2009-06-21 17:08:26

by Maciej Rutecki

[permalink] [raw]
Subject: 2.6.30-git(16 and 17) system hangs after resume from suspend to disk, mce related?

Tested kernel version: 2.6.30-git16 and 2.6.30-git17
Last known good: 2.6.30

System hangs few minutes after resume from suspend to disk. I have
tried bisection and here is result:

4efc0670baf4b14bc95502e54a83ccf639146125 is first bad commit
commit 4efc0670baf4b14bc95502e54a83ccf639146125
Author: Andi Kleen <[email protected]>
Date: Tue Apr 28 19:07:31 2009 +0200

x86, mce: use 64bit machine check code on 32bit

The 64bit machine check code is in many ways much better than
the 32bit machine check code: it is more specification compliant,
is cleaner, only has a single code base versus one per CPU,
has better infrastructure for recovery, has a cleaner way to communicate
with user space etc. etc.

Use the 64bit code for 32bit too.

This is the second attempt to do this. There was one a couple of years
ago to unify this code for 32bit and 64bit. Back then this ran into some
trouble with K7s and was reverted.

I believe this time the K7 problems (and some others) are addressed.
I went over the old handlers and was very careful to retain
all quirks.

But of course this needs a lot of testing on old systems. On newer
64bit capable systems I don't expect much problems because they have been
already tested with the 64bit kernel.

I made this a CONFIG for now that still allows to select the old
machine check code. This is mostly to make testing easier,
if someone runs into a problem we can ask them to try
with the CONFIG switched.

The new code is default y for more coverage.

Once there is confidence the 64bit code works well on older hardware
too the CONFIG_X86_OLD_MCE and the associated code can be easily
removed.

This causes a behaviour change for 32bit installations. They now
have to install the mcelog package to be able to log
corrected machine checks.

The 64bit machine check code only handles CPUs which support the
standard Intel machine check architecture described in the IA32 SDM.
The 32bit code has special support for some older CPUs which
have non standard machine check architectures, in particular
WinChip C3 and Intel P5. I made those a separate CONFIG option
and kept them for now. The WinChip variant could be probably
removed without too much pain, it doesn't really do anything
interesting. P5 is also disabled by default (like it
was before) because many motherboards have it miswired, but
according to Alan Cox a few embedded setups use that one.

Forward ported/heavily changed version of old patch, original patch
included review/fixes from Thomas Gleixner, Bert Wesarg.

Signed-off-by: Andi Kleen <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
Signed-off-by: Hidetoshi Seto <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>

:040000 040000 3ed45ebe46fdbb0df7f4190400fa4640be9f4c6c
e1fbb6da0ce70b944894d47c7e6fef0d30b5ff71 M arch


Unfortunately, because system hangs, I haven't any information in logs.

/proc/cpuinfo:
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Pentium(R) Dual CPU E2180 @ 2.00GHz
stepping : 13
cpu MHz : 1200.000
cache size : 1024 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 2
apicid : 0
initial apicid : 0
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx
lm constant_tsc arch_perfmon pebs bts pni dtes64 monitor ds_cpl est
tm2 ssse3 cx16 xtpr pdcm lahf_lm
bogomips : 3999.98
clflush size : 64
power management:

processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Pentium(R) Dual CPU E2180 @ 2.00GHz
stepping : 13
cpu MHz : 1200.000
cache size : 1024 KB
physical id : 0
siblings : 2
core id : 1
cpu cores : 2
apicid : 1
initial apicid : 1
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx
lm constant_tsc arch_perfmon pebs bts pni dtes64 monitor ds_cpl est
tm2 ssse3 cx16 xtpr pdcm lahf_lm
bogomips : 3999.72
clflush size : 64
power management:

dmesg, config from 2.6.30-git17:
http://unixy.pl/maciek/download/kernel/2.6.30-git17/pc/

--
Maciej Rutecki
http://www.maciek.unixy.pl


2009-06-21 18:43:17

by Andi Kleen

[permalink] [raw]
Subject: Re: 2.6.30-git(16 and 17) system hangs after resume from suspend to disk, mce related?

Maciej Rutecki wrote:
> Tested kernel version: 2.6.30-git16 and 2.6.30-git17
> Last known good: 2.6.30
>
> System hangs few minutes after resume from suspend to disk.

Thanks for the report.

I assume it runs stable for hours without resume from disk?
And you made sure you don't use stale data from
a different kernel for resume from disk?

It is strange that resume from disk affects machine check.
How is your resume setup?
Do you have any init scripts that change machine check state
before the resume from disk runs?

Is there any chance you could configure netconsole or similar
to get output during the hang?

> I have
> tried bisection and here is result:

I assume you have CONFIG_X86_NEW_MCE enabled, correct?
Does it still happen with CONFIG_X86_OLD_MCE instead?

Also a "a few minutes" suggest something might be going wrong
with the poll handler. Does the problem still happen
with you use CONFIG_X86_NEW_MCE again, but before
resume do

echo 0 > /sys/device/system/machinecheck/machinecheck0/check_interval

On the other hand you should get a crash very fast with

echo 1 > /sys/device/system/machinecheck/machinecheck0/check_interval

If we confirm it's the poll handler I can send you a debugging patch
to narrow it down further if I can't reproduce it.
But that would need console output during the crash.

Your dmesg also doesn't have anything related to resume from disk?

Thanks,

-Andi

2009-06-21 20:13:30

by Maciej Rutecki

[permalink] [raw]
Subject: Re: 2.6.30-git(16 and 17) system hangs after resume from suspend to disk, mce related?

2009/6/21 Andi Kleen <[email protected]>:
> I assume it runs stable for hours without resume from disk?

I only test for 40 minutes. latest git hangs 4-5 minutes after resume
from s2disk

> And you made sure you don't use stale data from
> a different kernel for resume from disk?

I'm sure

>
> It is strange that resume from disk affects machine check.
> How is your resume setup?

You ask about "resume" kernel option?

maciek@zlom:~$ cat /proc/cmdline
root=/dev/sda2 ro resume=/dev/sda3 selinux=0

> Do you have any init scripts that change machine check state
> before the resume from disk runs?

No. I use default Debian instalation. I use this script, to do s2disk:

#!/bin/sh
umount /mnt/vista
umount /mnt/drugi
governor0=`cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor`
governor1=`cat /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor`
f_min_0=`cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq`
f_min_1=`cat /sys/devices/system/cpu/cpu1/cpufreq/scaling_min_freq`
f_max_0=`cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq`
f_max_1=`cat /sys/devices/system/cpu/cpu1/cpufreq/scaling_max_freq`
#rmmod snd_hda_intel
sync
hdparm -F /dev/sda
hdparm -F /dev/sdb
sleep 1
# hibernate
echo -n platform > /sys/power/disk
echo -n disk > /sys/power/state
echo $governor0 > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
echo $governor1 > /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor
echo $f_min_0 > /sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq
echo $f_min_1 > /sys/devices/system/cpu/cpu1/cpufreq/scaling_min_freq
echo $f_max_0 > /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
echo $f_max_1 > /sys/devices/system/cpu/cpu1/cpufreq/scaling_max_freq
#modprobe snd_hda_intel model=3stack-dig
sleep 1
/etc/init.d/hdparm restart
mount /mnt/vista
mount /mnt/drugi


>
> I assume you have CONFIG_X86_NEW_MCE enabled, correct?

maciek@zlom:~$ cat /boot/config-2.6.30-git17 | grep MCE
CONFIG_X86_MCE=y
# CONFIG_X86_OLD_MCE is not set
CONFIG_X86_NEW_MCE=y
CONFIG_X86_MCE_INTEL=y
# CONFIG_X86_MCE_AMD is not set
# CONFIG_X86_ANCIENT_MCE is not set
CONFIG_X86_MCE_THRESHOLD=y
CONFIG_X86_MCE_INJECT=m

> Does it still happen with CONFIG_X86_OLD_MCE instead?

I will check tomorrow.

>
> Also a "a few minutes" suggest something might be going wrong
> with the poll handler.  Does the problem still happen
> with you use CONFIG_X86_NEW_MCE again, but before
> resume do
>
> echo 0 > /sys/device/system/machinecheck/machinecheck0/check_interval
>
> On the other hand you should get a crash very fast with
>
> echo 1 > /sys/device/system/machinecheck/machinecheck0/check_interval

I didn't instructions from above, but I found something else. After
normal boot I try:

echo 1 > /sys/devices/system/machinecheck/machinecheck0/check_interval

I I found this in dmesg:

[ 141.704025] ------------[ cut here ]------------
[ 141.704039] WARNING: at arch/x86/kernel/cpu/mcheck/mce.c:1102
mcheck_timer+0xf5/0x100()
[ 141.704044] Hardware name: G31M-S2L
[ 141.704047] Modules linked in: i915 drm i2c_algo_bit video
backlight output ppdev lp rfcomm l2cap xt_tcpudp xt_limit xt_state
iptable_filter nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables
x_tables fuse dm_crypt dm_mod coretemp it87 hwmon_vid loop usbhid hid
btusb bluetooth snd_hda_codec_realtek snd_hda_intel snd_hda_codec
snd_pcm_oss snd_mixer_oss snd_pcm snd_seq_dummy snd_seq_oss
snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq snd_timer
snd_seq_device snd uhci_hcd ehci_hcd soundcore parport_pc parport
psmouse r8169 usbcore 8139too 8139cp mii i2c_i801 button rtc_cmos
rtc_core rtc_lib snd_page_alloc intel_agp agpgart evdev
[ 141.704139] Pid: 0, comm: swapper Not tainted 2.6.30-git17 #1
[ 141.704143] Call Trace:
[ 141.704152] [<c039382c>] ? printk+0x18/0x1c
[ 141.704158] [<c010f715>] ? mcheck_timer+0xf5/0x100
[ 141.704165] [<c013212c>] warn_slowpath_common+0x6c/0xc0
[ 141.704170] [<c010f715>] ? mcheck_timer+0xf5/0x100
[ 141.704176] [<c0132195>] warn_slowpath_null+0x15/0x20
[ 141.704182] [<c010f715>] mcheck_timer+0xf5/0x100
[ 141.704188] [<c013b99d>] run_timer_softirq+0x12d/0x1f0
[ 141.704194] [<c010f620>] ? mcheck_timer+0x0/0x100
[ 141.704199] [<c010f620>] ? mcheck_timer+0x0/0x100
[ 141.704206] [<c01372da>] __do_softirq+0x9a/0x130
[ 141.704212] [<c014b0ce>] ? hrtimer_interrupt+0xde/0x230
[ 141.704217] [<c039642f>] ? _spin_unlock+0xf/0x30
[ 141.704224] [<c01373a5>] do_softirq+0x35/0x40
[ 141.704229] [<c01375ad>] irq_exit+0x6d/0x90
[ 141.704235] [<c01167e8>] smp_apic_timer_interrupt+0x58/0x90
[ 141.704241] [<c0103856>] apic_timer_interrupt+0x2a/0x30
[ 141.704248] [<c010a662>] ? mwait_idle+0x62/0x70
[ 141.704253] [<c0101ee5>] cpu_idle+0x55/0x90
[ 141.704259] [<c0390b0b>] start_secondary+0x184/0x1f9
[ 141.704264] ---[ end trace 54c5f0d77c70ea21 ]---
[ 142.701022] ------------[ cut here ]------------
[ 142.701036] WARNING: at arch/x86/kernel/cpu/mcheck/mce.c:1102
mcheck_timer+0xf5/0x100()
[ 142.701041] Hardware name: G31M-S2L
[ 142.701044] Modules linked in: i915 drm i2c_algo_bit video
backlight output ppdev lp rfcomm l2cap xt_tcpudp xt_limit xt_state
iptable_filter nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables
x_tables fuse dm_crypt dm_mod coretemp it87 hwmon_vid loop usbhid hid
btusb bluetooth snd_hda_codec_realtek snd_hda_intel snd_hda_codec
snd_pcm_oss snd_mixer_oss snd_pcm snd_seq_dummy snd_seq_oss
snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq snd_timer
snd_seq_device snd uhci_hcd ehci_hcd soundcore parport_pc parport
psmouse r8169 usbcore 8139too 8139cp mii i2c_i801 button rtc_cmos
rtc_core rtc_lib snd_page_alloc intel_agpagpgart evdev
[ 142.701138] Pid: 0, comm: swapper Tainted: G W 2.6.30-git17 #1
[ 142.701142] Call Trace:
[ 142.701151] [<c039382c>] ? printk+0x18/0x1c
[ 142.701156] [<c010f715>] ? mcheck_timer+0xf5/0x100
[ 142.701163] [<c013212c>] warn_slowpath_common+0x6c/0xc0
[ 142.701169] [<c010f715>] ? mcheck_timer+0xf5/0x100
[ 142.701174] [<c0132195>] warn_slowpath_null+0x15/0x20
[ 142.701180] [<c010f715>] mcheck_timer+0xf5/0x100
[ 142.701186] [<c013b99d>] run_timer_softirq+0x12d/0x1f0
[ 142.701192] [<c010f620>] ? mcheck_timer+0x0/0x100
[ 142.701197] [<c010f620>] ? mcheck_timer+0x0/0x100
[ 142.701204] [<c01372da>] __do_softirq+0x9a/0x130
[ 142.701210] [<c014b0ce>] ? hrtimer_interrupt+0xde/0x230
[ 142.701216] [<c039642f>] ? _spin_unlock+0xf/0x30
[ 142.701222] [<c01373a5>] do_softirq+0x35/0x40
[ 142.701228] [<c01375ad>] irq_exit+0x6d/0x90
[ 142.701234] [<c01167e8>] smp_apic_timer_interrupt+0x58/0x90
[ 142.701240] [<c0103856>] apic_timer_interrupt+0x2a/0x30
[ 142.701247] [<c010a662>] ? mwait_idle+0x62/0x70
[ 142.701252] [<c0101ee5>] cpu_idle+0x55/0x90
[ 142.701258] [<c0390b0b>] start_secondary+0x184/0x1f9
[ 142.701264] ---[ end trace 54c5f0d77c70ea22 ]---

It's stop when I do echo 0...

> Your dmesg also doesn't have anything related to resume from disk?

Dmesg after resume, but before hangs:
http://unixy.pl/maciek/download/kernel/2.6.30-git17/pc/dmesg-2.6.30-git17-after-resume.txt

Nothing weird.

>
> Thanks,
>
> -Andi
>

Thanks for ansfer.

--
Maciej Rutecki
http://www.maciek.unixy.pl

2009-06-22 02:03:35

by Hidetoshi Seto

[permalink] [raw]
Subject: Re: 2.6.30-git(16 and 17) system hangs after resume from suspend to disk, mce related?

Maciej Rutecki wrote:
>> Also a "a few minutes" suggest something might be going wrong
>> with the poll handler. Does the problem still happen
>> with you use CONFIG_X86_NEW_MCE again, but before
>> resume do
>>
>> echo 0 > /sys/device/system/machinecheck/machinecheck0/check_interval
>>
>> On the other hand you should get a crash very fast with
>>
>> echo 1 > /sys/device/system/machinecheck/machinecheck0/check_interval
>
> I didn't instructions from above, but I found something else. After
> normal boot I try:
>
> echo 1 > /sys/devices/system/machinecheck/machinecheck0/check_interval
>
> I I found this in dmesg:
>
> [ 141.704025] ------------[ cut here ]------------
> [ 141.704039] WARNING: at arch/x86/kernel/cpu/mcheck/mce.c:1102
> mcheck_timer+0xf5/0x100()

I see. At least this warning will be cleared by following patch.
WARN_ON(smp_processor_id() != data);

But I'm not sure whether this can cause system hangs or not.

Thanks,
H.Seto

===

[PATCH] x86, mce: percpu mcheck_timer should be pinned

Maciej Rutecki wrote:
> After normal boot I try:
>
> echo 1 > /sys/devices/system/machinecheck/machinecheck0/check_interval
>
> I I found this in dmesg:
>
> [ 141.704025] ------------[ cut here ]------------
> [ 141.704039] WARNING: at arch/x86/kernel/cpu/mcheck/mce.c:1102
> mcheck_timer+0xf5/0x100()

If CONFIG_NO_HZ + CONFIG_SMP, timer added via add_timer() might
be migrated on other cpu. Use add_timer_on() instead.

Reported-by: Maciej Rutecki <[email protected]>
Signed-off-by: Hidetoshi Seto <[email protected]>
CC: Andi Kleen <[email protected]>
---
arch/x86/kernel/cpu/mcheck/mce.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 7da8fec..8e1794a 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1117,7 +1117,7 @@ static void mcheck_timer(unsigned long data)
*n = min(*n*2, (int)round_jiffies_relative(check_interval*HZ));

t->expires = jiffies + *n;
- add_timer(t);
+ add_timer_on(t, smp_processor_id());
}

static void mce_do_trigger(struct work_struct *work)
@@ -1321,7 +1321,7 @@ static void mce_init_timer(void)
return;
setup_timer(t, mcheck_timer, smp_processor_id());
t->expires = round_jiffies(jiffies + *n);
- add_timer(t);
+ add_timer_on(t, smp_processor_id());
}

/*
--
1.6.3

2009-06-22 06:20:17

by Maciej Rutecki

[permalink] [raw]
Subject: Re: 2.6.30-git(16 and 17) system hangs after resume from suspend to disk, mce related?

2009/6/21 Andi Kleen <[email protected]>:

> Also a "a few minutes" suggest something might be going wrong
> with the poll handler.  Does the problem still happen
> with you use CONFIG_X86_NEW_MCE again, but before
> resume do
>
> echo 0 > /sys/device/system/machinecheck/machinecheck0/check_interval


Seems be stable.

>
> On the other hand you should get a crash very fast with
>
> echo 1 > /sys/device/system/machinecheck/machinecheck0/check_interval

System hangs during suspend to disk (not resume). It stops on this
message (no_console_suspend option enabled):
http://unixy.pl/maciek/download/kernel/2.6.30-git17/pc/img_0003.jpg

after this should be:
[ 2482.751571] CPU0 attaching NULL sched-domain.
[ 2482.752016] Switched to high resolution mode on CPU 1
[ 2482.754024] CPU0 attaching sched-domain:
[ 2482.754028] domain 0: span 0-1 level MC
[ 2482.754032] groups: 0 1
[ 2482.754039] CPU1 attaching sched-domain:
[ 2482.754042] domain 0: span 0-1 level MC
[ 2482.754045] groups: 1 0
[ 2482.755139] CPU1 is up

See:
http://unixy.pl/maciek/download/kernel/2.6.30-git17/pc/dmesg-2.6.30-git17-after-resume.txt
(dmesg after resume, but before hangs)

--
Maciej Rutecki
http://www.maciek.unixy.pl

2009-06-22 06:43:42

by Andi Kleen

[permalink] [raw]
Subject: Re: 2.6.30-git(16 and 17) system hangs after resume from suspend to disk, mce related?

Hidetoshi Seto wrote:
> Maciej Rutecki wrote:
>>> Also a "a few minutes" suggest something might be going wrong
>>> with the poll handler. Does the problem still happen
>>> with you use CONFIG_X86_NEW_MCE again, but before
>>> resume do
>>>
>>> echo 0 > /sys/device/system/machinecheck/machinecheck0/check_interval
>>>
>>> On the other hand you should get a crash very fast with
>>>
>>> echo 1 > /sys/device/system/machinecheck/machinecheck0/check_interval
>> I didn't instructions from above, but I found something else. After
>> normal boot I try:
>>
>> echo 1 > /sys/devices/system/machinecheck/machinecheck0/check_interval
>>
>> I I found this in dmesg:
>>
>> [ 141.704025] ------------[ cut here ]------------
>> [ 141.704039] WARNING: at arch/x86/kernel/cpu/mcheck/mce.c:1102
>> mcheck_timer+0xf5/0x100()
>
> I see. At least this warning will be cleared by following patch.
> WARN_ON(smp_processor_id() != data);
>
> But I'm not sure whether this can cause system hangs or not.

It might actually. If two different handlers run on the same CPU
they could re-add a timer twice, which might cause loops in the timer
list etc.

Maciej, can you test Seto-san's patch please?

BTW this is probably related to

commit eea08f32adb3f97553d49a4f79a119833036000a
Author: Arun R Bharadwaj <[email protected]>
Date: Thu Apr 16 12:16:41 2009 +0530

timers: Logic to move non pinned timers

it might be also useful to test if reverting that patch makes
the problem go away. But with this patch we need the add_timer_on change.

-Andi

2009-06-22 07:19:54

by Maciej Rutecki

[permalink] [raw]
Subject: Re: 2.6.30-git(16 and 17) system hangs after resume from suspend to disk, mce related?

2009/6/22 Hidetoshi Seto <[email protected]>:
>
> I see.  At least this warning will be cleared by following patch.
>  WARN_ON(smp_processor_id() != data);
>
> But I'm not sure whether this can cause system hangs or not.
>
> Thanks,
> H.Seto
>
> ===
>
> [PATCH] x86, mce: percpu mcheck_timer should be pinned
>

>
> If CONFIG_NO_HZ + CONFIG_SMP, timer added via add_timer() might
> be migrated on other cpu.  Use add_timer_on() instead.
>
> Reported-by: Maciej Rutecki <[email protected]>
> Signed-off-by: Hidetoshi Seto <[email protected]>
> CC: Andi Kleen <[email protected]>
> ---
>  arch/x86/kernel/cpu/mcheck/mce.c |    4 ++--
>  1 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
> index 7da8fec..8e1794a 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce.c
> @@ -1117,7 +1117,7 @@ static void mcheck_timer(unsigned long data)
>                *n = min(*n*2, (int)round_jiffies_relative(check_interval*HZ));
>
>        t->expires = jiffies + *n;
> -       add_timer(t);
> +       add_timer_on(t, smp_processor_id());
>  }
>
>  static void mce_do_trigger(struct work_struct *work)
> @@ -1321,7 +1321,7 @@ static void mce_init_timer(void)
>                return;
>        setup_timer(t, mcheck_timer, smp_processor_id());
>        t->expires = round_jiffies(jiffies + *n);
> -       add_timer(t);
> +       add_timer_on(t, smp_processor_id());
>  }
>
>  /*
> --
> 1.6.3
>

Patch remove warning.

Tested-by Maciej Rutecki <[email protected]>

_But_ still system hangs after resume and during s2disk (while use
echo 0 > /sys/devices...)

--
Maciej Rutecki
http://www.maciek.unixy.pl

2009-06-22 07:20:39

by Maciej Rutecki

[permalink] [raw]
Subject: Re: 2.6.30-git(16 and 17) system hangs after resume from suspend to disk, mce related?

2009/6/22 Maciej Rutecki <[email protected]>:

> _But_ still system hangs after resume and during s2disk (while use
> echo 0 > /sys/devices...)

Should be: echo 1 > /sys/devices...


--
Maciej Rutecki
http://www.maciek.unixy.pl

2009-06-22 09:40:28

by Maciej Rutecki

[permalink] [raw]
Subject: Re: 2.6.30-git(16 and 17) system hangs after resume from suspend to disk, mce related?

2009/6/21 Andi Kleen <[email protected]>:

> Does it still happen with CONFIG_X86_OLD_MCE instead?

Works without any problem:
maciek@zlom:~$ cat /boot/config-2.6.30-git17 | grep -i mce
CONFIG_X86_MCE=y
CONFIG_X86_OLD_MCE=y
CONFIG_X86_ANCIENT_MCE=y
# CONFIG_X86_MCE_NONFATAL is not set
# CONFIG_X86_MCE_P4THERMAL is not set

Regards
--
Maciej Rutecki
http://www.maciek.unixy.pl

2009-06-22 09:58:41

by Andi Kleen

[permalink] [raw]
Subject: Re: 2.6.30-git(16 and 17) system hangs after resume from suspend to disk, mce related?


>
> Patch remove warning.

Ok.


> Tested-by Maciej Rutecki <[email protected]>
>
> _But_ still system hangs after resume and during s2disk (while use
> echo 0 > /sys/devices...)

Too bad. Were you able to set up a console during the hang?

e.g. if you have a second system could you perhaps configure netconsole
(See Documentation/networking/netconsole.txt) and capture the messages?

-Andi

2009-06-22 11:53:51

by Maciej Rutecki

[permalink] [raw]
Subject: Re: 2.6.30-git(16 and 17) system hangs after resume from suspend to disk, mce related?

2009/6/22 Andi Kleen <[email protected]>:

>
> Too bad. Were you able to set up a console during the hang?
>
> e.g. if you have a second system could you perhaps configure netconsole
> (See Documentation/networking/netconsole.txt) and capture the messages?
>

I configure netconsole. But it doesn't show any messages when system hangs:

[...]
[ 54.115003] PM: Restoring platform NVS memory
[ 54.120959] Booting processor 1 APIC 0x1 ip 0x6000
[ 54.113855] CPU: L1 I cache: 32K, L1 D cache: 32K
[ 54.113855] CPU: Processor Core ID: 1
[ 54.113855] x86 PAT enabled: cpu 1, old 0x7040600070406, new 0x7010600070106
[ 54.216141] CPU1 is up
[ 54.218135] HDA Intel 0000:00:1b.0: PCI INT A -> GSI 16 (level,
low) -> IRQ 16
[ 54.272099] usb usb3: root hub lost power or was reset
[ 54.272227] ehci_hcd 0000:00:1d.7: PME# disabled
[ 54.277797] r8169 0000:02:00.0: PME# disabled
[ 54.279176] parport_pc 00:08: activated
[ 54.430512] ata2.01: ACPI cmd ef/03:44:00:00:00:b0 filtered out
[ 54.433320] ata2.00: ACPI cmd ef/03:45:00:00:00:a0 filtered out
[ 54.443411] ata2.00: configured for UDMA/100
[ 54.446252] ata1.00: ACPI cmd ef/03:0c:00:00:00:a0 filtered out
[ 54.481479] sd 1:0:0:0: [sdb] Starting disk
[ 55.799953] btusb 4-1:1.1: no reset_resume for driver btusb?
[ 57.247203] Restarting tasks ... done.


--
Maciej Rutecki
http://www.maciek.unixy.pl

2009-06-22 12:35:39

by Andi Kleen

[permalink] [raw]
Subject: Re: 2.6.30-git(16 and 17) system hangs after resume from suspend to disk, mce related?

Maciej Rutecki wrote:
> 2009/6/22 Andi Kleen <[email protected]>:
>
>> Too bad. Were you able to set up a console during the hang?
>>
>> e.g. if you have a second system could you perhaps configure netconsole
>> (See Documentation/networking/netconsole.txt) and capture the messages?
>>
>
> I configure netconsole. But it doesn't show any messages when system hangs:

Was this with check_interval = 1 ?

Do you see something when you set it to 60 or so (you need to wait 60 seconds)?

-Andi

2009-06-22 13:27:31

by Maciej Rutecki

[permalink] [raw]
Subject: Re: 2.6.30-git(16 and 17) system hangs after resume from suspend to disk, mce related?

2009/6/22 Andi Kleen <[email protected]>:

> Was this with check_interval = 1 ?
>
> Do you see something when you set it to 60 or so (you need to wait 60
> seconds)?

I try check_interval=10 before suspend fot machinecheck0 and
mechinecheck1. Suspend to disk works (not hangs like for "1"). System
hangs few seconds after resume:
[...]
[ 68.071137] CPU1 is up
[ 68.073336] HDA Intel 0000:00:1b.0: PCI INT A -> GSI 16 (level,
low) -> IRQ 16
[ 68.126110] usb usb3: root hub lost power or was reset
[ 68.126249] ehci_hcd 0000:00:1d.7: PME# disabled
[ 68.132082] r8169 0000:02:00.0: PME# disabled
[ 68.132935] serial 00:07: activated
[ 68.136433] r8169: eth0: link up
[ 68.287417] ata2.01: ACPI cmd ef/03:0c:00:00:00:b0 filtered out
[ 68.290331] ata2.00: ACPI cmd ef/03:0c:00:00:00:a0 filtered out
[ 68.302257] ata1.00: ACPI cmd ef/03:45:00:00:00:a0 filtered out
[ 68.324357] ata1.00: configured for UDMA/133
[ 69.656127] btusb 4-1:1.0: no reset_resume for driver btusb?
[ 70.708015] usb 5-2: reset low speed USB device using uhci_hcd and address 3

It seems, that resume isn't resume completely.


Also I try 20 seconds and find this in netconsole:
[ 532.576128] BUG: soft lockup - CPU#1 stuck for 61s! [ksensors:2670]
[ 532.576128] Modules linked in: i915 drm i2c_algo_bit video
backlight output ppdev lp rfcomm l2cap xt_tcpudp xt_limit snd[
532.576128] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
[ 532.576128] [<c01a266d>] sys_read+0x3d/0x70
^[[B^[[A[ 598.075129] Modules linked in: i915 drm i2c_algo_bit video
backlight output ppdev lp rfcomm l2cap xt_tcpudp xt_limit xt_state
snd_seq_device[ 598.075129] EAX: 00000001 EBX: c28023e0 ECX: 00000292
EDX: 000008fb
[ 598.075129] [<c01a24f9>] vfs_read+0x99/0x150
[ 663.573129] Modules linked in: i915 drm i2c_algo_bit video
backlight output ppdev lp rfcomm l2cap xt_tcpudp xt_limit xt_state
snd_timer[ 663.573129] EIP is at generic_exec_single+0x5e/0x90
[ 663.573129] [<c01a24f9>] vfs_read+0x99/0x150
[ 729.071129] Modules linked in: i915 drm i2c_algo_bit video
backlight output ppdev lp rfcomm l2cap xt_tcpudp xt_limit xt_state
snd_seq[ 729.071129] EIP: 0060:[<c015979a>] EFLAGS: 00000202 CPU: 1
[ 729.071129] [<c0226aaf>] ? security_file_permission+0xf/0x20
[ 794.569130] Modules linked in: i915 drm i2c_algo_bit video
backlight output ppdev lp rfcomm l2cap xt_tcpudp xt_limit xt_state
iptable_filter nf_defrag_ipv4 fuse coretemp loop bluetooth
snd_hda_codec[ 860.068130] BUG: softlockup - CPU#1 stuck for 61s!
[ksensors:2670]
[ 860.068130] Modules linked in: i915 drm i2c_algo_bit video
backlight output ppdev lp rfcomm l2cap xt_tcpudp xt_limit xt_state
nf_conntrack x_tables dm_mod hwmon_vid btusb snd_hda_intel[
925.566131] Modules linked in: i915drm i2c_algo_bit video backlight
output ppdev lp rfcomm l2cap xt_tcpudp xt_limit xt_state
iptable_filter nf_conntrack x_tables dm_mod loop btusb snd_hda_intel[
991.064131] Modules linked in: i915 drm i2c_algo_bit video backlight
output ppdev lp rfcomm l2cap xt_tcpudp xt_limit xt_state snd_seq[
991.064131] EIP: 0060:[<c0159798>] EFLAGS: 00000202 CPU: 1
[ 991.064131] [<c0226aaf>] ? security_file_permission+0xf/0x20
[ 1056.562131] Modules linked in: i915 drm i2c_algo_bit video
backlight output ppdev lp rfcomm l2cap xt_tcpudp xt_limit xt_state
snd_seq_midi_event[ 1056.562131] EIP: 0060:[<c015979a>] EFLAGS:
00000202 CPU: 1
[ 1056.562131] [<c01ec20d>] sysfs_read_file+0x7d/0x110
[ 1122.061132] Modules linked in: i915 drm i2c_algo_bit video
backlight output ppdev lp rfcomm l2cap xt_tcpudp xt_limit xt_state
snd_seq_device[ 1122.061132] EAX: 00000001 EBX: c28023e0 ECX: 00000292
EDX: 000008fb
[ 1122.061132] [<c01a24f9>] vfs_read+0x99/0x150
[ 1187.559132] Modules linked in: i915 drm i2c_algo_bit video
backlight output ppdev lp rfcomm l2cap xt_tcpudp xt_limit xt_state
snd_seq_device[ 1187.559132] EAX: 00000001 EBX: c28023e0 ECX: 00000292
EDX: 000008fb
[ 1187.559132] [<c01a24f9>] vfs_read+0x99/0x150
[ 1253.057133] Modules linked in: i915 drm i2c_algo_bit video
backlight output ppdev lp rfcomm l2cap xt_tcpudp xt_limit xt_state
snd_timer[ 1253.057133] EIP is at generic_exec_single+0x5e/0x90
[ 1253.057133] [<c01a24f9>] vfs_read+0x99/0x150
[ 1318.555133] Modules linked in: i915 drm i2c_algo_bit video
backlight output ppdev lp rfcomm l2cap xt_tcpudp xt_limit xt_state
snd_seq[ 1318.555133] EIP is at generic_exec_single+0x5a/0x90
[ 1318.555133] [<c0226aaf>] ? security_file_permission+0xf/0x20
[ 1384.054133] Modules linked in: i915 drm i2c_algo_bit video
backlight output ppdev lp rfcomm l2cap xt_tcpudp xt_limit xt_state
snd[ 1384.054133] EAX: 00000001 EBX: c28023e0 ECX: 00000292 EDX:
000008fb
[ 1384.054133] [<c01ec190>] ? sysfs_read_file+0x0/0x110
[ 1449.552134] Modules linked in: i915 drm i2c_algo_bit video
backlight output ppdev lp rfcomm l2cap xt_tcpudp xt_limit xt_state
snd_seq_device[ 1449.552134] EAX: 00000001 EBX: c28023e0 ECX: 00000292
EDX: 000008fb
[ 1449.552134] [<c01a24f9>] vfs_read+0x99/0x150

Message appears very slowly.


--
Maciej Rutecki
http://www.maciek.unixy.pl

2009-06-22 13:49:52

by Andi Kleen

[permalink] [raw]
Subject: Re: 2.6.30-git(16 and 17) system hangs after resume from suspend to disk, mce related?


> [ 68.287417] ata2.01: ACPI cmd ef/03:0c:00:00:00:b0 filtered out
> [ 68.290331] ata2.00: ACPI cmd ef/03:0c:00:00:00:a0 filtered out
> [ 68.302257] ata1.00: ACPI cmd ef/03:45:00:00:00:a0 filtered out
> [ 68.324357] ata1.00: configured for UDMA/133
> [ 69.656127] btusb 4-1:1.0: no reset_resume for driver btusb?
> [ 70.708015] usb 5-2: reset low speed USB device using uhci_hcd and address 3
>
> It seems, that resume isn't resume completely.

The USB reset messages are normally not serious

>
>
> Also I try 20 seconds and find this in netconsole:
> [ 532.576128] BUG: soft lockup - CPU#1 stuck for 61s! [ksensors:2670]
> [ 532.576128] Modules linked in: i915 drm i2c_algo_bit video
> backlight output ppdev lp rfcomm l2cap xt_tcpudp xt_limit snd[
> 532.576128] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
> [ 532.576128] [<c01a266d>] sys_read+0x3d/0x70

Hmm strange lockup in the VFS, but in sensors context. Might or might not
be related.

Here's a debug patch for the poller: http://firstfloor.org/ak/mcp-debug
Can you apply that and try again and send me the output?

Thanks,

-Andi

2009-06-22 14:09:01

by Maciej Rutecki

[permalink] [raw]
Subject: Re: 2.6.30-git(16 and 17) system hangs after resume from suspend to disk, mce related?

2009/6/22 Andi Kleen <[email protected]>:

Some strange issue: when I set check_interval=60 for both
machinecheck* then kernel works fine. Tested for 30 minutes, 2 times.

>
> Here's a debug patch for the poller: http://firstfloor.org/ak/mcp-debug
> Can you apply that and try again and send me the output?
>

I will check


--
Maciej Rutecki
http://www.maciek.unixy.pl

2009-06-22 14:55:57

by Maciej Rutecki

[permalink] [raw]
Subject: Re: 2.6.30-git(16 and 17) system hangs after resume from suspend to disk, mce related?

2009/6/22 Andi Kleen <[email protected]>:

> Here's a debug patch for the poller: http://firstfloor.org/ak/mcp-debug
> Can you apply that and try again and send me the output?
>

Dmesg after resume:
http://unixy.pl/maciek/download/kernel/2.6.30-git17/pc/dmesg-2.6.30-git17-patch.txt

System hangs when uptime is roughly 5-6 minutes (when I don't change
check_interval). netconsole doesn't show anything.

--
Maciej Rutecki
http://www.maciek.unixy.pl

2009-06-22 16:56:38

by Andi Kleen

[permalink] [raw]
Subject: Re: 2.6.30-git(16 and 17) system hangs after resume from suspend to disk, mce related?

Maciej Rutecki wrote:
> 2009/6/22 Andi Kleen <[email protected]>:
>
>> Here's a debug patch for the poller: http://firstfloor.org/ak/mcp-debug
>> Can you apply that and try again and send me the output?
>>
>
> Dmesg after resume:
> http://unixy.pl/maciek/download/kernel/2.6.30-git17/pc/dmesg-2.6.30-git17-patch.txt
>
> System hangs when uptime is roughly 5-6 minutes (when I don't change
> check_interval). netconsole doesn't show anything.

Weird.

And again can you double check it doesn't hang with check_interval = 0
in this setup? (just to make sure)

-Andi

2009-06-22 17:37:50

by Maciej Rutecki

[permalink] [raw]
Subject: Re: 2.6.30-git(16 and 17) system hangs after resume from suspend to disk, mce related?

2009/6/22 Andi Kleen <[email protected]>:

>
> Weird.
>
> And again can you double check it doesn't hang with check_interval = 0
> in this setup? (just to make sure)

To be clear: I do s2disk immediately after boot. System never hangs
after normal boot.

I do this test:

1. boot, don't change check_interval
2. suspend to disk
3. wait 10 minutes
4. resume
5. When finish resume uptime shows 12 minutes
6. hangs when uptime shows 15 minutes

Second test:
Another machine (laptop) has the same issue
1. boot, don't change check_interval
2. suspend to disk
3. don't wait
4. resume
5. hangs when uptime shows 5 minutes


I already do this test:
1. boot, check_interval=0
2. suspend to disk
3. wait 10 minutes
4. resume

I show result in next e-mail, please be patient.



--
Maciej Rutecki
http://www.maciek.unixy.pl

2009-06-22 18:12:44

by Maciej Rutecki

[permalink] [raw]
Subject: Re: 2.6.30-git(16 and 17) system hangs after resume from suspend to disk, mce related?

2009/6/22 Maciej Rutecki <[email protected]>:

> I already do this test:
> 1. boot, check_interval=0*
> 2. suspend to disk
> 3. wait 10 minutes
> 4. resume
5. After resume uptime shows 15 minutes
6. wait 15 minutes; nothing happen

*Note: I change check_interval for machinecheck0 and machinecheck1.

Perform any else tests?
--
Maciej Rutecki
http://www.maciek.unixy.pl

2009-06-22 18:41:08

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: 2.6.30-git(16 and 17) system hangs after resume from suspend to disk, mce related?

On Monday 22 June 2009, Maciej Rutecki wrote:
> 2009/6/22 Maciej Rutecki <[email protected]>:
>
> > I already do this test:
> > 1. boot, check_interval=0*
> > 2. suspend to disk
> > 3. wait 10 minutes
> > 4. resume
> 5. After resume uptime shows 15 minutes
> 6. wait 15 minutes; nothing happen

Does it mean the box survived the test?

> *Note: I change check_interval for machinecheck0 and machinecheck1.
>
> Perform any else tests?

Yes, please.

Before 2. in your procedure run 'echo core > /sys/power/pm_test'. It will not
hibernate then, but it'll run the hibernation sequence including suspending
devices etc. Next, it'll wait for 5 seconds and run the resume sequence.

Please let me know if the box(es) work(s) correctly after that.

Best,
Rafael

2009-06-22 18:45:45

by Maciej Rutecki

[permalink] [raw]
Subject: Re: 2.6.30-git(16 and 17) system hangs after resume from suspend to disk, mce related?

2009/6/22 Rafael J. Wysocki <[email protected]>:
> On Monday 22 June 2009, Maciej Rutecki wrote:
>> 2009/6/22 Maciej Rutecki <[email protected]>:
>>
>> > I already do this test:
>> > 1. boot, check_interval=0*
>> > 2. suspend to disk
>> > 3. wait 10 minutes
>> > 4. resume
>> 5. After resume uptime shows 15 minutes
>> 6. wait 15 minutes; nothing happen
>
> Does it mean the box survived the test?

Yes.


> Before 2. in your procedure run 'echo core > /sys/power/pm_test'.  It will not

I will check.

Regards
--
Maciej Rutecki
http://www.maciek.unixy.pl

2009-06-22 19:19:47

by Maciej Rutecki

[permalink] [raw]
Subject: Re: 2.6.30-git(16 and 17) system hangs after resume from suspend to disk, mce related?

2009/6/22 Maciej Rutecki <[email protected]>:

>> Before 2. in your procedure run 'echo core > /sys/power/pm_test'.  It will not
>
> I will check.

1. boot, don't change check_interval
2. echo core > /sys/power/pm_test
3. suspend to disk
4. wait 15 minutes, system doesn't hangs.


--
Maciej Rutecki
http://www.maciek.unixy.pl

2009-06-23 03:41:16

by Hidetoshi Seto

[permalink] [raw]
Subject: Re: 2.6.30-git(16 and 17) system hangs after resume from suspend to disk, mce related?

Maciej Rutecki wrote:
> 2009/6/22 Andi Kleen <[email protected]>:
>
>> Here's a debug patch for the poller: http://firstfloor.org/ak/mcp-debug
>> Can you apply that and try again and send me the output?
>>
>
> Dmesg after resume:
> http://unixy.pl/maciek/download/kernel/2.6.30-git17/pc/dmesg-2.6.30-git17-patch.txt
>
> System hangs when uptime is roughly 5-6 minutes (when I don't change
> check_interval). netconsole doesn't show anything.
>

I found in the dmesg that mce_init() and mce_cpu_features() are called
on cpu0 twice in short time:

[ 82.989005] mcp on cpu 0 flags 2 banks ecc39e70
[ 82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:502
[ 82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:506
[ 82.989005] bank 0
[ 82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:518
[ 82.989005] bank 1
[ 82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:518
[ 82.989005] bank 2
[ 82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:518
[ 82.989005] bank 3
[ 82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:518
[ 82.989005] bank 4
[ 82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:518
[ 82.989005] bank 5
[ 82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:518
[ 82.989005] mcp on cpu 0 finished
[ 82.989005] CPU0: Thermal LVT vector (0xfa) already installed
[ 82.989005] PM: Restoring platform NVS memory
[ 82.989005] mcp on cpu 0 flags 2 banks ecc39e70
[ 82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:502
[ 82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:506
[ 82.989005] bank 0
[ 82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:518
[ 82.989005] bank 1
[ 82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:518
[ 82.989005] bank 2
[ 82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:518
[ 82.989005] bank 3
[ 82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:518
[ 82.989005] bank 4
[ 82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:518
[ 82.989005] bank 5
[ 82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:518
[ 82.989005] mcp on cpu 0 finished
[ 82.989005] CPU0: Thermal LVT vector (0xfa) already installed

mce_cpu_features() (which prints "Thermal ...") is always paired with
mce_init(), and is called only from mcheck_init() and mce_resume().

One of the above would be from mce_resume(), and if another was from
mcheck_init(), then setup_timer() in mce_init_timer() will break the
pending timer...

[arch/x86/power/cpu.c]
> static void __restore_processor_state(struct saved_context *ctxt)
> {
> :
> #ifdef CONFIG_X86_32
> mcheck_init(&boot_cpu_data);
> #endif
> }

Hum?

Maciej, could you try this patch?

Thanks,
H.Seto

===
[PATCH] x86: Fix mce resume on 32bit

Calling mcheck_init() on resume is required only with CONFIG_X86_OLD_MCE=y.

Signed-off-by: Hidetoshi Seto <[email protected]>
---
arch/x86/power/cpu.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/power/cpu.c b/arch/x86/power/cpu.c
index d277ef1..b3d20b9 100644
--- a/arch/x86/power/cpu.c
+++ b/arch/x86/power/cpu.c
@@ -244,7 +244,7 @@ static void __restore_processor_state(struct saved_context *ctxt)
do_fpu_end();
mtrr_ap_init();

-#ifdef CONFIG_X86_32
+#ifdef CONFIG_X86_OLD_MCE
mcheck_init(&boot_cpu_data);
#endif
}
--
1.6.3

2009-06-23 10:55:00

by Hugh Dickins

[permalink] [raw]
Subject: Re: 2.6.30-git(16 and 17) system hangs after resume from suspend to disk, mce related?

On Tue, 23 Jun 2009, Hidetoshi Seto wrote:
> Maciej Rutecki wrote:
> > 2009/6/22 Andi Kleen <[email protected]>:
> >
> >> Here's a debug patch for the poller: http://firstfloor.org/ak/mcp-debug
> >> Can you apply that and try again and send me the output?
> >>
> >
> > Dmesg after resume:
> > http://unixy.pl/maciek/download/kernel/2.6.30-git17/pc/dmesg-2.6.30-git17-patch.txt
> >
> > System hangs when uptime is roughly 5-6 minutes (when I don't change
> > check_interval). netconsole doesn't show anything.
> >
>
> I found in the dmesg that mce_init() and mce_cpu_features() are called
> on cpu0 twice in short time:
>
> [ 82.989005] mcp on cpu 0 flags 2 banks ecc39e70
> [ 82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:502
> [ 82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:506
> [ 82.989005] bank 0
> [ 82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:518
> [ 82.989005] bank 1
> [ 82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:518
> [ 82.989005] bank 2
> [ 82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:518
> [ 82.989005] bank 3
> [ 82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:518
> [ 82.989005] bank 4
> [ 82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:518
> [ 82.989005] bank 5
> [ 82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:518
> [ 82.989005] mcp on cpu 0 finished
> [ 82.989005] CPU0: Thermal LVT vector (0xfa) already installed
> [ 82.989005] PM: Restoring platform NVS memory
> [ 82.989005] mcp on cpu 0 flags 2 banks ecc39e70
> [ 82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:502
> [ 82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:506
> [ 82.989005] bank 0
> [ 82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:518
> [ 82.989005] bank 1
> [ 82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:518
> [ 82.989005] bank 2
> [ 82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:518
> [ 82.989005] bank 3
> [ 82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:518
> [ 82.989005] bank 4
> [ 82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:518
> [ 82.989005] bank 5
> [ 82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:518
> [ 82.989005] mcp on cpu 0 finished
> [ 82.989005] CPU0: Thermal LVT vector (0xfa) already installed
>
> mce_cpu_features() (which prints "Thermal ...") is always paired with
> mce_init(), and is called only from mcheck_init() and mce_resume().
>
> One of the above would be from mce_resume(), and if another was from
> mcheck_init(), then setup_timer() in mce_init_timer() will break the
> pending timer...
>
> [arch/x86/power/cpu.c]
> > static void __restore_processor_state(struct saved_context *ctxt)
> > {
> > :
> > #ifdef CONFIG_X86_32
> > mcheck_init(&boot_cpu_data);
> > #endif
> > }
>
> Hum?
>
> Maciej, could you try this patch?

Well found, thanks so much for that, I can confirm that this patch
works for me (though I'm not Maciej).

Since running recent gits on Core2 Duo, following resume from RAM,
within five minutes I'd get either a silent freeze, or a BUG at
kernel/timer.c:911, and/or DEBUG_LIST list.h warnings from
internal_add_timer().

Your patch puts an end to all that. Something for Andi to rush to
Linus for -rc1 I think - though I won't be in the least surprised if
Andi decides on something a little different, it's rather surprising
to have that mcheck_init() call from over there in just the one case.

Thanks again,
Hugh

>
> Thanks,
> H.Seto
>
> ===
> [PATCH] x86: Fix mce resume on 32bit
>
> Calling mcheck_init() on resume is required only with CONFIG_X86_OLD_MCE=y.
>
> Signed-off-by: Hidetoshi Seto <[email protected]>
> ---
> arch/x86/power/cpu.c | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/arch/x86/power/cpu.c b/arch/x86/power/cpu.c
> index d277ef1..b3d20b9 100644
> --- a/arch/x86/power/cpu.c
> +++ b/arch/x86/power/cpu.c
> @@ -244,7 +244,7 @@ static void __restore_processor_state(struct saved_context *ctxt)
> do_fpu_end();
> mtrr_ap_init();
>
> -#ifdef CONFIG_X86_32
> +#ifdef CONFIG_X86_OLD_MCE
> mcheck_init(&boot_cpu_data);
> #endif
> }
> --
> 1.6.3

2009-06-23 11:06:20

by Andi Kleen

[permalink] [raw]
Subject: Re: 2.6.30-git(16 and 17) system hangs after resume from suspend to disk, mce related?


> Your patch puts an end to all that. Something for Andi to rush to
> Linus for -rc1 I think - though I won't be in the least surprised if
> Andi decides on something a little different, it's rather surprising
> to have that mcheck_init() call from over there in just the one case.
>
> Thanks again,
> Hugh


Yes patch looks good. I was just waiting for reports if it really fixes
the problem.

-Andi

2009-06-23 14:47:57

by Maciej Rutecki

[permalink] [raw]
Subject: Re: 2.6.30-git(16 and 17) system hangs after resume from suspend to disk, mce related?

2009/6/23 Hidetoshi Seto <[email protected]>:

>
> Maciej, could you try this patch?

Tested in 2 systems (for 17 and 30 minutes). Works without any
problems. Thanks for patch!

>
> ===
> [PATCH] x86: Fix mce resume on 32bit
>
> Calling mcheck_init() on resume is required only with CONFIG_X86_OLD_MCE=y.
>
> Signed-off-by: Hidetoshi Seto <[email protected]>
> ---
>  arch/x86/power/cpu.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/arch/x86/power/cpu.c b/arch/x86/power/cpu.c
> index d277ef1..b3d20b9 100644
> --- a/arch/x86/power/cpu.c
> +++ b/arch/x86/power/cpu.c
> @@ -244,7 +244,7 @@ static void __restore_processor_state(struct saved_context *ctxt)
>        do_fpu_end();
>        mtrr_ap_init();
>
> -#ifdef CONFIG_X86_32
> +#ifdef CONFIG_X86_OLD_MCE
>        mcheck_init(&boot_cpu_data);
>  #endif
>  }

Tested-by Maciej Rutecki <[email protected]>


--
Maciej Rutecki
http://www.maciek.unixy.pl

2009-06-23 19:57:54

by Andi Kleen

[permalink] [raw]
Subject: Re: 2.6.30-git(16 and 17) system hangs after resume from suspend to disk, mce related?

Maciej Rutecki wrote:
> 2009/6/23 Hidetoshi Seto <[email protected]>:
>
>> Maciej, could you try this patch?
>
> Tested in 2 systems (for 17 and 30 minutes). Works without any
> problems. Thanks for patch!

Ok great. Thanks for testing. Peter please put Seto-san's patch into urgent.

Acked-by: Andi Kleen <[email protected]>

Thanks.

-Andi

2009-06-24 00:21:27

by Hidetoshi Seto

[permalink] [raw]
Subject: Re: 2.6.30-git(16 and 17) system hangs after resume from suspend to disk, mce related?

Peter, could you put this patch into urgent too?

Thanks,
H.Seto

===

[PATCH] x86, mce: percpu mcheck_timer should be pinned

Maciej Rutecki wrote:
> > After normal boot I try:
> >
> > echo 1 > /sys/devices/system/machinecheck/machinecheck0/check_interval
> >
> > I found this in dmesg:
> >
> > [ 141.704025] ------------[ cut here ]------------
> > [ 141.704039] WARNING: at arch/x86/kernel/cpu/mcheck/mce.c:1102
> > mcheck_timer+0xf5/0x100()

If CONFIG_NO_HZ + CONFIG_SMP, timer added via add_timer() might
be migrated on other cpu. Use add_timer_on() instead.

Reported-by: Maciej Rutecki <[email protected]>
Signed-off-by: Hidetoshi Seto <[email protected]>
Tested-by Maciej Rutecki <[email protected]>
CC: Andi Kleen <[email protected]>
---
arch/x86/kernel/cpu/mcheck/mce.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 7da8fec..8e1794a 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1117,7 +1117,7 @@ static void mcheck_timer(unsigned long data)
*n = min(*n*2, (int)round_jiffies_relative(check_interval*HZ));

t->expires = jiffies + *n;
- add_timer(t);
+ add_timer_on(t, smp_processor_id());
}

static void mce_do_trigger(struct work_struct *work)
@@ -1321,7 +1321,7 @@ static void mce_init_timer(void)
return;
setup_timer(t, mcheck_timer, smp_processor_id());
t->expires = round_jiffies(jiffies + *n);
- add_timer(t);
+ add_timer_on(t, smp_processor_id());
}

/*
--
1.6.3

2009-06-24 06:17:53

by Andi Kleen

[permalink] [raw]
Subject: Re: 2.6.30-git(16 and 17) system hangs after resume from suspend to disk, mce related?

Hidetoshi Seto wrote:
> Peter, could you put this patch into urgent too?

Acked-by: Andi Kleen <[email protected]> too

(BTW that is needed in interaction with the new timer migration code
that was merged into .31; .30 based kernels don't need it)

-Andi

2009-06-25 15:53:18

by Maciej Rutecki

[permalink] [raw]
Subject: Re: 2.6.30-git(16 and 17) system hangs after resume from suspend to disk, mce related?

2009/6/24 Andi Kleen <[email protected]>:
> Hidetoshi Seto wrote:
>>
>> Peter, could you put this patch into urgent too?
>
> Acked-by: Andi Kleen <[email protected]> too
>
> (BTW that is needed in interaction with the new timer migration code
> that was merged into .31; .30 based kernels don't need it)
>
> -Andi
>
>

Are this patches was added to -rc1?

--
Maciej Rutecki
http://www.maciek.unixy.pl

2009-06-25 16:23:52

by H. Peter Anvin

[permalink] [raw]
Subject: Re: 2.6.30-git(16 and 17) system hangs after resume from suspend to disk, mce related?

Maciej Rutecki wrote:
>
> Are this patches was added to -rc1?
>

No, but they're in our push queue. Since they're bug fixes they will be
pushed after we have run tests on them.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.