2009-04-11 09:41:51

by Alan Jenkins

[permalink] [raw]
Subject: Regression: EEE PC hangs when booting off battery

Regression #6 on latest git! (last known good is 2.6.29-rc8)

My EEE PC hangs on bootup if it is not connected to mains power. The
hang happens during udev coldplug. I've left it for >60 seconds with
nmi_watchdog=1, soft lockup and hung task detection are enabled, and I
get no trace. Key presses are not echoed to the screen.

Magic SysRQ keys work though. Unfortunately, SysRq-W / SysRq-T generate
too much output, so the useful bits immediately scroll off the top of
the screen. I can't press SysRq-P (even if that would help), because
the SysRQ key on my laptop is accessed using a Fn modifier key, which
also turns the left hand side of the keyboard into a numeric pad.

Magic SysRQs on my USB keyboard generate no response. (Sometimes the
numlock key seems to work though. I can't tell on the laptop keyboard
because the laptop doesn't have a numlock led).

I also tried booting with vga=ask (anticipating nmi_watchdog output),
and selecting the highest mode, 80x60. It turns out that this doesn't
really work, the screen flickers constantly until userspace changes the
console font. However, it avoided the hang in udev coldplug. Instead,
the hang occurred when X started, leaving me with a black screen and a
non-responsive X cursor.

Booting with acpi=off fixes this.

Regards
Alan


2009-04-11 18:08:02

by Tzy-Jye Daniel Lin

[permalink] [raw]
Subject: Re: Regression: EEE PC hangs when booting off battery

On Sat, Apr 11, 2009 at 5:41 AM, Alan Jenkins
<[email protected]> wrote:
> Regression #6 on latest git! (last known good is 2.6.29-rc8)
>
> Magic SysRQ keys work though.  Unfortunately, SysRq-W / SysRq-T generate
> too much output, so the useful bits immediately scroll off the top of
> the screen.  I can't press SysRq-P (even if that would help), because
> the SysRQ key on my laptop is accessed using a Fn modifier key, which
> also turns the left hand side of the keyboard into a numeric pad.

I don't own an Eee, but having played with one, the key sequence

Alt down
Fn down
SysRq down, up
Fn up
P down, up
Alt up

should work just fine. As long as you keep holding Alt, you don't
need to continue holding SysRq after the initial keypress.

> Magic SysRQs on my USB keyboard generate no response.  (Sometimes the
> numlock key seems to work though.  I can't tell on the laptop keyboard
> because the laptop doesn't have a numlock led).

SysRq over USB has never worked. On PC hardware, Magic SysRq depends
on features of the built-in i8042 controller, which only handles PS/2
input.

2009-04-11 19:25:29

by Kristoffer Ericson

[permalink] [raw]
Subject: Re: Regression: EEE PC hangs when booting off battery

On Sat, 11 Apr 2009 10:41:35 +0100
Alan Jenkins <[email protected]> wrote:

> Regression #6 on latest git! (last known good is 2.6.29-rc8)
>
> My EEE PC hangs on bootup if it is not connected to mains power. The
> hang happens during udev coldplug. I've left it for >60 seconds with
> nmi_watchdog=1, soft lockup and hung task detection are enabled, and I
> get no trace. Key presses are not echoed to the screen.

Getting this myself, even if power cord attached. Its an tainted
kernel though (archlinux 2.6.28) so wasnt sure if
vanilla was affected.

BUG: unable to handle kernel NULL pointer dereference at 00000000
IP: [<f850d031>] acpi_ac_get_state+0x1b/0x59 [ac]
*pde = 00000000
Oops: 0000 [#1] PREEMPT SMP
last sysfs file: /sys/devices/platform/i8042/modalias
Modules linked in: joydev psmouse snd_pcsp serio_raw i2c_i801 iTCO_wdt iTCO_vendor_support i2c_core uvcvideo compat_ioctl32 videodev v4l1_compat sg option usbserial video output eeepc_laptop rfkill intel_agp agpgart thermal processor fan evdev button battery ac snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_hda_intel snd_hwdep snd_pcm_oss snd_pcm snd_timer snd_page_alloc snd_mixer_oss snd soundcore ath_pci wlan ath_hal(P) arc4 ecb ath5k mac80211 led_class cfg80211 atl2 rtc_cmos rtc_core rtc_lib ext3 jbd mbcache usbhid hid usb_storage sd_mod uhci_hcd ehci_hcd usbcore ata_piix ahci ata_generic pata_acpi libata scsi_mod

Pid: 1794, comm: fsck.ext3 Tainted: P (2.6.28-ARCH #1) 701
EIP: 0060:[<f850d031>] EFLAGS: 00010246 CPU: 0
EIP is at acpi_ac_get_state+0x1b/0x59 [ac]
EAX: f69559dc EBX: 00000000 ECX: 00000000 EDX: f850d418
ESI: f6955980 EDI: 00000001 EBP: f6834480 ESP: f6853f0c
DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Process fsck.ext3 (pid: 1794, ti=f6852000 task=f737e400 task.ti=f6852000)
Stack:
f69559dc f6955980 f692e360 f850d2fd 00000001 f692e360 c01ab0e0 00000400
b80cb000 f6834480 f692e380 00000000 00000000 00000001 00000000 00000000
f7327c00 fffffffb c01aafb0 f6834480 c01cead4 f6853fa0 00000400 b80cb000
Call Trace:
[<f850d2fd>] acpi_ac_seq_show+0x12/0x5d [ac]
[<c01ab0e0>] seq_read+0x130/0x2e0
[<c01aafb0>] seq_read+0x0/0x2e0
[<c01cead4>] proc_reg_read+0x64/0xa0
[<c01cea70>] proc_reg_read+0x0/0xa0
[<c019346d>] vfs_read+0x9d/0x160
[<c0193601>] sys_read+0x41/0x80
[<c0103f13>] sysenter_do_call+0x12/0x33
[<c0320000>] serial_pnp_probe+0x150/0x250
Code: b8 ea ff ff ff 75 07 8b 43 5c 89 01 31 c0 5b c3 56 89 c6 85 f6 b8 ea ff ff ff 53 74 49 8b 5e 58 8d 46 5c 31 c9 50 ba 18 d4 50 f8 <8b> 03 e8 e0 be d2 c7 89 c2 58 31 c0 85 d2 74 2b 68 1d d4 50 f8
EIP: [<f850d031>] acpi_ac_get_state+0x1b/0x59 [ac] SS:ESP 0068:f6853f0c
---[ end trace eb12d69e27c56472 ]---

>
> Magic SysRQ keys work though. Unfortunately, SysRq-W / SysRq-T generate
> too much output, so the useful bits immediately scroll off the top of
> the screen. I can't press SysRq-P (even if that would help), because
> the SysRQ key on my laptop is accessed using a Fn modifier key, which
> also turns the left hand side of the keyboard into a numeric pad.
>
> Magic SysRQs on my USB keyboard generate no response. (Sometimes the
> numlock key seems to work though. I can't tell on the laptop keyboard
> because the laptop doesn't have a numlock led).
>
> I also tried booting with vga=ask (anticipating nmi_watchdog output),
> and selecting the highest mode, 80x60. It turns out that this doesn't
> really work, the screen flickers constantly until userspace changes the
> console font. However, it avoided the hang in udev coldplug. Instead,
> the hang occurred when X started, leaving me with a black screen and a
> non-responsive X cursor.
>
> Booting with acpi=off fixes this.
>
> Regards
> Alan
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/


--
Kristoffer Ericson <[email protected]>

2009-04-12 09:00:46

by Alan Jenkins

[permalink] [raw]
Subject: Re: Regression: EEE PC hangs when booting off battery

Tzy-Jye Daniel Lin wrote:
> On Sat, Apr 11, 2009 at 5:41 AM, Alan Jenkins
> <[email protected]> wrote:
>
>> Regression #6 on latest git! (last known good is 2.6.29-rc8)
>>
>> Magic SysRQ keys work though. Unfortunately, SysRq-W / SysRq-T generate
>> too much output, so the useful bits immediately scroll off the top of
>> the screen. I can't press SysRq-P (even if that would help), because
>> the SysRQ key on my laptop is accessed using a Fn modifier key, which
>> also turns the left hand side of the keyboard into a numeric pad.
>>
>
> I don't own an Eee, but having played with one, the key sequence
>
> Alt down
> Fn down
> SysRq down, up
> Fn up
> P down, up
> Alt up
>
> should work just fine. As long as you keep holding Alt, you don't
> need to continue holding SysRq after the initial keypress.
>
>
>> Magic SysRQs on my USB keyboard generate no response. (Sometimes the
>> numlock key seems to work though. I can't tell on the laptop keyboard
>> because the laptop doesn't have a numlock led).
>>
>
> SysRq over USB has never worked. On PC hardware, Magic SysRq depends
> on features of the built-in i8042 controller, which only handles PS/2
> input.
>
Great, thanks a lot!

I was able to run SysRq-P, and found the following backtrace -

Pid: 0
EIP is at acpi_idle_enter_bm+0x1df/0x208 [processor]
cpud_idle_call
cpu_idle
rest_init
start_kernel
i386_start_kernel

2009-04-12 09:23:13

by Alan Jenkins

[permalink] [raw]
Subject: Re: Regression: EEE PC hangs when booting off battery

Kristoffer Ericson wrote:
> On Sat, 11 Apr 2009 10:41:35 +0100
> Alan Jenkins <[email protected]> wrote:
>
>
>> Regression #6 on latest git! (last known good is 2.6.29-rc8)
>>
>> My EEE PC hangs on bootup if it is not connected to mains power. The
>> hang happens during udev coldplug. I've left it for >60 seconds with
>> nmi_watchdog=1, soft lockup and hung task detection are enabled, and I
>> get no trace. Key presses are not echoed to the screen.
>>
>
> Getting this myself, even if power cord attached. Its an tainted
> kernel though (archlinux 2.6.28) so wasnt sure if
> vanilla was affected.
>
> BUG: unable to handle kernel NULL pointer dereference at 00000000
> IP: [<f850d031>] acpi_ac_get_state+0x1b/0x59 [ac]
> *pde = 00000000
> Oops: 0000 [#1] PREEMPT SMP
> last sysfs file: /sys/devices/platform/i8042/modalias
> Modules linked in: joydev psmouse snd_pcsp serio_raw i2c_i801 iTCO_wdt iTCO_vendor_support i2c_core uvcvideo compat_ioctl32 videodev v4l1_compat sg option usbserial video output eeepc_laptop rfkill intel_agp agpgart thermal processor fan evdev button battery ac snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_hda_intel snd_hwdep snd_pcm_oss snd_pcm snd_timer snd_page_alloc snd_mixer_oss snd soundcore ath_pci wlan ath_hal(P) arc4 ecb ath5k mac80211 led_class cfg80211 atl2 rtc_cmos rtc_core rtc_lib ext3 jbd mbcache usbhid hid usb_storage sd_mod uhci_hcd ehci_hcd usbcore ata_piix ahci ata_generic pata_acpi libata scsi_mod
>
> Pid: 1794, comm: fsck.ext3 Tainted: P (2.6.28-ARCH #1) 701
> EIP: 0060:[<f850d031>] EFLAGS: 00010246 CPU: 0
> EIP is at acpi_ac_get_state+0x1b/0x59 [ac]
> EAX: f69559dc EBX: 00000000 ECX: 00000000 EDX: f850d418
> ESI: f6955980 EDI: 00000001 EBP: f6834480 ESP: f6853f0c
> DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
> Process fsck.ext3 (pid: 1794, ti=f6852000 task=f737e400 task.ti=f6852000)
> Stack:
> f69559dc f6955980 f692e360 f850d2fd 00000001 f692e360 c01ab0e0 00000400
> b80cb000 f6834480 f692e380 00000000 00000000 00000001 00000000 00000000
> f7327c00 fffffffb c01aafb0 f6834480 c01cead4 f6853fa0 00000400 b80cb000
> Call Trace:
> [<f850d2fd>] acpi_ac_seq_show+0x12/0x5d [ac]
> [<c01ab0e0>] seq_read+0x130/0x2e0
> [<c01aafb0>] seq_read+0x0/0x2e0
> [<c01cead4>] proc_reg_read+0x64/0xa0
> [<c01cea70>] proc_reg_read+0x0/0xa0
> [<c019346d>] vfs_read+0x9d/0x160
> [<c0193601>] sys_read+0x41/0x80
> [<c0103f13>] sysenter_do_call+0x12/0x33
> [<c0320000>] serial_pnp_probe+0x150/0x250
> Code: b8 ea ff ff ff 75 07 8b 43 5c 89 01 31 c0 5b c3 56 89 c6 85 f6 b8 ea ff ff ff 53 74 49 8b 5e 58 8d 46 5c 31 c9 50 ba 18 d4 50 f8 <8b> 03 e8 e0 be d2 c7 89 c2 58 31 c0 85 d2 74 2b 68 1d d4 50 f8
> EIP: [<f850d031>] acpi_ac_get_state+0x1b/0x59 [ac] SS:ESP 0068:f6853f0c
> ---[ end trace eb12d69e27c56472 ]---
>

I'm pretty sure that's a separate bug :-). As you say, it happens under
different circumstances, it's a BUG as opposed to a hang, and it's in a
different version (I didn't have my problem in 2.6.28).

http://www.kerneloops.org says no-one else has reported your bug. So maybe
you're right, and Arch have applied some patches which nobody else runs
with. I don't know how to find Arch source packages, so I can't really
tell.

But the only _taint_ as such is due to the ath_hal module. Let's see,
Arch linux...

<http://wiki.archlinux.org/index.php/Udev#Blacklisting_Modules>
describes how you can (reversibly) blacklist modules. If you blacklist
ath_pci (and ath_hal too, just to make sure), you could boot without any
taint and generate a "clean" backtrace.

Regards
Alan

2009-04-12 13:12:21

by Alan Jenkins

[permalink] [raw]
Subject: [BISECTED] EEE PC hangs when booting off battery

Alan Jenkins wrote:
> Tzy-Jye Daniel Lin wrote:
>
>> On Sat, Apr 11, 2009 at 5:41 AM, Alan Jenkins
>> <[email protected]> wrote:
>>
>>
>>> Regression #6 on latest git! (last known good is 2.6.29-rc8)
>>>
>>> Magic SysRQ keys work though. Unfortunately, SysRq-W / SysRq-T generate
>>> too much output, so the useful bits immediately scroll off the top of
>>> the screen. I can't press SysRq-P (even if that would help), because
>>> the SysRQ key on my laptop is accessed using a Fn modifier key, which
>>> also turns the left hand side of the keyboard into a numeric pad.
>>>
>>>
>> I don't own an Eee, but having played with one, the key sequence
>>
>> Alt down
>> Fn down
>> SysRq down, up
>> Fn up
>> P down, up
>> Alt up
>>
>> should work just fine. As long as you keep holding Alt, you don't
>> need to continue holding SysRq after the initial keypress.
>>
>>
>>
>>> Magic SysRQs on my USB keyboard generate no response. (Sometimes the
>>> numlock key seems to work though. I can't tell on the laptop keyboard
>>> because the laptop doesn't have a numlock led).
>>>
>>>
>> SysRq over USB has never worked. On PC hardware, Magic SysRq depends
>> on features of the built-in i8042 controller, which only handles PS/2
>> input.
>>
>>
> Great, thanks a lot!
>
> I was able to run SysRq-P, and found the following backtrace -
>
> Pid: 0
> EIP is at acpi_idle_enter_bm+0x1df/0x208 [processor]
> cpud_idle_call
> cpu_idle
> rest_init
> start_kernel
> i386_start_kernel
>

7ec0a7290797f57b780f792d12f4bcc19c83aa4f is first bad commit
commit 7ec0a7290797f57b780f792d12f4bcc19c83aa4f
Author: Bjorn Helgaas <[email protected]>
Date: Mon Mar 30 17:48:24 2009 +0000

ACPI: processor: use .notify method instead of installing handler
directly

This patch adds a .notify() method. The presence of .notify() causes
Linux/ACPI to manage event handlers and notify handlers on our behalf,
so we don't have to install and remove them ourselves.

Signed-off-by: Bjorn Helgaas <[email protected]>
CC: Zhang Rui <[email protected]>
CC: Zhao Yakui <[email protected]>
CC: Venki Pallipadi <[email protected]>
CC: Anil S Keshavamurthy <[email protected]>
Signed-off-by: Len Brown <[email protected]>



However, reverting this commit from v2.6.30-rc1 doesn't solve the hang.

Alan

2009-04-13 19:16:17

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [BISECTED] EEE PC hangs when booting off battery

On Sunday 12 April 2009 07:11:57 am Alan Jenkins wrote:
> Alan Jenkins wrote:
> > Tzy-Jye Daniel Lin wrote:
> >> On Sat, Apr 11, 2009 at 5:41 AM, Alan Jenkins
> >> <[email protected]> wrote:
> >>
> >>> Regression #6 on latest git! (last known good is 2.6.29-rc8)

You mention that this occurs when booting off battery. So I
assume everything works fine when the EEE is plugged in to the
wall socket?

> >>> Magic SysRQ keys work though. ...
> > I was able to run SysRq-P, and found the following backtrace -
> >
> > Pid: 0
> > EIP is at acpi_idle_enter_bm+0x1df/0x208 [processor]

Can you figure out where this is in acpi_idle_enter_bm() or
maybe just email me your processor.ko module?

Does it always happen at the same point?

If you blacklist or rename the processor module to prevent it
from loading, does that keep the hang from occurring?

> > cpud_idle_call
> > cpu_idle
> > rest_init
> > start_kernel
> > i386_start_kernel
>
> 7ec0a7290797f57b780f792d12f4bcc19c83aa4f is first bad commit
> commit 7ec0a7290797f57b780f792d12f4bcc19c83aa4f
> Author: Bjorn Helgaas <[email protected]>
> Date: Mon Mar 30 17:48:24 2009 +0000

Ouch, sorry about that. Thanks for doing all the bisection work.

> ACPI: processor: use .notify method instead of installing handler
> directly
>
> This patch adds a .notify() method. The presence of .notify() causes
> Linux/ACPI to manage event handlers and notify handlers on our behalf,
> so we don't have to install and remove them ourselves.
>
> Signed-off-by: Bjorn Helgaas <[email protected]>
> CC: Zhang Rui <[email protected]>
> CC: Zhao Yakui <[email protected]>
> CC: Venki Pallipadi <[email protected]>
> CC: Anil S Keshavamurthy <[email protected]>
> Signed-off-by: Len Brown <[email protected]>
>
> However, reverting this commit from v2.6.30-rc1 doesn't solve the hang.

I don't see the problem in that commit yet, and if there is a problem
with it, I would think that reverting it from 2.6.30-rc1 would solve
it. But maybe it'd be useful to revert the whole .notify series to
make sure. From 2.6.30-rc1, you should be able to revert these:

7ec0a7290797f57b780f792d12f4bcc19c83aa4f processor
373cfc360ec773be2f7615e59a19f3313255db7c button
46ec8598fde74ba59703575c22a6fb0b6b151bb6 Linux/ACPI infrastructure

What happens with those commits reverted?

Bjorn

2009-04-13 19:57:21

by Alan Jenkins

[permalink] [raw]
Subject: Re: [BISECTED] EEE PC hangs when booting off battery

Bjorn Helgaas wrote:
> On Sunday 12 April 2009 07:11:57 am Alan Jenkins wrote:
>
> You mention that this occurs when booting off battery. So I
> assume everything works fine when the EEE is plugged in to the
> wall socket?
>

When I tested it before, that was what I found.

However, I now find that's not quite right. It only works (i.e. doesn't
hang) if I remove the battery as well as plugging it into the wall. If
I have the battery in, it hangs.

>>>>> Magic SysRQ keys work though. ...
>>>>>
>>> I was able to run SysRq-P, and found the following backtrace -
>>>
>>> Pid: 0
>>> EIP is at acpi_idle_enter_bm+0x1df/0x208 [processor]
>>>
>
> Can you figure out where this is in acpi_idle_enter_bm() or
> maybe just email me your processor.ko module?
>
> Does it always happen at the same point?
>

Yes, it always happens at the same point.

It turns out I can read the runes, but I don't understand what they're
saying :-).

objdump -dr processor.o

00001bd0 <acpi_idle_enter_bm>:

...
00001bd0 + 0x1df = 00001daf
...
1d70: b8 03 00 00 00 mov $0x3,%eax
1d75: e8 90 f3 ff ff call 110a <tsc_halts_in_c>
1d7a: 85 c0 test %eax,%eax
1d7c: 74 0a je 1d88 <acpi_idle_enter_bm+0x1b8>
1d7e: b8 0e 09 00 00 mov $0x90e,%eax
1d7f: R_386_32 .rodata.str1.1
1d83: e8 fc ff ff ff call 1d84 <acpi_idle_enter_bm+0x1b4>
1d84: R_386_PC32 mark_tsc_unstable
1d88: 8b 45 e8 mov -0x18(%ebp),%eax
1d8b: 8b 55 ec mov -0x14(%ebp),%edx
1d8e: e8 ab fd ff ff call 1b3e <us_to_pm_timer_ticks>
1d93: 89 c3 mov %eax,%ebx
1d95: b8 17 01 00 00 mov $0x117,%eax
1d9a: 69 ca 17 01 00 00 imul $0x117,%edx,%ecx
1da0: 89 d6 mov %edx,%esi
1da2: f7 e3 mul %ebx
1da4: 8d 14 11 lea (%ecx,%edx,1),%edx
1da7: e8 fc ff ff ff call 1da8 <acpi_idle_enter_bm+0x1d8>
1da8: R_386_PC32 sched_clock_idle_wakeup_event
1dac: fb sti
1dad: 89 e0 mov %esp,%eax
-> 1daf: 31 c9 xor %ecx,%ecx <---------
1db1: 25 00 e0 ff ff and $0xffffe000,%eax
1db6: 89 fa mov %edi,%edx
1db8: 83 48 0c 04 orl $0x4,0xc(%eax)
1dbc: ff 47 18 incl 0x18(%edi)
1dbf: 8b 45 e4 mov -0x1c(%ebp),%eax
1dc2: e8 a4 f5 ff ff call 136b <acpi_state_timer_broadcast>
1dc7: 01 5f 1c add %ebx,0x1c(%edi)
1dca: 11 77 20 adc %esi,0x20(%edi)
1dcd: 8b 45 e8 mov -0x18(%ebp),%eax
1dd0: 83 c4 10 add $0x10,%esp
1dd3: 5b pop %ebx
1dd4: 5e pop %esi
1dd5: 5f pop %edi
1dd6: 5d pop %ebp
1dd7: c3 ret


> If you blacklist or rename the processor module to prevent it
> from loading, does that keep the hang from occurring?
>

No. In that case I get the hang in default_idle+0x59/0x95

0000007a <default_idle>:
7a: 55 push %ebp
7b: 89 e5 mov %esp,%ebp
7d: 56 push %esi
7e: 53 push %ebx
7f: 83 ec 18 sub $0x18,%esp
82: 83 3d 18 00 00 00 00 cmpl $0x0,0x18
84: R_386_32 .bss
89: 75 7a jne 105 <default_idle+0x8b>
8b: 80 3d 05 00 00 00 00 cmpb $0x0,0x5
8d: R_386_32 boot_cpu_data
92: 74 71 je 105 <default_idle+0x8b>
94: 83 3d 04 00 00 00 00 cmpl $0x0,0x4
96: R_386_32 __tracepoint_power_start
9b: 74 23 je c0 <default_idle+0x46>
9d: 8b 1d 08 00 00 00 mov 0x8,%ebx
9f: R_386_32 __tracepoint_power_start
a3: 85 db test %ebx,%ebx
a5: 74 19 je c0 <default_idle+0x46>
a7: 8d 75 e0 lea -0x20(%ebp),%esi
aa: b9 01 00 00 00 mov $0x1,%ecx
af: ba 01 00 00 00 mov $0x1,%edx
b4: 89 f0 mov %esi,%eax
b6: ff 13 call *(%ebx)
b8: 83 c3 04 add $0x4,%ebx
bb: 83 3b 00 cmpl $0x0,(%ebx)
be: 75 ea jne aa <default_idle+0x30>
c0: 89 e0 mov %esp,%eax
c2: 25 00 e0 ff ff and $0xffffe000,%eax
c7: 83 60 0c fb andl $0xfffffffb,0xc(%eax)
cb: f6 40 08 08 testb $0x8,0x8(%eax)
cf: 75 04 jne d5 <default_idle+0x5b>
d1: fb sti
d2: f4 hlt
--> d3: eb 01 jmp d6 <default_idle+0x5c> <--------
d5: fb sti
d6: 89 e0 mov %esp,%eax
d8: 25 00 e0 ff ff and $0xffffe000,%eax
dd: 83 48 0c 04 orl $0x4,0xc(%eax)
e1: 83 3d 04 00 00 00 00 cmpl $0x0,0x4
e3: R_386_32 __tracepoint_power_end
e8: 74 1e je 108 <default_idle+0x8e>


>> 7ec0a7290797f57b780f792d12f4bcc19c83aa4f is first bad commit
>> commit 7ec0a7290797f57b780f792d12f4bcc19c83aa4f
>> Author: Bjorn Helgaas <[email protected]>
>> Date: Mon Mar 30 17:48:24 2009 +0000
>>
>
> Ouch, sorry about that. Thanks for doing all the bisection work.
>
>
>> ACPI: processor: use .notify method instead of installing handler
>> directly
>>
>> This patch adds a .notify() method. The presence of .notify() causes
>> Linux/ACPI to manage event handlers and notify handlers on our behalf,
>> so we don't have to install and remove them ourselves.
>>
>> Signed-off-by: Bjorn Helgaas <[email protected]>
>> CC: Zhang Rui <[email protected]>
>> CC: Zhao Yakui <[email protected]>
>> CC: Venki Pallipadi <[email protected]>
>> CC: Anil S Keshavamurthy <[email protected]>
>> Signed-off-by: Len Brown <[email protected]>
>>
>> However, reverting this commit from v2.6.30-rc1 doesn't solve the hang.
>>
>
> I don't see the problem in that commit yet, and if there is a problem
> with it, I would think that reverting it from 2.6.30-rc1 would solve
> it. But maybe it'd be useful to revert the whole .notify series to
> make sure. From 2.6.30-rc1, you should be able to revert these:
>
> 7ec0a7290797f57b780f792d12f4bcc19c83aa4f processor
> 373cfc360ec773be2f7615e59a19f3313255db7c button
> 46ec8598fde74ba59703575c22a6fb0b6b151bb6 Linux/ACPI infrastructure
>
> What happens with those commits reverted?
>

I'll find out tomorrow.

Thanks
Alan

2009-04-13 22:28:55

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [BISECTED] EEE PC hangs when booting off battery

On Monday 13 April 2009 01:57:00 pm Alan Jenkins wrote:
> Bjorn Helgaas wrote:
> > On Sunday 12 April 2009 07:11:57 am Alan Jenkins wrote:
> >
> > You mention that this occurs when booting off battery. So I
> > assume everything works fine when the EEE is plugged in to the
> > wall socket?
>
> When I tested it before, that was what I found.
>
> However, I now find that's not quite right. It only works (i.e. doesn't
> hang) if I remove the battery as well as plugging it into the wall. If
> I have the battery in, it hangs.
>
> >>>>> Magic SysRQ keys work though. ...
> >>>>>
> >>> I was able to run SysRq-P, and found the following backtrace -
> >>>
> >>> Pid: 0
> >>> EIP is at acpi_idle_enter_bm+0x1df/0x208 [processor]
> >
> > Can you figure out where this is in acpi_idle_enter_bm() or
> > maybe just email me your processor.ko module?
> >
> > Does it always happen at the same point?
>
> Yes, it always happens at the same point.
>
> It turns out I can read the runes, but I don't understand what they're
> saying :-).

I'm not much good with x86 assembly either :-)

I think that in both cases below, you're right after enabling
interrupts and about to exit the idle routine. My guess is the
system is not really hung; it just doesn't think it has anything
to do and is spending all its time in the idle loop.

> 00001bd0 <acpi_idle_enter_bm>:
>
> ...
> 00001bd0 + 0x1df = 00001daf
> ...
> 1d70: b8 03 00 00 00 mov $0x3,%eax
> 1d75: e8 90 f3 ff ff call 110a <tsc_halts_in_c>
> 1d7a: 85 c0 test %eax,%eax
> 1d7c: 74 0a je 1d88 <acpi_idle_enter_bm+0x1b8>
> 1d7e: b8 0e 09 00 00 mov $0x90e,%eax
> 1d7f: R_386_32 .rodata.str1.1
> 1d83: e8 fc ff ff ff call 1d84 <acpi_idle_enter_bm+0x1b4>
> 1d84: R_386_PC32 mark_tsc_unstable
> 1d88: 8b 45 e8 mov -0x18(%ebp),%eax
> 1d8b: 8b 55 ec mov -0x14(%ebp),%edx
> 1d8e: e8 ab fd ff ff call 1b3e <us_to_pm_timer_ticks>
> 1d93: 89 c3 mov %eax,%ebx
> 1d95: b8 17 01 00 00 mov $0x117,%eax
> 1d9a: 69 ca 17 01 00 00 imul $0x117,%edx,%ecx
> 1da0: 89 d6 mov %edx,%esi
> 1da2: f7 e3 mul %ebx
> 1da4: 8d 14 11 lea (%ecx,%edx,1),%edx
> 1da7: e8 fc ff ff ff call 1da8 <acpi_idle_enter_bm+0x1d8>
> 1da8: R_386_PC32 sched_clock_idle_wakeup_event
> 1dac: fb sti
> 1dad: 89 e0 mov %esp,%eax
> -> 1daf: 31 c9 xor %ecx,%ecx <---------
> 1db1: 25 00 e0 ff ff and $0xffffe000,%eax
> 1db6: 89 fa mov %edi,%edx
> 1db8: 83 48 0c 04 orl $0x4,0xc(%eax)
> 1dbc: ff 47 18 incl 0x18(%edi)
> 1dbf: 8b 45 e4 mov -0x1c(%ebp),%eax
> 1dc2: e8 a4 f5 ff ff call 136b <acpi_state_timer_broadcast>
> 1dc7: 01 5f 1c add %ebx,0x1c(%edi)
> 1dca: 11 77 20 adc %esi,0x20(%edi)
> 1dcd: 8b 45 e8 mov -0x18(%ebp),%eax
> 1dd0: 83 c4 10 add $0x10,%esp
> 1dd3: 5b pop %ebx
> 1dd4: 5e pop %esi
> 1dd5: 5f pop %edi
> 1dd6: 5d pop %ebp
> 1dd7: c3 ret
>
> > If you blacklist or rename the processor module to prevent it
> > from loading, does that keep the hang from occurring?
>
> No. In that case I get the hang in default_idle+0x59/0x95
>
> 0000007a <default_idle>:
> 7a: 55 push %ebp
> 7b: 89 e5 mov %esp,%ebp
> 7d: 56 push %esi
> 7e: 53 push %ebx
> 7f: 83 ec 18 sub $0x18,%esp
> 82: 83 3d 18 00 00 00 00 cmpl $0x0,0x18
> 84: R_386_32 .bss
> 89: 75 7a jne 105 <default_idle+0x8b>
> 8b: 80 3d 05 00 00 00 00 cmpb $0x0,0x5
> 8d: R_386_32 boot_cpu_data
> 92: 74 71 je 105 <default_idle+0x8b>
> 94: 83 3d 04 00 00 00 00 cmpl $0x0,0x4
> 96: R_386_32 __tracepoint_power_start
> 9b: 74 23 je c0 <default_idle+0x46>
> 9d: 8b 1d 08 00 00 00 mov 0x8,%ebx
> 9f: R_386_32 __tracepoint_power_start
> a3: 85 db test %ebx,%ebx
> a5: 74 19 je c0 <default_idle+0x46>
> a7: 8d 75 e0 lea -0x20(%ebp),%esi
> aa: b9 01 00 00 00 mov $0x1,%ecx
> af: ba 01 00 00 00 mov $0x1,%edx
> b4: 89 f0 mov %esi,%eax
> b6: ff 13 call *(%ebx)
> b8: 83 c3 04 add $0x4,%ebx
> bb: 83 3b 00 cmpl $0x0,(%ebx)
> be: 75 ea jne aa <default_idle+0x30>
> c0: 89 e0 mov %esp,%eax
> c2: 25 00 e0 ff ff and $0xffffe000,%eax
> c7: 83 60 0c fb andl $0xfffffffb,0xc(%eax)
> cb: f6 40 08 08 testb $0x8,0x8(%eax)
> cf: 75 04 jne d5 <default_idle+0x5b>
> d1: fb sti
> d2: f4 hlt
> --> d3: eb 01 jmp d6 <default_idle+0x5c> <--------
> d5: fb sti
> d6: 89 e0 mov %esp,%eax
> d8: 25 00 e0 ff ff and $0xffffe000,%eax
> dd: 83 48 0c 04 orl $0x4,0xc(%eax)
> e1: 83 3d 04 00 00 00 00 cmpl $0x0,0x4
> e3: R_386_32 __tracepoint_power_end
> e8: 74 1e je 108 <default_idle+0x8e>
>
>
> >> 7ec0a7290797f57b780f792d12f4bcc19c83aa4f is first bad commit
> >> commit 7ec0a7290797f57b780f792d12f4bcc19c83aa4f
> >> Author: Bjorn Helgaas <[email protected]>
> >> Date: Mon Mar 30 17:48:24 2009 +0000
> >
> > Ouch, sorry about that. Thanks for doing all the bisection work.
> >
> >> ACPI: processor: use .notify method instead of installing handler
> >> directly
> >>
> >> This patch adds a .notify() method. The presence of .notify() causes
> >> Linux/ACPI to manage event handlers and notify handlers on our behalf,
> >> so we don't have to install and remove them ourselves.
> >>
> >> Signed-off-by: Bjorn Helgaas <[email protected]>
> >> CC: Zhang Rui <[email protected]>
> >> CC: Zhao Yakui <[email protected]>
> >> CC: Venki Pallipadi <[email protected]>
> >> CC: Anil S Keshavamurthy <[email protected]>
> >> Signed-off-by: Len Brown <[email protected]>
> >>
> >> However, reverting this commit from v2.6.30-rc1 doesn't solve the hang.
> >
> > I don't see the problem in that commit yet, and if there is a problem
> > with it, I would think that reverting it from 2.6.30-rc1 would solve
> > it. But maybe it'd be useful to revert the whole .notify series to
> > make sure. From 2.6.30-rc1, you should be able to revert these:
> >
> > 7ec0a7290797f57b780f792d12f4bcc19c83aa4f processor
> > 373cfc360ec773be2f7615e59a19f3313255db7c button
> > 46ec8598fde74ba59703575c22a6fb0b6b151bb6 Linux/ACPI infrastructure
> >
> > What happens with those commits reverted?
>
> I'll find out tomorrow.

The fact that it still hangs even when you don't load the processor
driver at all suggests that the 7ec0a729079 commit identified by the
bisection is not the real problem. That commit only touches
drivers/acpi/processor_core.c.

I think it's more likely some kind of race or missed wakeup.

Since it seems to be sensitive to whether the battery is present,
I guess you could try blacklisting the battery.ko driver. There
have been a few changes to it since 2.6.29-rc8. If things work
without battery.ko, we can look through those changes.

Bjorn

2009-04-14 05:03:49

by Ben Gamari

[permalink] [raw]
Subject: Re: [BISECTED] EEE PC hangs when booting off battery

On Mon, Apr 13, 2009 at 08:57:00PM +0100, Alan Jenkins wrote:
> Bjorn Helgaas wrote:
> However, I now find that's not quite right. It only works (i.e. doesn't
> hang) if I remove the battery as well as plugging it into the wall. If
> I have the battery in, it hangs.

I seem to have a similar issue with master which occurs regardless of
whether the battery is in on my Dell Latitude D830. Here is the
backtrace,

usb 7-1.2: Product: O2Micro CCID SC Reader
usb 7-1.2: Manufacturer: O2
usb 7-1.2: configuration #1 chosen from 1 choice
[drm] TV-15: set mode NTSC 480i 0
allocated 1920x1200 fb: 0x007df000, bo ffff88011d895b40
Console: switching to colour frame buffer device 240x75
[drm] LVDS-8: set mode 1920x1200 17
fb0: inteldrmfb frame buffer device
registered panic notifier
input: Video Bus as /devices/LNXSYSTM:00/device:00/PNP0A03:00/device:37/input/i6
ACPI: Video Device [VID1] (multi-head: yes rom: no post: no)
ACPI Warning (nspredef-0437): \_SB_.PCI0.VID2._DOD: Return Package has no eleme]
input: Video Bus as /devices/LNXSYSTM:00/device:00/PNP0A03:00/device:3c/input/i7
ACPI: Video Device [VID2] (multi-head: yes rom: no post: no)
[drm] Initialized i915 1.6.0 20080730 for 0000:00:02.0 on minor 0
SysRq : Show backtrace of all active CPUs
CPU1:
CPU 1:
Modules linked in: i915 drm i2c_algo_bit cfbcopyarea i2c_core cfbimgblt video o]
Pid: 0, comm: swapper Not tainted 2.6.30-rc1ben #54 Latitude D830
RIP: 0010:[<ffffffff8126c488>] [<ffffffff8126c488>] acpi_idle_enter_bm+0x28e/00
RSP: 0018:ffff88011fb0fe88 EFLAGS: 00000246
RAX: 000000000000eca0 RBX: ffff88011fb0fee8 RCX: 0000000000000000
RDX: ffff88002803a000 RSI: 0000000516143023 RDI: ffff88002804d6c0
RBP: ffffffff8101180e R08: ffffffffff762325 R09: ffff88011fb09058
R10: ffff88011fb09018 R11: ffff880028049430 R12: 000000073446f694
R13: ffff88011fb0fe08 R14: 00000000000019d2 R15: ffff88011fb0fe18
FS: 0000000000000000(0000) GS:ffff88002803a000(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00007fff6d9cab78 CR3: 0000000000201000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Call Trace:
[<ffffffff8126c481>] ? acpi_idle_enter_bm+0x287/0x2d0
[<ffffffff8133f949>] ? cpuidle_idle_call+0x93/0xca
[<ffffffff8100fd20>] ? cpu_idle+0x64/0xcd
[<ffffffff813f5326>] ? start_secondary+0x1a1/0x1a6


Hope this helps. Let me know if I can do anything else. Thanks!

- ben

2009-04-14 08:07:15

by Alan Jenkins

[permalink] [raw]
Subject: Re: [BISECTED] EEE PC hangs when booting off battery

Bjorn Helgaas wrote:
> On Monday 13 April 2009 01:57:00 pm Alan Jenkins wrote:
>
>> Bjorn Helgaas wrote:
>>
>>> On Sunday 12 April 2009 07:11:57 am Alan Jenkins wrote:
>>>
>>> You mention that this occurs when booting off battery. So I
>>> assume everything works fine when the EEE is plugged in to the
>>> wall socket?
>>>
>> When I tested it before, that was what I found.
>>
>> However, I now find that's not quite right. It only works (i.e. doesn't
>> hang) if I remove the battery as well as plugging it into the wall. If
>> I have the battery in, it hangs.
>>

... and right now, I can only reproduce it by booting with it plugged
into the wall and the battery present. If I unplug it from the wall, it
boots fine.

It must be affected by something else as well, maybe battery level or
charging / discharging status.

>>>>>>> Magic SysRQ keys work though. ...
>>>>>>>
>>>>>>>
>>>>> I was able to run SysRq-P, and found the following backtrace -
>>>>>
>>>>> Pid: 0
>>>>> EIP is at acpi_idle_enter_bm+0x1df/0x208 [processor]
>>>>>
>>> Can you figure out where this is in acpi_idle_enter_bm() or
>>> maybe just email me your processor.ko module?
>>>
>>> Does it always happen at the same point?
>>>
>> Yes, it always happens at the same point.
>>
>> It turns out I can read the runes, but I don't understand what they're
>> saying :-).
>>
>
> I'm not much good with x86 assembly either :-)
>
> I think that in both cases below, you're right after enabling
> interrupts and about to exit the idle routine. My guess is the
> system is not really hung; it just doesn't think it has anything
> to do and is spending all its time in the idle loop.
>
>
>> 00001bd0 <acpi_idle_enter_bm>:
>>
>> ...
>> 00001bd0 + 0x1df = 00001daf
>> ...
>> 1d70: b8 03 00 00 00 mov $0x3,%eax
>> 1d75: e8 90 f3 ff ff call 110a <tsc_halts_in_c>
>> 1d7a: 85 c0 test %eax,%eax
>> 1d7c: 74 0a je 1d88 <acpi_idle_enter_bm+0x1b8>
>> 1d7e: b8 0e 09 00 00 mov $0x90e,%eax
>> 1d7f: R_386_32 .rodata.str1.1
>> 1d83: e8 fc ff ff ff call 1d84 <acpi_idle_enter_bm+0x1b4>
>> 1d84: R_386_PC32 mark_tsc_unstable
>> 1d88: 8b 45 e8 mov -0x18(%ebp),%eax
>> 1d8b: 8b 55 ec mov -0x14(%ebp),%edx
>> 1d8e: e8 ab fd ff ff call 1b3e <us_to_pm_timer_ticks>
>> 1d93: 89 c3 mov %eax,%ebx
>> 1d95: b8 17 01 00 00 mov $0x117,%eax
>> 1d9a: 69 ca 17 01 00 00 imul $0x117,%edx,%ecx
>> 1da0: 89 d6 mov %edx,%esi
>> 1da2: f7 e3 mul %ebx
>> 1da4: 8d 14 11 lea (%ecx,%edx,1),%edx
>> 1da7: e8 fc ff ff ff call 1da8 <acpi_idle_enter_bm+0x1d8>
>> 1da8: R_386_PC32 sched_clock_idle_wakeup_event
>> 1dac: fb sti
>> 1dad: 89 e0 mov %esp,%eax
>> -> 1daf: 31 c9 xor %ecx,%ecx <---------
>> 1db1: 25 00 e0 ff ff and $0xffffe000,%eax
>> 1db6: 89 fa mov %edi,%edx
>> 1db8: 83 48 0c 04 orl $0x4,0xc(%eax)
>> 1dbc: ff 47 18 incl 0x18(%edi)
>> 1dbf: 8b 45 e4 mov -0x1c(%ebp),%eax
>> 1dc2: e8 a4 f5 ff ff call 136b <acpi_state_timer_broadcast>
>> 1dc7: 01 5f 1c add %ebx,0x1c(%edi)
>> 1dca: 11 77 20 adc %esi,0x20(%edi)
>> 1dcd: 8b 45 e8 mov -0x18(%ebp),%eax
>> 1dd0: 83 c4 10 add $0x10,%esp
>> 1dd3: 5b pop %ebx
>> 1dd4: 5e pop %esi
>> 1dd5: 5f pop %edi
>> 1dd6: 5d pop %ebp
>> 1dd7: c3 ret
>>
>>
>>> If you blacklist or rename the processor module to prevent it
>>> from loading, does that keep the hang from occurring?
>>>
>> No. In that case I get the hang in default_idle+0x59/0x95
>>
>> 0000007a <default_idle>:
>> 7a: 55 push %ebp
>> 7b: 89 e5 mov %esp,%ebp
>> 7d: 56 push %esi
>> 7e: 53 push %ebx
>> 7f: 83 ec 18 sub $0x18,%esp
>> 82: 83 3d 18 00 00 00 00 cmpl $0x0,0x18
>> 84: R_386_32 .bss
>> 89: 75 7a jne 105 <default_idle+0x8b>
>> 8b: 80 3d 05 00 00 00 00 cmpb $0x0,0x5
>> 8d: R_386_32 boot_cpu_data
>> 92: 74 71 je 105 <default_idle+0x8b>
>> 94: 83 3d 04 00 00 00 00 cmpl $0x0,0x4
>> 96: R_386_32 __tracepoint_power_start
>> 9b: 74 23 je c0 <default_idle+0x46>
>> 9d: 8b 1d 08 00 00 00 mov 0x8,%ebx
>> 9f: R_386_32 __tracepoint_power_start
>> a3: 85 db test %ebx,%ebx
>> a5: 74 19 je c0 <default_idle+0x46>
>> a7: 8d 75 e0 lea -0x20(%ebp),%esi
>> aa: b9 01 00 00 00 mov $0x1,%ecx
>> af: ba 01 00 00 00 mov $0x1,%edx
>> b4: 89 f0 mov %esi,%eax
>> b6: ff 13 call *(%ebx)
>> b8: 83 c3 04 add $0x4,%ebx
>> bb: 83 3b 00 cmpl $0x0,(%ebx)
>> be: 75 ea jne aa <default_idle+0x30>
>> c0: 89 e0 mov %esp,%eax
>> c2: 25 00 e0 ff ff and $0xffffe000,%eax
>> c7: 83 60 0c fb andl $0xfffffffb,0xc(%eax)
>> cb: f6 40 08 08 testb $0x8,0x8(%eax)
>> cf: 75 04 jne d5 <default_idle+0x5b>
>> d1: fb sti
>> d2: f4 hlt
>> --> d3: eb 01 jmp d6 <default_idle+0x5c> <--------
>> d5: fb sti
>> d6: 89 e0 mov %esp,%eax
>> d8: 25 00 e0 ff ff and $0xffffe000,%eax
>> dd: 83 48 0c 04 orl $0x4,0xc(%eax)
>> e1: 83 3d 04 00 00 00 00 cmpl $0x0,0x4
>> e3: R_386_32 __tracepoint_power_end
>> e8: 74 1e je 108 <default_idle+0x8e>
>>
>>
>>
>>>> 7ec0a7290797f57b780f792d12f4bcc19c83aa4f is first bad commit
>>>> commit 7ec0a7290797f57b780f792d12f4bcc19c83aa4f
>>>> Author: Bjorn Helgaas <[email protected]>
>>>> Date: Mon Mar 30 17:48:24 2009 +0000
>>>>
>>> Ouch, sorry about that. Thanks for doing all the bisection work.
>>>
>>>
>>>> ACPI: processor: use .notify method instead of installing handler
>>>> directly
>>>>
>>>> This patch adds a .notify() method. The presence of .notify() causes
>>>> Linux/ACPI to manage event handlers and notify handlers on our behalf,
>>>> so we don't have to install and remove them ourselves.
>>>>
>>>> Signed-off-by: Bjorn Helgaas <[email protected]>
>>>> CC: Zhang Rui <[email protected]>
>>>> CC: Zhao Yakui <[email protected]>
>>>> CC: Venki Pallipadi <[email protected]>
>>>> CC: Anil S Keshavamurthy <[email protected]>
>>>> Signed-off-by: Len Brown <[email protected]>
>>>>
>>>> However, reverting this commit from v2.6.30-rc1 doesn't solve the hang.
>>>>
>>> I don't see the problem in that commit yet, and if there is a problem
>>> with it, I would think that reverting it from 2.6.30-rc1 would solve
>>> it. But maybe it'd be useful to revert the whole .notify series to
>>> make sure. From 2.6.30-rc1, you should be able to revert these:
>>>
>>> 7ec0a7290797f57b780f792d12f4bcc19c83aa4f processor
>>> 373cfc360ec773be2f7615e59a19f3313255db7c button
>>> 46ec8598fde74ba59703575c22a6fb0b6b151bb6 Linux/ACPI infrastructure
>>>
>>> What happens with those commits reverted?
>>>
>> I'll find out tomorrow.
>>
>
> The fact that it still hangs even when you don't load the processor
> driver at all suggests that the 7ec0a729079 commit identified by the
> bisection is not the real problem. That commit only touches
> drivers/acpi/processor_core.c.
>

Yah.

> I think it's more likely some kind of race or missed wakeup.
>
> Since it seems to be sensitive to whether the battery is present,
> I guess you could try blacklisting the battery.ko driver. There
> have been a few changes to it since 2.6.29-rc8. If things work
> without battery.ko, we can look through those changes.
>

Good guess :-). I tried a couple of times either way, and blacklisting
"battery" definitely avoids the hang.

Thanks
Alan

2009-04-14 09:26:35

by Alan Jenkins

[permalink] [raw]
Subject: Re: [BISECTED] EEE PC hangs when booting off battery

Alan Jenkins wrote:
> Bjorn Helgaas wrote:
>
>> On Monday 13 April 2009 01:57:00 pm Alan Jenkins wrote:
>>
>>
>>> Bjorn Helgaas wrote:
>>>
>>>
>>>> On Sunday 12 April 2009 07:11:57 am Alan Jenkins wrote:
>>>>
>>>> You mention that this occurs when booting off battery. So I
>>>> assume everything works fine when the EEE is plugged in to the
>>>> wall socket?
>>>>
>>>>
>>> When I tested it before, that was what I found.
>>>
>>> However, I now find that's not quite right. It only works (i.e. doesn't
>>> hang) if I remove the battery as well as plugging it into the wall. If
>>> I have the battery in, it hangs.
>>>
>>>
>
> ... and right now, I can only reproduce it by booting with it plugged
> into the wall and the battery present. If I unplug it from the wall, it
> boots fine.
>
> It must be affected by something else as well, maybe battery level or
> charging / discharging status.
>
>
>>>>>>>> Magic SysRQ keys work though. ...
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>> I was able to run SysRq-P, and found the following backtrace -
>>>>>>
>>>>>> Pid: 0
>>>>>> EIP is at acpi_idle_enter_bm+0x1df/0x208 [processor]
>>>>>>
>>>>>>
>>>> Can you figure out where this is in acpi_idle_enter_bm() or
>>>> maybe just email me your processor.ko module?
>>>>
>>>> Does it always happen at the same point?
>>>>
>>>>
>>> Yes, it always happens at the same point.
>>>
>>> It turns out I can read the runes, but I don't understand what they're
>>> saying :-).
>>>
>>>
>> I'm not much good with x86 assembly either :-)
>>
>> I think that in both cases below, you're right after enabling
>> interrupts and about to exit the idle routine. My guess is the
>> system is not really hung; it just doesn't think it has anything
>> to do and is spending all its time in the idle loop.
>>
>>
>>
>>> 00001bd0 <acpi_idle_enter_bm>:
>>>
>>> ...
>>> 00001bd0 + 0x1df = 00001daf
>>> ...
>>> 1d70: b8 03 00 00 00 mov $0x3,%eax
>>> 1d75: e8 90 f3 ff ff call 110a <tsc_halts_in_c>
>>> 1d7a: 85 c0 test %eax,%eax
>>> 1d7c: 74 0a je 1d88 <acpi_idle_enter_bm+0x1b8>
>>> 1d7e: b8 0e 09 00 00 mov $0x90e,%eax
>>> 1d7f: R_386_32 .rodata.str1.1
>>> 1d83: e8 fc ff ff ff call 1d84 <acpi_idle_enter_bm+0x1b4>
>>> 1d84: R_386_PC32 mark_tsc_unstable
>>> 1d88: 8b 45 e8 mov -0x18(%ebp),%eax
>>> 1d8b: 8b 55 ec mov -0x14(%ebp),%edx
>>> 1d8e: e8 ab fd ff ff call 1b3e <us_to_pm_timer_ticks>
>>> 1d93: 89 c3 mov %eax,%ebx
>>> 1d95: b8 17 01 00 00 mov $0x117,%eax
>>> 1d9a: 69 ca 17 01 00 00 imul $0x117,%edx,%ecx
>>> 1da0: 89 d6 mov %edx,%esi
>>> 1da2: f7 e3 mul %ebx
>>> 1da4: 8d 14 11 lea (%ecx,%edx,1),%edx
>>> 1da7: e8 fc ff ff ff call 1da8 <acpi_idle_enter_bm+0x1d8>
>>> 1da8: R_386_PC32 sched_clock_idle_wakeup_event
>>> 1dac: fb sti
>>> 1dad: 89 e0 mov %esp,%eax
>>> -> 1daf: 31 c9 xor %ecx,%ecx <---------
>>> 1db1: 25 00 e0 ff ff and $0xffffe000,%eax
>>> 1db6: 89 fa mov %edi,%edx
>>> 1db8: 83 48 0c 04 orl $0x4,0xc(%eax)
>>> 1dbc: ff 47 18 incl 0x18(%edi)
>>> 1dbf: 8b 45 e4 mov -0x1c(%ebp),%eax
>>> 1dc2: e8 a4 f5 ff ff call 136b <acpi_state_timer_broadcast>
>>> 1dc7: 01 5f 1c add %ebx,0x1c(%edi)
>>> 1dca: 11 77 20 adc %esi,0x20(%edi)
>>> 1dcd: 8b 45 e8 mov -0x18(%ebp),%eax
>>> 1dd0: 83 c4 10 add $0x10,%esp
>>> 1dd3: 5b pop %ebx
>>> 1dd4: 5e pop %esi
>>> 1dd5: 5f pop %edi
>>> 1dd6: 5d pop %ebp
>>> 1dd7: c3 ret
>>>
>>>
>>>
>>>> If you blacklist or rename the processor module to prevent it
>>>> from loading, does that keep the hang from occurring?
>>>>
>>>>
>>> No. In that case I get the hang in default_idle+0x59/0x95
>>>
>>> 0000007a <default_idle>:
>>> 7a: 55 push %ebp
>>> 7b: 89 e5 mov %esp,%ebp
>>> 7d: 56 push %esi
>>> 7e: 53 push %ebx
>>> 7f: 83 ec 18 sub $0x18,%esp
>>> 82: 83 3d 18 00 00 00 00 cmpl $0x0,0x18
>>> 84: R_386_32 .bss
>>> 89: 75 7a jne 105 <default_idle+0x8b>
>>> 8b: 80 3d 05 00 00 00 00 cmpb $0x0,0x5
>>> 8d: R_386_32 boot_cpu_data
>>> 92: 74 71 je 105 <default_idle+0x8b>
>>> 94: 83 3d 04 00 00 00 00 cmpl $0x0,0x4
>>> 96: R_386_32 __tracepoint_power_start
>>> 9b: 74 23 je c0 <default_idle+0x46>
>>> 9d: 8b 1d 08 00 00 00 mov 0x8,%ebx
>>> 9f: R_386_32 __tracepoint_power_start
>>> a3: 85 db test %ebx,%ebx
>>> a5: 74 19 je c0 <default_idle+0x46>
>>> a7: 8d 75 e0 lea -0x20(%ebp),%esi
>>> aa: b9 01 00 00 00 mov $0x1,%ecx
>>> af: ba 01 00 00 00 mov $0x1,%edx
>>> b4: 89 f0 mov %esi,%eax
>>> b6: ff 13 call *(%ebx)
>>> b8: 83 c3 04 add $0x4,%ebx
>>> bb: 83 3b 00 cmpl $0x0,(%ebx)
>>> be: 75 ea jne aa <default_idle+0x30>
>>> c0: 89 e0 mov %esp,%eax
>>> c2: 25 00 e0 ff ff and $0xffffe000,%eax
>>> c7: 83 60 0c fb andl $0xfffffffb,0xc(%eax)
>>> cb: f6 40 08 08 testb $0x8,0x8(%eax)
>>> cf: 75 04 jne d5 <default_idle+0x5b>
>>> d1: fb sti
>>> d2: f4 hlt
>>> --> d3: eb 01 jmp d6 <default_idle+0x5c> <--------
>>> d5: fb sti
>>> d6: 89 e0 mov %esp,%eax
>>> d8: 25 00 e0 ff ff and $0xffffe000,%eax
>>> dd: 83 48 0c 04 orl $0x4,0xc(%eax)
>>> e1: 83 3d 04 00 00 00 00 cmpl $0x0,0x4
>>> e3: R_386_32 __tracepoint_power_end
>>> e8: 74 1e je 108 <default_idle+0x8e>
>>>
>>>
>>>
>>>
>>>>> 7ec0a7290797f57b780f792d12f4bcc19c83aa4f is first bad commit
>>>>> commit 7ec0a7290797f57b780f792d12f4bcc19c83aa4f
>>>>> Author: Bjorn Helgaas <[email protected]>
>>>>> Date: Mon Mar 30 17:48:24 2009 +0000
>>>>>
>>>>>
>>>> Ouch, sorry about that. Thanks for doing all the bisection work.
>>>>
>>>>
>>>>
>>>>> ACPI: processor: use .notify method instead of installing handler
>>>>> directly
>>>>>
>>>>> This patch adds a .notify() method. The presence of .notify() causes
>>>>> Linux/ACPI to manage event handlers and notify handlers on our behalf,
>>>>> so we don't have to install and remove them ourselves.
>>>>>
>>>>> Signed-off-by: Bjorn Helgaas <[email protected]>
>>>>> CC: Zhang Rui <[email protected]>
>>>>> CC: Zhao Yakui <[email protected]>
>>>>> CC: Venki Pallipadi <[email protected]>
>>>>> CC: Anil S Keshavamurthy <[email protected]>
>>>>> Signed-off-by: Len Brown <[email protected]>
>>>>>
>>>>> However, reverting this commit from v2.6.30-rc1 doesn't solve the hang.
>>>>>
>>>>>
>>>> I don't see the problem in that commit yet, and if there is a problem
>>>> with it, I would think that reverting it from 2.6.30-rc1 would solve
>>>> it. But maybe it'd be useful to revert the whole .notify series to
>>>> make sure. From 2.6.30-rc1, you should be able to revert these:
>>>>
>>>> 7ec0a7290797f57b780f792d12f4bcc19c83aa4f processor
>>>> 373cfc360ec773be2f7615e59a19f3313255db7c button
>>>> 46ec8598fde74ba59703575c22a6fb0b6b151bb6 Linux/ACPI infrastructure
>>>>
>>>> What happens with those commits reverted?
>>>>
>>>>
>>> I'll find out tomorrow.
>>>
>>>
>> The fact that it still hangs even when you don't load the processor
>> driver at all suggests that the 7ec0a729079 commit identified by the
>> bisection is not the real problem. That commit only touches
>> drivers/acpi/processor_core.c.
>>
>>
>
> Yah.
>
>
>> I think it's more likely some kind of race or missed wakeup.
>>
>> Since it seems to be sensitive to whether the battery is present,
>> I guess you could try blacklisting the battery.ko driver. There
>> have been a few changes to it since 2.6.29-rc8. If things work
>> without battery.ko, we can look through those changes.
>>
>>
>
> Good guess :-). I tried a couple of times either way, and blacklisting
> "battery" definitely avoids the hang.
>

Ok, I tried reverting

0f66af530116e9f4dd97f328d91718b56a6fc5a4 "ACPI: battery: asynchronous init"

and that fixed it.

Thanks for your help so far, Bjorn.

Alan

2009-04-14 14:59:19

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [BISECTED] EEE PC hangs when booting off battery

On Tuesday 14 April 2009 03:26:13 am Alan Jenkins wrote:
> Alan Jenkins wrote:
> > Bjorn Helgaas wrote:
> >> On Monday 13 April 2009 01:57:00 pm Alan Jenkins wrote:
> >>> Bjorn Helgaas wrote:
> >>>> On Sunday 12 April 2009 07:11:57 am Alan Jenkins wrote:
> >>>>
> >>>> You mention that this occurs when booting off battery. So I
> >>>> assume everything works fine when the EEE is plugged in to the
> >>>> wall socket?
> >>>>
> >>> When I tested it before, that was what I found.
> >>>
> >>> However, I now find that's not quite right. It only works (i.e. doesn't
> >>> hang) if I remove the battery as well as plugging it into the wall. If
> >>> I have the battery in, it hangs.
> >
> > ... and right now, I can only reproduce it by booting with it plugged
> > into the wall and the battery present. If I unplug it from the wall, it
> > boots fine.
> >
> > It must be affected by something else as well, maybe battery level or
> > charging / discharging status.
> >
> >>>>>>>> Magic SysRQ keys work though. ...
> >>>>>> I was able to run SysRq-P, and found the following backtrace -
> >>>>>>
> >>>>>> Pid: 0
> >>>>>> EIP is at acpi_idle_enter_bm+0x1df/0x208 [processor]
> >>>>>>
> >>>> Can you figure out where this is in acpi_idle_enter_bm() or
> >>>> maybe just email me your processor.ko module?
> >>>>
> >>>> Does it always happen at the same point?
> >>>>
> >>> Yes, it always happens at the same point.
> >>>
> >>> It turns out I can read the runes, but I don't understand what they're
> >>> saying :-).
> >>>
> >> I'm not much good with x86 assembly either :-)
> >>
> >> I think that in both cases below, you're right after enabling
> >> interrupts and about to exit the idle routine. My guess is the
> >> system is not really hung; it just doesn't think it has anything
> >> to do and is spending all its time in the idle loop.
> >>
> >>> 00001bd0 <acpi_idle_enter_bm>:
> >>> ...
> >>> 00001bd0 + 0x1df = 00001daf
> >>> ...
> >>> 1d70: b8 03 00 00 00 mov $0x3,%eax
> >>> 1d75: e8 90 f3 ff ff call 110a <tsc_halts_in_c>
> >>> 1d7a: 85 c0 test %eax,%eax
> >>> 1d7c: 74 0a je 1d88 <acpi_idle_enter_bm+0x1b8>
> >>> 1d7e: b8 0e 09 00 00 mov $0x90e,%eax
> >>> 1d7f: R_386_32 .rodata.str1.1
> >>> 1d83: e8 fc ff ff ff call 1d84 <acpi_idle_enter_bm+0x1b4>
> >>> 1d84: R_386_PC32 mark_tsc_unstable
> >>> 1d88: 8b 45 e8 mov -0x18(%ebp),%eax
> >>> 1d8b: 8b 55 ec mov -0x14(%ebp),%edx
> >>> 1d8e: e8 ab fd ff ff call 1b3e <us_to_pm_timer_ticks>
> >>> 1d93: 89 c3 mov %eax,%ebx
> >>> 1d95: b8 17 01 00 00 mov $0x117,%eax
> >>> 1d9a: 69 ca 17 01 00 00 imul $0x117,%edx,%ecx
> >>> 1da0: 89 d6 mov %edx,%esi
> >>> 1da2: f7 e3 mul %ebx
> >>> 1da4: 8d 14 11 lea (%ecx,%edx,1),%edx
> >>> 1da7: e8 fc ff ff ff call 1da8 <acpi_idle_enter_bm+0x1d8>
> >>> 1da8: R_386_PC32 sched_clock_idle_wakeup_event
> >>> 1dac: fb sti
> >>> 1dad: 89 e0 mov %esp,%eax
> >>> -> 1daf: 31 c9 xor %ecx,%ecx <---------
> >>> 1db1: 25 00 e0 ff ff and $0xffffe000,%eax
> >>> 1db6: 89 fa mov %edi,%edx
> >>> 1db8: 83 48 0c 04 orl $0x4,0xc(%eax)
> >>> 1dbc: ff 47 18 incl 0x18(%edi)
> >>> 1dbf: 8b 45 e4 mov -0x1c(%ebp),%eax
> >>> 1dc2: e8 a4 f5 ff ff call 136b <acpi_state_timer_broadcast>
> >>> 1dc7: 01 5f 1c add %ebx,0x1c(%edi)
> >>> 1dca: 11 77 20 adc %esi,0x20(%edi)
> >>> 1dcd: 8b 45 e8 mov -0x18(%ebp),%eax
> >>> 1dd0: 83 c4 10 add $0x10,%esp
> >>> 1dd3: 5b pop %ebx
> >>> 1dd4: 5e pop %esi
> >>> 1dd5: 5f pop %edi
> >>> 1dd6: 5d pop %ebp
> >>> 1dd7: c3 ret
> >>>
> >>>> If you blacklist or rename the processor module to prevent it
> >>>> from loading, does that keep the hang from occurring?
> >>>>
> >>> No. In that case I get the hang in default_idle+0x59/0x95
> >>>
> >>> 0000007a <default_idle>:
> >>> 7a: 55 push %ebp
> >>> 7b: 89 e5 mov %esp,%ebp
> >>> 7d: 56 push %esi
> >>> 7e: 53 push %ebx
> >>> 7f: 83 ec 18 sub $0x18,%esp
> >>> 82: 83 3d 18 00 00 00 00 cmpl $0x0,0x18
> >>> 84: R_386_32 .bss
> >>> 89: 75 7a jne 105 <default_idle+0x8b>
> >>> 8b: 80 3d 05 00 00 00 00 cmpb $0x0,0x5
> >>> 8d: R_386_32 boot_cpu_data
> >>> 92: 74 71 je 105 <default_idle+0x8b>
> >>> 94: 83 3d 04 00 00 00 00 cmpl $0x0,0x4
> >>> 96: R_386_32 __tracepoint_power_start
> >>> 9b: 74 23 je c0 <default_idle+0x46>
> >>> 9d: 8b 1d 08 00 00 00 mov 0x8,%ebx
> >>> 9f: R_386_32 __tracepoint_power_start
> >>> a3: 85 db test %ebx,%ebx
> >>> a5: 74 19 je c0 <default_idle+0x46>
> >>> a7: 8d 75 e0 lea -0x20(%ebp),%esi
> >>> aa: b9 01 00 00 00 mov $0x1,%ecx
> >>> af: ba 01 00 00 00 mov $0x1,%edx
> >>> b4: 89 f0 mov %esi,%eax
> >>> b6: ff 13 call *(%ebx)
> >>> b8: 83 c3 04 add $0x4,%ebx
> >>> bb: 83 3b 00 cmpl $0x0,(%ebx)
> >>> be: 75 ea jne aa <default_idle+0x30>
> >>> c0: 89 e0 mov %esp,%eax
> >>> c2: 25 00 e0 ff ff and $0xffffe000,%eax
> >>> c7: 83 60 0c fb andl $0xfffffffb,0xc(%eax)
> >>> cb: f6 40 08 08 testb $0x8,0x8(%eax)
> >>> cf: 75 04 jne d5 <default_idle+0x5b>
> >>> d1: fb sti
> >>> d2: f4 hlt
> >>> --> d3: eb 01 jmp d6 <default_idle+0x5c> <--------
> >>> d5: fb sti
> >>> d6: 89 e0 mov %esp,%eax
> >>> d8: 25 00 e0 ff ff and $0xffffe000,%eax
> >>> dd: 83 48 0c 04 orl $0x4,0xc(%eax)
> >>> e1: 83 3d 04 00 00 00 00 cmpl $0x0,0x4
> >>> e3: R_386_32 __tracepoint_power_end
> >>> e8: 74 1e je 108 <default_idle+0x8e>
> >>>
> >>>>> 7ec0a7290797f57b780f792d12f4bcc19c83aa4f is first bad commit
> >>>>> commit 7ec0a7290797f57b780f792d12f4bcc19c83aa4f
> >>>>> Author: Bjorn Helgaas <[email protected]>
> >>>>> Date: Mon Mar 30 17:48:24 2009 +0000
> >>>>>
> >>>> Ouch, sorry about that. Thanks for doing all the bisection work.
> >>>>
> >>>>> ACPI: processor: use .notify method instead of installing handler
> >>>>> directly
> >>>>>
> >>>>> This patch adds a .notify() method. The presence of .notify() causes
> >>>>> Linux/ACPI to manage event handlers and notify handlers on our behalf,
> >>>>> so we don't have to install and remove them ourselves.
> >>>>>
> >>>>> Signed-off-by: Bjorn Helgaas <[email protected]>
> >>>>> CC: Zhang Rui <[email protected]>
> >>>>> CC: Zhao Yakui <[email protected]>
> >>>>> CC: Venki Pallipadi <[email protected]>
> >>>>> CC: Anil S Keshavamurthy <[email protected]>
> >>>>> Signed-off-by: Len Brown <[email protected]>
> >>>>>
> >>>>> However, reverting this commit from v2.6.30-rc1 doesn't solve the hang.
> >>>>>
> >>>> I don't see the problem in that commit yet, and if there is a problem
> >>>> with it, I would think that reverting it from 2.6.30-rc1 would solve
> >>>> it. But maybe it'd be useful to revert the whole .notify series to
> >>>> make sure. From 2.6.30-rc1, you should be able to revert these:
> >>>>
> >>>> 7ec0a7290797f57b780f792d12f4bcc19c83aa4f processor
> >>>> 373cfc360ec773be2f7615e59a19f3313255db7c button
> >>>> 46ec8598fde74ba59703575c22a6fb0b6b151bb6 Linux/ACPI infrastructure
> >>>>
> >>>> What happens with those commits reverted?
> >>>>
> >>> I'll find out tomorrow.
> >>>
> >> The fact that it still hangs even when you don't load the processor
> >> driver at all suggests that the 7ec0a729079 commit identified by the
> >> bisection is not the real problem. That commit only touches
> >> drivers/acpi/processor_core.c.
> >
> > Yah.
> >
> >> I think it's more likely some kind of race or missed wakeup.
> >>
> >> Since it seems to be sensitive to whether the battery is present,
> >> I guess you could try blacklisting the battery.ko driver. There
> >> have been a few changes to it since 2.6.29-rc8. If things work
> >> without battery.ko, we can look through those changes.
> >
> > Good guess :-). I tried a couple of times either way, and blacklisting
> > "battery" definitely avoids the hang.
>
> Ok, I tried reverting
>
> 0f66af530116e9f4dd97f328d91718b56a6fc5a4 "ACPI: battery: asynchronous init"
>
> and that fixed it.

I can't help with the real problem of why the asynchronous battery
init causes the hang.

But I do object to the magic makefile ordering change in that commit.
Nobody reading the makefile can tell why battery is down at the end,
and moving it apparently slows down boot significantly. So the
ordering change just feels like a band-aid that covers up a place
where ACPI could be improved.

I don't see anything unusual in what the battery init is doing, so
it's probably just some ACPI methods that take a long time to
execute. Other drivers could easily have similar problems.

Bjorn

2009-04-14 15:15:56

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [BISECTED] EEE PC hangs when booting off battery

On Tue, 14 Apr 2009 08:59:01 -0600
Bjorn Helgaas <[email protected]> wrote:

> I can't help with the real problem of why the asynchronous battery
> init causes the hang.

that got fixed already for the module case.

>
> But I do object to the magic makefile ordering change in that commit.
> Nobody reading the makefile can tell why battery is down at the end,
> and moving it apparently slows down boot significantly.

for all cases I've seen it actually speeds it up, because the battery
now runs concurrently with the disk probe.

> So the
> ordering change just feels like a band-aid that covers up a place
> where ACPI could be improved.

the reason for the move is that both the battery and other pieces take
the big acpi lock; which defeats the parallelism. So the battery needs
to happen at the end instead.

--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2009-04-14 15:37:59

by Alan Jenkins

[permalink] [raw]
Subject: Re: [BISECTED] EEE PC hangs when booting off battery

Arjan van de Ven wrote:
> On Tue, 14 Apr 2009 08:59:01 -0600
> Bjorn Helgaas <[email protected]> wrote:
>
>
>> I can't help with the real problem of why the asynchronous battery
>> init causes the hang.
>>
>
> that got fixed already for the module case.
>

Are you not thinking of the BUG due to calling __init functions from async?

I have a distinct _hang_ which occurs in v2.6.30-rc1-205-g08f4416. Do
you have a fix for that I can try?

Thanks
Alan

2009-04-14 15:49:27

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [BISECTED] EEE PC hangs when booting off battery

On Tuesday 14 April 2009 09:17:28 am Arjan van de Ven wrote:
> On Tue, 14 Apr 2009 08:59:01 -0600
> Bjorn Helgaas <[email protected]> wrote:
>
> > I can't help with the real problem of why the asynchronous battery
> > init causes the hang.
>
> that got fixed already for the module case.

But apparently still broken for the builtin case? I think Alan is
running pretty new bits -- he said "latest git" on April 11.

> > But I do object to the magic makefile ordering change in that commit.
> > Nobody reading the makefile can tell why battery is down at the end,
> > and moving it apparently slows down boot significantly.
>
> for all cases I've seen it actually speeds it up, because the battery
> now runs concurrently with the disk probe.

I understand; I just meant that if somebody moves it back where it
was, we'll mysteriously lose the speedup you got. Maybe a comment
in the makefile would be a short-term solution.

> > So the
> > ordering change just feels like a band-aid that covers up a place
> > where ACPI could be improved.
>
> the reason for the move is that both the battery and other pieces take
> the big acpi lock; which defeats the parallelism. So the battery needs
> to happen at the end instead.

Yep. But I don't think it's anything about the Linux battery driver
itself that makes it slow. I think it's more likely that some of the
ACPI methods it executes happen to be slow. And that could afflict
*any* driver, depending on the whim of a BIOS writer.

My guess is that if we could run ACPI methods concurrently and avoid
that big lock, the ordering wouldn't matter. I know we probably can't
do that any time soon, but I think it's good to make the problem
visible at least with a "we need this magic order because ACPI doesn't
support concurrent method execution" sort of comment.

Bjorn

2009-04-14 15:58:37

by Moore, Robert

[permalink] [raw]
Subject: RE: [BISECTED] EEE PC hangs when booting off battery

In fact, ACPI methods can execute concurrently -- constrained by the ACPI specification itself. The "big lock" is released before anything that will block for a significant amount of time, allowing other methods to run.

Bob


>-----Original Message-----
>From: [email protected] [mailto:linux-acpi-
>[email protected]] On Behalf Of Bjorn Helgaas
>Sent: Tuesday, April 14, 2009 8:49 AM
>To: Arjan van de Ven
>Cc: Alan Jenkins; [email protected]; linux-kernel; Kernel Testers
>List; Pallipadi, Venkatesh
>Subject: Re: [BISECTED] EEE PC hangs when booting off battery
>
>On Tuesday 14 April 2009 09:17:28 am Arjan van de Ven wrote:
>> On Tue, 14 Apr 2009 08:59:01 -0600
>> Bjorn Helgaas <[email protected]> wrote:
>>
>> > I can't help with the real problem of why the asynchronous battery
>> > init causes the hang.
>>
>> that got fixed already for the module case.
>
>But apparently still broken for the builtin case? I think Alan is
>running pretty new bits -- he said "latest git" on April 11.
>
>> > But I do object to the magic makefile ordering change in that commit.
>> > Nobody reading the makefile can tell why battery is down at the end,
>> > and moving it apparently slows down boot significantly.
>>
>> for all cases I've seen it actually speeds it up, because the battery
>> now runs concurrently with the disk probe.
>
>I understand; I just meant that if somebody moves it back where it
>was, we'll mysteriously lose the speedup you got. Maybe a comment
>in the makefile would be a short-term solution.
>
>> > So the
>> > ordering change just feels like a band-aid that covers up a place
>> > where ACPI could be improved.
>>
>> the reason for the move is that both the battery and other pieces take
>> the big acpi lock; which defeats the parallelism. So the battery needs
>> to happen at the end instead.
>
>Yep. But I don't think it's anything about the Linux battery driver
>itself that makes it slow. I think it's more likely that some of the
>ACPI methods it executes happen to be slow. And that could afflict
>*any* driver, depending on the whim of a BIOS writer.
>
>My guess is that if we could run ACPI methods concurrently and avoid
>that big lock, the ordering wouldn't matter. I know we probably can't
>do that any time soon, but I think it's good to make the problem
>visible at least with a "we need this magic order because ACPI doesn't
>support concurrent method execution" sort of comment.
>
>Bjorn
>--
>To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
>the body of a message to [email protected]
>More majordomo info at http://vger.kernel.org/majordomo-info.html

2009-04-14 16:57:00

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [BISECTED] EEE PC hangs when booting off battery

On Tuesday 14 April 2009 09:55:58 am Moore, Robert wrote:
> In fact, ACPI methods can execute concurrently -- constrained by the ACPI specification itself. The "big lock" is released before anything that will block for a significant amount of time, allowing other methods to run.

We had a discussion about this a couple years ago:
http://www.mail-archive.com/linux-acpi%40vger.kernel.org/msg06976.html

Here's my understanding (please correct me if I'm wrong):

- The ACPI CA holds a mutex while executing an AML method (see
acpi_ex_enter_interpreter()).

- When an AML method blocks, the ACPI CA releases the mutex, which
allows another AML method to run while the first is blocked.

- Because of the mutex, at most one AML method can be executing at
any given time. Others may have started, but they are blocked
until the current method completes or blocks itself.

- The undocumented "acpi_serialize" option makes the ACPI CA hold
the mutex for the entire duration of a method, even while a method
is blocked.

Based on that, my guess about the battery init being slow because of
a long-running AML method might be incorrect, although I suppose
it's still possible to write AML busy-loops that never block.

I'd really like to understand what's making it slow. The only
thing that looks at all unusual in acpi_battery_add() is
acpi_battery_update(), and that evaluates _STA and _BIF and
not much else.

Arjan, do you have any more information? Is the battery driver
slow on all laptops? If not, where did you see the problem? Do
you have a DSDT for them?

Bjorn

> >-----Original Message-----
> >From: [email protected] [mailto:linux-acpi-
> >[email protected]] On Behalf Of Bjorn Helgaas
> >Sent: Tuesday, April 14, 2009 8:49 AM
> >To: Arjan van de Ven
> >Cc: Alan Jenkins; [email protected]; linux-kernel; Kernel Testers
> >List; Pallipadi, Venkatesh
> >Subject: Re: [BISECTED] EEE PC hangs when booting off battery
> >
> >On Tuesday 14 April 2009 09:17:28 am Arjan van de Ven wrote:
> >> On Tue, 14 Apr 2009 08:59:01 -0600
> >> Bjorn Helgaas <[email protected]> wrote:
> >>
> >> > I can't help with the real problem of why the asynchronous battery
> >> > init causes the hang.
> >>
> >> that got fixed already for the module case.
> >
> >But apparently still broken for the builtin case? I think Alan is
> >running pretty new bits -- he said "latest git" on April 11.
> >
> >> > But I do object to the magic makefile ordering change in that commit.
> >> > Nobody reading the makefile can tell why battery is down at the end,
> >> > and moving it apparently slows down boot significantly.
> >>
> >> for all cases I've seen it actually speeds it up, because the battery
> >> now runs concurrently with the disk probe.
> >
> >I understand; I just meant that if somebody moves it back where it
> >was, we'll mysteriously lose the speedup you got. Maybe a comment
> >in the makefile would be a short-term solution.
> >
> >> > So the
> >> > ordering change just feels like a band-aid that covers up a place
> >> > where ACPI could be improved.
> >>
> >> the reason for the move is that both the battery and other pieces take
> >> the big acpi lock; which defeats the parallelism. So the battery needs
> >> to happen at the end instead.
> >
> >Yep. But I don't think it's anything about the Linux battery driver
> >itself that makes it slow. I think it's more likely that some of the
> >ACPI methods it executes happen to be slow. And that could afflict
> >*any* driver, depending on the whim of a BIOS writer.
> >
> >My guess is that if we could run ACPI methods concurrently and avoid
> >that big lock, the ordering wouldn't matter. I know we probably can't
> >do that any time soon, but I think it's good to make the problem
> >visible at least with a "we need this magic order because ACPI doesn't
> >support concurrent method execution" sort of comment.
> >
> >Bjorn

2009-04-14 17:23:13

by Moore, Robert

[permalink] [raw]
Subject: RE: [BISECTED] EEE PC hangs when booting off battery

Your understanding is correct.

The only interesting thing that happens when executing the battery methods like _STA, _BIF, and _BST is that the EC is usually involved - and this can be slow. Note that the AML interpreter is unlocked before accessing an EC operation region.

>-----Original Message-----
>From: Bjorn Helgaas [mailto:[email protected]]
>Sent: Tuesday, April 14, 2009 9:56 AM
>To: Moore, Robert
>Cc: Arjan van de Ven; Alan Jenkins; [email protected]; linux-
>kernel; Kernel Testers List; Pallipadi, Venkatesh
>Subject: Re: [BISECTED] EEE PC hangs when booting off battery
>
>On Tuesday 14 April 2009 09:55:58 am Moore, Robert wrote:
>> In fact, ACPI methods can execute concurrently -- constrained by the ACPI
>specification itself. The "big lock" is released before anything that will
>block for a significant amount of time, allowing other methods to run.
>
>We had a discussion about this a couple years ago:
> http://www.mail-archive.com/linux-acpi%40vger.kernel.org/msg06976.html
>
>Here's my understanding (please correct me if I'm wrong):
>
>- The ACPI CA holds a mutex while executing an AML method (see
> acpi_ex_enter_interpreter()).
>
>- When an AML method blocks, the ACPI CA releases the mutex, which
> allows another AML method to run while the first is blocked.
>
>- Because of the mutex, at most one AML method can be executing at
> any given time. Others may have started, but they are blocked
> until the current method completes or blocks itself.
>
>- The undocumented "acpi_serialize" option makes the ACPI CA hold
> the mutex for the entire duration of a method, even while a method
> is blocked.
>
>Based on that, my guess about the battery init being slow because of
>a long-running AML method might be incorrect, although I suppose
>it's still possible to write AML busy-loops that never block.
>
>I'd really like to understand what's making it slow. The only
>thing that looks at all unusual in acpi_battery_add() is
>acpi_battery_update(), and that evaluates _STA and _BIF and
>not much else.
>
>Arjan, do you have any more information? Is the battery driver
>slow on all laptops? If not, where did you see the problem? Do
>you have a DSDT for them?
>
>Bjorn
>
>> >-----Original Message-----
>> >From: [email protected] [mailto:linux-acpi-
>> >[email protected]] On Behalf Of Bjorn Helgaas
>> >Sent: Tuesday, April 14, 2009 8:49 AM
>> >To: Arjan van de Ven
>> >Cc: Alan Jenkins; [email protected]; linux-kernel; Kernel
>Testers
>> >List; Pallipadi, Venkatesh
>> >Subject: Re: [BISECTED] EEE PC hangs when booting off battery
>> >
>> >On Tuesday 14 April 2009 09:17:28 am Arjan van de Ven wrote:
>> >> On Tue, 14 Apr 2009 08:59:01 -0600
>> >> Bjorn Helgaas <[email protected]> wrote:
>> >>
>> >> > I can't help with the real problem of why the asynchronous battery
>> >> > init causes the hang.
>> >>
>> >> that got fixed already for the module case.
>> >
>> >But apparently still broken for the builtin case? I think Alan is
>> >running pretty new bits -- he said "latest git" on April 11.
>> >
>> >> > But I do object to the magic makefile ordering change in that
>commit.
>> >> > Nobody reading the makefile can tell why battery is down at the end,
>> >> > and moving it apparently slows down boot significantly.
>> >>
>> >> for all cases I've seen it actually speeds it up, because the battery
>> >> now runs concurrently with the disk probe.
>> >
>> >I understand; I just meant that if somebody moves it back where it
>> >was, we'll mysteriously lose the speedup you got. Maybe a comment
>> >in the makefile would be a short-term solution.
>> >
>> >> > So the
>> >> > ordering change just feels like a band-aid that covers up a place
>> >> > where ACPI could be improved.
>> >>
>> >> the reason for the move is that both the battery and other pieces take
>> >> the big acpi lock; which defeats the parallelism. So the battery needs
>> >> to happen at the end instead.
>> >
>> >Yep. But I don't think it's anything about the Linux battery driver
>> >itself that makes it slow. I think it's more likely that some of the
>> >ACPI methods it executes happen to be slow. And that could afflict
>> >*any* driver, depending on the whim of a BIOS writer.
>> >
>> >My guess is that if we could run ACPI methods concurrently and avoid
>> >that big lock, the ordering wouldn't matter. I know we probably can't
>> >do that any time soon, but I think it's good to make the problem
>> >visible at least with a "we need this magic order because ACPI doesn't
>> >support concurrent method execution" sort of comment.
>> >
>> >Bjorn

2009-04-14 17:30:22

by Alan Jenkins

[permalink] [raw]
Subject: Re: [BISECTED] EEE PC hangs when booting off battery

So I guess the problem is that the other ACPI modules execute methods
which also need to access the EC. And it's the EC that causes them to
be serialized against the battery module.

Moore, Robert wrote:
> Your understanding is correct.
>
> The only interesting thing that happens when executing the battery methods like _STA, _BIF, and _BST is that the EC is usually involved - and this can be slow. Note that the AML interpreter is unlocked before accessing an EC operation region.
>
>
>> -----Original Message-----
>> From: Bjorn Helgaas [mailto:[email protected]]
>> Sent: Tuesday, April 14, 2009 9:56 AM
>> To: Moore, Robert
>> Cc: Arjan van de Ven; Alan Jenkins; [email protected]; linux-
>> kernel; Kernel Testers List; Pallipadi, Venkatesh
>> Subject: Re: [BISECTED] EEE PC hangs when booting off battery
>>
>> On Tuesday 14 April 2009 09:55:58 am Moore, Robert wrote:
>>
>>> In fact, ACPI methods can execute concurrently -- constrained by the ACPI
>>>
>> specification itself. The "big lock" is released before anything that will
>> block for a significant amount of time, allowing other methods to run.
>>
>> We had a discussion about this a couple years ago:
>> http://www.mail-archive.com/linux-acpi%40vger.kernel.org/msg06976.html
>>
>> Here's my understanding (please correct me if I'm wrong):
>>
>> - The ACPI CA holds a mutex while executing an AML method (see
>> acpi_ex_enter_interpreter()).
>>
>> - When an AML method blocks, the ACPI CA releases the mutex, which
>> allows another AML method to run while the first is blocked.
>>
>> - Because of the mutex, at most one AML method can be executing at
>> any given time. Others may have started, but they are blocked
>> until the current method completes or blocks itself.
>>
>> - The undocumented "acpi_serialize" option makes the ACPI CA hold
>> the mutex for the entire duration of a method, even while a method
>> is blocked.
>>
>> Based on that, my guess about the battery init being slow because of
>> a long-running AML method might be incorrect, although I suppose
>> it's still possible to write AML busy-loops that never block.
>>
>> I'd really like to understand what's making it slow. The only
>> thing that looks at all unusual in acpi_battery_add() is
>> acpi_battery_update(), and that evaluates _STA and _BIF and
>> not much else.
>>
>> Arjan, do you have any more information? Is the battery driver
>> slow on all laptops? If not, where did you see the problem? Do
>> you have a DSDT for them?
>>
>> Bjorn
>>
>>
>>>> -----Original Message-----
>>>> From: [email protected] [mailto:linux-acpi-
>>>> [email protected]] On Behalf Of Bjorn Helgaas
>>>> Sent: Tuesday, April 14, 2009 8:49 AM
>>>> To: Arjan van de Ven
>>>> Cc: Alan Jenkins; [email protected]; linux-kernel; Kernel
>>>>
>> Testers
>>
>>>> List; Pallipadi, Venkatesh
>>>> Subject: Re: [BISECTED] EEE PC hangs when booting off battery
>>>>
>>>> On Tuesday 14 April 2009 09:17:28 am Arjan van de Ven wrote:
>>>>
>>>>> On Tue, 14 Apr 2009 08:59:01 -0600
>>>>> Bjorn Helgaas <[email protected]> wrote:
>>>>>
>>>>>
>>>>>> I can't help with the real problem of why the asynchronous battery
>>>>>> init causes the hang.
>>>>>>
>>>>> that got fixed already for the module case.
>>>>>
>>>> But apparently still broken for the builtin case? I think Alan is
>>>> running pretty new bits -- he said "latest git" on April 11.
>>>>
>>>>
>>>>>> But I do object to the magic makefile ordering change in that
>>>>>>
>> commit.
>>
>>>>>> Nobody reading the makefile can tell why battery is down at the end,
>>>>>> and moving it apparently slows down boot significantly.
>>>>>>
>>>>> for all cases I've seen it actually speeds it up, because the battery
>>>>> now runs concurrently with the disk probe.
>>>>>
>>>> I understand; I just meant that if somebody moves it back where it
>>>> was, we'll mysteriously lose the speedup you got. Maybe a comment
>>>> in the makefile would be a short-term solution.
>>>>
>>>>
>>>>>> So the
>>>>>> ordering change just feels like a band-aid that covers up a place
>>>>>> where ACPI could be improved.
>>>>>>
>>>>> the reason for the move is that both the battery and other pieces take
>>>>> the big acpi lock; which defeats the parallelism. So the battery needs
>>>>> to happen at the end instead.
>>>>>
>>>> Yep. But I don't think it's anything about the Linux battery driver
>>>> itself that makes it slow. I think it's more likely that some of the
>>>> ACPI methods it executes happen to be slow. And that could afflict
>>>> *any* driver, depending on the whim of a BIOS writer.
>>>>
>>>> My guess is that if we could run ACPI methods concurrently and avoid
>>>> that big lock, the ordering wouldn't matter. I know we probably can't
>>>> do that any time soon, but I think it's good to make the problem
>>>> visible at least with a "we need this magic order because ACPI doesn't
>>>> support concurrent method execution" sort of comment.
>>>>
>>>> Bjorn
>>>>

2009-04-15 14:33:48

by Alan Jenkins

[permalink] [raw]
Subject: [FIXED] EEE PC hangs when booting off battery

Bjorn Helgaas wrote:
> On Tuesday 14 April 2009 09:17:28 am Arjan van de Ven wrote:
>
>> On Tue, 14 Apr 2009 08:59:01 -0600
>> Bjorn Helgaas <[email protected]> wrote:
>>
>>
>>> I can't help with the real problem of why the asynchronous battery
>>> init causes the hang.
>>>
>> that got fixed already for the module case.
>>
>
> But apparently still broken for the builtin case? I think Alan is
> running pretty new bits -- he said "latest git" on April 11.
>

It's now fixed, in 2.6.30-rc2. My battery is modular btw. I suspect

5d38258ec026921a7b266f4047ebeaa75db358e5 "ACPI battery: fix async boot
oops" [removal of __init]

was not sufficient to fix my problem, but it was solved by the "real" fix,

d6de2c80e9d758d2e36c21699117db6178c0f517 "async: Fix module loading
async-work regression" [module loading waits on async work]


I would argue there's still a question over why the asynchronous battery
init (_with_ the oops fix) should cause a hang in the idle routine. But
since the regression is fixed, it's not exactly an urgent question.

Thanks
Alan

2009-04-15 15:39:16

by Len Brown

[permalink] [raw]
Subject: Re: [BISECTED] EEE PC hangs when booting off battery


> > In fact, ACPI methods can execute concurrently -- constrained by the ACPI specification itself. The "big lock" is released before anything that will block for a significant amount of time, allowing other methods to run.
>
> We had a discussion about this a couple years ago:
> http://www.mail-archive.com/linux-acpi%40vger.kernel.org/msg06976.html
>
> Here's my understanding (please correct me if I'm wrong):
>
> - The ACPI CA holds a mutex while executing an AML method (see
> acpi_ex_enter_interpreter()).
>
> - When an AML method blocks, the ACPI CA releases the mutex, which
> allows another AML method to run while the first is blocked.
>
> - Because of the mutex, at most one AML method can be executing at
> any given time. Others may have started, but they are blocked
> until the current method completes or blocks itself.
>
> - The undocumented "acpi_serialize" option makes the ACPI CA hold
> the mutex for the entire duration of a method, even while a method
> is blocked.
>
> Based on that, my guess about the battery init being slow because of
> a long-running AML method might be incorrect, although I suppose
> it's still possible to write AML busy-loops that never block.
>
> I'd really like to understand what's making it slow. The only
> thing that looks at all unusual in acpi_battery_add() is
> acpi_battery_update(), and that evaluates _STA and _BIF and
> not much else.
>
> Arjan, do you have any more information? Is the battery driver
> slow on all laptops? If not, where did you see the problem? Do
> you have a DSDT for them?

Before I added Arjan's async battery patch, I booted a couple of laptops
with async disabled and enabled. I saw about 40 usec battery init (T61,
nx6325).

Obviously, this was too fast for async battery to have any measurable
benefit on these machines (though async disk probing, by comparison,
was a big win).

But Arjan saw O(300) usec serial battery init on an EEE PC,
which is why he optimized it.

Indeed, this thread was started b/c a failure was seen on an EEEPC,
probably because of its unusually slow timing.

-Len

2009-04-15 15:41:43

by Len Brown

[permalink] [raw]
Subject: Re: [BISECTED] EEE PC hangs when booting off battery



On Tue, 14 Apr 2009, Ben Gamari wrote:

> On Mon, Apr 13, 2009 at 08:57:00PM +0100, Alan Jenkins wrote:
> > Bjorn Helgaas wrote:
> > However, I now find that's not quite right. It only works (i.e. doesn't
> > hang) if I remove the battery as well as plugging it into the wall. If
> > I have the battery in, it hangs.
>
> I seem to have a similar issue with master which occurs regardless of
> whether the battery is in on my Dell Latitude D830. Here is the
> backtrace,

Does it go away for you in 2.6.30-rc2 like it did for Alan?

thanks,
-Len Brown, Intel Open Source Technology Center

> input: Video Bus as /devices/LNXSYSTM:00/device:00/PNP0A03:00/device:37/input/i6
> ACPI: Video Device [VID1] (multi-head: yes rom: no post: no)
> ACPI Warning (nspredef-0437): \_SB_.PCI0.VID2._DOD: Return Package has no eleme]
> input: Video Bus as /devices/LNXSYSTM:00/device:00/PNP0A03:00/device:3c/input/i7
> ACPI: Video Device [VID2] (multi-head: yes rom: no post: no)
> [drm] Initialized i915 1.6.0 20080730 for 0000:00:02.0 on minor 0
> SysRq : Show backtrace of all active CPUs
> CPU1:
> CPU 1:
> Modules linked in: i915 drm i2c_algo_bit cfbcopyarea i2c_core cfbimgblt video o]
> Pid: 0, comm: swapper Not tainted 2.6.30-rc1ben #54 Latitude D830
> RIP: 0010:[<ffffffff8126c488>] [<ffffffff8126c488>] acpi_idle_enter_bm+0x28e/00
> RSP: 0018:ffff88011fb0fe88 EFLAGS: 00000246
> RAX: 000000000000eca0 RBX: ffff88011fb0fee8 RCX: 0000000000000000
> RDX: ffff88002803a000 RSI: 0000000516143023 RDI: ffff88002804d6c0
> RBP: ffffffff8101180e R08: ffffffffff762325 R09: ffff88011fb09058
> R10: ffff88011fb09018 R11: ffff880028049430 R12: 000000073446f694
> R13: ffff88011fb0fe08 R14: 00000000000019d2 R15: ffff88011fb0fe18
> FS: 0000000000000000(0000) GS:ffff88002803a000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> CR2: 00007fff6d9cab78 CR3: 0000000000201000 CR4: 00000000000006e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Call Trace:
> [<ffffffff8126c481>] ? acpi_idle_enter_bm+0x287/0x2d0
> [<ffffffff8133f949>] ? cpuidle_idle_call+0x93/0xca
> [<ffffffff8100fd20>] ? cpu_idle+0x64/0xcd
> [<ffffffff813f5326>] ? start_secondary+0x1a1/0x1a6
>
>
> Hope this helps. Let me know if I can do anything else. Thanks!
>
> - ben
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

2009-04-15 15:50:40

by Len Brown

[permalink] [raw]
Subject: archlinux 2.6.28 ac oops (was Re: Regression: EEE PC hangs when booting off battery)


> Kristoffer Ericson wrote:

> > Getting this myself, even if power cord attached. Its an tainted
> > kernel though (archlinux 2.6.28) so wasnt sure if
> > vanilla was affected.
> >
> > BUG: unable to handle kernel NULL pointer dereference at 00000000
> > IP: [<f850d031>] acpi_ac_get_state+0x1b/0x59 [ac]
> > *pde = 00000000
> > Oops: 0000 [#1] PREEMPT SMP
> > last sysfs file: /sys/devices/platform/i8042/modalias
> > Modules linked in: joydev psmouse snd_pcsp serio_raw i2c_i801 iTCO_wdt iTCO_vendor_support i2c_core uvcvideo compat_ioctl32 videodev v4l1_compat sg option usbserial video output eeepc_laptop rfkill intel_agp agpgart thermal processor fan evdev button battery ac snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_hda_intel snd_hwdep snd_pcm_oss snd_pcm snd_timer snd_page_alloc snd_mixer_oss snd soundcore ath_pci wlan ath_hal(P) arc4 ecb ath5k mac80211 led_class cfg80211 atl2 rtc_cmos rtc_core rtc_lib ext3 jbd mbcache usbhid hid usb_storage sd_mod uhci_hcd ehci_hcd usbcore ata_piix ahci ata_generic pata_acpi libata scsi_mod
> >
> > Pid: 1794, comm: fsck.ext3 Tainted: P (2.6.28-ARCH #1) 701
> > EIP: 0060:[<f850d031>] EFLAGS: 00010246 CPU: 0
> > EIP is at acpi_ac_get_state+0x1b/0x59 [ac]
> > EAX: f69559dc EBX: 00000000 ECX: 00000000 EDX: f850d418
> > ESI: f6955980 EDI: 00000001 EBP: f6834480 ESP: f6853f0c
> > DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
> > Process fsck.ext3 (pid: 1794, ti=f6852000 task=f737e400 task.ti=f6852000)
> > Stack:
> > f69559dc f6955980 f692e360 f850d2fd 00000001 f692e360 c01ab0e0 00000400
> > b80cb000 f6834480 f692e380 00000000 00000000 00000001 00000000 00000000
> > f7327c00 fffffffb c01aafb0 f6834480 c01cead4 f6853fa0 00000400 b80cb000
> > Call Trace:
> > [<f850d2fd>] acpi_ac_seq_show+0x12/0x5d [ac]
> > [<c01ab0e0>] seq_read+0x130/0x2e0
> > [<c01aafb0>] seq_read+0x0/0x2e0
> > [<c01cead4>] proc_reg_read+0x64/0xa0
> > [<c01cea70>] proc_reg_read+0x0/0xa0
> > [<c019346d>] vfs_read+0x9d/0x160
> > [<c0193601>] sys_read+0x41/0x80
> > [<c0103f13>] sysenter_do_call+0x12/0x33
> > [<c0320000>] serial_pnp_probe+0x150/0x250
> > Code: b8 ea ff ff ff 75 07 8b 43 5c 89 01 31 c0 5b c3 56 89 c6 85 f6 b8 ea ff ff ff 53 74 49 8b 5e 58 8d 46 5c 31 c9 50 ba 18 d4 50 f8 <8b> 03 e8 e0 be d2 c7 89 c2 58 31 c0 85 d2 74 2b 68 1d d4 50 f8
> > EIP: [<f850d031>] acpi_ac_get_state+0x1b/0x59 [ac] SS:ESP 0068:f6853f0c
> > ---[ end trace eb12d69e27c56472 ]---
> >
>
> I'm pretty sure that's a separate bug :-). As you say, it happens under
> different circumstances, it's a BUG as opposed to a hang, and it's in a
> different version (I didn't have my problem in 2.6.28).
>
> http://www.kerneloops.org says no-one else has reported your bug. So maybe
> you're right, and Arch have applied some patches which nobody else runs
> with. I don't know how to find Arch source packages, so I can't really
> tell.
>
> But the only _taint_ as such is due to the ath_hal module. Let's see,
> Arch linux...
>
> <http://wiki.archlinux.org/index.php/Udev#Blacklisting_Modules>
> describes how you can (reversibly) blacklist modules. If you blacklist
> ath_pci (and ath_hal too, just to make sure), you could boot without any
> taint and generate a "clean" backtrace.

This oops is happening in the ac driver when user-space is
reading /proc/acpi/ac_adapter/*/state

it would be interesting if...
you can reproduce this using a kernel.org kernel
if you delay loading of the ac module till after
the system is up and see if you can reproduce this
at will by reading the file above.

thanks,
-Len

2009-04-22 12:17:23

by Alan Jenkins

[permalink] [raw]
Subject: Re: [FIXED] EEE PC hangs when booting off battery

Alan Jenkins wrote:
> Bjorn Helgaas wrote:
>
>> On Tuesday 14 April 2009 09:17:28 am Arjan van de Ven wrote:
>>
>>
>>> On Tue, 14 Apr 2009 08:59:01 -0600
>>> Bjorn Helgaas <[email protected]> wrote:
>>>
>>>
>>>
>>>> I can't help with the real problem of why the asynchronous battery
>>>> init causes the hang.
>>>>
>>>>
>>> that got fixed already for the module case.
>>>
>>>
>> But apparently still broken for the builtin case? I think Alan is
>> running pretty new bits -- he said "latest git" on April 11.
>>
>>
>
> It's now fixed, in 2.6.30-rc2. My battery is modular btw. I suspect
>
> 5d38258ec026921a7b266f4047ebeaa75db358e5 "ACPI battery: fix async boot
> oops" [removal of __init]
>
> was not sufficient to fix my problem, but it was solved by the "real" fix,
>
> d6de2c80e9d758d2e36c21699117db6178c0f517 "async: Fix module loading
> async-work regression" [module loading waits on async work]
>
>
> I would argue there's still a question over why the asynchronous battery
> init (_with_ the oops fix) should cause a hang in the idle routine. But
> since the regression is fixed, it's not exactly an urgent question.
>

Ugh. Recently I tried building the battery driver into the kernel, to
benefit from the async work. Later, I tried booting from the battery
and it hung again.

This time, the kernel did not even respond to SysRq. I tried
nmi_watchdog=1 and waiting 2 minutes, but the watchdog didn't trigger
either. As before, it doesn't happen with acpi=off.

I checked that this still happened in todays rc3, and it doesn't happen
if I revert

0f66af530116e9f4dd97f328d91718b56a6fc5a4 "ACPI: battery: asynchronous init"

Regards
Alan

2009-04-26 11:34:26

by Alan Jenkins

[permalink] [raw]
Subject: EEE PC hangs when booting off battery

Alan Jenkins wrote:
> Alan Jenkins wrote:
>
>> Bjorn Helgaas wrote:
>>
>>
>>> On Tuesday 14 April 2009 09:17:28 am Arjan van de Ven wrote:
>>>
>>>
>>>
>>>> On Tue, 14 Apr 2009 08:59:01 -0600
>>>> Bjorn Helgaas <[email protected]> wrote:
>>>>
>>>>
>>>>
>>>>
>>>>> I can't help with the real problem of why the asynchronous battery
>>>>> init causes the hang.
>>>>>
>>>>>
>>>>>
>>>> that got fixed already for the module case.
>>>>
>>>>
>>>>
>>> But apparently still broken for the builtin case? I think Alan is
>>> running pretty new bits -- he said "latest git" on April 11.
>>>
>>>
>>>
>> It's now fixed, in 2.6.30-rc2. My battery is modular btw. I suspect
>>
>> 5d38258ec026921a7b266f4047ebeaa75db358e5 "ACPI battery: fix async boot
>> oops" [removal of __init]
>>
>> was not sufficient to fix my problem, but it was solved by the "real" fix,
>>
>> d6de2c80e9d758d2e36c21699117db6178c0f517 "async: Fix module loading
>> async-work regression" [module loading waits on async work]
>>
>>
>> I would argue there's still a question over why the asynchronous battery
>> init (_with_ the oops fix) should cause a hang in the idle routine. But
>> since the regression is fixed, it's not exactly an urgent question.
>>
>>
>
> Ugh. Recently I tried building the battery driver into the kernel, to
> benefit from the async work. Later, I tried booting from the battery
> and it hung again.
>
> This time, the kernel did not even respond to SysRq. I tried
> nmi_watchdog=1 and waiting 2 minutes, but the watchdog didn't trigger
> either. As before, it doesn't happen with acpi=off.
>
> I checked that this still happened in todays rc3, and it doesn't happen
> if I revert
>
> 0f66af530116e9f4dd97f328d91718b56a6fc5a4 "ACPI: battery: asynchronous init"
>
>

It looks like my hang is caused by linkwatch_event() deadlocking on
rtnl_lock(). I can't see any direct connection to asynchronous battery
init, so perhaps that is just revealing a bug by changing the timing.

It appears I wasn't patient enough for hung task detection. If I leave
it long enough, I see:

<scrolled off top of screen>
? kobject_uevent_env
? kobject_uevent_env
__mutex_lock_slowpath
mutex_lock
rtnl_lock
linkwatch_event
worker_thread
? linkwatch_event
? autoremove_wake_function
? worker_thread
kthread
kernel_thread_helper


INFO: task modprobe:485 blocked for more than 120 seconds
Call trace:
? __atomic_notifier_call_chain
schedule
schedule_timeout
? notify_update
? do_con_write
? __wake_up
wait_for_common
? default_wake_function
wait_for_completion
flush_cpu_workqueue
? wq_barrier_func
flush_workqueue
flush_scheduled_work
tty_ldisc_release
? tty_fasyc
tty_release_dev
? free_pgtables
tty_release
__fput
filp_close
sys_close
syscall_call
? __send_remote_softirq
? usecs_to_jiffies


I then seem to get another repetition of the second calltrace, followed
by a new one

INFO: task swapper:1 blocked for more than 120 seconds
Call trace:
schedule
schedule_timeout
? __wake_up_common
? wake_up
wait_for_common
wait_for_completion
call_usermodehelper_exec
__request_module
crypto_larval_lookup
? extract_entropy
crypto_alg_mod_lookup
crypto_alloc_base
ieee80211_wep_init
ieee80211_register_hw
? ath5k_hw_set_bss
ath5k+pci_probe
local_pci_probe
pci_device_probe
driver_probe_device
__driver_attach
bus_for_each_dev
driver_attach
? __driver_attach
buad_add_driver
driver_register
? ktime_get_ts
__pci_register_driver
init_ath5k_pci
_stext
? init_ath5k_pci
? proc_create_data
? register_ieq_proc
kernel_init
? kernel_init
kernel_thread_helper




The hang happens at this point:

[ 0.967588] scsi 1:0:0:0: Direct-Access ATA SILICONMOTION SM
n/a PQ: 0 ANSI: 5
[ 0.968049] calling 4_sd_probe_async+0x0/0x225 @ 323
[ 0.968313] initcall 3_async_port_probe+0x0/0x95 returned 0 after
343051 usecs
<mark> (see below).
[ 0.968786] sd 1:0:0:0: [sda] 7815024 512-byte hardware sectors:
(4.00 GB/3.72 GiB)
[ 0.968964] sd 1:0:0:0: [sda] Write Protect is off
[ 0.969062] sd 1:0:0:0: [sda] Mode Sense: 00 3a 00 00
[ 0.969132] sd 1:0:0:0: [sda] Write cache: disabled, read cache:
enabled, doesn't support DPO or FUA
[ 0.969543] sda: sda1 sda2
[ 0.970965] sd 1:0:0:0: [sda] Attached SCSI disk
[ 0.971073] initcall 4_sd_probe_async+0x0/0x225 returned 0 after 2849
usecs

On a successful boot, the next lines are

[ 0.971188] async_continuing @ 1 after 2483 usec
[ 0.971305] Freeing unused kernel memory: 256k freed
[ 1.071724] calling ata_generic_init+0x0/0x19 [ata_generic] @ 574
[ 1.073798] initcall ata_generic_init+0x0/0x19 [ata_generic] returned
0 after 144 usecs
[ 1.183372] Clocksource tsc unstable (delta = -128600689 ns)
[ 2.035932] EXT4-fs: delayed allocation enabled

Also, on a successful boot, I see these additional lines at the point
<mark> above.

[ 0.968461] async_continuing @ 1 after 76663 usec
[ 0.968556] async_waiting @ 1

In fact, when the hang happens I can see no "async_waiting @ 1" on my
50-line screen. Which makes sense if the kernel init process is hung
in init_athk_pci().


Thanks
Alan

2009-04-28 09:19:32

by Alan Jenkins

[permalink] [raw]
Subject: [PATCH] [RFC] EEE PC hangs when booting off battery

I found a regression where my EEE hangs at boot time, if the battery is
present.

I'm confident it's a regression because it disappears if I revert
Arjan's asynchronous battery initialisation. However, the evidence
points to a deadlock in the wireless stack which has simply been
uncovered by timing changes.

If I leave the system long enough, I get a series of hung task
warnings. They suggest the following deadlock:

- ieee80211_wep_init(), which is called with rtnl_lock() held, is
blocked in request_module() [waiting for modprobe to load a crypto module].
- modprobe is blocked in a call to flush_workqueue(), caused by closing
a TTY.
- worker_thread is blocked because the workqueue item linkwatch_event()
is blocked on rtnl_lock.


I've hacked up a test patch to move wep_init() outside of rtnl_lock, and
it solved the problem. My one caveat is that it would probably be
cleaner to move it after rtnl_unlock(), instead of before rtnl_lock().
I just wasn't 100% sure if that would be safe. Here's the patch:

---8<---

diff --git a/net/mac80211/main.c b/net/mac80211/main.c
index fbcbed6..fffa7f9 100644
--- a/net/mac80211/main.c
+++ b/net/mac80211/main.c
@@ -909,6 +909,13 @@ int ieee80211_register_hw(struct ieee80211_hw *hw)
if (result < 0)
goto fail_sta_info;

+ result = ieee80211_wep_init(local);
+ if (result < 0) {
+ printk(KERN_DEBUG "%s: Failed to initialize wep: %d\n",
+ wiphy_name(local->hw.wiphy), result);
+ goto fail_wep;
+ }
+
rtnl_lock();
result = dev_alloc_name(local->mdev, local->mdev->name);
if (result < 0)
@@ -930,14 +937,6 @@ int ieee80211_register_hw(struct ieee80211_hw *hw)
goto fail_rate;
}

- result = ieee80211_wep_init(local);
-
- if (result < 0) {
- printk(KERN_DEBUG "%s: Failed to initialize wep: %d\n",
- wiphy_name(local->hw.wiphy), result);
- goto fail_wep;
- }
-
/* add one default STA interface if supported */
if (local->hw.wiphy->interface_modes & BIT(NL80211_IFTYPE_STATION)) {
result = ieee80211_if_add(local, "wlan%d", NULL,
@@ -967,13 +966,12 @@ int ieee80211_register_hw(struct ieee80211_hw *hw)

return 0;

-fail_wep:
- rate_control_deinitialize(local);
fail_rate:
unregister_netdevice(local->mdev);
local->mdev = NULL;
fail_dev:
rtnl_unlock();
+fail_wep:
sta_info_stop(local);
fail_sta_info:
debugfs_hw_del(local);

2009-04-28 09:58:28

by Johannes Berg

[permalink] [raw]
Subject: Re: [PATCH] [RFC] EEE PC hangs when booting off battery

It seems the only reason lockdep doesn't warn is the detour to userspace
(modprobe) and the waiting for a userspace task (waiting isn't lockdep
annotated and I don't think it can be)

It seems you cannot load a module while under _any_ lock, ever, since
userspace might turn around and do something that requires that lock? I
think we should probably add a warning to the code that waits for the
userspace task:

debug_check_no_locks_held(current);

Or not? Some purely internal locks _might_ be ok... but how could you
verify that?

> - ieee80211_wep_init(), which is called with rtnl_lock() held, is
> blocked in request_module() [waiting for modprobe to load a crypto module].

Right.

> - modprobe is blocked in a call to flush_workqueue(), caused by closing
> a TTY.

Can you point out where? I can't find that.

> - worker_thread is blocked because the workqueue item linkwatch_event()
> is blocked on rtnl_lock.

This I know about.

> I've hacked up a test patch to move wep_init() outside of rtnl_lock, and
> it solved the problem. My one caveat is that it would probably be
> cleaner to move it after rtnl_unlock(), instead of before rtnl_lock().
> I just wasn't 100% sure if that would be safe. Here's the patch:

That doesn't seem relevant, this just does some initialisation. However,
you definitely missed adding a call to wep_free().

johannes


Attachments:
signature.asc (836.00 B)
This is a digitally signed message part

2009-04-28 10:36:06

by Johannes Berg

[permalink] [raw]
Subject: Re: [PATCH] [RFC] EEE PC hangs when booting off battery

On Tue, 2009-04-28 at 11:27 +0100, Alan Jenkins wrote:

> > Can you point out where? I can't find that.

>
> I posted the calltraces earlier at
> <http://marc.info/?l=linux-wireless&m=124074566523506&w=2>.
>
> wait_for_completion
> flush_cpu_workqueue
> ? wq_barrier_func
> flush_workqueue
> flush_scheduled_work
> tty_ldisc_release

Ah, I was just grepping for the wrong thing :)

johannes


Attachments:
signature.asc (836.00 B)
This is a digitally signed message part

2009-04-28 10:40:40

by Alan Jenkins

[permalink] [raw]
Subject: Re: [PATCH] [RFC] EEE PC hangs when booting off battery

Johannes Berg wrote:
> It seems the only reason lockdep doesn't warn is the detour to userspace
> (modprobe) and the waiting for a userspace task (waiting isn't lockdep
> annotated and I don't think it can be)
>
> It seems you cannot load a module while under _any_ lock, ever, since
> userspace might turn around and do something that requires that lock? I
> think we should probably add a warning to the code that waits for the
> userspace task:
>
> debug_check_no_locks_held(current);
>
> Or not? Some purely internal locks _might_ be ok... but how could you
> verify that?
>
>
>> - ieee80211_wep_init(), which is called with rtnl_lock() held, is
>> blocked in request_module() [waiting for modprobe to load a crypto module].
>>
>
> Right.
>
>
>> - modprobe is blocked in a call to flush_workqueue(), caused by closing
>> a TTY.
>>
>
> Can you point out where? I can't find that.
>

I posted the calltraces earlier at
<http://marc.info/?l=linux-wireless&m=124074566523506&w=2>.

wait_for_completion
flush_cpu_workqueue
? wq_barrier_func
flush_workqueue
flush_scheduled_work
tty_ldisc_release
? tty_fasyc
tty_release_dev
? free_pgtables
tty_release
__fput
filp_close
sys_close
syscall_call


>> - worker_thread is blocked because the workqueue item linkwatch_event()
>> is blocked on rtnl_lock.
>>
>
> This I know about.
>
>
>> I've hacked up a test patch to move wep_init() outside of rtnl_lock, and
>> it solved the problem. My one caveat is that it would probably be
>> cleaner to move it after rtnl_unlock(), instead of before rtnl_lock().
>> I just wasn't 100% sure if that would be safe. Here's the patch:
>>
>
> That doesn't seem relevant, this just does some initialisation. However,
> you definitely missed adding a call to wep_free().
>

Hah, I should have realized something was wrong when I noticed I was
removing more lines that I added.

The crypto init does cause the module load:

wait_for_completion
call_usermodehelper_exec
__request_module
crypto_larval_lookup
? extract_entropy
crypto_alg_mod_lookup
crypto_alloc_base
ieee80211_wep_init
ieee80211_register_hw
? ath5k_hw_set_bss
ath5k+pci_probe
local_pci_probe
pci_device_probe
driver_probe_device
__driver_attach
bus_for_each_dev
driver_attach
? __driver_attach
buad_add_driver
driver_register
? ktime_get_ts
__pci_register_driver
init_ath5k_pci
_stext
? init_ath5k_pci
? proc_create_data
? register_ieq_proc
kernel_init


Thanks
Alan

2009-04-29 11:14:57

by Alan Jenkins

[permalink] [raw]
Subject: Re: [PATCH] [RFC] EEE PC hangs when booting off battery

Alan Jenkins wrote:
> Johannes Berg wrote:
>
>> That doesn't seem relevant, this just does some initialisation. However,
>> you definitely missed adding a call to wep_free().
>>
>>
>
> Hah, I should have realized something was wrong when I noticed I was
> removing more lines that I added.
>
> The crypto init does cause the module load:
>
> wait_for_completion
> call_usermodehelper_exec
> __request_module
> crypto_larval_lookup
> ? extract_entropy
> crypto_alg_mod_lookup
> crypto_alloc_base
> ieee80211_wep_init
> ieee80211_register_hw
>

Here's a corrected patch complete with changelog. If there are no other
problems with it, can you please apply this for 2.6.30 to keep my EeePC
regression-free?

Thanks
Alan

------>
>From c5e9dc036247e70956d1a28e8850c3810385dda0 Mon Sep 17 00:00:00 2001
From: Alan Jenkins <[email protected]>
Date: Wed, 29 Apr 2009 11:41:24 +0100
Subject: [PATCH] mac80211: fix modprobe deadlock by not calling wep_init under rtnl_lock

- ieee80211_wep_init(), which is called with rtnl_lock held, blocks in
request_module() [waiting for modprobe to load a crypto module].

- modprobe blocks in a call to flush_workqueue(), when it closes a TTY
[presumably when it exits].

- The workqueue item linkwatch_event() blocks on rtnl_lock.

There's no reason for wep_init() to be called with rtnl_lock held, so
just move it outside the critical section.

Signed-off-by: Alan Jenkins <[email protected]>
---
net/mac80211/main.c | 19 +++++++++----------
1 files changed, 9 insertions(+), 10 deletions(-)

diff --git a/net/mac80211/main.c b/net/mac80211/main.c
index fbcbed6..00968c2 100644
--- a/net/mac80211/main.c
+++ b/net/mac80211/main.c
@@ -909,6 +909,13 @@ int ieee80211_register_hw(struct ieee80211_hw *hw)
if (result < 0)
goto fail_sta_info;

+ result = ieee80211_wep_init(local);
+ if (result < 0) {
+ printk(KERN_DEBUG "%s: Failed to initialize wep: %d\n",
+ wiphy_name(local->hw.wiphy), result);
+ goto fail_wep;
+ }
+
rtnl_lock();
result = dev_alloc_name(local->mdev, local->mdev->name);
if (result < 0)
@@ -930,14 +937,6 @@ int ieee80211_register_hw(struct ieee80211_hw *hw)
goto fail_rate;
}

- result = ieee80211_wep_init(local);
-
- if (result < 0) {
- printk(KERN_DEBUG "%s: Failed to initialize wep: %d\n",
- wiphy_name(local->hw.wiphy), result);
- goto fail_wep;
- }
-
/* add one default STA interface if supported */
if (local->hw.wiphy->interface_modes & BIT(NL80211_IFTYPE_STATION)) {
result = ieee80211_if_add(local, "wlan%d", NULL,
@@ -967,13 +966,13 @@ int ieee80211_register_hw(struct ieee80211_hw *hw)

return 0;

-fail_wep:
- rate_control_deinitialize(local);
fail_rate:
unregister_netdevice(local->mdev);
local->mdev = NULL;
fail_dev:
rtnl_unlock();
+ ieee80211_wep_free(local);
+fail_wep:
sta_info_stop(local);
fail_sta_info:
debugfs_hw_del(local);
--
1.5.4.3



2009-04-29 11:21:26

by Johannes Berg

[permalink] [raw]
Subject: Re: [PATCH] [RFC] EEE PC hangs when booting off battery

On Wed, 2009-04-29 at 12:14 +0100, Alan Jenkins wrote:

> From c5e9dc036247e70956d1a28e8850c3810385dda0 Mon Sep 17 00:00:00 2001
> From: Alan Jenkins <[email protected]>
> Date: Wed, 29 Apr 2009 11:41:24 +0100
> Subject: [PATCH] mac80211: fix modprobe deadlock by not calling wep_init under rtnl_lock
>
> - ieee80211_wep_init(), which is called with rtnl_lock held, blocks in
> request_module() [waiting for modprobe to load a crypto module].
>
> - modprobe blocks in a call to flush_workqueue(), when it closes a TTY
> [presumably when it exits].
>
> - The workqueue item linkwatch_event() blocks on rtnl_lock.
>
> There's no reason for wep_init() to be called with rtnl_lock held, so
> just move it outside the critical section.

Looks correct to me.

> Signed-off-by: Alan Jenkins <[email protected]>

Reviewed-by: Johannes Berg <[email protected]>

> ---
> net/mac80211/main.c | 19 +++++++++----------
> 1 files changed, 9 insertions(+), 10 deletions(-)
>
> diff --git a/net/mac80211/main.c b/net/mac80211/main.c
> index fbcbed6..00968c2 100644
> --- a/net/mac80211/main.c
> +++ b/net/mac80211/main.c
> @@ -909,6 +909,13 @@ int ieee80211_register_hw(struct ieee80211_hw *hw)
> if (result < 0)
> goto fail_sta_info;
>
> + result = ieee80211_wep_init(local);
> + if (result < 0) {
> + printk(KERN_DEBUG "%s: Failed to initialize wep: %d\n",
> + wiphy_name(local->hw.wiphy), result);
> + goto fail_wep;
> + }
> +
> rtnl_lock();
> result = dev_alloc_name(local->mdev, local->mdev->name);
> if (result < 0)
> @@ -930,14 +937,6 @@ int ieee80211_register_hw(struct ieee80211_hw *hw)
> goto fail_rate;
> }
>
> - result = ieee80211_wep_init(local);
> -
> - if (result < 0) {
> - printk(KERN_DEBUG "%s: Failed to initialize wep: %d\n",
> - wiphy_name(local->hw.wiphy), result);
> - goto fail_wep;
> - }
> -
> /* add one default STA interface if supported */
> if (local->hw.wiphy->interface_modes & BIT(NL80211_IFTYPE_STATION)) {
> result = ieee80211_if_add(local, "wlan%d", NULL,
> @@ -967,13 +966,13 @@ int ieee80211_register_hw(struct ieee80211_hw *hw)
>
> return 0;
>
> -fail_wep:
> - rate_control_deinitialize(local);
> fail_rate:
> unregister_netdevice(local->mdev);
> local->mdev = NULL;
> fail_dev:
> rtnl_unlock();
> + ieee80211_wep_free(local);
> +fail_wep:
> sta_info_stop(local);
> fail_sta_info:
> debugfs_hw_del(local);


Attachments:
signature.asc (836.00 B)
This is a digitally signed message part