2008-08-21 15:05:01

by Vegard Nossum

[permalink] [raw]
Subject: latest -git: hibernate: possible circular locking dependency detected

Hi,

I just got this on v2.6.27-rc4 (+ unrelated fix):

ACPI: Preparing to enter system sleep state S5
=======================================================
[ INFO: possible circular locking dependency detected ]
2.6.27-rc4-00003-ga798564 #28
-------------------------------------------------------
events/0/10 is trying to acquire lock:
(cpu_add_remove_lock){--..}, at: [<c013bd9f>] cpu_maps_update_begin+0xf/0x20
but task is already holding lock:
(poweroff_work){--..}, at: [<c014ae17>] run_workqueue+0x107/0x200
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #2 (poweroff_work){--..}:
[<c015f9e6>] validate_chain+0x976/0xe90
[<c0160159>] __lock_acquire+0x259/0xa00
[<c0160989>] lock_acquire+0x89/0xc0
[<c014ae75>] run_workqueue+0x165/0x200
[<c014b9bd>] worker_thread+0x7d/0xe0
[<c014e252>] kthread+0x42/0x70
[<c0105cf3>] kernel_thread_helper+0x7/0x14
[<ffffffff>] 0xffffffff
-> #1 (events){--..}:
[<c015f9e6>] validate_chain+0x976/0xe90
[<c0160159>] __lock_acquire+0x259/0xa00
[<c0160989>] lock_acquire+0x89/0xc0
[<c014b09e>] cleanup_workqueue_thread+0x3e/0x70
[<c0663bfc>] workqueue_cpu_callback+0x7c/0x130
[<c0152f97>] notifier_call_chain+0x37/0x70
[<c0153009>] __raw_notifier_call_chain+0x19/0x20
[<c015302a>] raw_notifier_call_chain+0x1a/0x20
[<c0661e9a>] _cpu_down+0x1ea/0x270
[<c013bf18>] disable_nonboot_cpus+0x58/0xe0
[<c0168897>] hibernation_snapshot+0x117/0x220
[<c0168a80>] hibernate+0xe0/0x180
[<c01675ef>] state_store+0xbf/0xd0
[<c0375f84>] kobj_attr_store+0x24/0x30
[<c01fa4c2>] sysfs_write_file+0xa2/0x100
[<c01bbf96>] vfs_write+0x96/0x130
[<c01bc4dd>] sys_write+0x3d/0x70
[<c0104f3b>] sysenter_do_call+0x12/0x3f
[<ffffffff>] 0xffffffff
-> #0 (cpu_add_remove_lock){--..}:
[<c015f5ba>] validate_chain+0x54a/0xe90
[<c0160159>] __lock_acquire+0x259/0xa00
[<c0160989>] lock_acquire+0x89/0xc0
[<c06859cb>] mutex_lock_nested+0xab/0x300
[<c013bd9f>] cpu_maps_update_begin+0xf/0x20
[<c013bed3>] disable_nonboot_cpus+0x13/0xe0
[<c0148990>] kernel_power_off+0x20/0x40
[<c016c0a8>] do_poweroff+0x8/0x10
[<c014ae7a>] run_workqueue+0x16a/0x200
[<c014b9bd>] worker_thread+0x7d/0xe0
[<c014e252>] kthread+0x42/0x70
[<c0105cf3>] kernel_thread_helper+0x7/0x14
[<ffffffff>] 0xffffffff
other info that might help us debug this:
2 locks held by events/0/10:
#0: (events){--..}, at: [<c014ae17>] run_workqueue+0x107/0x200
#1: (poweroff_work){--..}, at: [<c014ae17>] run_workqueue+0x107/0x200
stack backtrace:
Pid: 10, comm: events/0 Not tainted 2.6.27-rc4-00003-ga798564 #28
[<c015f029>] print_circular_bug_tail+0x79/0xc0
[<c015c193>] ? print_circular_bug_entry+0x43/0x50
[<c015f5ba>] validate_chain+0x54a/0xe90
[<c015b565>] ? add_lock_to_list+0x45/0xb0
[<c039b7dd>] ? acpi_os_acquire_lock+0x8/0xa
[<c0160159>] __lock_acquire+0x259/0xa00
[<c015b67b>] ? trace_hardirqs_off+0xb/0x10
[<c0160989>] lock_acquire+0x89/0xc0
[<c013bd9f>] ? cpu_maps_update_begin+0xf/0x20
[<c06859cb>] mutex_lock_nested+0xab/0x300
[<c013bd9f>] ? cpu_maps_update_begin+0xf/0x20
[<c013bd9f>] ? cpu_maps_update_begin+0xf/0x20
[<c013bd9f>] cpu_maps_update_begin+0xf/0x20
[<c013bed3>] disable_nonboot_cpus+0x13/0xe0
[<c0148990>] kernel_power_off+0x20/0x40
[<c016c0a8>] do_poweroff+0x8/0x10
[<c014ae7a>] run_workqueue+0x16a/0x200
[<c014ae17>] ? run_workqueue+0x107/0x200
[<c016c0a0>] ? do_poweroff+0x0/0x10
[<c014b9bd>] worker_thread+0x7d/0xe0
[<c014e560>] ? autoremove_wake_function+0x0/0x50
[<c014b940>] ? worker_thread+0x0/0xe0
[<c014e252>] kthread+0x42/0x70
[<c014e210>] ? kthread+0x0/0x70
[<c0105cf3>] kernel_thread_helper+0x7/0x14
=======================
Disabling non-boot CPUs ...
Unmapping cpu 1 from all nodes
Unmapping cpu 1 from all nodes
CPU 1 is now offline
lockdep: fixing up alternatives.
SMP alternatives: switching to UP code
CPU0 attaching NULL sched-domain.
CPU1 attaching NULL sched-domain.
CPU0 attaching sched-domain:
domain 0: span 0 level CPU
groups: 0
CPU1 is down
Power down.
acpi_power_off called
hwsleep-0326 [00] enter_sleep_state : Entering sleep state [S5]

...the machine would shut down, but not resume:

Trying to resume from /dev/VolGroup00/LogVol01
No suspend signature on swap, not resuming.
Creating root device.
Mounting root filesystem.

I previously also saw some SLUB errors on resuming (but those didn't
make it to the serial console, unfortunately); will post follow-up if
I can manage to get a capture.

I don't know if it's related, but I was reading some files from /sys
while doing the "echo disk > /sys/power/state".


Vegard

--
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
-- E. W. Dijkstra, EWD1036


2008-08-21 17:38:50

by Dmitry Adamushko

[permalink] [raw]
Subject: Re: latest -git: hibernate: possible circular locking dependency detected

Hi,

[ cc: Peter and Oleg ]

heh, my mind might have been also 'hibernated' by the everning but I
still dare to speculate :-)


=======================================================
> [ INFO: possible circular locking dependency detected ]
> 2.6.27-rc4-00003-ga798564 #28
> -------------------------------------------------------
> events/0/10 is trying to acquire lock:
> (cpu_add_remove_lock){--..}, at: [<c013bd9f>] cpu_maps_update_begin+0xf/0x20
> but task is already holding lock:
> (poweroff_work){--..}, at: [<c014ae17>] run_workqueue+0x107/0x200
> which lock already depends on the new lock.
> the existing dependency chain (in reverse order) is:
> -> #2 (poweroff_work){--..}:
> [<c015f9e6>] validate_chain+0x976/0xe90
> [<c0160159>] __lock_acquire+0x259/0xa00
> [<c0160989>] lock_acquire+0x89/0xc0
> [<c014ae75>] run_workqueue+0x165/0x200
> [<c014b9bd>] worker_thread+0x7d/0xe0
> [<c014e252>] kthread+0x42/0x70
> [<c0105cf3>] kernel_thread_helper+0x7/0x14
> [<ffffffff>] 0xffffffff
> -> #1 (events){--..}:
> [<c015f9e6>] validate_chain+0x976/0xe90
> [<c0160159>] __lock_acquire+0x259/0xa00
> [<c0160989>] lock_acquire+0x89/0xc0
> [<c014b09e>] cleanup_workqueue_thread+0x3e/0x70
> [<c0663bfc>] workqueue_cpu_callback+0x7c/0x130
> [<c0152f97>] notifier_call_chain+0x37/0x70
> [<c0153009>] __raw_notifier_call_chain+0x19/0x20
> [<c015302a>] raw_notifier_call_chain+0x1a/0x20
> [<c0661e9a>] _cpu_down+0x1ea/0x270
> [<c013bf18>] disable_nonboot_cpus+0x58/0xe0
> [<c0168897>] hibernation_snapshot+0x117/0x220
> [<c0168a80>] hibernate+0xe0/0x180
> [<c01675ef>] state_store+0xbf/0xd0
> [<c0375f84>] kobj_attr_store+0x24/0x30
> [<c01fa4c2>] sysfs_write_file+0xa2/0x100
> [<c01bbf96>] vfs_write+0x96/0x130
> [<c01bc4dd>] sys_write+0x3d/0x70
> [<c0104f3b>] sysenter_do_call+0x12/0x3f

this path is triggered as a result of "echo disk > /sys/power/state"

disable_nonboot_cpus() calls cpu_maps_update_being() which takes
"cpu_add_remove_lock" (lock-1).

If we go down the road cleanup_workqueue_thread() ->
flush_cpu_workqueue() will take "cwq->lock" (lock-2).
So this should be the second lock.


> [<ffffffff>] 0xffffffff
> -> #0 (cpu_add_remove_lock){--..}:
> [<c015f5ba>] validate_chain+0x54a/0xe90
> [<c0160159>] __lock_acquire+0x259/0xa00
> [<c0160989>] lock_acquire+0x89/0xc0
> [<c06859cb>] mutex_lock_nested+0xab/0x300
> [<c013bd9f>] cpu_maps_update_begin+0xf/0x20
> [<c013bed3>] disable_nonboot_cpus+0x13/0xe0
> [<c0148990>] kernel_power_off+0x20/0x40
> [<c016c0a8>] do_poweroff+0x8/0x10
> [<c014ae7a>] run_workqueue+0x16a/0x200
> [<c014b9bd>] worker_thread+0x7d/0xe0
> [<c014e252>] kthread+0x42/0x70
> [<c0105cf3>] kernel_thread_helper+0x7/0x14

hmm, did you somehow hit "Sysrq + o"?

'cause I don't see any other places (say, with handle_sysrq(k,...)
where "k" migth be 'o') from where do_power_off() might have been
triggered...

however, I think there are 2 problems with handle_poweroff()
[ kernel/power/poweroff.c ]

(1) it doesn't ensure that the 'cpu' it gets via
first_cpu(cpu_online_map) can't disappear (race with cpu_down()) on
the way to schedule_work_on()

[ I pressume, neither generic sysrq nor console layer takes care of
it. They shoudn't of course ]

(2) run_workqueue() [ which in the end calls do_poweroff() ] takes the
"cwq->lock" (which is lock-2 in our terminology)

well, actually it release it before calling "work->fun()" but is the
'lockdep' annotation right here? Peter?

(I admit, I never looked at lockdep and do make assumptions on its syntax here).

The lock-1 will be taken as a result of

then, do_poweroff() -> kernel_power_off() -> disable_nonboot_cpus()

which calls cpu_maps_update_begin() and takes "cpu_add_remove_lock"

and this looks dangerous. Due to the same reason as was before with
the use of get_online_cpus() by workqueue handlers before
CPU_POST_DEAD introduction
(http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=3da1c84c00c7e5fa8348336bd8c342f9128b0f14)

I guess, it may deadlock as the lock-1 has been already taken before
calling cleanup_workqueue_thread() -> flush_cpu_workqueue() and
completion of the former chain depends in turn on being able to
acquire the very same lock.

hm?


> [<ffffffff>] 0xffffffff
> other info that might help us debug this:
> 2 locks held by events/0/10:
> #0: (events){--..}, at: [<c014ae17>] run_workqueue+0x107/0x200
> #1: (poweroff_work){--..}, at: [<c014ae17>] run_workqueue+0x107/0x200
> stack backtrace:
> Pid: 10, comm: events/0 Not tainted 2.6.27-rc4-00003-ga798564 #28
> [<c015f029>] print_circular_bug_tail+0x79/0xc0
> [<c015c193>] ? print_circular_bug_entry+0x43/0x50
> [<c015f5ba>] validate_chain+0x54a/0xe90
> [<c015b565>] ? add_lock_to_list+0x45/0xb0
> [<c039b7dd>] ? acpi_os_acquire_lock+0x8/0xa
> [<c0160159>] __lock_acquire+0x259/0xa00
> [<c015b67b>] ? trace_hardirqs_off+0xb/0x10
> [<c0160989>] lock_acquire+0x89/0xc0
> [<c013bd9f>] ? cpu_maps_update_begin+0xf/0x20
> [<c06859cb>] mutex_lock_nested+0xab/0x300
> [<c013bd9f>] ? cpu_maps_update_begin+0xf/0x20
> [<c013bd9f>] ? cpu_maps_update_begin+0xf/0x20
> [<c013bd9f>] cpu_maps_update_begin+0xf/0x20
> [<c013bed3>] disable_nonboot_cpus+0x13/0xe0
> [<c0148990>] kernel_power_off+0x20/0x40
> [<c016c0a8>] do_poweroff+0x8/0x10
> [<c014ae7a>] run_workqueue+0x16a/0x200
> [<c014ae17>] ? run_workqueue+0x107/0x200
> [<c016c0a0>] ? do_poweroff+0x0/0x10
> [<c014b9bd>] worker_thread+0x7d/0xe0
> [<c014e560>] ? autoremove_wake_function+0x0/0x50
> [<c014b940>] ? worker_thread+0x0/0xe0
> [<c014e252>] kthread+0x42/0x70
> [<c014e210>] ? kthread+0x0/0x70
> [<c0105cf3>] kernel_thread_helper+0x7/0x14
> =======================
> Disabling non-boot CPUs ...
> Unmapping cpu 1 from all nodes
> Unmapping cpu 1 from all nodes
> CPU 1 is now offline
> lockdep: fixing up alternatives.
> SMP alternatives: switching to UP code
> CPU0 attaching NULL sched-domain.
> CPU1 attaching NULL sched-domain.
> CPU0 attaching sched-domain:
> domain 0: span 0 level CPU
> groups: 0
> CPU1 is down
> Power down.
> acpi_power_off called
> hwsleep-0326 [00] enter_sleep_state : Entering sleep state [S5]
>
> ...the machine would shut down, but not resume:
>
> Trying to resume from /dev/VolGroup00/LogVol01
> No suspend signature on swap, not resuming.
> Creating root device.
> Mounting root filesystem.
>
> I previously also saw some SLUB errors on resuming (but those didn't
> make it to the serial console, unfortunately); will post follow-up if
> I can manage to get a capture.
>
> I don't know if it's related, but I was reading some files from /sys
> while doing the "echo disk > /sys/power/state".
>
>
> Vegard
>
> --
> "The animistic metaphor of the bug that maliciously sneaked in while
> the programmer was not looking is intellectually dishonest as it
> disguises that the error is the programmer's own creation."
> -- E. W. Dijkstra, EWD1036
>



--
Best regards,
Dmitry Adamushko

2008-08-21 17:59:22

by Vegard Nossum

[permalink] [raw]
Subject: Re: latest -git: hibernate: possible circular locking dependency detected

On Thu, Aug 21, 2008 at 7:38 PM, Dmitry Adamushko
<[email protected]> wrote:
>> [ INFO: possible circular locking dependency detected ]
>> 2.6.27-rc4-00003-ga798564 #28

[...]

> this path is triggered as a result of "echo disk > /sys/power/state"

Yes. That is what I typed.

>
> disable_nonboot_cpus() calls cpu_maps_update_being() which takes
> "cpu_add_remove_lock" (lock-1).
>
> If we go down the road cleanup_workqueue_thread() ->
> flush_cpu_workqueue() will take "cwq->lock" (lock-2).
> So this should be the second lock.
>
>
[...]

> hmm, did you somehow hit "Sysrq + o"?
>
> 'cause I don't see any other places (say, with handle_sysrq(k,...)
> where "k" migth be 'o') from where do_power_off() might have been
> triggered...
>

No. But in my logs I often saw SysRq triggered, even though I didn't
do it any of these times:

log-20080821-104053.txt:SysRq : SysRq : Show State
log-20080821-105541.txt:SysRq : Emergency Sync
log-20080821-105541.txt:SysRq : Emergency Sync
log-20080821-105541.txt:SysRq : Power Off
log-20080821-110514.txt:SysRq : Terminate All Tasks
log-20080821-111650.txt:SysRq : HELP : loglevel0-8 <6>serial 00:0d: activated
log-20080821-111650.txt:SysRq : Power Off
log-20080821-111650.txt:SysRq : Terminate All Tasks
log-20080821-111650.txt:SysRq : SysRq : <6>serial 00:0d: activated
log-20080821-120628.txt:SysRq : SysRq : HELP : HELP : <6>serial 00:0d: activated
log-20080821-120628.txt:SysRq : SysRq : HELP : HELP : <6>serial 00:0d: activated

(And it seems to pick a random letter too. It even showed the "HELP:"
line at one point!)

I have no idea why this happens. Is it possible to trigger SysRq by
writing data on the other end of the serial console? (Maybe my cable
is bad or something, but data seems to be going in both directions
over my serial console.)

The rest I will not comment on. It also seems unlikely that I will be
able to test any patches reliably, simply because I see a few
different kinds of issues on resume :-(

Thanks!


Vegard


--
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
-- E. W. Dijkstra, EWD1036

2008-08-21 17:59:39

by Oleg Nesterov

[permalink] [raw]
Subject: Re: latest -git: hibernate: possible circular locking dependency detected

On 08/21, Dmitry Adamushko wrote:
>
> however, I think there are 2 problems with handle_poweroff()
> [ kernel/power/poweroff.c ]
>
> (1) it doesn't ensure that the 'cpu' it gets via
> first_cpu(cpu_online_map) can't disappear (race with cpu_down()) on
> the way to schedule_work_on()
>
> [ I pressume, neither generic sysrq nor console layer takes care of
> it. They shoudn't of course ]
>
> (2) run_workqueue() [ which in the end calls do_poweroff() ] takes the
> "cwq->lock" (which is lock-2 in our terminology)
>
> well, actually it release it before calling "work->fun()" but is the
> 'lockdep' annotation right here? Peter?
>
> (I admit, I never looked at lockdep and do make assumptions on its syntax here).
>
> The lock-1 will be taken as a result of
>
> then, do_poweroff() -> kernel_power_off() -> disable_nonboot_cpus()
>
> which calls cpu_maps_update_begin() and takes "cpu_add_remove_lock"
>
> and this looks dangerous. Due to the same reason as was before with
> the use of get_online_cpus() by workqueue handlers before
> CPU_POST_DEAD introduction
> (http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=3da1c84c00c7e5fa8348336bd8c342f9128b0f14)
>
> I guess, it may deadlock as the lock-1 has been already taken before
> calling cleanup_workqueue_thread() -> flush_cpu_workqueue() and
> completion of the former chain depends in turn on being able to
> acquire the very same lock.

I apologize in advance if I missed something else in your message,
but I think you are very right. Please look at

http://marc.info/?t=121580236300019

Oleg.

2008-08-21 18:18:58

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: latest -git: hibernate: possible circular locking dependency detected

On Thursday, 21 of August 2008, Vegard Nossum wrote:
> Hi,
>
> I just got this on v2.6.27-rc4 (+ unrelated fix):
>
> ACPI: Preparing to enter system sleep state S5

This looks fishy.

Please check what's there in /sys/power/disk .

Also, can you please post the full dmesg containing those traces?

Rafael

2008-08-21 18:31:59

by Vegard Nossum

[permalink] [raw]
Subject: Re: latest -git: hibernate: possible circular locking dependency detected

On Thu, Aug 21, 2008 at 8:22 PM, Rafael J. Wysocki <[email protected]> wrote:
> On Thursday, 21 of August 2008, Vegard Nossum wrote:
>> Hi,
>>
>> I just got this on v2.6.27-rc4 (+ unrelated fix):
>>
>> ACPI: Preparing to enter system sleep state S5
>
> This looks fishy.
>
> Please check what's there in /sys/power/disk .
>
> Also, can you please post the full dmesg containing those traces?

Yep, here is:

http://userweb.kernel.org/~vegard/bugs/20080821-acpi/dmesg.txt

There seems to be a few different ACPI errors scattered throughout the log.


Vegard

--
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
-- E. W. Dijkstra, EWD1036

2008-08-21 18:39:37

by Vegard Nossum

[permalink] [raw]
Subject: Re: latest -git: hibernate: possible circular locking dependency detected

On Thu, Aug 21, 2008 at 8:31 PM, Vegard Nossum <[email protected]> wrote:
> On Thu, Aug 21, 2008 at 8:22 PM, Rafael J. Wysocki <[email protected]> wrote:
>> On Thursday, 21 of August 2008, Vegard Nossum wrote:
>>>
>>> I just got this on v2.6.27-rc4 (+ unrelated fix):
>>>
>>> ACPI: Preparing to enter system sleep state S5
>>
>> This looks fishy.
>>
>> Please check what's there in /sys/power/disk .
>>

Sorry, forgot this one:

# cat /sys/power/disk
test testproc [shutdown] reboot


Vegard

--
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
-- E. W. Dijkstra, EWD1036

2008-08-21 19:04:30

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: latest -git: hibernate: possible circular locking dependency detected

On Thursday, 21 of August 2008, Vegard Nossum wrote:
> On Thu, Aug 21, 2008 at 8:31 PM, Vegard Nossum <[email protected]> wrote:
> > On Thu, Aug 21, 2008 at 8:22 PM, Rafael J. Wysocki <[email protected]> wrote:
> >> On Thursday, 21 of August 2008, Vegard Nossum wrote:
> >>>
> >>> I just got this on v2.6.27-rc4 (+ unrelated fix):
> >>>
> >>> ACPI: Preparing to enter system sleep state S5
> >>
> >> This looks fishy.
> >>
> >> Please check what's there in /sys/power/disk .
> >>
>
> Sorry, forgot this one:
>
> # cat /sys/power/disk
> test testproc [shutdown] reboot

Hm, I wonder why 'platform' isn't there.

Perhaps the dmesg log will tell.

[/me looks]

No, it doesn't. The BIOS evidently supports S4.

[/me looks into the source code]

I have no idea whatsoever how it is possible to have

"ACPI: (supports S0 S1 S3 S4 S5)"

in the log and not to have "platform" in the output of "cat /sys/power/disk".

Can you please add

printk(KERN_CRIT "PM: hibernation_ops = %p\n", hibernation_ops);

at the end of kernel/power/disk.c:hibernation_set_ops() and see if it gets
printed during boot?

Thanks,
Rafael

2008-08-21 19:14:18

by Vegard Nossum

[permalink] [raw]
Subject: Re: latest -git: hibernate: possible circular locking dependency detected

On Thu, Aug 21, 2008 at 9:07 PM, Rafael J. Wysocki <[email protected]> wrote:
> On Thursday, 21 of August 2008, Vegard Nossum wrote:
>> On Thu, Aug 21, 2008 at 8:31 PM, Vegard Nossum <[email protected]> wrote:
>> > On Thu, Aug 21, 2008 at 8:22 PM, Rafael J. Wysocki <[email protected]> wrote:
>> >> On Thursday, 21 of August 2008, Vegard Nossum wrote:
>> >>>
>> >>> I just got this on v2.6.27-rc4 (+ unrelated fix):
>> >>>
>> >>> ACPI: Preparing to enter system sleep state S5
>> >>
>> >> This looks fishy.
>> >>
>> >> Please check what's there in /sys/power/disk .
>> >>
>>
>> Sorry, forgot this one:
>>
>> # cat /sys/power/disk
>> test testproc [shutdown] reboot
>
> Hm, I wonder why 'platform' isn't there.
>
> Perhaps the dmesg log will tell.
>
> [/me looks]
>
> No, it doesn't. The BIOS evidently supports S4.
>
> [/me looks into the source code]
>
> I have no idea whatsoever how it is possible to have
>
> "ACPI: (supports S0 S1 S3 S4 S5)"
>
> in the log and not to have "platform" in the output of "cat /sys/power/disk".

I'm sorry, I made a stupid mistake: That file was taken from the
system when it was booted with acpi=off. The boot log had acpi
enabled, of course. Here is the same file (now booted without
acpi=off):

# cat /sys/power/disk
[platform] test testproc shutdown reboot

> Can you please add
>
> printk(KERN_CRIT "PM: hibernation_ops = %p\n", hibernation_ops);
>
> at the end of kernel/power/disk.c:hibernation_set_ops() and see if it gets
> printed during boot?

I guess this is now not necessary anymore?


Vegard

--
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
-- E. W. Dijkstra, EWD1036

2008-08-21 19:28:18

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: latest -git: hibernate: possible circular locking dependency detected

On Thursday, 21 of August 2008, Vegard Nossum wrote:
> On Thu, Aug 21, 2008 at 9:07 PM, Rafael J. Wysocki <[email protected]> wrote:
> > On Thursday, 21 of August 2008, Vegard Nossum wrote:
> >> On Thu, Aug 21, 2008 at 8:31 PM, Vegard Nossum <[email protected]> wrote:
> >> > On Thu, Aug 21, 2008 at 8:22 PM, Rafael J. Wysocki <[email protected]> wrote:
> >> >> On Thursday, 21 of August 2008, Vegard Nossum wrote:
> >> >>>
> >> >>> I just got this on v2.6.27-rc4 (+ unrelated fix):
> >> >>>
> >> >>> ACPI: Preparing to enter system sleep state S5
> >> >>
> >> >> This looks fishy.
> >> >>
> >> >> Please check what's there in /sys/power/disk .
> >> >>
> >>
> >> Sorry, forgot this one:
> >>
> >> # cat /sys/power/disk
> >> test testproc [shutdown] reboot
> >
> > Hm, I wonder why 'platform' isn't there.
> >
> > Perhaps the dmesg log will tell.
> >
> > [/me looks]
> >
> > No, it doesn't. The BIOS evidently supports S4.
> >
> > [/me looks into the source code]
> >
> > I have no idea whatsoever how it is possible to have
> >
> > "ACPI: (supports S0 S1 S3 S4 S5)"
> >
> > in the log and not to have "platform" in the output of "cat /sys/power/disk".
>
> I'm sorry, I made a stupid mistake: That file was taken from the
> system when it was booted with acpi=off. The boot log had acpi
> enabled, of course. Here is the same file (now booted without
> acpi=off):
>
> # cat /sys/power/disk
> [platform] test testproc shutdown reboot
>
> > Can you please add
> >
> > printk(KERN_CRIT "PM: hibernation_ops = %p\n", hibernation_ops);
> >
> > at the end of kernel/power/disk.c:hibernation_set_ops() and see if it gets
> > printed during boot?
>
> I guess this is now not necessary anymore?

No, it's not.

Now, what is the source of the "SysRq : Power Off" in the dmesg you posted?

Also, have you tried to hibernate with the "platform" setting selected in
"cat /sys/power/disk" and if you have, does the lockdep warning appear in this
case?

Rafael

2008-08-21 19:41:36

by Vegard Nossum

[permalink] [raw]
Subject: Re: latest -git: hibernate: possible circular locking dependency detected

On Thu, Aug 21, 2008 at 9:31 PM, Rafael J. Wysocki <[email protected]> wrote:
> Now, what is the source of the "SysRq : Power Off" in the dmesg you posted?

The only suggestion I can make is that SysRq can be triggered remotely
with serial console (COM1/ttyS0), because I know that the serial line
is producing junk in the other direction (i.e. reading from /dev/ttyS0
on the kernel which has been booted with console=ttyS0 reads junk,
sometimes things that look like pieces of dmesg). Other than that, I
have no idea. The SysRq thing seems to only happen after I've entered
"echo disk > /sys/power/state" too.

(Also see my email at http://lkml.org/lkml/2008/8/21/409)

>
> Also, have you tried to hibernate with the "platform" setting selected in
> "cat /sys/power/disk" and if you have, does the lockdep warning appear in this
> case?

Yes, I have tried it. In fact, I believe the lockdep warning in the
boot-log comes with such a setting (ACPI was enabled; isn't platform
the default in that case?). At least it looked to be selected when
acpi is enabled:

> # cat /sys/power/disk
> [platform] test testproc shutdown reboot


Vegard

--
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
-- E. W. Dijkstra, EWD1036

2008-08-21 19:51:10

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: latest -git: hibernate: possible circular locking dependency detected

On Thursday, 21 of August 2008, Vegard Nossum wrote:
> On Thu, Aug 21, 2008 at 9:31 PM, Rafael J. Wysocki <[email protected]> wrote:
> > Now, what is the source of the "SysRq : Power Off" in the dmesg you posted?
>
> The only suggestion I can make is that SysRq can be triggered remotely
> with serial console (COM1/ttyS0), because I know that the serial line
> is producing junk in the other direction (i.e. reading from /dev/ttyS0
> on the kernel which has been booted with console=ttyS0 reads junk,
> sometimes things that look like pieces of dmesg). Other than that, I
> have no idea. The SysRq thing seems to only happen after I've entered
> "echo disk > /sys/power/state" too.
>
> (Also see my email at http://lkml.org/lkml/2008/8/21/409)
>
> >
> > Also, have you tried to hibernate with the "platform" setting selected in
> > "cat /sys/power/disk" and if you have, does the lockdep warning appear in this
> > case?
>
> Yes, I have tried it. In fact, I believe the lockdep warning in the
> boot-log comes with such a setting (ACPI was enabled; isn't platform
> the default in that case?). At least it looked to be selected when
> acpi is enabled:

Yes, it is the default in that case.

> > # cat /sys/power/disk
> > [platform] test testproc shutdown reboot

The problem is, in the case of the "platform" hibernation, kernel_power_off()
is not called at all (ACPI mechanism to enter the S4 sleep state is used
instead). Apart from this, there should be lines like:

"ACPI: Preparing to enter system sleep state S4"

in the dmesg log in that case and they are not present in your dmesg.

Your dmesg looks like a mixture of hibernation in the "shutdown" mode with
"emergency" shutdown caused by "SysRq : Power Off".

Thanks,
Rafael

2008-08-22 08:55:57

by Pavel Machek

[permalink] [raw]
Subject: Re: latest -git: hibernate: possible circular locking dependency detected

Hi!


> I just got this on v2.6.27-rc4 (+ unrelated fix):

Did suspend ever work for you on this machine?
Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2008-08-22 09:18:37

by Pavel Machek

[permalink] [raw]
Subject: Re: latest -git: hibernate: possible circular locking dependency detected

Hi!

> > this path is triggered as a result of "echo disk > /sys/power/state"
>
> Yes. That is what I typed.
>
> >
> > disable_nonboot_cpus() calls cpu_maps_update_being() which takes
> > "cpu_add_remove_lock" (lock-1).
> >
> > If we go down the road cleanup_workqueue_thread() ->
> > flush_cpu_workqueue() will take "cwq->lock" (lock-2).
> > So this should be the second lock.
> >
> >
> [...]
>
> > hmm, did you somehow hit "Sysrq + o"?
> >
> > 'cause I don't see any other places (say, with handle_sysrq(k,...)
> > where "k" migth be 'o') from where do_power_off() might have been
> > triggered...
> >
>
> No. But in my logs I often saw SysRq triggered, even though I didn't
> do it any of these times:
>
> log-20080821-104053.txt:SysRq : SysRq : Show State
> log-20080821-105541.txt:SysRq : Emergency Sync
> log-20080821-105541.txt:SysRq : Emergency Sync
> log-20080821-105541.txt:SysRq : Power Off
> log-20080821-110514.txt:SysRq : Terminate All Tasks
> log-20080821-111650.txt:SysRq : HELP : loglevel0-8 <6>serial 00:0d: activated
> log-20080821-111650.txt:SysRq : Power Off
> log-20080821-111650.txt:SysRq : Terminate All Tasks
> log-20080821-111650.txt:SysRq : SysRq : <6>serial 00:0d: activated
> log-20080821-120628.txt:SysRq : SysRq : HELP : HELP : <6>serial 00:0d: activated
> log-20080821-120628.txt:SysRq : SysRq : HELP : HELP : <6>serial 00:0d: activated
>
> (And it seems to pick a random letter too. It even showed the "HELP:"
> line at one point!)
>
> I have no idea why this happens. Is it possible to trigger SysRq by
> writing data on the other end of the serial console? (Maybe my cable
> is bad or something, but data seems to be going in both directions
> over my serial console.)

Serial break translates to sysrq, IIRC.

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2008-08-22 19:09:26

by Vegard Nossum

[permalink] [raw]
Subject: Re: latest -git: hibernate: possible circular locking dependency detected

On Fri, Aug 22, 2008 at 10:55 AM, Pavel Machek <[email protected]> wrote:
> Hi!
>
>
>> I just got this on v2.6.27-rc4 (+ unrelated fix):
>
> Did suspend ever work for you on this machine?

Yes, 2.6.20.9 works fine :-P

I am checking out your suggestions of IDE driver changes now.

(Next e-mail:)
>> I have no idea why this happens. Is it possible to trigger SysRq by
>> writing data on the other end of the serial console? (Maybe my cable
>> is bad or something, but data seems to be going in both directions
>> over my serial console.)
>
> Serial break translates to sysrq, IIRC.

Thanks, that's useful info.

I also figured out something else; apparently the ttys are set to
character echo mode in either or both directions. I guess this is why
I also get these "ttyS0: 3 input overrun(s)" messages all the time. Am
I the only one who experienced this? I'll experiment with stty.


Vegard

--
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
-- E. W. Dijkstra, EWD1036