"Rafael J. Wysocki" <[email protected]> writes:
> On Tuesday, December 24, 2013 07:11:00 AM Viresh Kumar wrote:
>> __cpufreq_add_dev() can fail sometimes while we are resuming our system.
>> Currently we are clearing all sysfs nodes for cpufreq's failed policy as that
>> could make userspace unstable. But if we suspend/resume again, we should at least
>> try to bring back those policies.
>>
>> This patch fixes this issue by clearing fallback data on failure and trying to
>> allocate a new struct cpufreq_policy on second resume.
>>
>> Reported-and-tested-by: Bjørn Mork <[email protected]>
>> Signed-off-by: Viresh Kumar <[email protected]>
>
> Well, while I appreciate the work done here, I don't like the changelog,
> I don't really like the way the code is written and I don't like the comments.
> Sorry about that.
>
> Bjorn, can you please test the patch below instead along with the [2/2]
> from this series (on top of linux-pm.git/pm-cpufreq)?
I tested this series with your modified 1/2 today, but on top of a
current mainline (commit 9a0bb2966ef) instead of linux-pm.git/pm-cpufreq.
Shouldn't make much difference, since Linus has already pulled your
'pm+acpi-3.13-rc6' tag, which included that pm-cpufreq branch.
This series fixes the reported bug for me.
But I observed this, most likely unrelated, splat during the test:
ACPI Exception: AE_BAD_PARAMETER, Returned by Handler for [EmbeddedControl] (20131115/evregion-282)
ACPI Error: Method parse/execution failed [\_SB_.PCI0.LPC_.EC__.LPMD] (Node ffff880232499c18), AE_BAD_PARAMETER (20131115/psparse-536)
ACPI Error: Method parse/execution failed [\_PR_.CPU0._PPC] (Node ffff8802326f93d0), AE_BAD_PARAMETER (20131115/psparse-536)
ACPI Error: Method parse/execution failed [\_PR_.CPU1._PPC] (Node ffff8802326f9268), AE_BAD_PARAMETER (20131115/psparse-536)
ACPI Exception: AE_BAD_PARAMETER, Evaluating _PPC (20131115/processor_perflib-140)
======================================================
[ INFO: possible circular locking dependency detected ]
3.13.0-rc6+ #181 Not tainted
-------------------------------------------------------
s2disk/5651 is trying to acquire lock:
(s_active#170){++++.+}, at: [<ffffffff8118b9d7>] sysfs_addrm_finish+0x28/0x46
but task is already holding lock:
(cpu_hotplug.lock){+.+.+.}, at: [<ffffffff81039ff5>] cpu_hotplug_begin+0x28/0x50
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (cpu_hotplug.lock){+.+.+.}:
[<ffffffff81075027>] lock_acquire+0xfb/0x144
[<ffffffff8139d4d2>] mutex_lock_nested+0x6c/0x397
[<ffffffff81039f4a>] get_online_cpus+0x3c/0x50
[<ffffffff812a6974>] store+0x20/0xad
[<ffffffff8118a9a1>] sysfs_write_file+0x138/0x18b
[<ffffffff8112a428>] vfs_write+0x9c/0x102
[<ffffffff8112a716>] SyS_write+0x50/0x85
[<ffffffff813a57a2>] system_call_fastpath+0x16/0x1b
-> #0 (s_active#170){++++.+}:
[<ffffffff81074760>] __lock_acquire+0xae3/0xe68
[<ffffffff81075027>] lock_acquire+0xfb/0x144
[<ffffffff8118b027>] sysfs_deactivate+0xa5/0x108
[<ffffffff8118b9d7>] sysfs_addrm_finish+0x28/0x46
[<ffffffff8118bd3f>] sysfs_remove+0x2a/0x31
[<ffffffff8118be2f>] sysfs_remove_dir+0x66/0x6b
[<ffffffff811d5d11>] kobject_del+0x18/0x42
[<ffffffff811d5e1c>] kobject_cleanup+0xe1/0x14f
[<ffffffff811d5ede>] kobject_put+0x45/0x49
[<ffffffff812a6e85>] cpufreq_policy_put_kobj+0x37/0x83
[<ffffffff812a8bab>] __cpufreq_add_dev.isra.18+0x75e/0x78c
[<ffffffff812a8c39>] cpufreq_cpu_callback+0x53/0x88
[<ffffffff813a314c>] notifier_call_chain+0x67/0x92
[<ffffffff8105bce4>] __raw_notifier_call_chain+0x9/0xb
[<ffffffff81039e7c>] __cpu_notify+0x1b/0x32
[<ffffffff81039ea1>] cpu_notify+0xe/0x10
[<ffffffff8103a12b>] _cpu_up+0xf1/0x124
[<ffffffff8138ee7d>] enable_nonboot_cpus+0x52/0xbf
[<ffffffff8107a2a3>] hibernation_snapshot+0x1be/0x2ed
[<ffffffff8107eb84>] snapshot_ioctl+0x1e5/0x431
[<ffffffff81139080>] do_vfs_ioctl+0x45e/0x4a8
[<ffffffff8113911c>] SyS_ioctl+0x52/0x82
[<ffffffff813a57a2>] system_call_fastpath+0x16/0x1b
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(cpu_hotplug.lock);
lock(s_active#170);
lock(cpu_hotplug.lock);
lock(s_active#170);
*** DEADLOCK ***
7 locks held by s2disk/5651:
#0: (pm_mutex){+.+.+.}, at: [<ffffffff8107e9ea>] snapshot_ioctl+0x4b/0x431
#1: (device_hotplug_lock){+.+.+.}, at: [<ffffffff81283730>] lock_device_hotplug+0x12/0x14
#2: (acpi_scan_lock){+.+.+.}, at: [<ffffffff8122c657>] acpi_scan_lock_acquire+0x12/0x14
#3: (console_lock){+.+.+.}, at: [<ffffffff810817f2>] suspend_console+0x20/0x38
#4: (cpu_add_remove_lock){+.+.+.}, at: [<ffffffff81039fb9>] cpu_maps_update_begin+0x12/0x14
#5: (cpu_hotplug.lock){+.+.+.}, at: [<ffffffff81039ff5>] cpu_hotplug_begin+0x28/0x50
#6: (cpufreq_rwsem){.+.+.+}, at: [<ffffffff812a84cc>] __cpufreq_add_dev.isra.18+0x7f/0x78c
stack backtrace:
CPU: 0 PID: 5651 Comm: s2disk Not tainted 3.13.0-rc6+ #181
Hardware name: LENOVO 2776LEG/2776LEG, BIOS 6EET55WW (3.15 ) 12/19/2011
ffffffff81d3edf0 ffff8802174898f8 ffffffff81399cac 0000000000004549
ffffffff81d3edf0 ffff880217489948 ffffffff81397dc5 ffff880217489928
ffff88023198ea50 0000000000000006 ffff88023198f1d8 0000000000000006
Call Trace:
[<ffffffff81399cac>] dump_stack+0x4e/0x68
[<ffffffff81397dc5>] print_circular_bug+0x1f8/0x209
[<ffffffff81074760>] __lock_acquire+0xae3/0xe68
[<ffffffff8107565d>] ? debug_check_no_locks_freed+0x12c/0x143
[<ffffffff81075027>] lock_acquire+0xfb/0x144
[<ffffffff8118b9d7>] ? sysfs_addrm_finish+0x28/0x46
[<ffffffff81072201>] ? lockdep_init_map+0x14e/0x160
[<ffffffff8118b027>] sysfs_deactivate+0xa5/0x108
[<ffffffff8118b9d7>] ? sysfs_addrm_finish+0x28/0x46
[<ffffffff8118b9d7>] sysfs_addrm_finish+0x28/0x46
[<ffffffff8118bd3f>] sysfs_remove+0x2a/0x31
[<ffffffff8118be2f>] sysfs_remove_dir+0x66/0x6b
[<ffffffff811d5d11>] kobject_del+0x18/0x42
[<ffffffff811d5e1c>] kobject_cleanup+0xe1/0x14f
[<ffffffff811d5ede>] kobject_put+0x45/0x49
[<ffffffff812a6e85>] cpufreq_policy_put_kobj+0x37/0x83
[<ffffffff812a8bab>] __cpufreq_add_dev.isra.18+0x75e/0x78c
[<ffffffff81071b04>] ? __lock_is_held+0x32/0x54
[<ffffffff812a8c39>] cpufreq_cpu_callback+0x53/0x88
[<ffffffff813a314c>] notifier_call_chain+0x67/0x92
[<ffffffff8105bce4>] __raw_notifier_call_chain+0x9/0xb
[<ffffffff81039e7c>] __cpu_notify+0x1b/0x32
[<ffffffff81039ea1>] cpu_notify+0xe/0x10
[<ffffffff8103a12b>] _cpu_up+0xf1/0x124
[<ffffffff8138ee7d>] enable_nonboot_cpus+0x52/0xbf
[<ffffffff8107a2a3>] hibernation_snapshot+0x1be/0x2ed
[<ffffffff8107eb84>] snapshot_ioctl+0x1e5/0x431
[<ffffffff81139080>] do_vfs_ioctl+0x45e/0x4a8
[<ffffffff811414c8>] ? fcheck_files+0xa1/0xe4
[<ffffffff81141a67>] ? fget_light+0x33/0x9a
[<ffffffff8113911c>] SyS_ioctl+0x52/0x82
[<ffffffff811df4ce>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[<ffffffff813a57a2>] system_call_fastpath+0x16/0x1b
CPU1 is up
ACPI: Waking up from system sleep state S4
PM: noirq thaw of devices complete after 2.727 msecs
PM: early thaw of devices complete after 1.137 msecs
This warning appeared when I tried cancelling hibernation, which is
known to fail when using the acpi-cpufreq driver. The point of the test
was to verify that such failures still produce the expected result:
- all stale sysfs files are cleaned up and removed
- later suspend/resume actions will restore (or actually reinitiate)
the cpufreq driver for the non-boot CPUs
The test was successful, except that it produced the above lockdep warning.
I have not verified that this is a new warning (which means that it most
likely is not). And I expect the whole acpi-cpufreq problem, including
that warning, to go away when you eventually push
http://www.spinics.net/lists/cpufreq/msg08714.html with your improved
changelog (thanks for doing that BTW). So I don't worry too much about
it. Just wanted to let you know.
Bjørn
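The circular dependency in the splat above boils down to two lock classes taken in opposite orders on two paths: store() takes cpu_hotplug.lock while holding s_active, and the CPU-up notifier's kobject teardown waits on s_active while holding cpu_hotplug.lock. A minimal user-space sketch (illustrative names only, not kernel code) of how such an AB-BA cycle can be detected from recorded orderings, which is essentially what lockdep does:

```c
#include <string.h>

/* Recorded "held -> wanted" lock orderings, as lockdep would see them. */
static const char *dep_from[16], *dep_to[16];
static int ndeps;

/* Note that 'wanted' was acquired while 'held' was already held. */
void record_dep(const char *held, const char *wanted)
{
    dep_from[ndeps] = held;
    dep_to[ndeps] = wanted;
    ndeps++;
}

/* Return 1 if any two recorded orderings form an AB-BA cycle --
 * the "possible circular locking dependency" case. */
int has_inversion(void)
{
    for (int i = 0; i < ndeps; i++)
        for (int j = 0; j < ndeps; j++)
            if (i != j &&
                strcmp(dep_from[i], dep_to[j]) == 0 &&
                strcmp(dep_to[i], dep_from[j]) == 0)
                return 1;
    return 0;
}
```

Feeding it the two chains from the splat (s_active then cpu_hotplug.lock on the sysfs-write path, the reverse on the cpu_up() path) reports the cycle, even though neither path by itself is wrong.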
On Thursday, January 02, 2014 01:15:10 PM Bjørn Mork wrote:
> "Rafael J. Wysocki" <[email protected]> writes:
>
> > On Tuesday, December 24, 2013 07:11:00 AM Viresh Kumar wrote:
> >> __cpufreq_add_dev() can fail sometimes while we are resuming our system.
> >> Currently we are clearing all sysfs nodes for cpufreq's failed policy as that
> >> could make userspace unstable. But if we suspend/resume again, we should at least
> >> try to bring back those policies.
> >>
> >> This patch fixes this issue by clearing fallback data on failure and trying to
> >> allocate a new struct cpufreq_policy on second resume.
> >>
> >> Reported-and-tested-by: Bjørn Mork <[email protected]>
> >> Signed-off-by: Viresh Kumar <[email protected]>
> >
> > Well, while I appreciate the work done here, I don't like the changelog,
> > I don't really like the way the code is written and I don't like the comments.
> > Sorry about that.
> >
> > Bjorn, can you please test the patch below instead along with the [2/2]
> > from this series (on top of linux-pm.git/pm-cpufreq)?
>
> I tested this series with your modified 1/2 today, but on top of a
> current mainline (commit 9a0bb2966ef) instead of linux-pm.git/pm-cpufreq.
> Shouldn't make much difference since Linus already has pulled your
> 'pm+acpi-3.13-rc6' tag, which included that pm-cpufreq branch.
>
> This series fixes the reported bug for me.
>
>
> But I observed this, most likely unrelated, splat during the test:
>
> [...]
>
> This warning appeared when I tried cancelling hibernation, which is
> known to fail when using the acpi-cpufreq driver. The point of the test
> was to verify that such failures still produce the expected result:
>
> - all stale sysfs files are cleaned up and removed
> - later suspend/resume actions will restore (or actually reinitiate)
> the cpufreq driver for the non-boot CPUs
>
> Which was successful, except that it produced the above lockdep warning.
>
> I have not verified that this is a new warning (which means that it most
> likely is not). And I expect the whole acpi-cpufreq problem, including
> that warning, to go away when you eventually push
> http://www.spinics.net/lists/cpufreq/msg08714.html with your improved
> changelog (thanks for doing that BTW). So I don't worry too much about
> it. Just wanted to let you know.
OK, thanks! Well, this is somewhat worrisome.
Could you also check the linux-pm.git/fixes branch that contains all patches
I'm planning to push for 3.13-rc7 shortly?
Rafael
"Rafael J. Wysocki" <[email protected]> writes:
> OK, thanks! Well, this is somewhat worrisome.
>
> Could you also check the linux-pm.git/fixes branch that contains all patches
> I'm planning to push for 3.13-rc7 shortly?
It's definitely still there. But I'm less sure about the exact trigger.
I have now booted and suspended that branch a few times, trying to figure
out which actions it depends on. It seems to depend on a modified
scaling_max_freq (or any other attribute, I guess - I haven't tested
anything else) in combination with suspend. But the order doesn't
necessarily matter. I can sometimes set off the warning by writing to
sysfs after resuming, and sometimes at suspend time if the value has
already been modified. The results are not consistent, though. Sometimes
there is no warning at all.
But what is 100% sure is that I can still trigger this with the
linux-pm.git/fixes branch.
For example, triggered by
echo 800000 >/sys/devices/system/cpu/cpu1/cpufreq/scaling_max_freq
after resume (I believe the taint is due to a request_firmware failure
for one of the Bluetooth devices):
======================================================
[ INFO: possible circular locking dependency detected ]
3.13.0-rc6+ #183 Tainted: G W
-------------------------------------------------------
bash/4897 is trying to acquire lock:
(cpu_hotplug.lock){+.+.+.}, at: [<ffffffff81039f4a>] get_online_cpus+0x3c/0x50
but task is already holding lock:
(s_active#273){++++.+}, at: [<ffffffff8118a93f>] sysfs_write_file+0xd6/0x18b
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (s_active#273){++++.+}:
[<ffffffff81075027>] lock_acquire+0xfb/0x144
[<ffffffff8118b027>] sysfs_deactivate+0xa5/0x108
[<ffffffff8118b9d7>] sysfs_addrm_finish+0x28/0x46
[<ffffffff8118bd3f>] sysfs_remove+0x2a/0x31
[<ffffffff8118be2f>] sysfs_remove_dir+0x66/0x6b
[<ffffffff811d5d11>] kobject_del+0x18/0x42
[<ffffffff811d5e1c>] kobject_cleanup+0xe1/0x14f
[<ffffffff811d5ede>] kobject_put+0x45/0x49
[<ffffffff812a6ed5>] cpufreq_policy_put_kobj+0x37/0x83
[<ffffffff812a8bfb>] __cpufreq_add_dev.isra.18+0x75e/0x78c
[<ffffffff812a8c89>] cpufreq_cpu_callback+0x53/0x88
[<ffffffff813a314c>] notifier_call_chain+0x67/0x92
[<ffffffff8105bce4>] __raw_notifier_call_chain+0x9/0xb
[<ffffffff81039e7c>] __cpu_notify+0x1b/0x32
[<ffffffff81039ea1>] cpu_notify+0xe/0x10
[<ffffffff8103a12b>] _cpu_up+0xf1/0x124
[<ffffffff8138ee7d>] enable_nonboot_cpus+0x52/0xbf
[<ffffffff8107a2a3>] hibernation_snapshot+0x1be/0x2ed
[<ffffffff8107eb84>] snapshot_ioctl+0x1e5/0x431
[<ffffffff81139080>] do_vfs_ioctl+0x45e/0x4a8
[<ffffffff8113911c>] SyS_ioctl+0x52/0x82
[<ffffffff813a57a2>] system_call_fastpath+0x16/0x1b
-> #0 (cpu_hotplug.lock){+.+.+.}:
[<ffffffff81074760>] __lock_acquire+0xae3/0xe68
[<ffffffff81075027>] lock_acquire+0xfb/0x144
[<ffffffff8139d4d2>] mutex_lock_nested+0x6c/0x397
[<ffffffff81039f4a>] get_online_cpus+0x3c/0x50
[<ffffffff812a69c4>] store+0x20/0xad
[<ffffffff8118a9a1>] sysfs_write_file+0x138/0x18b
[<ffffffff8112a428>] vfs_write+0x9c/0x102
[<ffffffff8112a716>] SyS_write+0x50/0x85
[<ffffffff813a57a2>] system_call_fastpath+0x16/0x1b
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(s_active#273);
lock(cpu_hotplug.lock);
lock(s_active#273);
lock(cpu_hotplug.lock);
*** DEADLOCK ***
3 locks held by bash/4897:
#0: (sb_writers#5){.+.+.+}, at: [<ffffffff811297c7>] file_start_write+0x27/0x29
#1: (&of->mutex){+.+.+.}, at: [<ffffffff8118a937>] sysfs_write_file+0xce/0x18b
#2: (s_active#273){++++.+}, at: [<ffffffff8118a93f>] sysfs_write_file+0xd6/0x18b
stack backtrace:
CPU: 1 PID: 4897 Comm: bash Tainted: G W 3.13.0-rc6+ #183
Hardware name: LENOVO 2776LEG/2776LEG, BIOS 6EET55WW (3.15 ) 12/19/2011
ffffffff81ca83b0 ffff88021fa65c08 ffffffff81399cac 0000000000000006
ffffffff81ca83b0 ffff88021fa65c58 ffffffff81397dc5 0000000000000002
ffff8800b1af2810 0000000000000003 ffff8800b1af2ef0 0000000000000003
Call Trace:
[<ffffffff81399cac>] dump_stack+0x4e/0x68
[<ffffffff81397dc5>] print_circular_bug+0x1f8/0x209
[<ffffffff81074760>] __lock_acquire+0xae3/0xe68
[<ffffffff81075027>] lock_acquire+0xfb/0x144
[<ffffffff81039f4a>] ? get_online_cpus+0x3c/0x50
[<ffffffff8139d4d2>] mutex_lock_nested+0x6c/0x397
[<ffffffff81039f4a>] ? get_online_cpus+0x3c/0x50
[<ffffffff81071b04>] ? __lock_is_held+0x32/0x54
[<ffffffff81039f4a>] ? get_online_cpus+0x3c/0x50
[<ffffffff81039f4a>] get_online_cpus+0x3c/0x50
[<ffffffff812a69c4>] store+0x20/0xad
[<ffffffff8118a9a1>] sysfs_write_file+0x138/0x18b
[<ffffffff8112a428>] vfs_write+0x9c/0x102
[<ffffffff8112a716>] SyS_write+0x50/0x85
[<ffffffff813a57a2>] system_call_fastpath+0x16/0x1b
Or by s2disk (not the same boot - this time without the taint):
======================================================
[ INFO: possible circular locking dependency detected ]
3.13.0-rc6+ #183 Not tainted
-------------------------------------------------------
s2disk/5123 is trying to acquire lock:
(s_active#171){++++.+}, at: [<ffffffff8118b9d7>] sysfs_addrm_finish+0x28/0x46
but task is already holding lock:
(cpu_hotplug.lock){+.+.+.}, at: [<ffffffff81039ff5>] cpu_hotplug_begin+0x28/0x50
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (cpu_hotplug.lock){+.+.+.}:
[<ffffffff81075027>] lock_acquire+0xfb/0x144
[<ffffffff8139d4d2>] mutex_lock_nested+0x6c/0x397
[<ffffffff81039f4a>] get_online_cpus+0x3c/0x50
[<ffffffff812a69c4>] store+0x20/0xad
[<ffffffff8118a9a1>] sysfs_write_file+0x138/0x18b
[<ffffffff8112a428>] vfs_write+0x9c/0x102
[<ffffffff8112a716>] SyS_write+0x50/0x85
[<ffffffff813a57a2>] system_call_fastpath+0x16/0x1b
-> #0 (s_active#171){++++.+}:
[<ffffffff81074760>] __lock_acquire+0xae3/0xe68
[<ffffffff81075027>] lock_acquire+0xfb/0x144
[<ffffffff8118b027>] sysfs_deactivate+0xa5/0x108
[<ffffffff8118b9d7>] sysfs_addrm_finish+0x28/0x46
[<ffffffff8118bd3f>] sysfs_remove+0x2a/0x31
[<ffffffff8118be2f>] sysfs_remove_dir+0x66/0x6b
[<ffffffff811d5d11>] kobject_del+0x18/0x42
[<ffffffff811d5e1c>] kobject_cleanup+0xe1/0x14f
[<ffffffff811d5ede>] kobject_put+0x45/0x49
[<ffffffff812a6ed5>] cpufreq_policy_put_kobj+0x37/0x83
[<ffffffff812a8bfb>] __cpufreq_add_dev.isra.18+0x75e/0x78c
[<ffffffff812a8c89>] cpufreq_cpu_callback+0x53/0x88
[<ffffffff813a314c>] notifier_call_chain+0x67/0x92
[<ffffffff8105bce4>] __raw_notifier_call_chain+0x9/0xb
[<ffffffff81039e7c>] __cpu_notify+0x1b/0x32
[<ffffffff81039ea1>] cpu_notify+0xe/0x10
[<ffffffff8103a12b>] _cpu_up+0xf1/0x124
[<ffffffff8138ee7d>] enable_nonboot_cpus+0x52/0xbf
[<ffffffff8107a2a3>] hibernation_snapshot+0x1be/0x2ed
[<ffffffff8107eb84>] snapshot_ioctl+0x1e5/0x431
[<ffffffff81139080>] do_vfs_ioctl+0x45e/0x4a8
[<ffffffff8113911c>] SyS_ioctl+0x52/0x82
[<ffffffff813a57a2>] system_call_fastpath+0x16/0x1b
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(cpu_hotplug.lock);
lock(s_active#171);
lock(cpu_hotplug.lock);
lock(s_active#171);
*** DEADLOCK ***
7 locks held by s2disk/5123:
#0: (pm_mutex){+.+.+.}, at: [<ffffffff8107e9ea>] snapshot_ioctl+0x4b/0x431
#1: (device_hotplug_lock){+.+.+.}, at: [<ffffffff812837ac>] lock_device_hotplug+0x12/0x14
#2: (acpi_scan_lock){+.+.+.}, at: [<ffffffff8122c6d3>] acpi_scan_lock_acquire+0x12/0x14
#3: (console_lock){+.+.+.}, at: [<ffffffff810817f2>] suspend_console+0x20/0x38
#4: (cpu_add_remove_lock){+.+.+.}, at: [<ffffffff81039fb9>] cpu_maps_update_begin+0x12/0x14
#5: (cpu_hotplug.lock){+.+.+.}, at: [<ffffffff81039ff5>] cpu_hotplug_begin+0x28/0x50
#6: (cpufreq_rwsem){.+.+.+}, at: [<ffffffff812a851c>] __cpufreq_add_dev.isra.18+0x7f/0x78c
stack backtrace:
CPU: 0 PID: 5123 Comm: s2disk Not tainted 3.13.0-rc6+ #183
Hardware name: LENOVO 2776LEG/2776LEG, BIOS 6EET55WW (3.15 ) 12/19/2011
ffffffff81d3de70 ffff8800b10718f8 ffffffff81399cac 0000000000003a93
ffffffff81d3de70 ffff8800b1071948 ffffffff81397dc5 ffff8800b1071928
ffff8800b10c8bd0 0000000000000006 ffff8800b10c9358 0000000000000006
Call Trace:
[<ffffffff81399cac>] dump_stack+0x4e/0x68
[<ffffffff81397dc5>] print_circular_bug+0x1f8/0x209
[<ffffffff81074760>] __lock_acquire+0xae3/0xe68
[<ffffffff8107565d>] ? debug_check_no_locks_freed+0x12c/0x143
[<ffffffff81075027>] lock_acquire+0xfb/0x144
[<ffffffff8118b9d7>] ? sysfs_addrm_finish+0x28/0x46
[<ffffffff81072201>] ? lockdep_init_map+0x14e/0x160
[<ffffffff8118b027>] sysfs_deactivate+0xa5/0x108
[<ffffffff8118b9d7>] ? sysfs_addrm_finish+0x28/0x46
[<ffffffff8118b9d7>] sysfs_addrm_finish+0x28/0x46
[<ffffffff8118bd3f>] sysfs_remove+0x2a/0x31
[<ffffffff8118be2f>] sysfs_remove_dir+0x66/0x6b
[<ffffffff811d5d11>] kobject_del+0x18/0x42
[<ffffffff811d5e1c>] kobject_cleanup+0xe1/0x14f
[<ffffffff811d5ede>] kobject_put+0x45/0x49
[<ffffffff812a6ed5>] cpufreq_policy_put_kobj+0x37/0x83
[<ffffffff812a8bfb>] __cpufreq_add_dev.isra.18+0x75e/0x78c
[<ffffffff81071b04>] ? __lock_is_held+0x32/0x54
[<ffffffff812a8c89>] cpufreq_cpu_callback+0x53/0x88
[<ffffffff813a314c>] notifier_call_chain+0x67/0x92
[<ffffffff8105bce4>] __raw_notifier_call_chain+0x9/0xb
[<ffffffff81039e7c>] __cpu_notify+0x1b/0x32
[<ffffffff81039ea1>] cpu_notify+0xe/0x10
[<ffffffff8103a12b>] _cpu_up+0xf1/0x124
[<ffffffff8138ee7d>] enable_nonboot_cpus+0x52/0xbf
[<ffffffff8107a2a3>] hibernation_snapshot+0x1be/0x2ed
[<ffffffff8107eb84>] snapshot_ioctl+0x1e5/0x431
[<ffffffff81139080>] do_vfs_ioctl+0x45e/0x4a8
[<ffffffff811414c8>] ? fcheck_files+0xa1/0xe4
[<ffffffff81141a67>] ? fget_light+0x33/0x9a
[<ffffffff8113911c>] SyS_ioctl+0x52/0x82
[<ffffffff811df4ce>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[<ffffffff813a57a2>] system_call_fastpath+0x16/0x1b
CPU1 is up
ACPI: Waking up from system sleep state S4
PM: noirq thaw of devices complete after 1.722 msecs
PM: early thaw of devices complete after 1.180 msecs
Bjørn
Bjørn Mork <[email protected]> writes:
> "Rafael J. Wysocki" <[email protected]> writes:
>
>> OK, thanks! Well, this is somewhat worrisome.
>>
>> Could you also check the linux-pm.git/fixes branch that contains all patches
>> I'm planning to push for 3.13-rc7 shortly?
>
> It's definitely still there. But I'm less sure about the exact trigger.
I did a couple more tests after clean reboots, and 100% sure ways to
trigger the circular locking on my laptop are
echo ondemand > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
s2disk
or
echo 1401000 > /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
s2disk
or
echo 1401000 > /sys/devices/system/cpu/cpu1/cpufreq/scaling_max_freq
s2disk
Note that "ondemand" and "1401000" are the default values, so I don't
actually change anything here. The write is causing the problem, not
the value. As expected, I guess.
Also note that boot vs non-boot cpu doesn't seem to matter. Nor does
cancelling the hibernation. The warning appears on hibernate - not on
resume.
Hope this helps.
Bjørn
On 3 January 2014 15:23, Bjørn Mork <[email protected]> wrote:
> Note that "ondemand" and "1401000" are the default values, so I don't
> actually change anything here. The write is causing the problem, not
> the value. As expected, I guess.
>
> Also note that boot vs non-boot cpu doesn't seem to matter. Nor does
> cancelling the hibernation. The warning appears on hibernate - not on
> resume.
Hmm... I spent quite some time trying to understand what's going on, but
couldn't get to the bottom of it so far. I haven't tried reproducing it, though.
A few things I can make out of this mail thread so far:
- Apart from the log, everything works fine, i.e. the system is back in
  working condition.
- It only happens when cpufreq_add_dev() fails during hibernation, while
  we enable the non-boot CPUs again to save the image to disk. So it isn't
  a problem for systems where add_dev() doesn't fail on hibernation.
- There is contention between locks in the order they are taken: the
  hotplug lock taken by get_online_cpus() and the s_active lock for the
  sysfs files. I don't know what role the earlier write to the sysfs files
  plays, as that should finish before hibernation starts, so all locks
  should have been released by then.
--
viresh
Viresh Kumar <[email protected]> writes:
> On 3 January 2014 15:23, Bjørn Mork <[email protected]> wrote:
>> Note that "ondemand" and "1401000" are the default values, so I don't
>> actually change anything here. The write is causing the problem, not
>> the value. As expected, I guess.
>>
>> Also note that boot vs non-boot cpu doesn't seem to matter. Nor does
>> cancelling the hibernation. The warning appears on hibernate - not on
>> resume.
>
> Hmm... I spent quite some time trying to understand what's going on, but
> couldn't get to the bottom of it so far. I haven't tried reproducing it, though.
>
> Few things that I can make out of this mail chain so far:
> - Apart from the log, everything is working fine. i.e. system is back in
> working condition.
Correct. And users not running a lock debugging kernel will of course
not even see the warning.
> - It only happens when cpufreq_add_dev() fails during hibernation while
> we enable non-boot CPUs again to save image to disk. So, isn't a problem
> for a system which doesn't have any issues with add_dev() failing on
> hibernation
Wrong. This was my initial assumption but I later found out that the
issue is unrelated to hibernation failures. Sorry about the confusion.
> - There is a contention of locks in the order they are taken. And the contention
> looks to be between the hotplug lock taken by get_online_cpus() and the s_active
> lock for sysfs files. Don't know what's the role of previous write to
> sysfs files.
> As that should finish before hibernation starts and so all locks should be back
> in place.
Yes, that seems logical. But I guess this is where it fails?
Bjørn
On 3 January 2014 17:25, Bjørn Mork <[email protected]> wrote:
> Correct. And users not running a lock debugging kernel will of course
> not even see the warning.
Okay..
>> - It only happens when cpufreq_add_dev() fails during hibernation while
>> we enable non-boot CPUs again to save image to disk. So, isn't a problem
>> for a system which doesn't have any issues with add_dev() failing on
>> hibernation
>
> Wrong. This was my initial assumption but I later found out that the
> issue is unrelated to hibernation failures. Sorry about the confusion.
Hmm... Can you share the latest warning logs you have? The earlier ones
were related to hibernation.
>> - There is a contention of locks in the order they are taken. And the contention
>> looks to be between the hotplug lock taken by get_online_cpus() and the s_active
>> lock for sysfs files. Don't know what's the role of previous write to
>> sysfs files.
>> As that should finish before hibernation starts and so all locks should be back
>> in place.
>
> Yes, that seems logical. But I guess this is where it fails?
It looked like that... though your new logs might indicate something else.
Viresh Kumar <[email protected]> writes:
> On 3 January 2014 17:25, Bjørn Mork <[email protected]> wrote:
>> Correct. And users not running a lock debugging kernel will of course
>> not even see the warning.
>
> Okay..
>
>>> - It only happens when cpufreq_add_dev() fails during hibernation while
>>> we enable non-boot CPUs again to save image to disk. So, isn't a problem
>>> for a system which doesn't have any issues with add_dev() failing on
>>> hibernation
>>
>> Wrong. This was my initial assumption but I later found out that the
>> issue is unrelated to hibernation failures. Sorry about the confusion.
>
> Hmm.. Can we have the latest warning logs you have? Earlier ones were
> related to hibernation..
That's correct. I have not observed this on suspend to RAM. But then
again I haven't rigged up any way to log that, so I don't think it's
conclusive.
The point I tried to make is that it isn't related to any hibernation
*failures*. The warning appears even if the add_dev() is successful,
and it also appears if I touch only the *boot* cpu cpufreq attributes.
I.e., this seems to be unrelated to the hotplugging code.
Here's another copy of the warning, captured by cancelling hibernation
as I don't have any other way to log it at the moment. But I do see the
warning appear on the console *before* cancelling. And I also see this
warning appear when trying the in-kernel hibernation instead of
userspace.
[ 267.893084] ======================================================
[ 267.893085] [ INFO: possible circular locking dependency detected ]
[ 267.893087] 3.13.0-rc6+ #183 Tainted: G W
[ 267.893089] -------------------------------------------------------
[ 267.893090] s2disk/5450 is trying to acquire lock:
[ 267.893101] (s_active#164){++++.+}, at: [<ffffffff8118b9d7>] sysfs_addrm_finish+0x28/0x46
[ 267.893102]
[ 267.893102] but task is already holding lock:
[ 267.893111] (cpu_hotplug.lock){+.+.+.}, at: [<ffffffff81039ff5>] cpu_hotplug_begin+0x28/0x50
[ 267.893112]
[ 267.893112] which lock already depends on the new lock.
[ 267.893112]
[ 267.893113]
[ 267.893113] the existing dependency chain (in reverse order) is:
[ 267.893117]
[ 267.893117] -> #1 (cpu_hotplug.lock){+.+.+.}:
[ 267.893123] [<ffffffff81075027>] lock_acquire+0xfb/0x144
[ 267.893128] [<ffffffff8139d4d2>] mutex_lock_nested+0x6c/0x397
[ 267.893132] [<ffffffff81039f4a>] get_online_cpus+0x3c/0x50
[ 267.893137] [<ffffffff812a69c4>] store+0x20/0xad
[ 267.893142] [<ffffffff8118a9a1>] sysfs_write_file+0x138/0x18b
[ 267.893147] [<ffffffff8112a428>] vfs_write+0x9c/0x102
[ 267.893151] [<ffffffff8112a716>] SyS_write+0x50/0x85
[ 267.893155] [<ffffffff813a57a2>] system_call_fastpath+0x16/0x1b
[ 267.893160]
[ 267.893160] -> #0 (s_active#164){++++.+}:
[ 267.893164] [<ffffffff81074760>] __lock_acquire+0xae3/0xe68
[ 267.893168] [<ffffffff81075027>] lock_acquire+0xfb/0x144
[ 267.893172] [<ffffffff8118b027>] sysfs_deactivate+0xa5/0x108
[ 267.893175] [<ffffffff8118b9d7>] sysfs_addrm_finish+0x28/0x46
[ 267.893178] [<ffffffff8118bd3f>] sysfs_remove+0x2a/0x31
[ 267.893182] [<ffffffff8118be2f>] sysfs_remove_dir+0x66/0x6b
[ 267.893186] [<ffffffff811d5d11>] kobject_del+0x18/0x42
[ 267.893190] [<ffffffff811d5e1c>] kobject_cleanup+0xe1/0x14f
[ 267.893193] [<ffffffff811d5ede>] kobject_put+0x45/0x49
[ 267.893197] [<ffffffff812a6ed5>] cpufreq_policy_put_kobj+0x37/0x83
[ 267.893201] [<ffffffff812a8bfb>] __cpufreq_add_dev.isra.18+0x75e/0x78c
[ 267.893204] [<ffffffff812a8c89>] cpufreq_cpu_callback+0x53/0x88
[ 267.893208] [<ffffffff813a314c>] notifier_call_chain+0x67/0x92
[ 267.893213] [<ffffffff8105bce4>] __raw_notifier_call_chain+0x9/0xb
[ 267.893217] [<ffffffff81039e7c>] __cpu_notify+0x1b/0x32
[ 267.893221] [<ffffffff81039ea1>] cpu_notify+0xe/0x10
[ 267.893225] [<ffffffff8103a12b>] _cpu_up+0xf1/0x124
[ 267.893230] [<ffffffff8138ee7d>] enable_nonboot_cpus+0x52/0xbf
[ 267.893234] [<ffffffff8107a2a3>] hibernation_snapshot+0x1be/0x2ed
[ 267.893238] [<ffffffff8107eb84>] snapshot_ioctl+0x1e5/0x431
[ 267.893242] [<ffffffff81139080>] do_vfs_ioctl+0x45e/0x4a8
[ 267.893245] [<ffffffff8113911c>] SyS_ioctl+0x52/0x82
[ 267.893249] [<ffffffff813a57a2>] system_call_fastpath+0x16/0x1b
[ 267.893250]
[ 267.893250] other info that might help us debug this:
[ 267.893250]
[ 267.893251] Possible unsafe locking scenario:
[ 267.893251]
[ 267.893252] CPU0 CPU1
[ 267.893253] ---- ----
[ 267.893256] lock(cpu_hotplug.lock);
[ 267.893259] lock(s_active#164);
[ 267.893261] lock(cpu_hotplug.lock);
[ 267.893265] lock(s_active#164);
[ 267.893266]
[ 267.893266] *** DEADLOCK ***
[ 267.893266]
[ 267.893268] 7 locks held by s2disk/5450:
[ 267.893275] #0: (pm_mutex){+.+.+.}, at: [<ffffffff8107e9ea>] snapshot_ioctl+0x4b/0x431
[ 267.893283] #1: (device_hotplug_lock){+.+.+.}, at: [<ffffffff812837ac>] lock_device_hotplug+0x12/0x14
[ 267.893290] #2: (acpi_scan_lock){+.+.+.}, at: [<ffffffff8122c6d3>] acpi_scan_lock_acquire+0x12/0x14
[ 267.893297] #3: (console_lock){+.+.+.}, at: [<ffffffff810817f2>] suspend_console+0x20/0x38
[ 267.893305] #4: (cpu_add_remove_lock){+.+.+.}, at: [<ffffffff81039fb9>] cpu_maps_update_begin+0x12/0x14
[ 267.893311] #5: (cpu_hotplug.lock){+.+.+.}, at: [<ffffffff81039ff5>] cpu_hotplug_begin+0x28/0x50
[ 267.893318] #6: (cpufreq_rwsem){.+.+.+}, at: [<ffffffff812a851c>] __cpufreq_add_dev.isra.18+0x7f/0x78c
[ 267.893319]
[ 267.893319] stack backtrace:
[ 267.893322] CPU: 0 PID: 5450 Comm: s2disk Tainted: G W 3.13.0-rc6+ #183
[ 267.893324] Hardware name: LENOVO 2776LEG/2776LEG, BIOS 6EET55WW (3.15 ) 12/19/2011
[ 267.893329] ffffffff81d3cd00 ffff8800b7acd8f8 ffffffff81399cac 00000000000068f7
[ 267.893334] ffffffff81d3cd00 ffff8800b7acd948 ffffffff81397dc5 ffff8800b7acd928
[ 267.893338] ffff8800b7b24990 0000000000000006 ffff8800b7b25118 0000000000000006
[ 267.893339] Call Trace:
[ 267.893344] [<ffffffff81399cac>] dump_stack+0x4e/0x68
[ 267.893349] [<ffffffff81397dc5>] print_circular_bug+0x1f8/0x209
[ 267.893353] [<ffffffff81074760>] __lock_acquire+0xae3/0xe68
[ 267.893357] [<ffffffff8107565d>] ? debug_check_no_locks_freed+0x12c/0x143
[ 267.893361] [<ffffffff81075027>] lock_acquire+0xfb/0x144
[ 267.893365] [<ffffffff8118b9d7>] ? sysfs_addrm_finish+0x28/0x46
[ 267.893369] [<ffffffff81072201>] ? lockdep_init_map+0x14e/0x160
[ 267.893372] [<ffffffff8118b027>] sysfs_deactivate+0xa5/0x108
[ 267.893376] [<ffffffff8118b9d7>] ? sysfs_addrm_finish+0x28/0x46
[ 267.893380] [<ffffffff8118b9d7>] sysfs_addrm_finish+0x28/0x46
[ 267.893383] [<ffffffff8118bd3f>] sysfs_remove+0x2a/0x31
[ 267.893386] [<ffffffff8118be2f>] sysfs_remove_dir+0x66/0x6b
[ 267.893390] [<ffffffff811d5d11>] kobject_del+0x18/0x42
[ 267.893393] [<ffffffff811d5e1c>] kobject_cleanup+0xe1/0x14f
[ 267.893396] [<ffffffff811d5ede>] kobject_put+0x45/0x49
[ 267.893400] [<ffffffff812a6ed5>] cpufreq_policy_put_kobj+0x37/0x83
[ 267.893404] [<ffffffff812a8bfb>] __cpufreq_add_dev.isra.18+0x75e/0x78c
[ 267.893407] [<ffffffff81071b04>] ? __lock_is_held+0x32/0x54
[ 267.893413] [<ffffffff812a8c89>] cpufreq_cpu_callback+0x53/0x88
[ 267.893416] [<ffffffff813a314c>] notifier_call_chain+0x67/0x92
[ 267.893421] [<ffffffff8105bce4>] __raw_notifier_call_chain+0x9/0xb
[ 267.893425] [<ffffffff81039e7c>] __cpu_notify+0x1b/0x32
[ 267.893428] [<ffffffff81039ea1>] cpu_notify+0xe/0x10
[ 267.893432] [<ffffffff8103a12b>] _cpu_up+0xf1/0x124
[ 267.893436] [<ffffffff8138ee7d>] enable_nonboot_cpus+0x52/0xbf
[ 267.893439] [<ffffffff8107a2a3>] hibernation_snapshot+0x1be/0x2ed
[ 267.893443] [<ffffffff8107eb84>] snapshot_ioctl+0x1e5/0x431
[ 267.893447] [<ffffffff81139080>] do_vfs_ioctl+0x45e/0x4a8
[ 267.893451] [<ffffffff811414c8>] ? fcheck_files+0xa1/0xe4
[ 267.893454] [<ffffffff81141a67>] ? fget_light+0x33/0x9a
[ 267.893457] [<ffffffff8113911c>] SyS_ioctl+0x52/0x82
[ 267.893462] [<ffffffff811df4ce>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 267.893466] [<ffffffff813a57a2>] system_call_fastpath+0x16/0x1b
I don't think I do anything extraordinary to trigger this, so I would
be surprised if you can't reproduce it by doing
export x=`cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor`
echo $x >/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
s2disk
or similar on a kernel with CONFIG_PROVE_LOCKING=y
Bjørn
On 6 January 2014 14:31, Bjørn Mork <[email protected]> wrote:
> That's correct. I have not observed this on suspend to RAM. But then
> again I haven't rigged any way to log that, so I don't think it's
> conclusive..
>
> The point I tried to make is that it isn't related to any hibernation
> *failures*. The warning appears even if the add_dev() is successful,
> and it also appears if I touch only the *boot* cpu cpufreq attributes.
Okay.. But your log below is for the add_dev() failure case.
> I.e., this seems to be unrelated to the hotplugging code.
>
> Here's another copy of the warning, captured by cancelling hibernation
> as I don't have any other way to log it at the moment. But I do see the
> warning appear on the console *before* cancelling. And I also see this
> warning appear when trying the in-kernel hibernation instead of
> userspace.
How do you cancel hibernation here? Sorry, what is in-kernel hibernation?
> [ 267.893084] ======================================================
> [ 267.893085] [ INFO: possible circular locking dependency detected ]
> [ 267.893087] 3.13.0-rc6+ #183 Tainted: G W
> [ 267.893089] -------------------------------------------------------
> [ 267.893090] s2disk/5450 is trying to acquire lock:
> [ 267.893101] (s_active#164){++++.+}, at: [<ffffffff8118b9d7>] sysfs_addrm_finish+0x28/0x46
> [ 267.893102]
> [ 267.893102] but task is already holding lock:
> [ 267.893111] (cpu_hotplug.lock){+.+.+.}, at: [<ffffffff81039ff5>] cpu_hotplug_begin+0x28/0x50
> [ 267.893112]
> [ 267.893112] which lock already depends on the new lock.
> [ 267.893112]
> [ 267.893113]
> [ 267.893113] the existing dependency chain (in reverse order) is:
> [ 267.893117]
> [ 267.893117] -> #1 (cpu_hotplug.lock){+.+.+.}:
> [ 267.893123] [<ffffffff81075027>] lock_acquire+0xfb/0x144
> [ 267.893128] [<ffffffff8139d4d2>] mutex_lock_nested+0x6c/0x397
> [ 267.893132] [<ffffffff81039f4a>] get_online_cpus+0x3c/0x50
> [ 267.893137] [<ffffffff812a69c4>] store+0x20/0xad
> [ 267.893142] [<ffffffff8118a9a1>] sysfs_write_file+0x138/0x18b
> [ 267.893147] [<ffffffff8112a428>] vfs_write+0x9c/0x102
> [ 267.893151] [<ffffffff8112a716>] SyS_write+0x50/0x85
> [ 267.893155] [<ffffffff813a57a2>] system_call_fastpath+0x16/0x1b
> [ 267.893160]
> [ 267.893160] -> #0 (s_active#164){++++.+}:
> [ 267.893164] [<ffffffff81074760>] __lock_acquire+0xae3/0xe68
> [ 267.893168] [<ffffffff81075027>] lock_acquire+0xfb/0x144
> [ 267.893172] [<ffffffff8118b027>] sysfs_deactivate+0xa5/0x108
> [ 267.893175] [<ffffffff8118b9d7>] sysfs_addrm_finish+0x28/0x46
> [ 267.893178] [<ffffffff8118bd3f>] sysfs_remove+0x2a/0x31
> [ 267.893182] [<ffffffff8118be2f>] sysfs_remove_dir+0x66/0x6b
> [ 267.893186] [<ffffffff811d5d11>] kobject_del+0x18/0x42
> [ 267.893190] [<ffffffff811d5e1c>] kobject_cleanup+0xe1/0x14f
> [ 267.893193] [<ffffffff811d5ede>] kobject_put+0x45/0x49
> [ 267.893197] [<ffffffff812a6ed5>] cpufreq_policy_put_kobj+0x37/0x83
All of the above is for the case where add_dev() had to clean up an
earlier allocated policy because something failed during add_dev().
> [ 267.893201] [<ffffffff812a8bfb>] __cpufreq_add_dev.isra.18+0x75e/0x78c
> [ 267.893204] [<ffffffff812a8c89>] cpufreq_cpu_callback+0x53/0x88
> [ 267.893208] [<ffffffff813a314c>] notifier_call_chain+0x67/0x92
> [ 267.893213] [<ffffffff8105bce4>] __raw_notifier_call_chain+0x9/0xb
> [ 267.893217] [<ffffffff81039e7c>] __cpu_notify+0x1b/0x32
> [ 267.893221] [<ffffffff81039ea1>] cpu_notify+0xe/0x10
> [ 267.893225] [<ffffffff8103a12b>] _cpu_up+0xf1/0x124
> [ 267.893230] [<ffffffff8138ee7d>] enable_nonboot_cpus+0x52/0xbf
> [ 267.893234] [<ffffffff8107a2a3>] hibernation_snapshot+0x1be/0x2ed
And this is again the same log you sent last time. It happened while
saving the image after disabling all non-boot CPUs. It happened in your
case because your driver->init() was failing, so it may not be
reproducible at my end. I haven't tried it, though.
> I don't think I do anything extra-ordinary to trigger this, so I would
> be surprised if you can't reproduce it by doing
>
> export x=`cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor`
> echo $x >/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
> s2disk
I don't see you cancelling hibernation here. Sorry, I'm confused about
how exactly you reproduced it.
Viresh Kumar <[email protected]> writes:
> On 6 January 2014 14:31, Bjørn Mork <[email protected]> wrote:
>> That's correct. I have not observed this on suspend to RAM. But then
>> again I haven't rigged any way to log that, so I don't think it's
>> conclusive..
>>
>> The point I tried to make is that it isn't related to any hibernation
>> *failures*. The warning appears even if the add_dev() is successful,
>> and it also appears if I touch only the *boot* cpu cpufreq attributes.
>
> Okay.. But your log below is for add_dev() failure case.
Any suggestion on how to capture warnings issued between freeze and thaw
is appreciated :-)
Will a netconsole work? I can try that later.
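For reference, a typical netconsole setup looks roughly like this (the
interface name, addresses, ports and MAC address below are placeholders
for illustration, not values from this thread):

```shell
# On the machine being hibernated (all addresses/ports are examples):
# format: netconsole=[src-port]@[src-ip]/[dev],[tgt-port]@[tgt-ip]/[tgt-mac]
modprobe netconsole netconsole=6665@192.168.0.2/eth0,6666@192.168.0.1/aa:bb:cc:dd:ee:ff

# On the receiving machine, capture the UDP log stream:
nc -u -l -p 6666 | tee netconsole.log
```

Whether the NIC driver stays alive late enough into the freeze path to
deliver the messages is a separate question, so this may or may not
help here.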
>> I.e., this seems to be unrelated to the hotplugging code.
>>
>> Here's another copy of the warning, captured by cancelling hibernation
>> as I don't have any other way to log it at the moment. But I do see the
>> warning appear on the console *before* cancelling. And I also see this
>> warning appear when trying the in-kernel hibernation instead of
>> userspace.
>
> How do you cancel hibernation here? Sorry, what is in-kernel hibernation?
I don't know the proper terms here. I normally use s2disk from uswsusp
to hibernate. By "in-kernel hibernation" I meant the process initiated
by doing
echo disk >/sys/power/state
>> [ 267.893084] ======================================================
>> [ 267.893085] [ INFO: possible circular locking dependency detected ]
>> [ 267.893087] 3.13.0-rc6+ #183 Tainted: G W
>> [ 267.893089] -------------------------------------------------------
>> [ 267.893090] s2disk/5450 is trying to acquire lock:
>> [ 267.893101] (s_active#164){++++.+}, at: [<ffffffff8118b9d7>] sysfs_addrm_finish+0x28/0x46
>> [ 267.893102]
>> [ 267.893102] but task is already holding lock:
>> [ 267.893111] (cpu_hotplug.lock){+.+.+.}, at: [<ffffffff81039ff5>] cpu_hotplug_begin+0x28/0x50
>> [ 267.893112]
>> [ 267.893112] which lock already depends on the new lock.
>> [ 267.893112]
>> [ 267.893113]
>> [ 267.893113] the existing dependency chain (in reverse order) is:
>> [ 267.893117]
>> [ 267.893117] -> #1 (cpu_hotplug.lock){+.+.+.}:
>> [ 267.893123] [<ffffffff81075027>] lock_acquire+0xfb/0x144
>> [ 267.893128] [<ffffffff8139d4d2>] mutex_lock_nested+0x6c/0x397
>> [ 267.893132] [<ffffffff81039f4a>] get_online_cpus+0x3c/0x50
>> [ 267.893137] [<ffffffff812a69c4>] store+0x20/0xad
>> [ 267.893142] [<ffffffff8118a9a1>] sysfs_write_file+0x138/0x18b
>> [ 267.893147] [<ffffffff8112a428>] vfs_write+0x9c/0x102
>> [ 267.893151] [<ffffffff8112a716>] SyS_write+0x50/0x85
>> [ 267.893155] [<ffffffff813a57a2>] system_call_fastpath+0x16/0x1b
>> [ 267.893160]
>> [ 267.893160] -> #0 (s_active#164){++++.+}:
>> [ 267.893164] [<ffffffff81074760>] __lock_acquire+0xae3/0xe68
>> [ 267.893168] [<ffffffff81075027>] lock_acquire+0xfb/0x144
>> [ 267.893172] [<ffffffff8118b027>] sysfs_deactivate+0xa5/0x108
>> [ 267.893175] [<ffffffff8118b9d7>] sysfs_addrm_finish+0x28/0x46
>> [ 267.893178] [<ffffffff8118bd3f>] sysfs_remove+0x2a/0x31
>> [ 267.893182] [<ffffffff8118be2f>] sysfs_remove_dir+0x66/0x6b
>> [ 267.893186] [<ffffffff811d5d11>] kobject_del+0x18/0x42
>> [ 267.893190] [<ffffffff811d5e1c>] kobject_cleanup+0xe1/0x14f
>> [ 267.893193] [<ffffffff811d5ede>] kobject_put+0x45/0x49
>> [ 267.893197] [<ffffffff812a6ed5>] cpufreq_policy_put_kobj+0x37/0x83
>
> All above is for case where add dev was required to clear up earlier
> allocated policy as something failed during add_dev()
I see. I'll try capturing at least part of the warning with a camera
when it's not failing.
>> [ 267.893201] [<ffffffff812a8bfb>] __cpufreq_add_dev.isra.18+0x75e/0x78c
>> [ 267.893204] [<ffffffff812a8c89>] cpufreq_cpu_callback+0x53/0x88
>> [ 267.893208] [<ffffffff813a314c>] notifier_call_chain+0x67/0x92
>> [ 267.893213] [<ffffffff8105bce4>] __raw_notifier_call_chain+0x9/0xb
>> [ 267.893217] [<ffffffff81039e7c>] __cpu_notify+0x1b/0x32
>> [ 267.893221] [<ffffffff81039ea1>] cpu_notify+0xe/0x10
>> [ 267.893225] [<ffffffff8103a12b>] _cpu_up+0xf1/0x124
>> [ 267.893230] [<ffffffff8138ee7d>] enable_nonboot_cpus+0x52/0xbf
>> [ 267.893234] [<ffffffff8107a2a3>] hibernation_snapshot+0x1be/0x2ed
>
> And this is again the same log as what you have sent last time. It happened
> while saving image after disabling all non-boot CPUs. And so it happened for
> your case as your driver->init() was failing, so it may not get reproduced at
> my end. Haven't tried it though.
>
>> I don't think I do anything extra-ordinary to trigger this, so I would
>> be surprised if you can't reproduce it by doing
>>
>> export x=`cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor`
>> echo $x >/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
>> s2disk
>
> I don't see you cancelling hibernation here. Sorry, I'm confused about
> how exactly you reproduced it.
I cancelled by pressing backspace while the image was being written.
Bjørn
On 6 January 2014 16:19, Bjørn Mork <[email protected]> wrote:
> I cancelled by pressing backspace while the image was being written.
I will answer other queries on next mail, but a quick question:
Do you get these warnings even if you don't cancel hibernation?
On Monday, January 06, 2014 11:49:00 AM Bjørn Mork wrote:
> Viresh Kumar <[email protected]> writes:
>
> > On 6 January 2014 14:31, Bjørn Mork <[email protected]> wrote:
> >> That's correct. I have not observed this on suspend to RAM. But then
> >> again I haven't rigged any way to log that, so I don't think it's
> >> conclusive..
> >>
> >> The point I tried to make is that it isn't related to any hibernation
> >> *failures*. The warning appears even if the add_dev() is successful,
> >> and it also appears if I touch only the *boot* cpu cpufreq attributes.
> >
> > Okay.. But your log below is for add_dev() failure case.
>
> Any suggestion on how to capture warnings issued between freeze and thaw
> is appreciated :-)
You can try to add no_console_suspend to the kernel's command line.
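For example, on a GRUB2 system that would be something like this (file
locations and the config-regeneration command vary by distro):

```shell
# One-off: edit the kernel line at the GRUB boot menu and append:
#     no_console_suspend
#
# Persistent: add it to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub,
# then regenerate the configuration and reboot:
grub-mkconfig -o /boot/grub/grub.cfg

# After rebooting, confirm the parameter took effect:
grep -o no_console_suspend /proc/cmdline
```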
Thanks!
--
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.
On Monday, January 06, 2014 04:24:09 PM Viresh Kumar wrote:
> On 6 January 2014 16:19, Bjørn Mork <[email protected]> wrote:
> > I cancelled by pressing backspace while the image was being written.
Basically, stuff is still frozen at this point, but devices are fully
functional. It's kind of like returning an error code from swsusp_write()
in hibernate() (kernel/power/hibernate.c).
> I will answer other queries on next mail, but a quick question:
> Do you get these warnings even if you don't cancel hibernation?
Good question. :-)
Thanks!
--
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.