2014-10-24 16:08:19

by Eric B Munson

Subject: Commit 35ce7f29a breaks hibernation for XPS 13

Paul,

As of 3.18-rc1 I can no longer hibernate my Dell XPS-13. Bisect points
the finger at 35ce7f29a. Reverting that commit confirms it: with the
revert applied, I can once again hibernate my machine.

When the hibernation fails I see this in dmesg:
[ 37.953313] PM: Syncing filesystems ... done.
[ 37.963694] Freezing user space processes ... (elapsed 0.001 seconds) done.
[ 37.965297] PM: Marking nosave pages: [mem 0x00000000-0x00000fff]
[ 37.965299] PM: Marking nosave pages: [mem 0x00058000-0x00058fff]
[ 37.965301] PM: Marking nosave pages: [mem 0x0009d000-0x000fffff]
[ 37.965304] PM: Marking nosave pages: [mem 0xc496a000-0xc4b6bfff]
[ 37.965315] PM: Marking nosave pages: [mem 0xdadb7000-0xdcffefff]
[ 37.965479] PM: Marking nosave pages: [mem 0xdd000000-0xffffffff]
[ 37.966000] PM: Basic memory bitmaps created
[ 37.966046] PM: Preallocating image memory... done (allocated 181989 pages)
[ 38.141524] PM: Allocated 727956 kbytes in 0.17 seconds (4282.09 MB/s)
[ 38.141525] Freezing remaining freezable tasks ...
[ 58.151863] Freezing of tasks failed after 20.004 seconds (0 tasks refusing to freeze, wq_busy=1):
[ 58.151894]
[ 58.151896] Restarting kernel threads ... done.
[ 58.181915] PM: Basic memory bitmaps freed
[ 58.181917] Restarting tasks ... done.


I am not sure what else I can provide that might be useful, but I did
see the thread on net-dev about this same commit. Please CC me on any
fixes and I will be happy to test.

Eric


2014-10-24 16:20:28

by Paul E. McKenney

Subject: Re: Commit 35ce7f29a breaks hibernation for XPS 13

On Fri, Oct 24, 2014 at 12:08:15PM -0400, Eric B Munson wrote:
> Paul,
>
> As of 3.18-rc1 I can no longer hibernate my Dell XPS-13. Bisect points
> the finger at 35ce7f29a. A revert of that commit confirms, I can once
> again hibernate my machine without it.
>
> When the hibernation fails I see this in dmesg:
> [ 37.953313] PM: Syncing filesystems ... done.
> [ 37.963694] Freezing user space processes ... (elapsed 0.001 seconds) done.
> [ 37.965297] PM: Marking nosave pages: [mem 0x00000000-0x00000fff]
> [ 37.965299] PM: Marking nosave pages: [mem 0x00058000-0x00058fff]
> [ 37.965301] PM: Marking nosave pages: [mem 0x0009d000-0x000fffff]
> [ 37.965304] PM: Marking nosave pages: [mem 0xc496a000-0xc4b6bfff]
> [ 37.965315] PM: Marking nosave pages: [mem 0xdadb7000-0xdcffefff]
> [ 37.965479] PM: Marking nosave pages: [mem 0xdd000000-0xffffffff]
> [ 37.966000] PM: Basic memory bitmaps created
> [ 37.966046] PM: Preallocating image memory... done (allocated 181989 pages)
> [ 38.141524] PM: Allocated 727956 kbytes in 0.17 seconds (4282.09 MB/s)
> [ 38.141525] Freezing remaining freezable tasks ...
> [ 58.151863] Freezing of tasks failed after 20.004 seconds (0 tasks refusing to freeze, wq_busy=1):
> [ 58.151894]
> [ 58.151896] Restarting kernel threads ... done.
> [ 58.181915] PM: Basic memory bitmaps freed
> [ 58.181917] Restarting tasks ... done.
>
>
> I am not sure what else I can provide that might be useful, but I did
> see the thread on net-dev about this same commit. Please CC me on any
> fixes and I will be happy to test.

Thank you for the bug report!

Does the following patch help?

Thanx, Paul

------------------------------------------------------------------------

rcu: More on deadlock between CPU hotplug and expedited grace periods

Commit dd56af42bd82 (rcu: Eliminate deadlock between CPU hotplug and
expedited grace periods) was incomplete. Although it did eliminate
deadlocks involving synchronize_sched_expedited()'s acquisition of
cpu_hotplug.lock via get_online_cpus(), it did nothing about the similar
deadlock involving acquisition of this same lock via put_online_cpus().
This deadlock became apparent with testing involving hibernation.

This commit therefore changes put_online_cpus() acquisition of this lock
to be conditional, and increments a new cpu_hotplug.puts_pending field
in case of acquisition failure. Then cpu_hotplug_begin() checks for this
new field being non-zero, and applies any changes to cpu_hotplug.refcount.

Reported-by: Jiri Kosina <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
Tested-by: Jiri Kosina <[email protected]>
Tested-by: Borislav Petkov <[email protected]>

diff --git a/kernel/cpu.c b/kernel/cpu.c
index 356450f09c1f..90a3d017b90c 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -64,6 +64,8 @@ static struct {
* an ongoing cpu hotplug operation.
*/
int refcount;
+ /* And allows lockless put_online_cpus(). */
+ atomic_t puts_pending;

#ifdef CONFIG_DEBUG_LOCK_ALLOC
struct lockdep_map dep_map;
@@ -113,7 +115,11 @@ void put_online_cpus(void)
{
if (cpu_hotplug.active_writer == current)
return;
- mutex_lock(&cpu_hotplug.lock);
+ if (!mutex_trylock(&cpu_hotplug.lock)) {
+ atomic_inc(&cpu_hotplug.puts_pending);
+ cpuhp_lock_release();
+ return;
+ }

if (WARN_ON(!cpu_hotplug.refcount))
cpu_hotplug.refcount++; /* try to fix things up */
@@ -155,6 +161,12 @@ void cpu_hotplug_begin(void)
cpuhp_lock_acquire();
for (;;) {
mutex_lock(&cpu_hotplug.lock);
+ if (atomic_read(&cpu_hotplug.puts_pending)) {
+ int delta;
+
+ delta = atomic_xchg(&cpu_hotplug.puts_pending, 0);
+ cpu_hotplug.refcount -= delta;
+ }
if (likely(!cpu_hotplug.refcount))
break;
__set_current_state(TASK_UNINTERRUPTIBLE);
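
The deferred-put idea in the commit message above can be modelled in user space. What follows is only a sketch under stated assumptions, not the kernel code: `struct hotplug_model` and the `model_*` functions are hypothetical names, a pthread mutex stands in for `cpu_hotplug.lock`, and the writer wakeup that the real `put_online_cpus()` performs is left out.

```c
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical user-space stand-in for the kernel's cpu_hotplug state. */
struct hotplug_model {
	pthread_mutex_t lock;    /* plays the role of cpu_hotplug.lock */
	int refcount;            /* protected by lock */
	atomic_int puts_pending; /* puts that could not take the lock */
};

static void model_init(struct hotplug_model *m)
{
	pthread_mutex_init(&m->lock, NULL);
	m->refcount = 0;
	atomic_init(&m->puts_pending, 0);
}

/* Reader entry: take a reference under the lock. */
static void model_get(struct hotplug_model *m)
{
	pthread_mutex_lock(&m->lock);
	m->refcount++;
	pthread_mutex_unlock(&m->lock);
}

/*
 * Reader exit: never blocks. If the lock is busy, record the put in
 * puts_pending and let the writer account for it later.
 */
static void model_put(struct hotplug_model *m)
{
	if (pthread_mutex_trylock(&m->lock) != 0) {
		atomic_fetch_add(&m->puts_pending, 1);
		return;
	}
	m->refcount--;
	pthread_mutex_unlock(&m->lock);
}

/*
 * Writer side, called with m->lock held: fold any deferred puts into
 * refcount and return the updated value.
 */
static int model_fold_pending(struct hotplug_model *m)
{
	int delta = atomic_exchange(&m->puts_pending, 0);

	m->refcount -= delta;
	return m->refcount;
}
```

In this model a put issued while the lock is held only bumps `puts_pending`; the reference count becomes accurate again once the writer calls `model_fold_pending()`, which mirrors the check added to `cpu_hotplug_begin()` in the diff.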

2014-10-24 16:36:14

by Eric B Munson

Subject: Re: Commit 35ce7f29a breaks hibernation for XPS 13

On Fri, 24 Oct 2014, Paul E. McKenney wrote:

> On Fri, Oct 24, 2014 at 12:08:15PM -0400, Eric B Munson wrote:
> > Paul,
> >
> > As of 3.18-rc1 I can no longer hibernate my Dell XPS-13. Bisect points
> > the finger at 35ce7f29a. A revert of that commit confirms, I can once
> > again hibernate my machine without it.
> >
> > When the hibernation fails I see this in dmesg:
> > [ 37.953313] PM: Syncing filesystems ... done.
> > [ 37.963694] Freezing user space processes ... (elapsed 0.001 seconds) done.
> > [ 37.965297] PM: Marking nosave pages: [mem 0x00000000-0x00000fff]
> > [ 37.965299] PM: Marking nosave pages: [mem 0x00058000-0x00058fff]
> > [ 37.965301] PM: Marking nosave pages: [mem 0x0009d000-0x000fffff]
> > [ 37.965304] PM: Marking nosave pages: [mem 0xc496a000-0xc4b6bfff]
> > [ 37.965315] PM: Marking nosave pages: [mem 0xdadb7000-0xdcffefff]
> > [ 37.965479] PM: Marking nosave pages: [mem 0xdd000000-0xffffffff]
> > [ 37.966000] PM: Basic memory bitmaps created
> > [ 37.966046] PM: Preallocating image memory... done (allocated 181989 pages)
> > [ 38.141524] PM: Allocated 727956 kbytes in 0.17 seconds (4282.09 MB/s)
> > [ 38.141525] Freezing remaining freezable tasks ...
> > [ 58.151863] Freezing of tasks failed after 20.004 seconds (0 tasks refusing to freeze, wq_busy=1):
> > [ 58.151894]
> > [ 58.151896] Restarting kernel threads ... done.
> > [ 58.181915] PM: Basic memory bitmaps freed
> > [ 58.181917] Restarting tasks ... done.
> >
> >
> > I am not sure what else I can provide that might be useful, but I did
> > see the thread on net-dev about this same commit. Please CC me on any
> > fixes and I will be happy to test.
>
> Thank you for the bug report!
>
> Does the following patch help?
>
> Thanx, Paul

Paul,

This patch does not help. I see the same dmesg output and failure to
hibernate.

Eric


2014-10-24 17:22:16

by Paul E. McKenney

Subject: Re: Commit 35ce7f29a breaks hibernation for XPS 13

On Fri, Oct 24, 2014 at 12:36:12PM -0400, Eric B Munson wrote:
> On Fri, 24 Oct 2014, Paul E. McKenney wrote:
>
> > On Fri, Oct 24, 2014 at 12:08:15PM -0400, Eric B Munson wrote:
> > > Paul,
> > >
> > > As of 3.18-rc1 I can no longer hibernate my Dell XPS-13. Bisect points
> > > the finger at 35ce7f29a. A revert of that commit confirms, I can once
> > > again hibernate my machine without it.
> > >
> > > When the hibernation fails I see this in dmesg:
> > > [ 37.953313] PM: Syncing filesystems ... done.
> > > [ 37.963694] Freezing user space processes ... (elapsed 0.001 seconds) done.
> > > [ 37.965297] PM: Marking nosave pages: [mem 0x00000000-0x00000fff]
> > > [ 37.965299] PM: Marking nosave pages: [mem 0x00058000-0x00058fff]
> > > [ 37.965301] PM: Marking nosave pages: [mem 0x0009d000-0x000fffff]
> > > [ 37.965304] PM: Marking nosave pages: [mem 0xc496a000-0xc4b6bfff]
> > > [ 37.965315] PM: Marking nosave pages: [mem 0xdadb7000-0xdcffefff]
> > > [ 37.965479] PM: Marking nosave pages: [mem 0xdd000000-0xffffffff]
> > > [ 37.966000] PM: Basic memory bitmaps created
> > > [ 37.966046] PM: Preallocating image memory... done (allocated 181989 pages)
> > > [ 38.141524] PM: Allocated 727956 kbytes in 0.17 seconds (4282.09 MB/s)
> > > [ 38.141525] Freezing remaining freezable tasks ...
> > > [ 58.151863] Freezing of tasks failed after 20.004 seconds (0 tasks refusing to freeze, wq_busy=1):
> > > [ 58.151894]
> > > [ 58.151896] Restarting kernel threads ... done.
> > > [ 58.181915] PM: Basic memory bitmaps freed
> > > [ 58.181917] Restarting tasks ... done.
> > >
> > >
> > > I am not sure what else I can provide that might be useful, but I did
> > > see the thread on net-dev about this same commit. Please CC me on any
> > > fixes and I will be happy to test.
> >
> > Thank you for the bug report!
> >
> > Does the following patch help?
> >
> > Thanx, Paul
>
> Paul,
>
> This patch does not help. I see the same dmesg output and failure to
> hibernate.

Thank you for testing it. Does the following (untested, might not even
build) patch help? (Or feel free to wait until I have done some testing
on it.)

Thanx, Paul

------------------------------------------------------------------------

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 29fb23f33c18..927c17b081c7 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -2546,9 +2546,13 @@ static void rcu_spawn_one_nocb_kthread(struct rcu_state *rsp, int cpu)
rdp->nocb_leader = rdp_spawn;
if (rdp_last && rdp != rdp_spawn)
rdp_last->nocb_next_follower = rdp;
- rdp_last = rdp;
- rdp = rdp->nocb_next_follower;
- rdp_last->nocb_next_follower = NULL;
+ if (rdp == rdp_spawn) {
+ rdp = rdp->nocb_next_follower;
+ } else {
+ rdp_last = rdp;
+ rdp = rdp->nocb_next_follower;
+ rdp_last->nocb_next_follower = NULL;
+ }
} while (rdp);
rdp_spawn->nocb_next_follower = rdp_old_leader;
}
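
To make the effect of the loop change easier to see, here is a small user-space model of the relinking. It is a sketch only: `struct nocb_node` and `nocb_adopt_group()` are hypothetical names standing in for `struct rcu_data` and the loop inside `rcu_spawn_one_nocb_kthread()`, and the early-return guard is an assumption added so the function is safe to call on its own. Reading the diff, the old loop appears to treat `rdp_spawn` like any other list member, so followers queued behind it could be dropped when `rdp_spawn->nocb_next_follower` is overwritten at the end; the patched loop skips over `rdp_spawn` instead.

```c
#include <stddef.h>

/* Hypothetical, stripped-down stand-in for struct rcu_data. */
struct nocb_node {
	struct nocb_node *leader;
	struct nocb_node *next_follower;
};

/*
 * Make 'spawn' the leader of the group currently headed by its old
 * leader, following the patched loop: 'spawn' itself is skipped rather
 * than relinked, so the nodes queued behind it stay on the list.
 */
static void nocb_adopt_group(struct nocb_node *spawn)
{
	struct nocb_node *old_leader = spawn->leader;
	struct nocb_node *last = NULL;
	struct nocb_node *n = old_leader;

	/* Assumed guard: nothing to reorganize if spawn already leads. */
	if (old_leader == spawn)
		return;

	do {
		n->leader = spawn;
		if (last && n != spawn)
			last->next_follower = n;
		if (n == spawn) {
			/* Step past the new leader without unlinking it. */
			n = n->next_follower;
		} else {
			last = n;
			n = n->next_follower;
			last->next_follower = NULL;
		}
	} while (n);

	/* The old leader and its followers now hang off the new leader. */
	spawn->next_follower = old_leader;
}
```

Starting from a list `old -> a -> spawn -> b`, the model ends with `spawn -> old -> a -> b` and every node's `leader` pointing at `spawn`.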

2014-10-24 18:40:33

by Eric B Munson

Subject: Re: Commit 35ce7f29a breaks hibernation for XPS 13

On Fri, 24 Oct 2014, Paul E. McKenney wrote:

> On Fri, Oct 24, 2014 at 12:36:12PM -0400, Eric B Munson wrote:
> > On Fri, 24 Oct 2014, Paul E. McKenney wrote:
> >
> > > On Fri, Oct 24, 2014 at 12:08:15PM -0400, Eric B Munson wrote:
> > > > Paul,
> > > >
> > > > As of 3.18-rc1 I can no longer hibernate my Dell XPS-13. Bisect points
> > > > the finger at 35ce7f29a. A revert of that commit confirms, I can once
> > > > again hibernate my machine without it.
> > > >
> > > > When the hibernation fails I see this in dmesg:
> > > > [ 37.953313] PM: Syncing filesystems ... done.
> > > > [ 37.963694] Freezing user space processes ... (elapsed 0.001 seconds) done.
> > > > [ 37.965297] PM: Marking nosave pages: [mem 0x00000000-0x00000fff]
> > > > [ 37.965299] PM: Marking nosave pages: [mem 0x00058000-0x00058fff]
> > > > [ 37.965301] PM: Marking nosave pages: [mem 0x0009d000-0x000fffff]
> > > > [ 37.965304] PM: Marking nosave pages: [mem 0xc496a000-0xc4b6bfff]
> > > > [ 37.965315] PM: Marking nosave pages: [mem 0xdadb7000-0xdcffefff]
> > > > [ 37.965479] PM: Marking nosave pages: [mem 0xdd000000-0xffffffff]
> > > > [ 37.966000] PM: Basic memory bitmaps created
> > > > [ 37.966046] PM: Preallocating image memory... done (allocated 181989 pages)
> > > > [ 38.141524] PM: Allocated 727956 kbytes in 0.17 seconds (4282.09 MB/s)
> > > > [ 38.141525] Freezing remaining freezable tasks ...
> > > > [ 58.151863] Freezing of tasks failed after 20.004 seconds (0 tasks refusing to freeze, wq_busy=1):
> > > > [ 58.151894]
> > > > [ 58.151896] Restarting kernel threads ... done.
> > > > [ 58.181915] PM: Basic memory bitmaps freed
> > > > [ 58.181917] Restarting tasks ... done.
> > > >
> > > >
> > > > I am not sure what else I can provide that might be useful, but I did
> > > > see the thread on net-dev about this same commit. Please CC me on any
> > > > fixes and I will be happy to test.
> > >
> > > Thank you for the bug report!
> > >
> > > Does the following patch help?
> > >
> > > Thanx, Paul
> >
> > Paul,
> >
> > This patch does not help. I see the same dmesg output and failure to
> > hibernate.
>
> Thank you for testing it. Does the following (untested, might not even
> build) patch help? (Or feel free to wait until I have done some testing
> on it.)
>
> Thanx, Paul

Still didn't help. If it helps, when I attempt to reboot after trying
to hibernate I see a kworker thread hung and get the stack trace below
from that thread. I assume this is the same thread that is holding up
the hibernate.

Oct 24 14:26:46 lappy-486 kernel: [ 240.479810] INFO: task kworker/1:0:16 blocked for more than 120 seconds.
Oct 24 14:26:46 lappy-486 kernel: [ 240.479815] Tainted: G E 3.18.0-rc1+ #78
Oct 24 14:26:46 lappy-486 kernel: [ 240.479816] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 24 14:26:46 lappy-486 kernel: [ 240.479818] kworker/1:0 D ffff88021f254600 0 16 2 0x00000000
Oct 24 14:26:46 lappy-486 kernel: [ 240.479827] Workqueue: usb_hub_wq hub_event
Oct 24 14:26:46 lappy-486 kernel: [ 240.479829] ffff880213a93908 0000000000000046 ffff880213a83200 ffff880213a93fd8
Oct 24 14:26:46 lappy-486 kernel: [ 240.479831] 0000000000014600 0000000000014600 ffff88021357e400 ffff880213a83200
Oct 24 14:26:46 lappy-486 kernel: [ 240.479834] 0000000000014600 ffffffff81c58a10 ffffffff81c58a18 7fffffffffffffff
Oct 24 14:26:46 lappy-486 kernel: [ 240.479836] Call Trace:
Oct 24 14:26:46 lappy-486 kernel: [ 240.479843] [<ffffffff8174d919>] schedule+0x29/0x70
Oct 24 14:26:46 lappy-486 kernel: [ 240.479846] [<ffffffff8175091c>] schedule_timeout+0x20c/0x280
Oct 24 14:26:46 lappy-486 kernel: [ 240.479851] [<ffffffff81097bbd>] ? check_preempt_curr+0x8d/0xa0
Oct 24 14:26:46 lappy-486 kernel: [ 240.479854] [<ffffffff81097bed>] ? ttwu_do_wakeup+0x1d/0xd0
Oct 24 14:26:46 lappy-486 kernel: [ 240.479857] [<ffffffff8174e616>] wait_for_completion+0xa6/0x160
Oct 24 14:26:46 lappy-486 kernel: [ 240.479860] [<ffffffff8109abb0>] ? wake_up_state+0x20/0x20
Oct 24 14:26:46 lappy-486 kernel: [ 240.479863] [<ffffffff810ce267>] _rcu_barrier+0x157/0x200
Oct 24 14:26:46 lappy-486 kernel: [ 240.479865] [<ffffffff810ce365>] rcu_barrier+0x15/0x20
Oct 24 14:26:46 lappy-486 kernel: [ 240.479870] [<ffffffff816632f0>] netdev_run_todo+0x60/0x300
Oct 24 14:26:46 lappy-486 kernel: [ 240.479874] [<ffffffff8166ddee>] rtnl_unlock+0xe/0x10
Oct 24 14:26:46 lappy-486 kernel: [ 240.479877] [<ffffffff8165d3c5>] unregister_netdev+0x25/0x30
Oct 24 14:26:46 lappy-486 kernel: [ 240.479883] [<ffffffffa05b9768>] usbnet_disconnect+0x48/0xf0 [usbnet]
Oct 24 14:26:46 lappy-486 kernel: [ 240.479888] [<ffffffff81577a28>] usb_unbind_interface+0x1f8/0x2c0
Oct 24 14:26:46 lappy-486 kernel: [ 240.479893] [<ffffffff814c90e6>] ? rpm_idle+0xd6/0x2b0
Oct 24 14:26:46 lappy-486 kernel: [ 240.479898] [<ffffffff814bf3cf>] __device_release_driver+0x7f/0xf0
Oct 24 14:26:46 lappy-486 kernel: [ 240.479901] [<ffffffff814bf463>] device_release_driver+0x23/0x30
Oct 24 14:26:46 lappy-486 kernel: [ 240.479904] [<ffffffff814bed58>] bus_remove_device+0x108/0x180
Oct 24 14:26:46 lappy-486 kernel: [ 240.479907] [<ffffffff814bb4d9>] device_del+0x129/0x1e0
Oct 24 14:26:46 lappy-486 kernel: [ 240.479910] [<ffffffff81575140>] usb_disable_device+0xb0/0x290
Oct 24 14:26:46 lappy-486 kernel: [ 240.479913] [<ffffffff8156a554>] usb_disconnect+0x94/0x2c0
Oct 24 14:26:46 lappy-486 kernel: [ 240.479915] [<ffffffff8156cbe4>] hub_event+0x994/0x1500
Oct 24 14:26:46 lappy-486 kernel: [ 240.479919] [<ffffffff810a4c5e>] ? dequeue_task_fair+0x44e/0x660
Oct 24 14:26:46 lappy-486 kernel: [ 240.479924] [<ffffffff81088280>] process_one_work+0x150/0x3f0
Oct 24 14:26:46 lappy-486 kernel: [ 240.479927] [<ffffffff81088971>] worker_thread+0x121/0x520
Oct 24 14:26:46 lappy-486 kernel: [ 240.479930] [<ffffffff81088850>] ? rescuer_thread+0x330/0x330
Oct 24 14:26:46 lappy-486 kernel: [ 240.479932] [<ffffffff8108d942>] kthread+0xd2/0xf0
Oct 24 14:26:46 lappy-486 kernel: [ 240.479935] [<ffffffff8108d870>] ? kthread_create_on_node+0x180/0x180
Oct 24 14:26:46 lappy-486 kernel: [ 240.479939] [<ffffffff81751ffc>] ret_from_fork+0x7c/0xb0
Oct 24 14:26:46 lappy-486 kernel: [ 240.479941] [<ffffffff8108d870>] ? kthread_create_on_node+0x180/0x180

Eric


2014-10-24 20:35:18

by Paul E. McKenney

Subject: Re: Commit 35ce7f29a breaks hibernation for XPS 13

On Fri, Oct 24, 2014 at 02:40:28PM -0400, Eric B Munson wrote:
> On Fri, 24 Oct 2014, Paul E. McKenney wrote:
>
> > On Fri, Oct 24, 2014 at 12:36:12PM -0400, Eric B Munson wrote:
> > > On Fri, 24 Oct 2014, Paul E. McKenney wrote:
> > >
> > > > On Fri, Oct 24, 2014 at 12:08:15PM -0400, Eric B Munson wrote:
> > > > > Paul,
> > > > >
> > > > > As of 3.18-rc1 I can no longer hibernate my Dell XPS-13. Bisect points
> > > > > the finger at 35ce7f29a. A revert of that commit confirms, I can once
> > > > > again hibernate my machine without it.
> > > > >
> > > > > When the hibernation fails I see this in dmesg:
> > > > > [ 37.953313] PM: Syncing filesystems ... done.
> > > > > [ 37.963694] Freezing user space processes ... (elapsed 0.001 seconds) done.
> > > > > [ 37.965297] PM: Marking nosave pages: [mem 0x00000000-0x00000fff]
> > > > > [ 37.965299] PM: Marking nosave pages: [mem 0x00058000-0x00058fff]
> > > > > [ 37.965301] PM: Marking nosave pages: [mem 0x0009d000-0x000fffff]
> > > > > [ 37.965304] PM: Marking nosave pages: [mem 0xc496a000-0xc4b6bfff]
> > > > > [ 37.965315] PM: Marking nosave pages: [mem 0xdadb7000-0xdcffefff]
> > > > > [ 37.965479] PM: Marking nosave pages: [mem 0xdd000000-0xffffffff]
> > > > > [ 37.966000] PM: Basic memory bitmaps created
> > > > > [ 37.966046] PM: Preallocating image memory... done (allocated 181989 pages)
> > > > > [ 38.141524] PM: Allocated 727956 kbytes in 0.17 seconds (4282.09 MB/s)
> > > > > [ 38.141525] Freezing remaining freezable tasks ...
> > > > > [ 58.151863] Freezing of tasks failed after 20.004 seconds (0 tasks refusing to freeze, wq_busy=1):
> > > > > [ 58.151894]
> > > > > [ 58.151896] Restarting kernel threads ... done.
> > > > > [ 58.181915] PM: Basic memory bitmaps freed
> > > > > [ 58.181917] Restarting tasks ... done.
> > > > >
> > > > >
> > > > > I am not sure what else I can provide that might be useful, but I did
> > > > > see the thread on net-dev about this same commit. Please CC me on any
> > > > > fixes and I will be happy to test.
> > > >
> > > > Thank you for the bug report!
> > > >
> > > > Does the following patch help?
> > > >
> > > > Thanx, Paul
> > >
> > > Paul,
> > >
> > > This patch does not help. I see the same dmesg output and failure to
> > > hibernate.
> >
> > Thank you for testing it. Does the following (untested, might not even
> > build) patch help? (Or feel free to wait until I have done some testing
> > on it.)
> >
> > Thanx, Paul
>
> Still didn't help. If it helps, when I attempt to reboot after trying
> to hibernate I see a kworker thread hung and get the stack trace below
> from that thread. I assume this is the same thread that is holding up
> the hibernate.

Yep, looks like something that some other people are running into as well.

If you turn off CONFIG_RCU_NOCB_CPU, do you still get the failure?

Thanx, Paul

> Oct 24 14:26:46 lappy-486 kernel: [ 240.479810] INFO: task kworker/1:0:16 blocked for more than 120 seconds.
> Oct 24 14:26:46 lappy-486 kernel: [ 240.479815] Tainted: G E 3.18.0-rc1+ #78
> Oct 24 14:26:46 lappy-486 kernel: [ 240.479816] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Oct 24 14:26:46 lappy-486 kernel: [ 240.479818] kworker/1:0 D ffff88021f254600 0 16 2 0x00000000
> Oct 24 14:26:46 lappy-486 kernel: [ 240.479827] Workqueue: usb_hub_wq hub_event
> Oct 24 14:26:46 lappy-486 kernel: [ 240.479829] ffff880213a93908 0000000000000046 ffff880213a83200 ffff880213a93fd8
> Oct 24 14:26:46 lappy-486 kernel: [ 240.479831] 0000000000014600 0000000000014600 ffff88021357e400 ffff880213a83200
> Oct 24 14:26:46 lappy-486 kernel: [ 240.479834] 0000000000014600 ffffffff81c58a10 ffffffff81c58a18 7fffffffffffffff
> Oct 24 14:26:46 lappy-486 kernel: [ 240.479836] Call Trace:
> Oct 24 14:26:46 lappy-486 kernel: [ 240.479843] [<ffffffff8174d919>] schedule+0x29/0x70
> Oct 24 14:26:46 lappy-486 kernel: [ 240.479846] [<ffffffff8175091c>] schedule_timeout+0x20c/0x280
> Oct 24 14:26:46 lappy-486 kernel: [ 240.479851] [<ffffffff81097bbd>] ? check_preempt_curr+0x8d/0xa0
> Oct 24 14:26:46 lappy-486 kernel: [ 240.479854] [<ffffffff81097bed>] ? ttwu_do_wakeup+0x1d/0xd0
> Oct 24 14:26:46 lappy-486 kernel: [ 240.479857] [<ffffffff8174e616>] wait_for_completion+0xa6/0x160
> Oct 24 14:26:46 lappy-486 kernel: [ 240.479860] [<ffffffff8109abb0>] ? wake_up_state+0x20/0x20
> Oct 24 14:26:46 lappy-486 kernel: [ 240.479863] [<ffffffff810ce267>] _rcu_barrier+0x157/0x200
> Oct 24 14:26:46 lappy-486 kernel: [ 240.479865] [<ffffffff810ce365>] rcu_barrier+0x15/0x20
> Oct 24 14:26:46 lappy-486 kernel: [ 240.479870] [<ffffffff816632f0>] netdev_run_todo+0x60/0x300
> Oct 24 14:26:46 lappy-486 kernel: [ 240.479874] [<ffffffff8166ddee>] rtnl_unlock+0xe/0x10
> Oct 24 14:26:46 lappy-486 kernel: [ 240.479877] [<ffffffff8165d3c5>] unregister_netdev+0x25/0x30
> Oct 24 14:26:46 lappy-486 kernel: [ 240.479883] [<ffffffffa05b9768>] usbnet_disconnect+0x48/0xf0 [usbnet]
> Oct 24 14:26:46 lappy-486 kernel: [ 240.479888] [<ffffffff81577a28>] usb_unbind_interface+0x1f8/0x2c0
> Oct 24 14:26:46 lappy-486 kernel: [ 240.479893] [<ffffffff814c90e6>] ? rpm_idle+0xd6/0x2b0
> Oct 24 14:26:46 lappy-486 kernel: [ 240.479898] [<ffffffff814bf3cf>] __device_release_driver+0x7f/0xf0
> Oct 24 14:26:46 lappy-486 kernel: [ 240.479901] [<ffffffff814bf463>] device_release_driver+0x23/0x30
> Oct 24 14:26:46 lappy-486 kernel: [ 240.479904] [<ffffffff814bed58>] bus_remove_device+0x108/0x180
> Oct 24 14:26:46 lappy-486 kernel: [ 240.479907] [<ffffffff814bb4d9>] device_del+0x129/0x1e0
> Oct 24 14:26:46 lappy-486 kernel: [ 240.479910] [<ffffffff81575140>] usb_disable_device+0xb0/0x290
> Oct 24 14:26:46 lappy-486 kernel: [ 240.479913] [<ffffffff8156a554>] usb_disconnect+0x94/0x2c0
> Oct 24 14:26:46 lappy-486 kernel: [ 240.479915] [<ffffffff8156cbe4>] hub_event+0x994/0x1500
> Oct 24 14:26:46 lappy-486 kernel: [ 240.479919] [<ffffffff810a4c5e>] ? dequeue_task_fair+0x44e/0x660
> Oct 24 14:26:46 lappy-486 kernel: [ 240.479924] [<ffffffff81088280>] process_one_work+0x150/0x3f0
> Oct 24 14:26:46 lappy-486 kernel: [ 240.479927] [<ffffffff81088971>] worker_thread+0x121/0x520
> Oct 24 14:26:46 lappy-486 kernel: [ 240.479930] [<ffffffff81088850>] ? rescuer_thread+0x330/0x330
> Oct 24 14:26:46 lappy-486 kernel: [ 240.479932] [<ffffffff8108d942>] kthread+0xd2/0xf0
> Oct 24 14:26:46 lappy-486 kernel: [ 240.479935] [<ffffffff8108d870>] ? kthread_create_on_node+0x180/0x180
> Oct 24 14:26:46 lappy-486 kernel: [ 240.479939] [<ffffffff81751ffc>] ret_from_fork+0x7c/0xb0
> Oct 24 14:26:46 lappy-486 kernel: [ 240.479941] [<ffffffff8108d870>] ? kthread_create_on_node+0x180/0x180
>
> Eric
>

2014-10-27 13:48:00

by Eric B Munson

Subject: Re: Commit 35ce7f29a breaks hibernation for XPS 13

On Fri, 24 Oct 2014, Paul E. McKenney wrote:

> On Fri, Oct 24, 2014 at 02:40:28PM -0400, Eric B Munson wrote:
> > On Fri, 24 Oct 2014, Paul E. McKenney wrote:
> >
> > > On Fri, Oct 24, 2014 at 12:36:12PM -0400, Eric B Munson wrote:
> > > > On Fri, 24 Oct 2014, Paul E. McKenney wrote:
> > > >
> > > > > On Fri, Oct 24, 2014 at 12:08:15PM -0400, Eric B Munson wrote:
> > > > > > Paul,
> > > > > >
> > > > > > As of 3.18-rc1 I can no longer hibernate my Dell XPS-13. Bisect points
> > > > > > the finger at 35ce7f29a. A revert of that commit confirms, I can once
> > > > > > again hibernate my machine without it.
> > > > > >
> > > > > > When the hibernation fails I see this in dmesg:
> > > > > > [ 37.953313] PM: Syncing filesystems ... done.
> > > > > > [ 37.963694] Freezing user space processes ... (elapsed 0.001 seconds) done.
> > > > > > [ 37.965297] PM: Marking nosave pages: [mem 0x00000000-0x00000fff]
> > > > > > [ 37.965299] PM: Marking nosave pages: [mem 0x00058000-0x00058fff]
> > > > > > [ 37.965301] PM: Marking nosave pages: [mem 0x0009d000-0x000fffff]
> > > > > > [ 37.965304] PM: Marking nosave pages: [mem 0xc496a000-0xc4b6bfff]
> > > > > > [ 37.965315] PM: Marking nosave pages: [mem 0xdadb7000-0xdcffefff]
> > > > > > [ 37.965479] PM: Marking nosave pages: [mem 0xdd000000-0xffffffff]
> > > > > > [ 37.966000] PM: Basic memory bitmaps created
> > > > > > [ 37.966046] PM: Preallocating image memory... done (allocated 181989 pages)
> > > > > > [ 38.141524] PM: Allocated 727956 kbytes in 0.17 seconds (4282.09 MB/s)
> > > > > > [ 38.141525] Freezing remaining freezable tasks ...
> > > > > > [ 58.151863] Freezing of tasks failed after 20.004 seconds (0 tasks refusing to freeze, wq_busy=1):
> > > > > > [ 58.151894]
> > > > > > [ 58.151896] Restarting kernel threads ... done.
> > > > > > [ 58.181915] PM: Basic memory bitmaps freed
> > > > > > [ 58.181917] Restarting tasks ... done.
> > > > > >
> > > > > >
> > > > > > I am not sure what else I can provide that might be useful, but I did
> > > > > > see the thread on net-dev about this same commit. Please CC me on any
> > > > > > fixes and I will be happy to test.
> > > > >
> > > > > Thank you for the bug report!
> > > > >
> > > > > Does the following patch help?
> > > > >
> > > > > Thanx, Paul
> > > >
> > > > Paul,
> > > >
> > > > This patch does not help. I see the same dmesg output and failure to
> > > > hibernate.
> > >
> > > Thank you for testing it. Does the following (untested, might not even
> > > build) patch help? (Or feel free to wait until I have done some testing
> > > on it.)
> > >
> > > Thanx, Paul
> >
> > Still didn't help. If it helps, when I attempt to reboot after trying
> > to hibernate I see a kworker thread hung and get the stack trace below
> > from that thread. I assume this is the same thread that is holding up
> > the hibernate.
>
> Yep, looks like something that some other people are running into as well.
>
> If you turn off CONFIG_RCU_NOCB_CPU, do you still get the failure?
>
> Thanx, Paul
>

Disabling CONFIG_RCU_NOCB_CPU fixes the problem. I am able to hibernate
and resume successfully.

Eric

> > Oct 24 14:26:46 lappy-486 kernel: [ 240.479810] INFO: task kworker/1:0:16 blocked for more than 120 seconds.
> > Oct 24 14:26:46 lappy-486 kernel: [ 240.479815] Tainted: G E 3.18.0-rc1+ #78
> > Oct 24 14:26:46 lappy-486 kernel: [ 240.479816] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > Oct 24 14:26:46 lappy-486 kernel: [ 240.479818] kworker/1:0 D ffff88021f254600 0 16 2 0x00000000
> > Oct 24 14:26:46 lappy-486 kernel: [ 240.479827] Workqueue: usb_hub_wq hub_event
> > Oct 24 14:26:46 lappy-486 kernel: [ 240.479829] ffff880213a93908 0000000000000046 ffff880213a83200 ffff880213a93fd8
> > Oct 24 14:26:46 lappy-486 kernel: [ 240.479831] 0000000000014600 0000000000014600 ffff88021357e400 ffff880213a83200
> > Oct 24 14:26:46 lappy-486 kernel: [ 240.479834] 0000000000014600 ffffffff81c58a10 ffffffff81c58a18 7fffffffffffffff
> > Oct 24 14:26:46 lappy-486 kernel: [ 240.479836] Call Trace:
> > Oct 24 14:26:46 lappy-486 kernel: [ 240.479843] [<ffffffff8174d919>] schedule+0x29/0x70
> > Oct 24 14:26:46 lappy-486 kernel: [ 240.479846] [<ffffffff8175091c>] schedule_timeout+0x20c/0x280
> > Oct 24 14:26:46 lappy-486 kernel: [ 240.479851] [<ffffffff81097bbd>] ? check_preempt_curr+0x8d/0xa0
> > Oct 24 14:26:46 lappy-486 kernel: [ 240.479854] [<ffffffff81097bed>] ? ttwu_do_wakeup+0x1d/0xd0
> > Oct 24 14:26:46 lappy-486 kernel: [ 240.479857] [<ffffffff8174e616>] wait_for_completion+0xa6/0x160
> > Oct 24 14:26:46 lappy-486 kernel: [ 240.479860] [<ffffffff8109abb0>] ? wake_up_state+0x20/0x20
> > Oct 24 14:26:46 lappy-486 kernel: [ 240.479863] [<ffffffff810ce267>] _rcu_barrier+0x157/0x200
> > Oct 24 14:26:46 lappy-486 kernel: [ 240.479865] [<ffffffff810ce365>] rcu_barrier+0x15/0x20
> > Oct 24 14:26:46 lappy-486 kernel: [ 240.479870] [<ffffffff816632f0>] netdev_run_todo+0x60/0x300
> > Oct 24 14:26:46 lappy-486 kernel: [ 240.479874] [<ffffffff8166ddee>] rtnl_unlock+0xe/0x10
> > Oct 24 14:26:46 lappy-486 kernel: [ 240.479877] [<ffffffff8165d3c5>] unregister_netdev+0x25/0x30
> > Oct 24 14:26:46 lappy-486 kernel: [ 240.479883] [<ffffffffa05b9768>] usbnet_disconnect+0x48/0xf0 [usbnet]
> > Oct 24 14:26:46 lappy-486 kernel: [ 240.479888] [<ffffffff81577a28>] usb_unbind_interface+0x1f8/0x2c0
> > Oct 24 14:26:46 lappy-486 kernel: [ 240.479893] [<ffffffff814c90e6>] ? rpm_idle+0xd6/0x2b0
> > Oct 24 14:26:46 lappy-486 kernel: [ 240.479898] [<ffffffff814bf3cf>] __device_release_driver+0x7f/0xf0
> > Oct 24 14:26:46 lappy-486 kernel: [ 240.479901] [<ffffffff814bf463>] device_release_driver+0x23/0x30
> > Oct 24 14:26:46 lappy-486 kernel: [ 240.479904] [<ffffffff814bed58>] bus_remove_device+0x108/0x180
> > Oct 24 14:26:46 lappy-486 kernel: [ 240.479907] [<ffffffff814bb4d9>] device_del+0x129/0x1e0
> > Oct 24 14:26:46 lappy-486 kernel: [ 240.479910] [<ffffffff81575140>] usb_disable_device+0xb0/0x290
> > Oct 24 14:26:46 lappy-486 kernel: [ 240.479913] [<ffffffff8156a554>] usb_disconnect+0x94/0x2c0
> > Oct 24 14:26:46 lappy-486 kernel: [ 240.479915] [<ffffffff8156cbe4>] hub_event+0x994/0x1500
> > Oct 24 14:26:46 lappy-486 kernel: [ 240.479919] [<ffffffff810a4c5e>] ? dequeue_task_fair+0x44e/0x660
> > Oct 24 14:26:46 lappy-486 kernel: [ 240.479924] [<ffffffff81088280>] process_one_work+0x150/0x3f0
> > Oct 24 14:26:46 lappy-486 kernel: [ 240.479927] [<ffffffff81088971>] worker_thread+0x121/0x520
> > Oct 24 14:26:46 lappy-486 kernel: [ 240.479930] [<ffffffff81088850>] ? rescuer_thread+0x330/0x330
> > Oct 24 14:26:46 lappy-486 kernel: [ 240.479932] [<ffffffff8108d942>] kthread+0xd2/0xf0
> > Oct 24 14:26:46 lappy-486 kernel: [ 240.479935] [<ffffffff8108d870>] ? kthread_create_on_node+0x180/0x180
> > Oct 24 14:26:46 lappy-486 kernel: [ 240.479939] [<ffffffff81751ffc>] ret_from_fork+0x7c/0xb0
> > Oct 24 14:26:46 lappy-486 kernel: [ 240.479941] [<ffffffff8108d870>] ? kthread_create_on_node+0x180/0x180
> >
> > Eric
> >
> > >
> > > ------------------------------------------------------------------------
> > >
> > > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > > index 29fb23f33c18..927c17b081c7 100644
> > > --- a/kernel/rcu/tree_plugin.h
> > > +++ b/kernel/rcu/tree_plugin.h
> > > @@ -2546,9 +2546,13 @@ static void rcu_spawn_one_nocb_kthread(struct rcu_state *rsp, int cpu)
> > > rdp->nocb_leader = rdp_spawn;
> > > if (rdp_last && rdp != rdp_spawn)
> > > rdp_last->nocb_next_follower = rdp;
> > > - rdp_last = rdp;
> > > - rdp = rdp->nocb_next_follower;
> > > - rdp_last->nocb_next_follower = NULL;
> > > + if (rdp == rdp_spawn) {
> > > + rdp = rdp->nocb_next_follower;
> > > + } else {
> > > + rdp_last = rdp;
> > > + rdp = rdp->nocb_next_follower;
> > > + rdp_last->nocb_next_follower = NULL;
> > > + }
> > > } while (rdp);
> > > rdp_spawn->nocb_next_follower = rdp_old_leader;
> > > }
> > >
> >
>

2014-10-27 15:14:18

by Paul E. McKenney

Subject: Re: Commit 35ce7f29a breaks hibernation for XPS 13

On Mon, Oct 27, 2014 at 09:47:57AM -0400, Eric B Munson wrote:
> On Fri, 24 Oct 2014, Paul E. McKenney wrote:
>
> > On Fri, Oct 24, 2014 at 02:40:28PM -0400, Eric B Munson wrote:
> > > On Fri, 24 Oct 2014, Paul E. McKenney wrote:
> > >
> > > > On Fri, Oct 24, 2014 at 12:36:12PM -0400, Eric B Munson wrote:
> > > > > On Fri, 24 Oct 2014, Paul E. McKenney wrote:
> > > > >
> > > > > > On Fri, Oct 24, 2014 at 12:08:15PM -0400, Eric B Munson wrote:
> > > > > > > Paul,
> > > > > > >
> > > > > > > As of 3.18-rc1 I can no longer hibernate my Dell XPS-13. Bisect points
> > > > > > > the finger at 35ce7f29a. A revert of that commit confirms, I can once
> > > > > > > again hibernate my machine without it.
> > > > > > >
> > > > > > > When the hibernation fails I see this in dmesg:
> > > > > > > [ 37.953313] PM: Syncing filesystems ... done.
> > > > > > > [ 37.963694] Freezing user space processes ... (elapsed 0.001 seconds) done.
> > > > > > > [ 37.965297] PM: Marking nosave pages: [mem 0x00000000-0x00000fff]
> > > > > > > [ 37.965299] PM: Marking nosave pages: [mem 0x00058000-0x00058fff]
> > > > > > > [ 37.965301] PM: Marking nosave pages: [mem 0x0009d000-0x000fffff]
> > > > > > > [ 37.965304] PM: Marking nosave pages: [mem 0xc496a000-0xc4b6bfff]
> > > > > > > [ 37.965315] PM: Marking nosave pages: [mem 0xdadb7000-0xdcffefff]
> > > > > > > [ 37.965479] PM: Marking nosave pages: [mem 0xdd000000-0xffffffff]
> > > > > > > [ 37.966000] PM: Basic memory bitmaps created
> > > > > > > [ 37.966046] PM: Preallocating image memory... done (allocated 181989 pages)
> > > > > > > [ 38.141524] PM: Allocated 727956 kbytes in 0.17 seconds (4282.09 MB/s)
> > > > > > > [ 38.141525] Freezing remaining freezable tasks ...
> > > > > > > [ 58.151863] Freezing of tasks failed after 20.004 seconds (0 tasks refusing to freeze, wq_busy=1):
> > > > > > > [ 58.151894]
> > > > > > > [ 58.151896] Restarting kernel threads ... done.
> > > > > > > [ 58.181915] PM: Basic memory bitmaps freed
> > > > > > > [ 58.181917] Restarting tasks ... done.
> > > > > > >
> > > > > > >
> > > > > > > I am not sure what else I can provide that might be useful, but I did
> > > > > > > see the thread on net-dev about this same commit. Please CC me on any
> > > > > > > fixes and I will be happy to test.
> > > > > >
> > > > > > Thank you for the bug report!
> > > > > >
> > > > > > Does the following patch help?
> > > > > >
> > > > > > Thanx, Paul
> > > > >
> > > > > Paul,
> > > > >
> > > > > This patch does not help. I see the same dmesg output and failure to
> > > > > hibernate.
> > > >
> > > > Thank you for testing it. Does the following (untested, might not even
> > > > build) patch help? (Or feel free to wait until I have done some testing
> > > > on it.)
> > > >
> > > > Thanx, Paul
> > >
> > > Still didn't help. If it helps, when I attempt to reboot after trying
> > > to hibernate I see a kworker thread hung and get the stack trace below
> > > from that thread. I assume this is the same thread that is holding up
> > > the hibernate.
> >
> > Yep, looks like something that some other people are running into as well.
> >
> > If you turn off CONFIG_RCU_NOCB_CPU, do you still get the failure?
>
> Disabling CONFIG_RCU_NOCB_CPU fixes the problem. I am able to hibernate
> and resume successfully.

Very good! Then the fix I am working on might actually be a fix. ;-)

Thanx, Paul
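
The observation that the hang goes away without CONFIG_RCU_NOCB_CPU fits a simple counting model of the barrier. Every posted barrier callback bumps a counter, and the waiter is released only when every posted callback has been invoked. The sketch below is a userspace model under that assumption, not kernel code; struct model_target and model_barrier_completes() are hypothetical names made up for illustration.

```c
#include <stdbool.h>

/* Hypothetical model: one entry per CPU that the barrier posts a callback to. */
struct model_target {
	bool has_rcuo_kthread;	/* only CPUs with a kthread ever invoke callbacks */
};

/*
 * Returns true if the modeled barrier completes and false if it would
 * wait forever.  The count starts at 1 so that it cannot reach zero
 * while callbacks are still being posted.
 */
static bool model_barrier_completes(const struct model_target *t, int n)
{
	int count = 1;

	for (int i = 0; i < n; i++)
		count++;		/* callback posted to target i */
	for (int i = 0; i < n; i++)
		if (t[i].has_rcuo_kthread)
			count--;	/* callback invoked by that CPU's kthread */
	count--;			/* the barrier drops its own initial reference */

	return count == 0;
}
```

In this model a single target without a kthread leaves the count above zero, which would correspond to the kworker stuck in wait_for_completion() in the trace quoted earlier.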

2014-10-27 17:44:21

by Paul E. McKenney

Subject: Re: Commit 35ce7f29a breaks hibernation for XPS 13

On Mon, Oct 27, 2014 at 08:10:21AM -0700, Paul E. McKenney wrote:
> On Mon, Oct 27, 2014 at 09:47:57AM -0400, Eric B Munson wrote:
> > On Fri, 24 Oct 2014, Paul E. McKenney wrote:

[ . . . ]

> > > > Still didn't help. If it helps, when I attempt to reboot after trying
> > > > to hibernate I see a kworker thread hung and get the stack trace below
> > > > from that thread. I assume this is the same thread that is holding up
> > > > the hibernate.
> > >
> > > Yep, looks like something that some other people are running into as well.
> > >
> > > If you turn off CONFIG_RCU_NOCB_CPU, do you still get the failure?
> >
> > Disabling CONFIG_RCU_NOCB_CPU fixes the problem. I am able to hibernate
> > and resume successfully.
>
> Very good! Then the fix I am working on might actually be a fix. ;-)

And here is a patch that passes preliminary testing at my end. Does it
help at your end?

Thanx, Paul

------------------------------------------------------------------------

rcu: Make rcu_barrier() understand about missing rcuo kthreads

Commit 35ce7f29a44a (rcu: Create rcuo kthreads only for onlined CPUs)
avoids creating rcuo kthreads for CPUs that never come online. This
fixes a bug in many instances of firmware: Instead of lying about their
age, these systems instead lie about the number of CPUs that they have.
Before commit 35ce7f29a44a, this could result in huge numbers of useless
rcuo kthreads being created.

It appears that experience indicates that I should have told the
people suffering from this problem to fix their broken firmware, but
I instead produced what turned out to be a partial fix. The missing
piece supplied by this commit makes sure that rcu_barrier() knows not to
post callbacks for no-CBs CPUs that have not yet come online, because
otherwise rcu_barrier() will hang on systems having firmware that lies
about the number of CPUs.

It is tempting to simply have rcu_barrier() refuse to post a callback on
any no-CBs CPU that does not have an rcuo kthread. This unfortunately
does not work because rcu_barrier() is required to wait for all pending
callbacks. It is therefore required to wait even for those callbacks
that cannot possibly be invoked. Even if doing so hangs the system.

Given that posting a callback to a no-CBs CPU that does not yet have an
rcuo kthread can hang rcu_barrier(), it is tempting to report an error
in this case. Unfortunately, this will result in false positives at
boot time, when it is perfectly legal to post callbacks to the boot CPU
before the scheduler has started, in other words, before it is legal
to invoke rcu_barrier().

So this commit instead has rcu_barrier() avoid posting callbacks to
CPUs having neither rcuo kthread nor pending callbacks, and has it
complain bitterly if it finds CPUs having no rcuo kthread but some
pending callbacks. And when rcu_barrier() does find CPUs having no rcuo
kthread but pending callbacks, as noted earlier, it has no choice but
to hang indefinitely.

Reported-by: Yanko Kaneti <[email protected]>
Reported-by: Jay Vosburgh <[email protected]>
Reported-by: Eric B Munson <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>

diff --git a/include/trace/events/rcu.h b/include/trace/events/rcu.h
index aa8e5eea3ab4..c78e88ce5ea3 100644
--- a/include/trace/events/rcu.h
+++ b/include/trace/events/rcu.h
@@ -660,18 +660,18 @@ TRACE_EVENT(rcu_torture_read,
/*
* Tracepoint for _rcu_barrier() execution. The string "s" describes
* the _rcu_barrier phase:
- * "Begin": rcu_barrier_callback() started.
- * "Check": rcu_barrier_callback() checking for piggybacking.
- * "EarlyExit": rcu_barrier_callback() piggybacked, thus early exit.
- * "Inc1": rcu_barrier_callback() piggyback check counter incremented.
- * "Offline": rcu_barrier_callback() found offline CPU
- * "OnlineNoCB": rcu_barrier_callback() found online no-CBs CPU.
- * "OnlineQ": rcu_barrier_callback() found online CPU with callbacks.
- * "OnlineNQ": rcu_barrier_callback() found online CPU, no callbacks.
+ * "Begin": _rcu_barrier() started.
+ * "Check": _rcu_barrier() checking for piggybacking.
+ * "EarlyExit": _rcu_barrier() piggybacked, thus early exit.
+ * "Inc1": _rcu_barrier() piggyback check counter incremented.
+ * "OfflineNoCB": _rcu_barrier() found callback on never-online CPU
+ * "OnlineNoCB": _rcu_barrier() found online no-CBs CPU.
+ * "OnlineQ": _rcu_barrier() found online CPU with callbacks.
+ * "OnlineNQ": _rcu_barrier() found online CPU, no callbacks.
* "IRQ": An rcu_barrier_callback() callback posted on remote CPU.
* "CB": An rcu_barrier_callback() invoked a callback, not the last.
* "LastCB": An rcu_barrier_callback() invoked the last callback.
- * "Inc2": rcu_barrier_callback() piggyback check counter incremented.
+ * "Inc2": _rcu_barrier() piggyback check counter incremented.
* The "cpu" argument is the CPU or -1 if meaningless, the "cnt" argument
* is the count of remaining callbacks, and "done" is the piggybacking count.
*/
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index f6880052b917..7680fc275036 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3312,11 +3312,16 @@ static void _rcu_barrier(struct rcu_state *rsp)
continue;
rdp = per_cpu_ptr(rsp->rda, cpu);
if (rcu_is_nocb_cpu(cpu)) {
- _rcu_barrier_trace(rsp, "OnlineNoCB", cpu,
- rsp->n_barrier_done);
- atomic_inc(&rsp->barrier_cpu_count);
- __call_rcu(&rdp->barrier_head, rcu_barrier_callback,
- rsp, cpu, 0);
+ if (!rcu_nocb_cpu_needs_barrier(rsp, cpu)) {
+ _rcu_barrier_trace(rsp, "OfflineNoCB", cpu,
+ rsp->n_barrier_done);
+ } else {
+ _rcu_barrier_trace(rsp, "OnlineNoCB", cpu,
+ rsp->n_barrier_done);
+ atomic_inc(&rsp->barrier_cpu_count);
+ __call_rcu(&rdp->barrier_head,
+ rcu_barrier_callback, rsp, cpu, 0);
+ }
} else if (ACCESS_ONCE(rdp->qlen)) {
_rcu_barrier_trace(rsp, "OnlineQ", cpu,
rsp->n_barrier_done);
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 4beab3d2328c..8e7b1843896e 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -587,6 +587,7 @@ static void print_cpu_stall_info(struct rcu_state *rsp, int cpu);
static void print_cpu_stall_info_end(void);
static void zero_cpu_stall_ticks(struct rcu_data *rdp);
static void increment_cpu_stall_ticks(void);
+static bool rcu_nocb_cpu_needs_barrier(struct rcu_state *rsp, int cpu);
static void rcu_nocb_gp_set(struct rcu_node *rnp, int nrq);
static void rcu_nocb_gp_cleanup(struct rcu_state *rsp, struct rcu_node *rnp);
static void rcu_init_one_nocb(struct rcu_node *rnp);
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 927c17b081c7..68c5b23b7173 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -2050,6 +2050,33 @@ static void wake_nocb_leader(struct rcu_data *rdp, bool force)
}

/*
+ * Does the specified CPU need an RCU callback for the specified flavor
+ * of rcu_barrier()?
+ */
+static bool rcu_nocb_cpu_needs_barrier(struct rcu_state *rsp, int cpu)
+{
+ struct rcu_data *rdp = per_cpu_ptr(rsp->rda, cpu);
+ struct rcu_head *rhp;
+
+ /* No-CBs CPUs might have callbacks on any of three lists. */
+ rhp = ACCESS_ONCE(rdp->nocb_head);
+ if (!rhp)
+ rhp = ACCESS_ONCE(rdp->nocb_gp_head);
+ if (!rhp)
+ rhp = ACCESS_ONCE(rdp->nocb_follower_head);
+
+ /* Having no rcuo kthread but CBs after scheduler starts is bad! */
+ if (!ACCESS_ONCE(rdp->nocb_kthread) && rhp) {
+ /* RCU callback enqueued before CPU first came online??? */
+ pr_err("RCU: Never-onlined no-CBs CPU %d has CB %p\n",
+ cpu, rhp->func);
+ WARN_ON_ONCE(1);
+ }
+
+ return !!rhp;
+}
+
+/*
* Enqueue the specified string of rcu_head structures onto the specified
* CPU's no-CBs lists. The CPU is specified by rdp, the head of the
* string by rhp, and the tail of the string by rhtp. The non-lazy/lazy
@@ -2646,6 +2673,11 @@ static bool init_nocb_callback_list(struct rcu_data *rdp)

#else /* #ifdef CONFIG_RCU_NOCB_CPU */

+static bool rcu_nocb_cpu_needs_barrier(struct rcu_state *rsp, int cpu)
+{
+ return false; /* No no-CBs CPUs exist in this configuration. */
+}
+
static void rcu_nocb_gp_cleanup(struct rcu_state *rsp, struct rcu_node *rnp)
{
}
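
The decision that the commit message describes can be sketched outside the kernel as follows. This is a simplified userspace model, not the patch itself: struct model_cpu, model_needs_barrier() and model_count_barrier_posts() are hypothetical names, and plain pointers stand in for the three no-CBs callback lists and the rcuo kthread pointer.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical, simplified stand-in for the per-CPU rcu_data fields used by the patch. */
struct model_cpu {
	bool has_rcuo_kthread;		/* models rdp->nocb_kthread != NULL */
	const void *nocb_head;		/* the three lists that may hold callbacks */
	const void *nocb_gp_head;
	const void *nocb_follower_head;
};

/* A barrier callback is needed only if at least one of the lists is non-empty. */
static bool model_needs_barrier(const struct model_cpu *c, int cpu)
{
	const void *rhp = c->nocb_head;

	if (!rhp)
		rhp = c->nocb_gp_head;
	if (!rhp)
		rhp = c->nocb_follower_head;

	/* Pending callbacks with no kthread to invoke them: complain. */
	if (!c->has_rcuo_kthread && rhp)
		fprintf(stderr, "model: CPU %d has callbacks but no rcuo kthread\n", cpu);

	return rhp != NULL;
}

/* Counts the CPUs to which the modeled barrier would post a callback. */
static int model_count_barrier_posts(const struct model_cpu *cpus, int ncpus)
{
	int posts = 0;

	for (int cpu = 0; cpu < ncpus; cpu++)
		if (model_needs_barrier(&cpus[cpu], cpu))
			posts++;
	return posts;
}
```

A CPU that the firmware reported but that never came online has no kthread and empty lists, so the model skips it, which is the case that previously left rcu_barrier() waiting.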

2014-10-27 18:03:48

by Eric B Munson

Subject: Re: Commit 35ce7f29a breaks hibernation for XPS 13

On Mon, 27 Oct 2014, Paul E. McKenney wrote:

> On Mon, Oct 27, 2014 at 08:10:21AM -0700, Paul E. McKenney wrote:
> > On Mon, Oct 27, 2014 at 09:47:57AM -0400, Eric B Munson wrote:
> > > On Fri, 24 Oct 2014, Paul E. McKenney wrote:
>
> [ . . . ]
>
> > > > > Still didn't help. If it helps, when I attempt to reboot after trying
> > > > > to hibernate I see a kworker thread hung and get the stack trace below
> > > > > from that thread. I assume this is the same thread that is holding up
> > > > > the hibernate.
> > > >
> > > > Yep, looks like something that some other people are running into as well.
> > > >
> > > > If you turn off CONFIG_RCU_NOCB_CPU, do you still get the failure?
> > >
> > > Disabling CONFIG_RCU_NOCB_CPU fixes the problem. I am able to hibernate
> > > and resume successfully.
> >
> > Very good! Then the fix I am working on might actually be a fix. ;-)
>
> And here is a patch that passes preliminary testing at my end. Does it
> help at your end?
>
> Thanx, Paul

Thanks Paul, that fixed it for me. Feel free to add my Tested-by: to
the patch.

Eric

>
> ------------------------------------------------------------------------
>
> rcu: Make rcu_barrier() understand about missing rcuo kthreads
>
> Commit 35ce7f29a44a (rcu: Create rcuo kthreads only for onlined CPUs)
> avoids creating rcuo kthreads for CPUs that never come online. This
> fixes a bug in many instances of firmware: Instead of lying about their
> age, these systems instead lie about the number of CPUs that they have.
> Before commit 35ce7f29a44a, this could result in huge numbers of useless
> rcuo kthreads being created.
>
> It appears that experience indicates that I should have told the
> people suffering from this problem to fix their broken firmware, but
> I instead produced what turned out to be a partial fix. The missing
> piece supplied by this commit makes sure that rcu_barrier() knows not to
> post callbacks for no-CBs CPUs that have not yet come online, because
> otherwise rcu_barrier() will hang on systems having firmware that lies
> about the number of CPUs.
>
> It is tempting to simply have rcu_barrier() refuse to post a callback on
> any no-CBs CPU that does not have an rcuo kthread. This unfortunately
> does not work because rcu_barrier() is required to wait for all pending
> callbacks. It is therefore required to wait even for those callbacks
> that cannot possibly be invoked. Even if doing so hangs the system.
>
> Given that posting a callback to a no-CBs CPU that does not yet have an
> rcuo kthread can hang rcu_barrier(), it is tempting to report an error
> in this case. Unfortunately, this will result in false positives at
> boot time, when it is perfectly legal to post callbacks to the boot CPU
> before the scheduler has started, in other words, before it is legal
> to invoke rcu_barrier().
>
> So this commit instead has rcu_barrier() avoid posting callbacks to
> CPUs having neither rcuo kthread nor pending callbacks, and has it
> complain bitterly if it finds CPUs having no rcuo kthread but some
> pending callbacks. And when rcu_barrier() does find CPUs having no rcuo
> kthread but pending callbacks, as noted earlier, it has no choice but
> to hang indefinitely.
>
> Reported-by: Yanko Kaneti <[email protected]>
> Reported-by: Jay Vosburgh <[email protected]>
> Reported-by: Eric B Munson <[email protected]>
> Signed-off-by: Paul E. McKenney <[email protected]>
>
> diff --git a/include/trace/events/rcu.h b/include/trace/events/rcu.h
> index aa8e5eea3ab4..c78e88ce5ea3 100644
> --- a/include/trace/events/rcu.h
> +++ b/include/trace/events/rcu.h
> @@ -660,18 +660,18 @@ TRACE_EVENT(rcu_torture_read,
> /*
> * Tracepoint for _rcu_barrier() execution. The string "s" describes
> * the _rcu_barrier phase:
> - * "Begin": rcu_barrier_callback() started.
> - * "Check": rcu_barrier_callback() checking for piggybacking.
> - * "EarlyExit": rcu_barrier_callback() piggybacked, thus early exit.
> - * "Inc1": rcu_barrier_callback() piggyback check counter incremented.
> - * "Offline": rcu_barrier_callback() found offline CPU
> - * "OnlineNoCB": rcu_barrier_callback() found online no-CBs CPU.
> - * "OnlineQ": rcu_barrier_callback() found online CPU with callbacks.
> - * "OnlineNQ": rcu_barrier_callback() found online CPU, no callbacks.
> + * "Begin": _rcu_barrier() started.
> + * "Check": _rcu_barrier() checking for piggybacking.
> + * "EarlyExit": _rcu_barrier() piggybacked, thus early exit.
> + * "Inc1": _rcu_barrier() piggyback check counter incremented.
> + * "OfflineNoCB": _rcu_barrier() found callback on never-online CPU
> + * "OnlineNoCB": _rcu_barrier() found online no-CBs CPU.
> + * "OnlineQ": _rcu_barrier() found online CPU with callbacks.
> + * "OnlineNQ": _rcu_barrier() found online CPU, no callbacks.
> * "IRQ": An rcu_barrier_callback() callback posted on remote CPU.
> * "CB": An rcu_barrier_callback() invoked a callback, not the last.
> * "LastCB": An rcu_barrier_callback() invoked the last callback.
> - * "Inc2": rcu_barrier_callback() piggyback check counter incremented.
> + * "Inc2": _rcu_barrier() piggyback check counter incremented.
> * The "cpu" argument is the CPU or -1 if meaningless, the "cnt" argument
> * is the count of remaining callbacks, and "done" is the piggybacking count.
> */
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index f6880052b917..7680fc275036 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -3312,11 +3312,16 @@ static void _rcu_barrier(struct rcu_state *rsp)
> continue;
> rdp = per_cpu_ptr(rsp->rda, cpu);
> if (rcu_is_nocb_cpu(cpu)) {
> - _rcu_barrier_trace(rsp, "OnlineNoCB", cpu,
> - rsp->n_barrier_done);
> - atomic_inc(&rsp->barrier_cpu_count);
> - __call_rcu(&rdp->barrier_head, rcu_barrier_callback,
> - rsp, cpu, 0);
> + if (!rcu_nocb_cpu_needs_barrier(rsp, cpu)) {
> + _rcu_barrier_trace(rsp, "OfflineNoCB", cpu,
> + rsp->n_barrier_done);
> + } else {
> + _rcu_barrier_trace(rsp, "OnlineNoCB", cpu,
> + rsp->n_barrier_done);
> + atomic_inc(&rsp->barrier_cpu_count);
> + __call_rcu(&rdp->barrier_head,
> + rcu_barrier_callback, rsp, cpu, 0);
> + }
> } else if (ACCESS_ONCE(rdp->qlen)) {
> _rcu_barrier_trace(rsp, "OnlineQ", cpu,
> rsp->n_barrier_done);
> diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
> index 4beab3d2328c..8e7b1843896e 100644
> --- a/kernel/rcu/tree.h
> +++ b/kernel/rcu/tree.h
> @@ -587,6 +587,7 @@ static void print_cpu_stall_info(struct rcu_state *rsp, int cpu);
> static void print_cpu_stall_info_end(void);
> static void zero_cpu_stall_ticks(struct rcu_data *rdp);
> static void increment_cpu_stall_ticks(void);
> +static bool rcu_nocb_cpu_needs_barrier(struct rcu_state *rsp, int cpu);
> static void rcu_nocb_gp_set(struct rcu_node *rnp, int nrq);
> static void rcu_nocb_gp_cleanup(struct rcu_state *rsp, struct rcu_node *rnp);
> static void rcu_init_one_nocb(struct rcu_node *rnp);
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index 927c17b081c7..68c5b23b7173 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -2050,6 +2050,33 @@ static void wake_nocb_leader(struct rcu_data *rdp, bool force)
> }
>
> /*
> + * Does the specified CPU need an RCU callback for the specified flavor
> + * of rcu_barrier()?
> + */
> +static bool rcu_nocb_cpu_needs_barrier(struct rcu_state *rsp, int cpu)
> +{
> + struct rcu_data *rdp = per_cpu_ptr(rsp->rda, cpu);
> + struct rcu_head *rhp;
> +
> + /* No-CBs CPUs might have callbacks on any of three lists. */
> + rhp = ACCESS_ONCE(rdp->nocb_head);
> + if (!rhp)
> + rhp = ACCESS_ONCE(rdp->nocb_gp_head);
> + if (!rhp)
> + rhp = ACCESS_ONCE(rdp->nocb_follower_head);
> +
> + /* Having no rcuo kthread but CBs after scheduler starts is bad! */
> + if (!ACCESS_ONCE(rdp->nocb_kthread) && rhp) {
> + /* RCU callback enqueued before CPU first came online??? */
> + pr_err("RCU: Never-onlined no-CBs CPU %d has CB %p\n",
> + cpu, rhp->func);
> + WARN_ON_ONCE(1);
> + }
> +
> + return !!rhp;
> +}
> +
> +/*
> * Enqueue the specified string of rcu_head structures onto the specified
> * CPU's no-CBs lists. The CPU is specified by rdp, the head of the
> * string by rhp, and the tail of the string by rhtp. The non-lazy/lazy
> @@ -2646,6 +2673,11 @@ static bool init_nocb_callback_list(struct rcu_data *rdp)
>
> #else /* #ifdef CONFIG_RCU_NOCB_CPU */
>
> +static bool rcu_nocb_cpu_needs_barrier(struct rcu_state *rsp, int cpu)
> +{
> + return false; /* No no-CBs CPUs exist in this configuration. */
> +}
> +
> static void rcu_nocb_gp_cleanup(struct rcu_state *rsp, struct rcu_node *rnp)
> {
> }
>

2014-10-27 18:17:57

by Paul E. McKenney

Subject: Re: Commit 35ce7f29a breaks hibernation for XPS 13

On Mon, Oct 27, 2014 at 02:03:44PM -0400, Eric B Munson wrote:
> On Mon, 27 Oct 2014, Paul E. McKenney wrote:
>
> > On Mon, Oct 27, 2014 at 08:10:21AM -0700, Paul E. McKenney wrote:
> > > On Mon, Oct 27, 2014 at 09:47:57AM -0400, Eric B Munson wrote:
> > > > On Fri, 24 Oct 2014, Paul E. McKenney wrote:
> >
> > [ . . . ]
> >
> > > > > > Still didn't help. If it helps, when I attempt to reboot after trying
> > > > > > to hibernate I see a kworker thread hung and get the stack trace below
> > > > > > from that thread. I assume this is the same thread that is holding up
> > > > > > the hibernate.
> > > > >
> > > > > Yep, looks like something that some other people are running into as well.
> > > > >
> > > > > If you turn off CONFIG_RCU_NOCB_CPU, do you still get the failure?
> > > >
> > > > Disabling CONFIG_RCU_NOCB_CPU fixes the problem. I am able to hibernate
> > > > and resume successfully.
> > >
> > > Very good! Then the fix I am working on might actually be a fix. ;-)
> >
> > And here is a patch that passes preliminary testing at my end. Does it
> > help at your end?
> >
> > Thanx, Paul
>
> Thanks Paul, that fixed it for me. Feel free to add my Tested-by: to
> the patch.

Woo-hoo!!! ;-)

I added your Tested-by, and thank you for your reporting and testing
for this bug!

Thanx, Paul

> Eric
>
> >
> > ------------------------------------------------------------------------
> >
> > rcu: Make rcu_barrier() understand about missing rcuo kthreads
> >
> > Commit 35ce7f29a44a (rcu: Create rcuo kthreads only for onlined CPUs)
> > avoids creating rcuo kthreads for CPUs that never come online. This
> > fixes a bug in many instances of firmware: Instead of lying about their
> > age, these systems instead lie about the number of CPUs that they have.
> > Before commit 35ce7f29a44a, this could result in huge numbers of useless
> > rcuo kthreads being created.
> >
> > It appears that experience indicates that I should have told the
> > people suffering from this problem to fix their broken firmware, but
> > I instead produced what turned out to be a partial fix. The missing
> > piece supplied by this commit makes sure that rcu_barrier() knows not to
> > post callbacks for no-CBs CPUs that have not yet come online, because
> > otherwise rcu_barrier() will hang on systems having firmware that lies
> > about the number of CPUs.
> >
> > It is tempting to simply have rcu_barrier() refuse to post a callback on
> > any no-CBs CPU that does not have an rcuo kthread. This unfortunately
> > does not work because rcu_barrier() is required to wait for all pending
> > callbacks. It is therefore required to wait even for those callbacks
> > that cannot possibly be invoked. Even if doing so hangs the system.
> >
> > Given that posting a callback to a no-CBs CPU that does not yet have an
> > rcuo kthread can hang rcu_barrier(), it is tempting to report an error
> > in this case. Unfortunately, this will result in false positives at
> > boot time, when it is perfectly legal to post callbacks to the boot CPU
> > before the scheduler has started, in other words, before it is legal
> > to invoke rcu_barrier().
> >
> > So this commit instead has rcu_barrier() avoid posting callbacks to
> > CPUs having neither rcuo kthread nor pending callbacks, and has it
> > complain bitterly if it finds CPUs having no rcuo kthread but some
> > pending callbacks. And when rcu_barrier() does find CPUs having no rcuo
> > kthread but pending callbacks, as noted earlier, it has no choice but
> > to hang indefinitely.
> >
> > Reported-by: Yanko Kaneti <[email protected]>
> > Reported-by: Jay Vosburgh <[email protected]>
> > Reported-by: Eric B Munson <[email protected]>
> > Signed-off-by: Paul E. McKenney <[email protected]>
> >
> > diff --git a/include/trace/events/rcu.h b/include/trace/events/rcu.h
> > index aa8e5eea3ab4..c78e88ce5ea3 100644
> > --- a/include/trace/events/rcu.h
> > +++ b/include/trace/events/rcu.h
> > @@ -660,18 +660,18 @@ TRACE_EVENT(rcu_torture_read,
> > /*
> > * Tracepoint for _rcu_barrier() execution. The string "s" describes
> > * the _rcu_barrier phase:
> > - * "Begin": rcu_barrier_callback() started.
> > - * "Check": rcu_barrier_callback() checking for piggybacking.
> > - * "EarlyExit": rcu_barrier_callback() piggybacked, thus early exit.
> > - * "Inc1": rcu_barrier_callback() piggyback check counter incremented.
> > - * "Offline": rcu_barrier_callback() found offline CPU
> > - * "OnlineNoCB": rcu_barrier_callback() found online no-CBs CPU.
> > - * "OnlineQ": rcu_barrier_callback() found online CPU with callbacks.
> > - * "OnlineNQ": rcu_barrier_callback() found online CPU, no callbacks.
> > + * "Begin": _rcu_barrier() started.
> > + * "Check": _rcu_barrier() checking for piggybacking.
> > + * "EarlyExit": _rcu_barrier() piggybacked, thus early exit.
> > + * "Inc1": _rcu_barrier() piggyback check counter incremented.
> > + * "OfflineNoCB": _rcu_barrier() found callback on never-online CPU
> > + * "OnlineNoCB": _rcu_barrier() found online no-CBs CPU.
> > + * "OnlineQ": _rcu_barrier() found online CPU with callbacks.
> > + * "OnlineNQ": _rcu_barrier() found online CPU, no callbacks.
> > * "IRQ": An rcu_barrier_callback() callback posted on remote CPU.
> > * "CB": An rcu_barrier_callback() invoked a callback, not the last.
> > * "LastCB": An rcu_barrier_callback() invoked the last callback.
> > - * "Inc2": rcu_barrier_callback() piggyback check counter incremented.
> > + * "Inc2": _rcu_barrier() piggyback check counter incremented.
> > * The "cpu" argument is the CPU or -1 if meaningless, the "cnt" argument
> > * is the count of remaining callbacks, and "done" is the piggybacking count.
> > */
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index f6880052b917..7680fc275036 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -3312,11 +3312,16 @@ static void _rcu_barrier(struct rcu_state *rsp)
> > continue;
> > rdp = per_cpu_ptr(rsp->rda, cpu);
> > if (rcu_is_nocb_cpu(cpu)) {
> > - _rcu_barrier_trace(rsp, "OnlineNoCB", cpu,
> > - rsp->n_barrier_done);
> > - atomic_inc(&rsp->barrier_cpu_count);
> > - __call_rcu(&rdp->barrier_head, rcu_barrier_callback,
> > - rsp, cpu, 0);
> > + if (!rcu_nocb_cpu_needs_barrier(rsp, cpu)) {
> > + _rcu_barrier_trace(rsp, "OfflineNoCB", cpu,
> > + rsp->n_barrier_done);
> > + } else {
> > + _rcu_barrier_trace(rsp, "OnlineNoCB", cpu,
> > + rsp->n_barrier_done);
> > + atomic_inc(&rsp->barrier_cpu_count);
> > + __call_rcu(&rdp->barrier_head,
> > + rcu_barrier_callback, rsp, cpu, 0);
> > + }
> > } else if (ACCESS_ONCE(rdp->qlen)) {
> > _rcu_barrier_trace(rsp, "OnlineQ", cpu,
> > rsp->n_barrier_done);
> > diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
> > index 4beab3d2328c..8e7b1843896e 100644
> > --- a/kernel/rcu/tree.h
> > +++ b/kernel/rcu/tree.h
> > @@ -587,6 +587,7 @@ static void print_cpu_stall_info(struct rcu_state *rsp, int cpu);
> > static void print_cpu_stall_info_end(void);
> > static void zero_cpu_stall_ticks(struct rcu_data *rdp);
> > static void increment_cpu_stall_ticks(void);
> > +static bool rcu_nocb_cpu_needs_barrier(struct rcu_state *rsp, int cpu);
> > static void rcu_nocb_gp_set(struct rcu_node *rnp, int nrq);
> > static void rcu_nocb_gp_cleanup(struct rcu_state *rsp, struct rcu_node *rnp);
> > static void rcu_init_one_nocb(struct rcu_node *rnp);
> > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > index 927c17b081c7..68c5b23b7173 100644
> > --- a/kernel/rcu/tree_plugin.h
> > +++ b/kernel/rcu/tree_plugin.h
> > @@ -2050,6 +2050,33 @@ static void wake_nocb_leader(struct rcu_data *rdp, bool force)
> > }
> >
> > /*
> > + * Does the specified CPU need an RCU callback for the specified flavor
> > + * of rcu_barrier()?
> > + */
> > +static bool rcu_nocb_cpu_needs_barrier(struct rcu_state *rsp, int cpu)
> > +{
> > + struct rcu_data *rdp = per_cpu_ptr(rsp->rda, cpu);
> > + struct rcu_head *rhp;
> > +
> > + /* No-CBs CPUs might have callbacks on any of three lists. */
> > + rhp = ACCESS_ONCE(rdp->nocb_head);
> > + if (!rhp)
> > + rhp = ACCESS_ONCE(rdp->nocb_gp_head);
> > + if (!rhp)
> > + rhp = ACCESS_ONCE(rdp->nocb_follower_head);
> > +
> > + /* Having no rcuo kthread but CBs after scheduler starts is bad! */
> > + if (!ACCESS_ONCE(rdp->nocb_kthread) && rhp) {
> > + /* RCU callback enqueued before CPU first came online??? */
> > + pr_err("RCU: Never-onlined no-CBs CPU %d has CB %p\n",
> > + cpu, rhp->func);
> > + WARN_ON_ONCE(1);
> > + }
> > +
> > + return !!rhp;
> > +}
> > +
> > +/*
> > * Enqueue the specified string of rcu_head structures onto the specified
> > * CPU's no-CBs lists. The CPU is specified by rdp, the head of the
> > * string by rhp, and the tail of the string by rhtp. The non-lazy/lazy
> > @@ -2646,6 +2673,11 @@ static bool init_nocb_callback_list(struct rcu_data *rdp)
> >
> > #else /* #ifdef CONFIG_RCU_NOCB_CPU */
> >
> > +static bool rcu_nocb_cpu_needs_barrier(struct rcu_state *rsp, int cpu)
> > +{
> > + return false; /* No no-CBs CPUs exist in this configuration. */
> > +}
> > +
> > static void rcu_nocb_gp_cleanup(struct rcu_state *rsp, struct rcu_node *rnp)
> > {
> > }
> >
>