2017-06-08 20:11:57

by Krister Johansen

Subject: Re: [PATCH tip/core/rcu 45/88] rcu: Add memory barriers for NOCB leader wakeup

Hi Paul,

On Thu, May 25, 2017 at 02:59:18PM -0700, Paul E. McKenney wrote:
> Wait/wakeup operations do not guarantee ordering on their own. Instead,
> either locking or memory barriers are required. This commit therefore
> adds memory barriers to wake_nocb_leader() and nocb_leader_wait().
>
> Signed-off-by: Paul E. McKenney <[email protected]>
> ---
> kernel/rcu/tree_plugin.h | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index 0b1042545116..573fbe9640a0 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -1810,6 +1810,7 @@ static void wake_nocb_leader(struct rcu_data *rdp, bool force)
>  	if (READ_ONCE(rdp_leader->nocb_leader_sleep) || force) {
>  		/* Prior smp_mb__after_atomic() orders against prior enqueue. */
>  		WRITE_ONCE(rdp_leader->nocb_leader_sleep, false);
> +		smp_mb(); /* ->nocb_leader_sleep before swake_up(). */
>  		swake_up(&rdp_leader->nocb_wq);
>  	}
>  }
> @@ -2064,6 +2065,7 @@ static void nocb_leader_wait(struct rcu_data *my_rdp)
>  	 * nocb_gp_head, where they await a grace period.
>  	 */
>  	gotcbs = false;
> +	smp_mb(); /* wakeup before ->nocb_head reads. */
>  	for (rdp = my_rdp; rdp; rdp = rdp->nocb_next_follower) {
>  		rdp->nocb_gp_head = READ_ONCE(rdp->nocb_head);
>  		if (!rdp->nocb_gp_head)

May I impose upon you to CC this patch to stable, and tag it as fixing
abedf8e241? I ran into this on a production 4.9 branch. When I
debugged it, I discovered that it went all the way back to 4.6. The
tl;dr is that at least for some environments, the missed wakeup
manifests itself as a series of hung-task warnings to console and if I'm
unlucky it can also generate a hang that can block interactive logins
via ssh.

In case it's useful, I'm attaching the RCA that I sent out before I
circulated a similar patch for internal review. You beat me to the
punch in terms of getting a patch out to a public mailing list.

--- >8 cut here 8< ---

Production boxes go through fits where they generate spurts of hung-task
warnings. Sometimes, the slowdown is enough to ensnare a process
holding the rtnl_lock, or to prevent systemd from letting ssh logins
onto the box.

Below is a splat from a relatively benign occurrence:

INFO: task uc-spawn:34443 blocked for more than 120 seconds.
Not tainted 4.9.4-1.el7.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
uc-spawn D 0 34443 1 0x00000082
ffff883fec88e4c0 0000000000000000 ffff883ff1b4ac00 ffff883ffddd9540
ffff883fee999600 ffffc9001c907c30 ffffffff8175e3f6 0000000000000202
ffff883fee999600 7fffffffffffffff ffff883fee999600 7fffffffffffffff
Call Trace:
[<ffffffff8175e3f6>] ? __schedule+0x1a6/0x650
[<ffffffff8175e8d6>] schedule+0x36/0x80
[<ffffffff81761aec>] schedule_timeout+0x21c/0x3a0
[<ffffffff8175f3b2>] wait_for_completion+0xf2/0x130
[<ffffffff810b6710>] ? wake_up_q+0x80/0x80
[<ffffffff810eec27>] __wait_rcu_gp+0xd7/0x100
[<ffffffff810f264e>] synchronize_sched+0x5e/0x80
[<ffffffff810f1da0>] ? __call_rcu+0x320/0x320
[<ffffffff810eeb30>] ? trace_raw_output_rcu_utilization+0x60/0x60
[<ffffffff8125673b>] kern_unmount+0x2b/0x40
[<ffffffff812d6f85>] mq_put_mnt+0x15/0x20
[<ffffffff812d76de>] put_ipc_ns+0x3e/0xa0
[<ffffffff810ab768>] free_nsproxy+0x38/0x90
[<ffffffff810ab8c9>] switch_task_namespaces+0x59/0x60
[<ffffffff810ab8e0>] exit_task_namespaces+0x10/0x20
[<ffffffff8108d787>] do_exit+0x2b7/0xac0
[<ffffffff81003510>] ? syscall_trace_enter+0x1d0/0x2b0
[<ffffffff8108e00f>] do_group_exit+0x3f/0xb0
[<ffffffff8108e094>] SyS_exit_group+0x14/0x20
[<ffffffff81003a47>] do_syscall_64+0x67/0x180
[<ffffffff8176306b>] entry_SYSCALL64_slow_path+0x25/0x25

In order to debug this further, a kdump was obtained.

crash> ps 34443
PID PPID CPU TASK ST %MEM VSZ RSS COMM
34443 1 33 ffff883fee999600 UN 0.0 0 0 uc-spawn

crash> bt 34443
PID: 34443 TASK: ffff883fee999600 CPU: 33 COMMAND: "uc-spawn"
#0 [ffffc9001c907bb8] __schedule at ffffffff8175e3f6
#1 [ffffc9001c907c38] schedule at ffffffff8175e8d6
#2 [ffffc9001c907c50] schedule_timeout at ffffffff81761aec
#3 [ffffc9001c907cf0] wait_for_completion at ffffffff8175f3b2
#4 [ffffc9001c907d50] __wait_rcu_gp at ffffffff810eec27
#5 [ffffc9001c907da0] synchronize_sched at ffffffff810f264e
#6 [ffffc9001c907df0] kern_unmount at ffffffff8125673b
#7 [ffffc9001c907e08] mq_put_mnt at ffffffff812d6f85
#8 [ffffc9001c907e18] put_ipc_ns at ffffffff812d76de
#9 [ffffc9001c907e30] free_nsproxy at ffffffff810ab768
#10 [ffffc9001c907e48] switch_task_namespaces at ffffffff810ab8c9
#11 [ffffc9001c907e70] exit_task_namespaces at ffffffff810ab8e0
#12 [ffffc9001c907e80] do_exit at ffffffff8108d787
#13 [ffffc9001c907ee8] do_group_exit at ffffffff8108e00f
#14 [ffffc9001c907f18] sys_exit_group at ffffffff8108e094
#15 [ffffc9001c907f28] do_syscall_64 at ffffffff81003a47
RIP: 00007f6b6e1f4209 RSP: 00007ffc488028b8 RFLAGS: 00000246
RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f6b6e1f4209
RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000001
RBP: 00007f6b6e4ea840 R8: 000000000000003c R9: 00000000000000e7
R10: fffffffffffffe90 R11: 0000000000000246 R12: 00007f6b6e4ea840
R13: 00007f6b6e4efe80 R14: 00007ffc48802918 R15: 00007ffc48802998
ORIG_RAX: 00000000000000e7 CS: 0033 SS: 002b

This particular process is blocked in its exit path, waiting for the
synchronize_sched() call it invoked to complete. This function queues a
call_rcu_sched() invocation with a callback that triggers a wakeup of
the invoking process. In order for this process to become unblocked,
a grace period must elapse and the callback must be invoked.
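
For reference, the mechanism looks roughly like this (a simplified
sketch of the machinery in kernel/rcu/update.c; the waiter side is
paraphrased from __wait_rcu_gp() with error handling and the
multi-flavor array omitted):

struct rcu_synchronize {
        struct rcu_head head;
        struct completion completion;
};

/* Callback invoked once a grace period has elapsed. */
static void wakeme_after_rcu(struct rcu_head *head)
{
        struct rcu_synchronize *rcu;

        rcu = container_of(head, struct rcu_synchronize, head);
        complete(&rcu->completion);
}

/* The waiter, in essence: */
        struct rcu_synchronize rcu;

        init_completion(&rcu.completion);
        call_rcu_sched(&rcu.head, wakeme_after_rcu); /* lands on a nocb list */
        wait_for_completion(&rcu.completion);  /* <-- where pid 34443 waits */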

Typically, bugs like this imply that some processor failed to yield, or
that some thread is holding rcu_read_lock_sched() and is thereby preventing
a grace period from occurring. However, after diligent inspection I can
find no evidence of such a thing occurring. After dumping all stacks,
no threads are in call frames that hold this type of read-side critical
section. All cpus appear to be mostly idle and responding to interrupt
events. The rcu threads themselves are not blocked waiting for
completion of a grace period either.

A quick look at rcu_sched_state shows nothing obviously amiss:

crash> p rcu_sched_state
rcu_sched_state = $1 = {
node = {{
lock = {
raw_lock = {
val = {
counter = 0
}
}
},
gpnum = 22469944,
completed = 22469944,
qsmask = 0,
qsmaskinit = 3,
qsmaskinitnext = 0,
expmask = 0,
expmaskinit = 3,
expmaskinitnext = 0,
grpmask = 0,
grplo = 0,
grphi = 71,
grpnum = 0 '\000',
level = 0 '\000',
wait_blkd_tasks = false,
parent = 0x0,
blkd_tasks = {
next = 0xffffffff81c70428 <rcu_sched_state+104>,
prev = 0xffffffff81c70428 <rcu_sched_state+104>
},
gp_tasks = 0x0,
exp_tasks = 0x0,
boost_tasks = 0x0,
<...>

Note that according to this, the most recent gp completed was numbered
22469944.

Now here's the rcu_data for CPU 33, where the process got stuck. It's
at least 15 grace periods behind.

crash> struct rcu_data 0xffff883ffddda380
struct rcu_data {
completed = 22469929,
gpnum = 22469930,
rcu_qs_ctr_snap = 18203094,
cpu_no_qs = {
b = {
norm = 0 '\000',
exp = 0 '\000'
},
s = 0
},
core_needs_qs = true,
beenonline = true,
gpwrap = false,
mynode = 0xffffffff81c70600 <rcu_sched_state+576>,
grpmask = 8589934592,
ticks_this_gp = 0,
nxtlist = 0x0,
nxttail = {0x0, 0x0, 0x0, 0x0},
nxtcompleted = {0, 0, 0, 0},
qlen_lazy = 0,
qlen = 0,
qlen_last_fqs_check = 0,
n_cbs_invoked = 0,
n_nocbs_invoked = 10626703,
n_cbs_orphaned = 0,
n_cbs_adopted = 0,
n_force_qs_snap = 0,
blimit = 10,
dynticks = 0xffff883ffddcffd0,
dynticks_snap = 28482762,
dynticks_fqs = 19371,
offline_fqs = 0,
cond_resched_completed = 22421984,
n_rcu_pending = 2265627,
n_rp_core_needs_qs = 31314,
n_rp_report_qs = 595411,
n_rp_cb_ready = 0,
n_rp_cpu_needs_gp = 757,
n_rp_gp_completed = 367886,
n_rp_gp_started = 38620,
n_rp_nocb_defer_wakeup = 126,
n_rp_need_nothing = 1262827,
barrier_head = {
next = 0x0,
func = 0xffffffff810f0090 <rcu_barrier_callback>
},
exp_workdone0 = {
counter = 21444
},
exp_workdone1 = {
counter = 0
},
exp_workdone2 = {
counter = 0
},
exp_workdone3 = {
counter = 0
},
nocb_head = 0xffff881f8444a138,
nocb_tail = 0xffff881f9f8ee8d0,
nocb_q_count = {
counter = 500
},
nocb_q_count_lazy = {
counter = 18
},
nocb_follower_head = 0x0,
nocb_follower_tail = 0xffff883ffddda508,
nocb_wq = {
lock = {
raw_lock = {
val = {
counter = 0
}
}
},
task_list = {
next = 0xffffc900193e3e80,
prev = 0xffffc900193e3e80
}
},
nocb_kthread = 0xffff883ff0861600,
nocb_defer_wakeup = 0,
nocb_gp_head = 0xffff883fe18c71a0,
nocb_gp_tail = 0xffffc9001c7a3da0,
nocb_leader_sleep = false,
nocb_next_follower = 0xffff883ffde1a380,
nocb_leader = 0xffff883ffdd9a380,
softirq_snap = 1002642,
cpu = 33,
rsp = 0xffffffff81c703c0 <rcu_sched_state>
}

Notice also that the nocb_q_count is 500. With a bit of looking we can
see that the blocked process has its callback queued in both the
nocb_gp_head list and the nocb_head list. I'll spare readers all of the
output, but the correct linkage was located by walking the list entries
and then grepping for the correct pointer.

crash> struct completion 0xffffc9001c907dc0
struct completion {
done = 0,
wait = {
lock = {
{
rlock = {
raw_lock = {
val = {
counter = 0
}
}
}
}
},
task_list = {
next = 0xffffc9001c907d10,
prev = 0xffffc9001c907d10
}
}
}

crash> struct rcu_synchronize 0xffffc9001c907db0
struct rcu_synchronize {
head = {
next = 0xffff883fe4d18398,
func = 0xffffffff810eeb30 <wakeme_after_rcu>
},
completion = {
done = 0,
wait = {
lock = {
{
rlock = {
raw_lock = {
val = {
counter = 0
}
}
}
}
},
task_list = {
next = 0xffffc9001c907d10,
prev = 0xffffc9001c907d10
}
}
}
}

This is the stack for CPU 33's nocb kthread. It's sleeping.

crash> bt 0xffff883ff0861600
PID: 279 TASK: ffff883ff0861600 CPU: 39 COMMAND: "rcuos/33"
#0 [ffffc900193e3db0] __schedule at ffffffff8175e3f6
#1 [ffffc900193e3e30] schedule at ffffffff8175e8d6
#2 [ffffc900193e3e48] rcu_nocb_kthread at ffffffff810f2a67
#3 [ffffc900193e3ec8] kthread at ffffffff810a9e19
#4 [ffffc900193e3f50] ret_from_fork at ffffffff81763215

Since CPU 33 is a follower in this configuration, it needs to be awoken
by a leader. In this case CPU 32 is the leader assigned to wake CPU
33's nocb_kthread. It's also sleeping.

crash> bt 0xffff883ff0805800
PID: 271 TASK: ffff883ff0805800 CPU: 38 COMMAND: "rcuos/32"
#0 [ffffc900193a3db0] __schedule at ffffffff8175e3f6
#1 [ffffc900193a3e30] schedule at ffffffff8175e8d6
#2 [ffffc900193a3e48] rcu_nocb_kthread at ffffffff810f2ad5
#3 [ffffc900193a3ec8] kthread at ffffffff810a9e19
#4 [ffffc900193a3f50] ret_from_fork at ffffffff81763215

Here's a red flag: The leader is asleep despite nocb_leader_sleep being
false. The code in nocb_leader_wait() has an swait_event_interruptible()
that's supposed to only let this thread wait when nocb_leader_sleep is
true.
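
For anyone following along, that wait expands to roughly the loop below
(a simplified sketch of the swait_event_interruptible() machinery, not
the exact macro):

        struct swait_queue wait;

        for (;;) {
                /* Enqueue on nocb_wq and set TASK_INTERRUPTIBLE. */
                prepare_to_swait(&my_rdp->nocb_wq, &wait, TASK_INTERRUPTIBLE);
                if (!READ_ONCE(my_rdp->nocb_leader_sleep))
                        break;          /* condition met: don't sleep */
                if (signal_pending(current))
                        break;          /* interruptible wait */
                schedule();             /* the leader is parked here */
        }
        finish_swait(&my_rdp->nocb_wq, &wait);

Given that, the only way the leader can be asleep while
nocb_leader_sleep is false is if its read of the flag raced with the
waker's write.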

crash> struct rcu_data.nocb_leader_sleep 0xffff883ffdd9a380
nocb_leader_sleep = false

The leader's wq contains active entries, so something is sleeping here:

crash> struct rcu_data.nocb_wq 0xffff883ffdd9a380
nocb_wq = {
lock = {
raw_lock = {
val = {
counter = 0
}
}
},
task_list = {
next = 0xffffc900193a3e80,
prev = 0xffffc900193a3e80
}
}

crash> struct -o rcu_data.nocb_wq 0xffff883ffdd9a380
struct rcu_data {
[ffff883ffdd9a518] struct swait_queue_head nocb_wq;
}

crash> struct list_head 0xffffc900193a3e80
struct list_head {
next = 0xffff883ffdd9a520,
prev = 0xffff883ffdd9a520
}

This is the wait queue entry that the process stranded on cpu 33 parked
on its completion. It's also correctly configured.

crash> struct wait_queue_t 0xffffc9001c907cf8
struct wait_queue_t {
flags = 1,
private = 0xffff883fee999600,
func = 0xffffffff810b6710 <default_wake_function>,
task_list = {
next = 0xffffc9001c907dd0,
prev = 0xffffc9001c907dd0
}
}
crash> ps -m 34443
[ 0 00:11:07.318] [UN] PID: 34443 TASK: ffff883fee999600 CPU: 33 COMMAND: "uc-spawn"
crash> ps -m 0xffff883ff0861600
[ 0 00:11:44.271] [IN] PID: 279 TASK: ffff883ff0861600 CPU: 39 COMMAND: "rcuos/33"
crash> ps -m ffff883ff0805800
[ 0 00:11:44.208] [IN] PID: 271 TASK: ffff883ff0805800 CPU: 38 COMMAND: "rcuos/32"

Looking at the ps data, we can see that our processes have been stranded
for more than 11 minutes; the rcuos threads have been asleep for 37
seconds longer than pid 34443.

The reason this appears to be a transient hang lies in the code in
__call_rcu_nocb_enqueue(). We seem to be missing the initial wakeup,
and once that happens, we'll only send another wakeup if nocb_q_count
grows greater than qlen_last_fqs_check + qhimark. In our case, we need
to accumulate 10,000 callbacks before we get another bite at this. Since
these cpus are relatively idle, we've only accrued 500 on the follower
and ~2000 on the leader. The leader's full state is below:

crash> struct rcu_data 0xffff883ffdd9a380
struct rcu_data {
completed = 22469935,
gpnum = 22469935,
rcu_qs_ctr_snap = 17067139,
cpu_no_qs = {
b = {
norm = 0 '\000',
exp = 0 '\000'
},
s = 0
},
core_needs_qs = false,
beenonline = true,
gpwrap = false,
mynode = 0xffffffff81c70600 <rcu_sched_state+576>,
grpmask = 4294967296,
ticks_this_gp = 1,
nxtlist = 0x0,
nxttail = {0x0, 0x0, 0x0, 0x0},
nxtcompleted = {18446744073709551526, 18446744073709551526, 18446744073709551526, 0},
qlen_lazy = 0,
qlen = 0,
qlen_last_fqs_check = 0,
n_cbs_invoked = 0,
n_nocbs_invoked = 10730040,
n_cbs_orphaned = 0,
n_cbs_adopted = 0,
n_force_qs_snap = 0,
blimit = 10,
dynticks = 0xffff883ffdd8ffd0,
dynticks_snap = 29638122,
dynticks_fqs = 19616,
offline_fqs = 0,
cond_resched_completed = 22467956,
n_rcu_pending = 2314079,
n_rp_core_needs_qs = 30976,
n_rp_report_qs = 654184,
n_rp_cb_ready = 0,
n_rp_cpu_needs_gp = 813,
n_rp_gp_completed = 369917,
n_rp_gp_started = 37049,
n_rp_nocb_defer_wakeup = 135,
n_rp_need_nothing = 1251981,
barrier_head = {
next = 0x0,
func = 0xffffffff810f0090 <rcu_barrier_callback>
},
exp_workdone0 = {
counter = 20577
},
exp_workdone1 = {
counter = 0
},
exp_workdone2 = {
counter = 0
},
exp_workdone3 = {
counter = 0
},
nocb_head = 0xffff883ff0f6f700,
nocb_tail = 0xffff881f9f8ed2d0,
nocb_q_count = {
counter = 2029
},
nocb_q_count_lazy = {
counter = 71
},
nocb_follower_head = 0x0,
nocb_follower_tail = 0xffff883ffdd9a508,
nocb_wq = {
lock = {
raw_lock = {
val = {
counter = 0
}
}
},
task_list = {
next = 0xffffc900193a3e80,
prev = 0xffffc900193a3e80
}
},
nocb_kthread = 0xffff883ff0805800,
nocb_defer_wakeup = 0,
nocb_gp_head = 0x0,
nocb_gp_tail = 0xffff883feb6c4d00,
nocb_leader_sleep = false,
nocb_next_follower = 0xffff883ffddda380,
nocb_leader = 0xffff883ffdd9a380,
softirq_snap = 1061978,
cpu = 32,
rsp = 0xffffffff81c703c0 <rcu_sched_state>
}
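
For reference, the wakeup decision described above looks roughly like
this (a simplified sketch of the 4.9-era __call_rcu_nocb_enqueue(),
with the polling and deferred-wakeup paths omitted):

        /* Enqueue, then decide whether to wake the leader. */
        old_rhpp = xchg(&rdp->nocb_tail, rhtp);
        WRITE_ONCE(*old_rhpp, rhp);
        smp_mb__after_atomic(); /* Store *old_rhpp before wake test. */

        len = atomic_long_read(&rdp->nocb_q_count);
        if (old_rhpp == &rdp->nocb_head) {
                /* List was empty: the one and only wakeup attempt. */
                wake_nocb_leader(rdp, false);
                rdp->qlen_last_fqs_check = 0;
        } else if (len > rdp->qlen_last_fqs_check + qhimark) {
                /* No further wakeup until qhimark (10,000 by default)
                 * additional callbacks have accumulated. */
                wake_nocb_leader(rdp, true);
                rdp->qlen_last_fqs_check = LONG_MAX / 2;
        }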

The nocb path can miss a wakeup sent to the leader. If the system is
fairly idle, it can take a substantial amount of time for qhimark to be
reached so that we kick the leader again. It might be possible to tune
qhimark, but this is intellectually unsatisfying, especially since that
tunable has uses outside of the nocb path.

Instead, it looks like, in the conversion to simple wait queues, the
semantics of swake_up() came to differ slightly from wake_up(), and the
rcu_nocb code failed to make a necessary adjustment.

Here's the wakeup path:

static void wake_nocb_leader(struct rcu_data *rdp, bool force)
{
        struct rcu_data *rdp_leader = rdp->nocb_leader;

        if (!READ_ONCE(rdp_leader->nocb_kthread))
                return;
        if (READ_ONCE(rdp_leader->nocb_leader_sleep) || force) {
                /* Prior smp_mb__after_atomic() orders against prior enqueue. */
                WRITE_ONCE(rdp_leader->nocb_leader_sleep, false);
                swake_up(&rdp_leader->nocb_wq);
        }
}

And here's swake_up() itself:

void swake_up(struct swait_queue_head *q)
{
        unsigned long flags;

        if (!swait_active(q))
                return;

        raw_spin_lock_irqsave(&q->lock, flags);
        swake_up_locked(q);
        raw_spin_unlock_irqrestore(&q->lock, flags);
}
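
For contrast, the classic __wake_up() (4.9-era kernel/sched/wait.c)
takes the waitqueue lock unconditionally, so a waker always serializes
with a sleeper that is inside prepare_to_wait(), which takes the same
lock:

void __wake_up(wait_queue_head_t *q, unsigned int mode,
               int nr_exclusive, void *key)
{
        unsigned long flags;

        spin_lock_irqsave(&q->lock, flags);
        __wake_up_common(q, mode, nr_exclusive, 0, key);
        spin_unlock_irqrestore(&q->lock, flags);
}

swake_up(), by contrast, bakes a lockless waitqueue_active()-style
check into the wakeup itself, so every caller inherits the ordering
obligation that waitqueue_active() users must opt into explicitly.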

It's important to note that swait_active() isn't called with any locks
held, so we're in a situation where the write to nocb_leader_sleep and,
potentially, a store to q->task_list are in flight on different CPUs at
the same time. A comment in wake_nocb_leader() notes that the
smp_mb__after_atomic() gives us ordering against the prior enqueue, but
that's a red herring. In order to wake the leader, the waker's store to
nocb_leader_sleep must be visible on the CPU where the to-be-awakened
leader can observe it.
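
Concretely, the failure is the classic store-buffering interleaving.
Here's a sketch using the names from this code (two-column layout as in
the waitqueue_active comment quoted below):

   CPU 0 - waker                           CPU 1 - leader
   (wake_nocb_leader)                      (nocb_leader_wait)

   /* enqueue callback */                  prepare_to_swait(&nocb_wq, ...);
   WRITE_ONCE(nocb_leader_sleep, false);   if (!READ_ONCE(nocb_leader_sleep))
   swait_active(&nocb_wq)                          break;
     /* loads a stale, empty task_list,    /* stale load saw "true",
      * so returns without waking */        * so the break is not taken */
                                           schedule();  /* sleeps forever */

Each side loads a stale copy of the location the other side just
stored, so the waker skips the swake_up() and the leader commits to
sleeping.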

The documentation in memory-barriers.txt cautions us in this way:

[!] Note that the memory barriers implied by the sleeper and the
waker do _not_ order multiple stores before the wake-up with respect
to loads of those stored values after the sleeper has called
set_current_state(). For instance, if the sleeper does:

        set_current_state(TASK_INTERRUPTIBLE);
        if (event_indicated)
                break;
        __set_current_state(TASK_RUNNING);
        do_something(my_data);

and the waker does:

        my_data = value;
        event_indicated = 1;
        wake_up(&event_wait_queue);

there's no guarantee that the change to event_indicated will be
perceived by the sleeper as coming after the change to my_data. In
such a circumstance, the code on both sides must interpolate its own
memory barriers between the separate data accesses. Thus the above
sleeper ought to do:

        set_current_state(TASK_INTERRUPTIBLE);
        if (event_indicated) {
                smp_rmb();
                do_something(my_data);
        }

and the waker should do:

        my_data = value;
        smp_wmb();
        event_indicated = 1;
        wake_up(&event_wait_queue);

[https://www.kernel.org/doc/Documentation/memory-barriers.txt]

The RT-linux tree had a patch in its swait_head_has_waiters() function
to ensure that wakeups were not missed:

https://www.spinics.net/lists/linux-rt-users/msg10340.html

That barrier was dropped from the patch as part of getting it ready for
mainline:

http://www.spinics.net/lists/linux-rt-users/msg13691.html

The patch in the prior message appears to retain a comment to that
effect, though it's possible the comment was subsequently dropped. Also
note that waitqueue_active in the current code base contains the
following comment:

/**
 * waitqueue_active -- locklessly test for waiters on the queue
 * @q: the waitqueue to test for waiters
 *
 * returns true if the wait list is not empty
 *
 * NOTE: this function is lockless and requires care, incorrect usage _will_
 * lead to sporadic and non-obvious failure.
 *
 * Use either while holding wait_queue_head_t::lock or when used for wakeups
 * with an extra smp_mb() like:
 *
 *      CPU0 - waker                    CPU1 - waiter
 *
 *                                      for (;;) {
 *      @cond = true;                     prepare_to_wait(&wq, &wait, state);
 *      smp_mb();                         // smp_mb() from set_current_state()
 *      if (waitqueue_active(wq))         if (@cond)
 *        wake_up(wq);                      break;
 *                                        schedule();
 *                                      }
 *                                      finish_wait(&wq, &wait);
 *
 * Because without the explicit smp_mb() it's possible for the
 * waitqueue_active() load to get hoisted over the @cond store such that we'll
 * observe an empty wait list while the waiter might not observe @cond.
 *
 * Also note that this 'optimization' trades a spin_lock() for an smp_mb(),
 * which (when the lock is uncontended) are of roughly equal cost.
 */
[include/linux/wait.h]

That's exactly what we're seeing go wrong here. The waiter didn't see
the condition become true as it was going to sleep, and the waker didn't
see the waiter as being asleep.

Based upon all of this, it's pretty clear that we need a memory barrier
added to wake_nocb_leader(). The good news is that most other users of
swake_up() are already using other atomic operations that cause updates
to their values to become visible.

The fix should be something as simple as this:

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 56583e7..1322d0b 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -1765,8 +1765,8 @@ static void wake_nocb_leader(struct rcu_data *rdp, bool force)
 	if (!READ_ONCE(rdp_leader->nocb_kthread))
 		return;
 	if (READ_ONCE(rdp_leader->nocb_leader_sleep) || force) {
-		/* Prior smp_mb__after_atomic() orders against prior enqueue. */
-		WRITE_ONCE(rdp_leader->nocb_leader_sleep, false);
+		/* mb ensures nocb_leader_sleep is observed prior to wakeup */
+		smp_store_mb(rdp_leader->nocb_leader_sleep, false);
 		swake_up(&rdp_leader->nocb_wq);
 	}
 }
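
For what it's worth, the generic definition of smp_store_mb()
(include/asm-generic/barrier.h) is roughly the store plus a full
barrier, so this should be the same as following the existing
WRITE_ONCE() with a full smp_mb():

#define smp_store_mb(var, value)  do { WRITE_ONCE(var, value); smp_mb(); } while (0)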

--- >8 cut here 8< ---

Thanks,

-K


2017-06-08 20:55:08

by Paul E. McKenney

Subject: Re: [PATCH tip/core/rcu 45/88] rcu: Add memory barriers for NOCB leader wakeup

On Thu, Jun 08, 2017 at 01:11:48PM -0700, Krister Johansen wrote:
> Hi Paul,
>
> On Thu, May 25, 2017 at 02:59:18PM -0700, Paul E. McKenney wrote:
> > Wait/wakeup operations do not guarantee ordering on their own. Instead,
> > either locking or memory barriers are required. This commit therefore
> > adds memory barriers to wake_nocb_leader() and nocb_leader_wait().
> >
> > Signed-off-by: Paul E. McKenney <[email protected]>
> > ---
> > kernel/rcu/tree_plugin.h | 2 ++
> > 1 file changed, 2 insertions(+)
> >
> > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > index 0b1042545116..573fbe9640a0 100644
> > --- a/kernel/rcu/tree_plugin.h
> > +++ b/kernel/rcu/tree_plugin.h
> > @@ -1810,6 +1810,7 @@ static void wake_nocb_leader(struct rcu_data *rdp, bool force)
> >  	if (READ_ONCE(rdp_leader->nocb_leader_sleep) || force) {
> >  		/* Prior smp_mb__after_atomic() orders against prior enqueue. */
> >  		WRITE_ONCE(rdp_leader->nocb_leader_sleep, false);
> > +		smp_mb(); /* ->nocb_leader_sleep before swake_up(). */
> >  		swake_up(&rdp_leader->nocb_wq);
> >  	}
> >  }
> > @@ -2064,6 +2065,7 @@ static void nocb_leader_wait(struct rcu_data *my_rdp)
> >  	 * nocb_gp_head, where they await a grace period.
> >  	 */
> >  	gotcbs = false;
> > +	smp_mb(); /* wakeup before ->nocb_head reads. */
> >  	for (rdp = my_rdp; rdp; rdp = rdp->nocb_next_follower) {
> >  		rdp->nocb_gp_head = READ_ONCE(rdp->nocb_head);
> >  		if (!rdp->nocb_gp_head)
>
> May I impose upon you to CC this patch to stable, and tag it as fixing
> abedf8e241? I ran into this on a production 4.9 branch. When I
> debugged it, I discovered that it went all the way back to 4.6. The
> tl;dr is that at least for some environments, the missed wakeup
> manifests itself as a series of hung-task warnings to console and if I'm
> unlucky it can also generate a hang that can block interactive logins
> via ssh.

Interesting! This is the first that I have heard that this was anything
other than a theoretical bug. To the comment in your second URL, it is
wise to recall that a seismologist was in fact arrested for failing to
predict an earthquake. Later acquitted/pardoned/whatever, but arrested
nonetheless. ;-)

https://www.theguardian.com/world/2012/oct/23/jailing-italian-seismologists-scientific-community

Silliness aside, does my patch actually fix your problem in practice as
well as in theory? If so, may I have your Tested-by?

Impressive investigative effort, by the way!

Thanx, Paul

2017-06-08 21:28:21

by Krister Johansen

Subject: Re: [PATCH tip/core/rcu 45/88] rcu: Add memory barriers for NOCB leader wakeup

On Thu, Jun 08, 2017 at 01:55:00PM -0700, Paul E. McKenney wrote:
> On Thu, Jun 08, 2017 at 01:11:48PM -0700, Krister Johansen wrote:
> > May I impose upon you to CC this patch to stable, and tag it as fixing
> > abedf8e241? I ran into this on a production 4.9 branch. When I
> > debugged it, I discovered that it went all the way back to 4.6. The
> > tl;dr is that at least for some environments, the missed wakeup
> > manifests itself as a series of hung-task warnings to console and if I'm
> > unlucky it can also generate a hang that can block interactive logins
> > via ssh.
>
> Interesting! This is the first that I have heard that this was anything
> other than a theoretical bug. To the comment in your second URL, it is
> wise to recall that a seismologist was in fact arrested for failing to
> predict an earthquake. Later acquitted/pardoned/whatever, but arrested
> nonetheless. ;-)

Point taken. I do realize that we all make mistakes, and certainly I do
too. Perhaps I should have said that my survey of current callers of
swake_up() was enough to convince me that I didn't have an immediate
problem elsewhere, but that I'm not familiar enough with the code base
to make that statement with a lot of authority. The concern being that if
the patch came from RT-linux where the barrier was present in
swake_up(), are there other places where swake_up() callers still assume
this is being handled on their behalf?

As part of this, I also pondered whether I should add a comment around
swake_up(), similar to what's already there for waitqueue_active.
I wasn't sure how subtle this is for other consumers, though.

> Silliness aside, does my patch actually fix your problem in practice as
> well as in theory? If so, may I have your Tested-by?

Yes, it absolutely does. Consider it given:

Tested-by: Krister Johansen <[email protected]>

> Impressive investigative effort, by the way!

Thanks!

-K

2017-06-08 23:47:51

by Paul E. McKenney

Subject: Re: [PATCH tip/core/rcu 45/88] rcu: Add memory barriers for NOCB leader wakeup

On Thu, Jun 08, 2017 at 02:28:14PM -0700, Krister Johansen wrote:
> On Thu, Jun 08, 2017 at 01:55:00PM -0700, Paul E. McKenney wrote:
> > On Thu, Jun 08, 2017 at 01:11:48PM -0700, Krister Johansen wrote:
> > > May I impose upon you to CC this patch to stable, and tag it as fixing
> > > abedf8e241? I ran into this on a production 4.9 branch. When I
> > > debugged it, I discovered that it went all the way back to 4.6. The
> > > tl;dr is that at least for some environments, the missed wakeup
> > > manifests itself as a series of hung-task warnings to console and if I'm
> > > unlucky it can also generate a hang that can block interactive logins
> > > via ssh.
> >
> > Interesting! This is the first that I have heard that this was anything
> > other than a theoretical bug. To the comment in your second URL, it is
> > wise to recall that a seismologist was in fact arrested for failing to
> > predict an earthquake. Later acquitted/pardoned/whatever, but arrested
> > nonetheless. ;-)
>
> Point taken. I do realize that we all make mistakes, and certainly I do
> too.

Indeed! Let's just say that the author of that email will have no
trouble returning the favor, and sooner rather than later. ;-)

> Perhaps I should have said that my survey of current callers of
> swake_up() was enough to convince me that I didn't have an immediate
> problem elsewhere, but that I'm not familiar enough with the code base
> to make that statement with a lot of authority. The concern being that if
> the patch came from RT-linux where the barrier was present in
> swake_up(), are there other places where swake_up() callers still assume
> this is being handled on their behalf?
>
> As part of this, I also pondered whether I should add a comment around
> swake_up(), similar to what's already there for waitqueue_active.
> I wasn't sure how subtle this is for other consumers, though.

In my case, I assume I need barriers for swake_up(), which is why I
found this bug by inspection. Still, I wouldn't mind a comment.
Others might have other opinions.

> > Silliness aside, does my patch actually fix your problem in practice as
> > well as in theory? If so, may I have your Tested-by?
>
> Yes, it absolutely does. Consider it given:
>
> Tested-by: Krister Johansen <[email protected]>

Thank you!!!

Thanx, Paul

> > Impressive investigative effort, by the way!
>
> Thanks!
>
> -K
>

2017-06-09 03:24:40

by Krister Johansen

Subject: Re: [PATCH tip/core/rcu 45/88] rcu: Add memory barriers for NOCB leader wakeup

On Thu, Jun 08, 2017 at 04:47:43PM -0700, Paul E. McKenney wrote:
> On Thu, Jun 08, 2017 at 02:28:14PM -0700, Krister Johansen wrote:
> > On Thu, Jun 08, 2017 at 01:55:00PM -0700, Paul E. McKenney wrote:
> > > Interesting! This is the first that I have heard that this was anything
> > > other than a theoretical bug. To the comment in your second URL, it is
> > > wise to recall that a seismologist was in fact arrested for failing to
> > > predict an earthquake. Later acquitted/pardoned/whatever, but arrested
> > > nonetheless. ;-)
> >
> > Point taken. I do realize that we all make mistakes, and certainly I do
> > too.
>
> Indeed! Let's just say that the author of that email will have no
> trouble returning the favor, and sooner rather than later. ;-)

No doubt he's polishing up an extra-small, extra-tight pair of handcuffs
with my name on them.

> > Perhaps I should have said that my survey of current callers of
> > swake_up() was enough to convince me that I didn't have an immediate
> > problem elsewhere, but that I'm not familiar enough with the code base
> > to make that statement with a lot of authority. The concern being that if
> > the patch came from RT-linux where the barrier was present in
> > swake_up(), are there other places where swake_up() callers still assume
> > this is being handled on their behalf?
> >
> > As part of this, I also pondered whether I should add a comment around
> > swake_up(), similar to what's already there for waitqueue_active.
> > I wasn't sure how subtle this is for other consumers, though.
>
> In my case, I assume I need barriers for swake_up(), which is why I
> found this bug by inspection. Still, I wouldn't mind a comment.
> Others might have other opinions.

Since you don't mind, I've prepared a small patch for those comments. I'll
send that in a separate thread.

Thanks again,

-K