2018-07-31 18:31:53

by David Chen

Subject: Backport 35a2897c2a (sched/wait: Remove the lockless swait_active() check in swake_up*()) to 4.9 branch

Hi Peter,

On the 4.9 branch, we hit an issue in RCU where the NOCB follower list is not
getting reclaimed, which eventually causes an OOM.

In discussion with Paul, we were able to figure out that the problem is a
missed wake-up, resulting from the lack of a proper memory barrier between
setting the wake-up condition and calling swake_up().

nocb_leader_wait()
{
	...
	*tail = rdp->nocb_gp_head;
	smp_mb__after_atomic(); /* Store *tail before wakeup. */
	if (rdp != my_rdp && tail == &rdp->nocb_follower_head) {
		swake_up(&rdp->nocb_wq);
	}
	...
}
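
To illustrate the ordering requirement, here is a minimal userspace sketch
using C11 atomics. It is not the kernel code: "cond", "waiter_queued" and both
function names are hypothetical stand-ins for the wake-up condition and for
what a lockless swait_active() check would observe. It only shows that each
side needs a full barrier between its store and its subsequent load.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins, not kernel symbols. */
atomic_bool cond;          /* the wake-up condition */
atomic_bool waiter_queued; /* what a lockless "is anyone waiting?" check reads */

/* Waker side: publish the condition, then check for a queued waiter. */
bool waker_should_wake(void)
{
	atomic_store_explicit(&cond, true, memory_order_relaxed);
	/*
	 * Full fence: orders the store above before the load below.
	 * A compiler-only barrier does not prevent this store->load
	 * reordering on x86.
	 */
	atomic_thread_fence(memory_order_seq_cst);
	return atomic_load_explicit(&waiter_queued, memory_order_relaxed);
}

/* Waiter side: queue itself, then re-check the condition before sleeping. */
bool waiter_should_sleep(void)
{
	atomic_store_explicit(&waiter_queued, true, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
	return !atomic_load_explicit(&cond, memory_order_relaxed);
}

/* Reset the shared state between runs of the sketch. */
void reset_state(void)
{
	atomic_store(&cond, false);
	atomic_store(&waiter_queued, false);
}
```

With both fences in place, it cannot happen that the waker sees no waiter
while the waiter also sees no condition. If either fence is weakened to a
compiler barrier, that outcome becomes possible and the waiter sleeps forever.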

Note that smp_mb__after_atomic() is only a compiler barrier on x86, so it does
not order the store to *tail before the lockless swait_active() load inside
swake_up(). Originally I was going to change the barrier to smp_mb(), but then
I found that master has the following patch, which solves the same class of
problem by removing the lockless check inside swake_up():

35a2897c2a (sched/wait: Remove the lockless swait_active() check in swake_up*())
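
As a rough userspace analogy of that approach (a sketch with pthreads, not the
kernel implementation; "struct wq", "wq_wake()", "waiter()" and "run_once()"
are hypothetical names), the waker below always takes the queue lock before
looking at the waiter count, so the lock supplies the ordering and the caller
needs no explicit barrier:

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical wait queue: a lock, a condition variable, a waiter count. */
struct wq {
	pthread_mutex_t lock;
	pthread_cond_t cv;
	int nwaiters;
};

struct wq q = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0 };
atomic_bool work_ready; /* the wake-up condition, set outside the lock */

/* Wake: no lockless early return; the waiter count is read under the lock. */
void wq_wake(struct wq *w)
{
	pthread_mutex_lock(&w->lock);
	if (w->nwaiters > 0)
		pthread_cond_signal(&w->cv);
	pthread_mutex_unlock(&w->lock);
}

/* Waiter: queue itself and check the condition while holding the lock. */
void *waiter(void *arg)
{
	struct wq *w = arg;

	pthread_mutex_lock(&w->lock);
	w->nwaiters++;
	while (!atomic_load(&work_ready))
		pthread_cond_wait(&w->cv, &w->lock);
	w->nwaiters--;
	pthread_mutex_unlock(&w->lock);
	return NULL;
}

/* One round: start a waiter, set the condition, wake, and join. */
bool run_once(void)
{
	pthread_t t;

	atomic_store(&work_ready, false);
	if (pthread_create(&t, NULL, waiter, &q) != 0)
		return false;
	atomic_store(&work_ready, true); /* set the condition ... */
	wq_wake(&q);                     /* ... then wake, under the lock */
	return pthread_join(t, NULL) == 0;
}
```

Either the waker's critical section runs first, in which case the waiter
observes work_ready when it takes the lock, or the waiter's runs first, in
which case the waker sees nwaiters > 0 and signals. The cost is that every
wake-up takes the lock, even when nobody is waiting.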

So I'm wondering whether we can backport this patch to the 4.9 branch to solve
this issue, and possibly other potential missed wake-up issues as well.

Thanks,
David