Date: Thu, 15 Jun 2017 10:56:29 -0700
From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Boqun Feng
Cc: Krister Johansen, Steven Rostedt, Peter Zijlstra, Ingo Molnar,
	linux-kernel@vger.kernel.org, Paul Gortmaker, Thomas Gleixner
Subject: Re: [PATCH tip/sched/core] Add comments to aid in safer usage of swake_up.
Message-Id: <20170615175629.GE3721@linux.vnet.ibm.com>
In-Reply-To: <20170615041828.zk3a3sfyudm5p6nl@tardis>

On Thu, Jun 15, 2017 at 12:18:28PM +0800, Boqun Feng wrote:
> On Wed, Jun 14, 2017 at 09:25:58AM -0700, Krister Johansen wrote:
> > On Wed, Jun 14, 2017 at 11:02:40AM -0400, Steven Rostedt wrote:
> > > On Wed, 14 Jun 2017 09:10:15 -0400
> > > Steven Rostedt wrote:
> > >
> > > > Now let's make it simpler. I'll even add the READ_ONCE and WRITE_ONCE
> > > > where applicable.
> > > >
> > > >	CPU0				CPU1
> > > >	----				----
> > > >					LOCK(A)
> > > >
> > > >	LOCK(B)
> > > >					WRITE_ONCE(X, INIT)
> > > >
> > > >					(the cpu may postpone writing X)
> > > >
> > > >					(the cpu can fetch wq list here)
> > > >	list_add(wq, q)
> > > >
> > > >	UNLOCK(B)
> > > >
> > > >	(the cpu may fetch old value of X)
> > > >
> > > >					(write of X happens here)
> > > >
> > > >	if (READ_ONCE(X) != INIT)
> > > >		schedule();
> > > >
> > > >					UNLOCK(A)
> > > >
> > > >					if (list_empty(wq))
> > > >						return;
> > > >
> > > > Tell me again how the READ_ONCE() and WRITE_ONCE() help in this
> > > > scenario?
> > > >
> > > > Because we are using spinlocks, this won't be an issue for most
> > > > architectures. The bug happens if the fetch for the list_empty()
> > > > leaks in before the UNLOCK(A).
> > > >
> > > > If the reading/writing of the list and the reading/writing of
> > > > gp_flags get reversed in either direction by the CPU, then we
> > > > have a problem.
> > >
> > > FYI..
> > >
> > > Both sides need a memory barrier. Otherwise, even with a memory barrier
> > > on CPU1 we can still have:
> > >
> > >	CPU0				CPU1
> > >	----				----
> > >
> > >					LOCK(A)
> > >	LOCK(B)
> > >
> > >	list_add(wq, q)
> > >
> > >	(cpu waits to write wq list)
> > >
> > >	(cpu fetches X)
> > >
> > >					WRITE_ONCE(X, INIT)
> > >
> > >					UNLOCK(A)
> > >
> > >					smp_mb();
> > >
> > >					if (list_empty(wq))
> > >						return;
> > >
> > >	(cpu writes wq list)
> > >
> > >	UNLOCK(B)
> > >
> > >	if (READ_ONCE(X) != INIT)
> > >		schedule()
> > >
> > > Luckily for us, there is a memory barrier on CPU0. In
> > > prepare_to_swait() we have:
> > >
> > >	raw_spin_lock_irqsave(&q->lock, flags);
> > >	__prepare_to_swait(q, wait);
> > >	set_current_state(state);
> > >	raw_spin_unlock_irqrestore(&q->lock, flags);
> > >
> > > And that set_current_state() call includes a memory barrier, which will
> > > prevent the above from happening, as the addition to the wq list must
> > > be flushed before fetching X.
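
(For reference, the ordering Steven describes comes from set_current_state()
itself. In kernels of this vintage it boils down to roughly the following,
paraphrased from include/linux/sched.h and the generic barrier.h; the exact
definition varies by kernel version and architecture:

	/* paraphrase, not the verbatim source */
	#define set_current_state(state_value)				\
		smp_store_mb(current->state, (state_value))

	/* generic fallback: a plain store followed by a full barrier */
	#define smp_store_mb(var, value)				\
		do { WRITE_ONCE(var, value); smp_mb(); } while (0)

So the list_add() done in __prepare_to_swait() cannot be reordered past a
later read of the wait condition.)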
> > >
> > > I still strongly believe that the swait_active() requires a memory
> > > barrier.
> >
> > FWLIW, I agree. There was a smp_mb() in RT-linux's equivalent of
> > swait_activate().
> >
> > https://www.spinics.net/lists/linux-rt-users/msg10340.html
> >
> > If the barrier goes in swait_active() then we don't have to require all
> > of the callers of swait_active and swake_up to issue the barrier
> > instead. Handling this in swait_active is likely to be less error
> > prone. Though, we could also do something like wq_has_sleeper() and use
> > that preferentially in swake_up and its variants.
>
> I think it makes more sense that we delete the swait_active() check in
> swake_up(). Because we seem to encourage users to do the quick check on
> the wait queue on their own, so why do the check again in swake_up()?
> Besides, wake_up() doesn't call waitqueue_active() outside the lock
> critical section either.
>
> So how about the patch below (testing is in progress)? Peter?

It is quite possible that a problem I am seeing is caused by this, but
there are reasons to believe otherwise. And in any case, the problem is
quite rare, taking tens or perhaps even hundreds of hours of rcutorture
to reproduce.

So, would you be willing to create a dedicated swait torture test to
check this out? The usual approach would be to create a circle of
kthreads, with each waiting on the previous kthread and waking up the
next one. Each kthread, after being awakened, checks a variable that
its waker sets just before the wakeup. Have another kthread check for
hangs. Possibly introduce timeouts and random delays to stir things up
a bit.

But maybe such a test already exists. Does anyone know of one? I don't
see anything obvious.

Interested?
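
Something like the following completely untested sketch is what I have
in mind. Every name in it (NR_STRESS, swait_stress_*) is made up for
illustration, and module teardown and error handling are omitted:

	/*
	 * Sketch of a circular swait stress test; illustrative only,
	 * not an existing kernel module.
	 */
	#include <linux/kthread.h>
	#include <linux/module.h>
	#include <linux/sched.h>
	#include <linux/swait.h>

	#define NR_STRESS 8

	static struct swait_queue_head stress_wq[NR_STRESS];
	static unsigned long stress_count[NR_STRESS];
	static struct task_struct *stress_task[NR_STRESS];
	static struct task_struct *stress_check_task;

	/* Each kthread waits on its predecessor and wakes its successor. */
	static int swait_stress(void *arg)
	{
		int me = (long)arg;
		int next = (me + 1) % NR_STRESS;
		unsigned long seen = 0;

		while (!kthread_should_stop()) {
			/* Sleep until our waker bumps our counter. */
			swait_event(stress_wq[me],
				    READ_ONCE(stress_count[me]) != seen ||
				    kthread_should_stop());
			seen = READ_ONCE(stress_count[me]);

			/* Set the successor's condition, *then* wake it. */
			WRITE_ONCE(stress_count[next],
				   stress_count[next] + 1);
			swake_up(&stress_wq[next]);
		}
		return 0;
	}

	/* Watchdog: if the circle stops advancing, a wakeup was lost. */
	static int swait_stress_check(void *arg)
	{
		unsigned long last = 0;

		while (!kthread_should_stop()) {
			schedule_timeout_uninterruptible(10 * HZ);
			if (READ_ONCE(stress_count[0]) == last)
				pr_err("swait_stress: stall, lost wakeup?\n");
			last = READ_ONCE(stress_count[0]);
		}
		return 0;
	}

	static int __init swait_stress_init(void)
	{
		int i;

		for (i = 0; i < NR_STRESS; i++) {
			init_swait_queue_head(&stress_wq[i]);
			stress_task[i] = kthread_run(swait_stress,
						     (void *)(long)i,
						     "swait_stress/%d", i);
		}
		stress_check_task = kthread_run(swait_stress_check, NULL,
						"swait_stress_check");

		/* Kick the circle into motion. */
		WRITE_ONCE(stress_count[0], 1);
		swake_up(&stress_wq[0]);
		return 0;
	}
	module_init(swait_stress_init);
	MODULE_LICENSE("GPL");

The point of interest is that each kthread updates its successor's
counter before calling swake_up(), so a stall reported by the watchdog
would indicate a lost wakeup rather than a test bug. Random delays and
timeout-based waits could be layered on top to stir things up further.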

							Thanx, Paul

> Regards,
> Boqun
>
> --------------------->8
> Subject: [PATCH] swait: Remove the lockless swait_active() check in
> swake_up*()
>
> Steven Rostedt reported a potential race in RCU core because of
> swake_up():
>
>	CPU0				CPU1
>	----				----
>				__call_rcu_core() {
>
>				 spin_lock(rnp_root)
>				 need_wake = __rcu_start_gp() {
>				  rcu_start_gp_advanced() {
>				   gp_flags = FLAG_INIT
>				  }
>				 }
>
>	rcu_gp_kthread() {
>	 swait_event_interruptible(wq,
>		gp_flags & FLAG_INIT) {
>	 spin_lock(q->lock)
>
>				*fetch wq->task_list here! *
>
>	 list_add(wq->task_list, q->task_list)
>	 spin_unlock(q->lock);
>
>	 *fetch old value of gp_flags here *
>
>				 spin_unlock(rnp_root)
>
>				 rcu_gp_kthread_wake() {
>				  swake_up(wq) {
>				   swait_active(wq) {
>				    list_empty(wq->task_list)
>
>				   } * return false *
>
>	 if (condition) * false *
>	  schedule();
>
> In this case, a wakeup is missed, which could cause rcu_gp_kthread to
> wait for a long time.
>
> The reason for this is that we do a lockless swait_active() check in
> swake_up(). To fix this, we can either 1) add a smp_mb() in swake_up()
> before swait_active() to provide the proper order or 2) simply remove
> the swait_active() check in swake_up().
>
> Solution 2 not only fixes this problem but also keeps the swait and
> wait APIs as close as possible, as wake_up() doesn't provide a full
> barrier and doesn't do a lockless check of the wait queue either.
> Moreover, there are users already using swait_active() to do their
> quick checks on the wait queues, so it makes less sense that swake_up()
> and swake_up_all() do this on their own.
>
> This patch therefore removes the lockless swait_active() check in
> swake_up() and swake_up_all().
>
> Reported-by: Steven Rostedt
> Signed-off-by: Boqun Feng
> ---
>  kernel/sched/swait.c | 6 ------
>  1 file changed, 6 deletions(-)
>
> diff --git a/kernel/sched/swait.c b/kernel/sched/swait.c
> index 3d5610dcce11..2227e183e202 100644
> --- a/kernel/sched/swait.c
> +++ b/kernel/sched/swait.c
> @@ -33,9 +33,6 @@ void swake_up(struct swait_queue_head *q)
>  {
>  	unsigned long flags;
>  
> -	if (!swait_active(q))
> -		return;
> -
>  	raw_spin_lock_irqsave(&q->lock, flags);
>  	swake_up_locked(q);
>  	raw_spin_unlock_irqrestore(&q->lock, flags);
> @@ -51,9 +48,6 @@ void swake_up_all(struct swait_queue_head *q)
>  	struct swait_queue *curr;
>  	LIST_HEAD(tmp);
>  
> -	if (!swait_active(q))
> -		return;
> -
>  	raw_spin_lock_irq(&q->lock);
>  	list_splice_init(&q->task_list, &tmp);
>  	while (!list_empty(&tmp)) {
> --
> 2.13.0
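
(For completeness, option 1 from the commit message above, done in the
wq_has_sleeper() style that Krister mentioned, would look roughly like
the sketch below. The swq_has_sleeper() helper is invented here for
illustration; no such function exists in the swait API as of this
message:

	/* Hypothetical helper, by analogy with wq_has_sleeper(). */
	static inline bool swq_has_sleeper(struct swait_queue_head *q)
	{
		/*
		 * Pairs with the barrier implied by set_current_state()
		 * in prepare_to_swait(): order the waker's earlier
		 * condition update before it inspects the wait list, so
		 * that either the waiter sees the new condition or the
		 * waker sees the waiter on the list.
		 */
		smp_mb();
		return !list_empty(&q->task_list);
	}

	void swake_up(struct swait_queue_head *q)
	{
		unsigned long flags;

		if (!swq_has_sleeper(q))
			return;

		raw_spin_lock_irqsave(&q->lock, flags);
		swake_up_locked(q);
		raw_spin_unlock_irqrestore(&q->lock, flags);
	}

Whether an unconditional smp_mb() on every wakeup is cheaper than
unconditionally taking q->lock presumably depends on the architecture
and the workload, which may be an argument for simply dropping the
fast path as the patch above does.)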