Date: Tue, 27 Jun 2017 13:58:03 -0700
From: "Paul E. McKenney"
To: Andrea Parri
Cc: linux-kernel@vger.kernel.org, priyalee.kushwaha@intel.com, drozdziak1@gmail.com, arnd@arndb.de, ldr709@gmail.com, tglx@linutronix.de, peterz@infradead.org, josh@joshtriplett.org, nico@linaro.org, kjlx@templeofstupid.com, vegard.nossum@oracle.com, torvalds@linux-foundation.org, dcb314@hotmail.com, fengguang.wu@intel.com, fweisbec@gmail.com, riel@redhat.com, rostedt@goodmis.org, mingo@kernel.org, stern@rowland.harvard.edu, luc.maranget@inria.fr, j.alglave@ucl.ac.uk
Subject: Re: [GIT PULL rcu/next] RCU commits for 4.13
Reply-To: paulmck@linux.vnet.ibm.com
References: <20170612213754.GA7201@linux.vnet.ibm.com> <20170614025404.GA2525@andrea> <20170614043317.GK3721@linux.vnet.ibm.com> <20170614143322.GA3368@andrea> <20170614202329.GV3721@linux.vnet.ibm.com> <20170619162428.GA9632@andrea>
In-Reply-To: <20170619162428.GA9632@andrea>
Message-Id: <20170627205802.GY3721@linux.vnet.ibm.com>
On Mon, Jun 19, 2017 at 06:24:28PM +0200, Andrea Parri wrote:
> On Wed, Jun 14, 2017 at 01:23:29PM -0700, Paul E. McKenney wrote:
> > On Wed, Jun 14, 2017 at 04:33:22PM +0200, Andrea Parri wrote:
> > > On Tue, Jun 13, 2017 at 09:33:17PM -0700, Paul E. McKenney wrote:
> > > > On Wed, Jun 14, 2017 at 04:54:04AM +0200, Andrea Parri wrote:
> > > > > On Mon, Jun 12, 2017 at 02:37:55PM -0700, Paul E. McKenney wrote:
> > > > > > Hello, Ingo,
> > > > > >
> > > > > > This pull request is unusual in being a single linear set of
> > > > > > commits, as opposed to my usual topic branches.  This is due to
> > > > > > the many large-footprint changes, which means that reasonable
> > > > > > topic branches result in large numbers of merge conflicts.  In
> > > > > > addition, some commits depend on other commits that should be
> > > > > > on different topic branches.  I will return to the topic-branch
> > > > > > style next time.
> > > > > >
> > > > > > The largest feature of this series is shrinking and
> > > > > > simplification, with the following diffstat summary:
> > > > > >
> > > > > >  79 files changed, 1496 insertions(+), 4211 deletions(-)
> > > > > >
> > > > > > In other words, this series represents a net reduction of more
> > > > > > than 2700 lines of code.
> > > > > >
> > > > > > These commits were posted to LKML:
> > > > > >
> > > > > > http://lkml.kernel.org/r/20170525215934.GA11578@linux.vnet.ibm.com
> > > > >
> > > > > I did raise some issues (AFAICT, unresolved) concerning...
> > > > >
> > > > > > Two of these commits (46/88 and 48/88) have been deferred, most
> > > > > > likely to v4.14.  All of the remaining commits have been
> > > > > > subjected to the 0day Test Robot and -next testing, and are
> > > > > > available in the git repository at:
> > > > > >
> > > > > >   git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git for-mingo
> > > > > >
> > > > > > for you to fetch changes up to 6d48152eafde1f0d0a4a9e0584fa7d9ff4fbfdac:
> > > > > >
> > > > > >   rcu: Remove RCU CPU stall warnings from Tiny RCU (2017-06-08 18:52:45 -0700)
> > > > > >
> > > > > > ----------------------------------------------------------------
> > > > > > Arnd Bergmann (1):
> > > > > >       bcm47xx: Fix build regression
> > > > > >
> > > > > > Paul E. McKenney (83):
> > > > > >       rcutorture: Add lockdep to one of the SRCU scenarios
> > > > > >       rcutorture: Add three-level tree test for Tree SRCU
> > > > > >       rcutorture: Fix bug in reporting Kconfig mis-settings
> > > > > >       rcutorture: Add a scenario for Tiny SRCU
> > > > > >       rcutorture: Add a scenario for Classic SRCU
> > > > > >       rcu: Prevent rcu_barrier() from starting needless grace periods
> > > > > >       rcutorture: Correctly handle CONFIG_RCU_TORTURE_TEST_* options
> > > > > >       rcutorture: Update test scenarios based on new Kconfig dependencies
> > > > > >       srcu: Eliminate possibility of destructive counter overflow
> > > > > >       rcu: Complain if blocking in preemptible RCU read-side critical section
> > > > > >       rcuperf: Defer expedited/normal check to end of test
> > > > > >       rcuperf: Remove conflicting Kconfig options
> > > > > >       rcu: Remove obsolete reference to synchronize_kernel()
> > > > > >       rcuperf: Add ability to performance-test call_rcu() and friends
> > > > > >       rcuperf: Add a Kconfig-fragment file for Classic SRCU
> > > > > >       rcu: Make sync_rcu_preempt_exp_done() return bool
> > > > > >       checkpatch: Remove checks for expedited grace periods
> > > > > >       rcuperf: Add test for dynamically initialized srcu_struct
> > > > > >       doc/atomic_ops: Clarify smp_mb__{before,after}_atomic()
> > > > > >       atomics: Add header comment so spin_unlock_wait()
> > > > >
> > > > > ... this one: c.f.,
> > > > >
> > > > > http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1418503.html
> > > > >
> > > > > Any hints about those?
> > > >
> > > > I suggest being -extremely- clear.  This is about ARM, correct?
> > > > If so, lay out the exact situation (code example, which hardware,
> > > > which spin_unlock_wait(), which sequence of events) that could
> > > > lead to the failure.
> > > >
> > > > The problem here is that no one knows which of the 30 CPU families
> > > > you might be talking about, nor do they know exactly what the
> > > > problem is.  I didn't worry about it at the time because I figured
> > > > that you had sent private email to Will with the full story.
> > > >
> > > > Yes, the four of us (you, Alan, Luc, and me) discussed it, but we
> > > > weren't sure whether it was a bug in the memory model, the
> > > > spin_unlock_wait() code, or my description of spin_unlock_wait().
> > > > Given that Will didn't object to my April 13th email (the one that
> > > > you were not CCed on), I figured that he wasn't going to claim
> > > > that the spin_unlock_wait() description was wrong, especially
> > > > since he went to so much effort some years back to make ARM64
> > > > meet that description.
> > > >
> > > > So again, I recommend replying to your msg1418503.html email with
> > > > a code fragment demonstrating the problem, exact identification of
> > > > the hardware that might be susceptible (ARM64?  ARM32?  Which
> > > > ARM32?), exact identification of which spin_unlock_wait() function
> > > > you suspect, and a clear bullet-form sequence of events that shows
> > > > how you believe that the problem can occur.
> > > >
> > > > That makes it easy for people to see what your concern is, easy
> > > > for them to check their code and hardware, and hard for them to
> > > > ignore you.
> > > >
> > > > Make sense?
> > >
> > > My concerns originate from the fact that none of the implementations
> > > (of spin_unlock_wait()) for the architectures touched by:
> > >
> > >   726328d92a42b6d4b76078e2659f43067f82c4e8
> > >   ("locking/spinlock, arch: Update and fix spin_unlock_wait()
> > >    implementations")
> > >
> > > currently contain any traces of that RELEASE/spin_unlock() from your:
> > >
> > >   "Semantically this is equivalent to a spin_lock() immediately
> > >    followed by a spin_unlock()."
> > >
> > > In fact, the header of that commit states:
> > >
> > >   "The update is in semantics; where it previously was only a
> > >    control dependency, we now upgrade to a full load-acquire [...]"
> > >
> > > For an example leveraging this RELEASE, consider:
> > >
> > >   [initially: X = 0, s UNLOCKED]
> > >
> > >   P0                    P1
> > >   X = 1;                spin_lock(s);
> > >   spin_unlock_wait(s);  r0 = X;
> > >
> > > According to the "spin_lock(); spin_unlock() semantics" this has
> > > one non-deadlocking execution, and the RELEASE from the
> > > spin_unlock_wait() (paired with the ACQUIRE from the spin_lock()
> > > in P1) guarantees that r0 = 1 in this execution.  AFAICT, this
> > > same conclusion does not hold according to the
> > > "smp_cond_load_acquire() semantics" (726328d92a42b).
> >
> > OK.  For exactly which CPU families do you believe that this fails
> > to hold?  That is, given the implementations of spin_unlock_wait()
> > and spin_lock() for the various CPU families, which will break and
> > why?
> Considering the case of x86, the question amounts to asking whether
> the "exists" clause of the following test can be satisfied:
>
>   X86 TEST
>   {
>   1:EAX=1;
>   }
>    P0          | P1                ;
>    MOV [X],$1  | LOCK XCHG [s],EAX ;
>    MOV EAX,[s] | MOV EBX,[X]       ;
>   exists
>   (0:EAX=0 /\ 1:EBX=0)
>
> The answer is "Yes", as illustrated by the following sequence of
> events:
>
>   1) P0's store to X is placed into the store buffer (of P0);
>   2) P0 loads 0 from s;
>   3) P1 executes LOCK XCHG;
>   4) P1 loads X=0 from memory;
>   5) P0's store to X is flushed from store buffer to memory.
>
> ---
>
> This analysis shows that the x86 implementation needs additional
> synchronization to meet a "spin_lock(); spin_unlock()" semantics.
>
> I believe that the same conclusion holds for other architectures
> (e.g., sparc, ia64, arm(32), alpha).
>
> In fact, the only two implementations I know that _seem_ able to
> meet that semantics are arm64's and powerpc's.

And you are quite correct.  In fact, Luc Maranget was able to make this
happen on real x86 hardware using real Linux-kernel locking primitives.

So what next?

One option would be to weaken the definition of spin_unlock_wait() so
that it had acquire semantics but not release semantics.  Alternatively,
we could keep the full empty-critical-section semantics and add memory
barriers to spin_unlock_wait() implementations, perhaps as shown in the
patch below.  I could go either way, though I do have some preference
for the stronger semantics.

Thoughts?

							Thanx, Paul

------------------------------------------------------------------------

commit a023454468fefc41c5d3c1054106ca77aa1b962d
Author: Paul E. McKenney
Date:   Fri Jun 23 13:20:27 2017 -0700

    locking: Add leading smp_mb() for spin_unlock_wait()

    The spin_unlock_wait() primitive is supposed to have the effect of
    a spin_lock() immediately followed by a spin_unlock().
    However, the klitmus7 tool says otherwise on x86 when running a
    kernel module generated from the following litmus test:

	C C-unlock-wait-00

	{
		spinlock_t r;
	}

	P0(spinlock_t *r, int *x, int *y)
	{
		int r0;

		WRITE_ONCE(*x, 1);
		spin_unlock_wait(r);
		r0 = READ_ONCE(*y);
	}

	P1(spinlock_t *r, int *x, int *y)
	{
		int r1;

		spin_lock(r);
		WRITE_ONCE(*y, 1);
		r1 = READ_ONCE(*x);
		spin_unlock(r);
	}

	exists (0:r0=0 /\ 1:r1=0)

    If P0()'s spin_unlock_wait() slots in before P1()'s critical
    section, then r1 cannot be zero, and if P0()'s spin_unlock_wait()
    slots in after P1()'s critical section, then r0 cannot be zero.
    Yet real code running on real hardware, including x86, observes
    both r0 and r1 being zero.

    One option is to weaken the semantics of spin_unlock_wait(), but
    there is one use case that seems to need the strong semantics
    (ata_scsi_cmd_error_handler()), and another that has explicit
    memory barriers that could be removed given the stronger semantics
    proposed in this commit (nf_conntrack_lock()).  This commit
    therefore adds memory barriers to enforce the stronger semantics.

    Please note that this commit takes an extremely conservative
    approach because spin_unlock_wait() is not used heavily or on
    fastpaths.  It is quite possible that some of the added memory
    barriers are in fact unnecessary.

    Reported-by: Andrea Parri
    Reported-by: Luc Maranget
    Signed-off-by: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Will Deacon
    Cc:

diff --git a/arch/alpha/include/asm/spinlock.h b/arch/alpha/include/asm/spinlock.h
index a40b9fc0c6c3..6ceb56e1e716 100644
--- a/arch/alpha/include/asm/spinlock.h
+++ b/arch/alpha/include/asm/spinlock.h
@@ -18,6 +18,7 @@ static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
 {
+	smp_mb();
 	smp_cond_load_acquire(&lock->lock, !VAL);
 }

diff --git a/arch/arc/include/asm/spinlock.h b/arch/arc/include/asm/spinlock.h
index 233d5ffe6ec7..9d0f588caea3 100644
--- a/arch/arc/include/asm/spinlock.h
+++ b/arch/arc/include/asm/spinlock.h
@@ -18,6 +18,7 @@ static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
 {
+	smp_mb();
 	smp_cond_load_acquire(&lock->slock, !VAL);
 }

diff --git a/arch/arm/include/asm/spinlock.h b/arch/arm/include/asm/spinlock.h
index 4bec45442072..0b097ea6760d 100644
--- a/arch/arm/include/asm/spinlock.h
+++ b/arch/arm/include/asm/spinlock.h
@@ -56,6 +56,7 @@ static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
 {
 	u16 owner = READ_ONCE(lock->tickets.owner);

+	smp_mb();
 	for (;;) {
 		arch_spinlock_t tmp = READ_ONCE(*lock);

diff --git a/arch/blackfin/include/asm/spinlock.h b/arch/blackfin/include/asm/spinlock.h
index c58f4a83ed6f..f4bdc8996d35 100644
--- a/arch/blackfin/include/asm/spinlock.h
+++ b/arch/blackfin/include/asm/spinlock.h
@@ -50,6 +50,7 @@ static inline void arch_spin_unlock(arch_spinlock_t *lock)

 static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
 {
+	smp_mb();
 	smp_cond_load_acquire(&lock->lock, !VAL);
 }

diff --git a/arch/hexagon/include/asm/spinlock.h b/arch/hexagon/include/asm/spinlock.h
index a1c55788c5d6..9f09f900e146 100644
--- a/arch/hexagon/include/asm/spinlock.h
+++ b/arch/hexagon/include/asm/spinlock.h
@@ -181,6 +181,7 @@ static inline unsigned int arch_spin_trylock(arch_spinlock_t *lock)

 static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
 {
+	smp_mb();
 	smp_cond_load_acquire(&lock->lock, !VAL);
 }

diff --git a/arch/ia64/include/asm/spinlock.h b/arch/ia64/include/asm/spinlock.h
index ca9e76149a4a..4c76d7caf185 100644
--- a/arch/ia64/include/asm/spinlock.h
+++ b/arch/ia64/include/asm/spinlock.h
@@ -145,6 +145,7 @@ static __always_inline void arch_spin_lock_flags(arch_spinlock_t *lock,

 static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
 {
+	smp_mb();
 	__ticket_spin_unlock_wait(lock);
 }

diff --git a/arch/m32r/include/asm/spinlock.h b/arch/m32r/include/asm/spinlock.h
index 323c7fc953cd..1f2e97486557 100644
--- a/arch/m32r/include/asm/spinlock.h
+++ b/arch/m32r/include/asm/spinlock.h
@@ -32,6 +32,7 @@ static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
 {
+	smp_mb();
 	smp_cond_load_acquire(&lock->slock, VAL > 0);
 }

diff --git a/arch/metag/include/asm/spinlock.h b/arch/metag/include/asm/spinlock.h
index c0c7a22be1ae..53a17023beed 100644
--- a/arch/metag/include/asm/spinlock.h
+++ b/arch/metag/include/asm/spinlock.h
@@ -17,6 +17,7 @@ static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
 {
+	smp_mb();
 	smp_cond_load_acquire(&lock->lock, !VAL);
 }

diff --git a/arch/mips/include/asm/spinlock.h b/arch/mips/include/asm/spinlock.h
index a8df44d60607..2c5788223abc 100644
--- a/arch/mips/include/asm/spinlock.h
+++ b/arch/mips/include/asm/spinlock.h
@@ -53,7 +53,7 @@ static inline int arch_spin_value_unlocked(arch_spinlock_t lock)
 static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
 {
 	u16 owner = READ_ONCE(lock->h.serving_now);
-	smp_rmb();
+	smp_mb();
 	for (;;) {
 		arch_spinlock_t tmp = READ_ONCE(*lock);

diff --git a/arch/mn10300/include/asm/spinlock.h b/arch/mn10300/include/asm/spinlock.h
index 9c7b8f7942d8..517db0908132 100644
--- a/arch/mn10300/include/asm/spinlock.h
+++ b/arch/mn10300/include/asm/spinlock.h
@@ -28,6 +28,7 @@ static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
 {
+	smp_mb();
 	smp_cond_load_acquire(&lock->slock, !VAL);
 }

diff --git a/arch/parisc/include/asm/spinlock.h b/arch/parisc/include/asm/spinlock.h
index e32936cd7f10..aecf9e4f4335 100644
--- a/arch/parisc/include/asm/spinlock.h
+++ b/arch/parisc/include/asm/spinlock.h
@@ -18,6 +18,7 @@ static inline void arch_spin_unlock_wait(arch_spinlock_t *x)
 {
 	volatile unsigned int *a = __ldcw_align(x);
+	smp_mb();
 	smp_cond_load_acquire(a, VAL);
 }

diff --git a/arch/s390/include/asm/spinlock.h b/arch/s390/include/asm/spinlock.h
index f7838ecd83c6..5b97e8e9e444 100644
--- a/arch/s390/include/asm/spinlock.h
+++ b/arch/s390/include/asm/spinlock.h
@@ -100,6 +100,7 @@ static inline void arch_spin_unlock(arch_spinlock_t *lp)

 static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
 {
+	smp_mb();
 	while (arch_spin_is_locked(lock))
 		arch_spin_relax(lock);
 	smp_acquire__after_ctrl_dep();

diff --git a/arch/sh/include/asm/spinlock-cas.h b/arch/sh/include/asm/spinlock-cas.h
index c46e8cc7b515..37af00015330 100644
--- a/arch/sh/include/asm/spinlock-cas.h
+++ b/arch/sh/include/asm/spinlock-cas.h
@@ -31,6 +31,7 @@ static inline unsigned __sl_cas(volatile unsigned *p, unsigned old, unsigned new

 static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
 {
+	smp_mb();
 	smp_cond_load_acquire(&lock->lock, VAL > 0);
 }

diff --git a/arch/sh/include/asm/spinlock-llsc.h b/arch/sh/include/asm/spinlock-llsc.h
index cec78143fa83..cf3e37e70ae8 100644
--- a/arch/sh/include/asm/spinlock-llsc.h
+++ b/arch/sh/include/asm/spinlock-llsc.h
@@ -23,6 +23,7 @@ static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
 {
+	smp_mb();
 	smp_cond_load_acquire(&lock->lock, VAL > 0);
 }

diff --git a/arch/sparc/include/asm/spinlock_32.h b/arch/sparc/include/asm/spinlock_32.h
index 8011e79f59c9..441009963962 100644
--- a/arch/sparc/include/asm/spinlock_32.h
+++ b/arch/sparc/include/asm/spinlock_32.h
@@ -16,6 +16,7 @@ static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
 {
+	smp_mb();
 	smp_cond_load_acquire(&lock->lock, !VAL);
 }

diff --git a/arch/sparc/include/asm/spinlock_64.h b/arch/sparc/include/asm/spinlock_64.h
index 07c9f2e9bf57..9be036a4a88b 100644
--- a/arch/sparc/include/asm/spinlock_64.h
+++ b/arch/sparc/include/asm/spinlock_64.h
@@ -28,6 +28,7 @@ static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
 {
+	smp_mb();
 	smp_cond_load_acquire(&lock->lock, !VAL);
 }

diff --git a/arch/tile/lib/spinlock_32.c b/arch/tile/lib/spinlock_32.c
index 076c6cc43113..7a9bb8acdb6f 100644
--- a/arch/tile/lib/spinlock_32.c
+++ b/arch/tile/lib/spinlock_32.c
@@ -69,6 +69,7 @@ void arch_spin_unlock_wait(arch_spinlock_t *lock)
 	int next = READ_ONCE(lock->next_ticket);

 	/* Return immediately if unlocked. */
+	smp_mb();
 	if (next == curr)
 		return;

diff --git a/arch/tile/lib/spinlock_64.c b/arch/tile/lib/spinlock_64.c
index a4b5b2cbce93..2933e02582c9 100644
--- a/arch/tile/lib/spinlock_64.c
+++ b/arch/tile/lib/spinlock_64.c
@@ -69,6 +69,7 @@ void arch_spin_unlock_wait(arch_spinlock_t *lock)
 	u32 curr = arch_spin_current(val);

 	/* Return immediately if unlocked. */
+	smp_mb();
 	if (arch_spin_next(val) == curr)
 		return;

diff --git a/arch/xtensa/include/asm/spinlock.h b/arch/xtensa/include/asm/spinlock.h
index a36221cf6363..536417abd11b 100644
--- a/arch/xtensa/include/asm/spinlock.h
+++ b/arch/xtensa/include/asm/spinlock.h
@@ -35,6 +35,7 @@ static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
 {
+	smp_mb();
 	smp_cond_load_acquire(&lock->slock, !VAL);
 }