Date: Wed, 28 Jun 2017 16:16:03 -0400 (EDT)
From: Alan Stern <stern@rowland.harvard.edu>
To: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
cc: Linus Torvalds <torvalds@linux-foundation.org>,
        Andrea Parri <parri.andrea@gmail.com>,
        Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
        <priyalee.kushwaha@intel.com>, <drozdziak1@gmail.com>,
        Arnd Bergmann <arnd@arndb.de>, <ldr709@gmail.com>,
        Thomas Gleixner <tglx@linutronix.de>,
        Peter Zijlstra <peterz@infradead.org>,
        Josh Triplett <josh@joshtriplett.org>, Nicolas Pitre <nico@linaro.org>,
        Krister Johansen <kjlx@templeofstupid.com>,
        Vegard Nossum <vegard.nossum@oracle.com>, <dcb314@hotmail.com>,
        Wu Fengguang <fengguang.wu@intel.com>,
        Frederic Weisbecker <fweisbec@gmail.com>,
        Rik van Riel <riel@redhat.com>, Steven Rostedt <rostedt@goodmis.org>,
        Ingo Molnar <mingo@kernel.org>, Luc Maranget <luc.maranget@inria.fr>,
        Jade Alglave <j.alglave@ucl.ac.uk>
Subject: Re: [GIT PULL rcu/next] RCU commits for 4.13
In-Reply-To: <20170628170321.GQ3721@linux.vnet.ibm.com>
Message-ID: <Pine.LNX.4.44L0.1706281547270.27696-100000@netrider.rowland.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 4323
Lines: 113

On Wed, 28 Jun 2017, Paul E. McKenney wrote:

> > The problem is that adding smp_mb() before spin_unlock_wait() does not
> > provide release semantics, as Andrea has pointed out.  ISTM that when
> > people want full release & acquire semantics, they should just use
> > "spin_lock(); spin_unlock();".
> 
> Well, from what I can see, this thread is certainly demonstrating to my
> satisfaction that we really do need a memory model.  ;-)
> 
> I agree that calling a loops of loads a "release" is at best confusing,
> and certainly conflicts with all know nomenclature.  So let's forget
> "release" or not and try to agree on litmus tests.  Of course, as we
> both know and as noted long ago on LKML, we cannot -define- semantics
> via litmus tests, but we can use litmus tests to -check- semantics.
> 
> First, modeling lock acquisition with xchg(), which is fully ordered
> (given that locking is still very much under development in our model):
> 
> 	C SUW+or-ow+l-ow-or
> 	{}
> 
> 	P0(int *x, int *y, int *my_lock)
> 	{
> 	  r0 = READ_ONCE(*x);
> 	  WRITE_ONCE(*y, 1);
> 	  smp_mb();
> 	  r1 = READ_ONCE(*my_lock);
> 	}
> 
> 	P1(int *x, int *y, int *my_lock) {
> 	  r1 = xchg(my_lock, 1);
> 	  WRITE_ONCE(*x, 1);
> 	  r0 = READ_ONCE(*y);
> 	}
> 
> 	exists (0:r0=1 /\ 0:r1=0 /\ 1:r0=0 /\ 1:r1=0)
> 
> This gives "Positive: 0 Negative: 5".  But to your point, replacing
> "xchg()" with "xchg_acquire()" gets us "Positive: 1 Negative: 7".

Yes, that's what I had in mind.

> But xchg_acquire() would in fact work on x86 because the xchg instruction
> does full ordering in any case, right?  (I believe that this also applies
> to the other TSO architectures, but have not checked.)

True.

> For PowerPC (and I believe ARM), the leading smp_mb() suffices, as
> illustrated by this litmus test:
> 
> 	PPC spin_unlock_wait_st.litmus
> 	""
> 	{
> 	l=0;
> 	0:r1=1; 0:r3=42; 0:r4=x; 0:r10=0;          0:r12=l;
> 	1:r1=1; 1:r3=42; 1:r4=x; 1:r10=0; 1:r11=0; 1:r12=l;
> 	}
> 	 P0                 | P1                 ;
> 	 stw r1,0(r4)       | lwarx r11,r10,r12  ;
> 	 sync               | cmpwi r11,0        ;
> 	 lwarx r11,r10,r12  | bne Fail1          ;
> 	 stwcx. r11,r10,r12 | stwcx. r1,r10,r12  ;
> 	 bne Fail0          | bne Fail1          ;
> 	 lwz r3,0(r12)      | lwsync             ;
> 	 Fail0:             | lwz r3,0(r4)       ;
> 			    | Fail1:             ;
> 
> 
> 	exists
> 	(0:r3=0 /\ 1:r3=0)

Yes.  Bear in mind that the PowerPC version of arch_spin_unlock_wait
ends with smp_mb.  That already makes it a lot stronger than the
smp_cond_load_acquire implementation on other architectures.

> So what am I missing here?

Our memory model is (deliberately) weaker than the guarantees provided
by any of the hardware implementations.  So while adding smp_mb in
front of smp_cond_load_acquire may suffice to give the desired
semantics in many cases, it might not suffice for all architectures
(because it doesn't suffice in the model).  In fact, we would need to 
change the model to make it accept this idiom.

I admit not being able to point to any architectures where the
difference matters.  But if you take this approach, you do need to
document that for any future architectures, putting smp_mb in front of
spin_unlock_wait must be guaranteed by the arch-specific code to have
the desired effect.

Also, it would not be obvious to a reader _why_ putting an explicit
smp_mb before spin_unlock_wait would make prior writes visible to later
critical sections, since it's not obvious that spin_unlock_wait needs
to do any writes.  Replacing the whole thing with spin_lock +
spin_unlock would be easier to understand and wouldn't need any special 
guarantees or explanations.

> > If there are any places where this would add unacceptable overhead,
> > maybe those places need some rethinking.  For instance, perhaps we
> > could add a separate primitive that provides only release semantics.  
> > (But would using the new primitive together with spin_unlock_wait
> > really be significantly better than lock-unlock?)
> 
> At the moment, I have no idea on the relative overheads.  ;-)

I don't either.  But for architectures where spin_unlock_wait already 
does the equivalent of load-locked + store-conditional, it's hard to 
see why replacing it with spin_lock + spin_unlock would make it any 
slower.

Alan Stern