Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751551AbdF1UQO (ORCPT ); Wed, 28 Jun 2017 16:16:14 -0400 Received: from netrider.rowland.org ([192.131.102.5]:49533 "HELO netrider.rowland.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with SMTP id S1751615AbdF1UQF (ORCPT ); Wed, 28 Jun 2017 16:16:05 -0400 Date: Wed, 28 Jun 2017 16:16:03 -0400 (EDT) From: Alan Stern X-X-Sender: stern@netrider.rowland.org To: "Paul E. McKenney" cc: Linus Torvalds , Andrea Parri , Linux Kernel Mailing List , , , Arnd Bergmann , , Thomas Gleixner , Peter Zijlstra , Josh Triplett , Nicolas Pitre , Krister Johansen , Vegard Nossum , , Wu Fengguang , Frederic Weisbecker , Rik van Riel , Steven Rostedt , Ingo Molnar , Luc Maranget , Jade Alglave Subject: Re: [GIT PULL rcu/next] RCU commits for 4.13 In-Reply-To: <20170628170321.GQ3721@linux.vnet.ibm.com> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 4323 Lines: 113 On Wed, 28 Jun 2017, Paul E. McKenney wrote: > > The problem is that adding smp_mb() before spin_unlock_wait() does not > > provide release semantics, as Andrea has pointed out. ISTM that when > > people want full release & acquire semantics, they should just use > > "spin_lock(); spin_unlock();". > > Well, from what I can see, this thread is certainly demonstrating to my > satisfaction that we really do need a memory model. ;-) > > I agree that calling a loops of loads a "release" is at best confusing, > and certainly conflicts with all know nomenclature. So let's forget > "release" or not and try to agree on litmus tests. Of course, as we > both know and as noted long ago on LKML, we cannot -define- semantics > via litmus tests, but we can use litmus tests to -check- semantics. > > First, modeling lock acquisition with xchg(), which is fully ordered > (given that locking is still very much under development in our model): > > C SUW+or-ow+l-ow-or > {} > > P0(int *x, int *y, int *my_lock) > { > r0 = READ_ONCE(*x); > WRITE_ONCE(*y, 1); > smp_mb(); > r1 = READ_ONCE(*my_lock); > } > > P1(int *x, int *y, int *my_lock) { > r1 = xchg(my_lock, 1); > WRITE_ONCE(*x, 1); > r0 = READ_ONCE(*y); > } > > exists (0:r0=1 /\ 0:r1=0 /\ 1:r0=0 /\ 1:r1=0) > > This gives "Positive: 0 Negative: 5". But to your point, replacing > "xchg()" with "xchg_acquire()" gets us "Positive: 1 Negative: 7". Yes, that's what I had in mind. > But xchg_acquire() would in fact work on x86 because the xchg instruction > does full ordering in any case, right? (I believe that this also applies > to the other TSO architectures, but have not checked.) True. > For PowerPC (and I believe ARM), the leading smp_mb() suffices, as > illustrated by this litmus test: > > PPC spin_unlock_wait_st.litmus > "" > { > l=0; > 0:r1=1; 0:r3=42; 0:r4=x; 0:r10=0; 0:r12=l; > 1:r1=1; 1:r3=42; 1:r4=x; 1:r10=0; 1:r11=0; 1:r12=l; > } > P0 | P1 ; > stw r1,0(r4) | lwarx r11,r10,r12 ; > sync | cmpwi r11,0 ; > lwarx r11,r10,r12 | bne Fail1 ; > stwcx. r11,r10,r12 | stwcx. r1,r10,r12 ; > bne Fail0 | bne Fail1 ; > lwz r3,0(r12) | lwsync ; > Fail0: | lwz r3,0(r4) ; > | Fail1: ; > > > exists > (0:r3=0 /\ 1:r3=0) Yes. Bear in mind that the PowerPC version of arch_spin_unlock_wait ends with smp_mb. That already makes it a lot stronger than the smp_cond_load_acquire implementation on other architectures. > So what am I missing here? Our memory model is (deliberately) weaker than the guarantees provided by any of the hardware implementations. So while adding smp_mb in front of smp_cond_load_acquire may suffice to give the desired semantics in many cases, it might not suffice for all architectures (because it doesn't suffice in the model). In fact, we would need to change the model to make it accept this idiom. I admit not being able to point to any architectures where the difference matters. But if you take this approach, you do need to document that for any future architectures, putting smp_mb in front of spin_unlock_wait must be guaranteed by the arch-specific code to have the desired effect. Also, it would not be obvious to a reader _why_ putting an explicit smp_mb before spin_unlock_wait would make prior writes visible to later critical sections, since it's not obvious that spin_unlock_wait needs to do any writes. Replacing the whole thing with spin_lock + spin_unlock would be easier to understand and wouldn't need any special guarantees or explanations. > > If there are any places where this would add unacceptable overhead, > > maybe those places need some rethinking. For instance, perhaps we > > could add a separate primitive that provides only release semantics. > > (But would using the new primitive together with spin_unlock_wait > > really be significantly better than lock-unlock?) > > At the moment, I have no idea on the relative overheads. ;-) I don't either. But for architectures where spin_unlock_wait already does the equivalent of load-locked + store-conditional, it's hard to see why replacing it with spin_lock + spin_unlock would make it any slower. Alan Stern