Date: Fri, 1 Dec 2017 10:32:15 -0500 (EST)
From: Alan Stern <stern@rowland.harvard.edu>
To: Boqun Feng <boqun.feng@gmail.com>
cc: Daniel Lustig <dlustig@nvidia.com>,
        "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
        Andrea Parri <parri.andrea@gmail.com>,
        Luc Maranget <luc.maranget@inria.fr>,
        Jade Alglave <j.alglave@ucl.ac.uk>,
        Nicholas Piggin <npiggin@gmail.com>,
        Peter Zijlstra <peterz@infradead.org>,
        Will Deacon <will.deacon@arm.com>, David Howells <dhowells@redhat.com>,
        Palmer Dabbelt <palmer@dabbelt.com>,
        Kernel development list <linux-kernel@vger.kernel.org>
Subject: Re: Unlock-lock questions and the Linux Kernel Memory Model
In-Reply-To: <20171201024624.GB9516@tardis>
Message-ID: <Pine.LNX.4.44L0.1712011009560.1361-100000@iolanthe.rowland.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: linux-kernel-owner@vger.kernel.org

On Fri, 1 Dec 2017, Boqun Feng wrote:

> > > But in case of AMOs, which directly send the addition request to memory
> > > controller, so there wouldn't be any read part or even write part of the
> > > atomic_inc() executed by CPU. Would this be allowed then?
> > 
> > Firstly, sending the addition request to the memory controller _is_ a
> > write operation.
> > 
> > Secondly, even though the CPU hardware might not execute a read 
> > operation during an AMO, the LKMM and herd nevertheless represent the 
> > atomic update as a specially-annotated read event followed by a write 
> > event.
> > 
> 
> Ah, right! From the point of view of the model, there are read events
> and write events for the atomics.
> 
> > In an other-multicopy-atomic system, P0's write to y must become
> > visible to P1 before P1 executes the smp_load_acquire, because the
> > write was visible to the memory controller when the controller carried
> > out the AMO, and the write becomes visible to the memory controller and
> > to P1 at the same time (by other-multicopy-atomicity).  That's why I
> > said the test would be forbidden on ARM.
> > 
> 
> Agreed.
> 
> > But even on a non-other-multicopy-atomic system, there has to be some 
> > synchronization between the memory controller and P1's CPU.  Otherwise, 
> > how could the system guarantee that P1's smp_load_acquire would see the 
> > post-increment value of y?  It seems reasonable to assume that this 
> > synchronization would also cause P1 to see x=1.
> > 
> 
> I agree with you on the "reasonable" part ;-) So basically, the memory
> controller could only do the write of the AMO once P0's second write
> has propagated to the memory controller (and because of the wmb(), P0's
> first write must already have propagated to the memory controller, too),
> so it makes sense that when the write of the AMO propagates from the
> memory controller to P1, P0's first write has also propagated to P1.
> IOW, the write of the AMO on the memory controller acts at least like a
> release.
> 
> However, some part of myself is still a little paranoid, because to my
> understanding, the point of AMO is to get atomic operations executing
> as fast as possible, so maybe, AMO has some fast path for the memory
> controller to forward a write to the CPU that issues the AMO, in that
> way, it will become unreasonable ;-)

It's true that a hardware design in the future might behave differently 
from current hardware.  If that ever happens, we will need to rethink 
the situation.  Maybe the designers will change their hardware to make 
it match the memory model.  Or maybe the memory model will change.

And it's certainly possible to write a litmus test which emulates this 
situation:

C MP+wmb+emulated-amo-acq

{}

P0(int *x, int *y)
{
	WRITE_ONCE(*x, 1);
	smp_wmb();
	WRITE_ONCE(*y, 1);
}

P1(int *x, int *y, int *u, int *v)
{
	WRITE_ONCE(*u, 1);
	r1 = READ_ONCE(*v);
	smp_rmb();
	r2 = smp_load_acquire(y);
	r3 = READ_ONCE(*x);
}

P2(int *y, int *u, int *v)
{
	r4 = READ_ONCE(*u);
	if (r4 != 0) {
		atomic_inc(y);
		smp_wmb();
		WRITE_ONCE(*v, 1);
	}
}

exists (1:r1=1 /\ 1:r2=2 /\ 1:r3=0 /\ 2:r4=1)

Here P1 tells P2 to perform the atomic increment by setting u to 1, and
P2 tells P1 that the increment is finished by setting v to 1.  This
test is allowed by the LKMM, because the wmb in P2 is not A-cumulative.
On the other hand, store-release is A-cumulative -- the test would be
forbidden if P2 did "smp_store_release(v, 1)" rather than "smp_wmb() ;
WRITE_ONCE(*v, 1)".
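
As a sketch, the store-release variant of P2 would look like this, with
P0, P1, and the exists clause left unchanged:

P2(int *y, int *u, int *v)
{
	r4 = READ_ONCE(*u);
	if (r4 != 0) {
		atomic_inc(y);
		smp_store_release(v, 1);
	}
}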

> With that in mind, I think it's better if herd could provide the type
> annotations of atomics for the read and write parts, and we handle it
> inside the LKMM's cats and bells, rather than letting herd provide the
> internal dependency by default.

herd already does provide this information via the rmw relation.

Alan