More than one kernel developer has expressed the opinion that the LKMM
should enforce ordering of writes by release-acquire chains and by
locking. In other words, given the following code:
WRITE_ONCE(x, 1);
spin_unlock(&s);
spin_lock(&s);
WRITE_ONCE(y, 1);
or the following:
smp_store_release(&x, 1);
r1 = smp_load_acquire(&x); // r1 = 1
WRITE_ONCE(y, 1);
the stores to x and y should be propagated in order to all other CPUs,
even though those other CPUs might not access the lock s or be part of
the release-acquire chain. In terms of the memory model, this means
that rel-rf-acq-po should be part of the cumul-fence relation.
All the architectures supported by the Linux kernel (including RISC-V)
do behave this way, albeit for varying reasons. Therefore this patch
changes the model in accordance with the developers' wishes.
Signed-off-by: Alan Stern <[email protected]>
---
[as1871]
tools/memory-model/Documentation/explanation.txt | 81 +++++++++++++++++++++++
tools/memory-model/linux-kernel.cat | 2
2 files changed, 82 insertions(+), 1 deletion(-)
Index: usb-4.x/tools/memory-model/linux-kernel.cat
===================================================================
--- usb-4.x.orig/tools/memory-model/linux-kernel.cat
+++ usb-4.x/tools/memory-model/linux-kernel.cat
@@ -66,7 +66,7 @@ let ppo = to-r | to-w | fence
(* Propagation: Ordering from release operations and strong fences. *)
let A-cumul(r) = rfe? ; r
-let cumul-fence = A-cumul(strong-fence | po-rel) | wmb
+let cumul-fence = A-cumul(strong-fence | po-rel) | wmb | rel-rf-acq-po
let prop = (overwrite & ext)? ; cumul-fence* ; rfe?
(*
Index: usb-4.x/tools/memory-model/Documentation/explanation.txt
===================================================================
--- usb-4.x.orig/tools/memory-model/Documentation/explanation.txt
+++ usb-4.x/tools/memory-model/Documentation/explanation.txt
@@ -1897,3 +1897,84 @@ non-deadlocking executions. For example
Is it possible to end up with r0 = 36 at the end? The LKMM will tell
you it is not, but the model won't mention that this is because P1
will self-deadlock in the executions where it stores 36 in y.
+
+In the LKMM, locks and release-acquire chains cause stores to
+propagate in order. For example:
+
+ int x, y, z;
+
+ P0()
+ {
+ WRITE_ONCE(x, 1);
+ smp_store_release(&y, 1);
+ }
+
+ P1()
+ {
+ int r1;
+
+ r1 = smp_load_acquire(&y);
+ WRITE_ONCE(z, 1);
+ }
+
+ P2()
+ {
+ int r2, r3, r4;
+
+ r2 = READ_ONCE(z);
+ smp_rmb();
+ r3 = READ_ONCE(x);
+ r4 = READ_ONCE(y);
+ }
+
+If r1 = 1 and r2 = 1 at the end, then both r3 and r4 must also be 1.
+In other words, the smp_store_release() and the smp_load_acquire() that reads
+from it together act as a sort of inter-processor fence, forcing the stores to
+x and y to propagate to P2 before the store to z does, regardless of
+the fact that P2 doesn't execute any release or acquire instructions.
+This conclusion would hold even if P0 and P1 were on the same CPU, so
+long as r1 = 1.
+
+We have mentioned that the LKMM treats locks as acquires and unlocks
+as releases. Therefore it should not be surprising that something
+analogous to this ordering also holds for locks:
+
+ int x, y;
+ spinlock_t s;
+
+ P0()
+ {
+ spin_lock(&s);
+ WRITE_ONCE(x, 1);
+ spin_unlock(&s);
+ }
+
+ P1()
+ {
+ int r1;
+
+ spin_lock(&s);
+	r1 = READ_ONCE(x);
+ WRITE_ONCE(y, 1);
+ spin_unlock(&s);
+ }
+
+ P2()
+ {
+ int r2, r3;
+
+ r2 = READ_ONCE(y);
+ smp_rmb();
+ r3 = READ_ONCE(x);
+ }
+
+If r1 = 1 at the end (implying that P1's critical section executes
+after P0's) and r2 = 1, then r3 must be 1; the ordering of the
+critical sections forces the store to x to propagate to P2 before the
+store to y does.
+
+In both versions of this scenario, the store-propagation ordering is
+not required by the operational model. However, it does happen on all
+the architectures supported by the Linux kernel, and kernel developers
+seem to expect it; they have requested that this behavior be included
+in the LKMM.
On Thu, Jun 21, 2018 at 01:27:12PM -0400, Alan Stern wrote:
> More than one kernel developer has expressed the opinion that the LKMM
> should enforce ordering of writes by release-acquire chains and by
> locking. In other words, given the following code:
>
> WRITE_ONCE(x, 1);
> 	spin_unlock(&s);
> spin_lock(&s);
> WRITE_ONCE(y, 1);
>
> or the following:
>
> smp_store_release(&x, 1);
> r1 = smp_load_acquire(&x); // r1 = 1
> WRITE_ONCE(y, 1);
>
> the stores to x and y should be propagated in order to all other CPUs,
> even though those other CPUs might not access the lock s or be part of
> the release-acquire chain. In terms of the memory model, this means
> that rel-rf-acq-po should be part of the cumul-fence relation.
>
> All the architectures supported by the Linux kernel (including RISC-V)
> do behave this way, albeit for varying reasons. Therefore this patch
> changes the model in accordance with the developers' wishes.
>
> Signed-off-by: Alan Stern <[email protected]>
Thanks for that Alan!
Acked-by: Peter Zijlstra (Intel) <[email protected]>
On Thu, Jun 21, 2018 at 08:04:22PM +0200, Peter Zijlstra wrote:
> On Thu, Jun 21, 2018 at 01:27:12PM -0400, Alan Stern wrote:
> > More than one kernel developer has expressed the opinion that the LKMM
> > should enforce ordering of writes by release-acquire chains and by
> > locking. In other words, given the following code:
> >
> > WRITE_ONCE(x, 1);
> > 	spin_unlock(&s);
> > spin_lock(&s);
> > WRITE_ONCE(y, 1);
> >
> > or the following:
> >
> > smp_store_release(&x, 1);
> > r1 = smp_load_acquire(&x); // r1 = 1
> > WRITE_ONCE(y, 1);
> >
> > the stores to x and y should be propagated in order to all other CPUs,
> > even though those other CPUs might not access the lock s or be part of
> > the release-acquire chain. In terms of the memory model, this means
> > that rel-rf-acq-po should be part of the cumul-fence relation.
> >
> > All the architectures supported by the Linux kernel (including RISC-V)
> > do behave this way, albeit for varying reasons. Therefore this patch
> > changes the model in accordance with the developers' wishes.
> >
> > Signed-off-by: Alan Stern <[email protected]>
>
> Thanks for that Alan!
>
> Acked-by: Peter Zijlstra (Intel) <[email protected]>
I queued both, with Peter's ack on the second, thank you both!
(Peter, if you meant to ack both, please let me know.)
Thanx, Paul
On Thu, Jun 21, 2018 at 08:34:40PM -0700, Paul E. McKenney wrote:
> (Peter, if you meant to ack both, please let me know.)
I sadly have not had the time to go through all that fancy cat language
stuff, so while I can read and mostly understand the Changelog, I don't
feel qualified to actually say anything much about that first patch ;-)
At some distant point in the past I mostly grokked the cat stuffs, but
then things happened and I've not looked at it in more than a year and
the brain just doesn't want to remember... :/ Some day..
Hi Alan,
On Thu, Jun 21, 2018 at 01:27:12PM -0400, Alan Stern wrote:
> More than one kernel developer has expressed the opinion that the LKMM
> should enforce ordering of writes by release-acquire chains and by
> locking. In other words, given the following code:
>
> WRITE_ONCE(x, 1);
> 	spin_unlock(&s);
> spin_lock(&s);
> WRITE_ONCE(y, 1);
>
> or the following:
>
> smp_store_release(&x, 1);
> r1 = smp_load_acquire(&x); // r1 = 1
> WRITE_ONCE(y, 1);
>
> the stores to x and y should be propagated in order to all other CPUs,
> even though those other CPUs might not access the lock s or be part of
> the release-acquire chain. In terms of the memory model, this means
> that rel-rf-acq-po should be part of the cumul-fence relation.
>
> All the architectures supported by the Linux kernel (including RISC-V)
> do behave this way, albeit for varying reasons. Therefore this patch
> changes the model in accordance with the developers' wishes.
Interesting...
I think the second example would preclude us using LDAPR for load-acquire,
so I'm surprised that RISC-V is ok with this. For example, the first test
below is allowed on arm64.
I also think this would break if we used DMB LD to implement load-acquire
(second test below).
So I'm not a big fan of this change, and I'm surprised this works on all
architectures. What's the justification?
Will
--->8
AArch64 MP+poslq-poqp+poap
"PosWRLQ PodRWQP RfePA PodRRAP FrePL"
Generator=diyone7 (version 7.46+3)
Prefetch=0:x=F,0:y=W,1:y=F,1:x=T
Com=Rf Fr
Orig=PosWRLQ PodRWQP RfePA PodRRAP FrePL
{
0:X1=x; 0:X4=y;
1:X1=y; 1:X3=x;
}
P0 | P1 ;
MOV W0,#1 | LDAR W0,[X1] ;
STLR W0,[X1] | LDR W2,[X3] ;
LDAPR W2,[X1] | ;
MOV W3,#1 | ;
STR W3,[X4] | ;
exists
(1:X0=1 /\ 1:X2=0)
AArch64 MP+pos-dmb.ld+poap
"PosWR DMB.LDdRW RfePA PodRRAP Fre"
Generator=diyone7 (version 7.46+3)
Prefetch=0:x=F,0:y=W,1:y=F,1:x=T
Com=Rf Fr
Orig=PosWR DMB.LDdRW RfePA PodRRAP Fre
{
0:X1=x; 0:X4=y;
1:X1=y; 1:X3=x;
}
P0 | P1 ;
MOV W0,#1 | LDAR W0,[X1] ;
STR W0,[X1] | LDR W2,[X3] ;
LDR W2,[X1] | ;
DMB LD | ;
MOV W3,#1 | ;
STR W3,[X4] | ;
exists
(1:X0=1 /\ 1:X2=0)
On Thu, Jun 21, 2018 at 01:27:12PM -0400, Alan Stern wrote:
> More than one kernel developer has expressed the opinion that the LKMM
> should enforce ordering of writes by release-acquire chains and by
> locking. In other words, given the following code:
>
> WRITE_ONCE(x, 1);
> 	spin_unlock(&s);
> spin_lock(&s);
> WRITE_ONCE(y, 1);
>
> or the following:
>
> smp_store_release(&x, 1);
> r1 = smp_load_acquire(&x); // r1 = 1
> WRITE_ONCE(y, 1);
>
> the stores to x and y should be propagated in order to all other CPUs,
> even though those other CPUs might not access the lock s or be part of
> the release-acquire chain. In terms of the memory model, this means
> that rel-rf-acq-po should be part of the cumul-fence relation.
>
> All the architectures supported by the Linux kernel (including RISC-V)
> do behave this way, albeit for varying reasons. Therefore this patch
> changes the model in accordance with the developers' wishes.
>
> Signed-off-by: Alan Stern <[email protected]>
This patch changes the "Result" for ISA2+pooncelock+pooncelock+pombonce,
so it should update the corresponding comment/README.
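For reference, that test is roughly of the following shape (reconstructed
here, so variable names and the exact "exists" clause may differ from the
tree); with this patch its "Result:" would flip from Sometimes to Never:

	P0(int *x, int *y, spinlock_t *mylock)
	{
		spin_lock(mylock);
		WRITE_ONCE(*x, 1);
		WRITE_ONCE(*y, 1);
		spin_unlock(mylock);
	}

	P1(int *y, int *z, spinlock_t *mylock)
	{
		int r0;

		spin_lock(mylock);
		r0 = READ_ONCE(*y);
		WRITE_ONCE(*z, 1);
		spin_unlock(mylock);
	}

	P2(int *x, int *z)
	{
		int r1;
		int r2;

		r2 = READ_ONCE(*z);
		smp_mb();
		r1 = READ_ONCE(*x);
	}

	exists (1:r0=1 /\ 2:r2=1 /\ 2:r1=0)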
Reviewed-and-Tested-by: Andrea Parri <[email protected]>
Andrea
>
> ---
>
>
> [as1871]
>
>
> tools/memory-model/Documentation/explanation.txt | 81 +++++++++++++++++++++++
> tools/memory-model/linux-kernel.cat | 2
> 2 files changed, 82 insertions(+), 1 deletion(-)
>
> Index: usb-4.x/tools/memory-model/linux-kernel.cat
> ===================================================================
> --- usb-4.x.orig/tools/memory-model/linux-kernel.cat
> +++ usb-4.x/tools/memory-model/linux-kernel.cat
> @@ -66,7 +66,7 @@ let ppo = to-r | to-w | fence
>
> (* Propagation: Ordering from release operations and strong fences. *)
> let A-cumul(r) = rfe? ; r
> -let cumul-fence = A-cumul(strong-fence | po-rel) | wmb
> +let cumul-fence = A-cumul(strong-fence | po-rel) | wmb | rel-rf-acq-po
> let prop = (overwrite & ext)? ; cumul-fence* ; rfe?
>
> (*
> Index: usb-4.x/tools/memory-model/Documentation/explanation.txt
> ===================================================================
> --- usb-4.x.orig/tools/memory-model/Documentation/explanation.txt
> +++ usb-4.x/tools/memory-model/Documentation/explanation.txt
> @@ -1897,3 +1897,84 @@ non-deadlocking executions. For example
> Is it possible to end up with r0 = 36 at the end? The LKMM will tell
> you it is not, but the model won't mention that this is because P1
> will self-deadlock in the executions where it stores 36 in y.
> +
> +In the LKMM, locks and release-acquire chains cause stores to
> +propagate in order. For example:
> +
> + int x, y, z;
> +
> + P0()
> + {
> + WRITE_ONCE(x, 1);
> + smp_store_release(&y, 1);
> + }
> +
> + P1()
> + {
> + int r1;
> +
> + r1 = smp_load_acquire(&y);
> + WRITE_ONCE(z, 1);
> + }
> +
> + P2()
> + {
> + int r2, r3, r4;
> +
> + r2 = READ_ONCE(z);
> + smp_rmb();
> + r3 = READ_ONCE(x);
> + r4 = READ_ONCE(y);
> + }
> +
> +If r1 = 1 and r2 = 1 at the end, then both r3 and r4 must also be 1.
> +In other words, the smp_store_release() and the smp_load_acquire() that reads
> +from it together act as a sort of inter-processor fence, forcing the stores to
> +x and y to propagate to P2 before the store to z does, regardless of
> +the fact that P2 doesn't execute any release or acquire instructions.
> +This conclusion would hold even if P0 and P1 were on the same CPU, so
> +long as r1 = 1.
> +
> +We have mentioned that the LKMM treats locks as acquires and unlocks
> +as releases. Therefore it should not be surprising that something
> +analogous to this ordering also holds for locks:
> +
> + int x, y;
> + spinlock_t s;
> +
> + P0()
> + {
> + spin_lock(&s);
> + WRITE_ONCE(x, 1);
> + spin_unlock(&s);
> + }
> +
> + P1()
> + {
> + int r1;
> +
> + spin_lock(&s);
> +	r1 = READ_ONCE(x);
> + WRITE_ONCE(y, 1);
> + spin_unlock(&s);
> + }
> +
> + P2()
> + {
> + int r2, r3;
> +
> + r2 = READ_ONCE(y);
> + smp_rmb();
> + r3 = READ_ONCE(x);
> + }
> +
> +If r1 = 1 at the end (implying that P1's critical section executes
> +after P0's) and r2 = 1, then r3 must be 1; the ordering of the
> +critical sections forces the store to x to propagate to P2 before the
> +store to y does.
> +
> +In both versions of this scenario, the store-propagation ordering is
> +not required by the operational model. However, it does happen on all
> +the architectures supported by the Linux kernel, and kernel developers
> +seem to expect it; they have requested that this behavior be included
> +in the LKMM.
>
On Fri, Jun 22, 2018 at 09:09:28AM +0100, Will Deacon wrote:
> On Thu, Jun 21, 2018 at 01:27:12PM -0400, Alan Stern wrote:
> > More than one kernel developer has expressed the opinion that the LKMM
> > should enforce ordering of writes by release-acquire chains and by
> > locking. In other words, given the following code:
> >
> > WRITE_ONCE(x, 1);
> > 	spin_unlock(&s);
> > spin_lock(&s);
> > WRITE_ONCE(y, 1);
> >
> > or the following:
> >
> > smp_store_release(&x, 1);
> > r1 = smp_load_acquire(&x); // r1 = 1
> > WRITE_ONCE(y, 1);
> >
> > the stores to x and y should be propagated in order to all other CPUs,
> > even though those other CPUs might not access the lock s or be part of
> > the release-acquire chain. In terms of the memory model, this means
> > that rel-rf-acq-po should be part of the cumul-fence relation.
> >
> > All the architectures supported by the Linux kernel (including RISC-V)
> > do behave this way, albeit for varying reasons. Therefore this patch
> > changes the model in accordance with the developers' wishes.
>
> Interesting...
>
> I think the second example would preclude us using LDAPR for load-acquire,
> so I'm surprised that RISC-V is ok with this. For example, the first test
> below is allowed on arm64.
>
> I also think this would break if we used DMB LD to implement load-acquire
> (second test below).
>
> So I'm not a big fan of this change, and I'm surprised this works on all
> architectures. What's the justification?
I also just realised that this prevents Power from using ctrl+isync to
implement acquire, should they wish to do so.
Will
On Fri, Jun 22, 2018 at 10:55:47AM +0100, Will Deacon wrote:
> On Fri, Jun 22, 2018 at 09:09:28AM +0100, Will Deacon wrote:
> > On Thu, Jun 21, 2018 at 01:27:12PM -0400, Alan Stern wrote:
> > > More than one kernel developer has expressed the opinion that the LKMM
> > > should enforce ordering of writes by release-acquire chains and by
> > > locking. In other words, given the following code:
> > >
> > > WRITE_ONCE(x, 1);
> > > 	spin_unlock(&s);
> > > spin_lock(&s);
> > > WRITE_ONCE(y, 1);
So this is the one I'm relying on and really want sorted.
> > > or the following:
> > >
> > > smp_store_release(&x, 1);
> > > r1 = smp_load_acquire(&x); // r1 = 1
> > > WRITE_ONCE(y, 1);
Reading back some of the old threads [1], it seems the direct
translation of the first into acquire-release would be:
WRITE_ONCE(x, 1);
smp_store_release(&s, 1);
r1 = smp_load_acquire(&s);
WRITE_ONCE(y, 1);
Which is I think easier to make happen than the second example you give.
> > > the stores to x and y should be propagated in order to all other CPUs,
> > > even though those other CPUs might not access the lock s or be part of
> > > the release-acquire chain. In terms of the memory model, this means
> > > that rel-rf-acq-po should be part of the cumul-fence relation.
> > >
> > > All the architectures supported by the Linux kernel (including RISC-V)
> > > do behave this way, albeit for varying reasons. Therefore this patch
> > > changes the model in accordance with the developers' wishes.
> >
> > Interesting...
> >
> > I think the second example would preclude us using LDAPR for load-acquire,
> > so I'm surprised that RISC-V is ok with this. For example, the first test
> > below is allowed on arm64.
> >
> > I also think this would break if we used DMB LD to implement load-acquire
> > (second test below).
> >
> > So I'm not a big fan of this change, and I'm surprised this works on all
> > architectures. What's the justification?
>
> I also just realised that this prevents Power from using ctrl+isync to
> implement acquire, should they wish to do so.
They in fact do so on chips lacking LWSYNC, see how PPC_ACQUIRE_BARRIER
(as used by atomic_*_acquire) turns into ISYNC (note however that they
do not use PPC_ACQUIRE_BARRIER for smp_load_acquire -- because there's
no CTRL there).
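FWIW, the ctrl+isync acquire sequence being talked about is roughly this
(a sketch, not the exact kernel asm; register choices arbitrary):

	lwz	r9,0(r3)	# the load being "acquired"
	cmpw	r9,r9		# compare the loaded value against itself,
	bne-	1f		# giving a never-taken conditional branch (CTRL)
1:	isync			# CTRL + isync: later accesses stay after the load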
[1] https://lkml.kernel.org/r/[email protected]
Hi Peter,
On Fri, Jun 22, 2018 at 12:31:29PM +0200, Peter Zijlstra wrote:
> On Fri, Jun 22, 2018 at 10:55:47AM +0100, Will Deacon wrote:
> > On Fri, Jun 22, 2018 at 09:09:28AM +0100, Will Deacon wrote:
> > > On Thu, Jun 21, 2018 at 01:27:12PM -0400, Alan Stern wrote:
> > > > More than one kernel developer has expressed the opinion that the LKMM
> > > > should enforce ordering of writes by release-acquire chains and by
> > > > locking. In other words, given the following code:
> > > >
> > > > WRITE_ONCE(x, 1);
> > > > 	spin_unlock(&s);
> > > > spin_lock(&s);
> > > > WRITE_ONCE(y, 1);
>
> So this is the one I'm relying on and really want sorted.
Agreed, and I think this one makes a lot of sense.
>
> > > > or the following:
> > > >
> > > > smp_store_release(&x, 1);
> > > > r1 = smp_load_acquire(&x); // r1 = 1
> > > > WRITE_ONCE(y, 1);
>
> Reading back some of the old threads [1], it seems the direct
> translation of the first into acquire-release would be:
>
> WRITE_ONCE(x, 1);
> smp_store_release(&s, 1);
> r1 = smp_load_acquire(&s);
> WRITE_ONCE(y, 1);
>
> Which is I think easier to make happen than the second example you give.
It's easier, but it will still break on architectures with native support
for RCpc acquire/release. For example, using LDAPR again:
AArch64 MP+popl-rfilq-poqp+poap
"PodWWPL RfiLQ PodRWQP RfePA PodRRAP Fre"
Generator=diyone7 (version 7.46+3)
Prefetch=0:x=F,0:z=W,1:z=F,1:x=T
Com=Rf Fr
Orig=PodWWPL RfiLQ PodRWQP RfePA PodRRAP Fre
{
0:X1=x; 0:X3=y; 0:X6=z;
1:X1=z; 1:X3=x;
}
P0 | P1 ;
MOV W0,#1 | LDAR W0,[X1] ;
STR W0,[X1] | LDR W2,[X3] ;
MOV W2,#1 | ;
STLR W2,[X3] | ;
LDAPR W4,[X3] | ;
MOV W5,#1 | ;
STR W5,[X6] | ;
exists
(0:X4=1 /\ 1:X0=1 /\ 1:X2=0)
then this is permitted on arm64.
> > > > the stores to x and y should be propagated in order to all other CPUs,
> > > > even though those other CPUs might not access the lock s or be part of
> > > > the release-acquire chain. In terms of the memory model, this means
> > > > that rel-rf-acq-po should be part of the cumul-fence relation.
> > > >
> > > > All the architectures supported by the Linux kernel (including RISC-V)
> > > > do behave this way, albeit for varying reasons. Therefore this patch
> > > > changes the model in accordance with the developers' wishes.
> > >
> > > Interesting...
> > >
> > > I think the second example would preclude us using LDAPR for load-acquire,
> > > so I'm surprised that RISC-V is ok with this. For example, the first test
> > > below is allowed on arm64.
> > >
> > > I also think this would break if we used DMB LD to implement load-acquire
> > > (second test below).
> > >
> > > So I'm not a big fan of this change, and I'm surprised this works on all
> > > architectures. What's the justification?
> >
> > I also just realised that this prevents Power from using ctrl+isync to
> > implement acquire, should they wish to do so.
>
> They in fact do so on chips lacking LWSYNC, see how PPC_ACQUIRE_BARRIER
> (as used by atomic_*_acquire) turns into ISYNC (note however that they
> do not use PPC_ACQUIRE_BARRIER for smp_load_acquire -- because there's
> no CTRL there).
Right, so the example in the commit message is broken on PPC then. I think
it's also broken on RISC-V, despite the claim.
Could we drop the acquire/release stuff from the patch and limit this change
to locking instead?
Will
> > > I also just realised that this prevents Power from using ctrl+isync to
> > > implement acquire, should they wish to do so.
> >
> > They in fact do so on chips lacking LWSYNC, see how PPC_ACQUIRE_BARRIER
> > (as used by atomic_*_acquire) turns into ISYNC (note however that they
> > do not use PPC_ACQUIRE_BARRIER for smp_load_acquire -- because there's
> > no CTRL there).
>
> Right, so the example in the commit message is broken on PPC then. I think
> it's also broken on RISC-V, despite the claim.
I agree for RISC-V (and I missed it in my earlier review): the 2nd
snippet from the commit message would map to something like
fence rw, w
STORE #1,[x]
LOAD [x]
fence r ,rw
STORE #1,[y]
and there would be no guarantee that the stores to x and y will be
propagated in program order to another CPU, AFAICT. Thank you for
pointing this out.
Andrea
On Fri, Jun 22, 2018 at 12:31:29PM +0200, Peter Zijlstra wrote:
> On Fri, Jun 22, 2018 at 10:55:47AM +0100, Will Deacon wrote:
> > On Fri, Jun 22, 2018 at 09:09:28AM +0100, Will Deacon wrote:
> > > On Thu, Jun 21, 2018 at 01:27:12PM -0400, Alan Stern wrote:
> > > > More than one kernel developer has expressed the opinion that the LKMM
> > > > should enforce ordering of writes by release-acquire chains and by
> > > > locking. In other words, given the following code:
> > > >
> > > > WRITE_ONCE(x, 1);
> > > > 	spin_unlock(&s);
> > > > spin_lock(&s);
> > > > WRITE_ONCE(y, 1);
>
> So this is the one I'm relying on and really want sorted.
>
> > > > or the following:
> > > >
> > > > smp_store_release(&x, 1);
> > > > r1 = smp_load_acquire(&x); // r1 = 1
> > > > WRITE_ONCE(y, 1);
>
> Reading back some of the old threads [1], it seems the direct
> translation of the first into acquire-release would be:
>
> WRITE_ONCE(x, 1);
> smp_store_release(&s, 1);
> r1 = smp_load_acquire(&s);
> WRITE_ONCE(y, 1);
>
> Which is I think easier to make happen than the second example you give.
>
> > > > the stores to x and y should be propagated in order to all other CPUs,
> > > > even though those other CPUs might not access the lock s or be part of
> > > > the release-acquire chain. In terms of the memory model, this means
> > > > that rel-rf-acq-po should be part of the cumul-fence relation.
> > > >
> > > > All the architectures supported by the Linux kernel (including RISC-V)
> > > > do behave this way, albeit for varying reasons. Therefore this patch
> > > > changes the model in accordance with the developers' wishes.
> > >
> > > Interesting...
> > >
> > > I think the second example would preclude us using LDAPR for load-acquire,
> > > so I'm surprised that RISC-V is ok with this. For example, the first test
> > > below is allowed on arm64.
> > >
> > > I also think this would break if we used DMB LD to implement load-acquire
> > > (second test below).
> > >
> > > So I'm not a big fan of this change, and I'm surprised this works on all
> > > architectures. What's the justification?
> >
> > I also just realised that this prevents Power from using ctrl+isync to
> > implement acquire, should they wish to do so.
>
> They in fact do so on chips lacking LWSYNC, see how PPC_ACQUIRE_BARRIER
> (as used by atomic_*_acquire) turns into ISYNC (note however that they
> do not use PPC_ACQUIRE_BARRIER for smp_load_acquire -- because there's
> no CTRL there).
PowerPC -could- use a load-compare-branch-isync sequence to implement
smp_load_acquire(), but as you say it instead uses lwsync, falling
back to sync on CPUs not implementing lwsync.
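That is, the generated code for smp_load_acquire() is roughly (sketch):

	lwz	r9,0(r3)	# the load
	lwsync			# trailing barrier: later accesses cannot pass it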
Locking should work either way because even if lock acquisition uses isync
(leveraging the fact that the "stcwx." instruction affects condition
codes and branches back when someone else already holds the lock),
lock release still uses lwsync (or sync on systems not having lwsync).
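Roughly, then, the sequences in question look like this (again a sketch,
not the exact kernel asm):

1:	lwarx	r9,0,r3		# load-reserve the lock word
	cmpwi	r9,0		# lock free?
	bne-	1b		# no: spin
	stwcx.	r10,0,r3	# try to claim it
	bne-	1b		# reservation lost: retry
	isync			# acquire: CTRL (from the bne-) + isync

	...			# critical section

	lwsync			# release barrier (sync on chips lacking lwsync)
	stw	r11,0(r3)	# r11 holds 0: drop the lock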
Thanx, Paul
> [1] https://lkml.kernel.org/r/[email protected]
>
On Fri, 22 Jun 2018, Will Deacon wrote:
> Hi Alan,
>
> On Thu, Jun 21, 2018 at 01:27:12PM -0400, Alan Stern wrote:
> > More than one kernel developer has expressed the opinion that the LKMM
> > should enforce ordering of writes by release-acquire chains and by
> > locking. In other words, given the following code:
> >
> > WRITE_ONCE(x, 1);
> > 	spin_unlock(&s);
> > spin_lock(&s);
> > WRITE_ONCE(y, 1);
> >
> > or the following:
> >
> > smp_store_release(&x, 1);
> > r1 = smp_load_acquire(&x); // r1 = 1
> > WRITE_ONCE(y, 1);
> >
> > the stores to x and y should be propagated in order to all other CPUs,
> > even though those other CPUs might not access the lock s or be part of
> > the release-acquire chain. In terms of the memory model, this means
> > that rel-rf-acq-po should be part of the cumul-fence relation.
> >
> > All the architectures supported by the Linux kernel (including RISC-V)
> > do behave this way, albeit for varying reasons. Therefore this patch
> > changes the model in accordance with the developers' wishes.
>
> Interesting...
>
> I think the second example would preclude us using LDAPR for load-acquire,
What are the semantics of LDAPR? That instruction isn't included in my
year-old copy of the ARMv8.1 manual; the closest it comes is LDAR and
LDAXP.
> so I'm surprised that RISC-V is ok with this. For example, the first test
> below is allowed on arm64.
Does ARMv8 use LDAPR for smp_load_acquire()? If it doesn't, this is a
moot point.
> I also think this would break if we used DMB LD to implement load-acquire
> (second test below).
Same question.
> So I'm not a big fan of this change, and I'm surprised this works on all
> architectures. What's the justification?
For ARMv8, I've been going by something you wrote in an earlier email
to the effect that store-release and load-acquire are fully ordered,
and therefore a release can never be forwarded to an acquire. Is that
still true? But evidently it only justifies patch 1 in this series,
not patch 2.
For RISC-V, I've been going by Andrea's and Luc's comments.
> > Reading back some of the old threads [1], it seems the direct
> > translation of the first into acquire-release would be:
> >
> > WRITE_ONCE(x, 1);
> > smp_store_release(&s, 1);
> > r1 = smp_load_acquire(&s);
> > WRITE_ONCE(y, 1);
> >
> > Which is I think easier to make happen than the second example you give.
>
> It's easier, but it will still break on architectures with native support
> for RCpc acquire/release.
Again, do we want the kernel to support that?
For that matter, what would happen if someone were to try using RCpc
semantics for lock/unlock? Or to put it another way, why do you
contemplate the possibility of RCpc acquire/release but not RCpc
lock/unlock?
> Could we drop the acquire/release stuff from the patch and limit this change
> to locking instead?
The LKMM uses the same CAT code for acquire/release and lock/unlock.
(In essence, it considers a lock to be an acquire and an unlock to be a
release; everything else follows from that.) Treating one differently
from the other in these tests would require some significant changes.
It wouldn't be easy.
Alan
Hi Alan,
On Fri, Jun 22, 2018 at 02:09:04PM -0400, Alan Stern wrote:
> On Fri, 22 Jun 2018, Will Deacon wrote:
> > On Thu, Jun 21, 2018 at 01:27:12PM -0400, Alan Stern wrote:
> > > More than one kernel developer has expressed the opinion that the LKMM
> > > should enforce ordering of writes by release-acquire chains and by
> > > locking. In other words, given the following code:
> > >
> > > WRITE_ONCE(x, 1);
> > > 	spin_unlock(&s);
> > > spin_lock(&s);
> > > WRITE_ONCE(y, 1);
> > >
> > > or the following:
> > >
> > > smp_store_release(&x, 1);
> > > r1 = smp_load_acquire(&x); // r1 = 1
> > > WRITE_ONCE(y, 1);
> > >
> > > the stores to x and y should be propagated in order to all other CPUs,
> > > even though those other CPUs might not access the lock s or be part of
> > > the release-acquire chain. In terms of the memory model, this means
> > > that rel-rf-acq-po should be part of the cumul-fence relation.
> > >
> > > All the architectures supported by the Linux kernel (including RISC-V)
> > > do behave this way, albeit for varying reasons. Therefore this patch
> > > changes the model in accordance with the developers' wishes.
> >
> > Interesting...
> >
> > I think the second example would preclude us using LDAPR for load-acquire,
>
> What are the semantics of LDAPR? That instruction isn't included in my
> year-old copy of the ARMv8.1 manual; the closest it comes is LDAR and
> LDAXP.
It's part of 8.3 and is documented in the latest Arm Arm:
https://static.docs.arm.com/ddi0487/ca/DDI0487C_a_armv8_arm.pdf
It's also included in the upstream armv8.cat file using the 'Q' set.
> > so I'm surprised that RISC-V is ok with this. For example, the first test
> > below is allowed on arm64.
>
> Does ARMv8 use LDAPR for smp_load_acquire()? If it doesn't, this is a
> moot point.
I don't think it's a moot point. We want new architectures to implement
acquire/release efficiently, and it's not unlikely that they will have
acquire loads that are similar in semantics to LDAPR. This patch prevents
them from doing so, and it also breaks Power and RISC-V without any clear
justification for the stronger semantics.
> > I also think this would break if we used DMB LD to implement load-acquire
> > (second test below).
>
> Same question.
Same answer (and RISC-V is a concrete example of an architecture building
acquire using a load->load+store fence).
> > So I'm not a big fan of this change, and I'm surprised this works on all
> > architectures. What's the justification?
>
> For ARMv8, I've been going by something you wrote in an earlier email
> to the effect that store-release and load-acquire are fully ordered,
> and therefore a release can never be forwarded to an acquire. Is that
> still true? But evidently it only justifies patch 1 in this series,
> not patch 2.
LDAR and STLR are RCsc, so that remains true. arm64 is not broken by this
patch, but I'm still objecting to the change in semantics.
> For RISC-V, I've been going by Andrea's and Luc's comments.
https://is.gd/WhV1xz
From that state of rmem, you can propagate the writes out of order on
RISC-V.
> > > Reading back some of the old threads [1], it seems the direct
> > > translation of the first into acquire-release would be:
> > >
> > > WRITE_ONCE(x, 1);
> > > smp_store_release(&s, 1);
> > > r1 = smp_load_acquire(&s);
> > > WRITE_ONCE(y, 1);
> > >
> > > Which is I think easier to make happen than the second example you give.
> >
> > It's easier, but it will still break on architectures with native support
> > for RCpc acquire/release.
>
> Again, do we want the kernel to support that?
Yes, I think we do. That's the most common interpretation of
acquire/release, it matches what C11 has done and it facilitates
construction of acquire using a load->load+store fence.
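That is, roughly, in RISC-V-style assembly (just a sketch):

	lw	a0,0(a1)	# the acquiring load
	fence	r,rw		# load -> load+store: later accesses stay behind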
> For that matter, what would happen if someone were to try using RCpc
> semantics for lock/unlock? Or to put it another way, why do you
> contemplate the possibility of RCpc acquire/release but not RCpc
> lock/unlock?
I think lock/unlock is a higher-level abstraction than acquire/release
and therefore should be simpler to use and easier to reason about.
acquire/release are building blocks for more complicated synchronisation
mechanisms and we shouldn't be penalising their implementation without good
reason.
> > Could we drop the acquire/release stuff from the patch and limit this change
> > to locking instead?
>
> The LKMM uses the same CAT code for acquire/release and lock/unlock.
> (In essence, it considers a lock to be an acquire and an unlock to be a
> release; everything else follows from that.) Treating one differently
> from the other in these tests would require some significant changes.
> It wouldn't be easy.
It would be boring if it was easy ;) I think this is a case of the tail
wagging the dog.
Paul -- please can you drop this patch until we've resolved this discussion?
Will
On Fri, 22 Jun 2018, Will Deacon wrote:
> Hi Alan,
>
> On Fri, Jun 22, 2018 at 02:09:04PM -0400, Alan Stern wrote:
> > On Fri, 22 Jun 2018, Will Deacon wrote:
> > > On Thu, Jun 21, 2018 at 01:27:12PM -0400, Alan Stern wrote:
> > > > More than one kernel developer has expressed the opinion that the LKMM
> > > > should enforce ordering of writes by release-acquire chains and by
> > > > locking. In other words, given the following code:
> > > >
> > > > WRITE_ONCE(x, 1);
> > > > 	spin_unlock(&s);
> > > > spin_lock(&s);
> > > > WRITE_ONCE(y, 1);
> > > >
> > > > or the following:
> > > >
> > > > smp_store_release(&x, 1);
> > > > r1 = smp_load_acquire(&x); // r1 = 1
> > > > WRITE_ONCE(y, 1);
> > > >
> > > > the stores to x and y should be propagated in order to all other CPUs,
> > > > even though those other CPUs might not access the lock s or be part of
> > > > the release-acquire chain. In terms of the memory model, this means
> > > > that rel-rf-acq-po should be part of the cumul-fence relation.
> > > >
> > > > All the architectures supported by the Linux kernel (including RISC-V)
> > > > do behave this way, albeit for varying reasons. Therefore this patch
> > > > changes the model in accordance with the developers' wishes.
> > >
> > > Interesting...
> > >
> > > I think the second example would preclude us using LDAPR for load-acquire,
> >
> > What are the semantics of LDAPR? That instruction isn't included in my
> > year-old copy of the ARMv8.1 manual; the closest it comes is LDAR and
> > LDAXP.
>
> It's part of 8.3 and is documented in the latest Arm Arm:
>
> https://static.docs.arm.com/ddi0487/ca/DDI0487C_a_armv8_arm.pdf
>
> It's also included in the upstream armv8.cat file using the 'Q' set.
I'll have to look at that.
> > > so I'm surprised that RISC-V is ok with this. For example, the first test
> > > below is allowed on arm64.
> >
> > Does ARMv8 use LDAPR for smp_load_acquire()? If it doesn't, this is a
> > moot point.
>
> I don't think it's a moot point. We want new architectures to implement
> acquire/release efficiently, and it's not unlikely that they will have
> acquire loads that are similar in semantics to LDAPR. This patch prevents
> them from doing so, and it also breaks Power and RISC-V without any clear
> justification for the stronger semantics.
>
> > > I also think this would break if we used DMB LD to implement load-acquire
> > > (second test below).
> >
> > Same question.
>
> Same answer (and RISC-V is a concrete example of an architecture building
> acquire using a load->load+store fence).
>
> > > So I'm not a big fan of this change, and I'm surprised this works on all
> > > architectures. What's the justification?
> >
> > For ARMv8, I've been going by something you wrote in an earlier email
> > to the effect that store-release and load-acquire are fully ordered,
> > and therefore a release can never be forwarded to an acquire. Is that
> > still true? But evidently it only justifies patch 1 in this series,
> > not patch 2.
>
> LDAR and STLR are RCsc, so that remains true. arm64 is not broken by this
> patch, but I'm still objecting to the change in semantics.
>
> > For RISC-V, I've been going by Andrea's and Luc's comments.
>
> https://is.gd/WhV1xz
>
> From that state of rmem, you can propagate the writes out of order on
> RISC-V.
>
> > > > Reading back some of the old threads [1], it seems the direct
> > > > translation of the first into acquire-release would be:
> > > >
> > > > WRITE_ONCE(x, 1);
> > > > smp_store_release(&s, 1);
> > > > r1 = smp_load_acquire(&s);
> > > > WRITE_ONCE(y, 1);
> > > >
> > > > Which is I think easier to make happen than the second example you give.
> > >
> > > It's easier, but it will still break on architectures with native support
> > > for RCpc acquire/release.
> >
> > Again, do we want the kernel to support that?
>
> Yes, I think we do. That's the most common interpretation of
> acquire/release, it matches what C11 has done and it facilitates
> construction of acquire using a load->load+store fence.
>
> > For that matter, what would happen if someone were to try using RCpc
> > semantics for lock/unlock? Or to put it another way, why do you
> > contemplate the possibility of RCpc acquire/release but not RCpc
> > lock/unlock?
>
> I think lock/unlock is a higher-level abstraction than acquire/release
> and therefore should be simpler to use and easier to reason about.
> acquire/release are building blocks for more complicated synchronisation
> mechanisms and we shouldn't be penalising their implementation without good
> reason.
>
> > > Could we drop the acquire/release stuff from the patch and limit this change
> > > to locking instead?
> >
> > The LKMM uses the same CAT code for acquire/release and lock/unlock.
> > (In essence, it considers a lock to be an acquire and an unlock to be a
> > release; everything else follows from that.) Treating one differently
> > from the other in these tests would require some significant changes.
> > It wouldn't be easy.
>
> It would be boring if it was easy ;) I think this is a case of the tail
> wagging the dog.
>
> Paul -- please can you drop this patch until we've resolved this discussion?
Agreed. It sounds like we'll need two versions of the Rel and Acq sets
in the memory model; one for RCpc and one for RCsc. smp_load_acquire
and smp_store_release will use the former, and locking will use the
latter.
Would it suffice to have this duplication just for release, using a
single version of acquire? What would happen on ARMv8 or RISC-V if an
RCsc release was read by an RCpc acquire? Or vice versa?
Alan
On Fri, 22 Jun 2018, Andrea Parri wrote:
> On Thu, Jun 21, 2018 at 01:27:12PM -0400, Alan Stern wrote:
> > More than one kernel developer has expressed the opinion that the LKMM
> > should enforce ordering of writes by release-acquire chains and by
> > locking. In other words, given the following code:
> >
> > WRITE_ONCE(x, 1);
> > 	spin_unlock(&s);
> > spin_lock(&s);
> > WRITE_ONCE(y, 1);
> >
> > or the following:
> >
> > smp_store_release(&x, 1);
> > r1 = smp_load_acquire(&x); // r1 = 1
> > WRITE_ONCE(y, 1);
> >
> > the stores to x and y should be propagated in order to all other CPUs,
> > even though those other CPUs might not access the lock s or be part of
> > the release-acquire chain. In terms of the memory model, this means
> > that rel-rf-acq-po should be part of the cumul-fence relation.
> >
> > All the architectures supported by the Linux kernel (including RISC-V)
> > do behave this way, albeit for varying reasons. Therefore this patch
> > changes the model in accordance with the developers' wishes.
> >
> > Signed-off-by: Alan Stern <[email protected]>
>
> This patch changes the "Result" for ISA2+pooncelock+pooncelock+pombonce,
> so it should update the corresponding comment/README.
Thanks for noticing this. The next version of this patch will update
the comment. The README seems to be adequate as it is now, since it
doesn't specify the result of the litmus test.
Alan
On Fri, Jun 22, 2018 at 03:11:37PM -0400, Alan Stern wrote:
> On Fri, 22 Jun 2018, Will Deacon wrote:
> > On Fri, Jun 22, 2018 at 02:09:04PM -0400, Alan Stern wrote:
[ . . . ]
> > > > Could we drop the acquire/release stuff from the patch and limit this change
> > > > to locking instead?
> > >
> > > The LKMM uses the same CAT code for acquire/release and lock/unlock.
> > > (In essence, it considers a lock to be an acquire and an unlock to be a
> > > release; everything else follows from that.) Treating one differently
> > > from the other in these tests would require some significant changes.
> > > It wouldn't be easy.
> >
> > It would be boring if it was easy ;) I think this is a case of the tail
> > wagging the dog.
> >
> > Paul -- please can you drop this patch until we've resolved this discussion?
>
> Agreed. It sounds like we'll need two versions of the Rel and Acq sets
> in the memory model; one for RCpc and one for RCsc. smp_load_acquire
> and smp_store_release will use the former, and locking will use the
> latter.
Done!
Thanx, Paul
> Would it suffice to have this duplication just for release, using a
> single version of acquire? What would happen on ARMv8 or RISC-V if an
> RCsc release was read by an RCpc acquire? Or vice versa?
>
> Alan
>
I have yet to digest the rest of the discussion, however:
On Fri, Jun 22, 2018 at 02:09:04PM -0400, Alan Stern wrote:
> The LKMM uses the same CAT code for acquire/release and lock/unlock.
> (In essence, it considers a lock to be an acquire and an unlock to be a
> release; everything else follows from that.) Treating one differently
> from the other in these tests would require some significant changes.
> It wouldn't be easy.
That is problematic, acquire+release are very much simpler operations
than lock+unlock.
At the very least, lock includes a control-dependency, where acquire
does not.
On Fri, Jun 22, 2018 at 07:30:08PM +0100, Will Deacon wrote:
> > > I think the second example would preclude us using LDAPR for load-acquire,
> I don't think it's a moot point. We want new architectures to implement
> acquire/release efficiently, and it's not unlikely that they will have
> acquire loads that are similar in semantics to LDAPR. This patch prevents
> them from doing so,
By this same argument, you should not be a "big fan" of rfi-rel-acq in ppo ;)
consider, e.g., the two litmus tests below: what am I missing?
Andrea
C MP+fencewmbonceonce+pooncerelease-rfireleaseacquire-poacquireonce
{}
P0(int *x, int *y)
{
WRITE_ONCE(*x, 1);
smp_wmb();
WRITE_ONCE(*y, 1);
}
P1(int *x, int *y, int *z)
{
r0 = READ_ONCE(*y);
smp_store_release(z, 1);
r1 = smp_load_acquire(z);
r2 = READ_ONCE(*x);
}
exists (1:r0=1 /\ 1:r1=1 /\ 1:r2=0)
AArch64 MP+dmb.st+popl-rfilq-poqp
"DMB.STdWW Rfe PodRWPL RfiLQ PodRRQP Fre"
Generator=diyone7 (version 7.49+02(dev))
Prefetch=0:x=F,0:y=W,1:y=F,1:x=T
Com=Rf Fr
Orig=DMB.STdWW Rfe PodRWPL RfiLQ PodRRQP Fre
{
0:X1=x; 0:X3=y;
1:X1=y; 1:X3=z; 1:X6=x;
}
P0 | P1 ;
MOV W0,#1 | LDR W0,[X1] ;
STR W0,[X1] | MOV W2,#1 ;
DMB ST | STLR W2,[X3] ;
MOV W2,#1 | LDAPR W4,[X3] ;
STR W2,[X3] | LDR W5,[X6] ;
exists
(1:X0=1 /\ 1:X4=1 /\ 1:X5=0)
On Mon, Jun 25, 2018 at 09:32:29AM +0200, Peter Zijlstra wrote:
>
> I have yet to digest the rest of the discussion, however:
>
> On Fri, Jun 22, 2018 at 02:09:04PM -0400, Alan Stern wrote:
> > The LKMM uses the same CAT code for acquire/release and lock/unlock.
> > (In essence, it considers a lock to be an acquire and an unlock to be a
> > release; everything else follows from that.) Treating one differently
> > from the other in these tests would require some significant changes.
> > It wouldn't be easy.
>
> That is problematic, acquire+release are very much simpler operations
> than lock+unlock.
>
> At the very least, lock includes a control-dependency, where acquire
> does not.
I don't see how this is relevant here; roughly, "if something is guaranteed
by a control-dependency, that is also guaranteed by an acquire". Right? ;)
Andrea
On Mon, Jun 25, 2018 at 10:29:23AM +0200, Andrea Parri wrote:
> On Mon, Jun 25, 2018 at 09:32:29AM +0200, Peter Zijlstra wrote:
> >
> > I have yet to digest the rest of the discussion, however:
> >
> > On Fri, Jun 22, 2018 at 02:09:04PM -0400, Alan Stern wrote:
> > > The LKMM uses the same CAT code for acquire/release and lock/unlock.
> > > (In essence, it considers a lock to be an acquire and an unlock to be a
> > > release; everything else follows from that.) Treating one differently
> > > from the other in these tests would require some significant changes.
> > > It wouldn't be easy.
> >
> > That is problematic, acquire+release are very much simpler operations
> > than lock+unlock.
> >
> > At the very least, lock includes a control-dependency, where acquire
> > does not.
>
> I don't see how this is relevant here; roughly, "if something is guaranteed
> by a control-dependency, that is also guaranteed by an acquire". Right? ;)
Right, you are, clearly I needs me a wake up drink :-).. So lock does
very fundamentally involve a RmW, and it has the whole wait-until loop
thing in. But yes, now I'm struggling to better express how it's
different from a memory ordering pov.
But still, the lock case will/must disallow the re-ordering (since we rely on
it), whereas the pure acquire/release seems to be struggling.
Personally I prefer a stronger model over a weaker one (as does Linus
IIRC) but clearly people have different opinions on that.
Will:
On Mon, 25 Jun 2018, Andrea Parri wrote:
> On Fri, Jun 22, 2018 at 07:30:08PM +0100, Will Deacon wrote:
> > > > I think the second example would preclude us using LDAPR for load-acquire,
>
> > I don't think it's a moot point. We want new architectures to implement
> > acquire/release efficiently, and it's not unlikely that they will have
> > acquire loads that are similar in semantics to LDAPR. This patch prevents
> > them from doing so,
>
> By this same argument, you should not be a "big fan" of rfi-rel-acq in ppo ;)
> consider, e.g., the two litmus tests below: what am I missing?
This is an excellent point, which seems to have gotten lost in the
shuffle. I'd like to see your comments.
In essence, if you're using release-acquire instructions that only
provide RCpc consistency, does store-release followed by load-acquire
of the same address provide read-read ordering? In theory it doesn't
have to, because if the value from the store-release is forwarded to
the load-acquire then:
LOAD A
STORE-RELEASE X, v
LOAD-ACQUIRE X
LOAD B
could be executed by the CPU in the order:
LOAD-ACQUIRE X
LOAD B
LOAD A
STORE-RELEASE X, v
thereby accessing A and B out of program order without violating the
requirements on the release or the acquire.
Of course PPC doesn't allow this, but should we rule it out entirely?
> C MP+fencewmbonceonce+pooncerelease-rfireleaseacquire-poacquireonce
>
> {}
>
> P0(int *x, int *y)
> {
> WRITE_ONCE(*x, 1);
> smp_wmb();
> WRITE_ONCE(*y, 1);
> }
>
> P1(int *x, int *y, int *z)
> {
> r0 = READ_ONCE(*y);
> smp_store_release(z, 1);
> r1 = smp_load_acquire(z);
> r2 = READ_ONCE(*x);
> }
>
> exists (1:r0=1 /\ 1:r1=1 /\ 1:r2=0)
>
>
> AArch64 MP+dmb.st+popl-rfilq-poqp
> "DMB.STdWW Rfe PodRWPL RfiLQ PodRRQP Fre"
> Generator=diyone7 (version 7.49+02(dev))
> Prefetch=0:x=F,0:y=W,1:y=F,1:x=T
> Com=Rf Fr
> Orig=DMB.STdWW Rfe PodRWPL RfiLQ PodRRQP Fre
> {
> 0:X1=x; 0:X3=y;
> 1:X1=y; 1:X3=z; 1:X6=x;
> }
> P0 | P1 ;
> MOV W0,#1 | LDR W0,[X1] ;
> STR W0,[X1] | MOV W2,#1 ;
> DMB ST | STLR W2,[X3] ;
> MOV W2,#1 | LDAPR W4,[X3] ;
> STR W2,[X3] | LDR W5,[X6] ;
> exists
> (1:X0=1 /\ 1:X4=1 /\ 1:X5=0)
There's also read-write ordering, in the form of the LB pattern:
P0(int *x, int *y, int *z)
{
r0 = READ_ONCE(*x);
smp_store_release(z, 1);
r1 = smp_load_acquire(z);
WRITE_ONCE(*y, 1);
}
P1(int *x, int *y)
{
r2 = READ_ONCE(*y);
	smp_mb();
WRITE_ONCE(*x, 1);
}
exists (0:r0=1 /\ 1:r2=1)
Would this be allowed if smp_load_acquire() was implemented with LDAPR?
If the answer is yes then we will have to remove the rfi-rel-acq and
rel-rf-acq-po relations from the memory model entirely.
Alan
PS: Paul, is the patch which introduced rel-rf-acq-po currently present
in any of your branches? I couldn't find it.
On Tue, Jul 03, 2018 at 01:28:17PM -0400, Alan Stern wrote:
> Will:
>
> On Mon, 25 Jun 2018, Andrea Parri wrote:
>
> > On Fri, Jun 22, 2018 at 07:30:08PM +0100, Will Deacon wrote:
> > > > > I think the second example would preclude us using LDAPR for load-acquire,
> >
> > > I don't think it's a moot point. We want new architectures to implement
> > > acquire/release efficiently, and it's not unlikely that they will have
> > > acquire loads that are similar in semantics to LDAPR. This patch prevents
> > > them from doing so,
> >
> > By this same argument, you should not be a "big fan" of rfi-rel-acq in ppo ;)
> > consider, e.g., the two litmus tests below: what am I missing?
>
> This is an excellent point, which seems to have gotten lost in the
> shuffle. I'd like to see your comments.
>
> In essence, if you're using release-acquire instructions that only
> provide RCpc consistency, does store-release followed by load-acquire
> of the same address provide read-read ordering? In theory it doesn't
> have to, because if the value from the store-release is forwarded to
> the load-acquire then:
>
> LOAD A
> STORE-RELEASE X, v
> LOAD-ACQUIRE X
> LOAD B
>
> could be executed by the CPU in the order:
>
> LOAD-ACQUIRE X
> LOAD B
> LOAD A
> STORE-RELEASE X, v
>
> thereby accessing A and B out of program order without violating the
> requirements on the release or the acquire.
>
> Of course PPC doesn't allow this, but should we rule it out entirely?
>
> > C MP+fencewmbonceonce+pooncerelease-rfireleaseacquire-poacquireonce
> >
> > {}
> >
> > P0(int *x, int *y)
> > {
> > WRITE_ONCE(*x, 1);
> > smp_wmb();
> > WRITE_ONCE(*y, 1);
> > }
> >
> > P1(int *x, int *y, int *z)
> > {
> > r0 = READ_ONCE(*y);
> > smp_store_release(z, 1);
> > r1 = smp_load_acquire(z);
> > r2 = READ_ONCE(*x);
> > }
> >
> > exists (1:r0=1 /\ 1:r1=1 /\ 1:r2=0)
> >
> >
> > AArch64 MP+dmb.st+popl-rfilq-poqp
> > "DMB.STdWW Rfe PodRWPL RfiLQ PodRRQP Fre"
> > Generator=diyone7 (version 7.49+02(dev))
> > Prefetch=0:x=F,0:y=W,1:y=F,1:x=T
> > Com=Rf Fr
> > Orig=DMB.STdWW Rfe PodRWPL RfiLQ PodRRQP Fre
> > {
> > 0:X1=x; 0:X3=y;
> > 1:X1=y; 1:X3=z; 1:X6=x;
> > }
> > P0 | P1 ;
> > MOV W0,#1 | LDR W0,[X1] ;
> > STR W0,[X1] | MOV W2,#1 ;
> > DMB ST | STLR W2,[X3] ;
> > MOV W2,#1 | LDAPR W4,[X3] ;
> > STR W2,[X3] | LDR W5,[X6] ;
> > exists
> > (1:X0=1 /\ 1:X4=1 /\ 1:X5=0)
>
> There's also read-write ordering, in the form of the LB pattern:
>
> P0(int *x, int *y, int *z)
> {
> r0 = READ_ONCE(*x);
> smp_store_release(z, 1);
> r1 = smp_load_acquire(z);
> WRITE_ONCE(*y, 1);
> }
>
> P1(int *x, int *y)
> {
> r2 = READ_ONCE(*y);
> 	smp_mb();
> WRITE_ONCE(*x, 1);
> }
>
> exists (0:r0=1 /\ 1:r2=1)
>
> Would this be allowed if smp_load_acquire() was implemented with LDAPR?
> If the answer is yes then we will have to remove the rfi-rel-acq and
> rel-rf-acq-po relations from the memory model entirely.
>
> Alan
>
> PS: Paul, is the patch which introduced rel-rf-acq-po currently present
> in any of your branches? I couldn't find it.
It is not, I will add it back in. I misinterpreted your "drop this
patch" on 2/2 as "drop both patches". Please accept my apologies!
Just to double-check, the patch below should be added, correct?
Thanx, Paul
------------------------------------------------------------------------
Date: Thu, 21 Jun 2018 13:26:49 -0400 (EDT)
From: Alan Stern <[email protected]>
To: LKMM Maintainers -- Akira Yokosawa <[email protected]>, Andrea Parri
<[email protected]>, Boqun Feng
<[email protected]>, David Howells <[email protected]>,
Jade Alglave <[email protected]>, Luc Maranget <[email protected]>,
Nicholas Piggin <[email protected]>,
"Paul E. McKenney" <[email protected]>, Peter Zijlstra <[email protected]>,
Will Deacon <[email protected]>
cc: Kernel development list <[email protected]>
Subject: [PATCH 1/2] tools/memory-model: Change rel-rfi-acq ordering to
(rel-rf-acq-po & int)
Message-ID: <[email protected]>
This patch changes the LKMM rule which says that an acquire which
reads from an earlier release must be executed after that release (in
other words, the release cannot be forwarded to the acquire). This is
not true on PowerPC, for example.
What is true instead is that any instruction following the acquire
must be executed after the release. On some architectures this is
because a write-release cannot be forwarded to a read-acquire; on
others (including PowerPC) it is because the implementation of
smp_load_acquire() places a memory barrier immediately after the
load.
This change to the model does not cause any change to the model's
predictions. This is because any link starting from a load must be an
instance of either po or fr. In the po case, the new rule will still
provide ordering. In the fr case, we also have ordering because there
must be a co link to the same destination starting from the
write-release.
Signed-off-by: Alan Stern <[email protected]>
---
[as1870]
tools/memory-model/Documentation/explanation.txt | 35 ++++++++++++-----------
tools/memory-model/linux-kernel.cat | 6 +--
2 files changed, 22 insertions(+), 19 deletions(-)
Index: usb-4.x/tools/memory-model/linux-kernel.cat
===================================================================
--- usb-4.x.orig/tools/memory-model/linux-kernel.cat
+++ usb-4.x/tools/memory-model/linux-kernel.cat
@@ -38,7 +38,7 @@ let strong-fence = mb | gp
(* Release Acquire *)
let acq-po = [Acquire] ; po ; [M]
let po-rel = [M] ; po ; [Release]
-let rfi-rel-acq = [Release] ; rfi ; [Acquire]
+let rel-rf-acq-po = [Release] ; rf ; [Acquire] ; po
(**********************************)
(* Fundamental coherence ordering *)
@@ -60,9 +60,9 @@ let dep = addr | data
let rwdep = (dep | ctrl) ; [W]
let overwrite = co | fr
let to-w = rwdep | (overwrite & int)
-let to-r = addr | (dep ; rfi) | rfi-rel-acq
+let to-r = addr | (dep ; rfi)
let fence = strong-fence | wmb | po-rel | rmb | acq-po
-let ppo = to-r | to-w | fence
+let ppo = to-r | to-w | fence | (rel-rf-acq-po & int)
(* Propagation: Ordering from release operations and strong fences. *)
let A-cumul(r) = rfe? ; r
Index: usb-4.x/tools/memory-model/Documentation/explanation.txt
===================================================================
--- usb-4.x.orig/tools/memory-model/Documentation/explanation.txt
+++ usb-4.x/tools/memory-model/Documentation/explanation.txt
@@ -1067,27 +1067,30 @@ allowing out-of-order writes like this t
violating the write-write coherence rule by requiring the CPU not to
send the W write to the memory subsystem at all!)
-There is one last example of preserved program order in the LKMM: when
-a load-acquire reads from an earlier store-release. For example:
+There is one last example of preserved program order in the LKMM; it
+applies to instructions po-after a load-acquire which reads from an
+earlier store-release. For example:
smp_store_release(&x, 123);
r1 = smp_load_acquire(&x);
+	WRITE_ONCE(y, 246);
If the smp_load_acquire() ends up obtaining the 123 value that was
-stored by the smp_store_release(), the LKMM says that the load must be
-executed after the store; the store cannot be forwarded to the load.
-This requirement does not arise from the operational model, but it
-yields correct predictions on all architectures supported by the Linux
-kernel, although for differing reasons.
-
-On some architectures, including x86 and ARMv8, it is true that the
-store cannot be forwarded to the load. On others, including PowerPC
-and ARMv7, smp_store_release() generates object code that starts with
-a fence and smp_load_acquire() generates object code that ends with a
-fence. The upshot is that even though the store may be forwarded to
-the load, it is still true that any instruction preceding the store
-will be executed before the load or any following instructions, and
-the store will be executed before any instruction following the load.
+written by the smp_store_release(), the LKMM says that the store to y
+must be executed after the store to x. In fact, the only way this
+could fail would be if the store-release was forwarded to the
+load-acquire; the LKMM says it holds even in that case. This
+requirement does not arise from the operational model, but it yields
+correct predictions on all architectures supported by the Linux
+kernel, although for differing reasons:
+
+On some architectures, including x86 and ARMv8, a store-release cannot
+be forwarded to a load-acquire. On others, including PowerPC and
+ARMv7, smp_load_acquire() generates object code that ends with a
+fence. The result is that even though the store-release may be
+forwarded to the load-acquire, it is still true that the store-release
+(and all preceding instructions) will be executed before any
+instruction following the load-acquire.
AND THEN THERE WAS ALPHA
Hi Alan,
On Fri, Jun 22, 2018 at 03:11:37PM -0400, Alan Stern wrote:
> On Fri, 22 Jun 2018, Will Deacon wrote:
> > On Fri, Jun 22, 2018 at 02:09:04PM -0400, Alan Stern wrote:
> > > On Fri, 22 Jun 2018, Will Deacon wrote:
> > > > Could we drop the acquire/release stuff from the patch and limit this change
> > > > to locking instead?
> > >
> > > The LKMM uses the same CAT code for acquire/release and lock/unlock.
> > > (In essence, it considers a lock to be an acquire and an unlock to be a
> > > release; everything else follows from that.) Treating one differently
> > > from the other in these tests would require some significant changes.
> > > It wouldn't be easy.
> >
> > It would be boring if it was easy ;) I think this is a case of the tail
> > wagging the dog.
> >
> > Paul -- please can you drop this patch until we've resolved this discussion?
>
> Agreed. It sounds like we'll need two versions of the Rel and Acq sets
> in the memory model; one for RCpc and one for RCsc. smp_load_acquire
> and smp_store_release will use the former, and locking will use the
> latter.
>
> Would it suffice to have this duplication just for release, using a
> single version of acquire? What would happen on ARMv8 or RISC-V if an
> RCsc release was read by an RCpc acquire? Or vice versa?
On Arm, RCsc release can forward to an RCpc acquire. We don't have an RCpc
release instruction.
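That is, in a sequence like the following sketch, the LDAPR may read
the STLR's value by forwarding, before the store-release has become
visible to other observers:

	STLR  W2, [X3]    // RCsc store-release
	LDAPR W4, [X3]    // RCpc load-acquire; may be satisfied early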
Will
Hi Alan,
On Tue, Jul 03, 2018 at 01:28:17PM -0400, Alan Stern wrote:
> On Mon, 25 Jun 2018, Andrea Parri wrote:
>
> > On Fri, Jun 22, 2018 at 07:30:08PM +0100, Will Deacon wrote:
> > > > > I think the second example would preclude us using LDAPR for load-acquire,
> >
> > > I don't think it's a moot point. We want new architectures to implement
> > > acquire/release efficiently, and it's not unlikely that they will have
> > > acquire loads that are similar in semantics to LDAPR. This patch prevents
> > > them from doing so,
> >
> > By this same argument, you should not be a "big fan" of rfi-rel-acq in ppo ;)
> > consider, e.g., the two litmus tests below: what am I missing?
>
> This is an excellent point, which seems to have gotten lost in the
> shuffle. I'd like to see your comments.
Yeah, sorry. Loads going on at the moment. You could ask herd instead of me
though ;)
> In essence, if you're using release-acquire instructions that only
> provide RCpc consistency, does store-release followed by load-acquire
> of the same address provide read-read ordering? In theory it doesn't
> have to, because if the value from the store-release is forwarded to
> the load-acquire then:
>
> LOAD A
> STORE-RELEASE X, v
> LOAD-ACQUIRE X
> LOAD B
>
> could be executed by the CPU in the order:
>
> LOAD-ACQUIRE X
> LOAD B
> LOAD A
> STORE-RELEASE X, v
>
> thereby accessing A and B out of program order without violating the
> requirements on the release or the acquire.
>
> Of course PPC doesn't allow this, but should we rule it out entirely?
This would be allowed if LOAD-ACQUIRE was implemented using LDAPR on Arm.
I don't think we should be ruling out architectures using RCpc
acquire/release primitives, because doing so just feels like an artifact of
most architectures building these out of fences today.
It's funny really, because from an Arm-perspective I don't plan to stray
outside of RCsc, but I feel like other weak architectures aren't being
well represented here. If we just care about x86, Arm and Power (and assume
that Power doesn't plan to implement RCpc acquire/release instructions)
then we're good to tighten things up. But I fear that RISC-V should probably
be more engaged (adding Daniel) and who knows about MIPS or these other
random architectures popping up on linux-arch.
> > C MP+fencewmbonceonce+pooncerelease-rfireleaseacquire-poacquireonce
> >
> > {}
> >
> > P0(int *x, int *y)
> > {
> > WRITE_ONCE(*x, 1);
> > smp_wmb();
> > WRITE_ONCE(*y, 1);
> > }
> >
> > P1(int *x, int *y, int *z)
> > {
> > r0 = READ_ONCE(*y);
> > smp_store_release(z, 1);
> > r1 = smp_load_acquire(z);
> > r2 = READ_ONCE(*x);
> > }
> >
> > exists (1:r0=1 /\ 1:r1=1 /\ 1:r2=0)
> >
> >
> > AArch64 MP+dmb.st+popl-rfilq-poqp
> > "DMB.STdWW Rfe PodRWPL RfiLQ PodRRQP Fre"
> > Generator=diyone7 (version 7.49+02(dev))
> > Prefetch=0:x=F,0:y=W,1:y=F,1:x=T
> > Com=Rf Fr
> > Orig=DMB.STdWW Rfe PodRWPL RfiLQ PodRRQP Fre
> > {
> > 0:X1=x; 0:X3=y;
> > 1:X1=y; 1:X3=z; 1:X6=x;
> > }
> > P0 | P1 ;
> > MOV W0,#1 | LDR W0,[X1] ;
> > STR W0,[X1] | MOV W2,#1 ;
> > DMB ST | STLR W2,[X3] ;
> > MOV W2,#1 | LDAPR W4,[X3] ;
> > STR W2,[X3] | LDR W5,[X6] ;
> > exists
> > (1:X0=1 /\ 1:X4=1 /\ 1:X5=0)
(you can also run this yourself, since 'Q' is supported in the .cat file
I contributed to herdtools7)
Test MP+dmb.sy+popl-rfilq-poqp Allowed
States 4
1:X0=0; 1:X4=1; 1:X5=0;
1:X0=0; 1:X4=1; 1:X5=1;
1:X0=1; 1:X4=1; 1:X5=0;
1:X0=1; 1:X4=1; 1:X5=1;
Ok
Witnesses
Positive: 1 Negative: 3
Condition exists (1:X0=1 /\ 1:X4=1 /\ 1:X5=0)
Observation MP+dmb.sy+popl-rfilq-poqp Sometimes 1 3
Time MP+dmb.sy+popl-rfilq-poqp 0.01
Hash=61858b7b59a6310d869f99cd05718f96
> There's also read-write ordering, in the form of the LB pattern:
>
> P0(int *x, int *y, int *z)
> {
> r0 = READ_ONCE(*x);
> smp_store_release(z, 1);
> r1 = smp_load_acquire(z);
> WRITE_ONCE(*y, 1);
> }
>
> P1(int *x, int *y)
> {
> r2 = READ_ONCE(*y);
> smp_mb();
> WRITE_ONCE(*x, 1);
> }
>
> exists (0:r0=1 /\ 1:r2=1)
The access types are irrelevant to the acquire/release primitives, so yes
that's also allowed.
> Would this be allowed if smp_load_acquire() was implemented with LDAPR?
> If the answer is yes then we will have to remove the rfi-rel-acq and
> rel-rf-acq-po relations from the memory model entirely.
I don't understand what you mean by "rfi-rel-acq-po", and I assume you mean
rel-rfi-acq-po for the other? Sounds like I'm confused here.
Will
On Wed, Jul 04, 2018 at 04:28:52AM -0700, Paul E. McKenney wrote:
> On Tue, Jul 03, 2018 at 01:28:17PM -0400, Alan Stern wrote:
> > PS: Paul, is the patch which introduced rel-rf-acq-po currently present
> > in any of your branches? I couldn't find it.
>
> It is not, I will add it back in. I misinterpreted your "drop this
> patch" on 2/2 as "drop both patches". Please accept my apologies!
>
> Just to double-check, the patch below should be added, correct?
Hang on, I'm not sure this patch is quite right either. We need to reach
agreement on whether or not we want to support native RCpc acquire/release
instructions before we work out what to do with this relation.
Will
> ------------------------------------------------------------------------
>
> Date: Thu, 21 Jun 2018 13:26:49 -0400 (EDT)
> From: Alan Stern <[email protected]>
> To: LKMM Maintainers -- Akira Yokosawa <[email protected]>, Andrea Parri
> <[email protected]>, Boqun Feng
> <[email protected]>, David Howells <[email protected]>,
> Jade Alglave <[email protected]>, Luc Maranget <[email protected]>,
> Nicholas Piggin <[email protected]>,
> "Paul E. McKenney" <[email protected]>, Peter Zijlstra <[email protected]>,
> Will Deacon <[email protected]>
> cc: Kernel development list <[email protected]>
> Subject: [PATCH 1/2] tools/memory-model: Change rel-rfi-acq ordering to
> (rel-rf-acq-po & int)
> Message-ID: <[email protected]>
>
> This patch changes the LKMM rule which says that an acquire which
> reads from an earlier release must be executed after that release (in
> other words, the release cannot be forwarded to the acquire). This is
> not true on PowerPC, for example.
>
> What is true instead is that any instruction following the acquire
> must be executed after the release. On some architectures this is
> because a write-release cannot be forwarded to a read-acquire; on
> others (including PowerPC) it is because the implementation of
> smp_load_acquire() places a memory barrier immediately after the
> load.
>
> This change to the model does not cause any change to the model's
> predictions. This is because any link starting from a load must be an
> instance of either po or fr. In the po case, the new rule will still
> provide ordering. In the fr case, we also have ordering because there
> must be a co link to the same destination starting from the
> write-release.
>
> Signed-off-by: Alan Stern <[email protected]>
>
> ---
>
>
> [as1870]
>
>
> tools/memory-model/Documentation/explanation.txt | 35 ++++++++++++-----------
> tools/memory-model/linux-kernel.cat | 6 +--
> 2 files changed, 22 insertions(+), 19 deletions(-)
>
> Index: usb-4.x/tools/memory-model/linux-kernel.cat
> ===================================================================
> --- usb-4.x.orig/tools/memory-model/linux-kernel.cat
> +++ usb-4.x/tools/memory-model/linux-kernel.cat
> @@ -38,7 +38,7 @@ let strong-fence = mb | gp
> (* Release Acquire *)
> let acq-po = [Acquire] ; po ; [M]
> let po-rel = [M] ; po ; [Release]
> -let rfi-rel-acq = [Release] ; rfi ; [Acquire]
> +let rel-rf-acq-po = [Release] ; rf ; [Acquire] ; po
>
> (**********************************)
> (* Fundamental coherence ordering *)
> @@ -60,9 +60,9 @@ let dep = addr | data
> let rwdep = (dep | ctrl) ; [W]
> let overwrite = co | fr
> let to-w = rwdep | (overwrite & int)
> -let to-r = addr | (dep ; rfi) | rfi-rel-acq
> +let to-r = addr | (dep ; rfi)
> let fence = strong-fence | wmb | po-rel | rmb | acq-po
> -let ppo = to-r | to-w | fence
> +let ppo = to-r | to-w | fence | (rel-rf-acq-po & int)
>
> (* Propagation: Ordering from release operations and strong fences. *)
> let A-cumul(r) = rfe? ; r
> Index: usb-4.x/tools/memory-model/Documentation/explanation.txt
> ===================================================================
> --- usb-4.x.orig/tools/memory-model/Documentation/explanation.txt
> +++ usb-4.x/tools/memory-model/Documentation/explanation.txt
> @@ -1067,27 +1067,30 @@ allowing out-of-order writes like this t
> violating the write-write coherence rule by requiring the CPU not to
> send the W write to the memory subsystem at all!)
>
> -There is one last example of preserved program order in the LKMM: when
> -a load-acquire reads from an earlier store-release. For example:
> +There is one last example of preserved program order in the LKMM; it
> +applies to instructions po-after a load-acquire which reads from an
> +earlier store-release. For example:
>
> smp_store_release(&x, 123);
> r1 = smp_load_acquire(&x);
> + WRITE_ONCE(y, 246);
>
> If the smp_load_acquire() ends up obtaining the 123 value that was
> -stored by the smp_store_release(), the LKMM says that the load must be
> -executed after the store; the store cannot be forwarded to the load.
> -This requirement does not arise from the operational model, but it
> -yields correct predictions on all architectures supported by the Linux
> -kernel, although for differing reasons.
> -
> -On some architectures, including x86 and ARMv8, it is true that the
> -store cannot be forwarded to the load. On others, including PowerPC
> -and ARMv7, smp_store_release() generates object code that starts with
> -a fence and smp_load_acquire() generates object code that ends with a
> -fence. The upshot is that even though the store may be forwarded to
> -the load, it is still true that any instruction preceding the store
> -will be executed before the load or any following instructions, and
> -the store will be executed before any instruction following the load.
> +written by the smp_store_release(), the LKMM says that the store to y
> +must be executed after the store to x. In fact, the only way this
> +could fail would be if the store-release was forwarded to the
> +load-acquire; the LKMM says it holds even in that case. This
> +requirement does not arise from the operational model, but it yields
> +correct predictions on all architectures supported by the Linux
> +kernel, although for differing reasons:
> +
> +On some architectures, including x86 and ARMv8, a store-release cannot
> +be forwarded to a load-acquire. On others, including PowerPC and
> +ARMv7, smp_load_acquire() generates object code that ends with a
> +fence. The result is that even though the store-release may be
> +forwarded to the load-acquire, it is still true that the store-release
> +(and all preceding instructions) will be executed before any
> +instruction following the load-acquire.
>
>
> AND THEN THERE WAS ALPHA
>
>
On Wed, Jul 04, 2018 at 01:11:04PM +0100, Will Deacon wrote:
> Hi Alan,
>
> On Tue, Jul 03, 2018 at 01:28:17PM -0400, Alan Stern wrote:
> > On Mon, 25 Jun 2018, Andrea Parri wrote:
> >
> > > On Fri, Jun 22, 2018 at 07:30:08PM +0100, Will Deacon wrote:
> > > > > > I think the second example would preclude us using LDAPR for load-acquire,
> > >
> > > > I don't think it's a moot point. We want new architectures to implement
> > > > acquire/release efficiently, and it's not unlikely that they will have
> > > > acquire loads that are similar in semantics to LDAPR. This patch prevents
> > > > them from doing so,
> > >
> > > By this same argument, you should not be a "big fan" of rfi-rel-acq in ppo ;)
> > > consider, e.g., the two litmus tests below: what am I missing?
> >
> > This is an excellent point, which seems to have gotten lost in the
> > shuffle. I'd like to see your comments.
>
> Yeah, sorry. Loads going on at the moment. You could ask herd instead of me
> though ;)
>
> > In essence, if you're using release-acquire instructions that only
> > provide RCpc consistency, does store-release followed by load-acquire
> > of the same address provide read-read ordering? In theory it doesn't
> > have to, because if the value from the store-release is forwarded to
> > the load-acquire then:
> >
> > LOAD A
> > STORE-RELEASE X, v
> > LOAD-ACQUIRE X
> > LOAD B
> >
> > could be executed by the CPU in the order:
> >
> > LOAD-ACQUIRE X
> > LOAD B
> > LOAD A
> > STORE-RELEASE X, v
> >
> > thereby accessing A and B out of program order without violating the
> > requirements on the release or the acquire.
> >
> > Of course PPC doesn't allow this, but should we rule it out entirely?
>
> This would be allowed if LOAD-ACQUIRE was implemented using LDAPR on Arm.
> I don't think we should be ruling out architectures using RCpc
> acquire/release primitives, because doing so just feels like an artifact of
> most architectures building these out of fences today.
>
> It's funny really, because from an Arm-perspective I don't plan to stray
> outside of RCsc, but I feel like other weak architectures aren't being
> well represented here. If we just care about x86, Arm and Power (and assume
> that Power doesn't plan to implement RCpc acquire/release instructions)
> then we're good to tighten things up. But I fear that RISC-V should probably
> be more engaged (adding Daniel) and who knows about MIPS or these other
> random architectures popping up on linux-arch.
>
> > > C MP+fencewmbonceonce+pooncerelease-rfireleaseacquire-poacquireonce
> > >
> > > {}
> > >
> > > P0(int *x, int *y)
> > > {
> > > WRITE_ONCE(*x, 1);
> > > smp_wmb();
> > > WRITE_ONCE(*y, 1);
> > > }
> > >
> > > P1(int *x, int *y, int *z)
> > > {
> > > r0 = READ_ONCE(*y);
> > > smp_store_release(z, 1);
> > > r1 = smp_load_acquire(z);
> > > r2 = READ_ONCE(*x);
> > > }
> > >
> > > exists (1:r0=1 /\ 1:r1=1 /\ 1:r2=0)
> > >
> > >
> > > AArch64 MP+dmb.st+popl-rfilq-poqp
> > > "DMB.STdWW Rfe PodRWPL RfiLQ PodRRQP Fre"
> > > Generator=diyone7 (version 7.49+02(dev))
> > > Prefetch=0:x=F,0:y=W,1:y=F,1:x=T
> > > Com=Rf Fr
> > > Orig=DMB.STdWW Rfe PodRWPL RfiLQ PodRRQP Fre
> > > {
> > > 0:X1=x; 0:X3=y;
> > > 1:X1=y; 1:X3=z; 1:X6=x;
> > > }
> > > P0 | P1 ;
> > > MOV W0,#1 | LDR W0,[X1] ;
> > > STR W0,[X1] | MOV W2,#1 ;
> > > DMB ST | STLR W2,[X3] ;
> > > MOV W2,#1 | LDAPR W4,[X3] ;
> > > STR W2,[X3] | LDR W5,[X6] ;
> > > exists
> > > (1:X0=1 /\ 1:X4=1 /\ 1:X5=0)
>
> (you can also run this yourself, since 'Q' is supported in the .cat file
> I contributed to herdtools7)
>
> Test MP+dmb.sy+popl-rfilq-poqp Allowed
> States 4
> 1:X0=0; 1:X4=1; 1:X5=0;
> 1:X0=0; 1:X4=1; 1:X5=1;
> 1:X0=1; 1:X4=1; 1:X5=0;
> 1:X0=1; 1:X4=1; 1:X5=1;
> Ok
> Witnesses
> Positive: 1 Negative: 3
> Condition exists (1:X0=1 /\ 1:X4=1 /\ 1:X5=0)
> Observation MP+dmb.sy+popl-rfilq-poqp Sometimes 1 3
> Time MP+dmb.sy+popl-rfilq-poqp 0.01
> Hash=61858b7b59a6310d869f99cd05718f96
>
> > There's also read-write ordering, in the form of the LB pattern:
> >
> > P0(int *x, int *y, int *z)
> > {
> > r0 = READ_ONCE(*x);
> > smp_store_release(z, 1);
> > r1 = smp_load_acquire(z);
> > WRITE_ONCE(*y, 1);
> > }
> >
> > P1(int *x, int *y)
> > {
> > r2 = READ_ONCE(*y);
> > smp_mb();
> > WRITE_ONCE(*x, 1);
> > }
> >
> > exists (0:r0=1 /\ 1:r2=1)
>
> The access types are irrelevant to the acquire/release primitives, so yes
> that's also allowed.
>
> > Would this be allowed if smp_load_acquire() was implemented with LDAPR?
> > If the answer is yes then we will have to remove the rfi-rel-acq and
> > rel-rf-acq-po relations from the memory model entirely.
>
> I don't understand what you mean by "rfi-rel-acq-po", and I assume you mean
> rel-rfi-acq-po for the other? Sounds like I'm confused here.
[Your reply about 1/2 suggests that you've figured this out now, IAC ...]
"rfi-rel-acq" (as Alan wrote, and as I wrote before my question above...)
is defined and currently used in linux-kernel.cat (and, FWIW, it has been
so since when we upstreamed LKMM).
My point is that, as exemplified by the two tests I reported above, this
relation already prevents you from implementing your acquire with LDAPR;
so my/our question was not "can you run herd7 for me?" but rather "do you
think that changes are needed to the .cat file?".
This question goes back _at least_ to:
http://lkml.kernel.org/r/[email protected]
(see, in part., the "IMPORTANT" note at the bottom of the commit message)
and that discussion later resulted in:
0123f4d76ca63b ("riscv/spinlock: Strengthen implementations with fences")
5ce6c1f3535fa8 ("riscv/atomic: Strengthen implementations with fences")
(the latest _draft_ of the RISC-V specification, as pointed out by Daniel,
https://github.com/riscv/riscv-isa-manual, Appendix A.5
includes our "Linux mapping", although the currently-recommended mapping
differs and involves a "fence.tso [+ any acquire, including RCpc .aq]").
My understanding is that your answer to this question is "Yes", but I am
not sure about the "How/Which changes?"; of course, an answer to this question
_in the form_ of PATCHes would be appreciated! (but please also consider
that I'll be offline for most of the time until next Monday.)
Andrea
>
> Will
On Wed, 4 Jul 2018, Will Deacon wrote:
> Hi Alan,
>
> On Tue, Jul 03, 2018 at 01:28:17PM -0400, Alan Stern wrote:
> > On Mon, 25 Jun 2018, Andrea Parri wrote:
> >
> > > On Fri, Jun 22, 2018 at 07:30:08PM +0100, Will Deacon wrote:
> > > > > > I think the second example would preclude us using LDAPR for load-acquire,
> > >
> > > > I don't think it's a moot point. We want new architectures to implement
> > > > acquire/release efficiently, and it's not unlikely that they will have
> > > > acquire loads that are similar in semantics to LDAPR. This patch prevents
> > > > them from doing so,
> > >
> > > By this same argument, you should not be a "big fan" of rfi-rel-acq in ppo ;)
> > > consider, e.g., the two litmus tests below: what am I missing?
> >
> > This is an excellent point, which seems to have gotten lost in the
> > shuffle. I'd like to see your comments.
>
> Yeah, sorry. Loads going on at the moment. You could ask herd instead of me
> though ;)
Indeed; and the answer was as expected. Sometimes one gains additional
insights by asking a person, though.
> > In essence, if you're using release-acquire instructions that only
> > provide RCpc consistency, does store-release followed by load-acquire
> > of the same address provide read-read ordering? In theory it doesn't
> > have to, because if the value from the store-release is forwarded to
> > the load-acquire then:
> >
> > LOAD A
> > STORE-RELEASE X, v
> > LOAD-ACQUIRE X
> > LOAD B
> >
> > could be executed by the CPU in the order:
> >
> > LOAD-ACQUIRE X
> > LOAD B
> > LOAD A
> > STORE-RELEASE X, v
> >
> > thereby accessing A and B out of program order without violating the
> > requirements on the release or the acquire.
> >
> > Of course PPC doesn't allow this, but should we rule it out entirely?
>
> This would be allowed if LOAD-ACQUIRE was implemented using LDAPR on Arm.
> I don't think we should be ruling out architectures using RCpc
> acquire/release primitives, because doing so just feels like an artifact of
> most architectures building these out of fences today.
>
> It's funny really, because from an Arm-perspective I don't plan to stray
> outside of RCsc, but I feel like other weak architectures aren't being
> well represented here. If we just care about x86, Arm and Power (and assume
> that Power doesn't plan to implement RCpc acquire/release instructions)
> then we're good to tighten things up. But I fear that RISC-V should probably
> be more engaged (adding Daniel) and who knows about MIPS or these other
> random architectures popping up on linux-arch.
I don't object to having weak versions of acquire/release in the LKMM.
Perhaps the stronger versions could be kept in the hardware model
(which has not been published and is not in the kernel source), but
even that might be a bad idea in view of what RISC-V is liable to do.
> > > C MP+fencewmbonceonce+pooncerelease-rfireleaseacquire-poacquireonce
> > >
> > > {}
> > >
> > > P0(int *x, int *y)
> > > {
> > > WRITE_ONCE(*x, 1);
> > > smp_wmb();
> > > WRITE_ONCE(*y, 1);
> > > }
> > >
> > > P1(int *x, int *y, int *z)
> > > {
> > > r0 = READ_ONCE(*y);
> > > smp_store_release(z, 1);
> > > r1 = smp_load_acquire(z);
> > > r2 = READ_ONCE(*x);
> > > }
> > >
> > > exists (1:r0=1 /\ 1:r1=1 /\ 1:r2=0)
> > >
> > >
> > > AArch64 MP+dmb.st+popl-rfilq-poqp
> > > "DMB.STdWW Rfe PodRWPL RfiLQ PodRRQP Fre"
> > > Generator=diyone7 (version 7.49+02(dev))
> > > Prefetch=0:x=F,0:y=W,1:y=F,1:x=T
> > > Com=Rf Fr
> > > Orig=DMB.STdWW Rfe PodRWPL RfiLQ PodRRQP Fre
> > > {
> > > 0:X1=x; 0:X3=y;
> > > 1:X1=y; 1:X3=z; 1:X6=x;
> > > }
> > > P0 | P1 ;
> > > MOV W0,#1 | LDR W0,[X1] ;
> > > STR W0,[X1] | MOV W2,#1 ;
> > > DMB ST | STLR W2,[X3] ;
> > > MOV W2,#1 | LDAPR W4,[X3] ;
> > > STR W2,[X3] | LDR W5,[X6] ;
> > > exists
> > > (1:X0=1 /\ 1:X4=1 /\ 1:X5=0)
>
> (you can also run this yourself, since 'Q' is supported in the .cat file
> I contributed to herdtools7)
>
> Test MP+dmb.sy+popl-rfilq-poqp Allowed
> States 4
> 1:X0=0; 1:X4=1; 1:X5=0;
> 1:X0=0; 1:X4=1; 1:X5=1;
> 1:X0=1; 1:X4=1; 1:X5=0;
> 1:X0=1; 1:X4=1; 1:X5=1;
> Ok
> Witnesses
> Positive: 1 Negative: 3
> Condition exists (1:X0=1 /\ 1:X4=1 /\ 1:X5=0)
> Observation MP+dmb.sy+popl-rfilq-poqp Sometimes 1 3
> Time MP+dmb.sy+popl-rfilq-poqp 0.01
> Hash=61858b7b59a6310d869f99cd05718f96
>
> > There's also read-write ordering, in the form of the LB pattern:
> >
> > P0(int *x, int *y, int *z)
> > {
> > r0 = READ_ONCE(*x);
> > smp_store_release(z, 1);
> > r1 = smp_load_acquire(z);
> > WRITE_ONCE(*y, 1);
> > }
> >
> > P1(int *x, int *y)
> > {
> > r2 = READ_ONCE(*y);
> > smp_mb();
> > WRITE_ONCE(*x, 1);
> > }
> >
> > exists (0:r0=1 /\ 1:r2=1)
>
> The access types are irrelevant to the acquire/release primitives, so yes
> that's also allowed.
>
> > Would this be allowed if smp_load_acquire() was implemented with LDAPR?
> > If the answer is yes then we will have to remove the rfi-rel-acq and
> > rel-rf-acq-po relations from the memory model entirely.
>
> I don't understand what you mean by "rfi-rel-acq-po", and I assume you mean
> rel-rfi-acq-po for the other? Sounds like I'm confused here.
"rfi-rel-acq" is the relation which was removed by the first of my two
patches (it is now back in business since Paul reverted the commits),
and "rel-rf-acq-po" is the relation that was introduced to replace it.
At any rate, it looks like instead of strengthening the relation, I
should write a patch that removes it entirely. I also will add new,
stronger relations for use with locking, essentially making spin_lock
and spin_unlock be RCsc.
Alan
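(For concreteness, one shape such a locking-only relation could take in
linux-kernel.cat, reusing the lock-event tags that lock.cat already
defines; this is a sketch of the direction, not the eventual patch:

	(* ordering due to an unlock read by a subsequent lock *)
	let unlock-rf-lock-po = [UL] ; rf ; [LKR] ; po

with (unlock-rf-lock-po & int) taking the place of (rel-rf-acq-po & int)
in ppo, and unlock-rf-lock-po added to cumul-fence, so that ordinary
smp_store_release()/smp_load_acquire() can stay weak while lock and
unlock keep the stronger ordering.)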
On Wed, 4 Jul 2018, Will Deacon wrote:
> On Wed, Jul 04, 2018 at 04:28:52AM -0700, Paul E. McKenney wrote:
> > On Tue, Jul 03, 2018 at 01:28:17PM -0400, Alan Stern wrote:
> > > PS: Paul, is the patch which introduced rel-rf-acq-po currently present
> > > in any of your branches? I couldn't find it.
> >
> > It is not, I will add it back in. I misinterpreted your "drop this
> > patch" on 2/2 as "drop both patches". Please accept my apologies!
> >
> > Just to double-check, the patch below should be added, correct?
>
> Hang on, I'm not sure this patch is quite right either. We need to reach
> agreement on whether or not we want to support native RCpc acquire/release
> instructions before we work out what to do with this relation.
Agreed. Paul, please leave both patches reverted. I will send
replacements.
Alan
Andrea,
On Thu, Jul 05, 2018 at 04:00:29PM +0200, Andrea Parri wrote:
> On Wed, Jul 04, 2018 at 01:11:04PM +0100, Will Deacon wrote:
> > On Tue, Jul 03, 2018 at 01:28:17PM -0400, Alan Stern wrote:
> > > There's also read-write ordering, in the form of the LB pattern:
> > >
> > > P0(int *x, int *y, int *z)
> > > {
> > > r0 = READ_ONCE(*x);
> > > smp_store_release(z, 1);
> > > r1 = smp_load_acquire(z);
> > > WRITE_ONCE(*y, 1);
> > > }
> > >
> > > P1(int *x, int *y)
> > > {
> > > r2 = READ_ONCE(*y);
> > > smp_mb();
> > > WRITE_ONCE(*x, 1);
> > > }
> > >
> > > exists (0:r0=1 /\ 1:r2=1)
> >
> > The access types are irrelevant to the acquire/release primitives, so yes
> > that's also allowed.
> >
> > > Would this be allowed if smp_load_acquire() was implemented with LDAPR?
> > > If the answer is yes then we will have to remove the rfi-rel-acq and
> > > rel-rf-acq-po relations from the memory model entirely.
> >
> > I don't understand what you mean by "rfi-rel-acq-po", and I assume you mean
> > rel-rfi-acq-po for the other? Sounds like I'm confused here.
>
> [Your reply about 1/2 suggests that you've figured this out now, IAC ...]
Yeah, the naming threw me because it's not the same order as the composition
in the actual definition (why not?). I typoed an extra 'po' suffix. Well
done for spotting it.
> "rfi-rel-acq" (as Alan wrote, and as I wrote before my question above...)
> is defined and currently used in linux-kernel.cat (and, FWIW, it has been
> so since when we upstreamed LKMM).
>
> My point is that, as exemplified by the two tests I reported above, this
> relation already prevents you from implementing your acquire with LDAPR;
> so my/our question was not "can you run herd7 for me?" but rather "do you
> think that changes are needed to the .cat file?".
I mean, you can run herd on the armv8 memory model and see the answer to the
litmus test. So we have two options for the LKMM .cat file (the Arm one is
baked in silicon):
1. Say that architectures with RCpc acquire/release instructions must
instead use RCsc instructions (if they have them) or full fences
2. Change the LKMM .cat file to allow RCpc acquire/release (note that I'd
still like RCsc lock/unlock semantics, and I think that's actually the
real case that matters here, but the current difficulty in
distinguishing the two in the .cat file has led to this broader
ordering being enforced as a side-effect)
I prefer (2), because I think it's a safe and sensible relaxation to make
and accommodates what I consider to be a likely extension to weakly ordered
architectures that we might want to support efficiently. So yes, I think
changes are needed to the LKMM .cat file, but please don't view that as a
criticism. We change stuff all the time as long as it doesn't break userspace.
> This question goes back _at least_ to:
>
> http://lkml.kernel.org/r/[email protected]
>
> (see, in part., the "IMPORTANT" note at the bottom of the commit message)
> and that discussion later resulted in:
>
> 0123f4d76ca63b ("riscv/spinlock: Strengthen implementations with fences")
> 5ce6c1f3535fa8 ("riscv/atomic: Strengthen implementations with fences")
>
> (the latest _draft_ of the RISC-V specification, as pointed out by Daniel,
>
> https://github.com/riscv/riscv-isa-manual, Appendix A.5
>
> includes our "Linux mapping", although the currently-recommended mapping
> differs and involves a "fence.tso [+ any acquire, including RCpc .aq]").
[I think the discussion at hand is broader than RISC-V, and I looped in
Daniel already]
> My understanding is that your answer to this question is "Yes", but I am
> not sure about the "How/Which changes?"; of course, an answer to this question
> _in the form_ of PATCHes would be appreciated! (but please also consider
> that I'll be offline for most of the time until next Monday.)
C'mon, I'm reviewing patches here. The onus shouldn't be on the reviewer to
come back with the correct version of the patch. I'm also not terribly
worried if LKMM says the wrong thing whilst we work out what the right
thing should be, but I /would/ be worried if we started adding full fences to
an architecture that has acquire/release instructions just because they're
RCpc. If it turns out that no other arch maintainers care, then fine,
because frankly this doesn't affect me, but so far it's basically been
silence and I'd really like some more input before we close the door on
them.
Will
On Thu, Jul 05, 2018 at 10:21:36AM -0400, Alan Stern wrote:
> On Wed, 4 Jul 2018, Will Deacon wrote:
> > On Tue, Jul 03, 2018 at 01:28:17PM -0400, Alan Stern wrote:
> > > Would this be allowed if smp_load_acquire() was implemented with LDAPR?
> > > If the answer is yes then we will have to remove the rfi-rel-acq and
> > > rel-rf-acq-po relations from the memory model entirely.
> >
> > I don't understand what you mean by "rfi-rel-acq-po", and I assume you mean
> > rel-rfi-acq-po for the other? Sounds like I'm confused here.
>
> "rfi-rel-acq" is the relation which was removed by the first of my two
> patches (it is now back in business since Paul reverted the commits),
> and "rel-rf-acq-po" is the relation that was introduced to replace it.
Sorry, yes, I realised this after I'd replied. Curious: but why do you name
the relations this way around, as opposed to e.g. rel-rfi-acq? It's
obviously up to you, but I just couldn't figure out what inspired the
ordering.
> At any rate, it looks like instead of strengthening the relation, I
> should write a patch that removes it entirely. I also will add new,
> stronger relations for use with locking, essentially making spin_lock
> and spin_unlock be RCsc.
Thanks, Alan. I'll try to review them a bit more quickly this time, too.
Will
Will:
On Thu, 5 Jul 2018, Will Deacon wrote:
> On Thu, Jul 05, 2018 at 10:21:36AM -0400, Alan Stern wrote:
> > On Wed, 4 Jul 2018, Will Deacon wrote:
> > > On Tue, Jul 03, 2018 at 01:28:17PM -0400, Alan Stern wrote:
> > > > Would this be allowed if smp_load_acquire() was implemented with LDAPR?
> > > > If the answer is yes then we will have to remove the rfi-rel-acq and
> > > > rel-rf-acq-po relations from the memory model entirely.
> > >
> > > I don't understand what you mean by "rfi-rel-acq-po", and I assume you mean
> > > rel-rfi-acq-po for the other? Sounds like I'm confused here.
> >
> > "rfi-rel-acq" is the relation which was removed by the first of my two
> > patches (it is now back in business since Paul reverted the commits),
> > and "rel-rf-acq-po" is the relation that was introduced to replace it.
>
> Sorry, yes, I realised this after I'd replied. Curious: but why do you name
> the relations this way around, as opposed to e.g. rel-rfi-acq? It's
> obviously up to you, but I just couldn't figure out what inspired the
> ordering.
I no longer remember the reason for naming "rfi-rel-acq" the way I did.
As you say, it doesn't make a lot of sense.
The reason for "rel-rf-acq-po" instead of "rel-rfi-acq-po" was because
the second of the two patches uses that relation in a context where the
release and the acquire might very well run on different CPUs.
Alan
> > At any rate, it looks like instead of strengthening the relation, I
> > should write a patch that removes it entirely. I also will add new,
> > stronger relations for use with locking, essentially making spin_lock
> > and spin_unlock be RCsc.
>
> Thanks, Alan. I'll try to review them a bit more quickly this time, too.
>
> Will
> At any rate, it looks like instead of strengthening the relation, I
> should write a patch that removes it entirely. I also will add new,
> stronger relations for use with locking, essentially making spin_lock
> and spin_unlock be RCsc.
Thank you.
Ah let me put this forward: please keep an eye on the (generic)
queued_spin_lock()
queued_spin_unlock()
(just to point out an example). Their implementation (in particular,
the fast path) suggests that if we stick to an RCsc lock then we should
also stick to an RCsc acquire load from the RMW and an RCsc release store.
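For reference, those fast paths look roughly like this (a simplified
sketch of asm-generic/qspinlock.h, with the slowpath and various
details elided):

	/* Rough sketch of the generic qspinlock fast paths. */
	static __always_inline void queued_spin_lock(struct qspinlock *lock)
	{
		u32 val;

		/* Lock fast path: an acquire-ordered read-modify-write. */
		val = atomic_cmpxchg_acquire(&lock->val, 0, _Q_LOCKED_VAL);
		if (likely(val == 0))
			return;
		queued_spin_lock_slowpath(lock, val);
	}

	static __always_inline void queued_spin_unlock(struct qspinlock *lock)
	{
		/* Unlock fast path: a release-ordered plain store. */
		smp_store_release(&lock->locked, 0);
	}

i.e., the lock side is exactly an acquire load from an RMW and the
unlock side is exactly a release store.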
Andrea
>
> Alan
>
On Thu, Jul 05, 2018 at 10:57:28AM -0400, Alan Stern wrote:
> On Thu, 5 Jul 2018, Will Deacon wrote:
> > On Thu, Jul 05, 2018 at 10:21:36AM -0400, Alan Stern wrote:
> > > On Wed, 4 Jul 2018, Will Deacon wrote:
> > > > On Tue, Jul 03, 2018 at 01:28:17PM -0400, Alan Stern wrote:
> > > > > Would this be allowed if smp_load_acquire() was implemented with LDAPR?
> > > > > If the answer is yes then we will have to remove the rfi-rel-acq and
> > > > > rel-rf-acq-po relations from the memory model entirely.
> > > >
> > > > I don't understand what you mean by "rfi-rel-acq-po", and I assume you mean
> > > > rel-rfi-acq-po for the other? Sounds like I'm confused here.
> > >
> > > "rfi-rel-acq" is the relation which was removed by the first of my two
> > > patches (it is now back in business since Paul reverted the commits),
> > > and "rel-rf-acq-po" is the relation that was introduced to replace it.
> >
> > Sorry, yes, I realised this after I'd replied. Curious: but why do you name
> > the relations this way around, as opposed to e.g. rel-rfi-acq? It's
> > obviously up to you, but I just couldn't figure out what inspired the
> > ordering.
>
> I no longer remember the reason for naming "rfi-rel-acq" the way I did.
> As you say, it doesn't make a lot of sense.
Fair enough!
> The reason for "rel-rf-acq-po" instead of "rel-rfi-acq-po" was because
> the second of the two patches uses that relation in a context where the
> release and the acquire might very well run on different CPUs.
Ok, that makes sense. I realised that I've only been thinking about RCpc
making a difference in the rfi case, because Armv8 is multi-copy atomic
so we don't allow early forwarding from a release to an RCpc acquire
if they are from different CPUs.
Again, I'd be interested in what other architectures have to say here (i.e.
whether RCpc acquire/release instructions are likely to exist in a non
multi-copy atomic architecture).
Will
On 7/5/2018 7:44 AM, Will Deacon wrote:
> Andrea,
>
> On Thu, Jul 05, 2018 at 04:00:29PM +0200, Andrea Parri wrote:
>> On Wed, Jul 04, 2018 at 01:11:04PM +0100, Will Deacon wrote:
>>> On Tue, Jul 03, 2018 at 01:28:17PM -0400, Alan Stern wrote:
>>>> There's also read-write ordering, in the form of the LB pattern:
>>>>
>>>> P0(int *x, int *y, int *z)
>>>> {
>>>> r0 = READ_ONCE(*x);
>>>> smp_store_release(z, 1);
>>>> r1 = smp_load_acquire(z);
>>>> WRITE_ONCE(*y, 1);
>>>> }
>>>>
>>>> P1(int *x, int *y)
>>>> {
>>>> r2 = READ_ONCE(*y);
>>>> smp_mb();
>>>> WRITE_ONCE(*x, 1);
>>>> }
>>>>
>>>> exists (0:r0=1 /\ 1:r2=1)
>>>
>>> The access types are irrelevant to the acquire/release primitives, so yes
>>> that's also allowed.
>>>
>>>> Would this be allowed if smp_load_acquire() was implemented with LDAPR?
>>>> If the answer is yes then we will have to remove the rfi-rel-acq and
>>>> rel-rf-acq-po relations from the memory model entirely.
>>>
>>> I don't understand what you mean by "rfi-rel-acq-po", and I assume you mean
>>> rel-rfi-acq-po for the other? Sounds like I'm confused here.
>>
>> [Your reply about 1/2 suggests that you've figured this out now, IAC ...]
>
> Yeah, the naming threw me because it's not the same order as the composition
> in the actual definition (why not?). I typoed an extra 'po' suffix. Well
> done for spotting it.
>
>> "rfi-rel-acq" (as Alan wrote, and as I wrote before my question above...)
>> is defined and currently used in linux-kernel.cat (and, FWIW, it has been
>> so since when we upstreamed LKMM).
>>
>> My point is that, as exemplified by the two tests I reported above, this
>> relation already prevents you from implementing your acquire with LDAPR;
>> so my/our question was not "can you run herd7 for me?" but rather "do you
>> think that changes are needed to the .cat file?".
>
> I mean, you can run herd on the armv8 memory model and see the answer to the
> litmus test. So we have two options for the LKMM .cat file (the Arm one is
> baked in silicon):
>
> 1. Say that architectures with RCpc acquire/release instructions must
> instead use RCsc instructions (if they have them) or full fences
>
> 2. Change the LKMM .cat file to allow RCpc acquire/release (note that I'd
> still like RCsc lock/unlock semantics, and I think that's actually the
> > real case that matters here, but the current difficulty in
> distinguishing the two in the .cat file has led to this broader
> ordering being enforced as a side-effect)
>
> I prefer (2), because I think it's a safe and sensible relaxation to make
> and accommodates what I consider to be a likely extension to weakly ordered
> architectures that we might want to support efficiently. So yes, I think
> changes are needed to the LKMM .cat file, but please don't view that as a
> criticism. We change stuff all the time as long as it doesn't break userspace.
>
>> This question goes back _at least_ to:
>>
>> http://lkml.kernel.org/r/[email protected]
>>
>> (see, in part., the "IMPORTANT" note at the bottom of the commit message)
>> and that discussion later resulted in:
>>
>> 0123f4d76ca63b ("riscv/spinlock: Strengthen implementations with fences")
>> 5ce6c1f3535fa8 ("riscv/atomic: Strengthen implementations with fences")
>>
>> (the latest _draft_ of the RISC-V specification, as pointed out by Daniel,
>>
>> https://github.com/riscv/riscv-isa-manual, Appendix A.5
>>
>> includes our "Linux mapping", although the currently-recommended mapping
>> differs and involves a "fence.tso [+ any acquire, including RCpc .aq]").
>
> [I think the discussion at hand is broader than RISC-V, and I looped in
> Daniel already]
Sorry for the delay, it was Independence Day here in the US.
I'm aligned with Will on all this so far. As mentioned above, this issue
definitely comes up on RISC-V, but for reasons even beyond RCpc vs. RCsc.
Why? On RISC-V, we have RCsc atomics, but no native load-acquire or
store-release opcodes. You might expect you can use "ld; fence r,rw"
and "fence rw,w; sd", respectively, but those don't interact with the
RCsc .aq/.rl mappings. For example, suppose we did this:
(a)
amoswap.w.rl x0, x0, 0(a1) // st-rel using amoswap with dummy dest
...
lw a0, 0(a1) // ld-acq using fence-based mapping
fence r,rw
(b)
Nothing here orders (a) before (b) in general. A mapping always using
fences would cover what's being asked for (I believe), and using only
RCsc operations (like ARM does) would also work if we had native
load-acquire and store-release operations, but this mixed mapping
doesn't. That's why the spec currently recommends the stronger
fence.tso (~= multi-copy atomic Power lwsync) instead.
...but it's very non-obvious. Is there really a hurry to rush all
this in?
>> My understanding is that your answer to this question is "Yes", but I am
>> not sure about the "How/Which changes?"; of course, an answer to this question
>> _in the form_ of PATCHes would be appreciated! (but please also consider
>> that I'll be offline for most of the time until next Monday.)
>
> C'mon, I'm reviewing patches here. The onus shouldn't be on the reviewer to
> come back with the correct version of the patch. I'm also not terribly
> worried if LKMM says the wrong thing whilst we work out what the right
> thing should be, but I /would/ be worried if we started adding full fences to
> an architecture that has acquire/release instructions just because they're
> RCpc. If it turns out that no other arch maintainers care, then fine,
> because frankly this doesn't affect me, but so far it's basically been
> silence and I'd really like some more input before we close the door on
> them
I care! Thanks for speaking up Will!
Dan
>
> Will
>
On Thu, Jul 05, 2018 at 10:23:01AM -0400, Alan Stern wrote:
> On Wed, 4 Jul 2018, Will Deacon wrote:
>
> > On Wed, Jul 04, 2018 at 04:28:52AM -0700, Paul E. McKenney wrote:
> > > On Tue, Jul 03, 2018 at 01:28:17PM -0400, Alan Stern wrote:
> > > > PS: Paul, is the patch which introduced rel-rf-acq-po currently present
> > > > in any of your branches? I couldn't find it.
> > >
> > > It is not, I will add it back in. I misinterpreted your "drop this
> > > patch" on 2/2 as "drop both patches". Please accept my apologies!
> > >
> > > Just to double-check, the patch below should be added, correct?
> >
> > Hang on, I'm not sure this patch is quite right either. We need to reach
> > agreement on whether or not we want to support native RCpc acquire/release
> > instructions before we work out what to do with this relation.
>
> Agreed. Paul, please leave both patches reverted. I will send
> replacements.
Looking forward to seeing them!
Thanx, Paul
On Thu, Jul 05, 2018 at 10:21:36AM -0400, Alan Stern wrote:
> On Wed, 4 Jul 2018, Will Deacon wrote:
>
> > Hi Alan,
> >
> > On Tue, Jul 03, 2018 at 01:28:17PM -0400, Alan Stern wrote:
> > > On Mon, 25 Jun 2018, Andrea Parri wrote:
> > >
> > > > On Fri, Jun 22, 2018 at 07:30:08PM +0100, Will Deacon wrote:
> > > > > > > I think the second example would preclude us using LDAPR for load-acquire,
> > > >
> > > > > I don't think it's a moot point. We want new architectures to implement
> > > > > acquire/release efficiently, and it's not unlikely that they will have
> > > > > acquire loads that are similar in semantics to LDAPR. This patch prevents
> > > > > them from doing so,
> > > >
> > > > By this same argument, you should not be a "big fan" of rfi-rel-acq in ppo ;)
> > > > consider, e.g., the two litmus tests below: what am I missing?
> > >
> > > This is an excellent point, which seems to have gotten lost in the
> > > shuffle. I'd like to see your comments.
> >
> > Yeah, sorry. Loads going on at the moment. You could ask herd instead of me
> > though ;)
>
> Indeed; and the answer was as expected. Sometimes one gains additional
> insights by asking a person, though.
>
> > > In essence, if you're using release-acquire instructions that only
> > > provide RCpc consistency, does store-release followed by load-acquire
> > > of the same address provide read-read ordering? In theory it doesn't
> > > have to, because if the value from the store-release is forwarded to
> > > the load-acquire then:
> > >
> > > LOAD A
> > > STORE-RELEASE X, v
> > > LOAD-ACQUIRE X
> > > LOAD B
> > >
> > > could be executed by the CPU in the order:
> > >
> > > LOAD-ACQUIRE X
> > > LOAD B
> > > LOAD A
> > > STORE-RELEASE X, v
> > >
> > > thereby accessing A and B out of program order without violating the
> > > requirements on the release or the acquire.
> > >
> > > Of course PPC doesn't allow this, but should we rule it out entirely?
> >
> > This would be allowed if LOAD-ACQUIRE was implemented using LDAPR on Arm.
> > I don't think we should be ruling out architectures using RCpc
> > acquire/release primitives, because doing so just feels like an artifact of
> > most architectures building these out of fences today.
> >
> > It's funny really, because from an Arm-perspective I don't plan to stray
> > outside of RCsc, but I feel like other weak architectures aren't being
> > well represented here. If we just care about x86, Arm and Power (and assume
> > that Power doesn't plan to implement RCpc acquire/release instructions)
> > then we're good to tighten things up. But I fear that RISC-V should probably
> > be more engaged (adding Daniel) and who knows about MIPS or these other
> > random architectures popping up on linux-arch.
>
> I don't object to having weak versions of acquire/release in the LKMM.
> Perhaps the stronger versions could be kept in the hardware model
> (which has not been published and is not in the kernel source), but
> even that might be a bad idea in view of what RISC-V is liable to do.
>
> > > > C MP+fencewmbonceonce+pooncerelease-rfireleaseacquire-poacquireonce
> > > >
> > > > {}
> > > >
> > > > P0(int *x, int *y)
> > > > {
> > > > WRITE_ONCE(*x, 1);
> > > > smp_wmb();
> > > > WRITE_ONCE(*y, 1);
> > > > }
> > > >
> > > > P1(int *x, int *y, int *z)
> > > > {
> > > > r0 = READ_ONCE(*y);
> > > > smp_store_release(z, 1);
> > > > r1 = smp_load_acquire(z);
> > > > r2 = READ_ONCE(*x);
> > > > }
> > > >
> > > > exists (1:r0=1 /\ 1:r1=1 /\ 1:r2=0)
> > > >
> > > >
> > > > AArch64 MP+dmb.st+popl-rfilq-poqp
> > > > "DMB.STdWW Rfe PodRWPL RfiLQ PodRRQP Fre"
> > > > Generator=diyone7 (version 7.49+02(dev))
> > > > Prefetch=0:x=F,0:y=W,1:y=F,1:x=T
> > > > Com=Rf Fr
> > > > Orig=DMB.STdWW Rfe PodRWPL RfiLQ PodRRQP Fre
> > > > {
> > > > 0:X1=x; 0:X3=y;
> > > > 1:X1=y; 1:X3=z; 1:X6=x;
> > > > }
> > > > P0 | P1 ;
> > > > MOV W0,#1 | LDR W0,[X1] ;
> > > > STR W0,[X1] | MOV W2,#1 ;
> > > > DMB ST | STLR W2,[X3] ;
> > > > MOV W2,#1 | LDAPR W4,[X3] ;
> > > > STR W2,[X3] | LDR W5,[X6] ;
> > > > exists
> > > > (1:X0=1 /\ 1:X4=1 /\ 1:X5=0)
> >
> > (you can also run this yourself, since 'Q' is supported in the .cat file
> > I contributed to herdtools7)
> >
> > Test MP+dmb.sy+popl-rfilq-poqp Allowed
> > States 4
> > 1:X0=0; 1:X4=1; 1:X5=0;
> > 1:X0=0; 1:X4=1; 1:X5=1;
> > 1:X0=1; 1:X4=1; 1:X5=0;
> > 1:X0=1; 1:X4=1; 1:X5=1;
> > Ok
> > Witnesses
> > Positive: 1 Negative: 3
> > Condition exists (1:X0=1 /\ 1:X4=1 /\ 1:X5=0)
> > Observation MP+dmb.sy+popl-rfilq-poqp Sometimes 1 3
> > Time MP+dmb.sy+popl-rfilq-poqp 0.01
> > Hash=61858b7b59a6310d869f99cd05718f96
> >
> > > There's also read-write ordering, in the form of the LB pattern:
> > >
> > > P0(int *x, int *y, int *z)
> > > {
> > > r0 = READ_ONCE(*x);
> > > smp_store_release(z, 1);
> > > r1 = smp_load_acquire(z);
> > > WRITE_ONCE(*y, 1);
> > > }
> > >
> > > P1(int *x, int *y)
> > > {
> > > r2 = READ_ONCE(*y);
> > > smp_mb();
> > > WRITE_ONCE(*x, 1);
> > > }
> > >
> > > exists (0:r0=1 /\ 1:r2=1)
> >
> > The access types are irrelevant to the acquire/release primitives, so yes
> > that's also allowed.
> >
> > > Would this be allowed if smp_load_acquire() was implemented with LDAPR?
> > > If the answer is yes then we will have to remove the rfi-rel-acq and
> > > rel-rf-acq-po relations from the memory model entirely.
> >
> > I don't understand what you mean by "rfi-rel-acq-po", and I assume you mean
> > rel-rfi-acq-po for the other? Sounds like I'm confused here.
>
> "rfi-rel-acq" is the relation which was removed by the first of my two
> patches (it is now back in business since Paul reverted the commits),
> and "rel-rf-acq-po" is the relation that was introduced to replace it.
>
> At any rate, it looks like instead of strengthening the relation, I
> should write a patch that removes it entirely. I also will add new,
> stronger relations for use with locking, essentially making spin_lock
> and spin_unlock be RCsc.
Only in the presence of smp_mb__after_unlock_lock() or
smp_mb__after_spinlock(), correct? Or am I confused about RCsc?
Thanx, Paul
On 7/5/2018 8:16 AM, Daniel Lustig wrote:
> On 7/5/2018 7:44 AM, Will Deacon wrote:
>> Andrea,
>>
>> On Thu, Jul 05, 2018 at 04:00:29PM +0200, Andrea Parri wrote:
>>> On Wed, Jul 04, 2018 at 01:11:04PM +0100, Will Deacon wrote:
>>>> On Tue, Jul 03, 2018 at 01:28:17PM -0400, Alan Stern wrote:
>>>>> There's also read-write ordering, in the form of the LB pattern:
>>>>>
>>>>> P0(int *x, int *y, int *z)
>>>>> {
>>>>> r0 = READ_ONCE(*x);
>>>>> smp_store_release(z, 1);
>>>>> r1 = smp_load_acquire(z);
>>>>> WRITE_ONCE(*y, 1);
>>>>> }
>>>>>
>>>>> P1(int *x, int *y)
>>>>> {
>>>>> r2 = READ_ONCE(*y);
>>>>> smp_mb();
>>>>> WRITE_ONCE(*x, 1);
>>>>> }
>>>>>
>>>>> exists (0:r0=1 /\ 1:r2=1)
>>>>
>>>> The access types are irrelevant to the acquire/release primitives, so yes
>>>> that's also allowed.
>>>>
>>>>> Would this be allowed if smp_load_acquire() was implemented with LDAPR?
>>>>> If the answer is yes then we will have to remove the rfi-rel-acq and
>>>>> rel-rf-acq-po relations from the memory model entirely.
>>>>
>>>> I don't understand what you mean by "rfi-rel-acq-po", and I assume you mean
>>>> rel-rfi-acq-po for the other? Sounds like I'm confused here.
>>>
>>> [Your reply about 1/2 suggests that you've figured this out now, IAC ...]
>>
>> Yeah, the naming threw me because it's not the same order as the composition
>> in the actual definition (why not?). I typoed an extra 'po' suffix. Well
>> done for spotting it.
>>
>>> "rfi-rel-acq" (as Alan wrote, and as I wrote before my question above...)
>>> is defined and currently used in linux-kernel.cat (and, FWIW, it has been
>>> so since when we upstreamed LKMM).
>>>
>>> My point is that, as exemplified by the two tests I reported above, this
>>> relation already prevents you from implementing your acquire with LDAPR;
>>> so my/our question was not "can you run herd7 for me?" but rather "do you
>>> think that changes are needed to the .cat file?".
>>
>> I mean, you can run herd on the armv8 memory model and see the answer to the
>> litmus test. So we have two options for the LKMM .cat file (the Arm one is
>> baked in silicon):
>>
>> 1. Say that architectures with RCpc acquire/release instructions must
>> instead use RCsc instructions (if they have them) or full fences
>>
>> 2. Change the LKMM .cat file to allow RCpc acquire/release (note that I'd
>> still like RCsc lock/unlock semantics, and I think that's actually the
>> real case that matters here, but the current difficulty in
>> distinguishing the two in the .cat file has led to this broader
>> ordering being enforced as a side-effect)
>>
>> I prefer (2), because I think it's a safe and sensible relaxation to make
>> and accommodates what I consider to be a likely extension to weakly ordered
>> architectures that we might want to support efficiently. So yes, I think
>> changes are needed to the LKMM .cat file, but please don't view that as a
>> criticism. We change stuff all the time as long as it doesn't break userspace.
>>
>>> This question goes back _at least_ to:
>>>
>>> http://lkml.kernel.org/r/[email protected]
>>>
>>> (see, in part., the "IMPORTANT" note at the bottom of the commit message)
>>> and that discussion later resulted in:
>>>
>>> 0123f4d76ca63b ("riscv/spinlock: Strengthen implementations with fences")
>>> 5ce6c1f3535fa8 ("riscv/atomic: Strengthen implementations with fences")
>>>
>>> (the latest _draft_ of the RISC-V specification, as pointed out by Daniel,
>>>
>>> https://github.com/riscv/riscv-isa-manual, Appendix A.5
>>>
>>> includes our "Linux mapping", although the currently-recommended mapping
>>> differs and involves a "fence.tso [+ any acquire, including RCpc .aq]").
>>
>> [I think the discussion at hand is broader than RISC-V, and I looped in
>> Daniel already]
>
> Sorry for the delay, it was Independence Day here in the US.
>
> I'm aligned with Will on all this so far. As mentioned above, this issue
> definitely comes up on RISC-V, but for reasons even beyond RCpc vs. RCsc.
> Why? On RISC-V, we have RCsc atomics, but no native load-acquire or
> store-release opcodes. You might expect you can use "ld; fence r,rw"
> and "fence rw,w; sd", respectively, but those don't interact with the
> RCsc .aq/.rl mappings. For example, suppose we did this:
>
> (a)
> amoswap.w.rl x0, x0, 0(a1) // st-rel using amoswap with dummy dest
> ...
> lw a0, 0(a1) // ld-acq using fence-based mapping
> fence r,rw
> (b)
>
> Nothing here orders (a) before (b) in general. A mapping always using
And here, I should have said that (a) and (b) are stores. But then
flip it around, so that the release uses fences and the acquire uses
RCsc, and then there's a similar lack of read->read ordering between
(a) and (b). That's what I really meant to say here. Depending on
which particular mapping you use in each case, you may or may not get
R->R or W->W ordering.
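Concretely, the flipped case would be something like this (again just a
sketch, with lr.w.aq standing in for an RCsc acquire):

	(a)                      // an earlier load
	fence rw,w               // st-rel using fence-based mapping
	sw t0, 0(a1)
	...
	lr.w.aq a0, (a1)         // ld-acq using an RCsc .aq load
	(b)                      // a later load

Nothing orders (a) before (b) here either: "fence rw,w" does not order
the earlier load against the lr.w.aq (a read is not in the fence's
successor set), and .aq only orders the lr before the accesses that
follow it.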
> fences would cover what's being asked for (I believe), and using only
> RCsc operations (like ARM does) would also work if we had native
> load-acquire and store-release operations, but this mixed mapping
> doesn't. That's why the spec currently recommends the stronger
> fence.tso (~= multi-copy atomic Power lwsync) instead.
>
> ...but it's very non-obvious. Is there really a hurry to rush all
> this in?
>
>>> My understanding is that your answer to this question is "Yes", but I am
>>> not sure about the "How/Which changes?"; of course, an answer to this question
>>> _in the form_ of PATCHes would be appreciated! (but please also consider
>>> that I'll be offline for most of the time until next Monday.)
>>
>> C'mon, I'm reviewing patches here. The onus shouldn't be on the reviewer to
>> come back with the correct version of the patch. I'm also not terribly
>> worried if LKMM says the wrong thing whilst we work out what the right
>> thing should be, but I /would/ be worried if we started adding full fences to
>> an architecture that has acquire/release instructions just because they're
>> RCpc. If it turns out that no other arch maintainers care, then fine,
>> because frankly this doesn't affect me, but so far it's basically been
>> silence and I'd really like some more input before we close the door on
>> them
>
> I care! Thanks for speaking up Will!
>
> Dan
>
>>
>> Will
>>
> > At any rate, it looks like instead of strengthening the relation, I
> > should write a patch that removes it entirely. I also will add new,
> > stronger relations for use with locking, essentially making spin_lock
> > and spin_unlock be RCsc.
>
> Only in the presence of smp_mb__after_unlock_lock() or
> smp_mb__after_spinlock(), correct? Or am I confused about RCsc?
There are at least two definitions of RCsc: one as documented in the header
comment for smp_mb__after_spinlock() or rather in the patch under review...,
one as processor architects used to intend it. ;-)
Andrea
> Thanx, Paul
>
On 7/5/2018 8:31 AM, Paul E. McKenney wrote:
> On Thu, Jul 05, 2018 at 10:21:36AM -0400, Alan Stern wrote:
>> At any rate, it looks like instead of strengthening the relation, I
>> should write a patch that removes it entirely. I also will add new,
>> stronger relations for use with locking, essentially making spin_lock
>> and spin_unlock be RCsc.
>
> Only in the presence of smp_mb__after_unlock_lock() or
> smp_mb__after_spinlock(), correct? Or am I confused about RCsc?
>
> Thanx, Paul
>
In terms of naming...is what you're asking for really RCsc? To me,
that would imply that even stores in the first critical section would
need to be ordered before loads in the second critical section.
Meaning that even x86 would need an mfence in either lock() or unlock()?
If you're only looking for R->R, R->W, and W->W ordering between
critical sections, is it better to find a new unique name for this?
"RCtso" possibly, or something to that effect?
Dan
On Thu, Jul 05, 2018 at 08:44:39AM -0700, Daniel Lustig wrote:
> On 7/5/2018 8:31 AM, Paul E. McKenney wrote:
> > On Thu, Jul 05, 2018 at 10:21:36AM -0400, Alan Stern wrote:
> >> At any rate, it looks like instead of strengthening the relation, I
> >> should write a patch that removes it entirely. I also will add new,
> >> stronger relations for use with locking, essentially making spin_lock
> >> and spin_unlock be RCsc.
> >
> > Only in the presence of smp_mb__after_unlock_lock() or
> > smp_mb__after_spinlock(), correct? Or am I confused about RCsc?
> >
> > Thanx, Paul
> >
>
> In terms of naming...is what you're asking for really RCsc? To me,
> that would imply that even stores in the first critical section would
> need to be ordered before loads in the second critical section.
> Meaning that even x86 would need an mfence in either lock() or unlock()?
I think a LOCK operation always implies an atomic RmW, which will give
full ordering guarantees on x86. I know there have been interesting issues
involving I/O accesses in the past, but I think that's still out of scope
for the memory model.
Peter will know.
Will
On Thu, Jul 05, 2018 at 05:22:26PM +0100, Will Deacon wrote:
> On Thu, Jul 05, 2018 at 08:44:39AM -0700, Daniel Lustig wrote:
> > On 7/5/2018 8:31 AM, Paul E. McKenney wrote:
> > > On Thu, Jul 05, 2018 at 10:21:36AM -0400, Alan Stern wrote:
> > >> At any rate, it looks like instead of strengthening the relation, I
> > >> should write a patch that removes it entirely. I also will add new,
> > >> stronger relations for use with locking, essentially making spin_lock
> > >> and spin_unlock be RCsc.
> > >
> > > Only in the presence of smp_mb__after_unlock_lock() or
> > > smp_mb__after_spinlock(), correct? Or am I confused about RCsc?
> > >
> > > Thanx, Paul
> > >
> >
> > In terms of naming...is what you're asking for really RCsc? To me,
> > that would imply that even stores in the first critical section would
> > need to be ordered before loads in the second critical section.
> > Meaning that even x86 would need an mfence in either lock() or unlock()?
>
> I think a LOCK operation always implies an atomic RmW, which will give
> full ordering guarantees on x86. I know there have been interesting issues
> involving I/O accesses in the past, but I think that's still out of scope
> for the memory model.
>
> Peter will know.
Agreed, x86 locked operations imply full fences, so x86 will order the
accesses in consecutive critical sections with respect to an observer
not holding the lock, even stores in earlier critical sections against
loads in later critical sections. We have been discussing tightening
LKMM to make an unlock-lock pair order everything except earlier stores
vs. later loads. (Of course, if everyone holds the lock, they will see
full ordering against both earlier and later critical sections.)
Or are you pushing for something stronger?
Thanx, Paul
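Paul's parenthetical can likewise be made concrete (a sketch; the test
name and setup are illustrative). When both observers hold the lock,
mutual exclusion alone forbids the weak outcome, because whichever
critical section runs second must see the other's store:

	C lock-both-hold

	{}

	P0(int *x, int *y, spinlock_t *s)
	{
		int r0;

		spin_lock(s);
		WRITE_ONCE(*x, 1);
		r0 = READ_ONCE(*y);
		spin_unlock(s);
	}

	P1(int *x, int *y, spinlock_t *s)
	{
		int r1;

		spin_lock(s);
		WRITE_ONCE(*y, 1);
		r1 = READ_ONCE(*x);
		spin_unlock(s);
	}

	exists (0:r0=0 /\ 1:r1=0)

The interesting cases are therefore the ones where some observer does
not hold the lock.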
On Thu, Jul 05, 2018 at 05:39:06PM +0200, Andrea Parri wrote:
> > > At any rate, it looks like instead of strengthening the relation, I
> > > should write a patch that removes it entirely. I also will add new,
> > > stronger relations for use with locking, essentially making spin_lock
> > > and spin_unlock be RCsc.
> >
> > Only in the presence of smp_mb__after_unlock_lock() or
> > smp_mb__after_spinlock(), correct? Or am I confused about RCsc?
>
> There are at least two definitions of RCsc: the one documented in the header
> comment for smp_mb__after_spinlock() (or rather, in the patch under review...),
> and the one processor architects used to intend. ;-)
Searching isn't working for me all that well this morning, so could you
please send me a pointer to that patch?
Thanx, Paul
On Thu, Jul 05, 2018 at 09:58:48AM -0700, Paul E. McKenney wrote:
> On Thu, Jul 05, 2018 at 05:39:06PM +0200, Andrea Parri wrote:
> > > > At any rate, it looks like instead of strengthening the relation, I
> > > > should write a patch that removes it entirely. I also will add new,
> > > > stronger relations for use with locking, essentially making spin_lock
> > > > and spin_unlock be RCsc.
> > >
> > > Only in the presence of smp_mb__after_unlock_lock() or
> > > smp_mb__after_spinlock(), correct? Or am I confused about RCsc?
> >
> > There are at least two definitions of RCsc: the one documented in the header
> > comment for smp_mb__after_spinlock() (or rather, in the patch under review...),
> > and the one processor architects used to intend. ;-)
>
> Searching isn't working for me all that well this morning, so could you
> please send me a pointer to that patch?
Sorry, I meant in _this patch_: "RCsc" as ordering everything except for
W -> R, without the barriers above (informally, the current LKMM misses
the W -> W order only).
Andrea
>
> Thanx, Paul
>
On 7/5/2018 9:56 AM, Paul E. McKenney wrote:
> On Thu, Jul 05, 2018 at 05:22:26PM +0100, Will Deacon wrote:
>> On Thu, Jul 05, 2018 at 08:44:39AM -0700, Daniel Lustig wrote:
>>> On 7/5/2018 8:31 AM, Paul E. McKenney wrote:
>>>> On Thu, Jul 05, 2018 at 10:21:36AM -0400, Alan Stern wrote:
>>>>> At any rate, it looks like instead of strengthening the relation, I
>>>>> should write a patch that removes it entirely. I also will add new,
>>>>> stronger relations for use with locking, essentially making spin_lock
>>>>> and spin_unlock be RCsc.
>>>>
>>>> Only in the presence of smp_mb__after_unlock_lock() or
>>>> smp_mb__after_spinlock(), correct? Or am I confused about RCsc?
>>>>
>>>> Thanx, Paul
>>>>
>>>
>>> In terms of naming...is what you're asking for really RCsc? To me,
>>> that would imply that even stores in the first critical section would
>>> need to be ordered before loads in the second critical section.
>>> Meaning that even x86 would need an mfence in either lock() or unlock()?
>>
>> I think a LOCK operation always implies an atomic RmW, which will give
>> full ordering guarantees on x86. I know there have been interesting issues
>> involving I/O accesses in the past, but I think that's still out of scope
>> for the memory model.
Yes, you're right about atomic RMWs on x86, and I'm not worried about I/O
here either. But see below.
>>
>> Peter will know.
>
> Agreed, x86 locked operations imply full fences, so x86 will order the
> accesses in consecutive critical sections with respect to an observer
> not holding the lock, even stores in earlier critical sections against
> loads in later critical sections. We have been discussing tightening
> LKMM to make an unlock-lock pair order everything except earlier stores
> vs. later loads. (Of course, if everyone holds the lock, they will see
> full ordering against both earlier and later critical sections.)
>
> Or are you pushing for something stronger?
>
> Thanx, Paul
>
No, I'm definitely not pushing for anything stronger. I'm still just
wondering if the name "RCsc" is right for what you described. For
example, Andrea just said this in a parallel email:
> "RCsc" as ordering everything except for W -> R, without the [extra]
> barriers
If it's "RCsc with exceptions", doesn't it make sense to find a
different name, rather than simply overloading the term "RCsc" with
a subtly different meaning, and hoping nobody gets confused?
I suppose on x86 and ARM you'd happen to get "true RCsc" anyway, just
due to the way things are currently mapped: LOCKed RMWs and "true RCsc"
instructions, respectively. But on Power and RISC-V, it would really
be more "RCsc with a W->R exception", right?
In fact, the more I think about it, this doesn't seem to be RCsc at all.
It seems closer to "RCpc plus extra PC ordering between critical
sections". No?
The synchronization accesses themselves aren't sequentially consistent
with respect to each other under the Power or RISC-V mappings, unless
there's a hwsync in there somewhere that I missed? Or a rule
preventing stw from forwarding to lwarx? Or some other higher-order
effect preventing it from being observed anyway?
So that's all I'm suggesting here. If you all buy that, maybe "RCpccs"
for "RCpc with processor consistent critical section ordering"?
I don't have a strong opinion on the name itself; I just want to find
a name that's less ambiguous or overloaded.
Dan
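For reference, the Power mapping behind Dan's question looks roughly
like this (an informal sketch of the mapping as discussed here, not
checked against arch/powerpc):

	spin_unlock:	lwsync ; store to the lock word
	spin_lock:	lwarx/stwcx. retry loop ; isync

Neither side contains hwsync, and nothing obviously prevents the
unlock's store from forwarding to a later lwarx on the same CPU, which
is what makes the lock accesses themselves look RCpc rather than RCsc.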
> No, I'm definitely not pushing for anything stronger. I'm still just
> wondering if the name "RCsc" is right for what you described. For
> example, Andrea just said this in a parallel email:
>
> > "RCsc" as ordering everything except for W -> R, without the [extra]
> > barriers
And I already regret it: the point is, different communities/people have
different things in mind when they use terms such as "RCsc" or "ordering",
and different communities seem to be represented in LKMM.
Really, I don't think that this is simply a matter of naming (personally,
I'd be OK with "foo" or whatever you suggested below! ;-)). My suggestion
would be: "get in there!!" ;-) Please let's refrain from using terms such
as these (_overly_ overloaded) "RCsc" and "order" when talking about MCM;
let's rather talk, say, about "ppo", "cumul-fence" ...
Andrea
>
> If it's "RCsc with exceptions", doesn't it make sense to find a
> different name, rather than simply overloading the term "RCsc" with
> a subtly different meaning, and hoping nobody gets confused?
>
> I suppose on x86 and ARM you'd happen to get "true RCsc" anyway, just
> due to the way things are currently mapped: LOCKed RMWs and "true RCsc"
> instructions, respectively. But on Power and RISC-V, it would really
> be more "RCsc with a W->R exception", right?
>
> In fact, the more I think about it, this doesn't seem to be RCsc at all.
> It seems closer to "RCpc plus extra PC ordering between critical
> sections". No?
>
> The synchronization accesses themselves aren't sequentially consistent
> with respect to each other under the Power or RISC-V mappings, unless
> there's a hwsync in there somewhere that I missed? Or a rule
> preventing stw from forwarding to lwarx? Or some other higher-order
> effect preventing it from being observed anyway?
>
> So that's all I'm suggesting here. If you all buy that, maybe "RCpccs"
> for "RCpc with processor consistent critical section ordering"?
> I don't have a strong opinion on the name itself; I just want to find
> a name that's less ambiguous or overloaded.
>
> Dan
On Thu, Jul 05, 2018 at 08:38:36PM +0200, Andrea Parri wrote:
> > No, I'm definitely not pushing for anything stronger. I'm still just
> > wondering if the name "RCsc" is right for what you described. For
> > example, Andrea just said this in a parallel email:
> >
> > > "RCsc" as ordering everything except for W -> R, without the [extra]
> > > barriers
>
> And I already regret it: the point is, different communities/people have
> different things in mind when they use terms such as "RCsc" or "ordering",
> and different communities seem to be represented in LKMM.
>
> Really, I don't think that this is simply a matter of naming (personally,
> I'd be OK with "foo" or whatever you suggested below! ;-)). My suggestion
> would be: "get in there!!" ;-) Please let's refrain from using terms such
> as these (_overly_ overloaded) "RCsc" and "order" when talking about MCM;
> let's rather talk, say, about "ppo", "cumul-fence" ...
... or bare litmus tests!
Andrea
>
> Andrea
>
>
> >
> > If it's "RCsc with exceptions", doesn't it make sense to find a
> > different name, rather than simply overloading the term "RCsc" with
> > a subtly different meaning, and hoping nobody gets confused?
> >
> > I suppose on x86 and ARM you'd happen to get "true RCsc" anyway, just
> > due to the way things are currently mapped: LOCKed RMWs and "true RCsc"
> > instructions, respectively. But on Power and RISC-V, it would really
> > be more "RCsc with a W->R exception", right?
> >
> > In fact, the more I think about it, this doesn't seem to be RCsc at all.
> > It seems closer to "RCpc plus extra PC ordering between critical
> > sections". No?
> >
> > The synchronization accesses themselves aren't sequentially consistent
> > with respect to each other under the Power or RISC-V mappings, unless
> > there's a hwsync in there somewhere that I missed? Or a rule
> > preventing stw from forwarding to lwarx? Or some other higher-order
> > effect preventing it from being observed anyway?
> >
> > So that's all I'm suggesting here. If you all buy that, maybe "RCpccs"
> > for "RCpc with processor consistent critical section ordering"?
> > I don't have a strong opinion on the name itself; I just want to find
> > a name that's less ambiguous or overloaded.
> >
> > Dan
On Thu, Jul 05, 2018 at 08:38:36PM +0200, Andrea Parri wrote:
> > No, I'm definitely not pushing for anything stronger. I'm still just
> > wondering if the name "RCsc" is right for what you described. For
> > example, Andrea just said this in a parallel email:
> >
> > > "RCsc" as ordering everything except for W -> R, without the [extra]
> > > barriers
>
> And I already regret it: the point is, different communities/people have
> different things in mind when they use terms such as "RCsc" or "ordering",
> and different communities seem to be represented in LKMM.
How about "RCsb" for release-consistency store buffering? (Sorry,
couldn't resist...)
> Really, I don't think that this is simply a matter of naming (personally,
> I'd be OK with "foo" or whatever you suggested below! ;-)). My suggestion
> would be: "get in there!!" ;-) Please let's refrain from using terms such
> as these (_overly_ overloaded) "RCsc" and "order" when talking about MCM;
> let's rather talk, say, about "ppo", "cumul-fence" ...
What is in a name? ;-)
Thanx, Paul
> Andrea
>
>
> >
> > If it's "RCsc with exceptions", doesn't it make sense to find a
> > different name, rather than simply overloading the term "RCsc" with
> > a subtly different meaning, and hoping nobody gets confused?
> >
> > I suppose on x86 and ARM you'd happen to get "true RCsc" anyway, just
> > due to the way things are currently mapped: LOCKed RMWs and "true RCsc"
> > instructions, respectively. But on Power and RISC-V, it would really
> > be more "RCsc with a W->R exception", right?
> >
> > In fact, the more I think about it, this doesn't seem to be RCsc at all.
> > It seems closer to "RCpc plus extra PC ordering between critical
> > sections". No?
> >
> > The synchronization accesses themselves aren't sequentially consistent
> > with respect to each other under the Power or RISC-V mappings, unless
> > there's a hwsync in there somewhere that I missed? Or a rule
> > preventing stw from forwarding to lwarx? Or some other higher-order
> > effect preventing it from being observed anyway?
> >
> > So that's all I'm suggesting here. If you all buy that, maybe "RCpccs"
> > for "RCpc with processor consistent critical section ordering"?
> > I don't have a strong opinion on the name itself; I just want to find
> > a name that's less ambiguous or overloaded.
> >
> > Dan
>
On Thu, Jul 05, 2018 at 08:44:10PM +0200, Andrea Parri wrote:
> On Thu, Jul 05, 2018 at 08:38:36PM +0200, Andrea Parri wrote:
> > > No, I'm definitely not pushing for anything stronger. I'm still just
> > > wondering if the name "RCsc" is right for what you described. For
> > > example, Andrea just said this in a parallel email:
> > >
> > > > "RCsc" as ordering everything except for W -> R, without the [extra]
> > > > barriers
> >
> > And I already regret it: the point is, different communities/people have
> > different things in mind when they use terms such as "RCsc" or "ordering",
> > and different communities seem to be represented in LKMM.
> >
> > Really, I don't think that this is simply a matter of naming (personally,
> > I'd be OK with "foo" or whatever you suggested below! ;-)). My suggestion
> > would be: "get in there!!" ;-) Please let's refrain from using terms such
> > as these (_overly_ overloaded) "RCsc" and "order" when talking about MCM;
> > let's rather talk, say, about "ppo", "cumul-fence" ...
>
> ... or bare litmus tests!
Validation of changes to the memory model is going to continue to be
an interesting topic, which will probably involve its share of litmus
tests.
Thanx, Paul
> Andrea
>
>
> >
> > Andrea
> >
> >
> > >
> > > If it's "RCsc with exceptions", doesn't it make sense to find a
> > > different name, rather than simply overloading the term "RCsc" with
> > > a subtly different meaning, and hoping nobody gets confused?
> > >
> > > I suppose on x86 and ARM you'd happen to get "true RCsc" anyway, just
> > > due to the way things are currently mapped: LOCKed RMWs and "true RCsc"
> > > instructions, respectively. But on Power and RISC-V, it would really
> > > be more "RCsc with a W->R exception", right?
> > >
> > > In fact, the more I think about it, this doesn't seem to be RCsc at all.
> > > It seems closer to "RCpc plus extra PC ordering between critical
> > > sections". No?
> > >
> > > The synchronization accesses themselves aren't sequentially consistent
> > > with respect to each other under the Power or RISC-V mappings, unless
> > > there's a hwsync in there somewhere that I missed? Or a rule
> > > preventing stw from forwarding to lwarx? Or some other higher-order
> > > effect preventing it from being observed anyway?
> > >
> > > So that's all I'm suggesting here. If you all buy that, maybe "RCpccs"
> > > for "RCpc with processor consistent critical section ordering"?
> > > I don't have a strong opinion on the name itself; I just want to find
> > > a name that's less ambiguous or overloaded.
> > >
> > > Dan
>
On Thu, Jul 05, 2018 at 09:56:02AM -0700, Paul E. McKenney wrote:
> On Thu, Jul 05, 2018 at 05:22:26PM +0100, Will Deacon wrote:
> > On Thu, Jul 05, 2018 at 08:44:39AM -0700, Daniel Lustig wrote:
> > > On 7/5/2018 8:31 AM, Paul E. McKenney wrote:
> > > > On Thu, Jul 05, 2018 at 10:21:36AM -0400, Alan Stern wrote:
> > > >> At any rate, it looks like instead of strengthening the relation, I
> > > >> should write a patch that removes it entirely. I also will add new,
> > > >> stronger relations for use with locking, essentially making spin_lock
> > > >> and spin_unlock be RCsc.
> > > >
> > > > Only in the presence of smp_mb__after_unlock_lock() or
> > > > smp_mb__after_spinlock(), correct? Or am I confused about RCsc?
> > > >
> > > > Thanx, Paul
> > > >
> > >
> > > In terms of naming...is what you're asking for really RCsc? To me,
> > > that would imply that even stores in the first critical section would
> > > need to be ordered before loads in the second critical section.
> > > Meaning that even x86 would need an mfence in either lock() or unlock()?
> >
> > I think a LOCK operation always implies an atomic RmW, which will give
> > full ordering guarantees on x86. I know there have been interesting issues
> > involving I/O accesses in the past, but I think that's still out of scope
> > for the memory model.
> >
> > Peter will know.
>
> Agreed, x86 locked operations imply full fences, so x86 will order the
> accesses in consecutive critical sections with respect to an observer
> not holding the lock, even stores in earlier critical sections against
> loads in later critical sections. We have been discussing tightening
> LKMM to make an unlock-lock pair order everything except earlier stores
> vs. later loads. (Of course, if everyone holds the lock, they will see
> full ordering against both earlier and later critical sections.)
>
> Or are you pushing for something stronger?
I (and I think Peter) would like something stronger, but we can't have
nice things ;)
Anyhow, that's not really related to this patch series, so sorry for
mis-speaking and thanks to everybody who piled on with corrections! I got
a bit arm-centric for a moment. I think Alan got the gist of it, so I'll
wait to see what he posts.
Will
On Fri, Jul 06, 2018 at 10:25:29AM +0100, Will Deacon wrote:
> On Thu, Jul 05, 2018 at 09:56:02AM -0700, Paul E. McKenney wrote:
> > On Thu, Jul 05, 2018 at 05:22:26PM +0100, Will Deacon wrote:
> > > On Thu, Jul 05, 2018 at 08:44:39AM -0700, Daniel Lustig wrote:
> > > > On 7/5/2018 8:31 AM, Paul E. McKenney wrote:
> > > > > On Thu, Jul 05, 2018 at 10:21:36AM -0400, Alan Stern wrote:
> > > > >> At any rate, it looks like instead of strengthening the relation, I
> > > > >> should write a patch that removes it entirely. I also will add new,
> > > > >> stronger relations for use with locking, essentially making spin_lock
> > > > >> and spin_unlock be RCsc.
> > > > >
> > > > > Only in the presence of smp_mb__after_unlock_lock() or
> > > > > smp_mb__after_spinlock(), correct? Or am I confused about RCsc?
> > > > >
> > > > > Thanx, Paul
> > > > >
> > > >
> > > > In terms of naming...is what you're asking for really RCsc? To me,
> > > > that would imply that even stores in the first critical section would
> > > > need to be ordered before loads in the second critical section.
> > > > Meaning that even x86 would need an mfence in either lock() or unlock()?
> > >
> > > I think a LOCK operation always implies an atomic RmW, which will give
> > > full ordering guarantees on x86. I know there have been interesting issues
> > > involving I/O accesses in the past, but I think that's still out of scope
> > > for the memory model.
> > >
> > > Peter will know.
> >
> > Agreed, x86 locked operations imply full fences, so x86 will order the
> > accesses in consecutive critical sections with respect to an observer
> > not holding the lock, even stores in earlier critical sections against
> > loads in later critical sections. We have been discussing tightening
> > LKMM to make an unlock-lock pair order everything except earlier stores
> > vs. later loads. (Of course, if everyone holds the lock, they will see
> > full ordering against both earlier and later critical sections.)
> >
> > Or are you pushing for something stronger?
>
> I (and I think Peter) would like something stronger, but we can't have
> nice things ;)
There is a lot of that going around! ;-)
> Anyhow, that's not really related to this patch series, so sorry for
> mis-speaking and thanks to everybody who piled on with corrections! I got
> a bit arm-centric for a moment. I think Alan got the gist of it, so I'll
> wait to see what he posts.
Sounds good!
Thanx, Paul
On Thu, 5 Jul 2018, Andrea Parri wrote:
> > At any rate, it looks like instead of strengthening the relation, I
> > should write a patch that removes it entirely. I also will add new,
> > stronger relations for use with locking, essentially making spin_lock
> > and spin_unlock be RCsc.
>
> Thank you.
>
> Ah let me put this forward: please keep an eye on the (generic)
>
> queued_spin_lock()
> queued_spin_unlock()
>
> (just to point out an example). Their implementation (in particular,
> the fast-path) suggests that if we stick to RCsc locks then we should
> also stick to RCsc acquire loads from RMWs and release stores.
A very good point. The implementation of those routines uses
atomic_cmpxchg_acquire() to acquire the lock. Unless this is
implemented with an operation or fence that provides write-write
ordering (in conjunction with a suitable release), qspinlocks won't
have the ordering properties that we want.
I'm going to assume that the release operations used for unlocking
don't need to have any extra properties; only the lock-acquire
operations need to be special (i.e., stronger than a normal
smp_load_acquire). This suggests that atomic RMW functions with acquire
semantics should also use this stronger form of acquire.
Does anybody have a different suggestion?
Alan
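For reference, the generic fastpaths in question look roughly like
this (a simplified sketch; the real code lives in
include/asm-generic/qspinlock.h and hands contention off to
queued_spin_lock_slowpath()):

	static __always_inline void queued_spin_lock(struct qspinlock *lock)
	{
		u32 val;

		/* The lock is taken with an ordinary acquire cmpxchg, so
		 * any extra ordering demanded of spin_lock() must come
		 * from this acquire or from the paired release below. */
		val = atomic_cmpxchg_acquire(&lock->val, 0, _Q_LOCKED_VAL);
		if (likely(val == 0))
			return;
		queued_spin_lock_slowpath(lock, val);
	}

	static __always_inline void queued_spin_unlock(struct qspinlock *lock)
	{
		/* Plain release store clearing the locked byte. */
		smp_store_release(&lock->locked, 0);
	}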
On Fri, Jul 06, 2018 at 04:37:21PM -0400, Alan Stern wrote:
> On Thu, 5 Jul 2018, Andrea Parri wrote:
>
> > > At any rate, it looks like instead of strengthening the relation, I
> > > should write a patch that removes it entirely. I also will add new,
> > > stronger relations for use with locking, essentially making spin_lock
> > > and spin_unlock be RCsc.
> >
> > Thank you.
> >
> > Ah let me put this forward: please keep an eye on the (generic)
> >
> > queued_spin_lock()
> > queued_spin_unlock()
> >
> > (just to point out an example). Their implementation (in particular,
> > the fast-path) suggests that if we stick to RCsc locks then we should
> > also stick to RCsc acquire loads from RMWs and release stores.
>
> A very good point. The implementation of those routines uses
> atomic_cmpxchg_acquire() to acquire the lock. Unless this is
> implemented with an operation or fence that provides write-write
> ordering (in conjunction with a suitable release), qspinlocks won't
> have the ordering properties that we want.
>
> I'm going to assume that the release operations used for unlocking
> don't need to have any extra properties; only the lock-acquire
> operations need to be special (i.e., stronger than a normal
> smp_load_acquire). This suggests that atomic RMW functions with acquire
> semantics should also use this stronger form of acquire.
>
> Does anybody have a different suggestion?
The approach you suggest makes sense to me. Will, Peter, Daniel, any
reasons why this approach would be a problem for you guys?
Thanx, Paul
On Fri, Jul 06, 2018 at 02:10:55PM -0700, Paul E. McKenney wrote:
> On Fri, Jul 06, 2018 at 04:37:21PM -0400, Alan Stern wrote:
> > On Thu, 5 Jul 2018, Andrea Parri wrote:
> >
> > > > At any rate, it looks like instead of strengthening the relation, I
> > > > should write a patch that removes it entirely. I also will add new,
> > > > stronger relations for use with locking, essentially making spin_lock
> > > > and spin_unlock be RCsc.
> > >
> > > Thank you.
> > >
> > > Ah let me put this forward: please keep an eye on the (generic)
> > >
> > > queued_spin_lock()
> > > queued_spin_unlock()
> > >
> > > (just to point out an example). Their implementation (in particular,
> > > the fast-path) suggests that if we stick to RCsc locks then we should
> > > also stick to RCsc acquire loads from RMWs and release stores.
> >
> > A very good point. The implementation of those routines uses
> > atomic_cmpxchg_acquire() to acquire the lock. Unless this is
> > implemented with an operation or fence that provides write-write
> > ordering (in conjunction with a suitable release), qspinlocks won't
> > have the ordering properties that we want.
> >
> > I'm going to assume that the release operations used for unlocking
> > don't need to have any extra properties; only the lock-acquire
> > operations need to be special (i.e., stronger than a normal
> > smp_load_acquire). This suggests that atomic RMW functions with acquire
> > semantics should also use this stronger form of acquire.
> >
> > Does anybody have a different suggestion?
>
> The approach you suggest makes sense to me. Will, Peter, Daniel, any
> reasons why this approach would be a problem for you guys?
qspinlock is very much opt-in per arch, so we can simply require that
an architecture must have RCsc RmW atomics if it wants to use qspinlock.
Should an architecture arise where that isn't the case, then we could
consider an arch hook in the qspinlock code, but I don't think we have
to solve that yet.
Will
On 7/9/2018 9:52 AM, Will Deacon wrote:
> On Fri, Jul 06, 2018 at 02:10:55PM -0700, Paul E. McKenney wrote:
>> On Fri, Jul 06, 2018 at 04:37:21PM -0400, Alan Stern wrote:
>>> On Thu, 5 Jul 2018, Andrea Parri wrote:
>>>
>>>>> At any rate, it looks like instead of strengthening the relation, I
>>>>> should write a patch that removes it entirely. I also will add new,
>>>>> stronger relations for use with locking, essentially making spin_lock
>>>>> and spin_unlock be RCsc.
>>>>
>>>> Thank you.
>>>>
>>>> Ah let me put this forward: please keep an eye on the (generic)
>>>>
>>>> queued_spin_lock()
>>>> queued_spin_unlock()
>>>>
>>>> (just to point out an example). Their implementation (in particular,
>>>> the fast-path) suggests that if we stick to RCsc locks then we should
>>>> also stick to RCsc acquire loads from RMWs and release stores.
Just to be clear, this is "RCsc with W->R exception" again, right?
>>> A very good point. The implementation of those routines uses
>>> atomic_cmpxchg_acquire() to acquire the lock. Unless this is
>>> implemented with an operation or fence that provides write-write
>>> ordering (in conjunction with a suitable release), qspinlocks won't
>>> have the ordering properties that we want.
>>>
>>> I'm going to assume that the release operations used for unlocking
>>> don't need to have any extra properties; only the lock-acquire
>>> operations need to be special (i.e., stronger than a normal
>>> smp_load_acquire). This suggests that atomic RMW functions with acquire
>>> semantics should also use this stronger form of acquire.
It's not clear to me that the burden of enforcing "RCsc with the W->R
exception" should always be placed only on the acquire half.
RISC-V currently places some of the burden on the release half, as
we discussed last week. Specifically, there are a few cases where
fence.tso is used instead of fence rw,w on the release side.
If we always use fence.tso here, following the current recommendation,
we'll still be fine. If LKMM introduces an RCpc vs. RCsc distinction
of some kind, though, I think we would want to distinguish the two
types of release accordingly as well.
>>> Does anybody have a different suggestion?
>>
>> The approach you suggest makes sense to me. Will, Peter, Daniel, any
>> reasons why this approach would be a problem for you guys?
>
> qspinlock is very much opt-in per arch, so we can simply require that
> an architecture must have RCsc RmW atomics if it wants to use qspinlock.
> Should an architecture arise where that isn't the case, then we could
> consider an arch hook in the qspinlock code, but I don't think we have
> to solve that yet.
>
> Will
This sounds reasonable to me.
Dan
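As a rough illustration of the split Daniel describes (toy functions
with made-up names, not the actual riscv arch code): a plain release
before the unlocking store would be "fence rw,w", while the
recommendation he mentions uses fence.tso, which also keeps earlier
loads ordered before later loads, so the unlock itself carries part of
the burden.

	/* Toy sketches, illustration only. */
	static inline void toy_unlock_plain_release(int *lock)
	{
		/* Prior loads and stores ordered before the store below. */
		__asm__ __volatile__("fence rw, w" : : : "memory");
		WRITE_ONCE(*lock, 0);
	}

	static inline void toy_unlock_tso_release(int *lock)
	{
		/* fence.tso: prior loads ordered before all later accesses,
		 * prior stores before later stores; prior stores may still
		 * pass later loads, which is exactly the W->R exception. */
		__asm__ __volatile__("fence.tso" : : : "memory");
		WRITE_ONCE(*lock, 0);
	}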
On Mon, 9 Jul 2018, Daniel Lustig wrote:
> On 7/9/2018 9:52 AM, Will Deacon wrote:
> > On Fri, Jul 06, 2018 at 02:10:55PM -0700, Paul E. McKenney wrote:
> >> On Fri, Jul 06, 2018 at 04:37:21PM -0400, Alan Stern wrote:
> >>> On Thu, 5 Jul 2018, Andrea Parri wrote:
> >>>
> >>>>> At any rate, it looks like instead of strengthening the relation, I
> >>>>> should write a patch that removes it entirely. I also will add new,
> >>>>> stronger relations for use with locking, essentially making spin_lock
> >>>>> and spin_unlock be RCsc.
> >>>>
> >>>> Thank you.
> >>>>
> >>>> Ah let me put this forward: please keep an eye on the (generic)
> >>>>
> >>>> queued_spin_lock()
> >>>> queued_spin_unlock()
> >>>>
> >>>> (just to point out an example). Their implementation (in part.,
> >>>> the fast-path) suggests that if we will stick to RCsc lock then
> >>>> we should also stick to RCsc acq. load from RMW and rel. store.
>
> Just to be clear, this is "RCsc with W->R exception" again, right?
Yes.
I don't think any of these suggested names are really appropriate.
(For instance, if I understood the original paper correctly, the "sc"
in "RCsc" refers to the ordering properties of the acquire and release
accesses themselves, not the accesses they protect.) I'm going to
avoid using them.
> >>> A very good point. The implementation of those routines uses
> >>> atomic_cmpxchg_acquire() to acquire the lock. Unless this is
> >>> implemented with an operation or fence that provides write-write
> >>> ordering (in conjunction with a suitable release), qspinlocks won't
> >>> have the ordering properties that we want.
> >>>
> >>> I'm going to assume that the release operations used for unlocking
> >>> don't need to have any extra properties; only the lock-acquire
> >>> operations need to be special (i.e., stronger than a normal
> >>> smp_load_acquire). This suggests that atomic RMW functions with acquire
> >>> semantics should also use this stronger form of acquire.
>
> It's not clear to me that the burden of enforcing "RCsc with the W->R
> exception" should always be placed only on the acquire half.
> RISC-V currently places some of the burden on the release half, as
> we discussed last week. Specifically, there are a few cases where
> fence.tso is used instead of fence rw,w on the release side.
>
> If we always use fence.tso here, following the current recommendation,
> we'll still be fine. If LKMM introduces an RCpc vs. RCsc distinction
> of some kind, though, I think we would want to distinguish the two
> types of release accordingly as well.
I wasn't talking about the burden in the implementation, just the
modification to the memory model. In practice it shouldn't matter
because real code should never intermix the two kinds of fences. That
is, nobody will call smp_store_release() or cmpxchg_acquire()
with an argument of type spinlock_t *, and nobody will call
spin_unlock() with an argument that isn't of type spinlock_t *.
> >>> Does anybody have a different suggestion?
> >>
> >> The approach you suggest makes sense to me. Will, Peter, Daniel, any
> >> reasons why this approach would be a problem for you guys?
> >
> > qspinlock is very much opt-in per arch, so we can simply require that
> > an architecture must have RCsc RmW atomics if it wants to use qspinlock.
> > Should an architecture arise where that isn't the case, then we could
> > consider an arch hook in the qspinlock code, but I don't think we have
> > to solve that yet.
> >
> > Will
>
> This sounds reasonable to me.
Okay, I'll arrange the patch so that the new requirements apply only to
lock and unlock accesses, not to RMW accesses in general.
Alan
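In the .cat notation of the model, such a lock-only strengthening
might take a shape along the following lines (purely a hypothetical
sketch of the direction, using the model's lock events UL and LKR; the
actual patch may well differ):

	(* Hypothetical sketch: let an unlock, the lock operation that
	   reads from it, and everything po-after that lock contribute
	   cumulatively, as rel-rf-acq-po did, but only for lock events. *)
	let unlock-rf-lock-po = [UL] ; rf ; [LKR] ; po
	let cumul-fence = A-cumul(strong-fence | po-rel) | wmb |
			  unlock-rf-lock-po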