The LKMM defines certain memory ordering constraints for spin_lock(),
spin_unlock() and other primitives that the kernel developer can rely
on; unfortunately, some of these constraints are not currently met by
the RISC-V implementations of spin_lock() and spin_unlock().
The following MP-like program exemplifies the issue: according to our
LKMM, program "unlock-lock-read-ordering" below can never reach state
(1:r0=1 /\ 1:r1=0). However, when we map this C program to the RISCV
program "RISCV-unlock-lock-read-ordering" below following the current
implementation, the corresponding state is reachable according to the
RISC-V specification and its formalizations [2].
C unlock-lock-read-ordering
{}
/* s initially owned by P1 */
P0(int *x, int *y)
{
WRITE_ONCE(*x, 1);
smp_wmb();
WRITE_ONCE(*y, 1);
}
P1(int *x, int *y, spinlock_t *s)
{
int r0;
int r1;
r0 = READ_ONCE(*y);
spin_unlock(s);
spin_lock(s);
r1 = READ_ONCE(*y);
}
exists (1:r0=1 /\ 1:r1=0)
RISCV RISCV-unlock-lock-read-ordering
{
0:x2=x; 0:x4=y;
1:x2=y; 1:x4=x; 1:x6=s;
s=1;
}
P0 | P1 ;
ori x1,x0,1 | lw x1,0(x2) ;
sw x1,0(x2) | amoswap.w.rl x0,x0,(x6) ;
fence w,w | ori x5,x0,1 ;
ori x3,x0,1 | amoswap.w.aq x0,x5,(x6) ;
sw x3,0(x4) | lw x3,0(x4) ;
exists
(1:x1=1 /\ 1:x3=0)
The issue can in fact be exacerbated if, as envisaged/discussed in [3],
the LKMM is modified to become even more "demanding" on the ordering
constraints associated with the locking primitives. For example, the
state (1:r0=1 /\ 1:r1=0) in program "unlock-lock-write-ordering" below
is currently allowed by LKMM (as is the corresponding state in "RISCV-
unlock-lock-write-ordering" below). However, proposals modifying LKMM
to _forbid_ that state have already appeared on LKML [4].
C unlock-lock-write-ordering
{}
/* s initially owned by P0 */
P0(int *x, int *y, spinlock_t *s)
{
WRITE_ONCE(*x, 1);
spin_unlock(s);
spin_lock(s);
WRITE_ONCE(*y, 1);
}
P1(int *x, int *y)
{
int r0;
int r1;
r0 = READ_ONCE(*y);
smp_rmb();
r1 = READ_ONCE(*y);
}
exists (1:r0=1 /\ 1:r1=0)
RISCV RISCV-unlock-lock-write-ordering
{
0:x2=x; 0:x4=y; 0:x6=s;
1:x2=y; 1:x4=x;
s=1;
}
P0 | P1 ;
ori x1,x0,1 | lw x1,0(x2) ;
sw x1,0(x2) | fence r,r ;
amoswap.w.rl x0,x0,(x6) | lw x3,0(x4) ;
ori x5,x0,1 | ;
amoswap.w.aq x0,x5,(x6) | ;
ori x3,x0,1 | ;
sw x3,0(x4) | ;
exists
(1:x1=1 /\ 1:x3=0)
[Curiously, RISC-V's current implementations of smp_load_acquire() and
smp_store_release() provide much stronger ordering than is currently
required by the LKMM, since they rely on the generic implementation
(cf. also [5]).]
This RFC fixes the issue by strengthening RISC-V's implementations of
spin_lock() and spin_unlock(), based on "A spinlock with fences" from
Section 2.3.5 ("Acquire/Release Ordering") of the RISC-V draft spec.
It does _not_ attempt to fix the read locks and the atomics (for which,
AFAICT, similar considerations would hold).
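For concreteness, here is a minimal C sketch of the resulting fence-based
lock/unlock pair, mirroring the fence placement used in the patch below
(the function names are made up for illustration only; RISCV_FENCE(p, s)
is assumed to expand to a "fence p, s" instruction):

static inline void sketch_spin_lock(arch_spinlock_t *lock)
{
	int busy;

	do {
		__asm__ __volatile__ (
			"amoswap.w %0, %2, %1"	/* old lock value -> busy */
			: "=r" (busy), "+A" (lock->lock)
			: "r" (1)
			: "memory");
	} while (busy);		/* spin until the old value was 0 (lock was free) */
	RISCV_FENCE(r, rw);	/* "acquire": lock read ordered before the critical section */
}

static inline void sketch_spin_unlock(arch_spinlock_t *lock)
{
	RISCV_FENCE(rw, w);	/* "release": critical section ordered before the store below */
	__asm__ __volatile__ (
		"amoswap.w x0, x0, %0"	/* lock = 0 */
		: "=A" (lock->lock)
		:: "memory");
}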
IMPORTANT. This patch is _NOT_ intended to be applied as is. Rather,
it is intended to test the waters, the implicit questions being "Should
we take this direction?" and "Are changes to the LKMM needed?", and to
start a technical discussion on the above issues.
[1] https://marc.info/?l=linux-kernel&m=151633436614259&w=2
[2] https://groups.google.com/a/groups.riscv.org/forum/#!topic/isa-dev/hKywNHBkAXM
[3] https://marc.info/?l=linux-kernel&m=151181741400461&w=2
[4] https://marc.info/?l=linux-kernel&m=151871035014425&w=2
[5] https://marc.info/?l=linux-kernel&m=151912186913692&w=2
Signed-off-by: Andrea Parri <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Albert Ou <[email protected]>
Cc: Daniel Lustig <[email protected]>
Cc: Alan Stern <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: David Howells <[email protected]>
Cc: Jade Alglave <[email protected]>
Cc: Luc Maranget <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Akira Yokosawa <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
arch/riscv/include/asm/spinlock.h | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/arch/riscv/include/asm/spinlock.h b/arch/riscv/include/asm/spinlock.h
index 2fd27e8ef1fd6..2f89fc62c9196 100644
--- a/arch/riscv/include/asm/spinlock.h
+++ b/arch/riscv/include/asm/spinlock.h
@@ -28,8 +28,9 @@
static inline void arch_spin_unlock(arch_spinlock_t *lock)
{
+ RISCV_FENCE(rw,w);
__asm__ __volatile__ (
- "amoswap.w.rl x0, x0, %0"
+ "amoswap.w x0, x0, %0"
: "=A" (lock->lock)
:: "memory");
}
@@ -39,10 +40,11 @@ static inline int arch_spin_trylock(arch_spinlock_t *lock)
int tmp = 1, busy;
__asm__ __volatile__ (
- "amoswap.w.aq %0, %2, %1"
+ "amoswap.w %0, %2, %1"
: "=r" (busy), "+A" (lock->lock)
: "r" (tmp)
: "memory");
+ RISCV_FENCE(r,rw);
return !busy;
}
--
2.7.4
On Thu, Feb 22, 2018 at 01:19:50PM +0100, Andrea Parri wrote:
> The LKMM defines certain memory ordering constraints for spin_lock(),
> spin_unlock() and other primitives that the kernel developer can rely
> on; unfortunately, some of these constraints are not currently met by
> the RISC-V implementation of spin_lock(), spin_unlock().
>
> The following MP-like program exemplifies the issue: according to our
> LKMM, program "unlock-lock-read-ordering" below can never reach state
> (1:r0=1 /\ 1:r1=0). However, when we map this C program to the RISCV
> program "RISCV-unlock-lock-read-ordering" below following the current
> implementation, the corresponding state is reachable according to the
> RISCV specification and its formalizations [2].
>
> C unlock-lock-read-ordering
>
> {}
> /* s initially owned by P1 */
>
> P0(int *x, int *y)
> {
> WRITE_ONCE(*x, 1);
> smp_wmb();
> WRITE_ONCE(*y, 1);
> }
>
> P1(int *x, int *y, spinlock_t *s)
> {
> int r0;
> int r1;
>
> r0 = READ_ONCE(*y);
> spin_unlock(s);
> spin_lock(s);
> r1 = READ_ONCE(*y);
This last read should have been from 'x' of course (and similarly in
unlock-lock-write-ordering). Sorry,
Andrea
> }
>
> exists (1:r0=1 /\ 1:r1=0)
>
> RISCV RISCV-unlock-lock-read-ordering
> {
> 0:x2=x; 0:x4=y;
> 1:x2=y; 1:x4=x; 1:x6=s;
> s=1;
> }
> P0 | P1 ;
> ori x1,x0,1 | lw x1,0(x2) ;
> sw x1,0(x2) | amoswap.w.rl x0,x0,(x6) ;
> fence w,w | ori x5,x0,1 ;
> ori x3,x0,1 | amoswap.w.aq x0,x5,(x6) ;
> sw x3,0(x4) | lw x3,0(x4) ;
> exists
> (1:x1=1 /\ 1:x3=0)
>
> The issue can in fact be exarcebated if, as envisaged/discussed in [3],
> the LKMM will be modified to become even more "demanding" on the order-
> ing constraints associated to the locking primitives. For example the
> state (1:r0=1 /\ 1:r1=0) in program "unlock-lock-write-ordering" below
> is currently allowed by LKMM (as is the corresponding state in "RISCV-
> unlock-lock-write-ordering" below). However, proposals modifying LKMM
> to _forbid_ that state have already appeared on LKML [4].
>
> C unlock-lock-write-ordering
>
> {}
> /* s initially owned by P0 */
>
> P0(int *x, int *y, spinlock_t *s)
> {
> WRITE_ONCE(*x, 1);
> spin_unlock(s);
> spin_lock(s);
> WRITE_ONCE(*y, 1);
> }
>
> P1(int *x, int *y)
> {
> int r0;
> int r1;
>
> r0 = READ_ONCE(*y);
> smp_rmb();
> r1 = READ_ONCE(*y);
> }
>
> exists (1:r0=1 /\ 1:r1=0)
>
> RISCV RISCV-unlock-lock-write-ordering
> {
> 0:x2=x; 0:x4=y; 0:x6=s;
> 1:x2=y; 1:x4=x;
> s=1;
> }
> P0 | P1 ;
> ori x1,x0,1 | lw x1,0(x2) ;
> sw x1,0(x2) | fence r,r ;
> amoswap.w.rl x0,x0,(x6) | lw x3,0(x4) ;
> ori x5,x0,1 | ;
> amoswap.w.aq x0,x5,(x6) | ;
> ori x3,x0,1 | ;
> sw x3,0(x4) | ;
> exists
> (1:x1=1 /\ 1:x3=0)
>
> [Curiously, RISC-V's current implementations of smp_load_acquire() and
> smp_store_release() provide way stronger ordering than what currently
> required by LKMM since those're relying on the generic implementation
> (c.f, also, [5]). ]
>
> This RFC fixes the issue by strengthening RISC-V's implementations of
> spin_lock() and spin_unlock(), based on "A spinlock with fences" from
> Section 2.3.5 ("Acquire/Release Ordering") of the RISC-V draft spec.
> It does _not_ attempt to fix read-lock and atomics (for which, AFAICT,
> similar considerations would hold).
>
> IMPORTANT. This patch is _NOT_ intended to be applied as is. Rather,
> this is intended to test the waters, implicit questions being "Should
> we take this direction?" "Are changes to LKMM needed?" (and develop a
> technical discussion on the above issues.)
>
> [1] https://marc.info/?l=linux-kernel&m=151633436614259&w=2
> [2] https://groups.google.com/a/groups.riscv.org/forum/#!topic/isa-dev/hKywNHBkAXM
> [3] https://marc.info/?l=linux-kernel&m=151181741400461&w=2
> [4] https://marc.info/?l=linux-kernel&m=151871035014425&w=2
> [5] https://marc.info/?l=linux-kernel&m=151912186913692&w=2
>
> Signed-off-by: Andrea Parri <[email protected]>
> Cc: Palmer Dabbelt <[email protected]>
> Cc: Albert Ou <[email protected]>
> Cc: Daniel Lustig <[email protected]>
> Cc: Alan Stern <[email protected]>
> Cc: Will Deacon <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Boqun Feng <[email protected]>
> Cc: Nicholas Piggin <[email protected]>
> Cc: David Howells <[email protected]>
> Cc: Jade Alglave <[email protected]>
> Cc: Luc Maranget <[email protected]>
> Cc: "Paul E. McKenney" <[email protected]>
> Cc: Akira Yokosawa <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> ---
> arch/riscv/include/asm/spinlock.h | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/arch/riscv/include/asm/spinlock.h b/arch/riscv/include/asm/spinlock.h
> index 2fd27e8ef1fd6..2f89fc62c9196 100644
> --- a/arch/riscv/include/asm/spinlock.h
> +++ b/arch/riscv/include/asm/spinlock.h
> @@ -28,8 +28,9 @@
>
> static inline void arch_spin_unlock(arch_spinlock_t *lock)
> {
> + RISCV_FENCE(rw,w);
> __asm__ __volatile__ (
> - "amoswap.w.rl x0, x0, %0"
> + "amoswap.w x0, x0, %0"
> : "=A" (lock->lock)
> :: "memory");
> }
> @@ -39,10 +40,11 @@ static inline int arch_spin_trylock(arch_spinlock_t *lock)
> int tmp = 1, busy;
>
> __asm__ __volatile__ (
> - "amoswap.w.aq %0, %2, %1"
> + "amoswap.w %0, %2, %1"
> : "=r" (busy), "+A" (lock->lock)
> : "r" (tmp)
> : "memory");
> + RISCV_FENCE(r,rw);
>
> return !busy;
> }
> --
> 2.7.4
>
On Thu, Feb 22, 2018 at 01:19:50PM +0100, Andrea Parri wrote:
> C unlock-lock-read-ordering
>
> {}
> /* s initially owned by P1 */
>
> P0(int *x, int *y)
> {
> WRITE_ONCE(*x, 1);
> smp_wmb();
> WRITE_ONCE(*y, 1);
> }
>
> P1(int *x, int *y, spinlock_t *s)
> {
> int r0;
> int r1;
>
> r0 = READ_ONCE(*y);
> spin_unlock(s);
> spin_lock(s);
> r1 = READ_ONCE(*y);
> }
>
> exists (1:r0=1 /\ 1:r1=0)
>
> RISCV RISCV-unlock-lock-read-ordering
> {
> 0:x2=x; 0:x4=y;
> 1:x2=y; 1:x4=x; 1:x6=s;
> s=1;
> }
> P0 | P1 ;
> ori x1,x0,1 | lw x1,0(x2) ;
> sw x1,0(x2) | amoswap.w.rl x0,x0,(x6) ;
> fence w,w | ori x5,x0,1 ;
> ori x3,x0,1 | amoswap.w.aq x0,x5,(x6) ;
> sw x3,0(x4) | lw x3,0(x4) ;
> exists
> (1:x1=1 /\ 1:x3=0)
So I would indeed expect this to be forbidden. Could someone please
explain how this could be allowed?
> C unlock-lock-write-ordering
>
> {}
> /* s initially owned by P0 */
>
> P0(int *x, int *y, spinlock_t *s)
> {
> WRITE_ONCE(*x, 1);
> spin_unlock(s);
> spin_lock(s);
> WRITE_ONCE(*y, 1);
> }
>
> P1(int *x, int *y)
> {
> int r0;
> int r1;
>
> r0 = READ_ONCE(*y);
> smp_rmb();
> r1 = READ_ONCE(*y);
> }
>
> exists (1:r0=1 /\ 1:r1=0)
>
> RISCV RISCV-unlock-lock-write-ordering
> {
> 0:x2=x; 0:x4=y; 0:x6=s;
> 1:x2=y; 1:x4=x;
> s=1;
> }
> P0 | P1 ;
> ori x1,x0,1 | lw x1,0(x2) ;
> sw x1,0(x2) | fence r,r ;
> amoswap.w.rl x0,x0,(x6) | lw x3,0(x4) ;
> ori x5,x0,1 | ;
> amoswap.w.aq x0,x5,(x6) | ;
> ori x3,x0,1 | ;
> sw x3,0(x4) | ;
> exists
> (1:x1=1 /\ 1:x3=0)
And here I think the RISCV conversion is flawed, there should be a ctrl
dependency. The second store-word in P0 should depend on the result of
amoswap.w.aq being 0.
(strictly speaking there should be a ctrl-dep in the read example too,
except it'd be pointless for ordering reads, so I accept it being left
out)
Again, I cannot see how this could be allowed.
On Thu, Feb 22, 2018 at 02:40:04PM +0100, Peter Zijlstra wrote:
> On Thu, Feb 22, 2018 at 01:19:50PM +0100, Andrea Parri wrote:
>
> > C unlock-lock-read-ordering
> >
> > {}
> > /* s initially owned by P1 */
> >
> > P0(int *x, int *y)
> > {
> > WRITE_ONCE(*x, 1);
> > smp_wmb();
> > WRITE_ONCE(*y, 1);
> > }
> >
> > P1(int *x, int *y, spinlock_t *s)
> > {
> > int r0;
> > int r1;
> >
> > r0 = READ_ONCE(*y);
> > spin_unlock(s);
> > spin_lock(s);
> > r1 = READ_ONCE(*y);
> > }
> >
> > exists (1:r0=1 /\ 1:r1=0)
> >
> > RISCV RISCV-unlock-lock-read-ordering
> > {
> > 0:x2=x; 0:x4=y;
> > 1:x2=y; 1:x4=x; 1:x6=s;
> > s=1;
> > }
> > P0 | P1 ;
> > ori x1,x0,1 | lw x1,0(x2) ;
> > sw x1,0(x2) | amoswap.w.rl x0,x0,(x6) ;
> > fence w,w | ori x5,x0,1 ;
> > ori x3,x0,1 | amoswap.w.aq x0,x5,(x6) ;
> > sw x3,0(x4) | lw x3,0(x4) ;
> > exists
> > (1:x1=1 /\ 1:x3=0)
>
> So I would indeed expect this to be forbidden. Could someone please
> explain how this could be allowed?
As mentioned on IRC, my understanding here is based only on the spec
referred to below and on its (available) formalizations.  I expect that
the RISC-V people will be able to provide more information.
>
> > C unlock-lock-write-ordering
> >
> > {}
> > /* s initially owned by P0 */
> >
> > P0(int *x, int *y, spinlock_t *s)
> > {
> > WRITE_ONCE(*x, 1);
> > spin_unlock(s);
> > spin_lock(s);
> > WRITE_ONCE(*y, 1);
> > }
> >
> > P1(int *x, int *y)
> > {
> > int r0;
> > int r1;
> >
> > r0 = READ_ONCE(*y);
> > smp_rmb();
> > r1 = READ_ONCE(*y);
> > }
> >
> > exists (1:r0=1 /\ 1:r1=0)
> >
> > RISCV RISCV-unlock-lock-write-ordering
> > {
> > 0:x2=x; 0:x4=y; 0:x6=s;
> > 1:x2=y; 1:x4=x;
> > s=1;
> > }
> > P0 | P1 ;
> > ori x1,x0,1 | lw x1,0(x2) ;
> > sw x1,0(x2) | fence r,r ;
> > amoswap.w.rl x0,x0,(x6) | lw x3,0(x4) ;
> > ori x5,x0,1 | ;
> > amoswap.w.aq x0,x5,(x6) | ;
> > ori x3,x0,1 | ;
> > sw x3,0(x4) | ;
> > exists
> > (1:x1=1 /\ 1:x3=0)
>
> And here I think the RISCV conversion is flawed, there should be a ctrl
> dependency. The second store-word in P0 should depend on the result of
> amoswap.w.aq being 0.
You're right: AFAICT, this can be remedied by inserting "beq x0,x5,FAIL00"
right after the amoswap.w.aq (and adding this label at the end of P0); this
does not change the verdict of the available formalizations reported above,
however.
(So, AFAICT, the above question remains valid/open.)
Andrea
>
> (strictly speaking there should be a ctrl-dep in the read example too,
> except it'd be pointless for ordering reads, so I accept it being left
> out)
>
> Again, I cannot see how this could be allowed.
>
On 2/22/2018 6:12 AM, Andrea Parri wrote:
> On Thu, Feb 22, 2018 at 02:40:04PM +0100, Peter Zijlstra wrote:
>> On Thu, Feb 22, 2018 at 01:19:50PM +0100, Andrea Parri wrote:
>>
>>> C unlock-lock-read-ordering
>>>
>>> {}
>>> /* s initially owned by P1 */
>>>
>>> P0(int *x, int *y)
>>> {
>>> WRITE_ONCE(*x, 1);
>>> smp_wmb();
>>> WRITE_ONCE(*y, 1);
>>> }
>>>
>>> P1(int *x, int *y, spinlock_t *s)
>>> {
>>> int r0;
>>> int r1;
>>>
>>> r0 = READ_ONCE(*y);
>>> spin_unlock(s);
>>> spin_lock(s);
>>> r1 = READ_ONCE(*y);
>>> }
>>>
>>> exists (1:r0=1 /\ 1:r1=0)
>>>
>>> RISCV RISCV-unlock-lock-read-ordering
>>> {
>>> 0:x2=x; 0:x4=y;
>>> 1:x2=y; 1:x4=x; 1:x6=s;
>>> s=1;
>>> }
>>> P0 | P1 ;
>>> ori x1,x0,1 | lw x1,0(x2) ;
>>> sw x1,0(x2) | amoswap.w.rl x0,x0,(x6) ;
>>> fence w,w | ori x5,x0,1 ;
>>> ori x3,x0,1 | amoswap.w.aq x0,x5,(x6) ;
>>> sw x3,0(x4) | lw x3,0(x4) ;
>>> exists
>>> (1:x1=1 /\ 1:x3=0)
>>
>> So I would indeed expect this to be forbidden. Could someone please
>> explain how this could be allowed?
>
> As mentioned in IRC, my understanding here is only based on the spec.
> referred below and on its (available) formalizations. I expect that
> RISC-V people will be able to provide more information.
I think an RCpc interpretation of .aq and .rl would in fact allow
the two normal loads in P1 to be reordered. Wouldn't it? Does that
go against what the LKMM requires?
The intuition would be that the amoswap.w.aq can forward from the
amoswap.w.rl while that's still in the store buffer, and then the
lw x3,0(x4) can also perform while the amoswap.w.rl is still in
the store buffer, all before the lw x1,0(x2) executes. That's
not forbidden unless the amoswaps are RCsc, unless I'm missing
something.
Likewise even if the unlock()/lock() is between two stores. A
control dependency might originate from the load part of the
amoswap.w.aq, but there still would have to be something to
ensure that this load part in fact performs after the store
part of the amoswap.w.rl performs globally, and that's not
automatic under RCpc.
In any case, our concern that amoswap.w.rl and amoswap.w.aq might
not actually be sufficient for spin_unlock() and spin_lock() was
what prompted us to start our own internal discussion on this.
FWIW we have made a few clarifications to our spec that haven't
propagated to the public upstream yet, but that should be happening
soon. In any case as it relates to the questions here, it's
more a matter of us deciding what we should put into the spec in
the end than it is a matter of checking the current text. In
other words, we're still finalizing things, and there's still
some time to tweak as necessary.
>>
>>> C unlock-lock-write-ordering
>>>
>>> {}
>>> /* s initially owned by P0 */
>>>
>>> P0(int *x, int *y, spinlock_t *s)
>>> {
>>> WRITE_ONCE(*x, 1);
>>> spin_unlock(s);
>>> spin_lock(s);
>>> WRITE_ONCE(*y, 1);
>>> }
>>>
>>> P1(int *x, int *y)
>>> {
>>> int r0;
>>> int r1;
>>>
>>> r0 = READ_ONCE(*y);
>>> smp_rmb();
>>> r1 = READ_ONCE(*y);
>>> }
>>>
>>> exists (1:r0=1 /\ 1:r1=0)
>>>
>>> RISCV RISCV-unlock-lock-write-ordering
>>> {
>>> 0:x2=x; 0:x4=y; 0:x6=s;
>>> 1:x2=y; 1:x4=x;
>>> s=1;
>>> }
>>> P0 | P1 ;
>>> ori x1,x0,1 | lw x1,0(x2) ;
>>> sw x1,0(x2) | fence r,r ;
>>> amoswap.w.rl x0,x0,(x6) | lw x3,0(x4) ;
>>> ori x5,x0,1 | ;
>>> amoswap.w.aq x0,x5,(x6) | ;
>>> ori x3,x0,1 | ;
>>> sw x3,0(x4) | ;
>>> exists
>>> (1:x1=1 /\ 1:x3=0)
>>
>> And here I think the RISCV conversion is flawed, there should be a ctrl
>> dependency. The second store-word in P0 should depend on the result of
>> amoswap.w.aq being 0.
>
> You're right: AFAICT, this can be remedied by inserting "beq x0,x5,FAIL00"
> right after amoswap.w.aq (and this label at the end of P0); this does not
> change the verdict of the available formalizations reported above however.
>
> (So, AFAICT, the above question remains valid/open.)
This makes sense, but one other question: Paul mentioned earlier in the
thread that
> The PowerPC lock implementation's unlock-lock pair does not order writes
> from the previous critical section against reads from the later critical
> section, but it does order other combinations of reads and writes.
My understanding is that the same logic about control dependencies applies
here too, right? In other words, in spite of Peter's claim, the control
dependency doesn't automatically fix this for RISC-V or for PowerPC?
>
> Andrea
>
>
>>
>> (strictly speaking there should be a ctrl-dep in the read example too,
>> except it'd be pointless for ordering reads, so I accept it being left
>> out)
>>
>> Again, I cannot see how this could be allowed.
>>
Peter, in this test you mentioned earlier:
P0()
{
WRITE_ONCE(x, 1);
smp_store_release(&y, 1);
r0 = smp_load_acquire(&y);
WRITE_ONCE(z, 1);
}
P1()
{
r1 = READ_ONCE(z);
smp_rmb();
r2 = READ_ONCE(x);
}
exists: r0 == 1 /\ r1==1 /\ r2==0
You expect this to be forbidden too, even if the release and acquire
are RCpc? This part I don't follow. There's no conditional branch
here to enforce a control dependency. I get and I agree that
smp_store_release/smp_load_acquire are different than spin_unlock/
spin_lock, but isn't that even more reason to allow this one if
smp_store_release/smp_load_acquire are RCpc without question?
Thanks,
Dan
On Thu, Feb 22, 2018 at 09:27:09AM -0800, Daniel Lustig wrote:
> On 2/22/2018 6:12 AM, Andrea Parri wrote:
> > On Thu, Feb 22, 2018 at 02:40:04PM +0100, Peter Zijlstra wrote:
> >> On Thu, Feb 22, 2018 at 01:19:50PM +0100, Andrea Parri wrote:
> >>
> >>> C unlock-lock-read-ordering
> >>>
> >>> {}
> >>> /* s initially owned by P1 */
> >>>
> >>> P0(int *x, int *y)
> >>> {
> >>> WRITE_ONCE(*x, 1);
> >>> smp_wmb();
> >>> WRITE_ONCE(*y, 1);
> >>> }
> >>>
> >>> P1(int *x, int *y, spinlock_t *s)
> >>> {
> >>> int r0;
> >>> int r1;
> >>>
> >>> r0 = READ_ONCE(*y);
> >>> spin_unlock(s);
> >>> spin_lock(s);
> >>> r1 = READ_ONCE(*y);
> >>> }
> >>>
> >>> exists (1:r0=1 /\ 1:r1=0)
> >>>
> >>> RISCV RISCV-unlock-lock-read-ordering
> >>> {
> >>> 0:x2=x; 0:x4=y;
> >>> 1:x2=y; 1:x4=x; 1:x6=s;
> >>> s=1;
> >>> }
> >>> P0 | P1 ;
> >>> ori x1,x0,1 | lw x1,0(x2) ;
> >>> sw x1,0(x2) | amoswap.w.rl x0,x0,(x6) ;
> >>> fence w,w | ori x5,x0,1 ;
> >>> ori x3,x0,1 | amoswap.w.aq x0,x5,(x6) ;
> >>> sw x3,0(x4) | lw x3,0(x4) ;
> >>> exists
> >>> (1:x1=1 /\ 1:x3=0)
> >>
> >> So I would indeed expect this to be forbidden. Could someone please
> >> explain how this could be allowed?
> >
> > As mentioned in IRC, my understanding here is only based on the spec.
> > referred below and on its (available) formalizations. I expect that
> > RISC-V people will be able to provide more information.
>
> I think an RCpc interpretation of .aq and .rl would in fact allow
> the two normal loads in P1 to be reordered. Wouldn't it? Does that
> go against what the LKMM requires?
>
> The intuition would be that the amoswap.w.aq can forward from the
> amoswap.w.rl while that's still in the store buffer, and then the
> lw x3,0(x4) can also perform while the amoswap.w.rl is still in
> the store buffer, all before the l1 x1,0(x2) executes. That's
> not forbidden unless the amoswaps are RCsc, unless I'm missing
> something.
>
> Likewise even if the unlock()/lock() is between two stores. A
> control dependency might originate from the load part of the
> amoswap.w.aq, but there still would have to be something to
> ensure that this load part in fact performs after the store
> part of the amoswap.w.rl performs globally, and that's not
> automatic under RCpc.
>
> In any case, our concern that amoswap.w.rl and amoswap.w.aq might
> not actually be sufficient for spin_unlock() and spin_lock() was
> what prompted us to start our own internal discussion on this.
>
> FWIW we have made a few clarifications to our spec that haven't
> propagated to the public upstream yet, but that should be happening
> soon. In any case as it relates to the questions here, it's
> more a matter of us deciding what we should put into the spec in
> the end than it is a matter of checking the current text. In
> other words, we're still finalizing things, and there's still
> some time to tweak as necessary.
>
> >>
> >>> C unlock-lock-write-ordering
> >>>
> >>> {}
> >>> /* s initially owned by P0 */
> >>>
> >>> P0(int *x, int *y, spinlock_t *s)
> >>> {
> >>> WRITE_ONCE(*x, 1);
> >>> spin_unlock(s);
> >>> spin_lock(s);
> >>> WRITE_ONCE(*y, 1);
> >>> }
> >>>
> >>> P1(int *x, int *y)
> >>> {
> >>> int r0;
> >>> int r1;
> >>>
> >>> r0 = READ_ONCE(*y);
> >>> smp_rmb();
> >>> r1 = READ_ONCE(*y);
> >>> }
> >>>
> >>> exists (1:r0=1 /\ 1:r1=0)
> >>>
> >>> RISCV RISCV-unlock-lock-write-ordering
> >>> {
> >>> 0:x2=x; 0:x4=y; 0:x6=s;
> >>> 1:x2=y; 1:x4=x;
> >>> s=1;
> >>> }
> >>> P0 | P1 ;
> >>> ori x1,x0,1 | lw x1,0(x2) ;
> >>> sw x1,0(x2) | fence r,r ;
> >>> amoswap.w.rl x0,x0,(x6) | lw x3,0(x4) ;
> >>> ori x5,x0,1 | ;
> >>> amoswap.w.aq x0,x5,(x6) | ;
> >>> ori x3,x0,1 | ;
> >>> sw x3,0(x4) | ;
> >>> exists
> >>> (1:x1=1 /\ 1:x3=0)
> >>
> >> And here I think the RISCV conversion is flawed, there should be a ctrl
> >> dependency. The second store-word in P0 should depend on the result of
> >> amoswap.w.aq being 0.
> >
> > You're right: AFAICT, this can be remedied by inserting "beq x0,x5,FAIL00"
> > right after amoswap.w.aq (and this label at the end of P0); this does not
> > change the verdict of the available formalizations reported above however.
> >
> > (So, AFAICT, the above question remains valid/open.)
>
> This makes sense, but one other question: Paul mentioned earlier in the
> thread that
>
> > The PowerPC lock implementation's unlock-lock pair does not order writes
> > from the previous critical section against reads from the later critical
> > section, but it does order other combinations of reads and writes.
>
> My understanding is that the same logic about control dependencies applies
> here too, right? In other words, in spite of Peter's claim, the control
> dependency doesn't automatically fix this for RISC-V or for PowerPC?
For PowerPC, an lwsync instruction is used, which means that an external
observer can see a write from an earlier critical section being
reordered with a read from a later critical section, but no other
combination of reads and writes can be so reordered.
I will dig up PowerPC litmus tests demonstrating this and send them
along.
> >> (strictly speaking there should be a ctrl-dep in the read example too,
> >> except it'd be pointless for ordering reads, so I accept it being left
> >> out)
> >>
> >> Again, I cannot see how this could be allowed.
> >>
>
> Peter, in this test you mentioned earlier:
>
> P0()
> {
> WRITE_ONCE(x, 1);
> smp_store_release(&y, 1);
> r0 = smp_load_acquire(&y);
> WRITE_ONCE(z, 1);
> }
>
> P1()
> {
> r1 = READ_ONCE(z);
> smp_rmb();
> r2 = READ_ONCE(x);
> }
>
> exists: r0 == 1 /\ r1==1 /\ r2==0
>
> You expect this to be forbidden too, even if the release and acquire
> are RCpc? This part I don't follow. There's no conditional branch
> here to enforce a control dependency. I get and I agree that
> smp_store_release/smp_load_acquire are different than spin_unlock/
> spin_lock, but isn't that even more reason to allow this one if
> smp_store_release/smp_load_acquire are RCpc without question?
Here is the above test, reworked a bit to allow herd to accept it:
------------------------------------------------------------------------
C C-peterz
{
}
P0(int *x, int *y, int *z)
{
int r0;
WRITE_ONCE(*x, 1);
smp_store_release(y, 1);
r0 = smp_load_acquire(y);
WRITE_ONCE(*z, 1);
}
P1(int *x, int *y, int *z)
{
int r1;
int r2;
r1 = READ_ONCE(*z);
smp_rmb();
r2 = READ_ONCE(*x);
}
exists (0:r0 == 1 /\ 1:r1==1 /\ 1:r2==0)
------------------------------------------------------------------------
Please let me know if I messed up the conversion. On the perhaps unlikely
chance that I got it right, here is what the current version of the
Linux-kernel memory model says:
------------------------------------------------------------------------
$ herd7 -conf linux-kernel.cfg /tmp/C-peterz.litmus
Test C-peterz Allowed
States 4
0:r0=1; 1:r1=0; 1:r2=0;
0:r0=1; 1:r1=0; 1:r2=1;
0:r0=1; 1:r1=1; 1:r2=0;
0:r0=1; 1:r1=1; 1:r2=1;
Ok
Witnesses
Positive: 1 Negative: 3
Condition exists (0:r0=1 /\ 1:r1=1 /\ 1:r2=0)
Observation C-peterz Sometimes 1 3
Time C-peterz 0.01
Hash=6c4f7f8a8dc750fd5d706d6f3c104360
------------------------------------------------------------------------
So the current Linux-kernel memory model does in fact allow this, as
you say. But again, there is a proposal to tighten the Linux-kernel
memory model based on the pre-RISC-V set of architectures running the
Linux kernel, all of which would forbid this test.
So we have something that is not all that rare in the Linux kernel
community, namely two conflicting more-or-less concurrent changes.
This clearly needs to be resolved, either by us not strengthening the
Linux-kernel memory model in the way we were planning to or by you
strengthening RISC-V to be no weaker than PowerPC for these sorts of
externally viewed release-acquire situations.
Other thoughts?
Thanx, Paul
On Thu, Feb 22, 2018 at 09:27:09AM -0800, Daniel Lustig wrote:
> On 2/22/2018 6:12 AM, Andrea Parri wrote:
> > On Thu, Feb 22, 2018 at 02:40:04PM +0100, Peter Zijlstra wrote:
> >> On Thu, Feb 22, 2018 at 01:19:50PM +0100, Andrea Parri wrote:
> >>
> >>> C unlock-lock-read-ordering
> >>>
> >>> {}
> >>> /* s initially owned by P1 */
> >>>
> >>> P0(int *x, int *y)
> >>> {
> >>> WRITE_ONCE(*x, 1);
> >>> smp_wmb();
> >>> WRITE_ONCE(*y, 1);
> >>> }
> >>>
> >>> P1(int *x, int *y, spinlock_t *s)
> >>> {
> >>> int r0;
> >>> int r1;
> >>>
> >>> r0 = READ_ONCE(*y);
> >>> spin_unlock(s);
> >>> spin_lock(s);
> >>> r1 = READ_ONCE(*y);
> >>> }
> >> So I would indeed expect this to be forbidden. Could someone please
> >> explain how this could be allowed?
> I think an RCpc interpretation of .aq and .rl would in fact allow
> the two normal loads in P1 to be reordered. Wouldn't it? Does that
> go against what the LKMM requires?
>
> The intuition would be that the amoswap.w.aq can forward from the
> amoswap.w.rl while that's still in the store buffer, and then the
> lw x3,0(x4) can also perform while the amoswap.w.rl is still in
> the store buffer, all before the l1 x1,0(x2) executes. That's
> not forbidden unless the amoswaps are RCsc, unless I'm missing
> something.
So here you're equating flushing the store buffer with RCsc. Is that
entirely fair? Couldn't a local store-buffer flush still fall short of
global visibility? Similarly, couldn't a store buffer have internal
ordering, such that it guarantees to flush writes in order, employing
for example multiple stages / generations?
That is, I think there's quite a bit of gray area here.
But even with the store buffer, the STORE-RELEASE could 'wait' for all
prior LOADs to complete before issuing the STORE. It could also
increment a store-buffer generation before issuing, such that it would,
once it gets around to flushing things, issue the STOREs in the 'promised'
order.
> Likewise even if the unlock()/lock() is between two stores. A
> control dependency might originate from the load part of the
> amoswap.w.aq, but there still would have to be something to
> ensure that this load part in fact performs after the store
> part of the amoswap.w.rl performs globally, and that's not
> automatic under RCpc.
Right, so I think we have different definitions of what RCpc means, and
I suspect I might be the odd one out. And it's not just spinlocks.
> >>
> >>> C unlock-lock-write-ordering
> >>>
> >>> {}
> >>> /* s initially owned by P0 */
> >>>
> >>> P0(int *x, int *y, spinlock_t *s)
> >>> {
> >>> WRITE_ONCE(*x, 1);
> >>> spin_unlock(s);
> >>> spin_lock(s);
> >>> WRITE_ONCE(*y, 1);
> >>> }
> >>>
> >>> P1(int *x, int *y)
> >>> {
> >>> int r0;
> >>> int r1;
> >>>
> >>> r0 = READ_ONCE(*y);
> >>> smp_rmb();
> >>> r1 = READ_ONCE(*y);
> >>> }
> >>>
> >>> exists (1:r0=1 /\ 1:r1=0)
> >>>
> >>> RISCV RISCV-unlock-lock-write-ordering
> >>> {
> >>> 0:x2=x; 0:x4=y; 0:x6=s;
> >>> 1:x2=y; 1:x4=x;
> >>> s=1;
> >>> }
> >>> P0 | P1 ;
> >>> ori x1,x0,1 | lw x1,0(x2) ;
> >>> sw x1,0(x2) | fence r,r ;
> >>> amoswap.w.rl x0,x0,(x6) | lw x3,0(x4) ;
> >>> ori x5,x0,1 | ;
> >>> amoswap.w.aq x0,x5,(x6) | ;
> >>> ori x3,x0,1 | ;
> >>> sw x3,0(x4) | ;
> >>> exists
> >>> (1:x1=1 /\ 1:x3=0)
> >>
> >> And here I think the RISCV conversion is flawed, there should be a ctrl
> >> dependency. The second store-word in P0 should depend on the result of
> >> amoswap.w.aq being 0.
> >
> > You're right: AFAICT, this can be remedied by inserting "beq x0,x5,FAIL00"
> > right after amoswap.w.aq (and this label at the end of P0); this does not
> > change the verdict of the available formalizations reported above however.
> >
> > (So, AFAICT, the above question remains valid/open.)
>
> This makes sense, but one other question: Paul mentioned earlier in the
> thread that
>
> > The PowerPC lock implementation's unlock-lock pair does not order writes
> > from the previous critical section against reads from the later critical
> > section, but it does order other combinations of reads and writes.
>
> My understanding is that the same logic about control dependencies applies
> here too, right? In other words, in spite of Peter's claim, the control
> dependency doesn't automatically fix this for RISC-V or for PowerPC?
So what Paul says is that earlier STOREs and later LOADs can reorder
(the TSO reordering) over the RELEASE+ACQUIRE. And that is right, LWSYNC
allows just that one reorder.
The ctrl-dep cannot fix that; a ctrl-dep provides LOAD->STORE order, i.e.
later STOREs will happen after the earlier LOAD.
But the above example is a STORE->STORE, which LWSYNC _will_ order.
But yes, I remember being confused by that example.
> Peter, in this test you mentioned earlier:
>
> P0()
> {
> WRITE_ONCE(x, 1);
> smp_store_release(&y, 1);
> r0 = smp_load_acquire(&y);
> WRITE_ONCE(z, 1);
> }
>
> P1()
> {
> r1 = READ_ONCE(z);
> smp_rmb();
> r2 = READ_ONCE(x);
> }
>
> exists: r0 == 1 /\ r1==1 /\ r2==0
>
> You expect this to be forbidden too, even if the release and acquire
> are RCpc? This part I don't follow. There's no conditional branch
> here to enforce a control dependency. I get and I agree that
> smp_store_release/smp_load_acquire are different than spin_unlock/
> spin_lock, but isn't that even more reason to allow this one if
> smp_store_release/smp_load_acquire are RCpc without question?
Yes, I was expecting that to be forbidden. My definition of RCpc is such
that ACQUIRE and RELEASE are 'locally' respected.
The TSO archs forbid the above by ordering STOREs; beyond them, the only
architectures that implement ACQUIRE/RELEASE are PPC and ARM64 (the rest
use smp_mb()): ARM64 has globally consistent RELEASE/ACQUIRE, and PPC
has its LWSYNC, which imposes a local TSO order.
In particular I'm not a great fan of weakening the LKMM further. PPC
is already painfully weak (more so than ARM64), but the proposed RISC-V
would be weaker still.
Of course, I come from x86 so I could be biased, but history has shown
that the weaker this stuff becomes the more painful this all becomes.
Not to mention that we'd have to go audit all existing code for
constructs no longer allowed, which is incredibly painful.
Linus too has expressed his preference for stronger models.
On Thu, Feb 22, 2018 at 10:13:17AM -0800, Paul E. McKenney wrote:
> So we have something that is not all that rare in the Linux kernel
> community, namely two conflicting more-or-less concurrent changes.
> This clearly needs to be resolved, either by us not strengthening the
> Linux-kernel memory model in the way we were planning to or by you
> strengthening RISC-V to be no weaker than PowerPC for these sorts of
> externally viewed release-acquire situations.
>
> Other thoughts?
Like said in the other email, I would _much_ prefer to not go weaker
than PPC, I find that PPC is already painfully weak at times.
On 2/22/2018 10:27 AM, Peter Zijlstra wrote:
> On Thu, Feb 22, 2018 at 10:13:17AM -0800, Paul E. McKenney wrote:
>> So we have something that is not all that rare in the Linux kernel
>> community, namely two conflicting more-or-less concurrent changes.
>> This clearly needs to be resolved, either by us not strengthening the
>> Linux-kernel memory model in the way we were planning to or by you
>> strengthening RISC-V to be no weaker than PowerPC for these sorts of
>> externally viewed release-acquire situations.
>>
>> Other thoughts?
>
> Like said in the other email, I would _much_ prefer to not go weaker
> than PPC, I find that PPC is already painfully weak at times.
Sure, and RISC-V could make this work too by using RCsc instructions
and/or by using lightweight fences instead. It just wasn't clear at
first whether smp_load_acquire() and smp_store_release() were RCpc,
RCsc, or something else, and hence whether RISC-V would actually need
to use something stronger than pure RCpc there. Likewise for
spin_unlock()/spin_lock() and everywhere else this comes up.
As Paul's email in the other thread observed, RCpc seems to be
OK for smp_load_acquire()/smp_store_release() at least according
to the current LKMM herd spec. Unlock/lock are stronger already
I guess. But if there's an active proposal to strengthen them all
to something stricter than pure RCpc, then that's good to know.
My understanding from earlier discussions is that ARM has no plans
to use their own RCpc instruction for smp_load_acquire() instead
of their RCsc instructions. Is that still true? If they were to
use the RCpc load there, that would cause them to have the same
problem we're discussing here, right? Just checking.
Dan
On Thu, Feb 22, 2018 at 07:27:17PM +0100, Peter Zijlstra wrote:
> On Thu, Feb 22, 2018 at 10:13:17AM -0800, Paul E. McKenney wrote:
> > So we have something that is not all that rare in the Linux kernel
> > community, namely two conflicting more-or-less concurrent changes.
> > This clearly needs to be resolved, either by us not strengthening the
> > Linux-kernel memory model in the way we were planning to or by you
> > strengthening RISC-V to be no weaker than PowerPC for these sorts of
> > externally viewed release-acquire situations.
> >
> > Other thoughts?
>
> Like said in the other email, I would _much_ prefer to not go weaker
> than PPC, I find that PPC is already painfully weak at times.
And here are the four PowerPC litmus tests. As expected, a
release-acquire pair within a given process orders everything except
for prior stores against later loads, from the viewpoint of some other
process.
And yes, a few of the filenames are unfortunate.
Thanx, Paul
------------------------------------------------------------------------
PPC MP+o-r-a-o+o-rmb-o
""
(* 22-Feb-2018: ppcmem says "Never" *)
{
0:r1=1; 0:r4=x; 0:r5=y; 0:r6=z;
1:r1=1; 1:r4=x; 1:r5=y; 1:r6=z;
}
P0 | P1 ;
stw r1,0(r4) | lwz r7,0(r6) ;
lwsync | lwsync ;
stw r1,0(r5) | lwz r8,0(r4) ;
lwz r7,0(r5) | ;
lwsync | ;
stw r1,0(r6) | ;
exists
(1:r7=1 /\ 1:r8=0)
------------------------------------------------------------------------
PPC SB+o-r-a-o+o-rmb-o
""
(* 22-Feb-2018: ppcmem says "Sometimes" *)
{
0:r1=1; 0:r4=x; 0:r5=y; 0:r6=z;
1:r1=1; 1:r4=x; 1:r5=y; 1:r6=z;
}
P0 | P1 ;
stw r1,0(r4) | stw r1,0(r6) ;
lwsync | lwsync ;
stw r1,0(r5) | lwz r7,0(r4) ;
lwz r8,0(r5) | ;
lwsync | ;
lwz r7,0(r6) | ;
exists
(0:r7=0 /\ 1:r7=0)
------------------------------------------------------------------------
PPC LB+o-r-a-o+o-rmb-o
""
(* 22-Feb-2018: ppcmem says "Never" *)
{
0:r1=1; 0:r4=x; 0:r5=y; 0:r6=z;
1:r1=1; 1:r4=x; 1:r5=y; 1:r6=z;
}
P0 | P1 ;
lwz r7,0(r4) | lwz r7,0(r6) ;
lwsync | lwsync ;
stw r1,0(r5) | stw r1,0(r4) ;
lwz r8,0(r5) | ;
lwsync | ;
stw r1,0(r6) | ;
exists
(0:r7=1 /\ 1:r7=1)
------------------------------------------------------------------------
PPC MP+o-rmb-o+o-r-a-o.litmus
""
(* 22-Feb-2018: ppcmem says "Never" *)
{
0:r1=1; 0:r4=x; 0:r5=y; 0:r6=z;
1:r1=1; 1:r4=x; 1:r5=y; 1:r6=z;
}
P0 | P1 ;
lwz r7,0(r4) | stw r1,0(r6) ;
lwsync | lwsync ;
stw r1,0(r5) | stw r1,0(r4) ;
lwz r8,0(r5) | ;
lwsync | ;
lwz r9,0(r6) | ;
exists
(0:r7=1 /\ 0:r9=0)
On Thu, Feb 22, 2018 at 11:47:57AM -0800, Daniel Lustig wrote:
> On 2/22/2018 10:27 AM, Peter Zijlstra wrote:
> > On Thu, Feb 22, 2018 at 10:13:17AM -0800, Paul E. McKenney wrote:
> >> So we have something that is not all that rare in the Linux kernel
> >> community, namely two conflicting more-or-less concurrent changes.
> >> This clearly needs to be resolved, either by us not strengthening the
> >> Linux-kernel memory model in the way we were planning to or by you
> >> strengthening RISC-V to be no weaker than PowerPC for these sorts of
> >> externally viewed release-acquire situations.
> >>
> >> Other thoughts?
> >
> > Like said in the other email, I would _much_ prefer to not go weaker
> > than PPC, I find that PPC is already painfully weak at times.
>
> Sure, and RISC-V could make this work too by using RCsc instructions
> and/or by using lightweight fences instead. It just wasn't clear at
> first whether smp_load_acquire() and smp_store_release() were RCpc,
> RCsc, or something else, and hence whether RISC-V would actually need
> to use something stronger than pure RCpc there. Likewise for
> spin_unlock()/spin_lock() and everywhere else this comes up.
Thank you for the confirmations.
The hope was indeed that this thread helped clarify what the (current)
LKMM "expectations" are, and how these are likely to evolve in a near
future. And Yes, I'm definitely with you and Paul on "let us (try to)
keep LKMM and RISC-V synchronized"! ;-)
Andrea
>
>
> Dan
On Thu, Feb 22, 2018 at 11:47:57AM -0800, Daniel Lustig wrote:
> On 2/22/2018 10:27 AM, Peter Zijlstra wrote:
> > On Thu, Feb 22, 2018 at 10:13:17AM -0800, Paul E. McKenney wrote:
> >> So we have something that is not all that rare in the Linux kernel
> >> community, namely two conflicting more-or-less concurrent changes.
> >> This clearly needs to be resolved, either by us not strengthening the
> >> Linux-kernel memory model in the way we were planning to or by you
> >> strengthening RISC-V to be no weaker than PowerPC for these sorts of
> >> externally viewed release-acquire situations.
> >>
> >> Other thoughts?
> >
> > Like said in the other email, I would _much_ prefer to not go weaker
> > than PPC, I find that PPC is already painfully weak at times.
>
> Sure, and RISC-V could make this work too by using RCsc instructions
> and/or by using lightweight fences instead. It just wasn't clear at
> first whether smp_load_acquire() and smp_store_release() were RCpc,
> RCsc, or something else, and hence whether RISC-V would actually need
> to use something stronger than pure RCpc there. Likewise for
> spin_unlock()/spin_lock() and everywhere else this comes up.
>
> As Paul's email in the other thread observed, RCpc seems to be
> OK for smp_load_acquire()/smp_store_release() at least according
> to the current LKMM herd spec. Unlock/lock are stronger already
> I guess. But if there's an active proposal to strengthen them all
> to something stricter than pure RCpc, then that's good to know.
>
> My understanding from earlier discussions is that ARM has no plans
> to use their own RCpc instruction for smp_load_acquire() instead
> of their RCsc instructions. Is that still true? If they were to
> use the RCpc load there, that would cause them to have the same
> problem we're discussing here, right? Just checking.
Agreed. No plans to use the LDAPR instruction in Linux.
Will
> On 2/22/2018 10:27 AM, Peter Zijlstra wrote:
> > On Thu, Feb 22, 2018 at 10:13:17AM -0800, Paul E. McKenney wrote:
> >> So we have something that is not all that rare in the Linux kernel
> >> community, namely two conflicting more-or-less concurrent changes.
> >> This clearly needs to be resolved, either by us not strengthening the
> >> Linux-kernel memory model in the way we were planning to or by you
> >> strengthening RISC-V to be no weaker than PowerPC for these sorts of
> >> externally viewed release-acquire situations.
> >>
> >> Other thoughts?
> >
> > Like said in the other email, I would _much_ prefer to not go weaker
> > than PPC, I find that PPC is already painfully weak at times.
>
> Sure, and RISC-V could make this work too by using RCsc instructions
> and/or by using lightweight fences instead. It just wasn't clear at
> first whether smp_load_acquire() and smp_store_release() were RCpc,
> RCsc, or something else,
As far as I now understand, they are something else. Something in-between.
Basically, considering a multi-copy-atomic model, and following the RISC-V
axiomatic herd notation:

RCpc:
  [Acq];po in ppo        (ie the Acquire "half barrier")
  po;[Rel] in ppo        (ie the Release "half barrier")

RCsc adds the rule
  [Rel];po;[Acq] in ppo  (ie a store Release followed by a load Acquire is
  kept in order)

while the LKMM has a less constrained additional rule
  [Rel];po-loc;[Acq]     (ie a store Release followed by a load Acquire from
  the SAME location is kept in order)

(For specialists: LKMM is formulated differently, but the overall
effect is the same.)
> and hence whether RISC-V would actually need
> to use something stronger than pure RCpc there. Likewise for
> spin_unlock()/spin_lock() and everywhere else this comes up.
One may observe here that the LKMM rule is there precisely to
forbid Andrea's test. In that respect, the suggested RISC-V implementation
of lock and unlock on spinlock_t is too weak, as it allows the test.
However, I cannot conclude on smp_load_acquire/smp_store_release (see below).
>
> As Paul's email in the other thread observed, RCpc seems to be
> OK for smp_load_acquire()/smp_store_release() at least according
> to the current LKMM herd spec. Unlock/lock are stronger already
> I guess. But if there's an active proposal to strengthen them all
> to something stricter than pure RCpc, then that's good to know.
As illustrated above, RCpc is not enough to implement the LKMM
smp_load_acquire()/smp_store_release(). However, I do not know
for sure what the current implementation of smp_load_acquire()
and smp_store_release() (ie the LKMM primitives) is on RISC-V.
As far as I could find, the current implementation is the generic
default, which uses strong fences and is safe.
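For reference, the asm-generic fallback amounts to a full barrier on the
appropriate side of a plain access. Roughly (sketched from memory of
include/asm-generic/barrier.h, so details may differ in any given tree;
on RISC-V, __smp_mb() is a full "fence rw,rw"):

#define __smp_store_release(p, v)					\
do {									\
	compiletime_assert_atomic_type(*p);				\
	__smp_mb();	/* full barrier before the store */		\
	WRITE_ONCE(*p, v);						\
} while (0)

#define __smp_load_acquire(p)						\
({									\
	typeof(*p) ___p1 = READ_ONCE(*p);				\
	compiletime_assert_atomic_type(*p);				\
	__smp_mb();	/* full barrier after the load */		\
	___p1;								\
})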
On a related note, it may not yet be clear to everyone
that the LKMM has "platonic spinlocks".
That is, locks are not implemented from more basic primitives but are specified.
The specification can be described as behaving this way:
- A lock behaves as a read-modify-write, the read behaving as a read-acquire.
- An unlock behaves as a store release.
- Lock and unlock operations (to a given lock variable)
  alternate in some total order. This order induces the usual relations
  between accesses to a given location.
This of course strongly suggests implementing lock with a RMW (the R being
an acquire, ie amoswap.aq in RISC-V) and unlock as a store release
(ie amoswap.rl in RISC-V). And again, this is too weak given the LKMM's
additional rule for RC. However, nothing prevents a different
implementation, provided it is safe.
As regards LKMM spinlocks, I am not sure that having strong
implementations of lock/unlock in Power (lwsync) and
ARMv8 (RCsc?) commands a specification stronger than the current one.
>
> My understanding from earlier discussions is that ARM has no plans
> to use their own RCpc instruction for smp_load_acquire() instead
> of their RCsc instructions. Is that still true? If they were to
> use the RCpc load there, that would cause them to have the same
> problem we're discussing here, right? Just checking.
>
> Dan
On Mon, Feb 26, 2018 at 6:21 AM, Luc Maranget <[email protected]> wrote:
>
> That is, locks are not implemented from more basic primitive but are specified.
> The specification can be described as behaving that way:
> - A lock behaves as a read-modify-write. the read behaving as a read-acquire
This is wrong, or perhaps just misleading.
The *whole* r-m-w acts as an acquire. Not just the read part. The
write is very much part of it.
Maybe that's what you meant, but it read to me as "just the read part
of the rmw behaves as a read-acquire".
Because it is very important that the _write_ part of the rmw is also
ordered wrt everything that is inside the spinlock.
So doing a spinlock as
(a) read-locked-acquire
modify
(c) write-conditional
would be wrong, because the accesses inside the spinlock are ordered
not just wrt the read-acquire, they have to be ordered wrt the write
too.
So it is closer to say that it's the _write_ of the r-m-w sequence
that has the acquire semantics, not the read.
> - A unlock behaves as a store release.
Yes.
Linus
On Mon, Feb 26, 2018 at 08:06:59AM -0800, Linus Torvalds wrote:
> On Mon, Feb 26, 2018 at 6:21 AM, Luc Maranget <[email protected]> wrote:
> >
> > That is, locks are not implemented from more basic primitive but are specified.
> > The specification can be described as behaving that way:
> > - A lock behaves as a read-modify-write. the read behaving as a read-acquire
>
> This is wrong, or perhaps just misleading.
>
> The *whole* r-m-w acts as an acquire. Not just the read part. The
> write is very much part of it.
>
> Maybe that's what you meant, but it read to me as "just the read part
> of the rmw behaves as a read-acquire".
>
> Because it is very important that the _write_ part of the rmw is also
> ordered wrt everything that is inside the spinlock.
>
> So doing a spinlock as
>
> (a) read-locked-acquire
> modify
> (c) write-conditional
>
> would be wrong, because the accesses inside the spinlock are ordered
> not just wrt the read-acquire, they have to be ordered wrt the write
> too.
>
> So it is closer to say that it's the _write_ of the r-m-w sequence
> that has the acquire semantics, not the read.
Strictly speaking, that's not what we've got implemented on arm64: only
the read part of the RmW has Acquire semantics, but there is a total
order on the lock/unlock operations for the lock. For example, if one
CPU does:
spin_lock(&lock);
WRITE_ONCE(foo, 42);
then another CPU could do:
if (smp_load_acquire(&foo) == 42)
BUG_ON(!spin_is_locked(&lock));
and that could fire. Is that relied on somewhere?
Will
On Mon, Feb 26, 2018 at 8:24 AM, Will Deacon <[email protected]> wrote:
>
> Strictly speaking, that's not what we've got implemented on arm64: only
> the read part of the RmW has Acquire semantics, but there is a total
> order on the lock/unlock operations for the lock.
Hmm.
I thought we had exactly that bug on some architecture with the queued
spinlocks, and people decided it was wrong.
But it's possible that I mis-remember, and that we decided it was ok after all.
> spin_lock(&lock);
> WRITE_ONCE(foo, 42);
>
> then another CPU could do:
>
> if (smp_load_acquire(&foo) == 42)
> BUG_ON(!spin_is_locked(&lock));
>
> and that could fire. Is that relied on somewhere?
I have a distinct memory that we said the spinlock write is seen in
order, wrt the writes inside the spinlock, and the reason was
something very similar to the above, except that "spin_is_locked()"
was about our spin_unlock_wait().
Because we had something very much like the above in the exit path,
where we would look at some state and do "spin_unlock_wait()" and
expect to be guaranteed to be the last user after that.
But a few months ago we obviously got rid of spin_unlock_wait exactly
because people were worried about the semantics.
So maybe I just remember an older issue that simply became a non-issue
with that.
Linus
On Mon, Feb 26, 2018 at 09:00:43AM -0800, Linus Torvalds wrote:
> On Mon, Feb 26, 2018 at 8:24 AM, Will Deacon <[email protected]> wrote:
> >
> > Strictly speaking, that's not what we've got implemented on arm64: only
> > the read part of the RmW has Acquire semantics, but there is a total
> > order on the lock/unlock operations for the lock.
>
> Hmm.
>
> I thought we had exactly that bug on some architecture with the queued
> spinlocks, and people decided it was wrong.
>
> But it's possible that I mis-remember, and that we decided it was ok after all.
>
> > spin_lock(&lock);
> > WRITE_ONCE(foo, 42);
> >
> > then another CPU could do:
> >
> > if (smp_load_acquire(&foo) == 42)
> > BUG_ON(!spin_is_locked(&lock));
> >
> > and that could fire. Is that relied on somewhere?
>
> I have a distinct memory that we said the spinlock write is seen in
> order, wrt the writes inside the spinlock, and the reason was
> something very similar to the above, except that "spin_is_locked()"
> was about our spin_unlock_wait().
Yes, we did run into problems with spin_unlock_wait and we ended up
strengthening the arm64 implementation to do an RmW, which puts it into
the total order of lock/unlock operations. However, we then went and
killed the thing because it was seldom used correctly and we struggled
to define what "correctly" even meant!
> Because we had something very much like the above in the exit path,
> where we would look at some state and do "spin_unlock_wait()" and
> expect to be guaranteed to be the last user after that.
>
> But a few months ago we obviously got rid of spin_unlock_wait exactly
> because people were worried about the semantics.
Similarly for spin_can_lock.
> So maybe I just remember an older issue that simply became a non-issue
> with that.
I think so. If we need to, I could make spin_is_locked do an RmW on
arm64 so we can say that all successful spin_* operations are totally
ordered for a given lock, but spin_is_locked is normally only used as a
coarse debug check anyway where it's assumed that if it's held, it's
held by the current CPU. We should probably move most users over to
lockdep and see what we're left with.
Will
On Mon, Feb 26, 2018 at 04:24:27PM +0000, Will Deacon wrote:
> On Mon, Feb 26, 2018 at 08:06:59AM -0800, Linus Torvalds wrote:
> > On Mon, Feb 26, 2018 at 6:21 AM, Luc Maranget <[email protected]> wrote:
> > >
> > > That is, locks are not implemented from more basic primitive but are specified.
> > > The specification can be described as behaving that way:
> > > - A lock behaves as a read-modify-write. the read behaving as a read-acquire
> >
> > This is wrong, or perhaps just misleading.
> >
> > The *whole* r-m-w acts as an acquire. Not just the read part. The
> > write is very much part of it.
> >
> > Maybe that's what you meant, but it read to me as "just the read part
> > of the rmw behaves as a read-acquire".
> >
> > Because it is very important that the _write_ part of the rmw is also
> > ordered wrt everything that is inside the spinlock.
> >
> > So doing a spinlock as
> >
> > (a) read-locked-acquire
> > modify
> > (c) write-conditional
> >
> > would be wrong, because the accesses inside the spinlock are ordered
> > not just wrt the read-acquire, they have to be ordered wrt the write
> > too.
> >
> > So it is closer to say that it's the _write_ of the r-m-w sequence
> > that has the acquire semantics, not the read.
>
> Strictly speaking, that's not what we've got implemented on arm64: only
> the read part of the RmW has Acquire semantics, but there is a total
> order on the lock/unlock operations for the lock. For example, if one
> CPU does:
>
> spin_lock(&lock);
> WRITE_ONCE(foo, 42);
>
> then another CPU could do:
>
> if (smp_load_acquire(&foo) == 42)
> BUG_ON(!spin_is_locked(&lock));
>
Hmm.. this is new to me. So the write part of spin_lock() and the
WRITE_ONCE() will not get reordered? Could you explain more about this
or point me to where I should look in the document? I understand the write
part of spin_lock() must be committed earlier than the WRITE_ONCE() is
committed due to the ll/sc, but I thought the ordering of their arrivals
in the memory system was undefined/arbitrary.
Regards,
Boqun
> and that could fire. Is that relied on somewhere?
>
> Will
On Tue, Feb 27, 2018 at 01:06:35PM +0800, Boqun Feng wrote:
> On Mon, Feb 26, 2018 at 04:24:27PM +0000, Will Deacon wrote:
> > On Mon, Feb 26, 2018 at 08:06:59AM -0800, Linus Torvalds wrote:
> > > On Mon, Feb 26, 2018 at 6:21 AM, Luc Maranget <[email protected]> wrote:
> > > >
> > > > That is, locks are not implemented from more basic primitive but are specified.
> > > > The specification can be described as behaving that way:
> > > > - A lock behaves as a read-modify-write. the read behaving as a read-acquire
> > >
> > > This is wrong, or perhaps just misleading.
> > >
> > > The *whole* r-m-w acts as an acquire. Not just the read part. The
> > > write is very much part of it.
> > >
> > > Maybe that's what you meant, but it read to me as "just the read part
> > > of the rmw behaves as a read-acquire".
> > >
> > > Because it is very important that the _write_ part of the rmw is also
> > > ordered wrt everything that is inside the spinlock.
> > >
> > > So doing a spinlock as
> > >
> > > (a) read-locked-acquire
> > > modify
> > > (c) write-conditional
> > >
> > > would be wrong, because the accesses inside the spinlock are ordered
> > > not just wrt the read-acquire, they have to be ordered wrt the write
> > > too.
> > >
> > > So it is closer to say that it's the _write_ of the r-m-w sequence
> > > that has the acquire semantics, not the read.
> >
> > Strictly speaking, that's not what we've got implemented on arm64: only
> > the read part of the RmW has Acquire semantics, but there is a total
> > order on the lock/unlock operations for the lock. For example, if one
> > CPU does:
> >
> > spin_lock(&lock);
> > WRITE_ONCE(foo, 42);
> >
> > then another CPU could do:
> >
> > if (smp_load_acquire(&foo) == 42)
> > BUG_ON(!spin_is_locked(&lock));
> >
>
> Hmm.. this is new to me. So the write part of spin_lock() and the
> WRITE_ONCE() will not get reordered? Could you explain more about this
> or point where I should look in the document? I understand the write
> part of spin_lock() must be committed earlier than the WRITE_ONCE()
> committed due to the ll/sc, but I thought the ordering of their arrivals
> in memory system is undefined/arbitary.
>
> Regards,
> Boqun
>
> > and that could fire. Is that relied on somewhere?
> >
My bad, I misunderstood your mail. So you were saying the BUG_ON() can
be triggered in current implementations. Never mind my reply above.
Regards,
Boqun
> > Will
Hi Daniel,
On Thu, Feb 22, 2018 at 11:47:57AM -0800, Daniel Lustig wrote:
> On 2/22/2018 10:27 AM, Peter Zijlstra wrote:
> > On Thu, Feb 22, 2018 at 10:13:17AM -0800, Paul E. McKenney wrote:
> >> So we have something that is not all that rare in the Linux kernel
> >> community, namely two conflicting more-or-less concurrent changes.
> >> This clearly needs to be resolved, either by us not strengthening the
> >> Linux-kernel memory model in the way we were planning to or by you
> >> strengthening RISC-V to be no weaker than PowerPC for these sorts of
> >> externally viewed release-acquire situations.
> >>
> >> Other thoughts?
> >
> > Like said in the other email, I would _much_ prefer to not go weaker
> > than PPC, I find that PPC is already painfully weak at times.
>
> Sure, and RISC-V could make this work too by using RCsc instructions
> and/or by using lightweight fences instead. It just wasn't clear at
> first whether smp_load_acquire() and smp_store_release() were RCpc,
> RCsc, or something else, and hence whether RISC-V would actually need
> to use something stronger than pure RCpc there. Likewise for
> spin_unlock()/spin_lock() and everywhere else this comes up.
while digging into riscv's locks and atomics to fix the issues discussed
earlier in this thread, I became aware of another issue:
Considering here the CMPXCHG primitives, for example, I noticed that the
implementation currently provides the four variants
ATOMIC_OPS( , .aqrl)
ATOMIC_OPS(_acquire, .aq)
ATOMIC_OPS(_release, .rl)
ATOMIC_OPS(_relaxed, )
(corresponding, resp., to
atomic_cmpxchg()
atomic_cmpxchg_acquire()
atomic_cmpxchg_release()
atomic_cmpxchg_relaxed() )
so that the first variant above ends up doing
0: lr.w.aqrl %0, %addr
bne %0, %old, 1f
sc.w.aqrl %1, %new, %addr
bnez %1, 0b
1:
From Sect. 2.3.5. ("Acquire/Release Ordering") of the Spec.,
"AMOs with both .aq and .rl set are fully-ordered operations. Treating
the load part and the store part as independent RCsc operations is not
in and of itself sufficient to enforce full fencing behavior, but this
subtle weak behavior is counterintuitive and not much of an advantage
architecturally, especially with lr and sc also available [...]."
I understand that
{ x = y = u = v = 0 }
P0()
WRITE_ONCE(x, 1); ("relaxed" store, sw)
atomic_cmpxchg(&u, 0, 1);
r0 = READ_ONCE(y); ("relaxed" load, lw)
P1()
WRITE_ONCE(y, 1);
atomic_cmpxchg(&v, 0, 1);
r1 = READ_ONCE(x);
could result in (u = v = 1 and r0 = r1 = 0) at the end; can you confirm?
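For reference, with this mapping each process above expands to roughly
the sequence below (shown for P0 only; register choices are illustrative):
ori x1,x0,1            # x1 = 1
sw x1,0(x2)            # WRITE_ONCE(x, 1), "relaxed" store; x2 = &x
0: lr.w.aqrl x3,(x4)   # atomic_cmpxchg(&u, 0, 1); x4 = &u
bne x3,x0,1f           # compare against the old value 0
sc.w.aqrl x5,x1,(x4)   # try to store the new value 1
bnez x5,0b             # retry if the SC failed
1: lw x6,0(x7)         # r0 = READ_ONCE(y), "relaxed" load; x7 = &y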
Thanks,
Andrea
On Thu, 01 Mar 2018 07:11:41 PST (-0800), [email protected] wrote:
> Hi Daniel,
>
> On Thu, Feb 22, 2018 at 11:47:57AM -0800, Daniel Lustig wrote:
>> On 2/22/2018 10:27 AM, Peter Zijlstra wrote:
>> > On Thu, Feb 22, 2018 at 10:13:17AM -0800, Paul E. McKenney wrote:
>> >> So we have something that is not all that rare in the Linux kernel
>> >> community, namely two conflicting more-or-less concurrent changes.
>> >> This clearly needs to be resolved, either by us not strengthening the
>> >> Linux-kernel memory model in the way we were planning to or by you
>> >> strengthening RISC-V to be no weaker than PowerPC for these sorts of
>> >> externally viewed release-acquire situations.
>> >>
>> >> Other thoughts?
>> >
>> > Like said in the other email, I would _much_ prefer to not go weaker
>> > than PPC, I find that PPC is already painfully weak at times.
>>
>> Sure, and RISC-V could make this work too by using RCsc instructions
>> and/or by using lightweight fences instead. It just wasn't clear at
>> first whether smp_load_acquire() and smp_store_release() were RCpc,
>> RCsc, or something else, and hence whether RISC-V would actually need
>> to use something stronger than pure RCpc there. Likewise for
>> spin_unlock()/spin_lock() and everywhere else this comes up.
>
> while digging into riscv's locks and atomics to fix the issues discussed
> earlier in this thread, I became aware of another issue:
>
> Considering here the CMPXCHG primitives, for example, I noticed that the
> implementation currently provides the four variants
>
> ATOMIC_OPS( , .aqrl)
> ATOMIC_OPS(_acquire, .aq)
> ATOMIC_OPS(_release, .rl)
> ATOMIC_OPS(_relaxed, )
>
> (corresponding, resp., to
>
> atomic_cmpxchg()
> atomic_cmpxchg_acquire()
> atomic_cmpxchg_release()
> atomic_cmpxchg_relaxed() )
>
> so that the first variant above ends up doing
>
> 0: lr.w.aqrl %0, %addr
> bne %0, %old, 1f
> sc.w.aqrl %1, %new, %addr
> bnez %1, 0b
> 1:
>
> From Sect. 2.3.5. ("Acquire/Release Ordering") of the Spec.,
>
> "AMOs with both .aq and .rl set are fully-ordered operations. Treating
> the load part and the store part as independent RCsc operations is not
> in and of itself sufficient to enforce full fencing behavior, but this
> subtle weak behavior is counterintuitive and not much of an advantage
> architecturally, especially with lr and sc also available [...]."
>
> I understand that
>
> { x = y = u = v = 0 }
>
> P0()
>
> WRITE_ONCE(x, 1); ("relaxed" store, sw)
> atomic_cmpxchg(&u, 0, 1);
> r0 = READ_ONCE(y); ("relaxed" load, lw)
>
> P1()
>
> WRITE_ONCE(y, 1);
> atomic_cmpxchg(&v, 0, 1);
> r1 = READ_ONCE(x);
>
> could result in (u = v = 1 and r0 = r1 = 0) at the end; can you confirm?
cmpxchg isn't an AMO, it's an LR/SC sequence, so that blurb doesn't apply. I
think "lr.w.aqrl" and "sc.w.aqrl" are not sufficient to perform a fully ordered
operation (i.e., it's an incorrect implementation of atomic_cmpxchg()), but I
was hoping to get some time to actually internalize this part of the RISC-V
memory model at some point to be sure.
On 3/1/2018 1:54 PM, Palmer Dabbelt wrote:
> On Thu, 01 Mar 2018 07:11:41 PST (-0800), [email protected] wrote:
>> Hi Daniel,
>>
>> On Thu, Feb 22, 2018 at 11:47:57AM -0800, Daniel Lustig wrote:
>>> On 2/22/2018 10:27 AM, Peter Zijlstra wrote:
>>> > On Thu, Feb 22, 2018 at 10:13:17AM -0800, Paul E. McKenney wrote:
>>> >> So we have something that is not all that rare in the Linux kernel
>>> >> community, namely two conflicting more-or-less concurrent changes.
>>> >> This clearly needs to be resolved, either by us not strengthening the
>>> >> Linux-kernel memory model in the way we were planning to or by you
>>> >> strengthening RISC-V to be no weaker than PowerPC for these sorts of
>>> >> externally viewed release-acquire situations.
>>> >>
>>> >> Other thoughts?
>>> >
>>> > Like said in the other email, I would _much_ prefer to not go weaker
>>> > than PPC, I find that PPC is already painfully weak at times.
>>>
>>> Sure, and RISC-V could make this work too by using RCsc instructions
>>> and/or by using lightweight fences instead. It just wasn't clear at
>>> first whether smp_load_acquire() and smp_store_release() were RCpc,
>>> RCsc, or something else, and hence whether RISC-V would actually need
>>> to use something stronger than pure RCpc there. Likewise for
>>> spin_unlock()/spin_lock() and everywhere else this comes up.
>>
>> while digging into riscv's locks and atomics to fix the issues discussed
>> earlier in this thread, I became aware of another issue:
>>
>> Considering here the CMPXCHG primitives, for example, I noticed that the
>> implementation currently provides the four variants
>>
>> ATOMIC_OPS( , .aqrl)
>> ATOMIC_OPS(_acquire, .aq)
>> ATOMIC_OPS(_release, .rl)
>> ATOMIC_OPS(_relaxed, )
>>
>> (corresponding, resp., to
>>
>> atomic_cmpxchg()
>> atomic_cmpxchg_acquire()
>> atomic_cmpxchg_release()
>> atomic_cmpxchg_relaxed() )
>>
>> so that the first variant above ends up doing
>>
>> 0: lr.w.aqrl %0, %addr
>> bne %0, %old, 1f
>> sc.w.aqrl %1, %new, %addr
>> bnez %1, 0b
>> 1:
>>
>> From Sect. 2.3.5. ("Acquire/Release Ordering") of the Spec.,
>>
>> "AMOs with both .aq and .rl set are fully-ordered operations. Treating
>> the load part and the store part as independent RCsc operations is not
>> in and of itself sufficient to enforce full fencing behavior, but this
>> subtle weak behavior is counterintuitive and not much of an advantage
>> architecturally, especially with lr and sc also available [...]."
>>
>> I understand that
>>
>> { x = y = u = v = 0 }
>>
>> P0()
>>
>> WRITE_ONCE(x, 1); ("relaxed" store, sw)
>> atomic_cmpxchg(&u, 0, 1);
>> r0 = READ_ONCE(y); ("relaxed" load, lw)
>>
>> P1()
>>
>> WRITE_ONCE(y, 1);
>> atomic_cmpxchg(&v, 0, 1);
>> r1 = READ_ONCE(x);
>>
>> could result in (u = v = 1 and r0 = r1 = 0) at the end; can you confirm?
>
> cmpxchg isn't an AMO, it's an LR/SC sequence, so that blurb doesn't
> apply. I think "lr.w.aqrl" and "sc.w.aqrl" are not sufficient to
> perform a fully ordered operation (i.e., it's an incorrect
> implementation of atomic_cmpxchg()), but I was hoping to get some
> time to actually internalize this part of the RISC-V memory model at
> some point to be sure.
Agreed, the current intention is that we'll add a fence rw,rw in there
when we use lr/sc as the implementation, likely similar to what ARM does:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8e86f0b409a44193f1587e87b69c5dcf8f65be67
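For illustration, that would give the fully-ordered cmpxchg roughly the
shape below; this is only a sketch, and taking the arm64 route of a
release-ordered SC followed by a full fence is an assumption here, not
a committed implementation:
0: lr.w %0, %addr          # LR without .aq
bne %0, %old, 1f
sc.w.rl %1, %new, %addr    # release-ordered SC
bnez %1, 0b
fence rw, rw               # full fence on the success path
1: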
It's also probably not the only piece of synchronization-related code
we'll have to go back and audit as we finalize the RISC-V memory
consistency model over the next couple months or so.
Dan
On Mon, Feb 26, 2018 at 09:00:43AM -0800, Linus Torvalds wrote:
> On Mon, Feb 26, 2018 at 8:24 AM, Will Deacon <[email protected]> wrote:
> >
> > Strictly speaking, that's not what we've got implemented on arm64: only
> > the read part of the RmW has Acquire semantics, but there is a total
> > order on the lock/unlock operations for the lock.
>
> Hmm.
>
> I thought we had exactly that bug on some architecture with the queued
> spinlocks, and people decided it was wrong.
So ARM64 and Power have the acquire-on-load only thing, but qspinlock
has it per construction on anything that allows reordering of stores.
Given that unlock/lock are ordered, which covers about 99% of the users
out there, and that fixing the issue would make things significantly
slower on the weak architectures, we let it be.
But yes, it's a pesky detail.