More than one kernel developer has expressed the opinion that the LKMM
should enforce ordering of writes by locking. In other words, given
the following code:
WRITE_ONCE(x, 1);
	spin_unlock(&s);
spin_lock(&s);
WRITE_ONCE(y, 1);
the stores to x and y should be propagated in order to all other CPUs,
even though those other CPUs might not access the lock s. In terms of
the memory model, this means expanding the cumul-fence relation.
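
As a minimal sketch of the requested guarantee (the observer process P1()
and the enclosing lock/unlock in P0() are added here purely for
illustration; they are not part of the snippet above), the outcome
r1 = 1 && r2 = 0 below would be forbidden even though P1() never
touches s:

	int x, y;
	spinlock_t s;

	P0()
	{
		spin_lock(&s);
		WRITE_ONCE(x, 1);
		spin_unlock(&s);
		spin_lock(&s);
		WRITE_ONCE(y, 1);
		spin_unlock(&s);
	}

	P1()
	{
		int r1, r2;

		r1 = READ_ONCE(y);
		smp_rmb();
		r2 = READ_ONCE(x);
	}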
Locks should also provide read-read (and read-write) ordering in a
similar way. Given:
READ_ONCE(x);
spin_unlock(&s);
spin_lock(&s);
READ_ONCE(y); // or WRITE_ONCE(y, 1);
the load of x should be executed before the load of (or store to) y.
The LKMM already provides this ordering, but it provides it even in
the case where the two accesses are separated by a release/acquire
pair of fences rather than unlock/lock. This would prevent
architectures from using weakly ordered implementations of release and
acquire, which seems like an unnecessary restriction. The patch
therefore removes the ordering requirement from the LKMM for that
case.
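For concreteness, the release/acquire case whose ordering requirement is
removed is sketched below (here s is used as an ordinary shared variable
rather than a lock):

	r1 = READ_ONCE(x);
	smp_store_release(&s, 1);
	r0 = smp_load_acquire(&s);
	r2 = READ_ONCE(y);

With this change the store-release may be forwarded to the load-acquire,
allowing the load of y to execute before the load of x; replacing the
release/acquire pair with spin_unlock(&s)/spin_lock(&s) still forbids
that reordering.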
All the architectures supported by the Linux kernel (including RISC-V)
do provide this ordering for locks, albeit for varying reasons.
Therefore this patch changes the model in accordance with the
developers' wishes.
Signed-off-by: Alan Stern <[email protected]>
---
v.2: Restrict the ordering to lock operations, not general release
and acquire fences.
[as1871b]
tools/memory-model/Documentation/explanation.txt | 186 +++++++---
tools/memory-model/linux-kernel.cat | 8
tools/memory-model/litmus-tests/ISA2+pooncelock+pooncelock+pombonce.litmus | 5
3 files changed, 149 insertions(+), 50 deletions(-)
Index: usb-4.x/tools/memory-model/linux-kernel.cat
===================================================================
--- usb-4.x.orig/tools/memory-model/linux-kernel.cat
+++ usb-4.x/tools/memory-model/linux-kernel.cat
@@ -38,7 +38,7 @@ let strong-fence = mb | gp
(* Release Acquire *)
let acq-po = [Acquire] ; po ; [M]
let po-rel = [M] ; po ; [Release]
-let rfi-rel-acq = [Release] ; rfi ; [Acquire]
+let unlock-rf-lock-po = [UL] ; rf ; [LKR] ; po
(**********************************)
(* Fundamental coherence ordering *)
@@ -60,13 +60,13 @@ let dep = addr | data
let rwdep = (dep | ctrl) ; [W]
let overwrite = co | fr
let to-w = rwdep | (overwrite & int)
-let to-r = addr | (dep ; rfi) | rfi-rel-acq
+let to-r = addr | (dep ; rfi)
let fence = strong-fence | wmb | po-rel | rmb | acq-po
-let ppo = to-r | to-w | fence
+let ppo = to-r | to-w | fence | (unlock-rf-lock-po & int)
(* Propagation: Ordering from release operations and strong fences. *)
let A-cumul(r) = rfe? ; r
-let cumul-fence = A-cumul(strong-fence | po-rel) | wmb
+let cumul-fence = A-cumul(strong-fence | po-rel) | wmb | unlock-rf-lock-po
let prop = (overwrite & ext)? ; cumul-fence* ; rfe?
(*
Index: usb-4.x/tools/memory-model/litmus-tests/ISA2+pooncelock+pooncelock+pombonce.litmus
===================================================================
--- usb-4.x.orig/tools/memory-model/litmus-tests/ISA2+pooncelock+pooncelock+pombonce.litmus
+++ usb-4.x/tools/memory-model/litmus-tests/ISA2+pooncelock+pooncelock+pombonce.litmus
@@ -1,11 +1,10 @@
C ISA2+pooncelock+pooncelock+pombonce.litmus
(*
- * Result: Sometimes
+ * Result: Never
*
- * This test shows that the ordering provided by a lock-protected S
- * litmus test (P0() and P1()) are not visible to external process P2().
- * This is likely to change soon.
+ * This test shows that write-write ordering provided by locks
+ * (in P0() and P1()) is visible to external process P2().
*)
{}
Index: usb-4.x/tools/memory-model/Documentation/explanation.txt
===================================================================
--- usb-4.x.orig/tools/memory-model/Documentation/explanation.txt
+++ usb-4.x/tools/memory-model/Documentation/explanation.txt
@@ -28,7 +28,8 @@ Explanation of the Linux-Kernel Memory C
20. THE HAPPENS-BEFORE RELATION: hb
21. THE PROPAGATES-BEFORE RELATION: pb
22. RCU RELATIONS: rcu-link, gp, rscs, rcu-fence, and rb
- 23. ODDS AND ENDS
+ 23. LOCKING
+ 24. ODDS AND ENDS
@@ -1067,28 +1068,6 @@ allowing out-of-order writes like this t
violating the write-write coherence rule by requiring the CPU not to
send the W write to the memory subsystem at all!)
-There is one last example of preserved program order in the LKMM: when
-a load-acquire reads from an earlier store-release. For example:
-
- smp_store_release(&x, 123);
- r1 = smp_load_acquire(&x);
-
-If the smp_load_acquire() ends up obtaining the 123 value that was
-stored by the smp_store_release(), the LKMM says that the load must be
-executed after the store; the store cannot be forwarded to the load.
-This requirement does not arise from the operational model, but it
-yields correct predictions on all architectures supported by the Linux
-kernel, although for differing reasons.
-
-On some architectures, including x86 and ARMv8, it is true that the
-store cannot be forwarded to the load. On others, including PowerPC
-and ARMv7, smp_store_release() generates object code that starts with
-a fence and smp_load_acquire() generates object code that ends with a
-fence. The upshot is that even though the store may be forwarded to
-the load, it is still true that any instruction preceding the store
-will be executed before the load or any following instructions, and
-the store will be executed before any instruction following the load.
-
AND THEN THERE WAS ALPHA
------------------------
@@ -1766,6 +1745,147 @@ before it does, and the critical section
grace period does and ends after it does.
+LOCKING
+-------
+
+The LKMM includes locking. In fact, there is special code for locking
+in the formal model, added in order to make tools run faster.
+However, this special code is intended to be more or less equivalent
+to concepts we have already covered. A spinlock_t variable is treated
+the same as an int, and spin_lock(&s) is treated almost the same as:
+
+ while (cmpxchg_acquire(&s, 0, 1) != 0)
+ cpu_relax();
+
+This waits until s is equal to 0 and then atomically sets it to 1,
+and the read part of the cmpxchg operation acts as an acquire fence.
+An alternate way to express the same thing would be:
+
+ r = xchg_acquire(&s, 1);
+
+along with a requirement that at the end, r = 0. Similarly,
+spin_trylock(&s) is treated almost the same as:
+
+ return !cmpxchg_acquire(&s, 0, 1);
+
+which atomically sets s to 1 if it is currently equal to 0 and returns
+true if it succeeds (the read part of the cmpxchg operation acts as an
+acquire fence only if the operation is successful). spin_unlock(&s)
+is treated almost the same as:
+
+ smp_store_release(&s, 0);
+
+The "almost" qualifiers above need some explanation. In the LMKM, the
+store-release in a spin_unlock() and the load-acquire which forms the
+first half of the atomic rmw update in a spin_lock() or a successful
+spin_trylock() -- we can call these things lock-releases and
+lock-acquires -- have two properties beyond those of ordinary releases
+and acquires.
+
+First, when a lock-acquire reads from a lock-release, the LKMM
+requires that every instruction po-before the lock-release must
+execute before any instruction po-after the lock-acquire. This would
+naturally hold if the release and acquire operations were on different
+CPUs, but the LKMM says it holds even when they are on the same CPU.
+For example:
+
+ int x, y;
+ spinlock_t s;
+
+ P0()
+ {
+ int r1, r2;
+
+ spin_lock(&s);
+ r1 = READ_ONCE(x);
+ spin_unlock(&s);
+ spin_lock(&s);
+ r2 = READ_ONCE(y);
+ spin_unlock(&s);
+ }
+
+ P1()
+ {
+ WRITE_ONCE(y, 1);
+ smp_wmb();
+ WRITE_ONCE(x, 1);
+ }
+
+Here the second spin_lock() reads from the first spin_unlock(), and
+therefore the load of x must execute before the load of y. Thus we
+cannot have r1 = 1 and r2 = 0 at the end (this is an instance of the
+MP pattern).
+
+This requirement does not apply to ordinary release and acquire
+fences, only to lock-related operations. For instance, suppose P0()
+in the example had been written as:
+
+ P0()
+ {
+ int r1, r2, r3;
+
+ r1 = READ_ONCE(x);
+ smp_store_release(&s, 1);
+ r3 = smp_load_acquire(&s);
+ r2 = READ_ONCE(y);
+ }
+
+Then the CPU would be allowed to forward the s = 1 value from the
+smp_store_release() to the smp_load_acquire(), executing the
+instructions in the following order:
+
+ r3 = smp_load_acquire(&s); // Obtains r3 = 1
+ r2 = READ_ONCE(y);
+ r1 = READ_ONCE(x);
+ smp_store_release(&s, 1); // Value is forwarded
+
+and thus it could load y before x, obtaining r2 = 0 and r1 = 1.
+
+Second, when a lock-acquire reads from a lock-release, and some other
+stores W and W' occur po-before the lock-release and po-after the
+lock-acquire respectively, the LKMM requires that W must propagate to
+each CPU before W' does. For example, consider:
+
+ int x, y;
+ spinlock_t s;
+
+ P0()
+ {
+ spin_lock(&s);
+ WRITE_ONCE(x, 1);
+ spin_unlock(&s);
+ }
+
+ P1()
+ {
+ int r1;
+
+ spin_lock(&s);
+ r1 = READ_ONCE(x);
+ WRITE_ONCE(y, 1);
+ spin_unlock(&s);
+ }
+
+ P2()
+ {
+ int r2, r3;
+
+ r2 = READ_ONCE(y);
+ smp_rmb();
+ r3 = READ_ONCE(x);
+ }
+
+If r1 = 1 at the end then the spin_lock() in P1 must have read from
+the spin_unlock() in P0. Hence the store to x must propagate to P2
+before the store to y does, so we cannot have r2 = 1 and r3 = 0.
+
+These two special requirements for lock-release and lock-acquire do
+not arise from the operational model. Nevertheless, kernel developers
+have come to expect and rely on them because they do hold on all
+architectures supported by the Linux kernel, albeit for various
+differing reasons.
+
+
ODDS AND ENDS
-------------
@@ -1831,26 +1951,6 @@ they behave as follows:
events and the events preceding them against all po-later
events.
-The LKMM includes locking. In fact, there is special code for locking
-in the formal model, added in order to make tools run faster.
-However, this special code is intended to be exactly equivalent to
-concepts we have already covered. A spinlock_t variable is treated
-the same as an int, and spin_lock(&s) is treated the same as:
-
- while (cmpxchg_acquire(&s, 0, 1) != 0)
- cpu_relax();
-
-which waits until s is equal to 0 and then atomically sets it to 1,
-and where the read part of the atomic update is also an acquire fence.
-An alternate way to express the same thing would be:
-
- r = xchg_acquire(&s, 1);
-
-along with a requirement that at the end, r = 0. spin_unlock(&s) is
-treated the same as:
-
- smp_store_release(&s, 0);
-
Interestingly, RCU and locking each introduce the possibility of
deadlock. When faced with code sequences such as:
On Mon, Jul 09, 2018 at 04:01:57PM -0400, Alan Stern wrote:
> More than one kernel developer has expressed the opinion that the LKMM
> should enforce ordering of writes by locking. In other words, given
> the following code:
>
> WRITE_ONCE(x, 1);
> spin_unlock(&s):
> spin_lock(&s);
> WRITE_ONCE(y, 1);
>
> the stores to x and y should be propagated in order to all other CPUs,
> even though those other CPUs might not access the lock s. In terms of
> the memory model, this means expanding the cumul-fence relation.
>
> Locks should also provide read-read (and read-write) ordering in a
> similar way. Given:
>
> READ_ONCE(x);
> spin_unlock(&s);
> spin_lock(&s);
> READ_ONCE(y); // or WRITE_ONCE(y, 1);
>
> the load of x should be executed before the load of (or store to) y.
> The LKMM already provides this ordering, but it provides it even in
> the case where the two accesses are separated by a release/acquire
> pair of fences rather than unlock/lock. This would prevent
> architectures from using weakly ordered implementations of release and
> acquire, which seems like an unnecessary restriction. The patch
> therefore removes the ordering requirement from the LKMM for that
> case.
>
> All the architectures supported by the Linux kernel (including RISC-V)
> do provide this ordering for locks, albeit for varying reasons.
> Therefore this patch changes the model in accordance with the
> developers' wishes.
>
> Signed-off-by: Alan Stern <[email protected]>
Nice!
However, it doesn't apply against current -rcu. Am I missing a patch?
Or is this supposed to apply against origin/lkmm?
Thanx, Paul
On Mon, Jul 09, 2018 at 04:01:57PM -0400, Alan Stern wrote:
> More than one kernel developer has expressed the opinion that the LKMM
> should enforce ordering of writes by locking. In other words, given
I'd like to step back on this point: I still don't have a strong opinion
on this, but all this debating made me curious about others' opinion ;-)
I'd like to see the above argument expanded: what's the rationale behind
that opinion? can we maybe add references to actual code relying on that
ordering? or other reasons that I've been missing?
I'd extend these same questions to the "ordering of reads" snippet below
(and discussed since so long...).
> the following code:
>
> WRITE_ONCE(x, 1);
> spin_unlock(&s):
> spin_lock(&s);
> WRITE_ONCE(y, 1);
>
> the stores to x and y should be propagated in order to all other CPUs,
> even though those other CPUs might not access the lock s. In terms of
> the memory model, this means expanding the cumul-fence relation.
>
> Locks should also provide read-read (and read-write) ordering in a
> similar way. Given:
>
> READ_ONCE(x);
> spin_unlock(&s);
> spin_lock(&s);
> READ_ONCE(y); // or WRITE_ONCE(y, 1);
>
> the load of x should be executed before the load of (or store to) y.
> The LKMM already provides this ordering, but it provides it even in
> the case where the two accesses are separated by a release/acquire
> pair of fences rather than unlock/lock. This would prevent
> architectures from using weakly ordered implementations of release and
> acquire, which seems like an unnecessary restriction. The patch
> therefore removes the ordering requirement from the LKMM for that
> case.
IIUC, the same argument could be used to support the removal of the new
unlock-rf-lock-po (we already discussed riscv .aq/.rl, it doesn't seem
hard to imagine an arm64 LDAPR-exclusive, or the adoption of ctrl+isync
on powerpc). Why are we effectively preventing their adoption? Again,
I'd like to see more details about the underlying motivations...
>
> All the architectures supported by the Linux kernel (including RISC-V)
> do provide this ordering for locks, albeit for varying reasons.
> Therefore this patch changes the model in accordance with the
> developers' wishes.
>
> Signed-off-by: Alan Stern <[email protected]>
>
> ---
>
> v.2: Restrict the ordering to lock operations, not general release
> and acquire fences.
This is another controversial point, and one that makes me shiver ...
I have the impression that we're dismissing the suggestion "RMW-acquire
on par with LKR" a bit too hastily. So, this patch is implying that:
while (cmpxchg_acquire(&s, 0, 1) != 0)
cpu_relax();
is _not_ a valid implementation of spin_lock()! or, at least, it is not
when paired with an smp_store_release(). Will was anticipating inserting
arch hooks into the (generic) qspinlock code, when we know that similar
patterns are spread all over in (q)rwlocks, mutexes, rwsem, ... (please
also notice that the informal documentation is currently treating these
synchronization mechanisms equally as far as "ordering" is concerned...).
This distinction between locking operations and "other acquires" appears
to me not only unmotivated but also extremely _fragile_ (difficult to
use/maintain) when considering the analysis of synchronization mechanisms
such as those mentioned above or their porting for new arch.
Please see below for a couple of minor comments.
>
> [as1871b]
>
>
> tools/memory-model/Documentation/explanation.txt | 186 +++++++---
> tools/memory-model/linux-kernel.cat | 8
> tools/memory-model/litmus-tests/ISA2+pooncelock+pooncelock+pombonce.litmus | 5
> 3 files changed, 149 insertions(+), 50 deletions(-)
>
> Index: usb-4.x/tools/memory-model/linux-kernel.cat
> ===================================================================
> --- usb-4.x.orig/tools/memory-model/linux-kernel.cat
> +++ usb-4.x/tools/memory-model/linux-kernel.cat
> @@ -38,7 +38,7 @@ let strong-fence = mb | gp
> (* Release Acquire *)
> let acq-po = [Acquire] ; po ; [M]
> let po-rel = [M] ; po ; [Release]
> -let rfi-rel-acq = [Release] ; rfi ; [Acquire]
> +let unlock-rf-lock-po = [UL] ; rf ; [LKR] ; po
>
> (**********************************)
> (* Fundamental coherence ordering *)
> @@ -60,13 +60,13 @@ let dep = addr | data
> let rwdep = (dep | ctrl) ; [W]
> let overwrite = co | fr
> let to-w = rwdep | (overwrite & int)
> -let to-r = addr | (dep ; rfi) | rfi-rel-acq
> +let to-r = addr | (dep ; rfi)
> let fence = strong-fence | wmb | po-rel | rmb | acq-po
> -let ppo = to-r | to-w | fence
> +let ppo = to-r | to-w | fence | (unlock-rf-lock-po & int)
>
> (* Propagation: Ordering from release operations and strong fences. *)
> let A-cumul(r) = rfe? ; r
> -let cumul-fence = A-cumul(strong-fence | po-rel) | wmb
> +let cumul-fence = A-cumul(strong-fence | po-rel) | wmb | unlock-rf-lock-po
> let prop = (overwrite & ext)? ; cumul-fence* ; rfe?
>
> (*
> Index: usb-4.x/tools/memory-model/litmus-tests/ISA2+pooncelock+pooncelock+pombonce.litmus
> ===================================================================
> --- usb-4.x.orig/tools/memory-model/litmus-tests/ISA2+pooncelock+pooncelock+pombonce.litmus
> +++ usb-4.x/tools/memory-model/litmus-tests/ISA2+pooncelock+pooncelock+pombonce.litmus
> @@ -1,11 +1,10 @@
> C ISA2+pooncelock+pooncelock+pombonce.litmus
>
> (*
> - * Result: Sometimes
> + * Result: Never
> *
> - * This test shows that the ordering provided by a lock-protected S
> - * litmus test (P0() and P1()) are not visible to external process P2().
> - * This is likely to change soon.
> + * This test shows that write-write ordering provided by locks
> + * (in P0() and P1()) is visible to external process P2().
> *)
>
> {}
> Index: usb-4.x/tools/memory-model/Documentation/explanation.txt
> ===================================================================
> --- usb-4.x.orig/tools/memory-model/Documentation/explanation.txt
> +++ usb-4.x/tools/memory-model/Documentation/explanation.txt
> @@ -28,7 +28,8 @@ Explanation of the Linux-Kernel Memory C
> 20. THE HAPPENS-BEFORE RELATION: hb
> 21. THE PROPAGATES-BEFORE RELATION: pb
> 22. RCU RELATIONS: rcu-link, gp, rscs, rcu-fence, and rb
> - 23. ODDS AND ENDS
> + 23. LOCKING
> + 24. ODDS AND ENDS
>
>
>
> @@ -1067,28 +1068,6 @@ allowing out-of-order writes like this t
> violating the write-write coherence rule by requiring the CPU not to
> send the W write to the memory subsystem at all!)
>
> -There is one last example of preserved program order in the LKMM: when
> -a load-acquire reads from an earlier store-release. For example:
> -
> - smp_store_release(&x, 123);
> - r1 = smp_load_acquire(&x);
> -
> -If the smp_load_acquire() ends up obtaining the 123 value that was
> -stored by the smp_store_release(), the LKMM says that the load must be
> -executed after the store; the store cannot be forwarded to the load.
> -This requirement does not arise from the operational model, but it
> -yields correct predictions on all architectures supported by the Linux
> -kernel, although for differing reasons.
> -
> -On some architectures, including x86 and ARMv8, it is true that the
> -store cannot be forwarded to the load. On others, including PowerPC
> -and ARMv7, smp_store_release() generates object code that starts with
> -a fence and smp_load_acquire() generates object code that ends with a
> -fence. The upshot is that even though the store may be forwarded to
> -the load, it is still true that any instruction preceding the store
> -will be executed before the load or any following instructions, and
> -the store will be executed before any instruction following the load.
> -
>
> AND THEN THERE WAS ALPHA
> ------------------------
> @@ -1766,6 +1745,147 @@ before it does, and the critical section
> grace period does and ends after it does.
>
>
> +LOCKING
> +-------
> +
> +The LKMM includes locking. In fact, there is special code for locking
> +in the formal model, added in order to make tools run faster.
> +However, this special code is intended to be more or less equivalent
> +to concepts we have already covered. A spinlock_t variable is treated
> +the same as an int, and spin_lock(&s) is treated almost the same as:
> +
> + while (cmpxchg_acquire(&s, 0, 1) != 0)
> + cpu_relax();
> +
> +This waits until s is equal to 0 and then atomically sets it to 1,
> +and the read part of the cmpxchg operation acts as an acquire fence.
> +An alternate way to express the same thing would be:
> +
> + r = xchg_acquire(&s, 1);
> +
> +along with a requirement that at the end, r = 0. Similarly,
> +spin_trylock(&s) is treated almost the same as:
> +
> + return !cmpxchg_acquire(&s, 0, 1);
> +
> +which atomically sets s to 1 if it is currently equal to 0 and returns
> +true if it succeeds (the read part of the cmpxchg operation acts as an
> +acquire fence only if the operation is successful). spin_unlock(&s)
> +is treated almost the same as:
> +
> + smp_store_release(&s, 0);
> +
> +The "almost" qualifiers above need some explanation. In the LMKM, the
memory-barriers.txt seems to also need an update in this regard: e.g.,
"VARIETIES OF MEMORY BARRIERS" currently has:
ACQUIRE operations include LOCK operations and both smp_load_acquire()
and smp_cond_acquire() operations. [BTW, the latter was replaced by
smp_cond_load_acquire() in 1f03e8d2919270 ...]
RELEASE operations include UNLOCK operations and smp_store_release()
operations. [...]
[...] after an ACQUIRE on a given variable, all memory accesses
preceding any prior RELEASE on that same variable are guaranteed
to be visible.
Please see also "LOCK ACQUISITION FUNCTIONS".
> +store-release in a spin_unlock() and the load-acquire which forms the
> +first half of the atomic rmw update in a spin_lock() or a successful
> +spin_trylock() -- we can call these things lock-releases and
> +lock-acquires -- have two properties beyond those of ordinary releases
> +and acquires.
> +
> +First, when a lock-acquire reads from a lock-release, the LKMM
> +requires that every instruction po-before the lock-release must
> +execute before any instruction po-after the lock-acquire. This would
> +naturally hold if the release and acquire operations were on different
> +CPUs, but the LKMM says it holds even when they are on the same CPU.
> +For example:
> +
> + int x, y;
> + spinlock_t s;
> +
> + P0()
> + {
> + int r1, r2;
> +
> + spin_lock(&s);
> + r1 = READ_ONCE(x);
> + spin_unlock(&s);
> + spin_lock(&s);
> + r2 = READ_ONCE(y);
> + spin_unlock(&s);
> + }
> +
> + P1()
> + {
> + WRITE_ONCE(y, 1);
> + smp_wmb();
> + WRITE_ONCE(x, 1);
> + }
> +
> +Here the second spin_lock() reads from the first spin_unlock(), and
> +therefore the load of x must execute before the load of y. Thus we
> +cannot have r1 = 1 and r2 = 0 at the end (this is an instance of the
> +MP pattern).
> +
> +This requirement does not apply to ordinary release and acquire
> +fences, only to lock-related operations. For instance, suppose P0()
> +in the example had been written as:
> +
> + P0()
> + {
> + int r1, r2, r3;
> +
> + r1 = READ_ONCE(x);
> + smp_store_release(&s, 1);
> + r3 = smp_load_acquire(&s);
> + r2 = READ_ONCE(y);
> + }
> +
> +Then the CPU would be allowed to forward the s = 1 value from the
> +smp_store_release() to the smp_load_acquire(), executing the
> +instructions in the following order:
> +
> + r3 = smp_load_acquire(&s); // Obtains r3 = 1
> + r2 = READ_ONCE(y);
> + r1 = READ_ONCE(x);
> + smp_store_release(&s, 1); // Value is forwarded
> +
> +and thus it could load y before x, obtaining r2 = 0 and r1 = 1.
> +
> +Second, when a lock-acquire reads from a lock-release, and some other
> +stores W and W' occur po-before the lock-release and po-after the
> +lock-acquire respectively, the LKMM requires that W must propagate to
> +each CPU before W' does. For example, consider:
> +
> + int x, y;
> + spinlock_t x;
> +
> + P0()
> + {
> + spin_lock(&s);
> + WRITE_ONCE(x, 1);
> + spin_unlock(&s);
> + }
> +
> + P1()
> + {
> + int r1;
> +
> + spin_lock(&s);
> + r1 = READ_ONCE(x);
> + WRITE_ONCE(y, 1);
> + spin_unlock(&s);
> + }
> +
> + P2()
> + {
> + int r2, r3;
> +
> + r2 = READ_ONCE(y);
> + smp_rmb();
> + r3 = READ_ONCE(x);
> + }
Commit 047213158996f2 in -rcu/dev used the above test to illustrate a
property of smp_mb__after_spinlock(), cf. its header comment; if we
accept this patch, we should consider updating that comment.
Andrea
> +
> +If r1 = 1 at the end then the spin_lock() in P1 must have read from
> +the spin_unlock() in P0. Hence the store to x must propagate to P2
> +before the store to y does, so we cannot have r2 = 1 and r3 = 0.
> +
> +These two special requirements for lock-release and lock-acquire do
> +not arise from the operational model. Nevertheless, kernel developers
> +have come to expect and rely on them because they do hold on all
> +architectures supported by the Linux kernel, albeit for various
> +differing reasons.
> +
> +
> ODDS AND ENDS
> -------------
>
> @@ -1831,26 +1951,6 @@ they behave as follows:
> events and the events preceding them against all po-later
> events.
>
> -The LKMM includes locking. In fact, there is special code for locking
> -in the formal model, added in order to make tools run faster.
> -However, this special code is intended to be exactly equivalent to
> -concepts we have already covered. A spinlock_t variable is treated
> -the same as an int, and spin_lock(&s) is treated the same as:
> -
> - while (cmpxchg_acquire(&s, 0, 1) != 0)
> - cpu_relax();
> -
> -which waits until s is equal to 0 and then atomically sets it to 1,
> -and where the read part of the atomic update is also an acquire fence.
> -An alternate way to express the same thing would be:
> -
> - r = xchg_acquire(&s, 1);
> -
> -along with a requirement that at the end, r = 0. spin_unlock(&s) is
> -treated the same as:
> -
> - smp_store_release(&s, 0);
> -
> Interestingly, RCU and locking each introduce the possibility of
> deadlock. When faced with code sequences such as:
>
>
On Mon, 9 Jul 2018, Paul E. McKenney wrote:
> On Mon, Jul 09, 2018 at 04:01:57PM -0400, Alan Stern wrote:
> > More than one kernel developer has expressed the opinion that the LKMM
> > should enforce ordering of writes by locking. In other words, given
> > the following code:
> >
> > WRITE_ONCE(x, 1);
> > spin_unlock(&s):
> > spin_lock(&s);
> > WRITE_ONCE(y, 1);
> >
> > the stores to x and y should be propagated in order to all other CPUs,
> > even though those other CPUs might not access the lock s. In terms of
> > the memory model, this means expanding the cumul-fence relation.
> >
> > Locks should also provide read-read (and read-write) ordering in a
> > similar way. Given:
> >
> > READ_ONCE(x);
> > spin_unlock(&s);
> > spin_lock(&s);
> > READ_ONCE(y); // or WRITE_ONCE(y, 1);
> >
> > the load of x should be executed before the load of (or store to) y.
> > The LKMM already provides this ordering, but it provides it even in
> > the case where the two accesses are separated by a release/acquire
> > pair of fences rather than unlock/lock. This would prevent
> > architectures from using weakly ordered implementations of release and
> > acquire, which seems like an unnecessary restriction. The patch
> > therefore removes the ordering requirement from the LKMM for that
> > case.
> >
> > All the architectures supported by the Linux kernel (including RISC-V)
> > do provide this ordering for locks, albeit for varying reasons.
> > Therefore this patch changes the model in accordance with the
> > developers' wishes.
> >
> > Signed-off-by: Alan Stern <[email protected]>
>
> Nice!
>
> However, it doesn't apply against current -rcu. Am I missing a patch?
> Or is this supposed to apply against origin/lkmm?
I wrote it based on 4.18-rc. However, I can rebase it against your
current dev branch.
Alan
> > ACQUIRE operations include LOCK operations and both smp_load_acquire()
> > and smp_cond_acquire() operations. [BTW, the latter was replaced by
> > smp_cond_load_acquire() in 1f03e8d2919270 ...]
> >
> > RELEASE operations include UNLOCK operations and smp_store_release()
> > operations. [...]
> >
> > [...] after an ACQUIRE on a given variable, all memory accesses
> > preceding any prior RELEASE on that same variable are guaranteed
> > to be visible.
>
> As far as I can see, these statements remain valid.
Interesting ;-) What do these statements tell you ;-) when applied
to a: and b: below?
a: WRITE_ONCE(x, 1); // "preceding any prior RELEASE..."
smp_store_release(&s, 1);
smp_load_acquire(&s);
b: WRITE_ONCE(y, 1); // "after an ACQUIRE..."
Andrea
On Tue, 10 Jul 2018, Andrea Parri wrote:
> > > ACQUIRE operations include LOCK operations and both smp_load_acquire()
> > > and smp_cond_acquire() operations. [BTW, the latter was replaced by
> > > smp_cond_load_acquire() in 1f03e8d2919270 ...]
> > >
> > > RELEASE operations include UNLOCK operations and smp_store_release()
> > > operations. [...]
> > >
> > > [...] after an ACQUIRE on a given variable, all memory accesses
> > > preceding any prior RELEASE on that same variable are guaranteed
> > > to be visible.
> >
> > As far as I can see, these statements remain valid.
>
> Interesting; ;-) What does these statement tells you ;-) when applied
> to a: and b: below?
>
> a: WRITE_ONCE(x, 1); // "preceding any prior RELEASE..."
> smp_store_release(&s, 1);
> smp_load_acquire(&s);
> b: WRITE_ONCE(y, 1); // "after an ACQUIRE..."
The first statement tells me that b: follows an ACQUIRE.
The second tells me that a: precedes a RELEASE.
And the third tells me that any READ_ONCE(x) statements coming po-after
b: would see x = 1 or a later value of x. (Of course, they would have
to see that anyway because of the cache coherency rules.)
More to the point, given:
P0()
{
WRITE_ONCE(x, 1);
a: smp_store_release(&s, 1);
}
P1()
{
b: r1 = smp_load_acquire(&s);
r2 = READ_ONCE(x);
}
the third statement tells me that if r1 = 1 (that is, if a: is prior to
b:) then r2 must be 1.
Alan
On Tue, Jul 10, 2018 at 09:57:17AM -0400, Alan Stern wrote:
> On Mon, 9 Jul 2018, Paul E. McKenney wrote:
>
> > On Mon, Jul 09, 2018 at 04:01:57PM -0400, Alan Stern wrote:
> > > More than one kernel developer has expressed the opinion that the LKMM
> > > should enforce ordering of writes by locking. In other words, given
> > > the following code:
> > >
> > > WRITE_ONCE(x, 1);
> > > spin_unlock(&s):
> > > spin_lock(&s);
> > > WRITE_ONCE(y, 1);
> > >
> > > the stores to x and y should be propagated in order to all other CPUs,
> > > even though those other CPUs might not access the lock s. In terms of
> > > the memory model, this means expanding the cumul-fence relation.
> > >
> > > Locks should also provide read-read (and read-write) ordering in a
> > > similar way. Given:
> > >
> > > READ_ONCE(x);
> > > spin_unlock(&s);
> > > spin_lock(&s);
> > > READ_ONCE(y); // or WRITE_ONCE(y, 1);
> > >
> > > the load of x should be executed before the load of (or store to) y.
> > > The LKMM already provides this ordering, but it provides it even in
> > > the case where the two accesses are separated by a release/acquire
> > > pair of fences rather than unlock/lock. This would prevent
> > > architectures from using weakly ordered implementations of release and
> > > acquire, which seems like an unnecessary restriction. The patch
> > > therefore removes the ordering requirement from the LKMM for that
> > > case.
> > >
> > > All the architectures supported by the Linux kernel (including RISC-V)
> > > do provide this ordering for locks, albeit for varying reasons.
> > > Therefore this patch changes the model in accordance with the
> > > developers' wishes.
> > >
> > > Signed-off-by: Alan Stern <[email protected]>
> >
> > Nice!
> >
> > However, it doesn't apply against current -rcu. Am I missing a patch?
> > Or is this supposed to apply against origin/lkmm?
>
> I wrote it based on 4.18-rc. However, I can rebase it against your
> current dev branch.
Could you please? Against either the dev or lkmm branch would work well.
If you don't have time for this, my approach would be to apply against
4.18-rc, then cherry-pick onto my branch, resolving the conflicts and
emailing you both the "<<<<"-marked file and my proposed resolution.
(Or git might just resolve everything automatically -- that does
sometimes happen. But it would still be good to double-check its work,
as it sometimes does "interesting" resolutions.)
Thanx, Paul
On Tue, 10 Jul 2018, Andrea Parri wrote:
> On Mon, Jul 09, 2018 at 04:01:57PM -0400, Alan Stern wrote:
> > More than one kernel developer has expressed the opinion that the LKMM
> > should enforce ordering of writes by locking. In other words, given
>
> I'd like to step back on this point: I still don't have a strong opinion
> on this, but all this debating made me curious about others' opinion ;-)
> I'd like to see the above argument expanded: what's the rationale behind
> that opinion? can we maybe add references to actual code relying on that
> ordering? other that I've been missing?
>
> I'd extend these same questions to the "ordering of reads" snippet below
> (and discussed since so long...).
>
>
> > the following code:
> >
> > WRITE_ONCE(x, 1);
> > spin_unlock(&s):
> > spin_lock(&s);
> > WRITE_ONCE(y, 1);
> >
> > the stores to x and y should be propagated in order to all other CPUs,
> > even though those other CPUs might not access the lock s. In terms of
> > the memory model, this means expanding the cumul-fence relation.
> >
> > Locks should also provide read-read (and read-write) ordering in a
> > similar way. Given:
> >
> > READ_ONCE(x);
> > spin_unlock(&s);
> > spin_lock(&s);
> > READ_ONCE(y); // or WRITE_ONCE(y, 1);
> >
> > the load of x should be executed before the load of (or store to) y.
> > The LKMM already provides this ordering, but it provides it even in
> > the case where the two accesses are separated by a release/acquire
> > pair of fences rather than unlock/lock. This would prevent
> > architectures from using weakly ordered implementations of release and
> > acquire, which seems like an unnecessary restriction. The patch
> > therefore removes the ordering requirement from the LKMM for that
> > case.
>
> IIUC, the same argument could be used to support the removal of the new
> unlock-rf-lock-po (we already discussed riscv .aq/.rl, it doesn't seem
> hard to imagine an arm64 LDAPR-exclusive, or the adoption of ctrl+isync
> on powerpc). Why are we effectively preventing their adoption? Again,
> I'd like to see more details about the underlying motivations...
>
>
> >
> > All the architectures supported by the Linux kernel (including RISC-V)
> > do provide this ordering for locks, albeit for varying reasons.
> > Therefore this patch changes the model in accordance with the
> > developers' wishes.
> >
> > Signed-off-by: Alan Stern <[email protected]>
> >
> > ---
> >
> > v.2: Restrict the ordering to lock operations, not general release
> > and acquire fences.
>
> This is another controversial point, and one that makes me shivering ...
>
> I have the impression that we're dismissing the suggestion "RMW-acquire
> at par with LKR" with a bit of rush. So, this patch is implying that:
>
> while (cmpxchg_acquire(&s, 0, 1) != 0)
> cpu_relax();
>
> is _not_ a valid implementation of spin_lock()! or, at least, it is not
> when paired with an smp_store_release().
At least, it's not a valid general-purpose implementation. For a lot
of architectures it would be okay, but it might not be okay (for
example) on RISC-V.
> Will was anticipating inserting
> arch hooks into the (generic) qspinlock code, when we know that similar
> patterns are spread all over in (q)rwlocks, mutexes, rwsem, ... (please
> also notice that the informal documentation is currently treating these
> synchronization mechanisms equally as far as "ordering" is concerned...).
>
> This distinction between locking operations and "other acquires" appears
> to me not only unmotivated but also extremely _fragile (difficult to use
> /maintain) when considering the analysis of synchronization mechanisms
> such as those mentioned above or their porting for new arch.
I will leave these points for others to discuss.
> memory-barriers.txt seems to also need an update on this regard: e.g.,
> "VARIETIES OF MEMORY BARRIERS" currently has:
>
> ACQUIRE operations include LOCK operations and both smp_load_acquire()
> and smp_cond_acquire() operations. [BTW, the latter was replaced by
> smp_cond_load_acquire() in 1f03e8d2919270 ...]
>
> RELEASE operations include UNLOCK operations and smp_store_release()
> operations. [...]
>
> [...] after an ACQUIRE on a given variable, all memory accesses
> preceding any prior RELEASE on that same variable are guaranteed
> to be visible.
As far as I can see, these statements remain valid.
> Please see also "LOCK ACQUISITION FUNCTIONS".
The (3) and (4) entries in that section's list seem redundant.
However, we should point out that one of the reorderings discussed
later on in that section would be disallowed if the RELEASE and ACQUIRE
were locking actions.
> > +
> > + int x, y;
> > + spinlock_t x;
> > +
> > + P0()
> > + {
> > + spin_lock(&s);
> > + WRITE_ONCE(x, 1);
> > + spin_unlock(&s);
> > + }
> > +
> > + P1()
> > + {
> > + int r1;
> > +
> > + spin_lock(&s);
> > + r1 = READ_ONCE(x);
> > + WRITE_ONCE(y, 1);
> > + spin_unlock(&s);
> > + }
> > +
> > + P2()
> > + {
> > + int r2, r3;
> > +
> > + r2 = READ_ONCE(y);
> > + smp_rmb();
> > + r3 = READ_ONCE(x);
> > + }
>
> Commit 047213158996f2 in -rcu/dev used the above test to illustrate a
> property of smp_mb__after_spinlock(), c.f., its header comment; if we
> accept this patch, we should consider updating that comment.
Indeed, the use of smp_mb__after_spinlock() illustrated in that comment
would become unnecessary.
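
For reference, the usage illustrated there presumably amounts to something
like the following variant of P1() from the test above (a sketch, not a
quote of that comment); with this patch the barrier would no longer be
needed for this particular guarantee:

	P1()
	{
		int r1;

		spin_lock(&s);
		smp_mb__after_spinlock();	/* full barrier after taking the lock */
		r1 = READ_ONCE(x);
		WRITE_ONCE(y, 1);
		spin_unlock(&s);
	}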
Alan
On 7/9/2018 1:01 PM, Alan Stern wrote:
> More than one kernel developer has expressed the opinion that the LKMM
> should enforce ordering of writes by locking. In other words, given
> the following code:
>
> WRITE_ONCE(x, 1);
> spin_unlock(&s):
> spin_lock(&s);
> WRITE_ONCE(y, 1);
>
> the stores to x and y should be propagated in order to all other CPUs,
> even though those other CPUs might not access the lock s. In terms of
> the memory model, this means expanding the cumul-fence relation.
>
> Locks should also provide read-read (and read-write) ordering in a
> similar way. Given:
>
> READ_ONCE(x);
> spin_unlock(&s);
> spin_lock(&s);
> READ_ONCE(y); // or WRITE_ONCE(y, 1);
>
> the load of x should be executed before the load of (or store to) y.
> The LKMM already provides this ordering, but it provides it even in
> the case where the two accesses are separated by a release/acquire
> pair of fences rather than unlock/lock. This would prevent
> architectures from using weakly ordered implementations of release and
> acquire, which seems like an unnecessary restriction. The patch
> therefore removes the ordering requirement from the LKMM for that
> case.
>
> All the architectures supported by the Linux kernel (including RISC-V)
> do provide this ordering for locks, albeit for varying reasons.
> Therefore this patch changes the model in accordance with the
> developers' wishes.
>
> Signed-off-by: Alan Stern <[email protected]>
>
> ---
>
> v.2: Restrict the ordering to lock operations, not general release
> and acquire fences.
>
> [as1871b]
>
>
> tools/memory-model/Documentation/explanation.txt | 186 +++++++---
> tools/memory-model/linux-kernel.cat | 8
> tools/memory-model/litmus-tests/ISA2+pooncelock+pooncelock+pombonce.litmus | 5
> 3 files changed, 149 insertions(+), 50 deletions(-)
>
> Index: usb-4.x/tools/memory-model/linux-kernel.cat
> ===================================================================
> --- usb-4.x.orig/tools/memory-model/linux-kernel.cat
> +++ usb-4.x/tools/memory-model/linux-kernel.cat
> @@ -38,7 +38,7 @@ let strong-fence = mb | gp
> (* Release Acquire *)
> let acq-po = [Acquire] ; po ; [M]
> let po-rel = [M] ; po ; [Release]
> -let rfi-rel-acq = [Release] ; rfi ; [Acquire]
> +let unlock-rf-lock-po = [UL] ; rf ; [LKR] ; po
It feels slightly weird that unlock-rf-lock-po is asymmetrical. And in
fact, I think the current RISC-V solution we've been discussing (namely,
putting a fence.tso instead of a fence rw,w in front of the release)
may not even technically respect that particular sequence. The
fence.tso solution really enforces "po; [UL]; rf; [LKR]", right?
Does something like "po; [UL]; rf; [LKR]; po" fit in with the rest
of the model? If so, maybe that solves the asymmetry and also
legalizes the approach of putting fence.tso in front?
Or, other suggestions?
Dan
> (**********************************)
> (* Fundamental coherence ordering *)
> @@ -60,13 +60,13 @@ let dep = addr | data
> let rwdep = (dep | ctrl) ; [W]
> let overwrite = co | fr
> let to-w = rwdep | (overwrite & int)
> -let to-r = addr | (dep ; rfi) | rfi-rel-acq
> +let to-r = addr | (dep ; rfi)
> let fence = strong-fence | wmb | po-rel | rmb | acq-po
> -let ppo = to-r | to-w | fence
> +let ppo = to-r | to-w | fence | (unlock-rf-lock-po & int)
>
> (* Propagation: Ordering from release operations and strong fences. *)
> let A-cumul(r) = rfe? ; r
> -let cumul-fence = A-cumul(strong-fence | po-rel) | wmb
> +let cumul-fence = A-cumul(strong-fence | po-rel) | wmb | unlock-rf-lock-po
> let prop = (overwrite & ext)? ; cumul-fence* ; rfe?
On Tue, 10 Jul 2018, Daniel Lustig wrote:
> > --- usb-4.x.orig/tools/memory-model/linux-kernel.cat
> > +++ usb-4.x/tools/memory-model/linux-kernel.cat
> > @@ -38,7 +38,7 @@ let strong-fence = mb | gp
> > (* Release Acquire *)
> > let acq-po = [Acquire] ; po ; [M]
> > let po-rel = [M] ; po ; [Release]
> > -let rfi-rel-acq = [Release] ; rfi ; [Acquire]
> > +let unlock-rf-lock-po = [UL] ; rf ; [LKR] ; po
>
> It feels slightly weird that unlock-rf-lock-po is asymmetrical. And in
> fact, I think the current RISC-V solution we've been discussing (namely,
> putting a fence.tso instead of a fence rw,w in front of the release)
> may not even technically respect that particular sequence. The
> fence.tso solution really enforces "po; [UL]; rf; [LKR]", right?
>
> Does something like "po; [UL]; rf; [LKR]; po" fit in with the rest
> of the model? If so, maybe that solves the asymmetry and also
> legalizes the approach of putting fence.tso in front?
That would work just as well. For this version of the patch it
doesn't make any difference, because nothing that comes po-after the
LKR is able to directly read the value stored by the UL.
Alan
More than one kernel developer has expressed the opinion that the LKMM
should enforce ordering of writes by locking. In other words, given
the following code:
WRITE_ONCE(x, 1);
	spin_unlock(&s);
spin_lock(&s);
WRITE_ONCE(y, 1);
the stores to x and y should be propagated in order to all other CPUs,
even though those other CPUs might not access the lock s. In terms of
the memory model, this means expanding the cumul-fence relation.
Locks should also provide read-read (and read-write) ordering in a
similar way. Given:
READ_ONCE(x);
spin_unlock(&s);
spin_lock(&s);
READ_ONCE(y); // or WRITE_ONCE(y, 1);
the load of x should be executed before the load of (or store to) y.
The LKMM already provides this ordering, but it provides it even in
the case where the two accesses are separated by a release/acquire
pair of fences rather than unlock/lock. This would prevent
architectures from using weakly ordered implementations of release and
acquire, which seems like an unnecessary restriction. The patch
therefore removes the ordering requirement from the LKMM for that
case.
All the architectures supported by the Linux kernel (including RISC-V)
do provide this ordering for locks, albeit for varying reasons.
Therefore this patch changes the model in accordance with the
developers' wishes.
Signed-off-by: Alan Stern <[email protected]>
---
v.3: Rebased against the dev branch of Paul's linux-rcu tree.
Changed unlock-rf-lock-po to po-unlock-rf-lock-po, making it more
symmetrical and more in accordance with the use of fence.tso for
the release on RISC-V.
v.2: Restrict the ordering to lock operations, not general release
and acquire fences.
[as1871c]
tools/memory-model/Documentation/explanation.txt | 186 +++++++---
tools/memory-model/linux-kernel.cat | 8
tools/memory-model/litmus-tests/ISA2+pooncelock+pooncelock+pombonce.litmus | 7
3 files changed, 150 insertions(+), 51 deletions(-)
Index: usb-4.x/tools/memory-model/linux-kernel.cat
===================================================================
--- usb-4.x.orig/tools/memory-model/linux-kernel.cat
+++ usb-4.x/tools/memory-model/linux-kernel.cat
@@ -38,7 +38,7 @@ let strong-fence = mb | gp
(* Release Acquire *)
let acq-po = [Acquire] ; po ; [M]
let po-rel = [M] ; po ; [Release]
-let rfi-rel-acq = [Release] ; rfi ; [Acquire]
+let po-unlock-rf-lock-po = po ; [UL] ; rf ; [LKR] ; po
(**********************************)
(* Fundamental coherence ordering *)
@@ -60,13 +60,13 @@ let dep = addr | data
let rwdep = (dep | ctrl) ; [W]
let overwrite = co | fr
let to-w = rwdep | (overwrite & int)
-let to-r = addr | (dep ; rfi) | rfi-rel-acq
+let to-r = addr | (dep ; rfi)
let fence = strong-fence | wmb | po-rel | rmb | acq-po
-let ppo = to-r | to-w | fence
+let ppo = to-r | to-w | fence | (po-unlock-rf-lock-po & int)
(* Propagation: Ordering from release operations and strong fences. *)
let A-cumul(r) = rfe? ; r
-let cumul-fence = A-cumul(strong-fence | po-rel) | wmb
+let cumul-fence = A-cumul(strong-fence | po-rel) | wmb | po-unlock-rf-lock-po
let prop = (overwrite & ext)? ; cumul-fence* ; rfe?
(*
Index: usb-4.x/tools/memory-model/litmus-tests/ISA2+pooncelock+pooncelock+pombonce.litmus
===================================================================
--- usb-4.x.orig/tools/memory-model/litmus-tests/ISA2+pooncelock+pooncelock+pombonce.litmus
+++ usb-4.x/tools/memory-model/litmus-tests/ISA2+pooncelock+pooncelock+pombonce.litmus
@@ -1,11 +1,10 @@
C ISA2+pooncelock+pooncelock+pombonce
(*
- * Result: Sometimes
+ * Result: Never
*
- * This test shows that the ordering provided by a lock-protected S
- * litmus test (P0() and P1()) are not visible to external process P2().
- * This is likely to change soon.
+ * This test shows that write-write ordering provided by locks
+ * (in P0() and P1()) is visible to external process P2().
*)
{}
Index: usb-4.x/tools/memory-model/Documentation/explanation.txt
===================================================================
--- usb-4.x.orig/tools/memory-model/Documentation/explanation.txt
+++ usb-4.x/tools/memory-model/Documentation/explanation.txt
@@ -28,7 +28,8 @@ Explanation of the Linux-Kernel Memory C
20. THE HAPPENS-BEFORE RELATION: hb
21. THE PROPAGATES-BEFORE RELATION: pb
22. RCU RELATIONS: rcu-link, gp, rscs, rcu-fence, and rb
- 23. ODDS AND ENDS
+ 23. LOCKING
+ 24. ODDS AND ENDS
@@ -1067,28 +1068,6 @@ allowing out-of-order writes like this t
violating the write-write coherence rule by requiring the CPU not to
send the W write to the memory subsystem at all!)
-There is one last example of preserved program order in the LKMM: when
-a load-acquire reads from an earlier store-release. For example:
-
- smp_store_release(&x, 123);
- r1 = smp_load_acquire(&x);
-
-If the smp_load_acquire() ends up obtaining the 123 value that was
-stored by the smp_store_release(), the LKMM says that the load must be
-executed after the store; the store cannot be forwarded to the load.
-This requirement does not arise from the operational model, but it
-yields correct predictions on all architectures supported by the Linux
-kernel, although for differing reasons.
-
-On some architectures, including x86 and ARMv8, it is true that the
-store cannot be forwarded to the load. On others, including PowerPC
-and ARMv7, smp_store_release() generates object code that starts with
-a fence and smp_load_acquire() generates object code that ends with a
-fence. The upshot is that even though the store may be forwarded to
-the load, it is still true that any instruction preceding the store
-will be executed before the load or any following instructions, and
-the store will be executed before any instruction following the load.
-
AND THEN THERE WAS ALPHA
------------------------
@@ -1766,6 +1745,147 @@ before it does, and the critical section
grace period does and ends after it does.
+LOCKING
+-------
+
+The LKMM includes locking. In fact, there is special code for locking
+in the formal model, added in order to make tools run faster.
+However, this special code is intended to be more or less equivalent
+to concepts we have already covered. A spinlock_t variable is treated
+the same as an int, and spin_lock(&s) is treated almost the same as:
+
+ while (cmpxchg_acquire(&s, 0, 1) != 0)
+ cpu_relax();
+
+This waits until s is equal to 0 and then atomically sets it to 1,
+and the read part of the cmpxchg operation acts as an acquire fence.
+An alternate way to express the same thing would be:
+
+ r = xchg_acquire(&s, 1);
+
+along with a requirement that at the end, r = 0. Similarly,
+spin_trylock(&s) is treated almost the same as:
+
+ return !cmpxchg_acquire(&s, 0, 1);
+
+which atomically sets s to 1 if it is currently equal to 0 and returns
+true if it succeeds (the read part of the cmpxchg operation acts as an
+acquire fence only if the operation is successful). spin_unlock(&s)
+is treated almost the same as:
+
+ smp_store_release(&s, 0);
+
+The "almost" qualifiers above need some explanation. In the LKMM, the
+store-release in a spin_unlock() and the load-acquire which forms the
+first half of the atomic rmw update in a spin_lock() or a successful
+spin_trylock() -- we can call these things lock-releases and
+lock-acquires -- have two properties beyond those of ordinary releases
+and acquires.
+
+First, when a lock-acquire reads from a lock-release, the LKMM
+requires that every instruction po-before the lock-release must
+execute before any instruction po-after the lock-acquire. This would
+naturally hold if the release and acquire operations were on different
+CPUs, but the LKMM says it holds even when they are on the same CPU.
+For example:
+
+ int x, y;
+ spinlock_t s;
+
+ P0()
+ {
+ int r1, r2;
+
+ spin_lock(&s);
+ r1 = READ_ONCE(x);
+ spin_unlock(&s);
+ spin_lock(&s);
+ r2 = READ_ONCE(y);
+ spin_unlock(&s);
+ }
+
+ P1()
+ {
+ WRITE_ONCE(y, 1);
+ smp_wmb();
+ WRITE_ONCE(x, 1);
+ }
+
+Here the second spin_lock() reads from the first spin_unlock(), and
+therefore the load of x must execute before the load of y. Thus we
+cannot have r1 = 1 and r2 = 0 at the end (this is an instance of the
+MP pattern).
+
+This requirement does not apply to ordinary release and acquire
+fences, only to lock-related operations. For instance, suppose P0()
+in the example had been written as:
+
+ P0()
+ {
+ int r1, r2, r3;
+
+ r1 = READ_ONCE(x);
+ smp_store_release(&s, 1);
+ r3 = smp_load_acquire(&s);
+ r2 = READ_ONCE(y);
+ }
+
+Then the CPU would be allowed to forward the s = 1 value from the
+smp_store_release() to the smp_load_acquire(), executing the
+instructions in the following order:
+
+ r3 = smp_load_acquire(&s); // Obtains r3 = 1
+ r2 = READ_ONCE(y);
+ r1 = READ_ONCE(x);
+ smp_store_release(&s, 1); // Value is forwarded
+
+and thus it could load y before x, obtaining r2 = 0 and r1 = 1.
+
+Second, when a lock-acquire reads from a lock-release, and some other
+stores W and W' occur po-before the lock-release and po-after the
+lock-acquire respectively, the LKMM requires that W must propagate to
+each CPU before W' does. For example, consider:
+
+ int x, y;
+ spinlock_t s;
+
+ P0()
+ {
+ spin_lock(&s);
+ WRITE_ONCE(x, 1);
+ spin_unlock(&s);
+ }
+
+ P1()
+ {
+ int r1;
+
+ spin_lock(&s);
+ r1 = READ_ONCE(x);
+ WRITE_ONCE(y, 1);
+ spin_unlock(&s);
+ }
+
+ P2()
+ {
+ int r2, r3;
+
+ r2 = READ_ONCE(y);
+ smp_rmb();
+ r3 = READ_ONCE(x);
+ }
+
+If r1 = 1 at the end then the spin_lock() in P1 must have read from
+the spin_unlock() in P0. Hence the store to x must propagate to P2
+before the store to y does, so we cannot have r2 = 1 and r3 = 0.
+
+These two special requirements for lock-release and lock-acquire do
+not arise from the operational model. Nevertheless, kernel developers
+have come to expect and rely on them because they do hold on all
+architectures supported by the Linux kernel, albeit for various
+differing reasons.
+
+
ODDS AND ENDS
-------------
@@ -1831,26 +1951,6 @@ they behave as follows:
events and the events preceding them against all po-later
events.
-The LKMM includes locking. In fact, there is special code for locking
-in the formal model, added in order to make tools run faster.
-However, this special code is intended to be exactly equivalent to
-concepts we have already covered. A spinlock_t variable is treated
-the same as an int, and spin_lock(&s) is treated the same as:
-
- while (cmpxchg_acquire(&s, 0, 1) != 0)
- cpu_relax();
-
-which waits until s is equal to 0 and then atomically sets it to 1,
-and where the read part of the atomic update is also an acquire fence.
-An alternate way to express the same thing would be:
-
- r = xchg_acquire(&s, 1);
-
-along with a requirement that at the end, r = 0. spin_unlock(&s) is
-treated the same as:
-
- smp_store_release(&s, 0);
-
Interestingly, RCU and locking each introduce the possibility of
deadlock. When faced with code sequences such as:
On Tue, Jul 10, 2018 at 02:18:13PM -0400, Alan Stern wrote:
> More than one kernel developer has expressed the opinion that the LKMM
> should enforce ordering of writes by locking. In other words, given
> the following code:
>
> WRITE_ONCE(x, 1);
> spin_unlock(&s):
> spin_lock(&s);
> WRITE_ONCE(y, 1);
>
> the stores to x and y should be propagated in order to all other CPUs,
> even though those other CPUs might not access the lock s. In terms of
> the memory model, this means expanding the cumul-fence relation.
>
> Locks should also provide read-read (and read-write) ordering in a
> similar way. Given:
>
> READ_ONCE(x);
> spin_unlock(&s);
> spin_lock(&s);
> READ_ONCE(y); // or WRITE_ONCE(y, 1);
>
> the load of x should be executed before the load of (or store to) y.
> The LKMM already provides this ordering, but it provides it even in
> the case where the two accesses are separated by a release/acquire
> pair of fences rather than unlock/lock. This would prevent
> architectures from using weakly ordered implementations of release and
> acquire, which seems like an unnecessary restriction. The patch
> therefore removes the ordering requirement from the LKMM for that
> case.
>
> All the architectures supported by the Linux kernel (including RISC-V)
> do provide this ordering for locks, albeit for varying reasons.
> Therefore this patch changes the model in accordance with the
> developers' wishes.
>
> Signed-off-by: Alan Stern <[email protected]>
It now applies, thank you very much!
Is this something that you are comfortable pushing into the upcoming
merge window, or should I hold off until the next one?
Thanx, Paul
> ---
>
>
> v.3: Rebased against the dev branch of Paul's linux-rcu tree.
> Changed unlock-rf-lock-po to po-unlock-rf-lock-po, making it more
> symmetrical and more in accordance with the use of fence.tso for
> the release on RISC-V.
>
> v.2: Restrict the ordering to lock operations, not general release
> and acquire fences.
>
> [as1871c]
>
>
> tools/memory-model/Documentation/explanation.txt | 186 +++++++---
> tools/memory-model/linux-kernel.cat | 8
> tools/memory-model/litmus-tests/ISA2+pooncelock+pooncelock+pombonce.litmus | 7
> 3 files changed, 150 insertions(+), 51 deletions(-)
>
> Index: usb-4.x/tools/memory-model/linux-kernel.cat
> ===================================================================
> --- usb-4.x.orig/tools/memory-model/linux-kernel.cat
> +++ usb-4.x/tools/memory-model/linux-kernel.cat
> @@ -38,7 +38,7 @@ let strong-fence = mb | gp
> (* Release Acquire *)
> let acq-po = [Acquire] ; po ; [M]
> let po-rel = [M] ; po ; [Release]
> -let rfi-rel-acq = [Release] ; rfi ; [Acquire]
> +let po-unlock-rf-lock-po = po ; [UL] ; rf ; [LKR] ; po
>
> (**********************************)
> (* Fundamental coherence ordering *)
> @@ -60,13 +60,13 @@ let dep = addr | data
> let rwdep = (dep | ctrl) ; [W]
> let overwrite = co | fr
> let to-w = rwdep | (overwrite & int)
> -let to-r = addr | (dep ; rfi) | rfi-rel-acq
> +let to-r = addr | (dep ; rfi)
> let fence = strong-fence | wmb | po-rel | rmb | acq-po
> -let ppo = to-r | to-w | fence
> +let ppo = to-r | to-w | fence | (po-unlock-rf-lock-po & int)
>
> (* Propagation: Ordering from release operations and strong fences. *)
> let A-cumul(r) = rfe? ; r
> -let cumul-fence = A-cumul(strong-fence | po-rel) | wmb
> +let cumul-fence = A-cumul(strong-fence | po-rel) | wmb | po-unlock-rf-lock-po
> let prop = (overwrite & ext)? ; cumul-fence* ; rfe?
>
> (*
> Index: usb-4.x/tools/memory-model/litmus-tests/ISA2+pooncelock+pooncelock+pombonce.litmus
> ===================================================================
> --- usb-4.x.orig/tools/memory-model/litmus-tests/ISA2+pooncelock+pooncelock+pombonce.litmus
> +++ usb-4.x/tools/memory-model/litmus-tests/ISA2+pooncelock+pooncelock+pombonce.litmus
> @@ -1,11 +1,10 @@
> C ISA2+pooncelock+pooncelock+pombonce
>
> (*
> - * Result: Sometimes
> + * Result: Never
> *
> - * This test shows that the ordering provided by a lock-protected S
> - * litmus test (P0() and P1()) are not visible to external process P2().
> - * This is likely to change soon.
> + * This test shows that write-write ordering provided by locks
> + * (in P0() and P1()) is visible to external process P2().
> *)
>
> {}
> Index: usb-4.x/tools/memory-model/Documentation/explanation.txt
> ===================================================================
> --- usb-4.x.orig/tools/memory-model/Documentation/explanation.txt
> +++ usb-4.x/tools/memory-model/Documentation/explanation.txt
> @@ -28,7 +28,8 @@ Explanation of the Linux-Kernel Memory C
> 20. THE HAPPENS-BEFORE RELATION: hb
> 21. THE PROPAGATES-BEFORE RELATION: pb
> 22. RCU RELATIONS: rcu-link, gp, rscs, rcu-fence, and rb
> - 23. ODDS AND ENDS
> + 23. LOCKING
> + 24. ODDS AND ENDS
>
>
>
> @@ -1067,28 +1068,6 @@ allowing out-of-order writes like this t
> violating the write-write coherence rule by requiring the CPU not to
> send the W write to the memory subsystem at all!)
>
> -There is one last example of preserved program order in the LKMM: when
> -a load-acquire reads from an earlier store-release. For example:
> -
> - smp_store_release(&x, 123);
> - r1 = smp_load_acquire(&x);
> -
> -If the smp_load_acquire() ends up obtaining the 123 value that was
> -stored by the smp_store_release(), the LKMM says that the load must be
> -executed after the store; the store cannot be forwarded to the load.
> -This requirement does not arise from the operational model, but it
> -yields correct predictions on all architectures supported by the Linux
> -kernel, although for differing reasons.
> -
> -On some architectures, including x86 and ARMv8, it is true that the
> -store cannot be forwarded to the load. On others, including PowerPC
> -and ARMv7, smp_store_release() generates object code that starts with
> -a fence and smp_load_acquire() generates object code that ends with a
> -fence. The upshot is that even though the store may be forwarded to
> -the load, it is still true that any instruction preceding the store
> -will be executed before the load or any following instructions, and
> -the store will be executed before any instruction following the load.
> -
>
> AND THEN THERE WAS ALPHA
> ------------------------
> @@ -1766,6 +1745,147 @@ before it does, and the critical section
> grace period does and ends after it does.
>
>
> +LOCKING
> +-------
> +
> +The LKMM includes locking. In fact, there is special code for locking
> +in the formal model, added in order to make tools run faster.
> +However, this special code is intended to be more or less equivalent
> +to concepts we have already covered. A spinlock_t variable is treated
> +the same as an int, and spin_lock(&s) is treated almost the same as:
> +
> + while (cmpxchg_acquire(&s, 0, 1) != 0)
> + cpu_relax();
> +
> +This waits until s is equal to 0 and then atomically sets it to 1,
> +and the read part of the cmpxchg operation acts as an acquire fence.
> +An alternate way to express the same thing would be:
> +
> + r = xchg_acquire(&s, 1);
> +
> +along with a requirement that at the end, r = 0. Similarly,
> +spin_trylock(&s) is treated almost the same as:
> +
> + return !cmpxchg_acquire(&s, 0, 1);
> +
> +which atomically sets s to 1 if it is currently equal to 0 and returns
> +true if it succeeds (the read part of the cmpxchg operation acts as an
> +acquire fence only if the operation is successful). spin_unlock(&s)
> +is treated almost the same as:
> +
> + smp_store_release(&s, 0);
> +
> +The "almost" qualifiers above need some explanation. In the LKMM, the
> +store-release in a spin_unlock() and the load-acquire which forms the
> +first half of the atomic rmw update in a spin_lock() or a successful
> +spin_trylock() -- we can call these things lock-releases and
> +lock-acquires -- have two properties beyond those of ordinary releases
> +and acquires.
> +
> +First, when a lock-acquire reads from a lock-release, the LKMM
> +requires that every instruction po-before the lock-release must
> +execute before any instruction po-after the lock-acquire. This would
> +naturally hold if the release and acquire operations were on different
> +CPUs, but the LKMM says it holds even when they are on the same CPU.
> +For example:
> +
> + int x, y;
> + spinlock_t s;
> +
> + P0()
> + {
> + int r1, r2;
> +
> + spin_lock(&s);
> + r1 = READ_ONCE(x);
> + spin_unlock(&s);
> + spin_lock(&s);
> + r2 = READ_ONCE(y);
> + spin_unlock(&s);
> + }
> +
> + P1()
> + {
> + WRITE_ONCE(y, 1);
> + smp_wmb();
> + WRITE_ONCE(x, 1);
> + }
> +
> +Here the second spin_lock() reads from the first spin_unlock(), and
> +therefore the load of x must execute before the load of y. Thus we
> +cannot have r1 = 1 and r2 = 0 at the end (this is an instance of the
> +MP pattern).
> +
> +This requirement does not apply to ordinary release and acquire
> +fences, only to lock-related operations. For instance, suppose P0()
> +in the example had been written as:
> +
> + P0()
> + {
> + int r1, r2, r3;
> +
> + r1 = READ_ONCE(x);
> + smp_store_release(&s, 1);
> + r3 = smp_load_acquire(&s);
> + r2 = READ_ONCE(y);
> + }
> +
> +Then the CPU would be allowed to forward the s = 1 value from the
> +smp_store_release() to the smp_load_acquire(), executing the
> +instructions in the following order:
> +
> + r3 = smp_load_acquire(&s); // Obtains r3 = 1
> + r2 = READ_ONCE(y);
> + r1 = READ_ONCE(x);
> + smp_store_release(&s, 1); // Value is forwarded
> +
> +and thus it could load y before x, obtaining r2 = 0 and r1 = 1.
> +
> +Second, when a lock-acquire reads from a lock-release, and some other
> +stores W and W' occur po-before the lock-release and po-after the
> +lock-acquire respectively, the LKMM requires that W must propagate to
> +each CPU before W' does. For example, consider:
> +
> + int x, y;
> + spinlock_t s;
> +
> + P0()
> + {
> + spin_lock(&s);
> + WRITE_ONCE(x, 1);
> + spin_unlock(&s);
> + }
> +
> + P1()
> + {
> + int r1;
> +
> + spin_lock(&s);
> + r1 = READ_ONCE(x);
> + WRITE_ONCE(y, 1);
> + spin_unlock(&s);
> + }
> +
> + P2()
> + {
> + int r2, r3;
> +
> + r2 = READ_ONCE(y);
> + smp_rmb();
> + r3 = READ_ONCE(x);
> + }
> +
> +If r1 = 1 at the end then the spin_lock() in P1 must have read from
> +the spin_unlock() in P0. Hence the store to x must propagate to P2
> +before the store to y does, so we cannot have r2 = 1 and r3 = 0.
> +
> +These two special requirements for lock-release and lock-acquire do
> +not arise from the operational model. Nevertheless, kernel developers
> +have come to expect and rely on them because they do hold on all
> +architectures supported by the Linux kernel, albeit for various
> +differing reasons.
> +
> +
> ODDS AND ENDS
> -------------
>
> @@ -1831,26 +1951,6 @@ they behave as follows:
> events and the events preceding them against all po-later
> events.
>
> -The LKMM includes locking. In fact, there is special code for locking
> -in the formal model, added in order to make tools run faster.
> -However, this special code is intended to be exactly equivalent to
> -concepts we have already covered. A spinlock_t variable is treated
> -the same as an int, and spin_lock(&s) is treated the same as:
> -
> - while (cmpxchg_acquire(&s, 0, 1) != 0)
> - cpu_relax();
> -
> -which waits until s is equal to 0 and then atomically sets it to 1,
> -and where the read part of the atomic update is also an acquire fence.
> -An alternate way to express the same thing would be:
> -
> - r = xchg_acquire(&s, 1);
> -
> -along with a requirement that at the end, r = 0. spin_unlock(&s) is
> -treated the same as:
> -
> - smp_store_release(&s, 0);
> -
> Interestingly, RCU and locking each introduce the possibility of
> deadlock. When faced with code sequences such as:
>
>
On Tue, 10 Jul 2018, Paul E. McKenney wrote:
> On Tue, Jul 10, 2018 at 02:18:13PM -0400, Alan Stern wrote:
> > More than one kernel developer has expressed the opinion that the LKMM
> > should enforce ordering of writes by locking. In other words, given
> > the following code:
> >
> > WRITE_ONCE(x, 1);
> > spin_unlock(&s):
> > spin_lock(&s);
> > WRITE_ONCE(y, 1);
> >
> > the stores to x and y should be propagated in order to all other CPUs,
> > even though those other CPUs might not access the lock s. In terms of
> > the memory model, this means expanding the cumul-fence relation.
> >
> > Locks should also provide read-read (and read-write) ordering in a
> > similar way. Given:
> >
> > READ_ONCE(x);
> > spin_unlock(&s);
> > spin_lock(&s);
> > READ_ONCE(y); // or WRITE_ONCE(y, 1);
> >
> > the load of x should be executed before the load of (or store to) y.
> > The LKMM already provides this ordering, but it provides it even in
> > the case where the two accesses are separated by a release/acquire
> > pair of fences rather than unlock/lock. This would prevent
> > architectures from using weakly ordered implementations of release and
> > acquire, which seems like an unnecessary restriction. The patch
> > therefore removes the ordering requirement from the LKMM for that
> > case.
> >
> > All the architectures supported by the Linux kernel (including RISC-V)
> > do provide this ordering for locks, albeit for varying reasons.
> > Therefore this patch changes the model in accordance with the
> > developers' wishes.
> >
> > Signed-off-by: Alan Stern <[email protected]>
>
> It now applies, thank you very much!
>
> Is this something that you are comfortable pushing into the upcoming
> merge window, or should I hold off until the next one?
Given the concerns that Andrea raised, and given that neither Peter,
Will, nor Daniel has commented on v.3 of the patch, I think we should
hold off for a little while.
Alan
On Tue, Jul 10, 2018 at 04:24:34PM -0400, Alan Stern wrote:
> On Tue, 10 Jul 2018, Paul E. McKenney wrote:
>
> > On Tue, Jul 10, 2018 at 02:18:13PM -0400, Alan Stern wrote:
> > > More than one kernel developer has expressed the opinion that the LKMM
> > > should enforce ordering of writes by locking. In other words, given
> > > the following code:
> > >
> > > WRITE_ONCE(x, 1);
> > > spin_unlock(&s):
> > > spin_lock(&s);
> > > WRITE_ONCE(y, 1);
> > >
> > > the stores to x and y should be propagated in order to all other CPUs,
> > > even though those other CPUs might not access the lock s. In terms of
> > > the memory model, this means expanding the cumul-fence relation.
> > >
> > > Locks should also provide read-read (and read-write) ordering in a
> > > similar way. Given:
> > >
> > > READ_ONCE(x);
> > > spin_unlock(&s);
> > > spin_lock(&s);
> > > READ_ONCE(y); // or WRITE_ONCE(y, 1);
> > >
> > > the load of x should be executed before the load of (or store to) y.
> > > The LKMM already provides this ordering, but it provides it even in
> > > the case where the two accesses are separated by a release/acquire
> > > pair of fences rather than unlock/lock. This would prevent
> > > architectures from using weakly ordered implementations of release and
> > > acquire, which seems like an unnecessary restriction. The patch
> > > therefore removes the ordering requirement from the LKMM for that
> > > case.
> > >
> > > All the architectures supported by the Linux kernel (including RISC-V)
> > > do provide this ordering for locks, albeit for varying reasons.
> > > Therefore this patch changes the model in accordance with the
> > > developers' wishes.
> > >
> > > Signed-off-by: Alan Stern <[email protected]>
> >
> > It now applies, thank you very much!
> >
> > Is this something that you are comfortable pushing into the upcoming
> > merge window, or should I hold off until the next one?
>
> Given the concerns that Andrea raised, and given that neither Peter,
> Will, nor Daniel has commented on v.3 of the patch, I think we should
> hold off for a little while.
Works for me!
Thanx, Paul
On Tue, Jul 10, 2018 at 11:34:45AM -0400, Alan Stern wrote:
> On Tue, 10 Jul 2018, Andrea Parri wrote:
>
> > > > ACQUIRE operations include LOCK operations and both smp_load_acquire()
> > > > and smp_cond_acquire() operations. [BTW, the latter was replaced by
> > > > smp_cond_load_acquire() in 1f03e8d2919270 ...]
> > > >
> > > > RELEASE operations include UNLOCK operations and smp_store_release()
> > > > operations. [...]
> > > >
> > > > [...] after an ACQUIRE on a given variable, all memory accesses
> > > > preceding any prior RELEASE on that same variable are guaranteed
> > > > to be visible.
> > >
> > > As far as I can see, these statements remain valid.
> >
> > Interesting; ;-) What do these statements tell you ;-) when applied
> > to a: and b: below?
> >
> > a: WRITE_ONCE(x, 1); // "preceding any prior RELEASE..."
> > smp_store_release(&s, 1);
> > smp_load_acquire(&s);
> > b: WRITE_ONCE(y, 1); // "after an ACQUIRE..."
>
> The first statement tells me that b: follows an ACQUIRE.
>
> The second tells me that a: precedes a RELEASE.
>
> And the third tells me that any READ_ONCE(x) statements coming po-after
> b: would see x = 1 or a later value of x. (Of course, they would have
> to see that anyway because of the cache coherency rules.)
Mmh, something like "visible from the same CPU as the ACQUIRE" would
probably have helped me reach the same conclusion.
>
> More to the point, given:
>
> P0()
> {
> WRITE_ONCE(x, 1);
> a: smp_store_release(&s, 1);
> }
>
> P1()
> {
> b: r1 = smp_load_acquire(&s);
> r2 = READ_ONCE(x);
> }
>
> the third statement tells me that if r1 = 1 (that is, if a: is prior to
> b:) then r2 must be 1.
Indeed; the "prior" is ambiguous, but yes.
Andrea
>
> Alan
>
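For reference, the P0()/P1() example Alan gives above corresponds to the
following litmus test (the name and the exists clause are illustrative
additions, essentially restating the existing
MP+pooncerelease+poacquireonce test in the litmus-tests directory); the
LKMM forbids the listed outcome:

C MP+rel-acq-sketch

(* Illustrative restatement of Alan's example; not taken from the thread. *)

{}

P0(int *x, int *s)
{
	WRITE_ONCE(*x, 1);
	smp_store_release(s, 1);
}

P1(int *x, int *s)
{
	int r1;
	int r2;

	r1 = smp_load_acquire(s);
	r2 = READ_ONCE(*x);
}

exists (1:r1=1 /\ 1:r2=0)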
On Tue, Jul 10, 2018 at 01:17:50PM -0400, Alan Stern wrote:
> On Tue, 10 Jul 2018, Daniel Lustig wrote:
>
> > > --- usb-4.x.orig/tools/memory-model/linux-kernel.cat
> > > +++ usb-4.x/tools/memory-model/linux-kernel.cat
> > > @@ -38,7 +38,7 @@ let strong-fence = mb | gp
> > > (* Release Acquire *)
> > > let acq-po = [Acquire] ; po ; [M]
> > > let po-rel = [M] ; po ; [Release]
> > > -let rfi-rel-acq = [Release] ; rfi ; [Acquire]
> > > +let unlock-rf-lock-po = [UL] ; rf ; [LKR] ; po
> >
> > It feels slightly weird that unlock-rf-lock-po is asymmetrical. And in
> > fact, I think the current RISC-V solution we've been discussing (namely,
> > putting a fence.tso instead of a fence rw,w in front of the release)
> > may not even technically respect that particular sequence. The
> > fence.tso solution really enforces "po; [UL]; rf; [LKR]", right?
> >
> > Does something like "po; [UL]; rf; [LKR]; po" fit in with the rest
> > of the model? If so, maybe that solves the asymmetry and also
> > legalizes the approach of putting fence.tso in front?
>
> That would work just as well. For this version of the patch it
> doesn't make any difference, because nothing that comes po-after the
> LKR is able to directly read the value stored by the UL.
Consider:
C v2-versus-v3
{}

P0(spinlock_t *s, int *x)
{
	spin_lock(s); /* A */
	spin_unlock(s);
	spin_lock(s);
	WRITE_ONCE(*x, 1); /* B */
	spin_unlock(s);
}

P1(spinlock_t *s, int *x)
{
	int r0;
	int r1;

	r0 = READ_ONCE(*x); /* C */
	smp_rmb();
	r1 = spin_is_locked(s); /* D */
}

With v3, it's allowed that C reads from B and D reads from (the LKW of) A;
this is not allowed with v2 (unless I mis-applied/mis-read v2).
Andrea
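For anyone who wants to run the test above, it needs an exists clause;
the completion below is illustrative (it is not part of Andrea's
message), checking for the final state in which C observed B and D
observed the lock held:

	exists (1:r0=1 /\ 1:r1=1)

	$ herd7 -conf linux-kernel.cfg v2-versus-v3.litmus

Note that, as Alan points out in his reply further down, this final
state by itself cannot distinguish whether D read from A or from the
second spin_lock(), so both versions report the state as reachable;
the v2/v3 difference shows up only in which executions produce it.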
On Tue, Jul 10, 2018 at 11:38:21AM +0200, Andrea Parri wrote:
> On Mon, Jul 09, 2018 at 04:01:57PM -0400, Alan Stern wrote:
> > More than one kernel developer has expressed the opinion that the LKMM
> > should enforce ordering of writes by locking. In other words, given
>
> I'd like to step back on this point: I still don't have a strong opinion
> on this, but all this debating made me curious about others' opinion ;-)
> I'd like to see the above argument expanded: what's the rationale behind
> that opinion? can we maybe add references to actual code relying on that
> ordering? other that I've been missing?
>
> I'd extend these same questions to the "ordering of reads" snippet below
> (and discussed since so long...).
>
>
> > the following code:
> >
> > WRITE_ONCE(x, 1);
> > spin_unlock(&s):
> > spin_lock(&s);
> > WRITE_ONCE(y, 1);
> >
> > the stores to x and y should be propagated in order to all other CPUs,
> > even though those other CPUs might not access the lock s. In terms of
> > the memory model, this means expanding the cumul-fence relation.
> >
> > Locks should also provide read-read (and read-write) ordering in a
> > similar way. Given:
> >
> > READ_ONCE(x);
> > spin_unlock(&s);
> > spin_lock(&s);
> > READ_ONCE(y); // or WRITE_ONCE(y, 1);
> >
> > the load of x should be executed before the load of (or store to) y.
> > The LKMM already provides this ordering, but it provides it even in
> > the case where the two accesses are separated by a release/acquire
> > pair of fences rather than unlock/lock. This would prevent
> > architectures from using weakly ordered implementations of release and
> > acquire, which seems like an unnecessary restriction. The patch
> > therefore removes the ordering requirement from the LKMM for that
> > case.
>
> IIUC, the same argument could be used to support the removal of the new
> unlock-rf-lock-po (we already discussed riscv .aq/.rl, it doesn't seem
> hard to imagine an arm64 LDAPR-exclusive, or the adoption of ctrl+isync
> on powerpc). Why are we effectively preventing their adoption? Again,
> I'd like to see more details about the underlying motivations...
>
>
> >
> > All the architectures supported by the Linux kernel (including RISC-V)
> > do provide this ordering for locks, albeit for varying reasons.
> > Therefore this patch changes the model in accordance with the
> > developers' wishes.
> >
> > Signed-off-by: Alan Stern <[email protected]>
> >
> > ---
> >
> > v.2: Restrict the ordering to lock operations, not general release
> > and acquire fences.
>
> This is another controversial point, and one that makes me shivering ...
>
> I have the impression that we're dismissing the suggestion "RMW-acquire
> at par with LKR" with a bit of rush. So, this patch is implying that:
>
> while (cmpxchg_acquire(&s, 0, 1) != 0)
> cpu_relax();
>
> is _not_ a valid implementation of spin_lock()! or, at least, it is not
> when paired with an smp_store_release(). Will was anticipating inserting
> arch hooks into the (generic) qspinlock code, when we know that similar
> patterns are spread all over in (q)rwlocks, mutexes, rwsem, ... (please
> also notice that the informal documentation is currently treating these
> synchronization mechanisms equally as far as "ordering" is concerned...).
>
> This distinction between locking operations and "other acquires" appears
> to me not only unmotivated but also extremely _fragile (difficult to use
> /maintain) when considering the analysis of synchronization mechanisms
> such as those mentioned above or their porting for new arch.
The main reason for this is because developers use spinlocks all of the
time, including in drivers. It's less common to use explicit atomics and
extremely rare to use explicit acquire/release operations. So let's make
locks as easy to use as possible, by giving them the strongest semantics
that we can whilst remaining a good fit for the instructions that are
provided by the architectures we support.
If you want to extend this to atomic rmws, go for it, but I don't think
it's nearly as important and there will still be ways to implement locks
with insufficient ordering guarantees if you want to.
Will
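The distinction being debated here is the one the patch's
explanation.txt text illustrates with a same-CPU release/acquire pair.
In litmus form (the test name and exists clause are added here for
illustration), the plain release/acquire version is:

C MP+rel-acq-forwarding

(* Illustrative; adapted from the alternate P0() in the patch's
   explanation.txt text. *)

{}

P0(int *x, int *y, int *s)
{
	int r1;
	int r2;
	int r3;

	r1 = READ_ONCE(*x);
	smp_store_release(s, 1);
	r3 = smp_load_acquire(s);
	r2 = READ_ONCE(*y);
}

P1(int *x, int *y)
{
	WRITE_ONCE(*y, 1);
	smp_wmb();
	WRITE_ONCE(*x, 1);
}

exists (0:r1=1 /\ 0:r2=0 /\ 0:r3=1)

With this patch applied the outcome is allowed (the store to s can be
forwarded to the load), whereas the unlock/lock version of the same
pattern, the first P0() in the patch's example, remains forbidden.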
Hi Alan,
On Tue, Jul 10, 2018 at 02:18:13PM -0400, Alan Stern wrote:
> More than one kernel developer has expressed the opinion that the LKMM
> should enforce ordering of writes by locking. In other words, given
> the following code:
>
> WRITE_ONCE(x, 1);
> spin_unlock(&s):
> spin_lock(&s);
> WRITE_ONCE(y, 1);
>
> the stores to x and y should be propagated in order to all other CPUs,
> even though those other CPUs might not access the lock s. In terms of
> the memory model, this means expanding the cumul-fence relation.
>
> Locks should also provide read-read (and read-write) ordering in a
> similar way. Given:
>
> READ_ONCE(x);
> spin_unlock(&s);
> spin_lock(&s);
> READ_ONCE(y); // or WRITE_ONCE(y, 1);
>
> the load of x should be executed before the load of (or store to) y.
> The LKMM already provides this ordering, but it provides it even in
> the case where the two accesses are separated by a release/acquire
> pair of fences rather than unlock/lock. This would prevent
> architectures from using weakly ordered implementations of release and
> acquire, which seems like an unnecessary restriction. The patch
> therefore removes the ordering requirement from the LKMM for that
> case.
>
> All the architectures supported by the Linux kernel (including RISC-V)
> do provide this ordering for locks, albeit for varying reasons.
> Therefore this patch changes the model in accordance with the
> developers' wishes.
>
> Signed-off-by: Alan Stern <[email protected]>
Thanks, I'm happy with this version of the patch:
Reviewed-by: Will Deacon <[email protected]>
Will
On Wed, Jul 11, 2018 at 10:43:11AM +0100, Will Deacon wrote:
> On Tue, Jul 10, 2018 at 11:38:21AM +0200, Andrea Parri wrote:
> > On Mon, Jul 09, 2018 at 04:01:57PM -0400, Alan Stern wrote:
> > > More than one kernel developer has expressed the opinion that the LKMM
> > > should enforce ordering of writes by locking. In other words, given
> >
> > I'd like to step back on this point: I still don't have a strong opinion
> > on this, but all this debating made me curious about others' opinion ;-)
> > I'd like to see the above argument expanded: what's the rationale behind
> > that opinion? can we maybe add references to actual code relying on that
> > ordering? other that I've been missing?
> >
> > I'd extend these same questions to the "ordering of reads" snippet below
> > (and discussed since so long...).
> >
> >
> > > the following code:
> > >
> > > WRITE_ONCE(x, 1);
> > > spin_unlock(&s):
> > > spin_lock(&s);
> > > WRITE_ONCE(y, 1);
> > >
> > > the stores to x and y should be propagated in order to all other CPUs,
> > > even though those other CPUs might not access the lock s. In terms of
> > > the memory model, this means expanding the cumul-fence relation.
> > >
> > > Locks should also provide read-read (and read-write) ordering in a
> > > similar way. Given:
> > >
> > > READ_ONCE(x);
> > > spin_unlock(&s);
> > > spin_lock(&s);
> > > READ_ONCE(y); // or WRITE_ONCE(y, 1);
> > >
> > > the load of x should be executed before the load of (or store to) y.
> > > The LKMM already provides this ordering, but it provides it even in
> > > the case where the two accesses are separated by a release/acquire
> > > pair of fences rather than unlock/lock. This would prevent
> > > architectures from using weakly ordered implementations of release and
> > > acquire, which seems like an unnecessary restriction. The patch
> > > therefore removes the ordering requirement from the LKMM for that
> > > case.
> >
> > IIUC, the same argument could be used to support the removal of the new
> > unlock-rf-lock-po (we already discussed riscv .aq/.rl, it doesn't seem
> > hard to imagine an arm64 LDAPR-exclusive, or the adoption of ctrl+isync
> > on powerpc). Why are we effectively preventing their adoption? Again,
> > I'd like to see more details about the underlying motivations...
> >
> >
> > >
> > > All the architectures supported by the Linux kernel (including RISC-V)
> > > do provide this ordering for locks, albeit for varying reasons.
> > > Therefore this patch changes the model in accordance with the
> > > developers' wishes.
> > >
> > > Signed-off-by: Alan Stern <[email protected]>
> > >
> > > ---
> > >
> > > v.2: Restrict the ordering to lock operations, not general release
> > > and acquire fences.
> >
> > This is another controversial point, and one that makes me shivering ...
> >
> > I have the impression that we're dismissing the suggestion "RMW-acquire
> > at par with LKR" with a bit of rush. So, this patch is implying that:
> >
> > while (cmpxchg_acquire(&s, 0, 1) != 0)
> > cpu_relax();
> >
> > is _not_ a valid implementation of spin_lock()! or, at least, it is not
> > when paired with an smp_store_release(). Will was anticipating inserting
> > arch hooks into the (generic) qspinlock code, when we know that similar
> > patterns are spread all over in (q)rwlocks, mutexes, rwsem, ... (please
> > also notice that the informal documentation is currently treating these
> > synchronization mechanisms equally as far as "ordering" is concerned...).
> >
> > This distinction between locking operations and "other acquires" appears
> > to me not only unmotivated but also extremely _fragile (difficult to use
> > /maintain) when considering the analysis of synchronization mechanisms
> > such as those mentioned above or their porting for new arch.
>
> The main reason for this is because developers use spinlocks all of the
> time, including in drivers. It's less common to use explicit atomics and
> extremely rare to use explicit acquire/release operations. So let's make
> locks as easy to use as possible, by giving them the strongest semantics
> that we can whilst remaining a good fit for the instructions that are
> provided by the architectures we support.
Simplicity is in the eye of the beholder. From my POV (LKMM maintainer), the
simplest solution would be to get rid of rfi-rel-acq and unlock-rf-lock-po
(or its analogous in v3) all together:
diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
index 59b5cbe6b6240..bc413a6839a2d 100644
--- a/tools/memory-model/linux-kernel.cat
+++ b/tools/memory-model/linux-kernel.cat
@@ -38,7 +38,6 @@ let strong-fence = mb | gp
(* Release Acquire *)
let acq-po = [Acquire] ; po ; [M]
let po-rel = [M] ; po ; [Release]
-let rfi-rel-acq = [Release] ; rfi ; [Acquire]
(**********************************)
(* Fundamental coherence ordering *)
@@ -60,7 +59,7 @@ let dep = addr | data
let rwdep = (dep | ctrl) ; [W]
let overwrite = co | fr
let to-w = rwdep | (overwrite & int)
-let to-r = addr | (dep ; rfi) | rfi-rel-acq
+let to-r = addr | (dep ; rfi)
let fence = strong-fence | wmb | po-rel | rmb | acq-po
let ppo = to-r | to-w | fence
Among other things, this would immediately:
1) Enable RISC-V to use their .aq/.rl annotations _without_ having to
"worry" about tso or release/acquire fences; IOW, this will permit
a partial revert of:
0123f4d76ca6 ("riscv/spinlock: Strengthen implementations with fences")
5ce6c1f3535f ("riscv/atomic: Strengthen implementations with fences")
2) Resolve the above mentioned controversy (the inconsistency between
- locking operations and atomic RMWs on one side, and their actual
implementation in generic code on the other), thus enabling the use
of LKMM _and_ its tools for the analysis/reviewing of the latter.
>
> If you want to extend this to atomic rmws, go for it, but I don't think
> it's nearly as important and there will still be ways to implement locks
> with insufficient ordering guarantees if you want to.
I don't want to "implement locks with insufficient ordering guarantees"
(w.r.t. LKMM). ;-)
Andrea
>
> Will
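One way to gauge the impact of a .cat change like the one above is to
rerun the existing litmus tests against the modified model and compare
the output with each file's "Result:" comment. Assuming herdtools7 is
installed, something along these lines should do it (the loop is just a
sketch; the checklitmus scripts in the tree, if present, automate the
comparison):

	$ cd tools/memory-model
	$ for f in litmus-tests/*.litmus; do
		herd7 -conf linux-kernel.cfg "$f"
	  done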
On Wed, 11 Jul 2018, Andrea Parri wrote:
> > > Does something like "po; [UL]; rf; [LKR]; po" fit in with the rest
> > > of the model? If so, maybe that solves the asymmetry and also
> > > legalizes the approach of putting fence.tso in front?
> >
> > That would work just as well. For this version of the patch it
> > doesn't make any difference, because nothing that comes po-after the
> > LKR is able to directly read the value stored by the UL.
>
> Consider:
>
> C v2-versus-v3
>
> {}
>
> P0(spinlock_t *s, int *x)
> {
> spin_lock(s); /* A */
> spin_unlock(s);
> spin_lock(s);
> WRITE_ONCE(*x, 1); /* B */
> spin_unlock(s);
> }
>
> P1(spinlock_t *s, int *x)
> {
> int r0;
> int r1;
>
> r0 = READ_ONCE(*x); /* C */
> smp_rmb();
> r1 = spin_is_locked(s); /* D */
> }
>
> With v3, it's allowed that C reads from B and D reads from (the LKW of) A;
> this is not allowed with v2 (unless I mis-applied/mis-read v2).
Correct. But it doesn't affect the end result, because both versions
allow C to read from B while D reads from the second spin_lock(), and
there's no way to distinguish that from the case where D reads from A.
If we were talking about arbitrary integers and rmw-acquire updates,
there _would_ be a difference. But not with spinlocks.
Alan
On Wed, Jul 11, 2018 at 02:34:21PM +0200, Andrea Parri wrote:
> On Wed, Jul 11, 2018 at 10:43:11AM +0100, Will Deacon wrote:
> > On Tue, Jul 10, 2018 at 11:38:21AM +0200, Andrea Parri wrote:
> > > On Mon, Jul 09, 2018 at 04:01:57PM -0400, Alan Stern wrote:
> > > > More than one kernel developer has expressed the opinion that the LKMM
> > > > should enforce ordering of writes by locking. In other words, given
> > >
> > > I'd like to step back on this point: I still don't have a strong opinion
> > > on this, but all this debating made me curious about others' opinion ;-)
> > > I'd like to see the above argument expanded: what's the rationale behind
> > > that opinion? can we maybe add references to actual code relying on that
> > > ordering? other that I've been missing?
> > >
> > > I'd extend these same questions to the "ordering of reads" snippet below
> > > (and discussed since so long...).
> > >
> > >
> > > > the following code:
> > > >
> > > > WRITE_ONCE(x, 1);
> > > > spin_unlock(&s):
> > > > spin_lock(&s);
> > > > WRITE_ONCE(y, 1);
> > > >
> > > > the stores to x and y should be propagated in order to all other CPUs,
> > > > even though those other CPUs might not access the lock s. In terms of
> > > > the memory model, this means expanding the cumul-fence relation.
> > > >
> > > > Locks should also provide read-read (and read-write) ordering in a
> > > > similar way. Given:
> > > >
> > > > READ_ONCE(x);
> > > > spin_unlock(&s);
> > > > spin_lock(&s);
> > > > READ_ONCE(y); // or WRITE_ONCE(y, 1);
> > > >
> > > > the load of x should be executed before the load of (or store to) y.
> > > > The LKMM already provides this ordering, but it provides it even in
> > > > the case where the two accesses are separated by a release/acquire
> > > > pair of fences rather than unlock/lock. This would prevent
> > > > architectures from using weakly ordered implementations of release and
> > > > acquire, which seems like an unnecessary restriction. The patch
> > > > therefore removes the ordering requirement from the LKMM for that
> > > > case.
> > >
> > > IIUC, the same argument could be used to support the removal of the new
> > > unlock-rf-lock-po (we already discussed riscv .aq/.rl, it doesn't seem
> > > hard to imagine an arm64 LDAPR-exclusive, or the adoption of ctrl+isync
> > > on powerpc). Why are we effectively preventing their adoption? Again,
> > > I'd like to see more details about the underlying motivations...
> > >
> > >
> > > >
> > > > All the architectures supported by the Linux kernel (including RISC-V)
> > > > do provide this ordering for locks, albeit for varying reasons.
> > > > Therefore this patch changes the model in accordance with the
> > > > developers' wishes.
> > > >
> > > > Signed-off-by: Alan Stern <[email protected]>
> > > >
> > > > ---
> > > >
> > > > v.2: Restrict the ordering to lock operations, not general release
> > > > and acquire fences.
> > >
> > > This is another controversial point, and one that makes me shivering ...
> > >
> > > I have the impression that we're dismissing the suggestion "RMW-acquire
> > > at par with LKR" with a bit of rush. So, this patch is implying that:
> > >
> > > while (cmpxchg_acquire(&s, 0, 1) != 0)
> > > cpu_relax();
> > >
> > > is _not_ a valid implementation of spin_lock()! or, at least, it is not
> > > when paired with an smp_store_release(). Will was anticipating inserting
> > > arch hooks into the (generic) qspinlock code, when we know that similar
> > > patterns are spread all over in (q)rwlocks, mutexes, rwsem, ... (please
> > > also notice that the informal documentation is currently treating these
> > > synchronization mechanisms equally as far as "ordering" is concerned...).
> > >
> > > This distinction between locking operations and "other acquires" appears
> > > to me not only unmotivated but also extremely _fragile (difficult to use
> > > /maintain) when considering the analysis of synchronization mechanisms
> > > such as those mentioned above or their porting for new arch.
> >
> > The main reason for this is because developers use spinlocks all of the
> > time, including in drivers. It's less common to use explicit atomics and
> > extremely rare to use explicit acquire/release operations. So let's make
> > locks as easy to use as possible, by giving them the strongest semantics
> > that we can whilst remaining a good fit for the instructions that are
> > provided by the architectures we support.
>
> Simplicity is in the eye of the beholder. From my POV (LKMM maintainer), the
> simplest solution would be to get rid of rfi-rel-acq and unlock-rf-lock-po
> (or its analogous in v3) all together:
>
> diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
> index 59b5cbe6b6240..bc413a6839a2d 100644
> --- a/tools/memory-model/linux-kernel.cat
> +++ b/tools/memory-model/linux-kernel.cat
> @@ -38,7 +38,6 @@ let strong-fence = mb | gp
> (* Release Acquire *)
> let acq-po = [Acquire] ; po ; [M]
> let po-rel = [M] ; po ; [Release]
> -let rfi-rel-acq = [Release] ; rfi ; [Acquire]
>
> (**********************************)
> (* Fundamental coherence ordering *)
> @@ -60,7 +59,7 @@ let dep = addr | data
> let rwdep = (dep | ctrl) ; [W]
> let overwrite = co | fr
> let to-w = rwdep | (overwrite & int)
> -let to-r = addr | (dep ; rfi) | rfi-rel-acq
> +let to-r = addr | (dep ; rfi)
> let fence = strong-fence | wmb | po-rel | rmb | acq-po
> let ppo = to-r | to-w | fence
>
> Among other things, this would immediately:
>
> 1) Enable RISC-V to use their .aq/.rl annotations _without_ having to
> "worry" about tso or release/acquire fences; IOW, this will permit
> a partial revert of:
>
> 0123f4d76ca6 ("riscv/spinlock: Strengthen implementations with fences")
> 5ce6c1f3535f ("riscv/atomic: Strengthen implementations with fences")
>
> 2) Resolve the above mentioned controversy (the inconsistency between
> - locking operations and atomic RMWs on one side, and their actual
> implementation in generic code on the other), thus enabling the use
> of LKMM _and_ its tools for the analysis/reviewing of the latter.
3) Liberate me from the unwritten duty of having to explain what these
rfi-rel-acq or unlock-rf-lock-po are (and imply!) _while_ reviewing
the next: ;-)
arch/$NEW_ARCH/include/asm/{spinlock,atomic}.h
(especially given that I could not point out a single use case in
the kernel which could illustrate and justify such requirements).
Andrea
>
>
> >
> > If you want to extend this to atomic rmws, go for it, but I don't think
> > it's nearly as important and there will still be ways to implement locks
> > with insufficient ordering guarantees if you want to.
>
> I don't want to "implement locks with insufficient ordering guarantees"
> (w.r.t. LKMM). ;-)
>
> Andrea
>
>
> >
> > Will
On Wed, Jul 11, 2018 at 10:43:45AM +0100, Will Deacon wrote:
> Hi Alan,
>
> On Tue, Jul 10, 2018 at 02:18:13PM -0400, Alan Stern wrote:
> > More than one kernel developer has expressed the opinion that the LKMM
> > should enforce ordering of writes by locking. In other words, given
> > the following code:
> >
> > WRITE_ONCE(x, 1);
> > spin_unlock(&s):
> > spin_lock(&s);
> > WRITE_ONCE(y, 1);
> >
> > the stores to x and y should be propagated in order to all other CPUs,
> > even though those other CPUs might not access the lock s. In terms of
> > the memory model, this means expanding the cumul-fence relation.
> >
> > Locks should also provide read-read (and read-write) ordering in a
> > similar way. Given:
> >
> > READ_ONCE(x);
> > spin_unlock(&s);
> > spin_lock(&s);
> > READ_ONCE(y); // or WRITE_ONCE(y, 1);
> >
> > the load of x should be executed before the load of (or store to) y.
> > The LKMM already provides this ordering, but it provides it even in
> > the case where the two accesses are separated by a release/acquire
> > pair of fences rather than unlock/lock. This would prevent
> > architectures from using weakly ordered implementations of release and
> > acquire, which seems like an unnecessary restriction. The patch
> > therefore removes the ordering requirement from the LKMM for that
> > case.
> >
> > All the architectures supported by the Linux kernel (including RISC-V)
> > do provide this ordering for locks, albeit for varying reasons.
> > Therefore this patch changes the model in accordance with the
> > developers' wishes.
> >
> > Signed-off-by: Alan Stern <[email protected]>
>
> Thanks, I'm happy with this version of the patch:
>
> Reviewed-by: Will Deacon <[email protected]>
I have applied your Reviewed-by, and thank you both!
Given that this is a non-trivial change and given that I am posting
for -tip acceptance in a few days, I intend to send this one not
to the upcoming merge window, but to the one after that.
Please let me know if there is an urgent need for this to go into the
v4.19 merge window.
Thanx, Paul
On Wed, Jul 11, 2018 at 02:34:21PM +0200, Andrea Parri wrote:
> On Wed, Jul 11, 2018 at 10:43:11AM +0100, Will Deacon wrote:
> > On Tue, Jul 10, 2018 at 11:38:21AM +0200, Andrea Parri wrote:
> > > This distinction between locking operations and "other acquires" appears
> > > to me not only unmotivated but also extremely _fragile (difficult to use
> > > /maintain) when considering the analysis of synchronization mechanisms
> > > such as those mentioned above or their porting for new arch.
> >
> > The main reason for this is because developers use spinlocks all of the
> > time, including in drivers. It's less common to use explicit atomics and
> > extremely rare to use explicit acquire/release operations. So let's make
> > locks as easy to use as possible, by giving them the strongest semantics
> > that we can whilst remaining a good fit for the instructions that are
> > provided by the architectures we support.
>
> Simplicity is in the eye of the beholder. From my POV (LKMM maintainer), the
> simplest solution would be to get rid of rfi-rel-acq and unlock-rf-lock-po
> (or its analogous in v3) all together:
It might be simple to model, but I worry this weakens our locking
implementations to a point where they will not be understood by the average
kernel developer. As I've said before, I would prefer "full" RCsc locking,
but that's not the case with architectures we currently support today, so
the next best thing is this "everything apart from W->R in the
inter-thread case" ordering, which isn't going to crop up unless you're
doing weird stuff anyway afaict.
> > If you want to extend this to atomic rmws, go for it, but I don't think
> > it's nearly as important and there will still be ways to implement locks
> > with insufficient ordering guarantees if you want to.
>
> I don't want to "implement locks with insufficient ordering guarantees"
> (w.r.t. LKMM). ;-)
I didn't mean you personally; I mean that somebody could write a lock that
mixes rmws and acquire/release, so they'd still have to deal with the
fallout even with your proposed changes.
Will
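The "everything apart from W->R in the inter-thread case" carve-out can
be made concrete with a store-buffering shape; the test below is an
illustrative sketch (the name and exists clause are not from the
thread), and even with this patch the listed outcome is expected to
remain allowed, because unlock followed by lock is still not a full
barrier:

C SB+unlock-lock+mb

(* Illustrative sketch of the remaining W->R hole. *)

{}

P0(int *x, int *y, spinlock_t *s)
{
	int r1;

	spin_lock(s);
	WRITE_ONCE(*x, 1);
	spin_unlock(s);
	spin_lock(s);
	r1 = READ_ONCE(*y);
	spin_unlock(s);
}

P1(int *x, int *y)
{
	int r2;

	WRITE_ONCE(*y, 1);
	smp_mb();
	r2 = READ_ONCE(*x);
}

exists (0:r1=0 /\ 1:r2=0)

This is the case smp_mb__after_spinlock() exists for: adding it after
the second spin_lock() in P0() should forbid the outcome.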
> It might be simple to model, but I worry this weakens our locking
> implementations to a point where they will not be understood by the average
> kernel developer. As I've said before, I would prefer "full" RCsc locking,
> but that's not the case with architectures we currently support today, so
> the next best thing is this "everything apart from W->R in the
> inter-thread case" ordering, which isn't going to crop up unless you're
> doing weird stuff anyway afaict.
The "average kernel developer" thinks TSO or about, right? ;-)
Andrea
>
> Will
On Wed, Jul 11, 2018 at 10:43:45AM +0100, Will Deacon wrote:
> Hi Alan,
>
> On Tue, Jul 10, 2018 at 02:18:13PM -0400, Alan Stern wrote:
> > More than one kernel developer has expressed the opinion that the LKMM
> > should enforce ordering of writes by locking. In other words, given
> > the following code:
> >
> > WRITE_ONCE(x, 1);
> > spin_unlock(&s):
> > spin_lock(&s);
> > WRITE_ONCE(y, 1);
> >
> > the stores to x and y should be propagated in order to all other CPUs,
> > even though those other CPUs might not access the lock s. In terms of
> > the memory model, this means expanding the cumul-fence relation.
> >
> > Locks should also provide read-read (and read-write) ordering in a
> > similar way. Given:
> >
> > READ_ONCE(x);
> > spin_unlock(&s);
> > spin_lock(&s);
> > READ_ONCE(y); // or WRITE_ONCE(y, 1);
> >
> > the load of x should be executed before the load of (or store to) y.
> > The LKMM already provides this ordering, but it provides it even in
> > the case where the two accesses are separated by a release/acquire
> > pair of fences rather than unlock/lock. This would prevent
> > architectures from using weakly ordered implementations of release and
> > acquire, which seems like an unnecessary restriction. The patch
> > therefore removes the ordering requirement from the LKMM for that
> > case.
> >
> > All the architectures supported by the Linux kernel (including RISC-V)
> > do provide this ordering for locks, albeit for varying reasons.
> > Therefore this patch changes the model in accordance with the
> > developers' wishes.
> >
> > Signed-off-by: Alan Stern <[email protected]>
>
> Thanks, I'm happy with this version of the patch:
>
> Reviewed-by: Will Deacon <[email protected]>
Me too! Thanks Alan.
Acked-by: Peter Zijlstra (Intel) <[email protected]>
On Wed, Jul 11, 2018 at 06:34:56PM +0200, Peter Zijlstra wrote:
> On Wed, Jul 11, 2018 at 10:43:45AM +0100, Will Deacon wrote:
> > Hi Alan,
> >
> > On Tue, Jul 10, 2018 at 02:18:13PM -0400, Alan Stern wrote:
> > > More than one kernel developer has expressed the opinion that the LKMM
> > > should enforce ordering of writes by locking. In other words, given
> > > the following code:
> > >
> > > WRITE_ONCE(x, 1);
> > > spin_unlock(&s):
> > > spin_lock(&s);
> > > WRITE_ONCE(y, 1);
> > >
> > > the stores to x and y should be propagated in order to all other CPUs,
> > > even though those other CPUs might not access the lock s. In terms of
> > > the memory model, this means expanding the cumul-fence relation.
> > >
> > > Locks should also provide read-read (and read-write) ordering in a
> > > similar way. Given:
> > >
> > > READ_ONCE(x);
> > > spin_unlock(&s);
> > > spin_lock(&s);
> > > READ_ONCE(y); // or WRITE_ONCE(y, 1);
> > >
> > > the load of x should be executed before the load of (or store to) y.
> > > The LKMM already provides this ordering, but it provides it even in
> > > the case where the two accesses are separated by a release/acquire
> > > pair of fences rather than unlock/lock. This would prevent
> > > architectures from using weakly ordered implementations of release and
> > > acquire, which seems like an unnecessary restriction. The patch
> > > therefore removes the ordering requirement from the LKMM for that
> > > case.
> > >
> > > All the architectures supported by the Linux kernel (including RISC-V)
> > > do provide this ordering for locks, albeit for varying reasons.
> > > Therefore this patch changes the model in accordance with the
> > > developers' wishes.
> > >
> > > Signed-off-by: Alan Stern <[email protected]>
> >
> > Thanks, I'm happy with this version of the patch:
> >
> > Reviewed-by: Will Deacon <[email protected]>
>
> Me too! Thanks Alan.
>
> Acked-by: Peter Zijlstra (Intel) <[email protected]>
And I applied your ack as well, thank you!
Thanx, Paul
On Wed, Jul 11, 2018 at 08:42:11AM -0700, Paul E. McKenney wrote:
> On Wed, Jul 11, 2018 at 10:43:45AM +0100, Will Deacon wrote:
> > Hi Alan,
> >
> > On Tue, Jul 10, 2018 at 02:18:13PM -0400, Alan Stern wrote:
> > > More than one kernel developer has expressed the opinion that the LKMM
> > > should enforce ordering of writes by locking. In other words, given
> > > the following code:
> > >
> > > WRITE_ONCE(x, 1);
> > > spin_unlock(&s):
> > > spin_lock(&s);
> > > WRITE_ONCE(y, 1);
> > >
> > > the stores to x and y should be propagated in order to all other CPUs,
> > > even though those other CPUs might not access the lock s. In terms of
> > > the memory model, this means expanding the cumul-fence relation.
> > >
> > > Locks should also provide read-read (and read-write) ordering in a
> > > similar way. Given:
> > >
> > > READ_ONCE(x);
> > > spin_unlock(&s);
> > > spin_lock(&s);
> > > READ_ONCE(y); // or WRITE_ONCE(y, 1);
> > >
> > > the load of x should be executed before the load of (or store to) y.
> > > The LKMM already provides this ordering, but it provides it even in
> > > the case where the two accesses are separated by a release/acquire
> > > pair of fences rather than unlock/lock. This would prevent
> > > architectures from using weakly ordered implementations of release and
> > > acquire, which seems like an unnecessary restriction. The patch
> > > therefore removes the ordering requirement from the LKMM for that
> > > case.
> > >
> > > All the architectures supported by the Linux kernel (including RISC-V)
> > > do provide this ordering for locks, albeit for varying reasons.
> > > Therefore this patch changes the model in accordance with the
> > > developers' wishes.
> > >
> > > Signed-off-by: Alan Stern <[email protected]>
> >
> > Thanks, I'm happy with this version of the patch:
> >
> > Reviewed-by: Will Deacon <[email protected]>
>
> I have applied your Reviewed-by, and thank you both!
>
> Given that this is a non-trivial change and given that I am posting
> for -tip acceptance in a few days, I intend to send this one not
> to the upcoming merge window, but to the one after that.
>
> Please let me know if there is an urgent need for this to go into the
> v4.19 merge window.
I raised some concerns in my review to v2; AFAICT, these concerns have
not been resolved: so, until then, please feel free to add my NAK. ;-)
Andrea
>
> Thanx, Paul
>
On Wed, Jul 11, 2018 at 04:57:51PM +0100, Will Deacon wrote:
> It might be simple to model, but I worry this weakens our locking
> implementations to a point where they will not be understood by the average
> kernel developer. As I've said before, I would prefer "full" RCsc locking,
Another vote for RCsc locks. The (in)famous hold-out is of course
PowerPC, but it now looks like RISC-V is following where I really
rather wish they didn't.
On 7/11/2018 10:00 AM, Peter Zijlstra wrote:
> On Wed, Jul 11, 2018 at 04:57:51PM +0100, Will Deacon wrote:
>
>> It might be simple to model, but I worry this weakens our locking
>> implementations to a point where they will not be understood by the average
>> kernel developer. As I've said before, I would prefer "full" RCsc locking,
>
> Another vote for RCsc locks. The (in)famous hold-out is of course
> PowerPC, but it now looks like RISC-V is following where they I really
> rather wish they didn't.
That's not entirely fair. We came in wanting to do the "natural" or
"expected" thing: use RCsc atomics where we have them available in the
ISA, and use "fence r,rw" and "fence rw,w" where we don't.
I would argue that the idea of having data races on variables that are
also protected by locks is equally if not more counterintuitive and
unexpected than the "natural" mapping we had originally proposed to use.
All the discussion here[1] for example is about having ordering and
doing an smp_cond_load_acquire() on a variable which is sometimes
protected by a CPU's rq->lock and sometimes not? Isn't that one of the
key use cases for this whole discussion?
FWIW, this code also mixes locking, smp_rmb(), and
smp_cond_load_acquire() all in very close proximity with each other...
I suppose there's no mixing and matching on the same variable, but
we are here trying to reason about how all those pieces interact, no?
[1] https://lkml.org/lkml/2015/10/6/805
Dan
On Wed, 11 Jul 2018, David Laight wrote:
> > From: Alan Stern
> > Sent: 10 July 2018 19:18
> > More than one kernel developer has expressed the opinion that the LKMM
> > should enforce ordering of writes by locking. In other words, given
> > the following code:
> >
> > WRITE_ONCE(x, 1);
> > spin_unlock(&s):
> > spin_lock(&s);
> > WRITE_ONCE(y, 1);
> >
> > the stores to x and y should be propagated in order to all other CPUs,
> > even though those other CPUs might not access the lock s. In terms of
> > the memory model, this means expanding the cumul-fence relation.
>
> The usual 'elephant in the room' is Alpha.
> I don't claim to understand the alpha memory model but it wouldn't
> surprise me if the above is impossible to implement on alpha.
It's not impossible, since Alpha does have a full memory barrier
instruction (and the implementation uses it).
Alan
On Wed, Jul 11, 2018 at 06:17:17PM +0200, Andrea Parri wrote:
> On Wed, Jul 11, 2018 at 08:42:11AM -0700, Paul E. McKenney wrote:
> > On Wed, Jul 11, 2018 at 10:43:45AM +0100, Will Deacon wrote:
> > > Hi Alan,
> > >
> > > On Tue, Jul 10, 2018 at 02:18:13PM -0400, Alan Stern wrote:
> > > > More than one kernel developer has expressed the opinion that the LKMM
> > > > should enforce ordering of writes by locking. In other words, given
> > > > the following code:
> > > >
> > > > WRITE_ONCE(x, 1);
> > > > spin_unlock(&s):
> > > > spin_lock(&s);
> > > > WRITE_ONCE(y, 1);
> > > >
> > > > the stores to x and y should be propagated in order to all other CPUs,
> > > > even though those other CPUs might not access the lock s. In terms of
> > > > the memory model, this means expanding the cumul-fence relation.
> > > >
> > > > Locks should also provide read-read (and read-write) ordering in a
> > > > similar way. Given:
> > > >
> > > > READ_ONCE(x);
> > > > spin_unlock(&s);
> > > > spin_lock(&s);
> > > > READ_ONCE(y); // or WRITE_ONCE(y, 1);
> > > >
> > > > the load of x should be executed before the load of (or store to) y.
> > > > The LKMM already provides this ordering, but it provides it even in
> > > > the case where the two accesses are separated by a release/acquire
> > > > pair of fences rather than unlock/lock. This would prevent
> > > > architectures from using weakly ordered implementations of release and
> > > > acquire, which seems like an unnecessary restriction. The patch
> > > > therefore removes the ordering requirement from the LKMM for that
> > > > case.
> > > >
> > > > All the architectures supported by the Linux kernel (including RISC-V)
> > > > do provide this ordering for locks, albeit for varying reasons.
> > > > Therefore this patch changes the model in accordance with the
> > > > developers' wishes.
> > > >
> > > > Signed-off-by: Alan Stern <[email protected]>
> > >
> > > Thanks, I'm happy with this version of the patch:
> > >
> > > Reviewed-by: Will Deacon <[email protected]>
> >
> > I have applied your Reviewed-by, and thank you both!
> >
> > Given that this is a non-trivial change and given that I am posting
> > for -tip acceptance in a few days, I intend to send this one not
> > to the upcoming merge window, but to the one after that.
> >
> > Please let me know if there is an urgent need for this to go into the
> > v4.19 merge window.
>
> I raised some concerns in my review to v2; AFAICT, these concerns have
> not been resolved: so, until then, please feel free to add my NAK. ;-)
I will be keeping the prototype in my -rcu tree for the time being,
but I will not be submitting it into the v4.19 merge window. I expect
that you all will be able to come to agreement in the couple of months
until the v4.20/v5.0 merge window. ;-)
I will apply Peter's ack and at the same time mark it EXP so that its
state will be apparent.
Thanx, Paul
On Wed, Jul 11, 2018 at 02:34:21PM +0200, Andrea Parri wrote:
> Simplicity is the eye of the beholder. From my POV (LKMM maintainer), the
> simplest solution would be to get rid of rfi-rel-acq and unlock-rf-lock-po
> (or its analogous in v3) all together:
<snip>
> Among other things, this would immediately:
>
> 1) Enable RISC-V to use their .aq/.rl annotations _without_ having to
> "worry" about tso or release/acquire fences; IOW, this will permit
> a partial revert of:
>
> 0123f4d76ca6 ("riscv/spinlock: Strengthen implementations with fences")
> 5ce6c1f3535f ("riscv/atomic: Strengthen implementations with fences")
But I feel this goes in the wrong direction. This weakens the effective
memory model, where I feel we should strengthen it.
Currently PowerPC is the weakest here, and the above RISC-V changes
(reverts) would make RISC-V weaker still.
And any effective weakening makes me very uncomfortable -- who knows
what will come apart this time. This memory ordering stuff causes
horrible subtle bugs at best.
> 2) Resolve the above mentioned controversy (the inconsistency between
> - locking operations and atomic RMWs on one side, and their actual
> implementation in generic code on the other), thus enabling the use
> of LKMM _and_ its tools for the analysis/reviewing of the latter.
This is a good point; so let's see if there is something we can do to
strengthen the model so it all works again.
And I think if we raise atomic*_acquire() to require TSO (but ideally
raise it to RCsc) we're there.
The TSO archs have RCpc load-acquire and store-release, but fully
ordered atomics. Most of the other archs have smp_mb() everything, with
the exception of PPC, ARM64 and now RISC-V.
PPC has the RCpc TSO fence LWSYNC, ARM64 has the RCsc
load-acquire/store-release. And RISC-V has a gazillion of options IIRC.
So ideally atomic*_acquire() + smp_store_release() will be RCsc, and is
with the notable exception of PPC, and ideally RISC-V would be RCsc
here. But at the very least it should not be weaker than PPC.
By increasing atomic*_acquire() to TSO we also immediately get the
proposed:
P0()
{
	WRITE_ONCE(X, 1);
	spin_unlock(&s);
	spin_lock(&s);
	WRITE_ONCE(Y, 1);
}

P1()
{
	r1 = READ_ONCE(Y);
	smp_rmb();
	r2 = READ_ONCE(X);
}
behaviour under discussion; because the spin_lock() will imply the TSO
ordering.
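
For concreteness, the pattern above can be written as a complete litmus
test, roughly as below (a sketch: the test name is made up, and the extra
spin_lock()/spin_unlock() bracketing is only there to keep the lock
operations balanced):

C MP+unlocklock+rmb

(*
 * Sketch only; untested here.
 *)

{}

P0(int *x, int *y, spinlock_t *s)
{
	spin_lock(s);
	WRITE_ONCE(*x, 1);
	spin_unlock(s);
	spin_lock(s);
	WRITE_ONCE(*y, 1);
	spin_unlock(s);
}

P1(int *x, int *y)
{
	int r1;
	int r2;

	r1 = READ_ONCE(*y);
	smp_rmb();
	r2 = READ_ONCE(*x);
}

exists (1:r1=1 /\ 1:r2=0)

With unlock+lock feeding into cumul-fence as proposed, the exists clause
should be forbidden; without that, the model would be expected to allow it.
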
And note that this retains regular RCpc ACQUIRE for smp_load_acquire()
and associated primitives -- as they have had since their introduction
not too long ago.
> All the discussion here[1] for example is about having ordering and
> doing an smp_cond_load_acquire() on a variable which is sometimes
> protected by a CPU's rq->lock and sometimes not? Isn't that one of the
> key use cases for this whole discussion?
Not a "pure" one:
http://lkml.kernel.org/r/1530629639-27767-1-git-send-email-andrea.parri@amarulasolutions.com
we also need "W->R ordering" in schedule()! so there better be an
smp_mb__after_spinlock() or a barrier providing similar ordering.
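
As a sketch of that W->R case (the test name is made up and the lock
bracketing is illustrative), the outcome below should be forbidden once
the smp_mb__after_spinlock() is there, whereas unlock+lock alone would
not be expected to forbid it:

C SB+unlocklock-mbafterspinlock+mb

(*
 * Sketch only; untested here.
 *)

{}

P0(int *x, int *y, spinlock_t *s)
{
	int r1;

	spin_lock(s);
	WRITE_ONCE(*x, 1);
	spin_unlock(s);
	spin_lock(s);
	smp_mb__after_spinlock();
	r1 = READ_ONCE(*y);
	spin_unlock(s);
}

P1(int *x, int *y)
{
	int r2;

	WRITE_ONCE(*y, 1);
	smp_mb();
	r2 = READ_ONCE(*x);
}

exists (0:r1=0 /\ 1:r2=0)
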
Nick was suggesting a "weaker version" of this barrier back in:
362a61ad61199e ("fix SMP data race in pagetable setup vs walking")
c.f., the comment in mm/memory.c:__pte_alloc(), but that does not
match our pattern (UNLOCK+LOCK), AFAICT.
Andrea
>
> [1] https://lkml.org/lkml/2015/10/6/805
>
> Dan
On Wed, Jul 11, 2018 at 10:50:56AM -0700, Daniel Lustig wrote:
> On 7/11/2018 10:00 AM, Peter Zijlstra wrote:
> > On Wed, Jul 11, 2018 at 04:57:51PM +0100, Will Deacon wrote:
> >
> >> It might be simple to model, but I worry this weakens our locking
> >> implementations to a point where they will not be understood by the average
> >> kernel developer. As I've said before, I would prefer "full" RCsc locking,
> >
> > Another vote for RCsc locks. The (in)famous hold-out is of course
> > PowerPC, but it now looks like RISC-V is following where they I really
> > rather wish they didn't.
>
> That's not entirely fair. We came in wanting to do the "natural" or
My apologies then, and I must admit I've not really kept track of what
RISC-V ended up doing :/
> "expected" thing: use RCsc atomics where we have them available in the
> ISA, and use "fence r,rw" and "fence rw,w" where we don't.
>
> I would argue that the idea of having data races on variables that are
> also protected by locks is equally if not more counterintuitive and
> unexpected than the "natural" mapping we had originally proposed to use.
>
> All the discussion here[1] for example is about having ordering and
> doing an smp_cond_load_acquire() on a variable which is sometimes
> protected by a CPU's rq->lock and sometimes not? Isn't that one of the
> key use cases for this whole discussion?
Well, the scheduler locks are one place where I know we rely on stuff
like this. RCU is another place where we know we rely on locks being
RCsc.
The big problem of course is all the places we do not know about. This
kernel thing is terribly big and there's an awful lot of clever people
involved.
Now, PPC seems to have built RCpc locks before we fully understood these
things and has since (understandably) 'refused' to go RCsc because of
performance regressions.
But the fact is that stronger ordering is easier on these pesky humans.
Also:
https://lkml.org/lkml/2016/2/2/80
On Thu, Jul 12, 2018 at 09:40:40AM +0200, Peter Zijlstra wrote:
> And I think if we raise atomic*_acquire() to require TSO (but ideally
> raise it to RCsc) we're there.
To clarify, just the RmW-acquire. Things like atomic_read_acquire() can
stay smp_load_acquire() and be RCpc.
On Thu, Jul 12, 2018 at 11:34:32AM +0200, Peter Zijlstra wrote:
> On Thu, Jul 12, 2018 at 09:40:40AM +0200, Peter Zijlstra wrote:
> > And I think if we raise atomic*_acquire() to require TSO (but ideally
> > raise it to RCsc) we're there.
>
> To clarify, just the RmW-acquire. Things like atomic_read_acquire() can
> stay smp_load_acquire() and be RCpc.
I don't have strong opinions about strengthening RmW atomics to TSO, so
if it helps to unblock Alan's patch (which doesn't go near this!) then I'll
go with it. The important part is that we continue to allow roach motel
into the RmW for other accesses in the non-fully-ordered cases.
Daniel -- your AMO instructions are cool with this, right? It's just the
fence-based implementations that will need help?
Will
On Thu, Jul 12, 2018 at 09:40:40AM +0200, Peter Zijlstra wrote:
> On Wed, Jul 11, 2018 at 02:34:21PM +0200, Andrea Parri wrote:
> > Simplicity is the eye of the beholder. From my POV (LKMM maintainer), the
> > simplest solution would be to get rid of rfi-rel-acq and unlock-rf-lock-po
> > (or its analogous in v3) all together:
>
> <snip>
>
> > Among other things, this would immediately:
> >
> > 1) Enable RISC-V to use their .aq/.rl annotations _without_ having to
> > "worry" about tso or release/acquire fences; IOW, this will permit
> > a partial revert of:
> >
> > 0123f4d76ca6 ("riscv/spinlock: Strengthen implementations with fences")
> > 5ce6c1f3535f ("riscv/atomic: Strengthen implementations with fences")
>
> But I feel this goes in the wrong direction. This weakens the effective
> memory model, where I feel we should strengthen it.
>
> Currently PowerPC is the weakest here, and the above RISC-V changes
> (reverts) would make RISC-V weaker still.
>
> Any any effective weakening makes me very uncomfortable -- who knows
> what will come apart this time. This memory ordering stuff causes
> horrible subtle bugs at best.
Indeed, what I was suggesting above is a weakening of the current model
(and I agree: I wouldn't say that bugs in this context are nice ;-).
These changes would affect a specific area: (IMO,) the examples we've
been considering here aren't for the faint-hearted ;-) and as Daniel
already suggested, everything would again be "nice and neat", if this
was all about locking/if every thread used lock-synchronization.
>
> > 2) Resolve the above mentioned controversy (the inconsistency between
> > - locking operations and atomic RMWs on one side, and their actual
> > implementation in generic code on the other), thus enabling the use
> > of LKMM _and_ its tools for the analysis/reviewing of the latter.
>
> This is a good point; so lets see if there is something we can do to
> strengthen the model so it all works again.
>
> And I think if we raise atomic*_acquire() to require TSO (but ideally
> raise it to RCsc) we're there.
>
> The TSO archs have RCpc load-acquire and store-release, but fully
> ordered atomics. Most of the other archs have smp_mb() everything, with
> the exception of PPC, ARM64 and now RISC-V.
>
> PPC has the RCpc TSO fence LWSYNC, ARM64 has the RCsc
> load-acquire/store-release. And RISC-V has a gazillion of options IIRC.
>
>
> So ideally atomic*_acquire() + smp_store_release() will be RCsc, and is
> with the notable exception of PPC, and ideally RISC-V would be RCsc
> here. But at the very least it should not be weaker than PPC.
>
> By increasing atomic*_acquire() to TSO we also immediately get the
> proposed:
>
> P0()
> {
> WRITE_ONCE(X, 1);
> spin_unlock(&s);
> spin_lock(&s);
> WRITE_ONCE(Y, 1);
> }
>
> P1()
> {
> r1 = READ_ONCE(Y);
> smp_rmb();
> r2 = READ_ONCE(X);
> }
>
> behaviour under discussion; because the spin_lock() will imply the TSO
> ordering.
You mean: "when paired with a po-earlier release to the same memory
location", right? I am afraid that neither arm64 nor riscv current
implementations would ensure "(r1 == 1 && r2 == 0) forbidden" if we
removed the po-earlier spin_unlock()...
AFAICT, the current implementation would work with that release: as
you remarked above, arm64 release->acquire is SC; riscv has an rw,w
fence in its spin_unlock() (hence a w,w fence between the stores),
or it could have a .tso fence ...
But again, these are subtle patterns, and my guess is that several/
most kernel developers really won't care about such guarantees (and
if some will do, they'll have the tools to figure out what they can
actually rely on ...)
OTOH (as I pointed out earlier) the strengthening we're configuring
will prevent some arch. (riscv being just the example of today!) to
go "full RCsc", and this will inevitably "complicate" both the LKMM
and the reviewing process of related changes (atomics, locking, ...;
c.f., this debate), apparently, just because you ;-) want to "care"
about these guarantees.
Not yet convinced ... :/
Andrea
>
> And note that this retains regular RCpc ACQUIRE for smp_load_acquire()
> and associated primitives -- as they have had since their introduction
> not too long ago.
On Thu, Jul 12, 2018 at 01:52:49PM +0200, Andrea Parri wrote:
> On Thu, Jul 12, 2018 at 09:40:40AM +0200, Peter Zijlstra wrote:
> > On Wed, Jul 11, 2018 at 02:34:21PM +0200, Andrea Parri wrote:
> > > Simplicity is the eye of the beholder. From my POV (LKMM maintainer), the
> > > simplest solution would be to get rid of rfi-rel-acq and unlock-rf-lock-po
> > > (or its analogous in v3) all together:
> >
> > <snip>
> >
> > > Among other things, this would immediately:
> > >
> > > 1) Enable RISC-V to use their .aq/.rl annotations _without_ having to
> > > "worry" about tso or release/acquire fences; IOW, this will permit
> > > a partial revert of:
> > >
> > > 0123f4d76ca6 ("riscv/spinlock: Strengthen implementations with fences")
> > > 5ce6c1f3535f ("riscv/atomic: Strengthen implementations with fences")
> >
> > But I feel this goes in the wrong direction. This weakens the effective
> > memory model, where I feel we should strengthen it.
> >
> > Currently PowerPC is the weakest here, and the above RISC-V changes
> > (reverts) would make RISC-V weaker still.
> >
> > Any any effective weakening makes me very uncomfortable -- who knows
> > what will come apart this time. This memory ordering stuff causes
> > horrible subtle bugs at best.
>
> Indeed, what I was suggesting above is a weaking of the current model
> (and I agree: I wouldn't say that bugs in this context are nice ;-).
>
> These changes would affect a specific area: (IMO,) the examples we've
> been considering here aren't for the faint-hearted ;-) and as Daniel
> already suggested, everything would again be "nice and neat", if this
> was all about locking/if every thread used lock-synchronization.
>
>
> >
> > > 2) Resolve the above mentioned controversy (the inconsistency between
> > > - locking operations and atomic RMWs on one side, and their actual
> > > implementation in generic code on the other), thus enabling the use
> > > of LKMM _and_ its tools for the analysis/reviewing of the latter.
> >
> > This is a good point; so lets see if there is something we can do to
> > strengthen the model so it all works again.
> >
> > And I think if we raise atomic*_acquire() to require TSO (but ideally
> > raise it to RCsc) we're there.
> >
> > The TSO archs have RCpc load-acquire and store-release, but fully
> > ordered atomics. Most of the other archs have smp_mb() everything, with
> > the exception of PPC, ARM64 and now RISC-V.
> >
> > PPC has the RCpc TSO fence LWSYNC, ARM64 has the RCsc
> > load-acquire/store-release. And RISC-V has a gazillion of options IIRC.
> >
> >
> > So ideally atomic*_acquire() + smp_store_release() will be RCsc, and is
> > with the notable exception of PPC, and ideally RISC-V would be RCsc
> > here. But at the very least it should not be weaker than PPC.
> >
> > By increasing atomic*_acquire() to TSO we also immediately get the
> > proposed:
> >
> > P0()
> > {
> > WRITE_ONCE(X, 1);
> > spin_unlock(&s);
> > spin_lock(&s);
> > WRITE_ONCE(Y, 1);
> > }
> >
> > P1()
> > {
> > r1 = READ_ONCE(Y);
> > smp_rmb();
> > r2 = READ_ONCE(X);
> > }
> >
> > behaviour under discussion; because the spin_lock() will imply the TSO
> > ordering.
>
> You mean: "when paired with a po-earlier release to the same memory
> location", right? I am afraid that neither arm64 nor riscv current
> implementations would ensure "(r1 == 1 && r2 == 0) forbidden" if we
> removed the po-earlier spin_unlock()...
>
> AFAICT, the current implementation would work with that release: as
> you remarked above, arm64 release->acquire is SC; riscv has an rw,w
> fence in its spin_unlock() (hence an w,w fence between the stores),
> or it could have a .tso fence ...
>
> But again, these are stuble patterns, and my guess is that several/
> most kernel developers really won't care about such guarantees (and
> if some will do, they'll have the tools to figure out what they can
> actually rely on ...)
>
> OTOH (as I pointed out earlier) the strengthening we're configuring
> will prevent some arch. (riscv being just the example of today!) to
> go "full RCsc", and this will inevitably "complicate" both the LKMM
"full RCpc"
Andrea
> and the reviewing process of related changes (atomics, locking, ...;
> c.f., this debate), apparently, just because you ;-) want to "care"
> about these guarantees.
>
> Not yet convinced ... :/
>
> Andrea
>
>
> >
> > And note that this retains regular RCpc ACQUIRE for smp_load_acquire()
> > and associated primitives -- as they have had since their introduction
> > not too long ago.
On Thu, Jul 12, 2018 at 02:01:00PM +0200, Andrea Parri wrote:
> > OTOH (as I pointed out earlier) the strengthening we're configuring
> > will prevent some arch. (riscv being just the example of today!) to
> > go "full RCsc", and this will inevitably "complicate" both the LKMM
>
> "full RCpc"
That's a good thing :-)
On Thu, Jul 12, 2018 at 01:52:49PM +0200, Andrea Parri wrote:
> On Thu, Jul 12, 2018 at 09:40:40AM +0200, Peter Zijlstra wrote:
> > On Wed, Jul 11, 2018 at 02:34:21PM +0200, Andrea Parri wrote:
> > > 2) Resolve the above mentioned controversy (the inconsistency between
> > > - locking operations and atomic RMWs on one side, and their actual
> > > implementation in generic code on the other), thus enabling the use
> > > of LKMM _and_ its tools for the analysis/reviewing of the latter.
> >
> > This is a good point; so lets see if there is something we can do to
> > strengthen the model so it all works again.
> >
> > And I think if we raise atomic*_acquire() to require TSO (but ideally
> > raise it to RCsc) we're there.
> >
> You mean: "when paired with a po-earlier release to the same memory
> location", right? I am afraid that neither arm64 nor riscv current
> implementations would ensure "(r1 == 1 && r2 == 0) forbidden" if we
> removed the po-earlier spin_unlock()...
Yes indeed. More on this below.
> But again, these are stuble patterns, and my guess is that several/
> most kernel developers really won't care about such guarantees (and
> if some will do, they'll have the tools to figure out what they can
> actually rely on ...)
Yes it is subtle, yes most people won't care, however the problem is
that it is subtly the wrong way around. People expect causality, this is
a human failing perhaps, but that's how it is.
And I strongly feel we should have our locks be such that they don't
subtly break things.
Take for instance the pattern where RCU relies on RCsc locks; this is an
entirely simple and straightforward use of locks, yet it completely fails
on this subtle point.
And people will not even try and use complicated tools for apparently
simple things. They'll say, oh of course this simple thing will work
right.
I'm still hoping we can convince the PowerPC people that they're wrong,
and get rid of this wart and just call all locks RCsc.
> OTOH (as I pointed out earlier) the strengthening we're configuring
> will prevent some arch. (riscv being just the example of today!) to
> go "full RCpc", and this will inevitably "complicate" both the LKMM
> and the reviewing process of related changes (atomics, locking, ...;
> c.f., this debate), apparently, just because you ;-) want to "care"
> about these guarantees.
It's not just me btw, Linus also cares about these matters. Widely used
primitives such as spinlocks, should not have subtle and
counter-intuitive behaviour such as RCpc.
Anyway, back to the problem of being able to use the memory model to
describe locks. This is I think a useful property.
My earlier reasoning was that:
- smp_store_release() + smp_load_acquire() := RCpc
- we use smp_store_release() as unlock()
Therefore, if we want unlock+lock to imply at least TSO (ideally
smp_mb()) we need lock to make up for whatever unlock lacks.
Hence my proposal to strengthen rmw-acquire, because that is the basic
primitive used to implement lock.
But as you (and Will) point out, we don't so much care about rmw-acquire
semantics as much as that we care about unlock+lock behaviour. Another
way to look at this is to define:
smp-store-release + rmw-acquire := TSO (ideally smp_mb)
But then we also have to look at:
rmw-release + smp-load-acquire
rmw-release + rmw-acquire
for completeness sake, and I would suggest they result in (at least) the
same (TSO) ordering as the one we really care about.
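
As a sketch of the first of these combinations, here the "unlock" is a
plain smp_store_release() and the "lock" is an RmW-acquire (xchg_acquire())
on the same variable; the test name and the use of an ordinary int as the
lock word are illustrative only:

C MP+rel-xchgacquire+rmb

(*
 * Sketch only; untested here.
 *)

{}

P0(int *x, int *y, int *s)
{
	int r0;

	WRITE_ONCE(*x, 1);
	smp_store_release(s, 1);
	r0 = xchg_acquire(s, 2);
	WRITE_ONCE(*y, 1);
}

P1(int *x, int *y)
{
	int r1;
	int r2;

	r1 = READ_ONCE(*y);
	smp_rmb();
	r2 = READ_ONCE(*x);
}

exists (1:r1=1 /\ 1:r2=0)

A locking-only strengthening (keyed on spin_unlock()/spin_lock()) would be
expected to leave this outcome allowed, while strengthening rmw-acquire
itself, as proposed here, should forbid it.
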
One alternative is to no longer use smp_store_release() for unlock(),
and say define atomic_set_release() to be in the rmw-release class
instead of being a simple smp_store_release().
Another, and I like this proposal least, is to introduce a new barrier
to make this all work.
On Thu, Jul 12, 2018 at 03:48:21PM +0200, Peter Zijlstra wrote:
[ . . . ]
> I'm still hoping we can convince the PowerPC people that they're wrong,
> and get rid of this wart and just call all locks RCsc.
As always, I must defer to the PowerPC maintainers on this point.
Thanx, Paul
On Thu, 12 Jul 2018, Peter Zijlstra wrote:
> > But again, these are stuble patterns, and my guess is that several/
> > most kernel developers really won't care about such guarantees (and
> > if some will do, they'll have the tools to figure out what they can
> > actually rely on ...)
>
> Yes it is subtle, yes most people won't care, however the problem is
> that it is subtly the wrong way around. People expect causality, this is
> a human failing perhaps, but that's how it is.
>
> And I strongly feel we should have our locks be such that they don't
> subtly break things.
>
> Take for instance the pattern where RCU relies on RCsc locks, this is an
> entirely simple and straight forward use of locks, yet completely fails
> on this subtle point.
Do you happen to remember exactly where in the kernel source this
occurs?
> And people will not even try and use complicated tools for apparently
> simple things. They'll say, oh of course this simple thing will work
> right.
>
> I'm still hoping we can convince the PowerPC people that they're wrong,
> and get rid of this wart and just call all locks RCsc.
It seems reasonable to ask people to learn that locks have stronger
ordering guarantees than RMW atomics do. Maybe not the greatest
situation in the world, but one I think we could live with.
> > OTOH (as I pointed out earlier) the strengthening we're configuring
> > will prevent some arch. (riscv being just the example of today!) to
> > go "full RCpc", and this will inevitably "complicate" both the LKMM
> > and the reviewing process of related changes (atomics, locking, ...;
> > c.f., this debate), apparently, just because you ;-) want to "care"
> > about these guarantees.
>
> It's not just me btw, Linus also cares about these matters. Widely used
> primitives such as spinlocks, should not have subtle and
> counter-intuitive behaviour such as RCpc.
Which raises the question of whether RCtso (adopting Daniel's suggested
term) is also too subtle or counter-intuitive for spinlocks. I wonder
what Linus would say...
> Anyway, back to the problem of being able to use the memory model to
> describe locks. This is I think a useful property.
>
> My earlier reasoning was that:
>
> - smp_store_release() + smp_load_acquire() := RCpc
>
> - we use smp_store_release() as unlock()
>
> Therefore, if we want unlock+lock to imply at least TSO (ideally
> smp_mb()) we need lock to make up for whatever unlock lacks.
>
> Hence my proposal to strenghten rmw-acquire, because that is the basic
> primitive used to implement lock.
That was essentially what the v2 patch did. (And my reasoning was
basically the same as what you have just outlined. There was one
additional element: smp_store_release() is already strong enough for
TSO; the acquire is what needs to be stronger in the memory model.)
> But as you (and Will) point out, we don't so much care about rmw-acquire
> semantics as much as that we care about unlock+lock behaviour. Another
> way to look at this is to define:
>
> smp-store-release + rmw-acquire := TSO (ideally smp_mb)
>
> But then we also have to look at:
>
> rmw-release + smp-load-acquire
> rmw-release + rmw-acquire
Let's assume that rmw-release is equivalent, in terms of ordering
strength, to smp_store_release(). Then we can focus our attention on
just the acquire part.
On PowerPC, for instance, if spin_lock() used a full HWSYNC fence
then unlock+lock would become RCsc -- even with no changes to
spin_unlock().
> for completeness sake, and I would suggest they result in (at least) the
> same (TSO) ordering as the one we really care about.
>
> One alternative is to no longer use smp_store_release() for unlock(),
> and say define atomic_set_release() to be in the rmw-release class
> instead of being a simple smp_store_release().
>
> Another, and I like this proposal least, is to introduce a new barrier
> to make this all work.
This apparently boils down to two questions:
Should spin_lock/spin_unlock be RCsc?
Should rmw-acquire be strong enough so that smp_store_release +
rmw-acquire is RCtso?
If both answers are No, we end up with the v3 patch. If the first
answer is No and the second is Yes, we end up with the v2 patch. The
problem is that different people seem to want differing answers.
(The implicit third question, "Should spin_lock/spin_unlock be RCtso?",
seems to be pretty well settled at this point -- by Peter's and Will's
vociferousness if nothing else -- despite Andrea's reservations.
However I admit it would be nice to have one or two examples showing
that the kernel really needs this.)
Alan
Hi Alan,
On Thu, Jul 12, 2018 at 01:04:27PM -0400, Alan Stern wrote:
> On Thu, 12 Jul 2018, Peter Zijlstra wrote:
> > But as you (and Will) point out, we don't so much care about rmw-acquire
> > semantics as much as that we care about unlock+lock behaviour. Another
> > way to look at this is to define:
> >
> > smp-store-release + rmw-acquire := TSO (ideally smp_mb)
> >
> > But then we also have to look at:
> >
> > rmw-release + smp-load-acquire
> > rmw-release + rmw-acquire
>
> Let's assume that rmw-release is equivalent, in terms of ordering
> strength, to smp_store_release(). Then we can focus our attention on
> just the acquire part.
I can live with that, but it does add another special case, where we could
otherwise just special case acquire/release for the load/store variants
vs everything else.
> On PowerPC, for instance, if spin_lock() used a full HWSYNC fence
> then unlock+lock would become RCsc -- even with no changes to
> spin_unlock().
>
> > for completeness sake, and I would suggest they result in (at least) the
> > same (TSO) ordering as the one we really care about.
> >
> > One alternative is to no longer use smp_store_release() for unlock(),
> > and say define atomic_set_release() to be in the rmw-release class
> > instead of being a simple smp_store_release().
> >
> > Another, and I like this proposal least, is to introduce a new barrier
> > to make this all work.
>
> This apparently boils down to two questions:
>
> Should spin_lock/spin_unlock be RCsc?
I would love that to be the case, but I'm not asking you to fight that
battle :)
> Should rmw-acquire be strong enough so that smp_store_release +
> rmw-acquire is RCtso?
>
> If both answers are No, we end up with the v3 patch. If the first
> answer is No and the second is Yes, we end up with the v2 patch. The
> problem is that different people seem to want differing answers.
Just to be extra unhelpful: I'm happy with either v2 or v3. I suspect Daniel
is the one to convince on v2, because it's RISC-V that's affected by this.
Will
On Thu, Jul 12, 2018 at 01:04:27PM -0400, Alan Stern wrote:
> On Thu, 12 Jul 2018, Peter Zijlstra wrote:
>
> > > But again, these are stuble patterns, and my guess is that several/
> > > most kernel developers really won't care about such guarantees (and
> > > if some will do, they'll have the tools to figure out what they can
> > > actually rely on ...)
> >
> > Yes it is subtle, yes most people won't care, however the problem is
> > that it is subtly the wrong way around. People expect causality, this is
> > a human failing perhaps, but that's how it is.
> >
> > And I strongly feel we should have our locks be such that they don't
> > subtly break things.
> >
> > Take for instance the pattern where RCU relies on RCsc locks, this is an
> > entirely simple and straight forward use of locks, yet completely fails
> > on this subtle point.
>
> Do you happen to remember exactly where in the kernel source this
> occurs?
Look for the uses of raw_spin_lock_irq_rcu_node() and friends in
kernel/rcu and include/linux/*rcu*, along with the explanation in
Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html
I must confess that I am not following exactly what Peter is calling
out as the failure. My best guess is that he is leading up to his
call for RCsc locks, but I might have missed a turn.
But Peter did supply raw_spin_lock_irq_rcu_node(), which is a marked
improvement over the earlier open-coding of smp_mb__after_unlock_lock()
after each and every acquisition of the rcu_node structure's ->lock,
so overall I cannot complain. ;-)
Thanx, Paul
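
For reference, a simplified sketch of the pattern behind the helpers Paul
mentions above (not the literal kernel macro; the my_ name is made up):

/*
 * Acquire an rcu_node lock, then promote the prior-unlock/this-lock
 * pair to a full barrier so that RCU's machinery sees critical
 * sections on this lock as fully ordered.
 */
#define my_raw_spin_lock_rcu_node(rnp)					\
do {									\
	raw_spin_lock(&(rnp)->lock);					\
	smp_mb__after_unlock_lock();					\
} while (0)

On architectures whose unlock+lock is already a full barrier,
smp_mb__after_unlock_lock() is a no-op; on powerpc it emits a real
barrier.
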
> Anyway, back to the problem of being able to use the memory model to
> describe locks. This is I think a useful property.
>
> My earlier reasoning was that:
>
> - smp_store_release() + smp_load_acquire() := RCpc
>
> - we use smp_store_release() as unlock()
>
> Therefore, if we want unlock+lock to imply at least TSO (ideally
> smp_mb()) we need lock to make up for whatever unlock lacks.
>
> Hence my proposal to strenghten rmw-acquire, because that is the basic
> primitive used to implement lock.
>
> But as you (and Will) point out, we don't so much care about rmw-acquire
> semantics as much as that we care about unlock+lock behaviour. Another
> way to look at this is to define:
>
> smp-store-release + rmw-acquire := TSO (ideally smp_mb)
>
> But then we also have to look at:
>
> rmw-release + smp-load-acquire
> rmw-release + rmw-acquire
>
> for completeness sake, and I would suggest they result in (at least) the
> same (TSO) ordering as the one we really care about.
Indeed (unless I'm not seeing something... ;-).
>
> One alternative is to no longer use smp_store_release() for unlock(),
> and say define atomic_set_release() to be in the rmw-release class
> instead of being a simple smp_store_release().
>
> Another, and I like this proposal least, is to introduce a new barrier
> to make this all work.
An smp_tso__after_unlock_lock()? (In a certain sense, the solution
adopted by RCU aligns to this approach: live with powerpc's RCpc and
introduce smp_mb__after_unlock_lock().) Or did you have something
else in mind?
But I wouldn't hasten to introduce such a barrier, given that: (1)
this would be a "do { } while (0)" for all the supported arch. _if_
we stuck to the current implementations, and (2) even if these
implementations changed or some new arch. required a non-trivial
definition, we still would have to find a "pure/TSO" case ;-).
Andrea
> It seems reasonable to ask people to learn that locks have stronger
> ordering guarantees than RMW atomics do. Maybe not the greatest
> situation in the world, but one I think we could live with.
Yeah, this was one of my main objections.
> > Hence my proposal to strenghten rmw-acquire, because that is the basic
> > primitive used to implement lock.
>
> That was essentially what the v2 patch did. (And my reasoning was
> basically the same as what you have just outlined. There was one
> additional element: smp_store_release() is already strong enough for
> TSO; the acquire is what needs to be stronger in the memory model.)
Mmh? see my comments to v2 (and your reply, in part., the part "At
least, it's not a valid general-purpose implementation".).
> > Another, and I like this proposal least, is to introduce a new barrier
> > to make this all work.
>
> This apparently boils down to two questions:
>
> Should spin_lock/spin_unlock be RCsc?
>
> Should rmw-acquire be strong enough so that smp_store_release +
> rmw-acquire is RCtso?
>
> If both answers are No, we end up with the v3 patch. If the first
> answer is No and the second is Yes, we end up with the v2 patch. The
> problem is that different people seem to want differing answers.
Again, maybe you're confounding v2 with v1?
Andrea
>
> (The implicit third question, "Should spin_lock/spin_unlock be RCtso?",
> seems to be pretty well settled at this point -- by Peter's and Will's
> vociferousness if nothing else -- despite Andrea's reservations.
> However I admit it would be nice to have one or two examples showing
> that the kernel really needs this.)
>
> Alan
>
On Thu, Jul 12, 2018 at 10:28:38AM -0700, Paul E. McKenney wrote:
> Look for the uses of raw_spin_lock_irq_rcu_node() and friends in
> kernel/rcu and include/linux/*rcu*, along with the explanation in
> Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html
>
> I must confess that I am not following exactly what Peter is calling
> out as the failure. My best guess is that he is leading up to his
> call for RCsc locks, but I might have missed a turn.
Yes, it is part of my call for RCsc locks.
The locking pattern is fairly simple and shows where RCpc comes apart
from expectation real nice.
Yes I helped clean up the barrier usage, but as always it'd be even
cleaner if we could just do away with the special barrier entirely.
On Thu, Jul 12, 2018 at 11:05 AM Peter Zijlstra <[email protected]> wrote:
>
> The locking pattern is fairly simple and shows where RCpc comes apart
> from expectation real nice.
So who does RCpc right now for the unlock-lock sequence? Somebody
mentioned powerpc. Anybody else?
How nasty would it be to make powerpc conform? I will always advocate
tighter locking and ordering rules over looser ones..
Linus
On Thu, Jul 12, 2018 at 01:04:27PM -0400, Alan Stern wrote:
> Which raises the question of whether RCtso (adopting Daniel's suggested
> term) is also too subtle or counter-intuitive for spinlocks.
IMO yes, RCtso also sucks :-) But in my 'fight' for RCsc it disallows
RCpc and captures the current state of affairs more accurately.
On Thu, Jul 12, 2018 at 11:10:58AM -0700, Linus Torvalds wrote:
> On Thu, Jul 12, 2018 at 11:05 AM Peter Zijlstra <[email protected]> wrote:
> >
> > The locking pattern is fairly simple and shows where RCpc comes apart
> > from expectation real nice.
>
> So who does RCpc right now for the unlock-lock sequence? Somebody
> mentioned powerpc. Anybody else?
powerpc has RCtso (and RCpc) but not RCsc unlock-lock, according to the
following (admittedly original) terminology:
- RCsc unlock-lock MUST ORDER:

  a) the WRITE and the READ below:

	WRITE x=1
	UNLOCK s
	LOCK s
	READ y

     as in a store-buffering test;

  b) the two WRITEs below:

	WRITE x=1
	UNLOCK s
	LOCK s
	WRITE y=1

     as in a message-passing test;

  c) the two READs below:

	READ x
	UNLOCK s
	LOCK s
	READ y

     as in a message-passing test;

  d) the READ and the WRITE below:

	READ x
	UNLOCK s
	LOCK s
	WRITE y

     as in a load-buffering test;

- RCtso unlock-lock MUST ORDER b), c), d) above.

- RCpc unlock-lock MUST ORDER none of the above.
AFAICT, all arch _in_ the current implementation have RCtso unlock-lock.
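
For concreteness, case a) above written as a litmus test (a sketch; the
name and the lock bracketing are made up):

C SB+unlocklock+mb

(*
 * Sketch only; untested here.
 *)

{}

P0(int *x, int *y, spinlock_t *s)
{
	int r1;

	spin_lock(s);
	WRITE_ONCE(*x, 1);
	spin_unlock(s);
	spin_lock(s);
	r1 = READ_ONCE(*y);
	spin_unlock(s);
}

P1(int *x, int *y)
{
	int r2;

	WRITE_ONCE(*y, 1);
	smp_mb();
	r2 = READ_ONCE(*x);
}

exists (0:r1=0 /\ 1:r2=0)

RCtso unlock-lock would be expected to leave this outcome allowed; RCsc
locks, or an smp_mb__after_spinlock() after the second spin_lock(),
should forbid it.
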
>
> How nasty would be be to make powerpc conform? I will always advocate
> tighter locking and ordering rules over looser ones..
A simple answer is right above (place a sync somewhere in the sequence);
for benchmark results, I must defer...
Andrea
>
> Linus
On Thu, Jul 12, 2018 at 09:52:42PM +0200, Andrea Parri wrote:
> On Thu, Jul 12, 2018 at 11:10:58AM -0700, Linus Torvalds wrote:
> > On Thu, Jul 12, 2018 at 11:05 AM Peter Zijlstra <[email protected]> wrote:
> > >
> > > The locking pattern is fairly simple and shows where RCpc comes apart
> > > from expectation real nice.
> >
> > So who does RCpc right now for the unlock-lock sequence? Somebody
> > mentioned powerpc. Anybody else?
>
> powerpc have RCtso (and RCpc) but not RCsc unlock-lock, according to the
> following indeed original terminology:
>
> - RCsc unlock-lock MUST ORDER:
>
> a) the WRITE and the READ below:
>
> WRITE x=1
> UNLOCK s
> LOCK s
> READ y
>
> as in a store-buffering test;
>
> b) the two WRITEs below:
>
> WRITE x=1
> UNLOCK s
> LOCK s
> WRITE y=1
>
> as in a message-passing test;
>
> c) the two READs below:
>
> READ x
> UNLOCK s
> LOCK s
> READ y
>
> as in a message-passing test;
>
> d) the READ and the WRITE below:
>
> READ x
> UNLOCK s
> LOCK s
> WRITE y
>
> as in a load-buffering test;
>
> - RCtso unlock-lock MUST ORDER b), c), d) above.
>
> - RCpc unlock-lock MUST ORDER none of the above.
>
> AFAICT, all arch _in_ the current implementation have RCtso unlock-lock.
>
>
> >
> > How nasty would be be to make powerpc conform? I will always advocate
> > tighter locking and ordering rules over looser ones..
>
> A simple answer is right above (place a sync somewhere in the sequence);
> for benchmark results, I must defer...
Sorry, not sure why but I did intend "conform to RCsc" here.
>
> Andrea
>
>
> >
> > Linus
On Thu, 12 Jul 2018, Andrea Parri wrote:
> > It seems reasonable to ask people to learn that locks have stronger
> > ordering guarantees than RMW atomics do. Maybe not the greatest
> > situation in the world, but one I think we could live with.
>
> Yeah, this was one of my main objections.
Does this mean you don't think you could live with it?
> > > Hence my proposal to strenghten rmw-acquire, because that is the basic
> > > primitive used to implement lock.
> >
> > That was essentially what the v2 patch did. (And my reasoning was
> > basically the same as what you have just outlined. There was one
> > additional element: smp_store_release() is already strong enough for
> > TSO; the acquire is what needs to be stronger in the memory model.)
>
> Mmh? see my comments to v2 (and your reply, in part., the part "At
> least, it's not a valid general-purpose implementation".).
>
>
> > > Another, and I like this proposal least, is to introduce a new barrier
> > > to make this all work.
> >
> > This apparently boils down to two questions:
> >
> > Should spin_lock/spin_unlock be RCsc?
> >
> > Should rmw-acquire be strong enough so that smp_store_release +
> > rmw-acquire is RCtso?
> >
> > If both answers are No, we end up with the v3 patch. If the first
> > answer is No and the second is Yes, we end up with the v2 patch. The
> > problem is that different people seem to want differing answers.
>
> Again, maybe you're confonding v2 with v1?
Oops, yes, I was. v1 was the version that made RMW updates be RCtso.
v2 and v3 affected only locking, the difference being that v2 used
unlock-rf-lock-po and v3 used po-unlock-rf-lock-po.
Alan
On Thu, Jul 12, 2018 at 04:43:46PM -0400, Alan Stern wrote:
> On Thu, 12 Jul 2018, Andrea Parri wrote:
>
> > > It seems reasonable to ask people to learn that locks have stronger
> > > ordering guarantees than RMW atomics do. Maybe not the greatest
> > > situation in the world, but one I think we could live with.
> >
> > Yeah, this was one of my main objections.
>
> Does this mean you don't think you could live with it?
Well, I *could* live with it and with RCtso locks, ;-) but I'd rather not.
Assuming that I will not be able to resist this RCtso trend, ;-) would the
below (completely untested) work?
let rmw = rmw | lk-rmw (* from lock.cat *)
let po-unlock-rf-lock-po = po ; [Release] ; rf ; [domain(rmw)] ; po
[the rest of your patch + the updates to the doc. I suggested in v2 ;-)]
Andrea
On Thu, Jul 12, 2018 at 11:13:48PM +0200, Andrea Parri wrote:
> On Thu, Jul 12, 2018 at 04:43:46PM -0400, Alan Stern wrote:
> > On Thu, 12 Jul 2018, Andrea Parri wrote:
> >
> > > > It seems reasonable to ask people to learn that locks have stronger
> > > > ordering guarantees than RMW atomics do. Maybe not the greatest
> > > > situation in the world, but one I think we could live with.
> > >
> > > Yeah, this was one of my main objections.
> >
> > Does this mean you don't think you could live with it?
>
> Well, I *could* leave with it and with RCtso locks, ;-) but I'd rather not.
>
> Assuming that I will not be able to resist this RCtso trend, ;-) would the
> below (completely untested) work?
>
> let rmw = rmw | lk-rmw (* from lock.cat *)
> let po-unlock-rf-lock-po = po ; [Release] ; rf ; [domain(rmw)] ; po
domain(rmw) & Acquire, maybe...
Andrea
> [the rest of your patch + the updates to the doc. I suggested in v2 ;-)]
>
> Andrea
On 7/12/2018 11:10 AM, Linus Torvalds wrote:
> On Thu, Jul 12, 2018 at 11:05 AM Peter Zijlstra <[email protected]> wrote:
>>
>> The locking pattern is fairly simple and shows where RCpc comes apart
>> from expectation real nice.
>
> So who does RCpc right now for the unlock-lock sequence? Somebody
> mentioned powerpc. Anybody else?
>
> How nasty would be be to make powerpc conform? I will always advocate
> tighter locking and ordering rules over looser ones..
>
> Linus
RISC-V probably would have been RCpc if we weren't having this discussion.
Depending on how we map atomics/acquire/release/unlock/lock, we can end up
producing RCpc, "RCtso" (feel free to find a better name here...), or RCsc
behaviors, and we're trying to figure out which we actually need.
I think the debate is this:
Obviously programmers would prefer just to have RCsc and not have to figure out
all the complexity of the other options. On x86 or architectures with native
RCsc operations (like ARMv8), that's generally easy enough to get.
For weakly-ordered architectures that use fences for ordering (including
PowerPC and sometimes RISC-V, see below), though, it takes extra fences to go
from RCpc to either "RCtso" or RCsc. People using these architectures are
concerned about whether there's a negative performance impact from those extra
fences.
However, some scheduler code, some RCU code, and probably some other examples
already implicitly or explicitly assume unlock()/lock() provides stronger
ordering than RCpc. So, we have to decide whether to:
1) define unlock()/lock() to enforce "RCtso" or RCsc, insert more fences on
PowerPC and RISC-V accordingly, and probably negatively regress PowerPC
2) leave unlock()/lock() as enforcing only RCpc, fix any code that currently
assumes something stronger than RCpc is being provided, and hope people don't
get it wrong in the future
3) some mixture like having unlock()/lock() be "RCtso" but smp_store_release()/
smp_cond_load_acquire() be only RCpc
Also, FWIW, if other weakly-ordered architectures come along in the future and
also use any kind of lightweight fence rather than native RCsc operations,
they'll likely be in the same boat as RISC-V and Power here, in the sense of
not providing RCsc by default either.
Is that a fair assessment everyone?
I can also not-so-briefly summarize RISC-V's status here, since I think there's
been a bunch of confusion about where we're coming from:
First of all, I promise we're not trying to start a fight about all this :)
We're trying to understand the LKMM requirements so we know what instructions
to use.
With that, the easy case: RISC-V is RCsc if we use AMOs or load-reserved/
store-conditional, all of which have RCsc .aq and .rl bits:
    (a) ...
        amoswap.w.rl x0, x0, [lock]	// unlock()
        ...
    loop:
        amoswap.w.aq a0, t1, [lock]	// lock()
        bnez a0, loop			// lock()
    (b) ...
(a) is ordered before (b) here, regardless of what (a) and (b) are. Likewise
for our load-reserved/store-conditional instructions, which also have .aq and
.rl. That's similar to how ARM behaves, and is no problem. We're happy with
that too.
Unfortunately, we don't (currently?) have plain load-acquire or store-release
opcodes in the ISA. (That's a different discussion...) For those, we need
fences instead. And that's where it gets messier.
RISC-V *would* end up providing only RCpc if we use what I'd argue is the most
"natural" fence-based mapping for store-release operations, and then pair that
with LR/SC:
    (a) ...
        fence rw,w			// unlock()
        sw x0, [lock]			// unlock()
        ...
    loop:
        lr.w.aq a0, [lock]		// lock()
        sc.w a1, t1, [lock]		// lock()
        bnez a1, loop			// lock()
    (b) ...
However, if (a) and (b) are loads to different addresses, then (a) is not
ordered before (b) here. One unpaired RCsc operation is not a full fence.
Clearly "fence rw,w" is not sufficient if the scheduler, RCU, and elsewhere
depend on "RCtso" or RCsc.
RISC-V can get back to "RCtso", matching PowerPC, by using a stronger fence:
    (a) ...
        fence.tso			// unlock(), fence.tso == fence rw,w + fence r,r
        sw x0, [lock]			// unlock()
        ...
    loop:
        lr.w.aq a0, [lock]		// lock()
        sc.w a1, t1, [lock]		// lock()
        bnez a1, loop			// lock()
    (b) ...
(a) is ordered before (b), unless (a) is a store and (b) is a load to a
different address.
(Modeling note: this example is why I asked for Alan's v3 patch over the v2
patch, which I believe would only have worked if the fence.tso were at the end)
To get full RCsc here, we'd need a fence rw,rw in between the unlock store and
the lock load, much like PowerPC would I believe need a heavyweight sync:
    (a) ...
        fence rw,w			// unlock()
        sw x0, [lock]			// unlock()
        ...
        fence rw,rw			// can attach either to lock() or to unlock()
        ...
    loop:
        lr.w.aq a0, [lock]		// lock()
        sc.w a1, t1, [lock]		// lock()
        bnez a1, loop			// lock()
    (b) ...
In general, RISC-V's fence.tso will suffice wherever PowerPC's lwsync does, and
RISC-V's fence rw,rw will suffice wherever PowerPC's full sync does. If anyone
is claiming RISC-V is suddenly proposing to go weaker than all the other major
architectures, that's a mischaracterization.
All in all: if LKMM wants RCsc, we can do it, but it's not free for RISC-V (or
Power). If LKMM wants RCtso, we can do that too, and that's in between. If
LKMM wants RCpc, we can do that, and it's the fastest of the bunch. No I don't
have concrete numbers either... And RISC-V implementations are going to vary
pretty widely anyway.
Hope that helps. Please correct anything I screwed up or mischaracterized.
Dan
On 7/12/2018 2:45 AM, Will Deacon wrote:
> On Thu, Jul 12, 2018 at 11:34:32AM +0200, Peter Zijlstra wrote:
>> On Thu, Jul 12, 2018 at 09:40:40AM +0200, Peter Zijlstra wrote:
>>> And I think if we raise atomic*_acquire() to require TSO (but ideally
>>> raise it to RCsc) we're there.
>>
>> To clarify, just the RmW-acquire. Things like atomic_read_acquire() can
>> stay smp_load_acquire() and be RCpc.
>
> I don't have strong opinions about strengthening RmW atomics to TSO, so
> if it helps to unblock Alan's patch (which doesn't go near this!) then I'll
> go with it. The important part is that we continue to allow roach motel
> into the RmW for other accesses in the non-fully-ordered cases.
>
> Daniel -- your AMO instructions are cool with this, right? It's just the
> fence-based implementations that will need help?
>
> Will
Right, let me pull this part out of the overly-long response I just gave
on the thread with Linus :)
If we pair AMOs with AMOs, we get RCsc, and everything is fine. If we
start mixing in fences (mostly because we don't currently have native
load-acquire or store-release opcodes), then that's when all the rest of the
complexity comes in.
Dan
On Thu, Jul 12, 2018 at 07:05:39PM -0700, Daniel Lustig wrote:
> On 7/12/2018 11:10 AM, Linus Torvalds wrote:
> > On Thu, Jul 12, 2018 at 11:05 AM Peter Zijlstra <[email protected]> wrote:
> >>
> >> The locking pattern is fairly simple and shows where RCpc comes apart
> >> from expectation real nice.
> >
> > So who does RCpc right now for the unlock-lock sequence? Somebody
> > mentioned powerpc. Anybody else?
> >
> > How nasty would be be to make powerpc conform? I will always advocate
> > tighter locking and ordering rules over looser ones..
> >
> > Linus
>
> RISC-V probably would have been RCpc if we weren't having this discussion.
> Depending on how we map atomics/acquire/release/unlock/lock, we can end up
> producing RCpc, "RCtso" (feel free to find a better name here...), or RCsc
> behaviors, and we're trying to figure out which we actually need.
>
> I think the debate is this:
>
> Obviously programmers would prefer just to have RCsc and not have to figure out
> all the complexity of the other options. On x86 or architectures with native
> RCsc operations (like ARMv8), that's generally easy enough to get.
>
> For weakly-ordered architectures that use fences for ordering (including
> PowerPC and sometimes RISC-V, see below), though, it takes extra fences to go
> from RCpc to either "RCtso" or RCsc. People using these architectures are
> concerned about whether there's a negative performance impact from those extra
> fences.
>
> However, some scheduler code, some RCU code, and probably some other examples
> already implicitly or explicitly assume unlock()/lock() provides stronger
> ordering than RCpc.
Just to be clear, the RCU code uses smp_mb__after_unlock_lock() to get
the ordering that it needs out of spinlocks. Maybe that is what you
meant by "explicitly assume", but I figured I should clarify.
Thanx, Paul
On Thu, Jul 12, 2018 at 07:05:39PM -0700, Daniel Lustig wrote:
> On 7/12/2018 11:10 AM, Linus Torvalds wrote:
> > On Thu, Jul 12, 2018 at 11:05 AM Peter Zijlstra <[email protected]> wrote:
> >>
> >> The locking pattern is fairly simple and shows where RCpc comes apart
> >> from expectation real nice.
> >
> > So who does RCpc right now for the unlock-lock sequence? Somebody
> > mentioned powerpc. Anybody else?
> >
> > How nasty would be be to make powerpc conform? I will always advocate
> > tighter locking and ordering rules over looser ones..
> >
> > Linus
>
> RISC-V probably would have been RCpc if we weren't having this discussion.
> Depending on how we map atomics/acquire/release/unlock/lock, we can end up
> producing RCpc, "RCtso" (feel free to find a better name here...), or RCsc
> behaviors, and we're trying to figure out which we actually need.
>
> I think the debate is this:
>
> Obviously programmers would prefer just to have RCsc and not have to figure out
> all the complexity of the other options. On x86 or architectures with native
> RCsc operations (like ARMv8), that's generally easy enough to get.
>
> For weakly-ordered architectures that use fences for ordering (including
> PowerPC and sometimes RISC-V, see below), though, it takes extra fences to go
> from RCpc to either "RCtso" or RCsc. People using these architectures are
> concerned about whether there's a negative performance impact from those extra
> fences.
>
> However, some scheduler code, some RCU code, and probably some other examples
> already implicitly or explicitly assume unlock()/lock() provides stronger
> ordering than RCpc. So, we have to decide whether to:
> 1) define unlock()/lock() to enforce "RCtso" or RCsc, insert more fences on
> PowerPC and RISC-V accordingly, and probably negatively regress PowerPC
> 2) leave unlock()/lock() as enforcing only RCpc, fix any code that currently
> assumes something stronger than RCpc is being provided, and hope people don't
> get it wrong in the future
> 3) some mixture like having unlock()/lock() be "RCtso" but smp_store_release()/
> smp_cond_load_acquire() be only RCpc
>
> Also, FWIW, if other weakly-ordered architectures come along in the future and
> also use any kind of lightweight fence rather than native RCsc operations,
> they'll likely be in the same boat as RISC-V and Power here, in the sense of
> not providing RCsc by default either.
>
> Is that a fair assessment everyone?
It is for me, thank you! And as we've seen, there are arguments for each of
the above three choices. I'm afraid that (despite Linus's statement ;-)),
my preference would currently go to (2).
>
>
>
> I can also not-so-briefly summarize RISC-V's status here, since I think there's
> been a bunch of confusion about where we're coming from:
>
> First of all, I promise we're not trying to start a fight about all this :)
> We're trying to understand the LKMM requirements so we know what instructions
> to use.
>
> With that, the easy case: RISC-V is RCsc if we use AMOs or load-reserved/
> store-conditional, all of which have RCsc .aq and .rl bits:
>
> (a) ...
> amoswap.w.rl x0, x0, [lock] // unlock()
> ...
> loop:
> amoswap.w.aq a0, t1, [lock] // lock()
> bnez a0, loop // lock()
> (b) ...
>
> (a) is ordered before (b) here, regardless of (a) and (b). Likewise for our
> load-reserved/store-conditional instructions, which also have .aq and rl.
> That's similiar to how ARM behaves, and is no problem. We're happy with that
> too.
>
> Unfortunately, we don't (currently?) have plain load-acquire or store-release
> opcodes in the ISA. (That's a different discussion...) For those, we need
> fences instead. And that's where it gets messier.
>
> RISC-V *would* end up providing only RCpc if we use what I'd argue is the most
> "natural" fence-based mapping for store-release operations, and then pair that
> with LR/SC:
>
> (a) ...
> fence rw,w // unlock()
> sw x0, [lock] // unlock()
> ...
> loop:
> lr.w.aq a0, [lock] // lock()
> sc.w t1, [lock] // lock()
> bnez loop // lock()
> (b) ...
>
> However, if (a) and (b) are loads to different addresses, then (a) is not
> ordered before (b) here. One unpaired RCsc operation is not a full fence.
> Clearly "fence rw,w" is not sufficient if the scheduler, RCU, and elsewhere
> depend on "RCtso" or RCsc.
>
> RISC-V can get back to "RCtso", matching PowerPC, by using a stronger fence:
Or by using a "fence r,rw" in the lock() (without the .aq), as current code
does ;-) though I'm not sure how the current solution would compare to the
.tso mapping...
Andrea
>
> (a) ...
> fence.tso // unlock(), fence.tso == fence rw,w + fence r,r
> sw x0, [lock] // unlock()
> ...
> loop:
> lr.w.aq a0, [lock] // lock()
> sc.w t1, [lock] // lock()
> bnez loop // lock()
> (b) ...
>
> (a) is ordered before (b), unless (a) is a store and (b) is a load to a
> different address.
>
> (Modeling note: this example is why I asked for Alan's v3 patch over the v2
> patch, which I believe would only have worked if the fence.tso were at the end)
>
> To get full RCsc here, we'd need a fence rw,rw in between the unlock store and
> the lock load, much like PowerPC would I believe need a heavyweight sync:
>
> (a) ...
> fence rw,w // unlock()
> sw x0, [lock] // unlock()
> ...
> fence rw,rw // can attach either to lock() or to unlock()
> ...
> loop:
> lr.w.aq a0, [lock] // lock()
> sc.w t1, [lock] // lock()
> bnez loop // lock()
> (b) ...
>
> In general, RISC-V's fence.tso will suffice wherever PowerPC's lwsync does, and
> RISC-V's fence rw,rw will suffice wherever PowerPC's full sync does. If anyone
> is claiming RISC-V is suddenly proposing to go weaker than all the other major
> architectures, that's a mischaracterization.
>
> All in all: if LKMM wants RCsc, we can do it, but it's not free for RISC-V (or
> Power). If LKMM wants RCtso, we can do that too, and that's in between. If
> LKMM wants RCpc, we can do that, and it's the fastest of the bunch. No I don't
> have concrete numbers either... And RISC-V implementations are going to vary
> pretty widely anyway.
>
> Hope that helps. Please correct anything I screwed up or mischaracterized.
>
> Dan
On Fri, Jul 13, 2018 at 11:07:11AM +0200, Andrea Parri wrote:
> On Thu, Jul 12, 2018 at 07:05:39PM -0700, Daniel Lustig wrote:
> > On 7/12/2018 11:10 AM, Linus Torvalds wrote:
> > > On Thu, Jul 12, 2018 at 11:05 AM Peter Zijlstra <[email protected]> wrote:
> > >>
> > >> The locking pattern is fairly simple and shows where RCpc comes apart
> > >> from expectation real nice.
> > >
> > > So who does RCpc right now for the unlock-lock sequence? Somebody
> > > mentioned powerpc. Anybody else?
> > >
> > > How nasty would it be to make powerpc conform? I will always advocate
> > > tighter locking and ordering rules over looser ones..
> > >
> > > Linus
> >
> > RISC-V probably would have been RCpc if we weren't having this discussion.
> > Depending on how we map atomics/acquire/release/unlock/lock, we can end up
> > producing RCpc, "RCtso" (feel free to find a better name here...), or RCsc
> > behaviors, and we're trying to figure out which we actually need.
> >
> > I think the debate is this:
> >
> > Obviously programmers would prefer just to have RCsc and not have to figure out
> > all the complexity of the other options. On x86 or architectures with native
> > RCsc operations (like ARMv8), that's generally easy enough to get.
> >
> > For weakly-ordered architectures that use fences for ordering (including
> > PowerPC and sometimes RISC-V, see below), though, it takes extra fences to go
> > from RCpc to either "RCtso" or RCsc. People using these architectures are
> > concerned about whether there's a negative performance impact from those extra
> > fences.
> >
> > However, some scheduler code, some RCU code, and probably some other examples
> > already implicitly or explicitly assume unlock()/lock() provides stronger
> > ordering than RCpc. So, we have to decide whether to:
> > 1) define unlock()/lock() to enforce "RCtso" or RCsc, insert more fences on
> > PowerPC and RISC-V accordingly, and probably negatively regress PowerPC
> > 2) leave unlock()/lock() as enforcing only RCpc, fix any code that currently
> > assumes something stronger than RCpc is being provided, and hope people don't
> > get it wrong in the future
> > 3) some mixture like having unlock()/lock() be "RCtso" but smp_store_release()/
> > smp_cond_load_acquire() be only RCpc
> >
> > Also, FWIW, if other weakly-ordered architectures come along in the future and
> > also use any kind of lightweight fence rather than native RCsc operations,
> > they'll likely be in the same boat as RISC-V and Power here, in the sense of
> > not providing RCsc by default either.
> >
> > Is that a fair assessment everyone?
>
> It's for me, thank you! And as we've seen, there are arguments for each of
> the above three choices. I'm afraid that (despite Linus's statement ;-)),
> my preference would currently go to (2).
And, since we're stating preferences, I'll reiterate my preference towards:
* RCsc unlock/lock
* RCpc release/acquire
* Not fussed about atomic rmws, but having them closer to RCsc would
make it easier to implement and reason about generic locking
implementations (i.e. reduce the number of special ordering cases
and/or magic barrier macros)
Will
On Thu, Jul 12, 2018 at 11:10:58AM -0700, Linus Torvalds wrote:
> On Thu, Jul 12, 2018 at 11:05 AM Peter Zijlstra <[email protected]> wrote:
> >
> > The locking pattern is fairly simple and shows where RCpc comes apart
> > from expectation real nice.
>
> So who does RCpc right now for the unlock-lock sequence? Somebody
> mentioned powerpc. Anybody else?
RISC-V followed, because the LKMM currently states it is allowed; in
fact the LKMM is currently weaker than even PowerPC, which is what this
current discussion is about.
A number of people, including myself, are arguing for stronger lock
ordering (RCsc), but at a minimum for getting the LKMM to be at least as
strong as Power ("RCtso", as coined by Daniel), which disallows full RCpc.
> How nasty would it be to make powerpc conform? I will always advocate
> tighter locking and ordering rules over looser ones..
mpe did a micro-bench a little while ago:
http://lkml.iu.edu/hypermail/linux/kernel/1804.0/01990.html
which says 30% more expensive for uncontended lock+unlock. Which I admit
is fairly yuck. No macro bench results though.
Peter Zijlstra <[email protected]> writes:
> On Thu, Jul 12, 2018 at 11:10:58AM -0700, Linus Torvalds wrote:
>> On Thu, Jul 12, 2018 at 11:05 AM Peter Zijlstra <[email protected]> wrote:
>> >
>> > The locking pattern is fairly simple and shows where RCpc comes apart
>> > from expectation real nice.
>>
>> So who does RCpc right now for the unlock-lock sequence? Somebody
>> mentioned powerpc. Anybody else?
>
> RISC-V followed, because the LKMM currently states it is allowed; in
> fact the LKMM is currently weaker than even PowerPC, which is what this
> current discussion is about.
>
> A number of people, including myself, are arguing for stronger lock
> ordering (RCsc), but at a minimum for getting the LKMM to be at least as
> strong as Power ("RCtso", as coined by Daniel), which disallows full RCpc.
>
>> How nasty would it be to make powerpc conform? I will always advocate
>> tighter locking and ordering rules over looser ones..
>
> mpe did a micro-bench a little while ago:
>
> http://lkml.iu.edu/hypermail/linux/kernel/1804.0/01990.html
>
> which says 30% more expensive for uncontended lock+unlock. Which I admit
> is fairly yuck. No macro bench results though.
I reran some numbers today with some slightly updated tests.
It varies quite a bit across machines and CPU revisions.
On one I get:
Lock/Unlock Time Time % Total Cycles Cycles Cycles Delta
lwsync/lwsync 79,290,859,955 100.0 % 290,160,065,087 145 -
lwsync/sync 104,903,703,237 132.3 % 383,966,199,430 192 47
Another:
Lock/Unlock Time Time % Total Cycles Cycles Cycles Delta
lwsync/lwsync 71,662,395,722 100.0 % 252,403,777,715 126 -
lwsync/sync 84,932,987,977 118.5 % 299,141,951,285 150 23
So 18-32% slower, or 23-47 cycles.
Next week I can do some macro benchmarks, to see if it's actually
detectable at all.
The other question is how they behave on a heavily loaded system.
My personal preference would be to switch to sync; we don't want to be
the only arch finding (or not finding!) exotic ordering bugs.
But we'd also rather not make our slow locks any slower than they have
to be.
cheers
Hi Michael,
On Fri, Jul 13, 2018 at 11:15:26PM +1000, Michael Ellerman wrote:
> I reran some numbers today with some slightly updated tests.
>
> It varies quite a bit across machines and CPU revisions.
>
> On one I get:
>
> Lock/Unlock Time Time % Total Cycles Cycles Cycles Delta
> lwsync/lwsync 79,290,859,955 100.0 % 290,160,065,087 145 -
> lwsync/sync 104,903,703,237 132.3 % 383,966,199,430 192 47
>
> Another:
>
> Lock/Unlock Time Time % Total Cycles Cycles Cycles Delta
> lwsync/lwsync 71,662,395,722 100.0 % 252,403,777,715 126 -
> lwsync/sync 84,932,987,977 118.5 % 299,141,951,285 150 23
>
>
> So 18-32% slower, or 23-47 cycles.
Very good info. Note that another option is to put the SYNC in lock(); it
doesn't really matter which of the two primitives gets it. I don't
suppose it really matters for timing either way around.
> Next week I can do some macro benchmarks, to see if it's actually
> detectable at all.
>
> The other question is how they behave on a heavily loaded system.
>
>
> My personal preference would be to switch to sync, we don't want to be
> the only arch finding (or not finding!) exotic ordering bugs.
>
> But we'd also rather not make our slow locks any slower than they have
> to be.
I completely understand, but I'll get you beer (lots) if you do manage
to make SYNC happen :-) :-)
On Fri, Jul 13, 2018 at 2:34 AM Will Deacon <[email protected]> wrote:
>
> And, since we're stating preferences, I'll reiterate my preference towards:
>
> * RCsc unlock/lock
> * RCpc release/acquire
Yes, I think this would be best. We *used* to have pretty heavy-weight
locking rules for various reasons, and we relaxed them for reasons
that weren't perhaps always the right ones.
Locking is pretty heavy-weight in general, and meant to be the "I
don't really have to think about this very much" option. Then not
being serializing enough to confuse people when it allows odd behavior
(on _some_ architectures) does not sound like a great idea.
In contrast, when you do release/acquire or any of the other "I know
what I'm doing" things, I think we want the minimal serialization
implied by the very specialized op.
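(As a concrete sketch of that minimal-serialization case, the classic
message-passing idiom needs nothing more than an RCpc release/acquire pair;
the names and the value 42 below are made up for illustration:)

#include <asm/barrier.h>	/* smp_store_release(), smp_load_acquire() */

static int data;
static int flag;

static void producer(void)
{
	data = 42;			/* payload */
	smp_store_release(&flag, 1);	/* publish: payload ordered before flag */
}

static int consumer(void)
{
	if (smp_load_acquire(&flag))	/* acquire: flag read ordered before payload read */
		return data;		/* sees 42 whenever the flag was seen */
	return -1;			/* not published yet */
}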
> * Not fussed about atomic rmws, but having them closer to RCsc would
> make it easier to implement and reason about generic locking
> implementations
I would prefer that rmw's be RCsc by default, but that there are then
"relaxed" versions of it that aren't.
For example, one common case of rmw has nothing to do with any
ordering at all: statistics gathering. It usually has absolutely zero
need for any ordering per se, and all it wants is cache coherence.
Yes, yes, the really critical stuff we then use percpu counters for
and a lot of clever software, but there's a lot of cases where that
isn't practical or isn't _quite_ important enough.
So "atomic_add()" being RCsc sounds like a nice tight requirement, but
then architectures who can do it cheaper could have
"atomic_add_relaxed()" that has no inherent ordering at all.
But let's see what the powerpc people find about the actual
performance impact of being RCsc on locking. Real numbers for real
loads would be nice.
Linus
On Fri, Jul 13, 2018 at 10:16:48AM -0700, Linus Torvalds wrote:
> On Fri, Jul 13, 2018 at 2:34 AM Will Deacon <[email protected]> wrote:
> >
> > And, since we're stating preferences, I'll reiterate my preference towards:
> >
> > * RCsc unlock/lock
> > * RCpc release/acquire
>
> Yes, I think this would be best. We *used* to have pretty heavy-weight
> locking rules for various reasons, and we relaxed them for reasons
> that weren't perhaps always the right ones.
>
> Locking is pretty heavy-weight in general, and meant to be the "I
> don't really have to think about this very much" option. Then not
> being serializing enough to confuse people when it allows odd behavior
> (on _some_ architectures) does not sound like a great idea.
>
> In contrast, when you do release/acquire or any of the other "I know
> what I'm doing" things, I think we want the minimal serialization
> implied by the very specialized op.
The changes under discussion are _not_ affecting uses such as:
P0:
spin_lock(s);
UPDATE data_struct
spin_unlock(s);
P1:
spin_lock(s);
UPDATE data_struct
spin_unlock(s);
[...]
(most common use case for locking?): these uses work just _fine_ with
the current implementations and in LKMM.
OTOH, these changes are going to affect uses where threads interact by
"mixing" locking and _other_ synchronization primitives such as in:
{ x = 0; y = 0; }
P0:
spin_lock(s);
WRITE_ONCE(x, 1);
spin_unlock(s);
P1:
spin_lock(s);
r0 = READ_ONCE(x);
WRITE_ONCE(y, 1);
spin_unlock(s);
P2:
r1 = smp_load_acquire(&y);
r2 = READ_ONCE(x);
BUG_ON(r0 == 1 && r1 == 1 && r2 == 0)
and
{ x = 0; y = 0; z = 0; }
P0:
spin_lock(s);
WRITE_ONCE(x, 1);
r0 = READ_ONCE(y);
spin_unlock(s);
P1:
spin_lock(s);
WRITE_ONCE(y, 1);
r1 = READ_ONCE(z);
spin_unlock(s);
P2
WRITE_ONCE(z, 1);
smp_mb();
r2 = READ_ONCE(x);
BUG_ON(r0 == 0 && r1 == 0 && r2 == 0)
(inspired by __two__ uses in kernel/{sched,rcu}). Even if someone were
to tell me that locks serialize enough, I'd still be prompted to say "yes,
but do / can my BUG_ON()s fire?".
Actually, my very first reaction, before starting what does appear to be
indeed a long and complex conversation, would probably be to run/check the
above snippets against the (latest) LKMM, by using the associated tool.
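(For reference, the first snippet above maps almost directly onto the LKMM
litmus-test format used under tools/memory-model/; the test name below is
made up, and the test can be checked with something like
"herd7 -conf linux-kernel.cfg <file>.litmus":)

C unlock-lock-acquire-mp

{}

P0(int *x, spinlock_t *s)
{
	spin_lock(s);
	WRITE_ONCE(*x, 1);
	spin_unlock(s);
}

P1(int *x, int *y, spinlock_t *s)
{
	int r0;

	spin_lock(s);
	r0 = READ_ONCE(*x);
	WRITE_ONCE(*y, 1);
	spin_unlock(s);
}

P2(int *x, int *y)
{
	int r1;
	int r2;

	r1 = smp_load_acquire(y);
	r2 = READ_ONCE(*x);
}

exists (1:r0=1 /\ 2:r1=1 /\ 2:r2=0)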
Once "checked" with both people and automated models, I'd probably remain
suspicious about my "magic" code so that I most likely will be prompted to
dig into each single arch. implementation / reference manual...
... Time's up!
Andrea
On Fri, Jul 13, 2018 at 06:42:39PM +0200, Peter Zijlstra wrote:
>
> Hi Michael,
>
> On Fri, Jul 13, 2018 at 11:15:26PM +1000, Michael Ellerman wrote:
> > I reran some numbers today with some slightly updated tests.
> >
> > It varies quite a bit across machines and CPU revisions.
> >
> > On one I get:
> >
> > Lock/Unlock Time Time % Total Cycles Cycles Cycles Delta
> > lwsync/lwsync 79,290,859,955 100.0 % 290,160,065,087 145 -
> > lwsync/sync 104,903,703,237 132.3 % 383,966,199,430 192 47
> >
> > Another:
> >
> > Lock/Unlock Time Time % Total Cycles Cycles Cycles Delta
> > lwsync/lwsync 71,662,395,722 100.0 % 252,403,777,715 126 -
> > lwsync/sync 84,932,987,977 118.5 % 299,141,951,285 150 23
> >
> >
> > So 18-32% slower, or 23-47 cycles.
>
> Very good info. Note that another option is to put the SYNC in lock() it
> doesn't really matter which of the two primitives gets it. I don't
> suppose it really matters for timing either way around.
>
> > Next week I can do some macro benchmarks, to see if it's actually
> > detectable at all.
> >
> > The other question is how they behave on a heavily loaded system.
> >
> >
> > My personal preference would be to switch to sync, we don't want to be
> > the only arch finding (or not finding!) exotic ordering bugs.
> >
> > But we'd also rather not make our slow locks any slower than they have
> > to be.
>
> I completely understand, but I'll get you beer (lots) if you do manage
> to make SYNC happen :-) :-)
One trivial remark seems due: it's of course very easy to stick a full
or a "tso" fence into one's spin_lock() implementation, or to tighten the
semantics of such a primitive; removing this fence, or weakening the
semantics, is another matter...
(/me reminding about that spin_is_locked() discussion...)
Andrea
On Fri, 13 Jul 2018, Andrea Parri wrote:
> On Fri, Jul 13, 2018 at 10:16:48AM -0700, Linus Torvalds wrote:
> > On Fri, Jul 13, 2018 at 2:34 AM Will Deacon <[email protected]> wrote:
> > >
> > > And, since we're stating preferences, I'll reiterate my preference towards:
> > >
> > > * RCsc unlock/lock
> > > * RCpc release/acquire
> >
> > Yes, I think this would be best. We *used* to have pretty heavy-weight
> > locking rules for various reasons, and we relaxed them for reasons
> > that weren't perhaps always the right ones.
> >
> > Locking is pretty heavy-weight in general, and meant to be the "I
> > don't really have to think about this very much" option. Then not
> > being serializing enough to confuse people when it allows odd behavior
> > (on _some_ architectures) does not sound like a great idea.
> >
> > In contrast, when you do release/acquire or any of the other "I know
> > what I'm doing" things, I think we want the minimal serialization
> > implied by the very specialized op.
>
> The changes under discussion are _not_ affecting uses such as:
>
> P0:
> spin_lock(s);
> UPDATE data_struct
> spin_unlock(s);
>
> P1:
> spin_lock(s);
> UPDATE data_struct
> spin_unlock(s);
>
> [...]
>
> (most common use case for locking?): these uses work just _fine_ with
> the current implementations and in LKMM.
>
> OTOH, these changes are going to affect uses where threads interact by
> "mixing" locking and _other_ synchronization primitives such as in:
>
> { x = 0; y = 0; }
>
> P0:
> spin_lock(s);
> WRITE_ONCE(x, 1);
> spin_unlock(s);
>
> P1:
> spin_lock(s);
> r0 = READ_ONCE(x);
> WRITE_ONCE(y, 1);
> spin_unlock(s);
>
> P2:
> r1 = smp_load_acquire(&y);
> r2 = READ_ONCE(x);
>
> BUG_ON(r0 == 1 && r1 == 1 && r2 == 0)
>
> and
>
> { x = 0; y = 0; z = 0; }
>
> P0:
> spin_lock(s);
> WRITE_ONCE(x, 1);
> r0 = READ_ONCE(y);
> spin_unlock(s);
>
> P1:
> spin_lock(s);
> WRITE_ONCE(y, 1);
> r1 = READ_ONCE(z);
> spin_unlock(s);
>
> P2
> WRITE_ONCE(z, 1);
> smp_mb();
> r2 = READ_ONCE(x);
>
> BUG_ON(r0 == 0 && r1 == 0 && r2 == 0)
>
> (inspired from __two__ uses in kernel/{sched,rcu}). Even if someone were
> to tell me that locks serialize enough, I'd still be prompted to say "yes,
> but do / can my BUG_ON()s fire?".
The point being that the scenarios under discussion in this thread all
fall most definitely into the "Non-standard usage; you'd better know
exactly what you're doing" category.
Which suggests, by Linus's reasoning, that locking should be as
lightweight as possible while still being able to perform its basic job
of defining critical sections. In other words, RCpc.
And which would still leave smp_mb__after_unlock_lock available for
more esoteric usages. Although it provides RCsc ordering, I assume the
overhead wouldn't be prohibitive in situations where only RCtso
ordering is needed.
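(A rough sketch of that usage pattern, with an illustrative lock name:)

spin_lock(&mylock);
/*
 * Together with a prior spin_unlock(&mylock), possibly on another CPU,
 * the following barrier makes the unlock+lock pair act as a full (RCsc)
 * barrier, visible even to CPUs not holding the lock.
 */
smp_mb__after_unlock_lock();
/* ... accesses that rely on that stronger ordering ... */
spin_unlock(&mylock);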
Alan
> Actually, my very first reaction, before starting what does appear to be
> indeed a long and complex conversation, would probably be to run/check the
> above snippets against the (latest) LKMM, by using the associated tool.
>
> Once "checked" with both people and automated models, I'd probably remain
> suspicious about my "magic" code so that I most likely will be prompted to
> dig into each single arch. implementation / reference manual...
>
> ... Time's up!
>
> Andrea
On Fri, Jul 13, 2018 at 6:51 PM Alan Stern <[email protected]> wrote:
>
> The point being that the scenarios under discussion in this thread all
> fall most definitely into the "Non-standard usage; you'd better know
> exactly what you're doing" category.
Well, yes and no.
The thing is, people expected unlock+lock to give a memory ordering.
It happened in RCU, and it's happened before elsewhere.
So it *is* the "pure locking" thing that ends up confusing people.
Yes, you have some other access that then cares about the memory
ordering, but this is a fairly natural expectation to have
(considering that we've had the same issue before).
Linus
On Fri, Jul 13, 2018 at 07:58:25PM -0700, Linus Torvalds wrote:
> On Fri, Jul 13, 2018 at 6:51 PM Alan Stern <[email protected]> wrote:
> >
> > The point being that the scenarios under discussion in this thread all
> > fall most definitely into the "Non-standard usage; you'd better know
> > exactly what you're doing" category.
>
> Well, yes and no.
>
> The thing is, people expected unlock+lock to give a memory ordering.
> It happened in RCU, and it's happened before elsewhere.
True enough. I have grown quite used to the current unlock+lock ordering,
but yes, it did come as a bit of a surprise when I first came across it
some years back. And it took quite a bit of digging to work out what
was going on. My hope is of course that whatever the eventual choice,
having the memory model in place will make it easier for developers down
the line.
> So it *is* the "pure locking" thing that ends up confusing people.
> Yes, you have some other access that then cares about the memory
> ordering, but this is a fairly natural expectation to have
> (considering that we've had the same issue before).
That said, I was doing (and am still doing) some pretty unusual things
with RCU. It requires every access from any CPU before any given grace
period to be ordered before any access from any CPU after that same
grace period. That sort of requirement has been rare enough that the
smp_mb__after_unlock_lock() macro RCU uses is defined locally within RCU,
and is currently used only by RCU.
But as always, your kernel, your choice!
Thanx, Paul
Peter Zijlstra <[email protected]> writes:
> On Fri, Jul 13, 2018 at 11:15:26PM +1000, Michael Ellerman wrote:
...
>>
>>
>> So 18-32% slower, or 23-47 cycles.
>
> Very good info. Note that another option is to put the SYNC in lock() it
> doesn't really matter which of the two primitives gets it. I don't
> suppose it really matters for timing either way around.
If the numbers can be trusted, it is actually slower to put the sync in
lock, at least on one of the machines:
Time
lwsync_sync 84,932,987,977
sync_lwsync 93,185,930,333
On the other machine it's slower but only by 0.1%, so that's slightly
weird.
The other advantage of putting the sync in unlock is that we could get rid of
our SYNC_IO logic, which conditionally puts a sync in unlock to order
IO accesses vs unlock.
>> Next week I can do some macro benchmarks, to see if it's actually
>> detectable at all.
I guess arguably it's not a very macro benchmark, but we have a
context_switch benchmark in the tree[1] which we often use to tune
things, and it degrades badly. It just spins up two threads and has them
ping-pong using yield.
The numbers are context switch iterations, so more == better.
| Before | After | Change | Change %
+------------+------------+------------+----------
| 35,601,160 | 32,371,164 | -3,229,996 | -9.07%
| 35,762,126 | 32,438,798 | -3,323,328 | -9.29%
| 35,690,870 | 32,353,676 | -3,337,194 | -9.35%
| 35,440,346 | 32,336,750 | -3,103,596 | -8.76%
| 35,614,868 | 32,676,378 | -2,938,490 | -8.25%
| 35,659,690 | 32,462,624 | -3,197,066 | -8.97%
| 35,594,058 | 32,403,922 | -3,190,136 | -8.96%
| 35,682,682 | 32,353,146 | -3,329,536 | -9.33%
| 35,954,454 | 32,306,168 | -3,648,286 | -10.15%
| 35,849,314 | 32,291,094 | -3,558,220 | -9.93%
----------+------------+------------+------------+----------
Average | 35,684,956 | 32,399,372 | -3,285,584 | -9.21%
Std Dev | 143,877 | 111,385 |
Std Dev % | 0.40% | 0.34% |
[1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/powerpc/benchmarks/context_switch.c
I'll do some kernbench runs tomorrow and see if it shows up there.
>> My personal preference would be to switch to sync, we don't want to be
>> the only arch finding (or not finding!) exotic ordering bugs.
>>
>> But we'd also rather not make our slow locks any slower than they have
>> to be.
>
> I completely understand, but I'll get you beer (lots) if you do manage
> to make SYNC happen :-) :-)
Just so we're clear, Fosters is not beer :)
cheers
On Tue, Jul 17, 2018 at 12:40:19AM +1000, Michael Ellerman wrote:
>
> I guess arguably it's not a very macro benchmark, but we have a
> context_switch benchmark in the tree[1] which we often use to tune
> things, and it degrades badly. It just spins up two threads and has them
> ping-pong using yield.
The one advantage you'd get from putting it in lock() is that you could
do away with smp_mb__after_spinlock().
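(For reference, a rough sketch of the kind of site that barrier covers; the
lock is illustrative:)

static DEFINE_RAW_SPINLOCK(mylock);	/* illustrative lock */

raw_spin_lock(&mylock);
/*
 * Promote the lock acquisition to a full barrier; if the lock itself
 * already contained a full sync, this could become a no-op.
 */
smp_mb__after_spinlock();
/* ... critical section ... */
raw_spin_unlock(&mylock);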
But yes, I completely forgot about your IO thingy... those bench results
make me sad :/ ah well.
On Mon, Jul 16, 2018 at 7:40 AM Michael Ellerman <[email protected]> wrote:
>
> If the numbers can be trusted it is actually slower to put the sync in
> lock, at least on one of the machines:
>
> Time
> lwsync_sync 84,932,987,977
> sync_lwsync 93,185,930,333
Very funky.
> I guess arguably it's not a very macro benchmark, but we have a
> context_switch benchmark in the tree[1] which we often use to tune
> things, and it degrades badly. It just spins up two threads and has them
> ping-pong using yield.
I hacked that up to run on x86, and it only is about 5% locking
overhead in my profiles. It's about 18% __switch_to, and a lot of
system call entry/exit, but not a lot of locking.
I'm actually surprised it is even that much locking, since it seems to
be single-cpu, so there should be no contention and the lock (which
seems to be
rq = this_rq();
rq_lock(rq, &rf);
in do_sched_yield()) should stay local to the cpu.
And for you the locking is apparently even _more_ noticeable.
But yes, a 10% regression on that context switch thing is huge. You
shouldn't do ping-pong stuff, but people kind of do.
Linus
Linus Torvalds <[email protected]> writes:
> On Mon, Jul 16, 2018 at 7:40 AM Michael Ellerman <[email protected]> wrote:
...
>> I guess arguably it's not a very macro benchmark, but we have a
>> context_switch benchmark in the tree[1] which we often use to tune
>> things, and it degrades badly. It just spins up two threads and has them
>> ping-pong using yield.
>
> I hacked that up to run on x86, and it only is about 5% locking
> overhead in my profiles. It's about 18% __switch_to, and a lot of
> system call entry/exit, but not a lot of locking.
Interesting. I don't see anything as high as 18%, it's more spread out:
7.81% context_switch [kernel.kallsyms] [k] cgroup_rstat_updated
7.60% context_switch [kernel.kallsyms] [k] system_call_exit
5.91% context_switch [kernel.kallsyms] [k] __switch_to
5.69% context_switch [kernel.kallsyms] [k] __sched_text_start
5.61% context_switch [kernel.kallsyms] [k] _raw_spin_lock
4.15% context_switch [kernel.kallsyms] [k] system_call
3.76% context_switch [kernel.kallsyms] [k] finish_task_switch
And it doesn't change much before/after the spinlock change.
(I should work out how to turn that cgroup stuff off.)
I tried uninlining spin_unlock() and that makes it a bit clearer.
Before:
9.67% context_switch [kernel.kallsyms] [k] _raw_spin_lock
7.74% context_switch [kernel.kallsyms] [k] cgroup_rstat_updated
7.39% context_switch [kernel.kallsyms] [k] system_call_exit
5.84% context_switch [kernel.kallsyms] [k] __sched_text_start
4.83% context_switch [kernel.kallsyms] [k] __switch_to
4.08% context_switch [kernel.kallsyms] [k] system_call
<snip 16 lines>
1.24% context_switch [kernel.kallsyms] [k] arch_spin_unlock <--
After:
8.69% context_switch [kernel.kallsyms] [k] _raw_spin_lock
7.01% context_switch [kernel.kallsyms] [k] cgroup_rstat_updated
6.76% context_switch [kernel.kallsyms] [k] system_call_exit
5.59% context_switch [kernel.kallsyms] [k] arch_spin_unlock <--
5.10% context_switch [kernel.kallsyms] [k] __sched_text_start
4.36% context_switch [kernel.kallsyms] [k] __switch_to
3.80% context_switch [kernel.kallsyms] [k] system_call
I was worried spectre/meltdown mitigations might be confusing things, but not
really: updated numbers with them off are higher, but the delta is about the
same in percentage terms:
| lwsync/lwsync | lwsync/sync | Change | Change %
+---------------+-------------+------------+----------
Average | 47,938,888 | 43,655,184 | -4,283,703 | -9.00%
> I'm actually surprised it is even that much locking, since it seems to
> be single-cpu, so there should be no contention and the lock (which
> seems to be
>
> rq = this_rq();
> rq_lock(rq, &rf);
>
> in do_sched_yield()) should stay local to the cpu.
>
> And for you the locking is apparently even _more_ noticeable.
> But yes, a 10% regression on that context switch thing is huge. You
> shouldn't do ping-pong stuff, but people kind of do.
Yeah.
There also seem to be folks who have optimised the rest of their stack pretty
hard, and therefore care about context switch performance because it's pure
overhead and they're searching for every cycle.
So although this test is not a real workload it's a proxy for something people
do complain to us about.
cheers
On Tue, Jul 17, 2018 at 7:45 AM Michael Ellerman <[email protected]> wrote:
>
>
> Interesting. I don't see anything as high as 18%, it's more spread out:
>
> 7.81% context_switch [kernel.kallsyms] [k] cgroup_rstat_updated
Oh, see that's the difference.
You're running in a non-root cgroup, I think.
That also means that your scheduler overhead has way more spinlocks,
and in particular, you have that
raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
..
raw_spin_lock_irqsave(cpu_lock, flags);
there too.
So you have at least twice the spinlocks that my case had, and yes,
the costs are way more spread out because your case has all that
cgroup accounting too.
That said, I don't understand the powerpc memory ordering. I thought
the rules were "isync on lock, lwsync on unlock".
That's what the AIX docs imply, at least.
In particular, I find:
"isync is not a memory barrier instruction, but the
load-compare-conditional branch-isync sequence can provide this
ordering property"
so why are you doing "sync/lwsync", when it sounds like "isync/lwsync"
(for lock/unlock) is the right thing and would already give memory
barrier semantics?
Linus
On Tue, Jul 17, 2018 at 09:19:15AM -0700, Linus Torvalds wrote:
> On Tue, Jul 17, 2018 at 7:45 AM Michael Ellerman <[email protected]> wrote:
> >
> >
> > Interesting. I don't see anything as high as 18%, it's more spread out:
> >
> > 7.81% context_switch [kernel.kallsyms] [k] cgroup_rstat_updated
>
> Oh, see that's the difference.
>
> You're running in a non-root cgroup, I think.
>
> That also means that your scheduler overhead has way more spinlocks,
> and in particular, you have that
>
> raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
> ..
> raw_spin_lock_irqsave(cpu_lock, flags);
>
> there too.
>
> So you have at least twice the spinlocks that my case had, and yes,
> the costs are way more spread out because your case has all that
> cgroup accounting too.
>
> That said, I don't understand the powerpc memory ordering. I thought
> the rules were "isync on lock, lwsync on unlock".
>
> That's what the AIX docs imply, at least.
>
> In particular, I find:
>
> "isync is not a memory barrier instruction, but the
> load-compare-conditional branch-isync sequence can provide this
> ordering property"
>
> so why are you doing "sync/lwsync", when it sounds like "isync/lwsync"
> (for lock/unlock) is the right thing and would already give memory
> barrier semantics?
The PowerPC guys will correct me if I miss something here...
The isync provides ordering roughly similar to lwsync, but nowhere near
as strong as sync, and it is sync that would be needed to cause lock
acquisition to provide full ordering. The reason for using lwsync instead
of isync is that the former proved to be faster on recent hardware.
The reason that the kernel still has the ability to instead generate
isync instructions is that some older PowerPC hardware does not provide
the lwsync instruction. If the hardware does support lwsync, the isync
instructions are overwritten with lwsync at boot time.
Thanx, Paul
On Tue, Jul 17, 2018 at 11:33:41AM -0700, Paul E. McKenney wrote:
> On Tue, Jul 17, 2018 at 09:19:15AM -0700, Linus Torvalds wrote:
> >
> > In particular, I find:
> >
> > "isync is not a memory barrier instruction, but the
> > load-compare-conditional branch-isync sequence can provide this
> > ordering property"
> >
> > so why are you doing "sync/lwsync", when it sounds like "isync/lwsync"
> > (for lock/unlock) is the right thing and would already give memory
> > barrier semantics?
>
> The PowerPC guys will correct me if I miss something here...
>
> The isync provides ordering roughly similar to lwsync, but nowhere near
> as strong as sync, and it is sync that would be needed to cause lock
> acquisition to provide full ordering. The reason for using lwsync instead
> of isync is that the former proved to be faster on recent hardware.
> The reason that the kernel still has the ability to instead generate
> isync instructions is that some older PowerPC hardware does not provide
> the lwsync instruction. If the hardware does support lwsync, the isync
> instructions are overwritten with lwsync at boot time.
Isn't ISYNC the instruction-sync pipeline flush instruction? That is
used as an smp_rmb() here, together with the control dependency from the
LL/SC, to form a LOAD->{LOAD,STORE} (aka LOAD-ACQUIRE) ordering?
Where LWSYNC provides a TSO-like ordering and SYNC provides a full
transitive barrier aka. smp_mb() (although I think it is strictly
stronger than smp_mb() since it also implies completion, which smp_mb()
does not).
And since both LL/SC-CTRL + ISYNC / LWSYNC are strictly CPU local, they
cannot be used to create RCsc ordering.
On Tue, Jul 17, 2018 at 11:31 AM Paul E. McKenney
<[email protected]> wrote:
>
> The isync provides ordering roughly similar to lwsync, but nowhere near
> as strong as sync, and it is sync that would be needed to cause lock
> acquisition to provide full ordering.
That's only true when looking at isync in isolation.
Read the part I quoted. The AIX documentation implies that the
*sequence* of load-compare-conditional branch-isync is a memory
barrier, even if isync on its own is not.
So I'm just saying that
(a) isync-on-lock is supposed to be much cheaper than sync-on-lock
(b) the AIX documentation at least implies that isync-on-lock (when
used together with the whole locking sequence) is actually a memory
barrier
Now, admittedly the powerpc barrier instructions are unfathomable
crazy stuff, so who knows. But:
(a) lwsync is a memory barrier for all the "easy" cases (ie
load->store, load->load, and store->load).
(b) lwsync is *not* a memory barrier for the store->load case.
(c) isync *is* (when in that *sequence*) a memory barrier for a
store->load case (and has to be: loads inside a spinlocked region MUST
NOT be done earlier than stores outside of it!).
So a unlock/lock sequence where the unlock is using lwsync, and the
lock is using isync, should in fact be a full memory barrier (which is
the semantics we're looking for).
So doing performance testing on sync/lwsync (for lock/unlock
respectively) seems the wrong thing to do. Please test the
isync/lwsync case instead.
Hmm? What am I missing?
Linus
On Tue, Jul 17, 2018 at 11:44 AM Linus Torvalds
<[email protected]> wrote:
>
> (a) lwsync is a memory barrier for all the "easy" cases (ie
> load->store, load->load, and store->load).
That last one should have been "store->store", of course.
So 'lwsync' gives smp_rmb(), smp_wmb(), and smp_load_acquire()
semantics (which are the usual "no barrier needed at all" suspects for
things like x86).
What lwsync lacks is store->load ordering. So:
> (b) lwsync is *not* a memory barrier for the store->load case.
BUT, this is where isync comes in:
> (c) isync *is* (when in that *sequence*) a memory barrier for a
> store->load case (and has to be: loads inside a spinlocked region MUST
> NOT be done earlier than stores outside of it!).
which is why I think that a spinlock implementation that uses isync
would give us the semantics we want, without the use of the crazy
expensive 'sync' that Michael tested (and which apparently gets
horrible 10% scheduler performance regressions at least on some
powerpc CPU's).
Linus
On Tue, Jul 17, 2018 at 11:44:23AM -0700, Linus Torvalds wrote:
> On Tue, Jul 17, 2018 at 11:31 AM Paul E. McKenney
> <[email protected]> wrote:
> >
> > The isync provides ordering roughly similar to lwsync, but nowhere near
> > as strong as sync, and it is sync that would be needed to cause lock
> > acquisition to provide full ordering.
>
> That's only true when looking at isync in isolation.
>
> Read the part I quoted. The AIX documentation implies that the
> *sequence* of load-compare-conditional branch-isync is a memory
> barrier, even if isync on its own is not.
You are referring to this URL, correct?
https://www.ibm.com/developerworks/systems/articles/powerpc.html
If so, the "this ordering property" refers to the ordering needed to
ensure that the later accesses happen after the "load global flag",
and lwsync suffices for that. (As does the branch-isync, as you say.)
The sync instruction provides much stronger ordering due to the fact
that sync flushes the store buffer and waits for acknowledgments from all
other CPUs (or, more accurately, makes it appear as if it had done so --
speculative execution can be and is used to hide some of the latency).
In contrast, isync (with or without the load and branch) does not
flush the store buffer nor does it cause any particular communication
with the other CPUs. The ordering that isync provides is therefore
quite a bit weaker than that of the sync instruction.
> So I'm just saying that
>
> (a) isync-on-lock is supposed to be much cheaper than sync-on-lock
Completely agreed.
> (b) the AIX documentation at least implies that isync-on-lock (when
> used together with the whole locking sequence) is actually a memory
> barrier
Agreed, but the ordering properties that it provides are similar to
those of the weaker lwsync memory-barrier instruction, and nowhere near
as strong of those of the sync memory-barrier instruction.
> Now, admittedly the powerpc barrier instructions are unfathomable
> crazy stuff, so who knows. But:
>
> (a) lwsync is a memory barrier for all the "easy" cases (ie
> load->store, load->load, and store->load).
Yes.
> (b) lwsync is *not* a memory barrier for the store->load case.
Agreed.
> (c) isync *is* (when in that *sequence*) a memory barrier for a
> store->load case (and has to be: loads inside a spinlocked region MUST
> NOT be done earlier than stores outside of it!).
Yes, isync will wait for the prior stores to -execute-, but it doesn't
necessarily wait for the corresponding entries to leave the store buffer.
And this suffices to provide ordering from the viewpoint of some other
CPU holding the lock.
> So a unlock/lock sequence where the unlock is using lwsync, and the
> lock is using isync, should in fact be a full memory barrier (which is
> the semantics we're looking for).
>
> So doing performance testing on sync/lwsync (for lock/unlock
> respectively) seems the wrong thing to do. Please test the
> isync/lwsync case instead.
>
> Hmm? What am I missing?
That the load-branch-isync has ordering properties similar to the lwsync
memory-barrier instruction, not to the sync memory-barrier instruction.
This means that the isync/lwsync case simply won't provide full ordering
from the viewpoint of CPUs not holding the lock. (As with lwsync,
CPUs holding the lock do see full ordering, as is absolutely required.)
You absolutely must use sync/lwsync or lwsync/sync to get the strong
order that is visible to other CPUs not holding the lock.
The PowerPC hardware verification tools agree, as shown below.
Thanx, Paul
------------------------------------------------------------------------
The PPCMEM tool (which the Power hardware guys helped to develop) agrees.
For the following litmus test, in which P0 does the lwsync-unlock and
isync-lock, and for which P1 checks the ordering, but without acquiring
the lock, PPCMEM says "no ordering".
------------------------------------------------------------------------
PPC lock-1thread-WR-barrier.litmus
""
{
l=1;
0:r1=1; 0:r3=42; 0:r4=x; 0:r5=y; 0:r10=0; 0:r11=0; 0:r12=l;
1:r1=1; 1:r4=x; 1:r5=y;
}
P0 | P1 ;
stw r1,0(r4) | stw r1,0(r5) ;
lwsync | sync ;
stw r10,0(r12) | lwz r7,0(r4) ;
lwarx r11,r10,r12 | ;
cmpwi r11,0 | ;
bne Fail1 | ;
stwcx. r1,r10,r12 | ;
bne Fail1 | ;
isync | ;
lwz r3,0(r5) | ;
Fail1: | ;
exists
(0:r3=0 /\ 1:r7=0)
------------------------------------------------------------------------
Here is the output of running the tool locally:
------------------------------------------------------------------------
0:r3=0; 1:r7=0;
0:r3=0; 1:r7=1;
0:r3=1; 1:r7=0;
0:r3=1; 1:r7=1;
0:r3=42; 1:r7=0;
0:r3=42; 1:r7=1;
Ok
Condition exists (0:r3=0 /\ 1:r7=0)
Hash=16c3d58e658f6c16bc3df7d6233d8bf8
Observation SB+lock+sync-Linus Sometimes 1 5
------------------------------------------------------------------------
And you can see the "0:r3=0; 1:r7=0;" state, which indicates that neither
process's loads saw the other process's stores, which in turn indicates
no ordering.
That said, if P1 actually acquired the lock, there would be ordering.
Or you can get the same result manually using this web site:
https://www.cl.cam.ac.uk/~pes20/ppcmem/index.html
On Tue, Jul 17, 2018 at 08:42:55PM +0200, Peter Zijlstra wrote:
> On Tue, Jul 17, 2018 at 11:33:41AM -0700, Paul E. McKenney wrote:
> > On Tue, Jul 17, 2018 at 09:19:15AM -0700, Linus Torvalds wrote:
> > >
> > > In particular, I find:
> > >
> > > "isync is not a memory barrier instruction, but the
> > > load-compare-conditional branch-isync sequence can provide this
> > > ordering property"
> > >
> > > so why are you doing "sync/lwsync", when it sounds like "isync/lwsync"
> > > (for lock/unlock) is the right thing and would already give memory
> > > barrier semantics?
> >
> > The PowerPC guys will correct me if I miss something here...
> >
> > The isync provides ordering roughly similar to lwsync, but nowhere near
> > as strong as sync, and it is sync that would be needed to cause lock
> > acquisition to provide full ordering. The reason for using lwsync instead
> > of isync is that the former proved to be faster on recent hardware.
> > The reason that the kernel still has the ability to instead generate
> > isync instructions is that some older PowerPC hardware does not provide
> > the lwsync instruction. If the hardware does support lwsync, the isync
> > instructions are overwritten with lwsync at boot time.
>
> Isn't ISYNC the instruction-sync pipeline flush instruction? That is
> used as an smp_rmb() here to, together with the control dependency from the
> LL/SC, to form a LOAD->{LOAD,STORE} (aka LOAD-ACQUIRE) ordering?
That is the one!
> Where LWSYNC provides a TSO-like ordering and SYNC provides a full
> transitive barrier aka. smp_mb() (although I think it is strictly
> stronger than smp_mb() since it also implies completion, which smp_mb()
> does not).
>
> And since both LL/SC-CTRL + ISYNC / LWSYNC are strictly CPU local, they
> cannot be used to create RCsc ordering.
Completely agreed, there needs to be a sync instruction in either the
lock or the unlock to get RCsc ordering. Neither lwsync nor isync
can provide the needed ordering.
Thanx, Paul
On Tue, 17 Jul 2018, Linus Torvalds wrote:
> On Tue, Jul 17, 2018 at 11:31 AM Paul E. McKenney
> <[email protected]> wrote:
> >
> > The isync provides ordering roughly similar to lwsync, but nowhere near
> > as strong as sync, and it is sync that would be needed to cause lock
> > acquisition to provide full ordering.
>
> That's only true when looking at isync in isolation.
>
> Read the part I quoted. The AIX documentation implies that the
> *sequence* of load-compare-conditional branch-isync is a memory
> barrier, even if isync on its own is not.
I'm not a huge expert on the PowerPC architecture, but I do have a
pretty good understanding of the widely accepted memory model published
by the Peter Sewell's group at Cambridge (PPCMEM). According to that
model, load-compare-conditional branch-isync is _not_ a full memory
barrier.
> So I'm just saying that
>
> (a) isync-on-lock is supposed to be much cheaper than sync-on-lock
>
> (b) the AIX documentation at least implies that isync-on-lock (when
> used together with the whole locking sequence) is actually a memory
> barrier
>
> Now, admittedly the powerpc barrier instructions are unfathomable
> crazy stuff, so who knows. But:
>
> (a) lwsync is a memory barrier for all the "easy" cases (ie
> load->store, load->load, and store->load).
>
> (b) lwsync is *not* a memory barrier for the store->load case.
>
> (c) isync *is* (when in that *sequence*) a memory barrier for a
> store->load case (and has to be: loads inside a spinlocked region MUST
> NOT be done earlier than stores outside of it!).
Why not? Instructions are allowed to migrate _into_ critical sections,
just not _out_ of them. So a store preceding the start of a spinlocked
region can migrate in and be executed after a load that is inside the
region.
Alan Stern
> So a unlock/lock sequence where the unlock is using lwsync, and the
> lock is using isync, should in fact be a full memory barrier (which is
> the semantics we're looking for).
>
> So doing performance testing on sync/lwsync (for lock/unlock
> respectively) seems the wrong thing to do. Please test the
> isync/lwsync case instead.
>
> Hmm? What am I missing?
> > That said, I don't understand the powerpc memory ordering. I thought
> > the rules were "isync on lock, lwsync on unlock".
> >
> > That's what the AIX docs imply, at least.
> >
> > In particular, I find:
> >
> > "isync is not a memory barrier instruction, but the
> > load-compare-conditional branch-isync sequence can provide this
> > ordering property"
> >
> > so why are you doing "sync/lwsync", when it sounds like "isync/lwsync"
> > (for lock/unlock) is the right thing and would already give memory
> > barrier semantics?
>
> The PowerPC guys will correct me if I miss something here...
[Same here.]
>
> The isync provides ordering roughly similar to lwsync, but nowhere near
> as strong as sync, and it is sync that would be needed to cause lock
> acquisition to provide full ordering.
IIRC, ctrl+isync is even *weaker* than lwsync in certain respects, e.g.,
the former doesn't provide A-cumulativity according to the architectural
intent.
> The reason for using lwsync instead
> of isync is that the former proved to be faster on recent hardware.
Interesting; can you add some references about this?
Andrea
> The reason that the kernel still has the ability to instead generate
> isync instructions is that some older PowerPC hardware does not provide
> the lwsync instruction. If the hardware does support lwsync, the isync
> instructions are overwritten with lwsync at boot time.
>
> Thanx, Paul
>
On Tue, Jul 17, 2018 at 11:49:41AM -0700, Linus Torvalds wrote:
> On Tue, Jul 17, 2018 at 11:44 AM Linus Torvalds
> <[email protected]> wrote:
> >
> > (a) lwsync is a memory barrier for all the "easy" cases (ie
> > load->store, load->load, and store->load).
>
> That last one should have been "store->store", of course.
Heh! I autocorrected without noticing.
> So 'lwsync' gives smp_rmb(), smp_wmb(), and smp_load_acquire()
> semantics (which are the usual "no barrier needed at all" suspects for
> things like x86).
Yes.
> What lwsync lacks is store->load ordering. So:
>
> > (b) lwsync is *not* a memory barrier for the store->load case.
>
> BUT, this is where isync comes in:
>
> > (c) isync *is* (when in that *sequence*) a memory barrier for a
> > store->load case (and has to be: loads inside a spinlocked region MUST
> > NOT be done earlier than stores outside of it!).
>
> which is why I think that a spinlock implementation that uses isync
> would give us the semantics we want, without the use of the crazy
> expensive 'sync' that Michael tested (and which apparently gets
> horrible 10% scheduler performance regressions at least on some
> powerpc CPU's).
Again, although isync orders the store -instruction- -execution- against
later loads, it does nothing to flush the store buffer. So the prior
stores will not necessarily appear to be ordered from the viewpoint of
other CPUs not holding the lock.
Again, if those other CPUs are holding the lock, they see all memory
accesses from all prior critical sections for that lock, as required.
Thanx, Paul
On Tue, 17 Jul 2018, Peter Zijlstra wrote:
> Isn't ISYNC the instruction-sync pipeline flush instruction? That is
> used as an smp_rmb() here to, together with the control dependency from the
> LL/SC, to form a LOAD->{LOAD,STORE} (aka LOAD-ACQUIRE) ordering?
That's right.
> Where LWSYNC provides a TSO-like ordering and SYNC provides a full
> transitive barrier aka. smp_mb() (although I think it is strictly
> stronger than smp_mb() since it also implies completion, which smp_mb()
> does not).
What is the difference? That is, can you give a litmus test for
"completion" in this sense?
> And since both LL/SC-CTRL + ISYNC / LWSYNC are strictly CPU local, they
> cannot be used to create RCsc ordering.
Yes, sort of -- it depends on whether you consider the store buffer to
be part of the CPU. But in any case, you are right that none of these
things can create RCsc ordering.
Alan
On Tue, Jul 17, 2018 at 09:40:01PM +0200, Andrea Parri wrote:
> > > That said, I don't understand the powerpc memory ordering. I thought
> > > the rules were "isync on lock, lwsync on unlock".
> > >
> > > That's what the AIX docs imply, at least.
> > >
> > > In particular, I find:
> > >
> > > "isync is not a memory barrier instruction, but the
> > > load-compare-conditional branch-isync sequence can provide this
> > > ordering property"
> > >
> > > so why are you doing "sync/lwsync", when it sounds like "isync/lwsync"
> > > (for lock/unlock) is the right thing and would already give memory
> > > barrier semantics?
> >
> > The PowerPC guys will correct me if I miss something here...
>
> [Same here.]
>
> > The isync provides ordering roughly similar to lwsync, but nowhere near
> > as strong as sync, and it is sync that would be needed to cause lock
> > acquisition to provide full ordering.
>
> IIRC, ctrl+isync is even *weaker* than lwsync in certain respects, e.g.,
> the former doesn't provide A-cumulativity according to the architectural
> intent.
>
>
> > The reason for using lwsync instead
> > of isync is that the former proved to be faster on recent hardware.
>
> Interesting; can you add some references about this?
Sadly, no. I just asked why all the isyncs were being rewritten by
lwsyncs some years back, and that is the answer I got. I trust the
people answering, so didn't dig further.
Thanx, Paul
> Andrea
>
>
> > The reason that the kernel still has the ability to instead generate
> > isync instructions is that some older PowerPC hardware does not provide
> > the lwsync instruction. If the hardware does support lwsync, the isync
> > instructions are overwritten with lwsync at boot time.
> >
> > Thanx, Paul
> >
>
On Tue, Jul 17, 2018 at 12:37 PM Alan Stern <[email protected]> wrote:
>
> Why not? Instructions are allowed to migrate _into_ critical sections,
> just not _out_ of them. So a store preceding the start of a spinlocked
> region can migrate in and be executed after a load that is inside the
> region.
Hmm, yes of course. But the isync instruction description I found also
talks about the previous instructions being "completed".
But yeah, that obviously can mean just "in the store buffer", not
actually ordered.
Linus
Linus Torvalds <[email protected]> writes:
> On Tue, Jul 17, 2018 at 7:45 AM Michael Ellerman <[email protected]> wrote:
>>
>>
>> Interesting. I don't see anything as high as 18%, it's more spread out:
>>
>> 7.81% context_switch [kernel.kallsyms] [k] cgroup_rstat_updated
>
> Oh, see that's the difference.
>
> You're running in a non-root cgroup, I think.
I'm not much of a cgroup expert, but yeah I think so:
# cat /proc/self/cgroup
7:cpuset:/
6:cpu,cpuacct:/user.slice
5:memory:/user.slice
...
I guess I need to boot with init=/bin/sh.
> That also means that your scheduler overhead has way more spinlocks,
> and in particular, you have that
>
> raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
> ..
> raw_spin_lock_irqsave(cpu_lock, flags);
>
> there too.
>
> So you have at least twice the spinlocks that my case had, and yes,
> the costs are way more spread out because your case has all that
> cgroup accounting too.
Yeah OK. And our locks are known to be more expensive generally, so that
and the cgroups seems like it accounts for most of the difference vs
x86.
> That said, I don't understand the powerpc memory ordering. I thought
> the rules were "isync on lock, lwsync on unlock".
>
> That's what the AIX docs imply, at least.
>
> In particular, I find:
>
> "isync is not a memory barrier instruction, but the
> load-compare-conditional branch-isync sequence can provide this
> ordering property"
>
> so why are you doing "sync/lwsync", when it sounds like "isync/lwsync"
> (for lock/unlock) is the right thing and would already give memory
> barrier semantics?
I think everyone else in the thread answered this already, and better
than I could.
Our options are sync/lwsync or lwsync/sync. I've been testing
lwsync/sync, because then we could also remove the conditional sync we
have in unlock (when there were IO accesses inside the lock).
These days you can download the ISA [PDF], with no registration or
anything required; here's a direct link:
https://ibm.ent.box.com/index.php?rm=box_download_shared_file&shared_name=1hzcwkwf8rbju5h9iyf44wm94amnlcrv&file_id=f_253346319281
That has some docs on locking in Book3 appendix B.2 (page 915). It
doesn't really have that much detail, but it does talk a bit about using
lwsync vs isync.
cheers
Michael Ellerman <[email protected]> writes:
> Peter Zijlstra <[email protected]> writes:
...
>
> I'll do some kernbench runs tomorrow and see if it shows up there.
Finally got some numbers.
The summary is that it's in the noise: the average delta is +0.1s (total time
is ~67s, so 0.17%), but the standard deviation is ~0.2s.
So that's probably a real slow down, but it's very minor.
I guess that's not surprising, given kernbench should be spending the
bulk of its time in userspace.
cheers