2018-09-26 18:29:18

by Paul E. McKenney

Subject: [PATCH memory-model 0/5] Updates to the formal memory model

Hello, Ingo!

This series contains updates to the Linux kernel's formal memory model
in tools/memory-model, along with corresponding changes in documentation
and Linux-kernel code. These patches are ready for inclusion into -tip.

1. Document litmus-test naming scheme.

2. Add extra ordering for locks and remove it for ordinary
release/acquire, courtesy of Alan Stern.

3. Fix a README typo, courtesy of SeongJae Park.

4. Expand the list of LKMM limitations.

5. Replace smp_cond_acquire() with smp_cond_load_acquire(),
courtesy of Andrea Parri.

Thanx, Paul

------------------------------------------------------------------------

Documentation/memory-barriers.txt | 3
tools/memory-model/Documentation/explanation.txt | 186 +++++++---
tools/memory-model/Documentation/recipes.txt | 2
tools/memory-model/README | 39 ++
tools/memory-model/linux-kernel.cat | 8
tools/memory-model/litmus-tests/ISA2+pooncelock+pooncelock+pombonce.litmus | 7
tools/memory-model/litmus-tests/README | 104 +++++
7 files changed, 293 insertions(+), 56 deletions(-)



2018-09-26 18:32:15

by Paul E. McKenney

Subject: [PATCH memory-model 4/5] tools/memory-model: Add more LKMM limitations

From: "Paul E. McKenney" <[email protected]>

This commit adds more detail about compiler optimizations and
not-yet-modeled Linux-kernel APIs.
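
As an illustration of the compiler point (a sketch, not part of the
patch; "index" and "array" are hypothetical shared objects), an address
dependency can be broken when the compiler can deduce the dependent
value:

        int i;
        int r;

        i = READ_ONCE(*index);          /* Dependency-carrying load. */
        r = READ_ONCE(array[i]);        /* The address depends on i, but if
                                         * the compiler can prove that i is
                                         * always 0 it may emit array[0]
                                         * instead, so the CPU never sees
                                         * the dependency. */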

Signed-off-by: Paul E. McKenney <[email protected]>
Reviewed-by: Andrea Parri <[email protected]>
---
tools/memory-model/README | 39 +++++++++++++++++++++++++++++++++++++++
1 file changed, 39 insertions(+)

diff --git a/tools/memory-model/README b/tools/memory-model/README
index ee987ce20aae..acf9077cffaa 100644
--- a/tools/memory-model/README
+++ b/tools/memory-model/README
@@ -171,6 +171,12 @@ The Linux-kernel memory model has the following limitations:
particular, the "THE PROGRAM ORDER RELATION: po AND po-loc"
and "A WARNING" sections).

+ Note that this limitation in turn limits LKMM's ability to
+ accurately model address, control, and data dependencies.
+ For example, if the compiler can deduce the value of some variable
+ carrying a dependency, then the compiler can break that dependency
+ by substituting a constant of that value.
+
2. Multiple access sizes for a single variable are not supported,
and neither are misaligned or partially overlapping accesses.

@@ -190,6 +196,36 @@ The Linux-kernel memory model has the following limitations:
However, a substantial amount of support is provided for these
operations, as shown in the linux-kernel.def file.

+ a. When rcu_assign_pointer() is passed NULL, the Linux
+ kernel provides no ordering, but LKMM models this
+ case as a store release.
+
+ b. The "unless" RMW operations are not currently modeled:
+ atomic_long_add_unless(), atomic_add_unless(),
+ atomic_inc_unless_negative(), and
+ atomic_dec_unless_positive(). These can be emulated
+ in litmus tests, for example, by using atomic_cmpxchg().
+
+ c. The call_rcu() function is not modeled. It can be
+ emulated in litmus tests by adding another process that
+ invokes synchronize_rcu() and the body of the callback
+ function, with (for example) a release-acquire from
+ the site of the emulated call_rcu() to the beginning
+ of the additional process.
+
+ d. The rcu_barrier() function is not modeled. It can be
+ emulated in litmus tests emulating call_rcu() via
+ (for example) a release-acquire from the end of each
+ additional call_rcu() process to the site of the
+ emulated rcu_barrier().
+
+ e. Sleepable RCU (SRCU) is not modeled. It can be
+ emulated, but perhaps not simply.
+
+ f. Reader-writer locking is not modeled. It can be
+ emulated in litmus tests using atomic read-modify-write
+ operations.
+
The "herd7" tool has some additional limitations of its own, apart from
the memory model:

@@ -204,3 +240,6 @@ the memory model:
Some of these limitations may be overcome in the future, but others are
more likely to be addressed by incorporating the Linux-kernel memory model
into other tools.
+
+Finally, please note that LKMM is subject to change as hardware, use cases,
+and compilers evolve.
--
2.17.1


2018-09-26 18:32:30

by Paul E. McKenney

Subject: [PATCH memory-model 5/5] doc: Replace smp_cond_acquire() with smp_cond_load_acquire()

From: Andrea Parri <[email protected]>

Amend commit 1f03e8d2919270 ("locking/barriers: Replace smp_cond_acquire()
with smp_cond_load_acquire()") by updating the documentation accordingly.
Also remove some obsolete information related to the implementation.
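
For reference, a minimal usage sketch of smp_cond_load_acquire() (not
part of the patch; "flag" is a hypothetical shared int): it spins until
the supplied condition on the loaded value holds, and the load that
satisfies the condition has ACQUIRE semantics.

        /* Wait for 'flag' to become nonzero; VAL names the value just
         * loaded, and the final (satisfying) load is an ACQUIRE. */
        smp_cond_load_acquire(&flag, VAL != 0);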

Signed-off-by: Andrea Parri <[email protected]>
Cc: Alan Stern <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: David Howells <[email protected]>
Cc: Jade Alglave <[email protected]>
Cc: Luc Maranget <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Akira Yokosawa <[email protected]>
Cc: Daniel Lustig <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
Acked-by: Will Deacon <[email protected]>
Acked-by: Alan Stern <[email protected]>
---
Documentation/memory-barriers.txt | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index 0d8d7ef131e9..c1d913944ad8 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -471,8 +471,7 @@ And a couple of implicit varieties:
operations after the ACQUIRE operation will appear to happen after the
ACQUIRE operation with respect to the other components of the system.
ACQUIRE operations include LOCK operations and both smp_load_acquire()
- and smp_cond_acquire() operations. The later builds the necessary ACQUIRE
- semantics from relying on a control dependency and smp_rmb().
+ and smp_cond_load_acquire() operations.

Memory operations that occur before an ACQUIRE operation may appear to
happen after it completes.
--
2.17.1


2018-09-26 18:32:30

by Paul E. McKenney

Subject: [PATCH memory-model 2/5] tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire

From: Alan Stern <[email protected]>

More than one kernel developer has expressed the opinion that the LKMM
should enforce ordering of writes by locking. In other words, given
the following code:

WRITE_ONCE(x, 1);
spin_unlock(&s);
spin_lock(&s);
WRITE_ONCE(y, 1);

the stores to x and y should be propagated in order to all other CPUs,
even though those other CPUs might not access the lock s. In terms of
the memory model, this means expanding the cumul-fence relation.
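
As a concrete illustration (a litmus-style sketch, not taken from the
patch; the surrounding spin_lock()/spin_unlock() calls are added to keep
the test well formed), an observer process can check that the two stores
propagate in order, and with this change the "exists" clause below
becomes forbidden:

C sketch-unlock-lock-write-ordering

{}

P0(int *x, int *y, spinlock_t *s)
{
        spin_lock(s);
        WRITE_ONCE(*x, 1);
        spin_unlock(s);
        spin_lock(s);
        WRITE_ONCE(*y, 1);
        spin_unlock(s);
}

P1(int *x, int *y)
{
        int r0;
        int r1;

        r0 = READ_ONCE(*y);
        smp_rmb();
        r1 = READ_ONCE(*x);
}

exists (1:r0=1 /\ 1:r1=0)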

Locks should also provide read-read (and read-write) ordering in a
similar way. Given:

READ_ONCE(x);
spin_unlock(&s);
spin_lock(&s);
READ_ONCE(y); // or WRITE_ONCE(y, 1);

the load of x should be executed before the load of (or store to) y.
The LKMM already provides this ordering, but it provides it even in
the case where the two accesses are separated by a release/acquire
pair of fences rather than unlock/lock. This would prevent
architectures from using weakly ordered implementations of release and
acquire, which seems like an unnecessary restriction. The patch
therefore removes the ordering requirement from the LKMM for that
case.
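
For instance (a sketch, not part of the patch), replacing the unlock/lock
pair with a release/acquire pair on an ordinary variable yields a test
whose "exists" clause becomes allowed under this patch; only the
unlock/lock form retains the stronger guarantee:

C sketch-release-acquire-rcpc

{}

P0(int *x, int *y, int *s)
{
        int r0;
        int r1;
        int r2;

        r0 = READ_ONCE(*x);
        smp_store_release(s, 1);
        r1 = smp_load_acquire(s);
        r2 = READ_ONCE(*y);
}

P1(int *x, int *y)
{
        WRITE_ONCE(*y, 1);
        smp_wmb();
        WRITE_ONCE(*x, 1);
}

exists (0:r0=1 /\ 0:r2=0)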

There are several arguments both for and against this change. Let us
refer to these enhanced ordering properties by saying that the LKMM
would require locks to be RCtso (a bit of a misnomer, but analogous to
RCpc and RCsc) and it would require ordinary acquire/release only to
be RCpc. (Note: In the following, the phrase "all supported
architectures" is meant not to include RISC-V. Although RISC-V is
indeed supported by the kernel, the implementation is still somewhat
in a state of flux and therefore statements about it would be
premature.)

Pros:

The kernel already provides RCtso ordering for locks on all
supported architectures, even though this is not stated
explicitly anywhere. Therefore the LKMM should formalize it.

In theory, guaranteeing RCtso ordering would reduce the need
for additional barrier-like constructs meant to increase the
ordering strength of locks.

Will Deacon and Peter Zijlstra are strongly in favor of
formalizing the RCtso requirement. Linus Torvalds and Will
would like to go even further, requiring locks to have RCsc
behavior (ordering preceding writes against later reads), but
they recognize that this would incur a noticeable performance
degradation on the POWER architecture. Linus also points out
that people have made the mistake, in the past, of assuming
that locking has stronger ordering properties than is
currently guaranteed, and this change would reduce the
likelihood of such mistakes.

Not requiring ordinary acquire/release to be any stronger than
RCpc may prove advantageous for future architectures, allowing
them to implement smp_load_acquire() and smp_store_release()
with more efficient machine instructions than would be
possible if the operations had to be RCtso. Will and Linus
approve this rationale, hypothetical though it is at the
moment (it may end up affecting the RISC-V implementation).
The same argument may or may not apply to RMW-acquire/release;
see also the second Con entry below.

Linus feels that locks should be easy for people to use
without worrying about memory consistency issues, since they
are so pervasive in the kernel, whereas acquire/release is
much more of an "experts only" tool. Requiring locks to be
RCtso is a step in this direction.

Cons:

Andrea Parri and Luc Maranget think that locks should have the
same ordering properties as ordinary acquire/release (indeed,
Luc points out that the names "acquire" and "release" derive
from the usage of locks). Andrea points out that having
different ordering properties for different forms of acquires
and releases is not only unnecessary, it would also be
confusing and unmaintainable.

Locks are constructed from lower-level primitives, typically
RMW-acquire (for locking) and ordinary release (for unlock).
It is illogical to require stronger ordering properties from
the high-level operations than from the low-level operations
they comprise. Thus, this change would make

while (cmpxchg_acquire(&s, 0, 1) != 0)
cpu_relax();

an incorrect implementation of spin_lock(&s) as far as the
LKMM is concerned. In theory this weakness can be ameliorated
by changing the LKMM even further, requiring
RMW-acquire/release also to be RCtso (which it already is on
all supported architectures).

As far as I know, nobody has singled out any examples of code
in the kernel that actually relies on locks being RCtso.
(People mumble about RCU and the scheduler, but nobody has
pointed to any actual code. If there are any real cases,
their number is likely quite small.) If RCtso ordering is not
needed, why require it?

A handful of locking constructs (qspinlocks, qrwlocks, and
mcs_spinlocks) are built on top of smp_cond_load_acquire()
instead of an RMW-acquire instruction. It currently provides
only the ordinary acquire semantics, not the stronger ordering
this patch would require of locks. In theory this could be
ameliorated by requiring smp_cond_load_acquire() in
combination with ordinary release also to be RCtso (which is
currently true on all supported architectures).

On future weakly ordered architectures, people may be able to
implement locks in a non-RCtso fashion with significant
performance improvement. Meeting the RCtso requirement would
necessarily add run-time overhead.

Overall, the technical aspects of these arguments seem relatively
minor, and it appears mostly to boil down to a matter of opinion.
Since the opinions of senior kernel maintainers such as Linus,
Peter, and Will carry more weight than those of Luc and Andrea, this
patch changes the model in accordance with the maintainers' wishes.

Signed-off-by: Alan Stern <[email protected]>
Reviewed-by: Will Deacon <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Andrea Parri <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
.../Documentation/explanation.txt | 186 ++++++++++++++----
tools/memory-model/linux-kernel.cat | 8 +-
...ISA2+pooncelock+pooncelock+pombonce.litmus | 7 +-
3 files changed, 150 insertions(+), 51 deletions(-)

diff --git a/tools/memory-model/Documentation/explanation.txt b/tools/memory-model/Documentation/explanation.txt
index 0cbd1ef8f86d..35bff92cc773 100644
--- a/tools/memory-model/Documentation/explanation.txt
+++ b/tools/memory-model/Documentation/explanation.txt
@@ -28,7 +28,8 @@ Explanation of the Linux-Kernel Memory Consistency Model
20. THE HAPPENS-BEFORE RELATION: hb
21. THE PROPAGATES-BEFORE RELATION: pb
22. RCU RELATIONS: rcu-link, gp, rscs, rcu-fence, and rb
- 23. ODDS AND ENDS
+ 23. LOCKING
+ 24. ODDS AND ENDS



@@ -1067,28 +1068,6 @@ allowing out-of-order writes like this to occur. The model avoided
violating the write-write coherence rule by requiring the CPU not to
send the W write to the memory subsystem at all!)

-There is one last example of preserved program order in the LKMM: when
-a load-acquire reads from an earlier store-release. For example:
-
- smp_store_release(&x, 123);
- r1 = smp_load_acquire(&x);
-
-If the smp_load_acquire() ends up obtaining the 123 value that was
-stored by the smp_store_release(), the LKMM says that the load must be
-executed after the store; the store cannot be forwarded to the load.
-This requirement does not arise from the operational model, but it
-yields correct predictions on all architectures supported by the Linux
-kernel, although for differing reasons.
-
-On some architectures, including x86 and ARMv8, it is true that the
-store cannot be forwarded to the load. On others, including PowerPC
-and ARMv7, smp_store_release() generates object code that starts with
-a fence and smp_load_acquire() generates object code that ends with a
-fence. The upshot is that even though the store may be forwarded to
-the load, it is still true that any instruction preceding the store
-will be executed before the load or any following instructions, and
-the store will be executed before any instruction following the load.
-

AND THEN THERE WAS ALPHA
------------------------
@@ -1766,6 +1745,147 @@ before it does, and the critical section in P2 both starts after P1's
grace period does and ends after it does.


+LOCKING
+-------
+
+The LKMM includes locking. In fact, there is special code for locking
+in the formal model, added in order to make tools run faster.
+However, this special code is intended to be more or less equivalent
+to concepts we have already covered. A spinlock_t variable is treated
+the same as an int, and spin_lock(&s) is treated almost the same as:
+
+ while (cmpxchg_acquire(&s, 0, 1) != 0)
+ cpu_relax();
+
+This waits until s is equal to 0 and then atomically sets it to 1,
+and the read part of the cmpxchg operation acts as an acquire fence.
+An alternate way to express the same thing would be:
+
+ r = xchg_acquire(&s, 1);
+
+along with a requirement that at the end, r = 0. Similarly,
+spin_trylock(&s) is treated almost the same as:
+
+ return !cmpxchg_acquire(&s, 0, 1);
+
+which atomically sets s to 1 if it is currently equal to 0 and returns
+true if it succeeds (the read part of the cmpxchg operation acts as an
+acquire fence only if the operation is successful). spin_unlock(&s)
+is treated almost the same as:
+
+ smp_store_release(&s, 0);
+
+The "almost" qualifiers above need some explanation. In the LKMM, the
+store-release in a spin_unlock() and the load-acquire which forms the
+first half of the atomic rmw update in a spin_lock() or a successful
+spin_trylock() -- we can call these things lock-releases and
+lock-acquires -- have two properties beyond those of ordinary releases
+and acquires.
+
+First, when a lock-acquire reads from a lock-release, the LKMM
+requires that every instruction po-before the lock-release must
+execute before any instruction po-after the lock-acquire. This would
+naturally hold if the release and acquire operations were on different
+CPUs, but the LKMM says it holds even when they are on the same CPU.
+For example:
+
+ int x, y;
+ spinlock_t s;
+
+ P0()
+ {
+ int r1, r2;
+
+ spin_lock(&s);
+ r1 = READ_ONCE(x);
+ spin_unlock(&s);
+ spin_lock(&s);
+ r2 = READ_ONCE(y);
+ spin_unlock(&s);
+ }
+
+ P1()
+ {
+ WRITE_ONCE(y, 1);
+ smp_wmb();
+ WRITE_ONCE(x, 1);
+ }
+
+Here the second spin_lock() reads from the first spin_unlock(), and
+therefore the load of x must execute before the load of y. Thus we
+cannot have r1 = 1 and r2 = 0 at the end (this is an instance of the
+MP pattern).
+
+This requirement does not apply to ordinary release and acquire
+fences, only to lock-related operations. For instance, suppose P0()
+in the example had been written as:
+
+ P0()
+ {
+ int r1, r2, r3;
+
+ r1 = READ_ONCE(x);
+ smp_store_release(&s, 1);
+ r3 = smp_load_acquire(&s);
+ r2 = READ_ONCE(y);
+ }
+
+Then the CPU would be allowed to forward the s = 1 value from the
+smp_store_release() to the smp_load_acquire(), executing the
+instructions in the following order:
+
+ r3 = smp_load_acquire(&s); // Obtains r3 = 1
+ r2 = READ_ONCE(y);
+ r1 = READ_ONCE(x);
+ smp_store_release(&s, 1); // Value is forwarded
+
+and thus it could load y before x, obtaining r2 = 0 and r1 = 1.
+
+Second, when a lock-acquire reads from a lock-release, and some other
+stores W and W' occur po-before the lock-release and po-after the
+lock-acquire respectively, the LKMM requires that W must propagate to
+each CPU before W' does. For example, consider:
+
+ int x, y;
+ spinlock_t s;
+
+ P0()
+ {
+ spin_lock(&s);
+ WRITE_ONCE(x, 1);
+ spin_unlock(&s);
+ }
+
+ P1()
+ {
+ int r1;
+
+ spin_lock(&s);
+ r1 = READ_ONCE(x);
+ WRITE_ONCE(y, 1);
+ spin_unlock(&s);
+ }
+
+ P2()
+ {
+ int r2, r3;
+
+ r2 = READ_ONCE(y);
+ smp_rmb();
+ r3 = READ_ONCE(x);
+ }
+
+If r1 = 1 at the end then the spin_lock() in P1 must have read from
+the spin_unlock() in P0. Hence the store to x must propagate to P2
+before the store to y does, so we cannot have r2 = 1 and r3 = 0.
+
+These two special requirements for lock-release and lock-acquire do
+not arise from the operational model. Nevertheless, kernel developers
+have come to expect and rely on them because they do hold on all
+architectures supported by the Linux kernel, albeit for various
+differing reasons.
+
+
ODDS AND ENDS
-------------

@@ -1831,26 +1951,6 @@ they behave as follows:
events and the events preceding them against all po-later
events.

-The LKMM includes locking. In fact, there is special code for locking
-in the formal model, added in order to make tools run faster.
-However, this special code is intended to be exactly equivalent to
-concepts we have already covered. A spinlock_t variable is treated
-the same as an int, and spin_lock(&s) is treated the same as:
-
- while (cmpxchg_acquire(&s, 0, 1) != 0)
- cpu_relax();
-
-which waits until s is equal to 0 and then atomically sets it to 1,
-and where the read part of the atomic update is also an acquire fence.
-An alternate way to express the same thing would be:
-
- r = xchg_acquire(&s, 1);
-
-along with a requirement that at the end, r = 0. spin_unlock(&s) is
-treated the same as:
-
- smp_store_release(&s, 0);
-
Interestingly, RCU and locking each introduce the possibility of
deadlock. When faced with code sequences such as:

diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
index 59b5cbe6b624..882fc33274ac 100644
--- a/tools/memory-model/linux-kernel.cat
+++ b/tools/memory-model/linux-kernel.cat
@@ -38,7 +38,7 @@ let strong-fence = mb | gp
(* Release Acquire *)
let acq-po = [Acquire] ; po ; [M]
let po-rel = [M] ; po ; [Release]
-let rfi-rel-acq = [Release] ; rfi ; [Acquire]
+let po-unlock-rf-lock-po = po ; [UL] ; rf ; [LKR] ; po

(**********************************)
(* Fundamental coherence ordering *)
@@ -60,13 +60,13 @@ let dep = addr | data
let rwdep = (dep | ctrl) ; [W]
let overwrite = co | fr
let to-w = rwdep | (overwrite & int)
-let to-r = addr | (dep ; rfi) | rfi-rel-acq
+let to-r = addr | (dep ; rfi)
let fence = strong-fence | wmb | po-rel | rmb | acq-po
-let ppo = to-r | to-w | fence
+let ppo = to-r | to-w | fence | (po-unlock-rf-lock-po & int)

(* Propagation: Ordering from release operations and strong fences. *)
let A-cumul(r) = rfe? ; r
-let cumul-fence = A-cumul(strong-fence | po-rel) | wmb
+let cumul-fence = A-cumul(strong-fence | po-rel) | wmb | po-unlock-rf-lock-po
let prop = (overwrite & ext)? ; cumul-fence* ; rfe?

(*
diff --git a/tools/memory-model/litmus-tests/ISA2+pooncelock+pooncelock+pombonce.litmus b/tools/memory-model/litmus-tests/ISA2+pooncelock+pooncelock+pombonce.litmus
index 0f749e419b34..094d58df7789 100644
--- a/tools/memory-model/litmus-tests/ISA2+pooncelock+pooncelock+pombonce.litmus
+++ b/tools/memory-model/litmus-tests/ISA2+pooncelock+pooncelock+pombonce.litmus
@@ -1,11 +1,10 @@
C ISA2+pooncelock+pooncelock+pombonce

(*
- * Result: Sometimes
+ * Result: Never
*
- * This test shows that the ordering provided by a lock-protected S
- * litmus test (P0() and P1()) are not visible to external process P2().
- * This is likely to change soon.
+ * This test shows that write-write ordering provided by locks
+ * (in P0() and P1()) is visible to external process P2().
*)

{}
--
2.17.1


2018-09-26 18:32:33

by Paul E. McKenney

Subject: [PATCH memory-model 1/5] tools/memory-model: Add litmus-test naming scheme

From: "Paul E. McKenney" <[email protected]>

This commit documents the scheme used to generate the names for the
litmus tests.
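
For instance (an informal decoding, using the scheme documented below),
the existing test name MP+pooncerelease+poacquireonce.litmus breaks down
as:

        MP                      litmus-test class (message passing)
        pooncerelease           P0: WRITE_ONCE() po-linked to smp_store_release()
        poacquireonce           P1: smp_load_acquire() po-linked to READ_ONCE()
        .litmus                 file-name suffix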

Signed-off-by: Paul E. McKenney <[email protected]>
[ paulmck: Apply feedback from Andrea Parri and Will Deacon. ]
Acked-by: Will Deacon <[email protected]>
---
tools/memory-model/litmus-tests/README | 104 ++++++++++++++++++++++++-
1 file changed, 102 insertions(+), 2 deletions(-)

diff --git a/tools/memory-model/litmus-tests/README b/tools/memory-model/litmus-tests/README
index 4581ec2d3c57..5ee08f129094 100644
--- a/tools/memory-model/litmus-tests/README
+++ b/tools/memory-model/litmus-tests/README
@@ -1,4 +1,6 @@
-This directory contains the following litmus tests:
+============
+LITMUS TESTS
+============

CoRR+poonceonce+Once.litmus
Test of read-read coherence, that is, whether or not two
@@ -36,7 +38,7 @@ IRIW+poonceonces+OnceOnce.litmus
ISA2+pooncelock+pooncelock+pombonce.litmus
Tests whether the ordering provided by a lock-protected S
litmus test is visible to an external process whose accesses are
- separated by smp_mb(). This addition of an external process to
+ separated by smp_mb(). This addition of an external process to
S is otherwise known as ISA2.

ISA2+poonceonces.litmus
@@ -151,3 +153,101 @@ Z6.0+pooncerelease+poacquirerelease+fencembonceonce.litmus
A great many more litmus tests are available here:

https://github.com/paulmckrcu/litmus
+
+==================
+LITMUS TEST NAMING
+==================
+
+Litmus tests are usually named based on their contents, which means that
+looking at the name tells you what the litmus test does. The naming
+scheme covers litmus tests having a single cycle that passes through
+each process exactly once, so litmus tests not fitting this description
+are named on an ad-hoc basis.
+
+The structure of a litmus-test name is the litmus-test class, a plus
+sign ("+"), and one string for each process, separated by plus signs.
+The end of the name is ".litmus".
+
+The litmus-test classes may be found in the infamous test6.pdf:
+https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test6.pdf
+Each class defines the pattern of accesses and of the variables accessed.
+For example, if the one process writes to a pair of variables, and
+the other process reads from these same variables, the corresponding
+litmus-test class is "MP" (message passing), which may be found on the
+left-hand end of the second row of tests on page one of test6.pdf.
+
+The strings used to identify the actions carried out by each process are
+complex due to a desire to have short(er) names. Thus, there is a tool to
+generate these strings from a given litmus test's actions. For example,
+consider the processes from SB+rfionceonce-poonceonces.litmus:
+
+ P0(int *x, int *y)
+ {
+ int r1;
+ int r2;
+
+ WRITE_ONCE(*x, 1);
+ r1 = READ_ONCE(*x);
+ r2 = READ_ONCE(*y);
+ }
+
+ P1(int *x, int *y)
+ {
+ int r3;
+ int r4;
+
+ WRITE_ONCE(*y, 1);
+ r3 = READ_ONCE(*y);
+ r4 = READ_ONCE(*x);
+ }
+
+The next step is to construct a space-separated list of descriptors,
+interleaving descriptions of the relation between a pair of consecutive
+accesses with descriptions of the second access in the pair.
+
+P0()'s WRITE_ONCE() is read by its first READ_ONCE(), which is a
+reads-from link (rf) and internal to the P0() process. This is
+"rfi", which is an abbreviation for "reads-from internal". Because
+some of the tools string these abbreviations together with space
+characters separating processes, the first character is capitalized,
+resulting in "Rfi".
+
+P0()'s second access is a READ_ONCE(), as opposed to (for example)
+smp_load_acquire(), so next is "Once". Thus far, we have "Rfi Once".
+
+P0()'s third access is also a READ_ONCE(), but to y rather than x.
+This is related to P0()'s second access by program order ("po"),
+to a different variable ("d"), and both accesses are reads ("RR").
+The resulting descriptor is "PodRR". Because P0()'s third access is
+READ_ONCE(), we add another "Once" descriptor.
+
+A from-read ("fre") relation links P0()'s third to P1()'s first
+access, and the resulting descriptor is "Fre". P1()'s first access is
+WRITE_ONCE(), which as before gives the descriptor "Once". The string
+thus far is thus "Rfi Once PodRR Once Fre Once".
+
+The remainder of P1() is similar to P0(), which means we add
+"Rfi Once PodRR Once". Another fre links P1()'s last access to
+P0()'s first access, which is WRITE_ONCE(), so we add "Fre Once".
+The full string is thus:
+
+ Rfi Once PodRR Once Fre Once Rfi Once PodRR Once Fre Once
+
+This string can be given to the "norm7" and "classify7" tools to
+produce the name:
+
+ $ norm7 -bell linux-kernel.bell \
+ Rfi Once PodRR Once Fre Once Rfi Once PodRR Once Fre Once | \
+ sed -e 's/:.*//g'
+ SB+rfionceonce-poonceonces
+
+Adding the ".litmus" suffix: SB+rfionceonce-poonceonces.litmus
+
+The descriptors that describe connections between consecutive accesses
+within the cycle through a given litmus test can be provided by the herd
+tool (Rfi, Po, Fre, and so on) or by the linux-kernel.bell file (Once,
+Release, Acquire, and so on).
+
+To see the full list of descriptors, execute the following command:
+
+ $ diyone7 -bell linux-kernel.bell -show edges
--
2.17.1


2018-09-26 18:33:25

by Paul E. McKenney

Subject: [PATCH memory-model 3/5] tools/memory-model: Fix a README typo

From: SeongJae Park <[email protected]>

This commit fixes a duplicate-"the" typo in README.

Signed-off-by: SeongJae Park <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
Acked-by: Alan Stern <[email protected]>
---
tools/memory-model/Documentation/recipes.txt | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/memory-model/Documentation/recipes.txt b/tools/memory-model/Documentation/recipes.txt
index af72700cc20a..7fe8d7aa3029 100644
--- a/tools/memory-model/Documentation/recipes.txt
+++ b/tools/memory-model/Documentation/recipes.txt
@@ -311,7 +311,7 @@ The smp_wmb() macro orders prior stores against later stores, and the
smp_rmb() macro orders prior loads against later loads. Therefore, if
the final value of r0 is 1, the final value of r1 must also be 1.

-The the xlog_state_switch_iclogs() function in fs/xfs/xfs_log.c contains
+The xlog_state_switch_iclogs() function in fs/xfs/xfs_log.c contains
the following write-side code fragment:

log->l_curr_block -= log->l_logBBsize;
--
2.17.1


2018-10-02 08:29:28

by Ingo Molnar

Subject: Re: [PATCH memory-model 0/5] Updates to the formal memory model


* Paul E. McKenney <[email protected]> wrote:

> Hello, Ingo!
>
> This series contains updates to the Linux kernel's formal memory model
> in tools/memory-model, along with corresponding changes in documentation
> and Linux-kernel code. These patches are ready for inclusion into -tip.
>
> 1. Document litmus-test naming scheme.
>
> 2. Add extra ordering for locks and remove it for ordinary
> release/acquire, courtesy of Alan Stern.
>
> 3. Fix a README typo, courtesy of SeongJae Park.
>
> 4. Expand the list of LKMM limitations.
>
> 5. Replace smp_cond_acquire() with smp_cond_load_acquire(),
> courtesy of Andrea Parri.
>
> Thanx, Paul
>
> ------------------------------------------------------------------------
>
> Documentation/memory-barriers.txt | 3
> tools/memory-model/Documentation/explanation.txt | 186 +++++++---
> tools/memory-model/Documentation/recipes.txt | 2
> tools/memory-model/README | 39 ++
> tools/memory-model/linux-kernel.cat | 8
> tools/memory-model/litmus-tests/ISA2+pooncelock+pooncelock+pombonce.litmus | 7
> tools/memory-model/litmus-tests/README | 104 +++++
> 7 files changed, 293 insertions(+), 56 deletions(-)

Applied to tip:locking/core, thanks Paul!

Ingo

Subject: [tip:locking/core] tools/memory-model: Add litmus-test naming scheme

Commit-ID: c4f790f244070dbab486805276ba4d1f87a057af
Gitweb: https://git.kernel.org/tip/c4f790f244070dbab486805276ba4d1f87a057af
Author: Paul E. McKenney <[email protected]>
AuthorDate: Wed, 26 Sep 2018 11:29:16 -0700
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 2 Oct 2018 10:28:00 +0200

tools/memory-model: Add litmus-test naming scheme

This commit documents the scheme used to generate the names for the
litmus tests.

[ paulmck: Apply feedback from Andrea Parri and Will Deacon. ]
Signed-off-by: Paul E. McKenney <[email protected]>
Acked-by: Will Deacon <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vince Weaver <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
tools/memory-model/litmus-tests/README | 104 ++++++++++++++++++++++++++++++++-
1 file changed, 102 insertions(+), 2 deletions(-)

diff --git a/tools/memory-model/litmus-tests/README b/tools/memory-model/litmus-tests/README
index 4581ec2d3c57..5ee08f129094 100644
--- a/tools/memory-model/litmus-tests/README
+++ b/tools/memory-model/litmus-tests/README
@@ -1,4 +1,6 @@
-This directory contains the following litmus tests:
+============
+LITMUS TESTS
+============

CoRR+poonceonce+Once.litmus
Test of read-read coherence, that is, whether or not two
@@ -36,7 +38,7 @@ IRIW+poonceonces+OnceOnce.litmus
ISA2+pooncelock+pooncelock+pombonce.litmus
Tests whether the ordering provided by a lock-protected S
litmus test is visible to an external process whose accesses are
- separated by smp_mb(). This addition of an external process to
+ separated by smp_mb(). This addition of an external process to
S is otherwise known as ISA2.

ISA2+poonceonces.litmus
@@ -151,3 +153,101 @@ Z6.0+pooncerelease+poacquirerelease+fencembonceonce.litmus
A great many more litmus tests are available here:

https://github.com/paulmckrcu/litmus
+
+==================
+LITMUS TEST NAMING
+==================
+
+Litmus tests are usually named based on their contents, which means that
+looking at the name tells you what the litmus test does. The naming
+scheme covers litmus tests having a single cycle that passes through
+each process exactly once, so litmus tests not fitting this description
+are named on an ad-hoc basis.
+
+The structure of a litmus-test name is the litmus-test class, a plus
+sign ("+"), and one string for each process, separated by plus signs.
+The end of the name is ".litmus".
+
+The litmus-test classes may be found in the infamous test6.pdf:
+https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test6.pdf
+Each class defines the pattern of accesses and of the variables accessed.
+For example, if the one process writes to a pair of variables, and
+the other process reads from these same variables, the corresponding
+litmus-test class is "MP" (message passing), which may be found on the
+left-hand end of the second row of tests on page one of test6.pdf.
+
+The strings used to identify the actions carried out by each process are
+complex due to a desire to have short(er) names. Thus, there is a tool to
+generate these strings from a given litmus test's actions. For example,
+consider the processes from SB+rfionceonce-poonceonces.litmus:
+
+ P0(int *x, int *y)
+ {
+ int r1;
+ int r2;
+
+ WRITE_ONCE(*x, 1);
+ r1 = READ_ONCE(*x);
+ r2 = READ_ONCE(*y);
+ }
+
+ P1(int *x, int *y)
+ {
+ int r3;
+ int r4;
+
+ WRITE_ONCE(*y, 1);
+ r3 = READ_ONCE(*y);
+ r4 = READ_ONCE(*x);
+ }
+
+The next step is to construct a space-separated list of descriptors,
+interleaving descriptions of the relation between a pair of consecutive
+accesses with descriptions of the second access in the pair.
+
+P0()'s WRITE_ONCE() is read by its first READ_ONCE(), which is a
+reads-from link (rf) and internal to the P0() process. This is
+"rfi", which is an abbreviation for "reads-from internal". Because
+some of the tools string these abbreviations together with space
+characters separating processes, the first character is capitalized,
+resulting in "Rfi".
+
+P0()'s second access is a READ_ONCE(), as opposed to (for example)
+smp_load_acquire(), so next is "Once". Thus far, we have "Rfi Once".
+
+P0()'s third access is also a READ_ONCE(), but to y rather than x.
+This is related to P0()'s second access by program order ("po"),
+to a different variable ("d"), and both accesses are reads ("RR").
+The resulting descriptor is "PodRR". Because P0()'s third access is
+READ_ONCE(), we add another "Once" descriptor.
+
+A from-read ("fre") relation links P0()'s third to P1()'s first
+access, and the resulting descriptor is "Fre". P1()'s first access is
+WRITE_ONCE(), which as before gives the descriptor "Once". The string
+thus far is thus "Rfi Once PodRR Once Fre Once".
+
+The remainder of P1() is similar to P0(), which means we add
+"Rfi Once PodRR Once". Another fre links P1()'s last access to
+P0()'s first access, which is WRITE_ONCE(), so we add "Fre Once".
+The full string is thus:
+
+ Rfi Once PodRR Once Fre Once Rfi Once PodRR Once Fre Once
+
+This string can be given to the "norm7" and "classify7" tools to
+produce the name:
+
+ $ norm7 -bell linux-kernel.bell \
+ Rfi Once PodRR Once Fre Once Rfi Once PodRR Once Fre Once | \
+ sed -e 's/:.*//g'
+ SB+rfionceonce-poonceonces
+
+Adding the ".litmus" suffix: SB+rfionceonce-poonceonces.litmus
+
+The descriptors that describe connections between consecutive accesses
+within the cycle through a given litmus test can be provided by the herd
+tool (Rfi, Po, Fre, and so on) or by the linux-kernel.bell file (Once,
+Release, Acquire, and so on).
+
+To see the full list of descriptors, execute the following command:
+
+ $ diyone7 -bell linux-kernel.bell -show edges

Subject: [tip:locking/core] tools/memory-model: Fix a README typo

Commit-ID: 3d2046a6fa2106584cf1080c2c08a6e8e79cbbb4
Gitweb: https://git.kernel.org/tip/3d2046a6fa2106584cf1080c2c08a6e8e79cbbb4
Author: SeongJae Park <[email protected]>
AuthorDate: Wed, 26 Sep 2018 11:29:18 -0700
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 2 Oct 2018 10:28:03 +0200

tools/memory-model: Fix a README typo

This commit fixes a duplicate-"the" typo in README.

Signed-off-by: SeongJae Park <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
Acked-by: Alan Stern <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vince Weaver <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
tools/memory-model/Documentation/recipes.txt | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/memory-model/Documentation/recipes.txt b/tools/memory-model/Documentation/recipes.txt
index af72700cc20a..7fe8d7aa3029 100644
--- a/tools/memory-model/Documentation/recipes.txt
+++ b/tools/memory-model/Documentation/recipes.txt
@@ -311,7 +311,7 @@ The smp_wmb() macro orders prior stores against later stores, and the
smp_rmb() macro orders prior loads against later loads. Therefore, if
the final value of r0 is 1, the final value of r1 must also be 1.

-The the xlog_state_switch_iclogs() function in fs/xfs/xfs_log.c contains
+The xlog_state_switch_iclogs() function in fs/xfs/xfs_log.c contains
the following write-side code fragment:

log->l_curr_block -= log->l_logBBsize;

Subject: [tip:locking/core] tools/memory-model: Add more LKMM limitations

Commit-ID: d8fa25c4efde0e5f31a427202e583d73d3f021c4
Gitweb: https://git.kernel.org/tip/d8fa25c4efde0e5f31a427202e583d73d3f021c4
Author: Paul E. McKenney <[email protected]>
AuthorDate: Wed, 26 Sep 2018 11:29:19 -0700
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 2 Oct 2018 10:28:04 +0200

tools/memory-model: Add more LKMM limitations

This commit adds more detail about compiler optimizations and
not-yet-modeled Linux-kernel APIs.

Signed-off-by: Paul E. McKenney <[email protected]>
Reviewed-by: Andrea Parri <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vince Weaver <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
tools/memory-model/README | 39 +++++++++++++++++++++++++++++++++++++++
1 file changed, 39 insertions(+)

diff --git a/tools/memory-model/README b/tools/memory-model/README
index ee987ce20aae..acf9077cffaa 100644
--- a/tools/memory-model/README
+++ b/tools/memory-model/README
@@ -171,6 +171,12 @@ The Linux-kernel memory model has the following limitations:
particular, the "THE PROGRAM ORDER RELATION: po AND po-loc"
and "A WARNING" sections).

+ Note that this limitation in turn limits LKMM's ability to
+ accurately model address, control, and data dependencies.
+ For example, if the compiler can deduce the value of some variable
+ carrying a dependency, then the compiler can break that dependency
+ by substituting a constant of that value.
+
2. Multiple access sizes for a single variable are not supported,
and neither are misaligned or partially overlapping accesses.

@@ -190,6 +196,36 @@ The Linux-kernel memory model has the following limitations:
However, a substantial amount of support is provided for these
operations, as shown in the linux-kernel.def file.

+ a. When rcu_assign_pointer() is passed NULL, the Linux
+ kernel provides no ordering, but LKMM models this
+ case as a store release.
+
+ b. The "unless" RMW operations are not currently modeled:
+ atomic_long_add_unless(), atomic_add_unless(),
+ atomic_inc_unless_negative(), and
+ atomic_dec_unless_positive(). These can be emulated
+ in litmus tests, for example, by using atomic_cmpxchg().
+
+ c. The call_rcu() function is not modeled. It can be
+ emulated in litmus tests by adding another process that
+ invokes synchronize_rcu() and the body of the callback
+ function, with (for example) a release-acquire from
+ the site of the emulated call_rcu() to the beginning
+ of the additional process.
+
+ d. The rcu_barrier() function is not modeled. It can be
+ emulated in litmus tests emulating call_rcu() via
+ (for example) a release-acquire from the end of each
+ additional call_rcu() process to the site of the
+ emulated rcu_barrier().
+
+ e. Sleepable RCU (SRCU) is not modeled. It can be
+ emulated, but perhaps not simply.
+
+ f. Reader-writer locking is not modeled. It can be
+ emulated in litmus tests using atomic read-modify-write
+ operations.
+
The "herd7" tool has some additional limitations of its own, apart from
the memory model:

@@ -204,3 +240,6 @@ the memory model:
Some of these limitations may be overcome in the future, but others are
more likely to be addressed by incorporating the Linux-kernel memory model
into other tools.
+
+Finally, please note that LKMM is subject to change as hardware, use cases,
+and compilers evolve.

Subject: [tip:locking/core] tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire

Commit-ID: 6e89e831a90172bc3d34ecbba52af5b9c4a447d1
Gitweb: https://git.kernel.org/tip/6e89e831a90172bc3d34ecbba52af5b9c4a447d1
Author: Alan Stern <[email protected]>
AuthorDate: Wed, 26 Sep 2018 11:29:17 -0700
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 2 Oct 2018 10:28:01 +0200

tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire

More than one kernel developer has expressed the opinion that the LKMM
should enforce ordering of writes by locking. In other words, given
the following code:

WRITE_ONCE(x, 1);
spin_unlock(&s);
spin_lock(&s);
WRITE_ONCE(y, 1);

the stores to x and y should be propagated in order to all other CPUs,
even though those other CPUs might not access the lock s. In terms of
the memory model, this means expanding the cumul-fence relation.

Locks should also provide read-read (and read-write) ordering in a
similar way. Given:

READ_ONCE(x);
spin_unlock(&s);
spin_lock(&s);
READ_ONCE(y); // or WRITE_ONCE(y, 1);

the load of x should be executed before the load of (or store to) y.
The LKMM already provides this ordering, but it provides it even in
the case where the two accesses are separated by a release/acquire
pair of fences rather than unlock/lock. This would prevent
architectures from using weakly ordered implementations of release and
acquire, which seems like an unnecessary restriction. The patch
therefore removes the ordering requirement from the LKMM for that
case.

There are several arguments both for and against this change. Let us
refer to these enhanced ordering properties by saying that the LKMM
would require locks to be RCtso (a bit of a misnomer, but analogous to
RCpc and RCsc) and it would require ordinary acquire/release only to
be RCpc. (Note: In the following, the phrase "all supported
architectures" is meant not to include RISC-V. Although RISC-V is
indeed supported by the kernel, the implementation is still somewhat
in a state of flux and therefore statements about it would be
premature.)

Pros:

The kernel already provides RCtso ordering for locks on all
supported architectures, even though this is not stated
explicitly anywhere. Therefore the LKMM should formalize it.

In theory, guaranteeing RCtso ordering would reduce the need
for additional barrier-like constructs meant to increase the
ordering strength of locks.

Will Deacon and Peter Zijlstra are strongly in favor of
formalizing the RCtso requirement. Linus Torvalds and Will
would like to go even further, requiring locks to have RCsc
behavior (ordering preceding writes against later reads), but
they recognize that this would incur a noticeable performance
degradation on the POWER architecture. Linus also points out
that people have made the mistake, in the past, of assuming
that locking has stronger ordering properties than is
currently guaranteed, and this change would reduce the
likelihood of such mistakes.

Not requiring ordinary acquire/release to be any stronger than
RCpc may prove advantageous for future architectures, allowing
them to implement smp_load_acquire() and smp_store_release()
with more efficient machine instructions than would be
possible if the operations had to be RCtso. Will and Linus
approve this rationale, hypothetical though it is at the
moment (it may end up affecting the RISC-V implementation).
The same argument may or may not apply to RMW-acquire/release;
see also the second Con entry below.

Linus feels that locks should be easy for people to use
without worrying about memory consistency issues, since they
are so pervasive in the kernel, whereas acquire/release is
much more of an "experts only" tool. Requiring locks to be
RCtso is a step in this direction.

Cons:

Andrea Parri and Luc Maranget think that locks should have the
same ordering properties as ordinary acquire/release (indeed,
Luc points out that the names "acquire" and "release" derive
from the usage of locks). Andrea points out that having
different ordering properties for different forms of acquires
and releases is not only unnecessary, it would also be
confusing and unmaintainable.

Locks are constructed from lower-level primitives, typically
RMW-acquire (for locking) and ordinary release (for unlock).
It is illogical to require stronger ordering properties from
the high-level operations than from the low-level operations
they comprise. Thus, this change would make

while (cmpxchg_acquire(&s, 0, 1) != 0)
cpu_relax();

an incorrect implementation of spin_lock(&s) as far as the
LKMM is concerned. In theory this weakness can be ameliorated
by changing the LKMM even further, requiring
RMW-acquire/release also to be RCtso (which it already is on
all supported architectures).

As far as I know, nobody has singled out any examples of code
in the kernel that actually relies on locks being RCtso.
(People mumble about RCU and the scheduler, but nobody has
pointed to any actual code. If there are any real cases,
their number is likely quite small.) If RCtso ordering is not
needed, why require it?

A handful of locking constructs (qspinlocks, qrwlocks, and
mcs_spinlocks) are built on top of smp_cond_load_acquire()
instead of an RMW-acquire instruction. It currently provides
only the ordinary acquire semantics, not the stronger ordering
this patch would require of locks. In theory this could be
ameliorated by requiring smp_cond_load_acquire() in
combination with ordinary release also to be RCtso (which is
currently true on all supported architectures).

On future weakly ordered architectures, people may be able to
implement locks in a non-RCtso fashion with significant
performance improvement. Meeting the RCtso requirement would
necessarily add run-time overhead.

Overall, the technical aspects of these arguments seem relatively
minor, and it appears mostly to boil down to a matter of opinion.
Since the opinions of senior kernel maintainers such as Linus,
Peter, and Will carry more weight than those of Luc and Andrea, this
patch changes the model in accordance with the maintainers' wishes.

Signed-off-by: Alan Stern <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
Reviewed-by: Will Deacon <[email protected]>
Reviewed-by: Andrea Parri <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vince Weaver <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
tools/memory-model/Documentation/explanation.txt | 186 ++++++++++++++++-----
tools/memory-model/linux-kernel.cat | 8 +-
.../ISA2+pooncelock+pooncelock+pombonce.litmus | 7 +-
3 files changed, 150 insertions(+), 51 deletions(-)

diff --git a/tools/memory-model/Documentation/explanation.txt b/tools/memory-model/Documentation/explanation.txt
index 0cbd1ef8f86d..35bff92cc773 100644
--- a/tools/memory-model/Documentation/explanation.txt
+++ b/tools/memory-model/Documentation/explanation.txt
@@ -28,7 +28,8 @@ Explanation of the Linux-Kernel Memory Consistency Model
20. THE HAPPENS-BEFORE RELATION: hb
21. THE PROPAGATES-BEFORE RELATION: pb
22. RCU RELATIONS: rcu-link, gp, rscs, rcu-fence, and rb
- 23. ODDS AND ENDS
+ 23. LOCKING
+ 24. ODDS AND ENDS



@@ -1067,28 +1068,6 @@ allowing out-of-order writes like this to occur. The model avoided
violating the write-write coherence rule by requiring the CPU not to
send the W write to the memory subsystem at all!)

-There is one last example of preserved program order in the LKMM: when
-a load-acquire reads from an earlier store-release. For example:
-
- smp_store_release(&x, 123);
- r1 = smp_load_acquire(&x);
-
-If the smp_load_acquire() ends up obtaining the 123 value that was
-stored by the smp_store_release(), the LKMM says that the load must be
-executed after the store; the store cannot be forwarded to the load.
-This requirement does not arise from the operational model, but it
-yields correct predictions on all architectures supported by the Linux
-kernel, although for differing reasons.
-
-On some architectures, including x86 and ARMv8, it is true that the
-store cannot be forwarded to the load. On others, including PowerPC
-and ARMv7, smp_store_release() generates object code that starts with
-a fence and smp_load_acquire() generates object code that ends with a
-fence. The upshot is that even though the store may be forwarded to
-the load, it is still true that any instruction preceding the store
-will be executed before the load or any following instructions, and
-the store will be executed before any instruction following the load.
-

AND THEN THERE WAS ALPHA
------------------------
@@ -1766,6 +1745,147 @@ before it does, and the critical section in P2 both starts after P1's
grace period does and ends after it does.


+LOCKING
+-------
+
+The LKMM includes locking. In fact, there is special code for locking
+in the formal model, added in order to make tools run faster.
+However, this special code is intended to be more or less equivalent
+to concepts we have already covered. A spinlock_t variable is treated
+the same as an int, and spin_lock(&s) is treated almost the same as:
+
+ while (cmpxchg_acquire(&s, 0, 1) != 0)
+ cpu_relax();
+
+This waits until s is equal to 0 and then atomically sets it to 1,
+and the read part of the cmpxchg operation acts as an acquire fence.
+An alternate way to express the same thing would be:
+
+ r = xchg_acquire(&s, 1);
+
+along with a requirement that at the end, r = 0. Similarly,
+spin_trylock(&s) is treated almost the same as:
+
+ return !cmpxchg_acquire(&s, 0, 1);
+
+which atomically sets s to 1 if it is currently equal to 0 and returns
+true if it succeeds (the read part of the cmpxchg operation acts as an
+acquire fence only if the operation is successful). spin_unlock(&s)
+is treated almost the same as:
+
+ smp_store_release(&s, 0);
+
+The "almost" qualifiers above need some explanation. In the LKMM, the
+store-release in a spin_unlock() and the load-acquire which forms the
+first half of the atomic rmw update in a spin_lock() or a successful
+spin_trylock() -- we can call these things lock-releases and
+lock-acquires -- have two properties beyond those of ordinary releases
+and acquires.
+
+First, when a lock-acquire reads from a lock-release, the LKMM
+requires that every instruction po-before the lock-release must
+execute before any instruction po-after the lock-acquire. This would
+naturally hold if the release and acquire operations were on different
+CPUs, but the LKMM says it holds even when they are on the same CPU.
+For example:
+
+ int x, y;
+ spinlock_t s;
+
+ P0()
+ {
+ int r1, r2;
+
+ spin_lock(&s);
+ r1 = READ_ONCE(x);
+ spin_unlock(&s);
+ spin_lock(&s);
+ r2 = READ_ONCE(y);
+ spin_unlock(&s);
+ }
+
+ P1()
+ {
+ WRITE_ONCE(y, 1);
+ smp_wmb();
+ WRITE_ONCE(x, 1);
+ }
+
+Here the second spin_lock() reads from the first spin_unlock(), and
+therefore the load of x must execute before the load of y. Thus we
+cannot have r1 = 1 and r2 = 0 at the end (this is an instance of the
+MP pattern).
+
+This requirement does not apply to ordinary release and acquire
+fences, only to lock-related operations. For instance, suppose P0()
+in the example had been written as:
+
+ P0()
+ {
+ int r1, r2, r3;
+
+ r1 = READ_ONCE(x);
+ smp_store_release(&s, 1);
+ r3 = smp_load_acquire(&s);
+ r2 = READ_ONCE(y);
+ }
+
+Then the CPU would be allowed to forward the s = 1 value from the
+smp_store_release() to the smp_load_acquire(), executing the
+instructions in the following order:
+
+ r3 = smp_load_acquire(&s); // Obtains r3 = 1
+ r2 = READ_ONCE(y);
+ r1 = READ_ONCE(x);
+ smp_store_release(&s, 1); // Value is forwarded
+
+and thus it could load y before x, obtaining r2 = 0 and r1 = 1.
+
+Second, when a lock-acquire reads from a lock-release, and some other
+stores W and W' occur po-before the lock-release and po-after the
+lock-acquire respectively, the LKMM requires that W must propagate to
+each CPU before W' does. For example, consider:
+
+ int x, y;
+ spinlock_t s;
+
+ P0()
+ {
+ spin_lock(&s);
+ WRITE_ONCE(x, 1);
+ spin_unlock(&s);
+ }
+
+ P1()
+ {
+ int r1;
+
+ spin_lock(&s);
+ r1 = READ_ONCE(x);
+ WRITE_ONCE(y, 1);
+ spin_unlock(&s);
+ }
+
+ P2()
+ {
+ int r2, r3;
+
+ r2 = READ_ONCE(y);
+ smp_rmb();
+ r3 = READ_ONCE(x);
+ }
+
+If r1 = 1 at the end then the spin_lock() in P1 must have read from
+the spin_unlock() in P0. Hence the store to x must propagate to P2
+before the store to y does, so we cannot have r2 = 1 and r3 = 0.
+
+These two special requirements for lock-release and lock-acquire do
+not arise from the operational model. Nevertheless, kernel developers
+have come to expect and rely on them because they do hold on all
+architectures supported by the Linux kernel, albeit for various
+differing reasons.
+
+
ODDS AND ENDS
-------------

@@ -1831,26 +1951,6 @@ they behave as follows:
events and the events preceding them against all po-later
events.

-The LKMM includes locking. In fact, there is special code for locking
-in the formal model, added in order to make tools run faster.
-However, this special code is intended to be exactly equivalent to
-concepts we have already covered. A spinlock_t variable is treated
-the same as an int, and spin_lock(&s) is treated the same as:
-
- while (cmpxchg_acquire(&s, 0, 1) != 0)
- cpu_relax();
-
-which waits until s is equal to 0 and then atomically sets it to 1,
-and where the read part of the atomic update is also an acquire fence.
-An alternate way to express the same thing would be:
-
- r = xchg_acquire(&s, 1);
-
-along with a requirement that at the end, r = 0. spin_unlock(&s) is
-treated the same as:
-
- smp_store_release(&s, 0);
-
Interestingly, RCU and locking each introduce the possibility of
deadlock. When faced with code sequences such as:

diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
index 59b5cbe6b624..882fc33274ac 100644
--- a/tools/memory-model/linux-kernel.cat
+++ b/tools/memory-model/linux-kernel.cat
@@ -38,7 +38,7 @@ let strong-fence = mb | gp
(* Release Acquire *)
let acq-po = [Acquire] ; po ; [M]
let po-rel = [M] ; po ; [Release]
-let rfi-rel-acq = [Release] ; rfi ; [Acquire]
+let po-unlock-rf-lock-po = po ; [UL] ; rf ; [LKR] ; po

(**********************************)
(* Fundamental coherence ordering *)
@@ -60,13 +60,13 @@ let dep = addr | data
let rwdep = (dep | ctrl) ; [W]
let overwrite = co | fr
let to-w = rwdep | (overwrite & int)
-let to-r = addr | (dep ; rfi) | rfi-rel-acq
+let to-r = addr | (dep ; rfi)
let fence = strong-fence | wmb | po-rel | rmb | acq-po
-let ppo = to-r | to-w | fence
+let ppo = to-r | to-w | fence | (po-unlock-rf-lock-po & int)

(* Propagation: Ordering from release operations and strong fences. *)
let A-cumul(r) = rfe? ; r
-let cumul-fence = A-cumul(strong-fence | po-rel) | wmb
+let cumul-fence = A-cumul(strong-fence | po-rel) | wmb | po-unlock-rf-lock-po
let prop = (overwrite & ext)? ; cumul-fence* ; rfe?

(*
diff --git a/tools/memory-model/litmus-tests/ISA2+pooncelock+pooncelock+pombonce.litmus b/tools/memory-model/litmus-tests/ISA2+pooncelock+pooncelock+pombonce.litmus
index 0f749e419b34..094d58df7789 100644
--- a/tools/memory-model/litmus-tests/ISA2+pooncelock+pooncelock+pombonce.litmus
+++ b/tools/memory-model/litmus-tests/ISA2+pooncelock+pooncelock+pombonce.litmus
@@ -1,11 +1,10 @@
C ISA2+pooncelock+pooncelock+pombonce

(*
- * Result: Sometimes
+ * Result: Never
*
- * This test shows that the ordering provided by a lock-protected S
- * litmus test (P0() and P1()) are not visible to external process P2().
- * This is likely to change soon.
+ * This test shows that write-write ordering provided by locks
+ * (in P0() and P1()) is visible to external process P2().
*)

{}
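
For readers who want to check the new guarantee themselves, the
P0()/P1()/P2() example from the explanation.txt hunk above can be
written as a standalone litmus test along the following lines.  This is
only a sketch: the test name and "exists" clause are chosen here for
illustration, and this is not the in-tree
ISA2+pooncelock+pooncelock+pombonce.litmus file.

	C lock-writes-visible-sketch

	(*
	 * Sketch corresponding to the explanation.txt example above:
	 * can P2() observe P1()'s store to y but not P0()'s store to x?
	 *)

	{}

	P0(int *x, spinlock_t *s)
	{
		spin_lock(s);
		WRITE_ONCE(*x, 1);
		spin_unlock(s);
	}

	P1(int *x, int *y, spinlock_t *s)
	{
		int r1;

		spin_lock(s);
		r1 = READ_ONCE(*x);
		WRITE_ONCE(*y, 1);
		spin_unlock(s);
	}

	P2(int *x, int *y)
	{
		int r2;
		int r3;

		r2 = READ_ONCE(*y);
		smp_rmb();
		r3 = READ_ONCE(*x);
	}

	exists (1:r1=1 /\ 2:r2=1 /\ 2:r3=0)

With the updated linux-kernel.cat, herd7 should report this "exists"
clause as unsatisfiable (Never), matching the argument in the
explanation.txt text.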

Subject: [tip:locking/core] locking/memory-barriers: Replace smp_cond_acquire() with smp_cond_load_acquire()

Commit-ID: 2f359c7ea554ba9cb78a52c82bedff351cdabd2b
Gitweb: https://git.kernel.org/tip/2f359c7ea554ba9cb78a52c82bedff351cdabd2b
Author: Andrea Parri <[email protected]>
AuthorDate: Wed, 26 Sep 2018 11:29:20 -0700
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 2 Oct 2018 10:28:05 +0200

locking/memory-barriers: Replace smp_cond_acquire() with smp_cond_load_acquire()

Amend the changes in commit:

1f03e8d2919270 ("locking/barriers: Replace smp_cond_acquire() with smp_cond_load_acquire()")

... by updating the documentation accordingly.

Also remove some obsolete information related to the implementation.

Signed-off-by: Andrea Parri <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
Acked-by: Will Deacon <[email protected]>
Acked-by: Alan Stern <[email protected]>
Cc: Akira Yokosawa <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Daniel Lustig <[email protected]>
Cc: David Howells <[email protected]>
Cc: Jade Alglave <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Luc Maranget <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vince Weaver <[email protected]>
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
Documentation/memory-barriers.txt | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index 0d8d7ef131e9..c1d913944ad8 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -471,8 +471,7 @@ And a couple of implicit varieties:
operations after the ACQUIRE operation will appear to happen after the
ACQUIRE operation with respect to the other components of the system.
ACQUIRE operations include LOCK operations and both smp_load_acquire()
- and smp_cond_acquire() operations. The later builds the necessary ACQUIRE
- semantics from relying on a control dependency and smp_rmb().
+ and smp_cond_load_acquire() operations.

Memory operations that occur before an ACQUIRE operation may appear to
happen after it completes.
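
As background for the documentation change above, here is a minimal
usage sketch of smp_cond_load_acquire().  It assumes kernel context;
the "flag" variable and the wait_for_flag() wrapper are made up for
illustration.

	#include <asm/barrier.h>	/* smp_cond_load_acquire() */

	static int flag;

	/* Illustrative only, not taken from any in-tree caller. */
	static void wait_for_flag(void)
	{
		/*
		 * Spin until "flag" becomes non-zero.  The load that
		 * finally observes the non-zero value has ACQUIRE
		 * semantics, so accesses after this call cannot be
		 * reordered before that load.
		 */
		smp_cond_load_acquire(&flag, VAL != 0);
	}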

2021-09-08 11:46:45

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [tip:locking/core] tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire

On Wed, Sep 08, 2021 at 01:00:26PM +0200, Peter Zijlstra wrote:
> On Tue, Oct 02, 2018 at 03:11:10AM -0700, tip-bot for Alan Stern wrote:
> > Commit-ID: 6e89e831a90172bc3d34ecbba52af5b9c4a447d1
> > Gitweb: https://git.kernel.org/tip/6e89e831a90172bc3d34ecbba52af5b9c4a447d1
> > Author: Alan Stern <[email protected]>
> > AuthorDate: Wed, 26 Sep 2018 11:29:17 -0700
> > Committer: Ingo Molnar <[email protected]>
> > CommitDate: Tue, 2 Oct 2018 10:28:01 +0200
> >
> > tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire
> >
> > More than one kernel developer has expressed the opinion that the LKMM
> > should enforce ordering of writes by locking. In other words, given
> > the following code:
> >
> > WRITE_ONCE(x, 1);
> > spin_unlock(&s):
> > spin_lock(&s);
> > WRITE_ONCE(y, 1);
> >
> > the stores to x and y should be propagated in order to all other CPUs,
> > even though those other CPUs might not access the lock s. In terms of
> > the memory model, this means expanding the cumul-fence relation.
>
> Let me revive this old thread... recently we ran into the case:
>
> WRITE_ONCE(x, 1);
> spin_unlock(&s):
> spin_lock(&r);
> WRITE_ONCE(y, 1);
>
> which is distinct from the original in that UNLOCK and LOCK are not on
> the same variable.
>
> I'm arguing this should still be RCtso by reason of:
>
> spin_lock() requires an atomic-acquire which:
>
> TSO-arch) implies smp_mb()
> ARM64) is RCsc for any stlr/ldar
> Power) LWSYNC
> Risc-V) fence r , rw
> *) explicitly has smp_mb()
>
>
> However, Boqun points out that the memory model disagrees, per:
>
> https://lkml.kernel.org/r/YTI2UjKy+C7LeIf+@boqun-archlinux
>
> Is this an error/oversight of the memory model, or did I miss a subtlety
> somewhere?

Hmm.. that argument isn't strong enough for Risc-V if I read that FENCE
thing right. That's just R->RW ordering, which doesn't constrain the
first WRITE_ONCE().

However, that spin_unlock has "fence rw, w" with a subsequent write. So
the whole thing then becomes something like:


WRITE_ONCE(x, 1);
FENCE RW, W
WRITE_ONCE(s.lock, 0);
AMOSWAP %0, 1, r.lock
FENCE R, RW
WRITE_ONCE(y, 1);


Which I think is still sufficient, irrespective of the whole s!=r thing.

2021-09-08 12:24:01

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [tip:locking/core] tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire

On Tue, Oct 02, 2018 at 03:11:10AM -0700, tip-bot for Alan Stern wrote:
> Commit-ID: 6e89e831a90172bc3d34ecbba52af5b9c4a447d1
> Gitweb: https://git.kernel.org/tip/6e89e831a90172bc3d34ecbba52af5b9c4a447d1
> Author: Alan Stern <[email protected]>
> AuthorDate: Wed, 26 Sep 2018 11:29:17 -0700
> Committer: Ingo Molnar <[email protected]>
> CommitDate: Tue, 2 Oct 2018 10:28:01 +0200
>
> tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire
>
> More than one kernel developer has expressed the opinion that the LKMM
> should enforce ordering of writes by locking. In other words, given
> the following code:
>
> WRITE_ONCE(x, 1);
> spin_unlock(&s):
> spin_lock(&s);
> WRITE_ONCE(y, 1);
>
> the stores to x and y should be propagated in order to all other CPUs,
> even though those other CPUs might not access the lock s. In terms of
> the memory model, this means expanding the cumul-fence relation.

Let me revive this old thread... recently we ran into the case:

WRITE_ONCE(x, 1);
spin_unlock(&s);
spin_lock(&r);
WRITE_ONCE(y, 1);

which is distinct from the original in that UNLOCK and LOCK are not on
the same variable.

I'm arguing this should still be RCtso by reason of:

spin_lock() requires an atomic-acquire which:

TSO-arch) implies smp_mb()
ARM64) is RCsc for any stlr/ldar
Power) LWSYNC
Risc-V) fence r , rw
*) explicitly has smp_mb()


However, Boqun points out that the memory model disagrees, per:

https://lkml.kernel.org/r/YTI2UjKy+C7LeIf+@boqun-archlinux

Is this an error/oversight of the memory model, or did I miss a subtlety
somewhere?
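
For reference, the case above can be written as an LKMM litmus test
along the following lines, with the lock and unlock operations completed
into matching pairs so that herd7 accepts the test (a sketch only; the
test name and "exists" clause are illustrative).  Per Boqun's
observation, the memory model as it stands reports the weak outcome as
reachable, that is, "Sometimes" rather than "Never":

	C unlock-lock-different-locks-sketch

	(*
	 * P0() writes x inside lock s and y inside lock r; can P1()
	 * observe the write to y but not the write to x?
	 *)

	{}

	P0(int *x, int *y, spinlock_t *s, spinlock_t *r)
	{
		spin_lock(s);
		WRITE_ONCE(*x, 1);
		spin_unlock(s);
		spin_lock(r);
		WRITE_ONCE(*y, 1);
		spin_unlock(r);
	}

	P1(int *x, int *y)
	{
		int r1;
		int r2;

		r1 = READ_ONCE(*y);
		smp_rmb();
		r2 = READ_ONCE(*x);
	}

	exists (1:r1=1 /\ 1:r2=0)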

2021-09-08 14:48:40

by Alan Stern

[permalink] [raw]
Subject: Re: [tip:locking/core] tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire

On Wed, Sep 08, 2021 at 01:44:11PM +0200, Peter Zijlstra wrote:
> On Wed, Sep 08, 2021 at 01:00:26PM +0200, Peter Zijlstra wrote:
> > On Tue, Oct 02, 2018 at 03:11:10AM -0700, tip-bot for Alan Stern wrote:
> > > Commit-ID: 6e89e831a90172bc3d34ecbba52af5b9c4a447d1
> > > Gitweb: https://git.kernel.org/tip/6e89e831a90172bc3d34ecbba52af5b9c4a447d1
> > > Author: Alan Stern <[email protected]>
> > > AuthorDate: Wed, 26 Sep 2018 11:29:17 -0700
> > > Committer: Ingo Molnar <[email protected]>
> > > CommitDate: Tue, 2 Oct 2018 10:28:01 +0200
> > >
> > > tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire
> > >
> > > More than one kernel developer has expressed the opinion that the LKMM
> > > should enforce ordering of writes by locking. In other words, given
> > > the following code:
> > >
> > > WRITE_ONCE(x, 1);
> > > spin_unlock(&s):
> > > spin_lock(&s);
> > > WRITE_ONCE(y, 1);
> > >
> > > the stores to x and y should be propagated in order to all other CPUs,
> > > even though those other CPUs might not access the lock s. In terms of
> > > the memory model, this means expanding the cumul-fence relation.
> >
> > Let me revive this old thread... recently we ran into the case:
> >
> > WRITE_ONCE(x, 1);
> > spin_unlock(&s):
> > spin_lock(&r);
> > WRITE_ONCE(y, 1);
> >
> > which is distinct from the original in that UNLOCK and LOCK are not on
> > the same variable.
> >
> > I'm arguing this should still be RCtso by reason of:
> >
> > spin_lock() requires an atomic-acquire which:
> >
> > TSO-arch) implies smp_mb()
> > ARM64) is RCsc for any stlr/ldar
> > Power) LWSYNC
> > Risc-V) fence r , rw
> > *) explicitly has smp_mb()
> >
> >
> > However, Boqun points out that the memory model disagrees, per:
> >
> > https://lkml.kernel.org/r/YTI2UjKy+C7LeIf+@boqun-archlinux
> >
> > Is this an error/oversight of the memory model, or did I miss a subtlety
> > somewhere?

There's the question of what we think the LKMM should do in principle, and
the question of how far it should go in mirroring the limitations of the
various kernel hardware implementations. These are obviously separate
questions, but they both should influence the design of the memory model.
But to what extent?

Given:

spin_lock(&r);
WRITE_ONCE(x, 1);
spin_unlock(&r);
spin_lock(&s);
WRITE_ONCE(y, 1);
spin_unlock(&s);

there is no reason _in theory_ why a CPU shouldn't reorder and interleave
the operations to get:

spin_lock(&r);
spin_lock(&s);
WRITE_ONCE(x, 1);
WRITE_ONCE(y, 1);
spin_unlock(&r);
spin_unlock(&s);

(Of course, this wouldn't happen if some other CPU was holding the s lock
while waiting for r to be released. In that case the spin loop for s above
wouldn't be able to end until after the unlock operation on r was complete,
so this reordering couldn't occur. But if there was no such contention then
the reordering is possible in theory -- ignoring restrictions imposed by the
actual implementations of the operations.)

Given such a reordering, nothing will prevent other CPUs from observing the
write to y before the write to x.

> Hmm.. that argument isn't strong enough for Risc-V if I read that FENCE
> thing right. That's just R->RW ordering, which doesn't constrain the
> first WRITE_ONCE().
>
> However, that spin_unlock has "fence rw, w" with a subsequent write. So
> the whole thing then becomes something like:
>
>
> WRITE_ONCE(x, 1);
> FENCE RW, W
> WRITE_ONCE(s.lock, 0);
> AMOSWAP %0, 1, r.lock
> FENCE R, WR
> WRITE_ONCE(y, 1);
>
>
> Which I think is still sufficient, irrespective of the whole s!=r thing.

To me, this argument feels like an artificial, unintended consequence of the
individual implementations, not something that should be considered a
systematic architectural requirement. Perhaps one could say the same thing
about the case where the two spinlock_t variables are the same, but at least
in that case there is a good argument for inherent ordering of atomic
accesses to a single variable.

Alan

2021-09-08 15:17:25

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [tip:locking/core] tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire

On Wed, Sep 08, 2021 at 10:42:17AM -0400, Alan Stern wrote:
> On Wed, Sep 08, 2021 at 01:44:11PM +0200, Peter Zijlstra wrote:

> > > Is this an error/oversight of the memory model, or did I miss a subtlety
> > > somewhere?
>
> There's the question of what we think the LKMM should do in principle, and
> the question of how far it should go in mirroring the limitations of the
> various kernel hardware implementations. These are obviously separate
> questions, but they both should influence the design of the memory model.
> But to what extent?
>
> Given:
>
> spin_lock(&r);
> WRITE_ONCE(x, 1);
> spin_unlock(&r);
> spin_lock(&s);
> WRITE_ONCE(y, 1);
> spin_unlock(&s);
>
> there is no reason _in theory_ why a CPU shouldn't reorder and interleave
> the operations to get:
>
> spin_lock(&r);
> spin_lock(&s);
> WRITE_ONCE(x, 1);
> WRITE_ONCE(y, 1);
> spin_unlock(&r);
> spin_unlock(&s);
>
> (Of course, this wouldn't happen if some other CPU was holding the s lock
> while waiting for r to be released. In that case the spin loop for s above
> wouldn't be able to end until after the unlock operation on r was complete,
> so this reordering couldn't occur. But if there was no such contention then
> the reordering is possible in theory -- ignoring restrictions imposed by the
> actual implementations of the operations.)
>
> Given such a reordering, nothing will prevent other CPUs from observing the
> write to y before the write to x.

To a very small degree the Risc-V implementation actually does some of
that. It allows the stores from unlock and lock to be observed out of
order. But in general we have very weak rules about where the store of
the lock is visible in any case.

(revisit the spin_is_locked() saga for more details there)

> > Hmm.. that argument isn't strong enough for Risc-V if I read that FENCE
> > thing right. That's just R->RW ordering, which doesn't constrain the
> > first WRITE_ONCE().
> >
> > However, that spin_unlock has "fence rw, w" with a subsequent write. So
> > the whole thing then becomes something like:
> >
> >
> > WRITE_ONCE(x, 1);
> > FENCE RW, W
> > WRITE_ONCE(s.lock, 0);
> > AMOSWAP %0, 1, r.lock
> > FENCE R, WR
> > WRITE_ONCE(y, 1);
> >
> >
> > Which I think is still sufficient, irrespective of the whole s!=r thing.
>
> To me, this argument feels like an artificial, unintended consequence of the
> individual implementations, not something that should be considered a
> systematic architectural requirement. Perhaps one could say the same thing
> about the case where the two spinlock_t variables are the same, but at least
> in that case there is a good argument for inherent ordering of atomic
> accesses to a single variable.

Possibly :-) The way I got here is that my brain seems to have produced
the rule that UNLOCK+LOCK -> TSO order (an improvement, because for a
time it said SC), and it completely forgot about this subtlety. And in
general I feel that less subtlety is more better, but I understand your
counter argument :/

In any case, it looks like we had to put an smp_mb() in there anyway due
to other reasons and the whole argument is moot again.

I'll try and remember for next time :-)

2021-09-08 17:48:31

by Linus Torvalds

[permalink] [raw]
Subject: Re: [tip:locking/core] tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire

On Wed, Sep 8, 2021 at 7:42 AM Alan Stern <[email protected]> wrote:
>
> there is no reason _in theory_ why a CPU shouldn't reorder and interleave
> the operations to get:

I agree about the theory part.

But I think the LKMM should be the strongest ordering that is reasonable.

And it should take common architecture behavior into account.

IOW, if there is some rare architecture where the above can happen,
but no common sane one allows it in practice, we should strive to make
the LKMM the _stronger_ one.

We sure as hell shouldn't say "RISC-V is potentially very weakly
ordered, so we'll allow that weak ordering".

Because overly weak ordering only causes problems for others. And the
performance arguments for it have historically been garbage anyway.
See the pain powerpc goes through because of bad ordering (and even
more so alpha), and see how arm actually strengthened their ordering
to make everybody happier.

So if this is purely a RISC-V thing, then I think it's entirely reasonable to

spin_unlock(&r);
spin_lock(&s);

cannot be reordered.

Strict specifications are not a bad thing, and weak memory ordering is
not inherently good.

Linus

2021-09-09 07:52:14

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [tip:locking/core] tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire

On Wed, Sep 08, 2021 at 09:08:33AM -0700, Linus Torvalds wrote:

> So if this is purely a RISC-V thing,

Just to clarify, I think the current RISC-V thing is stronger than
PowerPC, but maybe not as strong as say ARM64, but RISC-V memory
ordering is still somewhat hazy to me.

Specifically, the sequence:

/* critical section s */
WRITE_ONCE(x, 1);
FENCE RW, W
WRITE_ONCE(s.lock, 0); /* store S */
AMOSWAP %0, 1, r.lock /* store R */
FENCE R, RW
WRITE_ONCE(y, 1);
/* critical section r */

fully separates section s from section r, as in RW->RW ordering
(possibly not as strong as smp_mb() though), while on PowerPC it would
only impose TSO ordering between sections.

The AMOSWAP is a RmW and as such matches the W from the RW->W fence,
similarly it matches the R from the R->RW fence, yielding an:

RW-> W
RmW
R ->RW

ordering. It's the stores S and R that can be re-ordered, but not the
sections themselves (same on PowerPC and many others).

Clarification from a RISC-V enabled person would be appreciated.

> then I think it's entirely reasonable to
>
> spin_unlock(&r);
> spin_lock(&s);
>
> cannot be reordered.

I'm obviously completely in favour of that :-)

2021-09-09 13:41:40

by Will Deacon

[permalink] [raw]
Subject: Re: [tip:locking/core] tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire

[+Palmer, PaulW, Daniel and Michael]

On Thu, Sep 09, 2021 at 09:25:30AM +0200, Peter Zijlstra wrote:
> On Wed, Sep 08, 2021 at 09:08:33AM -0700, Linus Torvalds wrote:
>
> > So if this is purely a RISC-V thing,
>
> Just to clarify, I think the current RISC-V thing is stonger than
> PowerPC, but maybe not as strong as say ARM64, but RISC-V memory
> ordering is still somewhat hazy to me.
>
> Specifically, the sequence:
>
> /* critical section s */
> WRITE_ONCE(x, 1);
> FENCE RW, W
> WRITE_ONCE(s.lock, 0); /* store S */
> AMOSWAP %0, 1, r.lock /* store R */
> FENCE R, RW
> WRITE_ONCE(y, 1);
> /* critical section r */
>
> fully separates section s from section r, as in RW->RW ordering
> (possibly not as strong as smp_mb() though), while on PowerPC it would
> only impose TSO ordering between sections.
>
> The AMOSWAP is a RmW and as such matches the W from the RW->W fence,
> similarly it marches the R from the R->RW fence, yielding an:
>
> RW-> W
> RmW
> R ->RW
>
> ordering. It's the stores S and R that can be re-ordered, but not the
> sections themselves (same on PowerPC and many others).
>
> Clarification from a RISC-V enabled person would be appreciated.
>
> > then I think it's entirely reasonable to
> >
> > spin_unlock(&r);
> > spin_lock(&s);
> >
> > cannot be reordered.
>
> I'm obviously completely in favour of that :-)

I don't think we should require the accesses to the actual lockwords to
be ordered here, as it becomes pretty onerous for relaxed LL/SC
architectures where you'd end up with an extra barrier either after the
unlock() or before the lock() operation. However, I remain absolutely in
favour of strengthening the ordering of the _critical sections_ guarded by
the locks to be RCsc.

Last time this came up, I think the RISC-V folks were generally happy to
implement whatever was necessary for Linux [1]. The thing that was stopping
us was Power (see CONFIG_ARCH_WEAK_RELEASE_ACQUIRE), wasn't it? I think
Michael saw quite a bit of variety in the impact on benchmarks [2] across
different machines. So the question is whether newer Power machines are less
affected to the degree that we could consider making this change again.

Will

[1] https://lore.kernel.org/lkml/[email protected]/
[2] https://lore.kernel.org/lkml/[email protected]/

2021-09-09 17:05:50

by Daniel Lustig

[permalink] [raw]
Subject: Re: [tip:locking/core] tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire

On 9/9/2021 9:35 AM, Will Deacon wrote:
> [+Palmer, PaulW, Daniel and Michael]
>
> On Thu, Sep 09, 2021 at 09:25:30AM +0200, Peter Zijlstra wrote:
>> On Wed, Sep 08, 2021 at 09:08:33AM -0700, Linus Torvalds wrote:
>>
>>> So if this is purely a RISC-V thing,
>>
>> Just to clarify, I think the current RISC-V thing is stonger than
>> PowerPC, but maybe not as strong as say ARM64, but RISC-V memory
>> ordering is still somewhat hazy to me.
>>
>> Specifically, the sequence:
>>
>> /* critical section s */
>> WRITE_ONCE(x, 1);
>> FENCE RW, W
>> WRITE_ONCE(s.lock, 0); /* store S */
>> AMOSWAP %0, 1, r.lock /* store R */
>> FENCE R, RW
>> WRITE_ONCE(y, 1);
>> /* critical section r */
>>
>> fully separates section s from section r, as in RW->RW ordering
>> (possibly not as strong as smp_mb() though), while on PowerPC it would
>> only impose TSO ordering between sections.
>>
>> The AMOSWAP is a RmW and as such matches the W from the RW->W fence,
>> similarly it marches the R from the R->RW fence, yielding an:
>>
>> RW-> W
>> RmW
>> R ->RW
>>
>> ordering. It's the stores S and R that can be re-ordered, but not the
>> sections themselves (same on PowerPC and many others).
>>
>> Clarification from a RISC-V enabled person would be appreciated.

To first order, RISC-V's memory model is very similar to ARMv8's. It
is "other-multi-copy-atomic", unlike Power, and respects dependencies.
It also has AMOs and LR/SC with optional RCsc acquire or release
semantics. There's no need to worry about RISC-V somehow pushing the
boundaries of weak memory ordering in new ways.

The tricky part is that unlike ARMv8, RISC-V doesn't have load-acquire
or store-release opcodes at all. Only AMOs and LR/SC have acquire or
release options. That means that while certain operations like swap
can be implemented with native RCsc semantics, others like store-release
have to fall back on fences and plain writes.

That's where the complexity came up last time this was discussed, at
least as it relates to RISC-V: how to make sure the combination of RCsc
atomics and plain operations+fences gives the semantics everyone is
asking for here. And to be clear there, I'm not asking for LKMM to
weaken anything about critical section ordering just for RISC-V's sake.
TSO/RCsc ordering between critical sections is a perfectly reasonable
model in my opinion. I just want to make sure RISC-V gets it right
given whatever the decision is.

>>> then I think it's entirely reasonable to
>>>
>>> spin_unlock(&r);
>>> spin_lock(&s);
>>>
>>> cannot be reordered.
>>
>> I'm obviously completely in favour of that :-)
>
> I don't think we should require the accesses to the actual lockwords to
> be ordered here, as it becomes pretty onerous for relaxed LL/SC
> architectures where you'd end up with an extra barrier either after the
> unlock() or before the lock() operation. However, I remain absolutely in
> favour of strengthening the ordering of the _critical sections_ guarded by
> the locks to be RCsc.

I agree with Will here. If the AMOSWAP above is actually implemented with
a RISC-V AMO, then the two critical sections will be separated as if RW,RW,
as Peter described. If instead it's implemented using LR/SC, then RISC-V
gives only TSO (R->R, R->W, W->W), because the two pieces of the AMO are
split, and that breaks the chain. Getting full RW->RW between the critical
sections would therefore require an extra fence. Also, the accesses to the
lockwords themselves would not be ordered without an extra fence.

> Last time this came up, I think the RISC-V folks were generally happy to
> implement whatever was necessary for Linux [1]. The thing that was stopping
> us was Power (see CONFIG_ARCH_WEAK_RELEASE_ACQUIRE), wasn't it? I think
> Michael saw quite a bit of variety in the impact on benchmarks [2] across
> different machines. So the question is whether newer Power machines are less
> affected to the degree that we could consider making this change again.

Yes, as I said above, RISC-V will implement what is needed to make this work.

Dan

> Will
>
> [1] https://lore.kernel.org/lkml/[email protected]/
> [2] https://lore.kernel.org/lkml/[email protected]/

2021-09-09 17:10:55

by Linus Torvalds

[permalink] [raw]
Subject: Re: [tip:locking/core] tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire

On Thu, Sep 9, 2021 at 6:35 AM Will Deacon <[email protected]> wrote:
>
> I don't think we should require the accesses to the actual lockwords to
> be ordered here, as it becomes pretty onerous for relaxed LL/SC
> architectures where you'd end up with an extra barrier either after the
> unlock() or before the lock() operation. However, I remain absolutely in
> favour of strengthening the ordering of the _critical sections_ guarded by
> the locks to be RCsc.

Ack. The actual locking operations themselves can obviously overlap,
it's what they protect that should be ordered if at all possible.

Because anything else will be too confusing for words, and if we have
to add memory barriers *and* locking we're just screwed.

Because I think it is entirely understandable for people to expect
that sequence of two locked regions to be ordered wrt each other.

While memory ordering is subtle and confusing, we should strive to
make our "..but I used locks" to be as straightforward and as
understandable to people who really really don't want to even think
about memory order as at all reasonable.

I think we should have a very strong reason for accepting unordered
locked regions (with "strong reason" being defined as "on this
architecture that is hugely important, anything else would slow down
locks enormously").

It sounds like no such architecture exists, much less is important.

Linus

2021-09-09 17:47:37

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [tip:locking/core] tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire

[+ Nick Piggin]

On Thu, Sep 09, 2021 at 02:35:36PM +0100, Will Deacon wrote:
> [+Palmer, PaulW, Daniel and Michael]
>
> On Thu, Sep 09, 2021 at 09:25:30AM +0200, Peter Zijlstra wrote:
> > On Wed, Sep 08, 2021 at 09:08:33AM -0700, Linus Torvalds wrote:
> >
> > > So if this is purely a RISC-V thing,
> >
> > Just to clarify, I think the current RISC-V thing is stonger than
> > PowerPC, but maybe not as strong as say ARM64, but RISC-V memory
> > ordering is still somewhat hazy to me.
> >
> > Specifically, the sequence:
> >
> > /* critical section s */
> > WRITE_ONCE(x, 1);
> > FENCE RW, W
> > WRITE_ONCE(s.lock, 0); /* store S */
> > AMOSWAP %0, 1, r.lock /* store R */
> > FENCE R, RW
> > WRITE_ONCE(y, 1);
> > /* critical section r */
> >
> > fully separates section s from section r, as in RW->RW ordering
> > (possibly not as strong as smp_mb() though), while on PowerPC it would
> > only impose TSO ordering between sections.
> >
> > The AMOSWAP is a RmW and as such matches the W from the RW->W fence,
> > similarly it marches the R from the R->RW fence, yielding an:
> >
> > RW-> W
> > RmW
> > R ->RW
> >
> > ordering. It's the stores S and R that can be re-ordered, but not the
> > sections themselves (same on PowerPC and many others).
> >
> > Clarification from a RISC-V enabled person would be appreciated.
> >
> > > then I think it's entirely reasonable to
> > >
> > > spin_unlock(&r);
> > > spin_lock(&s);
> > >
> > > cannot be reordered.
> >
> > I'm obviously completely in favour of that :-)
>
> I don't think we should require the accesses to the actual lockwords to
> be ordered here, as it becomes pretty onerous for relaxed LL/SC
> architectures where you'd end up with an extra barrier either after the
> unlock() or before the lock() operation. However, I remain absolutely in
> favour of strengthening the ordering of the _critical sections_ guarded by
> the locks to be RCsc.

If by this you mean the critical sections when observed only by other
critical sections for a given lock, then everyone is already there.

However...

> Last time this came up, I think the RISC-V folks were generally happy to
> implement whatever was necessary for Linux [1]. The thing that was stopping
> us was Power (see CONFIG_ARCH_WEAK_RELEASE_ACQUIRE), wasn't it? I think
> Michael saw quite a bit of variety in the impact on benchmarks [2] across
> different machines. So the question is whether newer Power machines are less
> affected to the degree that we could consider making this change again.

Last I knew, on Power a pair of critical sections for a given lock could
be observed out of order (writes from the earlier critical section vs.
reads from the later critical section), but only by CPUs not holding
that lock. Also last I knew, tightening this would require upgrading
some of the locking primitives' lwsync instructions to sync instructions.
But I know very little about Power 10.

Adding Nick on CC for his thoughts.

Thanx, Paul

> Will
>
> [1] https://lore.kernel.org/lkml/[email protected]/
> [2] https://lore.kernel.org/lkml/[email protected]/

2021-09-09 18:01:57

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [tip:locking/core] tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire

On Thu, Sep 09, 2021 at 01:03:18PM -0400, Dan Lustig wrote:
> On 9/9/2021 9:35 AM, Will Deacon wrote:
> > [+Palmer, PaulW, Daniel and Michael]
> >
> > On Thu, Sep 09, 2021 at 09:25:30AM +0200, Peter Zijlstra wrote:
> >> On Wed, Sep 08, 2021 at 09:08:33AM -0700, Linus Torvalds wrote:
> >>
> >>> So if this is purely a RISC-V thing,
> >>
> >> Just to clarify, I think the current RISC-V thing is stonger than
> >> PowerPC, but maybe not as strong as say ARM64, but RISC-V memory
> >> ordering is still somewhat hazy to me.
> >>
> >> Specifically, the sequence:
> >>
> >> /* critical section s */
> >> WRITE_ONCE(x, 1);
> >> FENCE RW, W
> >> WRITE_ONCE(s.lock, 0); /* store S */
> >> AMOSWAP %0, 1, r.lock /* store R */
> >> FENCE R, RW
> >> WRITE_ONCE(y, 1);
> >> /* critical section r */
> >>
> >> fully separates section s from section r, as in RW->RW ordering
> >> (possibly not as strong as smp_mb() though), while on PowerPC it would
> >> only impose TSO ordering between sections.
> >>
> >> The AMOSWAP is a RmW and as such matches the W from the RW->W fence,
> >> similarly it marches the R from the R->RW fence, yielding an:
> >>
> >> RW-> W
> >> RmW
> >> R ->RW
> >>
> >> ordering. It's the stores S and R that can be re-ordered, but not the
> >> sections themselves (same on PowerPC and many others).
> >>
> >> Clarification from a RISC-V enabled person would be appreciated.
>
> To first order, RISC-V's memory model is very similar to ARMv8's. It
> is "other-multi-copy-atomic", unlike Power, and respects dependencies.
> It also has AMOs and LR/SC with optional RCsc acquire or release
> semantics. There's no need to worry about RISC-V somehow pushing the
> boundaries of weak memory ordering in new ways.
>
> The tricky part is that unlike ARMv8, RISC-V doesn't have load-acquire
> or store-release opcodes at all. Only AMOs and LR/SC have acquire or
> release options. That means that while certain operations like swap
> can be implemented with native RCsc semantics, others like store-release
> have to fall back on fences and plain writes.
>
> That's where the complexity came up last time this was discussed, at
> least as it relates to RISC-V: how to make sure the combination of RCsc
> atomics and plain operations+fences gives the semantics everyone is
> asking for here. And to be clear there, I'm not asking for LKMM to
> weaken anything about critical section ordering just for RISC-V's sake.
> TSO/RCsc ordering between critical sections is a perfectly reasonable
> model in my opinion. I just want to make sure RISC-V gets it right
> given whatever the decision is.
>
> >>> then I think it's entirely reasonable to
> >>>
> >>> spin_unlock(&r);
> >>> spin_lock(&s);
> >>>
> >>> cannot be reordered.
> >>
> >> I'm obviously completely in favour of that :-)
> >
> > I don't think we should require the accesses to the actual lockwords to
> > be ordered here, as it becomes pretty onerous for relaxed LL/SC
> > architectures where you'd end up with an extra barrier either after the
> > unlock() or before the lock() operation. However, I remain absolutely in
> > favour of strengthening the ordering of the _critical sections_ guarded by
> > the locks to be RCsc.
>
> I agree with Will here. If the AMOSWAP above is actually implemented with
> a RISC-V AMO, then the two critical sections will be separated as if RW,RW,
> as Peter described. If instead it's implemented using LR/SC, then RISC-V
> gives only TSO (R->R, R->W, W->W), because the two pieces of the AMO are
> split, and that breaks the chain. Getting full RW->RW between the critical
> sections would therefore require an extra fence. Also, the accesses to the
> lockwords themselves would not be ordered without an extra fence.
>
> > Last time this came up, I think the RISC-V folks were generally happy to
> > implement whatever was necessary for Linux [1]. The thing that was stopping
> > us was Power (see CONFIG_ARCH_WEAK_RELEASE_ACQUIRE), wasn't it? I think
> > Michael saw quite a bit of variety in the impact on benchmarks [2] across
> > different machines. So the question is whether newer Power machines are less
> > affected to the degree that we could consider making this change again.
>
> Yes, as I said above, RISC-V will implement what is needed to make this work.

Boqun, I vaguely remember a suggested change from you along these lines,
but now I cannot find it. Could you please send it as a formal patch
if you have not already done so or point me at it if you have?

Thanx, Paul

2021-09-09 19:04:43

by Alan Stern

[permalink] [raw]
Subject: Re: [tip:locking/core] tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire

On Thu, Sep 09, 2021 at 10:02:13AM -0700, Linus Torvalds wrote:
> On Thu, Sep 9, 2021 at 6:35 AM Will Deacon <[email protected]> wrote:
> >
> > I don't think we should require the accesses to the actual lockwords to
> > be ordered here, as it becomes pretty onerous for relaxed LL/SC
> > architectures where you'd end up with an extra barrier either after the
> > unlock() or before the lock() operation. However, I remain absolutely in
> > favour of strengthening the ordering of the _critical sections_ guarded by
> > the locks to be RCsc.
>
> Ack. The actual locking operations themselves can obviously overlap,
> it's what they protect that should be ordered if at all possible.
>
> Because anything else will be too confusing for words, and if we have
> to add memory barriers *and* locking we're just screwed.
>
> Because I think it is entirely understandable for people to expect
> that sequence of two locked regions to be ordered wrt each other.
>
> While memory ordering is subtle and confusing, we should strive to
> make our "..but I used locks" to be as straightforward and as
> understandable to people who really really don't want to even think
> about memory order as at all reasonable.
>
> I think we should have a very strong reason for accepting unordered
> locked regions (with "strong reason" being defined as "on this
> architecture that is hugely important, anything else would slow down
> locks enormously").
>
> It sounds like no such architecture exists, much less is important.

All right.

Currently the memory model has RCtso ordering of accesses in separate
critical sections for the same lock, as observed by other CPUs not
holding the lock. Peter is proposing (and Linus agrees) to expand this
to cover critical sections for different locks, so long as both critical
sections are on the same CPU. But he didn't propose strengthening the
ordering to RCsc, and I presume we don't want to do this because of the
slowdown it would incur on Power.

Does everyone agree that this correctly summarizes the change to be made
to the memory model?

Alan
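
In terms of the linux-kernel.cat fragment quoted earlier in this thread,
one way to express that extension would be to accept program order as an
alternative to the reads-from link between the unlock and the lock.
This is only a sketch of the idea, with a made-up relation name, not a
reviewed or tested change:

	(* Sketch: relate an unlock to any po-later lock on the same CPU,
	   in addition to the existing reads-from case. *)
	let po-unlock-lock-po = po ; [UL] ; (po | rf) ; [LKR] ; po

	(* ...then use it where po-unlock-rf-lock-po appears today: *)
	let ppo = to-r | to-w | fence | (po-unlock-lock-po & int)
	let cumul-fence = A-cumul(strong-fence | po-rel) | wmb | po-unlock-lock-po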

2021-09-10 00:04:05

by Boqun Feng

[permalink] [raw]
Subject: Re: [tip:locking/core] tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire

On Thu, Sep 09, 2021 at 01:03:18PM -0400, Dan Lustig wrote:
> On 9/9/2021 9:35 AM, Will Deacon wrote:
> > [+Palmer, PaulW, Daniel and Michael]
> >
> > On Thu, Sep 09, 2021 at 09:25:30AM +0200, Peter Zijlstra wrote:
> >> On Wed, Sep 08, 2021 at 09:08:33AM -0700, Linus Torvalds wrote:
> >>
> >>> So if this is purely a RISC-V thing,
> >>
> >> Just to clarify, I think the current RISC-V thing is stonger than
> >> PowerPC, but maybe not as strong as say ARM64, but RISC-V memory
> >> ordering is still somewhat hazy to me.
> >>
> >> Specifically, the sequence:
> >>
> >> /* critical section s */
> >> WRITE_ONCE(x, 1);
> >> FENCE RW, W
> >> WRITE_ONCE(s.lock, 0); /* store S */
> >> AMOSWAP %0, 1, r.lock /* store R */
> >> FENCE R, RW
> >> WRITE_ONCE(y, 1);
> >> /* critical section r */
> >>
> >> fully separates section s from section r, as in RW->RW ordering
> >> (possibly not as strong as smp_mb() though), while on PowerPC it would
> >> only impose TSO ordering between sections.
> >>
> >> The AMOSWAP is a RmW and as such matches the W from the RW->W fence,
> >> similarly it marches the R from the R->RW fence, yielding an:
> >>
> >> RW-> W
> >> RmW
> >> R ->RW
> >>
> >> ordering. It's the stores S and R that can be re-ordered, but not the
> >> sections themselves (same on PowerPC and many others).
> >>
> >> Clarification from a RISC-V enabled person would be appreciated.
>
> To first order, RISC-V's memory model is very similar to ARMv8's. It
> is "other-multi-copy-atomic", unlike Power, and respects dependencies.
> It also has AMOs and LR/SC with optional RCsc acquire or release
> semantics. There's no need to worry about RISC-V somehow pushing the
> boundaries of weak memory ordering in new ways.
>
> The tricky part is that unlike ARMv8, RISC-V doesn't have load-acquire
> or store-release opcodes at all. Only AMOs and LR/SC have acquire or
> release options. That means that while certain operations like swap
> can be implemented with native RCsc semantics, others like store-release
> have to fall back on fences and plain writes.
>
> That's where the complexity came up last time this was discussed, at
> least as it relates to RISC-V: how to make sure the combination of RCsc
> atomics and plain operations+fences gives the semantics everyone is
> asking for here. And to be clear there, I'm not asking for LKMM to
> weaken anything about critical section ordering just for RISC-V's sake.
> TSO/RCsc ordering between critical sections is a perfectly reasonable
> model in my opinion. I just want to make sure RISC-V gets it right
> given whatever the decision is.
>
> >>> then I think it's entirely reasonable to
> >>>
> >>> spin_unlock(&r);
> >>> spin_lock(&s);
> >>>
> >>> cannot be reordered.
> >>
> >> I'm obviously completely in favour of that :-)
> >
> > I don't think we should require the accesses to the actual lockwords to
> > be ordered here, as it becomes pretty onerous for relaxed LL/SC
> > architectures where you'd end up with an extra barrier either after the
> > unlock() or before the lock() operation. However, I remain absolutely in
> > favour of strengthening the ordering of the _critical sections_ guarded by
> > the locks to be RCsc.
>
> I agree with Will here. If the AMOSWAP above is actually implemented with
> a RISC-V AMO, then the two critical sections will be separated as if RW,RW,
> as Peter described. If instead it's implemented using LR/SC, then RISC-V

Just out of curiosity, in the following code, can the store S and load L
be reordered?

WRITE_ONCE(x, 1); // store S
FENCE RW, W
WRITE_ONCE(s.lock, 0); // unlock(s)
AMOSWAP %0, 1, s.lock // lock(s)
FENCE R, RW
r1 = READ_ONCE(y); // load L

I think they can, because neither "FENCE RW, W" nor "FENCE R, RW" order
them. Note that the reordering is allowed in LKMM, because unlock-lock
only needs to be as strong as RCtso.

Moreover, how about the outcome of the following case:

{
r1, r2 are registers (variables) on each CPU, X, Y are memory
locations, and initialized as 0
}

CPU 0
=====
AMOSWAP r1, 1, X
FENCE R, RW
r2 = READ_ONCE(Y);

CPU 1
=====
WRITE_ONCE(Y, 1);
FENCE RW, RW
r2 = READ_ONCE(X);

can we observe the result where r2 on CPU0 is 0 while r2 on CPU1 is 1?

Regards,
Boqun

> gives only TSO (R->R, R->W, W->W), because the two pieces of the AMO are
> split, and that breaks the chain. Getting full RW->RW between the critical
> sections would therefore require an extra fence. Also, the accesses to the
> lockwords themselves would not be ordered without an extra fence.
>
> > Last time this came up, I think the RISC-V folks were generally happy to
> > implement whatever was necessary for Linux [1]. The thing that was stopping
> > us was Power (see CONFIG_ARCH_WEAK_RELEASE_ACQUIRE), wasn't it? I think
> > Michael saw quite a bit of variety in the impact on benchmarks [2] across
> > different machines. So the question is whether newer Power machines are less
> > affected to the degree that we could consider making this change again.
>
> Yes, as I said above, RISC-V will implement what is needed to make this work.
>
> Dan
>
> > Will
> >
> > [1] https://lore.kernel.org/lkml/[email protected]/
> > [2] https://lore.kernel.org/lkml/[email protected]/

2021-09-10 05:40:19

by Boqun Feng

[permalink] [raw]
Subject: Re: [tip:locking/core] tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire

On Fri, Sep 10, 2021 at 08:01:14AM +0800, Boqun Feng wrote:
> On Thu, Sep 09, 2021 at 01:03:18PM -0400, Dan Lustig wrote:
> > On 9/9/2021 9:35 AM, Will Deacon wrote:
> > > [+Palmer, PaulW, Daniel and Michael]
> > >
> > > On Thu, Sep 09, 2021 at 09:25:30AM +0200, Peter Zijlstra wrote:
> > >> On Wed, Sep 08, 2021 at 09:08:33AM -0700, Linus Torvalds wrote:
> > >>
> > >>> So if this is purely a RISC-V thing,
> > >>
> > >> Just to clarify, I think the current RISC-V thing is stonger than
> > >> PowerPC, but maybe not as strong as say ARM64, but RISC-V memory
> > >> ordering is still somewhat hazy to me.
> > >>
> > >> Specifically, the sequence:
> > >>
> > >> /* critical section s */
> > >> WRITE_ONCE(x, 1);
> > >> FENCE RW, W
> > >> WRITE_ONCE(s.lock, 0); /* store S */
> > >> AMOSWAP %0, 1, r.lock /* store R */
> > >> FENCE R, RW
> > >> WRITE_ONCE(y, 1);
> > >> /* critical section r */
> > >>
> > >> fully separates section s from section r, as in RW->RW ordering
> > >> (possibly not as strong as smp_mb() though), while on PowerPC it would
> > >> only impose TSO ordering between sections.
> > >>
> > >> The AMOSWAP is a RmW and as such matches the W from the RW->W fence,
> > >> similarly it marches the R from the R->RW fence, yielding an:
> > >>
> > >> RW-> W
> > >> RmW
> > >> R ->RW
> > >>
> > >> ordering. It's the stores S and R that can be re-ordered, but not the
> > >> sections themselves (same on PowerPC and many others).
> > >>
> > >> Clarification from a RISC-V enabled person would be appreciated.
> >
> > To first order, RISC-V's memory model is very similar to ARMv8's. It
> > is "other-multi-copy-atomic", unlike Power, and respects dependencies.
> > It also has AMOs and LR/SC with optional RCsc acquire or release
> > semantics. There's no need to worry about RISC-V somehow pushing the
> > boundaries of weak memory ordering in new ways.
> >
> > The tricky part is that unlike ARMv8, RISC-V doesn't have load-acquire
> > or store-release opcodes at all. Only AMOs and LR/SC have acquire or
> > release options. That means that while certain operations like swap
> > can be implemented with native RCsc semantics, others like store-release
> > have to fall back on fences and plain writes.
> >
> > That's where the complexity came up last time this was discussed, at
> > least as it relates to RISC-V: how to make sure the combination of RCsc
> > atomics and plain operations+fences gives the semantics everyone is
> > asking for here. And to be clear there, I'm not asking for LKMM to
> > weaken anything about critical section ordering just for RISC-V's sake.
> > TSO/RCsc ordering between critical sections is a perfectly reasonable
> > model in my opinion. I just want to make sure RISC-V gets it right
> > given whatever the decision is.
> >
> > >>> then I think it's entirely reasonable to
> > >>>
> > >>> spin_unlock(&r);
> > >>> spin_lock(&s);
> > >>>
> > >>> cannot be reordered.
> > >>
> > >> I'm obviously completely in favour of that :-)
> > >
> > > I don't think we should require the accesses to the actual lockwords to
> > > be ordered here, as it becomes pretty onerous for relaxed LL/SC
> > > architectures where you'd end up with an extra barrier either after the
> > > unlock() or before the lock() operation. However, I remain absolutely in
> > > favour of strengthening the ordering of the _critical sections_ guarded by
> > > the locks to be RCsc.
> >
> > I agree with Will here. If the AMOSWAP above is actually implemented with
> > a RISC-V AMO, then the two critical sections will be separated as if RW,RW,
> > as Peter described. If instead it's implemented using LR/SC, then RISC-V
>
> Just out of curiosity, in the following code, can the store S and load L
> be reordered?
>
> WRITE_ONCE(x, 1); // store S
> FENCE RW, W
> WRITE_ONCE(s.lock, 0); // unlock(s)
> AMOSWAP %0, 1, s.lock // lock(s)
> FENCE R, RW
> r1 = READ_ONCE(y); // load L
>
> I think they can, because neither "FENCE RW, W" nor "FENCE R, RW" order
> them. Note that the reordering is allowed in LKMM, because unlock-lock
> only need to be as strong as RCtso.
>
> Moreover, how about the outcome of the following case:
>
> {
> r1, r2 are registers (variables) on each CPU, X, Y are memory
> locations, and initialized as 0
> }
>
> CPU 0
> =====
> AMOSWAP r1, 1, X
> FENCE R, RW
> r2 = READ_ONCE(Y);
>
> CPU 1
> =====
> WRITE_ONCE(Y, 1);
> FENCE RW, RW
> r2 = READ_ONCE(X);
>
> can we observe the result where r2 on CPU0 is 0 while r2 on CPU1 is 1?
>

As Andrea reminded me, what I meant to ask here is:

can we observe the result where r2 on CPU0 is 0 while r2 on CPU1 is 0?

Regards,
Boqun

> Regards,
> Boqun
>
> > gives only TSO (R->R, R->W, W->W), because the two pieces of the AMO are
> > split, and that breaks the chain. Getting full RW->RW between the critical
> > sections would therefore require an extra fence. Also, the accesses to the
> > lockwords themselves would not be ordered without an extra fence.
> >
> > > Last time this came up, I think the RISC-V folks were generally happy to
> > > implement whatever was necessary for Linux [1]. The thing that was stopping
> > > us was Power (see CONFIG_ARCH_WEAK_RELEASE_ACQUIRE), wasn't it? I think
> > > Michael saw quite a bit of variety in the impact on benchmarks [2] across
> > > different machines. So the question is whether newer Power machines are less
> > > affected to the degree that we could consider making this change again.
> >
> > Yes, as I said above, RISC-V will implement what is needed to make this work.
> >
> > Dan
> >
> > > Will
> > >
> > > [1] https://lore.kernel.org/lkml/[email protected]/
> > > [2] https://lore.kernel.org/lkml/[email protected]/

2021-09-10 09:37:27

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [tip:locking/core] tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire

On Fri, Sep 10, 2021 at 08:01:14AM +0800, Boqun Feng wrote:
> On Thu, Sep 09, 2021 at 01:03:18PM -0400, Dan Lustig wrote:
> > On 9/9/2021 9:35 AM, Will Deacon wrote:
> > > On Thu, Sep 09, 2021 at 09:25:30AM +0200, Peter Zijlstra wrote:

> > >> The AMOSWAP is a RmW and as such matches the W from the RW->W fence,
> > >> similarly it marches the R from the R->RW fence, yielding an:
> > >>
> > >> RW-> W
> > >> RmW
> > >> R ->RW
> > >>
> > >> ordering. It's the stores S and R that can be re-ordered, but not the
> > >> sections themselves (same on PowerPC and many others).

> > I agree with Will here. If the AMOSWAP above is actually implemented with
> > a RISC-V AMO, then the two critical sections will be separated as if RW,RW,
> > as Peter described. If instead it's implemented using LR/SC, then RISC-V
>
> Just out of curiosity, in the following code, can the store S and load L
> be reordered?
>
> WRITE_ONCE(x, 1); // store S
> FENCE RW, W
> WRITE_ONCE(s.lock, 0); // unlock(s)
> AMOSWAP %0, 1, s.lock // lock(s)
> FENCE R, RW
> r1 = READ_ONCE(y); // load L
>
> I think they can, because neither "FENCE RW, W" nor "FENCE R, RW" order
> them.

I'm confused by your argument, per the above quoted section, those
fences and the AMO combine into a RW,RW ordering which is (as per the
later clarification) multi-copy-atomic, aka smp_mb().

As such, S and L are not allowed to be re-ordered in the given scenario.

> Note that the reordering is allowed in LKMM, because unlock-lock
> only need to be as strong as RCtso.

Risc-V is strictly stronger than required in this instance, given the
current lock implementation. Daniel pointed out that if the atomic op
were LL/SC based instead of AMO it would end up being RCtso.

2021-09-10 10:07:45

by Boqun Feng

[permalink] [raw]
Subject: Re: [tip:locking/core] tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire

On Fri, Sep 10, 2021 at 11:33:25AM +0200, Peter Zijlstra wrote:
> On Fri, Sep 10, 2021 at 08:01:14AM +0800, Boqun Feng wrote:
> > On Thu, Sep 09, 2021 at 01:03:18PM -0400, Dan Lustig wrote:
> > > On 9/9/2021 9:35 AM, Will Deacon wrote:
> > > > On Thu, Sep 09, 2021 at 09:25:30AM +0200, Peter Zijlstra wrote:
>
> > > >> The AMOSWAP is a RmW and as such matches the W from the RW->W fence,
> > > >> similarly it marches the R from the R->RW fence, yielding an:
> > > >>
> > > >> RW-> W
> > > >> RmW
> > > >> R ->RW
> > > >>
> > > >> ordering. It's the stores S and R that can be re-ordered, but not the
> > > >> sections themselves (same on PowerPC and many others).
>
> > > I agree with Will here. If the AMOSWAP above is actually implemented with
> > > a RISC-V AMO, then the two critical sections will be separated as if RW,RW,
> > > as Peter described. If instead it's implemented using LR/SC, then RISC-V
> >
> > Just out of curiosity, in the following code, can the store S and load L
> > be reordered?
> >
> > WRITE_ONCE(x, 1); // store S
> > FENCE RW, W
> > WRITE_ONCE(s.lock, 0); // unlock(s)
> > AMOSWAP %0, 1, s.lock // lock(s)
> > FENCE R, RW
> > r1 = READ_ONCE(y); // load L
> >
> > I think they can, because neither "FENCE RW, W" nor "FENCE R, RW" order
> > them.
>
> I'm confused by your argument, per the above quoted section, those
> fences and the AMO combine into a RW,RW ordering which is (as per the
> later clarification) multi-copy-atomic, aka smp_mb().
>

Right, my question is more about the reasoning for why fence rw,w +
AMO + fence r,rw acts as a fence rw,rw. Another related question: can
fence rw,w + store + fence w,rw act as a fence rw,rw by similar
reasoning? IOW, will the two loads in the following be reordered?

r1 = READ_ONCE(x);
FENCE RW, W
WRITE_ONCE(z, 1);
FENCE W, RW
r2 = READ_ONCE(y);

again, this is more a question out of curiosity, not that I find this
pattern useful.

Regards,
Boqun

> As such, S and L are not allowed to be re-ordered in the given scenario.
>
> > Note that the reordering is allowed in LKMM, because unlock-lock
> > only need to be as strong as RCtso.
>
> Risc-V is strictly stronger than required in this instance. Given the
> current lock implementation. Daniel pointed out that if the atomic op
> were LL/SC based instead of AMO it would end up being RCtso.
>

2021-09-10 11:09:58

by Will Deacon

[permalink] [raw]
Subject: Re: [tip:locking/core] tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire

Hi Paul,

On Thu, Sep 09, 2021 at 10:46:35AM -0700, Paul E. McKenney wrote:
> On Thu, Sep 09, 2021 at 02:35:36PM +0100, Will Deacon wrote:
> > On Thu, Sep 09, 2021 at 09:25:30AM +0200, Peter Zijlstra wrote:
> > > On Wed, Sep 08, 2021 at 09:08:33AM -0700, Linus Torvalds wrote:
> > > > then I think it's entirely reasonable to
> > > >
> > > > spin_unlock(&r);
> > > > spin_lock(&s);
> > > >
> > > > cannot be reordered.
> > >
> > > I'm obviously completely in favour of that :-)
> >
> > I don't think we should require the accesses to the actual lockwords to
> > be ordered here, as it becomes pretty onerous for relaxed LL/SC
> > architectures where you'd end up with an extra barrier either after the
> > unlock() or before the lock() operation. However, I remain absolutely in
> > favour of strengthening the ordering of the _critical sections_ guarded by
> > the locks to be RCsc.
>
> If by this you mean the critical sections when observed only by other
> critical sections for a given lock, then everyone is already there.

No, I mean the case where somebody without the lock (but using memory
barriers) can observe the critical sections out of order (i.e. W -> R
order is not maintained).

> However...
>
> > Last time this came up, I think the RISC-V folks were generally happy to
> > implement whatever was necessary for Linux [1]. The thing that was stopping
> > us was Power (see CONFIG_ARCH_WEAK_RELEASE_ACQUIRE), wasn't it? I think
> > Michael saw quite a bit of variety in the impact on benchmarks [2] across
> > different machines. So the question is whether newer Power machines are less
> > affected to the degree that we could consider making this change again.
>
> Last I knew, on Power a pair of critical sections for a given lock could
> be observed out of order (writes from the earlier critical section vs.
> reads from the later critical section), but only by CPUs not holding
> that lock. Also last I knew, tightening this would require upgrading
> some of the locking primitives' lwsync instructions to sync instructions.
> But I know very little about Power 10.

Yup, that's the one. This is the primary reason why we have the confusing
"RCtso" model today so this is my periodic "Do we still need this?" poking
for the Power folks :)

If the SYNC is a disaster for Power, then I'll ask again in another ~3 years
time in the hope that newer micro-architectures can swallow the instruction
more easily, but the results last time weren't hugely compelling and so _if_
there's an opportunity to make locking more "obvious" then I'm all for it.

Will

2021-09-10 13:49:56

by Daniel Lustig

[permalink] [raw]
Subject: Re: [tip:locking/core] tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire

On 9/10/2021 6:04 AM, Boqun Feng wrote:
> On Fri, Sep 10, 2021 at 11:33:25AM +0200, Peter Zijlstra wrote:
>> On Fri, Sep 10, 2021 at 08:01:14AM +0800, Boqun Feng wrote:
>>> On Thu, Sep 09, 2021 at 01:03:18PM -0400, Dan Lustig wrote:
>>>> On 9/9/2021 9:35 AM, Will Deacon wrote:
>>>>> On Thu, Sep 09, 2021 at 09:25:30AM +0200, Peter Zijlstra wrote:
>>
>>>>>> The AMOSWAP is a RmW and as such matches the W from the RW->W fence,
>>>>>> similarly it marches the R from the R->RW fence, yielding an:
>>>>>>
>>>>>> RW-> W
>>>>>> RmW
>>>>>> R ->RW
>>>>>>
>>>>>> ordering. It's the stores S and R that can be re-ordered, but not the
>>>>>> sections themselves (same on PowerPC and many others).
>>
>>>> I agree with Will here. If the AMOSWAP above is actually implemented with
>>>> a RISC-V AMO, then the two critical sections will be separated as if RW,RW,
>>>> as Peter described. If instead it's implemented using LR/SC, then RISC-V
>>>
>>> Just out of curiosity, in the following code, can the store S and load L
>>> be reordered?
>>>
>>> WRITE_ONCE(x, 1); // store S
>>> FENCE RW, W
>>> WRITE_ONCE(s.lock, 0); // unlock(s)
>>> AMOSWAP %0, 1, s.lock // lock(s)
>>> FENCE R, RW
>>> r1 = READ_ONCE(y); // load L
>>>
>>> I think they can, because neither "FENCE RW, W" nor "FENCE R, RW" order
>>> them.
>>
>> I'm confused by your argument, per the above quoted section, those
>> fences and the AMO combine into a RW,RW ordering which is (as per the
>> later clarification) multi-copy-atomic, aka smp_mb().
>>
>
> Right, my question is more about the reasoning about why fence rw,w +
> AMO + fence r,rw act as a fence rw,rw.

Is this a RISC-V question? If so, it's as simple as:
1) S and anything earlier are ordered before the AMO by the first fence
2) L and anything later are ordered after the AMO by the second fence
3) 1 + 2 = S and anything earlier are ordered before L or anything later

Since RISC-V is multi-copy atomic, 1+2 just naturally compose
transitively.
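
Restating Boqun's sequence from above with that reasoning annotated
(comments added here for illustration; this is the same mix of C macros
and RISC-V pseudo-assembly used earlier in the thread, not compilable
code):

	WRITE_ONCE(x, 1);	// store S
	FENCE RW, W		// 1) S and anything earlier are ordered
				//    before the AMO's write
	WRITE_ONCE(s.lock, 0);	// unlock(s)
	AMOSWAP %0, 1, s.lock	// lock(s): one atomic read-modify-write
	FENCE R, RW		// 2) L and anything later are ordered
				//    after the AMO's read
	r1 = READ_ONCE(y);	// load L
				// 3) the AMO is a single atomic operation
				//    and RISC-V is multi-copy atomic, so
				//    1 + 2 compose: S is ordered before L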

> Another related question, can
> fence rw,w + store + fence w,rw act as a fence rw,rw by the similar
> reasoning? IOW, will the two loads in the following be reordered?
>
> r1 = READ_ONCE(x);
> FENCE RW, W
> WRITE_ONCE(z, 1);
> FENCE W, RW
> r2 = READ_ONCE(y);
>
> again, this is more like a question out of curiosity, not that I find
> this pattern is useful.

Does FENCE W,RW appear in some actual use case? But yes, if it does
appear, this sequence would also act as a FENCE RW,RW on RISC-V.

Dan

> Regards,
> Boqun
>
>> As such, S and L are not allowed to be re-ordered in the given scenario.
>>
>>> Note that the reordering is allowed in LKMM, because unlock-lock
>>> only need to be as strong as RCtso.
>>
>> RISC-V is strictly stronger than required in this instance. Given the
>> current lock implementation. Daniel pointed out that if the atomic op
>> were LL/SC based instead of AMO it would end up being RCtso.
>>

2021-09-10 14:22:06

by Boqun Feng

[permalink] [raw]
Subject: Re: [tip:locking/core] tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire

On Fri, Sep 10, 2021 at 09:48:55AM -0400, Dan Lustig wrote:
> On 9/10/2021 6:04 AM, Boqun Feng wrote:
> > On Fri, Sep 10, 2021 at 11:33:25AM +0200, Peter Zijlstra wrote:
> >> On Fri, Sep 10, 2021 at 08:01:14AM +0800, Boqun Feng wrote:
> >>> On Thu, Sep 09, 2021 at 01:03:18PM -0400, Dan Lustig wrote:
> >>>> On 9/9/2021 9:35 AM, Will Deacon wrote:
> >>>>> On Thu, Sep 09, 2021 at 09:25:30AM +0200, Peter Zijlstra wrote:
> >>
> >>>>>> The AMOSWAP is a RmW and as such matches the W from the RW->W fence,
> >>>>>> similarly it matches the R from the R->RW fence, yielding an:
> >>>>>>
> >>>>>> RW-> W
> >>>>>> RmW
> >>>>>> R ->RW
> >>>>>>
> >>>>>> ordering. It's the stores S and R that can be re-ordered, but not the
> >>>>>> sections themselves (same on PowerPC and many others).
> >>
> >>>> I agree with Will here. If the AMOSWAP above is actually implemented with
> >>>> a RISC-V AMO, then the two critical sections will be separated as if RW,RW,
> >>>> as Peter described. If instead it's implemented using LR/SC, then RISC-V
> >>>
> >>> Just out of curiosity, in the following code, can the store S and load L
> >>> be reordered?
> >>>
> >>> WRITE_ONCE(x, 1); // store S
> >>> FENCE RW, W
> >>> WRITE_ONCE(s.lock, 0); // unlock(s)
> >>> AMOSWAP %0, 1, s.lock // lock(s)
> >>> FENCE R, RW
> >>> r1 = READ_ONCE(y); // load L
> >>>
> >>> I think they can, because neither "FENCE RW, W" nor "FENCE R, RW" order
> >>> them.
> >>
> >> I'm confused by your argument, per the above quoted section, those
> >> fences and the AMO combine into a RW,RW ordering which is (as per the
> >> later clarification) multi-copy-atomic, aka smp_mb().
> >>
> >
> > Right, my question is more about the reasoning about why fence rw,w +
> > AMO + fence r,rw act as a fence rw,rw.
>
> Is this a RISC-V question? If so, it's as simple as:

Yep, and thanks for the answer.

> 1) S and anything earlier are ordered before the AMO by the first fence
> 2) L and anything later are ordered after the AMO by the second fence
> 3) 1 + 2 = S and anything earlier are ordered before L or anything later
>
> Since RISC-V is multi-copy atomic, 1 and 2 naturally compose
> transitively.
>
> > Another related question, can
> > fence rw,w + store + fence w,rw act as a fence rw,rw by the similar
> > reasoning? IOW, will the two loads in the following be reordered?
> >
> > r1 = READ_ONCE(x);
> > FENCE RW, W
> > WRITE_ONCE(z, 1);
> > FENCE W, RW
> > r2 = READ_ONCE(y);
> >
> > again, this is more like a question out of curiosity, not that I find
> > this pattern is useful.
>
> Does FENCE W,RW appear in some actual use case? But yes, if it does

I'm not aware of any, but that's probably because no other arch can order
write->read without a full barrier (or release+acquire if RCsc). We have
a few patterns in the kernel where we only want to order write->read, and
smp_mb()s are used; if on RISCV FENCE W,R is cheaper than FENCE RW,RW,
then *in theory* we could have smp_wrmb() implemented as FENCE W,R on
RISCV and as smp_mb() on other archs.

/me run

And I'm sure there are cases where we use smp_mb() but only
write->{read,write} is supposed to be ordered, so there may be a use case
for the same reason.

I'm not proposing doing anything, just saying we don't use FENCE W,RW
because there is no equivalent concept in other archs, so it's not
modeled by an API. Besides, it may not be cheaper than FENCE RW,RW on
RISCV.

Regards,
Boqun

> appear, this sequence would also act as a FENCE RW,RW on RISC-V.
>
> Dan
>
> > Regards,
> > Boqun
> >
> >> As such, S and L are not allowed to be re-ordered in the given scenario.
> >>
> >>> Note that the reordering is allowed in LKMM, because unlock-lock
> >>> only need to be as strong as RCtso.
> >>
> >> RISC-V is strictly stronger than required in this instance. Given the
> >> current lock implementation. Daniel pointed out that if the atomic op
> >> were LL/SC based instead of AMO it would end up being RCtso.
> >>
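
Boqun's "in theory" smp_wrmb() above might look something like the
following sketch. This is purely hypothetical (no such API exists in the
kernel), and it assumes FENCE W,R really would be cheaper on RISC-V:

	/* Hypothetical write->read barrier along the lines sketched above. */
	#ifdef CONFIG_RISCV
	#define smp_wrmb()	__asm__ __volatile__("fence w,r" ::: "memory")
	#else
	#define smp_wrmb()	smp_mb()	/* full barrier everywhere else */
	#endif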

2021-09-10 14:23:05

by Boqun Feng

[permalink] [raw]
Subject: Re: [tip:locking/core] tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire

On Thu, Sep 09, 2021 at 11:00:05AM -0700, Paul E. McKenney wrote:
[...]
>
> Boqun, I vaguely remember a suggested change from you along these lines,
> but now I cannot find it. Could you please send it as a formal patch
> if you have not already done so or point me at it if you have?
>

Here is a draft patch based on the change I did when I discussed with
Peter, and I really want to hear Alan's thought first. Ideally, we
should also have related litmus tests and send to linux-arch list so
that we know the ordering is provided by every architecture.

Regards,
Boqun

--------------------------------->8
Subject: [PATCH] tools/memory-model: Provide extra ordering for
lock-{release,acquire} on the same CPU

A recent discussion[1] shows that we are in favor of strengthening the
ordering of lock-release + lock-acquire on the same CPU: a lock-release
and a po-after lock-acquire should provide the so-called RCtso ordering,
that is a memory access S po-before the lock-release should be ordered
against a memory access R po-after the lock-acquire, unless S is a store
and R is a load.

The strengthening meets programmers' expectation that "sequence of two
locked regions to be ordered wrt each other" (from Linus), and can
reduce the mental burden when using locks. Therefore add it in LKMM.

[1]: https://lore.kernel.org/lkml/[email protected]/

Signed-off-by: Boqun Feng <[email protected]>
---
.../Documentation/explanation.txt | 28 +++++++++++++++++++
tools/memory-model/linux-kernel.cat | 6 ++--
2 files changed, 31 insertions(+), 3 deletions(-)

diff --git a/tools/memory-model/Documentation/explanation.txt b/tools/memory-model/Documentation/explanation.txt
index 5d72f3112e56..d62de21f32c4 100644
--- a/tools/memory-model/Documentation/explanation.txt
+++ b/tools/memory-model/Documentation/explanation.txt
@@ -1847,6 +1847,34 @@ therefore the load of x must execute before the load of y. Thus we
cannot have r1 = 1 and r2 = 0 at the end (this is an instance of the
MP pattern).

+This requirement also applies to a lock-release and a lock-acquire
+on different locks, as long as the lock-acquire is po-after the
+lock-release. Note that "po-after" means the lock-acquire and the
+lock-release are on the same CPU. An example similar to the above:
+
+ int x, y;
+ spinlock_t s;
+ spinlock_t t;
+
+ P0()
+ {
+ int r1, r2;
+
+ spin_lock(&s);
+ r1 = READ_ONCE(x);
+ spin_unlock(&s);
+ spin_lock(&t);
+ r2 = READ_ONCE(y);
+ spin_unlock(&t);
+ }
+
+ P1()
+ {
+ WRITE_ONCE(y, 1);
+ smp_wmb();
+ WRITE_ONCE(x, 1);
+ }
+
This requirement does not apply to ordinary release and acquire
fences, only to lock-related operations. For instance, suppose P0()
in the example had been written as:
diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
index 2a9b4fe4a84e..d70315fddef6 100644
--- a/tools/memory-model/linux-kernel.cat
+++ b/tools/memory-model/linux-kernel.cat
@@ -27,7 +27,7 @@ include "lock.cat"
(* Release Acquire *)
let acq-po = [Acquire] ; po ; [M]
let po-rel = [M] ; po ; [Release]
-let po-unlock-rf-lock-po = po ; [UL] ; rf ; [LKR] ; po
+let po-unlock-lock-po = po ; [UL] ; (po|rf) ; [LKR] ; po

(* Fences *)
let R4rmb = R \ Noreturn (* Reads for which rmb works *)
@@ -70,12 +70,12 @@ let rwdep = (dep | ctrl) ; [W]
let overwrite = co | fr
let to-w = rwdep | (overwrite & int) | (addr ; [Plain] ; wmb)
let to-r = addr | (dep ; [Marked] ; rfi)
-let ppo = to-r | to-w | fence | (po-unlock-rf-lock-po & int)
+let ppo = to-r | to-w | fence | (po-unlock-lock-po & int)

(* Propagation: Ordering from release operations and strong fences. *)
let A-cumul(r) = (rfe ; [Marked])? ; r
let cumul-fence = [Marked] ; (A-cumul(strong-fence | po-rel) | wmb |
- po-unlock-rf-lock-po) ; [Marked]
+ po-unlock-lock-po) ; [Marked]
let prop = [Marked] ; (overwrite & ext)? ; cumul-fence* ;
[Marked] ; rfe? ; [Marked]

--
2.32.0
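
For reference, the two-lock example added to explanation.txt above can be
written as a litmus test along the following lines (a sketch; the test
name is illustrative). Per the patch description, the updated model
should forbid the outcome (Never), while the current model allows it
(Sometimes):

C two-locks-one-cpu-rr

(*
 * Both critical sections run on the same CPU but use different locks,
 * as in the explanation.txt example above.
 *)

{}

P0(spinlock_t *s, spinlock_t *t, int *x, int *y)
{
	int r1;
	int r2;

	spin_lock(s);
	r1 = READ_ONCE(*x);
	spin_unlock(s);
	spin_lock(t);
	r2 = READ_ONCE(*y);
	spin_unlock(t);
}

P1(int *x, int *y)
{
	WRITE_ONCE(*y, 1);
	smp_wmb();
	WRITE_ONCE(*x, 1);
}

exists (0:r1=1 /\ 0:r2=0)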

2021-09-10 15:35:03

by Palmer Dabbelt

[permalink] [raw]
Subject: Re: [tip:locking/core] tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire

On Fri, 10 Sep 2021 07:20:13 PDT (-0700), [email protected] wrote:
> On Thu, Sep 09, 2021 at 11:00:05AM -0700, Paul E. McKenney wrote:
> [...]
>>
>> Boqun, I vaguely remember a suggested change from you along these lines,
>> but now I cannot find it. Could you please send it as a formal patch
>> if you have not already done so or point me at it if you have?
>>
>
> Here is a draft patch based on the change I did when I discussed with
> Peter, and I really want to hear Alan's thought first. Ideally, we
> should also have related litmus tests and send to linux-arch list so
> that we know the ordering is provided by every architecture.
>
> Regards,
> Boqun
>
> --------------------------------->8
> Subject: [PATCH] tools/memory-model: Provide extra ordering for
> lock-{release,acquire} on the same CPU
>
> A recent discussion[1] shows that we are in favor of strengthening the
> ordering of lock-release + lock-acquire on the same CPU: a lock-release
> and a po-after lock-acquire should provide the so-called RCtso ordering,
> that is a memory access S po-before the lock-release should be ordered
> against a memory access R po-after the lock-acquire, unless S is a store
> and R is a load.
>
> The strengthening meets programmers' expectation that "sequence of two
> locked regions to be ordered wrt each other" (from Linus), and can
> reduce the mental burden when using locks. Therefore add it in LKMM.
>
> [1]: https://lore.kernel.org/lkml/[email protected]/
>
> Signed-off-by: Boqun Feng <[email protected]>
> ---
> .../Documentation/explanation.txt | 28 +++++++++++++++++++
> tools/memory-model/linux-kernel.cat | 6 ++--
> 2 files changed, 31 insertions(+), 3 deletions(-)
>
> diff --git a/tools/memory-model/Documentation/explanation.txt b/tools/memory-model/Documentation/explanation.txt
> index 5d72f3112e56..d62de21f32c4 100644
> --- a/tools/memory-model/Documentation/explanation.txt
> +++ b/tools/memory-model/Documentation/explanation.txt
> @@ -1847,6 +1847,34 @@ therefore the load of x must execute before the load of y. Thus we
> cannot have r1 = 1 and r2 = 0 at the end (this is an instance of the
> MP pattern).
>
> +This requirement also applies to a lock-release and a lock-acquire
> +on different locks, as long as the lock-acquire is po-after the
> +lock-release. Note that "po-after" means the lock-acquire and the
> +lock-release are on the same CPU. An example similar to the above:
> +
> + int x, y;
> + spinlock_t s;
> + spinlock_t t;
> +
> + P0()
> + {
> + int r1, r2;
> +
> + spin_lock(&s);
> + r1 = READ_ONCE(x);
> + spin_unlock(&s);
> + spin_lock(&t);
> + r2 = READ_ONCE(y);
> + spin_unlock(&t);
> + }
> +
> + P1()
> + {
> + WRITE_ONCE(y, 1);
> + smp_wmb();
> + WRITE_ONCE(x, 1);
> + }
> +
> This requirement does not apply to ordinary release and acquire
> fences, only to lock-related operations. For instance, suppose P0()
> in the example had been written as:
> diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
> index 2a9b4fe4a84e..d70315fddef6 100644
> --- a/tools/memory-model/linux-kernel.cat
> +++ b/tools/memory-model/linux-kernel.cat
> @@ -27,7 +27,7 @@ include "lock.cat"
> (* Release Acquire *)
> let acq-po = [Acquire] ; po ; [M]
> let po-rel = [M] ; po ; [Release]
> -let po-unlock-rf-lock-po = po ; [UL] ; rf ; [LKR] ; po
> +let po-unlock-lock-po = po ; [UL] ; (po|rf) ; [LKR] ; po
>
> (* Fences *)
> let R4rmb = R \ Noreturn (* Reads for which rmb works *)
> @@ -70,12 +70,12 @@ let rwdep = (dep | ctrl) ; [W]
> let overwrite = co | fr
> let to-w = rwdep | (overwrite & int) | (addr ; [Plain] ; wmb)
> let to-r = addr | (dep ; [Marked] ; rfi)
> -let ppo = to-r | to-w | fence | (po-unlock-rf-lock-po & int)
> +let ppo = to-r | to-w | fence | (po-unlock-lock-po & int)
>
> (* Propagation: Ordering from release operations and strong fences. *)
> let A-cumul(r) = (rfe ; [Marked])? ; r
> let cumul-fence = [Marked] ; (A-cumul(strong-fence | po-rel) | wmb |
> - po-unlock-rf-lock-po) ; [Marked]
> + po-unlock-lock-po) ; [Marked]
> let prop = [Marked] ; (overwrite & ext)? ; cumul-fence* ;
> [Marked] ; rfe? ; [Marked]

I'm not a memory model person so I don't really feel comfortable
reviewing this, but I can follow the non-formal discussion so

Acked-by: Palmer Dabbelt <[email protected]> # For the RISC-V fallout

So far we've been sticking with the "fastest implementation allowed by
the spec" mentality, but TBH I think we'd have ended up moving to RCsc
locks regardless of where LKMM ended up just to be in line with the more
popular architectures. With mmiowb gone I think this was the last bit
of memory model weirdness we'd been carrying around in the RISC-V port,
so it would have always just been a worry.

We don't really have any hardware to evaluate the performance
implications of this change, as there are no interestingly aggressive
implementations of the memory model available today. Like Dan said,
we've got all the ISA mechanisms in place to adequately describe these
orderings to hardware, so in theory implementations should be able to
handle this without falling off any performance cliffs.

Happy to take a look at an implementation of this on RISC-V, but if
nothing arises I'll go sort it out. It does remind me that we were
supposed to move over to those generic ticket spinlocks, though...

Thanks!

2021-09-10 16:37:49

by Alan Stern

[permalink] [raw]
Subject: Re: [tip:locking/core] tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire

On Fri, Sep 10, 2021 at 10:20:13PM +0800, Boqun Feng wrote:
> On Thu, Sep 09, 2021 at 11:00:05AM -0700, Paul E. McKenney wrote:
> [...]
> >
> > Boqun, I vaguely remember a suggested change from you along these lines,
> > but now I cannot find it. Could you please send it as a formal patch
> > if you have not already done so or point me at it if you have?
> >
>
> Here is a draft patch based on the change I did when I discussed with
> Peter, and I really want to hear Alan's thought first. Ideally, we
> should also have related litmus tests and send to linux-arch list so
> that we know the ordering is provided by every architecture.
>
> Regards,
> Boqun
>
> --------------------------------->8
> Subject: [PATCH] tools/memory-model: Provide extra ordering for
> lock-{release,acquire} on the same CPU
>
> A recent discussion[1] shows that we are in favor of strengthening the
> ordering of lock-release + lock-acquire on the same CPU: a lock-release
> and a po-after lock-acquire should provide the so-called RCtso ordering,
> that is a memory access S po-before the lock-release should be ordered
> against a memory access R po-after the lock-acquire, unless S is a store
> and R is a load.
>
> > The strengthening meets programmers' expectation that "sequence of two
> locked regions to be ordered wrt each other" (from Linus), and can
> reduce the mental burden when using locks. Therefore add it in LKMM.
>
> [1]: https://lore.kernel.org/lkml/[email protected]/
>
> Signed-off-by: Boqun Feng <[email protected]>
> ---

The change to linux-kernel.cat looks fine. However, I didn't like your
update to explanation.txt. Instead I wrote my own, given below.

I also wrote a couple of litmus tests which Paul can add to the
appropriate archive. They are attached to this email. As expected,
they fail (result Sometimes) with the current LKMM and succeed (Never)
with Boqun's updated model.

Alan


--- usb-devel.orig/tools/memory-model/Documentation/explanation.txt
+++ usb-devel/tools/memory-model/Documentation/explanation.txt
@@ -1813,15 +1813,16 @@ spin_trylock() -- we can call these thin
lock-acquires -- have two properties beyond those of ordinary releases
and acquires.

-First, when a lock-acquire reads from a lock-release, the LKMM
-requires that every instruction po-before the lock-release must
-execute before any instruction po-after the lock-acquire. This would
-naturally hold if the release and acquire operations were on different
-CPUs, but the LKMM says it holds even when they are on the same CPU.
-For example:
+First, when a lock-acquire reads from or is po-after a lock-release,
+the LKMM requires that every instruction po-before the lock-release
+must execute before any instruction po-after the lock-acquire. This
+would naturally hold if the release and acquire operations were on
+different CPUs and accessed the same lock variable, but the LKMM says
+it also holds when they are on the same CPU, even if they access
+different lock variables. For example:

int x, y;
- spinlock_t s;
+ spinlock_t s, t;

P0()
{
@@ -1830,9 +1831,9 @@ For example:
spin_lock(&s);
r1 = READ_ONCE(x);
spin_unlock(&s);
- spin_lock(&s);
+ spin_lock(&t);
r2 = READ_ONCE(y);
- spin_unlock(&s);
+ spin_unlock(&t);
}

P1()
@@ -1842,10 +1843,10 @@ For example:
WRITE_ONCE(x, 1);
}

-Here the second spin_lock() reads from the first spin_unlock(), and
-therefore the load of x must execute before the load of y. Thus we
-cannot have r1 = 1 and r2 = 0 at the end (this is an instance of the
-MP pattern).
+Here the second spin_lock() is po-after the first spin_unlock(), and
+therefore the load of x must execute before the load of y, even tbough
+the two locking operations use different locks. Thus we cannot have
+r1 = 1 and r2 = 0 at the end (this is an instance of the MP pattern).

This requirement does not apply to ordinary release and acquire
fences, only to lock-related operations. For instance, suppose P0()
@@ -1872,13 +1873,13 @@ instructions in the following order:

and thus it could load y before x, obtaining r2 = 0 and r1 = 1.

-Second, when a lock-acquire reads from a lock-release, and some other
-stores W and W' occur po-before the lock-release and po-after the
-lock-acquire respectively, the LKMM requires that W must propagate to
-each CPU before W' does. For example, consider:
+Second, when a lock-acquire reads from or is po-after a lock-release,
+and some other stores W and W' occur po-before the lock-release and
+po-after the lock-acquire respectively, the LKMM requires that W must
+propagate to each CPU before W' does. For example, consider:

int x, y;
- spinlock_t x;
+ spinlock_t s;

P0()
{
@@ -1908,7 +1909,12 @@ each CPU before W' does. For example, c

If r1 = 1 at the end then the spin_lock() in P1 must have read from
the spin_unlock() in P0. Hence the store to x must propagate to P2
-before the store to y does, so we cannot have r2 = 1 and r3 = 0.
+before the store to y does, so we cannot have r2 = 1 and r3 = 0. But
+if P1 had used a lock variable different from s, the writes could have
+propagated in either order. (On the other hand, if the code in P0 and
+P1 had all executed on a single CPU, as in the example before this
+one, then the writes would have propagated in order even if the two
+critical sections used different lock variables.)

These two special requirements for lock-release and lock-acquire do
not arise from the operational model. Nevertheless, kernel developers


Attachments:
ullk-rw.litmus (556.00 B)
ullk-ww.litmus (566.00 B)

2021-09-10 17:14:49

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [tip:locking/core] tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire

On Fri, Sep 10, 2021 at 12:36:32PM -0400, Alan Stern wrote:
> +Here the second spin_lock() is po-after the first spin_unlock(), and
> +therefore the load of x must execute before the load of y, even tbough

I think that's commonly spelled: though, right? ^^^^^^

2021-09-10 17:22:46

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [tip:locking/core] tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire

On Fri, Sep 10, 2021 at 12:36:32PM -0400, Alan Stern wrote:
> --- usb-devel.orig/tools/memory-model/Documentation/explanation.txt
> +++ usb-devel/tools/memory-model/Documentation/explanation.txt
> @@ -1813,15 +1813,16 @@ spin_trylock() -- we can call these thin
> lock-acquires -- have two properties beyond those of ordinary releases
> and acquires.
>
> +First, when a lock-acquire reads from or is po-after a lock-release,
> +the LKMM requires that every instruction po-before the lock-release
> +must execute before any instruction po-after the lock-acquire. This
> +would naturally hold if the release and acquire operations were on
> +different CPUs and accessed the same lock variable, but the LKMM says
> +it also holds when they are on the same CPU, even if they access
> +different lock variables. For example:

Could be I don't understand this right, but the way I'm reading it, it
seems to imply RCsc. Which I don't think we're actually asking for at this
time.

2021-09-10 17:58:33

by Alan Stern

[permalink] [raw]
Subject: Re: [tip:locking/core] tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire

On Fri, Sep 10, 2021 at 07:12:21PM +0200, Peter Zijlstra wrote:
> On Fri, Sep 10, 2021 at 12:36:32PM -0400, Alan Stern wrote:
> > +Here the second spin_lock() is po-after the first spin_unlock(), and
> > +therefore the load of x must execute before the load of y, even tbough
>
> I think that's commonly spelled: though, right? ^^^^^^

Oops, yes, I missed that. Good eye!

> > --- usb-devel.orig/tools/memory-model/Documentation/explanation.txt
> > +++ usb-devel/tools/memory-model/Documentation/explanation.txt
> > @@ -1813,15 +1813,16 @@ spin_trylock() -- we can call these thin
> > lock-acquires -- have two properties beyond those of ordinary releases
> > and acquires.
> >
> > +First, when a lock-acquire reads from or is po-after a lock-release,
> > +the LKMM requires that every instruction po-before the lock-release
> > +must execute before any instruction po-after the lock-acquire. This
> > +would naturally hold if the release and acquire operations were on
> > +different CPUs and accessed the same lock variable, but the LKMM says
> > +it also holds when they are on the same CPU, even if they access
> > +different lock variables. For example:
>
> Could be I don't understand this right, but the way I'm reading it, it
> seems to imply RCsc. Which I don't think we're actually asking for at this
> time.

No, it doesn't imply RCsc. This document makes a distinction between
when a store executes and when it becomes visible to (or propagates to)
other CPUs. Thus, even though write 1 executes before write 2, write 2
might become visible to a different CPU before write 1 does. In fact,
on non-other-multicopy-atomic systems, two writes might become visible
to different CPUs in different orders (think of the IRIW litmus
pattern.)

Or to consider a more relevant example, a write can execute before a
read even though the write doesn't become visible to other CPUs until
after the read is complete.

If you want, you can read this as saying "execute as seen from its own
CPU" (although even that isn't entirely right, since a write can be
visible to a po-later read which nevertheless executes before the write
does). Or think of a write as executing when its value gets put into
the local store buffer, rather than when it gets put into the cache
line.

Alan
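
To make the RCtso-versus-RCsc distinction concrete, here is a sketch of
the one case the strengthened model still leaves unordered: a store
po-before the unlock and a load po-after the next lock, observed by an
outside CPU using a full barrier. It mirrors the same-lock sketch given
earlier in the thread; per this discussion the expected result is
Sometimes even with the updated model (the test name is illustrative):

C unlock-lock-sb

(*
 * Store-buffering across unlock+lock on one CPU.  The W->R case is the
 * RCtso exception, so the outcome remains allowed.
 *)

{}

P0(spinlock_t *s, spinlock_t *t, int *x, int *y)
{
	int r1;

	spin_lock(s);
	WRITE_ONCE(*x, 1);
	spin_unlock(s);
	spin_lock(t);
	r1 = READ_ONCE(*y);
	spin_unlock(t);
}

P1(int *x, int *y)
{
	int r2;

	WRITE_ONCE(*y, 1);
	smp_mb();
	r2 = READ_ONCE(*x);
}

exists (0:r1=0 /\ 1:r2=0)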

2021-09-12 00:31:20

by Boqun Feng

[permalink] [raw]
Subject: Re: [tip:locking/core] tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire

On Fri, Sep 10, 2021 at 12:36:32PM -0400, Alan Stern wrote:
> On Fri, Sep 10, 2021 at 10:20:13PM +0800, Boqun Feng wrote:
> > On Thu, Sep 09, 2021 at 11:00:05AM -0700, Paul E. McKenney wrote:
> > [...]
> > >
> > > Boqun, I vaguely remember a suggested change from you along these lines,
> > > but now I cannot find it. Could you please send it as a formal patch
> > > if you have not already done so or point me at it if you have?
> > >
> >
> > Here is a draft patch based on the change I did when I discussed with
> > Peter, and I really want to hear Alan's thought first. Ideally, we
> > should also have related litmus tests and send to linux-arch list so
> > that we know the ordering is provided by every architecture.
> >
> > Regards,
> > Boqun
> >
> > --------------------------------->8
> > Subject: [PATCH] tools/memory-model: Provide extra ordering for
> > lock-{release,acquire} on the same CPU
> >
> > A recent discussion[1] shows that we are in favor of strengthening the
> > ordering of lock-release + lock-acquire on the same CPU: a lock-release
> > and a po-after lock-acquire should provide the so-called RCtso ordering,
> > that is a memory access S po-before the lock-release should be ordered
> > against a memory access R po-after the lock-acquire, unless S is a store
> > and R is a load.
> >
> > > The strengthening meets programmers' expectation that "sequence of two
> > locked regions to be ordered wrt each other" (from Linus), and can
> > reduce the mental burden when using locks. Therefore add it in LKMM.
> >
> > [1]: https://lore.kernel.org/lkml/[email protected]/
> >
> > Signed-off-by: Boqun Feng <[email protected]>
> > ---
>
> The change to linux-kernel.cat looks fine. However, I didn't like your
> update to explanation.txt. Instead I wrote my own, given below.
>

Thanks. Indeed your changes to the explanation are better.

> I also wrote a couple of litmus tests which Paul can add to the
> appropriate archive. They are attached to this email. As expected,
> they fail (result Sometimes) with the current LKMM and succeed (Never)
> with Boqun's updated model.
>

Appreciate it, I will put together your change to explanation.txt (with
the typo fixed), my change to the cat file and the litmus tests, and send
a proper patch next Monday.

Regards,
Boqun

> Alan
>
>
> --- usb-devel.orig/tools/memory-model/Documentation/explanation.txt
> +++ usb-devel/tools/memory-model/Documentation/explanation.txt
> @@ -1813,15 +1813,16 @@ spin_trylock() -- we can call these thin
> lock-acquires -- have two properties beyond those of ordinary releases
> and acquires.
>
> -First, when a lock-acquire reads from a lock-release, the LKMM
> -requires that every instruction po-before the lock-release must
> -execute before any instruction po-after the lock-acquire. This would
> -naturally hold if the release and acquire operations were on different
> -CPUs, but the LKMM says it holds even when they are on the same CPU.
> -For example:
> +First, when a lock-acquire reads from or is po-after a lock-release,
> +the LKMM requires that every instruction po-before the lock-release
> +must execute before any instruction po-after the lock-acquire. This
> +would naturally hold if the release and acquire operations were on
> +different CPUs and accessed the same lock variable, but the LKMM says
> +it also holds when they are on the same CPU, even if they access
> +different lock variables. For example:
>
> int x, y;
> - spinlock_t s;
> + spinlock_t s, t;
>
> P0()
> {
> @@ -1830,9 +1831,9 @@ For example:
> spin_lock(&s);
> r1 = READ_ONCE(x);
> spin_unlock(&s);
> - spin_lock(&s);
> + spin_lock(&t);
> r2 = READ_ONCE(y);
> - spin_unlock(&s);
> + spin_unlock(&t);
> }
>
> P1()
> @@ -1842,10 +1843,10 @@ For example:
> WRITE_ONCE(x, 1);
> }
>
> -Here the second spin_lock() reads from the first spin_unlock(), and
> -therefore the load of x must execute before the load of y. Thus we
> -cannot have r1 = 1 and r2 = 0 at the end (this is an instance of the
> -MP pattern).
> +Here the second spin_lock() is po-after the first spin_unlock(), and
> +therefore the load of x must execute before the load of y, even tbough
> +the two locking operations use different locks. Thus we cannot have
> +r1 = 1 and r2 = 0 at the end (this is an instance of the MP pattern).
>
> This requirement does not apply to ordinary release and acquire
> fences, only to lock-related operations. For instance, suppose P0()
> @@ -1872,13 +1873,13 @@ instructions in the following order:
>
> and thus it could load y before x, obtaining r2 = 0 and r1 = 1.
>
> -Second, when a lock-acquire reads from a lock-release, and some other
> -stores W and W' occur po-before the lock-release and po-after the
> -lock-acquire respectively, the LKMM requires that W must propagate to
> -each CPU before W' does. For example, consider:
> +Second, when a lock-acquire reads from or is po-after a lock-release,
> +and some other stores W and W' occur po-before the lock-release and
> +po-after the lock-acquire respectively, the LKMM requires that W must
> +propagate to each CPU before W' does. For example, consider:
>
> int x, y;
> - spinlock_t x;
> + spinlock_t s;
>
> P0()
> {
> @@ -1908,7 +1909,12 @@ each CPU before W' does. For example, c
>
> If r1 = 1 at the end then the spin_lock() in P1 must have read from
> the spin_unlock() in P0. Hence the store to x must propagate to P2
> -before the store to y does, so we cannot have r2 = 1 and r3 = 0.
> +before the store to y does, so we cannot have r2 = 1 and r3 = 0. But
> +if P1 had used a lock variable different from s, the writes could have
> +propagated in either order. (On the other hand, if the code in P0 and
> +P1 had all executed on a single CPU, as in the example before this
> +one, then the writes would have propagated in order even if the two
> +critical sections used different lock variables.)
>
> These two special requirements for lock-release and lock-acquire do
> not arise from the operational model. Nevertheless, kernel developers
>

> C ullk-rw
>
> (*
> * Result: Never
> *
> * If two locked critical sections execute on the same CPU, all accesses
> * in the first must execute before any accesses in the second, even if
> * the critical sections are protected by different locks.
> *)
>
> {}
>
> P0(spinlock_t *s, spinlock_t *t, int *x, int *y)
> {
> int r1;
>
> spin_lock(s);
> r1 = READ_ONCE(*x);
> spin_unlock(s);
> spin_lock(t);
> WRITE_ONCE(*y, 1);
> spin_unlock(t);
> }
>
> P1(int *x, int *y)
> {
> int r2;
>
> r2 = smp_load_acquire(y);
> WRITE_ONCE(*x, 1);
> }
>
> exists (0:r1=1 /\ 1:r2=1)

> C ullk-ww
>
> (*
> * Result: Never
> *
> * If two locked critical sections execute on the same CPU, stores in the
> * first must propagate to each CPU before stores in the second do, even if
> * the critical sections are protected by different locks.
> *)
>
> {}
>
> P0(spinlock_t *s, spinlock_t *t, int *x, int *y)
> {
> spin_lock(s);
> WRITE_ONCE(*x, 1);
> spin_unlock(s);
> spin_lock(t);
> WRITE_ONCE(*y, 1);
> spin_unlock(t);
> }
>
> P1(int *x, int *y)
> {
> int r1;
> int r2;
>
> r1 = READ_ONCE(*y);
> smp_rmb();
> r2 = READ_ONCE(*x);
> }
>
> exists (1:r1=1 /\ 1:r2=0)
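
(The litmus tests above can be checked against the model with herd7 from
within tools/memory-model/, for example:

	$ herd7 -conf linux-kernel.cfg ullk-rw.litmus
	$ herd7 -conf linux-kernel.cfg ullk-ww.litmus

assuming the files are saved under those names.)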

2021-09-17 12:26:43

by Nicholas Piggin

[permalink] [raw]
Subject: Re: [tip:locking/core] tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire

Excerpts from Nicholas Piggin's message of September 17, 2021 1:21 pm:
> Excerpts from Will Deacon's message of September 10, 2021 9:08 pm:
>> Hi Paul,
>>
>> On Thu, Sep 09, 2021 at 10:46:35AM -0700, Paul E. McKenney wrote:
>>> On Thu, Sep 09, 2021 at 02:35:36PM +0100, Will Deacon wrote:
>>> > On Thu, Sep 09, 2021 at 09:25:30AM +0200, Peter Zijlstra wrote:
>>> > > On Wed, Sep 08, 2021 at 09:08:33AM -0700, Linus Torvalds wrote:
>>> > > > then I think it's entirely reasonable to
>>> > > >
>>> > > > spin_unlock(&r);
>>> > > > spin_lock(&s);
>>> > > >
>>> > > > cannot be reordered.
>>> > >
>>> > > I'm obviously completely in favour of that :-)
>>> >
>>> > I don't think we should require the accesses to the actual lockwords to
>>> > be ordered here, as it becomes pretty onerous for relaxed LL/SC
>>> > architectures where you'd end up with an extra barrier either after the
>>> > unlock() or before the lock() operation. However, I remain absolutely in
>>> > favour of strengthening the ordering of the _critical sections_ guarded by
>>> > the locks to be RCsc.
>>>
>>> If by this you mean the critical sections when observed only by other
>>> critical sections for a given lock, then everyone is already there.
>>
>> No, I mean the case where somebody without the lock (but using memory
>> barriers) can observe the critical sections out of order (i.e. W -> R
>> order is not maintained).
>
> This is a sincere question: why is this important? I mean, _any_
> restriction on reordering makes things easier by definition, I can't
> argue with that, but why is this one in particular seen as a problem?
> It just seems disproportionate.
>
> We naturally think of accesses within locks as atomic as a whole
> (provided the other parties are doing the proper locking too). So like
> atomic operations, aligned stores, etc. can be reordered, I don't see why
> these should have any particular ordering either, or why an unlock()
> should pair with a later lock() of an unrelated lock to provide some
> ordering.
>
> It gives the idea that individual lock operations in isolation should be
> or do something special, but I think that's the wrong way to think about
> it, the lock and the unlock operate on a specific lock word and protect
> specific data vs other processors that access the same data under the
> same locks.
>
> If you don't know what you're doing or don't want to think about
> ordering, perform accesses under locks. If you don't lock, you get to
> think about ordering. At which point sure two sets of operations from
> different critical sections could go out of order, but so can any two
> stores. Or two stores from inside the one critical section if you are
> not holding the correct lock.

It doesn't actually really relieve the burden of thinking about barriers
much at all, come to think of it.

spin_lock(&foo);
x = 1;
spin_unlock(&foo);
spin_lock(&bar);
y = 1;
spin_unlock(&bar);

vs

if (READ_ONCE(y) == 1) {
	// spin_unlock(&foo)+spin_lock(&bar) provides store ordering
	smp_rmb();
	BUG_ON(READ_ONCE(x) == 0);
}

Then if you didn't comment the store ordering requirement in the first
code, then you patch things to simplify or add functionality:

spin_lock(&foo);
x = 1;
spin_lock(&bar);
y = 1;
spin_unlock(&bar);
z = 1;
spin_unlock(&foo);

or

spin_lock(&baz);
x = 1;
y = 1;
spin_unlock(&baz);

Then you broke it. Because you thought being clever and avoiding
thinking about or talking about ordering in your code which performs
memory accesses outside of locks was improving the situation.

It's not good practice. If there is *any* unlocked memory access, you
should always think about and comment *all* memory orderings. Locking
should not ever be relied on to give you some kind of implicit semantics
that you think should be obvious. All the accesses always need thought
and they always need comments.

spin_lock(&foo);
// The spin_unlock+spin_lock orders this store before the store to y,
// the corresponding smp_rmb is at function().
x = 1;
spin_unlock(&foo);
spin_lock(&bar);
// See x
y = 1;
spin_unlock(&bar);

Once you do that, does it *really* break the bank to add a line of code?

spin_lock(&foo);
// See smp_unlock_lock_mb() below
x = 1;
spin_unlock(&foo);
// This orders this store of x above before the
// store to y below.
smp_unlock_lock_mb();
spin_lock(&bar);
// See smp_unlock_lock_mb() above
y = 1;
spin_unlock(&bar);

I can't see how that line of code created a fundamentally bigger
problem.

If you don't need the performance and don't want to deal with ordering,
*always use locks*. If you absolutely can't always use locks, *always
document important memory accesses ordering that is or can affect
unlocked memory accesses*. We are all agreed on this rule, right? So
what am I missing?

Thanks,
Nick
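
Nick's smp_unlock_lock_mb() is hypothetical (no such API exists in the
kernel); a conservative sketch of what it might look like:

	/*
	 * Hypothetical: promote an unlock+lock pair on this CPU to a full
	 * barrier.  The generic fallback is simply smp_mb(); an architecture
	 * whose unlock+lock sequence is already fully ordered (for example
	 * RISC-V with AMO-based locks, per the discussion above) could in
	 * principle define it more cheaply.
	 */
	#ifndef smp_unlock_lock_mb
	#define smp_unlock_lock_mb()	smp_mb()
	#endif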

2021-09-17 15:19:21

by Michael Ellerman

[permalink] [raw]
Subject: Re: [tip:locking/core] tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire

Will Deacon <[email protected]> writes:
> On Thu, Sep 09, 2021 at 10:46:35AM -0700, Paul E. McKenney wrote:
>> On Thu, Sep 09, 2021 at 02:35:36PM +0100, Will Deacon wrote:
>> > On Thu, Sep 09, 2021 at 09:25:30AM +0200, Peter Zijlstra wrote:
>> > > On Wed, Sep 08, 2021 at 09:08:33AM -0700, Linus Torvalds wrote:
>> > > > then I think it's entirely reasonable to
>> > > >
>> > > > spin_unlock(&r);
>> > > > spin_lock(&s);
>> > > >
>> > > > cannot be reordered.
>> > >
>> > > I'm obviously completely in favour of that :-)
>> >
>> > I don't think we should require the accesses to the actual lockwords to
>> > be ordered here, as it becomes pretty onerous for relaxed LL/SC
>> > architectures where you'd end up with an extra barrier either after the
>> > unlock() or before the lock() operation. However, I remain absolutely in
>> > favour of strengthening the ordering of the _critical sections_ guarded by
>> > the locks to be RCsc.
>>
>> If by this you mean the critical sections when observed only by other
>> critical sections for a given lock, then everyone is already there.
>
> No, I mean the case where somebody without the lock (but using memory
> barriers) can observe the critical sections out of order (i.e. W -> R
> order is not maintained).
>
>> However...
>>
>> > Last time this came up, I think the RISC-V folks were generally happy to
>> > implement whatever was necessary for Linux [1]. The thing that was stopping
>> > us was Power (see CONFIG_ARCH_WEAK_RELEASE_ACQUIRE), wasn't it? I think
>> > Michael saw quite a bit of variety in the impact on benchmarks [2] across
>> > different machines. So the question is whether newer Power machines are less
>> > affected to the degree that we could consider making this change again.
>>
>> Last I knew, on Power a pair of critical sections for a given lock could
>> be observed out of order (writes from the earlier critical section vs.
>> reads from the later critical section), but only by CPUs not holding
>> that lock. Also last I knew, tightening this would require upgrading
>> some of the locking primitives' lwsync instructions to sync instructions.
>> But I know very little about Power 10.
>
> Yup, that's the one. This is the primary reason why we have the confusing
> "RCtso" model today so this is my periodic "Do we still need this?" poking
> for the Power folks :)
>
> If the SYNC is a disaster for Power, then I'll ask again in another ~3 years
> time in the hope that newer micro-architectures can swallow the instruction
> more easily, but the results last time weren't hugely compelling and so _if_
> there's an opportunity to make locking more "obvious" then I'm all for it.

I haven't had time to do the full set of numbers like I did last time,
but a quick test shows it's still about a 20-25% drop switching to sync.

So on that basis we'd definitely rather not :)

I'll try and get some more numbers next week.

cheers

2021-09-17 16:00:09

by Nicholas Piggin

[permalink] [raw]
Subject: Re: [tip:locking/core] tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire

Excerpts from Will Deacon's message of September 10, 2021 9:08 pm:
> Hi Paul,
>
> On Thu, Sep 09, 2021 at 10:46:35AM -0700, Paul E. McKenney wrote:
>> On Thu, Sep 09, 2021 at 02:35:36PM +0100, Will Deacon wrote:
>> > On Thu, Sep 09, 2021 at 09:25:30AM +0200, Peter Zijlstra wrote:
>> > > On Wed, Sep 08, 2021 at 09:08:33AM -0700, Linus Torvalds wrote:
>> > > > then I think it's entirely reasonable to
>> > > >
>> > > > spin_unlock(&r);
>> > > > spin_lock(&s);
>> > > >
>> > > > cannot be reordered.
>> > >
>> > > I'm obviously completely in favour of that :-)
>> >
>> > I don't think we should require the accesses to the actual lockwords to
>> > be ordered here, as it becomes pretty onerous for relaxed LL/SC
>> > architectures where you'd end up with an extra barrier either after the
>> > unlock() or before the lock() operation. However, I remain absolutely in
>> > favour of strengthening the ordering of the _critical sections_ guarded by
>> > the locks to be RCsc.
>>
>> If by this you mean the critical sections when observed only by other
>> critical sections for a given lock, then everyone is already there.
>
> No, I mean the case where somebody without the lock (but using memory
> barriers) can observe the critical sections out of order (i.e. W -> R
> order is not maintained).

This is a sincere question: why is this important? I mean, _any_
restriction on reordering makes things easier by definition, I can't
argue with that, but why is this one in particular seen as a problem?
It just seems disproportionate.

We naturally think of accesses within locks as atomic as a whole
(provided the other parties are doing the proper locking too). So like
atomic operations, aligned stores, etc. can be reordered, I don't see why
these should have any particular ordering either, or why an unlock()
should pair with a later lock() of an unrelated lock to provide some
ordering.

It gives the idea that individual lock operations in isolation should be
or do something special, but I think that's the wrong way to think about
it, the lock and the unlock operate on a specific lock word and protect
specific data vs other processors that access the same data under the
same locks.

If you don't know what you're doing or don't want to think about
ordering, perform accesses under locks. If you don't lock, you get to
think about ordering. At which point sure two sets of operations from
different critical sections could go out of order, but so can any two
stores. Or two stores from inside the one critical section if you are
not holding the correct lock.

Thanks,
Nick