2019-03-19 19:40:39

by Alan Stern

Subject: Adding plain accesses and detecting data races in the LKMM

LKMM maintainers and other interested parties:

Here is my latest proposal for extending the Linux Kernel Memory Model
to handle plain accesses and data races. The document below explains
the rationale behind the proposal, and a patch (based on the dev
branch of Paul's linux-rcu tree) is attached.

This extension is not a complete treatment of plain accesses, because
it pretty much ignores any ordering that plain accesses can provide.
For example, given:

r0 = rcu_dereference(ptr);
r1 = r0->branch;
r2 = READ_ONCE(r1->data);

the memory model will not realize that the READ_ONCE must execute after
the rcu_dereference, because it doesn't take into account the address
dependency from the intermediate plain read. Hopefully we will add
such things to the memory model later on. Concentrating on data races
seems like enough for now.

Some of the ideas below were originally proposed for the LKMM by Andrea
Parri.

Alan

-----------------------------------------------------------------------

A Plan For Modeling Plain Accesses In The Linux Kernel Memory Model

Definition: A "plain" access is one which is not "marked". Marked
accesses include all those annotated with READ_ONCE or WRITE_ONCE,
plus acquire and release accesses, Read-Modify-Write (RMW) atomic
and bitop accesses, and locking accesses.

Plain accesses are those the compiler is free to mess around with.
Marked accesses are implemented in the kernel by means of inline
assembly or as volatile accesses, which greatly restricts what the
compiler can do.
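
For instance, roughly speaking (a simplified sketch of the idea, not the
kernel's exact definitions), the ONCE accessors come down to volatile
accesses:

	/* Sketch: volatile casts are what restrict the compiler */
	#define READ_ONCE(x)		(*(const volatile typeof(x) *)&(x))
	#define WRITE_ONCE(x, v)	(*(volatile typeof(x) *)&(x) = (v))

	r1 = READ_ONCE(*p);	/* marked: exactly one volatile load */
	r2 = *p;		/* plain: may be merged, repeated, torn, ... */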

Definition: A "region" of code is a program-order source segment
contained in a single thread that is maximal with respect to not
containing any compiler barriers (i.e., it is bounded at each end by
either a compiler barrier or the start or end of its thread).

Assumption: All memory barriers are also compiler barriers. This
includes acquire loads and release stores, which are not considered to
be compiler barriers in the C++11 Standard, as well as non-relaxed RMW
atomic operations. (Question: should this include situation-specific
memory barriers such as smp_mb__after_unlock_lock? Probably not.)
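
For example, in the following (hypothetical) thread the accesses fall
into three regions:

	*x = 1;			/* region 1 */
	r1 = *y;		/* region 1 */
	smp_wmb();		/* memory barrier, hence also a compiler barrier */
	*x = 2;			/* region 2 */
	barrier();		/* compiler barrier only, not a memory barrier */
	WRITE_ONCE(*z, 1);	/* region 3 */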

To avoid complications, the LKMM will consider only code in which
plain writes are separated by a compiler barrier from any marked
accesses of the same location. (Question: Can this restriction be
removed?) As a consequence, if a region contains any marked accesses
to a particular location then it cannot contain any plain writes to
that location.
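
For example, the attached patch flags the first snippet below (as
"mixed-accesses"); the second is acceptable because the compiler
barrier puts the two accesses in different regions:

	WRITE_ONCE(*x, 1);	/* marked access to x ... */
	*x = 2;			/* ... and a plain write to x, same region */

	WRITE_ONCE(*x, 1);
	barrier();
	*x = 2;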

This analysis is guided by the intuition that one can set up a limited
correspondence between an execution of the source code and an
execution of the generated object code. Such a correspondence is
based on some limitations on what compilers can do when translating a
program:

Each marked access in the source-level execution must
correspond to a unique instruction in the object-level
execution. The corresponding events must access the same
location, have the same values, and be of the same type (read
or write).

If a marked access is address-dependent on a marked read then
the corresponding object-level instructions must have the same
dependency (i.e., the compiler does not break address
dependencies for marked accesses).

If a source region contains no plain accesses to a location
then the corresponding object-level region contains no
accesses to that location other than those corresponding to
marked accesses.

If a source region contains no plain writes to a location then
the corresponding object-level region contains no stores to
that location other than those corresponding to marked writes.

If a source region contains a plain write then the object code
corresponding to that region must contain at least one access
(it could be a load instead of a store) to the same location.

If all the accesses to a particular location in some source
region are address-dependent on one of the marked reads in
some set S then every object-level access of that location
must be dependent (not necessarily address-dependent, when the
access is a store) on one of the instructions corresponding to
the members of S.

The object code for a region must ensure that the final value
stored in a location is the same as the value of the po-final
source write in the region.

The positions of the load and store instructions in the object code
for a region do not have to bear any particular relation to the plain
accesses in the source code. Subject to the restrictions above, a
sequence of m plain writes in the source could be implemented by a
sequence of n store instructions at runtime where n is <, =, or > m,
and likewise for plain reads.
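
For instance, the following would be a perfectly legal translation of a
region containing two plain writes (a sketch; a real compiler might do
this or something else entirely):

	/* Source region: two plain writes (m = 2) */
	*x = 1;
	*x = 2;

	/* One possible object-level outcome: a single store (n = 1);
	 * the intermediate value 1 need never reach memory at all. */
	*x = 2;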

Similarly, the rf, co, and fr relations for plain accesses in
source-level executions have very limited meaning.

Given an object-level execution, let A1 and B1 be accesses in region
R1 and let A2 and B2 be accesses in region R2, all of the same
location, not all reads, and not all corresponding to marked accesses.
(We explicitly allow A1 and B1 to be the same access, and the same for
A2 and B2.) The execution has a data race if the coherence ordering
from A1 to A2 is opposite to the ordering from B1 to B2.

Definition: Two accesses of the same location "conflict" if they are
in different threads, they are not both marked, and they are not both
reads.
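
A quick illustration (hypothetical threads):

	P0(int *x)
	{
		*x = 1;			/* plain write */
	}

	P1(int *x)
	{
		int r0;

		r0 = READ_ONCE(*x);	/* conflicts with P0's write */
		WRITE_ONCE(*x, 2);	/* also conflicts with P0's write */
	}

P1's two accesses do not conflict with each other: they are in the same
thread and both are marked.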

The notions "executes before" and "is visible to" have already been
defined for marked accesses in the LKMM or in earlier proposals for
handling plain accesses (the xb -- or more properly, xb* -- and vis
relations). I trust the ideas are familiar to everyone. Note that
vis is contained in xb. Also note that the vis relation is not
transitive, because smp_wmb is not A-cumulative. That is, it's
possible for W1 to be visible to W2 and W2 to be visible to W3,
without W1 being visible to W3. (However, on systems that are
other-multicopy-atomic, vis is indeed transitive.)

We want to define data races for source programs in such a way that a
race-free source program has no data races in any of its object-level
executions. To that end, we could begin with some requirements:

For any two conflicting writes, one must be visible to the
other.

For a write conflicting with a read, either the write must be
visible to the read or the read must execute before the write.

However, this isn't strong enough. A source-level write can give rise
to object-level loads as well as stores. Therefore whenever we
require that a plain write W be visible to some other access A, we
must also require that if W were a read then it would execute before
A. And similarly, if A is required to be visible to W, we must also
require that if W were a read then A would still be visible to it.

At the same time, it is too strong. If W1 is required to be visible
to W2 and W2 is required to be visible to A, then W1 doesn't
necessarily have to be visible to A. Visibility is required only in
cases where W2 need not correspond to any object-level stores.

Since vis and xb aren't defined for plain accesses, we have to extend
the definitions. We begin by saying that a plain access is pre- or
post-bounded by a marked access if the execution order can be
guaranteed, as follows.

Definition:

A write W is "w-post-bounded" by a po-later marked access A if
they are separated by an appropriate memory barrier (including
the case where A is a release write).

A read R is "r-post-bounded" by a po-later marked access A if
they are separated by an appropriate memory barrier.

Oddly, a plain write W is "r-post-bounded" by a po-later
marked access A whenever W would be considered to be
r-post-bounded by A if W were a read (for example, when A is a
read and the two are separated by smp_rmb).

In addition, a marked write is w-post-bounded by itself and a
marked read is r-post-bounded by itself.

"w-pre-bounded" and "r-pre-bounded" by a po-earlier marked access
are defined analogously, except that we also include cases where
the later plain access is address-dependent on the marked access.
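
For instance (a sketch with hypothetical variables):

	*x = 1;			/* W: plain write */
	smp_wmb();
	WRITE_ONCE(*y, 1);	/* A: W is w-post-bounded by A */

	r1 = READ_ONCE(*z);	/* B: marked read */
	r2 = *(p + r1);		/* plain read, r-pre-bounded by B thanks
				 * to the address dependency */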

Note that as a result of these definitions, if one plain write is
w-post-bounded by a marked access then all the writes in the same
region are (this is because the memory barrier which enforces the
bounding is also a region delimiter). And if one plain access is
r-post-bounded by a marked access then all the plain accesses in the
same region are. The same cannot be said of w-pre-bounded and
r-pre-bounded, though, because of possible address dependencies.

Definition: ww-vis, ww-xb, wr-vis, wr-xb, and rw-xb are given by the
following scheme. Let i and j each be either w or r, and let rel be
either vis or xb. Then access A is related to access D by ij-rel if
there are marked accesses B and C such that:

A ->i-post-bounded B ->rel C ->j-pre-bounded D.

For example, A is related by wr-vis to D if there are marked accesses
B and C such that A is w-post-bounded by B, B ->vis C, and C
r-pre-bounds D. In cases where A or D is marked, B could be equal to
A or C could be equal to D.
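
Concretely, here is a sketch of A ->wr-vis D with a release/acquire
pair playing the roles of B and C (essentially the message-passing
pattern):

	P0(int *x, int *y)
	{
		*x = 1;				/* A: plain write */
		smp_store_release(y, 1);	/* B: w-post-bounds A */
	}

	P1(int *x, int *y)
	{
		int r1;
		int r2 = 0;

		r1 = smp_load_acquire(y);	/* C */
		if (r1)
			r2 = *x;		/* D: r-pre-bounded by C */
	}

In the executions where C reads from B we have B ->vis C, hence
A ->wr-vis D, and the two plain accesses to x do not race.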

As mentioned earlier, ww-vis is contained in ww-xb and wr-vis is
contained in wr-xb. It turns out that the LKMM can almost prove that
the ij-xb relations are transitive, in the following sense. If W is a
write and R is a read, then:

(A ->iw-xb W ->wj-xb C) implies (A ->ij-xb C), and

(A ->ir-xb R ->rj-xb C) implies (A ->ij-xb C).

To fully prove this requires adding one new term to the ppo relation:

[Marked & R] ; addr ; [Plain & W] ; wmb ; [Marked & W]

For example, given the code:

r = READ_ONCE(ptr);
*r = 27;
smp_wmb();
WRITE_ONCE(x, 1);

the new term says that the READ_ONCE must execute before the
WRITE_ONCE. To justify this, note that the object code is obliged to
ensure that *r contains 27 (or some co-later value) before the smp_wmb
executes. It can do so in one of three ways:

(1) Actually store a 27 through the *r pointer;

(2) Load through the *r pointer and check that the location already
holds 27; or

(3) Check that r aliases a pointer to a location known to hold 27
already.

In case (1), the ordering is enforced by the address dependency and
the smp_wmb. In case (2) there is an address dependency to the *r
load and a conditional depending on that load. Control dependencies
in object-code executions extend to all po-later stores; hence the
WRITE_ONCE must be ordered after the *r load. In case (3) there is a
conditional depending on the READ_ONCE and po-before the WRITE_ONCE.
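
For instance, a hypothetical object-code translation along the lines of
case (2) might look like this (in pseudo-C):

	r = READ_ONCE(ptr);
	tmp = *r;		/* load through r: address-dependent on
				 * the READ_ONCE */
	if (tmp != 27)		/* conditional depending on that load */
		*r = 27;	/* store performed only when needed */
	smp_wmb();
	WRITE_ONCE(x, 1);	/* ordered by the control dependency */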

With this new term added, the LKMM can show that a cycle in the ij-xb
relations would give rise to a cycle in xb of marked accesses. It
follows that in an allowed execution, the regions containing accesses
to a particular location x can be totally ordered in a way compatible
with the ij-xb relations, and this ordering must agree with the co and
rf orderings for x.

Therefore we can try defining a race-free execution to be one which
obeys the following requirements:

ww-race: If W1 ->co W2 and the two writes conflict then we must have
W1 ->ww-vis W2. If W1 is a plain write then we must also have
W1 ->rw-xb W2, and if W2 is a plain write then we must also
have W1 ->wr-vis W2.

wr-race: If W ->(co?;rf) R and the two accesses conflict then we must
have W ->wr-vis R.

rw-race: If R ->fr W and the two accesses conflict then we must have
R ->rw-xb W.

However, as described earlier, this is stronger than we really need.
In ww-race, we don't need to have W1 ->ww-vis W2 or W1 ->wr-vis W2 if
there is a third write W3 in between such that W3 must give rise to a
store in the object code. In this case it doesn't matter whether W1
is visible to W2 or not; they can't race because W3 will be visible to
W2 and being co-later than W1, it will prevent W1 from interfering
with W2. Much the same is true for wr-race.

If W3 is a marked write, it certainly corresponds to a write in the
object code. But what if it is a plain write?

Definition: A write is "major" if it is not overwritten by any
po-later writes in its region. A major write is "super" if it has a
different value from the co-preceding major write. In addition, all
marked writes are considered to be super.

If a region contains a super write then the object code for the region
must contain a store to the write's location. Otherwise, when
execution reached the end of the region at runtime, the value
contained in that location would still be the value from the end of
the preceding region -- that is, the value would be wrong.
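
For example, within a single region:

	*x = 1;		/* overwritten below, hence not major */
	*x = 2;		/* major: the po-final write to x in the region;
			 * it is super if the co-preceding major write of
			 * x stored something other than 2 */
	smp_wmb();	/* region ends here */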

Thus, ww-race need not apply if there is a super write co-between W1
and W2. Likewise, wr-race need not apply if there is a super write
co-after W and co?-before the write which R reads from. Note that
these weakenings do not apply in situations where W2 is the co-next
write after W1 or when R reads from W; in such cases ww-race and
wr-race retain their full force.

The LKMM also has to detect forbidden executions involving plain
accesses. There are three obvious coherence-related checks:

If W ->rf R then we must not have R ->rw-xb W.

If R ->fr W then we must not have W ->wr-vis R.

If W1 ->co W2 then we must not have W2 ->ww-vis W1.

(Question: Are these checks actually correct? I'm not certain; they
are hard to reason about because co, rf, and fr don't have clear
meanings for plain accesses in racy executions.)

These proposed changes do not include any ordering due to dependencies
from plain reads to marked accesses, or from overwrites. Perhaps we
will want to add them later on.

Some relevant and interesting litmus tests:

C non-transitive-wmb
(* allowed, no race *)

{}

P0(int *x, int *y)
{
	*x = 1;
	smp_store_release(y, 1);
}

P1(int *x, int *y, int *z)
{
	int r1;

	r1 = smp_load_acquire(y);
	if (r1) {
		*x = 2;
		smp_wmb();
		WRITE_ONCE(*z, 1);
	}
}

P2(int *x, int *z)
{
	int r3;
	int r4 = 0;

	r3 = READ_ONCE(*z);
	if (r3) {
		smp_rmb();
		r4 = READ_ONCE(*x);
	}
}

exists (2:r3=1 /\ 2:r4=2)

If the plain write in P1 were changed to "*x = 1" then it would no
longer be super, and P0 would race with P2.

C LB1
(* forbidden, no race *)

{
int *x = &a;
}

P0(int **x, int *y)
{
	int *r0;

	r0 = rcu_dereference(*x);
	*r0 = 0;
	smp_wmb();
	WRITE_ONCE(*y, 1);
}

P1(int **x, int *y, int *b)
{
	int r0;

	r0 = READ_ONCE(*y);
	rcu_assign_pointer(*x, b);
}

exists (0:r0=b /\ 1:r0=1)

The new term in ppo forces P0's rcu_dereference to execute before the
WRITE_ONCE.

C non-race4
(* allowed (should be forbidden), no race *)

{
int *x = &a;
}

P0(int **x, int *b)
{
	*b = 1;
	smp_store_release(x, b);
}

P1(int **x, int **tmp)
{
	int *r1;
	int *r2;
	int r3;

	r1 = READ_ONCE(*x);
	*tmp = r1;
	r2 = *tmp;
	r3 = *r2;
}

exists (1:r1=b /\ 1:r3=0)

It's pretty obvious that P1's read of *r2 must execute after the
READ_ONCE, but the model doesn't know this.

C overwrite-race

{}

P0(int *x)
{
	int r0;

	r0 = *x;
	WRITE_ONCE(*x, 1);
}

P1(int *x)
{
	int r1;

	r1 = READ_ONCE(*x);
	if (r1 == 1)
		WRITE_ONCE(*x, 2);
}

exists (1:r1=1)

This counts as a race, because the compiler is allowed to generate
loads from *x po-after P0's WRITE_ONCE.

C plain-ppo1
(* allowed (should be forbidden), no race *)

{
int *x = &u;
int u = 0;
int z = 5;
}

P0(int **x, int *y)
{
	int *r0;
	int r1;

	r0 = READ_ONCE(*x);
	r1 = *r0;
	WRITE_ONCE(*y, r1);
}

P1(int **x, int *y, int *z)
{
	int r2;

	r2 = READ_ONCE(*y);
	if (r2)
		WRITE_ONCE(*x, z);
}

exists (0:r0=z /\ 0:r1=5 /\ 1:r2=5)

The model does not recognize that P0's WRITE_ONCE must execute after
the READ_ONCE.


Attachments:
plain-race.patch (7.02 kB)

2019-04-02 15:52:28

by Andrea Parri

Subject: Re: Adding plain accesses and detecting data races in the LKMM

On Tue, Mar 19, 2019 at 03:38:19PM -0400, Alan Stern wrote:
> LKMM maintainers and other interested parties:
>
> Here is my latest proposal for extending the Linux Kernel Memory Model
> to handle plain accesses and data races. The document below explains
> the rationale behind the proposal, and a patch (based on the dev
> branch of Paul's linux-rcu tree) is attached.

Thank you for this! Unfortunately, I didn't have enough time to do it
justice yet... Below are some first thoughts/questions.


>
> This extension is not a complete treatment of plain accesses, because
> it pretty much ignores any ordering that plain accesses can provide.
> For example, given:
>
> r0 = rcu_dereference(ptr);
> r1 = r0->branch;
> r2 = READ_ONCE(r1->data);
>
> the memory model will not realize that the READ_ONCE must execute after
> the rcu_dereference, because it doesn't take into account the address
> dependency from the intermediate plain read. Hopefully we will add
> such things to the memory model later on. Concentrating on data races
> seems like enough for now.
>
> Some of the ideas below were originally proposed for the LKMM by Andrea
> Parri.
>
> Alan
>
> -----------------------------------------------------------------------
>
> A Plan For Modeling Plain Accesses In The Linux Kernel Memory Model
>
> Definition: A "plain" access is one which is not "marked". Marked
> accesses include all those annotated with READ_ONCE or WRITE_ONCE,
> plus acquire and release accesses, Read-Modify-Write (RMW) atomic
> and bitop accesses, and locking accesses.
>
> Plain accesses are those the compiler is free to mess around with.
> Marked accesses are implemented in the kernel by means of inline
> assembly or as volatile accesses, which greatly restricts what the
> compiler can do.
>
> Definition: A "region" of code is a program-order source segment
> contained in a single thread that is maximal with respect to not
> containing any compiler barriers (i.e., it is bounded at each end by
> either a compiler barrier or the start or end of its thread).
>
> Assumption: All memory barriers are also compiler barriers. This
> includes acquire loads and release stores, which are not considered to
> be compiler barriers in the C++11 Standard, as well as non-relaxed RMW
> atomic operations. (Question: should this include situation-specific
> memory barriers such as smp_mb__after_unlock_lock? Probably not.)

I'd have:

*x = 1; /* A */
smp_mb__before_atomic();
r0 = xchg_relaxed(x, 2); /* B (read or write part) */

=> (A ->barrier B)

smp_mb__after_unlock_lock() seems more tricky, given that it generates
inter-thread links.

Intuitively, we would have a "barrier" that is a superset of "mb" (I'm
not saying that this is feasible, but this seems worth considering...)

---
I don't get the motivations for the component:

(po ; [Acquire | Release]) |
([Acquire | Release] ; po)

in "barrier". Why did you write it?

Also, consider the snippets:

WRITE_ONCE(*x, 1);
*x = 2;

and

smp_store_release(x, 1);
*x = 2;

The first snippet would be flagged "mixed-accesses" by the patch while
the second snippet would not (thanks to the above component); was this
intentional? (Notice that some implementations map the latter to:

barrier();
WRITE_ONCE(*x, 1);
*x = 2;

)


>
> To avoid complications, the LKMM will consider only code in which
> plain writes are separated by a compiler barrier from any marked
> accesses of the same location. (Question: Can this restriction be
> removed?) As a consequence, if a region contains any marked accesses
> to a particular location then it cannot contain any plain writes to
> that location.

I don't know if it can be removed... That said, maybe we should go back
to the (simpler?) question: "What can go wrong without the restriction?"
;-) IOW, what are those "complications"? Can you provide some examples?


>
> This analysis is guided by the intuition that one can set up a limited
> correspondence between an execution of the source code and an
> execution of the generated object code. Such a correspondence is
> based on some limitations on what compilers can do when translating a
> program:
>
> Each marked access in the source-level execution must
> correspond to a unique instruction in the object-level
> execution. The corresponding events must access the same
> location, have the same values, and be of the same type (read
> or write).
>
> If a marked access is address-dependent on a marked read then
> the corresponding object-level instructions must have the same
> dependency (i.e., the compiler does not break address
> dependencies for marked accesses).

We know that this "limitation" does not hold in this generality. Your
proposal goes even further than that (by adding address dep. from marked
reads to marked accesses): a warning, and maybe also a pointer to doc.
such as Documentation/RCU/rcu_dereference.txt, wouldn't have hurt! ;-)


>
> If a source region contains no plain accesses to a location
> then the corresponding object-level region contains no
> accesses to that location other than those corresponding to
> marked accesses.
>
> If a source region contains no plain writes to a location then
> the corresponding object-level region contains no stores to
> that location other than those corresponding to marked writes.
>
> If a source region contains a plain write then the object code
> corresponding to that region must contain at least one access
> (it could be a load instead of a store) to the same location.
>
> If all the accesses to a particular location in some source
> region are address-dependent on one of the marked reads in
> some set S then every object-level access of that location
> must be dependent (not necessarily address-dependent, when the
> access is a store) on one of the instructions corresponding to
> the members of S.
>
> The object code for a region must ensure that the final value
> stored in a location is the same as the value of the po-final
> source write in the region.
>
> The positions of the load and store instructions in the object code
> for a region do not have to bear any particular relation to the plain
> accesses in the source code. Subject to the restrictions above, a
> sequence of m plain writes in the source could be implemented by a
> sequence of n store instructions at runtime where n is <, =, or > m,
> and likewise for plain reads.
>
> Similarly, the rf, co, and fr relations for plain accesses in
> source-level executions have very limited meaning.
>
> Given an object-level execution, let A1 and B1 be accesses in region
> R1 and let A2 and B2 be accesses in region R2, all of the same
> location, not all reads, and not all corresponding to marked accesses.
> (We explicitly allow A1 and B1 to be the same access, and the same for
> A2 and B2.) The execution has a data race if the coherence ordering
> from A1 to A2 is opposite to the ordering from B1 to B2.
>
> Definition: Two accesses of the same location "conflict" if they are
> in different threads, they are not both marked, and they are not both
> reads.
>
> The notions "executes before" and "is visible to" have already been
> defined for marked accesses in the LKMM or in earlier proposals for
> handling plain accesses (the xb -- or more properly, xb* -- and vis
> relations). I trust the ideas are familiar to everyone. Note that
> vis is contained in xb. Also note that the vis relation is not
> transitive, because smp_wmb is not A-cumulative. That is, it's
> possible for W1 to be visible to W2 and W2 to be visible to W3,
> without W1 being visible to W3. (However, on systems that are
> other-multicopy-atomic, vis is indeed transitive.)
>
> We want to define data races for source programs in such a way that a
> race-free source program has no data races in any of its object-level
> executions. To that end, we could begin with some requirements:
>
> For any two conflicting writes, one must be visible to the
> other.
>
> For a write conflicting with a read, either the write must be
> visible to the read or the read must execute before the write.

I'm missing the link with the notion of "data race of an object-level
execution" given above (it is possible that I did not understand that
description): could you elaborate? (maybe by providing some examples)


>
> However, this isn't strong enough. A source-level write can give rise
> to object-level loads as well as stores. Therefore whenever we
> require that a plain write W be visible to some other access A, we
> must also require that if W were a read then it would execute before
> A. And similarly, if A is required to be visible to W, we must also
> require that if W were a read then A would still be visible to it.
>
> At the same time, it is too strong. If W1 is required to be visible
> to W2 and W2 is required to be visible to A, then W1 doesn't
> necessarily have to be visible to A. Visibility is required only in
> cases where W2 need not correspond to any object-level stores.
>
> Since vis and xb aren't defined for plain accesses, we have to extend
> the definitions. We begin by saying that a plain access is pre- or
> post-bounded by a marked access if the execution order can be
> guaranteed, as follows.
>
> Definition:
>
> A write W is "w-post-bounded" by a po-later marked access A if
> they are separated by an appropriate memory barrier (including
> the case where A is a release write).
>
> A read R is "r-post-bounded" by a po-later marked access A if
> they are separated by an appropriate memory barrier.
>
> Oddly, a plain write W is "r-post-bounded" by a po-later
> marked access A whenever W would be considered to be
> r-post-bounded by A if W were a read (for example, when A is a
> read and the two are separated by smp_rmb).

Looking at the patch, I see that a plain read can be "w-post-bounded":
should/can we prevent this from happening?


>
> In addition, a marked write is w-post-bounded by itself and a
> marked read is r-post-bounded by itself.
>
> "w-pre-bounded" and "r-pre-bounded" by a po-earlier marked access
> are defined analogously, except that we also include cases where
> the later plain access is address-dependent on the marked access.
>
> Note that as a result of these definitions, if one plain write is
> w-post-bounded by a marked access then all the writes in the same
> region are (this is because the memory barrier which enforces the
> bounding is also a region delimiter). And if one plain access is
> r-post-bounded by a marked access then all the plain accesses in the
> same region are. The same cannot be said of w-pre-bounded and
> r-pre-bounded, though, because of possible address dependencies.
>
> Definition: ww-vis, ww-xb, wr-vis, wr-xb, and rw-xb are given by the
> following scheme. Let i and j each be either w or r, and let rel be
> either vis or xb. Then access A is related to access D by ij-rel if
> there are marked accesses B and C such that:
>
> A ->i-post-bounded B ->rel C ->j-pre-bounded D.
>
> For example, A is related by wr-vis to D if there are marked accesses
> B and C such that A is w-post-bounded by B, B ->vis C, and C
> r-pre-bounds D. In cases where A or D is marked, B could be equal to
> A or C could be equal to D.
>
> As mentioned earlier, ww-vis is contained in ww-xb and wr-vis is
> contained in wr-xb. It turns out that the LKMM can almost prove that
> the ij-xb relations are transitive, in the following sense. If W is a
> write and R is a read, then:
>
> (A ->iw-xb W ->wj-xb C) implies (A ->ij-xb C), and
>
> (A ->ir-xb R ->rj-xb C) implies (A ->ij-xb C).
>
> To fully prove this requires adding one new term to the ppo relation:
>
> [Marked & R] ; addr ; [Plain & W] ; wmb ; [Marked & W]

... or removing any of these links, right?


>
> For example, given the code:
>
> r = READ_ONCE(ptr);
> *r = 27;
> smp_wmb();
> WRITE_ONCE(x, 1);
>
> the new term says that the READ_ONCE must execute before the
> WRITE_ONCE. To justify this, note that the object code is obliged to
> ensure that *r contains 27 (or some co-later value) before the smp_wmb
> executes. It can do so in one of three ways:
>
> (1) Actually store a 27 through the *r pointer;
>
> (2) Load through the *r pointer and check that the location already
> holds 27; or
>
> (3) Check that r aliases a pointer to a location known to hold 27
> already.
>
> In case (1), the ordering is enforced by the address dependency and
> the smp_wmb. In case (2) there is an address dependency to the *r
> load and a conditional depending on that load. Control dependencies
> in object-code executions extend to all po-later stores; hence the
> WRITE_ONCE must be ordered after the *r load. In case (3) there is a
> conditional depending on the READ_ONCE and po-before the WRITE_ONCE.

The proposed model is unsound w.r.t. (2); c.f., LB1 vs. LB2 from:

http://lkml.kernel.org/r/20190116213658.GA3984@andrea

Mmh..., no particular suggestions at the moment (just blaming "addr"
and "wmb", I guess ;-) ); the fact is that "soundness" looks like a
very desirable property...


>
> With this new term added, the LKMM can show that a cycle in the ij-xb
> relations would give rise to a cycle in xb of marked accesses. It
> follows that in an allowed execution, the regions containing accesses
> to a particular location x can be totally ordered in a way compatible
> with the ij-xb relations, and this ordering must agree with the co and
> rf orderings for x.
>
> Therefore we can try defining a race-free execution to be one which
> obeys the following requirements:
>
> ww-race: If W1 ->co W2 and the two writes conflict then we must have
> W1 ->ww-vis W2. If W1 is a plain write then we must also have
> W1 ->rw-xb W2, and if W2 is a plain write then we must also
> have W1 ->wr-vis W2.
>
> wr-race: If W ->(co?;rf) R and the two accesses conflict then we must
> have W ->wr-vis R.
>
> rw-race: If R ->fr W and the two accesses conflict then we must have
> R ->rw-xb W.
>
> However, as described earlier, this is stronger than we really need.
> In ww-race, we don't need to have W1 ->ww-vis W2 or W1 ->wr-vis W2 if
> there is a third write W3 in between such that W3 must give rise to a
> store in the object code. In this case it doesn't matter whether W1
> is visible to W2 or not; they can't race because W3 will be visible to
> W2 and being co-later than W1, it will prevent W1 from interfering
> with W2. Much the same is true for wr-race.
>
> If W3 is a marked write, it certainly corresponds to a write in the
> object code. But what if it is a plain write?
>
> Definition: A write is "major" if it is not overwritten by any
> po-later writes in its region. A major write is "super" if it has a
> different value from the co-preceding major write. In addition, all
> marked writes are considered to be super.
>
> If a region contains a super write then the object code for the region
> must contain a store to the write's location. Otherwise, when
> execution reached the end of the region at runtime, the value
> contained in that location would still be the value from the end of
> the preceding region -- that is, the value would be wrong.
>
> Thus, ww-race need not apply if there is a super write co-between W1
> and W2. Likewise, wr-race need not apply if there is a super write
> co-after W and co?-before the write which R reads from. Note that
> these weakenings do not apply in situations where W2 is the co-next
> write after W1 or when R reads from W; in such cases ww-race and
> wr-race retain their full force.

I don't know yet... Frankly, this idea of "semantics depending on the
_values_ of plain accesses" simply scares me. ;-)

Let's look at your "non-transitive-wmb" test: I tend to read the write
"*x = 2" as nothing more than "write 2 to the 1st byte, write 0 to the
2nd byte, write 0 to the next byte, ..."; and similarly for "*x = 1";
maybe some of these writes aren't performed at all (the compiler knows
that the most significant byte, say, is never touched/modified); maybe
they intersect (overwrite) one another... what else? So, for example,
if we "slice" the most significant parts of the plain writes in P0 and
P1 then they have the "same value" in a certain sense.

Mmh, I still don't feel comfortable with these concepts, sorry; maybe I
just need more time to process them...


>
> The LKMM also has to detect forbidden executions involving plain
> accesses. There are three obvious coherence-related checks:
>
> If W ->rf R then we must not have R ->rw-xb W.
>
> If R ->fr W then we must not have W ->wr-vis R.
>
> If W1 ->co W2 then we must not have W2 ->ww-vis W1.
>
> (Question: Are these checks actually correct? I'm not certain; they
> are hard to reason about because co, rf, and fr don't have clear
> meanings for plain accesses in racy executions.)

So maybe they were not that "obvious". ;-)

This all deserves a careful review and testing; unfortunately, I doubt
that I'll be able to do much more in the next couple of weeks. bbl,

Andrea


>
> These proposed changes do not include any ordering due to dependencies
> from plain reads to marked accesses, or from overwrites. Perhaps we
> will want to add them later on.
>
> Some relevant and interesting litmus tests:
>
> C non-transitive-wmb
> (* allowed, no race *)
>
> {}
>
> P0(int *x, int *y)
> {
> *x = 1;
> smp_store_release(y, 1);
> }
>
> P1(int *x, int *y, int *z)
> {
> int r1;
>
> r1 = smp_load_acquire(y);
> if (r1) {
> *x = 2;
> smp_wmb();
> WRITE_ONCE(*z, 1);
> }
> }
>
> P2(int *x, int *z)
> {
> int r3;
> int r4 = 0;
>
> r3 = READ_ONCE(*z);
> if (r3) {
> smp_rmb();
> r4 = READ_ONCE(*x);
> }
> }
>
> exists (2:r3=1 /\ 2:r4=2)
>
> If the plain write in P1 were changed to "*x = 1" then it would no
> longer be super, and P0 would race with P2.
>
> C LB1
> (* forbidden, no race *)
>
> {
> int *x = &a;
> }
>
> P0(int **x, int *y)
> {
> int *r0;
>
> r0 = rcu_dereference(*x);
> *r0 = 0;
> smp_wmb();
> WRITE_ONCE(*y, 1);
> }
>
> P1(int **x, int *y, int *b)
> {
> int r0;
>
> r0 = READ_ONCE(*y);
> rcu_assign_pointer(*x, b);
> }
>
> exists (0:r0=b /\ 1:r0=1)
>
> The new term in ppo forces P0's rcu_dereference to execute before the
> WRITE_ONCE.
>
> C non-race4
> (* allowed (should be forbidden), no race *)
>
> {
> int *x = &a;
> }
>
> P0(int **x, int *b)
> {
> *b = 1;
> smp_store_release(x, b);
> }
>
> P1(int **x, int **tmp)
> {
> int *r1;
> int *r2;
> int r3;
>
> r1 = READ_ONCE(*x);
> *tmp = r1;
> r2 = *tmp;
> r3 = *r2;
> }
>
> exists (1:r1=b /\ 1:r3=0)
>
> It's pretty obvious that P1's read of *r2 must execute after the
> READ_ONCE, but the model doesn't know this.
>
> C overwrite-race
>
> {}
>
> P0(int *x)
> {
> int r0;
>
> r0 = *x;
> WRITE_ONCE(*x, 1);
> }
>
> P1(int *x)
> {
> int r1;
>
> r1 = READ_ONCE(*x);
> if (r1 == 1)
> WRITE_ONCE(*x, 2);
> }
>
> exists (1:r1=1)
>
> This counts as a race, because the compiler is allowed to generate
> loads from *x po-after P0's WRITE_ONCE.
>
> C plain-ppo1
> (* allowed (should be forbidden), no race *)
>
> {
> int *x = &u;
> int u = 0;
> int z = 5;
> }
>
> P0(int **x, int *y)
> {
> int *r0;
> int r1;
>
> r0 = READ_ONCE(*x);
> r1 = *r0;
> WRITE_ONCE(*y, r1);
> }
>
> P1(int **x, int *y, int *z)
> {
> int r2;
>
> r2 = READ_ONCE(*y);
> if (r2)
> WRITE_ONCE(*x, z);
> }
>
> exists (0:r0=z /\ 0:r1=5 /\ 1:r2=5)
>
> The model does not recognize that P0's WRITE_ONCE must execute after
> the READ_ONCE.

> tools/memory-model/linux-kernel.bell | 6 ++
> tools/memory-model/linux-kernel.cat | 94 +++++++++++++++++++++++++++++------
> tools/memory-model/linux-kernel.def | 1
> 3 files changed, 86 insertions(+), 15 deletions(-)
>
> Index: lkmm/tools/memory-model/linux-kernel.bell
> ===================================================================
> --- lkmm.orig/tools/memory-model/linux-kernel.bell
> +++ lkmm/tools/memory-model/linux-kernel.bell
> @@ -24,6 +24,7 @@ instructions RMW[{'once,'acquire,'releas
> enum Barriers = 'wmb (*smp_wmb*) ||
> 'rmb (*smp_rmb*) ||
> 'mb (*smp_mb*) ||
> + 'barrier (*barrier*) ||
> 'rcu-lock (*rcu_read_lock*) ||
> 'rcu-unlock (*rcu_read_unlock*) ||
> 'sync-rcu (*synchronize_rcu*) ||
> @@ -76,3 +77,8 @@ flag ~empty rcu-rscs & (po ; [Sync-srcu]
>
> (* Validate SRCU dynamic match *)
> flag ~empty different-values(srcu-rscs) as srcu-bad-nesting
> +
> +(* Compute marked and plain memory accesses *)
> +let Marked = (~M) | IW | Once | Release | Acquire | domain(rmw) | range(rmw) |
> + LKR | LKW | UL | LF | RL | RU
> +let Plain = M \ Marked
> Index: lkmm/tools/memory-model/linux-kernel.cat
> ===================================================================
> --- lkmm.orig/tools/memory-model/linux-kernel.cat
> +++ lkmm/tools/memory-model/linux-kernel.cat
> @@ -24,8 +24,14 @@ include "lock.cat"
> (* Basic relations *)
> (*******************)
>
> +(* Release Acquire *)
> +let acq-po = [Acquire] ; po ; [M]
> +let po-rel = [M] ; po ; [Release]
> +let po-unlock-rf-lock-po = po ; [UL] ; rf ; [LKR] ; po
> +
> (* Fences *)
> -let rmb = [R \ Noreturn] ; fencerel(Rmb) ; [R \ Noreturn]
> +let R4rmb = R \ Noreturn (* Reads for which rmb works *)
> +let rmb = [R4rmb] ; fencerel(Rmb) ; [R4rmb]
> let wmb = [W] ; fencerel(Wmb) ; [W]
> let mb = ([M] ; fencerel(Mb) ; [M]) |
> ([M] ; fencerel(Before-atomic) ; [RMW] ; po? ; [M]) |
> @@ -34,13 +40,14 @@ let mb = ([M] ; fencerel(Mb) ; [M]) |
> ([M] ; po ; [UL] ; (co | po) ; [LKW] ;
> fencerel(After-unlock-lock) ; [M])
> let gp = po ; [Sync-rcu | Sync-srcu] ; po?
> -
> let strong-fence = mb | gp
>
> -(* Release Acquire *)
> -let acq-po = [Acquire] ; po ; [M]
> -let po-rel = [M] ; po ; [Release]
> -let po-unlock-rf-lock-po = po ; [UL] ; rf ; [LKR] ; po
> +let nonrw-fence = strong-fence | po-rel | acq-po
> +let fence = nonrw-fence | wmb | rmb
> +let barrier = fencerel(Barrier | Rmb | Wmb | Mb | Sync-rcu | Sync-srcu |
> + Acquire | Release) |
> + (po ; [Acquire | Release]) |
> + ([Acquire | Release] ; po)
>
> (**********************************)
> (* Fundamental coherence ordering *)
> @@ -61,21 +68,22 @@ empty rmw & (fre ; coe) as atomic
> let dep = addr | data
> let rwdep = (dep | ctrl) ; [W]
> let overwrite = co | fr
> -let to-w = rwdep | (overwrite & int)
> -let to-r = addr | (dep ; rfi)
> -let fence = strong-fence | wmb | po-rel | rmb | acq-po
> +let to-w = rwdep | (overwrite & int) | (addr ; [Plain] ; wmb)
> +let to-r = addr | (dep ; [Marked] ; rfi)
> let ppo = to-r | to-w | fence | (po-unlock-rf-lock-po & int)
>
> (* Propagation: Ordering from release operations and strong fences. *)
> -let A-cumul(r) = rfe? ; r
> -let cumul-fence = A-cumul(strong-fence | po-rel) | wmb | po-unlock-rf-lock-po
> -let prop = (overwrite & ext)? ; cumul-fence* ; rfe?
> +let A-cumul(r) = (rfe ; [Marked])? ; r
> +let cumul-fence = [Marked] ; (A-cumul(strong-fence | po-rel) | wmb |
> + po-unlock-rf-lock-po) ; [Marked]
> +let prop = [Marked] ; (overwrite & ext)? ; cumul-fence* ;
> + [Marked] ; rfe? ; [Marked]
>
> (*
> * Happens Before: Ordering from the passage of time.
> * No fences needed here for prop because relation confined to one process.
> *)
> -let hb = ppo | rfe | ((prop \ id) & int)
> +let hb = [Marked] ; (ppo | rfe | ((prop \ id) & int)) ; [Marked]
> acyclic hb as happens-before
>
> (****************************************)
> @@ -83,7 +91,7 @@ acyclic hb as happens-before
> (****************************************)
>
> (* Propagation: Each non-rf link needs a strong fence. *)
> -let pb = prop ; strong-fence ; hb*
> +let pb = prop ; strong-fence ; hb* ; [Marked]
> acyclic pb as propagation
>
> (*******)
> @@ -131,7 +139,7 @@ let rec rcu-fence = rcu-gp | srcu-gp |
> (rcu-fence ; rcu-link ; rcu-fence)
>
> (* rb orders instructions just as pb does *)
> -let rb = prop ; po ; rcu-fence ; po? ; hb* ; pb*
> +let rb = prop ; po ; rcu-fence ; po? ; hb* ; pb* ; [Marked]
>
> irreflexive rb as rcu
>
> @@ -143,3 +151,59 @@ irreflexive rb as rcu
> * let xb = hb | pb | rb
> * acyclic xb as executes-before
> *)
> +
> +
> +(*********************************)
> +(* Plain accesses and data races *)
> +(*********************************)
> +
> +(* Warn about plain writes and marked accesses in the same region *)
> +let mixed-accesses = ([Plain & W] ; (po-loc \ barrier) ; [Marked]) |
> + ([Marked] ; (po-loc \ barrier) ; [Plain & W])
> +flag ~empty mixed-accesses as mixed-accesses
> +
> +(* Executes-before and visibility *)
> +let xbstar = (hb | pb | rb)*
> +let full-fence = strong-fence | (po ; rcu-fence ; po?)
> +let vis = cumul-fence* ; rfe? ; [Marked] ;
> + ((full-fence ; [Marked] ; xbstar) | (xbstar & int))
> +
> +(* Boundaries for lifetimes of plain accesses *)
> +let w-pre-bounded = [Marked] ; (addr | fence)?
> +let r-pre-bounded = [Marked] ; (addr | nonrw-fence |
> + ([R4rmb] ; fencerel(Rmb) ; [~Noreturn]))?
> +let w-post-bounded = fence? ; [Marked]
> +let r-post-bounded = (nonrw-fence | ([~Noreturn] ; fencerel(Rmb) ; [R4rmb]))? ;
> + [Marked]
> +
> +(* Visibility and executes-before for plain accesses *)
> +let ww-vis = w-post-bounded ; vis ; w-pre-bounded
> +let wr-vis = w-post-bounded ; vis ; r-pre-bounded
> +let rw-xbstar = r-post-bounded ; xbstar ; w-pre-bounded
> +
> +(* Potential races *)
> +let pre-race = ext & ((Plain * M) | ((M \ IW) * Plain))
> +
> +(* Coherence requirements for plain accesses *)
> +let wr-incoh = pre-race & rf & rw-xbstar^-1
> +let rw-incoh = pre-race & fr & wr-vis^-1
> +let ww-incoh = pre-race & co & ww-vis^-1
> +empty (wr-incoh | rw-incoh | ww-incoh) as plain-coherence
> +
> +(* Actual races *)
> +let ww-nonrace = ww-vis & ((Marked * W) | rw-xbstar) & ((W * Marked) | wr-vis)
> +let ww-race = (pre-race & co) \ ww-nonrace
> +let wr-race = (pre-race & (co? ; rf)) \ wr-vis
> +let rw-race = (pre-race & fr) \ rw-xbstar
> +
> +(* Major writes, super writes, and superseded writes *)
> +let Major-w = W \ domain(coi \ barrier)
> +let Super-w = (W & Marked) |
> + range(different-values(singlestep([Major-w] ; co ; [Major-w])))
> +let superseded-w = co ; [Super-w] ; (co | (co? ; rf))
> +
> +(* Superseded writes don't race *)
> +let ww-race = ww-race \ superseded-w
> +let wr-race = wr-race \ superseded-w
> +
> +flag ~empty (ww-race | wr-race | rw-race) as data-race
> Index: lkmm/tools/memory-model/linux-kernel.def
> ===================================================================
> --- lkmm.orig/tools/memory-model/linux-kernel.def
> +++ lkmm/tools/memory-model/linux-kernel.def
> @@ -24,6 +24,7 @@ smp_mb__before_atomic() { __fence{before
> smp_mb__after_atomic() { __fence{after-atomic}; }
> smp_mb__after_spinlock() { __fence{after-spinlock}; }
> smp_mb__after_unlock_lock() { __fence{after-unlock-lock}; }
> +barrier() { __fence{barrier}; }
>
> // Exchange
> xchg(X,V) __xchg{mb}(X,V)

2019-04-02 18:29:33

by Alan Stern

Subject: Re: Adding plain accesses and detecting data races in the LKMM

On Tue, 2 Apr 2019, Andrea Parri wrote:

> On Tue, Mar 19, 2019 at 03:38:19PM -0400, Alan Stern wrote:
> > LKMM maintainers and other interested parties:
> >
> > Here is my latest proposal for extending the Linux Kernel Memory Model
> > to handle plain accesses and data races. The document below explains
> > the rationale behind the proposal, and a patch (based on the dev
> > branch of Paul's linux-rcu tree) is attached.
>
> Thank you for this! Unfortunately, I didn't have enough time to do it
> justice yet... Below are some first thoughts/questions.

Thanks for the feedback.

> > This extension is not a complete treatment of plain accesses, because
> > it pretty much ignores any ordering that plain accesses can provide.
> > For example, given:
> >
> > r0 = rcu_dereference(ptr);
> > r1 = r0->branch;
> > r2 = READ_ONCE(r1->data);
> >
> > the memory model will not realize that the READ_ONCE must execute after
> > the rcu_dereference, because it doesn't take into account the address
> > dependency from the intermediate plain read. Hopefully we will add
> > such things to the memory model later on. Concentrating on data races
> > seems like enough for now.
> >
> > Some of the ideas below were originally proposed for the LKMM by Andrea
> > Parri.
> >
> > Alan
> >
> > -----------------------------------------------------------------------
> >
> > A Plan For Modeling Plain Accesses In The Linux Kernel Memory Model
> >
> > Definition: A "plain" access is one which is not "marked". Marked
> > accesses include all those annotated with READ_ONCE or WRITE_ONCE,
> > plus acquire and release accesses, Read-Modify-Write (RMW) atomic
> > and bitop accesses, and locking accesses.
> >
> > Plain accesses are those the compiler is free to mess around with.
> > Marked accesses are implemented in the kernel by means of inline
> > assembly or as volatile accesses, which greatly restricts what the
> > compiler can do.
> >
> > Definition: A "region" of code is a program-order source segment
> > contained in a single thread that is maximal with respect to not
> > containing any compiler barriers (i.e., it is bounded at each end by
> > either a compiler barrier or the start or end of its thread).
> >
> > Assumption: All memory barriers are also compiler barriers. This
> > includes acquire loads and release stores, which are not considered to
> > be compiler barriers in the C++11 Standard, as well as non-relaxed RMW
> > atomic operations. (Question: should this include situation-specific
> > memory barriers such as smp_mb__after_unlock_lock? Probably not.)
>
> I'd have:
>
> *x = 1; /* A */
> smp_mb__before_atomic();
> r0 = xchg_relaxed(x, 2); /* B (read or write part) */
>
> => (A ->barrier B)

Perhaps so. It wasn't clear initially how these should be treated, so
I just ignored them. For example, what should happen here if there
were other statements between the smp_mb__before_atomic and the
xchg_relaxed?
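
That is, something along these lines (a hypothetical variant of your
snippet, with an extra plain access in the middle):

	*x = 1;				/* A */
	smp_mb__before_atomic();
	*y = 2;				/* should the barrier cover this too? */
	r0 = xchg_relaxed(x, 2);	/* B */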

> smp_mb__after_unlock_lock() seems more tricky, given that it generates
> inter-thread links.

The inter-thread part is completely irrelevant as far as compiler
barriers are concerned.

> Intuitively, we would have a "barrier" that is a superset of "mb" (I'm
> not saying that this is feasible, but this seems worth considering...)

I don't understand that comment.

> ---
> I don't get the motivations for the component:
>
> (po ; [Acquire | Release]) |
> ([Acquire | Release] ; po)
>
> in "barrier". Why did you write it?

I was thinking that in situations like:

A: z = 1;
B: smp_store_release(z, 2);
C: r = z;

in addition to there being a compiler barrier between A and C, there
effectively has to be a barrier between A and B (that is, the object
code has to store first 1 and then 2 to z) and between B and C (that
is, the object code can't skip doing the load from z and just set r to
2).

The basic principle was the idea that load-acquire and store-release
are special because each of them is both a memory access and a memory
barrier at the same time. Now, I didn't give this a tremendous amount
of thought and it's quite possible that the Cat code should be changed.
For example, maybe there should be a compiler barrier immediately
before a store-release but not immediately after, and vice versa for
load-acquire.

> Also, consider the snippets:
>
> WRITE_ONCE(*x, 1);
> *x = 2;
>
> and
>
> smp_store_release(x, 1);
> *x = 2;
>
> The first snippet would be flagged "mixed-accesses" by the patch while
> the second snippet would not (thanks to the above component); was this
> intentional? (Notice that some implementations map the latter to:
>
> barrier();
> WRITE_ONCE(*x, 1);
> *x = 2;
>
> )

That last observation is a good reason for changing the Cat code.

> > To avoid complications, the LKMM will consider only code in which
> > plain writes are separated by a compiler barrier from any marked
> > accesses of the same location. (Question: Can this restriction be
> > removed?) As a consequence, if a region contains any marked accesses
> > to a particular location then it cannot contain any plain writes to
> > that location.
>
> I don't know if it can be removed... That said, maybe we should go back
> to the (simpler?) question: "What can go wrong without the restriction?"
> ;-) IOW, what are those "complications"? Can you provide some examples?

Here's an example I sent to Will a little while ago. Suppose we
have:

r = rcu_dereference(ptr);
*r = 1;
WRITE_ONCE(x, 2);

Imagine if the compiler transformed this to:

r = rcu_dereference(ptr);
WRITE_ONCE(x, 2);
if (r != &x)
	*r = 1;

This is bad because it destroys the address dependency when ptr
contains &x.

> > This analysis is guided by the intuition that one can set up a limited
> > correspondence between an execution of the source code and an
> > execution of the generated object code. Such a correspondence is
> > based on some limitations on what compilers can do when translating a
> > program:
> >
> > Each marked access in the source-level execution must
> > correspond to a unique instruction in the object-level
> > execution. The corresponding events must access the same
> > location, have the same values, and be of the same type (read
> > or write).
> >
> > If a marked access is address-dependent on a marked read then
> > the corresponding object-level instructions must have the same
> > dependency (i.e., the compiler does not break address
> > dependencies for marked accesses).
>
> We know that this "limitation" does not hold in this generality. Your
> proposal goes even further than that (by adding address dep. from marked
> reads to marked accesses): a warning, and maybe also a pointer to doc.
> such as Documentation/RCU/rcu_dereference.txt, wouldn't have hurt! ;-)

Agreed.

> > If a source region contains no plain accesses to a location
> > then the corresponding object-level region contains no
> > accesses to that location other than those corresponding to
> > marked accesses.
> >
> > If a source region contains no plain writes to a location then
> > the corresponding object-level region contains no stores to
> > that location other than those corresponding to marked writes.
> >
> > If a source region contains a plain write then the object code
> > corresponding to that region must contain at least one access
> > (it could be a load instead of a store) to the same location.
> >
> > If all the accesses to a particular location in some source
> > region are address-dependent on one of the marked reads in
> > some set S then every object-level access of that location
> > must be dependent (not necessarily address-dependent, when the
> > access is a store) on one of the instructions corresponding to
> > the members of S.
> >
> > The object code for a region must ensure that the final value
> > stored in a location is the same as the value of the po-final
> > source write in the region.
> >
> > The positions of the load and store instructions in the object code
> > for a region do not have to bear any particular relation to the plain
> > accesses in the source code. Subject to the restrictions above, a
> > sequence of m plain writes in the source could be implemented by a
> > sequence of n store instructions at runtime where n is <, =, or > m,
> > and likewise for plain reads.
> >
> > Similarly, the rf, co, and fr relations for plain accesses in
> > source-level executions have very limited meaning.
> >
> > Given an object-level execution, let A1 and B1 be accesses in region
> > R1 and let A2 and B2 be accesses in region R2, all of the same
> > location, not all reads, and not all corresponding to marked accesses.
> > (We explicitly allow A1 and B1 to be the same access, and the same for
> > A2 and B2.) The execution has a data race if the coherence ordering
> > from A1 to A2 is opposite to the ordering from B1 to B2.
> >
> > Definition: Two accesses of the same location "conflict" if they are
> > in different threads, they are not both marked, and they are not both
> > reads.
> >
> > The notions "executes before" and "is visible to" have already been
> > defined for marked accesses in the LKMM or in earlier proposals for
> > handling plain accesses (the xb -- or more properly, xb* -- and vis
> > relations). I trust the ideas are familiar to everyone. Note that
> > vis is contained in xb. Also note that the vis relation is not
> > transitive, because smp_wmb is not A-cumulative. That is, it's
> > possible for W1 to be visible to W2 and W2 to be visible to W3,
> > without W1 being visible to W3. (However, on systems that are
> > other-multicopy-atomic, vis is indeed transitive.)
> >
> > We want to define data races for source programs in such a way that a
> > race-free source program has no data races in any of its object-level
> > executions. To that end, we could begin with some requirements:
> >
> > For any two conflicting writes, one must be visible to the
> > other.
> >
> > For a write conflicting with a read, either the write must be
> > visible to the read or the read must execute before the write.
>
> I'm missing the link with the notion of "data race of an object-level
> execution" given above (it is possible that I did not understand that
> description): could you elaborate? (maybe by providing some examples)

The argument here is based on an unstated principle (I didn't state it
explicitly because I don't know if it is really true): Given a
candidate execution in which write W is not visible to write X even
though W is co-before X, there is another, very similar (whatever
that means!) execution in which X is co-before W.

So consider as an example the case where W is a marked write which
conflicts with a plain write X. Since X is plain, the compiler is
allowed to generate object code which carries out the X write twice;
let's call these object-level writes X1 and X2. If W is not visible to
X1 and X2 and yet W ->co X2, the principle states that there is
an execution in which X1 ->co W ->co X2. This fits the object-level
definition of a data race above.
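
In other words (a sketch):

	/* Source */
	*x = 1;			/* X: plain write */

	/* One possible object-level translation */
	*x = 1;			/* X1 */
	...			/* other work in the same region */
	*x = 1;			/* X2 */

If W is not visible to X1 or X2 and yet W ->co X2, the principle yields
an execution with X1 ->co W ->co X2, which is exactly the object-level
definition of a data race given earlier.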

> > However, this isn't strong enough. A source-level write can give rise
> > to object-level loads as well as stores. Therefore whenever we
> > require that a plain write W be visible to some other access A, we
> > must also require that if W were a read then it would execute before
> > A. And similarly, if A is required to be visible to W, we must also
> > require that if W were a read then A would still be visible to it.
> >
> > At the same time, it is too strong. If W1 is required to be visible
> > to W2 and W2 is required to be visible to A, then W1 doesn't
> > necessarily have to be visible to A. Visibility is required only in
> > cases where W2 need not correspond to any object-level stores.
> >
> > Since vis and xb aren't defined for plain accesses, we have to extend
> > the definitions. We begin by saying that a plain access is pre- or
> > post-bounded by a marked access if the execution order can be
> > guaranteed, as follows.
> >
> > Definition:
> >
> > A write W is "w-post-bounded" by a po-later marked access A if
> > they are separated by an appropriate memory barrier (including
> > the case where A is a release write).
> >
> > A read R is "r-post-bounded" by a po-later marked access A if
> > they are separated by an appropriate memory barrier.
> >
> > Oddly, a plain write W is "r-post-bounded" by a po-later
> > marked access A whenever W would be considered to be
> > r-post-bounded by A if W were a read (for example, when A is a
> > read and the two are separated by smp_rmb).
>
> Looking at the patch, I see that a plain read can be "w-post-bounded":
> should/can we prevent this from happening?

It doesn't matter. w-post-bounded gets used only in situations where
the left-hand access is a write. (Unless I made a mistake somewhere.)

> > In addition, a marked write is w-post-bounded by itself and a
> > marked read is r-post-bounded by itself.
> >
> > "w-pre-bounded" and "r-pre-bounded" by a po-earlier marked access
> > are defined analogously, except that we also include cases where
> > the later plain access is address-dependent on the marked access.
> >
> > Note that as a result of these definitions, if one plain write is
> > w-post-bounded by a marked access then all the writes in the same
> > region are (this is because the memory barrier which enforces the
> > bounding is also a region delimiter). And if one plain access is
> > r-post-bounded by a marked access then all the plain accesses in the
> > same region are. The same cannot be said of w-pre-bounded and
> > r-pre-bounded, though, because of possible address dependencies.
> >
> > Definition: ww-vis, ww-xb, wr-vis, wr-xb, and rw-xb are given by the
> > following scheme. Let i and j each be either w or r, and let rel be
> > either vis or xb. Then access A is related to access D by ij-rel if
> > there are marked accesses B and C such that:
> >
> > A ->i-post-bounded B ->rel C ->j-pre-bounded D.
> >
> > For example, A is related by wr-vis to D if there are marked accesses
> > B and C such that A is w-post-bounded by B, B ->vis C, and C
> > r-pre-bounds D. In cases where A or D is marked, B could be equal to
> > A or C could be equal to D.
> >
> > As mentioned earlier, ww-vis is contained in ww-xb and wr-vis is
> > contained in wr-xb. It turns out that the LKMM can almost prove that
> > the ij-xb relations are transitive, in the following sense. If W is a
> > write and R is a read, then:
> >
> > (A ->iw-xb W ->wj-xb C) implies (A ->ij-xb C), and
> >
> > (A ->ir-xb R ->rj-xb C) implies (A ->ij-xb C).
> >
> > To fully prove this requires adding one new term to the ppo relation:
> >
> > [Marked & R] ; addr ; [Plain & W] ; wmb ; [Marked & W]
>
> ... or removing any of these links, right?

I don't understand. Remove any of which links? Remove it from the
candidate execution or from the memory model?

> > For example, given the code:
> >
> > r = READ_ONCE(ptr);
> > *r = 27;
> > smp_wmb();
> > WRITE_ONCE(x, 1);
> >
> > the new term says that the READ_ONCE must execute before the
> > WRITE_ONCE. To justify this, note that the object code is obliged to
> > ensure that *r contains 27 (or some co-later value) before the smp_wmb
> > executes. It can do so in one of three ways:
> >
> > (1) Actually store a 27 through the *r pointer;
> >
> > (2) Load through the *r pointer and check that the location already
> > holds 27; or
> >
> > (3) Check that r aliases a pointer to a location known to hold 27
> > already.
> >
> > In case (1), the ordering is enforced by the address dependency and
> > the smp_wmb. In case (2) there is an address dependency to the *r
> > load and a conditional depending on that load. Control dependencies
> > in object-code executions extend to all po-later stores; hence the
> > WRITE_ONCE must be ordered after the *r load. In case (3) there is a
> > conditional depending on the READ_ONCE and po-before the WRITE_ONCE.
>
> The proposed model is unsound w.r.t. (2); c.f., LB1 vs. LB2 from:
>
> http://lkml.kernel.org/r/20190116213658.GA3984@andrea

Again, I don't understand. What's unsound about the proposed model?

For LB1 it says that the WRITE_ONCE must execute after the
rcu_dereference, which is correct. For LB2 it doesn't say anything,
because in LB2 there is no plain write event between the
rcu_dereference and the smp_wmb events.

> Mmh..., no particular suggestions at the moment (just blaming "addr"
> and "wmb", I guess ;-) ); the fact is that "soundness" looks like a
> very desirable property...
>
>
> >
> > With this new term added, the LKMM can show that a cycle in the ij-xb
> > relations would give rise to a cycle in xb of marked accesses. It
> > follows that in an allowed execution, the regions containing accesses
> > to a particular location x can be totally ordered in a way compatible
> > with the ij-xb relations, and this ordering must agree with the co and
> > rf orderings for x.
> >
> > Therefore we can try defining a race-free execution to be one which
> > obeys the following requirements:
> >
> > ww-race: If W1 ->co W2 and the two writes conflict then we must have
> > W1 ->ww-vis W2. If W1 is a plain write then we must also have
> > W1 ->rw-xb W2, and if W2 is a plain write then we must also
> > have W1 ->wr-vis W2.
> >
> > wr-race: If W ->(co?;rf) R and the two accesses conflict then we must
> > have W ->wr-vis R.
> >
> > rw-race: If R ->fr W and the two accesses conflict then we must have
> > R ->rw-xb W.
> >
> > However, as described earlier, this is stronger than we really need.
> > In ww-race, we don't need to have W1 ->ww-vis W2 or W1 ->wr-vis W2 if
> > there is a third write W3 in between such that W3 must give rise to a
> > store in the object code. In this case it doesn't matter whether W1
> > is visible to W2 or not; they can't race because W3 will be visible to
> > W2 and being co-later than W1, it will prevent W1 from interfering
> > with W2. Much the same is true for wr-race.
> >
> > If W3 is a marked write, it certainly corresponds to a write in the
> > object code. But what if it is a plain write?
> >
> > Definition: A write is "major" if it is not overwritten by any
> > po-later writes in its region. A major write is "super" if it has a
> > different value from the co-preceding major write. In addition, all
> > marked writes are considered to be super.
> >
> > If a region contains a super write then the object code for the region
> > must contain a store to the write's location. Otherwise, when
> > execution reached the end of the region at runtime, the value
> > contained in that location would still be the value from the end of
> > the preceding region -- that is, the value would be wrong.
> >
> > Thus, ww-race need not apply if there is a super write co-between W1
> > and W2. Likewise, wr-race need not apply if there is a super write
> > co-after W and co?-before the write which R reads from. Note that
> > these weakenings do not apply in situations where W2 is the co-next
> > write after W1 or when R reads from W; in such cases ww-race and
> > wr-race retain their full force.
>
> I don't know yet... Frankly, this idea of "semantics depending on the
> _values_ of plain accesses" simply scares me. ;-)

Unfortunately, that's how computer programs work in the real world:
Their behavior depends on the values that reads obtain.

> Let's look at your "non-transitive-wmb" test: I tend to read the write
> "*x = 2" as nothing more than "write 2 to the 1st byte, write 0 to the
> 2nd byte, write 0 to the next byte, ..."; and similarly for "*x = 1";

That's already a mistake. The compiler can do extraordinary things to
the code. It doesn't have to end up being anything as simple as "write
2 to the 1st byte, ...".

> maybe some of these writes aren't performed at all (the compiler knows
> that the most significant byte, say, is never touched/modified); maybe
> they intersect (overwrite) one another... what else?

That's the thing. According to one of the assumptions listed earlier
in this document, it's not possible that the "write 2 to the 1st byte"
access isn't performed at all. The higher-order bytes, yes, they might
not be touched. But if no writes are performed at all then at the end
of the region *x will contain 1, not 2 -- and that would be a bug in
the compiler.
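
To make this concrete, here is a sketch of the kind of region being
discussed (my own illustration, with hypothetical values; the comments
apply the definitions of "major" and "super" quoted above):

	/* region delimited by the acquire load and the smp_wmb */
	*x = 9;		/* hypothetical earlier write: overwritten
			 * below, hence not major */
	*x = 2;		/* major (last write to x in the region); super
			 * if the co-preceding major write stored a
			 * different value */
	smp_wmb();

The compiler may well collapse the two plain writes into a single store
of 2, but it cannot drop that final store altogether: otherwise, at the
end of the region, *x would still hold whatever the preceding region
left there.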

> So, for example,
> if we "slice" the most significant parts of the plain writes in P0 and
> P1 then they have the "same value" in a certain sense.
>
> Mmh, I still don't feel comfortable with these concepts sorry, maybe I
> just need more time to process them...
>
>
> >
> > The LKMM also has to detect forbidden executions involving plain
> > accesses. There are three obvious coherence-related checks:
> >
> > If W ->rf R then we must not have R ->rw-xb W.
> >
> > If R ->fr W then we must not have W ->wr-vis R.
> >
> > If W1 ->co W2 then we must not have W2 ->ww-vis W1.
> >
> > (Question: Are these checks actually correct? I'm not certain; they
> > are hard to reason about because co, rf, and fr don't have clear
> > meanings for plain accesses in racy executions.)
>
> So maybe they were not that "obvious". ;-)

Well, they were pretty obvious to me -- they are just restatements of
the coherence-related rules in the operational model. Of course,
"obvious" doesn't always mean "correct".

> This all deserves a careful review and testing; unfortunately, I doubt
> that I'll be able to do much more in the next couple of weeks. bbl,

Good luck on all the more pressing engagements!

Alan

2019-04-06 00:52:12

by Andrea Parri

[permalink] [raw]
Subject: Re: Adding plain accesses and detecting data races in the LKMM

> > I'd have:
> >
> > *x = 1; /* A */
> > smp_mb__before_atomic();
> > r0 = xchg_relaxed(x, 2); /* B (read or write part) */
> >
> > => (A ->barrier B)
>
> Perhaps so. It wasn't clear initially how these should be treated, so
> I just ignored them. For example, what should happen here if there
> were other statements between the smp_mb__before_atomic and the
> xchg_relaxed?

I'd say that the link "A ->barrier B" should still be present and that
no other barrier-links should appear. More formally, I am suggesting
that Before-atomic induces the following barrier-links:

[M] ; fencerel(Before-atomic) ; [RMW] ; po? ; [M]
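
For instance (this is just my reading of the formula; the intervening
statement is purely illustrative), in:

	*x = 1;				/* A */
	smp_mb__before_atomic();
	*y = 3;				/* hypothetical intervening statement */
	r0 = xchg_relaxed(x, 2);	/* B */

the link A ->barrier B would still be present, while Before-atomic by
itself would contribute no barrier-link for the intervening access.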


> > smp_mb__after_unlock_lock() seems more tricky, given that it generates
> > inter-threads links.
>
> The inter-thread part is completely irrelevant as far as compiler
> barriers are concerned.
>
> > Intuitively, we would have a "barrier" that is a superset of "mb" (I'm
> > not saying that this is feasible, but this seems worth considering...)
>
> I don't understand that comment.

I was suggesting to consider something like:

let barrier = [...] | mb

but I'm still not sure about those After-unlock-lock and After-spinlock.


> > I don't get the motivations for the component:
> >
> > (po ; [Acquire | Release]) |
> > ([Acquire | Release] ; po)
> >
> > in "barrier". Why did you write it?
>
> I was thinking that in situations like:
>
> A: z = 1;
> B: smp_store_release(z, 2);
> C: r = z;
>
> in addition to there being a compiler barrier between A and C, there
> effectively has to be a barrier between A and B (that is, the object
> code has to store first 1 and then 2 to z) and between B and C (that
> is, the object code can't skip doing the load from z and just set r to
> 2).
>
> The basic principle was the idea that load-acquire and store-release
> are special because each of them is both a memory access and a memory
> barrier at the same time. Now, I didn't give this a tremendous amount
> of thought and it's quite possible that the Cat code should be changed.
> For example, maybe there should be a compiler barrier immediately
> before a store-release but not immediately after, and vice versa for
> load-acquire.
>
> > Also, consider the snippets:
> >
> > WRITE_ONCE(*x, 1);
> > *x = 2;
> >
> > and
> >
> > smp_store_release(x, 1);
> > *x = 2;
> >
> > The first snippet would be flagged "mixed-accesses" by the patch while
> > the second snippet would not (thanks to the above component); was this
> > intentional? (Notice that some implementations map the latter to:
> >
> > barrier();
> > WRITE_ONCE(*x, 1);
> > *x = 2;
> >
> > )
>
> That last observation is a good reason for changing the Cat code.

You mean in:

po-rel | acq-po

? (together with the fencerel()-ifications of Release and Acquire already
present in the patch).


> > > To avoid complications, the LKMM will consider only code in which
> > > plain writes are separated by a compiler barrier from any marked
> > > accesses of the same location. (Question: Can this restriction be
> > > removed?) As a consequence, if a region contains any marked accesses
> > > to a particular location then it cannot contain any plain writes to
> > > that location.
> >
> > I don't know if it can be removed... Said this, maybe we should go back
> > to the (simpler?) question: "What can go wrong without the restriction?",
> > ;-) IOW, what are those "complications", can you provide some examples?
>
> Here's an example I sent to Will a little while ago. Suppose we
> have:
>
> r = rcu_dereference(ptr);
> *r = 1;
> WRITE_ONCE(x, 2);
>
> Imagine if the compiler transformed this to:
>
> r = rcu_dereference(ptr);
> WRITE_ONCE(x, 2);
> if (r != &x)
> *r = 1;
>
> This is bad because it destroys the address dependency when ptr
> contains &x.

Oh, indeed! (In fact, we discussed this example before... ;-/) BTW,
can you also point out an example which does not involve dependencies
(or destruction thereof)?


> > > Given an object-level execution, let A1 and B1 be accesses in region
> > > R1 and let A2 and B2 be accesses in region R2, all of the same
> > > location, not all reads, and not all corresponding to marked accesses.
> > > (We explicitly allow A1 and B1 to be the same access, and the same for
> > > A2 and B2.) The execution has a data race if the coherence ordering
> > > from A1 to A2 is opposite to the ordering from B1 to B2.
> > >
> > > Definition: Two accesses of the same location "conflict" if they are
> > > in different threads, they are not both marked, and they are not both
> > > reads.
> > >
> > > The notions "executes before" and "is visible to" have already been
> > > defined for marked accesses in the LKMM or in earlier proposals for
> > > handling plain accesses (the xb -- or more properly, xb* -- and vis
> > > relations). I trust the ideas are familiar to everyone. Note that
> > > vis is contained in xb. Also note that the vis relation is not
> > > transitive, because smp_wmb is not A-cumulative. That is, it's
> > > possible for W1 to be visible to W2 and W2 to be visible to W3,
> > > without W1 being visible to W3. (However, on systems that are other
> > > multicopy-atomic, vis is indeed transitive.)
> > >
> > > We want to define data races for source programs in such a way that a
> > > race-free source program has no data races in any of its object-level
> > > executions. To that end, we could begin with some requirements:
> > >
> > > For any two conflicting writes, one must be visible to the
> > > other.
> > >
> > > For a write conflicting with a read, either the write must be
> > > visible to the read or the read must execute before the write.
> >
> > I'm missing the link with the notion of "data race of an object-level
> > execution" given above (it is possible that I did not understand that
> > description): could you elaborate? (maybe by providing some examples)
>
> The argument here is based on an unstated principle (I didn't state it
> explicitly because I don't know if it is really true): Given a
> candidate execution in which write W is not visible to write X even
> though W is co-before X, there is another, very similar (whatever
> that means!) execution in which X is co-before W.
>
> So consider as an example the case where W is a marked write which
> conflicts with a plain write X. Since X is plain, the compiler is
> allowed to generate object code which carries out the X write twice;
> let's call these object-level writes X1 and X2. If W is not visible to
> X1 and X2 and yet W ->co X2, the principle states that there is
> an execution in which X1 ->co W ->co X2. This fits the object-level
> definition of a data race above.

Thanks for the intuition. (I'll think about it...)


> > Looking at the patch, I see that a plain read can be "w-post-bounded":
> > should/can we prevent this from happening?
>
> It doesn't matter. w-post-bounded gets used only in situations where
> the left-hand access is a write. (Unless I made a mistake somewhere.)

Well, we can enforce that "gets used only in..." by heading the relation
with [W] ; _ (this will also help us prevent "mistakes" in the future).


> > > As mentioned earlier, ww-vis is contained in ww-xb and wr-vis is
> > > contained in wr-xb. It turns out that the LKMM can almost prove that
> > > the ij-xb relations are transitive, in the following sense. If W is a
> > > write and R is a read, then:
> > >
> > > (A ->iw-xb W ->wj-xb C) implies (A ->ij-xb C), and
> > >
> > > (A ->ir-xb R ->rj-xb C) implies (A ->ij-xb C).
> > >
> > > To fully prove this requires adding one new term to the ppo relation:
> > >
> > > [Marked & R] ; addr ; [Plain & W] ; wmb ; [Marked & W]
> >
> > ... or removing any of these links, right?
>
> I don't understand. Remove any of which links? Remove it from the
> candidate execution or from the memory model?

I meant removing the term [Marked & R] ; addr ; [Plain & W] or the term
[Plain & W] ; wmb ; [Marked & W] from the relevant definitions of
{w,r}-{pre,post}-bounded. (And yes: IIUC, this also means continuing
to say "sorry, you're on your own" for some common patterns...)


> > The proposed model is unsound w.r.t. (2); c.f., LB1 vs. LB2 from:
> >
> > http://lkml.kernel.org/r/20190116213658.GA3984@andrea
>
> Again, I don't understand. What's unsound about the proposed model?

Let me try again... Consider the following two steps:

- Start with a _race-free_ test program T1 (= LB1),

- Apply a "reasonable" source-to-source transformation (2) to obtain a
new test T2 (LB2) (in particular, T2 must be race-free);

Then the property that I had in mind is described as follows:

If S is a valid state of T2 then S is also a valid state of T1; IOW,
the transformation does not introduce new behaviors.

(An infamously well-known paper about the study of this property, in the
context of the C11 memory model, is available at:

http://plv.mpi-sws.org/c11comp/ )

This property does not hold in the proposed model (c.f., e.g., the state
specified in the "exists" clauses).


> > Let's look at your "non-transitive-wmb" test: I tend to read the write
> > "*x = 2" as nothing more than "write 2 to the 1st byte, write 0 to the
> > 2nd byte, write 0 to the next byte, ..."; and similarly for "*x = 1";
> > maybe some of these writes aren't performed at all (the compiler knows
> > that the most significant byte, say, is never touched/modified); maybe
> > they intersect (overwrite) one another... what else?
>
> That's the thing. According to one of the assumptions listed earlier
> in this document, it's not possible that the "write 2 to the 1st byte"
> access isn't performed at all. The higher-order bytes, yes, they might
> not be touched. But if no writes are performed at all then at the end
> of the region *x will contain 1, not 2 -- and that would be a bug in
> the compiler.

AFAICT, we agreed here. So consider the following test:

C non-transitive-wmb-2

{
	x=0x00010000;
}

P0(int *x, int *y)
{
	*x = 1;
	smp_store_release(y, 1);
}

P1(int *x, int *y, int *z)
{
	int r1;

	r1 = smp_load_acquire(y);
	if (r1) {
		*x = 2;
		smp_wmb();
		WRITE_ONCE(*z, 1);
	}
}

P2(int *x, int *z)
{
	int r3;
	int r4 = 0;

	r3 = READ_ONCE(*z);
	if (r3) {
		smp_rmb();
		r4 = READ_ONCE(*x); /* E */
	}
}

exists (2:r3=1 /\ 2:r4=2)

The proposed model says that this is race-free. Now suppose that the
compiler mapped the two plain writes above as follows:

	*x = 1;  -->  A: write 0x0001 at x
	              B: write 0x0000 at (x + 2)

	*x = 2;  -->  C: write 0x0002 at x
	              if ((D: read of 2 bytes at (x + 2)) != 0x0000)
	                      write 0x0000 at (x + 2)

and consider the execution where 1:r1=1 and 2:r3=1. We know that A and
B propagate to P1 before C and D execute (release/acquire) and that C
propagates to P2 before E executes (wmb/rmb). But what guarantees that
B, say, propagates to P2 before E executes? AFAICT, nothing prevents P2
from reading the value 0x00010002 for x, that is, from reading the least
significant bits from P1's "write 0x0002 at x" and the most significant
bits from the initial write. However, the proposed model doesn't report
the corresponding execution/state.

Andrea

2019-04-06 16:05:32

by Alan Stern

[permalink] [raw]
Subject: Re: Adding plain accesses and detecting data races in the LKMM

On Sat, 6 Apr 2019, Andrea Parri wrote:

> > > I'd have:
> > >
> > > *x = 1; /* A */
> > > smp_mb__before_atomic();
> > > r0 = xchg_relaxed(x, 2); /* B (read or write part) */
> > >
> > > => (A ->barrier B)
> >
> > Perhaps so. It wasn't clear initially how these should be treated, so
> > I just ignored them. For example, what should happen here if there
> > were other statements between the smp_mb__before_atomic and the
> > xchg_relaxed?
>
> I'd say that the link "A ->barrier B" should still be present and that
> no other barrier-links should appear. More formally, I am suggesting
> that Before-atomic induced the following barrier-links:
>
> [M] ; fencerel(Before-atomic) ; [RMW] ; po? ; [M]

Hmmm, maybe. But exactly where is the barrier introduced? Are you
saying that on some architectures the barrier is contained in the
smp_mb__before_atomic and on others it is contained in the
xchg_relaxed?

> > > smp_mb__after_unlock_lock() seems more tricky, given that it generates
> > > inter-threads links.
> >
> > The inter-thread part is completely irrelevant as far as compiler
> > barriers are concerned.
> >
> > > Intuitively, we would have a "barrier" that is a superset of "mb" (I'm
> > > not saying that this is feasible, but this seems worth considering...)
> >
> > I don't understand that comment.
>
> I was suggesting to consider something like:
>
> let barrier = [...] | mb
>
> but I'm still not sure about those After-unlock-lock and After-spinlock.

I don't see any point in saying that After-unlock-lock or
After-spinlock contains a barrier, because spin_lock and spin_unlock
already do.


> > > I don't get the motivations for the component:
> > >
> > > (po ; [Acquire | Release]) |
> > > ([Acquire | Release] ; po)
> > >
> > > in "barrier". Why did you write it?
> >
> > I was thinking that in situations like:
> >
> > A: z = 1;
> > B: smp_store_release(z, 2);
> > C: r = z;
> >
> > in addition to there being a compiler barrier between A and C, there
> > effectively has to be a barrier between A and B (that is, the object
> > code has to store first 1 and then 2 to z) and between B and C (that
> > is, the object code can't skip doing the load from z and just set r to
> > 2).
> >
> > The basic principle was the idea that load-acquire and store-release
> > are special because each of them is both a memory access and a memory
> > barrier at the same time. Now, I didn't give this a tremendous amount
> > of thought and it's quite possible that the Cat code should be changed.
> > For example, maybe there should be a compiler barrier immediately
> > before a store-release but not immediately after, and vice versa for
> > load-acquire.
> >
> > > Also, consider the snippets:
> > >
> > > WRITE_ONCE(*x, 1);
> > > *x = 2;
> > >
> > > and
> > >
> > > smp_store_release(x, 1);
> > > *x = 2;
> > >
> > > The first snippet would be flagged "mixed-accesses" by the patch while
> > > the second snippet would not (thanks to the above component); was this
> > > intentional? (Notice that some implementations map the latter to:
> > >
> > > barrier();
> > > WRITE_ONCE(*x, 1);
> > > *x = 2;
> > >
> > > )
> >
> > That last observation is a good reason for changing the Cat code.
>
> You mean in:
>
> po-rel | acq-po
>
> ? (together with the fencerel()-ifications of Release and Acquire already
> present in the patch).

No, I meant changing the definition of "barrier" to say:

(po ; [Release]) | ([Acquire] ; po)

(for example) instead of the code you quoted above.
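
If I read that change right (a sketch of its effect, not a statement
from the patch), in the snippet discussed above:

	smp_store_release(x, 1);
	*x = 2;

there would no longer be a compiler barrier between the release and the
po-later plain write, so this pair would get flagged "mixed-accesses"
just like the WRITE_ONCE variant.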


> > > > To avoid complications, the LKMM will consider only code in which
> > > > plain writes are separated by a compiler barrier from any marked
> > > > accesses of the same location. (Question: Can this restriction be
> > > > removed?) As a consequence, if a region contains any marked accesses
> > > > to a particular location then it cannot contain any plain writes to
> > > > that location.
> > >
> > > I don't know if it can be removed... Said this, maybe we should go back
> > > to the (simpler?) question: "What can go wrong without the restriction?",
> > > ;-) IOW, what are those "complications", can you provide some examples?
> >
> > Here's an example I sent to Will a little while ago. Suppose we
> > have:
> >
> > r = rcu_dereference(ptr);
> > *r = 1;
> > WRITE_ONCE(x, 2);
> >
> > Imagine if the compiler transformed this to:
> >
> > r = rcu_dereference(ptr);
> > WRITE_ONCE(x, 2);
> > if (r != &x)
> > *r = 1;
> >
> > This is bad because it destroys the address dependency when ptr
> > contains &x.
>
> Oh, indeed! (In fact, we discussed this example before... ;-/) BTW,
> can you also point out an example which does not involve dependencies
> (or destruction thereof)?

I believe all the examples did involve dependencies. However, Paul has
provided several use cases showing that one might have good reasons for
mixing plain reads with marked accesses -- but he didn't have any use
cases for mixing plain writes.

There are some open questions remaining. For instance, given:

r1 = READ_ONCE(*x);
r2 = *x;

is the compiler allowed to replace the plain read with the value from
the READ_ONCE? What if the statements were in the opposite order?
What if the READ_ONCE was changed to WRITE_ONCE?
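
(As a sketch of the first transformation being asked about -- an
illustration of the question, not a claim that it is permitted -- the
object code would effectively become:

	r1 = READ_ONCE(*x);
	r2 = r1;	/* plain load of *x elided, READ_ONCE value reused */

and similarly for the other variants.)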

At the moment, I think it's best if we can ignore these issues.


> > > Looking at the patch, I see that a plain read can be "w-post-bounded":
> > > should/can we prevent this from happening?
> >
> > It doesn't matter. w-post-bounded gets used only in situations where
> > the left-hand access is a write. (Unless I made a mistake somewhere.)
>
> Well, we can enforce that "gets used only in..." by heading the relation
> with [W] ; _ (this will also help us to prevent "mistakes" in the future)

I suppose so. It shouldn't make any difference.


> > > > As mentioned earlier, ww-vis is contained in ww-xb and wr-vis is
> > > > contained in wr-xb. It turns out that the LKMM can almost prove that
> > > > the ij-xb relations are transitive, in the following sense. If W is a
> > > > write and R is a read, then:
> > > >
> > > > (A ->iw-xb W ->wj-xb C) implies (A ->ij-xb C), and
> > > >
> > > > (A ->ir-xb R ->rj-xb C) implies (A ->ij-xb C).
> > > >
> > > > To fully prove this requires adding one new term to the ppo relation:
> > > >
> > > > [Marked & R] ; addr ; [Plain & W] ; wmb ; [Marked & W]
> > >
> > > ... or removing any of these links, right?
> >
> > I don't understand. Remove any of which links? Remove it from the
> > candidate execution or from the memory model?
>
> I meant removing the term [Marked & R] ; addr ; [Plain & W] or the term
> [Plain & W] ; wmb ; [Marked & W] from the (interested) definitions of
> {w,r}-{pre,post}-bounded. (And yes: IIUC, this also means to continue
> to say "sorry, you're on your own" for some common patterns...)

Yes, indeed it will. Many of the common patterns involved in RCU would
then be considered to contain races. I don't think people would like
that.


> > > The proposed model is unsound w.r.t. (2); c.f., LB1 vs. LB2 from:
> > >
> > > http://lkml.kernel.org/r/20190116213658.GA3984@andrea
> >
> > Again, I don't understand. What's unsound about the proposed model?
>
> Let me try again... Consider the following two steps:
>
> - Start with a _race-free_ test program T1 (= LB1),
>
> - Apply a "reasonable" source-to-source transformation (2) to obtain a
> new test T2 (LB2) (in particular, T2 must be race-free);
>
> Then the property that I had in mind is described as follows:
>
> If S is a valid state of T2 then S is also a valid state of T1; IOW,
> the tranformation does not introduce new behaviors.
>
> (An infamously well-known paper about the study of this property, in the
> context of the C11 memory model, is available at:
>
> http://plv.mpi-sws.org/c11comp/ )
>
> This property does not hold in the proposed model (c.f., e.g., the state
> specified in the "exists" clauses).

I don't regard this as a serious drawback. We know that the memory
model doesn't behave all that well when it comes to control
dependencies to marked writes beyond the end of an "if" statement, for
example. This is related to that.
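
For instance (my sketch of that known limitation, not new analysis), in:

	if (READ_ONCE(*x))
		WRITE_ONCE(*y, 1);
	WRITE_ONCE(*z, 1);	/* po-after the end of the "if" */

the model's control dependency orders only the store to y after the
load of x; the store to z is not covered, even though in any object-code
execution the conditional branch would order it as well.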

We also know that not all hardware exhibits all the behaviors allowed
by the architecture. So it's reasonable that on some computer, T1
might never exhibit state S (even though it is architecturally allowed
to) whereas T2 could exhibit S.

> > > Let's look at your "non-transitive-wmb" test: I tend to read the write
> > > "*x = 2" as nothing more than "write 2 to the 1st byte, write 0 to the
> > > 2nd byte, write 0 to the next byte, ..."; and similarly for "*x = 1";
> > > maybe some of these writes aren't performed at all (the compiler knows
> > > that the most significant byte, say, is never touched/modified); maybe
> > > they intersect (overwrite) one another... what else?
> >
> > That's the thing. According to one of the assumptions listed earlier
> > in this document, it's not possible that the "write 2 to the 1st byte"
> > access isn't performed at all. The higher-order bytes, yes, they might
> > not be touched. But if no writes are performed at all then at the end
> > of the region *x will contain 1, not 2 -- and that would be a bug in
> > the compiler.
>
> AFAICT, we agreed here. So consider the following test:
>
> C non-transitive-wmb-2
>
> {
> x=0x00010000;
> }
>
> P0(int *x, int *y)
> {
> *x = 1;
> smp_store_release(y, 1);
> }
>
> P1(int *x, int *y, int *z)
> {
> int r1;
>
> r1 = smp_load_acquire(y);
> if (r1) {
> *x = 2;
> smp_wmb();
> WRITE_ONCE(*z, 1);
> }
> }
>
> P2(int *x, int *z)
> {
> int r3;
> int r4 = 0;
>
> r3 = READ_ONCE(*z);
> if (r3) {
> smp_rmb();
> r4 = READ_ONCE(*x); /* E */
> }
> }
>
> exists (2:r3=1 /\ 2:r4=2)
>
> The proposed model says that this is race-free. Now suppose that the
> compiler mapped the two plain writes above as follows:
>
> *x = 1; --> A:write 0x0001 at x
> B:write 0x0000 at (x + 2)
>
> *x = 2; --> C:write 0x0002 at x
> if ((D:read of 2 bytes at (x + 2)) != 0x0000)
> write 0x0000 at (x + 2)
>
> and consider the execution where 1:r1=1 and 2:r3=1. We know that A and
> B propagate to P1 before C and D execute (release/acquire) and that C
> propagates to P2 before E executes (wmb/rmb). But what guarantees that
> B, say, propagates to P2 before E executes? AFAICT, nothing prevent P2
> from reading the value 0x00010002 for x, that is, from reading the least
> significant bits from P1's "write 0x0002 at x" and the most significant
> bits from the initial write. However, the proposed model doesn't report
> the corresponding execution/state.

Ah, that is indeed an excellent example! I had not considered
mixed-size accesses generated by the compiler.

So perhaps it would be better to get rid of the whole notion of super
and major writes, and declare that non-transitive-wmb is racy in all
its variants.

Alan

2019-04-08 05:52:43

by Andrea Parri

[permalink] [raw]
Subject: Re: Adding plain accesses and detecting data races in the LKMM

> > > > I'd have:
> > > >
> > > > *x = 1; /* A */
> > > > smp_mb__before_atomic();
> > > > r0 = xchg_relaxed(x, 2); /* B (read or write part) */
> > > >
> > > > => (A ->barrier B)
> > >
> > > Perhaps so. It wasn't clear initially how these should be treated, so
> > > I just ignored them. For example, what should happen here if there
> > > were other statements between the smp_mb__before_atomic and the
> > > xchg_relaxed?
> >
> > I'd say that the link "A ->barrier B" should still be present and that
> > no other barrier-links should appear. More formally, I am suggesting
> > that Before-atomic induced the following barrier-links:
> >
> > [M] ; fencerel(Before-atomic) ; [RMW] ; po? ; [M]
>
> Hmmm, maybe. But exactly where is the barrier introduced? Are you
> saying that on some architectures the barrier is contained in the
> smp_mb__before_atomic and on others it is contained in the
> xchg_relaxed?

The formula was more along the lines of "do not assume either of these
cases to hold; use barrier() if you need an unconditional barrier..."
AFAICT, all current implementations of smp_mb__{before,after}_atomic()
provide a compiler barrier with either barrier() or a "memory" clobber.
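
(For reference, the compiler barrier in question boils down to an empty
asm with a "memory" clobber; roughly:

	#define barrier() __asm__ __volatile__("" : : : "memory")

modulo the exact #define guards in the kernel headers.)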


> > I was suggesting to consider something like:
> >
> > let barrier = [...] | mb
> >
> > but I'm still not sure about those After-unlock-lock and After-spinlock.
>
> I don't see any point in saying that After-unlock-lock or
> After-spinlock contains a barrier, because spin_lock and spin_unlock
> already do.

(They would not "contain a barrier"; they would induce some when paired
with the specified locking primitive/sequence: this distinction matters
because some of the implementations here don't provide the compiler
barrier. But) I agree with you: these barriers seem either redundant or
unnecessary.


> > > > Also, consider the snippets:
> > > >
> > > > WRITE_ONCE(*x, 1);
> > > > *x = 2;
> > > >
> > > > and
> > > >
> > > > smp_store_release(x, 1);
> > > > *x = 2;
> > > >
> > > > The first snippet would be flagged "mixed-accesses" by the patch while
> > > > the second snippet would not (thanks to the above component); was this
> > > > intentional? (Notice that some implementations map the latter to:
> > > >
> > > > barrier();
> > > > WRITE_ONCE(*x, 1);
> > > > *x = 2;
> > > >
> > > > )
> > >
> > > That last observation is a good reason for changing the Cat code.
> >
> > You mean in:
> >
> > po-rel | acq-po
> >
> > ? (together with the fencerel()-ifications of Release and Acquire already
> > present in the patch).
>
> No, I meant changing the definition of "barrier" to say:
>
> (po ; [Release]) | ([Acquire] ; po)
>
> (for example) instead of the code you quoted above.

This makes sense to me.


> > > > > To avoid complications, the LKMM will consider only code in which
> > > > > plain writes are separated by a compiler barrier from any marked
> > > > > accesses of the same location. (Question: Can this restriction be
> > > > > removed?) As a consequence, if a region contains any marked accesses
> > > > > to a particular location then it cannot contain any plain writes to
> > > > > that location.
> > > >
> > > > I don't know if it can be removed... Said this, maybe we should go back
> > > > to the (simpler?) question: "What can go wrong without the restriction?",
> > > > ;-) IOW, what are those "complications", can you provide some examples?
> > >
> > > Here's an example I sent to Will a little while ago. Suppose we
> > > have:
> > >
> > > r = rcu_dereference(ptr);
> > > *r = 1;
> > > WRITE_ONCE(x, 2);
> > >
> > > Imagine if the compiler transformed this to:
> > >
> > > r = rcu_dereference(ptr);
> > > WRITE_ONCE(x, 2);
> > > if (r != &x)
> > > *r = 1;
> > >
> > > This is bad because it destroys the address dependency when ptr
> > > contains &x.
> >
> > Oh, indeed! (In fact, we discussed this example before... ;-/) BTW,
> > can you also point out an example which does not involve dependencies
> > (or destruction thereof)?
>
> I believe all the examples did involve dependencies. However, Paul has
> provided several use cases showing that one might have good reasons for
> mixing plain reads with marked acccesses -- but he didn't have any use
> cases for mixing plain writes.

I see. BTW, if we converge on adding this new flag (or some variant),
it might be a good idea to log some of these examples, maybe just the
snippet above, in a commit message (they could come in handy when we
are asked the question "Why am I seeing this flag?" ...)


> There are some open questions remaining. For instance, given:
>
> r1 = READ_ONCE(*x);
> r2 = *x;
>
> is the compiler allowed to replace the plain read with the value from
> the READ_ONCE? What if the statements were in the opposite order?
> What if the READ_ONCE was changed to WRITE_ONCE?
>
> At the moment, I think it's best if we can ignore these issues.

(To all three questions:) let me put it in these terms: I'm currently
unable to argue "the compiler isn't allowed to, because...". But wait,
which "issues"? Am I forgetting some other "bad" examples? (The
above-mentioned flag doesn't seem to "cure" plain reads...)


> > I meant removing the term [Marked & R] ; addr ; [Plain & W] or the term
> > [Plain & W] ; wmb ; [Marked & W] from the (interested) definitions of
> > {w,r}-{pre,post}-bounded. (And yes: IIUC, this also means to continue
> > to say "sorry, you're on your own" for some common patterns...)
>
> Yes, indeed it will. Many of the common patterns involved in RCU would
> then be considered to contain races. I don't think people would like
> that.

We will try to give people what they like, no doubt about that. ;-)
Of course, people also do not typically like compilers breaking their
dependencies: ideally, we will flag "all exceptions" (including those
documented in rcu_dereference.txt), but we're not quite there... ;-/


> > Let me try again... Consider the following two steps:
> >
> > - Start with a _race-free_ test program T1 (= LB1),
> >
> > - Apply a "reasonable" source-to-source transformation (2) to obtain a
> > new test T2 (LB2) (in particular, T2 must be race-free);
> >
> > Then the property that I had in mind is described as follows:
> >
> > If S is a valid state of T2 then S is also a valid state of T1; IOW,
> > the tranformation does not introduce new behaviors.
> >
> > (An infamously well-known paper about the study of this property, in the
> > context of the C11 memory model, is available at:
> >
> > http://plv.mpi-sws.org/c11comp/ )
> >
> > This property does not hold in the proposed model (c.f., e.g., the state
> > specified in the "exists" clauses).
>
> I don't regard this as a serious drawback. We know that the memory
> model doesn't behave all that well when it comes to control
> dependencies to marked writes beyond the end of an "if" statement, for
> example. This is related to that.
>
> We also know that not all hardware exhibits all the behaviors allowed
> by the architecture. So it's reasonable that on some computer, T1
> might never exhibit state S (even though it is architecturally allowed
> to) whereas T2 could exhibit S.

Sure. I was just pointing out that the proposed approach does
introduce scenarios (the example above) where this property is violated.
It wasn't intended as a "no, no" ;-) but, again (like the "if (-) E
else E" scenario, which is also related...), it seemed worth noticing.


> > > > Let's look at your "non-transitive-wmb" test: I tend to read the write
> > > > "*x = 2" as nothing more than "write 2 to the 1st byte, write 0 to the
> > > > 2nd byte, write 0 to the next byte, ..."; and similarly for "*x = 1";
> > > > maybe some of these writes aren't performed at all (the compiler knows
> > > > that the most significant byte, say, is never touched/modified); maybe
> > > > they intersect (overwrite) one another... what else?
> > >
> > > That's the thing. According to one of the assumptions listed earlier
> > > in this document, it's not possible that the "write 2 to the 1st byte"
> > > access isn't performed at all. The higher-order bytes, yes, they might
> > > not be touched. But if no writes are performed at all then at the end
> > > of the region *x will contain 1, not 2 -- and that would be a bug in
> > > the compiler.
> >
> > AFAICT, we agreed here. So consider the following test:
> >
> > C non-transitive-wmb-2
> >
> > {
> > x=0x00010000;
> > }
> >
> > P0(int *x, int *y)
> > {
> > *x = 1;
> > smp_store_release(y, 1);
> > }
> >
> > P1(int *x, int *y, int *z)
> > {
> > int r1;
> >
> > r1 = smp_load_acquire(y);
> > if (r1) {
> > *x = 2;
> > smp_wmb();
> > WRITE_ONCE(*z, 1);
> > }
> > }
> >
> > P2(int *x, int *z)
> > {
> > int r3;
> > int r4 = 0;
> >
> > r3 = READ_ONCE(*z);
> > if (r3) {
> > smp_rmb();
> > r4 = READ_ONCE(*x); /* E */
> > }
> > }
> >
> > exists (2:r3=1 /\ 2:r4=2)
> >
> > The proposed model says that this is race-free. Now suppose that the
> > compiler mapped the two plain writes above as follows:
> >
> > *x = 1; --> A:write 0x0001 at x
> > B:write 0x0000 at (x + 2)
> >
> > *x = 2; --> C:write 0x0002 at x
> > if ((D:read of 2 bytes at (x + 2)) != 0x0000)
> > write 0x0000 at (x + 2)
> >
> > and consider the execution where 1:r1=1 and 2:r3=1. We know that A and
> > B propagate to P1 before C and D execute (release/acquire) and that C
> > propagates to P2 before E executes (wmb/rmb). But what guarantees that
> > B, say, propagates to P2 before E executes? AFAICT, nothing prevent P2
> > from reading the value 0x00010002 for x, that is, from reading the least
> > significant bits from P1's "write 0x0002 at x" and the most significant
> > bits from the initial write. However, the proposed model doesn't report
> > the corresponding execution/state.
>
> Ah, that is indeed an excellent example! I had not considered
> mixed-size accesses generated by the compiler.
>
> So perhaps it would be better to get rid of the whole notion of super
> and major writes, and declare that non-transitive-wmb is racy in all
> its variants.

This makes sense to me.

Andrea

2019-04-08 14:19:38

by Alan Stern

[permalink] [raw]
Subject: Re: Adding plain accesses and detecting data races in the LKMM

On Mon, 8 Apr 2019, Andrea Parri wrote:

> > > > > I'd have:
> > > > >
> > > > > *x = 1; /* A */
> > > > > smp_mb__before_atomic();
> > > > > r0 = xchg_relaxed(x, 2); /* B (read or write part) */
> > > > >
> > > > > => (A ->barrier B)
> > > >
> > > > Perhaps so. It wasn't clear initially how these should be treated, so
> > > > I just ignored them. For example, what should happen here if there
> > > > were other statements between the smp_mb__before_atomic and the
> > > > xchg_relaxed?
> > >
> > > I'd say that the link "A ->barrier B" should still be present and that
> > > no other barrier-links should appear. More formally, I am suggesting
> > > that Before-atomic induced the following barrier-links:
> > >
> > > [M] ; fencerel(Before-atomic) ; [RMW] ; po? ; [M]
> >
> > Hmmm, maybe. But exactly where is the barrier introduced? Are you
> > saying that on some architectures the barrier is contained in the
> > smp_mb__before_atomic and on others it is contained in the
> > xchg_relaxed?
>
> The formula was more along the line of "do not assume either of these
> cases to hold; use barrier() is you need an unconditional barrier..."
> AFAICT, all current implementations of smp_mb__{before,after}_atomic()
> provides a compiler barrier with either barrier() or "memory" clobber.

Well, we have two reasonable choices: Say that
smp_mb__{before,after}_atomic will always provide a compiler barrier,
or don't say this. I see no point in saying that the combination of
Before-atomic followed by RMW provides a barrier.


> > > > > Also, consider the snippets:
> > > > >
> > > > > WRITE_ONCE(*x, 1);
> > > > > *x = 2;
> > > > >
> > > > > and
> > > > >
> > > > > smp_store_release(x, 1);
> > > > > *x = 2;
> > > > >
> > > > > The first snippet would be flagged "mixed-accesses" by the patch while
> > > > > the second snippet would not (thanks to the above component); was this
> > > > > intentional? (Notice that some implementations map the latter to:
> > > > >
> > > > > barrier();
> > > > > WRITE_ONCE(*x, 1);
> > > > > *x = 2;
> > > > >
> > > > > )
> > > >
> > > > That last observation is a good reason for changing the Cat code.
> > >
> > > You mean in:
> > >
> > > po-rel | acq-po
> > >
> > > ? (together with the fencerel()-ifications of Release and Acquire already
> > > present in the patch).
> >
> > No, I meant changing the definition of "barrier" to say:
> >
> > (po ; [Release]) | ([Acquire] ; po)
> >
> > (for example) instead of the code you quoted above.
>
> This makes sense to me.

I'll put it into the next version.


> > > > > > To avoid complications, the LKMM will consider only code in which
> > > > > > plain writes are separated by a compiler barrier from any marked
> > > > > > accesses of the same location. (Question: Can this restriction be
> > > > > > removed?) As a consequence, if a region contains any marked accesses
> > > > > > to a particular location then it cannot contain any plain writes to
> > > > > > that location.
> > > > >
> > > > > I don't know if it can be removed... Said this, maybe we should go back
> > > > > to the (simpler?) question: "What can go wrong without the restriction?",
> > > > > ;-) IOW, what are those "complications", can you provide some examples?
> > > >
> > > > Here's an example I sent to Will a little while ago. Suppose we
> > > > have:
> > > >
> > > > r = rcu_dereference(ptr);
> > > > *r = 1;
> > > > WRITE_ONCE(x, 2);
> > > >
> > > > Imagine if the compiler transformed this to:
> > > >
> > > > r = rcu_dereference(ptr);
> > > > WRITE_ONCE(x, 2);
> > > > if (r != &x)
> > > > *r = 1;
> > > >
> > > > This is bad because it destroys the address dependency when ptr
> > > > contains &x.
> > >
> > > Oh, indeed! (In fact, we discussed this example before... ;-/) BTW,
> > > can you also point out an example which does not involve dependencies
> > > (or destruction thereof)?
> >
> > I believe all the examples did involve dependencies. However, Paul has
> > provided several use cases showing that one might have good reasons for
> > mixing plain reads with marked acccesses -- but he didn't have any use
> > cases for mixing plain writes.
>
> I see. BTW, if we'll converge to add this new flag (or some variant),
> it might be a good idea to log some of these examples, maybe just the
> snippet above, in a commit message (these could come in handy when we
> will be addressed with the question "Why am I seeing this flag?" ...)

Okay.

> > There are some open questions remaining. For instance, given:
> >
> > r1 = READ_ONCE(*x);
> > r2 = *x;
> >
> > is the compiler allowed to replace the plain read with the value from
> > the READ_ONCE? What if the statements were in the opposite order?
> > What if the READ_ONCE was changed to WRITE_ONCE?
> >
> > At the moment, I think it's best if we can ignore these issues.
>
> (To all three questions:) let me put it in these terms: I'm currently
> unable to argue: "The compiler isn't allowed, because...". But wait,
> which "issues"? am I forgetting some other "bad" examples? (the above
> mentioned flag doesn't seem to "cure" plain reads...)

Well, these aren't really issues, because they don't affect the memory
model directly. At least, not in its current form. But they are open
questions, and at the moment I think it's best to assume the most
conservative answers -- that is, the compiler may do all sorts of crazy
things to these code examples.


> > > Let me try again... Consider the following two steps:
> > >
> > > - Start with a _race-free_ test program T1 (= LB1),
> > >
> > > - Apply a "reasonable" source-to-source transformation (2) to obtain a
> > > new test T2 (LB2) (in particular, T2 must be race-free);
> > >
> > > Then the property that I had in mind is described as follows:
> > >
> > > If S is a valid state of T2 then S is also a valid state of T1; IOW,
> > > the tranformation does not introduce new behaviors.
> > >
> > > (An infamously well-known paper about the study of this property, in the
> > > context of the C11 memory model, is available at:
> > >
> > > http://plv.mpi-sws.org/c11comp/ )
> > >
> > > This property does not hold in the proposed model (c.f., e.g., the state
> > > specified in the "exists" clauses).
> >
> > I don't regard this as a serious drawback. We know that the memory
> > model doesn't behave all that well when it comes to control
> > dependencies to marked writes beyond the end of an "if" statement, for
> > example. This is related to that.
> >
> > We also know that not all hardware exhibits all the behaviors allowed
> > by the architecture. So it's reasonable that on some computer, T1
> > might never exhibit state S (even though it is architecturally allowed
> > to) whereas T2 could exhibit S.
>
> Sure. I was just pointing out the fact that the proposed approach does
> introduce scenarios (the example above) where this property is violated.
> It wasn't intended as "No, no" ;-) but, again, (like for the "if (-) E
> else E" scenario, also related...) it seemed worth noticing.

All right. Consider it noticed. :-)


> > > > > Let's look at your "non-transitive-wmb" test: I tend to read the write
> > > > > "*x = 2" as nothing more than "write 2 to the 1st byte, write 0 to the
> > > > > 2nd byte, write 0 to the next byte, ..."; and similarly for "*x = 1";
> > > > > maybe some of these writes aren't performed at all (the compiler knows
> > > > > that the most significant byte, say, is never touched/modified); maybe
> > > > > they intersect (overwrite) one another... what else?
> > > >
> > > > That's the thing. According to one of the assumptions listed earlier
> > > > in this document, it's not possible that the "write 2 to the 1st byte"
> > > > access isn't performed at all. The higher-order bytes, yes, they might
> > > > not be touched. But if no writes are performed at all then at the end
> > > > of the region *x will contain 1, not 2 -- and that would be a bug in
> > > > the compiler.
> > >
> > > AFAICT, we agreed here. So consider the following test:
> > >
> > > C non-transitive-wmb-2
> > >
> > > {
> > > x=0x00010000;
> > > }
> > >
> > > P0(int *x, int *y)
> > > {
> > > *x = 1;
> > > smp_store_release(y, 1);
> > > }
> > >
> > > P1(int *x, int *y, int *z)
> > > {
> > > int r1;
> > >
> > > r1 = smp_load_acquire(y);
> > > if (r1) {
> > > *x = 2;
> > > smp_wmb();
> > > WRITE_ONCE(*z, 1);
> > > }
> > > }
> > >
> > > P2(int *x, int *z)
> > > {
> > > int r3;
> > > int r4 = 0;
> > >
> > > r3 = READ_ONCE(*z);
> > > if (r3) {
> > > smp_rmb();
> > > r4 = READ_ONCE(*x); /* E */
> > > }
> > > }
> > >
> > > exists (2:r3=1 /\ 2:r4=2)
> > >
> > > The proposed model says that this is race-free. Now suppose that the
> > > compiler mapped the two plain writes above as follows:
> > >
> > > *x = 1; --> A:write 0x0001 at x
> > > B:write 0x0000 at (x + 2)
> > >
> > > *x = 2; --> C:write 0x0002 at x
> > > if ((D:read of 2 bytes at (x + 2)) != 0x0000)
> > > write 0x0000 at (x + 2)
> > >
> > > and consider the execution where 1:r1=1 and 2:r3=1. We know that A and
> > > B propagate to P1 before C and D execute (release/acquire) and that C
> > > propagates to P2 before E executes (wmb/rmb). But what guarantees that
> > > B, say, propagates to P2 before E executes? AFAICT, nothing prevent P2
> > > from reading the value 0x00010002 for x, that is, from reading the least
> > > significant bits from P1's "write 0x0002 at x" and the most significant
> > > bits from the initial write. However, the proposed model doesn't report
> > > the corresponding execution/state.
> >
> > Ah, that is indeed an excellent example! I had not considered
> > mixed-size accesses generated by the compiler.
> >
> > So perhaps it would be better to get rid of the whole notion of super
> > and major writes, and declare that non-transitive-wmb is racy in all
> > its variants.
>
> This makes sense to me.

In the next version...

Alan

2019-04-09 01:38:19

by Andrea Parri

[permalink] [raw]
Subject: Re: Adding plain accesses and detecting data races in the LKMM

> > The formula was more along the line of "do not assume either of these
> > cases to hold; use barrier() is you need an unconditional barrier..."
> > AFAICT, all current implementations of smp_mb__{before,after}_atomic()
> > provides a compiler barrier with either barrier() or "memory" clobber.
>
> Well, we have two reasonable choices: Say that
> smp_mb__{before,after}_atomic will always provide a compiler barrier,
> or don't say this. I see no point in saying that the combination of
> Before-atomic followed by RMW provides a barrier.

;-/ I'm fine with the first choice. I don't see how the second choice
(this proposal/patch) would be consistent with some documentation and
with the current implementations; for example,

1) Documentation/atomic_t.txt says:

Thus:

atomic_fetch_add();

is equivalent to:

smp_mb__before_atomic();
atomic_fetch_add_relaxed();
smp_mb__after_atomic();

[...]

2) Some implementations of the _relaxed() variants do not provide any
compiler barrier currently.

Andrea

2019-04-09 15:03:27

by Paul E. McKenney

[permalink] [raw]
Subject: Re: Adding plain accesses and detecting data races in the LKMM

On Tue, Apr 09, 2019 at 03:36:18AM +0200, Andrea Parri wrote:
> > > The formula was more along the line of "do not assume either of these
> > > cases to hold; use barrier() is you need an unconditional barrier..."
> > > AFAICT, all current implementations of smp_mb__{before,after}_atomic()
> > > provides a compiler barrier with either barrier() or "memory" clobber.
> >
> > Well, we have two reasonable choices: Say that
> > smp_mb__{before,after}_atomic will always provide a compiler barrier,
> > or don't say this. I see no point in saying that the combination of
> > Before-atomic followed by RMW provides a barrier.
>
> ;-/ I'm fine with the first choice. I don't see how the second choice
> (this proposal/patch) would be consistent with some documentation and
> with the current implementations; for example,
>
> 1) Documentation/atomic_t.txt says:
>
> Thus:
>
> atomic_fetch_add();
>
> is equivalent to:
>
> smp_mb__before_atomic();
> atomic_fetch_add_relaxed();
> smp_mb__after_atomic();
>
> [...]
>
> 2) Some implementations of the _relaxed() variants do not provide any
> compiler barrier currently.

But don't all implementations of smp_mb__before_atomic() and
smp_mb__after_atomic() currently supply a compiler barrier?

Thanx, Paul

2019-04-13 21:40:48

by Andrea Parri

[permalink] [raw]
Subject: Re: Adding plain accesses and detecting data races in the LKMM

On Tue, Apr 09, 2019 at 08:01:32AM -0700, Paul E. McKenney wrote:
> On Tue, Apr 09, 2019 at 03:36:18AM +0200, Andrea Parri wrote:
> > > > The formula was more along the line of "do not assume either of these
> > > > cases to hold; use barrier() is you need an unconditional barrier..."
> > > > AFAICT, all current implementations of smp_mb__{before,after}_atomic()
> > > > provides a compiler barrier with either barrier() or "memory" clobber.
> > >
> > > Well, we have two reasonable choices: Say that
> > > smp_mb__{before,after}_atomic will always provide a compiler barrier,
> > > or don't say this. I see no point in saying that the combination of
> > > Before-atomic followed by RMW provides a barrier.
> >
> > ;-/ I'm fine with the first choice. I don't see how the second choice
> > (this proposal/patch) would be consistent with some documentation and
> > with the current implementations; for example,
> >
> > 1) Documentation/atomic_t.txt says:
> >
> > Thus:
> >
> > atomic_fetch_add();
> >
> > is equivalent to:
> >
> > smp_mb__before_atomic();
> > atomic_fetch_add_relaxed();
> > smp_mb__after_atomic();
> >
> > [...]
> >
> > 2) Some implementations of the _relaxed() variants do not provide any
> > compiler barrier currently.
>
> But don't all implementations of smp_mb__before_atomic() and
> smp_mb__after_atomic() currently supply a compiler barrier?

Yes, AFAICS, all implementations of smp_mb__{before,after}_atomic() currently
supply a compiler barrier.

Nevertheless, there's a difference between: (1) specifying that these barriers
supply a compiler barrier, (2) specifying that (certain) combinations of these
barriers and RMWs supply a compiler barrier, and (3) this patch... ;-)

FWIW, I'm not aware of current/informal documentation following (the arguably
simpler but slightly stronger) (1). But again (amending my last remark): (1)
and (2) both make sense to me.

Thanx,
Andrea


>
> Thanx, Paul
>

2019-04-15 13:37:56

by Paul E. McKenney

[permalink] [raw]
Subject: Re: Adding plain accesses and detecting data races in the LKMM

On Sat, Apr 13, 2019 at 11:39:38PM +0200, Andrea Parri wrote:
> On Tue, Apr 09, 2019 at 08:01:32AM -0700, Paul E. McKenney wrote:
> > On Tue, Apr 09, 2019 at 03:36:18AM +0200, Andrea Parri wrote:
> > > > > The formula was more along the line of "do not assume either of these
> > > > > cases to hold; use barrier() is you need an unconditional barrier..."
> > > > > AFAICT, all current implementations of smp_mb__{before,after}_atomic()
> > > > > provides a compiler barrier with either barrier() or "memory" clobber.
> > > >
> > > > Well, we have two reasonable choices: Say that
> > > > smp_mb__{before,after}_atomic will always provide a compiler barrier,
> > > > or don't say this. I see no point in saying that the combination of
> > > > Before-atomic followed by RMW provides a barrier.
> > >
> > > ;-/ I'm fine with the first choice. I don't see how the second choice
> > > (this proposal/patch) would be consistent with some documentation and
> > > with the current implementations; for example,
> > >
> > > 1) Documentation/atomic_t.txt says:
> > >
> > > Thus:
> > >
> > > atomic_fetch_add();
> > >
> > > is equivalent to:
> > >
> > > smp_mb__before_atomic();
> > > atomic_fetch_add_relaxed();
> > > smp_mb__after_atomic();
> > >
> > > [...]
> > >
> > > 2) Some implementations of the _relaxed() variants do not provide any
> > > compiler barrier currently.
> >
> > But don't all implementations of smp_mb__before_atomic() and
> > smp_mb__after_atomic() currently supply a compiler barrier?
>
> Yes, AFAICS, all implementations of smp_mb__{before,after}_atomic() currently
> supply a compiler barrier.
>
> Nevertheless, there's a difference between: (1) Specify that these barriers
> supply a compiler barrier, (2) Specify that (certain) combinations of these
> barriers and RMWs supply a compiler barrier, and (3) This patch... ;-)
>
> FWIW, I'm not aware of current/informal documentation following (the arguably
> simpler but slightly stronger) (1). But again (amending my last remark): (1)
> and (2) both make sense to me.

Another question is "should the kernel permit smp_mb__{before,after}*()
anywhere other than immediately before or after the primitive being
strengthened?"

Thoughts?

Thanx, Paul

2019-04-15 13:51:03

by Alan Stern

[permalink] [raw]
Subject: Re: Adding plain accesses and detecting data races in the LKMM

On Mon, 15 Apr 2019, Paul E. McKenney wrote:

> Another question is "should the kernel permit smp_mb__{before,after}*()
> anywhere other than immediately before or after the primitive being
> strengthened?"
>
> Thoughts?

How would the kernel forbid other constructions?

Alan

2019-04-15 13:54:37

by Paul E. McKenney

[permalink] [raw]
Subject: Re: Adding plain accesses and detecting data races in the LKMM

On Mon, Apr 15, 2019 at 09:50:03AM -0400, Alan Stern wrote:
> On Mon, 15 Apr 2019, Paul E. McKenney wrote:
>
> > Another question is "should the kernel permit smp_mb__{before,after}*()
> > anywhere other than immediately before or after the primitive being
> > strengthened?"
> >
> > Thoughts?
>
> How would the kernel forbid other constructions?

Coccinelle scripts added to 0day test robot.

Thanx, Paul

2019-04-18 12:55:29

by Andrea Parri

[permalink] [raw]
Subject: Re: Adding plain accesses and detecting data races in the LKMM

> Another question is "should the kernel permit smp_mb__{before,after}*()
> anywhere other than immediately before or after the primitive being
> strengthened?"

Mmh, I do think that keeping these barriers "immediately before or after
the primitive being strengthened" is a good practice (readability, and
all that), if this is what you're suggesting.

However, a first auditing of the callsites showed that this practice is
in fact not always applied, notably... ;-)

kernel/rcu/tree_exp.h:sync_exp_work_done
kernel/sched/cpupri.c:cpupri_set

So there appear, at least, to be some exceptions/reasons for not always
following it? Thoughts?

BTW, while auditing these callsites, I've stumbled across the following
snippet (from kernel/futex.c):

*futex = newval;
sys_futex(WAKE, futex);
futex_wake(futex);
smp_mb(); (B)
if (waiters)
...

where B is actually (c.f., futex_get_mm()):

atomic_inc(...->mm_count);
smp_mb__after_atomic();

It seems worth mentioning the fact that, AFAICT, this sequence does not
necessarily provide ordering when plain accesses are involved: consider,
e.g., the following variant of the snippet:

A:*x = 1;
/*
* I've "ignored" the syscall, which should provide
* (at least) a compiler barrier...
*/
atomic_inc(u);
smp_mb__after_atomic();
B:r0 = *y;

On x86, AFAICT, the compiler can do this:

atomic_inc(u);
A:*x = 1;
smp_mb__after_atomic();
B:r0 = *y;

(the implementation of atomic_inc() contains no compiler barrier), then
the CPU can "reorder" A and B (smp_mb__after_atomic() being #defined to
a compiler barrier).
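
To make this concrete, here is a small standalone approximation (userspace
C, GCC on x86-64) of the two v5.1-rc5 definitions in question; the my_
prefix marks these as local sketches rather than the kernel's own code:

#include <stdio.h>

typedef struct { int counter; } my_atomic_t;

/*
 * Sketch of x86 atomic_inc(): a LOCK-prefixed RMW, hence a full
 * barrier for the CPU, but note the absence of a "memory" clobber,
 * so the compiler may move other accesses across it.
 */
static inline void my_atomic_inc(my_atomic_t *v)
{
	asm volatile("lock incl %0" : "+m" (v->counter));
}

/*
 * Sketch of x86 smp_mb__after_atomic(), i.e. barrier(): a
 * compiler-only barrier, no instruction is emitted.
 */
#define my_smp_mb__after_atomic() asm volatile("" ::: "memory")

int main(void)
{
	my_atomic_t u = { 0 };

	my_atomic_inc(&u);
	my_smp_mb__after_atomic();
	printf("u = %d\n", u.counter);
	return 0;
}

With these definitions, the compiler may sink the plain store A below the
locked instruction, and the store buffer may then delay it past the plain
load B, since no mfence or locked instruction separates the two.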

The mips implementation seems also affected by such "reorderings": I am
not familiar with this implementation but, AFAICT, it does not enforce
ordering from A to B in the following snippet:

A:*x = 1;
atomic_inc(u);
smp_mb__after_atomic();
B:WRITE_ONCE(*y, 1);

when CONFIG_WEAK_ORDERING=y, CONFIG_WEAK_REORDERING_BEYOND_LLSC=n.
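
For reference, the marked-access analogue of this pattern can be written
as a litmus test that the current herd model can evaluate (the test name
and variable names are mine):

C after-atomic-weakening

{}

P0(int *x, int *y, atomic_t *u)
{
	WRITE_ONCE(*x, 1);	/* A (marked here; plain in the snippet above) */
	atomic_inc(u);
	smp_mb__after_atomic();
	WRITE_ONCE(*y, 1);	/* B */
}

P1(int *x, int *y)
{
	int r0;
	int r1;

	r0 = READ_ONCE(*y);
	smp_rmb();
	r1 = READ_ONCE(*x);
}

exists (1:r0=1 /\ 1:r1=0)

AFAICT, the current ([M] ; po? ; [RMW] ; fencerel(After-atomic) ; [M])
term forbids this outcome, whereas a model in which smp_mb__after_atomic()
only orders the atomic_inc() itself against B would allow it.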

Do these observations make sense to you? Thoughts?

Andrea

2019-04-18 17:45:47

by Alan Stern

[permalink] [raw]
Subject: Re: Adding plain accesses and detecting data races in the LKMM

On Thu, 18 Apr 2019, Andrea Parri wrote:

> > Another question is "should the kernel permit smp_mb__{before,after}*()
> > anywhere other than immediately before or after the primitive being
> > strengthened?"
>
> Mmh, I do think that keeping these barriers "immediately before or after
> the primitive being strengthened" is a good practice (readability, and
> all that), if this is what you're suggesting.
>
> However, a first auditing of the callsites showed that this practice is
> in fact not always applied, notably... ;-)
>
> kernel/rcu/tree_exp.h:sync_exp_work_done
> kernel/sched/cpupri.c:cpupri_set
>
> So there appear, at least, to be some exceptions/reasons for not always
> following it? Thoughts?
>
> BTW, while auditing these callsites, I've stumbled across the following
> snippet (from kernel/futex.c):
>
> *futex = newval;
> sys_futex(WAKE, futex);
> futex_wake(futex);
> smp_mb(); (B)
> if (waiters)
> ...
>
> where B is actually (c.f., futex_get_mm()):
>
> atomic_inc(...->mm_count);
> smp_mb__after_atomic();
>
> It seems worth mentioning the fact that, AFAICT, this sequence does not
> necessarily provide ordering when plain accesses are involved: consider,
> e.g., the following variant of the snippet:
>
> A:*x = 1;
> /*
> * I've "ignored" the syscall, which should provide
> * (at least) a compiler barrier...
> */
> atomic_inc(u);
> smp_mb__after_atomic();
> B:r0 = *y;
>
> On x86, AFAICT, the compiler can do this:
>
> atomic_inc(u);
> A:*x = 1;
> smp_mb__after_atomic();
> B:r0 = *y;
>
> (the implementation of atomic_inc() contains no compiler barrier), then
> the CPU can "reorder" A and B (smp_mb__after_atomic() being #defined to
> a compiler barrier).

Are you saying that on x86, atomic_inc() acts as a full memory barrier
but not as a compiler barrier, and vice versa for
smp_mb__after_atomic()? Or that neither atomic_inc() nor
smp_mb__after_atomic() implements a full memory barrier?

Either one seems like a very dangerous situation indeed.

Alan

> The mips implementation seems also affected by such "reorderings": I am
> not familiar with this implementation but, AFAICT, it does not enforce
> ordering from A to B in the following snippet:
>
> A:*x = 1;
> atomic_inc(u);
> smp_mb__after_atomic();
> B:WRITE_ONCE(*y, 1);
>
> when CONFIG_WEAK_ORDERING=y, CONFIG_WEAK_REORDERING_BEYOND_LLSC=n.
>
> Do these observations make sense to you? Thoughts?
>
> Andrea

2019-04-18 18:40:58

by Paul E. McKenney

[permalink] [raw]
Subject: Re: Adding plain accesses and detecting data races in the LKMM

On Thu, Apr 18, 2019 at 01:44:36PM -0400, Alan Stern wrote:
> On Thu, 18 Apr 2019, Andrea Parri wrote:
>
> > > Another question is "should the kernel permit smp_mb__{before,after}*()
> > > anywhere other than immediately before or after the primitive being
> > > strengthened?"
> >
> > Mmh, I do think that keeping these barriers "immediately before or after
> > the primitive being strengthened" is a good practice (readability, and
> > all that), if this is what you're suggesting.
> >
> > However, a first auditing of the callsites showed that this practice is
> > in fact not always applied, notably... ;-)
> >
> > kernel/rcu/tree_exp.h:sync_exp_work_done
> > kernel/sched/cpupri.c:cpupri_set
> >
> > So there appear, at least, to be some exceptions/reasons for not always
> > following it? Thoughts?
> >
> > BTW, while auditing these callsites, I've stumbled across the following
> > snippet (from kernel/futex.c):
> >
> > *futex = newval;
> > sys_futex(WAKE, futex);
> > futex_wake(futex);
> > smp_mb(); (B)
> > if (waiters)
> > ...
> >
> > where B is actually (c.f., futex_get_mm()):
> >
> > atomic_inc(...->mm_count);
> > smp_mb__after_atomic();
> >
> > It seems worth mentioning the fact that, AFAICT, this sequence does not
> > necessarily provide ordering when plain accesses are involved: consider,
> > e.g., the following variant of the snippet:
> >
> > A:*x = 1;
> > /*
> > * I've "ignored" the syscall, which should provide
> > * (at least) a compiler barrier...
> > */
> > atomic_inc(u);
> > smp_mb__after_atomic();
> > B:r0 = *y;
> >
> > On x86, AFAICT, the compiler can do this:
> >
> > atomic_inc(u);
> > A:*x = 1;
> > smp_mb__after_atomic();
> > B:r0 = *y;
> >
> > (the implementation of atomic_inc() contains no compiler barrier), then
> > the CPU can "reorder" A and B (smp_mb__after_atomic() being #defined to
> > a compiler barrier).
>
> Are you saying that on x86, atomic_inc() acts as a full memory barrier
> but not as a compiler barrier, and vice versa for
> smp_mb__after_atomic()? Or that neither atomic_inc() nor
> smp_mb__after_atomic() implements a full memory barrier?
>
> Either one seems like a very dangerous situation indeed.

If I am following the macro-name breadcrumb trails correctly, x86's
atomic_inc() does have a compiler barrier. But this is an accident
of implementation -- from what I can see, it is not required to do so.

So smp_mb__after_atomic() is only guaranteed to order the atomic_inc()
before B, not A. To order A before B in the above example, an
smp_mb__before_atomic() is also needed.

But now that I look, LKMM looks to be stating a stronger guarantee:

([M] ; fencerel(Before-atomic) ; [RMW] ; po? ; [M]) |
([M] ; po? ; [RMW] ; fencerel(After-atomic) ; [M]) |
([M] ; po? ; [LKW] ; fencerel(After-spinlock) ; [M]) |
([M] ; po ; [UL] ; (co | po) ; [LKW] ;
fencerel(After-unlock-lock) ; [M])

Maybe something like this?

([M] ; fencerel(Before-atomic) ; [RMW] ; fencerel(After-atomic) ; [M]) |
([M] ; fencerel(Before-atomic) ; [RMW]) |
([RMW] ; fencerel(After-atomic) ; [M]) |
([M] ; po? ; [LKW] ; fencerel(After-spinlock) ; [M]) |
([M] ; po ; [UL] ; (co | po) ; [LKW] ;
fencerel(After-unlock-lock) ; [M])
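
As a sanity check on the difference, consider a store-buffering-shaped
litmus test with only an smp_mb__before_atomic() on the first CPU (the
test name and variable names are illustrative):

C before-atomic-only

{}

P0(int *x, int *y, atomic_t *u)
{
	int r0;

	WRITE_ONCE(*x, 1);
	smp_mb__before_atomic();
	atomic_inc(u);
	r0 = READ_ONCE(*y);
}

P1(int *x, int *y)
{
	int r1;

	WRITE_ONCE(*y, 1);
	smp_mb();
	r1 = READ_ONCE(*x);
}

exists (0:r0=0 /\ 1:r1=0)

If I am reading things correctly, the current po?-based terms forbid this
outcome, while the weaker terms sketched above would allow it, matching
the implementation-level arguments earlier in the thread.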

Who is the lead maintainer for this stuff, anyway??? ;-)

Thanx, Paul

> Alan
>
> > The mips implementation seems also affected by such "reorderings": I am
> > not familiar with this implementation but, AFAICT, it does not enforce
> > ordering from A to B in the following snippet:
> >
> > A:*x = 1;
> > atomic_inc(u);
> > smp_mb__after_atomic();
> > B:WRITE_ONCE(*y, 1);
> >
> > when CONFIG_WEAK_ORDERING=y, CONFIG_WEAK_REORDERING_BEYOND_LLSC=n.
> >
> > Do these observations make sense to you? Thoughts?
> >
> > Andrea
>

2019-04-18 20:21:47

by Alan Stern

[permalink] [raw]
Subject: Re: Adding plain accesses and detecting data races in the LKMM

On Thu, 18 Apr 2019, Paul E. McKenney wrote:

> > Are you saying that on x86, atomic_inc() acts as a full memory barrier
> > but not as a compiler barrier, and vice versa for
> > smp_mb__after_atomic()? Or that neither atomic_inc() nor
> > smp_mb__after_atomic() implements a full memory barrier?
> >
> > Either one seems like a very dangerous situation indeed.
>
> If I am following the macro-name breadcrumb trails correctly, x86's
> atomic_inc() does have a compiler barrier. But this is an accident
> of implementation -- from what I can see, it is not required to do so.
>
> So smp_mb__after_atomic() is only guaranteed to order the atomic_inc()
> before B, not A. To order A before B in the above example, an
> smp_mb__before_atomic() is also needed.

Are you certain?

> But now that I look, LKMM looks to be stating a stronger guarantee:
>
> ([M] ; fencerel(Before-atomic) ; [RMW] ; po? ; [M]) |
> ([M] ; po? ; [RMW] ; fencerel(After-atomic) ; [M]) |
> ([M] ; po? ; [LKW] ; fencerel(After-spinlock) ; [M]) |
> ([M] ; po ; [UL] ; (co | po) ; [LKW] ;
> fencerel(After-unlock-lock) ; [M])
>
> Maybe something like this?
>
> ([M] ; fencerel(Before-atomic) ; [RMW] ; fencerel(After-atomic) ; [M]) |
> ([M] ; fencerel(Before-atomic) ; [RMW]) |
> ([RMW] ; fencerel(After-atomic) ; [M]) |
> ([M] ; po? ; [LKW] ; fencerel(After-spinlock) ; [M]) |
> ([M] ; po ; [UL] ; (co | po) ; [LKW] ;
> fencerel(After-unlock-lock) ; [M])

The first line you wrote is redundant; it follows from the second and
third lines.

Aside from that, while this proposal may be correct and may express
what smp_mb__{before|after}_atomic really are intended to do, it
contradicts Documentation/atomic_t.txt. That file says:

These barriers provide a full smp_mb().

And of course, a full smp_mb() would order everything before it against
everything after it. If we update the model then we should also update
that file.

In addition, it's noteworthy that smp_mb__after_spinlock and
smp_mb__after_unlock_lock do not behave in this way.

Alan

2019-04-19 00:54:30

by Andrea Parri

[permalink] [raw]
Subject: Re: Adding plain accesses and detecting data races in the LKMM

> Are you saying that on x86, atomic_inc() acts as a full memory barrier
> but not as a compiler barrier, and vice versa for
> smp_mb__after_atomic()? Or that neither atomic_inc() nor
> smp_mb__after_atomic() implements a full memory barrier?

I'd say the former; AFAICT, these boil down to:

https://elixir.bootlin.com/linux/v5.1-rc5/source/arch/x86/include/asm/atomic.h#L95
https://elixir.bootlin.com/linux/v5.1-rc5/source/arch/x86/include/asm/barrier.h#L84

Andrea

2019-04-19 18:20:24

by Paul E. McKenney

[permalink] [raw]
Subject: Re: Adding plain accesses and detecting data races in the LKMM

On Fri, Apr 19, 2019 at 10:34:06AM -0400, Alan Stern wrote:
> On Fri, 19 Apr 2019, Paul E. McKenney wrote:
>
> > On Fri, Apr 19, 2019 at 02:53:02AM +0200, Andrea Parri wrote:
> > > > Are you saying that on x86, atomic_inc() acts as a full memory barrier
> > > > but not as a compiler barrier, and vice versa for
> > > > smp_mb__after_atomic()? Or that neither atomic_inc() nor
> > > > smp_mb__after_atomic() implements a full memory barrier?
> > >
> > > I'd say the former; AFAICT, these boil down to:
> > >
> > > https://elixir.bootlin.com/linux/v5.1-rc5/source/arch/x86/include/asm/atomic.h#L95
> > > https://elixir.bootlin.com/linux/v5.1-rc5/source/arch/x86/include/asm/barrier.h#L84
> >
> > OK, how about the following?
> >
> > Thanx, Paul
> >
> > ------------------------------------------------------------------------
> >
> > commit 19d166dadc4e1bba4b248fb46d32ca4f2d10896b
> > Author: Paul E. McKenney <[email protected]>
> > Date: Fri Apr 19 05:20:30 2019 -0700
> >
> > tools/memory-model: Make smp_mb__{before,after}_atomic() match x86
> >
> > Read-modify-write atomic operations that do not return values need not
> > provide any ordering guarantees, and this means that both the compiler
> > and the CPU are free to reorder accesses across things like atomic_inc()
> > and atomic_dec(). The stronger systems such as x86 allow the compiler
> > to do the reordering, but prevent the CPU from so doing, and these
> > systems implement smp_mb__{before,after}_atomic() as compiler barriers.
> > The weaker systems such as Power allow both the compiler and the CPU
> > to reorder accesses across things like atomic_inc() and atomic_dec(),
> > and implement smp_mb__{before,after}_atomic() as full memory barriers.
> >
> > This means that smp_mb__before_atomic() only orders the atomic operation
> > itself with accesses preceding the smp_mb__before_atomic(), and does
> > not necessarily provide any ordering whatsoever against accesses
> > folowing the atomic operation. Similarly, smp_mb__after_atomic()
> > only orders the atomic operation itself with accesses following the
> > smp_mb__after_atomic(), and does not necessarily provide any ordering
> > whatsoever against accesses preceding the atomic operation. Full ordering
> > therefore requires both an smp_mb__before_atomic() before the atomic
> > operation and an smp_mb__after_atomic() after the atomic operation.
> >
> > Therefore, linux-kernel.cat's current model of Before-atomic
> > and After-atomic is too strong, as it guarantees ordering of
> > accesses on the other side of the atomic operation from the
> > smp_mb__{before,after}_atomic(). This commit therefore weakens
> > the guarantee to match the semantics called out above.
> >
> > Reported-by: Andrea Parri <[email protected]>
> > Suggested-by: Alan Stern <[email protected]>
> > Signed-off-by: Paul E. McKenney <[email protected]>
> >
> > diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
> > index 169d938c0b53..e5b97c3e8e39 100644
> > --- a/Documentation/memory-barriers.txt
> > +++ b/Documentation/memory-barriers.txt
> > @@ -1888,7 +1888,37 @@ There are some more advanced barrier functions:
> > atomic_dec(&obj->ref_count);
> >
> > This makes sure that the death mark on the object is perceived to be set
> > - *before* the reference counter is decremented.
> > + *before* the reference counter is decremented. However, please note
> > + that smp_mb__before_atomic()'s ordering guarantee does not necessarily
> > + extend beyond the atomic operation. For example:
> > +
> > + obj->dead = 1;
> > + smp_mb__before_atomic();
> > + atomic_dec(&obj->ref_count);
> > + r1 = a;
> > +
> > + Here the store to obj->dead is not guaranteed to be ordered with
> > + with the load from a. This reordering can happen on x86 as follows:
> > + (1) The compiler can reorder the load from a to precede the
> > + atomic_dec(), (2) Because x86 smp_mb__before_atomic() is only a
> > + compiler barrier, the CPU can reorder the preceding store to
> > + obj->dead with the later load from a.
> > +
> > + This could be avoided by using READ_ONCE(), which would prevent the
> > + compiler from reordering due to both atomic_dec() and READ_ONCE()
> > + being volatile accesses, and is usually preferable for loads from
> > + shared variables. However, weakly ordered CPUs would still be
> > + free to reorder the atomic_dec() with the load from a, so a more
> > + readable option is to also use smp_mb__after_atomic() as follows:
> > +
> > + WRITE_ONCE(obj->dead, 1);
> > + smp_mb__before_atomic();
> > + atomic_dec(&obj->ref_count);
> > + smp_mb__after_atomic();
> > + r1 = READ_ONCE(a);
> > +
> > + This orders all three accesses against each other, and also makes
> > + the intent quite clear.
> >
> > See Documentation/atomic_{t,bitops}.txt for more information.
> >
> > diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
> > index 8dcb37835b61..b6866f93abb8 100644
> > --- a/tools/memory-model/linux-kernel.cat
> > +++ b/tools/memory-model/linux-kernel.cat
> > @@ -28,8 +28,8 @@ include "lock.cat"
> > let rmb = [R \ Noreturn] ; fencerel(Rmb) ; [R \ Noreturn]
> > let wmb = [W] ; fencerel(Wmb) ; [W]
> > let mb = ([M] ; fencerel(Mb) ; [M]) |
> > - ([M] ; fencerel(Before-atomic) ; [RMW] ; po? ; [M]) |
> > - ([M] ; po? ; [RMW] ; fencerel(After-atomic) ; [M]) |
> > + ([M] ; fencerel(Before-atomic) ; [RMW]) |
> > + ([RMW] ; fencerel(After-atomic) ; [M]) |
> > ([M] ; po? ; [LKW] ; fencerel(After-spinlock) ; [M]) |
> > ([M] ; po ; [UL] ; (co | po) ; [LKW] ;
> > fencerel(After-unlock-lock) ; [M])
>
> Something like the following should also be applied, either as part of
> the same patch or immediately after.

Please do send a patch!

Thanx, Paul

> Alan
>
>
> Index: usb-devel/Documentation/atomic_t.txt
> ===================================================================
> --- usb-devel.orig/Documentation/atomic_t.txt
> +++ usb-devel/Documentation/atomic_t.txt
> @@ -171,7 +171,10 @@ The barriers:
> smp_mb__{before,after}_atomic()
>
> only apply to the RMW ops and can be used to augment/upgrade the ordering
> -inherent to the used atomic op. These barriers provide a full smp_mb().
> +inherent to the used atomic op. Unlike normal smp_mb() barriers, they order
> +only the RMW op itself against the instructions preceding the
> +smp_mb__before_atomic() or following the smp_mb__after_atomic(); they do
> +not order instructions on the other side of the RMW op at all.
>
> These helper barriers exist because architectures have varying implicit
> ordering on their SMP atomic primitives. For example our TSO architectures
> @@ -195,7 +198,8 @@ Further, while something like:
> atomic_dec(&X);
>
> is a 'typical' RELEASE pattern, the barrier is strictly stronger than
> -a RELEASE. Similarly for something like:
> +a RELEASE because it orders preceding instructions against both the read
> +and write parts of the atomic_dec(). Similarly, something like:
>
> atomic_inc(&X);
> smp_mb__after_atomic();
> @@ -227,7 +231,8 @@ strictly stronger than ACQUIRE. As illus
>
> This should not happen; but a hypothetical atomic_inc_acquire() --
> (void)atomic_fetch_inc_acquire() for instance -- would allow the outcome,
> -since then:
> +because it would not order the W part of the RMW against the following
> +WRITE_ONCE. Thus:
>
> P1 P2
>
>


2019-04-19 18:36:03

by Alan Stern

[permalink] [raw]
Subject: Re: Adding plain accesses and detecting data races in the LKMM

On Fri, 19 Apr 2019, Paul E. McKenney wrote:

> On Fri, Apr 19, 2019 at 02:53:02AM +0200, Andrea Parri wrote:
> > > Are you saying that on x86, atomic_inc() acts as a full memory barrier
> > > but not as a compiler barrier, and vice versa for
> > > smp_mb__after_atomic()? Or that neither atomic_inc() nor
> > > smp_mb__after_atomic() implements a full memory barrier?
> >
> > I'd say the former; AFAICT, these boil down to:
> >
> > https://elixir.bootlin.com/linux/v5.1-rc5/source/arch/x86/include/asm/atomic.h#L95
> > https://elixir.bootlin.com/linux/v5.1-rc5/source/arch/x86/include/asm/barrier.h#L84
>
> OK, how about the following?
>
> Thanx, Paul
>
> ------------------------------------------------------------------------
>
> commit 19d166dadc4e1bba4b248fb46d32ca4f2d10896b
> Author: Paul E. McKenney <[email protected]>
> Date: Fri Apr 19 05:20:30 2019 -0700
>
> tools/memory-model: Make smp_mb__{before,after}_atomic() match x86
>
> Read-modify-write atomic operations that do not return values need not
> provide any ordering guarantees, and this means that both the compiler
> and the CPU are free to reorder accesses across things like atomic_inc()
> and atomic_dec(). The stronger systems such as x86 allow the compiler
> to do the reordering, but prevent the CPU from so doing, and these
> systems implement smp_mb__{before,after}_atomic() as compiler barriers.
> The weaker systems such as Power allow both the compiler and the CPU
> to reorder accesses across things like atomic_inc() and atomic_dec(),
> and implement smp_mb__{before,after}_atomic() as full memory barriers.
>
> This means that smp_mb__before_atomic() only orders the atomic operation
> itself with accesses preceding the smp_mb__before_atomic(), and does
> not necessarily provide any ordering whatsoever against accesses
> folowing the atomic operation. Similarly, smp_mb__after_atomic()
> only orders the atomic operation itself with accesses following the
> smp_mb__after_atomic(), and does not necessarily provide any ordering
> whatsoever against accesses preceding the atomic operation. Full ordering
> therefore requires both an smp_mb__before_atomic() before the atomic
> operation and an smp_mb__after_atomic() after the atomic operation.
>
> Therefore, linux-kernel.cat's current model of Before-atomic
> and After-atomic is too strong, as it guarantees ordering of
> accesses on the other side of the atomic operation from the
> smp_mb__{before,after}_atomic(). This commit therefore weakens
> the guarantee to match the semantics called out above.
>
> Reported-by: Andrea Parri <[email protected]>
> Suggested-by: Alan Stern <[email protected]>
> Signed-off-by: Paul E. McKenney <[email protected]>
>
> diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
> index 169d938c0b53..e5b97c3e8e39 100644
> --- a/Documentation/memory-barriers.txt
> +++ b/Documentation/memory-barriers.txt
> @@ -1888,7 +1888,37 @@ There are some more advanced barrier functions:
> atomic_dec(&obj->ref_count);
>
> This makes sure that the death mark on the object is perceived to be set
> - *before* the reference counter is decremented.
> + *before* the reference counter is decremented. However, please note
> + that smp_mb__before_atomic()'s ordering guarantee does not necessarily
> + extend beyond the atomic operation. For example:
> +
> + obj->dead = 1;
> + smp_mb__before_atomic();
> + atomic_dec(&obj->ref_count);
> + r1 = a;
> +
> + Here the store to obj->dead is not guaranteed to be ordered with
> + with the load from a. This reordering can happen on x86 as follows:
> + (1) The compiler can reorder the load from a to precede the
> + atomic_dec(), (2) Because x86 smp_mb__before_atomic() is only a
> + compiler barrier, the CPU can reorder the preceding store to
> + obj->dead with the later load from a.
> +
> + This could be avoided by using READ_ONCE(), which would prevent the
> + compiler from reordering due to both atomic_dec() and READ_ONCE()
> + being volatile accesses, and is usually preferable for loads from
> + shared variables. However, weakly ordered CPUs would still be
> + free to reorder the atomic_dec() with the load from a, so a more
> + readable option is to also use smp_mb__after_atomic() as follows:
> +
> + WRITE_ONCE(obj->dead, 1);
> + smp_mb__before_atomic();
> + atomic_dec(&obj->ref_count);
> + smp_mb__after_atomic();
> + r1 = READ_ONCE(a);
> +
> + This orders all three accesses against each other, and also makes
> + the intent quite clear.
>
> See Documentation/atomic_{t,bitops}.txt for more information.
>
> diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
> index 8dcb37835b61..b6866f93abb8 100644
> --- a/tools/memory-model/linux-kernel.cat
> +++ b/tools/memory-model/linux-kernel.cat
> @@ -28,8 +28,8 @@ include "lock.cat"
> let rmb = [R \ Noreturn] ; fencerel(Rmb) ; [R \ Noreturn]
> let wmb = [W] ; fencerel(Wmb) ; [W]
> let mb = ([M] ; fencerel(Mb) ; [M]) |
> - ([M] ; fencerel(Before-atomic) ; [RMW] ; po? ; [M]) |
> - ([M] ; po? ; [RMW] ; fencerel(After-atomic) ; [M]) |
> + ([M] ; fencerel(Before-atomic) ; [RMW]) |
> + ([RMW] ; fencerel(After-atomic) ; [M]) |
> ([M] ; po? ; [LKW] ; fencerel(After-spinlock) ; [M]) |
> ([M] ; po ; [UL] ; (co | po) ; [LKW] ;
> fencerel(After-unlock-lock) ; [M])

Something like the following should also be applied, either as part of
the same patch or immediately after.

Alan


Index: usb-devel/Documentation/atomic_t.txt
===================================================================
--- usb-devel.orig/Documentation/atomic_t.txt
+++ usb-devel/Documentation/atomic_t.txt
@@ -171,7 +171,10 @@ The barriers:
smp_mb__{before,after}_atomic()

only apply to the RMW ops and can be used to augment/upgrade the ordering
-inherent to the used atomic op. These barriers provide a full smp_mb().
+inherent to the used atomic op. Unlike normal smp_mb() barriers, they order
+only the RMW op itself against the instructions preceding the
+smp_mb__before_atomic() or following the smp_mb__after_atomic(); they do
+not order instructions on the other side of the RMW op at all.

These helper barriers exist because architectures have varying implicit
ordering on their SMP atomic primitives. For example our TSO architectures
@@ -195,7 +198,8 @@ Further, while something like:
atomic_dec(&X);

is a 'typical' RELEASE pattern, the barrier is strictly stronger than
-a RELEASE. Similarly for something like:
+a RELEASE because it orders preceding instructions against both the read
+and write parts of the atomic_dec(). Similarly, something like:

atomic_inc(&X);
smp_mb__after_atomic();
@@ -227,7 +231,8 @@ strictly stronger than ACQUIRE. As illus

This should not happen; but a hypothetical atomic_inc_acquire() --
(void)atomic_fetch_inc_acquire() for instance -- would allow the outcome,
-since then:
+because it would not order the W part of the RMW against the following
+WRITE_ONCE. Thus:

P1 P2



2019-04-19 18:44:10

by Paul E. McKenney

[permalink] [raw]
Subject: Re: Adding plain accesses and detecting data races in the LKMM

On Fri, Apr 19, 2019 at 02:53:02AM +0200, Andrea Parri wrote:
> > Are you saying that on x86, atomic_inc() acts as a full memory barrier
> > but not as a compiler barrier, and vice versa for
> > smp_mb__after_atomic()? Or that neither atomic_inc() nor
> > smp_mb__after_atomic() implements a full memory barrier?
>
> I'd say the former; AFAICT, these boil down to:
>
> https://elixir.bootlin.com/linux/v5.1-rc5/source/arch/x86/include/asm/atomic.h#L95
> https://elixir.bootlin.com/linux/v5.1-rc5/source/arch/x86/include/asm/barrier.h#L84

OK, how about the following?

Thanx, Paul

------------------------------------------------------------------------

commit 19d166dadc4e1bba4b248fb46d32ca4f2d10896b
Author: Paul E. McKenney <[email protected]>
Date: Fri Apr 19 05:20:30 2019 -0700

tools/memory-model: Make smp_mb__{before,after}_atomic() match x86

Read-modify-write atomic operations that do not return values need not
provide any ordering guarantees, and this means that both the compiler
and the CPU are free to reorder accesses across things like atomic_inc()
and atomic_dec(). The stronger systems such as x86 allow the compiler
to do the reordering, but prevent the CPU from so doing, and these
systems implement smp_mb__{before,after}_atomic() as compiler barriers.
The weaker systems such as Power allow both the compiler and the CPU
to reorder accesses across things like atomic_inc() and atomic_dec(),
and implement smp_mb__{before,after}_atomic() as full memory barriers.

This means that smp_mb__before_atomic() only orders the atomic operation
itself with accesses preceding the smp_mb__before_atomic(), and does
not necessarily provide any ordering whatsoever against accesses
folowing the atomic operation. Similarly, smp_mb__after_atomic()
only orders the atomic operation itself with accesses following the
smp_mb__after_atomic(), and does not necessarily provide any ordering
whatsoever against accesses preceding the atomic operation. Full ordering
therefore requires both an smp_mb__before_atomic() before the atomic
operation and an smp_mb__after_atomic() after the atomic operation.

Therefore, linux-kernel.cat's current model of Before-atomic
and After-atomic is too strong, as it guarantees ordering of
accesses on the other side of the atomic operation from the
smp_mb__{before,after}_atomic(). This commit therefore weakens
the guarantee to match the semantics called out above.

Reported-by: Andrea Parri <[email protected]>
Suggested-by: Alan Stern <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>

diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index 169d938c0b53..e5b97c3e8e39 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -1888,7 +1888,37 @@ There are some more advanced barrier functions:
atomic_dec(&obj->ref_count);

This makes sure that the death mark on the object is perceived to be set
- *before* the reference counter is decremented.
+ *before* the reference counter is decremented. However, please note
+ that smp_mb__before_atomic()'s ordering guarantee does not necessarily
+ extend beyond the atomic operation. For example:
+
+ obj->dead = 1;
+ smp_mb__before_atomic();
+ atomic_dec(&obj->ref_count);
+ r1 = a;
+
+ Here the store to obj->dead is not guaranteed to be ordered with
+ with the load from a. This reordering can happen on x86 as follows:
+ (1) The compiler can reorder the load from a to precede the
+ atomic_dec(), (2) Because x86 smp_mb__before_atomic() is only a
+ compiler barrier, the CPU can reorder the preceding store to
+ obj->dead with the later load from a.
+
+ This could be avoided by using READ_ONCE(), which would prevent the
+ compiler from reordering due to both atomic_dec() and READ_ONCE()
+ being volatile accesses, and is usually preferable for loads from
+ shared variables. However, weakly ordered CPUs would still be
+ free to reorder the atomic_dec() with the load from a, so a more
+ readable option is to also use smp_mb__after_atomic() as follows:
+
+ WRITE_ONCE(obj->dead, 1);
+ smp_mb__before_atomic();
+ atomic_dec(&obj->ref_count);
+ smp_mb__after_atomic();
+ r1 = READ_ONCE(a);
+
+ This orders all three accesses against each other, and also makes
+ the intent quite clear.

See Documentation/atomic_{t,bitops}.txt for more information.

diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
index 8dcb37835b61..b6866f93abb8 100644
--- a/tools/memory-model/linux-kernel.cat
+++ b/tools/memory-model/linux-kernel.cat
@@ -28,8 +28,8 @@ include "lock.cat"
let rmb = [R \ Noreturn] ; fencerel(Rmb) ; [R \ Noreturn]
let wmb = [W] ; fencerel(Wmb) ; [W]
let mb = ([M] ; fencerel(Mb) ; [M]) |
- ([M] ; fencerel(Before-atomic) ; [RMW] ; po? ; [M]) |
- ([M] ; po? ; [RMW] ; fencerel(After-atomic) ; [M]) |
+ ([M] ; fencerel(Before-atomic) ; [RMW]) |
+ ([RMW] ; fencerel(After-atomic) ; [M]) |
([M] ; po? ; [LKW] ; fencerel(After-spinlock) ; [M]) |
([M] ; po ; [UL] ; (co | po) ; [LKW] ;
fencerel(After-unlock-lock) ; [M])
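
For completeness, a litmus-style rendering of the fully barriered example
(variable names as in the text above, plus a second CPU doing the
mirror-image accesses) could read:

C mb-before-and-after-atomic

{}

P0(int *dead, atomic_t *ref, int *a)
{
	int r1;

	WRITE_ONCE(*dead, 1);
	smp_mb__before_atomic();
	atomic_dec(ref);
	smp_mb__after_atomic();
	r1 = READ_ONCE(*a);
}

P1(int *dead, int *a)
{
	int r2;

	WRITE_ONCE(*a, 1);
	smp_mb();
	r2 = READ_ONCE(*dead);
}

exists (0:r1=0 /\ 1:r2=0)

If I am reading the weakened terms correctly, this outcome remains
forbidden, because the store to dead reaches the load from a through the
two fences anchored at the atomic_dec(); dropping the
smp_mb__after_atomic() removes that second link, which is essentially the
situation the new text warns about.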


2019-04-19 19:11:21

by Paul E. McKenney

[permalink] [raw]
Subject: Re: Adding plain accesses and detecting data races in the LKMM

On Sat, Apr 20, 2019 at 12:06:58AM +0900, Akira Yokosawa wrote:
> Hi Paul,
>
> Please find inline comments below.
>
> On Fri, 19 Apr 2019 05:47:20 -0700, Paul E. McKenney wrote:
> > On Fri, Apr 19, 2019 at 02:53:02AM +0200, Andrea Parri wrote:
> >>> Are you saying that on x86, atomic_inc() acts as a full memory barrier
> >>> but not as a compiler barrier, and vice versa for
> >>> smp_mb__after_atomic()? Or that neither atomic_inc() nor
> >>> smp_mb__after_atomic() implements a full memory barrier?
> >>
> >> I'd say the former; AFAICT, these boil down to:
> >>
> >> https://elixir.bootlin.com/linux/v5.1-rc5/source/arch/x86/include/asm/atomic.h#L95
> >> https://elixir.bootlin.com/linux/v5.1-rc5/source/arch/x86/include/asm/barrier.h#L84
> >
> > OK, how about the following?
> >
> > Thanx, Paul
> >
> > ------------------------------------------------------------------------
> >
> > commit 19d166dadc4e1bba4b248fb46d32ca4f2d10896b
> > Author: Paul E. McKenney <[email protected]>
> > Date: Fri Apr 19 05:20:30 2019 -0700
> >
> > tools/memory-model: Make smp_mb__{before,after}_atomic() match x86
> >
> > Read-modify-write atomic operations that do not return values need not
> > provide any ordering guarantees, and this means that both the compiler
> > and the CPU are free to reorder accesses across things like atomic_inc()
> > and atomic_dec(). The stronger systems such as x86 allow the compiler
> > to do the reordering, but prevent the CPU from so doing, and these
> > systems implement smp_mb__{before,after}_atomic() as compiler barriers.
> > The weaker systems such as Power allow both the compiler and the CPU
> > to reorder accesses across things like atomic_inc() and atomic_dec(),
> > and implement smp_mb__{before,after}_atomic() as full memory barriers.
> >
> > This means that smp_mb__before_atomic() only orders the atomic operation
> > itself with accesses preceding the smp_mb__before_atomic(), and does
> > not necessarily provide any ordering whatsoever against accesses
> > folowing the atomic operation. Similarly, smp_mb__after_atomic()
>
> s/folowing/following/

Good eyes, fixed!

> > only orders the atomic operation itself with accesses following the
> > smp_mb__after_atomic(), and does not necessarily provide any ordering
> > whatsoever against accesses preceding the atomic operation. Full ordering
> > therefore requires both an smp_mb__before_atomic() before the atomic
> > operation and an smp_mb__after_atomic() after the atomic operation.
> >
> > Therefore, linux-kernel.cat's current model of Before-atomic
> > and After-atomic is too strong, as it guarantees ordering of
> > accesses on the other side of the atomic operation from the
> > smp_mb__{before,after}_atomic(). This commit therefore weakens
> > the guarantee to match the semantics called out above.
> >
> > Reported-by: Andrea Parri <[email protected]>
> > Suggested-by: Alan Stern <[email protected]>
> > Signed-off-by: Paul E. McKenney <[email protected]>
> >
> > diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
> > index 169d938c0b53..e5b97c3e8e39 100644
> > --- a/Documentation/memory-barriers.txt
> > +++ b/Documentation/memory-barriers.txt
> > @@ -1888,7 +1888,37 @@ There are some more advanced barrier functions:
> > atomic_dec(&obj->ref_count);
> >
> > This makes sure that the death mark on the object is perceived to be set
> > - *before* the reference counter is decremented.
> > + *before* the reference counter is decremented. However, please note
> > + that smp_mb__before_atomic()'s ordering guarantee does not necessarily
> > + extend beyond the atomic operation. For example:
> > +
> > + obj->dead = 1;
> > + smp_mb__before_atomic();
> > + atomic_dec(&obj->ref_count);
> > + r1 = a;
> > +
> > + Here the store to obj->dead is not guaranteed to be ordered with
> > + with the load from a. This reordering can happen on x86 as follows:
>
> s/with//

Fixed fixed. ;-)

> And I beg you to avoid using the single letter variable "a".
> It's confusing.

Good point! I changed it to "x".

> > + (1) The compiler can reorder the load from a to precede the
> > + atomic_dec(), (2) Because x86 smp_mb__before_atomic() is only a
> > + compiler barrier, the CPU can reorder the preceding store to
> > + obj->dead with the later load from a.
> > +
> > + This could be avoided by using READ_ONCE(), which would prevent the
> > + compiler from reordering due to both atomic_dec() and READ_ONCE()
> > + being volatile accesses, and is usually preferable for loads from
> > + shared variables. However, weakly ordered CPUs would still be
> > + free to reorder the atomic_dec() with the load from a, so a more
> > + readable option is to also use smp_mb__after_atomic() as follows:
>
> The point here is not just "readability", but also the portability of the
> code, isn't it?

As Andrea noted, in this particular case, the guarantee that the
store to obj->dead precedes the load from x is portable. Either the
smp_mb__before_atomic() or the atomic_dec() must provide the ordering.
However, you are right that there is some non-portability. But this
non-portability involves the order of the atomic_dec() and the store to x.

So what I did was ...

> Thanks, Akira
>
> > +
> > + WRITE_ONCE(obj->dead, 1);
> > + smp_mb__before_atomic();
> > + atomic_dec(&obj->ref_count);
> > + smp_mb__after_atomic();
> > + r1 = READ_ONCE(a);
> > +
> > + This orders all three accesses against each other, and also makes
> > + the intent quite clear.

... change the above paragraph to read as follows:

In addition, the example without the smp_mb__after_atomic() does
not necessarily order the atomic_dec() with the load from x.
In contrast, the example with both smp_mb__before_atomic() and
smp_mb__after_atomic() orders all three accesses against each other,
and also makes the intent quite clear.

Does that help?

Thanx, Paul

> > See Documentation/atomic_{t,bitops}.txt for more information.
> >
> > diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
> > index 8dcb37835b61..b6866f93abb8 100644
> > --- a/tools/memory-model/linux-kernel.cat
> > +++ b/tools/memory-model/linux-kernel.cat
> > @@ -28,8 +28,8 @@ include "lock.cat"
> > let rmb = [R \ Noreturn] ; fencerel(Rmb) ; [R \ Noreturn]
> > let wmb = [W] ; fencerel(Wmb) ; [W]
> > let mb = ([M] ; fencerel(Mb) ; [M]) |
> > - ([M] ; fencerel(Before-atomic) ; [RMW] ; po? ; [M]) |
> > - ([M] ; po? ; [RMW] ; fencerel(After-atomic) ; [M]) |
> > + ([M] ; fencerel(Before-atomic) ; [RMW]) |
> > + ([RMW] ; fencerel(After-atomic) ; [M]) |
> > ([M] ; po? ; [LKW] ; fencerel(After-spinlock) ; [M]) |
> > ([M] ; po ; [UL] ; (co | po) ; [LKW] ;
> > fencerel(After-unlock-lock) ; [M])
> >
>


2019-04-19 19:25:02

by Akira Yokosawa

[permalink] [raw]
Subject: Re: Adding plain accesses and detecting data races in the LKMM

Hi Paul,

Please find inline comments below.

On Fri, 19 Apr 2019 05:47:20 -0700, Paul E. McKenney wrote:
> On Fri, Apr 19, 2019 at 02:53:02AM +0200, Andrea Parri wrote:
>>> Are you saying that on x86, atomic_inc() acts as a full memory barrier
>>> but not as a compiler barrier, and vice versa for
>>> smp_mb__after_atomic()? Or that neither atomic_inc() nor
>>> smp_mb__after_atomic() implements a full memory barrier?
>>
>> I'd say the former; AFAICT, these boil down to:
>>
>> https://elixir.bootlin.com/linux/v5.1-rc5/source/arch/x86/include/asm/atomic.h#L95
>> https://elixir.bootlin.com/linux/v5.1-rc5/source/arch/x86/include/asm/barrier.h#L84
>
> OK, how about the following?
>
> Thanx, Paul
>
> ------------------------------------------------------------------------
>
> commit 19d166dadc4e1bba4b248fb46d32ca4f2d10896b
> Author: Paul E. McKenney <[email protected]>
> Date: Fri Apr 19 05:20:30 2019 -0700
>
> tools/memory-model: Make smp_mb__{before,after}_atomic() match x86
>
> Read-modify-write atomic operations that do not return values need not
> provide any ordering guarantees, and this means that both the compiler
> and the CPU are free to reorder accesses across things like atomic_inc()
> and atomic_dec(). The stronger systems such as x86 allow the compiler
> to do the reordering, but prevent the CPU from so doing, and these
> systems implement smp_mb__{before,after}_atomic() as compiler barriers.
> The weaker systems such as Power allow both the compiler and the CPU
> to reorder accesses across things like atomic_inc() and atomic_dec(),
> and implement smp_mb__{before,after}_atomic() as full memory barriers.
>
> This means that smp_mb__before_atomic() only orders the atomic operation
> itself with accesses preceding the smp_mb__before_atomic(), and does
> not necessarily provide any ordering whatsoever against accesses
> folowing the atomic operation. Similarly, smp_mb__after_atomic()

s/folowing/following/

> only orders the atomic operation itself with accesses following the
> smp_mb__after_atomic(), and does not necessarily provide any ordering
> whatsoever against accesses preceding the atomic operation. Full ordering
> therefore requires both an smp_mb__before_atomic() before the atomic
> operation and an smp_mb__after_atomic() after the atomic operation.
>
> Therefore, linux-kernel.cat's current model of Before-atomic
> and After-atomic is too strong, as it guarantees ordering of
> accesses on the other side of the atomic operation from the
> smp_mb__{before,after}_atomic(). This commit therefore weakens
> the guarantee to match the semantics called out above.
>
> Reported-by: Andrea Parri <[email protected]>
> Suggested-by: Alan Stern <[email protected]>
> Signed-off-by: Paul E. McKenney <[email protected]>
>
> diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
> index 169d938c0b53..e5b97c3e8e39 100644
> --- a/Documentation/memory-barriers.txt
> +++ b/Documentation/memory-barriers.txt
> @@ -1888,7 +1888,37 @@ There are some more advanced barrier functions:
> atomic_dec(&obj->ref_count);
>
> This makes sure that the death mark on the object is perceived to be set
> - *before* the reference counter is decremented.
> + *before* the reference counter is decremented. However, please note
> + that smp_mb__before_atomic()'s ordering guarantee does not necessarily
> + extend beyond the atomic operation. For example:
> +
> + obj->dead = 1;
> + smp_mb__before_atomic();
> + atomic_dec(&obj->ref_count);
> + r1 = a;
> +
> + Here the store to obj->dead is not guaranteed to be ordered with
> + with the load from a. This reordering can happen on x86 as follows:

s/with//

And I beg you to avoid using the single letter variable "a".
It's confusing.

> + (1) The compiler can reorder the load from a to precede the
> + atomic_dec(), (2) Because x86 smp_mb__before_atomic() is only a
> + compiler barrier, the CPU can reorder the preceding store to
> + obj->dead with the later load from a.
> +
> + This could be avoided by using READ_ONCE(), which would prevent the
> + compiler from reordering due to both atomic_dec() and READ_ONCE()
> + being volatile accesses, and is usually preferable for loads from
> + shared variables. However, weakly ordered CPUs would still be
> + free to reorder the atomic_dec() with the load from a, so a more
> + readable option is to also use smp_mb__after_atomic() as follows:

The point here is not just "readability", but also the portability of the
code, isn't it?

Thanks, Akira

> +
> + WRITE_ONCE(obj->dead, 1);
> + smp_mb__before_atomic();
> + atomic_dec(&obj->ref_count);
> + smp_mb__after_atomic();
> + r1 = READ_ONCE(a);
> +
> + This orders all three accesses against each other, and also makes
> + the intent quite clear.
>
> See Documentation/atomic_{t,bitops}.txt for more information.
>
> diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
> index 8dcb37835b61..b6866f93abb8 100644
> --- a/tools/memory-model/linux-kernel.cat
> +++ b/tools/memory-model/linux-kernel.cat
> @@ -28,8 +28,8 @@ include "lock.cat"
> let rmb = [R \ Noreturn] ; fencerel(Rmb) ; [R \ Noreturn]
> let wmb = [W] ; fencerel(Wmb) ; [W]
> let mb = ([M] ; fencerel(Mb) ; [M]) |
> - ([M] ; fencerel(Before-atomic) ; [RMW] ; po? ; [M]) |
> - ([M] ; po? ; [RMW] ; fencerel(After-atomic) ; [M]) |
> + ([M] ; fencerel(Before-atomic) ; [RMW]) |
> + ([RMW] ; fencerel(After-atomic) ; [M]) |
> ([M] ; po? ; [LKW] ; fencerel(After-spinlock) ; [M]) |
> ([M] ; po ; [UL] ; (co | po) ; [LKW] ;
> fencerel(After-unlock-lock) ; [M])
>


2019-04-19 19:25:45

by Andrea Parri

[permalink] [raw]
Subject: Re: Adding plain accesses and detecting data races in the LKMM

> > + (1) The compiler can reorder the load from a to precede the
> > + atomic_dec(), (2) Because x86 smp_mb__before_atomic() is only a
> > + compiler barrier, the CPU can reorder the preceding store to
> > + obj->dead with the later load from a.
> > +
> > + This could be avoided by using READ_ONCE(), which would prevent the
> > + compiler from reordering due to both atomic_dec() and READ_ONCE()
> > + being volatile accesses, and is usually preferable for loads from
> > + shared variables. However, weakly ordered CPUs would still be
> > + free to reorder the atomic_dec() with the load from a, so a more
> > + readable option is to also use smp_mb__after_atomic() as follows:
> > +
> > + WRITE_ONCE(obj->dead, 1);
> > + smp_mb__before_atomic();
> > + atomic_dec(&obj->ref_count);
> > + smp_mb__after_atomic();
> > + r1 = READ_ONCE(a);
>
> The point here is not just "readability", but also the portability of the
> code, isn't it?

The implicit assumption was, I guess, that all weakly ordered CPUs which
are free to reorder the atomic_dec() with the READ_ONCE() execute a full
memory barrier in smp_mb__before_atomic() ... This assumption currently
holds, AFAICT, but yes: it may well become "non-portable"! ... ;-)

Andrea

2019-04-20 14:51:37

by Akira Yokosawa

[permalink] [raw]
Subject: Re: Adding plain accesses and detecting data races in the LKMM

On Fri, 19 Apr 2019 11:06:41 -0700, Paul E. McKenney wrote:
> On Sat, Apr 20, 2019 at 12:06:58AM +0900, Akira Yokosawa wrote:
>> Hi Paul,
>>
[...]
>
>>> + (1) The compiler can reorder the load from a to precede the
>>> + atomic_dec(), (2) Because x86 smp_mb__before_atomic() is only a
>>> + compiler barrier, the CPU can reorder the preceding store to
>>> + obj->dead with the later load from a.
>>> +
>>> + This could be avoided by using READ_ONCE(), which would prevent the
>>> + compiler from reordering due to both atomic_dec() and READ_ONCE()
>>> + being volatile accesses, and is usually preferable for loads from
>>> + shared variables. However, weakly ordered CPUs would still be
>>> + free to reorder the atomic_dec() with the load from a, so a more
>>> + readable option is to also use smp_mb__after_atomic() as follows:
>>
>> The point here is not just "readability", but also the portability of the
>> code, isn't it?
>
> As Andrea noted, in this particular case, the guarantee that the
> store to obj->dead precedes the load from x is portable. Either the
> smp_mb__before_atomic() or the atomic_dec() must provide the ordering.

I think I understood this. What I wanted to say was the code for x86 implied
in the subjunctive sentence:

obj->dead = 1;
smp_mb__before_atomic();
atomic_dec(&obj->ref_count);
r1 = READ_ONCE(x);

, which was not spelled out, is not portable if we expect the ordering of
atomic_dec() with READ_ONCE().

> However, you are right that there is some non-portability. But this
> non-portability involves the order of the atomic_dec() and the store to x.

Yes, you've guessed it right.

>
> So what I did was ...
>
>> Thanks, Akira
>>
>>> +
>>> + WRITE_ONCE(obj->dead, 1);
>>> + smp_mb__before_atomic();
>>> + atomic_dec(&obj->ref_count);
>>> + smp_mb__after_atomic();
>>> + r1 = READ_ONCE(a);
>>> +
>>> + This orders all three accesses against each other, and also makes
>>> + the intent quite clear.
>
> ... change the above paragraph to read as follows:
>
> In addition, the example without the smp_mb__after_atomic() does
> not necessarily order the atomic_dec() with the load from x.
> In contrast, the example with both smp_mb__before_atomic() and
> smp_mb__after_atomic() orders all three accesses against each other,
> and also makes the intent quite clear.
>
> Does that help?

This looks a little bit redundant to me. The original one is clear
enough.

How about editing the leading sentence above:

>>> + shared variables. However, weakly ordered CPUs would still be
>>> + free to reorder the atomic_dec() with the load from a, so a more
>>> + readable option is to also use smp_mb__after_atomic() as follows:

to read as follows?

shared variables. However, weakly ordered CPUs would still be
free to reorder the atomic_dec() with the load from x, so a
portable and more readable option is to also use
smp_mb__after_atomic() as follows:

Obviously, the interesting discussion going on in another thread will
surely affect this patch.

Thanks, Akira

>
> Thanx, Paul
>
>>> See Documentation/atomic_{t,bitops}.txt for more information.
>>>
>>> diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
>>> index 8dcb37835b61..b6866f93abb8 100644
>>> --- a/tools/memory-model/linux-kernel.cat
>>> +++ b/tools/memory-model/linux-kernel.cat
>>> @@ -28,8 +28,8 @@ include "lock.cat"
>>> let rmb = [R \ Noreturn] ; fencerel(Rmb) ; [R \ Noreturn]
>>> let wmb = [W] ; fencerel(Wmb) ; [W]
>>> let mb = ([M] ; fencerel(Mb) ; [M]) |
>>> - ([M] ; fencerel(Before-atomic) ; [RMW] ; po? ; [M]) |
>>> - ([M] ; po? ; [RMW] ; fencerel(After-atomic) ; [M]) |
>>> + ([M] ; fencerel(Before-atomic) ; [RMW]) |
>>> + ([RMW] ; fencerel(After-atomic) ; [M]) |
>>> ([M] ; po? ; [LKW] ; fencerel(After-spinlock) ; [M]) |
>>> ([M] ; po ; [UL] ; (co | po) ; [LKW] ;
>>> fencerel(After-unlock-lock) ; [M])
>>>
>>
>

2019-04-21 19:44:05

by Paul E. McKenney

[permalink] [raw]
Subject: Re: Adding plain accesses and detecting data races in the LKMM

On Sat, Apr 20, 2019 at 11:50:14PM +0900, Akira Yokosawa wrote:
> On Fri, 19 Apr 2019 11:06:41 -0700, Paul E. McKenney wrote:
> > On Sat, Apr 20, 2019 at 12:06:58AM +0900, Akira Yokosawa wrote:
> >> Hi Paul,
> >>
> [...]
> >
> >>> + (1) The compiler can reorder the load from a to precede the
> >>> + atomic_dec(), (2) Because x86 smp_mb__before_atomic() is only a
> >>> + compiler barrier, the CPU can reorder the preceding store to
> >>> + obj->dead with the later load from a.
> >>> +
> >>> + This could be avoided by using READ_ONCE(), which would prevent the
> >>> + compiler from reordering due to both atomic_dec() and READ_ONCE()
> >>> + being volatile accesses, and is usually preferable for loads from
> >>> + shared variables. However, weakly ordered CPUs would still be
> >>> + free to reorder the atomic_dec() with the load from a, so a more
> >>> + readable option is to also use smp_mb__after_atomic() as follows:
> >>
> >> The point here is not just "readability", but also the portability of the
> >> code, isn't it?
> >
> > As Andrea noted, in this particular case, the guarantee that the
> > store to obj->dead precedes the load from x is portable. Either the
> > smp_mb__before_atomic() or the atomic_dec() must provide the ordering.
>
> I think I understood this. What I wanted to say was the code for x86 implied
> in the subjunctive sentence:
>
> obj->dead = 1;
> smp_mb__before_atomic();
> atomic_dec(&obj->ref_count);
> r1 = READ_ONCE(x);
>
> , which was not spelled out, is not portable if we expect the ordering of
> atomic_dec() with READ_ONCE().

I now understand that you understood. ;-)

> > However, you are right that there is some non-portability. But this
> > non-portability involves the order of the atomic_dec() and the store to x.
>
> Yes, you've guessed it right.

Don't worry, it won't happen again!

> > So what I did was ...
> >
> >> Thanks, Akira
> >>
> >>> +
> >>> + WRITE_ONCE(obj->dead, 1);
> >>> + smp_mb__before_atomic();
> >>> + atomic_dec(&obj->ref_count);
> >>> + smp_mb__after_atomic();
> >>> + r1 = READ_ONCE(a);
> >>> +
> >>> + This orders all three accesses against each other, and also makes
> >>> + the intent quite clear.
> >
> > ... change the above paragraph to read as follows:
> >
> > In addition, the example without the smp_mb__after_atomic() does
> > not necessarily order the atomic_dec() with the load from x.
> > In contrast, the example with both smp_mb__before_atomic() and
> > smp_mb__after_atomic() orders all three accesses against each other,
> > and also makes the intent quite clear.
> >
> > Does that help?
>
> This looks a little bit redundant to me. The original one is clear
> enough.
>
> How about editing the leading sentence above:
>
> >>> + shared variables. However, weakly ordered CPUs would still be
> >>> + free to reorder the atomic_dec() with the load from a, so a more
> >>> + readable option is to also use smp_mb__after_atomic() as follows:
>
> to read as follows?
>
> shared variables. However, weakly ordered CPUs would still be
> free to reorder the atomic_dec() with the load from x, so a
> portable and more readable option is to also use
> smp_mb__after_atomic() as follows:

Adding "portable and", correct? Makes sense, so I applied this change.

> Obviously, the interesting discussion going on in another thread will
> surely affect this patch.

Quite possibly! ;-)


Thanx, Paul

> >>> See Documentation/atomic_{t,bitops}.txt for more information.
> >>>
> >>> diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
> >>> index 8dcb37835b61..b6866f93abb8 100644
> >>> --- a/tools/memory-model/linux-kernel.cat
> >>> +++ b/tools/memory-model/linux-kernel.cat
> >>> @@ -28,8 +28,8 @@ include "lock.cat"
> >>> let rmb = [R \ Noreturn] ; fencerel(Rmb) ; [R \ Noreturn]
> >>> let wmb = [W] ; fencerel(Wmb) ; [W]
> >>> let mb = ([M] ; fencerel(Mb) ; [M]) |
> >>> - ([M] ; fencerel(Before-atomic) ; [RMW] ; po? ; [M]) |
> >>> - ([M] ; po? ; [RMW] ; fencerel(After-atomic) ; [M]) |
> >>> + ([M] ; fencerel(Before-atomic) ; [RMW]) |
> >>> + ([RMW] ; fencerel(After-atomic) ; [M]) |
> >>> ([M] ; po? ; [LKW] ; fencerel(After-spinlock) ; [M]) |
> >>> ([M] ; po ; [UL] ; (co | po) ; [LKW] ;
> >>> fencerel(After-unlock-lock) ; [M])
> >>>
> >>
> >
>