Folks:
I have spent some time writing up a section for
tools/memory-model/Documentation/explanation.txt on plain accesses and
data races. The initial version is below.
I'm afraid it's rather long and perhaps gets too bogged down in
complexities. On the other hand, this is a complicated topic so to
some extent this is unavoidable.
In any case, I'd like to hear your comments and reviews.
Alan
------------------------------------------------------------------------
PLAIN ACCESSES AND DATA RACES
-----------------------------
In the LKMM, memory accesses such as READ_ONCE(x), atomic_inc(&y),
smp_load_acquire(&z), and so on are collectively referred to as
"marked" accesses, because they are all annotated with special
operations of one kind or another. Ordinary C-language memory
accesses such as x or y = 0 are simply called "plain" accesses.
Early versions of the LKMM had nothing to say about plain accesses.
The C standard allows compilers to assume that the variables affected
by plain accesses are not concurrently read or written by any other
threads or CPUs. This leaves compilers free to implement all manner
of transformations or optimizations of code containing plain accesses,
making such code very difficult for a memory model to handle.
Here is just one example of a possible pitfall:
int a = 6;
int *x = &a;
P0()
{
int *r1;
int r2 = 0;
r1 = x;
if (r1 != NULL)
r2 = READ_ONCE(*r1);
}
P1()
{
WRITE_ONCE(x, NULL);
}
On the face of it, one would expect that when this code runs, the only
possible final values for r2 are 6 and 0, depending on whether or not
P1's store to x propagates to P0 before P0's load from x executes.
But since P0's load from x is a plain access, the compiler may decide
to carry out the load twice (for the comparison against NULL, then again
for the READ_ONCE()) and eliminate the temporary variable r1. The
object code generated for P0 could therefore end up looking rather
like this:
P0()
{
int r2 = 0;
if (x != NULL)
r2 = READ_ONCE(*x);
}
And now it is obvious that this code runs the risk of dereferencing a
NULL pointer, because P1's store to x might propagate to P0 after the
test against NULL has been made but before the READ_ONCE() executes.
If the original code had said "r1 = READ_ONCE(x)" instead of "r1 = x",
the compiler would not have performed this optimization and there
would be no possibility of a NULL-pointer dereference.
Given the possibility of transformations like this one, the LKMM
doesn't try to predict all possible outcomes of code containing plain
accesses. It is content to determine whether the code violates the
compiler's assumptions, which would render the ultimate outcome
undefined.
In technical terms, the compiler is allowed to assume that when the
program executes, there will not be any data races. A "data race"
occurs when two conflicting memory accesses execute concurrently;
two memory accesses "conflict" if:
they access the same location,
they occur on different CPUs (or in different threads on the
same CPU),
at least one of them is a plain access,
and at least one of them is a store.
The LKMM tries to determine whether a program contains two conflicting
accesses which may execute concurrently; if it does then the LKMM says
there is a potential data race and makes no predictions about the
program's outcome.
Determining whether two accesses conflict is easy; you can see that
all the concepts involved in the definition above are already part of
the memory model. The hard part is telling whether they may execute
concurrently. The LKMM takes a conservative attitude, assuming that
accesses may be concurrent unless it can prove they cannot.
If two memory accesses aren't concurrent then one must execute before
the other. Therefore the LKMM decides two accesses aren't concurrent
if they can be connected by a sequence of hb, pb, and rb links
(together referred to as xb, for "executes before"). However, there
are two complicating factors.
If X is a load and X executes before a store Y, then indeed there is
no danger of X and Y being concurrent. After all, Y can't have any
effect on the value obtained by X until the memory subsystem has
propagated Y from its own CPU to X's CPU, which won't happen until
some time after Y executes and thus after X executes. But if X is a
store, then even if X executes before Y it is still possible that X
will propagate to Y's CPU just as Y is executing. In such a case X
could very well interfere somehow with Y, and we would have to
consider X and Y to be concurrent.
Therefore when X is a store, for X and Y to be non-concurrent the LKMM
requires not only that X must execute before Y but also that X must
propagate to Y's CPU before Y executes. (Or vice versa, of course, if
Y executes before X -- then Y must propagate to X's CPU before X
executes if Y is a store.) This is expressed by the visibility
relation (vis), where X ->vis Y is defined to hold if there is an
intermediate event Z such that:
X is connected to Z by a possibly empty sequence of
cumul-fence links followed by an optional rfe link (if none of
these links are present, X and Z are the same event),
and either:
Z is connected to Y by a strong-fence link followed by a
possibly empty sequence of xb links,
or:
Z is on the same CPU as Y and is connected to Y by a possibly
empty sequence of xb links (again, if the sequence is empty it
means Z and Y are the same event).
The motivations behind this definition are straightforward:
cumul-fence memory barriers force stores that are po-before
the barrier to propagate to other CPUs before stores that are
po-after the barrier.
An rfe link from an event W to an event R says that R reads
from W, which certainly means that W must have propagated to
R's CPU before R executed.
strong-fence memory barriers force stores that are po-before
the barrier, or that propagate to the barrier's CPU before the
barrier executes, to propagate to all CPUs before any events
po-after the barrier can execute.
To see how this works out in practice, consider our old friend, the MP
pattern (with fences and statement labels, but without the conditional
test):
int buf = 0, flag = 0;
P0()
{
X: WRITE_ONCE(buf, 1);
smp_wmb();
W: WRITE_ONCE(flag, 1);
}
P1()
{
int r1;
int r2 = 0;
Z: r1 = READ_ONCE(flag);
smp_rmb();
Y: r2 = READ_ONCE(buf);
}
The smp_wmb() memory barrier gives a cumul-fence link from X to W, and
assuming r1 = 1 at the end, there is an rfe link from W to Z. This
means that the store to buf must propagate from P0 to P1 before Z
executes. Next, Z and Y are on the same CPU and the smp_rmb() fence
provides an xb link from Z to Y (i.e., it forces Z to execute before
Y). Therefore we have X ->vis Y: X must propagate to Y's CPU before Y
executes.
The second complicating factor mentioned above arises from the fact
that when we are considering data races, some of the memory accesses
are plain. Now, although we have not said so explicitly, up to this
point most of the relations defined by the LKMM (ppo, hb, prop,
cumul-fence, pb, and so on -- including vis) apply only to marked
accesses.
There are good reasons for this restriction. The compiler is not
allowed to apply fancy transformations to marked accesses, and
consequently each such access in the source code corresponds more or
less directly to a single machine instruction in the object code. But
plain accesses are a different story; the compiler may combine them,
split them up, duplicate them, eliminate them, invent new ones, and
who knows what else. Seeing a plain access in the source code tells
you almost nothing about what machine instructions will end up in the
object code.
Fortunately, the compiler isn't completely free; it is subject to some
limitations. For one, it is not allowed to introduce a data race into
the object code if the source code does not already contain a data
race (if it could, memory models would be useless and no multithreaded
code would be safe!). For another, it cannot move a plain access past
a compiler barrier.
A compiler barrier is a kind of fence, but as the name implies, it
only affects the compiler; it does not necessarily have any effect on
how instructions are executed by the CPU. In Linux kernel source
code, the barrier() function is a compiler barrier. It doesn't give
rise directly to any machine instructions in the object code; rather,
it affects how the compiler generates the rest of the object code.
Given source code like this:
... some memory accesses ...
barrier();
... some other memory accesses ...
the barrier() function ensures that the machine instructions
corresponding to the first group of accesses will all end po-before
any machine instructions corresponding to the second group of accesses
-- even if some of the accesses are plain. (Of course, the CPU may
then execute some of those accesses out of program order, but we
already know how to deal with such issues.) Without the barrier()
there would be no such guarantee; the two groups of accesses could be
intermingled or even reversed in the object code.
The LKMM doesn't say much about the barrier() function, but it does
require that all fences are also compiler barriers. In addition, it
requires that the ordering properties of memory barriers such as
smp_rmb() or smp_store_release() apply to plain accesses as well as to
marked accesses.
This is the key to analyzing data races. Consider the MP pattern
again, now using plain accesses for buf:
int buf = 0, flag = 0;
P0()
{
U: buf = 1;
smp_wmb();
X: WRITE_ONCE(flag, 1);
}
P1()
{
int r1;
int r2 = 0;
Y: r1 = READ_ONCE(flag);
if (r1) {
smp_rmb();
V: r2 = buf;
}
}
This program does not contain a data race. Although the U and V
accesses conflict, the LKMM can prove they are not concurrent as
follows:
The smp_wmb() fence in P0 is both a compiler barrier and a
cumul-fence. It guarantees that no matter what hash of
machine instructions the compiler generates for the plain
access U, all those instructions will be po-before the fence.
Consequently U's store to buf, no matter how it is carried out
at the machine level, must propagate to P1 before X's store to
flag does.
X and Y are both marked accesses. Hence an rfe link from X to
Y is a valid indicator that X propagated to P1 before Y
executed, i.e., X ->vis Y. (And if there is no rfe link then
r1 will be 0, so V will not be executed and ipso facto won't
race with U.)
The smp_rmb() fence in P1 is a compiler barrier as well as a
fence. It guarantees that all the machine-level instructions
corresponding to the access V will be po-after the fence, and
therefore any loads among those instructions will execute
after the fence does and hence after Y does.
Thus U's store to buf is forced to propagate to P1 before V's load
executes (assuming V does execute), ruling out the possibility of a
data race between them.
This analysis illustrates how the LKMM deals with plain accesses in
general. Suppose R is a plain load and we want to show that R
executes before some marked access E. We can do this by finding a
marked access X such that R and X are ordered by a suitable fence and
X ->xb* E. If E was also a plain access, we would also look for a
marked access Y such that X ->xb* Y, and Y and E are ordered by a
fence. We describe this arrangement by saying that R is
"post-bounded" by X and E is "pre-bounded" by Y.
In fact, we go one step further: Since R is a read, we say that R is
"r-post-bounded" by X. Similarly, E would be "r-pre-bounded" or
"w-pre-bounded" by Y, depending on whether E was a store or a load.
This distinction is needed because some fences affect only loads
(i.e., smp_rmb()) and some affect only stores (smp_wmb()); otherwise
the two types of bounds are the same. And as a degenerate case, we
say that a marked access pre-bounds and post-bounds itself (e.g., if R
above were a marked load then X could simply be taken to be R itself.)
The need to distinguish between r- and w-bounding raises yet another
issue. When the source code contains a plain store, the compiler is
allowed to put plain loads of the same location into the object code.
For example, given the source code:
x = 1;
the compiler is theoretically allowed to generate object code that
looks like:
if (x != 1)
x = 1;
thereby adding a load (and possibly replacing the store entirely).
For this reason, whenever the LKMM requires a plain store to be
w-pre-bounded or w-post-bounded by a marked access, it also requires
the store to be r-pre-bounded or r-post-bounded, so as to handle cases
where the compiler adds a load.
(This may be overly cautious. We don't know of any examples where a
compiler has augmented a store with a load in this fashion, and the
Linux kernel developers would probably fight pretty hard to change a
compiler if it ever did this. Still, better safe than sorry.)
Incidentally, the other tranformation -- augmenting a plain load by
adding in a store to the same location -- is not allowed. This is
because the compiler cannot know whether any other CPUs might perform
a concurrent load from that location. Two concurrent loads don't
constitute a race (they can't interfere with each other), but a store
does race with a concurrent load. Thus adding a store might create a
data race where one was not already present in the source code,
something the compiler is forbidden to do. Augmenting a store with a
load, on the other hand, is acceptable because doing so won't create a
data race unless one already existed.
The LKMM includes a second way to pre-bound plain accesses, in
addition to fences: an address dependency from a marked load. That
is, in the sequence:
p = READ_ONCE(ptr);
r = *p;
the LKMM says that the marked load of ptr pre-bounds the plain load of
*p; the marked load must execute before any of the machine
instructions corresponding to the plain load. This is a reasonable
stipulation, since after all, the CPU can't perform the load of *p
until it knows what value p will hold. Furthermore, without some
assumption like this one, some usages typical of RCU would count as
data races. For example:
int a = 1, b;
int *ptr = &a;
P0()
{
b = 2;
rcu_assign_ptr(ptr, &b);
}
P1()
{
int *p;
int r;
rcu_read_lock();
p = rcu_dereference(ptr);
r = *p;
rcu_read_unlock();
}
(In this example the rcu_read_lock() and rcu_read_unlock() calls don't
really do anything, because there aren't any grace periods. They are
included merely for the sake of good form; typically P0 would call
synchronize_rcu() somewhere after the rcu_assign_ptr().)
rcu_assign_ptr() performs a store-release, so the plain store to b is
definitely w-post-bounded before the store to ptr, and the two stores
will propagate to P1 in that order. However, rcu_dereference() is
only equivalent to READ_ONCE(). While it is a marked access, it is
not a fence or compiler barrier. Hence the only guarantee we have
that the load of ptr in P1 is r-pre-bounded before the load of *p
(thus avoiding a race) is the assumption about address dependencies.
This is a situation where the compiler can undermine the memory model,
and a certain amount of care is required when programming constructs
like this one. In particular, comparisons between the pointer and
other known addresses can cause trouble. If you have something like:
p = rcu_dereference(ptr);
if (p == &x)
r = *p;
then the compiler just might generate object code resembling:
p = rcu_dereference(ptr);
if (p == &x)
r = x;
or even:
rtemp = x;
p = rcu_dereference(ptr);
if (p == &x)
r = rtemp;
which would invalidate the memory model's assumption, since the CPU
could now perform the load of x before the load of ptr (there might be
a control dependency but no address dependency at the machine level).
Finally, it turns out there is a situation in which a plain write does
not need to be w-post-bounded: when it is separated from the
conflicting access by a fence. At first glance this may seem
impossible. After all, to be conflicting the second access has to be
on a different CPU from the first, and fences don't link events on
different CPUs. Well, normal fences don't -- but rcu-fence can!
Here's an example:
int x, y;
P0()
{
WRITE_ONCE(x, 1);
synchronize_rcu();
y = 3;
}
P1()
{
rcu_read_lock();
if (READ_ONCE(x) == 0)
y = 2;
rcu_read_unlock();
}
Do the plain stores to y race? Clearly not if P1 reads a non-zero
value for x, so let's assume the READ_ONCE(x) does obtain 0. This
means that the read-side critical section in P1 must finish executing
before the grace period in P0 does, because RCU's Grace-Period
Guarantee says that otherwise P0's store to x would have propagated to
P1 before the critical section started and so would have been visible
to the READ_ONCE(). (Another way of putting it is that the fre link
from the READ_ONCE() to the WRITE_ONCE() gives rise to an rcu-link
between those two events.)
This means there is an rcu-fence link from P1's "y = 2" store to P0's
"y = 3" store, and consequently the first must propagate from P1 to P0
before the second can execute. Therefore the two stores cannot be
concurrent and there is no race, even though P1's plain store to y
isn't w-post-bounded by any marked accesses.
Putting all this material together yields the following picture. For
two conflicting stores W and W', where W ->co W', the LKMM says the
stores don't race if W can be linked to W' by a
w-post-bounded ; vis ; w-pre-bounded
sequence. If W is plain then they also have to be linked by an
r-post-bounded ; xb* ; w-pre-bounded
sequence, and if W' is plain then they also have to be linked by a
w-post-bounded ; vis ; r-pre-bounded
sequence. For a conflicting load R and store W, the LKMM says the two
accesses don't race if R can be linked to W by an
r-post-bounded ; xb* ; w-pre-bounded
sequence or if W can be linked to R by a
w-post-bounded ; vis ; r-pre-bounded
sequence. For the cases involving a vis link, the LKMM also accepts
sequences in which W is linked to W' or R by a
strong-fence ; xb* ; {w and/or r}-pre-bounded
sequence with no post-bounding, and in every case the LKMM also allows
the link simply to be a fence with no bounding at all. If no sequence
of the appropriate sort exists, the LKMM says that the accesses race.
There is one more part of the LKMM related to plain accesses (although
not to data races) we should discuss. Recall that many relations such
as hb are limited to marked accesses only. As a result, the
happens-before, propagates-before, and rcu axioms (which state that
various relation must not contain a cycle) doesn't apply to plain
accesses. Nevertheless, we do want to rule out such cycles, because
they don't make sense even for plain accesses.
To this end, the LKMM imposes three extra restrictions, together
called the "plain-coherence" axiom because of their resemblance to the
coherency rules:
If R and W conflict and it is possible to link R to W by one
of the xb* sequences listed above, then W ->rfe R is not
allowed (i.e., a load cannot read from a store that it
executes before, even if one or both is plain).
If W and R conflict and it is possible to link W to R by one
of the vis sequences listed above, then R ->fre W is not
allowed (i.e., if a store is visible to a load then the load
must read from that store or one coherence-after it).
If W and W' conflict and it is possible to link W to W' by one
of the vis sequences listed above, then W' ->co W is not
allowed (i.e., if one store is visible to another then it must
come after in the coherence order).
This is the extent to which the LKMM deals with plain accesses.
Perhaps it could say more (for example, plain accesses might
contribute to the ppo relation), but at the moment it seems that this
minimal, conservative approach is good enough.
On Fri, Sep 06, 2019 at 02:11:29PM -0400, Alan Stern wrote:
> Folks:
>
> I have spent some time writing up a section for
> tools/memory-model/Documentation/explanation.txt on plain accesses and
> data races. The initial version is below.
>
> I'm afraid it's rather long and perhaps gets too bogged down in
> complexities. On the other hand, this is a complicated topic so to
> some extent this is unavoidable.
>
> In any case, I'd like to hear your comments and reviews.
Good stuff, thank you for putting this together!
Please see below for some questions, comments, and confusion interspersed.
> Alan
>
> ------------------------------------------------------------------------
>
>
> PLAIN ACCESSES AND DATA RACES
> -----------------------------
>
> In the LKMM, memory accesses such as READ_ONCE(x), atomic_inc(&y),
> smp_load_acquire(&z), and so on are collectively referred to as
> "marked" accesses, because they are all annotated with special
> operations of one kind or another. Ordinary C-language memory
> accesses such as x or y = 0 are simply called "plain" accesses.
>
> Early versions of the LKMM had nothing to say about plain accesses.
> The C standard allows compilers to assume that the variables affected
> by plain accesses are not concurrently read or written by any other
> threads or CPUs. This leaves compilers free to implement all manner
> of transformations or optimizations of code containing plain accesses,
> making such code very difficult for a memory model to handle.
>
> Here is just one example of a possible pitfall:
>
> int a = 6;
> int *x = &a;
>
> P0()
> {
> int *r1;
> int r2 = 0;
>
> r1 = x;
> if (r1 != NULL)
> r2 = READ_ONCE(*r1);
> }
>
> P1()
> {
> WRITE_ONCE(x, NULL);
> }
I tried making a litmus test out of this:
------------------------------------------------------------------------
C plain-1
{
int a = 6;
int *x = &a;
}
P0(int **x)
{
int *r1;
int r2 = 0;
r1 = *x;
if (r1 != 0)
r2 = READ_ONCE(*r1);
}
P1(int **x)
{
WRITE_ONCE(*x, 0);
}
locations [a; x; r1]
exists ~r2=6 /\ ~r2=0
------------------------------------------------------------------------
However, r1 steadfastly refuses to have any value other than zero.
------------------------------------------------------------------------
$ herd7 -conf linux-kernel.cfg /tmp/argh
Test plain-1 Allowed
States 1
a=6; r1=0; r2=0; x=0;
No
Witnesses
Positive: 0 Negative: 2
Flag data-race
Condition exists (not (r2=6) /\ not (r2=0))
Observation plain-1 Never 0 2
Time plain-1 0.00
Hash=b0fdbd0f627fd65e0cd413bf87f6f4a4
------------------------------------------------------------------------
What am I doing wrong here? Outdated herd7 version?
$ herd7 -version
7.52+7(dev), Rev: c81f1ff06f30d5c28c34d893a29f5f8505334179
Hmmm... I might well be in an inconsistent herd7/ocaml state. If no
one sees anything obvious, I will try reinstalling from scratch, but
that will not likely happen in the next few days.
> On the face of it, one would expect that when this code runs, the only
> possible final values for r2 are 6 and 0, depending on whether or not
> P1's store to x propagates to P0 before P0's load from x executes.
> But since P0's load from x is a plain access, the compiler may decide
> to carry out the load twice (for the comparison against NULL, then again
> for the READ_ONCE()) and eliminate the temporary variable r1. The
> object code generated for P0 could therefore end up looking rather
> like this:
>
> P0()
> {
> int r2 = 0;
>
> if (x != NULL)
> r2 = READ_ONCE(*x);
> }
>
> And now it is obvious that this code runs the risk of dereferencing a
> NULL pointer, because P1's store to x might propagate to P0 after the
> test against NULL has been made but before the READ_ONCE() executes.
> If the original code had said "r1 = READ_ONCE(x)" instead of "r1 = x",
> the compiler would not have performed this optimization and there
> would be no possibility of a NULL-pointer dereference.
>
> Given the possibility of transformations like this one, the LKMM
> doesn't try to predict all possible outcomes of code containing plain
> accesses. It is content to determine whether the code violates the
I suggest starting this sentence with something like "It is instead
content to determine", adding "instead", to help the reader transition.
> compiler's assumptions, which would render the ultimate outcome
> undefined.
>
> In technical terms, the compiler is allowed to assume that when the
> program executes, there will not be any data races. A "data race"
> occurs when two conflicting memory accesses execute concurrently;
> two memory accesses "conflict" if:
>
> they access the same location,
>
> they occur on different CPUs (or in different threads on the
> same CPU),
>
> at least one of them is a plain access,
>
> and at least one of them is a store.
>
> The LKMM tries to determine whether a program contains two conflicting
> accesses which may execute concurrently; if it does then the LKMM says
> there is a potential data race and makes no predictions about the
> program's outcome.
>
> Determining whether two accesses conflict is easy; you can see that
> all the concepts involved in the definition above are already part of
> the memory model. The hard part is telling whether they may execute
> concurrently. The LKMM takes a conservative attitude, assuming that
> accesses may be concurrent unless it can prove they cannot.
>
> If two memory accesses aren't concurrent then one must execute before
Should this say "If two memory accesses to the same location aren't
concurrent..."?
> the other. Therefore the LKMM decides two accesses aren't concurrent
> if they can be connected by a sequence of hb, pb, and rb links
> (together referred to as xb, for "executes before"). However, there
> are two complicating factors.
>
> If X is a load and X executes before a store Y, then indeed there is
> no danger of X and Y being concurrent. After all, Y can't have any
> effect on the value obtained by X until the memory subsystem has
> propagated Y from its own CPU to X's CPU, which won't happen until
> some time after Y executes and thus after X executes. But if X is a
> store, then even if X executes before Y it is still possible that X
> will propagate to Y's CPU just as Y is executing. In such a case X
> could very well interfere somehow with Y, and we would have to
> consider X and Y to be concurrent.
>
> Therefore when X is a store, for X and Y to be non-concurrent the LKMM
> requires not only that X must execute before Y but also that X must
> propagate to Y's CPU before Y executes. (Or vice versa, of course, if
> Y executes before X -- then Y must propagate to X's CPU before X
> executes if Y is a store.) This is expressed by the visibility
> relation (vis), where X ->vis Y is defined to hold if there is an
> intermediate event Z such that:
"if there is a marked intermediate event Z such that"?
> X is connected to Z by a possibly empty sequence of
> cumul-fence links followed by an optional rfe link (if none of
> these links are present, X and Z are the same event),
>
> and either:
>
> Z is connected to Y by a strong-fence link followed by a
> possibly empty sequence of xb links,
"possibly empty sequence of xb links from a marked access"?
> or:
>
> Z is on the same CPU as Y and is connected to Y by a possibly
> empty sequence of xb links (again, if the sequence is empty it
> means Z and Y are the same event).
>
> The motivations behind this definition are straightforward:
>
> cumul-fence memory barriers force stores that are po-before
> the barrier to propagate to other CPUs before stores that are
> po-after the barrier.
>
> An rfe link from an event W to an event R says that R reads
> from W, which certainly means that W must have propagated to
> R's CPU before R executed.
>
> strong-fence memory barriers force stores that are po-before
> the barrier, or that propagate to the barrier's CPU before the
> barrier executes, to propagate to all CPUs before any events
> po-after the barrier can execute.
>
> To see how this works out in practice, consider our old friend, the MP
> pattern (with fences and statement labels, but without the conditional
> test):
>
> int buf = 0, flag = 0;
>
> P0()
> {
> X: WRITE_ONCE(buf, 1);
> smp_wmb();
> W: WRITE_ONCE(flag, 1);
> }
>
> P1()
> {
> int r1;
> int r2 = 0;
>
> Z: r1 = READ_ONCE(flag);
> smp_rmb();
> Y: r2 = READ_ONCE(buf);
> }
I have to ask. Why X then W then Z then Y? ;-)
(This is MP+fencewmbonceonce+fencermbonceonce.litmus in the current set
in tools/memory-model/litmus-tests.)
> The smp_wmb() memory barrier gives a cumul-fence link from X to W, and
> assuming r1 = 1 at the end, there is an rfe link from W to Z. This
> means that the store to buf must propagate from P0 to P1 before Z
> executes. Next, Z and Y are on the same CPU and the smp_rmb() fence
> provides an xb link from Z to Y (i.e., it forces Z to execute before
> Y). Therefore we have X ->vis Y: X must propagate to Y's CPU before Y
> executes.
>
> The second complicating factor mentioned above arises from the fact
> that when we are considering data races, some of the memory accesses
> are plain. Now, although we have not said so explicitly, up to this
> point most of the relations defined by the LKMM (ppo, hb, prop,
> cumul-fence, pb, and so on -- including vis) apply only to marked
> accesses.
>
> There are good reasons for this restriction. The compiler is not
> allowed to apply fancy transformations to marked accesses, and
> consequently each such access in the source code corresponds more or
> less directly to a single machine instruction in the object code. But
> plain accesses are a different story; the compiler may combine them,
> split them up, duplicate them, eliminate them, invent new ones, and
> who knows what else. Seeing a plain access in the source code tells
> you almost nothing about what machine instructions will end up in the
> object code.
>
> Fortunately, the compiler isn't completely free; it is subject to some
> limitations. For one, it is not allowed to introduce a data race into
> the object code if the source code does not already contain a data
> race (if it could, memory models would be useless and no multithreaded
> code would be safe!). For another, it cannot move a plain access past
> a compiler barrier.
>
> A compiler barrier is a kind of fence, but as the name implies, it
> only affects the compiler; it does not necessarily have any effect on
> how instructions are executed by the CPU. In Linux kernel source
> code, the barrier() function is a compiler barrier. It doesn't give
> rise directly to any machine instructions in the object code; rather,
> it affects how the compiler generates the rest of the object code.
> Given source code like this:
>
> ... some memory accesses ...
> barrier();
> ... some other memory accesses ...
>
> the barrier() function ensures that the machine instructions
> corresponding to the first group of accesses will all end po-before
> any machine instructions corresponding to the second group of accesses
> -- even if some of the accesses are plain. (Of course, the CPU may
> then execute some of those accesses out of program order, but we
> already know how to deal with such issues.) Without the barrier()
> there would be no such guarantee; the two groups of accesses could be
> intermingled or even reversed in the object code.
>
> The LKMM doesn't say much about the barrier() function, but it does
> require that all fences are also compiler barriers. In addition, it
> requires that the ordering properties of memory barriers such as
> smp_rmb() or smp_store_release() apply to plain accesses as well as to
> marked accesses.
>
> This is the key to analyzing data races. Consider the MP pattern
> again, now using plain accesses for buf:
>
> int buf = 0, flag = 0;
>
> P0()
> {
> U: buf = 1;
> smp_wmb();
> X: WRITE_ONCE(flag, 1);
> }
>
> P1()
> {
> int r1;
> int r2 = 0;
>
> Y: r1 = READ_ONCE(flag);
> if (r1) {
> smp_rmb();
> V: r2 = buf;
> }
> }
And same nit, why not just X, Y, and Z?
Similar issues with the litmus test:
------------------------------------------------------------------------
C plain-4
{
int buf = 0;
int flag = 0;
}
P0(int *buf, int *flag)
{
*buf = 1;
smp_wmb();
WRITE_ONCE(*flag, 1);
}
P1(int *buf, int *flag)
{
int r1;
int r2 = 0;
r1 = READ_ONCE(*flag);
if (r1) {
smp_rmb();
r2 = *buf;
}
}
exists r1=1 /\ r2=0
------------------------------------------------------------------------
> This program does not contain a data race. Although the U and V
> accesses conflict, the LKMM can prove they are not concurrent as
> follows:
>
> The smp_wmb() fence in P0 is both a compiler barrier and a
> cumul-fence. It guarantees that no matter what hash of
> machine instructions the compiler generates for the plain
> access U, all those instructions will be po-before the fence.
> Consequently U's store to buf, no matter how it is carried out
> at the machine level, must propagate to P1 before X's store to
> flag does.
>
> X and Y are both marked accesses. Hence an rfe link from X to
> Y is a valid indicator that X propagated to P1 before Y
> executed, i.e., X ->vis Y. (And if there is no rfe link then
> r1 will be 0, so V will not be executed and ipso facto won't
> race with U.)
>
> The smp_rmb() fence in P1 is a compiler barrier as well as a
> fence. It guarantees that all the machine-level instructions
> corresponding to the access V will be po-after the fence, and
> therefore any loads among those instructions will execute
> after the fence does and hence after Y does.
>
> Thus U's store to buf is forced to propagate to P1 before V's load
> executes (assuming V does execute), ruling out the possibility of a
> data race between them.
>
> This analysis illustrates how the LKMM deals with plain accesses in
> general. Suppose R is a plain load and we want to show that R
> executes before some marked access E. We can do this by finding a
> marked access X such that R and X are ordered by a suitable fence and
> X ->xb* E. If E was also a plain access, we would also look for a
> marked access Y such that X ->xb* Y, and Y and E are ordered by a
> fence. We describe this arrangement by saying that R is
> "post-bounded" by X and E is "pre-bounded" by Y.
>
> In fact, we go one step further: Since R is a read, we say that R is
> "r-post-bounded" by X. Similarly, E would be "r-pre-bounded" or
> "w-pre-bounded" by Y, depending on whether E was a store or a load.
> This distinction is needed because some fences affect only loads
> (i.e., smp_rmb()) and some affect only stores (smp_wmb()); otherwise
> the two types of bounds are the same. And as a degenerate case, we
> say that a marked access pre-bounds and post-bounds itself (e.g., if R
> above were a marked load then X could simply be taken to be R itself.)
>
> The need to distinguish between r- and w-bounding raises yet another
> issue. When the source code contains a plain store, the compiler is
> allowed to put plain loads of the same location into the object code.
> For example, given the source code:
>
> x = 1;
>
> the compiler is theoretically allowed to generate object code that
> looks like:
>
> if (x != 1)
> x = 1;
>
> thereby adding a load (and possibly replacing the store entirely).
> For this reason, whenever the LKMM requires a plain store to be
> w-pre-bounded or w-post-bounded by a marked access, it also requires
> the store to be r-pre-bounded or r-post-bounded, so as to handle cases
> where the compiler adds a load.
>
> (This may be overly cautious. We don't know of any examples where a
> compiler has augmented a store with a load in this fashion, and the
> Linux kernel developers would probably fight pretty hard to change a
> compiler if it ever did this. Still, better safe than sorry.)
>
> Incidentally, the other tranformation -- augmenting a plain load by
> adding in a store to the same location -- is not allowed. This is
> because the compiler cannot know whether any other CPUs might perform
> a concurrent load from that location. Two concurrent loads don't
> constitute a race (they can't interfere with each other), but a store
> does race with a concurrent load. Thus adding a store might create a
> data race where one was not already present in the source code,
> something the compiler is forbidden to do. Augmenting a store with a
> load, on the other hand, is acceptable because doing so won't create a
> data race unless one already existed.
>
> The LKMM includes a second way to pre-bound plain accesses, in
> addition to fences: an address dependency from a marked load. That
> is, in the sequence:
>
> p = READ_ONCE(ptr);
> r = *p;
>
> the LKMM says that the marked load of ptr pre-bounds the plain load of
> *p; the marked load must execute before any of the machine
> instructions corresponding to the plain load. This is a reasonable
> stipulation, since after all, the CPU can't perform the load of *p
> until it knows what value p will hold. Furthermore, without some
> assumption like this one, some usages typical of RCU would count as
> data races. For example:
>
> int a = 1, b;
herd7 doesn't much like this.
> int *ptr = &a;
>
> P0()
> {
> b = 2;
> rcu_assign_ptr(ptr, &b);
rcu_assign_pointer(), globally.
> }
>
> P1()
> {
> int *p;
> int r;
>
> rcu_read_lock();
> p = rcu_dereference(ptr);
> r = *p;
> rcu_read_unlock();
> }
------------------------------------------------------------------------
C plain-5
{
int a = 1;
int b;
int *ptr = &a;
}
P0(int *b, int **ptr)
{
*b = 2;
rcu_assign_pointer(*ptr, b);
}
P1(int *ptr)
{
int *r1;
int r2;
rcu_read_lock();
r1 = rcu_dereference(*ptr);
r2 = *r1;
rcu_read_unlock();
}
exists r1=b /\ r2=1
------------------------------------------------------------------------
Also strange output:
------------------------------------------------------------------------
$ herd7 -conf linux-kernel.cfg /tmp/argh
Test plain-3 Allowed
States 1
r1=0; r2=0;
No
Witnesses
Positive: 0 Negative: 2
Condition exists (r1=b /\ r2=1)
Observation plain-3 Never 0 2
Time plain-3 0.00
Hash=44e039597a92bdc7efc9217813478126
------------------------------------------------------------------------
> (In this example the rcu_read_lock() and rcu_read_unlock() calls don't
> really do anything, because there aren't any grace periods. They are
> included merely for the sake of good form; typically P0 would call
> synchronize_rcu() somewhere after the rcu_assign_ptr().)
>
> rcu_assign_ptr() performs a store-release, so the plain store to b is
> definitely w-post-bounded before the store to ptr, and the two stores
> will propagate to P1 in that order. However, rcu_dereference() is
> only equivalent to READ_ONCE(). While it is a marked access, it is
> not a fence or compiler barrier. Hence the only guarantee we have
> that the load of ptr in P1 is r-pre-bounded before the load of *p
> (thus avoiding a race) is the assumption about address dependencies.
>
> This is a situation where the compiler can undermine the memory model,
> and a certain amount of care is required when programming constructs
> like this one. In particular, comparisons between the pointer and
> other known addresses can cause trouble. If you have something like:
>
> p = rcu_dereference(ptr);
> if (p == &x)
> r = *p;
>
> then the compiler just might generate object code resembling:
>
> p = rcu_dereference(ptr);
> if (p == &x)
> r = x;
>
> or even:
>
> rtemp = x;
> p = rcu_dereference(ptr);
> if (p == &x)
> r = rtemp;
>
> which would invalidate the memory model's assumption, since the CPU
> could now perform the load of x before the load of ptr (there might be
> a control dependency but no address dependency at the machine level).
>
> Finally, it turns out there is a situation in which a plain write does
> not need to be w-post-bounded: when it is separated from the
> conflicting access by a fence. At first glance this may seem
> impossible. After all, to be conflicting the second access has to be
> on a different CPU from the first, and fences don't link events on
> different CPUs. Well, normal fences don't -- but rcu-fence can!
> Here's an example:
>
> int x, y;
>
> P0()
> {
> WRITE_ONCE(x, 1);
> synchronize_rcu();
> y = 3;
> }
>
> P1()
> {
> rcu_read_lock();
> if (READ_ONCE(x) == 0)
> y = 2;
> rcu_read_unlock();
> }
I introduced an r1 to create an "exists" clause:
------------------------------------------------------------------------
C plain-6
{
int x;
int y;
}
P0(int *x, int *y)
{
WRITE_ONCE(*x, 1);
synchronize_rcu();
*y = 3;
}
P1(int *x, int *y)
{
int r1;
rcu_read_lock();
r1 = READ_ONCE(*x);
if (r1 == 0)
*y = 2;
rcu_read_unlock();
}
exists r1=0 /\ y=2
------------------------------------------------------------------------
> Do the plain stores to y race? Clearly not if P1 reads a non-zero
> value for x, so let's assume the READ_ONCE(x) does obtain 0. This
> means that the read-side critical section in P1 must finish executing
> before the grace period in P0 does, because RCU's Grace-Period
> Guarantee says that otherwise P0's store to x would have propagated to
> P1 before the critical section started and so would have been visible
> to the READ_ONCE(). (Another way of putting it is that the fre link
> from the READ_ONCE() to the WRITE_ONCE() gives rise to an rcu-link
> between those two events.)
>
> This means there is an rcu-fence link from P1's "y = 2" store to P0's
> "y = 3" store, and consequently the first must propagate from P1 to P0
> before the second can execute. Therefore the two stores cannot be
> concurrent and there is no race, even though P1's plain store to y
> isn't w-post-bounded by any marked accesses.
>
> Putting all this material together yields the following picture. For
> two conflicting stores W and W', where W ->co W', the LKMM says the
> stores don't race if W can be linked to W' by a
>
> w-post-bounded ; vis ; w-pre-bounded
>
> sequence. If W is plain then they also have to be linked by an
>
> r-post-bounded ; xb* ; w-pre-bounded
>
> sequence, and if W' is plain then they also have to be linked by a
>
> w-post-bounded ; vis ; r-pre-bounded
>
> sequence. For a conflicting load R and store W, the LKMM says the two
> accesses don't race if R can be linked to W by an
>
> r-post-bounded ; xb* ; w-pre-bounded
>
> sequence or if W can be linked to R by a
>
> w-post-bounded ; vis ; r-pre-bounded
>
> sequence. For the cases involving a vis link, the LKMM also accepts
> sequences in which W is linked to W' or R by a
>
> strong-fence ; xb* ; {w and/or r}-pre-bounded
>
> sequence with no post-bounding, and in every case the LKMM also allows
> the link simply to be a fence with no bounding at all. If no sequence
> of the appropriate sort exists, the LKMM says that the accesses race.
>
> There is one more part of the LKMM related to plain accesses (although
> not to data races) we should discuss. Recall that many relations such
> as hb are limited to marked accesses only. As a result, the
> happens-before, propagates-before, and rcu axioms (which state that
> various relation must not contain a cycle) doesn't apply to plain
> accesses. Nevertheless, we do want to rule out such cycles, because
> they don't make sense even for plain accesses.
>
> To this end, the LKMM imposes three extra restrictions, together
> called the "plain-coherence" axiom because of their resemblance to the
> coherency rules:
>
> If R and W conflict and it is possible to link R to W by one
> of the xb* sequences listed above, then W ->rfe R is not
> allowed (i.e., a load cannot read from a store that it
> executes before, even if one or both is plain).
>
> If W and R conflict and it is possible to link W to R by one
> of the vis sequences listed above, then R ->fre W is not
> allowed (i.e., if a store is visible to a load then the load
> must read from that store or one coherence-after it).
>
> If W and W' conflict and it is possible to link W to W' by one
> of the vis sequences listed above, then W' ->co W is not
> allowed (i.e., if one store is visible to another then it must
> come after in the coherence order).
>
> This is the extent to which the LKMM deals with plain accesses.
> Perhaps it could say more (for example, plain accesses might
> contribute to the ppo relation), but at the moment it seems that this
> minimal, conservative approach is good enough.
I will need to read this last section again. Perhaps more than once. ;-)
Anyway, good stuff!!!
Thanx, Paul
On Thu, 12 Sep 2019, Paul E. McKenney wrote:
> On Fri, Sep 06, 2019 at 02:11:29PM -0400, Alan Stern wrote:
> > Folks:
> >
> > I have spent some time writing up a section for
> > tools/memory-model/Documentation/explanation.txt on plain accesses and
> > data races. The initial version is below.
> >
> > I'm afraid it's rather long and perhaps gets too bogged down in
> > complexities. On the other hand, this is a complicated topic so to
> > some extent this is unavoidable.
> >
> > In any case, I'd like to hear your comments and reviews.
>
> Good stuff, thank you for putting this together!
>
> Please see below for some questions, comments, and confusion interspersed.
Thanks for looking over this. Replies given inline below...
> > Alan
> >
> > ------------------------------------------------------------------------
> >
> >
> > PLAIN ACCESSES AND DATA RACES
> > -----------------------------
> >
> > In the LKMM, memory accesses such as READ_ONCE(x), atomic_inc(&y),
> > smp_load_acquire(&z), and so on are collectively referred to as
> > "marked" accesses, because they are all annotated with special
> > operations of one kind or another. Ordinary C-language memory
> > accesses such as x or y = 0 are simply called "plain" accesses.
> >
> > Early versions of the LKMM had nothing to say about plain accesses.
> > The C standard allows compilers to assume that the variables affected
> > by plain accesses are not concurrently read or written by any other
> > threads or CPUs. This leaves compilers free to implement all manner
> > of transformations or optimizations of code containing plain accesses,
> > making such code very difficult for a memory model to handle.
> >
> > Here is just one example of a possible pitfall:
> >
> > int a = 6;
> > int *x = &a;
> >
> > P0()
> > {
> > int *r1;
> > int r2 = 0;
> >
> > r1 = x;
> > if (r1 != NULL)
> > r2 = READ_ONCE(*r1);
> > }
> >
> > P1()
> > {
> > WRITE_ONCE(x, NULL);
> > }
>
> I tried making a litmus test out of this:
>
> ------------------------------------------------------------------------
> C plain-1
>
> {
> int a = 6;
> int *x = &a;
> }
>
> P0(int **x)
> {
> int *r1;
> int r2 = 0;
>
> r1 = *x;
> if (r1 != 0)
> r2 = READ_ONCE(*r1);
> }
>
> P1(int **x)
> {
> WRITE_ONCE(*x, 0);
> }
>
> locations [a; x; r1]
> exists ~r2=6 /\ ~r2=0
> ------------------------------------------------------------------------
>
> However, r1 steadfastly refuses to have any value other than zero.
>
> ------------------------------------------------------------------------
> $ herd7 -conf linux-kernel.cfg /tmp/argh
> Test plain-1 Allowed
> States 1
> a=6; r1=0; r2=0; x=0;
> No
> Witnesses
> Positive: 0 Negative: 2
> Flag data-race
> Condition exists (not (r2=6) /\ not (r2=0))
> Observation plain-1 Never 0 2
> Time plain-1 0.00
> Hash=b0fdbd0f627fd65e0cd413bf87f6f4a4
> ------------------------------------------------------------------------
>
> What am I doing wrong here? Outdated herd7 version?
In this and all the other litmus tests below, your "exists" clauses
are wrong. For example, here it should say:
exists ~0:r2=6 /\ ~0:r2=0
You forgot about the "0:" in front of the CPU-local variables.
Similarly for the "locations" clause.
> $ herd7 -version
> 7.52+7(dev), Rev: c81f1ff06f30d5c28c34d893a29f5f8505334179
>
> Hmmm... I might well be in an inconsistent herd7/ocaml state. If no
> one sees anything obvious, I will try reinstalling from scratch, but
> that will not likely happen in the next few days.
>
> > On the face of it, one would expect that when this code runs, the only
> > possible final values for r2 are 6 and 0, depending on whether or not
> > P1's store to x propagates to P0 before P0's load from x executes.
> > But since P0's load from x is a plain access, the compiler may decide
> > to carry out the load twice (for the comparison against NULL, then again
> > for the READ_ONCE()) and eliminate the temporary variable r1. The
> > object code generated for P0 could therefore end up looking rather
> > like this:
> >
> > P0()
> > {
> > int r2 = 0;
> >
> > if (x != NULL)
> > r2 = READ_ONCE(*x);
> > }
> >
> > And now it is obvious that this code runs the risk of dereferencing a
> > NULL pointer, because P1's store to x might propagate to P0 after the
> > test against NULL has been made but before the READ_ONCE() executes.
> > If the original code had said "r1 = READ_ONCE(x)" instead of "r1 = x",
> > the compiler would not have performed this optimization and there
> > would be no possibility of a NULL-pointer dereference.
> >
> > Given the possibility of transformations like this one, the LKMM
> > doesn't try to predict all possible outcomes of code containing plain
> > accesses. It is content to determine whether the code violates the
>
> I suggest starting this sentence with something like "It is instead
> content to determine", adding "instead", to help the reader transition.
Okay.
> > compiler's assumptions, which would render the ultimate outcome
> > undefined.
> >
> > In technical terms, the compiler is allowed to assume that when the
> > program executes, there will not be any data races. A "data race"
> > occurs when two conflicting memory accesses execute concurrently;
> > two memory accesses "conflict" if:
> >
> > they access the same location,
> >
> > they occur on different CPUs (or in different threads on the
> > same CPU),
> >
> > at least one of them is a plain access,
> >
> > and at least one of them is a store.
> >
> > The LKMM tries to determine whether a program contains two conflicting
> > accesses which may execute concurrently; if it does then the LKMM says
> > there is a potential data race and makes no predictions about the
> > program's outcome.
> >
> > Determining whether two accesses conflict is easy; you can see that
> > all the concepts involved in the definition above are already part of
> > the memory model. The hard part is telling whether they may execute
> > concurrently. The LKMM takes a conservative attitude, assuming that
> > accesses may be concurrent unless it can prove they cannot.
> >
> > If two memory accesses aren't concurrent then one must execute before
>
> Should this say "If two memory accesses to the same location aren't
> concurrent..."?
It could, but that doesn't seem to be necessary. The sentence is just
as true when talking about accesses to different locations.
Besides, if I have to add that qualifier _every_ time the document
talks about concurrent accesses, things will quickly get out of hand.
> > the other. Therefore the LKMM decides two accesses aren't concurrent
> > if they can be connected by a sequence of hb, pb, and rb links
> > (together referred to as xb, for "executes before"). However, there
> > are two complicating factors.
> >
> > If X is a load and X executes before a store Y, then indeed there is
> > no danger of X and Y being concurrent. After all, Y can't have any
> > effect on the value obtained by X until the memory subsystem has
> > propagated Y from its own CPU to X's CPU, which won't happen until
> > some time after Y executes and thus after X executes. But if X is a
> > store, then even if X executes before Y it is still possible that X
> > will propagate to Y's CPU just as Y is executing. In such a case X
> > could very well interfere somehow with Y, and we would have to
> > consider X and Y to be concurrent.
> >
> > Therefore when X is a store, for X and Y to be non-concurrent the LKMM
> > requires not only that X must execute before Y but also that X must
> > propagate to Y's CPU before Y executes. (Or vice versa, of course, if
> > Y executes before X -- then Y must propagate to X's CPU before X
> > executes if Y is a store.) This is expressed by the visibility
> > relation (vis), where X ->vis Y is defined to hold if there is an
> > intermediate event Z such that:
>
> "if there is a marked intermediate event Z such that"?
Not really needed. I explicitly mention elsewhere that many of the
relations defined by the LKMM, including vis, apply only to marked
accesses.
> > X is connected to Z by a possibly empty sequence of
> > cumul-fence links followed by an optional rfe link (if none of
> > these links are present, X and Z are the same event),
> >
> > and either:
> >
> > Z is connected to Y by a strong-fence link followed by a
> > possibly empty sequence of xb links,
>
> "possibly empty sequence of xb links from a marked access"?
Ditto.
> > or:
> >
> > Z is on the same CPU as Y and is connected to Y by a possibly
> > empty sequence of xb links (again, if the sequence is empty it
> > means Z and Y are the same event).
> >
> > The motivations behind this definition are straightforward:
> >
> > cumul-fence memory barriers force stores that are po-before
> > the barrier to propagate to other CPUs before stores that are
> > po-after the barrier.
> >
> > An rfe link from an event W to an event R says that R reads
> > from W, which certainly means that W must have propagated to
> > R's CPU before R executed.
> >
> > strong-fence memory barriers force stores that are po-before
> > the barrier, or that propagate to the barrier's CPU before the
> > barrier executes, to propagate to all CPUs before any events
> > po-after the barrier can execute.
> >
> > To see how this works out in practice, consider our old friend, the MP
> > pattern (with fences and statement labels, but without the conditional
> > test):
> >
> > int buf = 0, flag = 0;
> >
> > P0()
> > {
> > X: WRITE_ONCE(buf, 1);
> > smp_wmb();
> > W: WRITE_ONCE(flag, 1);
> > }
> >
> > P1()
> > {
> > int r1;
> > int r2 = 0;
> >
> > Z: r1 = READ_ONCE(flag);
> > smp_rmb();
> > Y: r2 = READ_ONCE(buf);
> > }
>
> I have to ask. Why X then W then Z then Y? ;-)
In order to agree with the text above. The Z statement here
corresponds to the Z event in the explanation of vis (and likewise for
X and Y). W was just an extra letter because I needed to refer to that
statement, and X, Y, and Z were already in use.
> (This is MP+fencewmbonceonce+fencermbonceonce.litmus in the current set
> in tools/memory-model/litmus-tests.)
>
> > The smp_wmb() memory barrier gives a cumul-fence link from X to W, and
> > assuming r1 = 1 at the end, there is an rfe link from W to Z. This
> > means that the store to buf must propagate from P0 to P1 before Z
> > executes. Next, Z and Y are on the same CPU and the smp_rmb() fence
> > provides an xb link from Z to Y (i.e., it forces Z to execute before
> > Y). Therefore we have X ->vis Y: X must propagate to Y's CPU before Y
> > executes.
> >
> > The second complicating factor mentioned above arises from the fact
> > that when we are considering data races, some of the memory accesses
> > are plain. Now, although we have not said so explicitly, up to this
> > point most of the relations defined by the LKMM (ppo, hb, prop,
> > cumul-fence, pb, and so on -- including vis) apply only to marked
> > accesses.
> >
> > There are good reasons for this restriction. The compiler is not
> > allowed to apply fancy transformations to marked accesses, and
> > consequently each such access in the source code corresponds more or
> > less directly to a single machine instruction in the object code. But
> > plain accesses are a different story; the compiler may combine them,
> > split them up, duplicate them, eliminate them, invent new ones, and
> > who knows what else. Seeing a plain access in the source code tells
> > you almost nothing about what machine instructions will end up in the
> > object code.
> >
> > Fortunately, the compiler isn't completely free; it is subject to some
> > limitations. For one, it is not allowed to introduce a data race into
> > the object code if the source code does not already contain a data
> > race (if it could, memory models would be useless and no multithreaded
> > code would be safe!). For another, it cannot move a plain access past
> > a compiler barrier.
> >
> > A compiler barrier is a kind of fence, but as the name implies, it
> > only affects the compiler; it does not necessarily have any effect on
> > how instructions are executed by the CPU. In Linux kernel source
> > code, the barrier() function is a compiler barrier. It doesn't give
> > rise directly to any machine instructions in the object code; rather,
> > it affects how the compiler generates the rest of the object code.
> > Given source code like this:
> >
> > ... some memory accesses ...
> > barrier();
> > ... some other memory accesses ...
> >
> > the barrier() function ensures that the machine instructions
> > corresponding to the first group of accesses will all end po-before
> > any machine instructions corresponding to the second group of accesses
> > -- even if some of the accesses are plain. (Of course, the CPU may
> > then execute some of those accesses out of program order, but we
> > already know how to deal with such issues.) Without the barrier()
> > there would be no such guarantee; the two groups of accesses could be
> > intermingled or even reversed in the object code.
> >
> > The LKMM doesn't say much about the barrier() function, but it does
> > require that all fences are also compiler barriers. In addition, it
> > requires that the ordering properties of memory barriers such as
> > smp_rmb() or smp_store_release() apply to plain accesses as well as to
> > marked accesses.
> >
> > This is the key to analyzing data races. Consider the MP pattern
> > again, now using plain accesses for buf:
> >
> > int buf = 0, flag = 0;
> >
> > P0()
> > {
> > U: buf = 1;
> > smp_wmb();
> > X: WRITE_ONCE(flag, 1);
> > }
> >
> > P1()
> > {
> > int r1;
> > int r2 = 0;
> >
> > Y: r1 = READ_ONCE(flag);
> > if (r1) {
> > smp_rmb();
> > V: r2 = buf;
> > }
> > }
>
> And same nit, why not just X, Y, and Z?
Well, for one thing, you can't make four distinct statement labels out
of X, Y, and Z! :-)
In this case I wanted X ->vis Y, to agree with the earlier usage, and I
needed two other labels for the plain accesses. U and V seemed
appropriate.
> Similar issues with the litmus test:
>
> ------------------------------------------------------------------------
> C plain-4
>
> {
> int buf = 0;
> int flag = 0;
> }
>
>
> P0(int *buf, int *flag)
> {
> *buf = 1;
> smp_wmb();
> WRITE_ONCE(*flag, 1);
> }
>
> P1(int *buf, int *flag)
> {
> int r1;
> int r2 = 0;
>
> r1 = READ_ONCE(*flag);
> if (r1) {
> smp_rmb();
> r2 = *buf;
> }
> }
>
> exists r1=1 /\ r2=0
> ------------------------------------------------------------------------
>
> > This program does not contain a data race. Although the U and V
> > accesses conflict, the LKMM can prove they are not concurrent as
> > follows:
> >
> > The smp_wmb() fence in P0 is both a compiler barrier and a
> > cumul-fence. It guarantees that no matter what hash of
> > machine instructions the compiler generates for the plain
> > access U, all those instructions will be po-before the fence.
> > Consequently U's store to buf, no matter how it is carried out
> > at the machine level, must propagate to P1 before X's store to
> > flag does.
> >
> > X and Y are both marked accesses. Hence an rfe link from X to
> > Y is a valid indicator that X propagated to P1 before Y
> > executed, i.e., X ->vis Y. (And if there is no rfe link then
> > r1 will be 0, so V will not be executed and ipso facto won't
> > race with U.)
> >
> > The smp_rmb() fence in P1 is a compiler barrier as well as a
> > fence. It guarantees that all the machine-level instructions
> > corresponding to the access V will be po-after the fence, and
> > therefore any loads among those instructions will execute
> > after the fence does and hence after Y does.
> >
> > Thus U's store to buf is forced to propagate to P1 before V's load
> > executes (assuming V does execute), ruling out the possibility of a
> > data race between them.
> >
> > This analysis illustrates how the LKMM deals with plain accesses in
> > general. Suppose R is a plain load and we want to show that R
> > executes before some marked access E. We can do this by finding a
> > marked access X such that R and X are ordered by a suitable fence and
> > X ->xb* E. If E was also a plain access, we would also look for a
> > marked access Y such that X ->xb* Y, and Y and E are ordered by a
> > fence. We describe this arrangement by saying that R is
> > "post-bounded" by X and E is "pre-bounded" by Y.
> >
> > In fact, we go one step further: Since R is a read, we say that R is
> > "r-post-bounded" by X. Similarly, E would be "r-pre-bounded" or
> > "w-pre-bounded" by Y, depending on whether E was a store or a load.
> > This distinction is needed because some fences affect only loads
> > (i.e., smp_rmb()) and some affect only stores (smp_wmb()); otherwise
> > the two types of bounds are the same. And as a degenerate case, we
> > say that a marked access pre-bounds and post-bounds itself (e.g., if R
> > above were a marked load then X could simply be taken to be R itself.)
> >
> > The need to distinguish between r- and w-bounding raises yet another
> > issue. When the source code contains a plain store, the compiler is
> > allowed to put plain loads of the same location into the object code.
> > For example, given the source code:
> >
> > x = 1;
> >
> > the compiler is theoretically allowed to generate object code that
> > looks like:
> >
> > if (x != 1)
> > x = 1;
> >
> > thereby adding a load (and possibly replacing the store entirely).
> > For this reason, whenever the LKMM requires a plain store to be
> > w-pre-bounded or w-post-bounded by a marked access, it also requires
> > the store to be r-pre-bounded or r-post-bounded, so as to handle cases
> > where the compiler adds a load.
> >
> > (This may be overly cautious. We don't know of any examples where a
> > compiler has augmented a store with a load in this fashion, and the
> > Linux kernel developers would probably fight pretty hard to change a
> > compiler if it ever did this. Still, better safe than sorry.)
> >
> > Incidentally, the other tranformation -- augmenting a plain load by
> > adding in a store to the same location -- is not allowed. This is
> > because the compiler cannot know whether any other CPUs might perform
> > a concurrent load from that location. Two concurrent loads don't
> > constitute a race (they can't interfere with each other), but a store
> > does race with a concurrent load. Thus adding a store might create a
> > data race where one was not already present in the source code,
> > something the compiler is forbidden to do. Augmenting a store with a
> > load, on the other hand, is acceptable because doing so won't create a
> > data race unless one already existed.
> >
> > The LKMM includes a second way to pre-bound plain accesses, in
> > addition to fences: an address dependency from a marked load. That
> > is, in the sequence:
> >
> > p = READ_ONCE(ptr);
> > r = *p;
> >
> > the LKMM says that the marked load of ptr pre-bounds the plain load of
> > *p; the marked load must execute before any of the machine
> > instructions corresponding to the plain load. This is a reasonable
> > stipulation, since after all, the CPU can't perform the load of *p
> > until it knows what value p will hold. Furthermore, without some
> > assumption like this one, some usages typical of RCU would count as
> > data races. For example:
> >
> > int a = 1, b;
>
> herd7 doesn't much like this.
The document mentions (near the top) that its code examples are not in
the proper form for actual litmus tests.
> > int *ptr = &a;
> >
> > P0()
> > {
> > b = 2;
> > rcu_assign_ptr(ptr, &b);
>
> rcu_assign_pointer(), globally.
Thank you; I will fix it.
> > }
> >
> > P1()
> > {
> > int *p;
> > int r;
> >
> > rcu_read_lock();
> > p = rcu_dereference(ptr);
> > r = *p;
> > rcu_read_unlock();
> > }
>
> ------------------------------------------------------------------------
> C plain-5
>
> {
> int a = 1;
> int b;
> int *ptr = &a;
> }
>
> P0(int *b, int **ptr)
> {
> *b = 2;
> rcu_assign_pointer(*ptr, b);
> }
>
> P1(int *ptr)
> {
> int *r1;
> int r2;
>
> rcu_read_lock();
> r1 = rcu_dereference(*ptr);
> r2 = *r1;
> rcu_read_unlock();
> }
>
> exists r1=b /\ r2=1
> ------------------------------------------------------------------------
>
> Also strange output:
>
> ------------------------------------------------------------------------
> $ herd7 -conf linux-kernel.cfg /tmp/argh
> Test plain-3 Allowed
> States 1
> r1=0; r2=0;
> No
> Witnesses
> Positive: 0 Negative: 2
> Condition exists (r1=b /\ r2=1)
> Observation plain-3 Never 0 2
> Time plain-3 0.00
> Hash=44e039597a92bdc7efc9217813478126
> ------------------------------------------------------------------------
>
> > (In this example the rcu_read_lock() and rcu_read_unlock() calls don't
> > really do anything, because there aren't any grace periods. They are
> > included merely for the sake of good form; typically P0 would call
> > synchronize_rcu() somewhere after the rcu_assign_ptr().)
> >
> > rcu_assign_ptr() performs a store-release, so the plain store to b is
> > definitely w-post-bounded before the store to ptr, and the two stores
> > will propagate to P1 in that order. However, rcu_dereference() is
> > only equivalent to READ_ONCE(). While it is a marked access, it is
> > not a fence or compiler barrier. Hence the only guarantee we have
> > that the load of ptr in P1 is r-pre-bounded before the load of *p
> > (thus avoiding a race) is the assumption about address dependencies.
> >
> > This is a situation where the compiler can undermine the memory model,
> > and a certain amount of care is required when programming constructs
> > like this one. In particular, comparisons between the pointer and
> > other known addresses can cause trouble. If you have something like:
> >
> > p = rcu_dereference(ptr);
> > if (p == &x)
> > r = *p;
> >
> > then the compiler just might generate object code resembling:
> >
> > p = rcu_dereference(ptr);
> > if (p == &x)
> > r = x;
> >
> > or even:
> >
> > rtemp = x;
> > p = rcu_dereference(ptr);
> > if (p == &x)
> > r = rtemp;
> >
> > which would invalidate the memory model's assumption, since the CPU
> > could now perform the load of x before the load of ptr (there might be
> > a control dependency but no address dependency at the machine level).
> >
> > Finally, it turns out there is a situation in which a plain write does
> > not need to be w-post-bounded: when it is separated from the
> > conflicting access by a fence. At first glance this may seem
> > impossible. After all, to be conflicting the second access has to be
> > on a different CPU from the first, and fences don't link events on
> > different CPUs. Well, normal fences don't -- but rcu-fence can!
> > Here's an example:
> >
> > int x, y;
> >
> > P0()
> > {
> > WRITE_ONCE(x, 1);
> > synchronize_rcu();
> > y = 3;
> > }
> >
> > P1()
> > {
> > rcu_read_lock();
> > if (READ_ONCE(x) == 0)
> > y = 2;
> > rcu_read_unlock();
> > }
>
> I introduced an r1 to create an "exists" clause:
>
> ------------------------------------------------------------------------
> C plain-6
>
> {
> int x;
> int y;
> }
>
>
> P0(int *x, int *y)
> {
> WRITE_ONCE(*x, 1);
> synchronize_rcu();
> *y = 3;
> }
>
> P1(int *x, int *y)
> {
> int r1;
>
> rcu_read_lock();
> r1 = READ_ONCE(*x);
> if (r1 == 0)
> *y = 2;
> rcu_read_unlock();
> }
>
> exists r1=0 /\ y=2
> ------------------------------------------------------------------------
By the way, my example is essentially the same as
manual/plain/C-S-rcunoderef-2.litmus in your archive. Variable names
changed and the local variable eliminated, but otherwise equivalent.
> > Do the plain stores to y race? Clearly not if P1 reads a non-zero
> > value for x, so let's assume the READ_ONCE(x) does obtain 0. This
> > means that the read-side critical section in P1 must finish executing
> > before the grace period in P0 does, because RCU's Grace-Period
> > Guarantee says that otherwise P0's store to x would have propagated to
> > P1 before the critical section started and so would have been visible
> > to the READ_ONCE(). (Another way of putting it is that the fre link
> > from the READ_ONCE() to the WRITE_ONCE() gives rise to an rcu-link
> > between those two events.)
> >
> > This means there is an rcu-fence link from P1's "y = 2" store to P0's
> > "y = 3" store, and consequently the first must propagate from P1 to P0
> > before the second can execute. Therefore the two stores cannot be
> > concurrent and there is no race, even though P1's plain store to y
> > isn't w-post-bounded by any marked accesses.
> >
> > Putting all this material together yields the following picture. For
> > two conflicting stores W and W', where W ->co W', the LKMM says the
> > stores don't race if W can be linked to W' by a
> >
> > w-post-bounded ; vis ; w-pre-bounded
> >
> > sequence. If W is plain then they also have to be linked by an
> >
> > r-post-bounded ; xb* ; w-pre-bounded
> >
> > sequence, and if W' is plain then they also have to be linked by a
> >
> > w-post-bounded ; vis ; r-pre-bounded
> >
> > sequence. For a conflicting load R and store W, the LKMM says the two
> > accesses don't race if R can be linked to W by an
> >
> > r-post-bounded ; xb* ; w-pre-bounded
> >
> > sequence or if W can be linked to R by a
> >
> > w-post-bounded ; vis ; r-pre-bounded
> >
> > sequence. For the cases involving a vis link, the LKMM also accepts
> > sequences in which W is linked to W' or R by a
> >
> > strong-fence ; xb* ; {w and/or r}-pre-bounded
> >
> > sequence with no post-bounding, and in every case the LKMM also allows
> > the link simply to be a fence with no bounding at all. If no sequence
> > of the appropriate sort exists, the LKMM says that the accesses race.
> >
> > There is one more part of the LKMM related to plain accesses (although
> > not to data races) we should discuss. Recall that many relations such
> > as hb are limited to marked accesses only. As a result, the
> > happens-before, propagates-before, and rcu axioms (which state that
> > various relation must not contain a cycle) doesn't apply to plain
> > accesses. Nevertheless, we do want to rule out such cycles, because
> > they don't make sense even for plain accesses.
> >
> > To this end, the LKMM imposes three extra restrictions, together
> > called the "plain-coherence" axiom because of their resemblance to the
> > coherency rules:
> >
> > If R and W conflict and it is possible to link R to W by one
> > of the xb* sequences listed above, then W ->rfe R is not
> > allowed (i.e., a load cannot read from a store that it
> > executes before, even if one or both is plain).
> >
> > If W and R conflict and it is possible to link W to R by one
> > of the vis sequences listed above, then R ->fre W is not
> > allowed (i.e., if a store is visible to a load then the load
> > must read from that store or one coherence-after it).
> >
> > If W and W' conflict and it is possible to link W to W' by one
> > of the vis sequences listed above, then W' ->co W is not
> > allowed (i.e., if one store is visible to another then it must
> > come after in the coherence order).
> >
> > This is the extent to which the LKMM deals with plain accesses.
> > Perhaps it could say more (for example, plain accesses might
> > contribute to the ppo relation), but at the moment it seems that this
> > minimal, conservative approach is good enough.
>
> I will need to read this last section again. Perhaps more than once. ;-)
>
> Anyway, good stuff!!!
Thanks. I'll submit it after more reviews are in.
Alan
On Thu, 12 Sep 2019, Paul E. McKenney wrote:
> On Fri, Sep 06, 2019 at 02:11:29PM -0400, Alan Stern wrote:
> > To this end, the LKMM imposes three extra restrictions, together
> > called the "plain-coherence" axiom because of their resemblance to the
> > coherency rules:
> >
> > If R and W conflict and it is possible to link R to W by one
> > of the xb* sequences listed above, then W ->rfe R is not
> > allowed (i.e., a load cannot read from a store that it
> > executes before, even if one or both is plain).
> >
> > If W and R conflict and it is possible to link W to R by one
> > of the vis sequences listed above, then R ->fre W is not
> > allowed (i.e., if a store is visible to a load then the load
> > must read from that store or one coherence-after it).
> >
> > If W and W' conflict and it is possible to link W to W' by one
> > of the vis sequences listed above, then W' ->co W is not
> > allowed (i.e., if one store is visible to another then it must
> > come after in the coherence order).
> I will need to read this last section again. Perhaps more than once. ;-)
I decided this part could use some improvement. Here is the updated
text:
To this end, the LKMM imposes three extra restrictions, together
called the "plain-coherence" axiom because of their resemblance to the
rules used by the operational model to ensure cache coherence (that
is, the rules governing the memory subsystem's choice of a store to
satisfy a load request and its determination of where a store will
fall in the coherence order):
If R and W conflict and it is possible to link R to W by one
of the xb* sequences listed above, then W ->rfe R is not
allowed (i.e., a load cannot read from a store that it
executes before, even if one or both is plain).
If W and R conflict and it is possible to link W to R by one
of the vis sequences listed above, then R ->fre W is not
allowed (i.e., if a store is visible to a load then the load
must read from that store or one coherence-after it).
If W and W' conflict and it is possible to link W to W' by one
of the vis sequences listed above, then W' ->co W is not
allowed (i.e., if one store is visible to a second then the
second must come after the first in the coherence order).
Alan
On Fri, Sep 13, 2019 at 11:21:17AM -0400, Alan Stern wrote:
> On Thu, 12 Sep 2019, Paul E. McKenney wrote:
> > On Fri, Sep 06, 2019 at 02:11:29PM -0400, Alan Stern wrote:
> > > Folks:
> > >
> > > I have spent some time writing up a section for
> > > tools/memory-model/Documentation/explanation.txt on plain accesses and
> > > data races. The initial version is below.
> > >
> > > I'm afraid it's rather long and perhaps gets too bogged down in
> > > complexities. On the other hand, this is a complicated topic so to
> > > some extent this is unavoidable.
> > >
> > > In any case, I'd like to hear your comments and reviews.
> >
> > Good stuff, thank you for putting this together!
> >
> > Please see below for some questions, comments, and confusion interspersed.
>
> Thanks for looking over this. Replies given inline below...
>
> > > Alan
> > >
> > > ------------------------------------------------------------------------
> > >
> > >
> > > PLAIN ACCESSES AND DATA RACES
> > > -----------------------------
> > >
> > > In the LKMM, memory accesses such as READ_ONCE(x), atomic_inc(&y),
> > > smp_load_acquire(&z), and so on are collectively referred to as
> > > "marked" accesses, because they are all annotated with special
> > > operations of one kind or another. Ordinary C-language memory
> > > accesses such as x or y = 0 are simply called "plain" accesses.
> > >
> > > Early versions of the LKMM had nothing to say about plain accesses.
> > > The C standard allows compilers to assume that the variables affected
> > > by plain accesses are not concurrently read or written by any other
> > > threads or CPUs. This leaves compilers free to implement all manner
> > > of transformations or optimizations of code containing plain accesses,
> > > making such code very difficult for a memory model to handle.
> > >
> > > Here is just one example of a possible pitfall:
> > >
> > > int a = 6;
> > > int *x = &a;
> > >
> > > P0()
> > > {
> > > int *r1;
> > > int r2 = 0;
> > >
> > > r1 = x;
> > > if (r1 != NULL)
> > > r2 = READ_ONCE(*r1);
> > > }
> > >
> > > P1()
> > > {
> > > WRITE_ONCE(x, NULL);
> > > }
> >
> > I tried making a litmus test out of this:
> >
> > ------------------------------------------------------------------------
> > C plain-1
> >
> > {
> > int a = 6;
> > int *x = &a;
> > }
> >
> > P0(int **x)
> > {
> > int *r1;
> > int r2 = 0;
> >
> > r1 = *x;
> > if (r1 != 0)
> > r2 = READ_ONCE(*r1);
> > }
> >
> > P1(int **x)
> > {
> > WRITE_ONCE(*x, 0);
> > }
> >
> > locations [a; x; r1]
> > exists ~r2=6 /\ ~r2=0
> > ------------------------------------------------------------------------
> >
> > However, r1 steadfastly refuses to have any value other than zero.
> >
> > ------------------------------------------------------------------------
> > $ herd7 -conf linux-kernel.cfg /tmp/argh
> > Test plain-1 Allowed
> > States 1
> > a=6; r1=0; r2=0; x=0;
> > No
> > Witnesses
> > Positive: 0 Negative: 2
> > Flag data-race
> > Condition exists (not (r2=6) /\ not (r2=0))
> > Observation plain-1 Never 0 2
> > Time plain-1 0.00
> > Hash=b0fdbd0f627fd65e0cd413bf87f6f4a4
> > ------------------------------------------------------------------------
> >
> > What am I doing wrong here? Outdated herd7 version?
>
> In this and all the other litmus tests below, your "exists" clauses
> are wrong. For example, here it should say:
>
> exists ~0:r2=6 /\ ~0:r2=0
>
> You forgot about the "0:" in front of the CPU-local variables.
> Similarly for the "locations" clause.
Thank you! Color me blind...
------------------------------------------------------------------------
C plain-1
{
int a = 6;
int *x = &a;
}
P0(int **x)
{
int *r1;
int r2 = 0;
r1 = *x;
if (r1 != 0)
r2 = READ_ONCE(*r1);
}
P1(int **x)
{
WRITE_ONCE(*x, 0);
}
locations [a; x; 0:r1]
exists ~0:r2=6 /\ ~0:r2=0
------------------------------------------------------------------------
$ herd7 -conf linux-kernel.cfg /tmp/plain1.litmus
Test plain-1 Allowed
States 2
0:r1=0; 0:r2=0; a=6; x=0;
0:r1=a; 0:r2=6; a=6; x=0;
No
Witnesses
Positive: 0 Negative: 2
Flag data-race
Condition exists (not (0:r2=6) /\ not (0:r2=0))
Observation plain-1 Never 0 2
Time plain-1 0.00
Hash=85c5c52abcd2e90733526b3fc42a01f2
------------------------------------------------------------------------
[ . . . ]
> > > compiler's assumptions, which would render the ultimate outcome
> > > undefined.
> > >
> > > In technical terms, the compiler is allowed to assume that when the
> > > program executes, there will not be any data races. A "data race"
> > > occurs when two conflicting memory accesses execute concurrently;
> > > two memory accesses "conflict" if:
> > >
> > > they access the same location,
> > >
> > > they occur on different CPUs (or in different threads on the
> > > same CPU),
> > >
> > > at least one of them is a plain access,
> > >
> > > and at least one of them is a store.
> > >
> > > The LKMM tries to determine whether a program contains two conflicting
> > > accesses which may execute concurrently; if it does then the LKMM says
> > > there is a potential data race and makes no predictions about the
> > > program's outcome.
> > >
> > > Determining whether two accesses conflict is easy; you can see that
> > > all the concepts involved in the definition above are already part of
> > > the memory model. The hard part is telling whether they may execute
> > > concurrently. The LKMM takes a conservative attitude, assuming that
> > > accesses may be concurrent unless it can prove they cannot.
> > >
> > > If two memory accesses aren't concurrent then one must execute before
> >
> > Should this say "If two memory accesses to the same location aren't
> > concurrent..."?
>
> It could, but that doesn't seem to be necessary. The sentence is just
> as true when talking about accesses to different locations.
>
> Besides, if I have to add that qualifier _every_ time the document
> talks about concurrent accesses, things will quickly get out of hand.
Fair enough!
> > > the other. Therefore the LKMM decides two accesses aren't concurrent
> > > if they can be connected by a sequence of hb, pb, and rb links
> > > (together referred to as xb, for "executes before"). However, there
> > > are two complicating factors.
> > >
> > > If X is a load and X executes before a store Y, then indeed there is
> > > no danger of X and Y being concurrent. After all, Y can't have any
> > > effect on the value obtained by X until the memory subsystem has
> > > propagated Y from its own CPU to X's CPU, which won't happen until
> > > some time after Y executes and thus after X executes. But if X is a
> > > store, then even if X executes before Y it is still possible that X
> > > will propagate to Y's CPU just as Y is executing. In such a case X
> > > could very well interfere somehow with Y, and we would have to
> > > consider X and Y to be concurrent.
> > >
> > > Therefore when X is a store, for X and Y to be non-concurrent the LKMM
> > > requires not only that X must execute before Y but also that X must
> > > propagate to Y's CPU before Y executes. (Or vice versa, of course, if
> > > Y executes before X -- then Y must propagate to X's CPU before X
> > > executes if Y is a store.) This is expressed by the visibility
> > > relation (vis), where X ->vis Y is defined to hold if there is an
> > > intermediate event Z such that:
> >
> > "if there is a marked intermediate event Z such that"?
>
> Not really needed. I explicitly mention elsewhere that many of the
> relations defined by the LKMM, including vis, apply only to marked
> accesses.
>
> > > X is connected to Z by a possibly empty sequence of
> > > cumul-fence links followed by an optional rfe link (if none of
> > > these links are present, X and Z are the same event),
> > >
> > > and either:
> > >
> > > Z is connected to Y by a strong-fence link followed by a
> > > possibly empty sequence of xb links,
> >
> > "possibly empty sequence of xb links from a marked access"?
>
> Ditto.
OK, leave it as is. We can always add it later should it prove necessary.
> > > or:
> > >
> > > Z is on the same CPU as Y and is connected to Y by a possibly
> > > empty sequence of xb links (again, if the sequence is empty it
> > > means Z and Y are the same event).
> > >
> > > The motivations behind this definition are straightforward:
> > >
> > > cumul-fence memory barriers force stores that are po-before
> > > the barrier to propagate to other CPUs before stores that are
> > > po-after the barrier.
> > >
> > > An rfe link from an event W to an event R says that R reads
> > > from W, which certainly means that W must have propagated to
> > > R's CPU before R executed.
> > >
> > > strong-fence memory barriers force stores that are po-before
> > > the barrier, or that propagate to the barrier's CPU before the
> > > barrier executes, to propagate to all CPUs before any events
> > > po-after the barrier can execute.
> > >
> > > To see how this works out in practice, consider our old friend, the MP
> > > pattern (with fences and statement labels, but without the conditional
> > > test):
> > >
> > > int buf = 0, flag = 0;
> > >
> > > P0()
> > > {
> > > X: WRITE_ONCE(buf, 1);
> > > smp_wmb();
> > > W: WRITE_ONCE(flag, 1);
> > > }
> > >
> > > P1()
> > > {
> > > int r1;
> > > int r2 = 0;
> > >
> > > Z: r1 = READ_ONCE(flag);
> > > smp_rmb();
> > > Y: r2 = READ_ONCE(buf);
> > > }
> >
> > I have to ask. Why X then W then Z then Y? ;-)
>
> In order to agree with the text above. The Z statement here
> corresponds to the Z event in the explanation of vis (and likewise for
> X and Y). W was just an extra letter because I needed to refer to that
> statement, and X, Y, and Z were already in use.
OK. ;-)
> > (This is MP+fencewmbonceonce+fencermbonceonce.litmus in the current set
> > in tools/memory-model/litmus-tests.)
> >
> > > The smp_wmb() memory barrier gives a cumul-fence link from X to W, and
> > > assuming r1 = 1 at the end, there is an rfe link from W to Z. This
> > > means that the store to buf must propagate from P0 to P1 before Z
> > > executes. Next, Z and Y are on the same CPU and the smp_rmb() fence
> > > provides an xb link from Z to Y (i.e., it forces Z to execute before
> > > Y). Therefore we have X ->vis Y: X must propagate to Y's CPU before Y
> > > executes.
> > >
> > > The second complicating factor mentioned above arises from the fact
> > > that when we are considering data races, some of the memory accesses
> > > are plain. Now, although we have not said so explicitly, up to this
> > > point most of the relations defined by the LKMM (ppo, hb, prop,
> > > cumul-fence, pb, and so on -- including vis) apply only to marked
> > > accesses.
> > >
> > > There are good reasons for this restriction. The compiler is not
> > > allowed to apply fancy transformations to marked accesses, and
> > > consequently each such access in the source code corresponds more or
> > > less directly to a single machine instruction in the object code. But
> > > plain accesses are a different story; the compiler may combine them,
> > > split them up, duplicate them, eliminate them, invent new ones, and
> > > who knows what else. Seeing a plain access in the source code tells
> > > you almost nothing about what machine instructions will end up in the
> > > object code.
> > >
> > > Fortunately, the compiler isn't completely free; it is subject to some
> > > limitations. For one, it is not allowed to introduce a data race into
> > > the object code if the source code does not already contain a data
> > > race (if it could, memory models would be useless and no multithreaded
> > > code would be safe!). For another, it cannot move a plain access past
> > > a compiler barrier.
> > >
> > > A compiler barrier is a kind of fence, but as the name implies, it
> > > only affects the compiler; it does not necessarily have any effect on
> > > how instructions are executed by the CPU. In Linux kernel source
> > > code, the barrier() function is a compiler barrier. It doesn't give
> > > rise directly to any machine instructions in the object code; rather,
> > > it affects how the compiler generates the rest of the object code.
> > > Given source code like this:
> > >
> > > ... some memory accesses ...
> > > barrier();
> > > ... some other memory accesses ...
> > >
> > > the barrier() function ensures that the machine instructions
> > > corresponding to the first group of accesses will all end po-before
> > > any machine instructions corresponding to the second group of accesses
> > > -- even if some of the accesses are plain. (Of course, the CPU may
> > > then execute some of those accesses out of program order, but we
> > > already know how to deal with such issues.) Without the barrier()
> > > there would be no such guarantee; the two groups of accesses could be
> > > intermingled or even reversed in the object code.
> > >
> > > The LKMM doesn't say much about the barrier() function, but it does
> > > require that all fences are also compiler barriers. In addition, it
> > > requires that the ordering properties of memory barriers such as
> > > smp_rmb() or smp_store_release() apply to plain accesses as well as to
> > > marked accesses.
> > >
> > > This is the key to analyzing data races. Consider the MP pattern
> > > again, now using plain accesses for buf:
> > >
> > > int buf = 0, flag = 0;
> > >
> > > P0()
> > > {
> > > U: buf = 1;
> > > smp_wmb();
> > > X: WRITE_ONCE(flag, 1);
> > > }
> > >
> > > P1()
> > > {
> > > int r1;
> > > int r2 = 0;
> > >
> > > Y: r1 = READ_ONCE(flag);
> > > if (r1) {
> > > smp_rmb();
> > > V: r2 = buf;
> > > }
> > > }
> >
> > And same nit, why not just X, Y, and Z?
>
> Well, for one thing, you can't make four distinct statement labels out
> of X, Y, and Z! :-)
Except that you only have three labels above.
> In this case I wanted X ->vis Y, to agree with the earlier usage, and I
> needed two other labels for the plain accesses. U and V seemed
> appropriate.
But fair enough, let's leave it as you have it.
> > Similar issues with the litmus test:
And adding the process numbers again helps considerably:
------------------------------------------------------------------------
C plain-4
{
int buf = 0;
int flag = 0;
}
P0(int *buf, int *flag)
{
*buf = 1;
smp_wmb();
WRITE_ONCE(*flag, 1);
}
P1(int *buf, int *flag)
{
int r1;
int r2 = 0;
r1 = READ_ONCE(*flag);
if (r1) {
smp_rmb();
r2 = *buf;
}
}
exists 1:r1=1 /\ 1:r2=0
------------------------------------------------------------------------
$ herd7 -conf linux-kernel.cfg /tmp/plain-4.litmus
Test plain-4 Allowed
States 2
1:r1=0; 1:r2=0;
1:r1=1; 1:r2=1;
No
Witnesses
Positive: 0 Negative: 2
Condition exists (1:r1=1 /\ 1:r2=0)
Observation plain-4 Never 0 2
Time plain-4 0.00
Hash=8121f748ab8c25abddd995581ed19844
------------------------------------------------------------------------
[ . . . ]
> > > the LKMM says that the marked load of ptr pre-bounds the plain load of
> > > *p; the marked load must execute before any of the machine
> > > instructions corresponding to the plain load. This is a reasonable
> > > stipulation, since after all, the CPU can't perform the load of *p
> > > until it knows what value p will hold. Furthermore, without some
> > > assumption like this one, some usages typical of RCU would count as
> > > data races. For example:
> > >
> > > int a = 1, b;
> >
> > herd7 doesn't much like this.
>
> The document mentions (near the top) that its code examples are not in
> the proper form for actual litmus tests.
Fair enough!
> > > int *ptr = &a;
> > >
> > > P0()
> > > {
> > > b = 2;
> > > rcu_assign_ptr(ptr, &b);
> >
> > rcu_assign_pointer(), globally.
>
> Thank you; I will fix it.
>
> > > }
> > >
> > > P1()
> > > {
> > > int *p;
> > > int r;
> > >
> > > rcu_read_lock();
> > > p = rcu_dereference(ptr);
> > > r = *p;
> > > rcu_read_unlock();
> > > }
And again:
------------------------------------------------------------------------
C plain-5
{
int a = 1;
int b;
int *ptr = &a;
}
P0(int *b, int **ptr)
{
*b = 2;
rcu_assign_pointer(*ptr, b);
}
P1(int *ptr)
{
int *r1;
int r2;
rcu_read_lock();
r1 = rcu_dereference(*ptr);
r2 = *r1;
rcu_read_unlock();
}
exists 1:r1=b /\ 1:r2=1
------------------------------------------------------------------------
$ herd7 -conf linux-kernel.cfg /tmp/plain-5.litmus
Test plain-5 Allowed
States 2
1:r1=a; 1:r2=1;
1:r1=b; 1:r2=2;
No
Witnesses
Positive: 0 Negative: 2
Condition exists (1:r1=b /\ 1:r2=1)
Observation plain-5 Never 0 2
Time plain-5 0.00
Hash=87a84be62e57a3086ef62fd4fbaaf405
------------------------------------------------------------------------
> > > (In this example the rcu_read_lock() and rcu_read_unlock() calls don't
> > > really do anything, because there aren't any grace periods. They are
> > > included merely for the sake of good form; typically P0 would call
> > > synchronize_rcu() somewhere after the rcu_assign_ptr().)
> > >
> > > rcu_assign_ptr() performs a store-release, so the plain store to b is
> > > definitely w-post-bounded before the store to ptr, and the two stores
> > > will propagate to P1 in that order. However, rcu_dereference() is
> > > only equivalent to READ_ONCE(). While it is a marked access, it is
> > > not a fence or compiler barrier. Hence the only guarantee we have
> > > that the load of ptr in P1 is r-pre-bounded before the load of *p
> > > (thus avoiding a race) is the assumption about address dependencies.
> > >
> > > This is a situation where the compiler can undermine the memory model,
> > > and a certain amount of care is required when programming constructs
> > > like this one. In particular, comparisons between the pointer and
> > > other known addresses can cause trouble. If you have something like:
> > >
> > > p = rcu_dereference(ptr);
> > > if (p == &x)
> > > r = *p;
> > >
> > > then the compiler just might generate object code resembling:
> > >
> > > p = rcu_dereference(ptr);
> > > if (p == &x)
> > > r = x;
> > >
> > > or even:
> > >
> > > rtemp = x;
> > > p = rcu_dereference(ptr);
> > > if (p == &x)
> > > r = rtemp;
> > >
> > > which would invalidate the memory model's assumption, since the CPU
> > > could now perform the load of x before the load of ptr (there might be
> > > a control dependency but no address dependency at the machine level).
> > >
> > > Finally, it turns out there is a situation in which a plain write does
> > > not need to be w-post-bounded: when it is separated from the
> > > conflicting access by a fence. At first glance this may seem
> > > impossible. After all, to be conflicting the second access has to be
> > > on a different CPU from the first, and fences don't link events on
> > > different CPUs. Well, normal fences don't -- but rcu-fence can!
> > > Here's an example:
> > >
> > > int x, y;
> > >
> > > P0()
> > > {
> > > WRITE_ONCE(x, 1);
> > > synchronize_rcu();
> > > y = 3;
> > > }
> > >
> > > P1()
> > > {
> > > rcu_read_lock();
> > > if (READ_ONCE(x) == 0)
> > > y = 2;
> > > rcu_read_unlock();
> > > }
> >
> > I introduced an r1 to create an "exists" clause:
------------------------------------------------------------------------
C plain-6
{
int x;
int y;
}
P0(int *x, int *y)
{
WRITE_ONCE(*x, 1);
synchronize_rcu();
*y = 3;
}
P1(int *x, int *y)
{
int r1;
rcu_read_lock();
r1 = READ_ONCE(*x);
if (r1 == 0)
*y = 2;
rcu_read_unlock();
}
exists 1:r1=0 /\ y=2
------------------------------------------------------------------------
$ herd7 -conf linux-kernel.cfg /tmp/plain-6.litmus
Test plain-6 Allowed
States 2
1:r1=0; y=3;
1:r1=1; y=3;
No
Witnesses
Positive: 0 Negative: 2
Condition exists (1:r1=0 /\ y=2)
Observation plain-6 Never 0 2
Time plain-6 0.00
Hash=be76479def6656df046180396df7b38a
------------------------------------------------------------------------
> By the way, my example is essentially the same as
> manual/plain/C-S-rcunoderef-2.litmus in your archive. Variable names
> changed and the local variable eliminated, but otherwise equivalent.
OK, no need to add this one, then, thank you!
> > > Do the plain stores to y race? Clearly not if P1 reads a non-zero
> > > value for x, so let's assume the READ_ONCE(x) does obtain 0. This
> > > means that the read-side critical section in P1 must finish executing
> > > before the grace period in P0 does, because RCU's Grace-Period
> > > Guarantee says that otherwise P0's store to x would have propagated to
> > > P1 before the critical section started and so would have been visible
> > > to the READ_ONCE(). (Another way of putting it is that the fre link
> > > from the READ_ONCE() to the WRITE_ONCE() gives rise to an rcu-link
> > > between those two events.)
> > >
> > > This means there is an rcu-fence link from P1's "y = 2" store to P0's
> > > "y = 3" store, and consequently the first must propagate from P1 to P0
> > > before the second can execute. Therefore the two stores cannot be
> > > concurrent and there is no race, even though P1's plain store to y
> > > isn't w-post-bounded by any marked accesses.
> > >
> > > Putting all this material together yields the following picture. For
> > > two conflicting stores W and W', where W ->co W', the LKMM says the
> > > stores don't race if W can be linked to W' by a
> > >
> > > w-post-bounded ; vis ; w-pre-bounded
> > >
> > > sequence. If W is plain then they also have to be linked by an
> > >
> > > r-post-bounded ; xb* ; w-pre-bounded
> > >
> > > sequence, and if W' is plain then they also have to be linked by a
> > >
> > > w-post-bounded ; vis ; r-pre-bounded
> > >
> > > sequence. For a conflicting load R and store W, the LKMM says the two
> > > accesses don't race if R can be linked to W by an
> > >
> > > r-post-bounded ; xb* ; w-pre-bounded
> > >
> > > sequence or if W can be linked to R by a
> > >
> > > w-post-bounded ; vis ; r-pre-bounded
> > >
> > > sequence. For the cases involving a vis link, the LKMM also accepts
> > > sequences in which W is linked to W' or R by a
> > >
> > > strong-fence ; xb* ; {w and/or r}-pre-bounded
> > >
> > > sequence with no post-bounding, and in every case the LKMM also allows
> > > the link simply to be a fence with no bounding at all. If no sequence
> > > of the appropriate sort exists, the LKMM says that the accesses race.
> > >
> > > There is one more part of the LKMM related to plain accesses (although
> > > not to data races) we should discuss. Recall that many relations such
> > > as hb are limited to marked accesses only. As a result, the
> > > happens-before, propagates-before, and rcu axioms (which state that
> > > various relation must not contain a cycle) doesn't apply to plain
> > > accesses. Nevertheless, we do want to rule out such cycles, because
> > > they don't make sense even for plain accesses.
> > >
> > > To this end, the LKMM imposes three extra restrictions, together
> > > called the "plain-coherence" axiom because of their resemblance to the
> > > coherency rules:
> > >
> > > If R and W conflict and it is possible to link R to W by one
> > > of the xb* sequences listed above, then W ->rfe R is not
> > > allowed (i.e., a load cannot read from a store that it
> > > executes before, even if one or both is plain).
> > >
> > > If W and R conflict and it is possible to link W to R by one
> > > of the vis sequences listed above, then R ->fre W is not
> > > allowed (i.e., if a store is visible to a load then the load
> > > must read from that store or one coherence-after it).
> > >
> > > If W and W' conflict and it is possible to link W to W' by one
> > > of the vis sequences listed above, then W' ->co W is not
> > > allowed (i.e., if one store is visible to another then it must
> > > come after in the coherence order).
> > >
> > > This is the extent to which the LKMM deals with plain accesses.
> > > Perhaps it could say more (for example, plain accesses might
> > > contribute to the ppo relation), but at the moment it seems that this
> > > minimal, conservative approach is good enough.
> >
> > I will need to read this last section again. Perhaps more than once. ;-)
> >
> > Anyway, good stuff!!!
>
> Thanks. I'll submit it after more reviews are in.
Sounds good!
Thanx, Paul
On Fri, Sep 13, 2019 at 03:13:26PM -0400, Alan Stern wrote:
> On Thu, 12 Sep 2019, Paul E. McKenney wrote:
>
> > On Fri, Sep 06, 2019 at 02:11:29PM -0400, Alan Stern wrote:
>
> > > To this end, the LKMM imposes three extra restrictions, together
> > > called the "plain-coherence" axiom because of their resemblance to the
> > > coherency rules:
> > >
> > > If R and W conflict and it is possible to link R to W by one
> > > of the xb* sequences listed above, then W ->rfe R is not
> > > allowed (i.e., a load cannot read from a store that it
> > > executes before, even if one or both is plain).
> > >
> > > If W and R conflict and it is possible to link W to R by one
> > > of the vis sequences listed above, then R ->fre W is not
> > > allowed (i.e., if a store is visible to a load then the load
> > > must read from that store or one coherence-after it).
> > >
> > > If W and W' conflict and it is possible to link W to W' by one
> > > of the vis sequences listed above, then W' ->co W is not
> > > allowed (i.e., if one store is visible to another then it must
> > > come after in the coherence order).
>
> > I will need to read this last section again. Perhaps more than once. ;-)
>
> I decided this part could use some improvement. Here is the updated
> text:
>
>
> To this end, the LKMM imposes three extra restrictions, together
> called the "plain-coherence" axiom because of their resemblance to the
> rules used by the operational model to ensure cache coherence (that
> is, the rules governing the memory subsystem's choice of a store to
> satisfy a load request and its determination of where a store will
> fall in the coherence order):
>
> If R and W conflict and it is possible to link R to W by one
> of the xb* sequences listed above, then W ->rfe R is not
> allowed (i.e., a load cannot read from a store that it
> executes before, even if one or both is plain).
>
> If W and R conflict and it is possible to link W to R by one
> of the vis sequences listed above, then R ->fre W is not
> allowed (i.e., if a store is visible to a load then the load
> must read from that store or one coherence-after it).
>
> If W and W' conflict and it is possible to link W to W' by one
> of the vis sequences listed above, then W' ->co W is not
> allowed (i.e., if one store is visible to a second then the
> second must come after the first in the coherence order).
These look good to me!
Thanx, Paul
Hi Alan,
I spend some time reading this, really helpful! Thanks.
Please see comments below:
On Fri, Sep 06, 2019 at 02:11:29PM -0400, Alan Stern wrote:
[...]
> If two memory accesses aren't concurrent then one must execute before
> the other. Therefore the LKMM decides two accesses aren't concurrent
> if they can be connected by a sequence of hb, pb, and rb links
> (together referred to as xb, for "executes before"). However, there
> are two complicating factors.
>
> If X is a load and X executes before a store Y, then indeed there is
> no danger of X and Y being concurrent. After all, Y can't have any
> effect on the value obtained by X until the memory subsystem has
> propagated Y from its own CPU to X's CPU, which won't happen until
> some time after Y executes and thus after X executes. But if X is a
> store, then even if X executes before Y it is still possible that X
> will propagate to Y's CPU just as Y is executing. In such a case X
> could very well interfere somehow with Y, and we would have to
> consider X and Y to be concurrent.
>
> Therefore when X is a store, for X and Y to be non-concurrent the LKMM
> requires not only that X must execute before Y but also that X must
> propagate to Y's CPU before Y executes. (Or vice versa, of course, if
> Y executes before X -- then Y must propagate to X's CPU before X
> executes if Y is a store.) This is expressed by the visibility
> relation (vis), where X ->vis Y is defined to hold if there is an
> intermediate event Z such that:
>
> X is connected to Z by a possibly empty sequence of
> cumul-fence links followed by an optional rfe link (if none of
> these links are present, X and Z are the same event),
>
I wonder whehter we could add an optional ->coe or ->fre between X and
the possibly empty sequence of cumul-fence, smiliar to the definition
of ->prop. This makes sense because if we have
X ->coe X' (and X' in on CPU C)
, X must be already propagated to C before X' executed, according to our
operation model:
"... In particular, it must arrange for the store to be co-later than
(i.e., to overwrite) any other store to the same location which has
already propagated to CPU C."
In other words, we can define ->vis as:
let vis = prop ; ((strong-fence ; [Marked] ; xbstar) | (xbstar & int))
, and for this document, reference the "prop" section in
explanation.txt. This could make the model simple (both for description
and explanation), and one better thing is that we won't need commit in
Paul's dev branch:
c683f2c807d2 "tools/memory-model: Fix data race detection for unordered store and load"
, and data race rules will look more symmetrical ;-)
Thoughts? Or am I missing something subtle here?
Regards,
Boqun
> and either:
>
> Z is connected to Y by a strong-fence link followed by a
> possibly empty sequence of xb links,
>
> or:
>
> Z is on the same CPU as Y and is connected to Y by a possibly
> empty sequence of xb links (again, if the sequence is empty it
> means Z and Y are the same event).
>
> The motivations behind this definition are straightforward:
>
> cumul-fence memory barriers force stores that are po-before
> the barrier to propagate to other CPUs before stores that are
> po-after the barrier.
>
> An rfe link from an event W to an event R says that R reads
> from W, which certainly means that W must have propagated to
> R's CPU before R executed.
>
> strong-fence memory barriers force stores that are po-before
> the barrier, or that propagate to the barrier's CPU before the
> barrier executes, to propagate to all CPUs before any events
> po-after the barrier can execute.
>
> To see how this works out in practice, consider our old friend, the MP
> pattern (with fences and statement labels, but without the conditional
> test):
>
> int buf = 0, flag = 0;
>
> P0()
> {
> X: WRITE_ONCE(buf, 1);
> smp_wmb();
> W: WRITE_ONCE(flag, 1);
> }
>
> P1()
> {
> int r1;
> int r2 = 0;
>
> Z: r1 = READ_ONCE(flag);
> smp_rmb();
> Y: r2 = READ_ONCE(buf);
> }
>
> The smp_wmb() memory barrier gives a cumul-fence link from X to W, and
> assuming r1 = 1 at the end, there is an rfe link from W to Z. This
> means that the store to buf must propagate from P0 to P1 before Z
> executes. Next, Z and Y are on the same CPU and the smp_rmb() fence
> provides an xb link from Z to Y (i.e., it forces Z to execute before
> Y). Therefore we have X ->vis Y: X must propagate to Y's CPU before Y
> executes.
>
[...]
On Mon, Sep 16, 2019 at 01:17:53PM +0800, Boqun Feng wrote:
> Hi Alan,
>
> I spend some time reading this, really helpful! Thanks.
>
> Please see comments below:
>
> On Fri, Sep 06, 2019 at 02:11:29PM -0400, Alan Stern wrote:
> [...]
> > If two memory accesses aren't concurrent then one must execute before
> > the other. Therefore the LKMM decides two accesses aren't concurrent
> > if they can be connected by a sequence of hb, pb, and rb links
> > (together referred to as xb, for "executes before"). However, there
> > are two complicating factors.
> >
> > If X is a load and X executes before a store Y, then indeed there is
> > no danger of X and Y being concurrent. After all, Y can't have any
> > effect on the value obtained by X until the memory subsystem has
> > propagated Y from its own CPU to X's CPU, which won't happen until
> > some time after Y executes and thus after X executes. But if X is a
> > store, then even if X executes before Y it is still possible that X
> > will propagate to Y's CPU just as Y is executing. In such a case X
> > could very well interfere somehow with Y, and we would have to
> > consider X and Y to be concurrent.
> >
> > Therefore when X is a store, for X and Y to be non-concurrent the LKMM
> > requires not only that X must execute before Y but also that X must
> > propagate to Y's CPU before Y executes. (Or vice versa, of course, if
> > Y executes before X -- then Y must propagate to X's CPU before X
> > executes if Y is a store.) This is expressed by the visibility
> > relation (vis), where X ->vis Y is defined to hold if there is an
> > intermediate event Z such that:
> >
> > X is connected to Z by a possibly empty sequence of
> > cumul-fence links followed by an optional rfe link (if none of
> > these links are present, X and Z are the same event),
> >
>
> I wonder whehter we could add an optional ->coe or ->fre between X and
> the possibly empty sequence of cumul-fence, smiliar to the definition
> of ->prop. This makes sense because if we have
>
> X ->coe X' (and X' in on CPU C)
>
> , X must be already propagated to C before X' executed, according to our
> operation model:
>
> "... In particular, it must arrange for the store to be co-later than
> (i.e., to overwrite) any other store to the same location which has
> already propagated to CPU C."
>
> In other words, we can define ->vis as:
>
> let vis = prop ; ((strong-fence ; [Marked] ; xbstar) | (xbstar & int))
>
Hmm.. so the problem with this approach is that the (xbstar & int) part
doesn't satisfy the requirement of visibility... i.e.
X ->prop Z ->(xbstar & int) Y
may not guarantee when Y executes, X is already propagated to Y's CPU.
Lemme think more on this...
Regards,
Boqun
> , and for this document, reference the "prop" section in
> explanation.txt. This could make the model simple (both for description
> and explanation), and one better thing is that we won't need commit in
> Paul's dev branch:
>
> c683f2c807d2 "tools/memory-model: Fix data race detection for unordered store and load"
>
> , and data race rules will look more symmetrical ;-)
>
> Thoughts? Or am I missing something subtle here?
>
> Regards,
> Boqun
>
> > and either:
> >
> > Z is connected to Y by a strong-fence link followed by a
> > possibly empty sequence of xb links,
> >
> > or:
> >
> > Z is on the same CPU as Y and is connected to Y by a possibly
> > empty sequence of xb links (again, if the sequence is empty it
> > means Z and Y are the same event).
> >
> > The motivations behind this definition are straightforward:
> >
> > cumul-fence memory barriers force stores that are po-before
> > the barrier to propagate to other CPUs before stores that are
> > po-after the barrier.
> >
> > An rfe link from an event W to an event R says that R reads
> > from W, which certainly means that W must have propagated to
> > R's CPU before R executed.
> >
> > strong-fence memory barriers force stores that are po-before
> > the barrier, or that propagate to the barrier's CPU before the
> > barrier executes, to propagate to all CPUs before any events
> > po-after the barrier can execute.
> >
> > To see how this works out in practice, consider our old friend, the MP
> > pattern (with fences and statement labels, but without the conditional
> > test):
> >
> > int buf = 0, flag = 0;
> >
> > P0()
> > {
> > X: WRITE_ONCE(buf, 1);
> > smp_wmb();
> > W: WRITE_ONCE(flag, 1);
> > }
> >
> > P1()
> > {
> > int r1;
> > int r2 = 0;
> >
> > Z: r1 = READ_ONCE(flag);
> > smp_rmb();
> > Y: r2 = READ_ONCE(buf);
> > }
> >
> > The smp_wmb() memory barrier gives a cumul-fence link from X to W, and
> > assuming r1 = 1 at the end, there is an rfe link from W to Z. This
> > means that the store to buf must propagate from P0 to P1 before Z
> > executes. Next, Z and Y are on the same CPU and the smp_rmb() fence
> > provides an xb link from Z to Y (i.e., it forces Z to execute before
> > Y). Therefore we have X ->vis Y: X must propagate to Y's CPU before Y
> > executes.
> >
> [...]
On Mon, 16 Sep 2019, Boqun Feng wrote:
> > executes if Y is a store.) This is expressed by the visibility
> > relation (vis), where X ->vis Y is defined to hold if there is an
> > intermediate event Z such that:
> >
> > X is connected to Z by a possibly empty sequence of
> > cumul-fence links followed by an optional rfe link (if none of
> > these links are present, X and Z are the same event),
> >
>
> I wonder whehter we could add an optional ->coe or ->fre between X and
> the possibly empty sequence of cumul-fence, smiliar to the definition
> of ->prop. This makes sense because if we have
>
> X ->coe X' (and X' in on CPU C)
>
> , X must be already propagated to C before X' executed, according to our
> operation model:
>
> "... In particular, it must arrange for the store to be co-later than
> (i.e., to overwrite) any other store to the same location which has
> already propagated to CPU C."
You have misinterpreted the text. The operational model says that if X
propagates to CPU C before X' executes then X ->coe X'. It does _not_
say that if X ->coe X' then X propagates to CPU C before X' executes.
> In other words, we can define ->vis as:
>
> let vis = prop ; ((strong-fence ; [Marked] ; xbstar) | (xbstar & int))
>
> , and for this document, reference the "prop" section in
> explanation.txt. This could make the model simple (both for description
> and explanation), and one better thing is that we won't need commit in
> Paul's dev branch:
>
> c683f2c807d2 "tools/memory-model: Fix data race detection for unordered store and load"
>
> , and data race rules will look more symmetrical ;-)
>
> Thoughts? Or am I missing something subtle here?
See above.
Alan
On Mon, 16 Sep 2019, Boqun Feng wrote:
> > In other words, we can define ->vis as:
> >
> > let vis = prop ; ((strong-fence ; [Marked] ; xbstar) | (xbstar & int))
> >
>
> Hmm.. so the problem with this approach is that the (xbstar & int) part
> doesn't satisfy the requirement of visibility... i.e.
>
> X ->prop Z ->(xbstar & int) Y
>
> may not guarantee when Y executes, X is already propagated to Y's CPU.
Yes, it doesn't guarantee this. But the reason it doesn't guarantee
this is because of the prop. The (xbstar & int) part is okay. In
other words, if
Z ->(xbstar & int) Y
then it is certainly true that any store propagating to Z's CPU before
Z executes also propagates to Y's CPU (which is the same one) before Y
executes.
Alan
On Mon, Sep 16, 2019 at 11:22:18AM -0400, Alan Stern wrote:
> On Mon, 16 Sep 2019, Boqun Feng wrote:
>
> > > executes if Y is a store.) This is expressed by the visibility
> > > relation (vis), where X ->vis Y is defined to hold if there is an
> > > intermediate event Z such that:
> > >
> > > X is connected to Z by a possibly empty sequence of
> > > cumul-fence links followed by an optional rfe link (if none of
> > > these links are present, X and Z are the same event),
> > >
> >
> > I wonder whehter we could add an optional ->coe or ->fre between X and
> > the possibly empty sequence of cumul-fence, smiliar to the definition
> > of ->prop. This makes sense because if we have
> >
> > X ->coe X' (and X' in on CPU C)
> >
> > , X must be already propagated to C before X' executed, according to our
> > operation model:
> >
> > "... In particular, it must arrange for the store to be co-later than
> > (i.e., to overwrite) any other store to the same location which has
> > already propagated to CPU C."
>
> You have misinterpreted the text. The operational model says that if X
> propagates to CPU C before X' executes then X ->coe X'. It does _not_
> say that if X ->coe X' then X propagates to CPU C before X' executes.
>
Alright, I got confused. If X ->coe X', we only have "X must execute
before X' propagated to the CPU of X" (otherwise, we will have X' ->coe
X), so we will only know the execution time of X (not the propagation
time) is before the propagation of X' to the CPU of X, and that won't
help build the visiblity, because we know zero about the propagation
time of X to any CPU.
> > In other words, we can define ->vis as:
> >
> > let vis = prop ; ((strong-fence ; [Marked] ; xbstar) | (xbstar & int))
> >
> > , and for this document, reference the "prop" section in
> > explanation.txt. This could make the model simple (both for description
> > and explanation), and one better thing is that we won't need commit in
> > Paul's dev branch:
> >
> > c683f2c807d2 "tools/memory-model: Fix data race detection for unordered store and load"
> >
> > , and data race rules will look more symmetrical ;-)
> >
> > Thoughts? Or am I missing something subtle here?
>
> See above.
>
Thanks!
Regards,
Boqun
> Alan
>
On Fri, Sep 06, 2019 at 02:11:29PM -0400, Alan Stern wrote:
> Folks:
>
> I have spent some time writing up a section for
> tools/memory-model/Documentation/explanation.txt on plain accesses and
> data races. The initial version is below.
>
> I'm afraid it's rather long and perhaps gets too bogged down in
> complexities. On the other hand, this is a complicated topic so to
> some extent this is unavoidable.
>
> In any case, I'd like to hear your comments and reviews.
Thank you for writing this up, Alan, and sorry for the delayed reply.
The section looks great to me, and I have no further suggestions besides
the minor fixes which have been already pointed out in the thread.
Looking forward to your v2 (an actual patch),
Andrea
On Fri, 27 Sep 2019, Andrea Parri wrote:
> On Fri, Sep 06, 2019 at 02:11:29PM -0400, Alan Stern wrote:
> > Folks:
> >
> > I have spent some time writing up a section for
> > tools/memory-model/Documentation/explanation.txt on plain accesses and
> > data races. The initial version is below.
> >
> > I'm afraid it's rather long and perhaps gets too bogged down in
> > complexities. On the other hand, this is a complicated topic so to
> > some extent this is unavoidable.
> >
> > In any case, I'd like to hear your comments and reviews.
>
> Thank you for writing this up, Alan, and sorry for the delayed reply.
>
> The section looks great to me, and I have no further suggestions besides
> the minor fixes which have been already pointed out in the thread.
>
> Looking forward to your v2 (an actual patch),
Thanks for the review. I'll post the patch (together with a couple of
other changes) in a week or so.
Alan