2006-09-07 21:25:54

by Alan Stern

Subject: Uses for memory barriers

Paul:

Here's something I had been thinking about back in July but never got
around to discussing: Under what circumstances would one ever want to use
"mb()" rather than "rmb()" or "wmb()"?

The canonical application for memory barriers is where one CPU writes two
locations and another reads them, to make certain that the ordering is
preserved (assume everything is initially equal to 0):

    CPU 0                   CPU 1
    -----                   -----
    a = 1;                  y = b;
    wmb();                  rmb();
    b = 1;                  x = a;
                            assert(x==1 || y==0);

In this situation the first CPU only needs wmb() and the second only needs
rmb(). So when would we need a full mb()?...
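
To make the pairing concrete, here is the same idiom written out as
flag-plus-payload code (a minimal sketch; the names are illustrative
only, not taken from any real driver):

    struct msg { int payload; } m;  /* initially 0 */
    int ready;                      /* initially 0 */

    void producer(void)             /* runs on CPU 0 */
    {
            m.payload = 42;         /* write the data first... */
            wmb();                  /* ...order it before the flag... */
            ready = 1;              /* ...then publish the flag */
    }

    void consumer(void)             /* runs on CPU 1 */
    {
            if (ready) {            /* saw the flag, so... */
                    rmb();          /* ...order the flag before the data... */
                    assert(m.payload == 42);  /* ...and the data is valid */
            }
    }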

The obvious extension of the canonical example is to have CPU 0 write
one location and read another, while CPU 1 reads and writes the same
locations. Example:

    CPU 0                   CPU 1
    -----                   -----
    while (y==0) relax();   y = -1;
    a = 1;                  b = 1;
    mb();                   mb();
    y = b;                  x = a;
                            while (y < 0) relax();
                            assert(x==1 || y==1);   //???

Apart from the extra stuff needed to make sure that CPU 1 sees the proper
value stored in y by CPU 0, this is just like the first example except for
the pattern of reads and writes. Naively one would think that if the
first half of the assertion fails, so x==0, then CPU 1 must have completed
all of the first four lines above by the time CPU 0 completed its mb().
Hence the assignment to b would have to be visible to CPU 0, since the
read of b occurs after the mb(). But of course we know that naive
reasoning isn't always right when it comes to the operation of memory
caches.

The opposite approach would use reads followed by writes:

    CPU 0                   CPU 1
    -----                   -----
    while (x==0) relax();   x = -1;
    x = a;                  y = b;
    mb();                   mb();
    b = 1;                  a = 1;
                            while (x < 0) relax();
                            assert(x==0 || y==0);   //???

Similar reasoning can be applied here. However IIRC, you decided that
neither of these assertions is actually guaranteed to hold. If that's the
case, then it looks like mb() is useless for coordinating two CPUs.

Am I correct? Or are there some easily-explained situations where mb()
really should be used for inter-CPU synchronization?

Alan


2006-09-07 22:10:09

by linux-os (Dick Johnson)

Subject: Re: Uses for memory barriers


On Thu, 7 Sep 2006, Alan Stern wrote:

> Paul:
>
> Here's something I had been thinking about back in July but never got
> around to discussing: Under what circumstances would one ever want to use
> "mb()" rather than "rmb()" or "wmb()"?
>
> The canonical application for memory barriers is where one CPU writes two
> locations and another reads them, to make certain that the ordering is
> preserved (assume everything is initially equal to 0):
>
> CPU 0 CPU 1
> ----- -----
> a = 1; y = b;
> wmb(); rmb();
> b = 1; x = a;
> assert(x==1 || y==0);
>
> In this situation the first CPU only needs wmb() and the second only needs
> rmb(). So when would we need a full mb()?...
>
> The obvious extension of the canonical example is to have CPU 0 write
> one location and read another, while CPU 1 reads and writes the same
> locations. Example:
>
> CPU 0 CPU 1
> ----- -----
> while (y==0) relax(); y = -1;
> a = 1; b = 1;
> mb(); mb();
> y = b; x = a;
> while (y < 0) relax();
> assert(x==1 || y==1); //???
>
> Apart from the extra stuff needed to make sure that CPU 1 sees the proper
> value stored in y by CPU 0, this is just like the first example except for
> the pattern of reads and writes. Naively one would think that if the
> first half of the assertion fails, so x==0, then CPU 1 must have completed
> all of the first four lines above by the time CPU 0 completed its mb().
> Hence the assignment to b would have to be visible to CPU 0, since the
> read of b occurs after the mb(). But of course we know that naive
> reasoning isn't always right when it comes to the operation of memory
> caches.
>
> The opposite approach would use reads followed by writes:
>
> CPU 0 CPU 1
> ----- -----
> while (x==0) relax(); x = -1;
> x = a; y = b;
> mb(); mb();
> b = 1; a = 1;
> while (x < 0) relax();
> assert(x==0 || y==0); //???
>
> Similar reasoning can be applied here. However IIRC, you decided that
> neither of these assertions is actually guaranteed to hold. If that's the
> case, then it looks like mb() is useless for coordinating two CPUs.
>
> Am I correct? Or are there some easily-explained situations where mb()
> really should be used for inter-CPU synchronization?
>
> Alan
>

It's simpler to understand if you know what the underlying problem
may be. Many modern computers don't have direct, interlocked, connections
to RAM anymore. They used to be like this:

    CPU0            CPU1
    cache           cache
      [memory controller]
              |
            [RAM]

The memory controller handled the hardware to make sure that reads
and writes didn't occur at the same time and a read would be held-off
until a write completed. That made sure that each CPU read what was
last written, regardless of who wrote it.

The situation is not the same anymore. It's now like this:

    CPU0            CPU1
    cache           cache
      |               |
    [serial link]   [serial link]
      |               |
      |               |
      |_______________|
              |
      [memory controller]
              |
            [RAM]

The serial links have a common characteristic: writes can be
queued, but a read forces all writes to complete before the
read occurs. Nothing is out of order [1], as seen by an individual
CPU, but you could have some real bad problems if you didn't
realize that the other CPUs' writes might get interleaved with
your CPU's writes!

So, if it is important what another CPU may write to your
variable, you need a memory-barrier which tells the compiler
to not assume that the variable just written contains the
value just written! It needs to read it again so that all
pending writes from anybody are finished before using
that memory value. Many times it's not important because
your variable may already be protected by a spin-lock so
it's impossible for any other CPU to muck with it. Other
times, as in the case of spin-locks themselves, memory
barriers are very important to make the lock work.

In your example, you cannot tell *when* one CPU may have
written to what you are reading. What you do know is
that the CPU that read a common variable, will read the
value that exists after all pending writes have occurred.
Since you don't know if there were any pending writes, you
need to protect such variables with spin-locks anyway.
These spin-locks contain the required memory barriers.

[1] Some CPUs now implement a 'sfence' instruction. Since
write combining can occur as a hardware optimization, this
can make it look to hardware as though writes occurred out-of-order.
This only affects what the hardware 'sees', not what
any CPU sees. If write ordering is important (it may be when
setting up DMA for instance, where the last write is
supposed to be the one that starts the DMA), then a 'sfence'
instruction should occur before such important writes. For
CPUs that don't have such an instruction, just read something
in the same address-space to force all pending writes to complete.
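
For illustration only -- the register offsets and names below are
invented for the example, not taken from any real device -- the DMA
case described above would look something like:

    #include <linux/io.h>

    /* Hypothetical device: descriptor address at offset 0x00,
     * "go" doorbell at offset 0x08. */
    static void start_dma(void __iomem *regs, u32 desc_addr)
    {
            writel(desc_addr, regs + 0x00); /* set up the transfer */
            wmb();                          /* setup writes must be visible... */
            writel(1, regs + 0x08);         /* ...before the doorbell write */
    }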

Cheers,
Dick Johnson
Penguin : Linux version 2.6.16.24 on an i686 machine (5592.66 BogoMips).
New book: http://www.AbominableFirebug.com/

2006-09-08 00:14:08

by Paul E. McKenney

Subject: Re: Uses for memory barriers

On Thu, Sep 07, 2006 at 05:25:51PM -0400, Alan Stern wrote:
> Paul:
>
> Here's something I had been thinking about back in July but never got
> around to discussing: Under what circumstances would one ever want to use
> "mb()" rather than "rmb()" or "wmb()"?

If there were reads needing to be separated from writes, for example
in spinlocks. The spinlock-acquisition primitive could not use either
wmb() or rmb(), since reads in the critical section must remain ordered
with respect to the write to the spinlock itself.

> The canonical application for memory barriers is where one CPU writes two
> locations and another reads them, to make certain that the ordering is
> preserved (assume everything is initially equal to 0):
>
> CPU 0 CPU 1
> ----- -----
> a = 1; y = b;
> wmb(); rmb();
> b = 1; x = a;
> assert(x==1 || y==0);
>
> In this situation the first CPU only needs wmb() and the second only needs
> rmb(). So when would we need a full mb()?...

Right, the above example does not need mb().

> The obvious extension of the canonical example is to have CPU 0 write
> one location and read another, while CPU 1 reads and writes the same
> locations. Example:
>
> CPU 0 CPU 1
> ----- -----
> while (y==0) relax(); y = -1;
> a = 1; b = 1;
> mb(); mb();
> y = b; x = a;
> while (y < 0) relax();
> assert(x==1 || y==1); //???
>
> Apart from the extra stuff needed to make sure that CPU 1 sees the proper
> value stored in y by CPU 0, this is just like the first example except for
> the pattern of reads and writes. Naively one would think that if the
> first half of the assertion fails, so x==0, then CPU 1 must have completed
> all of the first four lines above by the time CPU 0 completed its mb().
> Hence the assignment to b would have to be visible to CPU 0, since the
> read of b occurs after the mb(). But of course we know that naive
> reasoning isn't always right when it comes to the operation of memory
> caches.

In the above code, there is nothing stopping CPU 1 from executing through
the "x=a" before CPU 0 starts, so that x==0. In addition, CPU 1 imposes
no ordering between the assignment to y and b, so there is nothing stopping
CPU 0 from seeing the new value of y, but failing to see the new value of
b, so that y==0 (assuming the initial value of b is zero).

Something like the following might illustrate your point:

    CPU 0                   CPU 1
    -----                   -----
                            b = 1;
                            wmb();
    while (y==0) relax();   y = -1;
    a = 1;
    wmb();
    y = b;                  while (y < 0) relax();
                            rmb();
                            x = a;
                            assert(x==1 || y==1);   //???

Except that the memory barriers have all turned into rmb()s or wmb()s...

> The opposite approach would use reads followed by writes:
>
> CPU 0 CPU 1
> ----- -----
> while (x==0) relax(); x = -1;
> x = a; y = b;
> mb(); mb();
> b = 1; a = 1;
> while (x < 0) relax();
> assert(x==0 || y==0); //???
>
> Similar reasoning can be applied here. However IIRC, you decided that
> neither of these assertions is actually guaranteed to hold. If that's the
> case, then it looks like mb() is useless for coordinating two CPUs.

Yep, similar problems as with the earlier example.

> Am I correct? Or are there some easily-explained situations where mb()
> really should be used for inter-CPU synchronization?

Consider the following (lame) definitions for spinlock primitives,
but in an alternate universe where atomic_xchg() did not imply a
memory barrier, and on a weak-memory CPU:

    typedef atomic_t spinlock_t;

    void spin_lock(spinlock_t *l)
    {
            for (;;) {
                    if (atomic_xchg(l, 1) == 0) {
                            smp_mb();
                            return;
                    }
                    while (atomic_read(l) != 0) barrier();
            }
    }

    void spin_unlock(spinlock_t *l)
    {
            smp_mb();
            atomic_set(l, 0);
    }

The spin_lock() primitive needs smp_mb() to ensure that all loads and
stores in the following critical section happen only -after- the lock
is acquired. Similarly for the spin_unlock() primitive.
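
Here is a minimal usage sketch, assuming a counter protected only by
this (lame) lock; the names are illustrative:

    spinlock_t lock;        /* initially 0 == unlocked */
    int shared_count;

    void bump(void)
    {
            spin_lock(&lock);       /* smp_mb() keeps the accesses below... */
            shared_count++;         /* ...inside the critical section */
            spin_unlock(&lock);     /* smp_mb() keeps them from leaking
                                       past the releasing store */
    }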

Thanx, Paul

2006-09-08 05:52:41

by David Schwartz

Subject: RE: Uses for memory barriers


> Am I correct? Or are there some easily-explained situations where mb()
> really should be used for inter-CPU synchronization?

Consider when one CPU does the following:

while(!spinlock_acquire()) relax();
x=shared_value_protected_by_spinlock;

We need to make sure we do not *read* the protected values (say due to
prefetching) before other CPUs see our write that locks the spinlock.
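
As a sketch -- spinlock_acquire() and relax() are just the illustrative
names above, with the needed barrier shown explicitly:

    while (!spinlock_acquire())     /* the write that takes the lock */
            relax();
    mb();       /* without this, the load below could be satisfied
                   (e.g. from a prefetch) before other CPUs see the
                   locking write */
    x = shared_value_protected_by_spinlock;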

DS


2006-09-08 15:55:48

by Alan Stern

Subject: Re: Uses for memory barriers

On Thu, 7 Sep 2006, Paul E. McKenney wrote:

> On Thu, Sep 07, 2006 at 05:25:51PM -0400, Alan Stern wrote:
> > Paul:
> >
> > Here's something I had been thinking about back in July but never got
> > around to discussing: Under what circumstances would one ever want to use
> > "mb()" rather than "rmb()" or "wmb()"?
>
> If there were reads needing to be separated from writes, for example
> in spinlocks. The spinlock-acquisition primitive could not use either
> wmb() or rmb(), since reads in the critical section must remain ordered
> with respect to the write to the spinlock itself.

Yes, okay, that's a general situation where mb() would be useful. But the
actual application amounts to one of the two patterns below, as you will
see...

Let's call the following pattern "write-mb-read", since that's what each
CPU does (write a, mb, read b on CPU 0 and write b, mb, read a on CPU 1).

> > The obvious extension of the canonical example is to have CPU 0 write
> > one location and read another, while CPU 1 reads and writes the same
> > locations. Example:
> >
> > CPU 0 CPU 1
> > ----- -----
> > while (y==0) relax(); y = -1;
> > a = 1; b = 1;
> > mb(); mb();
> > y = b; x = a;
> > while (y < 0) relax();
> > assert(x==1 || y==1); //???

> In the above code, there is nothing stopping CPU 1 from executing through
> the "x=a" before CPU 0 starts, so that x==0.

Agreed.

> In addition, CPU 1 imposes
> no ordering between the assignment to y and b,

Agreed.

> so there is nothing stopping
> CPU 0 from seeing the new value of y, but failing to see the new value of
> b,

Disagree: CPU 0 executes mb() between reading y and b. You have assumed
that CPU 1 executed its write to b and its mb() before CPU 0 got started,
so why wouldn't CPU 0's mb() guarantee that it sees the new value of b?
That's really the key point.

To rephrase it in terms of partial orderings of events: CPU 1's mb()
orders the commit for the write to b before the read from a. The fact
that a was read as 0 means that the read occurred before CPU 0's write to
a was committed. The mb() on CPU 0 orders the commit for the write to a
before the read from b. By transitivity, the read from b must have
occurred after the write to b was committed, so the value read must have
been 1.
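
Laid out as a chain of events, with "<" meaning "occurred before":

    commit of W(b)  <  R(a)         [CPU 1's mb()]
    R(a)  <  commit of W(a)         [a was read as 0]
    commit of W(a)  <  R(b)         [CPU 0's mb()]

so by transitivity, commit of W(b) < R(b), and the value read from b
must be 1.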

> so that y==0 (assuming the initial value of b is zero).
>
> Something like the following might illustrate your point:
>
> CPU 0 CPU 1
> ----- -----
> b = 1;
> wmb();
> while (y==0) relax(); y = -1;
> a = 1;
> wmb();
> y = b; while (y < 0) relax();
> rmb();
> x = a;
> assert(x==1 || y==1); //???
>
> Except that the memory barriers have all turned into rmb()s or wmb()s...

This seems like a non-sequitur. My point was about mb(); how could this
example illustrate it? In particular, CPU 0's wmb() doesn't imply any
ordering between its read of y and its read of b.

Furthermore, in this example the stronger assertion x==1 would always
hold (it's a corollary of assert(y==-1 || x==1) combined with the
knowledge that the while loop has terminated). Did you in fact intend CPU
0's wmb() to be rmb()?

Let's call the following pattern "read-mb-write".

> > The opposite approach would use reads followed by writes:
> >
> > CPU 0 CPU 1
> > ----- -----
> > while (x==0) relax(); x = -1;
> > x = a; y = b;
> > mb(); mb();
> > b = 1; a = 1;
> > while (x < 0) relax();
> > assert(x==0 || y==0); //???
> >
> > Similar reasoning can be applied here. However IIRC, you decided that
> > neither of these assertions is actually guaranteed to hold. If that's the
> > case, then it looks like mb() is useless for coordinating two CPUs.
>
> Yep, similar problems as with the earlier example.
>
> > Am I correct? Or are there some easily-explained situations where mb()
> > really should be used for inter-CPU synchronization?
>
> Consider the following (lame) definitions for spinlock primitives,
> but in an alternate universe where atomic_xchg() did not imply a
> memory barrier, and on a weak-memory CPU:
>
> typedef atomic_t spinlock_t;
>
> void spin_lock(spinlock_t *l)
> {
> for (;;) {
> if (atomic_xchg(l, 1) == 0) {
> smp_mb();
> return;
> }
> while (atomic_read(l) != 0) barrier();
> }
>
> }
>
> void spin_unlock(spinlock_t *l)
> {
> smp_mb();
> atomic_set(l, 0);
> }
>
> The spin_lock() primitive needs smp_mb() to ensure that all loads and
> stores in the following critical section happen only -after- the lock
> is acquired. Similarly for the spin_unlock() primitive.

Really? Let's take a closer look. Let b be a pointer to a spinlock_t and
let a get accessed only within the critical sections:

    CPU 0                   CPU 1
    -----                   -----
    spin_lock(b);
    ...
    x = a;
    spin_unlock(b);         spin_lock(b);
                            a = 1;
                            ...
                            assert(x==0);   //???

Expanding out the code gives us a version of the read-mb-write pattern.
For the sake of argument, let's suppose that CPU 1's spin_lock succeeds
immediately with no looping:

    CPU 0                   CPU 1
    -----                   -----
    ...
    x = a;
    // spin_unlock(b):
    mb();
    atomic_set(b,0);        if (atomic_xchg(b,1) == 0)  // Succeeds
                                    mb();
                            a = 1;
                            ...
                            assert(x==0);   //???

As you can see, this fits the pattern exactly. CPU 0 does read a, mb,
write b, and CPU 1 does read b, mb, write a.

We are supposing the read of b obtains the new value (no looping); can we
then assert that the read of a obtained the old value? If we can -- and
we had better if spinlocks are to work correctly -- then doesn't this mean
that the read-mb-write pattern will always succeed?

Alan

2006-09-08 18:06:31

by Alan Stern

Subject: Re: Uses for memory barriers

On Fri, 8 Sep 2006, Oliver Neukum wrote:

> On Friday, 8 September 2006 17:55, Alan Stern wrote:
> > >     CPU 0                   CPU 1
> > >     -----                   -----
> > >     while (y==0) relax();   y = -1;
> > >     a = 1;                  b = 1;
> > >     mb();                   mb();
> > >     y = b;                  x = a;
> > >                             while (y < 0) relax();
> > >                             assert(x==1 || y==1);   //???
>
> y = -1
> a = 1
> y =b (== 0)
> b = 1
> x = a (==1)

So in this case the assertion is satisfied because x == 1. My question
was whether the assertion could ever fail. Paul said that it could, but I
didn't find his reason convincing.

> > Disagree: CPU 0 executes mb() between reading y and b. You have assumed
> > that CPU 1 executed its write to b and its mb() before CPU 0 got started,
> > so why wouldn't CPU 0's mb() guarantee that it sees the new value of b?
> > That's really the key point.
>
> That code has an ordinary race condition.
> For it to work it needs to be:
>
> b = 1;
> mb(); //could be wmb()
> y = -1;

I'm not sure what you mean. The code wasn't intended to "work" in any
sense; it was just to make a point. My question still stands: Is it
possible, in the code as I originally wrote it, for the assertion to fail?

Alan Stern

2006-09-08 18:22:43

by Oliver Neukum

Subject: Re: Uses for memory barriers


> I'm not sure what you mean. The code wasn't intended to "work" in any
> sense; it was just to make a point. My question still stands: Is it
> possible, in the code as I originally wrote it, for the assertion to fail?

Sorry, I misread || as &&.
The assertion will be true.

Regards
Oliver

2006-09-08 18:39:15

by Alan Stern

Subject: Re: Uses for memory barriers

Thanks for your comments, although they did not directly address the
question I asked.

On Thu, 7 Sep 2006, linux-os (Dick Johnson) wrote:

> It's simpler to understand if you know what the underlying problem
> may be. Many modern computers don't have direct, interlocked, connections
> to RAM anymore. They used to be like this:
>
> CPU0 CPU1
> cache cache
> [memory controller]
> |
> [RAM]
>
> The memory controller handled the hardware to make sure that reads
> and writes didn't occur at the same time and a read would be held-off
> until a write completed. That made sure that each CPU read what was
> last written, regardless of who wrote it.
>
> The situation is not the same anymore. It's now like this:
>
> CPU0 CPU1
> cache cache
> | |
> [serial link] [serial link]
> | |
> | |
> |________________|
> |
> [memory controller]
> |
> [RAM]
>
> The serial links have a common characteristic: writes can be
> queued, but a read forces all writes to complete before the
> read occurs. Nothing is out of order [1], as seen by an individual
> CPU, but you could have some real bad problems if you didn't
> realize that the other CPUs' writes might get interleaved with
> your CPU's writes!

That's not the whole story. An important aspect is that CPUs are free to
execute instructions out-of-order. In this code, even with a compiler
barrier present:

a = 1;
barrier();
b = 1;

the CPU is allowed to execute the write to b before the write to a. It's
not clear whether or not this agrees with what you wrote above: "Nothing
is out of order, as seen by an individual CPU".

> So, if it is important what another CPU may write to your
> variable, you need a memory-barrier which tells the compiler
> to not assume that the variable just written contains the
> value just written!

My questions concern memory barriers only, not compiler barriers.

> It needs to read it again so that all
> pending writes from anybody are finished before using
> that memory value. Many times it's not important because
> your variable may already be protected by a spin-lock so
> it's impossible for any other CPU to muck with it. Other
> times, as in the case of spin-locks themselves, memory
> barriers are very important to make the lock work.
>
> In your example, you cannot tell *when* one CPU may have
> written to what you are reading. What you do know is
> that the CPU that read a common variable, will read the
> value that exists after all pending writes have occurred.

That's not at all clear. Exactly what does it mean for a write to be
pending? Can't a read be satisfied right away from a CPU's local cache
even if another CPU has a pending write to the same variable? (In the
absence of instantaneous communication between processors it's hard to see
how this could fail to be true.)

> Since you don't know if there were any pending writes, you
> need to protect such variables with spin-locks anyway.
> These spin-locks contain the required memory barriers.

The point of my question was to understand more precisely how memory
barriers work. Saying that spinlocks should always be used, as they
contain all the required barriers, doesn't further my understanding at
all.

Alan Stern

2006-09-08 18:56:49

by Paul E. McKenney

Subject: Re: Uses for memory barriers

On Fri, Sep 08, 2006 at 11:55:46AM -0400, Alan Stern wrote:
> On Thu, 7 Sep 2006, Paul E. McKenney wrote:
>
> > On Thu, Sep 07, 2006 at 05:25:51PM -0400, Alan Stern wrote:
> > > Paul:
> > >
> > > Here's something I had been thinking about back in July but never got
> > > around to discussing: Under what circumstances would one ever want to use
> > > "mb()" rather than "rmb()" or "wmb()"?
> >
> > If there were reads needing to be separated from writes, for example
> > in spinlocks. The spinlock-acquisition primitive could not use either
> > wmb() or rmb(), since reads in the critical section must remain ordered
> > with respect to the write to the spinlock itself.
>
> Yes, okay, that's a general situation where mb() would be useful. But the
> actual application amounts to one of the two patterns below, as you will
> see...

Hey, you asked!!! ;-)

> Let's call the following pattern "write-mb-read", since that's what each
> CPU does (write a, mb, read b on CPU 0 and write b, mb, read a on CPU 1).
>
> > > The obvious extension of the canonical example is to have CPU 0 write
> > > one location and read another, while CPU 1 reads and writes the same
> > > locations. Example:
> > >
> > > CPU 0 CPU 1
> > > ----- -----
> > > while (y==0) relax(); y = -1;
> > > a = 1; b = 1;
> > > mb(); mb();
> > > y = b; x = a;
> > > while (y < 0) relax();
> > > assert(x==1 || y==1); //???
>
> > In the above code, there is nothing stopping CPU 1 from executing through
> > the "x=a" before CPU 0 starts, so that x==0.
>
> Agreed.
>
> > In addition, CPU 1 imposes
> > no ordering between the assignment to y and b,
>
> Agreed.
>
> > so there is nothing stopping
> > CPU 0 from seeing the new value of y, but failing to see the new value of
> > b,
>
> Disagree: CPU 0 executes mb() between reading y and b. You have assumed
> that CPU 1 executed its write to b and its mb() before CPU 0 got started,
> so why wouldn't CPU 0's mb() guarantee that it sees the new value of b?
> That's really the key point.

If we are talking about some specific CPUs, you might have a point.
But Linux must tolerate least-common-denominator semantics...

And those least-common-denominator semantics are "if a CPU sees an
assignment from another CPU that followed a memory barrier on that
other CPU, then the first CPU is guaranteed to see any stores from
the other CPU -preceding- that memory barrier."

In your example, CPU 0 does not access x, so never sees any stores
from CPU 1 following the mb(), and thus is never guaranteed to see
the assignments preceding CPU 1's mb().
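
A minimal restatement of that guarantee in the two-column style of this
thread (a and flag being illustrative variables):

    CPU 0                   CPU 1
    -----                   -----
    a = 1;
    mb();
    flag = 1;               if (flag == 1) {  /* saw the post-mb() store */
                                    mb();
                                    assert(a == 1);  /* guaranteed */
                            }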

> To rephrase it in terms of partial orderings of events: CPU 1's mb()
> orders the commit for the write to b before the read from a. The fact
> that a was read as 0 means that the read occurred before CPU 0's write to
> a was committed. The mb() on CPU 0 orders the commit for the write to a
> before the read from b. By transitivity, the read from b must have
> occurred after the write to b was committed, so the value read must have
> been 1.

After CPU 1 sees the new value of y, then, yes, any operation ordered
after the read of y would see the new value of a. But (1) the "x=a"
precedes the check for y and (2) there is no mb() following the check of y
to force any ordering whatsoever.

Assume that initially, a==0 in a cache line owned by CPU 1. Assume that
b==0 and y==0 in separate cache lines owned by CPU 0.

    CPU 0                           CPU 1
    -----                           -----
                                    y = -1; [this goes out quickly.]
    while (y==0) relax(); [picks up the value from CPU 1]
    a = 1;  [the cacheline is on CPU 1, so this one sits in
            CPU 0's store queue. CPU 0 transmits an invalidation
            request, which will take time to reach CPU 1.]
                                    b = 1; [the cacheline is on CPU 0, so
                                    this one sits in CPU 1's store
                                    queue, and CPU 1 transmits an
                                    invalidation request, which again
                                    will take time to reach CPU 0.]
    mb();   [CPU 0 waits for acknowledgement of reception of all previously
            transmitted invalidation requests, and also processes any
            invalidation requests that it has previously received (none!)
            before processing any subsequent loads. Yes, CPU 1 already
            sent the invalidation request for b, but there is no
            guarantee that it has worked its way to CPU 0 yet!]
                                    mb(); [Ditto.]

At this point, CPU 0 has received the invalidation request for b
from CPU 1, and CPU 1 has received the invalidation request for
a from CPU 0, but there is no guarantee that either of these
invalidation requests have been processed.

                                    x = a; [Could therefore still be using
                                    the old cached version of a, so
                                    that x==0.]
    y = b;  [Could also be still using the old cached version of b,
            so that y==0.]
                                    while (y < 0) relax(); [This will spin
                                    until the cache coherence protocol
                                    delivers the new value y==0. At this
                                    point, CPU 1 is guaranteed to see the
                                    new value of a due to CPU 0's mb(), but
                                    it is too late... And CPU 1 would
                                    need a memory barrier following the
                                    while loop in any case -- otherwise,
                                    CPU 1 would be within its rights
                                    on some CPUs to execute the code
                                    out of order.]
                                    assert(x==1 || y==1); //???
                                    [This assertion can therefore fail.]

By the way, this sort of scenario is why maintainers have my full
sympathies when they automatically reject any patches containing explicit
memory barriers...

> > so that y==0 (assuming the initial value of b is zero).
> >
> > Something like the following might illustrate your point:
> >
> > CPU 0 CPU 1
> > ----- -----
> > b = 1;
> > wmb();
> > while (y==0) relax(); y = -1;
> > a = 1;
> > wmb();
> > y = b; while (y < 0) relax();
> > rmb();
> > x = a;
> > assert(x==1 || y==1); //???
> >
> > Except that the memory barriers have all turned into rmb()s or wmb()s...
>
> This seems like a non-sequitur. My point was about mb(); how could this
> example illustrate it? In particular, CPU 0's wmb() doesn't imply any
> ordering between its read of y and its read of b.

Excellent point, thus CPU 0's wmb() needs to instead be an mb().

> Furthermore, in this example the stronger assertion x==1 would always
> hold (it's a corollary of assert(y==-1 || x==1) combined with the
> knowledge that the while loop has terminated). Did you in fact intend CPU
> 0's wmb() to be rmb()?

I was just confused, confusion being a common symptom of working with
explicit memory barriers. :-/

If CPU 0 did mb(), then I believe assert(x==1&&y==1) would hold.

> Let's call the following pattern "read-mb-write".
>
> > > The opposite approach would use reads followed by writes:
> > >
> > > CPU 0 CPU 1
> > > ----- -----
> > > while (x==0) relax(); x = -1;
> > > x = a; y = b;
> > > mb(); mb();
> > > b = 1; a = 1;
> > > while (x < 0) relax();
> > > assert(x==0 || y==0); //???
> > >
> > > Similar reasoning can be applied here. However IIRC, you decided that
> > > neither of these assertions is actually guaranteed to hold. If that's the
> > > case, then it looks like mb() is useless for coordinating two CPUs.
> >
> > Yep, similar problems as with the earlier example.
> >
> > > Am I correct? Or are there some easily-explained situations where mb()
> > > really should be used for inter-CPU synchronization?
> >
> > Consider the following (lame) definitions for spinlock primitives,
> > but in an alternate universe where atomic_xchg() did not imply a
> > memory barrier, and on a weak-memory CPU:
> >
> > typedef atomic_t spinlock_t;
> >
> > void spin_lock(spinlock_t *l)
> > {
> > for (;;) {
> > if (atomic_xchg(l, 1) == 0) {
> > smp_mb();
> > return;
> > }
> > while (atomic_read(l) != 0) barrier();
> > }
> >
> > }
> >
> > void spin_unlock(spinlock_t *l)
> > {
> > smp_mb();
> > atomic_set(l, 0);
> > }
> >
> > The spin_lock() primitive needs smp_mb() to ensure that all loads and
> > stores in the following critical section happen only -after- the lock
> > is acquired. Similarly for the spin_unlock() primitive.
>
> Really? Let's take a closer look. Let b be a pointer to a spinlock_t and
> let a get accessed only within the critical sections:
>
> CPU 0 CPU 1
> ----- -----
> spin_lock(b);
> ...
> x = a;
> spin_unlock(b); spin_lock(b);
> a = 1;
> ...
> assert(x==0); //???
>
> Expanding out the code gives us a version of the read-mb-write pattern.

Eh? There is nothing guaranteeing that CPU 0 gets the lock before
CPU 1, so what is to assert?

> For the sake of argument, let's suppose that CPU 1's spin_lock succeeds
> immediately with no looping:
>
> CPU 0 CPU 1
> ----- -----
> ...
> x = a;
> // spin_unlock(b):
> mb();
> atomic_set(b,0); if (atomic_xchg(b,1) == 0) // Succeeds
> mb();
> a = 1;
> ...
> assert(x==0); //???
>
> As you can see, this fits the pattern exactly. CPU 0 does read a, mb,
> write b, and CPU 1 does read b, mb, write a.

Again, you don't have anything that guarantees that CPU 0 will get the
lock first, so the assertion is bogus. Also, the "if" shouldn't be
there -- since we are assuming CPU 1 got the lock immediately, it
would be unconditionally executing the mb(), right?

> We are supposing the read of b obtains the new value (no looping); can we
> then assert that the read of a obtained the old value? If we can -- and
> we had better if spinlocks are to work correctly -- then doesn't this mean
> that the read-mb-write pattern will always succeed?

Since CPU 1 saw CPU 0's assignment to b (which follows CPU 0's mb()),
and since CPU 1 immediately executed a memory barrier, CPU 1's critical
section is guaranteed to follow that of CPU 0. -Both- CPU 0's and CPU 1's
memory barriers are needed to make this work. This does not rely on
anything fancy, simply the standard pair-wise memory-barrier guarantee.

(Which is: if CPU 1 sees an assignment by CPU 0 following CPU 0 executing
a memory barrier, then any code on CPU 1 following a later CPU 1 memory
barrier is guaranteed to see the effects of all references by CPU 0
that preceded CPU 0's memory barrier.)

David Howells's Documentation/memory-barriers.txt goes through this
sort of thing, IIRC.

Thanx, Paul

2006-09-08 21:23:09

by Alan Stern

Subject: Re: Uses for memory barriers

On Fri, 8 Sep 2006, Paul E. McKenney wrote:

> > Yes, okay, that's a general situation where mb() would be useful. But the
> > actual application amounts to one of the two patterns below, as you will
> > see...
>
> Hey, you asked!!! ;-)

I knew that starting this discussion would open a can of worms, but what
the heck... I was curious! :-)

> If we are talking about some specific CPUs, you might have a point.
> But Linux must tolerate least-common-denominator semantics...
>
> And those least-common-denominator semantics are "if a CPU sees an
> assignment from another CPU that followed a memory barrier on that
> other CPU, then the first CPU is guaranteed to see any stores from
> the other CPU -preceding- that memory barrier."

The "canonical" memory-barrier guarantee. But I'm interested in
non-canonical situations...

> Assume that initially, a==0 in a cache line owned by CPU 1. Assume that
> b==0 and y==0 in separate cache lines owned by CPU 0.

Ah, a meaty analysis with details of inner mechanisms. Good.

> CPU 0 CPU 1
> ----- -----
> y = -1; [this goes out quickly.]
> while (y==0) relax(); [picks up the value from CPU 1]
> a = 1; [the cacheline is on CPU 1, so this one sits in
> CPU 0's store queue. CPU 0 transmits an invalidation
> request, which will take time to reach CPU 1.]
> b = 1; [the cacheline is on CPU 0, so
> this one sits in CPU 1's store
> queue, and CPU 1 transmits an
> invalidation request, which again
> will take time to reach CPU 0.]
> mb(); [CPU 0 waits for acknowledgement of reception of all previously
> transmitted invalidation requests, and also processes any
> invalidation requests that it has previously received (none!)
> before processing any subsequent loads. Yes, CPU 1 already
> sent the invalidation request for b, but there is no
> guarantee that it has worked its way to CPU 0 yet!]
> mb(); [Ditto.]
>
> At this point, CPU 0 has received the invalidation request for b
> from CPU 1, and CPU 1 has received the invalidation request for
> a from CPU 0, but there is no guarantee that either of these
> invalidation requests have been processed.

Do you draw a distinction between an invalidation request being
"acknowledged" and being "processed"? CPU 0 won't finish its mb() until
CPU 1 has acknowledged receiving the invalidation request for a's
cacheline. Does it make sense for CPU 1 to acknowledge reception of the
request without having yet processed the request?

Wouldn't that open a door for some nasty synchronization problems? For
example, suppose CPU 1 had previously modified some other variable c
located in the same cacheline as a. Acknowledging the invalidation
request means it is giving permission for CPU 0 to own the cacheline;
wouldn't this cause the modification of c to be lost?

>
> x = a; [Could therefore still be using
> the old cached version of a, so
> that x==0.]

Not if the invalidation request was processed as well as merely
acknowledged.

> y = b; [Could also be still using the old cached version of b,
> so that y==0.]
> while (y < 0) relax(); [This will spin until
> the cache coherence protocol delivers
> the new value y==0. At this point,
> CPU 1 is guaranteed to see the new
> value of a due to CPU 0's mb(), but
> it is too late... And CPU 1 would
> need a memory barrier following the
> while loop in any case -- otherwise,
> CPU 1 would be within its rights
> on some CPUs to execute the code
> out of order.]
> assert(x==1 || y==1); //???
> [This assertion can therefore fail.]
>
> By the way, this sort of scenario is why maintainers have my full
> sympathies when they automatically reject any patches containing explicit
> memory barriers...

I sympathize.


> > Let's call the following pattern "read-mb-write".

I wish I had presented read-mb-write first, before write-mb-read. It's
more pertinent to the section on locking below. Can you please give an
equally detailed analysis, like the one above, showing how the assertion
in this pattern can fail?

> > > > The opposite approach would use reads followed by writes:
> > > >
> > > > CPU 0 CPU 1
> > > > ----- -----
> > > > while (x==0) relax(); x = -1;
> > > > x = a; y = b;
> > > > mb(); mb();
> > > > b = 1; a = 1;
> > > > while (x < 0) relax();
> > > > assert(x==0 || y==0); //???

Or does it turn out that read-mb-write does succeed, even though
write-mb-read can sometimes fail?


> > > Consider the following (lame) definitions for spinlock primitives,
> > > but in an alternate universe where atomic_xchg() did not imply a
> > > memory barrier, and on a weak-memory CPU:
> > >
> > > typedef atomic_t spinlock_t;
> > >
> > > void spin_lock(spinlock_t *l)
> > > {
> > > for (;;) {
> > > if (atomic_xchg(l, 1) == 0) {
> > > smp_mb();
> > > return;
> > > }
> > > while (atomic_read(l) != 0) barrier();
> > > }
> > >
> > > }
> > >
> > > void spin_unlock(spinlock_t *l)
> > > {
> > > smp_mb();
> > > atomic_set(l, 0);
> > > }
> > >
> > > The spin_lock() primitive needs smp_mb() to ensure that all loads and
> > > stores in the following critical section happen only -after- the lock
> > > is acquired. Similarly for the spin_unlock() primitive.
> >
> > Really? Let's take a closer look. Let b be a pointer to a spinlock_t and
> > let a get accessed only within the critical sections:
> >
> > CPU 0 CPU 1
> > ----- -----
> > spin_lock(b);
> > ...
> > x = a;
> > spin_unlock(b); spin_lock(b);
> > a = 1;
> > ...
> > assert(x==0); //???
> >
> > Expanding out the code gives us a version of the read-mb-write pattern.
>
> Eh? There is nothing guaranteeing that CPU 0 gets the lock before
> CPU 1, so what is to assert?

Okay, I left out some stuff. Suppose we have this instead:

    CPU 0                   CPU 1
    -----                   -----
    spin_lock(b);
    ...
    y = -1;                 while (y == 0) relax();
    x = a;
    spin_unlock(b);         spin_lock(b);
                            a = 1;
                            ...
                            assert(x==0);   //???

Now we _can_ guarantee that CPU 0 gets the lock before CPU 1. The
discussion will not be significantly affected by this change.

> > For the sake of argument, let's suppose that CPU 1's spin_lock succeeds
> > immediately with no looping:
> >
> > CPU 0 CPU 1
> > ----- -----
> > ...
Insert: y = -1; (goes out quickly)
while (y == 0) relax(); (quickly terminates)
> > x = a;
> > // spin_unlock(b):
> > mb();
> > atomic_set(b,0); if (atomic_xchg(b,1) == 0) // Succeeds
> > mb();
> > a = 1;
> > ...
> > assert(x==0); //???
> >
> > As you can see, this fits the pattern exactly. CPU 0 does read a, mb,
> > write b, and CPU 1 does read b, mb, write a.
>
> Again, you don't have anything that guarantees that CPU 0 will get the
> lock first, so the assertion is bogus.

Now we do have such a guarantee.

> Also, the "if" shouldn't be
> there -- since we are assuming CPU 1 got the lock immediately, it
> would be unconditionally executing the mb(), right?

I copied the "if" from your original code. If the "if" had failed then
CPU 1 would have looped and retried the "if", until it did succeed. As
the "// Succeeds" comment indicates, we can assume for the sake of
argument that the "if" succeeds the first time through.

> > We are supposing the read of b obtains the new value (no looping); can we
> > then assert that the read of a obtained the old value? If we can -- and
> > we had better if spinlocks are to work correctly -- then doesn't this mean
> > that the read-mb-write pattern will always succeed?
>
> Since CPU 1 saw CPU 0's assignment to b (which follows CPU 0's mb()),
> and since CPU 1 immediately executed a memory barrier, CPU 1's critical
> section is guaranteed to follow that of CPU 0.

Correct.

> -Both- CPU 0's and CPU 1's
> memory barriers are needed to make this work. This does not rely on
> anything fancy, simply the standard pair-wise memory-barrier guarantee.

Yes. But you haven't answered the question: Must the assertion succeed?
And if it must, what is the low-level mechanism which insures it will?
How does this mechanism compare with the mechanism for the read-mb-write
pattern?


> David Howells's Documentation/memory-barriers.txt goes through this
> sort of thing, IIRC.

That's why I put him on the CC: list. It doesn't explain things in quite
this much detail, though, and it doesn't consider these non-canonical
cases (except for one brief section on the implementation of
rw-semaphores).

Alan

2006-09-08 21:26:06

by Alan Stern

Subject: Re: Uses for memory barriers

On Fri, 8 Sep 2006, Oliver Neukum wrote:

> It seems you are correct.
> Therefore the correct code on CPU 1 would be:
>
> y = -1;
> b = 1;
> //mb();
> //x = a;
> while (y < 0) relax();
>
> mb();
> x = a;
>
> assert(x==1 || y==1); //???
>
> And yes, it is confusing. I've been forced to change my mind twice.

Again you have misunderstood. The original code was _not_ incorrect. I
was asking: Given the code as stated, would the assertion ever fail?

The code _was_ correct for my purposes, namely, to illustrate a technical
point about the behavior of memory barriers.

Alan

2006-09-08 21:46:07

by Oliver Neukum

Subject: Re: Uses for memory barriers

On Friday, 8 September 2006 23:26, Alan Stern wrote:
> On Fri, 8 Sep 2006, Oliver Neukum wrote:
>
> > It seems you are correct.
> > Therefore the correct code on CPU 1 would be:
> >
> > y = -1;
> > b = 1;
> > //mb();
> > //x = a;
> > while (y < 0) relax();
> >
> > mb();
> > x = a;
> >
> > assert(x==1 || y==1); //???
> >
> > And yes, it is confusing. I've been forced to change my mind twice.
>
> Again you have misunderstood. The original code was _not_ incorrect. I
> was asking: Given the code as stated, would the assertion ever fail?

I claim the right to call code that fails its own assertions incorrect. :-)

> The code _was_ correct for my purposes, namely, to illustrate a technical
> point about the behavior of memory barriers.

I would say that the code may fail the assertion purely based
on the formal definition of a memory barrier. And do so in a subtle
and non-obvious way.

Regards
Oliver

2006-09-08 22:25:27

by Alan Stern

Subject: Re: Uses for memory barriers

On Fri, 8 Sep 2006, Oliver Neukum wrote:

> > Again you have misunderstood. The original code was _not_ incorrect. I
> > was asking: Given the code as stated, would the assertion ever fail?
>
> I claim the right to call code that fails its own assertions incorrect. :-)

Touche!

> > The code _was_ correct for my purposes, namely, to illustrate a technical
> > point about the behavior of memory barriers.
>
> I would say that the code may fail the assertion purely based
> on the formal definition of a memory barrier. And do so in a subtle
> and non-obvious way.

But what _is_ the formal definition of a memory barrier? I've never seen
one that was complete and correct.

Alan Stern

2006-09-08 22:49:07

by Oliver Neukum

Subject: Re: Uses for memory barriers

On Saturday, 9 September 2006 00:25, Alan Stern wrote:
> On Fri, 8 Sep 2006, Oliver Neukum wrote:
>
> > > Again you have misunderstood. The original code was _not_ incorrect. I
> > > was asking: Given the code as stated, would the assertion ever fail?
> >
> > I claim the right to call code that fails its own assertions incorrect. :-)
>
> Touche!
>
> > > The code _was_ correct for my purposes, namely, to illustrate a technical
> > > point about the behavior of memory barriers.
> >
> > I would say that the code may fail the assertion purely based
> > on the formal definition of a memory barrier. And do so in a subtle
> > and non-obvious way.
>
> But what _is_ the formal definition of a memory barrier? I've never seen
> one that was complete and correct.

I'd say "mb();" is "rmb();wmb();"

and they work so that:

CPU 0

a = TRUE;
wmb();
b = TRUE;

CPU 1

if (b) {
        rmb();
        assert(a);
}

is correct. Possibly that is not a complete definition though.

Regards
Oliver

2006-09-09 00:43:22

by Paul E. McKenney

Subject: Re: Uses for memory barriers

On Fri, Sep 08, 2006 at 05:23:04PM -0400, Alan Stern wrote:
> On Fri, 8 Sep 2006, Paul E. McKenney wrote:
>
> > > Yes, okay, that's a general situation where mb() would be useful. But the
> > > actual application amounts to one of the two patterns below, as you will
> > > see...
> >
> > Hey, you asked!!! ;-)
>
> I knew that starting this discussion would open a can of worms, but what
> the heck... I was curious! :-)

;-)

> > If we are talking about some specific CPUs, you might have a point.
> > But Linux must tolerate least-common-denominator semantics...
> >
> > And those least-common-denominator semantics are "if a CPU sees an
> > assignment from another CPU that followed a memory barrier on that
> > other CPU, then the first CPU is guaranteed to see any stores from
> > the other CPU -preceding- that memory barrier."
>
> The "canonical" memory-barrier guarantee. But I'm interested in
> non-canonical situations...

OK... But beyond a certain point, that way lies madness...

> > Assume that initially, a==0 in a cache line owned by CPU 1. Assume that
> > b==0 and y==0 in separate cache lines owned by CPU 0.
>
> Ah, a meaty analysis with details of inner mechanisms. Good.
>
> > CPU 0 CPU 1
> > ----- -----
> > y = -1; [this goes out quickly.]
> > while (y==0) relax(); [picks up the value from CPU 1]
> > a = 1; [the cacheline is on CPU 1, so this one sits in
> > CPU 0's store queue. CPU 0 transmits an invalidation
> > request, which will take time to reach CPU 1.]
> > b = 1; [the cacheline is on CPU 0, so
> > this one sits in CPU 1's store
> > queue, and CPU 1 transmits an
> > invalidation request, which again
> > will take time to reach CPU 0.]
> > mb(); [CPU 0 waits for acknowledgement of reception of all previously
> > transmitted invalidation requests, and also processes any
> > invalidation requests that it has previously received (none!)
> > before processing any subsequent loads. Yes, CPU 1 already
> > sent the invalidation request for b, but there is no
> > guarantee that it has worked its way to CPU 0 yet!]
> > mb(); [Ditto.]
> >
> > At this point, CPU 0 has received the invalidation request for b
> > from CPU 1, and CPU 1 has received the invalidation request for
> > a from CPU 0, but there is no guarantee that either of these
> > invalidation requests have been processed.
>
> Do you draw a distinction between an invalidation request being
> "acknowledged" and being "processed"? CPU 0 won't finish its mb() until
> CPU 1 has acknowledged receiving the invalidation request for a's
> cacheline. Does it make sense for CPU 1 to acknowledge reception of the
> request without having yet processed the request?

Yes. At least for the definition for "make sense" that is used in
some CPU-design circles...

Here is one possible sequence of events for CPU 0's mb() operation:

1. Mark all currently pending invalidation requests from other CPUs
(but there are none).

2. Wait for acknowledgements for all outstanding invalidation requests
from CPU 0 to other CPUs (there is one, that corresponding to the
assignment to a).

3. At some point in this sequence, the invalidation request from
CPU 1 corresponding to the assignment to b arrives. However,
it was not present before the mb() started executing, so is
-not- marked.

4. Ensure that all marked pending invalidation requests from other
CPUs complete before any subsequent loads are allowed to
commence on CPU 0 (but there are no marked pending invalidation
requests).

Real CPUs are much more clever about performing these operations in a way
that reduces the probability of stalling, but the above sequence should
be good enough for this discussion. The assignments to a and b pass in
the night -- there are no memory barriers forcing them to be ordered.
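
In conceptual pseudocode -- the queues and helpers here are abstractions
of the hardware, not real kernel objects:

    void mb(void)   /* CPU 0, per steps 1-4 above */
    {
            mark_all(&my_invalidate_queue);  /* step 1: nothing to mark */
            wait_for_acks(&my_store_queue);  /* step 2: covers "a = 1" */
            /* step 3: CPU 1's invalidation of b may arrive about now,
             * but it is not marked */
            process_marked(&my_invalidate_queue);  /* step 4: a no-op here,
             * since nothing was marked, so later loads may proceed */
    }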

> Wouldn't that open a door for some nasty synchronization problems?

Ummm.... Yes!!! ;-)

That is one reason why people should use the standard primitives that
already contain the necessary memory barriers whenever possible.

> For
> example, suppose CPU 1 had previously modified some other variable c
> located in the same cacheline as a. Acknowledging the invalidation
> request means it is giving permission for CPU 0 to own the cacheline;
> wouldn't this cause the modification of c to be lost?

Not necessarily. Keep in mind that in this example, CPU 0 does not yet
have a copy of the cache line containing variables a and c. CPU 1 is
therefore free to update the cacheline with the values of a and c before
transmitting it to CPU 0.

Unless, of course, there was a memory barrier between the two assignments,
and that memory barrier was executed -after- the reception of the
invalidation request. In that case, CPU 1 would be required to update
the cache line with the first assignment (say to variable a), then ship
the cacheline to CPU 0. But the assignment to c still will not be lost,
since CPU 1 can retain the assignment in its store queue. CPU 1 would
send an invalidation request to CPU 0, and update the cacheline with
the assignment to c after getting the cacheline back from CPU 0.

This would be cache thrashing. Or cacheline bouncing.

> > x = a; [Could therefore still be using
> > the old cached version of a, so
> > that x==0.]
>
> Not if the invalidation request was processed as well as merely
> acknowledged.

True enough. But there is no -guarantee- that the invalidation request
will have been processed.

> > y = b; [Could also be still using the old cached version of b,
> > so that y==0.]
> > while (y < 0) relax(); [This will spin until
> > the cache coherence protocol delivers
> > the new value y==0. At this point,
> > CPU 1 is guaranteed to see the new
> > value of a due to CPU 0's mb(), but
> > it is too late... And CPU 1 would
> > need a memory barrier following the
> > while loop in any case -- otherwise,
> > CPU 1 would be within its rights
> > on some CPUs to execute the code
> > out of order.]
> > assert(x==1 || y==1); //???
> > [This assertion can therefore fail.]
> >
> > By the way, this sort of scenario is why maintainers have my full
> > sympathies when they automatically reject any patches containing explicit
> > memory barriers...
>
> I sympathize.
>
> > > Let's call the following pattern "read-mb-write".
>
> I wish I had presented read-mb-write first, before write-mb-read. It's
> more pertinent to the section on locking below. Can you please give an
> equally detailed analysis, like the one above, showing how the assertion
> in this pattern can fail?

See below... But I would strongly advise nacking any patch involving
this sort of line of reasoning, especially given that I could have
written a significant amount of "normal" code in the time that it
took me to analyze this, and given the error rate on the part of
myself and of others.

> > > > > The opposite approach would use reads followed by writes:
> > > > >
> > > > > CPU 0 CPU 1
> > > > > ----- -----
> > > > > while (x==0) relax(); x = -1;
> > > > > x = a; y = b;
> > > > > mb(); mb();
> > > > > b = 1; a = 1;
> > > > > while (x < 0) relax();
> > > > > assert(x==0 || y==0); //???
>
> Or does it turn out that read-mb-write does succeed, even though
> write-mb-read can sometimes fail?

OK... The initial value of a, b, x, and y are zero, right?

    CPU 0                   CPU 1
    -----                   -----
    while (x==0) relax();   x = -1;
    x = a;                  y = b;
    mb();                   mb();
    b = 1;                  a = 1;
                            while (x < 0) relax();
                            assert(x==0 || y==0);   //???

For y!=0, CPU 1's assignment y=b must follow CPU 0's assignment b=1.
Since b=1 follows CPU 0's memory barrier, and since y=b precedes CPU 1's
memory barrier, any code after CPU 1's memory barrier must see the effect
of assignments from CPU 0 preceding CPU 0's memory barrier, so that CPU
1's assignment a=1 comes after CPU 0's assignment x=a, and so therefore
x=a=0 at CPU 1. But CPU 1 also assigns x=-1. The question then becomes
"which assignment to x came last?". The "while" loop on CPU 0 cannot
exit before the assignment x=-1 on CPU 1 completes, but there is no memory
barrier between CPU 0's "while" loop and its subsequent assignment "x=a".

However, CPU 0 must see its own accesses as occurring in order, and all
CPUs have to see some single global ordering of a sequence of assignments
to a -single- given variable. So I believe CPU 0's assignment x=a must be
seen by all CPUs as following CPU 1's assignment x=-1, implying that the
final value of x==0, and that the assertion does not trip.

But my head hurts...

Now, for x!=0... We have established that CPU 0's assignment x=a must
follow CPU 1's assignment x=-1. Therefore, for x!=0, CPU 0's assignment
x=a must also follow CPU 1's assignment a=1. In this case, CPU 0 has
seen an assignment that follows CPU 1's memory barrier, and therefore CPU
0's subsequent assignment b=1 must follow all of CPU 1's references prior
to the memory barrier -- in particular, y=b must have already happened,
so that y==0. Again, the assertion does not trip.

So I was mistaken in my first email when I said that the assert() can
trip in this case -- it cannot.

But there had better be a -really- good reason for pulling this kind
of stunt... And even then, there had better be some -really- good
comments...

> > > > Consider the following (lame) definitions for spinlock primitives,
> > > > but in an alternate universe where atomic_xchg() did not imply a
> > > > memory barrier, and on a weak-memory CPU:
> > > >
> > > > typedef atomic_t spinlock_t;
> > > >
> > > > void spin_lock(spinlock_t *l)
> > > > {
> > > > for (;;) {
> > > > if (atomic_xchg(l, 1) == 0) {
> > > > smp_mb();
> > > > return;
> > > > }
> > > > while (atomic_read(l) != 0) barrier();
> > > > }
> > > >
> > > > }
> > > >
> > > > void spin_unlock(spinlock_t *l)
> > > > {
> > > > smp_mb();
> > > > atomic_set(l, 0);
> > > > }
> > > >
> > > > The spin_lock() primitive needs smp_mb() to ensure that all loads and
> > > > stores in the following critical section happen only -after- the lock
> > > > is acquired. Similarly for the spin_unlock() primitive.
> > >
> > > Really? Let's take a closer look. Let b be a pointer to a spinlock_t and
> > > let a get accessed only within the critical sections:
> > >
> > > CPU 0 CPU 1
> > > ----- -----
> > > spin_lock(b);
> > > ...
> > > x = a;
> > > spin_unlock(b); spin_lock(b);
> > > a = 1;
> > > ...
> > > assert(x==0); //???
> > >
> > > Expanding out the code gives us a version of the read-mb-write pattern.
> >
> > Eh? There is nothing guaranteeing that CPU 0 gets the lock before
> > CPU 1, so what is to assert?
>
> Okay, I left out some stuff. Suppose we have this instead:
>
> CPU 0 CPU 1
> ----- -----
> spin_lock(b);
> ...
> y = -1; while (y == 0) relax();
> x = a;
> spin_unlock(b); spin_lock(b);
> a = 1;
> ...
> assert(x==0); //???
>
> Now we _can_ guarantee that CPU 0 gets the lock before CPU 1. The
> discussion will not be significantly affected by this change.

-My- discussion is affected!

Anyway, the assignment a=1 must follow the assignment x=a, so the
assertion cannot trip. Without the guarantee, CPU 1's critical
section could have preceded that of CPU 0, so that x==1, tripping
the assertion.

> > > For the sake of argument, let's suppose that CPU 1's spin_lock succeeds
> > > immediately with no looping:
> > >
> > > CPU 0 CPU 1
> > > ----- -----
> > > ...
> Insert: y = -1; (goes out quickly)
> while (y == 0) relax(); (quickly terminates)
> > > x = a;
> > > // spin_unlock(b):
> > > mb();
> > > atomic_set(b,0); if (atomic_xchg(b,1) == 0) // Succeeds
> > > mb();
> > > a = 1;
> > > ...
> > > assert(x==0); //???
> > >
> > > As you can see, this fits the pattern exactly. CPU 0 does read a, mb,
> > > write b, and CPU 1 does read b, mb, write a.
> >
> > Again, you don't have anything that guarantees that CPU 0 will get the
> > lock first, so the assertion is bogus.
>
> Now we do have such a guarantee.

And preceding the y=-1 was a lock acquisition (presumably), so that
the assignment a=1 is again required to follow the assignment x=a,
so that x==0 and the assertion does not trip. This works on IA64
as well (with its semi-permeable memory barriers associated with
locking), but the potential scenarios are more complicated.

However, if the lock acquisition follows the assignment y=-1, then
CPU 1 could potentially acquire the lock before CPU 0 does, in which
case one could end up with x==1, tripping the assertion.

> > Also, the "if" shouldn't be
> > there -- since we are assuming CPU 1 got the lock immediately, it
> > would be unconditionally executing the mb(), right?
>
> I copied the "if" from your original code. If the "if" had failed then
> CPU 1 would have looped and retried the "if", until it did succeed. As
> the "// Succeeds" comment indicates, we can assume for the sake of
> argument that the "if" succeeds the first time through.

OK. But this stuff is grotesque enough without leaving assumptions
unstated, no matter how natural they might seem. ;-)

> > > We are supposing the read of b obtains the new value (no looping); can we
> > > then assert that the read of a obtained the old value? If we can -- and
> > > we had better if spinlocks are to work correctly -- then doesn't this mean
> > > that the read-mb-write pattern will always succeed?
> >
> > Since CPU 1 saw CPU 0's assignment to b (which follows CPU 0's mb()),
> > and since CPU 1 immediately executed a memory barrier, CPU 1's critical
> > section is guaranteed to follow that of CPU 0.
>
> Correct.
>
> > -Both- CPU 0's and CPU 1's
> > memory barriers are needed to make this work. This does not rely on
> > anything fancy, simply the standard pair-wise memory-barrier guarantee.
>
> Yes. But you haven't answered the question: Must the assertion succeed?
> And if it must, what is the low-level mechanism which insures it will?
> How does this mechanism compare with the mechanism for the read-mb-write
> pattern?

See above, but the answer applies only to the new version. In the
old version of the code, the order of the critical sections could
have reversed, possibly tripping the assertion.

> > David Howells's Documentation/memory-barriers.txt goes through this
> > sort of thing, IIRC.
>
> That's why I put him on the CC: list. It doesn't explain things in quite
> this much detail, though, and it doesn't consider these non-canonical
> cases (except for one brief section on the implementation of
> rw-semaphores).

I can't say I blame him -- I certainly would not want to be seen as in
any way encouraging non-canonical use of memory barriers, certainly not
if there is some reasonable alternative! ;-)

Thanx, Paul

2006-09-09 02:25:47

by Alan Stern

[permalink] [raw]
Subject: Re: Uses for memory barriers

On Sat, 9 Sep 2006, Oliver Neukum wrote:

> > But what _is_ the formal definition of a memory barrier? I've never seen
> > one that was complete and correct.
>
> I'd say "mb();" is "rmb();wmb();"
>
> and they work so that:
>
> CPU 0
>
> a = TRUE;
> wmb();
> b = TRUE;
>
> CPU 1
>
> if (b) {
> rmb();
> assert(a);
> }
>
> is correct. Possibly that is not a complete definition though.

It isn't. Paul has agreed that this assertion:

CPU 0 CPU 1
----- -----
while (x == 0) relax(); x = -1;
x = a; y = b;
mb(); mb();
b = 1; a = 1;
while (x < 0) relax();
assert(x==0 || y==0);

will not fail. I think this would not be true if either of the mb()
statements were replaced with {rmb(); wmb();}.

To put it another way, {rmb(); wmb();} guarantees that any preceding read
will complete before any following read and any preceding write will
complete before any following write. However it does not guarantee that
any preceding read will complete before any following write, whereas mb()
does guarantee that. (To whatever extent these statements make sense.)
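
To make the distinction concrete, here is a minimal sketch (not part
of the original message; all variables are ordinary shared memory,
initially 0):

	x = a;		/* read  before the barrier */
	b = 1;		/* write before the barrier */
	rmb(); wmb();	/* versus a single mb()     */
	y = c;		/* read  after the barrier  */
	d = 1;		/* write after the barrier  */

With {rmb(); wmb();} the read of a is ordered before the read of c,
and the store to b before the store to d, but the read of a may still
complete after the store to d. Only mb() also orders that
read-then-write pairing, which is exactly what the example above
relies on.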

Alan

2006-09-11 16:05:51

by Alan Stern

[permalink] [raw]
Subject: Re: Uses for memory barriers

On Fri, 8 Sep 2006, Paul E. McKenney wrote:

> Here is one possible sequence of events for CPU 0's mb() operation:
>
> 1. Mark all currently pending invalidation requests from other CPUs
> (but there are none).
>
> 2. Wait for acknowledgements for all outstanding invalidation requests
> from CPU 0 to other CPUs (there is one, that corresponding to the
> assignment to a).
>
> 3. At some point in this sequence, the invalidation request from
> CPU 1 corresponding to the assignment to b arrives. However,
> it was not present before the mb() started executing, so is
> -not- marked.
>
> 4. Ensure that all marked pending invalidation requests from other
> CPUs complete before any subsequent loads are allowed to
> commence on CPU 0 (but there are no marked pending invalidation
> requests).
>
> Real CPUs are much more clever about performing these operations in a way
> that reduces the probability of stalling, but the above sequence should
> be good enough for this discussion. The assignments to a and b pass in
> the night -- there are no memory barriers forcing them to be ordered.

Ah, I see. This explains a lot. I now have a much clearer mental
model of how these things work. It isn't complete, but it does have
some interesting consequences.

One insight has to do with why it is that read-mb-write succeeds
whereas write-mb-read can fail. It comes down to this essential
asymmetry:

Suppose one CPU writes to a memory location at about the same
time as another CPU using a different cache reads from it. If
the read completes before the write then the read is certain
to return the old value. But if the write completes before the
read, the read is _not_ certain to return the new value.

> See below... But I would strongly advise nacking any patch involving
> this sort of line of reasoning, especially given that I could have
> written a significant amount of "normal" code in the time that it
> took me to analyze this, and given the error rate on the part of
> myself and of others.

This discussion has not been aimed toward writing or accepting any
patches. It's simply an attempt to understand better how memory
barriers work.


Moving on to the read-mb-write pattern...

> OK... The initial value of a, b, x, and y are zero, right?
>
> CPU 0 CPU 1
> ----- -----
> while (x==0) relax(); x = -1;
> x = a; y = b;
> mb(); mb();
> b = 1; a = 1;
> while (x < 0) relax();
> assert(x==0 || y==0); //???
>
> For y!=0, CPU 1's assignment y=b must follow CPU 0's assignment b=1.
> Since b=1 follows CPU 0's memory barrier, and since y=b precedes CPU 1's
> memory barrier, any code after CPU 1's memory barrier must see the effect
> of assignments from CPU 0 preceding CPU 0's memory barrier, so that CPU
> 1's assignment a=1 comes after CPU 0's assignment x=a, and so therefore
> x=a=0 at CPU 1. But CPU 1 also assigns x=-1. The question then becomes
> "which assignment to x came last?". The "while" loop on CPU 0 cannot
> exit before the assignment x=-1 on CPU 1 completes, but there is no memory
> barrier between CPU 0's "while" loop and its subsequent assignment "x=a".
>
> However, CPU 0 must see its own accesses as occurring in order, and all
> CPUs have to see some single global ordering of a sequence of assignments
> to a -single- given variable. So I believe CPU 0's assignment x=a must be
> seen by all CPUs as following CPU 1's assignment x=-1, implying that the
> final value of x==0, and that the assertion does not trip.
>
> But my head hurts...

You're getting a little hung-up on those "while" loops. They aren't
an essential feature of the example; I put them in there just to
insure that CPU 0's write to x would be visible on CPU 1.

The real point of the example is that the read of a on CPU 0 and the
read of b on CPU 1 cannot both yield 1. I could have made the same
point in a number of different ways. This is perhaps the simplest:

CPU 0 CPU 1
----- -----
x = a; y = b;
mb(); mb();
b = 1; a = 1;
assert(x==0 || y==0);

I'm sure you can verify that this assertion will never fail.

It's worth noting that since y is accessible to both CPUs, it must be
a variable stored in memory. It follows that CPU 1's mb() could be
replaced with wmb() and the assertion would still always hold. On the
other hand, the compiler is free to put x in a processor register.
This means that CPU 0's mb() cannot safely be replaced with wmb().


Let's return to the question I asked at the start of this thread:
Under what circumstances is mb() truly needed? Well, we've just seen
one. If CPU 0's mb() were replaced with "rmb(); wmb();" the assertion
above might fail. The cache would be free to carry out the operations
in a way such that the read of a completed after the write to b.

This read-mb-write pattern turns out to be vital for implementing
synchronization primitives. Take your own example:

> Consider the following (lame) definitions for spinlock primitives,
> but in an alternate universe where atomic_xchg() did not imply a
> memory barrier, and on a weak-memory CPU:
>
> typedef atomic_t spinlock_t;
>
> void spin_lock(spinlock_t *l)
> {
> 	for (;;) {
> 		if (atomic_xchg(l, 1) == 0) {
> 			smp_mb();
> 			return;
> 		}
> 		while (atomic_read(l) != 0) barrier();
> 	}
> }
>
> void spin_unlock(spinlock_t *l)
> {
> 	smp_mb();
> 	atomic_set(l, 0);
> }
>
> The spin_lock() primitive needs smp_mb() to ensure that all loads and
> stores in the following critical section happen only -after- the lock
> is acquired. Similarly for the spin_unlock() primitive.

In fact that last paragraph isn't quite right. The spin_lock()
primitive would also work with smp_rmb() in place of smp_mb(). (I can
explain my reasoning later, if you're interested.) However
spin_unlock() _does_ need smp_mb(), and for the very same reason as
the read-mb-write pattern. If it were replaced with "smp_rmb();
smp_wmb();" then a register-read preceding a spin_unlock() could leak
past the unlock.
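
To spell out the failure mode (a hypothetical interleaving, using the
lame primitives quoted above with spin_unlock()'s smp_mb() replaced
by "smp_rmb(); smp_wmb();"):

	CPU 0                           CPU 1
	-----                           -----
	spin_lock(b);
	x = a;             /* read */
	// spin_unlock(b):
	smp_rmb(); smp_wmb();
	atomic_set(b, 0);               spin_lock(b);
	                                a = 1;

Neither smp_rmb() nor smp_wmb() orders the read "x = a" before the
store "atomic_set(b, 0)", so the read could complete after CPU 1 has
acquired the lock and stored to a -- the read-mb-write pattern gone
wrong.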


So it looks like this "non-canonical" read-mb-write pattern is
essential under certain circumstances, even if it doesn't turn out to
be particularly useful in more general-purpose code. An interesting
thing about it is that, unlike the "canonical" pattern, it readily
extends to more than two processors:

CPU 0 CPU 1 CPU 2
----- ----- -----
x = a; y = b; z = c;
mb(); mb(); mb();
b = 1; c = 1; a = 1;
assert(x==0 || y==0 || z==0);

The cyclic pattern of ordering requirements guarantees that at least
one of the reads must complete before the write of the corresponding
variable. In more detail: One of the three reads, let's say the read
of b, will complete before -- or at the same time as -- each of the
others. The mb() forces the write of b on CPU 0 to complete after the
read of a and hence after the read of b; thus the value of y must end
up being 0. Extensions to more than 3 processors are left as an
exercise for the reader. :-)
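
For concreteness, here is one way the N-processor extension might be
written (a sketch only; N, the v[] and r[] arrays, and the per-CPU
function are illustrative assumptions):

	int v[N];		/* all initially 0 */
	int r[N];

	void cpu(int i)		/* run once on CPU i, 0 <= i < N */
	{
		r[i] = v[i];
		mb();
		v[(i + 1) % N] = 1;
	}

	/* After all N CPUs have finished: */
	int i, saw_zero = 0;
	for (i = 0; i < N; i++)
		if (r[i] == 0)
			saw_zero = 1;
	assert(saw_zero);

The same cyclic argument applies: some read r[i] = v[i] completes no
later than all the others, and the mb() on CPU (i-1) mod N then
forces that CPU's write of v[i] to complete after its own read, and
hence after the read of v[i], so r[i] must be 0.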


Here's another question: To what extent do memory barriers actually
force completions to occur in order? Let me explain...

We can think of a CPU and its cache as two semi-autonomous units.
From the cache's point of view, a load or a store initializes when the
CPU issues the request to the cache, and it completes when the data is
extracted from the cache for transmission to the CPU (for a load) or
when the data is placed in the cache (for a store).

Memory barriers prevent CPUs from reordering instructions, thereby
forcing loads and stores to initialize in order. But this reflects
how the barriers affect the CPU, whereas I'm more concerned about how
they affect the cache. So to what extent do barriers force
memory accesses to _complete_ in order? Let's go through the
possibilities.

STORE A; wmb; STORE B;

This does indeed force the completion of STORE B to wait until STORE A
has completed. Nothing else would have the desired effect.

READ A; mb; STORE B;

This also forces the completion of STORE B to wait until READ A has
completed. Without such a guarantee the read-mb-write pattern wouldn't
work.

STORE A; mb; READ B;

This one is strange. For all I know the mb might very well force
the completions to occur in order... but it doesn't matter! Even when
the completions do occur in order, it can appear to other CPUs as if
they did not! This is exactly what happens when the write-mb-read
pattern fails.

READ A; rmb; READ B;

Another peculiar case. One almost always hears read barriers described
as forcing the second read to complete after the first. But in reality
this is not sufficient.

The additional requirement of rmb is: Unless there is a pending STORE
to B, READ B is not allowed to complete until all the invalidation
requests for B pending (i.e., acknowledged but not yet processed) at
the time of the rmb instruction have completed (i.e., been processed).
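
Restated as cache-side pseudocode (a sketch in the style of the
invalidate-queue model quoted earlier; the structure is mine, not
from the original message):

	rmb:
		mark every invalidation request already acknowledged
		    but not yet processed;
	subsequent READ B:
		if (a STORE to B is pending in the local store buffer)
			satisfy the READ from the store buffer;
		else
			wait until every marked request for B has been
			    processed, then read B from the cache;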


There are consequences of these ideas with potential implications for
RCU. Consider, as a simplified example, a thread T0 which constantly
repeats the following loop:

for (;;) {
++cnt;
mb();
x = p;
// Do something time-consuming with *x
}

Now suppose another thread T1 wants to update the value of p and wait
until the old value is no longer being used (a typical RCU-ish sort of
thing). The natural approach is this:

old_p = p;
p = new_p;
mb();
c = cnt;
while (c == cnt)
barrier();
// Deallocate old_p

This code can fail, for exactly the same reason as the write-mb-read
pattern. If the two mb() instructions happen to execute at the same
time, it's possible for T0 to set x to the old value of p while T1
sets c to the old value of cnt. (That's how the write-mb-read pattern
fails; both reads end up seeing the pre-write values.) Then T1's
while() loop will quickly terminate, so T1 ends up destroying whatever
old_p points to while T0 is still using it.

RCU doesn't work quite like this, but it's still worth keeping in
mind.
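
For comparison, the same update written with the real RCU primitives
(a sketch; this is the shape the later replies in this thread point
toward):

	/* T0 */
	for (;;) {
		rcu_read_lock();
		x = rcu_dereference(p);
		/* Do something time-consuming with *x */
		rcu_read_unlock();
	}

	/* T1 */
	old_p = p;
	rcu_assign_pointer(p, new_p);
	synchronize_rcu();	/* waits for pre-existing readers */
	kfree(old_p);

Here synchronize_rcu() takes the place of the hand-rolled mb()/cnt
dance and sidesteps the write-mb-read hazard entirely.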


There are yet more implications when one starts considering
communication with peripheral devices via DMA-coherent or
DMA-consistent memory buffers. However I don't know enough about
these topics to say anything intelligent. And then there's
memory-mapped I/O, but that's another story...

Alan Stern

2006-09-11 16:20:15

by Paul E. McKenney

[permalink] [raw]
Subject: Re: Uses for memory barriers

On Fri, Sep 08, 2006 at 10:25:45PM -0400, Alan Stern wrote:
> On Sat, 9 Sep 2006, Oliver Neukum wrote:
>
> > > But what _is_ the formal definition of a memory barrier? I've never seen
> > > one that was complete and correct.
> >
> > I'd say "mb();" is "rmb();wmb();"
> >
> > and they work so that:
> >
> > CPU 0
> >
> > a = TRUE;
> > wmb();
> > b = TRUE;
> >
> > CPU 1
> >
> > if (b) {
> > rmb();
> > assert(a);
> > }
> >
> > is correct. Possibly that is not a complete definition though.
>
> It isn't. Paul has agreed that this assertion:
>
> CPU 0 CPU 1
> ----- -----
> while (x == 0) relax(); x = -1;
> x = a; y = b;
> mb(); mb();
> b = 1; a = 1;
> while (x < 0) relax();
> assert(x==0 || y==0);
>
> will not fail. I think this would not be true if either of the mb()
> statements were replaced with {rmb(); wmb();}.
>
> To put it another way, {rmb(); wmb();} guarantees that any preceding read
> will complete before any following read and any preceding write will
> complete before any following write. However it does not guarantee that
> any preceding read will complete before any following write, whereas mb()
> does guarantee that. (To whatever extent these statements make sense.)

This is a summary of the Linux memory-barrier semantics as I understand
them:

1. A given CPU will always perceive its own memory operations
as occurring in program order.

2. All stores to a given single memory location will be perceived
as having occurred in the same order by all CPUs. This is
"coherence". (And this is the property that I was forgetting
about when I first looked at your second example.)

3. A given type of memory barrier, when executed on a given CPU,
causes that CPU's prior accesses of the corresponding type to
be perceived by other CPUs as having occurred before the given
CPU's subsequent accesses of the corresponding type.

The types of memory barriers are rmb(), which segregates only
reads, wmb(), which segregates only writes, and mb(), which
segregates both. There is also mmiowb(), which is like wmb(),
but gives additional ordering guarantees that extend to I/O
busses, such as PCI bridges.

Alan is quite correct when he says that rmb();wmb(); is not necessarily
equivalent to mb(). For example:

Sequence 1 Sequence 2

load A load A
store B store B
mb() rmb();wmb()
load C load C
store D store D

In sequence 1, other CPUs will see the load from A and the store to B both
preceding the load from C and the store to D. In sequence 2, other CPUs
might well see the store to D preceding the load from A, or, conversely,
the store to B following the load from C. This second scenario might seem
unlikely, but there is real hardware that has similar properties (e.g.,
ppc's separately ordering accesses to cached and to non-cached memory).

In all these cases, of course, these other CPUs would themselves be needing
to use memory barriers to order their own accesses.

You guys asked!!!

Thanx, Paul

2006-09-11 16:50:09

by Alan Stern

[permalink] [raw]
Subject: Re: Uses for memory barriers

On Mon, 11 Sep 2006, Paul E. McKenney wrote:

> This is a summary of the Linux memory-barrier semantics as I understand
> them:
>
> 1. A given CPU will always perceive its own memory operations
> as occurring in program order.
>
> 2. All stores to a given single memory location will be perceived
> as having occurred in the same order by all CPUs. This is
> "coherence". (And this is the property that I was forgetting
> about when I first looked at your second example.)
...

This can't be right. Together 1 and 2 would obviate the need for wmb().
The CPU doing "STORE A; STORE B" will always see the operations occurring
in program order by 1, and hence every other CPU would always see them
occurring in the same order by 2 -- even without wmb().

Either 2 is too strong, or else what you mean by "perceived" isn't
sufficiently clear.

Alan Stern

2006-09-11 17:23:54

by Segher Boessenkool

[permalink] [raw]
Subject: Re: Uses for memory barriers

> This can't be right. Together 1 and 2 would obviate the need for
> wmb().
> The CPU doing "STORE A; STORE B" will always see the operations
> occurring
> in program order by 1, and hence every other CPU would always see them
> occurring in the same order by 2 -- even without wmb().
>
> Either 2 is too strong, or else what you mean by "perceived" isn't
> sufficiently clear.

2. is only for multiple stores to a _single_ memory location -- you
use wmb() to order stores to _separate_ memory locations.


Segher

2006-09-11 17:21:45

by Segher Boessenkool

[permalink] [raw]
Subject: Re: Uses for memory barriers

> 2. All stores to a given single memory location will be perceived
> as having occurred in the same order by all CPUs.

All CPUs that _do_ see two stores to the same memory location happening,
will see them occurring in the same order -- not all CPUs seeing a
later store will necessarily see the earlier stores.


Segher

2006-09-11 19:02:54

by Paul E. McKenney

[permalink] [raw]
Subject: Re: Uses for memory barriers

On Mon, Sep 11, 2006 at 12:50:07PM -0400, Alan Stern wrote:
> On Mon, 11 Sep 2006, Paul E. McKenney wrote:
>
> > This is a summary of the Linux memory-barrier semantics as I understand
> > them:
> >
> > 1. A given CPU will always perceive its own memory operations
> > as occurring in program order.
> >
> > 2. All stores to a given single memory location will be perceived
> > as having occurred in the same order by all CPUs. This is
> > "coherence". (And this is the property that I was forgetting
> > about when I first looked at your second example.)
> ...
>
> This can't be right. Together 1 and 2 would obviate the need for wmb().
> The CPU doing "STORE A; STORE B" will always see the operations occurring
> in program order by 1, and hence every other CPU would always see them
> occurring in the same order by 2 -- even without wmb().

Not so. A and B are different memory locations, hence #2 does not
apply to the "STORE A; STORE B" sequence.

> Either 2 is too strong, or else what you mean by "perceived" isn't
> sufficiently clear.

The key phrase is "to a given -single- memory location". ;-)

A and B are presumably -different- memory locations. However, if A and
B are aliases for the same memory location, then the wmb() would in fact
be unnecessary. But, again, I am assuming that they are different, so
that #2 does not apply.

Thanx, Paul

2006-09-11 19:03:45

by Paul E. McKenney

[permalink] [raw]
Subject: Re: Uses for memory barriers

On Mon, Sep 11, 2006 at 07:23:49PM +0200, Segher Boessenkool wrote:
> >This can't be right. Together 1 and 2 would obviate the need for
> >wmb().
> >The CPU doing "STORE A; STORE B" will always see the operations
> >occurring
> >in program order by 1, and hence every other CPU would always see them
> >occurring in the same order by 2 -- even without wmb().
> >
> >Either 2 is too strong, or else what you mean by "perceived" isn't
> >sufficiently clear.
>
> 2. is only for multiple stores to a _single_ memory location -- you
> use wmb() to order stores to _separate_ memory locations.

Precisely!!!

Thanx, Paul

2006-09-11 19:48:25

by Oliver Neukum

[permalink] [raw]
Subject: Re: Uses for memory barriers

On Monday, 11 September 2006 18:21, Paul E. McKenney wrote:
> 1. A given CPU will always perceive its own memory operations
> as occurring in program order.

Is this true for physical memory if virtually indexed caches are
involved?

Regards
Oliver

2006-09-11 20:28:23

by Paul E. McKenney

[permalink] [raw]
Subject: Re: Uses for memory barriers

On Mon, Sep 11, 2006 at 09:48:42PM +0200, Oliver Neukum wrote:
> On Monday, 11 September 2006 18:21, Paul E. McKenney wrote:
> > 1. A given CPU will always perceive its own memory operations
> > as occurring in program order.
>
> Is this true for physical memory if virtually indexed caches are
> involved?

As I understand it, in systems with virtually indexed caches, the OS must
take care to ensure that a given cacheline appears only once in the cache,
even if it is mapped to multiple virtual addresses. If an OS failed to
do this, then, as far as I can see, all bets are off. Curt Schimmel's
book "UNIX(R) Systems for Modern Architectures: Symmetric Multiprocessing
and Caching for Kernel Programmers" is an excellent guide to the issues
posed by virtually indexed and virtually tagged caches.

In principle, one could construct a virtually indexed/tagged CPU cache
that automatically ejected any line with a conflicting physical address
(given that lookups are presumably much more frequent than loading new
cache lines), but I have no idea if any real hardware takes this approach.
I have had the good fortune to always work with physically tagged/indexed
caches. ;-)

Thanx, Paul

2006-09-12 08:58:07

by David Howells

[permalink] [raw]
Subject: Re: Uses for memory barriers


I'd state it as:

mb() _implies_ both rmb() and wmb(), but is more complete than both
since it _also_ partially orders reads and writes with respect to each
other.

David

2006-09-12 09:01:52

by David Howells

[permalink] [raw]
Subject: Re: Uses for memory barriers

Paul E. McKenney <[email protected]> wrote:

> 2. All stores to a given single memory location will be perceived
> as having occurred in the same order by all CPUs.

Does that take into account a CPU combining or discarding coincident memory
operations?

For instance, a CPU asked to issue two writes to the same location may discard
the first if it hasn't done it yet.

David

2006-09-12 10:21:43

by Oliver Neukum

[permalink] [raw]
Subject: Re: Uses for memory barriers

On Tuesday, 12 September 2006 11:01, David Howells wrote:
> Paul E. McKenney <[email protected]> wrote:
>
> > 2. All stores to a given single memory location will be perceived
> > as having occurred in the same order by all CPUs.
>
> Does that take into account a CPU combining or discarding coincident memory
> operations?
>
> For instance, a CPU asked to issue two writes to the same location may discard
> the first if it hasn't done it yet.

Does it make sense? If you do:
mov #x, $a
wmb
mov #y, $b
wmb
mov #z, $a

The CPU must not discard any write. If you do

mov #x, $a
mov #y, $b
wmb
mov #z, $a

The first store to $a is superfluous if you have only inter-CPU
issues in mind.

Regards
Oliver

2006-09-12 14:41:46

by Paul E. McKenney

[permalink] [raw]
Subject: Re: Uses for memory barriers

On Tue, Sep 12, 2006 at 10:01:43AM +0100, David Howells wrote:
> Paul E. McKenney <[email protected]> wrote:
>
> > 2. All stores to a given single memory location will be perceived
> > as having occurred in the same order by all CPUs.
>
> Does that take into account a CPU combining or discarding coincident memory
> operations?

I believe so.

> For instance, a CPU asked to issue two writes to the same location may discard
> the first if it hasn't done it yet.

If I understand your example correctly, in that case, all CPUs would agree
that the given CPU's stores happened consecutively. Yes, they might not
see the intermediate value, but their view of the sequence of values
would be consistent with the given CPU's pair of stores having happened
at a specific place in the sequence of values.

This is not peculiar to this situation -- consider the following:

CPU 0         CPU 1         CPU 2         CPU 3

A=1
              Q1=A                        Q3=A
                            A=2
                            A=3
              Q2=A                        Q4=A

None of the CPUs saw CPU 2's first assignment A=2, but all of their
reads are consistent with the 1,2,3 sequence of values. Your example
(if I understand it correctly) is simply a special case where the
pair of assignments happened on a single CPU.

Thanx, Paul

2006-09-12 14:54:51

by Paul E. McKenney

[permalink] [raw]
Subject: Re: Uses for memory barriers

On Tue, Sep 12, 2006 at 12:22:00PM +0200, Oliver Neukum wrote:
> On Tuesday, 12 September 2006 11:01, David Howells wrote:
> > Paul E. McKenney <[email protected]> wrote:
> >
> > > 2. All stores to a given single memory location will be perceived
> > > as having occurred in the same order by all CPUs.
> >
> > Does that take into account a CPU combining or discarding coincident memory
> > operations?
> >
> > For instance, a CPU asked to issue two writes to the same location may discard
> > the first if it hasn't done it yet.
>
> Does it make sense? If you do:
> mov #x, $a
> wmb
> mov #y, $b
> wmb
> mov #z, $a
>
> The CPU must not discard any write. If you do
>
> mov #x, $a
> mov #y, $b
> wmb
> mov #z, $a
>
> The first store to $a is superfluous if you have only inter-CPU
> issues in mind.

In both cases, the CPU might "discard" the write, if there are no intervening
reads or writes to the same location. The only difference between your
two examples is the ordering of the first store to $a and the store to $b.
In your first example, other CPUs must see the first store to $a as happening
first, while in your second example, other CPUs might see the store to $b
as happening first.

Thanx, Paul

2006-09-12 15:07:14

by Oliver Neukum

[permalink] [raw]
Subject: Re: Uses for memory barriers

On Tuesday, 12 September 2006 16:55, Paul E. McKenney wrote:
> On Tue, Sep 12, 2006 at 12:22:00PM +0200, Oliver Neukum wrote:
> > On Tuesday, 12 September 2006 11:01, David Howells wrote:
> > > Paul E. McKenney <[email protected]> wrote:
> > >
> > > > 2. All stores to a given single memory location will be perceived
> > > > as having occurred in the same order by all CPUs.
> > >
> > > Does that take into account a CPU combining or discarding coincident memory
> > > operations?
> > >
> > > For instance, a CPU asked to issue two writes to the same location may discard
> > > the first if it hasn't done it yet.
> >
> > Does it make sense? If you do:
> > mov #x, $a
> > wmb
> > mov #y, $b
> > wmb
> > mov #z, $a
> >
> > The CPU must not discard any write. If you do
> >
> > mov #x, $a
> > mov #y, $b
> > wmb
> > mov #z, $a
> >
> > The first store to $a is superfluous if you have only inter-CPU
> > issues in mind.
>
> In both cases, the CPU might "discard" the write, if there are no intervening
> reads or writes to the same location. The only difference between your

How can it know that?

> two examples is the ordering of the first store to $a and the store to $b.
> In your first example, other CPUs must see the first store to $a as happening
> first, while in your second example, other CPUs might see the store to $b
> as happening first.

There's no way in the second case a CPU might tell whether the first
write ever happened.

Regards
Oliver

2006-09-12 16:12:10

by Paul E. McKenney

[permalink] [raw]
Subject: Re: Uses for memory barriers

On Tue, Sep 12, 2006 at 05:07:31PM +0200, Oliver Neukum wrote:
> On Tuesday, 12 September 2006 16:55, Paul E. McKenney wrote:
> > On Tue, Sep 12, 2006 at 12:22:00PM +0200, Oliver Neukum wrote:
> > > On Tuesday, 12 September 2006 11:01, David Howells wrote:
> > > > Paul E. McKenney <[email protected]> wrote:
> > > >
> > > > > 2. All stores to a given single memory location will be perceived
> > > > > as having occurred in the same order by all CPUs.
> > > >
> > > > Does that take into account a CPU combining or discarding coincident memory
> > > > operations?
> > > >
> > > > For instance, a CPU asked to issue two writes to the same location may discard
> > > > the first if it hasn't done it yet.
> > >
> > > Does it make sense? If you do:
> > > mov #x, $a
> > > wmb
> > > mov #y, $b
> > > wmb
> > > mov #z, $a
> > >
> > > The CPU must not discard any write. If you do
> > >
> > > mov #x, $a
> > > mov #y, $b
> > > wmb
> > > mov #z, $a
> > >
> > > The first store to $a is superfluous if you have only inter-CPU
> > > issues in mind.
> >
> > In both cases, the CPU might "discard" the write, if there are no intervening
> > reads or writes to the same location. The only difference between your
>
> How can it know that?

If the CPU starts off owning both cache lines, and if there are no intervening
invalidation requests from other CPUs, then the writing CPU knows that there
could not possibly have been any intervening writes.

> > two examples is the ordering of the first store to $a and the store to $b.
> > In your first example, other CPUs must see the first store to $a as happening
> > first, while in your second example, other CPUs might see the store to $b
> > as happening first.
>
> There's no way in the second case a CPU might tell whether the first
> write ever happened.

With one important exception -- the CPU that did the write knows.
There is no requirement that each and every CPU see each and every write.
The only requirement is that the sequence of values that each CPU sees
for a single given variable is consistent with that of each other CPU.

Thanx, Paul

2006-09-12 17:50:56

by Segher Boessenkool

[permalink] [raw]
Subject: Re: Uses for memory barriers

>> In both cases, the CPU might "discard" the write, if there are no
>> intervening
>> reads or writes to the same location. The only difference between
>> your
>
> How can it know that?

Because it holds the cache line in the "O" (owned) state, for example.

And it doesn't matter how a CPU would do this; the only thing that
matters is that you do not assume anything that is not guaranteed
by the model.


Segher

2006-09-12 18:08:24

by Alan Stern

[permalink] [raw]
Subject: Re: Uses for memory barriers

On Mon, 11 Sep 2006, Paul E. McKenney wrote:

> > One insight has to do with why it is that read-mb-write succeeds
> > whereas write-mb-read can fail. It comes down to this essential
> > asymmetry:
> >
> > Suppose one CPU writes to a memory location at about the same
> > time as another CPU using a different cache reads from it. If
> > the read completes before the write then the read is certain
> > to return the old value. But if the write completes before the
> > read, the read is _not_ certain to return the new value.
>
> Yes. Of course, keeping in mind that "before" can be a partial ordering.
> In any real situation, you would need to say how you determined that the
> read did in fact happen "before" the write. Some systems have fine-grained
> synchronized counters that make this easy, but most do not. In the latter
> case, the memory operations themselves are about the only evidence you
> have to figure out what happened "before".

I think it makes sense to say this even in the absence of precise
definitions. To put it another way, a CPU or cache can't read the
updated value from another CPU (or cache) before the second CPU (or cache)
decides to make the updated value publicly available ("committed the
value"). It _can_ read the updated value after the value has been
committed, but it isn't constrained to do so -- it's allowed to keep using
the old value for a while.

Incidentally, I object very strongly to the way you hardware folk keep
talking about events being only partially ordered. This isn't the case at
all. Given a precise physical definition of what an event is (a
particular transistor changing state for example, or perhaps the entire
time interval starting when one transistor changes state and ending when
another does), each event takes place at a moment of time or during an
interval of time.

Now time really _is_ totally ordered (so long as you confine yourself to a
single reference frame!) and hence in any individual run of a computer
program so are the events -- provided you take proper care when discussing
events that encompass a time interval.

The difficulty is that program runs have a stochastic character; it's
difficult or impossible to predict exactly what the hardware will do. So
the same program events could be ordered one way during a run and the
opposite way during another run. Each run is still totally ordered.

If you want, you can derive a partial ordering by considering only those
pairs of events that will _always_ occur in the same order in _every_ run.
But thinking in terms of this sort of partial ordering is clearly
inadequate. It doesn't even let you derive the properties of the
"canonical" pattern:

CPU 0 CPU 1
----- -----
a = 1; y = b;
wmb(); rmb();
b = 1; x = a;
assert(!(y==1 && x==0));

In terms of partial orderings, we know that all the events on CPU 0 are
ordered by the wmb() and the events on CPU 1 are ordered by the rmb().
But there is no ordering relation between the events on the two CPUs. "y
= b" could occur either before or after "b = 1", hence these two events
are incomparable in the partial ordering. So you can't conclude anything
about the assertion, if you confine your thinking to the partial ordering.


> > CPU 0 CPU 1
> > ----- -----
> > x = a; y = b;
> > mb(); mb();
> > b = 1; a = 1;
> > assert(x==0 || y==0);
> >
> > I'm sure you can verify that this assertion will never fail.
>
> Two cases, assuming that the initial value of all variables is zero:

Actually it's just one case. Once you have proved that x!=0 implies y==0,
by elementary logic you have also proved that y!=0 implies x==0 (one
statement is the contrapositive of the other).

> 1. If x!=0, then CPU 0 must have seen CPU 1's assignment a=1
> before CPU 0's memory barrier, so all of CPU 0's code after
> the memory barrier must see CPU 1's assignment to y. In more
> detail:
>
> CPU 0                 CPU 1
> -----                 -----
>                       y = b;
>                       mb();    The load of b must precede the
>                                store to a, and the invalidate
>                                of y must be queued at CPU 1.
>                       a = 1;
> x = a;      This saw CPU 1's a=1, so CPU 1's load of b must already
>             have happened.
> mb();       The invalidate of y must have been received, so it will
>             be processed prior to any subsequent load.
> b = 1;      This cannot affect CPU 1's assignment to y.
> assert(x==0 || y==0);
>             Since y was invalidated, we must get the cacheline. Since
>             we saw the assignment to a, we must also see the assignment
>             to y, which must be zero, since the assignment to y preceded
>             the assignment to b.
>
> So the assertion succeeds with y==0 in this case.
>
> 2. If y!=0, then CPU 1 must have seen CPU 0's assignment b=1.
> This is symmetric with case 1, so x==0 in this case.
>
> So, yes, the assertion never trips. And, yes, the intuitive approach
> would get the answer with much less hassle, but, given your prior examples,
> I figured that being pedantic was the correct approach.

One could derive the result a little more formally using the causality
principle I mentioned above (LOAD before STORE cannot return the result of
the STORE) together with another causality principle: A CPU cannot make
the result of a STORE available to other CPUs (or to itself) before it
knows the value to be stored. A cache might send out invalidate messages
before knowing the value to be stored, but it can't send out read
responses.

> And, yes, removing the loops did not change the result.

As intended.

> > It's worth noting that since y is accessible to both CPUs, it must be
> > a variable stored in memory. It follows that CPU 1's mb() could be
> > replaced with wmb() and the assertion would still always hold. On the
> > other hand, the compiler is free to put x in a processor register.
> > This means that CPU 0's mb() cannot safely be replaced with wmb().
>
> You might well be correct, but I must ask you to lay out the operations.
> If CPU 1's mb() becomes a wmb(), the CPU is within its rights to violate
> causality in the assignment y=b. There are some very strange things
> that can happen in NUMA machines with nested shared caches.

What do you mean by "violate causality"? Storing a value in y (and making
that value available to the other CPU) before the read has obtained b's
value? I don't see how that could happen on any architecture, unless your
machine uses thiotimoline. :-)

> > Let's return to the question I asked at the start of this thread:
> > Under what circumstances is mb() truly needed? Well, we've just seen
> > one. If CPU 0's mb() were replaced with "rmb(); wmb();" the assertion
> > above might fail. The cache would be free to carry out the operations
> > in a way such that the read of a completed after the write to b.
>
> And this would be one of the causality violations that render an intuitive
> approach to this so hazardous. Intuitively, the write of x must follow
> the read of a, and the write memory barrier would force the write of b
> to follow the write of x. The question in this case is whether the
> intuitive application of transitivity really applies in this case.

I wouldn't put it that way. Knowing that CPUs are free to reorder
operations in the absence of barriers to prevent such things, the
violation you mention wouldn't seem unintuitive to me. I tend to have
more trouble remembering that reads don't necessarily have to return the
most recent values available.

> > This read-mb-write pattern turns out to be vital for implementing
> > synchronization primitives. Take your own example:
> >
> > > Consider the following (lame) definitions for spinlock primitives,
> > > but in an alternate universe where atomic_xchg() did not imply a
> > > memory barrier, and on a weak-memory CPU:
> > >
> > > typedef atomic_t spinlock_t;
> > >
> > > void spin_lock(spinlock_t *l)
> > > {
> > > 	for (;;) {
> > > 		if (atomic_xchg(l, 1) == 0) {
> > > 			smp_mb();
> > > 			return;
> > > 		}
> > > 		while (atomic_read(l) != 0) barrier();
> > > 	}
> > > }
> > >
> > > void spin_unlock(spinlock_t *l)
> > > {
> > > 	smp_mb();
> > > 	atomic_set(l, 0);
> > > }
> > >
> > > The spin_lock() primitive needs smp_mb() to ensure that all loads and
> > > stores in the following critical section happen only -after- the lock
> > > is acquired. Similarly for the spin_unlock() primitive.
> >
> > In fact that last paragraph isn't quite right. The spin_lock()
> > primitive would also work with smp_rmb() in place of smp_mb(). (I can
> > explain my reasoning later, if you're interested.)
>
> With smp_rmb(), why couldn't the following, given globals b and c:
>
> spin_lock(&mylock);
> b = 1;
> c = 1;
> spin_unlock(&mylock);
>
> be executed by the CPU as follows?
>
> c = 1;
> b = 1;
> spin_lock(&mylock);
> spin_unlock(&mylock);

Because of the conditional in spin_lock().

> This order of execution seems to me to be highly undesirable. ;-)
>
> So, what am I missing here???

A CPU cannot move a write back past a conditional. Otherwise it runs
the risk of committing the write when (according to the program flow) the
write should never have taken place. No CPU does speculative writes.
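
In code terms (a sketch; the inlining is mine):

	/* spin_lock(&mylock) inlined, with smp_rmb() substituted: */
	if (atomic_xchg(&mylock, 1) == 0)	/* the conditional */
		smp_rmb();
	else
		/* ... spin and retry ... */ ;
	b = 1;		/* first store of the critical section */

The store "b = 1" is control-dependent on the value returned by
atomic_xchg(); committing it before the xchg completes would amount
to a speculative write. So the stores cannot leak ahead of the lock
acquisition, and smp_rmb() suffices to hold back the loads.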


> > Here's another question: To what extent do memory barriers actually
> > force completions to occur in order? Let me explain...
>
> In general, they do not at all. They instead simply impose ordering
> constraints.

What's the difference between imposing an ordering constraint and forcing
two events to occur in a particular order? To me the phrases appear
synonymous.

> > We can think of a CPU and its cache as two semi-autonomous units.
> > From the cache's point of view, a load or a store initializes when the
> > CPU issues the request to the cache, and it completes when the data is
> > extracted from the cache for transmission to the CPU (for a load) or
> > when the data is placed in the cache (for a store).
> >
> > Memory barriers prevent CPUs from reordering instructions, thereby
> > forcing loads and stores to initialize in order. But this reflects
> > how the barriers affect the CPU, whereas I'm more concerned about how
> > they affect the cache. So to what extent do barriers force
> > memory accesses to _complete_ in order? Let's go through the
> > possibilities.
> >
> > STORE A; wmb; STORE B;
> >
> > This does indeed force the completion of STORE B to wait until STORE A
> > has completed. Nothing else would have the desired effect.
>
> CPU designers would not agree. The completion of STORE B must -appear-
> -to- -software- to precede that of STORE A. For example, you could get
> the effect by allowing the stores to complete out of order, but then
> constraining the ordering of concurrent loads of A and B, for example,
> delaying concurrent loads of A until the store to B completed.
>
> Typical CPUs really do use both approaches!

Then it would be better to say that the wmb prevents the STORE A from
becoming visible on any particular CPU after the STORE B has become
visible on that CPU. (Note that this is weaker than saying the STORE A
must not become visible to any CPU after the STORE B has become visible to
any CPU.)

> > READ A; mb; STORE B;
> >
> > This also forces the completion of STORE B to wait until READ A has
> > completed. Without such a guarantee the read-mb-write pattern wouldn't
> > work.
>
> Again, this forces the STORE B to -appear- -to- -software- to have
> completed after the READ A. The hardware is free to allow the operations
> to complete in either order, as long as it constrains the order of
> concurrent accesses to A and B so as to preserve the illusion of ordering.

Again, the mb forces the READ A to commit to a value before the STORE B
becomes visible to any CPU.

> > STORE A; mb; READ B;
> >
> > This one is strange. For all I know the mb might very well force
> > the completions to occur in order... but it doesn't matter! Even when
> > the completions do occur in order, it can appear to other CPUs as if
> > they did not! This is exactly what happens when the write-mb-read
> > pattern fails.
>
> ???
>
> The corresponding accesses would be STORE B; mb; LOAD A. If the LOAD A
> got the old value of A, but the STORE B did -not- affect the other CPU's
> READ B, we have a violation. So I don't see anything strange about this
> one. What am I missing?

You're missing something you described earlier: The two STORES can cross
in the night. If neither cache has acknowledged the other's invalidate
message when the mb instructions are executed, then the mb's won't have
any effect. The invalidate acknowledgments can then be sent and the
WRITEs allowed to update their respective caches. If the processing of
the invalidates is delayed, the READs could be allowed to return the
caches' old data.
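
A possible timeline (a sketch of the scenario just described, in
terms of the invalidate-queue model from earlier in the thread):

	CPU 0                           CPU 1
	-----                           -----
	a = 1; send inval of a          b = 1; send inval of b
	mb(): nothing to mark;          mb(): nothing to mark;
	      wait for ack of a               wait for ack of b
	receive inval of b              receive inval of a
	    (too late to be marked)         (too late to be marked)
	ack and queue inval of b        ack and queue inval of a
	mb() completes                  mb() completes
	y = b;  /* stale b == 0 */      x = a;  /* stale a == 0 */

Both reads return the pre-write values even though each store was
issued before the other CPU's mb() completed.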

> > READ A; rmb; READ B;
> >
> > Another peculiar case. One almost always hears read barriers described
> > as forcing the second read to complete after the first. But in reality
> > this is not sufficient.
>
> Agreed, not if you want to think closer to the hardware. If you are
> willing to work at a higher level of abstraction, then the description
> is reasonable.
>
> > The additional requirement of rmb is: Unless there is a pending STORE
> > to B, READ B is not allowed to complete until all the invalidation
> > requests for B pending (i.e., acknowledged but not yet processed) at
> > the time of the rmb instruction have completed (i.e., been processed).
>
> Did you mean to say "If there is a pending invalidate to B, READ B is
> not allowed to complete until all such pre-existing invalidate requests
> have completed"? As opposed to "Unless..."?

I was trying to come to terms with the possibility that there might be
a pending STORE sitting in the store buffer when the rmb is executed. In
that case subsequent LOADs would not be forced to wait for pending
invalidations; they could be satisfied directly from the store buffer.

Continuing with this reasoning leads to some annoying complications. It
becomes clear, for instance, that sometimes one CPU's STORE does not ever
have to become visible to another CPU! This is perhaps most obvious in a
hierarchical cache arrangement, but it's true even in the simple flat
model.

For example, imagine that CPU 0 does "a = 1" and CPU 1 does "a = 2" at
just about the same time. Close enough together so that each CPU sees its
own assignment before it sees the other one. Let's say that the final
value stored in a is 1. Then the store of 2 could never have been visible
to CPU 0; otherwise CPU 0 would see the stores occurring in the order
1,2,1.

Or to put it another way, suppose CPU 0's write just sits in a store
buffer while CPU 0 receives, acknowledges, and processes CPU 1's
invalidate. CPU 0 would never see the value associated with that
invalidate because all reads would be satisfied directly from the store
buffer.

Now that I realize this, it becomes a lot harder to describe what rmb
really does. Consider again the "canonical" pattern:

CPU 0 CPU 1
----- -----
a = 1; y = b;
wmb(); rmb();
b = 1; x = a;
assert(!(y==1 && x==0));

We've been saying all along that initially a and b are equal to 0. But
suppose that the _reason_ a is equal to 0 is because some time in the past
both CPUs executed "a = 0". Suppose CPU 0's assignment completed
quickly, but CPU 1's assignment was delayed and is in fact still sitting
in a store buffer. Then the assertion might fail!

A more correct pattern would have to look like this:

CPU 0                           CPU 1
-----                           -----
while (c == 0) ;                wmb();
                                // Earlier writes to a or b are now flushed
                                c = 1;

// arbitrarily long delay with neither CPU writing to a or b

a = 1;                          y = b;
wmb();                          rmb();
b = 1;                          x = a;
assert(!(y==1 && x==0));

In general you might say that a LOAD following rmb won't return the value
from any STORE that became visible before the rmb other than the last one.
But that leaves open the question of which STOREs actually do become
visible. I don't know how to deal with that in general.


> > There are consequences of these ideas with potential implications for
> > RCU. Consider, as a simplified example, a thread T0 which constantly
> > repeats the following loop:
> >
> > for (;;) {
> > ++cnt;
> > mb();
>
> Need an rcu_read_lock() here (or earlier).
>
> > x = p;
>
> The above assignment to x should be "x = rcu_dereference(p)", though it
> doesn't affect memory ordering except on Alpha.

True, but not relevant in this context.

> > // Do something time-consuming with *x
>
> Need an rcu_read_unlock() here (or later).
>
> > }
> >
> > Now suppose another thread T1 wants to update the value of p and wait
> > until the old value is no longer being used (a typical RCU-ish sort of
> > thing). The natural approach is this:
> >
> > old_p = p;
> > p = new_p;
>
> The above assignment should instead be "rcu_assign_pointer(p, new_p)",
> which would put an smp_wmb() before the actual assignment to p. Not
> yet sure if this matters in this case, but just for the record...

You may assume there was a wmb() immediately prior to this.

> > mb();
> > c = cnt;
> > while (c == cnt)
> > barrier();
>
> Yikes!!! The above loop should be replaced by synchronize_rcu().
> Otherwise, you are in an extreme state of sin.

Perhaps you could explain why (other than the fact that it might fail!).

I actually do have code like this in one of my drivers. But there are
significant differences from the example. In the driver T0 doesn't run on
a CPU; it runs on a peripheral controller which accesses p via DMA and
makes cnt available over the I/O bus (inw).

Alan

2006-09-12 20:23:14

by Paul E. McKenney

[permalink] [raw]
Subject: Re: Uses for memory barriers

On Tue, Sep 12, 2006 at 02:08:21PM -0400, Alan Stern wrote:
> On Mon, 11 Sep 2006, Paul E. McKenney wrote:
>
> > > One insight has to do with why it is that read-mb-write succeeds
> > > whereas write-mb-read can fail. It comes down to this essential
> > > asymmetry:
> > >
> > > Suppose one CPU writes to a memory location at about the same
> > > time as another CPU using a different cache reads from it. If
> > > the read completes before the write then the read is certain
> > > to return the old value. But if the write completes before the
> > > read, the read is _not_ certain to return the new value.
> >
> > Yes. Of course, keeping in mind that "before" can be a partial ordering.
> > In any real situation, you would need to say how you determined that the
> > read did in fact happen "before" the write. Some systems have fine-grained
> > synchronized counters that make this easy, but most do not. In the latter
> > case, the memory operations themselves are about the only evidence you
> > have to figure out what happened "before".
>
> I think it makes sense to say this even in the absence of precise
> definitions. To put it another way, a CPU or cache can't read the
> updated value from another CPU (or cache) before the second CPU (or cache)
> decides to make the updated value publicly available ("committed the
> value"). It _can_ read the updated value after the value has been
> committed, but it isn't constrained to do so -- it's allowed to keep using
> the old value for a while.

Yes, at least in absence of ordering constraints.

> Incidentally, I object very strongly to the way you hardware folk keep
> talking about events being only partially ordered. This isn't the case at
> all. Given a precise physical definition of what an event is (a
> particular transistor changing state for example, or perhaps the entire
> time interval starting when one transistor changes state and ending when
> another does), each event takes place at a moment of time or during an
> interval of time.

Sorry, but there is absolutely no way I am carrying this discussion down
to the transistor level. You need to talk to a real hardware guy for
that. ;-)

> Now time really _is_ totally ordered (so long as you confine yourself to a
> single reference frame!) and hence in any individual run of a computer
> program so are the events -- provided you take proper care when discussing
> events that encompass a time interval.

But different CPUs would have different opinions about the time at which
a given event (e.g., a store) occurred. They would also have different
opinions about the order in which sequences of stores occurred. So the
linear underlying time is not all that relevant -- maybe it is there, maybe
not, but software cannot see it, so it cannot rely on it. Hence the partial
order.

The underlying problem is a given load or store is actually made up of
many events, which will occur at different times. You could arbitrarily
choose a particular event as being the key event that determines the
time for the corresponding load or store, but this won't really help.
Which of the following events should determine the timestamp for a
given store from CPU 0?

1. When the instruction determines what value to store?

2. When an entry on CPU 0's store queue becomes available?

3. When the store-queue entry is filled in? (This is the earliest
time that makes sense to me.)

4. When the invalidation request is transmitted (and more than
one such transmission is required on some architectures)?

5. When the invalidation request is received? If so, by which
CPU?

6. When CPUs send acknowledgements for the invalidation requests?

7. When CPU 0 receives the acknowledgments? (The reception of
the last such acknowledgement would be a reasonable choice
in some cases.)

8. When CPU 0 applies the store-queue entry to the cache line?

9. When the other CPUs process the invalidate request?
(The processing of the invalidate request at the last
CPU to do so would be another reasonable choice in some cases.)

I have an easier time thinking of this as a partial order than to try
to keep track of many individual events that I cannot accurately measure.

> The difficulty is that program runs have a stochastic character; it's
> difficult or impossible to predict exactly what the hardware will do. So
> the same program events could be ordered one way during a run and the
> opposite way during another run. Each run is still totally ordered.

But if I must prove some property for all possible actual orderings, then
keeping track of a partial ordering does make sense. Enumerating all of
the possible actual orderings is often infeasible.

> If you want, you can derive a partial ordering by considering only those
> pairs of events that will _always_ occur in the same order in _every_ run.
> But thinking in terms of this sort of partial ordering is clearly
> inadequate. It doesn't even let you derive the properties of the
> "canonical" pattern:
>
> CPU 0 CPU 1
> ----- -----
> a = 1; y = b;
> wmb(); rmb();
> b = 1; x = a;
> assert(!(y==1 && x==0));
>
> In terms of partial orderings, we know that all the events on CPU 0 are
> ordered by the wmb() and the events on CPU 1 are ordered by the rmb().
> But there is no ordering relation between the events on the two CPUs. "y
> = b" could occur either before or after "b = 1", hence these two events
> are incomparable in the partial ordering. So you can't conclude anything
> about the assertion, if you confine your thinking to the partial ordering.

But I don't need absolute timings, all I need to know is whether or not
CPU 1's "y=b" sees CPU 0's "b=1".

> > > CPU 0 CPU 1
> > > ----- -----
> > > x = a; y = b;
> > > mb(); mb();
> > > b = 1; a = 1;
> > > assert(x==0 || y==0);
> > >
> > > I'm sure you can verify that this assertion will never fail.
> >
> > Two cases, assuming that the initial value of all variables is zero:
>
> Actually it's just one case. Once you have proved that x!=0 implies y==0,
> by elementary logic you have also proved that y!=0 implies x==0 (one
> statement is the contrapositive of the other).

Agreed, they are symmetric. Though I got there by noting that the variable
names can be interchanged, same result. ;-)

> > 1. If x!=0, then CPU 0 must have seen CPU 1's assignment a=1
> > before CPU 0's memory barrier, so all of CPU 0's code after
> > the memory barrier must see CPU 1's assignment to y. In more
> > detail:
> >
> > CPU 0                 CPU 1
> > -----                 -----
> >                       y = b;
> >                       mb();    The load of b must precede the
> >                                store to a, and the invalidate
> >                                of y must be queued at CPU 1.
> >                       a = 1;
> > x = a;      This saw CPU 1's a=1, so CPU 1's load of b must already
> >             have happened.
> > mb();       The invalidate of y must have been received, so it will
> >             be processed prior to any subsequent load.
> > b = 1;      This cannot affect CPU 1's assignment to y.
> > assert(x==0 || y==0);
> >             Since y was invalidated, we must get the cacheline. Since
> >             we saw the assignment to a, we must also see the assignment
> >             to y, which must be zero, since the assignment to y preceded
> >             the assignment to b.
> >
> > So the assertion succeeds with y==0 in this case.
> >
> > 2. If y!=0, then CPU 1 must have seen CPU 0's assignment b=1.
> > This is symmetric with case 1, so x==0 in this case.
> >
> > So, yes, the assertion never trips. And, yes, the intuitive approach
> > would get the answer with much less hassle, but, given your prior examples,
> > I figured that being pedantic was the correct approach.
>
> One could derive the result a little more formally using the causality
> principle I mentioned above (LOAD before STORE cannot return the result of
> the STORE) together with another causality principle: A CPU cannot make
> the result of a STORE available to other CPUs (or to itself) before it
> knows the value to be stored. A cache might send out invalidate messages
> before knowing the value to be stored, but it can't send out read
> responses.

Yes but... Some other CPU might see the resulting store before it
saw the information that the storing CPU based the stored value on.

> > And, yes, removing the loops did not change the result.
>
> As intended.

Fair enough!

> > > It's worth noting that since y is accessible to both CPUs, it must be
> > > a variable stored in memory. It follows that CPU 1's mb() could be
> > > replaced with wmb() and the assertion would still always hold. On the
> > > other hand, the compiler is free to put x in a processor register.
> > > This means that CPU 0's mb() cannot safely be replaced with wmb().
> >
> > You might well be correct, but I must ask you to lay out the operations.
> > If CPU 1's mb() becomes a wmb(), the CPU is within its rights to violate
> > causality in the assignment y=b. There are some very strange things
> > that can happen in NUMA machines with nested shared caches.
>
> What do you mean by "violate causality"? Storing a value in y (and making
> that value available to the other CPU) before the read has obtained b's
> value? I don't see how that could happen on any architecture, unless your
> machine uses thiotimoline. :-)

Different CPUs can perceive a given event happening at different times,
so other CPUs might see the store into y before they saw the logically
earlier store into b. Thiotimoline available on eBay? I would indeed
be interested! ;-)

Your point might be that this example doesn't care, given that the CPU
doing the store into y is also doing the assert, but that doesn't change
my basic point that you have to be -very- careful when using causality
arguments when working out memory ordering.

> > > Let's return to the question I asked at the start of this thread:
> > > Under what circumstances is mb() truly needed? Well, we've just seen
> > > one. If CPU 0's mb() were replaced with "rmb(); wmb();" the assertion
> > > above might fail. The cache would be free to carry out the operations
> > > in a way such that the read of a completed after the write to b.
> >
> > And this would be one of the causality violations that render an intuitive
> > approach to this so hazardous. Intuitively, the write of x must follow
> > the read of a, and the write memory barrier would force the write of b
> > to follow the write of x. The question in this case is whether the
> > intuitive application of transitivity really applies in this case.
>
> I wouldn't put it that way. Knowing that CPUs are free to reorder
> operations in the absence of barriers to prevent such things, the
> violation you mention wouldn't seem unintuitive to me. I tend to have
> more trouble remembering that reads don't necessarily have to return the
> most recent values available.

So do I, which is one reason that I resist the notion that stores happen
at a single specific time. If I consider stores to be fuzzy events, then
it is easier for me to keep in mind that concurrent reads of the
same variable from different CPUs might return different values.
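
For example (again my sketch): suppose CPU 0's store to a is in flight,
CPU 1's cache has already processed the invalidate and fetched the new
line, but CPU 2's cache is still sitting on its queued copy of the
invalidate:

	CPU 0           CPU 1           CPU 2
	-----           -----           -----
	a = 1;          x = a; (sees 1) y = a; (still sees 0)

The two reads can execute at the same wall-clock instant, and
x==1 && y==0 is a perfectly legal outcome.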

> > > This read-mb-write pattern turns out to be vital for implementing
> > > synchronization primitives. Take your own example:
> > >
> > > > Consider the following (lame) definitions for spinlock primitives,
> > > > but in an alternate universe where atomic_xchg() did not imply a
> > > > memory barrier, and on a weak-memory CPU:
> > > >
> > > > typedef atomic_t spinlock_t;
> > > >
> > > > void spin_lock(spinlock_t *l)
> > > > {
> > > > 	for (;;) {
> > > > 		if (atomic_xchg(l, 1) == 0) {
> > > > 			smp_mb();
> > > > 			return;
> > > > 		}
> > > > 		while (atomic_read(l) != 0) barrier();
> > > > 	}
> > > > }
> > > >
> > > > void spin_unlock(spinlock_t *l)
> > > > {
> > > > 	smp_mb();
> > > > 	atomic_set(l, 0);
> > > > }
> > > >
> > > > The spin_lock() primitive needs smp_mb() to ensure that all loads and
> > > > stores in the following critical section happen only -after- the lock
> > > > is acquired. Similarly for the spin_unlock() primitive.
> > >
> > > In fact that last paragraph isn't quite right. The spin_lock()
> > > primitive would also work with smp_rmb() in place of smp_mb(). (I can
> > > explain my reasoning later, if you're interested.)
> >
> > With smp_rmb(), why couldn't the following, given globals b and c:
> >
> > spin_lock(&mylock);
> > b = 1;
> > c = 1;
> > spin_unlock(&mylock);
> >
> > be executed by the CPU as follows?
> >
> > c = 1;
> > b = 1;
> > spin_lock(&mylock);
> > spin_unlock(&mylock);
>
> Because of the conditional in spin_lock().
>
> > This order of execution seems to me to be highly undesirable. ;-)
> >
> > So, what am I missing here???
>
> A CPU cannot move a write back past a conditional. Otherwise it runs
> the risk of committing the write when (according to the program flow) the
> write should never have taken place. No CPU does speculative writes.

From the perspective of some other CPU holding the lock, agreed (I
think...). From the perspective of some other CPU reading the lock
word and the variables b and c, I do not agree. The CPU doing the
stores does not have to move the writes back past the conditional --
all it has to do is move the "c=1" and "b=1" stores back past the store
implied by the atomic operation in the spin_lock(). All three stores
then appear to have happened after the conditional.
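
To make the danger concrete, here is a sketch with a hypothetical
observer CPU (the observer is my addition, not part of your example;
all variables initially zero):

	CPU 0 (rmb()-based spin_lock())      CPU 2 (observer)
	-------------------------------      ----------------
	load mylock, obtain 0                x = c;
	c = 1;  [moved back]                 rmb();
	b = 1;  [moved back]                 y = atomic_read(&mylock);
	store mylock = 1                     assert(!(x==1 && y==0));
	smp_rmb();

The observer's assertion can fail, because nothing orders the store to
c after the store half of the atomic_xchg(). With the full smp_mb() in
spin_lock(), the store to mylock would be ordered before the stores in
the critical section, and the assertion would hold.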

> > > Here's another question: To what extent do memory barriers actually
> > > force completions to occur in order? Let me explain...
> >
> > In general, they do not at all. They instead simply impose ordering
> > constraints.
>
> What's the difference between imposing an ordering constraint and forcing
> two events to occur in a particular order? To me the phrases appear
> synonymous.

That is because you persist in believing that things happen at a
well-defined single time. ;-)

> > > We can think of a CPU and its cache as two semi-autonomous units.
> > > From the cache's point of view, a load or a store initiates when the
> > > CPU issues the request to the cache, and it completes when the data is
> > > extracted from the cache for transmission to the CPU (for a load) or
> > > when the data is placed in the cache (for a store).
> > >
> > > Memory barriers prevent CPUs from reordering instructions, thereby
> > > forcing loads and stores to initiate in order. But this reflects
> > > how the barriers affect the CPU, whereas I'm more concerned about how
> > > they affect the cache. So to what extent do barriers force
> > > memory accesses to _complete_ in order? Let's go through the
> > > possibilities.
> > >
> > > STORE A; wmb; STORE B;
> > >
> > > This does indeed force the completion of STORE B to wait until STORE A
> > > has completed. Nothing else would have the desired effect.
> >
> > CPU designers would not agree. The completion of STORE B must -appear-
> > -to- -software- to precede that of STORE A. For example, you could get
> > the effect by allowing the stores to complete out of order, but then
> > constraining the ordering of concurrent loads of A and B, for example,
> > delaying concurrent loads of A until the store to B completed.
> >
> > Typical CPUs really do use both approaches!
>
> Then it would be better to say that the wmb prevents the STORE A from
> becoming visible on any particular CPU after the STORE B has become
> visible on that CPU. (Note that this is weaker than saying the STORE A
> must not become visible to any CPU after the STORE B has become visible to
> any CPU.)

s/precede/follow/ in my response above, sorry for the confusion!
Or, better yet, interchange A and B.

> > > READ A; mb; STORE B;
> > >
> > > This also forces the completion of STORE B to wait until READ A has
> > > completed. Without such a guarantee the read-mb-write pattern wouldn't
> > > work.
> >
> > Again, this forces the STORE B to -appear- -to- -software- to have
> > completed after the READ A. The hardware is free to allow the operations
> > to complete in either order, as long as it constrains the order of
> > concurrent accesses to A and B so as to preserve the illusion of ordering.
>
> Again, the mb forces the READ A to commit to a value before the STORE B
> becomes visible to any CPU.

Although it is quite possible that the CPU executing the sequence is unaware
of the value of A at the time that the new value of B becomes visible to
other CPUs.

> > > STORE A; mb; READ B;
> > >
> > > This one is strange. For all I know the mb might very well force
> > > the completions to occur in order... but it doesn't matter! Even when
> > > the completions do occur in order, it can appear to other CPUs as if
> > > they did not! This is exactly what happens when the write-mb-read
> > > pattern fails.
> >
> > ???
> >
> > The corresponding accesses would be STORE B; mb; LOAD A. If the LOAD A
> > got the old value of A, but the STORE B did -not- affect the other CPU's
> > READ B, we have a violation. So I don't see anything strange about this
> > one. What am I missing?
>
> You're missing something you described earlier: The two STORES can cross
> in the night. If neither cache has acknowledged the other's invalidate
> message when the mb instructions are executed, then the mb's won't have
> any effect. The invalidate acknowledgments can then be sent and the
> WRITEs allowed to update their respective caches. If the processing of
> the invalidates is delayed, the READs could be allowed to return the
> caches' old data.

Good catch! In all the locking cases, there is a store -after- at least
one of the memory barriers. :-/
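
Spelling your scenario out in the model we have been assuming (my
sketch; both variables initially zero):

	CPU 0                                CPU 1
	-----                                -----
	a = 1;  (buffered; invalidate        b = 1;  (buffered; invalidate
	         sent to CPU 1)                       sent to CPU 0)
	mb();   (local invalidate queue      mb();   (likewise)
	         still empty, so only
	         the acknowledgment is
	         waited for)
	        (each CPU now queues and acknowledges the other's
	         invalidate; both mb()s complete)
	y = b;  (stale 0: the invalidate     x = a;  (stale 0, likewise)
	         of b arrived after the
	         mb() and is still queued)

So x==0 && y==0, even though each mb() did everything it promises.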

> > > READ A; rmb; READ B;
> > >
> > > Another peculiar case. One almost always hears read barriers described
> > > as forcing the second read to complete after the first. But in reality
> > > this is not sufficient.
> >
> > Agreed, not if you want to think closer to the hardware. If you are
> > willing to work at a higher level of abstraction, then the description
> > is reasonable.
> >
> > > The additional requirement of rmb is: Unless there is a pending STORE
> > > to B, READ B is not allowed to complete until all the invalidation
> > > requests for B pending (i.e., acknowledged but not yet processed) at
> > > the time of the rmb instruction have completed (i.e., been processed).
> >
> > Did you mean to say "If there is a pending invalidate to B, READ B is
> > not allowed to complete until all such pre-existing invalidate requests
> > have completed"? As opposed to "Unless..."?
>
> I was trying to come to terms with the possibility that there might be
> a pending STORE sitting in the store buffer when the rmb is executed. In
> that case subsequent LOADs would not be forced to wait for pending
> invalidations; they could be satisfied directly from the store buffer.

Good point!

> Continuing with this reasoning leads to some annoying complications. It
> becomes clear, for instance, that sometimes one CPU's STORE does not ever
> have to become visible to another CPU! This is perhaps most obvious in a
> hierarchical cache arrangement, but it's true even in the simple flat
> model.

Yep!

> For example, imagine that CPU 0 does "a = 1" and CPU 1 does "a = 2" at
> just about the same time. Close enough together so that each CPU sees its
> own assignment before it sees the other one. Let's say that the final
> value stored in a is 1. Then the store of 2 could never have been visible
> to CPU 0; otherwise CPU 0 would see the stores occurring in the order
> 1,2,1.
>
> Or to put it another way, suppose CPU 0's write just sits in a store
> buffer while CPU 0 receives, acknowledges, and processes CPU 1's
> invalidate. CPU 0 would never see the value associated with that
> invalidate because all reads would be satisfied directly from the store
> buffer.

True enough!

> Now that I realize this, it becomes a lot harder to describe what rmb
> really does. Consider again the "canonical" pattern:
>
> CPU 0 CPU 1
> ----- -----
> a = 1; y = b;
> wmb(); rmb();
> b = 1; x = a;
> assert(!(y==1 && x==0));
>
> We've been saying all along that initially a and b are equal to 0. But
> suppose that the _reason_ a is equal to 0 is because some time in the past
> both CPUs executed "a = 0". Suppose CPU 0's assignment completed
> quickly, but CPU 1's assignment was delayed and is in fact still sitting
> in a store buffer. Then the assertion might fail!

Yep!

> A more correct pattern would have to look like this:
>
> CPU 0                       CPU 1
> -----                       -----
> while (c == 0) ;            wmb();
>                             // Earlier writes to a or b are now flushed
>                             c = 1;
>
> // arbitrarily long delay with neither CPU writing to a or b
>
> a = 1;                      y = b;
> wmb();                      rmb();
> b = 1;                      x = a;
> assert(!(y==1 && x==0));
>
> In general you might say that a LOAD following rmb won't return the value
> from any STORE that became visible before the rmb other than the last one.
> But that leaves open the question of which STOREs actually do become
> visible. I don't know how to deal with that in general.

The best answer I can give is that an rmb() -by- -itself- means absolutely
nothing. The only way it has any meaning is in conjunction with an
mb() or a wmb() on some other CPU. The "pairwise" stuff in David's
documentation.

In other words, if there was an rmb() in flight on one CPU, but none
of the other CPUs ever executed a memory barrier of any kind, then
the lone rmb() would not need to place any constraints on any CPU's
memory access.

Or, more accurately, if you wish to understand an rmb() in isolation,
you must do so in the context of a specific hardware implementation.

> > > There are consequences of these ideas with potential implications for
> > > RCU. Consider, as a simplified example, a thread T0 which constantly
> > > repeats the following loop:
> > >
> > > for (;;) {
> > > ++cnt;
> > > mb();
> >
> > Need an rcu_read_lock() here (or earlier).
> >
> > > x = p;
> >
> > The above assignment to x should be "x = rcu_dereference(p)", though it
> > doesn't affect memory ordering except on Alpha.
>
> True, but not relevant in this context.
>
> > > // Do something time-consuming with *x
> >
> > Need an rcu_read_unlock() here (or later).
> >
> > > }
> > >
> > > Now suppose another thread T1 wants to update the value of p and wait
> > > until the old value is no longer being used (a typical RCU-ish sort of
> > > thing). The natural approach is this:
> > >
> > > old_p = p;
> > > p = new_p;
> >
> > The above assignment should instead be "rcu_assign_pointer(p, new_p)",
> > which would put an smp_wmb() before the actual assignment to p. Not
> > yet sure if this matters in this case, but just for the record...
>
> You may assume there was a wmb() immediately prior to this.

OK.

> > > mb();
> > > c = cnt;
> > > while (c == cnt)
> > > barrier();
> >
> > Yikes!!! The above loop should be replaced by synchronize_rcu().
> > Otherwise, you are in an extreme state of sin.
>
> Perhaps you could explain why (other than the fact that it might fail!).

Ummm... Isn't that reason enough?

One reason is that the ++cnt can be reordered by both CPU and compiler
to precede some of the code in "// Do something time-consuming with *x".
I am assuming that only one thread is permitted to execute in the loop
at the same time, otherwise there are many other failure mechanisms.

> I actually do have code like this in one of my drivers. But there are
> significant differences from the example. In the driver T0 doesn't run on
> a CPU; it runs on a peripheral controller which accesses p via DMA and
> makes cnt available over the I/O bus (inw).

OK, so there is only one thread T0... And so it depends on the
peripheral controller's memory model. I would hope that they sorted
out the synchronization. ;-)

Thanx, Paul

2006-09-14 14:58:46

by Alan Stern

[permalink] [raw]
Subject: Re: Uses for memory barriers

On Tue, 12 Sep 2006, Paul E. McKenney wrote:

> > Incidentally, I object very strongly to the way you hardware folk keep
> > talking about events being only partially ordered. This isn't the case at
> > all. Given a precise physical definition of what an event is (a
> > particular transistor changing state for example, or perhaps the entire
> > time interval starting when one transistor changes state and ending when
> > another does), each event takes place at a moment of time or during an
> > interval of time.
>
> Sorry, but there is absolutely no way I am carrying this discussion down
> to the transistor level. You need to talk to a real hardware guy for
> that. ;-)

I don't mind keeping the discussion at a higher level. The real point is
that physical events occur at real, physical times.

> > Now time really _is_ totally ordered (so long as you confine yourself to a
> > single reference frame!) and hence in any individual run of a computer
> > program so are the events -- provided you take proper care when discussing
> > events that encompass a time interval.
>
> But different CPUs would have different opinions about the time at which
> a given event (e.g., a store) occurred.

That's okay. There isn't any single notion of "This store occurred at
this time". Instead, for each CPU there is a notion of "This store became
visible to this CPU at this time".

> They would also have different
> opinions about the order in which sequences of stores occurred.

Again, not a problem. And for the same reason. Although if the stores
are all to the same location, different CPUs should _not_ have different
opinions about the order in which they became visible.

> So the
> linear underlying time is not all that relevant -- maybe it is there, maybe
> not, but software cannot see it, so it cannot rely on it. Hence the partial
> order.

Maybe the software can't see it directly, but it can still help when
you're trying to reason about the program's behavior.

> The underlying problem is a given load or store is actually made up of
> many events, which will occur at different times. You could arbitrarily
> choose a particular event as being the key event that determines the
> time for the corresponding load or store, but this won't really help.
> Which of the following events should determine the timestamp for a
> given store from CPU 0?

That's not a valid notion. Let's ask instead what is the timestamp for
when a given store from CPU 0 becomes visible to CPU 1.

> 1. When the instruction determines what value to store?
>
> 2. When an entry on CPU 0's store queue becomes available?
>
> 3. When the store-queue entry is filled in? (This is the earliest
> time that makes sense to me.)
>
> 4. When the invalidation request is transmitted (and more than
> one such transmission is required on some architectures)?
>
> 5. When the invalidation request is received? If so, by which
> CPU?
>
> 6. When CPUs send acknowledgements for the invalidation requests?

This is a very good candidate: when CPU 1 sends its acknowledgement. Any
read on CPU 1 which commits to a return value before this time will not
see the new value. A read on CPU 1 which commits to a return value after
this time may see the new value.

> 7. When CPU 0 receives the acknowledgments? (The reception of
> the last such acknowledgement would be a reasonable choice
> in some cases.)
>
> 8. When CPU 0 applies the store-queue entry to the cache line?
>
> 9. When the other CPUs process the invalidate request?
> (The processing of the invalidate request at the last
> CPU to do so would be another reasonable choice in some cases.)
>
> I have an easier time thinking of this as a partial order than to try
> to keep track of many individual events that I cannot accurately measure.

All right, let's express things in terms of the partial order. In those
terms, this is the principle you would expound:

(P1): If two writes are ordered on the same CPU and two reads are
ordered on another CPU, and the first read sees the result of
the second write, then the second read will see the result of
the first write.

Correct?

> > The difficulty is that program runs have a stochastic character; it's
> > difficult or impossible to predict exactly what the hardware will do. So
> > the same program events could be ordered one way during a run and the
> > opposite way during another run. Each run is still totally ordered.
>
> But if I must prove some property for all possible actual orderings, then
> keeping track of a partial ordering does make sense. Enumerating all of
> the possible actual orderings is often infeasible.

You don't have to enumerate all possible cases to prove things.
Mathematics would still be stuck back at the time of the ancient Greeks if
that were so.

> > If you want, you can derive a partial ordering by considering only those
> > pairs of events that will _always_ occur in the same order in _every_ run.
> > But thinking in terms of this sort of partial ordering is clearly
> > inadequate. It doesn't even let you derive the properties of the
> > "canonical" pattern:
> >
> > CPU 0 CPU 1
> > ----- -----
> > a = 1; y = b;
> > wmb(); rmb();
> > b = 1; x = a;
> > assert(!(y==1 && x==0));
> >
> > In terms of partial orderings, we know that all the events on CPU 0 are
> > ordered by the wmb() and the events on CPU 1 are ordered by the rmb().
> > But there is no ordering relation between the events on the two CPUs. "y
> > = b" could occur either before or after "b = 1", hence these two events
> > are incomparable in the partial ordering. So you can't conclude anything
> > about the assertion, if you confine your thinking to the partial ordering.
>
> But I don't need absolute timings; all I need to know is whether or not
> CPU 1's "y=b" sees CPU 0's "b=1".

How does the partial-ordering point of view help you understand the
read-mb-write pattern?

CPU 0 CPU 1
----- -----
x = a; y = b + 1;
mb(); mb();
b = 1; a = 1;
assert(!(x==1 && y!=1));

You would need an additional principle similar to P1 above. Trying to
apply P1 won't work; all you can deduce is that CPU 1 writes y and a
whereas CPU 0 reads a and y, hence CPU 1 can assert only (!(x==1 &&
y==0)), not (!(x==1 && y!=1)).

This idea, that some _new_ principle P2 is needed to explain
read-mb-write, is one of the major points I have been trying to establish
here. Discussions like David's always mention P1 but they hardly ever
mention this P2. In fact I can't recall ever seeing it mentioned before
this email thread. And yet, as we've seen, P2 is essential for explaining
why synchronization primitives work.


> > One could derive the result a little more formally using the causality
> > principle I mentioned above (LOAD before STORE cannot return the result of
> > the STORE) together with another causality principle: A CPU cannot make
> > the result of a STORE available to other CPUs (or to itself) before it
> > knows the value to be stored. A cache might send out invalidate messages
> > before knowing the value to be stored, but it can't send out read
> > responses.
>
> Yes but... Some other CPU might see the resulting store before it
> saw the information that the storing CPU based the stored value on.

Nothing wrong with that. As I have mentioned before, stores don't have to
become visible to all CPUs at the same time.

> > What do you mean by "violate causality"? Storing a value in y (and making
> > that value available to the other CPU) before the read has obtained b's
> > value? I don't see how that could happen on any architecture, unless your
> > machine uses thiotimoline. :-)
>
> Different CPUs can perceive a given event happening at different times,
> so other CPUs might see the store into y before they saw the logically
> earlier store into b.

Nothing wrong with that either, and for the same reason.

> Thiotimoline available on eBay? I would indeed
> be interested! ;-)

You and me both!

> Your point might be that this example doesn't care, given that the CPU
> doing the store into y is also doing the assert, but that doesn't change
> my basic point that you have to be -very- careful when using causality
> arguments when working out memory ordering.

I think that all you really need to do is remember that statements can be
reordered, that stores become visible to different CPUs at different
times, and that loads don't always return the value of the last visible
store. Everything else is intuitive -- except for the way memory barriers
work!

> > I wouldn't put it that way. Knowing that CPUs are free to reorder
> > operations in the absence of barriers to prevent such things, the
> > violation you mention wouldn't seem unintuitive to me. I tend to have
> > more trouble remembering that reads don't necessarily have to return the
> > most recent values available.
>
> So do I, which is one reason that I resist the notion that stores happen
> at a single specific time. If I consider stores to be fuzzy events, then
> it is easier for me to keep in mind that concurrent reads of the
> same variable from different CPUs might return different values.

Instead of considering stores to be fuzzy events, you can think of them as
becoming visible at precise times to individual CPUs. That sort of
reasoning would help someone to understand P2, whereas the
partially-ordered fuzzy-events approach would not.


> > > > This read-mb-write pattern turns out to be vital for implementing
> > > > synchronization primitives. Take your own example:
> > > >
> > > > > Consider the following (lame) definitions for spinlock primitives,
> > > > > but in an alternate universe where atomic_xchg() did not imply a
> > > > > memory barrier, and on a weak-memory CPU:
> > > > >
> > > > > typedef atomic_t spinlock_t;
> > > > >
> > > > > void spin_lock(spinlock_t *l)
> > > > > {
> > > > > 	for (;;) {
> > > > > 		if (atomic_xchg(l, 1) == 0) {
> > > > > 			smp_mb();
> > > > > 			return;
> > > > > 		}
> > > > > 		while (atomic_read(l) != 0) barrier();
> > > > > 	}
> > > > > }
> > > > >
> > > > > void spin_unlock(spinlock_t *l)
> > > > > {
> > > > > 	smp_mb();
> > > > > 	atomic_set(l, 0);
> > > > > }
> > > > >
> > > > > The spin_lock() primitive needs smp_mb() to ensure that all loads and
> > > > > stores in the following critical section happen only -after- the lock
> > > > > is acquired. Similarly for the spin_unlock() primitive.
> > > >
> > > > In fact that last paragraph isn't quite right. The spin_lock()
> > > > primitive would also work with smp_rmb() in place of smp_mb(). (I can
> > > > explain my reasoning later, if you're interested.)
> > >
> > > With smp_rmb(), why couldn't the following, given globals b and c:
> > >
> > > spin_lock(&mylock);
> > > b = 1;
> > > c = 1;
> > > spin_unlock(&mylock);
> > >
> > > be executed by the CPU as follows?
> > >
> > > c = 1;
> > > b = 1;
> > > spin_lock(&mylock);
> > > spin_unlock(&mylock);
> >
> > Because of the conditional in spin_lock().
> >
> > > This order of execution seems to me to be highly undesirable. ;-)
> > >
> > > So, what am I missing here???
> >
> > A CPU cannot move a write back past a conditional. Otherwise it runs
> > the risk of committing the write when (according to the program flow) the
> > write should never have taken place. No CPU does speculative writes.
>
> From the perspective of some other CPU holding the lock, agreed (I
> think...). From the perspective of some other CPU reading the lock
> word and the variables b and c, I do not agree. The CPU doing the
> stores does not have to move the writes back past the conditional --
> all it has to do is move the "c=1" and "b=1" stores back past the store
> implied by the atomic operation in the spin_lock(). All three stores
> then appear to have happened after the conditional.

Ah. I'm glad you mentioned this.

Yes, it's true that given sufficiently weak ordering, the writes could
be moved back in between the load and store implicit in the atomic
exchange. It's worth pointing out that even if this does occur, it will
not be visible to any CPU that accesses b and c only from within a
critical section. But it could be visible in a situation like this:

CPU 0                             CPU 1
-----                             -----
[call spin_lock(&mylock)]         x = b;
read mylock, obtain 0             mb();
b = 1;  [moved up]                y = atomic_read(&mylock);
write mylock = 1
rmb();
[leave spin_lock()]
                                  mb();
                                  assert(!(x==1 && y==0 && c==0));
c = 1;
spin_unlock(&mylock);

The assertion could indeed fail. HOWEVER...

What you may not realize is that even when spin_lock() contains your
original full mb(), if CPU 0 reads b (instead of writing b) that read
could appear to another CPU to have occurred before the write to mylock.
Example:

CPU 0                             CPU 1
-----                             -----
[call spin_lock(&mylock)]         b = 1;
read mylock, obtain 0             mb();
write mylock = 1                  y = atomic_read(&mylock);
mb();
[leave spin_lock()]

x = b + 1;                        mb();
                                  assert(!(x==1 && y==0 && c==0));
c = 1;
spin_unlock(&mylock);

If the assertion fails, then CPU 1 will think that CPU 0 read the value of
b before setting mylock to 1. But the assertion can fail! It's another
example of the write-mb-read pattern. If CPU 0's write to mylock crosses
with CPU 1's write to b, then both CPU 0's read of b and CPU 1's read of
mylock could obtain the old values.

So in this respect your implementation already fails to prevent reads from
leaking partway out of the critical section. My implementation does the
same thing with writes, but otherwise is no worse.


> > What's the difference between imposing an ordering constraint and forcing
> > two events to occur in a particular order? To me the phrases appear
> > synonymous.
>
> That is because you persist in believing that things happen at a
> well-defined single time. ;-)

Physical events _do_ happen at well-defined times. Even if we don't know
exactly when those times occur.


> > A more correct pattern would have to look like this:
> >
> > CPU 0                       CPU 1
> > -----                       -----
> > while (c == 0) ;            wmb();
> >                             // Earlier writes to a or b are now flushed
> >                             c = 1;
> >
> > // arbitrarily long delay with neither CPU writing to a or b
> >
> > a = 1;                      y = b;
> > wmb();                      rmb();
> > b = 1;                      x = a;
> > assert(!(y==1 && x==0));
> >
> > In general you might say that a LOAD following rmb won't return the value
> > from any STORE that became visible before the rmb other than the last one.
> > But that leaves open the question of which STOREs actually do become
> > visible. I don't know how to deal with that in general.
>
> The best answer I can give is that an rmb() -by- -itself- means absolutely
> nothing. The only way it has any meaning is in conjunction with an
> mb() or a wmb() on some other CPU. The "pairwise" stuff in David's
> documentation.

Even that isn't enough. You also have to have some sort of additional
information that can guarantee CPU 0's write to b will become visible to
CPU 1. If it never becomes visible then the assertion could fail.

But I can't think of any reasonable way to obtain such a guarantee that
isn't implementation-specific. Which leaves things in a rather
unsatisfactory state...

> In other words, if there was an rmb() in flight on one CPU, but none
> of the other CPUs ever executed a memory barrier of any kind, then
> the lone rmb() would not need to place any constraints on any CPU's
> memory access.
>
> Or, more accurately, if you wish to understand an rmb() in isolation,
> you must do so in the context of a specific hardware implementation.

I'll buy that second description but not the first. It makes no sense to
say that what an rmb() instruction does to this CPU right here depends
somehow on what that CPU way over there is doing. Instructions are
supposed to act locally.

> > Perhaps you could explain why (other than the fact that it might fail!).
>
> Ummm... Isn't that reason enough?
>
> One reason is that the ++cnt can be reordered by both CPU and compiler
> to precede some of the code in "// Do something time-consuming with *x".
> I am assuming that only one thread is permitted to execute in the loop
> at the same time, otherwise there are many other failure mechanisms.

Yes. There would have to be a barrier before ++cnt.
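
Concretely, T0's loop would become something like this (a sketch only):

	for (;;) {
		mb();	/* don't let ++cnt become visible before the
			   previous pass is finished with *x */
		++cnt;
		mb();
		x = p;
		// Do something time-consuming with *x
	}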

> > I actually do have code like this in one of my drivers. But there are
> > significant differences from the example. In the driver T0 doesn't run on
> > a CPU; it runs on a peripheral controller which accesses p via DMA and
> > makes cnt available over the I/O bus (inw).
>
> OK, so there is only one thread T0... And so it depends on the
> peripheral controller's memory model. I would hope that they sorted
> out the synchronization. ;-)

Me too!

Alan

2006-09-15 05:17:07

by Paul E. McKenney

[permalink] [raw]
Subject: Re: Uses for memory barriers

On Thu, Sep 14, 2006 at 10:58:43AM -0400, Alan Stern wrote:
> On Tue, 12 Sep 2006, Paul E. McKenney wrote:
>
> > > Incidentally, I object very strongly to the way you hardware folk keep
> > > talking about events being only partially ordered. This isn't the case at
> > > all. Given a precise physical definition of what an event is (a
> > > particular transistor changing state for example, or perhaps the entire
> > > time interval starting when one transistor changes state and ending when
> > > another does), each event takes place at a moment of time or during an
> > > interval of time.
> >
> > Sorry, but there is absolutely no way I am carrying this discussion down
> > to the transistor level. You need to talk to a real hardware guy for
> > that. ;-)
>
> I don't mind keeping the discussion at a higher level. The real point is
> that physical events occur at real, physical times.

I agree that physical events do occur at real physical times (exceptions
a la Heisenberg not being relevant to the energy levels and timescales
we are discussing, besides which -you- are the physicist, not me),
but so what?

> > > Now time really _is_ totally ordered (so long as you confine yourself to a
> > > single reference frame!) and hence in any individual run of a computer
> > > program so are the events -- provided you take proper care when discussing
> > > events that encompass a time interval.
> >
> > But different CPUs would have different opinions about the time at which
> > a given event (e.g., a store) occurred.
>
> That's okay. There isn't any single notion of "This store occurred at
> this time". Instead, for each CPU there is a notion of "This store became
> visible to this CPU at this time".

That is one way to look at it. It may or may not be the best way to
look at it in a given situation -- for example, some stores never become
visible to some CPUs due to the effect of later stores. For another
example, if the software doesn't look at a given situation, it cannot
tell whether or not a given store is visible to it.

The fact that there are indeed real events happening at real time isn't
necessarily relevant if they cannot be seen. Accepting fuzzy stores
can decrease the size of the state space substantially. This can be an
extremely good thing.

> > They would also have different
> > opinions about the order in which sequences of stores occurred.
>
> Again, not a problem. And for the same reason. Although if the stores
> are all to the same location, different CPUs should _not_ have different
> opinions about the order in which they became visible.

The different CPUs might have different opinions -- in fact, they
usually -do- have different opinions. The constraint is instead that
their different opinions must be consistent with -at- -least- -one-
totally ordered sequence of stores. For an example, consider three
CPUs storing to the same location concurrently, and another CPU watching.
The CPUs will agree on who was the last CPU to do the store, but might
have no idea of the order of the other two stores.

On the "not a problem", yes, you -can- model stores as a set of events
that all happen at precise but unknowable times. You can also assert
that this is the One True Way of modeling stores, but I am not buying.

In my experience, it is often extremely helpful to model stores as
being fuzzy. You are free to disagree, but your disagreement will
not change my memory. ;-)

> > So the
> > linear underlying time is not all that relevant -- maybe it is there, maybe
> > not, but software cannot see it, so it cannot rely on it. Hence the partial
> > order.
>
> Maybe the software can't see it directly, but it can still help when
> you're trying to reason about the program's behavior.

Yes, it -can- -sometimes- help. Other times it can result in useless
combinatorial explosion.

> > The underlying problem is a given load or store is actually made up of
> > many events, which will occur at different times. You could arbitrarily
> > choose a particular event as being the key event that determines the
> > time for the corresponding load or store, but this won't really help.
> > Which of the following events should determine the timestamp for a
> > given store from CPU 0?
>
> That's not a valid notion. Let's ask instead what is the timestamp for
> when a given store from CPU 0 becomes visible to CPU 1.

Yes it is a valid notion.

Yes, your question is sometimes the right thing to ask. But not always.

> > 1. When the instruction determines what value to store?
> >
> > 2. When an entry on CPU 0's store queue becomes available?
> >
> > 3. When the store-queue entry is filled in? (This is the earliest
> > time that makes sense to me.)
> >
> > 4. When the invalidation request is transmitted (and more than
> > one such transmission is required on some architectures)?
> >
> > 5. When the invalidation request is received? If so, by which
> > CPU?
> >
> > 6. When CPUs send acknowledgements for the invalidation requests?
>
> This is a very good candidate: when CPU 1 sends its acknowledgement. Any
> read on CPU 1 which commits to a return value before this time will not
> see the new value. A read on CPU 1 which commits to a return value after
> this time may see the new value.

You cannot see it from software, so it is difficult to make use of,
unless you are running your software in an extremely detailed simulator
or on some types of in-circuit emulators.

I have sometimes had to live in this space, including quite recently.
Trust me, it is -not- where you want to be!!!

> > 7. When CPU 0 receives the acknowledgments? (The reception of
> > the last such acknowledgement would be a reasonable choice
> > in some cases.)
> >
> > 8. When CPU 0 applies the store-queue entry to the cache line?
> >
> > 9. When the other CPUs process the invalidate request?
> > (The processing of the invalidate request at the last
> > CPU to do so would be another reasonable choice in some cases.)
> >
> > I have an easier time thinking of this as a partial order than to try
> > to keep track of many individual events that I cannot accurately measure.
>
> All right, let's express things in terms of the partial order. In those
> terms, this is the principle you would expound:
>
> (P1): If two writes are ordered on the same CPU and two reads are
> ordered on another CPU, and the first read sees the result of
> the second write, then the second read will see the result of
> the first write.
>
> Correct?

Partially.

(P1a): If two writes are ordered on one CPU, and two reads are ordered
on another CPU, and if the first read sees the result of the
second write, then the second read will see the result of the
first write.

(P1b): If a read is ordered before a write on each of two CPUs,
and if the read on the second CPU sees the result of the write
on the first CPU, then the read on the first CPU will -not-
see the result of the write on the second CPU.

(P2): All CPUs will see the results of writes to a single location
in orderings consistent with at least one total global order
of writes to that single location.
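
A litmus test for (P2), with all variables initially zero and rmb()s
supplied to keep each observer's pair of reads ordered (my sketch):

	CPU 0           CPU 1           CPU 2           CPU 3
	-----           -----           -----           -----
	a = 1;          a = 2;          x1 = a;         y1 = a;
	                                rmb();          rmb();
	                                x2 = a;         y2 = a;

	assert(!(x1==1 && x2==2 && y1==2 && y2==1));

If CPU 2 saw 1-then-2 while CPU 3 saw 2-then-1, no total order of the
two stores could be consistent with both observations, so (P2) forbids
that outcome.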

> > > The difficulty is that program runs have a stochastic character; it's
> > > difficult or impossible to predict exactly what the hardware will do. So
> > > the same program events could be ordered one way during a run and the
> > > opposite way during another run. Each run is still totally ordered.
> >
> > But if I must prove some property for all possible actual orderings, then
> > keeping track of a partial ordering does make sense. Enumerating all of
> > the possible actual orderings is often infeasible.
>
> You don't have to enumerate all possible cases to prove things.
> Mathematics would still be stuck back at the time of the ancient Greeks if
> that were so.

You are correct that I was being overly dramatic. Still, explicitly
demonstrating collapse of the state space is best avoided in many cases.

> > > If you want, you can derive a partial ordering by considering only those
> > > pairs of events that will _always_ occur in the same order in _every_ run.
> > > But thinking in terms of this sort of partial ordering is clearly
> > > inadequate. It doesn't even let you derive the properties of the
> > > "canonical" pattern:
> > >
> > > CPU 0 CPU 1
> > > ----- -----
> > > a = 1; y = b;
> > > wmb(); rmb();
> > > b = 1; x = a;
> > > assert(!(y==1 && x==0));
> > >
> > > In terms of partial orderings, we know that all the events on CPU 0 are
> > > ordered by the wmb() and the events on CPU 1 are ordered by the rmb().
> > > But there is no ordering relation between the events on the two CPUs. "y
> > > = b" could occur either before or after "b = 1", hence these two events
> > > are incomparable in the partial ordering. So you can't conclude anything
> > > about the assertion, if you confine your thinking to the partial ordering.
> >
> > But I don't need absolute timings; all I need to know is whether or not
> > CPU 1's "y=b" sees CPU 0's "b=1".
>
> How does the partial-ordering point of view help you understand the
> read-mb-write pattern?
>
> CPU 0 CPU 1
> ----- -----

All variables initially zero, correct?

> x = a; y = b + 1;
> mb(); mb();
> b = 1; a = 1;
> assert(!(x==1 && y!=1));
>
> You would need an additional principle similar to P1 above. Trying to
> apply P1 won't work; all you can deduce is that CPU 1 writes y and a
> whereas CPU 0 reads a and y, hence CPU 1 can assert only (!(x==1 &&
> y==0)), not (!(x==1 && y!=1)).

Yep, you need both my P1a (same as your P1) and my P1b.

> This idea, that some _new_ principle P2 is needed to explain
> read-mb-write, is one of the major points I have been trying to establish
> here. Discussions like David's always mention P1 but they hardly ever
> mention this P2. In fact I can't recall ever seeing it mentioned before
> this email thread. And yet, as we've seen, P2 is essential for explaining
> why synchronization primitives work.

Is your P2 the same as my P1b? If so, we might well be in agreement! ;-)

> > > One could derive the result a little more formally using the causality
> > > principle I mentioned above (LOAD before STORE cannot return the result of
> > > the STORE) together with another causality principle: A CPU cannot make
> > > the result of a STORE available to other CPUs (or to itself) before it
> > > knows the value to be stored. A cache might send out invalidate messages
> > > before knowing the value to be stored, but it can't send out read
> > > responses.
> >
> > Yes but... Some other CPU might see the resulting store before it
> > saw the information that the storing CPU based the stored value on.
>
> Nothing wrong with that. As I have mentioned before, stores don't have to
> become visible to all CPUs at the same time.

Agreed.

> > > What do you mean by "violate causality"? Storing a value in y (and making
> > > that value available to the other CPU) before the read has obtained b's
> > > value? I don't see how that could happen on any architecture, unless your
> > > machine uses thiotimoline. :-)
> >
> > Different CPUs can perceive a given event happening at different times,
> > so other CPUs might see the store into y before they saw the logically
> > earlier store into b.
>
> Nothing wrong with that either, and for the same reason.

It breaks the intuitive notion of transitivity in some cases.

Stores "passing in the night", for example.

> > Thiotimoline available on eBay? I would indeed
> > be interested! ;-)
>
> You and me both!

;-)

> > Your point might be that this example doesn't care, given that the CPU
> > doing the store into y is also doing the assert, but that doesn't change
> > my basic point that you have to be -very- careful when using causality
> > arguments when working out memory ordering.
>
> I think that all you really need to do is remember that statements can be
> reordered, that stores become visible to different CPUs at different
> times, and that loads don't always return the value of the last visible
> store. Everything else is intuitive -- except for the way memory barriers
> work!

Let me get this right. In your viewpoint, memory barriers are strongly
counter-intuitive, yet you are trying to convince me to switch to your
viewpoint? ;-)

(Seriously, you have convinced me that current explanations of memory
barriers fall short -- which I do very much appreciate!)

> > > I wouldn't put it that way. Knowing that CPUs are free to reorder
> > > operations in the absence of barriers to prevent such things, the
> > > violation you mention wouldn't seem unintuitive to me. I tend to have
> > > more trouble remembering that reads don't necessarily have to return the
> > > most recent values available.
> >
> > So do I, which is one reason that I resist the notion that stores happen
> > at a single specific time. If I consider stores to be fuzzy events, then
> > it is easier for me to keep in mind that concurrent reads of the
> > same variable from different CPUs might return different values.
>
> Instead of considering stores to be fuzzy events, you can think of them as
> becoming visible at precise times to individual CPUs. That sort of
> reasoning would help someone to understand P2, whereas the
> partially-ordered fuzzy-events approach would not.

Sure, I most certainly could. But in many cases, it is better not to.

> > > > > This read-mb-write pattern turns out to be vital for implementing
> > > > > synchronization primitives. Take your own example:
> > > > >
> > > > > > Consider the following (lame) definitions for spinlock primitives,
> > > > > > but in an alternate universe where atomic_xchg() did not imply a
> > > > > > memory barrier, and on a weak-memory CPU:
> > > > > >
> > > > > > typedef atomic_t spinlock_t;
> > > > > >
> > > > > > void spin_lock(spinlock_t *l)
> > > > > > {
> > > > > > 	for (;;) {
> > > > > > 		if (atomic_xchg(l, 1) == 0) {
> > > > > > 			smp_mb();
> > > > > > 			return;
> > > > > > 		}
> > > > > > 		while (atomic_read(l) != 0) barrier();
> > > > > > 	}
> > > > > > }
> > > > > >
> > > > > > void spin_unlock(spinlock_t *l)
> > > > > > {
> > > > > > 	smp_mb();
> > > > > > 	atomic_set(l, 0);
> > > > > > }
> > > > > >
> > > > > > The spin_lock() primitive needs smp_mb() to ensure that all loads and
> > > > > > stores in the following critical section happen only -after- the lock
> > > > > > is acquired. Similarly for the spin_unlock() primitive.
> > > > >
> > > > > In fact that last paragraph isn't quite right. The spin_lock()
> > > > > primitive would also work with smp_rmb() in place of smp_mb(). (I can
> > > > > explain my reasoning later, if you're interested.)
> > > >
> > > > With smp_rmb(), why couldn't the following, given globals b and c:
> > > >
> > > > spin_lock(&mylock);
> > > > b = 1;
> > > > c = 1;
> > > > spin_unlock(&mylock);
> > > >
> > > > be executed by the CPU as follows?
> > > >
> > > > c = 1;
> > > > b = 1;
> > > > spin_lock(&mylock);
> > > > spin_unlock(&mylock);
> > >
> > > Because of the conditional in spin_lock().
> > >
> > > > This order of execution seems to me to be highly undesirable. ;-)
> > > >
> > > > So, what am I missing here???
> > >
> > > A CPU cannot move a write back past a conditional. Otherwise it runs
> > > the risk of committing the write when (according to the program flow) the
> > > write should never have taken place. No CPU does speculative writes.
> >
> > From the perspective of some other CPU holding the lock, agreed (I
> > think...). From the perspective of some other CPU reading the lock
> > word and the variables b and c, I do not agree. The CPU doing the
> > stores does not have to move the writes back past the conditional --
> > all it has to do is move the "c=1" and "b=1" stores back past the store
> > implied by the atomic operation in the spin_lock(). All three stores
> > then appear to have happened after the conditional.
>
> Ah. I'm glad you mentioned this.
>
> Yes, it's true that given sufficiently weak ordering, the writes could
> be moved back in between the load and store implicit in the atomic
> exchange. It's worth pointing out that even if this does occur, it will
> not be visible to any CPU that accesses b and c only from within a
> critical section.

The other key point is that some weakly ordered CPUs were designed
only to make spinlocks work.

> But it could be visible in a situation like this:
>
> CPU 0                             CPU 1
> -----                             -----
> [call spin_lock(&mylock)]         x = b;
> read mylock, obtain 0             mb();
> b = 1;  [moved up]                y = atomic_read(&mylock);
> write mylock = 1
> rmb();
> [leave spin_lock()]
>                                   mb();
>                                   assert(!(x==1 && y==0 && c==0));
> c = 1;
> spin_unlock(&mylock);
>
> The assertion could indeed fail. HOWEVER...
>
> What you may not realize is that even when spin_lock() contains your
> original full mb(), if CPU 0 reads b (instead of writing b) that read
> could appear to another CPU to have occurred before the write to mylock.
> Example:
>
> CPU 0                             CPU 1
> -----                             -----

All variables initially zero?

> [call spin_lock(&mylock)]         b = 1;
> read mylock, obtain 0             mb();
> write mylock = 1                  y = atomic_read(&mylock);
> mb();
> [leave spin_lock()]
>
> x = b + 1;                        mb();
>                                   assert(!(x==1 && y==0 && c==0));
> c = 1;
> spin_unlock(&mylock);
>
> If the assertion fails, then CPU 1 will think that CPU 0 read the value of
> b before setting mylock to 1. But the assertion can fail! It's another
> example of the write-mb-read pattern. If CPU 0's write to mylock crosses
> with CPU 1's write to b, then both CPU 0's read of b and CPU 1's read of
> mylock could obtain the old values.
>
> So in this respect your implementation already fails to prevent reads from
> leaking partway out of the critical section. My implementation does the
> same thing with writes, but otherwise is no worse.

But isn't this the write-memory-barrier-read sequence that you (rightly)
convinced me was problematic in earlier email?

> > > What's the difference between imposing an ordering constraint and forcing
> > > two events to occur in a particular order? To me the phrases appear
> > > synonymous.
> >
> > That is because you persist in believing that things happen at a
> > well-defined single time. ;-)
>
> Physical events _do_ happen at well-defined times. Even if we don't know
> exactly when those times occur.

Yes. But that doesn't -require- me to think in those terms.

> > > A more correct pattern would have to look like this:
> > >
> > > CPU 0                       CPU 1
> > > -----                       -----
> > > while (c == 0) ;            wmb();
> > >                             // Earlier writes to a or b are now flushed
> > >                             c = 1;
> > >
> > > // arbitrarily long delay with neither CPU writing to a or b
> > >
> > > a = 1;                      y = b;
> > > wmb();                      rmb();
> > > b = 1;                      x = a;
> > > assert(!(y==1 && x==0));
> > >
> > > In general you might say that a LOAD following rmb won't return the value
> > > from any STORE that became visible before the rmb other than the last one.
> > > But that leaves open the question of which STOREs actually do become
> > > visible. I don't know how to deal with that in general.
> >
> > The best answer I can give is that an rmb() -by- -itself- means absolutely
> > nothing. The only way it has any meaning is in conjunction with an
> > mb() or a wmb() on some other CPU. The "pairwise" stuff in David's
> > documentation.
>
> Even that isn't enough. You also have to have some sort of additional
> information that can guarantee CPU 0's write to b will become visible to
> CPU 1. If it never becomes visible then the assertion could fail.

Eh? If the write to b never becomes visible, then y==0, and the assertion
always succeeds. Or am I misreading this example?

> But I can't think of any reasonable way to obtain such a guarantee that
> isn't implementation-specific. Which leaves things in a rather
> unsatisfactory state...

But arbitrary delays can happen due to interrupts, preemption, etc.
So you are simply not allowed such a guarantee.

> > In other words, if there was an rmb() in flight on one CPU, but none
> > of the other CPUs ever executed a memory barrier of any kind, then
> > the lone rmb() would not need to place any constraints on any CPU's
> > memory access.
> >
> > Or, more accurately, if you wish to understand an rmb() in isolation,
> > you must do so in the context of a specific hardware implementation.
>
> I'll buy that second description but not the first. It makes no sense to
> say that what an rmb() instruction does to this CPU right here depends
> somehow on what that CPU way over there is doing. Instructions are
> supposed to act locally.

A CPU might choose to note whether or not all the loads are to cachelines
in cache and whether or not there are pending invalidates to any of the
corresponding cachelines, and then decide to let the loads execute in
arbitrary order. A CPU might also try executing the loads in arbitrary
order, and undo them if some event (such as reception of an invalidate)
occurred that invalidated the sequence chosen (speculative execution).
Real CPUs have been implemented in both ways, and in both cases, the
rmb() would have no effect on execution (in the absence of interference from
other CPUs).

And the CPU's instructions are -still- acting locally! ;-)

And I know of no non-implementation-specific reason that instructions
need to act locally. Yes, things normally go faster if they do act
locally, but that would be an implementation issue.

> > > Perhaps you could explain why (other than the fact that it might fail!).
> >
> > Ummm... Isn't that reason enough?
> >
> > One reason is that the ++cnt can be reordered by both CPU and compiler
> > to precede some of the code in "// Do something time-consuming with *x".
> > I am assuming that only one thread is permitted to execute in the loop
> > at the same time, otherwise there are many other failure mechanisms.
>
> Yes. There would have to be a barrier before ++cnt.

Or the CPU on the peripheral controller might do strict in-order
execution. But that would be equivalent to having full barriers
between all instructions, so ends up agreeing with your point.

> > > I actually do have code like this in one of my drivers. But there are
> > > significant differences from the example. In the driver T0 doesn't run on
> > > a CPU; it runs on a peripheral controller which accesses p via DMA and
> > > makes cnt available over the I/O bus (inw).
> >
> > OK, so there is only one thread T0... And so it depends on the
> > peripheral controller's memory model. I would hope that they sorted
> > out the synchronization. ;-)
>
> Me too!

;-) ;-)

Thanx, Paul

2006-09-15 19:48:33

by Alan Stern

[permalink] [raw]
Subject: Re: Uses for memory barriers

At this stage I think we're violently approaching agreement. :-)


On Thu, 14 Sep 2006, Paul E. McKenney wrote:

> > That's okay. There isn't any single notion of "This store occurred at
> > this time". Instead, for each CPU there is a notion of "This store became
> > visible to this CPU at this time".
>
> That is one way to look at it. It may or may not be the best way to
> look at it in a given situation -- for example, some stores never become
> visible to some CPUs due to the effect of later stores. For another
> example, if the software doesn't look at a given situation, it cannot
> tell whether or not a given store is visible to it.
>
> The fact that there are indeed real events happening at real time isn't
> necessarily relevant if they cannot be seen. Accepting fuzzy stores
> can decrease the size of the state space substantially. This can be an
> extremely good thing.

Okay. I'll allow that both points of view are valid and each can be
useful in its own way.


> > Again, not a problem. And for the same reason. Although if the stores
> > are all to the same location, different CPUs should _not_ have different
> > opinions about the order in which they became visible.
>
> The different CPUs might have different opinions -- in fact, they
> usually -do- have different opinions. The constraint is instead that
> their different opinions must be consistent with -at- -least- -one-
> totally ordered sequence of stores. For an example, consider three
> CPUs storing to the same location concurrently, and another CPU watching.
> The CPUs will agree on who was the last CPU to do the store, but might
> have no idea of the order of the other two stores.

The only way a CPU can have no idea of the order of the other two stores
is if it fails to see one or both of them. If it does see both stores
then of course it knows in what order it saw them.

A less ambiguous way to phrase what I said before would be: two CPUs
should not have different opinions about the order in which two stores
became visible, provided both stores were visible to both CPUs. Obviously
if a CPU doesn't see both of the stores then it can't have any opinion
about their ordering.


> > (P1): If two writes are ordered on the same CPU and two reads are
> > ordered on another CPU, and the first read sees the result of
> > the second write, then the second read will see the result of
> > the first write.
> >
> > Correct?
>
> Partially.
>
> (P1a): If two writes are ordered on one CPU, and two reads are ordered
> on another CPU, and if the first read sees the result of the
> second write, then the second read will see the result of the
> first write.
>
> (P1b): If a read is ordered before a write on each of two CPUs,
> and if the read on the second CPU sees the result of the write
> on the first CPU, then the read on the first CPU will -not-
> see the result of the write on the second CPU.
>
> (P2): All CPUs will see the results of writes to a single location
> in orderings consistent with at least one total global order
> of writes to that single location.

As you noted, your P1b is what I had called P2. It is usually not
presented in discussions about memory barriers. Nevertheless it is the
only justification for the mb() instruction; P1a talks about only rmb()
and wmb().
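
In the style of the earlier examples (all variables initially zero),
the pattern P1b covers might be sketched as:

	CPU 0		CPU 1
	-----		-----
	x = a;		y = b;
	mb();		mb();
	b = 1;		a = 1;

	assert(!(x==1 && y==1));

If CPU 1's load of b sees CPU 0's store (y==1), then CPU 0's load of
a must not have seen CPU 1's store, so x==0.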


> > I think that all you really need to do is remember that statements can be
> > reordered, that stores become visible to different CPUs at different
> > times, and that loads don't always return the value of the last visible
> > store. Everything else is intuitive -- except for the way memory barriers
> > work!
>
> Let me get this right. In your viewpoint, memory barriers are strongly
> counter-intuitive, yet you are trying to convince me to switch to your
> viewpoint? ;-)

Not the barriers themselves, but how they work! You have to agree that
their operation is deeply tied into the low-level nature of the
memory-access hardware on any particular system. Much more so than for
most instructions.


> > Yes, it's true that given sufficiently weak ordering, the writes could
> > be moved back in between the load and store implicit in the atomic
> > exchange. It's worth pointing out that even if this does occur, it will
> > not be visible to any CPU that accesses b and c only from within a
> > critical section.
>
> The other key point is that some weakly ordered CPUs were designed
> only to make spinlocks work.
>
> > But it could be visible in a situation like this:
> >
> > CPU 0 CPU 1
> > ----- -----
> > [call spin_lock(&mylock)] x = b;
> > read mylock, obtain 0 mb();
> > b = 1; [moved up] y = atomic_read(&mylock);
> > write mylock = 1
> > rmb();
> > [leave spin_lock()]
> > mb();
> > assert(!(x==1 && y==0 && c==0));
> > c = 1;
> > spin_unlock(&mylock);
> >
> > The assertion could indeed fail. HOWEVER...
> >
> > What you may not realize is that even when spin_lock() contains your
> > original full mb(), if CPU 0 reads b (instead of writing b) that read
> > could appear to another CPU to have occurred before the write to mylock.
> > Example:
> >
> > CPU 0 CPU 1
> > ----- -----
>
> All variables initially zero?

Of course.

> > [call spin_lock(&mylock)] b = 1;
> > read mylock, obtain 0 mb();
> > write mylock = 1 y = atomic_read(&mylock);
> > mb();
> > [leave spin_lock()]
> >
> > x = b + 1; mb();
> > assert(!(x==1 && y==0 && c==0));
> > c = 1;
> > spin_unlock(&mylock);
> >
> > If the assertion fails, then CPU 1 will think that CPU 0 read the value of
> > b before setting mylock to 1. But the assertion can fail! It's another
> > example of the write-mb-read pattern. If CPU 0's write to mylock crosses
> > with CPU 1's write to b, then both CPU 0's read of b and CPU 1's read of
> > mylock could obtain the old values.
> >
> > So in this respect your implementation already fails to prevent reads from
> > leaking partway out of the critical section. My implementation does the
> > same thing with writes, but otherwise is no worse.
>
> But isn't this the write-memory-barrier-read sequence that you (rightly)
> convinced me was problematic in earlier email?

Indeed it is. The problematic nature of that sequence means that in the
right conditions, a read can appear to leak partially in front of a
spin_lock().


> > > > A more correct pattern would have to look like this:
> > > >
> > > > CPU 0 CPU 1
> > > > ----- -----
> > > > while (c == 0) ; wmb();
> > > > // Earlier writes to a or b are now flushed
> > > > c = 1;
> > > >
> > > > // arbitrarily long delay with neither CPU writing to a or b
> > > >
> > > > a = 1; y = b;
> > > > wmb(); rmb();
> > > > b = 1; x = a;
> > > > assert(!(y==1 && x==0));
> > > >
> > > > In general you might say that a LOAD following rmb won't return the value
> > > > from any STORE that became visible before the rmb other than the last one.
> > > > But that leaves open the question of which STOREs actually do become
> > > > visible. I don't know how to deal with that in general.
> > >
> > > The best answer I can give is that an rmb() -by- -itself- means absolutely
> > > nothing. The only way it has any meaning is in conjunction with an
> > > mb() or a wmb() on some other CPU. The "pairwise" stuff in David's
> > > documentation.
> >
> > Even that isn't enough. You also have to have some sort of additional
> > information that can guarantee CPU 0's write to b will become visible to
> > CPU 1. If it never becomes visible then the assertion could fail.
>
> Eh? If the write to b never becomes visible, then y==0, and the assertion
> always succeeds. Or am I misreading this example?

Sorry, typo. I meant CPU 0's write to a, not its write to b. If you
can't guarantee that CPU 0's write to a will become visible to CPU 1 then
you can't guarantee the assertion will succeed.

> > But I can't think of any reasonable way to obtain such a guarantee that
> > isn't implementation-specific. Which leaves things in a rather
> > unsatisfactory state...
>
> But arbitrary delays can happen due to interrupts, preemption, etc.
> So you are simply not allowed such a guarantee.

The delay isn't important. What matters is that whatever else may happen
during the delay, neither CPU writes anything to a or b.

Strictly speaking even that's not quite right. What's needed is a
condition that can guarantee CPU 0's write to a will become visible to
CPU 1. Flushing CPU 1's store buffer may not be sufficient.


> > > In other words, if there was an rmb() in flight on one CPU, but none
> > > of the other CPUs ever executed a memory barrier of any kind, then
> > > the lone rmb() would not need to place any constraints on any CPU's
> > > memory access.
> > >
> > > Or, more accurately, if you wish to understand an rmb() in isolation,
> > > you must do so in the context of a specific hardware implementation.
> >
> > I'll buy that second description but not the first. It makes no sense to
> > say that what an rmb() instruction does to this CPU right here depends
> > somehow on what that CPU way over there is doing. Instructions are
> > supposed to act locally.
>
> A CPU might choose to note whether or not all the loads are to cachelines
> in cache and whether or not there are pending invalidates to any of the
> corresponding cachelines, and then decide to let the loads execute in
> arbitrary order. A CPU might also try executing the loads in arbitrary
> order, and undo them if some event (such as reception of an invalidate)
> occurred that invalidated the sequence chosen (speculative execution).
> Real CPUs have been implemented in both ways, and in both cases, the
> rmb() would have no effect on execution (in absence of interference from
> other CPUs).

In each of these examples an rmb() _would_ place constraints on its CPU's
memory access, by forcing the CPU to make those choices (about whether to
check for pending invalidates and allow loads in arbitrary order) in a
particular way. And it would do so whether or not any other CPUs executed
a memory barrier.

About the only reason I can think of for rmb() not placing constraints on a
CPU's memory accesses is if those accesses are already constrained to
occur in strict program order. Even then, the operation of rmb()
wouldn't depend on whether other CPUs had executed a memory barrier.

> And the CPU's instructions are -still- acting locally! ;-)
>
> And I know of no non-implementation-specific reason that instructions
> need to act locally. Yes, things normally go faster if they do act
> locally, but that would be an implementation issue.

Yes, all right. It's true that a system _could_ be designed in which CPUs
really would coordinate directly with each other as opposed to passing
messages. On such a system, rmb() could cause its CPU to ask all the
others whether they had done a wmb() recently or were going to do one in
the near future, and if the answers were all negative then the rmb() could
avoid putting any constraints on further memory accesses. In theory,
anyway -- I'm not sure such a design would be feasible in practice.

Alan Stern

2006-09-16 04:18:17

by Paul E. McKenney

[permalink] [raw]
Subject: Re: Uses for memory barriers

On Fri, Sep 15, 2006 at 03:48:31PM -0400, Alan Stern wrote:
> At this stage I think we're violently approaching agreement. :-)

;-) ;-) ;-)

> On Thu, 14 Sep 2006, Paul E. McKenney wrote:
>
> > > That's okay. There isn't any single notion of "This store occurred at
> > > this time". Instead, for each CPU there is a notion of "This store became
> > > visible to this CPU at this time".
> >
> > That is one way to look at it. It may or may not be the best way to
> > look at it in a given situation -- for example, some stores never become
> > visible to some CPUs due to the effect of later stores. For another
> > example, if the software doesn't look at a given situation, it cannot
> > tell whether or not a given store is visible to it.
> >
> > The fact that there are indeed real events happening at real time isn't
> > necessarily relevant if they cannot be seen. Accepting fuzzy stores
> > can decrease the size of the state space substantially. This can be an
> > extremely good thing.
>
> Okay. I'll allow that both points of view are valid and each can be
> useful in its own way.

Fair enough!!!

> > > Again, not a problem. And for the same reason. Although if the stores
> > > are all to the same location, different CPUs should _not_ have different
> > > opinions about the order in which they became visible.
> >
> > The different CPUs might have different opinions -- in fact, they
> > usually -do- have different opinions. The constraint is instead that
> > their different opinions must be consistent with -at- -least- -one-
> > totally ordered sequence of stores. For an example, consider three
> > CPUs storing to the same location concurrently, and another CPU watching.
> > The CPUs will agree on who was the last CPU to do the store, but might
> > have no idea of the order of the other two stores.
>
> The only way a CPU can have no idea of the order of the other two stores
> is if it fails to see one or both of them. If it does see both stores
> then of course it knows in what order it saw them.

Yep, with the proviso that one way to see a store is to be the CPU
actually executing it.

> A less ambiguous way to phrase what I said before would be: two CPUs
> should not have different opinions about the order in which two stores
> became visible, provided both stores were visible to both CPUs. Obviously
> if a CPU doesn't see both of the stores then it can't have any opinion
> about their ordering.

Good enough!

> > > (P1): If two writes are ordered on the same CPU and two reads are
> > > ordered on another CPU, and the first read sees the result of
> > > the second write, then the second read will see the result of
> > > the first write.
> > >
> > > Correct?
> >
> > Partially.
> >
> > (P1a): If two writes are ordered on one CPU, and two reads are ordered
> > on another CPU, and if the first read sees the result of the
> > second write, then the second read will see the result of the
> > first write.
> >
> > (P1b): If a read is ordered before a write on each of two CPUs,
> > and if the read on the second CPU sees the result of the write
> > on the first CPU, then the read on the first CPU will -not-
> > see the result of the write on the second CPU.
> >
> > (P2): All CPUs will see the results of writes to a single location
> > in orderings consistent with at least one total global order
> > of writes to that single location.
>
> As you noted, your P1b is what I had called P2. It is usually not
> presented in discussions about memory barriers. Nevertheless it is the
> only justification for the mb() instruction; P1a talks about only rmb()
> and wmb().

And there is a P1c:

(P1c): If one CPU does a load from A ordered before a store to B,
and if a second CPU does a store to B ordered before a
store to A, and if the first CPU's load from A gives
the value stored by the second CPU, then the first CPU's
store to B must happen after the second CPU's store to B,
hence the value stored by the first CPU persists. (Or,
for the more competitively oriented, the first CPU's store
to B "wins".)

I am mapping out a complete set -- these three might cover them all,
but need to revisit after sleeping on it. I also need to run this
by some CPU architects...

> > > I think that all you really need to do is remember that statements can be
> > > reordered, that stores become visible to different CPUs at different
> > > times, and that loads don't always return the value of the last visible
> > > store. Everything else is intuitive -- except for the way memory barriers
> > > work!
> >
> > Let me get this right. In your viewpoint, memory barriers are strongly
> > counter-intuitive, yet you are trying to convince me to switch to your
> > viewpoint? ;-)
>
> Not the barriers themselves, but how they work! You have to agree that
> their operation is deeply tied into the low-level nature of the
> memory-access hardware on any particular system. Much more so than for
> most instructions.

It certainly is safe to say that arriving at a correct but simple
description of what they do has been unexpectedly challenging...

> > > Yes, it's true that given sufficiently weak ordering, the writes could
> > > be moved back in between the load and store implicit in the atomic
> > > exchange. It's worth pointing out that even if this does occur, it will
> > > not be visible to any CPU that accesses b and c only from within a
> > > critical section.
> >
> > The other key point is that some weakly ordered CPUs were designed
> > only to make spinlocks work.
> >
> > > But it could be visible in a situation like this:
> > >
> > > CPU 0 CPU 1
> > > ----- -----
> > > [call spin_lock(&mylock)] x = b;
> > > read mylock, obtain 0 mb();
> > > b = 1; [moved up] y = atomic_read(&mylock);
> > > write mylock = 1
> > > rmb();
> > > [leave spin_lock()]
> > > mb();
> > > assert(!(x==1 && y==0 && c==0));
> > > c = 1;
> > > spin_unlock(&mylock);
> > >
> > > The assertion could indeed fail. HOWEVER...
> > >
> > > What you may not realize is that even when spin_lock() contains your
> > > original full mb(), if CPU 0 reads b (instead of writing b) that read
> > > could appear to another CPU to have occurred before the write to mylock.
> > > Example:
> > >
> > > CPU 0 CPU 1
> > > ----- -----
> >
> > All variables initially zero?
>
> Of course.

So I am paranoid. What is your point? ;-)

> > > [call spin_lock(&mylock)] b = 1;
> > > read mylock, obtain 0 mb();
> > > write mylock = 1 y = atomic_read(&mylock);
> > > mb();
> > > [leave spin_lock()]
> > >
> > > x = b + 1; mb();
> > > assert(!(x==1 && y==0 && c==0));
> > > c = 1;
> > > spin_unlock(&mylock);
> > >
> > > If the assertion fails, then CPU 1 will think that CPU 0 read the value of
> > > b before setting mylock to 1. But the assertion can fail! It's another
> > > example of the write-mb-read pattern. If CPU 0's write to mylock crosses
> > > with CPU 1's write to b, then both CPU 0's read of b and CPU 1's read of
> > > mylock could obtain the old values.
> > >
> > > So in this respect your implementation already fails to prevent reads from
> > > leaking partway out of the critical section. My implementation does the
> > > same thing with writes, but otherwise is no worse.
> >
> > But isn't this the write-memory-barrier-read sequence that you (rightly)
> > convinced me was problematic in earlier email?
>
> Indeed it is. The problematic nature of that sequence means that in the
> right conditions, a read can appear to leak partially in front of a
> spin_lock().

The above sequence can reasonably be interpreted as having that effect.
Good reason to be -very- careful when modifying data used by a given
critical section without the protection of the corresponding lock.

> > > > > A more correct pattern would have to look like this:
> > > > >
> > > > > CPU 0 CPU 1
> > > > > ----- -----
> > > > > while (c == 0) ; wmb();
> > > > > // Earlier writes to a or b are now flushed
> > > > > c = 1;

If CPU 0 might also have stored to a and b, you also need the following:

	CPU 0			CPU 1
	-----			-----
	wmb()			while (d == 0);
	d = 1;

Interestingly enough, this resembles one aspect of realtime
synchronize_rcu() where there are no barriers in rcu_read_lock()
and rcu_read_unlock(), and for roughly the same reason -- except
that mb() rather than wmb() is required for synchronize_rcu().
And synchronize_rcu() uses counters rather than binary variables in
order to allow the code to be "retriggered" after use.

> > > > > // arbitrarily long delay with neither CPU writing to a or b
> > > > >
> > > > > a = 1; y = b;
> > > > > wmb(); rmb();
> > > > > b = 1; x = a;
> > > > > assert(!(y==1 && x==0));
> > > > >
> > > > > In general you might say that a LOAD following rmb won't return the value
> > > > > from any STORE that became visible before the rmb other than the last one.
> > > > > But that leaves open the question of which STOREs actually do become
> > > > > visible. I don't know how to deal with that in general.
> > > >
> > > > The best answer I can give is that an rmb() -by- -itself- means absolutely
> > > > nothing. The only way it has any meaning is in conjunction with an
> > > > mb() or a wmb() on some other CPU. The "pairwise" stuff in David's
> > > > documentation.
> > >
> > > Even that isn't enough. You also have to have some sort of additional
> > > information that can guarantee CPU 0's write to b will become visible to
> > > CPU 1. If it never becomes visible then the assertion could fail.
> >
> > Eh? If the write to b never becomes visible, then y==0, and the assertion
> > always succeeds. Or am I misreading this example?
>
> Sorry, typo. I meant CPU 0's write to a, not its write to b. If you
> can't guarantee that CPU 0's write to a will become visible to CPU 1 then
> you can't guarantee the assertion will succeed.

Ah!

But if CPU 0's write to a is indefinitely delayed, then, by virtue of
CPU 0's wmb(), CPU 0's write to b must also be indefinitely delayed.
In this case, y==0&&x==0, so the assertion again always succeeds.

> > > But I can't think of any reasonable way to obtain such a guarantee that
> > > isn't implementation-specific. Which leaves things in a rather
> > > unsatisfactory state...
> >
> > But arbitrary delays can happen due to interrupts, preemption, etc.
> > So you are simply not allowed such a guarantee.
>
> The delay isn't important. What matters is that whatever else may happen
> during the delay, neither CPU writes anything to a or b.
>
> Strictly speaking even that's not quite right. What's needed is a
> condition that can guarantee CPU 0's write to a will become visible to
> CPU 1. Flushing CPU 1's store buffer may not be sufficient.

The guarantee is that b=1 will not become visible before a=1 is visible.
This guarantee is sufficient to cause the assertion to succeed. I think.

> > > > In other words, if there was an rmb() in flight on one CPU, but none
> > > > of the other CPUs ever executed a memory barrier of any kind, then
> > > > the lone rmb() would not need to place any constraints on any CPU's
> > > > memory access.
> > > >
> > > > Or, more accurately, if you wish to understand an rmb() in isolation,
> > > > you must do so in the context of a specific hardware implementation.
> > >
> > > I'll buy that second description but not the first. It makes no sense to
> > > say that what an rmb() instruction does to this CPU right here depends
> > > somehow on what that CPU way over there is doing. Instructions are
> > > supposed to act locally.
> >
> > A CPU might choose to note whether or not all the loads are to cachelines
> > in cache and whether or not there are pending invalidates to any of the
> > corresponding cachelines, and then decide to let the loads execute in
> > arbitrary order. A CPU might also try executing the loads in arbitrary
> > order, and undo them if some event (such as reception of an invalidate)
> > occurred that invalidated the sequence chosen (speculative execution).
> > Real CPUs have been implemented in both ways, and in both cases, the
> > rmb() would have no effect on execution (in absence of interference from
> > other CPUs).
>
> In each of these examples an rmb() _would_ place constraints on its CPU's
> memory access, by forcing the CPU to make those choices (about whether to
> check for pending invalidates and allow loads in arbitrary order) in a
> particular way. And it would do so whether or not any other CPUs executed
> a memory barrier.

Yes, my example differed from my earlier description -- but the underlying
point manages to survive, which is that the work performed by rmb() can
depend on what the other CPUs are doing (stores in my example, as opposed
to memory barriers in the description), and that in some situations, the
rmb() need impose no constraint on the order of loads on the CPU executing
the rmb().

For example, if CPU 0 executed x=a;rmb();y=b with a and b both initially
zero, and if both a and b are in cache, and if x and y are registers,
and if there are no pending invalidates, then CPU 0 would be within its
rights to load b before loading a.

Why? Because there is no way that any other CPU could tell that CPU 0
had misbehaved.

> > About the only reason I can think of for rmb() not placing constraints on a
> CPU's memory accesses is if those accesses are already constrained to
> occur in strict program order.

If there is no way for other CPUs to detect the misordering, why should
CPU 0 be prohibited from doing the misordering? Perhaps prior loads
are queued up against the cache bank containing b, while the cache bank
containing a is idle. Why delay the load of a unnecessarily?

> Even then, the operation of rmb()
> wouldn't depend on whether other CPUs had executed a memory barrier.

For the modern CPU -implementations- I am aware of, rmb()'s effect on
ordering of loads would indeed -not- depend on whether other CPUs had
executed a memory barrier. However, at the conceptual level, if the
CPU executing an rmb() was somehow telepathically aware that none of
the other CPUs were executing memory barriers anytime near the present,
then the CPU could simply ignore the rmb().

I agree that no real CPU is going to implement this telepathic awareness,
at least not until quantum computers become real (though I have been
surprised before). ;-)

However, as noted above, there might well be real CPUs that ignore rmb()s
when there are no interfering writes.

> > And the CPU's instructions are -still- acting locally! ;-)
> >
> > And I know of no non-implementation-specific reason that instructions
> > need to act locally. Yes, things normally go faster if they do act
> > locally, but that would be an implementation issue.
>
> Yes, all right. It's true that a system _could_ be designed in which CPUs
> really would coordinate directly with each other as opposed to passing
> messages. On such a system, rmb() could cause its CPU to ask all the
> others whether they had done a wmb() recently or were going to do one in
> the near future, and if the answers were all negative then the rmb() could
> avoid putting any constraints on further memory accesses. In theory,
> anyway -- I'm not sure such a design would be feasible in practice.

Agreed!

However, CPUs can (and I suspect actually do) ignore memory barriers in
cases where there are no pending interfering writes.

Thanx, Paul

2006-09-16 15:28:41

by Alan Stern

[permalink] [raw]
Subject: Re: Uses for memory barriers

On Fri, 15 Sep 2006, Paul E. McKenney wrote:

> > > (P1a): If two writes are ordered on one CPU, and two reads are ordered
> > > on another CPU, and if the first read sees the result of the
> > > second write, then the second read will see the result of the
> > > first write.
> > >
> > > (P1b): If a read is ordered before a write on each of two CPUs,
> > > and if the read on the second CPU sees the result of the write
> > > on the first CPU, then the read on the first CPU will -not-
> > > see the result of the write on the second CPU.
> > >
> > > (P2): All CPUs will see the results of writes to a single location
> > > in orderings consistent with at least one total global order
> > > of writes to that single location.
> >
> > As you noted, your P1b is what I had called P2. It is usually not
> > presented in discussions about memory barriers. Nevertheless it is the
> > only justification for the mb() instruction; P1a talks about only rmb()
> > and wmb().
>
> And there is a P1c:
>
> (P1c): If one CPU does a load from A ordered before a store to B,
> and if a second CPU does a store to B ordered before a
> store to A, and if the first CPU's load from A gives
> the value stored by the second CPU, then the first CPU's
> store to B must happen after the second CPU's store to B,
> hence the value stored by the first CPU persists. (Or,
> for the more competitively oriented, the first CPU's store
> to B "wins".)
>
> I am mapping out a complete set -- these three might cover them all,
> but need to revisit after sleeping on it. I also need to run this
> by some CPU architects...

I wonder if it really is possible to find a complete set. Perhaps
examples can become arbitrarily complex and only some formal logical
system would be able to generate all of them. If that's the case then I
would like to know what that logical system is.

In any case, it is important to distinguish carefully between the two
classes of effects caused by memory barriers:

They prevent CPUs from reordering certain types of instructions;

They cause certain loads to obtain the results of certain stores.

The first is more of an effect on the CPUs whereas the second is more
of an effect on the caches.


> > > But isn't this the write-memory-barrier-read sequence that you (rightly)
> > > convinced me was problematic in earlier email?
> >
> > Indeed it is. The problematic nature of that sequence means that in the
> > right conditions, a read can appear to leak partially in front of a
> > spin_lock().
>
> The above sequence can reasonably be interpreted as having that effect.
> Good reason to be -very- careful when modifying data used by a given
> critical section without the protection of the corresponding lock.

The write-mb-read sequence is a good example showing how naive reasoning
about memory barriers can give wrong answers.


> > Sorry, typo. I meant CPU 0's write to a, not its write to b. If you
> > can't guarantee that CPU 0's write to a will become visible to CPU 1 then
> > you can't guarantee the assertion will succeed.
>
> Ah!
>
> But if CPU 0's write to a is indefinitely delayed, then, by virtue of
> CPU 0's wmb(), CPU 0's write to b must also be indefinitely delayed.
> In this case, y==0&&x==0, so the assertion again always succeeds.

There are other reasons for a write not becoming visible besides being
indefinitely delayed. It could be masked entirely by a separate write.

> The guarantee is that b=1 will not become visible before a=1 is visible.
> This guarantee is sufficient to cause the assertion to succeed. I think.

No -- the guarantee is that _if_ a=1 and b=1 both become visible then
a=1 will become visible first. But it's possible for b=1 to become
visible and a=1 not to, which would go against your wording of the
guarantee.


> > In each of these examples an rmb() _would_ place constraints on its CPU's
> > memory access, by forcing the CPU to make those choices (about whether to
> > check for pending invalidates and allow loads in arbitrary order) in a
> > particular way. And it would do so whether or not any other CPUs executed
> > a memory barrier.
>
> Yes, my example differed from my earlier description -- but the underlying
> point manages to survive, which is that the work performed by rmb() can
> depend on what the other CPUs are doing (stores in my example, as opposed
> to memory barriers in the description), and that in some situations, the
> rmb() need impose no constraint on the order of loads on the CPU executing
> the rmb().
>
> For example, if CPU 0 executed x=a;rmb();y=b with a and b both initially
> zero, and if both a and b are in cache, and if x and y are registers,
> and if there are no pending invalidates, then CPU 0 would be within its
> rights to load b before loading a.
>
> Why? Because there is no way that any other CPU could tell that CPU 0
> had misbehaved.

I think now we're getting into semantics. If rmb() causes a CPU (or
cache) to modify its behavior based on what invalidates are pending,
is this behavior change local to the CPU (& cache) or not? Checking
the queue of pending invalidates is indeed a local behavior, but the
contents of that queue will depend on what other CPUs do.

So I can reasonably say that rmb() causes a CPU to change its behavior
based on its invalidate queue and this is a local change (given the
same contents of the queue the CPU will always behave the same way,
regardless of what other CPUs are doing). And you can also reasonably
say that rmb() causes a CPU to adjust its behavior according to the
actions of other CPUs, because those actions will be reflected in the
contents of the invalidate queue.

> > About the only reason I can think of for rmb() not placing constraints on a
> > CPU's memory accesses is if those accesses are already constrained to
> > occur in strict program order.
>
> If there is no way for other CPUs to detect the misordering, why should
> CPU 0 be prohibited from doing the misordering? Perhaps prior loads
> are queued up against the cache bank containing b, while the cache bank
> containing a is idle. Why delay the load of a unnecessarily?

Ah, but how does the CPU _know_ there is no way for other CPUs to
detect the misordering? Semantics again.

> For the modern CPU -implementations- I am aware of, rmb()'s effect on
> ordering of loads would indeed -not- depend on whether other CPUs had
> executed a memory barrier. However, at the conceptual level, if the
> CPU executing an rmb() was somehow telepathically aware that none of
> the other CPUs were executing memory barriers anytime near the present,
> then the CPU could simply ignore the rmb().
>
> I agree that no real CPU is going to implement this telepathic awareness,
> at least not until quantum computers become real (though I have been
> surprised before). ;-)
>
> However, as noted above, there might well be real CPUs that ignore rmb()s
> when there are no interfering writes.

In the model we've been talking about, where the CPU sends requests to
the cache and the cache carries them out, I think the CPU can't afford
to ignore rmb()s. That is, only the cache knows whether there are any
interfering writes, so the CPU has to respect the ordering enforced
by the rmb(). Of course, the cache would then be free to interchange
the loads.

Agreed, other architectural models could behave differently.

Alan

2006-09-18 19:12:49

by Paul E. McKenney

[permalink] [raw]
Subject: Re: Uses for memory barriers

On Sat, Sep 16, 2006 at 11:28:39AM -0400, Alan Stern wrote:
> On Fri, 15 Sep 2006, Paul E. McKenney wrote:
>
> > > > (P1a): If two writes are ordered on one CPU, and two reads are ordered
> > > > on another CPU, and if the first read sees the result of the
> > > > second write, then the second read will see the result of the
> > > > first write.
> > > >
> > > > (P1b): If a read is ordered before a write on each of two CPUs,
> > > > and if the read on the second CPU sees the result of the write
> > > > on the first CPU, then the read on the first CPU will -not-
> > > > see the result of the write on the second CPU.
> > > >
> > > > (P2): All CPUs will see the results of writes to a single location
> > > > in orderings consistent with at least one total global order
> > > > of writes to that single location.
> > >
> > > As you noted, your P1b is what I had called P2. It is usually not
> > > presented in discussions about memory barriers. Nevertheless it is the
> > > only justification for the mb() instruction; P1a talks about only rmb()
> > > and wmb().
> >
> > And there is a P1c:
> >
> > (P1c): If one CPU does a load from A ordered before a store to B,
> > and if a second CPU does a store to B ordered before a
> > store to A, and if the first CPU's load from A gives
> > the value stored by the second CPU, then the first CPU's
> > store to B must happen after the second CPU's store to B,
> > hence the value stored by the first CPU persists. (Or,
> > for the more competitively oriented, the first CPU's store
> > to B "wins".)
> >
> > I am mapping out a complete set -- these three might cover them all,
> > but need to revisit after sleeping on it. I also need to run this
> > by some CPU architects...
>
> I wonder if it really is possible to find a complete set. Perhaps
> examples can become arbitrarily complex and only some formal logical
> system would be able to generate all of them. If that's the case then I
> would like to know what that logical system is.

Here is my nomination for the complete set:

A|B B|A

L|L L|L -- Independent of ordering
L|L L|S -- Pointless, since there is no second store
L|L S|L -- Pointless, since there is no second store
L|L S|S -- first pairing

L|S L|L -- (DUP) single-store pointless
L|S L|S -- second pairing
L|S S|L -- problematic store-barrier-load
L|S S|S -- If CPU 0's load sees CPU 1's 2nd store, then CPU 0's
store wins. (third pairing)

S|L L|L -- (DUP) single-store pointless
S|L L|S -- (DUP) problematic store-barrier-load
S|L S|L -- (DUP) problematic store-barrier-load
S|L S|S -- (DUP) problematic store-barrier-load

S|S L|L -- (DUP) first pairing
S|S L|S -- (DUP) winning store.
S|S S|L -- (DUP) problematic store-barrier-load
S|S S|S -- Pointless, no loads

The first column is CPU 0's actions, the second is CPU 1's actions.
"S" is store, "L" is load. CPU 0 accesses A first, then B, while
CPU 1 does the opposite. The commentary is my guess at what happens.

I believe that more complex cases can be handled by superposition.
For example, if CPU 0 loads A, does a barrier, stores B and C, while
CPU 1 loads B and C, does a barrier, and stores A, then one can apply
the "second pairing" twice, once for A and B, and again for A and C.
If B and C are the same variable, then P2 would also apply.
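
In code, that example might look like this (all variables initially
zero):

	CPU 0		CPU 1
	-----		-----
	x = a;		y = b;
	mb();		z = c;
	b = 1;		mb();
	c = 1;		a = 1;

	assert(!(x==1 && (y==1 || z==1)));

The A/B pairing says that if y==1 then x==0, and the A/C pairing says
that if z==1 then x==0.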

Counterexamples? ;-)

> In any case, it is important to distinguish carefully between the two
> classes of effects caused by memory barriers:
>
> They prevent CPUs from reordering certain types of instructions;
>
> They cause certain loads to obtain the results of certain stores.
>
> The first is more of an effect on the CPUs whereas the second is more
> of an effect on the caches.

From where I sit, the second is the architectural constraint, while the
first is implementation-specific. So I am very concerned about the
second, but worried only about the first as needed to give examples.

> > > > But isn't this the write-memory-barrier-read sequence that you (rightly)
> > > > convinced me was problematic in earlier email?
> > >
> > > Indeed it is. The problematic nature of that sequence means that in the
> > > right conditions, a read can appear to leak partially in front of a
> > > spin_lock().
> >
> > The above sequence can reasonably be interpreted as having that effect.
> > Good reason to be -very- careful when modifying data used by a given
> > critical section without the protection of the corresponding lock.
>
> The write-mb-read sequence is a good example showing how naive reasoning
> about memory barriers can give wrong answers.

That too!!! ;-)

> > > Sorry, typo. I meant CPU 0's write to a, not its write to b. If you
> > > can't guarantee that CPU 0's write to a will become visible to CPU 1 then
> > > you can't guarantee the assertion will succeed.
> >
> > Ah!
> >
> > But if CPU 0's write to a is indefinitely delayed, then, by virtue of
> > CPU 0's wmb(), CPU 0's write to b must also be indefinitely delayed.
> > In this case, y==0&&x==0, so the assertion again always succeeds.
>
> There are other reasons for a write not becoming visible besides being
> indefinitely delayed. It could be masked entirely by a separate write.

OK. I was assuming that there was no such separate write. After all,
you didn't show it on your diagram.

If there were an additional write, then (P2) would apply if it came from
CPU 0. If it came from some other CPU, then the circumstances of that
other write would be needed to work out what happens.

> > The guarantee is that b=1 will not become visible before a=1 is visible.
> > This guarantee is sufficient to cause the assertion to succeed. I think.
>
> No -- the guarantee is that _if_ a=1 and b=1 both become visible then
> a=1 will become visible first. But it's possible for b=1 to become
> visible and a=1 not to, which would go against your wording of the
> guarantee.

Good point -- though I might choose to explicitly specify that there are
no other writes to the variables instead. Then use an example showing
that if other writes are possible, they need to be accounted for.

> > > In each of these examples an rmb() _would_ place constraints on its CPU's
> > > memory access, by forcing the CPU to make those choices (about whether to
> > > check for pending invalidates and allow loads in arbitrary order) in a
> > > particular way. And it would do so whether or not any other CPUs executed
> > > a memory barrier.
> >
> > Yes, my example differed from my earlier description -- but the underlying
> > point manages to survive, which is that the work performed by rmb() can
> > depend on what the other CPUs are doing (stores in my example, as opposed
> > to memory barriers in the description), and that in some situations, the
> > rmb() need impose no constraint on the order of loads on the CPU executing
> > the rmb().
> >
> > For example, if CPU 0 executed x=a;rmb();y=b with a and b both initially
> > zero, and if both a and b are in cache, and if x and y are registers,
> > and if there are no pending invalidates, then CPU 0 would be within its
> > rights to load b before loading a.
> >
> > Why? Because there is no way that any other CPU could tell that CPU 0
> > had misbehaved.
>
> I think now we're getting into semantics. If rmb() causes a CPU (or
> cache) to modify its behavior based on what invalidates are pending,
> is this behavior change local to the CPU (& cache) or not? Checking
> the queue of pending invalidates is indeed a local behavior, but the
> contents of that queue will depend on what other CPUs do.
>
> So I can reasonably say that rmb() causes a CPU to change its behavior
> based on its invalidate queue and this is a local change (given the
> same contents of the queue the CPU will always behave the same way,
> regardless of what other CPUs are doing). And you can also reasonably
> say that rmb() causes a CPU to adjust its behavior according to the
> actions of other CPUs, because those actions will be reflected in the
> contents of the invalidate queue.

That does seem to be about the size of it.

> > > About the only reason I can think of for rmb() not placing constraints on a
> > > CPU's memory accesses is if those accesses are already constrained to
> > > occur in strict program order.
> >
> > If there is no way for other CPUs to detect the misordering, why should
> > CPU 0 be prohibited from doing the misordering? Perhaps prior loads
> > are queued up against the cache bank containing b, while the cache bank
> > containing a is idle. Why delay the load of a unnecessarily?
>
> Ah, but how does the CPU _know_ there is no way for other CPUs to
> detect the misordering? Semantics again.

In an efficient hardware implementation, by inspecting its local state,
of course! ;-)

> > For the modern CPU -implementations- I am aware of, rmb()'s effect on
> > ordering of loads would indeed -not- depend on whether other CPUs had
> > executed a memory barrier. However, at the conceptual level, if the
> > CPU executing an rmb() was somehow telepathically aware that none of
> > the other CPUs were executing memory barriers anytime near the present,
> > then the CPU could simply ignore the rmb().
> >
> > I agree that no real CPU is going to implement this telepathic awareness,
> > at least not until quantum computers become real (though I have been
> > surprised before). ;-)
> >
> > However, as noted above, there might well be real CPUs that ignore rmb()s
> > when there are no interfering writes.
>
> In the model we've been talking about, where the CPU sends requests to
> the cache and the cache carries them out, I think the CPU can't afford
> to ignore rmb()s. That is, only the cache knows whether there are any
> interfering writes, so the CPU has to respect the ordering enforced
> by the rmb(). Of course, the cache would then be free to interchange
> the loads.
>
> Agreed, other architectural models could behave differently.

So one approach would be to describe the abstract ordering principles,
show how they work in code snippets, and then give a couple "bookend"
examples of hardware being creative about adhering to these principles.

Seem reasonable?

Thanx, Paul

2006-09-18 20:13:46

by Alan Stern

[permalink] [raw]
Subject: Re: Uses for memory barriers

On Mon, 18 Sep 2006, Paul E. McKenney wrote:

> > I wonder if it really is possible to find a complete set. Perhaps
> > examples can become arbitrarily complex and only some formal logical
> > system would be able to generate all of them. If that's the case then I
> > would like to know what that logical system is.
>
> Here is my nomination for the complete set:
>
> A|B B|A
>
> L|L L|L -- Independent of ordering
> L|L L|S -- Pointless, since there is no second store
> L|L S|L -- Pointless, since there is no second store
> L|L S|S -- first pairing
>
> L|S L|L -- (DUP) single-store pointless
> L|S L|S -- second pairing
> L|S S|L -- problematic store-barrier-load
> L|S S|S -- If CPU 0's load sees CPU 1's 2nd store, then CPU 0's
> store wins. (third pairing)
>
> S|L L|L -- (DUP) single-store pointless
> S|L L|S -- (DUP) problematic store-barrier-load
> S|L S|L -- (DUP) problematic store-barrier-load
> S|L S|S -- (DUP) problematic store-barrier-load
>
> S|S L|L -- (DUP) first pairing
> S|S L|S -- (DUP) winning store.
> S|S S|L -- (DUP) problematic store-barrier-load
> S|S S|S -- Pointless, no loads
>
> The first column is CPU 0's actions, the second is CPU 1's actions.
> "S" is store, "L" is load. CPU 0 accesses A first, then B, while
> CPU 1 does the opposite. The commentary is my guess at what happens.
>
> I believe that more complex cases can be handled by superposition.
> For example, if CPU 0 loads A, does a barrier, stores B and C, while
> CPU 1 loads B and C, does a barrier, and stores A, then one can apply
> the "second pairing" twice, once for A and B, and again for A and C.
> If B and C are the same variable, then P2 would also apply.
>
> Counterexamples? ;-)

You only considered actions by two CPUs, and you only considered orderings
induced by the memory barriers. What about interactions among multiple
CPUs or orderings caused by control dependencies? Two examples:

CPU 0 CPU 1 CPU 2
----- ----- -----
x = a; y = b; z = c;
mb(); mb(); mb();
b = 1; c = 1; a = 1;
assert(x==0 || y==0 || z==0);


CPU 0 CPU 1
----- -----
x = a; while (b == 0) ;
mb(); a = 1;
b = 1;
assert(x==0);

I suspect your scheme can't handle either of these. In fact, I rather
suspect that the "partial ordering" approach is fundamentally incapable of
generating all possible correct deductions about memory access
requirements, and that something more like my "every action happens at a
specific time" approach is necessary. That is, after all, what led me to
consider it in the first place.


> > In any case, it is important to distinguish carefully between the two
> > classes of effects caused by memory barriers:
> >
> > They prevent CPUs from reordering certain types of instructions;
> >
> > They cause certain loads to obtain the results of certain stores.
> >
> > The first is more of an effect on the CPUs whereas the second is more
> > of an effect on the caches.
>
> From where I sit, the second is the architectural constraint, while the
> first is implementation-specific. So I am very concerned about the
> second, but worried only about the first as needed to give examples.

It's true that this distinction depends very much on the system design.
For the model where CPUs send requests to caches which then carry them
out, the first effect is quite important.


> > There are other reasons for a write not becoming visible besides being
> > indefinitely delayed. It could be masked entirely by a separate write.
>
> OK. I was assuming that there was no such separate write. After all,
> you didn't show it on your diagram.

I had mentioned the possibility in the text earlier. Something like this:

CPU 0 CPU 1
----- -----
a = 0; // Suppose this gets hung up in a store buffer
// for an indefinite length of time. Then later...

// Here's where the standard example starts
y = b; a = 1;
rmb(); wmb();
x = a; b = 1;
assert(!(y==1 && x==0)); // Might fail

It's a matter of boundary conditions. So far the only stated boundary
condition has been that initially all variables are equal to 0. There has
also been an unstated condition that there are no writes to the variables
during the time period in question other than the ones shown in the
example, whether by these CPUs or any others.

In the example above, the boundary conditions are all satisfied. It's
still true that a is initially equal to 0, and the first line above is
supposed to execute before the time period in question for the standard
example. Nevertheless, the standard example fails.

Some stronger boundary condition is needed. Something along the lines of:
All writes to the variables preceding the time period in question have
completed when the example starts. That ought to be sufficient to
guarantee that each of the writes will become visible eventually to the
other CPU.


> So one approach would be to describe the abstract ordering principles,
> show how they work in code snippets, and then give a couple "bookend"
> examples of hardware being creative about adhering to these principles.
>
> Seem reasonable?

Yes.

Alan

2006-09-19 00:47:03

by Paul E. McKenney

[permalink] [raw]
Subject: Re: Uses for memory barriers

On Mon, Sep 18, 2006 at 04:13:42PM -0400, Alan Stern wrote:
> On Mon, 18 Sep 2006, Paul E. McKenney wrote:
>
> > > I wonder if it really is possible to find a complete set. Perhaps
> > > examples can become arbitrarily complex and only some formal logical
> > > system would be able to generate all of them. If that's the case then I
> > > would like to know what that logical system is.
> >
> > Here is my nomination for the complete set:
> >
> > A|B B|A
> >
> > L|L L|L -- Independent of ordering
> > L|L L|S -- Pointless, since there is no second store
> > L|L S|L -- Pointless, since there is no second store
> > L|L S|S -- first pairing
> >
> > L|S L|L -- (DUP) single-store pointless
> > L|S L|S -- second pairing
> > L|S S|L -- problematic store-barrier-load
> > L|S S|S -- If CPU 0's load sees CPU 1's 2nd store, then CPU 0's
> > store wins. (third pairing)
> >
> > S|L L|L -- (DUP) single-store pointless
> > S|L L|S -- (DUP) problematic store-barrier-load
> > S|L S|L -- (DUP) problematic store-barrier-load
> > S|L S|S -- (DUP) problematic store-barrier-load
> >
> > S|S L|L -- (DUP) first pairing
> > S|S L|S -- (DUP) winning store.
> > S|S S|L -- (DUP) problematic store-barrier-load
> > S|S S|S -- Pointless, no loads
> >
> > The first column is CPU 0's actions, the second is CPU 1's actions.
> > "S" is store, "L" is load. CPU 0 accesses A first, then B, while
> > CPU 1 does the opposite. The commentary is my guess at what happens.
> >
> > I believe that more complex cases can be handled by superposition.
> > For example, if CPU 0 loads A, does a barrier, stores B and C, while
> > CPU 1 loads B and C, does a barrier, and stores A, then one can apply
> > the "second pairing" twice, once for A and B, and again for A and C.
> > If B and C are the same variable, then P2 would also apply.
> >
> > Counterexamples? ;-)

Restating (and renumbering) the principles, all assuming no other stores
than those mentioned, assuming no aliasing among variables, and assuming
that each store changes the value of the target variable:

(P0): Each CPU sees its own stores and loads as occurring in program
order.

(P1): If each CPU performs a series of stores to a single shared variable,
then the series of values obtained by a given CPU's stores and
loads must be consistent with that obtained by each of the other
CPUs. It may or may not be possible to deduce a single global
order from the full set of such series.

(P2a): If two stores to A and B are ordered on CPU 0, and two loads from
B and A are ordered on CPU 1, and if CPU 1's load from B sees
CPU 0's store to B, then CPU 1's load from A must see CPU 0's
store to A.

(P2b): If a load from A is ordered before a store to B on CPU 0, and a
load from B is ordered before a store to A on CPU 1, then if
CPU 1's load from B sees CPU 0's store to B, then CPU 0's load
from A cannot see CPU 1's store to A.

(P2c): If a load from A is ordered before a store to B on CPU 0, and a
store to B is ordered before a store to A on CPU 1, and if CPU 0's
load from A gives the value stored to A by CPU 1, then CPU 1's
store to B must be ordered before CPU 0's store to B, so that
the final value of B is that stored by CPU 0.

> You only considered actions by two CPUs, and you only considered orderings
> induced by the memory barriers. What about interactions among multiple
> CPUs or orderings caused by control dependencies? Two examples:
>
> CPU 0 CPU 1 CPU 2
> ----- ----- -----

As always, assuming all variables initially zero. ;-)

> x = a; y = b; z = c;
> mb(); mb(); mb();
> b = 1; c = 1; a = 1;
> assert(x==0 || y==0 || z==0);

1. On CPU 0, either x=a sees CPU 2's a=1 (and by P2b sees z=c and sets
x==1) or does not (and thus might see either z=1 or z=0, but
sets x==0).

2. On CPU 2, z=c might see CPU 1's c=1 (and would by P2b also see
y=b, but it does not look).

3. CPU 1's situation is similar to that of CPU 2.

So, yes, some other principle would be required if we wanted to prove that
the assertion always succeeded. Do we need this assertion to succeed?
(Sure, it would be a bit mind-bending if it were permitted to fail,
but that would not be unheard of when dealing with memory barriers.)
One litmus test is "would assertion failure break normal locking?".
So, let's apply that test.

However, note that the locking sequence differs significantly from the
locking scenario that drove memory ordering on many CPUs -- in the locking
scenario, the lock variable itself would be able to make use of P1.
To properly use locking, we end up with something like the following,
all variables initially zero:

CPU 0 CPU 1 CPU 2
----- ----- -----
testandset(l) testandset(l) testandset(l)
mb() mb() mb()
x = a; y = b; z = c;
mb() mb() mb()
clear(l) clear(l) clear(l)
testandset(l) testandset(l) testandset(l)
mb() mb() mb()
b = 1; c = 1; a = 1;
mb() mb() mb()
clear(l) clear(l) clear(l)
assert(x==0 || y==0 || z==0);

The key additional point is that all CPUs see the series of values
taken on by the lockword "l". The load implied by a given testandtest(),
other than the first one, must see the store implied by the prior clear().
So far, nothing different. But let's take a look at the ordering 1, 2, 0.

At the handoff from CPU 1 to CPU 2, CPU 2 has seen CPU 1's clear(l), and
thus must see CPU 1's assignment to y (by P2b). At the handoff from
CPU 2 to CPU 0, again, CPU 0 has seen CPU 2's clear(l), thus CPU 2's
assignment to z. However, because "l" is a single variable, we can
now use P1, so that CPU 0 has also seen the effect of CPU 1's earlier
clear(l), and thus also CPU 1's y=b. Continuing this analysis shows that
the assertion always succeeds in this case.

A simpler sequence illustrating this is as follows:

CPU 0 CPU 1 CPU 2 CPU 3
----- ----- ----- -----
a = 1 while (b < 1); while (b < 2); y = c;
mb() mb() mb(); mb();
b = 1 b = 2; c = 1; x = a;
assert(y == 0 || x == 1)

The fact that we have the single variable "b" tracing through this
sequence allows P1 to be applied.

So, is there a burning need for the assertion in your original example
to succeed? If not, fewer principles allow Linux to deal with a larger
variety of CPUs.

> CPU 0 CPU 1
> ----- -----

Just to be obnoxious, again noting all variables initially zero. ;-)

> x = a; while (b == 0) ;
> mb(); a = 1;
> b = 1;
> assert(x==0);

Yep, here you are relying on a control dependency, and control dependencies
are not recognized by the principles. And, possibly, also not by some
CPUs...

The fix would be to insert an mb() just after CPU 1's "while" loop.
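
That is, something like (all variables initially zero):

	CPU 0		CPU 1
	-----		-----
	x = a;		while (b == 0) ;
	mb();		mb();
	b = 1;		a = 1;
	assert(x==0);

With the mb() in place this is just P2b: CPU 1's load of b in the
loop is ordered before its store to a, so if that load sees b==1,
CPU 0's load of a cannot see a==1.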

> I suspect your scheme can't handle either of these. In fact, I rather
> suspect that the "partial ordering" approach is fundamentally incapable of
> generating all possible correct deductions about memory access
> requirements, and that something more like my "every action happens at a
> specific time" approach is necessary. That is, after all, what led me to
> consider it in the first place.

Agreed, these assertions cannot be proven using the principles outlined
above. Is this necessarily bad? In other words, are there real situations
where we need these assertions to hold, and do all CPUs really meet them?

> > > In any case, it is important to distinguish carefully between the two
> > > classes of effects caused by memory barriers:
> > >
> > > They prevent CPUs from reordering certain types of instructions;
> > >
> > > They cause certain loads to obtain the results of certain stores.
> > >
> > > The first is more of an effect on the CPUs whereas the second is more
> > > of an effect on the caches.
> >
> > From where I sit, the second is the architectural constraint, while the
> > first is implementation-specific. So I am very concerned about the
> > second, but worried only about the first as needed to give examples.
>
> It's true that this distinction depends very much on the system design.
> For the model where CPUs send requests to caches which then carry them
> out, the first effect is quite important.
>
>
> > > There are other reasons for a write not becoming visible besides being
> > > indefinitely delayed. It could be masked entirely by a separate write.
> >
> > OK. I was assuming that there was no such separate write. After all,
> > you didn't show it on your diagram.
>
> I had mentioned the possibility in the text earlier. Something like this:
>
> CPU 0 CPU 1
> ----- -----
> a = 0; // Suppose this gets hung up in a store buffer
> // for an indefinite length of time. Then later...
>
> // Here's where the standard example starts
> y = b; a = 1;
> rmb(); wmb();
> x = a; b = 1;
> assert(!(y==1 && x==0)); // Might fail
>
> It's a matter of boundary conditions. So far the only stated boundary
> condition has been that initially all variables are equal to 0. There has
> also been an unstated condition that there are no writes to the variables
> during the time period in question other than the ones shown in the
> example, whether by these CPUs or any others.

Touche on my failing to cover all the boundary conditions! ;-)

> In the example above, the boundary conditions are all satisfied. It's
> still true that a is initially equal to 0, and the first line above is
> supposed to execute before the time period in question for the standard
> example. Nevertheless, the standard example fails.
>
> Some stronger boundary condition is needed. Something along the lines of:
> All writes to the variables preceding the time period in question have
> completed when the example starts. That ought to be sufficient to
> guarantee that each of the writes will become visible eventually to the
> other CPU.

Or: "there are no stores other than those shown in the example".

> > So one approach would be to describe the abstract ordering principles,
> > show how they work in code snippets, and then give a couple "bookend"
> > examples of hardware being creative about adhering to these principles.
> >
> > Seem reasonable?
>
> Yes.

Sounds good!

Thanx, Paul

2006-09-19 16:04:17

by Alan Stern

[permalink] [raw]
Subject: Re: Uses for memory barriers

On Mon, 18 Sep 2006, Paul E. McKenney wrote:

> Restating (and renumbering) the principles, all assuming no other stores
> than those mentioned, assuming no aliasing among variables, and assuming
> that each store changes the value of the target variable:
>
> (P0): Each CPU sees its own stores and loads as occurring in program
> order.
>
> (P1): If each CPU performs a series of stores to a single shared variable,
> then the series of values obtained by a given CPU's stores and
> loads must be consistent with that obtained by each of the other
> CPUs. It may or may not be possible to deduce a single global
> order from the full set of such series.

Suppose three CPUs respectively write the values 1, 2, and 3 to a single
variable. Are you saying that some CPU A might see the values 1,2 (in
that order), CPU B might see 2,3 (in that order), and CPU C might see 3,1
(in that order)? Each CPU's view would be consistent with each of the
others but there would not be any global order.

Somehow I don't think that's what you intended. In general the actual
situation is much messier, with some writes masking others for some CPUs
in such a way that whenever two CPUs both see the same two writes, they
see them in the same order. Is that all you meant to say?
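
(If one wanted to poke at this empirically, a throwaway probe along the
following lines -- relaxed C11 atomics only, all names invented -- records
each observer's pair of reads.  Single-location coherence should make a
mutually cyclic set of pairs such as (1,2), (2,3), (3,1) impossible; one
run proves nothing, of course, so in practice you would loop this many
millions of times and reset v between iterations.)

#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

static atomic_int v;                    /* the single shared variable */

static void *writer(void *arg)
{
        atomic_store_explicit(&v, (int)(intptr_t)arg, memory_order_relaxed);
        return NULL;
}

static void *observer(void *arg)
{
        /* Two back-to-back relaxed reads; read-read coherence means the
         * second read can never return an older store than the first. */
        int first = atomic_load_explicit(&v, memory_order_relaxed);
        int second = atomic_load_explicit(&v, memory_order_relaxed);

        printf("observer %ld saw %d then %d\n", (long)(intptr_t)arg,
               first, second);
        return NULL;
}

int main(void)
{
        pthread_t w[3], o[3];
        intptr_t i;

        for (i = 0; i < 3; i++)
                pthread_create(&o[i], NULL, observer, (void *)i);
        for (i = 0; i < 3; i++)
                pthread_create(&w[i], NULL, writer, (void *)(i + 1));
        for (i = 0; i < 3; i++) {
                pthread_join(w[i], NULL);
                pthread_join(o[i], NULL);
        }
        return 0;
}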

> (P2a): If two stores to A and B are ordered on CPU 0, and two loads from
> B and A are ordered on CPU 1, and if CPU 1's load from B sees
> CPU 0's store to B, then CPU 1's load from A must see CPU 0's
> store to A.
>
> (P2b): If a load from A is ordered before a store to B on CPU 0, and a
> load from B is ordered before a store to A on CPU 1, then if
> CPU 1's load from B sees CPU 0's store to B, then CPU 0's load
> from A cannot see CPU 1's store to A.
>
> (P2c): If a load from A is ordered before a store to B on CPU 0, and a
> store to B is ordered before a store to A on CPU 1, and if CPU 0's
> load from A gives the value stored to A by CPU 1, then CPU 1's
> store to B must be ordered before CPU 0's store to B, so that
> the final value of B is that stored by CPU 0.

Implicit in P2a and P2b is that events can be ordered only if they all
take place on the same CPU. Now P2c introduces the idea of ordering two
events that occur on different CPUs. This sounds to me like it's heading
for trouble.

You probably could revise P2c slightly and then be happy with P2[a-c].
They wouldn't be complete, although they might be adequate for all normal
uses of memory barriers.

> > You only considered actions by two CPUs, and you only considered orderings
> > induced by the memory barriers. What about interactions among multiple
> > CPUs or orderings caused by control dependencies? Two examples:
> >
> > CPU 0 CPU 1 CPU 2
> > ----- ----- -----
>
> As always, assuming all variables initially zero. ;-)
>
> > x = a; y = b; z = c;
> > mb(); mb(); mb();
> > b = 1; c = 1; a = 1;
> > assert(x==0 || y==0 || z==0);
>
> 1. On CPU 0, either x=a sees CPU 2's a=1 (and by P2b sees z=c and sets
> x==1) or does not (and thus might see either z=1 or z=0, but
> sets x==0).
>
> 2. On CPU 2, z=c might see CPU 1's c=1 (and would by P2b also see
> y=b, but it does not look).
>
> 3. CPU 1's situation is similar to that of CPU 2.
>
> So, yes, some other principle would be required if we wanted to prove that
> the assertion always succeeded. Do we need this assertion to succeed?
> (Sure, it would be a bit mind-bending if it were permitted to fail,
> but that would not be unheard of when dealing with memory barriers.)
> One litmus test is "would assertion failure break normal locking?".
> So, let's apply that test.
>
> However, note that the locking sequence differs significantly from the
> locking scenario that drove memory ordering on many CPUs -- in the locking
> scenario, the lock variable itself would be able to make use of P1.
> To properly use locking, we end up with something like the following,
> all variables initially zero:
>
> CPU 0 CPU 1 CPU 2
> ----- ----- -----
> testandset(l) testandset(l) testandset(l)
> mb() mb() mb()
> x = a; y = b; z = c;
> mb() mb() mb()
> clear(l) clear(l) clear(l)
> testandset(l) testandset(l) testandset(l)
> mb() mb() mb()
> b = 1; c = 1; a = 1;
> mb() mb() mb()
> clear(l) clear(l) clear(l)
> assert(x==0 || y==0 || z==0);
>
> The key additional point is that all CPUs see the series of values
> taken on by the lockword "l".  The load implied by a given testandset(),
> other than the first one, must see the store implied by the prior clear().
> So far, nothing different. But let's take a look at the ordering 1, 2, 0.
>
> At the handoff from CPU 1 to CPU 2, CPU 2 has seen CPU 1's clear(l), and
> thus must see CPU 1's assignment to y (by P2b). At the handoff from
> CPU 2 to CPU 0, again, CPU 0 has seen CPU 2's clear(l), thus CPU 2's
> assignment to z. However, because "l" is a single variable, we can
> now use P1, so that CPU 0 has also seen the effect of CPU 1's earlier
> clear(l), and thus also CPU 1's y=b. Continuing this analysis shows that
> the assertion always succeeds in this case.
>
> A simpler sequence illustrating this is as follows:
>
> CPU 0 CPU 1 CPU 2 CPU 3
> ----- ----- ----- -----
> a = 1 while (b < 1); while (b < 2); y = c;
> mb() mb() mb(); mb();
> b = 1 b = 2; c = 1; x = a;
> assert(y == 0 || x == 1)
>
> The fact that we have the single variable "b" tracing through this
> sequence allows P1 to be applied.

Isn't it possible that the write to c becomes visible on CPU 3 before the
write to a? In fact, isn't this essentially the same as one of the
examples in your manuscript?

> So, is there a burning need for the assertion in your original example
> to succeed?  If not, fewer principles allow Linux to deal with a larger
> variety of CPUs.
>
> > CPU 0 CPU 1
> > ----- -----
>
> Just to be obnoxious, again noting all variables initially zero. ;-)
>
> > x = a; while (b == 0) ;
> > mb(); a = 1;
> > b = 1;
> > assert(x==0);
>
> Yep, here you are relying on a control dependency, and control dependencies
> are not recognized by the principles. And, possibly, also not by some
> CPUs...

Not recognizing a control dependency like this would be tantamount to
doing a speculative write. Any CPU indulging in such fancies isn't likely
to survive very long.

> The fix would be to insert an mb() just after CPU 1's "while" loop.
>
> > I suspect your scheme can't handle either of these. In fact, I rather
> > suspect that the "partial ordering" approach is fundamentally incapable of
> > generating all possible correct deductions about memory access
> > requirements, and that something more like my "every action happens at a
> > specific time" approach is necessary. That is, after all, what led me to
> > consider it in the first place.
>
> Agreed, these assertions cannot be proven using the principles outlined
> above. Is this necessarily bad? In other words, are there real situations
> where we need these assertions to hold, and do all CPUs really meet them?

As far as real situations are concerned, you could probably get away with
nothing more than P2a and P2b, maybe also P2c. Gaining a clear and
precise understanding is a different story, though.

For instance, it's not clear to what extent the "ordering" in your scheme
is transitive. Part of the problem is that the events you order are
individual loads and stores. There's no acknowledgement of the fact that
a single store can become visible to different CPUs at different times (or
not at all).

I would much prefer to see an analysis where the primitive notions include
things like "This store becomes visible to that CPU before this load
occurs". It would bear a closer correspondence to what actually happens
in the hardware.

Alan

2006-09-19 16:38:18

by Nick Piggin

[permalink] [raw]
Subject: Re: Uses for memory barriers

Alan Stern wrote:
> On Mon, 18 Sep 2006, Paul E. McKenney wrote:
>
>
>>Restating (and renumbering) the principles, all assuming no other stores
>>than those mentioned, assuming no aliasing among variables, and assuming
>>that each store changes the value of the target variable:
>>
>>(P0): Each CPU sees its own stores and loads as occurring in program
>> order.
>>
>>(P1): If each CPU performs a series of stores to a single shared variable,
>>        then the series of values obtained by a given CPU's stores and
>> loads must be consistent with that obtained by each of the other
>> CPUs. It may or may not be possible to deduce a single global
>> order from the full set of such series.
>
>
> Suppose three CPUs respectively write the values 1, 2, and 3 to a single
> variable. Are you saying that some CPU A might see the values 1,2 (in
> that order), CPU B might see 2,3 (in that order), and CPU C might see 3,1
> (in that order)? Each CPU's view would be consistent with each of the
> others but there would not be any global order.
>
> Somehow I don't think that's what you intended. In general the actual
> situation is much messier, with some writes masking others for some CPUs
> in such a way that whenever two CPUs both see the same two writes, they
> see them in the same order. Is that all you meant to say?

I don't think that need be the case if one of the CPUs that has written
the variable forwards the store to a subsequent load before the store
reaches the cache-coherency protocol (I could be wrong here). If that is
the case, then your example above could indeed arise.

But if I'm wrong there, I think Paul's statement holds even if all
stores to a single cacheline are always instantly coherent (and thus do
have some global ordering). Consider a variation on your example where
one CPU loads 1,2 and another loads 1,3. What's the order?

--
SUSE Labs, Novell Inc.

2006-09-19 17:40:17

by Alan Stern

[permalink] [raw]
Subject: Re: Uses for memory barriers

On Wed, 20 Sep 2006, Nick Piggin wrote:

> >>(P1): If each CPU performs a series of stores to a single shared variable,
> >>        then the series of values obtained by a given CPU's stores and
> >> loads must be consistent with that obtained by each of the other
> >> CPUs. It may or may not be possible to deduce a single global
> >> order from the full set of such series.
> >
> >
> > Suppose three CPUs respectively write the values 1, 2, and 3 to a single
> > variable. Are you saying that some CPU A might see the values 1,2 (in
> > that order), CPU B might see 2,3 (in that order), and CPU C might see 3,1
> > (in that order)? Each CPU's view would be consistent with each of the
> > others but there would not be any global order.
> >
> > Somehow I don't think that's what you intended. In general the actual
> > situation is much messier, with some writes masking others for some CPUs
> > in such a way that whenever two CPUs both see the same two writes, they
> > see them in the same order. Is that all you meant to say?
>
> I don't think that need be the case if one of the CPUs that has written
> the variable forwards the store to a subsequent load before the store
> reaches the cache-coherency protocol (I could be wrong here). If that is
> the case, then your example above could indeed arise.

I don't understand your comment. Are you saying it's possible for two
CPUs to observe the same two writes and see them occurring in opposite
orders?

> But if I'm wrong there, I think Paul's statement holds even if all
> stores to a single cacheline are always instantly coherent (and thus do
> have some global ordering). Consider a variation on your example where
> one CPU loads 1,2 and another loads 1,3. What's the order?

Again I don't follow. If one CPU sees 1,2 and another sees 1,3 then there
are two possible global orderings: 1,2,3 and 1,3,2. Both are consistent
with what each CPU sees. If a third CPU sees 2,3 then the only consistent
ordering is 1,2,3.

But in the example I gave there are no global orderings consistent with
all the observations.  Nevertheless, my example isn't ruled out by what
Paul wrote. So could my example arise on a real system?

Alan

2006-09-19 17:51:35

by Nick Piggin

[permalink] [raw]
Subject: Re: Uses for memory barriers

Alan Stern wrote:
> On Wed, 20 Sep 2006, Nick Piggin wrote:

>>I don't think that need be the case if one of the CPUs that has written
>>the variable forwards the store to a subsequent load before the store
>>reaches the cache-coherency protocol (I could be wrong here). If that is
>>the case, then your example above could indeed arise.
>
>
> I don't understand your comment. Are you saying it's possible for two
> CPUs to observe the same two writes and see them occurring in opposite
> orders?

If store forwarding is able to occur outside the cache-coherency protocol,
then I don't see why not. I would also be interested to know if this
is the case on real systems.

--
SUSE Labs, Novell Inc.

2006-09-19 18:15:36

by Paul E. McKenney

[permalink] [raw]
Subject: Re: Uses for memory barriers

On Tue, Sep 19, 2006 at 12:04:15PM -0400, Alan Stern wrote:
> On Mon, 18 Sep 2006, Paul E. McKenney wrote:
>
> > Restating (and renumbering) the principles, all assuming no other stores
> > than those mentioned, assuming no aliasing among variables, and assuming
> > that each store changes the value of the target variable:
> >
> > (P0): Each CPU sees its own stores and loads as occurring in program
> > order.
> >
> > (P1): If each CPU performs a series of stores to a single shared variable,
> >         then the series of values obtained by a given CPU's stores and
> > loads must be consistent with that obtained by each of the other
> > CPUs. It may or may not be possible to deduce a single global
> > order from the full set of such series.
>
> Suppose three CPUs respectively write the values 1, 2, and 3 to a single
> variable. Are you saying that some CPU A might see the values 1,2 (in
> that order), CPU B might see 2,3 (in that order), and CPU C might see 3,1
> (in that order)? Each CPU's view would be consistent with each of the
> others but there would not be any global order.

Right... Good catch!!!

> Somehow I don't think that's what you intended. In general the actual
> situation is much messier, with some writes masking others for some CPUs
> in such a way that whenever two CPUs both see the same two writes, they
> see them in the same order. Is that all you meant to say?

I did indeed intend to say that there should be at least one global
ordering consistent with all the series of values.  How about the
following?

(P1): If each CPU performs a series of stores to a single shared variable,
        then the series of values obtained by a given CPU's stores
        and loads must be consistent with that obtained by each of the
        other CPUs.  It may or may not be possible to deduce a single
        global order from the full set of such series; however, there
        will be at least one global order that is consistent with each
        CPU's observed series of values.

> > (P2a): If two stores to A and B are ordered on CPU 0, and two loads from
> > B and A are ordered on CPU 1, and if CPU 1's load from B sees
> > CPU 0's store to B, then CPU 1's load from A must see CPU 0's
> > store to A.
> >
> > (P2b): If a load from A is ordered before a store to B on CPU 0, and a
> > load from B is ordered before a store to A on CPU 1, then if
> > CPU 1's load from B sees CPU 0's store to B, then CPU 0's load
> > from A cannot see CPU 1's store to A.
> >
> > (P2c): If a load from A is ordered before a store to B on CPU 0, and a
> > store to B is ordered before a store to A on CPU 1, and if CPU 0's
> > load from A gives the value stored to A by CPU 1, then CPU 1's
> > store to B must be ordered before CPU 0's store to B, so that
> > the final value of B is that stored by CPU 0.
>
> Implicit in P2a and P2b is that events can be ordered only if they all
> take place on the same CPU. Now P2c introduces the idea of ordering two
> events that occur on different CPUs. This sounds to me like it's heading
> for trouble.
>
> You probably could revise P2c slightly and then be happy with P2[a-c].
> They wouldn't be complete, although they might be adequate for all normal
> uses of memory barriers.

Good point -- perhaps I should just drop P2c entirely. I am running it by
some CPU architects to get their thoughts on it. But if we don't need it,
we should not specify it. Keep Linux portable!!! ;-)

> > > You only considered actions by two CPUs, and you only considered orderings
> > > induced by the memory barriers. What about interactions among multiple
> > > CPUs or orderings caused by control dependencies? Two examples:
> > >
> > > CPU 0 CPU 1 CPU 2
> > > ----- ----- -----
> >
> > As always, assuming all variables initially zero. ;-)
> >
> > > x = a; y = b; z = c;
> > > mb(); mb(); mb();
> > > b = 1; c = 1; a = 1;
> > > assert(x==0 || y==0 || z==0);
> >
> > 1. On CPU 0, either x=a sees CPU 2's a=1 (and by P2b sees z=c and sets
> > x==1) or does not (and thus might see either z=1 or z=0, but
> > sets x==0).
> >
> > 2. On CPU 2, z=c might see CPU 1's c=1 (and would by P2b also see
> > y=b, but it does not look).
> >
> > 3. CPU 1's situation is similar to that of CPU 2.
> >
> > So, yes, some other principle would be required if we wanted to prove that
> > the assertion always succeeded. Do we need this assertion to succeed?
> > (Sure, it would be a bit mind-bending if it were permitted to fail,
> > but that would not be unheard of when dealing with memory barriers.)
> > One litmus test is "would assertion failure break normal locking?".
> > So, let's apply that test.
> >
> > However, note that the locking sequence differs significantly from the
> > locking scenario that drove memory ordering on many CPUs -- in the locking
> > scenario, the lock variable itself would be able to make use of P1.
> > To properly use locking, we end up with something like the following,
> > all variables initially zero:
> >
> > CPU 0 CPU 1 CPU 2
> > ----- ----- -----
> > testandset(l) testandset(l) testandset(l)
> > mb() mb() mb()
> > x = a; y = b; z = c;
> > mb() mb() mb()
> > clear(l) clear(l) clear(l)
> > testandset(l) testandset(l) testandset(l)
> > mb() mb() mb()
> > b = 1; c = 1; a = 1;
> > mb() mb() mb()
> > clear(l) clear(l) clear(l)
> > assert(x==0 || y==0 || z==0);
> >
> > The key additional point is that all CPUs see the series of values
> > taken on by the lockword "l".  The load implied by a given testandset(),
> > other than the first one, must see the store implied by the prior clear().
> > So far, nothing different. But let's take a look at the ordering 1, 2, 0.
> >
> > At the handoff from CPU 1 to CPU 2, CPU 2 has seen CPU 1's clear(l), and
> > thus must see CPU 1's assignment to y (by P2b). At the handoff from
> > CPU 2 to CPU 0, again, CPU 0 has seen CPU 2's clear(l), thus CPU 2's
> > assignment to z. However, because "l" is a single variable, we can
> > now use P1, so that CPU 0 has also seen the effect of CPU 1's earlier
> > clear(l), and thus also CPU 1's y=b. Continuing this analysis shows that
> > the assertion always succeeds in this case.
> >
> > A simpler sequence illustrating this is as follows:
> >
> > CPU 0 CPU 1 CPU 2 CPU 3
> > ----- ----- ----- -----
> > a = 1 while (b < 1); while (b < 2); y = c;
> > mb() mb() mb(); mb();
> > b = 1 b = 2; c = 1; x = a;
> > assert(y == 0 || x == 1)
> >
> > The fact that we have the single variable "b" tracing through this
> > sequence allows P1 to be applied.
>
> Isn't it possible that the write to c becomes visible on CPU 3 before the
> write to a? In fact, isn't this essentially the same as one of the
> examples in your manuscript?

No clue. But I certainly didn't get this example right. How about
the following instead?

CPU 0 CPU 1 CPU 2
----- ----- -----
a = 1 while (l < 1); z = l;
mb(); b = 1; mb();
l = 1 mb(); y = b;
l = 2; x = a;
assert(l != 2 || (x == 1 && y == 1));

Here "l" is standing in for the lock variable -- if we acquire the lock,
we must see all the stores from all preceding critical sections for
that lock. A similar example would show that any stores CPU 2 executes
after its mb() cannot affect any loads that CPUs 0 and 1 executed prior
to their mb()s.

> > So, is there a burning need for the assertion in your original example
> > to succeed?  If not, fewer principles allow Linux to deal with a larger
> > variety of CPUs.
> >
> > > CPU 0 CPU 1
> > > ----- -----
> >
> > Just to be obnoxious, again noting all variables initially zero. ;-)
> >
> > > x = a; while (b == 0) ;
> > > mb(); a = 1;
> > > b = 1;
> > > assert(x==0);
> >
> > Yep, here you are relying on a control dependency, and control dependencies
> > are not recognized by the principles. And, possibly, also not by some
> > CPUs...
>
> Not recognizing a control dependency like this would be tantamount to
> doing a speculative write. Any CPU indulging in such fancies isn't likely
> to survive very long.

You might well be correct, though I would want to ask an Alpha expert.
In any case, there are a number of CPUs that do allow speculative -reads-
past control dependencies. My fear is that people will over-generalize
the "can't do speculative writes" lesson into "cannot speculate
past a conditional branch". So I would be OK saying that a memory
barrier is required in the above example -- but would be interested
in counter-examples where such code is desperately needed, and the
overhead of a memory barrier cannot be tolerated.

> > The fix would be to insert an mb() just after CPU 1's "while" loop.
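
With that fix in place, a C11 rendering might look like this (again just
a sketch: seq_cst fences standing in for mb(), thread harness invented):

#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static atomic_int a, b;                 /* both initially zero */

static void *cpu0(void *unused)
{
        int x = atomic_load_explicit(&a, memory_order_relaxed);

        atomic_thread_fence(memory_order_seq_cst);      /* mb() */
        atomic_store_explicit(&b, 1, memory_order_relaxed);
        assert(x == 0);         /* guaranteed once CPU 1 has its fence */
        return NULL;
}

static void *cpu1(void *unused)
{
        while (atomic_load_explicit(&b, memory_order_relaxed) == 0)
                ;                                       /* spin */
        atomic_thread_fence(memory_order_seq_cst);      /* the added mb() */
        atomic_store_explicit(&a, 1, memory_order_relaxed);
        return NULL;
}

int main(void)
{
        pthread_t t0, t1;

        pthread_create(&t0, NULL, cpu0, NULL);
        pthread_create(&t1, NULL, cpu1, NULL);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);
        return 0;
}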
> >
> > > I suspect your scheme can't handle either of these. In fact, I rather
> > > suspect that the "partial ordering" approach is fundamentally incapable of
> > > generating all possible correct deductions about memory access
> > > requirements, and that something more like my "every action happens at a
> > > specific time" approach is necessary. That is, after all, what led me to
> > > consider it in the first place.
> >
> > Agreed, these assertions cannot be proven using the principles outlined
> > above. Is this necessarily bad? In other words, are there real situations
> > where we need these assertions to hold, and do all CPUs really meet them?
>
> As far as real situations are concerned, you could probably get away with
> nothing more than P2a and P2b, maybe also P2c. Gaining a clear and
> precise understanding is a different story, though.

Yep, I suspect that it might be OK to ditch P2c from a "Linux absolutely
relies on this" viewpoint. "But it seems so harmless!" ;-)

And I agree that examples drawn from real
hardware will be helpful. Or even examples drawn from imaginary "hostile"
hardware. ;-)

> For instance, it's not clear to what extent the "ordering" in your scheme
> is transitive. Part of the problem is that the events you order are
> individual loads and stores. There's no acknowledgement of the fact that
> a single store can become visible to different CPUs at different times (or
> not at all).

I was trying to get at this with the "see the value stored" verbiage.
Any thoughts on ways to make this more clear? Perhaps have the principles
first, followed by longer explanations and examples.

And transitivity is painful -- the Alpha "dereferencing does not imply
ordering" example being a similar case in point!

> I would much prefer to see an analysis where the primitive notions include
> things like "This store becomes visible to that CPU before this load
> occurs". It would bear a closer correspondence to what actually happens
> in the hardware.

I will think on better wording.

Thanx, Paul

2006-09-19 18:18:32

by Paul E. McKenney

[permalink] [raw]
Subject: Re: Uses for memory barriers

On Wed, Sep 20, 2006 at 03:51:29AM +1000, Nick Piggin wrote:
> Alan Stern wrote:
> >On Wed, 20 Sep 2006, Nick Piggin wrote:
>
> >>I don't think that need be the case if one of the CPUs that has written
> >>the variable forwards the store to a subsequent load before the store
> >>reaches the cache-coherency protocol (I could be wrong here). If that is
> >>the case, then your example above could indeed arise.
> >
> >I don't understand your comment. Are you saying it's possible for two
> >CPUs to observe the same two writes and see them occurring in opposite
> >orders?
>
> If store forwarding is able to occur outside the cache-coherency protocol,
> then I don't see why not. I would also be interested to know if this
> is the case on real systems.

We are discussing multiple writes to the same variable, correct?

Just checking...

Thanx, Paul

2006-09-19 18:48:52

by Nick Piggin

[permalink] [raw]
Subject: Re: Uses for memory barriers

Paul E. McKenney wrote:
> On Wed, Sep 20, 2006 at 03:51:29AM +1000, Nick Piggin wrote:

>>If store forwarding is able to occur outside the cache-coherency protocol,
>>then I don't see why not. I would also be interested to know if this
>>is the case on real systems.
>
>
> We are discussing multiple writes to the same variable, correct?
>
> Just checking...

Correct.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2006-09-19 19:35:16

by Paul E. McKenney

[permalink] [raw]
Subject: Re: Uses for memory barriers

On Wed, Sep 20, 2006 at 04:48:45AM +1000, Nick Piggin wrote:
> Paul E. McKenney wrote:
> >On Wed, Sep 20, 2006 at 03:51:29AM +1000, Nick Piggin wrote:
>
> >>If store forwarding is able to occur outside the cache-coherency protocol,
> >>then I don't see why not. I would also be interested to know if this
> >>is the case on real systems.
> >
> >
> >We are discussing multiple writes to the same variable, correct?
> >
> >Just checking...
>
> Correct.

I am having a hard time seeing how this would happen.

Sooner or later, the cacheline comes to the store queue, defining
the ordering. All changes that occurred in the store queue while
waiting for the cache line appear to other CPUs as having happened
in very quick succession while the cacheline resides with the store
queue in question.

So, what am I missing?

Thanx, Paul

2006-09-19 19:48:09

by Nick Piggin

[permalink] [raw]
Subject: Re: Uses for memory barriers

Paul E. McKenney wrote:
> On Wed, Sep 20, 2006 at 04:48:45AM +1000, Nick Piggin wrote:
> Sooner or later, the cacheline comes to the store queue, defining
> the ordering. All changes that occurred in the store queue while
> waiting for the cache line appear to other CPUs as having happened
> in very quick succession while the cacheline resides with the store
> queue in question.
>
> So, what am I missing?

Maybe I'm missing something. But if the same CPU loads the value
before the store becomes visible to cache coherency, it might see
the value out of order with respect to what the other CPUs see.

--
SUSE Labs, Novell Inc.

2006-09-19 20:00:26

by Paul E. McKenney

[permalink] [raw]
Subject: Re: Uses for memory barriers

On Wed, Sep 20, 2006 at 05:48:00AM +1000, Nick Piggin wrote:
> Paul E. McKenney wrote:
> >On Wed, Sep 20, 2006 at 04:48:45AM +1000, Nick Piggin wrote:
> >Sooner or later, the cacheline comes to the store queue, defining
> >the ordering. All changes that occurred in the store queue while
> >waiting for the cache line appear to other CPUs as having happened
> >in very quick succession while the cacheline resides with the store
> >queue in question.
> >
> >So, what am I missing?
>
> Maybe I'm missing something. But if the same CPU loads the value
> before the store becomes visible to cache coherency, it might see
> the value out of order with respect to what the other CPUs see.

Agreed. But the CPUs would have to refer to a fine-grained synchronized
timebase or to some other variable in order to detect the fact that there
were in fact multiple different values for the same variable at the same
time (held in the different store queues).

If the CPUs looked only at that one single variable being stored to,
could they have inconsistent opinions about the order of values that
this single variable took on? My belief is that they could not.

Thanx, Paul

2006-09-19 20:38:21

by Alan Stern

[permalink] [raw]
Subject: Re: Uses for memory barriers

On Tue, 19 Sep 2006, Paul E. McKenney wrote:

> > Maybe I'm missing something. But if the same CPU loads the value
> > before the store becomes visible to cache coherency, it might see
> > the value out of order with respect to what the other CPUs see.
>
> Agreed. But the CPUs would have to refer to a fine-grained synchronized
> timebase or to some other variable in order to detect the fact that there
> were in fact multiple different values for the same variable at the same
> time (held in the different store queues).

Even that wouldn't be illegal. No one ever said that any particular write
becomes visible to all CPUs at the same time.

> If the CPUs looked only at that one single variable being stored to,
> could they have inconsistent opinions about the order of values that
> this single variable took on? My belief is that they could not.

Yes, I think this must be right. If a store is hung up in a CPU's store
buffer, it will mask later stores by other CPUs (i.e., prevent them from
becoming visible to the CPU that owns the store buffer). Hence all stores
that _do_ become visible will appear in a consistent order.

But my knowledge of outlandish hardware is extremely limited, so don't
take my word as gospel.

Alan

2006-09-20 19:39:24

by Alan Stern

[permalink] [raw]
Subject: Re: Uses for memory barriers

On Tue, 19 Sep 2006, Paul E. McKenney wrote:

> I did indeed intend to say that there should be at least one global
> ordering consistent with all the series of values.  How about the
> following?
>
> (P1): If each CPU performs a series of stores to a single shared variable,
>         then the series of values obtained by a given CPU's stores
>         and loads must be consistent with that obtained by each of the
>         other CPUs.  It may or may not be possible to deduce a single
>         global order from the full set of such series; however, there
>         will be at least one global order that is consistent with each
>         CPU's observed series of values.

How about just

(P1): If two CPUs both see the same two stores to a single shared
variable, then they will see the stores in the same order.


> Good point -- perhaps I should just drop P2c entirely. I am running it by
> some CPU architects to get their thoughts on it. But if we don't need it,
> we should not specify it. Keep Linux portable!!! ;-)

What P2c really tries to do is specify conditions for which of two stores
to a single location will be visible last. To do it properly requires
specifying an order for events that take place on more than one CPU,
something you've tried to avoid doing until now.


> No clue. But I certainly didn't get this example right. How about
> the following instead?
>
> CPU 0 CPU 1 CPU 2
> ----- ----- -----
> a = 1 while (l < 1); z = l;
> mb(); b = 1; mb();
> l = 1 mb(); y = b;
> l = 2; x = a;
> assert(l != 2 || (x == 1 && y == 1));

You probably meant to write "z" in the assertion rather than "l".

> Here "l" is standing in for the lock variable -- if we acquire the lock,
> we must see all the stores from all preceding critical sections for
> that lock. A similar example would show that any stores CPU 2 executes
> after its mb() cannot affect any loads that CPUs 0 and 1 executed prior
> to their mb()s.

Right. To do this properly requires a more fully developed notion of
ordering than you've been using, and it also requires a notion of control
dependencies for writes.

> > Not recognizing a control dependency like this would be tantamount to
> > doing a speculative write. Any CPU indulging in such fancies isn't likely
> > to survive very long.
>
> You might well be correct, though I would want to ask an Alpha expert.
> In any case, there are a number of CPUs that do allow speculative -reads-
> past control dependencies. My fear is that people will over-generalize
> the "can't do speculative writes" lesson into "cannot speculate
> past a conditional branch".

I think that's important enough to be worth emphasizing, so that people
won't get it wrong.

> So I would be OK saying that a memory
> barrier is required in the above example -- but would be interested
> in counter-examples where such code is desperately needed, and the
> overhead of a memory barrier cannot be tolerated.

I don't have any good examples.


> I was trying to get at this with the "see the value stored" verbiage.
> Any thoughts on ways to make this more clear? Perhaps have the principles
> first, followed by longer explanations and examples.
>
> And transitivity is painful -- the Alpha "dereferencing does not imply
> ordering" example being a similar case in point!
>
> > I would much prefer to see an analysis where the primitive notions include
> > things like "This store becomes visible to that CPU before this load
> > occurs". It would bear a closer correspondence to what actually happens
> > in the hardware.
>
> I will think on better wording.

My thoughts have been moving in this direction:

Describe everything in terms of a single global ordering. It's
easier to think about, and there shouldn't be a state-space
explosion because you can always confine your attention to the
events you care about and ignore the others.

Reads take place at a particular time (when the return value is
committed to) but writes become visible at different times to
different CPUs (such as when they respond to the invalidate
message).

A CPU's own writes become visible to itself as soon as they
execute. They become visible to other CPUs at the same time or
later, but not before. A write might not ever become visible
to some CPUs.

A read will always return the value of a store that has already
become visible to that CPU, but not necessarily the latest such
store.

So far everything is straightforward. The difficult part arises because
multiple stores to the same location can mask and overwrite one another.

If a read returns the value from a particular store, then later
reads (on the same CPU and of the same location) will never return
values from stores that became visible earlier than that one.
(The value has overwritten any earlier stores.)

A store can be masked from a particular CPU by earlier or later
stores to the same location (perhaps made by other CPUs). A store
is masked from a CPU iff it never becomes visible to that CPU.

In this setting we can describe the operation of the various sorts of
memory barriers. Apart from the obvious effect of forcing instructions to
apparently execute in program order, we might have the following:

If two stores are separated by wmb() then on any CPU where they
both become visible, they become visible in the order they were
executed.

If multiple stores to the same location are visible when a CPU
executes rmb(), then after the rmb() the latest store has
overwritten all the earlier ones.

If a CPU executes read-mb-store, then no other read on any CPU
can return the value of the store before this read completes.
(I'd like to say that the read completes before the store
becomes visible on any CPU; I'm not sure that either is right.
What about on the executing CPU?)

But this isn't complete. It doesn't say enough about when one write may
or may not mask another, or the fact that this occurs in such a way that
all CPUs will agree on the order of the stores to a single location they
see in common.

And certainly there should be something saying that one way or another,
stores do eventually complete. Given a long enough time with no other
accesses, some single store to a given location should become visible to
every CPU.

Alan

2006-09-21 01:34:10

by Paul E. McKenney

[permalink] [raw]
Subject: Re: Uses for memory barriers

On Wed, Sep 20, 2006 at 03:39:22PM -0400, Alan Stern wrote:
> On Tue, 19 Sep 2006, Paul E. McKenney wrote:
>
> > I did indeed intend to say that there should be at least one global
> > ordering consistent with all the series of values.  How about the
> > following?
> >
> > (P1): If each CPU performs a series of stores to a single shared variable,
> >         then the series of values obtained by a given CPU's stores
> >         and loads must be consistent with that obtained by each of the
> >         other CPUs.  It may or may not be possible to deduce a single
> >         global order from the full set of such series; however, there
> >         will be at least one global order that is consistent with each
> >         CPU's observed series of values.
>
> How about just
>
> (P1): If two CPUs both see the same two stores to a single shared
> variable, then they will see the stores in the same order.
>
>
> > Good point -- perhaps I should just drop P2c entirely. I am running it by
> > some CPU architects to get their thoughts on it. But if we don't need it,
> > we should not specify it. Keep Linux portable!!! ;-)
>
> What P2c really tries to do is specify conditions for which of two stores
> to a single location will be visible last. To do it properly requires
> specifying an order for events that take place on more than one CPU,
> something you've tried to avoid doing until now.

That is what I was thinking, but then I realized that P2c is absolutely
required for locking. For example:

CPU 0 CPU 1
----- -----
spin_lock(&l); spin_lock(&l);
x = 0; x = 1;
spin_unlock(&l); spin_unlock(&l);

Whichever CPU acquires the lock last had better be the one whose value
"sticks" in x. So, if I see that the other CPU has released its lock,
then any writes that I do need to override whatever writes the other CPU
did within its critical section.
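
As a trivial POSIX rendering of what P2c buys us (illustrative only --
any real locking primitive already supplies these barriers):

#include <pthread.h>
#include <stdint.h>

static pthread_mutex_t l = PTHREAD_MUTEX_INITIALIZER;
static int x;

static void *store_val(void *arg)
{
        pthread_mutex_lock(&l);
        x = (int)(intptr_t)arg; /* one plain store in the critical section */
        pthread_mutex_unlock(&l);
        return NULL;
}

int main(void)
{
        pthread_t t0, t1;

        pthread_create(&t0, NULL, store_val, (void *)0);
        pthread_create(&t1, NULL, store_val, (void *)1);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);
        /* x now holds the value stored by whichever thread locked last;
         * if an earlier critical section's store could overwrite a later
         * one, locking would be useless. */
        return 0;
}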

> > No clue. But I certainly didn't get this example right. How about
> > the following instead?
> >
> > CPU 0 CPU 1 CPU 2
> > ----- ----- -----
> > a = 1 while (l < 1); z = l;
> > mb(); b = 1; mb();
> > l = 1 mb(); y = b;
> > l = 2; x = a;
> > assert(l != 2 || (x == 1 && y == 1));
>
> You probably meant to write "z" in the assertion rather than "l".

That I did... As follows:

CPU 0 CPU 1 CPU 2
----- ----- -----
a = 1 while (l < 1); z = l;
mb(); b = 1; mb();
l = 1 mb(); y = b;
l = 2; x = a;
assert(z != 2 || (x == 1 && y == 1));
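
For what it is worth, a C11 sketch of this example -- seq_cst fences
standing in for the mb()s, everything else invented -- would be:

#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static atomic_int a, b, l;              /* all initially zero */

static void *cpu0(void *unused)
{
        atomic_store_explicit(&a, 1, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);      /* mb() */
        atomic_store_explicit(&l, 1, memory_order_relaxed);
        return NULL;
}

static void *cpu1(void *unused)
{
        while (atomic_load_explicit(&l, memory_order_relaxed) < 1)
                ;                                       /* spin */
        atomic_store_explicit(&b, 1, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);      /* mb() */
        atomic_store_explicit(&l, 2, memory_order_relaxed);
        return NULL;
}

static void *cpu2(void *unused)
{
        int z = atomic_load_explicit(&l, memory_order_relaxed);

        atomic_thread_fence(memory_order_seq_cst);      /* mb() */
        int y = atomic_load_explicit(&b, memory_order_relaxed);
        int x = atomic_load_explicit(&a, memory_order_relaxed);

        assert(z != 2 || (x == 1 && y == 1));
        return NULL;
}

int main(void)
{
        pthread_t t[3];
        int i;

        pthread_create(&t[0], NULL, cpu0, NULL);
        pthread_create(&t[1], NULL, cpu1, NULL);
        pthread_create(&t[2], NULL, cpu2, NULL);
        for (i = 0; i < 3; i++)
                pthread_join(t[i], NULL);
        return 0;
}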

> > Here "l" is standing in for the lock variable -- if we acquire the lock,
> > we must see all the stores from all preceding critical sections for
> > that lock. A similar example would show that any stores CPU 2 executes
> > after its mb() cannot affect any loads that CPUs 0 and 1 executed prior
> > to their mb()s.
>
> Right. To do this properly requires a more fully developed notion of
> ordering than you've been using, and it also requires a notion of control
> dependencies for writes.

I believe that P0, P1, and P2a are sufficient in this case:

0. If CPU 2 sees l!=2, then by P0, it will see z!=2 at the
assertion, which must therefore succeed.
1. On the other hand, if CPU 2 sees l==2, then by P2a, it must
also see b==1.
2. By P0, CPU 1 observes that l==1 preceded l==2.
3. By P1 and #2, neither CPU 0 nor CPU 2 can see l==1 following l==2.
4. Therefore, if CPU 2 observes l==2, there must have been
an earlier l==1.
5. By P2a, any implied observation by CPU 2 of l==1 requires it to
see a==1.
6. By #4 and #5, if CPU 2 sees l==2, it must also see a==1.
7. By P0, if CPU 2 sees b==1, it must see y==1 at the assertion.
8. By P0, if CPU 2 sees a==1, it must see x==1 at the assertion.
9. #1, #6, #7, and #8 taken together imply that if CPU 2 sees
l==2, the assertion succeeds.
10. #0 and #9 taken together demonstrate that the assertion always
succeeds.

I would need P2b or P2c if different combinations of reads and writes
happened in the critical section.

> > > Not recognizing a control dependency like this would be tantamount to
> > > doing a speculative write. Any CPU indulging in such fancies isn't likely
> > > to survive very long.
> >
> > You might well be correct, though I would want to ask an Alpha expert.
> > In any case, there are a number of CPUs that do allow speculative -reads-
> > past control dependencies. My fear is that people will over-generalize
> > the "can't do speculative writes" lesson into "cannot speculate
> > past a conditional branch".
>
> I think that's important enough to be worth emphasizing, so that people
> won't get it wrong.

Yep. Just as it is necessary that in head->a, the fetch of "head" might
well follow the fetch of "head->a". ;-)

> > So I would be OK saying that a memory
> > barrier is required in the above example -- but would be interested
> > in counter-examples where such code is desperately needed, and the
> > overhead of a memory barrier cannot be tolerated.
>
> I don't have any good examples.

OK, until someone comes up with an example, I will not add the "writes
cannot speculate" assumption.

> > I was trying to get at this with the "see the value stored" verbiage.
> > Any thoughts on ways to make this more clear? Perhaps have the principles
> > first, followed by longer explanations and examples.
> >
> > And transitivity is painful -- the Alpha "dereferencing does not imply
> > ordering" example being a similar case in point!
> >
> > > I would much prefer to see an analysis where the primitive notions include
> > > things like "This store becomes visible to that CPU before this load
> > > occurs". It would bear a closer correspondence to what actually happens
> > > in the hardware.
> >
> > I will think on better wording.
>
> My thoughts have been moving in this direction:
>
> Describe everything in terms of a single global ordering. It's
> easier to think about, and there shouldn't be a state-space
> explosion because you can always confine your attention to the
> events you care about and ignore the others.

I may well need to use both viewpoints. However, my experience has been
that thinking in terms of a single global ordering leads me to miss race
conditions, so I am -extremely- reluctant (as you might have noticed!) to
rely solely on explaining this in terms of a single global ordering.

Modern CPUs resemble beehives more than they resemble assembly lines. ;-)
(That is, each CPU acts like a beehive, so that the system is a collection
of interacting beehives, each individual bee representing either an
instruction or some data. For whatever it is worth, the honeycomb would
be the cache and the flowers main memory. There is no queen bee, at least
not in well-designed parallel code. OK, so BKL is the queen bee!!!)

> Reads take place at a particular time (when the return value is
> committed to) but writes become visible at different times to
> different CPUs (such as when they respond to the invalidate
> message).

Reads are as fuzzy as writes. The candidate times for a read would be:

0. When the CPU hands the read request to the cache (BTW, I do agree
with your earlier suggestion to separate CPU and cache actions,
though I might well change my mind after running through an example
or three...).

1. When all invalidates that arrived at the local cache before the
read have been processed. (Assuming the read reaches the local
cache -- it might instead have been satisfied by the store queue.)

2. When the read request is transmitted from the cache to the rest
of the system. (Assuming it missed in the cache.)

3. When the last prior conflicting write completes -- with the
list of possible write-completion times given in a previous
email. As near as I can tell, this is your preferred event
for defining the time at which a read takes place.

4. When the cacheline containing the variable being read is
received at the local cache.

5. When the value being read is delivered to the CPU.

Of course, real CPUs are more complicated, so present more possible events
to tag as being "when the read took place". ;-)

> A CPU's own writes become visible to itself as soon as they
> execute. They become visible to other CPUs at the same time or
> later, but not before. A write might not ever become visible
> to some CPUs.

The first sentence is implied by P0. Not sure the second sentence is
needed -- example? The third sentence is implied by P1.

> A read will always return the value of a store that has already
> become visible to that CPU, but not necessarily the latest such
> store.

Not sure that this sentence is needed -- example?

> So far everything is straightforward. The difficult part arises because
> multiple stores to the same location can mask and overwrite one another.
>
> If a read returns the value from a particular store, then later
> reads (on the same CPU and of the same location) will never return
> values from stores that became visible earlier than that one.
> (The value has overwritten any earlier stores.)

Implied separately by each of P0 and P1, depending on your taste.

> A store can be masked from a particular CPU by earlier or later
> stores to the same location (perhaps made by other CPUs). A store
> is masked from a CPU iff it never becomes visible to that CPU.

Implied by P1.

> In this setting we can describe the operation of the various sorts of
> memory barriers. Apart from the obvious effect of forcing instructions to
> apparently execute in program order, we might have the following:
>
> If two stores are separated by wmb() then on any CPU where they
> both become visible, they become visible in the order they were
> executed.

This is P2a.

> If multiple stores to the same location are visible when a CPU
> executes rmb(), then after the rmb() the latest store has
> overwritten all the earlier ones.

This is P1 and P0, even if there is no rmb(). The rmb() does not affect
loads and stores to a single location. In fact, no memory barrier
affects loads and stores to a single location, except to the extent
that it changes timing in a given implementation -- but a random number
of trips through a loop containing only nops would change timing as well.

> If a CPU executes read-mb-store, then no other read on any CPU
> can return the value of the store before this read completes.
> (I'd like to say that the read completes before the store
> becomes visible on any CPU; I'm not sure that either is right.
> What about on the executing CPU?)

This is P2b. I think.

> But this isn't complete. It doesn't say enough about when one write may
> or may not mask another, or the fact that this occurs in such a way that
> all CPUs will agree on the order of the stores to a single location they
> see in common.

This would be covered by P1.

> And certainly there should be something saying that one way or another,
> stores do eventually complete. Given a long enough time with no other
> accesses, some single store to a given location should become visible to
> every CPU.

This would certainly be needed in order to analyze liveness, as opposed
to "simply" analyzing SMP correctness (or lack thereof, as the case may be).
SMP correctness "only" requires a full state-space search -- preferably
-after- collapsing the state space. ;-)

My thought is to take a more portability-paranoid approach. Start with
locking, deriving the memory-barrier properties required for locking
to work correctly. Rely only on those properties, since a number of
CPUs seem to have been designed (to say nothing of validated!!!) only
for locking. After all, if a CPU doesn't support locking, it cannot
support an SMP build of Linux.

Thanx, Paul

2006-09-21 01:43:07

by Paul E. McKenney

[permalink] [raw]
Subject: Re: Uses for memory barriers

On Tue, Sep 19, 2006 at 04:38:19PM -0400, Alan Stern wrote:
> On Tue, 19 Sep 2006, Paul E. McKenney wrote:
>
> > > Maybe I'm missing something. But if the same CPU loads the value
> > > before the store becomes visible to cache coherency, it might see
> > > the value out of order with respect to what the other CPUs see.
> >
> > Agreed. But the CPUs would have to refer to a fine-grained synchronized
> > timebase or to some other variable in order to detect the fact that there
> > were in fact multiple different values for the same variable at the same
> > time (held in the different store queues).
>
> Even that wouldn't be illegal. No one ever said that any particular write
> becomes visible to all CPUs at the same time.

Agreed. But this would be outside of the cache-coherence protocol.

That said, cross-CPU timing measurements have been very helpful to
me in the past when messing with memory ordering. Spooky to be able
to prove that a single variable really does have multiple values at
a single point in time, from the perspectives of different CPUs! ;-)

> > If the CPUs looked only at that one single variable being stored to,
> > could they have inconsistent opinions about the order of values that
> > this single variable took on? My belief is that they could not.
>
> Yes, I think this must be right. If a store is hung up in a CPU's store
> buffer, it will mask later stores by other CPUs (i.e., prevent them from
> becoming visible to the CPU that owns the store buffer). Hence all stores
> that _do_ become visible will appear in a consistent order.
>
> But my knowledge of outlandish hardware is extremely limited, so don't
> take my word as gospel.

All the hardware that I have had intimate knowledge of has adhered to
this constraint.

Thanx, Paul

2006-09-21 20:59:04

by Alan Stern

[permalink] [raw]
Subject: Re: Uses for memory barriers

We are really discussing two different things here. On one hand you're
trying to present a high-level description of the requirements memory
barriers must fulfill in order to support usable synchronization. On the
other, I'm trying to present a lower-level description of how memory
barriers operate, from which one could prove that your higher-level
requirements are satisfied as a sort of emergent behavior.


On Wed, 20 Sep 2006, Paul E. McKenney wrote:

> That is what I was thinking, but then I realized that P2c is absolutely
> required for locking. For example:
>
> CPU 0 CPU 1
> ----- -----
> spin_lock(&l); spin_lock(&l);
> x = 0; x = 1;
> spin_unlock(&l); spin_unlock(&l);
>
> Whichever CPU acquires the lock last had better be the one whose value
> "sticks" in x. So, if I see that the other CPU has released its lock,
> then any writes that I do need to override whatever writes the other CPU
> did within its critical section.

Yes.

On the whole, the best way to present your approach might be like this.
P0 and P1 are simple basic requirements for any SMP system. They have
nothing to do with memory barriers especially. So let's concentrate on
P2.

To express the various P2 requirements, let's say that a read "comes
after" a write if it sees the value of the write, and a second write
"comes after" a first write if it overwrites the first value. Let A and B
be addresses in memory, let A(n) and B(n) stand for any sort of accesses
to A and B by CPU n, and let M stand for any appropriate memory barrier.
Then whenever we have

A(0) B(1)
M M
B(0) A(1)

the requirement is: B(1) comes after B(0) implies A(1) comes after A(0) --
or at least A(0) doesn't come after A(1) -- whenever this makes sense in
terms of the actual access and barrier operations. This is a bit stronger
than your P2a-P2c all together.

For locking, let L stand for a lock operation and U for an unlock
operation. Then whenever we have

L L
{A(0),B(0)} {A(1),B(1)}
// For each n, A(n) and B(n) can occur in arbitrary order
U U

the requirement is that A(1) comes after A(0) implies B(1) comes after
B(0) -- or at least B(0) doesn't come after B(1) -- again, whenever that
makes sense. If L and U are implemented in terms of something like
atomic_xchg plus mb(), this ought to be derivable from the first
requirement.
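
Spelled out, such an implementation might look like the following C11
sketch (atomic_exchange as the test-and-set, seq_cst fences as the
mb()s -- not a production-quality lock):

#include <sched.h>
#include <stdatomic.h>

static atomic_int lockword;             /* 0 = free, 1 = held */

static void L(void)                     /* lock */
{
        while (atomic_exchange_explicit(&lockword, 1,
                                        memory_order_relaxed) != 0)
                sched_yield();          /* spin until we flip 0 -> 1 */
        atomic_thread_fence(memory_order_seq_cst);      /* mb() */
}

static void U(void)                     /* unlock */
{
        atomic_thread_fence(memory_order_seq_cst);      /* mb() */
        atomic_store_explicit(&lockword, 0, memory_order_relaxed);
}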

Taken together, I think these two schema express all the properties you
want for P2. It's possible that you might want to state more strongly
that accesses cannot leak out of a critical section, even when viewed by a
CPU that's not in a critical section itself. However I'm not sure this is
either necessary or correct.

It's not clear that the schema can handle your next example.

> CPU 0 CPU 1 CPU 2
> ----- ----- -----
> a = 1 while (l < 1); z = l;
> mb(); b = 1; mb();
> l = 1 mb(); y = b;
> l = 2; x = a;
> assert(z != 2 || (x == 1 && y == 1));

We know by P2a between CPUs 1 and 2 that (z != 2 || y == 1) always holds.
So you can simplify the assertion to

assert(!(z==2 && x==0));

> I believe that P0, P1, and P2a are sufficient in this case:
>
> 0. If CPU 2 sees l!=2, then by P0, it will see z!=2 at the
> assertion, which must therefore succeed.
> 1. On the other hand, if CPU 2 sees l==2, then by P2a, it must
> also see b==1.
> 2. By P0, CPU 1 observes that l==1 preceded l==2.
> 3. By P1 and #2, neither CPU 0 nor CPU 2 can see l==1 following l==2.
> 4. Therefore, if CPU 2 observes l==2, there must have been
> an earlier l==1.
> 5. By P2a, any implied observation by CPU 2 of l==1 requires it to
> see a==1.

I don't like step 5. Reasoning with counterfactuals can easily lead to
trouble.  Since CPU 2 doesn't ever actually see l==1, we mustn't deduce
anything on the assumption that it does.

> 6. By #4 and #5, if CPU 2 sees l==2, it must also see a==1.
> 7. By P0, if CPU 2 sees b==1, it must see y==1 at the assertion.
> 8. By P0, if CPU 2 sees a==1, it must see x==1 at the assertion.
> 9. #1, #6, #7, and #8 taken together imply that if CPU 2 sees
> l==2, the assertion succeeds.
> 10. #0 and #9 taken together demonstrate that the assertion always
> succeeds.
>
> I would need P2b or P2c if different combinations of reads and writes
> happened in the critical section.

Maybe this example is complicated enough that you don't need to include it
among the requirements for usable synchronization.


> > I think that's important enough to be worth emphasizing, so that people
> > won't get it wrong.
>
> Yep. Just as it is necessary that in head->a, the fetch of "head" might
> well follow the fetch of "head->a". ;-)

It's worthwhile also explaining the difference between a data dependency
and a control dependency. In fact, for writes there are two kinds of data
dependencies: In one kind, the dependency is through the address to be
written (like what you have here) and in the other, it's through the value
to be written.
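
In C-ish terms (purely illustrative -- the struct and variable names are
made up):

struct node { int a; };

void deps(struct node *head, int *b, int *c)
{
        struct node *p = head;  /* load of the pointer */

        p->a = 1;               /* data dependency through the ADDRESS */

        int v = head->a;        /* load of the value */

        *b = v + 1;             /* data dependency through the VALUE */

        if (head->a != 0)       /* load feeding a conditional branch */
                *c = 1;         /* store CONTROL-dependent on that load */
}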


> > My thoughts have been moving in this direction:
> >
> > Describe everything in terms of a single global ordering. It's
> > easier to think about, and there shouldn't be a state-space
> > explosion because you can always confine your attention to the
> > events you care about and ignore the others.
>
> I may well need to use both viewpoints. However, my experience has been
> that thinking in terms of a single global ordering leads me to miss race
> conditions, so I am -extremely- reluctant (as you might have noticed!) to
> rely solely on explaining this in terms of a single global ordering.

That's a reasonable objection. Doesn't thinking in terms of partial
orders also lead you to miss race conditions? In my experience they are
easy to miss no matter how you organize your thinking! :-)

> > Reads take place at a particular time (when the return value is
> > committed to) but writes become visible at different times to
> > different CPUs (such as when they respond to the invalidate
> > message).
>
> Reads are as fuzzy as writes. The candidate times for a read would be:
>
> 0. When the CPU hands the read request to the cache (BTW, I do agree
> with your earlier suggestion to separate CPU and cache actions,
> though I might well change my mind after running through an example
> or three...).
>
> 1. When all invalidates that arrived at the local cache before the
> read have been processed. (Assuming the read reaches the local
> cache -- it might instead have been satisfied by the store queue.)
>
> 2. When the read request is transmitted from the cache to the rest
> of the system. (Assuming it missed in the cache.)
>
> 3. When the last prior conflicting write completes -- with the
> list of possible write-completion times given in a previous
> email. As near as I can tell, this is your preferred event
> for defining the time at which a read takes place.

No, I would prefer 4. That's when the value is committed to. 3 could
easily occur long before the read was executed.

> 4. When the cacheline containing the variable being read is
> received at the local cache.
>
> 5. When the value being read is delivered to the CPU.
>
> Of course, real CPUs are more complicated, so present more possible events
> to tag as being "when the read took place". ;-)

Yes. This is meant to be more of a suggestive model than an actual "this
is how the hardware works" pronouncement.

BTW, I'm starting to think that the time when a write becomes visible to a
CPU might better be defined as the time when the value is first written to
a cacheline or store buffer accessible by that CPU. Obviously any read
which completes before this time can't return the new value. In a
flat cache architecture this would mean that writes _do_ become visible at
the same time to all CPUs except the one doing the write. With a
hierarchical cache arrangement things get more complicated.

> > A CPU's own writes become visible to itself as soon as they
> > execute. They become visible to other CPUs at the same time or
> > later, but not before. A write might not ever become visible
> > to some CPUs.
>
> The first sentence is implied by P0. Not sure the second sentence is
> needed -- example? The third sentence is implied by P1.

The second sentence probably isn't needed. It merely states the obvious
fact that a write can't become visible before it executes. (Another
reason for not defining the time as when the invalidate response is sent.
There's no reason a CPU couldn't send an early invalidate message, even
before it knew whether it was going to execute the write.)

But your interpretation for the first and third sentences is backward.
They aren't implied by P0 and P1; rather P0 and P1 _follow_ as emergent
consequences of the behavior described here.

> > A read will always return the value of a store that has already
> > become visible to that CPU, but not necessarily the latest such
> > store.
>
> Not sure that this sentence is needed -- example?

CPU 1 does "a = 1" and then CPU 0 reads a==1. A little later CPU 1 does
"a = 2". CPU 0 acknowledges the invalidate message but hasn't yet
processed it when it tries to read a again and gets the old cached value
1. So even though both stores are visible to CPU 0, the second read
returns the value from the first store.

To put it another way, at the time of the read the second store has not
yet overwritten the first store on CPU 0.
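
(To make this concrete, here is a minimal C11 rendering of the scenario --
my own sketch, using stdatomic and pthreads rather than anything
kernel-specific:)

#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

static atomic_int a;

static void *cpu1(void *unused)         /* the writer */
{
        atomic_store_explicit(&a, 1, memory_order_relaxed);
        atomic_store_explicit(&a, 2, memory_order_relaxed);
        return NULL;
}

static void *cpu0(void *unused)         /* the reader */
{
        int r1 = atomic_load_explicit(&a, memory_order_relaxed);
        int r2 = atomic_load_explicit(&a, memory_order_relaxed);

        /* Coherence forbids r1==2 && r2==1: once this thread has seen
         * the second store, later reads cannot return the first.  But
         * r1==1 && r2==1 remains legal even if "a = 2" completed, in
         * real time, before the second load -- the stale value wins. */
        printf("r1=%d r2=%d\n", r1, r2);
        return NULL;
}

int main(void)
{
        pthread_t t0, t1;

        pthread_create(&t1, NULL, cpu1, NULL);
        pthread_create(&t0, NULL, cpu0, NULL);
        pthread_join(t1, NULL);
        pthread_join(t0, NULL);
        return 0;
}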

> > So far everything is straightforward. The difficult part arises
> > because multiple stores to the same location can mask and overwrite
> > one another.
> >
> > If a read returns the value from a particular store, then later
> > reads (on the same CPU and of the same location) will never return
> > values from stores that became visible earlier than that one.
> > (The value has overwritten any earlier stores.)
>
> Implied separately by each of P0 and P1, depending on your taste.

Backward again. This is _used_ to show that P0 and P1 hold.

> > A store can be masked from a particular CPU by earlier or later
> > stores to the same location (perhaps made by other CPUs). A store
> > is masked from a CPU iff it never becomes visible to that CPU.

Some coherency requirements I thought of later:

If two stores are made to the same location, it's not possible
for the first to mask the second on one CPU and the second to
mask the first on another CPU. Also, if both stores are visible
to any CPU then it's not possible for the store that becomes
visible first to mask the other on any CPU.

> > If two stores are separated by wmb() then on any CPU where they
> > both become visible, they become visible in the order they were
> > executed.
>
> This is P2a.

No, it isn't, since P2a doesn't say anything about stores being visible or
their order. It only talks about whether certain reads return the values
of certain stores. Furthermore, you need the next principle as well as
this one before you can deduce P2a.

> > If multiple stores to the same location are visible when a CPU
> > executes rmb(), then after the rmb() the latest store has
> > overwritten all the earlier ones.
>
> This is P1 and P0, even if there is no rmb(). The rmb() does not affect
> loads and stores to a single location. In fact, no memory barrier
> affects loads and stores to a single location, except to the extent
> that it changes timing in a given implementation -- but a random number
> of trips through a loop containing only nops would change timing as well.

I disagree. Without this principle, we could have:

CPU 0 CPU 1
----- -----
a = 0;
// Long delay
a = 1; y = b;
wmb(); rmb();
b = 1; x = a;
assert(!(y==1 && x==0));

Suppose y==1. Then when CPU 1 reads a, both writes to a are visible.
However if the rmb() didn't affect the CPU's behavior with respect to the
repeated writes, then the read would not be obliged to return the value
from the latest visible write. That's what the rmb() forces.
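
(The same litmus in C11 terms, with release/acquire fences standing in
for wmb() and rmb() -- again a sketch of mine, minus the thread-spawning
boilerplate shown earlier; run the two functions concurrently, one per
thread:)

#include <stdatomic.h>
#include <assert.h>

static atomic_int a, b;

static void cpu0(void)
{
        atomic_store_explicit(&a, 0, memory_order_relaxed);
        /* long delay */
        atomic_store_explicit(&a, 1, memory_order_relaxed);
        atomic_thread_fence(memory_order_release);      /* wmb() */
        atomic_store_explicit(&b, 1, memory_order_relaxed);
}

static void cpu1(void)
{
        int y = atomic_load_explicit(&b, memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire);      /* rmb() */
        int x = atomic_load_explicit(&a, memory_order_relaxed);

        /* If y==1 the fences pair up, and the read of a is obliged
         * to return the latest visible write: x must be 1. */
        assert(!(y == 1 && x == 0));
}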

> > If a CPU executes read-mb-store, then no other read on any CPU
> > can return the value of the store before this read completes.
> > (I'd like to say that the read completes before the store
> > becomes visible on any CPU; I'm not sure that either is right.
> > What about on the executing CPU?)
>
> This is P2b. I think.

I'm not so sure.

> > But this isn't complete. It doesn't say enough about when one write may
> > or may not mask another, or the fact that this occurs in such a way that
> > all CPUs will agree on the order of the stores to a single location they
> > see in common.
>
> This would be covered by P1.

The coherency principle I added covers much of this. For the rest, it's
still necessary to add:

Two stores cannot become visible in opposite orders to two CPUs.

> > And certainly there should be something saying that one way or another,
> > stores do eventually complete. Given a long enough time with no other
> > accesses, some single store to a given location should become visible to
> > every CPU.
>
> This would certainly be needed in order to analyze liveness, as opposed
> to "simply" analyzing SMP correctness (or lack thereof, as the case may be).
> SMP correctness "only" requires a full state-space search -- preferably
> -after- collapsing the state space. ;-)

There should also be time limits on how long a store is able to mask or
overwrite other stores.

> My thought is to take a more portability-paranoid approach. Start with
> locking, deriving the memory-barrier properties required for locking
> to work correctly. Rely only on those properties, since a number of
> CPUs seem to have been designed (to say nothing of validated!!!) only
> for locking. After all, if a CPU doesn't support locking, it cannot
> support an SMP build of Linux.

That's the high-level approach I mentioned at the beginning of this email.
It certainly is simpler than the lower-level approach.

Alan

2006-09-22 14:18:33

by Paul E. McKenney

[permalink] [raw]
Subject: Re: Uses for memory barriers

On Thu, Sep 21, 2006 at 04:59:00PM -0400, Alan Stern wrote:
> We are really discussing two different things here. On one hand you're
> trying to present a high-level description of the requirements memory
> barriers must fulfill in order to support usable synchronization. On the
> other, I'm trying to present a lower-level description of how memory
> barriers operate, from which one could prove that your higher-level
> requirements are satisfied as a sort of emergent behavior.

I do want to do both -- just need to start with the requirements to make
sure we don't end up overconstraining things.

> On Wed, 20 Sep 2006, Paul E. McKenney wrote:
>
> > That is what I was thinking, but then I realized that P2c is absolutely
> > required for locking. For example:
> >
> > CPU 0 CPU 1
> > ----- -----
> > spin_lock(&l); spin_lock(&l);
> > x = 0; x = 1;
> > spin_unlock(&l); spin_unlock(&l);
> >
> > Whichever CPU acquires the lock last had better be the one whose value
> > "sticks" in x. So, if I see that the other CPU has released its lock,
> > then any writes that I do need to override whatever writes the other CPU
> > did within its critical section.
>
> Yes.
>
> On the whole, the best way to present your approach might be like this.
> P0 and P1 are simple basic requirements for any SMP system. They have
> nothing to do with memory barriers especially. So let's concentrate on
> P2.

True enough, P0 and P1 are the ordering properties you can count on
even in the absence of memory barriers.

> To express the various P2 requirements, let's say that a read "comes
> after" a write if it sees the value of the write, and a second write
> "comes after" a first write if it overwrites the first value.

And a read "comes after" a read if the "later" read sees the value of at
least one write that "comes after" the "earlier" read. The relation of
the write and read is the converse of your first requirement, but that
should be OK -- could explicitly state it: a write "comes after" a read
if the read does not see the value of the write, but not sure whether it
is useful to do so. I need to think about it a bit, but this certainly
seems to fit with what I was previously calling "partial order".

> Let A and B
> be addresses in memory, let A(n) and B(n) stand for any sort of accesses
> to A and B by CPU n, and let M stand for any appropriate memory barrier.
> Then whenever we have
>
> A(0) B(1)
> M M
> B(0) A(1)
>
> the requirement is: B(1) comes after B(0) implies A(1) comes after A(0) --
> or at least A(0) doesn't come after A(1) -- whenever this makes sense in
> terms of the actual access and barrier operations. This is a bit stronger
> than your P2a-P2c all together.

As long as we avoid any notion of transitivity, I -think- I am OK with
this. (OK, so yes there is transitivity in the single-variable P1 case,
but not necessarily in the dual-variable P2x cases.)

> For locking, let L stand for a lock operation and U for an unlock
> operation. Then whenever we have
>
> L L
> {A(0),B(0)} {A(1),B(1)}
> // For each n, A(n) and B(n) can occur in arbitrary order
> U U
>
> the requirement is that A(1) comes after A(0) implies B(1) comes after
> B(0) -- or at least B(0) doesn't come after B(1) -- again, whenever that
> makes sense. If L and U are implemented in terms of something like
> atomic_xchg plus mb(), this ought to be derivable from the first
> requirement.

In fact, the atomic-plus-mb() would define which memory-barrier properties
Linux could safely rely on.
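
(For concreteness, a userspace C11 sketch of the sort of construction
I mean -- a bare test-and-set lock, emphatically not the real Linux
implementation for any architecture:)

#include <stdatomic.h>

static atomic_int lock;

static void my_lock(void)
{
        /* seq_cst exchange: the load-mb-store combination */
        while (atomic_exchange(&lock, 1))
                ;       /* spin; a real lock would also relax the CPU */
}

static void my_unlock(void)
{
        /* release store: critical-section accesses cannot leak
         * past the unlock */
        atomic_store_explicit(&lock, 0, memory_order_release);
}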

> Taken together, I think these two schema express all the properties you
> want for P2. It's possible that you might want to state more strongly
> that accesses cannot leak out of a critical section, even when viewed by a
> CPU that's not in a critical section itself. However I'm not sure this is
> either necessary or correct.

They are free to leak out, but only from the viewpoint of code that
is outside of any mutually excluded critical section. Our example a
few emails ago being a case in point.

> It's not clear that the schema can handle your next example.
>
> > CPU 0          CPU 1            CPU 2
> > -----          -----            -----
> > a = 1          while (l < 1);   z = l;
> > mb();          b = 1;           mb();
> > l = 1          mb();            y = b;
> >                l = 2;           x = a;
> > assert(z != 2 || (x == 1 && y == 1));
>
> We know by P2a between CPUs 1 and 2 that (z != 2 || y == 1) always holds.
> So you can simplify the assertion to
>
> assert(!(z==2 && x==0));
>
> > I believe that P0, P1, and P2a are sufficient in this case:
> >
> > 0. If CPU 2 sees l!=2, then by P0, it will see z!=2 at the
> > assertion, which must therefore succeed.
> > 1. On the other hand, if CPU 2 sees l==2, then by P2a, it must
> > also see b==1.
> > 2. By P0, CPU 1 observes that l==1 preceded l==2.
> > 3. By P1 and #2, neither CPU 0 nor CPU 2 can see l==1 following l==2.
> > 4. Therefore, if CPU 2 observes l==2, there must have been
> > an earlier l==1.
> > 5. By P2a, any implied observation by CPU 2 of l==1 requires it to
> > see a==1.
>
> I don't like step 5. Reasoning with counterfactuals can easily lead to
> trouble. Since CPU 2 doesn't ever actually see l==2, we mustn't deduce
> anything on the assumption that it does.

You lost me on the last sentence -- we only get here if CPU 2 did in fact
see l==2. Or did you mean to say l==1? In the latter case, then, yes,
the use of P1 in #3 was an example of the only permitted transitivity --
successive stores to a single variable. This is required for multiple
successive lock-based critical sections, so I feel justified relying
on it. If I acquire a lock, I see the results not only of the immediately
preceding critical section for that lock, but also of all earlier critical
sections for that lock (to the extent that the results of these earlier
critical sections were not overwritten by intervening critical sections,
of course).

> > 6. By #4 and #5, if CPU 2 sees l==2, it must also see a==1.
> > 7. By P0, if CPU 2 sees b==1, it must see y==1 at the assertion.
> > 8. By P0, if CPU 2 sees a==1, it must see x==1 at the assertion.
> > 9. #1, #6, #7, and #8 taken together implies that if CPU 2 sees
> > l==2, the assertion succeeds.
> > 10. #0 and #9 taken together demonstrate that the assertion always
> > succeeds.
> >
> > I would need P2b or P2c if different combinations of reads and writes
> > happened in the critical section.
>
> Maybe this example is complicated enough that you don't need to include it
> among the requirements for usable synchronization.

Yeah, the point was to stress-test the definitions, not to illuminate
the readers. I guess that any reader who found that example illuminating
must have -really- been in the dark! ;-)

> > > I think that's important enough to be worth emphasizing, so that people
> > > won't get it wrong.
> >
> > Yep. Just as it is necessary that in head->a, the fetch of "head" might
> > well follow the fetch of "head->a". ;-)
>
> It's worthwhile also explaining the difference between a data dependency
> and a control dependency. In fact, for writes there are two kinds of data
> dependencies: In one kind, the dependency is through the address to be
> written (like what you have here) and in the other, it's through the value
> to be written.

Fair enough -- although as we discussed before, control dependency does
not necessarily result in memory ordering. The writes cannot speculate
ahead of the branch (as far as I know), but earlier writes could be
delayed on weak-memory machines, which could look the same from the
viewpoint of another CPU.
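
(Schematically, with variable names of my own:)

extern int a, b;
extern volatile int flag;

void cpu0(void)
{
        a = 1;          /* earlier write: a weak-memory CPU may still
                         * delay this one past the branch... */
        if (flag)
                b = 1;  /* ...even though this dependent write cannot
                         * speculate ahead of the branch */
}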

> > > My thoughts have been moving in this direction:
> > >
> > > Describe everything in terms of a single global ordering. It's
> > > easier to think about, and there shouldn't be a state-space
> > > explosion because you can always confine your attention to the
> > > events you care about and ignore the others.
> >
> > I may well need to use both viewpoints. However, my experience has been
> > that thinking in terms of a single global ordering leads me to miss race
> > conditions, so I am -extremely- reluctant (as you might have noticed!) to
> > rely solely on explaining this in terms of a single global ordering.
>
> That's a reasonable objection. Doesn't thinking in terms of partial
> orders also lead you to miss race conditions? In my experience they are
> easy to miss no matter how you organize your thinking! :-)

I stand corrected. I should have said "leads me to miss -even- -more-
race conditions". ;-)

> > > Reads take place at a particular time (when the return value is
> > > committed to) but writes become visible at different times to
> > > different CPUs (such as when they respond to the invalidate
> > > message).
> >
> > Reads are as fuzzy as writes. The candidate times for a read would be:
> >
> > 0. When the CPU hands the read request to the cache (BTW, I do agree
> > with your earlier suggestion to separate CPU and cache actions,
> > though I might well change my mind after running through an example
> > or three...).
> >
> > 1. When all invalidates that arrived at the local cache before the
> > read have been processed. (Assuming the read reaches the local
> > cache -- it might instead have been satisfied by the store queue.)
> >
> > 2. When the read request is transmitted from the cache to the rest
> > of the system. (Assuming it missed in the cache.)
> >
> > 3. When the last prior conflicting write completes -- with the
> > list of possible write-completion times given in a previous
> > email. As near as I can tell, this is your preferred event
> > for defining the time at which a read takes place.
>
> No, I would prefer 4. That's when the value is committed to. 3 could
> easily occur long before the read was executed.

Hmmm... In a previous email dated September 14 2006 10:58AM EDT,
you chose #6 for the writes "when CPUs send acknowledgements for the
invalidation requests". Consider the following sequence of events,
with A==0 initially:

o CPU 0 stores A=1, but the cacheline for A is local, and not
present in any other CPU's cache, so the write happens quite
quickly.

o CPU 1 loads A, sending out a read transaction on the bus.

o CPU 2 stores A=2, sending out invalidations.

o CPU 0 responds to CPU 1's read transaction.

o CPU 1 receives the invalidation request from CPU 2. By
your earlier choice, this is the time at which the subsequent
write (the one -not- seen by CPU 1's load of A) occurs.

o CPU 1 receives the read response from CPU 0. By your current
choice, this is the time at which CPU 1's load of A occurs.
CPU 1 presumably grabs the value of A from the cache line,
then responds to CPU 2's invalidation request, so that the
cacheline does a touch-and-go in CPU 1's cache before heading
off to CPU 2.

So even though the load of A happened -after- the store A=2, the load
sees A==1 rather than A==2.

Situations like this helped break me of my desire to think of loads
and stores as being non-fuzzy. Though I will admit that I did beat
my head against that particular wall for a bit before giving in...

> > 4. When the cacheline containing the variable being read is
> > received at the local cache.
> >
> > 5. When the value being read is delivered to the CPU.
> >
> > Of course, real CPUs are more complicated, so present more possible events
> > to tag as being "when the read took place". ;-)
>
> Yes. This is meant to be more of a suggestive model than an actual "this
> is how the hardware works" pronouncement.

Agreed, as I have no desire to try to drag people through a real CPU.
Hey, I don't even have much desire to drag -myself- through a real CPU!!!

> BTW, I'm starting to think that the time when a write becomes visible to a
> CPU might better be defined as the time when the value is first written to
> a cacheline or store buffer accessible by that CPU. Obviously any read
> which completes before this time can't return the new value. In a
> flat cache architecture this would mean that writes _do_ become visible at
> the same time to all CPUs except the one doing the write. With a
> hierarchical cache arrangement things get more complicated.

Yep. In particular, different reads from different CPUs issuing at the
same time can see different values. You can have quite a bit of fun
if your system happens to have a perfectly synchronized fine-grained
counter available at low overhead to each CPU, as you can then write a
small program that proves that a given variable is taking on different
values at different CPUs at the same time -- the store buffers could
each be holding a different value for that variable for the length of
time required to pull the cache line in from DRAM. ;-)
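
(A sketch of such a probe, assuming an x86-style TSC that really is
synchronized across CPUs -- thread pinning and TSC-sync validation
omitted:)

#include <stdatomic.h>
#include <stdint.h>
#include <x86intrin.h>

static atomic_int v;

struct sample {
        uint64_t tsc;           /* the synchronized fine-grained counter */
        int val;
};

static struct sample probe(void)
{
        struct sample s;

        s.tsc = __rdtsc();
        s.val = atomic_load_explicit(&v, memory_order_relaxed);
        return s;
}

/* Run probe() in a tight loop on several CPUs while yet another CPU
 * stores to v; samples with overlapping tsc values but different val
 * demonstrate the store buffers holding different values for v at
 * the same instant. */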

Of course if you choose "written to a store buffer" as the time for
a write, then later reads might not see that write, despite its having
happened earlier. So I don't see the value of labelling any one of
these events as being "the" time the write occurred -- but I do agree
that I must at least show how things look from this sort of viewpoint,
as it is no doubt a very natural one for anyone who has not spent 16
years beating their head against SMP hardware.

> > > A CPU's own writes become visible to itself as soon as they
> > > execute. They become visible to other CPUs at the same time or
> > > later, but not before. A write might not ever become visible
> > > to some CPUs.
> >
> > The first sentence is implied by P0. Not sure the second sentence is
> > needed -- example? The third sentence is implied by P1.
>
> The second sentence probably isn't needed. It merely states the obvious
> fact that a write can't become visible before it executes. (Another
> reason for not defining the time as when the invalidate response is sent.
> There's no reason a CPU couldn't send an early invalidate message, even
> before it knew whether it was going to execute the write.)

On i386, the programmer can force an early invalidation manually,
as well -- "atomic_add(&v, 0);". Yes, this -is- cheating. What is
your point? ;-)

Seriously, I vaguely remember several CPUs having cache-control
instructions that effectively allow assembly-language programmers and
compiler writers to issue the invalidate long before the write occurs.
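
(In GCC-flavored C, the effect might be sketched like this -- the
builtin is my illustration, not a kernel interface:)

/* Ask for write ownership of the cacheline well before the store,
 * so the invalidations go out early. */
static void early_invalidate_store(int *p, int val)
{
        __builtin_prefetch(p, 1);       /* 1 == prefetch for write */
        /* ... unrelated work overlaps the coherence traffic ... */
        *p = val;                       /* the store itself comes later */
}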

> But your interpretation for the first and third sentences is backward.
> They aren't implied by P0 and P1; rather P0 and P1 _follow_ as emergent
> consequences of the behavior described here.

From your viewpoint, yes (I think). But if you are reasoning from any
given starting point, all the other points are consequences of that
starting point.

> > > A read will always return the value of a store that has already
> > > become visible to that CPU, but not necessarily the latest such
> > > store.
> >
> > Not sure that this sentence is needed -- example?
>
> CPU 1 does "a = 1" and then CPU 0 reads a==1. A little later CPU 1 does
> "a = 2". CPU 0 acknowledges the invalidate message but hasn't yet
> processed it when it tries to read a again and gets the old cached value
> 1. So even though both stores are visible to CPU 0, the second read
> returns the value from the first store.

Both stores are really visible to CPU 0? Or only to its cache?

> To put it another way, at the time of the read the second store has not
> yet overwritten the first store on CPU 0.

Or from the viewpoint of code running on CPU 0, the first store has
happened, but the second store has not.

> > > So far everything is straightforward. The difficult part arises
> > > because multiple stores to the same location can mask and overwrite
> > > one another.
> > >
> > > If a read returns the value from a particular store, then later
> > > reads (on the same CPU and of the same location) will never return
> > > values from stores that became visible earlier than that one.
> > > (The value has overwritten any earlier stores.)
> >
> > Implied separately by each of P0 and P1, depending on your taste.
>
> Backward again. This is _used_ to show that P0 and P1 hold.

To be expected, given our opposing viewpoints.

> > > A store can be masked from a particular CPU by earlier or later
> > > stores to the same location (perhaps made by other CPUs). A store
> > > is masked from a CPU iff it never becomes visible to that CPU.
>
> Some coherency requirements I thought of later:
>
> If two stores are made to the same location, it's not possible
> for the first to mask the second on one CPU and the second to
> mask the first on another CPU.

P1 again -- accesses to a given single variable are seen in consistent
order by all CPUs.

> Also, if both stores are visible
> to any CPU then it's not possible for the store that becomes
> visible first to mask the other on any CPU.

Eh? Regardless of which one is visible on what CPU, the one that
comes first will come first, and thus be overwritten by any later
store to that same variable.

> > > If two stores are separated by wmb() then on any CPU where they
> > > both become visible, they become visible in the order they were
> > > executed.
> >
> > This is P2a.
>
> No, it isn't, since P2a doesn't say anything about stores being visible or
> their order. It only talks about whether certain reads return the values
> of certain stores. Furthermore, you need the next principle as well as
> this one before you can deduce P2a.

It is P2a -- the pair of ordered reads required by P2a is implied by
"they become visible in the order they were executed", at least if
we choose an appropriate definition of "executed".

Part of the conflict here appears to be that you are trying to define
what a single memory barrier does in isolation, and I am trying to
avoid that like the plague. ;-)

> > > If multiple stores to the same location are visible when a CPU
> > > executes rmb(), then after the rmb() the latest store has
> > > overwritten all the earlier ones.
> >
> > This is P1 and P0, even if there is no rmb(). The rmb() does not affect
> > loads and stores to a single location. In fact, no memory barrier
> > affects loads and stores to a single location, except to the extent
> > that it changes timing in a given implementation -- but a random number
> > of trips through a loop containing only nops would change timing as well.
>
> I disagree. Without this principle, we could have:
>
> CPU 0 CPU 1
> ----- -----
> a = 0;
> // Long delay
> a = 1; y = b;
> wmb(); rmb();
> b = 1; x = a;
> assert(!(y==1 && x==0));
>
> Suppose y==1. Then when CPU 1 reads a, both writes to a are visible.
> However if the rmb() didn't affect the CPU's behavior with respect to the
> repeated writes, then the read would not be obliged to return the value
> from the latest visible write. That's what the rmb() forces.

The only reason you need the rmb() is because you introduced the
second variable that went unmentioned in your earlier email. ;-)

> > > If a CPU executes read-mb-store, then no other read on any CPU
> > > can return the value of the store before this read completes.
> > > (I'd like to say that the read completes before the store
> > > becomes visible on any CPU; I'm not sure that either is right.
> > > What about on the executing CPU?)
> >
> > This is P2b. I think.
>
> I'm not so sure.

It is either P2b, or it is overconstrained. The other CPU would not be
permitted to do a write conflicting with the first CPU's initial read,
but even then, there have to be memory barriers on the second CPU to
enforce this.

> > > But this isn't complete. It doesn't say enough about when one write may
> > > or may not mask another, or the fact that this occurs in such a way that
> > > all CPUs will agree on the order of the stores to a single location they
> > > see in common.
> >
> > This would be covered by P1.
>
> The coherency principle I added covers much of this. For the rest, it's
> still necessary to add:
>
> Two stores cannot become visible in opposite orders to two CPUs.

If they are to the same variable, yes. But I thought we covered this
earlier.

> > > And certainly there should be something saying that one way or another,
> > > stores do eventually complete. Given a long enough time with no other
> > > accesses, some single store to a given location should become visible to
> > > every CPU.
> >
> > This would certainly be needed in order to analyze liveness, as opposed
> > to "simply" analyzing SMP correctness (or lack thereof, as the case may be).
> > SMP correctness "only" requires a full state-space search -- preferably
> > -after- collapsing the state space. ;-)
>
> There should also be time limits on how long a store is able to mask or
> overwrite other stores.

For liveness/latency, yes. And most computer systems I know of do have
hardware timeouts should the cache-coherence protocol leave something
stranded for too long, but if these trigger, it is typically "game over"
for the software. On NUMA-Q, the hardware essentially takes a crash dump
of the cache-coherence protocol state when this happens, in which appear
only random physical addresses and cache tags (maybe values sometimes,
I don't recall). In early prototypes of the hardware, it was possible
for the software to force such an event (in one case, by doing something
to the effect of referencing a virtual address that was mapped to a
"hole" in the physical address space that was populated by neither RAM
nor MMIO). We were thankfully able to shame the hardware guys into
taking more reasonable action in such cases, as debugging these sorts
of things was quite painful.

> > My thought is to take a more portability-paranoid approach. Start with
> > locking, deriving the memory-barrier properties required for locking
> > to work correctly. Rely only on those properties, since a number of
> > CPUs seem to have been designed (to say nothing of validated!!!) only
> > for locking. After all, if a CPU doesn't support locking, it cannot
> > support an SMP build of Linux.
>
> That's the high-level approach I mentioned at the beginning of this email.
> It certainly is simpler than the lower-level approach.

OK, so I will start from locking, derive rules, and show how they look
from the hardware. Give or take what happens when I actually try to
do this. ;-)

Thanx, Paul

2006-09-22 20:38:57

by Alan Stern

[permalink] [raw]
Subject: Re: Uses for memory barriers

On Thu, 21 Sep 2006, Paul E. McKenney wrote:

> > To express the various P2 requirements, let's say that a read "comes
> > after" a write if it sees the value of the write, and a second write
> > "comes after" a first write if it overwrites the first value.
>
> And a read "comes after" a read if the "later" read sees the value of at
> least one write that "comes after" the "earlier" read.

No, this doesn't make sense since we have no notion of a write "coming
after" a read.

> The relation of
> the write and read is the converse of your first requirement, but that
> should be OK -- could explicitly state it: a write "comes after" a read
> if the read does not see the value of the write, but not sure whether it
> is useful to do so.

Not only is this not useful, it can be positively wrong! It allows you
to draw incorrect conclusions. For example:

CPU 0 CPU 1
----- -----
a = 1; b = 1;
mb(); mb();
x = b + 1; y = a + 1;
assert(!(x==1 && y==1));

The assertion can fail if the writes to a and b cross in the night. But
if CPU 1 sees a==0 then you would say CPU 0's "a = 1" came after CPU 1's
read of a, hence CPU 0's read of b came after CPU 1's write to b, hence
x==2 and the assertion should hold.

So we can talk about a read or a write "coming after" a write. In general
we cannot talk about anything "coming after" a read. However it might be
okay when both accesses are on the same CPU:

A
M
A'

In this case we might allow ourselves to say that the A' access "comes
after" the A access even when A is a read. (When A is a write, it
follows from P0 that A' has to come after A.) Though not particularly
useful in itself, it could be combined with transitivity -- if you decide
to allow transitivity.

> I need to think about it a bit, but this certainly
> seems to fit with what I was previously calling "partial order".

Maybe it does. If you're careful. It's not clear whether "comes after"
really needs to be transitive.

> > Let A and B
> > be addresses in memory, let A(n) and B(n) stand for any sort of accesses
> > to A and B by CPU n, and let M stand for any appropriate memory barrier.
> > Then whenever we have
> >
> > A(0) B(1)
> > M M
> > B(0) A(1)
> >
> > the requirement is: B(1) comes after B(0) implies A(1) comes after A(0) --
> > or at least A(0) doesn't come after A(1) -- whenever this makes sense in
> > terms of the actual access and barrier operations. This is a bit stronger
> > than your P2a-P2c all together.
>
> As long as we avoid any notion of transitivity, I -think- I am OK with
> this. (OK, so yes there is transitivity in the single-variable P1 case,
> but not necessarily in the dual-variable P2x cases.)
>
> > For locking, let L stand for a lock operation and U for an unlock
> > operation. Then whenever we have
> >
> > L L
> > {A(0),B(0)} {A(1),B(1)}
> > // For each n, A(n) and B(n) can occur in arbitrary order
> > U U
> >
> > the requirement is that A(1) comes after A(0) implies B(1) comes after
> > B(0) -- or at least B(0) doesn't come after B(1) -- again, whenever that
> > makes sense. If L and U are implemented in terms of something like
> > atomic_xchg plus mb(), this ought to be derivable from the first
> > requirement.
>
> In fact, the atomic-plus-mb() would define which memory-barrier properties
> Linux could safely rely on.

Sounds reasonable. I haven't actually checked that this locking
requirement follows from the memory barrier requirement plus that
implementation. Maybe you'd like to go through the details. :-)

> > Taken together, I think these two schema express all the properties you
> > want for P2. It's possible that you might want to state more strongly
> > that accesses cannot leak out of a critical section, even when viewed by a
> > CPU that's not in a critical section itself. However I'm not sure this is
> > either necessary or correct.
>
> They are free to leak out, but only from the viewpoint of code that
> is outside of any mutually excluded critical section. Our example a
> few emails ago being a case in point.

Good.

> > It's not clear that the schema can handle your next example.
> >
> > > CPU 0          CPU 1            CPU 2
> > > -----          -----            -----
> > > a = 1          while (l < 1);   z = l;
> > > mb();          b = 1;           mb();
> > > l = 1          mb();            y = b;
> > >                l = 2;           x = a;
> > > assert(z != 2 || (x == 1 && y == 1));
> >
> > We know by P2a between CPUs 1 and 2 that (z != 2 || y == 1) always holds.
> > So you can simplify the assertion to
> >
> > assert(!(z==2 && x==0));
> >
> > > I believe that P0, P1, and P2a are sufficient in this case:
> > >
> > > 0. If CPU 2 sees l!=2, then by P0, it will see z!=2 at the
> > > assertion, which must therefore succeed.
> > > 1. On the other hand, if CPU 2 sees l==2, then by P2a, it must
> > > also see b==1.
> > > 2. By P0, CPU 1 observes that l==1 preceded l==2.
> > > 3. By P1 and #2, neither CPU 0 nor CPU 2 can see l==1 following l==2.
> > > 4. Therefore, if CPU 2 observes l==2, there must have been
> > > an earlier l==1.
> > > 5. By P2a, any implied observation by CPU 2 of l==1 requires it to
> > > see a==1.
> >
> > I don't like step 5. Reasoning with counterfactuals can easily lead to
> > trouble. Since CPU 2 doesn't ever actually see l==2, we mustn't deduce
> > anything on the assumption that it does.
>
> You lost me on the last sentence -- we only get here if CPU 2 did in fact
> see l==2. Or did you mean to say l==1?

Yes, it was another typo. I try to weed them out, but a few always manage
to squeak through...

> In the latter case, then, yes,
> the use of P1 in #3 was an example of the only permitted transitivity --
> successive stores to a single variable. This is required for multiple
> successive lock-based critical sections, so I feel justified relying
> on it. If I acquire a lock, I see the results not only of the immediately
> preceding critical section for that lock, but also of all earlier critical
> sections for that lock (to the extent that the results of these earlier
> critical sections were not overwritten by intervening critical sections,
> of course).
>
> > > 6. By #4 and #5, if CPU 2 sees l==2, it must also see a==1.
> > > 7. By P0, if CPU 2 sees b==1, it must see y==1 at the assertion.
> > > 8. By P0, if CPU 2 sees a==1, it must see x==1 at the assertion.
> > > 9. #1, #6, #7, and #8 taken together implies that if CPU 2 sees
> > > l==2, the assertion succeeds.
> > > 10. #0 and #9 taken together demonstrate that the assertion always
> > > succeeds.
> > >
> > > I would need P2b or P2c if different combinations of reads and writes
> > > happened in the critical section.
> >
> > Maybe this example is complicated enough that you don't need to include it
> > among the requirements for usable synchronization.
>
> Yeah, the point was to stress-test the definitions, not to illuminate
> the readers. I guess that any reader who found that example illuminating
> must have -really- been in the dark! ;-)

And yet, are you certain that the assertion must hold? The mb() on CPU 0
means that CPU 2 has to see "a = 1" before it sees "l = 1". And CPU 2 has
to see "l = 2" before it can get going. But I don't know of any reason
why CPU 2 might not see "l = 2" first, "a = 1" later, and "l = 1" not at
all.


> > It's worthwhile also explaining the difference between a data dependency
> > and a control dependency. In fact, for writes there are two kinds of data
> > dependencies: In one kind, the dependency is through the address to be
> > written (like what you have here) and in the other, it's through the value
> > to be written.
>
> Fair enough -- although as we discussed before, control dependency does
> not necessarily result in memory ordering. The writes cannot speculate
> ahead of the branch (as far as I know), but earlier writes could be
> delayed on weak-memory machines, which could look the same from the
> viewpoint of another CPU.

Yes, an earlier write could be delayed until after the branch. But a
later write cannot be moved forward before the branch.


> Hmmm... In a previous email dated September 14 2006 10:58AM EDT,
> you chose #6 for the writes "when CPUs send acknowledgements for the
> invalidation requests". Consider the following sequence of events,
> with A==0 initially:
>
> o CPU 0 stores A=1, but the cacheline for A is local, and not
> present in any other CPU's cache, so the write happens quite
> quickly.
>
> o CPU 1 loads A, sending out a read transaction on the bus.
>
> o CPU 2 stores A=2, sending out invalidations.
>
> o CPU 0 responds to CPU 1's read transaction.
>
> o CPU 1 receives the invalidation request from CPU 2. By
> your earlier choice, this is the time at which the subsequent
> write (the one -not- seen by CPU 1's load of A) occurs.
>
> o CPU 1 receives the read response from CPU 0. By your current
> choice, this is the time at which CPU 1's load of A occurs.
> CPU 1 presumably grabs the value of A from the cache line,
> then responds to CPU 2's invalidation request, so that the
> cacheline does a touch-and-go in CPU 1's cache before heading
> off to CPU 2.
>
> So even though the load of A happened -after- the store A=2, the load
> sees A==1 rather than A==2.

That's right. It's exactly what I had in mind when I said that a load
doesn't necessarily return the value of the most recent visible store.

> > CPU 1 does "a = 1" and then CPU 0 reads a==1. A little later CPU 1 does
> > "a = 2". CPU 0 acknowledges the invalidate message but hasn't yet
> > processed it when it tries to read a again and gets the old cached value
> > 1. So even though both stores are visible to CPU 0, the second read
> > returns the value from the first store.
>
> Both stores are really visible to CPU 0? Or only to its cache?

I'm not distinguishing those two concepts.

> > To put it another way, at the time of the read the second store has not
> > yet overwritten the first store on CPU 0.
>
> Or from the viewpoint of code running on CPU 0, the first store has
> happened, but the second store has not.

Yes. The reason for doing things this way is so that I can give a "local"
description of how wmb() and rmb() affect the CPU they execute on. For
example, wmb() guarantees that earlier stores will become visible before
later stores. It can't guarantee that code running on another CPU will
load the values in that order -- not unless the other CPU executes rmb().
But keeping track of whether or not some other CPU chooses to execute
rmb() is outside the scope of what wmb() should have to do. It's not part
of the job description. :-)


> > Also, if both stores are visible
> > to any CPU then it's not possible for the store that becomes
> > visible first to mask the other on any CPU.
>
> Eh? Regardless of which one is visible on what CPU, the one that
> comes first will come first, and thus be overwritten by any later
> store to that same variable.

I didn't say "overwrite", I said "mask". They aren't the same. An
earlier store can mask a later store. For instance, if CPU 0 does "a = 2"
and the assignment sits in a store buffer while CPU 1 completes storing "a
= 1", then "a = 2" will mask "a = 1" on CPU 0. CPU 1 will see both stores
in the order 1,2 and CPU 0 will see only the value 2.

This is my way of getting around the fact that _if_ both stores were
visible to CPU 0, they would become visible in the opposite order from
CPU 1.

> Part of the conflict here appears to be that you are trying to define
> what a single memory barrier does in isolation,

That's exactly right. I firmly believe that such a description should be
possible. For instance, in the MESI model there aren't any special
messages exchanged saying "I am executing a wmb()". No CPU has any way to
know what instructions another CPU is executing; that information just
isn't in the protocol. So whatever effects a memory barrier may have,
they shouldn't depend directly on what instructions other CPUs are
running. They can depend only on the messages sent over the bus.

> and I am trying to
> avoid that like the plague. ;-)

To each his own.

> > > > If multiple stores to the same location are visible when a CPU
> > > > executes rmb(), then after the rmb() the latest store has
> > > > overwritten all the earlier ones.
> > >
> > > This is P1 and P0, even if there is no rmb(). The rmb() does not affect
> > > loads and stores to a single location. In fact, no memory barrier
> > > affects loads and stores to a single location, except to the extent
> > > that it changes timing in a given implementation -- but a random number
> > > of trips through a loop containing only nops would change timing as well.
> >
> > I disagree. Without this principle, we could have:
> >
> > CPU 0 CPU 1
> > ----- -----
> > a = 0;
> > // Long delay
> > a = 1; y = b;
> > wmb(); rmb();
> > b = 1; x = a;
> > assert(!(y==1 && x==0));
> >
> > Suppose y==1. Then when CPU 1 reads a, both writes to a are visible.
> > However if the rmb() didn't affect the CPU's behavior with respect to the
> > repeated writes, then the read would not be obliged to return the value
> > from the latest visible write. That's what the rmb() forces.
>
> The only reason you need the rmb() is because you introduced the
> second variable that went unmentioned in your earlier email. ;-)

What? If the rmb() wasn't there then the assertion might fail. The idea
of this principle is to describe how the rmb() -- in isolation! -- affects
the operation of CPU 1.

Alan

2006-09-27 21:07:00

by Alan Stern

[permalink] [raw]
Subject: Re: Uses for memory barriers

Paul:

Thinking about this a little more, I came up with something that you might
like. It's a bit abstract, but it does seem to capture all the essential
ingredients.

Let's start with some new notation. If A is a location in memory and n is
an index number, let's write "ld(A,n)", "st(A,n)", and "ac(A,n)" to stand
for a load, store, or arbitrary access to A. The index n is simply a way
to distinguish among multiple accesses to the same location. If n isn't
needed we may choose to omit it.

This approach uses two important relations. The first is the "comes
before" relation we have already seen. Let's abbreviate it "c.b.", or
"c.a." for the converse "comes after" relation.

"Comes before" applies across CPUs, but only for accesses to the same
location in memory. st(A) c.b. ld(A) if the load returns the value of the
store. st(A,m) c.b. st(A,n) if the second store overwrites the first.
We do not allow loads to "come before" anything. This reflects the fact
that even though a store may have completed and may be visible to a
particular CPU, a subsequent load might not return the value of the store
(for example, if an invalidate message has been acknowledged but not
yet processed).

"Comes before" need not be transitive, depending on the architecture. We
can safely allow it to be transitive among stores that are all visible to
some single CPU, but not all stores need to be so visible.

As an example, consider a 4-CPU system where CPUs 0,1 share the cache C01
and CPUs 2,3 share the cache C23. Suppose that each CPU executes a store
to A concurrently. Then C01 might decide that the store from CPU 0 will
overwrite the store from CPU 1, and C23 might decide that the store from
CPU 2 will overwrite the store from CPU 3. Similarly, the two caches
together might decide that the store from CPU 0 will overwrite the store
from CPU 2. Under these conditions it makes no sense to compare the
stores from CPUs 1 and 3, because nowhere are both stores visible.

As a special case, we can assume that stores taking place under a bus lock
(such as the store in atomic_xchg) will be visible to all CPUs or caches,
and hence all such stores to a particular location can be placed in a
global total order consistent with the "comes before" relation.

As part of your P0, we can assert that whenever the sequence

st(A,m)
ac(A,n)

occurs on a single CPU, st(A,m) c.b. ac(A,n). In other words, each CPU
always sees the results of its own stores.

The second relation I'll call "sequencing", and I'll write it using the
standard "<" and ">" symbols. Unlike "comes before", sequencing applies
to arbitrary memory locations but only to accesses on a single CPU. It is
fully transitive, hence a genuine partial ordering. It's kind of a
strengthened form of "comes before", just as "comes before" is a
strengthened form of "occurs earlier in time".

If M is a memory barrier, then in this code

ac(A)
M
ac(B)

we will have ac(A) < ac(B), provided the barrier type is appropriate for
the sorts of access. As a special extra case, if st(B) has any kind of
dependency (control, data, or address) on a previous ld(A), then ld(A) <
st(B) even without a memory barrier. In other words, CPUs don't do
speculative writes. But in general, if two accesses are not separated by
a memory barrier then they are not sequenced.
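
(To make the three kinds of dependency concrete -- examples of my own,
in plain C, where a is assumed to hold 0 or 1:)

extern int a, b, t[2];

void deps(void)
{
        int i = a;      /* ld(A) */

        t[i] = 1;       /* address dependency: store address computed from i */
        b = i + 1;      /* data dependency: store value computed from i */
        if (i)
                b = 2;  /* control dependency: store sits behind a branch */
}

/* In each case ld(A) < st(B) with no explicit barrier, because the
 * store cannot be issued speculatively. */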

Given this background, the operation of memory barriers can be explained
very simply as follows. Whenever we have

ac(A,i) < ac(B,j) c.b. ac(B,k) < ac(A,l)

then ac(A,i) c.b. ac(A,l), or if i is a load and l is a store, then
st(A,l) !c.b. ld(A,i).

As degenerate subcases, when A and B are the same location we also have:

ac(A,i) < ac(A,j) c.b. ac(A,k) implies
ac(A,i) c.b. ac(A,k), or ac(A,k) !c.b. ac(A,i);

ac(A,j) c.b. ac(A,k) < ac(A,l) implies
ac(A,j) c.b. ac(A,l), or ac(A,l) !c.b. ac(A,j).

One way to view all this is that sequencing is transitive with "comes
before", roughly speaking.


Now, if we consider atomic_xchg() to be a combined load followed by a
store, its atomic nature is expressed by requiring that no other store can
occur in the middle. Symbolically, let's say atomic_xchg(&A) is
represented by

ld(A,m); st(A,n);

and we can even stipulate that since these are atomic accesses, ld(A,m) <
st(A,n). Then for any other st(A,k) on any CPU, if st(A,k) c.b. st(A,n)
we must have st(A,k) c.b. ld(A,m). The reverse implication follows from
one of the degenerate subcases above.

From this you can prove that for any two atomic_xchg() calls on the same
atomic_t variable, one "comes before" the other. Going on from there, you
can show that -- assuming spinlocks are implemented via atomic_xchg() --
for any two critical sections, one comes completely before the other.
Furthermore every CPU will agree on which came first, so there is a
global total ordering of critical sections.

On the other hand, the fact that c.b. isn't transitive for all stores
means that this code can't be proved to work (all values initially 0):

CPU 0          CPU 1              CPU 2
-----          -----              -----
a = 1;         while (b < 1) ;
mb();          c = 1;
b = 1;         mb();
               b = 2;             while (b < 2) ;
                                  mb();
                                  assert(a==1 && c==1);

CPU 2 would have been safe in asserting that c==1 alone. But the
possibility remains that CPU 2 might see b=2 before seeing a=1, and it
might not see b=1 at all. Symbolically, even though we have

a=1 < b=1 c.b. b=2 c.b. !(b<2) < (a==1)

we can't conclude that a=1 c.b. (a==1).

What do you think?

Alan


2006-09-30 01:10:54

by Paul E. McKenney

[permalink] [raw]
Subject: Re: Uses for memory barriers

On Wed, Sep 27, 2006 at 05:06:57PM -0400, Alan Stern wrote:
> Paul:
>
> Thinking about this a little more, I came up with something that you might
> like. It's a bit abstract, but it does seem to capture all the essential
> ingredients.
>
> Let's start with some new notation. If A is a location in memory and n is
> an index number, let's write "ld(A,n)", "st(A,n)", and "ac(A,n)" to stand
> for a load, store, or arbitrary access to A. The index n is simply a way
> to distinguish among multiple accesses to the same location. If n isn't
> needed we may choose to omit it.

Don't we also need to have an argument indicating who is observing the
stores?

> This approach uses two important relations. The first is the "comes
> before" relation we have already seen. Let's abbreviate it "c.b.", or
> "c.a." for the converse "comes after" relation.
>
> "Comes before" applies across CPUs, but only for accesses to the same
> location in memory. st(A) c.b. ld(A) if the load returns the value of the
> store. st(A,m) c.b. st(A,n) if the second store overwrites the first.
> We do not allow loads to "come before" anything. This reflects the fact
> that even though a store may have completed and may be visible to a
> particular CPU, a subsequent load might not return the value of the store
> (for example, if an invalidate message has been acknowledged but not
> yet processed).
>
> "Comes before" need not be transitive, depending on the architecture. We
> can safely allow it to be transitive among stores that are all visible to
> some single CPU, but not all stores need to be so visible.

OK, I agree with total ordering on a specific variable, and also on
all loads and stores from a given CPU -- but the latter only from the
viewpoint of that particular CPU.

> As an example, consider a 4-CPU system where CPUs 0,1 share the cache C01
> and CPUs 2,3 share the cache C23. Suppose that each CPU executes a store
> to A concurrently. Then C01 might decide that the store from CPU 0 will
> overwrite the store from CPU 1, and C23 might decide that the store from
> CPU 2 will overwrite the store from CPU 3. Similarly, the two caches
> together might decide that the store from CPU 0 will overwrite the store
> from CPU 2. Under these conditions it makes no sense to compare the
> stores from CPUs 1 and 3, because nowhere are both stores visible.

Agreed -- in the absence of concurrent loads from A or use of things
like atomic_xchg() to do the stores, there is no way for the software
to know anything except that CPU 0 was the eventual winner. This means
that the six permutations of 1, 2, and 3 are possible from the software's
viewpoint -- it has no way of knowing that the order 3, 2, 1, 0 is the
"real" order.

> As a special case, we can assume that stores taking place under a bus lock

[or to a cacheline owned for the duration by the CPU doing the stores]

> (such as the store in atomic_xchg) will be visible to all CPUs or caches,
> and hence all such stores to a particular location can be placed in a
> global total order consistent with the "comes before" relation.
>
> As part of your P0, we can assert that whenever the sequence
>
> st(A,m)
> ac(A,n)
>
> occurs on a single CPU, st(A,m) c.b. ac(A,n). In other words, each CPU
> always sees the results of its own stores.
>
> The second relation I'll call "sequencing", and I'll write it using the
> standard "<" and ">" symbols. Unlike "comes before", sequencing applies
> to arbitrary memory locations but only to accesses on a single CPU. It is
> fully transitive, hence a genuine partial ordering. It's kind of a
> strengthened form of "comes before", just as "comes before" is a
> strengthened form of "occurs earlier in time".
>
> If M is a memory barrier, then in this code
>
> ac(A)
> M
> ac(B)
>
> we will have ac(A) < ac(B), provided the barrier type is appropriate for
> the sorts of access. As a special extra case, if st(B) has any kind of
> dependency (control, data, or address) on a previous ld(A), then ld(A) <
> st(B) even without a memory barrier. In other words, CPUs don't do
> speculative writes. But in general, if two accesses are not separated by
> a memory barrier then they are not sequenced.
>
> Given this background, the operation of memory barriers can be explained
> very simply as follows. Whenever we have
>
> ac(A,i) < ac(B,j) c.b. ac(B,k) < ac(A,l)
>
> then ac(A,i) c.b. ac(A,l), or if i is a load and l is a store, then
> st(A,l) !c.b. ld(A,i).
>
> As degenerate subcases, when A and B are the same location we also have:
>
> ac(A,i) < ac(A,j) c.b. ac(A,k) implies
> ac(A,i) c.b. ac(A,k), or ac(A,k) !c.b. ac(A,i);
>
> ac(A,j) c.b. ac(A,k) < ac(A,l) implies
> ac(A,j) c.b. ac(A,l), or ac(A,l) !c.b. ac(A,j).
>
> One way to view all this is that sequencing is transitive with "comes
> before", roughly speaking.
>
>
> Now, if we consider atomic_xchg() to be a combined load followed by a
> store, its atomic nature is expressed by requiring that no other store can
> occur in the middle. Symbolically, let's say atomic_xchg(&A) is
> represented by
>
> ld(A,m); st(A,n);
>
> and we can even stipulate that since these are atomic accesses, ld(A,m) <
> st(A,n). Then for any other st(A,k) on any CPU, if st(A,k) c.b. st(A,n)
> we must have st(A,k) c.b. ld(A,m). The reverse implication follows from
> one of the degenerate subcases above.
>
> From this you can prove that for any two atomic_xchg() calls on the same
> atomic_t variable, one "comes before" the other. Going on from there, you
> can show that -- assuming spinlocks are implemented via atomic_xchg() --
> for any two critical sections, one comes completely before the other.
> Furthermore every CPU will agree on which came first, so there is a
> global total ordering of critical sections.
>
> On the other hand, the fact that c.b. isn't transitive for all stores
> means that this code can't be proved to work (all values initially 0):
>
> CPU 0          CPU 1              CPU 2
> -----          -----              -----
> a = 1;         while (b < 1) ;
> mb();          c = 1;
> b = 1;         mb();
>                b = 2;             while (b < 2) ;
>                                   mb();
>                                   assert(a==1 && c==1);
>
> CPU 2 would have been safe in asserting that c==1 alone. But the
> possibility remains that CPU 2 might see b=2 before seeing a=1, and it
> might not see b=1 at all. Symbolically, even though we have
>
> a=1 < b=1 c.b. b=2 c.b. !(b<2) < (a==1)
>
> we can't conclude that a=1 c.b. (a==1).
>
> What do you think?

Interesting!

However, I believe we can safely claim a little bit more, given that
some CPUs do a blind store for the spin_unlock() operation. In this
blind-store case, a CPU that sees the store corresponding to (say) CPU
0's unlock would necessarily see all the accesses corresponding to
(say) CPU 1's "earlier" critical section. Therefore, at least some
degree of transitivity can be assumed for sequences of loads and stores
to a single variable.

Thoughts?

Thanx, Paul

2006-09-30 21:01:08

by Alan Stern

[permalink] [raw]
Subject: Re: Uses for memory barriers

On Fri, 29 Sep 2006, Paul E. McKenney wrote:

> > Let's start with some new notation. If A is a location in memory and n is
> > an index number, let's write "ld(A,n)", "st(A,n)", and "ac(A,n)" to stand
> > for a load, store, or arbitrary access to A. The index n is simply a way
> > to distinguish among multiple accesses to the same location. If n isn't
> > needed we may choose to omit it.
>
> Don't we also need to have an argument indicating who is observing the
> stores?

Not here. Such an observation would itself have to be a separate load,
and so would have its own index.

> > "Comes before" need not be transitive, depending on the architecture. We
> > can safely allow it to be transitive among stores that are all visible to
> > some single CPU, but not all stores need to be so visible.
>
> OK, I agree with total ordering on a specific variable, and also on
> all loads and stores from a given CPU -- but the latter only from the
> viewpoint of that particular CPU.

What I meant above was that the ordering can be total on all stores for a
specific variable that are visible to a given CPU, regardless of the CPUs
executing the stores. ("Comes before" never tries to compare accesses to
different variables.)

I admit that this notion may be a little vague, since it's hard to say
whether a particular store is visible to a particular CPU in the absence
of a witnessing load. The same objection applies to the issue of whether
one store overwrites another -- if a third store comes along and
overwrites both then it can be difficult or impossible to define the
ordering of the first two.

> > As an example, consider a 4-CPU system where CPUs 0,1 share the cache C01
> > and CPUs 2,3 share the cache C23. Suppose that each CPU executes a store
> > to A concurrently. Then C01 might decide that the store from CPU 0 will
> > overwrite the store from CPU 1, and C23 might decide that the store from
> > CPU 2 will overwrite the store from CPU 3. Similarly, the two caches
> > together might decide that the store from CPU 0 will overwrite the store
> > from CPU 2. Under these conditions it makes no sense to compare the
> > stores from CPUs 1 and 3, because nowhere are both stores visible.
>
> Agreed -- in the absence of concurrent loads from A or use of things
> like atomic_xchg() to do the stores, there is no way for the software
> to know anything except that CPU 0 was the eventual winner. This means
> that the six permutations of 1, 2, and 3 are possible from the software's
> viewpoint -- it has no way of knowing that the order 3, 2, 1, 0 is the
> "real" order.

It's worse than you say. Even if there _are_ concurrent loads that see
the various intermediate states, there's still no single CPU that can see
both the CPU 1 and CPU 3 values. No matter how hard you looked, you
wouldn't be able to order those two stores.


> > Now, if we consider atomic_xchg() to be a combined load followed by a
> > store, its atomic nature is expressed by requiring that no other store can
> > occur in the middle. Symbolically, let's say atomic_xchg(&A) is
> > represented by
> >
> > ld(A,m); st(A,n);
> >
> > and we can even stipulate that since these are atomic accesses, ld(A,m) <
> > st(A,n). Then for any other st(A,k) on any CPU, if st(A,k) c.b. st(A,n)
> > we must have st(A,k) c.b. ld(A,m). The reverse implication follows from
> > one of the degenerate subcases above.
> >
> > From this you can prove that for any two atomic_xchg() calls on the same
> > atomic_t variable, one "comes before" the other. Going on from there, you
> > can show that -- assuming spinlocks are implemented via atomic_xchg() --
> > for any two critical sections, one comes completely before the other.
> > Furthermore every CPU will agree on which came first, so there is a
> > global total ordering of critical sections.

> Interesting!
>
> However, I believe we can safely claim a little bit more, given that
> some CPUs do a blind store for the spin_unlock() operation. In this
> blind-store case, a CPU that sees the store corresponding to (say) CPU
0's unlock would necessarily see all the accesses corresponding to
> (say) CPU 1's "earlier" critical section. Therefore, at least some
> degree of transitivity can be assumed for sequences of loads and stores
> to a single variable.
>
> Thoughts?

I'm not quite sure what you're saying. In practice does it amount to
this?

CPU 0 CPU 1 CPU 2
----- ----- -----
spin_lock(&L); spin_lock(&L);
a = 1; b = a + 1;
spin_unlock(&L); spin_unlock(&L);
while (spin_is_locked(&L)) ;
rmb();
assert(!(b==2 && a==0));

I think this follows from the principles I laid out. But of course it
depends crucially on the protection provided by the critical sections.
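
For concreteness, that example might be rendered as a compilable C11
sketch along the following lines (the names are hypothetical,
acquire/release atomics stand in for the kernel primitives, and the
interesting interleaving -- CPU 2 already holding the lock when CPU 0
starts polling -- is assumed rather than enforced):

#include <stdatomic.h>
#include <assert.h>

static atomic_int a, b;
static atomic_int L;                    /* 0 == unlocked, 1 == locked */

static void sketch_lock(void)           /* spin_lock(&L) */
{
        while (atomic_exchange_explicit(&L, 1, memory_order_acquire))
                ;                       /* old value 1 == already held */
}

static void sketch_unlock(void)         /* spin_unlock(&L) */
{
        atomic_store_explicit(&L, 0, memory_order_release);
}

static void cpu1(void)
{
        sketch_lock();
        atomic_store_explicit(&a, 1, memory_order_relaxed);    /* a = 1 */
        sketch_unlock();
}

static void cpu2(void)
{
        sketch_lock();
        atomic_store_explicit(&b,                          /* b = a + 1 */
                atomic_load_explicit(&a, memory_order_relaxed) + 1,
                memory_order_relaxed);
        sketch_unlock();
}

static void cpu0(void)
{
        while (atomic_load_explicit(&L, memory_order_relaxed))
                ;               /* while (spin_is_locked(&L)) ; */
        atomic_thread_fence(memory_order_acquire);      /* rmb() */
        assert(!(atomic_load_explicit(&b, memory_order_relaxed) == 2 &&
                 atomic_load_explicit(&a, memory_order_relaxed) == 0));
}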

Alan

2006-10-02 00:05:48

by Paul E. McKenney

Subject: Re: Uses for memory barriers

On Sat, Sep 30, 2006 at 05:01:05PM -0400, Alan Stern wrote:
> On Fri, 29 Sep 2006, Paul E. McKenney wrote:
>
> > > Let's start with some new notation. If A is a location in memory and n is
> > > an index number, let's write "ld(A,n)", "st(A,n)", and "ac(A,n)" to stand
> > > for a load, store, or arbitrary access to A. The index n is simply a way
> > > to distinguish among multiple accesses to the same location. If n isn't
> > > needed we may choose to omit it.
> >
> > Don't we also need to have an argument indicating who is observing the
> > stores?
>
> Not here. Such an observation would itself have to be a separate load,
> and so would have its own index.

Ah! So ld(A,n) sees st(A,i) for the largest i<n?

> > > "Comes before" need not be transitive, depending on the architecture. We
> > > can safely allow it to be transitive among stores that are all visible to
> > > some single CPU, but not all stores need to be so visible.
> >
> > OK, I agree with total ordering on a specific variable, and also on
> > all loads and stores from a given CPU -- but the latter only from the
> > viewpoint of that particular CPU.
>
> What I meant above was that the ordering can be total on all stores for a
> specific variable that are visible to a given CPU, regardless of the CPUs
> executing the stores. ("Comes before" never tries to compare accesses to
> different variables.)
>
> I admit that this notion may be a little vague, since it's hard to say
> whether a particular store is visible to a particular CPU in the absence
> of a witnessing load. The same objection applies to the issue of whether
> one store overwrites another -- if a third store comes along and
> overwrites both then it can be difficult or impossible to define the
> ordering of the first two.

Definitely a complication!

> > > As an example, consider a 4-CPU system where CPUs 0,1 share the cache C01
> > > and CPUs 2,3 share the cache C23. Suppose that each CPU executes a store
> > > to A concurrently. Then C01 might decide that the store from CPU 0 will
> > > overwrite the store from CPU 1, and C23 might decide that the store from
> > > CPU 2 will overwrite the store from CPU 3. Similarly, the two caches
> > > together might decide that the store from CPU 0 will overwrite the store
> > > from CPU 2. Under these conditions it makes no sense to compare the
> > > stores from CPUs 1 and 3, because nowhere are both stores visible.
> >
> > Agreed -- in the absence of concurrent loads from A or use of things
> > like atomic_xchg() to do the stores, there is no way for the software
> > to know anything except that CPU 0 was the eventual winner. This means
> > that the six permutations of 1, 2, and 3 are possible from the software's
> > viewpoint -- it has no way of knowing that the order 3, 2, 1, 0 is the
> > "real" order.
>
> It's worse than you say. Even if there _are_ concurrent loads that see
> the various intermediate states, there's still no single CPU that can see
> both the CPU 1 and CPU 3 values. No matter how hard you looked, you
> wouldn't be able to order those two stores.

In the absence of something like a synchronized fine-grained clock, yes, you are
all too correct.

> > > Now, if we consider atomic_xchg() to be a combined load followed by a
> > > store, its atomic nature is expressed by requiring that no other store can
> > > occur in the middle. Symbolically, let's say atomic_xchg(&A) is
> > > represented by
> > >
> > > ld(A,m); st(A,n);
> > >
> > > and we can even stipulate that since these are atomic accesses, ld(A,m) <
> > > st(A,n). Then for any other st(A,k) on any CPU, if st(A,k) c.b. st(A,n)
> > > we must have st(A,k) c.b. ld(A,m). The reverse implication follows from
> > > one of the degenerate subcases above.
> > >
> > > From this you can prove that for any two atomic_xchg() calls on the same
> > > atomic_t variable, one "comes before" the other. Going on from there, you
> > > can show that -- assuming spinlocks are implemented via atomic_xchg() --
> > > for any two critical sections, one comes completely before the other.
> > > Furthermore every CPU will agree on which came first, so there is a
> > > global total ordering of critical sections.
>
> > Interesting!
> >
> > However, I believe we can safely claim a little bit more, given that
> > some CPUs do a blind store for the spin_unlock() operation. In this
> > blind-store case, a CPU that sees the store corresponding to (say) CPU
> 0's unlock would necessarily see all the accesses corresponding to
> > (say) CPU 1's "earlier" critical section. Therefore, at least some
> > degree of transitivity can be assumed for sequences of loads and stores
> > to a single variable.
> >
> > Thoughts?
>
> I'm not quite sure what you're saying. In practice does it amount to
> this?
>
> CPU 0 CPU 1 CPU 2
> ----- ----- -----
> spin_lock(&L); spin_lock(&L);
> a = 1; b = a + 1;
> spin_unlock(&L); spin_unlock(&L);
> while (spin_is_locked(&L)) ;
> rmb();
> assert(!(b==2 && a==0));
>
> I think this follows from the principles I laid out. But of course it
> depends crucially on the protection provided by the critical sections.

In the absence of CONFIG_X86_OOSTORE or CONFIG_X86_PPRO_FENCE, i386's
implementation of spin_unlock() ends up being a simple store:

#define __raw_spin_unlock_string \
"movb $1,%0" \
:"+m" (lock->slock) : : "memory"

No explicit memory barrier, as the x86's implicit store-ordering memory
barriers suffice to keep stores inside the critical section. In addition,
x86 refuses to pull a store ahead of a load, so the loads are also confined
to the critical section.
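
Purely as a user-space illustration (hypothetical code, not the
kernel's), the same shape -- atomic xchg to acquire, blind store to
release -- might look like:

static void xchg_lock(volatile int *slock)      /* 1 == unlocked */
{
        int tmp;

        do {
                tmp = 0;
                /* atomic swap: xchg with a memory operand is locked */
                __asm__ __volatile__("xchgl %0,%1"
                        : "+r" (tmp), "+m" (*slock) : : "memory");
        } while (tmp == 0);     /* old value 0 means it was held */
}

static void xchg_unlock(volatile int *slock)
{
        /* compiler barrier, like the "memory" clobber quoted above */
        __asm__ __volatile__("" : : : "memory");
        *slock = 1;             /* the blind store to the lock word */
}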

So, your example then looks as follows:

CPU 0 CPU 1 CPU 2
----- ----- -----
spin_lock(&L); spin_lock(&L);
a = 1; b = a + 1;
implicit_mb(); implicit_mb();
L=unlocked; L=unlocked;
while (spin_is_locked(&L)) ;
rmb();
assert(!(b==2 && a==0));

I am then asserting that a very weak form of transitivity is required.
The fact that CPU 0 saw CPU 2's unlock and the fact that CPU 2 saw
CPU 1's assignment a=1 must imply that CPU 0 also sees CPU 1's a=1.
It is OK to instead invoke the fact that CPU 2 saw CPU 1's unlock before
seeing CPU 1's assignment a=1, and I am suggesting taking this latter
course, since it appears to me to be the weaker assumption.

Thoughts?

BTW, I like your approach of naming the orderings differently. For
the pseudo-ordering implied by a memory barrier, would something like
"conditionally precedes" and "conditionally follows" get across the
fact that -some- sort of ordering is happening, but not necessarily
strict temporal ordering?

Thanx, Paul

2006-10-02 15:44:41

by Alan Stern

Subject: Re: Uses for memory barriers

On Sun, 1 Oct 2006, Paul E. McKenney wrote:

> On Sat, Sep 30, 2006 at 05:01:05PM -0400, Alan Stern wrote:
> > On Fri, 29 Sep 2006, Paul E. McKenney wrote:
> >
> > > > Let's start with some new notation. If A is a location in memory and n is
> > > > an index number, let's write "ld(A,n)", "st(A,n)", and "ac(A,n)" to stand
> > > > for a load, store, or arbitrary access to A. The index n is simply a way
> > > > to distinguish among multiple accesses to the same location. If n isn't
> > > > needed we may choose to omit it.
> > >
> > > Don't we also need to have an argument indicating who is observing the
> > > stores?
> >
> > Not here. Such an observation would itself have to be a separate load,
> > and so would have its own index.
>
> Ah! So ld(A,n) sees st(A,i) for the largest i<n?

No, the index numbers don't have any intrinsic meaning. They're just
labels. They don't have to appear in any numerical or temporal sequence.
You can think of them as nothing more than line numbers in the source
code.

Whether or not a particular load (considered as a line of code in the
program source) sees a particular store depends on the execution details.
The labels you put on the source lines have nothing to do with it.

> > I admit that this notion may be a little vague, since it's hard to say
> > whether a particular store is visible to a particular CPU in the absence
> > of a witnessing load. The same objection applies to the issue of whether
> > one store overwrites another -- if a third store comes along and
> > overwrites both then it can be difficult or impossible to define the
> > ordering of the first two.
>
> Definitely a complication!

I don't know quite how to deal with it. It's beginning to look like the
original definition of "comes before" was too narrow. We should instead
say that a store "comes before" a load if the load _would_ return the
value of the store, had there been no intervening stores.


> In the absence of CONFIG_X86_OOSTORE or CONFIG_X86_PPRO_FENCE, i386's
> implementation of spin_unlock() ends up being a simple store:
>
> #define __raw_spin_unlock_string \
> "movb $1,%0" \
> :"+m" (lock->slock) : : "memory"
>
> No explicit memory barrier, as the x86's implicit store-ordering memory
> barriers suffice to keep stores inside the critical section. In addition,
> x86 refuses to pull a store ahead of a load, so the loads are also confined
> to the critical section.
>
> So, your example then looks as follows:
>
> CPU 0 CPU 1 CPU 2
> ----- ----- -----
> spin_lock(&L); spin_lock(&L);
> a = 1; b = a + 1;
> implicit_mb(); implicit_mb();
> L=unlocked; L=unlocked;
> while (spin_is_locked(&L)) ;
> rmb();
> assert(!(b==2 && a==0));
>
> I am then asserting that a very weak form of transitivity is required.
> The fact that CPU 0 saw CPU 2's unlock and the fact that CPU 2 saw
> CPU 1's assignment a=1 must imply that CPU 0 also sees CPU 1's a=1.
> It is OK to instead invoke the fact that CPU 2 saw CPU 1's unlock before
> seeing CPU 1's assignment a=1, and I am suggesting taking this latter
> course, since it appears to me to be the weaker assumption.
>
> Thoughts?

This does point out a weakness in the formalism. You can make it even
more pointed by replacing CPU 0's

while (spin_is_locked(&L)) ;
rmb();

with

spin_lock(&L);

and adding a corresponding spin_unlock(&L) to the end.

Interestingly, if simpler synchronization techniques were used instead of
the spin_lock operations then we might not expect the assertion to hold.
For example:

CPU 0 CPU 1 CPU 2
----- ----- -----
a = 1; while (x < 1) ;
mb(); mb();
x = 1; b = a + 1;
mb();
x = 2;
while (x < 2) ;
rmb();
assert(!(b==2 && a==0));

Imagine an architecture where stores get propagated rapidly from CPU 1 to
CPU 2 and from CPU 2 to CPU 0, but they take a long time to go from CPU 1
to CPU 0. In that case it's quite possible for CPU 0 to see CPU 2's
stores to x and b before it sees CPU 1's store to a (and it might not see
CPU 0's store to x at all).

So to make critical sections work as desired, there has to be something
very special about the primitive operation used in spin_lock(). It must
have a transitivity property:

st(L,i) c.b. st*(L,j) c.b. ac(L,k) implies
st(L,i) c.b. ac(L,k),

where st*(L,j) is the special sort of access used by spin_lock. It might
be the store part of atomic_xchg(), for instance.

Going back to the original example, we would then have (leaving out a few
minor details):

CPU 1: st*(L,1) // spin_lock(&L);
mb(2);
st(a,3); // a = 1;
mb(4);
st(L,5); // spin_unlock(&L);

CPU 2: st*(L,6); // spin_lock(&L);
mb(7);
ld(a,8);
st(b,9); // b = a + 1;
mb(10);
st(L,11); // spin_unlock(&L);

CPU 0: ld(L,12); // while (spin_is_locked(&L)) ;
mb(13);
ld(b,14);
ld(a,15); // assert(!(b==2 && a==0));

Assuming that CPU 0 sees b==2, we then get:

st(a,3) < st(L,5) c.b. st*(L,6) < st(L,11) c.b. ld(L,12) < ld(a,15),

the <'s being justified by the various mb() instructions. The second <
can be absorbed, leaving

st(a,3) < st(L,5) c.b. st*(L,6) c.b. ld(L,12) < ld(a,15)

and then the new transitivity property gives us

st(a,3) < st(L,5) c.b. ld(L,12) < ld(a,15)

from which we get st(a,3) c.b. ld(a,15). This means that CPU 0 sees a==1,
since there are no intervening stores to a.


> BTW, I like your approach of naming the orderings differently. For
> the pseudo-ordering implied by a memory barrier, would something like
> "conditionally precedes" and "conditionally follows" get across the
> fact that -some- sort of ordering is happening, but not necessarily
> strict temporal ordering?

Why use the word "conditionally"? Conditional on what?

If you want a more neutral term for expressing an ordering relation, how
about something like "dominates" or "supersedes"?

Alan

2006-10-04 15:34:42

by Paul E. McKenney

Subject: Re: Uses for memory barriers

On Mon, Oct 02, 2006 at 11:44:36AM -0400, Alan Stern wrote:
> On Sun, 1 Oct 2006, Paul E. McKenney wrote:
>
> > On Sat, Sep 30, 2006 at 05:01:05PM -0400, Alan Stern wrote:
> > > On Fri, 29 Sep 2006, Paul E. McKenney wrote:
> > >
> > > > > Let's start with some new notation. If A is a location in memory and n is
> > > > > an index number, let's write "ld(A,n)", "st(A,n)", and "ac(A,n)" to stand
> > > > > for a load, store, or arbitrary access to A. The index n is simply a way
> > > > > to distinguish among multiple accesses to the same location. If n isn't
> > > > > needed we may choose to omit it.
> > > >
> > > > Don't we also need to have an argument indicating who is observing the
> > > > stores?
> > >
> > > Not here. Such an observation would itself have to be a separate load,
> > > and so would have its own index.
> >
> > Ah! So ld(A,n) sees st(A,i) for the largest i<n?
>
> No, the index numbers don't have any intrinsic meaning. They're just
> labels. They don't have to appear in any numerical or temporal sequence.
> You can think of them as nothing more than line numbers in the source
> code.
>
> Whether or not a particular load (considered as a line of code in the
> program source) sees a particular store depends on the execution details.
> The labels you put on the source lines have nothing to do with it.

OK. Would a given label be repeated if the corresponding load was in
a loop or something?

> > > I admit that this notion may be a little vague, since it's hard to say
> > > whether a particular store is visible to a particular CPU in the absence
> > > of a witnessing load. The same objection applies to the issue of whether
> > > one store overwrites another -- if a third store comes along and
> > > overwrites both then it can be difficult or impossible to define the
> > > ordering of the first two.
> >
> > Definitely a complication!
>
> I don't know quite how to deal with it. It's beginning to look like the
> original definition of "comes before" was too narrow. We should instead
> say that a store "comes before" a load if the load _would_ return the
> value of the store, had there been no intervening stores.

Another approach is to define a (conceptual!!!) version number for each
variable that increments on every store to that variable. Then a store
"comes before" a load if the load returns the value stored or some value
corresponding to a larger version number.
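
A toy rendering of that conceptual model (illustrative only -- real
hardware maintains no such counter):

struct versioned_var {
        int value;
        unsigned long version;  /* conceptual: bumped by every store */
};

static void vstore(struct versioned_var *v, int value)
{
        v->value = value;
        v->version++;           /* each store gets the next version */
}

/* the store tagged store_ver "comes before" a load that saw seen_ver */
static int comes_before(unsigned long store_ver, unsigned long seen_ver)
{
        return store_ver <= seen_ver;
}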

> > In the absence of CONFIG_X86_OOSTORE or CONFIG_X86_PPRO_FENCE, i386's
> > implementation of spin_unlock() ends up being a simple store:
> >
> > #define __raw_spin_unlock_string \
> > "movb $1,%0" \
> > :"+m" (lock->slock) : : "memory"
> >
> > No explicit memory barrier, as the x86's implicit store-ordering memory
> > barriers suffice to keep stores inside the critical section. In addition,
> > x86 refuses to pull a store ahead of a load, so the loads are also confined
> > to the critical section.
> >
> > So, your example then looks as follows:
> >
> > CPU 0 CPU 1 CPU 2
> > ----- ----- -----
> > spin_lock(&L); spin_lock(&L);
> > a = 1; b = a + 1;
> > implicit_mb(); implicit_mb();
> > L=unlocked; L=unlocked;
> > while (spin_is_locked(&L)) ;
> > rmb();
> > assert(!(b==2 && a==0));
> >
> > I am then asserting that a very weak form of transitivity is required.
> > The fact that CPU 0 saw CPU 2's unlock and the fact that CPU 2 saw
> > CPU 1's assignment a=1 must imply that CPU 0 also sees CPU 1's a=1.
> > It is OK to instead invoke the fact that CPU 2 saw CPU 1's unlock before
> > seeing CPU 1's assignment a=1, and I am suggesting taking this latter
> > course, since it appears to me to be the weaker assumption.
> >
> > Thoughts?
>
> This does point out a weakness in the formalism. You can make it even
> more pointed by replacing CPU 0's
>
> while (spin_is_locked(&L)) ;
> rmb();
>
> with
>
> spin_lock(&L);
>
> and adding a corresponding spin_unlock(&L) to the end.
>
> Interestingly, if simpler synchronization techniques were used instead of
> the spin_lock operations then we might not expect the assertion to hold.
> For example:
>
> CPU 0 CPU 1 CPU 2
> ----- ----- -----
> a = 1; while (x < 1) ;
> mb(); mb();
> x = 1; b = a + 1;
> mb();
> x = 2;
> while (x < 2) ;
> rmb();
> assert(!(b==2 && a==0));
>
> Imagine an architecture where stores get propagated rapidly from CPU 1 to
> CPU 2 and from CPU 2 to CPU 0, but they take a long time to go from CPU 1
> to CPU 0. In that case it's quite possible for CPU 0 to see CPU 2's
> stores to x and b before it sees CPU 1's store to a (and it might not see
> CPU 0's store to x at all).

You mean "CPU 1's store to x" in that last? In the sense that CPU 0
might never see x==1, agreed. If one is taking a version-number approach,
then the fact that CPU 0 eventually sees x==2 would mean that it also
saw x==1 in the sense that it saw a later version.

> So to make critical sections work as desired, there has to be something
> very special about the primitive operation used in spin_lock(). It must
> have a transitivity property:
>
> st(L,i) c.b. st*(L,j) c.b. ac(L,k) implies
> st(L,i) c.b. ac(L,k),
>
> where st*(L,j) is the special sort of access used by spin_lock. It might
> be the store part of atomic_xchg(), for instance.

Except that spin_unlock() can be implemented by a normal store (preceded
by an explicit memory barrier on sufficiently weakly ordered machines).
Hence my assertion in earlier email that visibility of stores to a single
variable has to be transitive. By this I mean that if a CPU sees version
N of a given variable, it must be as if it has also seen all versions
i<N of that variable. Otherwise the classic store-on-unlock form of
spinlocks doesn't work.

Or are you arguing that you could get the same effect by something
special happening with the atomic operation that is part of spin_lock()?

> Going back to the original example, we would then have (leaving out a few
> minor details):
>
> CPU 1: st*(L,1) // spin_lock(&L);
> mb(2);
> st(a,3); // a = 1;

This notation is quite confusing -- my brain automatically interprets
"st(a,3)" as "store the value 3 to variable a" rather than the intended
"store some unknown value to variable a with source-code-line tag 3".

> mb(4);
> st(L,5); // spin_unlock(&L);
>
> CPU 2: st*(L,6); // spin_lock(&L);
> mb(7);
> ld(a,8);
> st(b,9); // b = a + 1;
> mb(10);
> st(L,11); // spin_unlock(&L);
>
> CPU 0: ld(L,12); // while (spin_is_locked(&L)) ;
> mb(13);
> ld(b,14);
> ld(a,15); // assert(!(b==2 && a==0));
>
> Assuming that CPU 0 sees b==2, we then get:
>
> st(a,3) < st(L,5) c.b. st*(L,6) < st(L,11) c.b. ld(L,12) < ld(a,15),
>
> the <'s being justified by the various mb() instructions. The second <
> can be absorbed, leaving
>
> st(a,3) < st(L,5) c.b. st*(L,6) c.b. ld(L,12) < ld(a,15)
>
> and then the new transitivity property gives us
>
> st(a,3) < st(L,5) c.b. ld(L,12) < ld(a,15)
>
> from which we get st(a,3) c.b. ld(a,15). This means that CPU 0 sees a==1,
> since there are no intervening stores to a.
>
>
> > BTW, I like your approach of naming the orderings differently. For
> > the pseudo-ordering implied by a memory barrier, would something like
> > "conditionally precedes" and "conditionally follows" get across the
> > fact that -some- sort of ordering is happening, but not necessarily
> > strict temporal ordering?
>
> Why use the word "conditionally"? Conditional on what?
>
> If you want a more neutral term for expressing an ordering relation, how
> about something like "dominates" or "supersedes"?

Let me try again...

When a single CPU does loads and stores, its loads see its own stores
without needing any special ordering instructions. When a bunch of
CPUs do loads and stores to a single variable, they see a consistent
view of the resulting sequence of values. But when multiple CPUs do
loads and stores to multiple variables -- even with memory barriers
added -- things get strange. The terms "comes before", "dominates",
"supersedes", etc. all imply some sort of linear ordering that simply
does not necessarily exist in this multi-CPU/multi-variable case.

So, conditional on what? Conditional on the other CPUs involved making
proper use of explicit (or implicit) memory barriers.

Fair enough?

Thanx, Paul

2006-10-04 18:04:44

by Alan Stern

Subject: Re: Uses for memory barriers

On Wed, 4 Oct 2006, Paul E. McKenney wrote:

> > > Ah! So ld(A,n) sees st(A,i) for the largest i<n?
> >
> > No, the index numbers don't have any intrinsic meaning. They're just
> > labels. They don't have to appear in any numerical or temporal sequence.
> > You can think of them as nothing more than line numbers in the source
> > code.
> >
> > Whether or not a particular load (considered as a line of code in the
> > program source) sees a particular store depends on the execution details.
> > The labels you put on the source lines have nothing to do with it.
>
> OK. Would a given label be repeated if the corresponding load was in
> a loop or something?

I knew you would ask that! No, for different iterations of a loop you
would use different labels. They don't correspond exactly to locations in
the source code, more like to particular events during the execution.


> > I don't know quite how to deal with it. It's beginning to look like the
> > original definition of "comes before" was too narrow. We should instead
> > say that a store "comes before" a load if the load _would_ return the
> > value of the store, had there been no intervening stores.
>
> Another approach is to define a (conceptual!!!) version number for each
> variable that increments on every store to that variable. Then a store
> "comes before" a load if the load returns the value stored or some value
> corresponding to a larger version number.

But that might not give the correct semantics. See below.


> > Interestingly, if simpler synchronization techniques were used instead of
> > the spin_lock operations then we might not expect the assertion to hold.
> > For example:
> >
> > CPU 0 CPU 1 CPU 2
> > ----- ----- -----
> > a = 1; while (x < 1) ;
> > mb(); mb();
> > x = 1; b = a + 1;
> > mb();
> > x = 2;
> > while (x < 2) ;
> > rmb();
> > assert(!(b==2 && a==0));
> >
> > Imagine an architecture where stores get propagated rapidly from CPU 1 to
> > CPU 2 and from CPU 2 to CPU 0, but they take a long time to go from CPU 1
> > to CPU 0. In that case it's quite possible for CPU 0 to see CPU 2's
> > stores to x and b before it sees CPU 1's store to a (and it might not see
> > CPU 0's store to x at all).
>
> You mean "CPU 1's store to x" in that last?

Yes, yet another typo. They're impossible to eradicate. :-(

> In the sense that CPU 0
> might never see x==1, agreed. If one is taking a version-number approach,
> then the fact that CPU 0 eventually sees x==2 would mean that it also
> saw x==1 in the sense that it saw a later version.

And yet if you make that assumption (that CPU 0 "saw" x==1 because it does
see x==2), then the presence of the rmb() forces you to conclude that CPU
0 also sees a==1 -- which it might not, at least, not until after the
assertion has already failed.

This illustrates how using the "version number" approach can lead to
an incorrect result.

> > So to make critical sections work as desired, there has to be something
> > very special about the primitive operation used in spin_lock(). It must
> > have a transitivity property:
> >
> > st(L,i) c.b. st*(L,j) c.b. ac(L,k) implies
> > st(L,i) c.b. ac(L,k),
> >
> > where st*(L,j) is the special sort of access used by spin_lock. It might
> > be the store part of atomic_xchg(), for instance.
>
> Except that spin_unlock() can be implemented by a normal store (preceded
> by an explicit memory barrier on sufficiently weakly ordered machines).

That's right. So long as the spin_lock() operation is special, it doesn't
matter if spin_unlock() is normal.

> Hence my assertion in earlier email that visibility of stores to a single
> variable has to be transitive. By this I mean that if a CPU sees version
> N of a given variable, it must be as if it has also seen all versions
> i<N of that variable. Otherwise the classic store-on-unlock form of
> spinlocks doesn't work.
>
> Or are you arguing that you could get the same effect by something
> special happening with the atomic operation that is part of spin_lock()?

Yes, that's exactly what I'm arguing. It seems better to say that an
atomic operation (which we already know can have strange effects on the
memory bus) -- or whatever other operation is used in the native
implementation of spin_lock() -- has special properties than it does to
say that all stores to a single variable must be transitive.

> > Going back to the original example, we would then have (leaving out a few
> > minor details):
> >
> > CPU 1: st*(L,1) // spin_lock(&L);
> > mb(2);
> > st(a,3); // a = 1;
>
> This notation is quite confusing -- my brain automatically interprets
> "st(a,3)" as "store the value 3 to variable a" rather than the intended
> "store some unknown value to variable a with source-code-line tag 3".

I'd be happy to use a different notation, if you want to suggest one.
Maybe something like 3:st(a).


> > > BTW, I like your approach of naming the orderings differently. For
> > > the pseudo-ordering implied by a memory barrier, would something like
> > > "conditionally precedes" and "conditionally follows" get across the
> > > fact that -some- sort of ordering is happening, but not necessarily
> > > strict temporal ordering?
> >
> > Why use the word "conditionally"? Conditional on what?
> >
> > If you want a more neutral term for expressing an ordering relation, how
> > about something like "dominates" or "supersedes"?
>
> Let me try again...
>
> When a single CPU does loads and stores, its loads see its own stores
> without needing any special ordering instructions. When a bunch of
> CPUs do loads and stores to a single variable, they see a consistent
> view of the resulting sequence of values. But when multiple CPUs do
> loads and stores to multiple variables -- even with memory barriers
> added -- things get strange. The terms "comes before", "dominates",
> "supersedes", etc. all imply some sort of linear ordering that simply
> does not necessarily exist in this multi-CPU/multi-variable case.
>
> So, conditional on what? Conditional on the other CPUs involved making
> proper use of explicit (or implicit) memory barriers.
>
> Fair enough?

I see. The term in my mind was more like "sequenced before", although it
never got used in the message; you probably wouldn't like it either.
(You shouldn't object to "comes before", because that is defined to relate
only accesses to a single variable.)

The thing is, your objection to these terms has to do with the lack of
a single global linear ordering. The fact that they imply a linear
ordering should be okay, provided you accept that it is a _local_
ordering -- it makes sense only in the context of a single CPU.

For example, in the program

ld(A)
mb()
st(B)

the load and the store really are very strongly ordered by the mb(). The
CPU is prevented from rearranging them in ways that it normally would.
Sure, when you look at the results from the point of view of a different
CPU this ordering might or might not be apparent -- but it's still there
and very real.

So perhaps the best terms would be "locally precedes" and "locally
follows".

Alan

2006-10-13 16:50:12

by Paul E. McKenney

Subject: Re: Uses for memory barriers

On Wed, Oct 04, 2006 at 02:04:36PM -0400, Alan Stern wrote:
> On Wed, 4 Oct 2006, Paul E. McKenney wrote:
>
> > > > Ah! So ld(A,n) sees st(A,i) for the largest i<n?
> > >
> > > No, the index numbers don't have any intrinsic meaning. They're just
> > > labels. They don't have to appear in any numerical or temporal sequence.
> > > You can think of them as nothing more than line numbers in the source
> > > code.
> > >
> > > Whether or not a particular load (considered as a line of code in the
> > > program source) sees a particular store depends on the execution details.
> > > The labels you put on the source lines have nothing to do with it.
> >
> > OK. Would a given label be repeated if the corresponding load was in
> > a loop or something?
>
> I knew you would ask that! No, for different iterations of a loop you
> would use different labels. They don't correspond exactly to locations in
> the source code, more like to particular events during the execution.

Sorry to be so predictable. ;-)

Just for full disclosure, my experience has been that formal models
of memory ordering are not all that helpful in explaining memory ordering
to people. They -can- be good for formal verification, but their
track record even in that area has not been particularly stellar.

Formal verification of a particular -implementation- is another story,
that has worked -very- well -- but I am hoping for some machine
independence!

> > > I don't know quite how to deal with it. It's beginning to look like the
> > > original definition of "comes before" was too narrow. We should instead
> > > say that a store "comes before" a load if the load _would_ return the
> > > value of the store, had there been no intervening stores.
> >
> > Another approach is to define a (conceptual!!!) version number for each
> > variable that increments on every store to that variable. Then a store
> > "comes before" a load if the load returns the value stored or some value
> > corresponding to a larger version number.
>
> But that might not give the correct semantics. See below.

[Biting tongue and reading ahead...]

> > > Interestingly, if simpler synchronization techniques were used instead of
> > > the spin_lock operations then we might not expect the assertion to hold.
> > > For example:
> > >
> > > CPU 0 CPU 1 CPU 2
> > > ----- ----- -----
> > > a = 1; while (x < 1) ;
> > > mb(); mb();
> > > x = 1; b = a + 1;
> > > mb();
> > > x = 2;
> > > while (x < 2) ;
> > > rmb();
> > > assert(!(b==2 && a==0));
> > >
> > > Imagine an architecture where stores get propagated rapidly from CPU 1 to
> > > CPU 2 and from CPU 2 to CPU 0, but they take a long time to go from CPU 1
> > > to CPU 0. In that case it's quite possible for CPU 0 to see CPU 2's
> > > stores to x and b before it sees CPU 1's store to a (and it might not see
> > > CPU 0's store to x at all).
> >
> > You mean "CPU 1's store to x" in that last?
>
> Yes, yet another typo. They're impossible to eradicate. :-(

For me as well... :-/

> > In the sense that CPU 0
> > might never see x==1, agreed. If one is taking a version-number approach,
> > then the fact that CPU 0 eventually sees x==2 would mean that it also
> > saw x==1 in the sense that it saw a later version.
>
> And yet if you make that assumption (that CPU 0 "saw" x==1 because it does
> see x==2), then the presence of the rmb() forces you to conclude that CPU
> 0 also sees a==1 -- which it might not, at least, not until after the
> assertion has already failed.
>
> This illustrates how using the "version number" approach can lead to
> an incorrect result.

My thought earlier was to allow transitivity in the case of stores and
loads to a single variable. You later argued in favor of taking the
less-restrictive-to-hardware approach of only allowing transitivity for
CPUs whose last access to the variable in question was via an atomic
operation. My "seen" approach would work for all CPU architectures
that I am familiar with, but your approach would also work for a few
even-more obnoxious CPUs that someone might someday build.

I have not yet decided which approach to take, will write things up
and talk to a few more CPU architects.

> > > So to make critical sections work as desired, there has to be something
> > > very special about the primitive operation used in spin_lock(). It must
> > > have a transitivity property:
> > >
> > > st(L,i) c.b. st*(L,j) c.b. ac(L,k) implies
> > > st(L,i) c.b. ac(L,k),
> > >
> > > where st*(L,j) is the special sort of access used by spin_lock. It might
> > > be the store part of atomic_xchg(), for instance.
> >
> > Except that spin_unlock() can be implemented by a normal store (preceded
> > by an explicit memory barrier on sufficiently weakly ordered machines).
>
> That's right. So long as the spin_lock() operation is special, it doesn't
> matter if spin_unlock() is normal.

Yep. The choices are:

1. The spin_lock() operation is special, and makes all prior
critical sections for the given lock visible. (Your approach.)
Since a CPU cannot tell spin_lock() from a hole in the
ground (OK, outside of a few research efforts), this would
seem to come down to an atomic operation.

2. Seeing a given value of a particular variable has the same
effect with respect to memory barriers as having seen any
earlier value of that variable. (My approach.)

I believe that any code that works on a CPU that does #1 would also
work on a CPU that does #2, but not vice versa. If true, this would
certainly be a -major- point in favor of your approach. ;-)

> > Hence my assertion in earlier email that visibility of stores to a single
> > variable has to be transitive. By this I mean that if a CPU sees version
> > N of a given variable, it must be as if it has also seen all versions
> > i<N of that variable. Otherwise the classic store-on-unlock form of
> > spinlocks doesn't work.
> >
> > Or are you arguing that you could get the same effect by something
> > special happening with the atomic operation that is part of spin_lock()?
>
> Yes, that's exactly what I'm arguing. It seems better to say that an
> atomic operation (which we already know can have strange effects on the
> memory bus) -- or whatever other operation is used in the native
> implementation of spin_lock() -- has special properties than it does to
> say that all stores to a single variable must be transitive.

OK, good point.

> > > Going back to the original example, we would then have (leaving out a few
> > > minor details):
> > >
> > > CPU 1: st*(L,1) // spin_lock(&L);
> > > mb(2);
> > > st(a,3); // a = 1;
> >
> > This notation is quite confusing -- my brain automatically interprets
> > "st(a,3)" as "store the value 3 to variable a" rather than the intended
> > "store some unknown value to variable a with source-code-line tag 3".
>
> I'd be happy to use a different notation, if you want to suggest one.
> Maybe something like 3:st(a).

I am going to try to write it up and see what happens. My fear is that
any formal notation will be a barrier rather than a bridge to understanding.
That said, one possible approach would be to present informally and have
a formal notation in a separate appendix.

> > > > BTW, I like your approach of naming the orderings differently. For
> > > > the pseudo-ordering implied by a memory barrier, would something like
> > > > "conditionally precedes" and "conditionally follows" get across the
> > > > fact that -some- sort of ordering is happening, but not necessarily
> > > > strict temporal ordering?
> > >
> > > Why use the word "conditionally"? Conditional on what?
> > >
> > > If you want a more neutral term for expressing an ordering relation, how
> > > about something like "dominates" or "supersedes"?
> >
> > Let me try again...
> >
> > When a single CPU does loads and stores, its loads see its own stores
> > without needing any special ordering instructions. When a bunch of
> > CPUs do loads and stores to a single variable, they see a consistent
> > view of the resulting sequence of values. But when multiple CPUs do
> > loads and stores to multiple variables -- even with memory barriers
> > added -- things get strange. The terms "comes before", "dominates",
> > "supersedes", etc. all imply some sort of linear ordering that simply
> > does not necessarily exist in this multi-CPU/multi-variable case.
> >
> > So, conditional on what? Conditional on the other CPUs involved making
> > proper use of explicit (or implicit) memory barriers.
> >
> > Fair enough?
>
> I see. The term in my mind was more like "sequenced before", although it
> never got used in the message; you probably wouldn't like it either.
> (You shouldn't object to "comes before", because that is defined to relate
> only accesses to a single variable.)

Agreed, I have no problem with globally ordered sequences for the values
taken on by a single variable. Or for the sequence of loads and stores
performed by a single CPU -- but only from that CPU's viewpoint (other
CPUs of course might see different orderings).

> The thing is, your objection to these terms has to do with the lack of
> a single global linear ordering. The fact that they imply a linear
> ordering should be okay, provided you accept that it is a _local_
> ordering -- it makes sense only in the context of a single CPU.

If we are looking at a single CPU, then yes, a single global linear
order for that CPU's accesses does make sense.

> For example, in the program
>
> ld(A)
> mb()
> st(B)
>
> the load and the store really are very strongly ordered by the mb(). The
> CPU is prevented from rearranging them in ways that it normally would.
> Sure, when you look at the results from the point of view of a different
> CPU this ordering might or might not be apparent -- but it's still there
> and very real.
>
> So perhaps the best terms would be "locally precedes" and "locally
> follows".

Here I must disagree, at least for normal cached memory (MMIO I
discuss later). Yes, there are many -implementations- of computer
systems that behave this way, and such implementations are indeed more
intuitively appealing to me than others, but it really is possible to
create implementations where the CPU executing the above code really
does execute it out of order, where the rest of the system really
does see it out of order, and where it is up to the other CPU or the
interconnect to straighten things out based on the memory barriers.
For one (somewhat silly) example, imagine a system where each cache
line contained per-CPU sequence numbers. A CPU increments its sequence
number when it executes a memory barrier, and tags cache lines with the
corresponding sequence number. Would anyone actually do this? I doubt
it, at least with today's technology. But there are more subtle tricks
that can have this same effect in some situations.

Again, the ordering will only be guaranteed to take effect if the
memory barriers are properly paired.

OK, what about MMIO? In that case, yes, the CPU must make sure that
the actual MMIO operations are carried out in the order indicated by
memory barriers. The CPU can tell the difference -- some i386 CPUs use
memory-range registers, other types of CPUs use tags in the PTEs or TLB.
Either way, the memory range corresponding to MMIO is marked as "uncached"
or some such. And real CPUs do in fact special-case MMIO accesses.

Thanx, Paul

2006-10-13 18:30:25

by Alan Stern

Subject: Re: Uses for memory barriers

On Fri, 13 Oct 2006, Paul E. McKenney wrote:

> Just for full disclosure, my experience has been that formal models
> of memory ordering are not all that helpful in explaining memory ordering
> to people. They -can- be good for formal verification, but their
> track record even in that area has not been particularly stellar.
>
> Formal verification of a particular -implementation- is another story,
> that has worked -very- well -- but I am hoping for some machine
> independence!

Well, this sort of simple formal model would have been helpful in
explaining memory ordering to _me_! :-) Already in the course of this
discussion we have ferreted out one or two significant points which the
standard explanations fail to mention.

(That was why I started this whole thing. There was a clear sense that
the standard write-ups were too vague and incomplete, but there was no way
to know what they were leaving out or how they were misdirecting.)


> My thought earlier was to allow transitivity in the case of stores and
> loads to a single variable. You later argued in favor of taking the
> less-restrictive-to-hardware approach of only allowing transitivity for
> CPUs whose last access to the variable in question was via an atomic
> operation. My "seen" approach would work for all CPU architectures
> that I am familiar with, but your approach would also work for a few
> even-more obnoxious CPUs that someone might someday build.
>
> I have not yet decided which approach to take, will write things up
> and talk to a few more CPU architects.

Okay. Let me know what they say.

> The choices are:
>
> 1. The spin_lock() operation is special, and makes all prior
> critical sections for the given lock visible. (Your approach.)
> Since a CPU cannot tell spin_lock() from a hole in the
> ground (OK, outside of a few research efforts), this would
> seem to come down to an atomic operation.
>
> 2. Seeing a given value of a particular variable has the same
> effect with respect to memory barriers as having seen any
> earlier value of that variable. (My approach.)
>
> I believe that any code that works on a CPU that does #1 would also
> work on a CPU that does #2, but not vice versa. If true, this would
> certainly be a -major- point in favor of your approach. ;-)

Doesn't #2 imply #1? #2 means that seeing any spin_lock for a given lock
variable would have the same effect as having seen all the previous
spin_lock's for that lock, from which you could prove that all the effects
of prior critical sections for that lock would be visible.


> Agreed, I have no problem with globally ordered sequences for the values
> taken on by a single variable. Or for the sequence of loads and stores
> performed by a single CPU -- but only from that CPU's viewpoint (other
> CPUs of course might see different orderings).
>
> > The thing is, your objection to these terms has to do with the lack of
> > a single global linear ordering. The fact that they imply a linear
> > ordering should be okay, provided you accept that it is a _local_
> > ordering -- it makes sense only in the context of a single CPU.
>
> If we are looking at a single CPU, then yes, a single global linear
> order for that CPU's accesses does make sense.
>
> > For example, in the program
> >
> > ld(A)
> > mb()
> > st(B)
> >
> > the load and the store really are very strongly ordered by the mb(). The
> > CPU is prevented from rearranging them in ways that it normally would.
> > Sure, when you look at the results from the point of view of a different
> > CPU this ordering might or might not be apparent -- but it's still there
> > and very real.
> >
> > So perhaps the best terms would be "locally precedes" and "locally
> > follows".
>
> Here I must disagree, at least for normal cached memory (MMIO I
> discuss later). Yes, there are many -implementations- of computer
> systems that behave this way, and such implementations are indeed more
> intuitively appealing to me than others, but it really is possible to
> create implementations where the CPU executing the above code really
> does execute it out of order, where the rest of the system really
> does see it out of order, and where it is up to the other CPU or the
> interconnect to straighten things out based on the memory barriers.
> For one (somewhat silly) example, imagine a system where each cache
> line contained per-CPU sequence numbers. A CPU increments its sequence
> number when it executes a memory barrier, and tags cache lines with the
> corresponding sequence number. Would anyone actually do this? I doubt
> it, at least with today's technology. But there are more subtle tricks
> that can have this same effect in some situations.
>
> Again, the ordering will only be guaranteed to take effect if the
> memory barriers are properly paired.

So you disagree with my example, but do you agree with using the term
"locally precedes"? The "locally" emphasizes that the ordering exists
only from the point of view of a single CPU, which you say is perfectly
acceptable.

If not, then how about "logically precedes"? This points out the
distinction with "physically precedes" -- a notion which may not have any
clear meaning on a particular implementation.

Alan Stern

2006-10-13 22:38:25

by Paul E. McKenney

Subject: Re: Uses for memory barriers

On Fri, Oct 13, 2006 at 02:30:15PM -0400, Alan Stern wrote:
> On Fri, 13 Oct 2006, Paul E. McKenney wrote:
>
> > Just for full disclosure, my experience has been that formal models
> > of memory ordering are not all that helpful in explaining memory ordering
> > to people. They -can- be good for formal verification, but their
> > track record even in that area has not been particularly stellar.
> >
> > Formal verification of a particular -implementation- is another story,
> > that has worked -very- well -- but I am hoping for some machine
> > independence!
>
> Well, this sort of simple formal model would have been helpful in
> explaining memory ordering to _me_! :-) Already in the course of this
> discussion we have ferreted out one or two significant points which the
> standard explanations fail to mention.
>
> (That was why I started this whole thing. There was a clear sense that
> the standard write-ups were too vague and incomplete, but there was no way
> to know what they were leaving out or how they were misdirecting.)

So maybe a formal approach in some sort of appendix. Having two different
ways of looking at it can be helpful, I must agree.

> > My thought earlier was to allow transitivity in the case of stores and
> > loads to a single variable. You later argued in favor of taking the
> > less-restrictive-to-hardware approach of only allowing transitivity for
> > CPUs whose last access to the variable in question was via an atomic
> > operation. My "seen" approach would work for all CPU architectures
> > that I am familiar with, but your approach would also work for a few
> > even-more obnoxious CPUs that someone might someday build.
> >
> > I have not yet decided which approach to take, will write things up
> > and talk to a few more CPU architects.
>
> Okay. Let me know what they say.

One down... Several to go.

> > The choices are:
> >
> > 1. The spin_lock() operation is special, and makes all prior
> > critical sections for the given lock visible. (Your approach.)
> > Since a CPU cannot tell spin_lock() from a hole in the
> > ground (OK, outside of a few research efforts), this would
> > seem to come down to an atomic operation.
> >
> > 2. Seeing a given value of a particular variable has the same
> > effect with respect to memory barriers as having seen any
> > earlier value of that variable. (My approach.)
> >
> > I believe that any code that works on a CPU that does #1 would also
> > work on a CPU that does #2, but not vice versa. If true, this would
> > certainly be a -major- point in favor of your approach. ;-)
>
> Doesn't #2 imply #1? #2 means that seeing any spin_lock for a given lock
> variable would have the same effect as having seen all the previous
> spin_lock's for that lock, from which you could prove that all the effects
> of prior critical sections for that lock would be visible.

Yes, I believe that #2 is "stronger" in that any system that adheres to
#2 runs any software designed with #1 in mind. However, it is possible
to write code that works on a system adhering to #2, but which fails
to work on a system adhering to #1. The classic Dekker or Lamport
mutual-exclusion algorithms (with memory barriers added) would be examples.
They have no atomic operations, so would not work on a system that did
#1 but not #2.
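
For reference, the lock half of Dekker's algorithm might be sketched
in C11 as follows (a hypothetical rendering: relaxed accesses plus
seq_cst fences play the role of the added memory barriers, and there
is no atomic read-modify-write anywhere):

#include <stdatomic.h>

static atomic_int flag[2];      /* flag[i]: thread i wants in */
static atomic_int turn;

static void dekker_lock(int self)
{
        int other = 1 - self;

        atomic_store_explicit(&flag[self], 1, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);  /* the added mb() */
        while (atomic_load_explicit(&flag[other], memory_order_acquire)) {
                if (atomic_load_explicit(&turn,
                                memory_order_relaxed) != self) {
                        /* not our turn: back off until it is */
                        atomic_store_explicit(&flag[self], 0,
                                              memory_order_relaxed);
                        while (atomic_load_explicit(&turn,
                                        memory_order_relaxed) != self)
                                ;
                        atomic_store_explicit(&flag[self], 1,
                                              memory_order_relaxed);
                        atomic_thread_fence(memory_order_seq_cst);
                }
        }
}

static void dekker_unlock(int self)
{
        atomic_store_explicit(&turn, 1 - self, memory_order_relaxed);
        atomic_store_explicit(&flag[self], 0, memory_order_release);
}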

Ewww... How about __kfifo_get() and __kfifo_put()? These have no atomic
operations. Ah, but they are restricted to pairs of tasks, so pairwise
memory barriers should suffice.
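
In that same spirit, a single-producer/single-consumer ring along the
lines of __kfifo_put()/__kfifo_get() might be sketched in C11 as
follows (hypothetical code, not the kernel source; a release store
paired with an acquire load stands in for the smp_wmb()/smp_rmb()
pairing):

#include <stdatomic.h>
#include <stddef.h>

#define RING_SIZE 16                    /* any power of two */

struct spsc_ring {
        int buf[RING_SIZE];
        atomic_size_t in;               /* written only by the producer */
        atomic_size_t out;              /* written only by the consumer */
};

static int spsc_put(struct spsc_ring *r, int v)     /* producer side */
{
        size_t in = atomic_load_explicit(&r->in, memory_order_relaxed);
        size_t out = atomic_load_explicit(&r->out, memory_order_acquire);

        if (in - out == RING_SIZE)
                return 0;                           /* full */
        r->buf[in % RING_SIZE] = v;
        /* release pairs with the consumer's acquire: wmb()/rmb() */
        atomic_store_explicit(&r->in, in + 1, memory_order_release);
        return 1;
}

static int spsc_get(struct spsc_ring *r, int *v)    /* consumer side */
{
        size_t out = atomic_load_explicit(&r->out, memory_order_relaxed);
        size_t in = atomic_load_explicit(&r->in, memory_order_acquire);

        if (in == out)
                return 0;                           /* empty */
        *v = r->buf[out % RING_SIZE];
        atomic_store_explicit(&r->out, out + 1, memory_order_release);
        return 1;
}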

> > Agreed, I have no problem with globally ordered sequences for the values
> > taken on by a single variable. Or for the sequence of loads and stores
> > performed by a single CPU -- but only from that CPU's viewpoint (other
> > CPUs of course might see different orderings).
> >
> > > The thing is, your objection to these terms has to do with the lack of
> > > a single global linear ordering. The fact that they imply a linear
> > > ordering should be okay, provided you accept that it is a _local_
> > > ordering -- it makes sense only in the context of a single CPU.
> >
> > If we are looking at a single CPU, then yes, a single global linear
> > order for that CPU's accesses does make sense.
> >
> > > For example, in the program
> > >
> > > ld(A)
> > > mb()
> > > st(B)
> > >
> > > the load and the store really are very strongly ordered by the mb(). The
> > > CPU is prevented from rearranging them in ways that it normally would.
> > > Sure, when you look at the results from the point of view of a different
> > > CPU this ordering might or might not be apparent -- but it's still there
> > > and very real.
> > >
> > > So perhaps the best terms would be "locally precedes" and "locally
> > > follows".
> >
> > Here I must disagree, at least for normal cached memory (MMIO I
> > discuss later). Yes, there are many -implementations- of computer
> > systems that behave this way, and such implementations are indeed more
> > intuitively appealing to me than others, but it really is possible to
> > create implementations where the CPU executing the above code really
> > does execute it out of order, where the rest of the system really
> > does see it out of order, and where it is up to the other CPU or the
> > interconnect to straighten things out based on the memory barriers.
> > For one (somewhat silly) example, imagine a system where each cache
> > line contained per-CPU sequence numbers. A CPU increments its sequence
> > number when it executes a memory barrier, and tags cache lines with the
> > corresponding sequence number. Would anyone actually do this? I doubt
> > it, at least with today's technology. But there are more subtle tricks
> > that can have this same effect in some situations.
> >
> > Again, the ordering will only be guaranteed to take effect if the
> > memory barriers are properly paired.
>
> So you disagree with my example, but do you agree with using the term
> "locally precedes"? The "locally" emphasizes that the ordering exists
> only from the point of view of a single CPU, which you say is perfectly
> acceptable.
>
> If not, then how about "logically precedes"? This points out the
> distinction with "physically precedes" -- a notion which may not have any
> clear meaning on a particular implementation.

For the pairwise memory barriers, I really like "conditionally precedes",
which makes it very clear that the observation of order is not automatic.
On both CPUs, an explicit memory barrier is required (with the exception
of MMIO, where the communication is instead with an I/O device).

For the single-variable case and for the single-CPU case, just plain
"precedes" works, at least as long as you are not doing fine-grained
timings that can allow you to observe cache lines in motion. But if
you are doing that, you had better know what you are doing anyway. ;-)

Thanx, Paul

2006-10-14 02:27:38

by Alan Stern

Subject: Re: Uses for memory barriers

On Fri, 13 Oct 2006, Paul E. McKenney wrote:

> Ewww... How about __kfifo_get() and __kfifo_put()? These have no atomic
> operations. Ah, but they are restricted to pairs of tasks, so pairwise
> memory barriers should suffice.

Tasks can migrate from one CPU to another, of course. But that involves
context switching and plenty of synchronization operations in the kernel,
so you're okay in that respect.

> For the pairwise memory barriers, I really like "conditionally precedes",
> which makes it very clear that the observation of order is not automatic.
> On both CPUs, an explicit memory barrier is required (with the exception
> of MMIO, where the communication is instead with an I/O device).
>
> For the single-variable case and for the single-CPU case, just plain
> "precedes" works, at least as long as you are not doing fine-grained
> timings that can allow you to observe cache lines in motion. But if
> you are doing that, you had better know what you are doing anyway. ;-)

The reason I don't like "conditionally precedes" is because it suggests
the ordering is not automatic even in the single-CPU case.

Alan

2006-10-17 01:23:45

by Paul E. McKenney

Subject: Re: Uses for memory barriers

On Fri, Oct 13, 2006 at 10:27:36PM -0400, Alan Stern wrote:
> On Fri, 13 Oct 2006, Paul E. McKenney wrote:
>
> > Ewww... How about __kfifo_get() and __kfifo_put()? These have no atomic
> > operations. Ah, but they are restricted to pairs of tasks, so pairwise
> > memory barriers should suffice.
>
> Tasks can migrate from one CPU to another, of course. But that involves
> context switching and plenty of synchronization operations in the kernel,
> so you're okay in that respect.

Yep -- at least it had better be! Careful about how you write lightweight
schedulers! ;-)

> > For the pairwise memory barriers, I really like "conditionally precedes",
> > which makes it very clear that the observation of order is not automatic.
> > On both CPUs, an explicit memory barrier is required (with the exception
> > of MMIO, where the communication is instead with an I/O device).
> >
> > For the single-variable case and for the single-CPU case, just plain
> > "precedes" works, at least as long as you are not doing fine-grained
> > timings that can allow you to observe cache lines in motion. But if
> > you are doing that, you had better know what you are doing anyway. ;-)
>
> The reason I don't like "conditionally precedes" is because it suggests
> the ordering is not automatic even in the single-CPU case.

Aside from MMIO accesses, why would you be using memory barriers in the
single-CPU case? If you aren't using memory barriers, then just plain
"precedes" works fine -- "conditionally precedes" applies only to memory
barriers acting on normal memory (again, MMIO is handled specially).

So, ordering is indeed automatic in the single-CPU case. Or, more
accurately, ordering -looks- -like- it is automatic in the single-CPU
case. Except for MMIO -- MMIO giveth ordering in SMP and it taketh
ordering away on UP. ;-)

Thanx, Paul

2006-10-17 15:29:44

by Alan Stern

Subject: Re: Uses for memory barriers

On Mon, 16 Oct 2006, Paul E. McKenney wrote:

> > The reason I don't like "conditionally precedes" is because it suggests
> > the ordering is not automatic even in the single-CPU case.
>
> Aside from MMIO accesses, why would you be using memory barriers in the
> single-CPU case?

Obviously you wouldn't. But you might be fooled into doing so if you saw
the term "conditionally precedes" together with an explanation that the
"condition" requires a memory barrier to be present. You might also draw
this erroneous conclusion if you are on an SMP system but your variable is
accessed by only one of the CPUs.

> If you aren't using memory barriers, then just plain
> "precedes" works fine -- "conditionally precedes" applies only to memory
> barriers acting on normal memory (again, MMIO is handled specially).

No, no! Taken out of context this sentence looks terribly confused.
Read it again and you'll see what I mean. (Think about what it says for
people who don't use memory barriers on SMP systems.) Here's a much more
accurate statement:

If you are in the single-CPU case then just plain "precedes"
works fine for normal memory accesses (MMIO is handled
specially).

But when multiple CPUs access the same variable all ordering
is "conditional"; each CPU must use a memory barrier to
guarantee the desired ordering.

Alan

2006-10-17 17:26:09

by Paul E. McKenney

[permalink] [raw]
Subject: Re: Uses for memory barriers

On Tue, Oct 17, 2006 at 11:29:42AM -0400, Alan Stern wrote:
> On Mon, 16 Oct 2006, Paul E. McKenney wrote:
>
> > > The reason I don't like "conditionally precedes" is because it suggests
> > > the ordering is not automatic even in the single-CPU case.
> >
> > Aside from MMIO accesses, why would you be using memory barriers in the
> > single-CPU case?
>
> Obviously you wouldn't. But you might be fooled into doing so if you saw
> the term "conditionally precedes" together with an explanation that the
> "condition" requires a memory barrier to be present. You might also draw
> this erroneous conclusion if you are on an SMP system but your variable is
> accessed by only one of the CPUs.

OK. My thought would be to clearly state that memory barriers are needed
only in the following two cases: (1) for MMIO and (2) for sharing memory
among multiple CPUs.

> > If you aren't using memory barriers, then just plain
> > "precedes" works fine -- "conditionally precedes" applies only to memory
> > barriers acting on normal memory (again, MMIO is handled specially).
>
> No, no! Taken out of context this sentence looks terribly confused.
> Read it again and you'll see what I mean. (Think about what it says for
> people who don't use memory barriers on SMP systems.) Here's a much more
> accurate statement:
>
> If you are in the single-CPU case then just plain "precedes"
> works fine for normal memory accesses (MMIO is handled
> specially).
>
> But when multiple CPUs access the same variable all ordering
> is "conditional"; each CPU must use a memory barrier to
> guarantee the desired ordering.

"There -is- no ordering!" (Or was that "There -is- no spoon"?) ;-)

You know, I am not sure we are ever going to reconcile our two
viewpoints...

Your view (I believe) is that each execution produces some definite
order of loads and stores, with each load and store occurring at some
specific point in time. Of course, a given load or store might be visible
at different times from different CPUs, even to the point that CPUs
might disagree on the order in which given loads and stores occurred.

My view is that individual loads and stores are complex operations, each
of which takes significant time to complete. A pair of these operations
can therefore overlap in time, so that it might or might not make sense
to say that the pair occurred in any order at all -- nor to say that
any given load or store happens at a single point in time. I described
this earlier as loads and stores being "fuzzy".

The odds of you (or anyone) being able to pry me out of my viewpoint
are extremely low -- in fact, if such a thing were possible, the
odds would be represented by a negative number. The reason for this
admittedly unreasonable attitude is that every time I have strayed from
this viewpoint over the past decade or two, I have been brutally punished
by the appearance of large numbers of subtle and hard-to-find bugs.
My intuition about sequencing and ordering is so strong that if I let
it gain a foothold, it will blind me to the possibility of these bugs
occurring. I can either deny that the loads and stores happen in any
order whatsoever (which works, but is very hard to explain), or assert
that they take non-zero time to execute and can therefore overlap.

That said, I do recognize that my viewpoint is not universally applicable.

Someone using locking will probably be much more productive if they
leverage their intuition, assuming that locks are acquired and released
in a definite order and that all the memory references in the critical
sections execute in order, each at a definite point in time. After all,
the whole point of the locking primitives and their associated memory
barriers is to present exactly this illusion. It is quite possible that
it is better to use a viewpoint like yours when thinking about MMIO
accesses -- and given your work in USB and PCI, it might well be very
difficult to pry -you- out of -your- viewpoint.

But I believe that taking a viewpoint very similar to mine is critical
for getting the low-level synchronization primitives even halfway correct.
(Obscene quantities of testing are required as well.)

So, how to proceed?

One approach would be to demonstrate counter-intuitive results with a
small program (and I do have several at hand), enumerate the viewpoints,
and then use examples.

Another approach would be to use a formalism. Notations such as ">p"
(comes before) and "<p" (comes after) for program order and ">v"/"<v"
for order of values in a given variable have been used in the past.
These could be coupled with something vaguely resembling your suggestion
for loads and stores: l(v,c,l) for load from variable "v" by CPU "c"
at code line number "l" and s(v,c,l,n) for store to variable "v" by CPU
"c" at code line number "l" with new value "n". (In the examples below,
"l" can be omitted, since there is only one of each per CPU.)

If we were to take your choice for the transitivity required for locking,
we would need to also define an ">a"/"<a" or some such denoting a
chain of values where the last assignment was performed atomically
(possibly also by a particular CPU). In that case, ">v"/"<v" would
denote values separated by a single assignment. However, for simplicity
of nomenclature, I am taking my choice for this example -- if there is
at least one CPU that actually requires the atomic operation, then the
final result will need the more complex nomenclature.

Then take the usual code, with all variables initially zero:

CPU 0 CPU 1

A=1 Y=B
smp_mb() smp_mb()
B=1 X=A

Then the description of a memory barrier ends up being something like
the following:

Given the following:

l(B,1,) >p smp_mb() >p l(A,1,), and
s(A,0,,1) >p smp_mb() >p s(B,0,,1):

Then:

s(B,0,,1) >v l(B,1,) -> s(A,0,,1) >v l(A,1,).

This notation correctly distinguishes the following two cases (all
variables initially zero):

CPU 0 CPU 1 CPU 2

A=1 while (B==0); while (C==0);
smp_mb() C=1 smp_mb()
B=1 assert(A==1) <fails>

and:

CPU 0 CPU 1 CPU 2

A=1 while (B==0); while (B<2);
smp_mb() B++ smp_mb()
B=1 assert(A==1) <succeeds>

In the first case, we don't have s(B,0,,1) >v l(C,2,). Therefore,
we cannot rely on s(A,0,,1) >v l(A,2,). In the second case, we do
have s(B,0,,1) >v l(B,2,), so s(A,0,,1) >v l(A,2,) must hold.
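
For readers who want to experiment, the two-CPU pattern above can be
rendered as a userspace harness. This is only a sketch: pthreads plus
C11 seq_cst fences stand in for separate CPUs and smp_mb(), and all
names are invented. By the definition above, the assert can never fire:

#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static atomic_int A, B;
static int X, Y;	/* results; pthread_join() orders these accesses */

static void *cpu0(void *arg)
{
	(void)arg;
	atomic_store_explicit(&A, 1, memory_order_relaxed);	/* A=1 */
	atomic_thread_fence(memory_order_seq_cst);		/* smp_mb() */
	atomic_store_explicit(&B, 1, memory_order_relaxed);	/* B=1 */
	return NULL;
}

static void *cpu1(void *arg)
{
	(void)arg;
	Y = atomic_load_explicit(&B, memory_order_relaxed);	/* Y=B */
	atomic_thread_fence(memory_order_seq_cst);		/* smp_mb() */
	X = atomic_load_explicit(&A, memory_order_relaxed);	/* X=A */
	return NULL;
}

int main(void)
{
	for (int i = 0; i < 100000; i++) {
		pthread_t t0, t1;

		atomic_store(&A, 0);
		atomic_store(&B, 0);
		pthread_create(&t0, NULL, cpu0, NULL);
		pthread_create(&t1, NULL, cpu1, NULL);
		pthread_join(t0, NULL);
		pthread_join(t1, NULL);
		/* s(B,0,,1) >v l(B,1,) must imply s(A,0,,1) >v l(A,1,). */
		assert(!(Y == 1 && X == 0));
	}
	return 0;
}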

My guess is that this formal approach is absolutely required for the
more mathematically inclined, but that it will instead be an obstacle
to understanding (in fact, an obstacle even to reading!) for many people.

So my thought is to put the formal approach later on as reference material.
And probably to improve the notation -- the ",,"s look ugly.

Thoughts?

Thanx, Paul

PS. One difference between our two viewpoints is that I would forbid
something like "s(A,0,,1) >v s(B,0,,1)", instead permitting ">v"/"<v"
to be used only on loads and stores of a single variable, with the exception
of MMIO. My guess is that you are just fine with orderings of
assignments to different variables. ;-)

My example formalism for a memory barrier says nothing about the
actual order in which the assignments to A and B occurred, nor about
the actual order in which the loads from A and B occurred. No such
ordering is required to describe the action of the memory barrier.

2006-10-17 19:42:13

by Alan Stern

[permalink] [raw]
Subject: Re: Uses for memory barriers

On Tue, 17 Oct 2006, Paul E. McKenney wrote:

> OK. My thought would be to clearly state that memory barriers are needed
> only in the following two cases: (1) for MMIO and (2) for sharing memory
> among multiple CPUs.

And then also state that for all other cases, "conditionally precedes" is
the same as "precedes".


> > If you are in the single-CPU case then just plain "precedes"
> > works fine for normal memory accesses (MMIO is handled
> > specially).
> >
> > But when multiple CPUs access the same variable all ordering
> > is "conditional"; each CPU must use a memory barrier to
> > guarantee the desired ordering.
>
> "There -is- no ordering!" (Or was that "There -is- no spoon"?) ;-)

(Ceci n'est pas un CPU!)

If there is no ordering then what does "conditionally precedes" mean?
Perhaps you would prefer to write instead:

... each CPU must use a memory barrier to obtain the apparent
effect of the desired ordering.

Or even:

... each CPU must use a memory barrier to satisfy the "condition"
in "conditionally precedes".

> You know, I am not sure we are ever going to reconcile our two
> viewpoints...
>
> Your view (I believe) is that each execution produces some definite
> order of loads and stores, with each load and store occurring at some
> specific point in time. Of course, a given load or store might be visible
> at different times from different CPUs, even to the point that CPUs
> might disagree on the order in which given loads and stores occurred.

Maybe so, but I am willing to converse in terms which assume no definite
ordering. Which makes the term "conditionally precedes" difficult to
justify, since the word "precedes" implies the existence of an ordering.

> My view is that individual loads and stores are complex operations, each
> of which takes significant time to complete. A pair of these operations
> can therefore overlap in time, so that it might or might not make sense
> to say that the pair occurred in any order at all -- nor to say that
> any given load or store happens at a single point in time. I described
> this earlier as loads and stores being "fuzzy".
>
> The odds of you (or anyone) being able to pry me out of my viewpoint
> are extremely low -- in fact, if such a thing were possible, the
> odds would be represented by a negative number. The reason for this
> admittedly unreasonable attitude is that every time I have strayed from
> this viewpoint over the past decade or two, I have been brutally punished
> by the appearance of large numbers of subtle and hard-to-find bugs.
> My intuition about sequencing and ordering is so strong that if I let
> it gain a foothold, it will blind me to the possibility of these bugs
> occurring. I can either deny that the loads and stores happen in any
> order whatsoever (which works, but is very hard to explain), or assert
> that they take non-zero time to execute and can therefore overlap.

That's different from saying "There -is- no ordering!" For example, if
loads and stores take non-zero time to execute and can therefore overlap,
then they can be ordered by their start times or by their end times. Or
you could set up a partial ordering: One access comes before another if
its end time precedes the other's start time.

These might not be _useful_ orderings, but they do exist. :-)

> That said, I do recognize that my viewpoint is not universally applicable.
>
> Someone using locking will probably be much more productive if they
> leverage their intuition, assuming that locks are acquired and released
> in a definite order and that all the memory references in the critical
> sections execute in order, each at a definite point in time. After all,
> the whole point of the locking primitives and their associated memory
> barriers is to present exactly this illusion. It is quite possible that
> it is better to use a viewpoint like yours when thinking about MMIO
> accesses -- and given your work in USB and PCI, it might well be very
> difficult to pry -you- out of -your- viewpoint.
>
> But I believe that taking a viewpoint very similar to mine is critical
> for getting the low-level synchronization primitives even halfway correct.
> (Obscene quantities of testing are required as well.)
>
> So, how to proceed?
>
> One approach would be to demonstrate counter-intuitive results with a
> small program (and I do have several at hand), enumerate the viewpoints,
> and then use examples.

That sounds like a pedagogically useful way to disturb people out of their
complacency.

> Another approach would be to use a formalism. Notations such as ">p"
> (comes before) and "<p" (comes after) for program order and ">v"/"<v"
> for order of values in a given variable have been used in the past.
> These could be coupled with something vaguely resembling your suggestion
> for loads and stores: l(v,c,l) for load from variable "v" by CPU "c"
> at code line number "l" and s(v,c,l,n) for store to variable "v" by CPU
> "c" at code line number "l" with new value "n". (In the examples below,
> "l" can be omitted, since there is only one of each per CPU.)
>
> If we were to take your choice for the transitivity required for locking,
> we would need to also define an ">a"/"<a" or some such denoting a
> chain of values where the last assignment was performed atomically
> (possibly also by a particular CPU). In that case, ">v"/"<v" would
> denote values separated by a single assignment. However, for simplicity
> of nomenclature, I am taking my choice for this example -- if there is
> at least one CPU that actually requires the atomic operation, then the
> final result will need the more complex nomenclature.
>
> Then take the usual code, with all variables initially zero:
>
> CPU 0 CPU 1
>
> A=1 Y=B
> smp_mb() smp_mb()
> B=1 X=A
>
> Then the description of a memory barrier ends up being something like
> the following:
>
> Given the following:
>
> l(B,1,) >p smp_mb() >p l(A,1,), and
> s(A,0,,1) >p smp_mb() >p s(B,0,,1):
>
> Then:
>
> s(B,0,,1) >v l(B,1,) -> s(A,0,,1) >v l(A,1,).
>
> This notation correctly distinguishes the following two cases (all
> variables initially zero):
>
> CPU 0 CPU 1 CPU 2
>
> A=1 while (B==0); while (C==0);
> smp_mb() C=1 smp_mb()
> B=1 assert(A==1) <fails>
>
> and:
>
> CPU 0 CPU 1 CPU 2
>
> A=1 while (B==0); while (B<2);
> smp_mb() B++ smp_mb()
> B=1 assert(A==1) <succeeds>
>
> In the first case, we don't have s(B,0,,1) >v l(C,2,). Therefore,
> we cannot rely on s(A,0,,1) >v l(A,2,). In the second case, we do
> have s(B,0,,1) >v l(B,2,), so s(A,0,,1) >v l(A,2,) must hold.

Using your notion of transitivity for all accesses to a single variable,
yes.

> My guess is that this formal approach is absolutely required for the
> more mathematically inclined, but that it will instead be an obstacle
> to understanding (in fact, an obstacle even to reading!) for many people.
>
> So my thought is to put the formal approach later on as reference material.
> And probably to improve the notation -- the ",,"s look ugly.
>
> Thoughts?

I agree; putting this material in an appendix would improve readability of
the main text and help give a single location for a formal definition of
the SMP memory access model used by Linux.

And yes, the notation could be improved. For instance, in typeset text
the CPU number could be a subscript. For stores the value could be
indicated by an equals sign, and the line number could precede the
expression. Maybe even use "lo" and "st" to make the names easier to
perceive. You'd end up with something like this: 17:st_0(A = 2). Then
it becomes a lot cleaner to omit the CPU number, the line number, or the
value.

> PS. One difference between our two viewpoints is that I would forbid
> something like "s(A,0,,1) >v s(B,0,,1)", instead permitting ">v"/"<v"
> to be used only on loads and stores of a single variable, with the exception
> of MMIO. My guess is that you are just fine with orderings of
> assignments to different variables. ;-)

Earlier I defined two separate kinds of orderings: "comes before" and
"sequentially precedes". My "comes before" is essentially the same as
your "<v", applying only to accesses of the same variable. You don't have
any direct analog to "sequentially precedes", which is perhaps a weakness:
It will be harder for you to denote the effect of a load dependency on a
subsequent store. My "sequentially precedes" does _not_ require the
accesses to be to the same variable, but it does require them to take
place on the same CPU.

> My example formalism for a memory barrier says nothing about the
> actual order in which the assignments to A and B occurred, nor about
> the actual order in which the loads from A and B occurred. No such
> ordering is required to describe the action of the memory barrier.

Are you sure about that? I would think it was implicit in your definition
of "<v". Talking about the order of values in a variable during the past
isn't very different from talking about the order in which the
corresponding stores occurred.

For that matter, the whole concept of "the value in a variable" is itself
rather fuzzy. Even the sequence of values might not be well defined: If
you had some single CPU do nothing but repeatedly load the variable and
note its value, you could end up missing some of the values perceived by
other CPUs. That is, it could be possible for CPU 0 to see A take on the
values 0,1,2 while CPU 1 sees only the values 0,2.

Alan Stern

2006-10-17 20:14:00

by Paul E. McKenney

[permalink] [raw]
Subject: Re: Uses for memory barriers

On Tue, Oct 17, 2006 at 03:42:07PM -0400, Alan Stern wrote:
> On Tue, 17 Oct 2006, Paul E. McKenney wrote:
>
> > OK. My thought would be to clearly state that memory barriers are needed
> > only in the following two cases: (1) for MMIO and (2) for sharing memory
> > among multiple CPUs.
>
> And then also state that for all other cases, "conditionally precedes" is
> the same as "precedes".

Or that "conditionally precedes" is not to be used except for describing
the action of memory barriers.

> > > If you are in the single-CPU case then just plain "precedes"
> > > works fine for normal memory accesses (MMIO is handled
> > > specially).
> > >
> > > But when multiple CPUs access the same variable all ordering
> > > is "conditional"; each CPU must use a memory barrier to
> > > guarantee the desired ordering.
> >
> > "There -is- no ordering!" (Or was that "There -is- no spoon"?) ;-)
>
> (Ceci n'est pas un CPU!)

I wonder what our CPUs thought while processing that sentence. ;-)

> If there is no ordering then what does "conditionally precedes" mean?
> Perhaps you would prefer to write instead:
>
> ... each CPU must use a memory barrier to obtain the apparent
> effect of the desired ordering.
>
> Or even:
>
> ... each CPU must use a memory barrier to satisfy the "condition"
> in "conditionally precedes".

"There -is- no ordering" was a bit extreme, even for me. ;-)

> > You know, I am not sure we are ever going to reconcile our two
> > viewpoints...
> >
> > Your view (I believe) is that each execution produces some definite
> > order of loads and stores, with each load and store occurring at some
> > specific point in time. Of course, a given load or store might be visible
> > at different times from different CPUs, even to the point that CPUs
> > might disagree on the order in which given loads and stores occurred.
>
> Maybe so, but I am willing to converse in terms which assume no definite
> ordering. Which makes the term "conditionally precedes" difficult to
> justify, since the word "precedes" implies the existence of an ordering.

My hope is to come up with something that is semi-comfortable for both
viewpoints.

> > My view is that individual loads and stores are complex operations, each
> > of which takes significant time to complete. A pair of these operations
> > can therefore overlap in time, so that it might or might not make sense
> > to say that the pair occurred in any order at all -- nor to say that
> > any given load or store happens at a single point in time. I described
> > this earlier as loads and stores being "fuzzy".
> >
> > The odds of you (or anyone) being able to pry me out of my viewpoint
> > are extremely low -- in fact, if such a thing were possible, the
> > odds would be represented by a negative number. The reason for this
> > admittedly unreasonable attitude is that every time I have strayed from
> > this viewpoint over the past decade or two, I have been brutally punished
> > by the appearance of large numbers of subtle and hard-to-find bugs.
> > My intuition about sequencing and ordering is so strong that if I let
> > it gain a foothold, it will blind me to the possibility of these bugs
> > occurring. I can either deny that the loads and stores happen in any
> > order whatsoever (which works, but is very hard to explain), or assert
> > that they take non-zero time to execute and can therefore overlap.
>
> That's different from saying "There -is- no ordering!" For example, if
> loads and stores take non-zero time to execute and can therefore overlap,
> then they can be ordered by their start times or by their end times. Or
> you could set up a partial ordering: One access comes before another if
> its end time precedes the other's start time.
>
> These might not be _useful_ orderings, but they do exist. :-)

"Zen and the art of memory-barrier placement". ;-)

> > That said, I do recognize that my viewpoint is not universally applicable.
> >
> > Someone using locking will probably be much more productive if they
> > leverage their intuition, assuming that locks are acquired and released
> > in a definite order and that all the memory references in the critical
> > sections execute in order, each at a definite point in time. After all,
> > the whole point of the locking primitives and their associated memory
> > barriers is to present exactly this illusion. It is quite possible that
> > it is better to use a viewpoint like yours when thinking about MMIO
> > accesses -- and given your work in USB and PCI, it might well be very
> > difficult to pry -you- out of -your- viewpoint.
> >
> > But I believe that taking a viewpoint very similar to mine is critical
> > for getting the low-level synchronization primitives even halfway correct.
> > (Obscene quantities of testing are required as well.)
> >
> > So, how to proceed?
> >
> > One approach would be to demonstrate counter-intuitive results with a
> > small program (and I do have several at hand), enumerate the viewpoints,
> > and then use examples.
>
> That sounds like a pedagogically useful way to disturb people out of their
> complacency.

Guilty to charges as read!

> > Another approach would be to use a formalism. Notations such as ">p"
> > (comes before) and "<p" (comes after) for program order and ">v"/"<v"
> > for order of values in a given variable have been used in the past.
> > These could be coupled with something vaguely resembling your suggestion
> > for loads and stores: l(v,c,l) for load from variable "v" by CPU "c"
> > at code line number "l" and s(v,c,l,n) for store to variable "v" by CPU
> > "c" at code line number "l" with new value "n". (In the examples below,
> > "l" can be omitted, since there is only one of each per CPU.)
> >
> > If we were to take your choice for the transitivity required for locking,
> > we would need to also define an ">a"/"<a" or some such denoting a
> > chain of values where the last assignment was performed atomically
> > (possibly also by a particular CPU). In that case, ">v"/"<v" would
> > denote values separated by a single assignment. However, for simplicity
> > of nomenclature, I am taking my choice for this example -- if there is
> > at least one CPU that actually requires the atomic operation, then the
> > final result will need the more complex nomenclature.
> >
> > Then take the usual code, with all variables initially zero:
> >
> > CPU 0 CPU 1
> >
> > A=1 Y=B
> > smp_mb() smp_mb()
> > B=1 X=A
> >
> > Then the description of a memory barrier ends up being something like
> > the following:
> >
> > Given the following:
> >
> > l(B,1,) >p smp_mb() >p l(A,1,), and
> > s(A,0,,1) >p smp_mb() >p s(B,0,,1):
> >
> > Then:
> >
> > s(B,0,,1) >v l(B,1,) -> s(A,0,,1) >v l(A,1,).
> >
> > This notation correctly distinguishes the following two cases (all
> > variables initially zero):
> >
> > CPU 0 CPU 1 CPU 2
> >
> > A=1 while (B==0); while (C==0);
> > smp_mb() C=1 smp_mb()
> > B=1 assert(A==1) <fails>
> >
> > and:
> >
> > CPU 0 CPU 1 CPU 2
> >
> > A=1 while (B==0); while (B<2);
> > smp_mb() B++ smp_mb()
> > B=1 assert(A==1) <succeeds>
> >
> > In the first case, we don't have s(B,0,,1) >v l(C,2,). Therefore,
> > we cannot rely on s(A,0,,1) >v l(A,2,). In the second case, we do
> > have s(B,0,,1) >v l(B,2,), so s(A,0,,1) >v l(A,2,) must hold.
>
> Using your notion of transitivity for all accesses to a single variable,
> yes.

Yep. In your scenario, both assertions could fail.

> > My guess is that this formal approach is absolutely required for the
> > more mathematically inclined, but that it will instead be an obstacle
> > to understanding (in fact, an obstacle even to reading!) for many people.
> >
> > So my thought is to put the formal approach later on as reference material.
> > And probably to improve the notation -- the ",,"s look ugly.
> >
> > Thoughts?
>
> I agree; putting this material in an appendix would improve readability of
> the main text and help give a single location for a formal definition of
> the SMP memory access model used by Linux.

Cool!

> And yes, the notation could be improved. For instance, in typeset text
> the CPU number could be a subscript. For stores the value could be
> indicated by an equals sign, and the line number could precede the
> expression. Maybe even use "lo" and "st" to make the names easier to
> perceive. You'd end up with something like this: 17:st_0(A = 2). Then
> it becomes a lot cleaner to omit the CPU number, the line number, or the
> value.

Interesting approach -- I will play with this.

> > PS. One difference between our two viewpoints is that I would forbid
> > something like "s(A,0,,1) >v s(B,0,,1)", instead permitting ">v"/"<v"
> > to be used only on loads and stores of a single variable, with the exception
> > of MMIO. My guess is that you are just fine with orderings of
> > assignments to different variables. ;-)
>
> Earlier I defined two separate kinds of orderings: "comes before" and
> "sequentially precedes". My "comes before" is essentially the same as
> your "<v", applying only to accesses of the same variable. You don't have
> any direct analog to "sequentially precedes", which is perhaps a weakness:
> It will be harder for you to denote the effect of a load dependency on a
> subsequent store. My "sequentially precedes" does _not_ require the
> accesses to be to the same variable, but it does require them to take
> place on the same CPU.

This is similar to my ">p"/"<p" -- or was your "sequentially precedes"
somehow taking the effects of other CPUs into account?

> > My example formalism for a memory barrier says nothing about the
> > actual order in which the assignments to A and B occurred, nor about
> > the actual order in which the loads from A and B occurred. No such
> > ordering is required to describe the action of the memory barrier.
>
> Are you sure about that? I would think it was implicit in your definition
> of "<v". Talking about the order of values in a variable during the past
> isn't very different from talking about the order in which the
> corresponding stores occurred.

My "<v" is valid only for a single variable. A computer that reversed
the order of execution of CPU 0's two assignments would be permitted,
as long as the loads on CPU 1 and CPU 2 got the correct values.

For an extreme example, consider mapping the code onto a dataflow
machine -- as long as the data dependencies are satisfied, the time
order is irrelevant.

> For that matter, the whole concept of "the value in a variable" is itself
> rather fuzzy. Even the sequence of values might not be well defined: If
> you had some single CPU do nothing but repeatedly load the variable and
> note its value, you could end up missing some of the values perceived by
> other CPUs. That is, it could be possible for CPU 0 to see A take on the
> values 0,1,2 while CPU 1 sees only the values 0,2.

Heck, if you have a synchronized clock register with sufficient accuracy,
you can catch different CPUs thinking that a given variable has different
values at the same point in time. ;-)

Thanx, Paul

2006-10-17 21:21:35

by Alan Stern

[permalink] [raw]
Subject: Re: Uses for memory barriers

On Tue, 17 Oct 2006, Paul E. McKenney wrote:

> > Earlier I defined two separate kinds of orderings: "comes before" and
> > "sequentially precedes". My "comes before" is essentially the same as
> > your "<v", applying only to accesses of the same variable. You don't have
> > any direct analog to "sequentially precedes", which is perhaps a weakness:
> > It will be harder for you to denote the effect of a load dependency on a
> > subsequent store. My "sequentially precedes" does _not_ require the
> > accesses to be to the same variable, but it does require them to take
> > place on the same CPU.
>
> This is similar to my ">p"/"<p" -- or was your "sequentially precedes"
> somehow taking effects of other CPUs into account.

It was taking the effect of memory barriers into account. In the program
"load(A); store(B)" the load doesn't sequentially precede the store. But
in the program "load(A); smp_mb(); store(B)" it does. Similarly, in the
program "if (A) B = 2;" the load(A) sequentially precedes the store(B) --
thanks to the dependency or (if you prefer) the absence of speculative
stores.

Basically "sequentially precedes" means that any other CPU using the
appropriate memory barriers will observe the accesses apparently occurring
in this order.
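
Spelled out as C, the three little programs above look something like
the sketch below. The names are invented; volatile keeps the compiler
from discarding the accesses and __sync_synchronize() stands in for
smp_mb(). (An optimizing compiler can in principle defeat a bare
control dependency, so this illustrates only the CPU-level behavior
under discussion.)

static volatile int A, B;

/* load(A); store(B): the load does not sequentially precede the store. */
static void plain_order(void)
{
	int a = A;	/* load(A) */

	(void)a;
	B = 2;		/* store(B): other CPUs may observe this first */
}

/* load(A); smp_mb(); store(B): now it does. */
static void barrier_order(void)
{
	int a = A;	/* load(A) */

	(void)a;
	__sync_synchronize();	/* stands in for smp_mb() */
	B = 2;			/* store(B) */
}

/* if (A) B = 2;: the dependency provides the sequencing. */
static void dependency_order(void)
{
	if (A)		/* the conditional store below cannot be */
		B = 2;	/* speculated ahead of this load */
}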

> > > My example formalism for a memory barrier says nothing about the
> > > actual order in which the assignments to A and B occurred, nor about
> > > the actual order in which the loads from A and B occurred. No such
> > > ordering is required to describe the action of the memory barrier.
> >
> > Are you sure about that? I would think it was implicit in your definition
> > of "<v". Talking about the order of values in a variable during the past
> > isn't very different from talking about the order in which the
> > corresponding stores occurred.
>
> My "<v" is valid only for a single variable. A computer that reversed
> the order of execution of CPU 0's two assignments would be permitted,
> as long as the loads on CPU 1 and CPU 2 got the correct values.

Yes, I realize that. But if several CPUs store values to the same
variable at about the same time, it's not at all clear which stores are
"<v" others. Deciding this is tantamount to ordering all the stores to
that variable.

> > For that matter, the whole concept of "the value in a variable" is itself
> > rather fuzzy. Even the sequence of values might not be well defined: If
> > you had some single CPU do nothing but repeatedly load the variable and
> > note its value, you could end up missing some of the values perceived by
> > other CPUs. That is, it could be possible for CPU 0 to see A take on the
> > values 0,1,2 while CPU 1 sees only the values 0,2.
>
> Heck, if you have a synchronized clock register with sufficient accuracy,
> you can catch different CPUs thinking that a given variable has different
> values at the same point in time. ;-)

Exactly. That's why I'm not too comfortable with your "<v" -- and I'm not
completely certain of the validity of "comes before" either. Hardly
surprising, since they mean pretty much the same thing.

Alan

2006-10-17 22:57:37

by Paul E. McKenney

[permalink] [raw]
Subject: Re: Uses for memory barriers

On Tue, Oct 17, 2006 at 05:21:33PM -0400, Alan Stern wrote:
> On Tue, 17 Oct 2006, Paul E. McKenney wrote:
>
> > > Earlier I defined two separate kinds of orderings: "comes before" and
> > > "sequentially precedes". My "comes before" is essentially the same as
> > > your "<v", applying only to accesses of the same variable. You don't have
> > > any direct analog to "sequentially precedes", which is perhaps a weakness:
> > > It will be harder for you to denote the effect of a load dependency on a
> > > subsequent store. My "sequentially precedes" does _not_ require the
> > > accesses to be to the same variable, but it does require them to take
> > > place on the same CPU.
> >
> > This is similar to my ">p"/"<p" -- or was your "sequentially precedes"
> > somehow taking the effects of other CPUs into account?
>
> It was taking the effect of memory barriers into account. In the program
> "load(A); store(B)" the load doesn't sequentially precede the store. But
> in the program "load(A); smp_mb(); store(B)" it does. Similarly, in the
> program "if (A) B = 2;" the load(A) sequentially precedes the store(B) --
> thanks to the dependency or (if you prefer) the absence of speculative
> stores.
>
> Basically "sequentially precedes" means that any other CPU using the
> appropriate memory barriers will observe the accesses apparently occurring
> in this order.

Your first example in the previous paragraph fits the description.
The second does not, as illustrated by the following scenario:

CPU 0 CPU 1 CPU 2

A=1 while (B==0); while (C==0);
smp_mb() C=1 smp_mb()
B=1 assert(A==1) <fails>

Please note that the "<fails>" is not a theoretical assertion -- I have
seen this happen in real life. So, yes, the C=1 might not speculate ahead
of the load of B that produced a non-zero result, but CPU 2's assertion
can still fail, even though both CPU 2 and CPU 0 are using memory barriers.

> > > > My example formalism for a memory barrier says nothing about the
> > > > actual order in which the assignments to A and B occurred, nor about
> > > > the actual order in which the loads from A and B occurred. No such
> > > > ordering is required to describe the action of the memory barrier.
> > >
> > > Are you sure about that? I would think it was implicit in your definition
> > > of "<v". Talking about the order of values in a variable during the past
> > > isn't very different from talking about the order in which the
> > > corresponding stores occurred.
> >
> > My "<v" is valid only for a single variable. A computer that reversed
> > the order of execution of CPU 0's two assignments would be permitted,
> > as long as the loads on CPU 1 and CPU 2 got the correct values.
>
> Yes, I realize that. But if several CPUs store values to the same
> variable at about the same time, it's not at all clear which stores are
> "<v" others. Deciding this is tantamount to ordering all the stores to
> that variable.

Yep. Consider the following case:

CPU 0 CPU 1 CPU 2

A=1 B=1 X=C
smp_mb() smp_mb() smp_mb()
C=1 C=2 if (X==1) ???

In the then-clause of the "if", CPU 2 can only be sure that it will
see A==1. It might or might not see B==1. We simply don't know the
order of stores to C, even at runtime.

Now consider the following:

CPU 0 CPU 1 CPU 2

A=1 B=1 X=C
smp_mb() smp_mb() smp_mb()
atomic_inc(&C) atomic_inc(&C) assert(C!=2 || (A==1 && B==1))

This assertion is guaranteed to succeed (using my semantics of the
transitivity of ">v"/"<v" -- using yours, CPU 2 would instead need to
use an atomic operation to fetch the value of C). We still don't know
which atomic_inc() happened first (we would need atomic_inc_return()
to figure that out), but we can nevertheless determine if both have
happened and act accordingly.
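
A userspace rendition of this second example, sketched with invented
names: C11 seq_cst fences and atomic_fetch_add() play the roles of
smp_mb() and atomic_inc(). (The C11 model happens to let the plain
atomic load of C suffice here, matching the transitive ">v" semantics
described above.)

#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static atomic_int A, B, C;

static void *cpu0(void *arg)
{
	(void)arg;
	atomic_store_explicit(&A, 1, memory_order_relaxed);	/* A=1 */
	atomic_thread_fence(memory_order_seq_cst);		/* smp_mb() */
	atomic_fetch_add(&C, 1);			/* atomic_inc(&C) */
	return NULL;
}

static void *cpu1(void *arg)
{
	(void)arg;
	atomic_store_explicit(&B, 1, memory_order_relaxed);	/* B=1 */
	atomic_thread_fence(memory_order_seq_cst);		/* smp_mb() */
	atomic_fetch_add(&C, 1);			/* atomic_inc(&C) */
	return NULL;
}

static void *cpu2(void *arg)
{
	int x;

	(void)arg;
	x = atomic_load(&C);				/* X=C */
	atomic_thread_fence(memory_order_seq_cst);	/* smp_mb() */
	/* Once both increments are visible, so are both flags. */
	assert(x != 2 || (atomic_load(&A) == 1 && atomic_load(&B) == 1));
	return NULL;
}

int main(void)
{
	pthread_t t0, t1, t2;

	pthread_create(&t0, NULL, cpu0, NULL);
	pthread_create(&t1, NULL, cpu1, NULL);
	pthread_create(&t2, NULL, cpu2, NULL);
	pthread_join(t0, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	return 0;
}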

> > > For that matter, the whole concept of "the value in a variable" is itself
> > > rather fuzzy. Even the sequence of values might not be well defined: If
> > > you had some single CPU do nothing but repeatedly load the variable and
> > > note its value, you could end up missing some of the values perceived by
> > > other CPUs. That is, it could be possible for CPU 0 to see A take on the
> > > values 0,1,2 while CPU 1 sees only the values 0,2.
> >
> > Heck, if you have a synchronized clock register with sufficient accuracy,
> > you can catch different CPUs thinking that a given variable has different
> > values at the same point in time. ;-)
>
> Exactly. That's why I'm not too comfortable with your "<v" -- and I'm not
> completely certain of the validity of "comes before" either. Hardly
> surprising, since they mean pretty much the same thing.

An alternative would be to use something like "sees" to describe "<v":

ld_1(A) <v st_0(A=1)

might be called "CPU 1's load of A sees CPU 0's store of 1 into A".
Then "<v" would be "is seen by". In my regime:

ld_2(A) <v ++_1(A=2) <v st_0(A=1) -> ld_2(A) <v st_0(A=1)

In yours, this would not hold unless the ld_2() was replaced by an atomic
operation (if I understand your regime correctly).

Does this "sees"/"is seen by" nomenclature seem more reasonable?
Or perhaps "visibility includes"/"visible to"? Or keep "sees"/"seen by"
and use "<s"/">s" to adjust the mneumonic?

Thanx, Paul

2006-10-18 19:05:28

by Alan Stern

[permalink] [raw]
Subject: Re: Uses for memory barriers

On Tue, 17 Oct 2006, Paul E. McKenney wrote:

> > It was taking the effect of memory barriers into account. In the program
> > "load(A); store(B)" the load doesn't sequentially precede the store. But
> > in the program "load(A); smp_mb(); store(B)" it does. Similarly, in the
> > program "if (A) B = 2;" the load(A) sequentially precedes the store(B) --
> > thanks to the dependency or (if you prefer) the absence of speculative
> > stores.
> >
> > Basically "sequentially precedes" means that any other CPU using the
> > appropriate memory barriers will observe the accesses apparently occurring
> > in this order.
>
> Your first example in the previous paragraph fits the description.
> The second does not, as illustrated by the following scenario:
>
> CPU 0 CPU 1 CPU 2
>
> A=1 while (B==0); while (C==0);
> smp_mb() C=1 smp_mb()
> B=1 assert(A==1) <fails>

In what way is this inconsistent with my second example? Using "<" for
"sequentially precedes" we have:

st_0(A=1) < st_0(B=1) <v ld_1(B) < st_1(C=1) <v ld_2(C) < ld_2(A)

You cannot derive st_0(A=1) <v ld_2(A) from this. The rules are:

1:st(X) < 2:st(Y) <v 3:ac(Y) < 4:ac(X) implies
1:st(X) <v 4:ac(X)

and

1:ld(X) < 2:st(Y) <v 3:ac(Y) < 4:st(X) implies
4:st(X) !<v 1:ld(X).

> Please note that the "<fails>" is not a theoretical assertion -- I have
> seen this happen in real life. So, yes, the C=1 might not speculate ahead
> of the load of B that produced a non-zero result, but CPU 2's assertion
> can still fail, even though both CPU 2 and CPU 0 are using memory barriers.

Your example would be no less fallacious if CPU 1 executed smp_mb() before
C = 1. The assertion could still fail because CPU 1's write to C could
become visible to CPU 2 before CPU 0's writes to A and B.


> > Yes, I realize that. But if several CPUs store values to the same
> > variable at about the same time, it's not at all clear which stores are
> > "<v" others. Deciding this is tantamount to ordering all the stores to
> > that variable.
>
> Yep. Consider the following case:
>
> CPU 0 CPU 1 CPU 2
>
> A=1 B=1 X=C
> smp_mb() smp_mb() smp_mb()
> C=1 C=2 if (X==1) ???
>
> In the then-clause of the "if", CPU 2 can only be sure that it will
> see A==1. It might or might not see B==1. We simply don't know the
> order of stores to C, even at runtime.

It's one thing to say you don't know the order of stores. It's another
thing to say that there _is_ no order -- especially if you're going to use
the "<v" notation to implicitly impose such an order!

So maybe what you're claiming is that the stores to a single variable _do_
have a global order at runtime, even though we might not know what it is,
and the view each CPU has is always a suborder of that global order. (And
of course there is no fixed relation on how the global orders of stores to
two separate variables inter-relate, unless it is enforced by memory
barriers.)

Does this claim always make sense, even in a hierarchical cache system?

> Now consider the following:
>
> CPU 0 CPU 1 CPU 2
>
> A=1 B=1 X=C
> smp_mb() smp_mb() smp_mb()
> atomic_inc(&C) atomic_inc(&C) assert(C!=2 || (A==1 && B==1))
>
> This assertion is guaranteed to succeed (using my semantics of the
> transitivity of ">v"/"<v" -- using yours, CPU 2 would instead need to
> use an atomic operation to fetch the value of C). We still don't know
> which atomic_inc() happened first (we would need atomic_inc_return()
> to figure that out), but we can nevertheless determine if both have
> happened and act accordingly.

Yes. What is your point? And how is it related to the existence of a
global ordering for all stores to a single variable?


> An alternative would be to use something like "sees" to describe "<v":
>
> ld_1(A) <v st_0(A=1)
>
> might be called "CPU 1's load of A sees CPU 0's store of 1 into A".

You wrote this backwards; it should say: st_0(A=1) <v ld_1(A).

> Then "<v" would be "is seen by". In my regime:
>
> ld_2(A) <v ++_1(A=2) <v st_0(A=1) -> ld_2(A) <v st_0(A=1)

Backwards again. You mean:

st_0(A=1) <v ++_1(A=2) <v ld_2(A) implies st_0(A=1) <v ld_2(A)

> In yours, this would not hold unless the ld_2() was replaced by an atomic
> operation (if I understand your regime correctly).

Basically yes, although I'm not sure how a load manages to be atomic.
Maybe if it's part of an atomic exchange or something like that.

> Does this "sees"/"is seen by" nomenclature seem more reasonable?
> Or perhaps "visibility includes"/"visible to"? Or keep "sees"/"seen by"
> and use "<s"/">s" to adjust the mneumonic?

I'm not keen on either one. Does it make sense to say one store is
visible to (or is seen by) another store? Maybe... But "comes before"
seems more natural. Especially if you're assuming the existence of a
global ordering on these stores.

Incidentally, although you haven't mentioned this, it's important never to
state that a load is visible to (or is seen by or comes before) another
access. In other words, the global ordering of stores for a single
variable doesn't extend to a global ordering of _all_ accesses for that
variable.

Alan

2006-10-18 23:00:55

by Paul E. McKenney

[permalink] [raw]
Subject: Re: Uses for memory barriers

On Wed, Oct 18, 2006 at 03:05:26PM -0400, Alan Stern wrote:
> On Tue, 17 Oct 2006, Paul E. McKenney wrote:
>
> > > It was taking the effect of memory barriers into account. In the program
> > > "load(A); store(B)" the load doesn't sequentially precede the store. But
> > > in the program "load(A); smp_mb(); store(B)" it does. Similarly, in the
> > > program "if (A) B = 2;" the load(A) sequentially precedes the store(B) --
> > > thanks to the dependency or (if you prefer) the absence of speculative
> > > stores.
> > >
> > > Basically "sequentially precedes" means that any other CPU using the
> > > appropriate memory barriers will observe the accesses apparently occurring
> > > in this order.
> >
> > Your first example in the previous paragraph fits the description.
> > The second does not, as illustrated by the following scenario:
> >
> > CPU 0 CPU 1 CPU 2
> >
> > A=1 while (B==0); while (C==0);
> > smp_mb() C=1 smp_mb()
> > B=1 assert(A==1) <fails>
>
> In what way is this inconsistent with my second example?

I believe that it -is- in fact consistent with your second example,
though more elaborate than needed. But it doesn't matter; in the
following simpler example, the assertion also fails:

CPU 0 CPU 1 CPU 2

A=1 while (A==0); while (B==0);
B=1 smp_mb()
assert(A==1) <fails>

Perhaps I don't understand what you mean by "sequentially precedes"
in your earlier paragraph -- but you did say that "any other CPU using
the appropriate memory barriers will observe the accesses apparently
occurring in this order". One interpretation might be that CPU 2's
load of A temporally follows CPU 1's load of A. This would of course
not prevent CPU 2 from seeing an "earlier" value of A than did CPU 1,
as we have both noted many times. ;-)

> Using "<" for
> "sequentially precedes" we have:
>
> st_0(A=1) < st_0(B=1) <v ld_1(B) < st_1(C=1) <v ld_2(C) < ld_2(A)

We are using "<" backwards from each other. You are using it as if
you were comparing time values (which I guess I should have anticipated
given your temporal viewpoint), while I was using it to indicate flow
of data (for ">v") or flow of control (for ">p"). In my notation,
st > ld makes sense (the store's data flows to the load), in yours it
would instead indicate that the store happened after the load. I think.

My guess is that you are using "<" to indicate presence of a memory barrier.

Perhaps I should use ">>v", analogous to the C++ I/O operator, to
emphasize that I am not comparing things. Would that help?

Another alternative would be "->v", but that could be confused with
the implication operator.

> You cannot derive st_0(A=1) <v ld_2(A) from this. The rules are:
>
> 1:st(X) < 2:st(Y) <v 3:ac(Y) < 4:ac(X) implies
> 1:st(X) <v 4:ac(X)
>
> and
>
> 1:ld(X) < 2:st(Y) <v 3:ac(Y) < 4:st(X) implies
> 4:st(X) !<v 1:ld(X).

Agreed. If you -could- derive this, the assertion would never fail.

> > Please note that the "<fails>" is not a theoretical assertion -- I have
> > seen this happen in real life. So, yes, the C=1 might not speculate ahead
> > of the load of B that produced a non-zero result, but CPU 2's assertion
> > can still fail, even though both CPU 2 and CPU 0 are using memory barriers.
>
> Your example would be no less fallacious if CPU 1 executed smp_mb() before
> C = 1. The assertion could still fail because CPU 1's write to C could
> become visible to CPU 2 before CPU 0's writes to A and B.

Yep, at least if I interpret "fallacious" to mean "prone to assertion
failure".

> > > Yes, I realize that. But if several CPUs store values to the same
> > > variable at about the same time, it's not at all clear which stores are
> > > "<v" others. Deciding this is tantamount to ordering all the stores to
> > > that variable.
> >
> > Yep. Consider the following case:
> >
> > CPU 0 CPU 1 CPU 2
> >
> > A=1 B=1 X=C
> > smp_mb() smp_mb() smp_mb()
> > C=1 C=2 if (X==1) ???
> >
> > In the then-clause of the "if", CPU 2 can only be sure that it will
> > see A==1. It might or might not see B==1. We simply don't know the
> > order of stores to C, even at runtime.
>
> It's one thing to say you don't know the order of stores. It's another
> thing to say that there _is_ no order -- especially if you're going to use
> the "<v" notation to implicitly impose such an order!

">v" only applies to single variables, at least in my viewpoint.

For multiple stores to multiple variables, it is easy to create scenarios
where people could disagree as to what the order of stores was, depending
on exactly which of the many events associated with a given store defined
that store's timestamp.

> So maybe what you're claiming is that the stores to a single variable _do_
> have a global order at runtime, even though we might not know what it is,
> and the view each CPU has is always a suborder of that global order. (And
> of course there is no fixed relation on how the global orders of stores to
> two separate variables inter-relate, unless it is enforced by memory
> barriers.)
>
> Does this claim always make sense, even in a hierarchical cache system?

For a single variable in a cache-coherent system, yes. Because if this
claim does not hold, then by definition, the system is not cache-coherent.
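
Cache coherence in miniature, as a sketch with invented names: even
relaxed C11 atomics promise a single per-variable store order, so the
reader below can never observe the writer's two stores out of order:

#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static atomic_int V;

static void *writer(void *arg)
{
	(void)arg;
	atomic_store_explicit(&V, 1, memory_order_relaxed);
	atomic_store_explicit(&V, 2, memory_order_relaxed);
	return NULL;
}

static void *reader(void *arg)
{
	int r1, r2;

	(void)arg;
	r1 = atomic_load_explicit(&V, memory_order_relaxed);
	r2 = atomic_load_explicit(&V, memory_order_relaxed);
	/* Successive reads move forward, never backward, in the
	 * variable's global store order, with no barriers anywhere. */
	assert(!(r1 == 2 && r2 == 1));
	return NULL;
}

int main(void)
{
	pthread_t w, r;

	pthread_create(&w, NULL, writer, NULL);
	pthread_create(&r, NULL, reader, NULL);
	pthread_join(w, NULL);
	pthread_join(r, NULL);
	return 0;
}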

> > Now consider the following:
> >
> > CPU 0 CPU 1 CPU 2
> >
> > A=1 B=1 X=C
> > smp_mb() smp_mb() smp_mb()
> > atomic_inc(&C) atomic_inc(&C) assert(C!=2 || (A==1 && B==1))
> >
> > This assertion is guaranteed to succeed (using my semantics of the
> > transitivity of ">v"/"<v" -- using yours, CPU 2 would instead need to
> > use an atomic operation to fetch the value of C). We still don't know
> > which atomic_inc() happened first (we would need atomic_inc_return()
> > to figure that out), but we can nevertheless determine if both have
> > happened and act accordingly.
>
> Yes. What is your point? And how is it related to the existence of a
> global ordering for all stores to a single variable?

The first example did not permit CPU 2 to detect when both CPU's stores
to C had completed, while the second example does permit this.

> > An alternative would be to use something like "sees" to describe "<v":
> >
> > ld_1(A) <v st_0(A=1)
> >
> > might be called "CPU 1's load of A sees CPU 0's store of 1 into A".
>
> You wrote this backwards; it should say: st_0(A=1) <v ld_1(A).

We are still disagreeing on the definition of "<v". I could try writing
it as ld_1(A) <<v st_0(A=1) to make it clear that I am not comparing
timestamps.

> > Then "<v" would be "is seen by". In my regime:
> >
> > ld_2(A) <v ++_1(A=2) <v st_0(A=1) -> ld_2(A) <v st_0(A=1)
>
> Backwards again. You mean:
>
> st_0(A=1) <v ++_1(A=2) <v ld_2(A) implies st_0(A=1) <v ld_2(A)

Or:

ld_2(A) <<v ++_1(A=2) <<v st_0(A=1) -> ld_2(A) <<v st_0(A=1)

> > In yours, this would not hold unless the ld_2() was replaced by an atomic
> > operation (if I understand your regime correctly).
>
> Basically yes, although I'm not sure how a load manages to be atomic.
> Maybe if it's part of an atomic exchange or something like that.

Yes, an atomic exchange is indeed one way to implement a lock-acquisition
primitive, which is the case that requires transitivity.

Perhaps this could be denoted by atomic_inc_1(A) or some such.

> > Does this "sees"/"is seen by" nomenclature seem more reasonable?
> > Or perhaps "visibility includes"/"visible to"? Or keep "sees"/"seen by"
> > and use "<s"/">s" to adjust the mnemonic?
>
> I'm not keen on either one. Does it make sense to say one store is
> visible to (or is seen by) another store? Maybe... But "comes before"
> seems more natural. Especially if you're assuming the existence of a
> global ordering on these stores.

In your time-based viewpoint, I would guess that "comes before" would
be quite natural -- and this makes complete sense for MMIO accesses,
as far as I can tell (never have done much in the way of parallel device
drivers, though). For memory-based data structures, as we have discussed,
there are cases where taking a time-based approach can be misleading.
Or, in my case, can lead to bugs. ;-)

> Incidentally, although you haven't mentioned this, it's important never to
> state that a load is visible to (or is seen by or comes before) another
> access. In other words, the global ordering of stores for a single
> variable doesn't extend to a global ordering of _all_ accesses for that
> variable.

Yes. From my viewpoint (ignoring MMIO), stores are mute and loads
are deaf. So stores can "hear" loads, but not vice versa. I guess I
could make the "sees" analogy work just as well -- in that case stores
are blind and loads are invisible, so that loads can see stores, but
not vice versa. And loads can neither see nor hear other loads.

In the MMIO case, a load may be visible to a subsequent store due to
changes in device state. In principle, anyway. It has been a while since
I have looked carefully at device-register specifications, so I have no
idea if any real devices do this -- on this, I must defer to you.

Thanx, Paul

2006-10-19 16:44:32

by Alan Stern

[permalink] [raw]
Subject: Re: Uses for memory barriers

On Wed, 18 Oct 2006, Paul E. McKenney wrote:

> > > > Basically "sequentially precedes" means that any other CPU using the
> > > > appropriate memory barriers will observe the accesses apparently occurring
> > > > in this order.
> > >
> > > Your first example in the previous paragraph fits the description.
> > > The second does not, as illustrated by the following scenario:
> > >
> > > CPU 0 CPU 1 CPU 2
> > >
> > > A=1 while (B==0); while (C==0);
> > > smp_mb() C=1 smp_mb()
> > > B=1 assert(A==1) <fails>
> >
> > In what way is this inconsistent with my second example?
>
> I believe that it -is- in fact consistent with your second example,
> though more elaborate than needed. But it doesn't matter; in the
> following simpler example, the assertion also fails:
>
> CPU 0 CPU 1 CPU 2
>
> A=1 while (A==0); while (B==0);
> B=1 smp_mb()
> assert(A==1) <fails>

Again, there's nothing wrong with this. Writing the example out
symbolically gives:

st_0(A=1) <v ld_1(A) < st_1(B=1) <v ld_2(B) < ld_2(A)

You seem to think that this implies ld_1(A) <v ld_2(A), from which you
derive st_0(A=1) <v ld_2(A). But the reasoning is incorrect, because

ld_1(A) <v ld_2(A)

is an invalid expression. A load is never allowed to appear on the left
side of <v.

And again, it wouldn't matter if you inserted smp_mb() before CPU 1's
store to B. The assertion could fail anyway, and for the same reason. So
I see no reason to think the control dependency in CPU 1's assignment to B
is any weaker than a memory barrier.


> Perhaps I don't understand what you mean by "sequentially precedes"
> in your earlier paragraph -- but you did say that "any other CPU using
> the appropriate memory barriers will observe the accesses apparently
> occurring in this order". One interpretation might be that CPU 2's
> load of A temporally follows CPU 1's load of A. This would of course
> not prevent CPU 2 from seeing an "earlier" value of A than did CPU 1,
> as we have both noted many times. ;-)

"Sequentially precedes" means that the system behaves as though there were
a memory barrier between the two accesses.

> We are using "<" backwards from each other. You are using it as if
> you were comparing time values (which I guess I should have anticipated
> given your temporal viewpoint), while I was using it to indicate flow
> of data (for ">v") or flow of control (for ">p"). In my notation,
> st > ld makes sense (the store's data flows to the load), in yours it
> would instead indicate that the store happened after the load. I think.

In my notation "<" means "sequentially precedes" and ">" would therefore
mean "sequentially follows". So "st > ld" would indicate that the store
and the load both occurred on the same CPU, with the load preceding the
store in the object code, and there was either a dependency or an explicit
memory barrier between them.

There isn't a direct temporal implication, only an implication about the
order of statements in the object code. The CPU would be free to carry
out the two statements at any times and in any order it wants --
backwards, forwards, or upside down -- "st > ld" isn't concerned with
absolute or relative time values. It only cares about the flow of
execution.

> My guess is that you are using "<" to indicate presence of a memory barrier.

Approximately. It indicates that the behavior is the same as if a memory
barrier were present. But it also indicates program ordering. Given:

A = 1; mb(); B = 2;

Then we have st(A) < st(B) but not st(B) < st(A), because of the order
of the two statements in the object code. Perhaps you would want to
write it as st(A) >p st(B).


> Perhaps I should use ">>v", analogous to the C++ I/O operator, to
> emphasize that I am not comparing things. Would that help?
>
> Another alternative would be "->v", but that could be confused with
> the implication operator.

Well, we can write our orderings using either < or > as we please, just so
long as we are consistent. BUT... Considering that there is a strong
intuitive association here with the flow of time, I think you would end up
only confusing people with your convention:

We want to emphasize that the <v operator is not directly
connected with the flow of time. Hence we will write
"x <v y" to mean that x logically follows y.

I don't think people will go for this. :-)

> We are still disagreeing on the definition of "<v". I could try writing
> it as ld_1(A) <<v st_0(A=1) to make it clear that I am not comparing
> timestamps.

It is already clear. You don't need to do anything special to emphasize
it. We will explain that <v coincides only roughly with any actual time
ordering.


> > It's one thing to say you don't know the order of stores. It's another
> > thing to say that there _is_ no order -- especially if you're going to use
> > the "<v" notation to implicitly impose such an order!
>
> ">v" only applies to single variables, at least in my viewpoint.

That's what I've been talking about. By using ">v", you are implicitly
saying that there is a global order of all stores to a single variable.

> > So maybe what you're claiming is that the stores to a single variable _do_
> > have a global order at runtime, even though we might not know what it is,
> > and the view each CPU has is always a suborder of that global order. (And
> > of course there is no fixed relation on how the global orders of stores to
> > two separate variables inter-relate, unless it is enforced by memory
> > barriers.)
> >
> > Does this claim always make sense, even in a hierarchical cache system?
>
> For a single variable in a cache-coherent system, yes. Because if this
> claim does not hold, then by definition, the system is not cache-coherent.

Okay, that's fine. In a cache-coherent system, all stores to a single
variable fall into a global ordering which has a strong temporal nature
(if two stores are separated by a sufficiently long time, the earlier
store is bound to be ordered before the later store). In this situation
it certainly is natural to think of the ordering as a temporal one, even
though we know this is not exactly true. And I think it's confusing to
say that an earlier event is > a later event.

(BTW, can you explain the difference between "cache coherent" and "cache
consistent"? I never quite got it straight...)



> > Basically yes, although I'm not sure how a load manages to be atomic.
> > Maybe if it's part of an atomic exchange or something like that.
>
> Yes, an atomic exchange is indeed one way to implement a lock-acquisition
> primitive, which is the case that requires transitivity.
>
> Perhaps this could be denoted by atomic_inc_1(A) or some such.

Or at_1(A+=1).


> > > Does this "sees"/"is seen by" nomenclature seem more reasonable?
> > > Or perhaps "visibility includes"/"visible to"? Or keep "sees"/"seen by"
> > > and use "<s"/">s" to adjust the mnemonic?
> >
> > I'm not keen on either one. Does it make sense to say one store is
> > visible to (or is seen by) another store? Maybe... But "comes before"
> > seems more natural. Especially if you're assuming the existence of a
> > global ordering on these stores.
>
> In your time-based viewpoint, I would guess that "comes before" would
> be quite natural -- and this makes complete sense for MMIO accesses,
> as far as I can tell (never have done much in the way of parallel device
> drivers, though). For memory-based data structures, as we have discussed,
> there are cases where taking a time-based approach can be misleading.
> Or, in my case, can lead to bugs. ;-)

The word "before" doesn't necessarily mean the same thing as "earlier in
time". Here it means "lower in the global ordering of all stores to this
variable".

You're too hung up on this idea that something which _looks_ like a
temporal ordering actually _must be_ a temporal ordering. Not at all --
it may resemble one for mnemonic purposes without actually being one.


> > Incidentally, although you haven't mentioned this, it's important never to
> > state that a load is visible to (or is seen by or comes before) another
> > access. In other words, the global ordering of stores for a single
> > variable doesn't extend to a global ordering of _all_ accesses for that
> > variable.
>
> Yes. From my viewpoint (ignoring MMIO), stores are mute and loads
> are deaf. So stores can "hear" loads, but not vice versa. I guess I
> could make the "sees" analogy work just as well -- in that case stores
> are blind and loads are invisible, so that loads can see stores, but
> not vice versa. And loads can neither see nor hear other loads.

The analogy breaks down for pairs of stores. If stores are blind then
they can't see other stores -- but we need them to.

> In the MMIO case, a load may be visible to a subsequent store due to
> changes in device state. In principle, anyway, it has been a while since
> I have looked carefully at device-register specifications, so I have no
> idea if any real devices do this -- on this, I must defer to you.

They sometimes do, in rather specialized circumstances. It isn't very
common.

Alan

2006-10-19 19:21:09

by Paul E. McKenney

[permalink] [raw]
Subject: Re: Uses for memory barriers

On Thu, Oct 19, 2006 at 12:44:28PM -0400, Alan Stern wrote:
> On Wed, 18 Oct 2006, Paul E. McKenney wrote:
>
> > > > > Basically "sequentially precedes" means that any other CPU using the
> > > > > appropriate memory barriers will observe the accesses apparently occurring
> > > > > in this order.
> > > >
> > > > Your first example in the previous paragraph fits the description.
> > > > The second does not, as illustrated by the following scenario:
> > > >
> > > > CPU 0                   CPU 1                   CPU 2
> > > >
> > > > A=1                     while (B==0);           while (C==0);
> > > > smp_mb()                C=1                     smp_mb()
> > > > B=1                                             assert(A==1) <fails>
> > >
> > > In what way is this inconsistent with my second example?
> >
> > I believe that it -is- in fact consistent with your second example,
> > though more elaborate than needed. But it doesn't matter, in the
> > following simpler example, the assertion also fails:
> >
> > CPU 0                   CPU 1                   CPU 2
> >
> > A=1                     while (A==0);           while (B==0);
> >                         B=1                     smp_mb()
> >                                                 assert(A==1) <fails>
>
> Again, there's nothing wrong with this. Writing the example out
> symbolically gives:
>
> st_0(A=1) <v ld_1(A) < st_1(B=1) <v ld_2(B) < ld_2(A)
>
> You seem to think that this implies ld_1(A) <v ld_2(A),

No, I thought that -you- thought that this implication held. :-/

> from which you
> derive st_0(A=1) <v ld_2(A). But the reasoning is incorrect, because
>
> ld_1(A) <v ld_2(A)
>
> is an invalid expression. A load is never allowed to appear on the left
> side of <v.

Yep, you have thoroughly destroyed that particular strawman. ;-)

> And again, it wouldn't matter if you inserted smp_mb() before CPU 1's
> store to B. The assertion could fail anyway, and for the same reason.

This I agree with.

> So
> I see no reason to think the control dependency in CPU 1's assignment to B
> is any weaker than a memory barrier.

I am assuming that you have in mind a special restricted memory barrier
that applies only to the load of A, and not necessarily to any other
preceding operations. Otherwise, these two code sequences would be
equivalent, and they are not (as usual, all variables initially zero):

CPU 0, Sequence 1       CPU 0, Sequence 2       CPU 1

A=1                     A=1                     while (C==0);
while (B==0);           while (B==0);           smp_mb();
C=1                     smp_mb();               assert(A==1);
                        C=1

In sequence 1, CPU 1's assertion can fail. Not so with sequence 2. The
difference is that the posited restricted barrier orders only the loads
of B against the dependent store to C, whereas the smp_mb() in sequence 2
also orders the earlier store to A against the store to C.

Regardless of your definition of your posited memory barrier corresponding
to the control dependency, a counter example:

CPU 1                   CPU 2

A=1;
...
while (A==0);           while (B==0);
B=1                     smp_mb()
                        assert(A==1) <fails>

Here, placing an smp_mb() after the "while (A==0)" does make a difference.

Degenerate, perhaps, given that the same CPU is assigning and while-ing,
but so it goes.

Even assuming a special restricted memory barrier, the example of DEC
Alpha and pointer dereferencing gives me pause. Feel free to berate
me for this, as you have done in the past. ;-)

Seriously, my judgement of this would depend on exactly what part of
the smp_mb() semantics you are claiming for the control dependency.
I do not believe that we could make progress without appealing to a
specific implementation, so I would rather ignore control dependencies,
at least for non-MMIO accesses. MMIO would be another story altogether.

> > Perhaps I don't understand what you mean by "sequentially precedes"
> > in your earlier paragraph -- but you did say that "any other CPU using
> > the appropriate memory barriers will observe the accesses apparently
> > occurring in this order". One interpretation might be that CPU 2's
> > load of A temporally follows CPU 1's load of A. This would of course
> > not prevent CPU 2 from seeing an "earlier" value of A than did CPU 1,
> > as we have both noted many times. ;-)
>
> "Sequentially precedes" means that the system behaves as though there were
> a memory barrier between the two accesses.

OK. As noted above, if I were to interpret "a memory barrier" as really
being everything entailed by smp_mb(), I disagree with your statement in an
earlier email stating:

Similarly, in the program "if (A) B = 2;" the load(A) sequentially
precedes the store(B) -- thanks to the dependency or (if you
prefer) the absence of speculative stores.

However, I don't believe that is what you mean by "a memory barrier" in
this case -- my guess again is that you mean a special memory barrier that
applies only to the load of A in one direction, but that applies to
everything following the load in the other direction.

> > We are using "<" backwards from each other. You are using it as if
> > you were comparing time values (which I guess I should have anticipated
> > given your temporal viewpoint), while I was using it to indicate flow
> > of data (for ">v") or flow of control (for ">p"). In my notation,
> > st > ld makes sense (the store's data flows to the load), in yours it
> > would instead indicate that the store happened after the load. I think.
>
> In my notation "<" means "sequentially precedes" and ">" would therefore
> mean "sequentially follows". So "st > ld" would indicate that the store
> and the load both occurred on the same CPU, with the load preceding the
> store in the object code, and there was either a dependency or an explicit
> memory barrier between them.
>
> There isn't a direct temporal implication, only an implication about the
> order of statements in the object code. The CPU would be free to carry
> out the two statements at any times and in any order it wants --
> backwards, forwards, or upside down -- "st > ld" isn't concerned with
> absolute or relative time values. It only cares about the flow of
> execution.

Which is why we need to use a notation that very clearly denotes flow,
without the temporal-ordering connotation that ">" has to many people
(me in particular).

> > My guess is that you are using "<" to indicate presence of a memory barrier.
>
> Approximately. It indicates that the behavior is the same as if a memory
> barrier were present. But it also indicates program ordering. Given:
>
> A = 1; mb(); B = 2;
>
> Then we have st(A) < st(B) but not st(B) < st(A), because of the order
> of the two statements in the object code. Perhaps you would want to
> write it as st(A) >p st(B).

I would use ">p" for the program-order relationship, and probably something
like ">b" for the memory-barrier relationship. There are other orderings,
including the control-flow ordering discussed earlier, data dependencies,
and so on.

> > Perhaps I should use ">>v", analogous to the C++ I/O operator, to
> > emphasize that I am not comparing things. Would that help?
> >
> > Another alternative would be "->v", but that could be confused with
> > the implication operator.
>
> Well, we can write our orderings using either < or > as we please, just so
> long as we are consistent. BUT... Considering that there is a strong
> intuitive association here with the flow of time, I think you would end up
> only confusing people with your convention:
>
> We want to emphasize that the <v operator is not directly
> connected with the flow of time. Hence we will write
> "x <v y" to mean that x logically follows y.
>
> I don't think people will go for this. :-)

The literature is quite inconsistent. The DEC Alpha manual takes your
approach, while Gharachorloo's dissertation takes my approach. Not to
be outdone, Steinke and Nutt's JACM paper (written long after the other
two) uses different directions for different types of orderings!!!
See http://arxiv.org/PS_cache/cs/pdf/0208/0208027.pdf, page 49,
Definitions A.5, A.6 on the one hand and Definition A.7 on the other. ;-)

> > We are still disagreeing on the definition of "<v". I could try writing
> > it as ld_1(A) <<v st_0(a=1) to make it clear that I am not comparing
> > timestamps.
>
> It is already clear. You don't need to do anything special to emphasize
> it. We will explain that <v coincides only roughly with any actual time
> ordering.

Given the disagreement between the two of us, to say nothing of the
different notations used in the literature, I cannot agree that it
is clear. And this stuff is confusing enough without conflicting
connotations! ;-)

> > > It's one thing to say you don't know the order of stores. It's another
> > > thing to say that there _is_ no order -- especially if you're going to use
> > > the "<v" notation to implicitly impose such an order!
> >
> > ">v" only applies to single variables, at least in my viewpoint.
>
> That's what I've been talking about. By using ">v", you are implicitly
> saying that there is a global order of all stores to a single variable.

And in a cache-coherent system, there must be. Or, more precisely,
there must not be different sequences of loads that indicate inconsistent
orderings of stores to a given single variable. If the system can
prove that there are no concurrent loads during a given period of
time, I guess it would be within its rights to ditch cache coherence
for that variable during that time...
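
As a sketch of the sort of disagreement this rules out (all variables
initially zero, r1-r4 being CPU-local):

CPU 0           CPU 1           CPU 2                   CPU 3

A=1             A=2             r1=A; r2=A              r3=A; r4=A

assert(!(r1==1 && r2==2 && r3==2 && r4==1));

If the assertion failed, CPU 2's loads would place A=1 before A=2 in the
global order while CPU 3's loads placed them the other way around, which
is exactly the inconsistency prohibited above.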

> > > So maybe what you're claiming is that the stores to a single variable _do_
> > > have a global order at runtime, even though we might not know what it is,
> > > and the view each CPU has is always a suborder of that global order. (And
> > > of course there is no fixed relation on how the global orders of stores to
> > > two separate variables inter-relate, unless it is enforced by memory
> > > barriers.)
> > >
> > > Does this claim always make sense, even in a hierarchical cache system?
> >
> > For a single variable in a cache-coherent system, yes. Because if this
> > claim does not hold, then by definition, the system is not cache-coherent.
>
> Okay, that's fine. In a cache-coherent system, all stores to a single
> variable fall into a global ordering which has a strong temporal nature
> (if two stores are separated by a sufficiently long time, the earlier
> store is bound to be ordered before the later store). In this situation
> it certainly is natural to think of the ordering as a temporal one, even
> though we know this is not exactly true. And I think it's confusing to
> say that an earlier event is > a later event.

This is the connotation conflict. For you, it is confusing to write
"A > B" when A precedes B. For me, it is confusing to write "st < ld"
when data flows from the "st" to the "ld". So, the only way to resolve
this is to avoid use of ">" like the plague!

> (BTW, can you explain the difference between "cache coherent" and "cache
> consistent"? I never quite got it straight...)

"Cache coherent" is the preferred term, though "cache consistent" is
sometimes used as a synonym. If you want to be painfully correct, you
would say "cache coherent" when talking about stores to a single variable,
and "memory consistency model" when talking about ordering of accesses
to multiple variables.

> > > Basically yes, although I'm not sure how a load manages to be atomic.
> > > Maybe if it's part of an atomic exchange or something like that.
> >
> > Yes, an atomic exchange is indeed one way to implement a lock-acquisition
> > primitive, which is the case that requires transitivity.
> >
> > Perhaps this could be denoted by atomic_inc_1(A) or some such.
>
> Or at_1(A+=1).

I do like your suggestion quite a bit better than I like mine!

> > > > Does this "sees"/"is seen by" nomenclature seem more reasonable?
> > > > Or perhaps "visibility includes"/"visible to"? Or keep "sees"/"seen by"
> > > > and use "<s"/">s" to adjust the mnemonic?
> > >
> > > I'm not keen on either one. Does it make sense to say one store is
> > > visible to (or is seen by) another store? Maybe... But "comes before"
> > > seems more natural. Especially if you're assuming the existence of a
> > > global ordering on these stores.
> >
> > In your time-based viewpoint, I would guess that "comes before" would
> > be quite natural -- and this makes complete sense for MMIO accesses,
> > as far as I can tell (never have done much in the way of parallel device
> > drivers, though). For memory-based data structures, as we have discussed,
> > there are cases where taking a time-based approach can be misleading.
> > Or, in my case, can lead to bugs. ;-)
>
> The word "before" doesn't necessarily mean the same thing as "earlier in
> time". Here it means "lower in the global ordering of all stores to this
> variable".
>
> You're too hung up on this idea that something which _looks_ like a
> temporal ordering actually _must be_ a temporal ordering. Not at all --
> it may resemble one for mnemonic purposes without actually being one.

Perhaps I am too hung up on this, but I came by this particular hangup
honestly! ;-)

> > > Incidentally, although you haven't mentioned this, it's important never to
> > > state that a load is visible to (or is seen by or comes before) another
> > > access. In other words, the global ordering of stores for a single
> > > variable doesn't extend to a global ordering of _all_ accesses for that
> > > variable.
> >
> > Yes. From my viewpoint (ignoring MMIO), stores are mute and loads
> > are deaf. So stores can "hear" loads, but not vice versa. I guess I
> > could make the "sees" analogy work just as well -- in that case stores
> > are blind and loads are invisible, so that loads can see stores, but
> > not vice versa. And loads can neither see nor hear other loads.
>
> The analogy breaks down for pairs of stores. If stores are blind then
> they can't see other stores -- but we need them to.

I would instead say that you need to execute some loads in order to be
able to see the effects of your pairs of stores.

> > In the MMIO case, a load may be visible to a subsequent store due to
> > changes in device state. In principle, anyway, it has been a while since
> > I have looked carefully at device-register specifications, so I have no
> > idea if any real devices do this -- on this, I must defer to you.
>
> They sometimes do, in rather specialized circumstances. It isn't very
> common.

Good to know, thank you!

Thanx, Paul

2006-10-19 20:55:19

by Alan Stern

[permalink] [raw]
Subject: Re: Uses for memory barriers

On Thu, 19 Oct 2006, Paul E. McKenney wrote:

> > I see no reason to think the control dependency in CPU 1's assignment to B
> > is any weaker than a memory barrier.
>
> I am assuming that you have in mind a special restricted memory barrier
> that applies only to the load of A, and not necessarily to any other
> preceding operations. Otherwise, these two code sequences would be
> equivalent, and they are not (as usual, all variables initially zero):
>
> CPU 0, Sequence 1       CPU 0, Sequence 2       CPU 1
>
> A=1                     A=1                     while (C==0);
> while (B==0);           while (B==0);           smp_mb();
> C=1                     smp_mb();               assert(A==1);
>                         C=1
>
> In sequence 1, CPU 1's assertion can fail. Not so with sequence 2.

Yes, that's a very good point. Indeed, I meant a restricted memory
barrier applying only to the two accesses involved. In the same sort of
way rmb() is a restricted memory barrier, applying only to pairs of
loads.
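
(In code, the analogous pattern is:

        r1 = B;
        rmb();
        r2 = A;

where the rmb() constrains loads against loads but says nothing about
stores.)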

> Regardless of your definition of your posited memory barrier corresponding
> to the control dependency, a counter example:
>
> CPU 1                   CPU 2
>
> A=1;
> ...
> while (A==0);           while (B==0);
> B=1                     smp_mb()
>                         assert(A==1) <fails>
>
> Here, placing an smp_mb() after the "while (A==0)" does make a difference.
>
> Degenerate, perhaps, given that the same CPU is assigning and while-ing,
> but so it goes.

The smp_mb() does make a difference. But it doesn't invalidate my notion
of a dependency acting as a restricted memory barrier. The notion allows
you to conclude from this example only that ld_1(A) >v ld_2(A), which is
meaningless (using your convention for >v). It doesn't allow you to
conclude st_1(A) >v ld_2(A).

> Even assuming a special restricted memory barrier, the example of DEC
> Alpha and pointer dereferencing gives me pause. Feel free to berate
> me for this, as you have done in the past. ;-)

Ah, interesting comment. With the Alpha and pointer dereferencing, the
problems arise because of failure to respect a data dependency between two
loads. Here I am talking about a dependency between a load and a
subsequent store, so it isn't the same thing at all. Failure to respect
this kind of dependency would mean the CPU was writing a value before it
knew what value to write (or whether to write it, or where to write it).
Not even the most aggressively speculative machine will do that!
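
In code, the kind of dependency I have in mind is simply (a sketch, with
r1 being CPU-local):

        r1 = A;                 /* load */
        if (r1)
                B = 2;          /* store, control-dependent on the load */

The CPU cannot commit the store to B before the load of A returns, since
the loaded value determines whether the store happens at all.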

> Seriously, my judgement of this would depend on exactly what part of
> the smp_mb() semantics you are claiming for the control dependency.
> I do not believe that we could make progress without appealing to a
> specific implementation, so I would rather ignore control dependencies,
> at least for non-MMIO accesses. MMIO would be another story altogether.

What I'm claiming is exactly what was written in an earlier email:

st(A) < st(B) >v ac(B) < ac(A) implies st(A) >v ac(A), and

ld(A) < st(B) >v ac(B) < st(A) implies st(A) !>v ld(A).

Here I'm using your convention for >v, and < indicates either an explicit
barrier between two accesses or a dependency between a load and a later
store.
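
As a sketch of the second implication in action (all variables initially
zero, r1 CPU-local):

CPU 0                   CPU 1

r1 = A;                 while (B==0) ;
if (r1==0) B = 1;       mb();
assert(r1==0);          A = 1;

Here ld_0(A) < st_0(B=1) by the dependency, st_0(B=1) >v ld_1(B) because
the loop can exit only by reading CPU 0's store, and ld_1(B) < st_1(A=1)
by the mb(). The second rule then gives st_1(A=1) !>v ld_0(A): CPU 0's
load cannot have returned CPU 1's store, so the assertion holds.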

> > "Sequentially precedes" means that the system behaves as though there were
> > a memory barrier between the two accesses.
>
> OK. As noted above, if I were to interpret "a memory barrier" as really
> being everything entailed by smp_mb(), I disagree with your statement in an
> earlier email stating:
>
> Similarly, in the program "if (A) B = 2;" the load(A) sequentially
> precedes the store(B) -- thanks to the dependency or (if you
> prefer) the absence of speculative stores.
>
> However, I don't believe that is what you mean by "a memory barrier" in
> this case -- my guess again is that you mean a special memory barrier that
> applies only to the load of A in one direction, but that applies to
> everything following the load in the other direction.

It applies to the load of A in one direction and to all later stores in
the other direction. Not to later loads.


> I would use ">p" for the program-order relationship, and probably something
> like ">b" for the memory-barrier relationship. There are other orderings,
> including the control-flow ordering discussed earlier, data dependencies,
> and so on.

> The literature is quite inconsistent. The DEC Alpha manual takes your
> approach, while Gharachorloo's dissertation takes my approach. Not to
> be outdone, Steinke and Nutt's JACM paper (written long after the other
> two) uses different directions for different types of orderings!!!
> See http://arxiv.org/PS_cache/cs/pdf/0208/0208027.pdf, page 49,
> Definitions A.5, A.6 on the one hand and Definition A.7 on the other. ;-)

> This is the connotation conflict. For you, it is confusing to write
> "A > B" when A precedes B. For me, it is confusing to write "st < ld"
> when data flows from the "st" to the "ld". So, the only way to resolve
> this is to avoid use of ">" like the plague!

Okay, let's change the notation. I don't like <v very much. Let's not
worry about potential confusion with implication signs, and use

1:st(A) -> 2:st(A)

to indicate that 1:st occurs earlier than 2:st in the global ordering of
all stores to A. And let's use

3:st(B) -> 4:ld(B)

to mean that 4:ld returned the value either of 3:st or of some other store
to B occurring later in the global ordering of all such stores. Lastly,
let's use

5:ac(A) +> 6:ac(B)

to indicate either that the two accesses are separated by a memory barrier
or that 5:ac is a load and 6:ac is a dependent store (all occurring on the
same CPU).


> And in a cache-coherent system, there must be. Or, more precisely,
> there must not be different sequences of loads that indicate inconsistent
> orderings of stores to a given single variable. If the system can
> prove that there are no concurrent loads during a given period of
> time, I guess it would be within its rights to ditch cache coherence
> for that variable during that time...

What about indirect indications of inconsistency? See my example below.

> > (BTW, can you explain the difference between "cache coherent" and "cache
> > consistent"? I never quite got it straight...)
>
> "Cache coherent" is the preferred term, though "cache consistent" is
> sometimes used as a synonym. If you want to be painfully correct, you
> would say "cache coherent" when talking about stores to a single variable,
> and "memory consistency model" when talking about ordering of accesses
> to multiple variables.

Hmmm. Then what about "DMA coherent" vs. "DMA consistent"?


> > The analogy breaks down for pairs of stores. If stores are blind then
> > they can't see other stores -- but we need them to.
>
> I would instead say that you need to execute some loads in order to be
> able to see the effects of your pairs of stores.

Consider this example:

CPU 0                   CPU 1
-----                   -----
A = 1;                  B = 2;
mb();                   mb();
B = 1;                  X = A + 1;
...
assert(!(B==2 && X==1));

The assertion cannot fail. But to prove it in our formalism requires
writing st_0(B=1) -> st_1(B=2). In other words, CPU 1's store to B sees
(i.e., overwrites) CPU 0's store to B.
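
As a rough sketch of how one might probe this empirically with POSIX
threads (assuming GCC's __sync_synchronize() builtin as mb(); such a
test can only ever exhibit a violation, never prove its absence):

        #include <pthread.h>
        #include <stdio.h>

        static volatile int A, B, X;

        static void *cpu0(void *unused)
        {
                A = 1;
                __sync_synchronize();           /* mb() */
                B = 1;
                return NULL;
        }

        static void *cpu1(void *unused)
        {
                B = 2;
                __sync_synchronize();           /* mb() */
                X = A + 1;
                return NULL;
        }

        int main(void)
        {
                long i;

                /* error checking omitted for brevity */
                for (i = 0; i < 10000000; i++) {
                        pthread_t t0, t1;

                        A = B = X = 0;
                        pthread_create(&t0, NULL, cpu0, NULL);
                        pthread_create(&t1, NULL, cpu1, NULL);
                        pthread_join(t0, NULL);
                        pthread_join(t1, NULL);
                        if (B == 2 && X == 1)   /* the forbidden outcome */
                                printf("violation at iteration %ld\n", i);
                }
                return 0;
        }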

Alan

2006-10-19 22:45:41

by Paul E. McKenney

[permalink] [raw]
Subject: Re: Uses for memory barriers

On Thu, Oct 19, 2006 at 04:55:16PM -0400, Alan Stern wrote:
> On Thu, 19 Oct 2006, Paul E. McKenney wrote:
>
> > > I see no reason to think the control dependency in CPU 1's assignment to B
> > > is any weaker than a memory barrier.
> >
> > I am assuming that you have in mind a special restricted memory barrier
> > that applies only to the load of A, and not necessarily to any other
> > preceding operations. Otherwise, these two code sequences would be
> > equivalent, and they are not (as usual, all variables initially zero):
> >
> > CPU 0, Sequence 1       CPU 0, Sequence 2       CPU 1
> >
> > A=1                     A=1                     while (C==0);
> > while (B==0);           while (B==0);           smp_mb();
> > C=1                     smp_mb();               assert(A==1);
> >                         C=1
> >
> > In sequence 1, CPU 1's assertion can fail. Not so with sequence 2.
>
> Yes, that's a very good point. Indeed, I meant a restricted memory
> barrier applying only to the two accesses involved. In the same sort of
> way rmb() is a restricted memory barrier, applying only to pairs of
> loads.

OK.

> > Regardless of your definition of your posited memory barrier corresponding
> > to the control dependency, a counter example:
> >
> > CPU 1                   CPU 2
> >
> > A=1;
> > ...
> > while (A==0);           while (B==0);
> > B=1                     smp_mb()
> >                         assert(A==1) <fails>
> >
> > Here, placing an smp_mb() after the "while (A==0)" does make a difference.
> >
> > Degenerate, perhaps, given that the same CPU is assigning and while-ing,
> > but so it goes.
>
> The smp_mb() does make a difference. But it doesn't invalidate my notion
> of a dependency acting as a restricted memory barrier. The notion allows
> you to conclude from this example only that ld_1(A) >v ld_2(A), which is
> meaningless (using your convention for >v). It doesn't allow you to
> conclude st_1(A) >v ld_2(A).

Yes, assuming that control dependencies result in your restricted memory
barrier.

> > Even assuming a special restricted memory barrier, the example of DEC
> > Alpha and pointer dereferencing gives me pause. Feel free to berate
> > me for this, as you have done in the past. ;-)
>
> Ah, interesting comment. With the Alpha and pointer dereferencing, the
> problems arise because of failure to respect a data dependency between two
> loads. Here I am talking about a dependency between a load and a
> subsequent store, so it isn't the same thing at all. Failure to respect
> this kind of dependency would mean the CPU was writing a value before it
> knew what value to write (or whether to write it, or where to write it).
> Not even the most aggressively speculative machine will do that!

http://www.tinker.ncsu.edu/techreports/vssepic.pdf

Not exactly the same thing, but certainly a very similar level of
speculative aggression!

> > Seriously, my judgement of this would depend on exactly what part of
> > the smp_mb() semantics you are claiming for the control dependency.
> > I do not believe that we could make progress without appealing to a
> > specific implementation, so I would rather ignore control dependencies,
> > at least for non-MMIO accesses. MMIO would be another story altogether.
>
> What I'm claiming is exactly what was written in an earlier email:
>
> st(A) < st(B) >v ac(B) < ac(A) implies st(A) >v ac(A), and
>
> ld(A) < st(B) >v ac(B) < st(A) implies st(A) !>v ld(A).
>
> Here I'm using your convention for >v, and < indicates either an explicit
> barrier between two accesses or a dependency between a load and a later
> store.

Your notion of control-dependency barriers makes sense in an intuitive
sense. Does Linux rely on it, other than for MMIO accesses?

> > > "Sequentially precedes" means that the system behaves as though there were
> > > a memory barrier between the two accesses.
> >
> > OK. As noted above, if I were to interpret "a memory barrier" as really
> > being everything entailed by smp_mb(), I disagree with your statement in an
> > earlier email stating:
> >
> > Similarly, in the program "if (A) B = 2;" the load(A) sequentially
> > precedes the store(B) -- thanks to the dependency or (if you
> > prefer) the absence of speculative stores.
> >
> > However, I don't believe that is what you mean by "a memory barrier" in
> > this case -- my guess again is that you mean a special memory barrier that
> > applies only to the load of A in one direction, but that applies to
> > everything following the load in the other direction.
>
> It applies to the load of A in one direction and to all later stores in
> the other direction. Not to later loads.

Ah, good point -- I didn't pick up on the fact that it needn't constrain
later loads.

> > I would use ">p" for the program-order relationship, and probably something
> > like ">b" for the memory-barrier relationship. There are other orderings,
> > including the control-flow ordering discussed earlier, data dependencies,
> > and so on.
>
> > The literature is quite inconsistent. The DEC Alpha manual takes your
> > approach, while Gharachorloo's dissertation takes my approach. Not to
> > be outdone, Steinke and Nutt's JACM paper (written long after the other
> > two) uses different directions for different types of orderings!!!
> > See http://arxiv.org/PS_cache/cs/pdf/0208/0208027.pdf, page 49,
> > Definitions A.5, A.6 on the one hand and Definition A.7 on the other. ;-)
>
> > This is the connotation conflict. For you, it is confusing to write
> > "A > B" when A precedes B. For me, it is confusing to write "st < ld"
> > when data flows from the "st" to the "ld". So, the only way to resolve
> > this is to avoid use of ">" like the plague!
>
> Okay, let's change the notation. I don't like <v very much. Let's not
> worry about potential confusion with implication signs, and use
>
> 1:st(A) -> 2:st(A)

Would "=>" work, or does that conflict with something else?

And the number before the colon is the CPU #, right?

> to indicate that 1:st occurs earlier than 2:st in the global ordering of
> all stores to A. And let's use
>
> 3:st(B) -> 4:ld(B)
>
> to mean that 4:ld returned the value either of 3:st or of some other store
> to B occurring later in the global ordering of all such stores.

OK... Though expressing your English description formally is a bit messy,
it does capture a very useful idiom.

> Lastly, let's use
>
> 5:ac(A) +> 6:ac(B)
>
> to indicate either that the two accesses are separated by a memory barrier
> or that 5:ac is a load and 6:ac is a dependent store (all occurring on the
> same CPU).

So the number preceding the colon is the value being loaded or stored?

Either way, the symbols seem reasonable. In a PDF, I would probably
set a symbol indicating the type of flow over a hollow arrow or something.

> > And in a cache-coherent system, there must be. Or, more precisely,
> > there must not be different sequences of loads that indicate inconsistent
> > orderings of stores to a given single variable. If the system can
> > prove that there are no concurrent loads during a given period of
> > time, I guess it would be within its rights to ditch cache coherence
> > for that variable during that time...
>
> What about indirect indications of inconsistency? See my example below.

I have some questions about that one.

> > > (BTW, can you explain the difference between "cache coherent" and "cache
> > > consistent"? I never quite got it straight...)
> >
> > "Cache coherent" is the preferred term, though "cache consistent" is
> > sometimes used as a synonym. If you want to be painfully correct, you
> > would say "cache coherent" when talking about stores to a single variable,
> > and "memory consistency model" when talking about ordering of accesses
> > to multiple variables.
>
> Hmmm. Then what about "DMA coherent" vs. "DMA consistent"?

No idea. Having worked with systems where DMA did not play particularly
nicely with the cache-coherence protocol, they both sound like good things,
though. ;-)

As near as I can tell by looking around, they are synonyms or nearly so.

> > > The analogy breaks down for pairs of stores. If stores are blind then
> > > they can't see other stores -- but we need them to.
> >
> > I would instead say that you need to execute some loads in order to be
> > able to see the effects of your pairs of stores.
>
> Consider this example:
>
> CPU 0                   CPU 1
> -----                   -----
> A = 1;                  B = 2;
> mb();                   mb();
> B = 1;                  X = A + 1;
> ...
> assert(!(B==2 && X==1));
>
> The assertion cannot fail. But to prove it in our formalism requires
> writing st_0(B=1) -> st_1(B=2). In other words, CPU 1's store to B sees
> (i.e., overwrites) CPU 0's store to B.

Alternatively, we could use a notation that states that a given load gets
exactly the value from a given store, for example "st ==> ld" as opposed
to "st => ld", where there might be an intervening store.

(1) B==2 -> st_1(B=2) ==> ld_0(B==2)

Because there is only one store of 2 into B.

(2) But st_0(B=1) =p> ld_0(B) -> st_0(B=1) => ld_0(B)

Here I use "=p>" to indicate program order, and rely on the
fact that a CPU must see its own accesses in order.

(3) (1) and (2) imply st_0(B=1) => st_1(B=2) ==> ld_0(B==2)

So, yes, we do end up saying something about the order of the
stores, but only indirectly, based on other observations -- in
this case, program order and direct value sequence. In other
words, we can sometimes say things about the order of stores
even though stores are blind.

(4) By memory-barrier implication:

(a) st_0(A=1) +> st_0(B=1) &&

(b) st_1(B=2) +> ld_1(A) &&

(c) st_0(B=1) => st_1(B=2)

-> st_0(A=1) => ld_1(A)

(5) Since there is only one store to A: st_0(A=1) ==> ld_1(A==1)

(6) Therefore, X==2 and the assertion cannot fail if B==2. But
if the assertion fails, it must be true that B==2, so the
assertion cannot fail.

Is that more or less what you had in mind?

Thanx, Paul

2006-10-20 16:54:56

by Alan Stern

[permalink] [raw]
Subject: Re: Uses for memory barriers

On Thu, 19 Oct 2006, Paul E. McKenney wrote:

> Your notion of control-dependency barriers makes sense in an intuitive
> sense. Does Linux rely on it, other than for MMIO accesses?

I'm pretty sure that somewhere I saw some code which looked like it needed
a memory barrier but didn't have one. I decided it was okay because of
exactly this sort of control dependency. No way to know if it was
deliberate or merely an oversight. Don't ask me where that was; I can't
remember.


> > Okay, let's change the notation. I don't like <v very much. Let's not
> > worry about potential confusion with implication signs, and use
> >
> > 1:st(A) -> 2:st(A)
>
> Would "=>" work, or does that conflict with something else?

It's okay with me.

> And the number before the colon is the CPU #, right?

No, it's an instance label (kind of like a line number). As we've been
doing before. The CPU number is still a subscript.

> > to indicate that 1:st occurs earlier than 2:st in the global ordering of
> > all stores to A. And let's use
> >
> > 3:st(B) -> 4:ld(B)
> >
> > to mean that 4:ld returned the value either of 3:st or of some other store
> > > to B occurring later in the global ordering of all such stores.
>
> OK... Though expressing your English description formally is a bit messy,
> it does capture a very useful idiom.
>
> > Lastly, let's use
> >
> > 5:ac(A) +> 6:ac(B)
> >
> > to indicate either that the two accesses are separated by a memory barrier
> > or that 5:ac is a load and 6:ac is a dependent store (all occurring on the
> > same CPU).
>
> So the number preceding the colon is the value being loaded or stored?

No, that value goes inside the parentheses.

> Either way, the symbols seem reasonable. In a PDF, I would probably
> set a symbol indicating the type of flow over a hollow arrow or something.


> > Hmmm. Then what about "DMA coherent" vs. "DMA consistent"?
>
> No idea. Having worked with systems where DMA did not play particularly
> nicely with the cache-coherence protocol, they both sound like good things,
> though. ;-)
>
> As near as I can tell by looking around, they are synonyms or nearly so.

I once saw someone draw a distinction between them, but I didn't
understand it at the time and now it has faded away.


> > Consider this example:
> >
> > CPU 0                   CPU 1
> > -----                   -----
> > A = 1;                  B = 2;
> > mb();                   mb();
> > B = 1;                  X = A + 1;
> > ...
> > assert(!(B==2 && X==1));
> >
> > The assertion cannot fail. But to prove it in our formalism requires
> > writing st_0(B=1) -> st_1(B=2). In other words, CPU 1's store to B sees
> > (i.e., overwrites) CPU 0's store to B.

I wrote that the assertion cannot fail, but I'm not so sure any more.
Couldn't there be a system where CPU 1's statements run before any of CPU
0's writes become visible to CPU 1, but nevertheless the caching hardware
decides that CPU 1's write to B "wins"? More on this below...

> Alternatively, we could use a notation that states that a given load gets
> exactly the value from a given store, for example "st ==> ld" as opposed
> to "st => ld", where there might be an intervening store.

That's a useful concept.

> (1) B==2 -> st_1(B=2) ==> ld_0(B==2)
>
> Because there is only one store of 2 into B.
>
> (2) But st_0(B=1) =p> ld_0(B) -> st_0(B=1) => ld_0(B)
>
> Here I use "=p>" to indicate program order, and rely on the
> fact that a CPU must see its own accesses in order.

Right.

> (3) (1) and (2) imply st_0(B=1) => st_1(B=2) ==> ld_0(B==2)

Yes. st_0(B=1) => ld_0(B) means that the load must return the value
either of the st_0 or of some later store. Since the value returned was
2, we know that the store on CPU 1 occurred later than the store on
CPU 0.

(By "later" I mean at a higher position according to the global ordering
of all stores to B. There doesn't seem to be any good way to express this
without using time-related words.)

> So, yes, we do end up saying something about the order of the
> stores, but only indirectly, based on other observations -- in
> this case, program order and direct value sequence. In other
> words, we can sometimes say things about the order of stores
> even though stores are blind.

Which was my original point.

> (4) By memory-barrier implication:
>
> (a) st_0(A=1) +> st_0(B=1) &&
>
> (b) st_1(B=2) +> ld_1(A) &&
>
> (c) st_0(B=1) => st_1(B=2)
>
> -> st_0(A=1) => ld_1(A)
>
> (5) Since there is only one store to A: st_0(A=1) ==> ld_1(A==1)
>
> (6) Therefore, X==2 and the assertion cannot fail if B==2. But
> if the assertion fails, it must be true that B==2, so the
> assertion cannot fail.
>
> Is that more or less what you had in mind?

Yes it was.

But now I wonder... This approach might require too much of memory
barriers. If we have

st(A); mb(); st(B);

does the memory barrier really say anything about CPUs which never see one
or both of those stores? Or does it say only that CPUs which see both
stores will see them in the correct order?

The only way to be certain a CPU sees a store is if it does a load which
returns the value of that store. In the absence of such a criterion we
don't know what really happened. For example:

CPU 0                   CPU 1                   CPU 2
-----                   -----                   -----
A = 1;                  while (B < 1) ;         while (B < 2) ;
mb();                   mb();                   mb();
B = 1;                  B = 2;                  assert(A==1);

I put memory barriers in for CPUs 1 and 2 even though they shouldn't be
necessary. Now look what we have:

st_0(A=1) +> st_0(B=1) ==> ld_1(B) =p>

st_1(B=2) ==> ld_2(B) +> ld_2(A).

From this we can deduce that st_0(B=1) => st_1(B=2). For if not then
st_1(B=2) => st_0(B=1) => ld_1(B) and hence st_1(B) => ld_1(B),
contradicting the requirement that CPU 1 sees its own accesses in order.

But now we can go on to deduce st_0(B=1) => ld_2(B) (not ==>), and hence
st_0(A=1) => ld_2(A). Since there are no other writes to A, we get
st_0(A=1) ==> ld_2(A), meaning that the assertion must hold.

But we know that the assertion may fail on some systems! The reason being
that CPU 0's stores might propagate slowly to CPU 2 but the paths from
CPU 0 to CPU 1 and from CPU 1 to CPU 2 might be quick. Hence CPU 0's
write to A might not be visible to CPU 2 when the assertion runs, and
CPU 0's write to B might never be visible to CPU 2.

This wouldn't violate the guarantee made by CPU 0's mb(). The guarantee
wouldn't apply to CPU 2, since CPU 2 never sees the "B = 1".

To resolve this we need an additional notion, of a store to some variable
being visible to a subsequent access of that variable. The mere fact that
one store precedes another in the global ordering isn't enough to make it
visible.

How would you straighten this out?

Alan

2006-10-21 00:58:05

by Paul E. McKenney

[permalink] [raw]
Subject: Re: Uses for memory barriers

On Fri, Oct 20, 2006 at 12:54:52PM -0400, Alan Stern wrote:
> On Thu, 19 Oct 2006, Paul E. McKenney wrote:
>
> > Your notion of control-dependency barriers makes sense in an intuitive
> > sense. Does Linux rely on it, other than for MMIO accesses?
>
> I'm pretty sure that somewhere I saw some code which looked like it needed
> a memory barrier but didn't have one. I decided it was okay because of
> exactly this sort of control dependency. No way to know if it was
> deliberate or merely an oversight. Don't ask me where that was; I can't
> remember.

There are a number of cases where this sort of code structure is OK,
even without assuming that control dependencies have any memory-ordering
consequences. But they do have to be taken case by case...

> > > Okay, let's change the notation. I don't like <v very much. Let's not
> > > worry about potential confusion with implication signs, and use
> > >
> > > 1:st(A) -> 2:st(A)
> >
> > Would "=>" work, or does that conflict with something else?
>
> It's okay with me.
>
> > And the number before the colon is the CPU #, right?
>
> No, it's an instance label (kind of like a line number). As we've been
> doing before. The CPU number is still a subscript.

When I started actually doing formulas, the use of ":" conflicted with
the logic-expression operator ("for every x there exists y and z such
that...", which turns into upside-down-A x backwards-E y,z colon ...,
at least if I am remembering all this correctly).

So how about having the line number be a second subscript following the
CPU number?

> > > to indicate that 1:st occurs earlier than 2:st in the global ordering of
> > > all stores to A. And let's use
> > >
> > > 3:st(B) -> 4:ld(B)
> > >
> > > to mean that 4:ld returned the value either of 3:st or of some other store
> > > to B occurring later in the global ordering of all such stores.
> >
> > OK... Though expressing your English description formally is a bit messy,
> > it does capture a very useful idiom.
> >
> > > Lastly, let's use
> > >
> > > 5:ac(A) +> 6:ac(B)
> > >
> > > to indicate either that the two accesses are separated by a memory barrier
> > > or that 5:ac is a load and 6:ac is a dependent store (all occurring on the
> > > same CPU).
> >
> > So the number preceding the colon is the value being loaded or stored?
>
> No, that value goes inside the parentheses.

Got it...

> > Either way, the symbols seem reasonable. In a PDF, I would probably
> > set a symbol indicating the type of flow over a hollow arrow or something.
>
>
> > > Hmmm. Then what about "DMA coherent" vs. "DMA consistent"?
> >
> > No idea. Having worked with systems where DMA did not play particularly
> > nicely with the cache-coherence protocol, they both sound like good things,
> > though. ;-)
> >
> > As near as I can tell by looking around, they are synonyms or nearly so.
>
> I once saw someone draw a distinction between them, but I didn't
> understand it at the time and now it has faded away.

I do remember some documentation mentioning this, will track it down.

> > > Consider this example:
> > >
> > > CPU 0                   CPU 1
> > > -----                   -----
> > > A = 1;                  B = 2;
> > > mb();                   mb();
> > > B = 1;                  X = A + 1;
> > > ...
> > > assert(!(B==2 && X==1));
> > >
> > > The assertion cannot fail. But to prove it in our formalism requires
> > > writing st_0(B=1) -> st_1(B=2). In other words, CPU 1's store to B sees
> > > (i.e., overwrites) CPU 0's store to B.
>
> I wrote that the assertion cannot fail, but I'm not so sure any more.
> Couldn't there be a system where CPU 1's statements run before any of CPU
> 0's writes become visible to CPU 1, but nevertheless the caching hardware
> decides that CPU 1's write to B "wins"? More on this below...

The memory barriers require that if the caching hardware decides that
CPU 1's B=2 "wins", then X=A+1 must see CPU 0's A=1. Now the assertion
could legally see B==2&&X==0, or B==1&&X={0,1,2}, but it cannot legally
see B==2&&X==1 because that would violate the constraints that the
memory barriers are supposed to be enforcing.

> > Alternatively, we could use a notation that states that a given load gets
> > exactly the value from a given store, for example "st ==> ld" as opposed
> > to "st => ld", where there might be an intervening store.
>
> That's a useful concept.

Especially when defining cache coherence -- need to be able to say that
only stores can change the value, and that any loads between a pair of
stores see the value from the first of the two stores. This can be done
without ==>, but it is more elaborate.

> > (1) B==2 -> st_1(B=2) ==> ld_0(B==2)
> >
> > Because there is only one store of 2 into B.
> >
> > (2) But st_0(B=1) =p> ld_0(B) -> st_0(B=1) => ld_0(B)
> >
> > Here I use "=p>" to indicate program order, and rely on the
> > fact that a CPU must see its own accesses in order.
>
> Right.
>
> > (3) (1) and (2) imply st_0(B=1) => st_1(B=2) ==> ld_0(B==2)
>
> Yes. st_0(B=1) => ld_0(B) means that the load must return the value
> either of the st_0 or of some later store. Since the value returned was
> 2, we know that the store on CPU 1 occurred later than the store on
> CPU 0.
>
> (By "later" I mean at a higher position according to the global ordering
> of all stores to B. There doesn't seem to be any good way to express this
> without using time-related words.)

One could say "the store on CPU 1 overwrote that of CPU 0", but this
admittedly gets cumbersome if there is a long sequence of stores.

> > So, yes, we do end up saying something about the order of the
> > stores, but only indirectly, based on other observations -- in
> > this case, program order and direct value sequence. In other
> > words, we can sometimes say things about the order of stores
> > even though stores are blind.
>
> Which was my original point.

OK. I misunderstood you.

> > (4) By memory-barrier implication:
> >
> > (a) st_0(A=1) +> st_0(B=1) &&
> >
> > (b) st_1(B=2) +> ld_1(A) &&
> >
> > (c) st_0(B=1) => st_1(B=2)
> >
> > -> st_0(A=1) => ld_1(A)
> >
> > (5) Since there is only one store to A: st_0(A=1) ==> ld_1(A==1)
> >
> > (6) Therefore, X==2 and the assertion cannot fail if B==2. But
> > if the assertion fails, it must be true that B==2, so the
> > assertion cannot fail.
> >
> > Is that more or less what you had in mind?
>
> Yes it was.
>
> But now I wonder... This approach might require too much of memory
> barriers. If we have
>
> st(A); mb(); st(B);
>
> does the memory barrier really say anything about CPUs which never see one
> or both of those stores? Or does it say only that CPUs which see both
> stores will see them in the correct order?

Only that CPUs that see both stores with appropriate use of memory barriers
will see them in the correct order.

> The only way to be certain a CPU sees a store is if it does a load which
> returns the value of that store. In the absence of such a criterion we
> don't know what really happened. For example:
>
> CPU 0                   CPU 1                   CPU 2
> -----                   -----                   -----
> A = 1;                  while (B < 1) ;         while (B < 2) ;
> mb();                   mb();                   mb();
> B = 1;                  B = 2;                  assert(A==1);
>
> I put memory barriers in for CPUs 1 and 2 even though they shouldn't be
> necessary. Now look what we have:
>
> st_0(A=1) +> st_0(B=1) ==> ld_1(B) =p>
>
> st_1(B=2) ==> ld_2(B) +> ld_2(A).
>
> From this we can deduce that st_0(B=1) => st_1(B=2). For if not then
> st_1(B=2) => st_0(B=1) => ld_1(B) and hence st_1(B) => ld_1(B),
> contradicting the requirement that CPU 1 sees its own accesses in order.
>
> But now we can go on to deduce st_0(B=1) => ld_2(B) (not ==>), and hence
> st_0(A=1) => ld_2(A). Since there are no other writes to A, we get
> st_0(A=1) ==> ld_2(A), meaning that the assertion must hold.

Yep, which it must for locking to work, assuming my approach to transitivity.
If we were using your approach to transitivity, we would have a different
memory-barrier condition that required an atomic operation as the final
link in a chain of changes to a given variable -- and in that case, we
would find that the assertion could fail.

> But we know that the assertion may fail on some systems!

To be determined -- the fact that we are threading the single variable B
through all three CPUs potentially makes a difference. Working on the
empirical end. I know the assert() always would succeed on POWER, and
empirical evidence agrees (fine-grained low-access-overhead synchronized
time-base registers make this easier!). I believe that the assert would
also succeed on Alpha.

> The reason being
> that CPU 0's stores might propagate slowly to CPU 2 but the paths from
> CPU 0 to CPU 1 and from CPU 1 to CPU 2 might be quick. Hence CPU 0's
> write to A might not be visible to CPU 2 when the assertion runs, and
> CPU 0's write to B might never be visible to CPU 2.

Agreed -- if the assertion really is guaranteed to work, it is because
the cache-coherence protocol is required to explicitly make it be so.
One reason that I am not too pessimistic is that there are some popular
textbook algorithms that cannot work otherwise (e.g., the various locking
algorithms that don't use atomic instructions, possibly also some of the
NBS algorithms).

> This wouldn't violate the guarantee made by CPU 0's mb(). The guarantee
> wouldn't apply to CPU 2, since CPU 2 never sees the "B = 1".

Again, this depends on the form of the memory-barrier guarantee --
the presence of a "==>" vs. a "=>" is quite important! ;-)

> To resolve this we need an additional notion, of a store to some variable
> being visible to a subsequent access of that variable. The mere fact that
> one store precedes another in the global ordering isn't enough to make it
> visible.
>
> How would you straighten this out?

My hope is that the CPUs are guaranteed to act the way that I hope they
do, so that the memory-barrier rule is given by:

st(A) -p> wmb -p> st(B) && ld(B) -p> rmb -p> ld(A) && st(B) => ld(B) ->
st(A) => ld(A)

If so, everything works nicely. In your scenario, the memory-barrier
rule would instead be something like:

st(A) -p> wmb -p> st(B) && ld(B) -p> rmb -p> ld(A) && st(B) ==> ld(B) ->
        st(A) => ld(A)

st_0(A) -p> wmb -p> st(B) && ld_1(B) -p> rmb -p> ld(A) && st_0(B) => ld_1(B) &&
        not-exist i: st_0(B) => st_i(B) => ld_1(B) ->
        st(A) => ld(A)

st(A) -p> wmb -p> st(B) && ld(B) -p> rmb -p> ld(A) && st(B) => at(B) ->
        st(A) => ld(A)

In other words, either: (1) the st(B) must immediately precede the ld(B), or
(2) there can only be loads between the st(B) and ld(B) (yes, I would have
to add other terms to exclude atomics and so on...), or (3) the access to A
by CPU 1 must be an atomic operation.

Given this set of rules, the assertion could fail. However, if the
"while (B < 2)" were replaced by (or followed by) an atomic operation
on B, as one would have with a lock, then the assertion would always
succeed.
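
For instance (a sketch using GCC's __sync_val_compare_and_swap() builtin
to stand in for the atomic operation a lock acquisition would perform):

CPU 2
-----
while (__sync_val_compare_and_swap(&B, 2, 2) != 2) ;
smp_mb();
assert(A==1);

The no-op compare-and-swap is an atomic access to B, supplying the at(B)
link required by the third rule above.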

Does this make sense?

Thanx, Paul

2006-10-21 19:47:34

by Alan Stern

[permalink] [raw]
Subject: Re: Uses for memory barriers

On Fri, 20 Oct 2006, Paul E. McKenney wrote:

> > > And the number before the colon is the CPU #, right?
> >
> > No, it's an instance label (kind of like a line number). As we've been
> > doing before. The CPU number is still a subscript.
>
> When I started actually doing formulas, the use of ":" conflicted with
> the logic-expression operator ("for every x there exists y and z such
> that...", which turns into upside-down-A x backwards-E y,z colon ...,
> at least if I am remembering all this correctly).
>
> So how about having the line number be a second subscript following the
> CPU number?

Make it a second subscript following the close paren. That way either
subscript can be omitted without confusion.

> > > > Consider this example:
> > > >
> > > > CPU 0                   CPU 1
> > > > -----                   -----
> > > > A = 1;                  B = 2;
> > > > mb();                   mb();
> > > > B = 1;                  X = A + 1;
> > > > ...
> > > > assert(!(B==2 && X==1));
> > > >
> > > > The assertion cannot fail. But to prove it in our formalism requires
> > > > writing st_0(B=1) -> st_1(B=2). In other words, CPU 1's store to B sees
> > > > (i.e., overwrites) CPU 0's store to B.
> >
> > I wrote that the assertion cannot fail, but I'm not so sure any more.
> > Couldn't there be a system where CPU 1's statements run before any of CPU
> > 0's writes become visible to CPU 1, but nevertheless the caching hardware
> > decides that CPU 1's write to B "wins"? More on this below...
>
> The memory barriers require that if the caching hardware decides that
> CPU 1's B=2 "wins", then X=A+1 must see CPU 0's A=1. Now the assertion
> could legally see B==2&&X==0, or B==1&&X={0,1,2}, but it cannot legally
> see B==2&&X==1 because that would violate the constraints that the
> memory barriers are supposed to be enforcing.

I won't say that you're wrong. But how do you know for sure? That is,
exactly which constraint does it violate and where is it written in stone
that the memory barriers are supposed to enforce this constraint? After
all, this doesn't fit the usual st-wmb-st / ld-rmb-ld pattern.


> > But now I wonder... This approach might require too much of memory
> > barriers. If we have
> >
> > st(A); mb(); st(B);
> >
> > does the memory barrier really say anything about CPUs which never see one
> > or both of those stores? Or does it say only that CPUs which see both
> > stores will see them in the correct order?
>
> Only that CPUs that see both stores with appropriate use of memory barriers
> will see them in the correct order.

Right -- that's what I meant.

> > The only way to be certain a CPU sees a store is if it does a load which
> > returns the value of that store. In the absence of such a criterion we
> > don't know what really happened. For example:
> >
> > CPU 0                   CPU 1                   CPU 2
> > -----                   -----                   -----
> > A = 1;                  while (B < 1) ;         while (B < 2) ;
> > mb();                   mb();                   mb();
> > B = 1;                  B = 2;                  assert(A==1);

> Yep, which it must for locking to work, assuming my approach to transitivity.
> If we were using your approach to transitivity, we would have a different
> memory-barrier condition that required an atomic operation as the final
> link in a chain of changes to a given variable -- and in that case, we
> would find that the assertion could fail.
>
> > But we know that the assertion may fail on some systems!
>
> To be determined -- the fact that we are threading the single variable B
> through all three CPUs potentially makes a difference. Working on the
> empirical end. I know the assert() always would succeed on POWER, and
> empirical evidence agrees (fine-grained low-access-overhead synchronized
> time-base registers make this easier!). I believe that the assert would
> also succeed on Alpha.

I overstated the case. We don't know of any actual architectures where
the assertion could fail, but we have considered a possible architecture
where it might. What about NUMA machines?

> > The reason being
> > that CPU 0's stores might propagate slowly to CPU 2 but the paths from
> > CPU 0 to CPU 1 and from CPU 1 to CPU 2 might be quick. Hence CPU 0's
> > write to A might not be visible to CPU 2 when the assertion runs, and
> > CPU 0's write to B might never be visible to CPU 2.
>
> Agreed -- if the assertion really is guaranteed to work, it is because
> the cache-coherence protocol is required to explicitly make it be so.
> One reason that I am not too pessimistic is that there are some popular
> textbook algorithms that cannot work otherwise (e.g., the various locking
> algorithms that don't use atomic instructions, possibly also some of the
> NBS algorithms).
>
> > This wouldn't violate the guarantee made by CPU 0's mb(). The guarantee
> > wouldn't apply to CPU 2, since CPU 2 never sees the "B = 1".
>
> Again, this depends on the form of the memory-barrier guarantee --
> the presence of a "==>" vs. a "=>" is quite important! ;-)

Just so.

> My hope is that the CPUs are guaranteed to act the way that I hope they
> do, so that the memory-barrier rule is given by:
>
> st(A) -p> wmb -p> st(B) && ld(B) -p> rmb -p> ld(A) && st(B) => ld(B) ->
> st(A) => ld(A)
>
> If so, everything works nicely. In your scenario, the memory-barrier
> rule would instead be something like:
>
> st(A) -p> wmb -p> st(B) && ld(B) -p> rmb -p> ld(A) && st(B) ==> ld(B) ->
> st(A) => ld(A)
> st_0(A) -p> wmb -p> st_0(B) && ld_1(B) -p> rmb -p> ld_1(A) && st_0(B) => ld_1(B) &&
> not-exist i: st_0(B) => st_i(B) => ld_1(B) ->
> st_0(A) => ld_1(A)

This is identical to the previous version, since by definition

st_i(B) ==> ld_j(B) is equivalent to st_i(B) => ld_j(B) &&
not exist k: st_i(B) => st_k(B) => ld_j(B).

> st(A) -p> wmb -p> st(B) && ld(B) -p> rmb -p> ld(A) && st(B) => at(B) ->
> st(A) => ld(A)
>
> In other words, either: (1) the st(B) must immediately precede the ld(B), or
> (2) there can only be loads between the st(B) and ld(B) (yes, I would have
> to add other terms to exclude atomics and so on...), or (3) the access to A
> by CPU 1 must be an atomic operation.

(1) Yes; it's the same as saying that ld(B) returns the value of the
st(B) (no intervening stores).

(2) doesn't make sense, since loads aren't part of the global ordering of
accesses of B -- they are invisible. (BTW, you don't need to assume as
well that stores are blind; it's enough just to have loads be invisible.)
Each load sees an initial sequence of stores ending in the store whose
value is returned by the load, but this doesn't mean that the load occurs
between that store and the next one.

(3) The assumption should be that both accesses of B are atomic; it
doesn't matter whether the accesses of A are.

> Does this make sense?

Yes. Maybe we should include these rules as an alternative set of "very
weak" memory ordering assumptions? For normal uses they shouldn't make
any difference.

Alan

2006-10-21 22:51:29

by Paul E. McKenney

[permalink] [raw]
Subject: Re: Uses for memory barriers

On Sat, Oct 21, 2006 at 03:47:31PM -0400, Alan Stern wrote:
> On Fri, 20 Oct 2006, Paul E. McKenney wrote:
>
> > > > And the number before the colon is the CPU #, right?
> > >
> > > No, it's an instance label (kind of like a line number). As we've been
> > > doing before. The CPU number is still a subscript.
> >
> > When I started actually doing formulas, the use of ":" conflicted with
> > the logic-expression operator ("for every x there exists y and z such
> > that...", which turns into upside-down-A x backwards-E y,z colon ...,
> > at least if I am remembering all this correctly).
> >
> > So how about having the line number be a second subscript following the
> > CPU number?
>
> Make it a second subscript following the close paren. That way either
> subscript can be omitted without confusion.

Or just use an unbound variable for the CPU if unspecified.

> > > > > Consider this example:
> > > > >
> > > > > CPU 0 CPU 1
> > > > > ----- -----
> > > > > A = 1; B = 2;
> > > > > mb(); mb();
> > > > > B = 1; X = A + 1;
> > > > > ...
> > > > > assert(!(B==2 && X==1));
> > > > >
> > > > > The assertion cannot fail. But to prove it in our formalism requires
> > > > > writing st_0(B=1) -> st_1(B=2). In other words, CPU 1's store to B sees
> > > > > (i.e., overwrites) CPU 0's store to B.
> > >
> > > I wrote that the assertion cannot fail, but I'm not so sure any more.
> > > Couldn't there be a system where CPU 1's statements run before any of CPU
> > > 0's writes become visible to CPU 1, but nevertheless the caching hardware
> > > decides that CPU 1's write to B "wins"? More on this below...
> >
> > The memory barriers require that if the caching hardware decides that
> > CPU 1's B=2 "wins", then X=A+1 must see CPU 0's A=1. Now the assertion
> > could legally see B==2&&X==0, or B==1&&X={0,1,2}, but it cannot legally
> > see B==2&&X==1 because that would violate the constraints that the
> > memory barriers are supposed to be enforcing.
>
> I won't say that you're wrong. But how do you know for sure? That is,
> exactly which constraint does it violate and where is it written in stone
> that the memory barriers are supposed to enforce this constraint? After
> all, this doesn't fit the usual st-wmb-st / ld-rmb-ld pattern.

I don't know for sure... Yet. Work in progress.

> > > But now I wonder... This approach might require too much of memory
> > > barriers. If we have
> > >
> > > st(A); mb(); st(B);
> > >
> > > does the memory barrier really say anything about CPUs which never see one
> > > or both of those stores? Or does it say only that CPUs which see both
> > > stores will see them in the correct order?
> >
> > Only that CPUs that see both stores with appropriate use of memory barriers
> > will see them in the correct order.
>
> Right -- that's what I meant.
>
> > > The only way to be certain a CPU sees a store is if it does a load which
> > > returns the value of that store. In the absence of such a criterion we
> > > don't know what really happened. For example:
> > >
> > > CPU 0 CPU 1 CPU 2
> > > ----- ----- -----
> > > A = 1; while (B < 1) ; while (B < 2) ;
> > > mb(); mb(); mb();
> > > B = 1; B = 2; assert(A==1);
>
> > Yep, which it must for locking to work, assuming my approach to transitivity.
> > If we were using your approach to transitivity, we would have a different
> > memory-barrier condition that required an atomic operation as the final
> > link in a chain of changes to a given variable -- and in that case, we
> > would find that the assertion could fail.
> >
> > > But we know that the assertion may fail on some systems!
> >
> > To be determined -- the fact that we are threading the single variable B
> > through all three CPUs potentially makes a difference. Working on the
> > empirical end. I know the assert() always would succeed on POWER, and
> > empirical evidence agrees (fine-grained low-access-overhead synchronized
> > time-base registers make this easier!). I believe that the assert would
> > also succeed on Alpha.
>
> I overstated the case. We don't know of any actual architectures where
> the assertion could fail, but we have considered a possible architecture
> where it might. What about NUMA machines?

If it fails on a given system, that system would be unable to execute
some popular textbook algorithms. So, no, I am not certain, but I am
reasonably confident.
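
Dekker's and Peterson's algorithms are the classic examples: mutual
exclusion built from nothing but plain loads, plain stores, and full
barriers. A sketch of Peterson's lock for two threads (my own
rendering, with __sync_synchronize() standing in for mb()):

    static volatile int flag[2];    /* flag[i]: thread i wants the lock */
    static volatile int turn;       /* whose turn it is to back off */

    static void peterson_lock(int me)
    {
            int other = 1 - me;

            flag[me] = 1;
            turn = other;
            __sync_synchronize();   /* mb(): order the stores above
                                       before the loads below */
            while (flag[other] && turn == other)
                    ;               /* spin */
    }

    static void peterson_unlock(int me)
    {
            __sync_synchronize();   /* mb(): keep the critical section
                                       before the releasing store */
            flag[me] = 0;
    }

Note that the lock side genuinely needs a full mb(): it must order a
store (flag[me]=1) before a load (flag[other]), which neither wmb()
nor rmb() can provide. And if barrier ordering were not transitive
through chains like the one above, handing such a lock from CPU to CPU
could not work.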

> > > The reason being
> > > that CPU 0's stores might propagate slowly to CPU 2 but the paths from
> > > CPU 0 to CPU 1 and from CPU 1 to CPU 2 might be quick. Hence CPU 0's
> > > write to A might not be visible to CPU 2 when the assertion runs, and
> > > CPU 0's write to B might never be visible to CPU 2.
> >
> > Agreed -- if the assertion really is guaranteed to work, it is because
> > the cache-coherence protocol is required to explicitly make it be so.
> > One reason that I am not too pessimistic is that there are some popular
> > textbook algorithms that cannot work otherwise (e.g., the various locking
> > algorithms that don't use atomic instructions, possibly also some of the
> > NBS algorithms).
> >
> > > This wouldn't violate the guarantee made by CPU 0's mb(). The guarantee
> > > wouldn't apply to CPU 2, since CPU 2 never sees the "B = 1".
> >
> > Again, this depends on the form of the memory-barrier guarantee --
> > the presence of a "==>" vs. a "=>" is quite important! ;-)
>
> Just so.
>
> > My hope is that the CPUs are guaranteed to act the way that I hope they
> > do, so that the memory-barrier rule is given by:
> >
> > st(A) -p> wmb -p> st(B) && ld(B) -p> rmb -p> ld(A) && st(B) => ld(B) ->
> > st(A) => ld(A)
> >
> > If so, everything works nicely. In your scenario, the memory-barrier
> > rule would instead be something like:
> >
> > st(A) -p> wmb -p> st(B) && ld(B) -p> rmb -p> ld(A) && st(B) ==> ld(B) ->
> > st(A) => ld(A)
> > st_0(A) -p> wmb -p> st_0(B) && ld_1(B) -p> rmb -p> ld_1(A) && st_0(B) => ld_1(B) &&
> > not-exist i: st_0(B) => st_i(B) => ld_1(B) ->
> > st_0(A) => ld_1(A)
>
> This is identical to the previous version, since by definition
>
> st_i(B) ==> ld_j(B) is equivalent to st_i(B) => ld_j(B) &&
> not exist k: st_i(B) => st_k(B) => ld_j(B).

OK -- we were assuming slightly different definitions of "==>". I was
assuming that if st==>ld1==>ld2, then it is not the case that "st==>ld2".
In this circumstance, your definition is certainly more convenient than
is mine. In the case of MMIO, the situation might be reversed.
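
To spell out the difference: given st(B=1) followed by two loads ld1
and ld2 that both return 1, your definition gives both st==>ld1 and
st==>ld2, while mine gives st==>ld1==>ld2 but not st==>ld2, since each
link in the chain consumes the previous one.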

> > st(A) -p> wmb -p> st(B) && ld(B) -p> rmb -p> ld(A) && st(B) => at(B) ->
> > st(A) => ld(A)
> >
> > In other words, either: (1) the st(B) must immediately precede the ld(B), or
> > (2) there can only be loads between the st(B) and ld(B) (yes, I would have
> > to add other terms to exclude atomics and so on...), or (3) the access to A
> > by CPU 1 must be an atomic operation.
>
> (1) Yes; it's the same as saying that ld(B) returns the value of the
> st(B) (no intervening stores).

Good.

> (2) doesn't make sense, since loads aren't part of the global ordering of
> accesses of B -- they are invisible. (BTW, you don't need to assume as
> well that stores are blind; it's enough just to have loads be invisible.)
> Each load sees an initial sequence of stores ending in the store whose
> value is returned by the load, but this doesn't mean that the load occurs
> between that store and the next one.

That is due to our difference in definition. Perhaps the following
definition: "A==>B" means either that B sees the value stored by A
or that B sees the same value as does A?

Some work will be required to see what is best.

> (3) The assumption should be that both accesses of B are atomic; it
> doesn't matter whether the accesses of A are.

Check out the i386 default definition of spin_unlock() -- no atomic
operations. So only the final access of B (the one corresponding to
spin_lock()) would need to be atomic.
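
From memory, the 2.6-era i386 fast path looked roughly like this --
the unlock is nothing but a plain byte store (the "memory" clobber is
a compiler barrier, not a hardware one):

    static inline void i386_spin_unlock(volatile char *slock)
    {
            asm volatile("movb $1,%0" : "=m" (*slock) : : "memory");
    }

Only the spin_lock() side used a locked/atomic instruction.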

> > Does this make sense?
>
> Yes. Maybe we should include these rules as an alternative set of "very
> weak" memory ordering assumptions? For normal uses they shouldn't make
> any difference.

Let's see what the actual architectures really impose. I would prefer to
shave the Mandelbrot set a bit more closely than absolutely needed,
rather than vice versa. ;-)

Thanx, Paul

2006-10-22 02:18:41

by Alan Stern

[permalink] [raw]
Subject: Re: Uses for memory barriers

On Sat, 21 Oct 2006, Paul E. McKenney wrote:

> > This is identical to the previous version, since by definition
> >
> > st_i(B) ==> ld_j(B) is equivalent to st_i(B) => ld_j(B) &&
> > not exist k: st_i(B) => st_k(B) => ld_j(B).
>
> OK -- we were assuming slightly different definitions of "==>". I was
> assuming that if st==>ld1==>ld2, then it is not the case that "st==>ld2".
> In this circumstance, your definition is certainly more convenient than
> is mine. In the case of MMIO, the situation might be reversed.

MMIO of course is completely different. For regular memory accesses I
think we should never allow a load on the left side of "=>" or "==>".
Keep them invisible! :-)

Writing ld(A) => st(A) is bad because (1) it suggests that the store
somehow "sees" the load (which it doesn't; the load is invisible), and (2)
it suggests that the store occurs "later" in some sense than the load
(which might not be true, since a load doesn't necessarily return the
value of the temporally most recent store).

My viewpoint is that "=>" really provides an ordering of stores only.
Its use with loads is something of an artifact; it gives a convenient way
of expressing the fact that a load "sees" an initial segment of all the
stores to a variable (and the value it returns is that of the last store
in the segment).
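
For example, given st_0(A=1) => st_1(A=2) => st_2(A=3), a load
returning 2 has seen the initial segment {st_0, st_1}; we can usefully
write st_1(A) ==> ld(A), but nothing entitles us to place the load
globally before or after st_2.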

> > (2) doesn't make sense, since loads aren't part of the global ordering of
> > accesses of B -- they are invisible. (BTW, you don't need to assume as
> > well that stores are blind; it's enough just to have loads be invisible.)
> > Each load sees an initial sequence of stores ending in the store whose
> > value is returned by the load, but this doesn't mean that the load occurs
> > between that store and the next one.
>
> That is due to our difference in definition. Perhaps the following
> definition: "A==>B" means either that B sees the value stored by A
> or that B sees the same value as does A?
>
> Some work will be required to see what is best.

How about this instead: "A==>B" means that B sees the value stored by A,
and "A==B" means that A and B are both loads and they see the value from
the same store. That way we avoid putting a load on the left side of
"==>".


> > (3) The assumption should be that both accesses of B are atomic; it
> > doesn't matter whether the accesses of A are.
>
> Check out the i386 default definition of spin_unlock() -- no atomic
> operations. So only the final access of B (the one corresponding to
> spin_lock()) would need to be atomic.

You are right.

Alan

2006-10-23 05:31:19

by Paul E. McKenney

[permalink] [raw]
Subject: Re: Uses for memory barriers

On Sat, Oct 21, 2006 at 10:18:38PM -0400, Alan Stern wrote:
> On Sat, 21 Oct 2006, Paul E. McKenney wrote:
>
> > > This is identical to the previous version, since by definition
> > >
> > > st_i(B) ==> ld_j(B) is equivalent to st_i(B) => ld_j(B) &&
> > > not exist k: st_i(B) => st_k(B) => ld_j(B).
> >
> > OK -- we were assuming slightly different definitions of "==>". I was
> > assuming that if st==>ld1==>ld2, then it is not the case that "st==>ld2".
> > In this circumstance, your definition is certainly more convenient than
> > is mine. In the case of MMIO, the situation might be reversed.
>
> MMIO of course is completely different. For regular memory accesses I
> think we should never allow a load on the left side of "=>" or "==>".
> Keep them invisible! :-)
>
> Writing ld(A) => st(A) is bad because (1) it suggests that the store
> somehow "sees" the load (which it doesn't; the load is invisible), and (2)
> it suggests that the store occurs "later" in some sense than the load
> (which might not be true, since a load doesn't necessarily return the
> value of the temporally most recent store).

How about ld_i(A) => ld_j(A)? This would say that both loads corresponded
to the same store.

> My viewpoint is that "=>" really provides an ordering of stores only.
> Its use with loads is something of an artifact; it gives a convenient way
> of expressing the fact that a load "sees" an initial segment of all the
> stores to a variable (and the value it returns is that of the last store
> in the segment).

Seems reasonable at first glance, give or take comparing two loads.

> > > (2) doesn't make sense, since loads aren't part of the global ordering of
> > > accesses of B -- they are invisible. (BTW, you don't need to assume as
> > > well that stores are blind; it's enough just to have loads be invisible.)
> > > Each load sees an initial sequence of stores ending in the store whose
> > > value is returned by the load, but this doesn't mean that the load occurs
> > > between that store and the next one.
> >
> > That is due to our difference in definition. Perhaps the following
> > definition: "A==>B" means either that B sees the value stored by A
> > or that B sees the same value as does A?
> >
> > Some work will be required to see what is best.
>
> How about this instead: "A==>B" means that B sees the value stored by A,
> and "A==B" means that A and B are both loads and they see the value from
> the same store. That way we avoid putting a load on the left side of
> "==>".

My concern is that "==" might also have connotations of equal values from
distinct stores.

Thanx, Paul

2006-10-23 14:07:30

by Alan Stern

[permalink] [raw]
Subject: Re: Uses for memory barriers

On Sun, 22 Oct 2006, Paul E. McKenney wrote:

> How about ld_i(A) => ld_j(A)? This would say that both loads corresponded
> to the same store.

> > How about this instead: "A==>B" means that B sees the value stored by A,
> > and "A==B" means that A and B are both loads and they see the value from
> > the same store. That way we avoid putting a load on the left side of
> > "==>".
>
> My concern is that "==" might also have connotations of equal values from
> distinct stores.

Okay, here's another suggestion: ld_i(A) <=> ld_j(A). This avoids
connotations of ordering and indicates the symmetry of the relation: both
loads return data from the same store.
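
For instance, if st_0(A=1) => st_1(A=2) and two loads on different
CPUs both return 1, we would write ld_i(A) <=> ld_j(A) -- with no
implication about which load executed first, or about whether st_1 had
yet reached either CPU.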

Alan

2006-10-24 17:51:48

by Paul E. McKenney

[permalink] [raw]
Subject: Re: Uses for memory barriers

On Mon, Oct 23, 2006 at 10:07:28AM -0400, Alan Stern wrote:
> On Sun, 22 Oct 2006, Paul E. McKenney wrote:
>
> > How about ld_i(A) => ld_j(A)? This would say that both loads corresponded
> > to the same store.
>
> > > How about this instead: "A==>B" means that B sees the value stored by A,
> > > and "A==B" means that A and B are both loads and they see the value from
> > > the same store. That way we avoid putting a load on the left side of
> > > "==>".
> >
> > My concern is that "==" might also have connotations of equal values from
> > distinct stores.
>
> Okay, here's another suggestion: ld_i(A) <=> ld_j(A). This avoids
> connotations of ordering and indicates the symmetry of the relation: both
> loads return data from the same store.

Good point -- will try something like this.

Thanx, Paul