LinuxLists.cc - Re: [RFC] LKMM: Add volatile

2021-06-07 11:54:32

Subject: Re: [RFC] LKMM: Add volatile_if()

On Mon, Jun 07, 2021 at 12:43:01PM +0200, Peter Zijlstra wrote:
> On Sun, Jun 06, 2021 at 11:43:42AM -0700, Linus Torvalds wrote:
> > So while the example code is insane and pointless (and you shouldn't
> > read *too* much into it), conceptually the notion of that pattern of
> >
> > if (READ_ONCE(a)) {
> > WRITE_ONCE(b,1);
> > .. do something ..
> > } else {
> > WRITE_ONCE(b,1);
> > .. do something else ..
> > }
>
> This is actually more tricky than it would appear (isn't it always).
>
> The thing is, that normally we must avoid speculative stores, because
> they'll result in out-of-thin-air values.
>
> *Except* in this case, where both branches emit the same store, then
> it's a given that the store will happen and it will not be OOTA.
> Someone's actually done the proof for that apparently (Will, you have a
> reference to Jade's paper?)

I don't think there's a paper on this, but Jade and I are hoping to talk
about aspects of it at LPC (assuming the toolchain MC gets accepted).

> There's apparently also a competition going on who can build the
> weakestest ARM64 implementation ever.
>
> Combine the two, and you'll get a CPU that *will* emit the store early
> :/

So there are a lot of important details missing here and, as above, I think
this is something worth discussing at LPC with Jade. The rough summary is
that the arm64 memory model recently (so recently that it's not yet landed
in the public docs) introduced something called "pick dependencies", which
are a bit like control dependencies only they don't create order to all
subsequent stores. These are useful for some conditional data-processing
instructions such as CSEL and CAS, but it's important to note here that
*conditional branch instructions behave exactly as you would expect*.

<disclaimer; I don't work for Arm so any mistakes here are mine>

To reiterate, in the code sequence at the top of this mail, if the compiler
emits something along the lines of:

LDR
<conditional branch instruction>
STR

then the load *will* be ordered before the store, even if the same store
instruction is executed regardless of the branch direction. Yes, one can
fantasize about a CPU that executes both taken and non-taken paths and
figures out that the STR can be hoisted before the load, but that is not
allowed by the architecture today.

It's the conditional instructions that are more fun. For example, the CSEL
instruction:

CSEL X0, X1, X2, <cond>

basically says:

if (cond)
X0 = X1;
else
X0 = X2;

these are just register-register operations, but the idea is that the CPU
can predict that "branching event" inside the CSEL instruction and
speculatively rename X0 while waiting for the condition to resolve.

So then you can add loads and stores to the mix along the lines of:

LDR X0, [X1] // X0 = *X1
CMP X0, X2
CSEL X3, X4, X5, EQ // X3 = (X0 == X2) ? X4 : X5
STR X3, [X6] // MUST BE ORDERED AFTER THE LOAD
STR X7, [X8] // Can be reordered

(assuming X1, X6, X8 all point to different locations in memory)

So now we have a dependency from the load to the first store, but the
interesting part is that the last store is _not_ ordered wrt either of the
other two memory accesses, whereas it would be if we used a conditional
branch instead of the CSEL. Make sense?

Now, obviously the compiler is blissfully unaware that conditional
data processing instructions can give rise to dependencies than
conditional branches, so the question really is how much do we need to
care in the kernel?

My preference is to use load-acquire instead of control dependencies so
that we don't have to worry about this, or any future relaxations to the
CPU architecture, at all.

Jade -- please can you correct me if I got any of this wrong?

Will

2021-06-07 15:26:59

by Paul E. McKenney

[permalink] [raw]

Subject: Re: [RFC] LKMM: Add volatile_if()

On Mon, Jun 07, 2021 at 12:52:35PM +0100, Will Deacon wrote:
> On Mon, Jun 07, 2021 at 12:43:01PM +0200, Peter Zijlstra wrote:
> > On Sun, Jun 06, 2021 at 11:43:42AM -0700, Linus Torvalds wrote:
> > > So while the example code is insane and pointless (and you shouldn't
> > > read *too* much into it), conceptually the notion of that pattern of
> > >
> > > if (READ_ONCE(a)) {
> > > WRITE_ONCE(b,1);
> > > .. do something ..
> > > } else {
> > > WRITE_ONCE(b,1);
> > > .. do something else ..
> > > }
> >
> > This is actually more tricky than it would appear (isn't it always).
> >
> > The thing is, that normally we must avoid speculative stores, because
> > they'll result in out-of-thin-air values.
> >
> > *Except* in this case, where both branches emit the same store, then
> > it's a given that the store will happen and it will not be OOTA.
> > Someone's actually done the proof for that apparently (Will, you have a
> > reference to Jade's paper?)
>
> I don't think there's a paper on this, but Jade and I are hoping to talk
> about aspects of it at LPC (assuming the toolchain MC gets accepted).
>
> > There's apparently also a competition going on who can build the
> > weakestest ARM64 implementation ever.
> >
> > Combine the two, and you'll get a CPU that *will* emit the store early
> > :/
>
> So there are a lot of important details missing here and, as above, I think
> this is something worth discussing at LPC with Jade. The rough summary is
> that the arm64 memory model recently (so recently that it's not yet landed
> in the public docs) introduced something called "pick dependencies", which
> are a bit like control dependencies only they don't create order to all
> subsequent stores. These are useful for some conditional data-processing
> instructions such as CSEL and CAS, but it's important to note here that
> *conditional branch instructions behave exactly as you would expect*.
>
> <disclaimer; I don't work for Arm so any mistakes here are mine>
>
> To reiterate, in the code sequence at the top of this mail, if the compiler
> emits something along the lines of:
>
> LDR
> <conditional branch instruction>
> STR
>
> then the load *will* be ordered before the store, even if the same store
> instruction is executed regardless of the branch direction. Yes, one can
> fantasize about a CPU that executes both taken and non-taken paths and
> figures out that the STR can be hoisted before the load, but that is not
> allowed by the architecture today.
>
> It's the conditional instructions that are more fun. For example, the CSEL
> instruction:
>
> CSEL X0, X1, X2, <cond>
>
> basically says:
>
> if (cond)
> X0 = X1;
> else
> X0 = X2;
>
> these are just register-register operations, but the idea is that the CPU
> can predict that "branching event" inside the CSEL instruction and
> speculatively rename X0 while waiting for the condition to resolve.
>
> So then you can add loads and stores to the mix along the lines of:
>
> LDR X0, [X1] // X0 = *X1
> CMP X0, X2
> CSEL X3, X4, X5, EQ // X3 = (X0 == X2) ? X4 : X5
> STR X3, [X6] // MUST BE ORDERED AFTER THE LOAD
> STR X7, [X8] // Can be reordered
>
> (assuming X1, X6, X8 all point to different locations in memory)
>
> So now we have a dependency from the load to the first store, but the
> interesting part is that the last store is _not_ ordered wrt either of the
> other two memory accesses, whereas it would be if we used a conditional
> branch instead of the CSEL. Make sense?

And if I remember correctly, this is why LKMM orders loads in the
"if" condition only with stores in the "then" and "else" clauses,
not with stores after the end of the "if" statement. Or is there
some case that I am missing?

> Now, obviously the compiler is blissfully unaware that conditional
> data processing instructions can give rise to dependencies than
> conditional branches, so the question really is how much do we need to
> care in the kernel?
>
> My preference is to use load-acquire instead of control dependencies so
> that we don't have to worry about this, or any future relaxations to the
> CPU architecture, at all.

From what I can see, ARMv8 has DMB(LD) and DMB(ST). Does it have
something like a DMB(LD,ST) that would act something like powerpc lwsync?

Or are you proposing rewriting the "if" conditions to upgrade
READ_ONCE() to smp_load_acquire()? Or something else?

Just trying to find out exactly what you are proposing. ;-)

Thanx, Paul

> Jade -- please can you correct me if I got any of this wrong?
>
> Will

2021-06-07 16:05:36

by Will Deacon

[permalink] [raw]

Subject: Re: [RFC] LKMM: Add volatile_if()

Hi Paul,

On Mon, Jun 07, 2021 at 08:25:33AM -0700, Paul E. McKenney wrote:
> On Mon, Jun 07, 2021 at 12:52:35PM +0100, Will Deacon wrote:
> > It's the conditional instructions that are more fun. For example, the CSEL
> > instruction:
> >
> > CSEL X0, X1, X2, <cond>
> >
> > basically says:
> >
> > if (cond)
> > X0 = X1;
> > else
> > X0 = X2;
> >
> > these are just register-register operations, but the idea is that the CPU
> > can predict that "branching event" inside the CSEL instruction and
> > speculatively rename X0 while waiting for the condition to resolve.
> >
> > So then you can add loads and stores to the mix along the lines of:
> >
> > LDR X0, [X1] // X0 = *X1
> > CMP X0, X2
> > CSEL X3, X4, X5, EQ // X3 = (X0 == X2) ? X4 : X5
> > STR X3, [X6] // MUST BE ORDERED AFTER THE LOAD
> > STR X7, [X8] // Can be reordered
> >
> > (assuming X1, X6, X8 all point to different locations in memory)
> >
> > So now we have a dependency from the load to the first store, but the
> > interesting part is that the last store is _not_ ordered wrt either of the
> > other two memory accesses, whereas it would be if we used a conditional
> > branch instead of the CSEL. Make sense?
>
> And if I remember correctly, this is why LKMM orders loads in the
> "if" condition only with stores in the "then" and "else" clauses,
> not with stores after the end of the "if" statement. Or is there
> some case that I am missing?

It's not clear to me that such a restriction prevents the compiler from
using any of the arm64 conditional instructions in place of the conditional
branch in such a way that you end up with an "independent" store in the
assembly output constructed from two stores on the "then" and "else" paths
which the compiler determined where the same.

> > Now, obviously the compiler is blissfully unaware that conditional
> > data processing instructions can give rise to dependencies than
> > conditional branches, so the question really is how much do we need to
> > care in the kernel?
> >
> > My preference is to use load-acquire instead of control dependencies so
> > that we don't have to worry about this, or any future relaxations to the
> > CPU architecture, at all.
>
> From what I can see, ARMv8 has DMB(LD) and DMB(ST). Does it have
> something like a DMB(LD,ST) that would act something like powerpc lwsync?
>
> Or are you proposing rewriting the "if" conditions to upgrade
> READ_ONCE() to smp_load_acquire()? Or something else?
>
> Just trying to find out exactly what you are proposing. ;-)

Some options are:

(1) Do nothing until something actually goes wrong (and hope we spot/debug it)

(2) Have volatile_if force a conditional branch, assuming that it solves
the problem and doesn't hurt codegen (I still haven't convinced myself
for either case)

(3) Upgrade READ_ONCE() to RCpc acquire, relaxed atomic RMWs to RCsc
acquire on arm64

(4) Introduce e.g. READ_ONCE_CTRL(), atomic_add_return_ctrl() etc
specifically for control dependencies and upgrade only those for
arm64

(5) Work to get toolchain support for dependency ordering and use that

I'm suggesting (3) or (4) because, honestly, it feels like we're being
squeezed from both sides with both the compiler and the hardware prepared
to break control dependencies.

Will

2021-06-07 18:10:20

by Paul E. McKenney

[permalink] [raw]

Subject: Re: [RFC] LKMM: Add volatile_if()

On Mon, Jun 07, 2021 at 05:02:53PM +0100, Will Deacon wrote:
> Hi Paul,
>
> On Mon, Jun 07, 2021 at 08:25:33AM -0700, Paul E. McKenney wrote:
> > On Mon, Jun 07, 2021 at 12:52:35PM +0100, Will Deacon wrote:
> > > It's the conditional instructions that are more fun. For example, the CSEL
> > > instruction:
> > >
> > > CSEL X0, X1, X2, <cond>
> > >
> > > basically says:
> > >
> > > if (cond)
> > > X0 = X1;
> > > else
> > > X0 = X2;
> > >
> > > these are just register-register operations, but the idea is that the CPU
> > > can predict that "branching event" inside the CSEL instruction and
> > > speculatively rename X0 while waiting for the condition to resolve.
> > >
> > > So then you can add loads and stores to the mix along the lines of:
> > >
> > > LDR X0, [X1] // X0 = *X1
> > > CMP X0, X2
> > > CSEL X3, X4, X5, EQ // X3 = (X0 == X2) ? X4 : X5
> > > STR X3, [X6] // MUST BE ORDERED AFTER THE LOAD
> > > STR X7, [X8] // Can be reordered
> > >
> > > (assuming X1, X6, X8 all point to different locations in memory)
> > >
> > > So now we have a dependency from the load to the first store, but the
> > > interesting part is that the last store is _not_ ordered wrt either of the
> > > other two memory accesses, whereas it would be if we used a conditional
> > > branch instead of the CSEL. Make sense?
> >
> > And if I remember correctly, this is why LKMM orders loads in the
> > "if" condition only with stores in the "then" and "else" clauses,
> > not with stores after the end of the "if" statement. Or is there
> > some case that I am missing?
>
> It's not clear to me that such a restriction prevents the compiler from
> using any of the arm64 conditional instructions in place of the conditional
> branch in such a way that you end up with an "independent" store in the
> assembly output constructed from two stores on the "then" and "else" paths
> which the compiler determined where the same.
>
> > > Now, obviously the compiler is blissfully unaware that conditional
> > > data processing instructions can give rise to dependencies than
> > > conditional branches, so the question really is how much do we need to
> > > care in the kernel?
> > >
> > > My preference is to use load-acquire instead of control dependencies so
> > > that we don't have to worry about this, or any future relaxations to the
> > > CPU architecture, at all.
> >
> > From what I can see, ARMv8 has DMB(LD) and DMB(ST). Does it have
> > something like a DMB(LD,ST) that would act something like powerpc lwsync?
> >
> > Or are you proposing rewriting the "if" conditions to upgrade
> > READ_ONCE() to smp_load_acquire()? Or something else?
> >
> > Just trying to find out exactly what you are proposing. ;-)
>
> Some options are:
>
> (1) Do nothing until something actually goes wrong (and hope we spot/debug it)
>
> (2) Have volatile_if force a conditional branch, assuming that it solves
> the problem and doesn't hurt codegen (I still haven't convinced myself
> for either case)
>
> (3) Upgrade READ_ONCE() to RCpc acquire, relaxed atomic RMWs to RCsc
> acquire on arm64
>
> (4) Introduce e.g. READ_ONCE_CTRL(), atomic_add_return_ctrl() etc
> specifically for control dependencies and upgrade only those for
> arm64
>
> (5) Work to get toolchain support for dependency ordering and use that
>
> I'm suggesting (3) or (4) because, honestly, it feels like we're being
> squeezed from both sides with both the compiler and the hardware prepared
> to break control dependencies.

I will toss out this as well:

(6) Create a volatile_if() that does not support an "else" clause,
thus covering all current use cases and avoiding some of the
same-store issues. Which in the end might or might not help,
but perhaps worth looking into.

Thanx, Paul

2021-07-30 17:22:54

by Alglave, Jade

[permalink] [raw]

Subject: Re: [RFC] LKMM: Add volatile_if()

Dear all,

On Mon, Jun 07, 2021 at 12:52:35PM +0100, Will Deacon wrote:
> On Mon, Jun 07, 2021 at 12:43:01PM +0200, Peter Zijlstra wrote:
> > On Sun, Jun 06, 2021 at 11:43:42AM -0700, Linus Torvalds wrote:
> > > So while the example code is insane and pointless (and you shouldn't
> > > read *too* much into it), conceptually the notion of that pattern of
> > >
> > > if (READ_ONCE(a)) {
> > > WRITE_ONCE(b,1);
> > > .. do something ..
> > > } else {
> > > WRITE_ONCE(b,1);
> > > .. do something else ..
> > > }
> >
> > This is actually more tricky than it would appear (isn't it always).
> >
> > The thing is, that normally we must avoid speculative stores, because
> > they'll result in out-of-thin-air values.
> >
> > *Except* in this case, where both branches emit the same store, then
> > it's a given that the store will happen and it will not be OOTA.
> > Someone's actually done the proof for that apparently (Will, you have a
> > reference to Jade's paper?)
>
> I don't think there's a paper on this, but Jade and I are hoping to talk
> about aspects of it at LPC (assuming the toolchain MC gets accepted).
>
> > There's apparently also a competition going on who can build the
> > weakestest ARM64 implementation ever.
> >
> > Combine the two, and you'll get a CPU that *will* emit the store early
> > :/
>
> So there are a lot of important details missing here and, as above, I think
> this is something worth discussing at LPC with Jade. The rough summary is
> that the arm64 memory model recently (so recently that it's not yet landed
> in the public docs) introduced something called "pick dependencies", which
> are a bit like control dependencies only they don't create order to all
> subsequent stores. These are useful for some conditional data-processing
> instructions such as CSEL and CAS, but it's important to note here that
> *conditional branch instructions behave exactly as you would expect*.
>
> <disclaimer; I don't work for Arm so any mistakes here are mine>
>
> To reiterate, in the code sequence at the top of this mail, if the compiler
> emits something along the lines of:
>
> LDR
> <conditional branch instruction>
> STR
>
> then the load *will* be ordered before the store, even if the same store
> instruction is executed regardless of the branch direction. Yes, one can
> fantasize about a CPU that executes both taken and non-taken paths and
> figures out that the STR can be hoisted before the load, but that is not
> allowed by the architecture today.
>
> It's the conditional instructions that are more fun. For example, the CSEL
> instruction:
>
> CSEL X0, X1, X2, <cond>
>
> basically says:
>
> if (cond)
> X0 = X1;
> else
> X0 = X2;
>
> these are just register-register operations, but the idea is that the CPU
> can predict that "branching event" inside the CSEL instruction and
> speculatively rename X0 while waiting for the condition to resolve.
>
> So then you can add loads and stores to the mix along the lines of:
>
> LDR X0, [X1] // X0 = *X1
> CMP X0, X2
> CSEL X3, X4, X5, EQ // X3 = (X0 == X2) ? X4 : X5
> STR X3, [X6] // MUST BE ORDERED AFTER THE LOAD
> STR X7, [X8] // Can be reordered
>
> (assuming X1, X6, X8 all point to different locations in memory)
>
> So now we have a dependency from the load to the first store, but the
> interesting part is that the last store is _not_ ordered wrt either of the
> other two memory accesses, whereas it would be if we used a conditional
> branch instead of the CSEL. Make sense?
>
> Now, obviously the compiler is blissfully unaware that conditional
> data processing instructions can give rise to dependencies than
> conditional branches, so the question really is how much do we need to
> care in the kernel?
>
> My preference is to use load-acquire instead of control dependencies so
> that we don't have to worry about this, or any future relaxations to the
> CPU architecture, at all.
>
> Jade -- please can you correct me if I got any of this wrong?
>
Sincere apologies in taking so long to reply. I attach a technical
report which describes the status of dependencies in the Arm memory
model.

I have also released the corresponding cat files and a collection of
interesting litmus tests over here:
https://github.com/herd/herdtools7/commit/f80bd7c2e49d7d3adad22afc62ff4768d65bf830

I hope this material can help inform this conversation and I would love
to hear your thoughts.

Thanks,
Jade

> Will

Attachments:

(No filename) (4.51 kB)
TheBalladOfDependencies.pdf (336.85 kB)
Download all attachments

2021-07-30 20:40:25

by Alan Stern

[permalink] [raw]

Subject: Re: [RFC] LKMM: Add volatile_if()

On Fri, Jul 30, 2021 at 06:20:22PM +0100, Jade Alglave wrote:
> Dear all,

> Sincere apologies in taking so long to reply. I attach a technical
> report which describes the status of dependencies in the Arm memory
> model.
>
> I have also released the corresponding cat files and a collection of
> interesting litmus tests over here:
> https://github.com/herd/herdtools7/commit/f80bd7c2e49d7d3adad22afc62ff4768d65bf830
>
> I hope this material can help inform this conversation and I would love
> to hear your thoughts.

Jade:

Here are a few very preliminary reactions (I haven't finished reading the
entire paper yet).

P.2: Typo: "the register X1 contains the address x" should be "the
register X1 contains the address of x".

P.4 and later: Several complicated instructions (including CSEL, CAS, and
SWP) are mentioned but not explained; the text assumes that the reader
already understands what these instructions do. A brief description of
their effects would help readers like me who aren't very familiar with the
ARM instruction set.

P.4: The text describing Instrinsic dependencies in CSEL instructions says
that if cond is true then there is an Intrinsic control dependencies from
the read of PSTATE.NZCV to the read of Xm. Why is this so? Can't the CPU
read Xm unconditionally before it knows whether the value will be used?

P.17: The definition of "Dependency through registers" uses the acronym
"PE", but the acronym isn't defined anywhere.

P.14: In the description of Figure 18, I wasn't previously aware --
although perhaps I should have been -- that ARM could speculatively place
a Store in a local store buffer, allowing it to be forwarded to a po-later
Read. Why doesn't the same mechanism apply to Figure 20, allowing the
Store in D to be speculatively placed in a local store buffer and
forwarded to E? Is this because conditional branches are predicted but
loads aren't? If so, that is a significant difference.

More to come...

Alan

2021-08-02 21:20:30

by Alan Stern

[permalink] [raw]

Subject: Re: [RFC] LKMM: Add volatile_if()

On Fri, Jul 30, 2021 at 06:20:22PM +0100, Jade Alglave wrote:
> I hope this material can help inform this conversation and I would love
> to hear your thoughts.

More comments...

I find the herd-style diagrams (Figures 2, 3, 5, 7, 9, and so on) almost
impossible to decipher. While they might be useful to people running
herd, they have several drawbacks for readers of this report:

They include multiple instructions, not just the one for which
you want to illustrate the internal dependencies. How about
getting rid of the extraneous instructions?

Each box contains three lines of information, of which only the
first is really significant, and it is hard to figure out. How
about getting rid of the second and third lines, and replacing
things like "e: R0:X1q=x" in the first line with something more
along the lines of "RegR X0" or "tmp1 = RegR X0"?

The "iico" in the dependency arrows doesn't add anything.

Section 1.1 mentions order, data, and control Intrinsic dependencies but
doesn't give so much as a hint as to what they are. Instead the reader
is forced to invent his own generalizations by reading through several
complex special-case examples. There should be a short description of
what each Intrinsic dependency represents. For instance, the first
sentence in 1.3 would be a great way to explain data dependencies. (And
is it not true that control dependencies are mainly needed for
situations where an instruction's inputs and outputs may include the
same register or memory address, when it is necessary to enforce that
the input value is read before the output value is written?)

Some of the dependencies listed for CAS are surprising, but there is no
explanation. Why is C2 a control dependency rather than a data
dependency? After all, the value read from [Xn] is stored in Xs in both
cases. In fact, Df1 supersedes C2 in the failure case, doesn't it? And
why are C1 and Ds1 a control and data dependency respectively rather
than both order dependencies?

Section 2.1: Although the Store F is independent of the conditional
branch and so might be made visible to other observers early, isn't it
true that neither ARMv8 nor any other type of processor will do this?

General question: How does this discussion of conditional branches
relate overall to the way computed branches are handled?

Alan

2021-08-02 23:34:57

by Paul E. McKenney

[permalink] [raw]

Subject: Re: [RFC] LKMM: Add volatile_if()

On Fri, Jul 30, 2021 at 06:20:22PM +0100, Jade Alglave wrote:

[ . . . ]

> Sincere apologies in taking so long to reply. I attach a technical
> report which describes the status of dependencies in the Arm memory
> model.
>
> I have also released the corresponding cat files and a collection of
> interesting litmus tests over here:
> https://github.com/herd/herdtools7/commit/f80bd7c2e49d7d3adad22afc62ff4768d65bf830
>
> I hope this material can help inform this conversation and I would love
> to hear your thoughts.

It is very good to see this! A few random questions and comments below
based on a couple of passes through this document.

Thanx, Paul

------------------------------------------------------------------------

o Figure 2: The iico_data arc are essentially invisible after
printing, as in the text on the following page is darker
than these arcs. I had similar difficulties with many
of the other diagrams.

o Figure 2: What does the "q" signify in the upper line of the
uppermost event-c box ("R0:X1q=x")? I get that we are reading
register X1 and getting back the address of variable "x". I am
assuming that the "R0" means that process 0 is doing a read,
but I cannot be sure.

I am assuming that the "proc: P0 poi:0" means that this is
the first instruction of process P0. If this is incorrect,
please let me know.

o Figure 6 initializes X0, X1, X2, and X3, while Figure 4
initializes only X0 and X3. Is this difference meaningful?
(My guess is that you have default-zero initialization so
that it does not matter, but I figured that I should ask.)

o Figure 6: The iico_ctrl arcs are easier to see on printed copy
than the iico_data arcs, but it would be nice if they were a
bit darker. The rf-reg arcs are plainly visible.

o Figure 6: Why is there no po arc to the CSEL instruction?

o Section 1.3, "Swap instructions" paragraph. Please supply
a litmus-test figure to go along with Figure 12.

o Figures 10 and 11: Having these on the same page was extremely
helpful, thank you!

o Figure 11: What does the "*" signify in the first line event "a:"
("a: Rx*q=0")? Why is there no "*" in the corresponding event
in Figure 9?

o Figure 11: The ca arcs are nicely visible, but I am coming up
empty on hypotheses for their meaning. Or is ca the new co?

o Figure 11: Why two po arcs into the CAS instruction? Due
to independent register reads taht might proceed concurrently?
If so, why no po arc to event g?

o Figure 11: The connections between events a, f, and h lead me to
believe that the hardware is permitted to rewrite register X3
with the value previously read from X3 as opposed to the value
read from [X1]. Or maybe omit the write entirely.

I don't see anything wrong with taking this approach, but I
figured I should check.

o Section 2 I leave in Alan's capable hands.

o Section 3.1, "Dependency through registers": A "PE" is a
processing element or some such?

o Section 3.1, "Dependency through registers", first bullet:
The exclusion of Store Exclusive is to avoid ordering via
the 0/1 status store, correct?

o Section 3.1, "Address Dependency", second bullet, second
sub-bullet: OK, I will bite. The dependency from the Branching
Effect is due to a load from the program counter or some such?
Or are there some special-purpose ARMv8 branch instructions that
I should look up.

o Figure 27, "MOV W5, W0": It took me a bit to figure out that
this instruction exists strictly for the benefit of the
"exists" clause. Or am I missing something subtle?

o Section 4.1, "Interestingly, this notion of ``pick dependency...":
I suggest using something like "require" instead of "proscribe",
if that is what is meant. The hamming distance between the
antonyms "proscribe" and "prescribe" is quite small, which can
result in errors both when writing and when reading. :-(

o Figure 30: The discarding of register X3 is intentional, correct?
If so, it is indeed hard to imagine wanting ordering from this
code sequence. Though I might once again be suffering from a
failure of imagination...

o Figure 32: The reason for this litmus test being allowed is that
the ordering through CSEL is sort of like a control dependency,
and control dependencies to loads do not force ordering, correct?
Or did I miss a turn in there somewhere?

o Section 4.2, "Pick Basic dependency": Should the second and
third bullets be indented under the first bullet?

It will take at least another pass to get my head around pick
dependencies, so I will stop here for the moment.

Again, good stuff, and great to see the additional definition!

Thanx, Paul

2021-08-04 21:49:46

by Alan Stern

[permalink] [raw]

Subject: Re: [RFC] LKMM: Add volatile_if()

On Mon, Aug 02, 2021 at 04:31:56PM -0700, Paul E. McKenney wrote:
> o Section 2 I leave in Alan's capable hands.

Here goes, although I'm not sure how important this is, given that
section 2 is presented as merely a "straw man" argument for something
that ARM decided to abandon.

While reading this section (and the paper in general), it was annoying
that the terms "down-one-leg" and "down-two-legs" are never explained or
motivated. Even after reading section 2, I'm still not sure what they
are really intended to mean. My impression is that "down-one-leg" is an
attempt to express the idea that control dependencies apply to accesses
occurring along one leg of a conditional but not to accesses occurring
after the two legs have rejoined. Is that right?

P.17: "The drawback of this approach is that it would require order for
the “independent” case" -- this doesn't seem like a drawback to me.
Particularly since no existing architecture attempts to avoid ordering
the independent case.

Def. of "points of divergence": This is not very precise. What exactly
is a "branching decision"? Do the two paths of a CAS or CSEL
instruction count? What if the decision doesn't involve whether or not
to take the branch but rather where to branch to (as in a computed
branch or even just a call through a function pointer)?

Def. of "address dependency": How could there be a Dependency through
registers from D4 to R2? It's not at all easy to untangle the
definitions to see what this might mean. What would be an example? At
any rate, the case where RW2 is a Memory read doesn't seem right. It
says that:

R0 = Load
R1 = Load([R0])

is an address dependency but

R0 = Load
// Branching decision that depends on the value of R0 and
// carries a Dependency through registers to a new value for
// R0 (whatever that may mean) which is always equal to the
// existing value
R1 = Load([R0])

isn't. Is this really what you mean? If so, what is the motivation for
this definition? How does it relate to the discussion earlier in this
section?

Def. of antecedent: What is a Local read successor or an immediate Local
write successor? These terms aren't defined, and without knowing what
they mean it is impossible to understand what an antecedent is.

Def. of pre-equivalent effects and related terms: I don't understand how
you can have effects on different branches of a Point of divergence. By
definition, only one of the branches is executed -- how can there be any
effects on the speculated branch?

With all these concepts being so unclear, I was completely unable to
figure out what the definition of control dependency means. The text
doesn't help at all, because it doesn't contain any examples or
explanations to make these things more comprehensible.

The formalization in cat may have some historical interest, but it
conveys no information to a reader who isn't prepared to spend hours or
days trying to decipher it. Honestly, do you know _anybody_ who could
tell what Figures 22 - 25 mean and what they do just from reading them?
You pretty much have to be an expert in cat just to tell what some of
the recursive functions in Figs. 23 and 24 do.

(As just one very minor example, the "bisimulation" function in the
fourth-to-last line of Figure 25 isn't mentioned anywhere else. How are
people supposed to understand it?)

Alan

2021-08-05 21:28:47

by Alan Stern

[permalink] [raw]

Subject: Re: [RFC] LKMM: Add volatile_if()

On Fri, Jul 30, 2021 at 06:20:22PM +0100, Jade Alglave wrote:
> I hope this material can help inform this conversation and I would love
> to hear your thoughts.

Thoughts on section 3...

The paragraph on Branching Effect is pretty meager. Exactly what
effects does a conditional branch instruction generate? I imagine
there's a register read of the flags register, to see whether the branch
should be taken. And evidently there's a branch effect, which may take
the register read as input but doesn't have any obvious outputs.
Anything else? -- there don't seem to be any register writes.

Why are Store Exclusive instructions explicitly disallowed in the
definition of dependency through registers? Is this because ARM CPUs
don't forward values written by such instructions to po-later reads? If
so, why don't they? (Paul asked a similar question.)

Since the recursive definition of dependency through registers starts
with either a register write or intrinsic order of events in an
instruction, it appears that there cannot be any dependency through
registers starting from a branching effect. So why does the definition
of address dependency talk about a dependency through registers from B4
(a branching effect) to R2? (Paul also asked about this -- does writing
to the program counter get treated as a register write? But almost no
instructions explicitly read from the program counter.)

What is the whole point of the special handling of branching effects in
the definition of address dependencies? It isn't obvious and the text
doesn't explain it.

Figure 26 includes a lot of terms that seem like herd primitives. They
must be relatively new, because they aren't mentioned in the
documentation that I've got. (I'm referring to such terms as iico_data,
iico_ctrl, intrinsic, same-instance, DATA, and NDATA.) Are they
explained anywhere?

Way back in Section 1, various Intrinsic dependency relations were
introduced. The reason for treating Intrinsic control dependencies
specially seems to be that the CPU can speculate past such dependencies
(though it would be nice if the text made this point explicitly). But
why do you differentiate between data and order Intrinsic dependencies?
Is this also related to some specific behavior of ARM CPUs?

Alan

2021-08-07 00:53:19

by Alan Stern

[permalink] [raw]

Subject: Re: [RFC] LKMM: Add volatile_if()

On Fri, Jul 30, 2021 at 06:20:22PM +0100, Jade Alglave wrote:
> I hope this material can help inform this conversation and I would love
> to hear your thoughts.

Thoughts on section 4...

The definition of Pick Basic dependency is phrased incorrectly. The
"all of the following apply" in the first paragraph refers only to first
bullet point, which in turn refers to the following two bullet points.
The "all of the following apply" phrase should be removed and the first
bullet point should be merged into the main text.

The definition of Pick dependency is redundant, because each Pick
Address and Pick Data dependency is itself already a Pick Basic
dependency. The same is true of the cat formalization.

In the cat code, the definition of Reg looks wrong. It is:

let Reg=~M | ~BR

Since (I presume) no event falls into both the M and BR classes, this
definition includes all events. It probably should be:

let Reg=~(M | BR)

or

let Reg=~M & ~BR

It's now clear that my original understanding of the underlying basis of
Intrinsic control dependencies was wrong. They aren't separated out
because CPUs can speculate through conditional branches; rather they are
separated out because they are the things which give rise to Pick
dependencies. It would have been nice if the text had explained this at
the start instead of leaving it up to me to figure out for myself.

Alan