2015-05-20 00:55:23

by Paul E. McKenney

Subject: Compilers and RCU readers: Once more unto the breach!

Hello!

Following up on last year's discussion (https://lwn.net/Articles/586838/,
https://lwn.net/Articles/588300/), I believe that we have a solution. If
I am wrong, I am sure you all will let me know, and in great detail. ;-)

The key simplification is to "just say no" to RCU-protected array indexes:
https://lkml.org/lkml/2015/5/12/827, as was suggested by several people.
This simplification means that rcu_dereference (AKA memory_order_consume)
need only return pointers. This in turn avoids things like (x-x),
(x*0), and (x%1) because if "x" is a pointer, these expressions either
return non-pointers or are compilation errors. With a very few exceptions,
dependency chains can lead -to- non-pointers, but cannot pass -through-
them.
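
For illustration, here is a minimal sketch of the resulting usage pattern,
written with the usual Linux-kernel RCU primitives (struct foo, its ->a
field, and the global pointer gp are made-up names for this example):

	struct foo { int a; };
	struct foo *gp;		/* RCU-protected; updated via rcu_assign_pointer() */

	int reader(void)
	{
		struct foo *p;
		int v = -1;

		rcu_read_lock();
		p = rcu_dereference(gp);	/* heads the dependency chain */
		if (p)
			v = p->a;		/* dependency-ordered dereference */
		rcu_read_unlock();
		return v;	/* the chain led -to- the non-pointer v, but any
				 * further computation on v is not ordered */
	}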

The result is that dependencies are carried only by operations for
which the compiler cannot easily optimize away those dependencies;
these operations include simple assignment, integer offset (including
indexing), dereferencing, casts, passing as a function argument, returning
values from functions, and so on. A complete list with commentary starts
on page 28 of:

http://www.rdrop.com/users/paulmck/RCU/consume.2015.05.18a.pdf

Dependency chains are broken if a pointer compares equal to some other
pointer not part of the same dependency chain, if too many bits are ORed
onto or ANDed off of an intptr_t or uintptr_t, or if the dependency is
explicitly killed (which should now strictly speaking never be necessary,
but which might allow better diagnostics). These are set out in more
detail on page 30 of the above PDF.
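
As a hedged illustration of the pointer-comparison rule (default_foo and
the ->a field are made-up names, and rcu_read_lock()/rcu_read_unlock()
are omitted for brevity):

	struct foo { int a; };
	extern struct foo *gp;			/* RCU-protected */
	extern struct foo default_foo;		/* reachable without the chain */

	int reader(void)
	{
		struct foo *p = rcu_dereference(gp);

		if (p == &default_foo)
			return p->a;	/* p compared equal to an unrelated pointer,
					 * so the compiler may substitute default_foo
					 * for p here: the chain is broken */
		return p->a;		/* chain intact: dependency ordered */
	}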

This covers all the uses in the Linux kernel that I am aware of without
any source-code changes (other than to the rcu_dereference() primitives
themselves) and should also work for compilers and standards.

Thoughts?

Thanx, Paul


2015-05-20 01:57:05

by Linus Torvalds

Subject: Re: Compilers and RCU readers: Once more unto the breach!

On Tue, May 19, 2015 at 5:55 PM, Paul E. McKenney
<[email protected]> wrote:
>
> http://www.rdrop.com/users/paulmck/RCU/consume.2015.05.18a.pdf

From a very quick read-through, the restricted dependency chain in 7.9
seems to be reasonable, and essentially covers "that's what hardware
gives us anyway", making compiler writers happy.

I would clarify the language somewhat:

- it says that the result of a cast of a pointer is a dependency. You
need to make an exception for casting to bool, methinks (because
that's effectively just a test-against-NULL, which you later describe
as terminating the dependency).

Maybe get rid of the "any type", and just limit it to casts to
types of size intptr_t, ie ones that don't drop significant bits. All
the other rules talk about [u]intptr_t anyway.

- you clarify that the trivial "& 0" and "| ~0" kill the dependency
chain, but if you really want to be a stickler, you might want to
extend it to a few more cases. Things like "& 1" (to extract a tag
from the low bit of a tagged pointer) should likely also drop the
dependency, since a compiler would commonly end up using the end
result as a conditional even if the code was written to then use
casting etc to look like a dereference.

- the "you can add/subtract integral values" still opens you up to
language lawyers claiming "(char *)ptr - (intptr_t)ptr" preserving the
dependency, which it clearly doesn't. But language-lawyering it does,
since all those operations (cast to pointer, cast to integer,
subtracting an integer) claim to be dependency-preserving operations.

So I think you want to limit the logical operators to things that
don't mask off too many bits, and you should probably limit the
add/subtract operations some way (maybe specify that the integer value
you add/subtract cannot be related to the pointer). But I think
limiting it to mostly pointer ops (and a _few_ integer operations to
do offsets and remove tag bits) is otherwise a good approach.

Linus

2015-05-20 02:10:16

by Linus Torvalds

Subject: Re: Compilers and RCU readers: Once more unto the breach!

On Tue, May 19, 2015 at 6:57 PM, Linus Torvalds
<[email protected]> wrote:
>
> - the "you can add/subtract integral values" still opens you up to
> language lawyers claiming "(char *)ptr - (intptr_t)ptr" preserving the
> dependency, which it clearly doesn't. But language-lawyering it does,
> since all those operations (cast to pointer, cast to integer,
> subtracting an integer) claim to be dependency-preserving operations.
>
> So I think you want to limit the logical operators to things that
> don't mask off too many bits, and you should probably limit the
> add/subtract operations some way (maybe specify that the integer value
> you add/subtract cannot be related to the pointer).

Actually, "not related" doesn't work. For some buddy allocator thing,
you very much might want some of the bits to be related.

So I think you're better off just saying that operations designed to
drop significant bits break the dependency chain, and give things like
"& 1" and "(char *)ptr-(uintptr_t)ptr" as examples of such.

Making that just an extension of your existing "& 0" language would
seem to be natural.

Humans will understand, and compiler writers won't care. They will
either depend on hardware semantics anyway (and argue that your
language is tight enough that they don't need to do anything special)
or they will turn the consume into an acquire (on platforms that have
too weak hardware).

Linus

2015-05-20 02:34:11

by Paul E. McKenney

Subject: Re: Compilers and RCU readers: Once more unto the breach!

On Tue, May 19, 2015 at 06:57:02PM -0700, Linus Torvalds wrote:
> On Tue, May 19, 2015 at 5:55 PM, Paul E. McKenney
> <[email protected]> wrote:
> >
> > http://www.rdrop.com/users/paulmck/RCU/consume.2015.05.18a.pdf
>
> From a very quick read-through, the restricted dependency chain in 7.9
> seems to be reasonable, and essentially covers "that's what hardware
> gives us anyway", making compiler writers happy.
>
> I would clarify the language somewhat:
>
> - it says that the result of a cast of a pointer is a dependency. You
> need to make an exception for casting to bool, methinks (because
> that's effectively just a test-against-NULL, which you later describe
> as terminating the dependency).
>
> Maybe get rid of the "any type", and just limit it to casts to
> types of size intptr_t, ie ones that don't drop significant bits. All
> the other rules talk about [u]intptr_t anyway.

Excellent point! I now say:

If a pointer is part of a dependency chain, then casting it
(either explicitly or implicitly) to any pointer-sized type
extends the chain to the result.

If this approach works out, the people in the Core Working Group will
come up with alternative language-lawyer-proof wording, but this informal
version will hopefully do for the moment.

> - you clarify that the trivial "& 0" and "| ~0" kill the dependency
> chain, but if you really want to be a stickler, you might want to
> extend it to a few more cases. Things like "& 1" (to extract a tag
> from the low bit of a tagged pointer) should likely also drop the
> dependency, since a compiler would commonly end up using the end
> result as a conditional even if the code was written to then use
> casting etc to look like a dereference.

Ah, how about the following?

If a value of type intptr_t or uintptr_t is part of a dependency
chain, and if that value is one of the operands to an & or |
infix operator whose result has too few or too many bits set,
then the resulting value will not be part of any dependency
chain. For example, on a 64-bit system, if p is part of a
dependency chain, then (p & 0x7) provides just the tag bits,
and normally cannot even be legally dereferenced. Similarly,
(p | ~0) normally cannot be legally dereferenced.
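
As a hedged illustration (the tag-bit layout and the names below are made
up for the example, not proposed wording), the usual tagged-pointer idiom
would then behave as follows:

	#define TAG_MASK 0x7UL

	struct node { struct node *next; long key; };

	struct node *p = rcu_dereference(head);	/* head: RCU-protected list */
	unsigned long tag = (uintptr_t)p & TAG_MASK;
				/* too few bits survive: tag is not part of any
				 * dependency chain (nor is it dereferenceable) */
	struct node *q = (struct node *)((uintptr_t)p & ~TAG_MASK);
				/* only the tag bits are cleared, so enough
				 * significant bits survive for the chain to
				 * extend through the cast to q */
	long k = q->key;	/* dependency-ordered load */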

> - the "you can add/subtract integral values" still opens you up to
> language lawyers claiming "(char *)ptr - (intptr_t)ptr" preserving the
> dependency, which it clearly doesn't. But language-lawyering it does,
> since all those operations (cast to pointer, cast to integer,
> subtracting an integer) claim to be dependency-preserving operations.

My thought was that the result of "(char *)ptr - (intptr_t)ptr" is a
NULL pointer in most environments, and dereferencing a NULL pointer
is undefined behavior. So it becomes irrelevant whether or not the
NULL pointer carries a dependency.

There are some stranger examples, such as "(char *)ptr - ((intptr_t)ptr)/7",
but in that case, if the resulting pointer happens by chance to reference
valid memory, I believe a dependency would still be carried. Of course,
if you are producing code like that, I am guessing that dependencies are
the very least of your concerns.

However, I will give this some more thought.

> So I think you want to limit the logical operators to things that
> don't mask off too many bits, and you should probably limit the
> add/subtract operations some way (maybe specify that the integer value
> you add/subtract cannot be related to the pointer). But I think
> limiting it to mostly pointer ops (and a _few_ integer operations to
> do offsets and remove tag bits) is otherwise a good approach.

Glad you mostly like it! ;-)

Thanx, Paul

2015-05-20 02:41:57

by Paul E. McKenney

Subject: Re: Compilers and RCU readers: Once more unto the breach!

On Tue, May 19, 2015 at 07:10:12PM -0700, Linus Torvalds wrote:
> On Tue, May 19, 2015 at 6:57 PM, Linus Torvalds
> <[email protected]> wrote:
> >
> > - the "you can add/subtract integral values" still opens you up to
> > language lawyers claiming "(char *)ptr - (intptr_t)ptr" preserving the
> > dependency, which it clearly doesn't. But language-lawyering it does,
> > since all those operations (cast to pointer, cast to integer,
> > subtracting an integer) claim to be dependency-preserving operations.
> >
> > So I think you want to limit the logical operators to things that
> > don't mask off too many bits, and you should probably limit the
> > add/subtract operations some way (maybe specify that the integer value
> > you add/subtract cannot be related to the pointer).
>
> Actually, "not related" doesn't work. For some buddy allocator thing,
> you very much might want some of the bits to be related.

Good point, you could do the buddy-allocator computations with add
and subtract instead of exclusive OR.
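
For concreteness, a rough sketch of that add/subtract formulation (purely
illustrative, assuming power-of-two block sizes; no claim about any real
allocator):

	/* Buddy of the block at p, where size is the power-of-two block size. */
	static char *buddy_of(char *p, uintptr_t size)
	{
		if ((uintptr_t)p & size)
			return p - size;
		else
			return p + size;
	}

The value added or subtracted is derived from the pointer's own bits, so it
is very much "related" to the pointer, yet the pointer value is never
cancelled, and the dependency chain survives. The conventional form,
(char *)((uintptr_t)p ^ size), computes the same buddy address.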

> So I think you're better off just saying that operations designed to
> drop significant bits break the dependency chain, and give things like
> "& 1" and "(char *)ptr-(uintptr_t)ptr" as examples of such.
>
> Making that just an extension of your existing "& 0" language would
> seem to be natural.

Works for me! I added the following bullet to the list of things
that break dependencies:

If a pointer is part of a dependency chain, and if the values
added to or subtracted from that pointer cancel the pointer
value so as to allow the compiler to precisely determine the
resulting value, then the resulting value will not be part of
any dependency chain. For example, if p is part of a dependency
chain, then ((char *)p-(uintptr_t)p)+65536 will not be.

Seem reasonable?

> Humans will understand, and compiler writers won't care. They will
> either depend on hardware semantics anyway (and argue that your
> language is tight enough that they don't need to do anything special)
> or they will turn the consume into an acquire (on platforms that have
> too weak hardware).

Agreed. Plus Core Working Group will hammer out the exact wording,
should this approach meet their approval.

Thanx, Paul

2015-05-20 07:35:08

by Jens Maurer

Subject: Re: [c++std-parallel-1614] Re: Compilers and RCU readers: Once more unto the breach!

On 05/20/2015 04:34 AM, Paul E. McKenney wrote:
> On Tue, May 19, 2015 at 06:57:02PM -0700, Linus Torvalds wrote:

>> - the "you can add/subtract integral values" still opens you up to
>> language lawyers claiming "(char *)ptr - (intptr_t)ptr" preserving the
>> dependency, which it clearly doesn't. But language-lawyering it does,
>> since all those operations (cast to pointer, cast to integer,
>> subtracting an integer) claim to be dependency-preserving operations.

[...]

> There are some stranger examples, such as "(char *)ptr - ((intptr_t)ptr)/7",
> but in that case, if the resulting pointer happens by chance to reference
> valid memory, I believe a dependency would still be carried.
[...]

From a language lawyer standpoint, pointer arithmetic is only valid
within an array. These examples seem to go beyond the bounds of the
array and therefore have undefined behavior.

C++ standard section 5.7 paragraph 4
"If both the pointer operand and the result point to elements of the
same array object, or one past the last element of the array object,
the evaluation shall not produce an overflow; otherwise, the behavior
is undefined."

C99 and C11
identical phrasing in 6.5.6 paragraph 8

Jens

2015-05-20 09:03:09

by Richard Biener

Subject: Re: [c++std-parallel-1614] Re: Compilers and RCU readers: Once more unto the breach!

On Wed, May 20, 2015 at 9:34 AM, Jens Maurer <[email protected]> wrote:
> On 05/20/2015 04:34 AM, Paul E. McKenney wrote:
>> On Tue, May 19, 2015 at 06:57:02PM -0700, Linus Torvalds wrote:
>
>>> - the "you can add/subtract integral values" still opens you up to
>>> language lawyers claiming "(char *)ptr - (intptr_t)ptr" preserving the
>>> dependency, which it clearly doesn't. But language-lawyering it does,
>>> since all those operations (cast to pointer, cast to integer,
>>> subtracting an integer) claim to be dependency-preserving operations.
>
> [...]
>
>> There are some stranger examples, such as "(char *)ptr - ((intptr_t)ptr)/7",
>> but in that case, if the resulting pointer happens by chance to reference
>> valid memory, I believe a dependency would still be carried.
> [...]
>
> From a language lawyer standpoint, pointer arithmetic is only valid
> within an array. These examples seem to go beyond the bounds of the
> array and therefore have undefined behavior.
>
> C++ standard section 5.7 paragraph 4
> "If both the pointer operand and the result point to elements of the
> same array object, or one past the last element of the array object,
> the evaluation shall not produce an overflow; otherwise, the behavior
> is undefined."
>
> C99 and C11
> identical phrasing in 6.5.6 paragraph 8

Of course you can try to circumvent that by doing
(char*)((intptr_t)ptr - (intptr_t)ptr + (intptr_t)ptr)
(see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65752 for extra fun).

Which (IMHO) gets you into the standard language that only makes conversion of
the exact same integer back to a pointer well-defined(?)

Richard.

> Jens

2015-05-20 11:47:53

by Will Deacon

Subject: Re: Compilers and RCU readers: Once more unto the breach!

Hi Paul,

On Wed, May 20, 2015 at 03:41:48AM +0100, Paul E. McKenney wrote:
> On Tue, May 19, 2015 at 07:10:12PM -0700, Linus Torvalds wrote:
> > On Tue, May 19, 2015 at 6:57 PM, Linus Torvalds
> > <[email protected]> wrote:
> > So I think you're better off just saying that operations designed to
> > drop significant bits break the dependency chain, and give things like
> > "& 1" and "(char *)ptr-(uintptr_t)ptr" as examples of such.
> >
> > Making that just an extension of your existing "& 0" language would
> > seem to be natural.
>
> Works for me! I added the following bullet to the list of things
> that break dependencies:
>
> If a pointer is part of a dependency chain, and if the values
> added to or subtracted from that pointer cancel the pointer
> value so as to allow the compiler to precisely determine the
> resulting value, then the resulting value will not be part of
> any dependency chain. For example, if p is part of a dependency
> chain, then ((char *)p-(uintptr_t)p)+65536 will not be.
>
> Seem reasonable?

Whilst I understand what you're saying (the ARM architecture makes these
sorts of distinctions when calling out dependency-based ordering), it
feels like we're dangerously close to defining the difference between a
true and a false dependency. If we want to do this in the context of the
C language specification, you run into issues because you need to evaluate
the program in order to determine data values in order to determine the
nature of the dependency.

You tackle this above by saying "to allow the compiler to precisely
determine the resulting value", but I can't see how that can be cleanly
fitted into something like the C language specification. Even if it can,
then we'd need to reword the "?:" treatment that you currently have:

"If a pointer is part of a dependency chain, and that pointer appears
in the entry of a ?: expression selected by the condition, then the
chain extends to the result."

which I think requires the state of the condition to be known statically
if we only want to extend the chain from the selected expression. In the
general case, wouldn't a compiler have to assume that the chain is
extended from both?

Additionally, what about the following code?

char *x = y ? z : z;

Does that extend a dependency chain from z to x? If so, I can imagine a
CPU breaking that in practice.

> > Humans will understand, and compiler writers won't care. They will
> > either depend on hardware semantics anyway (and argue that your
> > language is tight enough that they don't need to do anything special)
> > or they will turn the consume into an acquire (on platforms that have
> > too weak hardware).
>
> Agreed. Plus Core Working Group will hammer out the exact wording,
> should this approach meet their approval.

For the avoidance of doubt, I'm completely behind any attempts to tackle
this problem, but I anticipate an uphill struggle getting this text into
the C standard. Is your intention to change the carries-a-dependency
relation to encompass this change?

Cheers,

Will

2015-05-20 12:01:21

by Paul E. McKenney

Subject: Re: [c++std-parallel-1616] Re: Compilers and RCU readers: Once more unto the breach!

On Wed, May 20, 2015 at 09:34:10AM +0200, Jens Maurer wrote:
> On 05/20/2015 04:34 AM, Paul E. McKenney wrote:
> > On Tue, May 19, 2015 at 06:57:02PM -0700, Linus Torvalds wrote:
>
> >> - the "you can add/subtract integral values" still opens you up to
> >> language lawyers claiming "(char *)ptr - (intptr_t)ptr" preserving the
> >> dependency, which it clearly doesn't. But language-lawyering it does,
> >> since all those operations (cast to pointer, cast to integer,
> >> subtracting an integer) claim to be dependency-preserving operations.
>
> [...]
>
> > There are some stranger examples, such as "(char *)ptr - ((intptr_t)ptr)/7",
> > but in that case, if the resulting pointer happens by chance to reference
> > valid memory, I believe a dependency would still be carried.
> [...]
>
> From a language lawyer standpoint, pointer arithmetic is only valid
> within an array. These examples seem to go beyond the bounds of the
> array and therefore have undefined behavior.
>
> C++ standard section 5.7 paragraph 4
> "If both the pointer operand and the result point to elements of the
> same array object, or one past the last element of the array object,
> the evaluation shall not produce an overflow; otherwise, the behavior
> is undefined."
>
> C99 and C11
> identical phrasing in 6.5.6 paragraph 8

Even better! I added a footnote calling out these two paragraphs.

Thanx, Paul

2015-05-20 12:02:37

by Paul E. McKenney

Subject: Re: [c++std-parallel-1614] Re: Compilers and RCU readers: Once more unto the breach!

On Wed, May 20, 2015 at 11:03:00AM +0200, Richard Biener wrote:
> On Wed, May 20, 2015 at 9:34 AM, Jens Maurer <[email protected]> wrote:
> > On 05/20/2015 04:34 AM, Paul E. McKenney wrote:
> >> On Tue, May 19, 2015 at 06:57:02PM -0700, Linus Torvalds wrote:
> >
> >>> - the "you can add/subtract integral values" still opens you up to
> >>> language lawyers claiming "(char *)ptr - (intptr_t)ptr" preserving the
> >>> dependency, which it clearly doesn't. But language-lawyering it does,
> >>> since all those operations (cast to pointer, cast to integer,
> >>> subtracting an integer) claim to be dependency-preserving operations.
> >
> > [...]
> >
> >> There are some stranger examples, such as "(char *)ptr - ((intptr_t)ptr)/7",
> >> but in that case, if the resulting pointer happens by chance to reference
> >> valid memory, I believe a dependency would still be carried.
> > [...]
> >
> > From a language lawyer standpoint, pointer arithmetic is only valid
> > within an array. These examples seem to go beyond the bounds of the
> > array and therefore have undefined behavior.
> >
> > C++ standard section 5.7 paragraph 4
> > "If both the pointer operand and the result point to elements of the
> > same array object, or one past the last element of the array object,
> > the evaluation shall not produce an overflow; otherwise, the behavior
> > is undefined."
> >
> > C99 and C11
> > identical phrasing in 6.5.6 paragraph 8
>
> Of course you can try to circumvent that by doing
> (char*)((intptr_t)ptr - (intptr_t)ptr + (intptr_t)ptr)
> (see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65752 for extra fun).
>
> Which (IMHO) gets you into the standard language that only makes conversion of
> the exact same integer back to a pointer well-defined(?)

I am feeling good about leaving the restriction and calling out
the two paragraphs in a footnote, then. ;-)

Thanx, Paul

2015-05-20 12:23:19

by Paul E. McKenney

Subject: Re: Compilers and RCU readers: Once more unto the breach!

On Wed, May 20, 2015 at 12:47:45PM +0100, Will Deacon wrote:
> Hi Paul,
>
> On Wed, May 20, 2015 at 03:41:48AM +0100, Paul E. McKenney wrote:
> > On Tue, May 19, 2015 at 07:10:12PM -0700, Linus Torvalds wrote:
> > > On Tue, May 19, 2015 at 6:57 PM, Linus Torvalds
> > > <[email protected]> wrote:
> > > So I think you're better off just saying that operations designed to
> > > drop significant bits break the dependency chain, and give things like
> > > "& 1" and "(char *)ptr-(uintptr_t)ptr" as examples of such.
> > >
> > > Making that just an extension of your existing "& 0" language would
> > > seem to be natural.
> >
> > Works for me! I added the following bullet to the list of things
> > that break dependencies:
> >
> > If a pointer is part of a dependency chain, and if the values
> > added to or subtracted from that pointer cancel the pointer
> > value so as to allow the compiler to precisely determine the
> > resulting value, then the resulting value will not be part of
> > any dependency chain. For example, if p is part of a dependency
> > chain, then ((char *)p-(uintptr_t)p)+65536 will not be.
> >
> > Seem reasonable?
>
> Whilst I understand what you're saying (the ARM architecture makes these
> sorts of distinctions when calling out dependency-based ordering), it
> feels like we're dangerously close to defining the difference between a
> true and a false dependency. If we want to do this in the context of the
> C language specification, you run into issues because you need to evaluate
> the program in order to determine data values in order to determine the
> nature of the dependency.

Indeed, something like this does -not- carry a dependency from the
memory_order_consume load to q:

char *p, *q;

p = atomic_load_explicit(&gp, memory_order_consume);
q = gq + (intptr_t)p - (intptr_t)p;

If this was compiled with -O0, ARM and Power might well carry a
dependency, but given any optimization, the assembly language would have
no hint of any such dependency. So I am not seeing any particular danger.

> You tackle this above by saying "to allow the compiler to precisely
> determine the resulting value", but I can't see how that can be cleanly
> fitted into something like the C language specification.

I am sure that there will be significant rework from where this document
is to language appropriate from the standard. Which is why I am glad
that Jens is taking an interest in this, as he is particularly good at
producing standards language.

> Even if it can,
> then we'd need to reword the "?:" treatment that you currently have:
>
> "If a pointer is part of a dependency chain, and that pointer appears
> in the entry of a ?: expression selected by the condition, then the
> chain extends to the result."
>
> which I think requires the state of the condition to be known statically
> if we only want to extend the chain from the selected expression. In the
> general case, wouldn't a compiler have to assume that the chain is
> extended from both?

In practice, yes, if the compiler cannot determine which expression is
selected, it must arrange for the dependency to be carried from either,
depending on the run-time value of the condition. But you would have
to work pretty hard to create code that did not carry the dependencies
as required, no?

> Additionally, what about the following code?
>
> char *x = y ? z : z;
>
> Does that extend a dependency chain from z to x? If so, I can imagine a
> CPU breaking that in practice.

I am not seeing this. I would expect the compiler to optimize to
something like this:

char *x = z;

How does this avoid carrying the dependency? Or are you saying that
ARM loses the dependency via a store to memory and a later reload?
That would be a bit surprising...

> > > Humans will understand, and compiler writers won't care. They will
> > > either depend on hardware semantics anyway (and argue that your
> > > language is tight enough that they don't need to do anything special)
> > > or they will turn the consume into an acquire (on platforms that have
> > > too weak hardware).
> >
> > Agreed. Plus Core Working Group will hammer out the exact wording,
> > should this approach meet their approval.
>
> For the avoidance of doubt, I'm completely behind any attempts to tackle
> this problem, but I anticipate an uphill struggle getting this text into
> the C standard. Is your intention to change the carries-a-dependency
> relation to encompass this change?

I completely agree that this won't be easy, but this is the task at hand.
And yes, the intent is to change carries-a-dependency, given that the
current wording isn't helping anything. ;-)

Thanx, Paul

2015-05-20 13:18:49

by David Howells

Subject: Re: Compilers and RCU readers: Once more unto the breach!

Paul E. McKenney <[email protected]> wrote:

> > Additionally, what about the following code?
> >
> > char *x = y ? z : z;
> >
> > Does that extend a dependency chain from z to x? If so, I can imagine a
> > CPU breaking that in practice.
>
> I am not seeing this. I would expect the compiler to optimize to
> something like this:
>
> char *x = z;

Why? What if y has a potential side-effect (say it makes a function call)?

David

2015-05-20 13:30:47

by Paul E. McKenney

Subject: Re: Compilers and RCU readers: Once more unto the breach!

On Wed, May 20, 2015 at 02:18:37PM +0100, David Howells wrote:
> Paul E. McKenney <[email protected]> wrote:
>
> > > Additionally, what about the following code?
> > >
> > > char *x = y ? z : z;
> > >
> > > Does that extend a dependency chain from z to x? If so, I can imagine a
> > > CPU breaking that in practice.
> >
> > I am not seeing this. I would expect the compiler to optimize to
> > something like this:
> >
> > char *x = z;
>
> Why? What if y has a potential side-effect (say it makes a function call)?

I was thinking of "y" as a simple variable, but if it is something more
complex, then the compiler could do this, right?

char *x;

y;
x = z;

Thanx, Paul

2015-05-20 13:38:50

by David Howells

Subject: Re: Compilers and RCU readers: Once more unto the breach!

Paul E. McKenney <[email protected]> wrote:

> I was thinking of "y" as a simple variable, but if it is something more
> complex, then the compiler could do this, right?
>
> char *x;
>
> y;
> x = z;

Yeah. I presume it has to maintain the ordering, though.

David

2015-05-20 13:44:38

by Ramana Radhakrishnan

Subject: Re: Compilers and RCU readers: Once more unto the breach!



On 20/05/15 14:37, David Howells wrote:
> Paul E. McKenney <[email protected]> wrote:
>
>> I was thinking of "y" as a simple variable, but if it is something more
>> complex, then the compiler could do this, right?
>>
>> char *x;
>>
>> y;
>> x = z;
>
> Yeah. I presume it has to maintain the ordering, though.

The scheduler for e.g. is free to reorder if it can prove there is no
dependence (or indeed side-effects for y) between insns produced for y
and `x = z'.

regards
Ramana

>
> David
>

2015-05-20 14:03:44

by Paul E. McKenney

Subject: Re: [c++std-parallel-1624] Re: Compilers and RCU readers: Once more unto the breach!

On Wed, May 20, 2015 at 02:37:05PM +0100, David Howells wrote:
> Paul E. McKenney <[email protected]> wrote:
>
> > I was thinking of "y" as a simple variable, but if it is something more
> > complex, then the compiler could do this, right?
> >
> > char *x;
> >
> > y;
> > x = z;
>
> Yeah. I presume it has to maintain the ordering, though.

Agreed. Unless of course y writes to x or some such.

Given that there is already code in the Linux kernel relying on
dependencies being carried through stores to local variables,
this should not be a problem.

Or am I missing something?

Thanx, Paul

2015-05-20 14:04:11

by Paul E. McKenney

Subject: Re: Compilers and RCU readers: Once more unto the breach!

On Wed, May 20, 2015 at 02:44:30PM +0100, Ramana Radhakrishnan wrote:
>
>
> On 20/05/15 14:37, David Howells wrote:
> >Paul E. McKenney <[email protected]> wrote:
> >
> >>I was thinking of "y" as a simple variable, but if it is something more
> >>complex, then the compiler could do this, right?
> >>
> >> char *x;
> >>
> >> y;
> >> x = z;
> >
> >Yeah. I presume it has to maintain the ordering, though.
>
> The scheduler for e.g. is free to reorder if it can prove there is
> no dependence (or indeed side-effects for y) between insns produced
> for y and `x = z'.

So for example, if y is independent of z, the compiler can do the
following:

char *x;

x = z;
y;

But the dependency ordering is still maintained from z to x, so this
is not a problem.

Or am I missing something subtle here?

Thanx, Paul

2015-05-20 14:15:52

by Ramana Radhakrishnan

Subject: Re: Compilers and RCU readers: Once more unto the breach!



On 20/05/15 15:03, Paul E. McKenney wrote:
> On Wed, May 20, 2015 at 02:44:30PM +0100, Ramana Radhakrishnan wrote:
>>
>>
>> On 20/05/15 14:37, David Howells wrote:
>>> Paul E. McKenney <[email protected]> wrote:
>>>
>>>> I was thinking of "y" as a simple variable, but if it is something more
>>>> complex, then the compiler could do this, right?
>>>>
>>>> char *x;
>>>>
>>>> y;
>>>> x = z;
>>>
>>> Yeah. I presume it has to maintain the ordering, though.
>>
>> The scheduler for e.g. is free to reorder if it can prove there is
>> no dependence (or indeed side-effects for y) between insns produced
>> for y and `x = z'.
>
> So for example, if y is independent of z, the compiler can do the
> following:
>
> char *x;
>
> x = z;
> y;
>
> But the dependency ordering is still maintained from z to x, so this
> is not a problem.


Well, reads of x, if any (assuming x was initialized elsewhere), would
need to happen before x got assigned z.

I understood the original "maintain the ordering" as between the
statements `x = z' and `y'.


>
> Or am I missing something subtle here?

No, it sounds like we are on the same page here.

regards
Ramana

>
> Thanx, Paul
>

2015-05-20 15:18:35

by Paul E. McKenney

Subject: Re: Compilers and RCU readers: Once more unto the breach!

On Wed, May 20, 2015 at 03:15:48PM +0100, Ramana Radhakrishnan wrote:
>
>
> On 20/05/15 15:03, Paul E. McKenney wrote:
> >On Wed, May 20, 2015 at 02:44:30PM +0100, Ramana Radhakrishnan wrote:
> >>
> >>
> >>On 20/05/15 14:37, David Howells wrote:
> >>>Paul E. McKenney <[email protected]> wrote:
> >>>
> >>>>I was thinking of "y" as a simple variable, but if it is something more
> >>>>complex, then the compiler could do this, right?
> >>>>
> >>>> char *x;
> >>>>
> >>>> y;
> >>>> x = z;
> >>>
> >>>Yeah. I presume it has to maintain the ordering, though.
> >>
> >>The scheduler for e.g. is free to reorder if it can prove there is
> >>no dependence (or indeed side-effects for y) between insns produced
> >>for y and `x = z'.
> >
> >So for example, if y is independent of z, the compiler can do the
> >following:
> >
> > char *x;
> >
> > x = z;
> > y;
> >
> >But the dependency ordering is still maintained from z to x, so this
> >is not a problem.
>
>
> Well, reads of x, if any (assuming x was initialized elsewhere), would
> need to happen before x got assigned z.

Agreed, there needs to be a memory_order_consume load up there somewhere.
(AKA rcu_dereference().)

> I understood the original "maintain the ordering" as between the
> statements `x = z' and `y'.

Ah, I was assuming between x and z. David, what was your intent? ;-)

> >Or am I missing something subtle here?
>
> No, it sounds like we are on the same page here.

Whew! ;-)

Thanx, Paul

2015-05-20 15:46:15

by David Howells

Subject: Re: Compilers and RCU readers: Once more unto the breach!

Paul E. McKenney <[email protected]> wrote:

> Ah, I was assuming between x and z. David, what was your intent? ;-)

Clarification.

David

2015-05-20 15:46:26

by Will Deacon

Subject: Re: Compilers and RCU readers: Once more unto the breach!

On Wed, May 20, 2015 at 01:15:22PM +0100, Paul E. McKenney wrote:
> On Wed, May 20, 2015 at 12:47:45PM +0100, Will Deacon wrote:
> > On Wed, May 20, 2015 at 03:41:48AM +0100, Paul E. McKenney wrote:
> > > If a pointer is part of a dependency chain, and if the values
> > > added to or subtracted from that pointer cancel the pointer
> > > value so as to allow the compiler to precisely determine the
> > > resulting value, then the resulting value will not be part of
> > > any dependency chain. For example, if p is part of a dependency
> > > chain, then ((char *)p-(uintptr_t)p)+65536 will not be.
> > >
> > > Seem reasonable?
> >
> > Whilst I understand what you're saying (the ARM architecture makes these
> > sorts of distinctions when calling out dependency-based ordering), it
> > feels like we're dangerously close to defining the difference between a
> > true and a false dependency. If we want to do this in the context of the
> > C language specification, you run into issues because you need to evaluate
> > the program in order to determine data values in order to determine the
> > nature of the dependency.
>
> Indeed, something like this does -not- carry a dependency from the
> memory_order_consume load to q:
>
> char *p, *q;
>
> p = atomic_load_explicit(&gp, memory_order_consume);
> q = gq + (intptr_t)p - (intptr_t)p;
>
> If this was compiled with -O0, ARM and Power might well carry a
> dependency, but given any optimization, the assembly language would have
> no hint of any such dependency. So I am not seeing any particular danger.

The above is a welcome relaxation over C11, since ARM doesn't even give
you ordering based off false data dependencies. My concern is more to do
with how this can be specified precisely without prohibiting honest compiler
and hardware optimisations.

Out of interest, how do you tackle examples (4) and (5) of (assuming the
reads are promoted to consume loads)?:

http://www.cl.cam.ac.uk/~pes20/cpp/notes42.html

my understanding is that you permit both outcomes (I appreciate you're
not directly tackling out-of-thin-air, but treatment of dependencies
is heavily related).

> > You tackle this above by saying "to allow the compiler to precisely
> > determine the resulting value", but I can't see how that can be cleanly
> > fitted into something like the C language specification.
>
> I am sure that there will be significant rework from where this document
> is to language appropriate for the standard. Which is why I am glad
> that Jens is taking an interest in this, as he is particularly good at
> producing standards language.

Ok. I'm curious to see how that comes along.

> > Even if it can,
> > then we'd need to reword the "?:" treatment that you currently have:
> >
> > "If a pointer is part of a dependency chain, and that pointer appears
> > in the entry of a ?: expression selected by the condition, then the
> > chain extends to the result."
> >
> > which I think requires the state of the condition to be known statically
> > if we only want to extend the chain from the selected expression. In the
> > general case, wouldn't a compiler have to assume that the chain is
> > extended from both?
>
> In practice, yes, if the compiler cannot determine which expression is
> selected, it must arrange for the dependency to be carried from either,
> depending on the run-time value of the condition. But you would have
> to work pretty hard to create code that did not carry the dependencies
> as required, no?

I'm not sure... you'd require the compiler to perform static analysis of
loops to determine the state of the machine when they exit (if they exit!)
in order to show whether or not a dependency is carried to subsequent
operations. If it can't prove otherwise, it would have to assume that a
dependency *is* carried, and it's not clear to me how it would use this
information to restrict any subsequent dependency removing optimisations.

I guess that's one for the GCC folks.

> > Additionally, what about the following code?
> >
> > char *x = y ? z : z;
> >
> > Does that extend a dependency chain from z to x? If so, I can imagine a
> > CPU breaking that in practice.
>
> I am not seeing this. I would expect the compiler to optimize to
> something like this:
>
> char *x = z;
>
> How does this avoid carrying the dependency? Or are you saying that
> ARM loses the dependency via a store to memory and a later reload?
> That would be a bit surprising...

I was thinking that the compiler would have to preserve the conditional
structure so that the dependency chain could be tracked correctly, but
if it can just assume that the dependency is carried regardless of y then
I agree that it doesn't matter for this code. All the CPU could do is
remove the conditional hazard.

> > > > Humans will understand, and compiler writers won't care. They will
> > > > either depend on hardware semantics anyway (and argue that your
> > > > language is tight enough that they don't need to do anything special)
> > > > or they will turn the consume into an acquire (on platforms that have
> > > > too weak hardware).
> > >
> > > Agreed. Plus Core Working Group will hammer out the exact wording,
> > > should this approach meet their approval.
> >
> > For the avoidance of doubt, I'm completely behind any attempts to tackle
> > this problem, but I anticipate an uphill struggle getting this text into
> > the C standard. Is your intention to change the carries-a-dependency
> > relation to encompass this change?
>
> I completely agree that this won't be easy, but this is the task at hand.
> And yes, the intent is to change carries-a-dependency, given that the
> current wording isn't helping anything. ;-)

Agreed on that!

Will

2015-05-20 15:55:54

by Andrew Haley

Subject: Re: Compilers and RCU readers: Once more unto the breach!

On 05/20/2015 04:46 PM, Will Deacon wrote:
> I'm not sure... you'd require the compiler to perform static analysis of
> loops to determine the state of the machine when they exit (if they exit!)
> in order to show whether or not a dependency is carried to subsequent
> operations. If it can't prove otherwise, it would have to assume that a
> dependency *is* carried, and it's not clear to me how it would use this
> information to restrict any subsequent dependency removing optimisations.

It'd just convert consume to acquire.

Andrew.

2015-05-20 18:16:19

by Paul E. McKenney

Subject: Re: Compilers and RCU readers: Once more unto the breach!

On Wed, May 20, 2015 at 04:46:17PM +0100, Will Deacon wrote:
> On Wed, May 20, 2015 at 01:15:22PM +0100, Paul E. McKenney wrote:
> > On Wed, May 20, 2015 at 12:47:45PM +0100, Will Deacon wrote:
> > > On Wed, May 20, 2015 at 03:41:48AM +0100, Paul E. McKenney wrote:
> > > > If a pointer is part of a dependency chain, and if the values
> > > > added to or subtracted from that pointer cancel the pointer
> > > > value so as to allow the compiler to precisely determine the
> > > > resulting value, then the resulting value will not be part of
> > > > any dependency chain. For example, if p is part of a dependency
> > > > chain, then ((char *)p-(uintptr_t)p)+65536 will not be.
> > > >
> > > > Seem reasonable?
> > >
> > > Whilst I understand what you're saying (the ARM architecture makes these
> > > sorts of distinctions when calling out dependency-based ordering), it
> > > feels like we're dangerously close to defining the difference between a
> > > true and a false dependency. If we want to do this in the context of the
> > > C language specification, you run into issues because you need to evaluate
> > > the program in order to determine data values in order to determine the
> > > nature of the dependency.
> >
> > Indeed, something like this does -not- carry a dependency from the
> > memory_order_consume load to q:
> >
> > char *p, *q;
> >
> > p = atomic_load_explicit(&gp, memory_order_consume);
> > q = gq + (intptr_t)p - (intptr_t)p;
> >
> > If this was compiled with -O0, ARM and Power might well carry a
> > dependency, but given any optimization, the assembly language would have
> > no hint of any such dependency. So I am not seeing any particular danger.
>
> The above is a welcome relaxation over C11, since ARM doesn't even give
> you ordering based off false data dependencies. My concern is more to do
> with how this can be specified precisely without prohibiting honest compiler
> and hardware optimisations.

That last is the challenge. I believe that I am pretty close, but I am
sure that additional adjustment will be required. Especially given that
we also need the memory model to be amenable to formal analysis.

> Out of interest, how do you tackle examples (4) and (5) of (assuming the
> reads are promoted to consume loads)?:
>
> http://www.cl.cam.ac.uk/~pes20/cpp/notes42.html
>
> my understanding is that you permit both outcomes (I appreciate you're
> not directly tackling out-of-thin-air, but treatment of dependencies
> is heavily related).

Let's see... #4 is as follows, given promotion to memory_order_consume
and (I am guessing) memory_order_relaxed:

r1 = atomic_load_explicit(&x, memory_order_consume);
if (r1 == 42)
	atomic_store_explicit(&y, r1, memory_order_relaxed);
------------------------------------------------------
r2 = atomic_load_explicit(&y, memory_order_consume);
if (r2 == 42)
	atomic_store_explicit(&x, 42, memory_order_relaxed);
else
	atomic_store_explicit(&x, 42, memory_order_relaxed);

The second thread does not have a proper control dependency, even with
the memory_order_consume load because both branches assign the same
value to "x". This means that the compiler is within its rights to
optimize this into the following:

r1 = atomic_load_explicit(&x, memory_order_consume);
if (r1 == 42)
	atomic_store_explicit(&y, r1, memory_order_relaxed);
------------------------------------------------------
r2 = atomic_load_explicit(&y, memory_order_consume);
atomic_store_explicit(&x, 42, memory_order_relaxed);

There is no dependency between the second thread's pair of statements,
so both the compiler and the CPU are within their rights to optimize
further as follows:

r1 = atomic_load_explicit(&x, memory_order_consume);
if (r1 == 42)
	atomic_store_explicit(&y, r1, memory_order_relaxed);
------------------------------------------------------
atomic_store_explicit(&x, 42, memory_order_relaxed);
r2 = atomic_load_explicit(&y, memory_order_consume);

If the compiler makes this final optimization, even mythical SC hardware
is within its rights to end up with (r1 == 42 && r2 == 42). Which is
fine, as far as I am concerned. Or at least something that can be
lived with.

On to #5:

r1 = atomic_load_explicit(&x, memory_order_consume);
if (r1 == 42)
	atomic_store_explicit(&y, r1, memory_order_relaxed);
----------------------------------------------------
r2 = atomic_load_explicit(&y, memory_order_consume);
if (r2 == 42)
	atomic_store_explicit(&x, 42, memory_order_relaxed);

The first thread's accesses are dependency ordered. The second thread's
ordering is in a corner case that memory-barriers.txt does not cover.
You are supposed to start control dependencies with READ_ONCE_CTRL(), not
a memory_order_consume load (AKA rcu_dereference and friends). However,
Alpha would have a full barrier as part of the memory_order_consume load,
and the rest of the processors would (one way or another) respect the
control dependency. And the compiler would have some fun trying to
break it.
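
For reference, a rough sketch of how the second thread would be written
under those memory-barriers.txt rules, with the control dependency headed
by READ_ONCE_CTRL() (the then-current Linux-kernel primitive) rather than
by a memory_order_consume load:

	r2 = READ_ONCE_CTRL(y);
	if (r2 == 42)
		WRITE_ONCE(x, 42);	/* ordered after the load of y by the
					 * control dependency, the store being
					 * present on only one branch */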

So the current Linux memory model would allow (r1 == 42 && r2 == 42),
but I don't know of any hardware/compiler combination that would
allow it. And no, I am -not- going to update memory-barriers.txt for
this litmus test, its theoretical interest notwithstanding! ;-)

In summary, both #4 and #5 would be allowed, as modified above.
Seem reasonable?

> > > You tackle this above by saying "to allow the compiler to precisely
> > > determine the resulting value", but I can't see how that can be cleanly
> > > fitted into something like the C language specification.
> >
> > I am sure that there will be significant rework from where this document
> > is to language appropriate from the standard. Which is why I am glad
> > that Jens is taking an interest in this, as he is particularly good at
> > producing standards language.
>
> Ok. I'm curious to see how that comes along.

Me too...

> > > Even if it can,
> > > then we'd need to reword the "?:" treatment that you currently have:
> > >
> > > "If a pointer is part of a dependency chain, and that pointer appears
> > > in the entry of a ?: expression selected by the condition, then the
> > > chain extends to the result."
> > >
> > > which I think requires the state of the condition to be known statically
> > > if we only want to extend the chain from the selected expression. In the
> > > general case, wouldn't a compiler have to assume that the chain is
> > > extended from both?
> >
> > In practice, yes, if the compiler cannot determine which expression is
> > selected, it must arrange for the dependency to be carried from either,
> > depending on the run-time value of the condition. But you would have
> > to work pretty hard to create code that did not carry the dependencies
> > as required, no?
>
> I'm not sure... you'd require the compiler to perform static analysis of
> loops to determine the state of the machine when they exit (if they exit!)
> in order to show whether or not a dependency is carried to subsequent
> operations. If it can't prove otherwise, it would have to assume that a
> dependency *is* carried, and it's not clear to me how it would use this
> information to restrict any subsequent dependency removing optimisations.
>
> I guess that's one for the GCC folks.

The goal is to restrict the dependency carrying such that the compiler
folks don't need to care. (Yeah, I know, great goal and all that...)

> > > Additionally, what about the following code?
> > >
> > > char *x = y ? z : z;
> > >
> > > Does that extend a dependency chain from z to x? If so, I can imagine a
> > > CPU breaking that in practice.
> >
> > I am not seeing this. I would expect the compiler to optimize to
> > something like this:
> >
> > char *x = z;
> >
> > How does this avoid carrying the dependency? Or are you saying that
> > ARM loses the dependency via a store to memory and a later reload?
> > That would be a bit surprising...
>
> I was thinking that the compiler would have to preserve the conditional
> structure so that the dependency chain could be tracked correctly, but
> if it can just assume that the dependency is carried regardless of y then
> I agree that it doesn't matter for this code. All the CPU could do is
> remove the conditional hazard.

OK, sounds like we are on the same page on this one.

> > > > > Humans will understand, and compiler writers won't care. They will
> > > > > either depend on hardware semantics anyway (and argue that your
> > > > > language is tight enough that they don't need to do anything special)
> > > > > or they will turn the consume into an acquire (on platforms that have
> > > > > too weak hardware).
> > > >
> > > > Agreed. Plus Core Working Group will hammer out the exact wording,
> > > > should this approach meet their approval.
> > >
> > > For the avoidance of doubt, I'm completely behind any attempts to tackle
> > > this problem, but I anticipate an uphill struggle getting this text into
> > > the C standard. Is your intention to change the carries-a-dependency
> > > relation to encompass this change?
> >
> > I completely agree that this won't be easy, but this is the task at hand.
> > And yes, the intent is to change carries-a-dependency, given that the
> > current wording isn't helping anything. ;-)
>
> Agreed on that!

;-) ;-) ;-)

Thanx, Paul

2015-05-20 18:16:58

by Paul E. McKenney

Subject: Re: [c++std-parallel-1632] Re: Compilers and RCU readers: Once more unto the breach!

On Wed, May 20, 2015 at 04:54:51PM +0100, Andrew Haley wrote:
> On 05/20/2015 04:46 PM, Will Deacon wrote:
> > I'm not sure... you'd require the compiler to perform static analysis of
> > loops to determine the state of the machine when they exit (if they exit!)
> > in order to show whether or not a dependency is carried to subsequent
> > operations. If it can't prove otherwise, it would have to assume that a
> > dependency *is* carried, and it's not clear to me how it would use this
> > information to restrict any subsequent dependency removing optimisations.
>
> It'd just convert consume to acquire.

It should not need to, actually.

Thanx, Paul

2015-05-21 14:22:45

by Michael Matz

Subject: Re: [c++std-parallel-1632] Re: Compilers and RCU readers: Once more unto the breach!

Hi,

On Wed, 20 May 2015, Paul E. McKenney wrote:

> > > I'm not sure... you'd require the compiler to perform static analysis of
> > > loops to determine the state of the machine when they exit (if they exit!)
> > > in order to show whether or not a dependency is carried to subsequent
> > > operations. If it can't prove otherwise, it would have to assume that a
> > > dependency *is* carried, and it's not clear to me how it would use this
> > > information to restrict any subsequent dependency removing optimisations.
> >
> > It'd just convert consume to acquire.
>
> It should not need to, actually.

[with GCC hat, and having only lightly read your document]

Then you need to provide language or at least informal reasons why the
compiler is allowed to not do that. Without that a compiler would have to
be conservative, if it can't _prove_ that a dependency chain is stopped,
then it has to assume it hasn't.

For instance I can't really easily make out what your document says about
the following simple situation (well, actually I have difficulty
distinguishing between what you're proposing as the good new model of all
this, and what you're merely describing as different current states of
affairs):

char * fancy_assign (char *in) { return in; }
...
char *x, *y;

x = atomic_load_explicit(p, memory_order_consume);
y = fancy_assign (x);
atomic_store_explicit(q, y, memory_order_relaxed);

So, is there, or is there not a dependency carried from x to y in your
proposed model (and which rule in your document states so)? Clearly,
without any other language the compiler would have to assume that there is
(because the equivalent 'y = x' assignment would carry the dependency).

If it has to assume this, then the whole model is not going to work very
well, as usual with models that assume a certain less-optimal fact
("carries-dep" is less optimal for code generation purposes that
"not-carries-dep") unless very specific circumstances say it can be
ignored.


Ciao,
Michael.

2015-05-21 15:10:20

by Paul E. McKenney

Subject: Re: [c++std-parallel-1632] Re: Compilers and RCU readers: Once more unto the breach!

On Thu, May 21, 2015 at 04:22:38PM +0200, Michael Matz wrote:
> Hi,
>
> On Wed, 20 May 2015, Paul E. McKenney wrote:
>
> > > > I'm not sure... you'd require the compiler to perform static analysis of
> > > > loops to determine the state of the machine when they exit (if they exit!)
> > > > in order to show whether or not a dependency is carried to subsequent
> > > > operations. If it can't prove otherwise, it would have to assume that a
> > > > dependency *is* carried, and it's not clear to me how it would use this
> > > > information to restrict any subsequent dependency removing optimisations.
> > >
> > > It'd just convert consume to acquire.
> >
> > It should not need to, actually.
>
> [with GCC hat, and having only lightly read your document]

Understood.

> Then you need to provide language or at least informal reasons why the
> compiler is allowed to not do that. Without that a compiler would have to
> be conservative, if it can't _prove_ that a dependency chain is stopped,
> then it has to assume it hasn't.
>
> For instance I can't really easily make out what your document says about
> the following simple situation (well, actually I have difficulty
> distinguishing between what you're proposing as the good new model of all
> this, and what you're merely describing as different current states of
> affairs):

The point is -exactly- to codify the current state of affairs. I expect
a follow-on effort to specify some sort of marking regimen, as noted in
the last paragraph of 7.9 and as discussed with Torvald Riegel. However,
given that there are not yet any implementations or practical experience
with such markings, I suspect that some time will be required to hammer
out a good marking scheme.

> char * fancy_assign (char *in) { return in; }
> ...
> char *x, *y;
>
> x = atomic_load_explicit(p, memory_order_consume);
> y = fancy_assign (x);
> atomic_store_explicit(q, y, memory_order_relaxed);
>
> So, is there, or is there not a dependency carried from x to y in your
> proposed model (and which rule in your document states so)? Clearly,
> without any other language the compiler would have to assume that there is
> (because the equivalent 'y = x' assignment would carry the dependency).

The dependency is not carried, though this is due to the current set
of rules not covering atomic loads and stores, which I need to fix.
Here is the sequence of events:

o A memory_order_consume load heads a dependency chain.

o Rule 2 says that if a value is part of a dependency chain and
is used as the right-hand side of an assignment operator,
the expression extends the chain to cover the assignment.
And I switched to numbered bullet items here:
http://www2.rdrop.com/users/paulmck/RCU/consume.2015.05.21a.pdf

o Rule 14 says that if a value is part of a dependency chain and
is used as the actual parameter of a function call, then the
dependency chain extends to the corresponding formal parameter,
namely "in" of fancy_assign().

o Rule 15 says that if a value is part of a dependency chain and
is returned from a function, then the dependency chain extends
to the returned value in the calling function.

o And you are right. I need to make the first and second rules
cover the relaxed atomic operations, or at least atomic loads and
stores. Not that this is an issue for existing Linux-kernel code.

But given such a change, the new version of rule 2 would
extend the dependency chain to cover the atomic_store_explicit().

> If it has to assume this, then the whole model is not going to work very
> well, as usual with models that assume a certain less-optimal fact
> ("carries-dep" is less optimal for code generation purposes that
> "not-carries-dep") unless very specific circumstances say it can be
> ignored.

Although that is a good general rule of thumb, I do not believe that it
applies to this situation, with the exception that I do indeed assume
that no one is insane enough to do value-speculation optimizations for
non-NULL values on loads from pointers.
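
For concreteness, the sort of value-speculation transformation being assumed
away would look something like the following (cached_foo, gp, and ->val are
made-up names for the illustration):

	/* Original code: the load from p->val is dependency ordered. */
	p = atomic_load_explicit(&gp, memory_order_consume);
	a = p->val;

	/* Hypothetical value-speculating transformation: */
	p = atomic_load_explicit(&gp, memory_order_consume);
	if (p == &cached_foo)
		a = cached_foo.val;	/* load no longer depends on p: ordering lost */
	else
		a = p->val;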

So what am I missing here? Do you have a specific example where the
compiler would need to suppress a production-quality optimization?

Thanx, Paul

2015-05-21 16:17:51

by Michael Matz

Subject: Re: [c++std-parallel-1632] Re: Compilers and RCU readers: Once more unto the breach!

Hi,

On Thu, 21 May 2015, Paul E. McKenney wrote:

> The point is -exactly- to codify the current state of affairs.

Ah, I see, so it's not yet about creating a more useful (for compilers,
that is) model.

> > char * fancy_assign (char *in) { return in; }
> > ...
> > char *x, *y;
> >
> > x = atomic_load_explicit(p, memory_order_consume);
> > y = fancy_assign (x);
> > atomic_store_explicit(q, y, memory_order_relaxed);
> >
> > So, is there, or is there not a dependency carried from x to y in your
> > proposed model (and which rule in your document states so)? Clearly,
> > without any other language the compiler would have to assume that there is
> > (because the equivalent 'y = x' assignment would carry the dependency).
>
> The dependency is not carried, though this is due to the current set
> of rules not covering atomic loads and stores, which I need to fix.

Okay, so with the current regime(s), the dependency carries ...

> o Rule 14 says that if a value is part of a dependency chain and
> is used as the actual parameter of a function call, then the
> dependency chain extends to the corresponding formal parameter,
> namely "in" of fancy_assign().
>
> o Rule 15 says that if a value is part of a dependency chain and
> is returned from a function, then the dependency chain extends
> to the returned value in the calling function.
>
> o And you are right. I need to make the first and second rules
> cover the relaxed atomic operations, or at least atomic loads and
> stores. Not that this is an issue for existing Linux-kernel code.
>
> But given such a change, the new version of rule 2 would
> extend the dependency chain to cover the atomic_store_explicit().

... (if this detail were fixed). Okay, that's quite awful ...

> > If it has to assume this, then the whole model is not going to work
> > very well, as usual with models that assume a certain less-optimal
> > fact ("carries-dep" is less optimal for code generation purposes than
> > "not-carries-dep") unless very specific circumstances say it can be
> > ignored.
>
> Although that is a good general rule of thumb, I do not believe that it
> applies to this situation, with the exception that I do indeed assume
> that no one is insane enough to do value-speculation optimizations for
> non-NULL values on loads from pointers.
>
> So what am I missing here?

... because you are then missing that if "carries-dep" can flow through
function calls from arguments to return values by default, the compiler
has to assume this in fact always happens when it can't see the function
body, or can't analyze it. In effect that's making the whole "carries-dep
stops at these and those uses" a useless exercise because a malicious
user (malicious in the sense of abusing the model to show that it's
hindering optimizations), i.e. me, can hide all such carries-dep stopping
effects inside a function, et voila, the dependency carries through. So
for a slightly simpler example:

extern void *foo (void *); // body not available
x = load
y = foo (x);
store (y)

the compiler has to assume that there's a dep-chain from x to y; always.
What's worse, it also has to assume a carries-dep for this:

extern void foo (void *in, void **out1, void **out2);
x = load
foo (x, &o1, &o2);
store (o1);
store (o2);

Now the compiler has to assume that the body of 'foo' is just mean enough
to make the dep-chain carry from in to *out1 or *out2 (i.e. it has to
assume that for both). This extends to _all_ memory accessible from foo's
body, i.e. generally all global and all local address-taken variables, so
as soon as you have a function call into which a dep-chain value flows
you're creating a dep-chain extension from that value to each and every
global piece of memory, because the compiler cannot assume that the
black box called foo is not mean. This could conceivably be stopped by
making normal stores not carry the dependency; then only the return
value might be infected; but I don't see that in your rules, as a "normal
store" is just an assignment in your model and hence rules 1 and 2 apply
(that is, carries-dep flows through all assignments, incl. loads and
stores).
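
Spelled out with C11 atomics (just a sketch; gp, gq1, gq2 and caller()
are made-up names), the out-parameter case looks like this:

#include <stdatomic.h>

extern void foo (void *in, void **out1, void **out2);  /* body not available */

_Atomic(void *) gp, gq1, gq2;

void caller (void)
{
        void *x = atomic_load_explicit(&gp, memory_order_consume);
        void *o1, *o2;

        foo (x, &o1, &o2);      /* black box: may route x into *out1 and *out2 */

        /* Under the rules as written, the compiler must assume o1 and o2
           may carry the dependency from x, and so must preserve it through
           these stores. */
        atomic_store_explicit(&gq1, o1, memory_order_relaxed);
        atomic_store_explicit(&gq2, o2, memory_order_relaxed);
}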

Basically whenever you can construct black boxes for the compiler, you
have to limit their effects on such transitive relations like carries-dep
by default, at the border of such black boxes; otherwise that transitive
relation quickly becomes a most-x-everything relation (i.e.
most things carries-dep to everything), and as such is then totally
useless, because such a universally filled relation (like an empty one)
doesn't bear any interesting information, at which point it then is
questionable why the compiler should jump through hoops to analyse the few
cases that would be allowed to stop the carries-dep flow, when it more
often than not has to give up anyway and generate slow code.

> Do you have a specific example where the compiler would need to suppress
> a production-quality optimization?

I can't say; I haven't completely grokked all details of the different sub
mem-models and their interaction with compiler optimizations, and when to
convert consume to acquire. But I do know that transitive relations that
carry a code generation cost (i.e. that when two entities are related
you're limited in what you can do), let's call them infections :), have to
be quite strictly limited in scope, otherwise they become useless.

Real world examples that are quite terrible to get right, and get good
code out of at the same time, are aliasing and effective-type of memory
cells, which have some similar properties to the current carries-dep.
Both concepts were added to the language after-the-fact (to capture some
assumptions that were used in practice, and that seemed sensible), but
then in a way that didn't limit their scope very well. For instance,
if you follow the c++ language strictly, you can't assume that an
anonymous cell that was holding an int before a function call is still
holding an int afterwards, even without any stores in between, because the
function could use placement new to change the cell's type. Guess how GCC
deals with this? We've designed and redesigned our memory model to
accommodate this: It's conservatively assuming that crap-happens-here
is part of most function calls (because in the real world, it indeed is
the case that crap-happens-there) :)


Ciao,
Michael.

2015-05-21 18:37:42

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [c++std-parallel-1632] Re: Compilers and RCU readers: Once more unto the breach!

On Thu, May 21, 2015 at 06:17:43PM +0200, Michael Matz wrote:
> Hi,
>
> On Thu, 21 May 2015, Paul E. McKenney wrote:
>
> > The point is -exactly- to codify the current state of affairs.
>
> Ah, I see, so it's not yet about creating a more useful (for compilers,
> that is) model.

There are several approaches being considered for that as well, but we
do need to codify current usage.

> > > char * fancy_assign (char *in) { return in; }
> > > ...
> > > char *x, *y;
> > >
> > > x = atomic_load_explicit(p, memory_order_consume);
> > > y = fancy_assign (x);
> > > atomic_store_explicit(q, y, memory_order_relaxed);
> > >
> > > So, is there, or is there not a dependency carried from x to y in your
> > > proposed model (and which rule in your document states so)? Clearly,
> > > without any other language the compiler would have to assume that there is
> > > (because the equivalent 'y = x' assignment would carry the dependency).
> >
> > The dependency is not carried, though this is due to the current set
> > of rules not covering atomic loads and stores, which I need to fix.
>
> Okay, so with the current regime(s), the dependency carries ...

Yes, that is the intent.

> > o Rule 14 says that if a value is part of a dependency chain and
> > is used as the actual parameter of a function call, then the
> > dependency chain extends to the corresponding formal parameter,
> > namely "in" of fancy_assign().
> >
> > o Rule 15 says that if a value is part of a dependency chain and
> > is returned from a function, then the dependency chain extends
> > to the returned value in the calling function.
> >
> > o And you are right. I need to make the first and second rules
> > cover the relaxed atomic operations, or at least atomic loads and
> > stores. Not that this is an issue for existing Linux-kernel code.
> >
> > But given such a change, the new version of rule 2 would
> > extend the dependency chain to cover the atomic_store_explicit().
>
> ... (if this detail were fixed). Okay, that's quite awful ...
>
> > > If it has to assume this, then the whole model is not going to work
> > > very well, as usual with models that assume a certain less-optimal
> > > fact ("carries-dep" is less optimal for code generation purposes than
> > > "not-carries-dep") unless very specific circumstances say it can be
> > > ignored.
> >
> > Although that is a good general rule of thumb, I do not believe that it
> > applies to this situation, with the exception that I do indeed assume
> > that no one is insane enough to do value-speculation optimizations for
> > non-NULL values on loads from pointers.
> >
> > So what am I missing here?
>
> ... because you are then missing that if "carries-dep" can flow through
> function calls from arguments to return values by default, the compiler
> has to assume this in fact always happens when it can't see the function
> body, or can't analyze it. In effect that's making the whole "carries-dep
> stops at these and those uses" a useless exercise because a malicious
> user (malicious in the sense of abusing the model to show that it's
> hindering optimizations), i.e. me, can hide all such carries-dep stopping
> effects inside a function, et voila, the dependency carries through. So
> for a slightly simpler example:
>
> extern void *foo (void *); // body not available
> x = load
> y = foo (x);
> store (y)
>
> the compiler has to assume that there's a dep-chain from x to y; always.

Yes, the compiler does have to make this assumption. And the intent
behind the rules is to ensure that this assumption does not get in the
way of reasonable optimizations. So although I am sure that you are as
busy as the rest of us, I really do need you to go through the rules in
detail before you get too much more excited about this.

> What's worse, it also has to assume a carries-dep for this:
>
> extern void foo (void *in, void **out1, void **out2);
> x = load
> foo (x, &o1, &o2);
> store (o1);
> store (o2);
>
> Now the compiler has to assume that the body of 'foo' is just mean enough
> to make the dep-chain carry from in to *out1 or *out2 (i.e. it has to
> assume that for both). This extends to _all_ memory accessible from foo's
> body, i.e. generally all global and all local address-taken variables, so
> as soon as you have a function call into which a dep-chain value flows
> you're creating a dep-chain extension from that value to each and every
> global piece of memory, because the compiler cannot assume that the
> black box called foo is not mean. This could conceivably be stopped by
> making normal stores not carry the dependency; then only the return
> value might be infected; but I don't see that in your rules, as a "normal
> store" is just an assignment in your model and hence rules 1 and 2 apply
> (that is, carries-dep flows through all assignments, incl. loads and
> stores).
>
> Basically whenever you can construct black boxes for the compiler, you
> have to limit their effects on such transitive relations like carries-dep
> by default, at the border of such black boxes; otherwise that transitive
> relation quickly becomes a most-x-everything relation (i.e.
> most things carries-dep to everything), and as such is then totally
> useless, because such a universally filled relation (like an empty one)
> doesn't bear any interesting information, at which point it then is
> questionable why the compiler should jump through hoops to analyse the few
> cases that would be allowed to stop the carries-dep flow, when it more
> often than not has to give up anyway and generate slow code.

Furthermore, in general, if the compiler loads a pointer (only a
pointer, not any other type) from some random location, it needs to
assume that the value loaded is part of a dependency chain. Again,
the intent behind the rules is to make sure that the developers get the
guarantees that they need from the compiler without getting in the way
of reasonable optimizations. (Value speculation of pointer loads being
the only example of an unreasonable optimization that I am aware of.)
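
That is, even for a plain load (a trivial sketch):

void *maybe_in_chain(void **slot)
{
        return *slot;   /* a plain load of a pointer: unless the compiler can
                           prove otherwise, it must assume the loaded value
                           may be part of a dependency chain */
}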

> > Do you have a specific example where the compiler would need to suppress
> > a production-quality optimization?
>
> I can't say; I haven't completely grokked all details of the different sub
> mem-models and their interaction with compiler optimizations, and when to
> convert consume to acquire. But I do know that transitive relations that
> carry a code generation cost (i.e. that when two entities are related
> you're limited in what you can do), let's call them infections :), have to
> be quite strictly limited in scope, otherwise they become useless.

When I get the rules right (and I am sure that they need additional help),
it should never be necessary to promote consume to acquire.

> Real world examples that are quite terrible to get right, and get good
> code out of at the same time, are aliasing and effective-type of memory
> cells, which have some similar properties to the current carries-dep.
> Both concepts were added to the language after-the-fact (to capture some
> assumptions that were used in practice, and that seemed sensible), but
> then in a way that didn't limit their scope very well. For instance,
> if you follow the c++ language strictly, you can't assume that an
> anonymous cell that was holding an int before a function call is still
> holding an int afterwards, even without any stores in between, because the
> function could use placement new to change the cell's type. Guess how GCC
> deals with this? We've designed and redesigned our memory model to
> accommodate this: It's conservatively assuming that crap-happens-here
> is part of most function calls (because in the real world, it indeed is
> the case that crap-happens-there) :)

Agreed, crap-happens-everywhere is a good approximation for real-world
code. ;-)

But again, I believe that the rules do what needs to be done, although
some additional adjustments are no doubt needed. I have updated the
document:

http://www2.rdrop.com/users/paulmck/RCU/consume.2015.05.21b.pdf

Section 7.9 starts on page 28.

Thanx, Paul

2015-05-21 19:24:29

by Will Deacon

[permalink] [raw]
Subject: Re: Compilers and RCU readers: Once more unto the breach!

On Wed, May 20, 2015 at 07:16:06PM +0100, Paul E. McKenney wrote:
> On Wed, May 20, 2015 at 04:46:17PM +0100, Will Deacon wrote:
> > On Wed, May 20, 2015 at 01:15:22PM +0100, Paul E. McKenney wrote:
> > > Indeed, something like this does -not- carry a dependency from the
> > > memory_order_consume load to q:
> > >
> > > char *p, *q;
> > >
> > > p = atomic_load_explicit(&gp, memory_order_consume);
> > > q = gq + (intptr_t)p - (intptr_t)p;
> > >
> > > If this was compiled with -O0, ARM and Power might well carry a
> > > dependency, but given any optimization, the assembly language would have
> > > no hint of any such dependency. So I am not seeing any particular danger.
> >
> > The above is a welcome relaxation over C11, since ARM doesn't even give
> > you ordering based off false data dependencies. My concern is more to do
> > with how this can be specified precisely without prohibiting honest compiler
> > and hardware optimisations.
>
> That last is the challenge. I believe that I am pretty close, but I am
> sure that additional adjustment will be required. Especially given that
> we also need the memory model to be amenable to formal analysis.

Well, there's still the whole thin-air problem which unfortunately doesn't
go away with your proposal... (I was hoping that differentiating between
true and false dependencies would solve that, but your set of rules isn't
broad enough and I don't blame you at all for that!).

> > Out of interest, how do you tackle examples (4) and (5) of (assuming the
> > reads are promoted to consume loads)?:
> >
> > http://www.cl.cam.ac.uk/~pes20/cpp/notes42.html
> >
> > my understanding is that you permit both outcomes (I appreciate you're
> > not directly tackling out-of-thin-air, but treatment of dependencies
> > is heavily related).

Thanks for taking the time to walk these two examples through.

> Let's see... #4 is as follows, given promotion to memory_order_consume
> and (I am guessing) memory_order_relaxed:
>
> r1 = atomic_load_explicit(&x, memory_order_consume);
> if (r1 == 42)
> atomic_store_explicit(&y, r1, memory_order_relaxed);
> ------------------------------------------------------
> r2 = atomic_load_explicit(&y, memory_order_consume);
> if (r2 == 42)
> atomic_store_explicit(&x, 42, memory_order_relaxed);
> else
> atomic_store_explicit(&x, 42, memory_order_relaxed);
>
> The second thread does not have a proper control dependency, even with
> the memory_order_consume load because both branches assign the same
> value to "x". This means that the compiler is within its rights to
> optimize this into the following:
>
> r1 = atomic_load_explicit(&x, memory_order_consume);
> if (r1 == 42)
> atomic_store_explicit(&y, r1, memory_order_relaxed);
> ------------------------------------------------------
> r2 = atomic_load_explicit(&y, memory_order_consume);
> atomic_store_explicit(&x, 42, memory_order_relaxed);
>
> There is no dependency between the second thread's pair of statements,
> so both the compiler and the CPU are within their rights to optimize
> further as follows:
>
> r1 = atomic_load_explicit(&x, memory_order_consume);
> if (r1 == 42)
> atomic_store_explicit(&y, r1, memory_order_relaxed);
> ------------------------------------------------------
> atomic_store_explicit(&x, 42, memory_order_relaxed);
> r2 = atomic_load_explicit(&y, memory_order_consume);
>
> If the compiler makes this final optimization, even mythical SC hardware
> is within its rights to end up with (r1 == 42 && r2 == 42). Which is
> fine, as far as I am concerned. Or at least something that can be
> lived with.

Agreed.

> On to #5:
>
> r1 = atomic_load_explicit(&x, memory_order_consume);
> if (r1 == 42)
> atomic_store_explicit(&y, r1, memory_order_relaxed);
> ----------------------------------------------------
> r2 = atomic_load_explicit(&y, memory_order_consume);
> if (r2 == 42)
> atomic_store_explicit(&x, 42, memory_order_relaxed);
>
> The first thread's accesses are dependency ordered. The second thread's
> ordering is in a corner case that memory-barriers.txt does not cover.
> You are supposed to start control dependencies with READ_ONCE_CTRL(), not
> a memory_order_consume load (AKA rcu_dereference and friends). However,
> Alpha would have a full barrier as part of the memory_order_consume load,
> and the rest of the processors would (one way or another) respect the
> control dependency. And the compiler would have some fun trying to
> break it.

But this is interesting because the first thread is ordered whilst the
second is not, so doesn't that effectively forbid the compiler from
constant-folding values if it can't prove that there is no dependency
chain?

> So the current Linux memory model would allow (r1 == 42 && r2 == 42),
> but I don't know of any hardware/compiler combination that would
> allow it. And no, I am -not- going to update memory-barriers.txt for
> this litmus test, its theoretical interest notwithstanding! ;-)

Indeed, I also don't know of any hardware which permits speculative
writes to become visible, but it's the compiler (and the language
definition) that we need to think about here.

> In summary, both #4 and #5 would be allowed, as modified above.
> Seem reasonable?

It feels like it's suppressing a reasonable compiler optimisation, but again,
I'm not a compiler writer ;)

Will

2015-05-21 20:02:41

by Paul E. McKenney

[permalink] [raw]
Subject: Re: Compilers and RCU readers: Once more unto the breach!

On Thu, May 21, 2015 at 08:24:22PM +0100, Will Deacon wrote:
> On Wed, May 20, 2015 at 07:16:06PM +0100, Paul E. McKenney wrote:
> > On Wed, May 20, 2015 at 04:46:17PM +0100, Will Deacon wrote:
> > > On Wed, May 20, 2015 at 01:15:22PM +0100, Paul E. McKenney wrote:
> > > > Indeed, something like this does -not- carry a dependency from the
> > > > memory_order_consume load to q:
> > > >
> > > > char *p, *q;
> > > >
> > > > p = atomic_load_explicit(&gp, memory_order_consume);
> > > > q = gq + (intptr_t)p - (intptr_t)p;
> > > >
> > > > If this was compiled with -O0, ARM and Power might well carry a
> > > > dependency, but given any optimization, the assembly language would have
> > > > no hint of any such dependency. So I am not seeing any particular danger.
> > >
> > > The above is a welcome relaxation over C11, since ARM doesn't even give
> > > you ordering based off false data dependencies. My concern is more to do
> > > with how this can be specified precisely without prohibiting honest compiler
> > > and hardware optimisations.
> >
> > That last is the challenge. I believe that I am pretty close, but I am
> > sure that additional adjustment will be required. Especially given that
> > we also need the memory model to be amenable to formal analysis.
>
> Well, there's still the whole thin-air problem which unfortunately doesn't
> go away with your proposal... (I was hoping that differentiating between
> true and false dependencies would solve that, but your set of rules isn't
> broad enough and I don't blame you at all for that!).
>
> > > Out of interest, how do you tackle examples (4) and (5) of (assuming the
> > > reads are promoted to consume loads)?:
> > >
> > > http://www.cl.cam.ac.uk/~pes20/cpp/notes42.html
> > >
> > > my understanding is that you permit both outcomes (I appreciate you're
> > > not directly tackling out-of-thin-air, but treatment of dependencies
> > > is heavily related).
>
> Thanks for taking the time to walk these two examples through.

;-) ;-) ;-)

> > Let's see... #4 is as follows, given promotion to memory_order_consume
> > and (I am guessing) memory_order_relaxed:
> >
> > r1 = atomic_load_explicit(&x, memory_order_consume);
> > if (r1 == 42)
> > atomic_store_explicit(&y, r1, memory_order_relaxed);
> > ------------------------------------------------------
> > r2 = atomic_load_explicit(&y, memory_order_consume);
> > if (r2 == 42)
> > atomic_store_explicit(&x, 42, memory_order_relaxed);
> > else
> > atomic_store_explicit(&x, 42, memory_order_relaxed);
> >
> > The second thread does not have a proper control dependency, even with
> > the memory_order_consume load because both branches assign the same
> > value to "x". This means that the compiler is within its rights to
> > optimize this into the following:
> >
> > r1 = atomic_load_explicit(&x, memory_order_consume);
> > if (r1 == 42)
> > atomic_store_explicit(&y, r1, memory_order_relaxed);
> > ------------------------------------------------------
> > r2 = atomic_load_explicit(&y, memory_order_consume);
> > atomic_store_explicit(&x, 42, memory_order_relaxed);
> >
> > There is no dependency between the second thread's pair of statements,
> > so both the compiler and the CPU are within their rights to optimize
> > further as follows:
> >
> > r1 = atomic_load_explicit(&x, memory_order_consume);
> > if (r1 == 42)
> > atomic_store_explicit(&y, r1, memory_order_relaxed);
> > ------------------------------------------------------
> > atomic_store_explicit(&x, 42, memory_order_relaxed);
> > r2 = atomic_load_explicit(&y, memory_order_consume);
> >
> > If the compiler makes this final optimization, even mythical SC hardware
> > is within its rights to end up with (r1 == 42 && r2 == 42). Which is
> > fine, as far as I am concerned. Or at least something that can be
> > lived with.
>
> Agreed.
>
> > On to #5:
> >
> > r1 = atomic_load_explicit(&x, memory_order_consume);
> > if (r1 == 42)
> > atomic_store_explicit(&y, r1, memory_order_relaxed);
> > ----------------------------------------------------
> > r2 = atomic_load_explicit(&y, memory_order_consume);
> > if (r2 == 42)
> > atomic_store_explicit(&x, 42, memory_order_relaxed);
> >
> > The first thread's accesses are dependency ordered. The second thread's
> > ordering is in a corner case that memory-barriers.txt does not cover.
> > You are supposed to start control dependencies with READ_ONCE_CTRL(), not
> > a memory_order_consume load (AKA rcu_dereference and friends). However,
> > Alpha would have a full barrier as part of the memory_order_consume load,
> > and the rest of the processors would (one way or another) respect the
> > control dependency. And the compiler would have some fun trying to
> > break it.
>
> But this is interesting because the first thread is ordered whilst the
> second is not, so doesn't that effectively forbid the compiler from
> constant-folding values if it can't prove that there is no dependency
> chain?

You lost me on this one. Are you suggesting that the compiler
speculate the second thread's atomic store? That would be very
bad regardless of dependency chains.

So what constant-folding optimization are you thinking of here?
If the above example is not amenable to such an optimization, could
you please give me an example where constant folding would apply
in a way that is sensitive to dependency chains?

> > So the current Linux memory model would allow (r1 == 42 && r2 == 42),
> > but I don't know of any hardware/compiler combination that would
> > allow it. And no, I am -not- going to update memory-barriers.txt for
> > this litmus test, its theoretical interest notwithstanding! ;-)
>
> Indeed, I also don't know of any hardware which permits speculative
> writes to become visible, but it's the compiler (and the language
> definition) that we need to think about here.

The compiler can (and does) speculate non-atomic non-volatile writes
in some cases, but I do not believe that it is permitted to speculate
either volatile or atomic writes. All that aside, I reiterate that I am
assuming that the compiler does not speculate the value of pointer loads.

> > In summary, both #4 and #5 would be allowed, as modified above.
> > Seem reasonable?
>
> It feels like it's suppressing a reasonable compiler optimisation, but again,
> I'm not a compiler writer ;)

The intent of this proposal is to avoid suppressing reasonable compiler
optimizations, so could you please send along an example?

Thanx, Paul

2015-05-21 20:42:16

by Linus Torvalds

[permalink] [raw]
Subject: Re: Compilers and RCU readers: Once more unto the breach!

On Thu, May 21, 2015 at 1:02 PM, Paul E. McKenney
<[email protected]> wrote:
>
> The compiler can (and does) speculate non-atomic non-volatile writes
> in some cases, but I do not believe that it is permitted to speculate
> either volatile or atomic writes.

I do *not* believe that a compiler is ever allowed to speculate *any*
writes - volatile or not - unless the compiler can prove that the end
result is either single-threaded, or the write in question is
guaranteed to only be visible in that thread (ie local stack variable
etc).

Quite frankly, I'd be much happier if the C standard just said so outright.

Also, I do think that the whole "consume" read should be explained
better to compiler writers. Right now the language (including very
much in the "restricted dependency" model) is described in very
abstract terms. Yet those abstract terms are actually very subtle and
complex, and very opaque to a compiler writer.

If I was a compiler writer, I'd absolutely detest that definition.
It's very far removed from my problem space as a compiler writer, and
nothing in the language *explains* the odd and subtle abstract rules.
It smells ad-hoc to me.

Now, I actually understand the point of those odd and abstract rules,
but to a compiler writer that doesn't understand the background, the
whole section reads as "this is really painful for me to track all
those dependencies and what kills them".

So I would very much suggest that there would be language that
*explains* this. Basically, tell the compiler writer:

(a) the "official" rules are completely pointless, and make sense
only because the standard is written for some random "abstract
machine" that doesn't actually exist.

(b) in *real life*, the whole and only point of the rules is to make
sure that the compiler doesn't turn a data dependency into a control
dependency, which on ARM and POWERPC does not honor causal memory
ordering

(c) on x86, since *all* memory accesses are causal, all the magical
dependency rules are just pointless anyway, and what it really means
is that you cannot re-order accesses with value speculation.

(d) the *actual* relevant rule for a compiler writer is very simple:
the compiler must not do value speculation on a "consume" load, and
the abstract machine rules are written so that any other sane
optimization is legal.

(e) if the compiler writer really thinks they want to do value
speculation, they have to turn the "consume" load into an "acquire"
load. And you have to do that anyway on architectures like alpha that
aren't causal even for data dependencies.
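
To make the value-speculation rule concrete, here is the kind of
transformation that must not happen on a "consume" load (a sketch;
reader() and cached_node are made-up names):

#include <stdatomic.h>

struct node { int a; };
_Atomic(struct node *) gp;
struct node cached_node;

int reader(void)
{
        struct node *p = atomic_load_explicit(&gp, memory_order_consume);

        return p->a;    /* this load is address-dependent on p */
}

/*
 * Value speculation would effectively rewrite reader() as:
 *
 *      struct node *p = atomic_load_explicit(&gp, memory_order_consume);
 *      if (p == &cached_node)
 *              return cached_node.a;   // guessed: no longer depends on p
 *      return p->a;
 *
 * On ARM and POWERPC the speculated load of cached_node.a can then be
 * satisfied before the consume load, and the ordering is gone.
 */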

I personally think the whole "abstract machine" model of the C
language is a mistake. It would be much better to talk about things in
terms of actual code generation and actual issues. Make all the
problems much more concrete, with actual examples of how memory
ordering matters on different architectures.

99% of all the problems with the whole "consume" memory ordering comes
not from anything relevant to a compiler writer. All of it comes from
trying to "define" the issue in the wrong terms.

Linus

2015-05-21 22:02:42

by Paul E. McKenney

[permalink] [raw]
Subject: Re: Compilers and RCU readers: Once more unto the breach!

On Thu, May 21, 2015 at 01:42:11PM -0700, Linus Torvalds wrote:
> On Thu, May 21, 2015 at 1:02 PM, Paul E. McKenney
> <[email protected]> wrote:
> >
> > The compiler can (and does) speculate non-atomic non-volatile writes
> > in some cases, but I do not believe that it is permitted to speculate
> > either volatile or atomic writes.
>
> I do *not* believe that a compiler is ever allowed to speculate *any*
> writes - volatile or not - unless the compiler can prove that the end
> result is either single-threaded, or the write in question is
> guaranteed to only be visible in that thread (ie local stack variable
> etc).
>
> Quite frankly, I'd be much happier if the C standard just said so outright.

So would I. ;-)

The usual example is the following, where x is non-volatile and
non-atomic:

if (a)
x = 42;
else
x = 7;

The current rules allow this to be transformed as follows:

x = 7;
if (a)
x = 42;

So the variable x takes on the value 7 momentarily when it otherwise
would not have.

At least C11 now prohibits "drive-by" speculative writes, where the
compiler writes a larger size and then fixes things up afterwards.
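
For example, one classic shape of it, with adjacent small fields (the
struct and names here are mine, just a sketch):

struct s {
        char a;
        char b;         /* possibly written concurrently by another thread */
};

void set_a(struct s *p)
{
        p->a = 1;       /* the source touches only p->a */
}

/*
 * A "drive-by" write would behave as if the compiler had emitted:
 *
 *      struct s tmp = *p;      // reads p->b as well
 *      tmp.a = 1;
 *      *p = tmp;               // re-writes p->b: an invented store, hence a
 *                              // data race with any concurrent writer of b
 *
 * C11 forbids inventing that store to p->b.
 */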

> Also, I do think that the whole "consume" read should be explained
> better to compiler writers. Right now the language (including very
> much in the "restricted dependency" model) is described in very
> abstract terms. Yet those abstract terms are actually very subtle and
> complex, and very opaque to a compiler writer.
>
> If I was a compiler writer, I'd absolutely detest that definition.
> It's very far removed from my problem space as a compiler writer, and
> nothing in the language *explains* the odd and subtle abstract rules.
> It smells ad-hoc to me.
>
> Now, I actually understand the point of those odd and abstract rules,
> but to a compiler writer that doesn't understand the background, the
> whole section reads as "this is really painful for me to track all
> those dependencies and what kills them".
>
> So I would very much suggest that there would be language that
> *explains* this. Basically, tell the compiler writer:
>
> (a) the "official" rules are completely pointless, and make sense
> only because the standard is written for some random "abstract
> machine" that doesn't actually exist.
>
> (b) in *real life*, the whole and only point of the rules is to make
> sure that the compiler doesn't turn a data dependency into a control
> dependency, which on ARM and POWERPC does not honor causal memory
> ordering
>
> (c) on x86, since *all* memory accesses are causal, all the magical
> dependency rules are just pointless anyway, and what it really means
> is that you cannot re-order accesses with value speculation.
>
> (d) the *actual* relevant rule for a compiler writer is very simple:
> the compiler must not do value speculation on a "consume" load, and
> the abstract machine rules are written so that any other sane
> optimization is legal.
>
> (e) if the compiler writer really thinks they want to do value
> speculation, they have to turn the "consume" load into an "acquire"
> load. And you have to do that anyway on architectures like alpha that
> aren't causal even for data dependencies.

I am certainly more than willing to add this sort of wording to
the document.

> I personally think the whole "abstract machine" model of the C
> language is a mistake. It would be much better to talk about things in
> terms of actual code generation and actual issues. Make all the
> problems much more concrete, with actual examples of how memory
> ordering matters on different architectures.
>
> 99% of all the problems with the whole "consume" memory ordering comes
> not from anything relevant to a compiler writer. All of it comes from
> trying to "define" the issue in the wrong terms.

I certainly cannot deny that there seems to be significant confusion
in the discussion thus far.

Thanx, Paul

2015-05-22 06:43:52

by Ingo Molnar

[permalink] [raw]
Subject: Re: Compilers and RCU readers: Once more unto the breach!


* Linus Torvalds <[email protected]> wrote:

> (a) the "official" rules are completely pointless, and make sense
> only because the standard is written for some random "abstract
> machine" that doesn't actually exist.

Presuming the intent of the abstract machine specification is to avoid
being seen as biased towards any specific machine (politics), maybe
write this as:

(a) the "official" rules are written for a somewhat weird and
complex "union of all known and theoretically possible CPU
architectures that exist or which might exist in the future",
which machine does not actually exist in practice, but which
allows a single abstract set of rules to apply to all machines.
These rules are complex, but if applied to a specific machine
they become considerably simpler. Here's a few examples: ...

?

(Assuming it's a goal of this standard to be human parseable to more
than a few dozen people on the planet.)

Thanks,

Ingo

2015-05-22 10:55:07

by kenner

[permalink] [raw]
Subject: Re: Compilers and RCU readers: Once more unto the breach!

> (Assuming it's a goal of this standard to be human parseable to more
> than a few dozen people on the planet.)

Unfortunately, that's rarely a goal of most standards. ;-)

2015-05-22 13:11:39

by Paul E. McKenney

[permalink] [raw]
Subject: Re: Compilers and RCU readers: Once more unto the breach!

On Fri, May 22, 2015 at 06:43:32AM -0400, Richard Kenner wrote:
> > (Assuming it's a goal of this standard to be human parseable to more
> > than a few dozen people on the planet.)
>
> Unfortunately, that's rarely a goal of most standards. ;-)

My experience does match Richard's, sad to say. There has been some
vigorous discussion in another thread involving undefined behavior and
value narrowing, which has resulted in some useful changes to Section 7.9.
The updated document is here:

http://www2.rdrop.com/users/paulmck/RCU/consume.2015.05.22a.pdf

Thanx, Paul

2015-05-22 13:13:04

by Paul E. McKenney

[permalink] [raw]
Subject: Re: Compilers and RCU readers: Once more unto the breach!

On Fri, May 22, 2015 at 08:43:44AM +0200, Ingo Molnar wrote:
>
> * Linus Torvalds <[email protected]> wrote:
>
> > (a) the "official" rules are completely pointless, and make sense
> > only because the standard is written for some random "abstract
> > machine" that doesn't actually exist.
>
> Presuming the intent of the abstract machine specification is to avoid
> being seen as biased towards any specific machine (politics), maybe
> write this as:
>
> (a) the "official" rules are written for a somewhat weird and
> complex "union of all known and theoretically possible CPU
> architectures that exist or which might exist in the future",
> which machine does not actually exist in practice, but which
> allows a single abstract set of rules to apply to all machines.
> These rules are complex, but if applied to a specific machine
> they become considerably simpler. Here's a few examples: ...
>
> ?
>
> (Assuming it's a goal of this standard to be human parseable to more
> than a few dozen people on the planet.)

Should something based on Section 7.9 go in, then I would need to add
a more developer-friendly explanation in Documentation/RCU, no two
ways about it! ;-)

Thanx, Paul

2015-05-22 17:30:39

by Will Deacon

[permalink] [raw]
Subject: Re: Compilers and RCU readers: Once more unto the breach!

Hi Paul,

On Thu, May 21, 2015 at 09:02:12PM +0100, Paul E. McKenney wrote:
> On Thu, May 21, 2015 at 08:24:22PM +0100, Will Deacon wrote:
> > On Wed, May 20, 2015 at 07:16:06PM +0100, Paul E. McKenney wrote:
> > > On to #5:
> > >
> > > r1 = atomic_load_explicit(&x, memory_order_consume);
> > > if (r1 == 42)
> > > atomic_store_explicit(&y, r1, memory_order_relaxed);
> > > ----------------------------------------------------
> > > r2 = atomic_load_explicit(&y, memory_order_consume);
> > > if (r2 == 42)
> > > atomic_store_explicit(&x, 42, memory_order_relaxed);
> > >
> > > The first thread's accesses are dependency ordered. The second thread's
> > > ordering is in a corner case that memory-barriers.txt does not cover.
> > > You are supposed to start control dependencies with READ_ONCE_CTRL(), not
> > > a memory_order_consume load (AKA rcu_dereference and friends). However,
> > > Alpha would have a full barrier as part of the memory_order_consume load,
> > > and the rest of the processors would (one way or another) respect the
> > > control dependency. And the compiler would have some fun trying to
> > > break it.
> >
> > But this is interesting because the first thread is ordered whilst the
> > second is not, so doesn't that effectively forbid the compiler from
> > constant-folding values if it can't prove that there is no dependency
> > chain?
>
> You lost me on this one. Are you suggesting that the compiler
> speculate the second thread's atomic store? That would be very
> bad regardless of dependency chains.
>
> So what constant-folding optimization are you thinking of here?
> If the above example is not amenable to such an optimization, could
> you please give me an example where constant folding would apply
> in a way that is sensitive to dependency chains?

Unless I'm missing something, I can't see what would prevent a compiler
from looking at the code in thread1 and transforming it into the code in
thread2 (i.e. constant folding r1 with 42 given that the taken branch
must mean that r1 == 42). However, such an optimisation breaks the
dependency chain, which means that a compiler needs to walk backwards
to see if there is a dependency chain extending to r1.
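
Spelling out the folding I have in mind (a sketch of the example above):

/* First thread as written (data dependency from the load to the store): */
r1 = atomic_load_explicit(&x, memory_order_consume);
if (r1 == 42)
        atomic_store_explicit(&y, r1, memory_order_relaxed);

/* After folding r1 to 42 inside the taken branch, the data dependency is
   gone and only the control dependency remains -- i.e. the first thread
   now looks just like the second: */
r1 = atomic_load_explicit(&x, memory_order_consume);
if (r1 == 42)
        atomic_store_explicit(&y, 42, memory_order_relaxed);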

> > > So the current Linux memory model would allow (r1 == 42 && r2 == 42),
> > > but I don't know of any hardware/compiler combination that would
> > > allow it. And no, I am -not- going to update memory-barriers.txt for
> > > this litmus test, its theoretical interest notwithstanding! ;-)

Of course, I'm not asking for that at all! I'm just trying to see how
your proposal holds up with the example.

Will

2015-05-22 18:55:38

by Paul E. McKenney

[permalink] [raw]
Subject: Re: Compilers and RCU readers: Once more unto the breach!

On Fri, May 22, 2015 at 06:30:29PM +0100, Will Deacon wrote:
> Hi Paul,
>
> On Thu, May 21, 2015 at 09:02:12PM +0100, Paul E. McKenney wrote:
> > On Thu, May 21, 2015 at 08:24:22PM +0100, Will Deacon wrote:
> > > On Wed, May 20, 2015 at 07:16:06PM +0100, Paul E. McKenney wrote:
> > > > On to #5:
> > > >
> > > > r1 = atomic_load_explicit(&x, memory_order_consume);
> > > > if (r1 == 42)
> > > > atomic_store_explicit(&y, r1, memory_order_relaxed);
> > > > ----------------------------------------------------
> > > > r2 = atomic_load_explicit(&y, memory_order_consume);
> > > > if (r2 == 42)
> > > > atomic_store_explicit(&x, 42, memory_order_relaxed);
> > > >
> > > > The first thread's accesses are dependency ordered. The second thread's
> > > > ordering is in a corner case that memory-barriers.txt does not cover.
> > > > You are supposed to start control dependencies with READ_ONCE_CTRL(), not
> > > > a memory_order_consume load (AKA rcu_dereference and friends). However,
> > > > Alpha would have a full barrier as part of the memory_order_consume load,
> > > > and the rest of the processors would (one way or another) respect the
> > > > control dependency. And the compiler would have some fun trying to
> > > > break it.
> > >
> > > But this is interesting because the first thread is ordered whilst the
> > > second is not, so doesn't that effectively forbid the compiler from
> > > constant-folding values if it can't prove that there is no dependency
> > > chain?
> >
> > You lost me on this one. Are you suggesting that the compiler
> > speculate the second thread's atomic store? That would be very
> > bad regardless of dependency chains.
> >
> > So what constant-folding optimization are you thinking of here?
> > If the above example is not amenable to such an optimization, could
> > you please give me an example where constant folding would apply
> > in a way that is sensitive to dependency chains?
>
> Unless I'm missing something, I can't see what would prevent a compiler
> from looking at the code in thread1 and transforming it into the code in
> thread2 (i.e. constant folding r1 with 42 given that the taken branch
> must mean that r1 == 42). However, such an optimisation breaks the
> dependency chain, which means that a compiler needs to walk backwards
> to see if there is a dependency chain extending to r1.

Indeed! Which is one reason that (1) integers are not allowed in
dependency chains with a very few extremely constrained exceptions and
(2) sequences of comparisons and/or undefined-behavior considerations
that allow the compiler to exactly determine the pointer value break
the dependency chain.

> > > > So the current Linux memory model would allow (r1 == 42 && r2 == 42),
> > > > but I don't know of any hardware/compiler combination that would
> > > > allow it. And no, I am -not- going to update memory-barriers.txt for
> > > > this litmus test, its theoretical interest notwithstanding! ;-)
>
> Of course, I'm not asking for that at all! I'm just trying to see how
> your proposal holds up with the example.

Whew! ;-)

Thanx, Paul

2015-05-26 17:08:47

by Torvald Riegel

[permalink] [raw]
Subject: Re: [c++std-parallel-1611] Compilers and RCU readers: Once more unto the breach!

On Tue, 2015-05-19 at 17:55 -0700, Paul E. McKenney wrote:
> http://www.rdrop.com/users/paulmck/RCU/consume.2015.05.18a.pdf

I have been discussing Section 7.9 with Paul during last week.

While I think that 7.9 helps narrow down the problem somewhat, I'm still
concerned that it effectively requires compilers to either track
dependencies or conservatively prevent optimizations like value
speculation and specialization based on that. Neither is a good thing
for a compiler.


7.9 adds requirements that dependency chains stop if the program itself
informs the compiler about the value of something in the dependency
chain (e.g., as shown in Figure 33). Also, if a correct program that
does not have undefined behavior must use a particular value, this is
also considered as "informing" the compiler about that value. For
example:
int arr[2];
int* x = foo.load(mo_consume);
if (x > arr) // implies same object/array, so x is in arr[]
int r1 = *x; // compiler knows x == arr + 1
The program, assuming no undefined behavior, first tells the compiler
that x should be within arr, and then the comparison tells the compiler
that x!=arr, so x==arr+1 must hold because there are just two elements
in the array.

Having these requirements is generally good, but we don't yet know how
to specify this properly. For example, I suppose we'd need to also say
that the compiler cannot assume to know anything about a value returned
from an mo_consume load; otherwise, nothing prevents a compiler from
using knowledge about the stores that the mo_consume load can read from
(see Section 7.2).

Also, a compiler is still required to not do value-speculation or
optimizations based on that. For example, this program:

void op(type *p)
{
foo /= p->a;
bar = p->b;
}
void bar()
{
pointer = ppp.load(mo_consume);
op(pointer);
}

... must not be transformed into this program, even if the compiler
knows that global_var->a == 1:

void op(type *p) { /* unchanged */ }
void bar()
{
pointer = ppp.load(mo_consume);
if (pointer != global_var) {
op(pointer);
} else { // specialization for global_var
// compiler knows global_var->a==1;
// compiler uses global_var directly, inlines, optimizes:
bar = global_var->b;
}
}

The compiler could optimize out the division if pointer==global_var but
it must not access field b directly through global_var. This would be
pretty awkward; the compiler may work based on an optimized expression
in the specialization (ie, create code that assumes global_var instead
of pointer) but it would still have to carry around and use the
non-optimized expression.


This wouldn't be as bad if it were easily constrained to code sequences
that really need the dependencies. However, 7.9 does not effectively
contain dependencies to only the code that really needs them, IMO.
Unless the compiler proves otherwise, it has to assume that a load from
a pointer carries a dependency. Proving that is often very hard because
it requires points-to analysis; 7.9 restricts this to intra-thread
analysis but that is still nontrivial.
Michael Matz had a similar concern (in terms of what it results in).


Given that mo_consume is useful but a very specialized feature, I
wouldn't be optimistic that 7.9 would actually be supported by many
compilers. The trade-off between having to track dependencies or having
to disallow optimizations is a bad one to make. The simple way out for
a compiler would be to just emit mo_acquire instead of mo_consume and be
done with all -- and this might be the most practical decision overall,
or the default general-purpose implementation. At least I haven't heard
any compiler implementer say that they think it's obviously worth
implementing.

I also don't think 7.9 is ready for ISO standardization yet (or any of
the other alternatives mentioned in the paper). Standardizing a feature
that we're not sure whether it will actually be implemented is not a
good thing to do; it's too costly for all involved parties (compiler
writers *and* users).


IMO, the approach outlined in Section 7.7 is still the most promising
contender in the long run. It avoids the perhaps more pervasive changes
that a type-system-based approach such as the one in Section 7.2 might result
in, yet still informs the compiler where dependencies are actually used
and which chain of expressions would be involved in that. Tracking is
probably simplified, as dependencies are never open-ended and
potentially leaking into various other regions of code. It seems easier
to specify in a standard because we just need the programmer to annotate
the intent and the rest is compiler QoI. It would require users to
annotate their use of dependencies, but they don't need to follow
further rules; performance tuning of the code so it actually makes use
of dependencies is mostly a compiler QoI thing, and if the compiler
can't maintain a dependency, it can issue warnings and thus make the
tuning interactive for the user.

Of course, finding a good way to reference the source of the dependency
in the API (see the paper for some of the sub-issues) will be a
challenge. But, currently, I'm more optimistic we can find a useful
solution for this than finding a standardizable version of 7.9.

2015-05-26 17:38:23

by Torvald Riegel

[permalink] [raw]
Subject: Re: [c++std-parallel-1641] Re: Compilers and RCU readers: Once more unto the breach!

On Thu, 2015-05-21 at 13:42 -0700, Linus Torvalds wrote:
> On Thu, May 21, 2015 at 1:02 PM, Paul E. McKenney
> <[email protected]> wrote:
> >
> > The compiler can (and does) speculate non-atomic non-volatile writes
> > in some cases, but I do not believe that it is permitted to speculate
> > either volatile or atomic writes.
>
> I do *not* believe that a compiler is ever allowed to speculate *any*
> writes - volatile or not - unless the compiler can prove that the end
> result is either single-threaded, or the write in question is
> guaranteed to only be visible in that thread (ie local stack variable
> etc).

It must not speculate volatile accesses. It could speculate
non-volatiles even if those are atomic and observable by other threads
but that would require further work/checks on all potential observers of
those (ie, to still satisfy as-if). Thus, compilers are unlikely to do
such speculation, I'd say.

The as-if rule (ie, equality of observable behavior (ie, volatiles, ...)
to the abstract machine) makes all this clear.

> Also, I do think that the whole "consume" read should be explained
> better to compiler writers. Right now the language (including very
> much in the "restricted dependency" model) is described in very
> abstract terms. Yet those abstract terms are actually very subtle and
> complex, and very opaque to a compiler writer.

I believe the issues around the existing specification of mo_consume
were pointed out by compiler folks. It's a complex problem, and I'm
all for more explanations, but I did get the impression that the
compiler writers in ISO C++ Study Group 1 do have a good understanding
of the problem.

> I personally think the whole "abstract machine" model of the C
> language is a mistake. It would be much better to talk about things in
> terms of actual code generation and actual issues. Make all the
> problems much more concrete, with actual examples of how memory
> ordering matters on different architectures.

As someone working for a toolchain team, I don't see how the
abstract-machine-based specification is a problem at all, nor have I
seen compiler writers struggling with it. It does give precise rules
for code generation.

The level of abstraction is a good thing for most programs because for
those, the primary concern is that the observable behavior and end
result is computed -- it's secondary and QoI how that happens. In
contrast, if you specify on the level of code generation, you'd have to
foresee how code generation might look in the future, including predict
future optimizations and all that. That doesn't look future-proof to
me.

I do realize that this may be less than ideal for cases when one would
want to use a C compiler more like a convenient assembler. But that
case isn't the 99%, I believe.

2015-05-27 01:41:55

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [c++std-parallel-1651] Re: Compilers and RCU readers: Once more unto the breach!

On Tue, May 26, 2015 at 07:08:35PM +0200, Torvald Riegel wrote:
> On Tue, 2015-05-19 at 17:55 -0700, Paul E. McKenney wrote:
> > http://www.rdrop.com/users/paulmck/RCU/consume.2015.05.18a.pdf
>
> I have been discussing Section 7.9 with Paul during last week.
>
> While I think that 7.9 helps narrow down the problem somewhat, I'm still
> concerned that it effectively requires compilers to either track
> dependencies or conservatively prevent optimizations like value
> speculation and specialization based on that. Neither is a good thing
> for a compiler.

I do believe that we can find some middle ground.

> 7.9 adds requirements that dependency chains stop if the program itself
> informs the compiler about the value of something in the dependency
> chain (e.g., as shown in Figure 33). Also, if a correct program that
> does not have undefined behavior must use a particular value, this is
> also considered as "informing" the compiler about that value. For
> example:
> int arr[2];
> int* x = foo.load(mo_consume);
> if (x > arr) // implies same object/array, so x is in arr[]
> int r1 = *x; // compiler knows x == arr + 1
> The program, assuming no undefined behavior, first tells the compiler
> that x should be within arr, and then the comparison tells the compiler
> that x!=arr, so x==arr+1 must hold because there are just two elements
> in the array.

The updated version of Section 7.9 says that if undefined behavior
allows the compiler to deduce the exact pointer value, as in the
case you show above, the dependency chain is broken.

> Having these requirements is generally good, but we don't yet know how
> to specify this properly. For example, I suppose we'd need to also say
> that the compiler cannot assume to know anything about a value returned
> from an mo_consume load; otherwise, nothing prevents a compiler from
> using knowledge about the stores that the mo_consume load can read from
> (see Section 7.2).

I expect that the Linux kernel's rcu_dereference() primitives would
map to volatile memory_order_consume loads for this reason.
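
Roughly like this (a sketch only, not the actual kernel definition;
rcu_dereference_gp() is a made-up name):

#include <stdatomic.h>

struct foo;
volatile _Atomic(struct foo *) gp;      /* volatile: the compiler may not
                                           assume anything about the value */

static inline struct foo *rcu_dereference_gp(void)
{
        return atomic_load_explicit(&gp, memory_order_consume);
}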

> Also, a compiler is still required to not do value-speculation or
> optimizations based on that. For example, this program:
>
> void op(type *p)
> {
> foo /= p->a;
> bar = p->b;
> }
> void bar()
> {
> pointer = ppp.load(mo_consume);
> op(pointer);
> }
>
> ... must not be transformed into this program, even if the compiler
> knows that global_var->a == 1:
>
> void op(type *p) { /* unchanged */ }
> void bar()
> {
> pointer = ppp.load(mo_consume);
> if (pointer != global_var) {
> op(pointer);
> } else { // specialization for global_var
> // compiler knows global_var->a==1;
> // compiler uses global_var directly, inlines, optimizes:
> bar = global_var->b;
> }
> }
>
> The compiler could optimize out the division if pointer==global_var but
> it must not access field b directly through global_var. This would be
> pretty awkward; the compiler may work based on an optimized expression
> in the specialization (ie, create code that assumes global_var instead
> of pointer) but it would still have to carry around and use the
> non-optimized expression.

Exactly how valuable is this sort of optimization in real code? And
how likely is the compiler to actually be able to apply it?

(I nevertheless will take another pass through the Linux kernel looking
for global variables being added to RCU-protected linked structures.
Putting a global into an RCU-protected structure seems more likely than
an RCU-protected pointer into a two-element array.)

> This wouldn't be as bad if it were easily constrained to code sequences
> that really need the dependencies. However, 7.9 does not effectively
> contain dependencies to only the code that really needs them, IMO.
> Unless the compiler proves otherwise, it has to assume that a load from
> a pointer carries a dependency. Proving that is often very hard because
> it requires points-to analysis; 7.9 restricts this to intra-thread
> analysis but that is still nontrivial.
> Michael Matz had a similar concern (in terms of what it results in).

Again, I will be looking through the Linux kernel for vulnerabilities to
this sort of transformation. However, I am having a very hard time seeing
how the compiler is going to know that much about the vast majority of
the Linux-kernel use cases. The linked structures are allocated on the
heap, not in arrays or in globals.

> Given that mo_consume is useful but a very specialized feature, I
> wouldn't be optimistic that 7.9 would actually be supported by many
> compilers. The trade-off between having to track dependencies or having
> to disallow optimizations is a bad one to make. The simple way out for
> a compiler would be to just emit mo_acquire instead of mo_consume and be
> done with all -- and this might be the most practical decision overall,
> or the default general-purpose implementation. At least I haven't heard
> any compiler implementer say that they think it's obviously worth
> implementing.

I believe that we need to balance the specialization of the
memory_order_consume feature against the real-world usefulness of
the optimizations. In addition, I believe that we can rule out some
of the more ornate situations, which would allow the compiler to do
the optimizations and allow the designated set of dependencies to be
preserved.

> I also don't think 7.9 is ready for ISO standardization yet (or any of
> the other alternatives mentioned in the paper). Standardizing a feature
> that we're not sure will actually be implemented is not a
> good thing to do; it's too costly for all involved parties (compiler
> writers *and* users).

Understood, but it seems premature to give up on 7.9. I believe that we
are actually quite close.

> IMO, the approach outlined in Section 7.7 is still the most promising
> contender in the long run. It avoids the perhaps more pervasive changes
> that a type-system-based approach such as the one in Section 7.2 might result
> in, yet still informs the compiler where dependencies are actually used
> and which chain of expressions would be involved in that. Tracking is
> probably simplified, as dependencies are never open-ended and
> potentially leaking into various other regions of code. It seems easier
> to specify in a standard because we just need the programmer to annotate
> the intent and the rest is compiler QoI. It would require users to
> annotate their use of dependencies, but they don't need to follow
> further rules; performance tuning of the code so it actually makes use
> of dependencies is mostly a compiler QoI thing, and if the compiler
> can't maintain a dependency, it can issue warnings and thus make the
> tuning interactive for the user.
>
> Of course, finding a good way to reference the source of the dependency
> in the API (see the paper for some of the sub-issues) will be a
> challenge. But, currently, I'm more optimistic we can find a useful
> solution for this than finding a standardizable version of 7.9.

As discussed before, I have no objection to exploring the other options,
including the one in Section 7.7, but we really do need to be able to
handle the common-case dependency chain without overdosing on syntactic
sugar.
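
To give a sense of what per-function annotation looks like, here is a
sketch using C++11's existing [[carries_dependency]] attribute -- not
the Section 7.7 proposal itself, but it gives the flavor of marking
each function boundary that a dependency crosses (names mine):

    #include <atomic>

    struct foo { int a; };

    std::atomic<foo *> gp;

    // The return value carries the dependency out of this function...
    [[carries_dependency]] foo *get_node(void)
    {
        return gp.load(std::memory_order_consume);
    }

    // ...and the annotated parameter carries it back in, so the
    // dependency ordering from the consume load extends to the
    // dereference below.
    int use_node(foo *p [[carries_dependency]])
    {
        return p->a;
    }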

Thanx, Paul

2015-07-14 00:45:12

by Paul E. McKenney

[permalink] [raw]
Subject: Re: Compilers and RCU readers: Once more unto the breach!

On Tue, May 19, 2015 at 05:55:10PM -0700, Paul E. McKenney wrote:
> Hello!
>
> Following up on last year's discussion (https://lwn.net/Articles/586838/,
> https://lwn.net/Articles/588300/), I believe that we have a solution. If
> I am wrong, I am sure you all will let me know, and in great detail. ;-)
>
> The key simplification is to "just say no" to RCU-protected array indexes:
> https://lkml.org/lkml/2015/5/12/827, as was suggested by several people.
> This simplification means that rcu_dereference (AKA memory_order_consume)
> need only return pointers. This in turn avoids things like (x-x),
> (x*0), and (x%1) because if "x" is a pointer, these expressions either
> return non-pointers or are compilation errors. With a very few exceptions,
> dependency chains can lead -to- non-pointers, but cannot pass -through-
> them.
>
> The result is that dependencies are carried only by operations for
> which the compiler cannot easily optimize away those dependencies,
> these operations including simple assignment, integer offset (including
> indexing), dereferencing, casts, passing as a function argument, return
> values from functions and so on. A complete list with commentary starts
> on page 28 of:
>
> http://www.rdrop.com/users/paulmck/RCU/consume.2015.05.18a.pdf

And an update is available here:

http://www.rdrop.com/users/paulmck/RCU/consume.2015.07.13a.pdf

Among other things, this update addresses the points about equality
comparisons introduced by the compiler, and outlines some of the
issues head-/tail-marked alternatives face with respect to abstraction.
The updated "Restricted Dependency Chains" section starts on page 28.

Thoughts?

Thanx, Paul