Hi,
Let's give this linux-toolchains thing a test-run...
As some of you might know, there's a bit of a discrepancy between what
compiler and kernel people consider 'valid' use of the compiler :-)
One area where this shows up is in implicit (memory) ordering provided
by the hardware, which we kernel people would like to use to avoid
explicit fences (expensive) but which the compiler is unaware of and
could ruin (bad).
During the last LPC we had a session on that; you can find it here:
https://linuxplumbersconf.org/event/7/contributions/821/
With recordings of the event here:
https://youtu.be/FFjV9f_Ub9o?t=89
That presentation covers 3 different implicit dependencies and various
ways in which a compiler can ruin the game. For this thread, I'd like to
limit things to just control-dependencies. We can start separate threads
for the other issues.
In short, the control dependency relies on the hardware never
speculating stores (instant OOTA) to provide a LOAD->STORE ordering.
That is, a LOAD must be completed to resolve a conditional branch, the
STORE is after the branch and cannot be made visible until the branch is
determined (which implies the load is complete).
However, our 'dear' C language has no clue of any of this.
So given code like:
	x = *foo;
	if (x > 42)
		*bar = 1;
Which, if literally translated into assembly, would provide a
LOAD->STORE order between foo and bar, could, in the hands of an
evil^Woptimizing compiler, become:
	x = *foo;
	*bar = 1;
because it knows, through value tracking, that the condition must be
true.
Our Documentation/memory-barriers.txt has a Control Dependencies section
(which I shall not replicate here for brevity) which lists a number of
caveats. But in general the work-around we use is:
	x = READ_ONCE(*foo);
	if (x > 42)
		WRITE_ONCE(*bar, 1);
Where READ/WRITE_ONCE() cast the variable volatile. The volatile
qualifier dissuades the compiler from assuming it knows things and we
then hope it will indeed emit the branch like we'd expect.
Now, hoping the compiler generates correct code is clearly not ideal and
very dangerous indeed. Which is why my question to the compiler folks
assembled here is:
Can we get a C language extension for this?
And while we have a fair number (and growing) of existing users of this in
the kernel, I'd not be averse to having to annotate them.
Any suggestions from the compiler people present on how they'd like to
provide us this feature?
Even just being able to detect this going wrong would be a step forward.
~ Peter
From: Peter Zijlstra
> Sent: 06 October 2020 12:47
> Hi,
>
> Let's give this linux-toolchains thing a test-run...
>
> As some of you might know, there's a bit of a discrepancy between what
> compiler and kernel people consider 'valid' use of the compiler :-)
>
> One area where this shows up is in implicit (memory) ordering provided
> by the hardware, which we kernel people would like to use to avoid
> explicit fences (expensive) but which the compiler is unaware of and
> could ruin (bad).
...
>
> In short, the control dependency relies on the hardware never
> speculating stores (instant OOTA) to provide a LOAD->STORE ordering.
> That is, a LOAD must be completed to resolve a conditional branch, the
> STORE is after the branch and cannot be made visible until the branch is
> determined (which implies the load is complete).
>
> However, our 'dear' C language has no clue of any of this.
>
> So given code like:
>
> x = *foo;
> if (x > 42)
> *bar = 1;
>
> Which, if literally translated into assembly, would provide a
> LOAD->STORE order between foo and bar, could, in the hands of an
> evil^Woptimizing compiler, become:
>
> x = *foo;
> *bar = 1;
>
> because it knows, through value tracking, that the condition must be
> true.
>
> Our Documentation/memory-barriers.txt has a Control Dependencies section
> (which I shall not replicate here for brevity) which lists a number of
> caveats. But in general the work-around we use is:
>
> x = READ_ONCE(*foo);
> if (x > 42)
> WRITE_ONCE(*bar, 1);
An alternative is to 'persuade' the compiler that
any 'tracked' value for a local variable is invalid.
Rather like the way that barrier() 'invalidates' memory.
So you generate:
	x = *foo;
	asm ("" : "+r" (x));
	if (x > 42)
		*bar = 1;
Since the "+r" constraint indicates that the value of 'x'
might have changed, it can't optimise based on any
presumed old value.
(Unless it looks inside the asm opcodes...)
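For concreteness, a self-contained sketch of this laundering trick (the kernel wraps the same idea as OPTIMIZER_HIDE_VAR(); the helper name here is made up):

```c
/* Pass x through an empty asm with a "+r" constraint: the compiler
 * must assume the asm may have modified x, so any value it had
 * tracked for x is discarded and the branch below is preserved. */
static inline long hide(long x)
{
	__asm__ ("" : "+r" (x));
	return x;
}

long store_after_check(long *foo, long *bar)
{
	long x = hide(*foo);
	if (x > 42)
		*bar = 1;
	return x;
}
```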
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
On Tue, Oct 06, 2020 at 12:37:06PM +0000, David Laight wrote:
> > Our Documentation/memory-barriers.txt has a Control Dependencies section
> > (which I shall not replicate here for brevity) which lists a number of
> > caveats. But in general the work-around we use is:
> >
> > x = READ_ONCE(*foo);
> > if (x > 42)
> > WRITE_ONCE(*bar, 1);
>
> An alternative is to 'persuade' the compiler that
> any 'tracked' value for a local variable is invalid.
> Rather like the way that barrier() 'invalidates' memory.
> So you generate:
>
> x = *foo
> asm ("" : "+r" (x));
> if (x > 42)
> *bar = 1;
>
> Since the "+r" constraint indicates that the value of 'x'
> might have changed it can't optimise based on any
> presumed old value.
> (Unless it looks inside the asm opcodes...)
I'm using exactly this in userland to prevent the compiler from guessing
what I'm doing with a variable, and it's also useful sometimes to shut up
certain warnings when I know a condition is satisfied but can hardly be
expressed in a way to please the compiler. Overall I find that it's no
big deal and forces the developer to think twice before doing it, which
is probably a good thing in general.
Willy
On Tue, Oct 06, 2020 at 12:37:06PM +0000, David Laight wrote:
> From: Peter Zijlstra
> > Sent: 06 October 2020 12:47
> > ...
> > Our Documentation/memory-barriers.txt has a Control Dependencies section
> > (which I shall not replicate here for brevity) which lists a number of
> > caveats. But in general the work-around we use is:
> >
> > x = READ_ONCE(*foo);
> > if (x > 42)
> > WRITE_ONCE(*bar, 1);
>
> An alternative is to 'persuade' the compiler that
> any 'tracked' value for a local variable is invalid.
> Rather like the way that barrier() 'invalidates' memory.
> So you generate:
>
> x = *foo
> asm ("" : "+r" (x));
> if (x > 42)
> *bar = 1;
>
> Since the "+r" constraint indicates that the value of 'x'
> might have changed it can't optimise based on any
> presumed old value.
> (Unless it looks inside the asm opcodes...)
The compiler can still try and lift the store out of the block, possibly
by inventing more stores.
Please go read memory-barriers.txt for a bunch of other examples.
This thread is not to collect work-arounds that might convince a
compiler to emit the desired code as a side effect, but to get the
compiler people involved and get control-dependencies recognised such
that correct code gen is guaranteed.
Only if we get the compiler people on board and have them provide means
are we guaranteed safe from the optimizer. Otherwise we'll just keep
playing whack-a-mole with fancy new optimization techniques. And given
how horridly painful it is to debug memory ordering problems, I feel it
is best to make sure we're not going to have to do this more than necessary.
On Tue, Oct 06, 2020 at 03:31:15PM +0200, Peter Zijlstra wrote:
> On Tue, Oct 06, 2020 at 12:37:06PM +0000, David Laight wrote:
> > From: Peter Zijlstra
> > > Sent: 06 October 2020 12:47
> > > ...
> > > Our Documentation/memory-barriers.txt has a Control Dependencies section
> > > (which I shall not replicate here for brevity) which lists a number of
> > > caveats. But in general the work-around we use is:
> > >
> > > x = READ_ONCE(*foo);
> > > if (x > 42)
> > > WRITE_ONCE(*bar, 1);
> >
> > An alternative is to 'persuade' the compiler that
> > any 'tracked' value for a local variable is invalid.
> > Rather like the way that barrier() 'invalidates' memory.
> > So you generate:
> >
> > x = *foo
> > asm ("" : "+r" (x));
> > if (x > 42)
> > *bar = 1;
> >
> > Since the "+r" constraint indicates that the value of 'x'
> > might have changed it can't optimise based on any
> > presumed old value.
> > (Unless it looks inside the asm opcodes...)
>
> The compiler can still try and lift the store out of the block, possibly
> by inventing more stores.
>
> Please go read memory-barriers.txt for a bunch of other examples.
>
> This thread is not to collect work-arounds that might convince a
> compiler to emit the desired code as a side effect, but to get the
> compiler people involved and get control-dependencies recognised such
> that correct code gen is guaranteed.
>
> Only if we get the compiler people on board and have them provide means
> are we guaranteed safe from the optimizer. Otherwise we'll just keep
> playing whack-a-mole with fancy new optimization techniques. And given
> how horridly painful it is to debug memory ordering problems, I feel it
> is best to make sure we're not going to have to do this more than necessary.
Given that you would have to add a compiler annotation, isn't it just as
easy to use READ_ONCE and WRITE_ONCE?
Or are you worried that even with READ_ONCE and WRITE_ONCE, the compiler
might still somehow defeat the control dependency?
Alan
On Tue, Oct 06, 2020 at 10:23:24AM -0400, [email protected] wrote:
> On Tue, Oct 06, 2020 at 03:31:15PM +0200, Peter Zijlstra wrote:
> > Only if we get the compiler people on board and have them provide means
> > are we guaranteed safe from the optimizer. Otherwise we'll just keep
> > playing whack-a-mole with fancy new optimization techniques. And given
> > how horridly painful it is to debug memory ordering problems, I feel it
> > is best to make sure we're not going to have to do this more than necessary.
>
> Given that you would have to add a compiler annotation, isn't it just as
> easy to use READ_ONCE and WRITE_ONCE?
>
> Or are you worried that even with READ_ONCE and WRITE_ONCE, the compiler
> might still somehow defeat the control dependency?
Yes.
Also, not all instances actually use WRITE_ONCE(). The one in
the perf ringbuffer for example uses local_read() for the load (which is
basically READ_ONCE()), but doesn't use WRITE_ONCE() on the inside.
Also, afaiu WRITE_ONCE() also doesn't stop the compiler from lifting it
if it thinks it can get away with it -- memory-barriers.txt has
examples.
And then there's the case where the common branch has a store; I know the
ARM64 ARM isn't clear on that, but I'm thinking any actual hardware will
still have to respect it, and it's a matter of 'fixing' the ARM.
Mostly I just want the compiler people to say they'll guarantee the
behaviour if we do 'X'. If 'X' happens to be 'any dynamic branch headed
by a volatile load' that's fine by me.
If they want a new keyword or attribute, that's also fine. But I want to
have the compiler people tell me what they want and guarantee they'll
not come and wreck things.
> Please go read memory-barriers.txt for a bunch of other examples.
My brain hurts.
It's not as though I've not known about most of this since
the first SMP capable SPARC cpus that had a 'store buffer'.
(Probably 30 years ago.)
I sometimes wonder whether compiling things like the kernel
requires a compiler mode that defaults everything (except
function locals) to 'volatile'.
Although even that still doesn't fix control dependencies.
David
Hi Guys,
OK, so playing the devil's advocate here...
> Mostly I just want the compiler people to say they'll guarantee the
> behaviour if we do 'X'. If 'X' happens to be 'any dynamic branch headed
> by a volatile load' that's fine by me.
Is a compiler hack really the right way to go here?
After all, if you do get this feature added it will make it
harder to compile the kernel with other compilers (*cough
LLVM cough*), or older versions of gcc. Plus code like this
is often subject to very aggressive optimization and all it
takes is one bug in the compiler implementation and you lose
the gains you were trying for.
My suggestion as an alternative is to use assembler instead.
That way you can guarantee that you get the instructions you
want in the order that you want them. It should be fairly
straightforward to create a macro or inline function that
contains the necessary code and this can be done once and
then used wherever the functionality is required.
Cheers
Nick
> My suggestion as an alternative is to use assembler instead.
> That way you can guarantee that you get the instructions you
> want in the order that you want them. It should be fairly
> straightforward to create a macro or inline function that
> contains the necessary code and this can be done once and
> then used wherever the functionality is required.
C exists because K&R got fed up with writing pdp-11 assembler.
Compared to some modern ones it is nice and easy to write
(I'm old enough to have used pdp-11.)
You can't put control dependencies inside C asm statements.
David
On Tue, Oct 06, 2020 at 03:37:24PM +0000, David Laight wrote:
> > My suggestion as an alternative is to use assembler instead.
> > That way you can guarantee that you get the instructions you
> > want in the order that you want them. It should be fairly
> > straightforward to create a macro or inline function that
> > contains the necessary code and this can be done once and
> > then used wherever the functionality is required.
>
> C exists because K&R got fed up of writing pdp-11 assembler.
> Compared to some modern ones it is nice and easy to write
> (I'm old enough to have used pdp-11.)
>
> You can't put control dependencies inside C asm statements.
What David said!
And not only are modern machine languages quite complex and strange
compared to that of the PDP-11, but writing core Linux-kernel code
in assembler requires writing it 25 times, once for each of the
supported architectures. And even that is an underestimate because
some architectures have multiple variants, for but one example, arm's
multiple instruction sets.
Comparing 25+ assembly languages to but two compilers most definitely
motivates looking hard at doing something with the compilers.
Thanx, Paul
On Tue, Oct 06, 2020 at 08:50:52AM -0700, Paul E. McKenney wrote:
> Comparing 25+ assembly languages to but two compilers most definitely
> motivates looking hard at doing something with the compilers.
+1, especially since the kernel is not special; anyone working with
threads faces the same issues in userland, which are often hidden
behind the implicit whole-memory clobbers of certain operations or
the call to mutex manipulation functions, but which become a reality
again as soon as you go slightly away from these and try to use
lockless mechanisms.
Willy
From: Willy Tarreau
> Sent: 06 October 2020 17:11
>
> On Tue, Oct 06, 2020 at 08:50:52AM -0700, Paul E. McKenney wrote:
> > Comparing 25+ assembly languages to but two compilers most definitely
> > motivates looking hard at doing something with the compilers.
>
> +1, especially since the kernel is not special; anyone working with
> threads faces the same issues in userland, which are often hidden
> behind the implicit whole-memory clobbers of certain operations or
> the call to mutex manipulation functions, but which become a reality
> again as soon as you go slightly away from these and try to use
> lockless mechanisms.
AFAICT most Windows and Android apps completely ignore the problem
of thread locking - which is why they crash and lock up all the time :-)
I've spent most of the day looking at some library traces from a
customer bug.
I almost suspect a bug in the pthread mutex code on their system.
They are using a nice, modern, 3.10.0-957.el7.x86_64 kernel.
David
On Tue, Oct 06, 2020 at 04:22:10PM +0000, David Laight wrote:
> From: Willy Tarreau
> > Sent: 06 October 2020 17:11
> >
> > On Tue, Oct 06, 2020 at 08:50:52AM -0700, Paul E. McKenney wrote:
> > > Comparing 25+ assembly languages to but two compilers most definitely
> > > motivates looking hard at doing something with the compilers.
> >
> > +1, especially since the kernel is not special; anyone working with
> > threads faces the same issues in userland, which are often hidden
> > behind the implicit whole-memory clobbers of certain operations or
> > the call to mutex manipulation functions, but which become a reality
> > again as soon as you go slightly away from these and try to use
> > lockless mechanisms.
>
> AFAICT most windows and android apps completely ignore the problem
> of thread locking - which is why the crash and lock up all the time :-)
>
> I've spent most of the day looking at some library traces from a
> customer bug.
> I almost suspect a bug in the pthread mutex code on their system.
> They are using a nice, modern, 3.10.0-957.el7.x86_64 kernel.
It would be good if the compiler were more helpful! Failing that, if
-something- could be more helpful!!!
Thanx, Paul
* Peter Zijlstra:
> Our Documentation/memory-barriers.txt has a Control Dependencies section
> (which I shall not replicate here for brevity) which lists a number of
> caveats. But in general the work-around we use is:
>
> x = READ_ONCE(*foo);
> if (x > 42)
> WRITE_ONCE(*bar, 1);
>
> Where READ/WRITE_ONCE() cast the variable volatile. The volatile
> qualifier dissuades the compiler from assuming it knows things and we
> then hope it will indeed emit the branch like we'd expect.
>
>
> Now, hoping the compiler generates correct code is clearly not ideal and
> very dangerous indeed. Which is why my question to the compiler folks
> assembled here is:
>
> Can we get a C language extention for this?
For what exactly?
Do you want a compiler that never simplifies conditional expressions
(like some people want compilers that never re-associate floating point
operations)?
With a bit of effort, we could prototype such a compiler and run
benchmarks against a kernel that was built using it. But I'm not sure
if that's a good use of resources.
> And while we have a fair number (and growing) existing users of this in
> the kernel, I'd not be adverse to having to annotate them.
But not using READ_ONCE and WRITE_ONCE?
I think in GCC, they are called __atomic_load_n(foo, __ATOMIC_RELAXED)
and __atomic_store_n(foo, __ATOMIC_RELAXED). GCC can't optimize relaxed
MO loads and stores because the C memory model is defective and does not
actually guarantee the absence of out-of-thin-air values (a property it
was supposed to have). Obviously, actual implementations want to
provide this guarantee. So it's really impossible right now to argue
about this in any formal way and determine the validity of optimizations
(which is why there are none, hopefully).
In standard C, there is <stdatomic.h>, but its relaxed MO loads and
stores can of course be optimized, so you'll need a compiler extension
here.
A different way of annotating this would be a variant of _Atomic where
plain accesses have relaxed MO, not seq-cst MO.
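Sketching the pattern with the GCC built-ins mentioned above (whether the emitted conditional branch then actually guarantees the LOAD->STORE ordering is precisely the open question of this thread):

```c
/* Relaxed-MO load heading a conditional relaxed-MO store, using the
 * GCC __atomic built-ins.  Nothing in the C memory model promises
 * that the branch orders the store after the load; the hardware
 * ordering argument is what the kernel relies on. */
long read_then_store(long *foo, long *bar)
{
	long x = __atomic_load_n(foo, __ATOMIC_RELAXED);
	if (x > 42)
		__atomic_store_n(bar, 1, __ATOMIC_RELAXED);
	return x;
}
```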
Thanks,
Florian
--
Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn,
Commercial register: Amtsgericht Muenchen, HRB 153243,
Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill
On Tue, Oct 06, 2020 at 11:20:01PM +0200, Florian Weimer wrote:
> * Peter Zijlstra:
>
> > Our Documentation/memory-barriers.txt has a Control Dependencies section
> > (which I shall not replicate here for brevity) which lists a number of
> > caveats. But in general the work-around we use is:
> >
> > x = READ_ONCE(*foo);
> > if (x > 42)
> > WRITE_ONCE(*bar, 1);
> >
> > Where READ/WRITE_ONCE() cast the variable volatile. The volatile
> > qualifier dissuades the compiler from assuming it knows things and we
> > then hope it will indeed emit the branch like we'd expect.
> >
> >
> > Now, hoping the compiler generates correct code is clearly not ideal and
> > very dangerous indeed. Which is why my question to the compiler folks
> > assembled here is:
> >
> > Can we get a C language extention for this?
>
> For what exactly?
A branch that cannot be optimized away and that prohibits lifting stores
over it. One possible suggestion would be allowing the volatile keyword as
a qualifier to if:
	x = *foo;
	volatile if (x > 42)
		*bar = 1;
This would tell the compiler that the condition is special in that it
must emit a conditional branch instruction and that it must not lift
stores (or sequence points) over it.
> Do you want a compiler that never simplifies conditional expressions
> (like some people want compilers that never re-associate floating point
> operations)?
No. I'm fine with optimizing things in general, I just want to be able
to control/limit it for a few specific cases.
> > And while we have a fair number (and growing) existing users of this in
> > the kernel, I'd not be adverse to having to annotate them.
>
> But not using READ_ONCE and WRITE_ONCE?
I'm OK with READ_ONCE(), but the WRITE_ONCE() doesn't help much, if
anything. The compiler is always allowed to lift stores, regardless of
the qualifiers used.
> I think in GCC, they are called __atomic_load_n(foo, __ATOMIC_RELAXED)
> and __atomic_store_n(foo, __ATOMIC_RELAXED). GCC can't optimize relaxed
> MO loads and stores because the C memory model is defective and does not
> actually guarantee the absence of out-of-thin-air values (a property it
> was supposed to have).
AFAIK people want to get that flaw in the C memory model fixed (which to
me seemed like a very good idea).
Also, AFAIK the compiler would be allowed to lift __atomic_store_n(foo,
__ATOMIC_RELAXED) out of a branch.
> A different way of annotating this would be a variant of _Atomic where
> plain accesses have relaxed MO, not seq-cst MO.
So Linux isn't going to use _Atomic, we disagree with the C memory model
too much. Also, volatile is perfectly sufficient for things.
I know there's a bunch of people in the C committee that want to get rid
of volatile, but that's just not going to happen in the real world,
there's too much volatile out there.
More to the point, Linux already relies on this without the later stores
being annotated, and it works because lifting those stores just really
doesn't make sense (and it's further constrained by sequence points,
although I'm not sure what, if anything, the effect of LTO optimization
is on sequence points -- inline for example removes sequence points,
which is sometimes scary as heck).
* Peter Zijlstra:
> On Tue, Oct 06, 2020 at 11:20:01PM +0200, Florian Weimer wrote:
>> * Peter Zijlstra:
>>
>> > ...
>>
>> For what exactly?
>
> A branch that cannot be optimized away and prohibits lifting stores
> over. One possible suggestion would be allowing the volatile keyword as
> a qualifier to if.
>
> x = *foo;
> volatile if (x > 42)
> *bar = 1;
>
> This would tell the compiler that the condition is special in that it
> must emit a conditional branch instruction and that it must not lift
> stores (or sequence points) over it.
But it's not the if statement, but the expression in it, right? So this
would not be a valid transformation:
	x = *foo;
	bool flag = x > 42;
	volatile if (flag)
		*bar = 1;
And if we had this:
	unsigned x = *foo;
	volatile if (x >= 17 && x < 42)
		*bar = 1;
Would it be valid to transform this into (assuming that I got the
arithmetic correct):
	unsigned x = *foo;
	volatile if ((x - 17) < 25)
		*bar = 1;
Or would this destroy the magic because arithmetic happens on the value
before the comparison?
>> But not using READ_ONCE and WRITE_ONCE?
>
> I'm OK with READ_ONCE(), but the WRITE_ONCE() doesn't help much, if
> anything. The compiler is always allowed to lift stores, regardless of
> the qualifiers used.
I would hope that with some level of formalization, it can be shown that
no additional synchronization is necessary beyond the load/conditional
sequence.
>> I think in GCC, they are called __atomic_load_n(foo, __ATOMIC_RELAXED)
>> and __atomic_store_n(foo, __ATOMIC_RELAXED). GCC can't optimize relaxed
>> MO loads and stores because the C memory model is defective and does not
>> actually guarantee the absence of out-of-thin-air values (a property it
>> was supposed to have).
>
> AFAIK people want to get that flaw in the C memory model fixed (which to
> me seemd like a very good idea).
It's been a long time since people realized that this problem exists,
with several standard releases since then.
Thanks,
Florian
On Wed, Oct 07, 2020 at 11:32:43AM +0200, Peter Zijlstra wrote:
> A branch that cannot be optimized away and prohibits lifting stores
> over. One possible suggestion would be allowing the volatile keyword as
> a qualifier to if.
>
> x = *foo;
> volatile if (x > 42)
> *bar = 1;
>
> This would tell the compiler that the condition is special in that it
> must emit a conditional branch instruction and that it must not lift
> stores (or sequence points) over it.
This test is interesting, because if foo and bar are of the same type,
nothing prevents them from aliasing and the compiler cannot make wild
guesses on them (i.e. they may be plain memory as well as memory-mapped
registers).
Extending it like this shows a difference between the use of volatile
and __atomic_{load,store}_n. While both are correct in that each access
is properly performed, for an unknown reason the compiler decided to
implement two distinct branches in the atomic case and to inflate the
code:
$ gcc -v
gcc version 9.3.0 (GCC)
$ cat foo-volatile.c
long foobar(long *foo, long *bar)
{
	*(volatile long *)bar = 10;
	if (*(volatile long *)foo <= 42)
		*(volatile long *)bar = 64;
	if (*(volatile long *)foo > 42)
		*(volatile long *)bar = 0;
	return *(volatile long *)bar;
}
$ gcc -c -O2 foo-volatile.c
$ objdump -dr foo-volatile.o
0000000000000000 <foobar>:
0: 48 c7 06 0a 00 00 00 movq $0xa,(%rsi)
7: 48 8b 07 mov (%rdi),%rax
a: 48 83 f8 2a cmp $0x2a,%rax
e: 7f 07 jg 17 <foobar+0x17>
10: 48 c7 06 40 00 00 00 movq $0x40,(%rsi)
17: 48 8b 07 mov (%rdi),%rax
1a: 48 83 f8 2a cmp $0x2a,%rax
1e: 7e 07 jle 27 <foobar+0x27>
20: 48 c7 06 00 00 00 00 movq $0x0,(%rsi)
27: 48 8b 06 mov (%rsi),%rax
2a: c3 retq
$ cat foo-atomic.c
long foobar(long *foo, long *bar)
{
	__atomic_store_n(bar, 10, __ATOMIC_RELAXED);
	if (__atomic_load_n(foo, __ATOMIC_RELAXED) <= 42)
		__atomic_store_n(bar, 64, __ATOMIC_RELAXED);
	if (__atomic_load_n(foo, __ATOMIC_RELAXED) > 42)
		__atomic_store_n(bar, 0, __ATOMIC_RELAXED);
	return __atomic_load_n(bar, __ATOMIC_RELAXED);
}
$ objdump -dr foo-atomic.o
0000000000000000 <foobar>:
0: 48 c7 06 0a 00 00 00 movq $0xa,(%rsi)
7: 48 8b 07 mov (%rdi),%rax
a: 48 83 f8 2a cmp $0x2a,%rax
e: 7e 10 jle 20 <foobar+0x20>
10: 48 8b 07 mov (%rdi),%rax
13: 48 83 f8 2a cmp $0x2a,%rax
17: 7f 17 jg 30 <foobar+0x30>
19: 48 8b 06 mov (%rsi),%rax
1c: c3 retq
1d: 0f 1f 00 nopl (%rax)
20: 48 c7 06 40 00 00 00 movq $0x40,(%rsi)
27: 48 8b 07 mov (%rdi),%rax
2a: 48 83 f8 2a cmp $0x2a,%rax
2e: 7e e9 jle 19 <foobar+0x19>
30: 48 c7 06 00 00 00 00 movq $0x0,(%rsi)
37: 48 8b 06 mov (%rsi),%rax
3a: c3 retq
When building at -Os both produce the same code as the volatile version
above. It *seems* to me that the volatile version always produces better
code, but is it always correct? This is just an illustration of
how tricky this can currently be and how confusing it can sometimes be
for the developer to make sure the desired code is emitted in a few
special cases. And just for this, having the compiler support more easily
predictable constructs would be a nice improvement.
Willy
On Wed, Oct 07, 2020 at 12:20:41PM +0200, Florian Weimer wrote:
> * Peter Zijlstra:
> > A branch that cannot be optimized away and prohibits lifting stores
> > over. One possible suggestion would be allowing the volatile keyword as
> > a qualifier to if.
> >
> > x = *foo;
> > volatile if (x > 42)
> > *bar = 1;
> >
> > This would tell the compiler that the condition is special in that it
> > must emit a conditional branch instruction and that it must not lift
> > stores (or sequence points) over it.
>
> But it's not the if statement, but the expression in it, right?
No, it *IS* the if statement, the magic is a conditional branch
instruction and the fact that CPUs are not allowed to speculate stores
(which would lead to instant OOTA).
> So this would not be a valid transformation:
>
> x = *foo;
> bool flag = x > 42;
> volatile if (flag)
> *bar = 1;
It would be valid; it still ensures the load of *foo happens before the
store of *bar.
> And if we had this:
>
> unsigned x = *foo;
> volatile if (x >= 17 && x < 42)
> *bar = 1;
>
> Would it be valid to transform this into (assuming that I got the
> arithmetic correct):
>
> unsigned x = *foo;
> volatile if ((x - 17) < 25)
> *bar = 1;
>
> Or would this destroy the magic because arithmetic happens on the value
> before the comparison?
Nope, that'd still be good. The critical part is that the resolution of
the conditional branch depends on the load. All these transformations
preserve that.
So we use the data dependency between the load and the branch
instruction, coupled with the inability to speculate stores, to generate
a LOAD->STORE ordering.
> >> But not using READ_ONCE and WRITE_ONCE?
> >
> > I'm OK with READ_ONCE(), but the WRITE_ONCE() doesn't help much, if
> > anything. The compiler is always allowed to lift stores, regardless of
> > the qualifiers used.
>
> I would hope that with some level of formalization, it can be shown that
> no additional synchronization is necessary beyond the load/conditional
> sequence.
Agreed. Those are the critical part; the tricky bit is ensuring the
compiler doesn't lift stuff over the condition.
> >> I think in GCC, they are called __atomic_load_n(foo, __ATOMIC_RELAXED)
> >> and __atomic_store_n(foo, __ATOMIC_RELAXED). GCC can't optimize relaxed
> >> MO loads and stores because the C memory model is defective and does not
> >> actually guarantee the absence of out-of-thin-air values (a property it
> >> was supposed to have).
> >
> > AFAIK people want to get that flaw in the C memory model fixed (which to
> > me seemed like a very good idea).
>
> It's been a long time since people realized that this problem exists,
> with several standard releases since then.
I've been given to believe it is a hard problem. Personally I hold the
opinion that prohibiting store speculation (of all kinds) is both
necessary and sufficient to avoid OOTA. But I have 0 proof for that.
On Wed, Oct 07, 2020 at 01:50:54PM +0200, Peter Zijlstra wrote:
> On Wed, Oct 07, 2020 at 12:20:41PM +0200, Florian Weimer wrote:
> > * Peter Zijlstra:
[ . . . ]
> > >> I think in GCC, they are called __atomic_load_n(foo, __ATOMIC_RELAXED)
> > >> and __atomic_store_n(foo, __ATOMIC_RELAXED). GCC can't optimize relaxed
> > >> MO loads and stores because the C memory model is defective and does not
> > >> actually guarantee the absence of out-of-thin-air values (a property it
> > >> was supposed to have).
> > >
> > > AFAIK people want to get that flaw in the C memory model fixed (which to
> > > me seemed like a very good idea).
> >
> > It's been a long time since people realized that this problem exists,
> > with several standard releases since then.
>
> I've been given to believe it is a hard problem. Personally I hold the
> opinion that prohibiting store speculation (of all kinds) is both
> necessary and sufficient to avoid OOTA. But I have 0 proof for that.
There are proofs for some definitions of store speculation, for example,
as proposed by Demsky and Boehm [1] and as prototyped by Demsky's student,
Peizhao Ou [2]. But these require marking all accesses and end up being
optimized variants of acquire load and release store. One optimization
is that if you have a bunch of loads followed by a bunch of stores,
the compiler can emit a single memory-barrier instruction between the
last load and the first store.
I am not a fan of this approach.
Challenges include:
o Unmarked accesses. Compilers are quite aggressive about
moving normal code.
o Separately compiled code. For example, does the compiler have
unfortunate optimization opportunities when "volatile if"
appears in one translation unit and the dependent stores in
some other translation unit?
o LTO, as has already been mentioned in this thread.
Probably other issues as well, but this is a starting point.
Thanx, Paul
[1] https://dl.acm.org/doi/10.1145/2618128.2618134
"Outlawing ghosts: avoiding out-of-thin-air results"
Hans-J. Boehm and Brian Demsky.
[2] https://escholarship.org/uc/item/2vm546k1
"An Initial Study of Two Approaches to Eliminating Out-of-Thin-Air
Results" Peizhao Ou.
On Wed, Oct 07, 2020 at 10:11:07AM -0700, Paul E. McKenney wrote:
> Challenges include:
>
> o Unmarked accesses. Compilers are quite aggressive about
> moving normal code.
Which is why this thread exists :-) We wants to dis-allow lifting the
stores over our volatile-if.
> o Separately compiled code. For example, does the compiler have
> unfortunate optimization opportunities when "volatile if"
> appears in one translation unit and the dependent stores in
> some other translation unit?
It can hardly lift anything outside a TU (barring the next point). So I
don't see how it can go wrong here. This is in fact the case with the
perf ringbuffer. The ctrl-dep lives in a different TU from the
stores.
> o LTO, as has already been mentioned in this thread.
So I would probably advocate the volatile-if to be a full sync point,
and LTO would have to preserve that.
On Wed, Oct 07, 2020 at 11:07:17PM +0200, Peter Zijlstra wrote:
> On Wed, Oct 07, 2020 at 10:11:07AM -0700, Paul E. McKenney wrote:
>
> > Challenges include:
> >
> > o Unmarked accesses. Compilers are quite aggressive about
> > moving normal code.
>
> Which is why this thread exists :-) We wants to dis-allow lifting the
> stores over our volatile-if.
Of course. But you should expect this point to be a continual source
of shock and surprise to compiler folks. ;-)
> > o Separately compiled code. For example, does the compiler have
> > unfortunate optimization opportunities when "volatile if"
> > appears in one translation unit and the dependent stores in
> > some other translation unit?
>
> It can hardly lift anything outside a TU (barring the next point). So I
> don't see how it can go wrong here. This is in fact the case with the
> perf ringbuffer. The ctrl-dep lives in a different TU from the
> stores.
I don't see how it could either, but I have been surprised before.
> > o LTO, as has already been mentioned in this thread.
>
> So I would probably advocate the volatile-if to be a full sync point,
> and LTO would have to preserve that.
Completely agreed! And probably not the only place that LTO needs
to be reined in a bit.
Thanx, Paul