2021-06-04 18:28:35

by Alan Stern

Subject: Re: [RFC] LKMM: Add volatile_if()

On Fri, Jun 04, 2021 at 06:17:20PM +0200, Peter Zijlstra wrote:
> On Fri, Jun 04, 2021 at 11:51:54AM -0400, Alan Stern wrote:
> > On Fri, Jun 04, 2021 at 05:42:28PM +0200, Peter Zijlstra wrote:
>
> > > #define volatile_if(cond) if (({ bool __t = (cond); BUILD_BUG_ON(__builtin_constant_p(__t)); volatile_cond(__t); }))
> >
> > That won't help with more complicated examples, such as:
> >
> > volatile_if (READ_ONCE(*x) * 0 + READ_ONCE(*y))
>
> That's effectively:
>
> volatile_if (READ_ONCE(*y))
> WRITE_ONCE(*y, 42);

Sorry, what I meant to write was:

volatile_if (READ_ONCE(*x) * 0 + READ_ONCE(*y))
        WRITE_ONCE(*z, 42);

where there is no ordering between *x and *z. It's not daft, and yes, a
macro won't be able to warn about it.

Alan

> which is a valid, but daft, LOAD->STORE order, no? A compiler might
> maybe be able to WARN on that, but that's definitely beyond what we can
> do with macros.


2021-06-04 19:14:31

by Linus Torvalds

Subject: Re: [RFC] LKMM: Add volatile_if()

On Fri, Jun 4, 2021 at 11:27 AM Alan Stern <[email protected]> wrote:
>
> volatile_if (READ_ONCE(*x) * 0 + READ_ONCE(*y))
> WRITE_ONCE(*z, 42);
>
> where there is no ordering between *x and *z.

I wouldn't worry about it.

I think a compiler is allowed to optimize away stupid code.

I get upset when a compiler says "oh, that's undefined, so I will
ignore the obvious meaning of it", but that's a different thing
entirely.

I really wish that the C standards group showed some spine, and said
"there is no undefined, there is only implementation-defined". That
would solve a *lot* of problems.

But I also realize that will never happen. Because "spine" and "good
taste" is not something that I've ever heard of happening in an
industry standards committee.

Side note: it is worth noting that my version of "volatile_if()" has
an added little quirk: it _ONLY_ orders the stuff inside the
if-statement.

I do think it's worth not adding new special cases (especially that
"asm goto" hack that will generate worse code than the compiler could
do), but it means that

x = READ_ONCE(ptr);
volatile_if (x > 0)
        WRITE_ONCE(*z, 42);

has an ordering, but if you write it as

x = READ_ONCE(ptr);
volatile_if (x <= 0)
        return;
WRITE_ONCE(*z, 42);

then I could in theory see the compiler doing that WRITE_ONCE() as
some kind of non-control dependency.

That said, I don't actually see how the compiler could do anything
that actually broke the _semantics_ of the code. Yes, it could do the
write using a magical data dependency on the conditional and turning
it into a store on a conditional address instead (before doing the
branch), but honestly, I don't see how that would actually break
anything.

So this is more of a "in theory, the two sides are not symmetric". The
"asm volatile" in a barrier() will force the compiler to generate the
branch, and the memory clobber in barrier() will most certainly force
any stores inside the "volatile_if()" to be after the branch.

But because the memory clobber is only inside the if-statement true
case, the false case could have the compiler migrate any code in that
false thing to before the if.

Again, semantics do matter, and I don't see how the compiler could
actually break the fundamental issue of "load->conditional->store is a
fundamental ordering even without memory barriers because of basic
causality", because you can't just arbitrarily generate speculative
stores that would be visible to others.

But at the same time, that's *such* a fundamental rule that I really
am intrigued why people think "volatile_if()" is needed in reality (as
opposed to some "in theory, the compiler can know things that are
unknowable thanks to a magical oracle" BS argument).

Linus

2021-06-04 19:21:47

by Linus Torvalds

Subject: Re: [RFC] LKMM: Add volatile_if()

On Fri, Jun 4, 2021 at 12:09 PM Linus Torvalds
<[email protected]> wrote:
>
> Again, semantics do matter, and I don't see how the compiler could
> actually break the fundamental issue of "load->conditional->store is a
> fundamental ordering even without memory barriers because of basic
> causality", because you can't just arbitrarily generate speculative
> stores that would be visible to others.

This, after all, is why we trust that the *hardware* can't do it.

Even if the hardware mis-speculates and goes down the wrong branch,
and speculatively does the store when it shouldn't have, we don't
care: we know that such a speculative store can not possibly become
semantically visible (*) to other threads.

For all the same reasons, I don't see how a compiler can violate
causal ordering of the code (assuming, again, that the test is
_meaningful_ - if we write nonsensical code, that's a different
issue).

If we have compilers that create speculative stores that are visible
to other threads, we need to fix them.

Linus

(*) By "semantically visible" I intend to avoid the whole timing/cache
pattern kind of non-semantic visibility that is all about the spectre
leakage kind of things.

2021-06-04 20:59:52

by Paul E. McKenney

Subject: Re: [RFC] LKMM: Add volatile_if()

On Fri, Jun 04, 2021 at 12:18:43PM -0700, Linus Torvalds wrote:
> On Fri, Jun 4, 2021 at 12:09 PM Linus Torvalds
> <[email protected]> wrote:
> >
> > Again, semantics do matter, and I don't see how the compiler could
> > actually break the fundamental issue of "load->conditional->store is a
> > fundamental ordering even without memory barriers because of basic
> > causality", because you can't just arbitrarily generate speculative
> > stores that would be visible to others.
>
> This, after all, is why we trust that the *hardware* can't do it.
>
> Even if the hardware mis-speculates and goes down the wrong branch,
> and speculatively does the store when it shouldn't have, we don't
> care: we know that such a speculative store can not possibly become
> semantically visible (*) to other threads.
>
> For all the same reasons, I don't see how a compiler can violate
> causal ordering of the code (assuming, again, that the test is
> _meaningful_ - if we write nonsensical code, that's a different
> issue).

I am probably missing your point, but something like this:

if (READ_ONCE(x))
        y = 42;
else
        y = 1729;

Can in theory be transformed into something like this:

y = 1729;
if (READ_ONCE(x))
        y = 42;

The usual way to prevent it is to use WRITE_ONCE().
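
(For reference, a minimal sketch of that form; with volatile stores the
compiler cannot invent the early plain store of 1729:)

if (READ_ONCE(x))
        WRITE_ONCE(y, 42);
else
        WRITE_ONCE(y, 1729);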

Fortunately, register sets are large, and gcc manages to do a single
store and use only %eax.

Thanx, Paul

> If we have compilers that create speculative stores that are visible
> to other threads, we need to fix them.
>
> Linus
>
> (*) By "semantically visible" I intend to avoid the whole timing/cache
> pattern kind of non-semantic visibility that is all about the spectre
> leakage kind of things.

2021-06-04 21:31:03

by Linus Torvalds

Subject: Re: [RFC] LKMM: Add volatile_if()

On Fri, Jun 4, 2021 at 1:56 PM Paul E. McKenney <[email protected]> wrote:
>
> The usual way to prevent it is to use WRITE_ONCE().

The very *documentation example* for "volatile_if()" uses that WRITE_ONCE().

IOW, the patch that started this discussion has this comment in it:

+/**
+ * volatile_if() - Provide a control-dependency
+ *
+ * volatile_if(READ_ONCE(A))
+ * WRITE_ONCE(B, 1);
+ *
+ * will ensure that the STORE to B happens after the LOAD of A.

and my point is that I don't see *ANY THEORETICALLY POSSIBLE* way that
that "volatile_if()" could not be just a perfectly regular "if ()".

Can you?

Because we *literally* depend on the fundamental concept of causality
to make the hardware not re-order those operations.

That is the WHOLE AND ONLY point of this whole construct: we're
avoiding a possibly expensive hardware barrier operation, because we
know we have a more fundamental barrier that is INHERENT TO THE
OPERATION.

And I cannot for the life of me see how a compiler can break that
fundamental concept of causality either.

Seriously. Tell me how a compiler could _possibly_ turn that into
something that breaks the fundamental causal relationship. The same
fundamental causal relationship that is the whole and only reason we
don't need a memory barrier for the hardware.

And no, there is not a way in hell that the above can be written with
some kind of semantically visible speculative store without the
compiler being a total pile of garbage that wouldn't be usable for
compiling a kernel with.

If your argument is that the compiler can magically insert speculative
stores that can then be overwritten later, then MY argument is that
such a compiler could do that for *ANYTHING*. "volatile_if()" wouldn't
save us.

If that's valid compiler behavior in your opinion, then we have
exactly two options:

(a) give up

(b) not use that broken garbage of a compiler.

So I can certainly accept the patch with the simpler implementation of
"volatile_if()", but dammit, I want to see an actual real example
arguing for why it would be relevant and why the compiler would need
our help.

Because the EXACT VERY EXAMPLE that was in the patch as-is sure as
hell is no such thing.

If the intent is to *document* that "this conditional is part of a
load-conditional-store memory ordering pattern", then that is one
thing. But if that's the intent, then we might as well just write it
as

#define volatile_if(x) if (x)

and add a *comment* about why this kind of sequence doesn't need a
memory barrier.

I'd much rather have that kind of documentation, than have barriers
that are magical for theoretical compiler issues that aren't real, and
don't have any grounding in reality.

Without a real and valid example of how this could matter, this is
just voodoo programming.

We don't actually need to walk three times widdershins around the
computer before compiling the kernel. That's not how kernel development
works.

And we don't need to add a "volatile_if()" with magical barriers that
have no possibility of having real semantic meaning.

So I want to know what the semantic meaning of volatile_if() would be,
and why it fixes anything that a plain "if()" wouldn't. I want to see
the sequence where that "volatile_if()" actually *fixes* something.

Linus

2021-06-04 21:44:10

by Paul E. McKenney

Subject: Re: [RFC] LKMM: Add volatile_if()

On Fri, Jun 04, 2021 at 02:27:49PM -0700, Linus Torvalds wrote:
> On Fri, Jun 4, 2021 at 1:56 PM Paul E. McKenney <[email protected]> wrote:
> >
> > The usual way to prevent it is to use WRITE_ONCE().
>
> The very *documentation example* for "volatile_if()" uses that WRITE_ONCE().

Whew! ;-)

> IOW, the patch that started this discussion has this comment in it:
>
> +/**
> + * volatile_if() - Provide a control-dependency
> + *
> + * volatile_if(READ_ONCE(A))
> + * WRITE_ONCE(B, 1);
> + *
> + * will ensure that the STORE to B happens after the LOAD of A.
>
> and my point is that I don't see *ANY THEORETICALLY POSSIBLE* way that
> that "volatile_if()" could not be just a perfectly regular "if ()".
>
> Can you?

I cannot, maybe due to failure of imagination. But please see below.

> Because we *literally* depend on the fundamental concept of causality
> to make the hardware not re-order those operations.
>
> That is the WHOLE AND ONLY point of this whole construct: we're
> avoiding a possibly expensive hardware barrier operation, because we
> know we have a more fundamental barrier that is INHERENT TO THE
> OPERATION.
>
> And I cannot for the life of me see how a compiler can break that
> fundamental concept of causality either.
>
> Seriously. Tell me how a compiler could _possibly_ turn that into
> something that breaks the fundamental causal relationship. The same
> fundamental causal relationship that is the whole and only reason we
> don't need a memory barrier for the hardware.
>
> And no, there is not a way in hell that the above can be written with
> some kind of semantically visible speculative store without the
> compiler being a total pile of garbage that wouldn't be usable for
> compiling a kernel with.
>
> If your argument is that the compiler can magically insert speculative
> stores that can then be overwritten later, then MY argument is that
> such a compiler could do that for *ANYTHING*. "volatile_if()" wouldn't
> save us.
>
> If that's valid compiler behavior in your opinion, then we have
> exactly two options:
>
> (a) give up
>
> (b) not use that broken garbage of a compiler.
>
> So I can certainly accept the patch with the simpler implementation of
> "volatile_if()", but dammit, I want to see an actual real example
> arguing for why it would be relevant and why the compiler would need
> our help.
>
> Because the EXACT VERY EXAMPLE that was in the patch as-is sure as
> hell is no such thing.
>
> If the intent is to *document* that "this conditional is part of a
> load-conditional-store memory ordering pattern", then that is one
> thing. But if that's the intent, then we might as well just write it
> as
>
> #define volatile_if(x) if (x)
>
> and add a *comment* about why this kind of sequence doesn't need a
> memory barrier.
>
> I'd much rather have that kind of documentation, than have barriers
> that are magical for theoretical compiler issues that aren't real, and
> don't have any grounding in reality.
>
> Without a real and valid example of how this could matter, this is
> just voodoo programming.
>
> We don't actually need to walk three times widdershins around the
> computer before compiling the kernel. That's not how kernel development
> works.
>
> And we don't need to add a "volatile_if()" with magical barriers that
> have no possibility of having real semantic meaning.
>
> So I want to know what the semantic meaning of volatile_if() would be,
> and why it fixes anything that a plain "if()" wouldn't. I want to see
> the sequence where that "volatile_if()" actually *fixes* something.

Here is one use case:

volatile_if(READ_ONCE(A)) {
        WRITE_ONCE(B, 1);
        do_something();
} else {
        WRITE_ONCE(B, 1);
        do_something_else();
}

With plain "if", the compiler is within its rights to do this:

tmp = READ_ONCE(A);
WRITE_ONCE(B, 1);
if (tmp)
        do_something();
else
        do_something_else();

On x86, still no problem. But weaker hardware could now reorder the
store to B before the load from A. With volatile_if(), this reordering
would be prevented.

Thanx, Paul

2021-06-04 22:22:26

by Linus Torvalds

Subject: Re: [RFC] LKMM: Add volatile_if()

On Fri, Jun 4, 2021 at 2:40 PM Paul E. McKenney <[email protected]> wrote:
>
> Here is one use case:
>
> volatile_if(READ_ONCE(A)) {
> WRITE_ONCE(B, 1);
> do_something();
> } else {
> WRITE_ONCE(B, 1);
> do_something_else();
> }
>
> With plain "if", the compiler is within its rights to do this:
>
> tmp = READ_ONCE(A);
> WRITE_ONCE(B, 1);
> if (tmp)
> do_something();
> else
> do_something_else();
>
> On x86, still no problem. But weaker hardware could now reorder the
> store to B before the load from A. With volatile_if(), this reordering
> would be prevented.

But *should* it be prevented? For code like the above?

I'm not really seeing that the above is a valid code sequence.

Sure, that "WRITE_ONCE(B, 1)" could be seen as a lock release, and
then it would be wrong to have the read of 'A' happen after the lock
has actually been released. But if that's the case, then it should
have used a smp_store_release() in the first place, not a
WRITE_ONCE().

So I don't see the above as much of a valid example of actual
READ/WRITE_ONCE() use.

If people use READ/WRITE_ONCE() like the above, and they actually
depend on that kind of ordering, I think that code is likely wrong to
begin with. Using "volatile_if()" doesn't make it more valid.

Now, part of this is that I do think that in *general* we should never
use this very subtle load-cond-store pattern to begin with. We should
strive to use more smp_load_acquire() and smp_store_release() if we
care about ordering of accesses. They are typically cheap enough, and
if there's much of an ordering issue, they are the right things to do.
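
For example, Paul's case above could be written with an acquire load
instead of relying on the control dependency (a sketch; the acquire
alone already orders the load of A before the stores to B):

if (smp_load_acquire(&A)) {
        WRITE_ONCE(B, 1);
        do_something();
} else {
        WRITE_ONCE(B, 1);
        do_something_else();
}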

I think the whole "load-to-store ordering" subtle non-ordered case is
for very very special cases, when you literally don't have a general
memory ordering, you just have an ordering for *one* very particular
access. Like some of the very magical code in the rw-semaphore case,
or that smp_cond_load_acquire().

IOW, I would expect that we have a handful of uses of this thing. And
none of them have that "the conditional store is the same on both
sides" pattern, afaik.

And immediately when the conditional store is different, you end up
having a dependency on it that orders it.
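
For example (a sketch: once the two stores differ, the stored value
depends on the loaded value through either the branch or the data
itself, and that dependency is what orders it):

if (READ_ONCE(A))
        WRITE_ONCE(B, 1);
else
        WRITE_ONCE(B, 2);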

But I guess I can accept the above made-up example as an "argument",
even though I feel it is entirely irrelevant to the actual issues and
uses we have.

Linus

2021-06-05 03:15:48

by Alan Stern

Subject: Re: [RFC] LKMM: Add volatile_if()

On Fri, Jun 04, 2021 at 12:09:26PM -0700, Linus Torvalds wrote:
> Side note: it is worth noting that my version of "volatile_if()" has
> an added little quirk: it _ONLY_ orders the stuff inside the
> if-statement.
>
> I do think it's worth not adding new special cases (especially that
> "asm goto" hack that will generate worse code than the compiler could
> do), but it means that
>
> x = READ_ONCE(ptr);
> volatile_if (x > 0)
> WRITE_ONCE(*z, 42);
>
> has an ordering, but if you write it as
>
> x = READ_ONCE(ptr);
> volatile_if (x <= 0)
> return;
> WRITE_ONCE(*z, 42);
>
> then I could in theory see the compiler doing that WRITE_ONCE() as
> some kind of non-control dependency.

This may be a minor point, but can that loophole be closed as follows?

#define volatile_if(x) \
        if ((({ _Bool __x = (x); BUILD_BUG_ON(__builtin_constant_p(__x)); __x; }) && \
             ({ barrier(); 1; })) || ({ barrier(); 0; }))

(It's now a little later at night than when I usually think about this
sort of thing, so my brain isn't firing on all its cylinders. Forgive
me if this is a dumb question.)

Alan

2021-06-05 14:59:41

by Alan Stern

Subject: Re: [RFC] LKMM: Add volatile_if()

On Fri, Jun 04, 2021 at 03:19:11PM -0700, Linus Torvalds wrote:
> Now, part of this is that I do think that in *general* we should never
> use this very subtle load-cond-store pattern to begin with. We should
> strive to use more smp_load_acquire() and smp_store_release() if we
> care about ordering of accesses. They are typically cheap enough, and
> if there's much of an ordering issue, they are the right things to do.
>
> I think the whole "load-to-store ordering" subtle non-ordered case is
> for very very special cases, when you literally don't have a general
> memory ordering, you just have an ordering for *one* very particular
> access. Like some of the very magical code in the rw-semaphore case,
> or that smp_cond_load_acquire().
>
> IOW, I would expect that we have a handful of uses of this thing. And
> none of them have that "the conditional store is the same on both
> sides" pattern, afaik.
>
> And immediately when the conditional store is different, you end up
> having a dependency on it that orders it.
>
> But I guess I can accept the above made-up example as an "argument",
> even though I feel it is entirely irrelevant to the actual issues and
> uses we have.

Indeed, the expansion of the currently proposed version of

volatile_if (A) {
        B;
} else {
        C;
}

is basically the same as

if (A) {
        barrier();
        B;
} else {
        barrier();
        C;
}

which is just about as easy to write by hand. (For some reason my
fingers don't like typing "volatile_"; the letters tend to get
scrambled.)

So given that:

1. Reliance on control dependencies is uncommon in the kernel,
and

2. The loads in A could just be replaced with load_acquires
at a low penalty (or store-releases could go into B and C),

it seems that we may not need volatile_if at all! The only real reason
for having it in the first place was to avoid the penalty of
load-acquire on architectures where it has a significant cost, when the
control dependency would provide the necessary ordering for free. Such
architectures are getting less and less common.

Alan

2021-06-05 16:28:33

by Linus Torvalds

Subject: Re: [RFC] LKMM: Add volatile_if()

On Fri, Jun 4, 2021 at 8:14 PM Alan Stern <[email protected]> wrote:
>
> >
> > then I could in theory see the compiler doing that WRITE_ONCE() as
> > some kind of non-control dependency.
>
> This may be a minor point, but can that loophole be closed as follows?

Note that it's actually entirely sufficient to have the barrier just
on one side.

I brought it up mainly as an oddity, and that it can result in the
compiler generating different code for the two different directions.

The reason that it is sufficient is that with the barrier in place (on
either side), the compiler really can't do much. It can't join either
of the sides, because it has to do that barrier on one side before any
common code.
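
Concretely, something like this (a sketch reusing Paul's A/B example,
with the barrier kept on only one arm):

if (READ_ONCE(A)) {
        barrier();
        WRITE_ONCE(B, 1);
        do_something();
} else {
        WRITE_ONCE(B, 1);
        do_something_else();
}

Only one arm contains the barrier, so there is nothing identical to
merge, and the conditional branch stays in front of the store.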

In fact, even if the compiler decides to first do a conditional call
just around the barrier, and then do any common code (and then do
_another_ conditional branch), it still did that conditional branch
first, and the problem is solved. The CPU doesn't care, it will have
to resolve the branch before any subsequent stores are finalized.

Of course, if the compiler creates a conditional call just around the
barrier, and the barrier is empty (like we do now), and the compiler
leaves no mark of it in the result (like it does seem to do for empty
asm statements), I could imagine some optimizing assembler (or linker)
screwing things up for us, and saying "a conditional branch to the
next instruction can just be removed".

At that point, we've lost again, and it's a toolchain issue. I don't
think that issue can currently happen, but it's an example of yet
another really subtle problem that *could* happen even if *we* do
everything right.

I also do not believe that any of our code that has this pattern would
have that situation where the compiler would generate a branch over
just the barrier. It's kind of similar to Paul's example in that
sense. When we use volatile_if(), the two sides are very very
different entirely regardless of the barrier, so in practice I think
this is all entirely moot.

Linus

2021-06-06 00:16:19

by Paul E. McKenney

Subject: Re: [RFC] LKMM: Add volatile_if()

On Sat, Jun 05, 2021 at 10:57:39AM -0400, Alan Stern wrote:
> On Fri, Jun 04, 2021 at 03:19:11PM -0700, Linus Torvalds wrote:
> > Now, part of this is that I do think that in *general* we should never
> > use this very subtle load-cond-store pattern to begin with. We should
> > strive to use more smp_load_acquire() and smp_store_release() if we
> > care about ordering of accesses. They are typically cheap enough, and
> > if there's much of an ordering issue, they are the right things to do.
> >
> > I think the whole "load-to-store ordering" subtle non-ordered case is
> > for very very special cases, when you literally don't have a general
> > memory ordering, you just have an ordering for *one* very particular
> > access. Like some of the very magical code in the rw-semaphore case,
> > or that smp_cond_load_acquire().
> >
> > IOW, I would expect that we have a handful of uses of this thing. And
> > none of them have that "the conditional store is the same on both
> > sides" pattern, afaik.
> >
> > And immediately when the conditional store is different, you end up
> > having a dependency on it that orders it.
> >
> > But I guess I can accept the above made-up example as an "argument",
> > even though I feel it is entirely irrelevant to the actual issues and
> > uses we have.
>
> Indeed, the expansion of the currently proposed version of
>
> volatile_if (A) {
> B;
> } else {
> C;
> }
>
> is basically the same as
>
> if (A) {
> barrier();
> B;
> } else {
> barrier();
> C;
> }
>
> which is just about as easy to write by hand. (For some reason my
> fingers don't like typing "volatile_"; the letters tend to get
> scrambled.)
>
> So given that:
>
> 1. Reliance on control dependencies is uncommon in the kernel,
> and
>
> 2. The loads in A could just be replaced with load_acquires
> at a low penalty (or store-releases could go into B and C),
>
> it seems that we may not need volatile_if at all! The only real reason
> for having it in the first place was to avoid the penalty of
> load-acquire on architectures where it has a significant cost, when the
> control dependency would provide the necessary ordering for free. Such
> architectures are getting less and less common.

That does sound good, but...

Current compilers beg to differ at -O2: https://godbolt.org/z/5K55Gardn

------------------------------------------------------------------------
#define READ_ONCE(x) (*(volatile typeof(x) *)&(x))
#define WRITE_ONCE(x, val) (READ_ONCE(x) = (val))
#define barrier() __asm__ __volatile__("": : :"memory")

int x, y;

int main(int argc, char *argv[])
{
        if (READ_ONCE(x)) {
                barrier();
                WRITE_ONCE(y, 1);
        } else {
                barrier();
                WRITE_ONCE(y, 1);
        }
        return 0;
}
------------------------------------------------------------------------

Both gcc and clang generate a load followed by a store, with no branch.
ARM gets the same results from both compilers.

As Linus suggested, removing one (but not both!) invocations of barrier()
does cause a branch to be emitted, so maybe that is a way forward.
Assuming it is more than just dumb luck, anyway. :-/

Thanx, Paul

2021-06-06 01:37:02

by Alan Stern

Subject: Re: [RFC] LKMM: Add volatile_if()

On Sat, Jun 05, 2021 at 05:14:18PM -0700, Paul E. McKenney wrote:
> On Sat, Jun 05, 2021 at 10:57:39AM -0400, Alan Stern wrote:
> > Indeed, the expansion of the currently proposed version of
> >
> > volatile_if (A) {
> > B;
> > } else {
> > C;
> > }
> >
> > is basically the same as
> >
> > if (A) {
> > barrier();
> > B;
> > } else {
> > barrier();
> > C;
> > }

> That does sound good, but...
>
> Current compilers beg to differ at -O2: https://godbolt.org/z/5K55Gardn
>
> ------------------------------------------------------------------------
> #define READ_ONCE(x) (*(volatile typeof(x) *)&(x))
> #define WRITE_ONCE(x, val) (READ_ONCE(x) = (val))
> #define barrier() __asm__ __volatile__("": : :"memory")
>
> int x, y;
>
> int main(int argc, char *argv[])
> {
> if (READ_ONCE(x)) {
> barrier();
> WRITE_ONCE(y, 1);
> } else {
> barrier();
> WRITE_ONCE(y, 1);
> }
> return 0;
> }
> ------------------------------------------------------------------------
>
> Both gcc and clang generate a load followed by a store, with no branch.
> ARM gets the same results from both compilers.
>
> As Linus suggested, removing one (but not both!) invocations of barrier()
> does cause a branch to be emitted, so maybe that is a way forward.
> Assuming it is more than just dumb luck, anyway. :-/

Interesting. And changing one of the branches from barrier() to __asm__
__volatile__("nop": : :"memory") also causes a branch to be emitted. So
even though the compiler doesn't "look inside" assembly code, it does
compare two pieces at least textually and apparently assumes if they are
identical then they do the same thing.
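
(Concretely, a sketch of the variant I mean, based on Paul's test case
above; which arm gets the nop shouldn't matter:)

if (READ_ONCE(x)) {
        barrier();
        WRITE_ONCE(y, 1);
} else {
        __asm__ __volatile__("nop": : :"memory");
        WRITE_ONCE(y, 1);
}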

Alan

2021-06-06 03:49:50

by Linus Torvalds

Subject: Re: [RFC] LKMM: Add volatile_if()

On Sat, Jun 5, 2021 at 6:29 PM Alan Stern <[email protected]> wrote:
>
> Interesting. And changing one of the branches from barrier() to __asm__
> __volatile__("nop": : :"memory") also causes a branch to be emitted. So
> even though the compiler doesn't "look inside" assembly code, it does
> compare two pieces at least textually and apparently assumes if they are
> identical then they do the same thing.

That's actually a feature in some cases, ie the ability to do CSE on
asm statements (ie the "always has the same output" optimization that
the docs talk about).

So gcc has always looked at the asm string for that reason, afaik.

I think it's something of a bug when it comes to "asm volatile", but
the documentation isn't exactly super-specific.

There is a statement of "Under certain circumstances, GCC may
duplicate (or remove duplicates of) your assembly code when
optimizing" and a suggestion of using "%=" to generate a unique
instance of an asm.

Which might actually be a good idea for "barrier()", just in case.
However, the problem with that is that I don't think we are guaranteed
to have a universal comment character for asm statements.

IOW, it might be a good idea to do something like

#define barrier() \
        __asm__ __volatile__("# barrier %=": : :"memory")

but I'm not 100% convinced that '#' is always a comment in asm code,
so the above might not actually build everywhere.

However, *testing* the above (in my config, where '#' does work as a
comment character) shows that gcc doesn't actually consider them to be
distinct EVEN THEN, and will still merge two barrier statements.

That's distressing.

So the gcc docs are actively wrong, and %= does nothing - it will
still compare as the exact same inline asm, because the string
equality testing is apparently done before any expansion.

Something like this *does* seem to work:

#define ____barrier(id) __asm__ __volatile__("#" #id: : :"memory")
#define __barrier(id) ____barrier(id)
#define barrier() __barrier(__COUNTER__)

which is "interesting" or "disgusting" depending on how you happen to feel.

And again - the above works only as long as "#" is a valid comment
character in the assembler. And I have this very dim memory of us
having comments in inline asm, and it breaking certain configurations
(for when the assembler that the compiler uses is a special
human-unfriendly one that only accepts compiler output).

You could make even more disgusting hacks, and have it generate something like

.pushsection .discard.barrier
.long #id
.popsection

instead of a comment. We already expect that to work and have generic
inline asm cases that generate code like that.

Linus

2021-06-06 04:48:06

by Paul E. McKenney

Subject: Re: [RFC] LKMM: Add volatile_if()

On Sat, Jun 05, 2021 at 08:41:00PM -0700, Linus Torvalds wrote:
> On Sat, Jun 5, 2021 at 6:29 PM Alan Stern <[email protected]> wrote:
> >
> > Interesting. And changing one of the branches from barrier() to __asm__
> > __volatile__("nop": : :"memory") also causes a branch to be emitted. So
> > even though the compiler doesn't "look inside" assembly code, it does
> > compare two pieces at least textually and apparently assumes if they are
> > identical then they do the same thing.
>
> That's actually a feature in some cases, ie the ability to do CSE on
> asm statements (ie the "always has the same output" optimization that
> the docs talk about).

Agreed, albeit reluctantly. ;-)

> So gcc has always looked at the asm string for that reason, afaik.
>
> I think it's something of a bug when it comes to "asm volatile", but
> the documentation isn't exactly super-specific.
>
> There is a statement of "Under certain circumstances, GCC may
> duplicate (or remove duplicates of) your assembly code when
> optimizing" and a suggestion of using "%=" to generate a unique
> instance of an asm.

So gcc might some day note a do-nothing asm and duplicate it for
the sole purpose of collapsing the "then" and "else" clauses. I
guess I need to keep my paranoia for the time being, then. :-/

> Which might actually be a good idea for "barrier()", just in case.
> However, the problem with that is that I don't think we are guaranteed
> to have a universal comment character for asm statements.
>
> IOW, it might be a good idea to do something like
>
> #define barrier() \
> __asm__ __volatile__("# barrier %=": : :"memory")
>
> but I'm not 100% convinced that '#' is always a comment in asm code,
> so the above might not actually build everywhere.
>
> However, *testing* the above (in my config, where '#' does work as a
> comment character) shows that gcc doesn't actually consider them to be
> distinct EVEN THEN, and will still merge two barrier statements.
>
> That's distressing.

If I keep the old definition of barrier() and make a barrier1() as
you defined above:

#define barrier1() __asm__ __volatile__("# barrier %=": : :"memory")

Then putting barrier() in the "then" clause and barrier1() in the
"else" clause works, though clang 12 for whatever reason generates
an extra jump in that case. https://godbolt.org/z/YhbcsxsxG

Increasing the optimization level gets rid of the extra jump.

Of course, there is no guarantee that gcc won't learn about
assembler constants. :-/

> So the gcc docs are actively wrong, and %= does nothing - it will
> still compare as the exact same inline asm, because the string
> equality testing is apparently done before any expansion.
>
> Something like this *does* seem to work:
>
> #define ____barrier(id) __asm__ __volatile__("#" #id: : :"memory")
> #define __barrier(id) ____barrier(id)
> #define barrier() __barrier(__COUNTER__)
>
> which is "interesting" or "disgusting" depending on how you happen to feel.
>
> And again - the above works only as long as "#" is a valid comment
> character in the assembler. And I have this very dim memory of us
> having comments in inline asm, and it breaking certain configurations
> (for when the assembler that the compiler uses is a special
> human-unfriendly one that only accepts compiler output).
>
> You could make even more disgusting hacks, and have it generate something like
>
> .pushsection .discard.barrier
> .long #id
> .popsection
>
> instead of a comment. We already expect that to work and have generic
> inline asm cases that generate code like that.

And that does the trick as well, at least with recent gcc and clang.
https://godbolt.org/z/P8zPv9f9o

Thanx, Paul

2021-06-06 12:02:59

by Segher Boessenkool

Subject: Re: [RFC] LKMM: Add volatile_if()

On Sat, Jun 05, 2021 at 09:29:03PM -0400, Alan Stern wrote:
> Interesting. And changing one of the branches from barrier() to __asm__
> __volatile__("nop": : :"memory") also causes a branch to be emitted. So
> even though the compiler doesn't "look inside" assembly code, it does
> compare two pieces at least textually and apparently assumes if they are
> identical then they do the same thing.

And that is a simple fact, since the same assembler code (at the same
spot in the program) will do the same thing no matter how that ended up
there.

And the compiler always is allowed to duplicate, join, delete, you name
it, inline assembler code. The only thing that it cares about is
semantics of the code, just like for any other code.


Segher

2021-06-06 13:06:29

by Segher Boessenkool

Subject: Re: [RFC] LKMM: Add volatile_if()

On Sat, Jun 05, 2021 at 08:41:00PM -0700, Linus Torvalds wrote:
> On Sat, Jun 5, 2021 at 6:29 PM Alan Stern <[email protected]> wrote:
> > Interesting. And changing one of the branches from barrier() to __asm__
> > __volatile__("nop": : :"memory") also causes a branch to be emitted. So
> > even though the compiler doesn't "look inside" assembly code, it does
> > compare two pieces at least textually and apparently assumes if they are
> > identical then they do the same thing.
>
> That's actually a feature in some cases, ie the ability to do CSE on
> asm statements (ie the "always has the same output" optimization that
> the docs talk about).
>
> So gcc has always looked at the asm string for that reason, afaik.

GCC does not pretend it can understand the asm. But it can see when
two asm statements are identical.

> I think it's something of a bug when it comes to "asm volatile", but
> the documentation isn't exactly super-specific.

Why would that be? "asm volatile" does not prevent optimisation. It
says this code has some unspecified side effect, and that is all! All
the usual C rules cover everything needed: the same side effects have to
be executed in the same order on the real machine as they would on the
abstract machine.

> There is a statement of "Under certain circumstances, GCC may
> duplicate (or remove duplicates of) your assembly code when
> optimizing" and a suggestion of using "%=" to generate a unique
> instance of an asm.

"%=" outputs a number unique for every output instruction (the whole asm
is one instruction; these are GCC internal instructions, not the same
thing as machine instructions). This will not help here. The actual
thing the manual says is
  Under certain circumstances, GCC may duplicate (or remove duplicates
  of) your assembly code when optimizing. This can lead to unexpected
  duplicate symbol errors during compilation if your 'asm' code defines
  symbols or labels. Using '%=' may help resolve this problem.
It helps prevent duplicated symbols and labels. It does not do much
else.
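
For example (a sketch with a made-up label name): an asm that defines a
label can be duplicated without duplicate-symbol errors when the label
uses %=, because every emitted copy gets a different number:

        __asm__ __volatile__("dup_label%=: nop" : : : "memory");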

> Which might actually be a good idea for "barrier()", just in case.
> However, the problem with that is that I don't think we are guaranteed
> to have a universal comment character for asm statements.

That's right. But ";#" works on most systems, you may be able to use
that?

> IOW, it might be a good idea to do something like
>
> #define barrier() \
> __asm__ __volatile__("# barrier %=": : :"memory")
>
> but I'm not 100% convinced that '#' is always a comment in asm code,
> so the above might not actually build everywhere.

Some assemblers use ";", some use "!", and there are more variations.

But this will not do what you want. "%=" is output as a unique number
*after* everything GCC has done with the asm.

> However, *testing* the above (in my config, where '#' does work as a
> comment character) shows that gcc doesn't actually consider them to be
> distinct EVEN THEN, and will still merge two barrier statements.

Yes, the insns have the same templates and output exactly the same text
to the generated assembler code, so they are CSEd.

> So the gcc docs are actively wrong, and %= does nothing - it will
> still compare as the exact same inline asm, because the string
> equality testing is apparently done before any expansion.

They are not wrong. Maybe the doc could be clearer though? Patches
welcome.

> Something like this *does* seem to work:
>
> #define ____barrier(id) __asm__ __volatile__("#" #id: : :"memory")
> #define __barrier(id) ____barrier(id)
> #define barrier() __barrier(__COUNTER__)
>
> which is "interesting" or "disgusting" depending on how you happen to feel.

__COUNTER__ is a preprocessor thing, much more like what you want here:
this does its work *before* everything the compiler does, while %= does
its thing *after* :-)

(Not that I actually understand what you are trying to do with this).


Segher

2021-06-06 13:23:16

by Segher Boessenkool

Subject: Re: [RFC] LKMM: Add volatile_if()

On Sat, Jun 05, 2021 at 09:43:33PM -0700, Paul E. McKenney wrote:
> So gcc might some day note a do-nothing asm and duplicate it for
> the sole purpose of collapsing the "then" and "else" clauses. I
> guess I need to keep my paranoia for the time being, then. :-/

Or a "do-something" asm, even. What it does is make sure it is executed
on the real machine exactly like on the abstract machine. That is how C
is defined, what a compiler *does*.

The programmer does not have any direct control over the generated code.

> Of course, there is no guarantee that gcc won't learn about
> assembler constants. :-/

I am not sure what you call an "assembler constant" here. But you can
be sure that GCC will not start doing anything here. GCC does not try
to understand what you wrote in an inline asm, it just fills in the
operands and that is all. It can do all the same things to it that it
can do to any other code of course: duplicate it, deduplicate it,
frobnicate it, etc.


Segher

2021-06-06 13:48:53

by Alan Stern

Subject: Re: [RFC] LKMM: Add volatile_if()

On Sun, Jun 06, 2021 at 06:53:36AM -0500, Segher Boessenkool wrote:
> On Sat, Jun 05, 2021 at 09:29:03PM -0400, Alan Stern wrote:
> > Interesting. And changing one of the branches from barrier() to __asm__
> > __volatile__("nop": : :"memory") also causes a branch to be emitted. So
> > even though the compiler doesn't "look inside" assembly code, it does
> > compare two pieces at least textually and apparently assumes if they are
> > identical then they do the same thing.
>
> And that is a simple fact, since the same assembler code (at the same
> spot in the program) will do the same thing no matter how that ended up
> there.

Sure. But the same assembler code at two different spots in the program
might not do the same thing. (Think of code that stores the current EIP
register's value into a variable.)
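
(A hypothetical x86-64 sketch of that idea; the identical asm text
yields a different address at each site where it ends up:)

        void *ip;
        __asm__ __volatile__("leaq 0(%%rip), %0" : "=r" (ip));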

So while de-duplicating such code may be allowed, it will give rise to
observable results at execution time.

Alan

> And the compiler always is allowed to duplicate, join, delete, you name
> it, inline assembler code. The only thing that it cares about is
> semantics of the code, just like for any other code.
>
>
> Segher

2021-06-06 13:52:12

by Alan Stern

Subject: Re: [RFC] LKMM: Add volatile_if()

On Sun, Jun 06, 2021 at 07:59:55AM -0500, Segher Boessenkool wrote:
> On Sat, Jun 05, 2021 at 08:41:00PM -0700, Linus Torvalds wrote:
> > On Sat, Jun 5, 2021 at 6:29 PM Alan Stern <[email protected]> wrote:
> > > Interesting. And changing one of the branches from barrier() to __asm__
> > > __volatile__("nop": : :"memory") also causes a branch to be emitted. So
> > > even though the compiler doesn't "look inside" assembly code, it does
> > > compare two pieces at least textually and apparently assumes if they are
> > > identical then they do the same thing.
> >
> > That's actually a feature in some cases, ie the ability to do CSE on
> > asm statements (ie the "always has the same output" optimization that
> > the docs talk about).
> >
> > So gcc has always looked at the asm string for that reason, afaik.
>
> GCC does not pretend it can understand the asm. But it can see when
> two asm statements are identical.

How similar do two asm strings have to be before they are considered
identical? For instance, do changes to the amount of leading or
trailing whitespace matter?

Or what about including an empty assembly statement in one but not the
other?

Alan

2021-06-06 17:19:52

by Segher Boessenkool

Subject: Re: [RFC] LKMM: Add volatile_if()

On Sun, Jun 06, 2021 at 09:47:49AM -0400, Alan Stern wrote:
> > GCC does not pretend it can understand the asm. But it can see when
> > two asm statements are identical.
>
> How similar do two asm strings have to be before they are considered
> identical? For instance, do changes to the amount of leading or
> trailing whitespace matter?

They have to be identical to be considered identical.

> Or what about including an empty assembly statement in one but not the
> other?

GCC does not parse the assembler template.


Segher

2021-06-06 18:14:26

by Linus Torvalds

Subject: Re: [RFC] LKMM: Add volatile_if()

On Sun, Jun 6, 2021 at 4:56 AM Segher Boessenkool
<[email protected]> wrote:
>
> And that is a simple fact, since the same assembler code (at the same
> spot in the program) will do the same thing no matter how that ended up
> there.

The thing is, that's exactly what gcc violates.

The example - you may not have been cc'd personally on that one - was
something like

if (READ_ONCE(a)) {
        barrier();
        WRITE_ONCE(b, 1);
} else {
        barrier();
        WRITE_ONCE(b, 1);
}

and currently because gcc thinks "same exact code", it will actually
optimize this to (pseudo-asm):

        LD A
        "empty asm"
        ST $1,B

which is very much NOT equivalent to

        LD A
        BEQ over
        "empty asm"
        ST $1,B
        JMP join

over:
        "empty asm"
        ST $1,B

join:

and that's the whole point of the barriers.

It's not equivalent exactly because of memory ordering. In the first
case, there is no ordering on weak architectures. In the second case,
there is always an ordering, because of CPU consistency guarantees.

And no, gcc doesn't understand about memory ordering. But that's
exactly why we use inline asms.

> And the compiler always is allowed to duplicate, join, delete, you name
> it, inline assembler code. The only thing that it cares about is
> semantics of the code, just like for any other code.

See, but it VIOLATES the semantics of the code.

You can't join those two empty asm's (and then remove the branch),
because the semantics of the code really aren't the same any more if
you do. Truly.

Linus

2021-06-06 18:24:20

by Alan Stern

Subject: Re: [RFC] LKMM: Add volatile_if()

On Sun, Jun 06, 2021 at 11:04:49AM -0700, Linus Torvalds wrote:
> On Sun, Jun 6, 2021 at 4:56 AM Segher Boessenkool
> <[email protected]> wrote:
> >
> > And that is a simple fact, since the same assembler code (at the same
> > spot in the program) will do the same thing no matter how that ended up
> > there.
>
> The thing is, that's exactl;y what gcc violates.
>
> The example - you may not have been cc'd personally on that one - was
> something like
>
> if (READ_ONCE(a)) {
> barrier();
> WRITE_ONCE(b,1);
> } else {
> barrier();
> WRITE_ONCE(b, 1);
> }
>
> and currently because gcc thinks "same exact code", it will actually
> optimize this to (pseudo-asm):
>
> LD A
> "empty asm"
> ST $1,B
>
> which is very much NOT equivalent to
>
> LD A
> BEQ over
> "empty asm"
> ST $1,B
> JMP join
>
> over:
> "empty asm"
> ST $1,B
>
> join:
>
> and that's the whole point of the barriers.
>
> It's not equivalent exactly because of memory ordering. In the first
> case, there is no ordering on weak architectures. In the second case,
> there is always an ordering, because of CPU consistency guarantees.
>
> And no, gcc doesn't understand about memory ordering. But that's
> exactly why we use inline asms.
>
> > And the compiler always is allowed to duplicate, join, delete, you name
> > it, inline assembler code. The only thing that it cares about is
> > semantics of the code, just like for any other code.
>
> See, but it VIOLATES the semantics of the code.
>
> You can't join those two empty asm's (and then remove the branch),
> because the semantics of the code really aren't the same any more if
> you do. Truly.

To be fair, the same argument applies even without the asm code. The
compiler will translate

if (READ_ONCE(a))
        WRITE_ONCE(b, 1);
else
        WRITE_ONCE(b, 1);

to

        LD A
        ST $1,B

instead of

        LD A
        BEQ over
        ST $1,B
        JMP join

over:
        ST $1,B

join:

And these two are different for the same memory ordering reasons as
above.

Alan

2021-06-06 18:38:43

by Linus Torvalds

Subject: Re: [RFC] LKMM: Add volatile_if()

On Sun, Jun 6, 2021 at 6:03 AM Segher Boessenkool
<[email protected]> wrote:
>
> On Sat, Jun 05, 2021 at 08:41:00PM -0700, Linus Torvalds wrote:
> >
> > I think it's something of a bug when it comes to "asm volatile", but
> > the documentation isn't exactly super-specific.
>
> Why would that be? "asm volatile" does not prevent optimisation.

Sure it does.

That's the whole and only *POINT* of the "volatile".

It's the same as a volatile memory access. That very much prevents
certain optimizations. You can't just join two volatile reads or
writes, because they have side effects.

And the exact same thing is true of inline asm. Even when they are
*identical*, inline asms have side effects that gcc simply doesn't
understand.

And yes, those side effects can - and do - include "you can't just merge these".

> It says this code has some unspecified side effect, and that is all!

And that should be sufficient. But gcc then violates it, because gcc
doesn't understand the side effects.

Now, the side effects may be *subtle*, but they are very very real.
Just placement of code wrt a branch will actually affect memory
ordering, as that one example was.

> > Something like this *does* seem to work:
> >
> > #define ____barrier(id) __asm__ __volatile__("#" #id: : :"memory")
> > #define __barrier(id) ____barrier(id)
> > #define barrier() __barrier(__COUNTER__)
> >
> > which is "interesting" or "disgusting" depending on how you happen to feel.
>
> __COUNTER__ is a preprocessor thing, much more like what you want here:
> this does its work *before* everything the compiler does, while %= does
> its thing *after* :-)
>
> (Not that I actually understand what you are trying to do with this).

See my previous email for why two barriers in two different code
sequences cannot just be joined into one and moved into the common
parent. It actually is semantically meaningful *where* they are, and
they are distinct barriers.

The case we happen to care about is memory ordering issues. The
example quoted may sound pointless and insane, and I actually don't
believe we have real code that triggers the issue, because whenever we
have a conditional barrier, the two sides of the conditional are
generally so different that gcc would never merge any of it anyway.

So the issue is mostly theoretical, but we do have code that is fairly
critical, and that depends on memory ordering, and on some weakly
ordered machines (which is where all these problems would happen),
actual explicit memory barriers are also *much* too expensive.

End result: we have code that depends on the fact that a read-to-write
ordering exists if there is a data dependency or a control dependency
between the two. No actual expensive CPU instruction to specify the
ordering, because the ordering is implicit in the code flow itself.

But that's what we need a compiler barrier for in the first place -
the compiler certainly doesn't understand about this very subtle
memory ordering issue, and we want to make sure that the code sequence
*remains* that "if A then write B".

Linus

2021-06-06 18:45:43

by Alan Stern

Subject: Re: [RFC] LKMM: Add volatile_if()

On Sat, Jun 05, 2021 at 08:41:00PM -0700, Linus Torvalds wrote:
> On Sat, Jun 5, 2021 at 6:29 PM Alan Stern <[email protected]> wrote:
> >
> > Interesting. And changing one of the branches from barrier() to __asm__
> > __volatile__("nop": : :"memory") also causes a branch to be emitted. So
> > even though the compiler doesn't "look inside" assembly code, it does
> > compare two pieces at least textually and apparently assumes if they are
> > identical then they do the same thing.
>
> That's actually a feature in some cases, ie the ability to do CSE on
> asm statements (ie the "always has the same output" optimization that
> the docs talk about).
>
> So gcc has always looked at the asm string for that reason, afaik.
>
> I think it's something of a bug when it comes to "asm volatile", but
> the documentation isn't exactly super-specific.
>
> There is a statement of "Under certain circumstances, GCC may
> duplicate (or remove duplicates of) your assembly code when
> optimizing" and a suggestion of using "%=" to generate a unique
> instance of an asm.
>
> Which might actually be a good idea for "barrier()", just in case.
> However, the problem with that is that I don't think we are guaranteed
> to have a universal comment character for asm statements.
>
> IOW, it might be a good idea to do something like
>
> #define barrier() \
> __asm__ __volatile__("# barrier %=": : :"memory")
>
> but I'm not 100% convinced that '#' is always a comment in asm code,
> so the above might not actually build everywhere.
>
> However, *testing* the above (in my config, where '#' does work as a
> comment character) shows that gcc doesn't actually consider them to be
> distinct EVEN THEN, and will still merge two barrier statements.
>
> That's distressing.
>
> So the gcc docs are actively wrong, and %= does nothing - it will
> still compare as the exact same inline asm, because the string
> equality testing is apparently done before any expansion.
>
> Something like this *does* seem to work:
>
> #define ____barrier(id) __asm__ __volatile__("#" #id: : :"memory")
> #define __barrier(id) ____barrier(id)
> #define barrier() __barrier(__COUNTER__)
>
> which is "interesting" or "disgusting" depending on how you happen to feel.
>
> And again - the above works only as long as "#" is a valid comment
> character in the assembler. And I have this very dim memory of us
> having comments in inline asm, and it breaking certain configurations
> (for when the assembler that the compiler uses is a special
> human-unfriendly one that only accepts compiler output).
>
> You could make even more disgusting hacks, and have it generate something like
>
> .pushsection .discard.barrier
> .long #id
> .popsection
>
> instead of a comment. We already expect that to work and have generic
> inline asm cases that generate code like that.

I tried the experiment with this code:

#define READ_ONCE(x) (*(volatile typeof(x) *)&(x))
#define WRITE_ONCE(x, val) (READ_ONCE(x) = (val))
#define barrier() __asm__ __volatile__("": : :"memory")

int x, y;

int main(int argc, char *argv[])
{
        if (READ_ONCE(x)) {
                barrier();
                y = 1;
        } else {
                y = 1;
        }
        return 0;
}

The output from gcc -O2 is:

main:
        mov     eax, DWORD PTR x[rip]
        test    eax, eax
        je      .L2
.L2:
        mov     DWORD PTR y[rip], 1

The output from clang is essentially the same (the mov and test are
replaced by a cmp).

This does what we want, but I wouldn't bet against a future
optimization pass getting rid of the "useless" test and branch.

Alan

2021-06-06 18:46:40

by Linus Torvalds

Subject: Re: [RFC] LKMM: Add volatile_if()

On Sun, Jun 6, 2021 at 11:22 AM Alan Stern <[email protected]> wrote:
>
> To be fair, the same argument applies even without the asm code. The
> compiler will translate

Yes, yes.

But that is literally why the asm exists in the first place.

It's supposed to be the barrier that makes sure that doesn't happen.

So your point that "but this would happen without the asm" is missing
the whole point. This is exactly the thing that the asm is supposed to
avoid.

And it actually works fine when just one side has the barrier, because
then no merging can take place, because there is nothing to merge.

That's why my suggested fix for "volatile_if()" was this #define

#define barrier_true() ({ barrier(); 1; })
#define volatile_if(x) if ((x) && barrier_true())

because now code like

volatile_if (READ_ONCE(a))
        WRITE_ONCE(b, 1);
else
        WRITE_ONCE(b, 1);

would force that branch. And it's actually fine to merge the
"WRITE(b,1)", as loing as the branch exists, so the above can (and
does) compile to

        LD A
        BEQ over
        "empty asm"
over:
        ST $1,B

and the above is actually perfectly valid code and actually solves the
problem, even if it admittedly looks entirely insane.

With that crazy "conditional jump over nothing" the store to B is
ordered wrt the load from A on real machines.

And again: I do not believe we actually have this kind of code in the
kernel. I could imagine some CPU turning "conditional branch over
nothing" into a nop-op internally, and losing the ordering. And that's
ok, exactly because the above kind of code that *only* does the
WRITE_ONCE() and nothing else is crazy and stupid.

So don't get hung up on the "branch over nothing", that's just for
this insane unreal example.

But I *could* see us having something where both branches do end up
writing to "B", and it might even be the first thing both branches end
up doing. Not the *only* thing they do, but "B" might be a flag for "I
am actively working on this issue", and I could see a situation where
we care that the read of "A" (which might be what specifies *what* the
issue is) would need to be ordered with regards to that "I'm working
on it" flag.

IOW, another CPU might want to know *what* somebody is working on, and do

/* Is somebody working on this */
if (READ_ONCE(B)) {
        smp_rmb();
        READ_ONCE(A); <- this is what they are working on

and the ordering requirement in this all is that B has to be written
after A has been read.

So while the example code is insane and pointless (and you shouldn't
read *too* much into it), conceptually the notion of that pattern of

if (READ_ONCE(a)) {
        WRITE_ONCE(b, 1);
        .. do something ..
} else {
        WRITE_ONCE(b, 1);
        .. do something else ..
}

is not insane or entirely unrealistic - the WRITE_ONCE() might
basically be an ACK for "I have read the value of A and will act on
it".

Odd? Yes. Unusual? Yes. Do we do this now? No. But it does worry me
that we don't seem to have a good way to add that required barrier.

Adding it on one side is good, and works, but what if somebody then does

volatile_if (READ_ONCE(a))
        WRITE_ONCE(b, 1);
else {
        barrier();
        WRITE_ONCE(b, 1);
}

and now we end up with it on both sides again, and then the second
barrier basically undoes the first one..

Linus

2021-06-06 18:48:03

by Segher Boessenkool

Subject: Re: [RFC] LKMM: Add volatile_if()

On Sun, Jun 06, 2021 at 11:04:49AM -0700, Linus Torvalds wrote:
> if (READ_ONCE(a)) {
> barrier();
> WRITE_ONCE(b,1);
> } else {
> barrier();
> WRITE_ONCE(b, 1);
> }
>
> and currently because gcc thinks "same exact code", it will actually
> optimize this to (pseudo-asm):
>
> LD A
> "empty asm"
> ST $1,B
>
> which is very much NOT equivalent to
>
> LD A
> BEQ over
> "empty asm"
> ST $1,B
> JMP join
>
> over:
> "empty asm"
> ST $1,B
>
> join:
>
> and that's the whole point of the barriers.

You didn't use a barrier with these semantics though. There is nothing
in that code that guarantees a branch.

> See, but it VIOLATES the semantics of the code.

The code violates your expectations of the code.

> You can't join those two empty asm's (and then remove the branch),
> because the semantics of the code really aren't the same any more if
> you do. Truly.

You truly should have written a branch in the asm if you truly wanted
a branch instruction.


Segher

2021-06-06 18:52:24

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC] LKMM: Add volatile_if()

On Sun, Jun 6, 2021 at 11:43 AM Segher Boessenkool
<[email protected]> wrote:
>
> You truly should have written a branch in the asm if you truly wanted
> a branch instruction.

That's exactly what I don't want to do, and what the original patch by
PeterZ did.

Why?

Because then we need to write that stupid pointless branch for every
single architecture.

And to work well, it needs "asm goto", which is so recent that a lot
of compilers don't support it (thank God for clang dragging gcc
kicking and screaming to implement it at all - I'd asked for it over a
decade ago).

So you get bad code generation in a lot of cases, which entirely
obviates the _point_ of this all - which is that we can avoid an
expensive operation (a memory barrier) by just doing clever code
generation.

So if we can't get the clever code generation, it's all pretty much
moot, imnsho.

A working barrier "just fixes it".

I suspect the best we can do is to just work around the gcc badness
with that __COUNTER__ trick of mine. The lack of a reliable comment
character is the biggest issue with that trick.

Linus

2021-06-06 18:56:05

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC] LKMM: Add volatile_if()

On Sun, Jun 6, 2021 at 11:48 AM Linus Torvalds
<[email protected]> wrote:
> And to work well, it needs "asm goto", which is so recent that a lot
> of compilers don't support it (thank God for clang dragging gcc
> kicking and screaming to implement it at all - I'd asked for it over a
> decade ago).

Oh, actually, I'm wrong on this.

We don't need an output from the asm (the output ends up being in the
targets), so we can use the old-style asm goto that we've been relying
on for a long time.

So the main code generation problem is just (a) all the architectures
and (b) we'd have to use a fixed conditional against zero.

Linus

2021-06-06 19:01:28

by Jakub Jelinek

[permalink] [raw]
Subject: Re: [RFC] LKMM: Add volatile_if()

On Sat, Jun 05, 2021 at 08:41:00PM -0700, Linus Torvalds wrote:
> Something like this *does* seem to work:
>
> #define ____barrier(id) __asm__ __volatile__("#" #id: : :"memory")
> #define __barrier(id) ____barrier(id)
> #define barrier() __barrier(__COUNTER__)
>
> which is "interesting" or "disgusting" depending on how you happen to feel.

I think just
#define barrier() __asm__ __volatile__("" : : "i" (__COUNTER__) : "memory")
should be enough (or "X" instead of "i" if some arch uses -fpic and will not
accept small constants in PIC code), for CSE gcc compares that the asm template
string and all arguments are the same.

As for volatile, that is implicit on asm without any output operands and
it is about whether the inline asm can be DCEd, not whether it can be CSEd.
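
For illustration, a rough sketch of how that unique-argument barrier would
slot into the volatile_if() definition discussed earlier in the thread (the
macro names follow Linus's earlier sketch; this is not a tested patch):

        #define barrier()       __asm__ __volatile__("" : : "i" (__COUNTER__) : "memory")
        #define barrier_true()  ({ barrier(); 1; })
        #define volatile_if(x)  if ((x) && barrier_true())

Each textual expansion of barrier() gets its own __COUNTER__ value, so two
such asms are never "the same" as far as gcc's CSE is concerned and the two
arms of the conditional cannot be merged into one.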

Jakub

2021-06-06 19:10:59

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [RFC] LKMM: Add volatile_if()

On Sun, Jun 06, 2021 at 08:17:40AM -0500, Segher Boessenkool wrote:
> On Sat, Jun 05, 2021 at 09:43:33PM -0700, Paul E. McKenney wrote:
> > So gcc might some day note a do-nothing asm and duplicate it for
> > the sole purpose of collapsing the "then" and "else" clauses. I
> > guess I need to keep my paranoia for the time being, then. :-/
>
> Or a "do-something" asm, even. What it does is make sure it is executed
> on the real machine exactly like on the abstract machine. That is how C
> is defined, what a compiler *does*.
>
> The programmer does not have any direct control over the generated code.

I am not looking for direct control, simply sufficient influence. ;-)

> > Of course, there is no guarantee that gcc won't learn about
> > assembler constants. :-/
>
> I am not sure what you call an "assembler constant" here. But you can
> be sure that GCC will not start doing anything here. GCC does not try
> to understand what you wrote in an inline asm, it just fills in the
> operands and that is all. It can do all the same things to it that it
> can do to any other code of course: duplicate it, deduplicate it,
> frobnicate it, etc.

Apologies, that "assembler constants" should have been "assembler
comments".

Thanx, Paul

2021-06-06 19:17:35

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [RFC] LKMM: Add volatile_if()

On Sun, Jun 06, 2021 at 08:59:22PM +0200, Jakub Jelinek wrote:
> On Sat, Jun 05, 2021 at 08:41:00PM -0700, Linus Torvalds wrote:
> > Something like this *does* seem to work:
> >
> > #define ____barrier(id) __asm__ __volatile__("#" #id: : :"memory")
> > #define __barrier(id) ____barrier(id)
> > #define barrier() __barrier(__COUNTER__)
> >
> > which is "interesting" or "disgusting" depending on how you happen to feel.
>
> I think just
> #define barrier() __asm__ __volatile__("" : : "i" (__COUNTER__) : "memory")
> should be enough (or "X" instead of "i" if some arch uses -fpic and will not
> accept small constants in PIC code), for CSE gcc compares that the asm template
> string and all arguments are the same.

This does seem to do the trick: https://godbolt.org/z/K5j3bYqGT

So thank you for that!

Thanx, Paul

> As for volatile, that is implicit on asm without any output operands and
> it is about whether the inline asm can be DCEd, not whether it can be CSEd.
>
> Jakub
>

2021-06-06 19:25:20

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC] LKMM: Add volatile_if()

On Sun, Jun 6, 2021 at 11:59 AM Jakub Jelinek <[email protected]> wrote:
>
> I think just
> #define barrier() __asm__ __volatile__("" : : "i" (__COUNTER__) : "memory")
> should be enough

Oh, I like that. Much better.

It avoids all the issues with comments etc, and because it's not using
__COUNTER__ as a string, it doesn't need the preprocessor games with
double expansion either.

So yeah, that seems like a nice solution to the issue, and should make
the barriers all unique to the compiler.

Linus

2021-06-06 19:25:34

by Segher Boessenkool

[permalink] [raw]
Subject: Re: [RFC] LKMM: Add volatile_if()

On Sun, Jun 06, 2021 at 11:25:46AM -0700, Linus Torvalds wrote:
> On Sun, Jun 6, 2021 at 6:03 AM Segher Boessenkool
> <[email protected]> wrote:
> >
> > On Sat, Jun 05, 2021 at 08:41:00PM -0700, Linus Torvalds wrote:
> > >
> > > I think it's something of a bug when it comes to "asm volatile", but
> > > the documentation isn't exactly super-specific.
> >
> > Why would that be? "asm volatile" does not prevent optimisation.
>
> Sure it does.
>
> That's the whole and only *POINT* of the "volatile".
>
> It's the same as a volatile memory access. That very much prevents
> certain optimizations. You can't just join two volatile reads or
> writes, because they have side effects.

You can though. In exactly this same way:

volatile int x;
void g(int);
void f(int n) { if (n) g(x); else g(x); }

==>

f:
movl x(%rip), %edi
jmp g

You can do whatever you want with code with side effects. The only
thing required is that the side effects are executed as often as before
and in the same order. Merging identical sides of a diamond is just
fine.

> And the exact same thing is true of inline asm. Even when they are
> *identical*, inline asms have side effects that gcc simply doesn't
> understand.

Only volatile asm does (including all asm without outputs). But that
still does not mean GCC cannot manipulate the asm!

> And yes, those side effects can - and do - include "you can't just merge these".

They do not. That is not what a side effect is.

> > It says this code has some unspecified side effect, and that is all!
>
> And that should be sufficient. But gcc then violates it, because gcc
> doesn't understand the side effects.
>
> Now, the side effects may be *subtle*, but they are very very real.
> Just placement of code wrt a branch will actually affect memory
> ordering, as that one example was.

You have a different definition of "side effect" than C does apparently.

5.1.2.3/2:
Accessing a volatile object, modifying an object, modifying a file, or
calling a function that does any of those operations are all side
effects, which are changes in the state of the execution environment.
Evaluation of an expression in general includes both value
computations and initiation of side effects. Value computation for an
lvalue expression includes determining the identity of the designated
object.

> But that's what we need a compiler barrier for in the first place -
> the compiler certainly doesn't understand about this very subtle
> memory ordering issue, and we want to make sure that the code sequence
> *remains* that "if A then write B".

The compiler doesn't magically understand your intention, no. Some real
work will need to be done to make this work.


Segher

2021-06-06 20:01:14

by Segher Boessenkool

[permalink] [raw]
Subject: Re: [RFC] LKMM: Add volatile_if()

On Sun, Jun 06, 2021 at 11:48:32AM -0700, Linus Torvalds wrote:
> On Sun, Jun 6, 2021 at 11:43 AM Segher Boessenkool
> <[email protected]> wrote:
> >
> > You truly should have written a branch in the asm if you truly wanted
> > a branch instruction.
>
> That's exactly what I don't want to do, and what the original patch by
> PeterZ did.

Yes, I know. But it is literally the *only* way to *always* get a
conditional branch: by writing one.

> And to work well, it needs "asm goto", which is so recent that a lot
> of compilers don't support it (thank God for clang dragging gcc
> kicking and screaming to implement it at all - I'd asked for it over a
> decade ago).

GCC has had it since 2009.

> So you get bad code generation in a lot of cases, which entirely
> obviates the _point_ of this all - which is that we can avoid an
> expensive operation (a memory barrier) by just doing clever code
> generation.
>
> So if we can't get the clever code generation, it's all pretty much
> moot, imnsho.

Yes.


Segher

2021-06-06 20:16:11

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC] LKMM: Add volatile_if()

On Sun, Jun 6, 2021 at 12:56 PM Segher Boessenkool
<[email protected]> wrote:
>
> Yes, I know. But it is literally the *only* way to *always* get a
> conditional branch: by writing one.

The thing is, I don't actually believe you.

The barrier() thing can work - all we need to do is to simply make it
impossible for gcc to validly create anything but a conditional
branch.

If either side of the thing have an asm that cannot be combined, gcc
simply doesn't have any choice in the matter. There's no other valid
model than a conditional branch around it (of some sort - doing an
indirect branch that has a data dependency isn't wrong either, it just
wouldn't be something that a sane compiler would generate because it's
obviously much slower and more complicated).

We are very used to just making the compiler generate the code we
need. That is, fundamentally, what any use of inline asm is all about.
We want the compiler to generate all the common cases and all the
regular instructions.

The conditional branch itself - and the instructions leading up to it
- are exactly those "common regular instructions" that we'd want the
compiler to generate. That is in fact more true here than for most
inline asm, exactly because there are so many different possible
combinations of conditional branches (equal, not equal, less than,..)
and so many ways to generate the code that generates the condition.

So we are much better off letting the compiler do all that for us -
it's very much what the compiler is good at.

Linus

2021-06-06 20:19:37

by Segher Boessenkool

[permalink] [raw]
Subject: Re: [RFC] LKMM: Add volatile_if()

On Sun, Jun 06, 2021 at 12:22:44PM -0700, Linus Torvalds wrote:
> On Sun, Jun 6, 2021 at 11:59 AM Jakub Jelinek <[email protected]> wrote:
> >
> > I think just
> > #define barrier() __asm__ __volatile__("" : : "i" (__COUNTER__) : "memory")
> > should be enough
>
> Oh, I like that. Much better.
>
> It avoids all the issues with comments etc, and because it's not using
> __COUNTER__ as a string, it doesn't need the preprocessor games with
> double expansion either.
>
> So yeah, that seems like a nice solution to the issue, and should make
> the barriers all unique to the compiler.

__COUNTER__ is a preprocessor thing as well, and it may not do all that
you expect. Ex.:

===
#define fm() __COUNTER__
int gm(void) { return fm(); }
int hm(void) { return fm(); }

int fi(void) { return __COUNTER__; }
int gi(void) { return fi(); }
int hi(void) { return fi(); }
===

The macro version here works as you would hope, but the inlined one has
the same number everywhere.
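
For concreteness, a worked reading of that example (assuming these are the
only __COUNTER__ uses in the translation unit, so the counter starts at 0):

        int gm(void) { return 0; }      /* first macro expansion       */
        int hm(void) { return 1; }      /* second macro expansion      */

        int fi(void) { return 2; }      /* single token, expanded once */
        int gi(void) { return fi(); }   /* inlines to 2                */
        int hi(void) { return fi(); }   /* inlines to 2                */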


Segher

2021-06-06 20:32:25

by Segher Boessenkool

[permalink] [raw]
Subject: Re: [RFC] LKMM: Add volatile_if()

On Sun, Jun 06, 2021 at 01:11:53PM -0700, Linus Torvalds wrote:
> On Sun, Jun 6, 2021 at 12:56 PM Segher Boessenkool
> <[email protected]> wrote:
> >
> > Yes, I know. But it is literally the *only* way to *always* get a
> > conditional branch: by writing one.
>
> The thing is, I don't actually believe you.

Fortune favours the bold!

> The barrier() thing can work - all we need to do is to simply make it
> impossible for gcc to validly create anything but a conditional
> branch.

And the only foolproof way of doing that is by writing a branch.

> If either side of the thing have an asm that cannot be combined, gcc
> simply doesn't have any choice in the matter. There's no other valid
> model than a conditional branch around it (of some sort - doing an
> indirect branch that has a data dependency isn't wrong either, it just
> wouldn't be something that a sane compiler would generate because it's
> obviously much slower and more complicated).

Or push something to the stack and return. Or rewrite the whole thing
as an FSM. Or or or.

(And yes, there are existing compilers that can do both of these things
on some code).

> We are very used to just making the compiler generate the code we
> need. That is, fundamentally, what any use of inline asm is all about.
> We want the compiler to generate all the common cases and all the
> regular instructions.
>
> The conditional branch itself - and the instructions leading up to it
> - are exactly those "common regular instructions" that we'd want the
> compiler to generate. That is in fact more true here than for most
> inline asm, exactly because there are so many different possible
> combinations of conditional branches (equal, not equal, less than,..)
> and so many ways to generate the code that generates the condition.
>
> So we are much better off letting the compiler do all that for us -
> it's very much what the compiler is good at.

Yes, exactly.

I am saying that if you depend on that some C code you write will result
in some particular machine code, without actually *forcing* the compiler
to output that exact machine code, then you will be disappointed. Maybe
not today, and maybe it will take years, if you are lucky.

(s/forcing/instructing/ of course, compilers have feelings too!)


Segher

2021-06-06 21:22:57

by Alexander Monakov

[permalink] [raw]
Subject: Re: [RFC] LKMM: Add volatile_if()



On Sun, 6 Jun 2021, Linus Torvalds wrote:

> On Sun, Jun 6, 2021 at 11:59 AM Jakub Jelinek <[email protected]> wrote:
> >
> > I think just
> > #define barrier() __asm__ __volatile__("" : : "i" (__COUNTER__) : "memory")
> > should be enough
>
> Oh, I like that. Much better.
>
> It avoids all the issues with comments etc, and because it's not using
> __COUNTER__ as a string, it doesn't need the preprocessor games with
> double expansion either.
>
> So yeah, that seems like a nice solution to the issue, and should make
> the barriers all unique to the compiler.

It also plants a nice LTO time-bomb (__COUNTER__ values will be unique
only within each LTO input unit, not across all of them).

Alexander

2021-06-06 22:47:39

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC] LKMM: Add volatile_if()

On Sun, Jun 6, 2021 at 2:19 PM Alexander Monakov <[email protected]> wrote:
>
> > So yeah, that seems like a nice solution to the issue, and should make
> > the barriers all unique to the compiler.
>
> It also plants a nice LTO time-bomb (__COUNTER__ values will be unique
> only within each LTO input unit, not across all of them).

That could be an issue in other circumstances, but for at least
volatile_if() that doesn't much matter. The decision there is purely
local, and it's literally about the two sides of the conditional not
being merged.

Now, an optimizing linker or assembler can of course do anything at
all in theory: and if that ends up being an issue we'd have to have
some way to actually propagate the barrier from being just a compiler
thing. Right now gcc doesn't even output the barrier in the assembly
code, so it's invisible to any optimizing assembler/linker thing.

But I don't think that's an issue with what _currently_ goes on in an
assembler or linker - not even a smart one like LTO.

And such things really are independent of "volatile_if()". We use
barriers for other things where we need to force some kind of
operation ordering, and right now the only thing that re-orders
accesses etc is the compiler.

Btw, since we have compiler people on line, the suggested 'barrier()'
isn't actually perfect for this particular use:

#define barrier() __asm__ __volatile__("" : : "i" (__COUNTER__) : "memory")

in the general barrier case, we very much want to have that "memory"
clobber, because the whole point of the general barrier case is that
we want to make sure that the compiler doesn't cache memory state
across it (ie the traditional use was basically what we now use
"cpu_relax()" for, and you would use it for busy-looping on some
condition).

In the case of "volatile_if()", we actually would like to have not a
memory clobber, but a "memory read". IOW, it would be a barrier for
any writes taking place, but reads can move around it.

I don't know of any way to express that to the compiler. We've used
hacks for it before (in gcc, BLKmode reads turn into that kind of
barrier in practice, so you can do something like make the memory
input to the asm be a big array). But that turned out to be fairly
unreliable, so now we use memory clobbers even if we just mean "reads
random memory".

Example: variable_test_bit(), which generates a "bt" instruction, does

: "m" (*(unsigned long *)addr), "Ir" (nr) : "memory");

and the memory clobber is obviously wrong: 'bt' only *reads* memory,
but since the whole reason we use it is that it's not just that word
at address 'addr', in order to make sure that any previous writes are
actually stable in memory, we use that "memory" clobber.

It would be much nicer to have a "memory read" marker instead, to let
the compiler know "I need to have done all pending writes to memory,
but I can still cache read values over this op because it doesn't
_change_ memory".

Anybody have ideas or suggestions for something like that?

Linus

2021-06-06 23:47:12

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [RFC] LKMM: Add volatile_if()

On Sun, Jun 06, 2021 at 03:26:16PM -0500, Segher Boessenkool wrote:
> On Sun, Jun 06, 2021 at 01:11:53PM -0700, Linus Torvalds wrote:
> > On Sun, Jun 6, 2021 at 12:56 PM Segher Boessenkool
> > <[email protected]> wrote:
> > >
> > > Yes, I know. But it is literally the *only* way to *always* get a
> > > conditional branch: by writing one.
> >
> > The thing is, I don't actually believe you.
>
> Fortune favours the bold!
>
> > The barrier() thing can work - all we need to do is to simply make it
> > impossible for gcc to validly create anything but a conditional
> > branch.
>
> And the only foolproof way of doing that is by writing a branch.
>
> > If either side of the thing have an asm that cannot be combined, gcc
> > simply doesn't have any choice in the matter. There's no other valid
> > model than a conditional branch around it (of some sort - doing an
> > indirect branch that has a data dependency isn't wrong either, it just
> > wouldn't be something that a sane compiler would generate because it's
> > obviously much slower and more complicated).
>
> Or push something to the stack and return. Or rewrite the whole thing
> as an FSM. Or or or.
>
> (And yes, there are existing compilers that can do both of these things
> on some code).
>
> > We are very used to just making the compiler generate the code we
> > need. That is, fundamentally, what any use of inline asm is all about.
> > We want the compiler to generate all the common cases and all the
> > regular instructions.
> >
> > The conditional branch itself - and the instructions leading up to it
> > - are exactly those "common regular instructions" that we'd want the
> > compiler to generate. That is in fact more true here than for most
> > inline asm, exactly because there are so many different possible
> > combinations of conditional branches (equal, not equal, less than,..)
> > and so many ways to generate the code that generates the condition.
> >
> > So we are much better off letting the compiler do all that for us -
> > it's very much what the compiler is good at.
>
> Yes, exactly.
>
> I am saying that if you depend on that some C code you write will result
> in some particular machine code, without actually *forcing* the compiler
> to output that exact machine code, then you will be disappointed. Maybe
> not today, and maybe it will take years, if you are lucky.
>
> (s/forcing/instructing/ of course, compilers have feelings too!)

OK, I will bite...

What would you suggest as a way of instructing the compiler to emit the
conditional branch that we are looking for?

Thanx, Paul

2021-06-06 23:48:17

by Rasmus Villemoes

[permalink] [raw]
Subject: Re: [RFC] LKMM: Add volatile_if()

On 07/06/2021 00.38, Linus Torvalds wrote:

> Example: variable_test_bit(), which generates a "bt" instruction, does
>
> : "m" (*(unsigned long *)addr), "Ir" (nr) : "memory");
>
> and the memory clobber is obviously wrong: 'bt' only *reads* memory,
> but since the whole reason we use it is that it's not just that word
> at address 'addr', in order to make sure that any previous writes are
> actually stable in memory, we use that "memory" clobber.
>
> It would be much nicer to have a "memory read" marker instead, to let
> the compiler know "I need to have done all pending writes to memory,
> but I can still cache read values over this op because it doesn't
> _change_ memory".
>
> Anybody have ideas or suggestions for something like that?

The obvious thing is to try and mark the function as pure. But when
applied to a static inline, gcc seems to read the contents and say "nah,
you have something here that declares itself to possibly write to
memory". Replacing with a call to an extern function marked pure does
indeed cause gcc to cache the value of y*z, so in theory this should be
possible, if one could convince gcc to "trust me, this really is a pure
function".

https://godbolt.org/z/s4546K6Pj
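
For reference, a minimal sketch of that trick (all names here are made up
for illustration; the extern function is only declared, so the compiler has
to take the attribute at face value):

        /* Promises: no side effects, and the result depends only on the
         * arguments and on (non-volatile) global memory. */
        extern int pure_test_bit(long nr, const unsigned long *addr)
                __attribute__((pure));

        int y, z;

        int f(long nr, const unsigned long *addr)
        {
                int a = y * z;
                if (pure_test_bit(nr, addr))
                        return a;
                return y * z;   /* the pure call cannot have modified y or z,
                                 * so gcc may reuse the earlier y * z */
        }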

Rasmus

2021-06-06 23:50:24

by Rasmus Villemoes

[permalink] [raw]
Subject: Re: [RFC] LKMM: Add volatile_if()

On 07/06/2021 01.39, Rasmus Villemoes wrote:

> memory". Replacing with a call to an extern function marked pure does
> indeed cause gcc to cache the value of y*z, so in theory this should be
> possible, if one could convince gcc to "trust me, this really is a pure
> function".

Don't know why I didn't think to check before sending, but FWIW clang
doesn't need convincing, it already takes the __pure at face value and
caches y*z.

Rasmus

2021-06-07 08:05:24

by Alexander Monakov

[permalink] [raw]
Subject: Re: [RFC] LKMM: Add volatile_if()

On Sun, 6 Jun 2021, Linus Torvalds wrote:

> On Sun, Jun 6, 2021 at 2:19 PM Alexander Monakov <[email protected]> wrote:
> >
> > > So yeah, that seems like a nice solution to the issue, and should make
> > > the barriers all unique to the compiler.
> >
> > It also plants a nice LTO time-bomb (__COUNTER__ values will be unique
> > only within each LTO input unit, not across all of them).
>
> That could be an issue in other circumstances, but for at least
> volatile_if() that doesn't much matter. The decision there is purely
> local, and it's literally about the two sides of the conditional not
> being merged.
>
> Now, an optimizing linker or assembler can of course do anything at
> all in theory: and if that ends up being an issue we'd have to have
> some way to actually propagate the barrier from being just a compiler
> thing. Right now gcc doesn't even output the barrier in the assembly
> code, so it's invisible to any optimizing assembler/linker thing.
>
> But I don't think that's an issue with what _currently_ goes on in an
> assembler or linker - not even a smart one like LTO.
>
> And such things really are independent of "volatile_if()". We use
> barriers for other things where we need to force some kind of
> operation ordering, and right now the only thing that re-orders
> accesses etc is the compiler.

Uhh... I was not talking about some (non-existent) "optimizing linker".
LTO works by relaunching the compiler from the linker and letting it
consume multiple translation units (which are fully preprocessed by that
point). So the very thing you wanted to avoid -- such barriers appearing
in close proximity where they can be deduplicated -- may arise after a
little bit of cross-unit inlining.

My main point here is that using __COUNTER__ that way (making things
"unique" for the compiler) does not work in general when LTO enters the
picture. As long as that is remembered, I'm happy.

> Btw, since we have compiler people on line, the suggested 'barrier()'
> isn't actually perfect for this particular use:
>
> #define barrier() __asm__ __volatile__("" : : "i" (__COUNTER__) : "memory")
>
> in the general barrier case, we very much want to have that "memory"
> clobber, because the whole point of the general barrier case is that
> we want to make sure that the compiler doesn't cache memory state
> across it (ie the traditional use was basically what we now use
> "cpu_relax()" for, and you would use it for busy-looping on some
> condition).
>
> In the case of "volatile_if()", we actually would like to have not a
> memory clobber, but a "memory read". IOW, it would be a barrier for
> any writes taking place, but reads can move around it.
>
> I don't know of any way to express that to the compiler. We've used
> hacks for it before (in gcc, BLKmode reads turn into that kind of
> barrier in practice, so you can do something like make the memory
> input to the asm be a big array). But that turned out to be fairly
> unreliable, so now we use memory clobbers even if we just mean "reads
> random memory".

So the barrier which is a compiler barrier but not a machine barrier is
__atomic_signal_fence(model), but internally GCC will not treat it smarter
than an asm-with-memory-clobber today.

> Example: variable_test_bit(), which generates a "bt" instruction, does
>
> : "m" (*(unsigned long *)addr), "Ir" (nr) : "memory");
>
> and the memory clobber is obviously wrong: 'bt' only *reads* memory,
> but since the whole reason we use it is that it's not just that word
> at address 'addr', in order to make sure that any previous writes are
> actually stable in memory, we use that "memory" clobber.
>
> It would be much nicer to have a "memory read" marker instead, to let
> the compiler know "I need to have done all pending writes to memory,
> but I can still cache read values over this op because it doesn't
> _change_ memory".
>
> Anybody have ideas or suggestions for something like that?

In the specific case of 'bt', the offset cannot be negative, so I think you
can simply spell out the extent of the array being accessed:

: "m" *(unsigned long (*)[-1UL / 8 / sizeof(long) + 1])addr

In the general case (possibility of negative offsets, or no obvious base to
supply), have you considered adding a "wild read" through a char pointer
that is initialized in a non-transparent way? Like this:

        char *wild_pointer;

        asm(""
            : "=X"(wild_pointer)
            : "X"(base1)
            , "X"(base2));      // unknown value related to given base pointers

        asm("pattern"
            : // normal outputs
            : // normal inputs
            , "m"(*wild_pointer));

The "X" constraint in theory should not tie up neither a register nor a stack
slot.

Alexander

2021-06-07 08:32:57

by Marco Elver

[permalink] [raw]
Subject: Re: [RFC] LKMM: Add volatile_if()

On Mon, 7 Jun 2021 at 10:02, Alexander Monakov <[email protected]> wrote:
> On Sun, 6 Jun 2021, Linus Torvalds wrote:
[...]
> > On Sun, Jun 6, 2021 at 2:19 PM Alexander Monakov <[email protected]> wrote:
[...]
> > Btw, since we have compiler people on line, the suggested 'barrier()'
> > isn't actually perfect for this particular use:
> >
> > #define barrier() __asm__ __volatile__("" : : "i" (__COUNTER__) : "memory")
> >
> > in the general barrier case, we very much want to have that "memory"
> > clobber, because the whole point of the general barrier case is that
> > we want to make sure that the compiler doesn't cache memory state
> > across it (ie the traditional use was basically what we now use
> > "cpu_relax()" for, and you would use it for busy-looping on some
> > condition).
> >
> > In the case of "volatile_if()", we actually would like to have not a
> > memory clobber, but a "memory read". IOW, it would be a barrier for
> > any writes taking place, but reads can move around it.
> >
> > I don't know of any way to express that to the compiler. We've used
> > hacks for it before (in gcc, BLKmode reads turn into that kind of
> > barrier in practice, so you can do something like make the memory
> > input to the asm be a big array). But that turned out to be fairly
> > unreliable, so now we use memory clobbers even if we just mean "reads
> > random memory".
>
> So the barrier which is a compiler barrier but not a machine barrier is
> __atomic_signal_fence(model), but internally GCC will not treat it smarter
> than an asm-with-memory-clobber today.

FWIW, Clang seems to be cleverer about it, and seems to do the optimal
thing if I use a __atomic_signal_fence(__ATOMIC_RELEASE):
https://godbolt.org/z/4v5xojqaY
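
For reference, a sketch of the kind of construct that link exercises (the
macro name is made up; whether this helpful code generation is guaranteed,
rather than just what current compilers happen to do, is a separate
question):

        /* Compiler-only release fence: earlier memory accesses may not be
         * moved past later stores, but previously loaded values may still
         * be kept cached rather than reloaded. */
        #define compiler_wmb()  __atomic_signal_fence(__ATOMIC_RELEASE)

        void publish(int *data, int *flag)
        {
                *data = 42;
                compiler_wmb();         /* keep the store to *data ahead of *flag */
                *flag = 1;
        }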

Thanks,
-- Marco

2021-06-07 14:20:13

by Segher Boessenkool

[permalink] [raw]
Subject: Re: [RFC] LKMM: Add volatile_if()

On Sun, Jun 06, 2021 at 04:37:29PM -0700, Paul E. McKenney wrote:
> > > The barrier() thing can work - all we need to do is to simply make it
> > > impossible for gcc to validly create anything but a conditional
> > > branch.
> >
> > And the only foolproof way of doing that is by writing a branch.

[ ... ]

> > I am saying that if you depend on that some C code you write will result
> > in some particular machine code, without actually *forcing* the compiler
> > to output that exact machine code, then you will be disappointed. Maybe
> > not today, and maybe it will take years, if you are lucky.
> >
> > (s/forcing/instructing/ of course, compilers have feelings too!)
>
> OK, I will bite...
>
> What would you suggest as a way of instructing the compiler to emit the
> conditional branch that we are looking for?

You write it in the assembler code.

Yes, it sucks. But it is the only way to get a branch if you really
want one. Now, you do not really need one here anyway, so there may be
some other way to satisfy the actual requirements.


Segher

2021-06-07 15:28:38

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [RFC] LKMM: Add volatile_if()

On Mon, Jun 07, 2021 at 09:12:42AM -0500, Segher Boessenkool wrote:
> On Sun, Jun 06, 2021 at 04:37:29PM -0700, Paul E. McKenney wrote:
> > > > The barrier() thing can work - all we need to do is to simply make it
> > > > impossible for gcc to validly create anything but a conditional
> > > > branch.
> > >
> > > And the only foolproof way of doing that is by writing a branch.
>
> [ ... ]
>
> > > I am saying that if you depend on that some C code you write will result
> > > in some particular machine code, without actually *forcing* the compiler
> > > to output that exact machine code, then you will be disappointed. Maybe
> > > not today, and maybe it will take years, if you are lucky.
> > >
> > > (s/forcing/instructing/ of course, compilers have feelings too!)
> >
> > OK, I will bite...
> >
> > What would you suggest as a way of instructing the compiler to emit the
> > conditional branch that we are looking for?
>
> You write it in the assembler code.
>
> Yes, it sucks. But it is the only way to get a branch if you really
> want one. Now, you do not really need one here anyway, so there may be
> some other way to satisfy the actual requirements.

Hmmm... What do you see Peter asking for that is different than what
I am asking for? ;-)

Thanx, Paul

2021-06-07 15:30:02

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [RFC] LKMM: Add volatile_if()

On Mon, Jun 07, 2021 at 10:27:10AM +0200, Marco Elver wrote:
> On Mon, 7 Jun 2021 at 10:02, Alexander Monakov <[email protected]> wrote:
> > On Sun, 6 Jun 2021, Linus Torvalds wrote:
> [...]
> > > On Sun, Jun 6, 2021 at 2:19 PM Alexander Monakov <[email protected]> wrote:
> [...]
> > > Btw, since we have compiler people on line, the suggested 'barrier()'
> > > isn't actually perfect for this particular use:
> > >
> > > #define barrier() __asm__ __volatile__("" : : "i" (__COUNTER__) : "memory")
> > >
> > > in the general barrier case, we very much want to have that "memory"
> > > clobber, because the whole point of the general barrier case is that
> > > we want to make sure that the compiler doesn't cache memory state
> > > across it (ie the traditional use was basically what we now use
> > > "cpu_relax()" for, and you would use it for busy-looping on some
> > > condition).
> > >
> > > In the case of "volatile_if()", we actually would like to have not a
> > > memory clobber, but a "memory read". IOW, it would be a barrier for
> > > any writes taking place, but reads can move around it.
> > >
> > > I don't know of any way to express that to the compiler. We've used
> > > hacks for it before (in gcc, BLKmode reads turn into that kind of
> > > barrier in practice, so you can do something like make the memory
> > > input to the asm be a big array). But that turned out to be fairly
> > > unreliable, so now we use memory clobbers even if we just mean "reads
> > > random memory".
> >
> > So the barrier which is a compiler barrier but not a machine barrier is
> > __atomic_signal_fence(model), but internally GCC will not treat it smarter
> > than an asm-with-memory-clobber today.
>
> FWIW, Clang seems to be cleverer about it, and seems to do the optimal
> thing if I use a __atomic_signal_fence(__ATOMIC_RELEASE):
> https://godbolt.org/z/4v5xojqaY

Indeed it does! But I don't know of a guarantee for that helpful
behavior.

Thanx, Paul

2021-06-07 17:07:13

by Marco Elver

[permalink] [raw]
Subject: Re: [RFC] LKMM: Add volatile_if()

On Mon, Jun 07, 2021 at 08:28AM -0700, Paul E. McKenney wrote:
> On Mon, Jun 07, 2021 at 10:27:10AM +0200, Marco Elver wrote:
> > On Mon, 7 Jun 2021 at 10:02, Alexander Monakov <[email protected]> wrote:
> > > On Sun, 6 Jun 2021, Linus Torvalds wrote:
> > [...]
> > > > On Sun, Jun 6, 2021 at 2:19 PM Alexander Monakov <[email protected]> wrote:
> > [...]
> > > > Btw, since we have compiler people on line, the suggested 'barrier()'
> > > > isn't actually perfect for this particular use:
> > > >
> > > > #define barrier() __asm__ __volatile__("" : : "i" (__COUNTER__) : "memory")
> > > >
> > > > in the general barrier case, we very much want to have that "memory"
> > > > clobber, because the whole point of the general barrier case is that
> > > > we want to make sure that the compiler doesn't cache memory state
> > > > across it (ie the traditional use was basically what we now use
> > > > "cpu_relax()" for, and you would use it for busy-looping on some
> > > > condition).
> > > >
> > > > In the case of "volatile_if()", we actually would like to have not a
> > > > memory clobber, but a "memory read". IOW, it would be a barrier for
> > > > any writes taking place, but reads can move around it.
> > > >
> > > > I don't know of any way to express that to the compiler. We've used
> > > > hacks for it before (in gcc, BLKmode reads turn into that kind of
> > > > barrier in practice, so you can do something like make the memory
> > > > input to the asm be a big array). But that turned out to be fairly
> > > > unreliable, so now we use memory clobbers even if we just mean "reads
> > > > random memory".
> > >
> > > So the barrier which is a compiler barrier but not a machine barrier is
> > > __atomic_signal_fence(model), but internally GCC will not treat it smarter
> > > than an asm-with-memory-clobber today.
> >
> > FWIW, Clang seems to be cleverer about it, and seems to do the optimal
> > thing if I use a __atomic_signal_fence(__ATOMIC_RELEASE):
> > https://godbolt.org/z/4v5xojqaY
>
> Indeed it does! But I don't know of a guarantee for that helpful
> behavior.

Is there a way we can interpret the standard in such a way that it
should be guaranteed?

If yes, it should be easy to add tests to the compiler repos for
snippets that the Linux kernel relies on (if we decide to use
__atomic_signal_fence() for this).

If no, we can still try to add tests to the compiler repos, but may
receive some push-back at the very latest when some optimization pass
decides to break it. Because the argument then is that it's well within
the language standard.

Adding language extensions will likely be met with resistance, because
some compiler folks are afraid of creating language forks (the reason
why we have '-enable-trivial-auto-var-init-zero-knowing-it-will-be-removed-from-clang').
That could be solved if we declare Linux-C a "standard", and finally get
-std=linux or such, at which point asking for "volatile if" directly
would probably be easier without jumping through hoops.

The jumping-through-hoops variant would probably be asking for a
__builtin primitive that allows constructing volatile_if() (if we can't
bend existing primitives to do what we want).

Thanks,
-- Marco

2021-06-07 17:48:40

by Segher Boessenkool

[permalink] [raw]
Subject: Re: [RFC] LKMM: Add volatile_if()

On Sun, Jun 06, 2021 at 03:38:06PM -0700, Linus Torvalds wrote:
> In the case of "volatile_if()", we actually would like to have not a
> memory clobber, but a "memory read". IOW, it would be a barrier for
> any writes taking place, but reads can move around it.
>
> I don't know of any way to express that to the compiler. We've used
> hacks for it before (in gcc, BLKmode reads turn into that kind of
> barrier in practice, so you can do something like make the memory
> input to the asm be a big array). But that turned out to be fairly
> unreliable, so now we use memory clobbers even if we just mean "reads
> random memory".
>
> Example: variable_test_bit(), which generates a "bt" instruction, does
>
> : "m" (*(unsigned long *)addr), "Ir" (nr) : "memory");
>
> and the memory clobber is obviously wrong: 'bt' only *reads* memory,
> but since the whole reason we use it is that it's not just that word
> at address 'addr', in order to make sure that any previous writes are
> actually stable in memory, we use that "memory" clobber.

You can split the "I" version from the "r" version, it does not need
the memory clobber. If you know the actual maximum bit offset used you
don't need the clobber for "r" either. Or you could even write
"m"(((unsigned long *)addr)[nr/32])
That should work for all cases.

> Anybody have ideas or suggestions for something like that?

Is it useful in general for the kernel to have separate "read" and
"write" clobbers in asm expressions? And for other applications?


Segher

2021-06-07 17:58:17

by Segher Boessenkool

[permalink] [raw]
Subject: Re: [RFC] LKMM: Add volatile_if()

On Mon, Jun 07, 2021 at 11:01:39AM +0300, Alexander Monakov wrote:
> Uhh... I was not talking about some (non-existent) "optimizing linker".
> LTO works by relaunching the compiler from the linker and letting it
> consume multiple translation units (which are fully preprocessed by that
> point). So the very thing you wanted to avoid -- such barriers appearing
> in close proximity where they can be deduplicated -- may arise after a
> little bit of cross-unit inlining.
>
> My main point here is that using __COUNTER__ that way (making things
> "unique" for the compiler) does not work in general when LTO enters the
> picture. As long as that is remembered, I'm happy.

Yup. Exactly the same issue as using this in any function that may end
up inlined.

> > In the case of "volatile_if()", we actually would like to have not a
> > memory clobber, but a "memory read". IOW, it would be a barrier for
> > any writes taking place, but reads can move around it.
> >
> > I don't know of any way to express that to the compiler. We've used
> > hacks for it before (in gcc, BLKmode reads turn into that kind of
> > barrier in practice, so you can do something like make the memory
> > input to the asm be a big array). But that turned out to be fairly
> > unreliable, so now we use memory clobbers even if we just mean "reads
> > random memory".
>
> So the barrier which is a compiler barrier but not a machine barrier is
> __atomic_signal_fence(model), but internally GCC will not treat it smarter
> than an asm-with-memory-clobber today.

It will do nothing for relaxed ordering, and do blockage for everything
else. Can it do anything weaker than that?


Segher

2021-06-07 18:09:37

by Alexander Monakov

[permalink] [raw]
Subject: Re: [RFC] LKMM: Add volatile_if()

On Mon, 7 Jun 2021, Segher Boessenkool wrote:

> > So the barrier which is a compiler barrier but not a machine barrier is
> > __atomic_signal_fence(model), but internally GCC will not treat it smarter
> > than an asm-with-memory-clobber today.
>
> It will do nothing for relaxed ordering, and do blockage for everything
> else. Can it do anything weaker than that?

It's a "blockage instruction" after transitioning to RTL, but before that,
on GIMPLE, the compiler sees it properly as a corresponding built-in, and
may optimize according to given memory model. And on RTL, well, if anyone
cares they'll need to invent RTL representation for it, I guess.

Alexander

2021-06-07 18:25:35

by Segher Boessenkool

[permalink] [raw]
Subject: Re: [RFC] LKMM: Add volatile_if()

On Mon, Jun 07, 2021 at 09:07:58PM +0300, Alexander Monakov wrote:
> On Mon, 7 Jun 2021, Segher Boessenkool wrote:
>
> > > So the barrier which is a compiler barrier but not a machine barrier is
> > > __atomic_signal_fence(model), but internally GCC will not treat it smarter
> > > than an asm-with-memory-clobber today.
> >
> > It will do nothing for relaxed ordering, and do blockage for everything
> > else. Can it do anything weaker than that?
>
> It's a "blockage instruction" after transitioning to RTL, but before that,
> on GIMPLE, the compiler sees it properly as a corresponding built-in, and
> may optimize according to given memory model. And on RTL, well, if anyone
> cares they'll need to invent RTL representation for it, I guess.

My question was if anything weaker is *valid* :-) (And if so, why!)


Segher

2021-06-07 18:29:33

by Segher Boessenkool

[permalink] [raw]
Subject: Re: [RFC] LKMM: Add volatile_if()

On Mon, Jun 07, 2021 at 08:27:12AM -0700, Paul E. McKenney wrote:
> > > > > The barrier() thing can work - all we need to do is to simply make it
> > > > > impossible for gcc to validly create anything but a conditional
> > > > > branch.

> > > What would you suggest as a way of instructing the compiler to emit the
> > > conditional branch that we are looking for?
> >
> > You write it in the assembler code.
> >
> > Yes, it sucks. But it is the only way to get a branch if you really
> > want one. Now, you do not really need one here anyway, so there may be
> > some other way to satisfy the actual requirements.
>
> Hmmm... What do you see Peter asking for that is different than what
> I am asking for? ;-)

I don't know what you are referring to, sorry?

I know what you asked for: literally some way to tell the compiler to
emit a conditional branch. If that is what you want, the only way to
make sure that is what you get is by writing exactly that in assembler.


Segher

2021-06-07 19:53:12

by Alan Stern

[permalink] [raw]
Subject: Re: [RFC] LKMM: Add volatile_if()

On Mon, Jun 07, 2021 at 01:23:35PM -0500, Segher Boessenkool wrote:
> On Mon, Jun 07, 2021 at 08:27:12AM -0700, Paul E. McKenney wrote:
> > > > > > The barrier() thing can work - all we need to do is to simply make it
> > > > > > impossible for gcc to validly create anything but a conditional
> > > > > > branch.
>
> > > > What would you suggest as a way of instructing the compiler to emit the
> > > > conditional branch that we are looking for?
> > >
> > > You write it in the assembler code.
> > >
> > > Yes, it sucks. But it is the only way to get a branch if you really
> > > want one. Now, you do not really need one here anyway, so there may be
> > > some other way to satisfy the actual requirements.
> >
> > Hmmm... What do you see Peter asking for that is different than what
> > I am asking for? ;-)
>
> I don't know what you are referring to, sorry?
>
> I know what you asked for: literally some way to tell the compiler to
> emit a conditional branch. If that is what you want, the only way to
> make sure that is what you get is by writing exactly that in assembler.

That's not necessarily it.

People would be happy to have an easy way of telling the compiler that
all writes in the "if" branch of an if statement must be ordered after
any reads that the condition depends on. Or maybe all writes in either
the "if" branch or the "else" branch. And maybe not all reads that the
condition depends on, but just the reads appearing syntactically in the
condition. Or maybe even just the volatile reads appearing in the
condition. Nobody has said exactly.

The exact method used for doing this doesn't matter. It could be
accomplished by treating those reads as load-acquires. Or it could be
done by ensuring that the object code contains a dependency (control or
data) from the reads to the writes. Or it could be done by treating
the writes as store-releases. But we do want the execution-time
penalty to be small.
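
As one concrete illustration, the load-acquire variant would be roughly the
following (using the kernel's existing smp_load_acquire(); it buys the
ordering with an acquire barrier instead of relying on the branch):

        if (smp_load_acquire(&a))       /* ordered before any later store */
                WRITE_ONCE(b, 1);
        else
                WRITE_ONCE(b, 1);

Here the writes to b stay ordered after the read of a even if the compiler
merges the two arms, since the acquire applies to the load itself rather
than to the control dependency.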

In short, we want to guarantee somehow that the conditional writes are
not re-ordered before the reads in the condition. (But note that
"conditional writes" includes identical writes occurring in both
branches.)

Alan

2021-06-07 20:18:41

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [RFC] LKMM: Add volatile_if()

On Mon, Jun 07, 2021 at 03:51:44PM -0400, Alan Stern wrote:
> On Mon, Jun 07, 2021 at 01:23:35PM -0500, Segher Boessenkool wrote:
> > On Mon, Jun 07, 2021 at 08:27:12AM -0700, Paul E. McKenney wrote:
> > > > > > > The barrier() thing can work - all we need to do is to simply make it
> > > > > > > impossible for gcc to validly create anything but a conditional
> > > > > > > branch.
> >
> > > > > What would you suggest as a way of instructing the compiler to emit the
> > > > > conditional branch that we are looking for?
> > > >
> > > > You write it in the assembler code.
> > > >
> > > > Yes, it sucks. But it is the only way to get a branch if you really
> > > > want one. Now, you do not really need one here anyway, so there may be
> > > > some other way to satisfy the actual requirements.
> > >
> > > Hmmm... What do you see Peter asking for that is different than what
> > > I am asking for? ;-)
> >
> > I don't know what you are referring to, sorry?
> >
> > I know what you asked for: literally some way to tell the compiler to
> > emit a conditional branch. If that is what you want, the only way to
> > make sure that is what you get is by writing exactly that in assembler.
>
> That's not necessarily it.
>
> People would be happy to have an easy way of telling the compiler that
> all writes in the "if" branch of an if statement must be ordered after
> any reads that the condition depends on. Or maybe all writes in either
> the "if" branch or the "else" branch. And maybe not all reads that the
> condition depends on, but just the reads appearing syntactically in the
> condition. Or maybe even just the volatile reads appearing in the
> condition. Nobody has said exactly.
>
> The exact method used for doing this doesn't matter. It could be
> accomplished by treating those reads as load-acquires. Or it could be
> done by ensuring that the object code contains a dependency (control or
> data) from the reads to the writes. Or it could be done by treating
> the writes as store-releases. But we do want the execution-time
> penalty to be small.
>
> In short, we want to guarantee somehow that the conditional writes are
> not re-ordered before the reads in the condition. (But note that
> "conditional writes" includes identical writes occurring in both
> branches.)

What Alan said! ;-)

Thanx, Paul

2021-06-07 20:36:39

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC] LKMM: Add volatile_if()

On Mon, Jun 7, 2021 at 10:45 AM Segher Boessenkool
<[email protected]> wrote:
>
> On Sun, Jun 06, 2021 at 03:38:06PM -0700, Linus Torvalds wrote:
> >
> > Example: variable_test_bit(), which generates a "bt" instruction, does
> >
> > : "m" (*(unsigned long *)addr), "Ir" (nr) : "memory");
> >
> > and the memory clobber is obviously wrong: 'bt' only *reads* memory,
> > but since the whole reason we use it is that it's not just that word
> > at address 'addr', in order to make sure that any previous writes are
> > actually stable in memory, we use that "memory" clobber.
>
> You can split the "I" version from the "r" version, it does not need
> the memory clobber. If you know the actual maximum bit offset used you
> don't need the clobber for "r" either. Or you could even write
> "m"(((unsigned long *)addr)[nr/32])
> That should work for all cases.

Note that the bit test thing really was just an example.

And some other cases don't actually have an address range at all,
because they affect arbitrary ranges, not - like that bit test - just
one particular range.

To pick a couple of examples of that, think of

(a) write memory barrier. On some architectures it's an explicit
instruction, on x86 it's just a compiler barrier, since writes are
ordered on the CPU anyway, and we only need to make sure that the
compiler doesn't re-order writes around the barrier

Again, we currently use that same "barrier()" macro for that:

#define __smp_wmb() barrier()

but as mentioned, the barrier() thing has a "memory" clobber, and that
means that this write barrier - which is really really cheap on x86 -
also unnecessarily ends up causing pointless reloads from globals. It
obviously doesn't actually *change* memory, but it very much requires
that writes are not moved around it.

(b) things like cache flush and/or invalidate instructions, eg

asm volatile("wbinvd": : :"memory");

Again, this one doesn't actually *modify* memory, and honestly, this
one is not performance critical so the memory clobber is not actually
a problem, but I'm pointing it out as an example of the exact same
issue: the notion of an instruction that we don't want _writes_ to
move around, but reads can happily be moved and/or cached around it.

(c) this whole "volatile_if()" situation: we want to make sure writes
can't move around it, but there's no reason to re-load memory values,
because it doesn't modify memory, and we only need to make sure that
any writes are delayed to after the conditional.

We long long ago (over 20 years by now) used to do things like this:

        struct __dummy { unsigned long a[100]; };
        #define ADDR (*(volatile struct __dummy *) addr)

        __asm__ __volatile__(
                "btl %2,%1\n\tsbbl %0,%0"
                :"=r" (oldbit)
                :"m" (ADDR),"ir" (nr));

for that test-bit thing. Note how the above doesn't need the memory
clobber, because for gcc that ADDR thing (access to a big struct) ends
up being a "BLKmode" read, and then gcc at least used to treat it as
an arbitrary read.

I forget just why we had to stop using that trick; I think it caused
some reload confusion for some gcc version at some point. Probably
exactly because inline asms had issues with some BLKmode thing. That
change happened before 2001, and we didn't have nice changelogs with
detailed commit messages back then, so ...

> > Anybody have ideas or suggestions for something like that?
>
> Is it useful in general for the kernel to have separate "read" and
> "write" clobbers in asm expressions? And for other applications?

See above. It's actually not all that uncommon that you have a "this
doesn't modify memory, but you can't move writes around it". It's
usually very much about cache handling or memory ordering operations,
and that bit test example was probably a bad example exactly because
it made it look like it's about some controlled range.

The "write memory barroer" is likely the best and simplest example,
but it's in not the only one.

Linus

2021-06-07 22:47:59

by Segher Boessenkool

[permalink] [raw]
Subject: Re: [RFC] LKMM: Add volatile_if()

On Mon, Jun 07, 2021 at 01:16:33PM -0700, Paul E. McKenney wrote:
> On Mon, Jun 07, 2021 at 03:51:44PM -0400, Alan Stern wrote:
> > On Mon, Jun 07, 2021 at 01:23:35PM -0500, Segher Boessenkool wrote:
> > > On Mon, Jun 07, 2021 at 08:27:12AM -0700, Paul E. McKenney wrote:
> > > > > > > > The barrier() thing can work - all we need to do is to simply make it
> > > > > > > > impossible for gcc to validly create anything but a conditional
> > > > > > > > branch.
> > >
> > > > > > What would you suggest as a way of instructing the compiler to emit the
> > > > > > conditional branch that we are looking for?
> > > > >
> > > > > You write it in the assembler code.
> > > > >
> > > > > Yes, it sucks. But it is the only way to get a branch if you really
> > > > > want one. Now, you do not really need one here anyway, so there may be
> > > > > some other way to satisfy the actual requirements.
> > > >
> > > > Hmmm... What do you see Peter asking for that is different than what
> > > > I am asking for? ;-)
> > >
> > > I don't know what you are referring to, sorry?
> > >
> > > I know what you asked for: literally some way to tell the compiler to
> > > emit a conditional branch. If that is what you want, the only way to
> > > make sure that is what you get is by writing exactly that in assembler.
> >
> > That's not necessarily it.
> >
> > People would be happy to have an easy way of telling the compiler that
> > all writes in the "if" branch of an if statement must be ordered after
> > any reads that the condition depends on. Or maybe all writes in either
> > the "if" branch or the "else" branch. And maybe not all reads that the
> > condition depends on, but just the reads appearing syntactically in the
> > condition. Or maybe even just the volatile reads appearing in the
> > condition. Nobody has said exactly.
> >
> > The exact method used for doing this doesn't matter. It could be
> > accomplished by treating those reads as load-acquires. Or it could be
> > done by ensuring that the object code contains a dependency (control or
> > data) from the reads to the writes. Or it could be done by treating
> > the writes as store-releases. But we do want the execution-time
> > penalty to be small.
> >
> > In short, we want to guarantee somehow that the conditional writes are
> > not re-ordered before the reads in the condition. (But note that
> > "conditional writes" includes identical writes occurring in both
> > branches.)
>
> What Alan said! ;-)

Okay, I'll think about that.

But you wrote:

> > > > > > What would you suggest as a way of instructing the compiler to emit the
> > > > > > conditional branch that we are looking for?

... and that is what I answered. I am sorry if you do not like being
taken literally, but that is how I read technical remarks: as literally
what they say. If you say you want a branch, I take it you want a
branch! :-)


Segher

2021-06-07 23:02:50

by Segher Boessenkool

[permalink] [raw]
Subject: Re: [RFC] LKMM: Add volatile_if()

On Mon, Jun 07, 2021 at 01:31:24PM -0700, Linus Torvalds wrote:
> > Is it useful in general for the kernel to have separate "read" and
> > "write" clobbers in asm expressions? And for other applications?
>
> See above. It's actually not all that uncommon that you have a "this
> doesn't modify memory, but you can't move writes around it". It's
> usually very much about cache handling or memory ordering operations,
> and that bit test example was probably a bad example exactly because
> it made it look like it's about some controlled range.
>
> The "write memory barroer" is likely the best and simplest example,
> but it's in not the only one.

Thanks for the examples! I opened <https://gcc.gnu.org/PR100953> so
that we can easily track it.
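
For orientation, a minimal sketch of the status quo the PR asks to refine:
the only portable compiler-level fence available today is the catch-all
memory clobber, which pessimizes reads and writes alike, so a write-only
ordering (as in the write-barrier example quoted above) has no cheaper
spelling. The kernel's barrier() is essentially:

        /*
         * Full compiler fence: the compiler may not move any memory
         * access, read or write, across it.  There is currently no way
         * to clobber "writes only".
         */
        #define barrier()       __asm__ __volatile__("" : : : "memory")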


Segher

2021-06-07 23:29:55

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [RFC] LKMM: Add volatile_if()

On Mon, Jun 07, 2021 at 05:40:37PM -0500, Segher Boessenkool wrote:
> On Mon, Jun 07, 2021 at 01:16:33PM -0700, Paul E. McKenney wrote:
> > On Mon, Jun 07, 2021 at 03:51:44PM -0400, Alan Stern wrote:
> > > On Mon, Jun 07, 2021 at 01:23:35PM -0500, Segher Boessenkool wrote:
> > > > On Mon, Jun 07, 2021 at 08:27:12AM -0700, Paul E. McKenney wrote:
> > > > > > > > > The barrier() thing can work - all we need to do is to simply make it
> > > > > > > > > impossible for gcc to validly create anything but a conditional
> > > > > > > > > branch.
> > > >
> > > > > > > What would you suggest as a way of instructing the compiler to emit the
> > > > > > > conditional branch that we are looking for?
> > > > > >
> > > > > > You write it in the assembler code.
> > > > > >
> > > > > > Yes, it sucks. But it is the only way to get a branch if you really
> > > > > > want one. Now, you do not really need one here anyway, so there may be
> > > > > > some other way to satisfy the actual requirements.
> > > > >
> > > > > Hmmm... What do you see Peter asking for that is different than what
> > > > > I am asking for? ;-)
> > > >
> > > > I don't know what you are referring to, sorry?
> > > >
> > > > I know what you asked for: literally some way to tell the compiler to
> > > > emit a conditional branch. If that is what you want, the only way to
> > > > make sure that is what you get is by writing exactly that in assembler.
> > >
> > > That's not necessarily it.
> > >
> > > People would be happy to have an easy way of telling the compiler that
> > > all writes in the "if" branch of an if statement must be ordered after
> > > any reads that the condition depends on. Or maybe all writes in either
> > > the "if" branch or the "else" branch. And maybe not all reads that the
> > > condition depends on, but just the reads appearing syntactically in the
> > > condition. Or maybe even just the volatile reads appearing in the
> > > condition. Nobody has said exactly.
> > >
> > > The exact method used for doing this doesn't matter. It could be
> > > accomplished by treating those reads as load-acquires. Or it could be
> > > done by ensuring that the object code contains a dependency (control or
> > > data) from the reads to the writes. Or it could be done by treating
> > > the writes as store-releases. But we do want the execution-time
> > > penalty to be small.
> > >
> > > In short, we want to guarantee somehow that the conditional writes are
> > > not re-ordered before the reads in the condition. (But note that
> > > "conditional writes" includes identical writes occurring in both
> > > branches.)
> >
> > What Alan said! ;-)
>
> Okay, I'll think about that.
>
> But you wrote:
>
> > > > > > > What would you suggest as a way of instructing the compiler to emit the
> > > > > > > conditional branch that we are looking for?
>
> ... and that is what I answered. I am sorry if you do not like being
> taken literally, but that is how I read technical remarks: as literally
> what they say. If you say you want a branch, I take it you want a
> branch! :-)

When it is the cheapest means of providing the needed ordering, I really
do want a branch. ;-)

And a branch would implement Alan's "control dependency" above.
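
Spelled out as a minimal, editorial sketch (names are illustrative;
READ_ONCE()/WRITE_ONCE() are the existing kernel accessors, and the
LOAD->STORE ordering is exactly the property under discussion rather than
something the C standard promises):

        static void publish_if_flagged(int *flag, int *data)
        {
                /*
                 * The write to *data must not be reordered before the
                 * read of *flag; with a conditional branch in the object
                 * code, the hardware's LOAD->STORE causality provides
                 * that ordering.
                 */
                if (READ_ONCE(*flag))
                        WRITE_ONCE(*data, 42);
        }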

Thanx, Paul

2021-06-08 09:33:54

by Marco Elver

[permalink] [raw]
Subject: Re: [RFC] LKMM: Add volatile_if()

On Mon, 7 Jun 2021 at 19:04, Marco Elver <[email protected]> wrote:
[...]
> > > > So the barrier which is a compiler barrier but not a machine barrier is
> > > > __atomic_signal_fence(model), but internally GCC will not treat it smarter
> > > > than an asm-with-memory-clobber today.
> > >
> > > FWIW, Clang seems to be cleverer about it, and seems to do the optimal
> > > thing if I use a __atomic_signal_fence(__ATOMIC_RELEASE):
> > > https://godbolt.org/z/4v5xojqaY
> >
> > Indeed it does! But I don't know of a guarantee for that helpful
> > behavior.
>
> Is there a way we can interpret the standard in such a way that it
> should be guaranteed?

I figured out why it works, and unfortunately it comes down to suboptimal
codegen. In LLVM, __atomic_signal_fence() turns into a real IR instruction,
which simply emits nothing when lowered to asm. But most optimizations
happen earlier, on the IR, and a "fence" instruction cannot be removed
there. Essentially, imagine there's an invisible instruction; that explains
why it does what it does. Sadly we can't rely on that.
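
One way such a fence might slot into the expression-wrapper form of
volatile_if(); a sketch only, leaning on the observed (and, as noted,
unguaranteed) Clang behaviour, and not necessarily the exact code behind
the godbolt link:

        /*
         * Compiler-only release fence after the condition load: no
         * machine instruction is emitted for it, but current Clang will
         * not hoist the conditional stores above it.
         */
        #define volatile_if(cond)                                       \
                if (({ bool __t = (cond);                               \
                       __atomic_signal_fence(__ATOMIC_RELEASE);         \
                       __t; }))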

> The jumping-through-hoops variant would probably be asking for a
> __builtin primitive that allows constructing volatile_if() (if we can't
> bend existing primitives to do what we want).

I had a think about this. I think if we ask for some primitive
compiler support, "volatile if" as the target is suboptimal design,
because it somewhat limits composability (and of course make it hard
to get as an extension). That primitive should probably also support
for/while/switch. But "volatile if" would also preclude us from
limiting the scope of the source of forced dependency, e.g. say we
have "if (A && B)", but we only care about A.

The cleaner approach would be an expression wrapper, e.g. "if
(ctrl_depends(A) && B) { ... }".

I imagine syntactically it'd be similar to __builtin_expect(..). I
think that's also easier to request an extension for, say
__builtin_ctrl_depends(expr). (If that is appealing, we can try and
propose it as std::ctrl_depends() along with std::dependent_ptr<>.)
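
A usage sketch of the hypothetical builtin (to be clear,
__builtin_ctrl_depends() does not exist in any compiler today; it is the
extension being floated here, and a, b, c are illustrative):

        /*
         * Hypothetical: only the read feeding A is tied to the
         * conditional writes; B remains an ordinary subexpression the
         * compiler may optimize freely.
         */
        if (__builtin_ctrl_depends(READ_ONCE(*a)) && *b > 0)
                WRITE_ONCE(*c, 1);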

Thoughts?

Thanks,
-- Marco

2021-06-08 11:27:45

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] LKMM: Add volatile_if()

On Tue, Jun 08, 2021 at 11:30:36AM +0200, Marco Elver wrote:

> The cleaner approach would be an expression wrapper, e.g. "if
> (ctrl_depends(A) && B) { ... }".
>
> I imagine syntactically it'd be similar to __builtin_expect(..). I
> think that's also easier to request an extension for, say
> __builtin_ctrl_depends(expr). (If that is appealing, we can try and
> propose it as std::ctrl_depends() along with std::dependent_ptr<>.)
>
> Thoughts?

Works for me; and note how it mirrors how we implemented volatile_if()
in the first place, by doing an expression wrapper.

__builtin_ctrl_depends(expr) would have to:

- ensure !__builtin_const_p(expr) (A)
- imply an acquire compiler fence (B)
- ensure cond-branch is emitted (C)

*OR*

- ensure !__builtin_const_p(expr); (A)
- upgrade the load in @expr to load-acquire (D)


A)

This all hinges on there actually being a LOAD; if expr is constant, we
have a malformed program and can emit a compiler error.

B)

We want to capture any store that comes after, not just volatile stores.

The example here is a ring-buffer that loads the (head and) tail pointer
to check for space and then writes data elements; a sketch follows point
D below. It would be 'cumbersome' to have all the data writes as
volatile.

C)

We depend on the load-to-branch data dependency guarding the store to
provide the LOAD->STORE memory order.

D)

Upgrading LOAD to LOAD-ACQUIRE also provides LOAD->STORE ordering, but
it does require that the compiler has access to the LOAD in the first
place, which isn't a given, seeing how much asm() we have around. Also
the architecture should have a cheap LOAD-ACQUIRE in the first place,
otherwise there's no point.

If this is done, the branch is allowed to be optimized away if the
compiler so wants.
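
Returning to the ring-buffer case from (B), a sketch of the producer
side, assuming a ctrl_depends()-style wrapper existed (rb, head, v and
RB_SIZE are illustrative; smp_store_release() is the existing kernel
primitive):

        unsigned int head = rb->head;   /* producer-owned index */

        /*
         * Check for space, then fill the slot with plain writes, then
         * publish.  The data writes must stay after the tail load that
         * established there was room, without marking them volatile.
         */
        if (ctrl_depends(READ_ONCE(rb->tail)) != ((head + 1) % RB_SIZE)) {
                rb->buf[head] = v;                      /* plain data write */
                smp_store_release(&rb->head, (head + 1) % RB_SIZE);
        }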

Now, Will will also want to allow the load-acquire to be run-time
patched between the RCsc and RCpc variant depending on what ARMv8
extensions are available, which will be 'interesting' (although I can
think of ways to actually do that; one would be to keep a special
section that tracks the locations of these __builtin_ctrl_depends()
generated load-acquire instructions).


2021-06-09 04:24:47

by Segher Boessenkool

[permalink] [raw]
Subject: Re: [RFC] LKMM: Add volatile_if()

On Tue, Jun 08, 2021 at 01:22:58PM +0200, Peter Zijlstra wrote:
> Works for me; and note how it mirrors how we implemented volatile_if()
> in the first place, by doing an expression wrapper.
>
> __builtin_ctrl_depends(expr) would have to:
>
> - ensure !__builtin_const_p(expr) (A)

Why would it be an error if __builtin_constant_p(expr)? In many
programs the compiler can figure out that some expression never changes.
Having a control dependency on something like that is not erroneous.

> - imply an acquire compiler fence (B)
> - ensure cond-branch is emitted (C)

(C) is almost impossible to do. This should be reformulated to talk
about the effect of the generated code, instead.

> *OR*
>
> - ensure !__builtin_const_p(expr); (A)
> - upgrade the load in @expr to load-acquire (D)

So that will only work if there is exactly one read from memory in expr?
That is problematic.

This needs some work.


Segher

2021-06-09 17:28:19

by Marco Elver

[permalink] [raw]
Subject: Re: [RFC] LKMM: Add volatile_if()

On Tue, 8 Jun 2021 at 17:30, Segher Boessenkool
<[email protected]> wrote:
> On Tue, Jun 08, 2021 at 01:22:58PM +0200, Peter Zijlstra wrote:
> > Works for me; and note how it mirrors how we implemented volatile_if()
> > in the first place, by doing an expression wrapper.
> >
> > __builtin_ctrl_depends(expr) would have to:
> >
> > - ensure !__builtin_const_p(expr) (A)
>
> Why would it be an error if __builtin_constant_p(expr)? In many
> programs the compiler can figure out that some expression never changes.
> Having a control dependency on something like that is not erroneous.
>
> > - imply an acquire compiler fence (B)
> > - ensure cond-branch is emitted (C)
>
> (C) is almost impossible to do. This should be reformulated to talk
> about the effect of the generated code, instead.
>
> > *OR*
> >
> > - ensure !__builtin_const_p(expr); (A)
> > - upgrade the load in @expr to load-acquire (D)
>
> So that will only work if there is exactly one read from memory in expr?
> That is problematic.
>
> This needs some work.

There is a valid concern that something at the level of the memory
model requires very precise specification in terms of language
semantics and not generated code. Otherwise it seems difficult to get
compiler folks onboard. And coming up with such a specification may
take a while, especially if we have to venture in the realm of the
C11/C++11 memory model while still trying to somehow make it work for
the LKMM. That seems like a very tricky maze we may want to avoid.

An alternative design would be to use a statement attribute to only
enforce (C) ("__attribute__((mustcontrol))" ?). The rest can be
composed through existing primitives I think (the compiler barriers
need optimizing though), which should give us ctrl_depends().

At least for Clang, it should be doable: https://reviews.llvm.org/D103958
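
A composition sketch along those lines, heavily hypothetical:
"mustcontrol" is just the attribute name floated above, its spelling and
placement are undecided, and barrier() here is the unoptimized stand-in
for the compiler-fence half:

        /*
         * Hypothetical: the attribute would pin the conditional branch
         * (C), while an ordinary compiler barrier provides the fence
         * part; a real version would want something weaker than a full
         * barrier().
         */
        #define ctrl_depends(expr) \
                ({ typeof(expr) __e = (expr); barrier(); __e; })

        __attribute__((mustcontrol))          /* placement is hypothetical */
        if (ctrl_depends(READ_ONCE(*a)))
                WRITE_ONCE(*b, 1);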

Thanks,
-- Marco

2021-06-09 18:06:45

by Segher Boessenkool

[permalink] [raw]
Subject: Re: [RFC] LKMM: Add volatile_if()

On Wed, Jun 09, 2021 at 02:44:08PM +0200, Marco Elver wrote:
> On Tue, 8 Jun 2021 at 17:30, Segher Boessenkool
> <[email protected]> wrote:
> > This needs some work.
>
> There is a valid concern that something at the level of the memory
> model requires very precise specification in terms of language
> semantics and not generated code.

Yup, exactly. Especially because the meaning of generated code is hard
to describe, and even much more so if you do not limit yourself to a
single machine architecture.

> Otherwise it seems difficult to get
> compiler folks onboard.

It isn't just difficult to get us on board without it; we know it is
simply impossible to do anything sensible without it.

> And coming up with such a specification may
> take a while,

Yes.

> especially if we have to venture in the realm of the
> C11/C++11 memory model while still trying to somehow make it work for
> the LKMM. That seems like a very tricky maze we may want to avoid.

Well, you only need to use the saner parts of the memory model (not the
full thing), and extensions are fine as well of course.

> An alternative design would be to use a statement attribute to only
> enforce (C) ("__attribute__((mustcontrol))" ?).

Statement attributes only exist for empty statements. It is unclear how
(and if!) we could support it for general statements.

Some new builtin seems to fit the requirements better? I haven't looked
too closely though.


Segher