LinuxLists.cc - [PATCH] x86: Optimize variable_test

Subject: Re: [PATCH] x86: Optimize variable_test_bit()

On Fri, May 1, 2015 at 8:16 AM, Peter Zijlstra <[email protected]> wrote:
>
> Since test_bit() doesn't actually have any output variables, we can use
> asm goto without having to add a memory clobber. This reduces the code
> to something sensible:

Yes, looks good, except if we have anything that actually wants to use
the value rather than branch on it. But a quick grep seems to show
that the vast majority of them are all about just directly testing the
result.

It worries me a bit that gcc now cannot pick the likely branch any
more. It will always branch out for the bit being set. So code like
this:

  net/core/dev.c: if (likely(!test_bit(__QDISC_STATE_DEACTIVATED, &q->state)))

wouldn't work, but almost all of those seem to be the constant case
that doesn't get to this anyway.

So on the whole, this would seem to be a win.

> PS. should we kill the memory clobber for __test_and_change_bit()? It
> seems inconsistent and out of place.

We don't seem to have it for the clear case, so yeah..

> PPS. Jakub, I see gcc5.1 still hasn't got output operands for asm goto;
> is this something we can get 'fixed' ?

I suspect the problem is that now the particular register allocation
choices are basically not just around the asm, they'd affect the
target labels of the asm too.

I think that for the kernel, it would *generally* be ok to just say
that the outputs are only valid in the case the asm does *not* branch
out, assuming that the *clobbers* obviously clobber things regardless.
Keeping the register allocation for the asm itself still purely
"local" to the asm.

Something with a memory output we could just turn into a memory
clobber (so we could do the test-and-change bits today without using
any outputs - just mark memory clobbered).
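
A minimal sketch of that idea, assuming x86-64 and a GCC-compatible
compiler: the asm goto has no output operands, lists *addr as a plain
memory input, and relies on the "memory" clobber to tell the compiler that
the word changed. The name sketch_test_and_change_bit() and the plain-C
fallback for other targets are made up for illustration; this is not the
kernel's implementation.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_WORD_BITS ((long)(sizeof(unsigned long) * CHAR_BIT))

/* Hypothetical helper: flip bit nr of *addr and report whether it was
 * set beforehand. Only bits inside the first word are handled. */
static bool sketch_test_and_change_bit(long nr, unsigned long *addr)
{
    assert(addr != NULL && nr >= 0 && nr < SKETCH_WORD_BITS);
#if defined(__x86_64__) && defined(__GNUC__)
    /* No outputs: *addr is an input, and the "memory" clobber makes the
     * compiler assume memory was modified by the asm. */
    __asm__ goto("btcq %1, %0\n\t"
                 "jc %l[was_set]"
                 : /* no outputs */
                 : "m" (*addr), "Ir" (nr)
                 : "memory", "cc"
                 : was_set);
    return false;
was_set:
    return true;
#else
    /* Portable fallback with the same result, for other targets. */
    unsigned long mask = 1UL << nr;
    bool old = (*addr & mask) != 0;
    *addr ^= mask;
    return old;
#endif
}
```

The cost is the one raised elsewhere in the thread: the "memory" clobber
applies to everything, so unrelated values cached in registers may have to
be reloaded after the asm.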

Don't get me wrong - it would be even better if outputs would be valid
even for the labels the asm can jump to, but I can see that being
fundamentally hard. For example, two different asm goto's that share
one label, but that have different output register allocation. That
would be a nightmare (the compiler could basically have to duplicate
the label etc - although maybe you have to do that anyway).

And many of the places where I would personally like to use "asm goto"
are where the goto label is for an error case. Things like a failed
cmpxchg, or a failed user access etc. It would generally be ok if the
output values from the asm were "lost", because it's about the cleanup
(or trying again)..

So we might be ok in many cases with that kind of "weaker output",
where the output is only valid when the goto is *not* taken.

Linus

2015-05-01 16:17:10

Subject: Re: [PATCH] x86: Optimize variable_test_bit()

On Fri, May 01, 2015 at 09:03:32AM -0700, Linus Torvalds wrote:
> On Fri, May 1, 2015 at 8:16 AM, Peter Zijlstra <[email protected]> wrote:
> >
> > Since test_bit() doesn't actually have any output variables, we can use
> > asm goto without having to add a memory clobber. This reduces the code
> > to something sensible:
>
> Yes, looks good, except if we have anything that actually wants to use
> the value rather than branch on it. But a quick grep seems to show
> that the vast majority of them are all about just directly testing the
> result.
>
> It worries me a bit that gcc now cannot pick the likely branch any
> more. It will always branch out for the bit being set. So code like
> this:
>
>   net/core/dev.c: if (likely(!test_bit(__QDISC_STATE_DEACTIVATED, &q->state)))
>
> wouldn't work, but almost all of those seem to be the constant case
> that doesn't get to this anyway.

Ah yes, that's another thing we've previously discussed with the GCC
people (IIRC). The GCC manual states you can use hot and cold attributes
on the labels (although when we tested that it didn't actually work, it
might now). But that's no good if the hint is one (or more) layer up
from the asm goto.

It would indeed be very good if the likely/unlikely thing would work as
expected.
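
For reference, a rough sketch of what a cold attribute on an asm goto
target label looks like, assuming x86-64 and GCC (label attributes are a GCC
extension, so the macro expands to nothing elsewhere). The helper name
sketch_bit_is_set() and the SKETCH_COLD macro are hypothetical, and nothing
here shows that the hint actually changes the generated code.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Label attributes are a GCC extension; assume nothing about other compilers. */
#if defined(__GNUC__) && !defined(__clang__)
#define SKETCH_COLD __attribute__((cold))
#else
#define SKETCH_COLD
#endif

/* Hypothetical helper: returns whether bit nr of *addr is set. */
static bool sketch_bit_is_set(long nr, const unsigned long *addr)
{
    assert(addr != NULL && nr >= 0 &&
           nr < (long)(sizeof(unsigned long) * CHAR_BIT));
#if defined(__x86_64__) && defined(__GNUC__)
    __asm__ goto("btq %1, %0\n\t"
                 "jc %l[bit_set]"
                 : /* no outputs */
                 : "m" (*addr), "Ir" (nr)
                 : "cc"
                 : bit_set);
    return false;
bit_set:
    SKETCH_COLD;    /* hint that the branch-out path is the unlikely one */
    return true;
#else
    return ((*addr >> nr) & 1UL) != 0;
#endif
}
```

The attribute sits on the label inside the helper, so a likely() or
unlikely() written by the caller one layer up has no way to reach it.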

2015-05-01 16:19:23

Subject: Re: [PATCH] x86: Optimize variable_test_bit()

On Fri, May 01, 2015 at 09:03:32AM -0700, Linus Torvalds wrote:
> On Fri, May 1, 2015 at 8:16 AM, Peter Zijlstra <[email protected]> wrote:
> > PPS. Jakub, I see gcc5.1 still hasn't got output operands for asm goto;
> > is this something we can get 'fixed' ?
>
> I suspect the problem is that now the particular register allocation
> choices are basically not just around the asm, they'd affect the
> target labels of the asm too.
>
> I think that for the kernel, it would *generally* be ok to just say
> that the outputs are only valid in the case the asm does *not* branch
> out, assuming that the *clobbers* obviously clobber things regardless.
> Keeping the register allocation for the asm itself still purely
> "local" to the asm.
>
> Something with a memory output we could just turn into a memory
> clobber (so we could do the test-and-change bits today without using
> any outputs - just mark memory clobbered).

The risk is of course that we'll cause too many stores and reloads
around them and regress instead of win.

A single variable clobber might be a solution here ?

2015-05-01 16:29:37

Subject: Re: [PATCH] x86: Optimize variable_test_bit()

On Fri, May 01, 2015 at 06:16:54PM +0200, Peter Zijlstra wrote:
> On Fri, May 01, 2015 at 09:03:32AM -0700, Linus Torvalds wrote:
> > On Fri, May 1, 2015 at 8:16 AM, Peter Zijlstra <[email protected]> wrote:
> > >
> > > Since test_bit() doesn't actually have any output variables, we can use
> > > asm goto without having to add a memory clobber. This reduces the code
> > > to something sensible:
> >
> > Yes, looks good, except if we have anything that actually wants to use
> > the value rather than branch on it. But a quick grep seems to show
> > that the vast majority of them are all about just directly testing the
> > result.
> >
> > It worries me a bit that gcc now cannot pick the likely branch any
> > more. It will always branch out for the bit being set. So code like
> > this:
> >
> >   net/core/dev.c: if (likely(!test_bit(__QDISC_STATE_DEACTIVATED, &q->state)))
> >
> > wouldn't work, but almost all of those seem to be the constant case
> > that doesn't get to this anyway.
>
> Ah yes, that's another thing we've previously discussed with the GCC
> people (IIRC). The GCC manual states you can use hot and cold attributes
> on the labels (although when we tested that it didn't actually work, it
> might now). But that's no good if the hint is one (or more) layer up
> from the asm goto.
>
> It would indeed be very good if the likely/unlikely thing would work as
> expected.

Ah, I see what you meant. Yes we're stuck with the 'jc', the compiler
cannot flip that into a jnc.

The best it can do is add unconditional jumps to re-arrange the blocks,
and that's somewhat ugly indeed, although better than nothing at all.
And in experiments back when we did the static_branch stuff, none of
that worked at all.

2015-05-01 16:34:25

by Jakub Jelinek

Subject: Re: [PATCH] x86: Optimize variable_test_bit()

On Fri, May 01, 2015 at 09:03:32AM -0700, Linus Torvalds wrote:
> > PPS. Jakub, I see gcc5.1 still hasn't got output operands for asm goto;
> > is this something we can get 'fixed' ?

CCing Richard as author of asm goto and Vlad as register allocator
maintainer. There are a few enhancement requests to support this, like
http://gcc.gnu.org/PR59615 and http://gcc.gnu.org/PR52381 , but indeed the
reason why no outputs are allowed is the register allocation issue.
Don't know if LRA would be better suited to handle that case, but it would
indeed be pretty hard.

Jakub

2015-05-01 16:45:40

Subject: Re: [PATCH] x86: Optimize variable_test_bit()

On Fri, May 1, 2015 at 9:33 AM, Jakub Jelinek <[email protected]> wrote:
>
> CCing Richard as author of asm goto and Vlad as register allocator
> maintainer. There are a few enhancement requests to support this, like
> http://gcc.gnu.org/PR59615 and http://gcc.gnu.org/PR52381 , but indeed the
> reason why no outputs are allowed is the register allocation issue.

So would it help if it was documented that output registers would only
be valid for the "fallthrough" case of an asm goto, and would be
considered undefined for any jump targets?

It wouldn't be the perfect situation, but it would be better than not
having any outputs at all, and as mentioned, it would probably be
sufficient for a fair number of cases.

Linus

2015-05-01 16:56:33

Subject: Re: [PATCH] x86: Optimize variable_test_bit()

On Fri, May 01, 2015 at 06:33:29PM +0200, Jakub Jelinek wrote:
> On Fri, May 01, 2015 at 09:03:32AM -0700, Linus Torvalds wrote:
> > > PPS. Jakub, I see gcc5.1 still hasn't got output operands for asm goto;
> > > is this something we can get 'fixed' ?
>
> CCing Richard as author of asm goto and Vlad as register allocator
> maintainer. There are a few enhancement requests to support this, like
> http://gcc.gnu.org/PR59615 and http://gcc.gnu.org/PR52381 , but indeed the
> reason why no outputs are allowed is the register allocation issue.
> Don't know if LRA would be better suited to handle that case, but it would
> indeed be pretty hard.

So it would be awesome if we could use these freshly modeled flags as output
for regular asm stmts; that would obviate much of the asm goto hackery we now
do/have and allow gcc to pick the right branch for likely/unlikely.

2015-05-01 17:17:36

by Ingo Molnar

Subject: Re: [PATCH] x86: Optimize variable_test_bit()

* Peter Zijlstra <[email protected]> wrote:

> On Fri, May 01, 2015 at 06:33:29PM +0200, Jakub Jelinek wrote:
> > On Fri, May 01, 2015 at 09:03:32AM -0700, Linus Torvalds wrote:
> > > > PPS. Jakub, I see gcc5.1 still hasn't got output operands for asm goto;
> > > > is this something we can get 'fixed' ?
> >
> > CCing Richard as author of asm goto and Vlad as register allocator
> > maintainer. There are a few enhancement requests to support this,
> > like http://gcc.gnu.org/PR59615 and http://gcc.gnu.org/PR52381 ,
> > but indeed the reason why no outputs are allowed is the register
> > allocation issue. Don't know if LRA would be better suited to
> > handle that case, but it would indeed be pretty hard.
>
> So it would be awesome if we could use these freshly modeled flags as
> output for regular asm stmts; that would obviate much of the asm
> goto hackery we now do/have and allow gcc to pick the right branch
> for likely/unlikely.

If I may hijack the discussion a bit: it would also be awesome if
there was a GCC flag that would allow us to use __builtin_expect()
hints even when automatic branch heuristics are disabled:

I.e. very similar to -fno-guess-branch-probability, just that explicit
__builtin_expect() hints would not be ignored (like
-fno-guess-branch-probability does it today).

We could use this to compress the kernel instruction cache footprint
by about 5% on x86-64, while still having all the hand-made
optimizations that __builtin_expect() allows us.

It would be a perfect solution if -fno-guess-branch-probability just
stopped ignoring __builtin_expect().
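
To make the kind of hint concrete, here is a small sketch of hand-placed
__builtin_expect() annotations, assuming a GCC-compatible compiler. The
macro names sketch_likely()/sketch_unlikely() and the function
sum_nonnegative() are hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__GNUC__)
#define sketch_likely(x)   __builtin_expect(!!(x), 1)
#define sketch_unlikely(x) __builtin_expect(!!(x), 0)
#else
#define sketch_likely(x)   (!!(x))
#define sketch_unlikely(x) (!!(x))
#endif

/* Sum the non-negative entries of v; negative entries are treated as a
 * rare error and only counted in *rejected. */
static long sum_nonnegative(const int *v, size_t n, size_t *rejected)
{
    long sum = 0;

    assert(v != NULL && rejected != NULL);
    *rejected = 0;
    for (size_t i = 0; i < n; i++) {
        if (sketch_unlikely(v[i] < 0)) {
            (*rejected)++;      /* explicitly marked as the cold path */
            continue;
        }
        sum += v[i];
    }
    return sum;
}
```

The hint only affects code layout, never the result; the request above is
that such explicit hints keep working when the compiler's own guessing is
switched off.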

Thanks,

Ingo

2015-05-01 19:03:16

by Vladimir Makarov

Subject: Re: [PATCH] x86: Optimize variable_test_bit()

On 01/05/15 12:33 PM, Jakub Jelinek wrote:
> On Fri, May 01, 2015 at 09:03:32AM -0700, Linus Torvalds wrote:
>>> PPS. Jakub, I see gcc5.1 still hasn't got output operands for asm goto;
>>> is this something we can get 'fixed' ?
> CCing Richard as author of asm goto and Vlad as register allocator
> maintainer. There are a few enhancement requests to support this, like
> http://gcc.gnu.org/PR59615 and http://gcc.gnu.org/PR52381 , but indeed the
> reason why no outputs are allowed is the register allocation issue.
> Don't know if LRA would be better suited to handle that case, but it would
> indeed be pretty hard.
>
>
GCC RA is a major reason to prohibit output operands for asm goto.

The reload pass was not designed to deal with output reloads for control
flow insns. It is very hard to implement this feature there and to
implement it reliably. Also, nobody has done any development on
reload for a long time, so I doubt that anybody would do this for reload.

LRA is more suitable to implement the feature. In general, even
outputs used on any branch can be permitted. Although critical edges
can complicate the implementation as new BBs are created. But it is
doable too.

The only problem is that asm goto semantics in this case would have to be
defined depending on which local register allocator (reload or LRA) GCC
uses for the given target.

Currently LRA is used by x86/x86-64, ARM, AARCH64, s390, and MIPS.
PPC, SH, and ARC are moving to LRA. All other targets are still
reload based.

So I could implement the output reloads in LRA, probably for the
next GCC release. How to enable and mostly use it for multi-target
code like the kernel is another question.

2015-05-01 20:49:58

Subject: Re: [PATCH] x86: Optimize variable_test_bit()

On Fri, May 1, 2015 at 12:02 PM, Vladimir Makarov <[email protected]> wrote:
>
> GCC RA is a major reason to prohibit output operands for asm goto.

Hmm.. Thinking some more about it, I think that what would actually
work really well at least for the kernel is:

(a) allow *memory* operands (ie "=m") as outputs and having them be
meaningful even at any output labels (obviously with the caveat that
the asm instructions that write to memory would have to happen before
the branch ;)

This covers the somewhat common case of having magic instructions that
result in conditions that can't be tested at a C level. Things like
"bit clear and test" on x86 (with or without the lock) .

(b) allow other operands to be meaningful only for the fallthrough case.

From a register allocation standpoint, these should be the easy cases.
(a) doesn't need any register allocation of the output (only on the
input to set up the effective address of the memory location), and (b)
would explicitly mean that an "asm goto" would leave any non-memory
outputs undefined in any of the goto cases, so from a RA standpoint it
ends up being equivalent to a non-goto asm..

Hmm?

So as an example of something that the kernel does and which wants to
have an output register is to do a load from user space that can
fault. When it faults, we obviously simply don't *have* an actual
result, and we return an error. But for the successful fallthrough
case, we get a value in a register.

I'd love to be able to write it as (this is simplified, and doesn't
worry about all the different access sizes, or the "stac/clac"
sequence to enable user accesses on modern Intel CPU's):

    asm goto(
        "1:"
        "\tmovl %1,%0\n"
        _ASM_EXTABLE(1b,%l[error])
        : "=r" (val)
        : "m" (*userptr)
        : : error);

where that "_ASM_EXTABLE()" is our magic macro for generating an
exception entry for that instruction, so that if the load takes an
exception, it will instead go to the "error" label.

But if it goes to the error label, the "val" output register really
doesn't contain anything, so we wouldn't even *want* gcc to try to do
any register allocation for the "jump to label from assembly" case.

So at least for one of the major cases that I'd like to use "asm goto"
with an output, I actually don't *want* any register allocation for
anything but the fallthrough case. And I suspect that's a
not-too-uncommon pattern - it's probably often about error handling.
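
The exception-table mechanism is kernel-specific, so the following sketch
swaps in a different error case, signed overflow, to show the same shape:
an asm goto whose register output is only read on the fallthrough path.
It assumes a compiler that accepts output operands on asm goto (the guard
assumes GCC 11 or newer on x86-64, something that did not exist when this
was written) and otherwise falls back to plain C. The name
sketch_checked_add() is hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumption: output operands on asm goto need GCC 11 or newer. */
#if defined(__x86_64__) && defined(__GNUC__) && !defined(__clang__) && \
    (__GNUC__ >= 11)
#define SKETCH_ASM_GOTO_OUTPUTS 1
#else
#define SKETCH_ASM_GOTO_OUTPUTS 0
#endif

/* Hypothetical helper: store a + b in *out, or return false on overflow
 * and leave *out untouched. */
static bool sketch_checked_add(long a, long b, long *out)
{
    assert(out != NULL);
#if SKETCH_ASM_GOTO_OUTPUTS
    long sum;

    __asm__ goto("movq %[a], %[sum]\n\t"
                 "addq %[b], %[sum]\n\t"
                 "jo %l[overflow]"
                 : [sum] "=&r" (sum)
                 : [a] "r" (a), [b] "r" (b)
                 : "cc"
                 : overflow);
    *out = sum;     /* 'sum' is only read on the fallthrough path */
    return true;
overflow:
    return false;   /* error path: the output is deliberately ignored */
#else
    /* Portable fallback with the same observable behaviour. */
    if ((b > 0 && a > LONG_MAX - b) || (b < 0 && a < LONG_MIN - b))
        return false;
    *out = a + b;
    return true;
#endif
}
```

Nothing after the "overflow" label looks at "sum", which is exactly the
weaker contract described above: the output only has to be valid when the
goto is not taken.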

Linus

2015-05-01 22:22:51

by Vladimir Makarov

Subject: Re: [PATCH] x86: Optimize variable_test_bit()

On 01/05/15 04:49 PM, Linus Torvalds wrote:
> On Fri, May 1, 2015 at 12:02 PM, Vladimir Makarov <[email protected]> wrote:
>> GCC RA is a major reason to prohibit output operands for asm goto.
> Hmm.. Thinking some more about it, I think that what would actually
> work really well at least for the kernel is:
>
> (a) allow *memory* operands (ie "=m") as outputs and having them be
> meaningful even at any output labels (obviously with the caveat that
> the asm instructions that write to memory would have to happen before
> the branch ;)
>
> This covers the somewhat common case of having magic instructions that
> result in conditions that can't be tested at a C level. Things like
> "bit clear and test" on x86 (with or without the lock) .
>
> (b) allow other operands to be meaningful only for the fallthrough case.
>
> From a register allocation standpoint, these should be the easy cases.
> (a) doesn't need any register allocation of the output (only on the
> input to set up the effective address of the memory location), and (b)
> would explicitly mean that an "asm goto" would leave any non-memory
> outputs undefined in any of the goto cases, so from a RA standpoint it
> ends up being equivalent to a non-goto asm..
Thanks for the explanation of what you need in the most common case.

A big part of GCC RA (at least of the local register allocators, the reload
pass and LRA), besides assigning hard registers to pseudos, is making
transformations to satisfy insn constraints. If there are not enough
hard registers, a pseudo can be allocated to a stack slot, and if the insn
using the pseudo needs a hard register, a load and/or store has to be
generated before/after the insn. The problem for both the old RA (the
reload pass) and the new RA (LRA) is that they were not designed to put new
insns after an insn that changes control flow. Assigning hard registers
itself is not an issue for the asm goto case.

If I understood you correctly, you assume that just permitting =m will
make GCC generate the correct code. Unfortunately, it is more
complicated. The operand might not be memory at all, or might be memory
that does not satisfy the memory constraint 'm'. So insns for moving memory
satisfying 'm' into the output operand location might still be necessary
after the asm goto.

We could define asm goto semantics to require that the user provide
memory for such an output operand (e.g. a pointer dereference, as in your
case) and generate an error otherwise. By the way, the same could be
done for an output *register* operand; to avoid the error, the user would
use a local register variable (a GCC extension) as the operand. But that
might be a bad idea from a code performance point of view.

Unfortunately, the operand can be substituted by an equivalent value during
various transformations, so even if a user thinks it will be memory
before RA, that might be wrong. I believe there are some cases
where we can be sure that it will be memory (e.g. dereferencing a pointer
which is a function argument and is not used anywhere else in the
function). Still, it makes asm goto semantics complicated, imho.

We could prevent equivalence substitution for the output memory operand of
asm goto through all the optimizations, but that is probably an even harder
task than implementing output reloads in the *reload* pass (it is a
28-year-old pass with so many changes during its life that practically
nobody can understand it well now or change it w/o introducing a new bug).
As for LRA, as I wrote, implementing output reloads is a doable task.

> Hmm?
>
> So as an example of something that the kernel does and which wants to
> have an output register is to do a load from user space that can
> fault. When it faults, we obviously simply don't *have* an actual
> result, and we return an error. But for the successful fallthrough
> case, we get a value in a register.
>
> I'd love to be able to write it as (this is simplified, and doesn't
> worry about all the different access sizes, or the "stac/clac"
> sequence to enable user accesses on modern Intel CPU's):
>
>     asm goto(
>         "1:"
>         "\tmovl %1,%0\n"
>         _ASM_EXTABLE(1b,%l[error])
>         : "=r" (val)
>         : "m" (*userptr)
>         : : error);
>
> where that "_ASM_EXTABLE()" is our magic macro for generating an
> exception entry for that instruction, so that if the load takes an
> exception, it will instead go to the "error" label.
>
> But if it goes to the error label, the "val" output register really
> doesn't contain anything, so we wouldn't even *want* gcc to try to do
> any register allocation for the "jump to label from assembly" case.
>
> So at least for one of the major cases that I'd like to use "asm goto"
> with an output, I actually don't *want* any register allocation for
> anything but the fallthrough case. And I suspect that's a
> not-too-uncommon pattern - it's probably often about error handling.
>
>
As I wrote already, if we implement output reloads after the control flow
insn, it does not matter what the operand constraint is (memory or
register). Implementing it only for the fall-through case simplifies the
task, but not by much. For LRA it is doable and I can do this; for the
reload pass it is very hard (requiring only memory operands might simplify
the implementation in reload, although I am not sure about it).

But maybe somebody will agree to do it for reload; sorry, just not me.
I cannot think about this without flinching.

2015-05-02 12:40:26

Subject: Re: [PATCH] x86: Optimize variable_test_bit()

On Fri, May 01, 2015 at 01:49:52PM -0700, Linus Torvalds wrote:
> On Fri, May 1, 2015 at 12:02 PM, Vladimir Makarov <[email protected]> wrote:
> >
> > GCC RA is a major reason to prohibit output operands for asm goto.
>
> Hmm.. Thinking some more about it, I think that what would actually
> work really well at least for the kernel is:
>
> (a) allow *memory* operands (ie "=m") as outputs and having them be
> meaningful even at any output labels (obviously with the caveat that
> the asm instructions that write to memory would have to happen before
> the branch ;)
>
> This covers the somewhat common case of having magic instructions that
> result in conditions that can't be tested at a C level. Things like
> "bit clear and test" on x86 (with or without the lock) .

Would not something like:

static inline bool __test_and_clear_bit(long nr, volatile unsigned long *addr)
{
        bool oldbit;

        asm volatile ("btr %2, %1"
                      : "CF" (oldbit), "+m" (*addr)
                      : "Ir" (nr));

        return oldbit;
}

Be the far better solution for this? Bug 59615 comment 7 states that
they actually modeled the flags in the .md file, so the above should be
possible to implement.

Now GCC can decide to use "sbb %0, %0" to convert CF into a register
value or use "jnc" / "jc" for branches, depending on what
__test_and_clear_bit() was used for.

We don't have to (ab)use asm goto for these things anymore; furthermore
I think the above will naturally work with our __builtin_expect() hints,
whereas the asm goto stuff has a hard time with that (afaik).

That's not to say output operands for asm goto would not still be useful
for other things (like your EXTABLE example).
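
For comparison, GCC later gained flag output operands spelled
"=@cc<cond>"; the sketch below uses that spelling rather than the "CF"
constraint proposed above, and only when the compiler advertises support
through __GCC_ASM_FLAG_OUTPUTS__ on x86-64. The helper name
sketch_test_and_clear_bit() and the plain-C fallback are hypothetical and
are not the kernel's code.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: clear bit nr of *addr and return its old value.
 * Only bits inside the first word are handled. */
static bool sketch_test_and_clear_bit(long nr, unsigned long *addr)
{
    assert(addr != NULL && nr >= 0 &&
           nr < (long)(sizeof(unsigned long) * CHAR_BIT));
#if defined(__x86_64__) && defined(__GCC_ASM_FLAG_OUTPUTS__)
    bool oldbit;

    /* "=@ccc" asks the compiler to take the output from the carry flag
     * as left by the asm; the asm itself contains no setcc or branch. */
    __asm__ volatile("btrq %2, %1"
                     : "=@ccc" (oldbit), "+m" (*addr)
                     : "Ir" (nr));
    return oldbit;
#else
    /* Portable fallback with the same result. */
    unsigned long mask = 1UL << nr;
    bool oldbit = (*addr & mask) != 0;
    *addr &= ~mask;
    return oldbit;
#endif
}
```

Because the result is an ordinary boolean at the C level, the compiler is
free to materialize it in a register or to branch on it directly, depending
on how the caller uses it.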

2015-05-02 12:43:54

Subject: Re: [PATCH] x86: Optimize variable_test_bit()

On Fri, May 01, 2015 at 03:02:24PM -0400, Vladimir Makarov wrote:
> Currently LRA is used by x86/x86-64, ARM, AARCH64, s390, and MIPS.
> PPC, SH, and ARC are moving to LRA. All other targets are still
> reload based.
>
> So I could implement the output reloads in LRA, probably for the
> next GCC release. How to enable and mostly use it for multi-target
> code like the kernel is another question.

Pretty much all inline asm is in per arch code; so one arch having
different asm features than another should not be a problem at all.

2015-05-04 13:43:22

Subject: Re: [PATCH] x86: Optimize variable_test_bit()

On Fri, May 01, 2015 at 05:16:30PM +0200, Peter Zijlstra wrote:

> diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
> index cfe3b954d5e4..bcf4fa77c04f 100644
> --- a/arch/x86/include/asm/bitops.h
> +++ b/arch/x86/include/asm/bitops.h
> @@ -313,6 +313,15 @@ static __always_inline int constant_test_bit(long nr, const volatile unsigned lo
>
> static inline int variable_test_bit(long nr, volatile const unsigned long *addr)
> {
> +#ifdef CC_HAVE_ASM_GOTO
> + asm_volatile_goto ("bt %1, %0\n\t"
> + "jc %l[cc_label]"
> + : : "m" (*(unsigned long *)addr), "Ir" (nr)
> + : : cc_label);
> + return 0;
> +cc_label:
> + return 1;
> +#else

I figured I'd try both jc and jnc versions:

      text     data      bss       dec     hex  filename
  12203914  1738112  1081344  15023370  e53d0a  defconfig-build/vmlinux
  12204228  1738112  1081344  15023684  e53e44  defconfig-build/vmlinux-jc
  12203240  1738112  1081344  15022696  e53a68  defconfig-build/vmlinux-jnc

Clearly I picked the wrong one :-) It also shows there is real value in
exposing this decision to GCC.
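
The two variants differ only in which outcome leaves the asm through the
label. A rough sketch of both, assuming x86-64 and a GCC-compatible
compiler; the names sketch_test_bit_jc()/sketch_test_bit_jnc() are
hypothetical, and this is not the exact code that was measured.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define SKETCH_BITS ((long)(sizeof(unsigned long) * CHAR_BIT))

/* Variant 1: branch out of the asm when the bit is set ("jc"). */
static int sketch_test_bit_jc(long nr, const unsigned long *addr)
{
    assert(addr != NULL && nr >= 0 && nr < SKETCH_BITS);
#if defined(__x86_64__) && defined(__GNUC__)
    __asm__ goto("btq %1, %0\n\t"
                 "jc %l[bit_set]"
                 : /* no outputs */
                 : "m" (*addr), "Ir" (nr)
                 : "cc"
                 : bit_set);
    return 0;
bit_set:
    return 1;
#else
    return (int)((*addr >> nr) & 1UL);
#endif
}

/* Variant 2: branch out of the asm when the bit is clear ("jnc"). */
static int sketch_test_bit_jnc(long nr, const unsigned long *addr)
{
    assert(addr != NULL && nr >= 0 && nr < SKETCH_BITS);
#if defined(__x86_64__) && defined(__GNUC__)
    __asm__ goto("btq %1, %0\n\t"
                 "jnc %l[bit_clear]"
                 : /* no outputs */
                 : "m" (*addr), "Ir" (nr)
                 : "cc"
                 : bit_clear);
    return 1;
bit_clear:
    return 0;
#else
    return (int)((*addr >> nr) & 1UL);
#endif
}
```

Both return the same value for every input; the polarity is baked into the
asm string, so the compiler cannot switch between them on its own.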

2015-05-04 15:38:04

Subject: Re: [PATCH] x86: Optimize variable_test_bit()

On 05/02/2015 05:39 AM, Peter Zijlstra wrote:
> On Fri, May 01, 2015 at 01:49:52PM -0700, Linus Torvalds wrote:
>> On Fri, May 1, 2015 at 12:02 PM, Vladimir Makarov <[email protected]> wrote:
>>>
>>> GCC RA is a major reason to prohibit output operands for asm goto.
>>
>> Hmm.. Thinking some more about it, I think that what would actually
>> work really well at least for the kernel is:
>>
>> (a) allow *memory* operands (ie "=m") as outputs and having them be
>> meaningful even at any output labels (obviously with the caveat that
>> the asm instructions that write to memory would have to happen before
>> the branch ;)
>>
>> This covers the somewhat common case of having magic instructions that
>> result in conditions that can't be tested at a C level. Things like
>> "bit clear and test" on x86 (with or without the lock) .
>
> Would not something like:
>
> static inline bool __test_and_clear_bit(long nr, volatile unsigned long *addr)
> {
>         bool oldbit;
>
>         asm volatile ("btr %2, %1"
>                       : "CF" (oldbit), "+m" (*addr)
>                       : "Ir" (nr));
>
>         return oldbit;
> }
>
> Be the far better solution for this? Bug 59615 comment 7 states that
> they actually modeled the flags in the .md file, so the above should be
> possible to implement.
>
> Now GCC can decide to use "sbb %0, %0" to convert CF into a register
> value or use "jnc" / "jc" for branches, depending on what
> __test_and_clear_bit() was used for.
>
> We don't have to (ab)use asm goto for these things anymore; furthermore
> I think the above will naturally work with our __builtin_expect() hints,
> whereas the asm goto stuff has a hard time with that (afaik).
>
> That's not to say output operands for asm goto would not still be useful
> for other things (like your EXTABLE example).
>

I agree that being able to model flags outputs, and thus minimize the amount of
code actually within the asm, is superior to the complexity of asm goto.

r~

2015-05-04 18:07:44

by Vladimir Makarov

Subject: Re: [PATCH] x86: Optimize variable_test_bit()

On 02/05/15 08:43 AM, Peter Zijlstra wrote:
> On Fri, May 01, 2015 at 03:02:24PM -0400, Vladimir Makarov wrote:
>> Currently LRA is used by x86/x86-64, ARM, AARCH64, s390, and MIPS.
>> PPC, SH, and ARC are moving to LRA. All other targets are still
>> reload based.
>>
>> So I could implement the output reloads in LRA, probably for the
>> next GCC release. How to enable and mostly use it for multi-target
>> code like the kernel is another question.
> Pretty much all inline asm is in per arch code; so one arch having
> different asm features than another should not be a problem at all.
Ok, then. I'll try to implement output operands for asm-goto in LRA for
the next GCC release.

Of course, if nobody objects to changing asm goto semantics from

An 'asm goto' statement cannot have outputs ...

to

An 'asm goto' statement cannot have outputs on some targets ...

2015-05-04 19:33:52

Subject: [RFC] Design for flag bit outputs from asms

On 05/02/2015 05:39 AM, Peter Zijlstra wrote:
> static inline bool __test_and_clear_bit(long nr, volatile unsigned long *addr)
> {
>         bool oldbit;
>
>         asm volatile ("btr %2, %1"
>                       : "CF" (oldbit), "+m" (*addr)
>                       : "Ir" (nr));
>
>         return oldbit;
> }
>
> Be the far better solution for this? Bug 59615 comment 7 states that
> they actually modeled the flags in the .md file, so the above should be
> possible to implement.
>
> Now GCC can decide to use "sbb %0, %0" to convert CF into a register
> value or use "jnc" / "jc" for branches, depending on what
> __test_and_clear_bit() was used for.
>
> We don't have to (ab)use asm goto for these things anymore; furthermore
> I think the above will naturally work with our __builtin_expect() hints,
> whereas the asm goto stuff has a hard time with that (afaik).
>
> That's not to say output operands for asm goto would not still be useful
> for other things (like your EXTABLE example).
>

(0) The C level output variable should be an integral type, from bool on up.

The flags are a scarce resource, easily clobbered. We cannot allow user code
to keep data in the flags. While x86 does have lahf/sahf, they don't exactly
perform well. And other targets like arm don't even have that bad option.

Therefore, the language level semantics are that the output is a boolean store
into the variable with a condition specified by a magic constraint.

That said, just like the compiler should be able to optimize

    void bar(int y)
    {
        int x = (y <= 0);
        if (x) foo();
    }

such that we only use a single compare against y, the expectation is that
within a similarly constrained context the compiler will not require two tests
for these boolean outputs.

Therefore:

(1) Each target defines a set of constraint strings,

E.g. for x86, wherein we're almost out of constraint letters,

    ja    aux carry flag
    jc    carry flag
    jo    overflow flag
    jp    parity flag
    js    sign flag
    jz    zero flag

E.g. for arm/aarch64 (using "j" here, but other possibilities exist):

    jn    negative flag
    jc    carry flag
    jz    zero flag
    jv    overflow flag

E.g. for s390x (I've thought less about what's useful here)

    j<m>  where m is a hex digit, and is the mask of CC values
          for which the condition is true; exactly corresponding
          to the M1 field in the branch on condition instruction.

(2) A new target hook post-processes the asm_insn, looking for the
new constraint strings. The hook expands the condition prescribed
by the string, adjusting the asm_insn as required.

E.g.

    bool x, y, z;
    asm ("xyzzy" : "=jc"(x), "=jp"(y), "=jo"(z) : : );

originally

    (parallel [
            (set (reg:QI 83 [ x ])
                (asm_operands/v:QI ("xyzzy") ("=jc") 0 []
                     []
                     [] z.c:4))
            (set (reg:QI 84 [ y ])
                (asm_operands/v:QI ("xyzzy") ("=jp") 1 []
                     []
                     [] z.c:4))
            (set (reg:QI 85 [ z ])
                (asm_operands/v:QI ("xyzzy") ("=jo") 2 []
                     []
                     [] z.c:4))
            (clobber (reg:QI 18 fpsr))
            (clobber (reg:QI 17 flags))
        ])

becomes

    (parallel [
            (set (reg:CC 17 flags)
                (asm_operands/v:CC ("xyzzy") ("=j_") 0 []
                     []
                     [] z.c:4))
            (clobber (reg:QI 18 fpsr))
        ])
    (set (reg:QI 83 [ x ])
        (ne:QI (reg:CCC 17 flags) (const_int 0)))
    (set (reg:QI 84 [ y ])
        (ne:QI (reg:CCP 17 flags) (const_int 0)))
    (set (reg:QI 85 [ z ])
        (ne:QI (reg:CCO 17 flags) (const_int 0)))

which ought to assemble to something like

    xyzzy
    setc %dl
    setp %cl
    seto %r15l

Note that rtl level data flow is preserved via the flags hard register,
and the lifetime of flags would not be extended any further than we would
for a normal cstore pattern.

Note that the output constraints are adjusted to a single internal "=j_"
which would match the flags register in any mode. We can collapse
several output flags to a single set of the flags hard register.
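
A sketch of the same idea, several flag outputs taken from a single asm,
written with the "=@cc<cond>" flag-output spelling that GCC later adopted
instead of the "=jc"-style strings proposed here. It assumes x86-64 and a
compiler that defines __GCC_ASM_FLAG_OUTPUTS__, with a plain-C fallback
otherwise; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical result type: the 32-bit sum plus three flag outputs. */
struct sketch_add_flags {
    uint32_t sum;
    bool carry;
    bool overflow;
    bool zero;
};

static struct sketch_add_flags sketch_add_with_flags(uint32_t a, uint32_t b)
{
    struct sketch_add_flags r;
#if defined(__x86_64__) && defined(__GCC_ASM_FLAG_OUTPUTS__)
    uint32_t sum = a;
    bool carry, overflow, zero;

    /* One instruction, three flag outputs: CF, OF and ZF. */
    __asm__("addl %[b], %[sum]"
            : "=@ccc" (carry), "=@cco" (overflow), "=@ccz" (zero),
              [sum] "+r" (sum)
            : [b] "r" (b));
    r.sum = sum;
    r.carry = carry;
    r.overflow = overflow;
    r.zero = zero;
#else
    /* Portable fallback computing the same three conditions in C. */
    r.sum = a + b;
    r.carry = r.sum < a;
    r.overflow = (((a ^ r.sum) & (b ^ r.sum)) >> 31) != 0;
    r.zero = (r.sum == 0);
#endif
    assert(r.zero == (r.sum == 0));     /* sanity check on the zero flag */
    return r;
}
```

How the compiler materializes each boolean (setcc, a branch, or something
else) is left to it; the asm only contains the instruction that sets the
flags.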

(3) Note that ppc is both easier and more complicated.

There we have 8 4-bit registers, although most of the integer
non-comparisons only write to CR0. And the vector non-comparisons
only write to CR1, though of course that's of less interest in the
context of kernel code.

For the purposes of cr0, the same scheme could certainly work, although
the hook would not insert a hard register use, but rather a pseudo to
be allocated to cr0 (constraint "x").

That said, it's my understanding that "dot insns", setting cr0 are
expensive in current processor generations. There's also a lot less
of the x86-style "operate and set a flag based on something useful".

Can anyone think of any drawbacks, pitfalls, or portability issues to less
popular targets that I haven't considered?

r~

2015-05-04 20:15:24

Subject: Re: [RFC] Design for flag bit outputs from asms

On 05/04/2015 12:33 PM, Richard Henderson wrote:
>
> (0) The C level output variable should be an integral type, from bool on up.
>
> The flags are a scarse resource, easily clobbered. We cannot allow user code
> to keep data in the flags. While x86 does have lahf/sahf, they don't exactly
> perform well. And other targets like arm don't even have that bad option.
>
> Therefore, the language level semantics are that the output is a boolean store
> into the variable with a condition specified by a magic constraint.
>
> That said, just like the compiler should be able to optimize
>
> void bar(int y)
> {
> int x = (y <= 0);
> if (x) foo();
> }
>
> such that we only use a single compare against y, the expectation is that
> within a similarly constrained context the compiler will not require two tests
> for these boolean outputs.
>
> Therefore:
>
> (1) Each target defines a set of constraint strings,
>
> E.g. for x86, wherein we're almost out of constraint letters,
>
> ja aux carry flag
> jc carry flag
> jo overflow flag
> jp parity flag
> js sign flag
> jz zero flag
>

I would argue that for x86 what you actually want is to model the
*conditions* that are available on the flags, not the flags themselves.
There are 16 such conditions, 8 if we discard the inversions.

It is notable that the auxiliary carry flag has no Jcc/SETcc/CMOVcc
instructions; it is only ever consumed by the DAA/DAS instructions which
makes it pointless to try to model it in a compiler any more than, say, IF.

> (2) A new target hook post-processes the asm_insn, looking for the
> new constraint strings. The hook expands the condition prescribed
> by the string, adjusting the asm_insn as required.
>
> E.g.
>
> bool x, y, z;
> asm ("xyzzy" : "=jc"(x), "=jp"(y), "=jo"(z) : : );

Other than that, this is exactly what would be wonderful to see.

-hpa

2015-05-04 20:15:49

Subject: Re: [PATCH] x86: Optimize variable_test_bit()

On 05/04/2015 11:07 AM, Vladimir Makarov wrote:
>>>
>>> So I could implement the output reloads in LRA, probably for the
>>> next GCC release. How to enable and mostly use it for multi-target
>>> code like the kernel is another question.
>> Pretty much all inline asm is in per arch code; so one arch having
>> different asm features than another should not be a problem at all.
> Ok, then. I'll try to implement output operands for asm-goto in LRA for
> the next GCC release.
>
> Of course, if nobody objects to changing asm goto semantics from
>
> An 'asm goto' statement cannot have outputs ...
>
> to
>
> An 'asm goto' statement cannot have outputs on some targets ...
>

A gradual implementation should be fine.

-hpa

2015-05-04 20:28:58

Subject: Re: [RFC] Design for flag bit outputs from asms

On 05/04/2015 01:14 PM, H. Peter Anvin wrote:
>>
>> Therefore:
>>
>> (1) Each target defines a set of constraint strings,
>>
>> E.g. for x86, wherein we're almost out of constraint letters,
>>
>> ja aux carry flag
>> jc carry flag
>> jo overflow flag
>> jp parity flag
>> js sign flag
>> jz zero flag
>>
>
> I would argue that for x86 what you actually want is to model the
> *conditions* that are available on the flags, not the flags themselves.
> There are 16 such conditions, 8 if we discard the inversions.
>
> It is notable that the auxiliary carry flag has no Jcc/SETcc/CMOVcc
> instructions; it is only ever consumed by the DAA/DAS instructions which
> makes it pointless to try to model it in a compiler any more than, say, IF.
>

OK, let me qualify that. This is only necessary if it is impractical
for gcc to optimize boolean combinations of flags. If such
optimizations are available then it doesn't matter and is probably
needlessly complex. For example:

char foo(void)
{
bool zf, sf, of;

asm("xyzzy" : "=jz" (zf), "=js" (sf), "=jo" (of));

return zf || (sf != of);
}

... should compile to ...

xyzzy
setng %al
ret

-hpa

2015-05-04 20:33:53

Subject: Re: [RFC] Design for flag bit outputs from asms

On 05/04/2015 01:14 PM, H. Peter Anvin wrote:
> On 05/04/2015 12:33 PM, Richard Henderson wrote:
>>
>> (0) The C level output variable should be an integral type, from bool on up.
>>
>> The flags are a scarse resource, easily clobbered. We cannot allow user code
>> to keep data in the flags. While x86 does have lahf/sahf, they don't exactly
>> perform well. And other targets like arm don't even have that bad option.
>>
>> Therefore, the language level semantics are that the output is a boolean store
>> into the variable with a condition specified by a magic constraint.
>>
>> That said, just like the compiler should be able to optimize
>>
>> void bar(int y)
>> {
>> int x = (y <= 0);
>> if (x) foo();
>> }
>>
>> such that we only use a single compare against y, the expectation is that
>> within a similarly constrained context the compiler will not require two tests
>> for these boolean outputs.
>>
>> Therefore:
>>
>> (1) Each target defines a set of constraint strings,
>>
>> E.g. for x86, wherein we're almost out of constraint letters,
>>
>> ja aux carry flag
>> jc carry flag
>> jo overflow flag
>> jp parity flag
>> js sign flag
>> jz zero flag
>>
>
> I would argue that for x86 what you actually want is to model the
> *conditions* that are available on the flags, not the flags themselves.
> There are 16 such conditions, 8 if we discard the inversions.

A fair point. Though honestly, I was hoping that this feature would mostly be
used for conditions that are "weird" -- that is, not normally describable by
arithmetic at all. Otherwise, why are you using inline asm for it?

> It is notable that the auxiliary carry flag has no Jcc/SETcc/CMOVcc
> instructions; it is only ever consumed by the DAA/DAS instructions which
> makes it pointless to try to model it in a compiler any more than, say, IF.

Oh yeah. Consider that dropped.

r~

2015-05-04 20:36:06

Subject: Re: [RFC] Design for flag bit outputs from asms

On Mon, May 4, 2015 at 1:14 PM, H. Peter Anvin <[email protected]> wrote:
>
> I would argue that for x86 what you actually want is to model the
> *conditions* that are available on the flags, not the flags themselves.

Yes. Otherwise it would be a nightmare to try to describe simple
conditions like "le", which is a rather complicated combination of three
of the actual flag bits:

((SF ^^ OF) || ZF) = 1

which would just be ridiculously painful for (a) the user to describe
and (b) for the compiler to recognize once described.
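
To make the combination concrete, here is a small software model, offered
as a sketch only: it computes ZF, SF and OF the way a 32-bit "cmp a, b"
would and evaluates "le" from them. The names struct flags, cmp_flags and
cond_le are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flags as an x86 "cmp a, b" (that is, a - b) would leave them. */
struct flags { bool zf, sf, of; };

static struct flags cmp_flags(int32_t a, int32_t b)
{
    uint32_t ua = (uint32_t)a, ub = (uint32_t)b;
    uint32_t r = ua - ub;                 /* wraps like the hardware */
    struct flags f;

    f.zf = (r == 0);
    f.sf = (r >> 31) & 1;
    /* Signed overflow on subtraction: the operands differ in sign and
       the result's sign differs from the minuend's. */
    f.of = (((ua ^ ub) & (ua ^ r)) >> 31) & 1;
    return f;
}

/* The "le" condition: ZF || (SF != OF). */
static bool cond_le(struct flags f)
{
    return f.zf || (f.sf != f.of);
}
```

For any pair of 32-bit values, cond_le(cmp_flags(a, b)) should agree with
the C expression a <= b, including the cases where the subtraction
overflows.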

Now, I do admit that most of the cases where you'd use inline asm with
condition codes would probably fall into just simple "test ZF or CF".
But I could certainly imagine other cases.

Linus

2015-05-04 20:43:01

Subject: Re: [RFC] Design for flag bit outputs from asms

On 05/04/2015 01:35 PM, Linus Torvalds wrote:
> On Mon, May 4, 2015 at 1:14 PM, H. Peter Anvin <[email protected]> wrote:
>>
>> I would argue that for x86 what you actually want is to model the
>> *conditions* that are available on the flags, not the flags themselves.
>
> Yes. Otherwise it would be a nightmare to try to describe simple
> conditions like "le", which a rather complicated combination of three
> of the actual flag bits:
>
> ((SF ^^ OF) || ZF) = 1
>
> which would just be ridiculously painful for (a) the user to describe
> and (b) fior the compiler to recognize once described.
>
> Now, I do admit that most of the cases where you'd use inline asm with
> condition codes would probably fall into just simple "test ZF or CF".
> But I could certainly imagine other cases.
>

Yes, although once again I'm more than happy to let gcc do the boolean
optimizations if it already has logic to do so (which it might have/want
for its own reasons.)

-hpa

2015-05-04 20:45:13

Subject: Re: [RFC] Design for flag bit outputs from asms

On Mon, May 4, 2015 at 1:33 PM, Richard Henderson <[email protected]> wrote:
>
> A fair point. Though honestly, I was hoping that this feature would mostly be
> used for conditions that are "weird" -- that is, not normally describable by
> arithmetic at all. Otherwise, why are you using inline asm for it?

I could easily imagine using some of the combinations for atomic operations.

For example, doing a "lock decl", and wanting to see if the result is
negative or zero. Sure, it would be possible to set *two* booleans (ZF
and SF), but there's a conditional for "BE"..
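
A minimal sketch of the "lock decl" case as it can be written without flag
outputs, using one compound condition instead of two booleans. It assumes
the signed condition "le" (ZF || (SF != OF)) is the one wanted for "zero
or negative"; the helper name dec_and_test_le and the non-x86 fallback are
illustrative assumptions.

```c
#include <stdbool.h>

/* Hypothetical helper: atomically decrement *v and report whether the
   result is zero or negative. */
static bool dec_and_test_le(int *v)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned char le;

    /* setle reads ZF, SF and OF as left by the locked decrement. */
    __asm__ __volatile__ ("lock decl %0\n\t"
                          "setle %1"
                          : "+m" (*v), "=qm" (le)
                          :
                          : "memory", "cc");
    return le;
#else
    /* Fallback using the GCC/Clang atomic builtin. */
    return __atomic_sub_fetch(v, 1, __ATOMIC_SEQ_CST) <= 0;
#endif
}
```

A flag output for the compound condition would let the compiler branch on
the result directly rather than going through the setle byte.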

Linus

2015-05-04 20:57:23

Subject: Re: [RFC] Design for flag bit outputs from asms

On 05/04/2015 01:45 PM, Linus Torvalds wrote:
> On Mon, May 4, 2015 at 1:33 PM, Richard Henderson <[email protected]> wrote:
>>
>> A fair point. Though honestly, I was hoping that this feature would mostly be
>> used for conditions that are "weird" -- that is, not normally describable by
>> arithmetic at all. Otherwise, why are you using inline asm for it?
>
> I could easily imagine using some of the combinations for atomic operations.
>
> For example, doing a "lock decl", and wanting to see if the result is
> negative or zero. Sure, it would be possible to set *two* booleans (ZF
> and SF), but there's a contiional for "BE"..

Sure.

I'd be more inclined to support these compound conditionals directly, rather
than try to get the compiler to recognize them after the fact.

Indeed, I believe we have a near complete set of them in the x86 backend
already. It'd just be a matter of selecting the spellings for the constraints.

r~

2015-05-04 21:23:59

Subject: Re: [RFC] Design for flag bit outputs from asms

On 05/04/2015 01:57 PM, Richard Henderson wrote:
>
> Sure.
>
> I'd be more inclined to support these compound conditionals directly, rather
> than try to get the compiler to recognize them after the fact.
>
> Indeed, I believe we have a near complete set of them in the x86 backend
> already. It'd just be a matter of selecting the spellings for the constraints.
>

Whichever works for you.

Here is the full set of conditions and mnemonics, plus a bitmask for each
with the bits indexed from MSB to LSB as (OF,SF,ZF,PF,CF), which is
probably the sanest way to model these for the purpose of boolean
optimization.

Opcode  Mnemonics  Condition            Bitmask
0       o          OF                   0xffff0000
1       no         !OF                  0x0000ffff
2       b/c/nae    CF                   0xaaaaaaaa
3       ae/nb/nc   !CF                  0x55555555
4       e/z        ZF                   0xf0f0f0f0
5       ne/nz      !ZF                  0x0f0f0f0f
6       be/na      CF || ZF             0xfafafafa
7       a/nbe      !CF && !ZF           0x05050505
8       s          SF                   0xff00ff00
9       ns         !SF                  0x00ff00ff
A       p/pe       PF                   0xcccccccc
B       np/po      !PF                  0x33333333
C       l/nge      SF != OF             0x00ffff00
D       ge/nl      SF == OF             0xff0000ff
E       le/ng      ZF || (SF != OF)     0xf0fffff0
F       g/nle      !ZF && (SF == OF)    0x0f00000f
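
The bitmasks can be reproduced mechanically. In the sketch below, bit i of
a mask is set when the condition holds for the flag combination whose
5-bit index is i, with OF as the most significant bit and CF as the least.
The helper names flag_mask, mask_le and mask_be are made up for
illustration.

```c
#include <stdint.h>

/* Bit positions within the 5-bit index, MSB to LSB: OF, SF, ZF, PF, CF. */
enum { CF = 1 << 0, PF = 1 << 1, ZF = 1 << 2, SF = 1 << 3, OF = 1 << 4 };

/* Mask for a single flag: bit i is set when `bit` is set in index i. */
static uint32_t flag_mask(unsigned bit)
{
    uint32_t mask = 0;

    for (unsigned i = 0; i < 32; i++)
        if (i & bit)
            mask |= UINT32_C(1) << i;
    return mask;
}

/* Compound conditions become plain bitwise operations on the masks. */
static uint32_t mask_le(void)   /* ZF || (SF != OF) */
{
    return flag_mask(ZF) | (flag_mask(SF) ^ flag_mask(OF));
}

static uint32_t mask_be(void)   /* CF || ZF */
{
    return flag_mask(CF) | flag_mask(ZF);
}
```

Under this model, recognizing that zf || (sf != of) is a single condition
amounts to looking up the combined mask in the table, and an inverted
condition is just the complemented mask.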

-hpa

2015-05-05 09:41:05

by Gabriel Paubert

Subject: Re: [RFC] Design for flag bit outputs from asms

On Mon, May 04, 2015 at 12:33:38PM -0700, Richard Henderson wrote:
[snipped]
> (3) Note that ppc is both easier and more complicated.
>
> There we have 8 4-bit registers, although most of the integer
> non-comparisons only write to CR0. And the vector non-comparisons
> only write to CR1, though of course that's of less interest in the
> context of kernel code.

Actually vector (Altivec) instructions write to CR6. Standard FPU
instructions optionally write to CR1, but the written value does not
exactly depend on the result of the last instruction; it is instead an
accrued exception status.

>
> For the purposes of cr0, the same scheme could certainly work, although
> the hook would not insert a hard register use, but rather a pseudo to
> be allocated to cr0 (constaint "x").

Yes, but we might also want to leave the choice of a cr register to the compiler.

>
> That said, it's my understanding that "dot insns", setting cr0 are
> expensive in current processor generations.

Not that much if I understand properly power7.md and power8.md:
no (P7) or one (P8) additional clock for common instructions
(add/sub and logical), but nothing else, so they are likely a win.

Shift/rotate/sign extensions seem to have more decoding restrictions:
the recording ("dot") forms are "cracked" and use 2 integer units.

> There's also a lot less
> of the x86-style "operate and set a flag based on something useful".
>

But there is at least an important one, which I occasionally wished I had:
the conditional stores.

The overflow bit might also be useful, not really
for the kernel, but for applications (and mfxer is slow).

Regards,
Gabriel

2015-05-05 16:08:52

by Segher Boessenkool

Subject: Re: [RFC] Design for flag bit outputs from asms

On Mon, May 04, 2015 at 12:33:38PM -0700, Richard Henderson wrote:
> (1) Each target defines a set of constraint strings,

> (2) A new target hook post-processes the asm_insn, looking for the
> new constraint strings. The hook expands the condition prescribed
> by the string, adjusting the asm_insn as required.

Since it is pre-processed, there is no real reason to overlap this with
the constraints namespace; we could have e.g. "=@[xy]" (and "@[xy]" for
inputs) mean the target needs to do some "xy" transform here.
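
For readers coming to this thread later: GCC 6 and newer provide flag
outputs on x86 under a spelling outside the ordinary constraint letters,
"=@cc<cond>", and define __GCC_ASM_FLAG_OUTPUTS__ when the feature is
available. A minimal sketch, in which the helper name test_bit_flag and
the plain C fallback are illustrative assumptions:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: test bit nr of word with the x86 "bt" instruction. */
static bool test_bit_flag(uint32_t word, unsigned nr)
{
#if defined(__GCC_ASM_FLAG_OUTPUTS__) && \
    (defined(__x86_64__) || defined(__i386__))
    bool cf;

    /* "=@ccc" asks for the carry flag as an output; no setc is written
       and no "cc" clobber is listed. */
    __asm__ ("btl %2, %1" : "=@ccc" (cf) : "r" (word), "r" (nr));
    return cf;
#else
    /* Fallback without flag outputs; bt on a register also takes the
       bit offset modulo 32. */
    return (word >> (nr & 31)) & 1;
#endif
}
```

When the result is only branched on, the compiler is free to emit a jc
after the bt instead of materializing a boolean.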

> Note that the output constraints are adjusted to a single internal "=j_"
> which would match the flags register in any mode. We can collapse
> several output flags to a single set of the flags hard register.

Many targets would use an already existing constraint that describes the
flags. Targets that need a fixed register could just insert the hard
register here as far as I see? (I'm assuming this happens at expand time).

> (3) Note that ppc is both easier and more complicated.
>
> There we have 8 4-bit registers, although most of the integer
> non-comparisons only write to CR0. And the vector non-comparisons
> only write to CR1, though of course that's of less interest in the
> context of kernel code.
>
> For the purposes of cr0, the same scheme could certainly work, although
> the hook would not insert a hard register use, but rather a pseudo to
> be allocated to cr0 (constaint "x").

And "y" for "any CR field".

> That said, it's my understanding that "dot insns", setting cr0 are
> expensive in current processor generations.

They are not. (Cell BE is not "current" :-) )

PowerPC also has some other bits (the carry bit for example, CA) that
could be usefully exposed via this mechanism.

> Can anyone think of any drawbacks, pitfalls, or portability issues to less
> popular targets that I havn't considered?

I don't like co-opting the constraint names for this; other than that, it
looks quite good :-)

Segher

2015-05-05 16:10:45