2022-11-06 09:54:58

by Vincent MAILHOL

[permalink] [raw]
Subject: [PATCH v1 1/2] x86/asm/bitops: Replace __fls() by its generic builtin implementation

Below snippet:

#include <linux/bitops.h>

unsigned int foo(unsigned long word)
{
return __fls(word);
}

produces this on GCC 12.1.0:

0000000000000000 <foo>:
0: f3 0f 1e fa endbr64
4: e8 00 00 00 00 call 9 <foo+0x9>
9: 53 push %rbx
a: 48 89 fb mov %rdi,%rbx
d: e8 00 00 00 00 call 12 <foo+0x12>
12: 48 0f bd c3 bsr %rbx,%rax
16: 5b pop %rbx
17: 31 ff xor %edi,%edi
19: e9 00 00 00 00 jmp 1e <foo+0x1e>

and that on clang 14.0.6:

0000000000000000 <foo>:
0: f3 0f 1e fa endbr64
4: e8 00 00 00 00 call 9 <foo+0x9>
9: 53 push %rbx
a: 50 push %rax
b: 48 89 fb mov %rdi,%rbx
e: e8 00 00 00 00 call 13 <foo+0x13>
13: 48 89 1c 24 mov %rbx,(%rsp)
17: 48 0f bd 04 24 bsr (%rsp),%rax
1c: 48 83 c4 08 add $0x8,%rsp
20: 5b pop %rbx
21: c3 ret

The implementation from <asm-generic/bitops/builtin-__fls.h> [1]
produces the exact same code on GCC and below code on clang:

0000000000000000 <foo>:
0: f3 0f 1e fa endbr64
4: e8 00 00 00 00 call 9 <foo+0x9>
9: 53 push %rbx
a: 48 89 fb mov %rdi,%rbx
d: e8 00 00 00 00 call 12 <foo+0x12>
12: 48 0f bd c3 bsr %rbx,%rax
16: 5b pop %rbx
17: c3 ret

The builtin implementation is better for two reasons:

1/ it saves two instructions on clang (a push and a stack pointer
decrement) because of a useless tentative to save rax.

2/ when used on constant expressions, the compiler is only able to
fold the builtin version (c.f. [2]).

For those two reasons, replace the assembly implementation by its
builtin counterpart.

[1] https://elixir.bootlin.com/linux/v6.0/source/include/asm-generic/bitops/builtin-__fls.h

[2] commit 146034fed6ee ("x86/asm/bitops: Use __builtin_ffs() to evaluate constant expressions")

CC: Borislav Petkov <[email protected]>
CC: Nick Desaulniers <[email protected]>
CC: Yury Norov <[email protected]>
Signed-off-by: Vincent Mailhol <[email protected]>
---
arch/x86/include/asm/bitops.h | 14 +-------------
1 file changed, 1 insertion(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 2edf68475fec..a31453d7686d 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -285,19 +285,7 @@ static __always_inline unsigned long variable_ffz(unsigned long word)
(unsigned long)__builtin_ctzl(~word) : \
variable_ffz(word))

-/*
- * __fls: find last set bit in word
- * @word: The word to search
- *
- * Undefined if no set bit exists, so code should check against 0 first.
- */
-static __always_inline unsigned long __fls(unsigned long word)
-{
- asm("bsr %1,%0"
- : "=r" (word)
- : "rm" (word));
- return word;
-}
+#include <asm-generic/bitops/builtin-__fls.h>

#undef ADDR

--
2.37.4



2022-11-07 10:58:29

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v1 1/2] x86/asm/bitops: Replace __fls() by its generic builtin implementation

On Sun, Nov 06, 2022 at 06:51:05PM +0900, Vincent Mailhol wrote:
> The builtin implementation is better for two reasons:
>
> 1/ it saves two instructions on clang (a push and a stack pointer
> decrement) because of a useless tentative to save rax.

I'm thinking this is the same old clang-sucks-at-"rm" constraints and
*really* should not be a reason to change things. Clang should get fixed
already.

> 2/ when used on constant expressions, the compiler is only able to
> fold the builtin version (c.f. [2]).
>
> For those two reasons, replace the assembly implementation by its
> builtin counterpart.
>
> [1] https://elixir.bootlin.com/linux/v6.0/source/include/asm-generic/bitops/builtin-__fls.h
>
> [2] commit 146034fed6ee ("x86/asm/bitops: Use __builtin_ffs() to evaluate constant expressions")

I would much prefer consistently with 146034fed6ee.

2022-11-07 12:51:26

by Vincent MAILHOL

[permalink] [raw]
Subject: Re: [PATCH v1 1/2] x86/asm/bitops: Replace __fls() by its generic builtin implementation

On Mon. 7 Nov. 2022 at 18:38, Peter Zijlstra <[email protected]> wrote:
> On Sun, Nov 06, 2022 at 06:51:05PM +0900, Vincent Mailhol wrote:
> > The builtin implementation is better for two reasons:
> >
> > 1/ it saves two instructions on clang (a push and a stack pointer
> > decrement) because of a useless tentative to save rax.
>
> I'm thinking this is the same old clang-sucks-at-"rm" constraints and
> *really* should not be a reason to change things. Clang should get fixed
> already.
>
> > 2/ when used on constant expressions, the compiler is only able to
> > fold the builtin version (c.f. [2]).
> >
> > For those two reasons, replace the assembly implementation by its
> > builtin counterpart.
> >
> > [1] https://elixir.bootlin.com/linux/v6.0/source/include/asm-generic/bitops/builtin-__fls.h
> >
> > [2] commit 146034fed6ee ("x86/asm/bitops: Use __builtin_ffs() to evaluate constant expressions")
>
> I would much prefer consistently with 146034fed6ee.

There is one big difference between 146034fed6ee and this patch. For
the ffs() functions, the x86 asm produces *better* code so there is a
reason to keep the x86 asm.
The clang missed optimization is not the core reason for me to send
this patch. The purpose of the x86 asm code is to be more performant
than the generic implementation, isn't it? Let's imagine for a moment
that the x86 asm and the builtin produced exactly the same output,
then what would be the reason for keeping the x86 asm version?

My point is not that x86 asm is worse, but that x86 asm isn't better.
The clang missed optimization is one additional reason for this patch,
not the main one.

Yours sincerely,
Vincent Mailhol

2022-11-10 19:21:29

by Nick Desaulniers

[permalink] [raw]
Subject: Re: [PATCH v1 1/2] x86/asm/bitops: Replace __fls() by its generic builtin implementation

On Sun, Nov 6, 2022 at 1:51 AM Vincent Mailhol
<[email protected]> wrote:
>
> Below snippet:
>
> #include <linux/bitops.h>
>
> unsigned int foo(unsigned long word)
> {
> return __fls(word);
> }
>
> produces this on GCC 12.1.0:
>
> 0000000000000000 <foo>:
> 0: f3 0f 1e fa endbr64
> 4: e8 00 00 00 00 call 9 <foo+0x9>
> 9: 53 push %rbx
> a: 48 89 fb mov %rdi,%rbx
> d: e8 00 00 00 00 call 12 <foo+0x12>
> 12: 48 0f bd c3 bsr %rbx,%rax
> 16: 5b pop %rbx
> 17: 31 ff xor %edi,%edi
> 19: e9 00 00 00 00 jmp 1e <foo+0x1e>
>
> and that on clang 14.0.6:
>
> 0000000000000000 <foo>:
> 0: f3 0f 1e fa endbr64
> 4: e8 00 00 00 00 call 9 <foo+0x9>
> 9: 53 push %rbx
> a: 50 push %rax
> b: 48 89 fb mov %rdi,%rbx
> e: e8 00 00 00 00 call 13 <foo+0x13>
> 13: 48 89 1c 24 mov %rbx,(%rsp)
> 17: 48 0f bd 04 24 bsr (%rsp),%rax
> 1c: 48 83 c4 08 add $0x8,%rsp
> 20: 5b pop %rbx
> 21: c3 ret
>
> The implementation from <asm-generic/bitops/builtin-__fls.h> [1]
> produces the exact same code on GCC and below code on clang:
>
> 0000000000000000 <foo>:
> 0: f3 0f 1e fa endbr64
> 4: e8 00 00 00 00 call 9 <foo+0x9>
> 9: 53 push %rbx
> a: 48 89 fb mov %rdi,%rbx
> d: e8 00 00 00 00 call 12 <foo+0x12>
> 12: 48 0f bd c3 bsr %rbx,%rax
> 16: 5b pop %rbx
> 17: c3 ret
>
> The builtin implementation is better for two reasons:
>
> 1/ it saves two instructions on clang (a push and a stack pointer
> decrement) because of a useless tentative to save rax.
>
> 2/ when used on constant expressions, the compiler is only able to
> fold the builtin version (c.f. [2]).
>
> For those two reasons, replace the assembly implementation by its
> builtin counterpart.
>
> [1] https://elixir.bootlin.com/linux/v6.0/source/include/asm-generic/bitops/builtin-__fls.h
>
> [2] commit 146034fed6ee ("x86/asm/bitops: Use __builtin_ffs() to evaluate constant expressions")
>
> CC: Borislav Petkov <[email protected]>
> CC: Nick Desaulniers <[email protected]>
> CC: Yury Norov <[email protected]>
> Signed-off-by: Vincent Mailhol <[email protected]>

LGTM; thanks for the patch!
Reviewed-by: Nick Desaulniers <[email protected]>

> ---
> arch/x86/include/asm/bitops.h | 14 +-------------
> 1 file changed, 1 insertion(+), 13 deletions(-)
>
> diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
> index 2edf68475fec..a31453d7686d 100644
> --- a/arch/x86/include/asm/bitops.h
> +++ b/arch/x86/include/asm/bitops.h
> @@ -285,19 +285,7 @@ static __always_inline unsigned long variable_ffz(unsigned long word)
> (unsigned long)__builtin_ctzl(~word) : \
> variable_ffz(word))
>
> -/*
> - * __fls: find last set bit in word
> - * @word: The word to search
> - *
> - * Undefined if no set bit exists, so code should check against 0 first.
> - */
> -static __always_inline unsigned long __fls(unsigned long word)
> -{
> - asm("bsr %1,%0"
> - : "=r" (word)
> - : "rm" (word));
> - return word;
> -}
> +#include <asm-generic/bitops/builtin-__fls.h>
>
> #undef ADDR
>
> --
> 2.37.4
>


--
Thanks,
~Nick Desaulniers

2022-11-10 19:30:16

by Nick Desaulniers

[permalink] [raw]
Subject: Re: [PATCH v1 1/2] x86/asm/bitops: Replace __fls() by its generic builtin implementation

On Mon, Nov 7, 2022 at 1:39 AM Peter Zijlstra <[email protected]> wrote:
>
> On Sun, Nov 06, 2022 at 06:51:05PM +0900, Vincent Mailhol wrote:
> > The builtin implementation is better for two reasons:
> >
> > 1/ it saves two instructions on clang (a push and a stack pointer
> > decrement) because of a useless tentative to save rax.
>
> I'm thinking this is the same old clang-sucks-at-"rm" constraints and
> *really* should not be a reason to change things. Clang should get fixed
> already.

Well messing up constant folding for all compilers absolutely should
be a reason!

I did get a chance to speak with some colleagues more about this at
the LLVM developer meeting during the past 2 days. We have some ideas
on approaches that might work. There's some higher priority features
we're working on first, but I suspect we'll be able to visit that
issue soon. It's a pretty tricky dance between instruction selection
and register allocation.

>
> > 2/ when used on constant expressions, the compiler is only able to
> > fold the builtin version (c.f. [2]).
> >
> > For those two reasons, replace the assembly implementation by its
> > builtin counterpart.
> >
> > [1] https://elixir.bootlin.com/linux/v6.0/source/include/asm-generic/bitops/builtin-__fls.h
> >
> > [2] commit 146034fed6ee ("x86/asm/bitops: Use __builtin_ffs() to evaluate constant expressions")
>
> I would much prefer consistently with 146034fed6ee.

The bottom of this file arch/x86/include/asm/bitops.h is full of
#include <asm-generic/bitops/*.h>

--
Thanks,
~Nick Desaulniers