For x86_64, the current ffs() implementation does not produce
optimized code when called with a constant expression. On the
contrary, the __builtin_ffs() function of both GCC and clang is able
to simplify the expression into a single instruction.
* Example *
Let's consider two dummy functions foo() and bar() as below:
| #include <linux/bitops.h>
| #define CONST 0x01000000
|
| unsigned int foo(void)
| {
| return ffs(CONST);
| }
|
| unsigned int bar(void)
| {
| return __builtin_ffs(CONST);
| }
GCC would produce below assembly code:
| 0000000000000000 <foo>:
| 0: ba 00 00 00 01 mov $0x1000000,%edx
| 5: b8 ff ff ff ff mov $0xffffffff,%eax
| a: 0f bc c2 bsf %edx,%eax
| d: 83 c0 01 add $0x1,%eax
| 10: c3 ret
| 11: 66 66 2e 0f 1f 84 00 data16 cs nopw 0x0(%rax,%rax,1)
| 18: 00 00 00 00
| 1c: 0f 1f 40 00 nopl 0x0(%rax)
|
| 0000000000000020 <bar>:
| 20: b8 19 00 00 00 mov $0x19,%eax
| 25: c3 ret
And clang would produce:
| 0000000000000000 <foo>:
| 0: b8 ff ff ff ff mov $0xffffffff,%eax
| 5: 0f bc 05 00 00 00 00 bsf 0x0(%rip),%eax # c <foo+0xc>
| c: 83 c0 01 add $0x1,%eax
| f: c3 ret
|
| 0000000000000010 <bar>:
| 10: b8 19 00 00 00 mov $0x19,%eax
| 15: c3 ret
For both example, we clearly see the benefit of using __builtin_ffs()
instead of the kernel's asm implementation for constant
expressions.
However, for non constant expressions, the ffs() asm version of the
kernel remains better for x86_64 because, contrary to GCC, it doesn't
emit the CMOV assembly instruction, c.f. [1] (noticeably, clang is
able optimize out the CMOV call).
This patch uses the __builtin_constant_p() to select between the
kernel's ffs() and the __builtin_ffs() depending on whether the
argument is constant or not.
As a side benefit, this patch also removes below -Wshadow warning:
| ./arch/x86/include/asm/bitops.h:283:28: warning: declaration of 'ffs' shadows a built-in function [-Wshadow]
| 283 | static __always_inline int ffs(int x)
** Statistics **
On a allyesconfig, before applying this patch...:
| $ objdump -d vmlinux.o | grep bsf | wc -l
| 3607
...and after:
| $ objdump -d vmlinux.o | grep bsf | wc -l
| 792
So, roughly 26.7% of the call to ffs() were using constant expression
and were optimized out.
(tests done on linux v5.18-rc5 x86_64 using GCC 11.2.1)
[1] commit ca3d30cc02f7 ("x86_64, asm: Optimise fls(), ffs() and fls64()")
http://lkml.kernel.org/r/[email protected]
CC: Nick Desaulniers <[email protected]>
Signed-off-by: Vincent Mailhol <[email protected]>
---
arch/x86/include/asm/bitops.h | 26 ++++++++++++++------------
1 file changed, 14 insertions(+), 12 deletions(-)
diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index a288ecd230ab..6ed979547086 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -269,18 +269,7 @@ static __always_inline unsigned long __fls(unsigned long word)
#undef ADDR
#ifdef __KERNEL__
-/**
- * ffs - find first set bit in word
- * @x: the word to search
- *
- * This is defined the same way as the libc and compiler builtin ffs
- * routines, therefore differs in spirit from the other bitops.
- *
- * ffs(value) returns 0 if value is 0 or the position of the first
- * set bit if value is nonzero. The first (least significant) bit
- * is at position 1.
- */
-static __always_inline int ffs(int x)
+static __always_inline int variable_ffs(int x)
{
int r;
@@ -310,6 +299,19 @@ static __always_inline int ffs(int x)
return r + 1;
}
+/**
+ * ffs - find first set bit in word
+ * @x: the word to search
+ *
+ * This is defined the same way as the libc and compiler builtin ffs
+ * routines, therefore differs in spirit from the other bitops.
+ *
+ * ffs(value) returns 0 if value is 0 or the position of the first
+ * set bit if value is nonzero. The first (least significant) bit
+ * is at position 1.
+ */
+#define ffs(x) (__builtin_constant_p(x) ? __builtin_ffs(x) : variable_ffs(x))
+
/**
* fls - find last set bit in word
* @x: the word to search
--
2.35.1
Le 11/05/2022 à 18:03, Vincent Mailhol a écrit :
> For x86_64, the current ffs() implementation does not produce
> optimized code when called with a constant expression. On the
> contrary, the __builtin_ffs() function of both GCC and clang is able
> to simplify the expression into a single instruction.
>
[...]
>
> ** Statistics **
>
> On a allyesconfig, before applying this patch...:
>
> | $ objdump -d vmlinux.o | grep bsf | wc -l
> | 3607
>
> ...and after:
>
> | $ objdump -d vmlinux.o | grep bsf | wc -l
> | 792
>
> So, roughly 26.7% of the call to ffs() were using constant expression
> and were optimized out.
>
>
nitpicking: numbers look odd.
3607 is the exact same number as in patch 2/2. (ok, could be)
26.7% is surprising with these numbers. (I guess it is (total_before
- remaining) / total_before x 100 = (3607-792)/36.07 = 78.0%)
(but patch looks great to me :)
CJ
On Thu. 12 May 2022 at 06:35, Nick Desaulniers <[email protected]> wrote:
> On Wed, May 11, 2022 at 9:03 AM Vincent Mailhol
> <[email protected]> wrote:
> >
> > For x86_64, the current ffs() implementation does not produce
> > optimized code when called with a constant expression. On the
> > contrary, the __builtin_ffs() function of both GCC and clang is able
> > to simplify the expression into a single instruction.
> >
> > * Example *
> >
> > Let's consider two dummy functions foo() and bar() as below:
> >
> > | #include <linux/bitops.h>
> > | #define CONST 0x01000000
> > |
> > | unsigned int foo(void)
> > | {
> > | return ffs(CONST);
> > | }
> > |
> > | unsigned int bar(void)
> > | {
> > | return __builtin_ffs(CONST);
> > | }
> >
> > GCC would produce below assembly code:
>
> Thanks for the patch! LGTM.
> Reviewed-by: Nick Desaulniers <[email protected]>
>
> >
> > | 0000000000000000 <foo>:
> > | 0: ba 00 00 00 01 mov $0x1000000,%edx
> > | 5: b8 ff ff ff ff mov $0xffffffff,%eax
> > | a: 0f bc c2 bsf %edx,%eax
> > | d: 83 c0 01 add $0x1,%eax
> > | 10: c3 ret
>
> This should be the end of foo. I...actually don't know what's at the
> end here. But I don't think the region from here...
>
> > | 11: 66 66 2e 0f 1f 84 00 data16 cs nopw 0x0(%rax,%rax,1)
> > | 18: 00 00 00 00
> > | 1c: 0f 1f 40 00 nopl 0x0(%rax)
>
> ...to here is relevant.
I do not know either. I was hesitating to redact this part but finally
sent it to be verbatim.
I will redact this in v3.
> > |
> > | 0000000000000020 <bar>:
> > | 20: b8 19 00 00 00 mov $0x19,%eax
> > | 25: c3 ret
> >
> > And clang would produce:
> >
> > | 0000000000000000 <foo>:
> > | 0: b8 ff ff ff ff mov $0xffffffff,%eax
> > | 5: 0f bc 05 00 00 00 00 bsf 0x0(%rip),%eax # c <foo+0xc>
> > | c: 83 c0 01 add $0x1,%eax
> > | f: c3 ret
>
> Weird, so I just tried this:
> ```
> $ cat /tmp/x.c
> #define CONST 0x01000000
>
> unsigned ffs (int x) {
> int r;
> asm("bsfl %1,%0"
> : "=r" (r)
> : "rm" (x), "0" (-1));
> return r;
> }
>
> unsigned int foo(void) {
> return ffs(CONST);
> }
>
> unsigned int bar(void) {
> return __builtin_ffs(CONST);
> }
> $ clang /tmp/x.c -O2 -o /tmp/x.o -c && llvm-objdump -dr /tmp/x.o
> --disassemble-symbols=foo
> ...
> 0000000000000010 <foo>:
> 10: b8 19 00 00 00 movl $25, %eax
> 15: c3 retq
> 16: 66 2e 0f 1f 84 00 00 00 00 00 nopw %cs:(%rax,%rax)
> ```
> but if we make `ffs` `static`, we get:
> ```
> 0000000000000000 <foo>:
> 0: b8 ff ff ff ff movl $4294967295, %eax
> # imm = 0xFFFFFFFF
> 5: 0f bc 05 00 00 00 00 bsfl (%rip), %eax
> # 0xc <foo+0xc>
> 0000000000000008: R_X86_64_PC32 .LCPI0_0-0x4
> c: c3 retq
> d: 0f 1f 00 nopl (%rax)
> ```
> Which is very interesting to me; it looks like constant propagation
> actually hurt optimization, we lost that this was a libcall which we
> could have optimized.
>
> As in LLVM does:
> 1. sink CONST into ffs; it's static and has one caller
> 2. delete x parameter; it's unused
> 3. now libcall optimization just sees a call to ffs with no params,
> that doesn't match the signature of libc.
>
> Your change should fix that since we don't even call a function named
> ffs if we have a constant (explicitly, or via optimization). Filed
> https://github.com/llvm/llvm-project/issues/55394
Great! Didn't realize my patch had so many side benefits.
Will add a one sentence remark in v3 and point to your message.
Thanks!
On Wed, May 11, 2022 at 9:03 AM Vincent Mailhol
<[email protected]> wrote:
>
> For x86_64, the current ffs() implementation does not produce
> optimized code when called with a constant expression. On the
> contrary, the __builtin_ffs() function of both GCC and clang is able
> to simplify the expression into a single instruction.
>
> * Example *
>
> Let's consider two dummy functions foo() and bar() as below:
>
> | #include <linux/bitops.h>
> | #define CONST 0x01000000
> |
> | unsigned int foo(void)
> | {
> | return ffs(CONST);
> | }
> |
> | unsigned int bar(void)
> | {
> | return __builtin_ffs(CONST);
> | }
>
> GCC would produce below assembly code:
Thanks for the patch! LGTM.
Reviewed-by: Nick Desaulniers <[email protected]>
>
> | 0000000000000000 <foo>:
> | 0: ba 00 00 00 01 mov $0x1000000,%edx
> | 5: b8 ff ff ff ff mov $0xffffffff,%eax
> | a: 0f bc c2 bsf %edx,%eax
> | d: 83 c0 01 add $0x1,%eax
> | 10: c3 ret
This should be the end of foo. I...actually don't know what's at the
end here. But I don't think the region from here...
> | 11: 66 66 2e 0f 1f 84 00 data16 cs nopw 0x0(%rax,%rax,1)
> | 18: 00 00 00 00
> | 1c: 0f 1f 40 00 nopl 0x0(%rax)
...to here is relevant.
> |
> | 0000000000000020 <bar>:
> | 20: b8 19 00 00 00 mov $0x19,%eax
> | 25: c3 ret
>
> And clang would produce:
>
> | 0000000000000000 <foo>:
> | 0: b8 ff ff ff ff mov $0xffffffff,%eax
> | 5: 0f bc 05 00 00 00 00 bsf 0x0(%rip),%eax # c <foo+0xc>
> | c: 83 c0 01 add $0x1,%eax
> | f: c3 ret
Weird, so I just tried this:
```
$ cat /tmp/x.c
#define CONST 0x01000000
unsigned ffs (int x) {
int r;
asm("bsfl %1,%0"
: "=r" (r)
: "rm" (x), "0" (-1));
return r;
}
unsigned int foo(void) {
return ffs(CONST);
}
unsigned int bar(void) {
return __builtin_ffs(CONST);
}
$ clang /tmp/x.c -O2 -o /tmp/x.o -c && llvm-objdump -dr /tmp/x.o
--disassemble-symbols=foo
...
0000000000000010 <foo>:
10: b8 19 00 00 00 movl $25, %eax
15: c3 retq
16: 66 2e 0f 1f 84 00 00 00 00 00 nopw %cs:(%rax,%rax)
```
but if we make `ffs` `static`, we get:
```
0000000000000000 <foo>:
0: b8 ff ff ff ff movl $4294967295, %eax
# imm = 0xFFFFFFFF
5: 0f bc 05 00 00 00 00 bsfl (%rip), %eax
# 0xc <foo+0xc>
0000000000000008: R_X86_64_PC32 .LCPI0_0-0x4
c: c3 retq
d: 0f 1f 00 nopl (%rax)
```
Which is very interesting to me; it looks like constant propagation
actually hurt optimization, we lost that this was a libcall which we
could have optimized.
As in LLVM does:
1. sink CONST into ffs; it's static and has one caller
2. delete x parameter; it's unused
3. now libcall optimization just sees a call to ffs with no params,
that doesn't match the signature of libc.
Your change should fix that since we don't even call a function named
ffs if we have a constant (explicitly, or via optimization). Filed
https://github.com/llvm/llvm-project/issues/55394
--
Thanks,
~Nick Desaulniers