2022-05-12 15:21:53

by Vincent MAILHOL

[permalink] [raw]
Subject: [PATCH v4 1/2] x86/asm/bitops: ffs: use __builtin_ffs to evaluate constant expressions

For x86_64, the current ffs() implementation does not produce
optimized code when called with a constant expression. On the
contrary, the __builtin_ffs() function of both GCC and clang is able
to simplify the expression into a single instruction.

* Example *

Let's consider two dummy functions foo() and bar() as below:

| #include <linux/bitops.h>
| #define CONST 0x01000000
|
| unsigned int foo(void)
| {
| return ffs(CONST);
| }
|
| unsigned int bar(void)
| {
| return __builtin_ffs(CONST);
| }

GCC would produce below assembly code:

| 0000000000000000 <foo>:
| 0: ba 00 00 00 01 mov $0x1000000,%edx
| 5: b8 ff ff ff ff mov $0xffffffff,%eax
| a: 0f bc c2 bsf %edx,%eax
| d: 83 c0 01 add $0x1,%eax
| 10: c3 ret
<Instructions after ret and before next function were redacted>
|
| 0000000000000020 <bar>:
| 20: b8 19 00 00 00 mov $0x19,%eax
| 25: c3 ret

And clang would produce:

| 0000000000000000 <foo>:
| 0: b8 ff ff ff ff mov $0xffffffff,%eax
| 5: 0f bc 05 00 00 00 00 bsf 0x0(%rip),%eax # c <foo+0xc>
| c: 83 c0 01 add $0x1,%eax
| f: c3 ret
|
| 0000000000000010 <bar>:
| 10: b8 19 00 00 00 mov $0x19,%eax
| 15: c3 ret

For both example, we clearly see the benefit of using __builtin_ffs()
instead of the kernel's asm implementation for constant
expressions.

However, for non constant expressions, the ffs() asm version of the
kernel remains better for x86_64 because, contrary to GCC, it doesn't
emit the CMOV assembly instruction, c.f. [1] (noticeably, clang is
able optimize out the CMOV call).

This patch uses the __builtin_constant_p() to select between the
kernel's ffs() and the __builtin_ffs() depending on whether the
argument is constant or not.

As a side benefit, this patch also removes below -Wshadow warning:

| ./arch/x86/include/asm/bitops.h:283:28: warning: declaration of 'ffs' shadows a built-in function [-Wshadow]
| 283 | static __always_inline int ffs(int x)

** Statistics **

On a allyesconfig, before applying this patch...:

| $ objdump -d vmlinux.o | grep bsf | wc -l
| 1081

...and after:

| $ objdump -d vmlinux.o | grep bsf | wc -l
| 792

So, roughly 26.7% of the calls to ffs() were using constant
expressions and could be optimized out.

(tests done on linux v5.18-rc5 x86_64 using GCC 11.2.1)

[1] commit ca3d30cc02f7 ("x86_64, asm: Optimise fls(), ffs() and fls64()")
http://lkml.kernel.org/r/[email protected]

Reviewed-by: Nick Desaulniers <[email protected]>
Signed-off-by: Vincent Mailhol <[email protected]>
---
arch/x86/include/asm/bitops.h | 26 ++++++++++++++------------
1 file changed, 14 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index a288ecd230ab..6ed979547086 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -269,18 +269,7 @@ static __always_inline unsigned long __fls(unsigned long word)
#undef ADDR

#ifdef __KERNEL__
-/**
- * ffs - find first set bit in word
- * @x: the word to search
- *
- * This is defined the same way as the libc and compiler builtin ffs
- * routines, therefore differs in spirit from the other bitops.
- *
- * ffs(value) returns 0 if value is 0 or the position of the first
- * set bit if value is nonzero. The first (least significant) bit
- * is at position 1.
- */
-static __always_inline int ffs(int x)
+static __always_inline int variable_ffs(int x)
{
int r;

@@ -310,6 +299,19 @@ static __always_inline int ffs(int x)
return r + 1;
}

+/**
+ * ffs - find first set bit in word
+ * @x: the word to search
+ *
+ * This is defined the same way as the libc and compiler builtin ffs
+ * routines, therefore differs in spirit from the other bitops.
+ *
+ * ffs(value) returns 0 if value is 0 or the position of the first
+ * set bit if value is nonzero. The first (least significant) bit
+ * is at position 1.
+ */
+#define ffs(x) (__builtin_constant_p(x) ? __builtin_ffs(x) : variable_ffs(x))
+
/**
* fls - find last set bit in word
* @x: the word to search
--
2.35.1



2022-05-14 02:43:46

by Joe Perches

[permalink] [raw]
Subject: Re: [PATCH v4 1/2] x86/asm/bitops: ffs: use __builtin_ffs to evaluate constant expressions

On Thu, 2022-05-12 at 10:18 +0900, Vincent Mailhol wrote:
> For x86_64, the current ffs() implementation does not produce
> optimized code when called with a constant expression. On the
> contrary, the __builtin_ffs() function of both GCC and clang is able
> to simplify the expression into a single instruction.
[]
> -static __always_inline int ffs(int x)
> +static __always_inline int variable_ffs(int x)
> {
> int r;
>
> @@ -310,6 +299,19 @@ static __always_inline int ffs(int x)
> return r + 1;
> }
>
> +/**
> + * ffs - find first set bit in word
> + * @x: the word to search
> + *
> + * This is defined the same way as the libc and compiler builtin ffs
> + * routines, therefore differs in spirit from the other bitops.
> + *
> + * ffs(value) returns 0 if value is 0 or the position of the first
> + * set bit if value is nonzero. The first (least significant) bit
> + * is at position 1.
> + */
> +#define ffs(x) (__builtin_constant_p(x) ? __builtin_ffs(x) : variable_ffs(x))

How about not defining another function and using parentheses around
the function definition to avoid the macro expansion like:

#define ffs(x) (__builtin_constant_p(x) ? __builtin_ffs(x) : ffs(x))

and

static __always_inline int (ffs)(int x)
{
etc...
}



2022-05-14 03:40:46

by Vincent MAILHOL

[permalink] [raw]
Subject: Re: [PATCH v4 1/2] x86/asm/bitops: ffs: use __builtin_ffs to evaluate constant expressions

On Thu. 12 May 2022 at 12:02, Joe Perches <[email protected]> wrote:
> On Thu, 2022-05-12 at 10:18 +0900, Vincent Mailhol wrote:
> > For x86_64, the current ffs() implementation does not produce
> > optimized code when called with a constant expression. On the
> > contrary, the __builtin_ffs() function of both GCC and clang is able
> > to simplify the expression into a single instruction.
> []
> > -static __always_inline int ffs(int x)
> > +static __always_inline int variable_ffs(int x)
> > {
> > int r;
> >
> > @@ -310,6 +299,19 @@ static __always_inline int ffs(int x)
> > return r + 1;
> > }
> >
> > +/**
> > + * ffs - find first set bit in word
> > + * @x: the word to search
> > + *
> > + * This is defined the same way as the libc and compiler builtin ffs
> > + * routines, therefore differs in spirit from the other bitops.
> > + *
> > + * ffs(value) returns 0 if value is 0 or the position of the first
> > + * set bit if value is nonzero. The first (least significant) bit
> > + * is at position 1.
> > + */
> > +#define ffs(x) (__builtin_constant_p(x) ? __builtin_ffs(x) : variable_ffs(x))
>
> How about not defining another function and using parentheses around
> the function definition to avoid the macro expansion like:
>
> #define ffs(x) (__builtin_constant_p(x) ? __builtin_ffs(x) : ffs(x))
>
> and
>
> static __always_inline int (ffs)(int x)
> {
> etc...
> }

Sorry, but I don’t really like this approach.

Main issue I see is that this code will emit a -Wshadow warning.

And using parentheses around the function definition just seems an
obscure hack to me. The variable_foo() gives me less headache. Was
this pattern ever used anywhere else in the kernel?


Yours sincerely,
Vincent Mailhol