The compilers provide builtin functions equivalent to the ffs(),
__ffs() and ffz() functions of the kernel. The kernel uses optimized
assembly which produces better code than those builtin
functions. However, such assembly code cannot be optimized when used
on constant expressions.
This series relies on __builtin_constant_p to select the optimal
solution (illustrated by the sketch right after this list):
* use the kernel's assembly for non-constant expressions
* use the compiler's __builtin functions for constant expressions.
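For illustration, the dispatch for ffs() in patch 1 boils down to the
one-liner below, variable_ffs() being the renamed asm helper:
| /* compile-time dispatch between the compiler builtin and the asm helper */
| #define ffs(x) (__builtin_constant_p(x) ? __builtin_ffs(x) : variable_ffs(x))
When the argument is a constant expression, __builtin_constant_p(x)
folds to 1 and __builtin_ffs(x) is evaluated at compile time;
otherwise, the asm helper is called.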
I also think that fls() and fls64() can be optimized in a similar
way, using __builtin_clz() and __builtin_clzll(), but it is a bit
less trivial so I want to focus on this series first. If it gets
accepted, I will then work on those two additional functions.
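For reference, a hypothetical constant-expression path for fls()
could look like the sketch below (not part of this series;
variable_fls() is a made-up name for the asm helper). It also hints
at why it is less trivial: the builtin path needs an explicit zero
check and a subtraction from the word width:
| /* hypothetical sketch, not part of this series */
| #define fls(x) \
| 	(__builtin_constant_p(x) ? \
| 	 ((x) ? 32 - __builtin_clz(x) : 0) : \
| 	 variable_fls(x))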
** Statistics **
Patch 1/2 optimizes 26.7% of ffs() calls and patch 2/2 optimizes 27.9%
of __ffs() and ffz() calls (details of the calculation in each patch).
** Changelog **
v2 -> v3:
* Redacted the instructions after ret and before the next function
in the assembly output.
* Added a note and a link to Nick's message on the constant
propagation missed-optimization in clang:
https://lore.kernel.org/all/CAKwvOdnH_gYv4qRN9pKY7jNTQK95xNeH1w1KZJJmvCkh8xJLBg@mail.gmail.com/
* Fix copy/paste typo in the statistics of patch 1. The number of
occurrences before the patches is 1081 and not 3607 (the percentage
reduction of 26.7% remains correct).
* Rename the functions as follows:
- __variable_ffs() -> variable___ffs()
- __variable_ffz() -> variable_ffz()
v1 -> v2:
* Use the ORC unwinder for the produced assembly code in patch 1.
* Rename the functions as follows:
- __ffs_asm() -> variable_ffs()
- __ffs_asm_not_zero() -> __variable_ffs()
- ffz_asm() -> __variable_ffz()
* Fit #define ffs(x) in a single line.
* Correct the statistics for ffs() in patch 1 and add the statistics
for __ffs() and ffz() in patch 2.
Vincent Mailhol (2):
x86/asm/bitops: ffs: use __builtin_ffs to evaluate constant
expressions
x86/asm/bitops: __ffs,ffz: use __builtin_ctzl to evaluate constant
expressions
arch/x86/include/asm/bitops.h | 64 +++++++++++++++++++++--------------
1 file changed, 38 insertions(+), 26 deletions(-)
--
2.35.1
For x86_64, the current ffs() implementation does not produce
optimized code when called with a constant expression. In contrast,
the __builtin_ffs() function of both GCC and clang is able to
simplify the expression into a single instruction.
* Example *
Let's consider two dummy functions foo() and bar() as below:
| #include <linux/bitops.h>
| #define CONST 0x01000000
|
| unsigned int foo(void)
| {
| return ffs(CONST);
| }
|
| unsigned int bar(void)
| {
| return __builtin_ffs(CONST);
| }
GCC produces the below assembly code:
| 0000000000000000 <foo>:
| 0: ba 00 00 00 01 mov $0x1000000,%edx
| 5: b8 ff ff ff ff mov $0xffffffff,%eax
| a: 0f bc c2 bsf %edx,%eax
| d: 83 c0 01 add $0x1,%eax
| 10: c3 ret
<Instructions after ret and before next function were redacted>
|
| 0000000000000020 <bar>:
| 20: b8 19 00 00 00 mov $0x19,%eax
| 25: c3 ret
And clang produces:
| 0000000000000000 <foo>:
| 0: b8 ff ff ff ff mov $0xffffffff,%eax
| 5: 0f bc 05 00 00 00 00 bsf 0x0(%rip),%eax # c <foo+0xc>
| c: 83 c0 01 add $0x1,%eax
| f: c3 ret
|
| 0000000000000010 <bar>:
| 10: b8 19 00 00 00 mov $0x19,%eax
| 15: c3 ret
For both examples, we clearly see the benefit of using __builtin_ffs()
instead of the kernel's asm implementation for constant
expressions.
However, for non-constant expressions, the kernel's asm version of
ffs() remains better for x86_64 because, contrary to GCC, it does not
emit the CMOV instruction, cf. [1] (notably, clang is able to
optimize out the CMOV).
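For reference, below is a simplified sketch of the trick used by the
kernel's asm on x86_64 (adapted, not the verbatim kernel code): the
destination register is pre-loaded with -1, relying on the de-facto
behaviour that BSF leaves its destination unmodified when the source
is zero, so no CMOV is needed:
| static __always_inline int ffs_sketch(int x)
| {
| 	int r;
|
| 	asm("bsfl %1,%0"
| 	    : "=r" (r)
| 	    : "rm" (x), "0" (-1));	/* "0" (-1): r starts at -1 */
| 	return r + 1;			/* yields 0 when x == 0 */
| }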
This patch uses __builtin_constant_p() to select between the
kernel's ffs() and __builtin_ffs() depending on whether the argument
is a constant expression or not.
As a side benefit, this patch also removes the below -Wshadow warning:
| ./arch/x86/include/asm/bitops.h:283:28: warning: declaration of 'ffs' shadows a built-in function [-Wshadow]
| 283 | static __always_inline int ffs(int x)
And finally, Nick Desaulniers pointed out in [2] that this also fixes
a constant propagation missed-optimization in clang.
** Statistics **
On an allyesconfig, before applying this patch...:
| $ objdump -d vmlinux.o | grep bsf | wc -l
| 1081
...and after:
| $ objdump -d vmlinux.o | grep bsf | wc -l
| 792
So, roughly 26.7% of the calls to ffs() were using constant
expressions and could be optimized out.
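For clarity, the percentage derives from the two counts above:
| (1081 - 792) / 1081 = 289 / 1081 ~ 26.7%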
(tests done on linux v5.18-rc5 x86_64 using GCC 11.2.1)
[1] commit ca3d30cc02f7 ("x86_64, asm: Optimise fls(), ffs() and fls64()")
http://lkml.kernel.org/r/[email protected]
[2] https://lore.kernel.org/all/CAKwvOdnH_gYv4qRN9pKY7jNTQK95xNeH1w1KZJJmvCkh8xJLBg@mail.gmail.com/
Reviewed-by: Nick Desaulniers <[email protected]>
Signed-off-by: Vincent Mailhol <[email protected]>
---
arch/x86/include/asm/bitops.h | 26 ++++++++++++++------------
1 file changed, 14 insertions(+), 12 deletions(-)
diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index a288ecd230ab..6ed979547086 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -269,18 +269,7 @@ static __always_inline unsigned long __fls(unsigned long word)
#undef ADDR
#ifdef __KERNEL__
-/**
- * ffs - find first set bit in word
- * @x: the word to search
- *
- * This is defined the same way as the libc and compiler builtin ffs
- * routines, therefore differs in spirit from the other bitops.
- *
- * ffs(value) returns 0 if value is 0 or the position of the first
- * set bit if value is nonzero. The first (least significant) bit
- * is at position 1.
- */
-static __always_inline int ffs(int x)
+static __always_inline int variable_ffs(int x)
{
int r;
@@ -310,6 +299,19 @@ static __always_inline int ffs(int x)
return r + 1;
}
+/**
+ * ffs - find first set bit in word
+ * @x: the word to search
+ *
+ * This is defined the same way as the libc and compiler builtin ffs
+ * routines, therefore differs in spirit from the other bitops.
+ *
+ * ffs(value) returns 0 if value is 0 or the position of the first
+ * set bit if value is nonzero. The first (least significant) bit
+ * is at position 1.
+ */
+#define ffs(x) (__builtin_constant_p(x) ? __builtin_ffs(x) : variable_ffs(x))
+
/**
* fls - find last set bit in word
* @x: the word to search
--
2.35.1
__ffs(x) is equivalent to (unsigned long)__builtin_ctzl(x) and ffz(x)
is equivalent to (unsigned long)__builtin_ctzl(~x). Because
__builtin_ctzl() returns an int, a cast to (unsigned long) is
necessary to avoid potential warnings on implicit casts.
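To illustrate the equivalences, with hypothetical example values:
| __ffs(0x10) == __builtin_ctzl(0x10)  == 4	/* bit 4 is the first set bit  */
| ffz(0x0f)   == __builtin_ctzl(~0x0f) == 4	/* bit 4 is the first zero bit */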
For x86_64, the current __ffs() and ffz() implementations do not
produce optimized code when called with a constant expression. In
contrast, __builtin_ctzl() gets simplified into a single
instruction.
However, for non-constant expressions, the kernel's __ffs() and ffz()
asm versions remain slightly better than the code produced by GCC
(GCC emits a useless extra instruction to clear eax).
This patch uses __builtin_constant_p() to select between the
kernel's __ffs()/ffz() and __builtin_ctzl() depending on whether the
argument is a constant expression or not.
** Statistics **
On an allyesconfig, before applying this patch...:
| $ objdump -d vmlinux.o | grep tzcnt | wc -l
| 3607
...and after:
| $ objdump -d vmlinux.o | grep tzcnt | wc -l
| 2600
So, roughly 27.9% of the calls to either __ffs() or ffz() were using
constant expressions and could be optimized out.
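For clarity, the percentage derives from the two counts above:
| (3607 - 2600) / 3607 = 1007 / 3607 ~ 27.9%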
(tests done on linux v5.18-rc5 x86_64 using GCC 11.2.1)
Note: on x86_64, the asm bsf instruction produces tzcnt when used
with the rep prefix (which is why we grep for tzcnt instead of bsf in
the above benchmark), cf. [1].
[1] commit e26a44a2d618 ("x86: Use REP BSF unconditionally")
http://lkml.kernel.org/r/[email protected]
CC: Nick Desaulniers <[email protected]>
Signed-off-by: Vincent Mailhol <[email protected]>
---
arch/x86/include/asm/bitops.h | 38 ++++++++++++++++++++++-------------
1 file changed, 24 insertions(+), 14 deletions(-)
diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 6ed979547086..fb0d7cd9f957 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -224,13 +224,7 @@ static __always_inline bool variable_test_bit(long nr, volatile const unsigned l
? constant_test_bit((nr), (addr)) \
: variable_test_bit((nr), (addr)))
-/**
- * __ffs - find first set bit in word
- * @word: The word to search
- *
- * Undefined if no bit exists, so code should check against 0 first.
- */
-static __always_inline unsigned long __ffs(unsigned long word)
+static __always_inline unsigned long variable___ffs(unsigned long word)
{
asm("rep; bsf %1,%0"
: "=r" (word)
@@ -238,13 +232,18 @@ static __always_inline unsigned long __ffs(unsigned long word)
return word;
}
-/**
- * ffz - find first zero bit in word
- * @word: The word to search
- *
- * Undefined if no zero exists, so code should check against ~0UL first.
- */
-static __always_inline unsigned long ffz(unsigned long word)
+/**
+ * __ffs - find first set bit in word
+ * @word: The word to search
+ *
+ * Undefined if no bit exists, so code should check against 0 first.
+ */
+#define __ffs(word) \
+ (__builtin_constant_p(word) ? \
+ (unsigned long)__builtin_ctzl(word) : \
+ variable___ffs(word))
+
+static __always_inline unsigned long variable_ffz(unsigned long word)
{
asm("rep; bsf %1,%0"
: "=r" (word)
@@ -252,6 +251,17 @@ static __always_inline unsigned long ffz(unsigned long word)
return word;
}
+/**
+ * ffz - find first zero bit in word
+ * @word: The word to search
+ *
+ * Undefined if no zero exists, so code should check against ~0UL first.
+ */
+#define ffz(word) \
+ (__builtin_constant_p(word) ? \
+ (unsigned long)__builtin_ctzl(~word) : \
+ variable_ffz(word))
+
/*
* __fls: find last set bit in word
* @word: The word to search
--
2.35.1