The compilers provide builtin expressions equivalent to the ffs(),
__ffs() and ffz() functions of the kernel. The kernel uses optimized
assembly which produces better code than the builtin
functions. However, such assembly code can not be optimized when used
on constant expressions.
This series relies on __builtin_constant_p to select the optimal solution:
* use kernel assembly for non constant expressions
* use compiler's __builtin function for constant expressions.
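For illustration, here is a minimal standalone sketch of the dispatch
pattern (the names my_ffs() and runtime_ffs() are placeholders, not
taken from the series):

  /* Sketch only: runtime_ffs() stands in for the kernel's optimized
   * asm implementation; in the real series it is an inline asm helper. */
  static inline int runtime_ffs(int x)
  {
          return __builtin_ffs(x);
  }

  #define my_ffs(x) \
          (__builtin_constant_p(x) ? __builtin_ffs(x) : runtime_ffs(x))

With this pattern, my_ffs(0x100) folds to the constant 9 at compile
time, while my_ffs(runtime_value) goes through runtime_ffs().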
I also think that fls() and fls64() can be optimized in a similar
way, using __builtin_clz() and __builtin_clzll(), but it is a bit less
trivial, so I want to focus on this series first. If it gets accepted, I
will then work on those two additional functions.
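As a rough sketch of that future work (an assumption, not part of this
series), fls() and fls64() could be built on the clz builtins along
these lines:

  /* Sketch only: fls() returns 0 for an input of 0, otherwise the
   * one-based position of the most significant set bit. */
  static inline int fls_sketch(unsigned int x)
  {
          return x ? 32 - __builtin_clz(x) : 0;
  }

  static inline int fls64_sketch(unsigned long long x)
  {
          return x ? 64 - __builtin_clzll(x) : 0;
  }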
** Statistics **
Patch 1/2 optimizes 26.7% of ffs() calls and patch 2/2 optimizes 27.9%
of __ffs() and ffz() calls (details of the calculation in each patch).
** Changelog **
v1 -> v2:
* Use the ORC unwinder for the produced assembly code in patch 1.
* Rename the functions as follows:
- __ffs_asm() -> variable_ffs()
- __ffs_asm_not_zero() -> __variable_ffs()
- ffz_asm() -> __variable_ffz()
* Fit #define ffs(x) in a single line.
* Correct the statistics for ffs() in patch 1 and add the statistics
for __ffs() and ffz() in patch 2.
Vincent Mailhol (2):
x86/asm/bitops: ffs: use __builtin_ffs to evaluate constant
expressions
x86/asm/bitops: __ffs,ffz: use __builtin_ctzl to evaluate constant
expressions
arch/x86/include/asm/bitops.h | 64 +++++++++++++++++++++--------------
1 file changed, 38 insertions(+), 26 deletions(-)
--
2.35.1
__ffs(x) is equivalent to (unsigned long)__builtin_ctzl(x) and ffz(x)
is equivalent to (unsigned long)__builtin_ctzl(~x). Because
__builtin_ctzl() returns an int, a cast to (unsigned long) is
necessary to avoid potential warnings on implicit casts.
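As a quick illustration of the equivalence and of the cast (a sketch,
with the values worked out by hand):

  unsigned long i = (unsigned long)__builtin_ctzl(0x10UL);  /* __ffs(0x10) == 4 */
  unsigned long j = (unsigned long)__builtin_ctzl(~0xffUL); /* ffz(0xff) == 8 */

Without the cast, assigning the int result to an unsigned long could
trigger implicit conversion warnings on some configurations.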
For x86_64, the current __ffs() and ffz() implementations do not
produce optimized code when called with a constant expression. On the
contrary, the __builtin_ctzl() gets simplified into a single
instruction.
However, for non constant expressions, the __ffs() and ffz() asm
versions of the kernel remain slightly better than the code produced
by GCC (GCC emits a useless extra instruction to clear eax).
This patch uses the __builtin_constant_p() to select between the
kernel's __ffs()/ffz() and the __builtin_ctzl() depending on whether
the argument is constant or not.
** Statistics **
On an allyesconfig, before applying this patch...:
| $ objdump -d vmlinux.o | grep tzcnt | wc -l
| 3607
...and after:
| $ objdump -d vmlinux.o | grep tzcnt | wc -l
| 2600
So, roughly 27.9% of the calls to either __ffs() or ffz() were using
constant expressions and could be optimized out.
(tests done on linux v5.18-rc5 x86_64 using GCC 11.2.1)
Note: on x86_64, the asm bsf instruction produces tzcnt when used with
the rep prefix (which is why we grep tzcnt instead of bsf in the above
benchmark). cf. [1]
[1] commit e26a44a2d618 ("x86: Use REP BSF unconditionally")
http://lkml.kernel.org/r/[email protected]
CC: Nick Desaulniers <[email protected]>
Signed-off-by: Vincent Mailhol <[email protected]>
---
arch/x86/include/asm/bitops.h | 38 ++++++++++++++++++++++-------------
1 file changed, 24 insertions(+), 14 deletions(-)
diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 6ed979547086..7cf5374ce403 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -224,13 +224,7 @@ static __always_inline bool variable_test_bit(long nr, volatile const unsigned l
? constant_test_bit((nr), (addr)) \
: variable_test_bit((nr), (addr)))
-/**
- * __ffs - find first set bit in word
- * @word: The word to search
- *
- * Undefined if no bit exists, so code should check against 0 first.
- */
-static __always_inline unsigned long __ffs(unsigned long word)
+static __always_inline unsigned long __variable_ffs(unsigned long word)
{
asm("rep; bsf %1,%0"
: "=r" (word)
@@ -238,13 +232,18 @@ static __always_inline unsigned long __ffs(unsigned long word)
return word;
}
-/**
- * ffz - find first zero bit in word
- * @word: The word to search
- *
- * Undefined if no zero exists, so code should check against ~0UL first.
- */
-static __always_inline unsigned long ffz(unsigned long word)
+/**
+ * __ffs - find first set bit in word
+ * @word: The word to search
+ *
+ * Undefined if no bit exists, so code should check against 0 first.
+ */
+#define __ffs(word) \
+ (__builtin_constant_p(word) ? \
+ (unsigned long)__builtin_ctzl(word) : \
+ __variable_ffs(word))
+
+static __always_inline unsigned long __variable_ffz(unsigned long word)
{
asm("rep; bsf %1,%0"
: "=r" (word)
@@ -252,6 +251,17 @@ static __always_inline unsigned long ffz(unsigned long word)
return word;
}
+/**
+ * ffz - find first zero bit in word
+ * @word: The word to search
+ *
+ * Undefined if no zero exists, so code should check against ~0UL first.
+ */
+#define ffz(word) \
+ (__builtin_constant_p(word) ? \
+ (unsigned long)__builtin_ctzl(~word) : \
+ __variable_ffz(word))
+
/*
* __fls: find last set bit in word
* @word: The word to search
--
2.35.1
On Wed, May 11, 2022 at 9:04 AM Vincent Mailhol
<[email protected]> wrote:
>
> __ffs(x) is equivalent to (unsigned long)__builtin_ctzl(x) and ffz(x)
> is equivalent to (unsigned long)__builtin_ctzl(~x). Because
> __builtin_ctzl() returns an int, a cast to (unsigned long) is
> necessary to avoid potential warnings on implicit casts.
>
> For x86_64, the current __ffs() and ffz() implementations do not
> produce optimized code when called with a constant expression. On the
> contrary, the __builtin_ctzl() gets simplified into a single
> instruction.
>
> However, for non constant expressions, the __ffs() and ffz() asm
> versions of the kernel remain slightly better than the code produced
> by GCC (GCC emits a useless extra instruction to clear eax).
>
> This patch uses the __builtin_constant_p() to select between the
> kernel's __ffs()/ffz() and the __builtin_ctzl() depending on whether
> the argument is constant or not.
>
> ** Statistics **
>
> On an allyesconfig, before applying this patch...:
>
> | $ objdump -d vmlinux.o | grep tzcnt | wc -l
> | 3607
>
> ...and after:
>
> | $ objdump -d vmlinux.o | grep tzcnt | wc -l
> | 2600
>
> So, roughly 27.9% of the calls to either __ffs() or ffz() were using
> constant expressions and could be optimized out.
>
> (tests done on linux v5.18-rc5 x86_64 using GCC 11.2.1)
>
> Note: on x86_64, the asm bsf instruction produces tzcnt when used with
> the rep prefix (which is why we grep tzcnt instead of bsf in the above
> benchmark). cf. [1]
>
> [1] commit e26a44a2d618 ("x86: Use REP BSF unconditionally")
> http://lkml.kernel.org/r/[email protected]
>
> CC: Nick Desaulniers <[email protected]>
> Signed-off-by: Vincent Mailhol <[email protected]>
Patch LGTM, though I find the location of the double underscores in the
names slightly against my taste.
> ---
> arch/x86/include/asm/bitops.h | 38 ++++++++++++++++++++++-------------
> 1 file changed, 24 insertions(+), 14 deletions(-)
>
> diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
> index 6ed979547086..7cf5374ce403 100644
> --- a/arch/x86/include/asm/bitops.h
> +++ b/arch/x86/include/asm/bitops.h
> @@ -224,13 +224,7 @@ static __always_inline bool variable_test_bit(long nr, volatile const unsigned l
> ? constant_test_bit((nr), (addr)) \
> : variable_test_bit((nr), (addr)))
>
> -/**
> - * __ffs - find first set bit in word
> - * @word: The word to search
> - *
> - * Undefined if no bit exists, so code should check against 0 first.
> - */
> -static __always_inline unsigned long __ffs(unsigned long word)
> +static __always_inline unsigned long __variable_ffs(unsigned long word)
How about `variable___ffs`? Patch 1/2 used `variable_ffs` for `ffs`?
> {
> asm("rep; bsf %1,%0"
> : "=r" (word)
> @@ -238,13 +232,18 @@ static __always_inline unsigned long __ffs(unsigned long word)
> return word;
> }
>
> -/**
> - * ffz - find first zero bit in word
> - * @word: The word to search
> - *
> - * Undefined if no zero exists, so code should check against ~0UL first.
> - */
> -static __always_inline unsigned long ffz(unsigned long word)
> +/**
> + * __ffs - find first set bit in word
> + * @word: The word to search
> + *
> + * Undefined if no bit exists, so code should check against 0 first.
> + */
> +#define __ffs(word) \
> + (__builtin_constant_p(word) ? \
> + (unsigned long)__builtin_ctzl(word) : \
> + __variable_ffs(word))
> +
> +static __always_inline unsigned long __variable_ffz(unsigned long word)
`ffz` had no underscore. Regardless of `__ffs`, this should definitely
be `variable_ffz` IMO.
> {
> asm("rep; bsf %1,%0"
> : "=r" (word)
> @@ -252,6 +251,17 @@ static __always_inline unsigned long ffz(unsigned long word)
> return word;
> }
>
> +/**
> + * ffz - find first zero bit in word
> + * @word: The word to search
> + *
> + * Undefined if no zero exists, so code should check against ~0UL first.
> + */
> +#define ffz(word) \
> + (__builtin_constant_p(word) ? \
> + (unsigned long)__builtin_ctzl(~word) : \
> + __variable_ffz(word))
> +
> /*
> * __fls: find last set bit in word
> * @word: The word to search
> --
> 2.35.1
>
--
Thanks,
~Nick Desaulniers
The compilers provide builtin expressions equivalent to the ffs(),
__ffs() and ffz() functions of the kernel. The kernel uses optimized
assembly which produces better code than the builtin
functions. However, such assembly code can not be optimized when used
on constant expressions.
This series relies on __builtin_constant_p to select the optimal solution:
* use kernel assembly for non constant expressions
* use compiler's __builtin function for constant expressions.
I also think that fls() and fls64() can be optimized in a similar
way, using __builtin_clz() and __builtin_clzll(), but it is a bit less
trivial, so I want to focus on this series first. If it gets accepted, I
will then work on those two additional functions.
** Statistics **
Patch 1/2 optimizes 26.7% of ffs() calls and patch 2/2 optimizes 27.9%
of __ffs() and ffz() calls (details of the calculation in each patch).
** Changelog **
v3 -> v4:
* (no changes on code, only commit comment was modified)
* Remove note and link to Nick's message in patch 1/2, c.f.:
https://lore.kernel.org/all/CAKwvOdnnDaiJcV1gr9vV+ya-jWxx7+2KJNTDThyFctVDOgt9zQ@mail.gmail.com/
* Add Reviewed-by: Nick Desaulniers <[email protected]> tag in patch 2/2.
v2 -> v3:
* Redacted out the instructions after ret and before next function
in the assembly output.
* Added a note and a link to Nick's message on the constant
propagation missed-optimization in clang:
https://lore.kernel.org/all/CAKwvOdnH_gYv4qRN9pKY7jNTQK95xNeH1w1KZJJmvCkh8xJLBg@mail.gmail.com/
* Fix copy/paste typo in statistics of patch 1/2. The number of
occurrences before the patches is 1081 and not 3607 (the percentage
reduction of 26.7% remains correct).
* Rename the functions as follows:
- __variable_ffs() -> variable___ffs()
- __variable_ffz() -> variable_ffz()
* Add Reviewed-by: Nick Desaulniers <[email protected]> tag in patch 1/2.
Vincent Mailhol (2):
x86/asm/bitops: ffs: use __builtin_ffs to evaluate constant
expressions
x86/asm/bitops: __ffs,ffz: use __builtin_ctzl to evaluate constant
expressions
arch/x86/include/asm/bitops.h | 64 +++++++++++++++++++++--------------
1 file changed, 38 insertions(+), 26 deletions(-)
--
2.35.1
__ffs(x) is equivalent to (unsigned long)__builtin_ctzl(x) and ffz(x)
is equivalent to (unsigned long)__builtin_ctzl(~x). Because
__builtin_ctzl() returns an int, a cast to (unsigned long) is
necessary to avoid potential warnings on implicit casts.
For x86_64, the current __ffs() and ffz() implementations do not
produce optimized code when called with a constant expression. On the
contrary, the __builtin_ctzl() gets simplified into a single
instruction.
However, for non constant expressions, the __ffs() and ffz() asm
versions of the kernel remain slightly better than the code produced
by GCC (GCC emits a useless extra instruction to clear eax).
This patch uses the __builtin_constant_p() to select between the
kernel's __ffs()/ffz() and the __builtin_ctzl() depending on whether
the argument is constant or not.
** Statistics **
On an allyesconfig, before applying this patch...:
| $ objdump -d vmlinux.o | grep tzcnt | wc -l
| 3607
...and after:
| $ objdump -d vmlinux.o | grep tzcnt | wc -l
| 2600
So, roughly 27.9% of the calls to either __ffs() or ffz() were using
constant expressions and could be optimized out.
(tests done on linux v5.18-rc5 x86_64 using GCC 11.2.1)
Note: on x86_64, the asm bsf instruction produces tzcnt when used with
the rep prefix (which is why we grep tzcnt instead of bsf in the above
benchmark). cf. [1]
[1] commit e26a44a2d618 ("x86: Use REP BSF unconditionally")
http://lkml.kernel.org/r/[email protected]
Reviewed-by: Nick Desaulniers <[email protected]>
Signed-off-by: Vincent Mailhol <[email protected]>
---
arch/x86/include/asm/bitops.h | 38 ++++++++++++++++++++++-------------
1 file changed, 24 insertions(+), 14 deletions(-)
diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 6ed979547086..f88c55b8b37c 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -224,13 +224,7 @@ static __always_inline bool variable_test_bit(long nr, volatile const unsigned l
? constant_test_bit((nr), (addr)) \
: variable_test_bit((nr), (addr)))
-/**
- * __ffs - find first set bit in word
- * @word: The word to search
- *
- * Undefined if no bit exists, so code should check against 0 first.
- */
-static __always_inline unsigned long __ffs(unsigned long word)
+static __always_inline unsigned long variable___ffs(unsigned long word)
{
asm("rep; bsf %1,%0"
: "=r" (word)
@@ -238,13 +232,18 @@ static __always_inline unsigned long __ffs(unsigned long word)
return word;
}
-/**
- * ffz - find first zero bit in word
- * @word: The word to search
- *
- * Undefined if no zero exists, so code should check against ~0UL first.
- */
-static __always_inline unsigned long ffz(unsigned long word)
+/**
+ * __ffs - find first set bit in word
+ * @word: The word to search
+ *
+ * Undefined if no bit exists, so code should check against 0 first.
+ */
+#define __ffs(word) \
+ (__builtin_constant_p(word) ? \
+ (unsigned long)__builtin_ctzl(word) : \
+ variable___ffs(word))
+
+static __always_inline unsigned long variable_ffz(unsigned long word)
{
asm("rep; bsf %1,%0"
: "=r" (word)
@@ -252,6 +251,17 @@ static __always_inline unsigned long ffz(unsigned long word)
return word;
}
+/**
+ * ffz - find first zero bit in word
+ * @word: The word to search
+ *
+ * Undefined if no zero exists, so code should check against ~0UL first.
+ */
+#define ffz(word) \
+ (__builtin_constant_p(word) ? \
+ (unsigned long)__builtin_ctzl(~word) : \
+ variable_ffz(word))
+
/*
* __fls: find last set bit in word
* @word: The word to search
--
2.35.1
On Thu. 12 May 2022 at 10:20, Vincent Mailhol
<[email protected]> wrote:
> The compilers provide builtin expressions equivalent to the ffs(),
> __ffs() and ffz() functions of the kernel. The kernel uses optimized
> assembly which produces better code than the builtin
> functions. However, such assembly code can not be optimized when used
> on constant expressions.
>
> This series relies on __builtin_constant_p to select the optimal solution:
>
> * use kernel assembly for non constant expressions
>
> * use compiler's __builtin function for constant expressions.
>
> I also think that fls() and fls64() can be optimized in a similar
> way, using __builtin_clz() and __builtin_clzll(), but it is a bit less
> trivial, so I want to focus on this series first. If it gets accepted, I
> will then work on those two additional functions.
>
>
> ** Statistics **
>
> Patch 1/2 optimizes 26.7% of ffs() calls and patch 2/2 optimizes 27.9%
> of __ffs() and ffz() calls (details of the calculation in each patch).
>
>
> ** Changelog **
>
> v3 -> v4:
>
> * (no changes on code, only commit comment was modified)
>
> * Remove note and link to Nick's message in patch 1/2, c.f.:
> https://lore.kernel.org/all/CAKwvOdnnDaiJcV1gr9vV+ya-jWxx7+2KJNTDThyFctVDOgt9zQ@mail.gmail.com/
>
> * Add Reviewed-by: Nick Desaulniers <[email protected]> tag in patch 2/2.
>
>
> v2 -> v3:
>
> * Redacted out the instructions after ret and before next function
> in the assembly output.
>
> * Added a note and a link to Nick's message on the constant
> propagation missed-optimization in clang:
> https://lore.kernel.org/all/CAKwvOdnH_gYv4qRN9pKY7jNTQK95xNeH1w1KZJJmvCkh8xJLBg@mail.gmail.com/
>
> * Fix copy/paste typo in statistics of patch 1/2. The number of
> occurrences before the patches is 1081 and not 3607 (the percentage
> reduction of 26.7% remains correct).
>
> * Rename the functions as follows:
> - __variable_ffs() -> variable___ffs()
> - __variable_ffz() -> variable_ffz()
>
> * Add Reviewed-by: Nick Desaulniers <[email protected]> tag in patch 1/2.
>
> Vincent Mailhol (2):
> x86/asm/bitops: ffs: use __builtin_ffs to evaluate constant
> expressions
> x86/asm/bitops: __ffs,ffz: use __builtin_ctzl to evaluate constant
> expressions
Hi Thomas, Ingo and Borislav,
Are there any chances for you to pick up those two patches during this
week's merge window?
https://lore.kernel.org/all/[email protected]/
https://lore.kernel.org/all/[email protected]/
Thank you!
Yours sincerely,
Vincent Mailhol
The compilers provide builtin expressions equivalent to the ffs(),
__ffs() and ffz() functions of the kernel. The kernel uses optimized
assembly which produces better code than the builtin
functions. However, such assembly code can not be optimized when used
on constant expressions.
This series relies on __builtin_constant_p to select the optimal solution:
* use kernel assembly for non constant expressions
* use compiler's __builtin function for constant expressions.
** Statistics **
Patch 1/2 optimizes 26.7% of ffs() calls and patch 2/2 optimizes 27.9%
of __ffs() and ffz() calls (details of the calculation in each patch).
** Changelog **
v3 -> v4:
* (no changes on code, only commit comment was modified)
* Remove note and link to Nick's message in patch 1/2, c.f.:
https://lore.kernel.org/all/CAKwvOdnnDaiJcV1gr9vV+ya-jWxx7+2KJNTDThyFctVDOgt9zQ@mail.gmail.com/
* Add Reviewed-by: Nick Desaulniers <[email protected]> tag in patch 2/2.
v2 -> v3:
* Redacted out the instructions after ret and before next function
in the assembly output.
* Added a note and a link to Nick's message on the constant
propagation missed-optimization in clang:
https://lore.kernel.org/all/CAKwvOdnH_gYv4qRN9pKY7jNTQK95xNeH1w1KZJJmvCkh8xJLBg@mail.gmail.com/
* Fix copy/paste typo in statistics of patch 1/2. The number of
occurrences before the patches is 1081 and not 3607 (the percentage
reduction of 26.7% remains correct).
* Rename the functions as follows:
- __variable_ffs() -> variable___ffs()
- __variable_ffz() -> variable_ffz()
* Add Reviewed-by: Nick Desaulniers <[email protected]> tag in patch 1/2.
Vincent Mailhol (2):
x86/asm/bitops: ffs: use __builtin_ffs to evaluate constant
expressions
x86/asm/bitops: __ffs,ffz: use __builtin_ctzl to evaluate constant
expressions
arch/x86/include/asm/bitops.h | 64 +++++++++++++++++++++--------------
1 file changed, 38 insertions(+), 26 deletions(-)
--
2.35.1
For x86_64, the current ffs() implementation does not produce
optimized code when called with a constant expression. On the
contrary, the __builtin_ffs() function of both GCC and clang is able
to simplify the expression into a single instruction.
* Example *
Let's consider two dummy functions foo() and bar() as below:
| #include <linux/bitops.h>
| #define CONST 0x01000000
|
| unsigned int foo(void)
| {
|         return ffs(CONST);
| }
|
| unsigned int bar(void)
| {
|         return __builtin_ffs(CONST);
| }
GCC would produce below assembly code:
| 0000000000000000 <foo>:
| 0: ba 00 00 00 01 mov $0x1000000,%edx
| 5: b8 ff ff ff ff mov $0xffffffff,%eax
| a: 0f bc c2 bsf %edx,%eax
| d: 83 c0 01 add $0x1,%eax
| 10: c3 ret
<Instructions after ret and before next function were redacted>
|
| 0000000000000020 <bar>:
| 20: b8 19 00 00 00 mov $0x19,%eax
| 25: c3 ret
And clang would produce:
| 0000000000000000 <foo>:
| 0: b8 ff ff ff ff mov $0xffffffff,%eax
| 5: 0f bc 05 00 00 00 00 bsf 0x0(%rip),%eax # c <foo+0xc>
| c: 83 c0 01 add $0x1,%eax
| f: c3 ret
|
| 0000000000000010 <bar>:
| 10: b8 19 00 00 00 mov $0x19,%eax
| 15: c3 ret
In both examples, we clearly see the benefit of using __builtin_ffs()
instead of the kernel's asm implementation for constant expressions.
However, for non constant expressions, the ffs() asm version of the
kernel remains better for x86_64 because, contrary to GCC, it doesn't
emit the CMOV assembly instruction, cf. [1] (notably, clang is
able to optimize out the CMOV instruction).
This patch uses the __builtin_constant_p() to select between the
kernel's ffs() and the __builtin_ffs() depending on whether the
argument is constant or not.
As a side benefit, this patch also removes the below -Wshadow warning:
| ./arch/x86/include/asm/bitops.h:283:28: warning: declaration of 'ffs' shadows a built-in function [-Wshadow]
| 283 | static __always_inline int ffs(int x)
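For context, a minimal out-of-kernel reproduction of that warning (a
sketch, assuming a build with -Wshadow enabled):

  /* shadow.c: `gcc -Wshadow -c shadow.c` warns that this declaration
   * shadows the ffs() built-in function. */
  static inline int ffs(int x)
  {
          return __builtin_ffs(x);
  }

Defining ffs() as a macro instead, as this patch does, sidesteps the
warning because no function identifier gets declared.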
** Statistics **
On an allyesconfig, before applying this patch...:
| $ objdump -d vmlinux.o | grep bsf | wc -l
| 1081
...and after:
| $ objdump -d vmlinux.o | grep bsf | wc -l
| 792
So, roughly 26.7% of the calls to ffs() were using constant
expressions and could be optimized out.
(tests done on linux v5.18-rc5 x86_64 using GCC 11.2.1)
[1] commit ca3d30cc02f7 ("x86_64, asm: Optimise fls(), ffs() and fls64()")
http://lkml.kernel.org/r/[email protected]
Reviewed-by: Nick Desaulniers <[email protected]>
Signed-off-by: Vincent Mailhol <[email protected]>
---
arch/x86/include/asm/bitops.h | 26 ++++++++++++++------------
1 file changed, 14 insertions(+), 12 deletions(-)
diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index a288ecd230ab..6ed979547086 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -269,18 +269,7 @@ static __always_inline unsigned long __fls(unsigned long word)
#undef ADDR
#ifdef __KERNEL__
-/**
- * ffs - find first set bit in word
- * @x: the word to search
- *
- * This is defined the same way as the libc and compiler builtin ffs
- * routines, therefore differs in spirit from the other bitops.
- *
- * ffs(value) returns 0 if value is 0 or the position of the first
- * set bit if value is nonzero. The first (least significant) bit
- * is at position 1.
- */
-static __always_inline int ffs(int x)
+static __always_inline int variable_ffs(int x)
{
int r;
@@ -310,6 +299,19 @@ static __always_inline int ffs(int x)
return r + 1;
}
+/**
+ * ffs - find first set bit in word
+ * @x: the word to search
+ *
+ * This is defined the same way as the libc and compiler builtin ffs
+ * routines, therefore differs in spirit from the other bitops.
+ *
+ * ffs(value) returns 0 if value is 0 or the position of the first
+ * set bit if value is nonzero. The first (least significant) bit
+ * is at position 1.
+ */
+#define ffs(x) (__builtin_constant_p(x) ? __builtin_ffs(x) : variable_ffs(x))
+
/**
* fls - find last set bit in word
* @x: the word to search
--
2.35.1
__ffs(x) is equivalent to (unsigned long)__builtin_ctzl(x) and ffz(x)
is equivalent to (unsigned long)__builtin_ctzl(~x). Because
__builtin_ctzl() returns an int, a cast to (unsigned long) is
necessary to avoid potential warnings on implicit casts.
For x86_64, the current __ffs() and ffz() implementations do not
produce optimized code when called with a constant expression. On the
contrary, the __builtin_ctzl() gets simplified into a single
instruction.
However, for non constant expressions, the __ffs() and ffz() asm
versions of the kernel remain slightly better than the code produced
by GCC (GCC emits a useless extra instruction to clear eax).
This patch uses the __builtin_constant_p() to select between the
kernel's __ffs()/ffz() and the __builtin_ctzl() depending on whether
the argument is constant or not.
** Statistics **
On an allyesconfig, before applying this patch...:
| $ objdump -d vmlinux.o | grep tzcnt | wc -l
| 3607
...and after:
| $ objdump -d vmlinux.o | grep tzcnt | wc -l
| 2600
So, roughly 27.9% of the calls to either __ffs() or ffz() were using
constant expressions and could be optimized out.
(tests done on linux v5.18-rc5 x86_64 using GCC 11.2.1)
Note: on x86_64, the asm bsf instruction produces tzcnt when used with
the rep prefix (which is why we grep tzcnt instead of bsf in the above
benchmark). cf. [1]
[1] commit e26a44a2d618 ("x86: Use REP BSF unconditionally")
http://lkml.kernel.org/r/[email protected]
Reviewed-by: Nick Desaulniers <[email protected]>
Signed-off-by: Vincent Mailhol <[email protected]>
---
arch/x86/include/asm/bitops.h | 38 ++++++++++++++++++++++-------------
1 file changed, 24 insertions(+), 14 deletions(-)
diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 6ed979547086..f88c55b8b37c 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -224,13 +224,7 @@ static __always_inline bool variable_test_bit(long nr, volatile const unsigned l
? constant_test_bit((nr), (addr)) \
: variable_test_bit((nr), (addr)))
-/**
- * __ffs - find first set bit in word
- * @word: The word to search
- *
- * Undefined if no bit exists, so code should check against 0 first.
- */
-static __always_inline unsigned long __ffs(unsigned long word)
+static __always_inline unsigned long variable___ffs(unsigned long word)
{
asm("rep; bsf %1,%0"
: "=r" (word)
@@ -238,13 +232,18 @@ static __always_inline unsigned long __ffs(unsigned long word)
return word;
}
-/**
- * ffz - find first zero bit in word
- * @word: The word to search
- *
- * Undefined if no zero exists, so code should check against ~0UL first.
- */
-static __always_inline unsigned long ffz(unsigned long word)
+/**
+ * __ffs - find first set bit in word
+ * @word: The word to search
+ *
+ * Undefined if no bit exists, so code should check against 0 first.
+ */
+#define __ffs(word) \
+ (__builtin_constant_p(word) ? \
+ (unsigned long)__builtin_ctzl(word) : \
+ variable___ffs(word))
+
+static __always_inline unsigned long variable_ffz(unsigned long word)
{
asm("rep; bsf %1,%0"
: "=r" (word)
@@ -252,6 +251,17 @@ static __always_inline unsigned long ffz(unsigned long word)
return word;
}
+/**
+ * ffz - find first zero bit in word
+ * @word: The word to search
+ *
+ * Undefined if no zero exists, so code should check against ~0UL first.
+ */
+#define ffz(word) \
+ (__builtin_constant_p(word) ? \
+ (unsigned long)__builtin_ctzl(~word) : \
+ variable_ffz(word))
+
/*
* __fls: find last set bit in word
* @word: The word to search
--
2.35.1
The compilers provide builtin expressions equivalent to the ffs(),
__ffs() and ffz() functions of the kernel. The kernel uses optimized
assembly which produces better code than the builtin
functions. However, such assembly code can not be optimized when used
on constant expressions.
This series relies on __builtin_constant_p to select the optimal solution:
* use kernel assembly for non constant expressions
* use compiler's __builtin function for constant expressions.
** Statistics **
Patch 1/2 optimizes 26.7% of ffs() calls and patch 2/2 optimizes 27.9%
of __ffs() and ffz() calls (details of the calculation in each patch).
** Changelog **
v3 -> v4:
* (no changes on code, only commit comment was modified)
* Remove note and link to Nick's message in patch 1/2, c.f.:
https://lore.kernel.org/all/CAKwvOdnnDaiJcV1gr9vV+ya-jWxx7+2KJNTDThyFctVDOgt9zQ@mail.gmail.com/
* Add Reviewed-by: Nick Desaulniers <[email protected]> tag in patch 2/2.
v2 -> v3:
* Redacted out the instructions after ret and before next function
in the assembly output.
* Added a note and a link to Nick's message on the constant
propagation missed-optimization in clang:
https://lore.kernel.org/all/CAKwvOdnH_gYv4qRN9pKY7jNTQK95xNeH1w1KZJJmvCkh8xJLBg@mail.gmail.com/
* Fix copy/paste typo in statistics of patch 1/2. The number of
occurrences before the patches is 1081 and not 3607 (the percentage
reduction of 26.7% remains correct).
* Rename the functions as follows:
- __variable_ffs() -> variable___ffs()
- __variable_ffz() -> variable_ffz()
* Add Reviewed-by: Nick Desaulniers <[email protected]> tag in patch 1/2.
Vincent Mailhol (2):
x86/asm/bitops: ffs: use __builtin_ffs to evaluate constant
expressions
x86/asm/bitops: __ffs,ffz: use __builtin_ctzl to evaluate constant
expressions
arch/x86/include/asm/bitops.h | 64 +++++++++++++++++++++--------------
1 file changed, 38 insertions(+), 26 deletions(-)
--
2.35.1
For x86_64, the current ffs() implementation does not produce
optimized code when called with a constant expression. On the
contrary, the __builtin_ffs() function of both GCC and clang is able
to simplify the expression into a single instruction.
* Example *
Let's consider two dummy functions foo() and bar() as below:
| #include <linux/bitops.h>
| #define CONST 0x01000000
|
| unsigned int foo(void)
| {
|         return ffs(CONST);
| }
|
| unsigned int bar(void)
| {
|         return __builtin_ffs(CONST);
| }
GCC would produce below assembly code:
| 0000000000000000 <foo>:
| 0: ba 00 00 00 01 mov $0x1000000,%edx
| 5: b8 ff ff ff ff mov $0xffffffff,%eax
| a: 0f bc c2 bsf %edx,%eax
| d: 83 c0 01 add $0x1,%eax
| 10: c3 ret
<Instructions after ret and before next function were redacted>
|
| 0000000000000020 <bar>:
| 20: b8 19 00 00 00 mov $0x19,%eax
| 25: c3 ret
And clang would produce:
| 0000000000000000 <foo>:
| 0: b8 ff ff ff ff mov $0xffffffff,%eax
| 5: 0f bc 05 00 00 00 00 bsf 0x0(%rip),%eax # c <foo+0xc>
| c: 83 c0 01 add $0x1,%eax
| f: c3 ret
|
| 0000000000000010 <bar>:
| 10: b8 19 00 00 00 mov $0x19,%eax
| 15: c3 ret
In both examples, we clearly see the benefit of using __builtin_ffs()
instead of the kernel's asm implementation for constant expressions.
However, for non constant expressions, the ffs() asm version of the
kernel remains better for x86_64 because, contrary to GCC, it doesn't
emit the CMOV assembly instruction, cf. [1] (notably, clang is
able to optimize out the CMOV instruction).
This patch uses the __builtin_constant_p() to select between the
kernel's ffs() and the __builtin_ffs() depending on whether the
argument is constant or not.
As a side benefit, this patch also removes the below -Wshadow warning:
| ./arch/x86/include/asm/bitops.h:283:28: warning: declaration of 'ffs' shadows a built-in function [-Wshadow]
| 283 | static __always_inline int ffs(int x)
** Statistics **
On an allyesconfig, before applying this patch...:
| $ objdump -d vmlinux.o | grep bsf | wc -l
| 1081
...and after:
| $ objdump -d vmlinux.o | grep bsf | wc -l
| 792
So, roughly 26.7% of the calls to ffs() were using constant
expressions and could be optimized out.
(tests done on linux v5.18-rc5 x86_64 using GCC 11.2.1)
[1] commit ca3d30cc02f7 ("x86_64, asm: Optimise fls(), ffs() and fls64()")
http://lkml.kernel.org/r/[email protected]
Reviewed-by: Nick Desaulniers <[email protected]>
Signed-off-by: Vincent Mailhol <[email protected]>
---
arch/x86/include/asm/bitops.h | 26 ++++++++++++++------------
1 file changed, 14 insertions(+), 12 deletions(-)
diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index a288ecd230ab..6ed979547086 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -269,18 +269,7 @@ static __always_inline unsigned long __fls(unsigned long word)
#undef ADDR
#ifdef __KERNEL__
-/**
- * ffs - find first set bit in word
- * @x: the word to search
- *
- * This is defined the same way as the libc and compiler builtin ffs
- * routines, therefore differs in spirit from the other bitops.
- *
- * ffs(value) returns 0 if value is 0 or the position of the first
- * set bit if value is nonzero. The first (least significant) bit
- * is at position 1.
- */
-static __always_inline int ffs(int x)
+static __always_inline int variable_ffs(int x)
{
int r;
@@ -310,6 +299,19 @@ static __always_inline int ffs(int x)
return r + 1;
}
+/**
+ * ffs - find first set bit in word
+ * @x: the word to search
+ *
+ * This is defined the same way as the libc and compiler builtin ffs
+ * routines, therefore differs in spirit from the other bitops.
+ *
+ * ffs(value) returns 0 if value is 0 or the position of the first
+ * set bit if value is nonzero. The first (least significant) bit
+ * is at position 1.
+ */
+#define ffs(x) (__builtin_constant_p(x) ? __builtin_ffs(x) : variable_ffs(x))
+
/**
* fls - find last set bit in word
* @x: the word to search
--
2.35.1
__ffs(x) is equivalent to (unsigned long)__builtin_ctzl(x) and ffz(x)
is equivalent to (unsigned long)__builtin_ctzl(~x). Because
__builtin_ctzl() returns an int, a cast to (unsigned long) is
necessary to avoid potential warnings on implicit casts.
For x86_64, the current __ffs() and ffz() implementations do not
produce optimized code when called with a constant expression. On the
contrary, the __builtin_ctzl() gets simplified into a single
instruction.
However, for non constant expressions, the __ffs() and ffz() asm
versions of the kernel remain slightly better than the code produced
by GCC (GCC emits a useless extra instruction to clear eax).
This patch uses the __builtin_constant_p() to select between the
kernel's __ffs()/ffz() and the __builtin_ctzl() depending on whether
the argument is constant or not.
** Statistics **
On an allyesconfig, before applying this patch...:
| $ objdump -d vmlinux.o | grep tzcnt | wc -l
| 3607
...and after:
| $ objdump -d vmlinux.o | grep tzcnt | wc -l
| 2600
So, roughly 27.9% of the calls to either __ffs() or ffz() were using
constant expressions and could be optimized out.
(tests done on linux v5.18-rc5 x86_64 using GCC 11.2.1)
Note: on x86_64, the asm bsf instruction produces tzcnt when used with
the rep prefix (which is why we grep tzcnt instead of bsf in the above
benchmark). cf. [1]
[1] commit e26a44a2d618 ("x86: Use REP BSF unconditionally")
http://lkml.kernel.org/r/[email protected]
Reviewed-by: Nick Desaulniers <[email protected]>
Signed-off-by: Vincent Mailhol <[email protected]>
---
arch/x86/include/asm/bitops.h | 38 ++++++++++++++++++++++-------------
1 file changed, 24 insertions(+), 14 deletions(-)
diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 6ed979547086..f88c55b8b37c 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -224,13 +224,7 @@ static __always_inline bool variable_test_bit(long nr, volatile const unsigned l
? constant_test_bit((nr), (addr)) \
: variable_test_bit((nr), (addr)))
-/**
- * __ffs - find first set bit in word
- * @word: The word to search
- *
- * Undefined if no bit exists, so code should check against 0 first.
- */
-static __always_inline unsigned long __ffs(unsigned long word)
+static __always_inline unsigned long variable___ffs(unsigned long word)
{
asm("rep; bsf %1,%0"
: "=r" (word)
@@ -238,13 +232,18 @@ static __always_inline unsigned long __ffs(unsigned long word)
return word;
}
-/**
- * ffz - find first zero bit in word
- * @word: The word to search
- *
- * Undefined if no zero exists, so code should check against ~0UL first.
- */
-static __always_inline unsigned long ffz(unsigned long word)
+/**
+ * __ffs - find first set bit in word
+ * @word: The word to search
+ *
+ * Undefined if no bit exists, so code should check against 0 first.
+ */
+#define __ffs(word) \
+ (__builtin_constant_p(word) ? \
+ (unsigned long)__builtin_ctzl(word) : \
+ variable___ffs(word))
+
+static __always_inline unsigned long variable_ffz(unsigned long word)
{
asm("rep; bsf %1,%0"
: "=r" (word)
@@ -252,6 +251,17 @@ static __always_inline unsigned long ffz(unsigned long word)
return word;
}
+/**
+ * ffz - find first zero bit in word
+ * @word: The word to search
+ *
+ * Undefined if no zero exists, so code should check against ~0UL first.
+ */
+#define ffz(word) \
+ (__builtin_constant_p(word) ? \
+ (unsigned long)__builtin_ctzl(~word) : \
+ variable_ffz(word))
+
/*
* __fls: find last set bit in word
* @word: The word to search
--
2.35.1
On Sun 24 Jul. 2022 at 00:15, Vincent Mailhol
<[email protected]> wrote:
> The compilers provide builtin expressions equivalent to the ffs(),
> __ffs() and ffz() functions of the kernel. The kernel uses optimized
> assembly which produces better code than the builtin
> functions. However, such assembly code can not be optimized when used
> on constant expressions.
>
> This series relies on __builtin_constant_p to select the optimal solution:
>
> * use kernel assembly for non constant expressions
>
> * use compiler's __builtin function for constant expressions.
>
>
> ** Statistics **
>
> Patch 1/2 optimizes 26.7% of ffs() calls and patch 2/2 optimizes 27.9%
> of __ffs() and ffz() calls (details of the calculation in each patch).
>
>
> ** Changelog **
>
> v3 -> v4:
>
> * (no changes on code, only commit comment was modified)
>
> * Remove note and link to Nick's message in patch 1/2, c.f.:
> https://lore.kernel.org/all/CAKwvOdnnDaiJcV1gr9vV+ya-jWxx7+2KJNTDThyFctVDOgt9zQ@mail.gmail.com/
>
> * Add Reviewed-by: Nick Desaulniers <[email protected]> tag in patch 2/2.
>
>
> v2 -> v3:
>
> * Redacted out the instructions after ret and before next function
> in the assembly output.
>
> * Added a note and a link to Nick's message on the constant
> propagation missed-optimization in clang:
> https://lore.kernel.org/all/CAKwvOdnH_gYv4qRN9pKY7jNTQK95xNeH1w1KZJJmvCkh8xJLBg@mail.gmail.com/
>
> * Fix copy/paste typo in statistics of patch 1/2. The number of
> occurrences before the patches is 1081 and not 3607 (the percentage
> reduction of 26.7% remains correct).
>
> * Rename the functions as follows:
> - __variable_ffs() -> variable___ffs()
> - __variable_ffz() -> variable_ffz()
>
> * Add Reviewed-by: Nick Desaulniers <[email protected]> tag in patch 1/2.
>
> Vincent Mailhol (2):
> x86/asm/bitops: ffs: use __builtin_ffs to evaluate constant
> expressions
> x86/asm/bitops: __ffs,ffz: use __builtin_ctzl to evaluate constant
> expressions
>
> arch/x86/include/asm/bitops.h | 64 +++++++++++++++++++++--------------
> 1 file changed, 38 insertions(+), 26 deletions(-)
Hi Thomas, Ingo, Borislav, Dave and Peter,
This patch series [1] has been waiting for more than two months
already. So far, I have not heard back from any of the x86 maintainers.
Do you see anything wrong with this series? If not, is there any chance
one of you could pick it up?
[1] https://lore.kernel.org/all/[email protected]/#t
Thank you,
Yours sincerely,
Vincent Mailhol
On Fri, Jul 29, 2022 at 08:24:58PM +0900, Vincent MAILHOL wrote:
> This patch series [1] has been waiting for more than two months
> already. So far, I have not heard back from any of the x86 maintainers.
> Do you see anything wrong with this series? If not, is there any chance
> one of you could pick it up?
They're on my todo list but you have to be patient. If you haven't
heard, we're still busy with this thing called retbleed.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Fri. 29 Jul. 2022 at 21:22, Borislav Petkov <[email protected]> wrote:
> On Fri, Jul 29, 2022 at 08:24:58PM +0900, Vincent MAILHOL wrote:
> > This patch series [1] has been waiting for more than two months
> > already. So far, I have not heard back from any of the x86 maintainers.
> > Do you see anything wrong with this series? If not, is there any chance
> > one of you could pick it up?
>
> They're on my todo list but you have to be patient. If you haven't
> heard, we're still busy with this thing called retbleed.
Understood and thanks for the update!
If this is on your radar, then no more worries on my side. I wish you
courage and good luck for the retbleed fix!
Yours sincerely,
Vincent Mailhol
On Sun, Jul 24, 2022 at 12:15:20AM +0900, Vincent Mailhol wrote:
> For x86_64, the current ffs() implementation does not produce
> optimized code when called with a constant expression. On the
> contrary, the __builtin_ffs() function of both GCC and clang is able
> to simplify the expression into a single instruction.
>
> * Example *
>
> Let's consider two dummy functions foo() and bar() as below:
>
> | #include <linux/bitops.h>
> | #define CONST 0x01000000
Those code examples you can simply indent with two spaces.
> In both examples, we clearly see the benefit of using __builtin_ffs()
Who's "we"?
Please use passive voice in your commit message: no "we" or "I", etc,
and describe your changes in imperative mood.
> instead of the kernel's asm implementation for constant expressions.
>
> However, for non constant expressions, the ffs() asm version of the
> kernel remains better for x86_64 because, contrary to GCC, it doesn't
> emit the CMOV assembly instruction, cf. [1] (notably, clang is
> able to optimize out the CMOV instruction).
>
> This patch uses the __builtin_constant_p() to select between the
Avoid having "This patch" or "This commit" in the commit message. It is
tautologically useless.
Also, do
$ git grep 'This patch' Documentation/process
for more details.
> kernel's ffs() and the __builtin_ffs() depending on whether the
> argument is constant or not.
In general, you don't have to say what the patch does - that should be
visible from the diff. The more important part is the *why*. And that
you do.
Rest looks ok.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
The compilers provide builtin expressions equivalent to the ffs(),
__ffs() and ffz() functions of the kernel. The kernel uses optimized
assembly which produces better code than the builtin
functions. However, such assembly code can not be folded when used
with constant expressions.
This series relies on __builtin_constant_p to select the optimal solution:
* use kernel assembly for non constant expressions
* use compiler's __builtin function for constant expressions.
** Statistics **
Patch 1/2 optimizes 26.7% of ffs() calls and patch 2/2 optimizes 27.9%
of __ffs() and ffz() calls (details of the calculation in each patch).
** Changelog **
v4 -> v5:
* (no changes on code, only commit comment was modified)
* Rewrite the commit log:
- Use two spaces instead of `| ' to indent code snippets.
- Do not use `we'.
- Do not use `this patch' in the commit description. Instead,
use imperative tone.
Link: https://lore.kernel.org/all/[email protected]/
v3 -> v4:
* (no changes on code, only commit comment was modified)
* Remove note and link to Nick's message in patch 1/2, c.f.:
Link: https://lore.kernel.org/all/CAKwvOdnnDaiJcV1gr9vV+ya-jWxx7+2KJNTDThyFctVDOgt9zQ@mail.gmail.com/
* Add Reviewed-by: Nick Desaulniers <[email protected]> tag in
patch 2/2.
v2 -> v3:
* Redacted out the instructions after ret and before next function
in the assembly output.
* Added a note and a link to Nick's message on the constant
propagation missed-optimization in clang:
Link: https://lore.kernel.org/all/CAKwvOdnH_gYv4qRN9pKY7jNTQK95xNeH1w1KZJJmvCkh8xJLBg@mail.gmail.com/
* Fix copy/paste typo in statistics of patch 1/2. The number of
occurrences before the patches is 1081 and not 3607 (the percentage
reduction of 26.7% remains correct).
* Rename the functions as follows:
- __variable_ffs() -> variable___ffs()
- __variable_ffz() -> variable_ffz()
* Add Reviewed-by: Nick Desaulniers <[email protected]> tag in
patch 1/2.
Vincent Mailhol (2):
x86/asm/bitops: ffs: use __builtin_ffs to evaluate constant
expressions
x86/asm/bitops: __ffs,ffz: use __builtin_ctzl to evaluate constant
expressions
arch/x86/include/asm/bitops.h | 64 +++++++++++++++++++++--------------
1 file changed, 38 insertions(+), 26 deletions(-)
--
2.35.1
Hi Borislav,
On Thu. 11 Aug 2022 at 23:59, Borislav Petkov <[email protected]> wrote:
> On Sun, Jul 24, 2022 at 12:15:20AM +0900, Vincent Mailhol wrote:
> > For x86_64, the current ffs() implementation does not produce
> > optimized code when called with a constant expression. On the
> > contrary, the __builtin_ffs() function of both GCC and clang is able
> > to simplify the expression into a single instruction.
> >
> > * Example *
> >
> > Let's consider two dummy functions foo() and bar() as below:
> >
> > | #include <linux/bitops.h>
> > | #define CONST 0x01000000
>
> Those code examples you can simply indent with two spaces.
>
> > In both examples, we clearly see the benefit of using __builtin_ffs()
>
> Who's "we"?
>
> Please use passive voice in your commit message: no "we" or "I", etc,
> and describe your changes in imperative mood.
>
> > instead of the kernel's asm implementation for constant expressions.
> >
> > However, for non constant expressions, the ffs() asm version of the
> > kernel remains better for x86_64 because, contrary to GCC, it doesn't
> > emit the CMOV assembly instruction, cf. [1] (notably, clang is
> > able to optimize out the CMOV instruction).
> >
> > This patch uses the __builtin_constant_p() to select between the
>
> Avoid having "This patch" or "This commit" in the commit message. It is
> tautologically useless.
>
> Also, do
>
> $ git grep 'This patch' Documentation/process
>
> for more details.
>
> > kernel's ffs() and the __builtin_ffs() depending on whether the
> > argument is constant or not.
>
> In general, you don't have to say what the patch does - that should be
> visible from the diff. The more important part is the *why*. And that
> you do.
>
> Rest looks ok.
Thank you for the review!
I addressed all your comments and sent a v5:
https://lore.kernel.org/all/[email protected]/
Yours sincerely,
Vincent Mailhol
For x86_64, the current ffs() implementation does not produce
optimized code when called with a constant expression. On the
contrary, the __builtin_ffs() function of both GCC and clang is able
to fold the expression into a single instruction.
** Example **
Let's consider two dummy functions foo() and bar() as below:
  #include <linux/bitops.h>
  #define CONST 0x01000000

  unsigned int foo(void)
  {
          return ffs(CONST);
  }

  unsigned int bar(void)
  {
          return __builtin_ffs(CONST);
  }
GCC would produce below assembly code:
0000000000000000 <foo>:
0: ba 00 00 00 01 mov $0x1000000,%edx
5: b8 ff ff ff ff mov $0xffffffff,%eax
a: 0f bc c2 bsf %edx,%eax
d: 83 c0 01 add $0x1,%eax
10: c3 ret
<Instructions after ret and before next function were redacted>
0000000000000020 <bar>:
20: b8 19 00 00 00 mov $0x19,%eax
25: c3 ret
And clang would produce:
0000000000000000 <foo>:
0: b8 ff ff ff ff mov $0xffffffff,%eax
5: 0f bc 05 00 00 00 00 bsf 0x0(%rip),%eax # c <foo+0xc>
c: 83 c0 01 add $0x1,%eax
f: c3 ret
0000000000000010 <bar>:
10: b8 19 00 00 00 mov $0x19,%eax
15: c3 ret
Both examples clearly demonstrate the benefit of using __builtin_ffs()
instead of the kernel's asm implementation for constant expressions.
However, for non constant expressions, the ffs() asm version of the
kernel remains better for x86_64 because, contrary to GCC, it doesn't
emit the CMOV assembly instruction, cf. [1] (notably, clang is
able to optimize out the CMOV instruction).
Use __builtin_constant_p() to select between the kernel's ffs() and
the __builtin_ffs() depending on whether the argument is constant or
not.
As a side benefit, replacing the ffs() function declaration by a macro
also removes below -Wshadow warning:
./arch/x86/include/asm/bitops.h:283:28: warning: declaration of 'ffs' shadows a built-in function [-Wshadow]
283 | static __always_inline int ffs(int x)
** Statistics **
On an allyesconfig, before...:
$ objdump -d vmlinux.o | grep bsf | wc -l
1081
...and after:
$ objdump -d vmlinux.o | grep bsf | wc -l
792
So, roughly 26.7% of the calls to ffs() were using constant
expressions and could be optimized out.
(tests done on linux v5.18-rc5 x86_64 using GCC 11.2.1)
[1] commit ca3d30cc02f7 ("x86_64, asm: Optimise fls(), ffs() and fls64()")
Link: http://lkml.kernel.org/r/[email protected]
Reviewed-by: Nick Desaulniers <[email protected]>
Signed-off-by: Vincent Mailhol <[email protected]>
---
arch/x86/include/asm/bitops.h | 26 ++++++++++++++------------
1 file changed, 14 insertions(+), 12 deletions(-)
diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index a288ecd230ab..6ed979547086 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -269,18 +269,7 @@ static __always_inline unsigned long __fls(unsigned long word)
#undef ADDR
#ifdef __KERNEL__
-/**
- * ffs - find first set bit in word
- * @x: the word to search
- *
- * This is defined the same way as the libc and compiler builtin ffs
- * routines, therefore differs in spirit from the other bitops.
- *
- * ffs(value) returns 0 if value is 0 or the position of the first
- * set bit if value is nonzero. The first (least significant) bit
- * is at position 1.
- */
-static __always_inline int ffs(int x)
+static __always_inline int variable_ffs(int x)
{
int r;
@@ -310,6 +299,19 @@ static __always_inline int ffs(int x)
return r + 1;
}
+/**
+ * ffs - find first set bit in word
+ * @x: the word to search
+ *
+ * This is defined the same way as the libc and compiler builtin ffs
+ * routines, therefore differs in spirit from the other bitops.
+ *
+ * ffs(value) returns 0 if value is 0 or the position of the first
+ * set bit if value is nonzero. The first (least significant) bit
+ * is at position 1.
+ */
+#define ffs(x) (__builtin_constant_p(x) ? __builtin_ffs(x) : variable_ffs(x))
+
/**
* fls - find last set bit in word
* @x: the word to search
--
2.35.1
__ffs(x) is equivalent to (unsigned long)__builtin_ctzl(x) and ffz(x)
is equivalent to (unsigned long)__builtin_ctzl(~x). Because
__builtin_ctzl() returns an int, a cast to (unsigned long) is
necessary to avoid potential warnings on implicit casts.
For x86_64, the current __ffs() and ffz() implementations do not
produce optimized code when called with a constant expression. On the
contrary, the __builtin_ctzl() folds into a single instruction.
However, for non constant expressions, the __ffs() and ffz() asm
versions of the kernel remain slightly better than the code produced
by GCC (GCC emits a useless extra instruction to clear eax).
Use __builtin_constant_p() to select between the kernel's
__ffs()/ffz() and the __builtin_ctzl() depending on whether the
argument is constant or not.
** Statistics **
On an allyesconfig, before...:
$ objdump -d vmlinux.o | grep tzcnt | wc -l
3607
...and after:
$ objdump -d vmlinux.o | grep tzcnt | wc -l
2600
So, roughly 27.9% of the calls to either __ffs() or ffz() were using
constant expressions and could be optimized out.
(tests done on linux v5.18-rc5 x86_64 using GCC 11.2.1)
Note: on x86_64, the asm bsf instruction produces tzcnt when used with
the rep prefix (which explains the use of `grep tzcnt' instead of `grep
bsf' in the above benchmark). cf. [1]
[1] commit e26a44a2d618 ("x86: Use REP BSF unconditionally")
Link: http://lkml.kernel.org/r/[email protected]
Reviewed-by: Nick Desaulniers <[email protected]>
Signed-off-by: Vincent Mailhol <[email protected]>
---
arch/x86/include/asm/bitops.h | 38 ++++++++++++++++++++++-------------
1 file changed, 24 insertions(+), 14 deletions(-)
diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 6ed979547086..bd49aef87ab6 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -224,13 +224,7 @@ static __always_inline bool variable_test_bit(long nr, volatile const unsigned l
? constant_test_bit((nr), (addr)) \
: variable_test_bit((nr), (addr)))
-/**
- * __ffs - find first set bit in word
- * @word: The word to search
- *
- * Undefined if no bit exists, so code should check against 0 first.
- */
-static __always_inline unsigned long __ffs(unsigned long word)
+static __always_inline unsigned long variable___ffs(unsigned long word)
{
asm("rep; bsf %1,%0"
: "=r" (word)
@@ -238,13 +232,18 @@ static __always_inline unsigned long __ffs(unsigned long word)
return word;
}
-/**
- * ffz - find first zero bit in word
- * @word: The word to search
- *
- * Undefined if no zero exists, so code should check against ~0UL first.
- */
-static __always_inline unsigned long ffz(unsigned long word)
+/**
+ * __ffs - find first set bit in word
+ * @word: The word to search
+ *
+ * Undefined if no bit exists, so code should check against 0 first.
+ */
+#define __ffs(word) \
+ (__builtin_constant_p(word) ? \
+ (unsigned long)__builtin_ctzl(word) : \
+ variable___ffs(word))
+
+static __always_inline unsigned long variable_ffz(unsigned long word)
{
asm("rep; bsf %1,%0"
: "=r" (word)
@@ -252,6 +251,17 @@ static __always_inline unsigned long ffz(unsigned long word)
return word;
}
+/**
+ * ffz - find first zero bit in word
+ * @word: The word to search
+ *
+ * Undefined if no zero exists, so code should check against ~0UL first.
+ */
+#define ffz(word) \
+ (__builtin_constant_p(word) ? \
+ (unsigned long)__builtin_ctzl(~word) : \
+ variable_ffz(word))
+
/*
* __fls: find last set bit in word
* @word: The word to search
--
2.35.1
On Fri, Aug 12, 2022 at 08:44:38PM +0900, Vincent Mailhol wrote:
> __ffs(x) is equivalent to (unsigned long)__builtin_ctzl(x)
Are you sure about this?
My gcc documentation says:
"Built-in Function: int __builtin_ctz (unsigned int x)
Returns the number of trailing 0-bits in x, starting at the least significant bit position. If x is 0, the result is undefined."
Note the undefined part.
Also,
__builtin_ctzl(0): 0x40
ffs(0): 0x0
I'm using the kernel ffs() version in a small program which is basically
a wrapper around BSF.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Tue, Aug 23, 2022 at 10:12:17AM -0700, Nick Desaulniers wrote:
> Callers of these need to guard against zero input, as the pre-existing
> comment notes:
>
> >> Undefined if no bit exists, so code should check against 0 first.
This is just silly.
And then there's
* ffs(value) returns 0 if value is 0 or the position of the first
* set bit if value is nonzero. The first (least significant) bit
* is at position 1.
*/
static __always_inline int ffs(int x)
Can we unify the two and move the guard against 0 inside the function
or, like ffs() does, have it return 0 if value is 0?
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Tue, Aug 23, 2022 at 9:23 AM Borislav Petkov <[email protected]> wrote:
>
> On Fri, Aug 12, 2022 at 08:44:38PM +0900, Vincent Mailhol wrote:
> > __ffs(x) is equivalent to (unsigned long)__builtin_ctzl(x)
>
> Are you sure about this?
>
> My gcc documentation says:
>
> "Built-in Function: int __builtin_ctz (unsigned int x)
>
> Returns the number of trailing 0-bits in x, starting at the least significant bit position. If x is 0, the result is undefined."
>
> Note the undefined part.
>
> Also,
>
> __builtin_ctzl(0): 0x40
> ffs(0): 0x0
>
> I'm using the kernel ffs() version in a small program which is basically
> a wrapper around BSF.
Callers of these need to guard against zero input, as the pre-existing
comment notes:
>> Undefined if no bit exists, so code should check against 0 first.
--
Thanks,
~Nick Desaulniers
On Wed. 24 Aug. 2022 at 02:43, Borislav Petkov <[email protected]> wrote:
> On Tue, Aug 23, 2022 at 10:12:17AM -0700, Nick Desaulniers wrote:
> > Callers of these need to guard against zero input, as the pre-existing
> > comment notes:
> >
> > >> Undefined if no bit exists, so code should check against 0 first.
If the fact that __ffs(0) is undefined is a concern, I can add a safety net:
#define __ffs(word)                                     \
        (__builtin_constant_p(word) ?                   \
         (unsigned long)__builtin_ctzl(word) +          \
         BUILD_BUG_ON_ZERO((word) == 0) :               \
         variable___ffs(word))
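(For reference: BUILD_BUG_ON_ZERO(e) evaluates to 0 but breaks the
build when e is a nonzero constant expression, which is why the guard
tests (word) == 0.)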
It will only catch constant expressions, but that is still better than
nothing (this remark also applies to the other functions that are
undefined when the argument is zero).
Also, if this aspect was unclear, I can add a comment in the commit
log to explain.
> This is just silly.
>
> And then there's
>
> * ffs(value) returns 0 if value is 0 or the position of the first
> * set bit if value is nonzero. The first (least significant) bit
> * is at position 1.
> */
> static __always_inline int ffs(int x)
>
> Can we unify the two and move the guard against 0 inside the function
> or, like ffs() does, have it return 0 if value is 0?
There is an index issue. __ffs() starts at 0 but ffs() starts at 1,
i.e. __ffs(0x01) is 0 but ffs(0x01) is 1.
Aside from the zero edge case, ffs(x) equals __ffs(x) + 1. This
explains why __ffs(0) is undefined.
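A small out-of-kernel illustration of that off-by-one relation (a
sketch using the builtins as stand-ins for the kernel helpers):

  #include <assert.h>

  int main(void)
  {
          assert(__builtin_ctzl(0x8UL) == 3); /* __ffs(): zero-based index  */
          assert(__builtin_ffs(0x8) == 4);    /* ffs(): one-based position  */
          assert(__builtin_ffs(0) == 0);      /* only ffs(0) is defined     */
          return 0;
  }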
Yours sincerely,
Vincent Mailhol
On Wed, Aug 24, 2022 at 05:31:20AM +0900, Vincent MAILHOL wrote:
> If the fact that __ffs(0) is undefined is a concern,
So what is of concern is I'm looking at those *ffs things and they look
like a real mess:
* Undefined if no bit exists, so code should check against 0 first.
*/
static __always_inline unsigned long __ffs(unsigned long word)
{
asm("rep; bsf %1,%0"
and that's TZCNT.
And nowhere in TZCNT's description does it talk about undefined behavior
- it is all defined.
So I have no clue what that comment is supposed to mean?
Then:
* ffs - find first set bit in word
* @x: the word to search
*
* This is defined the same way as the libc and compiler builtin ffs
* routines, therefore differs in spirit from the other bitops.
*
* ffs(value) returns 0 if value is 0 or the position of the first
* set bit if value is nonzero. The first (least significant) bit
* is at position 1.
while
"Built-in Function: int __builtin_ctz (unsigned int x)
Returns the number of trailing 0-bits in x, starting at the least significant bit position. If x is 0, the result is undefined."
as previously pasted.
So ffs() doesn't have undefined behavior either.
I guess it wants to say, it is undefined in the *respective* libc or
compiler helper implementation. And that should be explained.
> I can add a safety net:
Nah, no need. It seems this "behavior" has been the case a long time so
callers should know better (or have burned themselves properly :)).
> There is an index issue. __ffs() starts at 0 but ffs() starts at 1.
> i.e.: __ffs(0x01) is 0 but ffs(0x01) is 1.
> Aside from the zero edge case, ffs(x) equals __ffs(x) + 1. This
> explains why __ffs(0) is undefined.
I'd love to drop the undefined thing and start counting at 1 while
keeping the 0 case the special one.
But that ship has sailed a long time ago - look at all the __ffs() and
ffs() callers.
Back to your patch: I think the text should be fixed to say that both
ffs() and __ffs()'s kernel implementations don't have undefined results
but since they need to adhere to the libc variants' API, they treat 0
differently. They surely can handle 0 as input.
I.e., I'd like to see a comment there explaining the whole difference
between ffs() and __ffs() so that people are aware.
Btw, pls do
s/variable___ffs/variable__ffs/g
Two underscores are just fine.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Wed. 24 Aug 2022 at 17:43, Borislav Petkov <[email protected]> wrote:
> On Wed, Aug 24, 2022 at 05:31:20AM +0900, Vincent MAILHOL wrote:
> > If the fact that __ffs(0) is undefined is a concern,
>
> So what is of concern is I'm looking at those *ffs things and they look
> like a real mess:
I agree that the thing is a mess. Especially the naming: adding
underscores when the behaviour is different is misleading. I think
that ctzl() would have been a better name than __ffs().
> * Undefined if no bit exists, so code should check against 0 first.
> */
> static __always_inline unsigned long __ffs(unsigned long word)
> {
> asm("rep; bsf %1,%0"
>
> and that's TZCNT.
Not exactly, this is TZCNT for x86_64 but for x86, it will be BSF…
> And nowhere in TZCNT's description does it talk about undefined behavior
> - it is all defined.
>
> So I have no clue what that comment is supposed to mean?
It means that __ffs() is not an x86_64-specific function. Each
architecture is free to provide its own optimized implementation and is
free to ignore __ffs(0) because this is undefined.
For __ffs(0) to be defined, every architecture would have to produce the
same result, and this is not the case.
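For illustration, the arch-independent fallback in
include/asm-generic/bitops/__ffs.h is a plain C binary search,
condensed here from memory (and renamed, to avoid clashing with the
function under discussion), so treat it as a sketch:

  static __always_inline unsigned long generic___ffs(unsigned long word)
  {
          int num = 0;

  #if BITS_PER_LONG == 64
          if ((word & 0xffffffff) == 0) { num += 32; word >>= 32; }
  #endif
          if ((word & 0xffff) == 0) { num += 16; word >>= 16; }
          if ((word & 0xff) == 0)   { num += 8;  word >>= 8;  }
          if ((word & 0xf) == 0)    { num += 4;  word >>= 4;  }
          if ((word & 0x3) == 0)    { num += 2;  word >>= 2;  }
          if ((word & 0x1) == 0)    { num += 1; }
          return num;
  }

Fed 0, every mask test passes and this version happens to return
BITS_PER_LONG - 1 (63 on 64-bit), while the x86_64 rep; bsf version
returns 64: two different, individually well-defined results, which is
why the common API can only call __ffs(0) undefined.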
> Then:
>
> * ffs - find first set bit in word
> * @x: the word to search
> *
> * This is defined the same way as the libc and compiler builtin ffs
> * routines, therefore differs in spirit from the other bitops.
> *
> * ffs(value) returns 0 if value is 0 or the position of the first
> * set bit if value is nonzero. The first (least significant) bit
> * is at position 1.
>
> while
>
> "Built-in Function: int __builtin_ctz (unsigned int x)
>
> Returns the number of trailing 0-bits in x, starting at the least significant bit position. If x is 0, the result is undefined."
>
> as previously pasted.
>
> So ffs() doesn't have undefined behavior either.
>
> I guess it wants to say, it is undefined in the *respective* libc or
> compiler helper implementation. And that should be explained.
>
> > I can add a safety net:
>
> Nah, no need. It seems this "behavior" has been the case a long time so
> callers should know better (or have burned themselves properly :)).
>
> > There is an index issue. __ffs() starts at 0 but ffs() starts at 1.
> > i.e.: __ffs(0x01) is 0 but ffs(0x01) is 1.
> > Aside from the zero edge case, ffs(x) equals __ffs(x) + 1. This
> > explains why __ffs(0) is undefined.
>
> I'd love to drop the undefined thing and start counting at 1 while
> keeping the 0 case the special one.
>
> But that ship has sailed a long time ago - look at all the __ffs() and
> ffs() callers.
ACK. I do not believe that this is something which can be changed now.
At least, I am not willing to start such a crusade.
> Back to your patch: I think the text should be fixed to say that both
> ffs() and __ffs()'s kernel implementations don't have undefined results
NACK. __ffs(0) is undefined behaviour (c.f. the TZCNT instruction for
x86_64 and the BSF instruction for x86). Even if x86_64 and x86 had the
same behaviour, that would still not be OK, as it may fool developers
into believing that __ffs(0) is defined kernel-wide and would result
in non-portable code.
> but since they need to adhere to the libc variants' API, they treat 0
> differently. They surely can handle 0 as input.
>
> I.e., I'd like to see a comment there explaining the whole difference
> between ffs() and __ffs() so that people are aware.
This would be helpful, but the priority would then be to modify asm-generic:
https://elixir.bootlin.com/linux/latest/source/include/asm-generic/bitops/__ffs.h#L11
Regardless, I do not think that the comments of __ffs() and ffs() are
related to this patch series.
> Btw, pls do
>
> s/variable___ffs/variable__ffs/g
>
> Two underscores are just fine.
OK for me. The rationale was to name it variable_<function_name>()
thus the three underscores. But I will also be happy with two
underscores.
Yours sincerely,
Vincent Mailhol
On Wed, Aug 24, 2022 at 09:10:59PM +0900, Vincent MAILHOL wrote:
> Not exactly, this is TZCNT for x86_64 but for x86, it will be BSF…
Not x86 - some old models which do not understand TZCNT, I'm being told.
And I'm being also told, "Intel and AMD disagree on what BSF does when
passed 0". So this is more mess.
> It means that __ffs() is not an x86_64-specific function. Each
No, not that. The comment "Undefined if no bit exists".
On my machine, __ffs(0) - the way it is implemented:
rep; bsf %1,%0
is well-defined:
"If the input operand is zero, CF is set to 1 and the size (in bits) of
the input operand is written to the destination register. Otherwise, CF
is cleared."
Leading to
__ffs(0): 0x40
i.e., input operand of 64 bits.
So on this particular x86 implementation, TZCNT(0) is well defined.
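(For readers who want to reproduce the observation, a minimal userspace
probe -- an assumption of what such a wrapper looks like, not Boris's
actual program -- could be:

  #include <stdio.h>

  /* Same asm as the kernel's x86_64 __ffs(). */
  static unsigned long kernel___ffs(unsigned long word)
  {
          asm("rep; bsf %1,%0" : "=r" (word) : "rm" (word));
          return word;
  }

  int main(void)
  {
          /* Prints 0x40 on CPUs that execute REP;BSF as TZCNT. */
          printf("__ffs(0): %#lx\n", kernel___ffs(0));
          return 0;
  }
)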
So I'd like for that "undefined" thing to be expanded upon and
explained. Something along the lines of "the libc/compiler primitives'
*ffs(0) is undefined. Our inline asm helpers adhere to that behavior
even if the result they return for input operand of 0 is very well
defined."
Now, if there are some machines which do not adhere to the current hw
behavior, then they should be ALTERNATIVEd.
Better?
> > Back to your patch: I think the text should be fixed to say that both
> > ffs() and __ffs()'s kernel implementations don't have undefined results
>
> NACK. __ffs(0) is an undefined behaviour (c.f. TZCNT instruction for
NACK, SCHMACK. Read my mail again: "I think the text should be fixed".
The *text* - not __ffs(0) itself. The *text* should be fixed to explain
what undefined means. See above too.
IOW, to start explaining this humongous mess I've scratched the surface
of.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Wed. 24 Aug. 2022 at 22:24, Borislav Petkov <[email protected]> wrote:
> On Wed, Aug 24, 2022 at 09:10:59PM +0900, Vincent MAILHOL wrote:
> > Not exactly, this is TZCNT for x86_64 but for x86, it will be BSF…
>
> Not x86 - some old models which do not understand TZCNT, I'm being told.
ACK.
> And I'm being also told, "Intel and AMD disagree on what BSF does when
> passed 0". So this is more mess.
ACK.
> > It means that __ffs() is not an x86_64-specific function. Each
>
> No, not that. The comment "Undefined if no bit exists".
>
> On my machine, __ffs(0) - the way it is implemented:
>
> rep; bsf %1,%0
>
> is well-defined:
>
> "If the input operand is zero, CF is set to 1 and the size (in bits) of
> the input operand is written to the destination register. Otherwise, CF
> is cleared."
It is well defined on *your* machine.
On some other machines, it might be undefined:
"If the content of the source operand is 0, the content of the
destination operand is undefined."
https://www.felixcloutier.com/x86/bsf
> Leading to
>
> __ffs(0): 0x40
>
> i.e., input operand of 64 bits.
>
> So on this particular x86 implementation, TZCNT(0) is well defined.
This is where I do not follow you. Granted, on most recent machines the
rep; bsf sequence executes as TZCNT, and that instruction is well
defined for zero. But on some older machines it executes as plain BSF,
and on a subset of those machines, BSF(0) might be undefined.
> So I'd like for that "undefined" thing to be expanded upon and
> explained. Something along the lines of "the libc/compiler primitives'
> *ffs(0) is undefined. Our inline asm helpers adhere to that behavior
> even if the result they return for input operand of 0 is very well
> defined."
>
> Now, if there are some machines which do not adhere to the current hw
> behavior, then they should be ALTERNATIVEd.
>
> Better?
>
> > > Back to your patch: I think the text should be fixed to say that both
> > > ffs() and __ffs()'s kernel implementations don't have undefined results
> >
> > NACK. __ffs(0) is undefined behaviour (c.f. the TZCNT instruction for
>
> NACK, SCHMACK. Read my mail again: "I think the text should be fixed".
> The *text* - not __ffs(0) itself. The *text* should be fixed to explain
> what undefined means. See above too.
>
> IOW, to start explaining this humongous mess I've scratched the surface
> of.
Agree that this is only the surface. But, my patch series is about
constant folding, not about the text of *ffs(). Here, I just *move*
the existing text, I did not modify anything.
Can we agree that this is a separate topic? I do not think I am the
right person to fix that mess (and in all honesty, I am not an expert
in this domain and I am afraid I would just waste your time if I had
to work on this).
Yours sincerely,
Vincent Mailhol
The compilers provide some builtin expression equivalent to the ffs(),
__ffs() and ffz() functions of the kernel. The kernel uses optimized
assembly which produces better code than the builtin
functions. However, such assembly code can not be folded when used
with constant expressions.
This series relies on __builtin_constant_p to select the optimal solution:
* use kernel assembly for non constant expressions
* use compiler's __builtin function for constant expressions.
** Statistics **
Patch 1/2 optimizes 26.7% of ffs() calls and patch 2/2 optimizes 27.9%
of __ffs() and ffz() calls (details of the calculation in each patch).
** Changelog **
v5 -> v6:
* Rename variable___ffs() into variable__ffs() (two underscores
instead of three)
v4 -> v5:
* (no changes on code, only commit comment was modified)
* Rewrite the commit log:
- Use two spaces instead of `| ' to indent code snippets.
- Do not use `we'.
- Do not use `this patch' in the commit description. Instead,
use imperative tone.
Link: https://lore.kernel.org/all/[email protected]/
v3 -> v4:
* (no changes on code, only commit comment was modified)
* Remove note and link to Nick's message in patch 1/2, c.f.:
Link: https://lore.kernel.org/all/CAKwvOdnnDaiJcV1gr9vV+ya-jWxx7+2KJNTDThyFctVDOgt9zQ@mail.gmail.com/
* Add Reviewed-by: Nick Desaulniers <[email protected]> tag in
patch 2/2.
v2 -> v3:
* Redacted out the instructions after ret and before next function
in the assembly output.
* Added a note and a link to Nick's message on the constant
propagation missed-optimization in clang:
Link: https://lore.kernel.org/all/CAKwvOdnH_gYv4qRN9pKY7jNTQK95xNeH1w1KZJJmvCkh8xJLBg@mail.gmail.com/
* Fix copy/paste typo in statistics of patch 1/2. Number of
occurrences before patches are 1081 and not 3607 (percentage
reduction of 26.7% remains correct)
* Rename the functions as follow:
- __variable_ffs() -> variable___ffs()
- __variable_ffz() -> variable_ffz()
* Add Reviewed-by: Nick Desaulniers <[email protected]> tag in
patch 1/2.
Vincent Mailhol (2):
x86/asm/bitops: ffs: use __builtin_ffs to evaluate constant
expressions
x86/asm/bitops: __ffs,ffz: use __builtin_ctzl to evaluate constant
expressions
arch/x86/include/asm/bitops.h | 64 +++++++++++++++++++++--------------
1 file changed, 38 insertions(+), 26 deletions(-)
--
2.35.1
For x86_64, the current ffs() implementation does not produce
optimized code when called with a constant expression. On the
contrary, the __builtin_ffs() functions of both GCC and clang are able
to fold the expression into a single instruction.
** Example **
Let's consider two dummy functions foo() and bar() as below:
#include <linux/bitops.h>
#define CONST 0x01000000
unsigned int foo(void)
{
        return ffs(CONST);
}

unsigned int bar(void)
{
        return __builtin_ffs(CONST);
}
GCC would produce below assembly code:
0000000000000000 <foo>:
0: ba 00 00 00 01 mov $0x1000000,%edx
5: b8 ff ff ff ff mov $0xffffffff,%eax
a: 0f bc c2 bsf %edx,%eax
d: 83 c0 01 add $0x1,%eax
10: c3 ret
<Instructions after ret and before next function were redacted>
0000000000000020 <bar>:
20: b8 19 00 00 00 mov $0x19,%eax
25: c3 ret
And clang would produce:
0000000000000000 <foo>:
0: b8 ff ff ff ff mov $0xffffffff,%eax
5: 0f bc 05 00 00 00 00 bsf 0x0(%rip),%eax # c <foo+0xc>
c: 83 c0 01 add $0x1,%eax
f: c3 ret
0000000000000010 <bar>:
10: b8 19 00 00 00 mov $0x19,%eax
15: c3 ret
Both examples clearly demonstrate the benefit of using __builtin_ffs()
instead of the kernel's asm implementation for constant expressions.
However, for non constant expressions, the ffs() asm version of the
kernel remains better for x86_64 because, contrary to GCC, it doesn't
emit the CMOV assembly instruction, c.f. [1] (noticeably, clang is
able to optimize out the CMOV call).
Use __builtin_constant_p() to select between the kernel's ffs() and
the __builtin_ffs() depending on whether the argument is constant or
not.
As a side benefit, replacing the ffs() function declaration by a macro
also removes below -Wshadow warning:
./arch/x86/include/asm/bitops.h:283:28: warning: declaration of 'ffs' shadows a built-in function [-Wshadow]
283 | static __always_inline int ffs(int x)
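(A minimal reproduction sketch of that warning -- a hypothetical
standalone file, not taken from the patch:

  /* gcc -Wshadow -O2 -c shadow.c */
  static inline int ffs(int x)    /* warning: shadows a built-in function */
  {
          return x ? __builtin_ctz(x) + 1 : 0;
  }

The macro form avoids it because the preprocessor rewrites the call
sites before the compiler ever sees an identifier named after the
builtin.)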
** Statistics **
On an allyesconfig, before...:
$ objdump -d vmlinux.o | grep bsf | wc -l
1081
...and after:
$ objdump -d vmlinux.o | grep bsf | wc -l
792
So, roughly 26.7% of the calls to ffs() were using constant
expressions and could be optimized out.
(tests done on linux v5.18-rc5 x86_64 using GCC 11.2.1)
[1] commit ca3d30cc02f7 ("x86_64, asm: Optimise fls(), ffs() and fls64()")
Link: http://lkml.kernel.org/r/[email protected]
Reviewed-by: Nick Desaulniers <[email protected]>
Signed-off-by: Vincent Mailhol <[email protected]>
---
arch/x86/include/asm/bitops.h | 26 ++++++++++++++------------
1 file changed, 14 insertions(+), 12 deletions(-)
diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index a288ecd230ab..6ed979547086 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -269,18 +269,7 @@ static __always_inline unsigned long __fls(unsigned long word)
#undef ADDR
#ifdef __KERNEL__
-/**
- * ffs - find first set bit in word
- * @x: the word to search
- *
- * This is defined the same way as the libc and compiler builtin ffs
- * routines, therefore differs in spirit from the other bitops.
- *
- * ffs(value) returns 0 if value is 0 or the position of the first
- * set bit if value is nonzero. The first (least significant) bit
- * is at position 1.
- */
-static __always_inline int ffs(int x)
+static __always_inline int variable_ffs(int x)
{
int r;
@@ -310,6 +299,19 @@ static __always_inline int ffs(int x)
return r + 1;
}
+/**
+ * ffs - find first set bit in word
+ * @x: the word to search
+ *
+ * This is defined the same way as the libc and compiler builtin ffs
+ * routines, therefore differs in spirit from the other bitops.
+ *
+ * ffs(value) returns 0 if value is 0 or the position of the first
+ * set bit if value is nonzero. The first (least significant) bit
+ * is at position 1.
+ */
+#define ffs(x) (__builtin_constant_p(x) ? __builtin_ffs(x) : variable_ffs(x))
+
/**
* fls - find last set bit in word
* @x: the word to search
--
2.35.1
On Wed, Aug 31, 2022 at 04:57:40PM +0900, Vincent Mailhol wrote:
> The compilers provide some builtin expression equivalent to the ffs(),
> __ffs() and ffz() functions of the kernel. The kernel uses optimized
> assembly which produces better code than the builtin
> functions. However, such assembly code can not be folded when used
> with constant expressions.
>
> This series relies on __builtin_constant_p to select the optimal solution:
>
> * use kernel assembly for non constant expressions
>
> * use compiler's __builtin function for constant expressions.
>
>
> ** Statistics **
>
> Patch 1/2 optimizes 26.7% of ffs() calls and patch 2/2 optimizes 27.9%
> of __ffs() and ffz() calls (details of the calculation in each patch).
Hi Vincent,
Can you please add a test for this? We've recently added a very similar
test_bitmap_const_eval() in lib/test_bitmap.c.
dc34d5036692c ("lib: test_bitmap: add compile-time optimization/evaluations
assertions")
Would be nice to have something like this for ffs() and ffz() in
lib/test_bitops.c.
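(A rough sketch -- names assumed, in the spirit of dc34d5036692c -- of
what such an assertion could look like once the series is applied:

  /* In lib/test_bitops.c: with the new macros, ffs()/ffz() of a
   * constant become integer constant expressions, so the folding can
   * be asserted at compile time. */
  static void __init test_ffs_const_eval(void)
  {
          BUILD_BUG_ON(ffs(BIT(10)) != 11);       /* ffs() is 1-based */
          BUILD_BUG_ON(ffz(~BIT(10)) != 10);      /* ffz() is 0-based */
  }
)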
Please keep me in loop in case of new versions.
Thanks,
Yury
On Wed, Aug 31, 2022 at 01:54:01AM -0700, Yury Norov wrote:
> On Wed, Aug 31, 2022 at 04:57:40PM +0900, Vincent Mailhol wrote:
> > The compilers provide some builtin expression equivalent to the ffs(),
> > __ffs() and ffz() functions of the kernel. The kernel uses optimized
> > assembly which produces better code than the builtin
> > functions. However, such assembly code can not be folded when used
> > with constant expressions.
> >
> > This series relies on __builtin_constant_p to select the optimal solution:
> >
> > * use kernel assembly for non constant expressions
> >
> > * use compiler's __builtin function for constant expressions.
> >
> >
> > ** Statistics **
> >
> > Patch 1/2 optimizes 26.7% of ffs() calls and patch 2/2 optimizes 27.9%
> > of __ffs() and ffz() calls (details of the calculation in each patch).
>
> Hi Vincent,
>
> Can you please add a test for this? We've recently added a very similar
> test_bitmap_const_eval() in lib/test_bitmap.c.
>
> dc34d5036692c ("lib: test_bitmap: add compile-time optimization/evaluations
> assertions")
>
> Would be nice to have something like this for ffs() and ffz() in
> lib/test_bitops.c.
>
> Please keep me in loop in case of new versions.
Also, what about fls? Is there any difference with ffs/ffz wrt compile
time optimizations? If not, would be great if the series will take
care of it too.
Thanks,
Yury
On Tue. 1 sept. 2022 at 12:49, Yury Norov <[email protected]> wrote:
> On Wed, Aug 31, 2022 at 01:54:01AM -0700, Yury Norov wrote:
> > On Wed, Aug 31, 2022 at 04:57:40PM +0900, Vincent Mailhol wrote:
> > > The compilers provide some builtin expression equivalent to the ffs(),
> > > __ffs() and ffz() functions of the kernel. The kernel uses optimized
> > > assembly which produces better code than the builtin
> > > functions. However, such assembly code can not be folded when used
> > > with constant expressions.
> > >
> > > This series relies on __builtin_constant_p to select the optimal solution:
> > >
> > > * use kernel assembly for non constant expressions
> > >
> > > * use compiler's __builtin function for constant expressions.
> > >
> > >
> > > ** Statistics **
> > >
> > > Patch 1/2 optimizes 26.7% of ffs() calls and patch 2/2 optimizes 27.9%
> > > of __ffs() and ffz() calls (details of the calculation in each patch).
> >
> > Hi Vincent,
> >
> > Can you please add a test for this? We've recently added a very similar
> > test_bitmap_const_eval() in lib/test_bitmap.c.
> >
> > dc34d5036692c ("lib: test_bitmap: add compile-time optimization/evaluations
> > assertions")
> >
> > Would be nice to have something like this for ffs() and ffz() in
> > lib/test_bitops.c.
> >
> > Please keep me in loop in case of new versions.
Hi Yury,
My patch only takes care of the x86 architecture. Assuming some other
architectures are not optimized yet, adding such a test might break
some builds. I am fine with adding the test, however, I will not write
patches for the other architectures because I do not have the
environment to compile and test them.
Does it still make sense to add the test before fixing all the architectures?
> Also, what about fls? Is there any difference with ffs/ffz wrt compile
> time optimizations? If not, would be great if the series will take
> care of it too.
Agree. The fls() and fls64() can use __builtin_clz() and
__builtin_clzll(). However, those two functions are a bit less
trivial. I wanted to have this series approved first before
working on *fls*().
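To illustrate why (a sketch only, not part of this series): because
fls() must return 0 for an input of 0, the constant branch needs an
explicit guard, something like:

  #define fls(x)                                          \
          (__builtin_constant_p(x) ?                      \
           ((x) ? 32 - __builtin_clz(x) : 0) :            \
           variable_fls(x))

whereas ffs(), __ffs() and ffz() each map directly onto a single
builtin.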
Yours sincerely,
Vincent Mailhol
On Thu, Sep 01, 2022 at 07:30:10PM +0900, Vincent MAILHOL wrote:
> On Tue. 1 sept. 2022 at 12:49, Yury Norov <[email protected]> wrote:
> > On Wed, Aug 31, 2022 at 01:54:01AM -0700, Yury Norov wrote:
> > > On Wed, Aug 31, 2022 at 04:57:40PM +0900, Vincent Mailhol wrote:
> > > > The compilers provide some builtin expression equivalent to the ffs(),
> > > > __ffs() and ffz() functions of the kernel. The kernel uses optimized
> > > > assembly which produces better code than the builtin
> > > > functions. However, such assembly code can not be folded when used
> > > > with constant expressions.
> > > >
> > > > This series relies on __builtin_constant_p to select the optimal solution:
> > > >
> > > > * use kernel assembly for non constant expressions
> > > >
> > > > * use compiler's __builtin function for constant expressions.
> > > >
> > > >
> > > > ** Statistics **
> > > >
> > > > Patch 1/2 optimizes 26.7% of ffs() calls and patch 2/2 optimizes 27.9%
> > > > of __ffs() and ffz() calls (details of the calculation in each patch).
> > >
> > > Hi Vincent,
> > >
> > > Can you please add a test for this? We've recently added a very similar
> > > test_bitmap_const_eval() in lib/test_bitmap.c.
> > >
> > > dc34d5036692c ("lib: test_bitmap: add compile-time optimization/evaluations
> > > assertions")
> > >
> > > Would be nice to have something like this for ffs() and ffz() in
> > > lib/test_bitops.c.
> > >
> > > Please keep me in loop in case of new versions.
>
> Hi Yury,
>
> My patch only takes care of the x86 architecture.
OK, I just realized that you started submitting this at least back in May.
For me, v6 is good enough and well-described. So, for the series:
Reviewed-by: Yury Norov <[email protected]>
How are you going to merge it? If you haven't a specific tree in mind
already, I can take it in my bitmap tree because ffs and ffz are closely
related to find_bit() functions.
> Assuming some other
> architectures are not optimized yet, adding such a test might break
> some builds. I am fine with adding the test, however, I will not write
> patches for the other architectures because I do not have the
> environment to compile and test them.
>
> Does it still make sense to add the test before fixing all the architectures?
An all-arches fix should begin with changing the ffs design. Namely, there
should be a generic ffs() in include/linux/bitops.h, and arch-specific
arch__ffs() in arch/xxx/include/asm/bitops.h; like we do for the set_bit()
family. I have a feeling that it's far beyond the scope of your series.
The test is a different story. Good tests are always welcome, even if
they don't cover all the arches.
> > Also, what about fls? Is there any difference with ffs/ffz wrt compile
> > time optimizations? If not, would be great if the series will take
> > care of it too.
>
> Agree. The fls() and fls64() can use __builtin_clz() and
> __builtin_clzll(). However, those two functions are a bit less
> trivial. I wanted to have this series approved first before
> working on *fls*().
OK, the test and fls() can be a matter of a follow-up series, taking
into account how long these 2 patches have been moving.
Thanks,
Yury
On Thu, Sep 1, 2022 at 7:19 AM Yury Norov <[email protected]> wrote:
>
> On Thu, Sep 01, 2022 at 07:30:10PM +0900, Vincent MAILHOL wrote:
> > On Tue. 1 sept. 2022 at 12:49, Yury Norov <[email protected]> wrote:
> > > On Wed, Aug 31, 2022 at 01:54:01AM -0700, Yury Norov wrote:
> > > > On Wed, Aug 31, 2022 at 04:57:40PM +0900, Vincent Mailhol wrote:
> > > > > The compilers provide some builtin expression equivalent to the ffs(),
> > > > > __ffs() and ffz() functions of the kernel. The kernel uses optimized
> > > > > assembly which produces better code than the builtin
> > > > > functions. However, such assembly code can not be folded when used
> > > > > with constant expressions.
> > > > >
> > > > > This series relies on __builtin_constant_p to select the optimal solution:
> > > > >
> > > > > * use kernel assembly for non constant expressions
> > > > >
> > > > > * use compiler's __builtin function for constant expressions.
> > > > >
> > > > >
> > > > > ** Statistics **
> > > > >
> > > > > Patch 1/2 optimizes 26.7% of ffs() calls and patch 2/2 optimizes 27.9%
> > > > > of __ffs() and ffz() calls (details of the calculation in each patch).
> > > >
> > > > Hi Vincent,
> > > >
> > > > Can you please add a test for this? We've recently added a very similar
> > > > test_bitmap_const_eval() in lib/test_bitmap.c.
> > > >
> > > > dc34d5036692c ("lib: test_bitmap: add compile-time optimization/evaluations
> > > > assertions")
> > > >
> > > > Would be nice to have something like this for ffs() and ffz() in
> > > > lib/test_bitops.c.
> > > >
> > > > Please keep me in loop in case of new versions.
> >
> > Hi Yury,
> >
> > My patch only takes care of the x86 architecture.
>
> OK, I just realized that you started submitting this at least back in May.
>
> For me, v6 is good enough and well-described. So, for the series:
> Reviewed-by: Yury Norov <[email protected]>
>
> How are you going to merge it? If you haven't a specific tree in mind
> already, I can take it in my bitmap tree because ffs and ffz are closely
> related to find_bit() functions.
Unless Boris feels strongly enough to nack the series, I think Yury
accepting the series is the way to go.
--
Thanks,
~Nick Desaulniers
On Thu. 1 Sep. 2022 at 23:19, Yury Norov <[email protected]> wrote:
> On Thu, Sep 01, 2022 at 07:30:10PM +0900, Vincent MAILHOL wrote:
> > On Tue. 1 sept. 2022 at 12:49, Yury Norov <[email protected]> wrote:
> > > On Wed, Aug 31, 2022 at 01:54:01AM -0700, Yury Norov wrote:
> > > > On Wed, Aug 31, 2022 at 04:57:40PM +0900, Vincent Mailhol wrote:
> > > > > The compilers provide some builtin expression equivalent to the ffs(),
> > > > > __ffs() and ffz() functions of the kernel. The kernel uses optimized
> > > > > assembly which produces better code than the builtin
> > > > > functions. However, such assembly code can not be folded when used
> > > > > with constant expressions.
> > > > >
> > > > > This series relies on __builtin_constant_p to select the optimal solution:
> > > > >
> > > > > * use kernel assembly for non constant expressions
> > > > >
> > > > > * use compiler's __builtin function for constant expressions.
> > > > >
> > > > >
> > > > > ** Statistics **
> > > > >
> > > > > Patch 1/2 optimizes 26.7% of ffs() calls and patch 2/2 optimizes 27.9%
> > > > > of __ffs() and ffz() calls (details of the calculation in each patch).
> > > >
> > > > Hi Vincent,
> > > >
> > > > Can you please add a test for this? We've recently added a very similar
> > > > test_bitmap_const_eval() in lib/test_bitmap.c.
> > > >
> > > > dc34d5036692c ("lib: test_bitmap: add compile-time optimization/evaluations
> > > > assertions")
> > > >
> > > > Would be nice to have something like this for ffs() and ffz() in
> > > > lib/test_bitops.c.
> > > >
> > > > Please keep me in loop in case of new versions.
> >
> > Hi Yury,
> >
> > My patch only takes care of the x86 architecture.
>
> OK, I just realized that you started submitting this at least back in May.
>
> For me, v6 is good enough and well-described. So, for the series:
> Reviewed-by: Yury Norov <[email protected]>
Thanks for the review!
> How are you going to merge it? If you haven't a specific tree in mind
> already, I can take it in my bitmap tree because ffs and ffz are closely
> related to find_bit() functions.
I never thought of a specific tree. I just CCed the x86 architecture
maintainers according to get_maintainer.pl and was expecting it to go
through the x86/asm branch of the tip tree. But I am perfectly fine if
it goes through your tree.
So, same as Nick's comment below: unless Borislav still has concerns
about the v6, please take it in your tree.
> > Assuming some other
> > architectures are not optimized yet, adding such a test might break
> > some builds. I am fine with adding the test, however, I will not write
> > patches for the other architectures because I do not have the
> > environment to compile and test them.
> >
> > Does it still make sense to add the test before fixing all the architectures?
>
> An all-arches fix should begin with changing the ffs design. Namely, there
> should be a generic ffs() in include/linux/bitops.h,
Currently, the generic fls, ffs, ffz are under:
include/asm-generic/bitops
In particular, here is the generic ffs():
https://elixir.bootlin.com/linux/latest/source/include/asm-generic/bitops/ffs.h
Isn't this sufficient?
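(For reference, that generic ffs() is a plain C binary search -- quoted
here from memory in condensed form, so treat it as a sketch -- which,
unlike __ffs(), explicitly guards the zero case and counts from 1:

  static inline int generic_ffs(int x)
  {
          int r = 1;

          if (!x)
                  return 0;
          if (!(x & 0xffff)) { x >>= 16; r += 16; }
          if (!(x & 0xff))   { x >>= 8;  r += 8;  }
          if (!(x & 0xf))    { x >>= 4;  r += 4;  }
          if (!(x & 3))      { x >>= 2;  r += 2;  }
          if (!(x & 1))      { r += 1; }
          return r;
  }
)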
> and arch-specific
> arch__ffs() in arch/xxx/include/asm/bitops.h; like we do for the set_bit()
> family. I have a feeling that it's far beyond the scope of your series.
>
> The test is a different story. Good tests are always welcome, even if
> they don't cover all the arches.
ACK. I will add the test in a different patch *after* this series gets
accepted. But to be clear, I will not fix other architectures.
> > > Also, what about fls? Is there any difference with ffs/ffz wrt compile
> > > time optimizations? If not, would be great if the series will take
> > > care of it too.
> >
> > Agree. The fls() and fls64() can use __builtin_clz() and
> > __builtin_clzll(). However, those two functions are a bit less
> > trivial. I wanted to have this series approved first before
> > working on *fls*().
>
> OK, the test and fls() can be a matter of a follow-up series, taking
> into account how long these 2 patches have been moving.
ACK.
> Thanks,
> Yury
On Tue. 31 Aug 2022 at 17:51, Yury Norov <[email protected]> wrote:
> On Wed, Aug 31, 2022 at 04:57:40PM +0900, Vincent Mailhol wrote:
(...)
> Please keep me in loop in case of new versions.
A side comment, but if you do not want to miss such discussions again
in the future, why not do this?
diff --git a/MAINTAINERS b/MAINTAINERS
index 56af0182a93b..f6caf4252799 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3602,6 +3602,7 @@ M: Yury Norov <[email protected]>
R: Andy Shevchenko <[email protected]>
R: Rasmus Villemoes <[email protected]>
S: Maintained
+F: arch/*/include/asm/bitops.h
F: include/linux/bitmap.h
F: include/linux/cpumask.h
F: include/linux/find.h
N.B. this is just a suggestion, please feel free to discard it.
> Thanks,
> Yury
On Thu, Sep 01, 2022 at 10:06:48AM -0700, Nick Desaulniers wrote:
> Unless Boris feels strongly enough to nack the series, I think Yury
> accepting the series is the way to go.
arch/x86/ patches go through the tip tree unless there's a compelling
reason to carry them through another. I don't see one here.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
The compilers provide some builtin expression equivalent to the ffs(),
__ffs() and ffz() functions of the kernel. The kernel uses optimized
assembly which produces better code than the builtin
functions. However, such assembly code can not be folded when used
with constant expressions.
This series relies on __builtin_constant_p to select the optimal solution:
* use kernel assembly for non constant expressions
* use compiler's __builtin function for constant expressions.
** Statistics **
Patch 1/2 optimizes 26.7% of ffs() calls and patch 2/2 optimizes 27.9%
of __ffs() and ffz() calls (details of the calculation in each patch).
** Changelog **
v6 -> v7:
* (no changes on code, only commit tag was modified)
* Add Reviewed-by: Yury Norov <[email protected]> in both patches
v5 -> v6:
* Rename variable___ffs() into variable__ffs() (two underscores
instead of three)
v4 -> v5:
* (no changes on code, only commit comment was modified)
* Rewrite the commit log:
- Use two spaces instead of `| ' to indent code snippets.
- Do not use `we'.
- Do not use `this patch' in the commit description. Instead,
use imperative tone.
Link: https://lore.kernel.org/all/[email protected]/
v3 -> v4:
* (no changes on code, only commit comment was modified)
* Remove note and link to Nick's message in patch 1/2, c.f.:
Link: https://lore.kernel.org/all/CAKwvOdnnDaiJcV1gr9vV+ya-jWxx7+2KJNTDThyFctVDOgt9zQ@mail.gmail.com/
* Add Reviewed-by: Nick Desaulniers <[email protected]> tag in
patch 2/2.
v2 -> v3:
* Redacted out the instructions after ret and before next function
in the assembly output.
* Added a note and a link to Nick's message on the constant
propagation missed-optimization in clang:
Link: https://lore.kernel.org/all/CAKwvOdnH_gYv4qRN9pKY7jNTQK95xNeH1w1KZJJmvCkh8xJLBg@mail.gmail.com/
* Fix copy/paste typo in statistics of patch 1/2. Number of
occurrences before patches are 1081 and not 3607 (percentage
reduction of 26.7% remains correct)
* Rename the functions as follow:
- __variable_ffs() -> variable___ffs()
- __variable_ffz() -> variable_ffz()
* Add Reviewed-by: Nick Desaulniers <[email protected]> tag in
patch 1/2.
Vincent Mailhol (2):
x86/asm/bitops: ffs: use __builtin_ffs to evaluate constant
expressions
x86/asm/bitops: __ffs,ffz: use __builtin_ctzl to evaluate constant
expressions
arch/x86/include/asm/bitops.h | 64 +++++++++++++++++++++--------------
1 file changed, 38 insertions(+), 26 deletions(-)
--
2.35.1
__ffs(x) is equivalent to (unsigned long)__builtin_ctzl(x) and ffz(x)
is equivalent to (unsigned long)__builtin_ctzl(~x). Because
__builtin_ctzl() returns an int, a cast to (unsigned long) is
necessary to avoid potential warnings on implicit casts.
For x86_64, the current __ffs() and ffz() implementations do not
produce optimized code when called with a constant expression. On the
contrary, the __builtin_ctzl() folds into a single instruction.
However, for non constant expressions, the __ffs() and ffz() asm
versions of the kernel remain slightly better than the code produced
by GCC (it produces a useless instruction to clear eax).
Use __builtin_constant_p() to select between the kernel's
__ffs()/ffz() and the __builtin_ctzl() depending on whether the
argument is constant or not.
** Statistics **
On an allyesconfig, before...:
$ objdump -d vmlinux.o | grep tzcnt | wc -l
3607
...and after:
$ objdump -d vmlinux.o | grep tzcnt | wc -l
2600
So, roughly 27.9% of the calls to either __ffs() or ffz() were using
constant expressions and could be optimized out.
(tests done on linux v5.18-rc5 x86_64 using GCC 11.2.1)
Note: on x86_64, the asm bsf instruction produces tzcnt when used with
the rep prefix (which explains the use of `grep tzcnt' instead of `grep
bsf' in the above benchmark). c.f. [1]
[1] commit e26a44a2d618 ("x86: Use REP BSF unconditionally")
Link: http://lkml.kernel.org/r/[email protected]
Reviewed-by: Nick Desaulniers <[email protected]>
Reviewed-by: Yury Norov <[email protected]>
Signed-off-by: Vincent Mailhol <[email protected]>
---
arch/x86/include/asm/bitops.h | 38 ++++++++++++++++++++++-------------
1 file changed, 24 insertions(+), 14 deletions(-)
diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 879238e5a6a0..08eca2938f27 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -247,13 +247,7 @@ arch_test_bit_acquire(unsigned long nr, const volatile unsigned long *addr)
variable_test_bit(nr, addr);
}
-/**
- * __ffs - find first set bit in word
- * @word: The word to search
- *
- * Undefined if no bit exists, so code should check against 0 first.
- */
-static __always_inline unsigned long __ffs(unsigned long word)
+static __always_inline unsigned long variable__ffs(unsigned long word)
{
asm("rep; bsf %1,%0"
: "=r" (word)
@@ -261,13 +255,18 @@ static __always_inline unsigned long __ffs(unsigned long word)
return word;
}
-/**
- * ffz - find first zero bit in word
- * @word: The word to search
- *
- * Undefined if no zero exists, so code should check against ~0UL first.
- */
-static __always_inline unsigned long ffz(unsigned long word)
+/**
+ * __ffs - find first set bit in word
+ * @word: The word to search
+ *
+ * Undefined if no bit exists, so code should check against 0 first.
+ */
+#define __ffs(word) \
+ (__builtin_constant_p(word) ? \
+ (unsigned long)__builtin_ctzl(word) : \
+ variable__ffs(word))
+
+static __always_inline unsigned long variable_ffz(unsigned long word)
{
asm("rep; bsf %1,%0"
: "=r" (word)
@@ -275,6 +274,17 @@ static __always_inline unsigned long ffz(unsigned long word)
return word;
}
+/**
+ * ffz - find first zero bit in word
+ * @word: The word to search
+ *
+ * Undefined if no zero exists, so code should check against ~0UL first.
+ */
+#define ffz(word) \
+ (__builtin_constant_p(word) ? \
+ (unsigned long)__builtin_ctzl(~word) : \
+ variable_ffz(word))
+
/*
* __fls: find last set bit in word
* @word: The word to search
--
2.35.1
On Sun, Sep 4, 2022 at 5:38 PM Vincent Mailhol
<[email protected]> wrote:
>
> The compilers provide some builtin expression equivalent to the ffs(),
> __ffs() and ffz() functions of the kernel. The kernel uses optimized
> assembly which produces better code than the builtin
> functions. However, such assembly code can not be folded when used
> with constant expressions.
Another tack, which may help additional sources other than just the
Linux kernel: it seems that compilers should be able to fold this.
Vincent, if you're interested in making such an optimization in LLVM,
we'd welcome the contribution, and I'd be happy to show you where to
make such changes within LLVM; please let me know off thread.
If not, at the least, we should file feature requests in both:
* https://github.com/llvm/llvm-project/issues
* https://gcc.gnu.org/bugzilla/
>
> This series relies on __builtin_constant_p to select the optimal solution:
>
> * use kernel assembly for non constant expressions
>
> * use compiler's __builtin function for constant expressions.
>
>
> ** Statistics **
>
> Patch 1/2 optimizes 26.7% of ffs() calls and patch 2/2 optimizes 27.9%
> of __ffs() and ffz() calls (details of the calculation in each patch).
>
>
> ** Changelog **
>
> v6 -> v7:
>
> * (no changes on code, only commit tag was modified)
>
> * Add Reviewed-by: Yury Norov <[email protected]> in both patches
>
>
> v5 -> v6:
> * Rename variable___ffs() into variable__ffs() (two underscores
> instead of three)
>
>
> v4 -> v5:
>
> * (no changes on code, only commit comment was modified)
>
> * Rewrite the commit log:
> - Use two spaces instead of `| ' to indent code snippets.
> - Do not use `we'.
> - Do not use `this patch' in the commit description. Instead,
> use imperative tone.
> Link: https://lore.kernel.org/all/[email protected]/
>
>
> v3 -> v4:
>
> * (no changes on code, only commit comment was modified)
>
> * Remove note and link to Nick's message in patch 1/2, c.f.:
> Link: https://lore.kernel.org/all/CAKwvOdnnDaiJcV1gr9vV+ya-jWxx7+2KJNTDThyFctVDOgt9zQ@mail.gmail.com/
>
> * Add Reviewed-by: Nick Desaulniers <[email protected]> tag in
> patch 2/2.
>
>
> v2 -> v3:
>
> * Redacted out the instructions after ret and before next function
> in the assembly output.
>
> * Added a note and a link to Nick's message on the constant
> propagation missed-optimization in clang:
> Link: https://lore.kernel.org/all/CAKwvOdnH_gYv4qRN9pKY7jNTQK95xNeH1w1KZJJmvCkh8xJLBg@mail.gmail.com/
>
> * Fix copy/paste typo in statistics of patch 1/2. Number of
> occurrences before patches are 1081 and not 3607 (percentage
> reduction of 26.7% remains correct)
>
> * Rename the functions as follow:
> - __variable_ffs() -> variable___ffs()
> - __variable_ffz() -> variable_ffz()
>
> * Add Reviewed-by: Nick Desaulniers <[email protected]> tag in
> patch 1/2.
>
>
> Vincent Mailhol (2):
> x86/asm/bitops: ffs: use __builtin_ffs to evaluate constant
> expressions
> x86/asm/bitops: __ffs,ffz: use __builtin_ctzl to evaluate constant
> expressions
>
> arch/x86/include/asm/bitops.h | 64 +++++++++++++++++++++--------------
> 1 file changed, 38 insertions(+), 26 deletions(-)
>
> --
> 2.35.1
>
--
Thanks,
~Nick Desaulniers
On Sat, Aug 27, 2022 at 06:32:05AM +0900, Vincent MAILHOL wrote:
> Agree that this is only the surface. But, my patch series is about
> constant folding, not about the text of *ffs(). Here, I just *move*
> the existing text, I did not modify anything.
> Can we agree that this is a separate topic?
Sure we can.
But then you can't start your commit message with:
"__ffs(x) is equivalent to (unsigned long)__builtin_ctzl(x) and ffz(x)
is equivalent to (unsigned long)__builtin_ctzl(~x)."
which will bring unenlightened readers like me into the very same mess.
So at least mention that there's a difference between the kernel
implementation using hw insns which are well defined on some machines
and what the glibc API does. So that at least people are aware that
there's something dangerous to be cautious about.
Ok?
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Wed. 7 Sep 2022 at 13:06, Borislav Petkov <[email protected]> wrote:
> On Sat, Aug 27, 2022 at 06:32:05AM +0900, Vincent MAILHOL wrote:
> > Agree that this is only the surface. But, my patch series is about
> > constant folding, not about the text of *ffs(). Here, I just *move*
> > the existing text, I did not modify anything.
> > Can we agree that this is a separate topic?
>
> Sure we can.
>
> But then you can't start your commit message with:
>
> "__ffs(x) is equivalent to (unsigned long)__builtin_ctzl(x) and ffz(x)
> is equivalent to (unsigned long)__builtin_ctzl(~x)."
>
> which will bring unenlightened readers like me into the very same mess.
>
> So at least mention that there's a difference between the kernel
> implementation using hw insns which are well defined on some machines
> and what the glibc API does. So that at least people are aware that
> there's something dangerous to be cautious about.
>
> Ok?
OK.
I rephrased the beginning of the commit message as below:
If x is not 0, __ffs(x) is equivalent to:
(unsigned long)__builtin_ctzl(x)
And if x is not ~0UL, ffz(x) is equivalent to:
(unsigned long)__builtin_ctzl(~x)
Because __builtin_ctzl() returns an int, a cast to (unsigned long) is
necessary to avoid potential warnings on implicit casts.
Concerning the edge cases, __builtin_ctzl(0) is always undefined,
whereas __ffs(0) and ffz(~0UL) may or may not be defined, depending on
the processor. Regardless, for both functions, developers are asked to
check against 0 or ~0UL so replacing __ffs() or ffz() by
__builtin_ctzl() is safe.
Does this solve the issue? If yes, I will prepare the v8 right away.
Yours sincerely,
Vincent Mailhol
On Tue, Sep 6, 2022 at 11:26 AM Nick Desaulniers
<[email protected]> wrote:
>
> On Sun, Sep 4, 2022 at 5:38 PM Vincent Mailhol
> <[email protected]> wrote:
> >
> > The compilers provide some builtin expression equivalent to the ffs(),
> > __ffs() and ffz() functions of the kernel. The kernel uses optimized
> > assembly which produces better code than the builtin
> > functions. However, such assembly code can not be folded when used
> > with constant expressions.
>
> Another tack, which may help additional sources other than just the
> Linux kernel: it seems that compilers should be able to fold this.
>
> Vincent, if you're interested in making such an optimization in LLVM,
> we'd welcome the contribution, and I'd be happy to show you where to
> make such changes within LLVM; please let me know off thread.
Oh right, it already does.
https://github.com/llvm/llvm-project/blob/ea953b9d9a65c202985a79f1f95da115829baef6/llvm/lib/Transforms/Utils/SimplifyLibCalls.cpp#L2635
I see what's happening. Constant propagation sinks constants into a
specialized version of ffs when there's only 1 callsite in a given
translation unit (or multiple call sites with the same constant).
Then dead argument elimination removes the argument, so libcall
optimization thinks this isn't the ffs(int) you're looking for, and
skips it. Nice.
https://github.com/llvm/llvm-project/issues/57599
I guess ffs() is usually forward declared in strings.h, so we don't
have such a static inline definition available to constant
prop/specialize in normal C code.
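(Concretely -- a hypothetical standalone file, not verified against a
specific clang version -- the folding is expected to work when only the
libc prototype is in scope:

  #include <strings.h>

  unsigned int foo(void)
  {
          return ffs(0x01000000); /* libcall simplification folds to 25 */
  }
)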
--
Thanks,
~Nick Desaulniers
On Wed. 7 Sep. 2022 at 16:04, Nick Desaulniers <[email protected]> wrote:
> On Tue, Sep 6, 2022 at 11:26 AM Nick Desaulniers
> <[email protected]> wrote:
> >
> > On Sun, Sep 4, 2022 at 5:38 PM Vincent Mailhol
> > <[email protected]> wrote:
> > >
> > > The compilers provide some builtin expression equivalent to the ffs(),
> > > __ffs() and ffz() functions of the kernel. The kernel uses optimized
> > > assembly which produces better code than the builtin
> > > functions. However, such assembly code can not be folded when used
> > > with constant expressions.
> >
> > Another tack, which may help additional sources other than just the
> > Linux kernel: it seems that compilers should be able to fold this.
Initially, I thought that you were suggesting folding the asm code
(which doesn’t seem trivial at all).
> > Vincent, if you're interested in making such an optimization in LLVM,
> > we'd welcome the contribution, and I'd be happy to show you where to
> > make such changes within LLVM; please let me know off thread.
>
> Oh right, it already does.
> https://github.com/llvm/llvm-project/blob/ea953b9d9a65c202985a79f1f95da115829baef6/llvm/lib/Transforms/Utils/SimplifyLibCalls.cpp#L2635
> I see what's happening. Constant propagation sinks constants into a
> specialized version of ffs when there's only 1 callsite in a given
> translation unit (or multiple call sites with the same constant).
> Then dead argument elimination removes the argument, so libcall
> optimization thinks this isn't the ffs(int) you're looking for, and
> skips it.
Isn’t it a wise decision to skip it? How should the optimization be
able to decide that the redefined ffs() is equivalent to
__builtin_ffs()?
More generally, if I write my own foo() which shadows a
__builtin_foo() function, the two functions might do something totally
different and I would be pissed off if the compiler decided to
constant-fold my foo().
Dummy example:
===================
#include <string.h>

char *s;

/* ffs: fast forward string
 * @i: how many bytes to move forward
 *
 * Move forward the global s pointer by @i or strlen(s) (whichever is smaller).
 *
 * Return: how many bytes we move forward.
 */
int ffs(int i)
{
        int len = strlen(s);
        int forward = i < len ? i : len;

        s += forward;
        return forward;
}
===================
How would you instruct the compiler to constant-fold the kernel’s
ffs() but not fold the above dummy ffs()?
> Nice.
> https://github.com/llvm/llvm-project/issues/57599
> I guess ffs() is usually forward declared in strings.h, so we don't
> have such a static inline definition available to constant
> prop/specialize in normal C code.
On Wed, Sep 07, 2022 at 02:35:41PM +0900, Vincent MAILHOL wrote:
> I rephrased the beginning of the commit message as below:
>
>
> If x is not 0, __ffs(x) is equivalent to:
> (unsigned long)__builtin_ctzl(x)
> And if x is not ~0UL, ffz(x) is equivalent to:
> (unsigned long)__builtin_ctzl(~x)
> Because __builting_ctzl() returns an int, a cast to (unsigned long) is
> necessary to avoid potential warnings on implicit casts.
>
> Concerning the edge cases, __builtin_ctzl(0) is always undefined,
> whereas __ffs(0) and ffz(~0UL) may or may not be defined, depending on
> the processor. Regardless, for both functions, developers are asked to
> check against 0 or ~0UL so replacing __ffs() or ffz() by
> __builtin_ctzl() is safe.
>
>
>
> Does this solve the issue?
Yes, that sounds better.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
The compilers provide some builtin expression equivalent to the ffs(),
__ffs() and ffz() functions of the kernel. The kernel uses optimized
assembly which produces better code than the builtin
functions. However, such assembly code can not be folded when used
with constant expressions.
This series relies on __builtin_constant_p to select the optimal solution:
* use kernel assembly for non constant expressions
* use compiler's __builtin function for constant expressions.
** Statistics **
Patch 1/2 optimizes 26.7% of ffs() calls and patch 2/2 optimizes 27.9%
of __ffs() and ffz() calls (details of the calculation in each patch).
** Changelog **
v7 -> v8:
* (no changes on code, only commit comment was modified)
* Rewrite introduction of patch 2/2 to add nuances on the
defined/undefined behaviors of __builtin_ctzl(0), __ffs(0) and
ffz(~0UL).
v6 -> v7:
* (no changes on code, only commit tag was modified)
* Add Reviewed-by: Yury Norov <[email protected]> in both patches
v5 -> v6:
* Rename variable___ffs() into variable__ffs() (two underscores
instead of three)
v4 -> v5:
* (no changes on code, only commit comment was modified)
* Rewrite the commit log:
- Use two spaces instead of `| ' to indent code snippets.
- Do not use `we'.
- Do not use `this patch' in the commit description. Instead,
use imperative tone.
Link: https://lore.kernel.org/all/[email protected]/
v3 -> v4:
* (no changes on code, only commit comment was modified)
* Remove note and link to Nick's message in patch 1/2, c.f.:
Link: https://lore.kernel.org/all/CAKwvOdnnDaiJcV1gr9vV+ya-jWxx7+2KJNTDThyFctVDOgt9zQ@mail.gmail.com/
* Add Reviewed-by: Nick Desaulniers <[email protected]> tag in
patch 2/2.
v2 -> v3:
* Redacted out the instructions after ret and before next function
in the assembly output.
* Added a note and a link to Nick's message on the constant
propagation missed-optimization in clang:
Link: https://lore.kernel.org/all/CAKwvOdnH_gYv4qRN9pKY7jNTQK95xNeH1w1KZJJmvCkh8xJLBg@mail.gmail.com/
* Fix copy/paste typo in statistics of patch 1/2. Number of
occurrences before patches are 1081 and not 3607 (percentage
reduction of 26.7% remains correct)
* Rename the functions as follow:
- __variable_ffs() -> variable___ffs()
- __variable_ffz() -> variable_ffz()
* Add Reviewed-by: Nick Desaulniers <[email protected]> tag in
patch 1/2.
Vincent Mailhol (2):
x86/asm/bitops: ffs: use __builtin_ffs to evaluate constant
expressions
x86/asm/bitops: __ffs,ffz: use __builtin_ctzl to evaluate constant
expressions
arch/x86/include/asm/bitops.h | 64 +++++++++++++++++++++--------------
1 file changed, 38 insertions(+), 26 deletions(-)
--
2.35.1
The following commit has been merged into the x86/asm branch of tip:
Commit-ID: 146034fed6ee75ec09cf8f996165e2296ceae0bb
Gitweb: https://git.kernel.org/tip/146034fed6ee75ec09cf8f996165e2296ceae0bb
Author: Vincent Mailhol <[email protected]>
AuthorDate: Wed, 07 Sep 2022 18:09:34 +09:00
Committer: Borislav Petkov <[email protected]>
CommitterDate: Tue, 20 Sep 2022 15:31:17 +02:00
x86/asm/bitops: Use __builtin_ffs() to evaluate constant expressions
For x86_64, the current ffs() implementation does not produce optimized
code when called with a constant expression. On the contrary, the
__builtin_ffs() functions of both GCC and clang are able to fold the
expression into a single instruction.
** Example **
Consider two dummy functions foo() and bar() as below:
#include <linux/bitops.h>
#define CONST 0x01000000
unsigned int foo(void)
{
        return ffs(CONST);
}

unsigned int bar(void)
{
        return __builtin_ffs(CONST);
}
GCC would produce below assembly code:
0000000000000000 <foo>:
0: ba 00 00 00 01 mov $0x1000000,%edx
5: b8 ff ff ff ff mov $0xffffffff,%eax
a: 0f bc c2 bsf %edx,%eax
d: 83 c0 01 add $0x1,%eax
10: c3 ret
<Instructions after ret and before next function were redacted>
0000000000000020 <bar>:
20: b8 19 00 00 00 mov $0x19,%eax
25: c3 ret
And clang would produce:
0000000000000000 <foo>:
0: b8 ff ff ff ff mov $0xffffffff,%eax
5: 0f bc 05 00 00 00 00 bsf 0x0(%rip),%eax # c <foo+0xc>
c: 83 c0 01 add $0x1,%eax
f: c3 ret
0000000000000010 <bar>:
10: b8 19 00 00 00 mov $0x19,%eax
15: c3 ret
Both examples clearly demonstrate the benefit of using __builtin_ffs()
instead of the kernel's asm implementation for constant expressions.
However, for non constant expressions, the kernel's ffs() asm version
remains better for x86_64 because, contrary to GCC, it doesn't emit the
CMOV assembly instruction, c.f. [1] (noticeably, clang is able to
optimize out the CMOV call).
Use __builtin_constant_p() to select between the kernel's ffs() and
the __builtin_ffs() depending on whether the argument is constant or
not.
As a side benefit, replacing the ffs() function declaration by a macro
also removes below -Wshadow warning:
./arch/x86/include/asm/bitops.h:283:28: warning: declaration of 'ffs' shadows a built-in function [-Wshadow]
283 | static __always_inline int ffs(int x)
** Statistics **
On an allyesconfig, before...:
$ objdump -d vmlinux.o | grep bsf | wc -l
1081
...and after:
$ objdump -d vmlinux.o | grep bsf | wc -l
792
So, roughly 26.7% of the calls to ffs() were using constant
expressions and could be optimized out.
(tests done on linux v5.18-rc5 x86_64 using GCC 11.2.1)
[1] commit ca3d30cc02f7 ("x86_64, asm: Optimise fls(), ffs() and fls64()")
[ bp: Massage commit message. ]
Signed-off-by: Vincent Mailhol <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Reviewed-by: Nick Desaulniers <[email protected]>
Reviewed-by: Yury Norov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
arch/x86/include/asm/bitops.h | 26 ++++++++++++++------------
1 file changed, 14 insertions(+), 12 deletions(-)
diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 0fe9de5..879238e 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -292,18 +292,7 @@ static __always_inline unsigned long __fls(unsigned long word)
#undef ADDR
#ifdef __KERNEL__
-/**
- * ffs - find first set bit in word
- * @x: the word to search
- *
- * This is defined the same way as the libc and compiler builtin ffs
- * routines, therefore differs in spirit from the other bitops.
- *
- * ffs(value) returns 0 if value is 0 or the position of the first
- * set bit if value is nonzero. The first (least significant) bit
- * is at position 1.
- */
-static __always_inline int ffs(int x)
+static __always_inline int variable_ffs(int x)
{
int r;
@@ -334,6 +323,19 @@ static __always_inline int ffs(int x)
}
/**
+ * ffs - find first set bit in word
+ * @x: the word to search
+ *
+ * This is defined the same way as the libc and compiler builtin ffs
+ * routines, therefore differs in spirit from the other bitops.
+ *
+ * ffs(value) returns 0 if value is 0 or the position of the first
+ * set bit if value is nonzero. The first (least significant) bit
+ * is at position 1.
+ */
+#define ffs(x) (__builtin_constant_p(x) ? __builtin_ffs(x) : variable_ffs(x))
+
+/**
* fls - find last set bit in word
* @x: the word to search
*
The following commit has been merged into the x86/asm branch of tip:
Commit-ID: fdb6649ab7c142e497539a471e573c2593b9c923
Gitweb: https://git.kernel.org/tip/fdb6649ab7c142e497539a471e573c2593b9c923
Author: Vincent Mailhol <[email protected]>
AuthorDate: Wed, 07 Sep 2022 18:09:35 +09:00
Committer: Borislav Petkov <[email protected]>
CommitterDate: Tue, 20 Sep 2022 15:35:37 +02:00
x86/asm/bitops: Use __builtin_ctzl() to evaluate constant expressions
If x is not 0, __ffs(x) is equivalent to:
(unsigned long)__builtin_ctzl(x)
And if x is not ~0UL, ffz(x) is equivalent to:
(unsigned long)__builtin_ctzl(~x)
Because __builtin_ctzl() returns an int, a cast to (unsigned long) is
necessary to avoid potential warnings on implicit casts.
Concerning the edge cases, __builtin_ctzl(0) is always undefined,
whereas __ffs(0) and ffz(~0UL) may or may not be defined, depending on
the processor. Regardless, for both functions, developers are asked to
check against 0 or ~0UL so replacing __ffs() or ffz() by
__builtin_ctzl() is safe.
For x86_64, the current __ffs() and ffz() implementations do not
produce optimized code when called with a constant expression. On the
contrary, the __builtin_ctzl() folds into a single instruction.
However, for non constant expressions, the __ffs() and ffz() asm
versions of the kernel remain slightly better than the code produced
by GCC (it produces a useless instruction to clear eax).
Use __builtin_constant_p() to select between the kernel's
__ffs()/ffz() and the __builtin_ctzl() depending on whether the
argument is constant or not.
** Statistics **
On an allyesconfig, before...:
$ objdump -d vmlinux.o | grep tzcnt | wc -l
3607
...and after:
$ objdump -d vmlinux.o | grep tzcnt | wc -l
2600
So, roughly 27.9% of the calls to either __ffs() or ffz() were using
constant expressions and could be optimized out.
(tests done on linux v5.18-rc5 x86_64 using GCC 11.2.1)
Note: on x86_64, the BSF instruction produces TZCNT when used with the
REP prefix (which explains the use of `grep tzcnt' instead of `grep bsf'
in the above benchmark). c.f. [1]
[1] e26a44a2d618 ("x86: Use REP BSF unconditionally")
[ bp: Massage commit message. ]
Signed-off-by: Vincent Mailhol <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Reviewed-by: Nick Desaulniers <[email protected]>
Reviewed-by: Yury Norov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
arch/x86/include/asm/bitops.h | 28 +++++++++++++++++++---------
1 file changed, 19 insertions(+), 9 deletions(-)
diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 879238e..2edf684 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -247,17 +247,30 @@ arch_test_bit_acquire(unsigned long nr, const volatile unsigned long *addr)
variable_test_bit(nr, addr);
}
+static __always_inline unsigned long variable__ffs(unsigned long word)
+{
+ asm("rep; bsf %1,%0"
+ : "=r" (word)
+ : "rm" (word));
+ return word;
+}
+
/**
* __ffs - find first set bit in word
* @word: The word to search
*
* Undefined if no bit exists, so code should check against 0 first.
*/
-static __always_inline unsigned long __ffs(unsigned long word)
+#define __ffs(word) \
+ (__builtin_constant_p(word) ? \
+ (unsigned long)__builtin_ctzl(word) : \
+ variable__ffs(word))
+
+static __always_inline unsigned long variable_ffz(unsigned long word)
{
asm("rep; bsf %1,%0"
: "=r" (word)
- : "rm" (word));
+ : "r" (~word));
return word;
}
@@ -267,13 +280,10 @@ static __always_inline unsigned long __ffs(unsigned long word)
*
* Undefined if no zero exists, so code should check against ~0UL first.
*/
-static __always_inline unsigned long ffz(unsigned long word)
-{
- asm("rep; bsf %1,%0"
- : "=r" (word)
- : "r" (~word));
- return word;
-}
+#define ffz(word) \
+ (__builtin_constant_p(word) ? \
+ (unsigned long)__builtin_ctzl(~word) : \
+ variable_ffz(word))
/*
* __fls: find last set bit in word