2022-11-06 10:22:09

by Vincent MAILHOL

[permalink] [raw]
Subject: [PATCH v1 0/2] x86/asm/bitops: optimize fls functions for constant expressions

The compilers provide some builtin expression equivalent to the fls(),
__fls() and fls64() functions of the kernel.

The kernel's x86 implementation relies on assembly code. This assembly
code can not be folded when used with constant expressions.

This series replaces the kernel assembly by a builtin equivalent when
appropriate. It is a follow-up on this previous series:

https://lore.kernel.org/all/[email protected]/

in which I promised to also modify the fls() functions.

Vincent Mailhol (2):
x86/asm/bitops: Replace __fls() by its generic builtin implementation
x86/asm/bitops: Use __builtin_clz*() to evaluate constant expressions

arch/x86/include/asm/bitops.h | 71 ++++++++++++++++++++---------------
1 file changed, 41 insertions(+), 30 deletions(-)

--
2.37.4



2022-11-25 02:42:42

by Vincent MAILHOL

[permalink] [raw]
Subject: [PATCH v2 0/2] x86/asm/bitops: optimize fls functions for constant expressions

The compilers provide some builtin expression equivalent to the fls(),
__fls() and fls64() functions of the kernel.

The kernel's x86 implementation relies on assembly code. This assembly
code can not be folded when used with constant expressions.

This series replaces the kernel assembly by a builtin equivalent when
appropriate. It is a follow-up on this previous series:

https://lore.kernel.org/all/[email protected]/

in which I promised to also modify the fls() functions.

* Changelog *

v1 -> v2:

* [patch 1/2] add Reviewed-by: Nick Desaulniers tag.

* [patch 2/2] replace:
BUILD_BUG_ON(sizeof(unsigned long long) != sizeof(x));
by:
BUILD_BUG_ON(!__same_type(x, unsigned long long));
and slightly modify the commit description.

Vincent Mailhol (2):
x86/asm/bitops: Replace __fls() by its generic builtin implementation
x86/asm/bitops: Use __builtin_clz*() to evaluate constant expressions

arch/x86/include/asm/bitops.h | 70 ++++++++++++++++++++---------------
1 file changed, 40 insertions(+), 30 deletions(-)

--
2.37.4

2022-11-25 02:42:55

by Vincent MAILHOL

[permalink] [raw]
Subject: [PATCH v2 1/2] x86/asm/bitops: Replace __fls() by its generic builtin implementation

Below snippet:

#include <linux/bitops.h>

unsigned int foo(unsigned long word)
{
return __fls(word);
}

produces this on GCC 12.1.0:

0000000000000000 <foo>:
0: f3 0f 1e fa endbr64
4: e8 00 00 00 00 call 9 <foo+0x9>
9: 53 push %rbx
a: 48 89 fb mov %rdi,%rbx
d: e8 00 00 00 00 call 12 <foo+0x12>
12: 48 0f bd c3 bsr %rbx,%rax
16: 5b pop %rbx
17: 31 ff xor %edi,%edi
19: e9 00 00 00 00 jmp 1e <foo+0x1e>

and that on clang 14.0.6:

0000000000000000 <foo>:
0: f3 0f 1e fa endbr64
4: e8 00 00 00 00 call 9 <foo+0x9>
9: 53 push %rbx
a: 50 push %rax
b: 48 89 fb mov %rdi,%rbx
e: e8 00 00 00 00 call 13 <foo+0x13>
13: 48 89 1c 24 mov %rbx,(%rsp)
17: 48 0f bd 04 24 bsr (%rsp),%rax
1c: 48 83 c4 08 add $0x8,%rsp
20: 5b pop %rbx
21: c3 ret

The implementation from <asm-generic/bitops/builtin-__fls.h> [1]
produces the exact same code on GCC and below code on clang:

0000000000000000 <foo>:
0: f3 0f 1e fa endbr64
4: e8 00 00 00 00 call 9 <foo+0x9>
9: 53 push %rbx
a: 48 89 fb mov %rdi,%rbx
d: e8 00 00 00 00 call 12 <foo+0x12>
12: 48 0f bd c3 bsr %rbx,%rax
16: 5b pop %rbx
17: c3 ret

The builtin implementation is better for two reasons:

1/ it saves two instructions on clang (a push and a stack pointer
decrement).

2/ when used on constant expressions, the compiler is only able to
fold the builtin version (c.f. [2]).

For those two reasons, replace the assembly implementation by its
builtin counterpart.

[1] https://elixir.bootlin.com/linux/v6.0/source/include/asm-generic/bitops/builtin-__fls.h
[2] commit 146034fed6ee ("x86/asm/bitops: Use __builtin_ffs() to evaluate constant expressions")

CC: Borislav Petkov <[email protected]>
CC: Yury Norov <[email protected]>
Signed-off-by: Vincent Mailhol <[email protected]>
Reviewed-by: Nick Desaulniers <[email protected]>
---
arch/x86/include/asm/bitops.h | 14 +-------------
1 file changed, 1 insertion(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 2edf68475fec..a31453d7686d 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -285,19 +285,7 @@ static __always_inline unsigned long variable_ffz(unsigned long word)
(unsigned long)__builtin_ctzl(~word) : \
variable_ffz(word))

-/*
- * __fls: find last set bit in word
- * @word: The word to search
- *
- * Undefined if no set bit exists, so code should check against 0 first.
- */
-static __always_inline unsigned long __fls(unsigned long word)
-{
- asm("bsr %1,%0"
- : "=r" (word)
- : "rm" (word));
- return word;
-}
+#include <asm-generic/bitops/builtin-__fls.h>

#undef ADDR

--
2.37.4

2022-11-25 02:58:17

by Vincent MAILHOL

[permalink] [raw]
Subject: [PATCH v2 2/2] x86/asm/bitops: Use __builtin_clz*() to evaluate constant expressions

GCC and clang offers the __builtin_clz(x) and __builtin_clzll(x)
functions which return the number of leading zero bits in
x. c.f. [1]. By a simple subtraction, we can derive below equivalences:

* For fls:
Aside of the x = 0 special case, fls(x) is equivalent to
BITS_PER_TYPE(x) - __builtin_clz(x).

* For fls64: Aside of the x = 0 special case, fls64(x) is equivalent
to BITS_PER_TYPE(x) - __builtin_clzll(x).
__builtin_clzll() takes an unsigned long long as argument instead of
a u64 which is fine because those two types are equivalent. Regardless,
add a BUILD_BUG_ON() safety net is added to formally assert that u64
and unsigned long long are the same.

When used on constant expressions, the compiler is only able to fold
the builtin version (c.f. [2]). However, for non constant expressions,
the kernel inline assembly results in better code for both GCC and
clang.

Use __builtin_constant_p() to select between the kernel's
fls()/fls64() and __builtin_clz()/__builtin_clzll() depending on
whether the argument is constant or not.

While renaming fls64() to variable_fls64(), change the argument type
from __64 to u64 because we are not in an uapi header.

[1] Built-in Functions Provided by GCC:
https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html

[2] commit 146034fed6ee ("x86/asm/bitops: Use __builtin_ffs() to evaluate constant expressions")

CC: Borislav Petkov <[email protected]>
CC: Nick Desaulniers <[email protected]>
CC: Yury Norov <[email protected]>
Signed-off-by: Vincent Mailhol <[email protected]>
---
arch/x86/include/asm/bitops.h | 56 ++++++++++++++++++++++++-----------
1 file changed, 39 insertions(+), 17 deletions(-)

diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index a31453d7686d..d10e774af343 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -333,18 +333,15 @@ static __always_inline int variable_ffs(int x)
*/
#define ffs(x) (__builtin_constant_p(x) ? __builtin_ffs(x) : variable_ffs(x))

-/**
- * fls - find last set bit in word
- * @x: the word to search
- *
- * This is defined in a similar way as the libc and compiler builtin
- * ffs, but returns the position of the most significant set bit.
- *
- * fls(value) returns 0 if value is 0 or the position of the last
- * set bit if value is nonzero. The last (most significant) bit is
- * at position 32.
- */
-static __always_inline int fls(unsigned int x)
+static __always_inline int constant_fls(unsigned int x)
+{
+ if (!x)
+ return 0;
+
+ return BITS_PER_TYPE(x) - __builtin_clz(x);
+}
+
+static __always_inline int variable_fls(unsigned int x)
{
int r;

@@ -375,18 +372,29 @@ static __always_inline int fls(unsigned int x)
}

/**
- * fls64 - find last set bit in a 64-bit word
+ * fls - find last set bit in word
* @x: the word to search
*
* This is defined in a similar way as the libc and compiler builtin
- * ffsll, but returns the position of the most significant set bit.
+ * ffs, but returns the position of the most significant set bit.
*
- * fls64(value) returns 0 if value is 0 or the position of the last
+ * fls(value) returns 0 if value is 0 or the position of the last
* set bit if value is nonzero. The last (most significant) bit is
- * at position 64.
+ * at position 32.
*/
+#define fls(x) (__builtin_constant_p(x) ? constant_fls(x) : variable_fls(x))
+
#ifdef CONFIG_X86_64
-static __always_inline int fls64(__u64 x)
+static __always_inline int constant_fls64(u64 x)
+{
+ if (!x)
+ return 0;
+
+ BUILD_BUG_ON(!__same_type(x, unsigned long long));
+ return BITS_PER_TYPE(x) - __builtin_clzll(x);
+}
+
+static __always_inline int variable_fls64(u64 x)
{
int bitpos = -1;
/*
@@ -399,6 +407,20 @@ static __always_inline int fls64(__u64 x)
: "rm" (x));
return bitpos + 1;
}
+
+/**
+ * fls64 - find last set bit in a 64-bit word
+ * @x: the word to search
+ *
+ * This is defined in a similar way as the libc and compiler builtin
+ * ffsll, but returns the position of the most significant set bit.
+ *
+ * fls64(value) returns 0 if value is 0 or the position of the last
+ * set bit if value is nonzero. The last (most significant) bit is
+ * at position 64.
+ */
+#define fls64(x) \
+ (__builtin_constant_p(x) ? constant_fls64(x) : variable_fls64(x))
#else
#include <asm-generic/bitops/fls64.h>
#endif
--
2.37.4