2022-05-10 19:49:18

by Vincent MAILHOL

Subject: [PATCH 0/2] x86/asm/bitops: optimize ff{s,z} functions for constant expressions

The compilers provide some builtin expressions equivalent to the
ffs(), __ffs() and ffz() functions of the kernel. The kernel uses
optimized assembly which produces better code than the builtin
functions. However, such assembly code can not be optimized when used
on constant expressions.

This series relies on __builtin_constant_p() to select the optimal solution:

* use kernel assembly for non-constant expressions

* use the compiler's __builtin function for constant expressions (see the sketch below).
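
As an illustration, each patch turns the function into a macro of
roughly the following shape (simplified sketch; the actual patches
below differ in details such as helper naming):

| #define ffs(x) \
| 	(__builtin_constant_p(x) ? \
| 	 __builtin_ffs(x) : \
| 	 __ffs_asm(x))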

I also think that fls() and fls64() can be optimized in a similar
way, using __builtin_clz() and __builtin_clzll(), but it is a bit less
trivial so I want to focus on this series first. If it gets accepted, I
will then work on those two additional functions.
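
One possible shape for the fls() case (illustrative only; the zero
case is what makes it less trivial, and variable_fls() is a
hypothetical name for the asm helper):

| #define fls(x) \
| 	(__builtin_constant_p(x) ? \
| 	 ((x) ? 32 - __builtin_clz(x) : 0) : \
| 	 variable_fls(x))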


** Statistics **

On an allyesconfig, before applying this series, I get:

| $ objdump -d vmlinux.o | grep bsf | wc -l
| 1081

After applying this series:

| $ objdump -d vmlinux.o | grep bsf | wc -l
| 792

So, roughly 26.7% of the calls to either ffs() or __ffs() were using
constant expressions and could be optimized (I did not produce the
figures for ffz()).

(tests done on linux v5.18-rc5 x86_64 using GCC 11.2.1)


Vincent Mailhol (2):
x86/asm/bitops: ffs: use __builtin_ffs to evaluate constant
expressions
x86/asm/bitops: __ffs,ffz: use __builtin_ctzl to evaluate constant
expressions

arch/x86/include/asm/bitops.h | 65 +++++++++++++++++++++--------------
1 file changed, 40 insertions(+), 25 deletions(-)

--
2.35.1



2022-05-10 20:27:26

by Vincent MAILHOL

Subject: [PATCH 2/2] x86/asm/bitops: __ffs,ffz: use __builtin_ctzl to evaluate constant expressions

__ffs(x) is equivalent to (unsigned long)__builtin_ctzl(x) and ffz(x)
is equivalent to (unsigned long)__builtin_ctzl(~x). Because
__builtin_ctzl() returns an int, a cast to (unsigned long) is
necessary to avoid potential warnings on implicit casts.
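
For instance, with illustrative values (not taken from the patch):

| __ffs(0x0f0)           == 4  /* lowest set bit of 0x0f0 is bit 4  */
| __builtin_ctzl(0x0f0)  == 4  /* 0x0f0 has four trailing zeros     */
| ffz(0xf0f)             == 4  /* lowest zero bit of 0xf0f is bit 4 */
| __builtin_ctzl(~0xf0f) == 4  /* low bits of ~0xf0f are ...0f0     */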

For x86_64, the current __ffs() and ffz() implementations do not
produce optimized code when called with a constant expression. On the
contrary, the __builtin_ctzl() gets simplified into a single
instruction.

However, for non-constant expressions, the __ffs() and ffz() asm
versions of the kernel remain slightly better than the code produced
by GCC (which emits a useless instruction to clear eax).

This patch uses __builtin_constant_p() to select between the
kernel's __ffs()/ffz() and __builtin_ctzl() depending on whether
the argument is constant or not.


Signed-off-by: Vincent Mailhol <[email protected]>
---
Usually, helper functions are prefixed with two underscores. But
here, __ffs() is already prefixed, so instead of creating a
____ffs(), I renamed it to __ffs_asm_not_zero(). The _not_zero()
suffix differentiates it from the __ffs_asm() helper function of the
previous patch.

If someone has better inspiration for the naming in this patch and
the previous one, please let me know.
---
arch/x86/include/asm/bitops.h | 36 ++++++++++++++++++++++-------------
1 file changed, 23 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 535a7a358c14..694d0d100399 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -224,13 +224,7 @@ static __always_inline bool variable_test_bit(long nr, volatile const unsigned l
? constant_test_bit((nr), (addr)) \
: variable_test_bit((nr), (addr)))

-/**
- * __ffs - find first set bit in word
- * @word: The word to search
- *
- * Undefined if no bit exists, so code should check against 0 first.
- */
-static __always_inline unsigned long __ffs(unsigned long word)
+static __always_inline unsigned long __ffs_asm_not_zero(unsigned long word)
{
asm("rep; bsf %1,%0"
: "=r" (word)
@@ -239,12 +233,17 @@ static __always_inline unsigned long __ffs(unsigned long word)
}

/**
- * ffz - find first zero bit in word
- * @word: The word to search
- *
- * Undefined if no zero exists, so code should check against ~0UL first.
- */
-static __always_inline unsigned long ffz(unsigned long word)
+ * __ffs - find first set bit in word
+ * @word: The word to search
+ *
+ * Undefined if no bit exists, so code should check against 0 first.
+ */
+#define __ffs(word) \
+ (__builtin_constant_p(word) ? \
+ (unsigned long)__builtin_ctzl(word) : \
+ __ffs_asm_not_zero(word))
+
+static __always_inline unsigned long __ffz_asm(unsigned long word)
{
asm("rep; bsf %1,%0"
: "=r" (word)
@@ -252,6 +251,17 @@ static __always_inline unsigned long ffz(unsigned long word)
return word;
}

+/**
+ * ffz - find first zero bit in word
+ * @word: The word to search
+ *
+ * Undefined if no zero exists, so code should check against ~0UL first.
+ */
+#define ffz(word) \
+ (__builtin_constant_p(word) ? \
+ (unsigned long)__builtin_ctzl(~word) : \
+ __ffz_asm(word))
+
/*
* __fls: find last set bit in word
* @word: The word to search
--
2.35.1


2022-05-10 21:08:37

by Vincent MAILHOL

Subject: [PATCH 1/2] x86/asm/bitops: ffs: use __builtin_ffs to evaluate constant expressions

For x86_64, the current ffs() implementation does not produce
optimized code when called with a constant expression. On the
contrary, the __builtin_ffs() function of both GCC and clang is able
to simplify the expression into a single instruction.

* Example *

Let's consider two dummy functions foo() and bar() as below:

| #include <linux/bitops.h>
| #define CONST 0x01000000
|
| unsigned int foo(void)
| {
| return ffs(CONST);
| }
|
| unsigned int bar(void)
| {
| return __builtin_ffs(CONST);
| }

GCC would produce below assembly code:

| 0000000000000000 <foo>:
| 0: b8 ff ff ff ff mov $0xffffffff,%eax
| 5: 0f bc c7 bsf %edi,%eax
| 8: 83 c0 01 add $0x1,%eax
| b: c3 ret
| c: 0f 1f 40 00 nopl 0x0(%rax)
|
| 0000000000000010 <bar>:
| 10: b8 19 00 00 00 mov $0x19,%eax
| 15: c3 ret

And clang would produce:

| 0000000000000000 <foo>:
| 0: 55 push %rbp
| 1: 48 89 e5 mov %rsp,%rbp
| 4: b8 ff ff ff ff mov $0xffffffff,%eax
| 9: 0f bc 05 00 00 00 00 bsf 0x0(%rip),%eax # 10 <foo+0x10>
| 10: ff c0 inc %eax
| 12: 5d pop %rbp
| 13: c3 ret
| 14: 66 2e 0f 1f 84 00 00 cs nopw 0x0(%rax,%rax,1)
| 1b: 00 00 00
| 1e: 66 90 xchg %ax,%ax
|
| 0000000000000020 <bar>:
| 20: 55 push %rbp
| 21: 48 89 e5 mov %rsp,%rbp
| 24: b8 19 00 00 00 mov $0x19,%eax
| 29: 5d pop %rbp
| 2a: c3 ret

For both examples, we clearly see the benefit of using __builtin_ffs()
instead of the kernel's asm implementation for constant expressions:
bar() reduces to a single mov of the constant 25 (0x19), the one-based
position of the only set bit of 0x01000000.

However, for non-constant expressions, the ffs() asm version of the
kernel remains better for x86_64 because, contrary to GCC, it doesn't
emit the CMOV assembly instruction, cf. [1] (notably, clang is
able to optimize out the CMOV).

This patch uses __builtin_constant_p() to select between the
kernel's ffs() and __builtin_ffs() depending on whether the
argument is constant or not.


As a side benefit, this patch also removes the -Wshadow warning below:

| ./arch/x86/include/asm/bitops.h:283:28: warning: declaration of 'ffs' shadows a built-in function [-Wshadow]
| 283 | static __always_inline int ffs(int x)

[1] commit ca3d30cc02f7 ("x86_64, asm: Optimise fls(), ffs() and fls64()")
http://lkml.kernel.org/r/[email protected]


Signed-off-by: Vincent Mailhol <[email protected]>
---
arch/x86/include/asm/bitops.h | 29 +++++++++++++++++------------
1 file changed, 17 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index a288ecd230ab..535a7a358c14 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -269,18 +269,7 @@ static __always_inline unsigned long __fls(unsigned long word)
#undef ADDR

#ifdef __KERNEL__
-/**
- * ffs - find first set bit in word
- * @x: the word to search
- *
- * This is defined the same way as the libc and compiler builtin ffs
- * routines, therefore differs in spirit from the other bitops.
- *
- * ffs(value) returns 0 if value is 0 or the position of the first
- * set bit if value is nonzero. The first (least significant) bit
- * is at position 1.
- */
-static __always_inline int ffs(int x)
+static __always_inline int __ffs_asm(int x)
{
int r;

@@ -310,6 +299,22 @@ static __always_inline int ffs(int x)
return r + 1;
}

+/**
+ * ffs - find first set bit in word
+ * @x: the word to search
+ *
+ * This is defined the same way as the libc and compiler builtin ffs
+ * routines, therefore differs in spirit from the other bitops.
+ *
+ * ffs(value) returns 0 if value is 0 or the position of the first
+ * set bit if value is nonzero. The first (least significant) bit
+ * is at position 1.
+ */
+#define ffs(x) \
+ (__builtin_constant_p(x) ? \
+ __builtin_ffs(x) : \
+ __ffs_asm(x))
+
/**
* fls - find last set bit in word
* @x: the word to search
--
2.35.1


2022-05-10 22:48:02

by Nick Desaulniers

Subject: Re: [PATCH 0/2] x86/asm/bitops: optimize ff{s,z} functions for constant expressions

On Tue, May 10, 2022 at 7:26 AM Vincent Mailhol
<[email protected]> wrote:
>
> The compilers provide some builtin expressions equivalent to the
> ffs(), __ffs() and ffz() functions of the kernel. The kernel uses
> optimized assembly which produces better code than the builtin
> functions. However, such assembly code can not be optimized when used
> on constant expressions.
>
> This series relies on __builtin_constant_p to select the optimal solution:
>
> * use kernel assembly for non constant expressions
>
> * use compiler's __builtin function for constant expressions.
>
> I also think that fls() and fls64() can be optimized in a similar
> way, using __builtin_clz() and __builtin_clzll(), but it is a bit less
> trivial so I want to focus on this series first. If it gets accepted, I
> will then work on those two additional functions.
>
>
> ** Statistics **
>
> On an allyesconfig, before applying this series, I get:
>
> | $ objdump -d vmlinux.o | grep bsf | wc -l
> | 1081
>
> After applying this series:
>
> | $ objdump -d vmlinux.o | grep bsf | wc -l
> | 792
>
> So, roughly 26.7% of the calls to either ffs() or __ffs() were using
> constant expressions and could be optimized (I did not produce the
> figures for ffz()).

These stats are interesting; consider putting them on patch 1/2 commit
message though (in addition to the cover letter). (Sending thoughts on
1/2 next).

>
> (tests done on linux v5.18-rc5 x86_64 using GCC 11.2.1)

Here's the same measure of x86_64 allyesconfig (./scripts/config -d
CONFIG_HINIC) at 9be9ed2612b5aedb52a2c240edb1630b6b743cb6 with ToT
LLVM (~clang-15):

Before:
$ objdump -d vmlinux.o | grep bsf | wc -l
1454

After:
$ objdump -d vmlinux.o | grep bsf | wc -l
1070

-26.4% :)


>
>
> Vincent Mailhol (2):
> x86/asm/bitops: ffs: use __builtin_ffs to evaluate constant
> expressions
> x86/asm/bitops: __ffs,ffz: use __builtin_ctzl to evaluate constant
> expressions
>
> arch/x86/include/asm/bitops.h | 65 +++++++++++++++++++++--------------
> 1 file changed, 40 insertions(+), 25 deletions(-)
>
> --
> 2.35.1
>


--
Thanks,
~Nick Desaulniers

2022-05-11 03:58:25

by Vincent MAILHOL

Subject: Re: [PATCH 0/2] x86/asm/bitops: optimize ff{s,z} functions for constant expressions

On Wed. 11 May 2022 at 07:14, Nick Desaulniers <[email protected]> wrote:
> On Tue, May 10, 2022 at 7:26 AM Vincent Mailhol
> <[email protected]> wrote:
> >
> > The compilers provide some builtin expressions equivalent to the
> > ffs(), __ffs() and ffz() functions of the kernel. The kernel uses
> > optimized assembly which produces better code than the builtin
> > functions. However, such assembly code can not be optimized when used
> > on constant expressions.
> >
> > This series relies on __builtin_constant_p to select the optimal solution:
> >
> > * use kernel assembly for non constant expressions
> >
> > * use compiler's __builtin function for constant expressions.
> >
> > I also think that fls() and fls64() can be optimized in a similar
> > way, using __builtin_clz() and __builtin_clzll(), but it is a bit less
> > trivial so I want to focus on this series first. If it gets accepted, I
> > will then work on those two additional functions.
> >
> >
> > ** Statistics **
> >
> > On an allyesconfig, before applying this series, I get:
> >
> > | $ objdump -d vmlinux.o | grep bsf | wc -l
> > | 1081
> >
> > After applying this series:
> >
> > | $ objdump -d vmlinux.o | grep bsf | wc -l
> > | 792
> >
> > So, roughly 26.7% of the calls to either ffs() or __ffs() were using
> > constant expressions and could be optimized (I did not produce the
> > figures for ffz()).
>
> These stats are interesting; consider putting them on patch 1/2 commit
> message though (in addition to the cover letter). (Sending thoughts on
> 1/2 next).

The fact is that patch 1/2 changes ffs() and patch 2/2 changes
__ffs(). For v2, I will run the stats on each patch separately in
order not to mix the results.

> >
> > (tests done on linux v5.18-rc5 x86_64 using GCC 11.2.1)
>
> Here's the same measure of x86_64 allyesconfig (./scripts/config -d
> CONFIG_HINIC) at 9be9ed2612b5aedb52a2c240edb1630b6b743cb6 with ToT
> LLVM (~clang-15):
>
> Before:
> $ objdump -d vmlinux.o | grep bsf | wc -l
> 1454
>
> After:
> $ objdump -d vmlinux.o | grep bsf | wc -l
> 1070
>
> -26.4% :)

Roughly same ratio. I am just surprised that the absolute numbers
are different:

* GCC before: 1081, after 792
* clang before 1454, after 1070

I wonder why clang produces more bsf instructions than GCC.

Also, on a side note, I am not the first one to realize that
__builtin_ffs() is able to optimize constant expressions. Some
people have already used it locally:

| $ git grep __builtin_ffs | wc -l
| 80

> >
> >
> > Vincent Mailhol (2):
> > x86/asm/bitops: ffs: use __builtin_ffs to evaluate constant
> > expressions
> > x86/asm/bitops: __ffs,ffz: use __builtin_ctzl to evaluate constant
> > expressions
> >
> > arch/x86/include/asm/bitops.h | 65 +++++++++++++++++++++--------------
> > 1 file changed, 40 insertions(+), 25 deletions(-)
> >
> > --
> > 2.35.1
> >
>
>
> --
> Thanks,
> ~Nick Desaulniers

2022-05-11 05:59:27

by Nick Desaulniers

Subject: Re: [PATCH 1/2] x86/asm/bitops: ffs: use __builtin_ffs to evaluate constant expressions

On Tue, May 10, 2022 at 7:26 AM Vincent Mailhol
<[email protected]> wrote:
>
> For x86_64, the current ffs() implementation does not produce
> optimized code when called with a constant expression. On the
> contrary, the __builtin_ffs() function of both GCC and clang is able
> to simplify the expression into a single instruction.
>
> * Example *
>
> Let's consider two dummy functions foo() and bar() as below:
>
> | #include <linux/bitops.h>
> | #define CONST 0x01000000
> |
> | unsigned int foo(void)
> | {
> | return ffs(CONST);
> | }
> |
> | unsigned int bar(void)
> | {
> | return __builtin_ffs(CONST);
> | }
>
> GCC would produce below assembly code:
>
> | 0000000000000000 <foo>:
> | 0: b8 ff ff ff ff mov $0xffffffff,%eax
> | 5: 0f bc c7 bsf %edi,%eax
> | 8: 83 c0 01 add $0x1,%eax
> | b: c3 ret
> | c: 0f 1f 40 00 nopl 0x0(%rax)
> |
> | 0000000000000010 <bar>:
> | 10: b8 19 00 00 00 mov $0x19,%eax
> | 15: c3 ret
>
> And clang would produce:
>
> | 0000000000000000 <foo>:
> | 0: 55 push %rbp
> | 1: 48 89 e5 mov %rsp,%rbp
> | 4: b8 ff ff ff ff mov $0xffffffff,%eax
> | 9: 0f bc 05 00 00 00 00 bsf 0x0(%rip),%eax # 10 <foo+0x10>
> | 10: ff c0 inc %eax
> | 12: 5d pop %rbp
> | 13: c3 ret
> | 14: 66 2e 0f 1f 84 00 00 cs nopw 0x0(%rax,%rax,1)
> | 1b: 00 00 00
> | 1e: 66 90 xchg %ax,%ax
> |
> | 0000000000000020 <bar>:
> | 20: 55 push %rbp
> | 21: 48 89 e5 mov %rsp,%rbp
> | 24: b8 19 00 00 00 mov $0x19,%eax
> | 29: 5d pop %rbp
> | 2a: c3 ret

Right, we need to allocate registers to move the inputs into the asm
block, and the results back out. Inline asm is analogous to a call
with a custom calling convention, where we don't look into the body of
the inline asm.

Does -fomit-frame-pointer make these snippets cleaner, or did
you not build with -O2? Consider using those flags if so, since we
generally prefer the ORC unwinder on x86, not the frame pointer
unwinder. If the compilers are forcing a frame pointer when using the
builtins once optimizations are enabled, that's a problem (that we've
seen in the past with the builtins for reading eflags with clang; now
fixed).
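
As a sketch of the expectation (not a measurement): with frame
pointers omitted, clang's bar() above should collapse to the same two
instructions as GCC's, i.e. something like:

| <bar>:
| 	mov    $0x19,%eax
| 	ret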

>
> For both examples, we clearly see the benefit of using __builtin_ffs()
> instead of the kernel's asm implementation for constant expressions.
>
> However, for non-constant expressions, the ffs() asm version of the
> kernel remains better for x86_64 because, contrary to GCC, it doesn't
> emit the CMOV assembly instruction, cf. [1] (notably, clang is
> able to optimize out the CMOV).
>
> This patch uses the __builtin_constant_p() to select between the
> kernel's ffs() and the __builtin_ffs() depending on whether the
> argument is constant or not.
>
>
> As a side benefit, this patch also removes the -Wshadow warning below:
>
> | ./arch/x86/include/asm/bitops.h:283:28: warning: declaration of 'ffs' shadows a built-in function [-Wshadow]
> | 283 | static __always_inline int ffs(int x)

Nice! :)

>
> [1] commit ca3d30cc02f7 ("x86_64, asm: Optimise fls(), ffs() and fls64()")
> http://lkml.kernel.org/r/[email protected]

+ David, author of ca3d30cc02f7. I was wondering if this applied to
more than just x86, but I see now that some architectures just include
include/asm-generic/bitops/builtin-ffs.h into their
arch/*/include/asm/bitops.h. It's only when we want to beat the
compiler for non-ICE expressions.
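
For reference, that generic header is essentially a thin wrapper
(sketch from memory, guards and kernel-doc omitted; details may vary
by kernel version):

| static __always_inline int ffs(int x)
| {
| 	return __builtin_ffs(x);
| }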

Patch LGTM; just minor comments on commit message, naming, and formatting.

>
>
> Signed-off-by: Vincent Mailhol <[email protected]>
> ---
> arch/x86/include/asm/bitops.h | 29 +++++++++++++++++------------
> 1 file changed, 17 insertions(+), 12 deletions(-)
>
> diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
> index a288ecd230ab..535a7a358c14 100644
> --- a/arch/x86/include/asm/bitops.h
> +++ b/arch/x86/include/asm/bitops.h
> @@ -269,18 +269,7 @@ static __always_inline unsigned long __fls(unsigned long word)
> #undef ADDR
>
> #ifdef __KERNEL__
> -/**
> - * ffs - find first set bit in word
> - * @x: the word to search
> - *
> - * This is defined the same way as the libc and compiler builtin ffs
> - * routines, therefore differs in spirit from the other bitops.
> - *
> - * ffs(value) returns 0 if value is 0 or the position of the first
> - * set bit if value is nonzero. The first (least significant) bit
> - * is at position 1.
> - */
> -static __always_inline int ffs(int x)
> +static __always_inline int __ffs_asm(int x)

How about variable_ffs rather than __ffs_asm? Let's try to stick with
the convention used by test_bit?
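
That is, mirroring the pattern already visible in the diff context
above:

| #define test_bit(nr, addr) \
| 	(__builtin_constant_p((nr)) \
| 	 ? constant_test_bit((nr), (addr)) \
| 	 : variable_test_bit((nr), (addr)))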

> {
> int r;
>
> @@ -310,6 +299,22 @@ static __always_inline int ffs(int x)
> return r + 1;
> }
>
> +/**
> + * ffs - find first set bit in word
> + * @x: the word to search
> + *
> + * This is defined the same way as the libc and compiler builtin ffs
> + * routines, therefore differs in spirit from the other bitops.
> + *
> + * ffs(value) returns 0 if value is 0 or the position of the first
> + * set bit if value is nonzero. The first (least significant) bit
> + * is at position 1.
> + */
> +#define ffs(x) \
> + (__builtin_constant_p(x) ? \
> + __builtin_ffs(x) : \
> + __ffs_asm(x))
> +

I think this whole #define can fit on one line? If not, perhaps the
BCP can start on the initial line? Otherwise it looks like the
then/else clauses are indented by 1 tab followed by 1 space. Consider
just using tabs.

> /**
> * fls - find last set bit in word
> * @x: the word to search
> --
> 2.35.1
>


--
Thanks,
~Nick Desaulniers

2022-05-11 09:21:33

by Vincent MAILHOL

Subject: Re: [PATCH 1/2] x86/asm/bitops: ffs: use __builtin_ffs to evaluate constant expressions

On Wed. 11 May 2022 at 07:29, Nick Desaulniers <[email protected]> wrote:
> On Tue, May 10, 2022 at 7:26 AM Vincent Mailhol
> <[email protected]> wrote:
> >
> > For x86_64, the current ffs() implementation does not produce
> > optimized code when called with a constant expression. On the
> > contrary, the __builtin_ffs() function of both GCC and clang is able
> > to simplify the expression into a single instruction.
> >
> > * Example *
> >
> > Let's consider two dummy functions foo() and bar() as below:
> >
> > | #include <linux/bitops.h>
> > | #define CONST 0x01000000
> > |
> > | unsigned int foo(void)
> > | {
> > | return ffs(CONST);
> > | }
> > |
> > | unsigned int bar(void)
> > | {
> > | return __builtin_ffs(CONST);
> > | }
> >
> > GCC would produce below assembly code:
> >
> > | 0000000000000000 <foo>:
> > | 0: b8 ff ff ff ff mov $0xffffffff,%eax
> > | 5: 0f bc c7 bsf %edi,%eax
> > | 8: 83 c0 01 add $0x1,%eax
> > | b: c3 ret
> > | c: 0f 1f 40 00 nopl 0x0(%rax)
> > |
> > | 0000000000000010 <bar>:
> > | 10: b8 19 00 00 00 mov $0x19,%eax
> > | 15: c3 ret
> >
> > And clang would produce:
> >
> > | 0000000000000000 <foo>:
> > | 0: 55 push %rbp
> > | 1: 48 89 e5 mov %rsp,%rbp
> > | 4: b8 ff ff ff ff mov $0xffffffff,%eax
> > | 9: 0f bc 05 00 00 00 00 bsf 0x0(%rip),%eax # 10 <foo+0x10>
> > | 10: ff c0 inc %eax
> > | 12: 5d pop %rbp
> > | 13: c3 ret
> > | 14: 66 2e 0f 1f 84 00 00 cs nopw 0x0(%rax,%rax,1)
> > | 1b: 00 00 00
> > | 1e: 66 90 xchg %ax,%ax
> > |
> > | 0000000000000020 <bar>:
> > | 20: 55 push %rbp
> > | 21: 48 89 e5 mov %rsp,%rbp
> > | 24: b8 19 00 00 00 mov $0x19,%eax
> > | 29: 5d pop %rbp
> > | 2a: c3 ret
>
> Right, we need to allocate registers to move the inputs into the asm
> block, and the results back out. Inline asm is analogous to a call
> with a custom calling convention, where we don't look into the body of
> the inline asm.
>
> Does -fomit-frame-pointer make these snippets cleaner, or did
> you not build with -O2? Consider using those flags if so, since we
> generally prefer the ORC unwinder on x86, not the frame pointer
> unwinder. If the compilers are forcing a frame pointer when using the
> builtins once optimizations are enabled, that's a problem (that we've
> seen in the past with the builtins for reading eflags with clang; now
> fixed).

I have not played with those parameters yet, so, short answer, I
am using the kernel defaults (the above assembly was compiled with
Kbuild). You are touching a few topics I am not familiar with, so I
need to do some research before answering you in more detail.

> >
> > For both examples, we clearly see the benefit of using __builtin_ffs()
> > instead of the kernel's asm implementation for constant expressions.
> >
> > However, for non-constant expressions, the ffs() asm version of the
> > kernel remains better for x86_64 because, contrary to GCC, it doesn't
> > emit the CMOV assembly instruction, cf. [1] (notably, clang is
> > able to optimize out the CMOV).
> >
> > This patch uses the __builtin_constant_p() to select between the
> > kernel's ffs() and the __builtin_ffs() depending on whether the
> > argument is constant or not.
> >
> >
> > As a side benefit, this patch also removes the -Wshadow warning below:
> >
> > | ./arch/x86/include/asm/bitops.h:283:28: warning: declaration of 'ffs' shadows a built-in function [-Wshadow]
> > | 283 | static __always_inline int ffs(int x)
>
> Nice! :)

Thanks.

For the record, fixing the -Wshadow is my real motivation. I am
pissed when the header files throw some W=12 warnings. Once
this patch is applied, there will be one last annoying W=2
warning to clear in order to only see local warnings and not
random spam from headers when doing a W=12 (at least on x86_64).

But because those kinds of W=2 fixes aren't so popular, I figured
it would be better to offer something else. I first checked whether
GCC still produces less optimized code than the kernel assembly: that
was still the case. I then looked at the GCC mailing list to see
if a discussion on this topic existed. I didn't find one, but found
instead that GCC could optimize constant expressions. And voilĂ ,
that is how I came to create this patch.

> >
> > [1] commit ca3d30cc02f7 ("x86_64, asm: Optimise fls(), ffs() and fls64()")
> > http://lkml.kernel.org/r/[email protected]
>
> + David, author of ca3d30cc02f7. I was wondering if this applied to
> more than just x86, but I see now that some architectures just include
> include/asm-generic/bitops/builtin-ffs.h into their
> arch/*/include/asm/bitops.h. It's only when we want to beat the
> compiler for non-ICE expressions.

Yes, I did some quick research: the majority of the architectures
already rely on the builtin function. I would need to take a deeper
look to track whether anyone other than x86 also uses assembly.

Also, this potentially may apply to builtin functions other than
the ffs family. I just did not find any other cases so far.

> Patch LGTM; just minor comments on commit message, naming, and formatting.
>
> >
> >
> > Signed-off-by: Vincent Mailhol <[email protected]>
> > ---
> > arch/x86/include/asm/bitops.h | 29 +++++++++++++++++------------
> > 1 file changed, 17 insertions(+), 12 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
> > index a288ecd230ab..535a7a358c14 100644
> > --- a/arch/x86/include/asm/bitops.h
> > +++ b/arch/x86/include/asm/bitops.h
> > @@ -269,18 +269,7 @@ static __always_inline unsigned long __fls(unsigned long word)
> > #undef ADDR
> >
> > #ifdef __KERNEL__
> > -/**
> > - * ffs - find first set bit in word
> > - * @x: the word to search
> > - *
> > - * This is defined the same way as the libc and compiler builtin ffs
> > - * routines, therefore differs in spirit from the other bitops.
> > - *
> > - * ffs(value) returns 0 if value is 0 or the position of the first
> > - * set bit if value is nonzero. The first (least significant) bit
> > - * is at position 1.
> > - */
> > -static __always_inline int ffs(int x)
> > +static __always_inline int __ffs_asm(int x)
>
> How about variable_ffs rather than __ffs_asm? Let's try to stick with
> the convention used by test_bit?

ACK. I will also follow this comment for patch 2/2 and use
__variable_ffs instead of __ffs_asm_not_zero there.

> > {
> > int r;
> >
> > @@ -310,6 +299,22 @@ static __always_inline int ffs(int x)
> > return r + 1;
> > }
> >
> > +/**
> > + * ffs - find first set bit in word
> > + * @x: the word to search
> > + *
> > + * This is defined the same way as the libc and compiler builtin ffs
> > + * routines, therefore differs in spirit from the other bitops.
> > + *
> > + * ffs(value) returns 0 if value is 0 or the position of the first
> > + * set bit if value is nonzero. The first (least significant) bit
> > + * is at position 1.
> > + */
> > +#define ffs(x) \
> > + (__builtin_constant_p(x) ? \
> > + __builtin_ffs(x) : \
> > + __ffs_asm(x))
> > +
>
> I think this whole #define can fit on one line?

I split it into multiple lines to be consistent with other macros
in the file. I have no objection to doing it on a single line (I
will just check whether this is within the 80-character limit and,
if so, will do a single line in v2).

> If not, perhaps the
> BCP can start on the initial line? Otherwise it looks like the
> then/else clauses are indented by 1 tab followed by 1 space. Consider
> just using tabs.

Right, I inadvertently added a space after the tab of the first
line, and the editor's auto-indentation repeated the pattern on the
other lines.

With that said, I will prepare the v2. I will send it within two
days, I think (I cannot do it right now).

> > /**
> > * fls - find last set bit in word
> > * @x: the word to search
> > --
> > 2.35.1
> >
>
>
> --
> Thanks,
> ~Nick Desaulniers

2022-05-11 20:14:03

by Vincent MAILHOL

Subject: Re: [PATCH 0/2] x86/asm/bitops: optimize ff{s,z} functions for constant expressions

On Wed. 11 May 2022 at 08:24, Vincent MAILHOL
<[email protected]> wrote:
> On Wed. 11 May 2022 at 07:14, Nick Desaulniers <[email protected]> wrote:
> > On Tue, May 10, 2022 at 7:26 AM Vincent Mailhol
> > <[email protected]> wrote:
> > >
> > > The compilers provide some builtin expressions equivalent to the
> > > ffs(), __ffs() and ffz() functions of the kernel. The kernel uses
> > > optimized assembly which produces better code than the builtin
> > > functions. However, such assembly code can not be optimized when used
> > > on constant expressions.
> > >
> > > This series relies on __builtin_constant_p to select the optimal solution:
> > >
> > > * use kernel assembly for non constant expressions
> > >
> > > * use compiler's __builtin function for constant expressions.
> > >
> > > I also think that fls() and fls64() can be optimized in a similar
> > > way, using __builtin_clz() and __builtin_clzll(), but it is a bit less
> > > trivial so I want to focus on this series first. If it gets accepted, I
> > > will then work on those two additional functions.
> > >
> > >
> > > ** Statistics **
> > >
> > > On an allyesconfig, before applying this series, I get:
> > >
> > > | $ objdump -d vmlinux.o | grep bsf | wc -l
> > > | 1081
> > >
> > > After applying this series:
> > >
> > > | $ objdump -d vmlinux.o | grep bsf | wc -l
> > > | 792
> > >
> > > So, roughly 26.7% of the calls to either ffs() or __ffs() were using
> > > constant expressions and could be optimized (I did not produce the
> > > figures for ffz()).
> >
> > These stats are interesting; consider putting them on patch 1/2 commit
> > message though (in addition to the cover letter). (Sending thoughts on
> > 1/2 next).
>
> The fact is that patch 1/2 changes ffs() and patch 2/2 changes
> __ffs(). For v2, I will run the stats on each patch separately in
> order not to mix the results.
>
> > >
> > > (tests done on linux v5.18-rc5 x86_64 using GCC 11.2.1)
> >
> > Here's the same measure of x86_64 allyesconfig (./scripts/config -d
> > CONFIG_HINIC) at 9be9ed2612b5aedb52a2c240edb1630b6b743cb6 with ToT
> > LLVM (~clang-15):
> >
> > Before:
> > $ objdump -d vmlinux.o | grep bsf | wc -l
> > 1454
> >
> > After:
> > $ objdump -d vmlinux.o | grep bsf | wc -l
> > 1070
> >
> > -26.4% :)
>
> Roughly same ratio. I am just surprised that the absolute numbers
> are different:
>
> * GCC before: 1081, after 792
> * clang before 1454, after 1070
>
> I wonder why clang produces more bsf instructions than GCC.

Did not find the answer yet, but while looking at this, I found
another interesting thing: on x86_64, the bsf instruction becomes
tzcnt when used with the rep prefix. So ffs() produces bsf assembly
instructions but __ffs() and ffz() produce tzcnt, cf.

http://lkml.kernel.org/r/[email protected]
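
As a sketch of why, here are the opcode bytes for the register form
shown in the examples above:

| f3 0f bc c7    tzcnt %edi,%eax   /* rep (f3) + bsf (0f bc) decodes as tzcnt */
| 0f bc c7       bsf   %edi,%eax   /* plain bsf, without the rep prefix       */

On CPUs without BMI1, the f3 prefix is simply ignored and the bytes
execute as a plain bsf.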

I will update the figures in v2 and benchmark both bsf and tzcnt.

> Also, on a side note, I am not the first one to realize that
> __builtin_ffs() is able to optimize constant expressions. Some
> people have already used it locally:
>
> | $ git grep __builtin_ffs | wc -l
> | 80

2022-05-13 14:17:21

by Vincent MAILHOL

Subject: Re: [PATCH 1/2] x86/asm/bitops: ffs: use __builtin_ffs to evaluate constant expressions

On Wed. 11 May 2022 at 08:54, Vincent MAILHOL
<[email protected]> wrote:
> On Wed. 11 May 2022 at 07:29, Nick Desaulniers <[email protected]> wrote:
> > On Tue, May 10, 2022 at 7:26 AM Vincent Mailhol
> > <[email protected]> wrote:
> > >
> > > For x86_64, the current ffs() implementation does not produce
> > > optimized code when called with a constant expression. On the
> > > contrary, the __builtin_ffs() function of both GCC and clang is able
> > > to simplify the expression into a single instruction.
> > >
> > > * Example *
> > >
> > > Let's consider two dummy functions foo() and bar() as below:
> > >
> > > | #include <linux/bitops.h>
> > > | #define CONST 0x01000000
> > > |
> > > | unsigned int foo(void)
> > > | {
> > > | return ffs(CONST);
> > > | }
> > > |
> > > | unsigned int bar(void)
> > > | {
> > > | return __builtin_ffs(CONST);
> > > | }
> > >
> > > GCC would produce below assembly code:
> > >
> > > | 0000000000000000 <foo>:
> > > | 0: b8 ff ff ff ff mov $0xffffffff,%eax
> > > | 5: 0f bc c7 bsf %edi,%eax
> > > | 8: 83 c0 01 add $0x1,%eax
> > > | b: c3 ret
> > > | c: 0f 1f 40 00 nopl 0x0(%rax)
> > > |
> > > | 0000000000000010 <bar>:
> > > | 10: b8 19 00 00 00 mov $0x19,%eax
> > > | 15: c3 ret
> > >
> > > And clang would produce:
> > >
> > > | 0000000000000000 <foo>:
> > > | 0: 55 push %rbp
> > > | 1: 48 89 e5 mov %rsp,%rbp
> > > | 4: b8 ff ff ff ff mov $0xffffffff,%eax
> > > | 9: 0f bc 05 00 00 00 00 bsf 0x0(%rip),%eax # 10 <foo+0x10>
> > > | 10: ff c0 inc %eax
> > > | 12: 5d pop %rbp
> > > | 13: c3 ret
> > > | 14: 66 2e 0f 1f 84 00 00 cs nopw 0x0(%rax,%rax,1)
> > > | 1b: 00 00 00
> > > | 1e: 66 90 xchg %ax,%ax
> > > |
> > > | 0000000000000020 <bar>:
> > > | 20: 55 push %rbp
> > > | 21: 48 89 e5 mov %rsp,%rbp
> > > | 24: b8 19 00 00 00 mov $0x19,%eax
> > > | 29: 5d pop %rbp
> > > | 2a: c3 ret
> >
> > Right, we need to allocate registers to move the inputs into the asm
> > block, and the results back out. Inline asm is analogous to a call
> > with a custom calling convention, where we don't look into the body of
> > the inline asm.
> >
> > Does -fomit-frame-pointer make these snippets cleaner, or did
> > you not build with -O2? Consider using those flags if so, since we
> > generally prefer the ORC unwinder on x86, not the frame pointer
> > unwinder. If the compilers are forcing a frame pointer when using the
> > builtins once optimizations are enabled, that's a problem (that we've
> > seen in the past with the builtins for reading eflags with clang; now
> > fixed).
>
> I have not played with those parameters yet, so, short answer, I
> am using the kernel defaults (the above assembly was compiled with
> Kbuild). You are touching a few topics I am not familiar with, so I
> need to do some research before answering you in more detail.

I got the answer: actually, I was using an allnoconfig when I
generated the above assembly code. And it appears that allnoconfig has
-fno-omit-frame-pointer by default. I will activate the ORC unwinder
and update the snippets in v2.
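
For v2, that would be something like (assuming the usual
scripts/config helper, as used earlier in this thread):

| $ ./scripts/config -e CONFIG_UNWINDER_ORC -d CONFIG_UNWINDER_FRAME_POINTER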

> > >
> > > For both examples, we clearly see the benefit of using __builtin_ffs()
> > > instead of the kernel's asm implementation for constant expressions.
> > >
> > > However, for non-constant expressions, the ffs() asm version of the
> > > kernel remains better for x86_64 because, contrary to GCC, it doesn't
> > > emit the CMOV assembly instruction, cf. [1] (notably, clang is
> > > able to optimize out the CMOV).
> > >
> > > This patch uses the __builtin_constant_p() to select between the
> > > kernel's ffs() and the __builtin_ffs() depending on whether the
> > > argument is constant or not.
> > >
> > >
> > > As a side benefit, this patch also removes the -Wshadow warning below:
> > >
> > > | ./arch/x86/include/asm/bitops.h:283:28: warning: declaration of 'ffs' shadows a built-in function [-Wshadow]
> > > | 283 | static __always_inline int ffs(int x)
> >
> > Nice! :)
>
> Thanks.
>
> For the record, fixing the -Wshadow is my real motivation. I am
> pissed when the header files throw some W=12 warnings. Once
> this patch is applied, there will be one last annoying W=2
> warning to clear in order to only see local warnings and not
> random spam from headers when doing a W=12 (at least on x86_64).
>
> But because those kinds of W=2 fixes aren't so popular, I figured
> it would be better to offer something else. I first checked whether
> GCC still produces less optimized code than the kernel assembly: that
> was still the case. I then looked at the GCC mailing list to see
> if a discussion on this topic existed. I didn't find one, but found
> instead that GCC could optimize constant expressions. And voilĂ ,
> that is how I came to create this patch.
>
> > >
> > > [1] commit ca3d30cc02f7 ("x86_64, asm: Optimise fls(), ffs() and fls64()")
> > > http://lkml.kernel.org/r/[email protected]
> >
> > + David, author of ca3d30cc02f7. I was wondering if this applied to
> > more than just x86, but I see now that some architectures just include
> > include/asm-generic/bitops/builtin-ffs.h into their
> > arch/*/include/asm/bitops.h. It's only when we want to beat the
> > compiler for non-ICE expressions.
>
> Yes, I did some quick research: the majority of the architectures
> already rely on the builtin function. I would need to take a deeper
> look to track whether anyone other than x86 also uses assembly.
>
> Also, this potentially may apply to builtin functions other than
> the ffs family. I just did not find any other cases so far.
>
> > Patch LGTM; just minor comments on commit message, naming, and formatting.
> >
> > >
> > >
> > > Signed-off-by: Vincent Mailhol <[email protected]>
> > > ---
> > > arch/x86/include/asm/bitops.h | 29 +++++++++++++++++------------
> > > 1 file changed, 17 insertions(+), 12 deletions(-)
> > >
> > > diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
> > > index a288ecd230ab..535a7a358c14 100644
> > > --- a/arch/x86/include/asm/bitops.h
> > > +++ b/arch/x86/include/asm/bitops.h
> > > @@ -269,18 +269,7 @@ static __always_inline unsigned long __fls(unsigned long word)
> > > #undef ADDR
> > >
> > > #ifdef __KERNEL__
> > > -/**
> > > - * ffs - find first set bit in word
> > > - * @x: the word to search
> > > - *
> > > - * This is defined the same way as the libc and compiler builtin ffs
> > > - * routines, therefore differs in spirit from the other bitops.
> > > - *
> > > - * ffs(value) returns 0 if value is 0 or the position of the first
> > > - * set bit if value is nonzero. The first (least significant) bit
> > > - * is at position 1.
> > > - */
> > > -static __always_inline int ffs(int x)
> > > +static __always_inline int __ffs_asm(int x)
> >
> > How about variable_ffs rather than __ffs_asm? Let's try to stick with
> > the convention used by test_bit?
>
> ACK. I will also follow this comment for patch 2/2 and use
> __variable_ffs instead of __ffs_asm_not_zero there.
>
> > > {
> > > int r;
> > >
> > > @@ -310,6 +299,22 @@ static __always_inline int ffs(int x)
> > > return r + 1;
> > > }
> > >
> > > +/**
> > > + * ffs - find first set bit in word
> > > + * @x: the word to search
> > > + *
> > > + * This is defined the same way as the libc and compiler builtin ffs
> > > + * routines, therefore differs in spirit from the other bitops.
> > > + *
> > > + * ffs(value) returns 0 if value is 0 or the position of the first
> > > + * set bit if value is nonzero. The first (least significant) bit
> > > + * is at position 1.
> > > + */
> > > +#define ffs(x) \
> > > + (__builtin_constant_p(x) ? \
> > > + __builtin_ffs(x) : \
> > > + __ffs_asm(x))
> > > +
> >
> > I think this whole #define can fit on one line?
>
> I split it into multiple lines to be consistent with other macros
> in the file. I have no objection to doing it on a single line (I
> will just check whether this is within the 80-character limit and,
> if so, will do a single line in v2).
>
> > If not, perhaps the
> > BCP can start on the initial line? Otherwise it looks like the
> > then/else clauses are indented by 1 tab followed by 1 space. Consider
> > just using tabs.
>
> Right, I inadvertently added a space after the tab of the first
> line, and the editor's auto-indentation repeated the pattern on the
> other lines.
>
> With that said, I will prepare the v2. I will send it within two
> days, I think (I cannot do it right now).
>
> > > /**
> > > * fls - find last set bit in word
> > > * @x: the word to search
> > > --
> > > 2.35.1
> > >
> >
> >
> > --
> > Thanks,
> > ~Nick Desaulniers