2017-04-21 11:18:48

by Christophe Leroy

[permalink] [raw]
Subject: [PATCH 0/4] powerpc: Replace some bitops functions by builtin/generic ones

This patchset replaces the following bitops by builtin/generic ones:
- ffs() / __ffs()
- fls() / __fls() / fls64()
- ffz()
- __ilog2() / __ilog2_u32() / __ilog2_u64()

The current functions are written as inline assembly which prevents
GCC to optimise them in case of constant parameters and obliges GCC
to group the related instructions all together.

With the builtin alternatives, GCC optimises better

Christophe Leroy (4):
powerpc: Discard ffs()/__ffs() function and use builtin functions
instead
powerpc: Use builtin functions for fls()/__fls()/fls64()
powerpc: Replace ffz() by equivalent generic function
powerpc: remove __ilog2()s and use generic ones

arch/powerpc/Kconfig | 8 ----
arch/powerpc/include/asm/bitops.h | 87 ++++-----------------------------------
2 files changed, 7 insertions(+), 88 deletions(-)

--
2.12.0


2017-04-21 11:18:54

by Christophe Leroy

[permalink] [raw]
Subject: [PATCH 2/4] powerpc: Use builtin functions for fls()/__fls()/fls64()

With the fls() functions as defined in arch/powerpc/include/asm/bitops.h
GCC will not optimise the code in case of constant parameter.

This patch replaces __fls() by the builtin function, and modifies
fls() and fls64() to use builtins instead of inline assembly

For non constant calls, the generated code is doing the same:

int testfls(unsigned int x)
{
return fls(x);
}

unsigned long test__fls(unsigned long x)
{
return __fls(x);
}

int testfls64(__u64 x)
{
return fls64(x);
}

On PPC32, before the patch:
00000064 <testfls>:
64: 7c 63 00 34 cntlzw r3,r3
68: 20 63 00 20 subfic r3,r3,32
6c: 4e 80 00 20 blr

00000070 <test__fls>:
70: 7c 63 00 34 cntlzw r3,r3
74: 20 63 00 1f subfic r3,r3,31
78: 4e 80 00 20 blr

0000007c <testfls64>:
7c: 2c 03 00 00 cmpwi r3,0
80: 40 82 00 10 bne 90 <testfls64+0x14>
84: 7c 83 00 34 cntlzw r3,r4
88: 20 63 00 20 subfic r3,r3,32
8c: 4e 80 00 20 blr
90: 7c 63 00 34 cntlzw r3,r3
94: 20 63 00 40 subfic r3,r3,64
98: 4e 80 00 20 blr

On PPC32, after the patch:
00000054 <testfls>:
54: 7c 63 00 34 cntlzw r3,r3
58: 20 63 00 20 subfic r3,r3,32
5c: 4e 80 00 20 blr

00000060 <test__fls>:
60: 7c 63 00 34 cntlzw r3,r3
64: 20 63 00 1f subfic r3,r3,31
68: 4e 80 00 20 blr

0000006c <testfls64>:
6c: 2c 03 00 00 cmpwi r3,0
70: 41 82 00 10 beq 80 <testfls64+0x14>
74: 7c 63 00 34 cntlzw r3,r3
78: 20 63 00 40 subfic r3,r3,64
7c: 4e 80 00 20 blr
80: 7c 83 00 34 cntlzw r3,r4
84: 20 63 00 40 subfic r3,r3,32
88: 4e 80 00 20 blr

On PPC64, before the patch:
00000000000000a0 <.testfls>:
a0: 7c 63 00 34 cntlzw r3,r3
a4: 20 63 00 20 subfic r3,r3,32
a8: 7c 63 07 b4 extsw r3,r3
ac: 4e 80 00 20 blr

00000000000000b0 <.test__fls>:
b0: 7c 63 00 74 cntlzd r3,r3
b4: 20 63 00 3f subfic r3,r3,63
b8: 7c 63 07 b4 extsw r3,r3
bc: 4e 80 00 20 blr

00000000000000c0 <.testfls64>:
c0: 7c 63 00 74 cntlzd r3,r3
c4: 20 63 00 40 subfic r3,r3,64
c8: 7c 63 07 b4 extsw r3,r3
cc: 4e 80 00 20 blr

On PPC64, after the patch:
0000000000000090 <.testfls>:
90: 7c 63 00 34 cntlzw r3,r3
94: 20 63 00 20 subfic r3,r3,32
98: 7c 63 07 b4 extsw r3,r3
9c: 4e 80 00 20 blr

00000000000000a0 <.test__fls>:
a0: 7c 63 00 74 cntlzd r3,r3
a4: 20 63 00 3f subfic r3,r3,63
a8: 4e 80 00 20 blr
ac: 60 00 00 00 nop

00000000000000b0 <.testfls64>:
b0: 7c 63 00 74 cntlzd r3,r3
b4: 20 63 00 40 subfic r3,r3,64
b8: 7c 63 07 b4 extsw r3,r3
bc: 4e 80 00 20 blr

Those builtins have been in GCC since at least 3.4.6 (see
https://gcc.gnu.org/onlinedocs/gcc-3.4.6/gcc/Other-Builtins.html )

Signed-off-by: Christophe Leroy <[email protected]>
---
arch/powerpc/include/asm/bitops.h | 24 +++---------------------
1 file changed, 3 insertions(+), 21 deletions(-)

diff --git a/arch/powerpc/include/asm/bitops.h b/arch/powerpc/include/asm/bitops.h
index 71b05685f3a7..af36b404dbe8 100644
--- a/arch/powerpc/include/asm/bitops.h
+++ b/arch/powerpc/include/asm/bitops.h
@@ -263,33 +263,15 @@ static __inline__ unsigned long ffz(unsigned long x)
*/
static __inline__ int fls(unsigned int x)
{
- int lz;
-
- asm ("cntlzw %0,%1" : "=r" (lz) : "r" (x));
- return 32 - lz;
+ return 32 - __builtin_clz(x);
}

-static __inline__ unsigned long __fls(unsigned long x)
-{
- return __ilog2(x);
-}
+#include <asm-generic/bitops/builtin-__fls.h>

-/*
- * 64-bit can do this using one cntlzd (count leading zeroes doubleword)
- * instruction; for 32-bit we use the generic version, which does two
- * 32-bit fls calls.
- */
-#ifdef __powerpc64__
static __inline__ int fls64(__u64 x)
{
- int lz;
-
- asm ("cntlzd %0,%1" : "=r" (lz) : "r" (x));
- return 64 - lz;
+ return 64 - __builtin_clzll(x);
}
-#else
-#include <asm-generic/bitops/fls64.h>
-#endif /* __powerpc64__ */

#ifdef CONFIG_PPC64
unsigned int __arch_hweight8(unsigned int w);
--
2.12.0

2017-04-21 11:18:56

by Christophe Leroy

[permalink] [raw]
Subject: [PATCH 1/4] powerpc: Discard ffs()/__ffs() function and use builtin functions instead

With the ffs() function as defined in arch/powerpc/include/asm/bitops.h
GCC will not optimise the code in case of constant parameter, as shown
by the small exemple below.

int ffs_test(void)
{
return 4 << ffs(31);
}

c0012334 <ffs_test>:
c0012334: 39 20 00 01 li r9,1
c0012338: 38 60 00 04 li r3,4
c001233c: 7d 29 00 34 cntlzw r9,r9
c0012340: 21 29 00 20 subfic r9,r9,32
c0012344: 7c 63 48 30 slw r3,r3,r9
c0012348: 4e 80 00 20 blr

With this patch, the same function will compile as follows:

c0012334 <ffs_test>:
c0012334: 38 60 00 08 li r3,8
c0012338: 4e 80 00 20 blr

The same happens with __ffs()

For non constant calls, the generated code is doing the same,
allthought it is slightly different on 64 bits for ffs():

unsigned long test__ffs(unsigned long x)
{
return __ffs(x);
}

int testffs(int x)
{
return ffs(x);
}

On PPC32, before the patch:
0000003c <test__ffs>:
3c: 7d 23 00 d0 neg r9,r3
40: 7d 23 18 38 and r3,r9,r3
44: 7c 63 00 34 cntlzw r3,r3
48: 20 63 00 1f subfic r3,r3,31
4c: 4e 80 00 20 blr

00000050 <testffs>:
50: 7d 23 00 d0 neg r9,r3
54: 7d 23 18 38 and r3,r9,r3
58: 7c 63 00 34 cntlzw r3,r3
5c: 20 63 00 20 subfic r3,r3,32
60: 4e 80 00 20 blr

On PPC32, after the patch:
0000002c <test__ffs>:
2c: 7d 23 00 d0 neg r9,r3
30: 7d 23 18 38 and r3,r9,r3
34: 7c 63 00 34 cntlzw r3,r3
38: 20 63 00 1f subfic r3,r3,31
3c: 4e 80 00 20 blr

00000040 <testffs>:
40: 7d 23 00 d0 neg r9,r3
44: 7d 23 18 38 and r3,r9,r3
48: 7c 63 00 34 cntlzw r3,r3
4c: 20 63 00 20 subfic r3,r3,32
50: 4e 80 00 20 blr

On PPC64, before the patch:
0000000000000060 <.test__ffs>:
60: 7c 03 00 d0 neg r0,r3
64: 7c 03 18 38 and r3,r0,r3
68: 7c 63 00 74 cntlzd r3,r3
6c: 20 63 00 3f subfic r3,r3,63
70: 7c 63 07 b4 extsw r3,r3
74: 4e 80 00 20 blr

0000000000000080 <.testffs>:
80: 7c 03 00 d0 neg r0,r3
84: 7c 03 18 38 and r3,r0,r3
88: 7c 63 00 74 cntlzd r3,r3
8c: 20 63 00 40 subfic r3,r3,64
90: 7c 63 07 b4 extsw r3,r3
94: 4e 80 00 20 blr

On PPC64, after the patch:
0000000000000050 <.test__ffs>:
50: 7c 03 00 d0 neg r0,r3
54: 7c 03 18 38 and r3,r0,r3
58: 7c 63 00 74 cntlzd r3,r3
5c: 20 63 00 3f subfic r3,r3,63
60: 4e 80 00 20 blr

0000000000000070 <.testffs>:
70: 7c 03 00 d0 neg r0,r3
74: 7c 03 18 38 and r3,r0,r3
78: 7c 63 00 34 cntlzw r3,r3
7c: 20 63 00 20 subfic r3,r3,32
80: 7c 63 07 b4 extsw r3,r3
84: 4e 80 00 20 blr
(ffs() operates on an int so cntlzw is equivalent to cntlzd)

In addition, when reading the generated vmlinux, we can observe
that with the builtin functions, GCC sometimes efficiently spreads
the instructions within the generated functions while the inline
assembly force them to remain grouped together.

__builtin_ffs() is already used in arch/powerpc/include/asm/page_32.h

Those builtins have been in GCC since at least 3.4.6 (see
https://gcc.gnu.org/onlinedocs/gcc-3.4.6/gcc/Other-Builtins.html )

Signed-off-by: Christophe Leroy <[email protected]>
---
arch/powerpc/include/asm/bitops.h | 16 ++--------------
1 file changed, 2 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/include/asm/bitops.h b/arch/powerpc/include/asm/bitops.h
index 33a24fdd7958..71b05685f3a7 100644
--- a/arch/powerpc/include/asm/bitops.h
+++ b/arch/powerpc/include/asm/bitops.h
@@ -253,21 +253,9 @@ static __inline__ unsigned long ffz(unsigned long x)
return __ilog2(x & -x);
}

-static __inline__ unsigned long __ffs(unsigned long x)
-{
- return __ilog2(x & -x);
-}
+#include <asm-generic/bitops/builtin-__ffs.h>

-/*
- * ffs: find first bit set. This is defined the same way as
- * the libc and compiler builtin ffs routines, therefore
- * differs in spirit from the above ffz (man ffs).
- */
-static __inline__ int ffs(int x)
-{
- unsigned long i = (unsigned long)x;
- return __ilog2(i & -i) + 1;
-}
+#include <asm-generic/bitops/builtin-ffs.h>

/*
* fls: find last (most-significant) bit set.
--
2.12.0

2017-04-21 11:19:02

by Christophe Leroy

[permalink] [raw]
Subject: [PATCH 3/4] powerpc: Replace ffz() by equivalent generic function

With the ffz() function as defined in arch/powerpc/include/asm/bitops.h
GCC will not optimise the code in case of constant parameter.

This patch replaces ffz() by the generic function.

The generic ffz(x) expects to never be called with ~x == 0
as written in the comment in include/asm-generic/bitops/ffz.h
The only user of ffz() within arch/powerpc/ is
platforms/512x/mpc5121_ads_cpld.c, which checks if x is not 0xff

For non constant calls, the generated code is doing the same:

unsigned long testffz(unsigned long x)
{
return ffz(x);
}

On PPC32, before the patch:
00000018 <testffz>:
18: 7c 63 18 f9 not. r3,r3
1c: 40 82 00 0c bne 28 <testffz+0x10>
20: 38 60 00 20 li r3,32
24: 4e 80 00 20 blr
28: 7d 23 00 d0 neg r9,r3
2c: 7d 23 18 38 and r3,r9,r3
30: 7c 63 00 34 cntlzw r3,r3
34: 20 63 00 1f subfic r3,r3,31
38: 4e 80 00 20 blr

On PPC32, after the patch:
00000018 <testffz>:
18: 39 23 00 01 addi r9,r3,1
1c: 7d 23 18 78 andc r3,r9,r3
20: 7c 63 00 34 cntlzw r3,r3
24: 20 63 00 1f subfic r3,r3,31
28: 4e 80 00 20 blr

On PPC64, before the patch:
0000000000000030 <.testffz>:
30: 7c 60 18 f9 not. r0,r3
34: 38 60 00 40 li r3,64
38: 4d 82 00 20 beqlr
3c: 7c 60 00 d0 neg r3,r0
40: 7c 63 00 38 and r3,r3,r0
44: 7c 63 00 74 cntlzd r3,r3
48: 20 63 00 3f subfic r3,r3,63
4c: 7c 63 07 b4 extsw r3,r3
50: 4e 80 00 20 blr

On PPC64, after the patch:
0000000000000030 <.testffz>:
30: 38 03 00 01 addi r0,r3,1
34: 7c 03 18 78 andc r3,r0,r3
38: 7c 63 00 74 cntlzd r3,r3
3c: 20 63 00 3f subfic r3,r3,63
40: 4e 80 00 20 blr

Signed-off-by: Christophe Leroy <[email protected]>
---
arch/powerpc/include/asm/bitops.h | 20 +-------------------
1 file changed, 1 insertion(+), 19 deletions(-)

diff --git a/arch/powerpc/include/asm/bitops.h b/arch/powerpc/include/asm/bitops.h
index af36b404dbe8..d835cd697d6b 100644
--- a/arch/powerpc/include/asm/bitops.h
+++ b/arch/powerpc/include/asm/bitops.h
@@ -233,25 +233,7 @@ int __ilog2_u64(u64 n)
}
#endif

-/*
- * Determines the bit position of the least significant 0 bit in the
- * specified double word. The returned bit position will be
- * zero-based, starting from the right side (63/31 - 0).
- */
-static __inline__ unsigned long ffz(unsigned long x)
-{
- /* no zero exists anywhere in the 8 byte area. */
- if ((x = ~x) == 0)
- return BITS_PER_LONG;
-
- /*
- * Calculate the bit position of the least significant '1' bit in x
- * (since x has been changed this will actually be the least significant
- * '0' bit in * the original x). Note: (x & -x) gives us a mask that
- * is the least significant * (RIGHT-most) 1-bit of the value in x.
- */
- return __ilog2(x & -x);
-}
+#include <asm-generic/bitops/ffz.h>

#include <asm-generic/bitops/builtin-__ffs.h>

--
2.12.0

2017-04-21 11:19:29

by Christophe Leroy

[permalink] [raw]
Subject: [PATCH 4/4] powerpc: remove __ilog2()s and use generic ones

With the __ilog2() function as defined in
arch/powerpc/include/asm/bitops.h, GCC will not optimise the code
in case of constant parameter.

The generic ilog2() function in include/linux/log2.h is written
to handle the case of the constant parameter.

This patch discards the three __ilog2() functions and
defines __ilog2() as ilog2()

For non constant calls, the generated code is doing the same:
int test__ilog2(unsigned long x)
{
return __ilog2(x);
}

int test__ilog2_u32(u32 n)
{
return __ilog2_u32(n);
}

int test__ilog2_u64(u64 n)
{
return __ilog2_u64(n);
}

On PPC32 before the patch:
00000000 <test__ilog2>:
0: 7c 63 00 34 cntlzw r3,r3
4: 20 63 00 1f subfic r3,r3,31
8: 4e 80 00 20 blr

0000000c <test__ilog2_u32>:
c: 7c 63 00 34 cntlzw r3,r3
10: 20 63 00 1f subfic r3,r3,31
14: 4e 80 00 20 blr

On PPC32 after the patch:
00000000 <test__ilog2>:
0: 7c 63 00 34 cntlzw r3,r3
4: 20 63 00 1f subfic r3,r3,31
8: 4e 80 00 20 blr

0000000c <test__ilog2_u32>:
c: 7c 63 00 34 cntlzw r3,r3
10: 20 63 00 1f subfic r3,r3,31
14: 4e 80 00 20 blr

On PPC64 before the patch:
0000000000000000 <.test__ilog2>:
0: 7c 63 00 74 cntlzd r3,r3
4: 20 63 00 3f subfic r3,r3,63
8: 7c 63 07 b4 extsw r3,r3
c: 4e 80 00 20 blr

0000000000000010 <.test__ilog2_u32>:
10: 7c 63 00 34 cntlzw r3,r3
14: 20 63 00 1f subfic r3,r3,31
18: 7c 63 07 b4 extsw r3,r3
1c: 4e 80 00 20 blr

0000000000000020 <.test__ilog2_u64>:
20: 7c 63 00 74 cntlzd r3,r3
24: 20 63 00 3f subfic r3,r3,63
28: 7c 63 07 b4 extsw r3,r3
2c: 4e 80 00 20 blr

On PPC64 after the patch:
0000000000000000 <.test__ilog2>:
0: 7c 63 00 74 cntlzd r3,r3
4: 20 63 00 3f subfic r3,r3,63
8: 7c 63 07 b4 extsw r3,r3
c: 4e 80 00 20 blr

0000000000000010 <.test__ilog2_u32>:
10: 7c 63 00 34 cntlzw r3,r3
14: 20 63 00 1f subfic r3,r3,31
18: 7c 63 07 b4 extsw r3,r3
1c: 4e 80 00 20 blr

0000000000000020 <.test__ilog2_u64>:
20: 7c 63 00 74 cntlzd r3,r3
24: 20 63 00 3f subfic r3,r3,63
28: 7c 63 07 b4 extsw r3,r3
2c: 4e 80 00 20 blr

Signed-off-by: Christophe Leroy <[email protected]>
---
arch/powerpc/Kconfig | 8 --------
arch/powerpc/include/asm/bitops.h | 27 +--------------------------
2 files changed, 1 insertion(+), 34 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 2bfd49098c5e..f231b0f190b5 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -62,14 +62,6 @@ config GENERIC_LOCKBREAK
default y
depends on SMP && PREEMPT

-config ARCH_HAS_ILOG2_U32
- bool
- default y
-
-config ARCH_HAS_ILOG2_U64
- bool
- default y if 64BIT
-
config GENERIC_HWEIGHT
bool
default y
diff --git a/arch/powerpc/include/asm/bitops.h b/arch/powerpc/include/asm/bitops.h
index d835cd697d6b..b750ffef83c7 100644
--- a/arch/powerpc/include/asm/bitops.h
+++ b/arch/powerpc/include/asm/bitops.h
@@ -206,32 +206,7 @@ static __inline__ void __clear_bit_unlock(int nr, volatile unsigned long *addr)
* Return the zero-based bit position (LE, not IBM bit numbering) of
* the most significant 1-bit in a double word.
*/
-static __inline__ __attribute__((const))
-int __ilog2(unsigned long x)
-{
- int lz;
-
- asm (PPC_CNTLZL "%0,%1" : "=r" (lz) : "r" (x));
- return BITS_PER_LONG - 1 - lz;
-}
-
-static inline __attribute__((const))
-int __ilog2_u32(u32 n)
-{
- int bit;
- asm ("cntlzw %0,%1" : "=r" (bit) : "r" (n));
- return 31 - bit;
-}
-
-#ifdef __powerpc64__
-static inline __attribute__((const))
-int __ilog2_u64(u64 n)
-{
- int bit;
- asm ("cntlzd %0,%1" : "=r" (bit) : "r" (n));
- return 63 - bit;
-}
-#endif
+#define __ilog2(x) ilog2(x)

#include <asm-generic/bitops/ffz.h>

--
2.12.0

2017-06-05 10:22:27

by Michael Ellerman

[permalink] [raw]
Subject: Re: [1/4] powerpc: Discard ffs()/__ffs() function and use builtin functions instead

On Fri, 2017-04-21 at 11:18:46 UTC, Christophe Leroy wrote:
> With the ffs() function as defined in arch/powerpc/include/asm/bitops.h
> GCC will not optimise the code in case of constant parameter, as shown
> by the small exemple below.
...
>
> In addition, when reading the generated vmlinux, we can observe
> that with the builtin functions, GCC sometimes efficiently spreads
> the instructions within the generated functions while the inline
> assembly force them to remain grouped together.
>
> __builtin_ffs() is already used in arch/powerpc/include/asm/page_32.h
>
> Those builtins have been in GCC since at least 3.4.6 (see
> https://gcc.gnu.org/onlinedocs/gcc-3.4.6/gcc/Other-Builtins.html )
>
> Signed-off-by: Christophe Leroy <[email protected]>

Series applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/f83647d642270f6b9d75736817fb5a

cheers