2022-07-28 16:13:54

by Yury Norov

Subject: [PATCH 0/5] lib/find: optimize find_bit() functions

In the recent discussion [1], it was noticed that the find_next_bit()
functions may be improved by adding wrappers around a common
__find_next_bit().

I tried that trick, and although it doesn't affect performance
significantly, as reported by find_bit_benchmark, there's a ~2.5K
decrease in image size.

find_first_bit() is reworked accordingly.

[1] https://lkml.org/lkml/2022/7/25/830

Yury Norov (5):
lib/find_bit: rename le to need_swab in __find_next_bit()
lib/find_bit: optimize find_next_bit() functions
lib/find_bit: unify _find_first_{,and,zero}_bit implementations
lib/find_bit: create find_first_zero_bit_le()
lib/find_bit: re-use __find_first_bit() in __find_next_bit()

 include/linux/find.h |  45 +++++++++----
 lib/find_bit.c       | 155 +++++++++++++++++++++++++++++--------------
 2 files changed, 138 insertions(+), 62 deletions(-)

--
2.34.1


2022-07-28 16:31:23

by Yury Norov

Subject: [PATCH 2/5] lib/find_bit: optimize find_next_bit() functions

The function _find_next_bit() takes parameters that modify its behavior
to implement the and-, zero- and le- flavors. The parameters are passed
at compile time, but the current design prevents the compiler from
optimizing them out.

This patch adds wrappers around _find_next_bit() and turns it into an
internal inline helper, so that the optimization becomes possible.
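
Schematically, the new shape is as follows (a condensed sketch; the full
helper body is in the lib/find_bit.c hunk below). The helper is now
static inline, so each wrapper gets its own specialized copy in which
the constant arguments are folded: the "invert" XOR disappears and the
dead "addr2"/"need_swab" branches are dropped.

static inline unsigned long __find_next_bit(const unsigned long *addr1,
                const unsigned long *addr2, unsigned long nbits,
                unsigned long start, unsigned long invert, bool need_swab);
                /* body unchanged; see the hunk below */

unsigned long _find_next_bit(const unsigned long *addr,
                             unsigned long nbits, unsigned long start)
{
        return __find_next_bit(addr, NULL, nbits, start, 0UL, 0);
}

unsigned long _find_next_zero_bit(const unsigned long *addr,
                                  unsigned long nbits, unsigned long start)
{
        return __find_next_bit(addr, NULL, nbits, start, ~0UL, 0);
}

unsigned long _find_next_and_bit(const unsigned long *addr1,
                                 const unsigned long *addr2,
                                 unsigned long nbits, unsigned long start)
{
        return __find_next_bit(addr1, addr2, nbits, start, 0UL, 0);
}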

I ran find_bit_benchmark 5 times on top of 5.19-rc8 and 5 times on top
of this patch. The results for kvm/x86_64 are:
                       v5.19-rc8   Optimized   Difference  (more - better)
Random dense bitmap        ns          ns           %          sigmas*
find_next_bit:           721209      692337         4           0.73
find_next_zero_bit:      738138      701094         5           0.52
find_last_bit:           802393      698133        13           0.49
find_first_bit:         3560900     3574644         0          -0.07
find_first_and_bit:    38601442    37945046         2           0.71
find_next_and_bit:       335574      306184         9           2.36

Random sparse bitmap
find_next_bit:            15868       13856        13           0.82
find_next_zero_bit:     1311843     1227418         4           0.72
find_last_bit:            13633       14080        -3          -0.74
find_first_bit:         1273625     1253343         1           0.52
find_first_and_bit:        8548        8157         7           0.32
find_next_and_bit:         8828        8437         6           0.52

* Calculated as:
(mean(before) - mean(after)) / max(std(before), std(after))
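
In code, the metric is simply the following (a hypothetical helper, not
part of find_bit_benchmark; population std over the runs):

#include <math.h>

/*
 * Effect size of a before/after comparison:
 * (mean(before) - mean(after)) / max(std(before), std(after)).
 * Positive values mean "after" is faster.
 */
static double sigmas(const double *before, const double *after, int n)
{
        double mb = 0, ma = 0, sb = 0, sa = 0;
        int i;

        for (i = 0; i < n; i++) {
                mb += before[i] / n;
                ma += after[i] / n;
        }

        for (i = 0; i < n; i++) {
                sb += (before[i] - mb) * (before[i] - mb) / n;
                sa += (after[i] - ma) * (after[i] - ma) / n;
        }

        return (mb - ma) / fmax(sqrt(sb), sqrt(sa));
}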

All in all, the optimized code is generally faster, but the difference
never reaches a solid 3 sigmas. find_next_and_bit almost touches that
threshold in the dense bitmap test, but not quite.

However, bloat-o-meter shows a significant ~2.5K decrease in image size,
so the optimization has an overall positive impact.

Bloat-o-meter:
add/remove: 4/2 grow/shrink: 18/193 up/down: 627/-3108 (-2481)

Suggested-by: Linus Torvalds <[email protected]>
Signed-off-by: Yury Norov <[email protected]>
---
 include/linux/find.h | 25 ++++++++++++++--------
 lib/find_bit.c       | 49 ++++++++++++++++++++++++++++++++++++++++----
 2 files changed, 62 insertions(+), 12 deletions(-)

diff --git a/include/linux/find.h b/include/linux/find.h
index 424ef67d4a42..3ace995d7079 100644
--- a/include/linux/find.h
+++ b/include/linux/find.h
@@ -8,9 +8,18 @@

#include <linux/bitops.h>

-extern unsigned long _find_next_bit(const unsigned long *addr1,
- const unsigned long *addr2, unsigned long nbits,
- unsigned long start, unsigned long invert, unsigned long le);
+unsigned long _find_next_bit(const unsigned long *addr1, unsigned long nbits,
+ unsigned long start);
+unsigned long _find_next_and_bit(const unsigned long *addr1, const unsigned long *addr2,
+ unsigned long nbits, unsigned long start);
+unsigned long _find_next_zero_bit(const unsigned long *addr, unsigned long nbits,
+ unsigned long start);
+#ifdef __BIG_ENDIAN
+unsigned long _find_next_zero_bit_le(const void *addr, unsigned
+ long size, unsigned long offset);
+unsigned long _find_next_bit_le(const void *addr, unsigned
+ long size, unsigned long offset);
+#endif
extern unsigned long _find_first_bit(const unsigned long *addr, unsigned long size);
extern unsigned long _find_first_and_bit(const unsigned long *addr1,
const unsigned long *addr2, unsigned long size);
@@ -41,7 +50,7 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
return val ? __ffs(val) : size;
}

- return _find_next_bit(addr, NULL, size, offset, 0UL, 0);
+ return _find_next_bit(addr, size, offset);
}
#endif

@@ -71,7 +80,7 @@ unsigned long find_next_and_bit(const unsigned long *addr1,
return val ? __ffs(val) : size;
}

- return _find_next_bit(addr1, addr2, size, offset, 0UL, 0);
+ return _find_next_and_bit(addr1, addr2, size, offset);
}
#endif

@@ -99,7 +108,7 @@ unsigned long find_next_zero_bit(const unsigned long *addr, unsigned long size,
return val == ~0UL ? size : ffz(val);
}

- return _find_next_bit(addr, NULL, size, offset, ~0UL, 0);
+ return _find_next_zero_bit(addr, size, offset);
}
#endif

@@ -247,7 +256,7 @@ unsigned long find_next_zero_bit_le(const void *addr, unsigned
return val == ~0UL ? size : ffz(val);
}

- return _find_next_bit(addr, NULL, size, offset, ~0UL, 1);
+ return _find_next_zero_bit_le(addr, size, offset);
}
#endif

@@ -266,7 +275,7 @@ unsigned long find_next_bit_le(const void *addr, unsigned
return val ? __ffs(val) : size;
}

- return _find_next_bit(addr, NULL, size, offset, 0UL, 1);
+ return _find_next_bit_le(addr, size, offset);
}
#endif

diff --git a/lib/find_bit.c b/lib/find_bit.c
index 04c142acfc40..4ef3151b3109 100644
--- a/lib/find_bit.c
+++ b/lib/find_bit.c
@@ -19,9 +19,6 @@
#include <linux/minmax.h>
#include <linux/swab.h>

-#if !defined(find_next_bit) || !defined(find_next_zero_bit) || \
- !defined(find_next_bit_le) || !defined(find_next_zero_bit_le) || \
- !defined(find_next_and_bit)
/*
* This is a common helper function for find_next_bit, find_next_zero_bit, and
* find_next_and_bit. The differences are:
@@ -29,7 +26,7 @@
* searching it for one bits.
* - The optional "addr2", which is anded with "addr1" if present.
*/
-unsigned long _find_next_bit(const unsigned long *addr1,
+static inline unsigned long __find_next_bit(const unsigned long *addr1,
const unsigned long *addr2, unsigned long nbits,
unsigned long start, unsigned long invert, bool need_swab)
{
@@ -68,9 +65,53 @@ unsigned long _find_next_bit(const unsigned long *addr1,

return min(start + __ffs(tmp), nbits);
}
+
+#ifndef find_next_bit
+unsigned long _find_next_bit(const unsigned long *addr, unsigned long nbits, unsigned long start)
+{
+ return __find_next_bit(addr, NULL, nbits, start, 0UL, 0);
+}
EXPORT_SYMBOL(_find_next_bit);
#endif

+#ifndef find_next_and_bit
+unsigned long _find_next_and_bit(const unsigned long *addr1, const unsigned long *addr2,
+ unsigned long nbits, unsigned long start)
+{
+ return __find_next_bit(addr1, addr2, nbits, start, 0UL, 0);
+}
+EXPORT_SYMBOL(_find_next_and_bit);
+#endif
+
+#ifndef find_next_zero_bit
+unsigned long _find_next_zero_bit(const unsigned long *addr, unsigned long nbits,
+ unsigned long start)
+{
+ return __find_next_bit(addr, NULL, nbits, start, ~0UL, 0);
+}
+EXPORT_SYMBOL(_find_next_zero_bit);
+#endif
+
+#ifdef __BIG_ENDIAN
+#ifndef find_next_zero_bit_le
+unsigned long _find_next_zero_bit_le(const void *addr, unsigned
+ long size, unsigned long offset)
+{
+ return __find_next_bit(addr, NULL, size, offset, ~0UL, 1);
+}
+EXPORT_SYMBOL(_find_next_zero_bit_le);
+#endif
+
+#ifndef find_next_bit_le
+unsigned long _find_next_bit_le(const void *addr, unsigned
+ long size, unsigned long offset)
+{
+ return __find_next_bit(addr, NULL, size, offset, 0UL, 1);
+}
+EXPORT_SYMBOL(_find_next_bit_le);
+#endif
+#endif /* __BIG_ENDIAN */
+
#ifndef find_first_bit
/*
* Find the first set bit in a memory region.
--
2.34.1

2022-07-28 16:32:21

by Yury Norov

Subject: [PATCH 3/5] lib/find_bit: unify _find_first_{,and,zero}_bit implementations

The functions are almost identical, so create a common helper for them,
so that the compiler will be able to either inline the helper and
optimize out parameters known at compile time, or save some space by
keeping it as a real function.
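
To see the folding concretely: with addr2 == NULL, invert == ~0UL and
need_swab == false propagated into the inline helper, the zero-bit
flavor effectively reduces to the following (an illustrative sketch,
not actual kernel source):

static unsigned long first_zero_bit_folded(const unsigned long *addr,
                                           unsigned long size)
{
        unsigned long idx, val;

        for (idx = 0; idx * BITS_PER_LONG < size; idx++) {
                val = ~addr[idx];       /* "val ^= ~0UL" folds to a NOT */
                if (val)                /* need_swab == false: no swab() */
                        return min(idx * BITS_PER_LONG + __ffs(val), size);
        }

        return size;
}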

On kvm/x86_64, bloat-o-meter reports +9 bytes. Running find_bit_benchmark
5 times before and after doesn't show a significant difference (i.e. a
delta greater than 3 sigmas), except for find_next_bit, which is most
likely an outlier (although a lucky one for the patch):
                       v5.19-rc8   Optimized   Difference  (more - better)
Random dense bitmap        ns          ns           %          sigmas
find_next_bit:           721209      594936        18           3.19
find_next_zero_bit:      738138      638182        14           1.40
find_last_bit:           802393      940846       -17          -0.31
find_first_bit:         3560900     3379983         5           0.65
find_first_and_bit:    38601442    37683449         2           1.00
find_next_and_bit:       335574      300373        10           2.82

Random sparse bitmap
find_next_bit:            15868       13856        13           0.82
find_next_zero_bit:     1311843     1227418         6           0.72
find_last_bit:            13633       14080        -3          -0.74
find_first_bit:         1273625     1253343         2           0.52
find_first_and_bit:        8548        8157         5           0.32
find_next_and_bit:         8828        8437         4           0.52

Signed-off-by: Yury Norov <[email protected]>
---
lib/find_bit.c | 62 +++++++++++++++++++++++++++-----------------------
1 file changed, 33 insertions(+), 29 deletions(-)

diff --git a/lib/find_bit.c b/lib/find_bit.c
index 4ef3151b3109..d207d1699834 100644
--- a/lib/find_bit.c
+++ b/lib/find_bit.c
@@ -20,12 +20,38 @@
#include <linux/swab.h>

/*
- * This is a common helper function for find_next_bit, find_next_zero_bit, and
- * find_next_and_bit. The differences are:
+ * This is a common helper function for find_{first,next}_bit{,_le}.
+ * Internal parameters are:
* - The "invert" argument, which is XORed with each fetched word before
- * searching it for one bits.
- * - The optional "addr2", which is anded with "addr1" if present.
+ * searching it for set bits; to implement find_*_zero_bit().
+ * - The optional "addr2", which is ANDed with "addr1" if present; to
+ * implement find_*_and_bit().
+ * - The "need_swab" that converts words to BE format; to implement
+ * find_*_le() on big-endian machines.
*/
+static inline
+unsigned long __find_first_bit(const unsigned long *addr1, const unsigned long *addr2,
+ unsigned long size, unsigned long invert, bool need_swab)
+{
+ unsigned long idx, val;
+
+ for (idx = 0; idx * BITS_PER_LONG < size; idx++) {
+ val = addr1[idx];
+ if (addr2)
+ val &= addr2[idx];
+
+ val ^= invert;
+
+ if (val) {
+ if (need_swab)
+ val = swab(val);
+ return min(idx * BITS_PER_LONG + __ffs(val), size);
+ }
+ }
+
+ return size;
+}
+
static inline unsigned long __find_next_bit(const unsigned long *addr1,
const unsigned long *addr2, unsigned long nbits,
unsigned long start, unsigned long invert, bool need_swab)
@@ -118,14 +144,7 @@ EXPORT_SYMBOL(_find_next_bit_le);
*/
unsigned long _find_first_bit(const unsigned long *addr, unsigned long size)
{
- unsigned long idx;
-
- for (idx = 0; idx * BITS_PER_LONG < size; idx++) {
- if (addr[idx])
- return min(idx * BITS_PER_LONG + __ffs(addr[idx]), size);
- }
-
- return size;
+ return __find_first_bit(addr, NULL, size, 0UL, false);
}
EXPORT_SYMBOL(_find_first_bit);
#endif
@@ -138,15 +157,7 @@ unsigned long _find_first_and_bit(const unsigned long *addr1,
const unsigned long *addr2,
unsigned long size)
{
- unsigned long idx, val;
-
- for (idx = 0; idx * BITS_PER_LONG < size; idx++) {
- val = addr1[idx] & addr2[idx];
- if (val)
- return min(idx * BITS_PER_LONG + __ffs(val), size);
- }
-
- return size;
+ return __find_first_bit(addr1, addr2, size, 0UL, false);
}
EXPORT_SYMBOL(_find_first_and_bit);
#endif
@@ -157,14 +168,7 @@ EXPORT_SYMBOL(_find_first_and_bit);
*/
unsigned long _find_first_zero_bit(const unsigned long *addr, unsigned long size)
{
- unsigned long idx;
-
- for (idx = 0; idx * BITS_PER_LONG < size; idx++) {
- if (addr[idx] != ~0UL)
- return min(idx * BITS_PER_LONG + ffz(addr[idx]), size);
- }
-
- return size;
+ return __find_first_bit(addr, NULL, size, ~0UL, false);
}
EXPORT_SYMBOL(_find_first_zero_bit);
#endif
--
2.34.1

2022-07-28 16:34:01

by Yury Norov

Subject: [PATCH 4/5] lib/find_bit: create find_first_zero_bit_le()

find_first_zero_bit_le() is an alias to find_next_zero_bit_le(), despite
the fact that the 'next' flavor is known to be slower than the 'first'
one.

Now that we have a common __find_first_bit(), it's trivial to implement
find_first_zero_bit_le() as a real function.
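
For comparison, the two shapes side by side (both pieces appear in the
diff below). The old alias pays the 'next' prologue for an offset that
is always 0, while the new function is a direct word-aligned scan:

/* Before: alias through the slower 'next' flavor. */
#define find_first_zero_bit_le(addr, size) \
        find_next_zero_bit_le((addr), (size), 0)

/* After: a real 'first' implementation on top of the common helper,
 * with invert == ~0UL and need_swab == true folded in. */
unsigned long _find_first_zero_bit_le(const unsigned long *addr, unsigned long size)
{
        return __find_first_bit(addr, NULL, size, ~0UL, true);
}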

Signed-off-by: Yury Norov <[email protected]>
---
 include/linux/find.h | 20 +++++++++++++++-----
 lib/find_bit.c       | 11 +++++++++++
 2 files changed, 26 insertions(+), 5 deletions(-)

diff --git a/include/linux/find.h b/include/linux/find.h
index 3ace995d7079..8d326d1518f4 100644
--- a/include/linux/find.h
+++ b/include/linux/find.h
@@ -19,6 +19,7 @@ unsigned long _find_next_zero_bit_le(const void *addr, unsigned
long size, unsigned long offset);
unsigned long _find_next_bit_le(const void *addr, unsigned
long size, unsigned long offset);
+unsigned long _find_first_zero_bit_le(const unsigned long *addr, unsigned long size);
#endif
extern unsigned long _find_first_bit(const unsigned long *addr, unsigned long size);
extern unsigned long _find_first_and_bit(const unsigned long *addr1,
@@ -260,6 +261,20 @@ unsigned long find_next_zero_bit_le(const void *addr, unsigned
}
#endif

+#ifndef find_first_zero_bit_le
+static inline
+unsigned long find_first_zero_bit_le(const unsigned long *addr, unsigned long size)
+{
+ if (small_const_nbits(size)) {
+ unsigned long val = swab(*addr) | ~GENMASK(size - 1, 0);
+
+ return val == ~0UL ? size : ffz(val);
+ }
+
+ return _find_first_zero_bit_le(addr, size);
+}
+#endif
+
#ifndef find_next_bit_le
static inline
unsigned long find_next_bit_le(const void *addr, unsigned
@@ -279,11 +294,6 @@ unsigned long find_next_bit_le(const void *addr, unsigned
}
#endif

-#ifndef find_first_zero_bit_le
-#define find_first_zero_bit_le(addr, size) \
- find_next_zero_bit_le((addr), (size), 0)
-#endif
-
#else
#error "Please fix <asm/byteorder.h>"
#endif
diff --git a/lib/find_bit.c b/lib/find_bit.c
index d207d1699834..a56872611a59 100644
--- a/lib/find_bit.c
+++ b/lib/find_bit.c
@@ -119,6 +119,17 @@ EXPORT_SYMBOL(_find_next_zero_bit);
#endif

#ifdef __BIG_ENDIAN
+#ifndef find_first_zero_bit_le
+/*
+ * Find the first cleared bit in an LE memory region.
+ */
+unsigned long _find_first_zero_bit_le(const unsigned long *addr, unsigned long size)
+{
+ return __find_first_bit(addr, NULL, size, ~0UL, true);
+}
+EXPORT_SYMBOL(_find_first_zero_bit_le);
+#endif
+
#ifndef find_next_zero_bit_le
unsigned long _find_next_zero_bit_le(const void *addr, unsigned
long size, unsigned long offset)
--
2.34.1

2022-07-28 16:34:23

by Yury Norov

Subject: [RFC PATCH 5/5] lib/find_bit: re-use __find_first_bit() in __find_next_bit()

Similarly to m68k, switch __find_next_bit() to use __find_first_bit().
The only difference between them is that 'next' must skip the first
#start bits. So do that in a prologue, and then just call the 'first'
version if needed.

Signed-off-by: Yury Norov <[email protected]>

---

This patch is just to see how the m68k approach would work in the
general (x86_64) case. It almost doubles the size of the find_next()
functions and adds nothing to performance.

I would prefer not taking it upstream.

add/remove: 0/0 grow/shrink: 4/0 up/down: 332/0 (332)
Function                  old     new   delta
_find_next_zero_bit       103     191     +88
find_next_clump8          133     219     +86
_find_next_and_bit        120     203     +83
_find_next_bit             99     174     +75

                       v5.19-rc8   Optimized   Difference  (more - better)
Random dense bitmap        ns          ns           %          sigmas
find_next_bit:           721209      742050        -3          -0.09
find_next_zero_bit:      738138      920508       -25          -0.51
find_last_bit:           802393      839157        -5          -0.08
find_first_bit:         3560900     3747795        -5          -0.30
find_first_and_bit:    38601442    37423751         3           1.28
find_next_and_bit:       335574      322220         4           1.07

Random sparse bitmap
find_next_bit:            15868       13969        12           2.21
find_next_zero_bit:     1311843     1290946         2           0.18
find_last_bit:            13633       13856        -2          -0.37
find_first_bit:         1273625     1233955         3           1.01
find_first_and_bit:        8548       14974       -75          -0.43
find_next_and_bit:         8828        7766        12           1.37


lib/find_bit.c | 29 +++++++++++++++--------------
1 file changed, 15 insertions(+), 14 deletions(-)

diff --git a/lib/find_bit.c b/lib/find_bit.c
index a56872611a59..137e606a05a1 100644
--- a/lib/find_bit.c
+++ b/lib/find_bit.c
@@ -61,12 +61,15 @@ static inline unsigned long __find_next_bit(const unsigned long *addr1,
if (unlikely(start >= nbits))
return nbits;

+ if (IS_ALIGNED(start, BITS_PER_LONG))
+ goto aligned;
+
+ /* Handle 1st word. */
tmp = addr1[start / BITS_PER_LONG];
if (addr2)
tmp &= addr2[start / BITS_PER_LONG];
tmp ^= invert;

- /* Handle 1st word. */
mask = BITMAP_FIRST_WORD_MASK(start);
if (need_swab)
mask = swab(mask);
@@ -74,22 +77,20 @@ static inline unsigned long __find_next_bit(const unsigned long *addr1,
tmp &= mask;

start = round_down(start, BITS_PER_LONG);
-
- while (!tmp) {
- start += BITS_PER_LONG;
- if (start >= nbits)
- return nbits;
-
- tmp = addr1[start / BITS_PER_LONG];
- if (addr2)
- tmp &= addr2[start / BITS_PER_LONG];
- tmp ^= invert;
+ if (tmp) {
+ if (need_swab)
+ tmp = swab(tmp);
+ return min(start + __ffs(tmp), nbits);
}

- if (need_swab)
- tmp = swab(tmp);
+ start += BITS_PER_LONG;
+ if (start >= nbits)
+ return nbits;

- return min(start + __ffs(tmp), nbits);
+aligned:
+ return start + __find_first_bit(addr1 + start/BITS_PER_LONG,
+ addr2 ? addr2 + start/BITS_PER_LONG : NULL,
+ nbits - start, invert, need_swab);
}

#ifndef find_next_bit
--
2.34.1

2022-07-28 16:49:16

by Yury Norov

Subject: [PATCH 1/5] lib/find_bit: rename "le" to "need_swab" in __find_next_bit()

The parameter is used to swap bytes on BE machines when searching for
bits in little-endian bitmaps. On LE machines the parameter is always 0,
even for LE bitmaps, which misleads readers.

Signed-off-by: Yury Norov <[email protected]>
---
lib/find_bit.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/lib/find_bit.c b/lib/find_bit.c
index 1b8e4b2a9cba..04c142acfc40 100644
--- a/lib/find_bit.c
+++ b/lib/find_bit.c
@@ -31,7 +31,7 @@
*/
unsigned long _find_next_bit(const unsigned long *addr1,
const unsigned long *addr2, unsigned long nbits,
- unsigned long start, unsigned long invert, unsigned long le)
+ unsigned long start, unsigned long invert, bool need_swab)
{
unsigned long tmp, mask;

@@ -45,7 +45,7 @@ unsigned long _find_next_bit(const unsigned long *addr1,

/* Handle 1st word. */
mask = BITMAP_FIRST_WORD_MASK(start);
- if (le)
+ if (need_swab)
mask = swab(mask);

tmp &= mask;
@@ -63,7 +63,7 @@ unsigned long _find_next_bit(const unsigned long *addr1,
tmp ^= invert;
}

- if (le)
+ if (need_swab)
tmp = swab(tmp);

return min(start + __ffs(tmp), nbits);
--
2.34.1

2022-07-28 19:55:42

by Linus Torvalds

Subject: Re: [PATCH 0/5] lib/find: optimize find_bit() functions

On Thu, Jul 28, 2022 at 9:12 AM Yury Norov <[email protected]> wrote:
>
> In the recent discussion [1], it was noticed that find_next_bit()
> functions may be improved by adding wrappers around common
> __find_next_bit().

So looking at the end result, this generates fairly good code.

I say "fairly good" because _find_next_and_bit() ends up being disgusting.

The reason? That

        if (addr2)
                tmp &= addr2[start / BITS_PER_LONG];

generates horrific code when 'addr2' isn't seen to be always NULL.

So this doesn't affect the regular _find_next_bit case, because the
inliner sees that addr2 is always NULL and doesn't generate extra code
for it.

But the code that actually *does* have two incoming bitmasks will now
pointlessly test that second bitmask pointer all the time.
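
In other words, the loop that _find_next_and_bit() inlines still looks
like this (an illustrative reduction; the invert XOR, first-word mask
and final bounds clamp are elided, and start is assumed word-aligned):

/* addr2 is a real, never-NULL pointer at this call site, but the
 * shared helper cannot know that, so the test runs for every word. */
static unsigned long next_and_bit_loop(const unsigned long *addr1,
                                       const unsigned long *addr2,
                                       unsigned long start, unsigned long nbits)
{
        unsigned long tmp = 0;

        while (!tmp) {
                if (start >= nbits)
                        return nbits;

                tmp = addr1[start / BITS_PER_LONG];
                if (addr2)      /* always true, tested anyway */
                        tmp &= addr2[start / BITS_PER_LONG];

                start += BITS_PER_LONG;
        }

        return start - BITS_PER_LONG + __ffs(tmp);
}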

Now, that particular function probably doesn't matter, but this code
seems to be unnecessarily written to try to be overly generic, and
that

(a) makes it hard to read the source code because the source code
doesn't do the obvious thing

(b) clearly generates suboptimal code too

so I'm really not a huge fan of this "try to share code". Not when the
resulting shared code is harder to read, and impossible for the
compiler to do a great job at.

And the code sharing really doesn't help all that much.

If you really want to generate good code, use macros, and share the
logic that way. Not hugely readable either, but I think it's actually
not bad.

I think something like this would work:

#define BIT_FIND_SETUP(addr, size, start)                       \
        unsigned long val, idx;                                 \
        if (unlikely(start >= size))                            \
                return size;                                    \
        idx = start / BITS_PER_LONG;

#define BIT_FIND_FIRST_WORD(addr, size, start, EXPRESSION)      \
        if (!IS_ALIGNED(start, BITS_PER_LONG)) {                \
                unsigned long mask;                             \
                mask = BITMAP_FIRST_WORD_MASK(start);           \
                if ((val = mask & (EXPRESSION)) != 0)           \
                        goto found;                             \
                idx++;                                          \
        }

#define BIT_WORD_LOOP(ptr, size, idx, val, EXPRESSION)          \
        do {                                                    \
                if ((val = (EXPRESSION)) != 0)                  \
                        goto found;                             \
                idx++;                                          \
        } while ((idx)*BITS_PER_LONG < (size));

#define BIT_FIND_BODY(addr, size, start, EXPRESSION)            \
        BIT_FIND_SETUP(addr, size, start)                       \
        BIT_FIND_FIRST_WORD(addr, size, start, EXPRESSION)      \
        BIT_WORD_LOOP(addr, size, idx, val, EXPRESSION)         \
        return size;                                            \
found:                                                          \
        BIT_WORD_SWAB(val);                                     \
        return min((idx)*BITS_PER_LONG + __ffs(val), size)

#define BIT_WORD_SWAB(x) /* Nothing */

and then for the BE versions you just use the same macros, but you
make BIT_WORD_SWAB() do the swab.
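
Each flavor then becomes a one-liner supplying its own EXPRESSION, and
the and-variant indexes both bitmaps directly with no NULL test (a
sketch in the spirit of the attached find_bit.c):

unsigned long _find_next_bit(const unsigned long *addr,
                             unsigned long size, unsigned long start)
{
        BIT_FIND_BODY(addr, size, start, addr[idx]);
}

unsigned long _find_next_zero_bit(const unsigned long *addr,
                                  unsigned long size, unsigned long start)
{
        BIT_FIND_BODY(addr, size, start, ~addr[idx]);
}

unsigned long _find_next_and_bit(const unsigned long *addr1,
                                 const unsigned long *addr2,
                                 unsigned long size, unsigned long start)
{
        BIT_FIND_BODY(addr1, size, start, addr1[idx] & addr2[idx]);
}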

I'm attaching an UNTESTED version of lib/find_bit.c that works the
above way (note: it is based on your header file changes!)

It builds for me and seems to generate reasonable code, although I
notice that clang messes up the "__ffs()" inline asm and forces the
source into memory.

(Gcc doesn't do that, but gcc follows the code exactly and generates
"shl $6" instructions, while clang seems to figure out that it can
just add 64 instead)

Linus


Attachments:
find_bit.c (3.31 kB)

2022-07-28 22:12:13

by Linus Torvalds

Subject: Re: [PATCH 0/5] lib/find: optimize find_bit() functions

On Thu, Jul 28, 2022 at 11:49 AM Linus Torvalds
<[email protected]> wrote:
>
> It builds for me and seems to generate reasonable code, although I
> notice that clang messes up the "__ffs()" inline asm and forces the
> source into memory.

I have created an LLVM issue for this at

https://github.com/llvm/llvm-project/issues/56789

and while I noticed this when looking at the rather odd code
generation for the bit finding functions, it seems to be a general
issue with clang inline asm.

It looks like any instruction that takes a mod/rm input (so a register
or memory) will always force the thing to be in memory. Which is very
pointless in itself, but it actually causes some functions to have a
stack frame that they wouldn't otherwise need or want. So it actually
has secondary downsides too.

And yes, that particular case could be solved with __builtin_ctzl(),
which seems to DTRT. But that uses plain bsf, and we seem to really
want tzcnt ("rep bsf") here, although I didn't check why (the comment
explicitly says "Undefined if no bit exists", which is the main
difference between bsf and tzcnt).
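
For reference, the x86 __ffs() under discussion looks approximately
like this:

/*
 * __ffs - find first set bit in word. Undefined if no bit exists,
 * so callers must check for zero first.
 */
static __always_inline unsigned long __ffs(unsigned long word)
{
        asm("rep; bsf %1,%0"    /* "tzcnt" on CPUs that have it */
                : "=r" (word)
                : "rm" (word)); /* the "rm" constraint clang demotes to memory */
        return word;
}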

I _think_ it's because tzcnt is faster when it exists exactly because
it always writes the destination, so 'bsf' is actually the inferior
op, and clang shouldn't generate it.

But the "rm" thing exists elsewhere too, and I just checked - this
same issue seems to happen with "g" too (ie "any general integer
input").

Linus

2022-08-01 18:49:57

by Nick Desaulniers

[permalink] [raw]
Subject: Re: [PATCH 0/5] lib/find: optimize find_bit() functions

On Thu, Jul 28, 2022 at 2:49 PM Linus Torvalds
<[email protected]> wrote:
>
> On Thu, Jul 28, 2022 at 11:49 AM Linus Torvalds
> <[email protected]> wrote:
> >
> > It builds for me and seems to generate reasonable code, although I
> > notice that clang messes up the "__ffs()" inline asm and forces the
> > source into memory.
>
> I have created an LLVM issue for this at
>
> https://github.com/llvm/llvm-project/issues/56789

Thanks for the report. I left a response there (in case you have
notification emails from github filtered; following up here).
https://github.com/llvm/llvm-project/issues/56789#issuecomment-1201525395

So it looks like at least 3 things we can clean up:
1. https://github.com/llvm/llvm-project/issues/20571
2. https://github.com/llvm/llvm-project/issues/34191
3. https://github.com/llvm/llvm-project/issues/33216

>
> and while I noticed this when looking at the rather odd code
> generation for the bit finding functions, it seems to be a general
> issue with clang inline asm.
>
> It looks like any instruction that takes a mod/rm input (so a register
> or memory) will always force the thing to be in memory. Which is very
> pointless in itself, but it actually causes some functions to have a
> stack frame that they wouldn't otherwise need or want. So it actually
> has secondary downsides too.
>
> And yes, that particular case could be solved with __builtin_ctzl(),
> which seems to DTRT. But that uses plain bsf, and we seem to really
> want tzcnt ("rep bsf") here, although I didn't check why (the comment
> explicitly says "Undefined if no bit exists", which is the main
> difference between bsf and tzcnt).
>
> I _think_ it's because tzcnt is faster when it exists exactly because
> it always writes the destination, so 'bsf' is actually the inferior
> op, and clang shouldn't generate it.
>
> But the "rm" thing exists elsewhere too, and I just checked - this
> same issue seems to happen with "g" too (ie "any general integer
> input").
>
> Linus



--
Thanks,
~Nick Desaulniers