LinuxLists.cc - [PATCH v2 0/4] lib: optimize find

2022-08-24 01:52:22

Subject: [PATCH v2 0/4] lib: optimize find_bit() functions

In the recent discussion, it was noticed that find_next_bit() functions may
be improved by adding wrappers around common __find_next_bit() in .c file.

As suggested by Linus, I tried the meta-programming trick with the
EXPRESSION macro, which is passed from wrapper into find_bit()
helpers:

#define BIT_FIND_BODY(addr, size, start, EXPRESSION) \
BIT_FIND_SETUP(addr, size, start) \
BIT_FIND_FIRST_WORD(addr, size, start, EXPRESSION) \
BIT_WORD_LOOP(addr, size, idx, val, EXPRESSION) \
return size; \
found: BIT_WORD_SWAB(val); \
return min((idx)*BITS_PER_LONG + __ffs(val), size)

unsigned long _find_next_and_bit(const unsigned long *addr1,
const unsigned long *addr2,
unsigned long size,
unsigned long start)
{ BIT_FIND_BODY(addr, size, start, addr1[idx] & addr2[idx]); }

I appreciated the potential of how the EXPRESSION works, but I don't like
that the resulting macro is constructed from pieces because it makes it
harder to understand what happens behind the ifdefery. Checkpatch isn't
happy as well because the final macro contains 'return' statement; and I
would agree that it's better to avoid it.

I spined the idea one more time, trying to make FIND helper a more or
less standard looking macros.

This new approach saves 10-11K of Image size, and is 15% faster in the
performance benchmark. See the last patch for some statistics.

v1: https://lore.kernel.org/all/[email protected]/T/

Yury Norov (3):
lib/find_bit: introduce FIND_FIRST_BIT() macro
lib/find_bit: create find_first_zero_bit_le()
lib/find_bit: optimize find_next_bit() functions

MAINTAINERS | 2 +
include/linux/find.h | 46 +++++++++++++------
lib/Makefile | 1 +
lib/find_bit.c | 104 ++++++++++++-------------------------------
lib/find_bit.h | 45 +++++++++++++++++++
lib/find_bit_be.c | 42 +++++++++++++++++
6 files changed, 152 insertions(+), 88 deletions(-)
create mode 100644 lib/find_bit.h
create mode 100644 lib/find_bit_be.c

--
2.34.1

2022-08-24 02:11:49

by Yury Norov

[permalink] [raw]

Subject: [PATCH v2 3/3] lib/find_bit: optimize find_next_bit() functions

Over the past couple years, the function _find_next_bit() was extended
with parameters that modify its behavior to implement and- zero- and le-
flavors. The parameters are passed at compile time, but current design
prevents a compiler from optimizing out the conditionals.

As find_next_bit() API grows, I expect that more parameterss will be added.
Current designs would require more conditional code in _find_next_bit(),
which would bloat the helper even more and make it barely readable.

This patch replaces _find_next_bit() with a macro FIND_NEXT_BIT, and adds
a set of wrappers, so that the compile-time optimization becomes possible.

The common logic is moved to the new macro, and all flavors may be
generated by providing an EXPRESSION macro parameter, like in this example:

#define FIND_NEXT_BIT(EXPRESSION, size, start) ...

find_next_xornot_and_bit(addr1, addr2, addr3, size, start)
{
return FIND_NEXT_BIT(addr1[idx] ^ ~addr2[idx] & addr3[idx], size, start);
}

The EXPRESSION may be of any complexity, as soon as it only refers
the bitmap(s) and an iterator idx.

I ran find_bit_benchmark 16 times on top of 6.0-rc2 and 16 times on top
of this series. The results for kvm/x86_64 are:

v6.0-rc2 Optimized Difference Z-score
Random dense bitmap ns ns ns %
find_next_bit: 787735 670546 117189 14.9 3.97
find_next_zero_bit: 777492 664208 113284 14.6 10.51
find_last_bit: 830925 687573 143352 17.3 2.35
find_first_bit: 3874366 3306635 567731 14.7 1.84
find_first_and_bit: 40677125 37739887 2937238 7.2 1.36
find_next_and_bit: 347865 304456 43409 12.5 1.35

Random sparse bitmap
find_next_bit: 19816 14021 5795 29.2 6.10
find_next_zero_bit: 1318901 1223794 95107 7.2 1.41
find_last_bit: 14573 13514 1059 7.3 6.92
find_first_bit: 1313321 1249024 64297 4.9 1.53
find_first_and_bit: 8921 8098 823 9.2 4.56
find_next_and_bit: 9796 7176 2620 26.7 5.39

Where the statistics is significant (z-score > 3), the improvement
is ~15%.

According to bloat-o-meter, the Image size is 10-11K less:

x86_64/defconfig:
add/remove: 32/14 grow/shrink: 61/782 up/down: 6344/-16521 (-10177)

arm64/defconfig:
add/remove: 3/2 grow/shrink: 50/714 up/down: 608/-11556 (-10948)

Suggested-by: Linus Torvalds <[email protected]>
Signed-off-by: Yury Norov <[email protected]>
---
Checkpatch warns that the FIND_NEXT_BIT() is a "Macros with flow
control statements", but in this case I think it's false positive
because the label is defined in the same code block as the 'goto'
statement, and control can't flow out of the block.

include/linux/find.h | 23 ++++++++-----
lib/find_bit.c | 78 +++++++++++++++-----------------------------
lib/find_bit.h | 21 ++++++++++++
lib/find_bit_be.c | 19 +++++++++++
4 files changed, 81 insertions(+), 60 deletions(-)

diff --git a/include/linux/find.h b/include/linux/find.h
index 2464bff5de04..dead6f53a97b 100644
--- a/include/linux/find.h
+++ b/include/linux/find.h
@@ -8,9 +8,12 @@

#include <linux/bitops.h>

-extern unsigned long _find_next_bit(const unsigned long *addr1,
- const unsigned long *addr2, unsigned long nbits,
- unsigned long start, unsigned long invert, unsigned long le);
+unsigned long _find_next_bit(const unsigned long *addr1, unsigned long nbits,
+ unsigned long start);
+unsigned long _find_next_and_bit(const unsigned long *addr1, const unsigned long *addr2,
+ unsigned long nbits, unsigned long start);
+unsigned long _find_next_zero_bit(const unsigned long *addr, unsigned long nbits,
+ unsigned long start);
extern unsigned long _find_first_bit(const unsigned long *addr, unsigned long size);
extern unsigned long _find_first_and_bit(const unsigned long *addr1,
const unsigned long *addr2, unsigned long size);
@@ -19,6 +22,10 @@ extern unsigned long _find_last_bit(const unsigned long *addr, unsigned long siz

#ifdef __BIG_ENDIAN
unsigned long _find_first_zero_bit_le(const unsigned long *addr, unsigned long size);
+unsigned long _find_next_zero_bit_le(const unsigned long *addr, unsigned
+ long size, unsigned long offset);
+unsigned long _find_next_bit_le(const unsigned long *addr, unsigned
+ long size, unsigned long offset);
#endif

#ifndef find_next_bit
@@ -45,7 +52,7 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
return val ? __ffs(val) : size;
}

- return _find_next_bit(addr, NULL, size, offset, 0UL, 0);
+ return _find_next_bit(addr, size, offset);
}
#endif

@@ -75,7 +82,7 @@ unsigned long find_next_and_bit(const unsigned long *addr1,
return val ? __ffs(val) : size;
}

- return _find_next_bit(addr1, addr2, size, offset, 0UL, 0);
+ return _find_next_and_bit(addr1, addr2, size, offset);
}
#endif

@@ -103,7 +110,7 @@ unsigned long find_next_zero_bit(const unsigned long *addr, unsigned long size,
return val == ~0UL ? size : ffz(val);
}

- return _find_next_bit(addr, NULL, size, offset, ~0UL, 0);
+ return _find_next_zero_bit(addr, size, offset);
}
#endif

@@ -251,7 +258,7 @@ unsigned long find_next_zero_bit_le(const void *addr, unsigned
return val == ~0UL ? size : ffz(val);
}

- return _find_next_bit(addr, NULL, size, offset, ~0UL, 1);
+ return _find_next_zero_bit_le(addr, size, offset);
}
#endif

@@ -284,7 +291,7 @@ unsigned long find_next_bit_le(const void *addr, unsigned
return val ? __ffs(val) : size;
}

- return _find_next_bit(addr, NULL, size, offset, 0UL, 1);
+ return _find_next_bit_le(addr, size, offset);
}
#endif

diff --git a/lib/find_bit.c b/lib/find_bit.c
index ccc4fb1dfc71..357750d25ff9 100644
--- a/lib/find_bit.c
+++ b/lib/find_bit.c
@@ -21,58 +21,6 @@

#include "find_bit.h"

-#if !defined(find_next_bit) || !defined(find_next_zero_bit) || \
- !defined(find_next_bit_le) || !defined(find_next_zero_bit_le) || \
- !defined(find_next_and_bit)
-/*
- * This is a common helper function for find_next_bit, find_next_zero_bit, and
- * find_next_and_bit. The differences are:
- * - The "invert" argument, which is XORed with each fetched word before
- * searching it for one bits.
- * - The optional "addr2", which is anded with "addr1" if present.
- */
-unsigned long _find_next_bit(const unsigned long *addr1,
- const unsigned long *addr2, unsigned long nbits,
- unsigned long start, unsigned long invert, unsigned long le)
-{
- unsigned long tmp, mask;
-
- if (unlikely(start >= nbits))
- return nbits;
-
- tmp = addr1[start / BITS_PER_LONG];
- if (addr2)
- tmp &= addr2[start / BITS_PER_LONG];
- tmp ^= invert;
-
- /* Handle 1st word. */
- mask = BITMAP_FIRST_WORD_MASK(start);
- if (le)
- mask = swab(mask);
-
- tmp &= mask;
-
- start = round_down(start, BITS_PER_LONG);
-
- while (!tmp) {
- start += BITS_PER_LONG;
- if (start >= nbits)
- return nbits;
-
- tmp = addr1[start / BITS_PER_LONG];
- if (addr2)
- tmp &= addr2[start / BITS_PER_LONG];
- tmp ^= invert;
- }
-
- if (le)
- tmp = swab(tmp);
-
- return min(start + __ffs(tmp), nbits);
-}
-EXPORT_SYMBOL(_find_next_bit);
-#endif
-
#ifndef find_first_bit
/*
* Find the first set bit in a memory region.
@@ -108,6 +56,32 @@ unsigned long _find_first_zero_bit(const unsigned long *addr, unsigned long size
EXPORT_SYMBOL(_find_first_zero_bit);
#endif

+#ifndef find_next_bit
+unsigned long _find_next_bit(const unsigned long *addr, unsigned long nbits, unsigned long start)
+{
+ return FIND_NEXT_BIT(addr[idx], nbits, start);
+}
+EXPORT_SYMBOL(_find_next_bit);
+#endif
+
+#ifndef find_next_and_bit
+unsigned long _find_next_and_bit(const unsigned long *addr1, const unsigned long *addr2,
+ unsigned long nbits, unsigned long start)
+{
+ return FIND_NEXT_BIT(addr1[idx] & addr2[idx], nbits, start);
+}
+EXPORT_SYMBOL(_find_next_and_bit);
+#endif
+
+#ifndef find_next_zero_bit
+unsigned long _find_next_zero_bit(const unsigned long *addr, unsigned long nbits,
+ unsigned long start)
+{
+ return FIND_NEXT_BIT(~addr[idx], nbits, start);
+}
+EXPORT_SYMBOL(_find_next_zero_bit);
+#endif
+
#ifndef find_last_bit
unsigned long _find_last_bit(const unsigned long *addr, unsigned long size)
{
diff --git a/lib/find_bit.h b/lib/find_bit.h
index b4b6245ddbf6..6b6312f93301 100644
--- a/lib/find_bit.h
+++ b/lib/find_bit.h
@@ -21,4 +21,25 @@
sz; \
})

+#define FIND_NEXT_BIT(EXPRESSION, size, start) \
+({ \
+ unsigned long mask, idx, tmp, sz = (size), __start = (start); \
+ \
+ if (unlikely(__start >= sz)) \
+ goto out; \
+ \
+ mask = word_op(BITMAP_FIRST_WORD_MASK(__start)); \
+ idx = __start / BITS_PER_LONG; \
+ \
+ for (tmp = (EXPRESSION) & mask; !tmp; tmp = (EXPRESSION)) { \
+ if (idx > sz / BITS_PER_LONG) \
+ goto out; \
+ idx++; \
+ } \
+ \
+ sz = min(idx * BITS_PER_LONG + __ffs(word_op(tmp)), sz); \
+out: \
+ sz; \
+})
+
#endif /* _LIB_FIND_BIT_H */
diff --git a/lib/find_bit_be.c b/lib/find_bit_be.c
index 36173cb7e012..cbf669aaf3cb 100644
--- a/lib/find_bit_be.c
+++ b/lib/find_bit_be.c
@@ -21,3 +21,22 @@ unsigned long _find_first_zero_bit_le(const unsigned long *addr, unsigned long s
}
EXPORT_SYMBOL(_find_first_zero_bit_le);
#endif
+
+#ifndef find_next_zero_bit_le
+unsigned long _find_next_zero_bit_le(const unsigned long *addr, unsigned
+ long size, unsigned long offset)
+{
+ return FIND_NEXT_BIT(~addr[idx], size, offset);
+}
+EXPORT_SYMBOL(_find_next_zero_bit_le);
+#endif
+
+#ifndef find_next_bit_le
+unsigned long _find_next_bit_le(const unsigned long *addr, unsigned
+ long size, unsigned long offset)
+{
+ return FIND_NEXT_BIT(addr[idx], size, offset);
+}
+EXPORT_SYMBOL(_find_next_bit_le);
+
+#endif
--
2.34.1

2022-08-24 02:17:26

by Yury Norov

[permalink] [raw]

Subject: [PATCH v2 2/3] lib/find_bit: create find_first_zero_bit_le()

find_first_zero_bit_le() is an alias to find_next_zero_bit_le(),
despite that 'next' is known to be slower than the 'first' version.

Now that we have a common FIND_FIRST_BIT() macro helper, it's trivial
to implement find_first_zero_bit_le() as a real function.

Moving find_*_le() to a separate file helps to fit the FIND_FIRST_BIT()
to the _le needs by wiring word_op to swab.

Signed-off-by: Yury Norov <[email protected]>
---
Like other find_*_le() functions, the new one takes void *addr, instead
of unsigned long *. This should be fixed for all in a separate series.

MAINTAINERS | 1 +
include/linux/find.h | 23 ++++++++++++++++++-----
lib/Makefile | 1 +
lib/find_bit_be.c | 23 +++++++++++++++++++++++
4 files changed, 43 insertions(+), 5 deletions(-)
create mode 100644 lib/find_bit_be.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 02e11f2dbafe..fd1d1625b053 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3614,6 +3614,7 @@ F: lib/bitmap.c
F: lib/cpumask.c
F: lib/find_bit.h
F: lib/find_bit.c
+F: lib/find_bit_le.c
F: lib/find_bit_benchmark.c
F: lib/test_bitmap.c
F: tools/include/linux/bitmap.h
diff --git a/include/linux/find.h b/include/linux/find.h
index 424ef67d4a42..2464bff5de04 100644
--- a/include/linux/find.h
+++ b/include/linux/find.h
@@ -17,6 +17,10 @@ extern unsigned long _find_first_and_bit(const unsigned long *addr1,
extern unsigned long _find_first_zero_bit(const unsigned long *addr, unsigned long size);
extern unsigned long _find_last_bit(const unsigned long *addr, unsigned long size);

+#ifdef __BIG_ENDIAN
+unsigned long _find_first_zero_bit_le(const unsigned long *addr, unsigned long size);
+#endif
+
#ifndef find_next_bit
/**
* find_next_bit - find the next set bit in a memory region
@@ -251,6 +255,20 @@ unsigned long find_next_zero_bit_le(const void *addr, unsigned
}
#endif

+#ifndef find_first_zero_bit_le
+static inline
+unsigned long find_first_zero_bit_le(const void *addr, unsigned long size)
+{
+ if (small_const_nbits(size)) {
+ unsigned long val = swab(*(const unsigned long *)addr) | ~GENMASK(size - 1, 0);
+
+ return val == ~0UL ? size : ffz(val);
+ }
+
+ return _find_first_zero_bit_le(addr, size);
+}
+#endif
+
#ifndef find_next_bit_le
static inline
unsigned long find_next_bit_le(const void *addr, unsigned
@@ -270,11 +288,6 @@ unsigned long find_next_bit_le(const void *addr, unsigned
}
#endif

-#ifndef find_first_zero_bit_le
-#define find_first_zero_bit_le(addr, size) \
- find_next_zero_bit_le((addr), (size), 0)
-#endif
-
#else
#error "Please fix <asm/byteorder.h>"
#endif
diff --git a/lib/Makefile b/lib/Makefile
index 5927d7fa0806..0f41b76a277e 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -49,6 +49,7 @@ obj-y += bcd.o sort.o parser.o debug_locks.o random32.o \
percpu-refcount.o rhashtable.o base64.o \
once.o refcount.o usercopy.o errseq.o bucket_locks.o \
generic-radix-tree.o
+obj-$(CONFIG_CPU_BIG_ENDIAN) += find_bit_be.o
obj-$(CONFIG_STRING_SELFTEST) += test_string.o
obj-y += string_helpers.o
obj-$(CONFIG_TEST_STRING_HELPERS) += test-string_helpers.o
diff --git a/lib/find_bit_be.c b/lib/find_bit_be.c
new file mode 100644
index 000000000000..36173cb7e012
--- /dev/null
+++ b/lib/find_bit_be.c
@@ -0,0 +1,23 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/* Big-endian routines for bit search */
+
+#include <linux/bitops.h>
+#include <linux/bitmap.h>
+#include <linux/export.h>
+#include <linux/math.h>
+#include <linux/minmax.h>
+#include <linux/swab.h>
+
+#define word_op swab
+#include "find_bit.h"
+
+#ifndef find_first_zero_bit_le
+/*
+ * Find the first cleared bit in an LE memory region.
+ */
+unsigned long _find_first_zero_bit_le(const unsigned long *addr, unsigned long size)
+{
+ return FIND_FIRST_BIT(~addr[idx], size);
+}
+EXPORT_SYMBOL(_find_first_zero_bit_le);
+#endif
--
2.34.1

2022-08-24 09:31:40

by Andy Shevchenko

[permalink] [raw]

Subject: Re: [PATCH v2 3/3] lib/find_bit: optimize find_next_bit() functions

On Wed, Aug 24, 2022 at 4:56 AM Yury Norov <[email protected]> wrote:
>
> Over the past couple years, the function _find_next_bit() was extended
> with parameters that modify its behavior to implement and- zero- and le-
> flavors. The parameters are passed at compile time, but current design
> prevents a compiler from optimizing out the conditionals.
>
> As find_next_bit() API grows, I expect that more parameterss will be added.

parameters

> Current designs would require more conditional code in _find_next_bit(),
> which would bloat the helper even more and make it barely readable.
>
> This patch replaces _find_next_bit() with a macro FIND_NEXT_BIT, and adds
> a set of wrappers, so that the compile-time optimization becomes possible.
>
> The common logic is moved to the new macro, and all flavors may be
> generated by providing an EXPRESSION macro parameter, like in this example:
>
> #define FIND_NEXT_BIT(EXPRESSION, size, start) ...
>
> find_next_xornot_and_bit(addr1, addr2, addr3, size, start)
> {
> return FIND_NEXT_BIT(addr1[idx] ^ ~addr2[idx] & addr3[idx], size, start);
> }
>
> The EXPRESSION may be of any complexity, as soon as it only refers
> the bitmap(s) and an iterator idx.

...

> +#define FIND_NEXT_BIT(EXPRESSION, size, start) \
> +({ \
> + unsigned long mask, idx, tmp, sz = (size), __start = (start); \
> + \
> + if (unlikely(__start >= sz)) \
> + goto out; \
> + \
> + mask = word_op(BITMAP_FIRST_WORD_MASK(__start)); \
> + idx = __start / BITS_PER_LONG; \
> + \
> + for (tmp = (EXPRESSION) & mask; !tmp; tmp = (EXPRESSION)) { \

for (unsigned long tmp ...;
But hey, why not loop over idx (which probably should be named as
offset) as I proposed in the first patch? You will drop a lot of
divisions / multiplications, no?

> + if (idx > sz / BITS_PER_LONG) \
> + goto out; \
> + idx++; \
> + } \
> + \
> + sz = min(idx * BITS_PER_LONG + __ffs(word_op(tmp)), sz); \
> +out: \
> + sz; \
> +})

--
With Best Regards,
Andy Shevchenko

2022-08-24 09:33:00

by Andy Shevchenko

[permalink] [raw]

Subject: Re: [PATCH v2 2/3] lib/find_bit: create find_first_zero_bit_le()

On Wed, Aug 24, 2022 at 5:17 AM Yury Norov <[email protected]> wrote:
>
> find_first_zero_bit_le() is an alias to find_next_zero_bit_le(),
> despite that 'next' is known to be slower than the 'first' version.
>
> Now that we have a common FIND_FIRST_BIT() macro helper, it's trivial
> to implement find_first_zero_bit_le() as a real function.
>
> Moving find_*_le() to a separate file helps to fit the FIND_FIRST_BIT()
> to the _le needs by wiring word_op to swab.
>
> Signed-off-by: Yury Norov <[email protected]>
> ---
> Like other find_*_le() functions, the new one takes void *addr, instead
> of unsigned long *. This should be fixed for all in a separate series.

From this comment it is unclear to me why we can't fix them first and
then apply this with the correct type?

...

> +#define word_op swab
> +#include "find_bit.h"

Looking at this, I would rather always require to define __ffs_word_op
(or whatever name) in the user and replace #ifndef in the find_bit.h
with
#error "The __ffs_word_op must be defined before including find_bit.h!"

--
With Best Regards,
Andy Shevchenko

2022-08-24 09:33:31

by Andy Shevchenko

[permalink] [raw]

Subject: Re: [PATCH v2 2/3] lib/find_bit: create find_first_zero_bit_le()

On Wed, Aug 24, 2022 at 12:22 PM Andy Shevchenko
<[email protected]> wrote:
> On Wed, Aug 24, 2022 at 5:17 AM Yury Norov <[email protected]> wrote:

...

> > +#define word_op swab
> > +#include "find_bit.h"
>
> Looking at this, I would rather always require to define __ffs_word_op
> (or whatever name) in the user and replace #ifndef in the find_bit.h
> with
> #error "The __ffs_word_op must be defined before including find_bit.h!"

The rationale is that the missed definition may give wrong results
while being compiled with no errors. With the above, the developer
must think about what they are doing.

--
With Best Regards,
Andy Shevchenko

2022-08-24 09:46:36

by Andy Shevchenko

[permalink] [raw]

Subject: Re: [PATCH v2 0/4] lib: optimize find_bit() functions

On Wed, Aug 24, 2022 at 4:39 AM Yury Norov <[email protected]> wrote:

...

> v1: https://lore.kernel.org/all/[email protected]/T/
>
> Yury Norov (3):
> lib/find_bit: introduce FIND_FIRST_BIT() macro
> lib/find_bit: create find_first_zero_bit_le()
> lib/find_bit: optimize find_next_bit() functions
>
> MAINTAINERS | 2 +
> include/linux/find.h | 46 +++++++++++++------
> lib/Makefile | 1 +
> lib/find_bit.c | 104 ++++++++++++-------------------------------
> lib/find_bit.h | 45 +++++++++++++++++++
> lib/find_bit_be.c | 42 +++++++++++++++++
> 6 files changed, 152 insertions(+), 88 deletions(-)
> create mode 100644 lib/find_bit.h
> create mode 100644 lib/find_bit_be.c
>
> --
> 2.34.1

Seems like the cover letter is generated by Git, but it has 0/4 in the
Subject while the rest of the series is n/3. What happened on your
side with the numbering of the patches?

--
With Best Regards,
Andy Shevchenko

2022-08-24 13:45:45

by Yury Norov

[permalink] [raw]

Subject: Re: [PATCH v2 2/3] lib/find_bit: create find_first_zero_bit_le()

On Wed, Aug 24, 2022 at 12:22:33PM +0300, Andy Shevchenko wrote:
> On Wed, Aug 24, 2022 at 5:17 AM Yury Norov <[email protected]> wrote:
> >
> > find_first_zero_bit_le() is an alias to find_next_zero_bit_le(),
> > despite that 'next' is known to be slower than the 'first' version.
> >
> > Now that we have a common FIND_FIRST_BIT() macro helper, it's trivial
> > to implement find_first_zero_bit_le() as a real function.
> >
> > Moving find_*_le() to a separate file helps to fit the FIND_FIRST_BIT()
> > to the _le needs by wiring word_op to swab.
> >
> > Signed-off-by: Yury Norov <[email protected]>
> > ---
> > Like other find_*_le() functions, the new one takes void *addr, instead
> > of unsigned long *. This should be fixed for all in a separate series.
>
> From this comment it is unclear to me why we can't fix them first and
> then apply this with the correct type?

Because there is a codebase that relies on existing types, mostly in
filesystem code. And those fs fixes would require 5 or 6 patches.

This would triple the length of this series, and is completely
unrelated. That's why I think that:
> > This should be fixed for all in a separate series.

> ...
>
> > +#define word_op swab
> > +#include "find_bit.h"
>
> Looking at this, I would rather always require to define __ffs_word_op
> (or whatever name) in the user and replace #ifndef in the find_bit.h
> with
> #error "The __ffs_word_op must be defined before including find_bit.h!"

This is a local header which is not intended to be included anywhere
except lib/find_bit{,_be}.c. I don't expect someone else would want to
include it, even in lib. So what you suggest is a bit overthinking to
me. But if you insist, I can do that.

Thanks,
Yury

2022-08-24 14:29:45

by Yury Norov

[permalink] [raw]

Subject: Re: [PATCH v2 3/3] lib/find_bit: optimize find_next_bit() functions

On Wed, Aug 24, 2022 at 12:19:05PM +0300, Andy Shevchenko wrote:
> On Wed, Aug 24, 2022 at 4:56 AM Yury Norov <[email protected]> wrote:
> >
> > Over the past couple years, the function _find_next_bit() was extended
> > with parameters that modify its behavior to implement and- zero- and le-
> > flavors. The parameters are passed at compile time, but current design
> > prevents a compiler from optimizing out the conditionals.
> >
> > As find_next_bit() API grows, I expect that more parameterss will be added.
>
> parameters
>
> > Current designs would require more conditional code in _find_next_bit(),
> > which would bloat the helper even more and make it barely readable.
> >
> > This patch replaces _find_next_bit() with a macro FIND_NEXT_BIT, and adds
> > a set of wrappers, so that the compile-time optimization becomes possible.
> >
> > The common logic is moved to the new macro, and all flavors may be
> > generated by providing an EXPRESSION macro parameter, like in this example:
> >
> > #define FIND_NEXT_BIT(EXPRESSION, size, start) ...
> >
> > find_next_xornot_and_bit(addr1, addr2, addr3, size, start)
> > {
> > return FIND_NEXT_BIT(addr1[idx] ^ ~addr2[idx] & addr3[idx], size, start);
> > }
> >
> > The EXPRESSION may be of any complexity, as soon as it only refers
> > the bitmap(s) and an iterator idx.
>
> ...
>
> > +#define FIND_NEXT_BIT(EXPRESSION, size, start) \
> > +({ \
> > + unsigned long mask, idx, tmp, sz = (size), __start = (start); \
> > + \
> > + if (unlikely(__start >= sz)) \
> > + goto out; \
> > + \
> > + mask = word_op(BITMAP_FIRST_WORD_MASK(__start)); \
> > + idx = __start / BITS_PER_LONG; \
> > + \
> > + for (tmp = (EXPRESSION) & mask; !tmp; tmp = (EXPRESSION)) { \
>
> for (unsigned long tmp ...;
> But hey, why not loop over idx (which probably should be named as
> offset)

Offset in structure, index in array, isn't?

> as I proposed in the first patch? You will drop a lot of
> divisions / multiplications, no?

Those divisions and multiplications are optimized away, and
what you suggested blows up the EXPRESSION.

I tried like this:
mask = word_op(BITMAP_FIRST_WORD_MASK(__start));
idx = __start / BITS_PER_LONG;
tmp = (EXPRESSION);

while (1) {
if (tmp) {
sz = min(idx * BITS_PER_LONG + __ffs(word_op(tmp)), sz);
break;
}

if (++idx > sz)
break;

tmp = (EXPRESSION);
}

And it generated the same code, but looks less expressive to me.
If you have some elegant approach in mind - can you please share
it, and how the generated code looks?

Thanks,
Yury

2022-08-24 18:00:33

by Andy Shevchenko

[permalink] [raw]

Subject: Re: [PATCH v2 3/3] lib/find_bit: optimize find_next_bit() functions

On Wed, Aug 24, 2022 at 4:53 PM Yury Norov <[email protected]> wrote:
> On Wed, Aug 24, 2022 at 12:19:05PM +0300, Andy Shevchenko wrote:
> > On Wed, Aug 24, 2022 at 4:56 AM Yury Norov <[email protected]> wrote:

...

> > > +#define FIND_NEXT_BIT(EXPRESSION, size, start) \
> > > +({ \
> > > + unsigned long mask, idx, tmp, sz = (size), __start = (start); \
> > > + \
> > > + if (unlikely(__start >= sz)) \
> > > + goto out; \
> > > + \
> > > + mask = word_op(BITMAP_FIRST_WORD_MASK(__start)); \
> > > + idx = __start / BITS_PER_LONG; \
> > > + \
> > > + for (tmp = (EXPRESSION) & mask; !tmp; tmp = (EXPRESSION)) { \
> >
> > for (unsigned long tmp ...;
> > But hey, why not loop over idx (which probably should be named as
> > offset)
>
> Offset in structure, index in array, isn't?
>
> > as I proposed in the first patch? You will drop a lot of
> > divisions / multiplications, no?
>
> Those divisions and multiplications are optimized away, and
> what you suggested blows up the EXPRESSION.
>
> I tried like this:
> mask = word_op(BITMAP_FIRST_WORD_MASK(__start));
> idx = __start / BITS_PER_LONG;
> tmp = (EXPRESSION);
>
> while (1) {
> if (tmp) {
> sz = min(idx * BITS_PER_LONG + __ffs(word_op(tmp)), sz);
> break;
> }
>
> if (++idx > sz)
> break;
>
> tmp = (EXPRESSION);
> }
>
> And it generated the same code, but looks less expressive to me.
> If you have some elegant approach in mind - can you please share
> it, and how the generated code looks?

for (unsigned long idx = 0; idx < sz; idx++) {
unsigned long tmp;

tmp = (EXPRESSION);
if (tmp) {
...
}
}

No?

--
With Best Regards,
Andy Shevchenko

2022-08-24 18:00:52

by Andy Shevchenko

[permalink] [raw]

Subject: Re: [PATCH v2 3/3] lib/find_bit: optimize find_next_bit() functions

On Wed, Aug 24, 2022 at 8:54 PM Andy Shevchenko
<[email protected]> wrote:
> On Wed, Aug 24, 2022 at 4:53 PM Yury Norov <[email protected]> wrote:
> > On Wed, Aug 24, 2022 at 12:19:05PM +0300, Andy Shevchenko wrote:
> > > On Wed, Aug 24, 2022 at 4:56 AM Yury Norov <[email protected]> wrote:

...

> > > > +#define FIND_NEXT_BIT(EXPRESSION, size, start) \
> > > > +({ \
> > > > + unsigned long mask, idx, tmp, sz = (size), __start = (start); \
> > > > + \
> > > > + if (unlikely(__start >= sz)) \
> > > > + goto out; \
> > > > + \
> > > > + mask = word_op(BITMAP_FIRST_WORD_MASK(__start)); \
> > > > + idx = __start / BITS_PER_LONG; \
> > > > + \
> > > > + for (tmp = (EXPRESSION) & mask; !tmp; tmp = (EXPRESSION)) { \
> > >
> > > for (unsigned long tmp ...;
> > > But hey, why not loop over idx (which probably should be named as
> > > offset)
> >
> > Offset in structure, index in array, isn't?
> >
> > > as I proposed in the first patch? You will drop a lot of
> > > divisions / multiplications, no?
> >
> > Those divisions and multiplications are optimized away, and
> > what you suggested blows up the EXPRESSION.
> >
> > I tried like this:
> > mask = word_op(BITMAP_FIRST_WORD_MASK(__start));
> > idx = __start / BITS_PER_LONG;
> > tmp = (EXPRESSION);
> >
> > while (1) {
> > if (tmp) {
> > sz = min(idx * BITS_PER_LONG + __ffs(word_op(tmp)), sz);
> > break;
> > }
> >
> > if (++idx > sz)
> > break;
> >
> > tmp = (EXPRESSION);
> > }
> >
> > And it generated the same code, but looks less expressive to me.
> > If you have some elegant approach in mind - can you please share
> > it, and how the generated code looks?
>
> for (unsigned long idx = 0; idx < sz; idx++) {

Of source 0 should be changed to whatever start you have there.

> unsigned long tmp;
>
> tmp = (EXPRESSION);
> if (tmp) {
> ...
> }
> }
>
> No?

--
With Best Regards,
Andy Shevchenko

2022-08-24 18:18:44

by Linus Torvalds

[permalink] [raw]

Subject: Re: [PATCH v2 2/3] lib/find_bit: create find_first_zero_bit_le()

Ugh, and here is the "odd trick" immediately afterwards:

On Tue, Aug 23, 2022 at 6:26 PM Yury Norov <[email protected]> wrote:
>
> +++ b/lib/find_bit_be.c
> @@ -0,0 +1,23 @@
> +
> +#define word_op swab
> +#include "find_bit.h"
> +
> +#ifndef find_first_zero_bit_le
> +/*
> + * Find the first cleared bit in an LE memory region.
> + */
> +unsigned long _find_first_zero_bit_le(const unsigned long *addr, unsigned long size)
> +{
> + return FIND_FIRST_BIT(~addr[idx], size);
> +}
> +EXPORT_SYMBOL(_find_first_zero_bit_le);
> +#endif

I looked at that "_find_first_zero_bit_le()" code a long time and says
"that can't be right".

Until I noticed the

> +#define word_op swab

that was several lines above it, and not even close to the use of it.

So no.

Just make the macro take that "last word op" as another argument, and
then you don't need these tricks, and you can do the _le() version in
lib/find.c file *exactly* like the regular version, using just

#ifdef __BIG_ENDIAN
unsigned long find_first_zero_bit_le(const unsigned long *addr,
unsigned long size)
{ return FIND_FIRST_BIT(~addr[idx], swab(val), size); }
#else
unsigned long find_first_zero_bit_le(const unsigned long *addr,
unsigned long size) __alias(find_first_zero_bit);
#endif

or something like that.

And yes, it means that the "regular" versions need to pass in just
"val" as a last-word expression, but who cares? It's still cleaner,
and easier to explain: "The first expression finds the word that
matches our requirement, the second expression munges the word we
found into the bit order we use".

Or something.

Linus

2022-08-24 18:25:44

by Russell King (Oracle)

[permalink] [raw]

Subject: Re: [PATCH v2 2/3] lib/find_bit: create find_first_zero_bit_le()

On Wed, Aug 24, 2022 at 06:37:45AM -0700, Yury Norov wrote:
> Because there is a codebase that relies on existing types, mostly in
> filesystem code. And those fs fixes would require 5 or 6 patches.

Does that mean that are there filesystems that are passing pointers to
the find_bit functions which are _not_ aligned to an "unsigned long" ?

If there are, we should _not_ convert 32-bit ARM to use word loads or
use the generic code; unaligned loads are expensive on older ARM CPUs,
at least not the code for older ARM CPUs.

--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!

2022-08-24 18:35:41

by Andy Shevchenko

[permalink] [raw]

Subject: Re: [PATCH v2 2/3] lib/find_bit: create find_first_zero_bit_le()

On Wed, Aug 24, 2022 at 4:37 PM Yury Norov <[email protected]> wrote:
> On Wed, Aug 24, 2022 at 12:22:33PM +0300, Andy Shevchenko wrote:
> > On Wed, Aug 24, 2022 at 5:17 AM Yury Norov <[email protected]> wrote:

...

> > > Like other find_*_le() functions, the new one takes void *addr, instead
> > > of unsigned long *. This should be fixed for all in a separate series.
> >
> > From this comment it is unclear to me why we can't fix them first and
> > then apply this with the correct type?
>
> Because there is a codebase that relies on existing types, mostly in
> filesystem code. And those fs fixes would require 5 or 6 patches.
>
> This would triple the length of this series, and is completely
> unrelated. That's why I think that:
> > > This should be fixed for all in a separate series.

So comment update then, if a new version is required?

...

> > > +#define word_op swab
> > > +#include "find_bit.h"
> >
> > Looking at this, I would rather always require to define __ffs_word_op
> > (or whatever name) in the user and replace #ifndef in the find_bit.h
> > with
> > #error "The __ffs_word_op must be defined before including find_bit.h!"
>
> This is a local header which is not intended to be included anywhere
> except lib/find_bit{,_be}.c. I don't expect someone else would want to
> include it, even in lib. So what you suggest is a bit overthinking to
> me. But if you insist, I can do that.

Basically by the above you assured me that #error is the right
approach, please do it that way and we will definitely catch the
incorrect users (even by `git grep -lw __ffs_word_op` if they slip
into the kernel somehow).

--
With Best Regards,
Andy Shevchenko

2022-08-24 20:32:30

by Yury Norov

[permalink] [raw]

Subject: Re: [PATCH v2 2/3] lib/find_bit: create find_first_zero_bit_le()

On Wed, Aug 24, 2022 at 06:58:10PM +0100, Russell King (Oracle) wrote:
> On Wed, Aug 24, 2022 at 06:37:45AM -0700, Yury Norov wrote:
> > Because there is a codebase that relies on existing types, mostly in
> > filesystem code. And those fs fixes would require 5 or 6 patches.
>
> Does that mean that are there filesystems that are passing pointers to
> the find_bit functions which are _not_ aligned to an "unsigned long" ?

Not necessarily. For example, I looked at ext2/4 code, and it looks like
they need void *bitmap, because they pass opaque pointer ->b_data from
struct buffer_head:

struct buffer_head {
...
char *b_data; /* pointer to data within the page */
...
}

So, quite probably, the pointer is aligned when it points to a bitmap.
But I have nothing to prove it, except standard anthropic principle
"otherwise people would complain".

In LE case, the find_bit_le() functions are aliased to find_bit(),
which is aligned, and somehow it works.

> If there are, we should _not_ convert 32-bit ARM to use word loads or
> use the generic code; unaligned loads are expensive on older ARM CPUs,
> at least not the code for older ARM CPUs.

I wonder, if there's an arch that throws an exception in case of unaligned
access. IIRC, ARM can do that, but this option is disabled in kernel.
Right?

I have a series that adds a runtime parameters checker for bitmaps.
I'll add a test for alignment and see how it works (not very soon).

Thanks,
Yury

2022-08-24 21:35:44

by Yury Norov

[permalink] [raw]

Subject: Re: [PATCH v2 3/3] lib/find_bit: optimize find_next_bit() functions

On Wed, Aug 24, 2022 at 08:56:02PM +0300, Andy Shevchenko wrote:
> On Wed, Aug 24, 2022 at 8:54 PM Andy Shevchenko
> <[email protected]> wrote:
> > On Wed, Aug 24, 2022 at 4:53 PM Yury Norov <[email protected]> wrote:
> > > On Wed, Aug 24, 2022 at 12:19:05PM +0300, Andy Shevchenko wrote:
> > > > On Wed, Aug 24, 2022 at 4:56 AM Yury Norov <[email protected]> wrote:
>
> ...
>
> > > > > +#define FIND_NEXT_BIT(EXPRESSION, size, start) \
> > > > > +({ \
> > > > > + unsigned long mask, idx, tmp, sz = (size), __start = (start); \
> > > > > + \
> > > > > + if (unlikely(__start >= sz)) \
> > > > > + goto out; \
> > > > > + \
> > > > > + mask = word_op(BITMAP_FIRST_WORD_MASK(__start)); \
> > > > > + idx = __start / BITS_PER_LONG; \
> > > > > + \
> > > > > + for (tmp = (EXPRESSION) & mask; !tmp; tmp = (EXPRESSION)) { \
> > > >
> > > > for (unsigned long tmp ...;
> > > > But hey, why not loop over idx (which probably should be named as
> > > > offset)
> > >
> > > Offset in structure, index in array, isn't?
> > >
> > > > as I proposed in the first patch? You will drop a lot of
> > > > divisions / multiplications, no?
> > >
> > > Those divisions and multiplications are optimized away, and
> > > what you suggested blows up the EXPRESSION.
> > >
> > > I tried like this:
> > > mask = word_op(BITMAP_FIRST_WORD_MASK(__start));
> > > idx = __start / BITS_PER_LONG;
> > > tmp = (EXPRESSION);
> > >
> > > while (1) {
> > > if (tmp) {
> > > sz = min(idx * BITS_PER_LONG + __ffs(word_op(tmp)), sz);
> > > break;
> > > }
> > >
> > > if (++idx > sz)
> > > break;
> > >
> > > tmp = (EXPRESSION);
> > > }
> > >
> > > And it generated the same code, but looks less expressive to me.
> > > If you have some elegant approach in mind - can you please share
> > > it, and how the generated code looks?
> >
> > for (unsigned long idx = 0; idx < sz; idx++) {
>
> Of source 0 should be changed to whatever start you have there.
>
> > unsigned long tmp;
> >
> > tmp = (EXPRESSION);
> > if (tmp) {
> > ...
> > }
> > }
> >
> > No?

No. For the first iteration, the tmp can't be calculated inside the loop
(my example above is wrong) because we need to clear first bits:

mask = BITMAP_FIRST_WORD_MASK(__start);
idx = __start / BITS_PER_LONG;
tmp = (EXPRESSION) & mask; // First fetch is here

while (1) {
if (tmp) { // Evaluate here
sz = min(idx * BITS_PER_LONG + __ffs(tmp), sz);
break;
}

if (++idx > sz) // Increment here
break;

tmp = (EXPRESSION); // Other fetches here
}

Trying to move iterator increment inside the for-loop, like you suggested
would break the sequence - common-case word fetch will happen before the
idx++.

2022-08-24 22:57:11

by Yury Norov

[permalink] [raw]

Subject: Re: [PATCH v2 2/3] lib/find_bit: create find_first_zero_bit_le()

> Just make the macro take that "last word op" as another argument, and
> then you don't need these tricks, and you can do the _le() version in
> lib/find.c file *exactly* like the regular version, using just
>
> #ifdef __BIG_ENDIAN
> unsigned long find_first_zero_bit_le(const unsigned long *addr,
> unsigned long size)
> { return FIND_FIRST_BIT(~addr[idx], swab(val), size); }
> #else
> unsigned long find_first_zero_bit_le(const unsigned long *addr,
> unsigned long size) __alias(find_first_zero_bit);
> #endif
>
> or something like that.
>
> And yes, it means that the "regular" versions need to pass in just
> "val" as a last-word expression, but who cares? It's still cleaner,
> and easier to explain: "The first expression finds the word that
> matches our requirement, the second expression munges the word we
> found into the bit order we use".

OK. I'll do like this:

#define FIND_FIRST_BIT(FETCH, MUNG, size) ...

unsigned long find_first_zero_bit(const unsigned long *addr, unsigned long size)
{
return FIND_FIRST_BIT(~addr[idx], /* NOP */, size);
}

unsigned long find_first_zero_bit_le(const unsigned long *addr, unsigned long size)
{
return FIND_FIRST_BIT(~addr[idx], swab, size);
}

It doesn't actually look that worse.

Thanks,
Yury