2024-05-02 23:32:18

by Yury Norov

Subject: [PATCH 0/4] bitops: optimize fns() for more

This series follows up [1].

[1] improves performance by switching fns() from __ffs() + __clear_bit()
to the equivalent but much faster expression that searches for and clears
the first N-1 set bits:

	while (word && n--)
		word &= word - 1;

We can squeeze even more out of fns() by replacing the linear walk over
all the bits in a word with a binary search.
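
The split step looks roughly like this (an illustrative sketch only;
'lo' and 'hi' name the two 32-bit halves of the word, and the real
helpers are added in patch #3):

	w = hweight32(lo);
	bit = (n < w) ? fns32(lo, n) : 32 + fns32(hi, n - w);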

Patch #3 implements it by adding fns8(), fns16(), fns32() and fns64(),
and patches 1 and 2 are cleanups related to fns().

The last patch creates a MAINTAINERS record for bitops, which are
currently unmaintained. I add Rasmus and myself as reviewer and
maintainer, respectively, just like for bitmaps. Rasmus, please let me
know if you don't want to review it.

[1] https://lore.kernel.org/linux-kernel/[email protected]/T/

Yury Norov (4):
lib: make test_bitops compilable into the kernel image
bitmap: relax find_nth_bit() limitation on return value
bitops: squeeze even more out of fns()
MAINTAINERS: add BITOPS API record

MAINTAINERS | 13 +++++++++++++
include/linux/bitops.h | 42 +++++++++++++++++++++++++++++++++++++-----
include/linux/find.h | 2 +-
lib/Kconfig.debug | 1 -
lib/find_bit.c | 2 +-
lib/test_bitmap.c | 4 ++--
6 files changed, 54 insertions(+), 10 deletions(-)

--
2.40.1



2024-05-02 23:32:30

by Yury Norov

Subject: [PATCH 1/4] lib: make test_bitops compilable into the kernel image

The test can only be compiled as a module, and there's no technical
reason for that. Now that the test includes a performance benchmark, it
would be reasonable to allow running it at kernel load time, before
userspace starts, to reduce possible jitter.
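
With the 'depends on m' restriction gone, the test can also be built
into the image, e.g. with a .config fragment like (shown for
illustration only):

	CONFIG_TEST_BITOPS=y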

Signed-off-by: Yury Norov <[email protected]>
---
lib/Kconfig.debug | 1 -
1 file changed, 1 deletion(-)

diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index c63a5fbf1f1c..fc8fe1ea5b49 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -2436,7 +2436,6 @@ config TEST_LKM
 
 config TEST_BITOPS
 	tristate "Test module for compilation of bitops operations"
-	depends on m
 	help
 	  This builds the "test_bitops" module that is much like the
 	  TEST_LKM module except that it does a basic exercise of the
--
2.40.1


2024-05-02 23:32:38

by Yury Norov

Subject: [PATCH 2/4] bitmap: relax find_nth_bit() limitation on return value

The function claims to return the bitmap size if the Nth bit doesn't
exist. This rule is violated in the inline case because the fns() used
there doesn't know anything about the size of the bitmap.

So, relax this requirement to '>= size', and make the out-of-line
implementation a bit cheaper.

All in-tree kernel users of find_nth_bit() are safe against this change.
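
A hypothetical caller sketch under the relaxed contract (bmap, nbits
and n are illustrative names) - test for 'not found' with '>=' rather
than '==':

	unsigned long bit = find_nth_bit(bmap, nbits, n);

	if (bit >= nbits)
		return -ENOENT;	/* fewer than n + 1 bits set */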

Reported-by: Rasmus Villemoes <[email protected]>
Closes: https://lore.kernel.org/all/Zi50cAgR8nZvgLa3@yury-ThinkPad/T/#m6da806a0525e74dcc91f35e5f20766ed4e853e8a
Signed-off-by: Yury Norov <[email protected]>
---
include/linux/find.h | 2 +-
lib/find_bit.c | 2 +-
lib/test_bitmap.c | 4 ++--
3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/find.h b/include/linux/find.h
index c69598e383c1..02751e43d448 100644
--- a/include/linux/find.h
+++ b/include/linux/find.h
@@ -220,7 +220,7 @@ unsigned long find_first_bit(const unsigned long *addr, unsigned long size)
  *	idx = find_first_bit(addr, size);
  *
  * Returns the bit number of the N'th set bit.
- * If no such, returns @size.
+ * If no such, returns >= @size.
  */
 static inline
 unsigned long find_nth_bit(const unsigned long *addr, unsigned long size, unsigned long n)
diff --git a/lib/find_bit.c b/lib/find_bit.c
index 32f99e9a670e..0bddfc3ff248 100644
--- a/lib/find_bit.c
+++ b/lib/find_bit.c
@@ -87,7 +87,7 @@ out: \
 	if (sz % BITS_PER_LONG) \
 		tmp = (FETCH) & BITMAP_LAST_WORD_MASK(sz); \
 found: \
-	sz = min(idx * BITS_PER_LONG + fns(tmp, nr), sz); \
+	sz = idx * BITS_PER_LONG + fns(tmp, nr); \
 out: \
 	sz; \
 })
diff --git a/lib/test_bitmap.c b/lib/test_bitmap.c
index 6b2b33579f56..088838f829c9 100644
--- a/lib/test_bitmap.c
+++ b/lib/test_bitmap.c
@@ -244,7 +244,7 @@ static void __init test_find_nth_bit(void)
 	expect_eq_uint(60, find_nth_bit(bmap, 64 * 3, 5));
 	expect_eq_uint(80, find_nth_bit(bmap, 64 * 3, 6));
 	expect_eq_uint(123, find_nth_bit(bmap, 64 * 3, 7));
-	expect_eq_uint(64 * 3, find_nth_bit(bmap, 64 * 3, 8));
+	expect_eq_uint(0, !!(find_nth_bit(bmap, 64 * 3, 8) < 64 * 3));
 
 	expect_eq_uint(10, find_nth_bit(bmap, 64 * 3 - 1, 0));
 	expect_eq_uint(20, find_nth_bit(bmap, 64 * 3 - 1, 1));
@@ -254,7 +254,7 @@ static void __init test_find_nth_bit(void)
 	expect_eq_uint(60, find_nth_bit(bmap, 64 * 3 - 1, 5));
 	expect_eq_uint(80, find_nth_bit(bmap, 64 * 3 - 1, 6));
 	expect_eq_uint(123, find_nth_bit(bmap, 64 * 3 - 1, 7));
-	expect_eq_uint(64 * 3 - 1, find_nth_bit(bmap, 64 * 3 - 1, 8));
+	expect_eq_uint(0, !!(find_nth_bit(bmap, 64 * 3 - 1, 8) < 64 * 3 - 1));
 
 	for_each_set_bit(bit, exp1, EXP1_IN_BITS) {
 		b = find_nth_bit(exp1, EXP1_IN_BITS, cnt++);
--
2.40.1


2024-05-02 23:32:52

by Yury Norov

Subject: [PATCH 3/4] bitops: squeeze even more out of fns()

The function clears the first N-1 set bits to find the N'th one with:

	while (word && n--)
		word &= word - 1;

In the worst case, this takes 63 iterations.

Instead of a linear walk through the set bits, we can do a binary search
by using hweight(). This works even better on platforms supporting
hardware-assisted hweight() - pretty much every modern arch.

On my Ryzen 9 5900X, the test_fns() benchmark runs the binary fns()
twice as fast as the linear one.

fns8() returns 64 to make sure that, when no bit is found, the return
value is greater than the bit width of the arguments of all fnsXX()
functions up to fns64().
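
As a worked example with a hypothetical input: for word =
0x0000010000000080 and n = 1, the low 32 bits have hweight32() == 1,
so fns64() drops into the high half with n = 0 and returns
32 + fns32(0x00000100, 0) = 32 + 8 = 40 - the second set bit of the
word.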

Signed-off-by: Yury Norov <[email protected]>
---
include/linux/bitops.h | 42 +++++++++++++++++++++++++++++++++++++-----
1 file changed, 37 insertions(+), 5 deletions(-)

diff --git a/include/linux/bitops.h b/include/linux/bitops.h
index 57ecef354f47..1c4773db56e0 100644
--- a/include/linux/bitops.h
+++ b/include/linux/bitops.h
@@ -247,17 +247,49 @@ static inline unsigned long __ffs64(u64 word)
 	return __ffs((unsigned long)word);
 }
 
+static inline unsigned long fns8(u8 word, unsigned int n)
+{
+	while (word && n--)
+		word &= word - 1;
+
+	return word ? __ffs(word) : 64;
+}
+
+static inline unsigned long fns16(u16 word, unsigned int n)
+{
+	unsigned int w = hweight8((u8)word);
+
+	return n < w ? fns8((u8)word, n) : 8 + fns8((u8)(word >> 8), n - w);
+}
+
+static inline unsigned long fns32(u32 word, unsigned int n)
+{
+	unsigned int w = hweight16((u16)word);
+
+	return n < w ? fns16((u16)word, n) : 16 + fns16((u16)(word >> 16), n - w);
+}
+
+static inline unsigned long fns64(u64 word, unsigned int n)
+{
+	unsigned int w = hweight32((u32)word);
+
+	return n < w ? fns32((u32)word, n) : 32 + fns32((u32)(word >> 32), n - w);
+}
+
 /**
  * fns - find N'th set bit in a word
  * @word: The word to search
- * @n: Bit to find
+ * @n: Bit to find, counting from 0
+ *
+ * Returns N'th set bit. If no such bit found, returns >= BITS_PER_LONG
  */
 static inline unsigned long fns(unsigned long word, unsigned int n)
 {
-	while (word && n--)
-		word &= word - 1;
-
-	return word ? __ffs(word) : BITS_PER_LONG;
+#if BITS_PER_LONG == 64
+	return fns64(word, n);
+#else
+	return fns32(word, n);
+#endif
 }
 
 /**
--
2.40.1


2024-05-02 23:36:39

by Yury Norov

Subject: [PATCH 4/4] MAINTAINERS: add BITOPS API record

The bitops API is very basic and widely used by the kernel, but the
corresponding files are not maintained.

Bitmaps actively use bit operations, and a big share of bitops material
already moves through the bitmap branch.

I would like to take a closer look at bitops.

This patch creates a BITOPS API record in MAINTAINERS, adding Rasmus as
a reviewer and myself as a maintainer of those files.

Signed-off-by: Yury Norov <[email protected]>
---
MAINTAINERS | 14 ++++++++++++++
1 file changed, 14 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index c23fda1aa1f0..0cfd2c5d9086 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3638,6 +3638,20 @@ F: tools/include/vdso/bits.h
 F:	tools/lib/bitmap.c
 F:	tools/lib/find_bit.c
 
+BITOPS API
+M:	Yury Norov <[email protected]>
+R:	Rasmus Villemoes <[email protected]>
+S:	Maintained
+F:	arch/*/include/asm/bitops.h
+F:	arch/*/include/asm/bitops_32.h
+F:	arch/*/include/asm/bitops_64.h
+F:	arch/*/lib/bitops.c
+F:	include/asm-generic/bitops
+F:	include/asm-generic/bitops.h
+F:	include/linux/bitops.h
+F:	lib/test_bitops.c
+F:	tools/*/bitops*
+
 BLINKM RGB LED DRIVER
 M:	Jan-Simon Moeller <[email protected]>
 S:	Maintained
--
2.40.1


2024-05-03 02:00:21

by Kuan-Wei Chiu

Subject: Re: [PATCH 1/4] lib: make test_bitops compilable into the kernel image

On Thu, May 02, 2024 at 04:32:01PM -0700, Yury Norov wrote:
> The test is limited to be compiled as a module. There's no technical
> reason for it. Now that the test bears performance benchmark, it would
> be reasonable to allow running it at kernel load time, before userspace
> starts, to reduce possible jitter.
>
> Signed-off-by: Yury Norov <[email protected]>
> ---
> lib/Kconfig.debug | 1 -
> 1 file changed, 1 deletion(-)
>
> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> index c63a5fbf1f1c..fc8fe1ea5b49 100644
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -2436,7 +2436,6 @@ config TEST_LKM
>
> config TEST_BITOPS
> tristate "Test module for compilation of bitops operations"
> - depends on m


Perhaps it would be better to modify the description in the following
help section at the same time?

Reviewed-by: Kuan-Wei Chiu <[email protected]>

> help
> This builds the "test_bitops" module that is much like the
> TEST_LKM module except that it does a basic exercise of the
> --
> 2.40.1
>

2024-05-03 02:14:22

by Yury Norov

Subject: Re: [PATCH 1/4] lib: make test_bitops compilable into the kernel image

On Fri, May 03, 2024 at 10:00:07AM +0800, Kuan-Wei Chiu wrote:
> On Thu, May 02, 2024 at 04:32:01PM -0700, Yury Norov wrote:
> > The test is limited to be compiled as a module. There's no technical
> > reason for it. Now that the test bears performance benchmark, it would
> > be reasonable to allow running it at kernel load time, before userspace
> > starts, to reduce possible jitter.
> >
> > Signed-off-by: Yury Norov <[email protected]>
> > ---
> > lib/Kconfig.debug | 1 -
> > 1 file changed, 1 deletion(-)
> >
> > diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> > index c63a5fbf1f1c..fc8fe1ea5b49 100644
> > --- a/lib/Kconfig.debug
> > +++ b/lib/Kconfig.debug
> > @@ -2436,7 +2436,6 @@ config TEST_LKM
> >
> > config TEST_BITOPS
> > tristate "Test module for compilation of bitops operations"
> > - depends on m
>
>
> Perhaps it would be better to modify the description in the following
> help section at the same time?

What exactly do you want to change?

> Reviewed-by: Kuan-Wei Chiu <[email protected]>
>
> > help
> > This builds the "test_bitops" module that is much like the
> > TEST_LKM module except that it does a basic exercise of the
> > --
> > 2.40.1
> >

2024-05-03 02:19:22

by Kuan-Wei Chiu

Subject: Re: [PATCH 3/4] bitops: squeeze even more out of fns()

+Cc Chin-Chun Chen & Ching-Chun (Jim) Huang

On Thu, May 02, 2024 at 04:32:03PM -0700, Yury Norov wrote:
> The function clears N-1 first set bits to find the N'th one with:
>
> while (word && n--)
> word &= word - 1;
>
> In the worst case, it would take 63 iterations.
>
> Instead of linear walk through the set bits, we can do a binary search
> by using hweight(). This would work even better on platforms supporting
> hardware-assisted hweight() - pretty much every modern arch.
>
Chin-Chun once proposed a method similar to binary search combined with
Hamming weight and discussed it privately with me and Jim. However,
Chin-Chun found that binary search would actually impair performance
when n is small. Since we are unsure about the typical range of n in
our actual workload, we have not yet proposed any relevant patches.
Considering only the overall benchmark results, this patch looks good
to me.

Regards,
Kuan-Wei

> On my Ryzen 9 5900X, the test_fns() benchmark runs binary fns() twice
> faster, comparing to linear one.
>
> The fns8() returns 64 to make sure that in case of no bit found, the
> return value will be greater than the bit capacity of arguments of all
> fnsXX() functions up to fns64().
>
> Signed-off-by: Yury Norov <[email protected]>
> ---
> include/linux/bitops.h | 42 +++++++++++++++++++++++++++++++++++++-----
> 1 file changed, 37 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/bitops.h b/include/linux/bitops.h
> index 57ecef354f47..1c4773db56e0 100644
> --- a/include/linux/bitops.h
> +++ b/include/linux/bitops.h
> @@ -247,17 +247,49 @@ static inline unsigned long __ffs64(u64 word)
> return __ffs((unsigned long)word);
> }
>
> +static inline unsigned long fns8(u8 word, unsigned int n)
> +{
> + while (word && n--)
> + word &= word - 1;
> +
> + return word ? __ffs(word) : 64;
> +}
> +
> +static inline unsigned long fns16(u16 word, unsigned int n)
> +{
> + unsigned int w = hweight8((u8)word);
> +
> + return n < w ? fns8((u8)word, n) : 8 + fns8((u8)(word >> 8), n - w);
> +}
> +
> +static inline unsigned long fns32(u32 word, unsigned int n)
> +{
> + unsigned int w = hweight16((u16)word);
> +
> + return n < w ? fns16((u16)word, n) : 16 + fns16((u16)(word >> 16), n - w);
> +}
> +
> +static inline unsigned long fns64(u64 word, unsigned int n)
> +{
> + unsigned int w = hweight32((u32)word);
> +
> + return n < w ? fns32((u32)word, n) : 32 + fns32((u32)(word >> 32), n - w);
> +}
> +
> /**
> * fns - find N'th set bit in a word
> * @word: The word to search
> - * @n: Bit to find
> + * @n: Bit to find, counting from 0
> + *
> + * Returns N'th set bit. If no such bit found, returns >= BITS_PER_LONG
> */
> static inline unsigned long fns(unsigned long word, unsigned int n)
> {
> - while (word && n--)
> - word &= word - 1;
> -
> - return word ? __ffs(word) : BITS_PER_LONG;
> +#if BITS_PER_LONG == 64
> + return fns64(word, n);
> +#else
> + return fns32(word, n);
> +#endif
> }
>
> /**
> --
> 2.40.1
>

2024-05-03 02:24:35

by Kuan-Wei Chiu

Subject: Re: [PATCH 1/4] lib: make test_bitops compilable into the kernel image

On Thu, May 02, 2024 at 07:14:10PM -0700, Yury Norov wrote:
> On Fri, May 03, 2024 at 10:00:07AM +0800, Kuan-Wei Chiu wrote:
> > On Thu, May 02, 2024 at 04:32:01PM -0700, Yury Norov wrote:
> > > The test is limited to be compiled as a module. There's no technical
> > > reason for it. Now that the test bears performance benchmark, it would
> > > be reasonable to allow running it at kernel load time, before userspace
> > > starts, to reduce possible jitter.
> > >
> > > Signed-off-by: Yury Norov <[email protected]>
> > > ---
> > > lib/Kconfig.debug | 1 -
> > > 1 file changed, 1 deletion(-)
> > >
> > > diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> > > index c63a5fbf1f1c..fc8fe1ea5b49 100644
> > > --- a/lib/Kconfig.debug
> > > +++ b/lib/Kconfig.debug
> > > @@ -2436,7 +2436,6 @@ config TEST_LKM
> > >
> > > config TEST_BITOPS
> > > tristate "Test module for compilation of bitops operations"
> > > - depends on m
> >
> >
> > Perhaps it would be better to modify the description in the following
> > help section at the same time?
>
> What exactly you want to change?
>
It seems to me that the entire description is written specifically for
the module. For instance, "doesn't run or load unless explicitly
requested by name. for example: modprobe test_bitops." In my view, this
description is no longer accurate.

Regards,
Kuan-Wei

> > Reviewed-by: Kuan-Wei Chiu <[email protected]>
> >
> > > help
> > > This builds the "test_bitops" module that is much like the
> > > TEST_LKM module except that it does a basic exercise of the
> > > --
> > > 2.40.1
> > >

2024-05-03 15:13:19

by Yury Norov

Subject: Re: [PATCH 1/4] lib: make test_bitops compilable into the kernel image

On Fri, May 03, 2024 at 10:24:23AM +0800, Kuan-Wei Chiu wrote:
> On Thu, May 02, 2024 at 07:14:10PM -0700, Yury Norov wrote:
> > On Fri, May 03, 2024 at 10:00:07AM +0800, Kuan-Wei Chiu wrote:
> > > On Thu, May 02, 2024 at 04:32:01PM -0700, Yury Norov wrote:
> > > > The test is limited to be compiled as a module. There's no technical
> > > > reason for it. Now that the test bears performance benchmark, it would
> > > > be reasonable to allow running it at kernel load time, before userspace
> > > > starts, to reduce possible jitter.
> > > >
> > > > Signed-off-by: Yury Norov <[email protected]>
> > > > ---
> > > > lib/Kconfig.debug | 1 -
> > > > 1 file changed, 1 deletion(-)
> > > >
> > > > diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> > > > index c63a5fbf1f1c..fc8fe1ea5b49 100644
> > > > --- a/lib/Kconfig.debug
> > > > +++ b/lib/Kconfig.debug
> > > > @@ -2436,7 +2436,6 @@ config TEST_LKM
> > > >
> > > > config TEST_BITOPS
> > > > tristate "Test module for compilation of bitops operations"
> > > > - depends on m
> > >
> > >
> > > Perhaps it would be better to modify the description in the following
> > > help section at the same time?
> >
> > What exactly you want to change?
> >
> It seems to me that the entire description is written specifically for
> the module. For instance, "doesn't run or load unless explicitly
> requested by name. for example: modprobe test_bitops." In my view, this
> description is no longer accurate.

An in-kernel module is still a module. Everything is the same as for a
.ko, except that it's loaded automatically and earlier for you. To me
this part of the description is correct.

If you feel it should be reworded - feel free to submit a patch. Now
that we're adding more functionality there, it's probably worth doing.
Not in this series, though.

Thanks,
Yury

2024-05-03 15:30:26

by Kuan-Wei Chiu

Subject: Re: [PATCH 1/4] lib: make test_bitops compilable into the kernel image

On Fri, May 03, 2024 at 08:13:05AM -0700, Yury Norov wrote:
> On Fri, May 03, 2024 at 10:24:23AM +0800, Kuan-Wei Chiu wrote:
> > On Thu, May 02, 2024 at 07:14:10PM -0700, Yury Norov wrote:
> > > On Fri, May 03, 2024 at 10:00:07AM +0800, Kuan-Wei Chiu wrote:
> > > > On Thu, May 02, 2024 at 04:32:01PM -0700, Yury Norov wrote:
> > > > > The test is limited to be compiled as a module. There's no technical
> > > > > reason for it. Now that the test bears performance benchmark, it would
> > > > > be reasonable to allow running it at kernel load time, before userspace
> > > > > starts, to reduce possible jitter.
> > > > >
> > > > > Signed-off-by: Yury Norov <[email protected]>
> > > > > ---
> > > > > lib/Kconfig.debug | 1 -
> > > > > 1 file changed, 1 deletion(-)
> > > > >
> > > > > diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> > > > > index c63a5fbf1f1c..fc8fe1ea5b49 100644
> > > > > --- a/lib/Kconfig.debug
> > > > > +++ b/lib/Kconfig.debug
> > > > > @@ -2436,7 +2436,6 @@ config TEST_LKM
> > > > >
> > > > > config TEST_BITOPS
> > > > > tristate "Test module for compilation of bitops operations"
> > > > > - depends on m
> > > >
> > > >
> > > > Perhaps it would be better to modify the description in the following
> > > > help section at the same time?
> > >
> > > What exactly you want to change?
> > >
> > It seems to me that the entire description is written specifically for
> > the module. For instance, "doesn't run or load unless explicitly
> > requested by name. for example: modprobe test_bitops." In my view, this
> > description is no longer accurate.
>
> In-kernel module is still module. Everything is the same as for .ko,
> except that it's loaded automatically and earlier for you. To me this
> part of the description is correct.
>
> If you feel it should be reworded - feel free to submit a patch. Now
> that we add more functionality in that, it's probably worth to do. Not
> in this series, though.
>
> Thanks,
> Yury

Got it, thank you for the explanation! :)

Regards,
Kuan-Wei

2024-05-03 16:22:00

by Yury Norov

Subject: Re: [PATCH 3/4] bitops: squeeze even more out of fns()

On Fri, May 03, 2024 at 10:19:10AM +0800, Kuan-Wei Chiu wrote:
> +Cc Chin-Chun Chen & Ching-Chun (Jim) Huang
>
> On Thu, May 02, 2024 at 04:32:03PM -0700, Yury Norov wrote:
> > The function clears N-1 first set bits to find the N'th one with:
> >
> > while (word && n--)
> > word &= word - 1;
> >
> > In the worst case, it would take 63 iterations.
> >
> > Instead of linear walk through the set bits, we can do a binary search
> > by using hweight(). This would work even better on platforms supporting
> > hardware-assisted hweight() - pretty much every modern arch.
> >
> Chin-Chun once proposed a method similar to binary search combined with
> hamming weight and discussed it privately with me and Jim. However,
> Chin-Chun found that binary search would actually impair performance
> when n is small. Since we are unsure about the typical range of n in
> our actual workload, we have not yet proposed any relevant patches. If
> considering only the overall benchmark results, this patch looks good
> to me.

fns() is used only as a helper for find_nth_bit().

In the kernel, find_nth_bit() is used in
 - bitmap_bitremap(),
 - bitmap_remap(), and
 - cpumask_local_spread() via sched_numa_find_nth_cpu(),

with the bit to search calculated as n = n % cpumask_weight(). This
virtually implies random, uniformly distributed n and word, just like
in test_fns().
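
A sketch of that pattern with illustrative names (i is the index being
spread over the set bits of mask):

	unsigned int n = i % cpumask_weight(mask);
	unsigned int cpu = cpumask_nth(n, mask);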

In rebalance_wq_table() in drivers/crypto/intel/iaa/iaa_crypto_main.c
it's used like:

	for (cpu = 0; cpu < nr_cpus_per_node; cpu++) {
		int node_cpu = cpumask_nth(cpu, node_cpus);
		...
	}

This is an abuse of the API and should be rewritten with for_each_cpu().
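
Something along these lines (a sketch, assuming the loop body only
needs to visit each CPU in node_cpus once):

	for_each_cpu(cpu, node_cpus) {
		...
	}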

In cpumask_any_housekeeping() at arch/x86/kernel/cpu/resctrl/internal.h
it's used like:

 90	hk_cpu = cpumask_nth_andnot(0, mask, tick_nohz_full_mask);
 91	if (hk_cpu == exclude_cpu)
 92		hk_cpu = cpumask_nth_andnot(1, mask, tick_nohz_full_mask);
 93
 94	if (hk_cpu < nr_cpu_ids)
 95		cpu = hk_cpu;

And this is another example of API abuse. We need to introduce a new
helper, cpumask_andnot_any_but(), and use it like:

	hk_cpu = cpumask_andnot_any_but(exclude_cpu, mask, tick_nohz_full_mask);
	if (hk_cpu < nr_cpu_ids)
		cpu = hk_cpu;

So, where the use of find_nth_bit() is legitimate, the parameters are
distributed like in the test, and I would expect the real-life
performance impact to be similar to what the test shows.

Optimizing the helper for non-legitimate cases isn't worth the effort.

Thanks,
Yury

2024-05-04 08:54:00

by Kuan-Wei Chiu

Subject: Re: [PATCH 3/4] bitops: squeeze even more out of fns()

On Fri, May 03, 2024 at 09:13:32AM -0700, Yury Norov wrote:
> On Fri, May 03, 2024 at 10:19:10AM +0800, Kuan-Wei Chiu wrote:
> > +Cc Chin-Chun Chen & Ching-Chun (Jim) Huang
> >
> > On Thu, May 02, 2024 at 04:32:03PM -0700, Yury Norov wrote:
> > > The function clears N-1 first set bits to find the N'th one with:
> > >
> > > while (word && n--)
> > > word &= word - 1;
> > >
> > > In the worst case, it would take 63 iterations.
> > >
> > > Instead of linear walk through the set bits, we can do a binary search
> > > by using hweight(). This would work even better on platforms supporting
> > > hardware-assisted hweight() - pretty much every modern arch.
> > >
> > Chin-Chun once proposed a method similar to binary search combined with
> > hamming weight and discussed it privately with me and Jim. However,
> > Chin-Chun found that binary search would actually impair performance
> > when n is small. Since we are unsure about the typical range of n in
> > our actual workload, we have not yet proposed any relevant patches. If
> > considering only the overall benchmark results, this patch looks good
> > to me.
>
> fns() is used only as a helper to find_nth_bit().
>
> In the kernel the find_nth_bit() is used in
> - bitmap_bitremap((),
> - bitmap_remap(), and
> - cpumask_local_spread() via sched_numa_find_nth_cpu()
>
> with the bit to search calculated as n = n % cpumask_weigth(). This
> virtually implies random uniformly distributed n and word, just like
> in the test_fns().
>
> In rebalance_wq_table() in drivers/crypto/intel/iaa/iaa_crypto_main.c
> it's used like:
>
> for (cpu = 0; cpu < nr_cpus_per_node; cpu++) {
> int node_cpu = cpumask_nth(cpu, node_cpus);
> ...
> }
>
> This is an API abuse, and should be rewritten with for_each_cpu()
>
> In cpumask_any_housekeeping() at arch/x86/kernel/cpu/resctrl/internal.h
> it's used like:
>
> 90 hk_cpu = cpumask_nth_andnot(0, mask, tick_nohz_full_mask);
> 91 if (hk_cpu == exclude_cpu)
> 92 hk_cpu = cpumask_nth_andnot(1, mask, tick_nohz_full_mask);
> 93
> 94 if (hk_cpu < nr_cpu_ids)
> 95 cpu = hk_cpu;
>
> And this is another example of the API abuse. We need to introduce a new
> helper cpumask_andnot_any_but() and use it like:
>
> hk_cpu = cpumask_andnot_any_but(exclude_cpu, mask, tick_nohz_full_mask).
> if (hk_cpu < nr_cpu_ids)
> cpu = hk_cpu;
>
> So, where the use of find_nth_bit() is legitimate, the parameters are
> distributed like in the test, and I would expect the real-life
> performance impact to be similar to the test.
>
> Optimizing the helper for non-legitimate cases doesn't worth the
> effort.
>
Got it, thank you for your detailed explanation :)

Reviewed-by: Kuan-Wei Chiu <[email protected]>

Regards,
Kuan-Wei