2024-05-01 13:21:01

by Kuan-Wei Chiu

[permalink] [raw]
Subject: [PATCH v4 0/2] bitops: Optimize fns() for improved performance

Hello,

This patch series optimizes the fns() function by avoiding repeated
calls to __ffs(). Additionally, tests for fns() have been added in
lib/test_bitops.c.

Changes in v4:
- Correct get_random_long() -> get_random_bytes() in the commit
message.

Changes in v3:
- Move the benchmark test for fns() to lib/test_bitops.c.
- Exclude the overhead of random number generation from the benchmark
result.
- Change the output to print only a total gross instead of each n in
the benchmark result.
- Update the commit message in the second patch.

Changes in v2:
- Add benchmark test for fns() in lib/find_bit_benchmark.c.
- Change the loop in fns() by counting down from n to 0.
- Add find_bit benchmark result for find_nth_bit in commit message.

Link to v3: https://lkml.kernel.org/[email protected]
Link to v2: https://lkml.kernel.org/[email protected]
Link to v1: https://lkml.kernel.org/[email protected]

Kuan-Wei Chiu (2):
lib/test_bitops: Add benchmark test for fns()
bitops: Optimize fns() for improved performance

include/linux/bitops.h | 12 +++---------
lib/test_bitops.c | 22 ++++++++++++++++++++++
2 files changed, 25 insertions(+), 9 deletions(-)

--
2.34.1



2024-05-01 13:21:15

by Kuan-Wei Chiu

[permalink] [raw]
Subject: [PATCH v4 1/2] lib/test_bitops: Add benchmark test for fns()

Introduce a benchmark test for the fns(). It measures the total time
taken by fns() to process 1,000,000 test data generated using
get_random_bytes() for each n in the range [0, BITS_PER_LONG).

example:
test_bitops: fns: 5876762553 ns, 64000000 iterations

Signed-off-by: Kuan-Wei Chiu <[email protected]>
---

Changes in v4:
- Correct get_random_long() -> get_random_bytes() in the commit
message.

lib/test_bitops.c | 22 ++++++++++++++++++++++
1 file changed, 22 insertions(+)

diff --git a/lib/test_bitops.c b/lib/test_bitops.c
index 3b7bcbee84db..ed939f124417 100644
--- a/lib/test_bitops.c
+++ b/lib/test_bitops.c
@@ -50,6 +50,26 @@ static unsigned long order_comb_long[][2] = {
};
#endif

+static unsigned long buf[1000000];
+
+static int __init test_fns(void)
+{
+ unsigned int i, n;
+ ktime_t time;
+
+ get_random_bytes(buf, sizeof(buf));
+ time = ktime_get();
+
+ for (n = 0; n < BITS_PER_LONG; n++)
+ for (i = 0; i < 1000000; i++)
+ fns(buf[i], n);
+
+ time = ktime_get() - time;
+ pr_err("fns: %18llu ns, %6d iterations\n", time, BITS_PER_LONG * 1000000);
+
+ return 0;
+}
+
static int __init test_bitops_startup(void)
{
int i, bit_set;
@@ -94,6 +114,8 @@ static int __init test_bitops_startup(void)
if (bit_set != BITOPS_LAST)
pr_err("ERROR: FOUND SET BIT %d\n", bit_set);

+ test_fns();
+
pr_info("Completed bitops test\n");

return 0;
--
2.34.1


2024-05-01 13:22:15

by Kuan-Wei Chiu

[permalink] [raw]
Subject: [PATCH v4 2/2] bitops: Optimize fns() for improved performance

The current fns() repeatedly uses __ffs() to find the index of the
least significant bit and then clears the corresponding bit using
__clear_bit(). The method for clearing the least significant bit can be
optimized by using word &= word - 1 instead.

Typically, the execution time of one __ffs() plus one __clear_bit() is
longer than that of a bitwise AND operation and a subtraction. To
improve performance, the loop for clearing the least significant bit
has been replaced with word &= word - 1, followed by a single __ffs()
operation to obtain the answer. This change reduces the number of
__ffs() iterations from n to just one, enhancing overall performance.

This modification significantly accelerates the fns() function in the
test_bitops benchmark, improving its speed by approximately 439 times.
Additionally, it enhances the performance of find_nth_bit() in the
find_bit benchmark by approximately 26%.

Before:
test_bitops: fns: 5876762553 ns, 64000000 iterations
find_nth_bit: 4254313 ns, 16525 iterations

After:
test_bitops: fns: 13388431 ns, 64000000 iterations
find_nth_bit: 3362863 ns, 16501 iterations

Signed-off-by: Kuan-Wei Chiu <[email protected]>
---
include/linux/bitops.h | 12 +++---------
1 file changed, 3 insertions(+), 9 deletions(-)

diff --git a/include/linux/bitops.h b/include/linux/bitops.h
index 2ba557e067fe..57ecef354f47 100644
--- a/include/linux/bitops.h
+++ b/include/linux/bitops.h
@@ -254,16 +254,10 @@ static inline unsigned long __ffs64(u64 word)
*/
static inline unsigned long fns(unsigned long word, unsigned int n)
{
- unsigned int bit;
+ while (word && n--)
+ word &= word - 1;

- while (word) {
- bit = __ffs(word);
- if (n-- == 0)
- return bit;
- __clear_bit(bit, &word);
- }
-
- return BITS_PER_LONG;
+ return word ? __ffs(word) : BITS_PER_LONG;
}

/**
--
2.34.1


2024-05-01 16:29:54

by Yury Norov

[permalink] [raw]
Subject: Re: [PATCH v4 1/2] lib/test_bitops: Add benchmark test for fns()

On Wed, May 01, 2024 at 09:20:46PM +0800, Kuan-Wei Chiu wrote:
> Introduce a benchmark test for the fns(). It measures the total time
> taken by fns() to process 1,000,000 test data generated using
> get_random_bytes() for each n in the range [0, BITS_PER_LONG).
>
> example:
> test_bitops: fns: 5876762553 ns, 64000000 iterations

So... 5 seconds for a test sounds too much. I see the following patch
improves it dramatically, but in general let's stay in a range of
milliseconds. On other machines it may run much slower and trigger
a stall watchdog.

> Signed-off-by: Kuan-Wei Chiu <[email protected]>

Suggested-by: Yury Norov <[email protected]>

> ---
>
> Changes in v4:
> - Correct get_random_long() -> get_random_bytes() in the commit
> message.
>
> lib/test_bitops.c | 22 ++++++++++++++++++++++
> 1 file changed, 22 insertions(+)
>
> diff --git a/lib/test_bitops.c b/lib/test_bitops.c
> index 3b7bcbee84db..ed939f124417 100644
> --- a/lib/test_bitops.c
> +++ b/lib/test_bitops.c
> @@ -50,6 +50,26 @@ static unsigned long order_comb_long[][2] = {
> };
> #endif
>
> +static unsigned long buf[1000000];

Can you make it __init, or allocate with kmalloc_array(), so that 64M
of memory will not last forever in the kernel?

> +static int __init test_fns(void)
> +{
> + unsigned int i, n;
> + ktime_t time;
> +
> + get_random_bytes(buf, sizeof(buf));
> + time = ktime_get();
> +
> + for (n = 0; n < BITS_PER_LONG; n++)
> + for (i = 0; i < 1000000; i++)
> + fns(buf[i], n);

What concerns me here is that fns() is a in fact a const function, and
the whole loop may be eliminated. Can you make sure it's not your case
because 450x performance boost sounds a bit too much to me.

You can declare a "static volatile __used __init" variable to assign
the result of fns(), and ensure that the code is not eliminated

> + time = ktime_get() - time;
> + pr_err("fns: %18llu ns, %6d iterations\n", time, BITS_PER_LONG * 1000000);

Here the number of iterations is always the same. What's the point to
print it?

Thanks,
Yury

2024-05-05 13:12:46

by David Laight

[permalink] [raw]
Subject: RE: [PATCH v4 1/2] lib/test_bitops: Add benchmark test for fns()

From: Yury Norov
> Sent: 01 May 2024 17:30
>
> On Wed, May 01, 2024 at 09:20:46PM +0800, Kuan-Wei Chiu wrote:
> > Introduce a benchmark test for the fns(). It measures the total time
> > taken by fns() to process 1,000,000 test data generated using
> > get_random_bytes() for each n in the range [0, BITS_PER_LONG).
> >
> > example:
> > test_bitops: fns: 5876762553 ns, 64000000 iterations
>
> So... 5 seconds for a test sounds too much. I see the following patch
> improves it dramatically, but in general let's stay in a range of
> milliseconds. On other machines it may run much slower and trigger
> a stall watchdog.
>
> > Signed-off-by: Kuan-Wei Chiu <[email protected]>
>
> Suggested-by: Yury Norov <[email protected]>
>
> > ---
> >
> > Changes in v4:
> > - Correct get_random_long() -> get_random_bytes() in the commit
> > message.
> >
> > lib/test_bitops.c | 22 ++++++++++++++++++++++
> > 1 file changed, 22 insertions(+)
> >
> > diff --git a/lib/test_bitops.c b/lib/test_bitops.c
> > index 3b7bcbee84db..ed939f124417 100644
> > --- a/lib/test_bitops.c
> > +++ b/lib/test_bitops.c
> > @@ -50,6 +50,26 @@ static unsigned long order_comb_long[][2] = {
> > };
> > #endif
> >
> > +static unsigned long buf[1000000];
>
> Can you make it __init, or allocate with kmalloc_array(), so that 64M
> of memory will not last forever in the kernel?
>
> > +static int __init test_fns(void)
> > +{
> > + unsigned int i, n;
> > + ktime_t time;
> > +
> > + get_random_bytes(buf, sizeof(buf));
> > + time = ktime_get();
> > +
> > + for (n = 0; n < BITS_PER_LONG; n++)
> > + for (i = 0; i < 1000000; i++)
> > + fns(buf[i], n);
>
> What concerns me here is that fns() is a in fact a const function, and
> the whole loop may be eliminated. Can you make sure it's not your case
> because 450x performance boost sounds a bit too much to me.
>
> You can declare a "static volatile __used __init" variable to assign
> the result of fns(), and ensure that the code is not eliminated

Yep, without 'c' this compiler to 'return 0'.

static inline unsigned long fns(unsigned long word, unsigned int n)
{
while (word && n--)
word &= word - 1;
return word ? __builtin_ffs(word) : 8 * sizeof (long);
}

unsigned long buf[1000000];

volatile int c;

int test_fns(void)
{
unsigned int i, n;

for (n = 0; n < 8*sizeof (long); n++)
for (i = 0; i < 1000000; i++)
c = fns(buf[i], n);
return 0;
}

You are also hitting the random number generator.
It would be better to use a predictable sequence.
Then you could, for instance, add up all the fns() results
and check you get the expected value.

With a really trivial 'RNG' (like step a CRC one bit) you
could do it inside the loop and not nee a buffer at all.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


2024-05-05 18:48:26

by Kuan-Wei Chiu

[permalink] [raw]
Subject: Re: [PATCH v4 1/2] lib/test_bitops: Add benchmark test for fns()

On Sun, May 05, 2024 at 01:11:53PM +0000, David Laight wrote:
> From: Yury Norov
> > Sent: 01 May 2024 17:30
> >
> > On Wed, May 01, 2024 at 09:20:46PM +0800, Kuan-Wei Chiu wrote:
> > > Introduce a benchmark test for the fns(). It measures the total time
> > > taken by fns() to process 1,000,000 test data generated using
> > > get_random_bytes() for each n in the range [0, BITS_PER_LONG).
> > >
> > > example:
> > > test_bitops: fns: 5876762553 ns, 64000000 iterations
> >
> > So... 5 seconds for a test sounds too much. I see the following patch
> > improves it dramatically, but in general let's stay in a range of
> > milliseconds. On other machines it may run much slower and trigger
> > a stall watchdog.
> >
> > > Signed-off-by: Kuan-Wei Chiu <[email protected]>
> >
> > Suggested-by: Yury Norov <[email protected]>
> >
> > > ---
> > >
> > > Changes in v4:
> > > - Correct get_random_long() -> get_random_bytes() in the commit
> > > message.
> > >
> > > lib/test_bitops.c | 22 ++++++++++++++++++++++
> > > 1 file changed, 22 insertions(+)
> > >
> > > diff --git a/lib/test_bitops.c b/lib/test_bitops.c
> > > index 3b7bcbee84db..ed939f124417 100644
> > > --- a/lib/test_bitops.c
> > > +++ b/lib/test_bitops.c
> > > @@ -50,6 +50,26 @@ static unsigned long order_comb_long[][2] = {
> > > };
> > > #endif
> > >
> > > +static unsigned long buf[1000000];
> >
> > Can you make it __init, or allocate with kmalloc_array(), so that 64M
> > of memory will not last forever in the kernel?
> >
> > > +static int __init test_fns(void)
> > > +{
> > > + unsigned int i, n;
> > > + ktime_t time;
> > > +
> > > + get_random_bytes(buf, sizeof(buf));
> > > + time = ktime_get();
> > > +
> > > + for (n = 0; n < BITS_PER_LONG; n++)
> > > + for (i = 0; i < 1000000; i++)
> > > + fns(buf[i], n);
> >
> > What concerns me here is that fns() is a in fact a const function, and
> > the whole loop may be eliminated. Can you make sure it's not your case
> > because 450x performance boost sounds a bit too much to me.
> >
> > You can declare a "static volatile __used __init" variable to assign
> > the result of fns(), and ensure that the code is not eliminated
>
> Yep, without 'c' this compiler to 'return 0'.
>
> static inline unsigned long fns(unsigned long word, unsigned int n)
> {
> while (word && n--)
> word &= word - 1;
> return word ? __builtin_ffs(word) : 8 * sizeof (long);
> }
>
> unsigned long buf[1000000];
>
> volatile int c;
>
> int test_fns(void)
> {
> unsigned int i, n;
>
> for (n = 0; n < 8*sizeof (long); n++)
> for (i = 0; i < 1000000; i++)
> c = fns(buf[i], n);
> return 0;
> }
>
> You are also hitting the random number generator.
> It would be better to use a predictable sequence.
> Then you could, for instance, add up all the fns() results
> and check you get the expected value.
>
> With a really trivial 'RNG' (like step a CRC one bit) you
> could do it inside the loop and not nee a buffer at all.
>
Hi David,

I do think that conducting correctness testing here is a good idea.
However, we are about to change the return value of fns() from return
BITS_PER_LONG to return >= BITS_PER_LONG [1][2] when the nth bit is not
found. Therefore, using a fixed input series here and checking the sum
of return values may not accurately test it. Do you know if there are
any other more suitable testing methods?

[1]: https://lore.kernel.org/all/[email protected]/
[2]: https://lore.kernel.org/all/[email protected]/

Regards,
Kuan-Wei

> David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)
>

2024-05-05 20:40:06

by Yury Norov

[permalink] [raw]
Subject: Re: [PATCH v4 1/2] lib/test_bitops: Add benchmark test for fns()

On Mon, May 06, 2024 at 02:48:14AM +0800, Kuan-Wei Chiu wrote:
> On Sun, May 05, 2024 at 01:11:53PM +0000, David Laight wrote:
> > From: Yury Norov
> > > Sent: 01 May 2024 17:30
> > >
> > > On Wed, May 01, 2024 at 09:20:46PM +0800, Kuan-Wei Chiu wrote:
> > > > Introduce a benchmark test for the fns(). It measures the total time
> > > > taken by fns() to process 1,000,000 test data generated using
> > > > get_random_bytes() for each n in the range [0, BITS_PER_LONG).
> > > >
> > > > example:
> > > > test_bitops: fns: 5876762553 ns, 64000000 iterations
> > >
> > > So... 5 seconds for a test sounds too much. I see the following patch
> > > improves it dramatically, but in general let's stay in a range of
> > > milliseconds. On other machines it may run much slower and trigger
> > > a stall watchdog.
> > >
> > > > Signed-off-by: Kuan-Wei Chiu <[email protected]>
> > >
> > > Suggested-by: Yury Norov <[email protected]>
> > >
> > > > ---
> > > >
> > > > Changes in v4:
> > > > - Correct get_random_long() -> get_random_bytes() in the commit
> > > > message.
> > > >
> > > > lib/test_bitops.c | 22 ++++++++++++++++++++++
> > > > 1 file changed, 22 insertions(+)
> > > >
> > > > diff --git a/lib/test_bitops.c b/lib/test_bitops.c
> > > > index 3b7bcbee84db..ed939f124417 100644
> > > > --- a/lib/test_bitops.c
> > > > +++ b/lib/test_bitops.c
> > > > @@ -50,6 +50,26 @@ static unsigned long order_comb_long[][2] = {
> > > > };
> > > > #endif
> > > >
> > > > +static unsigned long buf[1000000];
> > >
> > > Can you make it __init, or allocate with kmalloc_array(), so that 64M
> > > of memory will not last forever in the kernel?
> > >
> > > > +static int __init test_fns(void)
> > > > +{
> > > > + unsigned int i, n;
> > > > + ktime_t time;
> > > > +
> > > > + get_random_bytes(buf, sizeof(buf));
> > > > + time = ktime_get();
> > > > +
> > > > + for (n = 0; n < BITS_PER_LONG; n++)
> > > > + for (i = 0; i < 1000000; i++)
> > > > + fns(buf[i], n);
> > >
> > > What concerns me here is that fns() is a in fact a const function, and
> > > the whole loop may be eliminated. Can you make sure it's not your case
> > > because 450x performance boost sounds a bit too much to me.
> > >
> > > You can declare a "static volatile __used __init" variable to assign
> > > the result of fns(), and ensure that the code is not eliminated
> >
> > Yep, without 'c' this compiler to 'return 0'.
> >
> > static inline unsigned long fns(unsigned long word, unsigned int n)
> > {
> > while (word && n--)
> > word &= word - 1;
> > return word ? __builtin_ffs(word) : 8 * sizeof (long);
> > }
> >
> > unsigned long buf[1000000];
> >
> > volatile int c;
> >
> > int test_fns(void)
> > {
> > unsigned int i, n;
> >
> > for (n = 0; n < 8*sizeof (long); n++)
> > for (i = 0; i < 1000000; i++)
> > c = fns(buf[i], n);
> > return 0;
> > }
> >
> > You are also hitting the random number generator.
> > It would be better to use a predictable sequence.
> > Then you could, for instance, add up all the fns() results
> > and check you get the expected value.
> >
> > With a really trivial 'RNG' (like step a CRC one bit) you
> > could do it inside the loop and not nee a buffer at all.
> >
> Hi David,
>
> I do think that conducting correctness testing here is a good idea.
> However, we are about to change the return value of fns() from return
> BITS_PER_LONG to return >= BITS_PER_LONG [1][2] when the nth bit is not
> found. Therefore, using a fixed input series here and checking the sum
> of return values may not accurately test it. Do you know if there are
> any other more suitable testing methods?

Hi David,

Let's do one thing in a time. test_fns() is about performance testing,
and that's it. Kuan-Wei wrote it specifically for that.

He also chose to use random numbers to feed the fns(), and that's a
reasonable choice. I see nothing wrong in his approach with the array,
as well as what you proposed. But the test is done, and it works. If
you think it's worth testing the function with your approach, feel
free to submit your test, I'll take it just as well, and we'll have
even better coverage.

Regarding summing up all the fns() returns to check correctness - I
think it's a wrong idea. At first because current behavior when we
return BITS_PER_LONG is not a contract, just implementation detail.
And even now, on 64- and 32-bit arches fns() will sum up to different
numbers.

Second, this summing up doesn't sound reliable. Imagine a case when
the function fails one time returning a less-by-one number, and in the
following run - a greater-by-one. So the sum will be correct, and the
test will not catch the error.

We're running a test_find_nth_bit(), and to me it's enough for testing
fns(). But if you'd like, please send a more specific test.

Thanks,
Yury

2024-05-05 21:50:37

by David Laight

[permalink] [raw]
Subject: RE: [PATCH v4 1/2] lib/test_bitops: Add benchmark test for fns()

..
> He also chose to use random numbers to feed the fns(), and that's a
> reasonable choice. I see nothing wrong in his approach with the array,
> as well as what you proposed. But the test is done, and it works. If
> you think it's worth testing the function with your approach, feel
> free to submit your test, I'll take it just as well, and we'll have
> even better coverage.

Accessing a very large array is going to be affected by cache line
and TLB misses.
It wont really be entirely affected by implementation of the
function itself.
This is especially relevant for something that does really apply
to large memory blocks.

OTOH you don't want to repeatedly call the version that used
a binary chop with constant data - the branch predictor
will learn the pattern instead of getting if wrong 50% of the time.

A compromise would be to repeatedly process a short random
buffer enough times to get something measurable.
Although it is perfectly possible to use a cycle counter
(but not the x86 TSC) to time single function calls and
get repeatable answers that can actually be compared to
calculated values.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)