2024-04-30 05:49:32

by Kuan-Wei Chiu

[permalink] [raw]
Subject: [PATCH v2 0/2] bitops: Optimize fns() for improved performance

Hello,

This patch series optimizes the fns() function by avoiding repeated
calls to __ffs(). Additionally, tests for fns() have been added in
lib/find_bit_benchmark.c.

Changes in v2:
- Add benchmark test for fns() in lib/find_bit_benchmark.c.
- Change the loop in fns() by counting down from n to 0.
- Add find_bit benchmark result for find_nth_bit in commit message.

Link to v1: https://lkml.kernel.org/[email protected]

Kuan-Wei Chiu (2):
lib/find_bit_benchmark: Add benchmark test for fns()
bitops: Optimize fns() for improved performance

include/linux/bitops.h | 12 +++---------
lib/find_bit_benchmark.c | 25 +++++++++++++++++++++++++
2 files changed, 28 insertions(+), 9 deletions(-)

--
2.34.1



2024-04-30 05:49:43

by Kuan-Wei Chiu

[permalink] [raw]
Subject: [PATCH v2 1/2] lib/find_bit_benchmark: Add benchmark test for fns()

Introduce a benchmark test for the fns(). It measures the total time
taken by fns() to process 1,000,000 test data generated using
get_random_long() for each n in the range [0, BITS_PER_LONG].

Signed-off-by: Kuan-Wei Chiu <[email protected]>
---
lib/find_bit_benchmark.c | 25 +++++++++++++++++++++++++
1 file changed, 25 insertions(+)

diff --git a/lib/find_bit_benchmark.c b/lib/find_bit_benchmark.c
index d3fb09e6eff1..8712eacf3bbd 100644
--- a/lib/find_bit_benchmark.c
+++ b/lib/find_bit_benchmark.c
@@ -146,6 +146,28 @@ static int __init test_find_next_and_bit(const void *bitmap,
return 0;
}

+static int __init test_fns(void)
+{
+ const unsigned long round = 1000000;
+ s64 time[BITS_PER_LONG + 1];
+ unsigned int i, n;
+ volatile unsigned long x, y;
+
+ for (n = 0; n <= BITS_PER_LONG; n++) {
+ time[n] = ktime_get();
+ for (i = 0; i < round; i++) {
+ x = get_random_long();
+ y = fns(x, n);
+ }
+ time[n] = ktime_get() - time[n];
+ }
+
+ for (n = 0; n <= BITS_PER_LONG; n++)
+ pr_err("fns: n = %2u: %12lld ns\n", n, time[n]);
+
+ return 0;
+}
+
static int __init find_bit_test(void)
{
unsigned long nbits = BITMAP_LEN / SPARSE;
@@ -186,6 +208,9 @@ static int __init find_bit_test(void)
test_find_first_and_bit(bitmap, bitmap2, BITMAP_LEN);
test_find_next_and_bit(bitmap, bitmap2, BITMAP_LEN);

+ pr_err("\nStart testing for fns()\n");
+ test_fns();
+
/*
* Everything is OK. Return error just to let user run benchmark
* again without annoying rmmod.
--
2.34.1


2024-04-30 05:49:59

by Kuan-Wei Chiu

[permalink] [raw]
Subject: [PATCH v2 2/2] bitops: Optimize fns() for improved performance

The current fns() repeatedly uses __ffs() to find the index of the
least significant bit and then clears the corresponding bit using
__clear_bit(). The method for clearing the least significant bit can be
optimized by using word &= word - 1 instead.

Typically, the execution time of one __ffs() plus one __clear_bit() is
longer than that of a bitwise AND operation and a subtraction. To
improve performance, the loop for clearing the least significant bit
has been replaced with word &= word - 1, followed by a single __ffs()
operation to obtain the answer. This change reduces the number of
__ffs() iterations from n to just one, enhancing overall performance.

This change speeds up the find_nth_bit() in the find_bit benchmark by
approximately ~1.26 times:

Before:
find_nth_bit: 4254313 ns, 16525 iterations

After:
find_nth_bit: 3362863 ns, 16501 iterations

The following microbenchmark data, conducted on my x86-64 machine,
shows the execution time (in microseconds) required for 1000000 test
data generated by get_random_u64() and executed by fns() under
different values of n:

+-----+---------------+---------------+
| n | time_old | time_new |
+-----+---------------+---------------+
| 0 | 29194 | 25878 |
| 1 | 25510 | 25497 |
| 2 | 27836 | 25721 |
| 3 | 30140 | 25673 |
| 4 | 32569 | 25426 |
| 5 | 34792 | 25690 |
| 6 | 37117 | 25651 |
| 7 | 39742 | 25383 |
| 8 | 42360 | 25657 |
| 9 | 44672 | 25897 |
| 10 | 47237 | 25819 |
| 11 | 49884 | 26530 |
| 12 | 51864 | 26647 |
| 13 | 54265 | 28915 |
| 14 | 56440 | 28373 |
| 15 | 58839 | 28616 |
| 16 | 62383 | 29128 |
| 17 | 64257 | 30041 |
| 18 | 66805 | 29773 |
| 19 | 69368 | 33203 |
| 20 | 72942 | 33688 |
| 21 | 77006 | 34518 |
| 22 | 80926 | 34298 |
| 23 | 85723 | 35586 |
| 24 | 90324 | 36376 |
| 25 | 95992 | 37465 |
| 26 | 101101 | 37599 |
| 27 | 106520 | 37466 |
| 28 | 113287 | 38163 |
| 29 | 120552 | 38810 |
| 30 | 128040 | 39373 |
| 31 | 135624 | 40500 |
| 32 | 142580 | 40343 |
| 33 | 148915 | 40460 |
| 34 | 154005 | 41294 |
| 35 | 157996 | 41730 |
| 36 | 160806 | 41523 |
| 37 | 162975 | 42088 |
| 38 | 163426 | 41530 |
| 39 | 164872 | 41789 |
| 40 | 164477 | 42505 |
| 41 | 164758 | 41879 |
| 42 | 164182 | 41415 |
| 43 | 164842 | 42119 |
| 44 | 164881 | 42297 |
| 45 | 164870 | 42145 |
| 46 | 164673 | 42066 |
| 47 | 164616 | 42051 |
| 48 | 165055 | 41902 |
| 49 | 164847 | 41862 |
| 50 | 165171 | 41960 |
| 51 | 164851 | 42089 |
| 52 | 164763 | 41717 |
| 53 | 164635 | 42154 |
| 54 | 164757 | 41983 |
| 55 | 165095 | 41419 |
| 56 | 164641 | 42381 |
| 57 | 164601 | 41654 |
| 58 | 164864 | 41834 |
| 59 | 164594 | 41920 |
| 60 | 165207 | 42020 |
| 61 | 165056 | 41185 |
| 62 | 165160 | 41722 |
| 63 | 164923 | 41702 |
| 64 | 164777 | 41880 |
+-----+---------------+---------------+

Signed-off-by: Kuan-Wei Chiu <[email protected]>
---

Changes in v2:
- Change the loop in fns() by counting down from n to 0.
- Add find_bit benchmark result for find_nth_bit in commit message.

include/linux/bitops.h | 12 +++---------
1 file changed, 3 insertions(+), 9 deletions(-)

diff --git a/include/linux/bitops.h b/include/linux/bitops.h
index 2ba557e067fe..57ecef354f47 100644
--- a/include/linux/bitops.h
+++ b/include/linux/bitops.h
@@ -254,16 +254,10 @@ static inline unsigned long __ffs64(u64 word)
*/
static inline unsigned long fns(unsigned long word, unsigned int n)
{
- unsigned int bit;
+ while (word && n--)
+ word &= word - 1;

- while (word) {
- bit = __ffs(word);
- if (n-- == 0)
- return bit;
- __clear_bit(bit, &word);
- }
-
- return BITS_PER_LONG;
+ return word ? __ffs(word) : BITS_PER_LONG;
}

/**
--
2.34.1


2024-04-30 06:11:56

by Kuan-Wei Chiu

[permalink] [raw]
Subject: Re: [PATCH v2 0/2] bitops: Optimize fns() for improved performance

On Tue, Apr 30, 2024 at 01:49:10PM +0800, Kuan-Wei Chiu wrote:
> Hello,
>
> This patch series optimizes the fns() function by avoiding repeated
> calls to __ffs(). Additionally, tests for fns() have been added in
> lib/find_bit_benchmark.c.
>
> Changes in v2:
> - Add benchmark test for fns() in lib/find_bit_benchmark.c.
> - Change the loop in fns() by counting down from n to 0.
> - Add find_bit benchmark result for find_nth_bit in commit message.
>
> Link to v1: https://lkml.kernel.org/[email protected]

Sorry for pasting the wrong link, the link to v1 should be:
https://lkml.kernel.org/[email protected]

Regards,
Kuan-Wei
>
> Kuan-Wei Chiu (2):
> lib/find_bit_benchmark: Add benchmark test for fns()
> bitops: Optimize fns() for improved performance
>
> include/linux/bitops.h | 12 +++---------
> lib/find_bit_benchmark.c | 25 +++++++++++++++++++++++++
> 2 files changed, 28 insertions(+), 9 deletions(-)
>
> --
> 2.34.1
>

2024-04-30 17:24:19

by Yury Norov

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] lib/find_bit_benchmark: Add benchmark test for fns()

On Tue, Apr 30, 2024 at 01:49:11PM +0800, Kuan-Wei Chiu wrote:
> Introduce a benchmark test for the fns(). It measures the total time
> taken by fns() to process 1,000,000 test data generated using
> get_random_long() for each n in the range [0, BITS_PER_LONG].

Can you also print an example of test output?

> Signed-off-by: Kuan-Wei Chiu <[email protected]>
> ---
> lib/find_bit_benchmark.c | 25 +++++++++++++++++++++++++
> 1 file changed, 25 insertions(+)
>
> diff --git a/lib/find_bit_benchmark.c b/lib/find_bit_benchmark.c
> index d3fb09e6eff1..8712eacf3bbd 100644
> --- a/lib/find_bit_benchmark.c
> +++ b/lib/find_bit_benchmark.c
> @@ -146,6 +146,28 @@ static int __init test_find_next_and_bit(const void *bitmap,
> return 0;
> }
>
> +static int __init test_fns(void)
> +{
> + const unsigned long round = 1000000;
> + s64 time[BITS_PER_LONG + 1];
> + unsigned int i, n;
> + volatile unsigned long x, y;
> +
> + for (n = 0; n <= BITS_PER_LONG; n++) {

n == BITS_PER_LONG is an error. Testing error case together with
normal cases is even worse error because it fools readers.

> + time[n] = ktime_get();
> + for (i = 0; i < round; i++) {
> + x = get_random_long();
> + y = fns(x, n);
> + }

Here you count fns() + get_random_long() time. For your microbench
purposes it would be better exclude a random number generation
overhead.

> + time[n] = ktime_get() - time[n];
> + }
> +
> + for (n = 0; n <= BITS_PER_LONG; n++)
> + pr_err("fns: n = %2u: %12lld ns\n", n, time[n]);

Nah, not like that. Each test in there prints one line in the
report. Let's keep it that way for test_fns() too. Unless we have
a strong evidence that fns() for a particular input is worth to be
tracked separately, let's just print a total gross?

> +
> + return 0;
> +}

I'd suggest to modify it like:

static unsigned long buf[1000000];

static int __init test_fns(void)
{
get_random_bytes(buf, ARRAY_SIZE(buf));
time = ktime_get();

for (n = 0; n < BITS_PER_LONG; n++)
for (i = 0; i < 1000000; i++)
fns(buf[i], n);

time = ktime_get() - time;
pr_err(...);
}

> static int __init find_bit_test(void)
> {
> unsigned long nbits = BITMAP_LEN / SPARSE;
> @@ -186,6 +208,9 @@ static int __init find_bit_test(void)
> test_find_first_and_bit(bitmap, bitmap2, BITMAP_LEN);
> test_find_next_and_bit(bitmap, bitmap2, BITMAP_LEN);
>
> + pr_err("\nStart testing for fns()\n");
> + test_fns();

There are 2 sections in the test - one for regular, and another for
sparse data. Adding a new section for a just one function doesn't look
like a good idea.

Even more, the fns() is already tested here. Maybe test_bitops is a
better place for this test?

> +
> /*
> * Everything is OK. Return error just to let user run benchmark
> * again without annoying rmmod.
> --
> 2.34.1

2024-04-30 17:36:49

by Yury Norov

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] bitops: Optimize fns() for improved performance

On Tue, Apr 30, 2024 at 01:49:12PM +0800, Kuan-Wei Chiu wrote:
> The current fns() repeatedly uses __ffs() to find the index of the
> least significant bit and then clears the corresponding bit using
> __clear_bit(). The method for clearing the least significant bit can be
> optimized by using word &= word - 1 instead.
>
> Typically, the execution time of one __ffs() plus one __clear_bit() is
> longer than that of a bitwise AND operation and a subtraction. To
> improve performance, the loop for clearing the least significant bit
> has been replaced with word &= word - 1, followed by a single __ffs()
> operation to obtain the answer. This change reduces the number of
> __ffs() iterations from n to just one, enhancing overall performance.
>
> This change speeds up the find_nth_bit() in the find_bit benchmark by
> approximately ~1.26 times:

26% sounds better.

> Before:
> find_nth_bit: 4254313 ns, 16525 iterations
>
> After:
> find_nth_bit: 3362863 ns, 16501 iterations
>
> The following microbenchmark data, conducted on my x86-64 machine,
> shows the execution time (in microseconds) required for 1000000 test
> data generated by get_random_u64() and executed by fns() under
> different values of n:
>
> +-----+---------------+---------------+
> | n | time_old | time_new |
> +-----+---------------+---------------+
> | 0 | 29194 | 25878 |
> | 1 | 25510 | 25497 |
> | 2 | 27836 | 25721 |
> | 3 | 30140 | 25673 |
> | 4 | 32569 | 25426 |
> | 5 | 34792 | 25690 |
> | 6 | 37117 | 25651 |
> | 7 | 39742 | 25383 |
> | 8 | 42360 | 25657 |
> | 9 | 44672 | 25897 |
> | 10 | 47237 | 25819 |
> | 11 | 49884 | 26530 |
> | 12 | 51864 | 26647 |
> | 13 | 54265 | 28915 |
> | 14 | 56440 | 28373 |
> | 15 | 58839 | 28616 |
> | 16 | 62383 | 29128 |
> | 17 | 64257 | 30041 |
> | 18 | 66805 | 29773 |
> | 19 | 69368 | 33203 |
> | 20 | 72942 | 33688 |
> | 21 | 77006 | 34518 |
> | 22 | 80926 | 34298 |
> | 23 | 85723 | 35586 |
> | 24 | 90324 | 36376 |
> | 25 | 95992 | 37465 |
> | 26 | 101101 | 37599 |
> | 27 | 106520 | 37466 |
> | 28 | 113287 | 38163 |
> | 29 | 120552 | 38810 |
> | 30 | 128040 | 39373 |
> | 31 | 135624 | 40500 |
> | 32 | 142580 | 40343 |
> | 33 | 148915 | 40460 |
> | 34 | 154005 | 41294 |
> | 35 | 157996 | 41730 |
> | 36 | 160806 | 41523 |
> | 37 | 162975 | 42088 |
> | 38 | 163426 | 41530 |
> | 39 | 164872 | 41789 |
> | 40 | 164477 | 42505 |
> | 41 | 164758 | 41879 |
> | 42 | 164182 | 41415 |
> | 43 | 164842 | 42119 |
> | 44 | 164881 | 42297 |
> | 45 | 164870 | 42145 |
> | 46 | 164673 | 42066 |
> | 47 | 164616 | 42051 |
> | 48 | 165055 | 41902 |
> | 49 | 164847 | 41862 |
> | 50 | 165171 | 41960 |
> | 51 | 164851 | 42089 |
> | 52 | 164763 | 41717 |
> | 53 | 164635 | 42154 |
> | 54 | 164757 | 41983 |
> | 55 | 165095 | 41419 |
> | 56 | 164641 | 42381 |
> | 57 | 164601 | 41654 |
> | 58 | 164864 | 41834 |
> | 59 | 164594 | 41920 |
> | 60 | 165207 | 42020 |
> | 61 | 165056 | 41185 |
> | 62 | 165160 | 41722 |
> | 63 | 164923 | 41702 |
> | 64 | 164777 | 41880 |
> +-----+---------------+---------------+

Please, just a total gross in the commit message.

> Signed-off-by: Kuan-Wei Chiu <[email protected]>
> ---
>
> Changes in v2:
> - Change the loop in fns() by counting down from n to 0.
> - Add find_bit benchmark result for find_nth_bit in commit message.
>
> include/linux/bitops.h | 12 +++---------
> 1 file changed, 3 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/bitops.h b/include/linux/bitops.h
> index 2ba557e067fe..57ecef354f47 100644
> --- a/include/linux/bitops.h
> +++ b/include/linux/bitops.h
> @@ -254,16 +254,10 @@ static inline unsigned long __ffs64(u64 word)
> */
> static inline unsigned long fns(unsigned long word, unsigned int n)
> {
> - unsigned int bit;
> + while (word && n--)
> + word &= word - 1;
>
> - while (word) {
> - bit = __ffs(word);
> - if (n-- == 0)
> - return bit;
> - __clear_bit(bit, &word);
> - }
> -
> - return BITS_PER_LONG;
> + return word ? __ffs(word) : BITS_PER_LONG;
> }
>
> /**
> --
> 2.34.1

2024-04-30 22:51:11

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] lib/find_bit_benchmark: Add benchmark test for fns()

Hi Kuan-Wei,

kernel test robot noticed the following build warnings:

[auto build test WARNING on linus/master]
[also build test WARNING on v6.9-rc6 next-20240430]
[cannot apply to akpm-mm/mm-everything akpm-mm/mm-nonmm-unstable]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/Kuan-Wei-Chiu/lib-find_bit_benchmark-Add-benchmark-test-for-fns/20240430-144241
base: linus/master
patch link: https://lore.kernel.org/r/20240430054912.124237-2-visitorckw%40gmail.com
patch subject: [PATCH v2 1/2] lib/find_bit_benchmark: Add benchmark test for fns()
config: i386-buildonly-randconfig-004-20240501 (https://download.01.org/0day-ci/archive/20240501/[email protected]/config)
compiler: gcc-11 (Ubuntu 11.4.0-4ubuntu1) 11.4.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240501/[email protected]/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/

All warnings (new ones prefixed by >>):

lib/find_bit_benchmark.c: In function 'test_fns':
>> lib/find_bit_benchmark.c:154:35: warning: variable 'y' set but not used [-Wunused-but-set-variable]
154 | volatile unsigned long x, y;
| ^


vim +/y +154 lib/find_bit_benchmark.c

148
149 static int __init test_fns(void)
150 {
151 const unsigned long round = 1000000;
152 s64 time[BITS_PER_LONG + 1];
153 unsigned int i, n;
> 154 volatile unsigned long x, y;
155
156 for (n = 0; n <= BITS_PER_LONG; n++) {
157 time[n] = ktime_get();
158 for (i = 0; i < round; i++) {
159 x = get_random_long();
160 y = fns(x, n);
161 }
162 time[n] = ktime_get() - time[n];
163 }
164
165 for (n = 0; n <= BITS_PER_LONG; n++)
166 pr_err("fns: n = %2u: %12lld ns\n", n, time[n]);
167
168 return 0;
169 }
170

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

2024-05-01 04:50:16

by Kuan-Wei Chiu

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] lib/find_bit_benchmark: Add benchmark test for fns()

On Tue, Apr 30, 2024 at 10:24:03AM -0700, Yury Norov wrote:
> On Tue, Apr 30, 2024 at 01:49:11PM +0800, Kuan-Wei Chiu wrote:
> > Introduce a benchmark test for the fns(). It measures the total time
> > taken by fns() to process 1,000,000 test data generated using
> > get_random_long() for each n in the range [0, BITS_PER_LONG].
>
> Can you also print an example of test output?
>
> > Signed-off-by: Kuan-Wei Chiu <[email protected]>
> > ---
> > lib/find_bit_benchmark.c | 25 +++++++++++++++++++++++++
> > 1 file changed, 25 insertions(+)
> >
> > diff --git a/lib/find_bit_benchmark.c b/lib/find_bit_benchmark.c
> > index d3fb09e6eff1..8712eacf3bbd 100644
> > --- a/lib/find_bit_benchmark.c
> > +++ b/lib/find_bit_benchmark.c
> > @@ -146,6 +146,28 @@ static int __init test_find_next_and_bit(const void *bitmap,
> > return 0;
> > }
> >
> > +static int __init test_fns(void)
> > +{
> > + const unsigned long round = 1000000;
> > + s64 time[BITS_PER_LONG + 1];
> > + unsigned int i, n;
> > + volatile unsigned long x, y;
> > +
> > + for (n = 0; n <= BITS_PER_LONG; n++) {
>
> n == BITS_PER_LONG is an error. Testing error case together with
> normal cases is even worse error because it fools readers.
>
My initial intention was to add a test for fns() always returning
BITS_PER_LONG. However, I agree that this is not a good idea and may
confuse readers.

> > + time[n] = ktime_get();
> > + for (i = 0; i < round; i++) {
> > + x = get_random_long();
> > + y = fns(x, n);
> > + }
>
> Here you count fns() + get_random_long() time. For your microbench
> purposes it would be better exclude a random number generation
> overhead.
>
> > + time[n] = ktime_get() - time[n];
> > + }
> > +
> > + for (n = 0; n <= BITS_PER_LONG; n++)
> > + pr_err("fns: n = %2u: %12lld ns\n", n, time[n]);
>
> Nah, not like that. Each test in there prints one line in the
> report. Let's keep it that way for test_fns() too. Unless we have
> a strong evidence that fns() for a particular input is worth to be
> tracked separately, let's just print a total gross?
>
> > +
> > + return 0;
> > +}
>
> I'd suggest to modify it like:
>
> static unsigned long buf[1000000];
>
> static int __init test_fns(void)
> {
> get_random_bytes(buf, ARRAY_SIZE(buf));

Instead of ARRAY_SIZE(buf), it should be sizeof(buf).

> time = ktime_get();
>
> for (n = 0; n < BITS_PER_LONG; n++)
> for (i = 0; i < 1000000; i++)
> fns(buf[i], n);
>
> time = ktime_get() - time;
> pr_err(...);
> }
>

That does seem like a better approach. I'll move it to lib/test_bitops
and send a v3 patch series.

Regards,
Kuan-Wei

> > static int __init find_bit_test(void)
> > {
> > unsigned long nbits = BITMAP_LEN / SPARSE;
> > @@ -186,6 +208,9 @@ static int __init find_bit_test(void)
> > test_find_first_and_bit(bitmap, bitmap2, BITMAP_LEN);
> > test_find_next_and_bit(bitmap, bitmap2, BITMAP_LEN);
> >
> > + pr_err("\nStart testing for fns()\n");
> > + test_fns();
>
> There are 2 sections in the test - one for regular, and another for
> sparse data. Adding a new section for a just one function doesn't look
> like a good idea.
>
> Even more, the fns() is already tested here. Maybe test_bitops is a
> better place for this test?
>
> > +
> > /*
> > * Everything is OK. Return error just to let user run benchmark
> > * again without annoying rmmod.
> > --
> > 2.34.1