2021-06-18 20:08:48

by Yury Norov

Subject: [PATCH 2/3] find: micro-optimize for_each_{set,clear}_bit()

The macros iterate through all set/clear bits in a bitmap. They find the
first bit using find_first_bit(), and the remaining bits using
find_next_bit().

Since find_next_bit() is called shortly after find_first_bit(), we can
save a few lines of I-cache by not using find_first_bit().

Signed-off-by: Yury Norov <[email protected]>
---
include/linux/find.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/find.h b/include/linux/find.h
index 4500e8ab93e2..ae9ed52b52b8 100644
--- a/include/linux/find.h
+++ b/include/linux/find.h
@@ -280,7 +280,7 @@ unsigned long find_next_bit_le(const void *addr, unsigned
#endif

#define for_each_set_bit(bit, addr, size) \
- for ((bit) = find_first_bit((addr), (size)); \
+ for ((bit) = find_next_bit((addr), (size), 0); \
(bit) < (size); \
(bit) = find_next_bit((addr), (size), (bit) + 1))

@@ -291,7 +291,7 @@ unsigned long find_next_bit_le(const void *addr, unsigned
(bit) = find_next_bit((addr), (size), (bit) + 1))

#define for_each_clear_bit(bit, addr, size) \
- for ((bit) = find_first_zero_bit((addr), (size)); \
+ for ((bit) = find_next_zero_bit((addr), (size), 0); \
(bit) < (size); \
(bit) = find_next_zero_bit((addr), (size), (bit) + 1))

--
2.30.2


2021-06-19 20:33:44

by Andy Shevchenko

Subject: Re: [PATCH 2/3] find: micro-optimize for_each_{set,clear}_bit()

On Fri, Jun 18, 2021 at 12:57:34PM -0700, Yury Norov wrote:
> The macros iterate through all set/clear bits in a bitmap. They find the
> first bit using find_first_bit(), and the remaining bits using
> find_next_bit().
>
> Since find_next_bit() is called shortly after find_first_bit(), we can
> save a few lines of I-cache by not using find_first_bit().

Any numbers available?

--
With Best Regards,
Andy Shevchenko


2021-06-19 20:48:49

by Marc Zyngier

Subject: Re: [PATCH 2/3] find: micro-optimize for_each_{set,clear}_bit()

On Fri, 18 Jun 2021 20:57:34 +0100,
Yury Norov <[email protected]> wrote:
>
> The macros iterate through all set/clear bits in a bitmap. They find the
> first bit using find_first_bit(), and the remaining bits using
> find_next_bit().
>
> Since find_next_bit() is called shortly after find_first_bit(), we can
> save a few lines of I-cache by not using find_first_bit().

Really?

>
> Signed-off-by: Yury Norov <[email protected]>
> ---
> include/linux/find.h | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/find.h b/include/linux/find.h
> index 4500e8ab93e2..ae9ed52b52b8 100644
> --- a/include/linux/find.h
> +++ b/include/linux/find.h
> @@ -280,7 +280,7 @@ unsigned long find_next_bit_le(const void *addr, unsigned
> #endif
>
> #define for_each_set_bit(bit, addr, size) \
> - for ((bit) = find_first_bit((addr), (size)); \
> + for ((bit) = find_next_bit((addr), (size), 0); \

On which architecture do you observe a gain? Only 32bit ARM and m68k
implement their own version of find_first_bit(), and everyone else
uses the canonical implementation:

#ifndef find_first_bit
#define find_first_bit(addr, size) find_next_bit((addr), (size), 0)
#endif

These architectures explicitly have different implementations for
find_first_bit() and find_next_bit() because they can do better
(whether that is true or not is another debate). I don't think you
should remove this optimisation until it has been measured on these
two architectures.

Thanks,

M.

--
Without deviation from the norm, progress is not possible.

2021-06-19 20:49:03

by Yury Norov

Subject: Re: [PATCH 2/3] find: micro-optimize for_each_{set,clear}_bit()

On Sat, Jun 19, 2021 at 05:24:15PM +0100, Marc Zyngier wrote:
> On Fri, 18 Jun 2021 20:57:34 +0100,
> Yury Norov <[email protected]> wrote:
> >
> > The macros iterate through all set/clear bits in a bitmap. They find the
> > first bit using find_first_bit(), and the remaining bits using
> > find_next_bit().
> >
> > Since find_next_bit() is called shortly after find_first_bit(), we can
> > save a few lines of I-cache by not using find_first_bit().
>
> Really?
>
> >
> > Signed-off-by: Yury Norov <[email protected]>
> > ---
> > include/linux/find.h | 4 ++--
> > 1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/find.h b/include/linux/find.h
> > index 4500e8ab93e2..ae9ed52b52b8 100644
> > --- a/include/linux/find.h
> > +++ b/include/linux/find.h
> > @@ -280,7 +280,7 @@ unsigned long find_next_bit_le(const void *addr, unsigned
> > #endif
> >
> > #define for_each_set_bit(bit, addr, size) \
> > - for ((bit) = find_first_bit((addr), (size)); \
> > + for ((bit) = find_next_bit((addr), (size), 0); \
>
> On which architecture do you observe a gain? Only 32bit ARM and m68k
> implement their own version of find_first_bit(), and everyone else
> uses the canonical implementation:

And those that enable GENERIC_FIND_FIRST_BIT: x86, arm64, arc, mips
and s390.

> #ifndef find_first_bit
> #define find_first_bit(addr, size) find_next_bit((addr), (size), 0)
> #endif
>
> These architectures explicitly have different implementations for
> find_first_bit() and find_next_bit() because they can do better
> (whether that is true or not is another debate). I don't think you
> should remove this optimisation until it has been measured on these
> two architectures.

This patch is based on a series that enables a separate implementation
of find_first_bit() for all architectures; according to my tests,
find_first* is roughly twice as fast as find_next* on arm64 and x86.

https://lore.kernel.org/lkml/[email protected]/T/#t

After applying the series, I noticed that my small kernel module that
calls for_each_set_bit() is now using find_first_bit() to find just
one bit, and find_next_bit() for all the others. I think it's better to
always use find_next_bit() in this case to minimize the chance of a
cache miss. But if it's not that obvious, I'll try to write a test.

2021-06-27 17:10:20

by Yury Norov

Subject: Re: [PATCH 2/3] find: micro-optimize for_each_{set,clear}_bit()

On Sat, Jun 19, 2021 at 10:28:07AM -0700, Yury Norov wrote:
> On Sat, Jun 19, 2021 at 05:24:15PM +0100, Marc Zyngier wrote:
> > On Fri, 18 Jun 2021 20:57:34 +0100,
> > Yury Norov <[email protected]> wrote:
> > >
> > > The macros iterate through all set/clear bits in a bitmap. They find the
> > > first bit using find_first_bit(), and the remaining bits using
> > > find_next_bit().
> > >
> > > Since find_next_bit() is called shortly after find_first_bit(), we can
> > > save a few lines of I-cache by not using find_first_bit().
> >
> > Really?
> >
> > >
> > > Signed-off-by: Yury Norov <[email protected]>
> > > ---
> > > include/linux/find.h | 4 ++--
> > > 1 file changed, 2 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/include/linux/find.h b/include/linux/find.h
> > > index 4500e8ab93e2..ae9ed52b52b8 100644
> > > --- a/include/linux/find.h
> > > +++ b/include/linux/find.h
> > > @@ -280,7 +280,7 @@ unsigned long find_next_bit_le(const void *addr, unsigned
> > > #endif
> > >
> > > #define for_each_set_bit(bit, addr, size) \
> > > - for ((bit) = find_first_bit((addr), (size)); \
> > > + for ((bit) = find_next_bit((addr), (size), 0); \
> >
> > On which architecture do you observe a gain? Only 32bit ARM and m68k
> > implement their own version of find_first_bit(), and everyone else
> > uses the canonical implementation:
>
> And those that enable GENERIC_FIND_FIRST_BIT: x86, arm64, arc, mips
> and s390.
>
> > #ifndef find_first_bit
> > #define find_first_bit(addr, size) find_next_bit((addr), (size), 0)
> > #endif
> >
> > These architectures explicitly have different implementations for
> > find_first_bit() and find_next_bit() because they can do better
> > (whether that is true or not is another debate). I don't think you
> > should remove this optimisation until it has been measured on these
> > two architectures.
>
> This patch is based on a series that enables a separate implementation
> of find_first_bit() for all architectures; according to my tests,
> find_first* is roughly twice as fast as find_next* on arm64 and x86.
>
> https://lore.kernel.org/lkml/[email protected]/T/#t
>
> After applying the series, I noticed that my small kernel module that
> calls for_each_set_bit() is now using find_first_bit() to find just
> one bit, and find_next_bit() for all the others. I think it's better to
> always use find_next_bit() in this case to minimize the chance of a
> cache miss. But if it's not that obvious, I'll try to write a test.

This test measures the difference between for_each_set_bit() and
for_each_set_bit_from().

diff --git a/lib/find_bit_benchmark.c b/lib/find_bit_benchmark.c
index 5637c5711db9..1f37e99090b0 100644
--- a/lib/find_bit_benchmark.c
+++ b/lib/find_bit_benchmark.c
@@ -111,6 +111,59 @@ static int __init test_find_next_and_bit(const void *bitmap,
return 0;
}

+#ifdef CONFIG_X86_64
+#define flush_cache_all() wbinvd()
+#endif
+
+static int __init test_for_each_set_bit(int flags)
+{
+#ifdef flush_cache_all
+ DECLARE_BITMAP(bm, BITS_PER_LONG * 2);
+ unsigned long i, cnt = 0;
+ ktime_t time;
+
+ bm[0] = 1; bm[1] = 0;
+
+ time = ktime_get();
+ while (cnt < 1000) {
+ if (flags)
+ flush_cache_all();
+ for_each_set_bit(i, bm, BITS_PER_LONG * 2)
+ cnt++;
+ }
+
+ time = ktime_get() - time;
+
+ pr_err("for_each_set_bit: %18llu ns, %6lu iterations\n", (u64)time, cnt);
+#endif
+ return 0;
+}
+
+static int __init test_for_each_set_bit_from(int flags)
+{
+#ifdef flush_cache_all
+ DECLARE_BITMAP(bm, BITS_PER_LONG * 2);
+ unsigned long i, cnt = 0;
+ ktime_t time;
+
+ bm[0] = 1; bm[1] = 0;
+
+ time = ktime_get();
+ while (cnt < 1000) {
+ if (flags)
+ flush_cache_all();
+ i = 0;
+ for_each_set_bit_from(i, bm, BITS_PER_LONG * 2)
+ cnt++;
+ }
+
+ time = ktime_get() - time;
+
+ pr_err("for_each_set_bit_from:%16llu ns, %6lu iterations\n", (u64)time, cnt);
+#endif
+ return 0;
+}
+
static int __init find_bit_test(void)
{
unsigned long nbits = BITMAP_LEN / SPARSE;
@@ -147,6 +200,16 @@ static int __init find_bit_test(void)
test_find_first_bit(bitmap, BITMAP_LEN);
test_find_next_and_bit(bitmap, bitmap2, BITMAP_LEN);

+ pr_err("\nStart testing for_each_bit()\n");
+
+ test_for_each_set_bit(0);
+ test_for_each_set_bit_from(0);
+
+ pr_err("\nStart testing for_each_bit() with cache flushing\n");
+
+ test_for_each_set_bit(1);
+ test_for_each_set_bit_from(1);
+
/*
* Everything is OK. Return error just to let user run benchmark
* again without annoying rmmod.

Here on each iteration:
- for_each_set_bit() calls find_first_bit() once, and find_next_bit() once.
- for_each_set_bit_from() calls find_next_bit() twice.

On my AMD Ryzen 7 4700U, the result is like this:

Start testing for_each_bit()
for_each_set_bit: 15296 ns, 1000 iterations
for_each_set_bit_from: 15225 ns, 1000 iterations

Start testing for_each_bit() with cache flushing
for_each_set_bit: 547626 ns, 1000 iterations
for_each_set_bit_from: 497899 ns, 1000 iterations

for_each_set_bit_from() is ~10% faster than for_each_set_bit() in
case of cold caches, and no significant difference was observed if
flush_cache_all() is not called.

So, it looks reasonable to switch for_each_set_bit() to use
find_next_bit() only.

Thanks,
Yury