Subject: [PATCH] x86: Change x86 to use generic find_next_bit

x86: Change x86 to use the generic find_next_bit implementation

The versions with inline assembly are in fact slower on the machines I
tested them on (in userspace) (Athlon XP 2800+, p4-like Xeon 2.8GHz, AMD
Opteron 270). The i386-version needed a fix similar to 06024f21 to avoid
crashing the benchmark.

Benchmark using: gcc -fomit-frame-pointer -Os. For each bitmap size
1...512, for each possible bitmap with one bit set, for each possible
offset: find the position of the first bit starting at offset. If you
follow ;). Times include setup of the bitmap and checking of the
results.

               Athlon     Xeon       Opteron 32/64bit
x86-specific:  0m3.692s   0m2.820s   0m3.196s / 0m2.480s
generic:       0m2.622s   0m1.662s   0m2.100s / 0m1.572s
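
A rough userspace sketch of the benchmark loop described above
(illustrative only, not the exact harness that produced these numbers;
it assumes find_next_bit() and BITS_PER_LONG are supplied by the code
under test, plus <string.h>/<stdlib.h>):

	static void bench(void)
	{
		unsigned long bitmap[512 / BITS_PER_LONG + 1];
		unsigned long size, bit, offset, found;

		for (size = 1; size <= 512; size++)
			for (bit = 0; bit < size; bit++) {
				memset(bitmap, 0, sizeof(bitmap));
				bitmap[bit / BITS_PER_LONG] |= 1UL << (bit % BITS_PER_LONG);
				for (offset = 0; offset <= size; offset++) {
					found = find_next_bit(bitmap, size, offset);
					/* generic semantics: bit if offset <= bit, else size */
					if (found != (offset <= bit ? bit : size))
						abort();
				}
			}
	}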

If the bitmap size is not a multiple of BITS_PER_LONG, and no set
(cleared) bit is found, find_next_bit (find_next_zero_bit) returns a
value outside of the range [0,size]. The generic version always returns
exactly size. The generic version also uses unsigned long everywhere,
while the x86 versions use a mishmash of int, unsigned (int), long and
unsigned long.
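
With the generic semantics a caller only needs a plain bound check, for
example (just a sketch, the identifiers are placeholders):

	bit = find_next_bit(bitmap, size, start);
	if (bit >= size)
		return;			/* no set bit in [start, size) */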

Using the generic version does give a slightly bigger kernel, though.

defconfig:        text    data     bss      dec     hex  filename
x86-specific:  4738555  481232  626688  5846475  5935cb  vmlinux (32 bit)
generic:       4738621  481232  626688  5846541  59360d  vmlinux (32 bit)
x86-specific:  5392395  846568  724424  6963387  6a40bb  vmlinux (64 bit)
generic:       5392458  846568  724424  6963450  6a40fa  vmlinux (64 bit)

arch/x86/Kconfig | 3 ++
arch/x86/lib/Makefile | 2 +-
arch/x86/lib/bitops_32.c | 70 -------------------------------------------
arch/x86/lib/bitops_64.c | 68 -----------------------------------------
include/asm-x86/bitops.h | 6 ++++
include/asm-x86/bitops_32.h | 16 ----------
include/asm-x86/bitops_64.h | 2 -
lib/find_next_bit.c | 2 +
8 files changed, 12 insertions(+), 157 deletions(-)

Signed-off-by: Alexander van Heukelum <[email protected]>

---

Hi,

I think it would be a good idea to use the generic versions of
find_next_bit and find_next_zero_bit. The i386 versions have
a bug, and the generic ones are faster (in my probably
not-so-representative micro-benchmark, that is).

Patch is against -x86#testing. It compiles.

Greetings,
Alexander

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index ab85a04..29a5a38 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -83,6 +83,9 @@ config GENERIC_BUG
def_bool y
depends on BUG

+config GENERIC_FIND_NEXT_BIT
+ def_bool y
+
config GENERIC_HWEIGHT
def_bool y

diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile
index 25df1c1..4360932 100644
--- a/arch/x86/lib/Makefile
+++ b/arch/x86/lib/Makefile
@@ -11,7 +11,7 @@ lib-y += memcpy_$(BITS).o
ifeq ($(CONFIG_X86_32),y)
lib-y += checksum_32.o
lib-y += strstr_32.o
- lib-y += bitops_32.o semaphore_32.o string_32.o
+ lib-y += semaphore_32.o string_32.o

lib-$(CONFIG_X86_USE_3DNOW) += mmx_32.o
else
diff --git a/arch/x86/lib/bitops_32.c b/arch/x86/lib/bitops_32.c
deleted file mode 100644
index b654404..0000000
--- a/arch/x86/lib/bitops_32.c
+++ /dev/null
@@ -1,70 +0,0 @@
-#include <linux/bitops.h>
-#include <linux/module.h>
-
-/**
- * find_next_bit - find the next set bit in a memory region
- * @addr: The address to base the search on
- * @offset: The bitnumber to start searching at
- * @size: The maximum size to search
- */
-int find_next_bit(const unsigned long *addr, int size, int offset)
-{
- const unsigned long *p = addr + (offset >> 5);
- int set = 0, bit = offset & 31, res;
-
- if (bit) {
- /*
- * Look for nonzero in the first 32 bits:
- */
- __asm__("bsfl %1,%0\n\t"
- "jne 1f\n\t"
- "movl $32, %0\n"
- "1:"
- : "=r" (set)
- : "r" (*p >> bit));
- if (set < (32 - bit))
- return set + offset;
- set = 32 - bit;
- p++;
- }
- /*
- * No set bit yet, search remaining full words for a bit
- */
- res = find_first_bit (p, size - 32 * (p - addr));
- return (offset + set + res);
-}
-EXPORT_SYMBOL(find_next_bit);
-
-/**
- * find_next_zero_bit - find the first zero bit in a memory region
- * @addr: The address to base the search on
- * @offset: The bitnumber to start searching at
- * @size: The maximum size to search
- */
-int find_next_zero_bit(const unsigned long *addr, int size, int offset)
-{
- const unsigned long *p = addr + (offset >> 5);
- int set = 0, bit = offset & 31, res;
-
- if (bit) {
- /*
- * Look for zero in the first 32 bits.
- */
- __asm__("bsfl %1,%0\n\t"
- "jne 1f\n\t"
- "movl $32, %0\n"
- "1:"
- : "=r" (set)
- : "r" (~(*p >> bit)));
- if (set < (32 - bit))
- return set + offset;
- set = 32 - bit;
- p++;
- }
- /*
- * No zero yet, search remaining full bytes for a zero
- */
- res = find_first_zero_bit(p, size - 32 * (p - addr));
- return (offset + set + res);
-}
-EXPORT_SYMBOL(find_next_zero_bit);
diff --git a/arch/x86/lib/bitops_64.c b/arch/x86/lib/bitops_64.c
index 0e8f491..0eeb704 100644
--- a/arch/x86/lib/bitops_64.c
+++ b/arch/x86/lib/bitops_64.c
@@ -1,9 +1,7 @@
#include <linux/bitops.h>

#undef find_first_zero_bit
-#undef find_next_zero_bit
#undef find_first_bit
-#undef find_next_bit

static inline long
__find_first_zero_bit(const unsigned long * addr, unsigned long size)
@@ -57,39 +55,6 @@ long find_first_zero_bit(const unsigned long * addr, unsigned long size)
return __find_first_zero_bit (addr, size);
}

-/**
- * find_next_zero_bit - find the next zero bit in a memory region
- * @addr: The address to base the search on
- * @offset: The bitnumber to start searching at
- * @size: The maximum size to search
- */
-long find_next_zero_bit (const unsigned long * addr, long size, long offset)
-{
- const unsigned long * p = addr + (offset >> 6);
- unsigned long set = 0;
- unsigned long res, bit = offset&63;
-
- if (bit) {
- /*
- * Look for zero in first word
- */
- asm("bsfq %1,%0\n\t"
- "cmoveq %2,%0"
- : "=r" (set)
- : "r" (~(*p >> bit)), "r"(64L));
- if (set < (64 - bit))
- return set + offset;
- set = 64 - bit;
- p++;
- }
- /*
- * No zero yet, search remaining full words for a zero
- */
- res = __find_first_zero_bit (p, size - 64 * (p - addr));
-
- return (offset + set + res);
-}
-
static inline long
__find_first_bit(const unsigned long * addr, unsigned long size)
{
@@ -136,40 +101,7 @@ long find_first_bit(const unsigned long * addr, unsigned long size)
return __find_first_bit(addr,size);
}

-/**
- * find_next_bit - find the first set bit in a memory region
- * @addr: The address to base the search on
- * @offset: The bitnumber to start searching at
- * @size: The maximum size to search
- */
-long find_next_bit(const unsigned long * addr, long size, long offset)
-{
- const unsigned long * p = addr + (offset >> 6);
- unsigned long set = 0, bit = offset & 63, res;
-
- if (bit) {
- /*
- * Look for nonzero in the first 64 bits:
- */
- asm("bsfq %1,%0\n\t"
- "cmoveq %2,%0\n\t"
- : "=r" (set)
- : "r" (*p >> bit), "r" (64L));
- if (set < (64 - bit))
- return set + offset;
- set = 64 - bit;
- p++;
- }
- /*
- * No set bit yet, search remaining full words for a bit
- */
- res = __find_first_bit (p, size - 64 * (p - addr));
- return (offset + set + res);
-}
-
#include <linux/module.h>

-EXPORT_SYMBOL(find_next_bit);
EXPORT_SYMBOL(find_first_bit);
EXPORT_SYMBOL(find_first_zero_bit);
-EXPORT_SYMBOL(find_next_zero_bit);
diff --git a/include/asm-x86/bitops.h b/include/asm-x86/bitops.h
index 1a23ce1..7fbf93a 100644
--- a/include/asm-x86/bitops.h
+++ b/include/asm-x86/bitops.h
@@ -312,6 +312,12 @@ static int test_bit(int nr, const volatile unsigned long *addr);

#undef ADDR

+unsigned long find_next_bit(const unsigned long *addr,
+ unsigned long size, unsigned long offset);
+unsigned long find_next_zero_bit(const unsigned long *addr,
+ unsigned long size, unsigned long offset);
+
+
#ifdef CONFIG_X86_32
# include "bitops_32.h"
#else
diff --git a/include/asm-x86/bitops_32.h b/include/asm-x86/bitops_32.h
index e4d75fc..570f0fa 100644
--- a/include/asm-x86/bitops_32.h
+++ b/include/asm-x86/bitops_32.h
@@ -38,14 +38,6 @@ static inline int find_first_zero_bit(const unsigned long *addr, unsigned size)
}

/**
- * find_next_zero_bit - find the first zero bit in a memory region
- * @addr: The address to base the search on
- * @offset: The bit number to start searching at
- * @size: The maximum size to search
- */
-int find_next_zero_bit(const unsigned long *addr, int size, int offset);
-
-/**
* __ffs - find first bit in word.
* @word: The word to search
*
@@ -81,14 +73,6 @@ static inline unsigned find_first_bit(const unsigned long *addr, unsigned size)
}

/**
- * find_next_bit - find the first set bit in a memory region
- * @addr: The address to base the search on
- * @offset: The bit number to start searching at
- * @size: The maximum size to search
- */
-int find_next_bit(const unsigned long *addr, int size, int offset);
-
-/**
* ffz - find first zero in word.
* @word: The word to search
*
diff --git a/include/asm-x86/bitops_64.h b/include/asm-x86/bitops_64.h
index aaf1519..40bf0f8 100644
--- a/include/asm-x86/bitops_64.h
+++ b/include/asm-x86/bitops_64.h
@@ -6,9 +6,7 @@
*/

extern long find_first_zero_bit(const unsigned long *addr, unsigned long size);
-extern long find_next_zero_bit(const unsigned long *addr, long size, long offset);
extern long find_first_bit(const unsigned long *addr, unsigned long size);
-extern long find_next_bit(const unsigned long *addr, long size, long offset);

/* return index of first bet set in val or max when no bit is set */
static inline long __scanbit(unsigned long val, unsigned long max)
diff --git a/lib/find_next_bit.c b/lib/find_next_bit.c
index 78ccd73..5820e07 100644
--- a/lib/find_next_bit.c
+++ b/lib/find_next_bit.c
@@ -15,6 +15,8 @@
#include <asm/byteorder.h>

#define BITOP_WORD(nr) ((nr) / BITS_PER_LONG)
+#undef find_next_bit
+#undef find_next_zero_bit

/**
* find_next_bit - find the next set bit in a memory region


2008-03-09 20:10:44

by Ingo Molnar

Subject: Re: [PATCH] x86: Change x86 to use generic find_next_bit


* Alexander van Heukelum <[email protected]> wrote:

> x86: Change x86 to use the generic find_next_bit implementation
>
> The versions with inline assembly are in fact slower on the machines I
> tested them on (in userspace) (Athlon XP 2800+, p4-like Xeon 2.8GHz,
> AMD Opteron 270). The i386-version needed a fix similar to 06024f21 to
> avoid crashing the benchmark.
>
> Benchmark using: gcc -fomit-frame-pointer -Os. For each bitmap size
> 1...512, for each possible bitmap with one bit set, for each possible
> offset: find the position of the first bit starting at offset. If you
> follow ;). Times include setup of the bitmap and checking of the
> results.
>
> Athlon Xeon Opteron 32/64bit
> x86-specific: 0m3.692s 0m2.820s 0m3.196s / 0m2.480s
> generic: 0m2.622s 0m1.662s 0m2.100s / 0m1.572s

ok, that's rather convincing.

the generic version in lib/find_next_bit.c is open-coded C which gcc can
optimize pretty nicely.

the hand-coded assembly versions in arch/x86/lib/bitops_32.c mostly use
the special x86 'bit scan forward' (BSF) instruction - which, as i know
from the days when the scheduler relied on it, has some non-trivial setup
costs. So especially when _small_ bitmasks are involved, it's more
expensive.

> If the bitmap size is not a multiple of BITS_PER_LONG, and no set
> (cleared) bit is found, find_next_bit (find_next_zero_bit) returns a
> value outside of the range [0,size]. The generic version always
> returns exactly size. The generic version also uses unsigned long
> everywhere, while the x86 versions use a mishmash of int, unsigned
> (int), long and unsigned long.

i'm not surprised that the hand-coded assembly versions had a bug ...

[ this means we have to test it quite carefully though, as lots of code
only ever gets tested on x86 so code could have built dependency on
the buggy behavior. ]

> Using the generic version does give a slightly bigger kernel, though.
>
> defconfig: text data bss dec hex filename
> x86-specific: 4738555 481232 626688 5846475 5935cb vmlinux (32 bit)
> generic: 4738621 481232 626688 5846541 59360d vmlinux (32 bit)
> x86-specific: 5392395 846568 724424 6963387 6a40bb vmlinux (64 bit)
> generic: 5392458 846568 724424 6963450 6a40fa vmlinux (64 bit)

i'd not worry about that too much. Have you tried to build with:

CONFIG_CC_OPTIMIZE_FOR_SIZE=y
CONFIG_OPTIMIZE_INLINING=y

(the latter only available in x86.git)

> Patch is against -x86#testing. It compiles.

i've picked it up into x86.git, lets see how it goes in practice.

Ingo

2008-03-09 20:12:17

by Ingo Molnar

Subject: Re: [PATCH] x86: Change x86 to use generic find_next_bit


* Alexander van Heukelum <[email protected]> wrote:

> --- a/lib/find_next_bit.c
> +++ b/lib/find_next_bit.c
> @@ -15,6 +15,8 @@
> #include <asm/byteorder.h>
>
> #define BITOP_WORD(nr) ((nr) / BITS_PER_LONG)
> +#undef find_next_bit
> +#undef find_next_zero_bit

this bit looks weird - did you need it for testing?

Ingo

2008-03-09 20:28:43

by Andi Kleen

Subject: Re: [PATCH] x86: Change x86 to use generic find_next_bit

Alexander van Heukelum <[email protected]> writes:
>
> If the bitmap size is not a multiple of BITS_PER_LONG, and no set
> (cleared) bit is found, find_next_bit (find_next_zero_bit) returns a
> value outside of the range [0,size]. The generic version always returns
> exactly size.

With that change it is likely possible to remove the min(NR_CPUS, ...)
clamp in lib/cpumask.c's __first/__next_cpu. iirc it was just a
workaround for the x86 quirk. I suspect with such a change it would be
possible to inline those again, possibly speeding up some loops that do
cpumask walking (iirc the scheduler used to do that frequently).
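
(The workaround meant here is, roughly, the clamp in lib/cpumask.c --
quoted from memory, so treat it as a sketch rather than the exact
source:

	int __next_cpu(int n, const cpumask_t *srcp)
	{
		return min_t(int, NR_CPUS,
			     find_next_bit(srcp->bits, NR_CPUS, n + 1));
	}

Once find_next_bit is guaranteed to return at most NR_CPUS, the min_t()
becomes unnecessary.)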

-Andi

2008-03-09 20:31:54

by Alexander van Heukelum

Subject: Re: [PATCH] x86: Change x86 to use generic find_next_bit


On Sun, 9 Mar 2008 21:11:52 +0100, "Ingo Molnar" <[email protected]> said:
> * Alexander van Heukelum <[email protected]> wrote:
> > --- a/lib/find_next_bit.c
> > +++ b/lib/find_next_bit.c
> > @@ -15,6 +15,8 @@
> > #include <asm/byteorder.h>
> >
> > #define BITOP_WORD(nr) ((nr) / BITS_PER_LONG)
> > +#undef find_next_bit
> > +#undef find_next_zero_bit
>
> this bit looks weird - did you need it for testing?

Worse, it's needed to get x86_64 to compile.

They are defined in include/asm-x86/bitops_64.h (which gets
included). They are used to optimize the case where the
bitmap size is known at compile time and not larger than
BITS_PER_LONG. Undeffing them here is the easiest way to get
things to compile.

Greetings,
Alexander

--
http://www.fastmail.fm - The way an email service should be

2008-03-09 20:51:55

by Ingo Molnar

Subject: Re: [PATCH] x86: Change x86 to use generic find_next_bit


* Alexander van Heukelum <[email protected]> wrote:

> > > #define BITOP_WORD(nr) ((nr) / BITS_PER_LONG)
> > > +#undef find_next_bit
> > > +#undef find_next_zero_bit
> >
> > this bit looks weird - did you need it for testing?
>
> Worse, it's needed to get x86_64 to compile.
>
> They are defined in include/asm-x86/bitops_64.h (which gets included).
> They are used to optimize the case where the bitmap size is known at
> compile time and not larger than BITS_PER_LONG. Undeffing them here is
> the easiest way to get things to compile, here.

ok - but this needs to be solved in a cleaner way. That build-time
optimization needs to be pushed into generic code so that 32-bit x86 and
other architectures can make use of it as well. The lib/find_next_bit.c
functions should be named __find_next_bit() or so.

Ingo

2008-03-09 21:03:54

by Andi Kleen

Subject: Re: [PATCH] x86: Change x86 to use generic find_next_bit

Ingo Molnar <[email protected]> writes:
>
> the generic version in lib/find_next_bit.c is open-coded C which gcc can
> optimize pretty nicely.
>
> the hand-coded assembly versions in arch/x86/lib/bitops_32.c mostly use
> the special x86 'bit search forward' (BSF) instruction - which i know
> from the days when the scheduler relied on it has some non-trivial setup

~14 cycles on K8 for memory, but if you stay in a register it is 8 cycles

> costs. So especially when there's _small_ bitmasks involved, it's more
> expensive.

I had a patchkit some time ago to special-case the max_bit <= 63 case
and always use directly inlined, streamlined, single-instruction
assembler for that. There was still some issue and I dropped it then,
but doing something like that still makes sense. Even if the BSF
is slightly slower than the open-coded version, just getting rid
of the CALL will make it a win, and the value could also be kept in a
register so you get the 8-cycle variant (which I doubt you can beat
open coded).
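
(A sketch of the kind of inlined single-instruction variant meant here,
along the lines of the existing __scanbit -- not the exact patchkit
code:

	static inline long scanbit64(unsigned long val, unsigned long max)
	{
		asm("bsfq %1,%0\n\t"
		    "cmoveq %2,%0"
		    : "=&r" (val) : "r" (val), "r" (max));
		return val;
	}

For a constant size <= 64 the whole find_next_bit() call then collapses
to a shift of the input word plus this.)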

The result would be that a standard for_each_cpu () in a NR_CPUS <= 64
kernel wouldn't have any unnecessary calls.

In general the problem of walking cpu masks is quite different
from searching ext2 bitmaps, so they likely should be special cased
and optimized for each.

-Andi

2008-03-09 21:13:26

by Alexander van Heukelum

Subject: Re: [PATCH] x86: Change x86 to use generic find_next_bit

On Sun, 9 Mar 2008 21:10:16 +0100, "Ingo Molnar" <[email protected]> said:
> > Athlon Xeon Opteron 32/64bit
> > x86-specific: 0m3.692s 0m2.820s 0m3.196s / 0m2.480s
> > generic: 0m2.622s 0m1.662s 0m2.100s / 0m1.572s
>
> ok, that's rather convincing.
>
> the generic version in lib/find_next_bit.c is open-coded C which gcc can
> optimize pretty nicely.
>
> the hand-coded assembly versions in arch/x86/lib/bitops_32.c mostly use
> the special x86 'bit search forward' (BSF) instruction - which i know
> from the days when the scheduler relied on it has some non-trivial setup
> costs. So especially when there's _small_ bitmasks involved, it's more
> expensive.

Hi,

BSF is fine, it doesn't need any special setup. The problem is probably
that the old versions use find_first_bit and find_first_zero_bit,
which are also hand optimized versions... and they use "repe scasl/q".
That's another little project ;).
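
(For reference: the BSF-based bit-scan helpers really are tiny --
roughly like this sketch of an __ffs-style helper, simplified, not the
exact kernel implementation:

	static inline unsigned long my_ffs(unsigned long word)
	{
		asm("bsf %1,%0" : "=r" (word) : "rm" (word));
		return word;	/* undefined if word == 0 */
	}

Whatever cost there is comes from the latency of the instruction
itself, not from any surrounding setup code.)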

> > If the bitmap size is not a multiple of BITS_PER_LONG, and no set
> > (cleared) bit is found, find_next_bit (find_next_zero_bit) returns a
> > value outside of the range [0,size]. The generic version always
> > returns exactly size. The generic version also uses unsigned long
> > everywhere, while the x86 versions use a mishmash of int, unsigned
> > (int), long and unsigned long.
>
> i'm not surprised that the hand-coded assembly versions had a bug ...

Not surprised about the bug, but it was in fact noticed, and fixed
in x86_64!

> [ this means we have to test it quite carefully though, as lots of code
> only ever gets tested on x86 so code could have built dependency on
> the buggy behavior. ]

Agreed.

> > Using the generic version does give a slightly bigger kernel, though.
> >
> > defconfig: text data bss dec hex filename
> > x86-specific: 4738555 481232 626688 5846475 5935cb vmlinux (32 bit)
> > generic: 4738621 481232 626688 5846541 59360d vmlinux (32 bit)
> > x86-specific: 5392395 846568 724424 6963387 6a40bb vmlinux (64 bit)
> > generic: 5392458 846568 724424 6963450 6a40fa vmlinux (64 bit)
>
> i'd not worry about that too much. Have you tried to build with:

I don't worry about it, but I needed to compile something to test the
build anyhow ;)

> CONFIG_CC_OPTIMIZE_FOR_SIZE=y
> CONFIG_OPTIMIZE_INLINING=y

This was defconfig in -x86#testing, they were both already enabled.
Here is what you get with those options turned off ;).

                  text    data     bss      dec     hex  filename
x86-specific:  5543996  481232  626688  6651916  65800c  vmlinux (32 bit)
generic:       5543880  481232  626688  6651800  657f98  vmlinux (32 bit)
x86-specific:  6111834  846568  724424  7682826  753b0a  vmlinux (64 bit)
generic:       6111882  846568  724424  7682874  753b3a  vmlinux (64 bit)

(and I double-checked the i386 results)

> (the latter only available in x86.git)
>
> > Patch is against -x86#testing. It compiles.
>
> i've picked it up into x86.git, lets see how it goes in practice.

Thanks,
Alexander

> Ingo
--
Alexander van Heukelum
[email protected]

--
http://www.fastmail.fm - And now for something completely different?

2008-03-09 21:29:50

by Andi Kleen

Subject: Re: [PATCH] x86: Change x86 to use generic find_next_bit

Ingo Molnar <[email protected]> writes:
>
> ok - but this needs to be solved in a cleaner way. That build-time
> optimization needs to be pushed into generic code so that 32-bit x86 and

It probably doesn't make sense in generic code because it will be
too large open coded.

-Andi

2008-03-09 21:32:00

by Andi Kleen

Subject: Re: [PATCH] x86: Change x86 to use generic find_next_bit

Alexander van Heukelum <[email protected]> writes:
>
> Benchmark using: gcc -fomit-frame-pointer -Os. For each bitmap size
> 1...512, for each possible bitmap with one bit set, for each possible
> offset: find the position of the first bit starting at offset. If you
> follow ;). Times include setup of the bitmap and checking of the
> results.

BTW another comment: it would make far more sense if you did
some profiling on what bitmap sizes a real kernel uses
(should be easy enough with some systemtap) and benchmarked
only that.

I doubt 1 bit bitmap searches are common for example ...

-Andi

2008-03-09 21:32:53

by Andi Kleen

Subject: Re: [PATCH] x86: Change x86 to use generic find_next_bit

Andi Kleen <[email protected]> writes:
>
> I had a patchkit some time ago to special case the max_bit <= 63 case

... never mind ... the patchkit is actually included. I had only
dropped it at some point, but fixed it later.

-Andi

2008-03-10 06:30:11

by Ingo Molnar

Subject: Re: [PATCH] x86: Change x86 to use generic find_next_bit


* Alexander van Heukelum <[email protected]> wrote:

> > ok, that's rather convincing.
> >
> > the generic version in lib/find_next_bit.c is open-coded C which gcc
> > can optimize pretty nicely.
> >
> > the hand-coded assembly versions in arch/x86/lib/bitops_32.c mostly
> > use the special x86 'bit search forward' (BSF) instruction - which i
> > know from the days when the scheduler relied on it has some
> > non-trivial setup costs. So especially when there's _small_ bitmasks
> > involved, it's more expensive.
>
> Hi,
>
> BSF is fine, it doesn't need any special setup. [...]

under "setup costs" i mean cycles spent by the CPU itself - the
instruction itself is simple (of course) and needs no setup. If you look
at BSF performance you'll see that it has nontrivial overhead.

Ingo

Subject: [RFC/PATCH] x86: Optimize find_next_(zero_)bit for small constant-size bitmaps

x86: Optimize find_next_(zero_)bit for small constant-size bitmaps

NOTE: This is not well tested. I'm also quite unsure if this makes the
pre-processor situation any better.

On an i386 defconfig (the x86#testing one), the size of vmlinux hardly
changes with this applied. I have observed only four places where this
optimization avoids a call into find_next_bit:

In the functions return_unused_surplus_pages, alloc_fresh_huge_page,
and adjust_pool_surplus, this patch avoids a call for a 1-bit bitmap.
In __next_cpu a call is avoided for a 32-bit bitmap. That's it.

On x86_64 I observed the following (I did not look closely yet, though).

Current #testing defconfig:
146 x bsf, 27 x find_next_*bit
text data bss dec hex filename
5392637 846592 724424 6963653 6a41c5 vmlinux

After removing the x86_64 specific optimization for find_next_*bit:
94 x bsf, 79 x find_next_*bit
text data bss dec hex filename
5392358 846592 724424 6963374 6a40ae vmlinux

After this patch (making the optimization generic):
146 x bsf, 27 x find_next_*bit
text data bss dec hex filename
5392396 846592 724424 6963412 6a40d4 vmlinux

Signed-off-by: Alexander van Heukelum <[email protected]>

---

> ok - but this needs to be solved in a cleaner way. That build-time
> optimization needs to be pushed into generic code so that 32-bit x86 and
> other architectures can make use of it as well. The lib/find_next_bit.c
> functions should be named __find_next_bit() or so.
>
> Ingo

I think this is what you had in mind? It seems to work.

Alternative implementation ideas? Comments? In particular: does
this break non-x86?

Greetings,
Alexander

include/asm-x86/bitops.h | 4 +-
include/asm-x86/bitops_64.h | 10 -------
include/linux/bitops.h | 60 +++++++++++++++++++++++++++++++++++++++++++
lib/find_next_bit.c | 10 +++----
4 files changed, 66 insertions(+), 18 deletions(-)

diff --git a/include/asm-x86/bitops.h b/include/asm-x86/bitops.h
index 7fbf93a..cbbe329 100644
--- a/include/asm-x86/bitops.h
+++ b/include/asm-x86/bitops.h
@@ -312,9 +312,9 @@ static int test_bit(int nr, const volatile unsigned long *addr);

#undef ADDR

-unsigned long find_next_bit(const unsigned long *addr,
+unsigned long __find_next_bit(const unsigned long *addr,
unsigned long size, unsigned long offset);
-unsigned long find_next_zero_bit(const unsigned long *addr,
+unsigned long __find_next_zero_bit(const unsigned long *addr,
unsigned long size, unsigned long offset);


diff --git a/include/asm-x86/bitops_64.h b/include/asm-x86/bitops_64.h
index 40bf0f8..87e1a17 100644
--- a/include/asm-x86/bitops_64.h
+++ b/include/asm-x86/bitops_64.h
@@ -20,21 +20,11 @@ static inline long __scanbit(unsigned long val, unsigned long max)
(__scanbit(*(unsigned long *)addr,(size))) : \
find_first_bit(addr,size)))

-#define find_next_bit(addr,size,off) \
-((__builtin_constant_p(size) && (size) <= BITS_PER_LONG ? \
- ((off) + (__scanbit((*(unsigned long *)addr) >> (off),(size)-(off)))) : \
- find_next_bit(addr,size,off)))
-
#define find_first_zero_bit(addr,size) \
((__builtin_constant_p(size) && (size) <= BITS_PER_LONG ? \
(__scanbit(~*(unsigned long *)addr,(size))) : \
find_first_zero_bit(addr,size)))

-#define find_next_zero_bit(addr,size,off) \
-((__builtin_constant_p(size) && (size) <= BITS_PER_LONG ? \
- ((off)+(__scanbit(~(((*(unsigned long *)addr)) >> (off)),(size)-(off)))) : \
- find_next_zero_bit(addr,size,off)))
-
static inline void set_bit_string(unsigned long *bitmap, unsigned long i,
int len)
{
diff --git a/include/linux/bitops.h b/include/linux/bitops.h
index 69c1edb..ba78ca1 100644
--- a/include/linux/bitops.h
+++ b/include/linux/bitops.h
@@ -72,4 +72,64 @@ static inline unsigned fls_long(unsigned long l)
return fls64(l);
}

+#ifdef __KERNEL__
+#ifdef CONFIG_GENERIC_FIND_NEXT_BIT
+unsigned long __find_next_bit(const unsigned long *addr,
+ unsigned long size, unsigned long offset);
+
+static __always_inline unsigned long
+find_next_bit(const unsigned long *addr, unsigned long size,
+ unsigned long offset)
+{
+ unsigned long value;
+
+ /* Here we would like to use a version of ffs that */
+ /* returns BITS_PER_LONG if its argument is zero. */
+ if (__builtin_constant_p(size) && (size == BITS_PER_LONG)) {
+ value = (*addr) & ((~0ul) << offset);
+ return (value == 0) ? BITS_PER_LONG : __ffs(value);
+ }
+
+ /* Less than BITS_PER_LONG: put in sentinel, then __ffs */
+ /* Here we would like to have a __set_bit/__ffs combo */
+ if (__builtin_constant_p(size) && (size < BITS_PER_LONG)) {
+ value = (*addr) & ((~0ul) << offset);
+ value |= (1ul << size);
+ return __ffs(value);
+ }
+
+ /* size not constant or too big */
+ return __find_next_bit(addr, size, offset);
+}
+
+unsigned long __find_next_zero_bit(const unsigned long *addr,
+ unsigned long size, unsigned long offset);
+
+static __always_inline unsigned long
+find_next_zero_bit(const unsigned long *addr, unsigned long size,
+ unsigned long offset)
+{
+ unsigned long value;
+
+ /* Here we would like to use a version of ffs that */
+ /* returns BITS_PER_LONG if its argument is zero. */
+ if (__builtin_constant_p(size) && (size == BITS_PER_LONG)) {
+ value = (~(*addr)) & ((~0ul) << offset);
+ return (value == 0) ? BITS_PER_LONG : __ffs(value);
+ }
+
+ /* Less than BITS_PER_LONG: put in sentinel, then __ffs */
+ /* Here we would like to have a __set_bit/__ffs combo. */
+ if (__builtin_constant_p(size) && (size < BITS_PER_LONG)) {
+ value = (~(*addr)) & ((~0ul) << offset);
+ value |= (1ul << size);
+ return __ffs(value);
+ }
+
+ /* size not constant or too big */
+ return __find_next_zero_bit(addr, size, offset);
+}
+
+#endif /* CONFIG_GENERIC_FIND_NEXT_BIT */
+#endif /* __KERNEL__ */
#endif
diff --git a/lib/find_next_bit.c b/lib/find_next_bit.c
index 5820e07..5249f4a 100644
--- a/lib/find_next_bit.c
+++ b/lib/find_next_bit.c
@@ -15,8 +15,6 @@
#include <asm/byteorder.h>

#define BITOP_WORD(nr) ((nr) / BITS_PER_LONG)
-#undef find_next_bit
-#undef find_next_zero_bit

/**
* find_next_bit - find the next set bit in a memory region
@@ -24,8 +22,8 @@
* @offset: The bitnumber to start searching at
* @size: The maximum size to search
*/
-unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
- unsigned long offset)
+unsigned long __find_next_bit(const unsigned long *addr,
+ unsigned long size, unsigned long offset)
{
const unsigned long *p = addr + BITOP_WORD(offset);
unsigned long result = offset & ~(BITS_PER_LONG-1);
@@ -69,8 +67,8 @@ EXPORT_SYMBOL(find_next_bit);
* This implementation of find_{first,next}_zero_bit was stolen from
* Linus' asm-alpha/bitops.h.
*/
-unsigned long find_next_zero_bit(const unsigned long *addr, unsigned long size,
- unsigned long offset)
+unsigned long __find_next_zero_bit(const unsigned long *addr,
+ unsigned long size, unsigned long offset)
{
const unsigned long *p = addr + BITOP_WORD(offset);
unsigned long result = offset & ~(BITS_PER_LONG-1);

2008-03-11 09:57:00

by Ingo Molnar

Subject: Re: [RFC/PATCH] x86: Optimize find_next_(zero_)bit for small constant-size bitmaps


* Alexander van Heukelum <[email protected]> wrote:

> x86: Optimize find_next_(zero_)bit for small constant-size bitmaps

thanks, looks nice, except for a small detail:

> +#ifdef CONFIG_GENERIC_FIND_NEXT_BIT
> +unsigned long __find_next_bit(const unsigned long *addr,
> + unsigned long size, unsigned long offset);
> +
> +static __always_inline unsigned long
> +find_next_bit(const unsigned long *addr, unsigned long size,
> + unsigned long offset)

> +#endif /* CONFIG_GENERIC_FIND_NEXT_BIT */

there should be an #else here i think, which maps find_next_bit to
__find_next_bit / etc.

small syntactic nit:

> +unsigned long __find_next_bit(const unsigned long *addr,
> + unsigned long size, unsigned long offset);

should be explicitly "extern", so that we see that it's a prototype
declaration.

Ingo

Subject: [PATCH] x86: Optimize find_next_(zero_)bit for small constant-size bitmaps

x86: Optimize find_next_(zero_)bit for small constant-size bitmaps.

This moves an optimization for searching constant-sized small
bitmaps from x86_64-specific code to generic code.

On an i386 defconfig (the x86#testing one), the size of vmlinux hardly
changes with this applied. I have observed only four places where this
optimization avoids a call into find_next_bit:

In the functions return_unused_surplus_pages, alloc_fresh_huge_page,
and adjust_pool_surplus, this patch avoids a call for a 1-bit bitmap.
In __next_cpu a call is avoided for a 32-bit bitmap. That's it.

On x86_64, 52 locations are optimized with a minimal increase in
code size:

Current #testing defconfig:
146 x bsf, 27 x find_next_*bit
text data bss dec hex filename
5392637 846592 724424 6963653 6a41c5 vmlinux

After removing the x86_64 specific optimization for find_next_*bit:
94 x bsf, 79 x find_next_*bit
text data bss dec hex filename
5392358 846592 724424 6963374 6a40ae vmlinux

After this patch (making the optimization generic):
146 x bsf, 27 x find_next_*bit
text data bss dec hex filename
5392396 846592 724424 6963412 6a40d4 vmlinux

---

Hi Ingo,

I have now tested (in user-space) the single-long versions of
find_next_(zero_)bit in the patch. They give the expected
results.

> > +#ifdef CONFIG_GENERIC_FIND_NEXT_BIT
> > +unsigned long __find_next_bit(const unsigned long *addr,
> > + unsigned long size, unsigned long offset);
> > +
> > +static __always_inline unsigned long
> > +find_next_bit(const unsigned long *addr, unsigned long size,
> > + unsigned long offset)
>
> > +#endif /* CONFIG_GENERIC_FIND_NEXT_BIT */
>
> there should be an #else here i think, which maps find_next_bit to
> __find_next_bit / etc.

No, that is intentional: architectures are expected to either set
CONFIG_GENERIC_FIND_NEXT_BIT, or provide their own implementation
of find_next_bit.

An alternative would be to have all architectures provide
__find_next_bit/__find_next_zero_bit and do the optimization
unconditionally.

> > +unsigned long __find_next_bit(const unsigned long *addr,
> > + unsigned long size, unsigned long offset);
>
> should be explicitly "extern", so that we see that it's a prototype
> declaration.

ok.

This version runs fine on qemu.

Greetings,
Alexander

diff --git a/include/asm-x86/bitops.h b/include/asm-x86/bitops.h
index 7fbf93a..1a23ce1 100644
--- a/include/asm-x86/bitops.h
+++ b/include/asm-x86/bitops.h
@@ -312,12 +312,6 @@ static int test_bit(int nr, const volatile unsigned long *addr);

#undef ADDR

-unsigned long find_next_bit(const unsigned long *addr,
- unsigned long size, unsigned long offset);
-unsigned long find_next_zero_bit(const unsigned long *addr,
- unsigned long size, unsigned long offset);
-
-
#ifdef CONFIG_X86_32
# include "bitops_32.h"
#else
diff --git a/include/asm-x86/bitops_64.h b/include/asm-x86/bitops_64.h
index 40bf0f8..87e1a17 100644
--- a/include/asm-x86/bitops_64.h
+++ b/include/asm-x86/bitops_64.h
@@ -20,21 +20,11 @@ static inline long __scanbit(unsigned long val, unsigned long max)
(__scanbit(*(unsigned long *)addr,(size))) : \
find_first_bit(addr,size)))

-#define find_next_bit(addr,size,off) \
-((__builtin_constant_p(size) && (size) <= BITS_PER_LONG ? \
- ((off) + (__scanbit((*(unsigned long *)addr) >> (off),(size)-(off)))) : \
- find_next_bit(addr,size,off)))
-
#define find_first_zero_bit(addr,size) \
((__builtin_constant_p(size) && (size) <= BITS_PER_LONG ? \
(__scanbit(~*(unsigned long *)addr,(size))) : \
find_first_zero_bit(addr,size)))

-#define find_next_zero_bit(addr,size,off) \
-((__builtin_constant_p(size) && (size) <= BITS_PER_LONG ? \
- ((off)+(__scanbit(~(((*(unsigned long *)addr)) >> (off)),(size)-(off)))) : \
- find_next_zero_bit(addr,size,off)))
-
static inline void set_bit_string(unsigned long *bitmap, unsigned long i,
int len)
{
diff --git a/include/linux/bitops.h b/include/linux/bitops.h
index 69c1edb..15360f9 100644
--- a/include/linux/bitops.h
+++ b/include/linux/bitops.h
@@ -72,4 +72,81 @@ static inline unsigned fls_long(unsigned long l)
return fls64(l);
}

+#ifdef __KERNEL__
+#ifdef CONFIG_GENERIC_FIND_NEXT_BIT
+extern unsigned long __find_next_bit(const unsigned long *addr,
+ unsigned long size, unsigned long offset);
+
+/**
+ * find_next_bit - find the next set bit in a memory region
+ * @addr: The address to base the search on
+ * @offset: The bitnumber to start searching at
+ * @size: The bitmap size in bits
+ */
+static __always_inline unsigned long
+find_next_bit(const unsigned long *addr, unsigned long size,
+ unsigned long offset)
+{
+ unsigned long value;
+
+ /* Avoid a function call if the bitmap size is a constant */
+ /* and not bigger than BITS_PER_LONG. */
+
+ /* insert a sentinel so that __ffs returns size if there */
+ /* are no set bits in the bitmap */
+ if (__builtin_constant_p(size) && (size < BITS_PER_LONG)) {
+ value = (*addr) & ((~0ul) << offset);
+ value |= (1ul << size);
+ return __ffs(value);
+ }
+
+ /* the result of __ffs(0) is undefined, so it needs to be */
+ /* handled separately */
+ if (__builtin_constant_p(size) && (size == BITS_PER_LONG)) {
+ value = (*addr) & ((~0ul) << offset);
+ return (value == 0) ? BITS_PER_LONG : __ffs(value);
+ }
+
+ /* size is not constant or too big */
+ return __find_next_bit(addr, size, offset);
+}
+
+extern unsigned long __find_next_zero_bit(const unsigned long *addr,
+ unsigned long size, unsigned long offset);
+
+/**
+ * find_next_zero_bit - find the next cleared bit in a memory region
+ * @addr: The address to base the search on
+ * @offset: The bitnumber to start searching at
+ * @size: The bitmap size in bits
+ */
+static __always_inline unsigned long
+find_next_zero_bit(const unsigned long *addr, unsigned long size,
+ unsigned long offset)
+{
+ unsigned long value;
+
+ /* Avoid a function call if the bitmap size is a constant */
+ /* and not bigger than BITS_PER_LONG. */
+
+ /* insert a sentinel so that __ffs returns size if there */
+ /* are no set bits in the bitmap */
+ if (__builtin_constant_p(size) && (size < BITS_PER_LONG)) {
+ value = (~(*addr)) & ((~0ul) << offset);
+ value |= (1ul << size);
+ return __ffs(value);
+ }
+
+ /* the result of __ffs(0) is undefined, so it needs to be */
+ /* handled separately */
+ if (__builtin_constant_p(size) && (size == BITS_PER_LONG)) {
+ value = (~(*addr)) & ((~0ul) << offset);
+ return (value == 0) ? BITS_PER_LONG : __ffs(value);
+ }
+
+ /* size is not constant or too big */
+ return __find_next_zero_bit(addr, size, offset);
+}
+#endif /* CONFIG_GENERIC_FIND_NEXT_BIT */
+#endif /* __KERNEL__ */
#endif
diff --git a/lib/find_next_bit.c b/lib/find_next_bit.c
index 5820e07..ce94c4c 100644
--- a/lib/find_next_bit.c
+++ b/lib/find_next_bit.c
@@ -15,17 +15,12 @@
#include <asm/byteorder.h>

#define BITOP_WORD(nr) ((nr) / BITS_PER_LONG)
-#undef find_next_bit
-#undef find_next_zero_bit
-
-/**
- * find_next_bit - find the next set bit in a memory region
- * @addr: The address to base the search on
- * @offset: The bitnumber to start searching at
- * @size: The maximum size to search
+
+/*
+ * Find the next set bit in a memory region.
*/
-unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
- unsigned long offset)
+unsigned long __find_next_bit(const unsigned long *addr,
+ unsigned long size, unsigned long offset)
{
const unsigned long *p = addr + BITOP_WORD(offset);
unsigned long result = offset & ~(BITS_PER_LONG-1);
@@ -62,15 +57,14 @@ found_first:
found_middle:
return result + __ffs(tmp);
}
-
-EXPORT_SYMBOL(find_next_bit);
+EXPORT_SYMBOL(__find_next_bit);

/*
* This implementation of find_{first,next}_zero_bit was stolen from
* Linus' asm-alpha/bitops.h.
*/
-unsigned long find_next_zero_bit(const unsigned long *addr, unsigned long size,
- unsigned long offset)
+unsigned long __find_next_zero_bit(const unsigned long *addr,
+ unsigned long size, unsigned long offset)
{
const unsigned long *p = addr + BITOP_WORD(offset);
unsigned long result = offset & ~(BITS_PER_LONG-1);
@@ -107,8 +101,7 @@ found_first:
found_middle:
return result + ffz(tmp);
}
-
-EXPORT_SYMBOL(find_next_zero_bit);
+EXPORT_SYMBOL(__find_next_zero_bit);

#ifdef __BIG_ENDIAN

2008-03-11 15:23:32

by Ingo Molnar

Subject: Re: [PATCH] x86: Optimize find_next_(zero_)bit for small constant-size bitmaps


* Alexander van Heukelum <[email protected]> wrote:

> > > +#endif /* CONFIG_GENERIC_FIND_NEXT_BIT */
> >
> > there should be an #else here i think, which maps find_next_bit to
> > __find_next_bit / etc.
>
> No, that is intentional: architectures are expected to either set
> CONFIG_GENERIC_FIND_NEXT_BIT, or provide their own implementation of
> find_next_bit.

indeed, you are right.

> This version runs fine on qemu.

great - i've queued it up into x86.git.

Ingo

Subject: [RFC] non-x86: Optimize find_next_(zero_)bit for small constant-size bitmaps

non-x86: Optimize small bitmap searches for the other archs.

Concerns archs that have their own implementation of find_next_bit
and find_next_zero_bit.

For archs that define CONFIG_GENERIC_FIND_NEXT_BIT, a patch has been
submitted for inclusion in Ingo's x86-testing tree. It moves
an optimization for searching small, constant-sized bitmaps
from x86_64-specific to generic code. It works by creating
new __always_inline-marked functions in linux/bitops.h, and
renaming find_next_(zero_)bit to __find_next_(zero_)bit in
the generic code. The new code is currently guarded by
CONFIG_GENERIC_FIND_NEXT_BIT, so other archs are expected to
work as before.

To enable the optimization globally, all other archs need to
provide __find_next_bit and __find_next_zero_bit, either by
exporting these symbols or by implementing them as inline
functions in their asm/bitops.h.

It would also be nice if every implementation had the
same declaration:

unsigned long __find_next_(zero_)bit(unsigned long *,
unsigned long, unsigned long);

I added a totally untested patch for reference.

Did I miss an arch?
Does the assembly still match the (changed) prototype?
Can we get consensus on doing the optimization in generic code?

Greetings,
Alexander

arch/arm/lib/findbit.S | 6 ++++--
arch/avr32/kernel/avr32_ksyms.c | 4 ++--
arch/avr32/lib/findbit.S | 4 ++--
include/asm-arm/bitops.h | 20 ++++++++++++--------
include/asm-m68k/bitops.h | 8 ++++----
include/asm-powerpc/bitops.h | 4 ----
include/asm-s390/bitops.h | 12 ++++++------
include/linux/bitops.h | 2 --
8 files changed, 30 insertions(+), 30 deletions(-)

diff --git a/arch/arm/lib/findbit.S b/arch/arm/lib/findbit.S
index a5ca024..5c92967 100644
--- a/arch/arm/lib/findbit.S
+++ b/arch/arm/lib/findbit.S
@@ -36,7 +36,8 @@ ENTRY(_find_first_zero_bit_le)

/*
* Purpose : Find next 'zero' bit
- * Prototype: int find_next_zero_bit(void *addr, unsigned int maxbit, int offset)
+ * Prototype: unsigned long find_next_zero_bit(unsigned long *addr,
+ * unsigned long maxbit, unsigned long offset);
*/
ENTRY(_find_next_zero_bit_le)
teq r1, #0
@@ -70,7 +71,8 @@ ENTRY(_find_first_bit_le)

/*
* Purpose : Find next 'one' bit
- * Prototype: int find_next_zero_bit(void *addr, unsigned int maxbit, int offset)
+ * Prototype: unsigned long find_next_zero_bit(unsigned long *addr,
+ * unsigned long maxbit, unsigned long offset);
*/
ENTRY(_find_next_bit_le)
teq r1, #0
diff --git a/arch/avr32/kernel/avr32_ksyms.c b/arch/avr32/kernel/avr32_ksyms.c
index 80f55f8..c228ed1 100644
--- a/arch/avr32/kernel/avr32_ksyms.c
+++ b/arch/avr32/kernel/avr32_ksyms.c
@@ -51,9 +51,9 @@ EXPORT_SYMBOL(__const_udelay);

/* Bit operations (lib/findbit.S) */
EXPORT_SYMBOL(find_first_zero_bit);
-EXPORT_SYMBOL(find_next_zero_bit);
+EXPORT_SYMBOL(__find_next_zero_bit);
EXPORT_SYMBOL(find_first_bit);
-EXPORT_SYMBOL(find_next_bit);
+EXPORT_SYMBOL(__find_next_bit);
EXPORT_SYMBOL(generic_find_next_zero_le_bit);

/* I/O primitives (lib/io-*.S) */
diff --git a/arch/avr32/lib/findbit.S b/arch/avr32/lib/findbit.S
index c6b91de..729deab 100644
--- a/arch/avr32/lib/findbit.S
+++ b/arch/avr32/lib/findbit.S
@@ -29,7 +29,7 @@ ENTRY(find_first_zero_bit)
* unsigned long size,
* unsigned long offset)
*/
-ENTRY(find_next_zero_bit)
+ENTRY(__find_next_zero_bit)
lsr r8, r10, 5
sub r9, r11, r10
retle r11
@@ -93,7 +93,7 @@ ENTRY(find_first_bit)
* unsigned long size,
* unsigned long offset)
*/
-ENTRY(find_next_bit)
+ENTRY(__find_next_bit)
lsr r8, r10, 5
sub r9, r11, r10
retle r11
diff --git a/include/asm-arm/bitops.h b/include/asm-arm/bitops.h
index 5c60bfc..08b3163 100644
--- a/include/asm-arm/bitops.h
+++ b/include/asm-arm/bitops.h
@@ -158,9 +158,11 @@ extern int _test_and_set_bit_le(int nr, volatile unsigned long * p);
extern int _test_and_clear_bit_le(int nr, volatile unsigned long * p);
extern int _test_and_change_bit_le(int nr, volatile unsigned long * p);
extern int _find_first_zero_bit_le(const void * p, unsigned size);
-extern int _find_next_zero_bit_le(const void * p, int size, int offset);
+extern unsigned long _find_next_zero_bit_le(const unsigned long * p,
+ unsigned long size, unsigned long offset);
extern int _find_first_bit_le(const unsigned long *p, unsigned size);
-extern int _find_next_bit_le(const unsigned long *p, int size, int offset);
+extern unsigned long _find_next_bit_le(const unsigned long *p,
+ unsigned long size, unsigned long offset);

/*
* Big endian assembly bitops. nr = 0 -> byte 3 bit 0.
@@ -172,9 +174,11 @@ extern int _test_and_set_bit_be(int nr, volatile unsigned long * p);
extern int _test_and_clear_bit_be(int nr, volatile unsigned long * p);
extern int _test_and_change_bit_be(int nr, volatile unsigned long * p);
extern int _find_first_zero_bit_be(const void * p, unsigned size);
-extern int _find_next_zero_bit_be(const void * p, int size, int offset);
+extern unsigned long _find_next_zero_bit_be(const unsigned long * p,
+ unsigned long size, unsigned long offset);
extern int _find_first_bit_be(const unsigned long *p, unsigned size);
-extern int _find_next_bit_be(const unsigned long *p, int size, int offset);
+extern unsigned long _find_next_bit_be(const unsigned long *p,
+ unsigned long size, unsigned long offset);

#ifndef CONFIG_SMP
/*
@@ -208,9 +212,9 @@ extern int _find_next_bit_be(const unsigned long *p, int size, int offset);
#define test_and_clear_bit(nr,p) ATOMIC_BITOP_LE(test_and_clear_bit,nr,p)
#define test_and_change_bit(nr,p) ATOMIC_BITOP_LE(test_and_change_bit,nr,p)
#define find_first_zero_bit(p,sz) _find_first_zero_bit_le(p,sz)
-#define find_next_zero_bit(p,sz,off) _find_next_zero_bit_le(p,sz,off)
+#define __find_next_zero_bit(p,sz,off) _find_next_zero_bit_le(p,sz,off)
#define find_first_bit(p,sz) _find_first_bit_le(p,sz)
-#define find_next_bit(p,sz,off) _find_next_bit_le(p,sz,off)
+#define __find_next_bit(p,sz,off) _find_next_bit_le(p,sz,off)

#define WORD_BITOFF_TO_LE(x) ((x))

@@ -226,9 +230,9 @@ extern int _find_next_bit_be(const unsigned long *p, int size, int offset);
#define test_and_clear_bit(nr,p) ATOMIC_BITOP_BE(test_and_clear_bit,nr,p)
#define test_and_change_bit(nr,p) ATOMIC_BITOP_BE(test_and_change_bit,nr,p)
#define find_first_zero_bit(p,sz) _find_first_zero_bit_be(p,sz)
-#define find_next_zero_bit(p,sz,off) _find_next_zero_bit_be(p,sz,off)
+#define __find_next_zero_bit(p,sz,off) _find_next_zero_bit_be(p,sz,off)
#define find_first_bit(p,sz) _find_first_bit_be(p,sz)
-#define find_next_bit(p,sz,off) _find_next_bit_be(p,sz,off)
+#define __find_next_bit(p,sz,off) _find_next_bit_be(p,sz,off)

#define WORD_BITOFF_TO_LE(x) ((x) ^ 0x18)

diff --git a/include/asm-m68k/bitops.h b/include/asm-m68k/bitops.h
index 83d1f28..c320c7a 100644
--- a/include/asm-m68k/bitops.h
+++ b/include/asm-m68k/bitops.h
@@ -199,8 +199,8 @@ out:
return ((long)p - (long)vaddr - 4) * 8 + res;
}

-static inline int find_next_zero_bit(const unsigned long *vaddr, int size,
- int offset)
+static inline unsigned long __find_next_zero_bit(const unsigned long *vaddr,
+ unsigned long size, unsigned long offset)
{
const unsigned long *p = vaddr + (offset >> 5);
int bit = offset & 31UL, res;
@@ -246,8 +246,8 @@ out:
return ((long)p - (long)vaddr - 4) * 8 + res;
}

-static inline int find_next_bit(const unsigned long *vaddr, int size,
- int offset)
+static inline unsigned long __find_next_bit(const unsigned long *vaddr,
+ unsigned long size, unsigned long offset)
{
const unsigned long *p = vaddr + (offset >> 5);
int bit = offset & 31UL, res;
diff --git a/include/asm-powerpc/bitops.h b/include/asm-powerpc/bitops.h
index 220d9a7..8ae6667 100644
--- a/include/asm-powerpc/bitops.h
+++ b/include/asm-powerpc/bitops.h
@@ -317,8 +317,6 @@ static __inline__ int fls(unsigned int x)
#include <asm-generic/bitops/hweight.h>

#define find_first_zero_bit(addr, size) find_next_zero_bit((addr), (size), 0)
-unsigned long find_next_zero_bit(const unsigned long *addr,
- unsigned long size, unsigned long offset);
/**
* find_first_bit - find the first set bit in a memory region
* @addr: The address to start the search at
@@ -328,8 +326,6 @@ unsigned long find_next_zero_bit(const unsigned long *addr,
* containing a bit.
*/
#define find_first_bit(addr, size) find_next_bit((addr), (size), 0)
-unsigned long find_next_bit(const unsigned long *addr,
- unsigned long size, unsigned long offset);

/* Little-endian versions */

diff --git a/include/asm-s390/bitops.h b/include/asm-s390/bitops.h
index 965394e..13a5c33 100644
--- a/include/asm-s390/bitops.h
+++ b/include/asm-s390/bitops.h
@@ -691,9 +691,9 @@ static inline unsigned long find_first_bit(const unsigned long * addr,
* @offset: The bitnumber to start searching at
* @size: The maximum size to search
*/
-static inline int find_next_zero_bit (const unsigned long * addr,
- unsigned long size,
- unsigned long offset)
+static inline unsigned long __find_next_zero_bit(const unsigned long * addr,
+ unsigned long size,
+ unsigned long offset)
{
const unsigned long *p;
unsigned long bit, set;
@@ -727,9 +727,9 @@ static inline int find_next_zero_bit (const unsigned long * addr,
* @offset: The bitnumber to start searching at
* @size: The maximum size to search
*/
-static inline int find_next_bit (const unsigned long * addr,
- unsigned long size,
- unsigned long offset)
+static inline unsigned long __find_next_bit(const unsigned long * addr,
+ unsigned long size,
+ unsigned long offset)
{
const unsigned long *p;
unsigned long bit, set;
diff --git a/include/linux/bitops.h b/include/linux/bitops.h
index 15360f9..68be268 100644
--- a/include/linux/bitops.h
+++ b/include/linux/bitops.h
@@ -73,7 +73,6 @@ static inline unsigned fls_long(unsigned long l)
}

#ifdef __KERNEL__
-#ifdef CONFIG_GENERIC_FIND_NEXT_BIT
extern unsigned long __find_next_bit(const unsigned long *addr,
unsigned long size, unsigned long offset);

@@ -147,6 +146,5 @@ find_next_zero_bit(const unsigned long *addr, unsigned long size,
/* size is not constant or too big */
return __find_next_zero_bit(addr, size, offset);
}
-#endif /* CONFIG_GENERIC_FIND_NEXT_BIT */
#endif /* __KERNEL__ */
#endif

2008-03-13 12:44:46

by Aneesh Kumar K.V

Subject: Re: [PATCH] x86: Change x86 to use generic find_next_bit

On Sun, Mar 09, 2008 at 09:01:04PM +0100, Alexander van Heukelum wrote:
> x86: Change x86 to use the generic find_next_bit implementation
>
> The versions with inline assembly are in fact slower on the machines I
> tested them on (in userspace) (Athlon XP 2800+, p4-like Xeon 2.8GHz, AMD
> Opteron 270). The i386-version needed a fix similar to 06024f21 to avoid
> crashing the benchmark.
>
> Benchmark using: gcc -fomit-frame-pointer -Os. For each bitmap size
> 1...512, for each possible bitmap with one bit set, for each possible
> offset: find the position of the first bit starting at offset. If you
> follow ;). Times include setup of the bitmap and checking of the
> results.
>
> Athlon Xeon Opteron 32/64bit
> x86-specific: 0m3.692s 0m2.820s 0m3.196s / 0m2.480s
> generic: 0m2.622s 0m1.662s 0m2.100s / 0m1.572s
>
> If the bitmap size is not a multiple of BITS_PER_LONG, and no set
> (cleared) bit is found, find_next_bit (find_next_zero_bit) returns a
> value outside of the range [0,size]. The generic version always returns
> exactly size. The generic version also uses unsigned long everywhere,
> while the x86 versions use a mishmash of int, unsigned (int), long and
> unsigned long.
>

This problem is observed on x86_64 and powerpc also. We need a
long-aligned address for test_bit, set_bit, find_bit, etc. In ext4 we
have to make sure we align the address passed to:

ext4_test_bit
ext4_set_bit
ext4_find_next_zero_bit
ext4_find_next_bit

fs/ext4/mballoc.c has some examples.

-aneesh

2008-03-13 14:49:22

by Alexander van Heukelum

Subject: Re: [PATCH] x86: Change x86 to use generic find_next_bit


On Thu, 13 Mar 2008 18:14:24 +0530, "Aneesh Kumar K.V"
<[email protected]> said:
> On Sun, Mar 09, 2008 at 09:01:04PM +0100, Alexander van Heukelum wrote:
> > x86: Change x86 to use the generic find_next_bit implementation
> >
> > The versions with inline assembly are in fact slower on the machines I
> > tested them on (in userspace) (Athlon XP 2800+, p4-like Xeon 2.8GHz, AMD
> > Opteron 270). The i386-version needed a fix similar to 06024f21 to avoid
> > crashing the benchmark.
> >
> > Benchmark using: gcc -fomit-frame-pointer -Os. For each bitmap size
> > 1...512, for each possible bitmap with one bit set, for each possible
> > offset: find the position of the first bit starting at offset. If you
> > follow ;). Times include setup of the bitmap and checking of the
> > results.
> >
> > Athlon Xeon Opteron 32/64bit
> > x86-specific: 0m3.692s 0m2.820s 0m3.196s / 0m2.480s
> > generic: 0m2.622s 0m1.662s 0m2.100s / 0m1.572s
> >
> > If the bitmap size is not a multiple of BITS_PER_LONG, and no set
> > (cleared) bit is found, find_next_bit (find_next_zero_bit) returns a
> > value outside of the range [0,size]. The generic version always returns
> > exactly size. The generic version also uses unsigned long everywhere,
> > while the x86 versions use a mishmash of int, unsigned (int), long and
> > unsigned long.
>
> This problem is observed on x86_64 and powerpc also.

I'm not entirely sure if it is a problem. In some cases it
certainly is an inconvenience, though ;). I mentioned the
difference between the old and generic versions because of
the possibility that some code depends on this behaviour.

Indeed, I see for example (in fs/ext4/mballoc.c):

	bit = mb_find_next_zero_bit(bitmap_bh->b_data, end, bit);
	if (bit >= end)
		break;
	next = mb_find_next_bit(bitmap_bh->b_data, end, bit);
	if (next > end)
		next = end;
	free += next - bit;

So here the caller needs to adjust the value.

> We need a long
> aligned address for test_bit, set_bit find_bit etc. In ext4 we have
> to make sure we align the address passed to
>
> ext4_test_bit
> ext4_set_bit
> ext4_find_next_zero_bit
> ext4_find_next_bit
>
> fs/ext4/mballoc.c have some examples.

This is a different 'problem'. find_next_bit works on arrays of long,
while the bitmaps in ext4_find_next_bit are of type void * and seem
not to have any alignment restrictions. ext4 implements wrappers
around find_next_bit to solve that 'problem'.
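
(Roughly, such a wrapper aligns the buffer down to a long boundary and
compensates the bit offset. A little-endian sketch with hypothetical
names, not the actual mballoc code:

	static inline unsigned long aligned_find_next_bit(const void *addr,
			unsigned long size, unsigned long offset)
	{
		unsigned long misalign = (unsigned long)addr & (sizeof(long) - 1);
		const unsigned long *base =
			(const unsigned long *)((unsigned long)addr - misalign);
		unsigned long bits = misalign * 8;	/* little-endian only */
		unsigned long ret;

		ret = find_next_bit(base, size + bits, offset + bits) - bits;
		return ret > size ? size : ret;		/* clamp */
	}

The clamp at the end ties in with the [0,size] question below.)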

The question that arises is: do we want find_first_bit, find_next_bit,
etc. to always return a value in the range [0, size], or do we want
to allow implementations that return [0, size-1] if a bit is
found and something else (roundup(size,bitsperlong) or ulongmax, for
example) if none were found?

The current x86_64 versions of find_first_bit and find_next_bit return
roundup(size,bitsperlong) if no bits were found, but on the other hand
I guess most bitmaps are a multiple of bitsperlong bits in size, which
hides the difference.

Greetings,
Alexander

> -aneesh
--
Alexander van Heukelum
[email protected]

--
http://www.fastmail.fm - Or how I learned to stop worrying and
love email again