2010-01-30 09:54:03

by Fengguang Wu

Subject: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)

This allows use of hweight_long() in BUILD_BUG_ON().
Suggested by Jamie.

CC: Jamie Lokier <[email protected]>
CC: Roland Dreier <[email protected]>
Signed-off-by: Wu Fengguang <[email protected]>
---
include/linux/bitops.h | 12 ++++++++----
1 file changed, 8 insertions(+), 4 deletions(-)

--- linux-mm.orig/include/linux/bitops.h 2009-07-20 20:10:19.000000000 +0800
+++ linux-mm/include/linux/bitops.h 2010-01-30 17:41:15.000000000 +0800
@@ -40,10 +40,14 @@ static __inline__ int get_count_order(un
return order;
}

-static inline unsigned long hweight_long(unsigned long w)
-{
- return sizeof(w) == 4 ? hweight32(w) : hweight64(w);
-}
+#define hweight_long(x) \
+( \
+ __builtin_constant_p(x) ? \
+ __builtin_popcountl(x) : \
+ (sizeof(x) <= 4 ? \
+ hweight32(x) : \
+ hweight64(x)) \
+)

/**
* rol32 - rotate a 32-bit value left

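For illustration only (not part of the patch): with the macro above, a constant
argument folds at compile time, so hweight_long() becomes usable in
BUILD_BUG_ON() and - per Jamie's observation quoted further down - the bare
builtin can even size an array. EXAMPLE_MASK and the identifiers below are
made up.

#include <linux/kernel.h>
#include <linux/bitops.h>

#define EXAMPLE_MASK	0x000000f3UL	/* made-up mask with 6 bits set */

/* the builtin folds for constant arguments, so it can even size an array */
static int per_bit_slot[__builtin_popcountl(EXAMPLE_MASK)];

static inline void example_check(void)
{
	/* what the patch enables: hweight_long() inside BUILD_BUG_ON() */
	BUILD_BUG_ON(hweight_long(EXAMPLE_MASK) > 8);
}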

2010-02-01 20:49:16

by Andrew Morton

Subject: Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)

On Sat, 30 Jan 2010 17:45:17 +0800
Wu Fengguang <[email protected]> wrote:

> This allows use of hweight_long() in BUILD_BUG_ON().
> Suggested by Jamie.
>
> CC: Jamie Lokier <[email protected]>
> CC: Roland Dreier <[email protected]>
> Signed-off-by: Wu Fengguang <[email protected]>
> ---
> include/linux/bitops.h | 12 ++++++++----
> 1 file changed, 8 insertions(+), 4 deletions(-)
>
> --- linux-mm.orig/include/linux/bitops.h 2009-07-20 20:10:19.000000000 +0800
> +++ linux-mm/include/linux/bitops.h 2010-01-30 17:41:15.000000000 +0800
> @@ -40,10 +40,14 @@ static __inline__ int get_count_order(un
> return order;
> }
>
> -static inline unsigned long hweight_long(unsigned long w)
> -{
> - return sizeof(w) == 4 ? hweight32(w) : hweight64(w);
> -}
> +#define hweight_long(x) \
> +( \
> + __builtin_constant_p(x) ? \
> + __builtin_popcountl(x) : \
> + (sizeof(x) <= 4 ? \
> + hweight32(x) : \
> + hweight64(x)) \
> +)
>
> /**
> * rol32 - rotate a 32-bit value left

Peter's been mucking with a compile-time HWEIGHT(). An outdated
version of that is presently in linux-next.

(I wonder if it could have used __builtin_popcount)

(I wonder which gcc versions support __builtin_popcount)

Anyway, I suspect you should be using that rather than tweaking
hweight_long().

2010-02-03 13:42:17

by Fengguang Wu

Subject: Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)

On Mon, Feb 01, 2010 at 01:48:25PM -0700, Andrew Morton wrote:
> > -static inline unsigned long hweight_long(unsigned long w)
> > -{
> > - return sizeof(w) == 4 ? hweight32(w) : hweight64(w);
> > -}
> > +#define hweight_long(x) \
> > +( \
> > + __builtin_constant_p(x) ? \
> > + __builtin_popcountl(x) : \
> > + (sizeof(x) <= 4 ? \
> > + hweight32(x) : \
> > + hweight64(x)) \
> > +)
> >
> > /**
> > * rol32 - rotate a 32-bit value left
>
> Peter's been mucking with a compile-time HWEIGHT(). An outdated
> version of that is presently in linux-next.

Ah, I found it. Thanks.

> (I wonder if it could have used __builtin_popcount)
>
> (I wonder which gcc versions support __builtin_popcount)

This is how Jamie Lokier first recommended it to me:

Checked GCC 3.4.3 / 4.1 / 4.4: It expands as a compile-time
constant if the argument is a compile-time constant, so can be
used in BUILD_BUG_ON() and even for array sizes etc.

Is there an official lowest GCC version that Linux supports?

> Anyway, I suspect you should be using that rather than tweaking
> hweight_long().

OK.

But if ever the GCC builtins can be used:

__builtin_popcount(unsigned int)
__builtin_popcountl(unsigned long)
__builtin_popcountll(unsigned long long)

it would be more natural to use hweight_long() or even introduce
hweight_longlong() for (transparent) compile-time computation.

Thanks,
Fengguang

2010-02-03 15:12:12

by Andrew Morton

Subject: Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)

On Wed, 3 Feb 2010 21:39:51 +0800 Wu Fengguang <[email protected]> wrote:

> Is there an official lowest GCC version that Linux supports?

Documentation/Changes says gcc-3.2.

2010-02-03 15:16:28

by Peter Zijlstra

Subject: Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)

On Wed, 2010-02-03 at 07:08 -0800, Andrew Morton wrote:
> On Wed, 3 Feb 2010 21:39:51 +0800 Wu Fengguang <[email protected]> wrote:
>
> > Is there an official lowest GCC version that Linux supports?
>
> Documentation/Changes says gcc-3.2.

Anyway, it's not like the macro I used isn't straightforward. There's not
much a builtin can do better.

The only benefit the builtin has is possibly using the machine popcnt
instruction, but we already deal with that in the kernel.
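For reference, a compile-time popcount can indeed be done with a plain macro
along these lines - a sketch only; the 16/32/64-bit layers below match the
const_hweight.h hunk that appears later in this thread, while the 8-bit base
is reconstructed here:

#define __const_hweight8(w)				\
	((unsigned int)					\
	 ((!!((w) & (1ULL << 0))) +			\
	  (!!((w) & (1ULL << 1))) +			\
	  (!!((w) & (1ULL << 2))) +			\
	  (!!((w) & (1ULL << 3))) +			\
	  (!!((w) & (1ULL << 4))) +			\
	  (!!((w) & (1ULL << 5))) +			\
	  (!!((w) & (1ULL << 6))) +			\
	  (!!((w) & (1ULL << 7)))))

#define __const_hweight16(w) (__const_hweight8(w)  + __const_hweight8((w)  >> 8))
#define __const_hweight32(w) (__const_hweight16(w) + __const_hweight16((w) >> 16))
#define __const_hweight64(w) (__const_hweight32(w) + __const_hweight32((w) >> 32))

Everything here folds to an integer constant when w is a compile-time
constant, with no dependency on the gcc version.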


2010-02-03 15:44:29

by Andrew Morton

Subject: Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)

On Wed, 03 Feb 2010 16:15:57 +0100 Peter Zijlstra <[email protected]> wrote:

> On Wed, 2010-02-03 at 07:08 -0800, Andrew Morton wrote:
> > On Wed, 3 Feb 2010 21:39:51 +0800 Wu Fengguang <[email protected]> wrote:
> >
> > > Is there an official lowest GCC version that Linux supports?
> >
> > Documentation/Changes says gcc-3.2.
>
> Anyway, its not like that macro I used isn't straight forward. Not much
> a builtin can do better.
>
> The only benefit the builtin has is possibly use machine popcnt
> instructions but we already deal with that in the kernel.

We didn't deal with it on every architecture, which is something which
the compiler extension takes care of.

In fact I can't find anywhere where we dealt with it on x86.

2010-02-03 15:47:44

by Peter Zijlstra

Subject: Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)

On Wed, 2010-02-03 at 07:42 -0800, Andrew Morton wrote:
> On Wed, 03 Feb 2010 16:15:57 +0100 Peter Zijlstra <[email protected]> wrote:
>
> > On Wed, 2010-02-03 at 07:08 -0800, Andrew Morton wrote:
> > > On Wed, 3 Feb 2010 21:39:51 +0800 Wu Fengguang <[email protected]> wrote:
> > >
> > > > Is there an official lowest GCC version that Linux supports?
> > >
> > > Documentation/Changes says gcc-3.2.
> >
> > Anyway, its not like that macro I used isn't straight forward. Not much
> > a builtin can do better.
> >
> > The only benefit the builtin has is possibly use machine popcnt
> > instructions but we already deal with that in the kernel.
>
> We didn't deal with it on every architecture, which is something which
> the compiler extension takes care of.
>
> In fact I can't find anywhere where we dealt with it on x86.

From what I know the popcnt insn on x86 is relatively new; hpa was going
to look at supporting it using alternatives.

2010-02-03 17:11:55

by H. Peter Anvin

Subject: Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)

On 02/03/2010 05:39 AM, Wu Fengguang wrote:
>
>> (I wonder if it could have used __builtin_popcount)
>>
>> (I wonder which gcc versions support __builtin_popcount)
>

I think 4.0 or later, but it often generates calls to libgcc, which
would mean adding libgcc functions to the kernel (which some arches
already do).

> This is how Jamie Lokier first recommended it to me:
>
> Checked GCC 3.4.3 / 4.1 / 4.4: It expands as a compile-time
> constant if the argument is a compile-time constant, so can be
> used in BUILD_BUG_ON() and even for array sizes etc.
>
> Is there an official lowest GCC version that Linux supports?

3.4, I believe.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2010-02-03 17:16:36

by H. Peter Anvin

Subject: Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)

On 02/03/2010 07:47 AM, Peter Zijlstra wrote:
>
> From what I know the popcnt ins on x86 is relatively new, hpa was going
> to look at supporting it using alternatives.
>

And kernel-specific stuff like alternatives is pretty much the one thing
that gcc *can't* do for us.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2010-02-03 18:14:18

by Borislav Petkov

Subject: Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)

On Wed, Feb 03, 2010 at 07:42:51AM -0800, Andrew Morton wrote:
> We didn't deal with it on every architecture, which is something which
> the compiler extension takes care of.
>
> In fact I can't find anywhere where we dealt with it on x86.

Yeah, we talked briefly about using hardware popcnt, see thread
beginning at

http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-06/msg00245.html

for example. I did an ftrace of the cpumask_weight() calls in sched.c to
see whether there would be a measurable performance gain, but it didn't
seem so at the time. My numbers showed something like ca. 170 hweight
calls per second, and since the <lib/hweight.c> implementations roughly
translate to something like ~20 insns (hweight64 to about ~30), the whole
thing wasn't worth the trouble of checking binutils versions and slapping
in opcodes, or using gcc intrinsics, which involves gcc version checking.

An alternatives solution based on the CPUID flag could add the popcnt
opcode without checking any toolchain versions, but what would the
replacement instruction look like? Something like

alternative("call hweightXX", "popcnt", X86_FEATURE_POPCNT)

while making sure the arg is in some register first?

Hmm..

--
Regards/Gruss,
Boris.

--
Advanced Micro Devices, Inc.
Operating Systems Research Center

2010-02-03 18:48:20

by Peter Zijlstra

Subject: Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)

On Wed, 2010-02-03 at 19:14 +0100, Borislav Petkov wrote:

> alternative("call hweightXX", "popcnt", X86_FEATURE_POPCNT)

Make sure to apply a 0xff bitmask to the popcnt r16 call for hweight8(),
and hweight64() needs a bit of magic for 32bit, but yes, something like
that ought to work nicely.
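Spelled out, those two points might look roughly like this - an illustrative
sketch with made-up helper names, assuming a toolchain that knows the popcnt
mnemonic (the real patches later in the thread encode the instruction as raw
bytes instead):

#include <linux/types.h>

static inline unsigned int sketch_popcnt32(unsigned int w)
{
	unsigned int res;

	/* hardware popcount of a 32-bit value */
	asm ("popcnt %1, %0" : "=r" (res) : "r" (w));
	return res;
}

static inline unsigned int sketch_hweight8(unsigned int w)
{
	/* mask first so stray high bits are not counted */
	return sketch_popcnt32(w & 0xff);
}

static inline unsigned int sketch_hweight64_on_32bit(u64 w)
{
	/* the "magic" for 32-bit: two 32-bit popcounts */
	return sketch_popcnt32((u32)w) + sketch_popcnt32((u32)(w >> 32));
}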

2010-02-03 19:55:59

by H. Peter Anvin

Subject: Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)

On 02/03/2010 10:47 AM, Peter Zijlstra wrote:
> On Wed, 2010-02-03 at 19:14 +0100, Borislav Petkov wrote:
>
>> alternative("call hweightXX", "popcnt", X86_FEATURE_POPCNT)
>
> Make sure to apply a 0xff bitmask to the popcnt r16 call for hweight8(),
> and hweight64() needs a bit of magic for 32bit, but yes, something like
> that ought to work nicely.
>

Arguably the "best" option is to have the alternative being a jump to an
out-of-line stub which does the necessary parameter marshalling before
calling a stub. This technique is already used in a few other places.

-hpa

2010-02-04 15:10:43

by Borislav Petkov

Subject: Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)

On Wed, Feb 03, 2010 at 11:49:54AM -0800, H. Peter Anvin wrote:
> On 02/03/2010 10:47 AM, Peter Zijlstra wrote:
> > On Wed, 2010-02-03 at 19:14 +0100, Borislav Petkov wrote:
> >
> >> alternative("call hweightXX", "popcnt", X86_FEATURE_POPCNT)
> >
> > Make sure to apply a 0xff bitmask to the popcnt r16 call for hweight8(),
> > and hweight64() needs a bit of magic for 32bit, but yes, something like
> > that ought to work nicely.
> >
>
> Arguably the "best" option is to have the alternative being a jump to an
> out-of-line stub which does the necessary parameter marshalling before
> calling a stub. This technique is already used in a few other places.

Ok, here's a first alpha prototype, completely untested. The asm
output looks ok though. I've added separate 32-bit and 64-bit helpers in
order to dispense with the if-else tests. The hw-popcnt versions are the
opcodes for "popcnt %eax, %eax" and "popcnt %rax, %rax", respectively,
so %rAX has to be preloaded with the bitmask and the computed value
has to be retrieved from there afterwards. And yes, it doesn't look all
that elegant, so I'm open to suggestions.

The good thing is, this should work on any toolchain since we don't rely
on the compiler to know about popcnt and we're protected by CPUID flag
so that the hw-popcnt version is used only on processors which support
it.

Please take a good look and let me know what you guys think.

Thanks.

--
arch/x86/include/asm/bitops.h | 4 ++
arch/x86/lib/Makefile | 2 +-
arch/x86/lib/popcnt.c | 62 +++++++++++++++++++++++++++++++++++++++++
3 files changed, 67 insertions(+), 1 deletions(-)
create mode 100644 arch/x86/lib/popcnt.c

diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 02b47a6..deb5013 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -434,6 +434,10 @@ static inline int fls(int x)
#endif
return r + 1;
}
+
+
+extern int arch_hweight_long(unsigned long);
+
#endif /* __KERNEL__ */

#undef ADDR
diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile
index cffd754..c03fe2d 100644
--- a/arch/x86/lib/Makefile
+++ b/arch/x86/lib/Makefile
@@ -22,7 +22,7 @@ lib-y += usercopy_$(BITS).o getuser.o putuser.o
lib-y += memcpy_$(BITS).o
lib-$(CONFIG_KPROBES) += insn.o inat.o

-obj-y += msr.o msr-reg.o msr-reg-export.o
+obj-y += msr.o msr-reg.o msr-reg-export.o popcnt.o

ifeq ($(CONFIG_X86_32),y)
obj-y += atomic64_32.o
diff --git a/arch/x86/lib/popcnt.c b/arch/x86/lib/popcnt.c
new file mode 100644
index 0000000..179a6e8
--- /dev/null
+++ b/arch/x86/lib/popcnt.c
@@ -0,0 +1,62 @@
+#include <linux/kernel.h>
+#include <linux/bitops.h>
+
+int _hweight32(void)
+{
+ unsigned long w;
+
+ asm volatile("" : "=a" (w));
+
+ return hweight32(w);
+}
+
+int _hweight64(void)
+{
+ unsigned long w;
+
+ asm volatile("" : "=a" (w));
+
+ return hweight64(w);
+}
+
+int _popcnt32(void)
+{
+
+ unsigned long w;
+
+ asm volatile(".byte 0xf3\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xc0\n\t"
+ : "=a" (w));
+
+ return w;
+}
+
+int _popcnt64(void)
+{
+
+ unsigned long w;
+
+ asm volatile(".byte 0xf3\n\t.byte 0x48\n\t.byte 0x0f\n\t."
+ "byte 0xb8\n\t.byte 0xc0\n\t"
+ : "=a" (w));
+
+ return w;
+}
+
+int arch_hweight_long(unsigned long w)
+{
+ if (sizeof(w) == 4) {
+ asm volatile("movl %[w], %%eax" :: [w] "r" (w));
+ alternative("call _hweight32",
+ "call _popcnt32",
+ X86_FEATURE_POPCNT);
+ asm volatile("" : "=a" (w));
+
+ } else {
+ asm volatile("movq %[w], %%rax" :: [w] "r" (w));
+ alternative("call _hweight64",
+ "call _popcnt64",
+ X86_FEATURE_POPCNT);
+ asm volatile("" : "=a" (w));
+ }
+ return w;
+}
--
1.6.6

--
Regards/Gruss,
Boris.

--
Advanced Micro Devices, Inc.
Operating Systems Research Center

2010-02-04 15:14:17

by Peter Zijlstra

Subject: Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)

On Thu, 2010-02-04 at 16:10 +0100, Borislav Petkov wrote:
> On Wed, Feb 03, 2010 at 11:49:54AM -0800, H. Peter Anvin wrote:
> > On 02/03/2010 10:47 AM, Peter Zijlstra wrote:
> > > On Wed, 2010-02-03 at 19:14 +0100, Borislav Petkov wrote:
> > >
> > >> alternative("call hweightXX", "popcnt", X86_FEATURE_POPCNT)
> > >
> > > Make sure to apply a 0xff bitmask to the popcnt r16 call for hweight8(),
> > > and hweight64() needs a bit of magic for 32bit, but yes, something like
> > > that ought to work nicely.
> > >
> >
> > Arguably the "best" option is to have the alternative being a jump to an
> > out-of-line stub which does the necessary parameter marshalling before
> > calling a stub. This technique is already used in a few other places.
>
> Ok, here's a first alpha prototype and completely untested. The asm
> output looks ok though. I've added separate 32-bit and 64-bit helpers in
> order to dispense with the if-else tests. The hw-popcnt versions are the
> opcodes for "popcnt %eax, %eax" and "popcnt %rax, %rax", respectively,
> so %rAX has to be preloaded with the bitmask and the computed value
> has to be retrieved from there afterwards. And yes, it looks not that
> elegant so I'm open for suggestions.
>
> The good thing is, this should work on any toolchain since we don't rely
> on the compiler to know about popcnt and we're protected by CPUID flag
> so that the hw-popcnt version is used only on processors which support
> it.
>
> Please take a good look and let me know what do you guys think.

> +int arch_hweight_long(unsigned long w)
> +{
> + if (sizeof(w) == 4) {
> + asm volatile("movl %[w], %%eax" :: [w] "r" (w));
> + alternative("call _hweight32",
> + "call _popcnt32",
> + X86_FEATURE_POPCNT);
> + asm volatile("" : "=a" (w));
> +
> + } else {
> + asm volatile("movq %[w], %%rax" :: [w] "r" (w));
> + alternative("call _hweight64",
> + "call _popcnt64",
> + X86_FEATURE_POPCNT);
> + asm volatile("" : "=a" (w));
> + }
> + return w;
> +}

hweight_long() isn't an arch primitive, only __arch_hweight{8,16,32,64}
are.

2010-02-04 15:21:46

by H. Peter Anvin

Subject: Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)

On 02/04/2010 07:10 AM, Borislav Petkov wrote:
>>
>> Arguably the "best" option is to have the alternative being a jump to an
>> out-of-line stub which does the necessary parameter marshalling before
>> calling a stub. This technique is already used in a few other places.
>
> Ok, here's a first alpha prototype and completely untested. The asm
> output looks ok though. I've added separate 32-bit and 64-bit helpers in
> order to dispense with the if-else tests. The hw-popcnt versions are the
> opcodes for "popcnt %eax, %eax" and "popcnt %rax, %rax", respectively,
> so %rAX has to be preloaded with the bitmask and the computed value
> has to be retrieved from there afterwards. And yes, it looks not that
> elegant so I'm open for suggestions.
>
> The good thing is, this should work on any toolchain since we don't rely
> on the compiler to know about popcnt and we're protected by CPUID flag
> so that the hw-popcnt version is used only on processors which support
> it.
>
> Please take a good look and let me know what do you guys think.
>

I think you misunderstood me. The idea was to use the alternatives
mechanism to put the popcnt instruction inline instead of the call.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2010-02-04 15:39:34

by Brian Gerst

Subject: Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)

2010/2/4 Borislav Petkov <[email protected]>:
> On Wed, Feb 03, 2010 at 11:49:54AM -0800, H. Peter Anvin wrote:
>> On 02/03/2010 10:47 AM, Peter Zijlstra wrote:
>> > On Wed, 2010-02-03 at 19:14 +0100, Borislav Petkov wrote:
>> >
>> >> alternative("call hweightXX", "popcnt", X86_FEATURE_POPCNT)
>> >
>> > Make sure to apply a 0xff bitmask to the popcnt r16 call for hweight8(),
>> > and hweight64() needs a bit of magic for 32bit, but yes, something like
>> > that ought to work nicely.
>> >
>>
>> Arguably the "best" option is to have the alternative being a jump to an
>> out-of-line stub which does the necessary parameter marshalling before
>> calling a stub.  This technique is already used in a few other places.
>
> Ok, here's a first alpha prototype and completely untested. The asm
> output looks ok though. I've added separate 32-bit and 64-bit helpers in
> order to dispense with the if-else tests. The hw-popcnt versions are the
> opcodes for "popcnt %eax, %eax" and "popcnt %rax, %rax", respectively,
> so %rAX has to be preloaded with the bitmask and the computed value
> has to be retrieved from there afterwards. And yes, it looks not that
> elegant so I'm open for suggestions.
>
> The good thing is, this should work on any toolchain since we don't rely
> on the compiler to know about popcnt and we're protected by CPUID flag
> so that the hw-popcnt version is used only on processors which support
> it.
>
> Please take a good look and let me know what do you guys think.
>
> Thanks.
>
> --
>  arch/x86/include/asm/bitops.h |    4 ++
>  arch/x86/lib/Makefile         |    2 +-
>  arch/x86/lib/popcnt.c         |   62 +++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 67 insertions(+), 1 deletions(-)
>  create mode 100644 arch/x86/lib/popcnt.c
>
> diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
> index 02b47a6..deb5013 100644
> --- a/arch/x86/include/asm/bitops.h
> +++ b/arch/x86/include/asm/bitops.h
> @@ -434,6 +434,10 @@ static inline int fls(int x)
>  #endif
>        return r + 1;
>  }
> +
> +
> +extern int arch_hweight_long(unsigned long);
> +
>  #endif /* __KERNEL__ */
>
>  #undef ADDR
> diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile
> index cffd754..c03fe2d 100644
> --- a/arch/x86/lib/Makefile
> +++ b/arch/x86/lib/Makefile
> @@ -22,7 +22,7 @@ lib-y += usercopy_$(BITS).o getuser.o putuser.o
>  lib-y += memcpy_$(BITS).o
>  lib-$(CONFIG_KPROBES) += insn.o inat.o
>
> -obj-y += msr.o msr-reg.o msr-reg-export.o
> +obj-y += msr.o msr-reg.o msr-reg-export.o popcnt.o
>
>  ifeq ($(CONFIG_X86_32),y)
>         obj-y += atomic64_32.o
> diff --git a/arch/x86/lib/popcnt.c b/arch/x86/lib/popcnt.c
> new file mode 100644
> index 0000000..179a6e8
> --- /dev/null
> +++ b/arch/x86/lib/popcnt.c
> @@ -0,0 +1,62 @@
> +#include <linux/kernel.h>
> +#include <linux/bitops.h>
> +
> +int _hweight32(void)
> +{
> +       unsigned long w;
> +
> +       asm volatile("" : "=a" (w));
> +
> +       return hweight32(w);
> +

You should just use a normal parameter here (use %rdi for 64-bit).
The way you have it written isn't guaranteed to work. GCC could
clobber %eax/%rax before you read it, although it is highly unlikely.
For a good example of how this should be written, look at
__mutex_fastpath_lock().

--
Brian Gerst

2010-02-04 15:54:13

by Borislav Petkov

Subject: Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)

On Thu, Feb 04, 2010 at 04:13:52PM +0100, Peter Zijlstra wrote:
> hweight_long() isn't an arch primitive, only __arch_hweight{8,16,32,64}
> are.

Yeah, I'm still looking for the proper location. hweight_long() is the
generic version so do we want to do the

#ifndef __HAVE_ARCH_POPCNT
static inline unsigned long hweight_long(unsigned long w)
...

#endif

thing and define an x86-specific version?

--
Regards/Gruss,
Boris.

--
Advanced Micro Devices, Inc.
Operating Systems Research Center

2010-02-04 16:04:56

by Peter Zijlstra

Subject: Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)

On Thu, 2010-02-04 at 16:54 +0100, Borislav Petkov wrote:
> On Thu, Feb 04, 2010 at 04:13:52PM +0100, Peter Zijlstra wrote:
> > hweight_long() isn't an arch primitive, only __arch_hweight{8,16,32,64}
> > are.
>
> Yeah, I'm still looking for the proper location. hweight_long() is the
> generic version so do we want to do the
>
> #ifndef __HAVE_ARCH_POPCNT
> static inline unsigned long hweight_long(unsigned long w)
> ....
>
> #endif
>
> thing and define a x86-specific version?

No, just don't touch hweight_long(), simply provide
__arch_hweight{8,16,32,64} and all will be well.

hweight_long() is provided by <include/linux/bitops.h> and is constructed from
hweight32() and hweight64() depending on the actual word size.

2010-02-05 12:11:29

by Borislav Petkov

Subject: Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)

On Thu, Feb 04, 2010 at 05:04:17PM +0100, Peter Zijlstra wrote:
> No, just don't touch hweight_long(), simply provide
> __arch_hweight{8,16,32,64} and all will be well.

Ok, another day, another version :)

It is, of course, completely untested but it builds and the asm looks
ok. I think I've addressed all concerns so far.

--
arch/x86/include/asm/hweight.h | 14 ++++++++
arch/x86/lib/Makefile | 2 +-
arch/x86/lib/hweight.c | 57 ++++++++++++++++++++++++++++++++++
include/asm-generic/bitops/hweight.h | 45 ++++++++++++++++++++++++--
lib/hweight.c | 16 +++++-----
5 files changed, 121 insertions(+), 13 deletions(-)
create mode 100644 arch/x86/include/asm/hweight.h
create mode 100644 arch/x86/lib/hweight.c

diff --git a/arch/x86/include/asm/hweight.h b/arch/x86/include/asm/hweight.h
new file mode 100644
index 0000000..762125f
--- /dev/null
+++ b/arch/x86/include/asm/hweight.h
@@ -0,0 +1,14 @@
+#ifndef _ASM_X86_HWEIGHT_H
+#define _ASM_X86_HWEIGHT_H
+
+#define __arch_hweight8 __arch_hweight8
+#define __arch_hweight16 __arch_hweight16
+#define __arch_hweight32 __arch_hweight32
+#define __arch_hweight64 __arch_hweight64
+
+extern unsigned int __arch_hweight8(unsigned int);
+extern unsigned int __arch_hweight16(unsigned int);
+extern unsigned int __arch_hweight32(unsigned int);
+extern unsigned long __arch_hweight64(__u64);
+
+#endif /*_ASM_X86_HWEIGHT_H */
diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile
index cffd754..e811bbd 100644
--- a/arch/x86/lib/Makefile
+++ b/arch/x86/lib/Makefile
@@ -22,7 +22,7 @@ lib-y += usercopy_$(BITS).o getuser.o putuser.o
lib-y += memcpy_$(BITS).o
lib-$(CONFIG_KPROBES) += insn.o inat.o

-obj-y += msr.o msr-reg.o msr-reg-export.o
+obj-y += msr.o msr-reg.o msr-reg-export.o hweight.o

ifeq ($(CONFIG_X86_32),y)
obj-y += atomic64_32.o
diff --git a/arch/x86/lib/hweight.c b/arch/x86/lib/hweight.c
new file mode 100644
index 0000000..3cf51c8
--- /dev/null
+++ b/arch/x86/lib/hweight.c
@@ -0,0 +1,57 @@
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/bitops.h>
+
+#define POPCNT32 ".byte 0xf3\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xff"
+#define POPCNT64 ".byte 0xf3\n\t.byte 0x48\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xff"
+
+#define __arch_hweight_alt(size) \
+ ALTERNATIVE("call _hweight"#size, POPCNT##size, X86_FEATURE_POPCNT)
+
+unsigned int __arch_hweight16(unsigned int w)
+{
+ unsigned int res = 0;
+
+ asm volatile("xor %%dh, %%dh\n\t"
+ __arch_hweight_alt(32)
+ : "=di" (res)
+ : "di" (w)
+ : "ecx", "memory");
+
+ return res;
+}
+EXPORT_SYMBOL(__arch_hweight16);
+
+unsigned int __arch_hweight8(unsigned int w)
+{
+ return __arch_hweight16(w & 0xff);
+}
+EXPORT_SYMBOL(__arch_hweight8);
+
+unsigned int __arch_hweight32(unsigned int w)
+{
+ unsigned int res = 0;
+
+ asm volatile(__arch_hweight_alt(32)
+ : "=di" (res)
+ : "di" (w)
+ : "ecx", "memory");
+
+ return res;
+
+}
+EXPORT_SYMBOL(__arch_hweight32);
+
+unsigned long __arch_hweight64(__u64 w)
+{
+ unsigned int res = 0;
+
+ asm volatile(__arch_hweight_alt(64)
+ : "=di" (res)
+ : "di" (w)
+ : "rsi", "rcx", "r8", "r9", "r10", "r11",
+ "memory");
+
+ return res;
+}
+EXPORT_SYMBOL(__arch_hweight64);
diff --git a/include/asm-generic/bitops/hweight.h b/include/asm-generic/bitops/hweight.h
index fbbc383..340ad4e 100644
--- a/include/asm-generic/bitops/hweight.h
+++ b/include/asm-generic/bitops/hweight.h
@@ -2,10 +2,47 @@
#define _ASM_GENERIC_BITOPS_HWEIGHT_H_

#include <asm/types.h>
+#include <asm/hweight.h>

-extern unsigned int hweight32(unsigned int w);
-extern unsigned int hweight16(unsigned int w);
-extern unsigned int hweight8(unsigned int w);
-extern unsigned long hweight64(__u64 w);
+extern unsigned int _hweight32(unsigned int w);
+extern unsigned int _hweight16(unsigned int w);
+extern unsigned int _hweight8(unsigned int w);
+extern unsigned long _hweight64(__u64 w);
+
+static inline unsigned int hweight8(unsigned int w)
+{
+#ifdef __arch_hweight8
+ return __arch_hweight8(w);
+#else
+ return _hweight8(w);
+#endif
+}
+
+static inline unsigned int hweight16(unsigned int w)
+{
+#ifdef __arch_hweight16
+ return __arch_hweight16(w);
+#else
+ return _hweight16(w);
+#endif
+}
+
+static inline unsigned int hweight32(unsigned int w)
+{
+#ifdef __arch_hweight32
+ return __arch_hweight32(w);
+#else
+ return _hweight32(w);
+#endif
+}
+
+static inline unsigned long hweight64(__u64 w)
+{
+#ifdef __arch_hweight64
+ return __arch_hweight64(w);
+#else
+ return _hweight64(w);
+#endif
+}

#endif /* _ASM_GENERIC_BITOPS_HWEIGHT_H_ */
diff --git a/lib/hweight.c b/lib/hweight.c
index 389424e..f7b81a1 100644
--- a/lib/hweight.c
+++ b/lib/hweight.c
@@ -9,7 +9,7 @@
* The Hamming Weight of a number is the total number of bits set in it.
*/

-unsigned int hweight32(unsigned int w)
+unsigned int _hweight32(unsigned int w)
{
unsigned int res = w - ((w >> 1) & 0x55555555);
res = (res & 0x33333333) + ((res >> 2) & 0x33333333);
@@ -17,26 +17,26 @@ unsigned int hweight32(unsigned int w)
res = res + (res >> 8);
return (res + (res >> 16)) & 0x000000FF;
}
-EXPORT_SYMBOL(hweight32);
+EXPORT_SYMBOL(_hweight32);

-unsigned int hweight16(unsigned int w)
+unsigned int _hweight16(unsigned int w)
{
unsigned int res = w - ((w >> 1) & 0x5555);
res = (res & 0x3333) + ((res >> 2) & 0x3333);
res = (res + (res >> 4)) & 0x0F0F;
return (res + (res >> 8)) & 0x00FF;
}
-EXPORT_SYMBOL(hweight16);
+EXPORT_SYMBOL(_hweight16);

-unsigned int hweight8(unsigned int w)
+unsigned int _hweight8(unsigned int w)
{
unsigned int res = w - ((w >> 1) & 0x55);
res = (res & 0x33) + ((res >> 2) & 0x33);
return (res + (res >> 4)) & 0x0F;
}
-EXPORT_SYMBOL(hweight8);
+EXPORT_SYMBOL(_hweight8);

-unsigned long hweight64(__u64 w)
+unsigned long _hweight64(__u64 w)
{
#if BITS_PER_LONG == 32
return hweight32((unsigned int)(w >> 32)) + hweight32((unsigned int)w);
@@ -56,4 +56,4 @@ unsigned long hweight64(__u64 w)
#endif
#endif
}
-EXPORT_SYMBOL(hweight64);
+EXPORT_SYMBOL(_hweight64);
--
1.6.4.2


--
Regards/Gruss,
Boris.

-
Advanced Micro Devices, Inc.
Operating Systems Research Center

2010-02-05 12:16:30

by Peter Zijlstra

Subject: Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)

On Fri, 2010-02-05 at 13:11 +0100, Borislav Petkov wrote:
> On Thu, Feb 04, 2010 at 05:04:17PM +0100, Peter Zijlstra wrote:
> > No, just don't touch hweight_long(), simply provide
> > __arch_hweight{8,16,32,64} and all will be well.
>
> Ok, another day, another version :)
>
> It is, of course, completely untested but it builds and the asm looks
> ok. I think I've addressed all concerns so far.

please work against a tree that has:

http://lkml.org/lkml/2010/2/4/119

in.

2010-02-05 21:59:43

by H. Peter Anvin

Subject: Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)

On 02/05/2010 04:11 AM, Borislav Petkov wrote:
> +
> +unsigned int __arch_hweight16(unsigned int w)
> +{
> + unsigned int res = 0;
> +
> + asm volatile("xor %%dh, %%dh\n\t"
> + __arch_hweight_alt(32)
> + : "=di" (res)
> + : "di" (w)
> + : "ecx", "memory");
> +

This is wrong in more ways than I can shake a stick at.

a) "di" doesn't mean the DI register - it means the DX register (d) or
an immediate (i). Since you don't have any reference to either %0 or %1
in your code, you have no way of knowing which one it is. The
constraint for the di register is "D".

b) On 32 bits, the first argument register is in %eax (with %edx used
for the upper half of a 32-bit argument), but on 64 bits, the first
argument is in %rdi, with the return still in %rax.

c) You call a C function, but you don't clobber the set of registers
that a C function would clobber. You either need to put the function in
an assembly wrapper (which is better in the long run), or clobber the
full set of registers that is clobbered by a C function (which is better
in the short term) -- which is eax, edx, ecx on 32 bits, but rax, rdi,
esi, rdx, rcx, r8, r9, r10, r11 on 64 bits.

d) On the other hand, you do *not* need a "memory" clobber.
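Putting (a) and (c) together, the "clobber everything a C call may clobber"
variant could look roughly like this - a 64-bit-only sketch with illustrative
names (using the __sw_hweight32 name that later versions in this thread settle
on), not code from this thread:

extern unsigned int __sw_hweight32(unsigned int w);	/* out-of-line software fallback */

static inline unsigned int hweight32_call_sketch(unsigned int w)
{
	unsigned int res;
	unsigned long arg = w;	/* first argument travels in %rdi ("D") */

	asm ("call __sw_hweight32"
	     : "=a" (res), "+D" (arg)	/* result in %rax, %rdi treated as trashed */
	     :
	     : "rsi", "rdx", "rcx", "r8", "r9", "r10", "r11", "cc");

	return res;
}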

-hpa

2010-02-06 09:36:46

by Borislav Petkov

Subject: Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)

On Fri, Feb 05, 2010 at 01:54:42PM -0800, H. Peter Anvin wrote:
> On 02/05/2010 04:11 AM, Borislav Petkov wrote:
> > +
> > +unsigned int __arch_hweight16(unsigned int w)
> > +{
> > + unsigned int res = 0;
> > +
> > + asm volatile("xor %%dh, %%dh\n\t"
> > + __arch_hweight_alt(32)
> > + : "=di" (res)
> > + : "di" (w)
> > + : "ecx", "memory");
> > +
>
> This is wrong in more ways than I can shake a stick at.

Thanks for reviewing it though - how else would I learn :).

> a) "di" doesn't mean the DI register - it means the DX register (d) or
> an immediate (i). Since you don't have any reference to either %0 or %1
> in your code, you have no way of knowing which one it is. The
> constraint for the di register is "D".

right.

> b) On 32 bits, the first argument register is in %eax (with %edx used
> for the upper half of a 32-bit argument), but on 64 bits, the first
> argument is in %rdi, with the return still in %rax.

Sure, it is right there in arch/x86/include/asm/calling.h. Shame on me.

> c) You call a C function, but you don't clobber the set of registers
> that a C function would clobber. You either need to put the function in
> an assembly wrapper (which is better in the long run), or clobber the
> full set of registers that is clobbered by a C function (which is better
> in the short term) -- which is eax, edx, ecx on 32 bits, but rax, rdi,
> esi, rdx, rcx, r8, r9, r10, r11 on 64 bits.

I think you mean rsi instead of esi here.

Well, the example Brian pointed me to - __mutex_fastpath_lock - lists
the full set of clobbered registers. Please elaborate on the assembly
wrapper for the function - wouldn't I need to list all the clobbered
registers there too, or am I missing something?

> d) On the other hand, you do *not* need a "memory" clobber.

Right, in this case we have all non-barrier-like inlines, so no memory
clobber, according to the comment above the alternative() macro.

Thanks.

--
Regards/Gruss,
Boris.

-
Advanced Micro Devices, Inc.
Operating Systems Research Center

2010-02-07 02:00:47

by H. Peter Anvin

Subject: Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)

On 02/06/2010 01:36 AM, Borislav Petkov wrote:
>
>> c) You call a C function, but you don't clobber the set of registers
>> that a C function would clobber. You either need to put the function in
>> an assembly wrapper (which is better in the long run), or clobber the
>> full set of registers that is clobbered by a C function (which is better
>> in the short term) -- which is eax, edx, ecx on 32 bits, but rax, rdi,
>> esi, rdx, rcx, r8, r9, r10, r11 on 64 bits.
>
> I think you mean rsi instead of esi here.
>
> Well, the example Brian pointed me to - __mutex_fastpath_lock - lists
> the full set of clobbered registers. Please elaborate on the assembly
> wrapper for the function, wouldn't I need to list all the clobbered
> registers there too or am I missing something?
>

The notion there would be that you do push/pop in the assembly wrapper.

>> d) On the other hand, you do *not* need a "memory" clobber.
>
> Right, in this case we have all non-barrier like inlines so no memory
> clobber, according to the comment above alternative() macro.

OK, I'm missing something here.

A few more notions:

a. This is exactly the kind of code where you don't want to put
"volatile" on your asm statement, because it's a pure compute.

b. It is really rather pointless to go through the whole alternatives
work if you are then going to put it inside a function which isn't an
inline ...

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2010-02-08 09:28:53

by Borislav Petkov

Subject: Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)

On Sat, Feb 06, 2010 at 05:55:47PM -0800, H. Peter Anvin wrote:
> > Well, the example Brian pointed me to - __mutex_fastpath_lock - lists
> > the full set of clobbered registers. Please elaborate on the assembly
> > wrapper for the function, wouldn't I need to list all the clobbered
> > registers there too or am I missing something?
> >
>
> The notion there would be that you do push/pop in the assembly wrapper.

Oh yes, something similar to SAVE/RESTORE_ALL in
<arch/x86/kernel/entry_32.S> could work. Good idea!

> >> d) On the other hand, you do *not* need a "memory" clobber.
> >
> > Right, in this case we have all non-barrier like inlines so no memory
> > clobber, according to the comment above alternative() macro.
>
> OK, I'm missing something here.
>
> A few more notions:
>
> a. This is exactly the kind of code where you don't want to put
> "volatile" on your asm statement, because it's a pure compute.
>
> b. It is really rather pointless to go through the whole alternatives
> work if you are then going to put it inside a function which isn't an
> inline ...

Well, in the second version I did replace a 'call _hweightXX' with
the actual popcnt opcode, so the alternatives machinery is only needed to
do the replacement during boot. We might just as well do

if (X86_FEATURE_POPCNT)
__hw_popcnt()
else
__software_hweight()

The only advantage of the alternatives is that it would save us the
if-else test above each time we do cpumask_weight. However, the if-else
approach is much more readable and obviates the need for all that macro
magic and taking special care of calling a C function from within asm. And
since we do not call cpumask_weight all that often, I'll honestly opt for
the alternatives-less solution...

Hmm...
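For completeness, the alternatives-less variant sketched above, spelled out
(illustrative only; __hw_popcnt32() is a hypothetical hardware helper and
__sw_hweight32() the software fallback):

#include <asm/cpufeature.h>

extern unsigned int __hw_popcnt32(unsigned int w);	/* hypothetical popcnt wrapper */
extern unsigned int __sw_hweight32(unsigned int w);	/* software fallback */

static inline unsigned int hweight32_branch(unsigned int w)
{
	/* runtime CPUID-flag test on every call instead of boot-time patching */
	if (boot_cpu_has(X86_FEATURE_POPCNT))
		return __hw_popcnt32(w);

	return __sw_hweight32(w);
}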

Thanks,
Boris.

2010-02-08 09:40:05

by H. Peter Anvin

Subject: Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)

On 02/08/2010 01:28 AM, Borislav Petkov wrote:
>
> Well, in the second version I did replace a 'call _hweightXX' with
> the actual popcnt opcode so the alternatives is only needed to do the
> replacement during boot. We might just as well do
>
> if (X86_FEATURE_POPCNT)
> __hw_popcnt()
> else
> __software_hweight()
>
> The only advantage of the alternatives is that it would save us the
> if-else test above each time we do cpumask_weight. However, the if-else
> approach is much more readable and obviates the need for all that macro
> magic and taking special care of calling c function from within asm. And
> since we do not call cpumask_weight all that often I'll honestly opt for
> alternative-less solution...
>

The highest performance will be gotten by alternatives, but it only makes
sense if they are inlined at the point of use... otherwise it's
basically pointless.

-hpa
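To make the "inlined at the point of use" idea concrete, here is a 32-bit-flavoured
sketch (not a patch from this thread) of an ALTERNATIVE living in a static inline.
It assumes the out-of-line __sw_hweight32() fallback preserves every register
except %eax - which is what the -fcall-saved-* approach discussed further down
arranges - and it reuses the byte-encoded 32-bit popcnt from the patches above;
on 32-bit the first argument travels in %eax.

#include <asm/alternative.h>
#include <asm/cpufeature.h>

/* "popcnt %eax, %eax", byte-encoded so older assemblers need not know it */
#define POPCNT32_SKETCH ".byte 0xf3,0x0f,0xb8,0xc0"

extern unsigned int __sw_hweight32(unsigned int w);

static inline unsigned int __arch_hweight32_sketch(unsigned int w)
{
	unsigned int res;

	/* patched at boot: the call is replaced by popcnt on capable CPUs */
	asm (ALTERNATIVE("call __sw_hweight32", POPCNT32_SKETCH, X86_FEATURE_POPCNT)
	     : "=a" (res)
	     : "a" (w));

	return res;
}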

2010-02-08 09:59:54

by Borislav Petkov

Subject: Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)

On Mon, Feb 08, 2010 at 01:35:41AM -0800, H. Peter Anvin wrote:
> On 02/08/2010 01:28 AM, Borislav Petkov wrote:
> >
> >Well, in the second version I did replace a 'call _hweightXX' with
> >the actual popcnt opcode so the alternatives is only needed to do the
> >replacement during boot. We might just as well do
> >
> >if (X86_FEATURE_POPCNT)
> > __hw_popcnt()
> >else
> > __software_hweight()
> >
> >The only advantage of the alternatives is that it would save us the
> >if-else test above each time we do cpumask_weight. However, the if-else
> >approach is much more readable and obviates the need for all that macro
> >magic and taking special care of calling c function from within asm. And
> >since we do not call cpumask_weight all that often I'll honestly opt for
> >alternative-less solution...
> >
>
> The highest performance will be gotten by alternatives, but it only
> make sense if they are inlined at the point of use... otherwise it's
> basically pointless.

The popcnt-replacement part of the alternative would be as fast as
possible since we're adding the opcode there but the slow version would
add the additional overhead of saving/restoring the registers before
calling the software hweight implementation. I'll do some tracing to see
what a change like that would cost on machines which don't have popcnt.

Let me prep another version when I get back on Wed. (currently
travelling) with all the stuff we discussed to see how it turns out.

Thanks,
Boris.

2010-02-11 17:24:07

by Borislav Petkov

Subject: Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)

On Mon, Feb 08, 2010 at 10:59:45AM +0100, Borislav Petkov wrote:
> Let me prep another version when I get back on Wed. (currently
> travelling) with all the stuff we discussed to see how it would turn.

Ok, here's another version on top of PeterZ's patch at
http://lkml.org/lkml/2010/2/4/119. I need to handle 32-bit and 64-bit
differently wrt the popcnt opcode, so on 32-bit I do "popcnt %eax, %eax"
while on 64-bit I do "popcnt %rdi, %rdi".

I also did some rudimentary tracing with the function graph tracer of
all the cpumask_weight() calls in <kernel/sched.c> while doing a kernel
compile, and the preliminary results show that hweight in software takes
about 9.768 usecs at the longest, while the hardware popcnt takes about
8.515 usecs. The machine is a Fam10 revB2 quadcore.

What remains to be done is to see whether the saving/restoring of
callee-clobbered regs with this patch has any noticeable negative
effects on the software hweight case on machines which don't support
popcnt. Also, I'm open to better tracing ideas :).

Thanks.

---
arch/alpha/include/asm/bitops.h | 4 +
arch/ia64/include/asm/bitops.h | 1 +
arch/sparc/include/asm/bitops_64.h | 4 +
arch/x86/include/asm/bitops.h | 10 +++
arch/x86/lib/Makefile | 2 +-
arch/x86/lib/hweight.c | 96 ++++++++++++++++++++++++++++
include/asm-generic/bitops/arch_hweight.h | 8 +-
include/asm-generic/bitops/const_hweight.h | 43 +++++++++++-
lib/hweight.c | 20 +++---
9 files changed, 169 insertions(+), 19 deletions(-)
create mode 100644 arch/x86/lib/hweight.c

diff --git a/arch/alpha/include/asm/bitops.h b/arch/alpha/include/asm/bitops.h
index 296da1d..7fe6943 100644
--- a/arch/alpha/include/asm/bitops.h
+++ b/arch/alpha/include/asm/bitops.h
@@ -409,21 +409,25 @@ static inline unsigned long __arch_hweight64(unsigned long w)
{
return __kernel_ctpop(w);
}
+#define __arch_hweight64 __arch_hweight64

static inline unsigned int __arch_weight32(unsigned int w)
{
return __arch_hweight64(w);
}
+#define __arch_hweight32 __arch_hweight32

static inline unsigned int __arch_hweight16(unsigned int w)
{
return __arch_hweight64(w & 0xffff);
}
+#define __arch_hweight16 __arch_hweight16

static inline unsigned int __arch_hweight8(unsigned int w)
{
return __arch_hweight64(w & 0xff);
}
+#define __arch_hweight8 __arch_hweight8
#else
#include <asm-generic/bitops/arch_hweight.h>
#endif
diff --git a/arch/ia64/include/asm/bitops.h b/arch/ia64/include/asm/bitops.h
index 9da3df6..9b64f59 100644
--- a/arch/ia64/include/asm/bitops.h
+++ b/arch/ia64/include/asm/bitops.h
@@ -443,6 +443,7 @@ static __inline__ unsigned long __arch_hweight64(unsigned long x)
result = ia64_popcnt(x);
return result;
}
+#define __arch_hweight64 __arch_hweight64

#define __arch_hweight32(x) ((unsigned int) __arch_hweight64((x) & 0xfffffffful))
#define __arch_hweight16(x) ((unsigned int) __arch_hweight64((x) & 0xfffful))
diff --git a/arch/sparc/include/asm/bitops_64.h b/arch/sparc/include/asm/bitops_64.h
index 766121a..4982bc4 100644
--- a/arch/sparc/include/asm/bitops_64.h
+++ b/arch/sparc/include/asm/bitops_64.h
@@ -51,6 +51,7 @@ static inline unsigned int __arch_hweight64(unsigned long w)
__asm__ ("popc %1,%0" : "=r" (res) : "r" (w));
return res;
}
+#define __arch_hweight64 __arch_hweight64

static inline unsigned int __arch_hweight32(unsigned int w)
{
@@ -59,6 +60,7 @@ static inline unsigned int __arch_hweight32(unsigned int w)
__asm__ ("popc %1,%0" : "=r" (res) : "r" (w & 0xffffffff));
return res;
}
+#define __arch_hweight32 __arch_hweight32

static inline unsigned int __arch_hweight16(unsigned int w)
{
@@ -67,6 +69,7 @@ static inline unsigned int __arch_hweight16(unsigned int w)
__asm__ ("popc %1,%0" : "=r" (res) : "r" (w & 0xffff));
return res;
}
+#define __arch_hweight16 __arch_hweight16

static inline unsigned int __arch_hweight8(unsigned int w)
{
@@ -75,6 +78,7 @@ static inline unsigned int __arch_hweight8(unsigned int w)
__asm__ ("popc %1,%0" : "=r" (res) : "r" (w & 0xff));
return res;
}
+#define __arch_hweight8 __arch_hweight8

#else

diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 02b47a6..187defe 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -444,6 +444,16 @@ static inline int fls(int x)

#define ARCH_HAS_FAST_MULTIPLIER 1

+extern unsigned int __arch_hweight32(unsigned int w);
+extern unsigned int __arch_hweight16(unsigned int w);
+extern unsigned int __arch_hweight8(unsigned int w);
+extern unsigned long __arch_hweight64(__u64 w);
+
+#define __arch_hweight32 __arch_hweight32
+#define __arch_hweight16 __arch_hweight16
+#define __arch_hweight8 __arch_hweight8
+#define __arch_hweight64 __arch_hweight64
+
#include <asm-generic/bitops/hweight.h>

#endif /* __KERNEL__ */
diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile
index cffd754..e811bbd 100644
--- a/arch/x86/lib/Makefile
+++ b/arch/x86/lib/Makefile
@@ -22,7 +22,7 @@ lib-y += usercopy_$(BITS).o getuser.o putuser.o
lib-y += memcpy_$(BITS).o
lib-$(CONFIG_KPROBES) += insn.o inat.o

-obj-y += msr.o msr-reg.o msr-reg-export.o
+obj-y += msr.o msr-reg.o msr-reg-export.o hweight.o

ifeq ($(CONFIG_X86_32),y)
obj-y += atomic64_32.o
diff --git a/arch/x86/lib/hweight.c b/arch/x86/lib/hweight.c
new file mode 100644
index 0000000..4a1c5c9
--- /dev/null
+++ b/arch/x86/lib/hweight.c
@@ -0,0 +1,96 @@
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/bitops.h>
+
+/* popcnt %eax, %eax */
+#define POPCNT32 ".byte 0xf3\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xc0"
+
+/* popcnt %rdi, %rdi */
+#define POPCNT64 ".byte 0xf3\n\t.byte 0x48\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xff"
+
+#define PUSH_CLOBBERED_32 \
+ "pushl %%edx\n\t" \
+ "pushl %%ecx\n\t"
+
+#define POP_CLOBBERED_32 \
+ "\n\t" \
+ "popl %%ecx\n\t" \
+ "popl %%edx\n\t"
+
+#define PUSH_CLOBBERED_64 \
+ "pushq %%rax\n\t" \
+ "pushq %%rsi\n\t" \
+ "pushq %%rdx\n\t" \
+ "pushq %%rcx\n\t" \
+ "pushq %%r8\n\t" \
+ "pushq %%r9\n\t" \
+ "pushq %%r10\n\t" \
+ "pushq %%r11\n\t"
+
+#define POP_CLOBBERED_64 \
+ "\n\t" \
+ "popq %%r11\n\t" \
+ "popq %%r10\n\t" \
+ "popq %%r9\n\t" \
+ "popq %%r8\n\t" \
+ "popq %%rcx\n\t" \
+ "popq %%rdx\n\t" \
+ "popq %%rsi\n\t" \
+ "popq %%rax\n\t"
+
+#ifdef CONFIG_64BIT
+#define PUSH_CLOBBERED PUSH_CLOBBERED_64
+#define POP_CLOBBERED POP_CLOBBERED_64
+#define POPCNT POPCNT64
+#define ARG0 "D"
+#else
+#define PUSH_CLOBBERED PUSH_CLOBBERED_32
+#define POP_CLOBBERED POP_CLOBBERED_32
+#define POPCNT POPCNT32
+#define ARG0 "a"
+#endif
+
+unsigned int __arch_hweight32(unsigned int w)
+{
+ unsigned int res = 0;
+
+ asm volatile(PUSH_CLOBBERED
+ ALTERNATIVE("call __sw_hweight32", POPCNT, X86_FEATURE_POPCNT)
+ POP_CLOBBERED
+ : "="ARG0 (res)
+ : ARG0 (w));
+
+ return res;
+}
+EXPORT_SYMBOL(__arch_hweight32);
+
+unsigned int __arch_hweight16(unsigned int w)
+{
+ return __arch_hweight32(w & 0xffff);
+}
+EXPORT_SYMBOL(__arch_hweight16);
+
+unsigned int __arch_hweight8(unsigned int w)
+{
+ return __arch_hweight32(w & 0xff);
+}
+EXPORT_SYMBOL(__arch_hweight8);
+
+unsigned long __arch_hweight64(__u64 w)
+{
+ unsigned long res = 0;
+
+#ifdef CONFIG_X86_32
+ return __arch_hweight32((u32)w) +
+ __arch_hweight32((u32)(w >> 32));
+#else
+ asm volatile(PUSH_CLOBBERED
+ ALTERNATIVE("call __sw_hweight64", POPCNT, X86_FEATURE_POPCNT)
+ POP_CLOBBERED
+ : "="ARG0 (res)
+ : ARG0 (w));
+#endif /* CONFIG_X86_32 */
+
+ return res;
+}
+EXPORT_SYMBOL(__arch_hweight64);
diff --git a/include/asm-generic/bitops/arch_hweight.h b/include/asm-generic/bitops/arch_hweight.h
index 3a7be84..c72de8b 100644
--- a/include/asm-generic/bitops/arch_hweight.h
+++ b/include/asm-generic/bitops/arch_hweight.h
@@ -3,9 +3,9 @@

#include <asm/types.h>

-extern unsigned int __arch_hweight32(unsigned int w);
-extern unsigned int __arch_hweight16(unsigned int w);
-extern unsigned int __arch_hweight8(unsigned int w);
-extern unsigned long __arch_hweight64(__u64 w);
+extern unsigned int __sw_hweight32(unsigned int w);
+extern unsigned int __sw_hweight16(unsigned int w);
+extern unsigned int __sw_hweight8(unsigned int w);
+extern unsigned long __sw_hweight64(__u64 w);

#endif /* _ASM_GENERIC_BITOPS_HWEIGHT_H_ */
diff --git a/include/asm-generic/bitops/const_hweight.h b/include/asm-generic/bitops/const_hweight.h
index fa2a50b..8cf8169 100644
--- a/include/asm-generic/bitops/const_hweight.h
+++ b/include/asm-generic/bitops/const_hweight.h
@@ -18,13 +18,48 @@
#define __const_hweight32(w) (__const_hweight16(w) + __const_hweight16((w) >> 16))
#define __const_hweight64(w) (__const_hweight32(w) + __const_hweight32((w) >> 32))

+static inline unsigned int _hweight8(unsigned int w)
+{
+#ifdef __arch_hweight8
+ return __arch_hweight8(w);
+#else
+ return __sw_hweight8(w);
+#endif
+}
+
+static inline unsigned int _hweight16(unsigned int w)
+{
+#ifdef __arch_hweight16
+ return __arch_hweight16(w);
+#else
+ return __sw_hweight16(w);
+#endif
+}
+
+static inline unsigned int _hweight32(unsigned int w)
+{
+#ifdef __arch_hweight32
+ return __arch_hweight32(w);
+#else
+ return __sw_hweight32(w);
+#endif
+}
+
+static inline unsigned long _hweight64(__u64 w)
+{
+#ifdef __arch_hweight64
+ return __arch_hweight64(w);
+#else
+ return __sw_hweight64(w);
+#endif
+}
/*
* Generic interface.
*/
-#define hweight8(w) (__builtin_constant_p(w) ? __const_hweight8(w) : __arch_hweight8(w))
-#define hweight16(w) (__builtin_constant_p(w) ? __const_hweight16(w) : __arch_hweight16(w))
-#define hweight32(w) (__builtin_constant_p(w) ? __const_hweight32(w) : __arch_hweight32(w))
-#define hweight64(w) (__builtin_constant_p(w) ? __const_hweight64(w) : __arch_hweight64(w))
+#define hweight8(w) (__builtin_constant_p(w) ? __const_hweight8(w) : _hweight8(w))
+#define hweight16(w) (__builtin_constant_p(w) ? __const_hweight16(w) : _hweight16(w))
+#define hweight32(w) (__builtin_constant_p(w) ? __const_hweight32(w) : _hweight32(w))
+#define hweight64(w) (__builtin_constant_p(w) ? __const_hweight64(w) : _hweight64(w))

/*
* Interface for known constant arguments
diff --git a/lib/hweight.c b/lib/hweight.c
index 9ff86df..f9ce440 100644
--- a/lib/hweight.c
+++ b/lib/hweight.c
@@ -9,7 +9,7 @@
* The Hamming Weight of a number is the total number of bits set in it.
*/

-unsigned int __arch_hweight32(unsigned int w)
+unsigned int __sw_hweight32(unsigned int w)
{
unsigned int res = w - ((w >> 1) & 0x55555555);
res = (res & 0x33333333) + ((res >> 2) & 0x33333333);
@@ -17,30 +17,30 @@ unsigned int __arch_hweight32(unsigned int w)
res = res + (res >> 8);
return (res + (res >> 16)) & 0x000000FF;
}
-EXPORT_SYMBOL(__arch_hweight32);
+EXPORT_SYMBOL(__sw_hweight32);

-unsigned int __arch_hweight16(unsigned int w)
+unsigned int __sw_hweight16(unsigned int w)
{
unsigned int res = w - ((w >> 1) & 0x5555);
res = (res & 0x3333) + ((res >> 2) & 0x3333);
res = (res + (res >> 4)) & 0x0F0F;
return (res + (res >> 8)) & 0x00FF;
}
-EXPORT_SYMBOL(__arch_hweight16);
+EXPORT_SYMBOL(__sw_hweight16);

-unsigned int __arch_hweight8(unsigned int w)
+unsigned int __sw_hweight8(unsigned int w)
{
unsigned int res = w - ((w >> 1) & 0x55);
res = (res & 0x33) + ((res >> 2) & 0x33);
return (res + (res >> 4)) & 0x0F;
}
-EXPORT_SYMBOL(__arch_hweight8);
+EXPORT_SYMBOL(__sw_hweight8);

-unsigned long __arch_hweight64(__u64 w)
+unsigned long __sw_hweight64(__u64 w)
{
#if BITS_PER_LONG == 32
- return __arch_hweight32((unsigned int)(w >> 32)) +
- __arch_hweight32((unsigned int)w);
+ return __sw_hweight32((unsigned int)(w >> 32)) +
+ __sw_hweight32((unsigned int)w);
#elif BITS_PER_LONG == 64
#ifdef ARCH_HAS_FAST_MULTIPLIER
w -= (w >> 1) & 0x5555555555555555ul;
@@ -57,4 +57,4 @@ unsigned long __arch_hweight64(__u64 w)
#endif
#endif
}
-EXPORT_SYMBOL(__arch_hweight64);
+EXPORT_SYMBOL(__sw_hweight64);
--
1.6.6.1


--
Regards/Gruss,
Boris.

--
Advanced Micro Devices, Inc.
Operating Systems Research Center

2010-02-11 17:43:12

by H. Peter Anvin

Subject: Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)

On 02/11/2010 09:24 AM, Borislav Petkov wrote:
> On Mon, Feb 08, 2010 at 10:59:45AM +0100, Borislav Petkov wrote:
>> Let me prep another version when I get back on Wed. (currently
>> travelling) with all the stuff we discussed to see how it would turn.
>
> Ok, here's another version ontop of PeterZ's patch at
> http://lkml.org/lkml/2010/2/4/119. I need to handle 32- and 64-bit
> differently wrt to popcnt opcode so on 32-bit I do "popcnt %eax, %eax"
> while on 64-bit I do "popcnt %rdi, %rdi".

On 64 bits it should be "popcnt %rdi, %rax".

> I also did some rudimentary tracing with the function graph tracer of
> all the cpumask_weight-calls in <kernel/sched.c> while doing a kernel
> compile and the preliminary results show that hweight in software takes
> about 9.768 usecs the longest while the hardware popcnt about 8.515
> usecs. The machine is a Fam10 revB2 quadcore.
>
> What remains to be done is see whether the saving/restoring of
> callee-clobbered regs with this patch has any noticeable negative
> effects on the software hweight case on machines which don't support
> popcnt. Also, I'm open for better tracing ideas :).
>
> + asm volatile(PUSH_CLOBBERED
> + ALTERNATIVE("call __sw_hweight64", POPCNT, X86_FEATURE_POPCNT)
> + POP_CLOBBERED
> + : "="ARG0 (res)
> + : ARG0 (w));


Sorry, no.

You don't do the push/pop inline -- if you're going to take the hit of
pushing this into the caller, it's better to list them as explicit
clobbers and let the compiler figure out how to do it. The point of
doing an explicit push/pop is that it can be pushed into the out-of-line
subroutine.

Furthermore, you're still putting "volatile" on there... this is a pure
computation -- no side effects -- so it is exactly when you *shouldn't*
declare your asm statement volatile.

Note: given how simple and regular a popcnt actually is, it might be
preferable to have the out-of-line implementation either in assembly,
or using gcc's -fcall-saved-* options to reduce the number of registers
that are clobbered by the routine.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2010-02-12 17:06:55

by Borislav Petkov

Subject: Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)

On Thu, Feb 11, 2010 at 09:33:49AM -0800, H. Peter Anvin wrote:
> You don't do the push/pop inline -- if you're going to take the hit of
> pushing this into the caller, it's better to list them as explicit
> clobbers and let the compiler figure out how to do it. The point of
> doing an explicit push/pop is that it can be pushed into the out-of-line
> subroutine.

Doh! Of course, this was wrong. See below.

> Furthermore, you're still putting "volatile" on there... this is a pure
> computation -- no side effects -- so it is exactly when you *shouldn't*
> declare your asm statement volatile.

Also gone.

> Note: given how simple and regular a popcnt actually is, it might be
> preferrable to have the out-of-line implementation either in assembly,
> or using gcc's -fcall-saved-* options to reduce the number of registers
> that is clobbered by the routine.

Yep, this is actually a very good idea. And it seems to work - I've
tested it on an F10, K8 and even on an Atom, and no blow-ups so far...

---
arch/alpha/include/asm/bitops.h | 4 ++
arch/ia64/include/asm/bitops.h | 1 +
arch/sparc/include/asm/bitops_64.h | 4 ++
arch/x86/include/asm/bitops.h | 10 ++++
arch/x86/lib/Makefile | 8 +++-
arch/x86/lib/hweight.c | 72 ++++++++++++++++++++++++++++
include/asm-generic/bitops/arch_hweight.h | 8 ++--
include/asm-generic/bitops/const_hweight.h | 43 +++++++++++++++--
lib/hweight.c | 20 ++++----
9 files changed, 151 insertions(+), 19 deletions(-)
create mode 100644 arch/x86/lib/hweight.c

diff --git a/arch/alpha/include/asm/bitops.h b/arch/alpha/include/asm/bitops.h
index 296da1d..7fe6943 100644
--- a/arch/alpha/include/asm/bitops.h
+++ b/arch/alpha/include/asm/bitops.h
@@ -409,21 +409,25 @@ static inline unsigned long __arch_hweight64(unsigned long w)
{
return __kernel_ctpop(w);
}
+#define __arch_hweight64 __arch_hweight64

static inline unsigned int __arch_weight32(unsigned int w)
{
return __arch_hweight64(w);
}
+#define __arch_hweight32 __arch_hweight32

static inline unsigned int __arch_hweight16(unsigned int w)
{
return __arch_hweight64(w & 0xffff);
}
+#define __arch_hweight16 __arch_hweight16

static inline unsigned int __arch_hweight8(unsigned int w)
{
return __arch_hweight64(w & 0xff);
}
+#define __arch_hweight8 __arch_hweight8
#else
#include <asm-generic/bitops/arch_hweight.h>
#endif
diff --git a/arch/ia64/include/asm/bitops.h b/arch/ia64/include/asm/bitops.h
index 9da3df6..9b64f59 100644
--- a/arch/ia64/include/asm/bitops.h
+++ b/arch/ia64/include/asm/bitops.h
@@ -443,6 +443,7 @@ static __inline__ unsigned long __arch_hweight64(unsigned long x)
result = ia64_popcnt(x);
return result;
}
+#define __arch_hweight64 __arch_hweight64

#define __arch_hweight32(x) ((unsigned int) __arch_hweight64((x) & 0xfffffffful))
#define __arch_hweight16(x) ((unsigned int) __arch_hweight64((x) & 0xfffful))
diff --git a/arch/sparc/include/asm/bitops_64.h b/arch/sparc/include/asm/bitops_64.h
index 766121a..4982bc4 100644
--- a/arch/sparc/include/asm/bitops_64.h
+++ b/arch/sparc/include/asm/bitops_64.h
@@ -51,6 +51,7 @@ static inline unsigned int __arch_hweight64(unsigned long w)
__asm__ ("popc %1,%0" : "=r" (res) : "r" (w));
return res;
}
+#define __arch_hweight64 __arch_hweight64

static inline unsigned int __arch_hweight32(unsigned int w)
{
@@ -59,6 +60,7 @@ static inline unsigned int __arch_hweight32(unsigned int w)
__asm__ ("popc %1,%0" : "=r" (res) : "r" (w & 0xffffffff));
return res;
}
+#define __arch_hweight32 __arch_hweight32

static inline unsigned int __arch_hweight16(unsigned int w)
{
@@ -67,6 +69,7 @@ static inline unsigned int __arch_hweight16(unsigned int w)
__asm__ ("popc %1,%0" : "=r" (res) : "r" (w & 0xffff));
return res;
}
+#define __arch_hweight16 __arch_hweight16

static inline unsigned int __arch_hweight8(unsigned int w)
{
@@ -75,6 +78,7 @@ static inline unsigned int __arch_hweight8(unsigned int w)
__asm__ ("popc %1,%0" : "=r" (res) : "r" (w & 0xff));
return res;
}
+#define __arch_hweight8 __arch_hweight8

#else

diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 02b47a6..187defe 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -444,6 +444,16 @@ static inline int fls(int x)

#define ARCH_HAS_FAST_MULTIPLIER 1

+extern unsigned int __arch_hweight32(unsigned int w);
+extern unsigned int __arch_hweight16(unsigned int w);
+extern unsigned int __arch_hweight8(unsigned int w);
+extern unsigned long __arch_hweight64(__u64 w);
+
+#define __arch_hweight32 __arch_hweight32
+#define __arch_hweight16 __arch_hweight16
+#define __arch_hweight8 __arch_hweight8
+#define __arch_hweight64 __arch_hweight64
+
#include <asm-generic/bitops/hweight.h>

#endif /* __KERNEL__ */
diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile
index cffd754..ddc32eb 100644
--- a/arch/x86/lib/Makefile
+++ b/arch/x86/lib/Makefile
@@ -22,9 +22,11 @@ lib-y += usercopy_$(BITS).o getuser.o putuser.o
lib-y += memcpy_$(BITS).o
lib-$(CONFIG_KPROBES) += insn.o inat.o

-obj-y += msr.o msr-reg.o msr-reg-export.o
+obj-y += msr.o msr-reg.o msr-reg-export.o hweight.o

ifeq ($(CONFIG_X86_32),y)
+ CFLAGS_hweight.o = -fcall-saved-ecx -fcall-saved-edx
+
obj-y += atomic64_32.o
lib-y += checksum_32.o
lib-y += strstr_32.o
@@ -34,6 +36,10 @@ ifneq ($(CONFIG_X86_CMPXCHG64),y)
endif
lib-$(CONFIG_X86_USE_3DNOW) += mmx_32.o
else
+ CFLAGS_hweight.o = -fcall-saved-rsi -fcall-saved-rdx -fcall-saved-rcx \
+ -fcall-saved-r8 -fcall-saved-r9 -fcall-saved-r10 \
+ -fcall-saved-r11
+
obj-y += io_64.o iomap_copy_64.o
lib-y += csum-partial_64.o csum-copy_64.o csum-wrappers_64.o
lib-y += thunk_64.o clear_page_64.o copy_page_64.o
diff --git a/arch/x86/lib/hweight.c b/arch/x86/lib/hweight.c
new file mode 100644
index 0000000..9b4bfa5
--- /dev/null
+++ b/arch/x86/lib/hweight.c
@@ -0,0 +1,72 @@
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/bitops.h>
+
+#ifdef CONFIG_64BIT
+/* popcnt %rdi, %rax */
+#define POPCNT ".byte 0xf3\n\t.byte 0x48\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xc7"
+#define REG_IN "D"
+#define REG_OUT "a"
+#else
+/* popcnt %eax, %eax */
+#define POPCNT ".byte 0xf3\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xc0"
+#define REG_IN "a"
+#define REG_OUT "a"
+#endif
+
+/*
+ * Those are called out-of-line in the alternative below and are added here only
+ * so that gcc is able to figure out which registers have been clobbered by
+ * __sw_hweightXX so that it could restore their values before returning from
+ * the __arch_hweightXX versions. See also <arch/x86/lib/Makefile>.
+ */
+unsigned int _sw_hweight32(unsigned int w)
+{
+ return __sw_hweight32(w);
+}
+
+unsigned long _sw_hweight64(__u64 w)
+{
+ return __sw_hweight64(w);
+}
+
+unsigned int __arch_hweight32(unsigned int w)
+{
+ unsigned int res = 0;
+
+ asm (ALTERNATIVE("call _sw_hweight32", POPCNT, X86_FEATURE_POPCNT)
+ : "="REG_OUT (res)
+ : REG_IN (w));
+
+ return res;
+}
+EXPORT_SYMBOL(__arch_hweight32);
+
+unsigned int __arch_hweight16(unsigned int w)
+{
+ return __arch_hweight32(w & 0xffff);
+}
+EXPORT_SYMBOL(__arch_hweight16);
+
+unsigned int __arch_hweight8(unsigned int w)
+{
+ return __arch_hweight32(w & 0xff);
+}
+EXPORT_SYMBOL(__arch_hweight8);
+
+unsigned long __arch_hweight64(__u64 w)
+{
+ unsigned long res = 0;
+
+#ifdef CONFIG_X86_32
+ return __arch_hweight32((u32)w) +
+ __arch_hweight32((u32)(w >> 32));
+#else
+ asm (ALTERNATIVE("call _sw_hweight64", POPCNT, X86_FEATURE_POPCNT)
+ : "="REG_OUT (res)
+ : REG_IN (w));
+#endif /* CONFIG_X86_32 */
+
+ return res;
+}
+EXPORT_SYMBOL(__arch_hweight64);
diff --git a/include/asm-generic/bitops/arch_hweight.h b/include/asm-generic/bitops/arch_hweight.h
index 3a7be84..c72de8b 100644
--- a/include/asm-generic/bitops/arch_hweight.h
+++ b/include/asm-generic/bitops/arch_hweight.h
@@ -3,9 +3,9 @@

#include <asm/types.h>

-extern unsigned int __arch_hweight32(unsigned int w);
-extern unsigned int __arch_hweight16(unsigned int w);
-extern unsigned int __arch_hweight8(unsigned int w);
-extern unsigned long __arch_hweight64(__u64 w);
+extern unsigned int __sw_hweight32(unsigned int w);
+extern unsigned int __sw_hweight16(unsigned int w);
+extern unsigned int __sw_hweight8(unsigned int w);
+extern unsigned long __sw_hweight64(__u64 w);

#endif /* _ASM_GENERIC_BITOPS_HWEIGHT_H_ */
diff --git a/include/asm-generic/bitops/const_hweight.h b/include/asm-generic/bitops/const_hweight.h
index fa2a50b..8cf8169 100644
--- a/include/asm-generic/bitops/const_hweight.h
+++ b/include/asm-generic/bitops/const_hweight.h
@@ -18,13 +18,48 @@
#define __const_hweight32(w) (__const_hweight16(w) + __const_hweight16((w) >> 16))
#define __const_hweight64(w) (__const_hweight32(w) + __const_hweight32((w) >> 32))

+static inline unsigned int _hweight8(unsigned int w)
+{
+#ifdef __arch_hweight8
+ return __arch_hweight8(w);
+#else
+ return __sw_hweight8(w);
+#endif
+}
+
+static inline unsigned int _hweight16(unsigned int w)
+{
+#ifdef __arch_hweight16
+ return __arch_hweight16(w);
+#else
+ return __sw_hweight16(w);
+#endif
+}
+
+static inline unsigned int _hweight32(unsigned int w)
+{
+#ifdef __arch_hweight32
+ return __arch_hweight32(w);
+#else
+ return __sw_hweight32(w);
+#endif
+}
+
+static inline unsigned long _hweight64(__u64 w)
+{
+#ifdef __arch_hweight64
+ return __arch_hweight64(w);
+#else
+ return __sw_hweight64(w);
+#endif
+}
/*
* Generic interface.
*/
-#define hweight8(w) (__builtin_constant_p(w) ? __const_hweight8(w) : __arch_hweight8(w))
-#define hweight16(w) (__builtin_constant_p(w) ? __const_hweight16(w) : __arch_hweight16(w))
-#define hweight32(w) (__builtin_constant_p(w) ? __const_hweight32(w) : __arch_hweight32(w))
-#define hweight64(w) (__builtin_constant_p(w) ? __const_hweight64(w) : __arch_hweight64(w))
+#define hweight8(w) (__builtin_constant_p(w) ? __const_hweight8(w) : _hweight8(w))
+#define hweight16(w) (__builtin_constant_p(w) ? __const_hweight16(w) : _hweight16(w))
+#define hweight32(w) (__builtin_constant_p(w) ? __const_hweight32(w) : _hweight32(w))
+#define hweight64(w) (__builtin_constant_p(w) ? __const_hweight64(w) : _hweight64(w))

/*
* Interface for known constant arguments
diff --git a/lib/hweight.c b/lib/hweight.c
index 9ff86df..f9ce440 100644
--- a/lib/hweight.c
+++ b/lib/hweight.c
@@ -9,7 +9,7 @@
* The Hamming Weight of a number is the total number of bits set in it.
*/

-unsigned int __arch_hweight32(unsigned int w)
+unsigned int __sw_hweight32(unsigned int w)
{
unsigned int res = w - ((w >> 1) & 0x55555555);
res = (res & 0x33333333) + ((res >> 2) & 0x33333333);
@@ -17,30 +17,30 @@ unsigned int __arch_hweight32(unsigned int w)
res = res + (res >> 8);
return (res + (res >> 16)) & 0x000000FF;
}
-EXPORT_SYMBOL(__arch_hweight32);
+EXPORT_SYMBOL(__sw_hweight32);

-unsigned int __arch_hweight16(unsigned int w)
+unsigned int __sw_hweight16(unsigned int w)
{
unsigned int res = w - ((w >> 1) & 0x5555);
res = (res & 0x3333) + ((res >> 2) & 0x3333);
res = (res + (res >> 4)) & 0x0F0F;
return (res + (res >> 8)) & 0x00FF;
}
-EXPORT_SYMBOL(__arch_hweight16);
+EXPORT_SYMBOL(__sw_hweight16);

-unsigned int __arch_hweight8(unsigned int w)
+unsigned int __sw_hweight8(unsigned int w)
{
unsigned int res = w - ((w >> 1) & 0x55);
res = (res & 0x33) + ((res >> 2) & 0x33);
return (res + (res >> 4)) & 0x0F;
}
-EXPORT_SYMBOL(__arch_hweight8);
+EXPORT_SYMBOL(__sw_hweight8);

-unsigned long __arch_hweight64(__u64 w)
+unsigned long __sw_hweight64(__u64 w)
{
#if BITS_PER_LONG == 32
- return __arch_hweight32((unsigned int)(w >> 32)) +
- __arch_hweight32((unsigned int)w);
+ return __sw_hweight32((unsigned int)(w >> 32)) +
+ __sw_hweight32((unsigned int)w);
#elif BITS_PER_LONG == 64
#ifdef ARCH_HAS_FAST_MULTIPLIER
w -= (w >> 1) & 0x5555555555555555ul;
@@ -57,4 +57,4 @@ unsigned long __arch_hweight64(__u64 w)
#endif
#endif
}
-EXPORT_SYMBOL(__arch_hweight64);
+EXPORT_SYMBOL(__sw_hweight64);
--
1.6.6.1


--
Regards/Gruss,
Boris.

--
Advanced Micro Devices, Inc.
Operating Systems Research Center
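
For readers not familiar with the alternatives machinery used above: the
effect of ALTERNATIVE("call _sw_hweight32", POPCNT, X86_FEATURE_POPCNT)
is roughly the run-time dispatch sketched below, except that
apply_alternatives() patches the code once at boot, so the finished
kernel contains either the call or the popcnt instruction and no test at
all. This is only a conceptual sketch; popcnt_insn() is a made-up
stand-in for the replacement instruction, written with the mnemonic for
readability where the patch emits raw .byte codes because the assembler
could not be assumed to know popcnt:

#include <linux/bitops.h>
#include <asm/cpufeature.h>

/* hypothetical wrapper standing in for the patched-in POPCNT bytes */
static inline unsigned int popcnt_insn(unsigned int w)
{
        unsigned int res;

        asm ("popcnt %1, %0" : "=r" (res) : "r" (w));
        return res;
}

static unsigned int arch_hweight32_sketch(unsigned int w)
{
        /* decided once at boot by the alternative, sketched as a branch */
        if (boot_cpu_has(X86_FEATURE_POPCNT))
                return popcnt_insn(w);

        return __sw_hweight32(w);   /* C fallback, reached via _sw_hweight32 above */
}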

2010-02-12 17:32:59

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)

On 02/12/2010 09:06 AM, Borislav Petkov wrote:
> On Thu, Feb 11, 2010 at 09:33:49AM -0800, H. Peter Anvin wrote:
>> You don't do the push/pop inline -- if you're going to take the hit of
>> pushing this into the caller, it's better to list them as explicit
>> clobbers and let the compiler figure out how to do it. The point of
>> doing an explicit push/pop is that it can be pushed into the out-of-line
>> subroutine.
>
> Doh! Of course, this was wrong. See below.
>
> diff --git a/arch/x86/lib/hweight.c b/arch/x86/lib/hweight.c
> new file mode 100644
> index 0000000..9b4bfa5
> --- /dev/null
> +++ b/arch/x86/lib/hweight.c
> @@ -0,0 +1,72 @@
> +#include <linux/kernel.h>
> +#include <linux/module.h>
> +#include <linux/bitops.h>
> +
> +#ifdef CONFIG_64BIT
> +/* popcnt %rdi, %rax */
> +#define POPCNT ".byte 0xf3\n\t.byte 0x48\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xc7"
> +#define REG_IN "D"
> +#define REG_OUT "a"
> +#else
> +/* popcnt %eax, %eax */
> +#define POPCNT ".byte 0xf3\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xc0"
> +#define REG_IN "a"
> +#define REG_OUT "a"
> +#endif
> +
> +/*
> + * Those are called out-of-line in the alternative below and are added here only
> + * so that gcc is able to figure out which registers have been clobbered by
> + * __sw_hweightXX so that it could restore their values before returning from
> + * the __arch_hweightXX versions. See also <arch/x86/lib/Makefile>.
> + */
> +unsigned int _sw_hweight32(unsigned int w)
> +{
> + return __sw_hweight32(w);
> +}
> +
> +unsigned long _sw_hweight64(__u64 w)
> +{
> + return __sw_hweight64(w);
> +}
> +
> +unsigned int __arch_hweight32(unsigned int w)
> +{
> + unsigned int res = 0;
> +
> + asm (ALTERNATIVE("call _sw_hweight32", POPCNT, X86_FEATURE_POPCNT)
> + : "="REG_OUT (res)
> + : REG_IN (w));
> +

Now you have no clobbers and no assembly wrapper. That really doesn't work.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2010-02-12 17:47:56

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)

On Fri, Feb 12, 2010 at 09:28:32AM -0800, H. Peter Anvin wrote:
> On 02/12/2010 09:06 AM, Borislav Petkov wrote:
> > On Thu, Feb 11, 2010 at 09:33:49AM -0800, H. Peter Anvin wrote:
> >> You don't do the push/pop inline -- if you're going to take the hit of
> >> pushing this into the caller, it's better to list them as explicit
> >> clobbers and let the compiler figure out how to do it. The point of
> >> doing an explicit push/pop is that it can be pushed into the out-of-line
> >> subroutine.
> >
> > Doh! Of course, this was wrong. See below.
> >
> > diff --git a/arch/x86/lib/hweight.c b/arch/x86/lib/hweight.c
> > new file mode 100644
> > index 0000000..9b4bfa5
> > --- /dev/null
> > +++ b/arch/x86/lib/hweight.c
> > @@ -0,0 +1,72 @@
> > +#include <linux/kernel.h>
> > +#include <linux/module.h>
> > +#include <linux/bitops.h>
> > +
> > +#ifdef CONFIG_64BIT
> > +/* popcnt %rdi, %rax */
> > +#define POPCNT ".byte 0xf3\n\t.byte 0x48\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xc7"
> > +#define REG_IN "D"
> > +#define REG_OUT "a"
> > +#else
> > +/* popcnt %eax, %eax */
> > +#define POPCNT ".byte 0xf3\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xc0"
> > +#define REG_IN "a"
> > +#define REG_OUT "a"
> > +#endif
> > +
> > +/*
> > + * Those are called out-of-line in the alternative below and are added here only
> > + * so that gcc is able to figure out which registers have been clobbered by
> > + * __sw_hweightXX so that it could restore their values before returning from
> > + * the __arch_hweightXX versions. See also <arch/x86/lib/Makefile>.
> > + */
> > +unsigned int _sw_hweight32(unsigned int w)
> > +{
> > + return __sw_hweight32(w);
> > +}
> > +
> > +unsigned long _sw_hweight64(__u64 w)
> > +{
> > + return __sw_hweight64(w);
> > +}
> > +
> > +unsigned int __arch_hweight32(unsigned int w)
> > +{
> > + unsigned int res = 0;
> > +
> > + asm (ALTERNATIVE("call _sw_hweight32", POPCNT, X86_FEATURE_POPCNT)
> > + : "="REG_OUT (res)
> > + : REG_IN (w));
> > +
>
> Now you have no clobbers and no assembly wrapper. That really doesn't work.

I see, this means we have to build lib/hweight.c with -fcall-saved*. I
just did a test and it uses %rcx and %rdx temporarily by saving their
values on the stack and restoring them before it returns.

However, this is generic code and for the above to work we have to
enforce x86-specific CFLAGS for it. What is the preferred way to do
that?

--
Regards/Gruss,
Boris.

--
Advanced Micro Devices, Inc.
Operating Systems Research Center
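
A rough illustration of what the -fcall-saved-* switches buy here (the
flag names are the ones from the Makefile above; the function is a
made-up stand-in, not the real lib/hweight.c code): when the fallback's
translation unit is built with those switches, gcc treats the listed
registers as callee-saved in that file, so the function saves and
restores them itself and the "call" issued by the alternative only has
to account for the return register.

/*
 * Hypothetical stand-in for the out-of-line fallback, built with e.g.
 *
 *   CFLAGS_hweight.o = -fcall-saved-ecx -fcall-saved-edx   (32-bit case)
 *
 * With those flags gcc preserves %ecx/%edx across this function even
 * though the normal 32-bit ABI would let a callee clobber them.
 */
unsigned int sw_popcount_fallback(unsigned int w)
{
        unsigned int res = 0;

        while (w) {
                res += w & 1;
                w >>= 1;
        }
        return res;
}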

2010-02-12 19:14:34

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)

On 02/12/2010 09:47 AM, Borislav Petkov wrote:
>
> However, this is generic code and for the above to work we have to
> enforce x86-specific CFLAGS for it. What is the preferred way to do
> that?
>

That's a question for Michal and the kbuild list. Michal?

-hpa

2010-02-14 10:13:27

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)

On Thu, 2010-02-11 at 18:24 +0100, Borislav Petkov wrote:
> On Mon, Feb 08, 2010 at 10:59:45AM +0100, Borislav Petkov wrote:
> > Let me prep another version when I get back on Wed. (currently
> > travelling) with all the stuff we discussed to see how it would turn.
>
> Ok, here's another version ontop of PeterZ's patch at
> http://lkml.org/lkml/2010/2/4/119. I need to handle 32- and 64-bit
> differently wrt to popcnt opcode so on 32-bit I do "popcnt %eax, %eax"
> while on 64-bit I do "popcnt %rdi, %rdi".

Right, so I don't like how you need to touch !x86 for this, and I think
that is easily avoidable by not making x86 include
asm-generic/bitops/arch_hweight.h.

If you then add __sw_hweightN() -> __arch_hweightN() wrappers in
arch_hweight.h, then you can leave const_hweight.h use __arch_hweightN()
and simply provide __arch_hweightN() from x86/include/asm/bitops.h

2010-02-14 11:24:56

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)

On Sun, Feb 14, 2010 at 11:12:23AM +0100, Peter Zijlstra wrote:
> On Thu, 2010-02-11 at 18:24 +0100, Borislav Petkov wrote:
> > On Mon, Feb 08, 2010 at 10:59:45AM +0100, Borislav Petkov wrote:
> > > Let me prep another version when I get back on Wed. (currently
> > > travelling) with all the stuff we discussed to see how it would turn.
> >
> > Ok, here's another version ontop of PeterZ's patch at
> > http://lkml.org/lkml/2010/2/4/119. I need to handle 32- and 64-bit
> > differently wrt to popcnt opcode so on 32-bit I do "popcnt %eax, %eax"
> > while on 64-bit I do "popcnt %rdi, %rdi".
>
> Right, so I don't like how you need to touch !x86 for this, and I think
> that is easily avoidable by not making x86 include
> asm-generic/bitops/arch_hweight.h.
>
> If you then add __sw_hweightN() -> __arch_hweightN() wrappers in
> arch_hweight.h, then you can leave const_hweight.h use __arch_hweightN()
> and simply provide __arch_hweightN() from x86/include/asm/bitops.h

Hmm, all these different names start to get a little confusing. Can we first
agree on the naming, please? Here's my proposal:

__const_hweightN - for at compile time known constants as arguments
__arch_hweightN - arch possibly has an optimized hweight version
__sw_hweightN - fall back when nothing else is there, aka the functions in
lib/hweight.c

Now, in the x86 case, when the compiler can't know that the argument is
a constant, we call the __arch_hweightN versions. The alternative does
call the __sw_hweightN version in case the CPU doesn't support popcnt.
In this case, we need to build __sw_hweightN with -fcall-saved* for gcc
to be able to take care of the regs clobbered by __sw_hweightN.

So, if I understand you correctly, your suggestion might work, we
simply need to rename the lib/hweight.c versions to __sw_hweightN
and have <asm-generic/bitops/arch_hweight.h> have __arch_hweightN ->
__sw_hweightN wrappers in the default case, all arches which have an
optimized version will provide it in their respective bitops header...

How's that?

--
Regards/Gruss,
Boris.
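
Seen from a caller, the three layers proposed above compose as in the
following usage sketch (based on the definitions already quoted in this
thread, not new kernel code; count_enabled() is just an example name):

#include <linux/kernel.h>
#include <linux/bitops.h>

static unsigned int count_enabled(unsigned int mask)
{
        /*
         * Constant argument: hweight32() folds through
         * __const_hweight32() at compile time, so it can sit inside
         * BUILD_BUG_ON() -- the original motivation for this series.
         */
        BUILD_BUG_ON(hweight32(0x000000f0) != 4);

        /*
         * Variable argument: hweight32() dispatches to
         * __arch_hweight32(), i.e. popcnt on capable x86 CPUs or the
         * __sw_hweight32() fallback everywhere else.
         */
        return hweight32(mask);
}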

2010-02-14 12:23:57

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)

On Sun, 2010-02-14 at 12:24 +0100, Borislav Petkov wrote:
>
> So, if I understand you correctly, your suggestion might work, we
> simply need to rename the lib/hweight.c versions to __sw_hweightN
> and have <asm-generic/bitops/arch_hweight.h> have __arch_hweightN ->
> __sw_hweightN wrappers in the default case, all arches which have an
> optimized version will provide it in their respective bitops header...
>
I'm not quite sure what the last 'it' refers to, does that refer to:
1) an __arch_hweightN() implementation, or
2) __arch_hweightN() -> __sw_hweightN() wrappers ?

If you meant 1, then yes.

2010-02-14 14:19:12

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)

On Sun, Feb 14, 2010 at 01:23:35PM +0100, Peter Zijlstra wrote:
> On Sun, 2010-02-14 at 12:24 +0100, Borislav Petkov wrote:
> >
> > So, if I understand you correctly, your suggestion might work, we
> > simply need to rename the lib/hweight.c versions to __sw_hweightN
> > and have <asm-generic/bitops/arch_hweight.h> have __arch_hweightN ->
> > __sw_hweightN wrappers in the default case, all arches which have an
> > optimized version will provide it in their respective bitops header...
> >
> I'm not quite sure what the last 'it' refers to, does that refer to:
> 1) an __arch_hweightN() implementation, or
> 2) __arch_hweightN() -> __sw_hweightN() wrappers ?
>
> If you meant 1, then yes.

Yes, I mean all architectures which have an optimized hweight will use
that optimized version in their __arch_hweightN while as a default
fallback for the remaining architectures we'll have __arch_hweightN() ->
__sw_hweightN() wrappers in <asm-generic/bitops/arch_hweight.h>.

--
Regards/Gruss,
Boris.

-
Advanced Micro Devices, Inc.
Operating Systems Research Center

2010-02-14 18:40:57

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)

On 02/14/2010 03:24 AM, Borislav Petkov wrote:
>
> __const_hweightN - for at compile time known constants as arguments
> __arch_hweightN - arch possibly has an optimized hweight version
> __sw_hweightN - fall back when nothing else is there, aka the functions in
> lib/hweight.c
>
> Now, in the x86 case, when the compiler can't know that the argument is
> a constant, we call the __arch_hweightN versions. The alternative does
> call the __sw_hweightN version in case the CPU doesn't support popcnt.
> In this case, we need to build __sw_hweightN with -fcall-saved* for gcc
> to be able to take care of the regs clobbered by __sw_hweightN.
>
> So, if I understand you correctly, your suggestion might work, we
> simply need to rename the lib/hweight.c versions to __sw_hweightN
> and have <asm-generic/bitops/arch_hweight.h> have __arch_hweightN ->
> __sw_hweightN wrappers in the default case, all arches which have an
> optimized version will provide it in their respective bitops header...
>

I'm not entirely sure what you're asking; if what you're asking is what to
name an x86-specific fallback function, it presumably should be
__arch_sw_hweightN (i.e. __arch prefix with a modifier.)

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2010-02-14 20:28:50

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)

On Sun, Feb 14, 2010 at 10:36:48AM -0800, H. Peter Anvin wrote:
> On 02/14/2010 03:24 AM, Borislav Petkov wrote:
> >
> > __const_hweightN - for at compile time known constants as arguments
> > __arch_hweightN - arch possibly has an optimized hweight version
> > __sw_hweightN - fall back when nothing else is there, aka the functions in
> > lib/hweight.c
> >
> > Now, in the x86 case, when the compiler can't know that the argument is
> > a constant, we call the __arch_hweightN versions. The alternative does
> > call the __sw_hweightN version in case the CPU doesn't support popcnt.
> > In this case, we need to build __sw_hweightN with -fcall-saved* for gcc
> > to be able to take care of the regs clobbered by __sw_hweightN.
> >
> > So, if I understand you correctly, your suggestion might work, we
> > simply need to rename the lib/hweight.c versions to __sw_hweightN
> > and have <asm-generic/bitops/arch_hweight.h> have __arch_hweightN ->
> > __sw_hweightN wrappers in the default case, all arches which have an
> > optimized version will provide it in their respective bitops header...
> >
>
> I'm not entirely sure what you're asking; if what you're asking is what to
> name an x86-specific fallback function, it presumably should be
> __arch_sw_hweightN (i.e. __arch prefix with a modifier.)

Hmm, basically, what PeterZ suggested is that I drop one indirection
under __arch_hweightN, which would make x86-specific fallback functions
superfluous.

IOW, what we have so far is:

#define hweightN(w) (__builtin_constant_p(w) ? __const_hweightN(w) : __arch_hweightN(w))

and have <asm-generic/bitops/arch_hweight.h> provide __arch_hweightN()
-> __sw_hweightN wrappers per default, where the __sw_hweightN are the
lib/hweight.c generic versions.

On architectures/CPUs which provide popcnt in
hardware, we create __arch_hweightN implementations in
<arch/[:ARCH_NAME:]/include/asm/bitops.h> overriding the
<asm-generic/bitops/arch_hweight.h> versions by simply not including
that last header.

Is that agreeable?

--
Regards/Gruss,
Boris.
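
In other words, the layering being agreed on looks roughly like this
sketch (paths as discussed above, bodies trimmed to the bare mapping):

/* include/asm-generic/bitops/arch_hweight.h -- default for every
 * architecture without a faster population count: */
static inline unsigned int __arch_hweight32(unsigned int w)
{
        return __sw_hweight32(w);       /* plain C code in lib/hweight.c */
}

/* arch/x86/include/asm/bitops.h -- the override: x86 simply does not
 * include the generic header and provides its own popcnt-capable
 * __arch_hweight32(), so const_hweight.h keeps calling
 * __arch_hweightN() in both cases. */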

2010-02-14 22:17:30

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)

On 02/14/2010 12:28 PM, Borislav Petkov wrote:
>
> Hmm, basically, what PeterZ suggested is that I drop one indirection
> under __arch_hweightN, which would make x86-specific fallback functions
> superfluous.
>
> IOW, what we have so far is:
>
> #define hweightN(w) (__builtin_constant_p(w) ? __const_hweightN(w) : __arch_hweightN(w))
>
> and have <asm-generic/bitops/arch_hweight.h> provide __arch_hweightN()
> -> __sw_hweightN wrappers per default, where the __sw_hweightN are the
> lib/hweight.c generic versions.
>
> On architectures/CPUs which provide popcnt in
> hardware, we create __arch_hweightN implementations in
> <arch/[:ARCH_NAME:]/include/asm/bitops.h> overriding the
> <asm-generic/bitops/arch_hweight.h> versions by simply not including
> that last header.
>
> Is that agreeable?
>

That makes sense... after all, that's a pretty typical use of asm-generic.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2010-02-17 13:57:53

by Michal Marek

[permalink] [raw]
Subject: Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)

On 12.2.2010 20:05, H. Peter Anvin wrote:
> On 02/12/2010 09:47 AM, Borislav Petkov wrote:
>>
>> However, this is generic code and for the above to work we have to
>> enforce x86-specific CFLAGS for it. What is the preferred way to do
>> that?
>>
>
> That's a question for Michal and the kbuild list. Michal?

(I was offline last week).

The _preferred_ way probably is not to do it :), but otherwise you can
set CFLAGS_hweight.o depending on CONFIG_X86(_32|_64), just like you do
in arch/x86/lib/Makefile already.

Michal

2010-02-17 17:20:48

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)

On Wed, Feb 17, 2010 at 02:57:42PM +0100, Michal Marek wrote:
> On 12.2.2010 20:05, H. Peter Anvin wrote:
> > On 02/12/2010 09:47 AM, Borislav Petkov wrote:
> >>
> >> However, this is generic code and for the above to work we have to
> >> enforce x86-specific CFLAGS for it. What is the preferred way to do
> >> that?
> >>
> >
> > That's a question for Michal and the kbuild list. Michal?
>
> (I was offline last week).
>
> The _preferred_ way probably is not to do it :), but otherwise you can
> set CFLAGS_hweight.o depending on CONFIG_X86(_32|_64), just like you do
> in arch/x86/lib/Makefile already.

Wouldn't it be better if we had something like ARCH_CFLAGS_hweight.o
which gets set in the arch Makefile instead?

--
Regards/Gruss,
Boris.

--
Advanced Micro Devices, Inc.
Operating Systems Research Center

2010-02-17 17:31:11

by Michal Marek

[permalink] [raw]
Subject: Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)

On 17.2.2010 18:20, Borislav Petkov wrote:
> On Wed, Feb 17, 2010 at 02:57:42PM +0100, Michal Marek wrote:
>> On 12.2.2010 20:05, H. Peter Anvin wrote:
>>> On 02/12/2010 09:47 AM, Borislav Petkov wrote:
>>>>
>>>> However, this is generic code and for the above to work we have to
>>>> enforce x86-specific CFLAGS for it. What is the preferred way to do
>>>> that?
>>>>
>>>
>>> That's a question for Michal and the kbuild list. Michal?
>>
>> (I was offline last week).
>>
>> The _preferred_ way probably is not to do it :), but otherwise you can
>> set CFLAGS_hweight.o depending on CONFIG_X86(_32|_64), just like you do
>> in arch/x86/lib/Makefile already.
>
> Wouldn't it be better if we had something like ARCH_CFLAGS_hweight.o
> which gets set in the arch Makefile instead?

We could, but is it worth it if there is only one potential user so far?
IMO just put the condition to lib/Makefile now and if there turn out to
be more cases like this, we can add support for ARCH_CFLAGS_foo.o then.

Michal

2010-02-17 17:34:15

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)

On Wed, Feb 17, 2010 at 06:31:04PM +0100, Michal Marek wrote:
> On 17.2.2010 18:20, Borislav Petkov wrote:
> > On Wed, Feb 17, 2010 at 02:57:42PM +0100, Michal Marek wrote:
> >> On 12.2.2010 20:05, H. Peter Anvin wrote:
> >>> On 02/12/2010 09:47 AM, Borislav Petkov wrote:
> >>>>
> >>>> However, this is generic code and for the above to work we have to
> >>>> enforce x86-specific CFLAGS for it. What is the preferred way to do
> >>>> that?
> >>>>
> >>>
> >>> That's a question for Michal and the kbuild list. Michal?
> >>
> >> (I was offline last week).
> >>
> >> The _preferred_ way probably is not to do it :), but otherwise you can
> >> set CFLAGS_hweight.o depending on CONFIG_X86(_32|_64), just like you do
> >> in arch/x86/lib/Makefile already.
> >
> > Wouldn't it be better if we had something like ARCH_CFLAGS_hweight.o
> > which gets set in the arch Makefile instead?
>
> We could, but is it worth it if there is only one potential user so far?
> IMO just put the condition to lib/Makefile now and if there turn out to
> be more cases like this, we can add support for ARCH_CFLAGS_foo.o then.

Ok, I'm fine with that too, thanks.

--
Regards/Gruss,
Boris.

--
Advanced Micro Devices, Inc.
Operating Systems Research Center

2010-02-17 17:39:17

by Michal Marek

[permalink] [raw]
Subject: Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)

On 17.2.2010 18:31, Michal Marek wrote:
> On 17.2.2010 18:20, Borislav Petkov wrote:
>> On Wed, Feb 17, 2010 at 02:57:42PM +0100, Michal Marek wrote:
>>> On 12.2.2010 20:05, H. Peter Anvin wrote:
>>>> On 02/12/2010 09:47 AM, Borislav Petkov wrote:
>>>>>
>>>>> However, this is generic code and for the above to work we have to
>>>>> enforce x86-specific CFLAGS for it. What is the preferred way to do
>>>>> that?
>>>>>
>>>>
>>>> That's a question for Michal and the kbuild list. Michal?
>>>
>>> (I was offline last week).
>>>
>>> The _preferred_ way probably is not to do it :), but otherwise you can
>>> set CFLAGS_hweight.o depending on CONFIG_X86(_32|_64), just like you do
>>> in arch/x86/lib/Makefile already.
>>
>> Wouldn't it be better if we had something like ARCH_CFLAGS_hweight.o
>> which gets set in the arch Makefile instead?
>
> We could, but is it worth it if there is only one potential user so far?
> IMO just put the condition to lib/Makefile now and if there turn out to
> be more cases like this, we can add support for ARCH_CFLAGS_foo.o then.

It wouldn't work actually, because such a variable would then apply to all
hweight.o targets in the tree. But another way would be:

arch/x86/Kconfig
config ARCH_HWEIGHT_CFLAGS
string
default "..." if X86_32
default "..." if X86_64

lib/Makefile
CFLAGS_hweight.o = $(CONFIG_ARCH_HWEIGHT_CFLAGS)

Michal

2010-02-18 06:19:35

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)

On Wed, Feb 17, 2010 at 06:39:13PM +0100, Michal Marek wrote:
> It wouldn't work actually, because such a variable would then apply to all
> hweight.o targets in the tree. But another way would be:
>
> arch/x86/Kconfig
> config ARCH_HWEIGHT_CFLAGS
> string
> default "..." if X86_32
> default "..." if X86_64
>
> lib/Makefile
> CFLAGS_hweight.o = $(CONFIG_ARCH_HWEIGHT_CFLAGS)

Yep, this works, albeit with a small adjustment since
CONFIG_ARCH_HWEIGHT_CFLAGS is quoted in the Kconfig and the quotes
appear in the $(CC) call like this:

gcc -Wp,-MD,lib/.hweight.o.d ... "-fcall-saved-ecx..."

which I fixed like this (idea reused from the make manual):

---
source "init/Kconfig"
diff --git a/lib/Makefile b/lib/Makefile
index 3b0b4a6..e2ad17c 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -39,7 +39,10 @@ lib-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem.o
lib-$(CONFIG_GENERIC_FIND_FIRST_BIT) += find_next_bit.o
lib-$(CONFIG_GENERIC_FIND_NEXT_BIT) += find_next_bit.o
obj-$(CONFIG_GENERIC_FIND_LAST_BIT) += find_last_bit.o
+
+CFLAGS_hweight.o = $(subst $(quote),,$(CONFIG_ARCH_HWEIGHT_CFLAGS))
obj-$(CONFIG_GENERIC_HWEIGHT) += hweight.o
+
obj-$(CONFIG_LOCK_KERNEL) += kernel_lock.o
obj-$(CONFIG_DEBUG_PREEMPT) += smp_processor_id.o
obj-$(CONFIG_DEBUG_LIST) += list_debug.o
diff --git a/scripts/Makefile.lib b/scripts/Makefile.lib
index f9bdf26..cbcd654 100644
--- a/scripts/Makefile.lib
+++ b/scripts/Makefile.lib
@@ -245,3 +245,7 @@ quiet_cmd_lzo = LZO $@
cmd_lzo = (cat $(filter-out FORCE,$^) | \
lzop -9 && $(call size_append, $(filter-out FORCE,$^))) > $@ || \
(rm -f $@ ; false)
+
+# misc stuff
+# ---------------------------------------------------------------------------
+quote:="



I'm open to better suggestions, though.

--
Regards/Gruss,
Boris.

2010-02-18 10:52:23

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)

On Wed, 2010-02-17 at 14:57 +0100, Michal Marek wrote:
> On 12.2.2010 20:05, H. Peter Anvin wrote:
> > On 02/12/2010 09:47 AM, Borislav Petkov wrote:
> >>
> >> However, this is generic code and for the above to work we have to
> >> enforce x86-specific CFLAGS for it. What is the preferred way to do
> >> that?
> >>
> >
> > That's a question for Michal and the kbuild list. Michal?
>
> (I was offline last week).
>
> The _preferred_ way probably is not to do it :), but otherwise you can
> set CFLAGS_hweight.o depending on CONFIG_X86(_32|_64), just like you do
> in arch/x86/lib/Makefile already.

I guess one way to achieve that is to create a arch/x86/lib/hweight.c
that includes lib/hweight.c and give the x86 one special compile flags
and not build the lib on.

2010-02-18 11:51:58

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)

On Thu, Feb 18, 2010 at 11:51:50AM +0100, Peter Zijlstra wrote:
> I guess one way to achieve that is to create a arch/x86/lib/hweight.c
> that includes lib/hweight.c and give the x86 one special compile flags
> and not build the lib on.

That's what I thought initially too but that won't fly because the
lib/hweight.c helpers have to be inlined into arch/x86/lib/hweight.c so
that gcc can take care of the clobbered registers. Otherwise, it's just a
"call __sw_hweightXX" that gets issued into asm.


--
Regards/Gruss,
Boris.

-
Advanced Micro Devices, Inc.
Operating Systems Research Center

2010-02-19 14:22:07

by Borislav Petkov

[permalink] [raw]
Subject: [PATCH] x86: Add optimized popcnt variants

From: Borislav Petkov <[email protected]>

Add support for the hardware version of the Hamming weight function,
popcnt, present in CPUs which advertise it under CPUID, Function
0x0000_0001_ECX[23]. On CPUs which don't support it, we fall back to the
default lib/hweight.c sw versions.

A synthetic benchmark comparing popcnt with __sw_hweight64 showed almost
a 3x speedup on an F10h machine.

Signed-off-by: Borislav Petkov <[email protected]>
---
arch/x86/Kconfig | 5 ++
arch/x86/include/asm/bitops.h | 7 +++-
arch/x86/lib/Makefile | 2 +-
arch/x86/lib/hweight.c | 62 +++++++++++++++++++++++++++++
include/asm-generic/bitops/arch_hweight.h | 22 ++++++++--
lib/Makefile | 3 +
lib/hweight.c | 20 +++++-----
scripts/Makefile.lib | 4 ++
8 files changed, 109 insertions(+), 16 deletions(-)
create mode 100644 arch/x86/lib/hweight.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index eb40925..176950e 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -230,6 +230,11 @@ config X86_32_LAZY_GS
def_bool y
depends on X86_32 && !CC_STACKPROTECTOR

+config ARCH_HWEIGHT_CFLAGS
+ string
+ default "-fcall-saved-ecx -fcall-saved-edx" if X86_32
+ default "-fcall-saved-rsi -fcall-saved-rdx -fcall-saved-rcx -fcall-saved-r8 -fcall-saved-r9 -fcall-saved-r10 -fcall-saved-r11" if X86_64
+
config KTIME_SCALAR
def_bool X86_32
source "init/Kconfig"
diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 02b47a6..5ec3bd8 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -444,7 +444,12 @@ static inline int fls(int x)

#define ARCH_HAS_FAST_MULTIPLIER 1

-#include <asm-generic/bitops/hweight.h>
+extern unsigned int __arch_hweight32(unsigned int w);
+extern unsigned int __arch_hweight16(unsigned int w);
+extern unsigned int __arch_hweight8(unsigned int w);
+extern unsigned long __arch_hweight64(__u64 w);
+
+#include <asm-generic/bitops/const_hweight.h>

#endif /* __KERNEL__ */

diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile
index cffd754..e811bbd 100644
--- a/arch/x86/lib/Makefile
+++ b/arch/x86/lib/Makefile
@@ -22,7 +22,7 @@ lib-y += usercopy_$(BITS).o getuser.o putuser.o
lib-y += memcpy_$(BITS).o
lib-$(CONFIG_KPROBES) += insn.o inat.o

-obj-y += msr.o msr-reg.o msr-reg-export.o
+obj-y += msr.o msr-reg.o msr-reg-export.o hweight.o

ifeq ($(CONFIG_X86_32),y)
obj-y += atomic64_32.o
diff --git a/arch/x86/lib/hweight.c b/arch/x86/lib/hweight.c
new file mode 100644
index 0000000..54d3cb0
--- /dev/null
+++ b/arch/x86/lib/hweight.c
@@ -0,0 +1,62 @@
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/bitops.h>
+
+#ifdef CONFIG_64BIT
+/* popcnt %rdi, %rax */
+#define POPCNT ".byte 0xf3\n\t.byte 0x48\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xc7"
+#define REG_IN "D"
+#define REG_OUT "a"
+#else
+/* popcnt %eax, %eax */
+#define POPCNT ".byte 0xf3\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xc0"
+#define REG_IN "a"
+#define REG_OUT "a"
+#endif
+
+/*
+ * __sw_hweightXX are called from within the alternatives below
+ * and callee-clobbered registers need to be taken care of. See
+ * ARCH_HWEIGHT_CFLAGS in <arch/x86/Kconfig> for the respective
+ * compiler switches.
+ */
+unsigned int __arch_hweight32(unsigned int w)
+{
+ unsigned int res = 0;
+
+ asm (ALTERNATIVE("call __sw_hweight32", POPCNT, X86_FEATURE_POPCNT)
+ : "="REG_OUT (res)
+ : REG_IN (w));
+
+ return res;
+}
+EXPORT_SYMBOL(__arch_hweight32);
+
+unsigned int __arch_hweight16(unsigned int w)
+{
+ return __arch_hweight32(w & 0xffff);
+}
+EXPORT_SYMBOL(__arch_hweight16);
+
+unsigned int __arch_hweight8(unsigned int w)
+{
+ return __arch_hweight32(w & 0xff);
+}
+EXPORT_SYMBOL(__arch_hweight8);
+
+unsigned long __arch_hweight64(__u64 w)
+{
+ unsigned long res = 0;
+
+#ifdef CONFIG_X86_32
+ return __arch_hweight32((u32)w) +
+ __arch_hweight32((u32)(w >> 32));
+#else
+ asm (ALTERNATIVE("call __sw_hweight64", POPCNT, X86_FEATURE_POPCNT)
+ : "="REG_OUT (res)
+ : REG_IN (w));
+#endif /* CONFIG_X86_32 */
+
+ return res;
+}
+EXPORT_SYMBOL(__arch_hweight64);
diff --git a/include/asm-generic/bitops/arch_hweight.h b/include/asm-generic/bitops/arch_hweight.h
index 3a7be84..1c82306 100644
--- a/include/asm-generic/bitops/arch_hweight.h
+++ b/include/asm-generic/bitops/arch_hweight.h
@@ -3,9 +3,23 @@

#include <asm/types.h>

-extern unsigned int __arch_hweight32(unsigned int w);
-extern unsigned int __arch_hweight16(unsigned int w);
-extern unsigned int __arch_hweight8(unsigned int w);
-extern unsigned long __arch_hweight64(__u64 w);
+unsigned int __arch_hweight32(unsigned int w)
+{
+ return __sw_hweight32(w);
+}

+unsigned int __arch_hweight16(unsigned int w)
+{
+ return __sw_hweight16(w);
+}
+
+unsigned int __arch_hweight8(unsigned int w)
+{
+ return __sw_hweight8(w);
+}
+
+unsigned long __arch_hweight64(__u64 w)
+{
+ return __sw_hweight64(w);
+}
#endif /* _ASM_GENERIC_BITOPS_HWEIGHT_H_ */
diff --git a/lib/Makefile b/lib/Makefile
index 3b0b4a6..e2ad17c 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -39,7 +39,10 @@ lib-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem.o
lib-$(CONFIG_GENERIC_FIND_FIRST_BIT) += find_next_bit.o
lib-$(CONFIG_GENERIC_FIND_NEXT_BIT) += find_next_bit.o
obj-$(CONFIG_GENERIC_FIND_LAST_BIT) += find_last_bit.o
+
+CFLAGS_hweight.o = $(subst $(quote),,$(CONFIG_ARCH_HWEIGHT_CFLAGS))
obj-$(CONFIG_GENERIC_HWEIGHT) += hweight.o
+
obj-$(CONFIG_LOCK_KERNEL) += kernel_lock.o
obj-$(CONFIG_DEBUG_PREEMPT) += smp_processor_id.o
obj-$(CONFIG_DEBUG_LIST) += list_debug.o
diff --git a/lib/hweight.c b/lib/hweight.c
index 9ff86df..f9ce440 100644
--- a/lib/hweight.c
+++ b/lib/hweight.c
@@ -9,7 +9,7 @@
* The Hamming Weight of a number is the total number of bits set in it.
*/

-unsigned int __arch_hweight32(unsigned int w)
+unsigned int __sw_hweight32(unsigned int w)
{
unsigned int res = w - ((w >> 1) & 0x55555555);
res = (res & 0x33333333) + ((res >> 2) & 0x33333333);
@@ -17,30 +17,30 @@ unsigned int __arch_hweight32(unsigned int w)
res = res + (res >> 8);
return (res + (res >> 16)) & 0x000000FF;
}
-EXPORT_SYMBOL(__arch_hweight32);
+EXPORT_SYMBOL(__sw_hweight32);

-unsigned int __arch_hweight16(unsigned int w)
+unsigned int __sw_hweight16(unsigned int w)
{
unsigned int res = w - ((w >> 1) & 0x5555);
res = (res & 0x3333) + ((res >> 2) & 0x3333);
res = (res + (res >> 4)) & 0x0F0F;
return (res + (res >> 8)) & 0x00FF;
}
-EXPORT_SYMBOL(__arch_hweight16);
+EXPORT_SYMBOL(__sw_hweight16);

-unsigned int __arch_hweight8(unsigned int w)
+unsigned int __sw_hweight8(unsigned int w)
{
unsigned int res = w - ((w >> 1) & 0x55);
res = (res & 0x33) + ((res >> 2) & 0x33);
return (res + (res >> 4)) & 0x0F;
}
-EXPORT_SYMBOL(__arch_hweight8);
+EXPORT_SYMBOL(__sw_hweight8);

-unsigned long __arch_hweight64(__u64 w)
+unsigned long __sw_hweight64(__u64 w)
{
#if BITS_PER_LONG == 32
- return __arch_hweight32((unsigned int)(w >> 32)) +
- __arch_hweight32((unsigned int)w);
+ return __sw_hweight32((unsigned int)(w >> 32)) +
+ __sw_hweight32((unsigned int)w);
#elif BITS_PER_LONG == 64
#ifdef ARCH_HAS_FAST_MULTIPLIER
w -= (w >> 1) & 0x5555555555555555ul;
@@ -57,4 +57,4 @@ unsigned long __arch_hweight64(__u64 w)
#endif
#endif
}
-EXPORT_SYMBOL(__arch_hweight64);
+EXPORT_SYMBOL(__sw_hweight64);
diff --git a/scripts/Makefile.lib b/scripts/Makefile.lib
index f9bdf26..cbcd654 100644
--- a/scripts/Makefile.lib
+++ b/scripts/Makefile.lib
@@ -245,3 +245,7 @@ quiet_cmd_lzo = LZO $@
cmd_lzo = (cat $(filter-out FORCE,$^) | \
lzop -9 && $(call size_append, $(filter-out FORCE,$^))) > $@ || \
(rm -f $@ ; false)
+
+# misc stuff
+# ---------------------------------------------------------------------------
+quote:="
--
1.6.5.4

--
Regards/Gruss,
Boris.

-
Advanced Micro Devices, Inc.
Operating Systems Research Center

2010-02-19 16:10:56

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH] x86: Add optimized popcnt variants

On 02/19/2010 06:22 AM, Borislav Petkov wrote:
> --- /dev/null
> +++ b/arch/x86/lib/hweight.c
> @@ -0,0 +1,62 @@
> +#include <linux/kernel.h>
> +#include <linux/module.h>
> +#include <linux/bitops.h>
> +
> +#ifdef CONFIG_64BIT
> +/* popcnt %rdi, %rax */
> +#define POPCNT ".byte 0xf3\n\t.byte 0x48\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xc7"
> +#define REG_IN "D"
> +#define REG_OUT "a"
> +#else
> +/* popcnt %eax, %eax */
> +#define POPCNT ".byte 0xf3\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xc0"
> +#define REG_IN "a"
> +#define REG_OUT "a"
> +#endif
> +
> +/*
> + * __sw_hweightXX are called from within the alternatives below
> + * and callee-clobbered registers need to be taken care of. See
> + * ARCH_HWEIGHT_CFLAGS in <arch/x86/Kconfig> for the respective
> + * compiler switches.
> + */
> +unsigned int __arch_hweight32(unsigned int w)
> +{
> + unsigned int res = 0;
> +
> + asm (ALTERNATIVE("call __sw_hweight32", POPCNT, X86_FEATURE_POPCNT)
> + : "="REG_OUT (res)
> + : REG_IN (w));
> +
> + return res;
> +}
> +EXPORT_SYMBOL(__arch_hweight32);
> +
> +unsigned int __arch_hweight16(unsigned int w)
> +{
> + return __arch_hweight32(w & 0xffff);
> +}
> +EXPORT_SYMBOL(__arch_hweight16);
> +
> +unsigned int __arch_hweight8(unsigned int w)
> +{
> + return __arch_hweight32(w & 0xff);
> +}
> +EXPORT_SYMBOL(__arch_hweight8);
> +
> +unsigned long __arch_hweight64(__u64 w)
> +{
> + unsigned long res = 0;
> +
> +#ifdef CONFIG_X86_32
> + return __arch_hweight32((u32)w) +
> + __arch_hweight32((u32)(w >> 32));
> +#else
> + asm (ALTERNATIVE("call __sw_hweight64", POPCNT, X86_FEATURE_POPCNT)
> + : "="REG_OUT (res)
> + : REG_IN (w));
> +#endif /* CONFIG_X86_32 */
> +
> + return res;
> +}

You're still not inlining these. They should be: there is absolutely no
code-size reason not to inline them anymore.

> diff --git a/include/asm-generic/bitops/arch_hweight.h b/include/asm-generic/bitops/arch_hweight.h
> index 3a7be84..1c82306 100644
> --- a/include/asm-generic/bitops/arch_hweight.h
> +++ b/include/asm-generic/bitops/arch_hweight.h
> @@ -3,9 +3,23 @@
>
> #include <asm/types.h>
>
> -extern unsigned int __arch_hweight32(unsigned int w);
> -extern unsigned int __arch_hweight16(unsigned int w);
> -extern unsigned int __arch_hweight8(unsigned int w);
> -extern unsigned long __arch_hweight64(__u64 w);
> +unsigned int __arch_hweight32(unsigned int w)
> +{
> + return __sw_hweight32(w);
> +}
>
> +unsigned int __arch_hweight16(unsigned int w)
> +{
> + return __sw_hweight16(w);
> +}
> +
> +unsigned int __arch_hweight8(unsigned int w)
> +{
> + return __sw_hweight8(w);
> +}
> +
> +unsigned long __arch_hweight64(__u64 w)
> +{
> + return __sw_hweight64(w);
> +}
> #endif /* _ASM_GENERIC_BITOPS_HWEIGHT_H_ */

and these are in a header file and *definitely* should be inlines.

-hpa
--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2010-02-19 16:45:12

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH] x86: Add optimized popcnt variants

From: "H. Peter Anvin" <[email protected]>
Date: Fri, Feb 19, 2010 at 08:06:07AM -0800

<snip>

> > +unsigned long __arch_hweight64(__u64 w)
> > +{
> > + unsigned long res = 0;
> > +
> > +#ifdef CONFIG_X86_32
> > + return __arch_hweight32((u32)w) +
> > + __arch_hweight32((u32)(w >> 32));
> > +#else
> > + asm (ALTERNATIVE("call __sw_hweight64", POPCNT, X86_FEATURE_POPCNT)
> > + : "="REG_OUT (res)
> > + : REG_IN (w));
> > +#endif /* CONFIG_X86_32 */
> > +
> > + return res;
> > +}
>
> You're still not inlining these. They should be: there is absolutely no
> code-size reason not to inline them anymore.

Isn't it better to have only those 4 locations for apply_alternatives to
patch wrt popcnt instead of sprinkling alternatives sections around
the kernel in every call site of hweight and its users? Or is the aim to
optimize even that "call __arch_hweightXX" away?

> > +unsigned long __arch_hweight64(__u64 w)
> > +{
> > + return __sw_hweight64(w);
> > +}
> > #endif /* _ASM_GENERIC_BITOPS_HWEIGHT_H_ */
>
> and these are in a header file and *definitely* should be inlines.

Yep, done.

--
Regards/Gruss,
Boris.

-
Advanced Micro Devices, Inc.
Operating Systems Research Center

2010-02-19 16:59:28

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH] x86: Add optimized popcnt variants

On 02/19/2010 08:45 AM, Borislav Petkov wrote:
>>
>> You're still not inlining these. They should be: there is absolutely no
>> code-size reason not to inline them anymore.
>
> Isn't better to have only those 4 locations for apply_alternatives to
> patch wrt to popcnt instead of sprinkling alternatives sections around
> the kernel in every callsite of hweight and its users? Or is the aim to
> optimize even that "call __arch_hweightXX" away?
>

That's the idea, yes. We use inline alternatives in quite a few other
places.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2010-02-22 14:17:32

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH] x86: Add optimized popcnt variants

From: "H. Peter Anvin" <[email protected]>
Date: Fri, Feb 19, 2010 at 08:53:32AM -0800

> That's the idea, yes. We use inline alternatives in quite a few other
> places.

Ok, inlining results in circa 100+ replacements here both on 32- and
64-bit. Here we go:

--
From: Borislav Petkov <[email protected]>
Date: Thu, 11 Feb 2010 00:48:31 +0100
Subject: [PATCH] x86: Add optimized popcnt variants

Add support for the hardware version of the Hamming weight function,
popcnt, present in CPUs which advertise it under CPUID, Function
0x0000_0001_ECX[23]. On CPUs which don't support it, we fall back to the
default lib/hweight.c sw versions.

A synthetic benchmark comparing popcnt with __sw_hweight64 showed almost
a 3x speedup on an F10h machine.

Signed-off-by: Borislav Petkov <[email protected]>
---
arch/x86/Kconfig | 5 ++
arch/x86/include/asm/alternative.h | 9 +++-
arch/x86/include/asm/arch_hweight.h | 59 +++++++++++++++++++++++++++++
arch/x86/include/asm/bitops.h | 4 +-
include/asm-generic/bitops/arch_hweight.h | 22 +++++++++--
lib/Makefile | 3 +
lib/hweight.c | 20 +++++-----
scripts/Makefile.lib | 4 ++
8 files changed, 108 insertions(+), 18 deletions(-)
create mode 100644 arch/x86/include/asm/arch_hweight.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index eb40925..176950e 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -230,6 +230,11 @@ config X86_32_LAZY_GS
def_bool y
depends on X86_32 && !CC_STACKPROTECTOR

+config ARCH_HWEIGHT_CFLAGS
+ string
+ default "-fcall-saved-ecx -fcall-saved-edx" if X86_32
+ default "-fcall-saved-rsi -fcall-saved-rdx -fcall-saved-rcx -fcall-saved-r8 -fcall-saved-r9 -fcall-saved-r10 -fcall-saved-r11" if X86_64
+
config KTIME_SCALAR
def_bool X86_32
source "init/Kconfig"
diff --git a/arch/x86/include/asm/alternative.h b/arch/x86/include/asm/alternative.h
index 69b74a7..0720c96 100644
--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -39,9 +39,6 @@
#define LOCK_PREFIX ""
#endif

-/* This must be included *after* the definition of LOCK_PREFIX */
-#include <asm/cpufeature.h>
-
struct alt_instr {
u8 *instr; /* original instruction */
u8 *replacement;
@@ -91,6 +88,12 @@ static inline void alternatives_smp_switch(int smp) {}
".previous"

/*
+ * This must be included *after* the definition of ALTERNATIVE due to
+ * <asm/arch_hweight.h>
+ */
+#include <asm/cpufeature.h>
+
+/*
* Alternative instructions for different CPU types or capabilities.
*
* This allows to use optimized instructions even on generic binary
diff --git a/arch/x86/include/asm/arch_hweight.h b/arch/x86/include/asm/arch_hweight.h
new file mode 100644
index 0000000..f79b733
--- /dev/null
+++ b/arch/x86/include/asm/arch_hweight.h
@@ -0,0 +1,59 @@
+#ifndef _ASM_X86_HWEIGHT_H
+#define _ASM_X86_HWEIGHT_H
+
+#ifdef CONFIG_64BIT
+/* popcnt %rdi, %rax */
+#define POPCNT ".byte 0xf3\n\t.byte 0x48\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xc7"
+#define REG_IN "D"
+#define REG_OUT "a"
+#else
+/* popcnt %eax, %eax */
+#define POPCNT ".byte 0xf3\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xc0"
+#define REG_IN "a"
+#define REG_OUT "a"
+#endif
+
+/*
+ * __sw_hweightXX are called from within the alternatives below
+ * and callee-clobbered registers need to be taken care of. See
+ * ARCH_HWEIGHT_CFLAGS in <arch/x86/Kconfig> for the respective
+ * compiler switches.
+ */
+static inline unsigned int __arch_hweight32(unsigned int w)
+{
+ unsigned int res = 0;
+
+ asm (ALTERNATIVE("call __sw_hweight32", POPCNT, X86_FEATURE_POPCNT)
+ : "="REG_OUT (res)
+ : REG_IN (w));
+
+ return res;
+}
+
+static inline unsigned int __arch_hweight16(unsigned int w)
+{
+ return __arch_hweight32(w & 0xffff);
+}
+
+static inline unsigned int __arch_hweight8(unsigned int w)
+{
+ return __arch_hweight32(w & 0xff);
+}
+
+static inline unsigned long __arch_hweight64(__u64 w)
+{
+ unsigned long res = 0;
+
+#ifdef CONFIG_X86_32
+ return __arch_hweight32((u32)w) +
+ __arch_hweight32((u32)(w >> 32));
+#else
+ asm (ALTERNATIVE("call __sw_hweight64", POPCNT, X86_FEATURE_POPCNT)
+ : "="REG_OUT (res)
+ : REG_IN (w));
+#endif /* CONFIG_X86_32 */
+
+ return res;
+}
+
+#endif
diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 02b47a6..545776e 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -444,7 +444,9 @@ static inline int fls(int x)

#define ARCH_HAS_FAST_MULTIPLIER 1

-#include <asm-generic/bitops/hweight.h>
+#include <asm/arch_hweight.h>
+
+#include <asm-generic/bitops/const_hweight.h>

#endif /* __KERNEL__ */

diff --git a/include/asm-generic/bitops/arch_hweight.h b/include/asm-generic/bitops/arch_hweight.h
index 3a7be84..9a81c1e 100644
--- a/include/asm-generic/bitops/arch_hweight.h
+++ b/include/asm-generic/bitops/arch_hweight.h
@@ -3,9 +3,23 @@

#include <asm/types.h>

-extern unsigned int __arch_hweight32(unsigned int w);
-extern unsigned int __arch_hweight16(unsigned int w);
-extern unsigned int __arch_hweight8(unsigned int w);
-extern unsigned long __arch_hweight64(__u64 w);
+inline unsigned int __arch_hweight32(unsigned int w)
+{
+ return __sw_hweight32(w);
+}

+inline unsigned int __arch_hweight16(unsigned int w)
+{
+ return __sw_hweight16(w);
+}
+
+inline unsigned int __arch_hweight8(unsigned int w)
+{
+ return __sw_hweight8(w);
+}
+
+inline unsigned long __arch_hweight64(__u64 w)
+{
+ return __sw_hweight64(w);
+}
#endif /* _ASM_GENERIC_BITOPS_HWEIGHT_H_ */
diff --git a/lib/Makefile b/lib/Makefile
index 3b0b4a6..e2ad17c 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -39,7 +39,10 @@ lib-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem.o
lib-$(CONFIG_GENERIC_FIND_FIRST_BIT) += find_next_bit.o
lib-$(CONFIG_GENERIC_FIND_NEXT_BIT) += find_next_bit.o
obj-$(CONFIG_GENERIC_FIND_LAST_BIT) += find_last_bit.o
+
+CFLAGS_hweight.o = $(subst $(quote),,$(CONFIG_ARCH_HWEIGHT_CFLAGS))
obj-$(CONFIG_GENERIC_HWEIGHT) += hweight.o
+
obj-$(CONFIG_LOCK_KERNEL) += kernel_lock.o
obj-$(CONFIG_DEBUG_PREEMPT) += smp_processor_id.o
obj-$(CONFIG_DEBUG_LIST) += list_debug.o
diff --git a/lib/hweight.c b/lib/hweight.c
index 9ff86df..f9ce440 100644
--- a/lib/hweight.c
+++ b/lib/hweight.c
@@ -9,7 +9,7 @@
* The Hamming Weight of a number is the total number of bits set in it.
*/

-unsigned int __arch_hweight32(unsigned int w)
+unsigned int __sw_hweight32(unsigned int w)
{
unsigned int res = w - ((w >> 1) & 0x55555555);
res = (res & 0x33333333) + ((res >> 2) & 0x33333333);
@@ -17,30 +17,30 @@ unsigned int __arch_hweight32(unsigned int w)
res = res + (res >> 8);
return (res + (res >> 16)) & 0x000000FF;
}
-EXPORT_SYMBOL(__arch_hweight32);
+EXPORT_SYMBOL(__sw_hweight32);

-unsigned int __arch_hweight16(unsigned int w)
+unsigned int __sw_hweight16(unsigned int w)
{
unsigned int res = w - ((w >> 1) & 0x5555);
res = (res & 0x3333) + ((res >> 2) & 0x3333);
res = (res + (res >> 4)) & 0x0F0F;
return (res + (res >> 8)) & 0x00FF;
}
-EXPORT_SYMBOL(__arch_hweight16);
+EXPORT_SYMBOL(__sw_hweight16);

-unsigned int __arch_hweight8(unsigned int w)
+unsigned int __sw_hweight8(unsigned int w)
{
unsigned int res = w - ((w >> 1) & 0x55);
res = (res & 0x33) + ((res >> 2) & 0x33);
return (res + (res >> 4)) & 0x0F;
}
-EXPORT_SYMBOL(__arch_hweight8);
+EXPORT_SYMBOL(__sw_hweight8);

-unsigned long __arch_hweight64(__u64 w)
+unsigned long __sw_hweight64(__u64 w)
{
#if BITS_PER_LONG == 32
- return __arch_hweight32((unsigned int)(w >> 32)) +
- __arch_hweight32((unsigned int)w);
+ return __sw_hweight32((unsigned int)(w >> 32)) +
+ __sw_hweight32((unsigned int)w);
#elif BITS_PER_LONG == 64
#ifdef ARCH_HAS_FAST_MULTIPLIER
w -= (w >> 1) & 0x5555555555555555ul;
@@ -57,4 +57,4 @@ unsigned long __arch_hweight64(__u64 w)
#endif
#endif
}
-EXPORT_SYMBOL(__arch_hweight64);
+EXPORT_SYMBOL(__sw_hweight64);
diff --git a/scripts/Makefile.lib b/scripts/Makefile.lib
index f9bdf26..cbcd654 100644
--- a/scripts/Makefile.lib
+++ b/scripts/Makefile.lib
@@ -245,3 +245,7 @@ quiet_cmd_lzo = LZO $@
cmd_lzo = (cat $(filter-out FORCE,$^) | \
lzop -9 && $(call size_append, $(filter-out FORCE,$^))) > $@ || \
(rm -f $@ ; false)
+
+# misc stuff
+# ---------------------------------------------------------------------------
+quote:="
--
1.6.4.2


--
Regards/Gruss,
Boris.

-
Advanced Micro Devices, Inc.
Operating Systems Research Center

2010-02-22 17:26:06

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH] x86: Add optimized popcnt variants

On 02/22/2010 06:17 AM, Borislav Petkov wrote:
>
> +config ARCH_HWEIGHT_CFLAGS
> + string
> + default "-fcall-saved-rsi -fcall-saved-rdx -fcall-saved-rcx -fcall-saved-r8 -fcall-saved-r9 -fcall-saved-r10 -fcall-saved-r11" if X86_64
> +

[...]

> +
> +#ifdef CONFIG_64BIT
> +/* popcnt %rdi, %rax */
> +#define POPCNT ".byte 0xf3\n\t.byte 0x48\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xc7"
> +#define REG_IN "D"
> +#define REG_OUT "a"
> +#else

Just a note: this still means rdi is clobbered on x86-64, which is
probably fine, but needs to be recorded as such. Since gcc doesn't
support clobbers for registers used as operands (sigh), you have to
create a dummy output and assign it a "=D" constraint.

I don't know if gcc would handle -fcall-saved-rdi here... and if so, how
reliably.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
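
A minimal sketch of the dummy-output idea (illustrative only; the exact
operand layout is up to the patch author): gcc will not accept a clobber
for a register that is also an operand, so the input register is tied to
a throwaway output instead, which records that %rdi does not survive the
statement when the "call __sw_hweight64" half of the alternative runs.

static inline unsigned long hweight64_sketch(unsigned long w)
{
        unsigned long res, dummy;

        /*
         * Operand 1 ("=D", dummy) claims %rdi as an output; the "1"
         * input constraint places w in that same register.  The popcnt
         * mnemonic stands in for the whole ALTERNATIVE() here and is
         * spelled out for readability where the patch uses .byte codes.
         */
        asm ("popcnt %1, %0"
             : "=a" (res), "=D" (dummy)
             : "1" (w));

        return res;
}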

2010-02-22 18:49:39

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH] x86: Add optimized popcnt variants

From: "H. Peter Anvin" <[email protected]>
Date: Mon, Feb 22, 2010 at 09:21:05AM -0800

> On 02/22/2010 06:17 AM, Borislav Petkov wrote:
> >
> > +config ARCH_HWEIGHT_CFLAGS
> > + string
> > + default "-fcall-saved-rsi -fcall-saved-rdx -fcall-saved-rcx -fcall-saved-r8 -fcall-saved-r9 -fcall-saved-r10 -fcall-saved-r11" if X86_64
> > +
>
> [...]
>
> > +
> > +#ifdef CONFIG_64BIT
> > +/* popcnt %rdi, %rax */
> > +#define POPCNT ".byte 0xf3\n\t.byte 0x48\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xc7"
> > +#define REG_IN "D"
> > +#define REG_OUT "a"
> > +#else
>
> Just a note: this still means rdi is clobbered on x86-64, which is
> probably fine, but needs to be recorded as such. Since gcc doesn't
> support clobbers for registers used as operands (sigh), you have to
> create a dummy output and assign it a "=D" constraint.
>
> I don't know if gcc would handle -fcall-saved-rdi here... and if so, how
> reliably.

Ok, from looking at kernel/sched.s output it looks like it saves rdi
content over the alternative where needed. I'll do some more testing
just to make sure.

--
From: Borislav Petkov <[email protected]>
Date: Thu, 11 Feb 2010 00:48:31 +0100
Subject: [PATCH] x86: Add optimized popcnt variants

Add support for the hardware version of the Hamming weight function,
popcnt, present in CPUs which advertise it under CPUID, Function
0x0000_0001_ECX[23]. On CPUs which don't support it, we fall back to the
default lib/hweight.c sw versions.

A synthetic benchmark comparing popcnt with __sw_hweight64 showed almost
a 3x speedup on a F10h machine.

Signed-off-by: Borislav Petkov <[email protected]>
---
arch/x86/Kconfig | 5 ++
arch/x86/include/asm/alternative.h | 9 +++-
arch/x86/include/asm/arch_hweight.h | 59 +++++++++++++++++++++++++++++
arch/x86/include/asm/bitops.h | 4 +-
include/asm-generic/bitops/arch_hweight.h | 22 +++++++++--
lib/Makefile | 3 +
lib/hweight.c | 20 +++++-----
scripts/Makefile.lib | 4 ++
8 files changed, 108 insertions(+), 18 deletions(-)
create mode 100644 arch/x86/include/asm/arch_hweight.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index eb40925..176950e 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -230,6 +230,11 @@ config X86_32_LAZY_GS
def_bool y
depends on X86_32 && !CC_STACKPROTECTOR

+config ARCH_HWEIGHT_CFLAGS
+ string
+ default "-fcall-saved-ecx -fcall-saved-edx" if X86_32
+ default "-fcall-saved-rsi -fcall-saved-rdx -fcall-saved-rcx -fcall-saved-r8 -fcall-saved-r9 -fcall-saved-r10 -fcall-saved-r11" if X86_64
+
config KTIME_SCALAR
def_bool X86_32
source "init/Kconfig"
diff --git a/arch/x86/include/asm/alternative.h b/arch/x86/include/asm/alternative.h
index 69b74a7..0720c96 100644
--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -39,9 +39,6 @@
#define LOCK_PREFIX ""
#endif

-/* This must be included *after* the definition of LOCK_PREFIX */
-#include <asm/cpufeature.h>
-
struct alt_instr {
u8 *instr; /* original instruction */
u8 *replacement;
@@ -91,6 +88,12 @@ static inline void alternatives_smp_switch(int smp) {}
".previous"

/*
+ * This must be included *after* the definition of ALTERNATIVE due to
+ * <asm/arch_hweight.h>
+ */
+#include <asm/cpufeature.h>
+
+/*
* Alternative instructions for different CPU types or capabilities.
*
* This allows to use optimized instructions even on generic binary
diff --git a/arch/x86/include/asm/arch_hweight.h b/arch/x86/include/asm/arch_hweight.h
new file mode 100644
index 0000000..cc3a188
--- /dev/null
+++ b/arch/x86/include/asm/arch_hweight.h
@@ -0,0 +1,59 @@
+#ifndef _ASM_X86_HWEIGHT_H
+#define _ASM_X86_HWEIGHT_H
+
+#ifdef CONFIG_64BIT
+/* popcnt %rdi, %rax */
+#define POPCNT ".byte 0xf3\n\t.byte 0x48\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xc7"
+#define REG_IN "D"
+#define REG_OUT "a"
+#else
+/* popcnt %eax, %eax */
+#define POPCNT ".byte 0xf3\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xc0"
+#define REG_IN "a"
+#define REG_OUT "a"
+#endif
+
+/*
+ * __sw_hweightXX are called from within the alternatives below
+ * and callee-clobbered registers need to be taken care of. See
+ * ARCH_HWEIGHT_CFLAGS in <arch/x86/Kconfig> for the respective
+ * compiler switches.
+ */
+static inline unsigned int __arch_hweight32(unsigned int w)
+{
+ unsigned int res = 0;
+
+ asm (ALTERNATIVE("call __sw_hweight32", POPCNT, X86_FEATURE_POPCNT)
+ : "="REG_OUT (res)
+ : REG_IN (w));
+
+ return res;
+}
+
+static inline unsigned int __arch_hweight16(unsigned int w)
+{
+ return __arch_hweight32(w & 0xffff);
+}
+
+static inline unsigned int __arch_hweight8(unsigned int w)
+{
+ return __arch_hweight32(w & 0xff);
+}
+
+static inline unsigned long __arch_hweight64(__u64 w)
+{
+ unsigned long res = 0, dummy;
+
+#ifdef CONFIG_X86_32
+ return __arch_hweight32((u32)w) +
+ __arch_hweight32((u32)(w >> 32));
+#else
+ asm (ALTERNATIVE("call __sw_hweight64", POPCNT, X86_FEATURE_POPCNT)
+ : "="REG_OUT (res), "="REG_IN (dummy)
+ : REG_IN (w));
+#endif /* CONFIG_X86_32 */
+
+ return res;
+}
+
+#endif
diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 02b47a6..545776e 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -444,7 +444,9 @@ static inline int fls(int x)

#define ARCH_HAS_FAST_MULTIPLIER 1

-#include <asm-generic/bitops/hweight.h>
+#include <asm/arch_hweight.h>
+
+#include <asm-generic/bitops/const_hweight.h>

#endif /* __KERNEL__ */

diff --git a/include/asm-generic/bitops/arch_hweight.h b/include/asm-generic/bitops/arch_hweight.h
index 3a7be84..9a81c1e 100644
--- a/include/asm-generic/bitops/arch_hweight.h
+++ b/include/asm-generic/bitops/arch_hweight.h
@@ -3,9 +3,23 @@

#include <asm/types.h>

-extern unsigned int __arch_hweight32(unsigned int w);
-extern unsigned int __arch_hweight16(unsigned int w);
-extern unsigned int __arch_hweight8(unsigned int w);
-extern unsigned long __arch_hweight64(__u64 w);
+inline unsigned int __arch_hweight32(unsigned int w)
+{
+ return __sw_hweight32(w);
+}

+inline unsigned int __arch_hweight16(unsigned int w)
+{
+ return __sw_hweight16(w);
+}
+
+inline unsigned int __arch_hweight8(unsigned int w)
+{
+ return __sw_hweight8(w);
+}
+
+inline unsigned long __arch_hweight64(__u64 w)
+{
+ return __sw_hweight64(w);
+}
#endif /* _ASM_GENERIC_BITOPS_HWEIGHT_H_ */
diff --git a/lib/Makefile b/lib/Makefile
index 3b0b4a6..e2ad17c 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -39,7 +39,10 @@ lib-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem.o
lib-$(CONFIG_GENERIC_FIND_FIRST_BIT) += find_next_bit.o
lib-$(CONFIG_GENERIC_FIND_NEXT_BIT) += find_next_bit.o
obj-$(CONFIG_GENERIC_FIND_LAST_BIT) += find_last_bit.o
+
+CFLAGS_hweight.o = $(subst $(quote),,$(CONFIG_ARCH_HWEIGHT_CFLAGS))
obj-$(CONFIG_GENERIC_HWEIGHT) += hweight.o
+
obj-$(CONFIG_LOCK_KERNEL) += kernel_lock.o
obj-$(CONFIG_DEBUG_PREEMPT) += smp_processor_id.o
obj-$(CONFIG_DEBUG_LIST) += list_debug.o
diff --git a/lib/hweight.c b/lib/hweight.c
index 9ff86df..f9ce440 100644
--- a/lib/hweight.c
+++ b/lib/hweight.c
@@ -9,7 +9,7 @@
* The Hamming Weight of a number is the total number of bits set in it.
*/

-unsigned int __arch_hweight32(unsigned int w)
+unsigned int __sw_hweight32(unsigned int w)
{
unsigned int res = w - ((w >> 1) & 0x55555555);
res = (res & 0x33333333) + ((res >> 2) & 0x33333333);
@@ -17,30 +17,30 @@ unsigned int __arch_hweight32(unsigned int w)
res = res + (res >> 8);
return (res + (res >> 16)) & 0x000000FF;
}
-EXPORT_SYMBOL(__arch_hweight32);
+EXPORT_SYMBOL(__sw_hweight32);

-unsigned int __arch_hweight16(unsigned int w)
+unsigned int __sw_hweight16(unsigned int w)
{
unsigned int res = w - ((w >> 1) & 0x5555);
res = (res & 0x3333) + ((res >> 2) & 0x3333);
res = (res + (res >> 4)) & 0x0F0F;
return (res + (res >> 8)) & 0x00FF;
}
-EXPORT_SYMBOL(__arch_hweight16);
+EXPORT_SYMBOL(__sw_hweight16);

-unsigned int __arch_hweight8(unsigned int w)
+unsigned int __sw_hweight8(unsigned int w)
{
unsigned int res = w - ((w >> 1) & 0x55);
res = (res & 0x33) + ((res >> 2) & 0x33);
return (res + (res >> 4)) & 0x0F;
}
-EXPORT_SYMBOL(__arch_hweight8);
+EXPORT_SYMBOL(__sw_hweight8);

-unsigned long __arch_hweight64(__u64 w)
+unsigned long __sw_hweight64(__u64 w)
{
#if BITS_PER_LONG == 32
- return __arch_hweight32((unsigned int)(w >> 32)) +
- __arch_hweight32((unsigned int)w);
+ return __sw_hweight32((unsigned int)(w >> 32)) +
+ __sw_hweight32((unsigned int)w);
#elif BITS_PER_LONG == 64
#ifdef ARCH_HAS_FAST_MULTIPLIER
w -= (w >> 1) & 0x5555555555555555ul;
@@ -57,4 +57,4 @@ unsigned long __arch_hweight64(__u64 w)
#endif
#endif
}
-EXPORT_SYMBOL(__arch_hweight64);
+EXPORT_SYMBOL(__sw_hweight64);
diff --git a/scripts/Makefile.lib b/scripts/Makefile.lib
index f9bdf26..cbcd654 100644
--- a/scripts/Makefile.lib
+++ b/scripts/Makefile.lib
@@ -245,3 +245,7 @@ quiet_cmd_lzo = LZO $@
cmd_lzo = (cat $(filter-out FORCE,$^) | \
lzop -9 && $(call size_append, $(filter-out FORCE,$^))) > $@ || \
(rm -f $@ ; false)
+
+# misc stuff
+# ---------------------------------------------------------------------------
+quote:="
--
1.6.4.2


--
Regards/Gruss,
Boris.

--
Advanced Micro Devices, Inc.
Operating Systems Research Center

2010-02-22 20:41:02

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH] x86: Add optimized popcnt variants

On 02/22/2010 10:49 AM, Borislav Petkov wrote:
>>
>> Just a note: this still means rdi is clobbered on x86-64, which is
>> probably fine, but needs to be recorded as such. Since gcc doesn't
>> support clobbers for registers used as operands (sigh), you have to
>> create a dummy output and assign it a "=D" constraint.
>>
>> I don't know if gcc would handle -fcall-saved-rdi here... and if so, how
>> reliably.
>
> Ok, from looking at kernel/sched.s output it looks like it saves rdi
> content over the alternative where needed. I'll do some more testing
> just to make sure.
>

No, you can't rely on behavioral observation. A different version of
gcc could behave differently. We need to make sure we tell gcc what the
requirements actually are, as opposed to thinking we can just fix them.

+#define POPCNT ".byte 0xf3\n\t.byte 0x48\n\t.byte 0x0f\n\t.byte
0xb8\n\t.byte 0xc7"

BTW, this can be written:

#define POPCNT ".byte 0xf3,0x48,0x0f,0xb8,0xc7"

-hpa

2010-02-23 06:37:43

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH] x86: Add optimized popcnt variants

From: "H. Peter Anvin" <[email protected]>
Date: Mon, Feb 22, 2010 at 11:55:48AM -0800

> On 02/22/2010 10:49 AM, Borislav Petkov wrote:
> >>
> >> Just a note: this still means rdi is clobbered on x86-64, which is
> >> probably fine, but needs to be recorded as such. Since gcc doesn't
> >> support clobbers for registers used as operands (sigh), you have to
> >> create a dummy output and assign it a "=D" constraint.
> >>
> >> I don't know if gcc would handle -fcall-saved-rdi here... and if so, how
> >> reliably.
> >
> > Ok, from looking at kernel/sched.s output it looks like it saves rdi
> > content over the alternative where needed. I'll do some more testing
> > just to make sure.
> >
>
> No, you can't rely on behavioral observation. A different version of
> gcc could behave differently. We need to make sure we tell gcc what the
> requirements actually are, as opposed to thinking we can just fix them.

Ok, I've added the dummy "=D" constraint since it sounded like the more
stable thing to do WRT gcc versions. BTW, I left the machine overnight
and it is still alive cheerfully building randconfigs.

I'll try the -fcall-saved-rdi thing also later today.

> +#define POPCNT ".byte 0xf3\n\t.byte 0x48\n\t.byte 0x0f\n\t.byte
> 0xb8\n\t.byte 0xc7"
>
> BTW, this can be written:
>
> #define POPCNT ".byte 0xf3,0x48,0x0f,0xb8,0xc7"

done, updated version coming up.

--
Regards/Gruss,
Boris.

2010-02-23 15:58:45

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH] x86: Add optimized popcnt variants

From: "H. Peter Anvin" <[email protected]>
Date: Mon, Feb 22, 2010 at 11:55:48AM -0800

> On 02/22/2010 10:49 AM, Borislav Petkov wrote:
> >>
> >> Just a note: this still means rdi is clobbered on x86-64, which is
> >> probably fine, but needs to be recorded as such. Since gcc doesn't
> >> support clobbers for registers used as operands (sigh), you have to
> >> create a dummy output and assign it a "=D" constraint.
> >>
> >> I don't know if gcc would handle -fcall-saved-rdi here... and if so, how
> >> reliably.

Hmm, we cannot do that with the current design since __arch_hweight64
is being inlined into every callsite and AFAICT we would have to build
every callsite with "-fcall-saved-rdi" which is clearly too much. The
explicit "=D" dummy constraint is straightforward, instead.

> > Ok, from looking at kernel/sched.s output it looks like it saves rdi
> > content over the alternative where needed. I'll do some more testing
> > just to make sure.
> >
>
> No, you can't rely on behavioral observation. A different version of
> gcc could behave differently. We need to make sure we tell gcc what the
> requirements actually are, as opposed to thinking we can just fix them.

Just to make clear: those observations were referring to the version
_with_ the dummy "=D" constraint - I was simply verifying the asm
output and wasn't relying on behavioral observation.

> +#define POPCNT ".byte 0xf3\n\t.byte 0x48\n\t.byte 0x0f\n\t.byte
> 0xb8\n\t.byte 0xc7"
>
> BTW, this can be written:
>
> #define POPCNT ".byte 0xf3,0x48,0x0f,0xb8,0xc7"

Done, here's the latest version, it boots fine and testing is ongoing:

--
From: Borislav Petkov <[email protected]>
Date: Thu, 11 Feb 2010 00:48:31 +0100
Subject: [PATCH] x86: Add optimized popcnt variants

Add support for the hardware version of the Hamming weight function,
popcnt, present in CPUs which advertise it under CPUID, Function
0x0000_0001_ECX[23]. On CPUs which don't support it, we fall back to the
default lib/hweight.c sw versions.

A synthetic benchmark comparing popcnt with __sw_hweight64 showed almost
a 3x speedup on a F10h machine.

Signed-off-by: Borislav Petkov <[email protected]>
---
arch/x86/Kconfig | 5 ++
arch/x86/include/asm/alternative.h | 9 +++-
arch/x86/include/asm/arch_hweight.h | 61 +++++++++++++++++++++++++++++
arch/x86/include/asm/bitops.h | 4 +-
include/asm-generic/bitops/arch_hweight.h | 22 ++++++++--
lib/Makefile | 3 +
lib/hweight.c | 20 +++++-----
scripts/Makefile.lib | 4 ++
8 files changed, 110 insertions(+), 18 deletions(-)
create mode 100644 arch/x86/include/asm/arch_hweight.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index eb40925..176950e 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -230,6 +230,11 @@ config X86_32_LAZY_GS
def_bool y
depends on X86_32 && !CC_STACKPROTECTOR

+config ARCH_HWEIGHT_CFLAGS
+ string
+ default "-fcall-saved-ecx -fcall-saved-edx" if X86_32
+ default "-fcall-saved-rsi -fcall-saved-rdx -fcall-saved-rcx -fcall-saved-r8 -fcall-saved-r9 -fcall-saved-r10 -fcall-saved-r11" if X86_64
+
config KTIME_SCALAR
def_bool X86_32
source "init/Kconfig"
diff --git a/arch/x86/include/asm/alternative.h b/arch/x86/include/asm/alternative.h
index 69b74a7..0720c96 100644
--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -39,9 +39,6 @@
#define LOCK_PREFIX ""
#endif

-/* This must be included *after* the definition of LOCK_PREFIX */
-#include <asm/cpufeature.h>
-
struct alt_instr {
u8 *instr; /* original instruction */
u8 *replacement;
@@ -91,6 +88,12 @@ static inline void alternatives_smp_switch(int smp) {}
".previous"

/*
+ * This must be included *after* the definition of ALTERNATIVE due to
+ * <asm/arch_hweight.h>
+ */
+#include <asm/cpufeature.h>
+
+/*
* Alternative instructions for different CPU types or capabilities.
*
* This allows to use optimized instructions even on generic binary
diff --git a/arch/x86/include/asm/arch_hweight.h b/arch/x86/include/asm/arch_hweight.h
new file mode 100644
index 0000000..eaefadc
--- /dev/null
+++ b/arch/x86/include/asm/arch_hweight.h
@@ -0,0 +1,61 @@
+#ifndef _ASM_X86_HWEIGHT_H
+#define _ASM_X86_HWEIGHT_H
+
+#ifdef CONFIG_64BIT
+/* popcnt %rdi, %rax */
+#define POPCNT ".byte 0xf3,0x48,0x0f,0xb8,0xc7"
+#define REG_IN "D"
+#define REG_OUT "a"
+#else
+/* popcnt %eax, %eax */
+#define POPCNT ".byte 0xf3,0x0f,0xb8,0xc0"
+#define REG_IN "a"
+#define REG_OUT "a"
+#endif
+
+/*
+ * __sw_hweightXX are called from within the alternatives below
+ * and callee-clobbered registers need to be taken care of. See
+ * ARCH_HWEIGHT_CFLAGS in <arch/x86/Kconfig> for the respective
+ * compiler switches.
+ */
+static inline unsigned int __arch_hweight32(unsigned int w)
+{
+ unsigned int res = 0;
+
+ asm (ALTERNATIVE("call __sw_hweight32", POPCNT, X86_FEATURE_POPCNT)
+ : "="REG_OUT (res)
+ : REG_IN (w));
+
+ return res;
+}
+
+static inline unsigned int __arch_hweight16(unsigned int w)
+{
+ return __arch_hweight32(w & 0xffff);
+}
+
+static inline unsigned int __arch_hweight8(unsigned int w)
+{
+ return __arch_hweight32(w & 0xff);
+}
+
+static inline unsigned long __arch_hweight64(__u64 w)
+{
+ unsigned long res = 0;
+ /* tell gcc that %rdi is clobbered as an input operand */
+ unsigned long dummy;
+
+#ifdef CONFIG_X86_32
+ return __arch_hweight32((u32)w) +
+ __arch_hweight32((u32)(w >> 32));
+#else
+ asm (ALTERNATIVE("call __sw_hweight64", POPCNT, X86_FEATURE_POPCNT)
+ : "="REG_OUT (res), "="REG_IN (dummy)
+ : REG_IN (w));
+#endif /* CONFIG_X86_32 */
+
+ return res;
+}
+
+#endif
diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 02b47a6..545776e 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -444,7 +444,9 @@ static inline int fls(int x)

#define ARCH_HAS_FAST_MULTIPLIER 1

-#include <asm-generic/bitops/hweight.h>
+#include <asm/arch_hweight.h>
+
+#include <asm-generic/bitops/const_hweight.h>

#endif /* __KERNEL__ */

diff --git a/include/asm-generic/bitops/arch_hweight.h b/include/asm-generic/bitops/arch_hweight.h
index 3a7be84..9a81c1e 100644
--- a/include/asm-generic/bitops/arch_hweight.h
+++ b/include/asm-generic/bitops/arch_hweight.h
@@ -3,9 +3,23 @@

#include <asm/types.h>

-extern unsigned int __arch_hweight32(unsigned int w);
-extern unsigned int __arch_hweight16(unsigned int w);
-extern unsigned int __arch_hweight8(unsigned int w);
-extern unsigned long __arch_hweight64(__u64 w);
+inline unsigned int __arch_hweight32(unsigned int w)
+{
+ return __sw_hweight32(w);
+}

+inline unsigned int __arch_hweight16(unsigned int w)
+{
+ return __sw_hweight16(w);
+}
+
+inline unsigned int __arch_hweight8(unsigned int w)
+{
+ return __sw_hweight8(w);
+}
+
+inline unsigned long __arch_hweight64(__u64 w)
+{
+ return __sw_hweight64(w);
+}
#endif /* _ASM_GENERIC_BITOPS_HWEIGHT_H_ */
diff --git a/lib/Makefile b/lib/Makefile
index 3b0b4a6..e2ad17c 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -39,7 +39,10 @@ lib-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem.o
lib-$(CONFIG_GENERIC_FIND_FIRST_BIT) += find_next_bit.o
lib-$(CONFIG_GENERIC_FIND_NEXT_BIT) += find_next_bit.o
obj-$(CONFIG_GENERIC_FIND_LAST_BIT) += find_last_bit.o
+
+CFLAGS_hweight.o = $(subst $(quote),,$(CONFIG_ARCH_HWEIGHT_CFLAGS))
obj-$(CONFIG_GENERIC_HWEIGHT) += hweight.o
+
obj-$(CONFIG_LOCK_KERNEL) += kernel_lock.o
obj-$(CONFIG_DEBUG_PREEMPT) += smp_processor_id.o
obj-$(CONFIG_DEBUG_LIST) += list_debug.o
diff --git a/lib/hweight.c b/lib/hweight.c
index 9ff86df..f9ce440 100644
--- a/lib/hweight.c
+++ b/lib/hweight.c
@@ -9,7 +9,7 @@
* The Hamming Weight of a number is the total number of bits set in it.
*/

-unsigned int __arch_hweight32(unsigned int w)
+unsigned int __sw_hweight32(unsigned int w)
{
unsigned int res = w - ((w >> 1) & 0x55555555);
res = (res & 0x33333333) + ((res >> 2) & 0x33333333);
@@ -17,30 +17,30 @@ unsigned int __arch_hweight32(unsigned int w)
res = res + (res >> 8);
return (res + (res >> 16)) & 0x000000FF;
}
-EXPORT_SYMBOL(__arch_hweight32);
+EXPORT_SYMBOL(__sw_hweight32);

-unsigned int __arch_hweight16(unsigned int w)
+unsigned int __sw_hweight16(unsigned int w)
{
unsigned int res = w - ((w >> 1) & 0x5555);
res = (res & 0x3333) + ((res >> 2) & 0x3333);
res = (res + (res >> 4)) & 0x0F0F;
return (res + (res >> 8)) & 0x00FF;
}
-EXPORT_SYMBOL(__arch_hweight16);
+EXPORT_SYMBOL(__sw_hweight16);

-unsigned int __arch_hweight8(unsigned int w)
+unsigned int __sw_hweight8(unsigned int w)
{
unsigned int res = w - ((w >> 1) & 0x55);
res = (res & 0x33) + ((res >> 2) & 0x33);
return (res + (res >> 4)) & 0x0F;
}
-EXPORT_SYMBOL(__arch_hweight8);
+EXPORT_SYMBOL(__sw_hweight8);

-unsigned long __arch_hweight64(__u64 w)
+unsigned long __sw_hweight64(__u64 w)
{
#if BITS_PER_LONG == 32
- return __arch_hweight32((unsigned int)(w >> 32)) +
- __arch_hweight32((unsigned int)w);
+ return __sw_hweight32((unsigned int)(w >> 32)) +
+ __sw_hweight32((unsigned int)w);
#elif BITS_PER_LONG == 64
#ifdef ARCH_HAS_FAST_MULTIPLIER
w -= (w >> 1) & 0x5555555555555555ul;
@@ -57,4 +57,4 @@ unsigned long __arch_hweight64(__u64 w)
#endif
#endif
}
-EXPORT_SYMBOL(__arch_hweight64);
+EXPORT_SYMBOL(__sw_hweight64);
diff --git a/scripts/Makefile.lib b/scripts/Makefile.lib
index f9bdf26..cbcd654 100644
--- a/scripts/Makefile.lib
+++ b/scripts/Makefile.lib
@@ -245,3 +245,7 @@ quiet_cmd_lzo = LZO $@
cmd_lzo = (cat $(filter-out FORCE,$^) | \
lzop -9 && $(call size_append, $(filter-out FORCE,$^))) > $@ || \
(rm -f $@ ; false)
+
+# misc stuff
+# ---------------------------------------------------------------------------
+quote:="
--
1.6.4.2


--
Regards/Gruss,
Boris.

--
Advanced Micro Devices, Inc.
Operating Systems Research Center

2010-02-23 17:39:04

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH] x86: Add optimized popcnt variants

On 02/23/2010 07:58 AM, Borislav Petkov wrote:
>
> Hmm, we cannot do that with the current design since __arch_hweight64
> is being inlined into every callsite and AFAICT we would have to build
> every callsite with "-fcall-saved-rdi" which is clearly too much. The
> explicit "=D" dummy constraint is straightforward, instead.
>

Uh... the -fcall-saved-rdi would go with all the other ones. Assuming
it can actually work and that gcc doesn't choke on an inbound argument
being saved.

-hpa

2010-02-23 17:54:49

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH] x86: Add optimized popcnt variants

From: "H. Peter Anvin" <[email protected]>
Date: Tue, Feb 23, 2010 at 09:34:04AM -0800

> On 02/23/2010 07:58 AM, Borislav Petkov wrote:
> >
> >Hmm, we cannot do that with the current design since __arch_hweight64
> >is being inlined into every callsite and AFAICT we would have to build
> >every callsite with "-fcall-saved-rdi" which is clearly too much. The
> >explicit "=D" dummy constraint is straightforward, instead.
> >
>
> Uh... the -fcall-saved-rdi would go with all the other ones.
> Assuming it can actually work and that gcc doesn't choke on an
> inbound argument being saved.

Right, doh. Ok, just added it and it builds fine with a gcc (Gentoo
4.4.1 p1.0) 4.4.1. If you have suspicion that some older gcc versions
might choke on it, I could leave the "=D" dummy constraint in?

BTW, the current version screams

/usr/src/linux-2.6/arch/x86/include/asm/arch_hweight.h: In function ‘__arch_hweight64’:
/usr/src/linux-2.6/arch/x86/include/asm/arch_hweight.h:47: warning: unused variable ‘dummy’

on x86-32. I'll send a fixed version in a second.
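
One way to silence that warning would be to confine the dummy to the 64-bit
branch, roughly as in the sketch below (not the posted fix; the revision
further down drops the dummy entirely in favor of -fcall-saved-rdi):

static inline unsigned long __arch_hweight64(__u64 w)
{
#ifdef CONFIG_X86_32
	return __arch_hweight32((u32)w) +
	       __arch_hweight32((u32)(w >> 32));
#else
	unsigned long res = 0;
	/* dummy output marks %rdi as clobbered by the call */
	unsigned long dummy;

	asm (ALTERNATIVE("call __sw_hweight64", POPCNT, X86_FEATURE_POPCNT)
	     : "="REG_OUT (res), "="REG_IN (dummy)
	     : REG_IN (w));

	return res;
#endif
}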

--
Regards/Gruss,
Boris.

-
Advanced Micro Devices, Inc.
Operating Systems Research Center

2010-02-23 18:22:09

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH] x86: Add optimized popcnt variants

On 02/23/2010 09:54 AM, Borislav Petkov wrote:
>
> Right, doh. Ok, just added it and it builds fine with a gcc (Gentoo
> 4.4.1 p1.0) 4.4.1. If you have suspicion that some older gcc versions
> might choke on it, I could leave the "=D" dummy constraint in?
>

I can try it with gcc 3.4 here. -fcall-saved-rdi is cleaner, if it works.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2010-02-23 19:06:11

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH] x86: Add optimized popcnt variants

From: "H. Peter Anvin" <[email protected]>
Date: Tue, Feb 23, 2010 at 10:17:39AM -0800

> On 02/23/2010 09:54 AM, Borislav Petkov wrote:
> >
> > Right, doh. Ok, just added it and it builds fine with a gcc (Gentoo
> > 4.4.1 p1.0) 4.4.1. If you have suspicion that some older gcc versions
> > might choke on it, I could leave the "=D" dummy constraint in?
> >
>
> I can try it with gcc 3.4 here. -fcall-saved-rdi is cleaner, if it works.

Ok, here you go.

--
From: Borislav Petkov <[email protected]>
Date: Thu, 11 Feb 2010 00:48:31 +0100
Subject: [PATCH] x86: Add optimized popcnt variants

Add support for the hardware version of the Hamming weight function,
popcnt, present in CPUs which advertise it under CPUID, Function
0x0000_0001_ECX[23]. On CPUs which don't support it, we fall back to the
default lib/hweight.c sw versions.

A synthetic benchmark comparing popcnt with __sw_hweight64 showed almost
a 3x speedup on a F10h machine.

Signed-off-by: Borislav Petkov <[email protected]>
---
arch/x86/Kconfig | 5 ++
arch/x86/include/asm/alternative.h | 9 +++-
arch/x86/include/asm/arch_hweight.h | 59 +++++++++++++++++++++++++++++
arch/x86/include/asm/bitops.h | 4 +-
include/asm-generic/bitops/arch_hweight.h | 22 +++++++++--
lib/Makefile | 3 +
lib/hweight.c | 20 +++++-----
scripts/Makefile.lib | 4 ++
8 files changed, 108 insertions(+), 18 deletions(-)
create mode 100644 arch/x86/include/asm/arch_hweight.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index eb40925..4673dc5 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -230,6 +230,11 @@ config X86_32_LAZY_GS
def_bool y
depends on X86_32 && !CC_STACKPROTECTOR

+config ARCH_HWEIGHT_CFLAGS
+ string
+ default "-fcall-saved-ecx -fcall-saved-edx" if X86_32
+ default "-fcall-saved-rdi -fcall-saved-rsi -fcall-saved-rdx -fcall-saved-rcx -fcall-saved-r8 -fcall-saved-r9 -fcall-saved-r10 -fcall-saved-r11" if X86_64
+
config KTIME_SCALAR
def_bool X86_32
source "init/Kconfig"
diff --git a/arch/x86/include/asm/alternative.h b/arch/x86/include/asm/alternative.h
index 69b74a7..0720c96 100644
--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -39,9 +39,6 @@
#define LOCK_PREFIX ""
#endif

-/* This must be included *after* the definition of LOCK_PREFIX */
-#include <asm/cpufeature.h>
-
struct alt_instr {
u8 *instr; /* original instruction */
u8 *replacement;
@@ -91,6 +88,12 @@ static inline void alternatives_smp_switch(int smp) {}
".previous"

/*
+ * This must be included *after* the definition of ALTERNATIVE due to
+ * <asm/arch_hweight.h>
+ */
+#include <asm/cpufeature.h>
+
+/*
* Alternative instructions for different CPU types or capabilities.
*
* This allows to use optimized instructions even on generic binary
diff --git a/arch/x86/include/asm/arch_hweight.h b/arch/x86/include/asm/arch_hweight.h
new file mode 100644
index 0000000..d1fc3c2
--- /dev/null
+++ b/arch/x86/include/asm/arch_hweight.h
@@ -0,0 +1,59 @@
+#ifndef _ASM_X86_HWEIGHT_H
+#define _ASM_X86_HWEIGHT_H
+
+#ifdef CONFIG_64BIT
+/* popcnt %rdi, %rax */
+#define POPCNT ".byte 0xf3,0x48,0x0f,0xb8,0xc7"
+#define REG_IN "D"
+#define REG_OUT "a"
+#else
+/* popcnt %eax, %eax */
+#define POPCNT ".byte 0xf3,0x0f,0xb8,0xc0"
+#define REG_IN "a"
+#define REG_OUT "a"
+#endif
+
+/*
+ * __sw_hweightXX are called from within the alternatives below
+ * and callee-clobbered registers need to be taken care of. See
+ * ARCH_HWEIGHT_CFLAGS in <arch/x86/Kconfig> for the respective
+ * compiler switches.
+ */
+static inline unsigned int __arch_hweight32(unsigned int w)
+{
+ unsigned int res = 0;
+
+ asm (ALTERNATIVE("call __sw_hweight32", POPCNT, X86_FEATURE_POPCNT)
+ : "="REG_OUT (res)
+ : REG_IN (w));
+
+ return res;
+}
+
+static inline unsigned int __arch_hweight16(unsigned int w)
+{
+ return __arch_hweight32(w & 0xffff);
+}
+
+static inline unsigned int __arch_hweight8(unsigned int w)
+{
+ return __arch_hweight32(w & 0xff);
+}
+
+static inline unsigned long __arch_hweight64(__u64 w)
+{
+ unsigned long res = 0;
+
+#ifdef CONFIG_X86_32
+ return __arch_hweight32((u32)w) +
+ __arch_hweight32((u32)(w >> 32));
+#else
+ asm (ALTERNATIVE("call __sw_hweight64", POPCNT, X86_FEATURE_POPCNT)
+ : "="REG_OUT (res)
+ : REG_IN (w));
+#endif /* CONFIG_X86_32 */
+
+ return res;
+}
+
+#endif
diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 02b47a6..545776e 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -444,7 +444,9 @@ static inline int fls(int x)

#define ARCH_HAS_FAST_MULTIPLIER 1

-#include <asm-generic/bitops/hweight.h>
+#include <asm/arch_hweight.h>
+
+#include <asm-generic/bitops/const_hweight.h>

#endif /* __KERNEL__ */

diff --git a/include/asm-generic/bitops/arch_hweight.h b/include/asm-generic/bitops/arch_hweight.h
index 3a7be84..9a81c1e 100644
--- a/include/asm-generic/bitops/arch_hweight.h
+++ b/include/asm-generic/bitops/arch_hweight.h
@@ -3,9 +3,23 @@

#include <asm/types.h>

-extern unsigned int __arch_hweight32(unsigned int w);
-extern unsigned int __arch_hweight16(unsigned int w);
-extern unsigned int __arch_hweight8(unsigned int w);
-extern unsigned long __arch_hweight64(__u64 w);
+inline unsigned int __arch_hweight32(unsigned int w)
+{
+ return __sw_hweight32(w);
+}

+inline unsigned int __arch_hweight16(unsigned int w)
+{
+ return __sw_hweight16(w);
+}
+
+inline unsigned int __arch_hweight8(unsigned int w)
+{
+ return __sw_hweight8(w);
+}
+
+inline unsigned long __arch_hweight64(__u64 w)
+{
+ return __sw_hweight64(w);
+}
#endif /* _ASM_GENERIC_BITOPS_HWEIGHT_H_ */
diff --git a/lib/Makefile b/lib/Makefile
index 3b0b4a6..e2ad17c 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -39,7 +39,10 @@ lib-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem.o
lib-$(CONFIG_GENERIC_FIND_FIRST_BIT) += find_next_bit.o
lib-$(CONFIG_GENERIC_FIND_NEXT_BIT) += find_next_bit.o
obj-$(CONFIG_GENERIC_FIND_LAST_BIT) += find_last_bit.o
+
+CFLAGS_hweight.o = $(subst $(quote),,$(CONFIG_ARCH_HWEIGHT_CFLAGS))
obj-$(CONFIG_GENERIC_HWEIGHT) += hweight.o
+
obj-$(CONFIG_LOCK_KERNEL) += kernel_lock.o
obj-$(CONFIG_DEBUG_PREEMPT) += smp_processor_id.o
obj-$(CONFIG_DEBUG_LIST) += list_debug.o
diff --git a/lib/hweight.c b/lib/hweight.c
index 9ff86df..f9ce440 100644
--- a/lib/hweight.c
+++ b/lib/hweight.c
@@ -9,7 +9,7 @@
* The Hamming Weight of a number is the total number of bits set in it.
*/

-unsigned int __arch_hweight32(unsigned int w)
+unsigned int __sw_hweight32(unsigned int w)
{
unsigned int res = w - ((w >> 1) & 0x55555555);
res = (res & 0x33333333) + ((res >> 2) & 0x33333333);
@@ -17,30 +17,30 @@ unsigned int __arch_hweight32(unsigned int w)
res = res + (res >> 8);
return (res + (res >> 16)) & 0x000000FF;
}
-EXPORT_SYMBOL(__arch_hweight32);
+EXPORT_SYMBOL(__sw_hweight32);

-unsigned int __arch_hweight16(unsigned int w)
+unsigned int __sw_hweight16(unsigned int w)
{
unsigned int res = w - ((w >> 1) & 0x5555);
res = (res & 0x3333) + ((res >> 2) & 0x3333);
res = (res + (res >> 4)) & 0x0F0F;
return (res + (res >> 8)) & 0x00FF;
}
-EXPORT_SYMBOL(__arch_hweight16);
+EXPORT_SYMBOL(__sw_hweight16);

-unsigned int __arch_hweight8(unsigned int w)
+unsigned int __sw_hweight8(unsigned int w)
{
unsigned int res = w - ((w >> 1) & 0x55);
res = (res & 0x33) + ((res >> 2) & 0x33);
return (res + (res >> 4)) & 0x0F;
}
-EXPORT_SYMBOL(__arch_hweight8);
+EXPORT_SYMBOL(__sw_hweight8);

-unsigned long __arch_hweight64(__u64 w)
+unsigned long __sw_hweight64(__u64 w)
{
#if BITS_PER_LONG == 32
- return __arch_hweight32((unsigned int)(w >> 32)) +
- __arch_hweight32((unsigned int)w);
+ return __sw_hweight32((unsigned int)(w >> 32)) +
+ __sw_hweight32((unsigned int)w);
#elif BITS_PER_LONG == 64
#ifdef ARCH_HAS_FAST_MULTIPLIER
w -= (w >> 1) & 0x5555555555555555ul;
@@ -57,4 +57,4 @@ unsigned long __arch_hweight64(__u64 w)
#endif
#endif
}
-EXPORT_SYMBOL(__arch_hweight64);
+EXPORT_SYMBOL(__sw_hweight64);
diff --git a/scripts/Makefile.lib b/scripts/Makefile.lib
index f9bdf26..cbcd654 100644
--- a/scripts/Makefile.lib
+++ b/scripts/Makefile.lib
@@ -245,3 +245,7 @@ quiet_cmd_lzo = LZO $@
cmd_lzo = (cat $(filter-out FORCE,$^) | \
lzop -9 && $(call size_append, $(filter-out FORCE,$^))) > $@ || \
(rm -f $@ ; false)
+
+# misc stuff
+# ---------------------------------------------------------------------------
+quote:="
--
1.6.4.2


--
Regards/Gruss,
Boris.

-
Advanced Micro Devices, Inc.
Operating Systems Research Center

2010-02-26 05:34:44

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH] x86: Add optimized popcnt variants

On 02/23/2010 11:06 AM, Borislav Petkov wrote:
>
> Ok, here you go.
>
> --
> From: Borislav Petkov <[email protected]>
> Date: Thu, 11 Feb 2010 00:48:31 +0100
> Subject: [PATCH] x86: Add optimized popcnt variants
>
> Add support for the hardware version of the Hamming weight function,
> popcnt, present in CPUs which advertise it under CPUID, Function
> 0x0000_0001_ECX[23]. On CPUs which don't support it, we fall back to the
> default lib/hweight.c sw versions.
>
> A synthetic benchmark comparing popcnt with __sw_hweight64 showed almost
> a 3x speedup on a F10h machine.
>
> Signed-off-by: Borislav Petkov <[email protected]>

OK, this patch looks pretty good now, but I'm completely lost as to what
the baseline of this patch is supposed to be.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2010-02-26 07:47:20

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH] x86: Add optimized popcnt variants

From: "H. Peter Anvin" <[email protected]>
Date: Thu, Feb 25, 2010 at 09:27:25PM -0800

> OK, this patch looks pretty good now, but I'm completely lost as to
> what the baseline of this patch is supposed to be.

Yeah, this is based on PeterZ's http://lkml.org/lkml/2010/2/4/119

But I'm not sure which tree has it...

--
Regards/Gruss,
Boris.

-
Advanced Micro Devices, Inc.
Operating Systems Research Center

2010-02-26 17:55:16

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH] x86: Add optimized popcnt variants

On 02/25/2010 11:47 PM, Borislav Petkov wrote:
> From: "H. Peter Anvin" <[email protected]>
> Date: Thu, Feb 25, 2010 at 09:27:25PM -0800
>
>> OK, this patch looks pretty good now, but I'm completely lost as to
>> what the baseline of this patch is supposed to be.
>
> Yeah, this is based on PeterZ's http://lkml.org/lkml/2010/2/4/119
>
> But I'm not sure which tree has it...
>

Looks like -mm, which really means that either Andrew has to take your
patch, too, or we have to wait until that is upstream before we can merge
your patch.

I'm a little nervous about just acking the patch and telling Andrew to
test it, because I don't know what the fallout would look like. I'm
particularly concerned about gcc version dependencies.

I guess, on the other hand, if it ends up not getting merged until .35
it's not a huge deal either.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2010-02-27 08:28:08

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH] x86: Add optimized popcnt variants

From: "H. Peter Anvin" <[email protected]>
Date: Fri, Feb 26, 2010 at 09:48:46AM -0800

> Looks like -mm, which really means that either Andrew has to take your
> patch, too, or we have to wait until that is upstream before we can merge
> your patch.
>
> I'm a little nervous about just acking the patch and telling Andrew to
> test it, because I don't know what the fallout would look like. I'm
> particularly concerned about gcc version dependencies.

I have the same concern. Actually, I'd be much more at ease if it saw a
bit of wider testing without hitting mainline just yet. I'll try to give
it some more testing with the machines and toolchains I can get my hands
on next week.

> I guess, on the other hand, if it ends up not getting merged until .35
> it's not a huge deal either.

Yeah, let's give it another round of testing and queue it for .35 -
AFAIR Ingo also runs a wide testing effort, so it spending another cycle
in -tip and being hammered on by us could give us a bit more certainty.

Thanks.

--
Regards/Gruss,
Boris.

-
Advanced Micro Devices, Inc.
Operating Systems Research Center

2010-02-27 20:04:40

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH] x86: Add optimized popcnt variants

On 02/27/2010 12:28 AM, Borislav Petkov wrote:
>
>> I guess, on the other hand, if it ends up not getting merged until .35
>> it's not a huge deal either.
>
> Yeah, let's give it another round of testing and queue it for .35 -
> AFAIR Ingo also runs a wide testing effort, so it spending another cycle
> in -tip and being hammered on by us could give us a bit more certainty.
>

Yes, if we can get into -tip then we'll get more test coverage, so I'll
queue it up for .35 as soon as the merge window closes. Please remind
me if I forget.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.