Hi,
gcc 3.4 gained support for several typical bitops as builtin directives.
Using these over inline asm has a few advantages:
* gcc can optimize constants into these better
* gcc can reorder and schedule the code better
* gcc can allocate registers etc better for the code
The question is if we consider it desirable to go down this road or not. In
order to help that discussion I've attached a patch below that switches the
i386 ffz() function to the gcc builtin version, conditional on gcc having
support for this. Before I go down the road of converting more functions
and/or architectures.... is this worth doing?
Greetings,
Arjan van de Ven
--- linux-2.6.7/include/asm-i386/bitops.h~ 2004-06-23 23:45:06.048614387 +0200
+++ linux-2.6.7/include/asm-i386/bitops.h 2004-06-23 23:45:06.048614387 +0200
@@ -344,6 +344,8 @@
*
* Undefined if no zero exists, so code should check against ~0UL first.
*/
+
+#ifndef HAVE_BUILTIN_CTZL
static inline unsigned long ffz(unsigned long word)
{
__asm__("bsfl %1,%0"
@@ -351,6 +353,12 @@
:"r" (~word));
return word;
}
+#else
+static inline unsigned long ffz (unsigned long word)
+{
+ return __builtin_ctzl (~word);
+}
+#endif
/**
* __ffs - find first bit in word.
--- linux-2.6.7/include/linux/compiler-gcc3.h~ 2004-06-24 09:26:04.123455290 +0200
+++ linux-2.6.7/include/linux/compiler-gcc3.h 2004-06-24 09:26:04.123455290 +0200
@@ -19,6 +19,11 @@
# define __attribute_used__ __attribute__((__unused__))
#endif
+#if __GNUC_MINOR__ >= 4
+#define HAVE_BUILTIN_CTZL
+#endif
+
+
#define __attribute_pure__ __attribute__((pure))
#define __attribute_const__ __attribute__((__const__))
Arjan van de Ven <[email protected]> wrote:
>
> gcc 3.4 gained support for several typical bitops as builtin directives.
> Using these over inline asm has a few advantages:
> * gcc can optimize constants into these better
> * gcc can reorder and schedule the code better
> * gcc can allocate registers etc better for the code
>
> The question is if we consider it desirable to go down this road or not. In
> order to help that discussion I've attached a patch below that switches the
> i386 ffz() function to the gcc builtin version, conditional on gcc having
> support for this. Before I go down the road of converting more functions
> and/or architectures.... is this worth doing?
I guess it depends on the resulting code size and quality. Some extra
conversions would be needed for that.
For the implementation it would be nice to have the old-style
implementations in one header and the new-style ones in a separate header.
That would create a bit of an all-or-nothing situation, but that should be
OK?
> +static inline unsigned long ffz (unsigned long word)
> +{
> + return __builtin_ctzl (~word);
> +}
eww, whitepsace innovations.
static inline unsigned long ffz(unsigned long word)
{
return __builtin_ctzl(~word);
}
On Thu, Jun 24, 2004 at 02:00:22AM -0700, Andrew Morton wrote:
> > and/or architectures.... is this worth doing?
>
> I guess it depends on the resulting code size and quality. Some extra
> conversions would be needed for that.
ok I'll see if I can get some detailed info on that
>
> For the implementation it would be nice to have the old-style
> implementations in one header and the new-style ones in a separate header.
> That would create a bit of an all-or-nothing situation, but that should be
> OK?
Perhaps. It's not impossible that say gcc 3.5 will add a few more builtins
even that then allow more functions to be converted, otoh that shouldn't be
impossible to cope with. I'll have a look to see how it pans out.
Greetings,
Arjan van de Ven
On Thu, Jun 24, 2004 at 02:00:22AM -0700, Andrew Morton wrote:
> Arjan van de Ven <[email protected]> wrote:
> >
> > gcc 3.4 gained support for several typical bitops as builtin directives.
> > Using these over inline asm has a few advantages:
> > * gcc can optimize constants into these better
> > * gcc can reorder and schedule the code better
> > * gcc can allocate registers etc better for the code
> >
> > The question is if we consider it desirable to go down this road or not. In
> > order to help that discussion I've attached a patch below that switches the
> > i386 ffz() function to the gcc builtin version, conditional on gcc having
> > support for this. Before I go down the road of converting more functions
> > and/or architectures.... is this worth doing?
>
> I guess it depends on the resulting code size and quality. Some extra
> conversions would be needed for that.
for ffz() the exact same assembly instructions are generated in the cases I looked at
(kernel/signal.c); eg no extra code at all.
I see a list of these gcc bitop builtins at the bottom of the page:
http://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html
Looks like you can find the position of the first 1 bit, the length of
the leading or trailing seq of 0 bits, the hamming weight (popcount) and
the parity, each for int, long and long long.
I just add this for the benefit of others.
As to your primary question - is this worth doing - I don't have
an answer.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373
On Thu, Jun 24, 2004 at 02:31:09AM -0700, Paul Jackson wrote:
> I see a list of these gcc bitop builtins at the bottom of the page:
>
> http://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html
>
> Looks like you can find the position of the first 1 bit, the length of
> the leading or trailing seq of 0 bits, the hamming weight (popcount) and
> the parity, each for int, long and long long.
>
> I just add this for the benefit of others.
>
> As to your primary question - is this worth doing - I don't have
> an answer.
It is, for 2 reasons:
1) unlike __asm, GCC knows how to schedule the instructions in the builtins
2) GCC will handle stuff like ffz (16) at compile time rather than runtime
But, all the builtins are not natively supported on every architecture,
if there is no arch support, it falls back to a libgcc library function,
which the kernel probably wants to avoid.
E.g. popcount on i386, etc.
Jakub
On Thu, Jun 24, 2004 at 02:00:22AM -0700, Andrew Morton wrote:
> For the implementation it would be nice to have the old-style
> implementations in one header and the new-style ones in a separate header.
> That would create a bit of an all-or-nothing situation, but that should be
> OK?
In addition I stuck those in asm-generic since they no longer are
architecture specific....
diff -purN linux-2.6.7/include/asm-i386/bitops.h linux/include/asm-i386/bitops.h
--- linux-2.6.7/include/asm-i386/bitops.h 2004-06-24 17:26:10.030404507 +0200
+++ linux/include/asm-i386/bitops.h 2004-06-24 18:47:26.582837487 +0200
@@ -338,34 +338,6 @@ static inline int find_first_bit(const u
*/
int find_next_bit(const unsigned long *addr, int size, int offset);
-/**
- * ffz - find first zero in word.
- * @word: The word to search
- *
- * Undefined if no zero exists, so code should check against ~0UL first.
- */
-static inline unsigned long ffz(unsigned long word)
-{
- __asm__("bsfl %1,%0"
- :"=r" (word)
- :"r" (~word));
- return word;
-}
-
-/**
- * __ffs - find first bit in word.
- * @word: The word to search
- *
- * Undefined if no bit exists, so code should check against 0 first.
- */
-static inline unsigned long __ffs(unsigned long word)
-{
- __asm__("bsfl %1,%0"
- :"=r" (word)
- :"rm" (word));
- return word;
-}
-
/*
* fls: find last bit set.
*/
@@ -374,6 +346,12 @@ static inline unsigned long __ffs(unsign
#ifdef __KERNEL__
+#ifdef USE_BUILTIN_BITOPS
+#include <asm-generic/bitops_gcc.h>
+#else
+#include <asm/bitops_asm.h>
+#endif
+
/*
* Every architecture must define this function. It's the fastest
* way of searching a 140-bit bitmap where the first 100 bits are
@@ -394,25 +372,6 @@ static inline int sched_find_first_bit(c
}
/**
- * ffs - find first bit set
- * @x: the word to search
- *
- * This is defined the same way as
- * the libc and compiler builtin ffs routines, therefore
- * differs in spirit from the above ffz (man ffs).
- */
-static inline int ffs(int x)
-{
- int r;
-
- __asm__("bsfl %1,%0\n\t"
- "jnz 1f\n\t"
- "movl $-1,%0\n"
- "1:" : "=r" (r) : "rm" (x));
- return r+1;
-}
-
-/**
* hweightN - returns the hamming weight of a N-bit word
* @x: the word to weigh
*
diff -purN linux-2.6.7/include/asm-generic/bitops_gcc.h linux/include/asm-generic/bitops_gcc.h
--- linux-2.6.7/include/asm-generic/bitops_gcc.h 1970-01-01 01:00:00.000000000 +0100
+++ linux/include/asm-generic/bitops_gcc.h 2004-06-24 18:45:03.483991176 +0200
@@ -0,0 +1,61 @@
+#ifndef _I386_BITOPS_GCC_H
+#define _I386_BITOPS_GCC_H
+
+/*
+ * Copyright 1992, Linus Torvalds.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See
+ * the GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ *
+ * June 2004 - Modified by Arjan van de Ven <[email protected]> to use gcc builtin's
+ *
+ */
+
+
+/**
+ * ffz - find first zero in word.
+ * @word: The word to search
+ *
+ * Undefined if no zero exists, so code should check against ~0UL first.
+ */
+static inline unsigned long ffz(unsigned long word)
+{
+ return __builtin_ctzl(~word);
+}
+
+/**
+ * __ffs - find first bit in word.
+ * @word: The word to search
+ *
+ * Undefined if no bit exists, so code should check against 0 first.
+ */
+static inline unsigned long __ffs(unsigned long word)
+{
+ return __builtin_ctzl(word);
+}
+
+/**
+ * ffs - find first bit set
+ * @x: the word to search
+ *
+ * This is defined the same way as
+ * the libc and compiler builtin ffs routines, therefore
+ * differs in spirit from the above ffz (man ffs).
+ */
+static inline int ffs(int x)
+{
+ return __builtin_ffs(x);
+}
+
+#endif /* _I386_BITOPS_GCC_H */
diff -purN linux-2.6.7/include/asm-i386/bitops_asm.h linux/include/asm-i386/bitops_asm.h
--- linux-2.6.7/include/asm-i386/bitops_asm.h 1970-01-01 01:00:00.000000000 +0100
+++ linux/include/asm-i386/bitops_asm.h 2004-06-24 18:45:15.669530459 +0200
@@ -0,0 +1,56 @@
+#ifndef _I386_BITOPS_ASM_H
+#define _I386_BITOPS_ASM_H
+
+/*
+ * Copyright 1992, Linus Torvalds.
+ */
+
+
+/**
+ * ffz - find first zero in word.
+ * @word: The word to search
+ *
+ * Undefined if no zero exists, so code should check against ~0UL first.
+ */
+static inline unsigned long ffz(unsigned long word)
+{
+ __asm__("bsfl %1,%0"
+ :"=r" (word)
+ :"r" (~word));
+ return word;
+}
+
+/**
+ * __ffs - find first bit in word.
+ * @word: The word to search
+ *
+ * Undefined if no bit exists, so code should check against 0 first.
+ */
+static inline unsigned long __ffs(unsigned long word)
+{
+ __asm__("bsfl %1,%0"
+ :"=r" (word)
+ :"rm" (word));
+ return word;
+}
+
+/**
+ * ffs - find first bit set
+ * @x: the word to search
+ *
+ * This is defined the same way as
+ * the libc and compiler builtin ffs routines, therefore
+ * differs in spirit from the above ffz (man ffs).
+ */
+static inline int ffs(int x)
+{
+ int r;
+
+ __asm__("bsfl %1,%0\n\t"
+ "jnz 1f\n\t"
+ "movl $-1,%0\n"
+ "1:" : "=r" (r) : "rm" (x));
+ return r+1;
+}
+
+#endif /* _I386_BITOPS_ASM_H */
diff -purN linux-2.6.7/include/linux/compiler-gcc+.h linux/include/linux/compiler-gcc+.h
--- linux-2.6.7/include/linux/compiler-gcc+.h 2004-06-24 17:26:10.513346616 +0200
+++ linux/include/linux/compiler-gcc+.h 2004-06-24 18:48:29.266323429 +0200
@@ -6,6 +6,7 @@
*/
#include <linux/compiler-gcc.h>
+#define USE_BUILTIN_BITOPS
#define inline __inline__ __attribute__((always_inline))
#define __inline__ __inline__ __attribute__((always_inline))
#define __inline __inline__ __attribute__((always_inline))
diff -purN linux-2.6.7/include/linux/compiler-gcc3.h linux/include/linux/compiler-gcc3.h
--- linux-2.6.7/include/linux/compiler-gcc3.h 2004-06-24 17:26:10.511346855 +0200
+++ linux/include/linux/compiler-gcc3.h 2004-06-24 18:48:16.189890940 +0200
@@ -19,6 +19,10 @@
# define __attribute_used__ __attribute__((__unused__))
#endif
+#if __GNUC_MINOR__ >= 4
+#define USE_BUILTIN_BITOPS
+#endif
+
#define __attribute_pure__ __attribute__((pure))
#define __attribute_const__ __attribute__((__const__))
On Thu, Jun 24, 2004 at 09:09:36AM +0200, Arjan van de Ven wrote:
> Hi,
>
> gcc 3.4 gained support for several typical bitops as builtin directives.
> Using these over inline asm has a few advantages:
> * gcc can optimize constants into these better
> * gcc can reorder and schedule the code better
> * gcc can allocate registers etc better for the code
>
> The question is if we consider it desirable to go down this road or not. In
> order to help that discussion I've attached a patch below that switches the
> i386 ffz() function to the gcc builtin version, conditional on gcc having
> support for this. Before I go down the road of converting more functions
> and/or architectures.... is this worth doing?
>
> Greetings,
> Arjan van de Ven
>
>
> --- linux-2.6.7/include/asm-i386/bitops.h~ 2004-06-23 23:45:06.048614387 +0200
> +++ linux-2.6.7/include/asm-i386/bitops.h 2004-06-23 23:45:06.048614387 +0200
> @@ -344,6 +344,8 @@
> *
> * Undefined if no zero exists, so code should check against ~0UL first.
> */
> +
> +#ifndef HAVE_BUILTIN_CTZL
> static inline unsigned long ffz(unsigned long word)
> {
> __asm__("bsfl %1,%0"
> @@ -351,6 +353,12 @@
> :"r" (~word));
> return word;
> }
> +#else
> +static inline unsigned long ffz (unsigned long word)
> +{
> + return __builtin_ctzl (~word);
> +}
> +#endif
>
> /**
> * __ffs - find first bit in word.
> --- linux-2.6.7/include/linux/compiler-gcc3.h~ 2004-06-24 09:26:04.123455290 +0200
> +++ linux-2.6.7/include/linux/compiler-gcc3.h 2004-06-24 09:26:04.123455290 +0200
> @@ -19,6 +19,11 @@
> # define __attribute_used__ __attribute__((__unused__))
> #endif
>
> +#if __GNUC_MINOR__ >= 4
Please do no test only on minor. People have been arguing
whether the next major release of GCC should be called 3.5
or 4.0 since tree-ssa has been merged.
This said, I'd rather use GCC intrinsics when they can
perform the task, especially the ones that expand
to more than one machine instruction.
> +#define HAVE_BUILTIN_CTZL
> +#endif
Gabriel
On Thu, Jun 24, 2004 at 01:31:51PM +0200, Arjan van de Ven wrote:
> On Thu, Jun 24, 2004 at 02:00:22AM -0700, Andrew Morton wrote:
> > For the implementation it would be nice to have the old-style
> > implementations in one header and the new-style ones in a separate header.
> > That would create a bit of an all-or-nothing situation, but that should be
> > OK?
>
> In addition I stuck those in asm-generic since they no longer are
> architecture specific....
This is not going to work.
Say on x86_64, __builtin_ctzl (~word) ends up __ctzdi2 (~word) call in GCC
3.4.x, which is not defined in the kernel (in 3.5 it will be bsfq).
On a bunch of arches which don't have an instruction for ffz operation
it will always result in a library call.
Jakub
On Thu, Jun 24, 2004 at 01:51:46PM +0200, Gabriel Paubert wrote:
> > --- linux-2.6.7/include/linux/compiler-gcc3.h~ 2004-06-24 09:26:04.123455290 +0200
> > +++ linux-2.6.7/include/linux/compiler-gcc3.h 2004-06-24 09:26:04.123455290 +0200
> > @@ -19,6 +19,11 @@
> > # define __attribute_used__ __attribute__((__unused__))
> > #endif
> >
> > +#if __GNUC_MINOR__ >= 4
>
> Please do no test only on minor. People have been arguing
> whether the next major release of GCC should be called 3.5
> or 4.0 since tree-ssa has been merged.
this header is only uses for the gcc major == 3. The header for ==4 and
later has the define unconditional, and the file for major ==2 doesn't have
it at all.
On Thu, Jun 24, 2004 at 08:05:34AM -0400, Jakub Jelinek wrote:
> On Thu, Jun 24, 2004 at 01:31:51PM +0200, Arjan van de Ven wrote:
> > On Thu, Jun 24, 2004 at 02:00:22AM -0700, Andrew Morton wrote:
> > > For the implementation it would be nice to have the old-style
> > > implementations in one header and the new-style ones in a separate header.
> > > That would create a bit of an all-or-nothing situation, but that should be
> > > OK?
> >
> > In addition I stuck those in asm-generic since they no longer are
> > architecture specific....
>
> This is not going to work.
> Say on x86_64, __builtin_ctzl (~word) ends up __ctzdi2 (~word) call in GCC
> 3.4.x, which is not defined in the kernel (in 3.5 it will be bsfq).
> On a bunch of arches which don't have an instruction for ffz operation
> it will always result in a library call.
It's actually fine; the architecture first needs to include this file and
there it can use the proper ifdefs; the functions themselves don't matter,
only when they can be used, and the arch still controls that.
On Thu, 24 Jun 2004 09:20:07 +0200, you wrote in linux.kernel:
> +#if __GNUC_MINOR__ >= 4
> +#define HAVE_BUILTIN_CTZL
> +#endif
Eh, what value does __GNUC_MINOR__ have for, say, gcc-2.95.x?
--
Ciao,
Pascal
On Thu, Jun 24, 2004 at 03:46:50PM +0200, Pascal Schmidt wrote:
> On Thu, 24 Jun 2004 09:20:07 +0200, you wrote in linux.kernel:
>
> > +#if __GNUC_MINOR__ >= 4
> > +#define HAVE_BUILTIN_CTZL
> > +#endif
>
> Eh, what value does __GNUC_MINOR__ have for, say, gcc-2.95.x?
gcc 2.x.y do not use compiler-gcc3.h but compiler-gcc2.h instead so that is
irrelevant ;)
On Thu, 24 Jun 2004, Arjan van de Ven wrote:
>> Eh, what value does __GNUC_MINOR__ have for, say, gcc-2.95.x?
> gcc 2.x.y do not use compiler-gcc3.h but compiler-gcc2.h instead so that is
> irrelevant ;)
Ah. Sorry for the noise.
--
Ciao,
Pascal
On Thu, 24 Jun 2004, Jakub Jelinek wrote:
> >
> > As to your primary question - is this worth doing - I don't have
> > an answer.
>
> It is, for 2 reasons:
> 1) unlike __asm, GCC knows how to schedule the instructions in the builtins
> 2) GCC will handle stuff like ffz (16) at compile time rather than runtime
I'd argue that unless the gcc code can be shown to be clearly better, it
is NOT worth doing.
Adding support for built-ins generates a noticeable maintenance burden,
since we'll still have to support older gcc's and architectures where gcc
does WORSE than we do. And quite frankly, I doubt you'll find any cases
where gcc does any better in any measurable way.
In other words, the rule about gcc builtins should be:
- use them only if they are old enough that the huge majority (possibly
all) of users get them. This is partly because gcc has frankly been
buggy, and often makes assumptions that simply may not be true for the
kernel (ie "it's ok to use library routines").
- only use them if you can show a measurable improvement. Theoretical
arguments simply don't count. Theoretical arguments is why gcc-3.x is
a lot slower than 2.95 and is apparently still not generating
appreciably better code for the kernel (and no, don't bother pointing
me to spec runs, I just don't care. The kernel is what I care about).
So far, the only case where they have been worth it is likely the
"memcpy()" stuff, and even there our previous macros were doing an almost
equivalent job (but were arguably so ugly that it was worth making the
change).
For something like ffs/popcount/etc, I do not see _any_ point to compiler
support. There just aren't any kernel uses that make it worth it. Sounds
like a total micro-optimization for some spec benchmark.
Linus
On Thu, 24 Jun 2004, Arjan van de Ven wrote:
>
> It's actually fine; the architecture first needs to include this file and
> there it can use the proper ifdefs; the functions themselves don't matter,
> only when they can be used, and the arch still controls that.
And my argument is: what the hell does this _buy_ us, except for extra
complexity, and even more code dependence on different versions of gcc.
I don't want the extra code-paths and magic #ifdef's unless there's a
clear improvement somewhere. And quite frankly, I don't see that being the
case for something like ffs.
Linus
> Perhaps. It's not impossible that say gcc 3.5 will add a few more builtins
> even that then allow more functions to be converted, otoh that shouldn't be
> impossible to cope with. I'll have a look to see how it pans out.
You could have an asm-generic/bitops-builtin.h and arch's could #include
that after defining all the HAVE_BUILTIN_xxx macros they want. I suspect
not all architectures will get the most correct built-ins (e.g. the arch
may be able to optimize better than gcc's builtin is doing).
--
Debian - http://www.debian.org/
Linux 1394 - http://www.linux1394.org/
Subversion - http://subversion.tigris.org/
WatchGuard - http://www.watchguard.com/