2003-05-01 04:57:56

by Hu Gang

[permalink] [raw]
Subject: Re: [RFC][PATCH] Faster generic_fls

On Wed, 30 Apr 2003 13:55:12 -0700
Andrew Morton <[email protected]> wrote:

> Daniel Phillips <[email protected]> wrote:
> >
> > +static inline unsigned fls8(unsigned n)
> > +{
> > + return n & 0xf0?
> > + n & 0xc0? (n >> 7) + 7: (n >> 5) + 5:
> > + n & 0x0c? (n >> 3) + 3: n - ((n + 1) >> 2);
> > +}
>
> return fls_table[n];
>
>
> That'll be faster in benchmarks, possibly slower in real world. As usual.
>
Here is table version of the fls. Yes it fast than other.
typedef struct fls_table_t {
unsigned min;
unsigned max;
unsigned value;
} fls_table_t;

static fls_table_t fls_table[] = {
{0, 1, 1},
{1, 2, 2},
{2, 4, 3},
{4, 8, 4},
{8, 16, 5},
{16, 32, 6},
{32, 64, 7},
{64, 128, 8},
{128, 256, 9},
{256, 512, 10},
{512, 1024, 11},
{1024, 2048, 12},
{2048, 4096, 13},
{4096, 8192, 14},
{8192, 16384, 15},
{16384, 32768, 16},
{32768, 65536, 17},
{65536, 131072, 18},
{131072, 262144, 19},
{262144, 524288, 20},
{524288, 1048576, 21},
{1048576, 2097152, 22},
{2097152, 4194304, 23},
{4194304, 8388608, 24},
{8388608, 16777216, 25},
{16777216, 33554432, 26},
{33554432, 67108864, 27},
{67108864, 134217728, 28},
{134217728, 268435456, 29},
{268435456, 536870912, 30},
{536870912, 1073741824, 31},
{1073741824, 2147483648, 32},

{0, -1, 32},
};

static inline int fls_table_fls(unsigned n)
{
unsigned i = 0;

do {
i++;
} while (n <= fls_table[i].max && n > fls_table[i].min);

return fls_table[i].value;
}

--test log is here----
model name : Intel(R) Pentium(R) 4 Mobile CPU 1.60GHz
Reading specs from /usr/lib/gcc-lib/i386-linux/2.95.4/specs
gcc version 2.95.4 20011002 (Debian prerelease)
==== -march=i686 -O3 ====
test_fls.c:110: warning: decimal constant is so large that it is unsigned
4294967295 iterations of nil... checksum = 4294967295

real 0m22.311s
user 0m21.850s
sys 0m0.010s
4294967295 iterations of old... checksum = 4294967265

real 0m34.673s
user 0m33.960s
sys 0m0.000s
4294967295 iterations of new... checksum = 4294967265

real 0m57.702s
user 0m56.530s
sys 0m0.040s
4294967295 iterations of new2... checksum = 4294967265

real 0m41.709s
user 0m40.810s
sys 0m0.000s
4294967295 iterations of table... checksum = 4294967295

real 0m22.274s
user 0m21.640s
sys 0m0.010s
------------------
==== -march=i686 -O2 ====
test_fls.c:110: warning: decimal constant is so large that it is unsigned
4294967295 iterations of nil... checksum = 4294967295

real 0m29.501s
user 0m28.880s
sys 0m0.040s
4294967295 iterations of old... checksum = 4294967265

real 0m49.054s
user 0m47.840s
sys 0m0.040s
4294967295 iterations of new... checksum = 4294967265

real 0m52.331s
user 0m51.270s
sys 0m0.010s
4294967295 iterations of new2... checksum = 4294967265

real 1m7.043s
user 1m5.660s
sys 0m0.010s
4294967295 iterations of table... checksum = 4294967295

real 0m45.783s
user 0m44.860s
sys 0m0.010s
------------------
==== -O2 ====
test_fls.c:110: warning: decimal constant is so large that it is unsigned
4294967295 iterations of nil... checksum = 4294967295

real 0m34.395s
user 0m33.670s
sys 0m0.000s
4294967295 iterations of old... checksum = 4294967265

real 0m52.277s
user 0m51.230s
sys 0m0.000s
4294967295 iterations of new... checksum = 4294967265

real 0m54.750s
user 0m53.640s
sys 0m0.010s
4294967295 iterations of new2... checksum = 4294967265

real 0m57.728s
user 0m56.590s
sys 0m0.000s
4294967295 iterations of table... checksum = 4294967295

real 0m44.540s
user 0m43.600s
sys 0m0.020s
------------------
==== -O3 ====
test_fls.c:110: warning: decimal constant is so large that it is unsigned
4294967295 iterations of nil... checksum = 4294967295

real 0m16.196s
user 0m15.880s
sys 0m0.000s
4294967295 iterations of old... checksum = 4294967265

real 0m34.342s
user 0m33.630s
sys 0m0.010s
4294967295 iterations of new... checksum = 4294967265

real 0m58.554s
user 0m57.380s
sys 0m0.000s
4294967295 iterations of new2... checksum = 4294967265

real 0m37.140s
user 0m36.320s
sys 0m0.040s
4294967295 iterations of table... checksum = 4294967295

real 0m21.723s
user 0m21.270s
sys 0m0.000s
------------------

--
Hu Gang / Steve
Email : [email protected], [email protected]
GPG FinePrint: 4099 3F1D AE01 1817 68F7 D499 A6C2 C418 86C8 610E
ICQ# : 205800361
Registered Linux User : 204016


2003-05-01 04:58:40

by Andrew Morton

[permalink] [raw]
Subject: Re: [RFC][PATCH] Faster generic_fls

hugang <[email protected]> wrote:
>
> Here is table version of the fls. Yes it fast than other.

nooo.. That has a big cache footprint. At the very least you should use
a binary search. gcc will do it for you:

switch (n) {
case 0 ... 1:
return 1;
case 2 ... 3:
return 2;
case 4 ... 7:
return 3;
case 8 ... 15:
return 4;

etc.


2003-05-01 05:21:04

by Hu Gang

[permalink] [raw]
Subject: Re: [RFC][PATCH] Faster generic_fls

On Wed, 30 Apr 2003 22:11:29 -0700
Andrew Morton <[email protected]> wrote:

> nooo.. That has a big cache footprint. At the very least you should use
> a binary search. gcc will do it for you:
>
> switch (n) {
> case 0 ... 1:
> return 1;
> case 2 ... 3:
> return 2;
> case 4 ... 7:
> return 3;
> case 8 ... 15:
> return 4;
>
> etc.
It is here.
--------------------
static inline int fls_table_fls(unsigned n)
{
switch (n) {
case 0 ... 0: return 1;
case 1 ... 1: return 2;
case 2 ... 3: return 3;
case 4 ... 7: return 4;
case 8 ... 15: return 5;
case 16 ... 31: return 6;
case 32 ... 63: return 7;
case 64 ... 127: return 8;
case 128 ... 255: return 9;
case 256 ... 511: return 10;
case 512 ... 1023: return 11;
case 1024 ... 2047: return 12;
case 2048 ... 4095: return 13;
case 4096 ... 8191: return 14;
case 8192 ... 16383: return 15;
case 16384 ... 32767: return 16;
case 32768 ... 65535: return 17;
case 65536 ... 131071: return 18;
case 131072 ... 262143: return 19;
case 262144 ... 524287: return 20;
case 524288 ... 1048575: return 21;
case 1048576 ... 2097151: return 22;
case 2097152 ... 4194303: return 23;
case 4194304 ... 8388607: return 24;
case 8388608 ... 16777215: return 25;
case 16777216 ... 33554431: return 26;
case 33554432 ... 67108863: return 27;
case 67108864 ... 134217727: return 28;
case 134217728 ... 268435455: return 29;
case 268435456 ... 536870911: return 30;
case 536870912 ... 1073741823: return 31;
default:
}
return 32;
}

Now it in test, I will put log and file into
http://soulinfo.com/~hugang/kernel/

--
Hu Gang / Steve
Email : [email protected], [email protected]
GPG FinePrint: 4099 3F1D AE01 1817 68F7 D499 A6C2 C418 86C8 610E
ICQ# : 205800361
Registered Linux User : 204016

2003-05-01 07:01:01

by Hu Gang

[permalink] [raw]
Subject: Re: [RFC][PATCH] Faster generic_fls

Hello

Here is the table version of fls. Before version of it is wrong.
-------
static inline int fls_table_fls(unsigned n)
{
switch (n) {
case 0: return 0;
case 1 ... 1: return 1;
case 2 ... 3: return 2;
case 4 ... 7: return 3;
case 8 ... 15: return 4;
case 16 ... 31: return 5;
case 32 ... 63: return 6;
case 64 ... 127: return 7;
case 128 ... 255: return 8;
case 256 ... 511: return 9;
case 512 ... 1023: return 10;
case 1024 ... 2047: return 11;
case 2048 ... 4095: return 12;
case 4096 ... 8191: return 13;
case 8192 ... 16383: return 14;
case 16384 ... 32767: return 15;
case 32768 ... 65535: return 16;
case 65536 ... 131071: return 17;
case 131072 ... 262143: return 18;
case 262144 ... 524287: return 19;
case 524288 ... 1048575: return 20;
case 1048576 ... 2097151: return 21;
case 2097152 ... 4194303: return 22;
case 4194304 ... 8388607: return 23;
case 8388608 ... 16777215: return 24;
case 16777216 ... 33554431: return 25;
case 33554432 ... 67108863: return 26;
case 67108864 ... 134217727: return 27;
case 134217728 ... 268435455: return 28;
case 268435456 ... 536870911: return 29;
case 536870912 ... 1073741823: return 30;
case 1073741824 ... 2147483647: return 31;
default:return 32;
}
}
--here is the test log-----
model name : Intel(R) Pentium(R) 4 Mobile CPU 1.60GHz
Reading specs from /usr/lib/gcc-lib/i386-linux/2.95.4/specs
gcc version 2.95.4 20011002 (Debian prerelease)
==== -march=i686 -O3 ====
4294967295 iterations of nil... checksum = 4294967295

real 0m15.548s
user 0m15.500s
sys 0m0.010s
4294967295 iterations of old... checksum = 4294967265

real 0m34.778s
user 0m34.690s
sys 0m0.020s
4294967295 iterations of new... checksum = 4294967265

real 0m56.330s
user 0m56.190s
sys 0m0.000s
4294967295 iterations of new2... checksum = 4294967265

real 0m40.755s
user 0m40.680s
sys 0m0.000s
4294967295 iterations of table... checksum = 4294967265

real 0m49.928s
user 0m49.770s
sys 0m0.050s
------------------
==== -march=i686 -O2 ====
4294967295 iterations of nil... checksum = 4294967295

real 0m28.350s
user 0m28.280s
sys 0m0.020s
4294967295 iterations of old... checksum = 4294967265

real 0m48.199s
user 0m48.100s
sys 0m0.000s
4294967295 iterations of new... checksum = 4294967265

real 0m50.986s
user 0m50.890s
sys 0m0.000s
4294967295 iterations of new2... checksum = 4294967265

real 1m5.356s
user 1m5.210s
sys 0m0.010s
4294967295 iterations of table... checksum = 4294967265

real 0m45.238s
user 0m45.150s
sys 0m0.000s
------------------
==== -O2 ====
4294967295 iterations of nil... checksum = 4294967295

real 0m33.503s
user 0m33.410s
sys 0m0.020s
4294967295 iterations of old... checksum = 4294967265

real 0m50.571s
user 0m50.450s
sys 0m0.020s
4294967295 iterations of new... checksum = 4294967265

real 0m53.324s
user 0m53.200s
sys 0m0.010s
4294967295 iterations of new2... checksum = 4294967265

real 0m56.360s
user 0m56.250s
sys 0m0.000s
4294967295 iterations of table... checksum = 4294967265

real 0m46.524s
user 0m46.440s
sys 0m0.000s
------------------
==== -O3 ====
4294967295 iterations of nil... checksum = 4294967295

real 0m5.421s
user 0m5.410s
sys 0m0.000s
4294967295 iterations of old... checksum = 4294967265

real 0m29.744s
user 0m29.670s
sys 0m0.010s
4294967295 iterations of new... checksum = 4294967265

real 0m53.160s
user 0m53.050s
sys 0m0.000s
4294967295 iterations of new2... checksum = 4294967265

real 0m31.464s
user 0m31.400s
sys 0m0.010s
4294967295 iterations of table... checksum = 4294967265

real 0m46.546s
user 0m46.440s
sys 0m0.000s
------------------

--
Hu Gang / Steve
Email : [email protected], [email protected]
GPG FinePrint: 4099 3F1D AE01 1817 68F7 D499 A6C2 C418 86C8 610E
ICQ# : 205800361
Registered Linux User : 204016

2003-05-01 13:42:31

by Willy Tarreau

[permalink] [raw]
Subject: Re: [RFC][PATCH] Faster generic_fls

On Thu, May 01, 2003 at 03:05:57PM +0800, hugang wrote:
> Hello
>
> Here is the table version of fls. Before version of it is wrong.

Ok, I recoded the tree myself with if/else, and it's now faster than all others,
whatever the compiler. I have cut the tree to only 16 bits evaluations because
it was still faster than a full 32 bits tree on x86, probably because of the
code length. However, on Alpha EV6 (gcc-2.95.3), the 32 bits tree is faster.
The 32 bits code is also the same speed as the 16 bits on gcc-3.2.3 if I
specify -mbranch-cost=2 (because gcc assumes that today's CPUs need one cycle
per jump).

Here is the function, followed by the logs executed on an Athlon.

Disclaimer: Daniel's PentiumIII seems to give very different results than my
Athlon, so every test should be considered for one arch only !!!

To quickly resume the results, on gcc-3.2.3 it's 45% faster than the table algo,
I don't understand why, and just a little bit faster than Daniel's function.

On gcc-2.95.3, it's only 20% faster than the table, but 46% than Daniel's.

However, I have good news for you, the table code is 15% faster than my tree
on an Alpha EV6 ;-)
It may be because gcc can use different tricks on this arch.

Cheers
Willy

static inline int fls_tree_fls(unsigned n) {
#ifndef REALLY_WANT_A_32BITS_TREE
/* it seems as if working on half a tree is faster than the
complete tree.
*/
register t = 0;
if (n >= (1<<16)) {
n >>= 16;
t = 16;
}
if (n < (1<<8))
if (n < (1<<4))
if (n < 4)
return t + n - ((n + 1) >> 2);
else
return t + 3 + (n >= 8);
else
if (n < (1<<6))
return t + 5 + (n >= (1<<5));
else
return t + 7 + (n >= (1<<7));
else
if (n < (1<<12))
if (n < (1<<10))
return t + 9 + (n >= (1<<9));
else
return t + 11 + (n >= (1<<11));
else
if (n < (1<<14))
return t + 13 + (n >= (1<<13));
else
return t + 15 + (n >= (1<<15));
#else
/* perhaps this is faster on systems with BIG caches, but on
athlon, at least, it's slower than the previous one
*/
if (n < (1<<16))
if (n < (1<<8))
if (n < (1<<4))
if (n < 4)
return n - ((n + 1) >> 2);
else
return 3 + (n >= 8);
else
if (n < (1<<6))
return 5 + (n >= (1<<5));
else
return 7 + (n >= (1<<7));
else
if (n < (1<<12))
if (n < (1<<10))
return 9 + (n >= (1<<9));
else
return 11 + (n >= (1<<11));
else
if (n < (1<<14))
return 13 + (n >= (1<<13));
else
return 15 + (n >= (1<<15));
else
if (n < (1<<24))
if (n < (1<<20))
if (n < (1<<18))
return 17 + (n >= (1<<17));
else
return 19 + (n >= (1<<19));
else
if (n < (1<<22))
return 21 + (n >= (1<<21));
else
return 23 + (n >= (1<<23));
else
if (n < (1<<28))
if (n < (1<<26))
return 25 + (n >= (1<<25));
else
return 27 + (n >= (1<<27));
else
if (n < (1<<30))
return 29 + (n >= (1<<29));
else
return 31 + (n >= (1<<31));
#endif
}

== results on Athlon 1800+ (1.53 GHz), gcc-2.95.3 -O3 -march=i686 :

4294967295 iterations of nil... checksum = 4294967295

real 0m5.600s
user 0m5.590s
sys 0m0.007s
4294967295 iterations of old... checksum = 4294967265

real 0m39.640s
user 0m39.499s
sys 0m0.141s
4294967295 iterations of new... checksum = 4294967265

real 0m45.088s
user 0m44.936s
sys 0m0.158s
4294967295 iterations of fls_table... checksum = 4294967264

real 0m35.988s
user 0m35.833s
sys 0m0.159s
4294967295 iterations of fls_tree... checksum = 4294967265

real 0m28.699s
user 0m28.605s
sys 0m0.096s

== results on Athlon 1800+ (1.53 GHz), gcc-3.2.3 -O3 -march=athlon :

4294967295 iterations of nil... checksum = 4294967295

real 0m16.624s
user 0m16.624s
sys 0m0.001s
4294967295 iterations of old... checksum = 4294967265

real 0m57.685s
user 0m57.568s
sys 0m0.125s
4294967295 iterations of new... checksum = 4294967265

real 0m37.513s
user 0m37.415s
sys 0m0.103s
4294967295 iterations of fls_table... checksum = 4294967264

real 0m56.068s
user 0m55.991s
sys 0m0.085s
4294967295 iterations of fls_tree...
checksum = 4294967265

real 0m36.636s
user 0m36.516s
sys 0m0.125s

=== alpha EV6 / 466 Mhz, gcc-2.95.3 -O3 ====
4294967295 iterations of new... checksum = 4294967265

real 2m19.951s
user 2m19.543s
sys 0m0.018s
4294967295 iterations of fls_table... checksum = 4294967265

real 1m12.825s
user 1m12.667s
sys 0m0.005s
4294967295 iterations of fls_tree... checksum = 4294967265

real 1m33.469s
user 1m33.242s
sys 0m0.013s

2003-05-01 14:02:22

by Falk Hueffner

[permalink] [raw]
Subject: Re: [RFC][PATCH] Faster generic_fls

Willy TARREAU <[email protected]> writes:

> On Thu, May 01, 2003 at 03:05:57PM +0800, hugang wrote:
> Ok, I recoded the tree myself with if/else, and it's now faster than
> all others, whatever the compiler.

Have you tried with not simply increasing, but random numbers? I guess
this could make quite a difference here because of branch prediction.

--
Falk

2003-05-01 14:16:33

by Willy Tarreau

[permalink] [raw]
Subject: Re: [RFC][PATCH] Faster generic_fls

On Thu, May 01, 2003 at 04:14:17PM +0200, Falk Hueffner wrote:
> Willy TARREAU <[email protected]> writes:
>
> > On Thu, May 01, 2003 at 03:05:57PM +0800, hugang wrote:
> > Ok, I recoded the tree myself with if/else, and it's now faster than
> > all others, whatever the compiler.
>
> Have you tried with not simply increasing, but random numbers? I guess
> this could make quite a difference here because of branch prediction.

I thought about this, and indeed, that's what I used in the program I used
to bench the first function I sent yesterday. The problem of the random, is
that it's so slow that you must build a giant table and apply your tests to
this table. So the problem mainly displaces to data cache misses which cost
more than certain operations. If you try it, you'll note that it's difficult
to get comparable results twice.

Other solutions include non-linear suites such as mixing some sequential
values with BSWAP. Eg: x ^ bswap(x) ^ bswap(x << 4).

Willy

2003-05-01 14:48:12

by Hu Gang

[permalink] [raw]
Subject: Re: [RFC][PATCH] Faster generic_fls

On Thu, 1 May 2003 15:52:04 +0200
Willy TARREAU <[email protected]> wrote:

> However, I have good news for you, the table code is 15% faster than my tree
> on an Alpha EV6 ;-)
> It may be because gcc can use different tricks on this arch.
>
> Cheers
> Willy

However, The most code changed has increase a little peformance, Do we really need it?

Here is test in my machine, The faster is still table.

==== -Wall -Wstrict-prototypes -Wno-trigraphs -O2 -fno-strict-aliasing -fno-common -fomit-frame-pointer -pipe -mpreferred-stack-boundary=2 -march=i686 ====
test_fls.c: In function `fls_tree_fls':
test_fls.c:78: warning: type defaults to `int' in declaration of `t'
4294967295 iterations of nil... checksum = 4294967295

real 0m21.852s
user 0m21.500s
sys 0m0.300s
4294967295 iterations of old... checksum = 4294967265

real 0m45.206s
user 0m44.800s
sys 0m0.310s
4294967295 iterations of new... checksum = 4294967265

real 0m45.235s
user 0m44.720s
sys 0m0.420s
4294967295 iterations of new2... checksum = 4294967265

real 0m59.615s
user 0m58.960s
sys 0m0.540s
4294967295 iterations of table... checksum = 4294967265

real 0m41.784s
user 0m41.420s
sys 0m0.280s
4294967295 iterations of tree... checksum = 4294967265

real 0m44.874s
user 0m44.410s
sys 0m0.380s
------------------


--
Hu Gang / Steve
Email : [email protected], [email protected]
GPG FinePrint: 4099 3F1D AE01 1817 68F7 D499 A6C2 C418 86C8 610E
ICQ# : 205800361
Registered Linux User : 204016

2003-05-01 15:43:25

by Thomas Schlichter

[permalink] [raw]
Subject: Re: [RFC][PATCH] Faster generic_fls

Hi,

On March 1, hugang wrote:
> On Thu, 1 May 2003 15:52:04 +0200
>
> Willy TARREAU <[email protected]> wrote:
> > However, I have good news for you, the table code is 15% faster than my
> > tree on an Alpha EV6 ;-)
> > It may be because gcc can use different tricks on this arch.
> >
> > Cheers
> > Willy
>
> However, The most code changed has increase a little peformance, Do we
> really need it?
>
> Here is test in my machine, The faster is still table.

I think the table version is so fast as a normal distribution of numbers is
tested. With this distribution half of the occuring values have the MSB set,
one quarter the next significant bit and so on...

The tree version is approximately equally fast for every number, but the table
version is fast if a very significant bit is set and slow if a less
significant bit is set...

If small values occour more often than big ones the tree version should be
preferred, else following version may be very fast, too.

static inline int fls_shift(int x)
{
int bit = 32;

while(x > 0) {
--bit;
x <<= 1;
}

return x ? bit : 0;
}

I get following times on my K6-III (compiled with -O3 -march=k6) for
4294967296 iterations:

fls_bsrl: (uses the 'bsrl' command via inline assembler)
real 3m39.059s
user 3m27.416s
sys 0m0.345s

fls_shift:
real 1m47.337s
user 1m41.801s
sys 0m0.185s

fls_tree:
real 1m56.746s
user 1m50.681s
sys 0m0.165s

The 32Bit tree is a bit slower here, too:
real 1m59.447s
user 1m53.375s
sys 0m0.200s

I did not test the table version as I think it cannot be faster than the shift
version...

Best regards
Thomas Schlichter

P.S.: I'd be happy to see how this performs on an Athlon or P-IV...


Attachments:
(No filename) (1.68 kB)
signed data
(No filename) (189.00 B)
signature
Download all attachments

2003-05-01 17:32:10

by Willy Tarreau

[permalink] [raw]
Subject: Re: [RFC][PATCH] Faster generic_fls

On Thu, May 01, 2003 at 10:53:21PM +0800, hugang wrote:

> Here is test in my machine, The faster is still table.

because, as Falk said, the tests are incremental and the branch prediction
works very well. I proposed a simple scrambling function based on bswap. Please
consider this function :

f(i) = i ^ htonl(i) ^ htonl(i<<7)

It returns such values :

0x00000001 => 0x81000001
0x00000002 => 0x02010002
0x00000003 => 0x83010003
0x00000004 => 0x04020004
0x00000005 => 0x85020005
0x00000006 => 0x06030006
0x00000007 => 0x87030007
0x00000008 => 0x08040008
0x00000009 => 0x89040009
0x0000000a => 0x0a05000a
0x0000000b => 0x8b05000b

etc...

As you see, high bits move fast enough to shot a predictor.

The tree function as well as Daniel's "new" resist better to non linear suites.
BTW, the latest changes I did show that the convergence should be between
Daniel's function and mine, because there are common concepts. I noticed that
the expression used in Daniel's function is too complex for gcc to optimize it
well enough. In my case, gcc 3.2.3 coded too many jumps instead of conditional
moves. I saw that playing with -mbranch-cost changes the code. A mix between
the two is used here (and listed after). It's still not optimial, reading the
code, because there's always one useless jump and move. But in the worst case,
it gives exactly the same result as Daniel's. But in other cases, it is even
slightly faster. Honnestly, now it's just a matter of taste. Daniel's easier to
read, mine produces smaller code. This time, it's faster than others Athlon,
Alpha and Sparc. I Don't know about the PentiumIII nor the P4.

Here are the results on Athlon, gcc-3.2.3, then Alpha and Sparc.

Willy

====

4294967295 iterations of bswap/nil... checksum = 4294967295

real 0m53.133s
user 0m53.114s
sys 0m0.005s

4294967295 iterations of bswap/fls_table... checksum = 4294967262

real 1m44.686s
user 1m44.163s
sys 0m0.487s

4294967295 iterations of bswap/new... checksum = 4294967262

real 1m16.463s
user 1m16.144s
sys 0m0.314s

4294967295 iterations of bswap/fls_tree... checksum = 4294967262

real 1m16.511s
user 1m16.169s
sys 0m0.296s

== alpha 466 MHz , gcc-3.2.3
gcc -O3 -o testfls -fomit-frame-pointer -mcpu=ev67 ===

4294967295 iterations of bswap/fls_tree... checksum = 4294967262

real 4m14.432s
user 4m13.540s
sys 0m0.038s

4294967295 iterations of bswap/fls_table... checksum = 4294967262

real 4m36.204s
user 4m35.368s
sys 0m0.030s

== ultra-sparc 333 MHz, gcc 3.1.1 (linear only)
gcc -O3 -mcpu=v9 -fomit-frame-pointer ===

4294967295 iterations of fls_tree... checksum = 4294967265

real 1m52.680s
user 1m52.640s
sys 0m0.010s

4294967295 iterations of fls_table... checksum = 4294967265

real 3m24.514s
user 3m24.430s
sys 0m0.010s

======= here is the function :

unsigned fls_tree_fls(unsigned n) {
register t = 0;

if (n >= (1<<16)) {
n >>= 16;
t = 16;
}

if (n < (1<<8))
if (n < (1<<4))
if (n < 4)
return t + n - ((n + 1) >> 2);
else
return t + 3 + (n>>3);
else
if (n < (1<<6))
return t + 5 + (n>>5);
else
return t + 7 + (n>>7);
else
if (n < (1<<12))
if (n < (1<<10))
return t + 9 + (n>>9);
else
return t + 11 + (n>>11);
else
if (n < (1<<14))
return t + 13 + (n>>13);
else
return t + 15 + (n>>15);
}

2003-05-01 23:15:43

by Thomas Schlichter

[permalink] [raw]
Subject: Re: [RFC][PATCH] Faster generic_fls

On May 1, Willy TARREAU wrote:
> On Thu, May 01, 2003 at 10:53:21PM +0800, hugang wrote:
> > Here is test in my machine, The faster is still table.
>
> because, as Falk said, the tests are incremental and the branch prediction
> works very well. I proposed a simple scrambling function based on bswap.
> Please consider this function :
>
> f(i) = i ^ htonl(i) ^ htonl(i<<7)
>
> It returns such values :
>
> 0x00000001 => 0x81000001
> 0x00000002 => 0x02010002
> 0x00000003 => 0x83010003
> 0x00000004 => 0x04020004
> 0x00000005 => 0x85020005
> 0x00000006 => 0x06030006
> 0x00000007 => 0x87030007
> 0x00000008 => 0x08040008
> 0x00000009 => 0x89040009
> 0x0000000a => 0x0a05000a
> 0x0000000b => 0x8b05000b
>
> etc...
>
> As you see, high bits move fast enough to shot a predictor.
>
> The tree function as well as Daniel's "new" resist better to non linear
> suites. BTW, the latest changes I did show that the convergence should be
> between Daniel's function and mine, because there are common concepts. I
> noticed that the expression used in Daniel's function is too complex for
> gcc to optimize it well enough. In my case, gcc 3.2.3 coded too many jumps
> instead of conditional moves. I saw that playing with -mbranch-cost changes
> the code. A mix between the two is used here (and listed after). It's still
> not optimial, reading the code, because there's always one useless jump and
> move. But in the worst case, it gives exactly the same result as Daniel's.
> But in other cases, it is even slightly faster. Honnestly, now it's just a
> matter of taste. Daniel's easier to read, mine produces smaller code. This
> time, it's faster than others Athlon, Alpha and Sparc. I Don't know about
> the PentiumIII nor the P4.
>
> Here are the results on Athlon, gcc-3.2.3, then Alpha and Sparc.

~~ snip~~

Here are some results with the scrambling function on AMD K6-III 450MHz,
gcc-2.95.3:

fls_old (generic_fls from linux 2.5.68):
real 3m52.960s
user 3m42.080s
sys 0m0.348s

fls_bsrl:
real 4m39.040s
user 4m25.532s
sys 0m0.401s

fls_table:
real 4m59.622s
user 4m45.511s
sys 0m0.469s

fls_tree:
real 3m3.986s
user 2m55.236s
sys 0m0.272s

fls_shift:
real 2m58.765s
user 2m50.092s
sys 0m0.278s

So for me the table version seems to be the slowest one. The BSRL instruction
on the K6-III seems to be very slow, too. The tree and my shift version are
faster than the original version here...

That someone else can test my fls_shift version I'll provide it here again:
static inline int fls_shift(int x)
{
int bit = 32;

while(x > 0) {
--bit;
x <<= 1;
}

return x ? bit : 0;
}

> Willy

Thomas


Attachments:
(No filename) (2.60 kB)
signed data
(No filename) (189.00 B)
signature
Download all attachments

2003-05-01 23:55:36

by Daniel Phillips

[permalink] [raw]
Subject: Re: [RFC][PATCH] Faster generic_fls

On Friday 02 May 2003 01:27, Thomas Schlichter wrote:
> ...So for me the table version seems to be the slowest one. The BSRL
> instruction on the K6-III seems to be very slow, too. The tree and my shift
> version are faster than the original version here...
>
> That someone else can test my fls_shift version I'll provide it here again:
> static inline int fls_shift(int x)
> {
> int bit = 32;
>
> while(x > 0) {
> --bit;
> x <<= 1;
> }
>
> return x ? bit : 0;
> }

Your shift version is the fastest on the PIII as well, finishing in 45.3
seconds vs 53.4 for my original, and using only 12 bytes of text. This was a
big surprise. The time was the same, whether I inlined it or not. I take
this to mean that -O3 inlines such a short function in either case.

Regards,

Daniel

2003-05-02 00:00:20

by Willy Tarreau

[permalink] [raw]
Subject: Re: [RFC][PATCH] Faster generic_fls

On Fri, May 02, 2003 at 01:27:16AM +0200, Thomas Schlichter wrote:

> So for me the table version seems to be the slowest one. The BSRL instruction
> on the K6-III seems to be very slow, too. The tree and my shift version are
> faster than the original version here...
>
> That someone else can test my fls_shift version I'll provide it here again:

That's interesting Thomasr. I get 18.4 s on the Athlon here vs 32.3 for Daniel's
(I have broken my function at the moment so I cannot re-bench it right now, but
it should be near Daniel's). At first, I thought you had coded an FFS instead of
an FLS. But I realized it's valid, and is fast because one half of the numbers
tested will not even take one iteration.

Regards,
Willy

2003-05-02 00:04:20

by Lincoln Dale

[permalink] [raw]
Subject: Re: [RFC][PATCH] Faster generic_fls

At 03:52 PM 1/05/2003 +0200, Willy TARREAU wrote:
>Ok, I recoded the tree myself with if/else, and it's now faster than all
>others,
[..]
>4294967295 iterations of nil... checksum = 4294967295
>4294967295 iterations of old... checksum = 4294967265
>4294967295 iterations of new... checksum = 4294967265
>4294967295 iterations of fls_table... checksum = 4294967264
>4294967295 iterations of fls_tree... checksum = 4294967265

yep - faster - but with different results.
shouldn't the checksums be the same????


cheers,

lincoln.

2003-05-02 00:21:08

by Hu Gang

[permalink] [raw]
Subject: Re: [RFC][PATCH] Faster generic_fls

On Thu, 1 May 2003 17:54:28 +0200
Thomas Schlichter <[email protected]> wrote:

> Hi,
>
> On March 1, hugang wrote:
> > On Thu, 1 May 2003 15:52:04 +0200
> >
> > Willy TARREAU <[email protected]> wrote:
> > > However, I have good news for you, the table code is 15% faster than my
> > > tree on an Alpha EV6 ;-)
> > > It may be because gcc can use different tricks on this arch.
> > >
> > > Cheers
> > > Willy
> >
> > However, The most code changed has increase a little peformance, Do we
> > really need it?
> >
> > Here is test in my machine, The faster is still table.
>
> I think the table version is so fast as a normal distribution of numbers is
> tested. With this distribution half of the occuring values have the MSB set,
> one quarter the next significant bit and so on...
>
> The tree version is approximately equally fast for every number, but the table
> version is fast if a very significant bit is set and slow if a less
> significant bit is set...
>
> If small values occour more often than big ones the tree version should be
> preferred, else following version may be very fast, too.
>
> static inline int fls_shift(int x)
> {
> int bit = 32;
>
> while(x > 0) {
> --bit;
> x <<= 1;
> }
>
> return x ? bit : 0;
> }
>

Yes, The shift is very faster than other. Here is a test in P4 1.6G.

model name : Intel(R) Pentium(R) 4 Mobile CPU 1.60GHz
Reading specs from /usr/lib/gcc-lib/i386-linux/2.95.4/specs
gcc version 2.95.4 20011002 (Debian prerelease)
==== -Wall -Wstrict-prototypes -Wno-trigraphs -O2 -fno-strict-aliasing -fno-common -fomit-frame-pointer -pipe -mpreferred-stack-boundary=2 -march=i686 ====
test_fls.c: In function `fls_tree_fls':
test_fls.c:85: warning: type defaults to `int' in declaration of `t'
4294967295 iterations of nil... checksum = 4294967295

real 0m21.698s
user 0m21.650s
sys 0m0.000s
4294967295 iterations of table... checksum = 4294967265

real 0m41.581s
user 0m41.500s
sys 0m0.000s
4294967295 iterations of tree... checksum = 4294967265

real 1m1.173s
user 1m1.040s
sys 0m0.000s
4294967295 iterations of shift... checksum = 4294967265

real 0m22.050s
user 0m22.000s
sys 0m0.010s



--
Hu Gang / Steve
Email : [email protected], [email protected]
GPG FinePrint: 4099 3F1D AE01 1817 68F7 D499 A6C2 C418 86C8 610E
ICQ# : 205800361
Registered Linux User : 204016

2003-05-02 00:25:09

by Daniel Phillips

[permalink] [raw]
Subject: Re: [RFC][PATCH] Faster generic_fls

On Friday 02 May 2003 02:10, Willy Tarreau wrote:
> At first, I thought you had coded an FFS instead of an FLS. But I realized
> it's valid, and is fast because one half of the numbers tested will not even
> take one iteration.

Aha, and duh. At 1 million iterations, my binary search clobbers the shift
version by a factor of four. At 2**31 iterations, my version also wins, by
20%.

I should note that the iterations parameter to my benchmark program is deeply
flawed, since it changes the nature of the input set, not just the number of
iterations. Fortunately, all tests I've done have been with 2**32
iterations, giving a perfectly uniform distribution of input values.

For uniformly distributed inputs the simple shift loop does well, but for
calculating floor(log2(x)) it's much worse than the fancier alternatives. I
suspect typical usage leans more to the latter than the former.

Regards,

Daniel

2003-05-02 00:45:49

by Andrew Morton

[permalink] [raw]
Subject: Re: [RFC][PATCH] Faster generic_fls

Daniel Phillips <[email protected]> wrote:
>
> On Friday 02 May 2003 02:10, Willy Tarreau wrote:
> > At first, I thought you had coded an FFS instead of an FLS. But I realized
> > it's valid, and is fast because one half of the numbers tested will not even
> > take one iteration.
>
> Aha, and duh. At 1 million iterations, my binary search clobbers the shift
> version by a factor of four. At 2**31 iterations, my version also wins, by
> 20%.
>

Would it be churlish to point out that the only significant user of fls()
is sctp_v6_addr_match_len()?

2003-05-02 01:35:40

by Thomas Schlichter

[permalink] [raw]
Subject: Re: [RFC][PATCH] Faster generic_fls

On May 2, Daniel Phillips wrote:
> On Friday 02 May 2003 02:10, Willy Tarreau wrote:
> > At first, I thought you had coded an FFS instead of an FLS. But I
> > realized it's valid, and is fast because one half of the numbers tested
> > will not even take one iteration.
>
> Aha, and duh. At 1 million iterations, my binary search clobbers the shift
> version by a factor of four. At 2**31 iterations, my version also wins, by
> 20%.
>
> I should note that the iterations parameter to my benchmark program is
> deeply flawed, since it changes the nature of the input set, not just the
> number of iterations. Fortunately, all tests I've done have been with
> 2**32 iterations, giving a perfectly uniform distribution of input values.

That is what I posted in my first message in this thread... The shift
algorithm only works fine for uniform distributed input values... But here is
a version that behaves better for small values, too. I don't think it will
reach the tree version but it should be much better that the old version!

int fls_shift(int x)
{
int bit;

if(x & 0xFFFF0000) {
bit = 32;

while(x > 0) {
--bit;
x <<= 1;
}
} else {
bit = 0;

while(x) {
++bit;
x >>= 1;
}
}

return bit;
}

For me this version even speeds up the uniform distributed benchmark...

> For uniformly distributed inputs the simple shift loop does well, but for
> calculating floor(log2(x)) it's much worse than the fancier alternatives.
> I suspect typical usage leans more to the latter than the former.

If this is the case the tree version will surely be the best!

But I think this topic is not worth any further work as this is not used very
often... So this version will be my last one!

But this was a good example how suited algorithms can speed up benchmarks ;-)

> Regards,
>
> Daniel

Best regards
Thomas


Attachments:
(No filename) (1.80 kB)
signed data
(No filename) (189.00 B)
signature
Download all attachments

2003-05-02 12:02:50

by Daniel Phillips

[permalink] [raw]
Subject: Re: [RFC][PATCH] Faster generic_fls

On Friday 02 May 2003 02:54, Andrew Morton wrote:
> Would it be churlish to point out that the only significant user of fls()
> is sctp_v6_addr_match_len()?

Maybe. It's also useful for breaking up an arbitrary IO region optimally into
binary-sized blocks, which is part of the current 2.4 device-mapper patch
set, not yet submitted.

Regards,

Daniel

2003-05-02 13:12:20

by Willy Tarreau

[permalink] [raw]
Subject: Re: [RFC][PATCH] Faster generic_fls

On Fri, May 02, 2003 at 03:47:37AM +0200, Thomas Schlichter wrote:

> That is what I posted in my first message in this thread... The shift
> algorithm only works fine for uniform distributed input values... But here is
> a version that behaves better for small values, too. I don't think it will
> reach the tree version but it should be much better that the old version!

I also tried to change your version into this, and at least it's not slower,
so it's good anyway, considering the small code size.

> If this is the case the tree version will surely be the best!
>
> But I think this topic is not worth any further work as this is not used very
> often... So this version will be my last one!

I agree. Just for the record, I'll post two other original implementations
which are not intersting for their real-life performance, but may be
interesting to reuse in other projects or in cheap hardware implementations,
or as a base for other algos. One of the downsides is that they need many
registers. They both are about twice as slow as the tree on athlon, alpha and
sparc.

The first one uses no jump at all if the CPU can do CMOV. It's about twice as
slow as tree, but may be a win on machines with a big jump cost. And since
there's no jump, its execution time is constant :

unsigned fls32_1(unsigned n)
{
/* this code is totally jumpless on architectures that support CMOV,
and can execute up to 4 instructions per cycle. However, it uses
lots of instructions and registers and is not as fast as it should be.
It's about 20 cycles on a dual-pipeline CPU.
*/
register unsigned x = n, bits = 0, bits2, a, b;

a = x & 0xffff0000;
bits2 = bits + 16;

b = x & 0xff00ff00;
if (x & a) { bits = bits2;}
if (x & a) { x &= a;}

bits2 = bits + 8;
a = x & 0xf0f0f0f0;

if (x & b) { bits = bits2;}
if (x & b) { x &= b;}

bits2 = bits + 4;
b = x & 0xcccccccc;

if (x & a) { bits = bits2;}
if (x & a) { x &= a;}

bits2 = bits + 2;
a = x & 0xaaaaaaaa;

if (x & b) { bits = bits2;}

if (x & b) { x &= b;}
bits2 = bits + 1;

if (x & a) { bits = bits2;}
if (x & a) { x &= a;}

if (x) { bits += 1; }

return bits;
}

The second one has a radically different approach. It converges int 5 shifts.
However, each iteration has a non neglileable cost, and its time is nearly
constant too. The code is rather small (about 70 bytes), but it needs a CPU
which can shift and jump fast to be efficient. It consumes about the same
time as above.

unsigned fls32_2(unsigned n)
{
register unsigned t = 16, r = 0;
register unsigned m = 0xffff0000;

// it only needs 5 iterations to complete, and some
// instructions can be executed in parallel. It's
// more efficient than the pure shift on small values.
// But it needs many registers :-(
if (n) {
while (t) {
if (n & m) {
n >>= t;
r += t;
t >>= 1;
m >>= t;
} else {
n <<= t;
t >>= 1;
m <<= t ? t : 1;
}
}
return r + 1;
}
else
return n;
}


These were my last versions, and not particularly performant ones ;-)

Regards,
Willy