LinuxLists.cc - Re: [RFC][PATCH] Faster generic

2003-05-01 09:07:39

Subject: Re: [RFC][PATCH] Faster generic_fls

On Thu, May 01, 2003 at 03:02:14AM +0200, Daniel Phillips wrote:
> On Wednesday 30 April 2003 22:59, Willy Tarreau wrote:
> > I must acknowledge that your simple code was not easy to beat ! You can try
> > this one on your PIII, I could only test it on an athlon mobile and a P4.
> > With gcc 2.95.3, it gives me a boost of about 25%, because it seems as gcc
> > cannot optimize shifts efficiently. On 3.2.3, however, it's between 0 and
> > 5% depending on optimization/CPU.
>
> Was something ifdef'd incorrectly? Otherwise, there is something the PIII
> hates about that code. I got 107 seconds on the PIII, vs 53 seconds for my
> posted code at O3, and virtually no difference at O2. (gcc 3.2.3)

No, it was correct. It simply means that some CPUs prefer bit shifting while
others prefer comparisons and sometimes jumps, that mostly depends on the
ALU and the branch prediction unit, it seems. That shows us that if we include
such code in the kernel, it has to be arch-specific, so we may end up coding
it in assembler, keeping your code as the default one when no asm has been coded
for a given arch. BTW, has someone benchmarked BSF/BSR on x86 ? It should be
faster, it it's also possible that a poor microcode implements it with a one
bit/cycle algo, which will result in one instruction not being as fast as your
code.

Cheers,
Willy

2003-05-01 15:49:08

by Linus Torvalds

[permalink] [raw]

Subject: Re: [RFC][PATCH] Faster generic_fls

On Thu, 1 May 2003, Willy Tarreau wrote:
>
> BTW, has someone benchmarked BSF/BSR on x86 ? It should be
> faster, it it's also possible that a poor microcode implements it with a one
> bit/cycle algo, which will result in one instruction not being as fast as your
> code.

I think the original i386 did it with a one-bit-per-cycle algorithm,
anything since should be fine. In particular, on a P4 where I just tested,
the bsf seems to be 4 cycles over the whole input set (actually, my whole
loop was 4 cycles per iteration, so 4 cycles is worst-case. I'm assuming
the rest could have been done in parallell).

Linus