Linus Torvalds wrote:
>> BTW, has someone benchmarked BSF/BSR on x86 ? It should be
>> faster, it it's also possible that a poor microcode implements it with a one
>> bit/cycle algo, which will result in one instruction not being as fast as your
>> code.
>
> I think the original i386 did it with a one-bit-per-cycle algorithm,
> anything since should be fine. In particular, on a P4 where I just tested,
> the bsf seems to be 4 cycles over the whole input set (actually, my whole
> loop was 4 cycles per iteration, so 4 cycles is worst-case. I'm assuming
> the rest could have been done in parallell).
Just for comparison, the Pentium (Classic) manual says 6-43 clocks for
bsfl and 7-72 (!) for bsrl.
------
Chuck