i want to speed up my product's checksum verification code, and was
pondering the use of mmx (ip_fast_csum as implemented by cwik and
gulbrandsen from asm-i386/checksum.h is fast enough for my needs, but i
don't want to violate the gpl 8) ).
i'm refreshing myself on mmx currently, but noticed the following
comment from arch/i386/lib/mmx.c's _mmx_memcpy:
"Checksums are not a win with MMX on any CPU tested so far for any MMX
solution figured."
firstly, to what domain of checksums does this comment apply? secondly,
why is it true? it seems the PADDW family of instructions could work
well here; is the slowdown a result of the kernel's need to muck with
fpu state (from what i can tell, mmx uses the fp registers)?
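
for concreteness, here is a minimal sketch of one thing i'd have to watch for
(an assumption on my part, not something taken from the kernel source): PADDW
is a wrapping 16-bit add, so each lane throws away the carry that a one's
complement sum (rfc 1071 style) needs folded back in. the function names are
made up for illustration, and csum_wrap16 only models a single PADDW lane in
plain c; it is not real mmx code.

```c
#include <stddef.h>
#include <stdint.h>

/* Reference one's complement sum over 16-bit words, RFC 1071 style.
 * The 32-bit accumulator collects carries out of bit 15 in its upper
 * half, and they are folded back in at the end.
 * Assumes n is small enough (at most 65536 words) that sum cannot overflow. */
static uint16_t csum_fold32(const uint16_t *w, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += w[i];
    while (sum >> 16)
        sum = (sum & 0xffffu) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Hypothetical model of a single PADDW lane: a 16-bit add that wraps,
 * silently discarding the carry the checksum depends on. */
static uint16_t csum_wrap16(const uint16_t *w, size_t n)
{
    uint16_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum = (uint16_t)(sum + w[i]);
    return (uint16_t)~sum;
}
```

the two agree until some add carries out of bit 15, so an mmx version would
presumably have to widen the words to 32-bit lanes first (or recover the
carries some other way), which eats into whatever the packed adds save.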
thanks so much for any help!
--
nick black
On Tue, 11 Feb 2003 13:37:07 EST, nick black said:
> firstly, to what domain of checksums does this comment apply? secondly,
> why is it true? it seems the PADDW family of instructions could work
> well here; is the slowdown a result of the kernel's need to muck with
> fpu state (from what i can tell, mmx uses the fp registers)?
(Note - second-hand info from somebody else who looked at MMX/SSE to optimize
an inner loop. Double-check with CPU documentation).
There's a big "urp" sound as the processor switches from FP to MMX mode and
back, which apparently takes a large number of cycles. You can to some extent
amortize this if you're switching once for a LONG loop (the analysis I saw was
with a million or so pixels on a screen image) - if you're switching in and out
for a 1500 byte packet (or even worse, a 100-byte packet) the impact may be
more noticeable. You may wish to examine the SSE/SSE2 opcodes, which apparently
don't take this performance hit.
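
The amortization argument can be put as a rough break-even calculation. This is
only a sketch: the struct, the function names, and the idea of a single fixed
switch cost are assumptions made for illustration, and every input would have
to come from measurement on the CPU in question.

```c
/* Hypothetical cost model; none of these figures come from real hardware. */
struct simd_cost {
    double switch_cycles; /* fixed cost of entering and leaving MMX state */
    double scalar_cpb;    /* cycles per byte for the scalar loop */
    double simd_cpb;      /* cycles per byte for the MMX loop */
};

/* Estimated cycles for the scalar loop over a buffer of the given size. */
static double scalar_cycles(const struct simd_cost *c, double bytes)
{
    return c->scalar_cpb * bytes;
}

/* Estimated cycles for the MMX loop, paying the switch cost once. */
static double simd_cycles(const struct simd_cost *c, double bytes)
{
    return c->switch_cycles + c->simd_cpb * bytes;
}

/* Buffer size at which the two estimates meet; returns -1.0 if the
 * MMX loop is no faster per byte and so can never catch up. */
static double break_even_bytes(const struct simd_cost *c)
{
    double saved_per_byte = c->scalar_cpb - c->simd_cpb;
    if (saved_per_byte <= 0.0)
        return -1.0;
    return c->switch_cycles / saved_per_byte;
}
```

With made-up inputs of a 1000-cycle switch, 1.0 cycles per byte scalar, and
0.5 cycles per byte MMX, the break-even point is 2000 bytes: a 1500-byte packet
loses, while a million-byte image wins easily.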
--
Valdis Kletnieks
Computer Systems Senior Engineer
Virginia Tech