Here is a kernel patch which replaces compile-time selection
of the csum and csum_copy routines with boot-time selection.
This allows the kernel to use the fastest routines _for the given CPU_.
It checks CPU capabilities before using them.
I tried to generalize the code so that it can be adapted
for memcpy, copy to/from user, etc.
Theory of operation: I replace functions with pointers in .h files:
-asmlinkage unsigned int csum_partial(const unsigned char * buff, int len, unsigned int sum);
+extern
+asmlinkage unsigned int (*csum_partial)(const unsigned char * buff, int len, unsigned int sum);
Pointers are initialized to default routines:
+csum_func *csum_partial = csum_simple;
+csumcpy_func *csum_partial_copy_generic = csumcpy_simple;
After benchmarking, they are replaced by the fastest ones:
+ best = find_best(bench_csum, buffer, csum_runner,
+ VECTOR_SZ(csum_runner));
+ printk("csum: using csum function: %s\n", best->name);
+ csum_partial = (csum_func*)(best->f);
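To make the pointer swap concrete, here is a minimal userspace sketch of the same idea. It is not the code from the patch: only the csum_func typedef name is borrowed, and everything else (sum_bytewise, sum_unrolled, CAP_FANCY, struct candidate, pick_best, install_best) is a hypothetical stand-in. The "checksums" are plain byte sums, and the scores are passed in instead of being measured, so the selection logic can be checked deterministically.

```c
#include <assert.h>
#include <stddef.h>

/* Same shape as the csum_partial prototype quoted above. */
typedef unsigned int csum_func(const unsigned char *buff, int len, unsigned int sum);

/* Stand-in routine: a plain byte sum, NOT the real Internet checksum. */
static unsigned int sum_bytewise(const unsigned char *buff, int len, unsigned int sum)
{
    for (int i = 0; i < len; i++)
        sum += buff[i];
    return sum;
}

/* Second stand-in: same result, four bytes per iteration. */
static unsigned int sum_unrolled(const unsigned char *buff, int len, unsigned int sum)
{
    int i = 0;
    for (; i + 4 <= len; i += 4)
        sum += buff[i] + buff[i + 1] + buff[i + 2] + buff[i + 3];
    for (; i < len; i++)
        sum += buff[i];
    return sum;
}

#define CAP_NONE  0u
#define CAP_FANCY 1u /* hypothetical CPU feature bit */

struct candidate {
    const char *name;
    csum_func *f;
    unsigned int required_caps; /* features the routine needs */
};

static const struct candidate table[] = {
    { "bytewise", sum_bytewise, CAP_NONE },
    { "unrolled", sum_unrolled, CAP_NONE },
    { "fancy",    sum_unrolled, CAP_FANCY }, /* pretends to need CAP_FANCY */
};
#define TABLE_SZ (sizeof(table) / sizeof(table[0]))

/* Callers go through this pointer; it starts at a safe default. */
static csum_func *checksum = sum_bytewise;

/* Return the highest-scoring candidate whose required features are all
 * present in cpu_caps, or NULL if none qualifies. */
static const struct candidate *pick_best(const struct candidate *tab,
                                         const unsigned long *scores,
                                         size_t n, unsigned int cpu_caps)
{
    const struct candidate *best = NULL;
    unsigned long best_score = 0;

    assert(tab != NULL && scores != NULL);
    for (size_t i = 0; i < n; i++) {
        if (tab[i].required_caps & ~cpu_caps)
            continue; /* skipped: not supported by this CPU */
        if (best == NULL || scores[i] > best_score) {
            best = &tab[i];
            best_score = scores[i];
        }
    }
    return best;
}

/* Repoint the global pointer at the winner and return its name. */
static const char *install_best(const unsigned long *scores, unsigned int cpu_caps)
{
    const struct candidate *best = pick_best(table, scores, TABLE_SZ, cpu_caps);

    if (best == NULL)
        return "(default kept)";
    checksum = best->f;
    return best->name;
}
```

With scores { 34, 50, 79 } and cpu_caps == CAP_NONE, the third entry is skipped and "unrolled" wins. With CAP_FANCY set, "fancy" wins. Callers keep calling checksum(...) either way and never learn which routine is behind it.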
bench_func.[ch] contain generic benchmarking support,
to be used for memcpy and copy to/from user later.
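For readers who have not unpacked the attachment, here is one way a generic "count passes in a fixed time window" benchmark could look in userspace C. This is a sketch under assumptions, not the contents of bench_func.[ch]: count_passes, sum_bytewise and the use of clock() are all my own illustration, and the real code runs in the kernel and may time things quite differently.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

typedef unsigned int csum_func(const unsigned char *buff, int len, unsigned int sum);

/* Stand-in routine under test: a plain byte sum, not a real checksum. */
static unsigned int sum_bytewise(const unsigned char *buff, int len, unsigned int sum)
{
    for (int i = 0; i < len; i++)
        sum += buff[i];
    return sum;
}

/* Keeps the compiler from optimizing the benchmarked calls away. */
static volatile unsigned int sink;

/* Run f over the whole buffer, chunksz bytes at a time, and count how many
 * complete passes start before `window` clock ticks have elapsed.
 * At least one pass is always performed. */
static unsigned long count_passes(csum_func *f, const unsigned char *buf,
                                  size_t bufsz, size_t chunksz, clock_t window)
{
    unsigned long count = 0;
    clock_t start = clock();
    clock_t now;

    assert(chunksz > 0 && chunksz <= bufsz);
    do {
        for (size_t off = 0; off + chunksz <= bufsz; off += chunksz)
            sink += f(buf + off, (int)chunksz, 0);
        count++;
        now = clock();
        /* Stop early if the clock is unavailable, to avoid looping forever. */
    } while (start != (clock_t)-1 && now != (clock_t)-1 && now - start < window);

    return count;
}
```

A caller would run count_passes() once per candidate over the same buffer and compare the counts. In the dmesg below, every reported MB/sec figure equals count * 10.24 (34 -> 348.160, 79 -> 808.960), which is consistent with a fixed time window, though the exact constants are only in the patch itself.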
Some thoughts:
+/* Set this to value bigger than cache(s) */
+#define bufshift 20 /* 10=1kb,20=1Mb etc */
+#define chunksz (4*1024)
How can this be picked according to cache size
(I mean reliably, even for future CPUs)?
+ char *buffer = (char *) __get_free_pages(GFP_KERNEL,
+ (bufshift-PAGE_SHIFT));
Hmmm... this uses a __xx function. What is the 'official' way to do it?
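As a sanity check on the allocation size: the second argument of __get_free_pages() is an order, and the call returns 2^order contiguous pages. The arithmetic can be sketched as below, assuming PAGE_SHIFT is 12 (4 KiB pages, as on i386). The helper names and the ASSUMED_PAGE_SHIFT macro are illustrative only.

```c
#include <assert.h>

#define BUFSHIFT           20 /* same value as bufshift in the patch: 1 MB */
#define ASSUMED_PAGE_SHIFT 12 /* assumption: 4 KiB pages (i386) */

/* Allocation order handed to __get_free_pages(): log2 of the page count. */
static unsigned int buffer_order(void)
{
    assert(BUFSHIFT >= ASSUMED_PAGE_SHIFT);
    return BUFSHIFT - ASSUMED_PAGE_SHIFT;
}

/* Number of contiguous pages an allocation of that order covers. */
static unsigned long buffer_pages(void)
{
    return 1UL << buffer_order();
}

/* Total buffer size in bytes. */
static unsigned long buffer_bytes(void)
{
    return buffer_pages() << ASSUMED_PAGE_SHIFT;
}
```

Under that assumption the order is 8, which gives 256 pages or 1048576 bytes, and agrees with the "allocated 256 pages" line in the dmesg output.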
+#define ROUND(x) \
+SRC( movq x(%esi), %mm0 ); \
+ adcl x(%esi), %eax ; \
+ adcl x+4(%esi), %eax ; \
+DST( movntq %mm0, x(%edi) );
+// we don't need SRC() around 2nd and 3rd commands
+// (exception, if any, would be catched by 1st one)
+// (FIXME: can races against interrupts bite us?)
So, can the page disappear from under us here or not?
Currently the code is very verbose at boot time and runs too many
benchmark repetitions, for debugging purposes.
I don't know of any bugs in the code, but there may be some.
No warranties. I have given it only limited testing so far; it didn't eat
my disk and didn't drink my coffee :)
My dmesg snippet is shown below. The patch and test suite are attached
(bzipped).
--
vda
dmesg
=====
Linux version 2.4.20-pre11csumtest (root@localhost) (gcc version 3.2) #14 SMP
...
...
IP Protocols: ICMP, UDP, TCP, IGMP
IP: routing cache hash table of 1024 buckets, 16Kbytes
TCP: Hash tables configured (established 8192 bind 10922)
NET4: Unix domain sockets 1.0/SMP for Linux NET4.0.
Measuring network checksumming speed
allocated 256 pages
count = 34
count = 34
count = 34
count = 34
count = 34
simple : 348.160 MB/sec
count = 50
count = 50
count = 50
count = 50
count = 50
PII : 512.000 MB/sec
unsupported caps: 25
func SSE/MMX+ skipped: not supported by CPU
count = 79
count = 79
count = 79
count = 79
count = 79
SSE/MMX+ : 808.960 MB/sec
csum: using csum function: SSE/MMX+
count = 20
count = 20
count = 20
count = 20
count = 20
simple : 204.800 MB/sec
count = 19
count = 19
count = 19
count = 19
count = 19
PII : 194.560 MB/sec
unsupported caps: 25
func SSE/MMX+ skipped: not supported by CPU
count = 35
count = 35
count = 35
count = 35
count = 35
SSE/MMX+ : 358.400 MB/sec
csum: using csum_copy function: SSE/MMX+
freed 256 pages
Measuring network checksumming speed
allocated 256 pages
count = 34
count = 34
count = 34
count = 34
count = 34
simple : 348.160 MB/sec
count = 50
count = 50
count = 50
count = 50
count = 50
PII : 512.000 MB/sec
unsupported caps: 25
func SSE/MMX+ skipped: not supported by CPU
count = 79
count = 79
count = 79
count = 79
count = 79
SSE/MMX+ : 808.960 MB/sec
csum: using csum function: SSE/MMX+
count = 20
count = 20
count = 20
count = 20
count = 20
simple : 204.800 MB/sec
count = 19
count = 19
count = 19
count = 19
count = 19
PII : 194.560 MB/sec
unsupported caps: 25
func SSE/MMX+ skipped: not supported by CPU
count = 35
count = 35
count = 35
count = 35
count = 35
SSE/MMX+ : 358.400 MB/sec
csum: using csum_copy function: SSE/MMX+
freed 256 pages
VFS: Mounted root (ext2 filesystem) readonly.
Mounted devfs on /dev
Freeing unused kernel memory: 140k freed