2002-11-02 11:57:43

by Denis Vlasenko

[permalink] [raw]
Subject: [PATCH] csum and csum_copy routines with boot-time selection .D

Here is a kernel patch which replaces compile-time selection
of csum and csum_copy routines with boot-time selection.
This allows kernel to use fastest routines _for given CPU_,
regardless of copmile-time CPU seletion! You can compile for 486
and yet use faster P4 optimized code

Theory of operation: inside csum_partial() I check buffer size
and if it is large enough, I call optimized routine.

* Rigorous testsuite catches maybe 95% of all possible bugs.
It checks misaligned, zero, very small, odd sized buffers etc.
It compares results with known-good ones.
If bug is caught, testsuite prints verbose debug info.
This helps to understand and fix bug quickly.

* Testsuite is combined with benchmarking tool.

* Code is split into 'driver' and 'optimized' routines.
Driver routine checks and handles alignment issues and corner cases.
This reduces possibility of subtle bugs and makes optimized routines
simpler and easier to code/understand/bugcheck.

* Driver's and optimized routines' source files are the same
for kernel and testsuite, this reduces chances of cut'n'paste
and similar silly bugs.

* Optimized routines are called only if block is large enough (>128 bytes).
This avoids MMX/SSE register save/restore overhead for small blocks.
In other words, you never get slower copy than with old kernels
(speed is the same for small blocks) but large blocks will be handled
much faster.

Some thoughts:

+/* Set this to value bigger than cache(s) */
+#define bufshift 20 /* 10=1kb,20=1Mb etc */
+#define chunksz (4*1024)

How to make this picked according to cache amount
(I mean reliably even for future CPUs)?


+ char *buffer = (char *) __get_free_pages(GFP_KERNEL,
+ (bufshift-PAGE_SHIFT));

hmmm... using __xx func... what is 'official' way to do it?


+#define ROUND(x) \
+SRC( movq x(%esi), %mm0 ); \
+ adcl x(%esi), %eax ; \
+ adcl x+4(%esi), %eax ; \
+DST( movntq %mm0, x(%edi) );
+// we don't need SRC() around 2nd and 3rd commands
+// (exception, if any, would be catched by 1st one)
+// (FIXME: can races against interrupts bite us?)

So, can page disappear under us here or not?

Currently code is very verbose at boot time and runs too much
benchmark repetitions for debugging purposes.

I don't know of any bugs in the code, but they can be there.
No warranties. I gave it a some testing, it didn't eat
my disk and didn't drink my coffee :)
I write this while running patched kernel on a fully
NFS mounted filesystem, so it must be usable ;)

My dmesg snippet shown below. Patch and test suite attached
(bzipped).
--
vda

dmesg
=====
Linux version 2.4.20-pre11csum_t (root@localhost) (gcc version 3.2) #5 SMP Sat Nov 2 11:53:37 GMT+2 2002
....
....
IP-Config: Got DHCP answer from 255.255.255.255, my address is 172.16.42.177
IP-Config: Complete:
device=eth0, addr=172.16.42.177, mask=255.255.255.0, gw=172.16.42.98,
host=(none), domain=, nis-domain=(none),
bootserver=255.255.255.255, rootserver=172.16.42.75, rootpath=
NET4: Unix domain sockets 1.0/SMP for Linux NET4.0.
Measuring network checksumming speed
allocated 256 pages
count = 2
count = 2
count = 2
count = 3
count = 3
count = 3
count = 6
count = 6
count = 6
count = 11
count = 11
count = 11
count = 21
count = 21
count = 21
test run will last 16 ticks
count = 54
count = 54
count = 54
basic : 345.600 MB/sec
count = 29
count = 29
count = 29
simple : 185.600 MB/sec
unsupported caps: 63
func 3Dnow! skipped: not supported by CPU
unsupported caps: 54
func SSE/MMX+ skipped: not supported by CPU
count = 108
count = 105
count = 107
SSE/MMX+ : 691.200 MB/sec
csum: using csum function: SSE/MMX+
count = 21
count = 21
count = 21
basic : 134.400 MB/sec
count = 17
count = 17
count = 17
simple : 108.800 MB/sec
unsupported caps: 54
func SSE/MMX+ skipped: not supported by CPU
count = 39
count = 39
count = 39
SSE/MMX+ : 249.600 MB/sec
count = 39
count = 39
count = 39
SSE : 249.600 MB/sec
csum: using csum_copy function: SSE/MMX+
freed 256 pages
Looking up port of RPC 100003/2 on 172.16.42.75
Looking up port of RPC 100005/1 on 172.16.42.75
VFS: Mounted root (nfs filesystem).
Mounted devfs on /dev
Freeing unused kernel memory: 140k freed


Attachments:
linux-2.4.20-pre11csumD.patch.bz2 (10.13 kB)
timing_csum_copy.D.tar.bz2 (15.79 kB)
Download all attachments