2002-10-28 08:19:09

by Denis Vlasenko

Subject: New csum and csum_copy routines - and a test/benchmark program

I took some time to develop a little test/benchmark program
for csum and csum_copy routines (used in networking).
It has grown to include the following features:

* Total buffer size is #define-selectable, so you can measure
both cache-hot and cache-cold performance.

* It does not simply checksum the entire buffer; you can do it in 'chunks'.
Chunk size is a #define too. Chunk order is randomized for each run
(so that prefetch from the previous chunk into the next cannot fool us),
but you are guaranteed to walk the entire buffer.

* Buffer contents are randomized at each run. Csum correctness is checked.

* Buffer copy correctness verified for csum_copy.

* You can set random (up to a #defined value) start and end offset for each
chunk. Gaps are poisoned before each csum_copy and verified afterwards.
This has already caught two bugs.

* It benchmarks each routine by running it a #defined number of times
and reporting the min/max cycles per kb taken.

* It is easy to add/remove C and asm test routines.

* Easily adaptable for SSE and MMX instruction sets.

* It can make coffee for you. ;)
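
For readers without the tarball, here is a minimal sketch of two of the
checks listed above: visiting the chunks in random order while still covering
each one exactly once, and poisoning the gaps around a copy so that stray
writes show up. It is not taken from timing_csum_copy; the names
shuffle_chunks, checked_copy, good_copy, overrun_copy and the POISON value
are made up for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define POISON 0xA5 /* hypothetical poison byte, not from the original program */

/* Fill order[0..n-1] with 0..n-1, then shuffle it (Fisher-Yates).
 * Every chunk index appears exactly once, in an unpredictable order. */
static void shuffle_chunks(size_t *order, size_t n)
{
    size_t i, j, tmp;

    for (i = 0; i < n; i++)
        order[i] = i;
    for (i = n; i > 1; i--) {
        j = (size_t)rand() % i;
        tmp = order[i - 1];
        order[i - 1] = order[j];
        order[j] = tmp;
    }
}

typedef void (*copy_fn)(const unsigned char *src, unsigned char *dst,
                        size_t len);

/* A correct copy and a deliberately buggy one that writes one byte too many. */
static void good_copy(const unsigned char *src, unsigned char *dst, size_t len)
{
    memcpy(dst, src, len);
}

static void overrun_copy(const unsigned char *src, unsigned char *dst,
                         size_t len)
{
    memcpy(dst, src, len + 1);
}

/* Poison the whole destination chunk, copy bytes [start, end) and check that
 * the gaps before start and after end are untouched and the payload matches.
 * Returns 1 if everything is fine, 0 otherwise. */
static int checked_copy(copy_fn fn, const unsigned char *src,
                        unsigned char *dst, size_t chunk,
                        size_t start, size_t end)
{
    size_t i;

    assert(start <= end && end <= chunk);
    memset(dst, POISON, chunk);
    fn(src + start, dst + start, end - start);
    for (i = 0; i < start; i++)
        if (dst[i] != POISON)
            return 0;
    for (i = end; i < chunk; i++)
        if (dst[i] != POISON)
            return 0;
    return memcmp(src + start, dst + start, end - start) == 0;
}
```

With a gap left at the end of the chunk, checked_copy() accepts good_copy
and rejects overrun_copy, which is the kind of bug the poisoning is meant
to catch.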

I'm thinking about how to collect the 2-5 best routines and
make 'em compete at kernel init time for the right
to be used for blazing network performance, but I have not
even started to code this. A similar approach can be taken
for page clear/copy and copy to/from user routines.
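
A rough sketch of what such a competition could look like, assuming each
candidate is timed several times and the one with the smallest minimum wins
(the same "disregard max values" logic the benchmark uses). Nothing here is
kernel code: csum_candidate, min_cost, pick_best and the fake cycle counter
are hypothetical names, and a real version would read the CPU cycle counter
instead of fake_now().

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int (*csum_fn)(const unsigned char *buf, int len,
                                unsigned int sum);
typedef unsigned long (*cycles_fn)(void);

struct csum_candidate {
    const char *name;
    csum_fn fn;
};

/* Stand-ins so the sketch runs anywhere: a fake cycle counter and two
 * routines that "cost" a fixed number of fake cycles per call. */
static unsigned long fake_cycles;

static unsigned long fake_now(void)
{
    return fake_cycles;
}

static unsigned int byte_sum(const unsigned char *buf, int len,
                             unsigned int sum)
{
    int i;

    for (i = 0; i < len; i++)
        sum += buf[i];
    return sum;
}

static unsigned int slow_csum(const unsigned char *buf, int len,
                              unsigned int sum)
{
    fake_cycles += 300;
    return byte_sum(buf, len, sum);
}

static unsigned int fast_csum(const unsigned char *buf, int len,
                              unsigned int sum)
{
    fake_cycles += 100;
    return byte_sum(buf, len, sum);
}

/* Smallest elapsed time of one routine over `runs` tries. */
static unsigned long min_cost(csum_fn fn, const unsigned char *buf, int len,
                              int runs, cycles_fn now)
{
    unsigned long best = (unsigned long)-1, t0, dt;
    int i;

    for (i = 0; i < runs; i++) {
        t0 = now();
        (void)fn(buf, len, 0);
        dt = now() - t0;
        if (dt < best)
            best = dt;
    }
    return best;
}

/* Index of the candidate with the smallest minimum cost. */
static size_t pick_best(const struct csum_candidate *c, size_t n,
                        const unsigned char *buf, int len,
                        int runs, cycles_fn now)
{
    size_t i, best_i = 0;
    unsigned long cost, best = (unsigned long)-1;

    assert(n > 0);
    for (i = 0; i < n; i++) {
        cost = min_cost(c[i].fn, buf, len, runs, now);
        if (cost < best) {
            best = cost;
            best_i = i;
        }
    }
    return best_i;
}
```

The winner's index would then be used to set a function pointer once at
init time.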

Electing the 'best' routine by lkml posts:
1. Is slow
2. Doesn't fit a given combination of CPU/mem/mobo
so do _not_ send your results to lkml unless you think
you found something interesting.

FYI, my latest results are below. kpf_XXX routines are newest'n'greatest.
I found out to my surprise that shortening the unrolled loop
on Duron has a positive effect.

Coders with 'prefetchless' CPUs are encouraged to write
their own versions of prefetch-like routines. You may use
mov [mem],reg as a prefetch, in the hope that the CPU will reorder
instructions and happily csum older data while such a mov
is waiting for data to be fetched. But this needs testing.
That's what this program is for! :-)
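
To illustrate the idea in C rather than asm, reading the hint as an ordinary
load from memory into a register: the sketch below touches one word of the
next cache line while summing the current one. Whether this helps on any
given CPU is exactly what needs measuring. The 64-byte line size and the
names sum_plain and sum_touch_ahead are assumptions for illustration, not
part of the benchmark.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_WORDS 16 /* assumed: 64-byte cache line / 4-byte words */

/* Fold a 64-bit accumulator into 32 bits with end-around carry. */
static uint32_t fold32(uint64_t acc)
{
    while (acc >> 32)
        acc = (acc & 0xffffffffu) + (acc >> 32);
    return (uint32_t)acc;
}

/* Plain reference sum. */
static uint32_t sum_plain(const uint32_t *p, size_t nwords)
{
    uint64_t acc = 0;
    size_t i;

    for (i = 0; i < nwords; i++)
        acc += p[i];
    return fold32(acc);
}

/* Same sum, but once per cache line issue a dummy load from the next line.
 * The volatile access keeps the compiler from dropping the load; the value
 * read is discarded. Never touches memory beyond p[nwords - 1]. */
static uint32_t sum_touch_ahead(const uint32_t *p, size_t nwords)
{
    uint64_t acc = 0;
    size_t i;

    assert(p != NULL || nwords == 0);
    for (i = 0; i < nwords; i++) {
        if ((i % LINE_WORDS) == 0 && i + LINE_WORDS < nwords)
            (void)*(const volatile uint32_t *)(p + i + LINE_WORDS);
        acc += p[i];
    }
    return fold32(acc);
}
```

Both functions return the same value; only the memory access pattern
differs, so any speed difference comes from the early load.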
--
vda

Duron 650
=========
Csum benchmark program
buffer size: 4 Mb
Each test tried 32 times, max and min CPU cycles are reported.
Please disregard max values. They are due to system interference only.
csum tests:
kernel_csum - took 2612 max, 1887 min cycles per kb. sum=0xfad28968
kernel_csum - took 2654 max, 1887 min cycles per kb. sum=0xfad28968
kernel_csum - took 2105 max, 1887 min cycles per kb. sum=0xfad28968
kernel2_csum - took 2636 max, 1925 min cycles per kb. sum=0xfad28968
kernelpii_csum - took 11879 max, 1735 min cycles per kb. sum=0xaeffd53b
kernelpiipf_csum - took 2565 max, 1642 min cycles per kb. sum=0xaeffd53b
kpf_csum - took 1280 max, 1037 min cycles per kb. sum=0xaeffd53b
kpf_csum - took 1298 max, 1037 min cycles per kb. sum=0xaeffd53b
kpf_csum - took 1285 max, 1035 min cycles per kb. sum=0xaeffd53b
kpf_csum - took 1893 max, 1037 min cycles per kb. sum=0xaeffd53b
copy tests:
kernel_copy - took 5812 max, 4854 min cycles per kb. sum=0xfad28968
kernel_copy - took 5741 max, 4854 min cycles per kb. sum=0xfad28968
kernel_copy - took 17680 max, 4859 min cycles per kb. sum=0xfad28968
kernelpii_copy - took 7204 max, 6381 min cycles per kb. sum=0xe3bca07e
kernelpiipf_copy - took 8429 max, 7477 min cycles per kb. sum=0xe3bca07e
kpf_copy - took 12806 max, 2471 min cycles per kb. sum=0xfad28968
kpf_copy - took 3181 max, 2470 min cycles per kb. sum=0xfad28968
kpf_copy - took 3327 max, 2471 min cycles per kb. sum=0xfad28968
kpf_copy - took 11967 max, 2471 min cycles per kb. sum=0xfad28968
Done

Celeron 1200
============
Csum benchmark program
buffer size: 4 Mb
Each test tried 32 times, max and min CPU cycles are reported.
Please disregard max values. They are due to system interference only.
csum tests:
kernel_csum - took 7368 max, 6833 min cycles per kb. sum=0x291132e0
kernel_csum - took 9038 max, 6845 min cycles per kb. sum=0x291132e0
kernel_csum - took 7112 max, 6836 min cycles per kb. sum=0x291132e0
kernel2_csum - took 7254 max, 6871 min cycles per kb. sum=0x291132e0
kernelpii_csum - took 4696 max, 4109 min cycles per kb. sum=0x484713aa
kernelpiipf_csum - took 4715 max, 4271 min cycles per kb. sum=0x484713aa
kpf_csum - took 3295 max, 2780 min cycles per kb. sum=0x484713aa
kpf_csum - took 3091 max, 2793 min cycles per kb. sum=0x484713aa
kpf_csum - took 14580 max, 2833 min cycles per kb. sum=0x484713aa
kpf_csum - took 3292 max, 2833 min cycles per kb. sum=0x484713aa
copy tests:
kernel_copy - took 13927 max,13450 min cycles per kb. sum=0x291132e0
kernel_copy - took 14009 max,13406 min cycles per kb. sum=0x291132e0
kernel_copy - took 13957 max,13447 min cycles per kb. sum=0x291132e0
kernelpii_copy - took 15039 max,11335 min cycles per kb. sum=0x5474077d
kernelpiipf_copy - took 14137 max,13059 min cycles per kb. sum=0x5474077d
kpf_copy - took 8226 max, 7857 min cycles per kb. sum=0x291132e0
kpf_copy - took 20698 max, 7886 min cycles per kb. sum=0x291132e0
kpf_copy - took 8504 max, 7897 min cycles per kb. sum=0x291132e0
kpf_copy - took 8245 max, 7893 min cycles per kb. sum=0x291132e0
Done


Attachments:
timing_csum_copy.3.tar.bz2 (8.51 kB)

2002-10-28 08:42:13

by Roberto Nibali

Subject: Re: New csum and csum_copy routines - and a test/benchmark program

diff -ur timing_csum_copy.3/copy_kpf.S timing_csum_copy.3-ratz/copy_kpf.S
--- timing_csum_copy.3/copy_kpf.S Mon Oct 28 13:35:25 2002
+++ timing_csum_copy.3-ratz/copy_kpf.S Mon Oct 28 09:41:11 2002
@@ -76,7 +76,7 @@
PREFETCH(128(%esi))
PREFETCH(192(%esi))
subl $108, %esp
- fsave (%esp) # save FPU - we'll use MMX...
+ fsave (%esp) # save FPU - we will use MMX...
fwait
testl %esi, %esi # clears CF
10:
diff -ur timing_csum_copy.3/copy_ntq.c timing_csum_copy.3-ratz/copy_ntq.c
--- timing_csum_copy.3/copy_ntq.c Sun Oct 27 02:14:12 2002
+++ timing_csum_copy.3-ratz/copy_ntq.c Mon Oct 28 09:32:49 2002
@@ -1,6 +1,7 @@
unsigned int ntq_copy(const char *src, char *dst,
int len, int sum, int *src_err_ptr, int *dst_err_ptr)
{
+ int count;
char fpu_save[108];
__asm__ __volatile__ (
" fsave %0\n"
@@ -8,7 +9,7 @@
: /* output */ "=m"(fpu_save[0])
);

- int count = len/8;
+ count = len/8;
__asm__ __volatile__ (
" testl %%ecx, %%ecx\n" //carry unset - we need it
// these two back-to-back references actually _is_ the fastest way
diff -ur timing_csum_copy.3/copy_ntqpf.c timing_csum_copy.3-ratz/copy_ntqpf.c
--- timing_csum_copy.3/copy_ntqpf.c Fri Oct 25 20:01:44 2002
+++ timing_csum_copy.3-ratz/copy_ntqpf.c Mon Oct 28 09:33:08 2002
@@ -1,6 +1,7 @@
unsigned int ntqpf_copy(const char *src, char *dst,
int len, int sum, int *src_err_ptr, int *dst_err_ptr)
{
+ int count;
char fpu_save[108];
__asm__ __volatile__ (
" "PREFETCH" (%0)\n"
@@ -15,7 +16,7 @@
: /* output */ "=m"(fpu_save[0])
);

- int count = len/(8*8);
+ count = len/(8*8);
while(count--) {
__asm__ __volatile__ (
"1: "PREFETCH" 256(%1)\n"
diff -ur timing_csum_copy.3/copy_ntqpf2.c timing_csum_copy.3-ratz/copy_ntqpf2.c
--- timing_csum_copy.3/copy_ntqpf2.c Sun Oct 27 02:19:09 2002
+++ timing_csum_copy.3-ratz/copy_ntqpf2.c Mon Oct 28 09:33:27 2002
@@ -1,6 +1,7 @@
unsigned int ntqpf2_copy(const char *src, char *dst,
int len, int sum, int *src_err_ptr, int *dst_err_ptr)
{
+ int count;
char fpu_save[108];
__asm__ __volatile__ (
" "PREFETCH" (%0)\n"
@@ -15,7 +16,7 @@
: /* output */ "=m"(fpu_save[0])
);

- int count = len/(8*8);
+ count = len/(8*8);
while(count--) {
__asm__ __volatile__ (
"1: "PREFETCH" 256(%1)\n"
diff -ur timing_csum_copy.3/copy_ntqpfm.c timing_csum_copy.3-ratz/copy_ntqpfm.c
--- timing_csum_copy.3/copy_ntqpfm.c Sun Oct 27 02:38:21 2002
+++ timing_csum_copy.3-ratz/copy_ntqpfm.c Mon Oct 28 09:33:41 2002
@@ -1,6 +1,7 @@
unsigned int ntqpfm_copy(const char *src, char *dst,
int len, int sum, int *src_err_ptr, int *dst_err_ptr)
{
+ int count;
char xmm0[16];
__asm__ __volatile__ (
" "PREFETCH" (%0)\n"
@@ -14,7 +15,7 @@
: /* output */ "=m"(xmm0[0])
);

- int count = len/(8*8);
+ count = len/(8*8);
while(count--) {
__asm__ __volatile__ (
"1: "PREFETCH" 256(%1)\n"
diff -ur timing_csum_copy.3/csum_kpf.S timing_csum_copy.3-ratz/csum_kpf.S
--- timing_csum_copy.3/csum_kpf.S Mon Oct 28 13:35:30 2002
+++ timing_csum_copy.3-ratz/csum_kpf.S Mon Oct 28 09:43:21 2002
@@ -73,7 +73,7 @@
40:
PREFETCH(256(%esi))
41:
- addl/* -128(%esi), %eax
+ addl -128(%esi), %eax
adcl -124(%esi), %eax
adcl -120(%esi), %eax
adcl -116(%esi), %eax
@@ -97,7 +97,7 @@
adcl -44(%esi), %eax
adcl -40(%esi), %eax
adcl -36(%esi), %eax
- adcl*/ -32(%esi), %eax
+ adcl -32(%esi), %eax
adcl -28(%esi), %eax
adcl -24(%esi), %eax
adcl -20(%esi), %eax
@@ -115,7 +115,7 @@
js 46f
cmp $8,%ecx
jae 40b # need prefetch
- jmp 41b # don't need it
+ jmp 41b # do not need it
46:
//adcl $0, %eax
movl %edx, %ecx
diff -ur timing_csum_copy.3/csum_pfm.c timing_csum_copy.3-ratz/csum_pfm.c
--- timing_csum_copy.3/csum_pfm.c Sun Oct 27 02:41:17 2002
+++ timing_csum_copy.3-ratz/csum_pfm.c Mon Oct 28 09:31:58 2002
@@ -1,5 +1,6 @@
unsigned int pfm_csum(const unsigned char *src, int len, unsigned int sum)
{
+ int count;
__asm__ __volatile__ (
" "PREFETCH" (%0)\n"
" "PREFETCH" 64(%0)\n"
@@ -8,8 +9,8 @@
: : "r" (src)
);

- int count = len/(8*8);
- while(count--)
{
+ count = len/(8*8);
+ while(count--) {
__asm__ __volatile__ (
"1: "PREFETCH" 256(%1)\n"
" xorl %%ecx, %%ecx\n" //carry unset - we need it
diff -ur timing_csum_copy.3/csum_pfm2.c timing_csum_copy.3-ratz/csum_pfm2.c
--- timing_csum_copy.3/csum_pfm2.c Sun Oct 27 02:42:04 2002
+++ timing_csum_copy.3-ratz/csum_pfm2.c Mon Oct 28 09:32:27 2002
@@ -1,5 +1,6 @@
unsigned int pfm2_csum(const unsigned char *src, int len, unsigned int sum)
{
+ int count;
__asm__ __volatile__ (
" "PREFETCH" (%0)\n"
" "PREFETCH" 64(%0)\n"
@@ -8,8 +9,8 @@
: : "r" (src)
);

- int count = len/(8*8);
- while(count--)
{
+ count = len/(8*8);
+ while(count--) {
__asm__ __volatile__ (
"1: "PREFETCH" 256(%1)\n"
" xorl %%ecx, %%ecx\n" //carry unset - we need it


Attachments:
timing_csum_copy_fix.diff (4.78 kB)

2002-10-28 10:33:48

by Denis Vlasenko

Subject: Re: New csum and csum_copy routines - and a test/benchmark program

On 28 October 2002 06:48, Roberto Nibali wrote:
> Denis Vlasenko wrote:
> > I took some time to develop a little test/benchmark program
> > for csum and csum_copy routines (used in networking).
> > It has grown to include following features:
>
> I needed the attached patch with changes to make it work on my
> machine. Could you comment on it, please? Also a Makefile would be
> nicer ;).

Applied except for a bug ;) see below

--- timing_csum_copy.3/csum_kpf.S Mon Oct 28 13:35:30 2002
+++ timing_csum_copy.3-ratz/csum_kpf.S Mon Oct 28 09:43:21 2002
@@ -73,7 +73,7 @@
40:
PREFETCH(256(%esi))
41:
- addl/* -128(%esi), %eax
+ addl -128(%esi), %eax
adcl -124(%esi), %eax
adcl -120(%esi), %eax
adcl -116(%esi), %eax
@@ -97,7 +97,7 @@
adcl -44(%esi), %eax
adcl -40(%esi), %eax
adcl -36(%esi), %eax
- adcl*/ -32(%esi), %eax
+ adcl -32(%esi), %eax
adcl -28(%esi), %eax
adcl -24(%esi), %eax
adcl -20(%esi), %eax

No no no. The first instruction has to be an addl; that's why
there is a weird comment. Just comment out the lines
from addl -128... to adcl -36... and change adcl -32... to addl
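
For anyone wondering why the first instruction matters, here is a small C
model of an addl/adcl chain. It assumes (the asm is not quoted in full here)
that the carry flag may hold an unrelated value at the top of the block: addl
ignores it, adcl would add it in. The helper names adc and csum_block are
made up for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of adcl: a + b + carry-in, updating the carry flag. */
static uint32_t adc(uint32_t a, uint32_t b, unsigned *cf)
{
    uint64_t r;

    assert(*cf <= 1);
    r = (uint64_t)a + b + *cf;
    *cf = (unsigned)(r >> 32);
    return (uint32_t)r;
}

/* Sum n words into eax. If first_is_addl is nonzero the first add ignores
 * the incoming carry flag (like addl); otherwise it consumes stale_cf
 * (like adcl). A final "adcl $0" folds the last carry back in. */
static uint32_t csum_block(uint32_t eax, const uint32_t *w, size_t n,
                           unsigned stale_cf, int first_is_addl)
{
    unsigned cf = first_is_addl ? 0 : stale_cf;
    size_t i;

    for (i = 0; i < n; i++)
        eax = adc(eax, w[i], &cf);
    return adc(eax, 0, &cf);
}
```

With a stale carry of 1 and the words 0xffffffff and 1, the addl-first
chain yields 1 while the adcl-first chain yields 2, i.e. a wrong csum.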

# as --version
GNU assembler 2.13.90.0.6 20021002
Copyright 2002 Free Software Foundation, Inc.
This program is free software; you may redistribute it under the terms of
the GNU General Public License. This program has absolutely no warranty.
This assembler was configured for a target of `i386-pc-linux-gnu'.

What's yours?
--
vda

2002-10-28 11:00:23

by Roberto Nibali

Subject: Re: New csum and csum_copy routines - and a test/benchmark program

> Applied except for a bug ;) see below

Yes, this was wrong. I didn't read the code too closely. It produced
a wrong csum return value for me. With your .4 version it now works perfectly.

> # as --version
> GNU assembler 2.13.90.0.6 20021002
> Copyright 2002 Free Software Foundation, Inc.
> This program is free software; you may redistribute it under the terms of
> the GNU General Public License. This program has absolutely no warranty.
> This assembler was configured for a target of `i386-pc-linux-gnu'.
>
> What's yours?

# as --version
GNU assembler 2.11.92.0.10
Copyright 2001 Free Software Foundation, Inc.
This program is free software; you may redistribute it under the terms of
the GNU General Public License. This program has absolutely no warranty.
This assembler was configured for a target of `i686-pc-linux-gnu'.

I reckon this explains it all. I'll go and upgrade the damn thing now.

Cheers,
Roberto Nibali, ratz
--
echo '[q]sa[ln0=aln256%Pln256/snlbx]sb3135071790101768542287578439snlbxq'|dc

2002-10-28 15:06:08

by J.A. Magallon

Subject: Re: New csum and csum_copy routines - and a test/benchmark program


On 2002.10.28 Denis Vlasenko wrote:
>
>Duron 650
>=========
>csum tests:
> kernel_csum - took 2612 max, 1887 min cycles per kb. sum=0xfad28968
> kernel_csum - took 2654 max, 1887 min cycles per kb. sum=0xfad28968
> kernel_csum - took 2105 max, 1887 min cycles per kb. sum=0xfad28968
> kernel2_csum - took 2636 max, 1925 min cycles per kb. sum=0xfad28968
> kernelpii_csum - took 11879 max, 1735 min cycles per kb. sum=0xaeffd53b
> kernelpiipf_csum - took 2565 max, 1642 min cycles per kb. sum=0xaeffd53b
>copy tests:
> kernel_copy - took 5812 max, 4854 min cycles per kb. sum=0xfad28968
> kernel_copy - took 5741 max, 4854 min cycles per kb. sum=0xfad28968
> kernel_copy - took 17680 max, 4859 min cycles per kb. sum=0xfad28968
> kernelpii_copy - took 7204 max, 6381 min cycles per kb. sum=0xe3bca07e
> kernelpiipf_copy - took 8429 max, 7477 min cycles per kb. sum=0xe3bca07e
>
>Celeron 1200
>============
>csum tests:
> kernel_csum - took 7368 max, 6833 min cycles per kb. sum=0x291132e0
> kernel_csum - took 9038 max, 6845 min cycles per kb. sum=0x291132e0
> kernel_csum - took 7112 max, 6836 min cycles per kb. sum=0x291132e0
> kernel2_csum - took 7254 max, 6871 min cycles per kb. sum=0x291132e0
> kernelpii_csum - took 4696 max, 4109 min cycles per kb. sum=0x484713aa
> kernelpiipf_csum - took 4715 max, 4271 min cycles per kb. sum=0x484713aa
>copy tests:
> kernel_copy - took 13927 max,13450 min cycles per kb. sum=0x291132e0
> kernel_copy - took 14009 max,13406 min cycles per kb. sum=0x291132e0
> kernel_copy - took 13957 max,13447 min cycles per kb. sum=0x291132e0
> kernelpii_copy - took 15039 max,11335 min cycles per kb. sum=0x5474077d
> kernelpiipf_copy - took 14137 max,13059 min cycles per kb. sum=0x5474077d

Ahem, I have read your comment about not posting results, but can a PII
really do this:

Pentium2 400, 512Kb cache
=========================
Csum benchmark program
buffer size: 4 Mb
Each test tried 32 times, max and min CPU cycles are reported.
Please disregard max values. They are due to system interference only.
csum tests:
kernel_csum - took 2586 max, 2541 min cycles per kb. sum=0x7a86335d
kernel_csum - took 2566 max, 2541 min cycles per kb. sum=0x7a86335d
kernel_csum - took 2703 max, 2541 min cycles per kb. sum=0x7a86335d
kernel2_csum - took 2890 max, 2556 min cycles per kb. sum=0x7a86335d
kernelpii_csum - took 1682 max, 1494 min cycles per kb. sum=0x993114b2
kernelpiipf_csum - took 1688 max, 1493 min cycles per kb. sum=0x993114b2
kpf_csum - took 1742 max, 1703 min cycles per kb. sum=0xf3500fff
kpf_csum - took 1805 max, 1747 min cycles per kb. sum=0xf3500fff
kpf_csum - took 1746 max, 1703 min cycles per kb. sum=0xf3500fff
kpf_csum - took 1777 max, 1747 min cycles per kb. sum=0xf3500fff
copy tests:
kernel_copy - took 4507 max, 4306 min cycles per kb. sum=0x7a86335d
kernel_copy - took 4704 max, 4306 min cycles per kb. sum=0x7a86335d
kernel_copy - took 4833 max, 4306 min cycles per kb. sum=0x7a86335d
kernelpii_copy - took 3688 max, 3504 min cycles per kb. sum=0xa3cf0a14
kernelpiipf_copy - took 4961 max, 4516 min cycles per kb. sum=0xa3cf0a14
Done

I had to set an empty prefetch in kpf_csum.
I can't run the kpf_copy benchmarks; they crash even with an empty prefetch.
Do they contain PIII-specific code?
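
One way to check up front whether a routine could be using instructions the
CPU lacks is to look at the CPUID feature bits. The sketch below assumes
GCC or Clang on x86 (it uses __get_cpuid from <cpuid.h>) and only decodes
the MMX and SSE bits of leaf 1; a PII reports MMX but not SSE. The thread
does not say which instructions kpf_copy uses, so this is a diagnostic aid,
not an answer. cpu_caps, decode_caps and query_caps are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__i386__) || defined(__x86_64__)
#include <cpuid.h> /* GCC/Clang helper for the CPUID instruction */
#endif

#define CPUID1_EDX_MMX (1u << 23) /* CPUID leaf 1, EDX bit 23 */
#define CPUID1_EDX_SSE (1u << 25) /* CPUID leaf 1, EDX bit 25 */

struct cpu_caps {
    int mmx;
    int sse;
};

/* Decode the EDX feature word returned by CPUID leaf 1. */
static struct cpu_caps decode_caps(unsigned int edx)
{
    struct cpu_caps caps;

    caps.mmx = (edx & CPUID1_EDX_MMX) != 0;
    caps.sse = (edx & CPUID1_EDX_SSE) != 0;
    return caps;
}

/* Query the running CPU. Returns 1 and fills *out on success, 0 if CPUID
 * leaf 1 is unavailable or this is not an x86 build. */
static int query_caps(struct cpu_caps *out)
{
    assert(out != NULL);
#if defined(__i386__) || defined(__x86_64__)
    {
        unsigned int eax, ebx, ecx, edx;

        if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
            return 0;
        *out = decode_caps(edx);
        return 1;
    }
#else
    return 0;
#endif
}
```

A test harness could call query_caps() once and skip any routine whose
required feature is missing instead of crashing on it.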

TIA

--
J.A. Magallon <[email protected]> \ Software is like sex:
werewolf.able.es \ It's better when it's free
Mandrake Linux release 9.1 (Cooker) for i586
Linux 2.4.20-pre11-jam2 (gcc 3.2 (Mandrake Linux 9.0 3.2-2mdk))

2002-10-28 18:21:22

by Denis Vlasenko

Subject: Re: New csum and csum_copy routines - and a test/benchmark program

> Ahem, I have read your comment about not posting results, but can
> a PII really do this:
> [snip]

I'm trying to stop myself from tweaking those routines and get
on with writing the second part - a kernel patch. No success so far ;)
--
vda