I ran oprofile during bw_tcp; the retired-instruction samples on the Athlon showed:
samples   %        symbol name
903640    75.4825  csum_partial_copy_generic
In their 1998 USENIX paper, Carl Staelin and Larry McVoy wrote:
"It is interesting to compare pipes with TCP because the TCP benchmark is
identical to the pipe benchmark except for the transport mechanism. Ideally,
the TCP bandwidth would be as good as the pipe bandwidth. It is not widely
known that the majority of the TCP cost is in the bcopy, the checksum, and
the network interface driver. The checksum and the driver may be safely
eliminated in the loopback case and if the costs have been eliminated, then
TCP should be just as fast as pipes. From the pipe and TCP results [...]
it is easy to see that Solaris and HP-UX have done this optimization."
Here are results from some recent Linux kernels (bandwidth in MB/sec):
Processor             Pipe     TCP
Athlon/1330         840.66   73.75  (or 150 MB/sec - see below)
k6-2/475             65.15   52.45
PIII * 1/700 Xeon   539.73  446.16
I tried compiling the Athlon kernel without X86_USE_PPRO_CHECKSUM,
but that didn't really change TCP bandwidth.
kernel                  Pipe    TCP  (MB/sec)
2.4.19rc2aa1          860.97  74.27
2.4.19rc2aa1-nocsum   853.18  74.16
[topic shift]
There was a change in bw_tcp.c that has a 2x impact on
the computed bandwidth. I have two versions:
ls -gl LM*/src/bw_tcp.c
-r--r--r-- 1 rwhron 3553 Jul 23 2001 LMbench.old/src/bw_tcp.c
-r--r--r-- 1 rwhron 3799 Sep 27 2001 LMbench2/src/bw_tcp.c
Both LMbench trees have the same version:
#define MAJOR 2
#define MINOR -13 /* negative is alpha, it "increases" */
ident doesn't report a version for bw_tcp.c, but diff shows the
two files differ.
This is the newer bw_tcp on an Athlon 1330.
./bw_tcp localhost
server: nbytes=10485760
initial bandwidth measurement: move=10485760, usecs=117291: 89.40 MB/sec
move=693633024, XFERSIZE=65536
server: nbytes=693633024
Socket bandwidth using localhost: 75.85 MB/sec
And the older bw_tcp, compiled with the same gcc and the same kernel, on the Athlon:
./bw_tcp localhost
Socket bandwidth using localhost: 150.21 MB/sec
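For what it's worth, the figure bw_tcp prints appears to be just bytes moved
divided by elapsed microseconds (so "MB" here means 10^6 bytes); the initial
measurement line above checks out. A minimal sketch of that arithmetic, using
the figures from the run above:

    #include <stdio.h>

    int main(void)
    {
        /* figures from the "initial bandwidth measurement" line above */
        double bytes = 10485760.0;
        double usecs = 117291.0;

        /* bytes per microsecond == 10^6 bytes per second == "MB/sec" */
        printf("%.2f MB/sec\n", bytes / usecs);   /* prints 89.40 */
        return 0;
    }

So a 2x difference in the reported bandwidth means either the byte count or
the timed interval changed by 2x between the two versions of the benchmark.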
--
Randy Hron
http://home.earthlink.net/~rwhron/kernel/bigbox.html
On Sun, 2002-07-21 at 14:21, [email protected] wrote:
> In Carl Staelin and Larry McVoy's 98 Usenix paper they wrote:
>
> "It is interesting to compare pipes with TCP because the TCP benchmark is
> identical to the pipe benchmark except for the transport mechanism. Ideally,
> the TCP bandwidth would be as good as the pipe bandwidth. It is not widely
> known that the majority of the TCP cost is in the bcopy, the checksum, and
> the network interface driver. The checksum and the driver may be safely
The paper, however, ignored something else we do, which is why you see
csum_partial_copy_generic. On a modern processor the cost of fetching
and storing memory is so high compared to the throughput of the
processor that it is actually much more effective to fold the copy and
the checksum together. Generally the combined copy/checksum runs at the
same speed as a pure copy anyway.
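To make that concrete, here is a minimal user-space sketch of folding the
Internet checksum into the copy loop. The kernel's csum_partial_copy_generic
does this job in hand-tuned assembly, so this is only an illustration of the
technique, not the actual code:

    #include <stdint.h>
    #include <stdio.h>
    #include <stddef.h>

    /* Copy len bytes (assumed a multiple of 4 for brevity; byte-order
     * details of the real Internet checksum are glossed over) while
     * accumulating the 16-bit ones-complement checksum. */
    static uint16_t copy_and_csum(uint32_t *dst, const uint32_t *src,
                                  size_t len)
    {
        uint64_t sum = 0;
        size_t i;

        for (i = 0; i < len / 4; i++) {
            uint32_t w = src[i];   /* one load...            */
            dst[i] = w;            /* ...one store...         */
            sum += w;              /* ...checksum rides along */
        }
        /* fold the 64-bit accumulator down to 16 bits */
        while (sum >> 16)
            sum = (sum & 0xffff) + (sum >> 16);
        return (uint16_t)~sum;
    }

    int main(void)
    {
        uint32_t src[4] = { 0x01020304, 0x05060708,
                            0x090a0b0c, 0x0d0e0f10 };
        uint32_t dst[4];
        printf("csum = 0x%04x\n", copy_and_csum(dst, src, sizeof src));
        return 0;
    }

Since the load is needed for the copy anyway, the extra add is nearly free on
a machine limited by memory bandwidth rather than ALU throughput.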
> I ran oprofile with bw_tcp and retired instructions on
> athlon showed:
> samples %-age symbol name
> 903640 75.4825 csum_partial_copy_generic
> In Carl Staelin and Larry McVoy's 98 Usenix paper they
> wrote:
>
> It is interesting to compare pipes with TCP because the TCP
> benchmark is identical to the pipe benchmark except for the
> transport mechanism. Ideally, the TCP bandwidth would be as
> good as the pipe bandwidth. It is not widely known that the
Well, TCP will have a little more overhead than a pipe, since the
network stack has to take care of a few more things.
> majority of the TCP cost is in the bcopy, the checksum, and
> the network interface driver. The checksum and the driver
> may be safely eliminated in the loopback case and if the
> costs have been eliminated, then TCP should be just as fast
> as pipes. From the pipe and TCP results [...]
> it is easy to see that Solaris and HP-UX have done this
> optimization.
Much has happened since 1998: hardware checksum offload and
sendfile, among others. Note that this checksum
is performed while copying the data from/to
user space (though not that often in the rx code path).
However, while we don't look at the checksum on the
rx side for TCP, since the loopback driver will
have set ip_summed to CHECKSUM_UNNECESSARY, on the
send side we don't bother to skip it, because we have
to do the copy in any case. It's a marginal difference,
although perhaps worth doing. I had a patch for this
which showed no difference in a VolanoMark (yes, I know)
benchmark.
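For anyone following along, here is a toy model of that rx-side mechanism.
The field and flag names (ip_summed, CHECKSUM_UNNECESSARY) are the real
2.4-era ones, but the code is a standalone sketch, not kernel source:

    #include <stdio.h>

    /* Simplified model of the 2.4-era skb checksum flags. */
    enum { CHECKSUM_NONE, CHECKSUM_UNNECESSARY };

    struct sk_buff { int ip_summed; };

    /* The loopback driver hands the skb straight back to the stack,
     * marking the checksum as already known to be good. */
    static void loopback_xmit(struct sk_buff *skb)
    {
        skb->ip_summed = CHECKSUM_UNNECESSARY;
    }

    /* TCP rx path: only verify when the driver couldn't vouch for it. */
    static void tcp_rcv(struct sk_buff *skb)
    {
        if (skb->ip_summed != CHECKSUM_UNNECESSARY)
            printf("verify checksum\n");
        else
            printf("checksum skipped\n");
    }

    int main(void)
    {
        struct sk_buff skb = { CHECKSUM_NONE };
        loopback_xmit(&skb);
        tcp_rcv(&skb);   /* prints "checksum skipped" */
        return 0;
    }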
> Processor Pipe TCP
> Athlon/1330 840.66 73.75 (or 150 MB/sec - see below)
> k6-2/475 65.15 52.45
> PIII * 1/700 Xeon 539.73 446.16
Hmm, so if K6 and Xeon can scrounge up 80% of pipe
performance, why is the Athlon an order of magnitude off
at 8%? How did your Athlon perform in other tests relative
to these other procs?
> I tried compiling the athlon kernel without
> X86_USE_PPRO_CHECKSUM but that didn't really change
> tcp bandwidth.
> kernel Pipe TCP
> 2.4.19rc2aa1 860.97 74.27
> 2.4.19rc2aa1-nocsum 853.18 74.16
Well, that would simply change how the checksum is
calculated; in this case, I believe the bulk of the
latency comes from the copy itself.
> There was a change in bw_tcp.c that has a 2x impact on
> the computed bandwidth. I have two versions:
> ls -gl LM*/src/bw_tcp.c
> -r--r--r-- 1 rwhron 3553 Jul 23 2001 LMbench.old/src/bw_tcp.c
> -r--r--r-- 1 rwhron 3799 Sep 27 2001 LMbench2/src/bw_tcp.c
I can only find the BitKeeper version online, and it's almost
a year old; where is the later version from?
> Both LMbench trees have the same version:
...
> ident doesn't specify a version in tcp_bw.c, but diff shows
> a difference.
A change in your test causes a 2x difference in performance,
and you don't give us the diff? :) :)
> Socket bandwidth using localhost: 75.85 MB/sec
...
> Socket bandwidth using localhost: 150.21 MB/sec
Where are the complete profiles from these runs?
Also, any chance you have network stats from before/after?
I looked on your site but couldn't find the bw_tcp
runs.
thanks,
Nivedita
Nivedita Singhvi <[email protected]> writes:
> Hmm, so if K6 and Xeon can scrounge up 80% of pipe
> performance, why is the Athlon an order of magnitude off
> at 8%? How did your Athlon perform in other tests relative
> to these other procs?
The pipe test basically tests copy_from_user()/copy_to_user().
The standard implementation of these functions (essentially rep ; movsl)
doesn't exploit the Athlon very well - it is not good at this
instruction. AFAIK Intel CPUs have a faster microcode
implementation for it.
You could likely do better on the Athlon with a copy_*_user that uses
an unrolled loop with explicit movl instructions, or even SSE.
[similar to the implementation the x86-64 port uses, but without
the NT instructions]
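A rough user-space sketch of the two styles (32-bit x86 only; this is not
the actual copy_*_user code, and the unrolled variant is just a hypothetical
illustration of what an Athlon-friendly loop might look like):

    #include <stddef.h>
    #include <stdint.h>

    /* Essentially what the generic i386 copy loop boils down to:
     * a microcoded string move, which the Athlon executes poorly.
     * len is assumed to be a multiple of 4. */
    static void copy_rep_movsl(void *dst, const void *src, size_t len)
    {
        long d0, d1, d2;
        __asm__ __volatile__(
            "rep ; movsl"
            : "=&c" (d0), "=&D" (d1), "=&S" (d2)
            : "0" (len / 4), "1" (dst), "2" (src)
            : "memory");
    }

    /* Hypothetical alternative: an unrolled loop of explicit loads and
     * stores (movl instructions), avoiding the rep-string microcode.
     * len is assumed to be a multiple of 16. */
    static void copy_unrolled(uint32_t *dst, const uint32_t *src,
                              size_t len)
    {
        size_t n = len / 16;   /* four longwords per iteration */
        while (n--) {
            dst[0] = src[0];
            dst[1] = src[1];
            dst[2] = src[2];
            dst[3] = src[3];
            dst += 4;
            src += 4;
        }
    }

    int main(void)
    {
        uint32_t a[16] = { 0 }, b[16];
        copy_rep_movsl(b, a, sizeof a);
        copy_unrolled(b, a, sizeof a);
        return 0;
    }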
-Andi
Nivedita Singhvi wrote:
> How did your Athlon perform in other tests relative
> to these other procs?
TCP bandwidth was the only really strange result.
Side by side lmbench for three processors is at:
http://home.earthlink.net/~rwhron/kernel/lmbench_comparison.html
> I only see the bitkeeper version thats almost a year old, online,
> where is the later version from?
The earlier version of bw_tcp is from lmbench-2.0-patch1.tgz
and the later version is from lmbench-2.0-patch2.tgz.
> Also, any chance you have network stats before/after?
I only ran oprofile on the Athlon, using "localhost".
The oprofile run was just to get an idea of the hot functions;
oprofile wasn't running during the lmbench runs in the
link above.
--
Randy Hron
http://home.earthlink.net/~rwhron/kernel/bigbox.html