2003-08-31 07:02:32

by Dan Kegel

Subject: LMbench as gcc performance regression test?

http://cs.nmu.edu/~benchmark/ has an interesting little graph
of LMBench results vs. Linux kernel version, all done with the
same compiler.

Has anyone seen a similar graph showing LMBench results vs. gcc version,
all done with the same Linux kernel?
And does everyone agree that's a meaningful way to compare the
performance of code generated by different compilers?

I happen to have a number of versions of gcc handy, and was
considering making such a graph, but was hoping somebody
else had already done it.

(There seem to be large variations between successive runs of LMBench
when I try it, so it may take me a bit of work to get repeatable
results.)

Thanks,
Dan

--
Dan Kegel
http://www.kegel.com
http://counter.li.org/cgi-bin/runscript/display-person.cgi?user=78045


2003-08-31 14:09:34

by Dan Kegel

Subject: Re: LMbench as gcc performance regression test?

Larry McVoy wrote:
> On Sun, Aug 31, 2003 at 12:21:37AM -0700, Dan Kegel wrote:
>
>>(There seem to be large variations between successive runs of LMBench
>>when I try it, so it may take me a bit of work to get repeatable
>>results.)
>
>
> Other than the context switch part or anything based on it, that shouldn't
> be true; it should be very stable.
>
> I'm pretty convinced that the variations are due to different pages being
> allocated, and the resulting cache contention makes things bounce.

Or an idiot running the benchmark. We really do have to rule that out first.
- Dan

--
Dan Kegel
http://www.kegel.com
http://counter.li.org/cgi-bin/runscript/display-person.cgi?user=78045

2003-08-31 14:00:44

by Larry McVoy

Subject: Re: LMbench as gcc performance regression test?

On Sun, Aug 31, 2003 at 12:21:37AM -0700, Dan Kegel wrote:
> (There seem to be large variations between successive runs of LMBench
> when I try it, so it may take me a bit of work to get repeatable
> results.)

Other than the context switch part or anything based on it, that shouldn't
be true; it should be very stable.

I'm pretty convinced that the variations are due to different pages being
allocated, and the resulting cache contention makes things bounce.
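
(A user process cannot choose its physical pages, so the sketch below is
only an illustration, not LMbench code: it times the same strided walk
over a freshly malloc'd buffer. Run it many times in a row; the spread
across separate runs partly reflects which pages each run happened to get.)

/* pagevar.c: illustrative sketch, not from LMbench.  Each run gets
 * whatever physical pages the kernel hands out; timing the same
 * strided walk across many separate runs shows the kind of
 * run-to-run bounce described above. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#define SIZE (256 * 1024)   /* bigger than L1, comparable to L2 */
#define STRIDE 64           /* a typical cache-line size */
#define PASSES 1000

int main(void)
{
    volatile char *buf = malloc(SIZE);
    struct timeval t0, t1;
    long i, p, sum = 0;
    double usec;

    for (i = 0; i < SIZE; i++)
        buf[i] = (char)i;   /* fault the pages in before timing */

    gettimeofday(&t0, 0);
    for (p = 0; p < PASSES; p++)
        for (i = 0; i < SIZE; i += STRIDE)
            sum += buf[i];  /* volatile read: touch one line per step */
    gettimeofday(&t1, 0);

    usec = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
    printf("%.3f ns/line (sum %ld)\n",
           1000. * usec / ((double)PASSES * (SIZE / STRIDE)), sum);
    return 0;
}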
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-08-31 15:24:54

by Daniel Jacobowitz

Subject: Re: LMbench as gcc performance regression test?

On Sun, Aug 31, 2003 at 12:21:37AM -0700, Dan Kegel wrote:
> http://cs.nmu.edu/~benchmark/ has an interesting little graph
> of LMBench results vs. Linux kernel version, all done with the
> same compiler.
>
> Has anyone seen a similar graph showing LMBench results vs. gcc version,
> all done with the same Linux kernel?
> And does everyone agree that's a meaningful way to compare the
> performance of code generated by different compilers?
>
> I happen to have a number of versions of gcc handy, and was
> considering making such a graph, but was hoping somebody
> else had already done it.

It's been a while since I looked at lmbench, but: why do you think this
would be useful? It's a system and kernel benchmark; I doubt
optimization makes much difference at all.

--
Daniel Jacobowitz
MontaVista Software                         Debian GNU/Linux Developer

2003-08-31 15:39:11

by Dan Kegel

Subject: Re: LMbench as gcc performance regression test?

Daniel Jacobowitz wrote:
> On Sun, Aug 31, 2003 at 12:21:37AM -0700, Dan Kegel wrote:
>
>>http://cs.nmu.edu/~benchmark/ has an interesting little graph
>>of LMBench results vs. Linux kernel version, all done with the
>>same compiler.
>>
>>Has anyone seen a similar graph showing LMBench results vs. gcc version,
>>all done with the same Linux kernel?
>>And does everyone agree that's a meaningful way to compare the
>>performance of code generated by different compilers?
>
> It's been a while since I looked at lmbench but: why do you think this
> would be useful? It's a system and kernel benchmark; I doubt
> optimization makes much difference at all.

I need to make sure that moving to a newer compiler for our kernel
will cause no performance regressions. Before bothering to bring up a
real-world networking application and measuring its performance
under the new compiler, it seems sensible to use a couple of microbenchmarks
to verify that identifiable parts of the system have not degraded
in performance.

I myself am quite convinced I need to move to a newer compiler,
since I keep running into problems building various things with
old compilers, but my users are very conservative and skeptical;
I have to build a solid case for updating. Hence the insane amount
of time I spent figuring out and documenting how to build and test
the various versions of gcc and glibc (http://kegel.com/crosstool),
and then understanding the regression test failures.
- Dan

--
Dan Kegel
http://www.kegel.com
http://counter.li.org/cgi-bin/runscript/display-person.cgi?user=78045

2003-08-31 16:18:28

by Larry McVoy

Subject: Re: LMbench as gcc performance regression test?

On Sun, Aug 31, 2003 at 08:59:10AM -0700, Dan Kegel wrote:
> I need to make sure that moving to a newer compiler for our kernel
> will cause no performance regressions.

Perhaps people think that you mean to compile LMbench with different gccs,
when what you really mean is to compile the same kernel with different gccs
and measure that.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-08-31 16:54:49

by Randy Hron

Subject: Re: LMbench as gcc performance regression test?

> For starters, just compiling the kernel with different compilers;
> I'll keep running LMBench compiled with the old compiler.

Below are the lmbench results I got. One set is from a K6/2, and
one is from a quad P3 Xeon. The kernels labeled -falign=2 were built with
-falign-functions=2 -falign-jumps=2 -falign-labels=2 -falign-loops=2.

The first batch of results is from the K6/2; the second is from the quad P3.
The numbers are the average of 25 runs; the high and low results were not
removed from the averages.
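
(For reference, dropping the high and low before averaging takes only a
few lines of C. The sketch below is an illustration, not the script used
to produce these tables; feed it one result per line.)

/* trimmed.c: illustration only.  Reads N results from stdin, prints
 * the plain mean and the mean with the single highest and lowest
 * values dropped. */
#include <stdio.h>
#include <stdlib.h>

static int cmp(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

int main(void)
{
    double v[1024], sum = 0, trimmed = 0;
    int i, n = 0;

    while (n < 1024 && scanf("%lf", &v[n]) == 1)
        sum += v[n++];
    if (n < 3)
        return 1;
    qsort(v, n, sizeof(v[0]), cmp);
    for (i = 1; i < n - 1; i++)   /* skip v[0] (low) and v[n-1] (high) */
        trimmed += v[i];
    printf("mean %.4f, trimmed mean %.4f (%d runs)\n",
           sum / n, trimmed / (n - 2), n);
    return 0;
}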

The kernel versions without a suffix use the default kernel compile options.

I also ran a bunch of other benchmarks on these kernels.
It seems to me the application/target platform may be the
final arbiter of "what is best". I was hoping for a very
general result, like gcc-3.3.1 -Os -falign=2 always being clearly
better, but I don't see that. (That generalization is partly true
on the K6/2, but not on the P3 Xeon; the two chips have very
different cache characteristics.)

BTW, for the 2.6 kernel, gcc-3.3.1 -Os saved about 800k of RAM on a 1GB Athlon
compared to gcc-2.95.3 -O2. Most of the savings were in nop instructions. There
was also a different mix of push/pop instructions; -Os created more push/pops
on the Athlon.

One thing is clear: gcc-3.3.1 takes roughly twice as long to compile the kernel.

Note: LMbench itself wasn't recompiled; the compiler options changed only for the kernel.

                 L M B E N C H  2 . 0   S U M M A R Y
                 ------------------------------------

Processor, Processes - times in microseconds - smaller is better
----------------------------------------------------------------
                                     null    null                    open  signal  signal    fork  execve /bin/sh
kernel                               call     I/O    stat   fstat   close install  handle process process process
--------------------------------- ------- ------- ------- ------- ------- ------- ------- ------- ------- -------
2.6.0-test3                          0.73 1.20351    5.89    1.60    7.07    1.99    5.04    1426    3272   13618
2.6.0-test3-gcc-2.96-Os              0.73 1.25547    4.17    1.71    7.22    1.99    4.43    1456    3416   13945
2.6.0-test3-gcc-3.3.1                0.74 1.08099    4.15    1.56    6.73    2.02    4.28    1397    3146   13238
2.6.0-test3-gcc-3.3.1-Os             0.73 1.33862    4.64    1.92    7.11    2.00    4.88    1398    3453   13874
2.6.0-test3-gcc-3.3.1-Os-falign=2    0.71 1.24968    3.99    1.77    6.50    2.02    4.00    1413    3441   13814

File select - times in microseconds - smaller is better
-------------------------------------------------------
                                   select  select  select  select  select  select  select  select
kernel                              10 fd  100 fd  250 fd  500 fd  10 tcp 100 tcp 250 tcp 500 tcp
--------------------------------- ------- ------- ------- ------- ------- ------- ------- -------
2.6.0-test3                          4.11   15.51   35.15   66.14    5.98 33.6829 83.6085 156.960
2.6.0-test3-gcc-2.96-Os              4.04   15.71   35.02   67.80    6.03 33.5566 79.2899 160.586
2.6.0-test3-gcc-3.3.1                4.62   16.14   36.56   74.28    6.38 33.7929 79.8531 156.387
2.6.0-test3-gcc-3.3.1-Os             4.29   21.57   42.62   81.01    6.71 46.7026 98.9269 193.885
2.6.0-test3-gcc-3.3.1-Os-falign=2    3.64   15.20   33.43   65.21    6.23 39.4099 92.0308 178.482

Context switching with 0K - times in microseconds - smaller is better
---------------------------------------------------------------------
                                   2proc/0k  4proc/0k  8proc/0k 16proc/0k 32proc/0k 64proc/0k 96proc/0k
kernel                            ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch
--------------------------------- --------- --------- --------- --------- --------- --------- ---------
2.6.0-test3                            1.87      7.71      8.97     10.71     12.91     14.45     15.24
2.6.0-test3-gcc-2.96-Os                2.27      6.07      7.39      8.91     10.88     12.25     12.73
2.6.0-test3-gcc-3.3.1                  2.34      5.31      7.10      8.76     10.77     12.18     12.99
2.6.0-test3-gcc-3.3.1-Os               1.60      5.00      6.57      8.45     10.51     11.67     12.18
2.6.0-test3-gcc-3.3.1-Os-falign=2      1.25      3.60      5.45      7.15      9.29     10.65     11.19

Context switching with 4K - times in microseconds - smaller is better
---------------------------------------------------------------------
                                   2proc/4k  4proc/4k  8proc/4k 16proc/4k 32proc/4k 64proc/4k 96proc/4k
kernel                            ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch
--------------------------------- --------- --------- --------- --------- --------- --------- ---------
2.6.0-test3                            5.72     23.35     24.98     26.33     28.22     29.56     29.97
2.6.0-test3-gcc-2.96-Os                7.00     21.94     22.98     24.53     26.12     26.94     27.41
2.6.0-test3-gcc-3.3.1                  6.24     21.41     22.92     24.12     25.83     27.27     27.70
2.6.0-test3-gcc-3.3.1-Os               8.21     22.88     23.45     24.57     26.01     26.95     27.54
2.6.0-test3-gcc-3.3.1-Os-falign=2      8.41     24.30     25.56     26.98     28.93     29.94     30.51

Context switching with 8K - times in microseconds - smaller is better
---------------------------------------------------------------------
                                   2proc/8k  4proc/8k  8proc/8k 16proc/8k 32proc/8k 64proc/8k 96proc/8k
kernel                            ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch
--------------------------------- --------- --------- --------- --------- --------- --------- ---------
2.6.0-test3                           18.85     40.92     41.65     42.04     44.08     45.42     46.26
2.6.0-test3-gcc-2.96-Os               18.32     39.79     39.45     40.34     41.50     42.87     43.61
2.6.0-test3-gcc-3.3.1                 18.15     38.66     39.15     39.70     41.11     42.61     43.37
2.6.0-test3-gcc-3.3.1-Os              18.38     38.33     39.38     39.67     41.27     42.48     43.40
2.6.0-test3-gcc-3.3.1-Os-falign=2     18.67     41.22     41.94     42.71     44.51     45.99     46.75

Context switching with 16K - times in microseconds - smaller is better
----------------------------------------------------------------------
                                   2proc/16k 4proc/16k 8proc/16k 16prc/16k 32prc/16k 64prc/16k 96prc/16k
kernel                            ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch
--------------------------------- --------- --------- --------- --------- --------- --------- ---------
2.6.0-test3                           32.70     74.53     74.04     73.43     74.59     76.24     77.31
2.6.0-test3-gcc-2.96-Os               31.40     72.91     71.58     71.54     72.60     74.36     75.62
2.6.0-test3-gcc-3.3.1                 31.45     71.75     70.75     71.21     72.53     74.22     75.36
2.6.0-test3-gcc-3.3.1-Os              29.74     70.87     70.07     69.92     71.42     73.41     74.78
2.6.0-test3-gcc-3.3.1-Os-falign=2     32.24     75.09     74.88     74.67     75.23     76.83     78.08

Context switching with 32K - times in microseconds - smaller is better
----------------------------------------------------------------------
                                   2proc/32k 4proc/32k 8proc/32k 16prc/32k 32prc/32k 64prc/32k 96prc/32k
kernel                            ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch
--------------------------------- --------- --------- --------- --------- --------- --------- ---------
2.6.0-test3                          129.83    134.99    136.17    135.81    138.72    142.94    146.38
2.6.0-test3-gcc-2.96-Os              128.66    132.55    132.72    134.04    136.61    140.45    143.50
2.6.0-test3-gcc-3.3.1                129.67    133.26    133.57    134.00    135.80    139.78    143.06
2.6.0-test3-gcc-3.3.1-Os             130.01    133.43    132.44    133.15    135.29    139.57    143.19
2.6.0-test3-gcc-3.3.1-Os-falign=2    129.63    135.77    136.05    135.82    138.48    142.42    145.80

Context switching with 64K - times in microseconds - smaller is better
----------------------------------------------------------------------
                                   2proc/64k 4proc/64k 8proc/64k 16prc/64k 32prc/64k 64prc/64k 96prc/64k
kernel                            ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch
--------------------------------- --------- --------- --------- --------- --------- --------- ---------
2.6.0-test3                          234.17    241.21    244.17    247.96    255.85    265.69    270.87
2.6.0-test3-gcc-2.96-Os              239.48    242.65    242.96    246.98    254.17    263.57    268.69
2.6.0-test3-gcc-3.3.1                237.30    241.26    243.74    245.77    253.20    262.63    268.03
2.6.0-test3-gcc-3.3.1-Os             236.21    240.27    241.66    246.98    253.88    264.21    269.66
2.6.0-test3-gcc-3.3.1-Os-falign=2    241.20    244.36    246.88    250.26    257.55    267.07    272.12

File create/delete and VM system latencies in microseconds - smaller is better
----------------------------------------------------------------------------
                                        0K      0K      1K      1K      4K      4K     10K     10K    Mmap    Prot    Page
kernel                              Create  Delete  Create  Delete  Create  Delete  Create  Delete Latency   Fault   Fault
--------------------------------- ------- ------- ------- ------- ------- ------- ------- ------- ------- ------- -------
2.6.0-test3                          157.4    44.3   282.2    77.9   293.5    78.0   480.7    98.8    4281    0.11     8.7
2.6.0-test3-gcc-2.96-Os              160.1    39.1   283.4    74.0   291.8    74.0   474.3    92.0    3973    0.84     8.4
2.6.0-test3-gcc-3.3.1                152.9    35.2   263.4    64.9   271.0    65.3   450.8    84.1    3884    0.90     9.0
2.6.0-test3-gcc-3.3.1-Os             142.8    32.1   257.1    65.8   265.2    65.7   447.9    85.0    4600    0.37     9.8
2.6.0-test3-gcc-3.3.1-Os-falign=2    145.0    37.0   261.6    75.7   269.9    75.9   452.9    94.8    3981    0.67     8.9

*Local* Communication latencies in microseconds - smaller is better
-------------------------------------------------------------------
kernel                               Pipe AF/Unix     UDP RPC/UDP     TCP RPC/TCP TCPconn
--------------------------------- ------- ------- ------- ------- ------- ------- -------
2.6.0-test3                         23.91   28.95 79.4908 180.193 89.9243 213.264  379.17
2.6.0-test3-gcc-2.96-Os             16.72   24.36 61.8104 172.207 99.5137 213.399  384.39
2.6.0-test3-gcc-3.3.1               18.16   24.52 60.0927 156.112 85.7277 216.343  335.97
2.6.0-test3-gcc-3.3.1-Os            16.65   30.98 55.1646 159.562 97.5878 209.945  357.23
2.6.0-test3-gcc-3.3.1-Os-falign=2   15.69   25.77 64.9899 170.012 82.2858 210.149  344.43

*Local* Communication bandwidths in MB/s - bigger is better
-----------------------------------------------------------
                                                             File    Mmap   Bcopy   Bcopy  Memory  Memory
kernel                               Pipe AF/Unix     TCP  reread  reread  (libc)  (hand)    read   write
--------------------------------- ------- ------- ------- ------- ------- ------- ------- ------- -------
2.6.0-test3                          52.7    37.0    28.9    52.2   232.6    60.0    60.0   232.4    86.0
2.6.0-test3-gcc-2.96-Os              52.9    36.0    33.1    50.7   231.7    60.0    60.0   232.2    86.1
2.6.0-test3-gcc-3.3.1                55.8    36.8    31.9    52.9   232.7    60.1    60.1   232.5    86.2
2.6.0-test3-gcc-3.3.1-Os             56.5    37.2    25.2    51.8   232.6    60.0    60.0   232.4    86.2
2.6.0-test3-gcc-3.3.1-Os-falign=2    54.3    35.9    29.3    50.7   232.6    60.0    60.0   232.3    86.1

*Local* More Communication bandwidths in MB/s - bigger is better
----------------------------------------------------------------
                                     File    Mmap Aligned Partial Partial Partial Partial
OS                                   open    open   Bcopy   Bcopy    Mmap    Mmap    Mmap   Bzero
                                    close   close  (libc)  (hand)    read   write  rd/wrt    copy    HTTP
--------------------------------- ------- ------- ------- ------- ------- ------- ------- ------- -------
2.6.0-test3                          53.0   175.8    59.5    66.2   241.9    86.1    86.1    86.0    2.99
2.6.0-test3-gcc-2.96-Os              51.6   176.8    59.5    66.2   241.9    86.1    86.1    86.0    2.91
2.6.0-test3-gcc-3.3.1                52.3   178.0    59.5    66.2   242.1    86.2    86.2    86.2    3.21
2.6.0-test3-gcc-3.3.1-Os             52.8   165.6    59.5    66.2   242.1    86.3    86.3    86.2    3.08
2.6.0-test3-gcc-3.3.1-Os-falign=2    50.9   168.2    59.5    66.2   242.1    86.1    86.1    86.1    3.14

Memory latencies in nanoseconds - smaller is better
---------------------------------------------------
kernel                               Mhz    L1 $    L2 $  Main mem
--------------------------------- ----- ------- ------- ---------
2.6.0-test3                          476    4.25  232.06     268.3
2.6.0-test3-gcc-2.96-Os              476    4.26  231.12     268.7
2.6.0-test3-gcc-3.3.1                476    4.25  231.13     268.2
2.6.0-test3-gcc-3.3.1-Os             476    4.25  227.97     267.7
2.6.0-test3-gcc-3.3.1-Os-falign=2    476    4.25  231.75     267.7


Quad Xeon

                 L M B E N C H  2 . 0   S U M M A R Y
                 ------------------------------------

Processor, Processes - times in microseconds - smaller is better
----------------------------------------------------------------
                                         null    null                    open  signal  signal    fork  execve /bin/sh
kernel                                   call     I/O    stat   fstat   close install  handle process process process
------------------------------------- ------- ------- ------- ------- ------- ------- ------- ------- ------- -------
2.6.0-test3-mm2                          0.50 0.75332    4.35    1.36    5.94    1.56    5.13     257    1007    4506
2.6.0-test3-mm2-gcc-3.3.1                0.51 0.76516    4.39    1.37    5.96    1.54    5.11     254     996    4444
2.6.0-test3-mm2-gcc-3.3.1-Os-falign=2    0.52 0.75901    4.28    1.37    5.83    1.56    5.44     264    1019    4492

File select - times in microseconds - smaller is better
-------------------------------------------------------
                                       select  select  select  select  select  select  select  select
kernel                                  10 fd  100 fd  250 fd  500 fd  10 tcp 100 tcp 250 tcp 500 tcp
------------------------------------- ------- ------- ------- ------- ------- ------- ------- -------
2.6.0-test3-mm2                          3.98   22.31   53.07  106.21    5.02 33.4355 79.1707 159.488
2.6.0-test3-mm2-gcc-3.3.1                3.94   23.22   54.85  108.65    4.94 32.8961 79.7636 159.520
2.6.0-test3-mm2-gcc-3.3.1-Os-falign=2    3.90   22.66   53.64  105.36    5.49 38.1446 91.8798 181.395

Context switching with 0K - times in microseconds - smaller is better
---------------------------------------------------------------------
                                       2proc/0k  4proc/0k  8proc/0k 16proc/0k 32proc/0k 64proc/0k 96proc/0k
kernel                                ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch
------------------------------------- --------- --------- --------- --------- --------- --------- ---------
2.6.0-test3-mm2                            1.65      2.17      2.60      2.75      2.90      3.53      4.76
2.6.0-test3-mm2-gcc-3.3.1                  1.61      2.12      2.48      2.66      2.79      3.33      4.52
2.6.0-test3-mm2-gcc-3.3.1-Os-falign=2      2.00      2.62      3.39      3.44      3.53      3.92      5.00

Context switching with 4K - times in microseconds - smaller is better
---------------------------------------------------------------------
                                       2proc/4k  4proc/4k  8proc/4k 16proc/4k 32proc/4k 64proc/4k 96proc/4k
kernel                                ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch
------------------------------------- --------- --------- --------- --------- --------- --------- ---------
2.6.0-test3-mm2                            2.66      3.76      4.31      4.39      4.58      6.78     10.27
2.6.0-test3-mm2-gcc-3.3.1                  2.25      4.14      4.47      4.47      4.56      6.54      9.67
2.6.0-test3-mm2-gcc-3.3.1-Os-falign=2      2.54      4.18      4.95      4.96      5.22      6.95     10.17

Context switching with 8K - times in microseconds - smaller is better
---------------------------------------------------------------------
                                       2proc/8k  4proc/8k  8proc/8k 16proc/8k 32proc/8k 64proc/8k 96proc/8k
kernel                                ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch
------------------------------------- --------- --------- --------- --------- --------- --------- ---------
2.6.0-test3-mm2                            4.65      5.74      5.76      5.83      6.51     12.94     21.74
2.6.0-test3-mm2-gcc-3.3.1                  4.53      6.03      5.96      5.90      6.36     12.49     21.53
2.6.0-test3-mm2-gcc-3.3.1-Os-falign=2      4.50      5.95      6.16      6.27      6.86     12.92     21.89

Context switching with 16K - times in microseconds - smaller is better
----------------------------------------------------------------------
                                       2proc/16k 4proc/16k 8proc/16k 16prc/16k 32prc/16k 64prc/16k 96prc/16k
kernel                                ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch
------------------------------------- --------- --------- --------- --------- --------- --------- ---------
2.6.0-test3-mm2                            9.14      9.07      8.90      8.99     14.06     39.74     49.45
2.6.0-test3-mm2-gcc-3.3.1                  8.21      8.33      8.40      8.96     14.51     39.50     50.52
2.6.0-test3-mm2-gcc-3.3.1-Os-falign=2      8.76      9.07      9.09      9.68     15.57     40.04     50.86

Context switching with 32K - times in microseconds - smaller is better
----------------------------------------------------------------------
                                       2proc/32k 4proc/32k 8proc/32k 16prc/32k 32prc/32k 64prc/32k 96prc/32k
kernel                                ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch
------------------------------------- --------- --------- --------- --------- --------- --------- ---------
2.6.0-test3-mm2                          15.740    15.347    14.779    19.616    57.711    87.463    88.771
2.6.0-test3-mm2-gcc-3.3.1                14.584    14.636    14.619    20.376    57.226    87.183    88.533
2.6.0-test3-mm2-gcc-3.3.1-Os-falign=2    14.961    14.820    14.847    20.192    58.146    87.172    88.723

Context switching with 64K - times in microseconds - smaller is better
----------------------------------------------------------------------
                                       2proc/64k 4proc/64k 8proc/64k 16prc/64k 32prc/64k 64prc/64k 96prc/64k
kernel                                ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch
------------------------------------- --------- --------- --------- --------- --------- --------- ---------
2.6.0-test3-mm2                           26.08     25.19     32.19    105.04    158.28    163.83    162.59
2.6.0-test3-mm2-gcc-3.3.1                 24.83     24.42     34.92    106.19    160.16    166.21    164.72
2.6.0-test3-mm2-gcc-3.3.1-Os-falign=2     26.00     25.57     32.14    102.00    157.90    162.73    163.06

File create/delete and VM system latencies in microseconds - smaller is better
----------------------------------------------------------------------------
                                            0K      0K      1K      1K      4K      4K     10K     10K    Mmap    Prot    Page
kernel                                  Create  Delete  Create  Delete  Create  Delete  Create  Delete Latency   Fault   Fault
------------------------------------- ------- ------- ------- ------- ------- ------- ------- ------- ------- ------- -------
2.6.0-test3-mm2                           56.2    10.5    89.1    22.0    89.0    22.0   130.8    30.5    4603    0.65     4.0
2.6.0-test3-mm2-gcc-3.3.1                 53.1     9.9    85.9    21.0    85.9    21.0   127.8    29.7    4603    0.70     4.0
2.6.0-test3-mm2-gcc-3.3.1-Os-falign=2     57.7    10.3    90.4    21.3    90.4    21.3   132.3    30.1    4725    0.61     4.1

*Local* Communication latencies in microseconds - smaller is better
-------------------------------------------------------------------
kernel                                   Pipe AF/Unix     UDP RPC/UDP     TCP RPC/TCP TCPconn
------------------------------------- ------- ------- ------- ------- ------- ------- -------
2.6.0-test3-mm2                         10.31   15.11 30.8128 65.4053 37.5467 76.3995   97.65
2.6.0-test3-mm2-gcc-3.3.1               10.23   15.34 29.5681 62.6713 35.9673 75.1441   97.27
2.6.0-test3-mm2-gcc-3.3.1-Os-falign=2   11.77   17.47 33.1430 67.9045 38.9957 78.8412   97.49

*Local* Communication bandwidths in MB/s - bigger is better
-----------------------------------------------------------
                                                                 File    Mmap   Bcopy   Bcopy  Memory  Memory
kernel                                   Pipe AF/Unix     TCP  reread  reread  (libc)  (hand)    read   write
------------------------------------- ------- ------- ------- ------- ------- ------- ------- ------- -------
2.6.0-test3-mm2                         482.1   532.7   189.2   298.3   367.5   170.7   173.6   366.9   213.5
2.6.0-test3-mm2-gcc-3.3.1               480.6   546.2   167.3   296.8   362.3   170.5   173.1   362.6   212.4
2.6.0-test3-mm2-gcc-3.3.1-Os-falign=2   472.9   538.1   163.9   297.5   365.2   170.9   173.6   364.9   213.0

*Local* More Communication bandwidths in MB/s - bigger is better
----------------------------------------------------------------
                                         File    Mmap Aligned Partial Partial Partial Partial
OS                                       open    open   Bcopy   Bcopy    Mmap    Mmap    Mmap   Bzero
                                        close   close  (libc)  (hand)    read   write  rd/wrt    copy    HTTP
------------------------------------- ------- ------- ------- ------- ------- ------- ------- ------- -------
2.6.0-test3-mm2                         298.3   283.4   168.9   183.9   781.2   212.6   213.7   350.2   10.24
2.6.0-test3-mm2-gcc-3.3.1               296.5   280.1   168.6   182.9   782.7   211.9   212.2   350.6   10.24
2.6.0-test3-mm2-gcc-3.3.1-Os-falign=2   297.1   280.0   169.2   183.7   783.4   212.3   212.6   350.4   10.04

Memory latencies in nanoseconds - smaller is better
---------------------------------------------------
kernel                                   Mhz    L1 $    L2 $  Main mem
------------------------------------- ----- ------- ------- ---------
2.6.0-test3-mm2                          698    4.33   12.98     164.4
2.6.0-test3-mm2-gcc-3.3.1                698    4.33   13.00     166.4
2.6.0-test3-mm2-gcc-3.3.1-Os-falign=2    698    4.31   12.93     164.8




--
Randy Hron
http://home.earthlink.net/~rwhron/kernel/bigbox.html

2003-08-31 17:03:47

by Martin J. Bligh

Subject: Re: LMbench as gcc performance regression test?

> http://cs.nmu.edu/~benchmark/ has an interesting little graph
> of LMBench results vs. Linux kernel version, all done with the
> same compiler.
>
> Has anyone seen a similar graph showing LMBench results vs. gcc version,
> all done with the same Linux kernel?
> And does everyone agree that's a meaningful way to compare the
> performance of code generated by different compilers?

I've done similar things with kernbench before (always using 2.95 to
run the test, but comparing kernels compiled with gcc 2.95 vs 3.2 vs 3.3,
and -Os vs -O2, etc). The summary was that 3.x takes *much* longer to compile
the kernel, and produces worse code (though 3.3 is almost back up to
the performance of 2.95, and is better than 3.2). -O2 is better than -Os,
at least on a machine with a 2MB L2 cache. Search the archives for the
results I posted if you want, but I never bothered graphing them.

> I happen to have a number of versions of gcc handy, and was
> considering making such a graph, but was hoping somebody
> else had already done it.
>
> (There seem to be large variations between successive runs of LMBench
> when I try it, so it may take me a bit of work to get repeatable
> results.)

I'd just throw away any of the subtests that give you > 1% variation;
deriving anything meaningful from crap data is hard (and dubious). I have
similar problems with some of the results in lmbench - Larry suggested
setting "ENOUGH=" or something, which helped a few tests, but most of
them still aren't stable enough to be useful to me.

I'd also use something "bigger" than just a microbenchmark - you need to
exercise a realistic set of the kernel's functions in order to see space
vs. time tradeoffs, etc. If you want to see whether it's faster for you,
you need a benchmark that simulates roughly what you do with the machine
(i.e. a system-level benchmark).
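
(Rendered concretely, the "> 1%" rule is just a few lines of C; the sketch
below is an illustration, not anyone's actual tooling. Feed it the repeated
results for one subtest, one number per line.)

/* stable.c: flag a subtest whose repeated results vary by more
 * than 1%, per the suggestion above. */
#include <stdio.h>

int main(void)
{
    double x, lo = 0, hi = 0, mid;
    int n = 0;

    while (scanf("%lf", &x) == 1) {
        if (n == 0 || x < lo) lo = x;
        if (n == 0 || x > hi) hi = x;
        n++;
    }
    if (n == 0)
        return 1;
    mid = (lo + hi) / 2;
    /* spread = total variation relative to the midpoint */
    printf("spread %.2f%%: %s\n", 100 * (hi - lo) / mid,
           (hi - lo) / mid > 0.01 ? "discard" : "keep");
    return 0;
}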

M.

2003-08-31 22:34:06

by Dan Kegel

Subject: Re: LMbench as gcc performance regression test?

Larry McVoy wrote:
> Here is some background; pick a benchmark and play with it and see if
> you can convince yourself of anything. The basic idea is to run the
> benchmark TRIES times for $ENOUGH milliseconds. TRIES is set to an odd
> number in bench.h because we sort the results and take the midpoint and
> print that as the result.

It seems lat_pipe never does any median smoothing: its sample count (__N) always ends up as 1.
However, at least on the fairly quiet embedded system I'm testing on,
smoothing samples taken within a single run wouldn't make
a huge difference. Any smoothing you get with that would be swamped by
the fact that lat_pipe's result has a bimodal distribution, only one of
whose peaks shows up in any one run.
This sure sounds like the kind of thing page coloring is
supposed to solve; has anyone observed page coloring improving
the repeatability of the lat_pipe benchmark?
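
(One way to see the two modes is to run a stripped-down round-trip timer
many times and look at the per-run numbers. The sketch below is a
simplification of what lat_pipe measures, not LMbench code.)

/* pipes.c: minimal pipe round-trip timer.  Run it repeatedly; on a
 * system showing the bimodal behavior described above, successive
 * runs cluster around two different values. */
#include <stdio.h>
#include <unistd.h>
#include <sys/time.h>

#define N 10000

int main(void)
{
    int p1[2], p2[2], i;
    char c = 'x';
    struct timeval t0, t1;

    if (pipe(p1) || pipe(p2))
        return 1;
    if (fork() == 0) {          /* child: echo one byte back */
        close(p1[1]);
        close(p2[0]);
        while (read(p1[0], &c, 1) == 1)
            write(p2[1], &c, 1);
        return 0;               /* parent exited; pipe closed */
    }
    close(p1[0]);
    close(p2[1]);
    gettimeofday(&t0, 0);
    for (i = 0; i < N; i++) {   /* parent: N round trips */
        write(p1[1], &c, 1);
        read(p2[0], &c, 1);
    }
    gettimeofday(&t1, 0);
    printf("%.2f usec per round trip\n",
           ((t1.tv_sec - t0.tv_sec) * 1e6 +
            (t1.tv_usec - t0.tv_usec)) / N);
    return 0;
}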

(There's no median smoothing in lat_pipe.c, I think, because it passes
a value >= 1000000 as the 2nd arg of BENCH:
    BENCH(doit(p2[0], p1[1]), SHORT);
BENCH computes the number of samples to take the median of as
    __N = (get_enough(1000000) <= 100000) ? TRIES : 1;
get_enough() will always return at least what it is passed,
thus __N will always be 1. It sure was whenever I printed it out, too.
This seems to be the case for the following tests:
    bw_pipe bw_tcp bw_unix
    lat_fcntl lat_fifo lat_pipe lat_rpc lat_tcp lat_udp lat_unix)
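
(To make the arithmetic concrete, here is a standalone rendering of the
quoted logic, not the real LMbench source. The stub get_enough() captures
only the property used above, that it never returns less than its argument;
TRIES is 11 in lmbench 2.0's bench.h, though any value gives the same answer.)

/* enough.c: standalone rendering of the bench.h logic quoted above. */
#include <stdio.h>

#define TRIES 11
#define SHORT 1000000       /* what lat_pipe passes as "enough" */

/* Stub: the real get_enough() never returns less than its argument. */
static long get_enough(long enough)
{
    return enough;
}

int main(void)
{
    long n = (get_enough(SHORT) <= 100000) ? TRIES : 1;  /* the __N test */

    /* get_enough(1000000) >= 1000000 > 100000, so the condition is
     * always false and n is 1: no median smoothing happens. */
    printf("__N = %ld\n", n);
    return 0;
}
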
- Dan

--
Dan Kegel
http://www.kegel.com
http://counter.li.org/cgi-bin/runscript/display-person.cgi?user=78045