2002-11-11 23:24:55

by Con Kolivas

Subject: [BENCHMARK] 2.5.47{-mm1} with contest

Here are the latest contest (http://contest.kolivas.net) benchmarks up to and
including 2.5.47.

Note:
2.5.46-mm1 and later kernels tested now include preempt (previous ones didn't).
These tests were run on a system that uses ReiserFS, so the new changes are relevant.

noload:
Kernel [runs] Time CPU% Loads LCPU% Ratio
2.4.18 [5] 71.7 93 0 0 1.00
2.4.19 [5] 69.0 97 0 0 0.97
2.5.46 [2] 74.1 92 0 0 1.04
2.5.46-mm1 [5] 74.0 93 0 0 1.04
2.5.47 [3] 73.5 93 0 0 1.03
2.5.47-mm1 [5] 73.6 93 0 0 1.03

cacherun:
Kernel [runs] Time CPU% Loads LCPU% Ratio
2.4.18 [2] 66.6 99 0 0 0.93
2.4.19 [2] 68.0 99 0 0 0.95
2.5.46 [2] 67.9 99 0 0 0.95
2.5.46-mm1 [5] 68.9 99 0 0 0.96
2.5.47 [3] 68.3 99 0 0 0.96
2.5.47-mm1 [5] 68.4 99 0 0 0.96

process_load:
Kernel [runs] Time CPU% Loads LCPU% Ratio
2.4.18 [3] 109.5 57 119 44 1.53
2.4.19 [3] 106.5 59 112 43 1.49
2.5.46 [1] 92.9 74 36 29 1.30
2.5.46-mm1 [5] 82.7 82 21 21 1.16
2.5.47 [3] 83.4 82 22 21 1.17
2.5.47-mm1 [5] 83.0 83 21 20 1.16

ctar_load:
Kernel [runs] Time CPU% Loads LCPU% Ratio
2.4.18 [3] 117.4 63 1 7 1.64
2.4.19 [2] 106.5 70 1 8 1.49
2.5.46 [1] 98.3 80 1 7 1.38
2.5.46-mm1 [5] 95.3 80 1 5 1.33
2.5.47 [3] 93.9 80 1 5 1.32
2.5.47-mm1 [5] 94.0 81 1 5 1.32

xtar_load:
Kernel [runs] Time CPU% Loads LCPU% Ratio
2.4.18 [3] 150.8 49 2 8 2.11
2.4.19 [1] 132.4 55 2 9 1.85
2.5.46 [1] 113.5 67 1 8 1.59
2.5.46-mm1 [5] 227.1 34 3 7 3.18
2.5.47 [3] 167.1 45 2 7 2.34
2.5.47-mm1 [5] 118.5 64 1 7 1.66

Of note here is that 2.5.47 takes longer compared with 2.5.46 despite adding
preempt. Also, 2.5.47-mm1 is substantially shorter than 2.5.46-mm1 (both include preempt).


io_load:
Kernel [runs] Time CPU% Loads LCPU% Ratio
2.4.18 [3] 474.1 15 36 10 6.64
2.4.19 [3] 492.6 14 38 10 6.90
2.5.46 [1] 600.5 13 48 12 8.41
2.5.46-mm1 [5] 134.3 58 6 8 1.88
2.5.47 [3] 165.9 46 9 9 2.32
2.5.47-mm1 [5] 126.3 61 5 8 1.77

Very nice. Further improvement in 2.5.47-mm1 (note that the big change between
2.5.46 and 2.5.47 is consistent with the preempt addition, as mentioned in a
previous thread).


read_load:
Kernel [runs] Time CPU% Loads LCPU% Ratio
2.4.18 [3] 102.3 70 6 3 1.43
2.4.19 [2] 134.1 54 14 5 1.88
2.5.46 [1] 103.5 75 7 4 1.45
2.5.46-mm1 [5] 103.2 74 6 4 1.45
2.5.47 [3] 103.4 74 6 4 1.45
2.5.47-mm1 [5] 100.6 76 7 4 1.41

The improvement in 2.5.47-mm1, although small, is actually statistically significant.


list_load:
Kernel [runs] Time CPU% Loads LCPU% Ratio
2.4.18 [3] 90.2 76 1 17 1.26
2.4.19 [1] 89.8 77 1 20 1.26
2.5.46 [1] 96.8 74 2 22 1.36
2.5.46-mm1 [5] 101.4 70 1 22 1.42
2.5.47 [3] 100.2 71 1 20 1.40
2.5.47-mm1 [5] 102.4 69 1 19 1.43

mem_load:
Kernel [runs] Time CPU% Loads LCPU% Ratio
2.4.18 [3] 103.3 70 32 3 1.45
2.4.19 [3] 100.0 72 33 3 1.40
2.5.46 [3] 148.0 51 34 2 2.07
2.5.46-mm1 [5] 180.5 41 35 1 2.53
2.5.47 [3] 151.1 49 35 2 2.12
2.5.47-mm1 [5] 127.0 58 29 2 1.78

Again very nice.


I refuse to speculate on what part of the kernel is responsible for these
changes, but to me the -mm1 results are encouraging to say the least.

Well done.

Con


2002-11-12 00:02:52

by Andrew Morton

Subject: Re: [BENCHMARK] 2.5.47{-mm1} with contest

Con Kolivas wrote:
>
> io_load:
> Kernel [runs] Time CPU% Loads LCPU% Ratio
> 2.4.18 [3] 474.1 15 36 10 6.64
> 2.4.19 [3] 492.6 14 38 10 6.90
> 2.5.46 [1] 600.5 13 48 12 8.41
> 2.5.46-mm1 [5] 134.3 58 6 8 1.88
> 2.5.47 [3] 165.9 46 9 9 2.32
> 2.5.47-mm1 [5] 126.3 61 5 8 1.77
>
> Very nice. Further improvement in 2.5.47-mm1 (note the big change in 2.5.46-47
> is consistent with the preempt addition as mentioned in a previous thread)
>

Actually, 2.5.47 changed fifo_batch from 32 to 16. That's what caused
this big shift.

We've increased the kernel build speed by 3.6x while decreasing the
speed at which writes are retired by 5.3x.
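
For reference, those two figures fall straight out of the io_load rows above,
assuming the comparison is 2.5.46 versus 2.5.47 (an assumption, but the numbers
match those rows exactly). A throwaway C check of the arithmetic:

#include <stdio.h>

/* Re-derive the 3.6x / 5.3x figures from the io_load table quoted above,
 * assuming they compare 2.5.46 with 2.5.47 (the rows they happen to match). */
int main(void)
{
	double time_2546 = 600.5, time_2547 = 165.9;	/* compile time under io_load (s) */
	double loads_2546 = 48, loads_2547 = 9;		/* io_load "Loads" column */

	printf("build speedup:  %.1fx\n", time_2546 / time_2547);	/* ~3.6x */
	printf("write slowdown: %.1fx\n", loads_2546 / loads_2547);	/* ~5.3x */
	return 0;
}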

It could be argued that this is a net decrease in throughput. Although
there's clearly a big increase in total CPU utilisation.

It's a tradeoff. I think this is a better tradeoff than the old one
though.

2002-11-12 01:44:52

by Con Kolivas

Subject: Re: [BENCHMARK] 2.5.47{-mm1} with contest

Quoting Andrew Morton <[email protected]>:

> Con Kolivas wrote:
> >
> > io_load:
> > Kernel [runs] Time CPU% Loads LCPU% Ratio
> > 2.4.18 [3] 474.1 15 36 10 6.64
> > 2.4.19 [3] 492.6 14 38 10 6.90
> > 2.5.46 [1] 600.5 13 48 12 8.41
> > 2.5.46-mm1 [5] 134.3 58 6 8 1.88
> > 2.5.47 [3] 165.9 46 9 9 2.32
> > 2.5.47-mm1 [5] 126.3 61 5 8 1.77
> >
> > Very nice. Further improvement in 2.5.47-mm1 (note the big change in
> 2.5.46-47
> > is consistent with the preempt addition as mentioned in a previous thread)
> >
>
> Actually, 2.5.47 changed fifo_batch from 32 to 16. That's what caused
> this big shift.

There I go again, inappropriately commenting on the kernel ;-P Anyway, preempt
does help here too (I never said that).

> We've increased the kernel build speed by 3.6x while decreasing the
> speed at which writes are retired by 5.3x.
>
> It could be argued that this is a net decrease in throughput. Although
> there's clearly a big increase in total CPU utilisation.
>
> It's a tradeoff. I think this is a better tradeoff than the old one
> though.

I agree. Fortunately I don't think it's as bad a tradeoff as these numbers make
out. The load accounting in contest (johntest?) is still relatively bogus. Apart
from saying it's more or less loads, I don't think the scale of the numbers is
accurate.

Con

2002-11-12 02:00:22

by mark walters

Subject: Re: [BENCHMARK] 2.5.47{-mm1} with contest

--- Con Kolivas <[email protected]> wrote:
>
> I agree. Fortunately I don't think it's as bad a
> tradeoff as these numbers make
> out. The load accounting in contest (johntest?) is
> still relatively bogus. Apart
> from saying it's more or less loads I dont think the
> scale of the numbers are
> accurate.


Is the number of loads the total number of loads done
during the kernel compile or the number of loads per
unit time during the kernel compile? I was guessing
the former. (Andrew appeared to be guessing the
latter?)

Mark




2002-11-12 02:11:43

by Con Kolivas

Subject: Re: [BENCHMARK] 2.5.47{-mm1} with contest

Quoting mark walters <[email protected]>:

> --- Con Kolivas <[email protected]> wrote:
> >
> > I agree. Fortunately I don't think it's as bad a
> > tradeoff as these numbers make
> > out. The load accounting in contest (johntest?) is
> > still relatively bogus. Apart
> > from saying it's more or less loads I dont think the
> > scale of the numbers are
> > accurate.
>
>
> Is the number of loads the total number of loads done
> during the kernel compile or the number of loads per
> unit time during the kernel compile? I was guessing
> the former. (Andrew appeared to be guessing the
> latter?)

Number of loads = (total loads) * (kernel compile time) / (load run time)

And the load run time is impossible to fix because of the variable time it takes
to kill the load.

The load will be doing more work while the kernel is not compiling, so it will
always overestimate. At some stage I need to completely rewrite everything so
that the load itself knows when to start and stop counting load iterations, and
I'm not even sure I can do that.
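
A minimal sketch of that accounting for readers following along (the names
below are mine, not contest's actual source):

#include <stdio.h>

/* contest's reported "Loads" figure as described above: scale the total
 * iterations the load completed by the fraction of its runtime during which
 * the kernel compile was actually running.  The load iterates faster once
 * the compile has finished (nothing competing with it) until it is killed,
 * so this uniform-rate scaling overestimates the work done during the
 * compile. */
static double reported_loads(double total_loads, double compile_time,
			     double load_run_time)
{
	return total_loads * compile_time / load_run_time;
}

int main(void)
{
	/* arbitrary illustrative numbers, not a real measurement */
	printf("%.1f loads\n", reported_loads(10, 100.0, 120.0));
	return 0;
}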

Con

2002-11-12 02:58:17

by Aaron Lehmann

Subject: Re: [BENCHMARK] 2.5.47{-mm1} with contest

On Tue, Nov 12, 2002 at 10:31:38AM +1100, Con Kolivas wrote:
> Here are the latest contest (http://contest.kolivas.net) benchmarks up to and
> including 2.5.47.

This is just great to see. Most previous contest runs made me cringe
when I saw how -mm and recent 2.5 kernels were faring, but it looks
like Andrew has done something right in 2.5.47-mm1. I hope the
appropriate changes get merged so that 2.6.0 has stunning performance across
the board.

2002-11-12 08:45:34

by Giuliano Pochini

Subject: Re: [BENCHMARK] 2.5.47{-mm1} with contest


On 12-Nov-2002 Andrew Morton wrote:
> Con Kolivas wrote:
>>
>> io_load:
>> Kernel [runs] Time CPU% Loads LCPU% Ratio
>> 2.4.18 [3] 474.1 15 36 10 6.64
>> 2.4.19 [3] 492.6 14 38 10 6.90
>> 2.5.46 [1] 600.5 13 48 12 8.41
>> 2.5.46-mm1 [5] 134.3 58 6 8 1.88
>> 2.5.47 [3] 165.9 46 9 9 2.32
>> 2.5.47-mm1 [5] 126.3 61 5 8 1.77
>>
>
> We've increased the kernel build speed by 3.6x while decreasing the
> speed at which writes are retired by 5.3x.

Did the elevator change between .46 and .47 ?


Bye.


2002-11-12 09:14:09

by Jens Axboe

Subject: Re: [BENCHMARK] 2.5.47{-mm1} with contest

On Tue, Nov 12 2002, Giuliano Pochini wrote:
>
> On 12-Nov-2002 Andrew Morton wrote:
> > Con Kolivas wrote:
> >>
> >> io_load:
> >> Kernel [runs] Time CPU% Loads LCPU% Ratio
> >> 2.4.18 [3] 474.1 15 36 10 6.64
> >> 2.4.19 [3] 492.6 14 38 10 6.90
> >> 2.5.46 [1] 600.5 13 48 12 8.41
> >> 2.5.46-mm1 [5] 134.3 58 6 8 1.88
> >> 2.5.47 [3] 165.9 46 9 9 2.32
> >> 2.5.47-mm1 [5] 126.3 61 5 8 1.77
> >>
> >
> > We've increased the kernel build speed by 3.6x while decreasing the
> > speed at which writes are retired by 5.3x.
>
> Did the elevator change between .46 and .47 ?

No, but the fifo_batch count (which controls how many requests are moved from
the sort list to the dispatch queue) was halved. This gives lower latency, at
the possible cost of diminished throughput.
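
A toy model of that batching, purely for illustration (this is not the real
deadline io scheduler code; the numbers and names are made up around the
fifo_batch value mentioned):

#include <stdio.h>

/* Up to fifo_batch requests are moved from the sorted list to the dispatch
 * queue per pass, so a smaller fifo_batch means more passes (more chances to
 * notice an expired read, i.e. lower latency) but fewer back-to-back sorted
 * requests (possibly lower throughput). */
int main(void)
{
	int fifo_batch = 16;	/* was 32 before 2.5.47 */
	int pending = 64;	/* requests sitting on the sort list */
	int passes = 0;

	while (pending > 0) {
		int moved = pending < fifo_batch ? pending : fifo_batch;

		pending -= moved;	/* "move" one batch to the dispatch queue */
		passes++;
	}
	printf("fifo_batch=%d: %d dispatch passes for 64 requests\n",
	       fifo_batch, passes);
	return 0;
}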

--
Jens Axboe

2002-11-12 09:33:55

by Con Kolivas

Subject: Re: [BENCHMARK] 2.5.47{-mm1} with contest


>On Tue, Nov 12 2002, Giuliano Pochini wrote:
>> On 12-Nov-2002 Andrew Morton wrote:
>> > Con Kolivas wrote:
>> >> io_load:
>> >> Kernel [runs] Time CPU% Loads LCPU% Ratio
>> >> 2.4.18 [3] 474.1 15 36 10 6.64
>> >> 2.4.19 [3] 492.6 14 38 10 6.90
>> >> 2.5.46 [1] 600.5 13 48 12 8.41
>> >> 2.5.46-mm1 [5] 134.3 58 6 8 1.88
>> >> 2.5.47 [3] 165.9 46 9 9 2.32
>> >> 2.5.47-mm1 [5] 126.3 61 5 8 1.77
>> >
>> > We've increased the kernel build speed by 3.6x while decreasing the
>> > speed at which writes are retired by 5.3x.
>>
>> Did the elevator change between .46 and .47 ?
>
>No, but the fifo_batch count (which controls how many requests are moved
>sort list to dispatch queue) was halved. This gives lower latency, at
>the possible cost of dimishing throughput.

Preempt also lowered this value, and the ReiserFS changes may have affected it
further (the test machine runs ReiserFS).

Con

2002-11-12 10:57:47

by Andrew Morton

Subject: Re: [BENCHMARK] 2.5.47{-mm1} with contest

Aaron Lehmann wrote:
>
> On Tue, Nov 12, 2002 at 10:31:38AM +1100, Con Kolivas wrote:
> > Here are the latest contest (http://contest.kolivas.net) benchmarks up to and
> > including 2.5.47.
>
> This is just great to see. Most previous contest runs made me cringe
> when I saw how -mm and recent 2.5 kernels were faring, but it looks
> like Andrew has done something right in 2.5.47-mm1. I hope the
> appropriate get merged so that 2.6.0 has stunning performance across
> the board.

Tuning of 2.5 has really hardly started. In some ways, it should be
tested against 2.3.99 (well, not really, but...)

It will never be stunningly better than 2.4 for normal workloads on
normal machines, because 2.4 just ain't that bad.

What is being addressed in 2.5 is the areas where 2.4 fell down:
large machines, large numbers of threads, large disks, large amounts
of memory, etc. There have been really big gains in that area.

For the uniprocessors and small servers, there will be significant
gains in some corner cases. And some losses. Quite a lot of work
has gone into "fairness" issues: allowing tasks to make equal progress
when the machine is under load. Not stalling tasks for unreasonable
amounts of time, etc. Simple operations such as copying a forest
of files from one part of the disk to another have taken a bit of a
hit from this. (But copying them to another disk got better).

Generally, 2.6 should be "nicer to use" on the desktop. But not appreciably
faster. Significantly slower when there are several processes causing a
lot of swapout. That is one area where fairness really hurts throughput.
The old `make -j30 bzImage' with mem=128M takes 1.5x as long with 2.5.
Because everyone makes equal progress.

Most of the VM gains involve situations where there are large amounts
of dirty data in the machine. This has always been a big problem
for Linux, and I think we've largely got it under control now. There
are still a few issues in the page reclaim code wrt this, but they're
fairly obscure (I'm the only person who has noticed them ;))

There are some things which people simply have not yet noticed.


Andrea's kernel is the fastest which 2.4 has to offer; let's tickle
its weak spots:



Run mke2fs against six disks at the same time, mem=1G:

2.4.20-rc1aa1:
0.04s user 13.16s system 51% cpu 25.782 total
0.05s user 31.53s system 63% cpu 49.542 total
0.05s user 29.04s system 58% cpu 49.544 total
0.05s user 31.07s system 62% cpu 50.017 total
0.06s user 29.80s system 58% cpu 50.983 total
0.06s user 23.30s system 43% cpu 53.214 total

2.5.47-mm2:
0.04s user 2.94s system 48% cpu 6.168 total
0.04s user 2.89s system 39% cpu 7.473 total
0.05s user 3.00s system 37% cpu 8.152 total
0.06s user 4.33s system 43% cpu 9.992 total
0.06s user 4.35s system 42% cpu 10.484 total
0.04s user 4.32s system 32% cpu 13.415 total


Write six 4G files to six disks in parallel, mem=1G:

2.4.20-rc1aa1:
0.01s user 63.17s system 7% cpu 13:53.26 total
0.05s user 63.43s system 7% cpu 14:07.17 total
0.03s user 65.94s system 7% cpu 14:36.25 total
0.01s user 66.29s system 7% cpu 14:38.01 total
0.08s user 63.79s system 7% cpu 14:45.09 total
0.09s user 65.22s system 7% cpu 14:46.95 total

2.5.47-mm2:
0.03s user 53.95s system 39% cpu 2:18.27 total
0.03s user 58.11s system 30% cpu 3:08.23 total
0.02s user 57.43s system 30% cpu 3:08.47 total
0.03s user 54.73s system 23% cpu 3:52.43 total
0.03s user 54.72s system 23% cpu 3:53.22 total
0.03s user 46.14s system 14% cpu 5:29.71 total


Compile a kernel while running `while true;do;./dbench 32;done' against
the same disk. mem=128m:

2.4.20-rc1aa1:
Throughput 17.7491 MB/sec (NB=22.1863 MB/sec 177.491 MBit/sec)
Throughput 16.6311 MB/sec (NB=20.7888 MB/sec 166.311 MBit/sec)
Throughput 17.0409 MB/sec (NB=21.3012 MB/sec 170.409 MBit/sec)
Throughput 17.4876 MB/sec (NB=21.8595 MB/sec 174.876 MBit/sec)
Throughput 15.3017 MB/sec (NB=19.1271 MB/sec 153.017 MBit/sec)
Throughput 18.0726 MB/sec (NB=22.5907 MB/sec 180.726 MBit/sec)
Throughput 18.2769 MB/sec (NB=22.8461 MB/sec 182.769 MBit/sec)
Throughput 19.152 MB/sec (NB=23.94 MB/sec 191.52 MBit/sec)
Throughput 14.2632 MB/sec (NB=17.8291 MB/sec 142.632 MBit/sec)
Throughput 20.5007 MB/sec (NB=25.6258 MB/sec 205.007 MBit/sec)
Throughput 24.9471 MB/sec (NB=31.1838 MB/sec 249.471 MBit/sec)
Throughput 20.36 MB/sec (NB=25.45 MB/sec 203.6 MBit/sec)
make -j4 bzImage 412.28s user 36.90s system 15% cpu 47:11.14 total

2.5.46:
Throughput 19.3907 MB/sec (NB=24.2383 MB/sec 193.907 MBit/sec)
Throughput 16.6765 MB/sec (NB=20.8456 MB/sec 166.765 MBit/sec)
make -j4 bzImage 412.16s user 36.92s system 83% cpu 8:55.74 total

2.5.47-mm2:
Throughput 15.0539 MB/sec (NB=18.8174 MB/sec 150.539 MBit/sec)
Throughput 21.6388 MB/sec (NB=27.0485 MB/sec 216.388 MBit/sec)
make -j4 bzImage 413.88s user 35.90s system 94% cpu 7:56.68 total <- fifo_batch strikes again


It's the "doing multiple things at the same time" which gets better; the
straightline throughput of "one thing at a time" won't change much at all.

Corner cases....

2002-11-12 14:17:36

by Jens Axboe

Subject: Re: [BENCHMARK] 2.5.47{-mm1} with contest

On Tue, Nov 12 2002, Aaron Lehmann wrote:
> On Tue, Nov 12, 2002 at 03:04:23AM -0800, Andrew Morton wrote:
> > It will never be stunningly better than 2.4 for normal workloads on
> > normal machines, because 2.4 just ain't that bad.
>
> Actually, I am having serious problems with 2.4 (.20-pre5). Copying a
> file from hda to hdc without really doing anything else goes very
> slowly and lags the whole system ruthlessly. The load average rises to
> about three. Any app which tries to touch the disk will hang for
> several seconds. Yes, DMA is on on both drives (udma5), as well as
> 32-bit I/O and unmaskirq. Bad IDE controller or driver? I don't know.
> It's a ServerWorks CSB5. I've been meaning to try 2.5-mm to see if it
> improves this.

Testing 2.5 for this would be interesting too indeed, but you should
also try 2.4.20-rc1. Between -pre3 and -pre8 (iirc) you could have
awfully slow io. And you should probably do

# elvtune -r512 /dev/hd{a,c}

too, in 2.4.20-rc1

--
Jens Axboe

2002-11-12 14:13:35

by Aaron Lehmann

Subject: Re: [BENCHMARK] 2.5.47{-mm1} with contest

On Tue, Nov 12, 2002 at 03:04:23AM -0800, Andrew Morton wrote:
> It will never be stunningly better than 2.4 for normal workloads on
> normal machines, because 2.4 just ain't that bad.

Actually, I am having serious problems with 2.4 (.20-pre5). Copying a
file from hda to hdc without really doing anything else goes very
slowly and lags the whole system ruthlessly. The load average rises to
about three. Any app which tries to touch the disk will hang for
several seconds. Yes, DMA is on on both drives (udma5), as well as
32-bit I/O and unmaskirq. Bad IDE controller or driver? I don't know.
It's a ServerWorks CSB5. I've been meaning to try 2.5-mm to see if it
improves this.

Another sort of offtopic 2.4 thing: I found an entry like this for every
running process in dmesg:

getty S 00000013 5096 27557 1 24160 (NOTLB)
Call Trace: [<c0114d3b>] [<c01be0f9>] [<c01b15a1>] [<c01b106d>] [<c01ad1f5>]
[<c0136513>] [<c01239f8>] [<c0109107>]

This puzzles me because sysrq is turned OFF. How could this have
happened?

2002-11-12 20:31:57

by Bill Davidsen

Subject: Re: [BENCHMARK] 2.5.47{-mm1} with contest

On Tue, 12 Nov 2002, Andrew Morton wrote:

> Tuning of 2.5 has really hardly started. In some ways, it should be
> tested against 2.3.99 (well, not really, but...)
>
> It will never be stunningly better than 2.4 for normal workloads on
> normal machines, because 2.4 just ain't that bad.
>
> What is being addressed in 2.5 is the areas where 2.4 fell down:
> large machines, large numbers of threads, large disks, large amounts
> of memory, etc. There have been really big gains in that area.
>
> For the uniprocessors and small servers, there will be significant
> gains in some corner cases. And some losses. Quite a lot of work
> has gone into "fairness" issues: allowing tasks to make equal progress
> when the machine is under load. Not stalling tasks for unreasonable
> amounts of time, etc. Simple operations such as copying a forest
> of files from one part of the disk to another have taken a bit of a
> hit from this. (But copying them to another disk got better).
>
> Generally, 2.6 should be "nicer to use" on the desktop. But not appreciably
> faster. Significantly slower when there are several processes causing a
> lot of swapout. That is one area where fairness really hurts throughput.
> The old `make -j30 bzImage' with mem=128M takes 1.5x as long with 2.5.
> Because everyone makes equal progress.
>
> Most of the VM gains involve situations where there are large amounts
> of dirty data in the machine. This has always been a big problem
> for Linux, and I think we've largely got it under control now. There
> are still a few issues in the page reclaim code wrt this, but they're
> fairly obscure (I'm the only person who has noticed them ;))
>
> There are some things which people simply have not yet noticed.
>
>
> Andrea's kernel is the fastest which 2.4 has to offer; let's tickle
> its weak spots:
>
At this point let me say that these are not things I do every day,
thankfully.
>
> Run mke2fs against six disks at the same time, mem=1G:
>
> Write six 4G files to six disks in parallel, mem=1G:
>
> Compile a kernel while running `while true;do;./dbench 32;done' against
> the same disk. mem=128m:


> It's the "doing multiple things at the same time" which gets better; the
> straightline throughput of "one thing at a time" won't change much at all.
>
> Corner cases....

In the area of things I do do every day, the occasionally posted AIM and
BYTE benchmarks look as though pipe latency and thruput are down, UNIX
socket latency and thruput are down, and these are things which will make
the system feel slower. More to the point, they are things which seem to
go down from 2.2 to 2.4 and 2.4 to 2.6, and are not obviously impacted by
fairness. I have a context switching benchmark which I should run on a
single machine to get some results. Unfortunately I don't have a single
machine which runs all the kernels, although I might next month.

This is neither a complaint nor a condemnation of the development process.
It is an observation I believe is easily checked by numbers from recent
posts to this list. See
<[email protected]> for an
example. A few items:


10 signal_test Signal Traps/second
linux-2.4.19 60.00 13358 222.62222 222622.20
linux-2.4.20-rc1 60.01 13350 222.49630 222496.30

linux-2.5.42 60.01 9099 151.62473 151624.73
linux-2.5.43 60.00 9474 157.90000 157900.00
linux-2.5.44 60.00 9186 153.10000 153100.00
linux-2.5.45 60.01 7481 124.66256 124662.56
linux-2.5.46 60.00 7621 127.01667 127016.67
12 fork_test Task Creations/second
linux-2.4.19 60.01 1903 31.70560 3170.60
linux-2.4.20-rc1 60.01 1736 28.92510 2892.20

linux-2.5.42 60.03 772 12.86024 1286.02
linux-2.5.43 60.06 705 11.33826 1173.83
linux-2.5.44 60.01 806 13.43109 1343.11
linux-2.5.45 60.02 867 14.44518 1444.52
linux-2.5.46 60.06 755 12.57076 1257.08
22 disk_src Directory Searches/second
linux-2.4.19 60.00 21280 354.66670 26600.10
linux-2.4.20-rc1 60.00 20690 344.82760 25862.10

linux-2.5.42 60.00 9147 152.45000 11433.75
linux-2.5.43 60.00 9208 153.46667 11510.00
linux-2.5.44 60.01 9193 153.19113 11489.34
linux-2.5.45 60.01 9053 150.85819 11314.36
linux-2.5.46 60.00 8891 148.18333 11113.75
54 tcp_test TCP/IP Messages/second
linux-2.4.19 60.01 36185 603.08330 54277.50
linux-2.4.20-rc1 60.00 35735 595.58330 53602.50

linux-2.5.42 60.00 9464 157.73333 14196.00
linux-2.5.43 60.00 9377 156.28333 14065.50
linux-2.5.44 60.00 9368 156.13333 14052.00
linux-2.5.45 60.01 13410 223.46276 20111.65
linux-2.5.46 60.01 10293 171.52141 15436.93

Looking at the whole post, you will also see that some categories are far
better; this is not a one-way street. But latencies have been creeping
up, version by version, for some years. And I suspect that if the old results
I'm looking at were run on a modern machine, the values would show even
more changes.

As you say, the new kernel is being tuned; maybe these things will get
faster, but at the moment there are some performance drops in things which
happen on the desktop rather than the server.

--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

2002-11-12 20:52:34

by Andrew Morton

Subject: Re: [BENCHMARK] 2.5.47{-mm1} with contest

Bill Davidsen wrote:
>
> In the area of things I do do every day, the occasionally posted AIM and
> BYTE benchmarks look as though pipe latency and thruput are down, UNIX
> socket latency and thruput are down, and these are things which will make
> the system feel slower.

Yes, the AIM numbers which have been posted here are quite outrageous.
It's (hopefully) either a few Great Big Bugs or something in 2.5 has
invalidated the measurements. Or, conceivably, the reduction of the
size of the slabs_free list in the slab allocator.

I shall be taking a look at what's going on in there fairly soon.

2002-11-20 22:57:26

by Andrew Morton

Subject: Re: [BENCHMARK] 2.5.47{-mm1} with contest

Bill Davidsen wrote:
>
> ...
> In the area of things I do do every day, the occasionally posted AIM and
> BYTE benchmarks look as though pipe latency and thruput are down, UNIX
> socket latency and thruput are down, and these are things which will make
> the system feel slower.


OK, I finally got around to running AIM9. 2.4.20-rc1aa1 versus 2.5.48-mm1+

2.5 has taken a 1% to 1.5% hit everywhere due to the HZ change. 2.5 is faster
at lots of things and significantly slower at a couple of things. Which is all
exactly as one would hope and expect. I really don't know what's up with Pavan's
testing. I tested a quad PIII, so maybe there's a uniprocessor problem or
artifact...




First row: 2.4.30-rc1aa1
Second row: 2.5.48-mm1+


add_double 30040 18.5087 333155.79 Thousand Double Precision Additions/second
add_double 10020 18.1637 326946.11 Thousand Double Precision Additions/second

In all the compute-intensive operations, 2.5 is 1%-1.5% slower
due to the increase in HZ from 100 to 1000.
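
A rough sanity check of that figure (the per-tick cost below is an assumed
number for illustration, not something measured here):

#include <stdio.h>

/* Back-of-envelope: a timer interrupt costing on the order of 10-15us
 * (assumed) eats 1-1.5% of a CPU at HZ=1000, versus ~0.1% at HZ=100. */
int main(void)
{
	double tick_cost = 12e-6;	/* assumed seconds of work per tick */

	printf("HZ=100:  %.2f%% of one CPU\n", 100.0 * tick_cost * 100.0);
	printf("HZ=1000: %.2f%% of one CPU\n", 1000.0 * tick_cost * 100.0);
	return 0;
}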

add_float 30000 27.7667 333200.00 Thousand Single Precision Additions/second
add_float 10010 27.2727 327272.73 Thousand Single Precision Additions/second

add_long 30000 17.1333 1028000.00 Thousand Long Integer Additions/second
add_long 10040 16.8327 1009960.16 Thousand Long Integer Additions/second

add_int 30050 17.1381 1028286.19 Thousand Integer Additions/second
add_int 10050 16.8159 1008955.22 Thousand Integer Additions/second

add_short 30000 29.3667 704800.00 Thousand Short Integer Additions/second
add_short 10020 28.8423 692215.57 Thousand Short Integer Additions/second

creat-clo 30000 110.1 110100.00 File Creations and Closes/second
creat-clo 10010 119.181 119180.82 File Creations and Closes/second

2.5 sped up opens and closes. I don't know why.

page_test 30020 39.6069 67331.78 System Allocations & Pages/second
page_test 10000 41.1 69870.00 System Allocations & Pages/second

2.5's page allocator is faster - the per-cpu-pages code
presumably.

brk_test 30030 31.6017 537229.44 System Memory Allocations/second
brk_test 10010 32.4675 551948.05 System Memory Allocations/second

Ditto

jmp_test 30000 3311.6 3311600.00 Non-local gotos/second
jmp_test 10000 3249.1 3249100.00 Non-local gotos/second

signal_test 30000 122 122000.00 Signal Traps/second
signal_test 10000 91.7 91700.00 Signal Traps/second

Signal delivery is a lot slower in 2.5. I do not know why.

exec_test 30000 38.3 191.50 Program Loads/second
exec_test 10020 37.0259 185.13 Program Loads/second

Possibly rmap overhead.

fork_test 30010 15.3282 1532.82 Task Creations/second
fork_test 10060 13.5189 1351.89 Task Creations/second

Possibly rmap overhead

link_test 30000 472.5 29767.50 Link/Unlink Pairs/second
link_test 10010 568.332 35804.90 Link/Unlink Pairs/second

Again, VFS operations got a lot quicker. Reason unknown.
(This kernel has dcache-rcu).

disk_rr 30050 10.3494 52989.02 Random Disk Reads (K)/second
disk_rr 10080 10.7143 54857.14 Random Disk Reads (K)/second

2.5's IO paths are more efficient.

disk_rw 30060 9.61411 49224.22 Random Disk Writes (K)/second
disk_rw 10080 9.92063 50793.65 Random Disk Writes (K)/second

Ditto.

disk_rd 30020 31.9121 163389.74 Sequential Disk Reads (K)/second
disk_rd 10020 31.3373 160447.11 Sequential Disk Reads (K)/second

disk_wrt 30040 17.8096 91185.09 Sequential Disk Writes (K)/second
disk_wrt 10030 18.345 93926.22 Sequential Disk Writes (K)/second

disk_cp 30080 11.2699 57702.13 Disk Copies (K)/second
disk_cp 10010 11.1888 57286.71 Disk Copies (K)/second

sync_disk_rw 33490 0.089579 229.32 Sync Random Disk Writes (K)/second
sync_disk_rw 11120 0.0899281 230.22 Sync Random Disk Writes (K)/second

sync_disk_wrt 53610 0.0373065 95.50 Sync Sequential Disk Writes (K)/second
sync_disk_wrt 24700 0.0404858 103.64 Sync Sequential Disk Writes (K)/second

sync_disk_cp 53400 0.0374532 95.88 Sync Disk Copies (K)/second
sync_disk_cp 24940 0.0400962 102.65 Sync Disk Copies (K)/second

disk_src 30000 317.567 23817.50 Directory Searches/second
disk_src 10000 326.2 24465.00 Directory Searches/second

VFS is faster here too.

div_double 30010 18.8604 56581.14 Thousand Double Precision Divides/second
div_double 10050 18.5075 55522.39 Thousand Double Precision Divides/second

div_float 30010 18.8604 56581.14 Thousand Single Precision Divides/second
div_float 10040 18.5259 55577.69 Thousand Single Precision Divides/second

div_long 30020 15.4231 13880.75 Thousand Long Integer Divides/second
div_long 10030 15.1545 13639.08 Thousand Long Integer Divides/second

div_int 30010 15.4282 13885.37 Thousand Integer Divides/second
div_int 10030 15.1545 13639.08 Thousand Integer Divides/second

div_short 30020 15.4231 13880.75 Thousand Short Integer Divides/second
div_short 10030 15.1545 13639.08 Thousand Short Integer Divides/second

fun_cal 30010 46.851 23987737.42 Function Calls (no arguments)/second
fun_cal 10010 47.0529 24091108.89 Function Calls (no arguments)/second

fun_cal1 30010 55.7148 28525958.01 Function Calls (1 argument)/second
fun_cal1 10000 55.7 28518400.00 Function Calls (1 argument)/second

fun_cal2 30000 75.8667 38843733.33 Function Calls (2 arguments)/second
fun_cal2 10000 74.5 38144000.00 Function Calls (2 arguments)/second

fun_cal15 30030 27.0729 13861338.66 Function Calls (15 arguments)/second
fun_cal15 10010 26.5734 13605594.41 Function Calls (15 arguments)/second

sieve 30190 1.02683 5.13 Integer Sieves/second
sieve 10840 1.01476 5.07 Integer Sieves/second

mul_double 30020 16.6556 199866.76 Thousand Double Precision Multiplies/second
mul_double 10030 16.3509 196211.37 Thousand Double Precision Multiplies/second

mul_float 30010 16.6611 199933.36 Thousand Single Precision Multiplies/second
mul_float 10020 16.3673 196407.19 Thousand Single Precision Multiplies/second

mul_long 30000 737.667 177040.00 Thousand Long Integer Multiplies/second
mul_long 10000 722.2 173328.00 Thousand Long Integer Multiplies/second

mul_int 30000 735.833 176600.00 Thousand Integer Multiplies/second
mul_int 10000 723.8 173712.00 Thousand Integer Multiplies/second

mul_short 30000 590.4 177120.00 Thousand Short Integer Multiplies/second
mul_short 10010 577.023 173106.89 Thousand Short Integer Multiplies/second

num_rtns_1 30000 320.433 32043.33 Numeric Functions/second
num_rtns_1 10000 314.9 31490.00 Numeric Functions/second

trig_rtns 30020 22.0187 220186.54 Trigonometric Functions/second
trig_rtns 10040 21.9124 219123.51 Trigonometric Functions/second

matrix_rtns 30000 5946.77 594676.67 Point Transformations/second
matrix_rtns 10000 5916.4 591640.00 Point Transformations/second

array_rtns 30090 7.24493 144.90 Linear Systems Solved/second
array_rtns 10000 7.2 144.00 Linear Systems Solved/second

string_rtns 30030 8.25841 825.84 String Manipulations/second
string_rtns 10040 8.06773 806.77 String Manipulations/second

mem_rtns_1 30050 12.0466 361397.67 Dynamic Memory Operations/second
mem_rtns_1 10080 12.1032 363095.24 Dynamic Memory Operations/second

mem_rtns_2 30000 1332.43 133243.33 Block Memory Operations/second
mem_rtns_2 10000 1287.1 128710.00 Block Memory Operations/second

sort_rtns_1 30000 21.6333 216.33 Sort Operations/second
sort_rtns_1 10040 21.2151 212.15 Sort Operations/second

misc_rtns_1 30000 230.467 2304.67 Auxiliary Loops/second
misc_rtns_1 10000 239 2390.00 Auxiliary Loops/second

dir_rtns_1 30000 101.967 1019666.67 Directory Operations/second
dir_rtns_1 10000 81.5 815000.00 Directory Operations/second

2.5 VFS is slower here. I don't know why.

shell_rtns_1 30000 47.9333 47.93 Shell Scripts/second
shell_rtns_1 10010 46.7532 46.75 Shell Scripts/second

shell_rtns_2 30010 47.9174 47.92 Shell Scripts/second
shell_rtns_2 10010 46.7532 46.75 Shell Scripts/second

shell_rtns_3 30000 47.8667 47.87 Shell Scripts/second
shell_rtns_3 10010 46.7532 46.75 Shell Scripts/second

rmap?

series_1 30000 27653.8 2765376.67 Series Evaluations/second
series_1 10000 26897 2689700.00 Series Evaluations/second

shared_memory 30000 1499.8 149980.00 Shared Memory Operations/second
shared_memory 10000 1355 135500.00 Shared Memory Operations/second

Reason unknown.

tcp_test 30000 341.667 30750.00 TCP/IP Messages/second
tcp_test 10010 296.503 26685.31 TCP/IP Messages/second

networking to localhost is really unrepeatable. I tend to
ignore such results. Although 2.5 does seem to be consistently
slower.

udp_test 30000 635.367 63536.67 UDP/IP DataGrams/second
udp_test 10000 619.7 61970.00 UDP/IP DataGrams/second

fifo_test 30000 2008.63 200863.33 FIFO Messages/second
fifo_test 10000 1970.8 197080.00 FIFO Messages/second

stream_pipe 30000 1362.77 136276.67 Stream Pipe Messages/second
stream_pipe 10000 1381.5 138150.00 Stream Pipe Messages/second

dgram_pipe 30000 1315.1 131510.00 DataGram Pipe Messages/second
dgram_pipe 10000 1353.1 135310.00 DataGram Pipe Messages/second

pipe_cpy 30000 2164.77 216476.67 Pipe Messages/second
pipe_cpy 10000 2291.6 229160.00 Pipe Messages/second

The pipe code has had some work. Although these tests also
tend to show very high variation between runs, and between reboots.

ram_copy 30000 17630.1 441104268.00 Memory to Memory Copy/second
ram_copy 10000 17245 431469900.00 Memory to Memory Copy/second

2002-11-21 00:07:22

by William Lee Irwin III

Subject: Re: [BENCHMARK] 2.5.47{-mm1} with contest

On Wed, Nov 20, 2002 at 03:02:24PM -0800, Andrew Morton wrote:
> First row: 2.4.30-rc1aa1
> Second row: 2.5.48-mm1+

Given the contents of -aa this is probably not far from its proper
version number, if not being tagged as a 2.5 variant. =)


On Wed, Nov 20, 2002 at 03:02:24PM -0800, Andrew Morton wrote:
> signal_test 30000 122 122000.00 Signal Traps/second
> signal_test 10000 91.7 91700.00 Signal Traps/second
> Signal delivery is a lot slower in 2.5. I do not know why,

Similar things have been reported with 2.4.x vs. 2.2.x and IIRC there
was some speculation they were due to low-level arch code interactions.
I think this merits some investigation. I, for one, am a big user of
SIGIO in userspace C programs...
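
For concreteness, the kind of SIGIO setup being referred to looks roughly like
this (ordinary user-space C, not any particular program from this thread); each
delivery pays the signal-frame setup cost that signal_test measures:

#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <signal.h>
#include <unistd.h>

static volatile sig_atomic_t got_sigio;

static void on_sigio(int sig)
{
	(void)sig;
	got_sigio = 1;
}

int main(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_sigio;
	sigaction(SIGIO, &sa, NULL);

	/* deliver SIGIO to this process whenever stdin becomes readable */
	fcntl(STDIN_FILENO, F_SETOWN, getpid());
	fcntl(STDIN_FILENO, F_SETFL, fcntl(STDIN_FILENO, F_GETFL) | O_ASYNC);

	while (!got_sigio)
		pause();
	printf("stdin became readable\n");
	return 0;
}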


On Wed, Nov 20, 2002 at 03:02:24PM -0800, Andrew Morton wrote:
> exec_test 30000 38.3 191.50 Program Loads/second
> exec_test 10020 37.0259 185.13 Program Loads/second
> Possibly rmap overhead.
> fork_test 30010 15.3282 1532.82 Task Creations/second
> fork_test 10060 13.5189 1351.89 Task Creations/second
> Possibly rmap overhead

Both known rmap overheads. This got debated/flamed/beaten to death
earlier in 2.5.x.


On Wed, Nov 20, 2002 at 03:02:24PM -0800, Andrew Morton wrote:
> shell_rtns_1 30000 47.9333 47.93 Shell Scripts/second
> shell_rtns_1 10010 46.7532 46.75 Shell Scripts/second
> shell_rtns_2 30010 47.9174 47.92 Shell Scripts/second
> shell_rtns_2 10010 46.7532 46.75 Shell Scripts/second
> shell_rtns_3 30000 47.8667 47.87 Shell Scripts/second
> shell_rtns_3 10010 46.7532 46.75 Shell Scripts/second
> rmap?

Yep. Shell stuff forking short-lived things that exec rapid-fire is the
same stuff antonb saw during SDET. (Otherwise it could be something else.)


On Wed, Nov 20, 2002 at 03:02:24PM -0800, Andrew Morton wrote:
> shared_memory 30000 1499.8 149980.00 Shared Memory Operations/second
> shared_memory 10000 1355 135500.00 Shared Memory Operations/second
> Reason unknown.

This is mmap()/munmap() and open()/close() of an anonymous (i.e. link
count 0) shmfs inode. This needs to be broken down into more specific
sysv shm operations and profiled to get a proper notion of what's wrong.
The codepaths responsible are clear: ipc/shm.c, mm/shmem.c, and mm/mmap.c.
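
For concreteness, the sort of sysv shm round trip that test hammers looks
roughly like this (ordinary user-space code, not AIM9's actual source):

#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Create an anonymous (IPC_PRIVATE) segment, attach, touch, detach, destroy.
 * This walks the codepaths named above: ipc/shm.c for the segment,
 * mm/shmem.c for the backing shmfs inode, mm/mmap.c for the attach/detach. */
int main(void)
{
	int id = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);
	char *p;

	if (id < 0) {
		perror("shmget");
		return 1;
	}
	p = shmat(id, NULL, 0);			/* mmap of the shmfs inode */
	if (p == (void *)-1) {
		perror("shmat");
		return 1;
	}
	memset(p, 0, 4096);			/* fault the pages in */
	shmdt(p);				/* munmap */
	shmctl(id, IPC_RMID, NULL);		/* drop the inode */
	return 0;
}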


On Wed, Nov 20, 2002 at 03:02:24PM -0800, Andrew Morton wrote:
> tcp_test 30000 341.667 30750.00 TCP/IP Messages/second
> tcp_test 10010 296.503 26685.31 TCP/IP Messages/second
> udp_test 30000 635.367 63536.67 UDP/IP DataGrams/second
> udp_test 10000 619.7 61970.00 UDP/IP DataGrams/second
> networking to localhost is really unrepeatable. I tend to
> ignore such results. Although 2.5 does seem to be consistently
> slower.

The behavior I observe is a bit different (i.e. totally consistent) but
that's pretty much because I'm bitten incredibly hard by odd performance
scalability issues in networking code (localhost all goes to the same
queue and things spin hard on the lock). On machines small enough not
to really care that there's locking at all I see the variability too.
I'd like to get this variability resolved but it's quite far afield
from anything I actually have expertise in or any sanction to work on.


On Wed, Nov 20, 2002 at 03:02:24PM -0800, Andrew Morton wrote:
> fifo_test 30000 2008.63 200863.33 FIFO Messages/second
> fifo_test 10000 1970.8 197080.00 FIFO Messages/second
> stream_pipe 30000 1362.77 136276.67 Stream Pipe Messages/second
> stream_pipe 10000 1381.5 138150.00 Stream Pipe Messages/second
> dgram_pipe 30000 1315.1 131510.00 DataGram Pipe Messages/second
> dgram_pipe 10000 1353.1 135310.00 DataGram Pipe Messages/second
> pipe_cpy 30000 2164.77 216476.67 Pipe Messages/second
> pipe_cpy 10000 2291.6 229160.00 Pipe Messages/second
> The pipe code has had some work. Although these tests also
> tend to show very high variation between runs, and between reboots.

The interactions of wake_up_sync() with the rest of the workload would
be nice to resolve (the processes on the pipe end up dominating the
machine), though that's not going to show up in a microbenchmark. Aside
from that I'm satisfied with pipe performance and see highly variable
bandwidth measurements also. I've heard rumors pipe microbenchmark
performance numbers have mostly to do with page color clashes with the
codepaths exercised during the benchmark and the task_structs
describing the processes performing the benchmark, with the natural
link order and code arrangement/size dependencies implied (not to
mention phase of the moon, the weather, and the government).
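
For reference, a minimal stream-pipe ping-pong of the sort these
microbenchmarks time (this is not AIM9's source, just the pattern whose
numbers bounce around with wakeups, scheduling and, reputedly, cache colour):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/wait.h>

#define ITERS 100000

int main(void)
{
	int a[2], b[2], i;
	char c = 'x';
	struct timeval t0, t1;
	double secs;

	if (pipe(a) || pipe(b))
		return 1;

	if (fork() == 0) {		/* child: echo every byte back */
		for (i = 0; i < ITERS; i++)
			if (read(a[0], &c, 1) != 1 || write(b[1], &c, 1) != 1)
				exit(1);
		exit(0);
	}

	gettimeofday(&t0, NULL);
	for (i = 0; i < ITERS; i++)
		if (write(a[1], &c, 1) != 1 || read(b[0], &c, 1) != 1)
			return 1;
	gettimeofday(&t1, NULL);
	wait(NULL);

	secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
	printf("%.0f round trips/second\n", ITERS / secs);
	return 0;
}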

Would getting an idea of what's statistically significant across runs
for the highly variable benchmarks be useful?


Bill

2002-11-21 00:37:00

by Alan

Subject: Re: [BENCHMARK] 2.5.47{-mm1} with contest

On Thu, 2002-11-21 at 00:08, William Lee Irwin III wrote:
> Similar things have been reported with 2.4.x vs. 2.2.x and IIRC there
> was some speculation they were due to low-level arch code interactions.
> I think this merits some investigation. I, for one, am a big user of
> SIGIO in userspace C programs...

FSAVE vs FXSAVE is one of the reasons for the 2.2/2.4 shift.


2002-11-21 06:47:42

by Andrew Morton

Subject: Re: [BENCHMARK] 2.5.47{-mm1} with contest

William Lee Irwin III wrote:
>
> ...
> On Wed, Nov 20, 2002 at 03:02:24PM -0800, Andrew Morton wrote:
> > signal_test 30000 122 122000.00 Signal Traps/second
> > signal_test 10000 91.7 91700.00 Signal Traps/second
> > Signal delivery is a lot slower in 2.5. I do not know why,
>
> Similar things have been reported with 2.4.x vs. 2.2.x and IIRC there
> was some speculation they were due to low-level arch code interactions.
> I think this merits some investigation. I, for one, am a big user of
> SIGIO in userspace C programs...
>

OK, got it back to 119000. Each signal was calling copy_*_user 24 times.
This gets it down to six.


--- 25/arch/i386/kernel/i387.c~signal-speedup Wed Nov 20 20:44:56 2002
+++ 25-akpm/arch/i386/kernel/i387.c Wed Nov 20 21:06:04 2002
@@ -232,7 +232,7 @@ void set_fpu_mxcsr( struct task_struct *
* FXSR floating point environment conversions.
*/

-static inline int convert_fxsr_to_user( struct _fpstate *buf,
+static int convert_fxsr_to_user( struct _fpstate *buf,
struct i387_fxsave_struct *fxsave )
{
unsigned long env[7];
@@ -254,13 +254,18 @@ static inline int convert_fxsr_to_user(
to = &buf->_st[0];
from = (struct _fpxreg *) &fxsave->st_space[0];
for ( i = 0 ; i < 8 ; i++, to++, from++ ) {
- if ( __copy_to_user( to, from, sizeof(*to) ) )
+ unsigned long *t = (unsigned long *)to;
+ unsigned long *f = (unsigned long *)from;
+
+ if (__put_user(*f, t) ||
+ __put_user(*(f + 1), t + 1) ||
+ __put_user(from->exponent, &to->exponent))
return 1;
}
return 0;
}

-static inline int convert_fxsr_from_user( struct i387_fxsave_struct *fxsave,
+static int convert_fxsr_from_user( struct i387_fxsave_struct *fxsave,
struct _fpstate *buf )
{
unsigned long env[7];
@@ -283,7 +288,12 @@ static inline int convert_fxsr_from_user
to = (struct _fpxreg *) &fxsave->st_space[0];
from = &buf->_st[0];
for ( i = 0 ; i < 8 ; i++, to++, from++ ) {
- if ( __copy_from_user( to, from, sizeof(*from) ) )
+ unsigned long *t = (unsigned long *)to;
+ unsigned long *f = (unsigned long *)from;
+
+ if (__get_user(*t, f) ||
+ __get_user(*(t + 1), f + 1) ||
+ __get_user(to->exponent, &from->exponent))
return 1;
}
return 0;
@@ -305,7 +315,7 @@ static inline int save_i387_fsave( struc
return 1;
}

-static inline int save_i387_fxsave( struct _fpstate *buf )
+static int save_i387_fxsave( struct _fpstate *buf )
{
struct task_struct *tsk = current;
int err = 0;
@@ -355,7 +365,7 @@ static inline int restore_i387_fsave( st
sizeof(struct i387_fsave_struct) );
}

-static inline int restore_i387_fxsave( struct _fpstate *buf )
+static int restore_i387_fxsave( struct _fpstate *buf )
{
int err;
struct task_struct *tsk = current;
@@ -373,7 +383,7 @@ int restore_i387( struct _fpstate *buf )

if ( HAVE_HWFP ) {
if ( cpu_has_fxsr ) {
- err = restore_i387_fxsave( buf );
+ err = restore_i387_fxsave( buf );
} else {
err = restore_i387_fsave( buf );
}

_

2002-11-21 13:15:44

by Dave Jones

Subject: Re: [BENCHMARK] 2.5.47{-mm1} with contest

On Wed, Nov 20, 2002 at 10:54:40PM -0800, Andrew Morton wrote:
> > I think this merits some investigation. I, for one, am a big user of
> > SIGIO in userspace C programs...
> OK, got it back to 119000. Each signal was calling copy_*_user 24 times.
> This gets it down to six.

Good eyes. But.. this also applies to 2.4 (which should also then
get faster). So the gap between 2.4 & 2.5 must be somewhere else ?

Also maybe we can do something about that multiple memcpy in copy_fpu_fxsave()
In fact, that looks a bit fishy. We copy 10 bytes each memcpy, but
advance the to ptr 5 bytes each iteration. What gives here ?

Dave

--
| Dave Jones. http://www.codemonkey.org.uk
| SuSE Labs

2002-11-21 13:55:34

by Dave Jones

Subject: Re: [BENCHMARK] 2.5.47{-mm1} with contest

> > OK, got it back to 119000. Each signal was calling copy_*_user 24 times.
> > This gets it down to six.
> Also maybe we can do something about that multiple memcpy in copy_fpu_fxsave()
> In fact, that looks a bit fishy. We copy 10 bytes each memcpy, but
> advance the to ptr 5 bytes each iteration. What gives here ?

<morning caffeine kicks in>
Doh, of course.. it's copying shorts. Still looks icky though IMO.

Dave

--
| Dave Jones. http://www.codemonkey.org.uk
| SuSE Labs

2002-11-21 15:07:08

by Denis Vlasenko

Subject: Re: [BENCHMARK] 2.5.47{-mm1} with contest

On 12 November 2002 01:04, Aaron Lehmann wrote:
> On Tue, Nov 12, 2002 at 10:31:38AM +1100, Con Kolivas wrote:
> > Here are the latest contest (http://contest.kolivas.net) benchmarks
> > up to and including 2.5.47.
>
> This is just great to see. Most previous contest runs made me cringe
> when I saw how -mm and recent 2.5 kernels were faring, but it looks
> like Andrew has done something right in 2.5.47-mm1. I hope the
> appropriate get merged so that 2.6.0 has stunning performance across
> the board.

Con, your test is extremely useful. Thank you.

(I think I have to say this aloud instead of just reading lkml)
--
vda

2002-11-21 17:14:52

by Andrew Morton

Subject: Re: [BENCHMARK] 2.5.47{-mm1} with contest

Dave Jones wrote:
>
> On Wed, Nov 20, 2002 at 10:54:40PM -0800, Andrew Morton wrote:
> > > I think this merits some investigation. I, for one, am a big user of
> > > SIGIO in userspace C programs...
> > OK, got it back to 119000. Each signal was calling copy_*_user 24 times.
> > This gets it down to six.
>
> Good eyes. But.. this also applies to 2.4 (which should also then
> get faster). So the gap between 2.4 & 2.5 must be somewhere else ?

But 2.4 already inlines the usercopy functions. With this benchmark,
the cost of the function call is visible. Same with the dir_rtns_1
test - it is performing zillions of 3-, 7- and 10-byte copies into userspace.

The usercopy functions got themselves optimised for large copies and
cache footprint. Maybe we should inline them again. Maybe it doesn't
matter much.

> Also maybe we can do something about that multiple memcpy in copy_fpu_fxsave()
> In fact, that looks a bit fishy. We copy 10 bytes each memcpy, but
> advance the to ptr 5 bytes each iteration. What gives here ?
>

We'd buy a bit by arranging for the in-kernel copy of the fp state
to have the same layout as the hardware. That way it can be done in
a single big, fast, well-aligned slurp. But for some reason that code has
to convert into and out of a different representation.

But the real low-hanging fruit here is the observation that the
test application doesn't use floating point!!!

Maybe we need to take an fp trap now and then to "poll" the application
to see if it is still using float.

2002-11-21 17:27:14

by William Lee Irwin III

Subject: Re: [BENCHMARK] 2.5.47{-mm1} with contest

On Thu, Nov 21, 2002 at 09:21:47AM -0800, Andrew Morton wrote:
> We'd buy a bit by arranging for the in-kernel copy of the fp state
> to have the same layout as the hardware. That way it can be done in
> a single big, fast, well-aligned slurp. But for some reason that code has
> to convert into and out of a different representation.
> But the real low-hanging fruit here is the observation that the
> test application doesn't use floating point!!!
> Maybe we need to take an fp trap now and then to "poll" the application
> to see if it is still using float.

Um... both of these are in the "wtf?? it doesn't do that now??" category.


Bill

2002-11-21 18:15:04

by Dave Jones

Subject: Re: [BENCHMARK] 2.5.47{-mm1} with contest

On Thu, Nov 21, 2002 at 09:21:47AM -0800, Andrew Morton wrote:

> > Good eyes. But.. this also applies to 2.4 (which should also then
> > get faster). So the gap between 2.4 & 2.5 must be somewhere else ?
>
> But 2.4 already inlines the usercopy functions. With this benchmark,
> the cost of the function call is visible. Same with the dir_rtn_1
> test - it is performing zillions of 3, 7, 10-byte copies into userspace.

But the reduction of number of copy_*_user's applies to 2.4 too right ?

> We'd buy a bit by arranging for the in-kernel copy of the fp state
> to have the same layout as the hardware. That way it can be done in
> a single big, fast, well-aligned slurp. But for some reason that code has
> to convert into and out of a different representation.

Possibly hardware restrictions, I'm unfamiliar with how this voodoo works.

> But the real low-hanging fruit here is the observation that the
> test application doesn't use floating point!!!

Interesting point. What was the test app ? contest ?
(I missed the beginning of this thread).

> Maybe we need to take an fp trap now and then to "poll" the application
> to see if it is still using float.

Neat idea. Bonus points for making it work 8-)

Dave

--
| Dave Jones. http://www.codemonkey.org.uk
| SuSE Labs

2002-11-21 18:20:26

by Bill Davidsen

Subject: Re: [BENCHMARK] 2.5.47{-mm1} with contest

On Thu, 21 Nov 2002, Andrew Morton wrote:

> Dave Jones wrote:

> We'd buy a bit by arranging for the in-kernel copy of the fp state
> to have the same layout as the hardware. That way it can be done in
> a single big, fast, well-aligned slurp. But for some reason that code has
> to convert into and out of a different representation.
>
> But the real low-hanging fruit here is the observation that the
> test application doesn't use floating point!!!
>
> Maybe we need to take an fp trap now and then to "poll" the application
> to see if it is still using float.

I thought we used to do that and someone (Linus??) thought the complexity
didn't justify the saving. Always seemed like a good idea to me, but with
the various types of processor we have today, it's harder to say what wins
until benchmarks are done.

Hopefully someone has a better memory than I about this.

--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.