2002-11-22 22:20:50

by Con Kolivas

Subject: [BENCHMARK] 2.4.20-rc2-aa1 with contest

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Here is a partial run of contest (http://contest.kolivas.net) benchmarks for
rc2aa1 with the disk latency hack

noload:
Kernel [runs] Time CPU% Loads LCPU% Ratio
2.4.18 [5] 71.7 93 0 0 0.98
2.4.19 [5] 69.0 97 0 0 0.94
2.4.20-rc1 [3] 72.2 93 0 0 0.99
2.4.20-rc1aa1 [1] 71.9 94 0 0 0.98
2420rc2aa1 [1] 71.1 94 0 0 0.97

cacherun:
Kernel [runs] Time CPU% Loads LCPU% Ratio
2.4.18 [2] 66.6 99 0 0 0.91
2.4.19 [2] 68.0 99 0 0 0.93
2.4.20-rc1 [3] 67.2 99 0 0 0.92
2.4.20-rc1aa1 [1] 67.4 99 0 0 0.92
2420rc2aa1 [1] 66.6 99 0 0 0.91

process_load:
Kernel [runs] Time CPU% Loads LCPU% Ratio
2.4.18 [3] 109.5 57 119 44 1.50
2.4.19 [3] 106.5 59 112 43 1.45
2.4.20-rc1 [3] 110.7 58 119 43 1.51
2.4.20-rc1aa1 [3] 110.5 58 117 43 1.51*
2420rc2aa1 [1] 212.5 31 412 69 2.90*

This load just copies data between 4 processes repeatedly. Seems to take
longer.
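
For reference, a process_load-style load boils down to something like the
sketch below: four processes connected in a ring of pipes, each copying a
block of data to the next until killed. This is only an illustration of the
idea, not the actual contest source.

/*
 * Rough sketch of a process_load-style scheduler load: four processes
 * in a ring of pipes, each reading a block from the previous process
 * and writing it to the next, forever, until killed from outside.
 * Illustration only, not the actual contest source.
 */
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NPROC 4
#define BUFSZ 4096

int main(void)
{
    int pipes[NPROC][2];
    char buf[BUFSZ];
    int i;

    for (i = 0; i < NPROC; i++)
        if (pipe(pipes[i]) < 0)
            exit(1);

    memset(buf, 'x', sizeof(buf));

    for (i = 0; i < NPROC; i++) {
        if (fork() == 0) {
            /* child i: read from pipe i, write to pipe (i + 1) % NPROC */
            int in = pipes[i][0];
            int out = pipes[(i + 1) % NPROC][1];

            for (;;) {
                if (read(in, buf, sizeof(buf)) <= 0)
                    _exit(0);
                if (write(out, buf, sizeof(buf)) <= 0)
                    _exit(0);
            }
        }
    }

    /* prime the ring, then just wait to be killed */
    write(pipes[0][1], buf, sizeof(buf));
    pause();
    return 0;
}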


ctar_load:
Kernel [runs] Time CPU% Loads LCPU% Ratio
2.4.18 [3] 117.4 63 1 7 1.60
2.4.19 [2] 106.5 70 1 8 1.45
2.4.20-rc1 [3] 102.1 72 1 7 1.39
2.4.20-rc1aa1 [3] 107.1 69 1 7 1.46
2420rc2aa1 [1] 103.3 73 1 8 1.41

xtar_load:
Kernel [runs] Time CPU% Loads LCPU% Ratio
2.4.18 [3] 150.8 49 2 8 2.06
2.4.19 [1] 132.4 55 2 9 1.81
2.4.20-rc1 [3] 180.7 40 3 8 2.47
2.4.20-rc1aa1 [3] 166.6 44 2 7 2.28*
2420rc2aa1 [1] 217.7 34 4 9 2.97*

Takes longer. Is only one run though so may not be an accurate average.


io_load:
Kernel [runs] Time CPU% Loads LCPU% Ratio
2.4.18 [3] 474.1 15 36 10 6.48
2.4.19 [3] 492.6 14 38 10 6.73
2.4.20-rc1 [2] 1142.2 6 90 10 15.60
2.4.20-rc1aa1 [1] 1132.5 6 90 10 15.47
2420rc2aa1 [1] 164.3 44 10 9 2.24

This was where the disk latency hack was expected to have an effect. It sure
did.


read_load:
Kernel [runs] Time CPU% Loads LCPU% Ratio
2.4.18 [3] 102.3 70 6 3 1.40
2.4.19 [2] 134.1 54 14 5 1.83
2.4.20-rc1 [3] 173.2 43 20 5 2.37
2.4.20-rc1aa1 [3] 150.6 51 16 5 2.06
2420rc2aa1 [1] 140.5 51 13 4 1.92

list_load:
Kernel [runs] Time CPU% Loads LCPU% Ratio
2.4.18 [3] 90.2 76 1 17 1.23
2.4.19 [1] 89.8 77 1 20 1.23
2.4.20-rc1 [3] 88.8 77 0 12 1.21
2.4.20-rc1aa1 [1] 88.1 78 1 16 1.20
2420rc2aa1 [1] 99.7 69 1 19 1.36

mem_load:
Kernel [runs] Time CPU% Loads LCPU% Ratio
2.4.18 [3] 103.3 70 32 3 1.41
2.4.19 [3] 100.0 72 33 3 1.37
2.4.20-rc1 [3] 105.9 69 32 2 1.45

Mem load hung the machine. I could not get rc2aa1 through this part of the
benchmark no matter how many times I tried to run it. No idea what was going
on. Easy to reproduce. Simply run the mem_load out of contest (which runs
until it is killed) and the machine will hang.

Con

P.S. I'm having mailserver trouble, so please respond to lkml where I may see
responses.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.0 (GNU/Linux)

iD8DBQE93q/IF6dfvkL3i1gRAqWCAKCp6eZ2MFe4Ag7LqoGwy4+0MbUqxQCgkkxl
AOUDUScNazCAJ2oZrdgDMuE=
=vHmI
-----END PGP SIGNATURE-----


2002-11-24 16:21:44

by Andrea Arcangeli

Subject: Re: [BENCHMARK] 2.4.20-rc2-aa1 with contest

On Sat, Nov 23, 2002 at 09:29:22AM +1100, Con Kolivas wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Here is a partial run of contest (http://contest.kolivas.net) benchmarks for
> rc2aa1 with the disk latency hack
>
> noload:
> Kernel [runs] Time CPU% Loads LCPU% Ratio
> 2.4.18 [5] 71.7 93 0 0 0.98
> 2.4.19 [5] 69.0 97 0 0 0.94
> 2.4.20-rc1 [3] 72.2 93 0 0 0.99
> 2.4.20-rc1aa1 [1] 71.9 94 0 0 0.98
> 2420rc2aa1 [1] 71.1 94 0 0 0.97
>
> cacherun:
> Kernel [runs] Time CPU% Loads LCPU% Ratio
> 2.4.18 [2] 66.6 99 0 0 0.91
> 2.4.19 [2] 68.0 99 0 0 0.93
> 2.4.20-rc1 [3] 67.2 99 0 0 0.92
> 2.4.20-rc1aa1 [1] 67.4 99 0 0 0.92
> 2420rc2aa1 [1] 66.6 99 0 0 0.91
>
> process_load:
> Kernel [runs] Time CPU% Loads LCPU% Ratio
> 2.4.18 [3] 109.5 57 119 44 1.50
> 2.4.19 [3] 106.5 59 112 43 1.45
> 2.4.20-rc1 [3] 110.7 58 119 43 1.51
> 2.4.20-rc1aa1 [3] 110.5 58 117 43 1.51*
> 2420rc2aa1 [1] 212.5 31 412 69 2.90*
>
> This load just copies data between 4 processes repeatedly. Seems to take
> longer.

Can you go into linux/include/blkdev.h and increase MAX_QUEUE_SECTORS to (2
<< (20 - 9)) and see if it makes any difference here? If it doesn't
make a difference it could be the slightly increased readahead, but I doubt
it's the latter.
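
Concretely, that change amounts to a one-liner along these lines (context
lines taken from the blkdev.h hunk quoted later in this thread; the exact
offsets may differ):

--- include/linux/blkdev.h
+++ include/linux/blkdev.h
@@ ... @@
 #define MAX_SEGMENTS 128
 #define MAX_SECTORS 255
-#define MAX_QUEUE_SECTORS (1 << (20 - 9)) /* 1 mbytes when full sized */
+#define MAX_QUEUE_SECTORS (2 << (20 - 9)) /* 2 mbytes when full sized */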

> ctar_load:
> Kernel [runs] Time CPU% Loads LCPU% Ratio
> 2.4.18 [3] 117.4 63 1 7 1.60
> 2.4.19 [2] 106.5 70 1 8 1.45
> 2.4.20-rc1 [3] 102.1 72 1 7 1.39
> 2.4.20-rc1aa1 [3] 107.1 69 1 7 1.46
> 2420rc2aa1 [1] 103.3 73 1 8 1.41
>
> xtar_load:
> Kernel [runs] Time CPU% Loads LCPU% Ratio
> 2.4.18 [3] 150.8 49 2 8 2.06
> 2.4.19 [1] 132.4 55 2 9 1.81
> 2.4.20-rc1 [3] 180.7 40 3 8 2.47
> 2.4.20-rc1aa1 [3] 166.6 44 2 7 2.28*
> 2420rc2aa1 [1] 217.7 34 4 9 2.97*
>
> Takes longer. Is only one run though so may not be an accurate average.

This is most probably a too-small waitqueue. Of course, increasing the
waitqueue will also increase the latency a bit for the other workloads; it's
a tradeoff and there's no way around it. Even read-latency has the tradeoff
when it chooses the "nth" place to be the seventh slot, where to put the read
request if it fails insertion.
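
The read-latency behaviour referred to here is roughly the following; this is
a simplified sketch of the insertion policy, not the code of the actual
read-latency patch:

/*
 * Sketch of a read-latency style insertion policy: when a READ request
 * cannot be merged into an existing request, put it at most READ_SLOT
 * entries from the head of the queue rather than at the tail, so it is
 * not stuck behind a long run of queued writes.  The tradeoff is that
 * the writes behind that slot wait correspondingly longer.
 */
#define READ_SLOT 7    /* the "seventh slot" mentioned above */

struct req {
    struct req *next;
};

static void insert_read(struct req **head, struct req *rq)
{
    struct req **pos = head;
    int depth = 0;

    /* walk at most READ_SLOT entries from the head of the queue */
    while (*pos && depth < READ_SLOT) {
        pos = &(*pos)->next;
        depth++;
    }

    /* splice the read request in here */
    rq->next = *pos;
    *pos = rq;
}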

>
>
> io_load:
> Kernel [runs] Time CPU% Loads LCPU% Ratio
> 2.4.18 [3] 474.1 15 36 10 6.48
> 2.4.19 [3] 492.6 14 38 10 6.73
> 2.4.20-rc1 [2] 1142.2 6 90 10 15.60
> 2.4.20-rc1aa1 [1] 1132.5 6 90 10 15.47
> 2420rc2aa1 [1] 164.3 44 10 9 2.24
>
> This was where the disk latency hack was expected to have an effect. It
> sure did.

Yes, I certainly can feel the machine being much more responsive during the
write load too. Too bad some benchmarks like dbench decreased significantly,
but I don't see too many ways around it. At least now with those changes the
contiguous write case is unaffected; my storage test box still reads and
writes at over 100mbyte/sec, for example. This clearly means that what
matters is that we have 512k dma commands, not a huge queue size. Still, with
a loaded machine and potential scheduling delays it could matter more to have
a larger queue; that may be why the performance decreased for some workloads
here too, not only because of a less effective elevator. So probably 2Mbyte
of queue is a much better idea, so at least we can have a ring with 4
elements to refill after a completion wakeup; I wanted to be strict in the
first place to see the "lowlatency" effect at its maximum. We could also
consider using a /4 instead of my current /2 for the batch_sectors
initialization.

BTW, at first glance it looks 2.5 has the same problem in the queue
sizing too.

> read_load:
> Kernel [runs] Time CPU% Loads LCPU% Ratio
> 2.4.18 [3] 102.3 70 6 3 1.40
> 2.4.19 [2] 134.1 54 14 5 1.83
> 2.4.20-rc1 [3] 173.2 43 20 5 2.37
> 2.4.20-rc1aa1 [3] 150.6 51 16 5 2.06
> 2420rc2aa1 [1] 140.5 51 13 4 1.92
>
> list_load:
> Kernel [runs] Time CPU% Loads LCPU% Ratio
> 2.4.18 [3] 90.2 76 1 17 1.23
> 2.4.19 [1] 89.8 77 1 20 1.23
> 2.4.20-rc1 [3] 88.8 77 0 12 1.21
> 2.4.20-rc1aa1 [1] 88.1 78 1 16 1.20
> 2420rc2aa1 [1] 99.7 69 1 19 1.36
>
> mem_load:
> Kernel [runs] Time CPU% Loads LCPU% Ratio
> 2.4.18 [3] 103.3 70 32 3 1.41
> 2.4.19 [3] 100.0 72 33 3 1.37
> 2.4.20-rc1 [3] 105.9 69 32 2 1.45
>
> Mem load hung the machine. I could not get rc2aa1 through this part of the
> benchmark no matter how many times I tried to run it. No idea what was going
> on. Easy to reproduce. Simply run the mem_load out of contest (which runs
> until it is killed) and the machine will hang.

Sorry, but what is mem_load supposed to do other than loop forever? It has
been running for two days on my test box (512m of ram, 2G of swap, 4-way
smp) and nothing has happened yet. It's an infinite loop. It sounds like
you're trapping a signal. Wouldn't it be simpler to just finish after a
number of passes? The machine is perfectly usable and responsive during the
mem_load; xmms doesn't skip a beat, for instance. This is probably thanks to
the elevator-lowlatency too; I recall xmms didn't use to be completely smooth
during heavy swapping in previous kernels (because the read() of the sound
file didn't return in a reasonable time, since I'm swapping on the same hd
where I store the data).
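
For what it's worth, a mem_load-style hog is essentially just the loop below.
This is a sketch of the idea (the 768 MB figure is an arbitrary assumption
standing in for "more than physical RAM"), not the actual contest source:

/*
 * Sketch of a mem_load-style memory hog: allocate more memory than the
 * machine has RAM and keep touching it so the kernel has to swap,
 * looping forever until the process is killed from outside.
 */
#include <stdlib.h>

#define MEM_MB 768    /* assumed: larger than physical RAM on the test box */

int main(void)
{
    size_t size = (size_t)MEM_MB * 1024 * 1024;
    char *mem = malloc(size);
    size_t i;

    if (!mem)
        return 1;

    for (;;)                           /* infinite loop, killed externally */
        for (i = 0; i < size; i += 4096)
            mem[i] = (char)i;          /* touch one byte per page */

    return 0;
}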

jupiter:~ # uptime
4:20pm up 1 day, 14:43, 3 users, load average: 1.38, 1.28, 1.21
jupiter:~ # vmstat 1
procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
0 1 0 197408 4504 112 1436 21 34 23 34 36 19 0 2 97
0 1 0 199984 4768 116 1116 11712 5796 11720 5804 514 851 1 2 97
0 1 0 234684 4280 108 1116 14344 12356 14344 12360 617 1034 0 3 96
0 1 0 267880 4312 108 1116 10464 11916 10464 11916 539 790 0 3 97
1 0 0 268704 5192 108 1116 6220 9336 6220 9336 363 474 0 1 99
0 1 0 270764 5312 108 1116 13036 18952 13036 18952 584 958 0 1 99
0 1 0 271368 5088 108 1116 8288 5160 8288 5160 386 576 0 1 99
0 1 1 269184 4296 108 1116 4352 6420 4352 6416 254 314 0 0 100
0 1 0 266528 4604 108 1116 9644 4652 9644 4656 428 658 0 1 99

there is no way I can reproduce any stability problem with mem_load here
(tested both on scsi quad xeon and ide dualathlon). Can you provide more
details of your problem and/or a SYSRQ+T during the hang? thanks.

Andrea

2002-11-25 06:35:35

by Con Kolivas

Subject: Re: [BENCHMARK] 2.4.20-rc2-aa1 with contest

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


>On Sat, Nov 23, 2002 at 09:29:22AM +1100, Con Kolivas wrote:
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA1
>> process_load:
>> Kernel [runs] Time CPU% Loads LCPU% Ratio
>> 2.4.18 [3] 109.5 57 119 44 1.50
>> 2.4.19 [3] 106.5 59 112 43 1.45
>> 2.4.20-rc1 [3] 110.7 58 119 43 1.51
>> 2.4.20-rc1aa1 [3] 110.5 58 117 43 1.51*
>> 2420rc2aa1 [1] 212.5 31 412 69 2.90*
>>
>> This load just copies data between 4 processes repeatedly. Seems to take
>> longer.
>
>Can you go into linux/include/blkdev.h and increase MAX_QUEUE_SECTORS to (2
><< (20 - 9)) and see if it makes any difference here? If it doesn't
>make a difference it could be the slightly increased readahead, but I doubt
>it's the latter.

No significant difference:
2420rc2aa1 212.53 31% 412 69%
2420rc2aa1mqs2 227.72 29% 455 71%

>> xtar_load:
>> Kernel [runs] Time CPU% Loads LCPU% Ratio
>> 2.4.18 [3] 150.8 49 2 8 2.06
>> 2.4.19 [1] 132.4 55 2 9 1.81
>> 2.4.20-rc1 [3] 180.7 40 3 8 2.47
>> 2.4.20-rc1aa1 [3] 166.6 44 2 7 2.28*
>> 2420rc2aa1 [1] 217.7 34 4 9 2.97*
>>
>> Takes longer. Is only one run though so may not be an accurate average.
>
>This is most probably a too-small waitqueue. Of course, increasing the
>waitqueue will also increase the latency a bit for the other workloads; it's
>a tradeoff and there's no way around it. Even read-latency has the tradeoff
>when it chooses the "nth" place to be the seventh slot, where to put the
>read request if it fails insertion.
>
>> io_load:
>> Kernel [runs] Time CPU% Loads LCPU% Ratio
>> 2.4.18 [3] 474.1 15 36 10 6.48
>> 2.4.19 [3] 492.6 14 38 10 6.73
>> 2.4.20-rc1 [2] 1142.2 6 90 10 15.60
>> 2.4.20-rc1aa1 [1] 1132.5 6 90 10 15.47
>> 2420rc2aa1 [1] 164.3 44 10 9 2.24
>>
>> This was where the disk latency hack was expected to have an effect. It
>> sure did.
>
>Yes, I certainly can feel the machine being much more responsive during the
>write load too. Too bad some benchmarks like dbench decreased significantly,
>but I don't see too many ways around it. At least now with those changes the
>contiguous write case is unaffected; my storage test box still reads and
>writes at over 100mbyte/sec, for example. This clearly means that what
>matters is that we have 512k dma commands, not a huge queue size. Still,
>with a loaded machine and potential scheduling delays it could matter more
>to have a larger queue; that may be why the performance decreased for some
>workloads here too, not only because of a less effective elevator. So
>probably 2Mbyte of queue is a much better idea, so at least we can have a
>ring with 4 elements to refill after a completion wakeup; I wanted to be
>strict in the first place to see the "lowlatency" effect at its maximum. We
>could also consider using a /4 instead of my current /2 for the
>batch_sectors initialization.
>
>BTW, at first glance it looks 2.5 has the same problem in the queue
>sizing too.
>
>> read_load:
>> Kernel [runs] Time CPU% Loads LCPU% Ratio
>> 2.4.18 [3] 102.3 70 6 3 1.40
>> 2.4.19 [2] 134.1 54 14 5 1.83
>> 2.4.20-rc1 [3] 173.2 43 20 5 2.37
>> 2.4.20-rc1aa1 [3] 150.6 51 16 5 2.06
>> 2420rc2aa1 [1] 140.5 51 13 4 1.92
>>
>> list_load:
>> Kernel [runs] Time CPU% Loads LCPU% Ratio
>> 2.4.18 [3] 90.2 76 1 17 1.23
>> 2.4.19 [1] 89.8 77 1 20 1.23
>> 2.4.20-rc1 [3] 88.8 77 0 12 1.21
>> 2.4.20-rc1aa1 [1] 88.1 78 1 16 1.20
>> 2420rc2aa1 [1] 99.7 69 1 19 1.36
>>
>> mem_load:
>> Kernel [runs] Time CPU% Loads LCPU% Ratio
>> 2.4.18 [3] 103.3 70 32 3 1.41
>> 2.4.19 [3] 100.0 72 33 3 1.37
>> 2.4.20-rc1 [3] 105.9 69 32 2 1.45
>>
>> Mem load hung the machine. I could not get rc2aa1 through this part of the
>> benchmark no matter how many times I tried to run it. No idea what was
>> going on. Easy to reproduce. Simply run the mem_load out of contest (which
>> runs until it is killed) and the machine will hang.
>
>Sorry, but what is mem_load supposed to do other than loop forever? It has
>been running for two days on my test box (512m of ram, 2G of swap, 4-way
>smp) and nothing has happened yet. It's an infinite loop. It sounds like
>you're trapping a signal. Wouldn't it be simpler to just finish after a
>number of passes? The machine is perfectly usable and responsive during the
>mem_load; xmms doesn't skip a beat, for instance. This is probably thanks to
>the elevator-lowlatency too; I recall xmms didn't use to be completely
>smooth during heavy swapping in previous kernels (because the read() of the
>sound file didn't return in a reasonable time, since I'm swapping on the
>same hd where I store the data).
>
>jupiter:~ # uptime
> 4:20pm up 1 day, 14:43, 3 users, load average: 1.38, 1.28, 1.21
>jupiter:~ # vmstat 1
> procs memory swap io system cpu
> r b w swpd free buff cache si so bi bo in cs us sy id
> 0 1 0 197408 4504 112 1436 21 34 23 34 36 19 0 2 97
> 0 1 0 199984 4768 116 1116 11712 5796 11720 5804 514 851 1 2 97
> 0 1 0 234684 4280 108 1116 14344 12356 14344 12360 617 1034 0 3 96
> 0 1 0 267880 4312 108 1116 10464 11916 10464 11916 539 790 0 3 97
> 1 0 0 268704 5192 108 1116 6220 9336 6220 9336 363 474 0 1 99
> 0 1 0 270764 5312 108 1116 13036 18952 13036 18952 584 958 0 1 99
> 0 1 0 271368 5088 108 1116 8288 5160 8288 5160 386 576 0 1 99
> 0 1 1 269184 4296 108 1116 4352 6420 4352 6416 254 314 0 0 100
> 0 1 0 266528 4604 108 1116 9644 4652 9644 4656 428 658 0 1 99
>
>there is no way I can reproduce any stability problem with mem_load here
>(tested both on scsi quad xeon and ide dualathlon). Can you provide more
>details of your problem and/or a SYSRQ+T during the hang? thanks.

The machine stops responding but sysrq works. It won't write anything to the
logs. To get the error I have to run the mem_load portion of contest, not
just mem_load by itself. The purpose of mem_load is to be just that: a memory
load during the contest benchmark, and contest will kill it when it finishes
testing that load. To reproduce it yourself, run mem_load and then do a
kernel compile with make -j(4xnum_cpus). If that doesn't do it I'm not sure
how else you can see it. Sysrq-T shows too much stuff on screen for me to
make any sense of it, and it scrolls away without me being able to scroll up.

Con
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.0 (GNU/Linux)

iD8DBQE94cbRF6dfvkL3i1gRAvkgAKCOJwQ4hP2E5n1tu1r31MeCz9tULQCdE/lm
hEbMrTEK/u2Sb8INZbVJWpg=
=8YxG
-----END PGP SIGNATURE-----

2002-11-25 06:59:07

by Andrew Morton

Subject: Re: [BENCHMARK] 2.4.20-rc2-aa1 with contest

Con Kolivas wrote:
>
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> >On Sat, Nov 23, 2002 at 09:29:22AM +1100, Con Kolivas wrote:
> >> -----BEGIN PGP SIGNED MESSAGE-----
> >> Hash: SHA1
> >> process_load:
> >> Kernel [runs] Time CPU% Loads LCPU% Ratio
> >> 2.4.18 [3] 109.5 57 119 44 1.50
> >> 2.4.19 [3] 106.5 59 112 43 1.45
> >> 2.4.20-rc1 [3] 110.7 58 119 43 1.51
> >> 2.4.20-rc1aa1 [3] 110.5 58 117 43 1.51*
> >> 2420rc2aa1 [1] 212.5 31 412 69 2.90*
> >>
> >> This load just copies data between 4 processes repeatedly. Seems to take
> >> longer.
> >
> >Can you go into linux/include/blkdev.h and increase MAX_QUEUE_SECTORS to
> >(2 << (20 - 9)) and see if it makes any difference here? If it doesn't
> >make a difference it could be the slightly increased readahead, but I
> >doubt it's the latter.
>
> No significant difference:
> 2420rc2aa1 212.53 31% 412 69%
> 2420rc2aa1mqs2 227.72 29% 455 71%

process_load is a CPU scheduler thing, not a disk scheduler thing. Something
must have changed in kernel/sched.c.

It's debatable whether 210 seconds is worse than 110 seconds in
this test, really. You have four processes madly piping stuff around and
four to eight processes compiling stuff. I don't see why it's "worse"
that the compile happens to get 31% of the CPU time in this kernel. One
would need to decide how much CPU it _should_ get before making that decision.

> ...
>
> The machine stops responding but sysrq works. It won't write anything to the
> logs. To get the error I have to run the mem_load portion of contest, not
> just mem_load by itself. The purpose of mem_load is to be just that: a
> memory load during the contest benchmark, and contest will kill it when it
> finishes testing that load. To reproduce it yourself, run mem_load and then
> do a kernel compile with make -j(4xnum_cpus). If that doesn't do it I'm not
> sure how else you can see it. Sysrq-T shows too much stuff on screen for me
> to make any sense of it, and it scrolls away without me being able to
> scroll up.

Try sysrq-p.

2002-11-25 18:16:51

by Andrea Arcangeli

Subject: Re: [BENCHMARK] 2.4.20-rc2-aa1 with contest

On Mon, Nov 25, 2002 at 05:44:30PM +1100, Con Kolivas wrote:
> will kill it when it finishes testing that load. To reproduce it yourself,
> run mem_load and then do a kernel compile with make -j(4xnum_cpus).

I will try.

> If that doesn't do it I'm not sure how else you can see it. Sysrq-T shows
> too much stuff on screen for me to make any sense of it, and it scrolls
> away without me being able to scroll up.

You can, as usual, use a serial console or netconsole to log the sysrq+T
output.

Andrea

2002-11-25 18:50:24

by Andrea Arcangeli

Subject: Re: [BENCHMARK] 2.4.20-rc2-aa1 with contest

On Sun, Nov 24, 2002 at 11:06:13PM -0800, Andrew Morton wrote:
> Con Kolivas wrote:
> >
> > -----BEGIN PGP SIGNED MESSAGE-----
> > Hash: SHA1
> >
> > >On Sat, Nov 23, 2002 at 09:29:22AM +1100, Con Kolivas wrote:
> > >> -----BEGIN PGP SIGNED MESSAGE-----
> > >> Hash: SHA1
> > >> process_load:
> > >> Kernel [runs] Time CPU% Loads LCPU% Ratio
> > >> 2.4.18 [3] 109.5 57 119 44 1.50
> > >> 2.4.19 [3] 106.5 59 112 43 1.45
> > >> 2.4.20-rc1 [3] 110.7 58 119 43 1.51
> > >> 2.4.20-rc1aa1 [3] 110.5 58 117 43 1.51*
> > >> 2420rc2aa1 [1] 212.5 31 412 69 2.90*
> > >>
> > >> This load just copies data between 4 processes repeatedly. Seems to take
> > >> longer.
> > >
> > >Can you go into linux/include/blkdev.h and increase MAX_QUEUE_SECTORS
> > >to (2 << (20 - 9)) and see if it makes any difference here? If it
> > >doesn't make a difference it could be the slightly increased readahead,
> > >but I doubt it's the latter.
> >
> > No significant difference:
> > 2420rc2aa1 212.53 31% 412 69%
> > 2420rc2aa1mqs2 227.72 29% 455 71%
>
> process_load is a CPU scheduler thing, not a disk scheduler thing. Something
> must have changed in kernel/sched.c.
>
> It's debatable whether 210 seconds is worse than 110 seconds in
> this test, really. You have four processes madly piping stuff around and
> four to eight processes compiling stuff. I don't see why it's "worse"
> that the compile happens to get 31% of the CPU time in this kernel. One
> would need to decide how much CPU it _should_ get before making that decision.

I see, so it's probably one of the core O(1) scheduler design fixes I did in
my tree to avoid losing around 60% of the available cpu power in smp in
critical workloads due to design bugs in the O(1) scheduler (partly reduced,
by a factor of 10, in 2.5 because of HZ=1000, but that's also additional
overhead that shows up in all the userspace cpu-intensive benchmarks posted
to l-k, compared to the right fix that is needed anyway in 2.5 too, since
HZ=1000 only hides the problem partially, and the s390 idle patch won't let
the local smp interrupts run on idle cpus anyway). So this result should be a
good thing, or at any rate it's not interesting for what we're trying to
benchmark here.

>
> > ...
> >
> > The machine stops responding but sysrq works. It won't write anything to
> > the logs. To get the error I have to run the mem_load portion of contest,
> > not just mem_load by itself. The purpose of mem_load is to be just that:
> > a memory load during the contest benchmark, and contest will kill it when
> > it finishes testing that load. To reproduce it yourself, run mem_load and
> > then do a kernel compile with make -j(4xnum_cpus). If that doesn't do it
> > I'm not sure how else you can see it. Sysrq-T shows too much stuff on
> > screen for me to make any sense of it, and it scrolls away without me
> > being able to scroll up.
>
> Try sysrq-p.

Indeed, sysrq+P might be the interesting one; I would have found that out
from the sysrq+T. The problem with sysrq+P is that, with the improved
irq-balance patch in my tree, it will likely dump only one cpu; I should send
an IPI to get a reliable sysrq+P from all cpus at the same time, like I did
in the alpha port some time ago. Of course this is not a problem at all if
his test box is UP.

The main problem with the elevator-lowlatency patch is that it increases
fairness by an order of magnitude, so it can hardly be the fastest kernel on
dbench anymore.

Again, many thanks to Randy for these very useful and accurate benchmarks.

2.4.20-rc1aa1 73.92 75.22 71.79
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2.4.20-rc2-ac1-rmap15-O1 53.09 54.85 51.09
2.4.20-rc2aa1 64.60 65.33 63.98
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2.5.31-mm1-dl-ew 59.55 61.51 57.00
2.5.32-mm1-dl-ew 55.43 57.15 53.13
2.5.32-mm2-dl-ew 54.01 57.38 47.48
2.5.33-mm1-dl-ew 52.02 54.86 46.74
2.5.33-mm5 49.61 53.42 41.31
2.5.40-mm1 70.39 73.85 65.24
2.5.42 67.72 70.50 66.05
2.5.43-mm2 67.32 69.92 65.11
2.5.44-mm5 69.47 71.86 66.14
2.5.44-mm6 69.03 71.66 64.11

You can see that rc2aa1 is slower than rc1aa1, though not by as much as I had
expected; I was expecting something horrible on the order of 30mbyte/sec, so
it's quite a great result IMHO considering the queue was only 1Mbyte, but it
is still noticeable (note that the queue now is 1M even for seeks, not only
for contiguous I/O; previously it was 32M for contiguous I/O, where it's
useless to apply the elevator because the I/O is contiguous in the first
place, and it was something like 256k for seeks). It would be interesting to
see how dbench 192 on reiserfs reacts to this patch applied on top of
2.4.20rc2aa1. 4M is a saner value for the queue size; 1M was too small, but I
wanted to show the lowest latency ever in contest. With this one contest
should still show a very low read latency (and write latency too, unlike
read-latency, if you ever test fsync or O_SYNC/O_DIRECT and not only read
latency), but dbench should run faster. I doubt it's as fast as rc1aa1, but
it could be a good tradeoff.

--- 2.4.20rc2aa1/drivers/block/ll_rw_blk.c.~1~ 2002-11-21 06:06:02.000000000 +0100
+++ 2.4.20rc2aa1/drivers/block/ll_rw_blk.c 2002-11-25 19:45:03.000000000 +0100
@@ -421,7 +421,7 @@ int blk_grow_request_list(request_queue_
}
q->batch_requests = q->nr_requests;
q->max_queue_sectors = max_queue_sectors;
- q->batch_sectors = max_queue_sectors / 2;
+ q->batch_sectors = max_queue_sectors / 4;
BUG_ON(!q->batch_sectors);
atomic_set(&q->nr_sectors, 0);
spin_unlock_irqrestore(q->queue_lock, flags);
--- 2.4.20rc2aa1/include/linux/blkdev.h.~1~ 2002-11-21 06:24:18.000000000 +0100
+++ 2.4.20rc2aa1/include/linux/blkdev.h 2002-11-25 19:44:09.000000000 +0100
@@ -244,7 +244,7 @@ extern char * blkdev_varyio[MAX_BLKDEV];

#define MAX_SEGMENTS 128
#define MAX_SECTORS 255
-#define MAX_QUEUE_SECTORS (1 << (20 - 9)) /* 1 mbytes when full sized */
+#define MAX_QUEUE_SECTORS (4 << (20 - 9)) /* 4 mbytes when full sized */
#define MAX_NR_REQUESTS (MAX_QUEUE_SECTORS >> (10 - 9)) /* 1mbyte queue when all requests are 1k */

#define PageAlignSize(size) (((size) + PAGE_SIZE -1) & PAGE_MASK)

Andrea

2002-11-30 16:10:10

by Andrea Arcangeli

Subject: Re: [BENCHMARK] 2.4.20-rc2-aa1 with contest

On Mon, Nov 25, 2002 at 05:44:30PM +1100, Con Kolivas wrote:
> finishes testing that load. To reproduce it yourself, run mem_load and then do
> a kernel compile with make -j(4xnum_cpus). If that doesn't do it I'm not sure how

JFYI: I can't reproduce it here with a kernel compile and mem_load in
parallel. Did you compile in AGP? There's apparently some known issue with
AGP/DRI.

Andrea