LinuxLists.cc - [BENCHMARK] 2.4.{18,19{-ck9},20rc1{-aa1}} with contest

[permalink] [raw]

Subject: Re: [BENCHMARK] 2.4.{18,19{-ck9},20rc1{-aa1}} with contest

Con Kolivas wrote:
>
> io_load:
> Kernel [runs] Time CPU% Loads LCPU% Ratio
> 2.4.18 [3] 474.1 15 36 10 6.64
> 2.4.19 [3] 492.6 14 38 10 6.90
> 2.4.19-ck9 [2] 140.6 49 5 5 1.97
> 2.4.20-rc1 [2] 1142.2 6 90 10 16.00
> 2.4.20-rc1aa1 [1] 1132.5 6 90 10 15.86
>

2.4.20-pre3 included some elevator changes. I assume they are the
cause of this. Those changes have propagated into Alan's and Andrea's
kernels. Hence they have significantly impacted the responsiveness
of all mainstream 2.4 kernels under heavy writes.

(The -ck patch includes rmap14b which includes the read-latency2 thing)

2002-11-09 03:21:03

[permalink] [raw]

Subject: Re: [BENCHMARK] 2.4.{18,19{-ck9},20rc1{-aa1}} with contest

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

>Con Kolivas wrote:
>> io_load:
>> Kernel [runs] Time CPU% Loads LCPU% Ratio
>> 2.4.18 [3] 474.1 15 36 10 6.64
>> 2.4.19 [3] 492.6 14 38 10 6.90
>> 2.4.19-ck9 [2] 140.6 49 5 5 1.97
>> 2.4.20-rc1 [2] 1142.2 6 90 10 16.00
>> 2.4.20-rc1aa1 [1] 1132.5 6 90 10 15.86
>
>2.4.20-pre3 included some elevator changes. I assume they are the
>cause of this. Those changes have propagated into Alan's and Andrea's
>kernels. Hence they have significantly impacted the responsiveness
>of all mainstream 2.4 kernels under heavy writes.
>
>(The -ck patch includes rmap14b which includes the read-latency2 thing)

Thanks for the explanation. I should have said this was ck with compressed
caching; not rmap.

Con.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.0 (GNU/Linux)

iD8DBQE9zIB8F6dfvkL3i1gRAs6lAJ0f7E9HTlNl5cOaDnmSfw9gi0QLQgCfV3jh
kaG/a1TzlUviOGz5Ci895uA=
=TyH7
-----END PGP SIGNATURE-----

2002-11-09 03:38:04

by Dieter Nützel

[permalink] [raw]

Subject: Re: [BENCHMARK] 2.4.{18,19{-ck9},20rc1{-aa1}} with contest

Andreww Morton wrote:
> Con Kolivas wrote:
> >
> > io_load:
> > Kernel [runs] Time CPU% Loads LCPU% Ratio
> > 2.4.18 [3] 474.1 15 36 10 6.64
> > 2.4.19 [3] 492.6 14 38 10 6.90
> > 2.4.19-ck9 [2] 140.6 49 5 5 1.97
> > 2.4.20-rc1 [2] 1142.2 6 90 10 16.00
> > 2.4.20-rc1aa1 [1] 1132.5 6 90 10 15.86
> >
>
> 2.4.20-pre3 included some elevator changes. I assume they are the
> cause of this. Those changes have propagated into Alan's and Andrea's
> kernels. Hence they have significantly impacted the responsiveness
> of all mainstream 2.4 kernels under heavy writes.
>
> (The -ck patch includes rmap14b which includes the read-latency2 thing)

No, the 2.4.19-ck9 that I have (the default?) include -AA and preemption (!!!)

Preemption is for several months the clear throughput winner for me.
Latest 2.4.19-ck9 and now 2.5.46-mm1.

I know you all "hate" dbench but 2.5.45/2.5.46-mm1 halved (!!!) my "dbench 32"
numbers. deadline IO is so GREAT.

2.4.19-ck5: ~55-60 seconds
2.5.46-mm1: ~31-45 seconds (even under VM pressure)

total used free shared buffers cached
Mem: 1034988 864172 170816 0 231840 345120
-/+ buffers/cache: 287212 747776
Swap: 1028120 8452 1019668
Total: 2063108 872624 1190484

Throughput 110.61 MB/sec (NB=138.263 MB/sec 1106.1 MBit/sec)
7.941u 38.251s 0:39.20 117.8% 0+0k 0+0io 841pf+0w

Sorry, "free -t" forgotten.

Throughput 114.462 MB/sec (NB=143.077 MB/sec 1144.62 MBit/sec)
7.986u 35.900s 0:37.90 115.7% 0+0k 0+0io 841pf+0w

total used free shared buffers cached
Mem: 1034988 481812 553176 0 178788 54048
-/+ buffers/cache: 248976 786012
Swap: 1028120 9836 1018284
Total: 2063108 491648 1571460

Throughput 112.283 MB/sec (NB=140.354 MB/sec 1122.83 MBit/sec)
7.728u 37.358s 0:38.62 116.7% 0+0k 0+0io 841pf+0w

total used free shared buffers cached
Mem: 1034988 461736 573252 0 163260 51488
-/+ buffers/cache: 246988 788000
Swap: 1028120 9976 1018144
Total: 2063108 471712 1591396

Only one MP3 playback hiccup during "dbench 32" and nearly no slowdown of
dbench.

2.5.45+ need some more memory during my normal workload and do little more
swap than 2.4.19+AA.

MemTotal: 1034988 kB
MemFree: 559784 kB
MemShared: 0 kB
Buffers: 164260 kB
Cached: 63308 kB
SwapCached: 2884 kB
Active: 399388 kB
Inactive: 10096 kB
HighTotal: 131008 kB
HighFree: 46508 kB
LowTotal: 903980 kB
LowFree: 513276 kB
SwapTotal: 1028120 kB
SwapFree: 1018156 kB
Dirty: 44 kB
Writeback: 0 kB
Mapped: 220700 kB
Slab: 36904 kB
Committed_AS: 530908 kB
PageTables: 3436 kB
ReverseMaps: 125959
HugePages_Total: 0
HugePages_Free: 0
Hugepagesize: 4096 kB

slabinfo - version: 1.2
fib6_nodes 7 112 32 1 1 1 : 248 124
ip6_dst_cache 9 20 192 1 1 1 : 248 124
ndisc_cache 1 30 128 1 1 1 : 248 124
raw6_sock 0 0 576 0 0 1 : 120 60
udp6_sock 1 7 576 1 1 1 : 120 60
tcp6_sock 5 8 1024 2 2 1 : 120 60
ip_conntrack 8 60 320 5 5 1 : 120 60
unix_sock 192 261 448 29 29 1 : 120 60
tcp_tw_bucket 0 0 128 0 0 1 : 248 124
tcp_bind_bucket 13 112 32 1 1 1 : 248 124
tcp_open_request 0 0 128 0 0 1 : 248 124
inet_peer_cache 2 59 64 1 1 1 : 248 124
secpath_cache 0 0 32 0 0 1 : 248 124
flow_cache 0 0 64 0 0 1 : 248 124
xfrm4_dst_cache 0 0 192 0 0 1 : 248 124
ip_fib_hash 15 112 32 1 1 1 : 248 124
ip_dst_cache 25 100 192 5 5 1 : 248 124
arp_cache 3 60 128 2 2 1 : 248 124
raw4_sock 0 0 448 0 0 1 : 120 60
udp_sock 7 18 448 2 2 1 : 120 60
tcp_sock 24 40 896 10 10 1 : 120 60
sgpool-MAX_PHYS_SEGMENTS 32 33 2560 11 11 2 : 54 27
sgpool-64 32 33 1280 11 11 1 : 54 27
sgpool-32 32 36 640 6 6 1 : 120 60
sgpool-16 32 36 320 3 3 1 : 120 60
sgpool-8 36 40 192 2 2 1 : 248 124
reiser_inode_cache 3900 19320 384 1932 1932 1 : 120 60
eventpoll 0 0 96 0 0 1 : 248 124
kioctx 0 0 192 0 0 1 : 248 124
kiocb 0 0 192 0 0 1 : 248 124
dnotify_cache 0 0 20 0 0 1 : 248 124
file_lock_cache 104 160 96 4 4 1 : 248 124
fasync_cache 2 202 16 1 1 1 : 248 124
shmem_inode_cache 12 27 448 3 3 1 : 120 60
uid_cache 5 112 32 1 1 1 : 248 124
deadline_drq 1792 1792 32 16 16 1 : 248 124
blkdev_requests 1280 1320 192 66 66 1 : 248 124
biovec-BIO_MAX_PAGES 256 260 3072 52 52 4 : 54 27
biovec-128 256 260 1536 52 52 2 : 54 27
biovec-64 256 260 768 52 52 1 : 120 60
biovec-16 256 260 192 13 13 1 : 248 124
biovec-4 256 295 64 5 5 1 : 248 124
biovec-1 325 404 16 2 2 1 : 248 124
bio 272 295 64 5 5 1 : 248 124
sock_inode_cache 237 330 384 33 33 1 : 120 60
skbuff_head_cache 897 980 192 49 49 1 : 248 124
sock 7 10 384 1 1 1 : 120 60
proc_inode_cache 117 696 320 58 58 1 : 120 60
sigqueue 87 87 132 3 3 1 : 248 124
radix_tree_node 4560 11340 320 945 945 1 : 120 60
cdev_cache 24 177 64 3 3 1 : 248 124
bdev_cache 15 30 128 1 1 1 : 248 124
mnt_cache 24 59 64 1 1 1 : 248 124
inode_cache 548 588 320 49 49 1 : 120 60
dentry_cache 7302 36560 192 1828 1828 1 : 248 124
filp 2512 2550 128 85 85 1 : 248 124
names_cache 6 6 4096 6 6 1 : 54 27
buffer_head 56609 158616 52 2203 2203 1 : 248 124
mm_struct 90 110 384 11 11 1 : 120 60
vm_area_struct 5357 6300 128 210 210 1 : 248 124
fs_cache 90 295 64 5 5 1 : 248 124
files_cache 90 99 448 11 11 1 : 120 60
signal_act 99 99 1344 33 33 1 : 54 27
task_struct 133 145 1600 29 29 2 : 54 27
pte_chain 19930 28851 64 489 489 1 : 248 124
mm_chain 0 0 8 0 0 1 : 248 124
size-131072(DMA) 0 0 131072 0 0 32 : 8 4
size-131072 0 0 131072 0 0 32 : 8 4
size-65536(DMA) 0 0 65536 0 0 16 : 8 4
size-65536 0 0 65536 0 0 16 : 8 4
size-32768(DMA) 0 0 32768 0 0 8 : 8 4
size-32768 1 1 32768 1 1 8 : 8 4
size-16384(DMA) 0 0 16384 0 0 4 : 8 4
size-16384 11 15 16384 11 15 4 : 8 4
size-8192(DMA) 0 0 8192 0 0 2 : 8 4
size-8192 5 9 8192 5 9 2 : 8 4
size-4096(DMA) 0 0 4096 0 0 1 : 54 27
size-4096 198 212 4096 198 212 1 : 54 27
size-2048(DMA) 0 0 2048 0 0 1 : 54 27
size-2048 190 206 2048 99 103 1 : 54 27
size-1024(DMA) 0 0 1024 0 0 1 : 120 60
size-1024 268 268 1024 67 67 1 : 120 60
size-512(DMA) 0 0 512 0 0 1 : 120 60
size-512 512 512 512 64 64 1 : 120 60
size-256(DMA) 0 0 256 0 0 1 : 248 124
size-256 360 360 256 24 24 1 : 248 124
size-192(DMA) 0 0 192 0 0 1 : 248 124
size-192 54 60 192 3 3 1 : 248 124
size-128(DMA) 0 0 128 0 0 1 : 248 124
size-128 923 1050 128 35 35 1 : 248 124
size-64(DMA) 0 0 64 0 0 1 : 248 124
size-64 1851 2124 64 36 36 1 : 248 124
size-32(DMA) 0 0 64 0 0 1 : 248 124
size-32 1891 2065 64 35 35 1 : 248 124
kmem_cache 112 128 120 4 4 1 : 248 124

GREAT work!

Regards,
Dieter
--
Dieter N?tzel
Graduate Student, Computer Science

University of Hamburg
Department of Computer Science
@home: Dieter.Nuetzel at hamburg.de (replace at with @)

2002-11-09 03:49:07

[permalink] [raw]

Subject: Re: [BENCHMARK] 2.4.{18,19{-ck9},20rc1{-aa1}} with contest

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

>Andreww Morton wrote:
>> Con Kolivas wrote:
>> > io_load:
>> > Kernel [runs] Time CPU% Loads LCPU% Ratio
>> > 2.4.18 [3] 474.1 15 36 10 6.64
>> > 2.4.19 [3] 492.6 14 38 10 6.90
>> > 2.4.19-ck9 [2] 140.6 49 5 5 1.97
>> > 2.4.20-rc1 [2] 1142.2 6 90 10 16.00
>> > 2.4.20-rc1aa1 [1] 1132.5 6 90 10 15.86
>>
>> 2.4.20-pre3 included some elevator changes. I assume they are the
>> cause of this. Those changes have propagated into Alan's and Andrea's
>> kernels. Hence they have significantly impacted the responsiveness
>> of all mainstream 2.4 kernels under heavy writes.
>>
>> (The -ck patch includes rmap14b which includes the read-latency2 thing)
>
>No, the 2.4.19-ck9 that I have (the default?) include -AA and preemption
> (!!!)

Err I made the ck patchset so I think I should know. ck9 came only as one
patch which included O(1),Low Latency, Preempt, Compressed Caching,
Supermount, ALSA and XFS. CK10-13 on the otherhand had optional Compressed
Caching OR AA OR Rmap. By default since they are 2.4 kernels they all include
the vanilla aa vm, but the ck trunk with AA has the extra AA vm addons only
available in the -AA kernel set. If you disabled compressed caching in ck9
you got only the vanilla 2.4.19 vm.

Con
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.0 (GNU/Linux)

iD8DBQE9zIcRF6dfvkL3i1gRAoEmAJ9DxKp9y+Jx11G+k+rcaMYKrVsM5gCgn5NH
nMwKh/nfafNt5kMvLpm+Bsg=
=YwE8
-----END PGP SIGNATURE-----

2002-11-09 03:56:05

by Dieter Nützel

[permalink] [raw]

Subject: Re: [BENCHMARK] 2.4.{18,19{-ck9},20rc1{-aa1}} with contest

Am Samstag, 9. November 2002 04:54 schrieb Con Kolivas:
> >Andreww Morton wrote:
> >> Con Kolivas wrote:
> >> > io_load:
> >> > Kernel [runs] Time CPU% Loads LCPU% Ratio
> >> > 2.4.18 [3] 474.1 15 36 10 6.64
> >> > 2.4.19 [3] 492.6 14 38 10 6.90
> >> > 2.4.19-ck9 [2] 140.6 49 5 5 1.97
> >> > 2.4.20-rc1 [2] 1142.2 6 90 10 16.00
> >> > 2.4.20-rc1aa1 [1] 1132.5 6 90 10 15.86
> >>
> >> 2.4.20-pre3 included some elevator changes. I assume they are the
> >> cause of this. Those changes have propagated into Alan's and Andrea's
> >> kernels. Hence they have significantly impacted the responsiveness
> >> of all mainstream 2.4 kernels under heavy writes.
> >>
> >> (The -ck patch includes rmap14b which includes the read-latency2 thing)
> >
> >No, the 2.4.19-ck9 that I have (the default?) include -AA and preemption
> > (!!!)
>
> Err I made the ck patchset so I think I should know. ck9 came only as one
> patch which included O(1),Low Latency, Preempt, Compressed Caching,
> Supermount, ALSA and XFS. CK10-13 on the otherhand had optional Compressed
> Caching OR AA OR Rmap. By default since they are 2.4 kernels they all
> include the vanilla aa vm, but the ck trunk with AA has the extra AA vm
> addons only available in the -AA kernel set. If you disabled compressed
> caching in ck9 you got only the vanilla 2.4.19 vm.

Then I mixed it up with 2.4.19-llck5 -AA.
To much versions... Sorry!

-Dieter

2002-11-09 04:08:36

[permalink] [raw]

Subject: Re: [BENCHMARK] 2.4.{18,19{-ck9},20rc1{-aa1}} with contest

Con Kolivas wrote:
>
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> >Con Kolivas wrote:
> >> io_load:
> >> Kernel [runs] Time CPU% Loads LCPU% Ratio
> >> 2.4.18 [3] 474.1 15 36 10 6.64
> >> 2.4.19 [3] 492.6 14 38 10 6.90
> >> 2.4.19-ck9 [2] 140.6 49 5 5 1.97
> >> 2.4.20-rc1 [2] 1142.2 6 90 10 16.00
> >> 2.4.20-rc1aa1 [1] 1132.5 6 90 10 15.86
> >
> >2.4.20-pre3 included some elevator changes. I assume they are the
> >cause of this. Those changes have propagated into Alan's and Andrea's
> >kernels. Hence they have significantly impacted the responsiveness
> >of all mainstream 2.4 kernels under heavy writes.
> >
> >(The -ck patch includes rmap14b which includes the read-latency2 thing)
>
> Thanks for the explanation. I should have said this was ck with compressed
> caching; not rmap.
>

hrm. In that case I'll shut up with the speculating.

You're showing a big shift in behaviour between 2.4.19 and 2.4.20-rc1.
Maybe it doesn't translate to worsened interactivity. Needs more
testing and anaysis.

2002-11-09 05:06:16

[permalink] [raw]

Subject: Re: [BENCHMARK] 2.4.{18,19{-ck9},20rc1{-aa1}} with contest

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

>hrm. In that case I'll shut up with the speculating.

Please dont stop speculating. I and many others rely on someone like yourself
who is more likely to understand what is going on to comment. I can't expect
you to know exactly what goes into every patchset out there. Your input has
been invaluable and most of the drive for my benchmarking.

>You're showing a big shift in behaviour between 2.4.19 and 2.4.20-rc1.
>Maybe it doesn't translate to worsened interactivity. Needs more
>testing and anaysis.

Sounds fair enough. My resources are exhausted though. Someone else have any
thoughts?

Con
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.0 (GNU/Linux)

iD8DBQE9zJknF6dfvkL3i1gRAnHLAKCRuTqBfxqX582puVwQ/hBb0T0R1QCePyws
0N9uKoKVY/M22gses+MkEnE=
=UvJP
-----END PGP SIGNATURE-----

2002-11-09 11:13:55

[permalink] [raw]

Subject: Re: [BENCHMARK] 2.4.{18,19{-ck9},20rc1{-aa1}} with contest

On Fri, Nov 08 2002, Andrew Morton wrote:
> Con Kolivas wrote:
> >
> > -----BEGIN PGP SIGNED MESSAGE-----
> > Hash: SHA1
> >
> > >Con Kolivas wrote:
> > >> io_load:
> > >> Kernel [runs] Time CPU% Loads LCPU% Ratio
> > >> 2.4.18 [3] 474.1 15 36 10 6.64
> > >> 2.4.19 [3] 492.6 14 38 10 6.90
> > >> 2.4.19-ck9 [2] 140.6 49 5 5 1.97
> > >> 2.4.20-rc1 [2] 1142.2 6 90 10 16.00
> > >> 2.4.20-rc1aa1 [1] 1132.5 6 90 10 15.86
> > >
> > >2.4.20-pre3 included some elevator changes. I assume they are the
> > >cause of this. Those changes have propagated into Alan's and Andrea's
> > >kernels. Hence they have significantly impacted the responsiveness
> > >of all mainstream 2.4 kernels under heavy writes.
> > >
> > >(The -ck patch includes rmap14b which includes the read-latency2 thing)
> >
> > Thanks for the explanation. I should have said this was ck with compressed
> > caching; not rmap.
> >
>
> hrm. In that case I'll shut up with the speculating.
>
> You're showing a big shift in behaviour between 2.4.19 and 2.4.20-rc1.
> Maybe it doesn't translate to worsened interactivity. Needs more
> testing and anaysis.

The merging and seek accounting in 2.4.19 is completely off, it doesn't
make any sense. 2.4.20-rc1 should be sanely tweakable.

--
Jens Axboe

2002-11-09 11:15:05

[permalink] [raw]

Subject: Re: [BENCHMARK] 2.4.{18,19{-ck9},20rc1{-aa1}} with contest

On Sat, Nov 09 2002, Con Kolivas wrote:
> >You're showing a big shift in behaviour between 2.4.19 and 2.4.20-rc1.
> >Maybe it doesn't translate to worsened interactivity. Needs more
> >testing and anaysis.
>
> Sounds fair enough. My resources are exhausted though. Someone else have any
> thoughts?

Try setting lower elevator passover values. Something ala

# elvtune -r 64 /dev/hda

(or whatever your drive is)

--
Jens Axboe

2002-11-09 13:04:00

[permalink] [raw]

Subject: Re: [BENCHMARK] 2.4.{18,19{-ck9},20rc1{-aa1}} with contest

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

>On Sat, Nov 09 2002, Con Kolivas wrote:
>> >You're showing a big shift in behaviour between 2.4.19 and 2.4.20-rc1.
>> >Maybe it doesn't translate to worsened interactivity. Needs more
>> >testing and anaysis.
>>
>> Sounds fair enough. My resources are exhausted though. Someone else have
>> any thoughts?
>
>Try setting lower elevator passover values. Something ala
>
># elvtune -r 64 /dev/hda
>
>(or whatever your drive is)

Heres some more data:

io_load:
Kernel [runs] Time CPU% Loads LCPU% Ratio
2.4.20-rc1 [2] 1142.2 6 90 10 16.00
2420rc1r64 [3] 575.0 12 43 10 8.05

That's it then. Should I run a family of different values and if so over what
range?

Cheers,
Con
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.0 (GNU/Linux)

iD8DBQE9zQkXF6dfvkL3i1gRAggJAKCOAWzrTxFlnPbOftzMAXPnvI7KVQCfWqUC
iDVmD1UcPDNPWCfQmlBF9yk=
=Q299
-----END PGP SIGNATURE-----

2002-11-09 13:33:39

by Steve Lord

[permalink] [raw]

Subject: Re: [BENCHMARK] 2.4.{18,19{-ck9},20rc1{-aa1}} with contest

On Sat, 2002-11-09 at 07:09, Con Kolivas wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
>
> >On Sat, Nov 09 2002, Con Kolivas wrote:
> >> >You're showing a big shift in behaviour between 2.4.19 and 2.4.20-rc1.
> >> >Maybe it doesn't translate to worsened interactivity. Needs more
> >> >testing and anaysis.
> >>
> >> Sounds fair enough. My resources are exhausted though. Someone else have
> >> any thoughts?
> >
> >Try setting lower elevator passover values. Something ala
> >
> ># elvtune -r 64 /dev/hda
> >
> >(or whatever your drive is)
>
> Heres some more data:
>
> io_load:
> Kernel [runs] Time CPU% Loads LCPU% Ratio
> 2.4.20-rc1 [2] 1142.2 6 90 10 16.00
> 2420rc1r64 [3] 575.0 12 43 10 8.05
>
> That's it then. Should I run a family of different values and if so over what
> range?
>

There is more going on than this, XFS suffered a major slowdown in some
metadata write only benchmarks - the file create/delete phase of
bonnie++. Now thats a single app only doing writes. Slowdown on the
order of 500% to 600%. Since we did not follow the pre kernels in
2.4.20 we do not really know when it was introduced and there is
a possibility XFS itself has not followed some api change.

Steve

2002-11-09 13:48:20

[permalink] [raw]

Subject: Re: [BENCHMARK] 2.4.{18,19{-ck9},20rc1{-aa1}} with contest

On Sun, Nov 10 2002, Con Kolivas wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
>
> >On Sat, Nov 09 2002, Con Kolivas wrote:
> >> >You're showing a big shift in behaviour between 2.4.19 and 2.4.20-rc1.
> >> >Maybe it doesn't translate to worsened interactivity. Needs more
> >> >testing and anaysis.
> >>
> >> Sounds fair enough. My resources are exhausted though. Someone else have
> >> any thoughts?
> >
> >Try setting lower elevator passover values. Something ala
> >
> ># elvtune -r 64 /dev/hda
> >
> >(or whatever your drive is)
>
> Heres some more data:
>
> io_load:
> Kernel [runs] Time CPU% Loads LCPU% Ratio
> 2.4.20-rc1 [2] 1142.2 6 90 10 16.00
> 2420rc1r64 [3] 575.0 12 43 10 8.05
>
> That's it then. Should I run a family of different values and if so
> over what range?

The default is 2048. How long does the io_load test take, or rather how
many tests are appropriate to do? To get a good picture of how it looks
you should probably try: 0, 8, 16, 64, 128, 512. Once you get some of
these results, it will be easier to determine which area(s) would be
most interesting to further explore.

There's also the write passover, I don't think it will have much impact
on this test though.

--
Jens Axboe

2002-11-09 21:07:48

by Diego Calleja

[permalink] [raw]

Subject: Re: [BENCHMARK] 2.4.{18,19{-ck9},20rc1{-aa1}} with contest

On Sat, 9 Nov 2002 14:54:46 +0100
Jens Axboe <[email protected]> wrote:

> The default is 2048. How long does the io_load test take, or rather how

then, shouldn't the default be changed?. There's a big performance drop (/2)
(in that case of course)

Diego Calleja

2002-11-09 21:48:07

[permalink] [raw]

Subject: Re: [BENCHMARK] 2.4.{18,19{-ck9},20rc1{-aa1}} with contest

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

>On Sun, Nov 10 2002, Con Kolivas wrote:
>> >On Sat, Nov 09 2002, Con Kolivas wrote:
>> >> >You're showing a big shift in behaviour between 2.4.19 and 2.4.20-rc1.
>> >> >Maybe it doesn't translate to worsened interactivity. Needs more
>> >> >testing and anaysis.
>> >>
>> >> Sounds fair enough. My resources are exhausted though. Someone else
>> >> have any thoughts?
>> >
>> >Try setting lower elevator passover values. Something ala
>> >
>> ># elvtune -r 64 /dev/hda
>> >
>> >(or whatever your drive is)
>>

>> That's it then. Should I run a family of different values and if so
>> over what range?
>
>The default is 2048. How long does the io_load test take, or rather how
>many tests are appropriate to do? To get a good picture of how it looks
>you should probably try: 0, 8, 16, 64, 128, 512. Once you get some of
>these results, it will be easier to determine which area(s) would be
>most interesting to further explore.

The io_load test takes as long as the time in seconds shown on the table. At
least 3 tests are appropriate to get a reasonable average [runs is in square
parentheses]. Therefore it takes about half an hour per run. Luckily I had
the benefit of a night to set up a whole lot of runs:

io_load:
Kernel [runs] Time CPU% Loads LCPU% Ratio
2420rc1r0 [3] 489.3 15 36 10 6.85
2420rc1r8 [3] 485.5 15 35 10 6.80
2420rc1r16 [3] 570.4 12 43 10 7.99
2420rc1r32 [3] 570.1 12 42 10 7.98
2420rc1r64 [3] 575.0 12 43 10 8.05
2420rc1r128 [3] 611.4 11 46 10 8.56
2420rc1r256 [3] 646.2 11 49 10 9.05
2420rc1r512 [3] 603.7 12 45 10 8.46
2420rc1r1024 [3] 693.9 10 53 10 9.72
2.4.20-rc1 [2] 1142.2 6 90 10 16.00

Test hardware is 1133Mhz P3 laptop with 5400rpm ATA100 drive. I don't doubt
the response curve would be different for other hardware.

Con
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.0 (GNU/Linux)

iD8DBQE9zYPmF6dfvkL3i1gRAlgQAJ9wbCJUc6OesGsuR+S2YHi2+zzRuACePEPJ
MIVeNptM2zdnvEFPZXCWMO8=
=7M4k
-----END PGP SIGNATURE-----

2002-11-10 02:19:57

[permalink] [raw]

Subject: Re: [BENCHMARK] 2.4.{18,19{-ck9},20rc1{-aa1}} with contest

On Sat, Nov 09, 2002 at 10:12:06PM +0100, Arador wrote:
> On Sat, 9 Nov 2002 14:54:46 +0100
> Jens Axboe <[email protected]> wrote:
>
> > The default is 2048. How long does the io_load test take, or rather how
>
> then, shouldn't the default be changed?. There's a big performance drop (/2)
> (in that case of course)

depends what side you are benchmarking, not always more throughput means
less interactivity, but at some point (when the more throughput can't
payoff for the reordering anymore) it does.

You should definitely benchmark 2.4.19-ck9 and 2.4.20rc1aa2 with dbench
too. Those numbers as is doesn't show the whole picture.

Andrea

2002-11-10 02:38:15

[permalink] [raw]

Subject: Re: [BENCHMARK] 2.4.{18,19{-ck9},20rc1{-aa1}} with contest

On Sat, Nov 09, 2002 at 01:00:19PM +1100, Con Kolivas wrote:
> xtar_load:
> Kernel [runs] Time CPU% Loads LCPU% Ratio
> 2.4.18 [3] 150.8 49 2 8 2.11
> 2.4.19 [1] 132.4 55 2 9 1.85
> 2.4.19-ck9 [2] 138.6 58 2 11 1.94
> 2.4.20-rc1 [3] 180.7 40 3 8 2.53
> 2.4.20-rc1aa1 [3] 166.6 44 2 7 2.33

these numbers doesn't make sense. Can you describe what xtar_load is
doing?

> First noticeable difference. With repeated extracting of tars while compiling
> kernels 2.4.20-rc1 seems to be slower and aa1 curbs it just a little.
>
> io_load:
> Kernel [runs] Time CPU% Loads LCPU% Ratio
> 2.4.18 [3] 474.1 15 36 10 6.64
> 2.4.19 [3] 492.6 14 38 10 6.90
> 2.4.19-ck9 [2] 140.6 49 5 5 1.97
> 2.4.20-rc1 [2] 1142.2 6 90 10 16.00
> 2.4.20-rc1aa1 [1] 1132.5 6 90 10 15.86

What are you benchmarking, tar or the kernel compile? I think the
latter. That's the elevator and the size of the I/O queue here. Nothing
else. hacks like read-latency aren't very nice in particular with
async-io aware apps. If this improvement in ck9 was achieved decreasing
the queue size it'll be interesting to see how much the sequential I/O
is slowed down, it's very possible we've too big queues for some device.

> Well this is interesting. 2.4.20-rc1 seems to have improved it's ability to do
> IO work. Unfortunately it is now busy starving the scheduler in the mean
> time, much like the 2.5 kernels did before the deadline scheduler was put in.
>
> read_load:
> Kernel [runs] Time CPU% Loads LCPU% Ratio
> 2.4.18 [3] 102.3 70 6 3 1.43
> 2.4.19 [2] 134.1 54 14 5 1.88
> 2.4.19-ck9 [2] 77.4 85 11 9 1.08
> 2.4.20-rc1 [3] 173.2 43 20 5 2.43
> 2.4.20-rc1aa1 [3] 150.6 51 16 5 2.11

What is busy starving the scheduler? This sounds like it's again just an
evelator benchmark. I don't buy your scheduler claims, give more
explanations or it'll take it as vapourware wording, I very much doubt
you can find any single problem in the scheduler rc1aa2 or that the
scheduler in rc1aa1 has a chance to run slower than the one of 2.4.19 in
a I/O benchmark, ok it still misses the numa algorithm, but that's not a
bug, just a missing feature and it'll soon be fixed too and it doesn't
matter for normal smp non-numa machines out there.

> Also a noticeable difference, repeatedly reading a large file while trying to
> compile a kernel has slowed down in 2.4.20-rc1 and aa1 blunts this effect
> somewhat.
>
> list_load:
> Kernel [runs] Time CPU% Loads LCPU% Ratio
> 2.4.18 [3] 90.2 76 1 17 1.26
> 2.4.19 [1] 89.8 77 1 20 1.26
> 2.4.19-ck9 [2] 85.2 79 1 22 1.19
> 2.4.20-rc1 [3] 88.8 77 0 12 1.24
> 2.4.20-rc1aa1 [1] 88.1 78 1 16 1.23
>
> mem_load:
> Kernel [runs] Time CPU% Loads LCPU% Ratio
> 2.4.18 [3] 103.3 70 32 3 1.45
> 2.4.19 [3] 100.0 72 33 3 1.40
> 2.4.19-ck9 [2] 78.3 88 31 8 1.10
> 2.4.20-rc1 [3] 105.9 69 32 2 1.48
> 2.4.20-rc1aa1 [1] 106.3 69 33 3 1.49

again ck9 is faster because of elevator hacks ala read-latency.

in short your whole benchmark seems all about interacitivy of reads
during write flood. That's the read-latency thing or whatever else you
could do to ll_rw_block.c.

In short if somebody runs fast in something like this:

cp /dev/zero . & time cp bigfile /dev/null

he will win your whole contest too.

please show the difff between
2.4.19-ck9/drivers/block/{ll_rw_blk,elevator}.c and
2.4.19/drivers/block/...

All the difference is there and it will hurt you badly if you do
async-io benchmarks, and possibly dbench too. So you should always
accompain your benchmark with async-io simultanous read/write bandwitdth
and dbench, or I could always win your contest by shipping a very bad
kernel. Either that or change the name of your project, if somebody wins
this context that's probably a bad I/O scheduler in many other aspects,
some of the reason I didn't merge read-latency from Andrew.

Andrea

2002-11-10 03:49:37

by Matt Reppert

[permalink] [raw]

Subject: Re: [BENCHMARK] 2.4.{18,19{-ck9},20rc1{-aa1}} with contest

Purely for information's sake ...

On Sun, 10 Nov 2002 03:44:51 +0100
Andrea Arcangeli <[email protected]> wrote:

> On Sat, Nov 09, 2002 at 01:00:19PM +1100, Con Kolivas wrote:
> > xtar_load:
> > Kernel [runs] Time CPU% Loads LCPU% Ratio
> > 2.4.18 [3] 150.8 49 2 8 2.11
> > 2.4.19 [1] 132.4 55 2 9 1.85
> > 2.4.19-ck9 [2] 138.6 58 2 11 1.94
> > 2.4.20-rc1 [3] 180.7 40 3 8 2.53
> > 2.4.20-rc1aa1 [3] 166.6 44 2 7 2.33
>
> these numbers doesn't make sense. Can you describe what xtar_load is
> doing?

Repeatedly extracting tars while compiling kernels.

Andrea, I think you mixed up what the descriptions go to. They come *under*
the numbers, not above, commenting on only the test directly above them.
(eg, "First noticeable difference" is about "xtar_load")

Yes, these are kind of meaningless without descriptions. You can find those
at the webpage, http://contest.kolivas.net/ ... This will make more sense with
that, of course how meaningful it is is always up to debate :)

All of these benchmark the kernel compile while doing something else in the
background.

> In short if somebody runs fast in something like this:
>
> cp /dev/zero . & time cp bigfile /dev/null
>
> he will win your whole contest too.

That's practically one of the loads, actually.

"IO Load - copies /dev/zero continually to a file the size of
the physical memory."

Which dds blocks the size of MemTotal in /proc/meminfo to a file
in /tmp in a shell script as long as the kernel compile is running.

> please show the difff between
> 2.4.19-ck9/drivers/block/{ll_rw_blk,elevator}.c and
> 2.4.19/drivers/block/...

elevator.c is untouched, ll_rw_blk.c follows. The full patch is here:
http://members.optusnet.com.au/con.man/ck9_2.4.19.patch.bz2

diff -bBdaurN linux-2.4.19/drivers/block/ll_rw_blk.c linux-2.4.19-ck9/drivers/bl
ock/ll_rw_blk.c
--- linux-2.4.19/drivers/block/ll_rw_blk.c 2002-08-03 13:14:45.000000000 +1
000
+++ linux-2.4.19-ck9/drivers/block/ll_rw_blk.c 2002-10-14 17:21:18.000000000 +1
000
@@ -1112,6 +1112,9 @@
if (!test_bit(BH_Lock, &bh->b_state))
BUG();

+ if (buffer_delay(bh) || !buffer_mapped(bh))
+ BUG();
+
set_bit(BH_Req, &bh->b_state);
set_bit(BH_Launder, &bh->b_state);

@@ -1132,6 +1135,7 @@
kstat.pgpgin += count;
break;
}
+ conditional_schedule();
}

/**
@@ -1270,7 +1274,8 @@

req->errors = 0;
if (!uptodate)
- printk("end_request: I/O error, dev %s (%s), sector %lu\n",
+ printk(KERN_INFO "end_request: I/O error, dev %s (%s),"
+ " sector %lu\n",
kdevname(req->rq_dev), name, req->sector);

if ((bh = req->bh) != NULL) {

.
> Either that or change the name of your project,

It's called "contest" because it's a reasonably arbitrary test of what
the kernel does under some circumstances that's put out by Con Kolivas.
Con's test. Contest. It's not supposed to actually mean anything.

Matt

2002-11-10 09:52:47

[permalink] [raw]

Subject: Re: [BENCHMARK] 2.4.{18,19{-ck9},20rc1{-aa1}} with contest

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

First some explanation.

Contest (http://contest.kolivas.net) is obviously not a throughput style
benchmark. The benchmark simply uses userland loads known to slow down the
machine (like writing large files) and sees how much longer kernel
compilation takes (make -j4 bzImage on uniprocessor). Thus it never claims to
be any sort of comprehensive system benchmark; it only serves to give an idea
of the systems ability to respond in the presence of different loads, in
terms end users can understand.

>On Sat, Nov 09, 2002 at 01:00:19PM +1100, Con Kolivas wrote:
>> xtar_load:
>> Kernel [runs] Time CPU% Loads LCPU% Ratio
>> 2.4.18 [3] 150.8 49 2 8 2.11
>> 2.4.19 [1] 132.4 55 2 9 1.85
>> 2.4.19-ck9 [2] 138.6 58 2 11 1.94
>> 2.4.20-rc1 [3] 180.7 40 3 8 2.53
>> 2.4.20-rc1aa1 [3] 166.6 44 2 7 2.33
>
>these numbers doesn't make sense. Can you describe what xtar_load is
>doing?

Ok xtar_load starts extracting a large tar (a kernel tree) in the background
then tries to compile a kernel. The time is how long kernel compilation takes
and cpu% is how much cpu% make -j4 bzImage uses. Loads is how many times it
successfully extracts the tar and LCPU% is the cpu% returned by the "tar x
linux.tar" command. Ratio is the ratio of this kernel compilation time to the
reference (2.4.18 with no load).

>> First noticeable difference. With repeated extracting of tars while
>> compiling kernels 2.4.20-rc1 seems to be slower and aa1 curbs it just a
>> little.

This explanation said simply that kernel compilation with the same tar
extracting load takes longer in 2.4.20-rc1 compared with 2.4.19, but that the
aa addons sped it up a bit.

>> io_load:
>> Kernel [runs] Time CPU% Loads LCPU% Ratio
>> 2.4.18 [3] 474.1 15 36 10 6.64
>> 2.4.19 [3] 492.6 14 38 10 6.90
>> 2.4.19-ck9 [2] 140.6 49 5 5 1.97
>> 2.4.20-rc1 [2] 1142.2 6 90 10 16.00
>> 2.4.20-rc1aa1 [1] 1132.5 6 90 10 15.86
>
>What are you benchmarking, tar or the kernel compile? I think the
>latter. That's the elevator and the size of the I/O queue here. Nothing
>else. hacks like read-latency aren't very nice in particular with
>async-io aware apps. If this improvement in ck9 was achieved decreasing
>the queue size it'll be interesting to see how much the sequential I/O
>is slowed down, it's very possible we've too big queues for some device.
>
>> Well this is interesting. 2.4.20-rc1 seems to have improved it's ability
>> to do IO work. Unfortunately it is now busy starving the scheduler in the
>> mean time, much like the 2.5 kernels did before the deadline scheduler was
>> put in.
>>
>> read_load:
>> Kernel [runs] Time CPU% Loads LCPU% Ratio
>> 2.4.18 [3] 102.3 70 6 3 1.43
>> 2.4.19 [2] 134.1 54 14 5 1.88
>> 2.4.19-ck9 [2] 77.4 85 11 9 1.08
>> 2.4.20-rc1 [3] 173.2 43 20 5 2.43
>> 2.4.20-rc1aa1 [3] 150.6 51 16 5 2.11
>
>What is busy starving the scheduler? This sounds like it's again just an
>evelator benchmark. I don't buy your scheduler claims, give more
>explanations or it'll take it as vapourware wording, I very much doubt
>you can find any single problem in the scheduler rc1aa2 or that the
>scheduler in rc1aa1 has a chance to run slower than the one of 2.4.19 in
>a I/O benchmark, ok it still misses the numa algorithm, but that's not a
>bug, just a missing feature and it'll soon be fixed too and it doesn't
>matter for normal smp non-numa machines out there.

Ok I fully retract the statement. I should not pass judgement on what part of
the kernel has changed the benchmark results, I'll just describe what the
results say. Note however this comment was centred on the results of io_load
above. Put simply : if I am writing a large file and then try to compile the
kernel (make -j4 bzImage) it is 16 times slower.

>> mem_load:
>> Kernel [runs] Time CPU% Loads LCPU% Ratio
>> 2.4.18 [3] 103.3 70 32 3 1.45
>> 2.4.19 [3] 100.0 72 33 3 1.40
>> 2.4.19-ck9 [2] 78.3 88 31 8 1.10
>> 2.4.20-rc1 [3] 105.9 69 32 2 1.48
>> 2.4.20-rc1aa1 [1] 106.3 69 33 3 1.49
>
>again ck9 is faster because of elevator hacks ala read-latency.
>
>in short your whole benchmark seems all about interacitivy of reads
>during write flood. That's the read-latency thing or whatever else you
>could do to ll_rw_block.c.
>
>In short if somebody runs fast in something like this:
>
> cp /dev/zero . & time cp bigfile /dev/null
>
>he will win your whole contest too.
>
>please show the difff between
>2.4.19-ck9/drivers/block/{ll_rw_blk,elevator}.c and
>2.4.19/drivers/block/...

I think Matt addressed this issue.

>All the difference is there and it will hurt you badly if you do
>async-io benchmarks, and possibly dbench too. So you should always
>accompain your benchmark with async-io simultanous read/write bandwitdth
>and dbench, or I could always win your contest by shipping a very bad
>kernel. Either that or change the name of your project, if somebody wins
>this context that's probably a bad I/O scheduler in many other aspects,
>some of the reason I didn't merge read-latency from Andrew.

The name is meaningless and based on my name. Had my name been John it would
be johntest.

I regret ever including the -ck (http://kernel.kolivas.net) results. The
purpose of publishing these results was to compare 2.4.20-rc1/aa1 with
previous kernels. As some people are interested in the results of the ck
patchset I threw them in as well. -ck is a patchset with desktop users in
mind and is simply a merged patch of O(1),preempt,low latency and compressed
caching. If it sacrifices throughput in certain areas to maintain system
responsiveness then so be it. I'll look into adding other loads to contest as
you suggested, but I'm not going to add basic throughput benchmarks. There
are plenty of tools for this already.

I've done some ordinary dbench-quick benchmarks of ck9 and 2.4.20-rc1aa1 at
the OSDL http://www.osdl.org/stp

ck10_cc is the sum of patches that make up ck9 so is the same thing.

ck10_cc: http://khack.osdl.org/stp/7005/
2.4.20-rc1-aa1: http://khack.osdl.org/stp/7006/

Summary:
2420rc1aa1:
1 117.5
4 114.002
7 114.643
10 114.818
13 109.478
16 109.817
19 103.692
22 103.678
25 105.478
28 93.1296
31 87.0544
34 84.2668
37 81.0731
40 75.4605
43 77.2198
46 69.0448
49 66.7997
52 61.5987
55 60.2009
58 60.1531
61 58.3121
64 55.7127
67 56.2714
70 53.6214
73 52.2704
76 52.3631
79 49.7146
82 48.2406
85 48.1078
88 42.8405
91 42.4929
94 42.3958
97 43.5729
100 45.8318

ck10_cc:
1 116.239
4 115.075
7 114.414
10 114.166
13 109.129
16 109.403
19 106.601
22 97.7714
25 93.7279
28 95.0076
31 92.5594
34 88.5938
37 89.7026
40 86.9904
43 85.1783
46 82.7975
49 79.7348
52 80.2497
55 79.2346
58 76.6632
61 75.9002
64 75.8677
67 75.7318
70 73.2223
73 73.7652
76 72.9277
79 72.5244
82 71.6753
85 71.3161
88 70.9735
91 69.5539
94 69.602
97 67.2016
100 67.158

Regards,
Con
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.0 (GNU/Linux)

iD8DBQE9zi3UF6dfvkL3i1gRAkWmAJ4zX7gyUjzKH7eCNneyNRWLPGtCeACff9A7
Bn8LHqZw46CrGauuWTldDnQ=
=0WMB
-----END PGP SIGNATURE-----

2002-11-10 10:03:31

[permalink] [raw]

Subject: Re: [BENCHMARK] 2.4.{18,19{-ck9},20rc1{-aa1}} with contest

On Sun, Nov 10 2002, Con Kolivas wrote:
> >The default is 2048. How long does the io_load test take, or rather how
> >many tests are appropriate to do? To get a good picture of how it looks
> >you should probably try: 0, 8, 16, 64, 128, 512. Once you get some of
> >these results, it will be easier to determine which area(s) would be
> >most interesting to further explore.
>
> The io_load test takes as long as the time in seconds shown on the table. At
> least 3 tests are appropriate to get a reasonable average [runs is in square
> parentheses]. Therefore it takes about half an hour per run. Luckily I had
> the benefit of a night to set up a whole lot of runs:
>
> io_load:
> Kernel [runs] Time CPU% Loads LCPU% Ratio
> 2420rc1r0 [3] 489.3 15 36 10 6.85
> 2420rc1r8 [3] 485.5 15 35 10 6.80
> 2420rc1r16 [3] 570.4 12 43 10 7.99
> 2420rc1r32 [3] 570.1 12 42 10 7.98
> 2420rc1r64 [3] 575.0 12 43 10 8.05
> 2420rc1r128 [3] 611.4 11 46 10 8.56
> 2420rc1r256 [3] 646.2 11 49 10 9.05
> 2420rc1r512 [3] 603.7 12 45 10 8.46
> 2420rc1r1024 [3] 693.9 10 53 10 9.72
> 2.4.20-rc1 [2] 1142.2 6 90 10 16.00
>
> Test hardware is 1133Mhz P3 laptop with 5400rpm ATA100 drive. I don't doubt
> the response curve would be different for other hardware.

That looks pretty good, the behaviour in 2.4.20-rc1 is no sanely tunable
unlike before. Could you retest the whole contest suite with 512 as the
default value? It looks like a good default for 2.4.20.

Marcelo, we probably need to make few tweaks here to get the read
passover value right. The algorithmic changes in 2.4.20-pre made it
impossible to guess a good default value, as we invalidated the previous
tests. Right now we are using 2048 which is a number I basically pulled
out of my ass, it looks as if it might be a bit high. So I'll be sending
you a one-liner correction once a decent default value is found.

--
Jens Axboe

2002-11-10 10:00:45

[permalink] [raw]

Subject: Re: [BENCHMARK] 2.4.{18,19{-ck9},20rc1{-aa1}} with contest

On Sun, Nov 10 2002, Con Kolivas wrote:
> >> Well this is interesting. 2.4.20-rc1 seems to have improved it's ability
> >> to do IO work. Unfortunately it is now busy starving the scheduler in the
> >> mean time, much like the 2.5 kernels did before the deadline scheduler was
> >> put in.
> >>
> >> read_load:
> >> Kernel [runs] Time CPU% Loads LCPU% Ratio
> >> 2.4.18 [3] 102.3 70 6 3 1.43
> >> 2.4.19 [2] 134.1 54 14 5 1.88
> >> 2.4.19-ck9 [2] 77.4 85 11 9 1.08
> >> 2.4.20-rc1 [3] 173.2 43 20 5 2.43
> >> 2.4.20-rc1aa1 [3] 150.6 51 16 5 2.11
> >
> >What is busy starving the scheduler? This sounds like it's again just an
> >evelator benchmark. I don't buy your scheduler claims, give more
> >explanations or it'll take it as vapourware wording, I very much doubt
> >you can find any single problem in the scheduler rc1aa2 or that the
> >scheduler in rc1aa1 has a chance to run slower than the one of 2.4.19 in
> >a I/O benchmark, ok it still misses the numa algorithm, but that's not a
> >bug, just a missing feature and it'll soon be fixed too and it doesn't
> >matter for normal smp non-numa machines out there.
>
> Ok I fully retract the statement. I should not pass judgement on what part of
> the kernel has changed the benchmark results, I'll just describe what the
> results say. Note however this comment was centred on the results of io_load
> above. Put simply : if I am writing a large file and then try to compile the
> kernel (make -j4 bzImage) it is 16 times slower.

In Con's defence, I think he meant io scheduler starvation and not
process scheduler starvation. Otherwise the following wouldn't make a
lot of sense:

"Unfortunately it is now busy starving the scheduler in the mean time,
much like the 2.5 kernels did before the deadline scheduler was put in."

In indeed, 2.5 kernels had the exact same io scheduler algorithm in 2.5
as 2.4.20-rc has, so this makes perfect sense from the io scheduler
starvation POV.

There are inherent problems in the 2.4 io scheduler for these types of
work loads, the ugly and nausea-inducing read-latency hack that akpm did
attempts to work-around that.

Andrea is obviously talking about process scheduler, note the numa
reference among other things.

--
Jens Axboe

2002-11-10 10:08:16

by Kjartan Maraas

[permalink] [raw]

Subject: Re: [BENCHMARK] 2.4.{18,19{-ck9},20rc1{-aa1}} with contest

l?r, 2002-11-09 kl. 14:54 skrev Jens Axboe:

[SNIP]

> The default is 2048. How long does the io_load test take, or rather how

The default on my RH system with the latest errata kernel is as follows:

[root@sevilla kmaraas]# elvtune /dev/hda

/dev/hda elevator ID 0
read_latency: 8192
write_latency: 16384
max_bomb_segments: 6

[root@sevilla kmaraas]# uname -a
Linux sevilla.gnome.no 2.4.18-17.7.x #1 Tue Oct 8 13:33:14 EDT 2002 i686
unknown
[root@sevilla kmaraas]#

Is this worth changing to lower values then? They seem to be an awful
lot higher than the values mentioned below here.

> many tests are appropriate to do? To get a good picture of how it looks
> you should probably try: 0, 8, 16, 64, 128, 512. Once you get some of
> these results, it will be easier to determine which area(s) would be
> most interesting to further explore.
>
> There's also the write passover, I don't think it will have much impact
> on this test though.

Cheers
Kjartan

2002-11-10 10:11:23

[permalink] [raw]

Subject: Re: [BENCHMARK] 2.4.{18,19{-ck9},20rc1{-aa1}} with contest

On Sun, Nov 10 2002, Kjartan Maraas wrote:
> l?r, 2002-11-09 kl. 14:54 skrev Jens Axboe:
>
> [SNIP]
>
> > The default is 2048. How long does the io_load test take, or rather how
>
> The default on my RH system with the latest errata kernel is as follows:
>
> [root@sevilla kmaraas]# elvtune /dev/hda
>
> /dev/hda elevator ID 0
> read_latency: 8192
> write_latency: 16384
> max_bomb_segments: 6
>
> [root@sevilla kmaraas]# uname -a
> Linux sevilla.gnome.no 2.4.18-17.7.x #1 Tue Oct 8 13:33:14 EDT 2002 i686
> unknown
> [root@sevilla kmaraas]#
>
> Is this worth changing to lower values then? They seem to be an awful
> lot higher than the values mentioned below here.

As I mentioned in the email sent out a few minutes ago, you cannot
compare the values from 2.4.19 and earlier to 2.4.20-pre/rc at all. The
algorithm for determining when a request is starved has been changed to
be more correct, and that has invalidated these values.

--
Jens Axboe

2002-11-10 16:14:50

[permalink] [raw]

Subject: Re: [BENCHMARK] 2.4.{18,19{-ck9},20rc1{-aa1}} with contest

On Sun, Nov 10, 2002 at 11:06:56AM +0100, Jens Axboe wrote:
> Andrea is obviously talking about process scheduler, note the numa

exactly, sorry.

Andrea

2002-11-10 16:14:24

[permalink] [raw]

Subject: Re: [BENCHMARK] 2.4.{18,19{-ck9},20rc1{-aa1}} with contest

On Sun, Nov 10, 2002 at 08:58:43PM +1100, Con Kolivas wrote:
> >> to do IO work. Unfortunately it is now busy starving the scheduler in the
> >> mean time, much like the 2.5 kernels did before the deadline scheduler was
> >> put in.
> Ok I fully retract the statement. I should not pass judgement on what part of
> the kernel has changed the benchmark results, I'll just describe what the

actually Wil pointed out to me privately you meant I/O scheduler, you
just never mentioned the name "I/O" so I mistaken if for the process
scheduler, sorry (I should have understood from the deadline adjective).
It makes sense what you said once parsed as I/O scheduler of course.

Next week I will check the changes in your tree and I'll try to
reproduce the dbench numbers on my 4-way with very high I/O and disk
bandwith and I'll let you know the numbers I get here. It maybe simply
the different elevator default values and fixes in 2.4.20rc, but I
recall that you still win compared to -r0 somewhere (according to your
numbers). It's pointless from my part to discuss this further now until
I've the whole picture of the changes you did, the whole picture on the
contest source code, and after I can reproduce every single result you
posted here. Hope to be able to comment further ASAP.

Andrea

2002-11-10 16:16:48

[permalink] [raw]

Subject: Re: [BENCHMARK] 2.4.{18,19{-ck9},20rc1{-aa1}} with contest

On Sun, Nov 10, 2002 at 11:09:42AM +0100, Jens Axboe wrote:
> On Sun, Nov 10 2002, Con Kolivas wrote:
> > >The default is 2048. How long does the io_load test take, or rather how
> > >many tests are appropriate to do? To get a good picture of how it looks
> > >you should probably try: 0, 8, 16, 64, 128, 512. Once you get some of
> > >these results, it will be easier to determine which area(s) would be
> > >most interesting to further explore.
> >
> > The io_load test takes as long as the time in seconds shown on the table. At
> > least 3 tests are appropriate to get a reasonable average [runs is in square
> > parentheses]. Therefore it takes about half an hour per run. Luckily I had
> > the benefit of a night to set up a whole lot of runs:
> >
> > io_load:
> > Kernel [runs] Time CPU% Loads LCPU% Ratio
> > 2420rc1r0 [3] 489.3 15 36 10 6.85
> > 2420rc1r8 [3] 485.5 15 35 10 6.80
> > 2420rc1r16 [3] 570.4 12 43 10 7.99
> > 2420rc1r32 [3] 570.1 12 42 10 7.98
> > 2420rc1r64 [3] 575.0 12 43 10 8.05
> > 2420rc1r128 [3] 611.4 11 46 10 8.56
> > 2420rc1r256 [3] 646.2 11 49 10 9.05
> > 2420rc1r512 [3] 603.7 12 45 10 8.46
> > 2420rc1r1024 [3] 693.9 10 53 10 9.72
> > 2.4.20-rc1 [2] 1142.2 6 90 10 16.00
> >
> > Test hardware is 1133Mhz P3 laptop with 5400rpm ATA100 drive. I don't doubt
> > the response curve would be different for other hardware.
>
> That looks pretty good, the behaviour in 2.4.20-rc1 is no sanely tunable
> unlike before. Could you retest the whole contest suite with 512 as the
> default value? It looks like a good default for 2.4.20.

agreed, btw, a 2048 before the fixes would mean much less than now.

Andrea

2002-11-10 16:20:37

[permalink] [raw]

Subject: Re: [BENCHMARK] 2.4.{18,19{-ck9},20rc1{-aa1}} with contest

On Sun, Nov 10, 2002 at 11:12:47AM +0100, Kjartan Maraas wrote:
> l?r, 2002-11-09 kl. 14:54 skrev Jens Axboe:
>
> [SNIP]
>
> > The default is 2048. How long does the io_load test take, or rather how
>
> The default on my RH system with the latest errata kernel is as follows:
>
> [root@sevilla kmaraas]# elvtune /dev/hda
>
> /dev/hda elevator ID 0
> read_latency: 8192
> write_latency: 16384
> max_bomb_segments: 6

that has still the bugs in 2.4.19 and all previous 2.4 that I found and
that I fixed first with an limited patch, not complete, and then Jens
fixed it competely after I showed him the bugs while explaining him why
I did the first limited patch (then Jens's patch gone in 2.4.20pre).

so a 8192 there, isn't comparable to a 8192 in 2.4.20rc, Jens was of
course aware and just lowered it to 2048 but that is probably still more
than a 8192 in previous 2.4, it would be possible to do the math to
calculate the exact value in some common case but I guess we want a sane
default, not necessairly the exact same behaviour, so I guess
benchmarking is more useful than doing the math to calculate the exact
new value to get the exact same behaviour.

Andrea

2002-11-10 19:26:29

[permalink] [raw]

Subject: Re: [BENCHMARK] 2.4.{18,19{-ck9},20rc1{-aa1}} with contest

On Sun, 10 Nov 2002, Andrea Arcangeli wrote:
> On Sat, Nov 09, 2002 at 01:00:19PM +1100, Con Kolivas wrote:

> > 2.4.19-ck9 [2] 78.3 88 31 8 1.10
> > 2.4.20-rc1 [3] 105.9 69 32 2 1.48
> > 2.4.20-rc1aa1 [1] 106.3 69 33 3 1.49
>
> again ck9 is faster because of elevator hacks ala read-latency.
>
> in short your whole benchmark seems all about interacitivy of reads
> during write flood.

Which is a very important thing. You have to keep in mind that
reads and writes are fundamentally different operations since
the majority of the writes happen asynchronously while the program
continues running, while the majority of reads are synchronous and
your program will block while the read is going on.

Because of this it is also much easier to do writes in large chunks
than it is to do reads in large chunks, because with writes you
know exactly what data you're going to write while you can't know
which data you'll need to read next.

> All the difference is there and it will hurt you badly if you do
> async-io benchmarks,

Why would read-latency hurt the async-io benchmark ?

Whether the IO is synchronous or asynchronous shouldn't matter much,
if you do a read you still need to wait for the data to be read in
before you can process it while the data you write is still in memory
and can be used over and over again.

What is the big difference with asynchronous IO that removes the big
asymetry between reads and writes ?

> kernel. Either that or change the name of your project, if somebody wins
> this context that's probably a bad I/O scheduler in many other aspects,
> some of the reason I didn't merge read-latency from Andrew.

Any reasons in particular or just a gut feeling ?

regards,

Rik
--
Bravely reimplemented by the knights who say "NIH".
http://www.surriel.com/ http://distro.conectiva.com/
Current spamtrap: <a href=mailto:"[email protected]">[email protected]</a>

2002-11-10 20:04:11

[permalink] [raw]

Subject: Re: [BENCHMARK] 2.4.{18,19{-ck9},20rc1{-aa1}} with contest

On Sun, Nov 10, 2002 at 05:32:44PM -0200, Rik van Riel wrote:
> On Sun, 10 Nov 2002, Andrea Arcangeli wrote:
> > On Sat, Nov 09, 2002 at 01:00:19PM +1100, Con Kolivas wrote:
>
> > > 2.4.19-ck9 [2] 78.3 88 31 8 1.10
> > > 2.4.20-rc1 [3] 105.9 69 32 2 1.48
> > > 2.4.20-rc1aa1 [1] 106.3 69 33 3 1.49
> >
> > again ck9 is faster because of elevator hacks ala read-latency.
> >
> > in short your whole benchmark seems all about interacitivy of reads
> > during write flood.
>
> Which is a very important thing. You have to keep in mind that

sure, this is why I fixed the potential ~infinite starvation in the 2.3
elevator.

> reads and writes are fundamentally different operations since
> the majority of the writes happen asynchronously while the program
> continues running, while the majority of reads are synchronous and
> your program will block while the read is going on.
>
> Because of this it is also much easier to do writes in large chunks
> than it is to do reads in large chunks, because with writes you
> know exactly what data you're going to write while you can't know
> which data you'll need to read next.
>
> > All the difference is there and it will hurt you badly if you do
> > async-io benchmarks,
>
> Why would read-latency hurt the async-io benchmark ?

because only with async-io it is possible to keep the I/O pipeline
filled by reads. readahead only allows to do read-I/O in large chunks,
it has no way to fill the pipeline.

Infact the size of the request queue is the foundamental factor that
controls read latency during heavy writes without special heuristics ala
read-latency.

In short without async-io there is no way at all that a read application
can read at a decent speed during a write flood, unless you have special
hacks in the elevator ala read-latency that allows reads to enter in the
front of the queue, which reduces the chance to reorder reads and
potentially decreases performance on a async-io benchmark even in
presence of seeks.

> Whether the IO is synchronous or asynchronous shouldn't matter much,

the fact the I/O is sync or async makes the whole difference. with sync
reads the vmstat line in the read column will be always very small
compared to the write column under a write flood. This can be fixed either:

1) with hacks in the elevator ala read-latency that are not generic and
could decrease performance of other workloads
2) reducing the size of the I/O queue, that may decrease performance
also with seeks since it decreases the probaility of reordering in
the elevator
3) by having the app using async-io for reads allowing it to keep the
I/O pipeline full with reads

readahead, at least in its current form, only make sure that a 512k
command will be submitted instead of a 4k command, that's not remotely
comparable to writeback that floods the I/O queue constnatly with
several dozen or hundred mbytes of data. Increasing readhaead is also
risky, 512k is kind of obviously safe in all circumstances since it's a
single dma command anyways (and 128k for ide).

I'm starting benchmarking 2.4.20rc1aa against 2.4.19-ck9 under dbench
right now (then I'll run the contest), I can't imagine how can it be
that much faster under dbench, -aa is almost as fast as 2.5 in dbench
and much faster than 2.4 mainline, so if 19-ck9 is really that much
faster than -aa then it is likely much faster than 2.5 too. I definitely
need to examine in full detail what's going on with 2.4.19-ck9. Once I
will understand it I will let you know. For istance I know Randy's
numbers are fully reliable and I trust them:

http://home.earthlink.net/~rwhron/kernel/bigbox.html

I find Randy's number extremely useful. Of course it's great to see also
the responsiveness side of a kernel, but dbench isn't normally a
benchmark that needs responsiveness, quite the opposite, the most unfair
is the behaviour of vm and elevator, the faster usually dbench runs,
because with unfariness dbench tends to run kind of single threaded that
maximizes at most the writeback effect etc... So if 2.4.19-ck9 is so
much faster under dbench and so much more responsive with the contest
that seems to benchmark basically only the read latency under writeback
flushing flood, then it is definitely worthwhile to produce a patch
against mainline that generates this boost. If it has the preemption
patch that could hardly explain it too, the improvement from 45 MB/sec
to 65 MB/sec there's quite an huge difference and we have all the
schedule points in the submit_bh too, so it's quite unlikely that
preempt could explain that difference, it might against a mainline, but
not against my tree.

Anyways this is all guessing, once I'll check the code after I
reproduced the numbers things should be much more clear.

Andrea

2002-11-10 20:45:29

[permalink] [raw]

Subject: Re: [BENCHMARK] 2.4.{18,19{-ck9},20rc1{-aa1}} with contest

Andrea Arcangeli wrote:
>
> > Whether the IO is synchronous or asynchronous shouldn't matter much,
>
> the fact the I/O is sync or async makes the whole difference. with sync
> reads the vmstat line in the read column will be always very small
> compared to the write column under a write flood. This can be fixed either:
>
> 1) with hacks in the elevator ala read-latency that are not generic and
> could decrease performance of other workloads

read-latency will only do the front-insertion if it was unable to find a
merge or insert on the tail-to-head search.

And the problem it desparately addresses is severe.

2002-11-10 20:49:55

[permalink] [raw]

Subject: Re: [BENCHMARK] 2.4.{18,19{-ck9},20rc1{-aa1}} with contest

Andrea Arcangeli wrote:
>
> So if 2.4.19-ck9 is so
> much faster under dbench and so much more responsive with the contest
> that seems to benchmark basically only the read latency under writeback
> flushing flood, then it is definitely worthwhile to produce a patch
> against mainline that generates this boost. If it has the preemption
> patch that could hardly explain it too, the improvement from 45 MB/sec
> to 65 MB/sec there's quite an huge difference and we have all the
> schedule points in the submit_bh too, so it's quite unlikely that
> preempt could explain that difference, it might against a mainline, but
> not against my tree.
>
> Anyways this is all guessing, once I'll check the code after I
> reproduced the numbers things should be much more clear.

Well if I understand it correctly, compressed caching, umm, compresses
the cache ;)

And dbench writes 01 01 01 01 01 everywhere. Enormously compressible.

So it's basically fitting vastly more pagecache into the machine.

That would be my guessing, anyway. Changing dbench to write random
stuff might change the picture.

2002-11-10 20:58:28

[permalink] [raw]

Subject: Re: [BENCHMARK] 2.4.{18,19{-ck9},20rc1{-aa1}} with contest

On Sun, 10 Nov 2002, Andrew Morton wrote:
> Andrea Arcangeli wrote:
> >
> > > Whether the IO is synchronous or asynchronous shouldn't matter much,
> >
> > the fact the I/O is sync or async makes the whole difference. with sync
> > reads the vmstat line in the read column will be always very small
> > compared to the write column under a write flood. This can be fixed either:
> >
> > 1) with hacks in the elevator ala read-latency that are not generic and
> > could decrease performance of other workloads

It'd be nice if you specified which kind of workloads. Generic
handwaving is easy, but if you think about this problem a bit
more you'll see that most workloads which look like they might
suffer at first view should be just fine in reality...

> read-latency will only do the front-insertion if it was unable to find a
> merge or insert on the tail-to-head search.
>
> And the problem it desparately addresses is severe.

Note that async-IO shouldn't make a big difference here, except
maybe in synthetic benchmarks.

This is because the stream of data in a server will be approximately
the same regardless of whether the application is coded to use async
IO, threads or processes and because clients still need to wait for
the data on read while most writes are asynchronous.

regards,

Rik
--
Bravely reimplemented by the knights who say "NIH".
http://www.surriel.com/ http://distro.conectiva.com/
Current spamtrap: <a href=mailto:"[email protected]">[email protected]</a>

2002-11-11 01:01:38

[permalink] [raw]

Subject: Re: [BENCHMARK] 2.4.{18,19{-ck9},20rc1{-aa1}} with contest

On Sun, Nov 10, 2002 at 12:56:33PM -0800, Andrew Morton wrote:
> Andrea Arcangeli wrote:
> >
> > So if 2.4.19-ck9 is so
> > much faster under dbench and so much more responsive with the contest
> > that seems to benchmark basically only the read latency under writeback
> > flushing flood, then it is definitely worthwhile to produce a patch
> > against mainline that generates this boost. If it has the preemption
> > patch that could hardly explain it too, the improvement from 45 MB/sec
> > to 65 MB/sec there's quite an huge difference and we have all the
> > schedule points in the submit_bh too, so it's quite unlikely that
> > preempt could explain that difference, it might against a mainline, but
> > not against my tree.
> >
> > Anyways this is all guessing, once I'll check the code after I
> > reproduced the numbers things should be much more clear.
>
> Well if I understand it correctly, compressed caching, umm, compresses
> the cache ;)
>
> And dbench writes 01 01 01 01 01 everywhere. Enormously compressible.
>
> So it's basically fitting vastly more pagecache into the machine.
>
> That would be my guessing, anyway. Changing dbench to write random
> stuff might change the picture.

yes, it may be the pagecache compression that makes the difference here.
My hardware has lots of disk and ram bandwidth so it should benefit less
from compression. the results on my tree are finished, I'm starting a
new run on ck10.

Andrea

2002-11-11 01:48:13

[permalink] [raw]

Subject: Re: [BENCHMARK] 2.4.{18,19{-ck9},20rc1{-aa1}} with contest

On Sun, Nov 10, 2002 at 07:05:01PM -0200, Rik van Riel wrote:
> On Sun, 10 Nov 2002, Andrew Morton wrote:
> > Andrea Arcangeli wrote:
> > >
> > > > Whether the IO is synchronous or asynchronous shouldn't matter much,
> > >
> > > the fact the I/O is sync or async makes the whole difference. with sync
> > > reads the vmstat line in the read column will be always very small
> > > compared to the write column under a write flood. This can be fixed either:
> > >
> > > 1) with hacks in the elevator ala read-latency that are not generic and
> > > could decrease performance of other workloads
>
> It'd be nice if you specified which kind of workloads. Generic

the slowdown happens in this case:

queue 5 6 7 8 9

insert read 3

queue 3 5 6 7 8 9

request 3 is handled by the device

queue 5 6 7 8 9

insert read 1

queue 1 5 6 7 8 9

request 1 is handled by the device

queue 5 6 7 8 9

insert read 2

queue 2 5 6 7 8 9

request 2 is handled by the device

so what happened is:

read 3
read 1
read 2

while w/o read-latency what would happen most probably would been the
below, because the read 5 6 7 8 9 would give the other reads the time to
be inserted and reordered and in turn optimized:

read 1
read 2
read 3

let's ignore async-io to keep it simple, there definitely the
possibility of slowing down with lots of task reading at the same time
even with only sync reads (that could be during swapping lots of major
faults at the same time during some write load or whatever else that
generates lots of tasks reading at near time during some background
writing).

Anybody claiming there isn't the potential of a global I/O throughput
slowdown would be clueless.

I know in some case it the additional seeking may allow the cpu to do
more work and that may actually increase the throughput, but this isn't
always the case, it can definitely slowdown.

all you can argue is that the decrease of latency for lots of common
interactive workloads could worth the potential of a global throghput
slowdown. On that I may agree. I wasn't very excited in merging that
because I was scared of slowdowns of workloads with async-io and
lots of tasks reading at the same time small things during writes that
as I demonstrated above can definitely happen in practice and it's
realistic. I run myself a number of workloads like that. The current
algorithm is optimal for throughput.

However I think even read-latency is more a workarond to a problem in
the I/O queue dimensions. I think the I/O queue should be dunamically
limited to amount of data queued (in bytes not in number of requests).

We need plenty of requests only because all the requests may have 4k
only when no merging can happen, and in such case we definitely need the
elevator to do an huge work to be efficient, seeking heavily on 4k
requests (or smaller) hurts a lot, seeking on 512k requests is much less
severe.

But when each request is large 512k it is pointless to allow the same
number of requests that we allow when the requests are 4k. I think
starting with such simple fix would provide smimilar benefit of
read-latency and no corner case at all. So I would much prefer to start
with a fix like that to account for the available request size to
drivers in bytes of data in the queue, instead of in number of requests
in the queue. read-latency kind of workarounds the way too huge I/O
queue when each request is 512k in size. And it workaround it only for
reads, O_SYNC/-osync would get stuck big time against writeback load
from other files just like like reads now. The fix I propose is generic,
basically it has no downside, it is more dynamic and so I prefer it even
if may not be as direct and hard like read-latency, but that is infact what
makes it better and potentially faster in throughput than read-latency.

Going one step further we could limit the amount of bytes that each
single task can submit, so for example kupdate/bdflush couldn't fill the
queue completely anymore, and still the elevator could do an huge work
when thousand of different tasks are submitting at the same time, which
is the interesting case for the elevator, or the amount of data to
submit in the queue for each task could depend on the number of tasks
actively doing I/O in the last few seconds.

These are the fixes (I consider the limiting of bytes in the I/O queue a
fix) that I would prefer.

Infact I today think the max_bomb_segment I researched some year back
was so beneficial in terms of read-latency just because it effectively
had the effect of reducing the max amount of pending "writeback" bytes
in the queue, not really because it splitted the request in multiple dma
(in turn decreasing a lot performance because the dma chunks were way
too small to have an hope to reach the peak performance of the hardware,
and the fact performance was so hurted forced us to back it out
completely, rightly). So I'm optimistic that reducing the size of the
queue and making it tunable from elvtune would be the first thing to do
rather than playing with the read-latency hack that just workarounds the
way too huge queue size when the merging is at its maximum and that can
hurt performance in some case.

Andrea

2002-11-11 04:00:28

[permalink] [raw]

Subject: Re: [BENCHMARK] 2.4.{18,19{-ck9},20rc1{-aa1}} with contest

On Sun, Nov 10, 2002 at 08:03:01PM -0800, Andrew Morton wrote:
> Andrea Arcangeli wrote:
> >
> > the slowdown happens in this case:
> >
> > queue 5 6 7 8 9
> >
> > insert read 3
> >
> > queue 3 5 6 7 8 9
>
> read-latency will not do that.

So what will it do? Must do something very much like what I described or
it is a noop period. Please elaborate.

>
> > However I think even read-latency is more a workarond to a problem in
> > the I/O queue dimensions.
>
> The problem is the 2.4 algorithm. If a read is not mergeable or
> insertable it is placed at the tail of the queue. Which is the
> worst possible place it can be put because applications wait on
> reads, not on writes.

O_SYNC/-osync waits on writes too, so are you saying writes must go to
the head because of that? reads should be not too bad at the end too if
only the queue wasn't that oversized when the merging is at its maximum.
Fix the oversizing of the queue, then read-latency will matter much
less.

Andrea

2002-11-11 03:56:24

[permalink] [raw]

Subject: Re: [BENCHMARK] 2.4.{18,19{-ck9},20rc1{-aa1}} with contest

Andrea Arcangeli wrote:
>
> the slowdown happens in this case:
>
> queue 5 6 7 8 9
>
> insert read 3
>
> queue 3 5 6 7 8 9

read-latency will not do that.

> However I think even read-latency is more a workarond to a problem in
> the I/O queue dimensions.

The problem is the 2.4 algorithm. If a read is not mergeable or
insertable it is placed at the tail of the queue. Which is the
worst possible place it can be put because applications wait on
reads, not on writes.

2002-11-11 04:15:59

[permalink] [raw]

Subject: Re: [BENCHMARK] 2.4.{18,19{-ck9},20rc1{-aa1}} with contest

Andrea Arcangeli wrote:
>
> On Sun, Nov 10, 2002 at 08:03:01PM -0800, Andrew Morton wrote:
> > Andrea Arcangeli wrote:
> > >
> > > the slowdown happens in this case:
> > >
> > > queue 5 6 7 8 9
> > >
> > > insert read 3
> > >
> > > queue 3 5 6 7 8 9
> >
> > read-latency will not do that.
>
> So what will it do? Must do something very much like what I described or
> it is a noop period. Please elaborate.

If a read was not merged with another read on the tail->head walk
the read will be inserted near the head. The head->tail walk bypasses
all reads, six (default) writes and then inserts the new read.

It has the shortcoming that earlier reads may be walked past in the
tail->head phase. It's a three-liner to prevent that but I was never
able to demonstrate any difference.

> >
> > > However I think even read-latency is more a workarond to a problem in
> > > the I/O queue dimensions.
> >
> > The problem is the 2.4 algorithm. If a read is not mergeable or
> > insertable it is placed at the tail of the queue. Which is the
> > worst possible place it can be put because applications wait on
> > reads, not on writes.
>
> O_SYNC/-osync waits on writes too, so are you saying writes must go to
> the head because of that?

It has been discussed: boost a request to head-of-queue when a thread
starts to wait on a buffer/page which is inside that request.

But we don't care about synchronous writes. As long as we don't
starve them out completely, optimise the (vastly more) common case.

> reads should be not too bad at the end too if
> only the queue wasn't that oversized when the merging is at its maximum.
> Fix the oversizing of the queue, then read-latency will matter much
> less.

Think about two threads. One is generating a stream of writes and
the other is trying to read a file. The reader needs to read the
directory, the inode, the first data blocks, the first indirect and
then some more data blocks. That's at least three synchronous reads.
Even if those reads are placed just three requests from head-of-queue,
the reader will make one tenth of the progress of the writer.

And the current code places those reads 64 requests from head-of-queue.

When the various things which were congesting write queueing were fixed
in the 2.5 VM a streaming write was slowing such read operations down by
a factor of 4000.

2002-11-11 04:20:51

[permalink] [raw]

Subject: Re: [BENCHMARK] 2.4.{18,19{-ck9},20rc1{-aa1}} with contest

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

>On Sun, Nov 10 2002, Con Kolivas wrote:
>> io_load:
>> Kernel [runs] Time CPU% Loads LCPU% Ratio
>> 2420rc1r0 [3] 489.3 15 36 10 6.85
>> 2420rc1r8 [3] 485.5 15 35 10 6.80
>> 2420rc1r16 [3] 570.4 12 43 10 7.99
>> 2420rc1r32 [3] 570.1 12 42 10 7.98
>> 2420rc1r64 [3] 575.0 12 43 10 8.05
>> 2420rc1r128 [3] 611.4 11 46 10 8.56
>> 2420rc1r256 [3] 646.2 11 49 10 9.05
>> 2420rc1r512 [3] 603.7 12 45 10 8.46
>> 2420rc1r1024 [3] 693.9 10 53 10 9.72
>> 2.4.20-rc1 [2] 1142.2 6 90 10 16.00
>>
>> Test hardware is 1133Mhz P3 laptop with 5400rpm ATA100 drive. I don't
>> doubt the response curve would be different for other hardware.
>
>That looks pretty good, the behaviour in 2.4.20-rc1 is no sanely tunable
>unlike before. Could you retest the whole contest suite with 512 as the
>default value? It looks like a good default for 2.4.20.

Ok here they are:

noload:
Kernel [runs] Time CPU% Loads LCPU% Ratio
2.4.18 [5] 71.7 93 0 0 1.00
2.4.19 [5] 69.0 97 0 0 0.97
2.4.20-rc1 [3] 72.2 93 0 0 1.01
2420rc1r512 [3] 71.6 93 0 0 1.00

cacherun:
Kernel [runs] Time CPU% Loads LCPU% Ratio
2.4.18 [2] 66.6 99 0 0 0.93
2.4.19 [2] 68.0 99 0 0 0.95
2.4.20-rc1 [3] 67.2 99 0 0 0.94
2420rc1r512 [3] 67.1 99 0 0 0.94

process_load:
Kernel [runs] Time CPU% Loads LCPU% Ratio
2.4.18 [3] 109.5 57 119 44 1.53
2.4.19 [3] 106.5 59 112 43 1.49
2.4.20-rc1 [3] 110.7 58 119 43 1.55
2420rc1r512 [3] 112.1 57 122 43 1.57

ctar_load:
Kernel [runs] Time CPU% Loads LCPU% Ratio
2.4.18 [3] 117.4 63 1 7 1.64
2.4.19 [2] 106.5 70 1 8 1.49
2.4.20-rc1 [3] 102.1 72 1 7 1.43
2420rc1r512 [3] 101.7 73 1 8 1.42

xtar_load:
Kernel [runs] Time CPU% Loads LCPU% Ratio
2.4.18 [3] 150.8 49 2 8 2.11
2.4.19 [1] 132.4 55 2 9 1.85
2.4.20-rc1 [3] 180.7 40 3 8 2.53
2420rc1r512 [3] 170.0 44 3 7 2.38

io_load:
Kernel [runs] Time CPU% Loads LCPU% Ratio
2.4.18 [3] 474.1 15 36 10 6.64
2.4.19 [3] 492.6 14 38 10 6.90
2.4.20-rc1 [2] 1142.2 6 90 10 16.00
2420rc1r512 [6] 602.7 12 45 10 8.44

read_load:
Kernel [runs] Time CPU% Loads LCPU% Ratio
2.4.18 [3] 102.3 70 6 3 1.43
2.4.19 [2] 134.1 54 14 5 1.88
2.4.20-rc1 [3] 173.2 43 20 5 2.43
2420rc1r512 [3] 112.5 67 11 5 1.58

list_load:
Kernel [runs] Time CPU% Loads LCPU% Ratio
2.4.18 [3] 90.2 76 1 17 1.26
2.4.19 [1] 89.8 77 1 20 1.26
2.4.20-rc1 [3] 88.8 77 0 12 1.24
2420rc1r512 [3] 88.0 78 0 12 1.23

mem_load:
Kernel [runs] Time CPU% Loads LCPU% Ratio
2.4.18 [3] 103.3 70 32 3 1.45
2.4.19 [3] 100.0 72 33 3 1.40
2.4.20-rc1 [3] 105.9 69 32 2 1.48
2420rc1r512 [3] 105.0 70 33 3 1.47

Looks good. Note that read_load is a lot "better" too.

Con
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.0 (GNU/Linux)

iD8DBQE9zzFtF6dfvkL3i1gRAvQ/AJ0UK7za0Uvy6SnyPxFoYEjcX2iGDACcCWfx
WRq8eTboTj6bRCzERw/gMfo=
=kSMm
-----END PGP SIGNATURE-----

2002-11-11 04:33:17

[permalink] [raw]

Subject: Re: [BENCHMARK] 2.4.{18,19{-ck9},20rc1{-aa1}} with contest

On Sun, Nov 10, 2002 at 08:22:38PM -0800, Andrew Morton wrote:
> Andrea Arcangeli wrote:
> >
> > On Sun, Nov 10, 2002 at 08:03:01PM -0800, Andrew Morton wrote:
> > > Andrea Arcangeli wrote:
> > > >
> > > > the slowdown happens in this case:
> > > >
> > > > queue 5 6 7 8 9
> > > >
> > > > insert read 3
> > > >
> > > > queue 3 5 6 7 8 9
> > >
> > > read-latency will not do that.
> >
> > So what will it do? Must do something very much like what I described or
> > it is a noop period. Please elaborate.
>
> If a read was not merged with another read on the tail->head walk
> the read will be inserted near the head. The head->tail walk bypasses
> all reads, six (default) writes and then inserts the new read.
>
> It has the shortcoming that earlier reads may be walked past in the
> tail->head phase. It's a three-liner to prevent that but I was never
> able to demonstrate any difference.

from your description it seems what will happen is:

queue 3 5 6 7 8 9

I don't see why you say it won't do that. the whole point of the patch
to put reads at or near the head, and you say 3 won't be put at the
head if only 5 writes are pending. Or maybe your bypasses "6 writes"
means the other way around, that you put the read as the seventh entry
in the queue if there are 6 writes pending, is it the case?

> > > > However I think even read-latency is more a workarond to a
> > > > problem in
> > > > the I/O queue dimensions.
> > >
> > > The problem is the 2.4 algorithm. If a read is not mergeable or
> > > insertable it is placed at the tail of the queue. Which is the
> > > worst possible place it can be put because applications wait on
> > > reads, not on writes.
> >
> > O_SYNC/-osync waits on writes too, so are you saying writes must go to
> > the head because of that?
>
> It has been discussed: boost a request to head-of-queue when a thread
> starts to wait on a buffer/page which is inside that request.
>
> But we don't care about synchronous writes. As long as we don't
> starve them out completely, optimise the (vastly more) common case.

yes, it should be worthwhile to potentially decrease a little the global
throughput to increase significantly the read latency, I'm not against
that, but before I would care about that I prefer to get a limit on the
size of the queue in bytes, not in requests, that is a generic issue for
writes and read-async-io too, it's a task against task fairness/latency
matter, not specific to reads, but it should help read latency
visibly too. In any case the two things are orthogonal, if the queue is
smaller read-latency will do even better.

> > reads should be not too bad at the end too if
> > only the queue wasn't that oversized when the merging is at its maximum.
> > Fix the oversizing of the queue, then read-latency will matter much
> > less.
>
> Think about two threads. One is generating a stream of writes and
> the other is trying to read a file. The reader needs to read the
> directory, the inode, the first data blocks, the first indirect and
> then some more data blocks. That's at least three synchronous reads.

sure I know the problem with sync reads.

> Even if those reads are placed just three requests from head-of-queue,
> the reader will make one tenth of the progress of the writer.

actually it's probably much worse tha a 10 times ratio since the writer
is going to use big requests, while the reader is probably seeking with
<=4k requests.

> And the current code places those reads 64 requests from head-of-queue.
>
> When the various things which were congesting write queueing were fixed
> in the 2.5 VM a streaming write was slowing such read operations down by
> a factor of 4000.

Andrea

2002-11-11 05:04:04

[permalink] [raw]

Subject: Re: [BENCHMARK] 2.4.{18,19{-ck9},20rc1{-aa1}} with contest

Andrea Arcangeli wrote:
>
> from your description it seems what will happen is:
>
> queue 3 5 6 7 8 9
>
> I don't see why you say it won't do that. the whole point of the patch
> to put reads at or near the head, and you say 3 won't be put at the
> head if only 5 writes are pending. Or maybe your bypasses "6 writes"
> means the other way around, that you put the read as the seventh entry
> in the queue if there are 6 writes pending, is it the case?

Actually I thought your "queue" was "head of queue" and that 5,6,7,8 and 9
were reads....

If the queue contains, say:

(head) R1 R2 R3 W1 W2 W3 W4 W5 W6 W7

Then a new R4 will be inserted between W6 and W7. So if R5 is mergeable
with R4 there is still plenty of time for that.

> > > > > However I think even read-latency is more a workarond to a
> > > > > problem in
> > > > > the I/O queue dimensions.
> > > >
> > > > The problem is the 2.4 algorithm. If a read is not mergeable or
> > > > insertable it is placed at the tail of the queue. Which is the
> > > > worst possible place it can be put because applications wait on
> > > > reads, not on writes.
> > >
> > > O_SYNC/-osync waits on writes too, so are you saying writes must go to
> > > the head because of that?
> >
> > It has been discussed: boost a request to head-of-queue when a thread
> > starts to wait on a buffer/page which is inside that request.
> >
> > But we don't care about synchronous writes. As long as we don't
> > starve them out completely, optimise the (vastly more) common case.
>
> yes, it should be worthwhile to potentially decrease a little the global
> throughput to increase significantly the read latency, I'm not against
> that, but before I would care about that I prefer to get a limit on the
> size of the queue in bytes, not in requests,

Really, it should be in terms of "time". If you assume 6 msec seek and
30 mbyte/sec bandwidth, the crossover is a 120 kbyte I/O. Not that I'm
sure this means anything interesting ;) But the lesson is that the
size of a request isn't very important.

> actually it's probably much worse tha a 10 times ratio since the writer
> is going to use big requests, while the reader is probably seeking with
> <=4k requests.
>

Yup. This is one case where improving latency improves throughput,
if there's computational work to be done.

2.5 (and read-latency) sort-of solve these problems by creating a
massive seekstorm when there are competing reads and writes. It's
a pretty sad solution really.

Better would be to perform those reads and writes in nice big batches.
That's easy for the writes, but for reads we need to wait for the
application to submit another one. That means actually deliberately
leaving the disk head idle for a few milliseconds in the anticipation
that the application will submit another nearby read. This is called
"anticipatory scheduling" and has been shown to provide 20%-70%
performance boost in web serving workloads. It just makes heaps of
sense to me and I'd love to see it in Linux...

See http://www.cs.ucsd.edu/sosp01/papers/iyer.pdf

2002-11-11 05:17:07

[permalink] [raw]

Subject: Re: [BENCHMARK] 2.4.{18,19{-ck9},20rc1{-aa1}} with contest

On Sun, Nov 10, 2002 at 09:10:41PM -0800, Andrew Morton wrote:
> Andrea Arcangeli wrote:
> >
> > from your description it seems what will happen is:
> >
> > queue 3 5 6 7 8 9
> >
> > I don't see why you say it won't do that. the whole point of the patch
> > to put reads at or near the head, and you say 3 won't be put at the
> > head if only 5 writes are pending. Or maybe your bypasses "6 writes"
> > means the other way around, that you put the read as the seventh entry
> > in the queue if there are 6 writes pending, is it the case?
>
> Actually I thought your "queue" was "head of queue" and that 5,6,7,8 and 9
> were reads....
>
> If the queue contains, say:
>
> (head) R1 R2 R3 W1 W2 W3 W4 W5 W6 W7
>
> Then a new R4 will be inserted between W6 and W7. So if R5 is mergeable
> with R4 there is still plenty of time for that.

yes, the fact it's "near" and not exactly in the head as I originally
thought, makes it less likely that it slows things down, even if it
theoretically still could for some workload, overall it seems a
worthwhile heuristic.

Andrea

2002-11-11 07:54:31

by William Lee Irwin III

[permalink] [raw]

Subject: Re: [BENCHMARK] 2.4.{18,19{-ck9},20rc1{-aa1}} with contest

Andrea Arcangeli wrote:
> 2.5 (and read-latency) sort-of solve these problems by creating a
> massive seekstorm when there are competing reads and writes. It's
> a pretty sad solution really.

On Sun, Nov 10, 2002 at 09:10:41PM -0800, Andrew Morton wrote:
> Better would be to perform those reads and writes in nice big batches.
> That's easy for the writes, but for reads we need to wait for the
> application to submit another one. That means actually deliberately
> leaving the disk head idle for a few milliseconds in the anticipation
> that the application will submit another nearby read. This is called
> "anticipatory scheduling" and has been shown to provide 20%-70%
> performance boost in web serving workloads. It just makes heaps of
> sense to me and I'd love to see it in Linux...
> See http://www.cs.ucsd.edu/sosp01/papers/iyer.pdf

This smacks of "deceptive idleness". OTOH I prefer to keep out of those
issues and focus on pure fault handling, TLB, and space consumption
issues. I/O scheduling is far afield for me, and I prefer to keep it so.

Bill

2002-11-11 13:38:46

[permalink] [raw]

Subject: Re: [BENCHMARK] 2.4.{18,19{-ck9},20rc1{-aa1}} with contest

On Mon, 11 Nov 2002, Andrea Arcangeli wrote:

> [snip bad example by somebody who hasn't read Andrew's patch]

> Anybody claiming there isn't the potential of a global I/O throughput
> slowdown would be clueless.

IO throughput isn't the point. Due to the fundamental asymmetry
between reads and writes IO throughput does NOT correspond to
program throughput under many kinds of IO patterns.

Sure, the best IO throughput is good for writeout, but it'll slow
down any program doing reads, including async IO programs because
those too need to get their data before they can process it.

> all you can argue is that the decrease of latency for lots of common
> interactive workloads could worth the potential of a global throghput
> slowdown. On that I may agree.

On the contrary, the decrease of latency will probably bring a
global throughput increase. Just program throughput, not raw
IO throughput.

> However I think even read-latency is more a workarond to a problem in
> the I/O queue dimensions. I think the I/O queue should be dunamically
> limited to amount of data queued (in bytes not in number of requests).

The number of bytes makes surprisingly little sense when you keep
into account that one disk seek on a modern costs as much time as
it takes to read about half a megabyte worth of data.

> But when each request is large 512k it is pointless to allow the same
> number of requests that we allow when the requests are 4k.

A request of 512 kB will take about twice the time to service as a 4 kB
request would take, assuming the disk does around 50 MB/s throughput.
If you take one of those really modern disks Andre Hedrick has in his
lab the difference gets even smaller.

> Infact I today think the max_bomb_segment I researched some year back
> was so beneficial in terms of read-latency just because it effectively

That must be why it was backed out ;)

regards,

Rik
--
Bravely reimplemented by the knights who say "NIH".
http://www.surriel.com/ http://distro.conectiva.com/
Current spamtrap: <a href=mailto:"[email protected]">[email protected]</a>

2002-11-11 13:50:04

[permalink] [raw]

Subject: Re: [BENCHMARK] 2.4.{18,19{-ck9},20rc1{-aa1}} with contest

On Sun, 10 Nov 2002, Andrew Morton wrote:

> Really, it should be in terms of "time". If you assume 6 msec seek and
> 30 mbyte/sec bandwidth, the crossover is a 120 kbyte I/O.

Now figure in the rotational latency and the crossover point has
moved to 200 kB. ;)

> Not that I'm sure this means anything interesting ;) But the lesson is
> that the size of a request isn't very important.

Besides, larger requests are much more efficient so penalising
those is the very last thing we want to do.

> Better would be to perform those reads and writes in nice big batches.
> That's easy for the writes, but for reads we need to wait for the
> application to submit another one. That means actually deliberately
> leaving the disk head idle for a few milliseconds in the anticipation
> that the application will submit another nearby read. This is called
> "anticipatory scheduling" and has been shown to provide 20%-70%
> performance boost in web serving workloads. It just makes heaps of
> sense to me and I'd love to see it in Linux...

It only makes sense under heavy multiprocessing workloads where
we have multiple processes submitting IO, but if it's just one
process all this deliberate delay will achieve is a slowdown of
the process.

> See http://www.cs.ucsd.edu/sosp01/papers/iyer.pdf

Looking at it now.

regards,

Rik
--
Bravely reimplemented by the knights who say "NIH".
http://www.surriel.com/ http://distro.conectiva.com/
Current spamtrap: <a href=mailto:"[email protected]">[email protected]</a>

2002-11-11 14:02:47

[permalink] [raw]

Subject: Re: [BENCHMARK] 2.4.{18,19{-ck9},20rc1{-aa1}} with contest

On Mon, Nov 11 2002, Rik van Riel wrote:
> > Infact I today think the max_bomb_segment I researched some year back
> > was so beneficial in terms of read-latency just because it effectively
>
> That must be why it was backed out ;)

Warning, incredibly bad quote snip above.

Rik, you basically deleted the interesting part there. The
max_bomb_segment logic was pretty uninteresting if you looked at it
from the POV that says that we must limit the size of a request to
prevent starvation. This is what the name implies, and this is flawed.
However, Andrea goes on to say that it sort-of worked anyways, just
not for the reaon he originally thought it would. It worked because it
limited the total size of pending writes in the queue. And this is
indeed the key factor to read latency in the 2.4 elevator, because reads
tend to get pushed in the back all the time because the queue looks like

R1-W1-W2-W3-....W127

service R1, queue is now

W1-W2-W3....-W127

application got R1 serviced, issue a new read. Queue is now:

W1-W2-W3....-W127-R2

So even with 0 read passover value, an application typically has to wait
for the total sum of writes in the queue. And this is what causes the
starvation. max_bomb_segments wasn't too good anyways, because in order
to get good latency you have to limit the sum of W1-W127 way too much
and then it starts to hurt write throughput really badly.

This is why the 2.4 io scheduler is fundamentally flawed from the read
latency view point. This is also why the 2.5 deadline io scheduler is
far superior in this area.

>> But when each request is large 512k it is pointless to allow the same
>> number of requests that we allow when the requests are 4k.

> A request of 512 kB will take about twice the time to service as a 4 kB
> request would take, assuming the disk does around 50 MB/s throughput.
> If you take one of those really modern disks Andre Hedrick has in his
> lab the difference gets even smaller.

I'll mention that for 2.5 the number of bytes that equals a full seek in
service time if called a stream_unit and is tweakable. Typically you are
looking at plain 40MiB/s and 8ms seek, so ~256-300KiB is more in the
normal range that 512KiB.

--
Jens Axboe

2002-11-11 15:37:12

[permalink] [raw]

Subject: Re: [BENCHMARK] 2.4.{18,19{-ck9},20rc1{-aa1}} with contest

On Mon, Nov 11, 2002 at 11:45:06AM -0200, Rik van Riel wrote:
> On Mon, 11 Nov 2002, Andrea Arcangeli wrote:
>
> > [snip bad example by somebody who hasn't read Andrew's patch]
>
> > Anybody claiming there isn't the potential of a global I/O throughput
> > slowdown would be clueless.
>
> IO throughput isn't the point. Due to the fundamental asymmetry

IO throughput is the whole point of the elevator and if you change it
that way, you can decrease it, even if you put at the seventh request
instead of the first, you're making assumption that the reads cannot
keep the I/O pipeline full, this is a realistic assumption for some
workloads, but not all. My example still very much apply, just not at
the head but as the seventh request. I definitely known what is the
design idea behind read-latency unlike what you think, I just didn't
remeber the lowlevel implementation details which are not important in
terms of a pontential slowdown in math theorical terms.

> On the contrary, the decrease of latency will probably bring a
> global throughput increase. Just program throughput, not raw

I perfectly know this, but you're making assumptions about certain
workloads, I can agree they are realistic workloads on a desktop
machine though, but not all the workloads are like that.

> That must be why it was backed out ;)

it was backed out because the request size must be big and it couldn't
be big with such ""feature"" enabled, as I just said in my previous
email. I just given you the reason it was backed out, not sure what are
you wondering about.

The fact is that read-latency is an hack for getting a special case
faster and that definitely in theory can hurt some workload, there is a
reason read-latency isn't the default, read-latency definitely *can*
increase the seeks, not admitting this and claiming it can only improve
performance is clueless from your part. the implementation detail that
it is adding as the seventh request instead of as the first request
decreases the probability of a slowdown, but it still has the potential
of slowing down something, this is all about math local to the elevator.

And IMHO read-latency kinds of hide the real problem that is we should
limit the queue in bytes or we could delay after I/O completion as
mentioned by Andrew since certain workloads will be still very much
slower than writes even with read-latency. I'll fix soon the real
problem in my tree, I just need to make a number of benchmarks on SCSI
and IDE to kind of measure a good size in bytes for peak contigous I/O
performance before I can implement that.

Andrea

2002-11-11 15:41:38