On Fri, Apr 05, 2002 at 11:04:18PM +0200, Moritz Franosch wrote:
>
>
> Hello Andrea,
>
>
> When releasing 2.4.19-pre5, Marcelo wrote
>
> This release has -aa writeout scheduling changes, which should
> improve IO performance (and interactivity under heavy write loads).
>
> _Please_ test that extensively looking for any kind of problems
> (performance, interactivity, etc).
>
> I did test it because I noticed serious IO performance problems with
> earlier kernels.
> http://groups.google.com/groups?selm=linux.kernel.rxxn103tdbw.fsf%40synapse.t30.physik.tu-muenchen.de&output=gplain
> http://groups.google.com/groups?selm=linux.kernel.rxxsn83rd4c.fsf%40synapse.t30.physik.tu-muenchen.de&output=gplain
>
> The problem is that writing to a DVD-RAM, ZIP or MO device almost
> totally blocks reading from a _different_ device. Here is some data.
>
> nr  bench  read       write      2.4.18  2.4.19-pre5  expected  factor
>  1  dd     30GB HDD   DVD-RAM       278          490        60     8.2
>  2  dd     120GB HDD  DVD-RAM       197          438        32      14
>  3  dd     30GB HDD   ZIP           158          239        60     4.0
>  4  dd     120GB HDD  ZIP           142          249        32     7.8
>  5  dd     30GB HDD   120GB HDD      87           89        60     1.5
>  6  dd     120GB HDD  30GB HDD       66           69        32     2.2
>  7  cp     30GB HDD   120GB HDD      97           77        60     1.3
>  8  cp     120GB HDD  30GB HDD       78           65        50     1.3
>
> The columns "2.4.18" and "2.4.19-pre5" list the execution times, in
> seconds, of the respective benchmark. The column "expected" lists the
> time I would have expected the benchmark to take with a "perfect"
> kernel. The "factor" column is how many times slower 2.4.19-pre5 is
> than a perfect kernel would be.
>
> The "dd" benchmark reads by 'dd' a file of size 1GB from the device
> listed under "read" and writes to /dev/null while _another_ 'dd'
> process writes to the device listed under "write" and reads from
> /dev/zero. Please see the source below. The "cp" benchmark simply
> copies a file of size 1GB from "read" to "write".
>
> I have four IDE devices installed in that system:
> hda: 30 GB IDE HDD 5400 rpm
> hdb: 9.4 GB ATAPI DVD-RAM
> hdc: 120 GB IDE HDD 7200 rpm
> hdd: 100 MB ATAPI ZIP
>
> As you can see, the "cp" benchmark (7, 8) has improved considerably
> between 2.4.18 and 2.4.19-pre5 and is now only a factor of 1.3 away
> from perfect. Good work!
>
> The performance problems can be seen in benchmarks 1-4. Writing to
> DVD-RAM while reading from the (fast) 120GB HDD (benchmark 2) almost
> totally blocks the read process. Under 2.4.19-pre5, it takes 14 times
> longer to 'dd' a 1GB file from HDD to /dev/null while writing to
> DVD-RAM than without any other IO. Without any other IO it takes only
> 32 seconds to read the 1GB file from the 120GB HDD. I would expect
> writing simultaneously to _another_ device to have no impact on the
> read speed, which is why I expected benchmark 2 to complete in 32
> seconds. Similarly, in benchmark 6, reading 1GB from the 120GB HDD
> should take only 32 seconds, but it takes more than twice that time
> when writing simultaneously to the 30GB HDD.
>
> With 'vmstat 1' I have observed that 2.4.19-pre5 is a bit "fairer"
> than 2.4.18. Under 2.4.18, a process writing to DVD-RAM almost
> totally blocks reading from the HDD, whereas under 2.4.19-pre5 about
> 1-2 MB/s can still be read simultaneously. So I was astonished that
> in my benchmarks 1-4, kernel 2.4.19-pre5 performed much worse than
> 2.4.18. The reason may be that most of the throughput comes from the
> short moments where, for whatever reason, the read speed increases to
> the normal 20-30 MB/s. In benchmarks 3 and 4 the 'dd' process writing
> to the ZIP drive terminated with "no space left on device" before the
> reading 'dd' process completed. The reading 'dd' process then probably
> got a higher throughput (I checked that in X with xosview). That's
> probably why benchmarks 3 and 4 (ZIP) finished faster than 1 and 2
> (DVD-RAM).
>
> I ran all benchmarks just after booting into runlevel 1 (single-user
> mode) on a SuSE 7.1 system. The system is an Athlon 700 MHz, KT133
> chipset, 256 MB RAM, 256 MB swap.
>
>
> The "dd" benchmark is:
>
> #!/bin/bash
>
> dd if=/dev/zero of=$1/tmp bs=1000000 &
> # a sleep is sometimes necessary for bad performance (to fill cache?)
> sleep 30
> time dd if=$2 of=/dev/null bs=1000000
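(As a usage note: a hypothetical invocation of the script above, which
the original mail doesn't show. The first argument is a directory on the
"write" device, the second an existing ~1GB file on the "read" device;
the script name and file name below are made up for illustration, the
mount points are taken from the mount list further down.)

  # e.g. benchmark 2: write to the DVD-RAM mount, read from the 120GB HDD
  sh dd-bench.sh /dvd /lscratch2/testfile-1GB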
>
>
> Filesystems are:
> jfranosc@nomad:~ > mount
> /dev/hda3 on / type reiserfs (rw,noatime)
> proc on /proc type proc (rw)
> devpts on /dev/pts type devpts (rw,mode=0620,gid=5)
> /dev/hda1 on /boot type ext2 (rw,noatime)
> /dev/hda6 on /home type ext2 (rw,noatime)
> /dev/hda7 on /lscratch type reiserfs (rw,noatime)
> /dev/hdc2 on /lscratch2 type reiserfs (rw,noatime)
> shmfs on /dev/shm type shm (rw)
> automount(pid341) on /net type autofs (rw,fd=5,pgrp=341,minproto=2,maxproto=4)
> automount(pid334) on /misc type autofs (rw,fd=5,pgrp=334,minproto=2,maxproto=4)
> /dev/hdd4 on /mzip type vfat (rw,noexec,nosuid,nodev,user=jfranosc)
> /dev/sr0 on /dvd type ext2 (rw,noexec,nosuid,nodev,user=jfranosc)
>
>
> Boot parameters in /etc/lilo.conf:
> append = "hdb=ide-scsi hdd=ide-floppy"
>
>
> I report this performance problem because, first, there is room to
> improve IO performance by up to a factor of 2 when using multiple
> disks, and, second, writing to DVD-RAM or ZIP makes the system almost
> unusable (because read performance drops to virtually zero).
>
> If you have patches that you think should be tested, I'd like to try
> them.
The reason the hard disk is faster is that the new algorithm is much
better than the previous mainline code. As for the DVD-RAM hanging the
machine more, that's probably because more RAM can be marked dirty with
the new changes (beneficial for some workloads, but it stalls the fast
hard disk much more if there is one very slow block device in the
system). You can try decreasing the percentage of dirty VM in the
system with:
echo 2 500 0 0 500 3000 3 1 0 >/proc/sys/vm/bdflush
hope this helps,
Right fix is different but not suitable for 2.4.
Andrea
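For reference, a minimal sketch of how that tuning could be applied and
later reverted from a shell. The field layout (nfract first, nfract_sync
seventh) is assumed to follow the 2.4 bdflush interface described in
Documentation/sysctl/vm.txt, and the path of the backup file is made up
for the example.

  #!/bin/bash
  # Save the current bdflush settings so they can be restored later.
  cat /proc/sys/vm/bdflush > /tmp/bdflush.orig

  # Lower the dirty-buffer thresholds as suggested above: nfract (1st
  # field) to 2% and nfract_sync (7th field) to 3% of freeable RAM.
  echo 2 500 0 0 500 3000 3 1 0 > /proc/sys/vm/bdflush

  # ... run the DVD-RAM/ZIP write workload here ...

  # Put the previous settings back.
  echo $(cat /tmp/bdflush.orig) > /proc/sys/vm/bdflush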
On Tue, 16 Apr 2002, Andrea Arcangeli wrote:
> On Fri, Apr 05, 2002 at 11:04:18PM +0200, Moritz Franosch wrote:
> > The problem is that writing to a DVD-RAM, ZIP or MO device almost
> > totally blocks reading from a _different_ device. Here is some data.
> >
> > nr  bench  read       write      2.4.18  2.4.19-pre5  expected  factor
> >  1  dd     30GB HDD   DVD-RAM       278          490        60     8.2
> >  2  dd     120GB HDD  DVD-RAM       197          438        32      14
> >  3  dd     30GB HDD   ZIP           158          239        60     4.0
> >  4  dd     120GB HDD  ZIP           142          249        32     7.8
> >  5  dd     30GB HDD   120GB HDD      87           89        60     1.5
> >  6  dd     120GB HDD  30GB HDD       66           69        32     2.2
> >  7  cp     30GB HDD   120GB HDD      97           77        60     1.3
> >  8  cp     120GB HDD  30GB HDD       78           65        50     1.3
> >
> > The columns "2.4.18" and "2.4.19-pre5" list the execution times, in
> > seconds, of the respective benchmark. The column "expected" lists the
> > time I would have expected the benchmark to take with a "perfect"
> > kernel. The "factor" column is how many times slower 2.4.19-pre5 is
> > than a perfect kernel would be.
> The reason the hard disk is faster is that the new algorithm is much
> better than the previous mainline code. As for the DVD-RAM hanging the
> machine more, that's probably because more RAM can be marked dirty with
> the new changes (beneficial for some workloads, but it stalls the fast
> hard disk much more if there is one very slow block device in the
> system). You can try decreasing the percentage of dirty VM in the
> system with:
>
> echo 2 500 0 0 500 3000 3 1 0 >/proc/sys/vm/bdflush
Judging from the performance regression above it would seem the
new defaults suck rocks.
Can we please stop optimising Linux for a single workload benchmark
and start tuning it for the common case of running multiple kinds
of applications and making sure one application can't mess up the
others ?
Personally I couldn't care less if my tar went 30% faster if it
meant having my desktop unresponsive for the whole time.
regards,
Rik
--
http://www.linuxsymposium.org/2002/
"You're one of those condescending OLS attendants"
"Here's a nickle kid. Go buy yourself a real t-shirt"
http://www.surriel.com/ http://distro.conectiva.com/
> > The problem is that writing to a DVD-RAM, ZIP or MO device almost
> > totally blocks reading from a _different_ device. Here is some data.
Yes, I saw this with M/O disks; that's one reason the -ac tree doesn't adopt
all the ll_rw_blk/elevator changes from the vanilla tree.
> > DVD-RAM while reading from the (fast) 120GB HDD (benchmark 2) almost
> > totally blocks the read process. Under 2.4.19-pre5, it takes 14 times
You'll see this on other things too. Large file creates seem to basically
stall anything wanting swap
> > benchmarks 1-4, kernel 2.4.19-pre5 performed much worse than
> > 2.4.18. The reason may be that most of the throughput comes from the
> > short moments where, for whatever reason, the read speed increases
Fairness, throughput, latency - pick any two..
> Right fix is different but not suitable for 2.4.
Curious - what do you think the right fix is ?
On Tue, Apr 16, 2002 at 12:39:22PM -0300, Rik van Riel wrote:
> On Tue, 16 Apr 2002, Andrea Arcangeli wrote:
> > On Fri, Apr 05, 2002 at 11:04:18PM +0200, Moritz Franosch wrote:
>
> > > The problem is that writing to a DVD-RAM, ZIP or MO device almost
> > > totally blocks reading from a _different_ device. Here is some data.
> > >
> > > nr  bench  read       write      2.4.18  2.4.19-pre5  expected  factor
> > >  1  dd     30GB HDD   DVD-RAM       278          490        60     8.2
> > >  2  dd     120GB HDD  DVD-RAM       197          438        32      14
> > >  3  dd     30GB HDD   ZIP           158          239        60     4.0
> > >  4  dd     120GB HDD  ZIP           142          249        32     7.8
> > >  5  dd     30GB HDD   120GB HDD      87           89        60     1.5
> > >  6  dd     120GB HDD  30GB HDD       66           69        32     2.2
> > >  7  cp     30GB HDD   120GB HDD      97           77        60     1.3
> > >  8  cp     120GB HDD  30GB HDD       78           65        50     1.3
> > >
> > > The columns "2.4.18" and "2.4.19-pre5" list the execution times, in
> > > seconds, of the respective benchmark. The column "expected" lists the
> > > time I would have expected the benchmark to take with a "perfect"
> > > kernel. The "factor" column is how many times slower 2.4.19-pre5 is
> > > than a perfect kernel would be.
>
> > The reason the hard disk is faster is that the new algorithm is much
> > better than the previous mainline code. As for the DVD-RAM hanging the
> > machine more, that's probably because more RAM can be marked dirty with
> > the new changes (beneficial for some workloads, but it stalls the fast
> > hard disk much more if there is one very slow block device in the
> > system). You can try decreasing the percentage of dirty VM in the
> > system with:
> >
> > echo 2 500 0 0 500 3000 3 1 0 >/proc/sys/vm/bdflush
>
> Judging from the performance regression above it would seem the
> new defaults suck rocks.
>
> Can we please stop optimising Linux for a single workload benchmark
> and start tuning it for the common case of running multiple kinds
> of applications and making sure one application can't mess up the
> others ?
>
> Personally I couldn't care less if my tar went 30% faster if it
> meant having my desktop unresponsive for the whole time.
Your desktop is not unresponsive the whole time. The problem happens
only under a flood of writes to DVD or ZIP, and as you can see above,
2.4.18 sucks rocks for such a workload too; that's nothing new, not a
problem introduced by my changes. A more detailed explanation follows.
DVD-RAM and ZIP writes are dog slow, and for such a slow block device,
allowing 60% of freeable RAM to be locked in dirty buffers is exactly
like having to swap out on top of a DVD-RAM instead of on top of a
40 MB/sec hard disk while allocating memory. You will have to wait
minutes before that 60% of freeable RAM is flushed to disk. ZIP and
DVD-RAM write with a bandwidth lower than 1 MB/sec, so swapping out on
them can clearly lead a malloc to take minutes (having to flush dirty
data is, as said, equivalent to swapping out on them). OTOH, if you
don't reach the 60%, the DVD-RAM and ZIP will behave better with the
new changes; it's the cp /dev/zero /dvdram that causes you to wait the
100 seconds at the next malloc.
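To make the "minutes" concrete, a back-of-the-envelope calculation,
assuming roughly 256 MB of freeable RAM, the 60% dirty limit, and about
1 MB/sec of DVD-RAM write bandwidth (the numbers are only illustrative):

  # ~60% of 256 MB dirty, drained at ~1 MB/sec:
  ram_mb=256; dirty_pct=60; write_mb_per_s=1
  echo $(( ram_mb * dirty_pct / 100 / write_mb_per_s )) seconds   # ~153 s; ~5 min at 0.5 MB/sec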
So if you are used to cp /dev/zero /dvdram, you should definitely
reduce the maximum dirty amount of freeable RAM to 3%, and that's what
the bdflush tuning above does.

The only way to avoid having to set the nfract levels yourself is to
have them per block device. I don't see that as a 2.4 requirement, but
it would be nice to have for 2.5 at least.
This is an issue of swapping on a very slow HD: you will be slow, be
sure of that, with mainline too, and with the new changes even more,
because you will have to swap out more. Reducing the bdflush limits is
equivalent to swapping less. If the HD were as fast as memory we should
swap more; the slower the HD, the less we must swap to be fast. It's
not that we are optimizing for a single workload; it's that it's not
possible to derive mathematically the one number that gives the maximum
possible throughput while still providing good latency. The current
heuristic is tuned for a fairly normal HD, so a slower HD requires (and
always required, in previous mainline too) tuning the VM to avoid
swapping out too much if the HD runs at less than 1 MB/sec.
Not to mention that writes to a slow HD will hang all the writes to the
fast HD, and that's another basic design issue that is completely
unchanged between the two kernels, but that could be completely
rewritten to allow a fast HD to write at max speed while the slow HD
also writes at max speed. This is definitely not possible right now;
the fast HD will write only seldom, in small chunks, in such a
workload.
You should acknowledge that the new changes and defaults are better;
they are even better at exposing the kernel's basic design problems
with devices much slower than memory than a normal HD is. Just trying
to hide those design problems by setting a default of 3% would be
wrong; doing that would make your production desktops and servers much
slower. If the system is slow only during a backup to ZIP or DVD-RAM,
it's not a showstopper: no failures, just higher write and read
latencies and higher allocation latencies than when you run cp
/dev/zero on a normal HD.
Andrea
On Tue, Apr 16, 2002 at 05:09:17PM +0100, Alan Cox wrote:
> > > The problem is that writing to a DVD-RAM, ZIP or MO device almost
> > > totally blocks reading from a _different_ device. Here is some data.
>
> Yes, I saw this with M/O disks; that's one reason the -ac tree doesn't adopt
> all the ll_rw_blk/elevator changes from the vanilla tree.
That should have nothing to do with the elevator. The elevator matters
within the same disk; here it's the other (fast) disks that are slower
while you write to the slow ZIP/M-O/DVD-RAM. That has always been the
case; I remember it ever since I first ran 2.0.25 with ppa.
> > DVD-RAM while reading from the (fast) 120GB HDD (benchmark 2) almost
> > totally blocks the read process. Under 2.4.19-pre5, it takes 14 times
>
> You'll see this on other things too. Large file creates seem to basically
> stall anything wanting swap
the "wanting swap" bit also depends how much anon/shm ram there is in the
system compared to clean freeable cache. With the rest of the patches
applied you should not want swap during a large file create unless
you've quite a lot of physical ram mapped in shm/anon.
> > > benchmarks 1-4, kernel 2.4.19-pre5 performed much worse than
> > > 2.4.18. The reason may be that most of the throughput comes from the
> > > short moments where, for whatever reason, the read speed increases
>
> Fairness, throughput, latency - pick any two..
>
> > Right fix is different but not suitable for 2.4.
>
> Curious - what do you think the right fix is ?
One part of the fix is not to allow dirty buffers belonging to the
ZIP/M-O/DVD-RAM drives to grow over 3-4% of the total freeable RAM in
the system, so that the remaining 96-97% of freeable RAM can be
allocated nearly atomically, without blocking. And really that
percentage can depend on the user's needs too. If a user rewrites the
same data over and over and never uses more than 20% of the physical
RAM as cache, he will probably want a limit of 30%, not 3-4%, even if a
malloc requiring that 30% to be flushed to disk could then take several
minutes to return. It's not a trivial problem, but at least having
per-blkdev tunings would make it much better. 60% of RAM dirty against
something that writes 512 bytes/sec would be totally insane, for
example; if something writes at 512 bytes/sec we should allow at most a
few pages of cache to be dirty simultaneously. The best would be if the
kernel could learn a limit at runtime for each blkdev; the fixed 3-4%
is still not very appealing.
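As a sanity check on the 512 byte/sec example, the same arithmetic
(the single megabyte of dirty buffers is just an illustrative figure):

  # even 1 MB of dirty buffers against a 512 byte/sec device:
  echo $(( 1048576 / 512 )) seconds   # 2048 seconds, i.e. about 34 minutes to drain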
The other side of the fix is to rework how the writes in the BUF_DIRTY
list are submitted: right now, even if the allocations don't block, the
other async flushes will wait for those few pages to be written at
512 bytes/sec, even though those flushes could go to the fast HD in
parallel at 30 MB/sec.
The Linux VM (this has always been true since 2.0) is tuned for, and
behaves well with, normal HDs running at similar speeds; if the speeds
of the disks vary a lot, or if a disk is dog slow, Linux's async
flushing is not optimal. The new, more aggressive tunings just bring
this to light more, while other, more server-oriented workloads are
improved, because of the faster hardware and the fact that they
actually take advantage of the larger dirty cache, unlike the dd case,
where it wouldn't matter even if the writes were synchronous.
Andrea
On Tue, 16 Apr 2002, Alan Cox wrote:
> > > benchmarks 1-4, kernel 2.4.19-pre5 performed much worse than
> > > 2.4.18. The reason may be that most of the throughput comes from the
> > > short moments where, for whatever reason, the read speed increases
>
> Fairness, throughput, latency - pick any two..
Personally I try to go for fairness and latency in -rmap,
since most real workloads I've encountered don't seem to
have throughput problems.
The standard "it's getting slow" complaint has been about
response time and fairness 90% of the time, usually when
the system stalls one process during some other activity.
> > Right fix is different but not suitable for 2.4.
>
> Curious - what do you think the right fix is ?
Tuning the current system for latency and fairness should
keep most people happy. Desktop users really won't notice
if unpacking an RPM takes 20% longer, but having their
mp3 skip during RPM unpacking is generally considered
unacceptable.
regards,
Rik
--
http://www.linuxsymposium.org/2002/
"You're one of those condescending OLS attendants"
"Here's a nickle kid. Go buy yourself a real t-shirt"
http://www.surriel.com/ http://distro.conectiva.com/
> Judging from the performance regression above it would seem the
> new defaults suck rocks.
I first thought that 2.4.19-pre5 would be better than 2.4.18 because
vmstat showed that 2.4.19-pre5 could still read 1-2 MB per second from
the HDD while writing to DVD-RAM, whereas 2.4.18 blocked totally for
more than 10 seconds or so. But under both kernels (with default
bdflush parameters) there are short moments where you get data from the
HDD at a very high rate before it drops again. It seems that most of
the throughput over a long time comes from these short moments.
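(A minimal way to reproduce the observation: vmstat's "bi"/"bo" columns
show the blocks read from and written to block devices per interval;
the write target path below is just a placeholder.)

  # terminal 1: the write flood to the slow device
  dd if=/dev/zero of=/dvd/tmp bs=1000000 &
  # terminal 2: watch the read rate ("bi") collapse and occasionally burst
  vmstat 1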
> Can we please stop optimising Linux for a single workload benchmark
> and start tuning it for the common case of running multiple kinds
> of applications and making sure one application can't mess up the
> others ?
>
> Personally I couldn't care less if my tar went 30% faster if it
> meant having my desktop unresponsive for the whole time.
That's why I did the benchmarks in the first place: my desktop
was unresponsive while writing to DVD-RAM.
Moritz
--
Dipl.-Phys. Moritz Franosch
http://Franosch.org
Alan Cox <[email protected]> writes:
> > > benchmarks 1-4, kernel 2.4.19-pre5 performed much worse than
> > > 2.4.18. The reason may be that most of the throughput comes from the
> > > short moments where, for whatever reason, the read speed increases
>
> Fairness, throughput, latency - pick any two..
That's exactly the point. Writing large files to DVD-RAM leads to low
throughput when reading from the HDD and to long latencies, and (with
2.4.18) it doesn't even let the HDD read as much data as is written to
the DVD-RAM, which is very unfair.
My benchmarks show the bad throughput. What I first observed was bad
latency when writing to DVD-RAM (no mouse movement for 3 seconds or
so, long delays switching between applications, text output delayed
when typing). Fairness shouldn't be an issue, because 2.4.18 and
2.4.19-pre5 are also bad when the two tested disks are on different
IDE controllers, so no resources need to be shared between the reading
and the writing process (except RAM for cache, but there is plenty).
Moritz
--
Dipl.-Phys. Moritz Franosch
http://Franosch.org