i was trying to study various cpu & i/o bottlenecks for a backup
tool (rdiff-backup) and i stumbled into this oddity:
# time xargs -0 -n100 cat -- > /dev/null < /tmp/filelist
0.520u 5.310s 0:36.92 15.7% 0+0k 0+0io 11275pf+0w
# time xargs -0 -n100 cat -- > /dev/null < /tmp/filelist
0.510u 5.090s 0:35.05 15.9% 0+0k 0+0io 11275pf+0w
# time xargs -0 -P2 -n100 cat -- > /dev/null < /tmp/filelist
0.500u 5.380s 1:30.51 6.4% 0+0k 0+0io 11275pf+0w
# time xargs -0 -P2 -n100 cat -- > /dev/null < /tmp/filelist
0.420u 4.810s 1:36.73 5.4% 0+0k 0+0io 11275pf+0w
3x slower with the two cats in parallel.
these particular measurements were taken on this system:
linux 2.4.19-pre10-ac2
SMP dual P3 500MHz
512MB PC100 memory
promise ultra100/tx2
maxtor 5400rpm 80GB disk
the filelist has 25400 files totalling 1700MB (two copies of
my ~/mail).
i thought it might be that the /dev/null file is shared amongst
the two parallel cats... so i wrote up a program that just
iterates over each file in 4k blocks and does nothing with the
data:
# time xargs -0 -n100 ~/tmp/reader < /tmp/filelist
0.300u 5.040s 0:37.07 14.4% 0+0k 0+0io 8735pf+0w
# time xargs -0 -n100 -P2 ~/tmp/reader < /tmp/filelist
0.320u 4.740s 1:35.30 5.3% 0+0k 0+0io 8735pf+0w
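(the reader was roughly this -- a minimal sketch, give or take error
handling; the point is just open(), read() in 4k chunks, discard:)

/* reader.c -- sketch of the test program: open each argument,
 * read it in 4 KB chunks, and throw the data away. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define BUFSZ 4096

int main(int argc, char **argv)
{
    char buf[BUFSZ];
    int i;

    for (i = 1; i < argc; i++) {
        int fd = open(argv[i], O_RDONLY);
        if (fd < 0) {
            perror(argv[i]);
            continue;
        }
        while (read(fd, buf, sizeof(buf)) > 0)
            ;                   /* do nothing with the data */
        close(fd);
    }
    return 0;
}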
any ideas what is going on here?
i've seen the same thing on an smp system with 4x maxtors and software
raid5.
-dean
p.s. i was trying to get more i/o in flight in hopes of seeing better
perf from the disk subsystem... i know it's hopeless with a single IDE
disk without TCQ, but i was targeting the 4-disk system. are there
any newfangled fun interfaces which can improve the tree recursion and
zillion file open() bottlenecks in filesystem-level backups?
On Mon, Jun 17, 2002 at 01:03:15PM -0700, dean gaudet wrote:
> 3x slower with the two cats in parallel.
cat uses an incredibly small buffer for file io (4KB on x86), so
running multiple cats in parallel will simply thrash your disk.
What you really want is to run the open()s in parallel and the
read()s sequentially (or in parallel with a large buffer to cut
down on the seek cost).
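Something along these lines, say (just a sketch of the idea: one thread
open()s ahead while the main thread does the read()s sequentially with a
64KB buffer; build with -lpthread):

/* sketch: opener thread keeps a queue of open fds ahead of the
 * single reader; error handling kept minimal on purpose. */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define QSIZE 64
#define BUFSZ (64 * 1024)

static int queue[QSIZE];
static int qhead, qtail, qcount, done;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t notempty = PTHREAD_COND_INITIALIZER;
static pthread_cond_t notfull = PTHREAD_COND_INITIALIZER;

struct args { int argc; char **argv; };

static void *opener(void *p)
{
    struct args *a = p;
    int i;

    for (i = 1; i < a->argc; i++) {
        int fd = open(a->argv[i], O_RDONLY);
        if (fd < 0) {
            perror(a->argv[i]);
            continue;
        }
        pthread_mutex_lock(&lock);
        while (qcount == QSIZE)
            pthread_cond_wait(&notfull, &lock);
        queue[qtail] = fd;
        qtail = (qtail + 1) % QSIZE;
        qcount++;
        pthread_cond_signal(&notempty);
        pthread_mutex_unlock(&lock);
    }
    pthread_mutex_lock(&lock);
    done = 1;
    pthread_cond_signal(&notempty);
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(int argc, char **argv)
{
    static char buf[BUFSZ];         /* static: keep it off the stack */
    struct args a = { argc, argv };
    pthread_t tid;

    pthread_create(&tid, NULL, opener, &a);
    for (;;) {
        int fd;
        pthread_mutex_lock(&lock);
        while (qcount == 0 && !done)
            pthread_cond_wait(&notempty, &lock);
        if (qcount == 0 && done) {
            pthread_mutex_unlock(&lock);
            break;
        }
        fd = queue[qhead];
        qhead = (qhead + 1) % QSIZE;
        qcount--;
        pthread_cond_signal(&notfull);
        pthread_mutex_unlock(&lock);

        while (read(fd, buf, sizeof(buf)) > 0)
            ;                       /* discard the data */
        close(fd);
    }
    pthread_join(tid, NULL);
    return 0;
}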
-ben
--
"You will be reincarnated as a toad; and you will be much happier."
On Mon, 17 Jun 2002, Benjamin LaHaise wrote:
> On Mon, Jun 17, 2002 at 01:03:15PM -0700, dean gaudet wrote:
> > 3x slower with the two cats in parallel.
>
> cat uses an incredibly small buffer for file io (4KB on x86), so
> running multiple cats in parallel will simply thrash your disk.
> What you really want is to run the open()s in parallel and the
> read()s sequentially (or in parallel with a large buffer to cut
> down on the seek cost).
using a 64KB buffer makes the xargs -P2 only twice as long as the -P1 ...
so that's an improvement, but still something seems odd. (btw, many of
the files are tiny anyhow -- a bunch of maildirs mixed in amongst the
files.)
i'll try playing around with threading the open()s.
-dean
dean gaudet wrote:
>
> i was trying to study various cpu & i/o bottlenecks for a backup
> tool (rdiff-backup) and i stumbled into this oddity:
>
> # time xargs -0 -n100 cat -- > /dev/null < /tmp/filelist
> 0.520u 5.310s 0:36.92 15.7% 0+0k 0+0io 11275pf+0w
> # time xargs -0 -n100 cat -- > /dev/null < /tmp/filelist
> 0.510u 5.090s 0:35.05 15.9% 0+0k 0+0io 11275pf+0w
>
> # time xargs -0 -P2 -n100 cat -- > /dev/null < /tmp/filelist
> 0.500u 5.380s 1:30.51 6.4% 0+0k 0+0io 11275pf+0w
> # time xargs -0 -P2 -n100 cat -- > /dev/null < /tmp/filelist
> 0.420u 4.810s 1:36.73 5.4% 0+0k 0+0io 11275pf+0w
>
> 3x slower with the two cats in parallel.
Note that the CPU time remained constant. The wall time went up.
You did more seeking with the dual-thread approach.
It rather depends on what is in /tmp/filelist. I assume it's
something like the output of `find'. And I assume you're
using ext2 or ext3?
- ext2/3 will chop the filesystem up into 128-megabyte block groups.
- It attempts to place all the files in a directory into the same
block group.
- It will explicitly place new directories into a different block group
from their parent.
And I suspect it's the latter point which has caught you out. You have
two threads, and probably each thread's list of 100 files is from a
different directory. And hence it lives in a different block group.
And hence your two threads are competing for the disk head.
Even increasing the elevator read latency won't help you here - we don't
perform inter-file readahead, so as soon as thread 1 blocks on a read,
it has *no* reads queued up and the other thread's requests are then
serviced.
You'll get best throughput with a single read thread. There are some
smarter readahead things we could do in there, but it tends to be
that device-level readahead fixes everything up anyway.
-
On Mon, 17 Jun 2002, Andrew Morton wrote:
> It rather depends on what is in /tmp/filelist. I assume it's
> something like the output of `find'. And I assume you're
> using ext2 or ext3?
yup: find mail1 mail2 -type f -print0 >/tmp/filelist
ext3.
thanks for the description of the block groups... that all makes more
sense now.
> You'll get best throughput with a single read thread.
what if you have a disk array with lots of spindles? it seems at some
point that you need to give the array or some lower level driver a lot of
i/os to choose from so that it can get better parallelism out of the
hardware.
i see similar slowdowns on a 4-IDE-disk softRAID5+LVM+ext3 setup (SMP
2.4.19-pre7-ac4). not nearly as bad as on the single disk.
-dean
dean gaudet wrote:
>
> ...
> > You'll get best throughput with a single read thread.
>
> what if you have a disk array with lots of spindles? it seems at some
> point that you need to give the array or some lower level driver a lot of
> i/os to choose from so that it can get better parallelism out of the
> hardware.
mm. For that particular test, you'd get nice speedups from striping
the blockgroups across disks, so each `cat' is probably talking to
a different disk. I don't think I've seen anything like that proposed
though.
But regardless of the disk topology, the sanest way to get good IO
scheduling is to throw a lot of requests at the block layer. That's
simple for writes. But for reads, it's harder.
You could fork one `cat' per file ;) (Not so silly, really. But if
you took this approach, you'd need "many" more threads than blockgroups).
Or teach `cat' to perform asynchronous (aio) reads. You'd need async
opens, too. But generally we get a good cache hit rate against the
data which is needed to open a small file.
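Roughly this shape, using the POSIX aio interface (glibc implements it
with threads rather than in-kernel aio, so treat it as a sketch of the
idea only: it keeps eight reads in flight, grabbing the first 64KB of
each argument and discarding it; build with -lrt):

/* sketch: several reads in flight at once via POSIX aio_read(). */
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define INFLIGHT 8
#define BUFSZ (64 * 1024)

static struct aiocb cb[INFLIGHT];
static char buf[INFLIGHT][BUFSZ];
static int busy[INFLIGHT];

/* wait for slot s to finish, then release it */
static void reap(int s)
{
    const struct aiocb *list[1] = { &cb[s] };

    while (aio_error(&cb[s]) == EINPROGRESS)
        aio_suspend(list, 1, NULL);
    aio_return(&cb[s]);
    close(cb[s].aio_fildes);
    busy[s] = 0;
}

int main(int argc, char **argv)
{
    int i, next = 0;

    for (i = 1; i < argc; i++) {
        int fd = open(argv[i], O_RDONLY);
        int s = next++ % INFLIGHT;

        if (fd < 0) {
            perror(argv[i]);
            next--;
            continue;
        }
        if (busy[s])            /* all slots in flight: wait for the oldest */
            reap(s);

        memset(&cb[s], 0, sizeof(cb[s]));
        cb[s].aio_fildes = fd;
        cb[s].aio_buf = buf[s];
        cb[s].aio_nbytes = BUFSZ;   /* only the first 64KB, for the sketch */
        cb[s].aio_offset = 0;
        cb[s].aio_sigevent.sigev_notify = SIGEV_NONE;
        if (aio_read(&cb[s]) < 0) {
            perror("aio_read");
            close(fd);
            next--;
            continue;
        }
        busy[s] = 1;
    }

    for (i = 0; i < INFLIGHT; i++)      /* drain outstanding requests */
        if (busy[i])
            reap(i);
    return 0;
}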
hmm. What else? Physical readahead - read metadata into the block
device's pagecache and flip pages from there into directories and
files on-demand. Fat chance of that happening.
Or change ext2/3 to not place directories in different block groups
at all. That's super-effective, but does cause somewhat worse long-term
fragmentation.
You can probably lessen the seek-rate by accessing the files in the correct
order. Read all the files from a directory before descending into any of
its subdirectories. Can find(1) do that? You should be able to pretty
much achieve disk bandwidth this way - it depends on how bad the inter-
and intra-file fragmentation has become.
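Something like this would do it (sketch only; a real version would
remember the subdirectory names from the first pass instead of
lstat()ing everything twice):

/* sketch: emit all regular files in a directory before descending
 * into any of its subdirectories. */
#include <dirent.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

static void walk(const char *dir)
{
    DIR *d = opendir(dir);
    struct dirent *de;
    char path[PATH_MAX];
    struct stat st;
    int pass;

    if (!d)
        return;
    /* pass 0: regular files, pass 1: subdirectories */
    for (pass = 0; pass < 2; pass++) {
        rewinddir(d);
        while ((de = readdir(d)) != NULL) {
            if (!strcmp(de->d_name, ".") || !strcmp(de->d_name, ".."))
                continue;
            snprintf(path, sizeof(path), "%s/%s", dir, de->d_name);
            if (lstat(path, &st) < 0)
                continue;
            if (pass == 0 && S_ISREG(st.st_mode))
                puts(path);             /* or open and read it here */
            else if (pass == 1 && S_ISDIR(st.st_mode))
                walk(path);
        }
    }
    closedir(d);
}

int main(int argc, char **argv)
{
    int i;

    for (i = 1; i < argc; i++)
        walk(argv[i]);
    return 0;
}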
-
On Mon, 17 Jun 2002, Andrew Morton wrote:
> dean gaudet wrote:
> > what if you have a disk array with lots of spindles? it seems at some
> > point that you need to give the array or some lower level driver a lot of
> > i/os to choose from so that it can get better parallelism out of the
> > hardware.
>
> mm. For that particular test, you'd get nice speedups from striping
> the blockgroups across disks, so each `cat' is probably talking to
> a different disk. I don't think I've seen anything like that proposed
> though.
heh, a 128MB stripe? that'd be huge :)
> You could fork one `cat' per file ;) (Not so silly, really. But if
> you took this approach, you'd need "many" more threads than blockgroups).
i actually tried this first :) the problem then becomes a fork()
bottleneck before you run into the disk bottlenecks. iirc the numbers
were ~45s for the 1-file-per-cat (for any -Pn, n<=10), ~30s for
100-files-per-cat (-P1) and ~1m15s for 100-files-per-cat (-P2).
> hmm. What else? Physical readahead - read metadata into the block
> device's pagecache and flip pages from there into directories and
> files on-demand. Fat chance of that happening.
one idea i had -- given that the server has a volume manager and you're
working from a snapshot volume anyhow (only sane way to do backups), it
might make a lot more sense to use userland ext2/3 libraries to read the
snapshot block device directly. but this kind of makes me cringe :)
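something like this with e2fsprogs' libext2fs, maybe (completely
untested sketch, just to show the shape of it; build against
-lext2fs -lcom_err):

/* sketch: read one file straight off an ext2/ext3 block device. */
#include <ext2fs/ext2fs.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    ext2_filsys fs;
    ext2_ino_t ino;
    ext2_file_t file;
    char buf[64 * 1024];
    unsigned int got;
    errcode_t err;

    if (argc != 3) {
        fprintf(stderr, "usage: %s <device> <path-within-fs>\n", argv[0]);
        return 1;
    }

    err = ext2fs_open(argv[1], 0, 0, 0, unix_io_manager, &fs);
    if (err)
        return 1;

    /* resolve the path to an inode, then read it out in 64KB chunks */
    err = ext2fs_namei(fs, EXT2_ROOT_INO, EXT2_ROOT_INO, argv[2], &ino);
    if (!err)
        err = ext2fs_file_open(fs, ino, 0, &file);
    if (!err) {
        while (ext2fs_file_read(file, buf, sizeof(buf), &got) == 0 && got > 0)
            ;                   /* do something with buf[0..got) here */
        ext2fs_file_close(file);
    }
    ext2fs_close(fs);
    return err ? 1 : 0;
}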
-dean
On Jun 17, 2002 17:36 -0700, Andrew Morton wrote:
> You can probably lessen the seek-rate by accessing the files in the correct
> order. Read all the files from a directory before descending into any of
> its subdirectories. Can find(1) do that? You should be able to pretty
> much achieve disk bandwidth this way - it depends on how bad the inter-
> and intra-file fragmentation has become.
Just FYI - "find -depth" will do that, from find(1):
-depth Process each directory's contents before the directory itself.
Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/
On Mon, 17 Jun 2002, Andreas Dilger wrote:
> On Jun 17, 2002 17:36 -0700, Andrew Morton wrote:
> > You can probably lessen the seek-rate by accessing the files in the correct
> > order. Read all the files from a directory before descending into any of
> > its subdirectories. Can find(1) do that? You should be able to pretty
> > much achieve disk bandwidth this way - it depends on how bad the inter-
> > and intra-file fragmentation has become.
>
> Just FYI - "find -depth" will do that, from find(1):
> -depth Process each directory's contents before the directory itself.
not quite... even with this, find processes a directory's contents in the
order the filenames appear on disk, so it will still recurse into subdirs
before it has finished with all the files in the current directory.
what i think andrew was suggesting is to process all non-directory entries
before handling the directory entries.
i'm working with something like this:
find -type f -print \
| perl -ne '@c = split(/\//); print $#c . " $_";' \
| sort -k 1,1nr -k 2,2 \
| perl -pe 's#^\d+ ##;' \
> filelist
(deliberately not using perl's sort because i want the thing to scale of
course ;)
that reverse sorts by number of path components first, pathname second.
it seems to be worth another 10% or so improvement on the single reader
test.
i figured handling the leaves first was best... hard to tell if there's
any difference if i remove the "r". (this is on a live system with other
loads perturbing the results unfortunately.)
-dean
Jamie Lokier's treescan may be of interest:
http://www.tantalophile.demon.co.uk/treescan/
I noticed you're using IDE. Would Jens' TCQ
stuff achieve better parallelism for you?
Does mount -o remount,noatime,nodiratime make
any difference?
Padraig.