2002-10-08 02:54:00

by Andrew Morton

Subject: Re: The reason to call it 3.0 is the desktop (was Re: [OT] 2.6 not 3.0 - (NUMA))

Theodore Ts'o wrote:
>
> ...
> It depends on what you are doing. BSD, and even XFS, uses the concept
> of cylinder groups or block groups as one of many tools to avoid
> file fragmentation and to concentrate locality for files within a
> directory. The reason why FAT filesystems have file fragmentation
> problems in a far worse way is that they don't have the
> concept of a block group, and simply always allocate from the
> beginning of the filesystem. This is effectively what would happen if
> you had a single block/cylinder group in the filesystem.
>

In the testing which I did, based on Keith Smith's traces, the
current code really isn't very effective.

What I did was to run his aging workload an increasing number of
times, then measure the fragmentation of the files which it left
behind. I measured the fragmentation simply by timing how long it
took to read all the files, and comparing that to how long it took
to read the same files when they had been laid down on a fresh fs.
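
Concretely, the procedure was roughly this ("replay-aging" is a
stand-in name for whatever replays Keith's traces; device names
are illustrative):

# baseline: lay the tree down on a fresh fs, remount to drop the cache
mke2fs /dev/hda1
mount /dev/hda1 /mnt/test
(cd /mnt/test && tar xf /path/to/linux-2.4.19.tar)
umount /mnt/test && mount /dev/hda1 /mnt/test
(cd /mnt/test && time find linux-2.4.19 -type f | xargs cat > /dev/null)

# aged: run the aging workload N times, then repeat the timed read
for i in 1 2 3 4 5 6 7 8 9 10; do replay-aging /mnt/test; done
umount /mnt/test && mount /dev/hda1 /mnt/test
(cd /mnt/test && time find linux-2.4.19 -type f | xargs cat > /dev/null)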

After ten aging rounds, with the current block allocator, we're
running 4x to 5x slower. With the Orlov allocator, we're
running 5x to 6x slower. Either way, that's a big performance
slowdown.

Orlov turns a 400% slowdown into a 500% slowdown, so it is a
25% regression for slow growth. But it is a 300% to 500%
improvement for fast growth. (Well, it used to be. But I
just fixed a memory-corrupting bug in it which I think has
slowed it down. It's now only double the speed on SCSI, triple
on IDE.)

What we need, *regardless* of which allocator is used, is effective
defrag tools.

I just retested.

2.5.41, SCSI:
time find linux-2.4.19 -type f | xargs cat > /dev/null
find linux-2.4.19 -type f 0.06s user 0.24s system 1% cpu 19.274 total
xargs cat > /dev/null 0.23s user 1.42s system 8% cpu 19.954 total

2.5.41, IDE:
time find linux-2.4.19 -type f | xargs cat > /dev/null
find linux-2.4.19 -type f 0.06s user 0.23s system 0% cpu 29.274 total
xargs cat > /dev/null 0.23s user 1.58s system 5% cpu 30.199 total

2.5.41+Orlov, SCSI:
time find linux-2.4.19 -type f | xargs cat > /dev/null
find linux-2.4.19 -type f 0.06s user 0.24s system 2% cpu 11.579 total
xargs cat > /dev/null 0.23s user 1.46s system 14% cpu 11.951 total

2.5.41+Orlov, IDE:
time find linux-2.4.19 -type f | xargs cat > /dev/null
find linux-2.4.19 -type f 0.06s user 0.24s system 2% cpu 12.225 total
xargs cat > /dev/null 0.22s user 1.59s system 14% cpu 12.500 total

We need some of that goodness.

>
> [ administrator hints ]
>

Alas, nobody uses them :(

Maybe a mount option? But I think the current algorithm should
default to "off".


2002-10-08 16:10:35

by Theodore Ts'o

Subject: Re: The reason to call it 3.0 is the desktop (was Re: [OT] 2.6 not 3.0 - (NUMA))

On Mon, Oct 07, 2002 at 07:59:26PM -0700, Andrew Morton wrote:
> In the testing which I did, based on Keith Smith's traces, the
> current code really isn't very effective.
>
> What I did was to run his aging workload an increasing number of
> times, then measure the fragmentation of the files which it left
> behind. I measured the fragmentation simply by timing how long it
> took to read all the files, and comparing that to how long it took
> to read the same files when they had been laid down on a fresh fs.

What access pattern did you use when you read the files? Did you
sweep through the filesystem directory by directory, or did you use
some other pattern (perhaps random)?

It would also be interesting to get a measure of fragmentation of the
filesystems as reported by e2fsck. That only measures file
fragmentation, not file locality on a per-directory basis (or, more
ideally, per directory tree, though establishing where the directory
trees are is difficult).
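
(Concretely: a forced read-only check of the unmounted device prints
that figure in its summary line. The device name and counts below are
made up for illustration:

e2fsck -fn /dev/hda1
...
/dev/hda1: 12345/262144 files (3.1% non-contiguous), 98765/524288 blocks

The "non-contiguous" percentage is the file fragmentation number I
mean.)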

> >
> > [ administrator hints ]
> >
>
> Alas, nobody uses them :(

No one will use them if they need to do so manually. But if we
can convert a few programs to use them, then it might work. And
people didn't much use madvise() when it was first introduced either,
but that doesn't mean that the existence of the interface was a bad
idea....

If the current algorithm is so bad, then maybe the trick is to use the
fast-growth-optimized allocator as the default, *unless* given a hint
via some magic mkdir flag. Then, if certain programs, such as
adduser (when creating a home directory), "cp -r", "bk clone", tar,
etc. were modified to give hints that a particular directory was
at the top of a directory tree, the slow-growth-optimized allocator
could be used to spread apart directory trees. No, it's not perfect,
but it should be better than not using any hints at all. (And yes, it
will take a while before the userspace tools that provide said hints
are widely deployed.)
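
To sketch what the hint might look like from userspace, suppose it
were an inode attribute rather than a literal mkdir flag; the "T"
attribute below is hypothetical, meaning "top of a directory
hierarchy":

# adduser marks the new home directory as the top of a tree
mkdir /home/newuser
chattr +T /home/newuser
lsattr -d /home/newuser

The allocator would then treat everything below such a directory as a
single slow-growth unit, and spread separate tree tops across block
groups.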

And if we don't have any user-space hints, then we default to the
fast-growth algorithm, which should make Linus happy. :-)

> Maybe a mount option? But I think the current algorithm should
> default to "off".

How about a mount option with the possible values "fast", "slow",
"hinted", and "auto", with the default being "auto" or "hinted"?
(Where "hinted" utilizes user-space hints, and "auto" utilizes
user-space hints if present, plus some of the so-called ugly
heuristics which you had discussed?)
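
In practice that might look like this ("alloc" is just a placeholder
name for the option):

mount -t ext3 -o alloc=hinted /dev/hda1 /home
mount -t ext3 -o alloc=auto /dev/hda2 /var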

- Ted