2002-10-07 19:31:59

by Andrew Morton

Subject: Re: The reason to call it 3.0 is the desktop (was Re: [OT] 2.6 not 3.0 - (NUMA))

Chris Friesen wrote:
>
> Andrew Morton wrote:
>
> > Go into ext2_new_inode, replace the call to find_group_dir with
> > find_group_other. Then untar a kernel tree, unmount the fs,
> > remount it and see how long it takes to do a
> >
> > `find . -type f | xargs cat > /dev/null'
> >
> > on that tree. If your disk is like my disk, you will achieve
> > full disk bandwidth.
>
> Pardon my ignorance, but what's the difference between find_group_dir
> and find_group_other, and why aren't we using find_group_other already
> if it's so much faster?
>

ext2 and ext3 filesystems are carved up into "block groups", aka
"cylinder groups". Each one is 4096*8 blocks - typically 128 MB.
So you can easily have hundreds of blockgroups on a single partition.

The inode allocator is designed to arrange that files which are within the
same directory fall in the same blockgroup, for locality of reference.

But new directories are placed "far away", in block groups which have
plenty of free space. (find_group_dir -> find a blockgroup for a
directory).

The thinking here is that files within the same directory are related,
and files in different directories are unrelated. So we can take
advantage of that heuristic - go and use a new blockgroup each time
a new directory is created. This is a levelling algorithm which
tries to keep all blockgroups at a similar occupancy level.
That's a good thing, because high occupancy levels lead to fragmentation.

find_group_other() is basically first-fit-from-start-of-disk, and
if we use that for directories as well as files, your untar-onto-a-clean-disk
simply lays everything out in a contiguous chunk.
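
In (very) rough pseudo-C, the two policies look something like this.
This is just a sketch of the idea, not the actual kernel code - locking
and the real on-disk group descriptors are left out, and the struct and
function names below are made up:

struct group_stats {
	unsigned free_inodes;
	unsigned free_blocks;
};

/*
 * find_group_other()-style: effectively first-fit.  Start at the parent
 * directory's group and take the first group with a free inode, so an
 * untar onto a clean filesystem stays in one contiguous region of the disk.
 */
static int sketch_find_group_other(struct group_stats *g, int ngroups,
				   int parent_group)
{
	int i, group;

	for (i = 0; i < ngroups; i++) {
		group = (parent_group + i) % ngroups;
		if (g[group].free_inodes > 0)
			return group;
	}
	return -1;	/* out of inodes */
}

/*
 * find_group_dir()-style: for a new directory, ignore the parent and pick
 * a group with above-average free inodes and the most free blocks, which
 * spreads directories (and hence their files) across the whole disk.
 */
static int sketch_find_group_dir(struct group_stats *g, int ngroups,
				 unsigned total_free_inodes)
{
	unsigned avg = total_free_inodes / ngroups;
	int i, best = -1;

	for (i = 0; i < ngroups; i++) {
		if (g[i].free_inodes < avg)
			continue;
		if (best < 0 || g[i].free_blocks > g[best].free_blocks)
			best = i;
	}
	return best;
}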

Part of the problem here is that it has got worse over time. The
size of a blockgroup is hardwired to blocksize*bits-in-a-byte*blocksize.
But disks keep on getting bigger. Five years ago (when, presumably, this
algorithm was designed), a typical partition had, what? Maybe four
blockgroups? Now it has hundreds, and so the "levelling" is levelling
across hundreds of blockgroups and not just a handful.
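
To put rough numbers on that (back-of-the-envelope only, assuming a 4k
blocksize and ignoring the space eaten by bitmaps and inode tables):

#include <stdio.h>

int main(void)
{
	unsigned long long blocksize = 4096;
	/* one block of block bitmap covers blocksize*8 blocks */
	unsigned long long group_bytes = blocksize * 8 * blocksize;
	unsigned long long partitions[] = {
		512ULL << 20,	/* a late-90s sized partition */
		40ULL << 30,	/* a current-ish disk */
	};
	int i;

	printf("blockgroup size: %llu MB\n", group_bytes >> 20);
	for (i = 0; i < 2; i++)
		printf("%6llu MB partition -> %llu blockgroups\n",
		       partitions[i] >> 20, partitions[i] / group_bytes);
	return 0;
}

which prints a 128 MB blockgroup, 4 groups for the 512 MB partition and
320 for the 40 GB one - so the "levelling" is now spread across hundreds
of groups rather than a handful.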

I did a lot of work on this back in November 2001, mainly testing
with a trace-based workload from Keith Smith. See
http://www.eecs.harvard.edu/~keith/usenix.1995.html

Al Viro wrote a modified allocator (which nobody understood ;))
based on Orlov's algorithm.

I ended up concluding that the current (sucky) code is indeed
best for minimising long-term fragmentation under slow-growth
scenarios. And worst for fast-growth.

Orlov was in between on both.

Simply nuking find_group_dir() was best for fast-growth, worst
for slow-growth.

Block allocators are fertile grounds for academic papers. It's
complex. There is a risk that you can do something which is
cool in testing, but ends up exploding horridly after a year's
use. By which time we have ten million deployed systems running like
dogs, damn all we can do about it.

The best solution is to use first-fit and online defrag to fix the
long-term fragmentation. It really is. There has been no appreciable
progress on this.

A *practical* solution is to keep a spare partition empty and do
a `cp -a' from one partition onto another once per week and
swizzle the mountpoints. Because the big copy will unfragment
everything.

ho-hum. I shall forward-port Orlov, and again attempt to understand
it ;)


2002-10-08 02:31:41

by Simon Kirby

Subject: Re: The reason to call it 3.0 is the desktop (was Re: [OT] 2.6 not 3.0 - (NUMA))

On Mon, Oct 07, 2002 at 12:36:48PM -0700, Andrew Morton wrote:

> Block allocators are fertile grounds for academic papers. It's
> complex. There is a risk that you can do something which is
> cool in testing, but ends up exploding horridly after a year's
> use. By which time we have ten million deployed systems running like
> dogs, damn all we can do about it.
>
> The best solution is to use first-fit and online defrag to fix the
> long-term fragmentation. It really is. There has been no appreciable
> progress on this.
>
> A *practical* solution is to keep a spare partition empty and do
> a `cp -a' from one partition onto another once per week and
> swizzle the mountpoints. Because the big copy will unfragment
> everything.

Having seen fragmentation issues build up on (mbox) mail spools over
several years first hand, I can say that mail spools definitely show the
need for a defragmentation tool. I remember actually doing the "cp -a"
trick just to restore the mail server to decent performance (which
worked amazingly well, for another few months). (This was before we
switched to hashed directories and a POP3 server which caches mbox
message offsets/UIDLs/states.)

Being able to defragment online would be very useful. I've seen some
people talk about this every so often. How far away is it?

Simon-

[ Simon Kirby ][ Network Operations ]
[ [email protected] ][ NetNation Communications ]
[ Opinions expressed are not necessarily those of my employer. ]

2002-10-08 02:59:13

by Daniel Phillips

Subject: Re: The reason to call it 3.0 is the desktop (was Re: [OT] 2.6 not 3.0 - (NUMA))

On Tuesday 08 October 2002 04:36, Simon Kirby wrote:
> Being able to defragment online would be very useful. I've seen some
> people talk about this every so often. How far away is it?

The vfs consistency semantics are a little complex and fragile at the
moment, which is the only thing that makes it hard. Think about how
many months of truncate bugs we had, then consider how the situation
looks when all the bits of the filesystem are moving around while it's being
accessed. That's not to say it won't happen, but it's unlikely to ever
be solid until the vfs semantics mature a little more.

--
Daniel