2002-10-08 13:50:02

by Helge Hafting

Subject: Re: The reason to call it 3.0 is the desktop (was Re: [OT] 2.6 not 3.0 - (NUMA))

Andrew Morton wrote:
>
[...]
> ext2 and ext3 filesystems are carved up into "block groups", aka
> "cylinder groups". Each one is 4096*8 blocks - typically 128 MB.
> So you can easily have hundreds of blockgroups on a single partition.
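
For concreteness, with the common 4 KB blocksize the arithmetic is
(the group size is fixed by one block bitmap fitting in one block):

        4096 bytes/block * 8 bits/byte  = 32768 blocks per bitmap block
        32768 blocks * 4096 bytes/block = 128 MB per block group
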
>
> The inode allocator is designed to arrange that files which are within the
> same directory fall in the same blockgroup, for locality of reference.
>
> But new directories are placed "far away", in block groups which have
> plenty of free space. (find_group_dir -> find a blockgroup for a
> directory).
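
Schematically, the directory policy amounts to something like this (a
simplified sketch, not the real fs/ext2/ialloc.c code; the names and
signature are made up):

static int find_group_dir_sketch(int ngroups, const int *free_blocks,
                                 const int *free_inodes)
{
        int g, best = -1, best_free = -1;

        for (g = 0; g < ngroups; g++) {
                if (free_inodes[g] == 0)
                        continue;       /* no room for the inode itself */
                /* spread directories out: prefer the group with the most
                   free blocks, regardless of where the parent lives */
                if (free_blocks[g] > best_free) {
                        best_free = free_blocks[g];
                        best = g;
                }
        }
        return best;    /* -1 if the filesystem is out of inodes */
}
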
>
> The thinking here is that files in a separate directory are related,
> and files in different directories are unrelated. So we can take
> advantage of that heuristic - go and use a new blockgroup each time
> a new directory is created. This is a levelling algorithm which
> tries to keep all blockgroups at a similar occupancy level.
> That's a good thing, because high occupancy levels lead to fragmentation.
>
> find_group_other() is basically first-fit-from-start-of-disk, and
> if we use that for directories as well as files, your untar-onto-a-clean-disk
> simply lays everything out in a contiguous chunk.
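
The file policy, by contrast, is roughly this (same caveats: a
simplified sketch with invented names, not the actual kernel code):

static int find_group_other_sketch(int ngroups, const int *free_inodes,
                                   int parent_group)
{
        int g;

        /* keep files in the same group as their directory */
        if (free_inodes[parent_group] > 0)
                return parent_group;
        /* otherwise fall back to first fit from the start of the disk */
        for (g = 0; g < ngroups; g++)
                if (free_inodes[g] > 0)
                        return g;
        return -1;
}
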
>
> Part of the problem here is that it has got worse over time. The
> size of a blockgroup is hardwired to blocksize*bits-in-a-byte*blocksize.
> But disks keep on getting bigger. Five years ago (when, presumably, this
> algorithm was designed), a typical partition had, what? Maybe four
> blockgroups? Now it has hundreds, and so the "levelling" is levelling
> across hundreds of blockgroups and not just a handful.
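
To put numbers on that (partition sizes picked only for illustration):

        1 KB blocks:   8 MB per group -> a 512 MB partition has  64 groups
        4 KB blocks: 128 MB per group -> an 80 GB partition has 640 groups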

If having only "a few" block groups really works better
(even for today's bigger disks), then bigger
block groups seem like a solution.

Changing the on-disk format might not be popular, but there
is no need for that: simply regard several on-disk block
groups as one bigger "allocation group" when using the above
algorithm. This should be perfectly backwards compatible.
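
Something like this mapping in the allocator would do it (an
illustrative sketch only; the constant and names are invented, and
16 is an arbitrary choice):

#define GROUPS_PER_ALLOC_GROUP  16      /* 16 x 128 MB = 2 GB units */

/* Level new directories across these larger units instead of across
   raw on-disk groups; any on-disk group inside the chosen unit can
   then be used, so the on-disk format is untouched. */
static inline int alloc_group(int block_group)
{
        return block_group / GROUPS_PER_ALLOC_GROUP;
}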

Helge Hafting


2002-10-08 15:28:37

by Andreas Dilger

Subject: Re: The reason to call it 3.0 is the desktop (was Re: [OT] 2.6 not 3.0 - (NUMA))

On Oct 08, 2002 15:54 +0200, Helge Hafting wrote:
> Andrew Morton wrote:
> > Part of the problem here is that it has got worse over time. The
> > size of a blockgroup is hardwired to blocksize*bits-in-a-byte*blocksize.
> > But disks keep on getting bigger. Five years ago (when, presumably, this
> > algorithm was designed), a typical partition had, what? Maybe four
> > blockgroups? Now it has hundreds, and so the "levelling" is levelling
> > across hundreds of blockgroups and not just a handful.
>
> If having only "a few" block groups really works better
> (even for today's bigger disks), then bigger
> block groups seem like a solution.
>
> Changing the on-disk format might not be popular, but there
> is no need for that: simply regard several on-disk block
> groups as one bigger "allocation group" when using the above
> algorithm. This should be perfectly backwards compatible.

We already have plans for something like this - a "meta blockgroup".
This will help us with several things, actually, so it is likely to
be implemented.

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/