2002-10-07 18:54:24

by Chris Friesen

Subject: Re: The reason to call it 3.0 is the desktop (was Re: [OT] 2.6 not 3.0 - (NUMA))

Andrew Morton wrote:

> Go into ext2_new_inode, replace the call to find_group_dir with
> find_group_other. Then untar a kernel tree, unmount the fs,
> remount it and see how long it takes to do a
>
> `find . -type f | xargs cat > /dev/null'
>
> on that tree. If your disk is like my disk, you will achieve
> full disk bandwidth.

Pardon my ignorance, but what's the difference between find_group_dir
and find_group_other, and why aren't we using find_group_other already
if it's so much faster?

Chris


2002-10-07 19:20:28

by Daniel Phillips

Subject: Re: The reason to call it 3.0 is the desktop (was Re: [OT] 2.6 not 3.0 - (NUMA))

On Monday 07 October 2002 20:58, Chris Friesen wrote:
> Andrew Morton wrote:
>
> > Go into ext2_new_inode, replace the call to find_group_dir with
> > find_group_other. Then untar a kernel tree, unmount the fs,
> > remount it and see how long it takes to do a
> >
> > `find . -type f | xargs cat > /dev/null'
> >
> > on that tree. If your disk is like my disk, you will achieve
> > full disk bandwidth.
>
> Pardon my ignorance, but what's the difference between find_group_dir
> and find_group_other, and why aren't we using find_group_other already
> if it's so much faster?

These are the heuristics that determine where in the volume directory
inodes are allocated:

http://lxr.linux.no/source/fs/ext2/ialloc.c#L221

Ext2 likes to spread directory inodes around the volume so that there is
room to keep the associated file blocks nearby. This interacts rather
poorly with readahead.
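
To make the difference between the two heuristics concrete, here is a
minimal toy model in Python. It is a sketch under stated assumptions, not
the kernel code: the Group class, the place_dirs helper, and the numbers
are invented for illustration, and the two functions only imitate the
general behaviour (spread directories across groups versus stay in the
parent's group).

```python
from dataclasses import dataclass


@dataclass
class Group:
    """Toy stand-in for a block group descriptor (illustrative only)."""
    free_inodes: int
    free_blocks: int


def find_group_dir(groups, parent):
    """Spreading policy: among groups with at least the average number of
    free inodes, pick the one with the most free blocks. The parent's
    group is ignored on purpose."""
    avg = sum(g.free_inodes for g in groups) / len(groups)
    candidates = [i for i, g in enumerate(groups)
                  if g.free_inodes and g.free_inodes >= avg]
    if not candidates:
        return -1
    return max(candidates, key=lambda i: groups[i].free_blocks)


def find_group_other(groups, parent):
    """Clustering policy: prefer the parent's group, otherwise take the
    next group (wrapping around) that still has inodes and blocks."""
    n = len(groups)
    for step in range(n):
        i = (parent + step) % n
        if groups[i].free_inodes and groups[i].free_blocks:
            return i
    return -1


def place_dirs(policy, ngroups=8, ndirs=6):
    """Create ndirs directories under a parent living in group 0 and
    return the group chosen for each one."""
    groups = [Group(free_inodes=100, free_blocks=1000) for _ in range(ngroups)]
    placed = []
    for _ in range(ndirs):
        g = policy(groups, 0)
        groups[g].free_inodes -= 1
        groups[g].free_blocks -= 50  # pretend the directory's files land here
        placed.append(g)
    return placed


if __name__ == "__main__":
    print("find_group_dir  :", place_dirs(find_group_dir))
    print("find_group_other:", place_dirs(find_group_other))
```

In this model the spreading policy hands each new directory a different
group, while the clustering policy keeps all of them in the parent's
group, which is the effect the suggested experiment exploits.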

--
Daniel

2002-10-07 19:32:32

by Linus Torvalds

Subject: Re: The reason to call it 3.0 is the desktop (was Re: [OT] 2.6 not 3.0 - (NUMA))


On Mon, 7 Oct 2002, Daniel Phillips wrote:
>
> Ext2 likes to spread directory inodes around the volume so that there is
> room to keep the associated file blocks nearby. This interacts rather
> poorly with readahead.

Not a read-ahead problem. It interacts rather poorly _full_stop_.

It means that the inode tables are spread all out, the bitmaps are
fragmented etc, so the disk head has to move all over the disk even when
only working with one directory tree like the kernel sources.

Kernel-level read-ahead doesn't much help, because the FS tries to keep
the data blocks for individual files together - which is the case the
kernel _can_ try to optimize a bit. Physical read-ahead doesn't work
either, since the parts that can be physically read ahead are the ones
that the regular in-file read-ahead already mostly takes care of.

So the problem with spreading stuff out doesn't have anything to do with
read-ahead, and has everything to do with the basic issue of BAD LOCALITY.
Locality is _good_, independently of read-ahead and independently of
medium.

Locality helps regardless of any read-ahead, although it is clearly true
that bad locality makes readahead more futile.

Linus

2002-10-08 00:34:15

by Theodore Ts'o

Subject: Re: The reason to call it 3.0 is the desktop (was Re: [OT] 2.6 not 3.0 - (NUMA))

On Mon, Oct 07, 2002 at 12:35:26PM -0700, Linus Torvalds wrote:
>
> On Mon, 7 Oct 2002, Daniel Phillips wrote:
> >
> > Ext2 likes to spread directory inodes around the volume so that there is
> > room to keep the associated file blocks nearby. This interacts rather
> > poorly with readahead.
>
> Not a read-ahead problem. It interacts rather poorly _full_stop_.
>
> It means that the inode tables are spread all out, the bitmaps are
> fragmented etc, so the disk head has to move all over the disk even when
> only working with one directory tree like the kernel sources.

It depends on what you are doing. BSD, and even XFS, use the concept
of cylinder groups or block groups as one of many tools to avoid
file fragmentation and to concentrate locality for files within a
directory. The reason why FAT filesystems have far worse file
fragmentation problems is that they don't have the
concept of a block group, and simply always allocate from the
beginning of the filesystem. This is effectively what would happen if
you had a single block/cylinder group in the filesystem.
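
The fragmentation argument can be illustrated with a small simulation.
This is only a sketch with invented names (simulate, count_fragments)
and an assumed policy: each file has a "home" group and takes the lowest
free block at or after the start of that group. With a single group,
every file allocates from the beginning of the volume, which is the
FAT-like case described above.

```python
def count_fragments(blocks):
    """Number of contiguous runs in a file's block list."""
    return 1 + sum(1 for a, b in zip(blocks, blocks[1:]) if b != a + 1)


def simulate(ngroups, group_size=16, nfiles=2, appends=4):
    """Files take turns appending one block each.

    Each file searches for a free block starting at its home group
    (file index modulo ngroups), wrapping around the volume if needed.
    Returns a dict mapping file index to its list of block numbers.
    """
    total = ngroups * group_size
    free = [True] * total
    files = {f: [] for f in range(nfiles)}
    for _ in range(appends):
        for f in range(nfiles):
            home = (f % ngroups) * group_size
            for off in range(total):
                b = (home + off) % total
                if free[b]:
                    free[b] = False
                    files[f].append(b)
                    break
    return files


if __name__ == "__main__":
    for ngroups in (1, 4):
        layout = simulate(ngroups)
        frags = {f: count_fragments(b) for f, b in layout.items()}
        print(ngroups, "group(s):", layout, "fragments:", frags)
```

With one group the two files interleave block by block; with several
groups each file grows contiguously inside its own group.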

> So the problem with spreading stuff out doesn't have anything to do with
> read-ahead, and has everything to do with the basic issue of BAD LOCALITY.
> Locality is _good_, independently of read-ahead and independently of
> medium.

Ironically, as I mentioned, one of the reasons behind the block group
scheme is to *increase* locality for files within a particular
directory. As you point out quite correctly, though, it tends to
destroy locality across an entire directory tree.

Maybe the answer is that we need some way of declaring that some
directory is the root of "a directory tree". That way, the filesystem
can keep directories underneath the directory tree close together, and
the filesystem can try to keep directory trees far apart from each
other.

In order to do something like this, we would just need a filesystem
API extension to allow programs like tar and bitkeeper to give a hint
that a new directory tree is being established --- and ideally, it
needs to be done at mkdir time, so that the filesystem can
do a better job of deciding where to place the initial
root of the "directory tree". Things would also work if you declared
some directory to be the root of a "directory tree" after the
directory was initially created, but the allocation heuristics
wouldn't be nearly as effective.

Linus, what do you think about defining a new flag which could be
passed as part of the mode bits to mkdir()? If we allow the
filesystem to get some additional hints from userspace about what the
difference between /usr/src/linux (where directory spreading is a bad
idea) and /usr/home (where directory spreading is a very good idea),
it would make life for the filesystem much easier.
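
As a rough illustration of how such a hint might steer placement, here
is a toy Python model. Everything in it is hypothetical: no such mkdir()
flag exists, and the ToyFS class, its tree_root parameter, and the
"least used group" rule are assumptions made purely for this sketch.

```python
class ToyFS:
    """Toy allocator: tracks which block group each directory lives in."""

    def __init__(self, ngroups=8):
        self.used = [0] * ngroups      # directories placed in each group
        self.group_of = {"/": 0}       # the root directory sits in group 0
        self.used[0] = 1

    def mkdir(self, path, tree_root=False):
        """Place a new directory.

        tree_root is the hypothetical hint: when set, start the tree in
        the least used group (far from other trees); otherwise stay in
        the parent's group so the tree remains close together.
        """
        parent = path.rsplit("/", 1)[0] or "/"
        if tree_root:
            group = min(range(len(self.used)), key=lambda g: self.used[g])
        else:
            group = self.group_of[parent]
        self.used[group] += 1
        self.group_of[path] = group
        return group


if __name__ == "__main__":
    fs = ToyFS()
    print(fs.mkdir("/linux", tree_root=True))   # start of a new tree
    print(fs.mkdir("/linux/fs"))                # stays with its tree
    print(fs.mkdir("/home", tree_root=True))    # a different tree
```

In this model the hinted directories are pushed into separate groups,
and everything created beneath them without the hint stays in the same
group as its tree root.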

- Ted