2002-09-20 16:36:21

by Theodore Ts'o

[permalink] [raw]
Subject: New version of the ext3 indexed-directory patch


I've done a bunch of hacking on the ext3 indexed directory patch, and I
believe it's just about ready for integration with the 2.5 tree.
Testing and comments are appreciated.

The code can be found either via bitkeeper, at:

bk://extfs.bkbits.net/ext3-htree-2.4
and
bk://extfs.bkbits.net/ext3-htree-2.5

Or for those people who want straight diffs, patches against 2.4
and 2.5 can be found at:

http://thunk.org/tytso/linux/ext3-dxdir

- Ted

Here are my Release Notes for my changes:
(i.e., the changes from Christopher Li's port of Daniel
Phillip's hashtree code)

*) The Ext3 hash-tree code uses the new TEA and half-MD4 hash that ships
with e2fsprogs 1.28

*) The code has been massively reorganized and rewritten to make it more
maintainable. In particular, the horrible mess which had been
ext3_find_entry and ext3_add_entry has been broken up into
multiple functions. This eliminated a lot of goto's, and more
importantly, found some memory leaks which have been eliminated.
Some were only in the error paths, but some were in normal code
paths that would be executed after a split, for example.

*) As a side effect of the reorganization, the ext3_find_entry() and
ext3_readdir() paths can now support arbitrary numbers of
indirect levels in the tree. (The resulting code was simpler,
too!) The one code path that still could use some work is the
split handling on the insert path. Once that is generalized, we
will be able to remove the depth restriction on the indexed tree.

*) The error handling has been cleaned up, so that they are appropriately
reported back up to userspace, where approach. Ext3_std_error()
is now called when the journal routines report an error, in line
with all of the other journalling calls in the rest of the ext3
codebase.

*) Ext3_readdir now traverses the directory in hash sort order. This
(mostly) solves the requirement that readdir not return doubled
file entries, or skip a block of files, which could happen if
another process managed to trigger a tree split while the
readdir operation was going on. Unfortunately, on 32 bit
machines, we have to use a 31-bit sort under all conditions,
even if telldir64() is used in userspace. This is because the
VFS uses a single path for sys_lseek() and sys_llseek(), and so
the ext3 code can't tell whether to send back a 32-bit value or
a 64-bit value. And, if we send a 64-bit positional value when
the 32-bit lseek() was called, lseek() will bomb out.

This is unfortunate, but the chances of our hitting this failure
more are relatively small. Since readdir (well, sys_getdents64)
returns 4k of data at at time, that means a directory with
400,000 entries have approximately 3,000 getdents() system
calls, with each getdents() returning approximately 133 entries.
In order to trigger the problem, the 31-bit hash collision
must be positioned such that it straddles one of the 4k getdents
blocks --- while a split operation is taking place.

(If we were willing to modify the VFS layer, so that
ext3_readdir() knew how much space was left in the filldir
buffer, we could be even smarter about things by stopping if
there is only space for less than all of the entries that have
the same hash value. I'm not entirely convinced this is
necessary, though. The risk is low and in the worst case, one
or two filenames will simply be returned twice by readdir, which
shouldn't cause problems for most applications.)

*) Ext3_readdir(), ext3_add_entry(), and ext3_find_entry() now
automatically fall back to the old linear directory code if they
find a indexed directory format they don't understand. This
allows us to add new hash versions, and/or allow the depth of
the tree to be increased, without losing backwards compatiblity
with deployed kernels.


2002-09-20 17:01:06

by Andrew Morton

[permalink] [raw]
Subject: Re: New version of the ext3 indexed-directory patch

[email protected] wrote:
>
> I've done a bunch of hacking on the ext3 indexed directory patch, and I
> believe it's just about ready for integration with the 2.5 tree.

Thanks; it's good that this is moving ahead. I'll update to this
version in the -mm patchkit, so people can test it from there
too.

What is the status of e2fsprogs support for htree? Is everything covered?

2002-09-22 19:55:41

by Theodore Ts'o

[permalink] [raw]
Subject: Re: New version of the ext3 indexed-directory patch

On Fri, Sep 20, 2002 at 10:06:04AM -0700, Andrew Morton wrote:
>
> Thanks; it's good that this is moving ahead. I'll update to this
> version in the -mm patchkit, so people can test it from there
> too.
>
> What is the status of e2fsprogs support for htree? Is everything covered?

Almost. E2fsck support is fully there. With e2fsprogs 1.28, you
still need to manually set up the dir_index feature flag to convert a
filesystme to use the directory indexing feature:

debugfs -w -R "features dir_index" /dev/hdXX
debugfs -w -R "ssv def_hash_version tea" /dev/hdXX
debugfs -w -R "ssv hash_seed random" /dev/hdXX

In the 1.29 release, mke2fs will create filesystems with the directory
indexing flag set by default, and tune2fs -O dir_index will do set up
the directory indexing flag and the default hash version
automatically.

There is also a bug in e2fsprogs 1.28 so that the -D option to e2fsck
(which optimizes all directories) has a 1 in 512 chance of corrupting
a directory, due to a fenceport error that escaped my testing. This
will be fixed in 1.29, which should be released very shortly.

Folks who want to play with the latest e2fsprogs get grab it here:

bk://thunk.org:5000

- Ted