2001-04-10 16:54:57

by Daniel Phillips

Subject: Ext2 Directory Index - File Structure

> Are you going to go to a COMPAT flag before final release? This is
> pretty much needed for e2fsck to be able to detect/correct indexes.

I will if I know what the exact semantics are. I have only an
approximate idea of how this works and I'd appreciate a more precise
definition.

> One thing I was thinking about the root entry - for compatibility with 2.0
> and earlier kernels (which don't clear INDEX_FL) there is a simple
> solution:
>
> - deleted dirents do not pose any problems, because a hash index is no
> guarantee that a specific dirent exists, only a range of possible entries
> - adding a dirent will _always_ go into the first empty block (which is the
> index root block), and overwrite the dx_root_info data after "." and ".."
>
> In the latter case (and also in the case of data corruption), it is useful
> to have some sort of integrity check within dx_root_info. If a dirent
> overwrote the root index block, the first 4 bytes of the dirent after
> "." and ".." would be an inode number; this could be anything, so it is
> not much use for detecting whether the block is valid. However, if you
> put the "reserved_zero" field first and swap the order of "hash_version"
> and "info_length", then "info_length" and "hash_version" will overlap with
> the "rec_len" field of a new dirent. A valid "rec_len" is always a
> multiple of 4, so if (hash_version & 3) is always nonzero, it is possible
> to detect whether the dx_root_info has been overwritten by an old version
> of ext2 code. So:
>
> struct dx_root {
> struct fake_dirent fake1;
> char dot1[4];
> struct fake_dirent fake2;
> char dot2[4];
> struct dx_root_info {
> /* __u32 inode; */ le_u32 reserved_zero;
> /* __u16 rec_len lo */ u8 hash_version; /* 1, 2, or 3, but not 0 or 4 */
> /* __u16 rec_len hi */ u8 info_length; /* 8 */
> u8 indirect_levels;
> u8 unused_flags;
> } info;
> };
>
> It will also be easy to detect if reserved_zero is changed, because zero is
> not a valid inode number. We can't rely on fake2's rec_len in case a dir
> entry was added and then deleted (corrupting dx_root_info, but leaving
> fake1 and fake2 as they should be).

OK, this really does give a lot better protection against those wild and
crazy admins who switch back and forth between 2.0 and 2.5 systems on a
regular basis. ;-) Since it costs next to nothing I think this is a
good idea. It looks a little funny to have the info_length in the
middle of the info instead of at the beginning, but it would hardly be
the strangest layout I've ever seen.

So this is the new plan: reorder the fields as you suggest and let
hash_version be 1 when the hash function is finally frozen. Then we are
OK until somebody comes up with hash function version 4, and we have to
decide whether to skip that number or throw caution to the winds and use
it. (By that time, Linux 2.0 may be as distant a memory as 1.0 is now.)
As a fringe benefit we will automatically flag any test directories
inadvertently left around after being created by pre-release code. A
comment goes into the code advising against hash version numbers that
are multiples of 4, and why. On detecting an invalid hash version we
kprint a message advising a fsck and fail the lookup.
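The overwrite check this scheme buys us can be expressed in a few lines. Here is a standalone sketch assuming the reordered field layout from the quoted proposal; the struct and function names are illustrative, not the actual ext2 code:

```c
#include <stdint.h>

/* Sketch of the proposed dx_root_info layout; real code would read
 * these bytes out of the directory's first block. */
struct dx_root_info {
    uint32_t reserved_zero;   /* overlaps a dirent's inode field: must be 0 */
    uint8_t  hash_version;    /* overlaps rec_len low byte: never a multiple of 4 */
    uint8_t  info_length;     /* overlaps rec_len high byte: always 8 */
    uint8_t  indirect_levels;
    uint8_t  unused_flags;
};

/* A dirent written by an old kernel would put an inode number over
 * reserved_zero and a rec_len (always a multiple of 4) over
 * hash_version, so either test below catches the overwrite. */
static int dx_root_info_valid(const struct dx_root_info *info)
{
    return info->reserved_zero == 0 &&
           (info->hash_version & 3) != 0 &&
           info->info_length == 8;
}
```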

> > The high four bits of the block pointer field are reserved for use by
> > a coalesce-on-delete feature which may be implemented in the future.
> > The remaining 28 bits are capable of indexing up to 1TB per directory
> > file. (Perhaps I should coopt 8 bits instead of 4.)
>
> Might be worthwhile, even if you don't use all of the bits for coalesce
> flags. Nothing better than having free space for future expansion. Even
> for a 1kB block filesystem, this would give us directories up to 16GB, at
> worst case about 32M 256-character entries, average case 595M 9-to-12-character
> names. Having 24 bits for 4kB block (64GB) directories gives us worst-case
> 117M 256-character entries, and average case 2.2B 9-to-12 character names.
> Assumes 50% full leaf blocks for worst case, 70% full for average case.
>
> For that matter, I'm not even sure if ext2 supports directory files larger
> than 2GB? Anyone? I'm not 100% sure, but it may be that e2fsck considers
> directories >2GB as invalid.
>
> > Thus, directory leafs will average no more than 75% full and typically
> > a few points less than that since the hash function never produces a
> > perfectly uniform distribution. (The better the hash function, the
> > closer we get to 75%; we are currently at 71%.)
>
> Have you been testing this with different kinds of input for filenames,
> or only synthetic input data?

Only very boring, sequentially numbered names so far. Which reminds me:
I should probably improve the method of finding the middle of the block
for a split, which currently just picks the n/2th entry as the midpoint.
The correct thing to do is walk the sorted list of names twice, once to
find the total length of entries and again to find the entry closest to
1/2 that. This code is easy to write but it's a little distasteful
because of its bulk in comparison to the small average performance
improvement we'd see. For now I guess I'll just stick with the crude
approximation and beef up the comment.
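The two-pass split is small enough to sketch. For illustration it runs over a plain array of entry sizes rather than real dirents; names and types here are stand-ins:

```c
/* Walk the sorted entries once to total their sizes, then again to
 * find the first entry at which the running total reaches half; split
 * the block there rather than at the n/2th entry. */
static int find_split_point(const int *entry_size, int count)
{
    int total = 0, running = 0, i;
    for (i = 0; i < count; i++)
        total += entry_size[i];
    for (i = 0; i < count; i++) {
        running += entry_size[i];
        if (2 * running >= total)
            return i + 1;        /* split after entry i */
    }
    return count;
}
```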

Because I'm not relying on any particular properties of coherence of
names, i.e., my hash function is as random as I can make it, I think the
sequential names are a pretty good test, except that they are all of
similar length.

Here is my "makefiles" test code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

int main (int argc, char *argv[])
{
	int n = (argc > 2)? strtol(argv[2], 0, 10): 0;
	int i, size = 50, show = argc > 3 && !strncmp(argv[3], "y", 1);
	char name[size];
	for (i = 0; i < n; i++)
	{
		snprintf(name, size, "%s%i", argv[1], i);
		if (show) printf("create %s\n", name);
		close(open(name, O_CREAT, 0644)); /* 0100 == O_CREAT */
	}
	return 0;
}

Sample usage:

makefiles /mountpoint/testdir/basename 1000000 y

This creates 1,000,000 files named basename0..basename999999 in
testdir (which had better have been created with an index, or this test
will take several days to finish). The "y" enables listing of names and
should be off for a real benchmark run.

It would be a good idea to elaborate this code to generate names with
randomly varying lengths.
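One cheap way to do that, keeping the style of the test program above: pad each sequential name with a random run of filler characters. make_name is a hypothetical helper, and rand() is plenty random for varying lengths:

```c
#include <stdio.h>
#include <stdlib.h>

/* Append a random run of filler characters to each sequential name so
 * the directory sees a mix of name lengths. */
static void make_name(char *buf, int bufsize, const char *base, int i)
{
    int pad = rand() % 20;                     /* 0..19 extra chars */
    int n = snprintf(buf, bufsize, "%s%i", base, i);
    if (n > bufsize - 1)
        n = bufsize - 1;                       /* snprintf truncated */
    while (pad-- > 0 && n < bufsize - 1)
        buf[n++] = 'x';
    buf[n] = '\0';
}
```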

> One interesting dataset would be to get
> the filename generation code from sendmail (for /var/spool/mqueue) and/or
> inn and/or squid to see how those names are hashed. These are legitimate
> applications which tend to store lots of files in a single directory.

Yes, I won't do that myself but I'd be happy if somebody did.

> > It is quite likely that the hash function will be improved in the
> > future, therefore a hash function version field is included in the root
> > header. If the current code sees a nonzero hash function version it
> > will fail the directory operation gracefully. Thus, the worst possible
> > effect of adding a new hash function in the future is that old versions
> > of ext2 will not be able to access a directory created with the new hash
> > function. (Alternatively, we could arrange a fallback to a linear
> > search.)
>
> The fallback to a linear search is pretty much a requirement anyways, isn't
> it?

Yes, in the sense that the new code also has to be able to handle
nonindexed directories, which it does now. As far as falling back to a
linear search after somebody has done something boneheaded - I think
it's better to fail and kprint a suggestion to run fsck, which can
easily fix the problem instead of allowing it to go unnoticed and
perhaps adversely affect system performance. The problem here is that
most users do not check their log files unless something doesn't work.
In this case, failing instead of pressing on constitutes helping, not
hurting.

> What happens with an existing directory larger than a single block
> when you are mounting "-o index"? You can't index it after-the-fact, so
> you need to linear search.

This works now.

> This also needs to be the case if you detect
> bogus data in the index blocks. If there is any erroneous data in the index
> blocks, you basically need to turn off the INDEX flag, mark the filesystem
> in error, and revert to linear searching for that directory (leaving it to
> e2fsck to fix, based on a COMPAT flag).

Hmm. How about just marking the filesystem in error, advising the user
to run fsck, and failing. We are most likely talking about a situation
that came about as an ill-conceived series of kernel upgrades and
fallbacks between 2.0 and 2.4/2.5 kernels. An admin who really wants to
do this would be better advised to upgrade to a recent 2.2 kernel as an
intermediate step.

Another possibility is that we're seeing real disk corruption because of
faulty hardware or a kernel bug. In that case, we really don't want to
continue, we want to deal with it.

> > To Do
> > -----
> >
> > - Finalize hash function
> > - Test/debug big endian
> > - Convert to page cache
> >
> > Future Work
> > -----------
> >
> > - Fail gracefully on random data
> > - Defer index creation until 1st block full
>
> I would consider both of these in the "To Do" category before it goes into
> the kernel. Otherwise, in 99% of the normal directories, you are doubling
> the space needed for a directory, which means you will need to weigh on a
> filesystem-by-filesystem basis whether to use indexing. I would rather be
> able to turn it on for all filesystems.

OK, that's settled. For the release version I'll just try to reduce the
likelihood of oopsing on random data rather than trying to provide an
airtight guarantee.

> > [1] The index tree could have been made finer-grained than one block,
> > but not without sacrificing the desireable property of forward
> > compatibility with older Linux versions (because the directory block
> > format would have had to have been changed) and it is not clear that a
> > finer-grained index would perform better.
>
> Interestingly, I was reading on the reiserfs list that they are looking
> at (for performance reasons) using linear hash searching for under about
> 100 because of reduced overhead. Whether this goes into the code or not
> will obviously be determined by their testing, it does make some sense.
> If we had sub-block indexing, then we would need to keep the index entries
> sorted to a finer resolution, adding more overhead to each entry.

I have designed an incompatible directory block format that includes an
internal index and uses the same number of bytes per entry, including
the index overhead. I intend to code this up and try it sometime later
this year just to see if the performance improvement is significant. If
it is then, well... we'll deal with it then.

As far as bulking up the index goes, my guess is that it just costs
performance because of the bigger cache footprint.

> > [4] The test for directory size could be eliminated if we are willing to
> > accept that indexing be used on all new directories. In that case the
> > index flag would be set on any directory larger than a block. A
> > decision was made to err on the side of more control by the
> > administrator, who can now be offered the option of specifying which
> > directories should be indexed and which not.
>
> I'm not sure I follow this. Isn't this the case already (or at least the
> same as "only add the index after we grow past 1 block")? Namely, don't
> all directories get an index block (especially those growing larger than
> 1 block), or do you need to "chattr" each directory which will get an
> index?

New directories get the index flag set but with the planned
optimization, don't get an index until they grow over a block.

> If e2fsck gets into the picture with a COMPAT flag, I would think it
> will build an index for any directory larger than 1 block (to free the
> kernel from having to do this on existing filesystems).

That makes sense to me. There probably should be some way to suppress
the automatic index addition, for example, when a user has a partition
that's 100% full.


2001-04-10 19:19:46

by Andreas Dilger

Subject: Re: Ext2 Directory Index - File Structure

Daniel writes:
> > Are you going to go to a COMPAT flag before final release? This is
> > pretty much needed for e2fsck to be able to detect/correct indexes.
>
> I will if I know what the exact semantics are. I have only an
> approximate idea of how this works and I'd appreciate a more precise
> definition.

OK, on ext2 the {,RO_,IN_}COMPAT flags work as follows for the kernel:

COMPAT: Means that the on-disk format is 100% compatible with older on-disk
formats, so a kernel which didn't know anything about this feature
could read/write the filesystem without corrupting the filesystem.
This is essentially just a flag which says "this filesystem has a
(hidden) feature" for e2fsck (more on e2fsck later). The ext3
HAS_JOURNAL is a COMPAT feature, because the journal is simply a
regular file with data blocks in it, nothing special. DIR_PREALLOC
is a COMPAT feature, since all it does is pre-allocate empty
directory blocks for directories growing over 1 block.
RO_COMPAT: Means the on-disk format is 100% compatible with older on-disk
formats for reading (i.e. the feature does not change the visible
on-disk format). However, an old kernel writing to such a
filesystem would/could corrupt the filesystem, so this is
prevented(*). SPARSE_SUPER is such a feature, because sparse
groups allow data blocks where superblock/GDT backups used to
live, and the old ext2_free_blocks() would refuse to free these
blocks, leading to inconsistent bitmaps. Old ext2_free_blocks()
would also get an error if a series of blocks crossed a group
boundary.
INCOMPAT: Means the on-disk format has changed in some way that makes it
unreadable by older kernels. FILETYPE is this because older
kernels would think the directory name_len was > 256, so that
would be bad. COMPRESSION is obvious - you would just get
garbage back from read(). The ext3 RECOVER flag is needed to
prevent ext2 from mounting the filesystem without replaying the
journal. In most cases, an unrecovered ext3 filesystem is in
no worse a state than an unchecked ext2 filesystem (so at one
time I wanted an RO_COMPAT_RECOVER flag to at least allow you
to mount an ext3 filesystem as ext2 for e3fsck), but a lot
more inconsistencies can accumulate in the journal than was
previously possible so that was shot down.
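The mount-time policy these three classes imply condenses to a few lines. The masks and names below are placeholders, not the real ext2_fs.h definitions:

```c
/* Hypothetical masks of the feature bits this kernel understands. */
#define KNOWN_RO_COMPAT  0x0007
#define KNOWN_INCOMPAT   0x0003

/* Unknown INCOMPAT bits refuse the mount outright; unknown RO_COMPAT
 * bits permit read-only mounts only; unknown COMPAT bits are ignored
 * by the kernel (only e2fsck cares about those). */
static int mount_allowed(unsigned compat, unsigned ro_compat,
                         unsigned incompat, int readonly)
{
    (void)compat;                       /* never blocks a mount */
    if (incompat & ~KNOWN_INCOMPAT)
        return 0;
    if (!readonly && (ro_compat & ~KNOWN_RO_COMPAT))
        return 0;
    return 1;
}
```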

Now for e2fsck, slightly different semantics are needed. Basically, if
a filesystem has ANY unsupported feature (COMPAT, RO_COMPAT, INCOMPAT),
it will refuse to run on the filesystem, because it doesn't know how to
deal with the feature. For example, ext3 COMPAT_HAS_JOURNAL is not supported
by older e2fsck, because it is possible for ext3 to use a reserved inode
(number 8) to hold the journal, so this would fail. It also didn't know
what was a valid or invalid ext3 journal, so allowing an old e2fsck to pass
on an ext3 filesystem would be a false sense of security.

Because (with the change to dx_root_info), an indexed filesystem is 100%
safe to write into even for kernels that know nothing about the directory
indexes, it could be a COMPAT flag in the superblock. This would force
users to have an updated e2fsck to verify that the directory indexes
are valid. Otherwise, the indexes could (somehow) become corrupt and an
old e2fsck would not detect this because it would just think the index
blocks were empty directory blocks.

(*) kernels before 2.4.0 didn't check RO_COMPAT on remount, so root could
be mounted RO with an RO_COMPAT feature, and then be remounted RW. I
submitted a patch for this which made it into 2.4.0, but it is still
not in 2.2.19. We also didn't check *COMPAT flags for rev 0 ext2
filesystems (i.e. those created by default with mke2fs 1.15 or earlier,
and still may be around), so COMPAT flags were totally ignored in this
case. My patch also fixed that for 2.4.0. Alan has said he will put
it into 2.2.20-pre1.

> > For that matter, I'm not even sure if ext2 supports directory files larger
> > than 2GB? Anyone? I'm not 100% sure, but it may be that e2fsck considers
> > directories >2GB as invalid.

Actually, as I think about this more, it is currently invalid for
directories to become > 2GB. The i_size_high field in the inode is
actually the i_dir_acl pointer, so this is only "available" for regular
files and not directories. Given that the ext2 EA/ACL code does not use
this field, we may allow this in the future, but for now we are limited
to 2GB directories. Should be enough for now.

> > Have you been testing this with different kinds of input for filenames,
> > or only synthetic input data?
>
> Because I'm not relying on any particular properties of coherence of
> names, i.e., my hash function is as random as I can make it, I think the
> sequential names are a pretty good test, except that they are all of
> similar length.

You could try "bonnie++ -s 0 -n 1000 -d /mnt/index_filesystem"

to create 1M (zero length) files in a single directory. The files have
semi-random names and different lengths. Basically a fixed-length number
with random length characters added to the end. Since bonnie++ is also
a standard filesystem performance test, it will at least give some
useful numbers to compare with.

> > The fallback to a linear search is pretty much a requirement anyways, isn't
> > it?
>
> Yes, in the sense that the new code also has to be able to handle
> nonindexed directories, which it does now. As far as falling back to a
> linear search after somebody has done something boneheaded - I think
> it's better to fail and kprint a suggestion to run fsck, which can
> easily fix the problem instead of allowing it to go unnoticed and
> perhaps adversely affect system performance. The problem here is that
> most users do not check their log files unless something doesn't work.
> In this case, failing instead of pressing on constitutes helping, not
> hurting.

I would have to disagree. In the case of unsupported/corrupt/bad index
data, there _is_ a meaningful fallback, which is linear directory search.
Calling ext2_error() will mark the filesystem in error (for the next
e2fsck to clean up), and the sysadmin has the option of mounting with
"errors=remount-ro" or "errors=panic" if this is desirable. It should be
up to the sysadmin to decide whether they want their box to crash, if there
is a reasonable solution.

We should definitely also clear the index flag on that directory, so we
don't keep getting the same error. The rest of the ext2 code will deal
with the case if the actual directory data is corrupt.

Also, I'm sure that if the system has a large directory (thousands of
files), they will notice that it has become very slow. If they don't
notice (and e2fsck cleans it up after reboot), then that is just as well.

> > What happens with an existing directory larger than a single block
> > when you are mounting "-o index"? You can't index it after-the-fact, so
> > you need to linear search.
>
> This works now.

So then bailing out of index mode (on error) to go into linear search
mode is as easy as clearing the directory index flag and reading the
directory from the start.

> Another possibility is that we're seeing real disk corruption because of
> faulty hardware or a kernel bug. In that case, we really don't want to
> continue, we want to deal with it.

We should leave it up to the mount options and sysadmin whether they would
like to run in degraded mode rather than just fail. If the directory data
itself is also bad, the code will already deal with this.

> > If e2fsck gets into the picture with a COMPAT flag, I would think it
> > will build an index for any directory larger than 1 block (to free the
> > kernel from having to do this on existing filesystems).
>
> That makes sense to me. There probably should be some way to suppress
> the automatic index addition, for example, when a user has a partition
> that's 100% full.

Yes, I was basically thinking that index creation would be done after the
rest of the filesystem had already been checked and was in a valid state.
This is really the only safe time to do it. I suppose the index checking
itself could be done at that time as well.

Cheers, Andreas
--
Andreas Dilger \ "If a man ate a pound of pasta and a pound of antipasto,
\ would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/ -- Dogbert

2001-04-13 20:59:05

by Daniel Phillips

Subject: Re: Ext2 Directory Index - File Structure

Andreas Dilger wrote:
> Daniel writes:
> > > Are you going to go to a COMPAT flag before final release? This is
> > > pretty much needed for e2fsck to be able to detect/correct indexes.
> >
> > I will if I know what the exact semantics are. I have only an
> > approximate idea of how this works and I'd appreciate a more precise
> > definition.
>
> OK, on ext2 the {,RO_,IN_}COMPAT flags work as follows for the kernel:
>
> COMPAT: Means that the on-disk format is 100% compatible with older
> on-disk formats, so a kernel which didn't know anything about this feature
> could read/write the filesystem without corrupting the filesystem. This is
> essentially just a flag which says "this filesystem has a (hidden) feature"
> for e2fsck (more on e2fsck later). The ext3 HAS_JOURNAL is a COMPAT
> feature, because the journal is simply a regular file with data blocks in
> it, nothing special. DIR_PREALLOC is a COMPAT feature, since all it does
> is pre-allocate empty directory blocks for directories growing over 1 block.
>
> RO_COMPAT: Means the on-disk format is 100% compatible with older on-disk
> formats for reading (i.e. the feature does not change the
> visible on-disk format). However, an old kernel writing to such a
> filesystem would/could corrupt the filesystem, so this is
> prevented(*). SPARSE_SUPER is such a feature, because sparse
> groups allow data blocks where superblock/GDT backups used to
> live, and the old ext2_free_blocks() would refuse to free these
> blocks, leading to inconsistent bitmaps. Old ext2_free_blocks()
> would also get an error if a series of blocks crossed a group
> boundary.
> INCOMPAT: Means the on-disk format has changed in some way that makes it
> unreadable by older kernels. FILETYPE is this because older
> kernels would think the directory name_len was > 256, so that
> would be bad. COMPRESSION is obvious - you would just get
> garbage back from read(). The ext3 RECOVER flag is needed to
> prevent ext2 from mounting the filesystem without replaying the
> journal. In most cases, an unrecovered ext3 filesystem is in
> no worse a state than an unchecked ext2 filesystem (so at one
> time I wanted an RO_COMPAT_RECOVER flag to at least allow you
> to mount an ext3 filesystem as ext2 for e3fsck), but a lot
> more inconsistencies can accumulate in the journal than was
> previously possible so that was shot down.
>
> Now for e2fsck, slightly different semantics are needed. Basically, if
> a filesystem has ANY unsupported feature (COMPAT, RO_COMPAT, INCOMPAT),
> it will refuse to run on the filesystem, because it doesn't know how to
> deal with the feature. For example, ext3 COMPAT_HAS_JOURNAL is not
> supported by older e2fsck, because it is possible for ext3 to use a
> reserved inode (number 8) to hold the journal, so this would fail. It also
> didn't know what was a valid or invalid ext3 journal, so allowing an old
> e2fsck to pass on an ext3 filesystem would be a false sense of security.
>
> Because (with the change to dx_root_info), an indexed filesystem is 100%
> safe to write into even for kernels that know nothing about the directory
> indexes, it could be a COMPAT flag in the superblock. This would force
> users to have an updated e2fsck to verify that the directory indexes
> are valid. Otherwise, the indexes could (somehow) become corrupt and an
> old e2fsck would not detect this because it would just think the index
> blocks were empty directory blocks.
>
> (*) kernels before 2.4.0 didn't check RO_COMPAT on remount, so root could
> be mounted RO with an RO_COMPAT feature, and then be remounted RW. I
> submitted a patch for this which made it into 2.4.0, but it is still
> not in 2.2.19. We also didn't check *COMPAT flags for rev 0 ext2
> filesystems (i.e. those created by default with mke2fs 1.15 or earlier,
> and still may be around), so COMPAT flags were totally ignored in this
> case. My patch also fixed that for 2.4.0. Alan has said he will put
> it into 2.2.20-pre1.

OK, I get it. Nice article - it would sure be nice to see this
incorporated into Documentation/filesystems/ext2.txt. I just checked my
copy of Understanding the Linux Kernel, and while the existence of the
compat fields in the superblock is noted, nothing at all is said about
them.

So then, the obvious candidate would be:

#define EXT2_FEATURE_RO_COMPAT_DIR_INDEX 0x0004

which was formerly EXT2_FEATURE_RO_COMPAT_BTREE_DIR.

Other than declaring it, I gather I have to set this flag in the
superblock every time I set the EXT2_INDEX_FL in an inode. Is that it?

> > > For that matter, I'm not even sure if ext2 supports directory files
> > > larger than 2GB? Anyone? I'm not 100% sure, but it may be that e2fsck
> > > considers directories >2GB as invalid.
>
> Actually, as I think about this more, it is currently invalid for
> directories to become > 2GB. The i_size_high field in the inode is
> actually the i_dir_acl pointer, so this is only "available" for regular
> files and not directories. Given that the ext2 EA/ACL code does not use
> this field, we may allow this in the future, but for now we are limited
> to 2GB directories. Should be enough for now.

Yes, I will mask off 8 bits of the block index and this will be more than
enough to support the planned coalesce function. I can't think of any
other use for these bits than coalescing - can you?

As you point out, we can't have directories bigger than 2 (4?) GB at the
moment, and if we ever can then we could revisit the maximum size
question. I'm pretty sure that one day, not even very far in the
future, 16 GB will be too small a directory for some applications. On
the other hand, it's possible that none of these applications will be
using Ext2.

> > > Have you been testing this with different kinds of input for filenames,
> > > or only synthetic input data?
> >
> > Because I'm not relying on any particular properties of coherence of
> > names, i.e., my hash function is as random as I can make it, I think the
> > sequential names are a pretty good test, except that they are all of
> > similar length.
>
> You could try "bonnie++ -s 0 -n 1000 -d /mnt/index_filesystem"
>
> to create 1M (zero length) files in a single directory. The files have
> semi-random names and different lengths. Basically a fixed-length number
> with random length characters added to the end. Since bonnie++ is also
> a standard filesystem performance test, it will at least give some
> useful numbers to compare with.

OK, that sounds ideal.

> > > The fallback to a linear search is pretty much a requirement anyways,
> > > isn't it?
> >
> > Yes, in the sense that the new code also has to be able to handle
> > nonindexed directories, which it does now. As far as falling back to a
> > linear search after somebody has done something boneheaded - I think
> > it's better to fail and kprint a suggestion to run fsck, which can
> > easily fix the problem instead of allowing it to go unnoticed and
> > perhaps adversely affect system performance. The problem here is that
> > most users do not check their log files unless something doesn't work.
> > In this case, failing instead of pressing on constitutes helping, not
> > hurting.
>
> I would have to disagree. In the case of unsupported/corrupt/bad index
> data, there _is_ a meaningful fallback, which is linear directory search.
> Calling ext2_error() will mark the filesystem in error (for the next
> e2fsck to clean up), and the sysadmin has the option of mounting with
> "errors=remount-ro" or "errors=panic" if this is desirable. It should be
> up to the sysadmin to decide they want their box to crash or not, if there
> is a reasonable solution.
>
> We should definitely also clear the index flag on that directory, so we
> don't keep getting the same error. The rest of the ext2 code will deal
> with the case if the actual directory data is corrupt.
>
> Also, I'm sure that if the system has a large directory (thousands of
> files), they will notice that it has become very slow. If they don't
> notice (and e2fsck cleans it up after reboot), then that is just as well.

You win and I will put it on the to-do list. As you say, it's not a lot
of work, but of course that's not the point - the point is to be
consistent with current behaviour. Marking the filesystem in error
takes care of my objection, and the user gets to press on bravely in
this circumstance.

> > > What happens with an existing directory larger than a single block
> > > when you are mounting "-o index"? You can't index it after-the-fact,
> > > so you need to linear search.
> >
> > This works now.
>
> So then bailing out of index mode (on error) to go into linear search
> mode is as easy as clearing the directory index flag and reading the
> directory from the start.

Are you sure we want to clear the index flag? The user has probably
just booted the wrong kernel. And yes, we are talking about a
strategically placed goto here, after a little cleanup.
--
Daniel

2001-04-14 00:52:01

by Andreas Dilger

Subject: Re: Ext2 Directory Index - File Structure

Daniel, you write:
> OK, I get it. Nice article - it would sure be nice to see this
> incorporated into Documentation/filesystems/ext2.txt. I just checked my copy
> of Understanding the Linux Kernel and while existence of the compat
> fields in the super block is noted, there is nothing at all said about
> them.
>
> So then, the obvious candidate would be:
>
> #define EXT2_FEATURE_RO_COMPAT_DIR_INDEX 0x0004
>
> which was formerly EXT2_FEATURE_RO_COMPAT_BTREE_DIR.

Actually not. We should go with "EXT2_FEATURE_COMPAT_DIR_INDEX 0x0008"
because the on-disk layout is 100% compatible with older kernels, so
no reason to force read-only for those systems. I'm guessing Ted had
put RO_COMPAT_BTREE_DIR in there in anticipation, but it was never used.

> Other than declaring it, I gather I have to set this flag in the
> superblock every time I set the EXT2_INDEX_FL in an inode. Is that it?

Yes. If you do like the LARGE_FILE code, it may be worthwhile to check
whether it is already set to avoid writing out the superblock. However,
in most cases the superblock is already dirty after block allocation, so
it is no extra overhead in those cases.
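That pattern might look like the following; the struct and names are stand-ins for the real superblock code, not the actual ext2 implementation:

```c
/* Set the hypothetical DIR_INDEX COMPAT bit, but skip the superblock
 * write-out when the bit is already set (as the LARGE_FILE code does). */
struct toy_super { unsigned feature_compat; int dirty; };
#define TOY_FEATURE_COMPAT_DIR_INDEX 0x0008

static void set_dir_index_compat(struct toy_super *sb)
{
    if (sb->feature_compat & TOY_FEATURE_COMPAT_DIR_INDEX)
        return;                        /* already set: no extra I/O */
    sb->feature_compat |= TOY_FEATURE_COMPAT_DIR_INDEX;
    sb->dirty = 1;                     /* schedule superblock write */
}
```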

> Yes, I will mask off 8 bits of the block index and this will be more than
> enough to support the planned coalesce function. I can't think of any
> other use for these bits than coalescing - can you?

Not at the moment, but then again, I probably couldn't have foreseen any
of the other requirements either. If we don't need them for indexed
directories, we may as well save them for a future (unspecified) use.
Start using them from the high bits down, so that we allow the directory
size to grow (if needed) in the future.
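Concretely, reserving the top 8 bits of the 32-bit block field and consuming any flag bits from the high end down might look like this; the masks are illustrative, and the real code would pick its own names:

```c
#include <stdint.h>

#define DX_BLOCK_MASK  0x00ffffffu  /* low 24 bits: logical block number */
#define DX_FLAGS_SHIFT 24           /* top 8 bits: reserved, used high-first */

static uint32_t dx_entry_block(uint32_t raw) { return raw & DX_BLOCK_MASK; }
static uint32_t dx_entry_flags(uint32_t raw) { return raw >> DX_FLAGS_SHIFT; }
```

Keeping the reserved bits at the top means the 24-bit block field can later widen upward without moving any flag already in use.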

> > So then bailing out of index mode (on error) to go into linear search
> > mode is as easy as clearing the directory index flag and reading the
> > directory from the start.
>
> Are you sure we want to clear the index flag? The user has probably
> just booted the wrong kernel. And yes, we are talking about a
> strategically placed goto here, after a little cleanup.

OK, no need to clear the index flag under normal circumstances. We
simply fall back to linear search on unknown hash code, etc. But
with the hash version byte and dx_root size we should _know_ when we
have bogus data. Whether we clear the index flag or not in this case
is up for debate. I was thinking that it would force us into the
"safer" code path of linear search. However, clearing the index flag
may lose some state (if we decide to allow the admin to set un-indexed
but larger than 1 block directories). In that case, how do we make
e2fsck create indexes for old filesystems when the DIR_INDEX compat flag
is turned on in the superblock?

Cheers, Andreas
--
Andreas Dilger \ "If a man ate a pound of pasta and a pound of antipasto,
\ would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/ -- Dogbert

2001-04-15 19:53:12

by Daniel Phillips

[permalink] [raw]
Subject: Re: Ext2 Directory Index - File Structure

Andreas, you wrote:
> Daniel, you write:
> > So then, the obvious candidate would be:
> >
> > #define EXT2_FEATURE_RO_COMPAT_DIR_INDEX 0x0004
> >
> > which was formerly EXT2_FEATURE_RO_COMPAT_BTREE_DIR.
>
> Actually not. We should go with "EXT2_FEATURE_COMPAT_DIR_INDEX 0x0008"
> because the on-disk layout is 100% compatible with older kernels, so
> no reason to force read-only for those systems. I'm guessing Ted had
> put RO_COMPAT_BTREE_DIR in there in anticipation, but it was never used.

Whoops, my mind came to one conclusion and my mouse to another: I cut
and pasted Ted's original define without noticing the RO in it. Yes,
it would have been RO as Ted conceived it due to the addition
of a new class of filesystem metadata (btree nodes). He suggested
keeping the normal file index blocks as well, which would have given it
RO compatibility, which is what got me thinking about this design. As
it turns out, the page cache erases the advantage of absolute indexing
in terms of access time, and even with the buffer cache the overhead is
not large compared to the savings.

That reminds me, putting this code back into the page cache is gradually
working its way back up my to-do list. My first attempt to do that was
not very satisfying. The problems I ran into were:

- Locking issues. The page cache locks in units of pages; the
directory index works in units of blocks that do not group nicely
into pages. We do not lock buffers at all when we read/write them,
but in the current page cache regime we are supposed to with pages.
Since I operate on up to 4 non-contiguous blocks simultaneously this
turns into a major hassle, and at least one problem is quite
difficult: suppose you are splitting a leaf. You have a lock on the
page the original lives on and you allocate a new block at the end
of the file for the new one. Is that new block on the same page as
the original or not? If it is, you'd better not try to lock again
or you will deadlock. So you skip the lock, and now you have to
remember that you skipped the lock so you don't unlock the page
twice. This is just a simple example; when you throw index blocks
in with the leaf blocks it can get quite a lot messier.

- The "prepare/commit bracket". In the current regime you prepare a
page for writing by calling a VFS function that maps the page to
file blocks and reads some of them if necessary. With buffers, you
use the same function (bread) to prepare the buffer for writing as
you do for reading. Often, you will have read the buffer anyway, so
there's nothing more to do there. This doesn't sound like a huge
difference, but actually it is. In the case of mark_buffer_dirty
you are acting on the buffer at one point in its lifetime; with
prepare/commit you have to worry about a linear span of a page's
life. This can get especially messy if the span crosses a few
conditionals, which it often does.

- Extra parameters. Because you are typically operating on a piece of
a page you have to keep not only a pointer to the struct page, but
something to tell you which part of the page you were working on,
so instead of having one data item to pass around you have 2-3, and
the chance of them getting out of sync increases from "0" to
"probable". If you examine directory code you'll see I'm keeping a
pointer not only to each buffer, but to its b_data as well, which
would normally be a stupid thing to do - I kept this extra cruft in
there when I converted the code back from the page cache to the
buffer cache, in order to ease the pain of the next try at page
cache conversion. Combined with the prepare/commit bracket, you now
have to track 4-6 data items over a span of code vs 1 data item at
one point. The combinatorics are starting to kick in here. Sure,
it's all doable, but at what cost to readability?

- Granularity. Besides the locking issue, it is just extra work to
keep track of which logical block lives on which page.

- Interface definition issues. Nobody has written down a definition
of the page cache interface in terms of operations, states, locking
rules, etc. There are fragments of information here and there but
you really have to dig for them and it's a practical certainty that
you will miss some. You pretty much just have to read the code on
this one, and even then there are a lot of subtleties that can be
easily missed.

All these problems could have been solved, but the fact is, I put quite
a lot of work into the code and the result was distressingly unstable,
the reasons for this being not at all obvious. On top of that, I found
that the code was getting quite tied to all the little bookkeeping
issues that came up, to the point where I wasn't making a lot of
progress developing the features I wanted. And to add insult to injury,
I just wasn't very proud of the way the code looked on the page, and
didn't see a way clear to cleaning it up.

But the fact is, there remain very good reasons for dealing with these
issues. For example, it is wasteful to have the same file block
referenced by both the page cache and buffer cache, which can easily
happen if a directory is deleted and its blocks get reallocated to a
regular file. It is also annoying to have to keep track of the states
of both blocks and pages throughout the VFS, and we would at least make
some progress towards consolidating code paths. So there is no question
about whether to do this or not. (Let alone the fact that Linus has so
decreed;-)

Having had time to think about it and discuss the problem with a few
people, I have a pretty good idea how to proceed for my page cache
version, take two. In San Jose last month Stephen Tweedie suggested
that instead of directly on pages, I operate on the buffer attached to
the page. Al Viro hated this idea but was willing to admit that it
would probably work. Notwithstanding the need to convince Al of the
merits, I now plan to try this and see how it works. If it does work
reliably I will be able to keep essentially the same control flow and
data design I have now, without obfuscation, and I will also be able to
get the efficiency advantage of the page cache. (Operating on the
buffers underlying a page does not mean that I am actually using the
buffer cache - the buffer cache is accessed through a separate hash
table.)

A drawback of operating on the underlying buffers of a page is that we
fail to hide details of how the page cache works. However, it should be
clear from above that we fail to hide those details anyway, and in a way
that really hurts the readability/maintainability of the code.

By the way, did you mean:

#define EXT2_FEATURE_COMPAT_DIR_INDEX 0x0002

since there is only one other COMPAT feature currently defined?

> > Other than declaring it, I gather I have to set this flag in the
> > superblock every time I set the EXT2_INDEX_FL in an inode. Is that it?
>
> Yes. If you do like the LARGE_FILE code, it may be worthwhile to check
> whether it is already set to avoid writing out the superblock. However,
> in most cases the superblock is already dirty after block allocation, so
> it is no extra overhead in those cases.

OK, this revision of the patch incorporates the COMPAT handling, I'd
appreciate your comments.

> > Yes, I will mask off 8 bits of the block index and this will be more than
> > enough to support the planned coalesce function. I can't think of any
> > other use for these bits than coalescing - can you?
>
> Not at the moment, but then again, I probably couldn't have foreseen any
> of the other requirements either. If we don't need them for indexed
> directories, we may as well save them for a future (unspecified) use.
> Start using them from the high bits down, so that we allow the directory
> size to grow (if needed) in the future.

Yes, that was my plan.
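For concreteness, reserving the top 8 bits of the 32-bit block field might
be wrapped in accessors along these lines. This is only a sketch: the mask
width matches the discussion above, but the macro and function names are my
own invention, not the patch's.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch: the low 24 bits of a dx_entry's block field address a
 * logical block within the directory file; the high 8 bits are held
 * in reserve (to be consumed from the top down) for future uses such
 * as coalescing.  All names here are illustrative. */
#define DX_BLOCK_MASK 0x00ffffffu

static uint32_t dx_get_block(uint32_t field) { return field & DX_BLOCK_MASK; }
static uint32_t dx_get_rsvd(uint32_t field)  { return field >> 24; }
```

Keeping the reserved bits at the top means the addressable directory size
can still grow later by widening the mask.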

> > > So then bailing out of index mode (on error) to go into linear search
> > > mode is as easy as clearing the directory index flag and reading the
> > > directory from the start.
> >
> > Are you sure we want to clear the index flag? The user has probably
> > just booted the wrong kernel. And yes, we are talking about a
> > strategically placed goto here, after a little cleanup.
>
> OK, no need to clear the index flag under normal circumstances. We
> simply fall back to linear search on unknown hash code, etc. But
> with the hash version byte and dx_root size we should _know_ when we
> have bogus data. Whether we clear the index flag or not in this case
> is up for debate. I was thinking that it would force us into the
> "safer" code path of linear search. However, clearing the index flag
> may lose some state (if we decide to allow the admin to set un-indexed
> but larger than 1 block directories). In that case, how do we make
> e2fsck create indexes for old filesystems when the DIR_INDEX compat flag
> is turned on in the superblock?

For now I will leave the index flag untouched.

Modified Header Info
--------------------

The root_info has been rearranged along the lines you suggested, to gain
the ability to detect the case where a pre-2.2 kernel has modified an
indexed directory without clearing the inode's index flag:

struct dx_root
{
	struct fake_dirent fake1;
	char dot1[4];
	struct fake_dirent fake2;
	char dot2[4];
	struct dx_root_info
	{
		le_u32 reserved_zero;
		u8 hash_version; /* 0 now, 1 at release */
		u8 info_length; /* 8 */
		u8 indirect_levels;
		u8 unused_flags;
	}
	info;
	struct dx_entry entries[0];
};
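The overwrite-detection property this layout buys can be sketched as a
validity check: if an old kernel appended a dirent over the root block, the
byte now holding hash_version would really be the low byte of that dirent's
rec_len, which is always a multiple of 4, so a hash version chosen from
{1, 2, 3} can never be confused with one. The struct and helper below are
illustrative stand-ins, not the patch code (and with the interim hash
version of 0, only the reserved_zero and info_length checks apply).

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative stand-in for dx_root_info with the same byte layout. */
struct dx_root_info_sketch {
	uint32_t reserved_zero;  /* overlaps an old dirent's inode field */
	uint8_t  hash_version;   /* overlaps rec_len's low byte */
	uint8_t  info_length;    /* overlaps rec_len's high byte; always 8 */
	uint8_t  indirect_levels;
	uint8_t  unused_flags;
};

/* A dirent's rec_len is always a multiple of 4, hence so is its low
 * byte, so (hash_version & 3) != 0 rules out an overwriting dirent. */
static int dx_root_info_valid(const struct dx_root_info_sketch *info)
{
	return info->reserved_zero == 0 &&
	       (info->hash_version & 3) != 0 &&
	       info->info_length == 8;
}
```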

Deferred Index Creation
-----------------------

The current patch incorporates the deferred index creation feature.
For now, this can be enabled/disabled by defining/undefining the
DX_DEFER symbol (defaults to on). Eventually the index creation will
always be deferred and the original code path removed, after some
testing and perhaps some benchmarking of the deferred vs always-indexed
cases. It would be nice to know what the overhead of the index is on
single-block directories.
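The decision logic itself is simple enough to model: the directory stays
un-indexed while everything fits in its first block, and the index is
created only at the first overflow. A toy model follows; TOY_BLOCK_SIZE and
every name in it are made up for illustration, not taken from the patch.

```c
#include <assert.h>

#define TOY_BLOCK_SIZE 1024	/* stands in for the fs block size */

struct toy_dir { int used; int indexed; };

/* Deferred index creation in miniature: only when an entry will not
 * fit in the single un-indexed block is the index built (in the real
 * code: create the dx root and migrate entries into leaf blocks). */
static void toy_add(struct toy_dir *dir, int rec_len)
{
	if (!dir->indexed && dir->used + rec_len > TOY_BLOCK_SIZE)
		dir->indexed = 1;	/* build the index now, not at mkdir */
	dir->used += rec_len;
}
```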

It would also be nice to know whether my theory that the index begins to
win when the directory grows over one block is in fact correct. If I'm
wrong then I may have to defer the index creation by more than one
block, work I'd prefer to avoid.

Developing the deferred index creation took longer than I expected, but
then, I expected that. :-/ This feature sits cross-wise in the code
path, jumping across from the non-indexed to the indexed entry
creation path. A batch of index-specific local data had to be moved
into common scope, and as expected, making the directory look as if the
index had always been there was a little messy.

The compiler blindsided me by allocating separate stack slots for
local data in different blocks instead of overlapping them, causing
rarely-occurring stack overflows.
My solution to this was to develop a longer piece of custom code that
uses less stack, something I had to do in the leaf-splitting part of the
code earlier. An alternative and perhaps cleaner solution is to break
the memory-using parts out into separate functions so that the compiler
doesn't have the option of goofing up in this way, so I left the
original, overflow-causing code in as an ifdef for now.

This feature now seems to be stable, though I haven't tested it under a
wide variety of conditions.

Superblock Feature Flag
-----------------------

This is now incorporated. I use the following code:

	if (!EXT2_HAS_COMPAT_FEATURE(sb, EXT2_FEATURE_COMPAT_DIR_INDEX))
	{
		lock_kernel();
		ext2_update_dynamic_rev(sb);
		EXT2_SET_COMPAT_FEATURE(sb, EXT2_FEATURE_COMPAT_DIR_INDEX);
		unlock_kernel();
		ext2_write_super(sb);
	}
	new->u.ext2_i.i_flags |= EXT2_INDEX_FL;

It looks like there is a common factor in there that should be used
throughout the ext2 code.
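The common factor might become a helper that also keeps the cheap
already-set test. A toy model of just that logic (these are stand-ins, not
the real ext2 interfaces - the struct, the helper name, and the writes
counter are all invented here):

```c
#include <assert.h>

/* Toy model of set-feature-once: the interesting part is the guard
 * that skips the superblock write when the flag is already set.
 * toy_sb and the writes counter stand in for the real superblock
 * and for a call to ext2_write_super(). */
struct toy_sb { unsigned compat; int writes; };

static void toy_set_feature_once(struct toy_sb *sb, unsigned feature)
{
	if (sb->compat & feature)
		return;		/* already set: no superblock write needed */
	/* the real sequence takes a lock and updates the dynamic rev */
	sb->compat |= feature;
	sb->writes++;	/* stands in for ext2_write_super(sb) */
}
```

As noted above, in most cases the superblock is already dirty after block
allocation, so the guard mainly saves the redundant write on every
subsequent directory.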

This is only done on the deferred-creation path since the other path
will be gone by beta time.

Delete Performance
------------------

As we both noticed, delete performance for rm -rf is running
considerably behind create performance by a factor of 4 or so - worse,
it's running behind non-indexed Ext2 :-O

There is no valid reason for this in the index code. The indexed delete
dirties the same things as a non-indexed delete, so it's not clear to me
why we see a lot of extra disk activity with the indexed code.

I'm a little hampered in tracking this down since I do not have a test
machine at the moment, just my laptop and my firewall, neither of which
is available for booting hacked kernels. So I'd appreciate it if
somebody else wants to jump in here. What we need to do is put in
tracing code to find out why excessive numbers of block writes are
taking place.

Since I have to guess at this point, I guess that we are triggering some
fringe behavior of the buffer LRU, possibly caused by interaction with
page-oriented VFS code. If so, it's likely the behavior was introduced
in the pre-2.4 series and I could test this by preparing a patch for the
2.2 series.

For now, I'll just assume the lotus position and contemplate the code.

Redundant Existence Test
------------------------

The inner loop of add_ext2_entry has traditionally included a test for
an existing entry with the same name as the new entry, in which case an
error exit is taken:

err = -EEXIST;
if (ext2_match (namelen, name, de))
goto err_out;

This test is entirely redundant as can be seen from this code snippet
from fs/namei.c, open_namei:

980 dentry = lookup_hash(&nd->last, nd->dentry);
[...]
989 /* Negative dentry, just create the file */
990 if (!dentry->d_inode) {
991 error = vfs_create(dir->d_inode, dentry, mode);
[...]
1000 goto ok;
1001 }
1002
1003 /*
1004 * It already exists.
1005 */

There is always an explicit search for the entry before creating it
(except in the case where we find a deleted entry in the dcache - then
the search can be safely skipped). Thus, ext2's create never gets called
when the name already exists. Worse, the ext2 existence test is not
reliable. If there is a hole in the directory big enough for the new
entry then add_ext2_entry will stop there and not check the rest of
the directory. So the test in add_ext2_entry adds no extra protection,
except perhaps helping verify that the code is correct. In this case,
an assertion would capture the intent better. On the other hand, it
does cost a lot of CPU cycles.

For now I have removed the existence test from both the indexed and
non-indexed paths.

With the directory index it becomes attractive to combine the existence
test with the entry creation and this would come almost for free: after
a suitable place has been found for the new entry the rest of the block
has to be searched for an entry of the same name, and if the name's hash
value continues in the next block(s) those blocks have to be checked
too, the latter test being needed very infrequently. So in exchange for
20-30 new lines of code we get a significant performance boost. A
similar argument applies to unlink. The only difficulty is, there is
currently no internal interface to support this, so I'll just note that
it's possible.
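In outline, the combined pass looks like this toy version: keep scanning
past the chosen gap, checking the occupied entries for the new name. A
real implementation would also follow the hash value into the next block(s)
when it continues there; everything below is invented for illustration.

```c
#include <assert.h>
#include <string.h>

/* Toy dirent: free marks a gap big enough for the new name. */
struct toy_dirent { const char *name; int free; };

/* One pass does both jobs: remember the first usable gap, and keep
 * scanning the rest of the (hash) block for a duplicate name.
 * Returns the slot index, or -1 if the name already exists. */
static int toy_add_entry(struct toy_dirent *de, int count, const char *name)
{
	int spot = -1;
	for (int i = 0; i < count; i++) {
		if (de[i].free) {
			if (spot < 0)
				spot = i;	/* first big-enough gap */
		} else if (strcmp(de[i].name, name) == 0) {
			return -1;		/* duplicate: -EEXIST */
		}
	}
	return spot;
}
```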

Status
------

Funnily enough, all the same items remain on my to-do list as last week,
in fact it got longer. However, one item from the future work list is
now done:

- Defer index creation until 1st block full

And several items were taken care of that weren't on the list:

- Add COMPAT flag handling
- Rearrange header info for forward-compatibility with 2.0 series
- Cleaned up the non-indexed create path, merged the common parts
with the indexed path.

Quite a few small cleanups have been done in preparation for a new
attack on the page cache problem.

The patch is still pre-alpha, until the work on the hash function is
complete, and until you run out of objections to the on-disk format :-)

To Do
-----

- Finalize hash function
- Test/debug big endian
- Incorporate more of AV's cleaned up dir/namei code
- Convert to page cache (take two)
- Fail gracefully on random data

Future Work
-----------

- Generalize to n level index
- Coalesce on delete

The Patch is Available At
-------------------------

http://kernelnewbies.org/~phillips/htree/dx.testme.2.4.3-2

To apply:

cd /your/kernel/source
patch -p0 </the/patch

To create an indexed directory:

mount /dev/hdxxx /test -o index
mkdir /test/foo

Everybody: the invitation for testing remains open.

--
Daniel

2001-04-16 01:43:19

by Andreas Dilger

[permalink] [raw]
Subject: Re: Ext2 Directory Index - File Structure

Daniel, you write:
> Andreas, you wrote:
> > We should go with "EXT2_FEATURE_COMPAT_DIR_INDEX 0x0008"
> > because the on-disk layout is 100% compatible with older kernels, so
> > no reason to force read-only for those systems. I'm guessing Ted had
> > put RO_COMPAT_BTREE_DIR in there in anticipation, but it was never used.
>
> By the way, did you mean:
>
> #define EXT2_FEATURE_COMPAT_DIR_INDEX 0x0002
>
> since there is only one other COMPAT feature currently defined?

Quick note - you need to use 0x0008 or higher (Ted is the authority on
this). The kernel ext2_fs.h is out of date compared to the one in e2fsprogs.
The EXT3_FEATURE_COMPAT_HAS_JOURNAL and EXT2_FEATURE_COMPAT_IMAGIC_INODES
are missing from the kernel header.

Cheers, Andreas
--
Andreas Dilger \ "If a man ate a pound of pasta and a pound of antipasto,
\ would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/ -- Dogbert

2001-04-16 06:59:47

by Andreas Dilger

[permalink] [raw]
Subject: Re: Ext2 Directory Index - File Structure

Daniel, you write (re indexed directories):
> Superblock Feature Flag
> -----------------------
>
> This is now incorporated. I use the following code:
>
> if (!EXT2_HAS_COMPAT_FEATURE(sb, EXT2_FEATURE_COMPAT_DIR_INDEX))
> {
> lock_kernel();
> ext2_update_dynamic_rev(sb);
> EXT2_SET_COMPAT_FEATURE(sb, EXT2_FEATURE_COMPAT_DIR_INDEX);
> unlock_kernel();
> ext2_write_super(sb);
> }
> new->u.ext2_i.i_flags |= EXT2_INDEX_FL;

Why lock_kernel() calls and not lock_super()? I would _think_ that
file creation within the VFS was already protected with BKL, but you
need to do lock_super(), because lock_kernel() can be dropped on
context switch. A typo in the email maybe?

> It looks like there is a common factor in there that should be used
> throughout the ext2 code.

Yes, may be worthwhile to have a little helper function do this.

> This is only done on the deferred-creation path since the other path
> will be gone by beta time.

Yes, it only makes sense to do this at initial dx root creation.

> Delete Performance
> ------------------
>
> As we both noticed, delete performance for rm -rf is running
> considerably behind create performance by a factor of 4 or so - worse,
> it's running behind non-indexed Ext2 :-O
>
> There is no valid reason for this in the index code. The indexed delete
> dirties the same things as a non-indexed delete, so it's not clear to me
> why we see a lot of extra disk activity with the indexed code.

Possibilities:
- extra cache footprint for code (unlikely cause)
- extra block reads because of index blocks
-

> Redundant Existence Test
> ------------------------
>
> The inner loop of add_ext2_entry has traditionally included a test for
> an existing entry with the same name as the new entry, in which case an
> error exit is taken:
>
> err = -EEXIST;
> if (ext2_match (namelen, name, de))
> goto err_out;
>
> This test is entirely redundant as can be seen from this code snippet
> from fs/namei.c, open_namei:
>
> 980 dentry = lookup_hash(&nd->last, nd->dentry);
> [...]
> 989 /* Negative dentry, just create the file */
> 990 if (!dentry->d_inode) {
> 991 error = vfs_create(dir->d_inode, dentry, mode);
> [...]
> 1000 goto ok;
> 1001 }
> 1002
> 1003 /*
> 1004 * It already exists.
> 1005 */
>
> There is always an explicit search for the entry before creating it
> (except in the case where we find a deleted entry in the dcache - then
> the search can be safely skipped). Thus, ext2's create never gets called
> when the name already exists. Worse, the ext2 existence test is not
> reliable. If there is a hole in the directory big enough for the new
> entry then add_ext2_entry will stop there and not check the rest of
> the directory. So the test in add_ext2_entry adds no extra protection,
> except perhaps helping verify that the code is correct. In this case,
> an assertion would capture the intent better. On the other hand, it
> does cost a lot of CPU cycles.

Possibly it is a holdout from (or extra check for) some sort of locking
race condition? ISTR that the dentry cache _should_ protect us from a
dirent being created twice (that would also corrupt the dentry cache).

However, if it _was_ some sort of race avoidance, the existence check
_would_ be enough. Reasoning is that if we had two processes trying
to create the same dirent then one of them would find "the spot" big
enough to hold the new entry first, and the second process would _have_
to pass this spot in order to find another place to put the dentry
(assuming no other dentry was deleted in the meantime). The check
would be equally valid (if it is needed at all, mind you) in the index
code because we would always search the same hash bucket to find the
new home of the dentry. In fact, in the indexed code we would be much
more likely to find duplicate dentries because the hashing would always
place the duplicate dentries into the same hash bucket.

> With the directory index it becomes attractive to combine the existence
> test with the entry creation and this would come almost for free: after
> a suitable place has been found for the new entry the rest of the block
> has to be searched for an entry of the same name, and if the name's hash
> value continues in the next block(s) those blocks have to be checked
> too, the latter test being needed very infrequently. So in exchange for
> 20-30 new lines of code we get a significant performance boost. A
> similar argument applies to unlink. The only difficulty is, there is
> currently no internal interface to support this, so I'll just note that
> it's possible.

Yes, I have thought about this as well. If it is possible (I'm not sure
how, maybe a hashed per-super cache?) you could keep a pointer to the first
free entry in a directory, along with the dentry size and the mtime of the
directory. You could use this cache at dentry insertion time (validating
it by size and directory mtime). If the cache entry is invalid, fall back
to linear search again.
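A cache entry for that idea might carry just enough to be revalidated
cheaply before it is trusted. The structure and check below are purely
hypothetical (invented names, invented fields), sketching the
validate-by-mtime-and-size scheme described above:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical cached hint: where the first free dirent slot of a
 * directory was last seen, plus what is needed to revalidate it
 * (directory identity, slot size, directory mtime).  On any
 * mismatch the caller falls back to the normal linear search. */
struct free_slot_hint {
	uint32_t dir_ino;	/* which directory the hint describes */
	uint32_t block, offset;	/* location of the free slot */
	uint16_t rec_len;	/* slot size when cached */
	uint32_t mtime;		/* directory mtime when cached */
};

static int hint_usable(const struct free_slot_hint *h,
		       uint32_t ino, uint32_t mtime, uint16_t needed)
{
	return h->dir_ino == ino && h->mtime == mtime &&
	       h->rec_len >= needed;
}
```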

> http://kernelnewbies.org/~phillips/htree/dx.testme.2.4.3-2

Good. You have version numbers for the patches now...

Cheers, Andreas
--
Andreas Dilger \ "If a man ate a pound of pasta and a pound of antipasto,
\ would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/ -- Dogbert

2001-04-16 10:42:01

by Daniel Phillips

[permalink] [raw]
Subject: Re: Ext2 Directory Index - File Structure

Andreas, you wrote:
> Daniel, you write:
> > Andreas, you wrote:
> > > We should go with "EXT2_FEATURE_COMPAT_DIR_INDEX 0x0008"
> > > because the on-disk layout is 100% compatible with older kernels, so
> > > no reason to force read-only for those systems. I'm guessing Ted had
> > > put RO_COMPAT_BTREE_DIR in there in anticipation, but it was never used.
> >
> > By the way, did you mean:
> >
> > #define EXT2_FEATURE_COMPAT_DIR_INDEX 0x0002
> >
> > since there is only one other COMPAT feature currently defined?
>
> Quick note - you need to use 0x0008 or higher (Ted is the authority on
> this). The kernel ext2_fs.h is out of date compared to the one in e2fsprogs.
> The EXT3_FEATURE_COMPAT_HAS_JOURNAL and EXT2_FEATURE_COMPAT_IMAGIC_INODES
> are missing from the kernel header.

Since this could mess up somebody's superblock, albeit in a minor way, I
uploaded the correction right away without changing the patch name.

--
Daniel

2001-04-16 16:37:20

by Daniel Phillips

[permalink] [raw]
Subject: Re: Ext2 Directory Index - File Structure

On Monday 16 April 2001 14:40, you wrote:
> Daniel, you write (re indexed directories):
> > Superblock Feature Flag
> > -----------------------
> >
> > This is now incorporated. I use the following code:
> >
> > if (!EXT2_HAS_COMPAT_FEATURE(sb, EXT2_FEATURE_COMPAT_DIR_INDEX))
> > {
> > lock_kernel();
> > ext2_update_dynamic_rev(sb);
> > EXT2_SET_COMPAT_FEATURE(sb, EXT2_FEATURE_COMPAT_DIR_INDEX);
> > unlock_kernel();
> > ext2_write_super(sb);
> > }
> > new->u.ext2_i.i_flags |= EXT2_INDEX_FL;
>
> Why lock_kernel() calls and not lock_super()? I would _think_ that
> file creation within the VFS was already protected with BKL, but you
> need to do lock_super(), because lock_kernel() can be dropped on
> context switch. A typo in the email maybe?

I lifted the code directly from ext2_update_inode in ext2/inode.c.
I suppose I should have looked at it more critically, but at least I
wrote a comment that it has to be looked at further. The BKL does the
job; on SMP it's overkill, but that is probably inconsequential since
it's a once-per-life-of-filesystem operation. Al Viro has rewritten the
superblock locking completely and is preparing patches for 2.5, so I
suppose we need to see what the new locking regime is going to be. In
2.5 we can clean up all the update_dynamic_rev's at the same time.

> > This is only done on the deferred-creation path since the other path
> > will be gone by beta time.
>
> Yes, it only makes sense to do this at initial dx root creation.

Actually, I do it at the time I set the EXT2_INDEX_FL and the dx root is
created later, on the theory that even the presence of the index flag is
something that fsck might be interested in.

> > Delete Performance
> > ------------------
> >
> > As we both noticed, delete performance for rm -rf is running
> > considerably behind create performance by a factor of 4 or so - worse,
> > it's running behind non-indexed Ext2 :-O
> >
> > There is no valid reason for this in the index code. The indexed delete
> > dirties the same things as a non-indexed delete, so it's not clear to me
> > why we see a lot of extra disk activity with the indexed code.
>
> Possibilities:
> - extra cache footprint for code (unlikely cause)
> - extra block reads because of index blocks
> -

The second possibility is unlikely too, since we see the slowness even
when the directory has only a single block. I'm hunting for your third
possibility now.

I can think of quite a few ways to chase this but they are all
time-consuming, so I'm taking even more time to read the code and settle
on my line of attack. It strikes me that the Linux Trace Toolkit is
something I want to be using right now, but most likely not on my
laptop. I don't know, I think I will get a copy and see if there's any
chance it will run under UML.

I need some way of finding out where all the extra disk events are
coming from.

> > Redundant Existence Test
> > ------------------------
> >
> > The inner loop of add_ext2_entry has traditionally included a test for
> > an existing entry with the same name as the new entry, in which case an
> > error exit is taken:
> >
> > err = -EEXIST;
> > if (ext2_match (namelen, name, de))
> > goto err_out;
> >
> > This test is entirely redundant as can be seen from this code snippet
> > from fs/namei.c, open_namei:
> >
> > 980 dentry = lookup_hash(&nd->last, nd->dentry);
> > [...]
> > 989 /* Negative dentry, just create the file */
> > 990 if (!dentry->d_inode) {
> > 991 error = vfs_create(dir->d_inode, dentry, mode);
> > [...]
> > 1000 goto ok;
> > 1001 }
> > 1002
> > 1003 /*
> > 1004 * It already exists.
> > 1005 */
> >
> > There is always an explicit search for the entry before creating it
> > (except in the case where we find a deleted entry in the dcache - then
> > the search can be safely skipped). Thus, ext2's create never gets called
> > when the name already exists. Worse, the ext2 existence test is not
> > reliable. If there is a hole in the directory big enough for the new
> > entry then add_ext2_entry will stop there and not check the rest of
> > the directory. So the test in add_ext2_entry adds no extra protection,
> > except perhaps helping verify that the code is correct. In this case,
> > an assertion would capture the intent better. On the other hand, it
> > does cost a lot of CPU cycles.
>
> Possibly it is a holdout from (or extra check for) some sort of locking
> race condition? ISTR that the dentry cache _should_ protect us from a
> dirent being created twice (that would also corrupt the dentry cache).
>
> However, if it _was_ some sort of race avoidance, the existence check
> _would_ be enough. Reasoning is that if we had two processes trying
> to create the same dirent then one of them would find "the spot" big
> enough to hold the new entry first, and the second process would _have_
> to pass this spot in order to find another place to put the dentry
> (assuming no other dentry was deleted in the meantime). The check
> would be equally valid (if it is needed at all, mind you) in the index
> code because we would always search the same hash bucket to find the
> new home of the dentry. In fact, in the indexed code we would be much
> more likely to find duplicate dentries because the hashing would always
> place the duplicate dentries into the same hash bucket.

Sorry, I should not have snipped out the locking above. The
test/create is serialized by ->i_sem:

979 down(&dir->d_inode->i_sem);
980 dentry = lookup_hash(&nd->last, nd->dentry);
[...]
989 /* Negative dentry, just create the file */
990 if (!dentry->d_inode) {
991 error = vfs_create(dir->d_inode, dentry, mode);
992 up(&dir->d_inode->i_sem);

Out of interest I'll check to see how far back this goes. (lksr.org has
a full history of kernels in cvs; bitkeeper.com has something similar -
Larry McVoy's version.)

And I suppose I should use a dash between the patchname and
kernelversion to be consistent. I'll do that on the next version and
hopefully that will be the last time I change my patch naming scheme.
(Drifting offtopic here) I guess there are probably more patch naming
schemes than there are hackers, since the natural tendency is to come up
with a new improvement to the naming scheme with each patch revision.
So the lifetime of a patch naming scheme tends towards 1, such a system
being approximately as useful as a read-once cache.
--
Daniel