2008-07-21 21:11:51

by Daniel Phillips

Subject: [RFC] A better solution to the HTree telldir problem

Hi Ted,

After all these years, I may have finally noticed a better approach to
the NFS directory cookie problem in HTree.

Let me summarize the issue for those not privy to its horrors:

* The Single Unix Specification defines semantics for directory
traversal whereby it is permitted to create and delete directory
entries simultaneously with a process that reads the directory
entries and does something with them.

* SUS specifies that a sequence of all the files in a directory will
be returned by the readdir call, and though it leaves unspecified
whether a simultaneously deleted or created directory entry will
appear in the sequence, it does seem to require that no entry appear
more than once and that no entry that existed during the entire
directory traversal be omitted. (This is not directly stated,
but can be inferred from the definition of directory stream.)
http://www.opengroup.org/onlinepubs/009695399/functions/readdir.html
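
For onlookers, the userspace pattern at issue is just this
(illustrative only, error handling abbreviated):

#include <dirent.h>
#include <stdio.h>

void list_with_cookies(const char *path)
{
        DIR *dir = opendir(path);
        struct dirent *de;

        if (!dir)
                return;

        while ((de = readdir(dir)) != NULL) {
                long cookie = telldir(dir);     /* opaque resume point */
                printf("%ld\t%s\n", cookie, de->d_name);
        }

        /* Any cookie printed above may later be handed back via
         * seekdir(dir, cookie) to resume the stream at that point,
         * possibly after entries have been created or deleted; the
         * filesystem must then neither repeat nor drop any name that
         * existed for the whole traversal. */
        closedir(dir);
}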

The implied requirement that no pre-existing directory entry be returned
multiple times or be omitted poses problems for modern btree directory
indexing schemes, which like to move batches of entries around on disk
as new entries are created or old ones deleted. Whatever the indexing
scheme, it is necessary to come up with some stable enumeration of
directory entries that satisfies the (implied) SUS requirement. Enter
NFS with a monumental design blunder that makes this exceptionally
difficult to achieve for any modern directory indexing scheme. You
summarize the issues authoritatively here:

http://lkml.org/lkml/2007/4/7/159

One nit: NFS v2 provides 31 bits worth of telldir cookie, not 32. (But
you knew that)

I see that you have been busy explaining the gory details to Chris, as
you once explained them to me:

http://oss.oracle.com/pipermail/btrfs-devel/2008-January/000460.html

The code to deal with the telldir cookie problem in HTree was written by
one Theodore Y. Ts'o in 2002, and consists of some hundreds of lines of
clever RB tree hacking that miraculously works so reliably that none of
the tens of millions of Ext3 users has had any complaint for years. Now
six years after the fact, I think there is a nicer way.

This is something of a duh: if HTree refrains from splitting leaves
while a directory traversal is in progress then a simple linear
traversal can be used in classic Ext2/UFS fashion, with the logical
file address of each directory entry serving as the telldir cookie.

The idea is to disable leaf splitting during directory traversal and use
an alternate leaf overflow mechanism in its place. To handle a create
into a full leaf when splitting is disabled, a new leaf block is
created having the same hash index as the current one and inserted
before the full leaf in the hash tree to hold the new entry and any
subsequent creates into the same hash bucket. The lookup probe has to
be modified slightly to traverse all leaves with the same hash index.
Or maybe this already works.
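
In rough pseudo-kernel-C, the create path would look something like
this (every htree_* helper here is invented for illustration; this is
a sketch of the idea, not a patch):

static int htree_add_entry_nosplit(struct inode *dir,
                                   struct dx_frame *frame, u32 hash,
                                   struct dentry *dentry)
{
        /* All htree_* helpers below are made-up names. */
        struct buffer_head *leaf = htree_read_leaf(dir, frame);

        if (htree_leaf_has_room(leaf, dentry))
                return htree_insert_into_leaf(leaf, dentry);

        if (!htree_split_disabled(dir))
                return htree_split_and_insert(dir, frame, hash, dentry);

        /*
         * Splitting is disabled: allocate a fresh leaf carrying the
         * same hash index and link it in *before* the full leaf, so
         * this and subsequent creates for the bucket land in it and
         * lookups still visit every block that shares the hash.
         */
        leaf = htree_new_leaf_same_hash(dir, frame, hash);
        return htree_insert_into_leaf(leaf, dentry);
}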

HTree could opportunistically re-distribute the entries later when no
directory listing is in progress, or it could just leave things as they
are. The fallback is quite unlikely to occur in practice, and the
alternative storage format is only very slightly less efficient to
process on lookup, and does not consume any extra space.

A possible heuristic for disabling node splitting: disable on each
opendir and re-enable after a matching number of closedirs has been
received, unless a telldir was seen, in which case splitting stays
disabled until a subsequent readdir returns NULL, followed by the
matching closedirs.
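
In other words, something like the following per-directory
bookkeeping (all names invented, just to pin down the idea):

struct htree_traverse_state {
        atomic_t open_streams;  /* opendir minus closedir */
        bool     cookie_out;    /* a telldir cookie may still be live */
};

/* A readdir that hits end-of-directory retires any outstanding cookie. */
static void htree_readdir_eof(struct htree_traverse_state *ts)
{
        ts->cookie_out = false;
}

static bool htree_may_split(struct htree_traverse_state *ts)
{
        /* Split leaves only when no directory stream is open and no
         * telldir cookie could still be used to resume a traversal. */
        return atomic_read(&ts->open_streams) == 0 && !ts->cookie_out;
}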

A new feature bit would be needed, to be set only if the new overflow
code path is ever hit for a given directory (unlikely...), so that an
older HTree-enabled ext3 falls back to a linear scan if somebody
manages to write a volume under an upgraded kernel and then reboots
into an older kernel. There would also be a flag in the HTree index
root block that does not affect the volume as a whole. I think that
the way we did it, any unknown HTree flag just causes a fallback to
linear scan. (If so, this would be a nice example of forward
compatibility gone right.)
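
The check an old kernel would make amounts to something like this (the
flag name is invented and the real index root layout differs in
detail):

#define DX_FL_OVERFLOW_CHAIN 0x01       /* invented flag name */

static int dx_check_root_flags(u8 flags)
{
        /* Any bit we do not understand means a newer kernel wrote this
         * directory: treat it as unindexed and fall back to a plain
         * Ext2-style linear scan. */
        if (flags & ~DX_FL_OVERFLOW_CHAIN)
                return -ENOTSUPP;
        return 0;
}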

Now the sticky bit where the brain damaged NFS semantics really get to
take their best kick at us. This mechanism has to work across NFS server
reboots. We can notice from the flag in the HTree index root block that
a particular directory went into an "NFS mode" that was never properly
completed. But how can we know when to take the directory out of NFS
mode and get on with a slightly more efficient version of life? Well.

There are two kinds of NFS servers available on Linux: 1) kernel and
2) userspace. The kernel version is... part of the kernel, and so we
can always know when it has mounted our volume. We could give it a
minute or two to reconnect to the client that was doing the readdir,
then assume the readdir will never resume. Otherwise, if the readdir
does resume in that time, the client will get exactly the SUS semantics
and things will work properly even if the server reboots again.

The user space server is another story. I have no idea how to detect
its presence, and I doubt we need to care, because it already suffers
from fatal flaws such as:

"some awkwardnesses in the semantics (for instance, moving a file
to a different directory will render its file handle invalid)"
http://packages.debian.org/sid/nfs-user-server

Nobody should use this server, and if they do, they simply risk
dropouts and repeated filenames in a directory listing if they go out
of their way to abuse it. What you would have to do to break the above
heuristic for the user space NFS server:

* NFS client starts a directory listing
* Server reboots after retrieving some dirents
* NFS client creates some new files in the directory, splitting a
btree block
* Client side ls omits or repeats a few names. Sob.
* Moral of the story is: use knfsd if you want your NFS server to
respect reboot semantics properly, or to operate properly at all.

If the new semantics are acceptable, the payoff for adopting this simpler
method is:

* Can use a more efficient HTree hash because we do not need to be so
paranoid about bucket levelling to avoid rare cases that might break
the RB tree approach.

* We get to use the nice shiny alternative hash mechanism that we put
considerable effort into building but so far have never used.

* Considerably more efficient directory traversal.

* Delete will now happen in physical create order and therefore not
thrash the inode table, which is a problem that HTree can currently
hit with large directories and tight cache.

* Several hundred lines of special purpose code removed. (Replaced by
maybe 50 different lines of special purpose code.)

Drawbacks:

* Adds a new feature to knfsd to force it to advertise its presence
to filesystems. On the other hand, this might be seen as a feature
useful to other filesystems (btrfs and jfs come to mind).

* Breaks the user space NFS server worse than it already is. But it
is already fatally broken, so do we care?

Regards,

Daniel


2008-07-22 03:00:16

by Theodore Ts'o

Subject: Re: [RFC] A better solution to the HTree telldir problem

On Mon, Jul 21, 2008 at 02:11:51PM -0700, Daniel Phillips wrote:
> This is something of a duh: if HTree refrains from splitting leaves
> while a directory traversal is in progress then a simple linear
> traversal can be used in classic Ext2/UFS fashion, with the logical
> file address of each directory entry serving as the telldir cookie.

The problem here is the phrase "while a directory traversal is in
progress". What does that mean? While a file descriptor is open on a
directory? An application could call opendir(), and then go to sleep
for a week. Or, a user could type control-Z while an "ls -lR" is in
progress, and then forget to resume it. So any scheme has to be
prepared to not split leaves for a *long* time.

You later say that we only need to do this if the user calls
telldir(); although of course the kernel doesn't know if the user has
called telldir(). We only know if the lseek() system call has been
called. So it becomes impossible for us to distinguish between an
application calling telldir() and one calling rewinddir(). So we would
end up
needing to do this more often than might be strictly necessary.

The problem becomes even worse with NFS, because we have absolutely no
idea whether or not lseek() has been called. Even worse, because NFS
is a stateless protocol, we don't even know when a file handle is
closed! So we can't even detect a closedir()! Or of course, an
opendir()! With NFS all you get is an NFS getattr RPC, which could be
used to service either a stat(2) call or an open(2) call. The net is
that while an NFS mount is in progress, we would never be able to split a
leaf.

So your optimization of only doing this if there is a telldir()
operation may not make the operation as rare as we might like,
especially if there are some user programs that keep file descriptors
open over long periods for inotify purposes, and then use rewinddir().
We would need to check out the behavior of programs such as trackerd
and the GNome VFS library routines to see whether this could
potentially be a problem.

It might be possible to assume that if userspace calls
lseek(fd, 0, SEEK_SET), this really is a rewinddir() and not a
telldir(), but this could be a potentially dangerous assumption since
there is no way to know for sure whether userspace is paying attention
to the return value of lseek(fd, 0, SEEK_SET) or not.
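
Put concretely, all the kernel-side hook ever gets to see is something
like this (names illustrative only):

/* The directory's llseek method sees only offset and whence; whether
 * userspace meant rewinddir() or is holding a telldir() cookie is
 * invisible at this level. */
static loff_t dir_llseek_guess(struct file *file, loff_t offset,
                               int whence)
{
        if (whence == SEEK_SET && offset == 0) {
                /* Probably rewinddir(); but the caller might just be
                 * reading back the position, so guessing is risky. */
        } else {
                /* Have to assume a cookie escaped, so leaf splitting
                 * would stay disabled for this directory. */
        }
        return default_llseek(file, offset, whence);
}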

> The idea is to disable leaf splitting during directory traversal and use
> an alternate leaf overflow mechanism in its place. To handle a create
> into a full leaf when splitting is disabled, a new leaf block is
> created having the same hash index as the current one and inserted
> before the full leaf in the hash tree to hold the new entry and any
> subsequent creates into the same hash bucket. The lookup probe has to
> be modified slightly to traverse all leaves with the same hash index.
> Or maybe this already works.

The lookup probe would indeed need to be modified, but that's not a
major deal. The bigger deal is that lookups are less efficient, since
it effectively looks like a hash bucket overflow.

> HTree could opportunistically re-distribute the entries later when no
> directory listing is in progress, or it could just leave things as they
> are. The fallback is quite unlikely to occur in practice, and the
> alternative storage format is only very slightly less efficient to
> process on lookup, and does not consume any extra space.

How often this takes place depends on how often an opendir()/readdir()
cycle is held open for a long period of time. It will depend on your
application workload. And if NFS is enabled, all bets are off.

> If the new semantics are acceptable, the payoff for adopting this simpler
> method is:
>
> * Can use a more efficient HTree hash because we do not need to be so
> paranoid about bucket levelling to avoid rare cases that might break
> the RB tree approach.

Yes we do, because the biggest problem we have right now is that your
original implementation hard-codes us to a tree which is at most two
levels deep. Clusterfs/Sun has more general code which allows an
arbitrarily deep tree, but we would need to integrate that into the
kernel, and also get support for that into e2fsck.

> * We get to use the nice shiny alternative hash mechanism that we put
> considerable effort into building but so far have never used.

We are using alternative hash algorithms already....

> * Delete will now happen in physical create order and therefore not
> thrash the inode table, which is a problem that HTree can currently
> hit with large directories and tight cache.

Not just deletes, but any kind of readdir followed by stat operation.
This includes "ls -sF", find, sendmail/exim queue operations, etc., in
addition to "rm -rf".

> * Several hundred lines of special purpose code removed. (Replaced by
> maybe 50 different lines of special purpose code.)

If you mean the rbtree fill code, I suspect the special purpose code
you are referring to will end up being roughly the same size, plus or
minus 20%. That's because in order to do what you are
suggesting, after we finish looking at one leaf node, we will need to
go back to the interior node, and find the next leaf node to see if
the hash is the same. And if we are at the end of the interior node,
we may need to recursively go up one level, and then back down to find
the subsequent leaf node.
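
The shape of that walk is something like this (field and helper names
are invented; this is not the existing htree code):

static struct buffer_head *htree_next_leaf(struct dx_frame *frames,
                                           int level)
{
        /* Climb while the current interior node has been exhausted. */
        while (level >= 0 && frames[level].at == frames[level].last)
                level--;
        if (level < 0)
                return NULL;            /* walked off the root: done */

        frames[level].at++;             /* next entry at this level */

        /* Walk back down the first child at each level until we are
         * standing on a leaf again, then compare its hash. */
        return htree_descend_first(frames, level);
}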

Even worse, if NFS is enabled in any way, shape or form, this scheme
is totally and completely screwed. So we have to use the old way of
doing things if NFS is turned on.

- Ted

2008-07-22 05:57:46

by Daniel Phillips

Subject: Re: [RFC] A better solution to the HTree telldir problem

On Monday 21 July 2008 19:56, Theodore Tso wrote:
> Even worse, if NFS is enabled in any way, shape or form, this scheme
> is totally and completely screwed. So we have to use the old way of
> doing things if NFS is turned on.

I didn't realize it would be so difficult to detect telldir coming from
NFSD. NFS is the entire reason for all this posturing, so if the
strategy fails for NFS then it isn't worth spending any time on it.

I wrote generic btree code for ddsnap with arbitrary depth and merge on
delete, but was never able to justify the time to port it to HTree
during my day job. Props to clusterfs (Andreas!) for stepping up to
do this. No doubt a commercial imperative helped achieve focus there.

But I will not use HTree for my own filesystem work; I will use a new
thing called PHTree, a (P)hysically stable variant of HTree, obtained
by inserting a new layer of index blocks, the "terminal" index blocks,
between the index nodes and the dirent blocks. Each terminal index
block is populated with { hash : block } pairs, each of which
indicates that there is a dirent in <block> with hash <hash>.

This requires one extra lookup per operation, which is regrettable, but
it solves the physical stability problem. Directory entries will just
stay in the leaf blocks in which they are first created, and only
index nodes will ever be split. Because leaf blocks will typically be
full instead of 75% full in HTree, the space usage ends up about the
same. A more efficient hash algorithm can be used, and deletion and
other operations will tend to be faster due to less thrashing of the
inode table.
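
On disk, a terminal index block would look something like this (the
layout is provisional and nothing is final about the field widths):

struct phtree_term_entry {
        __le32 hash;    /* hash of the directory entry's name */
        __le32 block;   /* logical block of the dirent leaf holding it */
};

struct phtree_term_block {
        __le16 count;   /* entries in use */
        __le16 limit;   /* entries that fit in this block */
        struct phtree_term_entry entries[];     /* sorted by hash */
};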

Free space management becomes an issue: when dirents are deleted we
need to be able to re-use them for new entries. To support efficient
detection of available dirents of appropriate size, I will use a lazy
method of recording the maximum sized dirent available in any leaf
block. The actual largest free dirent may be smaller than that, which
will be detected when a search fails, causing the lazy limit to be
updated. We can safely skip searching for free space in any block
for which the lazy max is less than the size we require. One byte is
sufficient for the lazy max, so one 4K block is sufficient to keep
track of 2^12 * 2^12 bytes worth of directory blocks, a 16 meg
directory with about half a million entries. For larger directories
this structure becomes a radix tree, with lazy max recorded also at
each index pointer for quick free space searching.
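
In rough outline (nothing here is coded, the names are placeholders,
and the byte encoding, say 16-byte granules, is still an open detail):

/* Assumed helpers living elsewhere in the implementation: */
bool leaf_try_insert(int block, unsigned need);
unsigned leaf_largest_free(int block);

#define GRANULE 16      /* assumed: free space recorded in 16-byte units */

struct phtree_freemap {
        u8 max_free[4096];      /* one 4K block covers 2^12 leaf blocks */
};

static int phtree_find_space(struct phtree_freemap *map, int nblocks,
                             unsigned need)
{
        u8 need_g = DIV_ROUND_UP(need, GRANULE);
        int i;

        for (i = 0; i < nblocks; i++) {
                if (map->max_free[i] < need_g)
                        continue;       /* cannot possibly fit here */
                if (leaf_try_insert(i, need))
                        return i;       /* found room in leaf i */
                /* The hint was stale (it is only ever too high): record
                 * the real largest free dirent found during the failed
                 * search so this block is skipped next time. */
                map->max_free[i] = DIV_ROUND_UP(leaf_largest_free(i),
                                                GRANULE);
        }
        return -1;      /* no leaf currently has room; extend the directory */
}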

Mind you, I haven't coded this yet and it will be a good few months
before I do, so I don't have performance numbers. I just thought I
should mention it now, because I expect it to meet or beat HTree
overall in performance while supporting brain damaged NFS semantics in
a much nicer way. Hopefully we will find out for sure before Ext4
indexing becomes cast in stone.

Regards,

Daniel