2009-09-15 09:57:30

by Zhang Huan

Subject: Question on readdir implementation

Hi all,

I'm reading the EXT4 code and have some questions about the readdir
implementation.

Why traverse the directory in hash order? This requires a lot of code
to build and traverse a red-black tree. Why not simply traverse the
directory's blocks in order?

Since the red-black tree is built every time an NFS readdir request
comes in, in the case of a hash collision the NFS client may receive
duplicate directory entries if the buffer is not large enough to
return all entries with the same hash value at once.

Thanks.

--
Zhang Huan


2009-09-15 14:41:39

by Andreas Dilger

Subject: Re: Question on readdir implementation

On Sep 15, 2009 17:57 +0800, Zhang Huan wrote:
> I'm reading the EXT4 code and have some questions about the readdir
> implementation.
>
> Why traverse the directory in hash order? This requires a lot of code
> to build and traverse a red-black tree. Why not simply traverse the
> directory's blocks in order?

Because htree does not maintain constant ordering of directory entries.
Consider a readdir running at the same time as files are being added:

readdir proc                    create proc
read block 0
read block 1
                                create new file
                                hash of filename puts entry into block 0
                                block 0 is full, split it
                                add new block 3
                                copy 1/2 of block 0 entries to block 3
                                add new filename to block 0
read block 2
read block 3

When the readdir returns block 3, half of the entries in that block
are the same ones that were already returned with the original block 0
read. This violates the POSIX readdir() semantics, which require that
each existing entry be returned exactly once (it doesn't matter
whether the new filename is returned or not).
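
To make the duplication concrete, here is a toy userspace sketch of
the timeline above (a simplified model, not ext4 code):

#include <stdio.h>

#define BLOCKS 4
#define SLOTS  4

/* blk[b][s]: name in slot s of directory block b, NULL if empty */
static const char *blk[BLOCKS][SLOTS] = {
	{ "a", "b", "c", "d" },		/* block 0, about to split */
	{ "e", "f" },			/* block 1 */
	{ "g", "h" },			/* block 2 */
	{ 0 },				/* block 3, not yet allocated */
};

int main(void)
{
	int b, s;

	/* readdir proc consumes blocks 0 and 1 */
	for (b = 0; b < 2; b++)
		for (s = 0; s < SLOTS && blk[b][s]; s++)
			printf("returned: %s\n", blk[b][s]);

	/* create proc: block 0 is full, so it splits; half of its
	 * entries move to the new block 3, then the new name goes
	 * into block 0 (which the reader never revisits) */
	blk[3][0] = blk[0][2];
	blk[3][1] = blk[0][3];
	blk[0][2] = blk[0][3] = 0;
	blk[0][2] = "newfile";

	/* readdir proc resumes at block 2; "c" and "d" come back a
	 * second time, violating the once-and-only-once rule */
	for (b = 2; b < BLOCKS; b++)
		for (s = 0; s < SLOTS && blk[b][s]; s++)
			printf("returned: %s\n", blk[b][s]);
	return 0;
}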

> Since the red-black tree is built every time an NFS readdir request
> comes in, in the case of a hash collision the NFS client may receive
> duplicate directory entries if the buffer is not large enough to
> return all entries with the same hash value at once.

This is because NFSv2 only has a 32-bit cookie value, and if there is
a whole buffer full of entries with the same 32-bit hash there isn't
anything that can be done about it except drop the extra duplicates (a
very rare case). There would be a much more serious problem if there
were a directory larger than 2^32 bytes in size, which would result in
all entries beyond 2GB being lost.
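
A toy illustration of why (hypothetical, not nfsd or ext4 code): the
cookie can only say "resume at 32-bit hash H", so if a reply boundary
falls inside a run of entries that share hash H, the resumed listing
cannot tell which of them were already sent.

#include <stdio.h>
#include <stdint.h>

struct entry { uint32_t hash; const char *name; };

/* directory entries in hash order */
static const struct entry dir[] = {
	{ 0x1111, "a" },
	{ 0x2222, "b" },	/* sent in the first reply... */
	{ 0x2222, "c" },	/* ...but "c" did not fit in the buffer */
	{ 0x3333, "d" },
};

int main(void)
{
	uint32_t cookie = 0x2222;	/* hash where the first reply ended */
	unsigned i;

	/* resuming at the cookie sends "b" a second time */
	for (i = 0; i < sizeof(dir) / sizeof(dir[0]); i++)
		if (dir[i].hash >= cookie)
			printf("second reply: %s\n", dir[i].name);
	return 0;
}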

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2009-09-15 14:53:45

by Theodore Ts'o

Subject: Re: Question on readdir implementation

On Tue, Sep 15, 2009 at 05:57:24PM +0800, Zhang Huan wrote:
> Hi all,
>
> I'm reading the EXT4 code and have some questions about the readdir
> implementation.
>
> Why traverse the directory in hash order? This requires a lot of code
> to build and traverse a red-black tree. Why not simply traverse the
> directory's blocks in order?
>
> Since the red-black tree is built every time an NFS readdir request
> comes in, in the case of a hash collision the NFS client may receive
> duplicate directory entries if the buffer is not large enough to
> return all entries with the same hash value at once.

The reason we have to do this is the telldir() and seekdir()
interfaces. NFS is implicated in this as well, because it traverses
the directory using a 32-bit offset on the protocol wire.

The fundamental problem is that they presume the directory is stored
as a linear array. For file systems which use some kind of B-tree or
other non-linear data structure to speed up directory lookups, the
problem is what to do when we need to split a B-tree leaf. When this
happens, half of the directory entries in that B-tree leaf must be
moved to a new directory block. However, readdir() must still return
every single entry in the directory exactly once; and this must be
true even if telldir()/seekdir() is used, and even if NFS decides to
wait for several minutes or hours between reading the first set of
directory entries and reading the next set, sending the NFS server
nothing but the 32-bit offset token.

It is for this reason that we must traverse the directory in hash
order, so that directory entries are returned once and only once even
in the face of node splits. We try very hard to avoid hash collisions,
including using a keyed hash which is kept secret to prevent users
from deliberately trying to trigger the failure mode.
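
(The mapping is done by helpers along the lines of hash2pos() and
pos2maj_hash() in fs/ext4/dir.c; the code below is a userspace
paraphrase from memory, not the kernel code itself.)

#include <stdio.h>
#include <stdint.h>

/* Only the major (high) 32 bits of the 64-bit name hash are encoded
 * into f_pos, shifted down one bit so the result stays positive even
 * for a signed 32-bit off_t; the minor hash is simply dropped. */
static int64_t hash2pos(uint32_t major, uint32_t minor)
{
	(void)minor;		/* doesn't fit in a 32-bit offset */
	return major >> 1;
}

static uint32_t pos2maj_hash(int64_t pos)
{
	return ((uint32_t)pos << 1) & 0xfffffffe;
}

int main(void)
{
	uint32_t major = 0xdeadbeec;	/* htree hashes have bit 0 clear */
	int64_t pos = hash2pos(major, 0x12345678);

	/* the round trip recovers the major hash, but the minor hash
	 * that disambiguates collisions is lost */
	printf("f_pos=%lld, major hash back=%#x\n",
	       (long long)pos, pos2maj_hash(pos));
	return 0;
}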

Looking more closely at what we have done, we should be able to do
better on architectures where off_t is 64 bits. Internally, we use a
64-bit hash, but we only put the high 32 bits of the hash in f_offset
because we can't tell whether the telldir() or telldir64() call might
be used by an application, and we can't tell whether the NFS server is
speaking NFSv2 (where the readdir cookie is only 32 bits) or NFSv3
(where the readdir cookie is 64 bits). On platforms where off_t is 64
bits, we could store the full 64-bit hash value in f_offset, but we
would swap the low and high 32-bit halves, so that the 32 LSBs of the
hash are in the high 32 bits of f_offset and vice versa.
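
A sketch of that encoding (hypothetical; nothing like this is in the
tree today):

#include <stdio.h>
#include <stdint.h>

/* Pack the full 64-bit hash into f_offset with the 32-bit halves
 * swapped: minor hash high, major hash low.  Keeping the value
 * non-negative for a signed loff_t is elided here. */
static int64_t hash2pos64(uint32_t major, uint32_t minor)
{
	return ((int64_t)minor << 32) | major;
}

int main(void)
{
	int64_t pos = hash2pos64(0xdeadbeec, 0x12345678);

	/* an NFSv2 server that blindly truncates to 32 bits keeps the
	 * major hash, which is roughly what it gets today */
	printf("f_offset=%#llx, NFSv2 cookie=%#x\n",
	       (long long)pos, (uint32_t)pos);
	return 0;
}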

The NFSv2 server would still get a 64-bit f_offset, but it currently
doesn't do any range checking before dropping it into the 32-bit
cookie, so the high 32 bits would get truncated. This would result in
the same behavior we have today for those poor benighted souls who are
still using NFSv2, but allow for better behavior for NFSv3, at least
on 64-bit platforms.

So this is something we could do in the future. In practice, no one
has complained about this as far as NFS is concerned, so it's not a
high priority for me to pursue. Were you worried about this as a
practical matter, or as a theoretical one?

Regards,

- Ted

2009-09-15 18:32:09

by Florian Weimer

Subject: Re: Question on readdir implementation

* Theodore Tso:

> So this is something we could do in the future. In practice, no one
> has complained about this as far as NFS is concerned, so it's not a
> high priority for me to pursue. Were you worried about this as a
> practical matter, or as a theoretical one?

readdir returning entries in essentially randomized order is a
practical performance problem for many things, from grep -r to tar. 8-(
(My recent FIBMAP/FIEMAP question was related to that, too.)

2009-09-15 18:38:52

by Theodore Ts'o

Subject: Re: Question on readdir implementation

On Tue, Sep 15, 2009 at 05:56:32PM +0000, Florian Weimer wrote:
> * Theodore Tso:
>
> > So this is something we could do in the future. In practice, no one
> > has complained about this as far as NFS is concerned, so it's not a
> > high priority for me to pursue. Were you worried about this as a
> > practical matter, or as a theoretical one?
>
> readdir returning entries in essentially randomized order is a
> practical performance problem for many things, from grep -r to tar. 8-(
> (My recent FIBMAP/FIEMAP question was related to that, too.)

Well, it's not _that_ hard for applications to sort the directory
entries by inode number. I've written an LD_PRELOAD library, called
spd_readdir, for people who want to use it. The mutt application does
this natively, and it makes the problem go away.
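
A minimal userspace version of the same trick looks something like
this (my sketch here, not the actual spd_readdir code):

#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>

/* Compare directory entries by inode number, so that a subsequent
 * stat()/open() pass walks the inode table roughly in disk order. */
static int by_inode(const struct dirent **a, const struct dirent **b)
{
	if ((*a)->d_ino < (*b)->d_ino)
		return -1;
	return (*a)->d_ino > (*b)->d_ino;
}

int main(int argc, char **argv)
{
	struct dirent **list;
	int i, n;

	n = scandir(argc > 1 ? argv[1] : ".", &list, NULL, by_inode);
	if (n < 0) {
		perror("scandir");
		return 1;
	}
	for (i = 0; i < n; i++) {
		printf("%10lu  %s\n", (unsigned long)list[i]->d_ino,
		       list[i]->d_name);
		free(list[i]);
	}
	free(list);
	return 0;
}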

We could do this in the kernel, but for very large directories, you
will end up pinning large amounts of memory --- and if an application
holds a directory fd open for a long time, the memory needs to stay
pinned until the dfd is closed. Still, for moderately sized
directories, it's a possibility.

- Ted

2009-09-16 05:47:55

by Zhang Huan

Subject: Re: Question on readdir implementation

On Tue, Sep 15, 2009 at 10:53:37AM -0400, Theodore Tso wrote:
> On Tue, Sep 15, 2009 at 05:57:24PM +0800, Zhang Huan wrote:
> > Hi all,
> >
> > I'm reading the EXT4 code and have some questions about the readdir
> > implementation.
> >
> > Why traverse the directory in hash order? This requires a lot of code
> > to build and traverse a red-black tree. Why not simply traverse the
> > directory's blocks in order?
> >
> > Since the red-black tree is built every time an NFS readdir request
> > comes in, in the case of a hash collision the NFS client may receive
> > duplicate directory entries if the buffer is not large enough to
> > return all entries with the same hash value at once.
>
> The reason we have to do this is the telldir() and seekdir()
> interfaces. NFS is implicated in this as well, because it traverses
> the directory using a 32-bit offset on the protocol wire.
>
> The fundamental problem is that they presume the directory is stored
> as a linear array. For file systems which use some kind of B-tree or
> other non-linear data structure to speed up directory lookups, the
> problem is what to do when we need to split a B-tree leaf. When this
> happens, half of the directory entries in that B-tree leaf must be
> moved to a new directory block. However, readdir() must still return
> every single entry in the directory exactly once; and this must be
> true even if telldir()/seekdir() is used, and even if NFS decides to
> wait for several minutes or hours between reading the first set of
> directory entries and reading the next set, sending the NFS server
> nothing but the 32-bit offset token.
>
> It is for this reason that we must traverse the directory in hash
> order, so that directory entries are returned once and only once even
> in the face of node splits. We try very hard to avoid hash collisions,
> including using a keyed hash which is kept secret to prevent users
> from deliberately trying to trigger the failure mode.
>
> Looking more closely at what we have done, we should be able to do
> better on architectures where off_t is 64 bits. Internally, we use a
> 64-bit hash, but we only put the high 32 bits of the hash in f_offset
> because we can't tell whether the telldir() or telldir64() call might
> be used by an application, and we can't tell whether the NFS server is
> speaking NFSv2 (where the readdir cookie is only 32 bits) or NFSv3
> (where the readdir cookie is 64 bits). On platforms where off_t is 64
> bits, we could store the full 64-bit hash value in f_offset, but we
> would swap the low and high 32-bit halves, so that the 32 LSBs of the
> hash are in the high 32 bits of f_offset and vice versa.
>
> The NFSv2 server would still get a 64-bit f_offset, but it currently
> doesn't do any range checking before dropping it into the 32-bit
> cookie, so the high 32 bits would get truncated. This would result in
> the same behavior we have today for those poor benighted souls who are
> still using NFSv2, but allow for better behavior for NFSv3, at least
> on 64-bit platforms.
>
Thanks for taking the time to explain this in such detail.

So there isn't much we can do before we improve nfsd's offset range
checking.

> So this is something we could do in the future. In practice, no one
> has complained about this as far as NFS is concerned, so it's not a
> high priority for me to pursue. Were you worried about this as a
> practical matter, or as a theoretical one?
>
Oh, I was asking because a colleague told me he had seen this happen
once before. He was testing a proprietary Windows NFSv3 client at the
time, using EXT3 as the backend filesystem. I did not know that a hash
collision was the cause of the problem until I spent some time reading
the EXT3 code.

> Regards,
>
> - Ted

--
Zhang Huan