From: Zhang Huan <zhhuan@gmail.com>
Subject: Re: Question on readdir implementation
Date: Wed, 16 Sep 2009 13:47:46 +0800
Message-ID: <20090916054746.GA2393@zhanghuan.nrchpc.ac.cn>
References: <20090915095724.GA8440@zhanghuan.nrchpc.ac.cn> <20090915145337.GB23118@mit.edu>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: linux-ext4@vger.kernel.org
To: Theodore Tso <tytso@mit.edu>
Content-Disposition: inline
In-Reply-To: <20090915145337.GB23118@mit.edu>
Sender: linux-ext4-owner@vger.kernel.org

On Tue, Sep 15, 2009 at 10:53:37AM -0400, Theodore Tso wrote:
> On Tue, Sep 15, 2009 at 05:57:24PM +0800, Zhang Huan wrote:
> > Hi all,
> > 
> > I'm reading EXT4 codes and has some questions about readdir
> > implementation.
> > 
> > Why traverse the directory in hash order? This brings lots of code to
> > build and traverse a red-black tree. Why not just plainly traverse the
> > directory's blocks?
> > 
> > Since the red-black tree is built every time a NFS readdir request comes
> > in, in case of hash collision, the nfs client may receive duplicate dir
> > entries if the buffer is not large enough to return all entries with the
> > same hash value in once.
> 
> The reason why we have to do this is because of the telldir() and
> seekdir() interfaces.  NFS is implicated in this as well, because it
> traverses the directory using a 32-bit offset on the protocol wire.
> 
> The fundamental problem is they presume that directory is stored as a
> linear array.  For file systems which use some kind of B-tree or other
> non-linear data structure to speed up directory lookups, the problem
> is what to do when we need to split a B-tree leaf.  When this happens,
> half of the directory entries in the that B-tree leaf must be moved to
> a new directory block.  However, readdir() must still return every
> single entry in the directory once and exactly once; and this must be
> true even if even if telldir()/seekdir() is used, and even if NFS
> decides to wait for several minutes or hours between reading the first
> set of directory entries, and then reading the next set of directory
> entries, sending to the NFS server nothing but the 32-bit offset
> token.
> 
> It is for this reason that we must traverse the directory in hash
> order; so that directory entries are returned once and only once even
> in the face of node splits.  We try very hard to avoid a hash
> collisions, including using a keyed hash which is kept secret to
> prevent users from deliberately trying to trigger the failure mode.
> 
> Looking more closely at what we have done, we should be able to do
> better on architectures where off_t is 64 bits.  Internally, we use a
> 64-bit hash, but we only put the high 32 bits of the hash in f_offset
> because we can't tell whether the telldir() or telldir64() call might
> be used by an application, and we can't tell whether the NFS server is
> speaking NFSv2 (where the readdir cookie is only 32 bits) or NFSv3
> (where the readdir cookie is cookie is 64 bits).  On platforms where
> off_t is 64-bytes, we could store the full 64-bit hash value in
> f_offset, but we would swap the low and high 32-bit values, so that 32
> LSB are in the high 32 bits of f_offset and vice versa.  
> 
> The NFS v2 server would still get a 64-bit f_offset, but it currently
> doesn't do any range checking before dropping it into the 32-bit
> cookie, so the high 32 bits would get truncated.  This would result in
> the same behaviour we have today for those poor benighted souls which
> are still using NFSv2, but allow for better behavior for NFSv3 at
> least on 64-bit platforms.
> 
Thanks for replying to explain in such details.

So there isn't much we can do before we improve nfsd's offset range
checking.

> So this is something we could do in the future.  In practice, no one
> has complained about this as far as NFS is concerned, so it's not high
> priority for me to pursue.  Were you worried about this as a practical
> matter, or as a theoretical one?
> 
Oh, I was asking because some guy I work with told me he has seen this
once before. He was testing a proprietary Windows NFS v3 client then,
using EXT3 as backend filesystem.
I do not know it is hash collision that cause the problem until I spent
some time reading EXT3 codes.

> Regards,
> 
> 						- Ted

-- 
Zhang Huan