In recent discussions on #linuxfs with Coly Li (I hope I have the email
correct), we have ressurected the idea of improving the stat and unlink
performance of ext3 htree directories. Coly seems interested in working
on fixing this problem, so I thought I'd start an email thread in the
list to get wider input, and also keep a record of the discussions.
This problem was exposed widely in an old thread started by Andrew Morton:
and can be summarized as "htree directories have random readdir
order compared to inode number" (and vice versa).
And my email in that thread explains the solution relatively cleanly:
and can be summarized as "allocate new inodes in a manner that takes
the name hash into account".
Apparently, this has become a hot topic for the kernel.org systems
because Git is fond of creating huge directories.
It should be possible to make a patch that improves the performance
of readdir+stat or readdir+unlink (for directories created under this
scheme) but does not affect the on-disk format.
Ted has also proposed the LD_PRELOAD library for doing sorting of the
readdir output in userspace before returning it to the application.
This can be helpful, but until this approach is widely available (e.g.
built into glibc or something) it will have limited benefit to users.
This is complementary with any approach taken here.
The basic goal, as discussed in my above email is to map the filename
to a goal inode number in (growing) range of inodes/blocks in the itable.
The resulting inode order is piece-wise in approximately hash order. The
goal selection part is relatively straight forward, especially for new
directories, where we could keep a count of the number of inodes in
a directory and scale the itable window size as a function of the number
of inodes. We would likely reserve the itable blocks for a given directory
in the same manner (and possibly even the same code) that ext3 reservation
For existing directories, the number of inodes in the directory (and hence
the itable window size) can also be approximated by the current size of
the directory divided by an average filename length. Whether the directory
is full or not is less important, if we assume that past usage of the
directory is an indication of future usage. It is in fact better to know
the directory size in advance because we get fewer "pieces" in the hash to
What remains open for discussion (hopefully someone will have a good answer)
with existing directories is how to find the previous itable allocation
window(s) across a remount. For directories with more than tens of
thousands of inodes in them the inodes by necessity must span multiple
block groups, and there is no guarantee that the itable ranges are
contiguous (no point in reserving a range of itable blocks that are full).
One possibility is to scan the directory leaf block that will ultimately
hold the new inode and try to allocate in the same itable block(s) that
the other entries in this leaf are using (preferring the block that is
used closest hashes in that leaf). Drawbacks of this are that renamed
entries will have non-local inode numbers (which is a drawback of the
whole scheme, actually) and we don't store hashes with each entry.
We could also persistently store a "hint" for the itable allocation window
in the directory somewhere (e.g. the htree header by increasing the
Another possibility, first suggested by Ted in the same thread above,
and something I've been thinking about since the meeting in Portland
last year, is to change the directory format so that the directory leaf
blocks are actually the inode table itself. It is complex enough that
it should go into a separate email, and is only worth considering if
we are going to have another incompatible change to the on-disk format
(though we could bundle a number of inode- and directory- specific
Principal Software Engineer
Cluster File Systems, Inc.