2001-02-20 19:02:58

by Daniel Phillips

Subject: [rfc] Near-constant time directory index for Ext2

Earlier this month a runaway installation script decided to mail all its
problems to root. After a couple of hours the script aborted, having
created 65535 entries in Postfix's maildrop directory. Removing those
files took an awfully long time. The problem is that Ext2 does each
directory access using a simple, linear search through the entire
directory file, resulting in n**2 behaviour to create/delete n files.
It's about time we fixed that.

Last fall in Miami, Ted Ts'o mentioned some ideas he was playing with
for an Ext2 directory index, including the following points:

- Fixed-size hash keys instead of names in the index
- Leaf blocks are normal ext2 directory blocks
- Leaf blocks are sequential, so readdir doesn't have to be changed

Having thought about it on and off since then, I came up with the
following additional design elements:

- Logical addressing
The cost of logical addressing of disk blocks is scarcely higher
than physical addressing, and logical access through the page cache
is actually faster than physical addressing because you don't have
to traverse a tree: you can go straight to the logical block you are
interested in and only traverse the tree if it's not there. The
payoff in terms of not breaking Ext2's existing allocation strategy
is huge, not to mention friendliness to tools such as e2fsck and
e2resize. Finally, logical addressing means Tux2 will support
this feature without modification. :-)

- 8 bytes is sufficient for an index entry
This gives a branching factor of 512 for 4K filesystem blocks
resulting in log512(n) access time, performance that is almost
indistinguishable from constant-time. The 8 bytes allows for a 32
bit hash key (31 bits actually used, see below) and a 4 byte
logical block number, both sufficient for handling billions of
directory entries.

- Uniform-depth tree
Usually, some form of balanced tree is used for a directory index.
I found that a simple, uniform-depth tree provides equivalent
performance with far simpler code. Just two tree levels can handle
millions of directory entries, and for all practical purposes,
such a tree is never out of balance.

So to give this a name, it's a "uniform-depth hash tree", or htree for
short. (It's not a btree.) Such a structure inherits properties of
both trees and hash tables. From the hash table side, the htree
inherits the advantage of compact, fixed-size keys which gives a high
branching factor and enables the use of binary search in interior index
nodes.

It also inherits a big disadvantage of hash tables: key collisions.
Though rare, collisions give rise to a number of corner cases that
are not particularly easy to deal with. (see below)

Index Structure
---------------

The root of the index tree is in the 0th block of the file. Space is
reserved for a second level of the index tree in blocks 1 though 511
(for 4K filesystem blocks). Directory leaf blocks are appended
starting at block 512, thus the tail of the directory file looks like a
normal Ext2 directory and can be processed directly by ext2_readdir.
For directories with less than about 90K files there is a hole running
from block 1 to block 511, so an empty directory has just two blocks in
it, though its size appears to be about 2 Meg in a directory listing.

So a directory file looks like:

0: Root index block
1: Index block/0
2: Index block/0
...
511: Index block/0
512: Dirent block
513: Dirent block
...

Each index block consists of 512 index entries of the form:

hash, block

where hash is a 32 bit hash with a collision flag in its least
significant bit, and block is the logical block number of an index or
leaf block, depending on the tree level.

The hash value of the 0th index entry isn't needed because it can
always be obtained from the level above, so it is used to record the
count of index entries in an index block. This gives a nice round
branching factor of 512; the evenness is a nicety that mainly
satisfies my need for regularity rather than winning any real
performance. (On the other hand, the largeness of the branching factor
matters a great deal.)
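
To make that concrete, here is a minimal user-space sketch of the entry
layout and of the binary search the structure permits (the names and
helpers are mine, not the patch's, and the patch below still uses a
linear scan of the root):

#include <stdint.h>

/* One 8-byte index entry; 512 of them fit in a 4K block. */
struct dx_entry {
	uint32_t hash;		/* 31-bit hash, low bit is the collision flag */
	uint32_t block;		/* logical block of an index or leaf block */
};

/* Entry 0's hash field is reused to hold the count of entries. */
static inline unsigned dx_count(const struct dx_entry *entries)
{
	return entries[0].hash;
}

/* Find the last entry whose hash is <= the search hash.  Entry 0,
 * having no hash of its own, acts as a "minus infinity" sentinel, so
 * the search can never fall off the front of the block. */
static const struct dx_entry *dx_find(const struct dx_entry *entries,
				      uint32_t hash)
{
	unsigned lo = 1, hi = dx_count(entries);

	while (lo < hi) {
		unsigned mid = lo + (hi - lo) / 2;
		if (entries[mid].hash > hash)
			hi = mid;
		else
			lo = mid + 1;
	}
	return entries + lo - 1;
}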

The root index block has the same format as the other index blocks,
with its first 8 bytes reserved for a small header:

1 byte header length (default: 8)
1 byte index type (default: 0)
1 byte hash version (default: 0)
1 byte tree depth (default: 1)
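
In C terms the header might be laid out something like this (purely
illustrative; the patch below gets by with a bare 32-bit entry count in
its dx_root_info instead, and the reserved padding is my assumption):

#include <stdint.h>

/* 8-byte root header, so it occupies exactly one index entry slot and
 * the root still holds 511 full entries. */
struct dx_root_header {
	uint8_t  header_len;	/* default: 8 */
	uint8_t  index_type;	/* default: 0 */
	uint8_t  hash_version;	/* default: 0 */
	uint8_t  tree_depth;	/* default: 1 */
	uint32_t reserved;
};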

The treatment of the header differs slightly in the attached patch. In
particular, only a single level of the index tree (the root) is
implemented here. This turns out to be sufficient to handle more than
90,000 entries, so it is enough for today. When a second level is
added to the tree, capacity will increase to somewhere around 50
million entries, and there is nothing preventing the use of n levels,
should there ever be a reason. It's doubtful that a third level
will ever be required, but if it is, the design provides for it.

Lookup Algorithm
----------------

Lookup is straightforward:

- Compute a hash of the name
- Read the index root
- Use binary search (linear in the current code) to find the
first index or leaf block that could contain the target hash
(in tree order)
- Repeat the above until the lowest tree level is reached
- Read the leaf directory entry block and do a normal Ext2
directory block search in it.
- If the name is found, return its directory entry and buffer
- Otherwise, if the collision bit of the next directory entry is
set, continue searching in the successor block

Normally, two logical blocks of the file will need to be accessed, and
one or two metadata index blocks. The effect of the metadata index
blocks can largely be ignored in terms of disk access time since these
blocks are unlikely to be evicted from cache. There is some small CPU
cost that can be addressed by moving the whole directory into the page
cache.
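
As a rough sketch of those steps in C (this is not the patch's
dx_probe/ext2_find_entry; the block-reading and leaf-scanning helpers
are hypothetical stand-ins for ext2_bread() and the usual dirent scan,
and error handling is omitted):

#include <stdint.h>
#include <stddef.h>

struct dx_entry { uint32_t hash, block; };	/* as sketched earlier */

/* Hypothetical helpers: read one logical block of the directory file,
 * and scan one leaf block for a name the normal Ext2 way. */
extern const char *dir_read_block(void *dir, uint32_t logical_block);
extern void *leaf_find_name(const char *block, const char *name, size_t len);

/* hash is the 31-bit name hash shifted left one bit (low bit clear). */
void *htree_lookup(void *dir, const char *name, size_t len, uint32_t hash)
{
	const struct dx_entry *entries =
		(const struct dx_entry *) dir_read_block(dir, 0);
	unsigned count = entries[0].hash;	/* entry 0 holds the count */
	const struct dx_entry *at = entries;	/* entry 0: minimal hash */

	/* Find the last index entry covering a hash <= the target
	 * (linear here, as in the current code). */
	while (at + 1 < entries + count && (at + 1)->hash <= hash)
		at++;

	for (;;) {
		void *de = leaf_find_name(dir_read_block(dir, at->block),
					  name, len);
		if (de)
			return de;
		/* Does the next index entry cover the same 31-bit hash?
		 * Then a collision chain continues in the successor block. */
		if (++at == entries + count || (at->hash & ~1u) != hash)
			return NULL;
	}
}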

Insert Algorithm
----------------

Insertion of new entries into the directory is considerably more
complex than lookup, due to the need to split leaf blocks when they
become full, and to satisfy the conditions that allow hash key
collisions to be handled reliably and efficiently. I'll just summarize
here:

- Probe the index as for lookup
- If the target leaf block is full, split it and note the block
that will receive the new entry
- Insert the new entry in the leaf block using the normal Ext2
directory entry insertion code.

The details of splitting and hash collision handling are somewhat
messy, but I will be happy to dwell on them at length if anyone is
interested.

Splitting
---------

In brief, when a leaf node fills up and we want to put a new entry into
it, the leaf has to be split, and its share of the hash space has to
be partitioned. The most straightforward way to do this is to sort the
entries by hash value and split somewhere in the middle of the sorted
list. This costs O(n log n) in the number of entries in the leaf, not a
great cost so long as an efficient sorter is used. I used Combsort
for this, although Quicksort would have been just as good in this
case since average case performance is more important than worst case.

An alternative approach would be just to guess a median value for the
hash key, and the partition could be done in linear time, but the
resulting poorer partitioning of hash key space outweighs the small
advantage of the linear partition algorithm. In any event, the number
of entries needing sorting is bounded by the number that fit in a leaf.
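
In sketch form (user-space C, standing in for the patch's
dx_make_map/COMBSORT code; the duplicate-hash adjustment anticipates
the next section):

#include <stdint.h>
#include <stdlib.h>

/* One slot per directory entry in the leaf being split. */
struct dx_map_entry { uint32_t hash, offs; };

static int cmp_hash(const void *a, const void *b)
{
	uint32_t x = ((const struct dx_map_entry *) a)->hash;
	uint32_t y = ((const struct dx_map_entry *) b)->hash;
	return x < y ? -1 : x > y;
}

/* Sort the map by hash and split near the middle, sliding the split
 * point off any run of identical hashes so the run isn't cut in two.
 * Entries [split..count) move to the new block; a result of 0 or count
 * means the block is one big hash run and the caller must handle it
 * (the patch just creates an empty new block in that case). */
static int choose_split(struct dx_map_entry *map, int count, uint32_t new_hash)
{
	int split = count / 2;

	qsort(map, count, sizeof(*map), cmp_hash);
	if (new_hash <= map[split].hash)
		while (split > 0 && map[split].hash == map[split - 1].hash)
			split--;
	else
		while (split < count && map[split].hash == map[split - 1].hash)
			split++;
	return split;
}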

Key Collisions
--------------

Some complexity is introduced by the need to handle sequences of hash
key collisions. It is desirable to avoid splitting such sequences
between blocks, so the split point of a block is adjusted with this in
mind. But the possibility still remains that if the block fills up
with identically-hashed entries, the sequence may still have to be
split. This situation is flagged by placing a 1 in the low bit of the
index entry that points at the successor block, which is naturally
interpreted by the index probe as an intermediate value without any
special coding. Thus, handling the collision problem imposes no real
processing overhead, just some extra code and a slight reduction in the
hash key space. The hash key space remains sufficient for any
conceivable number of directory entries, up into the billions.
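
Spelled out, the flag is just one reserved bit in the stored key (the
macro names are mine, not the patch's):

#include <stdint.h>

/* Bits 1..31 hold the name hash.  Bit 0, set in the index entry that
 * points at the successor block, means "a run of identical hashes
 * continues from the previous block into this one". */
#define DX_HASH_OF(x)		((uint32_t)(x) & ~1u)
#define DX_IS_CONTINUED(x)	((uint32_t)(x) & 1u)
#define DX_MARK_CONTINUED(x)	((uint32_t)(x) | 1u)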

Hash Function
-------------

The exact properties of the hash function critically affect the
performance of this indexing strategy, as I learned by trying a number
of poor hash functions, at times intentionally. A poor hash function
will result in many collisions or poor partitioning of the hash space.
To illustrate why the latter is a problem, consider what happens when a
block is split such that it covers just a few distinct hash values.
The probability of later index entries hashing into the same, small
hash space is very small. Once a block is split, if its
hash space is too small it tends to stay half full forever, an effect I
observed in practice.

After some experimentation I came up with a hash function that gives
reasonably good dispersal of hash keys across the entire 31 bit key
space. This improved the average fullness of leaf blocks considerably,
getting much closer to the theoretical average of 3/4 full.

But the current hash function is just a placeholder, waiting for
a better version based on some solid theory. I currently favor the
idea of using crc32 as the default hash function, but I welcome
suggestions.

Inevitably, no matter how good a hash function I come up with, somebody
will come up with a better one later. For this reason the design
allows for additional hash functions to be added, with backward
compatibility. This is accomplished simply, by including a hash
function number in the index root. If a new, improved hash function is
added, all the previous versions remain available, and previously
created indexes remain readable.
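
A sketch of what that dispatch could look like (the table and names are
illustrative; only the version-0 function roughly mirrors the
placeholder hash in the patch below):

#include <stdint.h>
#include <stddef.h>

typedef uint32_t (*dx_hash_fn)(const char *name, size_t len);

/* Version 0: the placeholder.  New versions get appended to the table
 * and old ones are never changed, so existing indexes stay readable. */
static uint32_t dx_hash_v0(const char *name, size_t len)
{
	uint32_t hash0 = 0x12a3fe2d, hash1 = 0x37abe8f9;

	while (len--) {
		uint32_t hash = hash1 + (hash0 ^ ((uint32_t) *name++ * 71523));
		hash1 = hash0;
		hash0 = hash;
	}
	return hash0 << 1;	/* low bit reserved for the collision flag */
}

static const dx_hash_fn dx_hash_fns[] = { dx_hash_v0 /*, dx_hash_v1, ... */ };

/* 'version' is the hash version byte read from the index root. */
static uint32_t dx_hash(unsigned version, const char *name, size_t len)
{
	return dx_hash_fns[version](name, len);
}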

Of course, the best strategy is to have a good hash function right from
the beginning. The initial, quick hack has produced results that
certainly have not been disappointing.

Performance
-----------

OK, if you have read this far then this is no doubt the part you've
been waiting for. In short, the performance improvement over normal
Ext2 has been stunning. With very small directories performance is
similar to standard Ext2, but as directory size increases standard
Ext2 quickly blows up quadratically, while htree-enhanced Ext2
continues to scale linearly.

Uli Luckas ran benchmarks for file creation in various sizes of
directories ranging from 10,000 to 90,000 files. The results are
pleasing: total file creation time stays very close to linear, versus
quadratic increase with normal Ext2.

Time to create:

                  Indexed        Normal
                  =======        ======
10000 Files:     0m1.350s     0m23.670s
20000 Files:     0m2.720s     1m20.470s
30000 Files:     0m4.330s      3m9.320s
40000 Files:     0m5.890s     5m48.750s
50000 Files:     0m7.040s     9m31.270s
60000 Files:     0m8.610s    13m52.250s
70000 Files:     0m9.980s    19m24.070s
80000 Files:    0m12.060s    25m36.730s
90000 Files:    0m13.400s    33m18.550s

A graph is posted at:

http://www.innominate.org/~phillips/htree/performance.png

All of these tests are CPU-bound, which may come as a surprise. The
directories fit easily in cache, and the limiting factor in the case of
standard Ext2 is looking up directory blocks in the buffer cache
and the low-level scan of directory entries. In the case of htree
indexing there are a number of costs to be considered, all of them
pretty well bounded. Notwithstanding, there are a few obvious
optimizations to be done:

- Use binary search instead of linear search in the interior index
nodes.

- If there is only one leaf block in a directory, bypass the index
probe and go straight to the block.

- Map the directory into the page cache instead of the buffer cache.

Each of these optimizations will produce a noticeable improvement in
performance, but naturally it will never be anything like the big jump
going from O(N**2) to O(N*log512(N)) ~= O(N). In time the optimizations will be
applied and we can expect to see another doubling or so in performance.

There will be a very slight performance hit when the directory gets big
enough to need a second level. Because of caching this will be very
small. Traversing the directory's metadata index blocks will be a
bigger cost, and once again, this cost can be reduced by moving the
directory blocks into the page cache.

Typically, we will traverse 3 blocks to read or write a directory
entry, and that number increases to 4-5 with really huge directories.
But this is really nothing compared to normal Ext2, which traverses
several hundred blocks in the same situation.

Current Implementation
----------------------

The current implementation has only a single level of the htree (the
root) and is sufficient to handle a little more than 90,000 files.
This is good enough for benchmarking. There has not been a lot of
stability testing yet and indeed there are a number of unhandled error
conditions in the code, and possibly some buffer leaks as well.

This patch is for kernel 2.4.1, but it should be entirely applicable
to the 2.2 series as well. There it should find a friend: Stephen
Tweedie's Ext3 journalling extension.

To-do List
----------

There is still a fair amount of work remaining before this patch is
ready for regular use. Here is the to-do list as of today:

- finalize the file format
- endianness
- improve the hash function
- INCOMPAT flag handling
- second tree level
- bullet proofing
- testing under load

Additionally, some (small) changes will be required in ext2utils. The
ETA for completion of the items on the to-do list is... pretty soon.

Credits
-------

Thanks to Ted Ts'o for providing the inspiration and essential design
elements. Many thanks to Uli Luckas for spending a large number of
hours drinking beer^H^H^H^H^H^H^H^H^H^H walking through the code with
me, suggesting a number of design improvements and understanding and
fixing at least one part of the code which remains, quite frankly,
beyond me. :-)

Applying and Running the patch
------------------------------

The patch adds a symbol to ext2_fs.h, CONFIG_EXT2_INDEX, which
controls whether the htree index feature is enabled or not - it
defaults to on.

- Use a test machine, not your workstation :-)
- cd to the 2.4.1 source root
- patch -p0 <this.email
- build and install - should have no effect on normal operation
- mount /dev/hdxxx /test -t ext2 -o index

All new directories in the mounted partition will be created indexed.

Here is the patch:

--- ../2.4.1.uml.clean/fs/ext2/dir.c Sat Dec 9 02:35:54 2000
+++ ./fs/ext2/dir.c Tue Feb 20 04:21:25 2001
@@ -67,22 +67,24 @@
{
int error = 0;
unsigned long offset, blk;
- int i, num, stored;
- struct buffer_head * bh, * tmp, * bha[16];
- struct ext2_dir_entry_2 * de;
- struct super_block * sb;
- int err;
+ int i, num, stored = 0, err;
+ struct buffer_head *bh = NULL, *tmp, *bha[16];
+ struct ext2_dir_entry_2 *de;
struct inode *inode = filp->f_dentry->d_inode;
+ struct super_block *sb = inode->i_sb;
+ unsigned blockshift = EXT2_BLOCK_SIZE_BITS(sb);
+#ifdef CONFIG_EXT2_INDEX
+ int dir_base = is_dx(inode)? dx_dir_base(sb): 0;
+#else
+ int dir_base = 0;
+#endif

- sb = inode->i_sb;
-
- stored = 0;
- bh = NULL;
offset = filp->f_pos & (sb->s_blocksize - 1);

- while (!error && !stored && filp->f_pos < inode->i_size) {
- blk = (filp->f_pos) >> EXT2_BLOCK_SIZE_BITS(sb);
- bh = ext2_bread (inode, blk, 0, &err);
+ while (!error && !stored && filp->f_pos < inode->i_size - (dir_base << blockshift))
+ {
+ blk = (filp->f_pos) >> blockshift;
+ bh = ext2_bread (inode, dir_base + blk, 0, &err);
if (!bh) {
ext2_error (sb, "ext2_readdir",
"directory #%lu contains a hole at offset %lu",
@@ -95,9 +97,9 @@
* Do the readahead
*/
if (!offset) {
- for (i = 16 >> (EXT2_BLOCK_SIZE_BITS(sb) - 9), num = 0;
- i > 0; i--) {
- tmp = ext2_getblk (inode, ++blk, 0, &err);
+ for (i = 16 >> (blockshift - 9), num = 0; i > 0; i--)
+ {
+ tmp = ext2_getblk (inode, dir_base + ++blk, 0, &err);
if (tmp && !buffer_uptodate(tmp) && !buffer_locked(tmp))
bha[num++] = tmp;
else
@@ -140,8 +142,7 @@
de = (struct ext2_dir_entry_2 *) (bh->b_data + offset);
if (!ext2_check_dir_entry ("ext2_readdir", inode, de,
bh, offset)) {
- /* On error, skip the f_pos to the
- next block. */
+ /* On error, skip the f_pos to the next block. */
filp->f_pos = (filp->f_pos | (sb->s_blocksize - 1))
+ 1;
brelse (bh);
--- ../2.4.1.uml.clean/fs/ext2/namei.c Sat Dec 9 02:35:54 2000
+++ ./fs/ext2/namei.c Tue Feb 20 16:00:53 2001
@@ -18,13 +18,13 @@
* for B-tree directories by Theodore Ts'o ([email protected]), 1998
*/

+#define CONFIG_EXT2_INDEX
+
#include <linux/fs.h>
#include <linux/ext2_fs.h>
#include <linux/locks.h>
#include <linux/quotaops.h>

-
-
/*
* define how far ahead to read directories while searching them.
*/
@@ -33,6 +33,250 @@
#define NAMEI_RA_SIZE (NAMEI_RA_CHUNKS * NAMEI_RA_BLOCKS)
#define NAMEI_RA_INDEX(c,b) (((c) * NAMEI_RA_BLOCKS) + (b))

+#ifdef CONFIG_EXT2_INDEX
+#define dxtrace(command)
+#define dxtrace_on(command) command
+#define dxtrace_off(command)
+
+/*
+ * Order n log(n) sort utility with n log(n) worst case
+ */
+
+#ifndef COMBSORT
+#define COMBSORT(size, i, j, COMPARE, EXCHANGE) { \
+ unsigned gap = size, more, i; \
+ do { \
+ if (gap > 1) gap = gap*10/13; \
+ if (gap - 9 < 2) gap = 11; \
+ for (i = size - 1, more = gap > 1; i >= gap; i--) { \
+ int j = i - gap; \
+ if (COMPARE) { EXCHANGE; more = 1; } } \
+ } while (more); }
+#endif
+
+#ifndef exchange
+#define exchange(x, y) do { typeof(x) z = x; x = y; y = z; } while (0)
+#endif
+
+/*
+ * Structure of the directory root block
+ */
+
+struct dx_root
+{
+ struct dx_root_info
+ {
+ u32 total_entries;
+ u32 reserved_zero;
+ }
+ info;
+ struct dx_entry
+ {
+ u32 hash;
+ u32 block;
+ }
+ entries[0];
+};
+
+/*
+ * Bookkeeping for index traversal
+ */
+
+struct dx_frame
+{
+ struct buffer_head *bh;
+ struct dx_entry *entries;
+ struct dx_entry *at;
+ struct dx_root_info *info;
+ unsigned count;
+ unsigned limit;
+};
+
+/*
+ * Sort map for splitting leaf
+ */
+
+struct dx_map_entry
+{
+ u32 hash;
+ u32 offs;
+};
+
+#define MAX_DX_MAP (PAGE_SIZE/EXT2_DIR_REC_LEN(1) + 1)
+/* Assumes file blocksize <= PAGE_SIZE */
+
+#if 1
+unsigned dx_hash (const char *name, int namelen)
+{
+ u32 hash0 = 0x12a3fe2d, hash1 = 0x37abe8f9;
+ if (!namelen) return 0;
+ while (namelen--)
+ {
+ u32 hash = hash1 + (hash0 ^ (*name++ * 71523));
+ if (hash < 0) hash -= 0x7fffffff;
+ hash1 = hash0;
+ hash0 = hash;
+ }
+ return ((hash0 & -1) << 1);
+}
+#else
+/*
+ * A simple hash // need hash function upgrade support
+ */
+
+int dx_hash (const char *name, int namelen)
+{
+ u32 hash = 0;
+ if (!namelen) BUG();
+ while (namelen--) hash = *(name++) + (hash << 6);
+ return hash << 1;
+}
+#endif
+
+/*
+ * Probe to find a directory leaf block to search
+ */
+
+int dx_probe (struct inode *dir, u32 hash, struct dx_frame *dxframe)
+{
+ int count, search, err;
+ struct buffer_head *bh;
+ struct dx_entry *at, *at0;
+
+ dxtrace(printk("Look up %u.", hash));
+ if (!(bh = ext2_bread (dir, 0, 0, &err)))
+ {
+ dxframe->bh = NULL;
+ return -1;
+ }
+
+ /* First hash field holds count of entries */
+ at = at0 = ((struct dx_root *) (bh->b_data))->entries;
+ if (!(count = *(u32 *) at)) BUG();
+ search = count - 1; // should use binary search
+
+ while (search--)
+ {
+ dxtrace(printk("."));
+ if ((++at)->hash > hash)
+ {
+ at--;
+ break;
+ }
+ }
+ dxtrace(printk(" in %u:%u\n", at - at0, at->block));
+ dxframe->info = (struct dx_root_info *) bh->b_data;
+ dxframe->bh = bh;
+ dxframe->entries = at0;
+ dxframe->at = at;
+ dxframe->count = count;
+ dxframe->limit = (bh->b_size - sizeof(struct dx_root_info)) / sizeof(struct dx_entry);
+ return 0;
+}
+
+/*
+ * Prior to split, finds record offset, computes hash of each dir block record
+ */
+
+static int dx_make_map (struct ext2_dir_entry_2 *de, int size, struct dx_map_entry map[])
+{
+ int count = 0;
+ char *base = (char *) de;
+ while ((char *) de < base + size)
+ {
+ map[count].hash = dx_hash (de->name, de->name_len);
+ map[count].offs = (u32) ((char *) de - base);
+ de = (struct ext2_dir_entry_2 *) ((char *) de + le16_to_cpu(de->rec_len));
+ count++;
+ }
+ return count;
+}
+
+/*
+ * For dir block splitting and compacting
+ */
+
+struct ext2_dir_entry_2 *dx_copy (
+ char *from, char *to, unsigned size, // should pass from, to as de's (uli)
+ struct dx_map_entry map[], int start, int count)
+{
+ struct ext2_dir_entry_2 *de = NULL;
+ char *top = to + size;
+ unsigned rec_len = 0;
+ if (!count) BUG();
+ while (count--)
+ {
+ de = (struct ext2_dir_entry_2 *) (from + map[start++].offs);
+ rec_len = EXT2_DIR_REC_LEN(de->name_len);
+ if (to + rec_len > top) BUG();
+ memcpy (to, de, rec_len);
+ ((struct ext2_dir_entry_2 *) to)->rec_len = rec_len;
+ to += rec_len;
+ }
+ return (struct ext2_dir_entry_2 *) (to - rec_len);
+}
+
+void dx_adjust (struct ext2_dir_entry_2 *de, char *limit)
+{
+ de->rec_len = limit - (char *) de; // need to clear top?
+}
+
+/*
+ * Debug
+ */
+
+void dx_show_index (struct dx_frame *dxframe)
+{
+ struct dx_entry *entries = dxframe->entries;
+ int i = 0;
+ printk("Index: ");
+ for (;i < *(u32 *) entries; i++)
+ {
+ printk("%u@%u ", entries[i].hash, entries[i].block);
+ }
+ printk("\n");
+}
+
+void dx_show_leaf (struct ext2_dir_entry_2 *de, int size)
+{
+ int count = 0;
+ char *base = (char *) de;
+ printk("dirblock: ");
+ while ((char *) de < base + size)
+ {
+ { int n = de->name_len; char *s = de->name; while (n--) printk("%c", *s++); }
+ printk(":%u.%u ", dx_hash (de->name, de->name_len), (u32) ((char *) de - base));
+ de = (struct ext2_dir_entry_2 *) ((char *) de + le16_to_cpu(de->rec_len));
+ count++;
+ }
+ printk("(%i)\n", count);
+}
+
+void dx_show_buckets (struct inode *dir)
+{
+ struct super_block *sb = dir->i_sb;
+ int blockshift = EXT2_BLOCK_SIZE_BITS (sb), blocksize = 1 << blockshift;
+ int count, i, err;
+ struct dx_entry *at;
+ struct buffer_head *bh, *dbh;
+ if (!(dbh = ext2_bread (dir, 0, 0, &err))) return;
+ at = ((struct dx_root *) (dbh->b_data))->entries;
+ count = *(u32 *) at;
+ printk("%i indexed blocks...\n", count);
+ for (i = 0; i < count; i++, at++)
+ {
+ u32 hash = i? at->hash: 0;
+ u32 range = i == count - 1? ~at->hash: ((at + 1)->hash - hash);
+ printk("%i:%u hash %u/%u", i, at->block, hash, range);
+ if (!(bh = ext2_bread (dir, at->block, 0, &err))) continue;
+ dx_show_leaf ((struct ext2_dir_entry_2 *) bh->b_data, blocksize);
+ brelse (bh);
+ }
+ brelse(dbh);
+ printk("\n");
+}
+#endif
+
/*
* NOTE! unlike strncmp, ext2_match returns 1 for success, 0 for failure.
*
@@ -49,36 +293,94 @@
return !memcmp(name, de->name, len);
}

+struct ext2_dir_entry_2 *ext2_find_de (struct buffer_head *bh,
+ const char *const name, int namelen,
+ int *err, struct inode *dir, u32 offset)
+ /* dir, offset used only in error report */
+{
+ struct ext2_dir_entry_2 *de = (struct ext2_dir_entry_2 *) bh->b_data;
+ char *top = (char *) de + bh->b_size;
+ while ((char *) de < top) {
+ /* this code may be executed quadratically often */
+ /* do minimal checking `by hand' */
+ int de_len;
+ if ((char *) de + namelen <= top && ext2_match (namelen, name, de)) // is the compare to top really needed??
+ {
+ /* found a match - just to be sure, do a full check */
+ if (!ext2_check_dir_entry("ext2_find_entry", dir, de, bh, offset))
+ goto error;
+ *err = 0;
+ return de;
+ }
+ de_len = le16_to_cpu(de->rec_len);
+ /* prevent looping on a bad block */
+ if (de_len <= 0)
+ goto error;
+ de = (struct ext2_dir_entry_2 *) ((char *) de + de_len);
+ offset += de_len;
+ }
+ *err = 0;
+ return NULL;
+error:
+ *err = 1;
+ return NULL;
+}
+
/*
- * ext2_find_entry()
- *
- * finds an entry in the specified directory with the wanted name. It
- * returns the cache buffer in which the entry was found, and the entry
- * itself (as a parameter - res_dir). It does NOT read the inode of the
- * entry - you'll have to do that yourself if you want to.
- */
-static struct buffer_head * ext2_find_entry (struct inode * dir,
- const char * const name, int namelen,
- struct ext2_dir_entry_2 ** res_dir)
-{
- struct super_block * sb;
- struct buffer_head * bh_use[NAMEI_RA_SIZE];
- struct buffer_head * bh_read[NAMEI_RA_SIZE];
+ * Find an entry in the specified directory with the wanted name. Return
+ * the buffer the entry was found in, and set the entry through a pointer.
+ */
+static struct buffer_head *ext2_find_entry (
+ struct inode *dir,
+ const char *name, int namelen,
+ struct ext2_dir_entry_2 **res_dir)
+{
+ struct super_block *sb = dir->i_sb;
+ struct buffer_head *bh_use[NAMEI_RA_SIZE];
+ struct buffer_head *bh_read[NAMEI_RA_SIZE];
unsigned long offset;
int block, toread, i, err;
+ int blockshift = EXT2_BLOCK_SIZE_BITS (sb);

*res_dir = NULL;
- sb = dir->i_sb;
+ if (namelen > EXT2_NAME_LEN) return NULL;
+#ifdef CONFIG_EXT2_INDEX
+ if (is_dx(dir))
+ {
+ u32 hash = dx_hash (name, namelen);
+ struct ext2_dir_entry_2 *de;
+ struct dx_frame dxframe;
+ struct buffer_head *bh;
+ int err = dx_probe (dir, hash, &dxframe); // don't ignore the error!!
+
+ while (1)
+ {
+ bh = ext2_bread (dir, dxframe.at->block, 0, &err); // don't ignore the error!!
+ de = ext2_find_de (bh, name, namelen, &err, dir, 666); // don't ignore the error!!
+ if (de)
+ {
+ dxtrace(printk("Found %s in %i:%i\n", name,
+ dxframe.at - dxframe.entries, dxframe.at->block));
+ brelse(dxframe.bh);
+ *res_dir = de;
+ return bh;
+ }

- if (namelen > EXT2_NAME_LEN)
+ brelse(bh);
+ /* Same hash continues in next block? Search further. */
+ if (++(dxframe.at) - dxframe.entries == dxframe.count) break;
+ if ((dxframe.at->hash & -2) != hash) break;
+ dxtrace(printk("Try next, block %i\n", dxframe.at->block));
+ }
+ brelse(dxframe.bh);
return NULL;
-
+ }
+#endif
memset (bh_use, 0, sizeof (bh_use));
toread = 0;
for (block = 0; block < NAMEI_RA_SIZE; ++block) {
struct buffer_head * bh;
-
- if ((block << EXT2_BLOCK_SIZE_BITS (sb)) >= dir->i_size)
+ if ((block << blockshift) >= dir->i_size)
break;
bh = ext2_getblk (dir, block, 0, &err);
bh_use[block] = bh;
@@ -86,75 +388,54 @@
bh_read[toread++] = bh;
}

- for (block = 0, offset = 0; offset < dir->i_size; block++) {
+ for (block = 0, offset = 0; offset < dir->i_size; offset += sb->s_blocksize, block++)
+ {
struct buffer_head * bh;
- struct ext2_dir_entry_2 * de;
- char * dlimit;
-
- if ((block % NAMEI_RA_BLOCKS) == 0 && toread) {
+ struct ext2_dir_entry_2 *de;
+ if ((block % NAMEI_RA_BLOCKS) == 0 && toread)
+ {
ll_rw_block (READ, toread, bh_read);
toread = 0;
}
bh = bh_use[block % NAMEI_RA_SIZE];
- if (!bh) {
+ if (!bh)
+ {
#if 0
ext2_error (sb, "ext2_find_entry",
"directory #%lu contains a hole at offset %lu",
dir->i_ino, offset);
#endif
- offset += sb->s_blocksize;
continue;
}
+
wait_on_buffer (bh);
- if (!buffer_uptodate(bh)) {
- /*
- * read error: all bets are off
- */
+
+ /* handle read error */
+ if (!buffer_uptodate(bh))
break;
- }

- de = (struct ext2_dir_entry_2 *) bh->b_data;
- dlimit = bh->b_data + sb->s_blocksize;
- while ((char *) de < dlimit) {
- /* this code is executed quadratically often */
- /* do minimal checking `by hand' */
- int de_len;
-
- if ((char *) de + namelen <= dlimit &&
- ext2_match (namelen, name, de)) {
- /* found a match -
- just to be sure, do a full check */
- if (!ext2_check_dir_entry("ext2_find_entry",
- dir, de, bh, offset))
- goto failure;
- for (i = 0; i < NAMEI_RA_SIZE; ++i) {
- if (bh_use[i] != bh)
- brelse (bh_use[i]);
- }
- *res_dir = de;
- return bh;
- }
- /* prevent looping on a bad block */
- de_len = le16_to_cpu(de->rec_len);
- if (de_len <= 0)
- goto failure;
- offset += de_len;
- de = (struct ext2_dir_entry_2 *)
- ((char *) de + de_len);
+ de = ext2_find_de (bh, name, namelen, &err, dir, offset);
+ if (de)
+ {
+ for (i = 0; i < NAMEI_RA_SIZE; ++i)
+ if (bh_use[i] != bh)
+ brelse (bh_use[i]);
+ *res_dir = de;
+ return bh;
}
-
+ if (err)
+ goto fail;
brelse (bh);
- if (((block + NAMEI_RA_SIZE) << EXT2_BLOCK_SIZE_BITS (sb)) >=
- dir->i_size)
- bh = NULL;
- else
+ if (((block + NAMEI_RA_SIZE) << blockshift) < dir->i_size)
bh = ext2_getblk (dir, block + NAMEI_RA_SIZE, 0, &err);
+ else
+ bh = NULL;
+
bh_use[block % NAMEI_RA_SIZE] = bh;
if (bh && !buffer_uptodate(bh))
bh_read[toread++] = bh;
}
-
-failure:
+fail:
for (i = 0; i < NAMEI_RA_SIZE; ++i)
brelse (bh_use[i]);
return NULL;
@@ -171,7 +452,8 @@

bh = ext2_find_entry (dir, dentry->d_name.name, dentry->d_name.len, &de);
inode = NULL;
- if (bh) {
+ if (bh)
+ {
unsigned long ino = le32_to_cpu(de->inode);
brelse (bh);
inode = iget(dir->i_sb, ino);
@@ -202,37 +484,151 @@
}

/*
- * ext2_add_entry()
- *
* adds a file entry to the specified directory.
*/
+
int ext2_add_entry (struct inode * dir, const char * name, int namelen,
struct inode *inode)
{
unsigned long offset;
- unsigned short rec_len;
+ unsigned short rec_len = EXT2_DIR_REC_LEN(namelen);
struct buffer_head * bh;
- struct ext2_dir_entry_2 * de, * de1;
- struct super_block * sb;
- int retval;
-
- sb = dir->i_sb;
+ struct ext2_dir_entry_2 * de, * de2;
+ struct super_block * sb = dir->i_sb;
+ unsigned blockshift = EXT2_BLOCK_SIZE_BITS(sb);
+ unsigned blocksize = 1 << blockshift;
+ int err;
+#ifdef CONFIG_EXT2_INDEX
+ struct dx_frame dxframe;
+ u32 hash;
+#endif

- if (!namelen)
- return -EINVAL;
- bh = ext2_bread (dir, 0, 0, &retval);
- if (!bh)
- return retval;
- rec_len = EXT2_DIR_REC_LEN(namelen);
+ if (!namelen) return -EINVAL;
+#ifdef CONFIG_EXT2_INDEX
+ if (is_dx(dir))
+ {
+ hash = dx_hash (name, namelen);
+ dx_probe (dir, hash, &dxframe); // don't ignore the error!!
+ if (!dxframe.bh) return EINVAL;
+ if (!(bh = ext2_bread (dir, dxframe.at->block, 0, &err))) return err;
+ }
+ else
+#endif
+ {
+ if (!(bh = ext2_bread (dir, 0, 0, &err))) return err;
+ }
offset = 0;
de = (struct ext2_dir_entry_2 *) bh->b_data;
- while (1) {
- if ((char *)de >= sb->s_blocksize + bh->b_data) {
+ while (1)
+ {
+ if ((char *) de >= bh->b_data + blocksize)
+ {
+#ifdef CONFIG_EXT2_INDEX
+ if (is_dx(dir))
+ {
+ u32 block2 = dir->i_size >> blockshift;
+ struct dx_entry *entries = dxframe.entries, *at = dxframe.at;
+ struct buffer_head *bh2;
+ int count, split;
+ int continued; /* true if identical hashes split between two blocks */
+ u32 hash2;
+ dxtrace_off(printk("entry count %i, limit %i\n", dxframe.count, dxframe.limit));
+
+ if (dxframe.count == dxframe.limit)
+ {
+ brelse(bh);
+ brelse (dxframe.bh);
+ return -ENOENT;
+ }
+
+ if (!(bh2 = ext2_getblk (dir, block2, 1, &err)))
+ {
+ brelse(bh);
+ brelse (dxframe.bh);
+ return err;
+ }
+
+ {
+ char *b1 = bh->b_data, *b2, *b3;
+ struct dx_map_entry map[MAX_DX_MAP];
+ count = dx_make_map ((struct ext2_dir_entry_2 *) b1, blocksize, map);
+ split = count/2; // need to adjust to actual middle
+ COMBSORT(count, i, j, map[i].hash < map[j].hash, exchange(map[i], map[j]));
+
+ /* Don't split between duplicate hashes */
+ if (hash <= map[split].hash)
+ while (split && map[split].hash == map[split-1].hash)
+ split--;
+ else
+ while (split < count && map[split].hash == map[split-1].hash)
+ split++;
+ hash2 = map[split].hash;
+ continued = hash == hash2; // this happens to be valid for now
+ dxtrace(printk("Split block %i at %u, %i/%i\n", dxframe.at->block, hash2, split, count-split));
+
+ b2 = bh2->b_data;
+ dir->i_size = dir->i_size += blocksize;
+
+ if (!split || split == count)
+ {
+ // just create an empty dirblock for now
+ de2 = (struct ext2_dir_entry_2 *) b2;
+ de2->inode = 0;
+ de2->rec_len = le16_to_cpu(blocksize);
+ } else {
+ /* Fancy dance to stay within two buffers */
+ de2 = dx_copy (b1, b2, blocksize, map, split, count - split);
+ b3 = (char *) de2 + de2->rec_len;
+ de = dx_copy (b1, b3, blocksize - (b3 - b2), map, 0, split);
+ memcpy(b1, b3, (char *) de + de->rec_len - b3);
+ de = (struct ext2_dir_entry_2 *) ((char *) de - b3 + b1);
+ dx_adjust (de, b1 + blocksize);
+ dx_adjust (de2, b2 + blocksize);
+ }
+
+ dxtrace(dx_show_leaf ((struct ext2_dir_entry_2 *) b1, blocksize));
+ dxtrace(dx_show_leaf ((struct ext2_dir_entry_2 *) b2, blocksize));
+
+ /* Which block gets the new entry? */
+ dxtrace(printk("Insert %s/%u ", name, hash));
+ if (hash >= hash2 || !split || split == count)
+ {
+ dxtrace(printk("above"));
+ exchange(bh, bh2);
+ de = de2;
+ }
+ dxtrace(printk("\n"));
+ }
+
+ memmove (at + 1, at, (char *) (entries + dxframe.count) - (char *) (at));
+ if (continued && (!split || split == count))
+ {
+ /* assuming we put new identical hash into lower entry's block */
+ (at+1)->hash = hash + 1;
+ if (at != dxframe.entries) at->hash = hash;
+ at->block = block2;
+ } else {
+ at++;
+ at->block = block2;
+ at->hash = hash2;
+ }
+ dxframe.count = entries[0].hash++; /* first hash field is entry count */
+
+ /* Clean up and continue with scan for available space */
+ /* New dirent will be added at de in bh */
+ if (!continued) mark_buffer_dirty (bh2);
+ mark_buffer_dirty (dxframe.bh);
+ brelse (dxframe.bh);
+ brelse (bh2);
+ dxframe.bh = NULL; // oops if come here again
+ dxtrace(dx_show_index (&dxframe));
+ } else {
+#endif
brelse (bh);
bh = NULL;
- bh = ext2_bread (dir, offset >> EXT2_BLOCK_SIZE_BITS(sb), 1, &retval);
+ bh = ext2_bread (dir, offset >> EXT2_BLOCK_SIZE_BITS(sb), 1, &err);
if (!bh)
- return retval;
+ return err;
if (dir->i_size <= offset) {
if (dir->i_size == 0) {
return -ENOENT;
@@ -244,7 +640,6 @@
de->inode = 0;
de->rec_len = le16_to_cpu(sb->s_blocksize);
dir->i_size = offset + sb->s_blocksize;
- dir->u.ext2_i.i_flags &= ~EXT2_BTREE_FL;
mark_inode_dirty(dir);
} else {

@@ -252,6 +647,9 @@

de = (struct ext2_dir_entry_2 *) bh->b_data;
}
+#ifdef CONFIG_EXT2_INDEX
+ }
+#endif
}
if (!ext2_check_dir_entry ("ext2_add_entry", dir, de, bh,
offset)) {
@@ -266,12 +664,12 @@
(le16_to_cpu(de->rec_len) >= EXT2_DIR_REC_LEN(de->name_len) + rec_len)) {
offset += le16_to_cpu(de->rec_len);
if (le32_to_cpu(de->inode)) {
- de1 = (struct ext2_dir_entry_2 *) ((char *) de +
+ de2 = (struct ext2_dir_entry_2 *) ((char *) de +
EXT2_DIR_REC_LEN(de->name_len));
- de1->rec_len = cpu_to_le16(le16_to_cpu(de->rec_len) -
+ de2->rec_len = cpu_to_le16(le16_to_cpu(de->rec_len) -
EXT2_DIR_REC_LEN(de->name_len));
de->rec_len = cpu_to_le16(EXT2_DIR_REC_LEN(de->name_len));
- de = de1;
+ de = de2;
}
de->file_type = EXT2_FT_UNKNOWN;
if (inode) {
@@ -293,7 +691,6 @@
* and/or different from the directory change time.
*/
dir->i_mtime = dir->i_ctime = CURRENT_TIME;
- dir->u.ext2_i.i_flags &= ~EXT2_BTREE_FL;
mark_inode_dirty(dir);
dir->i_version = ++event;
mark_buffer_dirty_inode(bh, dir);
@@ -380,6 +777,7 @@
return err;
}
d_instantiate(dentry, inode);
+// dx_show_buckets (dir);
return 0;
}

@@ -408,12 +806,19 @@
return err;
}

-static int ext2_mkdir(struct inode * dir, struct dentry * dentry, int mode)
+static int ext2_mkdir (struct inode *dir, struct dentry *dentry, int mode)
{
- struct inode * inode;
- struct buffer_head * dir_block;
- struct ext2_dir_entry_2 * de;
+ struct super_block *sb = dir->i_sb;
+ struct inode *inode;
+ struct buffer_head *bh;
+ struct ext2_dir_entry_2 *de;
int err;
+#ifdef CONFIG_EXT2_INDEX
+ int make_dx = test_opt (sb, DXTREE);
+ int dir_blk = make_dx? dx_dir_base(sb): 0;
+#else
+ int dir_blk = 0;
+#endif

if (dir->i_nlink >= EXT2_LINK_MAX)
return -EMLINK;
@@ -425,40 +830,61 @@

inode->i_op = &ext2_dir_inode_operations;
inode->i_fop = &ext2_dir_operations;
- inode->i_size = inode->i_sb->s_blocksize;
+ inode->i_size = sb->s_blocksize;
inode->i_blocks = 0;
- dir_block = ext2_bread (inode, 0, 1, &err);
- if (!dir_block) {
+ bh = ext2_bread (inode, dir_blk, 1, &err);
+ if (!bh)
+ {
inode->i_nlink--; /* is this nlink == 0? */
mark_inode_dirty(inode);
iput (inode);
return err;
}
- de = (struct ext2_dir_entry_2 *) dir_block->b_data;
+ de = (struct ext2_dir_entry_2 *) bh->b_data;
+#ifdef CONFIG_EXT2_INDEX
de->inode = cpu_to_le32(inode->i_ino);
de->name_len = 1;
de->rec_len = cpu_to_le16(EXT2_DIR_REC_LEN(de->name_len));
strcpy (de->name, ".");
- ext2_set_de_type(dir->i_sb, de, S_IFDIR);
+ ext2_set_de_type(sb, de, S_IFDIR);
de = (struct ext2_dir_entry_2 *) ((char *) de + le16_to_cpu(de->rec_len));
de->inode = cpu_to_le32(dir->i_ino);
- de->rec_len = cpu_to_le16(inode->i_sb->s_blocksize - EXT2_DIR_REC_LEN(1));
+ de->rec_len = cpu_to_le16(sb->s_blocksize - EXT2_DIR_REC_LEN(1));
de->name_len = 2;
strcpy (de->name, "..");
- ext2_set_de_type(dir->i_sb, de, S_IFDIR);
+ ext2_set_de_type (sb, de, S_IFDIR);
+#else
+ de->rec_len = cpu_to_le16(sb->s_blocksize);
+#endif
inode->i_nlink = 2;
- mark_buffer_dirty_inode(dir_block, dir);
- brelse (dir_block);
+ mark_buffer_dirty_inode(bh, dir);
+ brelse (bh);
inode->i_mode = S_IFDIR | mode;
if (dir->i_mode & S_ISGID)
inode->i_mode |= S_ISGID;
mark_inode_dirty(inode);
- err = ext2_add_entry (dir, dentry->d_name.name, dentry->d_name.len,
- inode);
+ err = ext2_add_entry (dir, dentry->d_name.name, dentry->d_name.len, inode);
if (err)
goto out_no_entry;
dir->i_nlink++;
- dir->u.ext2_i.i_flags &= ~EXT2_BTREE_FL;
+#ifdef CONFIG_EXT2_INDEX
+ if (make_dx)
+ {
+ struct buffer_head *bh = ext2_bread (inode, 0, 1, &err);
+ if (bh)
+ {
+ struct dx_entry *entries = ((struct dx_root *) bh->b_data)->entries;
+ dxtrace_on(printk("Making dx indexed directory\n"));
+ inode->i_size = (dx_dir_base(sb) + 1) << sb->s_blocksize_bits;
+ entries[0].block = dx_dir_base(sb);
+ entries[0].hash = 1; /* first hash field is entry count */
+ mark_buffer_dirty(bh);
+ brelse(bh);
+ inode->u.ext2_i.i_flags |= EXT2_INDEX_FL;
+
+ }
+ }
+#endif
mark_inode_dirty(dir);
d_instantiate(dentry, inode);
return 0;
@@ -473,23 +899,27 @@
/*
* routine to check that the specified directory is empty (for rmdir)
*/
-static int empty_dir (struct inode * inode)
+static int ext2_is_empty_dir (struct inode *inode)
{
unsigned long offset;
struct buffer_head * bh;
struct ext2_dir_entry_2 * de, * de1;
- struct super_block * sb;
+ struct super_block * sb = inode->i_sb;
int err;
-
- sb = inode->i_sb;
+#ifdef CONFIG_EXT2_INDEX
+ int start = is_dx(inode)? dx_dir_base(sb): 0;
+#else
+ int start = 0;
+#endif
if (inode->i_size < EXT2_DIR_REC_LEN(1) + EXT2_DIR_REC_LEN(2) ||
- !(bh = ext2_bread (inode, 0, 0, &err))) {
+ !(bh = ext2_bread (inode, start, 0, &err))) {
ext2_warning (inode->i_sb, "empty_dir",
"bad directory (dir #%lu) - no data block",
inode->i_ino);
return 1;
}
de = (struct ext2_dir_entry_2 *) bh->b_data;
+#ifdef CONFIG_EXT2_INDEX
de1 = (struct ext2_dir_entry_2 *) ((char *) de + le16_to_cpu(de->rec_len));
if (le32_to_cpu(de->inode) != inode->i_ino || !le32_to_cpu(de1->inode) ||
strcmp (".", de->name) || strcmp ("..", de1->name)) {
@@ -501,6 +931,7 @@
}
offset = le16_to_cpu(de->rec_len) + le16_to_cpu(de1->rec_len);
de = (struct ext2_dir_entry_2 *) ((char *) de1 + le16_to_cpu(de1->rec_len));
+#endif
while (offset < inode->i_size ) {
if (!bh || (void *) de >= (void *) (bh->b_data + sb->s_blocksize)) {
brelse (bh);
@@ -552,7 +983,7 @@
goto end_rmdir;

retval = -ENOTEMPTY;
- if (!empty_dir (inode))
+ if (!ext2_is_empty_dir (inode))
goto end_rmdir;

retval = ext2_delete_entry(dir, de, bh);
@@ -568,7 +999,6 @@
mark_inode_dirty(inode);
dir->i_nlink--;
inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
- dir->u.ext2_i.i_flags &= ~EXT2_BTREE_FL;
mark_inode_dirty(dir);

end_rmdir:
@@ -605,7 +1035,6 @@
if (retval)
goto end_unlink;
dir->i_ctime = dir->i_mtime = CURRENT_TIME;
- dir->u.ext2_i.i_flags &= ~EXT2_BTREE_FL;
mark_inode_dirty(dir);
inode->i_nlink--;
mark_inode_dirty(inode);
@@ -729,7 +1158,7 @@
if (S_ISDIR(old_inode->i_mode)) {
if (new_inode) {
retval = -ENOTEMPTY;
- if (!empty_dir (new_inode))
+ if (!ext2_is_empty_dir (new_inode))
goto end_rename;
}
retval = -EIO;
@@ -782,7 +1211,6 @@
mark_inode_dirty(new_inode);
}
old_dir->i_ctime = old_dir->i_mtime = CURRENT_TIME;
- old_dir->u.ext2_i.i_flags &= ~EXT2_BTREE_FL;
mark_inode_dirty(old_dir);
if (dir_bh) {
PARENT_INO(dir_bh->b_data) = le32_to_cpu(new_dir->i_ino);
@@ -794,7 +1222,6 @@
mark_inode_dirty(new_inode);
} else {
new_dir->i_nlink++;
- new_dir->u.ext2_i.i_flags &= ~EXT2_BTREE_FL;
mark_inode_dirty(new_dir);
}
}
--- ../2.4.1.uml.clean/fs/ext2/super.c Fri Dec 29 23:36:44 2000
+++ ./fs/ext2/super.c Tue Feb 20 04:56:43 2001
@@ -188,6 +188,12 @@
printk("EXT2 Check option not supported\n");
#endif
}
+ else if (!strcmp (this_char, "index"))
+#ifdef CONFIG_EXT2_INDEX
+ set_opt (*mount_options, DXTREE);
+#else
+ printk("EXT2 Index option not supported\n");
+#endif
else if (!strcmp (this_char, "debug"))
set_opt (*mount_options, DEBUG);
else if (!strcmp (this_char, "errors")) {
--- ../2.4.1.uml.clean/include/linux/ext2_fs.h Tue Jan 30 08:24:55 2001
+++ ./include/linux/ext2_fs.h Tue Feb 20 15:52:54 2001
@@ -40,6 +40,12 @@
#define EXT2FS_VERSION "0.5b"

/*
+ * Hash Tree Directory indexing
+ * (c) Daniel Phillips, 2001
+ */
+#undef CONFIG_EXT2_INDEX
+
+/*
* Debug code
*/
#ifdef EXT2FS_DEBUG
@@ -53,7 +59,7 @@
#endif

/*
- * Special inodes numbers
+ * Special inode numbers
*/
#define EXT2_BAD_INO 1 /* Bad blocks inode */
#define EXT2_ROOT_INO 2 /* Root inode */
@@ -197,7 +203,7 @@
#define EXT2_NOCOMP_FL 0x00000400 /* Don't compress */
#define EXT2_ECOMPR_FL 0x00000800 /* Compression error */
/* End compression flags --- maybe not all used */
-#define EXT2_BTREE_FL 0x00001000 /* btree format dir */
+#define EXT2_INDEX_FL 0x00001000 /* btree format dir */
#define EXT2_RESERVED_FL 0x80000000 /* reserved for ext2 lib */

#define EXT2_FL_USER_VISIBLE 0x00001FFF /* User visible flags */
@@ -314,6 +320,7 @@
#define EXT2_MOUNT_ERRORS_PANIC 0x0040 /* Panic on errors */
#define EXT2_MOUNT_MINIX_DF 0x0080 /* Mimics the Minix statfs */
#define EXT2_MOUNT_NO_UID32 0x0200 /* Disable 32-bit UIDs */
+#define EXT2_MOUNT_DXTREE 0x0400 /* Enable dx trees */

#define clear_opt(o, opt) o &= ~EXT2_MOUNT_##opt
#define set_opt(o, opt) o |= EXT2_MOUNT_##opt
@@ -518,6 +525,16 @@
#define EXT2_DIR_ROUND (EXT2_DIR_PAD - 1)
#define EXT2_DIR_REC_LEN(name_len) (((name_len) + 8 + EXT2_DIR_ROUND) & \
~EXT2_DIR_ROUND)
+
+/*
+ * Hash Tree Directory indexing
+ * (c) Daniel Phillips, 2001
+ */
+#ifdef CONFIG_EXT2_INDEX
+#define is_dx(dir) (dir->u.ext2_i.i_flags & EXT2_INDEX_FL)
+#define dx_entries_per_block(sb) (EXT2_BLOCK_SIZE(sb) >> 3)
+#define dx_dir_base(sb) (dx_entries_per_block(sb) - 1 + 1)
+#endif

#ifdef __KERNEL__
/*


--
Daniel


2001-02-20 20:04:51

by Linus Torvalds

Subject: Re: [rfc] Near-constant time directory index for Ext2

In article <01022020011905.18944@gimli>,
Daniel Phillips <[email protected]> wrote:
>Earlier this month a runaway installation script decided to mail all its
>problems to root. After a couple of hours the script aborted, having
>created 65535 entries in Postfix's maildrop directory. Removing those
>files took an awfully long time. The problem is that Ext2 does each
>directory access using a simple, linear search though the entire
>directory file, resulting in n**2 behaviour to create/delete n files.
>It's about time we fixed that.

Interesting.

However, if you're playing with the directory structure, please consider
getting rid of the "struct buffer_head"-centricity, and using the page
cache instead. The page cache has much nicer caching semantics, and
looking up data in the page cache is much faster because it never needs
to do the "virtual->physical" translation.

Talk to Al Viro about this - he's already posted patches to move the
regular ext2 directory tree into the page cache, and they weren't
applied to 2.4.x only because there was no great feeling of "we _must_
do this for correctness".

I see that you already considered this issue, but I wanted to bring it
up again simply because something like this certainly looks like a
potential candidate for 2.5.x, but I will _refuse_ to add code that
increases our reliance of "struct buffer_head" as a caching entity. So
I'd rather see the page cache conversion happen sooner rather than
later...

Also, just out of interest: if you've already been worrying about
hashes, what's the verdict on just using the native dentry hash value
directly? It has other constraints (_really_ low latency and absolutely
performance critical to calculate for the common case, which is not
needing a real lookup at all), but maybe it is good enough? And if not,
and you have done some statistics on it, I'd love to hear about it ;)

Linus

2001-02-20 21:14:21

by Jeremy Jackson

Subject: Re: [rfc] Near-constant time directory index for Ext2

> In article <01022020011905.18944@gimli>,
> Daniel Phillips <[email protected]> wrote:
> >Earlier this month a runaway installation script decided to mail all its
> >problems to root. After a couple of hours the script aborted, having
> >created 65535 entries in Postfix's maildrop directory. Removing those
> >files took an awfully long time. The problem is that Ext2 does each
> >directory access using a simple, linear search though the entire
> >directory file, resulting in n**2 behaviour to create/delete n files.
> >It's about time we fixed that.

In the case of your script I'm not sure this will help, but:
I've seen /home directories organised like /home/a/adamsonj,
/home/a/arthurtone, /home/b/barrettj, etc.
this way (crude) indexing only costs in areas where it's needed,
without kernel modification (the app does it). What other places would we
need indexing *in* the filesystem?

2001-02-20 21:20:41

by Mike Dresser

Subject: Re: [rfc] Near-constant time directory index for Ext2

the way i'm reading this, the problem is there's 65535 files in the directory
/where/postfix/lives. rm * or what have you, is going to take forever and
ever, and bog the machine down while it's doing it. My understanding is you
could do the rm *, and instead of it reading the tree over and over for every
file that has to be deleted, it just jumps one or two blocks to the file that's
being deleted, instead of thousands of files to be scanned for each file
deleted.

Jeremy Jackson wrote:

> > In article <01022020011905.18944@gimli>,
> > Daniel Phillips <[email protected]> wrote:
> > >Earlier this month a runaway installation script decided to mail all its
> > >problems to root. After a couple of hours the script aborted, having
> > >created 65535 entries in Postfix's maildrop directory. Removing those
> > >files took an awfully long time. The problem is that Ext2 does each
> > >directory access using a simple, linear search though the entire
> > >directory file, resulting in n**2 behaviour to create/delete n files.
> > >It's about time we fixed that.
>
> In the case of your script I'm not sure this will help, but:
> I've seen /home directories organised like /home/a/adamsonj,
> /home/a/arthurtone, /home/b/barrettj, etc.
> this way (crude) indexing only costs areas where it's needed,
> without kernel modification. (app does it) What other placed would we
> need indexing *in* the filesystem?

2001-02-20 21:56:20

by Daniel Phillips

Subject: Re: [rfc] Near-constant time directory index for Ext2

On Tue, 20 Feb 2001, Linus Torvalds wrote:
> In article <01022020011905.18944@gimli>,
> Daniel Phillips <[email protected]> wrote:
> >Earlier this month a runaway installation script decided to mail all its
> >problems to root. After a couple of hours the script aborted, having
> >created 65535 entries in Postfix's maildrop directory. Removing those
> >files took an awfully long time. The problem is that Ext2 does each
> >directory access using a simple, linear search though the entire
> >directory file, resulting in n**2 behaviour to create/delete n files.
> >It's about time we fixed that.
>
> Interesting.
>
> However, if you're playing with the directory structure, please consider
> getting rid of the "struct buffer_head"-centricity, and using the page
> cache instead. The page cache has much nicer caching semantics, and
> looking up data in the page cache is much faster because it never needs
> to do the "virtual->physical" translation.

Oh yes, I was planning on it. I started with the buffers version
for two main reasons: 1) it's simple and solid, and 2) it
provides the basis for a backport to 2.2 - after the 2.4/2.5 version is
complete of course.

> Talk to Al Viro about this - he's already posted patches to move the
> regular ext2 directory tree into the page cache, and they weren't
> applied to 2.4.x only because there was no great feeling of "we _must_
> do this for correctness".
>
> I see that you already considered this issue, but I wanted to bring it
> up again simply because something like this certainly looks like a
> potential candidate for 2.5.x, but I will _refuse_ to add code that
> increases our reliance of "struct buffer_head" as a caching entity. So
> I'd rather see the page cache conversion happen sooner rather than
> later...

You are preaching to the converted.

> Also, just out of interest: if you've already been worrying about
> hashes, what's the verdict on just using the native dentry hash value
> directly? It has other constraints (_really_ low latency and absolutely
> performance critical to calculate for the common case, which is not
> needing a real lookup at all), but maybe it is good enough? And if not,
> and you have done some statistics on it, I'd love to hear about it ;)

You mean full_name_hash? I will un-static it and try it. I should have
some statistics tomorrow. I have a couple of simple metrics for
measuring the effectiveness of the hash function: the uniformity of
the hash space splitting (which in turn affects the average fullness
of directory leaves) and speed.

Let the hash races begin.

--
Daniel

2001-02-20 22:43:30

by Jeremy Jackson

Subject: Re: [rfc] Near-constant time directory index for Ext2

Mike Dresser wrote:

> the way i'm reading this, the problem is there's 65535 files in the directory
> /where/postfix/lives. rm * or what have you, is going to take forever and
> ever, and bog the machine down while its doing it. My understanding is you
> could do the rm *, and instead of it reading the tree over and over for every
> file that has to be deleted, it just jumps one or two blocks to the file that's
> being deleted, instead of thousands of files to be scanned for each file
> deleted.
>

I thought about it again, and the performance problem with "rm *" is that
the shell reads and sorts the directory, passes each file as a separate
argument to rm, which then causes the kernel to lookup each file
from a random directory block (random because of previous sort),
modify that directory block, then read another... after a few seconds
the modified blocks start to be written back to disk while new ones
are looked up... disk seek contention. and this becomes hard on the
dir. block cache (wherever this is) since from source each dir entry
is just over 256 bytes (?) 65535 files would require 16MB to
cache dir entries. Plus it has to read in all the inodes, modify,
then write, taking up xxMB more. You're probably swapping
out, with swap partition on same disk, the disk may explode.

If it were truly doing a linear scan, it might be faster. Two
successive mods to same dir block would be merged
onto same write.

Perhaps rm -rf . would be faster? Let rm do glob expansion,
without the sort. Care to recreate those 65535 files and try it?

or use ls with the nosort flag, pipe through xargs, then to rm...
again you lose sorting but don't delete the directory or subdirs.

>
> Jeremy Jackson wrote:
>
> > > In article <01022020011905.18944@gimli>,
> > > Daniel Phillips <[email protected]> wrote:
> > > >Earlier this month a runaway installation script decided to mail all its
> > > >problems to root. After a couple of hours the script aborted, having
> > > >created 65535 entries in Postfix's maildrop directory. Removing those
> > > >files took an awfully long time. The problem is that Ext2 does each
> > > >directory access using a simple, linear search though the entire
> > > >directory file, resulting in n**2 behaviour to create/delete n files.
> > > >It's about time we fixed that.
> >

2001-02-20 23:18:27

by Jonathan Morton

Subject: Re: [rfc] Near-constant time directory index for Ext2

>Perhaps rm -rf . would be faster? Let rm do glob expansion,
>without the sort. Care to recreate those 65535 files and try it?

Perhaps, but I think that form is still fairly slow. It takes an
"uncomfortable" amount of time to remove a complex directory structure
using, eg. "rm -rf /usr/src/linux-obsolete" or "rm -rf
downloads/XFree86-old-and-buggy". I'm not sure, but I would guess it's not
as much quicker than removing each file individually as you might think.

If I had more time on my hands, I'd run some quick benchmarks on some of my
systems.

--------------------------------------------------------------
from: Jonathan "Chromatix" Morton
mail: [email protected] (not for attachments)
big-mail: [email protected]
uni-mail: [email protected]

The key to knowledge is not to rely on people to teach you it.

Get VNC Server for Macintosh from http://www.chromatix.uklinux.net/vnc/

-----BEGIN GEEK CODE BLOCK-----
Version 3.12
GCS$/E/S dpu(!) s:- a20 C+++ UL++ P L+++ E W+ N- o? K? w--- O-- M++$ V? PS
PE- Y+ PGP++ t- 5- X- R !tv b++ DI+++ D G e+ h+ r- y+
-----END GEEK CODE BLOCK-----


2001-02-20 23:37:52

by Daniel Phillips

Subject: Re: [rfc] Near-constant time directory index for Ext2

On Tue, 20 Feb 2001, Jeremy Jackson wrote:
> Mike Dresser wrote:
>
> > the way i'm reading this, the problem is there's 65535 files in the directory
> > /where/postfix/lives. rm * or what have you, is going to take forever and
> > ever, and bog the machine down while its doing it. My understanding is you
> > could do the rm *, and instead of it reading the tree over and over for every
> > file that has to be deleted, it just jumps one or two blocks to the file that's
> > being deleted, instead of thousands of files to be scanned for each file
> > deleted.
>
> I thought about it again, and the proformance problem with "rm *" is that
> the shell reads and sorts the directory, passes each file as a separate
> argument to rm, which then causes the kernel to lookup each file
> from a random directory block (random because of previous sort),
> modify that directory block, then read another... after a few seconds
> the modified blocks start to be written back to disk while new ones
> are looked up... disk seek contention. and this becomes hard on the
> dir. block cache (wherever this is) since from source each dir entry
> is just over 256 bytes (?) 65535 files would require 16MB to
> cache dir entries. Plus it has to read in all the inodes, modify,
> then write, taking up xxMB more. You're probably swapping
> out, with swap partition on same disk, the disk may explode.
>
> If it were truly doing a linear scan, it might be faster. Two
> successive mods to same dir block would be merged
> onto same write.
>
> Perhaps rm -rf . would be faster? Let rm do glob expansion,
> without the sort. Care to recreate those 65535 files and try it?
>
> or use ls with the nosort flag pipe through xargs then to rm...
> again loose sorting but don't delete directory or subdirs.

Indeed, rm -rf is faster. It does a readdir to get all the directory
entries in internal order, then calls unlink to remove them, one at a
time. This removes each entry from the front of the file, shortening
the time that has to be spent scanning forward in the file to find the
target entry. Manfred Spraul observed that this could be speeded up
by caching the file position, and sent me a patch to do that. It
did speed things up - about 20%.

But actually, rm is not the problem, it's open and create. To do a
create you have to make sure the file doesn't already exist, and
without an index you have to scan on average half the directory file.
Open requires a similar scan. Here we are talking about using an index
to speed that up quadratically when operating on N files. That is the
real gravy.

--
Daniel

2001-02-21 00:23:30

by Linus Torvalds

Subject: Re: [rfc] Near-constant time directory index for Ext2



On Tue, 20 Feb 2001, Daniel Phillips wrote:
>
> You mean full_name_hash? I will un-static it and try it. I should have
> some statistics tomorrow. I have a couple of simple metrics for
> measuring the effectiveness of the hash function: the uniformity of
> the hash space splitting (which in turn affects the average fullness
> of directory leaves) and speed.

I was more thinking about just using "dentry->d_name->hash" directly, and
not worrying about how that hash was computed. Yes, for ext2 it will have
the same value as "full_name_hash" - the difference really being that
d_hash has already been precomputed for you anyway.

> Let the hash races begin.

Note that dentry->d_name->hash is really quick (no extra computation), but
I'm not claiming that it has anything like a CRC quality. And it's
probably a bad idea to use it, because in theory at least the VFS layer
might decide to switch the hash function around. I'm more interested in
hearing whether it's a good hash, and maybe we could improve the VFS hash
enough that there's no reason to use anything else..

Linus

2001-02-21 00:27:50

by Alan

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

> probably a bad idea to use it, because in theory at least the VFS layer
> might decide to switch the hash function around. I'm more interested in
> hearing whether it's a good hash, and maybe we could improve the VFS hash
> enough that there's no reason to use anything else..

Reiserfs seems to have done a lot of work on this and is using tea, which is
also nice as tea is non-trivial for a user to abuse to intentionally create
pessimal file searches.
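
For reference, this is the core of the standard TEA block cipher that the
name refers to (the reiserfs variant adapts it to hash variable-length
names, so treat this only as an illustration of why it is hard to steer):

#include <stdint.h>

/* One TEA encryption of a 64-bit block with a 128-bit key:
 * 32 mixing rounds driven by the constant 0x9e3779b9. */
static void tea_encrypt(uint32_t v[2], const uint32_t k[4])
{
	uint32_t v0 = v[0], v1 = v[1], sum = 0;
	const uint32_t delta = 0x9e3779b9;
	int i;

	for (i = 0; i < 32; i++) {
		sum += delta;
		v0 += ((v1 << 4) + k[0]) ^ (v1 + sum) ^ ((v1 >> 5) + k[1]);
		v1 += ((v0 << 4) + k[2]) ^ (v0 + sum) ^ ((v0 >> 5) + k[3]);
	}
	v[0] = v0;
	v[1] = v1;
}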

2001-02-21 01:03:17

by Andreas Dilger

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

Linus writes:
> On Tue, 20 Feb 2001, Daniel Phillips wrote:
> > You mean full_name_hash? I will un-static it and try it. I should have
> > some statistics tomorrow.
>
> I was more thinking about just using "dentry->d_name->hash" directly, and
> not worrying about how that hash was computed. Yes, for ext2 it will have
> the same value as "full_name_hash" - the difference really being that
> d_hash has already been precomputed for you anyway.

I _thought_ that's what you meant, but then I was also thinking that the
dentry hash was on the full path name and not just the filename? This
wouldn't be any good for use in the directory index, in case the directory
is renamed. If this is _not_ the case, then it is a definite candidate.

> Note that dentry->d_name->hash is really quick (no extra computation), but
> I'm not claiming that it has anything like a CRC quality. And it's
> probably a bad idea to use it, because in theory at least the VFS layer
> might decide to switch the hash function around.

I was thinking about this as well. Since the setup Daniel has allows us
to store a hash version, we could run the hash function on a fixed string
at SB init time to give us a hash "version" number. If the hash function
changes we will get a new hash "version". We could inline each new dentry
hash function into the ext2 code (so we can unpack the directories), or,
as a cop-out, if any directory has a hash version not equal to the current
one, we re-hash all the entries in the directory.
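
A rough sketch of that idea (hypothetical names, not actual ext2 code): hash
a fixed probe string once at mount time and keep the result as the index's
hash "version". The stand-in hash below is only there to make the example
self-contained; any name hash would do.

#include <stdio.h>

/* Stand-in for whatever name hash the directory index actually uses. */
static unsigned long name_hash(const unsigned char *name, unsigned int len)
{
	unsigned long hash = 0;

	while (len--) {
		hash = (hash << 4) | (hash >> (8 * sizeof(hash) - 4));
		hash ^= *name++;
	}
	return hash;
}

/* Hash a fixed probe string; if the hash function ever changes, this
 * derived "version" value changes with it, so old directories can be
 * detected and re-hashed (or handled by the matching old function). */
static unsigned long hash_version_tag(void)
{
	static const unsigned char probe[] = "dx-hash-version-probe";

	return name_hash(probe, sizeof(probe) - 1);
}

int main(void)
{
	printf("hash version tag: %08lx\n", hash_version_tag());
	return 0;
}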

Cheers, Andreas
--
Andreas Dilger \ "If a man ate a pound of pasta and a pound of antipasto,
\ would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/ -- Dogbert

2001-02-21 01:04:57

by Bernd Eckenfels

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

In article <01022100361408.18944@gimli> you wrote:
> But actually, rm is not problem, it's open and create. To do a
> create you have to make sure the file doesn't already exist, and
> without an index you have to scan on average half the directory file.

Unless you use a File System which is better for that, like Reiser-FS. Of
course an even better solution is to distribute those files in hashed subdirs.

Greetings
Bernd

2001-02-21 06:39:36

by Ed Tomlinson

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

Alan Cox wrote:

>> probably a bad idea to use it, because in theory at least the VFS layer
>> might decide to switch the hash function around. I'm more interested in
>> hearing whether it's a good hash, and maybe we could improve the VFS hash
>> enough that there's no reason to use anything else..
>
> Reiserfs seems to have done a lot of work on this and be using tea, which is
> also nice as tea is non trivial to abuse as a user to create pessimal file
> searches intentionally

The default in reiserfs is now the R5 hash, but you are right that lots of effort went
into finding this hash. This includes testing various hashes on real directory
structures to see which one worked best. R5 won.
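
For readers who have not seen it, r5 is tiny. A commonly quoted form is
roughly the following (from memory - check the reiserfs sources for the
authoritative version):

#include <stdint.h>

/* Shift-add-multiply-by-11 name hash in the style of reiserfs r5;
 * treat the exact constants as approximate. */
static uint32_t r5_hash(const char *name, int len)
{
	uint32_t a = 0;

	while (len--) {
		a += (unsigned char)*name << 4;
		a += (unsigned char)*name >> 4;
		a *= 11;
		name++;
	}
	return a;
}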

Ed Tomlinson

2001-02-21 16:40:55

by Daniel Phillips

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

On Wed, 21 Feb 2001, Bernd Eckenfels wrote:
> In article <01022100361408.18944@gimli> you wrote:
> > But actually, rm is not problem, it's open and create. To do a
> > create you have to make sure the file doesn't already exist, and
> > without an index you have to scan on average half the directory file.
>
> Unless you use a File System which is better for that, like Reiser-FS. Of
> course a even better solution is to distribute those files in hashed subdirs.

Ahem. Please read the first post in the thread. ;-)

--
Daniel

2001-02-21 17:19:41

by Davide Libenzi

[permalink] [raw]
Subject: RE: [rfc] Near-constant time directory index for Ext2


On 20-Feb-2001 Daniel Phillips wrote:
> Earlier this month a runaway installation script decided to mail all its
> problems to root. After a couple of hours the script aborted, having
> created 65535 entries in Postfix's maildrop directory. Removing those
> files took an awfully long time. The problem is that Ext2 does each
> directory access using a simple, linear search though the entire
> directory file, resulting in n**2 behaviour to create/delete n files.
> It's about time we fixed that.
>
> Last fall in Miami, Ted Ts'o mentioned some ideas he was playing with
> for an Ext2 directory index, including the following points:
>
> - Fixed-size hash keys instead of names in the index
> - Leaf blocks are normal ext2 directory blocks
> - Leaf blocks are sequental, so readdir doesn't have to be changed

Have you tried to use skiplists?
In '93 I coded a skiplist-based directory access for Minix and it gave very
interesting performance.
Skiplists have linked-list-like performance when linearly scanned, and overall
good performance in insertion/seek/delete.




- Davide

2001-02-21 21:09:35

by Martin Mares

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

Hello!

> Have You tried to use skiplists ?
> In 93 I've coded a skiplist based directory access for Minix and it gave very
> interesting performances.
> Skiplists have a link-list like performance when linear scanned, and overall
> good performance in insertion/seek/delete.

Skip list search/insert/delete is O(log N) on average, as skip lists are just a
dynamic version of interval bisection. Good hashing is O(1).

Have a nice fortnight
--
Martin `MJ' Mares <[email protected]> <[email protected]> http://atrey.karlin.mff.cuni.cz/~mj/
Entropy isn't what it used to be.

2001-02-21 21:28:40

by Davide Libenzi

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2


On 21-Feb-2001 Martin Mares wrote:
> Hello!
>
>> Have You tried to use skiplists ?
>> In 93 I've coded a skiplist based directory access for Minix and it gave
>> very
>> interesting performances.
>> Skiplists have a link-list like performance when linear scanned, and overall
>> good performance in insertion/seek/delete.
>
> Skip list search/insert/delete is O(log N) in average as skip lists are just
> a
> dynamic version of interval bisection. Good hashing is O(1).

To have O(1) you have to have the number of hash entries > number of files and a
really good hashing function.



>
> Have a nice fortnight

To be sincere, it's pretty much daylight here :)



- Davide

2001-02-21 21:33:02

by Martin Mares

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

Hello!

> To have O(1) you've to have the number of hash entries > number of files and a
> really good hasing function.

No, if you enlarge the hash table to twice its size (and re-hash everything)
every time the table fills up, the load factor of the table stays small and
everything is O(1) amortized - of course, only if you have a good hashing
function. If you are really smart and re-hash incrementally, you can get O(1)
worst case complexity, but the multiplicative constant is large.

> To be sincere, here is pretty daylight :)

:)
Martin

2001-02-21 21:58:26

by Davide Libenzi

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2


On 21-Feb-2001 Martin Mares wrote:
> Hello!
>
>> To have O(1) you've to have the number of hash entries > number of files and
>> a
>> really good hasing function.
>
> No, if you enlarge the hash table twice (and re-hash everything) every time
> the
> table fills up, the load factor of the table keeps small and everything is
> O(1)
> amortized, of course if you have a good hashing function. If you are really
> smart and re-hash incrementally, you can get O(1) worst case complexity, but
> the multiplicative constant is large.

My personal preference goes to skip lists because they don't have fixed (or
growing) tables to handle. You simply have a stub of data together with the
FS data in each direntry.
And performance ( O(log2(n)) ) is the same for whatever number of entries.




- Davide

2001-02-21 22:15:26

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

Followup to: <[email protected]>
By author: Martin Mares <[email protected]>
In newsgroup: linux.dev.kernel
>
> Hello!
>
> > To have O(1) you've to have the number of hash entries > number of files and a
> > really good hasing function.
>
> No, if you enlarge the hash table twice (and re-hash everything) every time the
> table fills up, the load factor of the table keeps small and everything is O(1)
> amortized, of course if you have a good hashing function. If you are really
> smart and re-hash incrementally, you can get O(1) worst case complexity, but
> the multiplicative constant is large.
>

Not true. The rehashing is O(n) and it has to be performed O(log n)
times during insertion. Therefore, insertion is O(log n).

-hpa
--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt

2001-02-21 22:27:18

by Martin Mares

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

Hello!

> My personal preference goes to skiplist coz it doesn't have fixed ( or growing
> ) tables to handle. You've simply a stub of data togheter with FS data in each
> direntry.

Another problem with skip lists is that they require variable sized nodes,
so you either need to keep free chunk lists and lose some space in deleted
nodes kept in these lists, or you choose to shift remaining nodes which is
slow and complicated as you need to keep the inter-node links right. With
hashing, you can separate the control part of the structure and the actual
data and shift data while leaving most of the control part intact.

> And performance ( O(log2(n)) ) are the same for whatever number of entries.

I don't understand this complexity estimate -- it cannot be the same for
whatever number of entries as the complexity function depends on the number
of entries.

Have a nice fortnight
--
Martin `MJ' Mares <[email protected]> <[email protected]> http://atrey.karlin.mff.cuni.cz/~mj/
P.C.M.C.I.A. stands for `People Can't Memorize Computer Industry Acronyms'

2001-02-21 22:32:38

by Martin Mares

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

Hello!

> Not true. The rehashing is O(n) and it has to be performed O(log n)
> times during insertion. Therefore, insertion is O(log n).

Rehashing is O(n), but the "n" is the _current_ number of items, not the
maximum one after all the insertions.

Let's assume you start with a single-entry hash table. You rehash for the
first time after inserting the first item (giving hash table of size 2),
then after the second item (=> size 4), then after the fourth item (=> size 8)
and so on. I.e., when you insert n items, the total cost of rehashing summed
over all the insertions is at most 1 + 2 + 4 + 8 + 16 + ... + 2^k (where
k=floor(log2(n))) <= 2^(k+1) = O(n). That is O(1) operations per item inserted.
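
A quick way to convince yourself: the toy program below doubles a table
starting from size 1 and counts the element copies done by the "rehashes"
while n items are inserted; the total stays below 2*n, i.e. O(1) amortized
per insertion.

#include <stdio.h>

int main(void)
{
	unsigned long n = 1000000, size = 1, used = 0, copies = 0, i;

	for (i = 0; i < n; i++) {
		if (used == size) {	/* table full: "rehash" into a table twice as big */
			copies += used;	/* every existing item gets copied once */
			size *= 2;
		}
		used++;
	}
	printf("items inserted: %lu, elements copied: %lu (%.2f per item)\n",
	       n, copies, (double)copies / n);
	return 0;
}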

Have a nice fortnight
--
Martin `MJ' Mares <[email protected]> <[email protected]> http://atrey.karlin.mff.cuni.cz/~mj/
MIPS: Meaningless Indicator of Processor Speed.

2001-02-21 22:38:48

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

Martin Mares wrote:
>
> Hello!
>
> > Not true. The rehashing is O(n) and it has to be performed O(log n)
> > times during insertion. Therefore, insertion is O(log n).
>
> Rehashing is O(n), but the "n" is the _current_ number of items, not the
> maximum one after all the insertions.
>
> Let's assume you start with a single-entry hash table. You rehash for the
> first time after inserting the first item (giving hash table of size 2),
> then after the second item (=> size 4), then after the fourth item (=> size 8)
> and so on. I.e., when you insert n items, the total cost of rehashing summed
> over all the insertions is at most 1 + 2 + 4 + 8 + 16 + ... + 2^k (where
> k=floor(log2(n))) <= 2^k+1 = O(n). That is O(1) operations per item inserted.
>

You're right. However, for each hash table operation to be O(1) the size
of the hash table must be >> n.

I suggested at one point to use B-trees with a hash value as the key.
B-trees are extremely efficient when used on a small constant-size key.

-hpa

--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt

2001-02-21 22:41:58

by Davide Libenzi

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2


On 21-Feb-2001 Martin Mares wrote:
> Hello!
>
>> My personal preference goes to skiplist coz it doesn't have fixed ( or
>> growing
>> ) tables to handle. You've simply a stub of data togheter with FS data in
>> each
>> direntry.
>
> Another problem with skip lists is that they require variable sized nodes,
> so you either need to keep free chunk lists and lose some space in deleted
> nodes kept in these lists, or you choose to shift remaining nodes which is
> slow and complicated as you need to keep the inter-node links right. With
> hashing, you can separate the control part of the structure and the actual
> data and shift data while leaving most of the control part intact.

An entry in the skip list table is a u32 direntry offset, and you don't have
to keep free entries; the height of the node simply changes depending on the
number of entries.
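
Something along these lines, presumably (hypothetical layout, just to
illustrate the point):

#include <stdint.h>

#define SKIP_MAX_HEIGHT 16

/* The payload is just the 32-bit offset of the dirent in the directory
 * file; the node's forward-pointer array is as tall as its randomly
 * chosen height, so no separate free list of fixed-size slots is needed. */
struct skip_node {
	uint32_t dirent_off;
	int height;			/* 1..SKIP_MAX_HEIGHT */
	struct skip_node *next[];	/* one forward pointer per level */
};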


>> And performance ( O(log2(n)) ) are the same for whatever number of entries.
>
> I don't understand this complexity estimate -- it cannot be the same for
> whatever number of entries as the complexity function depends on the number
> of entries.

n == number of entries

By constant I mean the formula, not the result.



- Davide

2001-02-21 22:47:29

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

Mark Hahn wrote:
>
> > You're right. However, for each hash table operation to be O(1) the size
> > of the hash table must be >> n.
>
> there's at least one kind of HT where the table starts small
> and gets bigger, but at trivial cost (memcpy). while those
> memcpy's are O(n) each time, it's a little misleading to treat
> them as costing the same as O(n) rehashing.
>

memcpy() isn't exactly trivial, especially not when we're talking about
disk storage. Note, too, that we're talking about storage in a
filesystem, and random access to a large, growable linear space (i.e. a
file) in a filesystem is O(log n) because of the necessary inode indirection.

That's yet another reason I like the idea of using B-trees over hash
values: B-trees are O(log n), but do not need the file inode indirection
to do the job, so what you end up with is very nice and fast.

-hpa

--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt

2001-02-21 22:50:38

by Martin Mares

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

Hello!

> You're right. However, for each hash table operation to be O(1) the size
> of the hash table must be >> n.

If we are talking about average case complexity (which is the only possibility
with fixed hash function and arbitrary input keys), it suffices to have
hash table size >= c*n for some constant c which gives O(1/c) cost of
all operations.

> I suggested at one point to use B-trees with a hash value as the key.
> B-trees are extremely efficient when used on a small constant-size key.

Although from asymptotic complexity standpoint hashing is much better
than B-trees, I'm not sure at all what will give the best performance for
reasonable directory sizes. Maybe the B-trees are really the better
alternative as they are updated dynamically and the costs of successive
operations are similar as opposed to hashing which is occasionally very
slow due to rehashing unless you try to rehash on-line, but I don't
know any algorithm for on-line rehashing with both inserts and deletes
which wouldn't be awfully complex and slow (speaking of multiplicative
constants, of course -- it's still O(1) per operation, but "the big Oh
is really big there").

Have a nice fortnight
--
Martin `MJ' Mares <[email protected]> <[email protected]> http://atrey.karlin.mff.cuni.cz/~mj/
"#define QUESTION ((bb) || !(bb))" -- Shakespeare

2001-02-21 22:54:58

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

Martin Mares wrote:
>
> Hello!
>
> > You're right. However, for each hash table operation to be O(1) the size
> > of the hash table must be >> n.
>
> If we are talking about average case complexity (which is the only possibility
> with fixed hash function and arbitrary input keys), it suffices to have
> hash table size >= c*n for some constant c which gives O(1/c) cost of
> all operations.
>

True. Note too, though, that on a filesystem (which we are, after all,
talking about), if you assume a large linear space you have to create a
file, which means you need to multiply the cost of all random-access
operations with O(log n).

> > I suggested at one point to use B-trees with a hash value as the key.
> > B-trees are extremely efficient when used on a small constant-size key.
>
> Although from asymptotic complexity standpoint hashing is much better
> than B-trees, I'm not sure at all what will give the best performance for
> reasonable directory sizes. Maybe the B-trees are really the better
> alternative as they are updated dynamically and the costs of successive
> operations are similar as opposed to hashing which is occassionally very
> slow due to rehashing unless you try to rehash on-line, but I don't
> know any algorithm for on-line rehashing with both inserts and deletes
> which wouldn't be awfully complex and slow (speaking of multiplicative
> constants, of course -- it's still O(1) per operation, but "the big Oh
> is really big there").

Well, once you multiply with O(log n) for the file indirection (which
B-trees don't need, since they inherently handle blocking and thus can
use block pointers directly) then the asymptotic complexity is the same
as well, and I think the B-trees are the overall winner.

-hpa

--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt

2001-02-21 23:08:24

by Martin Mares

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

Hello!

> True. Note too, though, that on a filesystem (which we are, after all,
> talking about), if you assume a large linear space you have to create a
> file, which means you need to multiply the cost of all random-access
> operations with O(log n).

One could avoid this, but it would mean designing the whole filesystem in a
completely different way -- merge all directories to a single gigantic
hash table and use (directory ID,file name) as a key, but we were originally
talking about extending ext2, so such massive changes are out of question
and your log n access argument is right.

Have a nice fortnight
--
Martin `MJ' Mares <[email protected]> <[email protected]> http://atrey.karlin.mff.cuni.cz/~mj/
COBOL -- Completely Outdated, Badly Overused Language

2001-02-21 23:15:44

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

Martin Mares wrote:
>
> Hello!
>
> > True. Note too, though, that on a filesystem (which we are, after all,
> > talking about), if you assume a large linear space you have to create a
> > file, which means you need to multiply the cost of all random-access
> > operations with O(log n).
>
> One could avoid this, but it would mean designing the whole filesystem in a
> completely different way -- merge all directories to a single gigantic
> hash table and use (directory ID,file name) as a key, but we were originally
> talking about extending ext2, so such massive changes are out of question
> and your log n access argument is right.
>

It would still be tricky since you have to have actual files in the
filesystem as well.

-hpa

--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt

2001-02-21 23:15:14

by Linus Torvalds

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

In article <[email protected]>,
Ed Tomlinson <[email protected]> wrote:
>
>The default in reiserfs is now the R5 hash, but you are right that lots of efforts went
>into finding this hash. This includes testing various hashes on real directory
>structures to see which one worked best. R5 won.

That's interesting. The R5 hash is easily also the only one of the
reiser hashes that might be useable for the generic VFS hashing. It's
not so different in spirit from the current one, and if you've done the
work to test it, it's bound to be a lot better.

(The current VFS name hash is probably _really_ stupid - I think it's
still my original one, and nobody probably ever even tried to run it
through any testing. For example, I bet that using a shift factor of 4
is really bad, because it evenly divides a byte, which together with the
xor means that you can really easily generate trivial bad cases).
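
To illustrate how easy the bad cases are with a shift of 4, here is a toy
hash built the way described above (rotate the running value by 4 bits, then
xor in the next character - not copied from the kernel, and assuming a 32-bit
unsigned int). Because 4 evenly divides both the byte and the 32-bit word,
the rotation amount repeats every 8 characters, so swapping two characters
that sit 8 positions apart never changes the result:

#include <stdio.h>

static unsigned int rol4_xor_hash(const char *name)
{
	unsigned int hash = 0;

	while (*name) {
		hash = (hash << 4) | (hash >> 28);	/* rotate left by 4 */
		hash ^= (unsigned char)*name++;		/* fold in the character */
	}
	return hash;
}

int main(void)
{
	/* Characters at positions 0 and 8 swapped - identical hashes. */
	const char *n1 = "a0123456b";
	const char *n2 = "b0123456a";

	printf("%s -> %08x\n", n1, rol4_xor_hash(n1));
	printf("%s -> %08x\n", n2, rol4_xor_hash(n2));
	return 0;
}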

What did you use for a test-case? Real-life directory contents? Did you
do any worst-case analysis too?

Linus

2001-02-21 23:27:16

by Jamie Lokier

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

Martin Mares wrote:
> Hello!
>
> > True. Note too, though, that on a filesystem (which we are, after all,
> > talking about), if you assume a large linear space you have to create a
> > file, which means you need to multiply the cost of all random-access
> > operations with O(log n).
>
> One could avoid this, but it would mean designing the whole filesystem in a
> completely different way -- merge all directories to a single gigantic
> hash table and use (directory ID,file name) as a key, but we were originally
> talking about extending ext2, so such massive changes are out of question
> and your log n access argument is right.

A gigantic hash table has serious problems with non-locality of
reference. Basically any regular access pattern you started with is
destroyed. This is a problem with pageable RAM, let alone disks with
millisecond seek times.

-- Jamie

2001-02-21 23:33:18

by Davide Libenzi

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2


On 21-Feb-2001 Linus Torvalds wrote:
> In article <[email protected]>,
> Ed Tomlinson <[email protected]> wrote:
>>
>>The default in reiserfs is now the R5 hash, but you are right that lots of
>>efforts went
>>into finding this hash. This includes testing various hashes on real
>>directory
>>structures to see which one worked best. R5 won.
>
> That's interesting. The R5 hash is easily also the only one of the
> reiser hashes that might be useable for the generic VFS hashing. It's
> not so different in spirit from the current one, and if you've done the
> work to test it, it's bound to be a lot better.
>
> (The current VFS name hash is probably _really_ stupid - I think it's
> still my original one, and nobody probably ever even tried to run it
> through any testing. For example, I bet that using a shift factor of 4
> is really bad, because it evenly divides a byte, which together with the
> xor means that you can really easily generate trivial bad cases).
>
> What did you use for a test-case? Real-life directory contents? Did you
> do any worst-case analysis too?

Yep, 4 is not good as a shifting factor. Prime numbers are a better choice
for this stuff.
The issue in getting a good distribution is not only having a good hashing
function, but also giving that function uncorrelated data.
A good hashing function for a Domain A may not be so good for a Domain B.




- Davide

2001-02-21 23:44:00

by Daniel Phillips

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

"H. Peter Anvin" wrote:
>
> Martin Mares wrote:
> >
> > > True. Note too, though, that on a filesystem (which we are, after all,
> > > talking about), if you assume a large linear space you have to create a
> > > file, which means you need to multiply the cost of all random-access
> > > operations with O(log n).
> >
> > One could avoid this, but it would mean designing the whole filesystem in a
> > completely different way -- merge all directories to a single gigantic
> > hash table and use (directory ID,file name) as a key, but we were originally
> > talking about extending ext2, so such massive changes are out of question
> > and your log n access argument is right.
>
> It would still be tricky since you have to have actual files in the
> filesystem as well.

Have you looked at the structure and algorithms I'm using? I would not
call this a hash table, nor is it a btree. It's a 'hash-keyed
uniform-depth tree'. It never needs to be rehashed (though it might be
worthwhile compacting it at some point). It also never needs to be
rebalanced - it's only two levels deep for up to 50 million files.

This thing deserves a name of its own. I call it an 'htree'. The
performance should speak for itself - 150 usec/create across 90,000
files and still a few optimizations to go.

Random access runs at similar speeds too, it's not just taking advantage
of a long sequence of insertions into the same directory.
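
For anyone who hasn't read the patch, a much-simplified sketch of what a
lookup does in such a structure (hypothetical types and names, not the patch
itself; it assumes 8-byte index entries of (hash, logical block) sorted by
ascending hash, with the first entry covering the lowest hashes): binary
search the root on the hash key, read the second-level index block it points
to, binary search that, and then linearly scan the single leaf block the hash
maps to.

#include <stdint.h>

struct dx_entry {			/* 8 bytes: hash key + logical block */
	uint32_t hash;
	uint32_t block;
};

struct dx_block {
	int count;
	struct dx_entry entry[512];	/* sorted by ascending hash */
};

/* Return the block of the last entry whose hash is <= the search key;
 * assumes entry[0] covers the lowest hashes. */
static uint32_t dx_find(const struct dx_block *idx, uint32_t hash)
{
	int lo = 0, hi = idx->count - 1;

	while (lo < hi) {
		int mid = (lo + hi + 1) / 2;

		if (idx->entry[mid].hash <= hash)
			lo = mid;
		else
			hi = mid - 1;
	}
	return idx->entry[lo].block;
}

/* Two index probes, then one leaf block left to scan for the name. */
static uint32_t htree_leaf_for_hash(const struct dx_block *root,
				    const struct dx_block *(*read_index)(uint32_t),
				    uint32_t hash)
{
	const struct dx_block *second = read_index(dx_find(root, hash));

	return dx_find(second, hash);
}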

BTW, the discussion in this thread has been very interesting, it just
isn't entirely relevant to my patch :-)

--
Daniel

2001-02-21 23:49:41

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

Daniel Phillips wrote:
>
> Have you looked at the structure and algorithms I'm using? I would not
> call this a hash table, nor is it a btree. It's a 'hash-keyed
> uniform-depth tree'. It never needs to be rehashed (though it might be
> worthwhile compacting it at some point). It also never needs to be
> rebalanced - it's only two levels deep for up to 50 million files.
>

I'm curious how you do that. It seems each level would have to be 64K
large in order to do that, with a minimum disk space consumption of 128K
for a directory. That seems extremely painful *except* in the case of
hysterically large directories, which tend to be the exception even on
filesystems where they occur.

I think I'd rather take the extra complexity and rebalancing cost of a
B-tree.

> This thing deserves a name of its own. I call it an 'htree'. The
> performance should speak for itself - 150 usec/create across 90,000
> files and still a few optmizations to go.
>
> Random access runs at similar speeds too, it's not just taking advantage
> of a long sequence of insertions into the same directory.
>
> BTW, the discussion in this thread has been very interesting, it just
> isn't entirely relevant to my patch :-)

-hpa

--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt

2001-02-21 23:52:01

by Davide Libenzi

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2


On 21-Feb-2001 Daniel Phillips wrote:
> "H. Peter Anvin" wrote:
>>
>> Martin Mares wrote:
>> >
>> > > True. Note too, though, that on a filesystem (which we are, after all,
>> > > talking about), if you assume a large linear space you have to create a
>> > > file, which means you need to multiply the cost of all random-access
>> > > operations with O(log n).
>> >
>> > One could avoid this, but it would mean designing the whole filesystem in
>> > a
>> > completely different way -- merge all directories to a single gigantic
>> > hash table and use (directory ID,file name) as a key, but we were
>> > originally
>> > talking about extending ext2, so such massive changes are out of question
>> > and your log n access argument is right.
>>
>> It would still be tricky since you have to have actual files in the
>> filesystem as well.
>
> Have you looked at the structure and algorithms I'm using? I would not
> call this a hash table, nor is it a btree. It's a 'hash-keyed
> uniform-depth tree'. It never needs to be rehashed (though it might be
> worthwhile compacting it at some point). It also never needs to be
> rebalanced - it's only two levels deep for up to 50 million files.
>
> This thing deserves a name of its own. I call it an 'htree'. The
> performance should speak for itself - 150 usec/create across 90,000
> files and still a few optmizations to go.
>
> Random access runs at similar speeds too, it's not just taking advantage
> of a long sequence of insertions into the same directory.
>
> BTW, the discussion in this thread has been very interesting, it just
> isn't entirely relevant to my patch :-)

Daniel,

I'm not at all saying that your algo is not good.
I use something very like it in my mail server ( XMail ) to index mail queue
files, which use a two-level-deep fs splitting.
Mine was only a hint to try different types of directory indexing.



- Davide

2001-02-21 23:58:21

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

Followup to: <[email protected]>
By author: [email protected] (Linus Torvalds)
In newsgroup: linux.dev.kernel
>
> (The current VFS name hash is probably _really_ stupid - I think it's
> still my original one, and nobody probably ever even tried to run it
> through any testing. For example, I bet that using a shift factor of 4
> is really bad, because it evenly divides a byte, which together with the
> xor means that you can really easily generate trivial bad cases).
>

Actually, the VFS name hash I think is derived from the "Dragon Book"
hash (via autofs), so it's not like it's completely untested.

-hpa
--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt

2001-02-22 00:00:01

by Linus Torvalds

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

In article <[email protected]>,
Davide Libenzi <[email protected]> wrote:
>
>Yep, 4 is not good as a shifting factor. Prime number are the better choice for
>this stuff.

Oh, absolutely.

It looks like the hash function was done rather early on in the dcache
lifetime (one of the first things), back when nobody cared about whether
it was really good or not because there were many much more complicated
questions like "how the h*ll will this all ever work" ;)

And at no point did anybody ever go back and verify whether the hash
function made much sense or not.

We had another boo-boo with the actual _folding_ of the "full" hash
value into the actual hash chain pointer that is done when the name
cache is actually looked up, which was even more embarrassing: even if
the hash ended up being ok, we would remove most of the valid bits from
it because it would under certain circumstances (512MB of RAM on x86)
basically xor itself with itself.

That took quite a while to find too - the code still worked fine, it
just had a horrible distribution on machines with half a gig of memory.

>The issue to have a good distribution is not only to have a good hashing
>function, but also to give this function not correlated data.
>Good hashing function for a Domain A may not be so good for a Domain B.

This is not something we can do all that much about. The data we get is
generated by the user, and can basically be a random string of
characters. HOWEVER, there are certainly tons of _usual_ data, and
while there's no way to select the data we can at least try to make sure
that the distribution is good for the normal case (ie regular ASCII
filenames, not forgetting the fact that many people use more interesting
encodings).

Linus

2001-02-22 00:36:27

by Ed Tomlinson

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

Linus Torvalds <[email protected]> wrote:
>
> Ed Tomlinson <[email protected]> wrote:
> >The default in reiserfs is now the R5 hash, but you are right that lots
> > of effort went into finding this hash.  This includes testing various
> > hashes on real directory structures to see which one worked best.  R5
> > won.
>
> That's interesting.  The R5 hash is easily also the only one of the
> reiser hashes that might be useable for the generic VFS hashing.  It's
> not so different in spirit from the current one, and if you've done the
> work to test it, it's bound to be a lot better.

It was not me personally.  I just remembered the thread (from June 2000) on
the reiserfs list...  I have summarized the results for you below.

For the program see: http://www.jedi.claranet.fr/hash_torture.tar.gz

Ed

PS.  I am still seeing hangs (with 2.4.2pre2, then I switched to ac7 or so,
and have had hangs with all the pre and ac kernels tried, and that is most
of them).  ac20 plus the latest reiserfs fixes has stayed up 8 hours so far -
it can take two or three days to trigger the hang though.  When it hangs it
is really dead: a UPS connected via a serial port cannot shut it down, pings
to the box fail, Alt+SysRQ is dead, and the software watchdog does not
trigger a reboot.  Ideas?

> (The current VFS name hash is probably _really_ stupid - I think it's
> still my original one, and nobody probably ever even tried to run it
> through any testing.  For example, I bet that using a shift factor of 4
> is really bad, because it evenly divides a byte, which together with the
> xor means that you can really easily generate trivial bad cases).
>
> What did you use for a test-case? Real-life directory contents? Did you
> do any worst-case analysis too?
>
>		Linus


Some test results from June 2000, with Hans's summary first.
---------------------------------------------------------------
(reiserfs) Re: r5 hash
From: Hans Reiser <[email protected]>
To: "Yury Yu. Rupasov" <[email protected]>
Cc: Jedi/Sector One <[email protected]>, Petru Paler <[email protected]>,
"[email protected]" <[email protected]>, Yury Shevchuk
<[email protected]>


Ok, based on this benchmark let's put rupasov5 in, and warn users who choose
the currently used rupasov1 hash that rupasov5 has obsoleted it.  Do this in
both 3.6 and 3.5, and fix the delimiting key check bug in 3.5 REISERFS_CHECK
at the same time.  Cut the patch, start testing, and see if you can release
by Monday.  Make rupasov5 the default.  sizif, review the documentation he
creates for users.

Jedi, if you disagree with the benchmarks let me know.  You might try
concatenating two filenames together instead of adding a digit to them, or
running find on a really large FS, to improve these tests.  Thanks for
helping us with analyzing the different hash methods available, Jedi.

Hans

---------------------------------------------------------------
(reiserfs) Re: r5 hash
From: "Yury Yu. Rupasov" <[email protected]>
To: Hans Reiser <[email protected]>
Cc: Jedi/Sector One <[email protected]>, Petru Paler <[email protected]>,
"[email protected]" <[email protected]>, Yury Shevchuk
<[email protected]>


Hans Reiser wrote:
>
> What is the speed of the real filenames, not just the number of collisions.
>

Ok, here are the results for real names:

# find / -type d -exec ls {} \; | sort | uniq > allfiles.txt

# wc -l allfiles.txt
161101 allfiles.txt

Collisions for 161 101 names:

tea_hash  :  784 total,  2 dangerous
jedi_hash2:  957 total,  2 dangerous
r5_hash   : 1191 total,  2 dangerous
r7_hash   : 8439 total, 18 dangerous


The speed for 161 101 real names:

create 161101 files of 10 bytes with names from allfiles.txt

# time create d1 allfiles.txt
# time cp d1 d2 -r
# time rm d1 -r

              create       copy        remove
             ---------------------------------
tea_hash   : 1m27.223s    5m43.069s   2m33.449s
jedi_hash2 : 1m26.062s    5m40.872s   2m32.795s
r5_hash    : 1m16.729s    4m14.967s   1m53.037s
r7_hash    : 1m10.665s    3m34.950s   1m39.756s


As you can see, the results differ, but not too much. :)
The situation changes dramatically if we test 1 million files.

The same test, but with numbers from 0 to 6 appended to each name from
allfiles.txt (1 127 707 files):

              create       copy        remove
             ---------------------------------
tea_hash   : 81m44.449s
jedi_hash2 : 79m46.419s
r5_hash    : 15m56.037s
r7_hash    : 15m30.680s

Dual Celeron 500, 128 MB RAM, 8 GB scsi HDD
Reiserfs-3.5.21, Linux-2.2.15

Thanks,
Yura.
---------------------------------------------------------------
(reiserfs) Torture results
From: Jedi/Sector One <[email protected]>
To: [email protected]


  Here are the results of the hash torture on a Celeron 300.
  Once again, you can subtract 1 from the dangerous collisions numbers.
  Xuan, can you provide a test for the case the Rupasov hash was designed
for?
  Anyway, I don't really see why large directories should have similar
file names, rather than keywords.

  Best regards,
--
        Frank DENIS aka Jedi/Sector One aka DJ Chrysalis <[email protected]>
                -> Software : http://www.jedi.claranet.fr <-
      If Bill Gates had a dime for every time a Windows box crashed...
                ...oh, wait a minute -- he already does.


********************** /usr/dict/words test **********************

Trying with 45402 words

Hash       Collisions   Dangerous         real       user       sys
--------   ----------   --------------    --------   --------   --------
tea                45   1   (ffff980)     0m0.145s   0m0.120s   0m0.010s
rupasov           553   1   (ffffe00)     0m0.297s   0m0.260s   0m0.020s
r5                185   1   (ffae000)     0m0.124s   0m0.080s   0m0.030s
r7               2528   1   (fffd400)     0m0.121s   0m0.100s   0m0.000s
jedi               54   1   (fff9780)     0m0.122s   0m0.100s   0m0.010s
jedi2              93   1   (fff9780)     0m0.122s   0m0.090s   0m0.020s
lookup2            63   1   (ffff480)     0m0.123s   0m0.100s   0m0.000s

********************** Squid names test **********************

Trying with 458752 squid cache entries

Hash       Collisions   Dangerous         real       user       sys
--------   ----------   --------------    --------   --------   --------
tea              6237   1   (fffff80)     0m1.138s   0m1.090s   0m0.030s
rupasov        377520   1   (e32700)      0m2.588s   0m2.550s   0m0.020s
r5             309991   1   (55406b80)    0m0.940s   0m0.880s   0m0.040s
r7             449006   2   (22b16580)    0m0.928s   0m0.840s   0m0.070s
jedi             2771   1   (fffef80)     0m0.928s   0m0.860s   0m0.050s
jedi2               0   1   (ffff80)      0m0.879s   0m0.810s   0m0.050s
lookup2          6203   1   (fffdc00)     0m0.930s   0m0.840s   0m0.080s

********************** Real names test **********************

Trying with 89830 files

Hash       Collisions   Dangerous         real       user       sys
--------   ----------   --------------    --------   --------   --------
tea               237   1   (fff5580)     0m0.276s   0m0.250s   0m0.000s
rupasov          6288   1   (ffee080)     0m0.582s   0m0.560s   0m0.010s
r5               3920   1   (fff4600)     0m0.230s   0m0.190s   0m0.020s
r7              11801   1   (fff580)      0m0.225s   0m0.180s   0m0.030s
jedi              269   1   (fff9f80)     0m0.226s   0m0.200s   0m0.010s
jedi2             415   1   (fff9f80)     0m0.225s   0m0.200s   0m0.010s
lookup2           223   1   (ffff480)     0m0.230s   0m0.210s   0m0.000s

----------------------------------------------------------------------------------------

(reiserfs) hash torture results
From: Petru Paler <[email protected]>
To: [email protected]


Machine: AMD Athlon/650MHz, 128Mb RAM, Quantum Fireball lct15 IDE hdd
(UDMA/66 but that doesn't matter). Kernel 2.4.0-test1-ac10.

The results are interesting, but more interesting would be to see how fast
reiserfs actually is with each of these hashes.

Script output:

********************** /usr/dict/words test **********************

Trying with 45402 words

Hash       Collisions   Dangerous         user    system   elapsed    %CPU
--------   ----------   --------------    -----   ------   --------   ----
tea                45   1   (ffff980)     0.00    0.01     0:00.08    11%
rupasov           553   1   (ffffe00)     0.00    0.00     0:00.18     0%
r5                185   1   (ffae000)     0.00    0.00     0:00.08     0%
r7               2528   1   (fffd400)     0.00    0.01     0:00.07    12%
jedi               54   1   (fff9780)     0.00    0.00     0:00.08     0%
jedi2              93   1   (fff9780)     0.00    0.00     0:00.07     0%
lookup2            63   1   (ffff480)     0.00    0.00     0:00.07     0%

********************** Squid names test **********************

Trying with 262144 squid cache entries

Hash       Collisions   Dangerous         user    system   elapsed    %CPU
--------   ----------   --------------    -----   ------   --------   ----
tea              2019   1   (ffff880)     0.00    0.01     0:00.47     2%
rupasov        210912   1   (a88f00)      0.00    0.02     0:01.03     1%
r5             171912   1   (54ca7680)    0.00    0.03     0:00.41     7%
r7             256171   6   (22aa0600)    0.00    0.03     0:00.41     7%
jedi              589   1   (fffda00)     0.00    0.02     0:00.42     4%
jedi2               0   1   (ffff80)      0.00    0.00     0:00.40     0%
lookup2          2041   1   (fffdc00)     0.00    0.01     0:00.40     2%

********************** Real names test **********************

find: /proc/31112/fd/4: No such file or directory
Trying with 94836 files

Hash       Collisions   Dangerous         user    system   elapsed    %CPU
--------   ----------   --------------    -----   ------   --------   ----
tea               235   1   (fff5e80)     0.00    0.00     0:00.20     0%
rupasov          2016   1   (fffab80)     0.01    0.00     0:00.46     2%
r5                495   1   (fff8780)     0.00    0.00     0:00.17     0%
r7               8162   1   (fff580)      0.00    0.02     0:00.17    11%
jedi              331   1   (ffe400)      0.00    0.00     0:00.17     0%
jedi2             341   1   (ffe400)      0.00    0.00     0:00.17     0%
lookup2           298   1   (fffb700)     0.00    0.00     0:00.17     0%

-Petru

2001-02-22 01:24:46

by Daniel Phillips

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

"H. Peter Anvin" wrote:
>
> Daniel Phillips wrote:
> >
> > Have you looked at the structure and algorithms I'm using? I would not
> > call this a hash table, nor is it a btree. It's a 'hash-keyed
> > uniform-depth tree'. It never needs to be rehashed (though it might be
> > worthwhile compacting it at some point). It also never needs to be
> > rebalanced - it's only two levels deep for up to 50 million files.
>
> I'm curious how you do that. It seems each level would have to be 64K
> large in order to do that, with a minimum disk space consumption of 128K
> for a directory. That seems extremely painful *except* in the case of
> hysterically large directories, which tend to be the exception even on
> filesystems where they occur.

Easy: with an average dirent reclen of 16 bytes, each directory leaf block
can hold up to 256 entries. Each index block indexes 512 directory
blocks and the root indexes 511 index blocks. Assuming the leaves are
on average 75% full this gives:

(4096 / 16) * 512 * 511 * .75 = 50,233,344

In practice I'm getting a little more than 90,000 entries indexed by a
*single* index block (the root) so I'm not just making this up.

> I think I'd rather take the extra complexity and rebalancing cost of a
> B-tree.

Do you still think so?

--
Daniel

2001-02-22 01:43:26

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

Daniel Phillips wrote:
>
> "H. Peter Anvin" wrote:
> >
> > Daniel Phillips wrote:
> > >
> > > Have you looked at the structure and algorithms I'm using? I would not
> > > call this a hash table, nor is it a btree. It's a 'hash-keyed
> > > uniform-depth tree'. It never needs to be rehashed (though it might be
> > > worthwhile compacting it at some point). It also never needs to be
> > > rebalanced - it's only two levels deep for up to 50 million files.
> >
> > I'm curious how you do that. It seems each level would have to be 64K
> > large in order to do that, with a minimum disk space consumption of 128K
> > for a directory. That seems extremely painful *except* in the case of
> > hysterically large directories, which tend to be the exception even on
> > filesystems where they occur.
>
> Easy, with average dirent reclen of 16 bytes each directory leaf block
> can holds up to 256 entries. Each index block indexes 512 directory
> blocks and the root indexes 511 index blocks. Assuming the leaves are
> on average 75% full this gives:
>
> (4096 / 16) * 512 * 511 * .75 = 50,233,344
>

That's a three-level tree, not a two-level tree.

> I practice I'm getting a little more than 90,000 entries indexed by a
> *single* index block (the root) so I'm not just making this up.
>
> > I think I'd rather take the extra complexity and rebalancing cost of a
> > B-tree.
>
> Do you still think so?

I think so.

-hpa

--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt

2001-02-22 02:04:20

by Andreas Dilger

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

Daniel Phillips writes:
> Easy, with average dirent reclen of 16 bytes each directory leaf block
> can holds up to 256 entries. Each index block indexes 512 directory
> blocks and the root indexes 511 index blocks. Assuming the leaves are
> on average 75% full this gives:
>
> (4096 / 16) * 512 * 511 * .75 = 50,233,344
>
> I practice I'm getting a little more than 90,000 entries indexed by a
> *single* index block (the root) so I'm not just making this up.

I was just doing the math for 1k ext2 filesystems, and the numbers aren't
nearly as nice. We get:

(1024 / 16) * 127 * .75 = 6096 # 1 level
(1024 / 16) * 128 * 127 * .75 = 780288 # 2 levels

Basically (IMHO) we will not really get any noticeable benefit with 1-level
index blocks for a 1k filesystem - my estimates at least are that the break-even
point is about 5k files. We _should_ be OK with 780k files in a single
directory for a while. Looks like we will need 2-level indexes sooner than
you would think though. Note that tests on my workstation showed an average
filename length of 10 characters (excluding MP3s at 78 characters), so this
would give 20-byte (or 88-byte) dirents for ext3, reducing the files count
to 4857 and 621792 (or 78183 and 40029696 for 4k filesystems) at 75% full.

Cheers, Andreas
--
Andreas Dilger \ "If a man ate a pound of pasta and a pound of antipasto,
\ would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/ -- Dogbert

2001-02-22 02:30:04

by Daniel Phillips

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

Linus Torvalds wrote:
>
> On Tue, 20 Feb 2001, Daniel Phillips wrote:
> >
> > You mean full_name_hash? I will un-static it and try it. I should have
> > some statistics tomorrow. I have a couple of simple metrics for
> > measuring the effectiveness of the hash function: the uniformity of
> > the hash space splitting (which in turn affects the average fullness
> > of directory leaves) and speed.
>
> I was more thinking about just using "dentry->d_name->hash" directly, and
> not worrying about how that hash was computed. Yes, for ext2 it will have
> the same value as "full_name_hash" - the difference really being that
> d_hash has already been precomputed for you anyway.
>
> > Let the hash races begin.
>
> Note that dentry->d_name->hash is really quick (no extra computation), but
> I'm not claiming that it has anything like a CRC quality. And it's
> probably a bad idea to use it, because in theory at least the VFS layer
> might decide to switch the hash function around. I'm more interested in
> hearing whether it's a good hash, and maybe we could improve the VFS hash
> enough that there's no reason to use anything else..

In the first heat of hash races - creating 20,000 files in one directory
- dentry::hash lost out to my original hack::dx_hash, causing a high
percentage of leaf blocks to remain exactly half full and slowing down
the whole thing by about 5%. (This was under uml - I haven't tried it
native yet but I expect the results to be similar.)

Contender Result
========= ======
dentry::hash Average fullness = 2352 (57%)
hack::dx_hash Average fullness = 2758 (67%)

This suggests that dentry::hash is producing distinctly non-dispersed
results and needs to be subjected to further scrutiny. I'll run the
next heat of hash races tomorrow, probably with R5, and CRC32 too if I
have time.

--
Daniel

2001-02-22 02:41:59

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

Andreas Dilger wrote:
>
> Basically (IMHO) we will not really get any noticable benefit with 1 level
> index blocks for a 1k filesystem - my estimates at least are that the break
> even point is about 5k files. We _should_ be OK with 780k files in a single
> directory for a while.
>

I've had a news server with 2000000 files in one directory. Such a
filesystem is likely to use small blocks, too, because each file is
generally small.

This is an important connection: filesystems which have lots and lots of
small files will have large directories and small block sizes.

-hpa

--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt

2001-02-22 03:10:17

by Daniel Phillips

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

Andreas Dilger wrote:
>
> Daniel Phillips writes:
> > Easy, with average dirent reclen of 16 bytes each directory leaf block
> > can holds up to 256 entries. Each index block indexes 512 directory
> > blocks and the root indexes 511 index blocks. Assuming the leaves are
> > on average 75% full this gives:
> >
> > (4096 / 16) * 512 * 511 * .75 = 50,233,344
> >
> > I practice I'm getting a little more than 90,000 entries indexed by a
> > *single* index block (the root) so I'm not just making this up.
>
> I was just doing the math for 1k ext2 filesystems, and the numbers aren't
> nearly as nice. We get:
>
> (1024 / 16) * 127 * .75 = 6096 # 1 level
> (1024 / 16) * 128 * 127 * .75 = 780288 # 2 levels
>
> Basically (IMHO) we will not really get any noticable benefit with 1 level
> index blocks for a 1k filesystem - my estimates at least are that the break
> even point is about 5k files. We _should_ be OK with 780k files in a single
> directory for a while. Looks like we will need 2-level indexes sooner than
> you would think though. Note that tests on my workstation showed an average
> filename length of 10 characters (excluding MP3s at 78 characters), so this
> would give 20-byte (or 88-byte) dirents for ext3, reducing the files count
> to 4857 and 621792 (or 78183 and 40029696 for 4k filesystems) at 75% full.

But you are getting over 3/4 million files in one directory on a 1K
blocksize system, and you really shouldn't be using 1K blocks on a
filesystem under that big a load. Is it just to reduce tail block
fragmentation? That's what tail merging is for - it does a much better
job than shrinking the block size.

But if you are *determined* to use 1K blocks and have more than 1/2
million files in one directory then I suppose a 3rd level is what you
need. The uniform-depth tree still works just fine and still doesn't
need to be rebalanced - it's never out of balance.
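
For concreteness, here is the back-of-the-envelope capacity arithmetic
as a small standalone program (just a sketch using the figures quoted
in this thread - 16-byte average dirents, 8-byte index entries, 75%
leaf fullness - not code from the patch):

#include <stdio.h>

/* rough htree capacity: blocksize in bytes, levels of index blocks */
static unsigned long htree_capacity(unsigned long blocksize, int levels,
                                    double fullness)
{
        unsigned long dirents_per_leaf = blocksize / 16; /* avg reclen 16 */
        unsigned long fanout = blocksize / 8;            /* 8-byte entries */
        unsigned long leaves = fanout - 1;               /* root holds header */
        int i;

        for (i = 1; i < levels; i++)
                leaves *= fanout;
        return (unsigned long) (leaves * dirents_per_leaf * fullness);
}

int main(void)
{
        /* 4K blocks, 2 levels: prints 50233344, the figure above */
        printf("%lu\n", htree_capacity(4096, 2, 0.75));
        /* 1K blocks, 2 levels: prints 780288, Andreas' figure above */
        printf("%lu\n", htree_capacity(1024, 2, 0.75));
        return 0;
}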

--
Daniel

2001-02-22 03:31:41

by Linus Torvalds

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2



On Thu, 22 Feb 2001, Daniel Phillips wrote:
>
> In the first heat of hash races - creating 20,000 files in one directory
> - dentry::hash lost out to my original hack::dx_hash, causing a high
> percentage of leaf blocks to remain exactly half full and slowing down
> the whole thing by about 5%. (This was under uml - I haven't tried it
> native yet but I expect the results to be similar.)
>
> Contender Result
> ========= ======
> dentry::hash Average fullness = 2352 (57%)
> hack::dx_hash Average fullness = 2758 (67%)
>
> This suggests that dentry::hash is producing distinctly non-dispersed
> results and needs to be subjected to further scrutiny. I'll run the
> next heat of hash races tomorrow, probably with R5, and CRC32 too if I
> have time.

I'd love to hear the results from R5, as that seems to be the reiserfs
favourite, and I'm trying it out in 2.4.2 because it was so easy to plug
in..

Linus

2001-02-22 03:44:42

by Daniel Phillips

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

"H. Peter Anvin" wrote:
>
> Andreas Dilger wrote:
> >
> > Basically (IMHO) we will not really get any noticeable benefit with 1 level
> > index blocks for a 1k filesystem - my estimates at least are that the break
> > even point is about 5k files. We _should_ be OK with 780k files in a single
> > directory for a while.
> >
>
> I've had a news server with 2000000 files in one directory. Such a
> filesystem is likely to use small blocks, too, because each file is
> generally small.
>
> This is an important connection: filesystems which have lots and lots of
> small files will have large directories and small block sizes.

I mentioned this earlier but it's worth repeating: the desire to use a
small block size is purely an artifact of the fact that ext2 has no
handling for tail block fragmentation. That's a temporary situation -
once we've dealt with it your 2,000,000 file directory will be happier
with 4K filesystem blocks. There will be a lot fewer metadata index
blocks in your directory file, for one thing. Another practical matter
is that 4K filesystem blocks map directly to 4K PAGE_SIZE and are as a
result friendlier to the page cache and memory manager.

--
Daniel

2001-02-22 04:03:08

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

Daniel Phillips wrote:
>
> "H. Peter Anvin" wrote:
> >
> > Andreas Dilger wrote:
> > >
> > > Basically (IMHO) we will not really get any noticeable benefit with 1 level
> > > index blocks for a 1k filesystem - my estimates at least are that the break
> > > even point is about 5k files. We _should_ be OK with 780k files in a single
> > > directory for a while.
> > >
> >
> > I've had a news server with 2000000 files in one directory. Such a
> > filesystem is likely to use small blocks, too, because each file is
> > generally small.
> >
> > This is an important connection: filesystems which have lots and lots of
> > small files will have large directories and small block sizes.
>
> I mentioned this earlier but it's worth repeating: the desire to use a
> small block size is purely an artifact of the fact that ext2 has no
> handling for tail block fragmentation. That's a temporary situation -
> once we've dealt with it your 2,000,000 file directory will be happier
> with 4K filesystem blocks. There will be a lot fewer metadata index
> blocks in your directory file, for one thing. Another practical matter
> is that 4K filesystem blocks map directly to 4K PAGE_SIZE and are as a
> result friendlier to the page cache and memory manager.
>

Well, that's something I really don't expect to see anymore -- this
"purely temporary situation" is now already 7 years old at least.

-hpa

--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt

2001-02-22 04:03:08

by Linus Torvalds

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

In article <[email protected]>,
Daniel Phillips <[email protected]> wrote:
>
>I mentioned this earlier but it's worth repeating: the desire to use a
>small block size is purely an artifact of the fact that ext2 has no
>handling for tail block fragmentation. That's a temporary situation -
>once we've dealt with it your 2,000,000 file directory will be happier
>with 4K filesystem blocks.

I'd rather see a whole new filesystem than have ext2 do tail-block
fragmentation.

Once you do tail fragments, you might as well do the whole filesystem
over and have it do fancier stuff than just handling sub-blocking.

Another way of saying this: if you go to the complexity of no longer
being a purely block-based filesystem, please go the whole way. Make the
thing be extent-based, and get away from the notion that you have to
allocate blocks one at a time. Make the blocksize something nice and
big, not just 4kB or 8kB or something.

And don't call it ext2.

Linus

2001-02-22 04:03:38

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

Daniel Phillips wrote:
>
> There will be a lot fewer metadata index
> blocks in your directory file, for one thing.
>

Oh yes, another thing: a B-tree directory structure does not need
metadata index blocks.

-hpa

--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt

2001-02-22 05:20:23

by Linus Torvalds

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

In article <[email protected]>,
Linus Torvalds <[email protected]> wrote:
>
>Another way of saying this: if you go to the complexity of no longer
>being a purely block-based filesystem, please go the whole way. Make the
>thing be extent-based, and get away from the notion that you have to
>allocate blocks one at a time. Make the blocksize something nice and
>big, not just 4kB or 8kB or something.

Btw, this is also going to be a VM and performance issue some time in
the future. There are already CPU's that would _love_ to have 64kB
pages etc, and as such a filesystem that doesn't play with the old silly
"everything is a block" rules would be much appreciated with the kind of
people who have multi-gigabyte files and want to read in big chunks at a
time.

So either you have a simple block-based filesystem (current ext2, no
extents, no crapola), or you decide to do it over. Don't do some
half-way thing, please.

Linus

2001-02-22 06:23:29

by Theodore Tso

[permalink] [raw]
Subject: Re: [Ext2-devel] [rfc] Near-constant time directory index for Ext2

Daniel,

Nice work!

A couple of comments. If you make the beginning of each index block
look like an empty directory block (i.e., the first 8 bytes look like
this):

32 bits: ino == 0
16 bits: rec_len == blocksize
16 bits: name_len = 0

... then you will have full backwards compatibility, both for reading
*and* writing. When reading, old kernels will simply ignore the index
blocks, since it looks like it has an unpopulated directory entry. And
if the kernel attempts to write into the directory, it will clear the
BTREE_FL flag, in which case new kernels won't treat the directory as a
tree anymore. (Running a smart e2fsck which knows about directory trees
will be able to restore the tree structure).
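
For concreteness, the dummy entry might be stamped something like this
(a sketch following the description above, using userspace types for
illustration - not code from the actual patch; on disk the fields are
little-endian):

#include <stdint.h>
#include <string.h>

struct fake_dirent {
        uint32_t inode;         /* 0: old kernels treat the entry as unused */
        uint16_t rec_len;       /* blocksize: the "entry" spans the block */
        uint16_t name_len;      /* 0: no name */
};

/* Make an index block look like one big empty dirent to old ext2 code. */
static void stamp_fake_dirent(unsigned char *block, unsigned int blocksize)
{
        struct fake_dirent fake = { 0, (uint16_t) blocksize, 0 };

        memcpy(block, &fake, sizeof(fake));
        /* the 8-byte index entries would follow this 8-byte header */
}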

Is it worth it? Well, it means you lose an index entry from each
directory block, thus reducing your fanout at each node of the tree by a
worst case of 0.7% (1k blocksize) and 0.2% if you're using 4k
blocksizes.

- Ted

2001-02-22 07:05:10

by Andreas Dilger

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

HPA writes:
> Daniel Phillips wrote:
> > I mentioned this earlier but it's worth repeating: the desire to use a
> > small block size is purely an artifact of the fact that ext2 has no
> > handling for tail block fragmentation. That's a temporary situation -
> > once we've dealt with it your 2,000,000 file directory will be happier
> > with 4K filesystem blocks. There will be a lot fewer metadata index
> > blocks in your directory file, for one thing. Another practical matter
> > is that 4K filesystem blocks map directly to 4K PAGE_SIZE and are as a
> > result friendlier to the page cache and memory manager.
> >
>
> Well, that's something I really don't expect to see anymore -- this
> "purely temporary situation" is now already 7 years old at least.

Peter, you're barking up the wrong tree - Daniel has had an ext2 tail
merging patch around for 6 months or more... However, from the sounds
of it, Linus may not want such a thing in ext2 (at least not until he
is convinced otherwise). It will be interesting to compare ext2 +
ongoing patches vs. new filesystems like reiserfs, XFS, JFS -- not only
speed, but reliability as well. XFS and JFS have previous implementations
to work with (although the JFS code is not the AIX JFS code), but reiserfs
has a long way to go, just from the standpoint of being run on millions
of machines, and being looked at by thousands of programmers.

I think people will be surprised at how ext2 + patches will continue to
improve. One of the reasons (despite Linus' misgivings, IMHO) is that
ext2 is continually being improved by small measures, has lots of eyes
on the code, and it offers a stable base for each improvement - which
means each improvement is stable and reliable much quicker than if you
were to code a new filesystem from scratch for each new feature.

Cheers, Andreas
--
Andreas Dilger \ "If a man ate a pound of pasta and a pound of antipasto,
\ would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/ -- Dogbert

2001-02-22 07:21:26

by Bill Wendling

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

Also sprach H. Peter Anvin:
} Martin Mares wrote:
} >
} > Hello!
} >
} > > True. Note too, though, that on a filesystem (which we are, after all,
} > > talking about), if you assume a large linear space you have to create a
} > > file, which means you need to multiply the cost of all random-access
} > > operations with O(log n).
} >
} > One could avoid this, but it would mean designing the whole filesystem in a
} > completely different way -- merge all directories to a single gigantic
} > hash table and use (directory ID,file name) as a key, but we were originally
} > talking about extending ext2, so such massive changes are out of question
} > and your log n access argument is right.
} >
}
} It would still be tricky since you have to have actual files in the
} filesystem as well.
}
But that's just a user space issue, isn't it.

(Just kidding :-)

--
|| Bill Wendling [email protected]

2001-02-22 07:29:47

by Daniel Phillips

[permalink] [raw]
Subject: Re: [Ext2-devel] [rfc] Near-constant time directory index for Ext2

On Thu, 22 Feb 2001, [email protected] wrote:
> A couple of comments. If you make the beginning of each index block
> look like an empty directory block (i.e., the first 8 bytes look like
> this):
>
> 32 bits: ino == 0
> 16 bits: rec_len == blocksize
> 16 bits: name_len = 0
>
> ... then you will have full backwards compatibility, both for reading
> *and* writing. When reading, old kernels will simply ignore the index
> blocks, since it looks like it has an unpopulated directory entry. And
> if the kernel attempts to write into the directory, it will clear the
> BTREE_FL flag, in which case new kernels won't treat the directory as a
> tree anymore. (Running a smart e2fsck which knows about directory trees
> will be able to restore the tree structure).

:-) That's really nice, now I see what you were thinking about with
all those bit clears.

> Is it worth it? Well, it means you lose an index entry from each
> directory block, thus reducing your fanout at each node of the tree by a
> worst case of 0.7% (1k blocksize) and 0.2% if you're using 4k
> blocksizes.

I'll leave that up to somebody else - we now have two alternatives, the
100%, no-compromise INCOMPAT solution, and the slightly-bruised but
still largely intact forward compatible solution. I'll maintain both
solutions in the code for now so it's just as easy to choose either in the end.

--
Daniel

2001-02-22 08:09:56

by Andreas Dilger

[permalink] [raw]
Subject: Re: [rfc] [LONG] Near-constant time directory index for Ext2

Daniel Phillips writes:
> Andreas Dilger wrote:
> > I was just doing the math for 1k ext2 filesystems, and the numbers aren't
> > nearly as nice. We get:
> >
> > (1024 / 16) * 127 * .75 = 6096 # 1 level
> > (1024 / 16) * 128 * 127 * .75 = 780288 # 2 levels
>
> But if you are *determined* to use 1K blocks and have more than 1/2
> million files in one directory then I suppose a 3rd level is what you
> need. The uniform-depth tree still works just fine and still doesn't
> need to be rebalanced - it's never out of balance.

I would rather simply go to some chained block scheme at that point.
ext2 is already fairly fast at linear searching, so if we index a HUGE
directory we are still linearly searching only 1/2^16 of the directory
(at worst for 1k blocks, 1/2^20 for 4k blocks).

I just had a clever idea - on a single-level index you put the header
and index data in block 0, and put the directory data in the first
indirect block (11 sparse blocks, instead of 511). If you need to go
to a second-level index, you can simply shift the indirect data block to
be a double-indirect block, and start the level-2 index in the first
indirect block. If we ever need a third-level index, you basically do
the same thing - move the double-indirect blocks to triple-indirect,
and put the level-3 index in the double-indirect block. It will always
fit, because the index branching factor is half the indirect block
branching factor, since each index entry carries an extra 4-byte hash value.

Andreas:
>> One thing I was thinking was that you could put "." and ".." in the first
>> block (like usual), and then put the index data after that. This way
>> "." and ".." still exist and e2fsck and the kernel code doesn't complain,
>> except about the sparse directory blocks.

Daniel:
>The kernel code - ext2 fs that is - doesn't complain at the moment
>because I removed the complaint, and everything seems to be fine. All
>references to "." and ".." are now intercepted and never reach the
>filesystem level. If they did then I'd just fix ext2_is_empty_dir to
>tolerate those entries being somewhere other than the first block.
>But, reading ahead, I see you are talking about forward compatibility...

One of the (many) benefits of ext2 is that it has tried to maintain
compatibility as much as possible, if possible. In this case, I
don't see that there is an overwhelming reason to NOT keep compatibility,
and I think Ted agrees:

Ted Ts'o writes:
> E2fsck uses '..' to be able to climb up the directory tree when it needs
> to print a pathname given only a directory inode. So yes, removing it
> will cause e2fsck to not work as well. '.' is not as useful, but it's
> useful as a sanity check.

> Of course, if we completely give up on compatibility, we don't actually
> need to have special directory entries for '.' and '..' complete with
> their names; we can just store the inode numbers for both in a 32bit
> field along with the indexes. (And so magic number for sanity checking;
> magic numbers are good things....)

Having real dirents for "." and ".." only costs 16 more bytes (2 index
entries), compared to only keeping the inode numbers.

Andreas:
> > So, we would have (for the root entry, let's say):
(in directory block 0)

> > ext2_dir_entry_2{ EXT2_ROOT_INO, 12, 1, EXT2_FT_DIR, ".\0\0\0"}
> > ext2_dir_entry_2{ EXT2_ROOT_INO, <blocksize> - 12, 2, EXT2_FT_DIR, "..\0\0"}
> > <index magic (maybe)>
> > <index header>
> > <index data>
> >
> > For the index ext2 kernel code, it would notice the EXT2_INDEX_FL and
> > access the data after the end of the ".." dir entry, and this would also
> > give you read-only "compatibility" of sorts with older kernels (modulo
> > calling ext2_error() for all of the sparse blocks before the start of the
> > actual directory data blocks). You lose 24 bytes of data in the first
> > block, but gain some compatibility. For second-level index blocks, if you
> > want to keep compatibility you lose 8 bytes each block if you start with:
> >
> > ext2_dir_entry_2 { 0, <blocksize>, 0, EXT2_FT_DIR, "" }
> > <index magic (maybe)>
> > <second level index data>

Daniel:
> I really think INCOMPAT is the way to go and if you must mount it with
> an old kernel, do a fsck. Old fsck manages to put things back into a
> form it can understand without too much difficulty, though you do have
> to answer quite a few questions. The exact answers you give don't seem
> to be that important.

You don't always have the luxury to go back to an old kernel (for whatever
reason), and if an incompat flag is set the kernel will refuse to mount
your old filesystem. If this is your root, you can't even run fsck. Yes,
I know - have a rescue disk/partition - but _sometimes_ you are just stuck
and it would be good to not get into that situation in the first place.

Andreas:
> > Will there be a lower limit at which you create indexed directories?

Daniel:
> Yes, I hashed that out today with Al Viro on #kernelnewbies. The
> breakeven happens at 3 directory blocks.

Andreas:
> > I guess the tradeoff is if you have to index all of the existing entries
> > in a non-indexed directory. However, you need to handle this anyways if
> > you are updating an existing non-indexed directory from an old filesystem.

Daniel:
> If I do the optimization just for the first directory block then it's
> very nearly free - just one extra read of the first directory block,
> and it's almost certainly in cache anyway because it was just read to
> see if the entry already exists.

But you still need to handle the case for an arbitrary-sized non-indexed
directory, if you want to be able to upgrade an existing ext2 filesystem.
Since you need this, you may as well only turn on indexing when you are
actually getting a speed benefit, because doing anything else still
wastes space. It may even be that indexing a large existing directory
and _then_ doing the lookup is still faster than doing the lookup on the
original un-indexed directory...

Ted writes:
> A couple of comments. If you make the beginning of each index block
> look like a an empty directory block:
>
> 32 bits: ino == 0
> 16 bits: rec_len == blocksize
> 16 bits: name_len = 0

This is what I also suggested for second-level index blocks above.
However, for a single-level index, blocks 1-511 (1-127 on a 1k filesystem)
will be sparse, because they will be unused - we don't want to have 511
(or 127) real empty dir blocks just for compatibility on a single-level
index. The ext2 dir code handles the case of a sparse directory block
with an ext2_error() and continues. By default ext2_error() is just
a printk, and on the only system I have seen where it is otherwise
(Debian), it is remount-ro for root only.

> ... then you will have full backwards compatibility, both for reading
> *and* writing. When reading, old kernels will simply ignore the index
> blocks, since it looks like it has an unpopulated directory entry. And
> if the kernel attempts to write into the directory, it will clear the
> BTREE_FL flag, in which case new kernels won't treat the directory as a
> tree anymore.

Yes, I had something like this on the tip of my brain as well. When you
boot with a non-index ext2 kernel, it will naturally find free space in
the first block, immediately after "." and ".." (with the setup above).
Not only will it clear BTREE_FL, it will also overwrite the index magic
(if we have one) so we definitely know that the index is not valid.
Since the index head is only using 4 of the 8 bytes needed for alignment,
we could stick in a 4 byte magic before or after the index header, and
still be assured that it will be overwritten by a new dirent.

Full COMPAT support would be a win, IMHO. You could leave it to e2fsck
to do reindexing, or the next time a file is added (or even removed)
from a candidate directory it could do the reindexing, which it needs
to be able to do for compatibility with old filesystems.

Cheers, Andreas
--
Andreas Dilger \ "If a man ate a pound of pasta and a pound of antipasto,
\ would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/ -- Dogbert

2001-02-22 08:35:00

by Rogier Wolff

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

H. Peter Anvin wrote:
> Martin Mares wrote:
> >
> > Hello!
> >
> > > True. Note too, though, that on a filesystem (which we are, after all,
> > > talking about), if you assume a large linear space you have to create a
> > > file, which means you need to multiply the cost of all random-access
> > > operations with O(log n).
> >
> > One could avoid this, but it would mean designing the whole filesystem in a
> > completely different way -- merge all directories to a single gigantic
> > hash table and use (directory ID,file name) as a key,

Novell, NTFS, HFS all do this.

Roger.

--
** [email protected] ** http://www.BitWizard.nl/ ** +31-15-2137555 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
* There are old pilots, and there are bold pilots.
* There are also old, bald pilots.

2001-02-22 10:33:09

by Alan

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

> Daniel Phillips wrote:
> >
> > There will be a lot fewer metadata index
> > blocks in your directory file, for one thing.
> >
>
> Oh yes, another thing: a B-tree directory structure does not need
> metadata index blocks.

Before people get excited about complex tree directory indexes, remember to
solve the other 95% before implementation - recovering from lost blocks,
corruption and the like.

2001-02-22 11:32:05

by Ingo Oeser

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

Hi Linus,
Hi LKML people,

On Wed, Feb 21, 2001 at 09:19:45PM -0800, Linus Torvalds wrote:
> In article <[email protected]>,
> Linus Torvalds <[email protected]> wrote:
> >allocate blocks one at a time. Make the blocksize something nice and
> >big, not just 4kB or 8kB or something.
>
> Btw, this is also going to be a VM and performance issue some time in
> the future. There are already CPU's that would _love_ to have 64kB
> pages etc, and as such a filesystem that doesn't play with the old silly
> "everything is a block" rules would be much appreciated with the kind of
> people who have multi-gigabyte files and want to read in big chunks at a
> time.

For this we need a block remapper layer that can map any
blocksize n to any blocksize m with only the following constraints:

- n and m are powers of 2
- n is a multiple of m

Both should use the page cache (of size p) of course, so it
becomes 2 layers if n > p.

- translating a buffer of n into some pages
- translating a page into buffers of m (current buffercache)

We could limit the translation to 5 powers of 2 above and 5 powers of 2
below PAGE_CACHE_SIZE so that we can maintain a validity bitmap
(2^5 = 32 bits) for each layer if access is too expensive[1].

Some subsystems could certainly benefit from it.

- loop device (with all the crypto stuff)
- LVM
- FSes that support block sizes != PAGE_CACHE_SIZE
- Devices with blocksize != 512 (they no longer have to be
treated as special cases). There are even some rumors
about very perverse blocksizes of 1M and the like.

Since these remapped buffers will look like merged requests, I
see no problems even with the elevator any more.

The question is, where we implement this infrastructure, esp. if
we consider the last user (devices with blocksize != 512).

This has to be answered by the main architects of Linux before
anyone could start.

> So either you have a simple block-based filesystem (current ext2, no
> extents, no crapola), or you decide to do it over. Don't do some
> half-way thing, please.

Daniel (and others) uses ext2 as a playground, because it is
implemented, tested and not that hard to understand and verify.

Hopefully they will switch to a design of their own later, once they
have sufficiently played around with^W^W^Wtested their ideas.

Regards

Ingo Oeser

[1] In buffer cache we use read-modify-write for partial pages,
which hurts performance for them and is annoying for media
with limited write cycles like flash and CD-RW[2].

[2] Yes I know about packet writing mode ;-)
--
10.+11.03.2001 - 3. Chemnitzer LinuxTag <http://www.tu-chemnitz.de/linux/tag>
<<<<<<<<<<<< come and join the fun >>>>>>>>>>>>

2001-02-22 13:20:36

by Theodore Tso

[permalink] [raw]
Subject: Re: [Ext2-devel] [rfc] Near-constant time directory index for Ext2

From: Daniel Phillips <[email protected]>
Date: Thu, 22 Feb 2001 08:24:08 +0100
Content-Type: text/plain

> Is it worth it? Well, it means you lose an index entry from each
> directory block, thus reducing your fanout at each node of the tree by a
> worst case of 0.7% (1k blocksize) and 0.2% if you're using 4k
> blocksizes.

I'll leave that up to somebody else - we now have two alternatives, the
100%, no-compromise INCOMPAT solution, and the slightly-bruised but
still largely intact forward compatible solution. I'll maintain both
solutions in the code for now so it's just as easy to choose either in the end.

Well, the $64,000 question is exactly how much performance does it cost?
My guess is that it will be barely measurable, but only benchmarks will
answer that question.

- Ted

2001-02-22 16:34:29

by Chris Mason

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2



On Wednesday, February 21, 2001 07:30:47 PM -0800 Linus Torvalds
<[email protected]> wrote:
> On Thu, 22 Feb 2001, Daniel Phillips wrote:
>>
>
> I'd love to hear the results from R5, as that seems to be the reiserfs
> favourite, and I'm trying it out in 2.4.2 because it was so easy to plug
> in..

Quick details, since I don't think I've seen them on l-k yet. r5 was
chosen because it is more tuned to the reiserfs disk format. The location
of a directory item on disk is determined by the hash of the name, and r5
is designed to put similar names close to each other on disk.

The benchmark that shows this best is creating X number of files in a
single dir (named 0001, 0002, 0003 etc). r5 greatly increases the chances
the directory item for 00006 will be right next to the item for 00007. If
the application accesses these files in the same order they were created,
this has benefits at other times than just creation. The benchmarks Ed
posted give a general idea for other naming patterns, but this one is best
case:

Time to create 100,000 files (10 bytes each) with r5 hash: 48s
Time to create 100,000 files (10 bytes each) with tea: 3m58s

The percentage increase just gets bigger as you create more and more files.
That doesn't mean this is a real world case, but it is what the hash was
designed for.

-chris

2001-02-22 18:17:19

by Andreas Dilger

[permalink] [raw]
Subject: Re: [Ext2-devel] [rfc] Near-constant time directory index for Ext2

Daniel writes:
> All references to "." and ".." are now intercepted and never reach the
> filesystem level.

Ted writes:
> From: Daniel Phillips <[email protected]>
>
> I'll leave that up to somebody else - we now have two alternatives, the
> 100%, no-compromise INCOMPAT solution, and the slightly-bruised but
> still largely intact forward compatible solution. I'll maintain both
> solutions in the code for now so it's just as easy to choose either in the end.
>
> Well, the $64,000 question is exactly how much performance does it cost?
> My guess is that it will be barely measurable, but only benchmarks will
> answer that question.

One important question as to the disk format is whether the "." and ".."
interception by VFS is a new phenomenon in 2.4 or if this also happened
in 2.2? If so, then having these entries on disk will be important
for 2.2 compatibility, and you don't want to have different on-disk formats
between 2.2 and 2.4.

Cheers, Andreas
--
Andreas Dilger \ "If a man ate a pound of pasta and a pound of antipasto,
\ would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/ -- Dogbert

2001-02-22 18:21:39

by Linus Torvalds

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2



On Thu, 22 Feb 2001, Ingo Oeser wrote:
>
> On Wed, Feb 21, 2001 at 09:19:45PM -0800, Linus Torvalds wrote:
> > In article <[email protected]>,
> > Linus Torvalds <[email protected]> wrote:
> > >allocate blocks one at a time. Make the blocksize something nice and
> > >big, not just 4kB or 8kB or something.
> >
> > Btw, this is also going to be a VM and performance issue some time in
> > the future. There are already CPU's that would _love_ to have 64kB
> > pages etc, and as such a filesystem that doesn't play with the old silly
> > "everything is a block" rules would be much appreciated with the kind of
> > people who have multi-gigabyte files and want to read in big chunks at a
> > time.
>
> For this we need a block remapper layer that can map any
> blocksize n to any blocksize m with only the following constraints:

No, nothing like that at all..

What you can _trivially_ do is to basically act to the VFS and VM layer as
if you're a 1kB block filesystem (or something), and then when you get
called to do a "bmap()" (which only happens for kernel installing and
LILO, not under normal load), you just return the "offset" into a larger
block.

The VFS and MM layers do not care what the _real_ underlying blocksize of
the filesystem is. They will just do "readpage()" and "write()" calls, and
you can implement those any way you want to - never showing that you are
chunking out page-sized pieces from a bigger allocation block.

It's not all that hard. You just have to think a bit differently: don't
think of it as a block-based filesystem that has to have fixed blocks. The
VFS and MM layer don't care. They just want to access it.
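
As a toy illustration of the "offset into a larger block" idea (all
names and numbers here are made up for the example, not taken from any
real filesystem):

#include <stdio.h>

#define SMALL_PER_CHUNK 64      /* e.g. 64kB allocation chunks, 1kB fs blocks */

/* toy map: allocation chunk index -> starting small-block number on disk */
static const long chunk_start[4] = { 1000, 5064, 9128, 13192 };

/* bmap()-style answer in small-block units despite chunky allocation */
static long toy_bmap(long file_block)
{
        long chunk  = file_block / SMALL_PER_CHUNK;
        long offset = file_block % SMALL_PER_CHUNK;

        return chunk_start[chunk] + offset;
}

int main(void)
{
        printf("file block 70 maps to disk block %ld\n", toy_bmap(70));
        return 0;
}

readpage() and write() never need to expose the chunk size at all,
which is the point above.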

> Daniel (and others) uses ext2 as a playground, because it is
> implemented, tested and not that hard to understand and verify.

I realize that. But I get _very_ nervous when people talk about adding
stuff to ext2, because there are a lot of people who do not want to ever
even by mistake run code that is "new" on their filesystem.

Note that I had the same issue with ext3 - for the longest time, Stephen
Tweedie wanted to just extend ext2, and make it an invisible upgrade where
the filesystem would just magically become journalled when the user asked
for it. It _sounds_ appealing, but it doesn't take into account (a)
inevitable bugs and (b) the fact that Reiserfs actually got a head start
at least partly because it didn't need to worry about backwards
compatibility at all (there were other reasons too).

Basically, if there is one thing I've learnt over ten years of Linux, it's
that it is absolutely _evil_ to add more "linkages" or dependencies than
you absolutely have to. It is _much_ easier to create a new filesystem,
and slowly phase out old code that is no longer used. It's been done
several times (both with filesystems and with drivers), and every time
we've had a "new X, phase out old X" kind of situation it has been very
smooth.

In comparison, if you have "new features in X, which also handles the old
cases of X" situation, you not only bind yourself to backwards
compatibility, but you also cause yourself to be unable to ever phase out
the old code. Which means that eventually the whole system is a piece of
crap, full of old garbage that nobody needs to use, but that is part of
the new stuff that everybody _does_ use.

Linus

2001-02-22 20:30:57

by kaih

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

[email protected] (Martin Mares) wrote on 22.02.01 in <[email protected]>:

> One could avoid this, but it would mean designing the whole filesystem in a
> completely different way -- merge all directories to a single gigantic
> hash table and use (directory ID,file name) as a key, but we were originally
> talking about extending ext2, so such massive changes are out of question
> and your log n access argument is right.

s/hash table/btree/ and you have just described the Macintosh HFS file
system. (Incidentally, it stores file extent indices in a similar manner,
with key = (file id, fork, offset).)


MfG Kai

2001-02-22 20:30:46

by kaih

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

[email protected] (Daniel Phillips) wrote on 20.02.01 in <01022020011905.18944@gimli>:

> But the current hash function is just a place holder, waiting for
> an better version based on some solid theory. I currently favor the
> idea of using crc32 as the default hash function, but I welcome
> suggestions.

I once liked those things, too - but I've learned better since.

Quoting _Handbook_of_Algorithms_and_Data_Structures_ (Gonnet/Baeza-Yates,
ISBN 0-201-41607-7, Addison-Wesley):

--- snip ---

3.3.1 Practical hashing functions

[...]

A universal class of hashing functions is a class with the property that
given any input, the average performance of all the functions is good.
[...] For example, h(k) = (a * k + b) mod m with integers a != 0 and b is
a universal class of hash functions.
[...]
Keys which are strings or sequences of words (including those which are of
variable length) are best treated by considering them as a number base b.
Let the string s be composed of k characters s1s2...sk. Then

h(s) = ( sum(i=0..k-1) B^i*s(k-i) ) mod m

To obtain a more efficient version of this function we can compute

h(s) = ( ( sum(i=0..k-1) B^i*s(k-i) ) mod 2^w ) mod m

where w is the number of bits in a computer word, and the mod 2^w
operation is done by the hardware. For this function the value B = 131 is
recommended, as B^i has a maximum cycle mod 2^k for 8<=k<=64.

Hashing function for strings

int hashfunction(s)
char *s;
{
        int i;

        for (i = 0; *s; s++)
                i = 131*i + *s;
        return (i % m);
}

--- snip ---

I've actually used that function for a hash containing something like a
million phone numbers as keys, and there were *very* few collisions.
Similarly for another hash containing megabytes of RFC 822 message-ids.

MfG Kai

2001-02-22 22:32:05

by Daniel Phillips

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

Linus Torvalds wrote:
>
> On Thu, 22 Feb 2001, Daniel Phillips wrote:
> >
> > In the first heat of hash races - creating 20,000 files in one directory
> > - dentry::hash lost out to my original hack::dx_hash, causing a high
> > percentage of leaf blocks to remain exactly half full and slowing down
> > the whole thing by about 5%. (This was under uml - I haven't tried it
> > native yet but I expect the results to be similar.)
> >
> > Contender Result
> > ========= ======
> > dentry::hash Average fullness = 2352 (57%)
> > hack::dx_hash Average fullness = 2758 (67%)
> >
> > This suggests that dentry::hash is producing distinctly non-dispersed
> > results and needs to be subjected to further scrutiny. I'll run the
> > next heat of hash races tomorrow, probably with R5, and CRC32 too if I
> > have time.
>
> I'd love to hear the results from R5, as that seems to be the reiserfs
> favourite, and I'm trying it out in 2.4.2 because it was so easy to plug
> in..

In this round there were two new contenders:

- ReiserFS's R5
- Bob Jenkins' hash

Eirik Fuller pointed me to the latter, the subject of a very interesting
article in Dr. Dobbs, available online here:

http://burtleburtle.net/bob/hash/doobs.html

As before, the runs are for 20,000 creates and I report only the
fullness, because I'm still running these under UML. Suffice to say
that the total running time is roughly related to the average fullness
with a variance of about 15% from the best to the worst. Eventually I
will rerun the entire series of tests natively and provide more detailed
statistics. Here are the results from the second heat of hash races:

Contender          Result
=========          ======
dentry::hash       Average fullness = 2352 (57%)
daniel::hack_hash  Average fullness = 2758 (67%)
bob::hash          Average fullness = 2539 (61%)
reiserfs::r5       Average fullness = 2064 (50%)

Just looking at R5 I knew it wasn't going to do well in this application
because it's similar to a number of hash functions I tried with the same
idea in mind: to place similar names together in the same leaf block.
That turned out to be not very important compared to achieving a
relatively high fullness of leaf blocks. The problem with R5 when used
with my htree is, it doesn't give very uniform dispersal. But according
to Chris Mason (see his post) it does work very well for ReiserFS. This
provides a little more evidence that my htree scheme is quite
different from other approaches.

u32 r5_hash (const char *msg, int len)
{
        u32 a = 0;
        while (*msg) {
                a += *msg << 4;
                a += *msg >> 4;
                a *= 11;
                msg++;
        }
        return a;
}

I expected more from bob::hash since it's very carefully thought out
in terms of dispersal and avoidance of 'funnelling' (the property that
determines the probability of a collision), but it still fell short of
hack_hash's performance. Oh well. Tomorrow I'll try CRC32.

The bottom line: dx_hack_hash is still the reigning champion. OK, come
out and take a bow:

unsigned dx_hack_hash (const char *name, int len)
{
        u32 hash0 = 0x12a3fe2d, hash1 = 0x37abe8f9;
        while (len--)
        {
                u32 hash = hash1 + (hash0 ^ (*name++ * 71523));
                if (hash < 0) hash -= 0x7fffffff;
                hash1 = hash0;
                hash0 = hash;
        }
        return hash0;
}

--
Daniel

2001-02-22 23:05:29

by Daniel Phillips

[permalink] [raw]
Subject: Re: [Ext2-devel] [rfc] Near-constant time directory index for Ext2

Andreas Dilger wrote:
> Daniel writes:
> > All references to "." and ".." are now intercepted and never reach the
> > filesystem level.
>
> Ted writes:
> > From: Daniel Phillips <[email protected]>
> >
> > I'll leave that up to somebody else - we now have two alternatives, the
> > 100%, no-compromise INCOMPAT solution, and the slightly-bruised but
> > still largely intact forward compatible solution. I'll maintain both
> > solutions in the code for now so it's just as easy to choose either in the end.
> >
> > Well, the $64,000 question is exactly how much performance does it cost?
> > My guess is that it will be barely measurable, but only benchmarks will
> > answer that question.
>
> One important question as to the disk format is whether the "." and ".."
> interception by VFS is a new phenomenon in 2.4 or if this also happened
> in 2.2? If so, then having these entries on disk will be important
> for 2.2 compatibility, and you don't want to have different on-disk formats
> between 2.2 and 2.4.

The answer is 'yes', it's been in since at least the beginning of 2.2:


http://innominate.org/cgi-bin/lksr/linux/fs/namei.c?rev=1.1&content-type=text/x-cvsweb-markup&cvsroot=v2.2

Search for '.'.

By the way, our whole linux cvsweb tree is here:

http://lksr.org/

with all versions of linux back to linux-0.97.pl5, with a makefile that
starts out with:

#
# Makefile for linux.
# If you don't have '-mstring-insns' in your gcc (and nobody but me has :-)
# remove them from the CFLAGS defines.
#

Getting back on topic, this makes the idea of getting rid of the actual
on-disk "." and ".." entries a little less scary, though I am keeping in
mind the fact that having those entries on disk could in some extreme
circumstance help fsck recover a corrupted directory tree a little
better and more automatically.

I resolve not to take a position on this subject, and I will carry
forward both a 'squeaky clean' backward-compatible version that sets an
INCOMPAT flag, and a 'slightly tarnished' but very clever version that
is both forward and backward-compatible, along the lines suggested by
Ted. Both flavors have the desirable property that old versions of
fsck with no knowledge of the new index structure can remove the indices
automatically, with fsck -y.

--
Daniel

2001-02-22 23:40:53

by Theodore Tso

[permalink] [raw]
Subject: Re: [Ext2-devel] [rfc] Near-constant time directory index for Ext2

From: Andreas Dilger <[email protected]>
Date: Thu, 22 Feb 2001 11:16:32 -0700 (MST)

One important question as to the disk format is whether the "." and ".."
interception by VFS is a new phenomenon in 2.4 or if this also happened
in 2.2? If so, then having these entries on disk will be important
for 2.2 compatibility, and you don't want to have different on-disk formats
between 2.2 and 2.4.

Well, you need to have the '.' and '..' there if you want full
backwards compatibility. That's clear.

If you don't care about backwards compatibility, it's important that
there be a way to find the parent directory, but there doesn't have to
be explicit '.' and '..' entries.

So if Daniel is going to try implementing it both ways then that's one
place where the #ifdef's might get a bit more complicated. After it's
done, we should do some benchmarks comparing it both ways; if the
difference is negligible, I'd argue for simply always providing
backwards compatibility. One of the key advantages of ext2/ext3 is its
backwards compatibility, and so if it's not too costly to preserve it
(as I suspect will be the case), we should try to do so.

- Ted

2001-02-23 00:58:52

by Felix von Leitner

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

Thus spake Alan Cox ([email protected]):
> > > There will be a lot fewer metadata index
> > > blocks in your directory file, for one thing.
> > Oh yes, another thing: a B-tree directory structure does not need
> > metadata index blocks.
> Before people get excited about complex tree directory indexes, remember to
> solve the other 95% before implementation - recovering from lost blocks,
> corruption and the like

And don't forget the trouble with NFS handles after the tree was rebalanced.

Trees are nice only theoretically. In practice, the benefits are
outweighed by the nastiness in form of fsck and NFS and bigger code
(normally: more complex -> less reliable).

Felix

2001-02-23 01:53:21

by Andries E. Brouwer

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

>> idea of using crc32 as the default hash function

> I once liked those things, too - but I've learned better since.

> A universal class of hashing functions is a class with the property that
> given any input, the average performance of all the functions is good.
> For example, h(k) = (a * k + b) mod m with integers a != 0 and b is
> a universal class of hash functions.

Here a != 0 should be (a,m) = 1.

> for(i=0; *s; s++) i = 131*i + *s;

Yes, that is a good function.

(Also because 131 has only 3 bits, so that there is a cheap shift and add
implementation.)

I did some random tests, on the one hand on a collection of 557398 file
names (last part of path names) in a file system here.
On the other hand on an artificially generated sequence of file names
with increasing number tails: foo0001, foo0002, ...

On the first collection the choice of multiplier didnt really matter
provided that it was odd and not too close to a power of two.
The smallest number with good behaviour was 11, the winner was 51.

(51 has 4 bits, but is not more expensive because they are evenly spaced:
/* 51x = 17*3*x */
x += (x << 1);
x += (x << 4);
)

On the second collection there were large differences between multipliers.
The clear winner was 11.

Some numbers:

Hash 557398 actual names, using
hash(unsigned char *s) {
        unsigned int h = 0;

        while (*s)
                h = m*h + *s++;
        return h % sz;
}
for various values of m and powers of two sz (so that the % is an AND).
Report min, max, average length of hash chain, and standard deviation.
Of course min and max should be close to average and stddev should be small.

m= 11 sz=2048, min 221, max 324, av 272.17, stddev 254.12
m= 13 sz=2048, min 219, max 322, av 272.17, stddev 259.96
m= 51 sz=2048, min 218, max 325, av 272.17, stddev 265.52
m=131 sz=2048, min 222, max 344, av 272.17, stddev 264.20

m= 11 sz=4096, min 96, max 176, av 136.08, stddev 132.58
m= 13 sz=4096, min 101, max 177, av 136.08, stddev 128.71
m= 51 sz=4096, min 92, max 174, av 136.08, stddev 130.89
m=131 sz=4096, min 85, max 180, av 136.08, stddev 131.99

m= 11 sz=8192, min 38, max 102, av 68.04, stddev 68.26
m= 13 sz=8192, min 42, max 100, av 68.04, stddev 66.30
m= 51 sz=8192, min 41, max 97, av 68.04, stddev 64.98
m=131 sz=8192, min 39, max 102, av 68.04, stddev 66.19

m= 11 sz=16384, min 14, max 57, av 34.02, stddev 33.96
m= 13 sz=16384, min 14, max 58, av 34.02, stddev 33.51
m= 51 sz=16384, min 15, max 60, av 34.02, stddev 32.29
m=131 sz=16384, min 16, max 59, av 34.02, stddev 33.94

m= 11 sz=32768, min 3, max 37, av 17.01, stddev 17.50
m= 13 sz=32768, min 3, max 34, av 17.01, stddev 16.84
m= 51 sz=32768, min 4, max 41, av 17.01, stddev 16.46
m=131 sz=32768, min 3, max 40, av 17.01, stddev 16.90

m= 11 sz=65536, min 0, max 24, av 8.51, stddev 8.70
m= 13 sz=65536, min 0, max 23, av 8.51, stddev 8.56
m= 51 sz=65536, min 0, max 24, av 8.51, stddev 8.31
m=131 sz=65536, min 0, max 23, av 8.51, stddev 8.51

m= 11 sz=131072, min 0, max 17, av 4.25, stddev 4.39
m= 13 sz=131072, min 0, max 16, av 4.25, stddev 4.32
m= 51 sz=131072, min 0, max 16, av 4.25, stddev 4.22
m=131 sz=131072, min 0, max 16, av 4.25, stddev 4.24

m= 11 sz=262144, min 0, max 12, av 2.13, stddev 2.20
m= 13 sz=262144, min 0, max 12, av 2.13, stddev 2.18
m= 51 sz=262144, min 0, max 12, av 2.13, stddev 2.12
m=131 sz=262144, min 0, max 12, av 2.13, stddev 2.12

On the second, nonrandom, collection there are more variations:

m= 11 sz=8192, min 61, max 76, av 68.04, stddev 4.41
m= 13 sz=8192, min 55, max 83, av 68.04, stddev 18.64
m= 51 sz=8192, min 58, max 79, av 68.04, stddev 12.47
m=131 sz=8192, min 52, max 83, av 68.04, stddev 29.05

m= 11 sz=16384, min 26, max 41, av 34.02, stddev 3.61
m= 13 sz=16384, min 24, max 45, av 34.02, stddev 8.76
m= 51 sz=16384, min 25, max 44, av 34.02, stddev 6.32
m=131 sz=16384, min 23, max 47, av 34.02, stddev 14.00

m= 11 sz=32768, min 10, max 23, av 17.01, stddev 4.36
m= 13 sz=32768, min 7, max 28, av 17.01, stddev 8.66
m= 51 sz=32768, min 10, max 25, av 17.01, stddev 4.04
m=131 sz=32768, min 6, max 27, av 17.01, stddev 8.66
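
For reference, a harness of roughly this shape can reproduce such
tables (this is a reconstruction, not the actual test program; it
reads one name per line on stdin and prints both the variance and its
square root, so either can be compared against the columns above):

#include <math.h>
#include <stdio.h>
#include <string.h>

#define SZ   (1 << 13)          /* 8192 buckets, so "mod sz" is just an AND */
#define MULT 11                 /* the multiplier under test */

static unsigned long counts[SZ];

static unsigned int hash(const unsigned char *s)
{
        unsigned int h = 0;

        while (*s)
                h = MULT * h + *s++;
        return h & (SZ - 1);
}

int main(void)
{
        char name[1024];
        unsigned long n = 0, min = ~0UL, max = 0;
        double mean, var = 0.0;
        int i;

        while (fgets(name, sizeof(name), stdin)) {
                name[strcspn(name, "\n")] = '\0';
                counts[hash((const unsigned char *) name)]++;
                n++;
        }
        mean = (double) n / SZ;
        for (i = 0; i < SZ; i++) {
                if (counts[i] < min) min = counts[i];
                if (counts[i] > max) max = counts[i];
                var += (counts[i] - mean) * (counts[i] - mean);
        }
        var /= SZ;
        printf("m=%d sz=%d, min %lu, max %lu, av %.2f, var %.2f, sqrt %.2f\n",
               MULT, SZ, min, max, mean, var, sqrt(var));
        return 0;
}

Fed the same name lists while varying MULT and SZ, this reproduces the
shape of the comparison, though the exact figures depend on the input.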

Andries

2001-02-23 02:49:27

by Andries E. Brouwer

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2


Just looking at R5 I knew it wasn't going to do well in this application
because it's similar to a number of hash functions I tried with the same
idea in mind: to place similar names together in the same leaf block.
That turned out to be not very important compared to achieving a
relatively high fullness of leaf blocks. The problem with R5 when used
with my htree is, it doesn't give very uniform dispersal.

The bottom line: dx_hack_hash is still the reigning champion.

Now that you provide source for r5 and dx_hack_hash, let me feed my
collections to them.
r5: catastrophic
dx_hack_hash: not bad, but the linear hash is better.

E.g.:
Actual file names:

Linear hash, m=11, sz=2048, min 262, max 283, av 272.17, stddev 12.25
dx_hack_hash: sz=2048, min 220, max 330, av 272.17, stddev 280.43
r5: sz=2048, min 205, max 382, av 272.17, stddev 805.18

Linear hash, m=11, sz=65536, min 0, max 24, av 8.51, stddev 8.70
dx_hack_hash: sz=65536, min 0, max 23, av 8.51, stddev 8.51
r5: sz=65536, min 0, max 26, av 8.51, stddev 8.89

Generated consecutive names:

Linear hash, m=11, sz=2048, min 262, max 283, av 272.17, stddev 12.25
dx_hack_hash: sz=2048, min 191, max 346, av 272.17, stddev 636.11
r5: sz=2048, min 0, max 3587, av 272.17, stddev 755222.91

Linear hash, m=11, sz=65536, min 2, max 14, av 8.51, stddev 2.79
dx_hack_hash: sz=65536, min 0, max 26, av 8.51, stddev 12.24
r5: sz=65536, min 0, max 120, av 8.51, stddev 738.08

Andries

2001-02-23 03:43:44

by Daniel Phillips

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

[email protected] wrote:
>
> Just looking at R5 I knew it wasn't going to do well in this application
> because it's similar to a number of hash functions I tried with the same
> idea in mind: to place similar names together in the same leaf block.
> That turned out to be not very important compared to achieving a
> relatively high fullness of leaf blocks. The problem with R5 when used
> with my htree is, it doesn't give very uniform dispersal.
>
> The bottom line: dx_hack_hash is still the reigning champion.
>
> Now that you provide source for r5 and dx_hack_hash, let me feed my
> collections to them.
> r5: catastrophic
> dx_hack_hash: not bad, but the linear hash is better.

I never expected dx_hack_hash to be particularly good at anything, but
we might as well test the version without the mistake in it - I was
previously using < 0 to test the sign bit - on an unsigned variable :-/

unsigned dx_hack_hash (const char *name, int len)
{
        u32 hash0 = 0x12a3fe2d, hash1 = 0x37abe8f9;
        while (len--)
        {
                u32 hash = hash1 + (hash0 ^ (*name++ * 71523));
                if (hash & 0x80000000) hash -= 0x7fffffff;
                hash1 = hash0;
                hash0 = hash;
        }
        return hash0;
}


The correction gained me another 1% in the leaf block fullness measure.
I will try your hash with the htree index code tomorrow.

--
Daniel

2001-02-23 12:22:54

by Jonathan Morton

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

>Now that you provide source for r5 and dx_hack_hash, let me feed my
>collections to them.
>r5: catastrophic
>dx_hack_hash: not bad, but the linear hash is better.

<snip verbose results>

So, not only does the linear hash normally provide a shorter worst-case
chain, its results are actually more consistent than the other two. Looks
like simple is good here, but is it still possible to produce
"pathological" sets for the linear hash to see how badly it falls down?
I'm no mathematician, so I'll leave that to the gurus...

Meanwhile, let's go back to Linus' comment on compatibility and so on. He
has a *very* good point, which I'll expand on slightly here:

Suppose some stone-age Linux user, running 2.0.35 or something equally old
(which runs ext2), decides to finally bite the bullet and upgrade to the
all-new 2.6.1 (OK, this is some time in the future). 2.6.1 implements some
"enhanced" version of ext2 which makes some incompatible modifications to
the directory structure. However, since the process of upgrading through
such a massive range of kernels also involves upgrading most other software
to boot, this user forgot one or two pieces, and reboots to 2.0.35 to
regain a sufficiently working system that he can build the updated software
- or not, because 2.0.35's old ext2 code suddenly can't read the
filesystem, which was modified by 2.6.1 before the boot process stalled.
e2fsck is no help here either, because he now has an unbootable system with
old software that doesn't understand the new stuff.

I hope people understand this as well as I do - if a filesystem upgrade is
desirable, let the user perform some *specific* action to upgrade it, when
he has an otherwise-working setup *and* sufficient backups. I for one do
not want to be caught out like the hypothetical user I mentioned above.

OTOH, I have my own opinions on the direction of ext2:

- Currently, it's a stable and universally-utilised filesystem which offers
very good performance for most applications. I'm specifically drawing
attention to certain benchmarks which place ext2's disk-based performance
ahead of many commercial UNIX' ram-based filesystem performance.

- There are specific problems with performance when reading and/or
modifying large directories. I haven't tested for this personally, but I
have noticed slowness when using 'rm -rf' on a large *tree* of directories.
The problem appeared to be one of disk access, but may be a side-effect of
poor storage distribution (I haven't examined the ext2 code). Related to
this, rebuilding the slocate database on all my systems appears to be
disk-bound rather than CPU-bound, and takes too long for my liking.

One of the current suggestions, if I've interpreted it correctly, is to
introduce an extension to ext2 which essentially makes a "fast index" of a
large directory, attaches it to the directory in some backwards-compatible
manner, and uses it *in conjunction with* the regular directory structure.
This is probably a good idea, but it needs some thought:

- How much overhead does the "fast index" introduce for modification of the
directory? Large directories are the most likely to have stuff added and
deleted, and it doesn't help if during an "rm *" operation the saving on
the search is negated by the overhead on the unlink.

- If the index gets out of sync with the directory, how is this detected
and recovered from? Assuming the index merely points to the correct
position in the regular directory, some simple sanity checks will suffice
for most cases (is this entry in the directory the same as I asked for?),
and if the entry is not in the index then a standard search of the real
directory can be done. In either case, the index can be marked as invalid
(and removed?) and rebuilt whenever necessary.

- At what threshold of directory size does the index come into play?
(fairly obviously, the index is useless for tiny directories)

- What happens when an old kernel encounters a directory which has an index
attached to it? Does it look like a virtual file, which has no special
properties but whose name is reserved for system use? (cf. lost+found) Or
is it some unidentifiable bits in the directory structure and a few lost
clusters on the disk? If the latter, it will have to be done in a way that
older versions of e2fsck can clean it up easily and older versions of ext2
won't throw up on it, which could be kinda hard. If the former, an
'unused' name will have to be found to avoid conflicts but the big
advantage is *no* inconsistency with old systems.

Answers on a postcard...

--------------------------------------------------------------
from: Jonathan "Chromatix" Morton
mail: [email protected] (not for attachments)
big-mail: [email protected]
uni-mail: [email protected]

The key to knowledge is not to rely on people to teach you it.

Get VNC Server for Macintosh from http://www.chromatix.uklinux.net/vnc/

-----BEGIN GEEK CODE BLOCK-----
Version 3.12
GCS$/E/S dpu(!) s:- a20 C+++ UL++ P L+++ E W+ N- o? K? w--- O-- M++$ V? PS
PE- Y+ PGP++ t- 5- X- R !tv b++ DI+++ D G e+ h+ r- y+
-----END GEEK CODE BLOCK-----


2001-02-23 12:39:12

by Andries E. Brouwer

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

From [email protected] Fri Feb 23 04:43:23 2001

> Now that you provide source for r5 and dx_hack_hash, let me feed my
> collections to them.
> r5: catastrophic
> dx_hack_hash: not bad, but the linear hash is better.

I never expected dx_hack_hash to be particularly good at anything, but
we might as well test the version without the mistake in it - I was
previously using < 0 to test the sign bit - on an unsigned variable :-/

unsigned dx_hack_hash (const char *name, int len)
{
        u32 hash0 = 0x12a3fe2d, hash1 = 0x37abe8f9;
        while (len--)
        {
                u32 hash = hash1 + (hash0 ^ (*name++ * 71523));
                if (hash & 0x80000000) hash -= 0x7fffffff;
                hash1 = hash0;
                hash0 = hash;
        }
        return hash0;
}


The correction gained me another 1% in the leaf block fullness measure.
I will try your hash with the htree index code tomorrow.

Basically I find the same results as before.

Actual names (N=557398)
dx_hack+if dx_hack-if best
Size min max av stddev min max av stddev
2048 217 326 272.17 273.45 220 330 272.17 280.43 +
4096 97 191 136.08 138.35 100 182 136.08 138.29 -
8192 40 105 68.04 68.57 36 102 68.04 68.06 -
16384 14 59 34.02 34.36 14 59 34.02 34.08 -
32768 3 37 17.01 17.24 4 36 17.01 17.09 -
65536 0 24 8.51 8.55 0 23 8.51 8.51 -
131072 0 18 4.25 4.24 0 16 4.25 4.26 +
262144 0 13 2.13 2.12 0 11 2.13 2.13 -

Generated names
2048 195 347 272.17 509.38 191 346 272.17 636.11 +
4096 71 218 136.08 645.73 56 224 136.08 995.79 +
8192 23 125 68.04 202.16 23 135 68.04 288.99 +
16384 10 69 34.02 67.47 8 72 34.02 89.29 +
32768 1 42 17.01 26.32 1 43 17.01 31.39 +
65536 0 28 8.51 10.92 0 26 8.51 12.24 +
131072 0 17 4.25 4.93 0 18 4.25 5.28 +
262144 0 12 2.13 2.32 0 13 2.13 2.40 +

In other words, the "broken" version wins on actual names, and the "correct"
version on generated names with a number tail. The differences are small.
(And of course the broken version is faster.)
As a comparison:

Actual names (N=557398)
linear hash (m=11) linear hash (m=51) best of 4
Size min max av stddev min max av stddev
2048 221 324 272.17 254.02 218 325 272.17 265.46 lin-11
4096 96 176 136.08 132.53 92 174 136.08 130.94 lin-51
8192 38 102 68.04 68.26 41 97 68.04 64.98 lin-51
16384 14 57 34.02 33.97 15 60 34.02 32.29 lin-51
32768 3 37 17.01 17.50 4 41 17.01 16.46 lin-51
65536 0 24 8.51 8.70 0 24 8.51 8.31 lin-51
131072 0 17 4.25 4.39 0 16 4.25 4.22 lin-51
262144 0 12 2.13 2.20 0 12 2.13 2.12 lin-51

Generated names
2048 262 283 272.17 12.25 244 298 272.17 136.72 lin-11
4096 128 146 136.08 9.39 119 151 136.08 39.73 lin-11
8192 61 76 68.04 4.41 58 79 68.04 12.47 lin-11
16384 26 41 34.02 3.61 25 44 34.02 6.32 lin-11
32768 10 23 17.01 4.36 10 25 17.01 4.04 lin-51
65536 2 14 8.51 2.79 3 16 8.51 4.54 lin-11
131072 0 8 4.25 1.85 0 10 4.25 1.88 lin-11
262144 0 5 2.13 1.10 0 6 2.13 1.22 lin-11

So both linear hash versions are far superior (if the criterion is uniform
distribution) in the case of generated names, and lin-51 also beats dx_hack
in the case of actual names. Of course it also wins in speed.
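
For anyone who wants to repeat this kind of measurement, here is a minimal
sketch of a test harness. It is not the program that produced the tables
above; the I/O conventions and the reported statistics are assumptions
(names one per line on stdin, reduction modulo the bucket count, and
min/max/mean/variance of the bucket counts).

/* Minimal bucket-occupancy harness (a sketch, not the program behind the
 * tables above).  Reads names one per line from stdin, hashes each into
 * one of sz buckets, prints min/max/mean/variance of the bucket counts. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef unsigned int u32;

static unsigned dx_hack_hash(const char *name, int len)
{
        u32 hash0 = 0x12a3fe2d, hash1 = 0x37abe8f9;
        while (len--) {
                u32 hash = hash1 + (hash0 ^ (*name++ * 71523));
                if (hash & 0x80000000) hash -= 0x7fffffff;
                hash1 = hash0;
                hash0 = hash;
        }
        return hash0;
}

int main(int argc, char **argv)
{
        int sz = argc > 1 ? atoi(argv[1]) : 2048;   /* bucket count */
        unsigned long *count = calloc(sz, sizeof(*count));
        char name[1024];
        unsigned long n = 0;

        if (!count)
                return 1;
        while (fgets(name, sizeof(name), stdin)) {
                int len = strcspn(name, "\n");      /* strip the newline */
                count[dx_hack_hash(name, len) % sz]++;
                n++;
        }

        unsigned long min = count[0], max = count[0];
        double mean = (double)n / sz, var = 0.0;
        for (int i = 0; i < sz; i++) {
                if (count[i] < min) min = count[i];
                if (count[i] > max) max = count[i];
                var += (count[i] - mean) * (count[i] - mean);
        }
        printf("N=%lu sz=%d min %lu max %lu av %.2f var %.2f\n",
               n, sz, min, max, mean, var / sz);
        free(count);
        return 0;
}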

Andries


2001-02-23 18:58:53

by Andreas Dilger

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

Jonathan Morton writes:
> Meanwhile, let's go back to Linus' comment on compatibility and so on. He
> has a *very* good point, which I'll expand on slightly here:
>
> Suppose some stone-age Linux user, running 2.0.35 or something equally old
> (which runs ext2), decides to finally bite the bullet and upgrade...
> ... 2.0.35's old ext2 code suddenly can't read the filesystem, which was
> modified by 2.6.1 before the boot process stalled.

One of the proposals on the table for the indexed directories is read AND
WRITE compatible with older kernels. For 1.0 through 2.4 kernels the read
compatibility is very safe (it will generate some errors on reads because
of "sparse" directory blocks, but the code will continue on correctly;
<= 1.2 kernels won't even generate an error). For < 2.2 kernels, deleting a
file from an indexed directory will work but will leave the index out of
date. For < 2.2 kernels, adding a new file to an indexed directory will
always overwrite the index, so it is also safe.

> I hope people understand this as well as I do - if a filesystem upgrade is
> desirable, let the user perform some *specific* action to upgrade it, when
> he has an otherwise-working setup *and* sufficient backups. I for one do
> not want to be caught out like the hypothetical user I mentioned above.

I am on the side of maintaining compatibility. There _may_ be a _small_
performance (more like capacity) impact on indexed directories for this
compatibility, but it is well worth it, IMHO.

> - Currently, it's a stable and universally-utilised filesystem which offers
> very good performance for most applications. I'm specifically drawing
> attention to certain benchmarks which place ext2's disk-based performance
> ahead of many commercial UNIX' ram-based filesystem performance.

Totally agree.

> - There are specific problems with performance when reading and/or
> modifying large directories. I haven't tested for this personally, but I
> have noticed slowness when using 'rm -rf' on a large *tree* of directories.

That is what the index change will address. Actually, "rm -r" may not
be sped up very much, but "rm *" definitely would be ("rm -r" deletes
files in directory order, but "rm *" deletes each file individually in
alphabetical order).

> One of the current suggestions, if I've interpreted it correctly, is to
> introduce an extension to ext2 which essentially makes a "fast index" of a
> large directory, attaches it to the directory in some backwards-compatible
> manner, and uses it *in conjunction with* the regular directory structure.

Yes, this is essentially true. The on-disk directory entries are exactly
the same. The index itself (in the compatible layout) appears to simply
be empty directory blocks (at the cost of 8 bytes = 1 index entry per block).

> - How much overhead does the "fast index" introduce for modification of the
> directory? Large directories are the most likely to have stuff added and
> deleted, and it doesn't help if during an "rm *" operation the saving on
> the search is negated by the overhead on the unlink.

The index will improve the performance for file add, stat, and delete. For
all of these operations you need to find a directory entry (add needs to
check if a file of the same name already exists before a new file is added).

> - If the index gets out of sync with the directory, how is this detected
> and recovered from? Assuming the index merely points to the correct
> position in the regular directory, some simple sanity checks will suffice
> for most cases (is this entry in the directory the same as I asked for?),
> and if the entry is not in the index then a standard search of the real
> directory can be done. In either case, the index can be marked as invalid
> (and removed?) and rebuilt whenever necessary.

On an index-aware kernel, the index can obviously not get out of sync.
All 2.2+ kernels that don't understand indexing will clear the "has
index" flag if they modify that directory, and the index will disappear.
Since the index is "hidden" (in my proposal at least) inside a totally
normal directory block, it will simply be overwritten by new entries.
As I mentioned above, 1.x and 2.0 kernels will overwrite the index on
an add, but not on a delete, and will not clear the "has index" flag.
This means we need some extra magic at the start of the index to ensure
we have a valid index header.
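
As a sketch of what such a disguised index block might look like - one
possible layout consistent with the description above, not the actual
proposed on-disk format; the struct and field names and the magic value
are illustrative only:

#include <stdint.h>

struct fake_dirent {               /* looks like an ordinary ext2 dirent   */
        uint32_t inode;            /* 0 means "unused entry"               */
        uint16_t rec_len;          /* set to the block size, so an old     */
                                   /* kernel sees one big empty entry      */
        uint8_t  name_len;         /* 0                                    */
        uint8_t  file_type;        /* 0                                    */
};                                 /* 8 bytes - the one lost index entry   */

struct dx_entry {                  /* the 8-byte index entry from the RFC  */
        uint32_t hash;             /* 31-bit hash key                      */
        uint32_t block;            /* logical block within the directory   */
};

struct dx_index_block {
        struct fake_dirent fake;   /* hides the index from old kernels     */
        uint16_t magic;            /* hypothetical value, e.g. 0xd1d1:     */
                                   /* lets an index-aware kernel detect an */
                                   /* index clobbered by an old kernel     */
        uint16_t count;            /* index entries in use                 */
        struct dx_entry entries[]; /* fill the rest of the block           */
};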

> - At what threshold of directory size does the index come into play?
> (fairly obviously, the index is useless for tiny directories)

This remains to be seen. Definitely not for directories 1 block in size
(which is 85%? of all directories). It looks like directories with about
250-300 files or more are needed for indexing to be useful. The good
news is that since indexing is optional, it can be tuned so that it is
only used where it improves directory performance.

> - What happens when an old kernel encounters a directory which has an index
> attached to it? Does it look like a virtual file, which has no special
> properties but whose name is reserved for system use? (cf. lost+found) Or
> is it some unidentifiable bits in the directory structure and a few lost
> clusters on the disk? If the latter, it will have to be done in a way that
> older versions of e2fsck can clean it up easily and older versions of ext2
> won't throw up on it, which could be kinda hard. If the former, an
> 'unused' name will have to be found to avoid conflicts but the big
> advantage is *no* inconsistency with old systems.

This has also been discussed already for the indexing code. The index
is actually stored inside the directory, so no hidden files or anything,
and no "lost blocks" either. The original indexing code wasn't 100%
compatible with the older layout, but a proposal has been made to make
it totally compatible with older kernels (excluding some error messages
that all ext2 directory code handles properly).

If (for whatever reason) you started using an indexed ext2 kernel,
created a huge directory of files, and then never used it again, the
only thing wasted would be your time (when you try to delete the files
in the huge directory without the directory index ;-). The ext2 code
has NEVER released directory blocks, even if all of the files are gone.
All blocks used by the index also appear (in the proposed layout) as
regular directory blocks, so a non-index kernel can use them at any time
(of course the index is destroyed in this case).

As an aside, lost+found is NOT a "special" name in any way to the
filesystem. The only thing that it is used for is e2fsck - to the kernel
it is just a regular directory.

Basically, I think the compatibility is excellent for the functionality
delivered. Another reason why I think ext2 will continue to have a long
and prosperous life, without making sacrifices for performance.

Cheers, Andreas
--
Andreas Dilger \ "If a man ate a pound of pasta and a pound of antipasto,
\ would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/ -- Dogbert

2001-02-23 20:41:45

by Theodore Tso

[permalink] [raw]
Subject: Re: [Ext2-devel] [rfc] Near-constant time directory index for Ext2

From: Daniel Phillips <[email protected]>
Date: Fri, 23 Feb 2001 00:04:02 +0100

> I resolve not to take a position on this subject, and I will carry
> forward both a 'squeaky clean' backward-compatible version that sets an
> INCOMPAT flag, and a 'slightly tarnished' but very clever version that
> is both forward and backward-compatible, along the lines suggested by
> Ted. Both flavors have the desirable property that old versions of
> fsck with no knowledge of the new index structure can remove the indices
> automatically, with fsck -y.

Note that in the long run, the fully compatible version should probably
have a COMPAT feature flag set so that you're forced to use a new enough
version of e2fsck. Otherwise an old e2fsck may end up not noticing
corruptions in an index block which might cause a new kernel to have
serious heartburn.

- Ted

2001-02-23 21:45:55

by Ralph Loader

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

Hi,

A while ago I did some experimentation with simple bit-op based string
hash functions, i.e. no multiplications / divides in the hash loop.

The best I found was:

int hash_fn (char * p)
{
        int hash = 0;
        while (*p) {
                hash = hash + *p++;
                // Rotate a 31 bit field 7 bits:
                hash = ((hash << 7) | (hash >> 24)) & 0x7fffffff;
        }
        return hash;
}

[I haven't kept my test program / data set - if anyone compares the above
to the other functions mentioned on the list, let me know.]

The 31 and 7 were determined experimentally. But the 31 has a theoretical
explanation (which might even be valid):

The rotate is equivalent to a multiplication by x**7 in Z_2[P=0],
where P is the polynomial x**31 - 1 (over Z_2).
Presumably the "best" P would be irreducible - but that would have more
bits set in the polynomial, making reduction harder. A compromise is to
choose P in the form x**N - 1 but with relatively few factors.
X**31 - 1 is such a P.

Also, a 32 bit rotate (modulo X**32 - 1, which is equal
to (X - 1) ** 32 over Z_2), came out pretty badly.
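
As a quick sanity check on that equivalence (a throwaway sketch, not part
of the original posting): treat a 31-bit value as a polynomial over Z_2
and spot-check that the rotate-left-by-7 matches multiplication by x**7
reduced modulo x**31 - 1.

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* 31-bit rotate left by 7, as in the hash above. */
static uint32_t rotl31_7(uint32_t v)
{
        return ((v << 7) | (v >> 24)) & 0x7fffffff;
}

/* Multiply the Z_2 polynomial v (degree < 31) by x**7 and reduce modulo
 * x**31 - 1: a bit shifted past position 30 wraps around to the bottom,
 * because x**31 == 1 there. */
static uint32_t mul_x7_mod(uint32_t v)
{
        uint32_t r = 0;
        for (int i = 0; i < 31; i++)
                if (v & (1u << i))
                        r ^= 1u << ((i + 7) % 31);
        return r;
}

int main(void)
{
        for (uint32_t v = 0; v < 1000000; v++)  /* spot check */
                assert(rotl31_7(v) == mul_x7_mod(v));
        printf("rotate-by-7 matches multiplication by x**7 mod x**31 - 1\n");
        return 0;
}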

One thing that shouldn't be forgotten about hashing for hash tables
is that you have to reduce the hash value to the required range - doing
that well greatly reduces the difference between various hash functions.

Ralph.


2001-02-23 22:37:43

by Guest section DW

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

On Sat, Feb 24, 2001 at 10:43:16AM +1300, Ralph Loader wrote:

> A while ago I did some experimentation with simple bit-op based string
> hash functions. I.e., no multiplications / divides in the hash loop.
>
> The best I found was:
>
> int hash_fn (char * p)
> {
>         int hash = 0;
>         while (*p) {
>                 hash = hash + *p++;
>                 // Rotate a 31 bit field 7 bits:
>                 hash = ((hash << 7) | (hash >> 24)) & 0x7fffffff;
>         }
>         return hash;
> }

Hmm. This one goes in the "catastrophic" category.

For actual names:

N=557398, m=51 sz=2048, min 82, max 4002, av 272.17, stddev 45122.99

For generated names:

N=557398, m=51 sz=2048, min 0, max 44800, av 272.17, stddev 10208445.83

A very non-uniform distribution.

> The rotate is equivalent to a multiplication by x**7 in Z_2[P=0],
> where P is the polynomial x**31 - 1 (over Z_2).
> Presumably the "best" P would be irreducible - but that would have more
> bits set in the polynomial, making reduction harder. A compromise is to
> choose P in the form x**N - 1 but with relatively few factors.
> X**31 - 1 is such a P.

It has seven irreducible factors (2 has order 5 mod 31, so x**31 - 1
splits over Z_2 into x - 1 times six irreducible quintics). Hardly
"almost irreducible".

Shifting the 7-bit ASCII characters over 7 bits makes sure that there
is very little interaction to start with. And the final AND that truncates
to the final size of the hash chain kills the high order bits.
No good.
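
To make that concrete, a small sketch (illustrative, not from the original
exchange): for names of three characters or fewer, the rotate-by-7 hash
packs the characters into disjoint 7-bit fields, so masking down to an
11-bit bucket index keeps nothing but the low four bits of the final
character - at most 16 of the 2048 buckets are reachable.

#include <stdio.h>

/* The rotate-by-7 hash from the quoted posting (pointer advance added,
 * written with unsigned arithmetic). */
static unsigned rot31_hash(const char *p)
{
        unsigned hash = 0;
        while (*p) {
                hash = hash + *p++;
                /* rotate a 31-bit field left by 7 */
                hash = ((hash << 7) | (hash >> 24)) & 0x7fffffff;
        }
        return hash;
}

int main(void)
{
        /* "abc", "xyc" and "uvs" share only the low four bits of their
         * final character, yet all land in bucket 384. */
        const char *names[] = { "abc", "xyc", "uvs", "tmp", "lib" };
        for (int i = 0; i < 5; i++) {
                unsigned h = rot31_hash(names[i]);
                printf("%-4s -> hash %08x  bucket %4u\n", names[i], h, h & 0x7ff);
        }
        return 0;
}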

Andries


2001-02-24 00:33:15

by Andreas Dilger

[permalink] [raw]
Subject: Re: [Ext2-devel] [rfc] Near-constant time directory index for Ext2

Ted writes:
> Note that in the long run, the fully compatible version should probably
> have a COMPAT feature flag set so that you're forced to use a new enough
> version of e2fsck. Otherwise an old e2fsck may end up not noticing
> corruptions in an index block which might cause a new kernel to have
> serious heartburn.

Actually, having a COMPAT flag also helps in other ways:

1) Turning indexing on and off is no longer a mount option (or something
done automatically), as it currently is, so it will quell Linus' fears
about the principle of least surprise (i.e. not converting a filesystem
without user action). A superblock COMPAT flag is more in keeping with
other ext2 features.

2) Running a new e2fsck on a COMPAT_INDEX filesystem could create the
index for existing "large" directories that don't have the BTREE/INDEX
flag set, so the kernel only ever has to deal with incremental indexing
after the first block. The kernel would just do linear access on
existing multi-block directories until e2fsck is run.

3) Clearing the COMPAT flag would make e2fsck remove the indexes, if the
user so desires. I think this would be the behaviour of existing
e2fsck anyways.

Cheers, Andreas
--
Andreas Dilger \ "If a man ate a pound of pasta and a pound of antipasto,
\ would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/ -- Dogbert

2001-02-24 02:47:46

by Ralph Loader

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

Andries,

> is very little interaction to start with. And the final AND that truncates
> to the final size of the hash chain kills the high order bits.
> No good.

I didn't realise you were bit-masking down to the required size.

Yes, it would be pretty useless in that case.

Ralph.


2001-02-24 05:35:35

by Ralph Loader

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

Andries,


> > int hash_fn (char * p)
> > {
> >         int hash = 0;
> >         while (*p) {
> >                 hash = hash + *p++;
> >                 // Rotate a 31 bit field 7 bits:
> >                 hash = ((hash << 7) | (hash >> 24)) & 0x7fffffff;
> >         }
> >         return hash;
> > }
>
> Hmm. This one goes in the "catastrophic" category.
>
> For actual names:
>
> N=557398, m=51 sz=2048, min 82, max 4002, av 272.17, stddev 45122.99
>
> For generated names:
>
> N=557398, m=51 sz=2048, min 0, max 44800, av 272.17, stddev 10208445.83
>

Instead of masking the hash value down to 11 bits you could try:

index = (hash ^ (hash >> 11) ^ (hash >> 22)) & 0x7ff;

I ran a quick test which gave fairly good results with that: 12871
identifiers from a source tree gave a mean square bucket size of 45.65;
the expected value for a random function is 45.78.

That change might improve some of your other hashes as well - there
doesn't seem to be much point in computing a 32 bit value only to throw
20 bits away - stirring in the extra bits makes much more sense to me.
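
Spelled out as code (a trivial sketch; the helper names are illustrative,
only the fold expression itself comes from the suggestion above):

/* Two ways of reducing a 32-bit hash to an 11-bit bucket index
 * (2048 buckets). */
static unsigned bucket_mask_only(unsigned hash)
{
        return hash & 0x7ff;            /* throws away the top 21 bits */
}

static unsigned bucket_folded(unsigned hash)
{
        /* fold the high bits back in before masking, so weak low bits
         * still get stirred by the rest of the hash value */
        return (hash ^ (hash >> 11) ^ (hash >> 22)) & 0x7ff;
}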

> > The rotate is equivalent to a multiplication by x**7 in Z_2[P=0],
> > where P is the polynomial x**31 - 1 (over Z_2).
> > Presumably the "best" P would be irreducible - but that would have more
> > bits set in the polynomial, making reduction harder. A compromise is to
> > choose P in the form x**N - 1 but with relatively few factors.
> > X**31 - 1 is such a P.
>
> It has seven irreducible factors. Hardly "almost irreducible".

I didn't say it was. "Almost irreducible" polynomials with Hamming
weight two are pretty rare... Relative to, say, x**32 - 1 or x**24 - 1,
having 7 factors is good.

Ralph.