2003-02-27 17:20:48

by Martin J. Bligh

Subject: [Bug 417] New: htree much slower than regular ext3

http://bugme.osdl.org/show_bug.cgi?id=417

Summary: htree much slower than regular ext3
Kernel Version: 2.5.63-bk3
Status: NEW
Severity: normal
Owner: [email protected]
Submitter: [email protected]


Distribution: Debian Testing

Hardware Environment: x86, 256mb RAM, PIIX4, Maxtor 91728D8 ATA DISK drive

Software Environment:
Linux razor 2.5.63bk3 #25 Thu Feb 27 10:13:35 EST 2003 i686 Pentium II
(Klamath) GenuineIntel GNU/Linux

Gnu C 2.95.4
Gnu make 3.79.1
util-linux 2.11n
mount 2.11n
module-init-tools implemented
e2fsprogs 1.30-WIP
reiserfsprogs 3.6.3
Linux C Library 2.2.5
Dynamic linker (ldd) 2.2.5
Procps 2.0.7
Net-tools 1.60
Console-tools 0.2.3
Sh-utils 4.5.2

Problem Description:
I created a directory ("test") with 32000 (ok, 31998) directories in it,
and put a file called 'foo' in each of them (for i in `ls`; do cd $i &&
touch bar && cd .. ; done). Then I took that ext3 partition, umounted it,
did a 'tune2fs -O dir_index', then 'fsck -fD', and remounted. I then did a
'time du -hs' on the test directory, and here are the results.

ext3+htree:
bwindle@razor:/giant/inodes$ time du -hs
126M .

real 7m21.756s
user 0m2.021s
sys 0m22.190s

I then unmounted, tune2fs -O ^dir_index, e2fsck -fD /dev/hdb1, remounted,
and did another du -hs on the test directory. It took 1 minute, 48 seconds.

bwindle@razor:/giant/test$ time du -hs
126M .

real 1m48.760s
user 0m1.986s
sys 0m21.563s


I thought htree was supposed to speed up access with large numbers of
directories?



2003-02-27 19:54:16

by Daniel Phillips

Subject: Re: [Bug 417] New: htree much slower than regular ext3

On Thursday 27 February 2003 18:31, Martin J. Bligh wrote:
> Problem Description:
> I created a directory ("test") with 32000 (ok, 31998) directories in it,
> and put a file called 'foo' in each of them (for i in `ls`; do cd $i &&
> touch bar && cd .. ; done). Then I took that ext3 partion, umounted it,
> did a 'tune2fs -O dir_index', then 'fsck -fD', and remounted. I then did a
> 'time du -hs' on the test directory, and here are the results.
>
> ext3+htree:
> bwindle@razor:/giant/inodes$ time du -hs
> 126M .
>
> real 7m21.756s
> user 0m2.021s
> sys 0m22.190s
>
> I then unmounted, tune2fs -O ^dir_index, e2fsck -fD /dev/hdb1, remounted,
> and did another du -hs on the test directory. It took 1 minute, 48
> seconds.
>
> bwindle@razor:/giant/test$ time du -hs
> 126M .
>
> real 1m48.760s
> user 0m1.986s
> sys 0m21.563s
>
>
> I thought htree was supposed to speed up access with large numbers of
> directories?

The du just does getdents and lstats in physical storage order, so there's no
possible benefit from indexing in this case, and unindexed ext3 avoids long
search times by caching the position at which it last found an entry. That
answers the question "why doesn't it speed up"; "why did it slow way down" is
harder.

The single-file leaves of your directory tree don't carry any index (it's not
worth it with only one file) and therefore use the same code path as unindexed
ext3, so there's no culprit there. I'm looking suspiciously at
ext3_dx_readdir, which is apparently absorbing about 11 ms per returned
entry (roughly the 330-odd extra seconds spread over ~32,000 entries). To put
this in perspective, I'm used to seeing individual directory operation times
well below 100 us, so even if each dirent cost as much as a full lookup, you'd
see less than 3 seconds overhead for your 30,000 directories.

11 ms sounds like two seeks for each returned dirent, which sounds like a bug.

Regards,

Daniel

2003-02-27 20:50:25

by Andreas Dilger

Subject: Re: [Bug 417] New: htree much slower than regular ext3

On Feb 28, 2003 03:55 +0100, Daniel Phillips wrote:
> On Thursday 27 February 2003 18:31, Martin J. Bligh wrote:
> > Problem Description:
> > I created a directory ("test") with 32000 (ok, 31998) directories in it,
> > and put a file called 'foo' in each of them (for i in `ls`; do cd $i &&
> > touch bar && cd .. ; done). Then I took that ext3 partion, umounted it,
> > did a 'tune2fs -O dir_index', then 'fsck -fD', and remounted. I then did a
> > 'time du -hs' on the test directory, and here are the results.
> >
> > ext3+htree:
> > bwindle@razor:/giant/inodes$ time du -hs
> > 126M .
> >
> > real 7m21.756s
> > user 0m2.021s
> > sys 0m22.190s
> >
> > I then unmounted, tune2fs -O ^dir_index, e2fsck -fD /dev/hdb1, remounted,
> > and did another du -hs on the test directory. It took 1 minute, 48
> > seconds.
> >
> > bwindle@razor:/giant/test$ time du -hs
> > 126M .
> >
> > real 1m48.760s
> > user 0m1.986s
> > sys 0m21.563s
> >
> >
> > I thought htree was supposed to speed up access with large numbers of
> > directories?
>
> The du just does getdents and lstats in physical storage order, so there's no
> possible benefit from indexing in this case, and unindexed ext3 avoids long
> search times by caching the position at which it last found an entry. That
> answers the question "why doesn't it speed up", however, "why did it slow way
> down" is harder.
>
> The single-file leaves of your directory tree don't carry any index (it's not
> worth it with only one file) and therfore use the same code path as unindexed
> ext3, so there's no culprit there. I'm looking suspiciously at
> ext3_dx_readdir, which is apparently absorbing about 11 ms per returned
> entry. To put this in perspective, I'm used to seeing individual directory
> operation times well below 100 us, so even if each dirent cost as much as a
> full lookup, you'd see less than 3 seconds overhead for your 30,000
> directories.
>
> 11 ms sounds like two seeks for each returned dirent, which sounds like a bug.

I think you are pretty dead on there. The difference is that with unindexed
entries, the directory entries and the inodes are in the same order, while with
indexed directories they are essentially in random order relative to each
other. That means each directory lookup causes a seek to fetch the inode data,
instead of reading the inodes in allocation order (which gives sequential reads
on disk).

In the past both would have been equally slow, but the Orlov allocator in
2.5 causes a number of directories to be created in the same group before
moving on to the next group, so you get nice batches of sequential reads.

I've got a patch which should help here, although it was originally written
to speed up the "create" case instead of the "lookup" case. In the lookup
case, it will do a pre-read of a number of inode table blocks, since the cost
of doing a 64kB read and doing a 4kB read is basically the same - the cost
of the seek.

The patch was written for a 2.4.18 RH kernel, but I think the areas touched
(ext3_get_inode_loc and ext3_new_inode) are relatively unchanged, so it
should be fairly straightforward to fix the rejects by hand for 2.5.

For the create case, the patch speeds up mass file creation a lot (50%),
regardless of whether we are using htree or not. That is because we do not
read in empty inode table blocks from disk (which is slow when those reads
have to queue behind a stream of writes), and instead we have a pure write
workload (with the exception of occasionally reading in the inode bitmaps).

Cheers, Andreas
======================= ext3-noread.diff ====================================
diff -ru lustre-head/fs/ext3/ialloc.c lustre/fs/ext3/ialloc.c
--- lustre-head/fs/ext3/ialloc.c Mon Dec 23 10:02:58 2002
+++ lustre/fs/ext3/ialloc.c Mon Dec 23 09:46:20 2002
@@ -289,6 +289,37 @@
}

/*
+ * @block_group: block group of inode
+ * @offset: relative offset of inode within @block_group
+ *
+ * Check whether any of the inodes in this disk block are in use.
+ *
+ * Caller must be holding superblock lock (group/bitmap read lock in future).
+ */
+int ext3_itable_block_used(struct super_block *sb, unsigned int block_group,
+ int offset)
+{
+ int bitmap_nr = load_inode_bitmap(sb, block_group);
+ int inodes_per_block;
+ unsigned long inum, iend;
+ struct buffer_head *ibitmap;
+
+ if (bitmap_nr < 0)
+ return 1;
+
+ inodes_per_block = sb->s_blocksize / EXT3_SB(sb)->s_inode_size;
+ inum = offset & ~(inodes_per_block - 1);
+ iend = inum + inodes_per_block;
+ ibitmap = EXT3_SB(sb)->s_inode_bitmap[bitmap_nr];
+ for (; inum < iend; inum++) {
+ if (inum != offset && ext3_test_bit(inum, ibitmap->b_data))
+ return 1;
+ }
+
+ return 0;
+}
+
+/*
* There are two policies for allocating an inode. If the new inode is
* a directory, then a forward search is made for a block group with both
* free space and a low directory-to-inode ratio; if that fails, then of
@@ -312,6 +343,7 @@
struct ext3_group_desc * gdp;
struct ext3_group_desc * tmp;
struct ext3_super_block * es;
+ struct ext3_iloc iloc;
int err = 0;

/* Cannot create files in a deleted directory */
@@ -514,9 +547,18 @@
inode->i_generation = sbi->s_next_generation++;

ei->i_state = EXT3_STATE_NEW;
- err = ext3_mark_inode_dirty(handle, inode);
+ err = ext3_get_inode_loc_new(inode, &iloc, 1);
if (err) goto fail;
-
+ BUFFER_TRACE(iloc.bh, "get_write_access");
+ err = ext3_journal_get_write_access(handle, iloc.bh);
+ if (err) {
+ brelse(iloc.bh);
+ iloc.bh = NULL;
+ goto fail;
+ }
+ err = ext3_mark_iloc_dirty(handle, inode, &iloc);
+ if (err) goto fail;
+
unlock_super (sb);
if(DQUOT_ALLOC_INODE(inode)) {
DQUOT_DROP(inode);
diff -ru lustre-head/fs/ext3/inode.c lustre/fs/ext3/inode.c
--- lustre-head/fs/ext3/inode.c Mon Dec 23 10:02:58 2002
+++ lustre/fs/ext3/inode.c Mon Dec 23 09:50:25 2002
@@ -2011,16 +1994,25 @@
ext3_journal_stop(handle, inode);
}

-/*
- * ext3_get_inode_loc returns with an extra refcount against the
- * inode's underlying buffer_head on success.
- */
+extern int ext3_itable_block_used(struct super_block *sb,
+ unsigned int block_group,
+ int offset);
+
+#define NUM_INODE_PREREAD 16

-int ext3_get_inode_loc (struct inode *inode, struct ext3_iloc *iloc)
+/*
+ * ext3_get_inode_loc returns with an extra refcount against the inode's
+ * underlying buffer_head on success. If this is for a new inode allocation
+ * (new is non-zero) then we may be able to optimize away the read if there
+ * are no other in-use inodes in this inode table block. If we need to do
+ * a read, then read in a whole chunk of blocks to avoid blocking again soon
+ * if we are doing lots of creates/updates.
+ */
+int ext3_get_inode_loc_new(struct inode *inode, struct ext3_iloc *iloc, int new)
{
struct super_block *sb = inode->i_sb;
struct ext3_sb_info *sbi = EXT3_SB(sb);
- struct buffer_head *bh = 0;
+ struct buffer_head *bh[NUM_INODE_PREREAD];
unsigned long block;
unsigned long block_group;
unsigned long group_desc;
@@ -2042,38 +2034,86 @@
}
group_desc = block_group >> sbi->s_desc_per_block_bits;
desc = block_group & (sbi->s_desc_per_block - 1);
- bh = sbi->s_group_desc[group_desc];
- if (!bh) {
+ if (!sbi->s_group_desc[group_desc]) {
ext3_error(sb, __FUNCTION__, "Descriptor not loaded");
goto bad_inode;
}

- gdp = (struct ext3_group_desc *) bh->b_data;
+ gdp = (struct ext3_group_desc *)(sbi->s_group_desc[group_desc]->b_data);
+
/*
* Figure out the offset within the block group inode table
*/
- offset = ((inode->i_ino - 1) % sbi->s_inodes_per_group) *
- sbi->s_inode_size;
+ offset = ((inode->i_ino - 1) % sbi->s_inodes_per_group);
+
block = le32_to_cpu(gdp[desc].bg_inode_table) +
- (offset >> EXT3_BLOCK_SIZE_BITS(sb));
- if (!(bh = sb_bread(sb, block))) {
- ext3_error (sb, __FUNCTION__,
- "unable to read inode block - "
- "inode=%lu, block=%lu", inode->i_ino, block);
- goto bad_inode;
+ (offset * sbi->s_inode_size >> EXT3_BLOCK_SIZE_BITS(sb));
+
+ bh[0] = sb_getblk(sb, block);
+ if (buffer_uptodate(bh[0]))
+ goto done;
+
+ /* If we don't really need to read this block, and it isn't already
+ * in memory, then we just zero it out. Otherwise, we keep the
+ * current block contents (deleted inode data) for posterity.
+ */
+ if (new && !ext3_itable_block_used(sb, block_group, offset)) {
+ lock_buffer(bh[0]);
+ memset(bh[0]->b_data, 0, bh[0]->b_size);
+ mark_buffer_uptodate(bh[0], 1);
+ unlock_buffer(bh[0]);
+ } else {
+ unsigned long block_end, itable_end;
+ int count = 1;
+
+ itable_end = le32_to_cpu(gdp[desc].bg_inode_table) +
+ sbi->s_itb_per_group;
+ block_end = block + NUM_INODE_PREREAD;
+ if (block_end > itable_end)
+ block_end = itable_end;
+
+ for (; block < block_end; block++) {
+ bh[count] = sb_getblk(sb, block);
+ if (count && (buffer_uptodate(bh[count]) ||
+ buffer_locked(bh[count]))) {
+ __brelse(bh[count]);
+ } else
+ count++;
+ }
+
+ ll_rw_block(READ, count, bh);
+
+ /* Release all but the block we actually need (bh[0]) */
+ while (--count > 0)
+ __brelse(bh[count]);
+
+ wait_on_buffer(bh[0]);
+ if (!buffer_uptodate(bh[0])) {
+ ext3_error(sb, __FUNCTION__,
+ "unable to read inode block - "
+ "inode=%lu, block=%lu", inode->i_ino,
+ bh[0]->b_blocknr);
+ goto bad_inode;
+ }
}
- offset &= (EXT3_BLOCK_SIZE(sb) - 1);
+ done:
+ offset = (offset * sbi->s_inode_size) & (EXT3_BLOCK_SIZE(sb) - 1);

- iloc->bh = bh;
- iloc->raw_inode = (struct ext3_inode *) (bh->b_data + offset);
+ iloc->bh = bh[0];
+ iloc->raw_inode = (struct ext3_inode *)(bh[0]->b_data + offset);
iloc->block_group = block_group;
-
+
return 0;
-
+
bad_inode:
return -EIO;
}

+int ext3_get_inode_loc(struct inode *inode, struct ext3_iloc *iloc)
+{
+ return ext3_get_inode_loc_new(inode, iloc, 0);
+}
+
void ext3_read_inode(struct inode * inode)
{
struct ext3_iloc iloc;
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/

2003-02-27 21:13:46

by Daniel Phillips

Subject: Re: [Bug 417] New: htree much slower than regular ext3

On Thursday 27 February 2003 22:00, Andreas Dilger wrote:
> > 11 ms sounds like two seeks for each returned dirent, which sounds like a
> > bug.
>
> I think you are pretty dead on there. The difference is that with
> unindexed entries, the directory entry and the inode are in the same order,
> while with indexed directories they are essentially in random order to each
> other. That means that each directory lookup causes a seek to get the
> directory inode data instead of doing allocation order (which is sequential
> reads on disk).
>
> In the past both would have been slow equally, but the orlov allocator in
> 2.5 causes a number of directories to be created in the same group before
> moving on to the next group, so you have nice batches of sequential reads.

I think you're close to the truth there, but out-of-order inode table access
would only introduce one seek per inode table block, and the cache should
take care of the rest. Martin's numbers suggest the cache isn't caching at
all.

Martin, does iostat show enormously more reads for the Htree case?

Regards,

Daniel

2003-02-27 21:33:27

by Martin J. Bligh

Subject: Re: [Bug 417] New: htree much slower than regular ext3

>> > 11 ms sounds like two seeks for each returned dirent, which sounds
>> > like a bug.
>>
>> I think you are pretty dead on there. The difference is that with
>> unindexed entries, the directory entry and the inode are in the same
>> order, while with indexed directories they are essentially in random
>> order to each other. That means that each directory lookup causes a
>> seek to get the directory inode data instead of doing allocation order
>> (which is sequential reads on disk).
>>
>> In the past both would have been slow equally, but the orlov allocator in
>> 2.5 causes a number of directories to be created in the same group before
>> moving on to the next group, so you have nice batches of sequential
>> reads.
>
> I think you're close to the truth there, but out-of-order inode table
> access would only introduce one seek per inode table block, and the
> cache should take care of the rest. Martin's numbers suggest the cache
> isn't caching at all.
>
> Martin, does iostat show enormously more reads for the Htree case?

Wasn't me ... I just forward the bug data from bugzilla ;-)
Filer in this case was ... [email protected]
cc'ed.

M.

2003-03-07 15:43:39

by Alex Tomas

Subject: Re: [Bug 417] New: htree much slower than regular ext3


Hi!

The problem is that getdents(2) returns inodes out of order, so
du causes many head seeks. I tried to solve the problem with the
patch included below. The idea of the patch is pretty simple: just
sort the dentries by inode number in readdir(). It works because
inodes sit at fixed places in ext2/ext3. Please look at the results
I just got:

                 real          user        sys

ext3+htree:   7m22.236s     0m0.292s    0m2.204s
ext3-htree:    3m6.539s     0m0.481s    0m2.278s
ext3+sd-05:   3m45.453s     0m0.493s    0m2.265s
ext3+sd-10:   3m35.770s     0m0.514s    0m2.076s
ext3+sd-15:   3m24.235s     0m0.558s    0m2.075s
ext3+sd-20:    3m6.802s     0m0.486s    0m2.214s
ext3+sd-25:   2m55.062s     0m0.490s    0m2.105s
ext3+sd-30:   2m39.658s     0m0.492s    0m2.138s


'sd-N' stands for sortdir with N blocks prefetched before sorting.

Of course, sortdir isn't production-ready and I haven't tried to
test it hard; it's just to probe the idea.

After applying the patch, use 'sortdir' or 'sortdir=N' as mount
option.

with best regards, Alex

>>>>> Martin J Bligh (MJB) writes:

MJB> Problem Description: I created a directory ("test") with 32000
MJB> (ok, 31998) directories in it, and put a file called 'foo' in
MJB> each of them (for i in `ls`; do cd $i && touch bar && cd .. ;
MJB> done). Then I took that ext3 partion, umounted it, did a
MJB> 'tune2fs -O dir_index', then 'fsck -fD', and remounted. I then
MJB> did a 'time du -hs' on the test directory, and here are the
MJB> results.

MJB> ext3+htree: bwindle@razor:/giant/inodes$ time du -hs 126M .

MJB> real 7m21.756s user 0m2.021s sys 0m22.190s

MJB> I then unmounted, tune2fs -O ^dir_index, e2fsck -fD /dev/hdb1,
MJB> remounted, and did another du -hs on the test directory. It took
MJB> 1 minute, 48 seconds.

MJB> bwindle@razor:/giant/test$ time du -hs 126M .

MJB> real 1m48.760s user 0m1.986s sys 0m21.563s


MJB> I thought htree was supposed to speed up access with large
MJB> numbers of directories?



diff -uNr linux/fs/ext3/dir.c edited/fs/ext3/dir.c
--- linux/fs/ext3/dir.c Thu Mar 6 14:57:50 2003
+++ edited/fs/ext3/dir.c Fri Mar 7 18:35:32 2003
@@ -33,8 +33,12 @@
static int ext3_readdir(struct file *, void *, filldir_t);
static int ext3_dx_readdir(struct file * filp,
void * dirent, filldir_t filldir);
+static int ext3_readdir_sort (struct file * filp,
+ void * dirent, filldir_t filldir);
static int ext3_release_dir (struct inode * inode,
struct file * filp);
+int ext3_htree_store_dirent_sorted(struct file *dir_file,
+ struct ext3_dir_entry_2 *dirent);

struct file_operations ext3_dir_operations = {
.read = generic_read_dir,
@@ -104,9 +108,17 @@
sb = inode->i_sb;

if (is_dx(inode)) {
+ if (test_opt(inode->i_sb, SORTDIR)) {
+ err = ext3_readdir_sort(filp, dirent, filldir);
+ unlock_kernel();
+ return err;
+ }
+
err = ext3_dx_readdir(filp, dirent, filldir);
- if (err != ERR_BAD_DX_DIR)
+ if (err != ERR_BAD_DX_DIR) {
+ unlock_kernel();
return err;
+ }
/*
* We don't set the inode dirty flag since it's not
* critical that it get flushed back to the disk.
@@ -249,6 +261,8 @@
__u32 inode;
__u8 name_len;
__u8 file_type;
+ loff_t offs;
+ struct list_head list;
char name[0];
};

@@ -311,6 +325,9 @@
p->curr_hash = pos2maj_hash(pos);
p->curr_minor_hash = pos2min_hash(pos);
p->next_hash = 0;
+ INIT_LIST_HEAD(&p->list);
+ p->list_nr = 0;
+
return p;
}

@@ -493,10 +510,193 @@

static int ext3_release_dir (struct inode * inode, struct file * filp)
{
- if (is_dx(inode) && filp->private_data)
+ if (is_dx(inode) && filp->private_data)
ext3_htree_free_dir_info(filp->private_data);

return 0;
}

+int ext3_htree_fill_pool(struct file *dir_file, struct ext3_dir_entry_2 *dirent)
+{
+ struct rb_node **p, *parent = NULL;
+ struct fname * fname, *new_fn, *prev_fname = NULL;
+ struct dir_private_info *info;
+ int len;
+
+ info = (struct dir_private_info *) dir_file->private_data;
+ p = &info->root.rb_node;
+
+ /* Create and allocate the fname structure */
+ len = sizeof(struct fname) + dirent->name_len + 1;
+ new_fn = kmalloc(len, GFP_KERNEL);
+ if (!new_fn)
+ return -ENOMEM;
+ memset(new_fn, 0, len);
+ new_fn->hash = new_fn->inode = le32_to_cpu(dirent->inode);
+ new_fn->name_len = dirent->name_len;
+ new_fn->file_type = dirent->file_type;
+ memcpy(new_fn->name, dirent->name, dirent->name_len);
+ new_fn->name[dirent->name_len] = 0;
+ new_fn->offs = dir_file->f_pos;
+
+ while (*p) {
+ parent = *p;
+ fname = rb_entry(parent, struct fname, rb_hash);
+
+ if (new_fn->hash < fname->hash)
+ p = &(*p)->rb_left;
+ else {
+ prev_fname = fname;
+ p = &(*p)->rb_right;
+ }
+ }
+
+ if (prev_fname)
+ list_add(&new_fn->list, &prev_fname->list);
+ else
+ list_add(&new_fn->list, &info->list);
+
+ rb_link_node(&new_fn->rb_hash, parent, p);
+ rb_insert_color(&new_fn->rb_hash, &info->root);
+ info->list_nr++;
+
+ return 0;
+}
+
+static int ext3_fill_from_sort_pool (struct file * filp, void * dirent, filldir_t filldir)
+{
+ struct super_block * sb = filp->f_dentry->d_inode->i_sb;
+ struct dir_private_info *info = filp->private_data;
+ struct list_head * cur;
+ struct fname *fname;
+ int error;
+
+ list_for_each(cur, &info->list) {
+ fname = list_entry(cur, struct fname, list);
+
+ error = filldir(dirent, fname->name,
+ fname->name_len, fname->offs,
+ fname->inode,
+ get_dtype(sb, fname->file_type));
+ if (error)
+ return 0;
+
+ list_del(&fname->list);
+ info->list_nr--;
+ }
+
+ if (info->list_nr == 0)
+ free_rb_tree_fname(&info->root);
+
+ return 0;
+}
+
+static int ext3_readdir_sort (struct file * filp,
+ void * dirent, filldir_t filldir)
+{
+ unsigned long offset, blk;
+ int i, stored, error = 0, blocks = 0, err;
+ struct buffer_head * bh;
+ struct ext3_dir_entry_2 * de;
+ struct inode *inode = filp->f_dentry->d_inode;
+ struct super_block * sb = inode->i_sb;
+ struct dir_private_info *info = filp->private_data;
+
+ if (!info) {
+ info = create_dir_info(filp->f_pos);
+ if (!info)
+ return -ENOMEM;
+ filp->private_data = info;
+ }
+
+ if (info->list_nr > 0) {
+ ext3_fill_from_sort_pool(filp, dirent, filldir);
+ return 0;
+ }
+
+ stored = 0;
+ bh = NULL;
+ offset = filp->f_pos & (sb->s_blocksize - 1);
+
+ while (!error && (blocks < EXT3_SB(sb)->sortdir_prefetch) &&
+ (filp->f_pos < inode->i_size)) {
+ blk = (filp->f_pos) >> EXT3_BLOCK_SIZE_BITS(sb);
+ bh = ext3_bread (0, inode, blk, 0, &err);
+ if (!bh) {
+ ext3_error (sb, "ext3_readdir",
+ "directory #%lu contains a hole at offset %lu",
+ inode->i_ino, (unsigned long)filp->f_pos);
+ filp->f_pos += sb->s_blocksize - offset;
+ continue;
+ }
+
+revalidate:
+ /* If the dir block has changed since the last call to
+ * readdir(2), then we might be pointing to an invalid
+ * dirent right now. Scan from the start of the block
+ * to make sure. */
+ if (filp->f_version != inode->i_version) {
+ for (i = 0; i < sb->s_blocksize && i < offset; ) {
+ de = (struct ext3_dir_entry_2 *)
+ (bh->b_data + i);
+ /* It's too expensive to do a full
+ * dirent test each time round this
+ * loop, but we do have to test at
+ * least that it is non-zero. A
+ * failure will be detected in the
+ * dirent test below. */
+ if (le16_to_cpu(de->rec_len) <
+ EXT3_DIR_REC_LEN(1))
+ break;
+ i += le16_to_cpu(de->rec_len);
+ }
+ offset = i;
+ filp->f_pos = (filp->f_pos & ~(sb->s_blocksize - 1))
+ | offset;
+ filp->f_version = inode->i_version;
+ }
+
+ while (!error && filp->f_pos < inode->i_size
+ && offset < sb->s_blocksize) {
+ de = (struct ext3_dir_entry_2 *) (bh->b_data + offset);
+ if (!ext3_check_dir_entry ("ext3_readdir", inode, de,
+ bh, offset)) {
+ /* On error, skip the f_pos to the
+ * next block. */
+ filp->f_pos = (filp->f_pos |
+ (sb->s_blocksize - 1)) + 1;
+ brelse (bh);
+ return stored;
+ }
+ offset += le16_to_cpu(de->rec_len);
+ if (le32_to_cpu(de->inode)) {
+ /* We might block in the next section
+ * if the data destination is
+ * currently swapped out. So, use a
+ * version stamp to detect whether or
+ * not the directory has been modified
+ * during the copy operation.
+ */
+ unsigned long version = filp->f_version;
+
+ error = ext3_htree_fill_pool(filp, de);
+ if (error)
+ break;
+ if (version != filp->f_version)
+ goto revalidate;
+ stored ++;
+ }
+ filp->f_pos += le16_to_cpu(de->rec_len);
+ }
+ offset = 0;
+ brelse (bh);
+ blocks++;
+ }
+
+ ext3_fill_from_sort_pool(filp, dirent, filldir);
+
+ UPDATE_ATIME(inode);
+ return 0;
+}
+
#endif
diff -uNr linux/fs/ext3/super.c edited/fs/ext3/super.c
--- linux/fs/ext3/super.c Mon Feb 24 17:47:45 2003
+++ edited/fs/ext3/super.c Fri Mar 7 17:47:27 2003
@@ -584,6 +584,19 @@
continue;
if ((value = strchr (this_char, '=')) != NULL)
*value++ = 0;
+ if (!strcmp (this_char, "sortdir")) {
+ if (value && *value) {
+ int blocks = simple_strtoul(value, &value, 0);
+ printk("EXT3: has value: %d\n", blocks);
+ if (blocks > 0)
+ sbi->sortdir_prefetch = blocks;
+ }
+ if (sbi->sortdir_prefetch == 0)
+ sbi->sortdir_prefetch = 10; /* default value */
+ set_opt(sbi->s_mount_opt, SORTDIR);
+ printk("EXT3-fs: sortdir (%d blocks prefetch)\n",
+ sbi->sortdir_prefetch);
+ } else
#ifdef CONFIG_EXT3_FS_XATTR
if (!strcmp (this_char, "user_xattr"))
set_opt (sbi->s_mount_opt, XATTR_USER);
diff -uNr linux/include/linux/ext3_fs.h edited/include/linux/ext3_fs.h
--- linux/include/linux/ext3_fs.h Mon Feb 24 17:47:33 2003
+++ edited/include/linux/ext3_fs.h Fri Mar 7 18:32:53 2003
@@ -330,6 +330,7 @@
#define EXT3_MOUNT_NO_UID32 0x2000 /* Disable 32-bit UIDs */
#define EXT3_MOUNT_XATTR_USER 0x4000 /* Extended user attributes */
#define EXT3_MOUNT_POSIX_ACL 0x8000 /* POSIX Access Control Lists */
+#define EXT3_MOUNT_SORTDIR 0x10000 /* sort indexed dirs in readdir() */

/* Compatibility, for having both ext2_fs.h and ext3_fs.h included at once */
#ifndef _LINUX_EXT2_FS_H
@@ -656,6 +657,8 @@
__u32 curr_hash;
__u32 curr_minor_hash;
__u32 next_hash;
+ struct list_head list;
+ int list_nr;
};

/*
diff -uNr linux/include/linux/ext3_fs_sb.h edited/include/linux/ext3_fs_sb.h
--- linux/include/linux/ext3_fs_sb.h Mon Nov 11 06:28:30 2002
+++ edited/include/linux/ext3_fs_sb.h Fri Mar 7 17:30:46 2003
@@ -63,6 +63,7 @@
struct timer_list turn_ro_timer; /* For turning read-only (crash simulation) */
wait_queue_head_t ro_wait_queue; /* For people waiting for the fs to go read-only */
#endif
+ int sortdir_prefetch; /* how many blocks to read to fill pool --AT */
};

#endif /* _LINUX_EXT3_FS_SB */

2003-03-07 17:24:11

by Daniel Phillips

Subject: Re: [Bug 417] New: htree much slower than regular ext3

On Fri 07 Mar 03 16:46, Alex Tomas wrote:
> Hi!
>
> The problem is that getdents(2) returns inodes out of order and
> du causes many head seeks. I tried to solve the problem by patch
> I included in. The idea of the patch is pretty simple: just try
> to sort dentries by inode number in readdir(). It works because
> inodes sits at fixed places in ext2/ext3. Please, look at results
> I just got:
>
> real user sys
>
> ext3+htree: 7m22.236s 0m0.292s 0m2.204s
> ext3-htree: 3m6.539s 0m0.481s 0m2.278s
> [...]
> ext3+sd-30: 2m39.658s 0m0.492s 0m2.138s

Nice. I posted a similar demonstration some time ago (could it *really* be
two years?) but I sorted the dirents in user space, which meant it was just a
demonstration, not a practical solution.

The problem I see with your approach is that the traversal is no longer in
hash order, so a leaf split in the middle of a directory traversal could
result in a lot of duplicate dirents. I'm not sure there's a way around that.

Another approach is to have the inode number correspond approximately to the
hash value. That's not as hard as it sounds; it simply means that the goal
for ext3_new_inode should be influenced by the hash value. (But watch out
for interaction with the new Orlov allocator.) It also depends on some
a priori estimate of the size of a directory, so that increasing hash values
can be distributed more or less uniformly and monotonically across a
corresponding range of inode numbers. This isn't too hard either: just round
the current size of the directory up to a power of two and use that as the
size estimate. The size estimate would change from time to time over the
life of the directory, but there are only log N different sizes, and that's
roughly how heavy the load on the inode cache would be during a directory
traversal. Finally, a nice property of this approach is that it stays stable
over many creates and deletes.
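
A hypothetical sketch of that goal calculation (the function name and
parameters are invented for illustration; this is not the real
ext3_new_inode interface):

/* Map a dirent's hash to an inode allocation goal: round the current
 * directory size up to a power of two and spread the 32-bit hash
 * monotonically across a range of inode numbers of that size.  The
 * estimate only changes log N times over the life of the directory. */
static unsigned long hash_to_ino_goal(unsigned long base_ino,
                                      unsigned long dir_entries,
                                      unsigned int hash)
{
        unsigned long range = 1;

        while (range < dir_entries)
                range <<= 1;

        /* hash is in [0, 2^32), so this lands in [base_ino, base_ino + range) */
        return base_ino + (unsigned long)
                (((unsigned long long)hash * range) >> 32);
}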

We've apparently got a simpler problem though: inode numbers aren't allocated
densely enough in the inode table blocks. This is the only thing I can think
of that could cause the amount of thrashing we're seeing. Spreading inode
number allocations out all over a volume is ok, but sparse allocation within
each table block is not ok because it multiplies the pressure on the cache.
We want a 30,000 entry directory to touch 1,000 - 2,000 inode table blocks,
not the worst-case 30,000. As above, fixing this comes down to tweaking the
ext3_new_inode allocation goal.

Regards,

Daniel

2003-03-07 23:17:58

by Theodore Ts'o

Subject: Re: [Bug 417] New: htree much slower than regular ext3

On Sat, Mar 08, 2003 at 06:38:50PM +0100, Daniel Phillips wrote:
> > The problem is that getdents(2) returns inodes out of order and
> > du causes many head seeks. I tried to solve the problem by patch
> > I included in. The idea of the patch is pretty simple: just try
> > to sort dentries by inode number in readdir().

Yup, that's the problem and the fix; the only issue is the one
which Daniel pointed out:

> The problem I see with your approach is that the traversal is no
> longer in hash order, so a leaf split in the middle of a directory
> traversal could result in a lot of duplicate dirents. I'm not sure
> there's a way around that.

The fix in userspace would be for du and find (the two programs most
likely to care about this sort of thing) to pull into memory a large
chunk of the directory at a time, sort the entries by inode number, and
then stat them in inode order.
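
A rough sketch of that workaround (error handling mostly omitted; this
is only an illustration, not a patch to du or find): read the whole
directory into memory, sort the entries by inode number, then stat them
in that order:

#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

struct ent {
        ino_t ino;
        char name[256];
};

static int by_ino(const void *a, const void *b)
{
        ino_t x = ((const struct ent *)a)->ino;
        ino_t y = ((const struct ent *)b)->ino;
        return (x > y) - (x < y);
}

int main(int argc, char **argv)
{
        DIR *dir;
        struct dirent *de;
        struct ent *list = NULL;
        size_t n = 0, max = 0, i;
        struct stat st;

        if (argc > 1 && chdir(argv[1]) != 0)
                return 1;
        if ((dir = opendir(".")) == NULL)
                return 1;

        /* Pull the directory entries into memory... */
        while ((de = readdir(dir)) != NULL) {
                if (n == max) {
                        max = max ? max * 2 : 1024;
                        list = realloc(list, max * sizeof(*list));
                }
                list[n].ino = de->d_ino;
                strncpy(list[n].name, de->d_name, sizeof(list[n].name) - 1);
                list[n].name[sizeof(list[n].name) - 1] = '\0';
                n++;
        }
        closedir(dir);

        /* ...sort them by inode number... */
        qsort(list, n, sizeof(*list), by_ino);

        /* ...and stat them in that order, so the inode table is read
         * more or less sequentially instead of in hash order. */
        for (i = 0; i < n; i++)
                if (lstat(list[i].name, &st) == 0)
                        printf("%10lu  %s\n", (unsigned long)list[i].ino,
                               list[i].name);

        free(list);
        return 0;
}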

The only other way around it would be to maintain an entirely
separate B-tree on disk just for the benefit of readdir(), which could
be kept in inode sort order. Readdir() could then traverse the tree in
inode sort order, thus avoiding the performance penalty. (The B-tree
has to be on disk, since for large directories we can't afford to keep
it in memory.)

> We've apparently got a simpler problem though: inode numbers aren't
> allocated densely enough in the inode table blocks. This is the
> only thing I can think of that could cause the amount of thrashing
> we're seeing. Spreading inode number allocations out all over a
> volume is ok, but sparse allocation within each table block is not
> ok because it multiplies the pressure on the cache. We want a
> 30,000 entry directory to touch 1,000 - 2,000 inode table blocks,
> not the worst-case 30,000. As above, fixing this comes down to
> tweaking the ext3_new_inode allocation goal.

Nope. That's not it. Inode numbers are allocated sequentially in the
inode table blocks. You can't get any more dense than that. Create a
bunch of inodes in a directory like this:

mke2fs -j -F -b 1024 -g 1024 -N 1200000 test.img
mount -t ext3 -o loop test.img /mnt
pushd /mnt
mkdir test test2 test3
cd test
time seq -f "%06.0f" 1 100 | xargs touch
cd ..
cd test2
time seq -f "%06.0f" 1 10000 | xargs touch
cd ..
cd test3
time seq -f "%06.0f" 1 100000 | xargs touch
cd ..
popd

Then look at the result using "ls -li". You will see that the inode
numbers are all sequentially allocated when you look at the inodes in
creation order (which in this particular case is also filename sort
order). The problem is that when you look at them in hash sort order,
the inode numbers are scattered all over the place.

> Another approach is to have the inode number correspond approximately to the
> hash value. That's not as hard as it sounds, it simply means that the goal
> for ext3_new_inode should be influenced by the hash value. (But watch out
> for interaction with the new Orlov allocator.)

I'm not at all convinced that splattering the inodes all over the volume
is in fact a good thing, because it means splattering the block
allocation for files within the same directory all over the volume.
Arguably, read access to files in a directory is much more common than
the "du" or "find" case. So if we need to make a trade-off in one
direction or the other, I'd much rather strive to maximize block
locality than make a readdir+stat run go fast. This is especially true
since there *is* a user space workaround that would speed things up for
du and find.

- Ted


2003-03-08 22:55:18

by Daniel Phillips

Subject: Re: [Bug 417] New: htree much slower than regular ext3

On Sat 08 Mar 03 06:48, Andrew Morton wrote:
> Daniel Phillips <[email protected]> wrote:
> > OK, according to akpm, data=journal is broken at the moment. Changing
> > the host filesystem to data=ordered allowed me to complete the test.
> >
> > Interesting results. Creating the initial 30,000 directories ought to
> > take just a few seconds with the index, but instead it takes 5 minutes
> > and 9 seconds, while consuming only 5 seconds of cpu time.
>
> It takes 14 seconds here against a real disk, and 9 seconds against loop.

The best I've seen is 77 seconds to make the directories, on a fast scsi disk
with 4K blocksize. That's a 6 1/2 times difference; what could account for
it?

> The difference: I used 4k blocksize. 1k blocksize should not be that bad:
> something is broken.
>
> Suggest you test against a real disk for now. Using loop may skew things.

OK, I tested 1K and 4K blocksize on scsi, ide and ide-hosted loopback; the
results are summarized here:

Make 30,000 directories:
          scsi 10K      ide 5400      loopback
   1K    1m44.243s     4m59.757s      4m8.362s
   4K    1m16.659s     4m49.827s     1m43.399s

Make one file in each directory:
          scsi 10K      ide 5400      loopback
   1K    2m50.483s     7m28.657s      4m8.793s
   4K    2m34.461s     5m45.615s      3m0.603s

Stat the tree (du):
          scsi 10K      ide 5400      loopback
   1K    0m43.245s     1m52.230s      0m1.343s
   4K     0m1.038s      0m2.056s      0m1.016s

The first thing to notice: Alex Tomas's original complaint about Htree and
stat doesn't manifest at all here. Mind you, I did not repeat his exact
sequence of tune2fs's etc, so it's possible there could be a utils bug or
something. I'll check into that later. Meanwhile, these stats show other
breakage that needs to be investigated.

The directory creation time is several times higher than it should be, and
the same goes for the file creates. With the index, the two sets of operations
should take about the same time. Is this a journal problem? My journal is
about 8 MB. I can imagine the journal wrapping during the file creation,
because it touches all the existing inode table blocks, but this should not
happen during the directory creation.

The stats also show a huge slowdown for 1K blocksize on non-loopback, as you
noticed earlier. Hmm. First: why on the raw device and not on loopback?
Second: why does the device affect this at all? This operation is r/o. Hmm,
except for atimes. I reran the scsi tests with noatime+nodiratime and the
stat times for 1K and 4K became nearly equal, about .75 seconds. So there we
have it: several tens of seconds are spent journalling dirty inode table
blocks due to atime, but only when blocksize is 1K. Big fat clue.

Stat the tree (du), noatime+nodiratime:
          scsi 10K
   1K    0m0.780s
   4K    0m0.751s

The cpu numbers are in the detailed results below, and they show that Htree
is doing its job, all the cpu numbers are tiny. We've kicked around the idea
that Htree is causing inode table thrashing, but I think that's the wrong
tree to bark up. Journal effects are looking more and more like the real
problem, possibly in combination with IO scheduling. The fact that Htree
dirties the inode table out of order is just exposing some other underlying
problem.

A quick look at slabinfo before and after unmounting the test volume shows
that all the inodes we expect to be cached, actually are:

cat /proc/slabinfo | grep ext3_inode
ext3_inode_cache 121469 121473 448 13497 13497 1 : 120 60

sudo umount /mnt; cat /proc/slabinfo | grep ext3_inode
ext3_inode_cache 61506 61632 448 6836 6848 1 : 120 60

So there were about 60,000 inodes cached for that volume, exactly as
expected. So much for the theory that we're thrashing the inode table cache.
What's actually happening, I strongly suspect, is a lot of dirty blocks are
getting pushed through the journal a little prematurely, without allowing any
opportunity for coalescing. At create time, it must be the directory blocks,
not the inode table blocks, that cause the extra writeout, since inodes are
being allocated (mostly) sequentially.

By the way, the script is a perfect trigger for the scheduler interactivity
bug, leaving me mouseless for 4 seconds at a time. I guess I'll apply
Linus's patch and see if that goes away.

One other thing I noticed is that when the script tries to umount
immediately after the stat, I get "umount: /mnt: device is busy"; however,
after a few seconds it works fine. This strikes me as a bug.

======================
1K blocksize, 10K rpm scsi
======================

Make directories...

real 1m44.243s
user 0m0.142s
sys 0m3.465s

Make one file in each directory...

real 2m50.483s
user 0m23.617s
sys 0m59.465s

Stat the tree...
30M /mnt/test

real 0m43.245s
user 0m0.245s
sys 0m0.915s

======================
1K blocksize, 5400 rpm ide
======================

Make directories...

real 4m59.757s
user 0m0.138s
sys 0m5.207s

Make one file in each directory...

real 7m28.657s
user 0m24.351s
sys 0m58.941s

Stat the tree...
30M /mnt/test

real 1m52.230s
user 0m0.261s
sys 0m1.013s

===================
1K blocksize, loopback
===================

Make directories...

real 4m8.362s
user 0m0.144s
sys 0m7.448s

Make one file in each directory...

real 4m8.793s
user 0m25.262s
sys 1m20.376s

Stat the tree...
30M /mnt/test

real 0m1.343s
user 0m0.265s
sys 0m0.720s

======================
4K blocksize, 10K rpm scsi
======================

Make directories...

real 1m16.659s
user 0m0.152s
sys 0m4.106s

Make one file in each directory...

real 2m34.461s
user 0m23.555s
sys 0m58.533s

Stat the tree...
118M /mnt/test

real 0m1.038s
user 0m0.242s
sys 0m0.664s

======================
4K blocksize, 5400 rpm ide
======================

Make directories...

real 4m49.827s
user 0m0.153s
sys 0m16.666s

Make one file in each directory...

real 5m45.615s
user 0m23.836s
sys 1m6.908s

Stat the tree...
118M /mnt/test

real 0m2.056s
user 0m0.432s
sys 0m0.622s

===================
4K blocksize, loopback
===================

Make directories...

real 1m43.399s
user 0m0.149s
sys 0m10.806s

Make one file in each directory...

real 3m0.603s
user 0m25.568s
sys 1m34.851s

Stat the tree...
118M /mnt/test

real 0m1.016s
user 0m0.318s
sys 0m0.588s

========
The script
========

#volume=test.img
#volume=/dev/hda5
volume=/dev/sdb5
mke2fs -j -F -b 4096 -g 1024 $volume
mount -t ext3 -o loop $volume /mnt
cd /mnt
mkdir test
cd test

echo
echo Make directories...
time seq -f "%06.0f" 1 30000 | xargs mkdir

echo
echo Make one file in each directory...
time for i in `ls`; do cd $i && touch foo && cd .. ; done

echo
echo Stat the tree...
time du -hs /mnt/test

echo umount /mnt

2003-03-09 07:05:02

by Alex Tomas

Subject: Re: [Bug 417] New: htree much slower than regular ext3

>>>>> Daniel Phillips (DP) writes:

DP> On Fri 07 Mar 03 16:46, Alex Tomas wrote:
DP> The problem I see with your approach is that the traversal is no
DP> longer in hash order, so a leaf split in the middle of a
DP> directory traversal could result in a lot of duplicate dirents.
DP> I'm not sure there's a way around that.

1) As far as I understand, duplicates are possible even in classic ext2
w/o sortdir/index. See the diagram:

Process 1                               Process 2

getdents(2) returns
  dentry1 (file1 -> Inode1)
  dentry2 (file2 -> Inode2)

context switch -->
                                        unlink(file1), empty dentry1
                                        creat(file3), Inode3, use dentry1
                                        creat(file1), Inode1, use dentry3

context switch -->

getdents(2) returns
  dentry3 (file1 -> Inode1)

Am I right?


2) Why not use hash order for traversal, like ext3_dx_readdir() does?
After reading several dentries within some hash range, readdir() sorts them
in inode order and returns them to the user.


with best regards, Alex

2003-03-09 17:44:28

by Daniel Phillips

Subject: Re: [Bug 417] New: htree much slower than regular ext3

On Sun 09 Mar 03 08:08, Alex Tomas wrote:
> >>>>> Daniel Phillips (DP) writes:
>
> DP> On Fri 07 Mar 03 16:46, Alex Tomas wrote:
> DP> The problem I see with your approach is that the traversal is no
> DP> longer in hash order, so a leaf split in the middle of a
> DP> directory traversal could result in a lot of duplicate dirents.
> DP> I'm not sure there's a way around that.
>
> 1) As far as I understand, duplicates are possible even in classic ext2
> w/o sortdir/index. See the diagram:
>
> Process 1                               Process 2
>
> getdents(2) returns
>   dentry1 (file1 -> Inode1)
>   dentry2 (file2 -> Inode2)
>
> context switch -->
>                                         unlink(file1), empty dentry1
>                                         creat(file3), Inode3, use dentry1
>                                         creat(file1), Inode1, use dentry3
>
> context switch -->
>
> getdents(2) returns
>   dentry3 (file1 -> Inode1)
>
>
> Am I right?
>
>
> 2) Why do not use hash order for traversal like ext3_dx_readdir() does?
> Upon reading several dentries within some hash set readdir() sorts them
> in inode order and returns to an user.
>
>
> with best regards, Alex

You're right, but you still don't win the stuffed poodle.

I put forth the same argument at last year's kernel workshop and was
overruled on the grounds that, let me see:

- Duplicates as a result of unlinks are one thing, duplicates as a
result of creates are another, worse thing.

- Creating one duplicate as a result of one unlink is one thing,
creating lots of duplicates with one operation is another, worse
thing.

- The rules are written to allow this particular duplicate behavior
in UFS (design inherited by Ext2/3) because at the time, UFS was
the only game in town.

The phrase "broken by design" comes to mind. Ted and Stephen would no doubt
be happy to elaborate. Anyway, there's another reason we need to return
results in hash order, and that is telldir/seekdir, which expects a stable
enumeration of a directory with no duplicates, even with concurrent
operations going on, and even if the server is rebooted in the middle of a
directory traversal. Not just broken by design, but smashed to pieces by
design. And we still have to make it work, for some definition of "work".

Anyway,

2003-03-09 19:22:31

by Alex Tomas

Subject: Re: [Bug 417] New: htree much slower than regular ext3

>>>>> Theodore Ts'o (TT) writes:

TT> The fix in userspace would be for du and find (the two programs
TT> most likely to care about this sort of thing) to pull into memory
TT> a large chunks of the directory at a time, sort them by inode,
TT> and then stat them in inode order.

In fact, this solution only helps on filesystems with fixed inode
placement; for example, it won't work for reiserfs.

2003-03-11 05:02:01

by Theodore Ts'o

Subject: Re: [Bug 417] New: htree much slower than regular ext3

On Sun, Mar 09, 2003 at 10:08:34AM +0300, Alex Tomas wrote:
> 1) As far as I understand, duplicates are possible even in classic ext2
> w/o sortdir/index. See the diagram:
>
> Process 1                               Process 2
>
> getdents(2) returns
>   dentry1 (file1 -> Inode1)
>   dentry2 (file2 -> Inode2)
>
> context switch -->
>                                         unlink(file1), empty dentry1
>                                         creat(file3), Inode3, use dentry1
>                                         creat(file1), Inode1, use dentry3
>
> context switch -->
>
> getdents(2) returns
>   dentry3 (file1 -> Inode1)
>
>
> Am I right?

Yup. POSIX allows this. If you're in the middle of readdir() when a
file (or files) is created or deleted, then it is undefined how
readdir() will treat the filename(s) involved. However, you're *not*
allowed to duplicate or omit filenames that were not touched by a file
creation or deletion.

> 2) Why do not use hash order for traversal like ext3_dx_readdir() does?
> Upon reading several dentries within some hash set readdir() sorts them
> in inode order and returns to an user.

Yeah, glibc could do this, although it would make telldir/seekdir more
than a little complicated. (There's an awful lot of pain and agony
caused to filesystem developers by the existence of telldir/seekdir,
which inherently assumes a flat directory structure and so makes it
hard to do tricks like this.)

At some level this is the same solution as "do this in userspace",
except instead of needing to convince the maintainers of du and find,
we'd need to convince Ulrich. It is *not* clear that it would
necessarily be easier to get this fix into the glibc. In fact, since
du and find probably don't use seekdir/telldir, it's probably easier
to get the change into du and find.

- Ted

2003-03-11 21:50:52

by Bill Davidsen

Subject: Re: [Bug 417] New: htree much slower than regular ext3

On 7 Mar 2003, Alex Tomas wrote:

> The problem is that getdents(2) returns inodes out of order and
> du causes many head seeks. I tried to solve the problem by patch
> I included in. The idea of the patch is pretty simple: just try
> to sort dentries by inode number in readdir(). It works because
> inodes sits at fixed places in ext2/ext3. Please, look at results
> I just got:

Any change in the order of dentries returned is likely to run into problems
when seeking in a directory. Given that readdir() is usually followed by
either zero or many stat()s, perhaps when the first stat() comes along you
could pre-read the inodes in optimal order and cache them. However you tune
the size of the sort should work about as well as tuning the size of the
pre-read.

This seems consistent with readahead and AS disk io.

--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

2003-03-13 20:53:33

by Stephen C. Tweedie

Subject: Re: [Ext2-devel] Re: [Bug 417] New: htree much slower than regular ext3

Hi,

On Thu, 2003-02-27 at 21:00, Andreas Dilger wrote:

> I've got a patch which should help here, although it was originally written
> to speed up the "create" case instead of the "lookup" case. In the lookup
> case, it will do a pre-read of a number of inode table blocks, since the cost
> of doing a 64kB read and doing a 4kB read is basically the same - the cost
> of the seek.

No it's not --- you're evicting 16 times as much other
potentially-useful data from the cache for each lookup. You'll improve
the "du" or "ls -l" case by prefetching, but you may well slow down the
overall system performance when you're just doing random accesses (eg.
managing large spools.)

It would be interesting to think about how we can spot the cases where
the prefetch is likely to be beneficial, for example by observing
"stat"s coming in in strict hash order.

--Stephen