2003-03-08 22:39:21

by Daniel Phillips

Subject: Re: [Ext2-devel] Re: [Bug 417] New: htree much slower than regular ext3

On Sat 08 Mar 03 09:04, Andreas Dilger wrote:
> I was testing this in UML-over-loop in 2.4, and the difference in speed
> for doing file creates vs. directory creates is dramatic. For file
> creates I was running 3s per 10000 files, and for directory creates I
> was running 12s per 10000 files.

And on a 10K RPM SCSI disk I'm seeing 35s per 10,000 directory creates,
which is way, way slower than it ought to be. There are two analysis tools
we're badly hurting for here:

- We need to see the physical allocation maps for directories, preferably
in a running kernel. I think the best way to do this is a little
map-dumper hooked into ext3/dir.c and exported through /proc (a rough
userspace stopgap is sketched just after this list).

- We need block-access traces in a nicer form than printks (or nobody
will ever use them). IOW, we need LTT or something very much like
it.
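
In the meantime, here's a rough userspace stopgap (just a sketch, and it
assumes FIBMAP will answer for a directory inode, which the filesystem may
well refuse): walk the directory's logical blocks with the FIBMAP ioctl and
print where each one landed on disk. Needs root for FIBMAP.

/*
 * Rough stopgap map-dumper: print the physical block behind each logical
 * block of a directory, via FIBMAP.  Whether FIBMAP works on directory
 * inodes depends on the filesystem; treat an error as "use debugfs".
 */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <linux/fs.h>		/* FIBMAP, FIGETBSZ */

int main(int argc, char **argv)
{
	struct stat st;
	int fd, bsz, i, nblocks;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <directory>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0 || fstat(fd, &st) < 0 || ioctl(fd, FIGETBSZ, &bsz) < 0) {
		perror(argv[1]);
		return 1;
	}
	nblocks = (st.st_size + bsz - 1) / bsz;
	for (i = 0; i < nblocks; i++) {
		int blk = i;	/* in: logical block, out: physical block */

		if (ioctl(fd, FIBMAP, &blk) < 0) {
			perror("FIBMAP");
			return 1;
		}
		printf("logical %4d -> physical %d\n", i, blk);
	}
	close(fd);
	return 0;
}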

> Depending on the size of the journal vs. how many block/inode bitmaps and
> directory blocks are dirtied, you will likely wrap the journal before you
> return to the first block group, so you might write 20kB * 32000 for the
> directory creates instead of 8kB for the file creates. You also have a
> lot of seeking to each block group to write out the directory data, instead
> of nearly sequential IO for the inode create case.

Yes, I think that's exactly what's happening. There are still some open
questions, such as why it doesn't happen to akpm. Another question: why
does it happen for the directory creates, where the only thing being
accessed randomly is the directory itself - the inode table is supposedly
being allocated/dirtied sequentially.

Regards,

Daniel


2003-03-08 23:09:05

by Andrew Morton

Subject: Re: [Ext2-devel] Re: [Bug 417] New: htree much slower than regular ext3

Daniel Phillips <[email protected]> wrote:
>
> Yes, I think that's exactly what's happening. There are still some open
> questions, such as why it doesn't happen to akpm.

Oh but it does. The 30,000 mkdirs test takes 76 seconds on 1k blocks, 16
seconds on 4k blocks.
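
(For reference, the test is nothing fancy - essentially just the loop
below. A sketch, not the exact script I ran; names and counts are
arbitrary:)

/*
 * Shape of the "30,000 mkdirs" test: create N subdirectories in one
 * parent directory and time it with `time`, or watch vmstat alongside.
 */
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>

int main(void)
{
	char name[32];
	int i;

	for (i = 0; i < 30000; i++) {
		snprintf(name, sizeof(name), "d%05d", i);
		if (mkdir(name, 0755) < 0) {
			perror(name);
			return 1;
		}
	}
	return 0;
}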

With 1k blocks, the journal is 1/4 the size, and there are 4x as many
blockgroups.

So not only does the fs have to checkpoint 4x as often, it has to seek all
over the disk to do it.

If you create a 1k blocksize fs with a 100MB journal, the test takes just
nine seconds.

But note that after these nine seconds, we were left with 50MB of dirty
buffercache. When pdflush later comes along to write that out it takes >30
seconds, because of the additional fragmentation from the tiny blockgroups.
So all we've done is to pipeline the pain.

I suspect the Orlov allocator isn't doing the right thing here - a full
dumpe2fs of the disk shows that the directories are all over the place.
Nobody has really looked at fine-tuning Orlov.
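
A quick way to eyeball the spread without wading through full dumpe2fs
output (again just a sketch: it takes inodes-per-group from `dumpe2fs -h'
as an argument and histograms which block group each directory's inode
landed in, using group = (ino - 1) / inodes_per_group for ext2/ext3):

/*
 * Histogram of which block groups a pile of directories' inodes landed
 * in.  Usage: ./groups <inodes-per-group> dir1 dir2 ...
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
	unsigned long inodes_per_group, counts[4096] = { 0 };
	struct stat st;
	int i, g, maxg = 0;

	if (argc < 3) {
		fprintf(stderr, "usage: %s <inodes-per-group> <dir>...\n",
			argv[0]);
		return 1;
	}
	inodes_per_group = strtoul(argv[1], NULL, 0);
	if (inodes_per_group == 0) {
		fprintf(stderr, "bad inodes-per-group\n");
		return 1;
	}
	for (i = 2; i < argc; i++) {
		if (stat(argv[i], &st) < 0) {
			perror(argv[i]);
			continue;
		}
		g = (st.st_ino - 1) / inodes_per_group;
		if (g < 4096) {
			counts[g]++;
			if (g > maxg)
				maxg = g;
		}
	}
	for (g = 0; g <= maxg; g++)
		if (counts[g])
			printf("group %4d: %lu dirs\n", g, counts[g]);
	return 0;
}

Run it over the 30,000 subdirectories and see how many groups they span.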

btw, an `rm -rf' of the dir which holds 30,000 dirs takes 50 seconds system
time on a 2.7GHz CPU. What's up with that?

c01945bc str2hashbuf 9790 61.1875
c0187064 ext3_htree_store_dirent 10472 28.7692
c0154920 __getblk 11319 202.1250
c018d11c dx_probe 15934 24.2896
c018d4d0 ext3_htree_fill_tree 15984 33.8644
c0188d3c ext3_get_branch 18238 81.4196
c0136388 find_get_page 18617 202.3587
c015477c bh_lru_install 20493 94.8750
c0194290 TEA_transform 21463 173.0887
c01536b0 __find_get_block_slow 24069 62.6797
c0154620 __brelse 36139 752.8958
c01893dc ext3_get_block_handle 43815 49.7898
c0154854 __find_get_block 71292 349.4706