2011-03-10 07:11:22

by Rogier Wolff

Subject: Time for "mkdir" on ext3.


Hi,

I have an ext3 filesystem. When I "cp -lr" a big tree there, it turns
out that the "mkdir" calls take the bulk of the time. IIRC there are
325000 directories (and 4 million files). Each mkdir call takes about
50ms (*), so that accounts for about 4.5 hours of the running time.

Would ext4 perform significantly better?

Roger.

(*) I forgot about this email while it was still in my editor. Now, a
day later, the mkdir calls all take around 17ms, and things run about
3x faster. On the other hand, it has been running for over 5 hours, and
yesterday I saw a streak of >100ms mkdir calls... So apparently
it depends on "something"....
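
For anyone who wants to reproduce per-call numbers like these, here is
a minimal sketch of a timing helper. It is not the tool used for the
figures above: the function name time_mkdirs, the d%07d naming scheme
and the use of CLOCK_MONOTONIC are all assumptions made for
illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Milliseconds elapsed between two CLOCK_MONOTONIC samples. */
static double elapsed_ms(const struct timespec *a, const struct timespec *b)
{
	return (b->tv_sec - a->tv_sec) * 1e3 + (b->tv_nsec - a->tv_nsec) / 1e6;
}

/*
 * Hypothetical helper: create `count` directories named
 * <base>/d0000000, <base>/d0000001, ... and report the mean and the
 * worst mkdir() latency in milliseconds.
 * Returns 0 on success, -1 as soon as one mkdir() fails.
 */
int time_mkdirs(const char *base, int count, double *mean_ms, double *max_ms)
{
	char path[4096];
	double total = 0.0, worst = 0.0;
	struct timespec t0, t1;

	for (int i = 0; i < count; i++) {
		snprintf(path, sizeof(path), "%s/d%07d", base, i);

		clock_gettime(CLOCK_MONOTONIC, &t0);
		if (mkdir(path, 0755) != 0)
			return -1;
		clock_gettime(CLOCK_MONOTONIC, &t1);

		double ms = elapsed_ms(&t0, &t1);
		total += ms;
		if (ms > worst)
			worst = ms;
	}

	*mean_ms = count > 0 ? total / count : 0.0;
	*max_ms = worst;
	return 0;
}
```

Calling time_mkdirs() on a scratch directory of the filesystem under
test, once on a fresh mount and again after the tree has grown, should
show whether the mean drifts in the way the 50ms and 17ms figures
suggest, and whether the worst case shows the >100ms streaks.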

--
** [email protected] ** http://www.BitWizard.nl/ ** +31-15-2600998 **
** Delftechpark 26 2628 XH Delft, The Netherlands. KVK: 27239233 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
Q: It doesn't work. A: Look buddy, doesn't work is an ambiguous statement.
Does it sit on the couch all day? Is it unemployed? Please be specific!
Define 'it' and what it isn't doing. --------- Adapted from lxrbot FAQ


2011-03-10 15:48:42

by Eric Sandeen

Subject: Re: Time for "mkdir" on ext3.

On 3/10/11 1:11 AM, Rogier Wolff wrote:
>
> Hi,
>
> I have an ext3 filesystem. When I "cp -lr" a big tree there, it turns
> out that the "mkdir" calls take the bulk of the time. IIRC there are
> 325000 directories (and 4 million files). Each mkdir call takes about
> 50ms (*), so that accounts for about 4.5 hours of the running time.
>
> Would ext4 perform significantly better?
>
> Roger.
>
> (*) I forgot about this email while it was still in my editor. Now, a
> day later, the mkdir calls all take around 17ms, and things run about
> 3x faster. On the other hand, it has been running for over 5 hours, and
> yesterday I saw a streak of >100ms mkdir calls... So apparently
> it depends on "something"....
>


There's a pretty pathological case in the directory allocator, where it
scans forward to find a free block group starting at the parent. For
each new subdir, it re-scans starting at the parent, even if it found
those groups full last time. I had experimented with an in-memory
"last free group" on each parent, which sped things up after the initial
scan. That might be what you're seeing...
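
To make the cost of that re-scan concrete, here is a toy userspace
model. It is only a sketch: the real allocator rejects a group based on
several thresholds rather than a simple "full" counter, and every name
below (toy_fs, toy_parent, toy_find_group, toy_mkdirs, NGROUPS) is made
up for illustration and is not kernel code.

```c
#include <stdbool.h>

#define NGROUPS 1024	/* hypothetical number of block groups */

/* Toy model: each group only tracks how many more directories fit. */
struct toy_fs {
	int free_slots[NGROUPS];
};

/* Toy stand-in for the parent directory's in-memory inode info. */
struct toy_parent {
	int block_group;	/* group holding the parent's inode */
	int child_group;	/* last group a child was placed in */
};

/*
 * Scan forward from `start`, wrapping around, for a group with room.
 * Every group examined adds one to *probes.
 * Returns the chosen group, or -1 if all groups are full.
 */
int toy_find_group(struct toy_fs *fs, int start, long *probes)
{
	for (int i = 0; i < NGROUPS; i++) {
		int group = (start + i) % NGROUPS;

		(*probes)++;
		if (fs->free_slots[group] > 0) {
			fs->free_slots[group]--;
			return group;
		}
	}
	return -1;
}

/*
 * Create `ndirs` subdirectories under a single parent and return the
 * total number of groups probed. With use_hint the scan starts at the
 * group that accepted the previous child; without it the scan always
 * restarts at the parent's own group.
 */
long toy_mkdirs(int ndirs, int slots_per_group, bool use_hint)
{
	struct toy_fs fs;
	struct toy_parent parent = { .block_group = 0, .child_group = 0 };
	long probes = 0;

	for (int g = 0; g < NGROUPS; g++)
		fs.free_slots[g] = slots_per_group;

	for (int d = 0; d < ndirs; d++) {
		int start = use_hint ? parent.child_group : parent.block_group;
		int group = toy_find_group(&fs, start, &probes);

		if (group < 0)
			break;
		parent.child_group = group;	/* remember where we ended up */
	}
	return probes;
}
```

In this model, toy_mkdirs(1000, 10, false) probes 50500 groups, while
toy_mkdirs(1000, 10, true) probes 1099. The first figure grows
quadratically with the number of subdirectories and the second
linearly, which is the kind of difference the "last free group" hint
is after.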

Here's the patch I had, untested since 2007 - if you're in a testing
mood... of course if it breaks you get to keep the pieces. :)

-Eric

diff --git a/fs/ext3/ialloc.c b/fs/ext3/ialloc.c
index 9724aef..2f7be0c 100644
--- a/fs/ext3/ialloc.c
+++ b/fs/ext3/ialloc.c
@@ -242,6 +242,7 @@ static int find_group_dir(struct super_block *sb, struct inode *parent)
 static int find_group_orlov(struct super_block *sb, struct inode *parent)
 {
 	int parent_group = EXT3_I(parent)->i_block_group;
+	unsigned int child_group = EXT3_I(parent)->i_child_block_group;
 	struct ext3_sb_info *sbi = EXT3_SB(sb);
 	struct ext3_super_block *es = sbi->s_es;
 	int ngroups = sbi->s_groups_count;
@@ -269,7 +270,7 @@ static int find_group_orlov(struct super_block *sb, struct inode *parent)
 		get_random_bytes(&group, sizeof(group));
 		parent_group = (unsigned)group % ngroups;
 		for (i = 0; i < ngroups; i++) {
-			group = (parent_group + i) % ngroups;
+			group = (child_group + i) % ngroups;
 			desc = ext3_get_group_desc (sb, group, NULL);
 			if (!desc || !desc->bg_free_inodes_count)
 				continue;
@@ -312,6 +313,7 @@ static int find_group_orlov(struct super_block *sb, struct inode *parent)
 			continue;
 		if (le16_to_cpu(desc->bg_free_blocks_count) < min_blocks)
 			continue;
+		EXT3_I(parent)->i_child_block_group = group;
 		return group;
 	}

@@ -555,6 +557,8 @@ got:
 	ei->i_dtime = 0;
 	ei->i_block_alloc_info = NULL;
 	ei->i_block_group = group;
+	if (S_ISDIR(mode))
+		ei->i_child_block_group = group;

 	ext3_set_inode_flags(inode);
 	if (IS_DIRSYNC(inode))
diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index ae94f6d..72b0c92 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -2888,6 +2888,8 @@ struct inode *ext3_iget(struct super_block *sb, unsigned long ino)
 	ei->i_disksize = inode->i_size;
 	inode->i_generation = le32_to_cpu(raw_inode->i_generation);
 	ei->i_block_group = iloc.block_group;
+	if (S_ISDIR(inode->i_mode))
+		ei->i_child_block_group = ei->i_block_group;
 	/*
 	 * NOTE! The in-memory inode i_data array is in little-endian order
 	 * even on big-endian machines: we do NOT byteswap the block numbers!
diff --git a/include/linux/ext3_fs_i.h b/include/linux/ext3_fs_i.h
index f42c098..79f3a72 100644
--- a/include/linux/ext3_fs_i.h
+++ b/include/linux/ext3_fs_i.h
@@ -87,6 +87,7 @@ struct ext3_inode_info {
 	 * near to their parent directory's inode.
 	 */
 	__u32 i_block_group;
+	__u32 i_child_block_group; /* last bg children allocated to */
 	unsigned long i_state_flags; /* Dynamic state flags for ext3 */

 	/* block reservation info */