2008-07-10 17:53:56

by Theodore Ts'o

Subject: Re: suspiciously good fsck times?

Based on the graphs which Eric posted, one interesting thing I think
you'll find if you repeat the ext3 experiment with e2fsck -t -t is
that pass 2 will take about seven times longer than pass 1. (Which is
backwards from most e2fsck runs, where pass 2 takes about half of
pass 1's run time --- although obviously that depends on how many
directory blocks you have.)

Yes, some kind of reservation window would help on ext3 --- but the
question is whether such a change would be too specific to this
benchmark or not. Most of the time directories don't grow to such a
huge size. So if you use a smallish reservation window (around 8
blocks, say) for many directories, this might lead to more filesystem
fragmentation that in the long run would cause the filesystem not to
age well; it also wouldn't help much when you have over 11 million
files in a directory, and the directory has over 100,000 blocks.
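(One way to see how badly a big directory's blocks are scattered is
filefrag from e2fsprogs, which on many kernels works on directories
as well as regular files; the path below is illustrative:)

```shell
# filefrag reports how many extents the blocks are split into --
# a rough measure of fragmentation.  Run as root for the block-map
# fallback on filesystems without FIEMAP support.
filefrag -v /mnt/bigdir
```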

I don't think delayed allocation is what's helping here either,
because the journal will force the directory blocks to be placed as
soon as we commit a transaction. I think what's saving us here is
that flex_bg and mballoc are separating the directory blocks from the
data blocks, allowing the directory blocks to be closely packed
together.
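(Whether flex_bg is actually enabled on a given filesystem is easy to
check from the superblock; device name again illustrative:)

```shell
# dumpe2fs -h prints only the superblock; flex_bg appears in the
# "Filesystem features:" line when the feature is enabled.
dumpe2fs -h /dev/sdb1 | grep -i features
```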

- Ted

2008-07-10 20:13:42

by Ric Wheeler

Subject: Re: suspiciously good fsck times?

Theodore Tso wrote:
> Based on the graphs which Eric posted, one interesting thing I think
> you'll find if you repeat the ext3 experiment with e2fsck -t -t is
> that pass 2 will take about seven times longer than pass 1. (Which is
> backwards from most e2fsck runs, where pass 2 takes about half of
> pass 1's run time --- although obviously that depends on how many
> directory blocks you have.)
>
>
Pass2 was where both spent most of their time, but I can rerun later to
validate that.

> Yes, some kind of reservation window would help on ext3 --- but the
> question is whether such a change would be too specific to this
> benchmark or not. Most of the time directories don't grow to such a
> huge size. So if you use a smallish reservation window (around 8
> blocks, say) for many directories, this might lead to more filesystem
> fragmentation that in the long run would cause the filesystem not to
> age well; it also wouldn't help much when you have over 11 million
> files in a directory, and the directory has over 100,000 blocks.
>
I think that the key is to lay out the directories (or files, for that
matter) in reasonably contiguous chunks. If we could always bump the
allocation up by enough to capture a full disk track (128k? 512k?),
you would probably be near optimal, but any significant portion of a
track would also help.

It would be interesting to rerun with the 46 million files in one
directory as well (basically, for working sets that have no natural
mapping into directories like some object based workloads).
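(A sketch of how such a single-directory working set could be
populated -- scaled way down here; the count and the temp directory
are placeholders, and a real run would use tens of millions of files:)

```shell
# Create N empty files in one flat directory.
N=1000
dir=$(mktemp -d)
i=0
while [ "$i" -lt "$N" ]; do
    : > "$dir/file$i"       # ':' with redirection creates an empty file
    i=$((i + 1))
done
echo "created $(ls "$dir" | wc -l) files in $dir"
```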

> I don't think delayed allocation is what's helping here either,
> because the journal will force the directory blocks to be placed as
> soon as we commit a transaction. I think what's saving us here is
> that flex_bg and mballoc are separating the directory blocks from the
> data blocks, allowing the directory blocks to be closely packed
> together.
>
> - Ted
>
I can try to validate that, thanks!

ric



2008-07-11 15:39:55

by Ric Wheeler

Subject: Re: suspiciously good fsck times?

Theodore Tso wrote:
> Based on the graphs which Eric posted, one interesting thing I think
> you'll find if you repeat the ext3 experiment with e2fsck -t -t is
> that pass 2 will take about seven times longer than pass 1. (Which is
> backwards from most e2fsck runs, where pass 2 takes about half of
> pass 1's run time --- although obviously that depends on how many
> directory blocks you have.)
>
> Yes, some kind of reservation window would help on ext3 --- but the
> question is whether such a change would be too specific to this
> benchmark or not. Most of the time directories don't grow to such a
> huge size. So if you use a smallish reservation window (around 8
> blocks, say) for many directories, this might lead to more filesystem
> fragmentation that in the long run would cause the filesystem not to
> age well; it also wouldn't help much when you have over 11 million
> files in a directory, and the directory has over 100,000 blocks.
>
> I don't think delayed allocation is what's helping here either,
> because the journal will force the directory blocks to be placed as
> soon as we commit a transaction. I think what's saving us here is
> that flex_bg and mballoc are separating the directory blocks from the
> data blocks, allowing the directory blocks to be closely packed
> together.
>
> - Ted
>

I made a new ext4 file system without flex_bg or uninit:

[root@localhost Perf]# /sbin/debuge4fs /dev/sdb1
debuge4fs 1.41-WIP (07-Jul-2008)
debuge4fs: feature
Filesystem features: has_journal ext_attr resize_inode dir_index filetype extent sparse_super large_file
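(For reference, a filesystem like this can be made by masking the
features at mkfs time with something like the following; the device
name is illustrative and the command destroys its contents:)

```shell
# ^feature disables a feature that mkfs would otherwise enable by
# default; uninit_bg is the on-disk name for uninitialized block groups.
mkfs.ext4 -O ^flex_bg,^uninit_bg /dev/sdb1
```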


The fsck time was a bit slower, but still looks like 8 minutes on ext4
vs 1 hour on ext3:

[root@localhost Perf]# umount /mnt
[root@localhost Perf]# time /sbin/fsck.ext4 -t -t -f /dev/sdb1
e4fsck 1.41-WIP (07-Jul-2008)
Pass 1: Checking inodes, blocks, and sizes
Pass 1: Memory used: 43944k/69424k (36476k/7469k), time: 352.48/93.27/29.45
Pass 1: I/O read: 14914MB, write: 0MB, rate: 42.31MB/s
Pass 2: Checking directory structure
Pass 2: Memory used: 71396k/61968k (51854k/19543k), time: 73.00/50.46/ 7.65
Pass 2: I/O read: 3023MB, write: 0MB, rate: 41.41MB/s
Pass 3: Checking directory connectivity
Peak memory: Memory used: 71396k/61968k (59307k/12090k), time: 425.82/143.83/37.10
Pass 3A: Memory used: 71396k/61968k (59307k/12090k), time: 0.00/ 0.00/ 0.00
Pass 3A: I/O read: 0MB, write: 0MB, rate: 0.00MB/s
Pass 3: Memory used: 71396k/61968k (51854k/19543k), time: 0.01/ 0.00/ 0.00
Pass 3: I/O read: 1MB, write: 0MB, rate: 76.91MB/s
Pass 4: Checking reference counts
Pass 4: Memory used: 71396k/44968k (27406k/43991k), time: 2.37/ 2.36/ 0.00
Pass 4: I/O read: 0MB, write: 0MB, rate: 0.00MB/s
Pass 5: Checking group summary information
Pass 5: Memory used: 71396k/240k (64671k/6726k), time: 63.60/ 4.98/ 0.33
Pass 5: I/O read: 37MB, write: 0MB, rate: 0.58MB/s
/dev/sdb1: 45600268/61054976 files (0.0% non-contiguous), 232657587/244190000 blocks
Memory used: 71396k/240k (64671k/6726k), time: 491.82/151.17/37.43
I/O read: 17974MB, write: 1MB, rate: 36.55MB/s

real 8m12.260s
user 2m31.167s
sys 0m37.766s


2008-07-14 21:19:27

by Andreas Dilger

Subject: Re: suspiciously good fsck times?

On Jul 10, 2008 16:13 -0400, Ric Wheeler wrote:
> It would be interesting to rerun with the 46 million files in one
> directory as well (basically, for working sets that have no natural
> mapping into directories like some object based workloads).

I think you'll hit a limit around 15M files in a single directory.
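(Back-of-the-envelope for where a limit of that order comes from,
assuming a 2-level htree, 4KB blocks, 8-byte index entries, and
roughly 60 usable entries per leaf block after the htree fill factor
-- all rough assumptions:)

```shell
block=4096
idx_per_block=$((block / 8))              # ~512 index entries per block
leaves=$((idx_per_block * idx_per_block)) # 2-level tree: ~262144 leaf blocks
per_leaf=60                               # rough average entries per leaf
echo $((leaves * per_leaf))               # on the order of 15 million entries
```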

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2008-07-15 00:48:13

by Ric Wheeler

Subject: Re: suspiciously good fsck times?

Andreas Dilger wrote:
> On Jul 10, 2008 16:13 -0400, Ric Wheeler wrote:
>
>> It would be interesting to rerun with the 46 million files in one
>> directory as well (basically, for working sets that have no natural
>> mapping into directories like some object based workloads).
>>
>
> I think you'll hit a limit around 15M files in a single directory.
>
> Cheers, Andreas
> --
>
Probably still worth a quick test, just to see how well it holds up at
the edge, thanks!

ric