Hi:
I promised back in mid-December to send out some benchmark numbers I'm
seeing with Frank Mayhar's work to allow ext4 to run without a journal. My
apologies for the delay...
I ran both iozone and compilebench on the following filesystems, using a
2.6.26-based kernel, with most ext4 patches applied. This is on an x86-based
4-core system, with a separate disk for these runs.
ext2, default create/mount options
ext3, default create/mount options
ext4, default create/mount options
ext4, created with "-O ^has_journal"
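For reference, the four configurations correspond to mke2fs invocations roughly like the following (the device name is hypothetical, and the exact options used aren't stated beyond "-O ^has_journal"):

```shell
# Hypothetical scratch device; the same disk was reused for every run.
DEV=/dev/sdb1

mke2fs $DEV                          # ext2, default options
mke2fs -j $DEV                       # ext3 (ext2 + journal), default options
mke2fs -t ext4 $DEV                  # ext4, default options
mke2fs -t ext4 -O ^has_journal $DEV  # ext4 without a journal
```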
For each filesystem, I ran each benchmark twice, doing a mke2fs before each
run. The same disk was used for each run; all benchmarks ran in the mount
directory of the newly mkfs'ed disk. I averaged the values for the two runs
for each FS/thread number.
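The per-test averaging can be sketched as below (this is an illustrative helper, not the actual post-processing script; it assumes the parenthesized figure is the sample standard deviation of the two runs, |a - b| / sqrt(2), and the input values are hypothetical):

```shell
# Average two runs' throughput figures and print "mean (stddev)".
avg_two() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    mean = (a + b) / 2
    # Sample standard deviation of two values: |a - b| / sqrt(2).
    sdev = (a > b ? a - b : b - a) / sqrt(2)
    printf "%.1f (%.1f)\n", mean, sdev
  }'
}

avg_two 56.5 56.7   # -> 56.6 (0.1)
```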
Iozone was run with the following command line:
iozone -t (# threads) -s 2g -r 256k -I -T -i0 -i1 -i2
I.e., throughput mode; a 2GiB file per thread; 256KiB record size; O_DIRECT.
Tests were limited to:
write/rewrite
read/re-read
random-read/write
I ran iozone twice for each FS: with a single thread (-t 1) and with 8
threads (-t 8).
Compilebench was run with the following command line:
compilebench -D (mount dir) -i 10 -r 30
I.e., 10 kernel trees, 30 "random operation" runs.
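Putting the invocations together, one pass over a freshly made filesystem looked roughly like the sketch below (dry-run: "run" just echoes each command so the sequence can be inspected; the mount point is hypothetical, and dropping the echo would execute for real):

```shell
# Dry-run sketch of one benchmark pass per filesystem.
run() { echo "$@"; }   # replace the echo with "$@" to actually run

MNT=/mnt/test          # hypothetical mount dir of the freshly mkfs'ed disk

for t in 1 8; do
  run iozone -t "$t" -s 2g -r 256k -I -T -i0 -i1 -i2
done
run compilebench -D "$MNT" -i 10 -r 30
```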
Results follow.
Thanks,
Curt
Iozone
======
ext2 : 1 thread
---------------
Average throughput:
Type Mean Stddev
initial_writers: 56.6 MB/s ( 0.2)
rewriters: 58.4 MB/s ( 0.2)
readers: 66.3 MB/s ( 0.2)
re-readers: 66.5 MB/s ( 0.0)
random_readers: 22.4 MB/s ( 0.1)
random_writers: 18.8 MB/s ( 0.0)
ext2 : 8 threads
----------------
Average throughput:
Type Mean Stddev
initial_writers: 28.5 MB/s ( 0.0)
rewriters: 43.5 MB/s ( 0.1)
readers: 51.5 MB/s ( 0.1)
re-readers: 51.8 MB/s ( 0.2)
random_readers: 20.3 MB/s ( 0.0)
random_writers: 17.3 MB/s ( 0.0)
ext3 : 1 thread
----------------
Average throughput:
Type Mean Stddev
initial_writers: 56.3 MB/s ( 0.2)
rewriters: 58.2 MB/s ( 0.1)
readers: 66.4 MB/s ( 0.1)
re-readers: 66.1 MB/s ( 0.2)
random_readers: 22.1 MB/s ( 0.1)
random_writers: 18.6 MB/s ( 0.1)
ext3 : 8 threads
----------------
Average throughput:
Type Mean Stddev
initial_writers: 28.7 MB/s ( 0.1)
rewriters: 43.2 MB/s ( 0.2)
readers: 51.5 MB/s ( 0.0)
re-readers: 51.5 MB/s ( 0.0)
random_readers: 20.2 MB/s ( 0.0)
random_writers: 17.3 MB/s ( 0.0)
ext4-nojournal : 1 thread
-------------------------
Average throughput:
Type Mean Stddev
initial_writers: 66.3 MB/s ( 0.2)
rewriters: 66.6 MB/s ( 0.1)
readers: 66.4 MB/s ( 0.0)
re-readers: 66.4 MB/s ( 0.0)
random_readers: 22.4 MB/s ( 0.1)
random_writers: 19.4 MB/s ( 0.2)
ext4-nojournal : 8 threads
--------------------------
Average throughput:
Type Mean Stddev
initial_writers: 56.1 MB/s ( 0.1)
rewriters: 60.3 MB/s ( 0.2)
readers: 61.0 MB/s ( 0.0)
re-readers: 61.0 MB/s ( 0.0)
random_readers: 20.4 MB/s ( 0.1)
random_writers: 18.3 MB/s ( 0.1)
ext4-stock : 1 thread
----------------------
Average throughput:
Type Mean Stddev
initial_writers: 65.5 MB/s ( 0.1)
rewriters: 65.7 MB/s ( 0.2)
readers: 65.8 MB/s ( 0.2)
re-readers: 65.6 MB/s ( 0.3)
random_readers: 21.9 MB/s ( 0.0)
random_writers: 19.1 MB/s ( 0.1)
ext4-stock : 8 threads
----------------------
Average throughput:
Type Mean Stddev
initial_writers: 53.7 MB/s ( 0.2)
rewriters: 58.3 MB/s ( 0.1)
readers: 58.8 MB/s ( 0.1)
re-readers: 59.0 MB/s ( 0.1)
random_readers: 20.2 MB/s ( 0.0)
random_writers: 18.1 MB/s ( 0.0)
Compilebench
============
ext2
----
Average values:
Type Mean Stddev
initial_create: 57.9 MB/s ( 1.9)
new_create: 13.0 MB/s ( 0.2)
patch: 7.3 MB/s ( 0.1)
compile: 25.6 MB/s ( 0.6)
clean: 70.4 MB/s ( 1.3)
read_tree: 22.1 MB/s ( 0.0)
read_compiled_tree: 33.3 MB/s ( 0.2)
delete_tree: 6.5 secs ( 0.2)
stat_tree: 5.2 secs ( 0.0)
stat_compiled_tree: 5.7 secs ( 0.1)
ext3
----
Average values:
Type Mean Stddev
initial_create: 30.6 MB/s ( 2.2)
new_create: 13.5 MB/s ( 0.2)
patch: 10.6 MB/s ( 0.1)
compile: 18.0 MB/s ( 0.3)
clean: 41.7 MB/s ( 1.8)
read_tree: 21.5 MB/s ( 0.2)
read_compiled_tree: 20.4 MB/s ( 1.1)
delete_tree: 13.5 secs ( 0.3)
stat_tree: 6.7 secs ( 0.4)
stat_compiled_tree: 9.6 secs ( 2.9)
ext4-nojournal
--------------
Average values:
Type Mean Stddev
initial_create: 77.1 MB/s ( 0.2)
new_create: 22.0 MB/s ( 0.1)
patch: 13.1 MB/s ( 0.0)
compile: 36.0 MB/s ( 0.1)
clean: 592.4 MB/s (39.4)
read_tree: 17.8 MB/s ( 0.2)
read_compiled_tree: 22.1 MB/s ( 0.1)
delete_tree: 2.5 secs ( 0.0)
stat_tree: 2.2 secs ( 0.0)
stat_compiled_tree: 2.5 secs ( 0.0)
ext4-stock
----------
Average values:
Type Mean Stddev
initial_create: 59.7 MB/s ( 0.4)
new_create: 20.5 MB/s ( 0.0)
patch: 12.5 MB/s ( 0.0)
compile: 33.9 MB/s ( 0.2)
clean: 539.5 MB/s ( 3.6)
read_tree: 17.1 MB/s ( 0.1)
read_compiled_tree: 21.8 MB/s ( 0.1)
delete_tree: 2.7 secs ( 0.1)
stat_tree: 2.4 secs ( 0.0)
stat_compiled_tree: 2.5 secs ( 0.2)
On Wed, Jan 07, 2009 at 11:29:11AM -0800, Curt Wohlgemuth wrote:
>
> I ran both iozone and compilebench on the following filesystems, using a
> 2.6.26-based kernel, with most ext4 patches applied. This is on an x86-based
> 4-core system, with a separate disk for these runs.
Curt, thanks for doing these test runs. One interesting thing to note
is that even though ext3 was running with barriers disabled, and ext4
was running with barriers enabled, ext4 still showed consistently
better results. (Or was this on an LVM/dm setup where barriers were
getting disabled?)
I took the liberty of reformatting the results so I could look at them
more easily:
Iozone, 1 Thread
Average throughput ext2 ext3 ext4 ext4-nojournal
Type Mean Stddev Mean Stddev Mean Stddev Mean Stddev
initl_writers: 56.6 MB/s (0.2) 56.3 MB/s (0.2) 65.5 MB/s (0.1) 66.3 MB/s (0.2)
rewriters: 58.4 MB/s (0.2) 58.2 MB/s (0.1) 65.7 MB/s (0.2) 66.6 MB/s (0.1)
readers: 66.3 MB/s (0.2) 66.4 MB/s (0.1) 65.8 MB/s (0.2) 66.4 MB/s (0.0)
re-readers: 66.5 MB/s (0.0) 66.1 MB/s (0.2) 65.6 MB/s (0.3) 66.4 MB/s (0.0)
random_readers: 22.4 MB/s (0.1) 22.1 MB/s (0.1) 21.9 MB/s (0.0) 22.4 MB/s (0.1)
random_writers: 18.8 MB/s (0.0) 18.6 MB/s (0.1) 19.1 MB/s (0.1) 19.4 MB/s (0.2)
Iozone, 8 Threads
Average throughput ext2 ext3 ext4 ext4-nojournal
Type Mean Stddev Mean Stddev Mean Stddev Mean Stddev
initl_writers: 28.5 MB/s (0.0) 28.7 MB/s (0.1) 53.7 MB/s (0.2) 56.1 MB/s (0.1)
rewriters: 43.5 MB/s (0.1) 43.2 MB/s (0.2) 58.3 MB/s (0.1) 60.3 MB/s (0.2)
readers: 51.5 MB/s (0.1) 51.5 MB/s (0.0) 58.8 MB/s (0.1) 61.0 MB/s (0.0)
re-readers: 51.8 MB/s (0.2) 51.5 MB/s (0.0) 59.0 MB/s (0.1) 61.0 MB/s (0.0)
random_readers: 20.3 MB/s (0.0) 20.2 MB/s (0.0) 20.2 MB/s (0.0) 20.4 MB/s (0.1)
random_writers: 17.3 MB/s (0.0) 17.3 MB/s (0.0) 18.1 MB/s (0.0) 18.3 MB/s (0.1)
Compilebench
Average values ext2 ext3 ext4 ext4-nojournal
Type Mean Stddev Mean Stddev Mean Stddev Mean Stddev
init_create: 57.9 MB/s (1.9) 30.6 MB/s (2.2) 59.7 MB/s (0.4) 77.1 MB/s ( 0.2)
new_create: 13.0 MB/s (0.2) 13.5 MB/s (0.2) 20.5 MB/s (0.0) 22.0 MB/s ( 0.1)
patch: 7.3 MB/s (0.1) 10.6 MB/s (0.1) 12.5 MB/s (0.0) 13.1 MB/s ( 0.0)
compile: 25.6 MB/s (0.6) 18.0 MB/s (0.3) 33.9 MB/s (0.2) 36.0 MB/s ( 0.1)
clean: 70.4 MB/s (1.3) 41.7 MB/s (1.8) 539.5 MB/s (3.6) 592.4 MB/s (39.4)
read_tree: 22.1 MB/s (0.0) 21.5 MB/s (0.2) 17.1 MB/s (0.1) 17.8 MB/s ( 0.2)
read_compld: 33.3 MB/s (0.2) 20.4 MB/s (1.1) 21.8 MB/s (0.1) 22.1 MB/s ( 0.1)
delete_tree: 6.5 secs (0.2) 13.5 secs (0.3) 2.7 secs (0.1) 2.5 secs ( 0.0)
stat_tree: 5.2 secs (0.0) 6.7 secs (0.4) 2.4 secs (0.0) 2.2 secs ( 0.0)
stat_compld: 5.7 secs (0.1) 9.6 secs (2.9) 2.5 secs (0.2) 2.5 secs ( 0.0)
A couple of things to note. If you were testing Frank's patches, I
made one additional optimization on top of his patch, which removed
the orphaned inode handling; that isn't necessary when running
without a journal. I'm not sure whether this would be measurable in
your benchmarks, since the inodes that would be getting modified were
probably going to be dirtied and require writeback anyway, but you
might get slightly better numbers with the version of the patch I
ultimately pushed to Linus.
The other thing to note is that in Compilebench's read_tree, ext2 and
ext3 score better than ext4. This is probably related to changes in
ext4's block/inode allocation heuristics, which is something we should
probably look at as part of a tuning exercise. The btrfs.boxacle.net
benchmarks showed something similar, which I would also attribute to
changes in ext4's allocation policies.
- Ted
Hi Ted:
On Wed, Jan 7, 2009 at 12:47 PM, Theodore Tso <[email protected]> wrote:
> On Wed, Jan 07, 2009 at 11:29:11AM -0800, Curt Wohlgemuth wrote:
>>
>> I ran both iozone and compilebench on the following filesystems, using a
>> 2.6.26-based kernel, with most ext4 patches applied. This is on an x86-based
>> 4-core system, with a separate disk for these runs.
>
> Curt, thanks for doing these test runs. One interesting thing to note
> is that even though ext3 was running with barriers disabled, and ext4
> was running with barriers enabled, ext4 still showed consistently
> better results. (Or was this on an LVM/dm setup where barriers were
> getting disabled?)
Nope. Barriers were enabled for both ext4 versions below.
> A couple of things to note. If you were testing Frank's patches, I
> made one additional optimization on top of his patch, which removed
> the orphaned inode handling; that isn't necessary when running
> without a journal. I'm not sure whether this would be measurable in
> your benchmarks, since the inodes that would be getting modified were
> probably going to be dirtied and require writeback anyway, but you
> might get slightly better numbers with the version of the patch I
> ultimately pushed to Linus.
I see the change you pushed; I'll integrate this and see if the
numbers look any different.
> The other thing to note is that in Compilebench's read_tree, ext2 and
> ext3 score better than ext4. This is probably related to changes in
> ext4's block/inode allocation heuristics, which is something we should
> probably look at as part of a tuning exercise. The btrfs.boxacle.net
> benchmarks showed something similar, which I would also attribute to
> changes in ext4's allocation policies.
Can you enlighten me as to what aspect of block allocation might be
involved in the slowdown here? Which block group these allocations
are made from? Or something more low-level than that?
Thanks,
Curt
On Wed, Jan 07, 2009 at 01:19:07PM -0800, Curt Wohlgemuth wrote:
> >
> > Curt, thanks for doing these test runs. One interesting thing to note
> > is that even though ext3 was running with barriers disabled, and ext4
> > was running with barriers enabled, ext4 still showed consistently
> > better results. (Or was this on an LVM/dm setup where barriers were
> > getting disabled?)
>
> Nope. Barriers were enabled for both ext4 versions below.
Well, barriers won't matter in the nojournal case, but it's nice to
know that for these workloads, ext4-stock (w/journalling) is faster
even than ext3 w/o barriers. That's probably not true for a
metadata-heavy workload with fsyncs, such as fsmark, though.
> > The other thing to note is that in Compilebench's read_tree, ext2 and
> > ext3 score better than ext4. This is probably related to changes in
> > ext4's block/inode allocation heuristics, which is something we should
> > probably look at as part of a tuning exercise. The btrfs.boxacle.net
> > benchmarks showed something similar, which I would also attribute to
> > changes in ext4's allocation policies.
>
> Can you enlighten me as to what aspect of block allocation might be
> involved in the slowdown here? Which block group these allocations
> are made from? Or something more low-level than that?
Ext4's block allocation algorithms are quite different from ext3's,
but that's not what I'm worried about. Ext4's mballoc algorithms are
much more aggressive about finding contiguous blocks, and that's a
good thing. There may be some issues with how it decides between
locality-group preallocation and streaming preallocation, but those
are tactical issues that in the end probably don't make that big a
difference. There may also be some issues with which block group
mballoc chooses when the home block group is full, but I suspect those
are second-order issues.
The bigger problem is the strategic-level issue of how inodes are
allocated, in particular when new directories are created. Ext4 is
much more aggressive about keeping subdirectories in the same block
group, and it completely disables the Orlov algorithm that spreads
out top-level directories and directories (such as /home) that have
the top-level directory flag set. Indeed, the new ext4 allocation
code doesn't differentiate between directories and regular inodes
in its allocation decisions at all.
My concern with the current algorithms is that for very short
benchmarks, it keeps everything very closely packed together at the
beginning of the filesystem, which is probably good for those
benchmarks. But for more complex benchmarks and longer-lived
filesystems where aging is a concern, the lack of spreading may cause
a much bigger set of problems, especially in the long term.
There are some other changes I want to make that involve avoiding
putting inodes in block groups that are a multiple of the flex block
group size, since all of the inode table blocks and block/inode
allocation bitmaps are stored in those block groups, and reserving
the blocks in those block groups for directory blocks; but that
requires testing to make sure it makes sense.
- Ted
On Jan 07, 2009 11:29 -0800, Curt Wohlgemuth wrote:
> Iozone was run with the following command line:
>
> iozone -t (# threads) -s 2g -r 256k -I -T -i0 -i1 -i2
>
> I.e., throughput mode; a 2GiB file per thread; 256KiB record size; O_DIRECT.
> Tests were limited to
How much RAM is on the test system? If the file size is only 2GB then
it will likely fit into RAM, which is possibly why the performance
numbers of all the filesystems are so close together. The other
possibility is that a single disk is the performance bottleneck and
all of the filesystems can feed a single disk at a reasonable rate.
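That first question can be checked mechanically; here's a small sketch (Linux-specific, since it reads /proc/meminfo; the 2 GiB figure matches the iozone per-thread file size above):

```shell
# Report whether a file of FILE_KB KiB fits in MEM_KB KiB of RAM.
fits_in_ram() { [ "$1" -le "$2" ] && echo "fits in RAM" || echo "exceeds RAM"; }

mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
echo "2 GiB iozone file: $(fits_in_ram $((2 * 1024 * 1024)) "$mem_kb")"
```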
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
Hi Andreas:
On Thu, Jan 8, 2009 at 5:03 AM, Andreas Dilger <[email protected]> wrote:
> On Jan 07, 2009 11:29 -0800, Curt Wohlgemuth wrote:
>> Iozone was run with the following command line:
>>
>> iozone -t (# threads) -s 2g -r 256k -I -T -i0 -i1 -i2
>>
>> I.e., throughput mode; a 2GiB file per thread; 256KiB record size; O_DIRECT.
>> Tests were limited to
>
> How much RAM is on the test system? If the file size is only 2GB then
> it will likely fit into RAM, which is possibly why the performance
> numbers of all the filesystems are so close together. The other possibility
> is that a single disk is the performance bottleneck and all of the
> filesystems can feed a single disk at a reasonable rate.
Indeed, the system was not memory-limited at all. I've done some
playing around with how limiting memory affects random reads in iozone
with O_DIRECT, and have found that, as expected, ext4 is much less
affected than ext2. I'm assuming this is because the metadata isn't
in the page cache, and the far larger number of metadata blocks on
ext2 than ext4 in this case causes a bigger hit on ext2.
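A back-of-envelope count illustrates the metadata gap (this assumes 4KiB blocks and classic indirect mapping, and glosses over the 12 direct pointers and extent-tree details):

```shell
blk=4096                                 # assumed filesystem block size, bytes
ptrs=$((blk / 4))                        # 1024 block pointers per indirect block
data=$((2 * 1024 * 1024 * 1024 / blk))   # 524288 data blocks in a 2GiB file
ind=$(( (data + ptrs - 1) / ptrs ))      # single-indirect blocks needed
dind=$(( (ind + ptrs - 1) / ptrs ))      # double-indirect blocks needed
echo "ext2-style mapping blocks: $((ind + dind))"   # -> 513
# An extent-mapped ext4 file of the same size typically needs only a
# handful of extent records, often held in the inode itself.
```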
If I generate numbers on a low-memory system, I'll post them here too.
Thanks,
Curt