LinuxLists.cc - EXT4 is ~2X as slow as XFS (593MB/s vs 304MB/s) for writes?

2010-02-27 00:31:07

by Justin Piszcz

Subject: EXT4 is ~2X as slow as XFS (593MB/s vs 304MB/s) for writes?

Hello,

Is it possible to 'optimize' ext4 so it is as fast as XFS for writes?
I see about half the performance of XFS for sequential writes.

I have checked the doc and tried several options, a few of which are shown
below (I have also tried the commit/journal_async/etc options but none of
them get the write speeds anywhere near XFS)?

Sure 'dd' is not a real benchmark, etc, etc, but with 10Gbps between 2
hosts I get 550MiB/s+ on reads from EXT4 but only 100-200MiB/s write.

When it was XFS I used to get 400-600MiB/s for writes for the same RAID
volume.

How do I 'speed' up ext4? Is it possible?

raid0_11 disks: (XFS)
# /dev/md0 /r1 xfs noatime 0 1
p63:/r1# dd if=/dev/zero of=bigfile1 bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 18.1021 s, 593 MB/s
p63:/r1#

raid0_11 disks: (EXT4)
# /dev/md0 /r1 ext4 noatime 0 1
# dd if=/dev/zero of=file bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 35.3741 s, 304 MB/s
p63:/r1#

Other tests (ext4)
p63:~# mount /dev/md0 /r1 -o data=writeback
p63:~# cd /r1
p63:/r1# dd if=/dev/zero of=file bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 39.8746 s, 269 MB/s
p63:/r1#

p63:~# mount /dev/md0 /r1 -o data=writeback,nobarrier
p63:/r1# dd if=/dev/zero of=file bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 40.0656 s, 268 MB/s
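A note on units, since this message mixes MB/s and MiB/s: GNU dd reports decimal megabytes (10^6 bytes) per second. A minimal sketch of the conversion follows; the function name throughput is made up for illustration, and awk is assumed to be available.

```shell
#!/bin/sh
# throughput BYTES SECONDS
# Prints the rate in MB/s (10^6 bytes, the unit GNU dd prints) and in
# MiB/s (2^20 bytes), both rounded to whole numbers.
# "throughput" is a hypothetical helper, not part of dd or coreutils.
throughput() {
    awk -v b="$1" -v s="$2" 'BEGIN {
        printf "%.0f MB/s = %.0f MiB/s\n", b / s / 1000000, b / s / 1048576
    }'
}

throughput 10737418240 18.1021   # the XFS run above
throughput 10737418240 35.3741   # the first ext4 run above
```

The roughly 2x ratio between the two filesystems is the same in either unit; only the absolute figures differ by about 5%.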

Justin.

2010-02-27 00:46:44

by Dmitry Monakhov

Subject: Re: EXT4 is ~2X as slow as XFS (593MB/s vs 304MB/s) for writes?

Justin Piszcz <[email protected]> writes:

> Hello,
>
> Is it possible to 'optimize' ext4 so it is as fast as XFS for writes?
> I see about half the performance as XFS for sequential writes.
>
> I have checked the doc and tried several options, a few of which are shown
> below (I have also tried the commit/journal_async/etc options but none of
> them get the write speeds anywhere near XFS)?
>
> Sure 'dd' is not a real benchmark, etc, etc, but with 10Gbps between 2
> hosts I get 550MiB/s+ on reads from EXT4 but only 100-200MiB/s write.
>
> When it was XFS I used to get 400-600MiB/s for writes for the same RAID
> volume.
>
> How do I 'speed' up ext4? Is it possible?
I don't know how to speed up ext4, but I do know how to slow down XFS :)
It seems you forgot to call fsync at the end of the file write.
In that case some data may still reside in the page cache.
Please add "conv=fsync" or "conv=fdatasync" to the dd command
and redo your measurements.
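As a small illustration of the suggestion, the sketch below writes a scratch file with conv=fsync so that dd flushes the data before it prints its timing. The scratch file comes from mktemp and is only a stand-in for a file on the array under test; bs=1048576 is the same as bs=1M, spelled out in bytes.

```shell
#!/bin/sh
# Write 8 MiB of zeros and have dd call fsync() before it reports the
# elapsed time and rate, so cached data is included in the measurement.
# conv=fdatasync is the variant that flushes file data only.
scratch=$(mktemp)    # stand-in scratch file, not the /r1 array
dd if=/dev/zero of="$scratch" bs=1048576 count=8 conv=fsync
size=$(wc -c < "$scratch")
echo "wrote $size bytes"
rm -f "$scratch"
```

For a real measurement the output file would live on the filesystem being tested and the count would be large enough to exceed RAM.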

2010-02-27 00:51:53

by Eric Sandeen

Subject: Re: EXT4 is ~2X as slow as XFS (593MB/s vs 304MB/s) for writes?

Justin Piszcz wrote:
> Hello,
>
> Is it possible to 'optimize' ext4 so it is as fast as XFS for writes?
> I see about half the performance as XFS for sequential writes.
>
> I have checked the doc and tried several options, a few of which are shown
> below (I have also tried the commit/journal_async/etc options but none
> of them get the write speeds anywhere near XFS)?
>
> Sure 'dd' is not a real benchmark, etc, etc, but with 10Gbps between 2
> hosts I get 550MiB/s+ on reads from EXT4 but only 100-200MiB/s write.
>
> When it was XFS I used to get 400-600MiB/s for writes for the same RAID
> volume.
>
> How do I 'speed' up ext4? Is it possible?

Aside from Dmitry's suggestion to time sync as well (although for 10G, you are
likely not leaving much in cache) I'd ask:

What kernel version? what xfsprogs/e2fsprogs version?

Were the filesystems created to align with raid geometry?

mkfs.xfs has done that forever; mkfs.ext4 only will do so (automatically)
with recent kernel+e2fsprogs.

-Eric

2010-02-27 01:05:40

by Justin Piszcz

Subject: Re: EXT4 is ~2X as slow as XFS (593MB/s vs 304MB/s) for writes?

On Sat, 27 Feb 2010, Dmitry Monakhov wrote:

> Justin Piszcz <[email protected]> writes:
>
>> Hello,
>>
>> Is it possible to 'optimize' ext4 so it is as fast as XFS for writes?
>> I see about half the performance as XFS for sequential writes.
>>
>> I have checked the doc and tried several options, a few of which are shown
>> below (I have also tried the commit/journal_async/etc options but none of
>> them get the write speeds anywhere near XFS)?
>>
>> Sure 'dd' is not a real benchmark, etc, etc, but with 10Gbps between 2
>> hosts I get 550MiB/s+ on reads from EXT4 but only 100-200MiB/s write.
>>
>> When it was XFS I used to get 400-600MiB/s for writes for the same RAID
>> volume.
>>
>> How do I 'speed' up ext4? Is it possible?
> I don't know how to speed up ext4, but I do know how to slow down XFS :)
> It seems you forgot to call fsync at the end of the file write.
> In that case some data may still reside in the page cache.
> Please add "conv=fsync" or "conv=fdatasync" to the dd command
> and redo your measurements.

Hi,

First with a sync added in the total time (still 2x as fast)

EXT4:
p63:~# mount /dev/md0 -o nobarrier,data=writeback /r1
p63:~# cd /r1
p63:/r1# /usr/bin/time bash -c 'dd if=/dev/zero of=file bs=1M count=10240; sync'
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 35.4163 s, 303 MB/s
0.02user 19.85system 0:36.97elapsed 53%CPU (0avgtext+0avgdata 7296maxresident)k
0inputs+0outputs (5major+1145minor)pagefaults 0swaps

XFS:
p63:/r1# /usr/bin/time bash -c 'dd if=/dev/zero of=file bs=1M count=10240; sync'
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 18.08 s, 594 MB/s
0.03user 16.15system 0:18.67elapsed 86%CPU (0avgtext+0avgdata 7312maxresident)k
0inputs+0outputs (5major+1147minor)pagefaults 0swaps
p63:/r1#

Per your request: conv=fsync & conv=fdatasync

XFS:
p63:/r1# /usr/bin/time bash -c 'dd if=/dev/zero of=file bs=1M conv=fsync count=10240'
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 18.2142 s, 590 MB/s
0.03user 16.05system 0:18.21elapsed 88%CPU (0avgtext+0avgdata 7312maxresident)k
0inputs+0outputs (0major+832minor)pagefaults 0swaps
p63:/r1#

EXT4:
p63:/r1# /usr/bin/time bash -c 'dd if=/dev/zero of=file bs=1M conv=fdatasync count=10240'
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 39.5562 s, 271 MB/s

XFS:
p63:/r1# /usr/bin/time bash -c 'dd if=/dev/zero of=file bs=1M conv=fdatasync count=10240'
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 18.513 s, 580 MB/s
0.03user 16.25system 0:18.51elapsed 87%CPU (0avgtext+0avgdata 7312maxresident)k
0inputs+0outputs (5major+828minor)pagefaults 0swaps
p63:/r1#

p63:/r1# /usr/bin/time bash -c 'dd if=/dev/zero of=file bs=1M conv=fsync count=10240'
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 39.7859 s, 270 MB/s
0.02user 24.20system 0:39.79elapsed 60%CPU (0avgtext+0avgdata 7328maxresident)k
0inputs+0outputs (5major+829minor)pagefaults 0swaps
p63:/r1#

It is still 2x as fast?
Is there some other option I am missing here or is this correct?

Justin.

2010-02-27 01:08:32

by Justin Piszcz

Subject: Re: EXT4 is ~2X as slow as XFS (593MB/s vs 304MB/s) for writes?

On Fri, 26 Feb 2010, Eric Sandeen wrote:

> Justin Piszcz wrote:
>> Hello,
>>
>> Is it possible to 'optimize' ext4 so it is as fast as XFS for writes?
>> I see about half the performance as XFS for sequential writes.
>>
>> I have checked the doc and tried several options, a few of which are shown
>> below (I have also tried the commit/journal_async/etc options but none
>> of them get the write speeds anywhere near XFS)?
>>
>> Sure 'dd' is not a real benchmark, etc, etc, but with 10Gbps between 2
>> hosts I get 550MiB/s+ on reads from EXT4 but only 100-200MiB/s write.
>>
>> When it was XFS I used to get 400-600MiB/s for writes for the same RAID
>> volume.
>>
>> How do I 'speed' up ext4? Is it possible?
>
> Aside from Dmitry's suggestion to time sync as well (although for 10G, you are
> likely not leaving much in cache) I'd ask:
>
> What kernel version? what xfsprogs/e2fsprogs version?
2.6.33/x86_64

ii xfsprogs 3.1.1 Utilities for managing the XFS filesystem
ii e2fsprogs 1.41.10-1 ext2/ext3/ext4 file system utilities

>
> Were the filesystems created to align with raid geometry?
Only default options were used except the mount options. If that is the
culprit, I have some more testing to do, thanks, will look into it.

>
> mkfs.xfs has done that forever; mkfs.ext4 only will do so (automatically)
> with recent kernel+e2fsprogs.
How recent?

2010-02-27 01:12:07

by Eric Sandeen

Subject: Re: EXT4 is ~2X as slow as XFS (593MB/s vs 304MB/s) for writes?

Justin Piszcz wrote:
...

>> Were the filesystems created to align with raid geometry?
> Only default options were used except the mount options. If that is the
> culprit, I have some more testing to do, thanks, will look into it.
>
>>
>> mkfs.xfs has done that forever; mkfs.ext4 only will do so (automatically)
>> with recent kernel+e2fsprogs.
> How recent?

You're recent enough. :)

mkfs.ext4 output should include the stripe info if it was found.

printf(_("Block size=%u (log=%u)\n"), fs->blocksize,
s->s_log_block_size);
printf(_("Fragment size=%u (log=%u)\n"), fs->fragsize,
s->s_log_frag_size);
printf(_("Stride=%u blocks, Stripe width=%u blocks\n"),
s->s_raid_stride, s->s_raid_stripe_width);
printf(_("%u inodes, %llu blocks\n"), s->s_inodes_count,
ext2fs_blocks_count(s));

etc.
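To check for that line without reading the whole mkfs output, something like the following could be used. It is only a sketch: raid_geometry is a made-up helper name, and it assumes the "Stride=... blocks, Stripe width=... blocks" wording printed by the code above.

```shell
#!/bin/sh
# raid_geometry: read mke2fs output on stdin and print "<stride> <stripe width>"
# taken from the "Stride=N blocks, Stripe width=M blocks" line.
# Hypothetical helper; prints nothing if the line is absent.
raid_geometry() {
    sed -n 's/^Stride=\([0-9]*\) blocks, Stripe width=\([0-9]*\) blocks$/\1 \2/p'
}

# Demo on two lines in the format mke2fs prints:
printf 'Block size=4096 (log=2)\nStride=256 blocks, Stripe width=2816 blocks\n' |
    raid_geometry
```

A result of "0 0" would mean mkfs did not detect (or was not given) any RAID geometry.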

-Eric

2010-02-27 01:28:47

by Eric Sandeen

Subject: Re: EXT4 is ~2X as slow as XFS (593MB/s vs 304MB/s) for writes?

Eric Sandeen wrote:
> Justin Piszcz wrote:
> ...
>
>>> Were the filesystems created to align with raid geometry?
>> Only default options were used except the mount options. If that is the
>> culprit, I have some more testing to do, thanks, will look into it.
>>
>>> mkfs.xfs has done that forever; mkfs.ext4 only will do so (automatically)
>>> with recent kernel+e2fsprogs.
>> How recent?
>
> You're recent enough. :)

Oh, you need very recent util-linux-ng as well, and use libblkid from there
with:

[e2fsprogs] # ./configure --disable-libblkid

Otherwise you can just feed mkfs.ext4 stripe & stride manually.

-Eric

2010-02-27 10:14:52

by Justin Piszcz

Subject: Re: EXT4 is ~2X as slow as XFS (593MB/s vs 304MB/s) for writes?

On Fri, 26 Feb 2010, Eric Sandeen wrote:

> Eric Sandeen wrote:
>
> Oh, you need very recent util-linux-ng as well, and use libblkid from there
> with:
>
> [e2fsprogs] # ./configure --disable-libblkid
>
> Otherwise you can just feed mkfs.ext4 stripe & stride manually.
>
> -Eric
>

Hi,

Even when set, there is still poor performance:

http://busybox.net/~aldot/mkfs_stride.html
Raid Level: 0
Number of Physical Disks: 11
RAID chunk size (in KiB): 1024
number of filesystem blocks (in KiB)
mkfs.ext4 -b 4096 -E stride=256,stripe-width=2816
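The calculator's arithmetic can be reproduced by hand: stride is the RAID chunk size divided by the filesystem block size, and stripe-width is stride times the number of data-bearing disks (all 11 for this RAID-0; one fewer than the disk count for RAID-5). A minimal sketch, with ext_raid_opts as a made-up helper name:

```shell
#!/bin/sh
# ext_raid_opts CHUNK_KIB BLOCK_BYTES DATA_DISKS
# Prints the value for mkfs.ext4's -E option.
#   stride       = chunk size / filesystem block size   (in fs blocks)
#   stripe-width = stride * number of data-bearing disks
# Hypothetical helper; DATA_DISKS excludes parity disks.
ext_raid_opts() {
    stride=$(( $1 * 1024 / $2 ))
    stripe_width=$(( stride * $3 ))
    echo "stride=$stride,stripe-width=$stripe_width"
}

ext_raid_opts 1024 4096 11   # 1024 KiB chunks: stride=256,stripe-width=2816
ext_raid_opts 64 4096 11     # 64 KiB chunks:   stride=16,stripe-width=176
```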

p63:~# /usr/bin/time mkfs.ext4 -b 4096 -E stride=256,stripe-width=2816 /dev/md0
mke2fs 1.41.10 (10-Feb-2009)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=256 blocks, Stripe width=2816 blocks
335765504 inodes, 1343055824 blocks
67152791 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
40987 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
102400000, 214990848, 512000000, 550731776, 644972544

Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 38 mounts or
180 days, whichever comes first. Use tune2fs -c or -i to override.
p63:~#

p63:~# mount /dev/md0 /r1 -o nobarrier,data=writeback
p63:/r1# dd if=/dev/zero of=file bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 39.3674 s, 273 MB/s
p63:/r1#

Still very slow?

Let's try with some optimizations:
p63:/r1# mount /dev/md0 /r1 -o noatime,barrier=0,data=writeback,nobh,commit=100,nouser_xattr,nodelalloc,max_batch_time=0^C

Still not anywhere near 500-600MiB/s of XFS:
p63:/r1# dd if=/dev/zero of=file bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 30.4824 s, 352 MB/s
p63:/r1#

Am I doing something wrong, or is there a flag I am missing that will speed
it up? Or is this the expected performance for sequential writes on EXT4?

Justin.

2010-02-27 10:51:16

by Justin Piszcz

Subject: Re: EXT4 is ~2X as slow as XFS (593MB/s vs 304MB/s) for writes?

On Sat, 27 Feb 2010, Justin Piszcz wrote:

>
>
> On Fri, 26 Feb 2010, Eric Sandeen wrote:
>
> > Eric Sandeen wrote:
> >
> > Oh, you need very recent util-linux-ng as well, and use libblkid from there
> > with:
> >
> > [e2fsprogs] # ./configure --disable-libblkid
> >
> > Otherwise you can just feed mkfs.ext4 stripe & stride manually.
> >
> > -Eric
> >

I also tried the default chunk size (64KiB) in case ext4 had a problem
with chunk sizes > 64KiB; the results were the same for ext4. I also tried
ext2 and ext3 to see how they would perform:

p63:~# mkfs.ext2 -b 4096 -E stride=16,stripe-width=176 /dev/md0
p63:~# mount /dev/md0 /r1
p63:/r1# dd if=/dev/zero of=file bs=1M count=10240
10737418240 bytes (11 GB) copied, 19.9434 s, 538 MB/s
p63:/r1#

p63:~# mkfs.ext3 -b 4096 -E stride=16,stripe-width=176 /dev/md0
p63:~# mount /dev/md0 /r1
p63:/r1# dd if=/dev/zero of=file bs=1M count=10240
10737418240 bytes (11 GB) copied, 31.0195 s, 346 MB/s

p63:~# mkfs.ext4 -b 4096 -E stride=16,stripe-width=176 /dev/md0
p63:~# mount /dev/md0 /r1
p63:/r1# dd if=/dev/zero of=file bs=1M count=10240
10737418240 bytes (11 GB) copied, 35.3866 s, 303 MB/s

And, for comparison, XFS:
p63:~# mkfs.xfs -f /dev/md0 > /dev/null 2>&1
p63:~# mount /dev/md0 /r1
p63:~# cd /r1
p63:/r1# dd if=/dev/zero of=file bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 18.1527 s, 592 MB/s
p63:/r1#

2010-02-27 11:09:08

by Justin Piszcz

Subject: Re: EXT4 is ~2X as slow as XFS (593MB/s vs 304MB/s) for writes?

Hi,

I have found the same results on 2 different systems:

It seems to peak at ~350MiB/s performance on mdadm raid, whether
a RAID-5 or RAID-0 (two separate machines):

The only option I found that allows it to go from:
10737418240 bytes (11 GB) copied, 48.7335 s, 220 MB/s
to
10737418240 bytes (11 GB) copied, 30.5425 s, 352 MB/s

Is the -o nodelalloc option.

The question is: why does it not break the 350MiB/s barrier?

Justin.

2010-02-27 11:36:38

by Justin Piszcz

Subject: Re: EXT4 is ~2X as slow as XFS (593MB/s vs 304MB/s) for writes?

Besides large sequential I/O, ext4 seems to be MUCH faster than XFS when
working with many small files.

EXT4

p63:/r1# sync; /usr/bin/time bash -c 'tar xf linux-2.6.33.tar; sync'
0.18user 2.43system 0:02.86elapsed 91%CPU (0avgtext+0avgdata 5216maxresident)k
0inputs+0outputs (0major+971minor)pagefaults 0swaps
linux-2.6.33 linux-2.6.33.tar
p63:/r1# sync; /usr/bin/time bash -c 'rm -rf linux-2.6.33; sync'
0.02user 0.98system 0:01.03elapsed 97%CPU (0avgtext+0avgdata 5216maxresident)k
0inputs+0outputs (0major+865minor)pagefaults 0swaps

XFS

p63:/r1# sync; /usr/bin/time bash -c 'tar xf linux-2.6.33.tar; sync'
0.20user 2.62system 1:03.90elapsed 4%CPU (0avgtext+0avgdata 5200maxresident)k
0inputs+0outputs (0major+970minor)pagefaults 0swaps
p63:/r1# sync; /usr/bin/time bash -c 'rm -rf linux-2.6.33; sync'
0.03user 2.02system 0:29.04elapsed 7%CPU (0avgtext+0avgdata 5200maxresident)k
0inputs+0outputs (0major+864minor)pagefaults 0swaps

So I guess that's the tradeoff, for massive I/O you should use XFS, else,
use EXT4?

I still would like to know however, why 350MiB/s seems to be the maximum
performance I can get from two different md raids (that easily do 600MiB/s
with XFS).

Is this a performance issue within ext4 and md-raid?
The problem does not exist with xfs and md-raid.

Justin.

2010-02-28 05:42:56

by Theodore Ts'o

Subject: Re: EXT4 is ~2X as slow as XFS (593MB/s vs 304MB/s) for writes?

On Sat, Feb 27, 2010 at 06:36:37AM -0500, Justin Piszcz wrote:
>
> I still would like to know however, why 350MiB/s seems to be the maximum
> performance I can get from two different md raids (that easily do 600MiB/s
> with XFS).

Can you run "filefrag -v <filename>" on the large file you created
using dd? Part of the problem may be the block allocator simply not
being well optimized for super large writes. To be honest, that's not
something we've tried (at all) to optimize, mainly because for most
users of ext4 they're more interested in much more reasonable sized
files, and we only have so many hours in a day to hack on ext4. :-)
XFS in contrast has in the past had plenty of paying customers
interested in writing really large scientific data sets, so this is
something Irix *has* spent time optimizing.

As far as I know none of the ext4 developers' day jobs are currently
focused on really large files using ext4. Some of us do use ext4 to
support really large files, but it's via some kind of cluster or
parallel file system layered on top of ext4 (i.e., Sun/Clusterfs
Lustre File Systems, or Google's GFS) --- and so what gets actually
stored in ext4 isn't a single 10-20 gigabyte file.

I'm saying this not as an excuse; but it's an explanation for why no
one has really noticed this performance problem until you brought it
up. I'd like to see ext4 be a good general purpose file system, which
includes handling the really big files stored in a single system. But
it's just not something we've tried optimizing yet.

So if you can gather some data, such as the filefrag information, that
would be a great first step. Something else that would be useful is
gathering blktrace information, so we can see how we are scheduling
the writes and whether we have something bad going on there. I
wouldn't be surprised if there is some stupidity going on in the
generic FS/MM writeback code which is throttling us, and which XFS has
worked around. Ext4 has worked around some writeback brain-damage
already, but I've been focused on much smaller files (files in the
tens or hundreds megabytes) since that's what I tend to use much more
frequently.

It's great to see that you're really interested in this; if you're
willing to do some investigative work, hopefully it's something we can
address.

Best Regards,

- Ted

P.S. I'm a bit unclear regarding your comment about "-o nodelalloc"
in one of your earlier messages. Does using nodelalloc actually speed
things up? There were a bunch of numbers being thrown around, and in
some configurations I thought you were getting around 300 MB/s without
using nodelalloc? Or am I misunderstanding your numbers and what
configurations you used with each test run?

If nodelalloc is actually speeding things up, then we almost certainly
have some kind of writeback problem. So filefrag and blktrace are
definitely the tools we need to look at to understand what is going
on.

2010-02-28 14:55:42

by Justin Piszcz

Subject: Re: EXT4 is ~2X as slow as XFS (593MB/s vs 304MB/s) for writes?

On Sun, 28 Feb 2010, [email protected] wrote:

> On Sat, Feb 27, 2010 at 06:36:37AM -0500, Justin Piszcz wrote:
>>
>> I still would like to know however, why 350MiB/s seems to be the maximum
>> performance I can get from two different md raids (that easily do 600MiB/s
>> with XFS).

> Can you run "filefrag -v <filename>" on the large file you created
> using dd? Part of the problem may be the block allocator simply not
> being well optimized for super large writes. To be honest, that's not
> something we've tried (at all) to optimize, mainly because for most
> users of ext4 they're more interested in much more reasonable sized
> files, and we only have so many hours in a day to hack on ext4. :-)
> XFS in contrast has in the past had plenty of paying customers
> interested in writing really large scientific data sets, so this is
> something Irix *has* spent time optimizing.
Yes, this is shown at the bottom of the e-mail both with -o data=ordered
and data=writeback.

[ .. ]

> So if you can gather some data, such as the filefrag information, that
> would be a great first step. Something else that would be useful is
> gathering blktrace information, so we can see how we are scheduling
> the writes and whether we have something bad going on there. I
> wouldn't be surprised if there is some stupidity going on in the
> generic FS/MM writeback code which is throttling us, and which XFS has
> worked around. Ext4 has worked around some writeback brain-damage
> already, but I've been focused on much smaller files (files in the
> tens or hundreds megabytes) since that's what I tend to use much more
> frequently.
>
> It's great to see that you're really interested in this; if you're
> willing to do some investigative work, hopefully it's something we can
> address.

[ .. ]

> P.S. I'm a bit unclear regarding your comment about "-o nodelalloc"
> in one of your earlier messages. Does using nodelalloc actually speed
> things up? There were a bunch of numbers being thrown around, and in
> some configurations I thought you were getting around 300 MB/s without
> using nodelalloc? Or am I misunderstanding your numbers and what
> configurations you used with each test run?
This is more dramatic on the software raid (mdadm) RAID-5 configuration.
Without -o nodelalloc, I see roughly 200MiB/s. With -o nodelalloc, I hit
the same barrier as the RAID-0, 350MiB/s, but its effect on RAID-0 is less
dramatic. The full tests and output appear at the bottom of this e-mail;
however, for brevity, the example below shows 55MiB/s and 132MiB/s
performance increases with RAID-0 and RAID-5 respectively:

For the RAID-0:

-o data=writeback,nobarrier:
10737418240 bytes (11 GB) copied, 34.755 s, 309 MB/s
-o data=writeback,nobarrier,nodelalloc:
10737418240 bytes (11 GB) copied, 29.5299 s, 364 MB/s
An increase of 55MiB/s.

For the RAID-5 (from earlier testing):

-o data=writeback,nobarrier:
10737418240 bytes (11 GB) copied, 48.7335 s, 220 MB/s
-o data=writeback,nobarrier,nodelalloc:
10737418240 bytes (11 GB) copied, 30.5425 s, 352 MB/s
An increase of 132MiB/s.

>
> If nodelalloc is actually speeding things up, then we almost certainly
> have some kind of writeback problem. So filefrag and blktrace are
> definitely the tools we need to look at to understand what is going
> on.
>

=== CREATE RAID-0 WITH 11 DISKS

p63:~# mdadm --create -e 0.90 /dev/md0 /dev/sd[b-l]1 --level=0 -n 11 -c 64
mdadm: /dev/sdb1 appears to be part of a raid array:
level=raid0 devices=11 ctime=Sun Feb 28 06:24:58 2010
mdadm: /dev/sdc1 appears to be part of a raid array:
level=raid0 devices=11 ctime=Sun Feb 28 06:24:58 2010
mdadm: /dev/sdd1 appears to be part of a raid array:
level=raid0 devices=11 ctime=Sun Feb 28 06:24:58 2010
mdadm: /dev/sde1 appears to be part of a raid array:
level=raid0 devices=11 ctime=Sun Feb 28 06:24:58 2010
mdadm: /dev/sdf1 appears to be part of a raid array:
level=raid0 devices=11 ctime=Sun Feb 28 06:24:58 2010
mdadm: /dev/sdg1 appears to be part of a raid array:
level=raid0 devices=11 ctime=Sun Feb 28 06:24:58 2010
mdadm: /dev/sdh1 appears to be part of a raid array:
level=raid0 devices=11 ctime=Sun Feb 28 06:24:58 2010
mdadm: /dev/sdi1 appears to be part of a raid array:
level=raid0 devices=11 ctime=Sun Feb 28 06:24:58 2010
mdadm: /dev/sdj1 appears to be part of a raid array:
level=raid0 devices=11 ctime=Sun Feb 28 06:24:58 2010
mdadm: /dev/sdk1 appears to be part of a raid array:
level=raid0 devices=11 ctime=Sun Feb 28 06:24:58 2010
mdadm: /dev/sdl1 appears to be part of a raid array:
level=raid0 devices=11 ctime=Sun Feb 28 06:24:58 2010
Continue creating array? y
mdadm: array /dev/md0 started.
p63:~#

=== SHOW MDADM RAID-0

p63:~# mdadm -D /dev/md0
/dev/md0:
Version : 0.90
Creation Time : Sun Feb 28 06:31:41 2010
Raid Level : raid0
Array Size : 5372223296 (5123.35 GiB 5501.16 GB)
Raid Devices : 11
Total Devices : 11
Preferred Minor : 0
Persistence : Superblock is persistent

Update Time : Sun Feb 28 06:31:41 2010
State : clean
Active Devices : 11
Working Devices : 11
Failed Devices : 0
Spare Devices : 0

Chunk Size : 64K

UUID : 077d4d5c:5acbcb29:26614430:c3345183 (local to host p63)
Events : 0.1

Number Major Minor RaidDevice State
0 8 17 0 active sync /dev/sdb1
1 8 33 1 active sync /dev/sdc1
2 8 49 2 active sync /dev/sdd1
3 8 65 3 active sync /dev/sde1
4 8 81 4 active sync /dev/sdf1
5 8 97 5 active sync /dev/sdg1
6 8 113 6 active sync /dev/sdh1
7 8 129 7 active sync /dev/sdi1
8 8 145 8 active sync /dev/sdj1
9 8 161 9 active sync /dev/sdk1
10 8 177 10 active sync /dev/sdl1
p63:~#

=== KERNEL CONFIGURATION BASELINE

The following kernel configuration was used:
http://home.comcast.net/~jpiszcz/20100228/config-2.6.33-baseline.txt

=== ESTABLISH CONTROL / BASELINE

p63:~# mkfs.xfs /dev/md0 -f
meta-data=/dev/md0 isize=256 agcount=32, agsize=41970496 blks
= sectsz=512 attr=2
data = bsize=4096 blocks=1343055824, imaxpct=5
= sunit=16 swidth=176 blks
naming =version 2 bsize=4096 ascii-ci=0
log =internal log bsize=4096 blocks=521728, version=2
= sectsz=512 sunit=16 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
p63:~# mount /dev/md0 /r1 -o nobarrier
p63:/r1# /usr/bin/time dd if=/dev/zero of=bigfile bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 17.9816 s, 597 MB/s
0.03user 16.10system 0:17.99elapsed 89%CPU (0avgtext+0avgdata 7312maxresident)k
0inputs+0outputs (1major+495minor)pagefaults 0swaps
p63:/r1#

p63:/r1# xfs_bmap -v /r1/bigfile
/r1/bigfile:
EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS
0: [0..20971519]: 671528064..692499583 2 (128..20971647) 20971520 00011
p63:/r1#

=== CREATE EXT4 FILESYSTEM ON ARRAY

NOTE: The stride/stripe width appears to be irrelevant to the speed
problem, as the cap is '350MiB/s' whether the filesystem is aligned or
not. See the following URL for those tests, which compare ext2 vs. ext3
vs. ext4 vs. XFS:
http://lkml.org/lkml/2010/2/27/77

NOTE: nobarrier does not seem to be a factor either, but I will include
it below to ensure it is not somehow impacting the tests performed.

p63:~# /usr/bin/time mkfs.ext4 /dev/md0
mke2fs 1.41.10 (10-Feb-2009)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=0 blocks, Stripe width=0 blocks
335765504 inodes, 1343055824 blocks
67152791 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
40987 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
102400000, 214990848, 512000000, 550731776, 644972544

Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 36 mounts or
180 days, whichever comes first. Use tune2fs -c or -i to override.
6.50user 83.89system 2:01.86elapsed 74%CPU (0avgtext+0avgdata 829552maxresident)k
0inputs+0outputs (5major+51889minor)pagefaults 0swaps
p63:~#

=== MOUNT FILESYSTEM WITH NOBARRIER, ORDERED (DEFAULT) & RUN TEST

p63:/r1# /usr/bin/time dd if=/dev/zero of=bigfile bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 35.2676 s, 304 MB/s
0.02user 19.40system 0:35.29elapsed 55%CPU (0avgtext+0avgdata 7312maxresident)k
0inputs+0outputs (3major+493minor)pagefaults 0swaps
p63:/r1#

=== SHOW FILEFRAG OUTPUT (NOBARRIER,ORDERED)

p63:/r1# filefrag -v /r1/bigfile
Filesystem type is: ef53
File size of /r1/bigfile is 10737418240 (2621440 blocks, blocksize 4096)
ext logical physical expected length flags
0 0 34816 32768
1 32768 67584 30720
2 63488 100352 98303 32768
3 96256 133120 30720
4 126976 165888 163839 32768
5 159744 198656 30720
6 190464 231424 229375 32768
7 223232 264192 30720
8 253952 296960 294911 32768
9 286720 329728 32768
10 319488 362496 32768
11 352256 395264 32768
12 385024 428032 32768
13 417792 460800 32768
14 450560 493568 30720
15 481280 557056 524287 32768
16 514048 589824 32768
17 546816 622592 32768
18 579584 655360 32768
19 612352 688128 32768
20 645120 720896 32768
21 677888 753664 32768
22 710656 786432 32768
23 743424 821248 819199 32768
24 776192 854016 30720
25 806912 886784 884735 32768
26 839680 919552 32768
27 872448 952320 32768
28 905216 985088 32768
29 937984 1017856 30720
30 968704 1081344 1048575 32768
31 1001472 1114112 32768
32 1034240 1146880 32768
33 1067008 1179648 32768
34 1099776 1212416 32768
35 1132544 1245184 32768
36 1165312 1277952 32768
37 1198080 1310720 32768
38 1230848 1343488 32768
39 1263616 1376256 32768
40 1296384 1409024 32768
41 1329152 1441792 32768
42 1361920 1474560 32768
43 1394688 1507328 32768
44 1427456 1540096 32768
45 1460224 1607680 1572863 32768
46 1492992 1640448 32768
47 1525760 1673216 32768
48 1558528 1705984 32768
49 1591296 1738752 32768
50 1624064 1771520 32768
51 1656832 1804288 32768
52 1689600 1837056 32768
53 1722368 1869824 32768
54 1755136 1902592 32768
55 1787904 1935360 32768
56 1820672 1968128 32768
57 1853440 2000896 32768
58 1886208 2033664 32768
59 1918976 2066432 30720
60 1949696 2129920 2097151 32768
61 1982464 2162688 32768
62 2015232 2195456 32768
63 2048000 2228224 32768
64 2080768 2260992 32768
65 2113536 2293760 32768
66 2146304 2326528 32768
67 2179072 2359296 32768
68 2211840 2392064 32768
69 2244608 2424832 32768
70 2277376 2457600 32768
71 2310144 2490368 32768
72 2342912 2523136 32768
73 2375680 2555904 32768
74 2408448 2588672 32768
75 2441216 2656256 2621439 32768
76 2473984 2689024 32768
77 2506752 2721792 32768
78 2539520 2754560 32768
79 2572288 2787328 18432
80 2590720 2818048 2805759 30720 eof
/r1/bigfile: 13 extents found
p63:/r1#
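
The listing has 81 rows while the summary line reports 13 extents. One plausible reading is that the summary counts physically contiguous runs: a row that carries a value in the "expected" column does not start where the previous row ended. The awk sketch below counts rows and runs under that assumption. The function name count_contiguous_runs is made up for illustration, and the column layout is assumed to match the filefrag output above.

```shell
# Count rows and physically contiguous runs in `filefrag -v` output
# (old column layout: ext logical physical [expected] length [flags]).
count_contiguous_runs() {
  awk '
    /^ *[0-9]+ / {
      rows++
      # Count purely numeric fields; a trailing flag such as "eof" is skipped.
      nums = 0
      for (i = 1; i <= NF; i++) if ($i ~ /^[0-9]+$/) nums++
      # Five numeric fields means the "expected" column is present,
      # i.e. this row does not follow directly after the previous one.
      if (nums >= 5) breaks++
    }
    END { printf "rows=%d runs=%d\n", rows, (rows ? breaks + 1 : 0) }
  '
}

# Usage (reads the listing on stdin):
#   filefrag -v /r1/bigfile | count_contiguous_runs
```

Fed the listing above, this should report rows=81 runs=13, which agrees with the "13 extents found" line.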

=== MOUNT FILESYSTEM WITH NOBARRIER, WRITEBACK & RUN TEST

p63:/# mount /dev/md0 -o data=writeback,nobarrier /r1
p63:/r1# /usr/bin/time dd if=/dev/zero of=bigfile bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 34.755 s, 309 MB/s
0.02user 19.38system 0:34.78elapsed 55%CPU (0avgtext+0avgdata 7280maxresident)k
0inputs+0outputs (3major+491minor)pagefaults 0swaps
p63:/r1#

=== SHOW FILEFRAG OUTPUT (NOBARRIER,WRITEBACK)

p63:/r1# filefrag -v /r1/bigfile
Filesystem type is: ef53
File size of /r1/bigfile is 10737418240 (2621440 blocks, blocksize 4096)
ext logical physical expected length flags
0 0 34816 32768
1 32768 67584 30720
2 63488 100352 98303 32768
3 96256 133120 30720
4 126976 165888 163839 32768
5 159744 198656 30720
6 190464 231424 229375 32768
7 223232 264192 30720
8 253952 296960 294911 32768
9 286720 329728 32768
10 319488 362496 32768
11 352256 395264 32768
12 385024 428032 32768
13 417792 460800 32768
14 450560 493568 30720
15 481280 557056 524287 32768
16 514048 589824 32768
17 546816 622592 32768
18 579584 655360 32768
19 612352 688128 32768
20 645120 720896 32768
21 677888 753664 32768
22 710656 786432 32768
23 743424 821248 819199 32768
24 776192 854016 30720
25 806912 886784 884735 32768
26 839680 919552 32768
27 872448 952320 32768
28 905216 985088 32768
29 937984 1017856 30720
30 968704 1081344 1048575 32768
31 1001472 1114112 32768
32 1034240 1146880 32768
33 1067008 1179648 32768
34 1099776 1212416 32768
35 1132544 1245184 32768
36 1165312 1277952 32768
37 1198080 1310720 32768
38 1230848 1343488 32768
39 1263616 1376256 32768
40 1296384 1409024 32768
41 1329152 1441792 32768
42 1361920 1474560 32768
43 1394688 1507328 32768
44 1427456 1540096 32768
45 1460224 1607680 1572863 32768
46 1492992 1640448 32768
47 1525760 1673216 32768
48 1558528 1705984 32768
49 1591296 1738752 32768
50 1624064 1771520 32768
51 1656832 1804288 32768
52 1689600 1837056 32768
53 1722368 1869824 32768
54 1755136 1902592 32768
55 1787904 1935360 32768
56 1820672 1968128 32768
57 1853440 2000896 32768
58 1886208 2033664 32768
59 1918976 2066432 30720
60 1949696 2129920 2097151 32768
61 1982464 2162688 32768
62 2015232 2195456 32768
63 2048000 2228224 32768
64 2080768 2260992 32768
65 2113536 2293760 32768
66 2146304 2326528 32768
67 2179072 2359296 32768
68 2211840 2392064 32768
69 2244608 2424832 32768
70 2277376 2457600 32768
71 2310144 2490368 32768
72 2342912 2523136 32768
73 2375680 2555904 32768
74 2408448 2588672 32768
75 2441216 2656256 2621439 32768
76 2473984 2689024 32768
77 2506752 2721792 32768
78 2539520 2754560 16384
/r1/bigfile: 12 extents found
p63:/r1#

=== USE OF -o nodelalloc WITH SOFTWARE RAID-0 (SPEED IMPROVEMENT)

p63:/r1# mount /dev/md0 -o data=writeback,nobarrier,nodelalloc /r1
p63:/r1# /usr/bin/time dd if=/dev/zero of=bigfile bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 29.5299 s, 364 MB/s
0.02user 28.95system 0:29.56elapsed 98%CPU (0avgtext+0avgdata 7312maxresident)k
0inputs+0outputs (3major+493minor)pagefaults 0swaps
p63:/r1#

While it does help, I have not been able to get > 400MiB/s; it stops at
roughly 350-360MiB/s.
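
Note that dd reports decimal MB/s, while the figures quoted in the text are written as MiB/s. A small sketch for converting a dd result into both units, so that numbers from different runs are compared in the same unit; the function name throughput is hypothetical.

```shell
# Convert a dd result (bytes written, elapsed seconds) into MB/s and MiB/s.
throughput() {
  # $1 = bytes written, $2 = elapsed seconds, both as printed by dd
  awk -v b="$1" -v s="$2" 'BEGIN {
    printf "%.1f MB/s %.1f MiB/s\n", b / s / 1000000, b / s / 1048576
  }'
}

# Usage with the nodelalloc run above:
#   throughput 10737418240 29.5299
```

For the nodelalloc run this works out to roughly 364 MB/s, which is about 347 MiB/s.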

=== FIRST ATTEMPT AT USING BLKTRACE

Following these docs:
http://git.kernel.org/?p=linux/kernel/git/axboe/blktrace.git;a=blob;f=README
http://github.com/znmeb/linux_perf_viz/raw/master/blktrace-howto/blktrace-howto.pdf
http://pdfedit.petricek.net/bt/file_download.php?file_id=17&type=bug
http://www.cse.unsw.edu.au/~aaronc/iosched/doc/blktrace.html

Options required in the kernel:

Kernel hacking:
| | [*] Debug Filesystem | |

Then the BLK_DEV_IO_TRACE option (it has moved from where the old docs say it is):
Kernel Hacking:
| | [ ] Tracers ---> | |
| | [*] Support for tracing block IO actions | |

Compile new kernel, reboot.

New kernel configuration used (only enabled the options shown above)
http://home.comcast.net/~jpiszcz/20100228/config-2.6.33-blktrace.txt
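
A quick way to confirm that a kernel config has both options before rebooting into it, as a minimal sketch. It assumes the Kconfig symbols are CONFIG_DEBUG_FS and CONFIG_BLK_DEV_IO_TRACE and that both are built in (=y); the function name check_blktrace_config and the config path in the usage line are illustrative.

```shell
# Report whether the options needed by blktrace are enabled in a kernel config.
check_blktrace_config() {
  cfg=$1
  missing=0
  for opt in CONFIG_DEBUG_FS CONFIG_BLK_DEV_IO_TRACE; do
    if grep -q "^${opt}=y" "$cfg"; then
      echo "$opt: enabled"
    else
      echo "$opt: MISSING"
      missing=1
    fi
  done
  return $missing
}

# Usage (path is an assumption; many distributions install the config here):
#   check_blktrace_config "/boot/config-$(uname -r)"
```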

Next step, create a fresh filesystem for the trace event:

p63:~# /usr/bin/time mkfs.ext4 /dev/md0
< .. >
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

Reboot to new kernel.

Per:
http://pdfedit.petricek.net/bt/file_download.php?file_id=17&type=bug

Mount the debug filesystem/make sure it is mounted:

p63:~# mount -t debugfs debugfs /sys/kernel/debug
mount: debugfs already mounted or /sys/kernel/debug busy
mount: according to mtab, debugfs is already mounted on /sys/kernel/debug
p63:~#
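
The "already mounted" error above can be avoided by checking the mount table first. A minimal sketch, assuming the mount table is readable at /proc/mounts; the function name debugfs_mounted is hypothetical.

```shell
# Succeed if a debugfs filesystem appears in the given mount table.
debugfs_mounted() {
  mounts=${1:-/proc/mounts}
  # Field 3 of each mount table line is the filesystem type.
  awk '$3 == "debugfs" { found = 1 } END { exit !found }' "$mounts"
}

# Usage: mount only when it is not there yet.
#   debugfs_mounted || mount -t debugfs debugfs /sys/kernel/debug
```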

Then follow instructions on page 14 from:
http://github.com/znmeb/linux_perf_viz/raw/master/blktrace-howto/blktrace-howto.pdf

p63:/dev/shm/server# blktrace -l
server: waiting for connections...
server: connection from 192.168.168.113

p63:/dev/shm/client# blktrace -h 192.168.168.113 /dev/md0
blktrace: connecting to 192.168.168.113
blktrace: connected!

Mount filesystem with -o data=writeback,nobarrier, run test blktrace1.

p63:~# mount -o data=writeback,nobarrier /dev/md0 /r1
p63:~# cd /r1
p63:/r1# /usr/bin/time dd if=/dev/zero of=bigfile bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 35.6317 s, 301 MB/s
0.03user 19.41system 0:35.67elapsed 54%CPU (0avgtext+0avgdata 7312maxresident)k
0inputs+0outputs (2major+494minor)pagefaults 0swaps
p63:/r1# rm bigfile
p63:/r1# sync
p63:/r1# cd
p63:~# umount /r1
p63:~#

SERVER PROCESS:

p63:/dev/shm/server# blktrace -l
server: waiting for connections...
server: connection from 192.168.168.113
server: end of run for 192.168.168.113:md0
=== md0 ===
CPU 0: 1548634 events, 72593 KiB data
CPU 1: 1009268 events, 47310 KiB data
Total: 2557902 events (dropped 0), 119902 KiB data
p63:/dev/shm/server# ls

CLIENT PROCESS:

# blktrace -h 192.168.168.113 /dev/md0
blktrace: connecting to 192.168.168.113
blktrace: connected!
^C=== md0 ===
CPU 0: 1548634 events, 72593 KiB data
CPU 1: 1009268 events, 47310 KiB data
Total: 2557902 events (dropped 0), 119902 KiB data

From this test, the following resulted:
# du -sh *
56K 192.168.168.113-2010-02-28-13:10:48
118M 192.168.168.113-2010-02-28-13:14:00

Let this trace be called blktrace1.

p63:/dev/shm/server# du -sh blktrace1/*
56K blktrace1/192.168.168.113-2010-02-28-13:10:48
118M blktrace1/192.168.168.113-2010-02-28-13:14:00
p63:/dev/shm/server#

Mount with -o data=writeback,nobarrier,nodelalloc, run test blktrace2.

p63:~# umount /r1
p63:~# mount -o data=writeback,nobarrier,nodelalloc /dev/md0 /r1
p63:~# cd /r1
p63:/r1# /usr/bin/time dd if=/dev/zero of=bigfile bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 30.6692 s, 350 MB/s
0.03user 29.55system 0:30.70elapsed 96%CPU (0avgtext+0avgdata 7312maxresident)k
0inputs+0outputs (1major+495minor)pagefaults 0swaps
p63:/r1# rm bigfile
p63:/r1# sync
p63:/r1# cd
p63:~# umount /r1
p63:~#

SERVER PROCESS:

p63:/dev/shm/server# blktrace -l
server: waiting for connections...
server: connection from 192.168.168.113
server: end of run for 192.168.168.113:md0
=== md0 ===
CPU 0: 50056 events, 2347 KiB data
CPU 1: 2478242 events, 116168 KiB data
Total: 2528298 events (dropped 0), 118515 KiB data

CLIENT PROCESS:

# blktrace -h 192.168.168.113 /dev/md0
blktrace: connecting to 192.168.168.113
blktrace: connected!
^C=== md0 ===
CPU 0: 50056 events, 2347 KiB data
CPU 1: 2478242 events, 116168 KiB data
Total: 2528298 events (dropped 0), 118515 KiB data
#

p63:/dev/shm/server# du -sh 192.168.168.113-2010-02-28-13\:17\:22/*
2.4M 192.168.168.113-2010-02-28-13:17:22/md0.blktrace.0
114M 192.168.168.113-2010-02-28-13:17:22/md0.blktrace.1

This is blktrace2.

One more time (blktrace3) with ordered.

p63:~# mount -o nobarrier /dev/md0 /r1
p63:~# dmesg | tail -n 2
[ 2788.928806] EXT4-fs (md0): barriers disabled
[ 2789.340573] EXT4-fs (md0): mounted filesystem with ordered data mode
p63:~# cd /r1
p63:/r1# /usr/bin/time dd if=/dev/zero of=bigfile bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 36.2893 s, 296 MB/s
0.04user 19.29system 0:36.32elapsed 53%CPU (0avgtext+0avgdata 7296maxresident)k
0inputs+0outputs (1major+494minor)pagefaults 0swaps
p63:/r1# rm bigfile
p63:/r1# sync
p63:/r1# cd
p63:~# umount /r1
p63:~#

SERVER PROCESS:

p63:/dev/shm/server# blktrace -l
server: waiting for connections...
server: connection from 192.168.168.113
server: end of run for 192.168.168.113:md0
=== md0 ===
CPU 0: 1587087 events, 74395 KiB data
CPU 1: 970979 events, 45515 KiB data
Total: 2558066 events (dropped 0), 119910 KiB data
p63:/dev/shm/server#

CLIENT PROCESS:

# blktrace -h 192.168.168.113 /dev/md0
blktrace: connecting to 192.168.168.113
blktrace: connected!
=== md0 ===
CPU 0: 1587087 events, 74395 KiB data
CPU 1: 970979 events, 45515 KiB data
Total: 2558066 events (dropped 0), 119910 KiB data
#

TRACE OUTPUT TOTAL AND SUMMARY:

p63:~/results-20100228# du -sh *
570M blktrace1 => -o data=writeback,nobarrier
570M blktrace1-redo => -o data=writeback,nobarrier
563M blktrace2 => -o data=writeback,nobarrier,nodelalloc
570M blktrace3 => -o nobarrier (data=ordered)
4.0K script
p63:~/results-20100228#
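
The three runs above repeat the same mount, dd, rm, sync, umount sequence by hand. A sketch of how that sequence could be scripted so that each run differs only in its mount options. The function names are hypothetical, /dev/md0 and /r1 are taken from the sessions above, and the blktrace client is assumed to be started and stopped separately around each run. By default the commands are only printed.

```shell
# Print each command by default; set DRY_RUN=0 to really execute it.
run_cmd() {
  if [ "${DRY_RUN:-1}" = 1 ]; then
    echo "$*"
  else
    "$@"
  fi
}

# One test run with the given ext4 mount options.
run_trace_test() {
  opts=$1
  run_cmd mount -o "$opts" /dev/md0 /r1
  run_cmd dd if=/dev/zero of=/r1/bigfile bs=1M count=10240
  run_cmd rm /r1/bigfile
  run_cmd sync
  run_cmd umount /r1
}

# Usage, one call per trace:
#   run_trace_test data=writeback,nobarrier              # blktrace1
#   run_trace_test data=writeback,nobarrier,nodelalloc   # blktrace2
#   run_trace_test nobarrier                             # blktrace3
```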

USING SCRIPT ON PAGE 24/30:

Run post-process.sh for each trace, blktrace{1,2,3}. The script itself,
from page 24/30:

# cat /root/post-process.sh
#! /bin/bash
blkrawverify md0 # check data for errors
blkparse -d md0.bin -i md0 > md0.blkparse # merged binary, parsed
btt -i md0.bin --all-data > md0.btt # basic btt report
# now the whole enchilada!
btt -i md0.bin -o md0x --all-data --easy-parse-avgs \
--iostat=md0x.iostat \
--per-io-dump=md0x.pid \
--q2d-latencies=md0x \
--d2c-latencies=md0x \
--q2c-latencies=md0x \
--dump-blocknos=md0x_dbn \
--active-queue-depth=md0x \
--unplug-hist=md0x_uph \
--seeks=seeks \
--seeks-per-second=sps \
--per-io-trees=md0x_pit \
> md0x.btt # md0x.btt is empty

#

Before running any tests, back up the raw data:

p63:/dev/shm# tar cf /root/server.tar server
p63:/dev/shm#

For each directory, run post-process:

blktrace1: (I must have waited too long in between steps here so it made two)

p63:/dev/shm/server/blktrace1# ls -1
192.168.168.113-2010-02-28-13:10:48
192.168.168.113-2010-02-28-13:14:00
p63:/dev/shm/server/blktrace1# cd *48
p63:/dev/shm/server/blktrace1/192.168.168.113-2010-02-28-13:10:48# /root/post-process.sh
Verifying md0
CPU 0
CPU 1
Wrote output to md0.verify.out
p63:/dev/shm/server/blktrace1/192.168.168.113-2010-02-28-13:10:48# cd ../*00
p63:/dev/shm/server/blktrace1/192.168.168.113-2010-02-28-13:14:00# /root/post-process.sh
Verifying md0
CPU 0
CPU 1
Wrote output to md0.verify.out
p63:/dev/shm/server/blktrace1/192.168.168.113-2010-02-28-13:14:00#

I will make another blktrace1, and be faster this time, so that all of the
results are of the same type. It is called blktrace1-redo:

blktrace1-redo:

p63:/dev/shm/server/blktrace1-redo/192.168.168.113-2010-02-28-13:35:45# /root/post-process.sh
Verifying md0
CPU 0
CPU 1
Wrote output to md0.verify.out
p63:/dev/shm/server/blktrace1-redo/192.168.168.113-2010-02-28-13:35:45#

blktrace2:

p63:/dev/shm/server/blktrace2/192.168.168.113-2010-02-28-13:17:22# /root/post-process.sh
Verifying md0
CPU 0
CPU 1
Wrote output to md0.verify.out
p63:/dev/shm/server/blktrace2/192.168.168.113-2010-02-28-13:17:22#

blktrace3:

p63:/dev/shm/server/blktrace3/192.168.168.113-2010-02-28-13:31:29# /root/post-process.sh
Verifying md0
CPU 0
CPU 1
Wrote output to md0.verify.out
p63:/dev/shm/server/blktrace3/192.168.168.113-2010-02-28-13:31:29#
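
The per-directory runs above could also be done in one loop. A minimal sketch, assuming the layout shown above (one directory per trace, one timestamped subdirectory per capture) and an absolute path to the script; the function name post_process_all is hypothetical.

```shell
# Run a post-processing script inside every <root>/<trace>/<capture>/ directory.
post_process_all() {
  root=$1
  script=$2   # must be an absolute path, since we cd before calling it
  for d in "$root"/*/*/; do
    [ -d "$d" ] || continue
    # Subshell so the cd does not leak into the next iteration.
    ( cd "$d" && "$script" ) || echo "failed: $d" >&2
  done
}

# Usage:
#   post_process_all /dev/shm/server /root/post-process.sh
```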

------------

=== FINAL RESULTS

p63:~/results-20100228# du -sh */*
216K blktrace1/192.168.168.113-2010-02-28-13:10:48
570M blktrace1/192.168.168.113-2010-02-28-13:14:00
570M blktrace1-redo/192.168.168.113-2010-02-28-13:35:45
563M blktrace2/192.168.168.113-2010-02-28-13:17:22
570M blktrace3/192.168.168.113-2010-02-28-13:31:29
4.0K script/post-process.sh
p63:~/results-20100228#

I used 7zip to compress the results because it offers a better compression
ratio than any other utility, including the latest 'xz' utility:
http://fixunix.com/kernel/238089-response-kernel-compression-e-mail-few-months-ago.html

$ xz -9 linux-2.6.16.17.tar

$ du -sk * | sort -n
32392 linux-2.6.16.17.tar.7z
32404 linux-2.6.16.17.tar.xz
33520 linux-2.6.16.17.tar.lzma
33760 linux-2.6.16.17.tar.rar
38064 linux-2.6.16.17.tar.rz
39472 linux-2.6.16.17.tar.szip
39520 linux-2.6.16.17.tar.bz
39936 linux-2.6.16.17.tar.bz2
40000 linux-2.6.16.17.tar.bicom
40656 linux-2.6.16.17.tar.sit
47664 linux-2.6.16.17.tar.lha
49968 linux-2.6.16.17.tar.dzip
50000 linux-2.6.16.17.tar.gz
51344 linux-2.6.16.17.tar.arj
57552 linux-2.6.16.17.tar.lzo
57984 linux-2.6.16.17.tar.F
81136 linux-2.6.16.17.tar.Z
94544 linux-2.6.16.17.tar.zoo
101216 linux-2.6.16.17.tar.arc
228608 linux-2.6.16.17.tar

=== COMPRESSION RESULTS:

-rw-r--r-- 1 abc users 155M 2010-02-28 09:02 results-20100228.tar.7z
-rw-r--r-- 1 abc users 290M 2010-02-28 08:42 results-20100228.tar.bz2
-rw-r--r-- 1 abc users 2.3G 2010-02-28 08:42 results-20100228.tar

=== LOCATION:

http://liquidswords.org/~war/results-20100228.tar.7z
wget http://liquidswords.org/~war/results-20100228.tar.7z

=== MD5 CHECKSUM:

$ md5sum *
1db01600ce2700854b4bafcfd68f7028 results-20100228.tar.7z
35793b283edf5c0f38738276812aad52 results-20100228.tar

=== VERIFICATION: MAKE SURE IT WORKS FOR OTHERS:

$ wget http://liquidswords.org/~war/results-20100228.tar.7z
--2010-02-28 09:48:36-- http://liquidswords.org/~war/results-20100228.tar.7z
Resolving liquidswords.org... 71.6.165.232
Connecting to liquidswords.org|71.6.165.232|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 161814574 (154M) [application/x-tar]
Saving to: "results-20100228.tar.7z"

100%[======================================>] 161,814,574 2.00M/s in 69s

$

$ md5sum *7z
1db01600ce2700854b4bafcfd68f7028 results-20100228.tar.7z

CORRECT

$ 7z x results-20100228.tar.7z

7-Zip 4.58 beta Copyright (c) 1999-2008 Igor Pavlov 2008-05-05
p7zip Version 4.58 (locale=en_US,Utf16=on,HugeFiles=on,8 CPUs)

Processing archive: results-20100228.tar.7z

Extracting results-20100228.tar

Everything is Ok

Size: 2382561280
Compressed: 161814574

$ md5sum *tar
35793b283edf5c0f38738276812aad52 results-20100228.tar

CORRECT

Again, the trace information details:

p63:~/results-20100228# du -sh *
570M blktrace1 => -o data=writeback,nobarrier
570M blktrace1-redo => -o data=writeback,nobarrier
563M blktrace2 => -o data=writeback,nobarrier,nodelalloc
570M blktrace3 => -o nobarrier (data=ordered)
4.0K script
p63:~/results-20100228#

Let me know if you need anything else, thanks.

Justin.

2010-02-28 23:50:24

by Dave Chinner

Subject: Re: EXT4 is ~2X as slow as XFS (593MB/s vs 304MB/s) for writes?

On Sat, Feb 27, 2010 at 06:36:37AM -0500, Justin Piszcz wrote:
> Besides large sequential I/O, ext4 seems to be MUCH faster than XFS when
> working with many small files.
>
> EXT4
>
> p63:/r1# sync; /usr/bin/time bash -c 'tar xf linux-2.6.33.tar; sync'
> 0.18user 2.43system 0:02.86elapsed 91%CPU (0avgtext+0avgdata 5216maxresident)k
> 0inputs+0outputs (0major+971minor)pagefaults 0swaps
> linux-2.6.33 linux-2.6.33.tar
> p63:/r1# sync; /usr/bin/time bash -c 'rm -rf linux-2.6.33; sync'
> 0.02user 0.98system 0:01.03elapsed 97%CPU (0avgtext+0avgdata 5216maxresident)k
> 0inputs+0outputs (0major+865minor)pagefaults 0swaps
>
> XFS
>
> p63:/r1# sync; /usr/bin/time bash -c 'tar xf linux-2.6.33.tar; sync'
> 0.20user 2.62system 1:03.90elapsed 4%CPU (0avgtext+0avgdata 5200maxresident)k
> 0inputs+0outputs (0major+970minor)pagefaults 0swaps
> p63:/r1# sync; /usr/bin/time bash -c 'rm -rf linux-2.6.33; sync'
> 0.03user 2.02system 0:29.04elapsed 7%CPU (0avgtext+0avgdata 5200maxresident)k
> 0inputs+0outputs (0major+864minor)pagefaults 0swaps

Mount XFS with "-o logbsize=262144". Metadata intensive workloads on
XFS are log IO bound, so larger log buffer size makes a big
difference. On 2.6.33 kernels on a single 15krpm SCSI drive I've
been getting ~21s for the untar, and 8s for the rm -rf with that
option set. Still slower than ext4, but nowhere near as bad.

> So I guess that's the tradeoff, for massive I/O you should use XFS, else,
> use EXT4?

I wouldn't consider writing an 11GB file "massive IO", nor would I
consider 600MB/s massive, either, since you can get that out of a
sub-$10k server these days....

> I still would like to know however, why 350MiB/s seems to be the maximum
> performance I can get from two different md raids (that easily do 600MiB/s
> with XFS).

Check whether the dd process on ext4 is CPU bound....
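
One rough way to check this from GNU time output: compare CPU time (user plus system) with elapsed wall time. The sketch below parses a line in the "0.02user 19.40system 0:35.29elapsed" format; the function name cpu_pct is hypothetical, and elapsed is assumed to be in [h:]m:ss.ss form.

```shell
# Percentage of wall time the process spent on CPU, from a GNU time line.
cpu_pct() {
  printf '%s\n' "$1" | awk '{
    u = $1; sub(/user/, "", u)
    s = $2; sub(/system/, "", s)
    e = $3; sub(/elapsed/, "", e)
    # Convert [h:]m:ss.ss into seconds.
    n = split(e, t, ":")
    secs = 0
    for (i = 1; i <= n; i++) secs = secs * 60 + t[i]
    printf "%.1f\n", 100 * (u + s) / secs
  }'
}

# Usage:
#   cpu_pct "0.02user 28.95system 0:29.56elapsed 98%CPU"
```

A value close to 100 suggests that dd (mostly system time here) is limited by one CPU rather than by the storage.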

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-03-01 08:39:24

by Andreas Dilger

Subject: Re: EXT4 is ~2X as slow as XFS (593MB/s vs 304MB/s) for writes?

On 2010-02-28, at 07:55, Justin Piszcz wrote:
> === CREATE RAID-0 WITH 11 DISKS

Have you tried testing with "nice" numbers of disks in your RAID set
(e.g. 8 disks for RAID-0, 9 for RAID-5, 10 for RAID-6)? The mballoc
code is really much better tuned for power-of-two sized allocations.
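
The three example counts have one thing in common: after subtracting parity disks, 8 data disks remain. A small sketch of that arithmetic, assuming "nice" means a power-of-two number of data disks; the function names data_disks and is_power_of_two are hypothetical.

```shell
# Number of data-bearing disks for a given RAID level and total disk count.
data_disks() {
  level=$1
  total=$2
  case $level in
    0) echo "$total" ;;
    5) echo $(( total - 1 )) ;;   # one disk's worth of parity
    6) echo $(( total - 2 )) ;;   # two disks' worth of parity
    *) echo "unsupported RAID level: $level" >&2; return 1 ;;
  esac
}

# Succeed if the argument is a positive power of two.
is_power_of_two() {
  [ "$1" -gt 0 ] && [ $(( $1 & ($1 - 1) )) -eq 0 ]
}

# Usage:
#   is_power_of_two "$(data_disks 5 9)" && echo "nice"
#   is_power_of_two "$(data_disks 0 11)" || echo "not nice"
```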

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

2010-03-01 09:21:06

by Justin Piszcz

Subject: Re: EXT4 is ~2X as slow as XFS (593MB/s vs 304MB/s) for writes?

On Mon, 1 Mar 2010, Andreas Dilger wrote:

> On 2010-02-28, at 07:55, Justin Piszcz wrote:
>> === CREATE RAID-0 WITH 11 DISKS
>
>
> Have you tried testing with "nice" numbers of disks in your RAID set (e.g. 8
> disks for RAID-0, 9 for RAID-5, 10 for RAID-6)? The mballoc code is really
> much better tuned for power-of-two sized allocations.

Hi,

Yes, the second system (RAID-5) has 8 disks and it shows the same
performance problems with ext4 and not XFS (as shown from previous
e-mail), where XFS usually got 500-600MiB/s for writes.

http://groups.google.com/group/linux.kernel/browse_thread/thread/e7b189bcaa2c1cb4/ad6c2a54b678cf5f?show_docid=ad6c2a54b678cf5f&pli=1

For the RAID-5 (from earlier testing): <- This one has 8 disks.
-o data=writeback,nobarrier:
10737418240 bytes (11 GB) copied, 48.7335 s, 220 MB/s
-o data=writeback,nobarrier,nodelalloc:
10737418240 bytes (11 GB) copied, 30.5425 s, 352 MB/s
An increase of 132MiB/s.

Justin.

2010-03-01 14:48:18

by Michael Tokarev

Subject: Re: EXT4 is ~2X as slow as XFS (593MB/s vs 304MB/s) for writes?

Justin Piszcz wrote:
>
> On Mon, 1 Mar 2010, Andreas Dilger wrote:
>
>> On 2010-02-28, at 07:55, Justin Piszcz wrote:
>>> === CREATE RAID-0 WITH 11 DISKS
>>
>> Have you tried testing with "nice" numbers of disks in your RAID set
>> (e.g. 8 disks for RAID-0, 9 for RAID-5, 10 for RAID-6)? The mballoc
>> code is really much better tuned for power-of-two sized allocations.
>
> Hi,
>
> Yes, the second system (RAID-5) has 8 disks and it shows the same
> performance problems with ext4 and not XFS (as shown from previous
> e-mail), where XFS usually got 500-600MiB/s for writes.
>
> http://groups.google.com/group/linux.kernel/browse_thread/thread/e7b189bcaa2c1cb4/ad6c2a54b678cf5f?show_docid=ad6c2a54b678cf5f&pli=1
>
>
> For the RAID-5 (from earlier testing): <- This one has 8 disks.

Note that for RAID-5, the "nice" number of disks is 9 as Andreas
said, not 8 as in your example.

/mjt

2010-03-01 15:07:35

by Justin Piszcz

Subject: Re: EXT4 is ~2X as slow as XFS (593MB/s vs 304MB/s) for writes?

On Mon, 1 Mar 2010, Michael Tokarev wrote:

> Justin Piszcz wrote:
>>
>> On Mon, 1 Mar 2010, Andreas Dilger wrote:
>>
>>> On 2010-02-28, at 07:55, Justin Piszcz wrote:
>>>> === CREATE RAID-0 WITH 11 DISKS
>>>
>>> Have you tried testing with "nice" numbers of disks in your RAID set
>>> (e.g. 8 disks for RAID-0, 9 for RAID-5, 10 for RAID-6)? The mballoc
>>> code is really much better tuned for power-of-two sized allocations.
>>
>> Hi,
>>
>> Yes, the second system (RAID-5) has 8 disks and it shows the same
>> performance problems with ext4 and not XFS (as shown from previous
>> e-mail), where XFS usually got 500-600MiB/s for writes.
>>
>> http://groups.google.com/group/linux.kernel/browse_thread/thread/e7b189bcaa2c1cb4/ad6c2a54b678cf5f?show_docid=ad6c2a54b678cf5f&pli=1
>>
>>
>> For the RAID-5 (from earlier testing): <- This one has 8 disks.
>
> Note that for RAID-5, the "nice" number of disks is 9 as Andreas
> said, not 8 as in your example.
>
> /mjt
>

Hi, thanks for this.

RAID-0 with 12 disks:

p63:~# mdadm --create -e 0.90 /dev/md0 /dev/sd[b-m]1 --level=0 -n 12 -c 64
mdadm: /dev/sdb1 appears to contain an ext2fs file system
size=1077256000K mtime=Sun Feb 28 08:35:47 2010
mdadm: /dev/sdb1 appears to be part of a raid array:
level=raid0 devices=11 ctime=Sun Feb 28 08:11:10 2010
mdadm: /dev/sdc1 appears to be part of a raid array:
level=raid0 devices=11 ctime=Sun Feb 28 08:11:10 2010
mdadm: /dev/sdd1 appears to be part of a raid array:
level=raid0 devices=11 ctime=Sun Feb 28 08:11:10 2010
mdadm: /dev/sde1 appears to be part of a raid array:
level=raid0 devices=11 ctime=Sun Feb 28 08:11:10 2010
mdadm: /dev/sdf1 appears to be part of a raid array:
level=raid0 devices=11 ctime=Sun Feb 28 08:11:10 2010
mdadm: /dev/sdg1 appears to be part of a raid array:
level=raid0 devices=11 ctime=Sun Feb 28 08:11:10 2010
mdadm: /dev/sdh1 appears to be part of a raid array:
level=raid0 devices=11 ctime=Sun Feb 28 08:11:10 2010
mdadm: /dev/sdi1 appears to be part of a raid array:
level=raid0 devices=11 ctime=Sun Feb 28 08:11:10 2010
mdadm: /dev/sdj1 appears to be part of a raid array:
level=raid0 devices=11 ctime=Sun Feb 28 08:11:10 2010
mdadm: /dev/sdk1 appears to be part of a raid array:
level=raid0 devices=11 ctime=Sun Feb 28 08:11:10 2010
mdadm: /dev/sdl1 appears to be part of a raid array:
level=raid0 devices=11 ctime=Sun Feb 28 08:11:10 2010
mdadm: /dev/sdm1 appears to be part of a raid array:
level=raid6 devices=11 ctime=Sat Feb 27 06:57:29 2010
Continue creating array? y
mdadm: array /dev/md0 started.
p63:~# mkfs.ext4 /dev/md0
mke2fs 1.41.10 (10-Feb-2009)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=0 blocks, Stripe width=0 blocks
366288896 inodes, 1465151808 blocks
73257590 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
44713 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
102400000, 214990848, 512000000, 550731776, 644972544

Writing inode tables: 28936/44713..etc

p63:~# mount -o nobarrier /dev/md0 /r1
p63:~# cd /r1
p63:/r1# dd if=/dev/zero of=bigfile bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 34.9723 s, 307 MB/s
p63:/r1#

Same issue for EXT4; with XFS, it gets faster:

p63:~# mkfs.xfs /dev/md0 -f
meta-data=/dev/md0 isize=256 agcount=32, agsize=45786000 blks
= sectsz=512 attr=2
data = bsize=4096 blocks=1465151808, imaxpct=5
= sunit=16 swidth=192 blks
naming =version 2 bsize=4096 ascii-ci=0
log =internal log bsize=4096 blocks=521728, version=2
= sectsz=512 sunit=16 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
mount /dev/md0 /r1
p63:~# mount /dev/md0 /r1
p63:~# cd /r1
p63:/r1# dd if=/dev/zero of=bigfile bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 17.6473 s, 608 MB/s
p63:/r1#
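
Note that mkfs.xfs picked up the RAID geometry (sunit=16 swidth=192 blks), while the mkfs.ext4 output above shows Stride=0 and Stripe width=0. A sketch of how the matching ext4 values could be computed and passed to mke2fs via -E; whether this closes the gap is untested here. The function name ext4_stripe_opts is hypothetical, and the option is written stripe_width (some mke2fs man pages spell it stripe-width).

```shell
# Compute mke2fs extended options from RAID chunk size, fs block size
# and number of data disks.
ext4_stripe_opts() {
  chunk_kib=$1      # md chunk size in KiB (-c 64 above)
  block_bytes=$2    # filesystem block size in bytes (4096 above)
  data_disks=$3     # disks carrying data (12 for this RAID-0)
  stride=$(( chunk_kib * 1024 / block_bytes ))
  stripe_width=$(( stride * data_disks ))
  echo "stride=${stride},stripe_width=${stripe_width}"
}

# Usage:
#   mkfs.ext4 -E "$(ext4_stripe_opts 64 4096 12)" /dev/md0
```

For a 64 KiB chunk, 4096-byte blocks and 12 disks this gives stride=16 and stripe_width=192, the same numbers mkfs.xfs reports as sunit and swidth.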

Justin.

2010-03-01 16:15:26

by Eric Sandeen

Subject: Re: EXT4 is ~2X as slow as XFS (593MB/s vs 304MB/s) for writes?

Justin Piszcz wrote:
>
>
> On Sun, 28 Feb 2010, [email protected] wrote:
>
>> On Sat, Feb 27, 2010 at 06:36:37AM -0500, Justin Piszcz wrote:
>>>
>>> I still would like to know however, why 350MiB/s seems to be the maximum
>>> performance I can get from two different md raids (that easily do
>>> 600MiB/s
>>> with XFS).
>
>> Can you run "filefrag -v <filename>" on the large file you created
>> using dd? Part of the problem may be the block allocator simply not
>> being well optimized for super large writes. To be honest, that's not
>> something we've tried (at all) to optimize, mainly because for most
>> users of ext4 they're more interested in much more reasonable sized
>> files, and we only have so many hours in a day to hack on ext4. :-)
>> XFS in contrast has in the past had plenty of paying customers
>> interested in writing really large scientific data sets, so this is
>> something Irix *has* spent time optimizing.
> Yes, this is shown at the bottom of the e-mail both with -o data=ordered
> and data=writeback.

...

> === SHOW FILEFRAG OUTPUT (NOBARRIER,ORDERED)
>
> p63:/r1# filefrag -v /r1/bigfile
> Filesystem type is: ef53
> File size of /r1/bigfile is 10737418240 (2621440 blocks, blocksize 4096)
> ext logical physical expected length flags
> 0 0 34816 32768
> 1 32768 67584 30720
> 2 63488 100352 98303 32768
> 3 96256 133120 30720
> 4 126976 165888 163839 32768
> 5 159744 198656 30720
...

That looks pretty good.

I think Dave's suggestion of seeing what CPU usage looks like is a good one.

Running blktrace on xfs vs. ext4 could possibly also shed some light.

-Eric

2010-03-02 00:08:30

by Eric Sandeen

Subject: Re: EXT4 is ~2X as slow as XFS (593MB/s vs 304MB/s) for writes?

Justin Piszcz wrote:
> Hello,
>
> Is it possible to 'optimize' ext4 so it is as fast as XFS for writes?
> I see about half the performance as XFS for sequential writes.
>
> I have checked the doc and tried several options, a few of which are shown
> below (I have also tried the commit/journal_async/etc options but none
> of them get the write speeds anywhere near XFS)?
>
> Sure 'dd' is not a real benchmark, etc, etc, but with 10Gbps between 2
> hosts I get 550MiB/s+ on reads from EXT4 but only 100-200MiB/s write.
>
> When it was XFS I used to get 400-600MiB/s for writes for the same RAID
> volume.
>
> How do I 'speed' up ext4? Is it possible?
>

FWIW I'm seeing similar things on fast storage (Fusion IO),
though this is under 2.6.31. 500MB/s+ for xfs, 300 for ext4.

Overwriting an existing file is no faster. I don't think this
driver is blktraceable but I'll try a newer driver that should be I think.

(xfs's overwrite went from 534 to 597 mb/s; ext4 sat at 320-ish)

direct IO was good for both xfs & ext4 at around 530mb/s

I'll see if I can get this running on a more recent kernel to
do further investigation.

-Eric

2010-03-02 00:37:46

by Eric Sandeen

Subject: Re: EXT4 is ~2X as slow as XFS (593MB/s vs 304MB/s) for writes?

Eric Sandeen wrote:
> Justin Piszcz wrote:
>> Hello,
>>
>> Is it possible to 'optimize' ext4 so it is as fast as XFS for writes?
>> I see about half the performance as XFS for sequential writes.
>>
>> I have checked the doc and tried several options, a few of which are shown
>> below (I have also tried the commit/journal_async/etc options but none
>> of them get the write speeds anywhere near XFS)?
>>
>> Sure 'dd' is not a real benchmark, etc, etc, but with 10Gbps between 2
>> hosts I get 550MiB/s+ on reads from EXT4 but only 100-200MiB/s write.
>>
>> When it was XFS I used to get 400-600MiB/s for writes for the same RAID
>> volume.
>>
>> How do I 'speed' up ext4? Is it possible?
>>
>
> FWIW I'm seeing similar things on fast storage (Fusion IO),
> though this is under 2.6.31. 500MB/s+ for xfs, 300 for ext4.
>
> Overwriting an existing file is no faster. I don't think this
> driver is blktraceable but I'll try a newer driver that should be I think.

FWIW, blktrace (I'm still on 2.6.31) is enlightening:

Total (xfs):
Reads Queued: 4, 16KiB Writes Queued: 122,567, 10,485MiB
Read Dispatches: 4, 16KiB Write Dispatches: 83,219, 10,485MiB
Reads Requeued: 0 Writes Requeued: 0
Reads Completed: 4, 16KiB Writes Completed: 83,219, 10,485MiB
Read Merges: 0, 0KiB Write Merges: 39,348, 314,804KiB
IO unplugs: 344 Timer unplugs: 338

Total (ext4):
Reads Queued: 14, 56KiB Writes Queued: 2,621K, 10,486MiB
Read Dispatches: 14, 56KiB Write Dispatches: 107,944, 10,486MiB
Reads Requeued: 0 Writes Requeued: 0
Reads Completed: 14, 56KiB Writes Completed: 107,944, 10,486MiB
Read Merges: 0, 0KiB Write Merges: 2,513K, 10,054MiB
IO unplugs: 2,461 Timer unplugs: 2,020

See "Writes Queued". See also submit_bio() calls in xfs.

ext4 doing things a block at a time is certainly giving the elevator a workout...
I'd tend to chalk it up to that at first glance.
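
To put a number on "a block at a time", divide the data written by the number of queued writes. A small sketch of that arithmetic using the totals above; the function name avg_queued_kib is hypothetical, and "2,621K" is taken as roughly 2,621,000.

```shell
# Average size of a queued write in KiB, from total MiB and request count.
avg_queued_kib() {
  awk -v m="$1" -v n="$2" 'BEGIN { printf "%.1f\n", m * 1024 / n }'
}

# Usage with the blkparse totals above:
#   avg_queued_kib 10485 122567    # xfs
#   avg_queued_kib 10486 2621000   # ext4
```

That comes to about 88 KiB per queued write for xfs against about 4 KiB (one filesystem block) for ext4, with the elevator merging the ext4 requests back together afterwards.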

-Eric