2013-11-18 16:02:16

by Martin Boutin

Subject: Filesystem writes on RAID5 too slow

Dear list,

I am writing about an apparent issue (or maybe it is normal, that's my
question) regarding filesystem write speed on a Linux raid device.
More specifically, I have linux-3.10.10 running on an Intel Haswell
embedded system with 3 HDDs in a RAID-5 configuration.
The hard disks have 4k physical sectors which are reported as 512-byte
logical sectors. I made sure the partitions underlying the raid device
start at sector 2048.

The RAID device has version 1.2 metadata and 4k (bytes) of data
offset, therefore the data should also be 4k aligned. The raid chunk
size is 512K.

I have the md0 raid device formatted as ext3 with a 4k block size, and
stride and stripe-width correctly chosen to match the raid chunk size,
that is, stride=128,stripe-width=256.
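
For reference, those values follow from the geometry: stride = chunk /
block = 512KiB / 4KiB = 128, and stripe-width = stride * 2 data disks =
256, i.e. roughly a mkfs invocation like:

$ mkfs.ext3 -b 4096 -E stride=128,stripe-width=256 /dev/md0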

While working on a small university project, I noticed that
the write speeds when using a filesystem over raid are *much* slower
than when writing directly to the raid device (or even compared to
filesystem read speeds).

The command line for measuring filesystem read and write speeds was:

$ dd if=/tmp/diskmnt/filerd.zero of=/dev/null bs=1M count=1000 iflag=direct
$ dd if=/dev/zero of=/tmp/diskmnt/filewr.zero bs=1M count=1000 oflag=direct

The command line for measuring raw read and write speeds was:

$ dd if=/dev/md0 of=/dev/null bs=1M count=1000 iflag=direct
$ dd if=/dev/zero of=/dev/md0 bs=1M count=1000 oflag=direct

Here are some speed measurements using dd (an average of 20 runs):

device      raw/fs   mode    speed (MB/s)   slowdown (%)
/dev/md0    raw      read    207
/dev/md0    raw      write   209
/dev/md1    raw      read    214
/dev/md1    raw      write   212

/dev/md0    xfs      read    188            9
/dev/md0    xfs      write    35            83

/dev/md1    ext3     read    199            7
/dev/md1    ext3     write    36            83

/dev/md0    ufs      read    212            0
/dev/md0    ufs      write    53            75

/dev/md0    ext2     read    202            2
/dev/md0    ext2     write    34            84

Is it possible that the filesystem has such an enormous impact on the
write speed? We are talking about a slowdown of 80%! Even a
filesystem as simple as ufs shows a slowdown of 75%! What am I missing?

Thank you,
--
Martin Boutin


2013-11-18 18:28:21

by Eric Sandeen

Subject: Re: Filesystem writes on RAID5 too slow

On 11/18/13, 10:02 AM, Martin Boutin wrote:
> Dear list,
>
> I am writing about an apparent issue (or maybe it is normal, that's my
> question) regarding filesystem write speed in in a linux raid device.
> More specifically, I have linux-3.10.10 running in an Intel Haswell
> embedded system with 3 HDDs in a RAID-5 configuration.
> The hard disks have 4k physical sectors which are reported as 512
> logical size. I made sure the partitions underlying the raid device
> start at sector 2048.

(fixed cc: to xfs list)

> The RAID device has version 1.2 metadata and 4k (bytes) of data
> offset, therefore the data should also be 4k aligned. The raid chunk
> size is 512K.
>
> I have the md0 raid device formatted as ext3 with a 4k block size, and
> stride and stripes correctly chosen to match the raid chunk size, that
> is, stride=128,stripe-width=256.
>
> While I was working in a small university project, I just noticed that
> the write speeds when using a filesystem over raid are *much* slower
> than when writing directly to the raid device (or even compared to
> filesystem read speeds).
>
> The command line for measuring filesystem read and write speeds was:
>
> $ dd if=/tmp/diskmnt/filerd.zero of=/dev/null bs=1M count=1000 iflag=direct
> $ dd if=/dev/zero of=/tmp/diskmnt/filewr.zero bs=1M count=1000 oflag=direct
>
> The command line for measuring raw read and write speeds was:
>
> $ dd if=/dev/md0 of=/dev/null bs=1M count=1000 iflag=direct
> $ dd if=/dev/zero of=/dev/md0 bs=1M count=1000 oflag=direct
>
> Here are some speed measures using dd (an average of 20 runs).:
>
> device raw/fs mode speed (MB/s) slowdown (%)
> /dev/md0 raw read 207
> /dev/md0 raw write 209
> /dev/md1 raw read 214
> /dev/md1 raw write 212
>
> /dev/md0 xfs read 188 9
> /dev/md0 xfs write 35 83
>
> /dev/md1 ext3 read 199 7
> /dev/md1 ext3 write 36 83
>
> /dev/md0 ufs read 212 0
> /dev/md0 ufs write 53 75
>
> /dev/md0 ext2 read 202 2
> /dev/md0 ext2 write 34 84
>
> Is it possible that the filesystem has such enormous impact in the
> write speed? We are talking about a slowdown of 80%!!! Even a
> filesystem as simple as ufs has a slowdown of 75%! What am I missing?

One thing you're missing is enough info to debug this.

/proc/mdstat, kernel version, xfs_info output, mkfs commandlines used,
partition table details, etc.

If something is misaligned and you are doing RMW for these IOs it could
hurt a lot.
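
For instance, something along these lines would cover most of it
(device and mount point names are just a guess, adjust to your setup):

$ uname -a
$ cat /proc/mdstat
$ mdadm --detail /dev/md0
$ fdisk -l -u /dev/sda
$ xfs_info /tmp/diskmnt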

-Eric

> Thank you,
>


2013-11-18 18:41:40

by Roman Mamedov

Subject: Re: Filesystem writes on RAID5 too slow

On Mon, 18 Nov 2013 11:02:15 -0500
Martin Boutin <[email protected]> wrote:

> I have the md0 raid device formatted as ext3 with a 4k block size, and
> stride and stripes correctly chosen to match the raid chunk size, that
> is, stride=128,stripe-width=256.

What is your stripe cache size?
http://peterkieser.com/2009/11/29/raid-mdraid-stripe_cache_size-vs-write-transfer/
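
For example (256 is the md default; 8192 below is just a value to try,
it costs some RAM):

$ cat /sys/block/md0/md/stripe_cache_size
$ echo 8192 > /sys/block/md0/md/stripe_cache_size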

> The command line for measuring filesystem read and write speeds was:
>
> $ dd if=/tmp/diskmnt/filerd.zero of=/dev/null bs=1M count=1000 iflag=direct
> $ dd if=/dev/zero of=/tmp/diskmnt/filewr.zero bs=1M count=1000 oflag=direct

Try testing with "fdatasync" instead of "direct" here.

--
With respect,
Roman



2013-11-18 19:25:43

by Roman Mamedov

Subject: Re: Filesystem writes on RAID5 too slow

On Tue, 19 Nov 2013 00:41:40 +0600
Roman Mamedov <[email protected]> wrote:

> > The command line for measuring filesystem read and write speeds was:
> >
> > $ dd if=/tmp/diskmnt/filerd.zero of=/dev/null bs=1M count=1000 iflag=direct
> > $ dd if=/dev/zero of=/tmp/diskmnt/filewr.zero bs=1M count=1000 oflag=direct
>
> Try testing with "fdatasync" instead of "direct" here.

Sorry, "conv=fdatasync" instead of "oflag=direct".

--
With respect,
Roman



2013-11-19 00:57:40

by Dave Chinner

Subject: Re: Filesystem writes on RAID5 too slow

On Mon, Nov 18, 2013 at 12:28:21PM -0600, Eric Sandeen wrote:
> On 11/18/13, 10:02 AM, Martin Boutin wrote:
> > Dear list,
> >
> > I am writing about an apparent issue (or maybe it is normal, that's my
> > question) regarding filesystem write speed in in a linux raid device.
> > More specifically, I have linux-3.10.10 running in an Intel Haswell
> > embedded system with 3 HDDs in a RAID-5 configuration.
> > The hard disks have 4k physical sectors which are reported as 512
> > logical size. I made sure the partitions underlying the raid device
> > start at sector 2048.
>
> (fixed cc: to xfs list)
>
> > The RAID device has version 1.2 metadata and 4k (bytes) of data
> > offset, therefore the data should also be 4k aligned. The raid chunk
> > size is 512K.
> >
> > I have the md0 raid device formatted as ext3 with a 4k block size, and
> > stride and stripes correctly chosen to match the raid chunk size, that
> > is, stride=128,stripe-width=256.
> >
> > While I was working in a small university project, I just noticed that
> > the write speeds when using a filesystem over raid are *much* slower
> > than when writing directly to the raid device (or even compared to
> > filesystem read speeds).
> >
> > The command line for measuring filesystem read and write speeds was:
> >
> > $ dd if=/tmp/diskmnt/filerd.zero of=/dev/null bs=1M count=1000 iflag=direct
> > $ dd if=/dev/zero of=/tmp/diskmnt/filewr.zero bs=1M count=1000 oflag=direct
> >
> > The command line for measuring raw read and write speeds was:
> >
> > $ dd if=/dev/md0 of=/dev/null bs=1M count=1000 iflag=direct
> > $ dd if=/dev/zero of=/dev/md0 bs=1M count=1000 oflag=direct
> >
> > Here are some speed measures using dd (an average of 20 runs).:
> >
> > device raw/fs mode speed (MB/s) slowdown (%)
> > /dev/md0 raw read 207
> > /dev/md0 raw write 209
> > /dev/md1 raw read 214
> > /dev/md1 raw write 212

So, that's writing to the first 1GB of /dev/md0, and all the writes
are going to be aligned to the MD stripe.

> > /dev/md0 xfs read 188 9
> > /dev/md0 xfs write 35 83

And these will not be written to the first 1GB of the block device
but somewhere else. Most likely a region that hasn't otherwise been
used, and so isn't going to be overwriting the same blocks like the
/dev/md0 case is going to be. Perhaps there's some kind of stripe
caching effect going on here? Was the md device fully initialised
before you ran these tests?

> >
> > /dev/md1 ext3 read 199 7
> > /dev/md1 ext3 write 36 83
> >
> > /dev/md0 ufs read 212 0
> > /dev/md0 ufs write 53 75
> >
> > /dev/md0 ext2 read 202 2
> > /dev/md0 ext2 write 34 84

I suspect what you are seeing here is either the latency introduced
by having to allocate blocks before issuing the IO, or that the file
layout resulting from allocation is not ideal. Single threaded direct IO is
latency bound, not bandwidth bound and, as such, is IO size
sensitive. Allocation for direct IO is also IO size sensitive -
there's typically an allocation per IO, so the more IO you have to
do, the more allocation that occurs.

So, on XFS, what does "xfs_bmap -vvp /tmp/diskmnt/filewr.zero"
output for the file you wrote? Specifically, I'm interested whether
it aligned the allocations to the stripe unit boundary, and if so,
what offset into the device those extents sit at....

Also, you should run iostat and blktrace to determine if MD is
doing RMW cycles when being written to through the filesystem.
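
Something along these lines would do (interval, device names and trace
name are just suggestions):

$ iostat -d -k -x 5 /dev/sda /dev/sdb /dev/sdc
$ blktrace -d /dev/md0 -o md0trace &
$ blkparse -i md0trace | less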

> > Is it possible that the filesystem has such enormous impact in the
> > write speed? We are talking about a slowdown of 80%!!! Even a
> > filesystem as simple as ufs has a slowdown of 75%! What am I missing?
>
> One thing you're missing is enough info to debug this.
>
> /proc/mdstat, kernel version, xfs_info output, mkfs commandlines used,
> partition table details, etc.

There's a good list here:

http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F

> If something is misaligned and you are doing RMW for these IOs it could
> hurt a lot.
>
> -Eric
>
> > Thank you,
> >
>
>

--
Dave Chinner
[email protected]


2013-11-21 09:11:42

by Martin Boutin

Subject: Re: Filesystem writes on RAID5 too slow

On Mon, Nov 18, 2013 at 7:57 PM, Dave Chinner <[email protected]> wrote:
> On Mon, Nov 18, 2013 at 12:28:21PM -0600, Eric Sandeen wrote:
>> On 11/18/13, 10:02 AM, Martin Boutin wrote:
>> > Dear list,
>> >
>> > I am writing about an apparent issue (or maybe it is normal, that's my
>> > question) regarding filesystem write speed in in a linux raid device.
>> > More specifically, I have linux-3.10.10 running in an Intel Haswell
>> > embedded system with 3 HDDs in a RAID-5 configuration.
>> > The hard disks have 4k physical sectors which are reported as 512
>> > logical size. I made sure the partitions underlying the raid device
>> > start at sector 2048.
>>
>> (fixed cc: to xfs list)
>>
>> > The RAID device has version 1.2 metadata and 4k (bytes) of data
>> > offset, therefore the data should also be 4k aligned. The raid chunk
>> > size is 512K.
>> >
>> > I have the md0 raid device formatted as ext3 with a 4k block size, and
>> > stride and stripes correctly chosen to match the raid chunk size, that
>> > is, stride=128,stripe-width=256.
>> >
>> > While I was working in a small university project, I just noticed that
>> > the write speeds when using a filesystem over raid are *much* slower
>> > than when writing directly to the raid device (or even compared to
>> > filesystem read speeds).
>> >
>> > The command line for measuring filesystem read and write speeds was:
>> >
>> > $ dd if=/tmp/diskmnt/filerd.zero of=/dev/null bs=1M count=1000 iflag=direct
>> > $ dd if=/dev/zero of=/tmp/diskmnt/filewr.zero bs=1M count=1000 oflag=direct
>> >
>> > The command line for measuring raw read and write speeds was:
>> >
>> > $ dd if=/dev/md0 of=/dev/null bs=1M count=1000 iflag=direct
>> > $ dd if=/dev/zero of=/dev/md0 bs=1M count=1000 oflag=direct
>> >
>> > Here are some speed measures using dd (an average of 20 runs).:
>> >
>> > device raw/fs mode speed (MB/s) slowdown (%)
>> > /dev/md0 raw read 207
>> > /dev/md0 raw write 209
>> > /dev/md1 raw read 214
>> > /dev/md1 raw write 212
>
> So, that's writing to the first 1GB of /dev/md0, and all the writes
> are going to be aligned to the MD stripe.
>
>> > /dev/md0 xfs read 188 9
>> > /dev/md0 xfs write 35 83
>
> And these will not be written to the first 1GB of the block device
> but somewhere else. Most likely a region that hasn't otherwise been
> used, and so isn't going to be overwriting the same blocks like the
> /dev/md0 case is going to be. Perhaps there's some kind of stripe
> caching effect going on here? Was the md device fully initialised
> before you ran these tests?
>
>> >
>> > /dev/md1 ext3 read 199 7
>> > /dev/md1 ext3 write 36 83
>> >
>> > /dev/md0 ufs read 212 0
>> > /dev/md0 ufs write 53 75
>> >
>> > /dev/md0 ext2 read 202 2
>> > /dev/md0 ext2 write 34 84
>
> I suspect what you are seeing here is either the latency introduced
> by having to allocate blocks before issuing the IO, or the file
> layout due to allocation is not idea. Single threaded direct IO is
> latency bound, not bandwidth bound and, as such, is IO size
> sensitive. Allocation for direct IO is also IO size sensitive -
> there's typically an allocation per IO, so the more IO you have to
> do, the more allocation that occurs.

I just did a few more tests, this time with ext4:

device      raw/fs   mode    speed (MB/s)   slowdown (%)
/dev/md0    ext4     read    199            4%
/dev/md0    ext4     write   210            0%

This time, no slowdown at all on ext4. I believe this is due to the
multiblock allocation feature of ext4 (I'm using O_DIRECT, so that is
what should be kicking in). So I guess for the other filesystems, it was
indeed the latency introduced by block allocation.
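
One way to double-check that would be to look at the extent layout of
the ext4 test file, e.g.:

$ filefrag -v /tmp/diskmnt/filewr.zero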

Thanks,
- Martin

>
> So, on XFS, what does "xfs_bmap -vvp /tmp/diskmnt/filewr.zero"
> output for the file you wrote? Specifically, I'm interested whether
> it aligned the allocations to the stripe unit boundary, and if so,
> what offset into the device those extents sit at....
>
> Also, you should run iostat and blktrace to determine if MD is
> doing RMW cycles when being written to through the filesystem.
>
>> > Is it possible that the filesystem has such enormous impact in the
>> > write speed? We are talking about a slowdown of 80%!!! Even a
>> > filesystem as simple as ufs has a slowdown of 75%! What am I missing?
>>
>> One thing you're missing is enough info to debug this.
>>
>> /proc/mdstat, kernel version, xfs_info output, mkfs commandlines used,
>> partition table details, etc.
>
> THere's a good list here:
>
> http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
>
>> If something is misaligned and you are doing RMW for these IOs it could
>> hurt a lot.
>>
>> -Eric
>>
>> > Thank you,
>> >
>>
>>
>
> --
> Dave Chinner
> [email protected]

2013-11-21 09:26:33

by Dave Chinner

Subject: Re: Filesystem writes on RAID5 too slow

On Thu, Nov 21, 2013 at 04:11:41AM -0500, Martin Boutin wrote:
> On Mon, Nov 18, 2013 at 7:57 PM, Dave Chinner <[email protected]> wrote:
> > On Mon, Nov 18, 2013 at 12:28:21PM -0600, Eric Sandeen wrote:
> >> On 11/18/13, 10:02 AM, Martin Boutin wrote:
> >> > Dear list,
> >> >
> >> > I am writing about an apparent issue (or maybe it is normal, that's my
> >> > question) regarding filesystem write speed in in a linux raid device.
> >> > More specifically, I have linux-3.10.10 running in an Intel Haswell
> >> > embedded system with 3 HDDs in a RAID-5 configuration.
> >> > The hard disks have 4k physical sectors which are reported as 512
> >> > logical size. I made sure the partitions underlying the raid device
> >> > start at sector 2048.
> >>
> >> (fixed cc: to xfs list)
> >>
> >> > The RAID device has version 1.2 metadata and 4k (bytes) of data
> >> > offset, therefore the data should also be 4k aligned. The raid chunk
> >> > size is 512K.
> >> >
> >> > I have the md0 raid device formatted as ext3 with a 4k block size, and
> >> > stride and stripes correctly chosen to match the raid chunk size, that
> >> > is, stride=128,stripe-width=256.
> >> >
> >> > While I was working in a small university project, I just noticed that
> >> > the write speeds when using a filesystem over raid are *much* slower
> >> > than when writing directly to the raid device (or even compared to
> >> > filesystem read speeds).
> >> >
> >> > The command line for measuring filesystem read and write speeds was:
> >> >
> >> > $ dd if=/tmp/diskmnt/filerd.zero of=/dev/null bs=1M count=1000 iflag=direct
> >> > $ dd if=/dev/zero of=/tmp/diskmnt/filewr.zero bs=1M count=1000 oflag=direct
> >> >
> >> > The command line for measuring raw read and write speeds was:
> >> >
> >> > $ dd if=/dev/md0 of=/dev/null bs=1M count=1000 iflag=direct
> >> > $ dd if=/dev/zero of=/dev/md0 bs=1M count=1000 oflag=direct
> >> >
> >> > Here are some speed measures using dd (an average of 20 runs).:
> >> >
> >> > device raw/fs mode speed (MB/s) slowdown (%)
> >> > /dev/md0 raw read 207
> >> > /dev/md0 raw write 209
> >> > /dev/md1 raw read 214
> >> > /dev/md1 raw write 212
> >
> > So, that's writing to the first 1GB of /dev/md0, and all the writes
> > are going to be aligned to the MD stripe.
> >
> >> > /dev/md0 xfs read 188 9
> >> > /dev/md0 xfs write 35 83
> >
> > And these will not be written to the first 1GB of the block device
> > but somewhere else. Most likely a region that hasn't otherwise been
> > used, and so isn't going to be overwriting the same blocks like the
> > /dev/md0 case is going to be. Perhaps there's some kind of stripe
> > caching effect going on here? Was the md device fully initialised
> > before you ran these tests?
> >
> >> >
> >> > /dev/md1 ext3 read 199 7
> >> > /dev/md1 ext3 write 36 83
> >> >
> >> > /dev/md0 ufs read 212 0
> >> > /dev/md0 ufs write 53 75
> >> >
> >> > /dev/md0 ext2 read 202 2
> >> > /dev/md0 ext2 write 34 84
> >
> > I suspect what you are seeing here is either the latency introduced
> > by having to allocate blocks before issuing the IO, or the file
> > layout due to allocation is not idea. Single threaded direct IO is
> > latency bound, not bandwidth bound and, as such, is IO size
> > sensitive. Allocation for direct IO is also IO size sensitive -
> > there's typically an allocation per IO, so the more IO you have to
> > do, the more allocation that occurs.
>
> I just did a few more tests, this time with ext4:
>
> device raw/fs mode speed (MB/s) slowdown (%)
> /dev/md0 ext4 read 199 4%
> /dev/md0 ext4 write 210 0%
>
> This time, no slowdown at all on ext4. I believe this is due to the
> multiblock allocation feature of ext4 (I'm using O_DIRECT, so it
> should be it). So I guess for the other filesystems, it was indeed
> the latency introduced by block allocation.

Except that XFS does extent based allocation as well, so that's not
likely the reason. The fact that ext4 doesn't see a slowdown like
every other filesystem really doesn't make a lot of sense to
me, either from an IO dispatch point of view or an IO alignment
point of view.

Why? Because all the filesystems align identically to the underlying
device and all should be doing 4k block aligned IO, and XFS has
roughly the same allocation overhead for this workload as ext4.
Did you retest XFS or any of the other filesystems directly after
running the ext4 tests (i.e. confirm you are testing apples to
apples)?

To determine why the other filesystems are slow (and why ext4 is fast)
we need more information about your configuration, and block traces
showing what is happening at the IO level, as was requested in a
previous email....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2013-11-21 09:50:51

by Martin Boutin

Subject: Re: Filesystem writes on RAID5 too slow

On Thu, Nov 21, 2013 at 4:26 AM, Dave Chinner <[email protected]> wrote:
> On Thu, Nov 21, 2013 at 04:11:41AM -0500, Martin Boutin wrote:
>> On Mon, Nov 18, 2013 at 7:57 PM, Dave Chinner <[email protected]> wrote:
>> > On Mon, Nov 18, 2013 at 12:28:21PM -0600, Eric Sandeen wrote:
>> >> On 11/18/13, 10:02 AM, Martin Boutin wrote:
>> >> > Dear list,
>> >> >
>> >> > I am writing about an apparent issue (or maybe it is normal, that's my
>> >> > question) regarding filesystem write speed in in a linux raid device.
>> >> > More specifically, I have linux-3.10.10 running in an Intel Haswell
>> >> > embedded system with 3 HDDs in a RAID-5 configuration.
>> >> > The hard disks have 4k physical sectors which are reported as 512
>> >> > logical size. I made sure the partitions underlying the raid device
>> >> > start at sector 2048.
>> >>
>> >> (fixed cc: to xfs list)
>> >>
>> >> > The RAID device has version 1.2 metadata and 4k (bytes) of data
>> >> > offset, therefore the data should also be 4k aligned. The raid chunk
>> >> > size is 512K.
>> >> >
>> >> > I have the md0 raid device formatted as ext3 with a 4k block size, and
>> >> > stride and stripes correctly chosen to match the raid chunk size, that
>> >> > is, stride=128,stripe-width=256.
>> >> >
>> >> > While I was working in a small university project, I just noticed that
>> >> > the write speeds when using a filesystem over raid are *much* slower
>> >> > than when writing directly to the raid device (or even compared to
>> >> > filesystem read speeds).
>> >> >
>> >> > The command line for measuring filesystem read and write speeds was:
>> >> >
>> >> > $ dd if=/tmp/diskmnt/filerd.zero of=/dev/null bs=1M count=1000 iflag=direct
>> >> > $ dd if=/dev/zero of=/tmp/diskmnt/filewr.zero bs=1M count=1000 oflag=direct
>> >> >
>> >> > The command line for measuring raw read and write speeds was:
>> >> >
>> >> > $ dd if=/dev/md0 of=/dev/null bs=1M count=1000 iflag=direct
>> >> > $ dd if=/dev/zero of=/dev/md0 bs=1M count=1000 oflag=direct
>> >> >
>> >> > Here are some speed measures using dd (an average of 20 runs).:
>> >> >
>> >> > device raw/fs mode speed (MB/s) slowdown (%)
>> >> > /dev/md0 raw read 207
>> >> > /dev/md0 raw write 209
>> >> > /dev/md1 raw read 214
>> >> > /dev/md1 raw write 212
>> >
>> > So, that's writing to the first 1GB of /dev/md0, and all the writes
>> > are going to be aligned to the MD stripe.
>> >
>> >> > /dev/md0 xfs read 188 9
>> >> > /dev/md0 xfs write 35 83
>> >
>> > And these will not be written to the first 1GB of the block device
>> > but somewhere else. Most likely a region that hasn't otherwise been
>> > used, and so isn't going to be overwriting the same blocks like the
>> > /dev/md0 case is going to be. Perhaps there's some kind of stripe
>> > caching effect going on here? Was the md device fully initialised
>> > before you ran these tests?
>> >
>> >> >
>> >> > /dev/md1 ext3 read 199 7
>> >> > /dev/md1 ext3 write 36 83
>> >> >
>> >> > /dev/md0 ufs read 212 0
>> >> > /dev/md0 ufs write 53 75
>> >> >
>> >> > /dev/md0 ext2 read 202 2
>> >> > /dev/md0 ext2 write 34 84
>> >
>> > I suspect what you are seeing here is either the latency introduced
>> > by having to allocate blocks before issuing the IO, or the file
>> > layout due to allocation is not idea. Single threaded direct IO is
>> > latency bound, not bandwidth bound and, as such, is IO size
>> > sensitive. Allocation for direct IO is also IO size sensitive -
>> > there's typically an allocation per IO, so the more IO you have to
>> > do, the more allocation that occurs.
>>
>> I just did a few more tests, this time with ext4:
>>
>> device raw/fs mode speed (MB/s) slowdown (%)
>> /dev/md0 ext4 read 199 4%
>> /dev/md0 ext4 write 210 0%
>>
>> This time, no slowdown at all on ext4. I believe this is due to the
>> multiblock allocation feature of ext4 (I'm using O_DIRECT, so it
>> should be it). So I guess for the other filesystems, it was indeed
>> the latency introduced by block allocation.
>
> Except that XFS does extent based allocation as well, so that's not
> likely the reason. The fact that ext4 doesn't see a slowdown like
> every other filesystem really doesn't make a lot of sense to
> me, either from an IO dispatch point of view or an IO alignment
> point of view.
>
> Why? Because all the filesystems align identically to the underlying
> device and all should be doing 4k block aligned IO, and XFS has
> roughly the same allocation overhead for this workload as ext4.
> Did you retest XFS or any of the other filesystems directly after
> running the ext4 tests (i.e. confirm you are testing apples to
> apples)?

Yes I did, the performance figures did not change for either XFS or ext3.
>
> What we need to determine why other filesystems are slow (and why
> ext4 is fast) is more information about your configuration and block
> traces showing what is happening at the IO level, like was requested
> in a previous email....

Ok, I'm going to try coming up with meaningful data. Thanks.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]



--
Martin Boutin


2013-11-21 13:31:38

by Martin Boutin

Subject: Re: Filesystem writes on RAID5 too slow

$ uname -a
Linux haswell1 3.10.10 #1 SMP PREEMPT Wed Oct 2 11:22:22 CEST 2013
i686 GNU/Linux

$ xfs_repair -V
xfs_repair version 3.1.4

$ cat /proc/cpuinfo | grep processor
processor : 0
processor : 1

$ mkfs.xfs -s size=4096 -f -l size=32m /dev/md0
$ mount -t xfs /dev/md0 /tmp/diskmnt/
$ dd if=/dev/zero of=/tmp/diskmnt/filewr.zero bs=1M count=1000 oflag=direct
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 28.0304 s, 37.4 MB/s

$ cat /proc/meminfo
MemTotal: 1313956 kB
MemFree: 1099936 kB
Buffers: 13232 kB
Cached: 141452 kB
SwapCached: 0 kB
Active: 128960 kB
Inactive: 55936 kB
Active(anon): 30548 kB
Inactive(anon): 1096 kB
Active(file): 98412 kB
Inactive(file): 54840 kB
Unevictable: 0 kB
Mlocked: 0 kB
HighTotal: 626696 kB
HighFree: 452472 kB
LowTotal: 687260 kB
LowFree: 647464 kB
SwapTotal: 72256 kB
SwapFree: 72256 kB
Dirty: 8 kB
Writeback: 0 kB
AnonPages: 30172 kB
Mapped: 15764 kB
Shmem: 1432 kB
Slab: 14720 kB
SReclaimable: 6632 kB
SUnreclaim: 8088 kB
KernelStack: 1792 kB
PageTables: 1176 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 729232 kB
Committed_AS: 734116 kB
VmallocTotal: 327680 kB
VmallocUsed: 10192 kB
VmallocChunk: 294904 kB
DirectMap4k: 12280 kB
DirectMap4M: 692224 kB

$ cat /proc/mounts
(...)
/dev/md0 /tmp/diskmnt xfs
rw,relatime,attr2,inode64,sunit=1024,swidth=2048,noquota 0 0

$ cat /proc/partitions
major minor #blocks name

8 0 976762584 sda
8 1 10281600 sda1
8 2 966479960 sda2
8 16 976762584 sdb
8 17 10281600 sdb1
8 18 966479960 sdb2
8 32 976762584 sdc
8 33 10281600 sdc1
8 34 966479960 sdc2
(...)
9 1 20560896 md1
9 0 1932956672 md0

# same layout for other disks
$ fdisk -c -u /dev/sda

The device presents a logical sector size that is smaller than
the physical sector size. Aligning to a physical sector (or optimal
I/O) size boundary is recommended, or performance may be impacted.

Command (m for help): p

Disk /dev/sda: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders, total 1953525168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x00000000

Device Boot Start End Blocks Id System
/dev/sda1 2048 20565247 10281600 83 Linux
/dev/sda2 20565248 1953525167 966479960 83 Linux

# unfortunately I had to reinitialize the array and recovery takes a
# while.. it does not impact performance much though.
$ cat /proc/mdstat
Personalities : [linear] [raid6] [raid5] [raid4]
md0 : active raid5 sda2[0] sdc2[3] sdb2[1]
1932956672 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/2] [UU_]
[>....................] recovery = 2.4% (23588740/966478336)
finish=156.6min speed=100343K/sec
bitmap: 0/1 pages [0KB], 2097152KB chunk


# sda sdb and sdc are the same model
$ hdparm -I /dev/sda

/dev/sda:

ATA device, with non-removable media
Model Number: HGST HCC541010A9E680
(...)
Firmware Revision: JA0OA560
Transport: Serial, ATA8-AST, SATA 1.0a, SATA II
Extensions, SATA Rev 2.5, SATA Rev 2.6; Revision: ATA8-AST T13 Project
D1697 Revision 0b
Standards:
Used: unknown (minor revision code 0x0028)
Supported: 8 7 6 5
Likely used: 8
Configuration:
Logical max current
cylinders 16383 16383
heads 16 16
sectors/track 63 63
--
CHS current addressable sectors: 16514064
LBA user addressable sectors: 268435455
LBA48 user addressable sectors: 1953525168
Logical Sector size: 512 bytes
Physical Sector size: 4096 bytes
Logical Sector-0 offset: 0 bytes
device size with M = 1024*1024: 953869 MBytes
device size with M = 1000*1000: 1000204 MBytes (1000 GB)
cache/buffer size = 8192 KBytes (type=DualPortCache)
Form Factor: 2.5 inch
Nominal Media Rotation Rate: 5400
Capabilities:
LBA, IORDY(can be disabled)
Queue depth: 32
Standby timer values: spec'd by Standard, no device specific minimum
R/W multiple sector transfer: Max = 16 Current = 16
Advanced power management level: 128
DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
Cycle time: min=120ns recommended=120ns
PIO: pio0 pio1 pio2 pio3 pio4
Cycle time: no flow control=120ns IORDY flow control=120ns

$ hdparm -I /dev/sd{a,b,c} | grep "Write cache"
* Write cache
* Write cache
* Write cache
# therefore write cache is enabled in all drives
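# (hdparm -W would report the write-caching state explicitly, e.g.:)
$ hdparm -W /dev/sd{a,b,c}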

$ xfs_info /dev/md0
meta-data=/dev/md0 isize=256 agcount=32, agsize=15101312 blks
= sectsz=4096 attr=2
data = bsize=4096 blocks=483239168, imaxpct=5
= sunit=128 swidth=256 blks
naming =version 2 bsize=4096 ascii-ci=0
log =internal bsize=4096 blocks=8192, version=2
= sectsz=4096 sunit=1 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0

$ xfs_bmap -vvp /tmp/diskmnt/filewr.zero
/tmp/diskmnt/filewr.zero:
EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS
0: [0..2047999]: 2049056..4097055 0 (2049056..4097055) 2048000 01111
FLAG Values:
010000 Unwritten preallocated extent
001000 Doesn't begin on stripe unit
000100 Doesn't end on stripe unit
000010 Doesn't begin on stripe width
000001 Doesn't end on stripe width
# this does not look good, does it?

# run while dd was executing; looks like we get almost half as many
# reads as writes....
$ iostat -d -k 30 2 /dev/sda2 /dev/sdb2 /dev/sdc2
Linux 3.10.10 (haswell1) 11/21/2013 _i686_ (2 CPU)

Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda2 13.75 6639.52 232.17 78863819 2757731
sdb2 13.74 6639.42 232.24 78862660 2758483
sdc2 13.68 55.86 6813.67 663443 80932375

Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda2 78.27 11191.20 22556.07 335736 676682
sdb2 78.30 11175.73 22589.13 335272 677674
sdc2 78.30 5506.13 28258.47 165184 847754

Thanks
- Martin

On Thu, Nov 21, 2013 at 4:50 AM, Martin Boutin <[email protected]> wrote:
> On Thu, Nov 21, 2013 at 4:26 AM, Dave Chinner <[email protected]> wrote:
>> On Thu, Nov 21, 2013 at 04:11:41AM -0500, Martin Boutin wrote:
>>> On Mon, Nov 18, 2013 at 7:57 PM, Dave Chinner <[email protected]> wrote:
>>> > On Mon, Nov 18, 2013 at 12:28:21PM -0600, Eric Sandeen wrote:
>>> >> On 11/18/13, 10:02 AM, Martin Boutin wrote:
>>> >> > Dear list,
>>> >> >
>>> >> > I am writing about an apparent issue (or maybe it is normal, that's my
>>> >> > question) regarding filesystem write speed in in a linux raid device.
>>> >> > More specifically, I have linux-3.10.10 running in an Intel Haswell
>>> >> > embedded system with 3 HDDs in a RAID-5 configuration.
>>> >> > The hard disks have 4k physical sectors which are reported as 512
>>> >> > logical size. I made sure the partitions underlying the raid device
>>> >> > start at sector 2048.
>>> >>
>>> >> (fixed cc: to xfs list)
>>> >>
>>> >> > The RAID device has version 1.2 metadata and 4k (bytes) of data
>>> >> > offset, therefore the data should also be 4k aligned. The raid chunk
>>> >> > size is 512K.
>>> >> >
>>> >> > I have the md0 raid device formatted as ext3 with a 4k block size, and
>>> >> > stride and stripes correctly chosen to match the raid chunk size, that
>>> >> > is, stride=128,stripe-width=256.
>>> >> >
>>> >> > While I was working in a small university project, I just noticed that
>>> >> > the write speeds when using a filesystem over raid are *much* slower
>>> >> > than when writing directly to the raid device (or even compared to
>>> >> > filesystem read speeds).
>>> >> >
>>> >> > The command line for measuring filesystem read and write speeds was:
>>> >> >
>>> >> > $ dd if=/tmp/diskmnt/filerd.zero of=/dev/null bs=1M count=1000 iflag=direct
>>> >> > $ dd if=/dev/zero of=/tmp/diskmnt/filewr.zero bs=1M count=1000 oflag=direct
>>> >> >
>>> >> > The command line for measuring raw read and write speeds was:
>>> >> >
>>> >> > $ dd if=/dev/md0 of=/dev/null bs=1M count=1000 iflag=direct
>>> >> > $ dd if=/dev/zero of=/dev/md0 bs=1M count=1000 oflag=direct
>>> >> >
>>> >> > Here are some speed measures using dd (an average of 20 runs).:
>>> >> >
>>> >> > device raw/fs mode speed (MB/s) slowdown (%)
>>> >> > /dev/md0 raw read 207
>>> >> > /dev/md0 raw write 209
>>> >> > /dev/md1 raw read 214
>>> >> > /dev/md1 raw write 212
>>> >
>>> > So, that's writing to the first 1GB of /dev/md0, and all the writes
>>> > are going to be aligned to the MD stripe.
>>> >
>>> >> > /dev/md0 xfs read 188 9
>>> >> > /dev/md0 xfs write 35 83
>>> >
>>> > And these will not be written to the first 1GB of the block device
>>> > but somewhere else. Most likely a region that hasn't otherwise been
>>> > used, and so isn't going to be overwriting the same blocks like the
>>> > /dev/md0 case is going to be. Perhaps there's some kind of stripe
>>> > caching effect going on here? Was the md device fully initialised
>>> > before you ran these tests?
>>> >
>>> >> >
>>> >> > /dev/md1 ext3 read 199 7
>>> >> > /dev/md1 ext3 write 36 83
>>> >> >
>>> >> > /dev/md0 ufs read 212 0
>>> >> > /dev/md0 ufs write 53 75
>>> >> >
>>> >> > /dev/md0 ext2 read 202 2
>>> >> > /dev/md0 ext2 write 34 84
>>> >
>>> > I suspect what you are seeing here is either the latency introduced
>>> > by having to allocate blocks before issuing the IO, or the file
>>> > layout due to allocation is not idea. Single threaded direct IO is
>>> > latency bound, not bandwidth bound and, as such, is IO size
>>> > sensitive. Allocation for direct IO is also IO size sensitive -
>>> > there's typically an allocation per IO, so the more IO you have to
>>> > do, the more allocation that occurs.
>>>
>>> I just did a few more tests, this time with ext4:
>>>
>>> device raw/fs mode speed (MB/s) slowdown (%)
>>> /dev/md0 ext4 read 199 4%
>>> /dev/md0 ext4 write 210 0%
>>>
>>> This time, no slowdown at all on ext4. I believe this is due to the
>>> multiblock allocation feature of ext4 (I'm using O_DIRECT, so it
>>> should be it). So I guess for the other filesystems, it was indeed
>>> the latency introduced by block allocation.
>>
>> Except that XFS does extent based allocation as well, so that's not
>> likely the reason. The fact that ext4 doesn't see a slowdown like
>> every other filesystem really doesn't make a lot of sense to
>> me, either from an IO dispatch point of view or an IO alignment
>> point of view.
>>
>> Why? Because all the filesystems align identically to the underlying
>> device and all should be doing 4k block aligned IO, and XFS has
>> roughly the same allocation overhead for this workload as ext4.
>> Did you retest XFS or any of the other filesystems directly after
>> running the ext4 tests (i.e. confirm you are testing apples to
>> apples)?
>
> Yes I did, the performance figures did not change for either XFS or ext3.
>>
>> What we need to determine why other filesystems are slow (and why
>> ext4 is fast) is more information about your configuration and block
>> traces showing what is happening at the IO level, like was requested
>> in a previous email....
>
> Ok, I'm going to try coming up with meaningful data. Thanks.
>>
>> Cheers,
>>
>> Dave.
>> --
>> Dave Chinner
>> [email protected]
>
>
>
> --
> Martin Boutin

2013-11-21 16:35:15

by Martin Boutin

Subject: Re: Filesystem writes on RAID5 too slow

Sorry for the spam, but I just noticed that the XFS stripe unit does not
match the stripe unit of the underlying RAID device. I tried to do a
mkfs.xfs with a stripe of 512KiB, but mkfs.xfs complains that the
maximum stripe width is 256KiB.

So I recreated the RAID with a stripe of 256KiB:
$ cat /proc/mdstat
Personalities : [linear] [raid6] [raid5] [raid4]
md0 : active raid5 sdc2[3] sdb2[1] sda2[0]
1932957184 blocks super 1.2 level 5, 256k chunk, algorithm 2 [3/2] [UU_]
resync=DELAYED
bitmap: 1/1 pages [4KB], 2097152KB chunk

and called mkfs.xfs with the proper parameters:
$ mkfs.xfs -d sunit=512,swidth=1024 -f -l size=32m /dev/md0
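
For reference, sunit/swidth there are in 512-byte sectors, so that
should be equivalent to spelling the sizes out, something like:

$ mkfs.xfs -d su=256k,sw=2 -f -l size=32m /dev/md0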

Unfortunately the file is still created unaligned to the RAID stripe.
$ xfs_bmap -vvp /tmp/diskmnt/filewr.zero
/tmp/diskmnt/filewr.zero:
EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS
0: [0..507903]: 2048544..2556447 0 (2048544..2556447) 507904 01111
FLAG Values:
010000 Unwritten preallocated extent
001000 Doesn't begin on stripe unit
000100 Doesn't end on stripe unit
000010 Doesn't begin on stripe width
000001 Doesn't end on stripe width

Now I'm out of ideas..

- Martin

On Thu, Nov 21, 2013 at 8:31 AM, Martin Boutin <[email protected]> wrote:
> $ uname -a
> Linux haswell1 3.10.10 #1 SMP PREEMPT Wed Oct 2 11:22:22 CEST 2013
> i686 GNU/Linux
>
> $ xfs_repair -V
> xfs_repair version 3.1.4
>
> $ cat /proc/cpuinfo | grep processor
> processor : 0
> processor : 1
>
> $ mkfs.xfs -s size=4096 -f -l size=32m /dev/md0
> $ mount -t xfs /dev/md0 /tmp/diskmnt/
> $ dd if=/dev/zero of=/tmp/diskmnt/filewr.zero bs=1M count=1000 oflag=direct
> 1000+0 records in
> 1000+0 records out
> 1048576000 bytes (1.0 GB) copied, 28.0304 s, 37.4 MB/s
>
> $ cat /proc/meminfo
> MemTotal: 1313956 kB
> MemFree: 1099936 kB
> Buffers: 13232 kB
> Cached: 141452 kB
> SwapCached: 0 kB
> Active: 128960 kB
> Inactive: 55936 kB
> Active(anon): 30548 kB
> Inactive(anon): 1096 kB
> Active(file): 98412 kB
> Inactive(file): 54840 kB
> Unevictable: 0 kB
> Mlocked: 0 kB
> HighTotal: 626696 kB
> HighFree: 452472 kB
> LowTotal: 687260 kB
> LowFree: 647464 kB
> SwapTotal: 72256 kB
> SwapFree: 72256 kB
> Dirty: 8 kB
> Writeback: 0 kB
> AnonPages: 30172 kB
> Mapped: 15764 kB
> Shmem: 1432 kB
> Slab: 14720 kB
> SReclaimable: 6632 kB
> SUnreclaim: 8088 kB
> KernelStack: 1792 kB
> PageTables: 1176 kB
> NFS_Unstable: 0 kB
> Bounce: 0 kB
> WritebackTmp: 0 kB
> CommitLimit: 729232 kB
> Committed_AS: 734116 kB
> VmallocTotal: 327680 kB
> VmallocUsed: 10192 kB
> VmallocChunk: 294904 kB
> DirectMap4k: 12280 kB
> DirectMap4M: 692224 kB
>
> $ cat /proc/mounts
> (...)
> /dev/md0 /tmp/diskmnt xfs
> rw,relatime,attr2,inode64,sunit=1024,swidth=2048,noquota 0 0
>
> $ cat /proc/partitions
> major minor #blocks name
>
> 8 0 976762584 sda
> 8 1 10281600 sda1
> 8 2 966479960 sda2
> 8 16 976762584 sdb
> 8 17 10281600 sdb1
> 8 18 966479960 sdb2
> 8 32 976762584 sdc
> 8 33 10281600 sdc1
> 8 34 966479960 sdc2
> (...)
> 9 1 20560896 md1
> 9 0 1932956672 md0
>
> # same layout for other disks
> $ fdisk -c -u /dev/sda
>
> The device presents a logical sector size that is smaller than
> the physical sector size. Aligning to a physical sector (or optimal
> I/O) size boundary is recommended, or performance may be impacted.
>
> Command (m for help): p
>
> Disk /dev/sda: 1000.2 GB, 1000204886016 bytes
> 255 heads, 63 sectors/track, 121601 cylinders, total 1953525168 sectors
> Units = sectors of 1 * 512 = 512 bytes
> Sector size (logical/physical): 512 bytes / 4096 bytes
> I/O size (minimum/optimal): 4096 bytes / 4096 bytes
> Disk identifier: 0x00000000
>
> Device Boot Start End Blocks Id System
> /dev/sda1 2048 20565247 10281600 83 Linux
> /dev/sda2 20565248 1953525167 966479960 83 Linux
>
> # unfortunately I had to reinitelize the array and recovery takes a
> while.. it does not impact performance much though.
> $ cat /proc/mdstat
> Personalities : [linear] [raid6] [raid5] [raid4]
> md0 : active raid5 sda2[0] sdc2[3] sdb2[1]
> 1932956672 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/2] [UU_]
> [>....................] recovery = 2.4% (23588740/966478336)
> finish=156.6min speed=100343K/sec
> bitmap: 0/1 pages [0KB], 2097152KB chunk
>
>
> # sda sdb and sdc are the same model
> $ hdparm -I /dev/sda
>
> /dev/sda:
>
> ATA device, with non-removable media
> Model Number: HGST HCC541010A9E680
> (...)
> Firmware Revision: JA0OA560
> Transport: Serial, ATA8-AST, SATA 1.0a, SATA II
> Extensions, SATA Rev 2.5, SATA Rev 2.6; Revision: ATA8-AST T13 Project
> D1697 Revision 0b
> Standards:
> Used: unknown (minor revision code 0x0028)
> Supported: 8 7 6 5
> Likely used: 8
> Configuration:
> Logical max current
> cylinders 16383 16383
> heads 16 16
> sectors/track 63 63
> --
> CHS current addressable sectors: 16514064
> LBA user addressable sectors: 268435455
> LBA48 user addressable sectors: 1953525168
> Logical Sector size: 512 bytes
> Physical Sector size: 4096 bytes
> Logical Sector-0 offset: 0 bytes
> device size with M = 1024*1024: 953869 MBytes
> device size with M = 1000*1000: 1000204 MBytes (1000 GB)
> cache/buffer size = 8192 KBytes (type=DualPortCache)
> Form Factor: 2.5 inch
> Nominal Media Rotation Rate: 5400
> Capabilities:
> LBA, IORDY(can be disabled)
> Queue depth: 32
> Standby timer values: spec'd by Standard, no device specific minimum
> R/W multiple sector transfer: Max = 16 Current = 16
> Advanced power management level: 128
> DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
> Cycle time: min=120ns recommended=120ns
> PIO: pio0 pio1 pio2 pio3 pio4
> Cycle time: no flow control=120ns IORDY flow control=120ns
>
> $ hdparm -I /dev/sd{a,b,c} | grep "Write cache"
> * Write cache
> * Write cache
> * Write cache
> # therefore write cache is enabled in all drives
>
> $ xfs_info /dev/md0
> meta-data=/dev/md0 isize=256 agcount=32, agsize=15101312 blks
> = sectsz=4096 attr=2
> data = bsize=4096 blocks=483239168, imaxpct=5
> = sunit=128 swidth=256 blks
> naming =version 2 bsize=4096 ascii-ci=0
> log =internal bsize=4096 blocks=8192, version=2
> = sectsz=4096 sunit=1 blks, lazy-count=1
> realtime =none extsz=4096 blocks=0, rtextents=0
>
> $ xfs_bmap -vvp /tmp/diskmnt/filewr.zero
> /tmp/diskmnt/filewr.zero:
> EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS
> 0: [0..2047999]: 2049056..4097055 0 (2049056..4097055) 2048000 01111
> FLAG Values:
> 010000 Unwritten preallocated extent
> 001000 Doesn't begin on stripe unit
> 000100 Doesn't end on stripe unit
> 000010 Doesn't begin on stripe width
> 000001 Doesn't end on stripe width
> # this does not look good, does it?
>
> # run while dd was executing, looks like we have almost the half
> writes as reads....
> $ iostat -d -k 30 2 /dev/sda2 /dev/sdb2 /dev/sdc2
> Linux 3.10.10 (haswell1) 11/21/2013 _i686_ (2 CPU)
>
> Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
> sda2 13.75 6639.52 232.17 78863819 2757731
> sdb2 13.74 6639.42 232.24 78862660 2758483
> sdc2 13.68 55.86 6813.67 663443 80932375
>
> Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
> sda2 78.27 11191.20 22556.07 335736 676682
> sdb2 78.30 11175.73 22589.13 335272 677674
> sdc2 78.30 5506.13 28258.47 165184 847754
>
> Thanks
> - Martin
>
> On Thu, Nov 21, 2013 at 4:50 AM, Martin Boutin <[email protected]> wrote:
>> On Thu, Nov 21, 2013 at 4:26 AM, Dave Chinner <[email protected]> wrote:
>>> On Thu, Nov 21, 2013 at 04:11:41AM -0500, Martin Boutin wrote:
>>>> On Mon, Nov 18, 2013 at 7:57 PM, Dave Chinner <[email protected]> wrote:
>>>> > On Mon, Nov 18, 2013 at 12:28:21PM -0600, Eric Sandeen wrote:
>>>> >> On 11/18/13, 10:02 AM, Martin Boutin wrote:
>>>> >> > Dear list,
>>>> >> >
>>>> >> > I am writing about an apparent issue (or maybe it is normal, that's my
>>>> >> > question) regarding filesystem write speed in in a linux raid device.
>>>> >> > More specifically, I have linux-3.10.10 running in an Intel Haswell
>>>> >> > embedded system with 3 HDDs in a RAID-5 configuration.
>>>> >> > The hard disks have 4k physical sectors which are reported as 512
>>>> >> > logical size. I made sure the partitions underlying the raid device
>>>> >> > start at sector 2048.
>>>> >>
>>>> >> (fixed cc: to xfs list)
>>>> >>
>>>> >> > The RAID device has version 1.2 metadata and 4k (bytes) of data
>>>> >> > offset, therefore the data should also be 4k aligned. The raid chunk
>>>> >> > size is 512K.
>>>> >> >
>>>> >> > I have the md0 raid device formatted as ext3 with a 4k block size, and
>>>> >> > stride and stripes correctly chosen to match the raid chunk size, that
>>>> >> > is, stride=128,stripe-width=256.
>>>> >> >
>>>> >> > While I was working in a small university project, I just noticed that
>>>> >> > the write speeds when using a filesystem over raid are *much* slower
>>>> >> > than when writing directly to the raid device (or even compared to
>>>> >> > filesystem read speeds).
>>>> >> >
>>>> >> > The command line for measuring filesystem read and write speeds was:
>>>> >> >
>>>> >> > $ dd if=/tmp/diskmnt/filerd.zero of=/dev/null bs=1M count=1000 iflag=direct
>>>> >> > $ dd if=/dev/zero of=/tmp/diskmnt/filewr.zero bs=1M count=1000 oflag=direct
>>>> >> >
>>>> >> > The command line for measuring raw read and write speeds was:
>>>> >> >
>>>> >> > $ dd if=/dev/md0 of=/dev/null bs=1M count=1000 iflag=direct
>>>> >> > $ dd if=/dev/zero of=/dev/md0 bs=1M count=1000 oflag=direct
>>>> >> >
>>>> >> > Here are some speed measures using dd (an average of 20 runs).:
>>>> >> >
>>>> >> > device raw/fs mode speed (MB/s) slowdown (%)
>>>> >> > /dev/md0 raw read 207
>>>> >> > /dev/md0 raw write 209
>>>> >> > /dev/md1 raw read 214
>>>> >> > /dev/md1 raw write 212
>>>> >
>>>> > So, that's writing to the first 1GB of /dev/md0, and all the writes
>>>> > are going to be aligned to the MD stripe.
>>>> >
>>>> >> > /dev/md0 xfs read 188 9
>>>> >> > /dev/md0 xfs write 35 83
>>>> >
>>>> > And these will not be written to the first 1GB of the block device
>>>> > but somewhere else. Most likely a region that hasn't otherwise been
>>>> > used, and so isn't going to be overwriting the same blocks like the
>>>> > /dev/md0 case is going to be. Perhaps there's some kind of stripe
>>>> > caching effect going on here? Was the md device fully initialised
>>>> > before you ran these tests?
>>>> >
>>>> >> >
>>>> >> > /dev/md1 ext3 read 199 7
>>>> >> > /dev/md1 ext3 write 36 83
>>>> >> >
>>>> >> > /dev/md0 ufs read 212 0
>>>> >> > /dev/md0 ufs write 53 75
>>>> >> >
>>>> >> > /dev/md0 ext2 read 202 2
>>>> >> > /dev/md0 ext2 write 34 84
>>>> >
>>>> > I suspect what you are seeing here is either the latency introduced
>>>> > by having to allocate blocks before issuing the IO, or the file
>>>> > layout due to allocation is not idea. Single threaded direct IO is
>>>> > latency bound, not bandwidth bound and, as such, is IO size
>>>> > sensitive. Allocation for direct IO is also IO size sensitive -
>>>> > there's typically an allocation per IO, so the more IO you have to
>>>> > do, the more allocation that occurs.
>>>>
>>>> I just did a few more tests, this time with ext4:
>>>>
>>>> device raw/fs mode speed (MB/s) slowdown (%)
>>>> /dev/md0 ext4 read 199 4%
>>>> /dev/md0 ext4 write 210 0%
>>>>
>>>> This time, no slowdown at all on ext4. I believe this is due to the
>>>> multiblock allocation feature of ext4 (I'm using O_DIRECT, so it
>>>> should be it). So I guess for the other filesystems, it was indeed
>>>> the latency introduced by block allocation.
>>>
>>> Except that XFS does extent based allocation as well, so that's not
>>> likely the reason. The fact that ext4 doesn't see a slowdown like
>>> every other filesystem really doesn't make a lot of sense to
>>> me, either from an IO dispatch point of view or an IO alignment
>>> point of view.
>>>
>>> Why? Because all the filesystems align identically to the underlying
>>> device and all should be doing 4k block aligned IO, and XFS has
>>> roughly the same allocation overhead for this workload as ext4.
>>> Did you retest XFS or any of the other filesystems directly after
>>> running the ext4 tests (i.e. confirm you are testing apples to
>>> apples)?
>>
>> Yes I did, the performance figures did not change for either XFS or ext3.
>>>
>>> What we need to determine why other filesystems are slow (and why
>>> ext4 is fast) is more information about your configuration and block
>>> traces showing what is happening at the IO level, like was requested
>>> in a previous email....
>>
>> Ok, I'm going to try coming up with meaningful data. Thanks.
>>>
>>> Cheers,
>>>
>>> Dave.
>>> --
>>> Dave Chinner
>>> [email protected]
>>
>>
>>
>> --
>> Martin Boutin



--
Martin Boutin

2013-11-21 23:41:16

by Dave Chinner

Subject: Re: Filesystem writes on RAID5 too slow

On Thu, Nov 21, 2013 at 08:31:38AM -0500, Martin Boutin wrote:
> $ uname -a
> Linux haswell1 3.10.10 #1 SMP PREEMPT Wed Oct 2 11:22:22 CEST 2013
> i686 GNU/Linux

Oh, it's a 32 bit system. Things you don't know from the obfuscating
codenames everyone uses these days...

> $ mkfs.xfs -s size=4096 -f -l size=32m /dev/md0
> $ mount -t xfs /dev/md0 /tmp/diskmnt/
> $ dd if=/dev/zero of=/tmp/diskmnt/filewr.zero bs=1M count=1000 oflag=direct
> 1000+0 records in
> 1000+0 records out
> 1048576000 bytes (1.0 GB) copied, 28.0304 s, 37.4 MB/s
....
> $ cat /proc/mounts
> (...)
> /dev/md0 /tmp/diskmnt xfs
> rw,relatime,attr2,inode64,sunit=1024,swidth=2048,noquota 0 0

sunit/swidth is 512k/1MB

> # same layout for other disks
> $ fdisk -c -u /dev/sda
....
> Device Boot Start End Blocks Id System
> /dev/sda1 2048 20565247 10281600 83 Linux

Aligned to 1 MB.

> /dev/sda2 20565248 1953525167 966479960 83 Linux

And that isn't aligned to 1MB. 20565248 / 2048 = 10041.625. It is
aligned to 4k, though, so there shouldn't be any hardware RMW
cycles.

> $ xfs_info /dev/md0
> meta-data=/dev/md0 isize=256 agcount=32, agsize=15101312 blks
> = sectsz=4096 attr=2
> data = bsize=4096 blocks=483239168, imaxpct=5
> = sunit=128 swidth=256 blks

sunit/swidth of 512k/1MB, so it matches the MD device.

> $ xfs_bmap -vvp /tmp/diskmnt/filewr.zero
> /tmp/diskmnt/filewr.zero:
> EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS
> 0: [0..2047999]: 2049056..4097055 0 (2049056..4097055) 2048000 01111
> FLAG Values:
> 010000 Unwritten preallocated extent
> 001000 Doesn't begin on stripe unit
> 000100 Doesn't end on stripe unit
> 000010 Doesn't begin on stripe width
> 000001 Doesn't end on stripe width
> # this does not look good, does it?

Yup, looks broken.

/me digs through git.

Yup, commit 27a3f8f ("xfs: introduce xfs_bmap_last_extent") broke
the code that sets stripe unit alignment for the initial allocation
way back in 3.2.

[ Hmmm, that would explain the very occasional failure that
generic/223 throws out (maybe once a month I see it fail). ]

Which means MD is doing RMW cycles for its parity calculations, and
that's where performance is going south.

Current code:

$ xfs_io -fd -c "truncate 0" -c "falloc 0 1g" -c "bmap -vvp" -c "pwrite 0 1g -b 1280k" testfile
testfile:
EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS
0: [0..2097151]: 1056..2098207 0 (1056..2098207) 2097152 11111
FLAG Values:
010000 Unwritten preallocated extent
001000 Doesn't begin on stripe unit
000100 Doesn't end on stripe unit
000010 Doesn't begin on stripe width
000001 Doesn't end on stripe width
wrote 1073741824/1073741824 bytes at offset 0
1 GiB, 1024 ops; 0:00:02.00 (343.815 MiB/sec and 268.6054 ops/sec)
$

Which indicates that even if we take direct IO based allocation out
of the picture, the allocation does not get aligned properly. This
is on a 3.5TB 12 SAS disk MD RAID6 with sunit=64k,swidth=640k.

With a fixed kernel:

$ xfs_io -fd -c "truncate 0" -c "falloc 0 1g" -c "bmap -vvp" -c "pwrite 0 1g -b 1280k" testfile
testfile:
EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS
0: [0..2097151]: 6293504..8390655 0 (6293504..8390655) 2097152 10000
FLAG Values:
010000 Unwritten preallocated extent
001000 Doesn't begin on stripe unit
000100 Doesn't end on stripe unit
000010 Doesn't begin on stripe width
000001 Doesn't end on stripe width
wrote 1073741824/1073741824 bytes at offset 0
1 GiB, 820 ops; 0:00:02.00 (415.192 MiB/sec and 332.4779 ops/sec)
$

It's clear we have completely stripe width aligned allocation and it's 25% faster.

Take fallocate out of the picture so the direct IO does the
allocation:

$ xfs_io -fd -c "truncate 0" -c "pwrite 0 1g -b 1280k" -c "bmap -vvp" testfile
wrote 1073741824/1073741824 bytes at offset 0
1 GiB, 820 ops; 0:00:02.00 (368.241 MiB/sec and 294.8807 ops/sec)
testfile:
EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS
0: [0..2097151]: 2099200..4196351 0 (2099200..4196351) 2097152 00000
FLAG Values:
010000 Unwritten preallocated extent
001000 Doesn't begin on stripe unit
000100 Doesn't end on stripe unit
000010 Doesn't begin on stripe width
000001 Doesn't end on stripe width

It's slower than with preallocation (no surprise - no allocation
overhead per write(2) call after preallocation is done) but the
allocation is still correctly aligned.

The patch below should fix the unaligned allocation problem you are
seeing, but because XFS defaults to stripe unit alignment for large
allocations, you might still see RMW cycles when it aligns to a
stripe unit that is not the first in a MD stripe. I'll have a quick
look at fixing that behaviour when the swalloc mount option is
specified....
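
For reference, swalloc is an existing XFS mount option, i.e. something
like:

$ mount -t xfs -o swalloc /dev/md0 /tmp/diskmnt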

Cheers,

Dave.
--
Dave Chinner
[email protected]

xfs: align initial file allocations correctly.

From: Dave Chinner <[email protected]>

The function xfs_bmap_isaeof() is used to indicate that an
allocation is occurring at or past the end of file, and as such
should be aligned to the underlying storage geometry if possible.

Commit 27a3f8f ("xfs: introduce xfs_bmap_last_extent") changed the
behaviour of this function for empty files - it turned off
allocation alignment for this case accidentally. Hence large initial
allocations from direct IO are not getting correctly aligned to the
underlying geometry, and that is cause write performance to drop in
alignment sensitive configurations.

Fix it by considering allocation into empty files as requiring
aligned allocation again.

Signed-off-by: Dave Chinner <[email protected]>
---
fs/xfs/xfs_bmap.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_bmap.c b/fs/xfs/xfs_bmap.c
index 3ef11b2..8401f11 100644
--- a/fs/xfs/xfs_bmap.c
+++ b/fs/xfs/xfs_bmap.c
@@ -1635,7 +1635,7 @@ xfs_bmap_last_extent(
  * blocks at the end of the file which do not start at the previous data block,
  * we will try to align the new blocks at stripe unit boundaries.
  *
- * Returns 0 in bma->aeof if the file (fork) is empty as any new write will be
+ * Returns 1 in bma->aeof if the file (fork) is empty as any new write will be
  * at, or past the EOF.
  */
 STATIC int
@@ -1650,9 +1650,14 @@ xfs_bmap_isaeof(
        bma->aeof = 0;
        error = xfs_bmap_last_extent(NULL, bma->ip, whichfork, &rec,
                                     &is_empty);
-       if (error || is_empty)
+       if (error)
                return error;

+       if (is_empty) {
+               bma->aeof = 1;
+               return 0;
+       }
+
        /*
         * Check if we are allocation or past the last extent, or at least into
         * the last delayed allocated extent.

2013-11-22 09:21:38

by Christoph Hellwig

Subject: Re: Filesystem writes on RAID5 too slow

> From: Dave Chinner <[email protected]>
>
> The function xfs_bmap_isaeof() is used to indicate that an
> allocation is occurring at or past the end of file, and as such
> should be aligned to the underlying storage geometry if possible.
>
> Commit 27a3f8f ("xfs: introduce xfs_bmap_last_extent") changed the
> behaviour of this function for empty files - it turned off
> allocation alignment for this case accidentally. Hence large initial
> allocations from direct IO are not getting correctly aligned to the
> underlying geometry, and that is cause write performance to drop in
> alignment sensitive configurations.
>
> Fix it by considering allocation into empty files as requiring
> aligned allocation again.
>
> Signed-off-by: Dave Chinner <[email protected]>

Ooops. The fix looks good,

Reviewed-by: Christoph Hellwig <[email protected]>


Might be worth cooking up a test for this, scsi_debug can expose
geometry, and we already have it wired up to large sector size
testing in xfstests.
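
Roughly something like the following - the scsi_debug parameter names
here are from memory, so treat this as a sketch, and sdX is whatever
device the module ends up creating:

$ modprobe scsi_debug dev_size_mb=2048 sector_size=512 physblk_exp=3 opt_blks=2048
$ cat /sys/block/sdX/queue/physical_block_size
$ cat /sys/block/sdX/queue/optimal_io_size

If those report sane values, mkfs.xfs should pick the geometry up the
same way it does for MD.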


2013-11-22 13:33:41

by Martin Boutin

[permalink] [raw]
Subject: Re: Filesystem writes on RAID5 too slow

Dave, I just applied your patch to my vanilla Linux 3.10.10 kernel. Here are
the new performance figures for XFS:

$ dd if=/dev/zero of=/tmp/diskmnt/filewr.zero bs=1M count=1000 oflag=direct
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 4.95292 s, 212 MB/s

: )
So things make more sense now... I hit a bug in XFS, and ext3 and ufs
apparently do not support this kind of multiblock allocation.

Thank you all,
- Martin

On Thu, Nov 21, 2013 at 6:41 PM, Dave Chinner <[email protected]> wrote:
> On Thu, Nov 21, 2013 at 08:31:38AM -0500, Martin Boutin wrote:
>> $ uname -a
>> Linux haswell1 3.10.10 #1 SMP PREEMPT Wed Oct 2 11:22:22 CEST 2013
>> i686 GNU/Linux
>
> Oh, it's a 32 bit system. Things you don't know from the obfuscating
> codenames everyone uses these days...
>
>> $ mkfs.xfs -s size=4096 -f -l size=32m /dev/md0
>> $ mount -t xfs /dev/md0 /tmp/diskmnt/
>> $ dd if=/dev/zero of=/tmp/diskmnt/filewr.zero bs=1M count=1000 oflag=direct
>> 1000+0 records in
>> 1000+0 records out
>> 1048576000 bytes (1.0 GB) copied, 28.0304 s, 37.4 MB/s
> ....
>> $ cat /proc/mounts
>> (...)
>> /dev/md0 /tmp/diskmnt xfs
>> rw,relatime,attr2,inode64,sunit=1024,swidth=2048,noquota 0 0
>
> sunit/swidth is 512k/1MB
>
>> # same layout for other disks
>> $ fdisk -c -u /dev/sda
> ....
>> Device Boot Start End Blocks Id System
>> /dev/sda1 2048 20565247 10281600 83 Linux
>
> Aligned to 1 MB.
>
>> /dev/sda2 20565248 1953525167 966479960 83 Linux
>
> And that isn't aligned to 1MB. 20565248 / 2048 = 10041.625. It is
> aligned to 4k, though, so there shouldn't be any hardware RMW
> cycles.
>
>> $ xfs_info /dev/md0
>> meta-data=/dev/md0 isize=256 agcount=32, agsize=15101312 blks
>> = sectsz=4096 attr=2
>> data = bsize=4096 blocks=483239168, imaxpct=5
>> = sunit=12
>
> sunit/swidth of 512k/1MB, so it matches the MD device.
>
>> $ xfs_bmap -vvp /tmp/diskmnt/filewr.zero
>> /tmp/diskmnt/filewr.zero:
>> EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS
>> 0: [0..2047999]: 2049056..4097055 0 (2049056..4097055) 2048000 01111
>> FLAG Values:
>> 010000 Unwritten preallocated extent
>> 001000 Doesn't begin on stripe unit
>> 000100 Doesn't end on stripe unit
>> 000010 Doesn't begin on stripe width
>> 000001 Doesn't end on stripe width
>> # this does not look good, does it?
>
> Yup, looks broken.
>
> /me digs through git.
>
> Yup, commit 27a3f8f ("xfs: introduce xfs_bmap_last_extent") broke
> the code that sets stripe unit alignment for the initial allocation
> way back in 3.2.
>
> [ Hmmm, that would explain the very occasional failure that
> generic/223 throws out (maybe once a month I see it fail). ]
>
> Which means MD is doing RMW cycles for its parity calculations, and
> that's where performance is going south.
>
> Current code:
>
> $ xfs_io -fd -c "truncate 0" -c "falloc 0 1g" -c "bmap -vvp" -c "pwrite 0 1g -b 1280k" testfile
> testfile:
> EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS
> 0: [0..2097151]: 1056..2098207 0 (1056..2098207) 2097152 11111
> FLAG Values:
> 010000 Unwritten preallocated extent
> 001000 Doesn't begin on stripe unit
> 000100 Doesn't end on stripe unit
> 000010 Doesn't begin on stripe width
> 000001 Doesn't end on stripe width
> wrote 1073741824/1073741824 bytes at offset 0
> 1 GiB, 1024 ops; 0:00:02.00 (343.815 MiB/sec and 268.6054 ops/sec)
> $
>
> Which indicates that even if we take direct IO based allocation out
> of the picture, the allocation does not get aligned properly. This
> is on a 3.5TB 12 SAS disk MD RAID6 with sunit=64k,swidth=640k.
>
> With a fixed kernel:
>
> $ xfs_io -fd -c "truncate 0" -c "falloc 0 1g" -c "bmap -vvp" -c "pwrite 0 1g -b 1280k" testfile
> testfile:
> EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS
> 0: [0..2097151]: 6293504..8390655 0 (6293504..8390655) 2097152 10000
> FLAG Values:
> 010000 Unwritten preallocated extent
> 001000 Doesn't begin on stripe unit
> 000100 Doesn't end on stripe unit
> 000010 Doesn't begin on stripe width
> 000001 Doesn't end on stripe width
> wrote 1073741824/1073741824 bytes at offset 0
> 1 GiB, 820 ops; 0:00:02.00 (415.192 MiB/sec and 332.4779 ops/sec)
> $
>
> It's clear we have completely stripe width aligned allocation and it's 25% faster.
>
> Take fallocate out of the picture so the direct IO does the
> allocation:
>
> $ xfs_io -fd -c "truncate 0" -c "pwrite 0 1g -b 1280k" -c "bmap -vvp" testfile
> wrote 1073741824/1073741824 bytes at offset 0
> 1 GiB, 820 ops; 0:00:02.00 (368.241 MiB/sec and 294.8807 ops/sec)
> testfile:
> EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS
> 0: [0..2097151]: 2099200..4196351 0 (2099200..4196351) 2097152 00000
> FLAG Values:
> 010000 Unwritten preallocated extent
> 001000 Doesn't begin on stripe unit
> 000100 Doesn't end on stripe unit
> 000010 Doesn't begin on stripe width
> 000001 Doesn't end on stripe width
>
> It's slower than with preallocation (no surprise - no allocation
> overhead per write(2) call after preallocation is done) but the
> allocation is still correctly aligned.
>
> The patch below should fix the unaligned allocation problem you are
> seeing, but because XFS defaults to stripe unit alignment for large
> allocations, you might still see RMW cycles when it aligns to a
> stripe unit that is not the first in a MD stripe. I'll have a quick
> look at fixing that behaviour when the swalloc mount option is
> specified....
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]
>
> xfs: align initial file allocations correctly.
>
> From: Dave Chinner <[email protected]>
>
> The function xfs_bmap_isaeof() is used to indicate that an
> allocation is occurring at or past the end of file, and as such
> should be aligned to the underlying storage geometry if possible.
>
> Commit 27a3f8f ("xfs: introduce xfs_bmap_last_extent") changed the
> behaviour of this function for empty files - it turned off
> allocation alignment for this case accidentally. Hence large initial
> allocations from direct IO are not getting correctly aligned to the
> underlying geometry, and that is causing write performance to drop in
> alignment sensitive configurations.
>
> Fix it by considering allocation into empty files as requiring
> aligned allocation again.
>
> Signed-off-by: Dave Chinner <[email protected]>
> ---
> fs/xfs/xfs_bmap.c | 9 +++++++--
> 1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/fs/xfs/xfs_bmap.c b/fs/xfs/xfs_bmap.c
> index 3ef11b2..8401f11 100644
> --- a/fs/xfs/xfs_bmap.c
> +++ b/fs/xfs/xfs_bmap.c
> @@ -1635,7 +1635,7 @@ xfs_bmap_last_extent(
> * blocks at the end of the file which do not start at the previous data block,
> * we will try to align the new blocks at stripe unit boundaries.
> *
> - * Returns 0 in bma->aeof if the file (fork) is empty as any new write will be
> + * Returns 1 in bma->aeof if the file (fork) is empty as any new write will be
> * at, or past the EOF.
> */
> STATIC int
> @@ -1650,9 +1650,14 @@ xfs_bmap_isaeof(
> bma->aeof = 0;
> error = xfs_bmap_last_extent(NULL, bma->ip, whichfork, &rec,
> &is_empty);
> - if (error || is_empty)
> + if (error)
> return error;
>
> + if (is_empty) {
> + bma->aeof = 1;
> + return 0;
> + }
> +
> /*
> * Check if we are allocation or past the last extent, or at least into
> * the last delayed allocated extent.



--
Martin Boutin

2013-11-22 22:40:44

by Dave Chinner

[permalink] [raw]
Subject: Re: Filesystem writes on RAID5 too slow

On Fri, Nov 22, 2013 at 01:21:36AM -0800, Christoph Hellwig wrote:
> > From: Dave Chinner <[email protected]>
> >
> > The function xfs_bmap_isaeof() is used to indicate that an
> > allocation is occurring at or past the end of file, and as such
> > should be aligned to the underlying storage geometry if possible.
> >
> > Commit 27a3f8f ("xfs: introduce xfs_bmap_last_extent") changed the
> > behaviour of this function for empty files - it turned off
> > allocation alignment for this case accidentally. Hence large initial
> > allocations from direct IO are not getting correctly aligned to the
> > underlying geometry, and that is causing write performance to drop in
> > alignment sensitive configurations.
> >
> > Fix it by considering allocation into empty files as requiring
> > aligned allocation again.
> >
> > Signed-off-by: Dave Chinner <[email protected]>
>
> Ooops. The fix looks good,
>
> Reviewed-by: Christoph Hellwig <[email protected]>
>
>
> Might be worth cooking up a test for this, scsi_debug can expose
> geometry, and we already have it wired up to large sector size
> testing in xfstests.

We don't need to screw around with the sector size - that is
irrelevant to the problem, and we have an allocation alignment
test that is supposed to catch these issues: generic/223.

As I said, I have seen occasional failures of that test (once a
month, on average) as a result of this bug. It was simply not often
enough - running in a hard loop didn't increase the frequency of
failures - to be able to debug it or to reach my "there's a regression
I need to look at" threshold. Perhaps we need to revisit that test
and see if we can make it more likely to trigger failures...
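
The hard loop is nothing fancy, by the way - just re-running the test
from an xfstests checkout until it exits non-zero, with
TEST_DEV/SCRATCH_DEV pointing at suitably aligned devices in
local.config:

$ while ./check generic/223; do :; done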

Cheers,

Dave.
--
Dave Chinner
[email protected]

2013-11-23 08:41:08

by Christoph Hellwig

[permalink] [raw]
Subject: Re: Filesystem writes on RAID5 too slow

On Sat, Nov 23, 2013 at 09:40:38AM +1100, Dave Chinner wrote:
> > geometry, and we already have it wired up to large sector size
> > testing in xfstests.
>
> We don't need to screw around with the sector size - that is
> irrelevant to the problem, and we have an allocation alignment
> test that is supposed to catch these issues: generic/223.

I didn't imply we need large sector sizes, but the same mechanism
used to expose a large sector size can also be used to present large
stripe units/widths.

> As I said, I have seen occasional failures of that test (once a
> month, on average) as a result of this bug. It was simply not often
> enough - running in a hard loop didn't increase the frequency of
> failures - to be able debug it or to reach my "there's a regression
> I need to look at" threshold. Perhaps we need to revisit that test
> and see if we can make it more likely to trigger failures...

Seems like 223 should have caught it regularly with the explicit
alignment options at mkfs time. Maybe we also need a test mirroring
the plain dd more closely?
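
Something along these lines, say - the su/sw values assume Martin's
3-disk RAID5 with a 512k chunk, and the device and mount point are
placeholders:

$ mkfs.xfs -f -d su=512k,sw=2 /dev/sdX
$ mount -t xfs /dev/sdX /mnt/scratch
$ dd if=/dev/zero of=/mnt/scratch/filewr.zero bs=1M count=1000 oflag=direct
$ xfs_bmap -vvp /mnt/scratch/filewr.zero

and fail the test if any of the stripe alignment flags are set on the
resulting extents.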

I've not seen 223 fail for a long time..

2013-11-24 23:21:42

by Dave Chinner

[permalink] [raw]
Subject: Re: Filesystem writes on RAID5 too slow

On Sat, Nov 23, 2013 at 12:41:06AM -0800, Christoph Hellwig wrote:
> On Sat, Nov 23, 2013 at 09:40:38AM +1100, Dave Chinner wrote:
> > > geometry, and we already have it wired up to large sector size
> > > testing in xfstests.
> >
> > We don't need to screw around with the sector size - that is
> > irrelevant to the problem, and we have an allocation alignment
> > test that is supposed to catch these issues: generic/223.
>
> It didn't imply we need large sector sizes, but the same mechanism
> to expodse a large sector size can also be used to present large
> stripe units/width.
>
> > As I said, I have seen occasional failures of that test (once a
> > month, on average) as a result of this bug. It was simply not often
> > enough - running in a hard loop didn't increase the frequency of
> > failures - to be able debug it or to reach my "there's a regression
> > I need to look at" threshold. Perhaps we need to revisit that test
> > and see if we can make it more likely to trigger failures...
>
> Seems like 223 should have caught it regularly with the explicit
> alignment options at mkfs time. Maybe we also need a test mirroring
> the plain dd more closely?

Preallocation showed the problem, too, so we probably don't even
need dd to check whether allocation alignment is working properly.
We should probably write a test that specifically checks all the
different alignment/extent size combinations we can use.

Preallocation should behave very similarly to direct IO, but I'm
pretty sure that it won't do things like round up allocations to
stripe unit/widths like direct IO does. The fact that we do
allocation sunit/swidth size alignment for direct IO outside the
allocator and sunit/swidth offset alignment inside the allocator is
kinda funky....
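
Rough shape of what I have in mind - preallocate at a few different
sizes on a scratch filesystem made with explicit sunit/swidth and check
that the only bmap flag ever reported is the unwritten one (the mount
point below is a placeholder):

$ for sz in 16m 64m 256m 1g; do
>     xfs_io -f -c "truncate 0" -c "falloc 0 $sz" -c "bmap -vvp" /mnt/scratch/file.$sz
> done

plus the same again with direct IO writes instead of preallocation.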

> I've not seen 223 fail for a long time..

Not surprising, it is a one in several hundred test runs occurrence
here...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2013-12-10 19:18:07

by Christoph Hellwig

[permalink] [raw]
Subject: Re: Filesystem writes on RAID5 too slow

> xfs: align initial file allocations correctly.
>
> From: Dave Chinner <[email protected]>
>
> The function xfs_bmap_isaeof() is used to indicate that an
> allocation is occurring at or past the end of file, and as such
> should be aligned to the underlying storage geometry if possible.
>
> Commit 27a3f8f ("xfs: introduce xfs_bmap_last_extent") changed the
> behaviour of this function for empty files - it turned off
> allocation alignment for this case accidentally. Hence large initial
> allocations from direct IO are not getting correctly aligned to the
> underlying geometry, and that is causing write performance to drop in
> alignment sensitive configurations.
>
> Fix it by considering allocation into empty files as requiring
> aligned allocation again.

Seems like this one didn't get picked up yet?


2013-12-11 00:27:53

by Dave Chinner

[permalink] [raw]
Subject: Re: Filesystem writes on RAID5 too slow

On Tue, Dec 10, 2013 at 11:18:03AM -0800, Christoph Hellwig wrote:
> > xfs: align initial file allocations correctly.
> >
> > From: Dave Chinner <[email protected]>
> >
> > The function xfs_bmap_isaeof() is used to indicate that an
> > allocation is occurring at or past the end of file, and as such
> > should be aligned to the underlying storage geometry if possible.
> >
> > Commit 27a3f8f ("xfs: introduce xfs_bmap_last_extent") changed the
> > behaviour of this function for empty files - it turned off
> > allocation alignment for this case accidentally. Hence large initial
> > allocations from direct IO are not getting correctly aligned to the
> > underlying geometry, and that is causing write performance to drop in
> > alignment sensitive configurations.
> >
> > Fix it by considering allocation into empty files as requiring
> > aligned allocation again.
>
> Seems like this one didn't get picked up yet?

I'm about to resend all my outstanding patches...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2013-12-11 19:09:45

by Ben Myers

[permalink] [raw]
Subject: Re: Filesystem writes on RAID5 too slow

Hi,

On Wed, Dec 11, 2013 at 11:27:53AM +1100, Dave Chinner wrote:
> On Tue, Dec 10, 2013 at 11:18:03AM -0800, Christoph Hellwig wrote:
> > > xfs: align initial file allocations correctly.
> > >
> > > From: Dave Chinner <[email protected]>
> > >
> > > The function xfs_bmap_isaeof() is used to indicate that an
> > > allocation is occurring at or past the end of file, and as such
> > > should be aligned to the underlying storage geometry if possible.
> > >
> > > Commit 27a3f8f ("xfs: introduce xfs_bmap_last_extent") changed the
> > > behaviour of this function for empty files - it turned off
> > > allocation alignment for this case accidentally. Hence large initial
> > > allocations from direct IO are not getting correctly aligned to the
> > > underlying geometry, and that is cause write performance to drop in
> > > alignment sensitive configurations.
> > >
> > > Fix it by considering allocation into empty files as requiring
> > > aligned allocation again.
> >
> > Seems like this one didn't get picked up yet?
>
> I'm about to resend all my outstanding patches...

Sorry I didn't see that one. If you stick the keyword 'patch' in the subject, I
tend to do a bit better.

Regards,
Ben
