From: Dave Chinner <david@fromorbit.com>
Subject: Re: Filesystem writes on RAID5 too slow
Date: Fri, 22 Nov 2013 10:41:16 +1100
Message-ID: <20131121234116.GD6502@dastard>
References: <CACtJ3HZxp6xEjY_wOucCcqX4scNzEGuiAsovQYObJS9whtYJsQ@mail.gmail.com>
 <528A5C45.4080906@redhat.com>
 <20131119005740.GY6188@dastard>
 <CACtJ3Ha3C7JNi5VZRnNMn+-okNheygmbj=j9AnUMvfzfZjNwug@mail.gmail.com>
 <20131121092606.GU11434@dastard>
 <CACtJ3HZAsOtmLArMWraygfQxpGymtZjr+a_reXv8o6LJzoMbvw@mail.gmail.com>
 <CACtJ3Ha5P2Heu4qiEEk6c4g+tKyR=RrD-4E-Cqj+bP8YDjKQ6w@mail.gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: Eric Sandeen <sandeen@redhat.com>,
	"Kernel.org-Linux-RAID" <linux-raid@vger.kernel.org>,
	xfs-oss <xfs@oss.sgi.com>,
	"Kernel.org-Linux-EXT4" <linux-ext4@vger.kernel.org>
To: Martin Boutin <martboutin@gmail.com>
Return-path: <linux-raid-owner@vger.kernel.org>
Content-Disposition: inline
In-Reply-To: <CACtJ3Ha5P2Heu4qiEEk6c4g+tKyR=RrD-4E-Cqj+bP8YDjKQ6w@mail.gmail.com>
Sender: linux-raid-owner@vger.kernel.org
List-Id: linux-ext4.vger.kernel.org

On Thu, Nov 21, 2013 at 08:31:38AM -0500, Martin Boutin wrote:
> $ uname -a
> Linux haswell1 3.10.10 #1 SMP PREEMPT Wed Oct 2 11:22:22 CEST 2013
> i686 GNU/Linux

Oh, it's 32 bit system. Things you don't know from the obfuscating
codenames everyone uses these days...

> $ mkfs.xfs -s size=4096 -f -l size=32m /dev/md0
> $ mount -t xfs /dev/md0 /tmp/diskmnt/
> $ dd if=/dev/zero of=/tmp/diskmnt/filewr.zero bs=1M count=1000 oflag=direct
> 1000+0 records in
> 1000+0 records out
> 1048576000 bytes (1.0 GB) copied, 28.0304 s, 37.4 MB/s
....
> $ cat /proc/mounts
> (...)
> /dev/md0 /tmp/diskmnt xfs
> rw,relatime,attr2,inode64,sunit=1024,swidth=2048,noquota 0 0

sunit/swidth is 512k/1MB

> # same layout for other disks
> $ fdisk -c -u /dev/sda
....
>    Device Boot      Start         End      Blocks   Id  System
> /dev/sda1            2048    20565247    10281600   83  Linux

Aligned to 1 MB.

> /dev/sda2        20565248  1953525167   966479960   83  Linux

And that isn't aligned to 1MB. 20565248 / 2048 = 10041.625. It is
aligned to 4k, though, so there shouldn't be any hardware RMW
cycles.

> $ xfs_info /dev/md0
> meta-data=/dev/md0               isize=256    agcount=32, agsize=15101312 blks
>          =                       sectsz=4096  attr=2
> data     =                       bsize=4096   blocks=483239168, imaxpct=5
>          =                       sunit=12

sunit/swidth of 512k/1MB, so it matches the MD device.

> $ xfs_bmap -vvp /tmp/diskmnt/filewr.zero
> /tmp/diskmnt/filewr.zero:
>  EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET            TOTAL FLAGS
>    0: [0..2047999]:    2049056..4097055  0 (2049056..4097055) 2048000 01111
>  FLAG Values:
>     010000 Unwritten preallocated extent
>     001000 Doesn't begin on stripe unit
>     000100 Doesn't end   on stripe unit
>     000010 Doesn't begin on stripe width
>     000001 Doesn't end   on stripe width
> # this does not look good, does it?

Yup, looks broken.

/me digs through git. 

Yup, commit 27a3f8f ("xfs: introduce xfs_bmap_last_extent") broke
the code that sets stripe unit alignment for the initial allocation
way back in 3.2.

[ Hmmm, that would explain the very occasional failure that
generic/223 throws outi (maybe once a month I see it fail). ]

Which means MD is doing RMW cycles for it's parity calculations, and
that's where performance is going south.

Current code:

$ xfs_io -fd -c "truncate 0" -c "falloc 0 1g" -c "bmap -vvp" -c "pwrite 0 1g -b 1280k" testfile
testfile:
 EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET          TOTAL FLAGS
   0: [0..2097151]:    1056..2098207     0 (1056..2098207)  2097152 11111
 FLAG Values:
    010000 Unwritten preallocated extent
    001000 Doesn't begin on stripe unit
    000100 Doesn't end   on stripe unit
    000010 Doesn't begin on stripe width
    000001 Doesn't end   on stripe width
wrote 1073741824/1073741824 bytes at offset 0
1 GiB, 1024 ops; 0:00:02.00 (343.815 MiB/sec and 268.6054 ops/sec)
$

Which indicates that even if we take direct IO based allocation out
of the picture, the allocation does not get aligned properly. This
in on a 3.5TB 12 SAS disk MD RAID6 with sunit=64k,swidth=640k.

With a fixed kernel:

$ xfs_io -fd -c "truncate 0" -c "falloc 0 1g" -c "bmap -vvp" -c "pwrite 0 1g -b 1280k" testfile
testfile:
 EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET            TOTAL FLAGS
   0: [0..2097151]:    6293504..8390655  0 (6293504..8390655) 2097152 10000
 FLAG Values:
    010000 Unwritten preallocated extent
    001000 Doesn't begin on stripe unit
    000100 Doesn't end   on stripe unit
    000010 Doesn't begin on stripe width
    000001 Doesn't end   on stripe width
wrote 1073741824/1073741824 bytes at offset 0
1 GiB, 820 ops; 0:00:02.00 (415.192 MiB/sec and 332.4779 ops/sec)
$

It;s clear we have completely stripe swidth aligned allocation and it's 25% faster.

Take fallocate out of the picture so the direct IO does the
allocation:

$ xfs_io -fd -c "truncate 0" -c "pwrite 0 1g -b 1280k" -c "bmap -vvp" testfile
wrote 1073741824/1073741824 bytes at offset 0
1 GiB, 820 ops; 0:00:02.00 (368.241 MiB/sec and 294.8807 ops/sec)
testfile:
 EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET            TOTAL FLAGS
   0: [0..2097151]:    2099200..4196351  0 (2099200..4196351) 2097152 00000
 FLAG Values:
    010000 Unwritten preallocated extent
    001000 Doesn't begin on stripe unit
    000100 Doesn't end   on stripe unit
    000010 Doesn't begin on stripe width
    000001 Doesn't end   on stripe width

It's slower than with preallocation (no surprise - no allocation
overhead per write(2) call after preallocation is done) but the
allocation is still correctly aligned.

The patch below should fix the unaligned allocation problem you are
seeing, but because XFS defaults to stripe unit alignment for large
allocations, you might still see RMW cycles when it aligns to a
stripe unit that is not the first in a MD stripe. I'll have a quick
look at fixing that behaviour when the swalloc mount option is
specified....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

xfs: align initial file allocations correctly.

From: Dave Chinner <dchinner@redhat.com>

The function xfs_bmap_isaeof() is used to indicate that an
allocation is occurring at or past the end of file, and as such
should be aligned to the underlying storage geometry if possible.

Commit 27a3f8f ("xfs: introduce xfs_bmap_last_extent") changed the
behaviour of this function for empty files - it turned off
allocation alignment for this case accidentally. Hence large initial
allocations from direct IO are not getting correctly aligned to the
underlying geometry, and that is cause write performance to drop in
alignment sensitive configurations.

Fix it by considering allocation into empty files as requiring
aligned allocation again.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_bmap.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_bmap.c b/fs/xfs/xfs_bmap.c
index 3ef11b2..8401f11 100644
--- a/fs/xfs/xfs_bmap.c
+++ b/fs/xfs/xfs_bmap.c
@@ -1635,7 +1635,7 @@ xfs_bmap_last_extent(
  * blocks at the end of the file which do not start at the previous data block,
  * we will try to align the new blocks at stripe unit boundaries.
  *
- * Returns 0 in bma->aeof if the file (fork) is empty as any new write will be
+ * Returns 1 in bma->aeof if the file (fork) is empty as any new write will be
  * at, or past the EOF.
  */
 STATIC int
@@ -1650,9 +1650,14 @@ xfs_bmap_isaeof(
 	bma->aeof = 0;
 	error = xfs_bmap_last_extent(NULL, bma->ip, whichfork, &rec,
 				     &is_empty);
-	if (error || is_empty)
+	if (error)
 		return error;
 
+	if (is_empty) {
+		bma->aeof = 1;
+		return 0;
+	}
+
 	/*
 	 * Check if we are allocation or past the last extent, or at least into
 	 * the last delayed allocated extent.