2006-11-10 01:48:56

by Chris Mason

Subject: [PATCH] avoid too many boundaries in DIO

Dave Chinner found a 10% performance regression with ext3 when using DIO
to fill holes instead of buffered IO. On large IOs, the ext3 get_block
routine will send more than a page worth of blocks back to DIO via a
single buffer_head with a large b_size value.

The DIO code iterates through this massive block and tests for a
boundary buffer over and over again. For every block size unit spanned
by the big map_bh, the boundary bit is tested and a bio may be forced
down to the block layer.
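
For reference, the loop in question looks roughly like this (a simplified
sketch of do_direct_IO()'s chunking circa 2.6.19, with the hole handling
and bookkeeping dropped; blocks_left_in_page is shorthand here, not a real
variable):

	/*
	 * Sketch only, not verbatim kernel code: the mapping from one
	 * get_block call is carved into chunks, and the boundary bit
	 * is re-read from map_bh on every pass, so one boundary
	 * mapping can trigger a bio submission per chunk.
	 */
	while (block_in_page < blocks_per_page) {
		this_chunk_blocks = min(dio->blocks_available,
					blocks_left_in_page);
		this_chunk_bytes = this_chunk_blocks << blkbits;

		dio->boundary = buffer_boundary(map_bh);	/* every pass */

		ret = submit_page_section(dio, page, offset_in_page,
				this_chunk_bytes, dio->next_block_for_io);
		/* advance block_in_page, next_block_for_io, ... */
	}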

There are two potential fixes, one is to ignore the boundary bit on
large regions returned by the FS. DIO can't tell which part of the big
region was a boundary, and so it may not be a good idea to trust the
hint.

This patch just clears the boundary bit after using it once. It is 10%
faster for a streaming DIO write w/blocksize of 512k on my sata drive.

Signed-off-by: Chris Mason <[email protected]>

diff -r 38d08cbe880b fs/direct-io.c
--- a/fs/direct-io.c	Thu Nov 09 20:02:08 2006 -0500
+++ b/fs/direct-io.c	Thu Nov 09 20:31:12 2006 -0500
@@ -959,6 +959,17 @@ do_holes:
 			BUG_ON(this_chunk_bytes == 0);
 
 			dio->boundary = buffer_boundary(map_bh);
+
+			/*
+			 * get_block may return more than one page worth
+			 * of blocks. Only make the first one a boundary.
+			 * This is still sub-optimal, it probably only
+			 * makes sense to play with boundaries when
+			 * get_block returns a single FS block sized
+			 * unit.
+			 */
+			clear_buffer_boundary(map_bh);
+
 			ret = submit_page_section(dio, page, offset_in_page,
 					this_chunk_bytes, dio->next_block_for_io);
 			if (ret) {


2006-11-10 06:36:06

by David Chinner

Subject: Re: [PATCH] avoid too many boundaries in DIO

On Thu, Nov 09, 2006 at 08:48:54PM -0500, Chris Mason wrote:
> Dave Chinner found a 10% performance regression with ext3 when using DIO
> to fill holes instead of buffered IO. On large IOs, the ext3 get_block
> routine will send more than a page worth of blocks back to DIO via a
> single buffer_head with a large b_size value.
>
> The DIO code iterates through this massive block and tests for a
> boundary buffer over and over again. For every block size unit spanned
> by the big map_bh, the boundary bit is tested and a bio may be forced
> down to the block layer.
>
> There are two potential fixes, one is to ignore the boundary bit on
> large regions returned by the FS. DIO can't tell which part of the big
> region was a boundary, and so it may not be a good idea to trust the
> hint.
>
> This patch just clears the boundary bit after using it once. It is 10%
> faster for a streaming DIO write w/blocksize of 512k on my sata drive.

8p Altix, 8GB RAM, 64 FC disks, >2.5GiB/s sustainable raw throughput.
dm stripe, outer 1GB of each disk for 64GB volume. Chunk size 128k.
Single thread Direct I/O, I/O size of 512MiB, sequential file extend.

# mkfs.ext3 -E stride-size=32 /dev/mapper/testvol
# mkfs.xfs -f -d sunit=256,swidth=16384 /dev/mapper/testvol

ext3 mounted data=ordered; data=writeback results are the same
for direct I/O.

2.6.19-rc3-pl is 2.6.19-rc3 + the direct I/O placeholder patches.
2.6.19-rc3-pl-bd is 2.6.19-rc3-pl plus the boundary patch.

Kernel            fs    throughput  I/Os/s   sys+intr
----------------  ----  ----------  -------  --------
2.6.19-rc3        ext3  660MiB/s    ~36,000  70+60
2.6.19-rc3-pl     ext3  600MiB/s    ~36,000  70+60
2.6.19-rc3-pl-bd  ext3  715MiB/s    ~16,000  45+35
2.6.19-rc3        xfs   2.28GiB/s   ~18,000  65+65
2.6.19-rc3-pl     xfs   2.24GiB/s   ~18,000  65+65
2.6.19-rc3-pl-bd  xfs   2.26GiB/s   ~18,000  65+65

Hole filling with direct I/O shows equivalent results.

The boundary patch doubles the average I/O size of ext3 and
substantially reduces CPU usage for direct I/O. Nice one, Chris.
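
(From the table: 660MiB/s at ~36,000 I/Os/s works out to roughly 19KiB
per I/O, while 715MiB/s at ~16,000 I/Os/s is roughly 46KiB per I/O.)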

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2006-11-10 06:56:27

by Andrew Morton

Subject: Re: [PATCH] avoid too many boundaries in DIO

On Thu, 9 Nov 2006 20:48:54 -0500
Chris Mason <[email protected]> wrote:

> Dave Chinner found a 10% performance regression with ext3 when using DIO
> to fill holes instead of buffered IO. On large IOs, the ext3 get_block
> routine will send more than a page worth of blocks back to DIO via a
> single buffer_head with a large b_size value.
>
> The DIO code iterates through this massive block and tests for a
> boundary buffer over and over again. For every block size unit spanned
> by the big map_bh, the boundary bit is tested and a bio may be forced
> down to the block layer.
>
> There are two potential fixes, one is to ignore the boundary bit on
> large regions returned by the FS. DIO can't tell which part of the big
> region was a boundary, and so it may not be a good idea to trust the
> hint.
>
> This patch just clears the boundary bit after using it once. It is 10%
> faster for a streaming DIO write w/blocksize of 512k on my sata drive.
>

Thanks.

But that's two large performance regressions (so far) from the multi-block
get_block() feature. And that was allegedly a performance optimisation!
Who's testing this stuff?

>
> diff -r 38d08cbe880b fs/direct-io.c
> --- a/fs/direct-io.c	Thu Nov 09 20:02:08 2006 -0500
> +++ b/fs/direct-io.c	Thu Nov 09 20:31:12 2006 -0500
> @@ -959,6 +959,17 @@ do_holes:
>  			BUG_ON(this_chunk_bytes == 0);
>  
>  			dio->boundary = buffer_boundary(map_bh);
> +
> +			/*
> +			 * get_block may return more than one page worth
> +			 * of blocks. Only make the first one a boundary.
> +			 * This is still sub-optimal, it probably only
> +			 * makes sense to play with boundaries when
> +			 * get_block returns a single FS block sized
> +			 * unit.
> +			 */
> +			clear_buffer_boundary(map_bh);
> +
>  			ret = submit_page_section(dio, page, offset_in_page,
>  					this_chunk_bytes, dio->next_block_for_io);
>  			if (ret) {


Is that actually correct? If ->get_block() returned a buffer_boundary()
buffer then what we want to do is to push down all the thus-far-queued BIOs
once we've submitted _all_ of the BIOs represented by map_bh. I think that
if we require more than one BIO to cover map_bh.b_size then we'll do the
submission after the first BIO has been sent instead of after the final one
has been sent?
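
Concretely (hypothetical numbers, just for illustration): if ->get_block()
hands back a 1MiB boundary mapping and the current bio fills up at 512KiB,
the boundary-triggered submission fires when that first 512KiB bio goes
down, while the blocks actually adjacent to the metadata boundary are
still waiting for the second bio.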

2006-11-10 21:49:38

by Chris Mason

Subject: Re: [PATCH] avoid too many boundaries in DIO

On Thu, Nov 09, 2006 at 10:56:18PM -0800, Andrew Morton wrote:
> > This patch just clears the boundary bit after using it once. It is 10%
> > faster for a streaming DIO write w/blocksize of 512k on my sata drive.
> >
>
> Thanks.
>
> But that's two large performance regressions (so far) from the multi-block
> get_block() feature. And that was allegedly a performance optimisation!
> Who's testing this stuff?

Well, I've been wearing the sata-drive-writeback-cache cape of shame
since Dave found the regression while testing my stuff, so I can't
really point fingers. On some drives the regression didn't show up, but
it should be there on any beefy storage.

> Is that actually correct? If ->get_block() returned a
> buffer_boundary() buffer then what we want to do is to push down all
> the thus-far-queued BIOs once we've submitted _all_ of the BIOs
> represented by map_bh. I think that if we require more than one BIO
> to cover map_bh.b_size then we'll do the submission after the first
> BIO has been sent instead of after the final one has been sent?
>
I realized the same thing this morning, but it took a while to figure
out why honoring the boundary on the last block was 5% slower than my
first patch. It turns out that we consistently send down the boundary
bio too soon.
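
The too-early submission is easy to see in the old dio_send_cur_page()
(an abbreviated copy of the pre-patch code, the same lines the second
hunk below deletes):

	if (dio->bio) {
		if (dio->final_block_in_bio != dio->cur_page_block)
			dio_bio_submit(dio);
		/*
		 * The boundary test happened here, before cur_page
		 * (the page holding the boundary block) had been added
		 * to any bio, so the bio pushed down stopped short of
		 * the boundary.
		 */
		if (dio->boundary)
			dio_bio_submit(dio);
	}

Moving the test to after dio_bio_add_page(), as the patch below does,
means the bio that goes down actually ends at the boundary block.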

Testing of this has been very light, but I wanted to get it out for
review. I'll test more over the weekend.

-chris

From: Chris Mason <[email protected]>
Subject: avoid too many boundaries in DIO

Dave Chinner found a 10% performance regression with ext3 when using DIO
to fill holes instead of buffered IO. On large IOs, the ext3 get_block
routine will send more than a page worth of blocks back to DIO via a
single buffer_head with a large b_size value.

The DIO code iterates through this massive block and tests for a
boundary buffer over and over again. For every block size unit spanned
by the big map_bh, the boundary bit is tested and a bio may be forced
down to the block layer.

This patch changes things to only submit the boundary bio for the
last block in the big map_bh.

The DIO code had a number of places that would honor dio->boundary
too early, sending the bio down before actually adding the boundary
block to it. Those are also fixed.
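
(In outline: dio_new_bio() no longer clears dio->boundary behind the
caller's back; the boundary submit in dio_send_cur_page() moves to after
dio_bio_add_page() so the submitted bio contains the boundary block;
submit_page_section() stashes and restores dio->boundary so that
dio_send_cur_page() can't act on it before the page section is queued;
and do_direct_IO() sets dio->boundary only for the final chunk of a
mapping.)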

Signed-off-by: Chris Mason <[email protected]>

diff -r 18a9e9f5c707 fs/direct-io.c
--- a/fs/direct-io.c	Thu Oct 19 08:30:00 2006 +0700
+++ b/fs/direct-io.c	Fri Nov 10 16:33:04 2006 -0500
@@ -572,7 +571,6 @@ static int dio_new_bio(struct dio *dio,
 	nr_pages = min(dio->pages_in_io, bio_get_nr_vecs(dio->map_bh.b_bdev));
 	BUG_ON(nr_pages <= 0);
 	ret = dio_bio_alloc(dio, dio->map_bh.b_bdev, sector, nr_pages);
-	dio->boundary = 0;
 out:
 	return ret;
 }
@@ -626,12 +624,6 @@ static int dio_send_cur_page(struct dio
 		 */
 		if (dio->final_block_in_bio != dio->cur_page_block)
 			dio_bio_submit(dio);
-		/*
-		 * Submit now if the underlying fs is about to perform a
-		 * metadata read
-		 */
-		if (dio->boundary)
-			dio_bio_submit(dio);
 	}
 
 	if (dio->bio == NULL) {
@@ -648,6 +640,12 @@ static int dio_send_cur_page(struct dio
 			BUG_ON(ret != 0);
 		}
 	}
+	/*
+	 * Submit now if the underlying fs is about to perform a
+	 * metadata read
+	 */
+	if (dio->boundary)
+		dio_bio_submit(dio);
 out:
 	return ret;
 }
@@ -674,6 +672,10 @@ submit_page_section(struct dio *dio, str
 		unsigned offset, unsigned len, sector_t blocknr)
 {
 	int ret = 0;
+	int boundary = dio->boundary;
+
+	/* don't let dio_send_cur_page do the boundary too soon */
+	dio->boundary = 0;
 
 	/*
 	 * Can we just grow the current page's presence in the dio?
@@ -683,17 +683,7 @@ submit_page_section(struct dio *dio, str
 		(dio->cur_page_block +
 			(dio->cur_page_len >> dio->blkbits) == blocknr)) {
 		dio->cur_page_len += len;
-
-		/*
-		 * If dio->boundary then we want to schedule the IO now to
-		 * avoid metadata seeks.
-		 */
-		if (dio->boundary) {
-			ret = dio_send_cur_page(dio);
-			page_cache_release(dio->cur_page);
-			dio->cur_page = NULL;
-		}
-		goto out;
+		goto out_send;
 	}
 
 	/*
@@ -712,6 +702,18 @@ submit_page_section(struct dio *dio, str
 	dio->cur_page_offset = offset;
 	dio->cur_page_len = len;
 	dio->cur_page_block = blocknr;
+
+out_send:
+	/*
+	 * If dio->boundary then we want to schedule the IO now to
+	 * avoid metadata seeks.
+	 */
+	if (boundary) {
+		dio->boundary = 1;
+		ret = dio_send_cur_page(dio);
+		page_cache_release(dio->cur_page);
+		dio->cur_page = NULL;
+	}
 out:
 	return ret;
 }
@@ -917,7 +919,16 @@ do_holes:
 			this_chunk_bytes = this_chunk_blocks << blkbits;
 			BUG_ON(this_chunk_bytes == 0);
 
-			dio->boundary = buffer_boundary(map_bh);
+			/*
+			 * get_block may return more than one page worth
+			 * of blocks. Make sure only the last io we
+			 * send down for this region is a boundary
+			 */
+			if (dio->blocks_available == this_chunk_blocks)
+				dio->boundary = buffer_boundary(map_bh);
+			else
+				dio->boundary = 0;
+
 			ret = submit_page_section(dio, page, offset_in_page,
 					this_chunk_bytes, dio->next_block_for_io);
 			if (ret) {