We are looking at adding calls to posix_fadvise(DONTNEED) to various
data logging routines. This has two benefits:
- frequent write-out -> shorter queues give lower latency, also disk
is more utilized as writeout begins immediately
- less useless stuff in page cache
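For concreteness, the per-chunk pattern we have in mind is roughly the
following (an illustrative sketch only: the helper name is made up and
error handling is omitted):

#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: append a buffer to the log, then hint that the
 * just-written range won't be needed again, so the kernel can begin
 * writeout now and later drop those pages from the page cache. */
static void log_append(int log_fd, const char *buf, size_t len, off_t *pos)
{
	ssize_t n = pwrite(log_fd, buf, len, *pos);

	if (n > 0) {
		posix_fadvise(log_fd, *pos, n, POSIX_FADV_DONTNEED);
		*pos += n;
	}
}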
One problem with fadvise() (and ext2, at least) is that associated
metadata isn't scheduled with the data. So, for a large log file with
a high append rate, hundreds of indirect blocks are left to be written
out by periodic writeback. This metadata consists of single blocks
spaced by 4MB, leading to spikes of very inefficient disk utilization,
deep queues and high latency.
Andrew suggests a new SYNC_FILE_RANGE_METADATA flag for
sys_sync_file_range(), and leaving posix_fadvise() alone. That will
work for my purposes, but it seems like it leaves
posix_fadvise(DONTNEED) with a performance bug on ext2 (or any other
filesystem with interleaved data/metadata). Andrew's argument is that
people have expectations about posix_fadvise() behavior as it's been
around for years in Linux.
I'd like to get a consensus on what The Right Thing is, so I can move
toward implementing it and migrating the logging code onto that
interface.
Chad
On Fri, 31 Oct 2008 13:54:14 -0700 Chad Talbott <[email protected]> wrote:
> We are looking at adding calls to posix_fadvise(DONTNEED) to various
> data logging routines. This has two benefits:
>
> - frequent write-out -> shorter queues give lower latency, also disk
> is more utilized as writeout begins immediately
>
> - less useless stuff in page cache
>
> One problem with fadvise() (and ext2, at least) is that associated
> metadata isn't scheduled with the data. So, for a large log file with
> a high append rate, hundreds of indirect blocks are left to be written
> out by periodic writeback. This metadata consists of single blocks
> spaced by 4MB, leading to spikes of very inefficient disk utilization,
> deep queues and high latency.
>
> Andrew suggests a new SYNC_FILE_RANGE_METADATA flag for
> sys_sync_file_range(), and leaving posix_fadvise() alone. That will
> work for my purposes, but it seems like it leaves
> posix_fadvise(DONTNEED) with a performance bug on ext2 (or any other
> filesystem with interleaved data/metadata). Andrew's argument is that
> people have expectations about posix_fadvise() behavior as it's been
> around for years in Linux.
Sort-of. It's just that posix_fadvise() is so poorly defined, and
there is some need to be compatible with other implementations.
And fadvise(FADV_DONTNEED) is just that: "I won't be using that data
again". Implementing specific writeback behaviour underneath that hint
is unobvious and a bit weird. It's a bit of a fluke that it does
writeout at all!
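For reference, the FADV_DONTNEED case in mm/fadvise.c currently does
approximately the following (paraphrased from memory, so check the
tree):

	case POSIX_FADV_DONTNEED:
		/* The writeout is just an opportunistic flush of the whole
		 * file, skipped entirely if the backing device is busy... */
		if (!bdi_write_congested(mapping->backing_dev_info))
			filemap_flush(mapping);

		/* ...followed by dropping the clean pages in the range. */
		start_index = (offset+(PAGE_CACHE_SIZE-1)) >> PAGE_CACHE_SHIFT;
		end_index = (endbyte >> PAGE_CACHE_SHIFT);
		if (end_index >= start_index)
			invalidate_mapping_pages(mapping, start_index,
						end_index);
		break;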
We have much more flexibility with sync_file_range(), and it is more
explicit.
That being said, I don't understand why the IO scheduling problems
which you're seeing are occurring. There is code in fs/mpage.c
specifically to handle this case (search for "write_boundary_block").
It will spot that 4k indirect block in the middle of two 4MB data
blocks and will schedule it for writeout at the right time.
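The helper lives in fs/buffer.c and does roughly this (paraphrased):
when a boundary buffer is submitted, look up the block immediately
following it and, if it's dirty (e.g. an ext2 indirect block), start
I/O on it right away so it goes out together with the surrounding data.

void write_boundary_block(struct block_device *bdev,
			sector_t bblock, unsigned blocksize)
{
	/* bblock + 1 is the block after the boundary, e.g. an ext2
	 * indirect block; push it out with the data it describes. */
	struct buffer_head *bh = __find_get_block(bdev, bblock + 1, blocksize);
	if (bh) {
		if (buffer_dirty(bh))
			ll_rw_block(WRITE, 1, &bh);
		put_bh(bh);
	}
}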
So why isn't that working?
The below (I merged it this week) is kinda related...
From: Miquel van Smoorenburg <[email protected]>
While tracing I/O patterns with blktrace (a great tool) a few weeks ago I
identified a minor issue in fs/mpage.c
As the comment above mpage_readpages() says, a fs's get_block function
will set BH_Boundary when it maps a block just before a block for which
extra I/O is required.
Since get_block() can map a range of pages, for all these pages the
BH_Boundary flag will be set. But we only need to push what I/O we have
accumulated at the last block of this range.
This makes do_mpage_readpage() send out the largest possible bio instead
of a bunch of page-sized ones in the BH_Boundary case.
Signed-off-by: Miquel van Smoorenburg <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: Jens Axboe <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
fs/mpage.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff -puN fs/mpage.c~do_mpage_readpage-dont-submit-lots-of-small-bios-on-boundary fs/mpage.c
--- a/fs/mpage.c~do_mpage_readpage-dont-submit-lots-of-small-bios-on-boundary
+++ a/fs/mpage.c
@@ -308,7 +308,10 @@ alloc_new:
 		goto alloc_new;
 	}
 
-	if (buffer_boundary(map_bh) || (first_hole != blocks_per_page))
+	relative_block = block_in_file - *first_logical_block;
+	nblocks = map_bh->b_size >> blkbits;
+	if ((buffer_boundary(map_bh) && relative_block == nblocks) ||
+	    (first_hole != blocks_per_page))
 		bio = mpage_bio_submit(READ, bio);
 	else
 		*last_block_in_bio = blocks[blocks_per_page - 1];
_
On Fri, Oct 31, 2008 at 01:54:14PM -0700, Chad Talbott wrote:
> We are looking at adding calls to posix_fadvise(DONTNEED) to various
> data logging routines. This has two benefits:
>
> - frequent write-out -> shorter queues give lower latency, also disk
> is more utilized as writeout begins immediately
>
> - less useless stuff in page cache
>
> One problem with fadvise() (and ext2, at least) is that associated
> metadata isn't scheduled with the data. So, for a large log file with
> a high append rate, hundreds of indirect blocks are left to be written
> out by periodic writeback. This metadata consists of single blocks
> spaced by 4MB, leading to spikes of very inefficient disk utilization,
> deep queues and high latency.
Sounds like a filesystem bug to me, not a problem with
posix_fadvise(DONTNEED).
> Andrew suggests a new SYNC_FILE_RANGE_METADATA flag for
> sys_sync_file_range(), and leaving posix_fadvise() alone.
What is the interface that a filesystem will see? No filesystem has
a "metadata sync" method - is this going to fall through to some new
convoluted combination of writeback flags to an inode/mapping
that more filesystems than not can get wrong?
FWIW, sys_sync_file_range() is fundamentally broken for data
integrity writeback - at no time does it call a filesystem method
that can result in a barrier I/O being issued to disk after
writeback is complete. So, unlike fsync() or fdatasync(), the data
can still be lost after completion due to power failure on drives
with volatile write caches....
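If you really do need the data to survive a power failure, you still
have to do something like the following after (or instead of) the
sync_file_range() call (illustrative only; fd, offset and nbytes are
placeholders):

	/* Illustrative only: sync_file_range() can push pages to the device
	 * and wait for completion, but only fdatasync()/fsync() go through
	 * the filesystem and can issue the barrier that volatile write
	 * caches need. */
	sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
	fdatasync(fd);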
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Sat, Nov 1, 2008 at 1:21 AM, Andrew Morton <[email protected]> wrote:
> And fadvise(FADV_DONTNEED) is just that: "I won't be using that data
> again". Implementing specific writeback behaviour underneath that hint
> is unobvious and a bit weird. It's a bit of a fluke that it does
> writeout at all!
>
> We have much more flexibility with sync_file_range(), and it is more
> explicit.
So in the new world, an application should call sync_file_range
(solving my problem by including metadata) to initiate writeout, and
then call posix_fadvise(DONTNEED) to drop the pages from page cache?
I think this would work for me.
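Something like this per log chunk, I imagine (a sketch; chunk_off and
chunk_len are placeholders, and the exact flag combination is my
guess):

	/* 1) start async writeout of the chunk we just appended */
	sync_file_range(fd, chunk_off, chunk_len, SYNC_FILE_RANGE_WRITE);

	/* 2) some time later, wait for that writeout to complete */
	sync_file_range(fd, chunk_off, chunk_len,
			SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE |
			SYNC_FILE_RANGE_WAIT_AFTER);

	/* 3) drop the now-clean pages from the page cache */
	posix_fadvise(fd, chunk_off, chunk_len, POSIX_FADV_DONTNEED);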
> That being said, I don't understand why the IO scheduling problems
> which you're seeing are occurring. There is code in fs/mpage.c
> specifically to handle this case (search for "write_boundary_block").
> It will spot that 4k indirect block in the middle of two 4MB data
> blocks and will schedule it for writeout at the right time.
>
> So why isn't that working?
Very good question, I'll look into why it's not helping here.
Chad
On Wed, 5 Nov 2008 16:56:54 -0800
Chad Talbott <[email protected]> wrote:
> On Sat, Nov 1, 2008 at 1:21 AM, Andrew Morton <[email protected]> wrote:
> > And fadvise(FADV_DONTNEED) is just that: "I won't be using that data
> > again". Implementing specific writeback behaviour underneath that hint
> > is unobvious and a bit weird. It's a bit of a fluke that it does
> > writeout at all!
> >
> > We have much more flexibility with sync_file_range(), and it is more
> > explicit.
>
> So in the new world, an application should call sync_file_range
> (solving my problem by including metadata) to initiate writeout, and
> then call posix_fadvise(DONTNEED) to drop the pages from page cache?
> I think this would work for me.
That would work.
Although Nick is threatening to make
sync_file_range(SYNC_FILE_RANGE_WRITE) all slow by using WB_SYNC_ALL,
probably unnecessarily, ho hum.
> > That being said, I don't understand why the IO scheduling problems
> > which you're seeing are occurring. There is code in fs/mpage.c
> > specifically to handle this case (search for "write_boundary_block").
> > It will spot that 4k indirect block in the middle of two 4MB data
> > blocks and will schedule it for writeout at the right time.
> >
> > So why isn't that working?
>
> Very good question, I'll look into why it's not helping here.
>
Thanks.
On Sun, Nov 2, 2008 at 2:45 PM, Dave Chinner <[email protected]> wrote:
> On Fri, Oct 31, 2008 at 01:54:14PM -0700, Chad Talbott wrote:
>> We are looking at adding calls to posix_fadvise(DONTNEED) to various
>> data logging routines. This has two benefits:
>>
>> - frequent write-out -> shorter queues give lower latency, also disk
>> is more utilized as writeout begins immediately
>>
>> - less useless stuff in page cache
>>
>> One problem with fadvise() (and ext2, at least) is that associated
>> metadata isn't scheduled with the data. So, for a large log file with
>> a high append rate, hundreds of indirect blocks are left to be written
>> out by periodic writeback. This metadata consists of single blocks
>> spaced by 4MB, leading to spikes of very inefficient disk utilization,
>> deep queues and high latency.
>
> Sounds like a filesystem bug to me, not a problem with
> posix_fadvise(DONTNEED).
Agreed that the right fix is not to hack fadvise(). If the boundary
page mechanism can't be made to work, it looks like the right thing
might be to modify ext2_writepages() to opportunistically write dirty
metadata
in holes between dirty data. For post-ext2 filesystems which attempt
to provide transactional semantics, this is probably not acceptable.
>> Andrew suggests a new SYNC_FILE_RANGE_METADATA flag for
>> sys_sync_file_range(), and leaving posix_fadvise() alone.
>
> What is the interface that a filesystem will see? No filesystem has
> a "metadata sync" method - is this going to fall through to some new
> convoluted combination of writeback flags to an inode/mapping
> that more filesystems than not can get wrong?
Good point; coupled with metadata/data ordering and your argument
below, that's a decent argument against exposing this interface.
> FWIW, sys_sync_file_range() is fundamentally broken for data
> integrity writeback - at no time does it call a filesystem method
> that can result in a barrier I/O being issued to disk after
> writeback is complete. So, unlike fsync() or fdatasync(), the data
> can still be lost after completion due to power failure on drives
> with volatile write caches....
Seems to be true. I'm not currently concerned with sync_file_range
for data integrity, so I'm going to punt on this issue.
If the consensus is against exposing a "sync metadata" interface, I'm
fine with ext2 silently updating metadata alongside neighboring data
in *either* posix_fadvise() or sync_file_range. Either way, does it
seem reasonable for posix_fadvise(DONTNEED) to call
__filemap_fdatawrite_range to do its work? This is the same path that
sync_file_range uses. I prefer this to the current behavior of
ignoring the passed range and initiating writeback on the entire file.
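i.e. something along these lines in the DONTNEED case (a sketch only,
untested):

	case POSIX_FADV_DONTNEED:
		/* Sketch: kick off async writeback on just the passed range
		 * instead of flushing the whole file. */
		if (!bdi_write_congested(mapping->backing_dev_info))
			__filemap_fdatawrite_range(mapping, offset, endbyte,
						   WB_SYNC_NONE);
		/* ...then invalidate the range as we do today. */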
Chad
On Wed, Nov 5, 2008 at 5:07 PM, Andrew Morton <[email protected]> wrote:
> On Wed, 5 Nov 2008 16:56:54 -0800
> Chad Talbott <[email protected]> wrote:
>> So in the new world, an application should call sync_file_range
>> (solving my problem by including metadata) to initiate writeout, and
>> then call posix_fadvise(DONTNEED) to drop the pages from page cache?
>> I think this would work for me.
>
> That would work.
>
> Although Nick is threatening to make
> sync_file_range(SYNC_FILE_RANGE_WRITE) all slow by using WB_SYNC_ALL,
> probably unnecessarily, ho hum.
Boo! That will send users (me, at least) running back to
posix_fadvise(DONTNEED) for its non-blocking-writeback-inducing side
effect.
Chad
On Wed, Nov 05, 2008 at 05:19:20PM -0800, Chad Talbott wrote:
> On Sun, Nov 2, 2008 at 2:45 PM, Dave Chinner <[email protected]> wrote:
> > On Fri, Oct 31, 2008 at 01:54:14PM -0700, Chad Talbott wrote:
> >> Andrew suggests a new SYNC_FILE_RANGE_METADATA flag for
> >> sys_sync_file_range(), and leaving posix_fadvise() alone.
> >
> > What is the interface that a filesystem will see? No filesystem has
> > a "metadata sync" method - is this going to fall through to some new
> > convoluted combination of writeback flags to an inode/mapping
> > that more filesystems than not can get wrong?
>
> Good point, coupled with metadata/data ordering and your argument
> below, a decent argument against exposing this interface.
>
> > FWIW, sys_sync_file_range() is fundamentally broken for data
> > integrity writeback - at no time does it call a filesystem method
> > that can result in a barrier I/O being issued to disk after
> > writeback is complete. So, unlike fsync() or fdatasync(), the data
> > can still be lost after completion due to power failure on drives
> > with volatile write caches....
>
> Seems to be true. I'm not currently concerned with sync_file_range
> for data integrity, so I'm going to punt on this issue.
;)
> If the consensus is against exposing a "sync metadata" interface, I'm
> fine with ext2 silently updating metadata alongside neighboring data
> in *either* posix_fadvise() or sync_file_range.
I think that sync_file_range is the better choice for "correct"
behaviour. When you sync data explicitly, there is an implicit
assumption that the metadata needed to reference that data is written
to disk as well.
> Either way, does it
> seem reasonable for posix_fadvise(DONTNEED) to call
> __filemap_fdatawrite_range to do its work?
From a kernel perspective, I don't think it really matters. To an
application, it could. e.g. If you're calling posix_fadvise on a
large range, then the I/O patterns will be the same either way. If
you're calling posix_fadvise() on small, sparse ranges of the file,
then you'll turn one large, fast writeout into lots of small random
writes. i.e. upgrade the kernel and the application goes much
slower....
I guess this all depends on whether this would be considered a
regression or a stupid application ;)
Cheers,
Dave.
--
Dave Chinner
[email protected]