Hi Darrick and Allison,
There has been a long-standing performance regression involving AIO DIO
4k-aligned writes on ext4 backed by a fast local SSD since the switch
to iomap. I think it was originally reported and investigated in this
thread: https://lore.kernel.org/all/[email protected]/
Short version:
Pre-iomap, for ext4 async direct writes, after the bio is written to
disk the completion function is called directly during the endio stage.
Post-iomap, for direct writes, after the bio is written to disk the
completion function is deferred to a work queue. This adds latency that
impacts performance most noticeably on very fast SSDs.
Detailed version:
A possible explanation is below, followed by a few questions to figure
out the right way to fix it.
In 4.15, ext4 uses fs/direct-io.c. When an AIO DIO write has completed
in the nvme driver, the interrupt handler for the write request ends
in calling bio_endio() which ends up calling dio_bio_end_aio(). A
different end_io function is used for async and sync io. If there are
no pages mapped in memory for the write operation's inode, then the
completion function for ext4 is called directly. If there are pages
mapped, they may be stale and need to be invalidated, so the
completion work is deferred to a work queue.
Here is the relevant 4.15 code:
fs/direct-io.c: dio_bio_end_aio()
	if (dio->result)
		defer_completion = dio->defer_completion ||
				   (dio_op == REQ_OP_WRITE &&
				    dio->inode->i_mapping->nrpages);
	if (defer_completion) {
		INIT_WORK(&dio->complete_work, dio_aio_complete_work);
		queue_work(dio->inode->i_sb->s_dio_done_wq,
			   &dio->complete_work);
	} else {
		dio_complete(dio, 0, DIO_COMPLETE_ASYNC);
	}
After ext4 switched to using iomap, the endio function became
iomap_dio_bio_end_io() in fs/iomap/direct-io.c. In iomap the same endio
function is used for both async and sync io. All async write requests
defer io completion to a work queue, even if there are no mapped pages
for the inode.
Here is the relevant code in the latest kernel (post-iomap):
fs/iomap/direct-io.c: iomap_dio_bio_end_io()
	if (dio->wait_for_completion) {
		struct task_struct *waiter = dio->submit.waiter;

		WRITE_ONCE(dio->submit.waiter, NULL);
		blk_wake_io_task(waiter);
	} else if (dio->flags & IOMAP_DIO_WRITE) {
		struct inode *inode = file_inode(dio->iocb->ki_filp);

		WRITE_ONCE(dio->iocb->private, NULL);
		INIT_WORK(&dio->aio.work, iomap_dio_complete_work);
		queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work);
	} else {
		WRITE_ONCE(dio->iocb->private, NULL);
		iomap_dio_complete_work(&dio->aio.work);
	}
With the attached patch, I see significantly better performance in 5.10
than 4.15. 5.10 is the latest kernel where I have driver support for an
SSD that is fast enough to reproduce the regression. I verified that
upstream iomap works the same.
Test results using the reproduction script from the original report
and testing with 4k/8k/12k/16k blocksizes and write-only:
https://people.collabora.com/~krisman/dio/week21/bench.sh
fio benchmark command:
fio --ioengine libaio --size=2G --direct=1 --filename=${MNT}/file --iodepth=64 \
--time_based=1 --thread=1 --overwrite=1 --bs=${BS} --rw=$RW \
--name "`uname -r`-${TYPE}-${RW}-${BS}-${FS}" \
--runtime=100 --output-format=terse >> ${LOG}
For 4.15, with all write completions called in io handler:
4k: bw=1056MiB/s
8k: bw=2082MiB/s
12k: bw=2332MiB/s
16k: bw=2453MiB/s
For unmodified 5.10, with all write completions deferred:
4k: bw=1004MiB/s
8k: bw=2074MiB/s
12k: bw=2309MiB/s
16k: bw=2465MiB/s
For modified 5.10, with all write completions called in io handler:
4k: bw=1193MiB/s
8k: bw=2258MiB/s
12k: bw=2346MiB/s
16k: bw=2446MiB/s
Questions:
Why did iomap from the beginning not make the async/sync io and
mapped/unmapped distinction that fs/direct-io.c did?
Since no issues were found with ext4 calling the completion work
directly in the io handler pre-iomap, it is unlikely that doing so is
unsafe for ext4 (i.e., that its completion path sleeps while running in
the io handler callback). However, this may not be true for all
filesystems. Does XFS potentially sleep in its completion code?
Allison, Ted mentioned you saw a similar problem when doing performance
testing for the latest version of Unbreakable Linux. Can you test/verify
this patch addresses your performance regression as well?
Jeremy Bongio (1):
For DIO writes with no mapped pages for inode, skip deferring
completion.
fs/iomap/direct-io.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
--
2.41.0.185.g7c58973941-goog
If there are no mapped pages for a DIO write, then the page cache does
not need to be invalidated after the write. For very fast SSDs doing
async direct IO, deferring completion to a workqueue can result in a
significant performance loss.
---
fs/iomap/direct-io.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 019cc87d0fb3..8f27d0dc4f6d 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -168,7 +168,9 @@ void iomap_dio_bio_end_io(struct bio *bio)
 		struct task_struct *waiter = dio->submit.waiter;
 		WRITE_ONCE(dio->submit.waiter, NULL);
 		blk_wake_io_task(waiter);
-	} else if (dio->flags & IOMAP_DIO_WRITE) {
+	} else if (dio->flags & IOMAP_DIO_WRITE &&
+		   (!dio->iocb->ki_filp->f_inode ||
+		    dio->iocb->ki_filp->f_inode->i_mapping->nrpages)) {
 		struct inode *inode = file_inode(dio->iocb->ki_filp);
 
 		WRITE_ONCE(dio->iocb->private, NULL);
--
2.41.0.185.g7c58973941-goog
On Wed, Jun 21, 2023 at 10:29:20AM -0700, Jeremy Bongio wrote:
> +++ b/fs/iomap/direct-io.c
> @@ -168,7 +168,9 @@ void iomap_dio_bio_end_io(struct bio *bio)
>  		struct task_struct *waiter = dio->submit.waiter;
>  		WRITE_ONCE(dio->submit.waiter, NULL);
>  		blk_wake_io_task(waiter);
> -	} else if (dio->flags & IOMAP_DIO_WRITE) {
> +	} else if (dio->flags & IOMAP_DIO_WRITE &&
> +		   (!dio->iocb->ki_filp->f_inode ||
> +		    dio->iocb->ki_filp->f_inode->i_mapping->nrpages)) {
I don't think it's possible for file->f_inode to be NULL here, is it?
At any rate, that amount of indirection is just nasty. How about this?
+++ b/fs/iomap/direct-io.c
@@ -161,15 +161,19 @@ void iomap_dio_bio_end_io(struct bio *bio)
 		struct task_struct *waiter = dio->submit.waiter;
 
 		WRITE_ONCE(dio->submit.waiter, NULL);
 		blk_wake_io_task(waiter);
-	} else if (dio->flags & IOMAP_DIO_WRITE) {
+	} else {
 		struct inode *inode = file_inode(dio->iocb->ki_filp);
 
 		WRITE_ONCE(dio->iocb->private, NULL);
-		INIT_WORK(&dio->aio.work, iomap_dio_complete_work);
-		queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work);
-	} else {
-		WRITE_ONCE(dio->iocb->private, NULL);
-		iomap_dio_complete_work(&dio->aio.work);
+		if (dio->flags & IOMAP_DIO_WRITE &&
+		    (inode->i_mapping->nrpages > 0)) {
+			INIT_WORK(&dio->aio.work,
+					iomap_dio_complete_work);
+			queue_work(inode->i_sb->s_dio_done_wq,
+					&dio->aio.work);
+		} else {
+			iomap_dio_complete_work(&dio->aio.work);
+		}
 	}
 }
On Thu, Jun 22, 2023 at 11:55:23AM +1000, Dave Chinner wrote:
> Ok, so having spent a bit more thought on this away from the office
> this morning, I think there is a generic way we can avoid deferring
> completions for pure overwrites.
OK, this is how we can, but should we? The same amount of work
needs to be done, no matter whether we do it in interrupt context or
workqueue context. Doing it in interrupt context has lower latency,
but deferring to the workqueue may allow us to batch up the work and so
get better bandwidth.
And we can't handle other interrupts while we're handling this one,
so from a whole-system perspective, I think we'd rather do the work in
the workqueue.
Latency is important for reads, but why is it important for writes?
There's such a thing as a dependent read, but writes are usually buffered
and we can wait as long as we like for a write to complete.
On Thu, Jun 22, 2023 at 03:55:52AM +0100, Matthew Wilcox wrote:
> Latency is important for reads, but why is it important for writes?
> There's such a thing as a dependent read, but writes are usually buffered
> and we can wait as long as we like for a write to complete.
That was exactly my reasoning on why I did always defer the write
completions in the initial iomap direct I/O code, and until now no
one has complained.
I could see why people care about either synchronous writes or polled
io_uring writes, for which we might be able to do the work in the
completion thread context, but for aio writes this feels a bit odd.
On Wed, 2023-06-21 at 10:29 -0700, Jeremy Bongio wrote:
> Hi Darrick and Allison,
>
> [...]
> Allison, Ted mentioned you saw a similar problem when doing
> performance testing for the latest version of Unbreakable Linux. Can
> you test/verify this patch addresses your performance regression as
> well?
Hi Jeremy,
Sure I can link this patch to the bug report to see if the tester sees
any improvements. If anything, I think it's a good data point to have,
even if we are still considering other solutions. Thanks!
Allison
On Thu, Jun 22, 2023 at 09:59:29AM +1000, Dave Chinner wrote:
> Ah, you are testing pure overwrites, which means for ext4 the only
> thing it needs to care about is cached mappings. What happens when
> you add O_DSYNC here?
I think you mean O_SYNC, right? In a pure overwrite case, where all
of the extents are initialized and where the Oracle or DB2 server is
doing writes to preallocated, pre-initialized space in the tablespace
file followed by fdatasync(), there *are* no post-I/O data integrity
operations which are required.
If the file is opened O_SYNC or if the blocks were not preallocated
using fallocate(2) and not initialized ahead of time, then sure, we
can't use this optimization.
However, the cases where database workloads *are* doing overwrites
and using fdatasync(2) most certainly do exist, and the benefit of
this optimization can be a 20% throughput improvement. Which is
nothing to sneeze at.
What we might want to do is to let the file system tell the iomap layer via
a flag whether or not there are no post-I/O metadata operations
required, and then *if* that flag is set, and *if* the inode has no
pages in the page cache (so there are no invalidate operations
necessary), it should be safe to skip using queue_work(). That way,
the file system has to affirmatively state that it is safe to skip the
workqueue, so it shouldn't do any harm to other file systems using the
iomap DIO layer.
What am I missing?
Cheers,
- Ted