In some cases, e.g. for zoned block devices, direct writes are
forced into buffered writes that will populate the page cache
and be written out just like buffered io.

Direct reads, on the other hand, are supported for the zoned
block device case. This has the effect that applications
built for direct io will fill up the page cache with data
that will never be read, and that is a waste of resources.
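
To make this concrete, below is a minimal userspace sketch of the
direct-write pattern such applications use (the file path is
hypothetical and error handling is trimmed); on f2fs over a zoned
device, each of these writes currently lands in the page cache
despite O_DIRECT:

#define _GNU_SOURCE	/* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	/* hypothetical file on an f2fs mount backed by a zoned device */
	int fd = open("/mnt/f2fs/data", O_WRONLY | O_CREAT | O_DIRECT, 0644);
	void *buf;

	if (fd < 0)
		return 1;
	/* O_DIRECT requires aligned buffers, offsets and lengths */
	if (posix_memalign(&buf, 4096, 4096))
		return 1;
	memset(buf, 0xab, 4096);

	/*
	 * The application expects this write to bypass the page cache;
	 * when it is silently turned into a buffered write, the data is
	 * cached instead and will likely never be read back.
	 */
	if (write(fd, buf, 4096) != 4096)
		return 1;

	free(buf);
	close(fd);
	return 0;
}
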
If we agree that this is a problem, how do we fix it?

A) Supporting proper direct writes for zoned block devices would
be the best, but it is currently not supported (probably for
a good but non-obvious reason). Would it be feasible to
implement proper direct IO?

B) Avoid the cost of keeping unwanted data by syncing and throwing
out the cached pages for buffered O_DIRECT writes before completion.

This patch implements B) by reusing the code for how partial
block writes are flushed out on the "normal" direct write path.

Note that this changes the performance characteristics of f2fs
quite a bit.

Direct IO performance for zoned block devices is lower for
small writes after this patch, but this should be expected
with direct IO and in line with how f2fs behaves on top of
conventional block devices.

Another open question is whether the flushing should be done for
all cases where buffered writes are forced.
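
As a rough way to observe the behavior before and after this patch,
mincore(2) can report how many pages of a just-written file remain
resident in the page cache. A sketch (the default path below is
hypothetical):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : "/mnt/f2fs/data";
	long psz = sysconf(_SC_PAGESIZE);
	long pages, resident = 0, i;
	unsigned char *vec;
	struct stat st;
	void *map;
	int fd;

	fd = open(path, O_RDONLY);
	if (fd < 0 || fstat(fd, &st) < 0 || st.st_size == 0)
		return 1;

	pages = (st.st_size + psz - 1) / psz;
	vec = malloc(pages);
	/* mapping the file does not fault its pages in by itself */
	map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
	if (!vec || map == MAP_FAILED || mincore(map, st.st_size, vec))
		return 1;

	for (i = 0; i < pages; i++)
		resident += vec[i] & 1;	/* bit 0: resident in page cache */
	printf("%ld of %ld pages resident\n", resident, pages);
	return 0;
}

With forced buffered writes, every written page shows up as resident;
after this patch the count should drop back to (close to) zero.
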
Signed-off-by: Hans Holmberg <[email protected]>
---
fs/f2fs/file.c | 38 ++++++++++++++++++++++++++++++--------
 1 file changed, 30 insertions(+), 8 deletions(-)

diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
index ecbc8c135b49..4e57c37bce35 100644
--- a/fs/f2fs/file.c
+++ b/fs/f2fs/file.c
@@ -4513,6 +4513,19 @@ static const struct iomap_dio_ops f2fs_iomap_dio_write_ops = {
.end_io = f2fs_dio_write_end_io,
};

+static void f2fs_flush_buffered_write(struct address_space *mapping,
+ loff_t start_pos, loff_t end_pos)
+{
+ int ret;
+
+ ret = filemap_write_and_wait_range(mapping, start_pos, end_pos);
+ if (ret < 0)
+ return;
+ invalidate_mapping_pages(mapping,
+ start_pos >> PAGE_SHIFT,
+ end_pos >> PAGE_SHIFT);
+}
+
static ssize_t f2fs_dio_write_iter(struct kiocb *iocb, struct iov_iter *from,
bool *may_need_sync)
{
@@ -4612,14 +4625,9 @@ static ssize_t f2fs_dio_write_iter(struct kiocb *iocb, struct iov_iter *from,

ret += ret2;

- ret2 = filemap_write_and_wait_range(file->f_mapping,
- bufio_start_pos,
- bufio_end_pos);
- if (ret2 < 0)
- goto out;
- invalidate_mapping_pages(file->f_mapping,
- bufio_start_pos >> PAGE_SHIFT,
- bufio_end_pos >> PAGE_SHIFT);
+ f2fs_flush_buffered_write(file->f_mapping,
+ bufio_start_pos,
+ bufio_end_pos);
}
} else {
/* iomap_dio_rw() already handled the generic_write_sync(). */
@@ -4717,8 +4725,22 @@ static ssize_t f2fs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
inode_unlock(inode);
out:
trace_f2fs_file_write_iter(inode, orig_pos, orig_count, ret);
+
if (ret > 0 && may_need_sync)
ret = generic_write_sync(iocb, ret);
+
+ /* If buffered IO was forced, flush and drop the data from
+ * the page cache to preserve O_DIRECT semantics
+ */
+ if (ret > 0 && !dio && (iocb->ki_flags & IOCB_DIRECT)) {
+ struct file *file = iocb->ki_filp;
+ loff_t end_pos = orig_pos + ret - 1;
+
+ f2fs_flush_buffered_write(file->f_mapping,
+ orig_pos,
+ end_pos);
+ }
+
return ret;
}

--
2.25.1
>In some cases, e.g. for zoned block devices, direct writes are
>forced into buffered writes that will populate the page cache
>and be written out just like buffered io.
>
>Direct reads, on the other hand, are supported for the zoned
>block device case. This has the effect that applications
>built for direct io will fill up the page cache with data
>that will never be read, and that is a waste of resources.
>
>If we agree that this is a problem, how do we fix it?
I agree.
Thanks.
>
>A) Supporting proper direct writes for zoned block devices would
>be the best, but it is currently not supported (probably for
>a good but non-obvious reason). Would it be feasible to
>implement proper direct IO?
>
>B) Avoid the cost of keeping unwanted data by syncing and throwing
>out the cached pages for buffered O_DIRECT writes before completion.
>
>This patch implements B) by reusing the code for how partial
>block writes are flushed out on the "normal" direct write path.
>
>Note that this changes the performance characteristics of f2fs
>quite a bit.
>
>Direct IO performance for zoned block devices is lower for
>small writes after this patch, but this should be expected
>with direct IO and in line with how f2fs behaves on top of
>conventional block devices.
>
>Another open question is whether the flushing should be done for
>all cases where buffered writes are forced.
>
>Signed-off-by: Hans Holmberg <[email protected]>
Reviewed-by: Yonggil Song <[email protected]>
>---
> fs/f2fs/file.c | 38 ++++++++++++++++++++++++++++++--------
> 1 file changed, 30 insertions(+), 8 deletions(-)
>
>diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
>index ecbc8c135b49..4e57c37bce35 100644
>--- a/fs/f2fs/file.c
>+++ b/fs/f2fs/file.c
>@@ -4513,6 +4513,19 @@ static const struct iomap_dio_ops f2fs_iomap_dio_write_ops = {
> .end_io = f2fs_dio_write_end_io,
> };
>
>+static void f2fs_flush_buffered_write(struct address_space *mapping,
>+ loff_t start_pos, loff_t end_pos)
>+{
>+ int ret;
>+
>+ ret = filemap_write_and_wait_range(mapping, start_pos, end_pos);
>+ if (ret < 0)
>+ return;
>+ invalidate_mapping_pages(mapping,
>+ start_pos >> PAGE_SHIFT,
>+ end_pos >> PAGE_SHIFT);
>+}
>+
> static ssize_t f2fs_dio_write_iter(struct kiocb *iocb, struct iov_iter *from,
> bool *may_need_sync)
> {
>@@ -4612,14 +4625,9 @@ static ssize_t f2fs_dio_write_iter(struct kiocb *iocb, struct iov_iter *from,
>
> ret += ret2;
>
>- ret2 = filemap_write_and_wait_range(file->f_mapping,
>- bufio_start_pos,
>- bufio_end_pos);
>- if (ret2 < 0)
>- goto out;
>- invalidate_mapping_pages(file->f_mapping,
>- bufio_start_pos >> PAGE_SHIFT,
>- bufio_end_pos >> PAGE_SHIFT);
>+ f2fs_flush_buffered_write(file->f_mapping,
>+ bufio_start_pos,
>+ bufio_end_pos);
> }
> } else {
> /* iomap_dio_rw() already handled the generic_write_sync(). */
>@@ -4717,8 +4725,22 @@ static ssize_t f2fs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
> inode_unlock(inode);
> out:
> trace_f2fs_file_write_iter(inode, orig_pos, orig_count, ret);
>+
> if (ret > 0 && may_need_sync)
> ret = generic_write_sync(iocb, ret);
>+
>+ /* If buffered IO was forced, flush and drop the data from
>+ * the page cache to preserve O_DIRECT semantics
>+ */
>+ if (ret > 0 && !dio && (iocb->ki_flags & IOCB_DIRECT)) {
>+ struct file *file = iocb->ki_filp;
>+ loff_t end_pos = orig_pos + ret - 1;
>+
>+ f2fs_flush_buffered_write(file->f_mapping,
>+ orig_pos,
>+ end_pos);
>+ }
>+
> return ret;
> }
>
>--
>2.25.1
On Mon, Feb 20, 2023 at 01:20:04PM +0100, Hans Holmberg wrote:
> A) Supporting proper direct writes for zoned block devices would
> be the best, but it is currently not supported (probably for
> a good but non-obvious reason). Would it be feasible to
> implement proper direct IO?
I don't see why not. In many ways direct writes to zoned devices
should be easier than non-direct writes.
Any comments from the maintainers why the direct I/O writes to zoned
devices are disabled? I could not find anything helpful in the comments
or commit logs.
On 03/20, Christoph Hellwig wrote:
> On Mon, Feb 20, 2023 at 01:20:04PM +0100, Hans Holmberg wrote:
> > A) Supporting proper direct writes for zoned block devices would
> > be the best, but it is currently not supported (probably for
> > a good but non-obvious reason). Would it be feasible to
> > implement proper direct IO?
>
> I don't see why not. In many ways direct writes to zoned devices
> should be easier than non-direct writes.
>
> Any comments from the maintainers why the direct I/O writes to zoned
> devices are disabled? I could not find anything helpful in the comments
> or commit logs.
The direct IO wants to overwrite the data on the same block address, while
the zoned device does not support it?
On 02/20, Hans Holmberg wrote:
> In some cases, e.g. for zoned block devices, direct writes are
> forced into buffered writes that will populate the page cache
> and be written out just like buffered io.
>
> Direct reads, on the other hand, are supported for the zoned
> block device case. This has the effect that applications
> built for direct io will fill up the page cache with data
> that will never be read, and that is a waste of resources.
>
> If we agree that this is a problem, how do we fix it?
>
> A) Supporting proper direct writes for zoned block devices would
> be the best, but it is currently not supported (probably for
> a good but non-obvious reason). Would it be feasible to
> implement proper direct IO?
>
> B) Avoid the cost of keeping unwanted data by syncing and throwing
> out the cached pages for buffered O_DIRECT writes before completion.
>
> This patch implements B) by reusing the code for how partial
> block writes are flushed out on the "normal" direct write path.
>
> Note that this changes the performance characteristics of f2fs
> quite a bit.
>
> Direct IO performance for zoned block devices is lower for
> small writes after this patch, but this should be expected
> with direct IO and in line with how f2fs behaves on top of
> conventional block devices.
>
> Another open question is whether the flushing should be done for
> all cases where buffered writes are forced.
>
> Signed-off-by: Hans Holmberg <[email protected]>
> ---
> fs/f2fs/file.c | 38 ++++++++++++++++++++++++++++++--------
> 1 file changed, 30 insertions(+), 8 deletions(-)
>
> diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
> index ecbc8c135b49..4e57c37bce35 100644
> --- a/fs/f2fs/file.c
> +++ b/fs/f2fs/file.c
> @@ -4513,6 +4513,19 @@ static const struct iomap_dio_ops f2fs_iomap_dio_write_ops = {
> .end_io = f2fs_dio_write_end_io,
> };
>
> +static void f2fs_flush_buffered_write(struct address_space *mapping,
> + loff_t start_pos, loff_t end_pos)
> +{
> + int ret;
> +
> + ret = filemap_write_and_wait_range(mapping, start_pos, end_pos);
> + if (ret < 0)
> + return;
> + invalidate_mapping_pages(mapping,
> + start_pos >> PAGE_SHIFT,
> + end_pos >> PAGE_SHIFT);
> +}
> +
> static ssize_t f2fs_dio_write_iter(struct kiocb *iocb, struct iov_iter *from,
> bool *may_need_sync)
> {
> @@ -4612,14 +4625,9 @@ static ssize_t f2fs_dio_write_iter(struct kiocb *iocb, struct iov_iter *from,
>
> ret += ret2;
>
> - ret2 = filemap_write_and_wait_range(file->f_mapping,
> - bufio_start_pos,
> - bufio_end_pos);
> - if (ret2 < 0)
> - goto out;
> - invalidate_mapping_pages(file->f_mapping,
> - bufio_start_pos >> PAGE_SHIFT,
> - bufio_end_pos >> PAGE_SHIFT);
> + f2fs_flush_buffered_write(file->f_mapping,
> + bufio_start_pos,
> + bufio_end_pos);
> }
> } else {
> /* iomap_dio_rw() already handled the generic_write_sync(). */
> @@ -4717,8 +4725,22 @@ static ssize_t f2fs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
> inode_unlock(inode);
> out:
> trace_f2fs_file_write_iter(inode, orig_pos, orig_count, ret);
> +
> if (ret > 0 && may_need_sync)
> ret = generic_write_sync(iocb, ret);
> +
> + /* If buffered IO was forced, flush and drop the data from
> + * the page cache to preserve O_DIRECT semantics
> + */
> + if (ret > 0 && !dio && (iocb->ki_flags & IOCB_DIRECT)) {
> + struct file *file = iocb->ki_filp;
> + loff_t end_pos = orig_pos + ret - 1;
> +
> + f2fs_flush_buffered_write(file->f_mapping,
> + orig_pos,
> + end_pos);
I applied a minor change:
/* If buffered IO was forced, flush and drop the data from
* the page cache to preserve O_DIRECT semantics
*/
- if (ret > 0 && !dio && (iocb->ki_flags & IOCB_DIRECT)) {
- struct file *file = iocb->ki_filp;
- loff_t end_pos = orig_pos + ret - 1;
-
- f2fs_flush_buffered_write(file->f_mapping,
+ if (ret > 0 && !dio && (iocb->ki_flags & IOCB_DIRECT))
+ f2fs_flush_buffered_write(iocb->ki_filp->f_mapping,
orig_pos,
- end_pos);
- }
+ orig_pos + ret - 1);
return ret;
}
> + }
> +
> return ret;
> }
>
> --
> 2.25.1
On Thu, 2023-03-23 at 15:14 -0700, Jaegeuk Kim wrote:
> On 03/20, Christoph Hellwig wrote:
> > On Mon, Feb 20, 2023 at 01:20:04PM +0100, Hans Holmberg wrote:
> > > A) Supporting proper direct writes for zoned block devices would
> > > be the best, but it is currently not supported (probably for
> > > a good but non-obvious reason). Would it be feasible to
> > > implement proper direct IO?
> >
> > I don't see why not. In many ways direct writes to zoned devices
> > should be easier than non-direct writes.
> >
> > Any comments from the maintainers why the direct I/O writes to
> > zoned
> > devices are disabled? I could not find anything helpful in the
> > comments
> > or commit logs.
>
> The direct IO wants to overwrite the data on the same block address,
> while
> > the zoned device does not support it?
Surely that is not the case with LFS mode, is it? Otherwise, even
buffered overwrites would have an issue.
On 03/23, Damien Le Moal wrote:
> On Thu, 2023-03-23 at 15:14 -0700, Jaegeuk Kim wrote:
> > On 03/20, Christoph Hellwig wrote:
> > > On Mon, Feb 20, 2023 at 01:20:04PM +0100, Hans Holmberg wrote:
> > > > A) Supporting proper direct writes for zoned block devices would
> > > > be the best, but it is currently not supported (probably for
> > > > a good but non-obvious reason). Would it be feasible to
> > > > implement proper direct IO?
> > >
> > > I don't see why not. In many ways direct writes to zoned devices
> > > should be easier than non-direct writes.
> > >
> > > Any comments from the maintainers why the direct I/O writes to
> > > zoned
> > > devices are disabled? I could not find anything helpful in the
> > > comments
> > > or commit logs.
> >
> > The direct IO wants to overwrite the data on the same block address,
> > while
> > zoned device does not support it?
>
> Surely that is not the case with LFS mode, is it? Otherwise, even
> buffered overwrites would have an issue.
A zoned device only supports LFS mode.
>
On Thu, 2023-03-23 at 16:46 -0700, Jaegeuk Kim wrote:
> On 03/23, Damien Le Moal wrote:
> > On Thu, 2023-03-23 at 15:14 -0700, Jaegeuk Kim wrote:
> > > On 03/20, Christoph Hellwig wrote:
> > > > On Mon, Feb 20, 2023 at 01:20:04PM +0100, Hans Holmberg wrote:
> > > > > A) Supporting proper direct writes for zoned block devices
> > > > > would
> > > > > be the best, but it is currently not supported (probably for
> > > > > a good but non-obvious reason). Would it be feasible to
> > > > > implement proper direct IO?
> > > >
> > > > I don't see why not. In many ways direct writes to zoned
> > > > devices
> > > > should be easier than non-direct writes.
> > > >
> > > > Any comments from the maintainers why the direct I/O writes to
> > > > zoned
> > > > devices are disabled? I could not find anything helpful in the
> > > > comments
> > > > or commit logs.
> > >
> > > The direct IO wants to overwrite the data on the same block
> > > address,
> > > while
> > > the zoned device does not support it?
> >
> > Surely that is not the case with LFS mode, is it? Otherwise,
> > even
> > buffered overwrites would have an issue.
>
> A zoned device only supports LFS mode.
Yes, and that was exactly my point: with LFS mode, O_DIRECT write
should never overwrite anything. So I do not see why direct writes
should be handled as buffered writes with zoned devices. Am I missing
something here ?
>
On 03/24, Damien Le Moal wrote:
> On Thu, 2023-03-23 at 16:46 -0700, Jaegeuk Kim wrote:
> > On 03/23, Damien Le Moal wrote:
> > > On Thu, 2023-03-23 at 15:14 -0700, Jaegeuk Kim wrote:
> > > > On 03/20, Christoph Hellwig wrote:
> > > > > On Mon, Feb 20, 2023 at 01:20:04PM +0100, Hans Holmberg wrote:
> > > > > > A) Supporting proper direct writes for zoned block devices
> > > > > > would
> > > > > > be the best, but it is currently not supported (probably for
> > > > > > a good but non-obvious reason). Would it be feasible to
> > > > > > implement proper direct IO?
> > > > >
> > > > > I don't see why not. In many ways direct writes to zoned
> > > > > devices
> > > > > should be easier than non-direct writes.
> > > > >
> > > > > Any comments from the maintainers why the direct I/O writes to
> > > > > zoned
> > > > > devices are disabled? I could not find anything helpful in the
> > > > > comments
> > > > > or commit logs.
> > > >
> > > > The direct IO wants to overwrite the data on the same block
> > > > address,
> > > > while
> > > > the zoned device does not support it?
> > >
> > > Surely that is not the case with LFS mode, is it? Otherwise,
> > > even
> > > buffered overwrites would have an issue.
> >
> > A zoned device only supports LFS mode.
>
> Yes, and that was exactly my point: with LFS mode, O_DIRECT write
> should never overwrite anything. So I do not see why direct writes
> should be handled as buffered writes with zoned devices. Am I missing
> something here ?
That's the easiest way to serialize block allocation and submit_bio when users
are calling buffered writes and direct writes in parallel. :)
I just felt that if we can manage both of them in direct write path along with
buffered write path, we may be able to avoid memcpy.
>
> >
>
On Thu, Mar 23, 2023 at 05:46:37PM -0700, Jaegeuk Kim wrote:
> > Yes, and that was exactly my point: with LFS mode, O_DIRECT write
> > should never overwrite anything. So I do not see why direct writes
> > should be handled as buffered writes with zoned devices. Am I missing
> > something here ?
>
> That's the easiest way to serialize block allocation and submit_bio when users
> are calling buffered writes and direct writes in parallel. :)
> I just felt that if we can manage both of them in direct write path along with
> buffered write path, we may be able to avoid memcpy.
Yes. Note that right now f2fs doesn't really support proper O_DIRECT
for buffered I/O either, as non-overwrites require a feature similar
to unwritten extents, or a split of the allocation phase and the record
metadata phase. If we'd go for the second choice for f2fs, which is the
more elegant thing to do, you'll get the zoned direct I/O write support
almost for free.
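
To illustrate the second option, here is a toy userspace model of the
split between the allocation phase and the record-metadata phase.
Every type and helper below is invented for illustration and
corresponds to no actual f2fs code:

#include <stdio.h>

struct extent { long blkaddr; long len; };

static long next_free = 1000;	/* toy write pointer of the open zone */

/* phase 1: allocate blocks sequentially, but do not publish them */
static struct extent reserve_blocks(long len)
{
	struct extent e = { next_free, len };

	next_free += len;
	return e;
}

/* submit the direct IO against the reserved blocks (pretend success) */
static int submit_dio(struct extent e)
{
	printf("DIO write -> blocks %ld..%ld\n",
	       e.blkaddr, e.blkaddr + e.len - 1);
	return 0;
}

/* phase 2: only after IO completion, record the mapping in metadata */
static void record_metadata(struct extent e)
{
	printf("metadata: mapped %ld blocks at %ld\n", e.len, e.blkaddr);
}

int main(void)
{
	struct extent e = reserve_blocks(8);

	if (submit_dio(e) == 0)
		record_metadata(e);
	return 0;
}

Since blocks are only ever appended at the zone write pointer and the
mapping becomes visible only after completion, nothing is overwritten
and no page cache copy is needed.
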
On Sun, Mar 26, 2023 at 04:39:10PM -0700, [email protected] wrote:
> On Thu, Mar 23, 2023 at 05:46:37PM -0700, Jaegeuk Kim wrote:
> > > Yes, and that was exactly my point: with LFS mode, O_DIRECT write
> > > should never overwrite anything. So I do not see why direct writes
> > > should be handled as buffered writes with zoned devices. Am I missing
> > > something here ?
> >
> > That's the easiest way to serialize block allocation and submit_bio when users
> > are calling buffered writes and direct writes in parallel. :)
> > I just felt that if we can manage both of them in direct write path along with
> > buffered write path, we may be able to avoid memcpy.
>
> Yes. Note that right now f2fs doesn't really support proper O_DIRECT
> for buffered I/O either, as non-overwrites require a feature similar
> to unwritten extents, or a split of the allocation phase and the record
> metadata phase. If we'd go for the second choice for f2fs, which is the
> more elegant thing to do, you'll get the zoned direct I/O write support
> almost for free.
So, Jaegeuk, do you think supporting proper direct IO is the way to fix this
issue? That looks like a better solution to me (at least long term).
Until that is in place, do you want my fix (with your code
style fixes) rebased and resent?
Cheers,
Hans
On 06/05, Hans Holmberg wrote:
>
> On Sun, Mar 26, 2023 at 04:39:10PM -0700, [email protected] wrote:
> > On Thu, Mar 23, 2023 at 05:46:37PM -0700, Jaegeuk Kim wrote:
> > > > Yes, and that was exactly my point: with LFS mode, O_DIRECT write
> > > > should never overwrite anything. So I do not see why direct writes
> > > > should be handled as buffered writes with zoned devices. Am I missing
> > > > something here ?
> > >
> > > That's the easiest way to serialize block allocation and submit_bio when users
> > > are calling buffered writes and direct writes in parallel. :)
> > > I just felt that if we can manage both of them in direct write path along with
> > > buffered write path, we may be able to avoid memcpy.
> >
> > Yes. Note that right now f2fs doesn't really support proper O_DIRECT
> > for buffered I/O either, as non-overwrites require a feature similar
> > to unwritten extents, or a split of the allocation phase and the record
> > metadata phase. If we'd go for the second choice for f2fs, which is the
> > more elegant thing to do, you'll get the zoned direct I/O write support
> > almost for free.
>
> So, Jaegeuk, do you think supporting proper direct IO is the way to fix this
> issue? That looks like a better solution to me (at least long term).
>
> Until that is in place, do you want my fix (with your code
> style fixes) rebased and resent?
Yes, it's already landed in 6.4-rc1 of Linus' tree, and surely I have the topic
in my long-term plan.
Thanks,
>
> Cheers,
> Hans