2023-10-11 14:22:26

by Jan Kara

[permalink] [raw]
Subject: [PATCH] ext4: Properly sync file size update after O_SYNC direct IO

Gao Xiang has reported that on ext4 O_SYNC direct IO does not properly
sync file size update and thus if we crash at unfortunate moment, the
file can have smaller size although O_SYNC IO has reported successful
completion. The problem happens because update of on-disk inode size is
handled in ext4_dio_write_iter() *after* iomap_dio_rw() (and thus
dio_complete() in particular) has returned and generic_file_sync() gets
called by dio_complete(). Fix the problem by handling on-disk inode size
update directly in our ->end_io completion handler.

References: https://lore.kernel.org/all/[email protected]
Reported-by: Gao Xiang <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/file.c | 139 ++++++++++++++++++-------------------------------
1 file changed, 52 insertions(+), 87 deletions(-)

So finally I've hopefully got all the corner cases right ;) At least fstest
pass now.

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 1492b1ae21f4..d0711c1a9b06 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -306,80 +306,34 @@ static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
}

static ssize_t ext4_handle_inode_extension(struct inode *inode, loff_t offset,
- ssize_t written, size_t count)
+ ssize_t count)
{
handle_t *handle;
- bool truncate = false;
- u8 blkbits = inode->i_blkbits;
- ext4_lblk_t written_blk, end_blk;
- int ret;
-
- /*
- * Note that EXT4_I(inode)->i_disksize can get extended up to
- * inode->i_size while the I/O was running due to writeback of delalloc
- * blocks. But, the code in ext4_iomap_alloc() is careful to use
- * zeroed/unwritten extents if this is possible; thus we won't leave
- * uninitialized blocks in a file even if we didn't succeed in writing
- * as much as we intended.
- */
- WARN_ON_ONCE(i_size_read(inode) < EXT4_I(inode)->i_disksize);
- if (offset + count <= EXT4_I(inode)->i_disksize) {
- /*
- * We need to ensure that the inode is removed from the orphan
- * list if it has been added prematurely, due to writeback of
- * delalloc blocks.
- */
- if (!list_empty(&EXT4_I(inode)->i_orphan) && inode->i_nlink) {
- handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
-
- if (IS_ERR(handle)) {
- ext4_orphan_del(NULL, inode);
- return PTR_ERR(handle);
- }
-
- ext4_orphan_del(handle, inode);
- ext4_journal_stop(handle);
- }
-
- return written;
- }
-
- if (written < 0)
- goto truncate;

+ lockdep_assert_held_write(&inode->i_rwsem);
handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
- if (IS_ERR(handle)) {
- written = PTR_ERR(handle);
- goto truncate;
- }
+ if (IS_ERR(handle))
+ return PTR_ERR(handle);

- if (ext4_update_inode_size(inode, offset + written)) {
- ret = ext4_mark_inode_dirty(handle, inode);
+ if (ext4_update_inode_size(inode, offset + count)) {
+ int ret = ext4_mark_inode_dirty(handle, inode);
if (unlikely(ret)) {
- written = ret;
ext4_journal_stop(handle);
- goto truncate;
+ return ret;
}
}

- /*
- * We may need to truncate allocated but not written blocks beyond EOF.
- */
- written_blk = ALIGN(offset + written, 1 << blkbits);
- end_blk = ALIGN(offset + count, 1 << blkbits);
- if (written_blk < end_blk && ext4_can_truncate(inode))
- truncate = true;
-
- /*
- * Remove the inode from the orphan list if it has been extended and
- * everything went OK.
- */
- if (!truncate && inode->i_nlink)
+ if (inode->i_nlink)
ext4_orphan_del(handle, inode);
ext4_journal_stop(handle);

- if (truncate) {
-truncate:
+ return count;
+}
+
+static void ext4_inode_extension_cleanup(struct inode *inode, ssize_t count)
+{
+ lockdep_assert_held_write(&inode->i_rwsem);
+ if (count < 0) {
ext4_truncate_failed_write(inode);
/*
* If the truncate operation failed early, then the inode may
@@ -388,9 +342,28 @@ static ssize_t ext4_handle_inode_extension(struct inode *inode, loff_t offset,
*/
if (inode->i_nlink)
ext4_orphan_del(NULL, inode);
+ return;
}
+ /*
+ * If i_disksize got extended due to writeback of delalloc blocks while
+ * the DIO was running we could fail to cleanup the orphan list in
+ * ext4_handle_inode_extension(). Do it now.
+ */
+ if (!list_empty(&EXT4_I(inode)->i_orphan) && inode->i_nlink) {
+ handle_t *handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);

- return written;
+ if (IS_ERR(handle)) {
+ /*
+ * The write has successfully completed. Not much to
+ * do with the error here so just cleanup the orphan
+ * list and hope for the best.
+ */
+ ext4_orphan_del(NULL, inode);
+ return;
+ }
+ ext4_orphan_del(handle, inode);
+ ext4_journal_stop(handle);
+ }
}

static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size,
@@ -399,31 +372,22 @@ static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size,
loff_t pos = iocb->ki_pos;
struct inode *inode = file_inode(iocb->ki_filp);

+ if (!error && size && flags & IOMAP_DIO_END_UNWRITTEN)
+ error = ext4_convert_unwritten_extents(NULL, inode, pos, size);
if (error)
return error;
-
- if (size && flags & IOMAP_DIO_END_UNWRITTEN) {
- error = ext4_convert_unwritten_extents(NULL, inode, pos, size);
- if (error < 0)
- return error;
- }
/*
- * If we are extending the file, we have to update i_size here before
- * page cache gets invalidated in iomap_dio_rw(). Otherwise racing
- * buffered reads could zero out too much from page cache pages. Update
- * of on-disk size will happen later in ext4_dio_write_iter() where
- * we have enough information to also perform orphan list handling etc.
- * Note that we perform all extending writes synchronously under
- * i_rwsem held exclusively so i_size update is safe here in that case.
- * If the write was not extending, we cannot see pos > i_size here
- * because operations reducing i_size like truncate wait for all
- * outstanding DIO before updating i_size.
+ * Note that EXT4_I(inode)->i_disksize can get extended up to
+ * inode->i_size while the I/O was running due to writeback of delalloc
+ * blocks. But the code in ext4_iomap_alloc() is careful to use
+ * zeroed/unwritten extents if this is possible; thus we won't leave
+ * uninitialized blocks in a file even if we didn't succeed in writing
+ * as much as we intended.
*/
- pos += size;
- if (pos > i_size_read(inode))
- i_size_write(inode, pos);
-
- return 0;
+ WARN_ON_ONCE(i_size_read(inode) < READ_ONCE(EXT4_I(inode)->i_disksize));
+ if (pos + size <= READ_ONCE(EXT4_I(inode)->i_disksize))
+ return 0;
+ return ext4_handle_inode_extension(inode, pos, size);
}

static const struct iomap_dio_ops ext4_dio_write_ops = {
@@ -606,9 +570,8 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
dio_flags, NULL, 0);
if (ret == -ENOTBLK)
ret = 0;
-
if (extend)
- ret = ext4_handle_inode_extension(inode, offset, ret, count);
+ ext4_inode_extension_cleanup(inode, ret);

out:
if (ilock_shared)
@@ -689,8 +652,10 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)

ret = dax_iomap_rw(iocb, from, &ext4_iomap_ops);

- if (extend)
- ret = ext4_handle_inode_extension(inode, offset, ret, count);
+ if (extend) {
+ ret = ext4_handle_inode_extension(inode, offset, ret);
+ ext4_inode_extension_cleanup(inode, ret);
+ }
out:
inode_unlock(inode);
if (ret > 0)
--
2.35.3


2023-10-12 02:21:46

by Gao Xiang

[permalink] [raw]
Subject: Re: [PATCH] ext4: Properly sync file size update after O_SYNC direct IO

Hi Dave,

On 2023/10/12 08:26, Dave Chinner wrote:
> On Wed, Oct 11, 2023 at 04:21:55PM +0200, Jan Kara wrote:
>> Gao Xiang has reported that on ext4 O_SYNC direct IO does not properly
>> sync file size update and thus if we crash at unfortunate moment, the
>> file can have smaller size although O_SYNC IO has reported successful
>> completion. The problem happens because update of on-disk inode size is
>> handled in ext4_dio_write_iter() *after* iomap_dio_rw() (and thus
>> dio_complete() in particular) has returned and generic_file_sync() gets
>> called by dio_complete(). Fix the problem by handling on-disk inode size
>> update directly in our ->end_io completion handler.
>>
>> References: https://lore.kernel.org/all/[email protected]
>> Reported-by: Gao Xiang <[email protected]>
>> Signed-off-by: Jan Kara <[email protected]>
>> ---
>> fs/ext4/file.c | 139 ++++++++++++++++++-------------------------------
>> 1 file changed, 52 insertions(+), 87 deletions(-)
> .....
>> @@ -388,9 +342,28 @@ static ssize_t ext4_handle_inode_extension(struct inode *inode, loff_t offset,
>> */
>> if (inode->i_nlink)
>> ext4_orphan_del(NULL, inode);
>> + return;
>> }
>> + /*
>> + * If i_disksize got extended due to writeback of delalloc blocks while
>> + * the DIO was running we could fail to cleanup the orphan list in
>> + * ext4_handle_inode_extension(). Do it now.
>> + */
>> + if (!list_empty(&EXT4_I(inode)->i_orphan) && inode->i_nlink) {
>> + handle_t *handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
>
> So this has to be called after the DIO write completes and calls
> ext4_handle_inode_extension()?
>
> ....
>
>> @@ -606,9 +570,8 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
>> dio_flags, NULL, 0);
>> if (ret == -ENOTBLK)
>> ret = 0;
>> -
>> if (extend)
>> - ret = ext4_handle_inode_extension(inode, offset, ret, count);
>> + ext4_inode_extension_cleanup(inode, ret);
>
> Because this doesn't wait for AIO DIO to complete and actually
> extend the file before running the cleanup code...

As far as I know, for ext4 AIO DIO extension cases,
IOMAP_DIO_FORCE_WAIT will be set, thus no async DIO here.

So the timing for this case will be strictly:
- ext4_handle_inode_extension() --- record i_disksize in .end_io

- generic_write_sync() --- forcely do fsync()

- ext4_inode_extension_cleanup() --- drop orphan in another transaction
as mentioned in [1]

Anyway, that is my current limited thoughts.

[1] https://lore.kernel.org/linux-ext4/20230920152005.7iowrlukd5zbvp43@quack3/

Thanks,
Gao Xiang

>
> Cheers,
>
> Dave.

2023-10-12 08:59:55

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH] ext4: Properly sync file size update after O_SYNC direct IO

On Thu 12-10-23 11:26:15, Dave Chinner wrote:
> On Wed, Oct 11, 2023 at 04:21:55PM +0200, Jan Kara wrote:
> > Gao Xiang has reported that on ext4 O_SYNC direct IO does not properly
> > sync file size update and thus if we crash at unfortunate moment, the
> > file can have smaller size although O_SYNC IO has reported successful
> > completion. The problem happens because update of on-disk inode size is
> > handled in ext4_dio_write_iter() *after* iomap_dio_rw() (and thus
> > dio_complete() in particular) has returned and generic_file_sync() gets
> > called by dio_complete(). Fix the problem by handling on-disk inode size
> > update directly in our ->end_io completion handler.
> >
> > References: https://lore.kernel.org/all/[email protected]
> > Reported-by: Gao Xiang <[email protected]>
> > Signed-off-by: Jan Kara <[email protected]>
> > ---
> > fs/ext4/file.c | 139 ++++++++++++++++++-------------------------------
> > 1 file changed, 52 insertions(+), 87 deletions(-)
> .....
> > @@ -388,9 +342,28 @@ static ssize_t ext4_handle_inode_extension(struct inode *inode, loff_t offset,
> > */
> > if (inode->i_nlink)
> > ext4_orphan_del(NULL, inode);
> > + return;
> > }
> > + /*
> > + * If i_disksize got extended due to writeback of delalloc blocks while
> > + * the DIO was running we could fail to cleanup the orphan list in
> > + * ext4_handle_inode_extension(). Do it now.
> > + */
> > + if (!list_empty(&EXT4_I(inode)->i_orphan) && inode->i_nlink) {
> > + handle_t *handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
>
> So this has to be called after the DIO write completes and calls
> ext4_handle_inode_extension()?

Yes, if the write was setup as extending one ('extend' is set to true in
ext4_dio_write_iter()).

> > @@ -606,9 +570,8 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
> > dio_flags, NULL, 0);
> > if (ret == -ENOTBLK)
> > ret = 0;
> > -
> > if (extend)
> > - ret = ext4_handle_inode_extension(inode, offset, ret, count);
> > + ext4_inode_extension_cleanup(inode, ret);
>
> Because this doesn't wait for AIO DIO to complete and actually
> extend the file before running the cleanup code...

As Gao wrote, ext4 sets IOMAP_DIO_FORCE_WAIT if 'extend' is set (see
ext4_dio_write_checks()) so if we get to calling
ext4_inode_extension_cleanup() we are guaranteed the IO has already
completed.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2023-10-12 15:26:33

by Ritesh Harjani (IBM)

[permalink] [raw]
Subject: Re: [PATCH] ext4: Properly sync file size update after O_SYNC direct IO

Jan Kara <[email protected]> writes:

> Gao Xiang has reported that on ext4 O_SYNC direct IO does not properly
> sync file size update and thus if we crash at unfortunate moment, the
> file can have smaller size although O_SYNC IO has reported successful
> completion. The problem happens because update of on-disk inode size is
> handled in ext4_dio_write_iter() *after* iomap_dio_rw() (and thus
> dio_complete() in particular) has returned and generic_file_sync() gets
> called by dio_complete(). Fix the problem by handling on-disk inode size
> update directly in our ->end_io completion handler.
>
> References: https://lore.kernel.org/all/[email protected]
> Reported-by: Gao Xiang <[email protected]>
> Signed-off-by: Jan Kara <[email protected]>

Guess we need a fixes tag.

> ---
> fs/ext4/file.c | 139 ++++++++++++++++++-------------------------------
> 1 file changed, 52 insertions(+), 87 deletions(-)
>
> So finally I've hopefully got all the corner cases right ;) At least fstest
> pass now.
>
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 1492b1ae21f4..d0711c1a9b06 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -306,80 +306,34 @@ static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
> }
>
> static ssize_t ext4_handle_inode_extension(struct inode *inode, loff_t offset,
> - ssize_t written, size_t count)
> + ssize_t count)
> {
> handle_t *handle;
> - bool truncate = false;
> - u8 blkbits = inode->i_blkbits;
> - ext4_lblk_t written_blk, end_blk;
> - int ret;
> -
> - /*
> - * Note that EXT4_I(inode)->i_disksize can get extended up to
> - * inode->i_size while the I/O was running due to writeback of delalloc
> - * blocks. But, the code in ext4_iomap_alloc() is careful to use
> - * zeroed/unwritten extents if this is possible; thus we won't leave
> - * uninitialized blocks in a file even if we didn't succeed in writing
> - * as much as we intended.
> - */
> - WARN_ON_ONCE(i_size_read(inode) < EXT4_I(inode)->i_disksize);
> - if (offset + count <= EXT4_I(inode)->i_disksize) {
> - /*
> - * We need to ensure that the inode is removed from the orphan
> - * list if it has been added prematurely, due to writeback of
> - * delalloc blocks.
> - */
> - if (!list_empty(&EXT4_I(inode)->i_orphan) && inode->i_nlink) {
> - handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
> -
> - if (IS_ERR(handle)) {
> - ext4_orphan_del(NULL, inode);
> - return PTR_ERR(handle);
> - }
> -
> - ext4_orphan_del(handle, inode);
> - ext4_journal_stop(handle);
> - }
> -
> - return written;
> - }
> -
> - if (written < 0)
> - goto truncate;
>
> + lockdep_assert_held_write(&inode->i_rwsem);
> handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
> - if (IS_ERR(handle)) {
> - written = PTR_ERR(handle);
> - goto truncate;
> - }
> + if (IS_ERR(handle))
> + return PTR_ERR(handle);
>
> - if (ext4_update_inode_size(inode, offset + written)) {
> - ret = ext4_mark_inode_dirty(handle, inode);
> + if (ext4_update_inode_size(inode, offset + count)) {
> + int ret = ext4_mark_inode_dirty(handle, inode);
> if (unlikely(ret)) {
> - written = ret;
> ext4_journal_stop(handle);
> - goto truncate;
> + return ret;
> }
> }
>
> - /*
> - * We may need to truncate allocated but not written blocks beyond EOF.
> - */
> - written_blk = ALIGN(offset + written, 1 << blkbits);
> - end_blk = ALIGN(offset + count, 1 << blkbits);
> - if (written_blk < end_blk && ext4_can_truncate(inode))
> - truncate = true;
> -
> - /*
> - * Remove the inode from the orphan list if it has been extended and
> - * everything went OK.
> - */
> - if (!truncate && inode->i_nlink)
> + if (inode->i_nlink)
> ext4_orphan_del(handle, inode);
> ext4_journal_stop(handle);
>
> - if (truncate) {
> -truncate:
> + return count;
> +}
> +
> +static void ext4_inode_extension_cleanup(struct inode *inode, ssize_t count)
> +{
> + lockdep_assert_held_write(&inode->i_rwsem);
> + if (count < 0) {
> ext4_truncate_failed_write(inode);
> /*
> * If the truncate operation failed early, then the inode may
> @@ -388,9 +342,28 @@ static ssize_t ext4_handle_inode_extension(struct inode *inode, loff_t offset,
> */
> if (inode->i_nlink)
> ext4_orphan_del(NULL, inode);
> + return;
> }
> + /*
> + * If i_disksize got extended due to writeback of delalloc blocks while
> + * the DIO was running we could fail to cleanup the orphan list in
> + * ext4_handle_inode_extension(). Do it now.
> + */
> + if (!list_empty(&EXT4_I(inode)->i_orphan) && inode->i_nlink) {
> + handle_t *handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
>
> - return written;
> + if (IS_ERR(handle)) {
> + /*
> + * The write has successfully completed. Not much to
> + * do with the error here so just cleanup the orphan
> + * list and hope for the best.
> + */
> + ext4_orphan_del(NULL, inode);
> + return;
> + }
> + ext4_orphan_del(handle, inode);
> + ext4_journal_stop(handle);
> + }
> }
>
> static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size,
> @@ -399,31 +372,22 @@ static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size,
> loff_t pos = iocb->ki_pos;
> struct inode *inode = file_inode(iocb->ki_filp);
>
> + if (!error && size && flags & IOMAP_DIO_END_UNWRITTEN)

Do we have IOMAP_DIO_END_UNWRITTEN? or should it be IOMAP_DIO_UNWRITTEN?
Also we don't need to check !error case if we return early in case of an error.

> + error = ext4_convert_unwritten_extents(NULL, inode, pos, size);
> if (error)
> return error;
> -
> - if (size && flags & IOMAP_DIO_END_UNWRITTEN) {

ditto.

> - error = ext4_convert_unwritten_extents(NULL, inode, pos, size);
> - if (error < 0)
> - return error;
> - }
> /*
> - * If we are extending the file, we have to update i_size here before
> - * page cache gets invalidated in iomap_dio_rw(). Otherwise racing
> - * buffered reads could zero out too much from page cache pages. Update
> - * of on-disk size will happen later in ext4_dio_write_iter() where
> - * we have enough information to also perform orphan list handling etc.
> - * Note that we perform all extending writes synchronously under
> - * i_rwsem held exclusively so i_size update is safe here in that case.
> - * If the write was not extending, we cannot see pos > i_size here
> - * because operations reducing i_size like truncate wait for all
> - * outstanding DIO before updating i_size.
> + * Note that EXT4_I(inode)->i_disksize can get extended up to
> + * inode->i_size while the I/O was running due to writeback of delalloc
> + * blocks. But the code in ext4_iomap_alloc() is careful to use
> + * zeroed/unwritten extents if this is possible; thus we won't leave
> + * uninitialized blocks in a file even if we didn't succeed in writing
> + * as much as we intended.
> */
> - pos += size;
> - if (pos > i_size_read(inode))
> - i_size_write(inode, pos);
> -
> - return 0;
> + WARN_ON_ONCE(i_size_read(inode) < READ_ONCE(EXT4_I(inode)->i_disksize));
> + if (pos + size <= READ_ONCE(EXT4_I(inode)->i_disksize))
> + return 0;
> + return ext4_handle_inode_extension(inode, pos, size);
> }

Although it is not a problem, but we are sometimes returning 0 and
sometimes count here.

>
> static const struct iomap_dio_ops ext4_dio_write_ops = {
> @@ -606,9 +570,8 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
> dio_flags, NULL, 0);
> if (ret == -ENOTBLK)
> ret = 0;
> -
> if (extend)
> - ret = ext4_handle_inode_extension(inode, offset, ret, count);
> + ext4_inode_extension_cleanup(inode, ret);
>
> out:
> if (ilock_shared)
> @@ -689,8 +652,10 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
>
> ret = dax_iomap_rw(iocb, from, &ext4_iomap_ops);
>
> - if (extend)
> - ret = ext4_handle_inode_extension(inode, offset, ret, count);
> + if (extend) {
> + ret = ext4_handle_inode_extension(inode, offset, ret);
> + ext4_inode_extension_cleanup(inode, ret);

ok, looks like we are using that return value here.

> + }
> out:
> inode_unlock(inode);
> if (ret > 0)
> --
> 2.35.3

Otherwise it looks good to me.

-ritesh

2023-10-12 16:17:51

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH] ext4: Properly sync file size update after O_SYNC direct IO

On Thu 12-10-23 20:55:29, Ritesh Harjani wrote:
> Jan Kara <[email protected]> writes:
>
> > Gao Xiang has reported that on ext4 O_SYNC direct IO does not properly
> > sync file size update and thus if we crash at unfortunate moment, the
> > file can have smaller size although O_SYNC IO has reported successful
> > completion. The problem happens because update of on-disk inode size is
> > handled in ext4_dio_write_iter() *after* iomap_dio_rw() (and thus
> > dio_complete() in particular) has returned and generic_file_sync() gets
> > called by dio_complete(). Fix the problem by handling on-disk inode size
> > update directly in our ->end_io completion handler.
> >
> > References: https://lore.kernel.org/all/[email protected]
> > Reported-by: Gao Xiang <[email protected]>
> > Signed-off-by: Jan Kara <[email protected]>
>
> Guess we need a fixes tag.

Good point, will add.

> > ---
> > fs/ext4/file.c | 139 ++++++++++++++++++-------------------------------
> > 1 file changed, 52 insertions(+), 87 deletions(-)
> >
> > So finally I've hopefully got all the corner cases right ;) At least fstest
> > pass now.
> >
> > diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> > index 1492b1ae21f4..d0711c1a9b06 100644
> > --- a/fs/ext4/file.c
> > +++ b/fs/ext4/file.c
> > @@ -306,80 +306,34 @@ static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
> > }
> >
> > static ssize_t ext4_handle_inode_extension(struct inode *inode, loff_t offset,
> > - ssize_t written, size_t count)
> > + ssize_t count)
> > {
> > handle_t *handle;
> > - bool truncate = false;
> > - u8 blkbits = inode->i_blkbits;
> > - ext4_lblk_t written_blk, end_blk;
> > - int ret;
> > -
> > - /*
> > - * Note that EXT4_I(inode)->i_disksize can get extended up to
> > - * inode->i_size while the I/O was running due to writeback of delalloc
> > - * blocks. But, the code in ext4_iomap_alloc() is careful to use
> > - * zeroed/unwritten extents if this is possible; thus we won't leave
> > - * uninitialized blocks in a file even if we didn't succeed in writing
> > - * as much as we intended.
> > - */
> > - WARN_ON_ONCE(i_size_read(inode) < EXT4_I(inode)->i_disksize);
> > - if (offset + count <= EXT4_I(inode)->i_disksize) {
> > - /*
> > - * We need to ensure that the inode is removed from the orphan
> > - * list if it has been added prematurely, due to writeback of
> > - * delalloc blocks.
> > - */
> > - if (!list_empty(&EXT4_I(inode)->i_orphan) && inode->i_nlink) {
> > - handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
> > -
> > - if (IS_ERR(handle)) {
> > - ext4_orphan_del(NULL, inode);
> > - return PTR_ERR(handle);
> > - }
> > -
> > - ext4_orphan_del(handle, inode);
> > - ext4_journal_stop(handle);
> > - }
> > -
> > - return written;
> > - }
> > -
> > - if (written < 0)
> > - goto truncate;
> >
> > + lockdep_assert_held_write(&inode->i_rwsem);
> > handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
> > - if (IS_ERR(handle)) {
> > - written = PTR_ERR(handle);
> > - goto truncate;
> > - }
> > + if (IS_ERR(handle))
> > + return PTR_ERR(handle);
> >
> > - if (ext4_update_inode_size(inode, offset + written)) {
> > - ret = ext4_mark_inode_dirty(handle, inode);
> > + if (ext4_update_inode_size(inode, offset + count)) {
> > + int ret = ext4_mark_inode_dirty(handle, inode);
> > if (unlikely(ret)) {
> > - written = ret;
> > ext4_journal_stop(handle);
> > - goto truncate;
> > + return ret;
> > }
> > }
> >
> > - /*
> > - * We may need to truncate allocated but not written blocks beyond EOF.
> > - */
> > - written_blk = ALIGN(offset + written, 1 << blkbits);
> > - end_blk = ALIGN(offset + count, 1 << blkbits);
> > - if (written_blk < end_blk && ext4_can_truncate(inode))
> > - truncate = true;
> > -
> > - /*
> > - * Remove the inode from the orphan list if it has been extended and
> > - * everything went OK.
> > - */
> > - if (!truncate && inode->i_nlink)
> > + if (inode->i_nlink)
> > ext4_orphan_del(handle, inode);
> > ext4_journal_stop(handle);
> >
> > - if (truncate) {
> > -truncate:
> > + return count;
> > +}
> > +
> > +static void ext4_inode_extension_cleanup(struct inode *inode, ssize_t count)
> > +{
> > + lockdep_assert_held_write(&inode->i_rwsem);
> > + if (count < 0) {
> > ext4_truncate_failed_write(inode);
> > /*
> > * If the truncate operation failed early, then the inode may
> > @@ -388,9 +342,28 @@ static ssize_t ext4_handle_inode_extension(struct inode *inode, loff_t offset,
> > */
> > if (inode->i_nlink)
> > ext4_orphan_del(NULL, inode);
> > + return;
> > }
> > + /*
> > + * If i_disksize got extended due to writeback of delalloc blocks while
> > + * the DIO was running we could fail to cleanup the orphan list in
> > + * ext4_handle_inode_extension(). Do it now.
> > + */
> > + if (!list_empty(&EXT4_I(inode)->i_orphan) && inode->i_nlink) {
> > + handle_t *handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
> >
> > - return written;
> > + if (IS_ERR(handle)) {
> > + /*
> > + * The write has successfully completed. Not much to
> > + * do with the error here so just cleanup the orphan
> > + * list and hope for the best.
> > + */
> > + ext4_orphan_del(NULL, inode);
> > + return;
> > + }
> > + ext4_orphan_del(handle, inode);
> > + ext4_journal_stop(handle);
> > + }
> > }
> >
> > static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size,
> > @@ -399,31 +372,22 @@ static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size,
> > loff_t pos = iocb->ki_pos;
> > struct inode *inode = file_inode(iocb->ki_filp);
> >
> > + if (!error && size && flags & IOMAP_DIO_END_UNWRITTEN)
>
> Do we have IOMAP_DIO_END_UNWRITTEN? or should it be IOMAP_DIO_UNWRITTEN?
> Also we don't need to check !error case if we return early in case of an error.

It should be IOMAP_DIO_UNWRITTEN. Unrelated iomap cleanup snuck under this
patch in my tree (because older versions of the patch needed it).

> > + error = ext4_convert_unwritten_extents(NULL, inode, pos, size);
> > if (error)
> > return error;
> > -
> > - if (size && flags & IOMAP_DIO_END_UNWRITTEN) {
>
> ditto.
>
> > - error = ext4_convert_unwritten_extents(NULL, inode, pos, size);
> > - if (error < 0)
> > - return error;
> > - }
> > /*
> > - * If we are extending the file, we have to update i_size here before
> > - * page cache gets invalidated in iomap_dio_rw(). Otherwise racing
> > - * buffered reads could zero out too much from page cache pages. Update
> > - * of on-disk size will happen later in ext4_dio_write_iter() where
> > - * we have enough information to also perform orphan list handling etc.
> > - * Note that we perform all extending writes synchronously under
> > - * i_rwsem held exclusively so i_size update is safe here in that case.
> > - * If the write was not extending, we cannot see pos > i_size here
> > - * because operations reducing i_size like truncate wait for all
> > - * outstanding DIO before updating i_size.
> > + * Note that EXT4_I(inode)->i_disksize can get extended up to
> > + * inode->i_size while the I/O was running due to writeback of delalloc
> > + * blocks. But the code in ext4_iomap_alloc() is careful to use
> > + * zeroed/unwritten extents if this is possible; thus we won't leave
> > + * uninitialized blocks in a file even if we didn't succeed in writing
> > + * as much as we intended.
> > */
> > - pos += size;
> > - if (pos > i_size_read(inode))
> > - i_size_write(inode, pos);
> > -
> > - return 0;
> > + WARN_ON_ONCE(i_size_read(inode) < READ_ONCE(EXT4_I(inode)->i_disksize));
> > + if (pos + size <= READ_ONCE(EXT4_I(inode)->i_disksize))
> > + return 0;
> > + return ext4_handle_inode_extension(inode, pos, size);
> > }
>
> Although it is not a problem, but we are sometimes returning 0 and
> sometimes count here.

Yeah. iomap_dio_complete() actually fixes this up but I agree it is
confusing a bit. Let's return 'size' here.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2023-10-13 10:14:45

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH] ext4: Properly sync file size update after O_SYNC direct IO

On Fri 13-10-23 10:33:08, Dave Chinner wrote:
> On Thu, Oct 12, 2023 at 10:59:37AM +0200, Jan Kara wrote:
> > On Thu 12-10-23 11:26:15, Dave Chinner wrote:
> > > On Wed, Oct 11, 2023 at 04:21:55PM +0200, Jan Kara wrote:
> > > > Gao Xiang has reported that on ext4 O_SYNC direct IO does not properly
> > > > sync file size update and thus if we crash at unfortunate moment, the
> > > > file can have smaller size although O_SYNC IO has reported successful
> > > > completion. The problem happens because update of on-disk inode size is
> > > > handled in ext4_dio_write_iter() *after* iomap_dio_rw() (and thus
> > > > dio_complete() in particular) has returned and generic_file_sync() gets
> > > > called by dio_complete(). Fix the problem by handling on-disk inode size
> > > > update directly in our ->end_io completion handler.
> > > >
> > > > References: https://lore.kernel.org/all/[email protected]
> > > > Reported-by: Gao Xiang <[email protected]>
> > > > Signed-off-by: Jan Kara <[email protected]>
> > > > ---
> > > > fs/ext4/file.c | 139 ++++++++++++++++++-------------------------------
> > > > 1 file changed, 52 insertions(+), 87 deletions(-)
> > > .....
> > > > @@ -388,9 +342,28 @@ static ssize_t ext4_handle_inode_extension(struct inode *inode, loff_t offset,
> > > > */
> > > > if (inode->i_nlink)
> > > > ext4_orphan_del(NULL, inode);
> > > > + return;
> > > > }
> > > > + /*
> > > > + * If i_disksize got extended due to writeback of delalloc blocks while
> > > > + * the DIO was running we could fail to cleanup the orphan list in
> > > > + * ext4_handle_inode_extension(). Do it now.
> > > > + */
> > > > + if (!list_empty(&EXT4_I(inode)->i_orphan) && inode->i_nlink) {
> > > > + handle_t *handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
> > >
> > > So this has to be called after the DIO write completes and calls
> > > ext4_handle_inode_extension()?
> >
> > Yes, if the write was setup as extending one ('extend' is set to true in
> > ext4_dio_write_iter()).
>
> Then that is worth a comment to document the constraint for anyone
> that is trying to understand how ext4 is using the iomap DIO code.

Fair enough, comment added.

> > > > @@ -606,9 +570,8 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
> > > > dio_flags, NULL, 0);
> > > > if (ret == -ENOTBLK)
> > > > ret = 0;
> > > > -
> > > > if (extend)
> > > > - ret = ext4_handle_inode_extension(inode, offset, ret, count);
> > > > + ext4_inode_extension_cleanup(inode, ret);
> > >
> > > Because this doesn't wait for AIO DIO to complete and actually
> > > extend the file before running the cleanup code...
> >
> > As Gao wrote, ext4 sets IOMAP_DIO_FORCE_WAIT if 'extend' is set (see
> > ext4_dio_write_checks()) so if we get to calling
> > ext4_inode_extension_cleanup() we are guaranteed the IO has already
> > completed.
>
> Ugh. That's a pretty nasty undocumented landmine. It definitely
> needs a comment (or better, a WARN_ON_ONCE()) to document that this
> code -only- works if AIO is disabled. This isn't for ext4
> developers, it's for people working on the iomap code to understand
> that ext4 has some really non-obvious constraints in it's DIO code
> paths and that's why the landmine is not being stepped on....

OK. Ext4 has always been this way so I never felt the need to document it
but you're right. I've added:

if (extend) {
/*
* We always perform extending DIO write synchronously so by
* now the IO is completed and ext4_handle_inode_extension()
* was called. Cleanup the inode in case of error or race with
* writeback of delalloc blocks.
*/
WARN_ON_ONCE(ret == -EIOCBQUEUED);
ext4_inode_extension_cleanup(inode, ret);
}

Thanks for the suggestions!

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR