2023-09-19 06:03:24

by Gao Xiang

[permalink] [raw]
Subject: [bug report] ext4 misses final i_size meta sync under O_DIRECT | O_SYNC semantics after iomap DIO conversion

Hi folks,

Our consumer reports a behavior change between pre-iomap and iomap
direct io conversion:

If the system crashes after an appending write to a file open with
O_DIRECT | O_SYNC flag set, file i_size won't be updated even if
O_SYNC was marked before.

It can be reproduced by a test program in the attachment with
gcc -o repro repro.c && ./repro testfile && echo c > /proc/sysrq-trigger

After some analysis, we found that before iomap direct I/O conversion,
the timing was roughly (taking Linux 3.10 codebase as an example):

..
- ext4_file_dio_write
- __generic_file_aio_write
..
- ext4_direct_IO # generic_file_direct_write
- ext4_ext_direct_IO
- ext4_ind_direct_IO # final_size > inode->i_size
- ..
- ret = blockdev_direct_IO()
- i_size_write(inode, end) # orphan && ret > 0 &&
# end > inode->i_size
- ext4_mark_inode_dirty()
- ...
- generic_write_sync # handling O_SYNC

So the dirty inode meta will be committed into journal immediately
if O_SYNC is set. However, After commit 569342dc2485 ("ext4: move
inode extension/truncate code out from ->iomap_end() callback"),
the new behavior seems as below:

..
- ext4_dio_write_iter
- ext4_dio_write_checks # extend = 1
- iomap_dio_rw
- __iomap_dio_rw
- iomap_dio_complete
- generic_write_sync
- ext4_handle_inode_extension # extend = 1

So that i_size will be recorded only after generic_write_sync() is
called. So O_SYNC won't flush the update i_size to the disk.

On the other side, after a quick look of XFS side, it will record
i_size changes in xfs_dio_write_end_io() so it seems that it doesn't
have this problem.

Thanks,
Gao Xiang


Attachments:
repro.c (1.25 kB)

2023-09-19 13:01:01

by Jan Kara

[permalink] [raw]
Subject: Re: [bug report] ext4 misses final i_size meta sync under O_DIRECT | O_SYNC semantics after iomap DIO conversion

Hello!

On Tue 19-09-23 14:00:04, Gao Xiang wrote:
> Our consumer reports a behavior change between pre-iomap and iomap
> direct io conversion:
>
> If the system crashes after an appending write to a file open with
> O_DIRECT | O_SYNC flag set, file i_size won't be updated even if
> O_SYNC was marked before.
>
> It can be reproduced by a test program in the attachment with
> gcc -o repro repro.c && ./repro testfile && echo c > /proc/sysrq-trigger
>
> After some analysis, we found that before iomap direct I/O conversion,
> the timing was roughly (taking Linux 3.10 codebase as an example):
>
> ..
> - ext4_file_dio_write
> - __generic_file_aio_write
> ..
> - ext4_direct_IO # generic_file_direct_write
> - ext4_ext_direct_IO
> - ext4_ind_direct_IO # final_size > inode->i_size
> - ..
> - ret = blockdev_direct_IO()
> - i_size_write(inode, end) # orphan && ret > 0 &&
> # end > inode->i_size
> - ext4_mark_inode_dirty()
> - ...
> - generic_write_sync # handling O_SYNC
>
> So the dirty inode meta will be committed into journal immediately
> if O_SYNC is set. However, After commit 569342dc2485 ("ext4: move
> inode extension/truncate code out from ->iomap_end() callback"),
> the new behavior seems as below:
>
> ..
> - ext4_dio_write_iter
> - ext4_dio_write_checks # extend = 1
> - iomap_dio_rw
> - __iomap_dio_rw
> - iomap_dio_complete
> - generic_write_sync
> - ext4_handle_inode_extension # extend = 1
>
> So that i_size will be recorded only after generic_write_sync() is
> called. So O_SYNC won't flush the update i_size to the disk.

Indeed, that looks like a bug. Thanks for report!

> On the other side, after a quick look of XFS side, it will record
> i_size changes in xfs_dio_write_end_io() so it seems that it doesn't
> have this problem.

Yes, I'm a bit hazy on the details but I think we've decided to call
ext4_handle_inode_extension() directly from ext4_dio_write_iter() because
from ext4_dio_write_end_io() it was difficult to test in a race-free way
whether extending i_size (and i_disksize) is needed or not (we don't
necessarily hold i_rwsem there). I'll think how we could fix the problem
you've reported.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2023-09-19 13:59:04

by Gao Xiang

[permalink] [raw]
Subject: Re: [bug report] ext4 misses final i_size meta sync under O_DIRECT | O_SYNC semantics after iomap DIO conversion


(sorry... add Darrick here...)

Hi Jan,

On 2023/9/19 20:05, Jan Kara wrote:
> Hello!
>
> On Tue 19-09-23 14:00:04, Gao Xiang wrote:
>> Our consumer reports a behavior change between pre-iomap and iomap
>> direct io conversion:
>>
>> If the system crashes after an appending write to a file open with
>> O_DIRECT | O_SYNC flag set, file i_size won't be updated even if
>> O_SYNC was marked before.
>>
>> It can be reproduced by a test program in the attachment with
>> gcc -o repro repro.c && ./repro testfile && echo c > /proc/sysrq-trigger
>>
>> After some analysis, we found that before iomap direct I/O conversion,
>> the timing was roughly (taking Linux 3.10 codebase as an example):
>>
>> ..
>> - ext4_file_dio_write
>> - __generic_file_aio_write
>> ..
>> - ext4_direct_IO # generic_file_direct_write
>> - ext4_ext_direct_IO
>> - ext4_ind_direct_IO # final_size > inode->i_size
>> - ..
>> - ret = blockdev_direct_IO()
>> - i_size_write(inode, end) # orphan && ret > 0 &&
>> # end > inode->i_size
>> - ext4_mark_inode_dirty()
>> - ...
>> - generic_write_sync # handling O_SYNC
>>
>> So the dirty inode meta will be committed into journal immediately
>> if O_SYNC is set. However, After commit 569342dc2485 ("ext4: move
>> inode extension/truncate code out from ->iomap_end() callback"),
>> the new behavior seems as below:
>>
>> ..
>> - ext4_dio_write_iter
>> - ext4_dio_write_checks # extend = 1
>> - iomap_dio_rw
>> - __iomap_dio_rw
>> - iomap_dio_complete
>> - generic_write_sync
>> - ext4_handle_inode_extension # extend = 1
>>
>> So that i_size will be recorded only after generic_write_sync() is
>> called. So O_SYNC won't flush the update i_size to the disk.
>
> Indeed, that looks like a bug. Thanks for report!

Thanks for the confirmation!

>
>> On the other side, after a quick look of XFS side, it will record
>> i_size changes in xfs_dio_write_end_io() so it seems that it doesn't
>> have this problem.
>
> Yes, I'm a bit hazy on the details but I think we've decided to call
> ext4_handle_inode_extension() directly from ext4_dio_write_iter() because
> from ext4_dio_write_end_io() it was difficult to test in a race-free way
> whether extending i_size (and i_disksize) is needed or not (we don't
> necessarily hold i_rwsem there). I'll think how we could fix the problem
> you've reported.

Yes, another concern is O_DSYNC, I'm quite not sure if the behavior
is changed too.

I had a rough feeling that currently iomap DIO behaviors on these are
too strict and might not fit in each specific fs detailed
implementation, tho.

Thanks,
Gao Xiang

>
> Honza

2023-09-20 07:45:02

by Gao Xiang

[permalink] [raw]
Subject: Re: [bug report] ext4 misses final i_size meta sync under O_DIRECT | O_SYNC semantics after iomap DIO conversion



On 2023/9/20 15:29, Dave Chinner wrote:
> On Tue, Sep 19, 2023 at 09:47:34PM +0800, Gao Xiang wrote:
>>
>> (sorry... add Darrick here...)
>>
>> Hi Jan,
>>
>> On 2023/9/19 20:05, Jan Kara wrote:
>>> Hello!
>>>
>>> On Tue 19-09-23 14:00:04, Gao Xiang wrote:
>>>> Our consumer reports a behavior change between pre-iomap and iomap
>>>> direct io conversion:
>>>>
>>>> If the system crashes after an appending write to a file open with
>>>> O_DIRECT | O_SYNC flag set, file i_size won't be updated even if
>>>> O_SYNC was marked before.
>>>>
>>>> It can be reproduced by a test program in the attachment with
>>>> gcc -o repro repro.c && ./repro testfile && echo c > /proc/sysrq-trigger
>>>>
>>>> After some analysis, we found that before iomap direct I/O conversion,
>>>> the timing was roughly (taking Linux 3.10 codebase as an example):
>>>>
>>>> ..
>>>> - ext4_file_dio_write
>>>> - __generic_file_aio_write
>>>> ..
>>>> - ext4_direct_IO # generic_file_direct_write
>>>> - ext4_ext_direct_IO
>>>> - ext4_ind_direct_IO # final_size > inode->i_size
>>>> - ..
>>>> - ret = blockdev_direct_IO()
>>>> - i_size_write(inode, end) # orphan && ret > 0 &&
>>>> # end > inode->i_size
>>>> - ext4_mark_inode_dirty()
>>>> - ...
>>>> - generic_write_sync # handling O_SYNC
>>>>
>>>> So the dirty inode meta will be committed into journal immediately
>>>> if O_SYNC is set. However, After commit 569342dc2485 ("ext4: move
>>>> inode extension/truncate code out from ->iomap_end() callback"),
>>>> the new behavior seems as below:
>>>>
>>>> ..
>>>> - ext4_dio_write_iter
>>>> - ext4_dio_write_checks # extend = 1
>>>> - iomap_dio_rw
>>>> - __iomap_dio_rw
>>>> - iomap_dio_complete
>>>> - generic_write_sync
>>>> - ext4_handle_inode_extension # extend = 1
>>>>
>>>> So that i_size will be recorded only after generic_write_sync() is
>>>> called. So O_SYNC won't flush the update i_size to the disk.
>>>
>>> Indeed, that looks like a bug. Thanks for report!
>>
>> Thanks for the confirmation!
>>
>>>
>>>> On the other side, after a quick look of XFS side, it will record
>>>> i_size changes in xfs_dio_write_end_io() so it seems that it doesn't
>>>> have this problem.
>>>
>>> Yes, I'm a bit hazy on the details but I think we've decided to call
>>> ext4_handle_inode_extension() directly from ext4_dio_write_iter() because
>>> from ext4_dio_write_end_io() it was difficult to test in a race-free way
>>> whether extending i_size (and i_disksize) is needed or not (we don't
>>> necessarily hold i_rwsem there). I'll think how we could fix the problem
>>> you've reported.
>
> Given that ext4 can run extent conversion in IO completion, it can
> run file extension in IO completion. Yes, that might require
> additional synchronisation of file size updates to co-ordinate
> submission and completion size checks. XFS just uses a spinlock for
> this....
>
>> Yes, another concern is O_DSYNC, I'm quite not sure if the behavior
>> is changed too.
>
> For O_DSYNC, the file size change needs to be covered by the
> call to generic_write_sync() as well. O_DSYNC should be thought of
> as being essentially the same as O_SYNC except for minor details.
>
>> I had a rough feeling that currently iomap DIO behaviors on these are
>> too strict and might not fit in each specific fs detailed
>> implementation, tho.
>
>
> In what way? iomap implements exactly the data integrity semantics
> that are required for O_DSYNC and O_SYNC writes, and it requires
> filesystem end_io method to finalize all metadata modifications
> needed for data integrity purposes.
>
> Keep in mind that iomap is designed around the requirements async IO
> (AIO and io_uring) place on individual IOs: there is no waiting
> context to "finish" the IO before userspace is signalled that it is
> complete. Hence everything related to data integrity needs to be
> done by the filesystem in ->end_io before the iomap completion runs
> generic_write_sync() and signals IO completion....

Yes, I understand iomap is well-suited in this model. Anyway, it's
somewhat out of scope on my side.

Thanks,
Gao Xiang

>
> Cheers,
>
> Dave.

2023-09-20 10:18:08

by Dave Chinner

[permalink] [raw]
Subject: Re: [bug report] ext4 misses final i_size meta sync under O_DIRECT | O_SYNC semantics after iomap DIO conversion

On Tue, Sep 19, 2023 at 09:47:34PM +0800, Gao Xiang wrote:
>
> (sorry... add Darrick here...)
>
> Hi Jan,
>
> On 2023/9/19 20:05, Jan Kara wrote:
> > Hello!
> >
> > On Tue 19-09-23 14:00:04, Gao Xiang wrote:
> > > Our consumer reports a behavior change between pre-iomap and iomap
> > > direct io conversion:
> > >
> > > If the system crashes after an appending write to a file open with
> > > O_DIRECT | O_SYNC flag set, file i_size won't be updated even if
> > > O_SYNC was marked before.
> > >
> > > It can be reproduced by a test program in the attachment with
> > > gcc -o repro repro.c && ./repro testfile && echo c > /proc/sysrq-trigger
> > >
> > > After some analysis, we found that before iomap direct I/O conversion,
> > > the timing was roughly (taking Linux 3.10 codebase as an example):
> > >
> > > ..
> > > - ext4_file_dio_write
> > > - __generic_file_aio_write
> > > ..
> > > - ext4_direct_IO # generic_file_direct_write
> > > - ext4_ext_direct_IO
> > > - ext4_ind_direct_IO # final_size > inode->i_size
> > > - ..
> > > - ret = blockdev_direct_IO()
> > > - i_size_write(inode, end) # orphan && ret > 0 &&
> > > # end > inode->i_size
> > > - ext4_mark_inode_dirty()
> > > - ...
> > > - generic_write_sync # handling O_SYNC
> > >
> > > So the dirty inode meta will be committed into journal immediately
> > > if O_SYNC is set. However, After commit 569342dc2485 ("ext4: move
> > > inode extension/truncate code out from ->iomap_end() callback"),
> > > the new behavior seems as below:
> > >
> > > ..
> > > - ext4_dio_write_iter
> > > - ext4_dio_write_checks # extend = 1
> > > - iomap_dio_rw
> > > - __iomap_dio_rw
> > > - iomap_dio_complete
> > > - generic_write_sync
> > > - ext4_handle_inode_extension # extend = 1
> > >
> > > So that i_size will be recorded only after generic_write_sync() is
> > > called. So O_SYNC won't flush the update i_size to the disk.
> >
> > Indeed, that looks like a bug. Thanks for report!
>
> Thanks for the confirmation!
>
> >
> > > On the other side, after a quick look of XFS side, it will record
> > > i_size changes in xfs_dio_write_end_io() so it seems that it doesn't
> > > have this problem.
> >
> > Yes, I'm a bit hazy on the details but I think we've decided to call
> > ext4_handle_inode_extension() directly from ext4_dio_write_iter() because
> > from ext4_dio_write_end_io() it was difficult to test in a race-free way
> > whether extending i_size (and i_disksize) is needed or not (we don't
> > necessarily hold i_rwsem there). I'll think how we could fix the problem
> > you've reported.

Given that ext4 can run extent conversion in IO completion, it can
run file extension in IO completion. Yes, that might require
additional synchronisation of file size updates to co-ordinate
submission and completion size checks. XFS just uses a spinlock for
this....

> Yes, another concern is O_DSYNC, I'm quite not sure if the behavior
> is changed too.

For O_DSYNC, the file size change needs to be covered by the
call to generic_write_sync() as well. O_DSYNC should be thought of
as being essentially the same as O_SYNC except for minor details.

> I had a rough feeling that currently iomap DIO behaviors on these are
> too strict and might not fit in each specific fs detailed
> implementation, tho.


In what way? iomap implements exactly the data integrity semantics
that are required for O_DSYNC and O_SYNC writes, and it requires
filesystem end_io method to finalize all metadata modifications
needed for data integrity purposes.

Keep in mind that iomap is designed around the requirements async IO
(AIO and io_uring) place on individual IOs: there is no waiting
context to "finish" the IO before userspace is signalled that it is
complete. Hence everything related to data integrity needs to be
done by the filesystem in ->end_io before the iomap completion runs
generic_write_sync() and signals IO completion....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2023-09-20 16:41:27

by Ritesh Harjani (IBM)

[permalink] [raw]
Subject: Re: [bug report] ext4 misses final i_size meta sync under O_DIRECT | O_SYNC semantics after iomap DIO conversion

Jan Kara <[email protected]> writes:

> Hello!
>
> On Tue 19-09-23 14:00:04, Gao Xiang wrote:
>> Our consumer reports a behavior change between pre-iomap and iomap
>> direct io conversion:
>>
>> If the system crashes after an appending write to a file open with
>> O_DIRECT | O_SYNC flag set, file i_size won't be updated even if
>> O_SYNC was marked before.
>>
>> It can be reproduced by a test program in the attachment with
>> gcc -o repro repro.c && ./repro testfile && echo c > /proc/sysrq-trigger
>>
>> After some analysis, we found that before iomap direct I/O conversion,
>> the timing was roughly (taking Linux 3.10 codebase as an example):
>>
>> ..
>> - ext4_file_dio_write
>> - __generic_file_aio_write
>> ..
>> - ext4_direct_IO # generic_file_direct_write
>> - ext4_ext_direct_IO
>> - ext4_ind_direct_IO # final_size > inode->i_size
>> - ..
>> - ret = blockdev_direct_IO()
>> - i_size_write(inode, end) # orphan && ret > 0 &&
>> # end > inode->i_size
>> - ext4_mark_inode_dirty()
>> - ...
>> - generic_write_sync # handling O_SYNC
>>
>> So the dirty inode meta will be committed into journal immediately
>> if O_SYNC is set. However, After commit 569342dc2485 ("ext4: move
>> inode extension/truncate code out from ->iomap_end() callback"),
>> the new behavior seems as below:
>>
>> ..
>> - ext4_dio_write_iter
>> - ext4_dio_write_checks # extend = 1
>> - iomap_dio_rw
>> - __iomap_dio_rw
>> - iomap_dio_complete
>> - generic_write_sync
>> - ext4_handle_inode_extension # extend = 1

Yes, since ext4_handle_inode_extension() will handle inode i_disksize
update and mark the inode dirty, generic_write_sync() call should happen
after that.

That also means then we don't have any generic FS testcase which can validate
this behaviour.

>>
>> So that i_size will be recorded only after generic_write_sync() is
>> called. So O_SYNC won't flush the update i_size to the disk.
>
> Indeed, that looks like a bug. Thanks for report!
>
>> On the other side, after a quick look of XFS side, it will record
>> i_size changes in xfs_dio_write_end_io() so it seems that it doesn't
>> have this problem.
>
> Yes, I'm a bit hazy on the details but I think we've decided to call
> ext4_handle_inode_extension() directly from ext4_dio_write_iter() because
> from ext4_dio_write_end_io() it was difficult to test in a race-free way
> whether extending i_size (and i_disksize) is needed or not (we don't
> necessarily hold i_rwsem there).

We do hold i_rwsem in exclusive write mode for file extend case.
(ext4_dio_write_checks()).

IIUC, ext4_handle_inode_extension() takes "written" and "count" as it's
argument. This means that "count" bytes were mapped, but only "written"
bytes were written. This information is used in
ext4_handle_inode_extension() case for truncating blocks beyond EOF.

I also found this discussion here [1].

[1]:
https://lore.kernel.org/linux-ext4/[email protected]/

From this thread it looks like we decided to move
ext4_handle_inode_extension() case out of ->end_io callback after v4
series (in v5) to handle above case. Right?

Do let me know if you think it was due to something else.

> I'll think how we could fix the problem you've reported.
>

1. I was thinking why do we need to truncate those blocks which are beyond
EOF for DIO case? Wasn't there an argument, that for short DIO writes,
we can use the remaining blocks allocated to be written by buffered-io,
right? Do we risk exposing anything in doing so?
We do fallback to buffered-io for short writes in ext4_dio_write_iter().

2. Either ways let's say we still would like to call truncate. Then can we
move ext4_truncate() logic out of ext4_handle_inode_extension() and call
ext4_handle_inode_extension() from within ->end_io callback.
ext4_truncate() can be then done in the main routine directly i.e. in
ext4_dio_write_iter() where we do have both "count" and "written" information.

Thoughts?

-ritesh

> Honza
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR

2023-09-20 16:56:47

by Jan Kara

[permalink] [raw]
Subject: Re: [bug report] ext4 misses final i_size meta sync under O_DIRECT | O_SYNC semantics after iomap DIO conversion

On Wed 20-09-23 17:08:19, Ritesh Harjani wrote:
> Jan Kara <[email protected]> writes:
>
> > Hello!
> >
> > On Tue 19-09-23 14:00:04, Gao Xiang wrote:
> >> Our consumer reports a behavior change between pre-iomap and iomap
> >> direct io conversion:
> >>
> >> If the system crashes after an appending write to a file open with
> >> O_DIRECT | O_SYNC flag set, file i_size won't be updated even if
> >> O_SYNC was marked before.
> >>
> >> It can be reproduced by a test program in the attachment with
> >> gcc -o repro repro.c && ./repro testfile && echo c > /proc/sysrq-trigger
> >>
> >> After some analysis, we found that before iomap direct I/O conversion,
> >> the timing was roughly (taking Linux 3.10 codebase as an example):
> >>
> >> ..
> >> - ext4_file_dio_write
> >> - __generic_file_aio_write
> >> ..
> >> - ext4_direct_IO # generic_file_direct_write
> >> - ext4_ext_direct_IO
> >> - ext4_ind_direct_IO # final_size > inode->i_size
> >> - ..
> >> - ret = blockdev_direct_IO()
> >> - i_size_write(inode, end) # orphan && ret > 0 &&
> >> # end > inode->i_size
> >> - ext4_mark_inode_dirty()
> >> - ...
> >> - generic_write_sync # handling O_SYNC
> >>
> >> So the dirty inode meta will be committed into journal immediately
> >> if O_SYNC is set. However, After commit 569342dc2485 ("ext4: move
> >> inode extension/truncate code out from ->iomap_end() callback"),
> >> the new behavior seems as below:
> >>
> >> ..
> >> - ext4_dio_write_iter
> >> - ext4_dio_write_checks # extend = 1
> >> - iomap_dio_rw
> >> - __iomap_dio_rw
> >> - iomap_dio_complete
> >> - generic_write_sync
> >> - ext4_handle_inode_extension # extend = 1
>
> Yes, since ext4_handle_inode_extension() will handle inode i_disksize
> update and mark the inode dirty, generic_write_sync() call should happen
> after that.
>
> That also means then we don't have any generic FS testcase which can
> validate this behaviour.

Yeah.

> >> So that i_size will be recorded only after generic_write_sync() is
> >> called. So O_SYNC won't flush the update i_size to the disk.
> >
> > Indeed, that looks like a bug. Thanks for report!
> >
> >> On the other side, after a quick look of XFS side, it will record
> >> i_size changes in xfs_dio_write_end_io() so it seems that it doesn't
> >> have this problem.
> >
> > Yes, I'm a bit hazy on the details but I think we've decided to call
> > ext4_handle_inode_extension() directly from ext4_dio_write_iter() because
> > from ext4_dio_write_end_io() it was difficult to test in a race-free way
> > whether extending i_size (and i_disksize) is needed or not (we don't
> > necessarily hold i_rwsem there).
>
> We do hold i_rwsem in exclusive write mode for file extend case.
> (ext4_dio_write_checks()).

Yes, the case I'm a bit concerned about is that for AIO overwrites we don't
hold i_rwsem at all in iomap_dio_complete() but we will still be performing
checks for file extension so we have to be careful they cannot have false
positives (or some other races) in the unlocked case.

> IIUC, ext4_handle_inode_extension() takes "written" and "count" as it's
> argument. This means that "count" bytes were mapped, but only "written"
> bytes were written. This information is used in
> ext4_handle_inode_extension() case for truncating blocks beyond EOF.
>
> I also found this discussion here [1].
>
> [1]:
> https://lore.kernel.org/linux-ext4/[email protected]/

Yeah, thanks for finding this.

> From this thread it looks like we decided to move
> ext4_handle_inode_extension() case out of ->end_io callback after v4
> series (in v5) to handle above case. Right?

Yes, the lack of original IO length in the ->end_io callback was the final
problem that made us move the ext4_handle_inode_extension() call out of
->end_io callback.

> > I'll think how we could fix the problem you've reported.
> >
>
> 1. I was thinking why do we need to truncate those blocks which are beyond
> EOF for DIO case? Wasn't there an argument, that for short DIO writes,
> we can use the remaining blocks allocated to be written by buffered-io,
> right? Do we risk exposing anything in doing so?
> We do fallback to buffered-io for short writes in ext4_dio_write_iter().

Yes, but as I mentioned in the thread you've referenced if we crash at
unfortunate moment, we will have inodes with blocks beyond EOF which is
not nice as we are "leaking" blocks.

> 2. Either ways let's say we still would like to call truncate. Then can we
> move ext4_truncate() logic out of ext4_handle_inode_extension() and call
> ext4_handle_inode_extension() from within ->end_io callback.
> ext4_truncate() can be then done in the main routine directly i.e. in
> ext4_dio_write_iter() where we do have both "count" and "written" information.

The truncate itself is not a big deal, as you say that can happen later.
The real question for which we need both "count" and "written" is whether
we can remove the inode from the orphan list in ->end_io callback or not.
For the common case of successful write, we want to do the removal from the
orphan list in the same transaction as the i_disksize update for
performance reasons. So that's why we have to do the decision about
truncation at the place where we update i_disksize.

But I think it shouldn't be a big deal to actually propagate the original
IO size to iomap_dio_end() and ->end_io callback. Let's try that.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2023-09-22 13:34:36

by Ritesh Harjani (IBM)

[permalink] [raw]
Subject: Re: [bug report] ext4 misses final i_size meta sync under O_DIRECT | O_SYNC semantics after iomap DIO conversion

Jan Kara <[email protected]> writes:

> On Wed 20-09-23 17:08:19, Ritesh Harjani wrote:
>> Jan Kara <[email protected]> writes:
>>
>> > Hello!
>> >
>> > On Tue 19-09-23 14:00:04, Gao Xiang wrote:
>> >> Our consumer reports a behavior change between pre-iomap and iomap
>> >> direct io conversion:
>> >>
>> >> If the system crashes after an appending write to a file open with
>> >> O_DIRECT | O_SYNC flag set, file i_size won't be updated even if
>> >> O_SYNC was marked before.
>> >>
>> >> It can be reproduced by a test program in the attachment with
>> >> gcc -o repro repro.c && ./repro testfile && echo c > /proc/sysrq-trigger
>> >>
>> >> After some analysis, we found that before iomap direct I/O conversion,
>> >> the timing was roughly (taking Linux 3.10 codebase as an example):
>> >>
>> >> ..
>> >> - ext4_file_dio_write
>> >> - __generic_file_aio_write
>> >> ..
>> >> - ext4_direct_IO # generic_file_direct_write
>> >> - ext4_ext_direct_IO
>> >> - ext4_ind_direct_IO # final_size > inode->i_size
>> >> - ..
>> >> - ret = blockdev_direct_IO()
>> >> - i_size_write(inode, end) # orphan && ret > 0 &&
>> >> # end > inode->i_size
>> >> - ext4_mark_inode_dirty()
>> >> - ...
>> >> - generic_write_sync # handling O_SYNC
>> >>
>> >> So the dirty inode meta will be committed into journal immediately
>> >> if O_SYNC is set. However, After commit 569342dc2485 ("ext4: move
>> >> inode extension/truncate code out from ->iomap_end() callback"),
>> >> the new behavior seems as below:
>> >>
>> >> ..
>> >> - ext4_dio_write_iter
>> >> - ext4_dio_write_checks # extend = 1
>> >> - iomap_dio_rw
>> >> - __iomap_dio_rw
>> >> - iomap_dio_complete
>> >> - generic_write_sync
>> >> - ext4_handle_inode_extension # extend = 1
>>
>> Yes, since ext4_handle_inode_extension() will handle inode i_disksize
>> update and mark the inode dirty, generic_write_sync() call should happen
>> after that.
>>
>> That also means then we don't have any generic FS testcase which can
>> validate this behaviour.
>
> Yeah.
>

Ok. Let me then first send a fstest in response to this integrity
problem with directio and o_sync.

-ritesh

2023-09-22 15:47:16

by Ritesh Harjani (IBM)

[permalink] [raw]
Subject: [PATCH] generic: Add integrity tests with synchronous directio

This test covers data & metadata integrity check with directio with
o_sync flag and checks the file contents & size after sudden fileystem
shutdown once the directio write is completed. ext4 directio after iomap
conversion was broken in the sense that if the FS crashes after
synchronous directio write, it's file size is not properly updated.
This test adds a testcase to cover such scenario.

Man page of open says that -
O_SYNC provides synchronized I/O file integrity completion, meaning write
operations will flush data and all associated metadata to the underlying
hardware

Reported-by: Gao Xiang <[email protected]>
Signed-off-by: Ritesh Harjani (IBM) <[email protected]>
---
tests/generic/471 | 45 +++++++++++++++++++++++++++++++++++++++++++
tests/generic/471.out | 8 ++++++++
2 files changed, 53 insertions(+)
create mode 100755 tests/generic/471
create mode 100644 tests/generic/471.out

diff --git a/tests/generic/471 b/tests/generic/471
new file mode 100755
index 00000000..6c31cff8
--- /dev/null
+++ b/tests/generic/471
@@ -0,0 +1,45 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2023 IBM Corporation. All Rights Reserved.
+#
+# FS QA Test 471
+#
+# Integrity test with DIRECT_IO & O_SYNC with sudden shutdown
+#
+. ./common/preamble
+_begin_fstest auto quick shutdown
+
+# Override the default cleanup function.
+_cleanup()
+{
+ cd /
+ rm -r -f $tmp.*
+}
+
+# Import common functions.
+. ./common/filter
+
+# real QA test starts here
+
+# Modify as appropriate.
+_supported_fs generic
+_require_scratch
+_require_scratch_shutdown
+
+_scratch_mkfs > $seqres.full 2>&1
+_scratch_mount
+
+echo "Create a 1M file using O_DIRECT & O_SYNC"
+xfs_io -fsd -c "pwrite -S 0x5a 0 1M" $SCRATCH_MNT/testfile > /dev/null 2>&1
+
+echo "Shutdown the fs suddenly"
+_scratch_shutdown
+
+echo "Cycle mount"
+_scratch_cycle_mount
+
+echo "File contents after cycle mount"
+_hexdump $SCRATCH_MNT/testfile
+
+status=0
+exit
diff --git a/tests/generic/471.out b/tests/generic/471.out
new file mode 100644
index 00000000..ae279b79
--- /dev/null
+++ b/tests/generic/471.out
@@ -0,0 +1,8 @@
+QA output created by 471
+Create a 1M file using O_DIRECT & O_SYNC
+Shutdown the fs suddenly
+Cycle mount
+File contents after cycle mount
+000000 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a >ZZZZZZZZZZZZZZZZ<
+*
+100000
--
2.41.0

2023-09-22 18:46:53

by Zorro Lang

[permalink] [raw]
Subject: Re: [PATCH] generic: Add integrity tests with synchronous directio

On Fri, Sep 22, 2023 at 05:40:36PM +0530, Ritesh Harjani (IBM) wrote:
> This test covers data & metadata integrity check with directio with
> o_sync flag and checks the file contents & size after sudden fileystem
> shutdown once the directio write is completed. ext4 directio after iomap
> conversion was broken in the sense that if the FS crashes after
> synchronous directio write, it's file size is not properly updated.
> This test adds a testcase to cover such scenario.

Thanks for this patch, some review points as below.

Is there a bug ? If there is, please use _fixed_by_kernel_commit to point
out that.

>
> Man page of open says that -
> O_SYNC provides synchronized I/O file integrity completion, meaning write
> operations will flush data and all associated metadata to the underlying
> hardware
>
> Reported-by: Gao Xiang <[email protected]>
> Signed-off-by: Ritesh Harjani (IBM) <[email protected]>
> ---
> tests/generic/471 | 45 +++++++++++++++++++++++++++++++++++++++++++
> tests/generic/471.out | 8 ++++++++
> 2 files changed, 53 insertions(+)
> create mode 100755 tests/generic/471
> create mode 100644 tests/generic/471.out
>
> diff --git a/tests/generic/471 b/tests/generic/471
> new file mode 100755
> index 00000000..6c31cff8
> --- /dev/null
> +++ b/tests/generic/471
> @@ -0,0 +1,45 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2023 IBM Corporation. All Rights Reserved.
> +#
> +# FS QA Test 471
> +#
> +# Integrity test with DIRECT_IO & O_SYNC with sudden shutdown
> +#
> +. ./common/preamble
> +_begin_fstest auto quick shutdown
> +
> +# Override the default cleanup function.
> +_cleanup()
> +{
> + cd /
> + rm -r -f $tmp.*
> +}

This _cleanup looks same ith the default one, so you don't need to do
this "override", just remove this _cleanup and use the default one.

> +
> +# Import common functions.
> +. ./common/filter

If you don't need any filter helpers, feel free to remove this line.

> +
> +# real QA test starts here
> +
> +# Modify as appropriate.
^^^
If you'll send a v2, feel free to remove this comment line :)

> +_supported_fs generic
> +_require_scratch
> +_require_scratch_shutdown

_require_odirect ??

Or if you will add aio test in v2, please use _require_aiodio.
Also add "aio" test group (in the _begin_fstest line).

> +
> +_scratch_mkfs > $seqres.full 2>&1
> +_scratch_mount
> +
> +echo "Create a 1M file using O_DIRECT & O_SYNC"
> +xfs_io -fsd -c "pwrite -S 0x5a 0 1M" $SCRATCH_MNT/testfile > /dev/null 2>&1

$XFS_IO_PROG

Thanks,
Zorro

> +
> +echo "Shutdown the fs suddenly"
> +_scratch_shutdown
> +
> +echo "Cycle mount"
> +_scratch_cycle_mount
> +
> +echo "File contents after cycle mount"
> +_hexdump $SCRATCH_MNT/testfile
> +
> +status=0
> +exit
> diff --git a/tests/generic/471.out b/tests/generic/471.out
> new file mode 100644
> index 00000000..ae279b79
> --- /dev/null
> +++ b/tests/generic/471.out
> @@ -0,0 +1,8 @@
> +QA output created by 471
> +Create a 1M file using O_DIRECT & O_SYNC
> +Shutdown the fs suddenly
> +Cycle mount
> +File contents after cycle mount
> +000000 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a >ZZZZZZZZZZZZZZZZ<
> +*
> +100000
> --
> 2.41.0
>

2023-09-23 12:05:44

by Ritesh Harjani (IBM)

[permalink] [raw]
Subject: [PATCHv2 1/2] aio-dio-write-verify: Add sync and noverify option

This patch adds -S for O_SYNC and -N for noverify option to
aio-dio-write-verify test. We will use this for integrity
verification test for aio-dio.

Signed-off-by: Ritesh Harjani (IBM) <[email protected]>
---
src/aio-dio-regress/aio-dio-write-verify.c | 29 ++++++++++++++++------
1 file changed, 21 insertions(+), 8 deletions(-)

diff --git a/src/aio-dio-regress/aio-dio-write-verify.c b/src/aio-dio-regress/aio-dio-write-verify.c
index 302b8fe4..61519f6e 100644
--- a/src/aio-dio-regress/aio-dio-write-verify.c
+++ b/src/aio-dio-regress/aio-dio-write-verify.c
@@ -34,13 +34,16 @@

void usage(char *progname)
{
- fprintf(stderr, "usage: %s [-t truncsize ] <-a size=N,off=M [-a ...]> filename\n"
+ fprintf(stderr, "usage: %s [-t truncsize ] <-a size=N,off=M [-a ...]> [-S] [-N] filename\n"
"\t-t truncsize: truncate the file to a special size before AIO wirte\n"
"\t-a: specify once AIO write size and startoff, this option can be specified many times, but less than 128\n"
"\t\tsize=N: AIO write size\n"
"\t\toff=M: AIO write startoff\n"
- "e.g: %s -t 4608 -a size=4096,off=512 -a size=4096,off=4608 filename\n",
- progname, progname);
+ "\t-S: uses O_SYNC flag for open. By default O_SYNC is not used\n"
+ "\t-N: no_verify: means no write verification. By default noverify is false\n"
+ "e.g: %s -t 4608 -a size=4096,off=512 -a size=4096,off=4608 filename\n"
+ "e.g: %s -t 1048576 -a size=1048576 -S -N filename\n",
+ progname, progname, progname);
exit(1);
}

@@ -281,8 +284,10 @@ int main(int argc, char *argv[])
char *filename = NULL;
int num_events = 0;
off_t tsize = 0;
+ int o_sync = 0;
+ int no_verify = 0;

- while ((c = getopt(argc, argv, "a:t:")) != -1) {
+ while ((c = getopt(argc, argv, "a:t:SN")) != -1) {
char *endp;

switch (c) {
@@ -297,6 +302,12 @@ int main(int argc, char *argv[])
case 't':
tsize = strtoul(optarg, &endp, 0);
break;
+ case 'S':
+ o_sync = O_SYNC;
+ break;
+ case 'N':
+ no_verify = 1;
+ break;
default:
usage(argv[0]);
}
@@ -313,7 +324,7 @@ int main(int argc, char *argv[])
else
usage(argv[0]);

- fd = open(filename, O_DIRECT | O_CREAT | O_TRUNC | O_RDWR, 0600);
+ fd = open(filename, O_DIRECT | O_CREAT | O_TRUNC | O_RDWR | o_sync, 0600);
if (fd == -1) {
perror("open");
return 1;
@@ -331,9 +342,11 @@ int main(int argc, char *argv[])
return 1;
}

- if (io_verify(fd) != 0) {
- fprintf(stderr, "Data verification fails\n");
- return 1;
+ if (no_verify == 0) {
+ if (io_verify(fd) != 0) {
+ fprintf(stderr, "Data verification fails\n");
+ return 1;
+ }
}

close(fd);
--
2.41.0

2023-09-23 12:21:09

by Ritesh Harjani (IBM)

[permalink] [raw]
Subject: Re: [PATCH] generic: Add integrity tests with synchronous directio

Zorro Lang <[email protected]> writes:

> On Fri, Sep 22, 2023 at 05:40:36PM +0530, Ritesh Harjani (IBM) wrote:
>> This test covers data & metadata integrity check with directio with
>> o_sync flag and checks the file contents & size after sudden fileystem
>> shutdown once the directio write is completed. ext4 directio after iomap
>> conversion was broken in the sense that if the FS crashes after
>> synchronous directio write, it's file size is not properly updated.
>> This test adds a testcase to cover such scenario.
>
> Thanks for this patch, some review points as below.
>
> Is there a bug ? If there is, please use _fixed_by_kernel_commit to point
> out that.
>

It's still under discussion. So I am fine if you would like to wait
until the official fix is ready.

>>
>> Man page of open says that -
>> O_SYNC provides synchronized I/O file integrity completion, meaning write
>> operations will flush data and all associated metadata to the underlying
>> hardware
>>
>> Reported-by: Gao Xiang <[email protected]>
>> Signed-off-by: Ritesh Harjani (IBM) <[email protected]>
>> ---
>> tests/generic/471 | 45 +++++++++++++++++++++++++++++++++++++++++++
>> tests/generic/471.out | 8 ++++++++
>> 2 files changed, 53 insertions(+)
>> create mode 100755 tests/generic/471
>> create mode 100644 tests/generic/471.out
>>
>> diff --git a/tests/generic/471 b/tests/generic/471
>> new file mode 100755
>> index 00000000..6c31cff8
>> --- /dev/null
>> +++ b/tests/generic/471
>> @@ -0,0 +1,45 @@
>> +#! /bin/bash
>> +# SPDX-License-Identifier: GPL-2.0
>> +# Copyright (c) 2023 IBM Corporation. All Rights Reserved.
>> +#
>> +# FS QA Test 471
>> +#
>> +# Integrity test with DIRECT_IO & O_SYNC with sudden shutdown
>> +#
>> +. ./common/preamble
>> +_begin_fstest auto quick shutdown
>> +
>> +# Override the default cleanup function.
>> +_cleanup()
>> +{
>> + cd /
>> + rm -r -f $tmp.*
>> +}
>
> This _cleanup looks same ith the default one, so you don't need to do
> this "override", just remove this _cleanup and use the default one.
>

Ok, IIUC, we don't need to define _cleanup function, since
". ./common/preamble" does it for us.

>> +
>> +# Import common functions.
>> +. ./common/filter
>
> If you don't need any filter helpers, feel free to remove this line.
>

will remove it.

>> +
>> +# real QA test starts here
>> +
>> +# Modify as appropriate.
> ^^^
> If you'll send a v2, feel free to remove this comment line :)
>

Will remove.


>> +_supported_fs generic
>> +_require_scratch
>> +_require_scratch_shutdown
>
> _require_odirect ??
>
> Or if you will add aio test in v2, please use _require_aiodio.
> Also add "aio" test group (in the _begin_fstest line).
>

Sure. Thanks for pointing out.

>> +
>> +_scratch_mkfs > $seqres.full 2>&1
>> +_scratch_mount
>> +
>> +echo "Create a 1M file using O_DIRECT & O_SYNC"
>> +xfs_io -fsd -c "pwrite -S 0x5a 0 1M" $SCRATCH_MNT/testfile > /dev/null 2>&1
>
> $XFS_IO_PROG

done.
>
> Thanks,
> Zorro
>

Thanks for the review!
-ritesh

2023-09-23 12:42:26

by Ritesh Harjani (IBM)

[permalink] [raw]
Subject: [PATCHv2 2/2] generic: Add integrity tests with synchronous directio

This test covers data & metadata integrity check with directio with
o_sync flag and checks the file contents & size after sudden fileystem
shutdown once the directio write is completed. ext4 directio after iomap
conversion was broken in the sense that if the FS crashes after
synchronous directio write, it's file size is not properly updated.
This test adds a testcase to cover such scenario.

Man page of open says that -
O_SYNC provides synchronized I/O file integrity completion, meaning write
operations will flush data and all associated metadata to the underlying
hardware

Reported-by: Gao Xiang <[email protected]>
Signed-off-by: Ritesh Harjani (IBM) <[email protected]>
---
tests/generic/471 | 50 +++++++++++++++++++++++++++++++++++++++++++
tests/generic/471.out | 22 +++++++++++++++++++
2 files changed, 72 insertions(+)
create mode 100755 tests/generic/471
create mode 100644 tests/generic/471.out

diff --git a/tests/generic/471 b/tests/generic/471
new file mode 100755
index 00000000..218e6676
--- /dev/null
+++ b/tests/generic/471
@@ -0,0 +1,50 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2023 IBM Corporation. All Rights Reserved.
+#
+# FS QA Test 471
+#
+# Integrity test for O_SYNC with buff-io, dio, aio-dio with sudden shutdown
+#
+. ./common/preamble
+_begin_fstest auto quick shutdown aio
+
+# real QA test starts here
+_supported_fs generic
+_require_scratch
+_require_scratch_shutdown
+_require_odirect
+_require_aiodio aio-dio-write-verify
+
+_scratch_mkfs > $seqres.full 2>&1
+_scratch_mount
+
+echo "T-1: Create a 1M file using buff-io & O_SYNC"
+$XFS_IO_PROG -fs -c "pwrite -S 0x5a 0 1M" $SCRATCH_MNT/testfile.t1 > /dev/null 2>&1
+echo "T-1: Shutdown the fs suddenly"
+_scratch_shutdown
+echo "T-1: Cycle mount"
+_scratch_cycle_mount
+echo "T-1: File contents after cycle mount"
+_hexdump $SCRATCH_MNT/testfile.t1
+
+echo "T-2: Create a 1M file using O_DIRECT & O_SYNC"
+$XFS_IO_PROG -fsd -c "pwrite -S 0x5a 0 1M" $SCRATCH_MNT/testfile.t2 > /dev/null 2>&1
+echo "T-2: Shutdown the fs suddenly"
+_scratch_shutdown
+echo "T-2: Cycle mount"
+_scratch_cycle_mount
+echo "T-2: File contents after cycle mount"
+_hexdump $SCRATCH_MNT/testfile.t2
+
+echo "T-3: Create a 1M file using AIO-DIO & O_SYNC"
+$AIO_TEST -a size=1048576 -S -N $SCRATCH_MNT/testfile.t3 > /dev/null 2>&1
+echo "T-3: Shutdown the fs suddenly"
+_scratch_shutdown
+echo "T-3: Cycle mount"
+_scratch_cycle_mount
+echo "T-3: File contents after cycle mount"
+_hexdump $SCRATCH_MNT/testfile.t3
+
+status=0
+exit
diff --git a/tests/generic/471.out b/tests/generic/471.out
new file mode 100644
index 00000000..2bfb033d
--- /dev/null
+++ b/tests/generic/471.out
@@ -0,0 +1,22 @@
+QA output created by 471
+T-1: Create a 1M file using buff-io & O_SYNC
+T-1: Shutdown the fs suddenly
+T-1: Cycle mount
+T-1: File contents after cycle mount
+000000 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a >ZZZZZZZZZZZZZZZZ<
+*
+100000
+T-2: Create a 1M file using O_DIRECT & O_SYNC
+T-2: Shutdown the fs suddenly
+T-2: Cycle mount
+T-2: File contents after cycle mount
+000000 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a >ZZZZZZZZZZZZZZZZ<
+*
+100000
+T-3: Create a 1M file using AIO-DIO & O_SYNC
+T-3: Shutdown the fs suddenly
+T-3: Cycle mount
+T-3: File contents after cycle mount
+000000 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a >ZZZZZZZZZZZZZZZZ<
+*
+100000
--
2.41.0

2023-09-26 03:11:46

by Jan Kara

[permalink] [raw]
Subject: Re: [bug report] ext4 misses final i_size meta sync under O_DIRECT | O_SYNC semantics after iomap DIO conversion

On Fri 22-09-23 17:33:59, Ritesh Harjani wrote:
> Jan Kara <[email protected]> writes:
>
> > On Wed 20-09-23 17:08:19, Ritesh Harjani wrote:
> >> Jan Kara <[email protected]> writes:
> >>
> >> > Hello!
> >> >
> >> > On Tue 19-09-23 14:00:04, Gao Xiang wrote:
> >> >> Our consumer reports a behavior change between pre-iomap and iomap
> >> >> direct io conversion:
> >> >>
> >> >> If the system crashes after an appending write to a file open with
> >> >> O_DIRECT | O_SYNC flag set, file i_size won't be updated even if
> >> >> O_SYNC was marked before.
> >> >>
> >> >> It can be reproduced by a test program in the attachment with
> >> >> gcc -o repro repro.c && ./repro testfile && echo c > /proc/sysrq-trigger
> >> >>
> >> >> After some analysis, we found that before iomap direct I/O conversion,
> >> >> the timing was roughly (taking Linux 3.10 codebase as an example):
> >> >>
> >> >> ..
> >> >> - ext4_file_dio_write
> >> >> - __generic_file_aio_write
> >> >> ..
> >> >> - ext4_direct_IO # generic_file_direct_write
> >> >> - ext4_ext_direct_IO
> >> >> - ext4_ind_direct_IO # final_size > inode->i_size
> >> >> - ..
> >> >> - ret = blockdev_direct_IO()
> >> >> - i_size_write(inode, end) # orphan && ret > 0 &&
> >> >> # end > inode->i_size
> >> >> - ext4_mark_inode_dirty()
> >> >> - ...
> >> >> - generic_write_sync # handling O_SYNC
> >> >>
> >> >> So the dirty inode meta will be committed into journal immediately
> >> >> if O_SYNC is set. However, After commit 569342dc2485 ("ext4: move
> >> >> inode extension/truncate code out from ->iomap_end() callback"),
> >> >> the new behavior seems as below:
> >> >>
> >> >> ..
> >> >> - ext4_dio_write_iter
> >> >> - ext4_dio_write_checks # extend = 1
> >> >> - iomap_dio_rw
> >> >> - __iomap_dio_rw
> >> >> - iomap_dio_complete
> >> >> - generic_write_sync
> >> >> - ext4_handle_inode_extension # extend = 1
> >>
> >> Yes, since ext4_handle_inode_extension() will handle inode i_disksize
> >> update and mark the inode dirty, generic_write_sync() call should happen
> >> after that.
> >>
> >> That also means then we don't have any generic FS testcase which can
> >> validate this behaviour.
> >
> > Yeah.
> >
>
> Ok. Let me then first send a fstest in response to this integrity
> problem with directio and o_sync.

Thanks for working on the testcase! I have written a fix in the meantime
but so far it causes weird inconsistencies in fsstress ENOSPC hitters tests
so I'm still debugging that.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2023-09-28 05:24:20

by Ritesh Harjani (IBM)

[permalink] [raw]
Subject: Re: [PATCHv2 2/2] generic: Add integrity tests with synchronous directio

Zorro Lang <[email protected]> writes:

> On Sat, Sep 23, 2023 at 05:30:24PM +0530, Ritesh Harjani (IBM) wrote:
>> This test covers data & metadata integrity check with directio with
>> o_sync flag and checks the file contents & size after sudden fileystem
>> shutdown once the directio write is completed. ext4 directio after iomap
>> conversion was broken in the sense that if the FS crashes after
>> synchronous directio write, it's file size is not properly updated.
>> This test adds a testcase to cover such scenario.
>>
>> Man page of open says that -
>> O_SYNC provides synchronized I/O file integrity completion, meaning write
>> operations will flush data and all associated metadata to the underlying
>> hardware
>
>
>
>>
>> Reported-by: Gao Xiang <[email protected]>
>> Signed-off-by: Ritesh Harjani (IBM) <[email protected]>
>> ---
>> tests/generic/471 | 50 +++++++++++++++++++++++++++++++++++++++++++
>> tests/generic/471.out | 22 +++++++++++++++++++
>> 2 files changed, 72 insertions(+)
>> create mode 100755 tests/generic/471
>> create mode 100644 tests/generic/471.out
>>
>> diff --git a/tests/generic/471 b/tests/generic/471
>
> The generic/471 has been taken last week, you can choose another number.
> Or simply use generic/999, then I'll change the 999 to a proper number.
>
>> new file mode 100755
>> index 00000000..218e6676
>> --- /dev/null
>> +++ b/tests/generic/471
>> @@ -0,0 +1,50 @@
>> +#! /bin/bash
>> +# SPDX-License-Identifier: GPL-2.0
>> +# Copyright (c) 2023 IBM Corporation. All Rights Reserved.
>> +#
>> +# FS QA Test 471
>> +#
>> +# Integrity test for O_SYNC with buff-io, dio, aio-dio with sudden shutdown
>> +#
>> +. ./common/preamble
>> +_begin_fstest auto quick shutdown aio
>> +
>> +# real QA test starts here
>> +_supported_fs generic
>
> Is the bug fix be reviewed and acked now? If it is, please use _fixed_by_kernel_commit
> at here. The commit id can be "xxxxxxxxxxxx" if it's not merged by acked.
>

Hi Zorro,

Yes, the patch is still being worked on. I think we can wait till then
to not spook the distro CI testing to start reporting multiple bug
reports using this testcase :P


>> +_require_scratch
>> +_require_scratch_shutdown
>> +_require_odirect
>
> Due to you add aio test in v2, so this line should be: _require_aiodio
>

Ok, I guess I can just remove this line "_require_odirect". Because in the
next line I anyway include _require_aiodio.

>> +_require_aiodio aio-dio-write-verify

here ^^

>> +
>> +_scratch_mkfs > $seqres.full 2>&1
>> +_scratch_mount
>> +
>> +echo "T-1: Create a 1M file using buff-io & O_SYNC"
>> +$XFS_IO_PROG -fs -c "pwrite -S 0x5a 0 1M" $SCRATCH_MNT/testfile.t1 > /dev/null 2>&1
>> +echo "T-1: Shutdown the fs suddenly"
>> +_scratch_shutdown
>> +echo "T-1: Cycle mount"
>> +_scratch_cycle_mount
>> +echo "T-1: File contents after cycle mount"
>> +_hexdump $SCRATCH_MNT/testfile.t1
>> +
>> +echo "T-2: Create a 1M file using O_DIRECT & O_SYNC"
>> +$XFS_IO_PROG -fsd -c "pwrite -S 0x5a 0 1M" $SCRATCH_MNT/testfile.t2 > /dev/null 2>&1
>> +echo "T-2: Shutdown the fs suddenly"
>> +_scratch_shutdown
>> +echo "T-2: Cycle mount"
>> +_scratch_cycle_mount
>> +echo "T-2: File contents after cycle mount"
>> +_hexdump $SCRATCH_MNT/testfile.t2
>> +
>> +echo "T-3: Create a 1M file using AIO-DIO & O_SYNC"
>> +$AIO_TEST -a size=1048576 -S -N $SCRATCH_MNT/testfile.t3 > /dev/null 2>&1
>
> So you just need aio-dio-write-verify.c to do aio write. Maybe we can have aio
> read and write support in xfs_io in one day:)

Yes, someday maybe :)
But I am not sure what plan Darrick has about it. Was there any
history behind no aio-dio support in xfs_io? Just curious since the
discussion came up.


Thanks for the review!
-ritesh

2023-09-28 06:30:39

by Zorro Lang

[permalink] [raw]
Subject: Re: [PATCHv2 2/2] generic: Add integrity tests with synchronous directio

On Sat, Sep 23, 2023 at 05:30:24PM +0530, Ritesh Harjani (IBM) wrote:
> This test covers data & metadata integrity check with directio with
> o_sync flag and checks the file contents & size after sudden fileystem
> shutdown once the directio write is completed. ext4 directio after iomap
> conversion was broken in the sense that if the FS crashes after
> synchronous directio write, it's file size is not properly updated.
> This test adds a testcase to cover such scenario.
>
> Man page of open says that -
> O_SYNC provides synchronized I/O file integrity completion, meaning write
> operations will flush data and all associated metadata to the underlying
> hardware



>
> Reported-by: Gao Xiang <[email protected]>
> Signed-off-by: Ritesh Harjani (IBM) <[email protected]>
> ---
> tests/generic/471 | 50 +++++++++++++++++++++++++++++++++++++++++++
> tests/generic/471.out | 22 +++++++++++++++++++
> 2 files changed, 72 insertions(+)
> create mode 100755 tests/generic/471
> create mode 100644 tests/generic/471.out
>
> diff --git a/tests/generic/471 b/tests/generic/471

The generic/471 has been taken last week, you can choose another number.
Or simply use generic/999, then I'll change the 999 to a proper number.

> new file mode 100755
> index 00000000..218e6676
> --- /dev/null
> +++ b/tests/generic/471
> @@ -0,0 +1,50 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2023 IBM Corporation. All Rights Reserved.
> +#
> +# FS QA Test 471
> +#
> +# Integrity test for O_SYNC with buff-io, dio, aio-dio with sudden shutdown
> +#
> +. ./common/preamble
> +_begin_fstest auto quick shutdown aio
> +
> +# real QA test starts here
> +_supported_fs generic

Is the bug fix be reviewed and acked now? If it is, please use _fixed_by_kernel_commit
at here. The commit id can be "xxxxxxxxxxxx" if it's not merged by acked.

> +_require_scratch
> +_require_scratch_shutdown
> +_require_odirect

Due to you add aio test in v2, so this line should be: _require_aiodio

> +_require_aiodio aio-dio-write-verify
> +
> +_scratch_mkfs > $seqres.full 2>&1
> +_scratch_mount
> +
> +echo "T-1: Create a 1M file using buff-io & O_SYNC"
> +$XFS_IO_PROG -fs -c "pwrite -S 0x5a 0 1M" $SCRATCH_MNT/testfile.t1 > /dev/null 2>&1
> +echo "T-1: Shutdown the fs suddenly"
> +_scratch_shutdown
> +echo "T-1: Cycle mount"
> +_scratch_cycle_mount
> +echo "T-1: File contents after cycle mount"
> +_hexdump $SCRATCH_MNT/testfile.t1
> +
> +echo "T-2: Create a 1M file using O_DIRECT & O_SYNC"
> +$XFS_IO_PROG -fsd -c "pwrite -S 0x5a 0 1M" $SCRATCH_MNT/testfile.t2 > /dev/null 2>&1
> +echo "T-2: Shutdown the fs suddenly"
> +_scratch_shutdown
> +echo "T-2: Cycle mount"
> +_scratch_cycle_mount
> +echo "T-2: File contents after cycle mount"
> +_hexdump $SCRATCH_MNT/testfile.t2
> +
> +echo "T-3: Create a 1M file using AIO-DIO & O_SYNC"
> +$AIO_TEST -a size=1048576 -S -N $SCRATCH_MNT/testfile.t3 > /dev/null 2>&1

So you just need aio-dio-write-verify.c to do aio write. Maybe we can have aio
read and write support in xfs_io in one day:)

Thanks,
Zorro


> +echo "T-3: Shutdown the fs suddenly"
> +_scratch_shutdown
> +echo "T-3: Cycle mount"
> +_scratch_cycle_mount
> +echo "T-3: File contents after cycle mount"
> +_hexdump $SCRATCH_MNT/testfile.t3
> +
> +status=0
> +exit
> diff --git a/tests/generic/471.out b/tests/generic/471.out
> new file mode 100644
> index 00000000..2bfb033d
> --- /dev/null
> +++ b/tests/generic/471.out
> @@ -0,0 +1,22 @@
> +QA output created by 471
> +T-1: Create a 1M file using buff-io & O_SYNC
> +T-1: Shutdown the fs suddenly
> +T-1: Cycle mount
> +T-1: File contents after cycle mount
> +000000 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a >ZZZZZZZZZZZZZZZZ<
> +*
> +100000
> +T-2: Create a 1M file using O_DIRECT & O_SYNC
> +T-2: Shutdown the fs suddenly
> +T-2: Cycle mount
> +T-2: File contents after cycle mount
> +000000 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a >ZZZZZZZZZZZZZZZZ<
> +*
> +100000
> +T-3: Create a 1M file using AIO-DIO & O_SYNC
> +T-3: Shutdown the fs suddenly
> +T-3: Cycle mount
> +T-3: File contents after cycle mount
> +000000 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a >ZZZZZZZZZZZZZZZZ<
> +*
> +100000
> --
> 2.41.0
>