2021-07-06 02:34:47

by Zhang Yi

[permalink] [raw]
Subject: [RFC PATCH 1/4] ext4: check and update i_disksize properly

After commit 3da40c7b0898 ("ext4: only call ext4_truncate when size <=
isize"), i_disksize could always be updated to i_size in ext4_setattr(),
and it seems that there is no other way that could appear
i_disksize < i_size besides the delalloc write. In the case of delay
alloc write, ext4_writepages() could update i_disksize for the new delay
allocated blocks properly. So we could switch to check i_size instead
of i_disksize in ext4_da_write_end() when write to the end of the file.
we also could remove ext4_mark_inode_dirty() together because
generic_write_end() will dirty the inode.

Signed-off-by: Zhang Yi <[email protected]>
---
fs/ext4/inode.c | 21 ++++++++-------------
1 file changed, 8 insertions(+), 13 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index d8de607849df..6f6a61f3ae5f 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3087,32 +3087,27 @@ static int ext4_da_write_end(struct file *file,
* generic_write_end() will run mark_inode_dirty() if i_size
* changes. So let's piggyback the i_disksize mark_inode_dirty
* into that.
+ *
+ * Check i_size not i_disksize here because ext4_writepages() could
+ * update i_disksize from i_size for delay allocated blocks properly.
*/
new_i_size = pos + copied;
- if (copied && new_i_size > EXT4_I(inode)->i_disksize) {
+ if (copied && new_i_size > inode->i_size) {
if (ext4_has_inline_data(inode) ||
- ext4_da_should_update_i_disksize(page, end)) {
+ ext4_da_should_update_i_disksize(page, end))
ext4_update_i_disksize(inode, new_i_size);
- /* We need to mark inode dirty even if
- * new_i_size is less that inode->i_size
- * bu greater than i_disksize.(hint delalloc)
- */
- ret = ext4_mark_inode_dirty(handle, inode);
- }
}

if (write_mode != CONVERT_INLINE_DATA &&
ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA) &&
ext4_has_inline_data(inode))
- ret2 = ext4_da_write_inline_data_end(inode, pos, len, copied,
+ ret = ext4_da_write_inline_data_end(inode, pos, len, copied,
page);
else
- ret2 = generic_write_end(file, mapping, pos, len, copied,
+ ret = generic_write_end(file, mapping, pos, len, copied,
page, fsdata);

- copied = ret2;
- if (ret2 < 0)
- ret = ret2;
+ copied = ret;
ret2 = ext4_journal_stop(handle);
if (unlikely(ret2 && !ret))
ret = ret2;
--
2.31.1


2021-07-06 12:35:47

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC PATCH 1/4] ext4: check and update i_disksize properly

On Tue 06-07-21 10:42:07, Zhang Yi wrote:
> After commit 3da40c7b0898 ("ext4: only call ext4_truncate when size <=
> isize"), i_disksize could always be updated to i_size in ext4_setattr(),
> and it seems that there is no other way that could appear
> i_disksize < i_size besides the delalloc write. In the case of delay

Well, there are also direct IO writes which have temporarily i_disksize <
i_size but when you hold i_rwsem, you're right that delalloc is the only
reason why you can see i_disksize < i_size AFAIK.

> alloc write, ext4_writepages() could update i_disksize for the new delay
> allocated blocks properly. So we could switch to check i_size instead
> of i_disksize in ext4_da_write_end() when write to the end of the file.

I agree that since ext4_da_should_update_i_disksize() needs to return true
for us to touch i_disksize, writeback has to have already allocated block
underlying the end of write (new_i_size position) and thus we are
guaranteed that writeback will also soon update i_disksize after the
new_i_size position. So I agree that your switch to testing i_size instead
of i_disksize should not have any bad effect... Thinking about this some
more why do we need i_disksize update in ext4_da_write_end() at all? The
page will be dirtied and when writeback will happen we will update
i_disksize to i_size. Updating i_disksize earlier brings no benefit - the user
will see zeros instead of valid data if we crash before the writeback
happened. Am I missing something guys?

Honza

> we also could remove ext4_mark_inode_dirty() together because
> generic_write_end() will dirty the inode.
>
> Signed-off-by: Zhang Yi <[email protected]>
> ---
> fs/ext4/inode.c | 21 ++++++++-------------
> 1 file changed, 8 insertions(+), 13 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index d8de607849df..6f6a61f3ae5f 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3087,32 +3087,27 @@ static int ext4_da_write_end(struct file *file,
> * generic_write_end() will run mark_inode_dirty() if i_size
> * changes. So let's piggyback the i_disksize mark_inode_dirty
> * into that.
> + *
> + * Check i_size not i_disksize here because ext4_writepages() could
> + * update i_disksize from i_size for delay allocated blocks properly.
> */
> new_i_size = pos + copied;
> - if (copied && new_i_size > EXT4_I(inode)->i_disksize) {
> + if (copied && new_i_size > inode->i_size) {
> if (ext4_has_inline_data(inode) ||
> - ext4_da_should_update_i_disksize(page, end)) {
> + ext4_da_should_update_i_disksize(page, end))
> ext4_update_i_disksize(inode, new_i_size);
> - /* We need to mark inode dirty even if
> - * new_i_size is less that inode->i_size
> - * bu greater than i_disksize.(hint delalloc)
> - */
> - ret = ext4_mark_inode_dirty(handle, inode);
> - }
> }
>
> if (write_mode != CONVERT_INLINE_DATA &&
> ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA) &&
> ext4_has_inline_data(inode))
> - ret2 = ext4_da_write_inline_data_end(inode, pos, len, copied,
> + ret = ext4_da_write_inline_data_end(inode, pos, len, copied,
> page);
> else
> - ret2 = generic_write_end(file, mapping, pos, len, copied,
> + ret = generic_write_end(file, mapping, pos, len, copied,
> page, fsdata);
>
> - copied = ret2;
> - if (ret2 < 0)
> - ret = ret2;
> + copied = ret;
> ret2 = ext4_journal_stop(handle);
> if (unlikely(ret2 && !ret))
> ret = ret2;
> --
> 2.31.1
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2021-07-06 14:41:22

by Zhang Yi

[permalink] [raw]
Subject: Re: [RFC PATCH 1/4] ext4: check and update i_disksize properly

On 2021/7/6 20:11, Jan Kara wrote:
> On Tue 06-07-21 10:42:07, Zhang Yi wrote:
>> After commit 3da40c7b0898 ("ext4: only call ext4_truncate when size <=
>> isize"), i_disksize could always be updated to i_size in ext4_setattr(),
>> and it seems that there is no other way that could appear
>> i_disksize < i_size besides the delalloc write. In the case of delay
>
> Well, there are also direct IO writes which have temporarily i_disksize <
> i_size but when you hold i_rwsem, you're right that delalloc is the only
> reason why you can see i_disksize < i_size AFAIK.
>
>> alloc write, ext4_writepages() could update i_disksize for the new delay
>> allocated blocks properly. So we could switch to check i_size instead
>> of i_disksize in ext4_da_write_end() when write to the end of the file.
>
> I agree that since ext4_da_should_update_i_disksize() needs to return true
> for us to touch i_disksize, writeback has to have already allocated block
> underlying the end of write (new_i_size position) and thus we are
> guaranteed that writeback will also soon update i_disksize after the
> new_i_size position. So I agree that your switch to testing i_size instead
> of i_disksize should not have any bad effect... Thinking about this some
> more why do we need i_disksize update in ext4_da_write_end() at all? The
> page will be dirtied and when writeback will happen we will update
> i_disksize to i_size. Updating i_disksize earlier brings no benefit - the user
> will see zeros instead of valid data if we crash before the writeback
> happened. Am I missing something guys?
>

Hi, Jan.

Do you remember the patch and question I asked 2 years ago[1][2]? The case of
new_i_size > i_size && ext4_da_should_update_i_disksize() here means partial
block append write, ext4_writepages() does not update i_disksize for this case
now. And the journal data=ordered mode also cannot guarantee write data before
metadata. So we cannot guarantee we cannot see zeros where data was written after
crash.

Thanks,
Yi.

[1]https://lore.kernel.org/linux-ext4/[email protected]/
[2]https://lore.kernel.org/linux-ext4/[email protected]/

>
>> we also could remove ext4_mark_inode_dirty() together because
>> generic_write_end() will dirty the inode.
>>
>> Signed-off-by: Zhang Yi <[email protected]>
>> ---
>> fs/ext4/inode.c | 21 ++++++++-------------
>> 1 file changed, 8 insertions(+), 13 deletions(-)
>>
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index d8de607849df..6f6a61f3ae5f 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -3087,32 +3087,27 @@ static int ext4_da_write_end(struct file *file,
>> * generic_write_end() will run mark_inode_dirty() if i_size
>> * changes. So let's piggyback the i_disksize mark_inode_dirty
>> * into that.
>> + *
>> + * Check i_size not i_disksize here because ext4_writepages() could
>> + * update i_disksize from i_size for delay allocated blocks properly.
>> */
>> new_i_size = pos + copied;
>> - if (copied && new_i_size > EXT4_I(inode)->i_disksize) {
>> + if (copied && new_i_size > inode->i_size) {
>> if (ext4_has_inline_data(inode) ||
>> - ext4_da_should_update_i_disksize(page, end)) {
>> + ext4_da_should_update_i_disksize(page, end))
>> ext4_update_i_disksize(inode, new_i_size);
>> - /* We need to mark inode dirty even if
>> - * new_i_size is less that inode->i_size
>> - * bu greater than i_disksize.(hint delalloc)
>> - */
>> - ret = ext4_mark_inode_dirty(handle, inode);
>> - }
>> }
>>
>> if (write_mode != CONVERT_INLINE_DATA &&
>> ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA) &&
>> ext4_has_inline_data(inode))
>> - ret2 = ext4_da_write_inline_data_end(inode, pos, len, copied,
>> + ret = ext4_da_write_inline_data_end(inode, pos, len, copied,
>> page);
>> else
>> - ret2 = generic_write_end(file, mapping, pos, len, copied,
>> + ret = generic_write_end(file, mapping, pos, len, copied,
>> page, fsdata);
>>
>> - copied = ret2;
>> - if (ret2 < 0)
>> - ret = ret2;
>> + copied = ret;
>> ret2 = ext4_journal_stop(handle);
>> if (unlikely(ret2 && !ret))
>> ret = ret2;
>> --
>> 2.31.1
>>

2021-07-06 15:28:08

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC PATCH 1/4] ext4: check and update i_disksize properly

On Tue 06-07-21 22:40:46, Zhang Yi wrote:
> On 2021/7/6 20:11, Jan Kara wrote:
> > On Tue 06-07-21 10:42:07, Zhang Yi wrote:
> >> After commit 3da40c7b0898 ("ext4: only call ext4_truncate when size <=
> >> isize"), i_disksize could always be updated to i_size in ext4_setattr(),
> >> and it seems that there is no other way that could appear
> >> i_disksize < i_size besides the delalloc write. In the case of delay
> >
> > Well, there are also direct IO writes which have temporarily i_disksize <
> > i_size but when you hold i_rwsem, you're right that delalloc is the only
> > reason why you can see i_disksize < i_size AFAIK.
> >
> >> alloc write, ext4_writepages() could update i_disksize for the new delay
> >> allocated blocks properly. So we could switch to check i_size instead
> >> of i_disksize in ext4_da_write_end() when write to the end of the file.
> >
> > I agree that since ext4_da_should_update_i_disksize() needs to return true
> > for us to touch i_disksize, writeback has to have already allocated block
> > underlying the end of write (new_i_size position) and thus we are
> > guaranteed that writeback will also soon update i_disksize after the
> > new_i_size position. So I agree that your switch to testing i_size instead
> > of i_disksize should not have any bad effect... Thinking about this some
> > more why do we need i_disksize update in ext4_da_write_end() at all? The
> > page will be dirtied and when writeback will happen we will update
> > i_disksize to i_size. Updating i_disksize earlier brings no benefit - the user
> > will see zeros instead of valid data if we crash before the writeback
> > happened. Am I missing something guys?
> >
>
> Hi, Jan.
>
> Do you remember the patch and question I asked 2 years ago[1][2]? The
> case of new_i_size > i_size && ext4_da_should_update_i_disksize() here
> means partial block append write,

Agreed.

> ext4_writepages() does not update i_disksize for this case now.

Doesn't it? Hmm, so mpage_map_and_submit_extent() certainly does make sure
we update i_size properly. But you are actually correct that
ext4_writepage() does not update i_disksize and neither does
mpage_prepare_extent_to_map() which can also writeback fully mapped pages.
Changing mpage_prepare_extent_to_map() to handle i_disksize update would be
trivial but dealing with ext4_writepage() would be difficult. So yes, let's
keep the i_disksize update in ext4_da_write_end() for now. But please add a
comment there explaining the situation. Like:

/*
* Since we are holding inode lock, we are sure i_disksize <=
* i_size. We also know that if i_disksize < i_size, there are
* delalloc writes pending in the range upto i_size. If the end of
* the current write is <= i_size, there's no need to touch
* i_disksize since writeback will push i_disksize upto i_size
* eventually. If the end of the current write is > i_size and
* inside an allocated block (ext4_da_should_update_i_disksize()
* check), we need to update i_disksize here as neither
* ext4_writepage() nor certain ext4_writepages() paths not
* allocating blocks update i_disksize.
*
* Note that we defer inode dirtying to generic_write_end() /
* ext4_da_write_inline_data_end().
*/

> And the journal data=ordered mode also
> cannot guarantee write data before metadata. So we cannot guarantee we
> cannot see zeros where data was written after crash.

Yes, but that is IMO somewhat different question.

Honza

>
> Thanks,
> Yi.
>
> [1]https://lore.kernel.org/linux-ext4/[email protected]/
> [2]https://lore.kernel.org/linux-ext4/[email protected]/
>
> >
> >> we also could remove ext4_mark_inode_dirty() together because
> >> generic_write_end() will dirty the inode.
> >>
> >> Signed-off-by: Zhang Yi <[email protected]>
> >> ---
> >> fs/ext4/inode.c | 21 ++++++++-------------
> >> 1 file changed, 8 insertions(+), 13 deletions(-)
> >>
> >> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> >> index d8de607849df..6f6a61f3ae5f 100644
> >> --- a/fs/ext4/inode.c
> >> +++ b/fs/ext4/inode.c
> >> @@ -3087,32 +3087,27 @@ static int ext4_da_write_end(struct file *file,
> >> * generic_write_end() will run mark_inode_dirty() if i_size
> >> * changes. So let's piggyback the i_disksize mark_inode_dirty
> >> * into that.
> >> + *
> >> + * Check i_size not i_disksize here because ext4_writepages() could
> >> + * update i_disksize from i_size for delay allocated blocks properly.
> >> */
> >> new_i_size = pos + copied;
> >> - if (copied && new_i_size > EXT4_I(inode)->i_disksize) {
> >> + if (copied && new_i_size > inode->i_size) {
> >> if (ext4_has_inline_data(inode) ||
> >> - ext4_da_should_update_i_disksize(page, end)) {
> >> + ext4_da_should_update_i_disksize(page, end))
> >> ext4_update_i_disksize(inode, new_i_size);
> >> - /* We need to mark inode dirty even if
> >> - * new_i_size is less that inode->i_size
> >> - * bu greater than i_disksize.(hint delalloc)
> >> - */
> >> - ret = ext4_mark_inode_dirty(handle, inode);
> >> - }
> >> }
> >>
> >> if (write_mode != CONVERT_INLINE_DATA &&
> >> ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA) &&
> >> ext4_has_inline_data(inode))
> >> - ret2 = ext4_da_write_inline_data_end(inode, pos, len, copied,
> >> + ret = ext4_da_write_inline_data_end(inode, pos, len, copied,
> >> page);
> >> else
> >> - ret2 = generic_write_end(file, mapping, pos, len, copied,
> >> + ret = generic_write_end(file, mapping, pos, len, copied,
> >> page, fsdata);
> >>
> >> - copied = ret2;
> >> - if (ret2 < 0)
> >> - ret = ret2;
> >> + copied = ret;
> >> ret2 = ext4_journal_stop(handle);
> >> if (unlikely(ret2 && !ret))
> >> ret = ret2;
> >> --
> >> 2.31.1
> >>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2021-07-07 06:19:35

by Zhang Yi

[permalink] [raw]
Subject: Re: [RFC PATCH 1/4] ext4: check and update i_disksize properly

On 2021/7/6 23:26, Jan Kara wrote:
> On Tue 06-07-21 22:40:46, Zhang Yi wrote:
>> On 2021/7/6 20:11, Jan Kara wrote:
>>> On Tue 06-07-21 10:42:07, Zhang Yi wrote:
>>>> After commit 3da40c7b0898 ("ext4: only call ext4_truncate when size <=
>>>> isize"), i_disksize could always be updated to i_size in ext4_setattr(),
>>>> and it seems that there is no other way that could appear
>>>> i_disksize < i_size besides the delalloc write. In the case of delay
>>>
>>> Well, there are also direct IO writes which have temporarily i_disksize <
>>> i_size but when you hold i_rwsem, you're right that delalloc is the only
>>> reason why you can see i_disksize < i_size AFAIK.
>>>
>>>> alloc write, ext4_writepages() could update i_disksize for the new delay
>>>> allocated blocks properly. So we could switch to check i_size instead
>>>> of i_disksize in ext4_da_write_end() when write to the end of the file.
>>>
>>> I agree that since ext4_da_should_update_i_disksize() needs to return true
>>> for us to touch i_disksize, writeback has to have already allocated block
>>> underlying the end of write (new_i_size position) and thus we are
>>> guaranteed that writeback will also soon update i_disksize after the
>>> new_i_size position. So I agree that your switch to testing i_size instead
>>> of i_disksize should not have any bad effect... Thinking about this some
>>> more why do we need i_disksize update in ext4_da_write_end() at all? The
>>> page will be dirtied and when writeback will happen we will update
>>> i_disksize to i_size. Updating i_disksize earlier brings no benefit - the user
>>> will see zeros instead of valid data if we crash before the writeback
>>> happened. Am I missing something guys?
>>>
>>
>> Hi, Jan.
>>
>> Do you remember the patch and question I asked 2 years ago[1][2]? The
>> case of new_i_size > i_size && ext4_da_should_update_i_disksize() here
>> means partial block append write,
>
> Agreed.
>
>> ext4_writepages() does not update i_disksize for this case now.
>
> Doesn't it? Hmm, so mpage_map_and_submit_extent() certainly does make sure
> we update i_size properly. But you are actually correct that
> ext4_writepage() does not update i_disksize and neither does
> mpage_prepare_extent_to_map() which can also writeback fully mapped pages.
> Changing mpage_prepare_extent_to_map() to handle i_disksize update would be
> trivial but dealing with ext4_writepage() would be difficult. So yes, let's
> keep the i_disksize update in ext4_da_write_end() for now. But please add a
> comment there explaining the situation. Like:
>
> /*
> * Since we are holding inode lock, we are sure i_disksize <=
> * i_size. We also know that if i_disksize < i_size, there are
> * delalloc writes pending in the range upto i_size. If the end of
> * the current write is <= i_size, there's no need to touch
> * i_disksize since writeback will push i_disksize upto i_size
> * eventually. If the end of the current write is > i_size and
> * inside an allocated block (ext4_da_should_update_i_disksize()
> * check), we need to update i_disksize here as neither
> * ext4_writepage() nor certain ext4_writepages() paths not
> * allocating blocks update i_disksize.
> *
> * Note that we defer inode dirtying to generic_write_end() /
> * ext4_da_write_inline_data_end().
> */
>

Yeah, it makes things clear, I will add this comments in the next iteration.

Thanks,
Yi.