2021-12-23 03:24:13

by Xin Yin

[permalink] [raw]
Subject: [PATCH 0/2] ext4: fast commit crash consistency issues

This patch sets fix 2 crash-consistency issues of fast commit.
First patch change to use ext4_ext_remove_space instead of
ext4_punch_hole during replay delete range procedure. This
avoid replay procedure being affeced by incorrect inode->i_size.
Second patch correct the trank range logic for ftruncte.

After testing this patch sets with xfstests-bld, in the "log" and
"quick" group with config "fast_commit" is selected. No regressions
was found.

Signed-off-by: Xin Yin <[email protected]>

Xin Yin (2):
ext4: use ext4_ext_remove_space() for fast commit replay delete range
ext4: fast commit may miss tracking unwritten range during ftruncate

fs/ext4/fast_commit.c | 13 ++++++++-----
fs/ext4/inode.c | 3 +--
2 files changed, 9 insertions(+), 7 deletions(-)

--
2.20.1



2021-12-23 03:24:18

by Xin Yin

[permalink] [raw]
Subject: [PATCH 1/2] ext4: use ext4_ext_remove_space() for fast commit replay delete range

For now ,we use ext4_punch_hole() during fast commit replay delete range
procedure. But it will be affected by inode->i_size, which may not
correct during fast commit replay procedure. The following test will
failed.

-create & write foo (len 1000K)
-falloc FALLOC_FL_ZERO_RANGE foo (range 400K - 600K)
-create & fsync bar
-falloc FALLOC_FL_PUNCH_HOLE foo (range 300K-500K)
-fsync foo
-crash before a full commit

After the fast_commit reply procedure, the range 400K-500K will not be
removed. Because in this case, when calling ext4_punch_hole() the
inode->i_size is 0, and it just retruns with doing nothing.

Change to use ext4_ext_remove_space() instead of ext4_punch_hole()
to remove blocks of inode directly.

Signed-off-by: Xin Yin <[email protected]>
---
fs/ext4/fast_commit.c | 13 ++++++++-----
1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
index aa05b23f9c14..3deb97b22ca4 100644
--- a/fs/ext4/fast_commit.c
+++ b/fs/ext4/fast_commit.c
@@ -1708,11 +1708,14 @@ ext4_fc_replay_del_range(struct super_block *sb, struct ext4_fc_tl *tl,
}
}

- ret = ext4_punch_hole(inode,
- le32_to_cpu(lrange.fc_lblk) << sb->s_blocksize_bits,
- le32_to_cpu(lrange.fc_len) << sb->s_blocksize_bits);
- if (ret)
- jbd_debug(1, "ext4_punch_hole returned %d", ret);
+ down_write(&EXT4_I(inode)->i_data_sem);
+ ret = ext4_ext_remove_space(inode, lrange.fc_lblk,
+ lrange.fc_lblk + lrange.fc_len - 1);
+ up_write(&EXT4_I(inode)->i_data_sem);
+ if (ret) {
+ iput(inode);
+ return 0;
+ }
ext4_ext_replay_shrink_inode(inode,
i_size_read(inode) >> sb->s_blocksize_bits);
ext4_mark_inode_dirty(NULL, inode);
--
2.20.1


2021-12-23 03:24:25

by Xin Yin

[permalink] [raw]
Subject: [PATCH 2/2] ext4: fast commit may miss tracking unwritten range during ftruncate

If use FALLOC_FL_KEEP_SIZE to alloc unwritten range at bottom, the
inode->i_size will not include the unwritten range. When call
ftruncate with fast commit enabled, it will miss to track the
unwritten range.

Change to trace the full range during ftruncate.

Signed-off-by: Xin Yin <[email protected]>
---
fs/ext4/inode.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 82f555d26980..1d2ba63874ad 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5423,8 +5423,7 @@ int ext4_setattr(struct user_namespace *mnt_userns, struct dentry *dentry,
ext4_fc_track_range(handle, inode,
(attr->ia_size > 0 ? attr->ia_size - 1 : 0) >>
inode->i_sb->s_blocksize_bits,
- (oldsize > 0 ? oldsize - 1 : 0) >>
- inode->i_sb->s_blocksize_bits);
+ EXT_MAX_BLOCKS - 1);
else
ext4_fc_track_range(
handle, inode,
--
2.20.1


2021-12-23 20:11:31

by harshad shirwadkar

[permalink] [raw]
Subject: Re: [PATCH 1/2] ext4: use ext4_ext_remove_space() for fast commit replay delete range

Looks good to me.

Reviewed-by: Harshad Shirwadkar <[email protected]>

On Wed, Dec 22, 2021 at 7:24 PM Xin Yin <[email protected]> wrote:
>
> For now ,we use ext4_punch_hole() during fast commit replay delete range
> procedure. But it will be affected by inode->i_size, which may not
> correct during fast commit replay procedure. The following test will
> failed.
>
> -create & write foo (len 1000K)
> -falloc FALLOC_FL_ZERO_RANGE foo (range 400K - 600K)
> -create & fsync bar
> -falloc FALLOC_FL_PUNCH_HOLE foo (range 300K-500K)
> -fsync foo
> -crash before a full commit
>
> After the fast_commit reply procedure, the range 400K-500K will not be
> removed. Because in this case, when calling ext4_punch_hole() the
> inode->i_size is 0, and it just retruns with doing nothing.
>
> Change to use ext4_ext_remove_space() instead of ext4_punch_hole()
> to remove blocks of inode directly.
>
> Signed-off-by: Xin Yin <[email protected]>
> ---
> fs/ext4/fast_commit.c | 13 ++++++++-----
> 1 file changed, 8 insertions(+), 5 deletions(-)
>
> diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
> index aa05b23f9c14..3deb97b22ca4 100644
> --- a/fs/ext4/fast_commit.c
> +++ b/fs/ext4/fast_commit.c
> @@ -1708,11 +1708,14 @@ ext4_fc_replay_del_range(struct super_block *sb, struct ext4_fc_tl *tl,
> }
> }
>
> - ret = ext4_punch_hole(inode,
> - le32_to_cpu(lrange.fc_lblk) << sb->s_blocksize_bits,
> - le32_to_cpu(lrange.fc_len) << sb->s_blocksize_bits);
> - if (ret)
> - jbd_debug(1, "ext4_punch_hole returned %d", ret);
> + down_write(&EXT4_I(inode)->i_data_sem);
> + ret = ext4_ext_remove_space(inode, lrange.fc_lblk,
> + lrange.fc_lblk + lrange.fc_len - 1);
> + up_write(&EXT4_I(inode)->i_data_sem);
> + if (ret) {
> + iput(inode);
> + return 0;
> + }
> ext4_ext_replay_shrink_inode(inode,
> i_size_read(inode) >> sb->s_blocksize_bits);
> ext4_mark_inode_dirty(NULL, inode);
> --
> 2.20.1
>

2021-12-23 20:12:12

by harshad shirwadkar

[permalink] [raw]
Subject: Re: [PATCH 2/2] ext4: fast commit may miss tracking unwritten range during ftruncate

Looks good to me.

Reviewed-by: Harshad Shirwadkar <[email protected]>

On Wed, Dec 22, 2021 at 7:24 PM Xin Yin <[email protected]> wrote:
>
> If use FALLOC_FL_KEEP_SIZE to alloc unwritten range at bottom, the
> inode->i_size will not include the unwritten range. When call
> ftruncate with fast commit enabled, it will miss to track the
> unwritten range.
>
> Change to trace the full range during ftruncate.
>
> Signed-off-by: Xin Yin <[email protected]>
> ---
> fs/ext4/inode.c | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 82f555d26980..1d2ba63874ad 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -5423,8 +5423,7 @@ int ext4_setattr(struct user_namespace *mnt_userns, struct dentry *dentry,
> ext4_fc_track_range(handle, inode,
> (attr->ia_size > 0 ? attr->ia_size - 1 : 0) >>
> inode->i_sb->s_blocksize_bits,
> - (oldsize > 0 ? oldsize - 1 : 0) >>
> - inode->i_sb->s_blocksize_bits);
> + EXT_MAX_BLOCKS - 1);
> else
> ext4_fc_track_range(
> handle, inode,
> --
> 2.20.1
>

2021-12-23 20:13:37

by harshad shirwadkar

[permalink] [raw]
Subject: Re: [PATCH 0/2] ext4: fast commit crash consistency issues

Thanks for the patches, these are good finds.

Reviewed-by: Harshad Shirwadkar <[email protected]>

On Wed, Dec 22, 2021 at 7:24 PM Xin Yin <[email protected]> wrote:
>
> This patch sets fix 2 crash-consistency issues of fast commit.
> First patch change to use ext4_ext_remove_space instead of
> ext4_punch_hole during replay delete range procedure. This
> avoid replay procedure being affeced by incorrect inode->i_size.
> Second patch correct the trank range logic for ftruncte.
>
> After testing this patch sets with xfstests-bld, in the "log" and
> "quick" group with config "fast_commit" is selected. No regressions
> was found.
>
> Signed-off-by: Xin Yin <[email protected]>
>
> Xin Yin (2):
> ext4: use ext4_ext_remove_space() for fast commit replay delete range
> ext4: fast commit may miss tracking unwritten range during ftruncate
>
> fs/ext4/fast_commit.c | 13 ++++++++-----
> fs/ext4/inode.c | 3 +--
> 2 files changed, 9 insertions(+), 7 deletions(-)
>
> --
> 2.20.1
>

2021-12-24 00:28:03

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 0/2] ext4: fast commit crash consistency issues

On Thu, 23 Dec 2021 11:23:35 +0800, Xin Yin wrote:
> This patch sets fix 2 crash-consistency issues of fast commit.
> First patch change to use ext4_ext_remove_space instead of
> ext4_punch_hole during replay delete range procedure. This
> avoid replay procedure being affeced by incorrect inode->i_size.
> Second patch correct the trank range logic for ftruncte.
>
> After testing this patch sets with xfstests-bld, in the "log" and
> "quick" group with config "fast_commit" is selected. No regressions
> was found.
>
> [...]

Applied, thanks!

[1/2] ext4: use ext4_ext_remove_space() for fast commit replay delete range
commit: 20a262d66487b27133c3f8b7ac0245c2fabaa526
[2/2] ext4: fast commit may miss tracking unwritten range during ftruncate
commit: 949197083545cbff930650f893135d30fd043d4d

Best regards,
--
Theodore Ts'o <[email protected]>

2022-02-02 11:25:16

by Ritesh Harjani

[permalink] [raw]
Subject: Re: [PATCH 1/2] ext4: use ext4_ext_remove_space() for fast commit replay delete range

Hello Xin,

Sorry about revisiting this thread so late :(
Recently when I was working on one of the fast_commit issue, I got interested
in looking into some of those recent fast_commit fixes.

Hence some of these queries.

On 21/12/23 11:23AM, Xin Yin wrote:
> For now ,we use ext4_punch_hole() during fast commit replay delete range
> procedure. But it will be affected by inode->i_size, which may not
> correct during fast commit replay procedure. The following test will
> failed.
>
> -create & write foo (len 1000K)
> -falloc FALLOC_FL_ZERO_RANGE foo (range 400K - 600K)
> -create & fsync bar
^^^^ do you mean "fsync foo" or is this actually a new file create and fsync
bar?


> -falloc FALLOC_FL_PUNCH_HOLE foo (range 300K-500K)
> -fsync foo
> -crash before a full commit
>
> After the fast_commit reply procedure, the range 400K-500K will not be
> removed. Because in this case, when calling ext4_punch_hole() the
> inode->i_size is 0, and it just retruns with doing nothing.

I tried looking into this, but I am not able to put my head around that when
will the inode->i_size will be 0?

So, what I think should happen is when you are doing falocate/fsync foo in your
above list of operations then, anyways the inode i_disksize will be updated
using ext4_mark_inode_dirty() and during replay phase inode->i_size will hold
the right value no?

Could you please help understand when, where and how will inode->i_size will be
0?

Also - it would be helpful if you have some easy reproducer of this issue you
mentioned.

-ritesh

>
> Change to use ext4_ext_remove_space() instead of ext4_punch_hole()
> to remove blocks of inode directly.
>
> Signed-off-by: Xin Yin <[email protected]>
> ---
> fs/ext4/fast_commit.c | 13 ++++++++-----
> 1 file changed, 8 insertions(+), 5 deletions(-)
>
> diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
> index aa05b23f9c14..3deb97b22ca4 100644
> --- a/fs/ext4/fast_commit.c
> +++ b/fs/ext4/fast_commit.c
> @@ -1708,11 +1708,14 @@ ext4_fc_replay_del_range(struct super_block *sb, struct ext4_fc_tl *tl,
> }
> }
>
> - ret = ext4_punch_hole(inode,
> - le32_to_cpu(lrange.fc_lblk) << sb->s_blocksize_bits,
> - le32_to_cpu(lrange.fc_len) << sb->s_blocksize_bits);
> - if (ret)
> - jbd_debug(1, "ext4_punch_hole returned %d", ret);
> + down_write(&EXT4_I(inode)->i_data_sem);
> + ret = ext4_ext_remove_space(inode, lrange.fc_lblk,
> + lrange.fc_lblk + lrange.fc_len - 1);
> + up_write(&EXT4_I(inode)->i_data_sem);
> + if (ret) {
> + iput(inode);
> + return 0;
> + }
> ext4_ext_replay_shrink_inode(inode,
> i_size_read(inode) >> sb->s_blocksize_bits);
> ext4_mark_inode_dirty(NULL, inode);
> --
> 2.20.1
>

2022-02-02 13:43:46

by Xin Yin

[permalink] [raw]
Subject: Re: [External] Re: [PATCH 1/2] ext4: use ext4_ext_remove_space() for fast commit replay delete range

On Wed, Feb 2, 2022 at 4:34 AM Ritesh Harjani <[email protected]> wrote:
>
> Hello Xin,
>
> Sorry about revisiting this thread so late :(
> Recently when I was working on one of the fast_commit issue, I got interested
> in looking into some of those recent fast_commit fixes.
>
> Hence some of these queries.
>
> On 21/12/23 11:23AM, Xin Yin wrote:
> > For now ,we use ext4_punch_hole() during fast commit replay delete range
> > procedure. But it will be affected by inode->i_size, which may not
> > correct during fast commit replay procedure. The following test will
> > failed.
> >
> > -create & write foo (len 1000K)
> > -falloc FALLOC_FL_ZERO_RANGE foo (range 400K - 600K)
> > -create & fsync bar
> ^^^^ do you mean "fsync foo" or is this actually a new file create and fsync
> bar?
bar is a new created file, it is the brother file of foo , it would be
like this.
./foo ./bar

>
>
> > -falloc FALLOC_FL_PUNCH_HOLE foo (range 300K-500K)
> > -fsync foo
> > -crash before a full commit
> >
> > After the fast_commit reply procedure, the range 400K-500K will not be
> > removed. Because in this case, when calling ext4_punch_hole() the
> > inode->i_size is 0, and it just retruns with doing nothing.
>
> I tried looking into this, but I am not able to put my head around that when
> will the inode->i_size will be 0?
>
> So, what I think should happen is when you are doing falocate/fsync foo in your
> above list of operations then, anyways the inode i_disksize will be updated
> using ext4_mark_inode_dirty() and during replay phase inode->i_size will hold
> the right value no?
yes, the inode->i_size hold the right value and ext4_fc_replay_inode()
will update inode to the final state, but during replay phase
ext4_fc_replay_inode() usually is the last step, so before this the
inode->i_size may not correct.

>
> Could you please help understand when, where and how will inode->i_size will be
> 0?
I didn't check why inode->i_size is 0, in this case. I just think
inode->i_size should not affect the behavior of the replay phase.
Another case is inode->i_size may not include unwritten blocks , and
if a file has unwritten blocks at bottom, we can not use
ext4_punch_hole() to remove the unwritten blocks beyond i_size during
the replay phase.

>
> Also - it would be helpful if you have some easy reproducer of this issue you
> mentioned.
The attached test code can reproduce this issue, hope it helps.


>
> -ritesh
>
> >
> > Change to use ext4_ext_remove_space() instead of ext4_punch_hole()
> > to remove blocks of inode directly.
> >
> > Signed-off-by: Xin Yin <[email protected]>
> > ---
> > fs/ext4/fast_commit.c | 13 ++++++++-----
> > 1 file changed, 8 insertions(+), 5 deletions(-)
> >
> > diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
> > index aa05b23f9c14..3deb97b22ca4 100644
> > --- a/fs/ext4/fast_commit.c
> > +++ b/fs/ext4/fast_commit.c
> > @@ -1708,11 +1708,14 @@ ext4_fc_replay_del_range(struct super_block *sb, struct ext4_fc_tl *tl,
> > }
> > }
> >
> > - ret = ext4_punch_hole(inode,
> > - le32_to_cpu(lrange.fc_lblk) << sb->s_blocksize_bits,
> > - le32_to_cpu(lrange.fc_len) << sb->s_blocksize_bits);
> > - if (ret)
> > - jbd_debug(1, "ext4_punch_hole returned %d", ret);
> > + down_write(&EXT4_I(inode)->i_data_sem);
> > + ret = ext4_ext_remove_space(inode, lrange.fc_lblk,
> > + lrange.fc_lblk + lrange.fc_len - 1);
> > + up_write(&EXT4_I(inode)->i_data_sem);
> > + if (ret) {
> > + iput(inode);
> > + return 0;
> > + }
> > ext4_ext_replay_shrink_inode(inode,
> > i_size_read(inode) >> sb->s_blocksize_bits);
> > ext4_mark_inode_dirty(NULL, inode);
> > --
> > 2.20.1
> >


Attachments:
del_range_issue.c (1.84 kB)

2022-02-04 07:18:19

by Ritesh Harjani

[permalink] [raw]
Subject: Re: [External] Re: [PATCH 1/2] ext4: use ext4_ext_remove_space() for fast commit replay delete range

On 22/02/02 09:40PM, Xin Yin wrote:
> On Wed, Feb 2, 2022 at 4:34 AM Ritesh Harjani <[email protected]> wrote:
> >
> > Hello Xin,
> >
> > Sorry about revisiting this thread so late :(
> > Recently when I was working on one of the fast_commit issue, I got interested
> > in looking into some of those recent fast_commit fixes.
> >
> > Hence some of these queries.
> >
> > On 21/12/23 11:23AM, Xin Yin wrote:
> > > For now ,we use ext4_punch_hole() during fast commit replay delete range
> > > procedure. But it will be affected by inode->i_size, which may not
> > > correct during fast commit replay procedure. The following test will
> > > failed.
> > >
> > > -create & write foo (len 1000K)
> > > -falloc FALLOC_FL_ZERO_RANGE foo (range 400K - 600K)
> > > -create & fsync bar
> > ^^^^ do you mean "fsync foo" or is this actually a new file create and fsync
> > bar?
> bar is a new created file, it is the brother file of foo , it would be
> like this.
> ./foo ./bar
>
> >
> >
> > > -falloc FALLOC_FL_PUNCH_HOLE foo (range 300K-500K)
> > > -fsync foo
> > > -crash before a full commit
> > >
> > > After the fast_commit reply procedure, the range 400K-500K will not be
> > > removed. Because in this case, when calling ext4_punch_hole() the
> > > inode->i_size is 0, and it just retruns with doing nothing.
> >
> > I tried looking into this, but I am not able to put my head around that when
> > will the inode->i_size will be 0?
> >
> > So, what I think should happen is when you are doing falocate/fsync foo in your
> > above list of operations then, anyways the inode i_disksize will be updated
> > using ext4_mark_inode_dirty() and during replay phase inode->i_size will hold
> > the right value no?
> yes, the inode->i_size hold the right value and ext4_fc_replay_inode()
> will update inode to the final state, but during replay phase
> ext4_fc_replay_inode() usually is the last step, so before this the
> inode->i_size may not correct.
>
> >
> > Could you please help understand when, where and how will inode->i_size will be
> > 0?
> I didn't check why inode->i_size is 0, in this case. I just think

Ok, so I now know why the inode->i_size is 0 during replay phase (for file foo).
This is because inode->i_disksize is not really updated until after the
ext4_writepages() kicks in, which in this case, won't happen (for file foo)
when we are doing fsync on file bar. And hence fsync on file bar won't also
not ensure the delalloc blocks for file foo get's written out.


In fact this above information was something that I was assuming it all wrong.
Earlier I was of the opinion that fast_commit still pushes _all_ the dirty
pagecache data of other files to disk too (which is incorrect) and the only
performance gains happens via less writes to disk (since we write less metadata
on disk).

But I think what really happens is -
In case of fast_commit when fsync is called on any file (say bar), apart from that
file's (bar) dirty data, it only writes the necessary required metadata information
of the blocks of others files (in this case file foo) which are already allocated.
(which in this case was due to fzero operation).
It does not actually allocate the delalloc blocks due to buffered writes of any
other file (other than for file on which fsync is called).

This happens in
ext4_fc_perform_commit() -> ext4_fc_submit_inode_data_all() ->
jbd2_submit_inode_data -> jbd2_journal_submit_inode_data_buffers() ->
generic_writepages() -> using writepage() which won't do block allocation for
delalloc blocks.

So that above is what should give the major performance boost with fast_commit
in case of multiple file writes doing fsync. :)

@Jan/Harshad - could you please confirm if above is correct?


> inode->i_size should not affect the behavior of the replay phase.
> Another case is inode->i_size may not include unwritten blocks , and
> if a file has unwritten blocks at bottom, we can not use
> ext4_punch_hole() to remove the unwritten blocks beyond i_size during
> the replay phase.

Right. So then yes, we should not depend on inode->i_size during replay phase,
since it might have an entry in fast_commit area which is still only partially
correct (or in some transient state w.r.t i_disksize).


>
> >
> > Also - it would be helpful if you have some easy reproducer of this issue you
> > mentioned.
> The attached test code can reproduce this issue, hope it helps.

Thanks, yes it did help.

-ritesh

>
>
> >
> > -ritesh
> >
> > >
> > > Change to use ext4_ext_remove_space() instead of ext4_punch_hole()
> > > to remove blocks of inode directly.
> > >
> > > Signed-off-by: Xin Yin <[email protected]>
> > > ---
> > > fs/ext4/fast_commit.c | 13 ++++++++-----
> > > 1 file changed, 8 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
> > > index aa05b23f9c14..3deb97b22ca4 100644
> > > --- a/fs/ext4/fast_commit.c
> > > +++ b/fs/ext4/fast_commit.c
> > > @@ -1708,11 +1708,14 @@ ext4_fc_replay_del_range(struct super_block *sb, struct ext4_fc_tl *tl,
> > > }
> > > }
> > >
> > > - ret = ext4_punch_hole(inode,
> > > - le32_to_cpu(lrange.fc_lblk) << sb->s_blocksize_bits,
> > > - le32_to_cpu(lrange.fc_len) << sb->s_blocksize_bits);
> > > - if (ret)
> > > - jbd_debug(1, "ext4_punch_hole returned %d", ret);
> > > + down_write(&EXT4_I(inode)->i_data_sem);
> > > + ret = ext4_ext_remove_space(inode, lrange.fc_lblk,
> > > + lrange.fc_lblk + lrange.fc_len - 1);
> > > + up_write(&EXT4_I(inode)->i_data_sem);
> > > + if (ret) {
> > > + iput(inode);
> > > + return 0;
> > > + }
> > > ext4_ext_replay_shrink_inode(inode,
> > > i_size_read(inode) >> sb->s_blocksize_bits);
> > > ext4_mark_inode_dirty(NULL, inode);
> > > --
> > > 2.20.1
> > >


2022-02-06 21:14:24

by Jan Kara

[permalink] [raw]
Subject: Re: [External] Re: [PATCH 1/2] ext4: use ext4_ext_remove_space() for fast commit replay delete range

On Fri 04-02-22 02:44:16, Ritesh Harjani wrote:
> Ok, so I now know why the inode->i_size is 0 during replay phase (for file foo).
> This is because inode->i_disksize is not really updated until after the
> ext4_writepages() kicks in, which in this case, won't happen (for file foo)
> when we are doing fsync on file bar. And hence fsync on file bar won't also
> not ensure the delalloc blocks for file foo get's written out.
>
> In fact this above information was something that I was assuming it all
> wrong. Earlier I was of the opinion that fast_commit still pushes _all_
> the dirty pagecache data of other files to disk too (which is incorrect)
> and the only performance gains happens via less writes to disk (since we
> write less metadata on disk).
>
> But I think what really happens is - In case of fast_commit when fsync is
> called on any file (say bar), apart from that file's (bar) dirty data, it
> only writes the necessary required metadata information of the blocks of
> others files (in this case file foo) which are already allocated. (which
> in this case was due to fzero operation). It does not actually allocate
> the delalloc blocks due to buffered writes of any other file (other than
> for file on which fsync is called).

Yes, but that is exactly what also happens for normal commit. I.e. even
without fastcommits, if we fsync(2), we will flush out data for that file
but for all the other files, buffered data still stays in delalloc state in
the page cache. Following journal commit will thus write all metadata (and
wait for data) of the fsynced files but not for any other file that has
only delalloc blocks. If writeback of some other file also happened before
we commit, then yes, we include all the changes to the commit as well.

> This happens in
> ext4_fc_perform_commit() -> ext4_fc_submit_inode_data_all() ->
> jbd2_submit_inode_data -> jbd2_journal_submit_inode_data_buffers() ->
> generic_writepages() -> using writepage() which won't do block allocation for
> delalloc blocks.
>
> So that above is what should give the major performance boost with fast_commit
> in case of multiple file writes doing fsync. :)
>
> @Jan/Harshad - could you please confirm if above is correct?

What you describe is correct but not special to fastcommit. As I mentioned
on the call yesterday, fastcommit is currently beneficial only because the
logical logging it does ends up writing much less blocks to the journal.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2022-02-07 10:10:02

by Ritesh Harjani

[permalink] [raw]
Subject: Re: [External] Re: [PATCH 1/2] ext4: use ext4_ext_remove_space() for fast commit replay delete range

On 22/02/04 12:36PM, Jan Kara wrote:
> On Fri 04-02-22 02:44:16, Ritesh Harjani wrote:
> > Ok, so I now know why the inode->i_size is 0 during replay phase (for file foo).
> > This is because inode->i_disksize is not really updated until after the
> > ext4_writepages() kicks in, which in this case, won't happen (for file foo)
> > when we are doing fsync on file bar. And hence fsync on file bar won't also
> > not ensure the delalloc blocks for file foo get's written out.
> >
> > In fact this above information was something that I was assuming it all
> > wrong. Earlier I was of the opinion that fast_commit still pushes _all_
> > the dirty pagecache data of other files to disk too (which is incorrect)
> > and the only performance gains happens via less writes to disk (since we
> > write less metadata on disk).
> >
> > But I think what really happens is - In case of fast_commit when fsync is
> > called on any file (say bar), apart from that file's (bar) dirty data, it
> > only writes the necessary required metadata information of the blocks of
> > others files (in this case file foo) which are already allocated. (which
> > in this case was due to fzero operation). It does not actually allocate
> > the delalloc blocks due to buffered writes of any other file (other than
> > for file on which fsync is called).
>
> Yes, but that is exactly what also happens for normal commit. I.e. even
> without fastcommits, if we fsync(2), we will flush out data for that file
> but for all the other files, buffered data still stays in delalloc state in
> the page cache. Following journal commit will thus write all metadata (and
> wait for data) of the fsynced files but not for any other file that has
> only delalloc blocks. If writeback of some other file also happened before
> we commit, then yes, we include all the changes to the commit as well.
>
> > This happens in
> > ext4_fc_perform_commit() -> ext4_fc_submit_inode_data_all() ->
> > jbd2_submit_inode_data -> jbd2_journal_submit_inode_data_buffers() ->
> > generic_writepages() -> using writepage() which won't do block allocation for
> > delalloc blocks.
> >
> > So that above is what should give the major performance boost with fast_commit
> > in case of multiple file writes doing fsync. :)
> >
> > @Jan/Harshad - could you please confirm if above is correct?
>
> What you describe is correct but not special to fastcommit. As I mentioned
> on the call yesterday, fastcommit is currently beneficial only because the
> logical logging it does ends up writing much less blocks to the journal.
>

Yes, thanks for taking time to explain it again.
I got this now.

Thanks!
-ritesh


> Honza
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR