2022-08-12 12:47:17

by Lukas Czerner

[permalink] [raw]
Subject: [PATCH v3 1/3] ext4: don't increase iversion counter for ea_inodes

ea_inodes are using i_version for storing part of the reference count so
we really need to leave it alone.

The problem can be reproduced by xfstest ext4/026 when iversion is
enabled. Fix it by not calling inode_inc_iversion() for EXT4_EA_INODE_FL
inodes in ext4_mark_iloc_dirty().

Signed-off-by: Lukas Czerner <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: Jeff Layton <[email protected]>
---
v2, v3: no change

fs/ext4/inode.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 601214453c3a..2a220be34caa 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5731,7 +5731,12 @@ int ext4_mark_iloc_dirty(handle_t *handle,
}
ext4_fc_track_inode(handle, inode);

- if (IS_I_VERSION(inode))
+ /*
+ * ea_inodes are using i_version for storing reference count, don't
+ * mess with it
+ */
+ if (IS_I_VERSION(inode) &&
+ !(EXT4_I(inode)->i_flags & EXT4_EA_INODE_FL))
inode_inc_iversion(inode);

/* the do_update_inode consumes one bh->b_count */
--
2.37.1


2022-08-12 12:47:36

by Lukas Czerner

[permalink] [raw]
Subject: [PATCH v4 3/3] ext4: unconditionally enable the i_version counter

From: Jeff Layton <[email protected]>

The original i_version implementation was pretty expensive, requiring a
log flush on every change. Because of this, it was gated behind a mount
option (implemented via the MS_I_VERSION mountoption flag).

Commit ae5e165d855d (fs: new API for handling inode->i_version) made the
i_version flag much less expensive, so there is no longer a performance
penalty from enabling it. xfs and btrfs already enable it
unconditionally when the on-disk format can support it.

Have ext4 ignore the SB_I_VERSION flag, and just enable it
unconditionally. While we're in here, remove the handling of
Opt_i_version as well, since we're almost to 5.20 anyway.

Ideally, we'd couple this change with a way to disable the i_version
counter (just in case), but the way the iversion mount option was
implemented makes that difficult to do. We'd need to add a new mount
option altogether or do something with tune2fs. That's probably best
left to later patches if it turns out to be needed.

[ Removed leftover bits of i_version from ext4_apply_options() since it
now can't ever be set in ctx->mask_s_flags -- lczerner ]

Cc: Dave Chinner <[email protected]>
Cc: Benjamin Coddington <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Darrick J. Wong <[email protected]>
Signed-off-by: Jeff Layton <[email protected]>
Signed-off-by: Lukas Czerner <[email protected]>
---
v3: Removed leftover bits of i_version from ext4_apply_options
v4: no change

fs/ext4/inode.c | 5 ++---
fs/ext4/super.c | 21 ++++-----------------
2 files changed, 6 insertions(+), 20 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 2a220be34caa..c77d40f05763 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5425,7 +5425,7 @@ int ext4_setattr(struct user_namespace *mnt_userns, struct dentry *dentry,
return -EINVAL;
}

- if (IS_I_VERSION(inode) && attr->ia_size != inode->i_size)
+ if (attr->ia_size != inode->i_size)
inode_inc_iversion(inode);

if (shrink) {
@@ -5735,8 +5735,7 @@ int ext4_mark_iloc_dirty(handle_t *handle,
* ea_inodes are using i_version for storing reference count, don't
* mess with it
*/
- if (IS_I_VERSION(inode) &&
- !(EXT4_I(inode)->i_flags & EXT4_EA_INODE_FL))
+ if (!(EXT4_I(inode)->i_flags & EXT4_EA_INODE_FL))
inode_inc_iversion(inode);

/* the do_update_inode consumes one bh->b_count */
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 9a66abcca1a8..1c953f6d400e 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1585,7 +1585,7 @@ enum {
Opt_inlinecrypt,
Opt_usrjquota, Opt_grpjquota, Opt_quota,
Opt_noquota, Opt_barrier, Opt_nobarrier, Opt_err,
- Opt_usrquota, Opt_grpquota, Opt_prjquota, Opt_i_version,
+ Opt_usrquota, Opt_grpquota, Opt_prjquota,
Opt_dax, Opt_dax_always, Opt_dax_inode, Opt_dax_never,
Opt_stripe, Opt_delalloc, Opt_nodelalloc, Opt_warn_on_error,
Opt_nowarn_on_error, Opt_mblk_io_submit, Opt_debug_want_extra_isize,
@@ -1694,7 +1694,6 @@ static const struct fs_parameter_spec ext4_param_specs[] = {
fsparam_flag ("barrier", Opt_barrier),
fsparam_u32 ("barrier", Opt_barrier),
fsparam_flag ("nobarrier", Opt_nobarrier),
- fsparam_flag ("i_version", Opt_i_version),
fsparam_flag ("dax", Opt_dax),
fsparam_enum ("dax", Opt_dax_type, ext4_param_dax),
fsparam_u32 ("stripe", Opt_stripe),
@@ -2140,11 +2139,6 @@ static int ext4_parse_param(struct fs_context *fc, struct fs_parameter *param)
case Opt_abort:
ctx_set_mount_flag(ctx, EXT4_MF_FS_ABORTED);
return 0;
- case Opt_i_version:
- ext4_msg(NULL, KERN_WARNING, deprecated_msg, param->key, "5.20");
- ext4_msg(NULL, KERN_WARNING, "Use iversion instead\n");
- ctx_set_flags(ctx, SB_I_VERSION);
- return 0;
case Opt_inlinecrypt:
#ifdef CONFIG_FS_ENCRYPTION_INLINE_CRYPT
ctx_set_flags(ctx, SB_INLINECRYPT);
@@ -2814,14 +2808,6 @@ static void ext4_apply_options(struct fs_context *fc, struct super_block *sb)
sb->s_flags &= ~ctx->mask_s_flags;
sb->s_flags |= ctx->vals_s_flags;

- /*
- * i_version differs from common mount option iversion so we have
- * to let vfs know that it was set, otherwise it would get cleared
- * on remount
- */
- if (ctx->mask_s_flags & SB_I_VERSION)
- fc->sb_flags |= SB_I_VERSION;
-
#define APPLY(X) ({ if (ctx->spec & EXT4_SPEC_##X) sbi->X = ctx->X; })
APPLY(s_commit_interval);
APPLY(s_stripe);
@@ -2970,8 +2956,6 @@ static int _ext4_show_options(struct seq_file *seq, struct super_block *sb,
SEQ_OPTS_PRINT("min_batch_time=%u", sbi->s_min_batch_time);
if (nodefs || sbi->s_max_batch_time != EXT4_DEF_MAX_BATCH_TIME)
SEQ_OPTS_PRINT("max_batch_time=%u", sbi->s_max_batch_time);
- if (sb->s_flags & SB_I_VERSION)
- SEQ_OPTS_PUTS("i_version");
if (nodefs || sbi->s_stripe)
SEQ_OPTS_PRINT("stripe=%lu", sbi->s_stripe);
if (nodefs || EXT4_MOUNT_DATA_FLAGS &
@@ -4640,6 +4624,9 @@ static int __ext4_fill_super(struct fs_context *fc, struct super_block *sb)
sb->s_flags = (sb->s_flags & ~SB_POSIXACL) |
(test_opt(sb, POSIX_ACL) ? SB_POSIXACL : 0);

+ /* i_version is always enabled now */
+ sb->s_flags |= SB_I_VERSION;
+
if (le32_to_cpu(es->s_rev_level) == EXT4_GOOD_OLD_REV &&
(ext4_has_compat_features(sb) ||
ext4_has_ro_compat_features(sb) ||
--
2.37.1

2022-08-12 12:49:03

by Lukas Czerner

[permalink] [raw]
Subject: [PATCH v3 2/3] fs: record I_DIRTY_TIME even if inode already has I_DIRTY_INODE

Currently the I_DIRTY_TIME will never get set if the inode already has
I_DIRTY_INODE with assumption that it supersedes I_DIRTY_TIME. That's
true, however ext4 will only update the on-disk inode in
->dirty_inode(), not on actual writeback. As a result if the inode
already has I_DIRTY_INODE state by the time we get to
__mark_inode_dirty() only with I_DIRTY_TIME, the time was already filled
into on-disk inode and will not get updated until the next I_DIRTY_INODE
update, which might never come if we crash or get a power failure.

The problem can be reproduced on ext4 by running xfstest generic/622
with -o iversion mount option.

Fix it by allowing I_DIRTY_TIME to be set even if the inode already has
I_DIRTY_INODE. Also make sure that the case is properly handled in
writeback_single_inode() as well. Additionally changes in
xfs_fs_dirty_inode() was made to accommodate for I_DIRTY_TIME in flag.

Thanks Jan Kara for suggestions on how to make this work properly.

Cc: Dave Chinner <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Signed-off-by: Lukas Czerner <[email protected]>
Suggested-by: Jan Kara <[email protected]>
---
v2: Reworked according to suggestions from Jan
v3: Update documentation, add comments, change flag to flags in
xfs_fs_dirty_inode()

Documentation/filesystems/vfs.rst | 2 ++
fs/fs-writeback.c | 34 ++++++++++++++++++++-----------
fs/xfs/xfs_super.c | 12 +++++++++--
include/linux/fs.h | 9 ++++----
4 files changed, 39 insertions(+), 18 deletions(-)

diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
index 6cd6953e175b..5d72b6ba4e63 100644
--- a/Documentation/filesystems/vfs.rst
+++ b/Documentation/filesystems/vfs.rst
@@ -274,6 +274,8 @@ or bottom half).
This is specifically for the inode itself being marked dirty,
not its data. If the update needs to be persisted by fdatasync(),
then I_DIRTY_DATASYNC will be set in the flags argument.
+ If the inode has dirty timestamp and lazytime is enabled
+ I_DIRTY_TIME will be set in the flags.

``write_inode``
this method is called when the VFS needs to write an inode to
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 05221366a16d..638dbf143727 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1718,9 +1718,14 @@ static int writeback_single_inode(struct inode *inode,
*/
if (!(inode->i_state & I_DIRTY_ALL))
inode_cgwb_move_to_attached(inode, wb);
- else if (!(inode->i_state & I_SYNC_QUEUED) &&
- (inode->i_state & I_DIRTY))
- redirty_tail_locked(inode, wb);
+ else if (!(inode->i_state & I_SYNC_QUEUED)) {
+ if ((inode->i_state & I_DIRTY))
+ redirty_tail_locked(inode, wb);
+ else if (inode->i_state & I_DIRTY_TIME) {
+ inode->dirtied_when = jiffies;
+ inode_io_list_move_locked(inode, wb, &wb->b_dirty_time);
+ }
+ }

spin_unlock(&wb->list_lock);
inode_sync_complete(inode);
@@ -2369,6 +2374,17 @@ void __mark_inode_dirty(struct inode *inode, int flags)
trace_writeback_mark_inode_dirty(inode, flags);

if (flags & I_DIRTY_INODE) {
+
+ /* Inode timestamp update will piggback on this dirtying */
+ if (inode->i_state & I_DIRTY_TIME) {
+ spin_lock(&inode->i_lock);
+ if (inode->i_state & I_DIRTY_TIME) {
+ inode->i_state &= ~I_DIRTY_TIME;
+ flags |= I_DIRTY_TIME;
+ }
+ spin_unlock(&inode->i_lock);
+ }
+
/*
* Notify the filesystem about the inode being dirtied, so that
* (if needed) it can update on-disk fields and journal the
@@ -2378,7 +2394,8 @@ void __mark_inode_dirty(struct inode *inode, int flags)
*/
trace_writeback_dirty_inode_start(inode, flags);
if (sb->s_op->dirty_inode)
- sb->s_op->dirty_inode(inode, flags & I_DIRTY_INODE);
+ sb->s_op->dirty_inode(inode,
+ flags & (I_DIRTY_INODE | I_DIRTY_TIME));
trace_writeback_dirty_inode(inode, flags);

/* I_DIRTY_INODE supersedes I_DIRTY_TIME. */
@@ -2399,21 +2416,15 @@ void __mark_inode_dirty(struct inode *inode, int flags)
*/
smp_mb();

- if (((inode->i_state & flags) == flags) ||
- (dirtytime && (inode->i_state & I_DIRTY_INODE)))
+ if ((inode->i_state & flags) == flags)
return;

spin_lock(&inode->i_lock);
- if (dirtytime && (inode->i_state & I_DIRTY_INODE))
- goto out_unlock_inode;
if ((inode->i_state & flags) != flags) {
const int was_dirty = inode->i_state & I_DIRTY;

inode_attach_wb(inode, NULL);

- /* I_DIRTY_INODE supersedes I_DIRTY_TIME. */
- if (flags & I_DIRTY_INODE)
- inode->i_state &= ~I_DIRTY_TIME;
inode->i_state |= flags;

/*
@@ -2486,7 +2497,6 @@ void __mark_inode_dirty(struct inode *inode, int flags)
out_unlock:
if (wb)
spin_unlock(&wb->list_lock);
-out_unlock_inode:
spin_unlock(&inode->i_lock);
}
EXPORT_SYMBOL(__mark_inode_dirty);
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 9ac59814bbb6..13efc77a1e79 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -653,7 +653,7 @@ xfs_fs_destroy_inode(
static void
xfs_fs_dirty_inode(
struct inode *inode,
- int flag)
+ int flags)
{
struct xfs_inode *ip = XFS_I(inode);
struct xfs_mount *mp = ip->i_mount;
@@ -661,7 +661,15 @@ xfs_fs_dirty_inode(

if (!(inode->i_sb->s_flags & SB_LAZYTIME))
return;
- if (flag != I_DIRTY_SYNC || !(inode->i_state & I_DIRTY_TIME))
+
+ /*
+ * Only do the timestamp update if the inode is dirty (I_DIRTY_SYNC)
+ * and has dirty timestamp (I_DIRTY_TIME). I_DIRTY_TIME can be either
+ * already set in i_state, or passed in flags possibly together with
+ * I_DIRTY_SYNC.
+ */
+ if ((flags & ~I_DIRTY_TIME) != I_DIRTY_SYNC ||
+ !((inode->i_state | flags) & I_DIRTY_TIME))
return;

if (xfs_trans_alloc(mp, &M_RES(mp)->tr_fsyncts, 0, 0, 0, &tp))
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 5113f65c786f..e220d26d771a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2376,13 +2376,14 @@ static inline void kiocb_clone(struct kiocb *kiocb, struct kiocb *kiocb_src,
* don't have to write inode on fdatasync() when only
* e.g. the timestamps have changed.
* I_DIRTY_PAGES Inode has dirty pages. Inode itself may be clean.
- * I_DIRTY_TIME The inode itself only has dirty timestamps, and the
+ * I_DIRTY_TIME The inode itself has dirty timestamps, and the
* lazytime mount option is enabled. We keep track of this
* separately from I_DIRTY_SYNC in order to implement
* lazytime. This gets cleared if I_DIRTY_INODE
- * (I_DIRTY_SYNC and/or I_DIRTY_DATASYNC) gets set. I.e.
- * either I_DIRTY_TIME *or* I_DIRTY_INODE can be set in
- * i_state, but not both. I_DIRTY_PAGES may still be set.
+ * (I_DIRTY_SYNC and/or I_DIRTY_DATASYNC) gets set. But
+ * I_DIRTY_TIME can still be set if I_DIRTY_SYNC is already
+ * in place because writeback might already be in progress
+ * and we don't want to lose the time update
* I_NEW Serves as both a mutex and completion notification.
* New inodes set I_NEW. If two processes both create
* the same inode, one of them will release its inode and
--
2.37.1

2022-08-12 13:25:16

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH v3 1/3] ext4: don't increase iversion counter for ea_inodes

On Fri, Aug 12, 2022 at 02:37:25PM +0200, Lukas Czerner wrote:
> ea_inodes are using i_version for storing part of the reference count so
> we really need to leave it alone.
>
> The problem can be reproduced by xfstest ext4/026 when iversion is
> enabled. Fix it by not calling inode_inc_iversion() for EXT4_EA_INODE_FL
> inodes in ext4_mark_iloc_dirty().
>
> Signed-off-by: Lukas Czerner <[email protected]>
> Reviewed-by: Jan Kara <[email protected]>
> Reviewed-by: Jeff Layton <[email protected]>
> ---

Reviewed-by: Christian Brauner (Microsoft) <[email protected]>

2022-08-12 13:25:22

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH v4 3/3] ext4: unconditionally enable the i_version counter

On Fri, Aug 12, 2022 at 02:37:27PM +0200, Lukas Czerner wrote:
> From: Jeff Layton <[email protected]>
>
> The original i_version implementation was pretty expensive, requiring a
> log flush on every change. Because of this, it was gated behind a mount
> option (implemented via the MS_I_VERSION mountoption flag).
>
> Commit ae5e165d855d (fs: new API for handling inode->i_version) made the
> i_version flag much less expensive, so there is no longer a performance
> penalty from enabling it. xfs and btrfs already enable it
> unconditionally when the on-disk format can support it.
>
> Have ext4 ignore the SB_I_VERSION flag, and just enable it
> unconditionally. While we're in here, remove the handling of
> Opt_i_version as well, since we're almost to 5.20 anyway.
>
> Ideally, we'd couple this change with a way to disable the i_version
> counter (just in case), but the way the iversion mount option was
> implemented makes that difficult to do. We'd need to add a new mount
> option altogether or do something with tune2fs. That's probably best
> left to later patches if it turns out to be needed.
>
> [ Removed leftover bits of i_version from ext4_apply_options() since it
> now can't ever be set in ctx->mask_s_flags -- lczerner ]
>
> Cc: Dave Chinner <[email protected]>
> Cc: Benjamin Coddington <[email protected]>
> Cc: Christoph Hellwig <[email protected]>
> Cc: Darrick J. Wong <[email protected]>
> Signed-off-by: Jeff Layton <[email protected]>
> Signed-off-by: Lukas Czerner <[email protected]>
> ---

Since ext4 seems to ignore unknown mount options in ext4_parse_param()
removing seems good,
Reviewed-by: Christian Brauner (Microsoft) <[email protected]>

2022-08-12 18:03:06

by Eric Biggers

[permalink] [raw]
Subject: Re: [PATCH v3 2/3] fs: record I_DIRTY_TIME even if inode already has I_DIRTY_INODE

On Fri, Aug 12, 2022 at 02:37:26PM +0200, Lukas Czerner wrote:
> Fix it by allowing I_DIRTY_TIME to be set even if the inode already has
> I_DIRTY_INODE.

How can this be reconciled with the below code in __mark_inode_dirty(), which
this patch doesn't touch?

/* I_DIRTY_INODE supersedes I_DIRTY_TIME. */
flags &= ~I_DIRTY_TIME;

Also inode_is_dirtytime_only(), which I thought I mentioned before:

/*
* Returns true if the given inode itself only has dirty timestamps (its pages
* may still be dirty) and isn't currently being allocated or freed.
* Filesystems should call this if when writing an inode when lazytime is
* enabled, they want to opportunistically write the timestamps of other inodes
* located very nearby on-disk, e.g. in the same inode block. This returns true
* if the given inode is in need of such an opportunistic update. Requires
* i_lock, or at least later re-checking under i_lock.
*/
static inline bool inode_is_dirtytime_only(struct inode *inode)
{
return (inode->i_state & (I_DIRTY_TIME | I_NEW |
I_FREEING | I_WILL_FREE)) == I_DIRTY_TIME;
}

- Eric

2022-08-12 18:16:10

by Eric Biggers

[permalink] [raw]
Subject: Re: [PATCH v3 2/3] fs: record I_DIRTY_TIME even if inode already has I_DIRTY_INODE

On Fri, Aug 12, 2022 at 02:37:26PM +0200, Lukas Czerner wrote:
> diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
> index 6cd6953e175b..5d72b6ba4e63 100644
> --- a/Documentation/filesystems/vfs.rst
> +++ b/Documentation/filesystems/vfs.rst
> @@ -274,6 +274,8 @@ or bottom half).
> This is specifically for the inode itself being marked dirty,
> not its data. If the update needs to be persisted by fdatasync(),
> then I_DIRTY_DATASYNC will be set in the flags argument.
> + If the inode has dirty timestamp and lazytime is enabled
> + I_DIRTY_TIME will be set in the flags.

The new sentence is not always true, since with this patch if
__mark_inode_dirty(I_DIRTY_INODE) is called twice on an inode that has
I_DIRTY_TIME, the second call will no longer include I_DIRTY_TIME -- even though
the inode still has dirty timestamps. Please be super clear about what the
flags actually mean -- I'm still struggling to understand this patch...

- Eric

2022-08-12 18:46:19

by Eric Biggers

[permalink] [raw]
Subject: Re: [PATCH v3 2/3] fs: record I_DIRTY_TIME even if inode already has I_DIRTY_INODE

On Fri, Aug 12, 2022 at 02:37:26PM +0200, Lukas Czerner wrote:
> Currently the I_DIRTY_TIME will never get set if the inode already has
> I_DIRTY_INODE with assumption that it supersedes I_DIRTY_TIME. That's
> true, however ext4 will only update the on-disk inode in
> ->dirty_inode(), not on actual writeback. As a result if the inode
> already has I_DIRTY_INODE state by the time we get to
> __mark_inode_dirty() only with I_DIRTY_TIME, the time was already filled
> into on-disk inode and will not get updated until the next I_DIRTY_INODE
> update, which might never come if we crash or get a power failure.
>
> The problem can be reproduced on ext4 by running xfstest generic/622
> with -o iversion mount option.
>
> Fix it by allowing I_DIRTY_TIME to be set even if the inode already has
> I_DIRTY_INODE. Also make sure that the case is properly handled in
> writeback_single_inode() as well. Additionally changes in
> xfs_fs_dirty_inode() was made to accommodate for I_DIRTY_TIME in flag.
>
> Thanks Jan Kara for suggestions on how to make this work properly.
>
> Cc: Dave Chinner <[email protected]>
> Cc: Christoph Hellwig <[email protected]>
> Signed-off-by: Lukas Czerner <[email protected]>
> Suggested-by: Jan Kara <[email protected]>

Sorry for so many separate emails. One more thought: isn't there a much more
straightforward fix to this bug that wouldn't require changing the semantics of
the inode flags: on __mark_inode_dirty(I_DIRTY_TIME), if the inode already has
i_state & I_DIRTY_INODE, just call ->dirty_inode with i_state & I_DIRTY_INODE?
That would fix the bug by making the filesystem update the on-disk inode.

Perhaps you aren't doing that in order to strictly maintain the semantics of
'lazytime', where timestamp updates are only persisted at certain times? Is
this useful even in the short window of time that an inode is dirty?

- Eric

2022-08-12 18:46:29

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH v3 1/3] ext4: don't increase iversion counter for ea_inodes

On Fri, 2022-08-12 at 14:37 +0200, Lukas Czerner wrote:
> ea_inodes are using i_version for storing part of the reference count so
> we really need to leave it alone.
>
> The problem can be reproduced by xfstest ext4/026 when iversion is
> enabled. Fix it by not calling inode_inc_iversion() for EXT4_EA_INODE_FL
> inodes in ext4_mark_iloc_dirty().
>
> Signed-off-by: Lukas Czerner <[email protected]>
> Reviewed-by: Jan Kara <[email protected]>
> Reviewed-by: Jeff Layton <[email protected]>
> ---
> v2, v3: no change
>
> fs/ext4/inode.c | 7 ++++++-
> 1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 601214453c3a..2a220be34caa 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -5731,7 +5731,12 @@ int ext4_mark_iloc_dirty(handle_t *handle,
> }
> ext4_fc_track_inode(handle, inode);
>
> - if (IS_I_VERSION(inode))
> + /*
> + * ea_inodes are using i_version for storing reference count, don't
> + * mess with it
> + */
> + if (IS_I_VERSION(inode) &&
> + !(EXT4_I(inode)->i_flags & EXT4_EA_INODE_FL))
> inode_inc_iversion(inode);
>
> /* the do_update_inode consumes one bh->b_count */


I've spent some time writing tests for the i_version counter (still
quite rough right now), and what I've found is that this particular
inode_inc_iversion results in the counter being bumped on _reads_ as
well as writes, due to the atime changing. This call to
inode_inc_iversion seems to make no sense, as we aren't bumping the
mtime here.

I'm still working on and testing this, but I think we'll probably just
want to remove this inode_inc_iversion entirely, and leave the i_version
bumping for normal files to happen when the timestamps are updated. So
far, my testing seems to indicate that that does the right thing.

Hopefully I'll have some testcases + patches for this next week
sometime.

Cheers,
--
Jeff Layton <[email protected]>

2022-08-16 11:52:13

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v3 2/3] fs: record I_DIRTY_TIME even if inode already has I_DIRTY_INODE

On Fri 12-08-22 11:12:27, Eric Biggers wrote:
> On Fri, Aug 12, 2022 at 02:37:26PM +0200, Lukas Czerner wrote:
> > diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
> > index 6cd6953e175b..5d72b6ba4e63 100644
> > --- a/Documentation/filesystems/vfs.rst
> > +++ b/Documentation/filesystems/vfs.rst
> > @@ -274,6 +274,8 @@ or bottom half).
> > This is specifically for the inode itself being marked dirty,
> > not its data. If the update needs to be persisted by fdatasync(),
> > then I_DIRTY_DATASYNC will be set in the flags argument.
> > + If the inode has dirty timestamp and lazytime is enabled
> > + I_DIRTY_TIME will be set in the flags.
>
> The new sentence is not always true, since with this patch if
> __mark_inode_dirty(I_DIRTY_INODE) is called twice on an inode that has
> I_DIRTY_TIME, the second call will no longer include I_DIRTY_TIME -- even though
> the inode still has dirty timestamps. Please be super clear about what the
> flags actually mean -- I'm still struggling to understand this patch...

Let me chime in here because I was the one who suggested the solution to
Lukas. There are two different things (which is why this is confusing I
guess):

1) I_DIRTY_TIME in the inode->i_state should mean: struct inode has times
updated after we last called ->dirty_inode() callback. Hence
inode_is_dirtytime_only() as well as the chunk:
/* I_DIRTY_INODE supersedes I_DIRTY_TIME. */
flags &= ~I_DIRTY_TIME;
you mention in the previous email are compatible with this meaning AFAICT.

2) I_DIRTY_TIME flag passed to ->dirty_inode() callback. This is admittedly
bit of a hack. Currently XFS relies on the fact that the only time its
->dirty_inode() callback needs to do anything is when VFS decides it is
time to writeback timestamps and XFS detects this situation by checking for
I_DIRTY_TIME in inode->i_state. Now to fix the race, we need to first clear
I_DIRTY_TIME in inode->i_state and only then call the ->dirty_inode()
callback (otherwise timestamp update can get lost). So the solution I've
suggested was to propagate the information "timestamp update needed" to XFS
through I_DIRTY_TIME in flags passed to ->dirty_inode().

I hope things are clearer now.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2022-08-16 12:00:39

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v3 2/3] fs: record I_DIRTY_TIME even if inode already has I_DIRTY_INODE

On Fri 12-08-22 11:42:21, Eric Biggers wrote:
> On Fri, Aug 12, 2022 at 02:37:26PM +0200, Lukas Czerner wrote:
> > Currently the I_DIRTY_TIME will never get set if the inode already has
> > I_DIRTY_INODE with assumption that it supersedes I_DIRTY_TIME. That's
> > true, however ext4 will only update the on-disk inode in
> > ->dirty_inode(), not on actual writeback. As a result if the inode
> > already has I_DIRTY_INODE state by the time we get to
> > __mark_inode_dirty() only with I_DIRTY_TIME, the time was already filled
> > into on-disk inode and will not get updated until the next I_DIRTY_INODE
> > update, which might never come if we crash or get a power failure.
> >
> > The problem can be reproduced on ext4 by running xfstest generic/622
> > with -o iversion mount option.
> >
> > Fix it by allowing I_DIRTY_TIME to be set even if the inode already has
> > I_DIRTY_INODE. Also make sure that the case is properly handled in
> > writeback_single_inode() as well. Additionally changes in
> > xfs_fs_dirty_inode() was made to accommodate for I_DIRTY_TIME in flag.
> >
> > Thanks Jan Kara for suggestions on how to make this work properly.
> >
> > Cc: Dave Chinner <[email protected]>
> > Cc: Christoph Hellwig <[email protected]>
> > Signed-off-by: Lukas Czerner <[email protected]>
> > Suggested-by: Jan Kara <[email protected]>
>
> Sorry for so many separate emails. One more thought: isn't there a much more
> straightforward fix to this bug that wouldn't require changing the semantics of
> the inode flags: on __mark_inode_dirty(I_DIRTY_TIME), if the inode already has
> i_state & I_DIRTY_INODE, just call ->dirty_inode with i_state & I_DIRTY_INODE?
> That would fix the bug by making the filesystem update the on-disk inode.

This is a good question and I was also considering this when we first
discussed the problem with Lukas. It should fix the bug for ext4 but
eventually I've decided against this because XFS would still need something
else to fix the problem (see my previous email) and for ext4 it seemed as
unnecessary overhead (see below).

> Perhaps you aren't doing that in order to strictly maintain the semantics of
> 'lazytime', where timestamp updates are only persisted at certain times? Is
> this useful even in the short window of time that an inode is dirty?

The result of this for ext4 will be that after the inode is dirtied for
some reason, we will be logging every timestamp update to the journal
(effectively disabling lazytime for the inode) for the 30s time window that
the inode stays dirty before writeback code decides to do writeback (which
just results in clearing the I_DIRTY_INODE flag for ext4). Not too bad I
guess but I'd prefer to avoid this overhead.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2022-08-16 12:09:39

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v4 3/3] ext4: unconditionally enable the i_version counter

On Fri 12-08-22 14:37:27, Lukas Czerner wrote:
> From: Jeff Layton <[email protected]>
>
> The original i_version implementation was pretty expensive, requiring a
> log flush on every change. Because of this, it was gated behind a mount
> option (implemented via the MS_I_VERSION mountoption flag).
>
> Commit ae5e165d855d (fs: new API for handling inode->i_version) made the
> i_version flag much less expensive, so there is no longer a performance
> penalty from enabling it. xfs and btrfs already enable it
> unconditionally when the on-disk format can support it.
>
> Have ext4 ignore the SB_I_VERSION flag, and just enable it
> unconditionally. While we're in here, remove the handling of
> Opt_i_version as well, since we're almost to 5.20 anyway.
>
> Ideally, we'd couple this change with a way to disable the i_version
> counter (just in case), but the way the iversion mount option was
> implemented makes that difficult to do. We'd need to add a new mount
> option altogether or do something with tune2fs. That's probably best
> left to later patches if it turns out to be needed.
>
> [ Removed leftover bits of i_version from ext4_apply_options() since it
> now can't ever be set in ctx->mask_s_flags -- lczerner ]
>
> Cc: Dave Chinner <[email protected]>
> Cc: Benjamin Coddington <[email protected]>
> Cc: Christoph Hellwig <[email protected]>
> Cc: Darrick J. Wong <[email protected]>
> Signed-off-by: Jeff Layton <[email protected]>
> Signed-off-by: Lukas Czerner <[email protected]>

Looks good to me. Feel free to add:

Reviewed-by: Jan Kara <[email protected]>

Honza

> ---
> v3: Removed leftover bits of i_version from ext4_apply_options
> v4: no change
>
> fs/ext4/inode.c | 5 ++---
> fs/ext4/super.c | 21 ++++-----------------
> 2 files changed, 6 insertions(+), 20 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 2a220be34caa..c77d40f05763 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -5425,7 +5425,7 @@ int ext4_setattr(struct user_namespace *mnt_userns, struct dentry *dentry,
> return -EINVAL;
> }
>
> - if (IS_I_VERSION(inode) && attr->ia_size != inode->i_size)
> + if (attr->ia_size != inode->i_size)
> inode_inc_iversion(inode);
>
> if (shrink) {
> @@ -5735,8 +5735,7 @@ int ext4_mark_iloc_dirty(handle_t *handle,
> * ea_inodes are using i_version for storing reference count, don't
> * mess with it
> */
> - if (IS_I_VERSION(inode) &&
> - !(EXT4_I(inode)->i_flags & EXT4_EA_INODE_FL))
> + if (!(EXT4_I(inode)->i_flags & EXT4_EA_INODE_FL))
> inode_inc_iversion(inode);
>
> /* the do_update_inode consumes one bh->b_count */
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 9a66abcca1a8..1c953f6d400e 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -1585,7 +1585,7 @@ enum {
> Opt_inlinecrypt,
> Opt_usrjquota, Opt_grpjquota, Opt_quota,
> Opt_noquota, Opt_barrier, Opt_nobarrier, Opt_err,
> - Opt_usrquota, Opt_grpquota, Opt_prjquota, Opt_i_version,
> + Opt_usrquota, Opt_grpquota, Opt_prjquota,
> Opt_dax, Opt_dax_always, Opt_dax_inode, Opt_dax_never,
> Opt_stripe, Opt_delalloc, Opt_nodelalloc, Opt_warn_on_error,
> Opt_nowarn_on_error, Opt_mblk_io_submit, Opt_debug_want_extra_isize,
> @@ -1694,7 +1694,6 @@ static const struct fs_parameter_spec ext4_param_specs[] = {
> fsparam_flag ("barrier", Opt_barrier),
> fsparam_u32 ("barrier", Opt_barrier),
> fsparam_flag ("nobarrier", Opt_nobarrier),
> - fsparam_flag ("i_version", Opt_i_version),
> fsparam_flag ("dax", Opt_dax),
> fsparam_enum ("dax", Opt_dax_type, ext4_param_dax),
> fsparam_u32 ("stripe", Opt_stripe),
> @@ -2140,11 +2139,6 @@ static int ext4_parse_param(struct fs_context *fc, struct fs_parameter *param)
> case Opt_abort:
> ctx_set_mount_flag(ctx, EXT4_MF_FS_ABORTED);
> return 0;
> - case Opt_i_version:
> - ext4_msg(NULL, KERN_WARNING, deprecated_msg, param->key, "5.20");
> - ext4_msg(NULL, KERN_WARNING, "Use iversion instead\n");
> - ctx_set_flags(ctx, SB_I_VERSION);
> - return 0;
> case Opt_inlinecrypt:
> #ifdef CONFIG_FS_ENCRYPTION_INLINE_CRYPT
> ctx_set_flags(ctx, SB_INLINECRYPT);
> @@ -2814,14 +2808,6 @@ static void ext4_apply_options(struct fs_context *fc, struct super_block *sb)
> sb->s_flags &= ~ctx->mask_s_flags;
> sb->s_flags |= ctx->vals_s_flags;
>
> - /*
> - * i_version differs from common mount option iversion so we have
> - * to let vfs know that it was set, otherwise it would get cleared
> - * on remount
> - */
> - if (ctx->mask_s_flags & SB_I_VERSION)
> - fc->sb_flags |= SB_I_VERSION;
> -
> #define APPLY(X) ({ if (ctx->spec & EXT4_SPEC_##X) sbi->X = ctx->X; })
> APPLY(s_commit_interval);
> APPLY(s_stripe);
> @@ -2970,8 +2956,6 @@ static int _ext4_show_options(struct seq_file *seq, struct super_block *sb,
> SEQ_OPTS_PRINT("min_batch_time=%u", sbi->s_min_batch_time);
> if (nodefs || sbi->s_max_batch_time != EXT4_DEF_MAX_BATCH_TIME)
> SEQ_OPTS_PRINT("max_batch_time=%u", sbi->s_max_batch_time);
> - if (sb->s_flags & SB_I_VERSION)
> - SEQ_OPTS_PUTS("i_version");
> if (nodefs || sbi->s_stripe)
> SEQ_OPTS_PRINT("stripe=%lu", sbi->s_stripe);
> if (nodefs || EXT4_MOUNT_DATA_FLAGS &
> @@ -4640,6 +4624,9 @@ static int __ext4_fill_super(struct fs_context *fc, struct super_block *sb)
> sb->s_flags = (sb->s_flags & ~SB_POSIXACL) |
> (test_opt(sb, POSIX_ACL) ? SB_POSIXACL : 0);
>
> + /* i_version is always enabled now */
> + sb->s_flags |= SB_I_VERSION;
> +
> if (le32_to_cpu(es->s_rev_level) == EXT4_GOOD_OLD_REV &&
> (ext4_has_compat_features(sb) ||
> ext4_has_ro_compat_features(sb) ||
> --
> 2.37.1
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2022-08-16 12:11:11

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v3 1/3] ext4: don't increase iversion counter for ea_inodes

On Fri 12-08-22 14:42:36, Jeff Layton wrote:
> On Fri, 2022-08-12 at 14:37 +0200, Lukas Czerner wrote:
> > ea_inodes are using i_version for storing part of the reference count so
> > we really need to leave it alone.
> >
> > The problem can be reproduced by xfstest ext4/026 when iversion is
> > enabled. Fix it by not calling inode_inc_iversion() for EXT4_EA_INODE_FL
> > inodes in ext4_mark_iloc_dirty().
> >
> > Signed-off-by: Lukas Czerner <[email protected]>
> > Reviewed-by: Jan Kara <[email protected]>
> > Reviewed-by: Jeff Layton <[email protected]>
> > ---
> > v2, v3: no change
> >
> > fs/ext4/inode.c | 7 ++++++-
> > 1 file changed, 6 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > index 601214453c3a..2a220be34caa 100644
> > --- a/fs/ext4/inode.c
> > +++ b/fs/ext4/inode.c
> > @@ -5731,7 +5731,12 @@ int ext4_mark_iloc_dirty(handle_t *handle,
> > }
> > ext4_fc_track_inode(handle, inode);
> >
> > - if (IS_I_VERSION(inode))
> > + /*
> > + * ea_inodes are using i_version for storing reference count, don't
> > + * mess with it
> > + */
> > + if (IS_I_VERSION(inode) &&
> > + !(EXT4_I(inode)->i_flags & EXT4_EA_INODE_FL))
> > inode_inc_iversion(inode);
> >
> > /* the do_update_inode consumes one bh->b_count */
>
>
> I've spent some time writing tests for the i_version counter (still
> quite rough right now), and what I've found is that this particular
> inode_inc_iversion results in the counter being bumped on _reads_ as
> well as writes, due to the atime changing. This call to
> inode_inc_iversion seems to make no sense, as we aren't bumping the
> mtime here.
>
> I'm still working on and testing this, but I think we'll probably just
> want to remove this inode_inc_iversion entirely, and leave the i_version
> bumping for normal files to happen when the timestamps are updated. So
> far, my testing seems to indicate that that does the right thing.

I agree that inode_inc_iversion() may be overly agressive here but where
else does get iversion updated for things like inode owner update or
permission changes?

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2022-08-16 12:20:31

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH v3 1/3] ext4: don't increase iversion counter for ea_inodes

On Tue, 2022-08-16 at 13:52 +0200, Jan Kara wrote:
> On Fri 12-08-22 14:42:36, Jeff Layton wrote:
> > On Fri, 2022-08-12 at 14:37 +0200, Lukas Czerner wrote:
> > > ea_inodes are using i_version for storing part of the reference count so
> > > we really need to leave it alone.
> > >
> > > The problem can be reproduced by xfstest ext4/026 when iversion is
> > > enabled. Fix it by not calling inode_inc_iversion() for EXT4_EA_INODE_FL
> > > inodes in ext4_mark_iloc_dirty().
> > >
> > > Signed-off-by: Lukas Czerner <[email protected]>
> > > Reviewed-by: Jan Kara <[email protected]>
> > > Reviewed-by: Jeff Layton <[email protected]>
> > > ---
> > > v2, v3: no change
> > >
> > > fs/ext4/inode.c | 7 ++++++-
> > > 1 file changed, 6 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > > index 601214453c3a..2a220be34caa 100644
> > > --- a/fs/ext4/inode.c
> > > +++ b/fs/ext4/inode.c
> > > @@ -5731,7 +5731,12 @@ int ext4_mark_iloc_dirty(handle_t *handle,
> > > }
> > > ext4_fc_track_inode(handle, inode);
> > >
> > > - if (IS_I_VERSION(inode))
> > > + /*
> > > + * ea_inodes are using i_version for storing reference count, don't
> > > + * mess with it
> > > + */
> > > + if (IS_I_VERSION(inode) &&
> > > + !(EXT4_I(inode)->i_flags & EXT4_EA_INODE_FL))
> > > inode_inc_iversion(inode)
> > >
> > > /* the do_update_inode consumes one bh->b_count */
> >
> >
> > I've spent some time writing tests for the i_version counter (still
> > quite rough right now), and what I've found is that this particular
> > inode_inc_iversion results in the counter being bumped on _reads_ as
> > well as writes, due to the atime changing. This call to
> > inode_inc_iversion seems to make no sense, as we aren't bumping the
> > mtime here.
> >
> > I'm still working on and testing this, but I think we'll probably just
> > want to remove this inode_inc_iversion entirely, and leave the i_version
> > bumping for normal files to happen when the timestamps are updated. So
> > far, my testing seems to indicate that that does the right thing.
>
> I agree that inode_inc_iversion() may be overly agressive here but where
> else does get iversion updated for things like inode owner update or
> permission changes?
>
> Honza

If we remove it here, then both the setattr and setxattr codepaths will
need to explicitly bump the iversion counter. Note that we update the
ctime in those paths too, so that gives us a guidepost as to when we
should update i_version. xfs will need similar changes, but btrfs turns
out to already do the right thing.

I'm planning to post my latest patches in just a bit.
--
Jeff Layton <[email protected]>

2022-08-21 06:27:26

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v3 2/3] fs: record I_DIRTY_TIME even if inode already has I_DIRTY_INODE

On Tue, Aug 16, 2022 at 01:21:24PM +0200, Jan Kara wrote:
> 2) I_DIRTY_TIME flag passed to ->dirty_inode() callback. This is admittedly
> bit of a hack. Currently XFS relies on the fact that the only time its
> ->dirty_inode() callback needs to do anything is when VFS decides it is
> time to writeback timestamps and XFS detects this situation by checking for
> I_DIRTY_TIME in inode->i_state. Now to fix the race, we need to first clear
> I_DIRTY_TIME in inode->i_state and only then call the ->dirty_inode()
> callback (otherwise timestamp update can get lost). So the solution I've
> suggested was to propagate the information "timestamp update needed" to XFS
> through I_DIRTY_TIME in flags passed to ->dirty_inode().

Maybe we should just add a separate update_lazy_time method to make this
a little more clear?

2022-08-22 08:34:24

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v3 2/3] fs: record I_DIRTY_TIME even if inode already has I_DIRTY_INODE

On Sat 20-08-22 23:14:37, Christoph Hellwig wrote:
> On Tue, Aug 16, 2022 at 01:21:24PM +0200, Jan Kara wrote:
> > 2) I_DIRTY_TIME flag passed to ->dirty_inode() callback. This is admittedly
> > bit of a hack. Currently XFS relies on the fact that the only time its
> > ->dirty_inode() callback needs to do anything is when VFS decides it is
> > time to writeback timestamps and XFS detects this situation by checking for
> > I_DIRTY_TIME in inode->i_state. Now to fix the race, we need to first clear
> > I_DIRTY_TIME in inode->i_state and only then call the ->dirty_inode()
> > callback (otherwise timestamp update can get lost). So the solution I've
> > suggested was to propagate the information "timestamp update needed" to XFS
> > through I_DIRTY_TIME in flags passed to ->dirty_inode().
>
> Maybe we should just add a separate update_lazy_time method to make this
> a little more clear?

Yes, we could do that if people prefer this. Although I'd say that good
documentation at the place in __mark_inode_dirty() where this gets used and
in documentation of .dirty_inode might clear the confusion as well.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR