Currently I_DIRTY_TIME will never get set if the inode already has
I_DIRTY_INODE, on the assumption that I_DIRTY_INODE supersedes
I_DIRTY_TIME. That's true, however ext4 will only update the on-disk
inode in ->dirty_inode(), not on actual writeback. As a result, if the
inode already has I_DIRTY_INODE set by the time we get to
__mark_inode_dirty() with only I_DIRTY_TIME, the previous timestamp was
already copied into the on-disk inode and the new one will not be until
the next I_DIRTY_INODE update, which might never come if we crash or get
a power failure.

The problem can be reproduced on ext4 by running xfstest generic/622
with the -o iversion mount option.

Fix it by allowing I_DIRTY_TIME to be set even if the inode already has
I_DIRTY_INODE. Also make sure that the case is properly handled in
writeback_single_inode(). Additionally, xfs_fs_dirty_inode() was changed
to accommodate I_DIRTY_TIME in flag.

Thanks to Jan Kara for suggestions on how to make this work properly.

Cc: Dave Chinner <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Signed-off-by: Lukas Czerner <[email protected]>
Suggested-by: Jan Kara <[email protected]>
---
v2: Reworked according to suggestions from Jan
fs/fs-writeback.c | 34 ++++++++++++++++++++++------------
fs/xfs/xfs_super.c | 3 ++-
include/linux/fs.h | 6 +++---
3 files changed, 27 insertions(+), 16 deletions(-)
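
For reviewers, the requeue decision made by the writeback_single_inode()
hunk can be modelled in userspace roughly as follows. This is only a
sketch: the bit values, the enum and requeue_after_writeback() are
made up for illustration, and locking and the dirtied_when update are
left out.

```c
#include <assert.h>

/* Illustrative bit values only; the real ones live in include/linux/fs.h. */
#define I_DIRTY_SYNC     (1u << 0)
#define I_DIRTY_DATASYNC (1u << 1)
#define I_DIRTY_PAGES    (1u << 2)
#define I_DIRTY_TIME     (1u << 11)
#define I_SYNC_QUEUED    (1u << 17)
#define I_DIRTY_INODE    (I_DIRTY_SYNC | I_DIRTY_DATASYNC)
#define I_DIRTY          (I_DIRTY_INODE | I_DIRTY_PAGES)
#define I_DIRTY_ALL      (I_DIRTY | I_DIRTY_TIME)

/* Hypothetical names for the possible outcomes after writeback. */
enum toy_list {
	LIST_ATTACHED,		/* inode_cgwb_move_to_attached() */
	LIST_REDIRTY_TAIL,	/* redirty_tail_locked() */
	LIST_B_DIRTY_TIME,	/* moved to wb->b_dirty_time */
	LIST_UNCHANGED,		/* left where it is */
};

/* Mirrors the order of the checks in the patched hunk. */
static enum toy_list requeue_after_writeback(unsigned int i_state)
{
	if (!(i_state & I_DIRTY_ALL))
		return LIST_ATTACHED;
	if (i_state & I_SYNC_QUEUED)
		return LIST_UNCHANGED;
	if (i_state & I_DIRTY)
		return LIST_REDIRTY_TAIL;
	if (i_state & I_DIRTY_TIME)
		return LIST_B_DIRTY_TIME;	/* the case added by this patch */
	return LIST_UNCHANGED;
}
```

The interesting case is an inode that is left with only I_DIRTY_TIME
after writeback: before this patch it fell through all the checks, with
the patch it ends up on b_dirty_time.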
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 05221366a16d..638dbf143727 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1718,9 +1718,14 @@ static int writeback_single_inode(struct inode *inode,
*/
if (!(inode->i_state & I_DIRTY_ALL))
inode_cgwb_move_to_attached(inode, wb);
- else if (!(inode->i_state & I_SYNC_QUEUED) &&
- (inode->i_state & I_DIRTY))
- redirty_tail_locked(inode, wb);
+ else if (!(inode->i_state & I_SYNC_QUEUED)) {
+ if ((inode->i_state & I_DIRTY))
+ redirty_tail_locked(inode, wb);
+ else if (inode->i_state & I_DIRTY_TIME) {
+ inode->dirtied_when = jiffies;
+ inode_io_list_move_locked(inode, wb, &wb->b_dirty_time);
+ }
+ }
spin_unlock(&wb->list_lock);
inode_sync_complete(inode);
@@ -2369,6 +2374,17 @@ void __mark_inode_dirty(struct inode *inode, int flags)
trace_writeback_mark_inode_dirty(inode, flags);
if (flags & I_DIRTY_INODE) {
+
+ /* Inode timestamp update will piggyback on this dirtying */
+ if (inode->i_state & I_DIRTY_TIME) {
+ spin_lock(&inode->i_lock);
+ if (inode->i_state & I_DIRTY_TIME) {
+ inode->i_state &= ~I_DIRTY_TIME;
+ flags |= I_DIRTY_TIME;
+ }
+ spin_unlock(&inode->i_lock);
+ }
+
/*
* Notify the filesystem about the inode being dirtied, so that
* (if needed) it can update on-disk fields and journal the
@@ -2378,7 +2394,8 @@ void __mark_inode_dirty(struct inode *inode, int flags)
*/
trace_writeback_dirty_inode_start(inode, flags);
if (sb->s_op->dirty_inode)
- sb->s_op->dirty_inode(inode, flags & I_DIRTY_INODE);
+ sb->s_op->dirty_inode(inode,
+ flags & (I_DIRTY_INODE | I_DIRTY_TIME));
trace_writeback_dirty_inode(inode, flags);
/* I_DIRTY_INODE supersedes I_DIRTY_TIME. */
@@ -2399,21 +2416,15 @@ void __mark_inode_dirty(struct inode *inode, int flags)
*/
smp_mb();
- if (((inode->i_state & flags) == flags) ||
- (dirtytime && (inode->i_state & I_DIRTY_INODE)))
+ if ((inode->i_state & flags) == flags)
return;
spin_lock(&inode->i_lock);
- if (dirtytime && (inode->i_state & I_DIRTY_INODE))
- goto out_unlock_inode;
if ((inode->i_state & flags) != flags) {
const int was_dirty = inode->i_state & I_DIRTY;
inode_attach_wb(inode, NULL);
- /* I_DIRTY_INODE supersedes I_DIRTY_TIME. */
- if (flags & I_DIRTY_INODE)
- inode->i_state &= ~I_DIRTY_TIME;
inode->i_state |= flags;
/*
@@ -2486,7 +2497,6 @@ void __mark_inode_dirty(struct inode *inode, int flags)
out_unlock:
if (wb)
spin_unlock(&wb->list_lock);
-out_unlock_inode:
spin_unlock(&inode->i_lock);
}
EXPORT_SYMBOL(__mark_inode_dirty);
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index aa977c7ea370..cff05a4771b5 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -658,7 +658,8 @@ xfs_fs_dirty_inode(
if (!(inode->i_sb->s_flags & SB_LAZYTIME))
return;
- if (flag != I_DIRTY_SYNC || !(inode->i_state & I_DIRTY_TIME))
+ if ((flag & ~I_DIRTY_TIME) != I_DIRTY_SYNC ||
+ !((inode->i_state | flag) & I_DIRTY_TIME))
return;
if (xfs_trans_alloc(mp, &M_RES(mp)->tr_fsyncts, 0, 0, 0, &tp))
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 9ad5e3520fae..2243797badf2 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2245,9 +2245,9 @@ static inline void kiocb_clone(struct kiocb *kiocb, struct kiocb *kiocb_src,
* lazytime mount option is enabled. We keep track of this
* separately from I_DIRTY_SYNC in order to implement
* lazytime. This gets cleared if I_DIRTY_INODE
- * (I_DIRTY_SYNC and/or I_DIRTY_DATASYNC) gets set. I.e.
- * either I_DIRTY_TIME *or* I_DIRTY_INODE can be set in
- * i_state, but not both. I_DIRTY_PAGES may still be set.
+ * (I_DIRTY_SYNC and/or I_DIRTY_DATASYNC) gets set. But
+ * I_DIRTY_TIME can still be set if I_DIRTY_SYNC is already
+ * in place.
* I_NEW Serves as both a mutex and completion notification.
* New inodes set I_NEW. If two processes both create
* the same inode, one of them will release its inode and
--
2.37.1
On Wed, Aug 03, 2022 at 12:53:39PM +0200, Lukas Czerner wrote:
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 9ad5e3520fae..2243797badf2 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2245,9 +2245,9 @@ static inline void kiocb_clone(struct kiocb *kiocb, struct kiocb *kiocb_src,
> * The inode itself only has dirty timestamps, and the
> * lazytime mount option is enabled. We keep track of this
> * separately from I_DIRTY_SYNC in order to implement
> * lazytime. This gets cleared if I_DIRTY_INODE
> - * (I_DIRTY_SYNC and/or I_DIRTY_DATASYNC) gets set. I.e.
> - * either I_DIRTY_TIME *or* I_DIRTY_INODE can be set in
> - * i_state, but not both. I_DIRTY_PAGES may still be set.
> + * (I_DIRTY_SYNC and/or I_DIRTY_DATASYNC) gets set. But
> + * I_DIRTY_TIME can still be set if I_DIRTY_SYNC is already
> + * in place.
I'm still having a hard time understanding the new semantics. The first
sentence above needs to be updated since I_DIRTY_TIME no longer means "the inode
itself only has dirty timestamps", right?
Also, have you checked all the places that I_DIRTY_TIME is used and verified
they do the right thing now? What about inode_is_dirtytime_only()?
Also what is the precise meaning of the flags argument to ->dirty_inode now?
sb->s_op->dirty_inode(inode,
flags & (I_DIRTY_INODE | I_DIRTY_TIME));
Note that dirty_inode is documented in Documentation/filesystems/vfs.rst.
- Eric
On Fri, Aug 05, 2022 at 01:05:45AM -0700, Eric Biggers wrote:
> On Wed, Aug 03, 2022 at 12:53:39PM +0200, Lukas Czerner wrote:
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 9ad5e3520fae..2243797badf2 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -2245,9 +2245,9 @@ static inline void kiocb_clone(struct kiocb *kiocb, struct kiocb *kiocb_src,
> > * The inode itself only has dirty timestamps, and the
> > * lazytime mount option is enabled. We keep track of this
> > * separately from I_DIRTY_SYNC in order to implement
> > * lazytime. This gets cleared if I_DIRTY_INODE
> > - * (I_DIRTY_SYNC and/or I_DIRTY_DATASYNC) gets set. I.e.
> > - * either I_DIRTY_TIME *or* I_DIRTY_INODE can be set in
> > - * i_state, but not both. I_DIRTY_PAGES may still be set.
> > + * (I_DIRTY_SYNC and/or I_DIRTY_DATASYNC) gets set. But
> > + * I_DIRTY_TIME can still be set if I_DIRTY_SYNC is already
> > + * in place.
>
> I'm still having a hard time understanding the new semantics. The first
> sentence above needs to be updated since I_DIRTY_TIME no longer means "the inode
> itself only has dirty timestamps", right?
The problem is that it was always assumed that I_DIRTY_INODE supersedes
I_DIRTY_TIME and so it would get cleared in __mark_inode_dirty() when we
have I_DIRTY_INODE. That's true: we call sb->s_op->dirty_inode(), the
time update gets pushed into the on-disk inode structure, I_DIRTY_TIME
is cleared and the inode gets queued for writeback.

Any subsequent dirtying with I_DIRTY_TIME gets ignored simply because
I_DIRTY_INODE is already set in i_state. But in ext4 this time update
will never get pushed into the on-disk inode and there is no
I_DIRTY_TIME, so once the writeback is done we've lost all the
I_DIRTY_TIME updates that happened in between, even if there was a sync.

Now, we still clear I_DIRTY_TIME when we get I_DIRTY_INODE, but any
subsequent I_DIRTY_TIME-only updates won't be ignored and we set the
flag in i_state. After the writeback is done the inode will be moved to
the b_dirty_time list.
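
To make the sequence concrete, here is a minimal userspace sketch of
the scenario. It is not kernel code: struct toy_inode, the toy_*()
helpers and the bit values are all hypothetical, and the filesystem is
assumed to behave like ext4 as described above, i.e. the timestamp only
reaches the on-disk copy from ->dirty_inode().

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit values only; the real ones live in include/linux/fs.h. */
#define I_DIRTY_SYNC     (1u << 0)
#define I_DIRTY_DATASYNC (1u << 1)
#define I_DIRTY_TIME     (1u << 11)
#define I_DIRTY_INODE    (I_DIRTY_SYNC | I_DIRTY_DATASYNC)

/* Toy inode: one in-memory timestamp and its copy in the on-disk inode. */
struct toy_inode {
	unsigned int i_state;
	long mem_time;
	long disk_time;
};

/* ext4-like ->dirty_inode(): the only place the on-disk copy is updated. */
static void toy_dirty_inode(struct toy_inode *inode)
{
	inode->disk_time = inode->mem_time;
}

/* Very reduced model of __mark_inode_dirty(), old and patched behaviour. */
static void toy_mark_dirty(struct toy_inode *inode, unsigned int flags,
			   bool patched)
{
	if (flags & I_DIRTY_INODE) {
		/* I_DIRTY_INODE supersedes I_DIRTY_TIME in both versions. */
		inode->i_state &= ~I_DIRTY_TIME;
		toy_dirty_inode(inode);
		flags &= ~I_DIRTY_TIME;
	} else if (!patched && (flags & I_DIRTY_TIME) &&
		   (inode->i_state & I_DIRTY_INODE)) {
		return;	/* old behaviour: timestamp-only dirtying is ignored */
	}
	inode->i_state |= flags;
}

/* Writeback of the inode itself: clears I_DIRTY_INODE, keeps I_DIRTY_TIME. */
static void toy_writeback(struct toy_inode *inode)
{
	inode->i_state &= ~I_DIRTY_INODE;
}

/* True if the newest timestamp is neither on disk nor flagged as dirty. */
static bool toy_update_lost(const struct toy_inode *inode)
{
	return inode->disk_time != inode->mem_time &&
	       !(inode->i_state & I_DIRTY_TIME);
}

/* Dirty the inode, update the timestamp, write back; report the outcome. */
static bool toy_scenario(bool patched)
{
	struct toy_inode inode = { 0, 1, 0 };

	toy_mark_dirty(&inode, I_DIRTY_SYNC, patched);	/* disk_time = 1 */
	inode.mem_time = 2;				/* lazytime update */
	toy_mark_dirty(&inode, I_DIRTY_TIME, patched);
	toy_writeback(&inode);
	return toy_update_lost(&inode);
}
```

With the old behaviour the inode ends up clean while the on-disk
timestamp is stale; with the patched behaviour I_DIRTY_TIME survives
the writeback, so the update is still tracked.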
So I am not sure how you would like it to be reworded. Would simply
removing the 'only' be OK?
>
> Also, have you checked all the places that I_DIRTY_TIME is used and verified
> they do the right thing now? What about inode_is_dirtytime_only()?
Yes, that's fine, despite the slightly misleading name ;)
>
> Also what is the precise meaning of the flags argument to ->dirty_inode now?
>
> sb->s_op->dirty_inode(inode,
> flags & (I_DIRTY_INODE | I_DIRTY_TIME));
>
> Note that dirty_inode is documented in Documentation/filesystems/vfs.rst.
Don't know. It already doesn't mention I_DIRTY_SYNC, which can be there
as well. Additionally, it can have I_DIRTY_TIME to inform the fs that we
have a dirty timestamp as well (in case of lazytime).
Perhaps we can add:
If the inode has a dirty timestamp and lazytime is enabled, I_DIRTY_TIME
will be set in the flags.
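
As a rough illustration of what the filesystem would then see, the
following userspace sketch models the I_DIRTY_INODE branch of the
patched __mark_inode_dirty(). The helper dirty_inode_arg() and the bit
values are hypothetical, locking is left out, and a return value of 0
just stands for "->dirty_inode() is not called".

```c
#include <assert.h>

/* Illustrative bit values only; the real ones live in include/linux/fs.h. */
#define I_DIRTY_SYNC     (1u << 0)
#define I_DIRTY_DATASYNC (1u << 1)
#define I_DIRTY_PAGES    (1u << 2)
#define I_DIRTY_TIME     (1u << 11)
#define I_DIRTY_INODE    (I_DIRTY_SYNC | I_DIRTY_DATASYNC)

/*
 * Hypothetical model: a pending I_DIRTY_TIME is moved from *i_state into
 * flags, and the returned value is what ->dirty_inode() would be passed.
 */
static unsigned int dirty_inode_arg(unsigned int *i_state, unsigned int flags)
{
	if (!(flags & I_DIRTY_INODE))
		return 0;	/* ->dirty_inode() is not called at all */

	/* Inode timestamp update piggybacks on this dirtying. */
	if (*i_state & I_DIRTY_TIME) {
		*i_state &= ~I_DIRTY_TIME;
		flags |= I_DIRTY_TIME;
	}
	return flags & (I_DIRTY_INODE | I_DIRTY_TIME);
}
```

So the argument is I_DIRTY_SYNC and/or I_DIRTY_DATASYNC, plus
I_DIRTY_TIME whenever a lazytime timestamp update was pending.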
-Lukas
>
> - Eric
>
On Wed, Aug 03, 2022 at 12:53:39PM +0200, Lukas Czerner wrote:
> Currently the I_DIRTY_TIME will never get set if the inode already has
> I_DIRTY_INODE with assumption that it supersedes I_DIRTY_TIME. That's
> true, however ext4 will only update the on-disk inode in
> ->dirty_inode(), not on actual writeback. As a result if the inode
> already has I_DIRTY_INODE state by the time we get to
> __mark_inode_dirty() only with I_DIRTY_TIME, the time was already filled
> into on-disk inode and will not get updated until the next I_DIRTY_INODE
> update, which might never come if we crash or get a power failure.
>
> The problem can be reproduced on ext4 by running xfstest generic/622
> with -o iversion mount option.
>
> Fix it by allowing I_DIRTY_TIME to be set even if the inode already has
> I_DIRTY_INODE. Also make sure that the case is properly handled in
> writeback_single_inode() as well. Additionally, xfs_fs_dirty_inode()
> was changed to accommodate I_DIRTY_TIME in flag.
>
> Thanks Jan Kara for suggestions on how to make this work properly.
>
> Cc: Dave Chinner <[email protected]>
> Cc: Christoph Hellwig <[email protected]>
> Signed-off-by: Lukas Czerner <[email protected]>
> Suggested-by: Jan Kara <[email protected]>
> ---
> v2: Reworked according to suggestions from Jan
....
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index aa977c7ea370..cff05a4771b5 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -658,7 +658,8 @@ xfs_fs_dirty_inode(
>
> if (!(inode->i_sb->s_flags & SB_LAZYTIME))
> return;
> - if (flag != I_DIRTY_SYNC || !(inode->i_state & I_DIRTY_TIME))
> + if ((flag & ~I_DIRTY_TIME) != I_DIRTY_SYNC ||
> + !((inode->i_state | flag) & I_DIRTY_TIME))
> return;
My eyes, they bleed. The dirty time code was already a horrid
abomination, and this makes it worse.
From looking at the code, I cannot work out what the new semantics
for I_DIRTY_TIME and I_DIRTY_SYNC are supposed to be, nor can I work
out what the condition in this new code is supposed to be doing. I
*can't verify it is correct* by reading the code.
Can you please add a comment here explaining the conditions where we
don't have to log a new timestamp update?
Also, if "flag" now contains multiple flags, can you rename it
"flags"?
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Mon, Aug 08, 2022 at 09:08:10AM +1000, Dave Chinner wrote:
> On Wed, Aug 03, 2022 at 12:53:39PM +0200, Lukas Czerner wrote:
> > Currently the I_DIRTY_TIME will never get set if the inode already has
> > I_DIRTY_INODE with assumption that it supersedes I_DIRTY_TIME. That's
> > true, however ext4 will only update the on-disk inode in
> > ->dirty_inode(), not on actual writeback. As a result if the inode
> > already has I_DIRTY_INODE state by the time we get to
> > __mark_inode_dirty() only with I_DIRTY_TIME, the time was already filled
> > into on-disk inode and will not get updated until the next I_DIRTY_INODE
> > update, which might never come if we crash or get a power failure.
> >
> > The problem can be reproduced on ext4 by running xfstest generic/622
> > with -o iversion mount option.
> >
> > Fix it by allowing I_DIRTY_TIME to be set even if the inode already has
> > I_DIRTY_INODE. Also make sure that the case is properly handled in
> > writeback_single_inode() as well. Additionally, xfs_fs_dirty_inode()
> > was changed to accommodate I_DIRTY_TIME in flag.
> >
> > Thanks Jan Kara for suggestions on how to make this work properly.
> >
> > Cc: Dave Chinner <[email protected]>
> > Cc: Christoph Hellwig <[email protected]>
> > Signed-off-by: Lukas Czerner <[email protected]>
> > Suggested-by: Jan Kara <[email protected]>
> > ---
> > v2: Reworked according to suggestions from Jan
>
> ....
>
> > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > index aa977c7ea370..cff05a4771b5 100644
> > --- a/fs/xfs/xfs_super.c
> > +++ b/fs/xfs/xfs_super.c
> > @@ -658,7 +658,8 @@ xfs_fs_dirty_inode(
> >
> > if (!(inode->i_sb->s_flags & SB_LAZYTIME))
> > return;
> > - if (flag != I_DIRTY_SYNC || !(inode->i_state & I_DIRTY_TIME))
> > + if ((flag & ~I_DIRTY_TIME) != I_DIRTY_SYNC ||
> > + !((inode->i_state | flag) & I_DIRTY_TIME))
> > return;
>
> My eyes, they bleed. The dirty time code was already a horrid
> abomination, and this makes it worse.
>
> From looking at the code, I cannot work out what the new semantics
> for I_DIRTY_TIME and I_DIRTY_SYNC are supposed to be, nor can I work
Hi Dave,
please see the other thread for this patch with Eric Biggers, where I
try to explain and give some suggestions for changing the doc. Does it
make sense to you, or am I missing something?
https://marc.info/?l=linux-ext4&m=165970194205621&w=2
> out what the condition in this new code is supposed to be doing. I
> *can't verify it is correct* by reading the code.
The ->dirty_inode() call site needed to be changed to clear I_DIRTY_TIME
from i_state *before* we call ->dirty_inode(), to avoid a race where we
would lose a timestamp update that comes just a little later, after the
->dirty_inode() call with I_DIRTY_INODE.
But that would break xfs, so I decided to keep the condition and loosen
the requirement so that I_DIRTY_TIME can also be set in 'flag', not just
in i_state. Hence the abomination.
>
> Can you please add a comment here explaining the conditions where we
> don't have to log a new timestamp update?
How about something like this?
Only do the timestamp update if the inode is dirty (I_DIRTY_SYNC) and
has a dirty timestamp (I_DIRTY_TIME). I_DIRTY_TIME can be either already
set in i_state, or passed in flags, possibly together with I_DIRTY_SYNC.
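
Spelled out as a standalone predicate, the check would look roughly
like this. It is a userspace sketch only: should_log_timestamp() is a
made-up name, the bit values are illustrative, and the SB_LAZYTIME test
that precedes the condition in xfs_fs_dirty_inode() is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit values only; the real ones live in include/linux/fs.h. */
#define I_DIRTY_SYNC     (1u << 0)
#define I_DIRTY_DATASYNC (1u << 1)
#define I_DIRTY_TIME     (1u << 11)

/*
 * Hypothetical helper with the same logic as the patched condition: log a
 * timestamp update only when the dirtying is I_DIRTY_SYNC (with or without
 * I_DIRTY_TIME) and a dirty timestamp is pending in i_state or in flags.
 */
static bool should_log_timestamp(unsigned int flags, unsigned int i_state)
{
	if ((flags & ~I_DIRTY_TIME) != I_DIRTY_SYNC)
		return false;	/* datasync involved, or no I_DIRTY_SYNC */
	if (!((i_state | flags) & I_DIRTY_TIME))
		return false;	/* no dirty timestamp pending anywhere */
	return true;
}
```

For example, flags == I_DIRTY_SYNC | I_DIRTY_TIME with a clean i_state
passes the check, while flags == I_DIRTY_TIME alone does not.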
>
> Also, if "flag" now contains multiple flags, can you rename it
> "flags"?
Sure, I can do that.
Thanks!
-Lukas
>
> Cheers,
>
> Dave.
>
> --
> Dave Chinner
> [email protected]
>
On Fri, Aug 05, 2022 at 02:23:06PM +0200, Lukas Czerner wrote:
> >
> > Also what is the precise meaning of the flags argument to ->dirty_inode now?
> >
> > sb->s_op->dirty_inode(inode,
> > flags & (I_DIRTY_INODE | I_DIRTY_TIME));
> >
> > Note that dirty_inode is documented in Documentation/filesystems/vfs.rst.
>
> Don't know. It already doesn't mention I_DIRTY_SYNC that can be there as
> well.
Well, it didn't really need to because there were only two possibilities:
datasync and not datasync. This patch changes that.
> Additionally it can have I_DIRTY_TIME to inform the fs we have a
> dirty timestamp as well (in case of lazytime).
This is introduced by this patch.
- Eric