2022-07-28 13:54:43

by Lukas Czerner

[permalink] [raw]
Subject: [PATCH 2/2] fs: record I_DIRTY_TIME even if inode already has I_DIRTY_INODE

Currently the I_DIRTY_TIME will never get set if the inode already has
I_DIRTY_INODE with assumption that it supersedes I_DIRTY_TIME. That's
true, however ext4 will only update the on-disk inode in
->dirty_inode(), not on actual writeback. As a result if the inode
already has I_DIRTY_INODE state by the time we get to
__mark_inode_dirty() only with I_DIRTY_TIME, the time was already filled
into on-disk inode and will not get updated until the next I_DIRTY_INODE
update, which might never come if we crash or get a power failure.

The problem can be reproduced on ext4 by running xfstest generic/622
with -o iversion mount option. Fix it by setting I_DIRTY_TIME even if
the inode already has I_DIRTY_INODE.

Also clear the I_DIRTY_TIME after ->dirty_inode() otherwise it may never
get cleared.

Signed-off-by: Lukas Czerner <[email protected]>
---
fs/fs-writeback.c | 18 +++++++++++++++---
1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 05221366a16d..174f01e6b912 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -2383,6 +2383,11 @@ void __mark_inode_dirty(struct inode *inode, int flags)

/* I_DIRTY_INODE supersedes I_DIRTY_TIME. */
flags &= ~I_DIRTY_TIME;
+ if (inode->i_state & I_DIRTY_TIME) {
+ spin_lock(&inode->i_lock);
+ inode->i_state &= ~I_DIRTY_TIME;
+ spin_unlock(&inode->i_lock);
+ }
} else {
/*
* Else it's either I_DIRTY_PAGES, I_DIRTY_TIME, or nothing.
@@ -2399,13 +2404,20 @@ void __mark_inode_dirty(struct inode *inode, int flags)
*/
smp_mb();

- if (((inode->i_state & flags) == flags) ||
- (dirtytime && (inode->i_state & I_DIRTY_INODE)))
+ if ((inode->i_state & flags) == flags)
return;

spin_lock(&inode->i_lock);
- if (dirtytime && (inode->i_state & I_DIRTY_INODE))
+ if (dirtytime && (inode->i_state & I_DIRTY_INODE)) {
+ /*
+ * We've got a new lazytime update. Make sure it's recorded in
+ * i_state, because the time might have already got updated in
+ * ->dirty_inode() and will not get updated until next
+ * I_DIRTY_INODE update.
+ */
+ inode->i_state |= I_DIRTY_TIME;
goto out_unlock_inode;
+ }
if ((inode->i_state & flags) != flags) {
const int was_dirty = inode->i_state & I_DIRTY;

--
2.35.3


2022-07-28 16:57:10

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 2/2] fs: record I_DIRTY_TIME even if inode already has I_DIRTY_INODE

On Thu 28-07-22 15:39:14, Lukas Czerner wrote:
> Currently the I_DIRTY_TIME will never get set if the inode already has
> I_DIRTY_INODE with assumption that it supersedes I_DIRTY_TIME. That's
> true, however ext4 will only update the on-disk inode in
> ->dirty_inode(), not on actual writeback. As a result if the inode
> already has I_DIRTY_INODE state by the time we get to
> __mark_inode_dirty() only with I_DIRTY_TIME, the time was already filled
> into on-disk inode and will not get updated until the next I_DIRTY_INODE
> update, which might never come if we crash or get a power failure.
>
> The problem can be reproduced on ext4 by running xfstest generic/622
> with -o iversion mount option. Fix it by setting I_DIRTY_TIME even if
> the inode already has I_DIRTY_INODE.

As a datapoint I've checked and XFS has the very same problem as ext4.

> Also clear the I_DIRTY_TIME after ->dirty_inode() otherwise it may never
> get cleared.
>
> Signed-off-by: Lukas Czerner <[email protected]>
> ---
> fs/fs-writeback.c | 18 +++++++++++++++---
> 1 file changed, 15 insertions(+), 3 deletions(-)
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 05221366a16d..174f01e6b912 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -2383,6 +2383,11 @@ void __mark_inode_dirty(struct inode *inode, int flags)
>
> /* I_DIRTY_INODE supersedes I_DIRTY_TIME. */
> flags &= ~I_DIRTY_TIME;
> + if (inode->i_state & I_DIRTY_TIME) {
> + spin_lock(&inode->i_lock);
> + inode->i_state &= ~I_DIRTY_TIME;
> + spin_unlock(&inode->i_lock);
> + }

Hum, so this is a bit dangerous because inode->i_state may be inconsistent
with the writeback list inode is queued in (wb->b_dirty_time) and these two
are supposed to be in sync. So I rather think we need to make sure we go
through the full round of 'update flags and writeback list' below in case
we need to clear I_DIRTY_TIME from inode->i_state.

> } else {
> /*
> * Else it's either I_DIRTY_PAGES, I_DIRTY_TIME, or nothing.
> @@ -2399,13 +2404,20 @@ void __mark_inode_dirty(struct inode *inode, int flags)
> */
> smp_mb();
>
> - if (((inode->i_state & flags) == flags) ||
> - (dirtytime && (inode->i_state & I_DIRTY_INODE)))
> + if ((inode->i_state & flags) == flags)
> return;
>
> spin_lock(&inode->i_lock);
> - if (dirtytime && (inode->i_state & I_DIRTY_INODE))
> + if (dirtytime && (inode->i_state & I_DIRTY_INODE)) {
> + /*
> + * We've got a new lazytime update. Make sure it's recorded in
> + * i_state, because the time might have already got updated in
> + * ->dirty_inode() and will not get updated until next
> + * I_DIRTY_INODE update.
> + */
> + inode->i_state |= I_DIRTY_TIME;
> goto out_unlock_inode;
> + }

So I'm afraid this combination is not properly handled in
writeback_single_inode() where we have at the end:

if (!(inode->i_state & I_DIRTY_ALL))
inode_cgwb_move_to_attached(inode, wb);
else if (!(inode->i_state & I_SYNC_QUEUED) &&
(inode->i_state & I_DIRTY))
redirty_tail_locked(inode, wb);

So inode that had I_DIRTY_SYNC | I_DIRTY_TIME will not be properly refiled
to wb->b_dirty_time list after writeback was done and I_DIRTY_SYNC got
cleared.

So we need to refine it to something like:

if (!(inode->i_state & I_DIRTY_ALL))
inode_cgwb_move_to_attached(inode, wb);
else if (!(inode->i_state & I_SYNC_QUEUED)) {
if (inode->i_state & I_DIRTY) {
redirty_tail_locked(inode, wb);
} else if (inode->i_state & I_DIRTY_TIME) {
inode->dirtied_when = jiffies;
inode_io_list_move_locked(inode, wb, &wb->b_dirty_time);
}
}

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2022-07-29 04:24:19

by Eric Biggers

[permalink] [raw]
Subject: Re: [PATCH 2/2] fs: record I_DIRTY_TIME even if inode already has I_DIRTY_INODE

On Thu, Jul 28, 2022 at 03:39:14PM +0200, Lukas Czerner wrote:
> Currently the I_DIRTY_TIME will never get set if the inode already has
> I_DIRTY_INODE with assumption that it supersedes I_DIRTY_TIME. That's
> true, however ext4 will only update the on-disk inode in
> ->dirty_inode(), not on actual writeback. As a result if the inode
> already has I_DIRTY_INODE state by the time we get to
> __mark_inode_dirty() only with I_DIRTY_TIME, the time was already filled
> into on-disk inode and will not get updated until the next I_DIRTY_INODE
> update, which might never come if we crash or get a power failure.
>
> The problem can be reproduced on ext4 by running xfstest generic/622
> with -o iversion mount option. Fix it by setting I_DIRTY_TIME even if
> the inode already has I_DIRTY_INODE.
>
> Also clear the I_DIRTY_TIME after ->dirty_inode() otherwise it may never
> get cleared.
>
> Signed-off-by: Lukas Czerner <[email protected]>

If you're going to change the meaning of I_* flags, please update the comment in
include/linux/fs.h that describes what they mean.

- Eric

2022-07-29 09:07:31

by Lukas Czerner

[permalink] [raw]
Subject: Re: [PATCH 2/2] fs: record I_DIRTY_TIME even if inode already has I_DIRTY_INODE

On Thu, Jul 28, 2022 at 09:05:11PM -0700, Eric Biggers wrote:
> On Thu, Jul 28, 2022 at 03:39:14PM +0200, Lukas Czerner wrote:
> > Currently the I_DIRTY_TIME will never get set if the inode already has
> > I_DIRTY_INODE with assumption that it supersedes I_DIRTY_TIME. That's
> > true, however ext4 will only update the on-disk inode in
> > ->dirty_inode(), not on actual writeback. As a result if the inode
> > already has I_DIRTY_INODE state by the time we get to
> > __mark_inode_dirty() only with I_DIRTY_TIME, the time was already filled
> > into on-disk inode and will not get updated until the next I_DIRTY_INODE
> > update, which might never come if we crash or get a power failure.
> >
> > The problem can be reproduced on ext4 by running xfstest generic/622
> > with -o iversion mount option. Fix it by setting I_DIRTY_TIME even if
> > the inode already has I_DIRTY_INODE.
> >
> > Also clear the I_DIRTY_TIME after ->dirty_inode() otherwise it may never
> > get cleared.
> >
> > Signed-off-by: Lukas Czerner <[email protected]>
>
> If you're going to change the meaning of I_* flags, please update the comment in
> include/linux/fs.h that describes what they mean.
>
> - Eric

Good point, it does say that I_DIRTY_TIME and I_DIRTY_INODE can't be
both set.

Thanks!
-Lukas

2022-07-29 09:07:31

by Lukas Czerner

[permalink] [raw]
Subject: Re: [PATCH 2/2] fs: record I_DIRTY_TIME even if inode already has I_DIRTY_INODE

On Thu, Jul 28, 2022 at 06:53:32PM +0200, Jan Kara wrote:
> On Thu 28-07-22 15:39:14, Lukas Czerner wrote:
> > Currently the I_DIRTY_TIME will never get set if the inode already has
> > I_DIRTY_INODE with assumption that it supersedes I_DIRTY_TIME. That's
> > true, however ext4 will only update the on-disk inode in
> > ->dirty_inode(), not on actual writeback. As a result if the inode
> > already has I_DIRTY_INODE state by the time we get to
> > __mark_inode_dirty() only with I_DIRTY_TIME, the time was already filled
> > into on-disk inode and will not get updated until the next I_DIRTY_INODE
> > update, which might never come if we crash or get a power failure.
> >
> > The problem can be reproduced on ext4 by running xfstest generic/622
> > with -o iversion mount option. Fix it by setting I_DIRTY_TIME even if
> > the inode already has I_DIRTY_INODE.

Hi Jan,

thanks for th review.

>
> As a datapoint I've checked and XFS has the very same problem as ext4.

Very interesting, I did look at xfs when I was debugging this problem
and wans't able to tell whether they have the same problem or not, but
it certainly can't be reproduced by generic/622. Or at least I can't
reproduce it on XFS.

So I wonder what is XFS doing differently in that case.

>
> > Also clear the I_DIRTY_TIME after ->dirty_inode() otherwise it may never
> > get cleared.
> >
> > Signed-off-by: Lukas Czerner <[email protected]>
> > ---
> > fs/fs-writeback.c | 18 +++++++++++++++---
> > 1 file changed, 15 insertions(+), 3 deletions(-)
> >
> > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > index 05221366a16d..174f01e6b912 100644
> > --- a/fs/fs-writeback.c
> > +++ b/fs/fs-writeback.c
> > @@ -2383,6 +2383,11 @@ void __mark_inode_dirty(struct inode *inode, int flags)
> >
> > /* I_DIRTY_INODE supersedes I_DIRTY_TIME. */
> > flags &= ~I_DIRTY_TIME;
> > + if (inode->i_state & I_DIRTY_TIME) {
> > + spin_lock(&inode->i_lock);
> > + inode->i_state &= ~I_DIRTY_TIME;
> > + spin_unlock(&inode->i_lock);
> > + }
>
> Hum, so this is a bit dangerous because inode->i_state may be inconsistent
> with the writeback list inode is queued in (wb->b_dirty_time) and these two
> are supposed to be in sync. So I rather think we need to make sure we go
> through the full round of 'update flags and writeback list' below in case
> we need to clear I_DIRTY_TIME from inode->i_state.

Ok, so we're clearing I_DIRTY_TIME in __ext4_update_other_inode_time()
which will opportunistically update the time fields for inodes in the
same block as the inode we're doing an update for via
ext4_do_update_inode(). Don't we also need to rewire that differently?

XFS is also clearing it on it's own in log code, but I can't tell if it
has the same problem as you describe here.

>
> > } else {
> > /*
> > * Else it's either I_DIRTY_PAGES, I_DIRTY_TIME, or nothing.
> > @@ -2399,13 +2404,20 @@ void __mark_inode_dirty(struct inode *inode, int flags)
> > */
> > smp_mb();
> >
> > - if (((inode->i_state & flags) == flags) ||
> > - (dirtytime && (inode->i_state & I_DIRTY_INODE)))
> > + if ((inode->i_state & flags) == flags)
> > return;
> >
> > spin_lock(&inode->i_lock);
> > - if (dirtytime && (inode->i_state & I_DIRTY_INODE))
> > + if (dirtytime && (inode->i_state & I_DIRTY_INODE)) {
> > + /*
> > + * We've got a new lazytime update. Make sure it's recorded in
> > + * i_state, because the time might have already got updated in
> > + * ->dirty_inode() and will not get updated until next
> > + * I_DIRTY_INODE update.
> > + */
> > + inode->i_state |= I_DIRTY_TIME;
> > goto out_unlock_inode;
> > + }
>
> So I'm afraid this combination is not properly handled in
> writeback_single_inode() where we have at the end:
>
> if (!(inode->i_state & I_DIRTY_ALL))
> inode_cgwb_move_to_attached(inode, wb);
> else if (!(inode->i_state & I_SYNC_QUEUED) &&
> (inode->i_state & I_DIRTY))
> redirty_tail_locked(inode, wb);
>
> So inode that had I_DIRTY_SYNC | I_DIRTY_TIME will not be properly refiled
> to wb->b_dirty_time list after writeback was done and I_DIRTY_SYNC got
> cleared.
>
> So we need to refine it to something like:
>
> if (!(inode->i_state & I_DIRTY_ALL))
> inode_cgwb_move_to_attached(inode, wb);
> else if (!(inode->i_state & I_SYNC_QUEUED)) {
> if (inode->i_state & I_DIRTY) {
> redirty_tail_locked(inode, wb);
> } else if (inode->i_state & I_DIRTY_TIME) {
> inode->dirtied_when = jiffies;
> inode_io_list_move_locked(inode, wb, &wb->b_dirty_time);
> }
> }

Very nice, thanks, I'll have a look.

-Lukas

>
> Honza
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR
>

2022-07-29 11:20:06

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 2/2] fs: record I_DIRTY_TIME even if inode already has I_DIRTY_INODE

On Fri 29-07-22 10:52:19, Lukas Czerner wrote:
> On Thu, Jul 28, 2022 at 06:53:32PM +0200, Jan Kara wrote:
> > On Thu 28-07-22 15:39:14, Lukas Czerner wrote:
> > > Currently the I_DIRTY_TIME will never get set if the inode already has
> > > I_DIRTY_INODE with assumption that it supersedes I_DIRTY_TIME. That's
> > > true, however ext4 will only update the on-disk inode in
> > > ->dirty_inode(), not on actual writeback. As a result if the inode
> > > already has I_DIRTY_INODE state by the time we get to
> > > __mark_inode_dirty() only with I_DIRTY_TIME, the time was already filled
> > > into on-disk inode and will not get updated until the next I_DIRTY_INODE
> > > update, which might never come if we crash or get a power failure.
> > >
> > > The problem can be reproduced on ext4 by running xfstest generic/622
> > > with -o iversion mount option. Fix it by setting I_DIRTY_TIME even if
> > > the inode already has I_DIRTY_INODE.
>
> Hi Jan,
>
> thanks for th review.
>
> >
> > As a datapoint I've checked and XFS has the very same problem as ext4.
>
> Very interesting, I did look at xfs when I was debugging this problem
> and wans't able to tell whether they have the same problem or not, but
> it certainly can't be reproduced by generic/622. Or at least I can't
> reproduce it on XFS.
>
> So I wonder what is XFS doing differently in that case.

OK, that's a bit curious but xfs has xfs_fs_dirty_inode() that's there
exactly to update timestamps when lazytime period expires. So in theory it
seems possible we lose the timestamp update.

> > > Also clear the I_DIRTY_TIME after ->dirty_inode() otherwise it may never
> > > get cleared.
> > >
> > > Signed-off-by: Lukas Czerner <[email protected]>
> > > ---
> > > fs/fs-writeback.c | 18 +++++++++++++++---
> > > 1 file changed, 15 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > > index 05221366a16d..174f01e6b912 100644
> > > --- a/fs/fs-writeback.c
> > > +++ b/fs/fs-writeback.c
> > > @@ -2383,6 +2383,11 @@ void __mark_inode_dirty(struct inode *inode, int flags)
> > >
> > > /* I_DIRTY_INODE supersedes I_DIRTY_TIME. */
> > > flags &= ~I_DIRTY_TIME;
> > > + if (inode->i_state & I_DIRTY_TIME) {
> > > + spin_lock(&inode->i_lock);
> > > + inode->i_state &= ~I_DIRTY_TIME;
> > > + spin_unlock(&inode->i_lock);
> > > + }
> >
> > Hum, so this is a bit dangerous because inode->i_state may be inconsistent
> > with the writeback list inode is queued in (wb->b_dirty_time) and these two
> > are supposed to be in sync. So I rather think we need to make sure we go
> > through the full round of 'update flags and writeback list' below in case
> > we need to clear I_DIRTY_TIME from inode->i_state.
>
> Ok, so we're clearing I_DIRTY_TIME in __ext4_update_other_inode_time()
> which will opportunistically update the time fields for inodes in the
> same block as the inode we're doing an update for via
> ext4_do_update_inode(). Don't we also need to rewire that differently?
>
> XFS is also clearing it on it's own in log code, but I can't tell if it
> has the same problem as you describe here.

Yes, we'll possibly have clean inodes still on wb->b_dirty_time list.
Checking the code, this should be safe in the end.

But thinking more about the possible races these two places clearing
I_DIRTY_TIME are safe because we copy timestamps to on-disk inode after
clearing I_DIRTY_TIME. But your clearing of I_DIRTY_TIME in
__mark_inode_dirty() could result in loosing timestamp update if it races
in the wrong way (basically the bug you're trying to fix would remain
unfixed).

Hum, thinking about it, even clearing of I_DIRTY_TIME later in
__mark_inode_dirty is problematic. There is still a race like:

CPU1 CPU2
__mark_inode_dirty(inode, I_DIRTY_TIME)
sets I_DIRTY_TIME in inode->i_state

__mark_inode_dirty(inode, I_DIRTY_SYNC)
->dirty_inode() - copies timestamps

__mark_inode_dirty(inode, I_DIRTY_TIME)
I_DIRTY_TIME already set -> bail
...
if (flags & I_DIRTY_INODE)
inode->i_state &= ~I_DIRTY_TIME;

and we have just lost the second timestamp update.

To fix this we'd need to clear I_DIRTY_TIME in inode->i_state before
calling ->dirty_inode() but that clashes with XFS' usage of ->dirty_inode
which uses I_DIRTY_TIME in inode->i_state to detect that timestamp update
is requested. I think we could do something like:

if (flags & I_DIRTY_INODE) {
/* Inode timestamp update will piggback on this dirtying */
if (inode->i_state & I_DIRTY_TIME) {
spin_lock(&inode->i_lock);
if (inode->i_state & I_DIRTY_TIME) {
inode->i_state &= ~I_DIRTY_TIME;
flags |= I_DIRTY_TIME;
}
spin_unlock(&inode->i_lock);
}
...
if (sb->s_op->dirty_inode)
sb->s_op->dirty_inode(inode,
flags & (I_DIRTY_INODE | I_DIRTY_TIME));
...
}

And then XFS could check for I_DIRTY_TIME in flags to detect what it needs
to do.

Hopefully now things are correct ;). Famous last words...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR