2023-10-31 11:05:05

by Jeff Layton

Subject: Re: [PATCH RFC 2/9] timekeeping: new interfaces for multigrain timestamp handing

On Tue, 2023-10-31 at 09:37 +1100, Dave Chinner wrote:
> On Fri, Oct 27, 2023 at 06:35:58AM -0400, Jeff Layton wrote:
> > On Thu, 2023-10-26 at 13:20 +1100, Dave Chinner wrote:
> > > On Wed, Oct 25, 2023 at 08:25:35AM -0400, Jeff Layton wrote:
> > > > On Wed, 2023-10-25 at 19:05 +1100, Dave Chinner wrote:
> > > > > On Tue, Oct 24, 2023 at 02:40:06PM -0400, Jeff Layton wrote:
> > > > In earlier discussions you alluded to some repair and/or analysis tools
> > > > that depended on this counter.
> > >
> > > Yes, and one of those "tools" is *me*.
> > >
> > > I frequently look at the di_changecount when doing forensic and/or
> > > failure analysis on filesystem corpses. SOE analysis, relative
> > > modification activity, etc all give insight into what happened to
> > > the filesystem to get it into the state it is currently in, and
> > > di_changecount provides information no other metadata in the inode
> > > contains.
> > >
> > > > I took a quick look in xfsprogs, but I
> > > > didn't see anything there. Is there a library or something that these
> > > > tools use to get at this value?
> > >
> > > xfs_db is the tool I use for this, such as:
> > >
> > > $ sudo xfs_db -c "sb 0" -c "a rootino" -c "p v3.change_count" /dev/mapper/fast
> > > v3.change_count = 35
> > > $
> > >
> > > The root inode in this filesystem has a change count of 35. The root
> > > inode has 32 dirents in it, which means that no entries have ever
> > > been removed or renamed. This sort of insight into the past history
> > > of inode metadata is largely impossible to get any other way, and
> > > it's been the difference between understanding failure and having no
> > > clue more than once.
> > >
> > > Most block device parsing applications simply write their own
> > > decoder that walks the on-disk format. That's pretty trivial to do,
> > > developers can get all the information needed to do this from the
> > > on-disk format specification documentation we keep on kernel.org...
> > >
> >
> > Fair enough. I'm not here to tell you guys that you need to
> > change how di_changecount works. If it's too valuable to keep it
> > counting atime-only updates, then so be it.
> >
> > If that's the case however, and given that the multigrain timestamp work
> > is effectively dead, then I don't see an alternative to growing the on-
> > disk inode. Do you?
>
> Yes, I do see alternatives. That's what I've been trying
> (unsuccessfully) to describe and get consensus on. I feel like I'm
> being ignored and rail-roaded here, because nobody is even
> acknowledging that I'm proposing alternatives and keeps insisting
> that the only solution is a change of on-disk format.
>
> So, I'll summarise the situation *yet again* in the hope that this
> time I won't get people arguing about atime vs i-version and what
> constitutes an on-disk format change because that goes nowhere and
> does nothing to determine which solution might be acceptable.
>
> The basic situation is this:
>
> If XFS can ignore relatime or lazytime persistent updates for given
> situations, then *we don't need to make periodic on-disk updates of
> atime*. This makes the whole problem of "persistent atime update bumps
> i_version" go away because then we *aren't making persistent atime
> updates* except when some other persistent modification that bumps
> [cm]time occurs.
>
> But I don't want to do this unconditionally - for systems not
> running anything that samples i_version we want relatime/lazytime
> to behave as they are supposed to and do periodic persistent updates
> as per normal. Principle of least surprise and all that jazz.
>
> So we really need an indication for inodes that we should enable this
> mode for the inode. I have asked if we can have per-operation
> context flag to trigger this given the needs for io_uring to have
> context flags for timestamp updates to be added.
>
> I have asked if we can have an inode flag set by the VFS or
> application code for this. e.g. a flag set by nfsd whenever it accesses a
> given inode.
>
> I have asked if this inode flag can just be triggered if we ever see
> I_VERSION_QUERIED set or statx is used to retrieve a change cookie,
> and whether this is a reliable mechanism for setting such a flag.
>

Ok, so to make sure I understand what you're proposing:

This would be a new inode flag that would be set in conjunction with
I_VERSION_QUERIED (but presumably is never cleared)? When XFS sees this
flag set, it would skip sending the atime to disk.

Given that you want to avoid on-disk changes, I assume this flag will
not be stored on disk. What happens after the NFS server reboots?

Consider:

1/ NFS server queries for the i_version and we set the
I_NO_ATIME_UPDATES_ON_DISK flag (or whatever) in conjunction with
I_VERSION_QUERIED. Some atime updates occur and the i_version isn't
bumped (as you'd expect).

2/ The server then reboots.

3/ Server comes back up, and some local task issues a read against the
inode. I_NO_ATIME_UPDATES_ON_DISK never had a chance to be set after the
reboot, so that atime update ends up incrementing the i_version counter.

4/ Client cache invalidation occurs even though there was no write to
the file.

This might reduce some of the spurious i_version bumps, but I don't see
how it can eliminate them entirely.
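
To make the sequence above concrete, here is a toy model of it. This is
only a sketch under loud assumptions: the Inode class, the
no_atime_on_disk field (standing in for the hypothetical
I_NO_ATIME_UPDATES_ON_DISK flag) and all helper names are made up for
illustration, and every atime update is treated as if it were persisted
immediately, which ignores the relatime/lazytime batching.

```python
from dataclasses import dataclass


@dataclass
class Inode:
    i_version: int = 0              # persisted on disk
    version_queried: bool = False   # in-memory only (models I_VERSION_QUERIED)
    no_atime_on_disk: bool = False  # in-memory only (hypothetical flag)


def query_i_version(inode):
    """Model nfsd sampling the change cookie: both in-memory flags get set."""
    inode.version_queried = True
    inode.no_atime_on_disk = True
    return inode.i_version


def atime_update(inode):
    """Model a read: persisting atime bumps i_version unless suppressed."""
    if not inode.no_atime_on_disk:
        inode.i_version += 1


def reboot(inode):
    """Only on-disk state survives; in-memory flags are lost."""
    return Inode(i_version=inode.i_version)


def scenario():
    inode = Inode()
    cookie = query_i_version(inode)    # 1/ server samples i_version
    atime_update(inode)                # atime update is suppressed
    assert inode.i_version == cookie
    inode = reboot(inode)              # 2/ server reboots
    atime_update(inode)                # 3/ local read before any new query
    return cookie, inode.i_version     # 4/ values differ with no write
```

In this model scenario() returns two different values even though no
write ever happened, which is the spurious invalidation in step 4.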

> I have suggested mechanisms for using masked off bits of timestamps
> to encode sub-timestamp granularity change counts and keep them
> invisible to userspace and then not using i_version at all for XFS.
> This avoids all the problems that the multi-grain timestamp
> infrastructure exposed due to variable granularity of user visible
> timestamps and ordering across inodes with different granularity.
> This is potentially a general solution, too.
>

I don't really understand this at all, but trying to do anything with
fine-grained timestamps will just run into a lot of the same problems we
hit with the multigrain work. If you still see this as a path forward,
maybe you can describe it in more detail?


> So, yeah, there are *lots* of ways we can solve this problem without
> needing to change on-disk formats.
>

--
Jeff Layton <[email protected]>


2023-10-31 12:22:12

by Jan Kara

Subject: Re: [PATCH RFC 2/9] timekeeping: new interfaces for multigrain timestamp handing

On Tue 31-10-23 07:04:53, Jeff Layton wrote:
> On Tue, 2023-10-31 at 09:37 +1100, Dave Chinner wrote:
> > I have suggested mechanisms for using masked off bits of timestamps
> > to encode sub-timestamp granularity change counts and keep them
> > invisible to userspace and then not using i_version at all for XFS.
> > This avoids all the problems that the multi-grain timestamp
> > infrastructure exposed due to variable granularity of user visible
> > timestamps and ordering across inodes with different granularity.
> > This is potentially a general solution, too.
>
> I don't really understand this at all, but trying to do anything with
> fine-grained timestamps will just run into a lot of the same problems we
> hit with the multigrain work. If you still see this as a path forward,
> maybe you can describe it in more detail?

Dave explained this in a bit more detail here [1]:

Another option is for XFS to play its own internal tricks with
[cm]time granularity and turn off i_version. e.g. limit external
timestamp visibility to 1us and use the remaining dozen bits of the
ns field to hold a change counter for updates within a single coarse
timer tick. This guarantees the timestamp changes within a coarse
tick for the purposes of change detection, but we don't expose those
bits to applications so applications that compare timestamps across
inodes won't get things back to front like was happening with the
multi-grain timestamps....
-

So as far as I understand, Dave wants to effectively persist a counter in the
low bits of ctime and expose ctime+counter as the change cookie. I guess that
could work, and what makes the complexity manageable compared to full
multigrain timestamps is the fact that we have one filesystem, one on-disk
format, etc. The only slight trouble could be that if we previously handed
out something in the low bits of ctime for XFS, we need to keep handing the
same thing out until the inode changes (i.e., no rounding until the moment
the inode changes), as the old timestamp could be stored somewhere externally
and compared.
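
A rough sketch of how I read the scheme, in case it helps the
discussion. Everything here is an assumption made for illustration: the
function names are made up, the counter lives in the sub-microsecond
remainder of the nanosecond field (so it holds at most 1000 values per
visible timestamp), and overflow simply raises because the thread does
not say what should happen in that case.

```python
NSEC_PER_USEC = 1000
NSEC_PER_SEC = 1_000_000_000


def bump_ctime(stored_sec, stored_nsec, now_sec, now_nsec):
    """Return the new stored (sec, nsec) for a change at coarse time 'now'."""
    now_us = now_nsec - now_nsec % NSEC_PER_USEC      # visible part of now
    old_us = stored_nsec - stored_nsec % NSEC_PER_USEC
    if (now_sec, now_us) > (stored_sec, old_us):
        # Visible timestamp advances; the hidden counter restarts at 0.
        return now_sec, now_us
    # Same visible timestamp: record the change in the hidden low part.
    counter = stored_nsec % NSEC_PER_USEC + 1
    if counter >= NSEC_PER_USEC:
        raise OverflowError("hidden change counter exhausted")
    return stored_sec, old_us + counter


def visible_ctime(sec, nsec):
    """What applications would see: 1us granularity, counter masked off."""
    return sec, nsec - nsec % NSEC_PER_USEC


def change_cookie(sec, nsec):
    """What a change-cookie consumer would see: timestamp plus counter."""
    return sec * NSEC_PER_SEC + nsec
```

Two changes inside the same visible microsecond give the same
visible_ctime() but a strictly larger change_cookie(). The stored value
is handed out unmodified until the next change, which is the "no
rounding until the inode changes" requirement mentioned above.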

Honza

[1] https://lore.kernel.org/all/[email protected]/


--
Jan Kara <[email protected]>
SUSE Labs, CR