2023-10-18 17:41:52

by Jeffrey Layton

[permalink] [raw]
Subject: [PATCH RFC 2/9] timekeeping: new interfaces for multigrain timestamp handing

Multigrain timestamps allow VFS to request fine-grained timestamps when
there has been recent interest in the file's attributes. Unfortunately,
the coarse-grained timestamps can lag the fine-grained ones. A file
stamped with a coarse-grained timestamp after one with a fine-grained
one can appear to have been modified in reverse order. This is
problematic for programs like "make" or "rsync" which depend on being
able to compare timestamps between two different inodes.

One way to prevent this is to ensure that when we stamp a file with a
fine-grained timestamp, that we use that value to establish a floor for
any later timestamp update.

Add a new mg_nsec field to the timekeeper structure for tracking the
nsec value of the most recent multigrain timestamp, along with two new
helper functions for fetching coarse and fine grained timestamps. When
getting a fine-grained timestamp update the mg_nsec with the value that
was fetched via timekeeping_get_ns.

When fetching a coarse-grained timestamp, set the mg_nsec to
TK_MG_NSEC_IGNORE. When fetching a coarse-grained time for a multigrain
timestamp, apply the mg_nsec value if it's not set to TK_MG_NSEC_IGNORE.

Signed-off-by: Jeff Layton <[email protected]>
---
include/linux/timekeeper_internal.h | 2 +
include/linux/timekeeping.h | 4 ++
kernel/time/timekeeping.c | 80 +++++++++++++++++++++++++++++++++++++
3 files changed, 86 insertions(+)

diff --git a/include/linux/timekeeper_internal.h b/include/linux/timekeeper_internal.h
index 84ff2844df2a..6583b06e7d08 100644
--- a/include/linux/timekeeper_internal.h
+++ b/include/linux/timekeeper_internal.h
@@ -55,6 +55,7 @@ struct tk_read_base {
* @tai_offset: The current UTC to TAI offset in seconds
* @clock_was_set_seq: The sequence number of clock was set events
* @cs_was_changed_seq: The sequence number of clocksource change events
+ * @mg_nsec: Nanosecond delta for multigrain timestamps
* @next_leap_ktime: CLOCK_MONOTONIC time value of a pending leap-second
* @raw_sec: CLOCK_MONOTONIC_RAW time in seconds
* @monotonic_to_boot: CLOCK_MONOTONIC to CLOCK_BOOTTIME offset
@@ -110,6 +111,7 @@ struct timekeeper {
u64 xtime_interval;
s64 xtime_remainder;
u64 raw_interval;
+ u64 mg_nsec;
/* The ntp_tick_length() value currently being used.
* This cached copy ensures we consistently apply the tick
* length for an entire tick, as ntp_tick_length may change
diff --git a/include/linux/timekeeping.h b/include/linux/timekeeping.h
index fe1e467ba046..5dc0ad619d42 100644
--- a/include/linux/timekeeping.h
+++ b/include/linux/timekeeping.h
@@ -44,6 +44,10 @@ extern void ktime_get_real_ts64(struct timespec64 *tv);
extern void ktime_get_coarse_ts64(struct timespec64 *ts);
extern void ktime_get_coarse_real_ts64(struct timespec64 *ts);

+/* multigrain timestamp support */
+void ktime_get_mg_fine_ts64(struct timespec64 *ts);
+void ktime_get_mg_coarse_ts64(struct timespec64 *ts);
+
void getboottime64(struct timespec64 *ts);

/*
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 266d02809dbb..7c20c98b1ea8 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -33,6 +33,9 @@
#define TK_MIRROR (1 << 1)
#define TK_CLOCK_WAS_SET (1 << 2)

+/* When mg_nsec is set to this, ignore it */
+#define TK_MG_NSEC_IGNORE (~(0ULL))
+
enum timekeeping_adv_mode {
/* Update timekeeper when a tick has passed */
TK_ADV_TICK,
@@ -139,6 +142,7 @@ static void tk_set_xtime(struct timekeeper *tk, const struct timespec64 *ts)
{
tk->xtime_sec = ts->tv_sec;
tk->tkr_mono.xtime_nsec = (u64)ts->tv_nsec << tk->tkr_mono.shift;
+ tk->mg_nsec = TK_MG_NSEC_IGNORE;
}

static void tk_xtime_add(struct timekeeper *tk, const struct timespec64 *ts)
@@ -146,6 +150,7 @@ static void tk_xtime_add(struct timekeeper *tk, const struct timespec64 *ts)
tk->xtime_sec += ts->tv_sec;
tk->tkr_mono.xtime_nsec += (u64)ts->tv_nsec << tk->tkr_mono.shift;
tk_normalize_xtime(tk);
+ tk->mg_nsec = TK_MG_NSEC_IGNORE;
}

static void tk_set_wall_to_mono(struct timekeeper *tk, struct timespec64 wtm)
@@ -1664,6 +1669,7 @@ void __init timekeeping_init(void)
tk_setup_internals(tk, clock);

tk_set_xtime(tk, &wall_time);
+ tk->mg_nsec = TK_MG_NSEC_IGNORE;
tk->raw_sec = 0;

tk_set_wall_to_mono(tk, wall_to_mono);
@@ -2283,6 +2289,80 @@ void ktime_get_coarse_ts64(struct timespec64 *ts)
}
EXPORT_SYMBOL(ktime_get_coarse_ts64);

+/**
+ * ktime_get_mg_fine_ts64 - Returns a fine-grained time of day in a timespec64
+ * @ts: pointer to the timespec64 to be set
+ *
+ * Multigrain filesystems use a scheme where they use coarse-grained
+ * timestamps most of the time, but will use a fine-grained timestamp
+ * if the previous timestamp was queried and the coarse-grained clock
+ * hasn't ticked yet.
+ *
+ * For this case, the caller is requesting a fine-grained timestamp,
+ * so we must update the sliding mg_nsec value before returning it. This
+ * ensures that any coarse-grained timestamp updates that occur after
+ * this won't appear to have occurred before.
+ *
+ * Fills in @ts with the current fine-grained time of day, suitable for
+ * a file timestamp.
+ */
+void ktime_get_mg_fine_ts64(struct timespec64 *ts)
+{
+ struct timekeeper *tk = &tk_core.timekeeper;
+ unsigned long flags;
+ u32 nsecs;
+
+ WARN_ON(timekeeping_suspended);
+
+ raw_spin_lock_irqsave(&timekeeper_lock, flags);
+ write_seqcount_begin(&tk_core.seq);
+
+ ts->tv_sec = tk->xtime_sec;
+ nsecs = timekeeping_get_ns(&tk->tkr_mono);
+ tk->mg_nsec = nsecs;
+
+ write_seqcount_end(&tk_core.seq);
+ raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
+
+ ts->tv_nsec = 0;
+ timespec64_add_ns(ts, nsecs);
+}
+
+/**
+ * ktime_get_mg_coarse_ts64 - Returns a coarse-grained time of day in a timespec64
+ * @ts: pointer to the timespec64 to be set
+ *
+ * Multigrain filesystems use a scheme where they use coarse-grained
+ * timestamps most of the time, but will use a fine-grained timestamp
+ * if the previous timestamp was queried and the coarse-grained clock
+ * hasn't ticked yet.
+ *
+ * For this case, the caller is requesting a coarse-grained timestamp,
+ * which will always be equal to or later than the latest fine-grained
+ * timestamp that has been handed out.
+ *
+ * Fills in @ts with the current coarse-grained time of day, suitable
+ * for a file timestamp.
+ */
+void ktime_get_mg_coarse_ts64(struct timespec64 *ts)
+{
+ struct timekeeper *tk = &tk_core.timekeeper;
+ unsigned int seq;
+ u64 nsec;
+
+ do {
+ seq = read_seqcount_begin(&tk_core.seq);
+
+ *ts = tk_xtime(tk);
+ nsec = tk->mg_nsec;
+ } while (read_seqcount_retry(&tk_core.seq, seq));
+
+ if (nsec != TK_MG_NSEC_IGNORE) {
+ ts->tv_nsec = 0;
+ timespec64_add_ns(ts, nsec);
+ }
+}
+
/*
* Must hold jiffies_lock
*/

--
2.41.0


2023-10-18 19:19:15

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH RFC 2/9] timekeeping: new interfaces for multigrain timestamp handing

On Wed, 18 Oct 2023 at 10:41, Jeff Layton <[email protected]> wrote:
>
> One way to prevent this is to ensure that when we stamp a file with a
> fine-grained timestamp, that we use that value to establish a floor for
> any later timestamp update.

I'm very leery of this.

I don't like how it's using a global time - and a global fine-grained
offset - when different filesystems will very naturally have different
granularities. I also don't like how it's no using that global lock.

Yes, yes, since the use of this all is then gated by the 'is_mgtime()'
thing, any filesystem with big granularities will presumably never set
FS_MGTIME in the first time, and that hides the worst pointless cases.
But it still feels iffy to me.

Also, the whole current_ctime() logic seems wrong. Later (in 4/9), you do this:

static struct timespec64 current_ctime(struct inode *inode)
{
if (is_mgtime(inode))
return current_mgtime(inode);

and current_mgtime() does

if (nsec & I_CTIME_QUERIED) {
ktime_get_real_ts64(&now);
return timestamp_truncate(now, inode);
}

so once the code has set I_CTIME_QUERIED, it will now use the
expensive fine-grained time - even when it makes no sense.

As far as I can tell, there is *never* a reason to get the
fine-grained time if the old inode ctime is already sufficiently far
away.

IOW, my gut feel is that all this logic should always not only be
guarded by FS_MGTIME (like you do seem to do), *and* by "has anybody
even queried this time" - it should *also* always be guarded by "if I
get the coarse-grained time, is that sufficient?"

So I think the current_ctime() logic should be something along the lines of

static struct timespec64 current_ctime(struct inode *inode)
{
struct timespec64 ts64 = current_time(inode);
unsigned long old_ctime_sec = inode->i_ctime_sec;
unsigned int old_ctime_nsec = inode->i_ctime_nsec;

if (ts64.tv_sec != old_ctime_sec)
return ts64;

/*
* No need to check is_mgtime(inode) - the I_CTIME_QUERIED
* flag is only set for mgtime filesystems
*/
if (!(old_ctime_nsec & I_CTIME_QUERIED))
return ts64;
old_ctime_nsec &= ~I_CTIME_QUERIED;
if (ts64.tv_nsec > old_ctime_nsec + inode->i_sb->s_time_gran)
return ts64;

/* Ok, only *now* do we do a finegrained value */
ktime_get_real_ts64(&ts64);
return timestamp_truncate(ts64);
}

or whatever. Make it *very* clear that the finegrained timestamp is
the absolute last option, after we have checked that the regular one
isn't possible.

I dunno.

Linus

2023-10-20 12:12:58

by Jeffrey Layton

[permalink] [raw]
Subject: Re: [PATCH RFC 2/9] timekeeping: new interfaces for multigrain timestamp handing

On Fri, 2023-10-20 at 09:02 +1100, Dave Chinner wrote:
> On Thu, Oct 19, 2023 at 07:28:48AM -0400, Jeff Layton wrote:
> > On Thu, 2023-10-19 at 11:29 +0200, Christian Brauner wrote:
> > > > Back to your earlier point though:
> > > >
> > > > Is a global offset really a non-starter? I can see about doing something
> > > > per-superblock, but ktime_get_mg_coarse_ts64 should be roughly as cheap
> > > > as ktime_get_coarse_ts64. I don't see the downside there for the non-
> > > > multigrain filesystems to call that.
> > >
> > > I have to say that this doesn't excite me. This whole thing feels a bit
> > > hackish. I think that a change version is the way more sane way to go.
> > >
> >
> > What is it about this set that feels so much more hackish to you? Most
> > of this set is pretty similar to what we had to revert. Is it just the
> > timekeeper changes? Why do you feel those are a problem?
> >
> > > >
> > > > On another note: maybe I need to put this behind a Kconfig option
> > > > initially too?
> > >
> > > So can we for a second consider not introducing fine-grained timestamps
> > > at all. We let NFSv3 live with the cache problem it's been living with
> > > forever.
> > >
> > > And for NFSv4 we actually do introduce a proper i_version for all
> > > filesystems that matter to it.
> > >
> > > What filesystems exactly don't expose a proper i_version and what does
> > > prevent them from adding one or fixing it?
> >
> > Certainly we can drop this series altogether if that's the consensus.
> >
> > The main exportable filesystem that doesn't have a suitable change
> > counter now is XFS. Fixing it will require an on-disk format change to
> > accommodate a new version counter that doesn't increment on atime
> > updates. This is something the XFS folks were specifically looking to
> > avoid, but maybe that's the simpler option.
>
> And now we have travelled the full circle.
>

LOL, yes!

> The problem NFS has with atime updates on XFS is a result of
> the default behaviour of relatime - it *always* forces a persistent
> atime update after mtime has changed. Hence a read-after-write
> operation will trigger an atime update because atime is older than
> mtime. This is what causes XFS to run a transaction (i.e. a
> persistent atime update) and that bumps iversion.
>

Those particular atime updates are not a problem. If we're updating the
mtime and ctime anyway, then bumping the change attribute is OK.

The problem is that relatime _also_ does an on-disk update once a day
for just an atime update. On XFS, this means that the change attribute
also gets bumped and the clients invalidate their caches all at once.

That doesn't sound like a big problem at first, but what if you're
sharing a multi-gigabyte r/o file between multiple clients? This sort of
thing is fairly common on render-farm workloads, and all of your clients
will end up invalidating their caches once once a day if you're serving
from XFS.

> lazytime does not behave this way - it delays all persistent
> timestamp updates until the next persistent change or until the
> lazytime aggregation period expires (24 hours). Hence with lazytime,
> read-after-write operations do not trigger a persistent atime
> update, and so XFS does not run a transaction to update atime. Hence
> i_version does not get bumped, and NFS behaves as expected.
>

Similar problem here. Once a day, NFS clients will invalidate the cache
on any static content served from XFS.

> IOWs, what the NFS server actually wants from the filesytsems is for
> lazy timestamp updates to always be used on read operations. It does
> not want persistent timestamp updates that change on-disk state. The
> recent "redefinition" of when i_version should change effectively
> encodes this - i_version should only change when a persistent
> metadata or data change is made that also changes [cm]time.
>
> Hence the simple, in-memory solution to this problem is for NFS to
> tell the filesysetms that it needs to using lazy (in-memory) atime
> updates for the given operation rather than persistent atime updates.
>
> We already need to modify how atime updates work for io_uring -
> io_uring needs atime updates to be guaranteed non-blocking similar
> to updating mtime in the write IO path. If a persistent timestamp
> change needs to be run, then the timestamp update needs to return
> -EAGAIN rather than (potentially) blocking so the entire operation
> can be punted to a context that can block.
>
> This requires control flags to be passed to the core atime handling
> functions. If a filesystem doesn't understand/support the flags, it
> can just ignore it and do the update however it was going to do it.
> It won't make anything work incorrectly, just might do something
> that is not ideal.
>
> With this new "non-blocking update only" flag for io_uring and a
> new "non-persistent update only" flag for NFS, we have a very
> similar conditional atime update requirements from two completely
> independent in-kernel applications.
>
> IOWs, this can be solved quite simply by having the -application-
> define the persistence semantics of the operation being performed.
> Add a RWF_LAZYTIME/IOCB_LAZYTIME flag for read IO that is being
> issued from the nfs daemon (i.e. passed to vfs_iter_read()) and then
> the vfs/filesystem can do exactly the right thing for the IO being
> issued.
>
> This is what io_uring does with IOCB_NOWAIT to tell the filesystems
> that the IO must be non-blocking, and it's the key we already use
> for non-blocking mtime updates and will use to trigger non-blocking
> atime updates....
>
> I also know of cases where a per-IO RWF_LAZYTIME flag would be
> beneficial - large databases are already using lazytime mount
> options so that their data IO doesn't take persistent mtime update
> overhead hits on every write IO.....
>

I don't think that trying to do something "special" for activity that is
initiated by the NFS server solves anything. Bear in mind that NFS
clients care about locally-initiated activity too.

The bottom line is that we don't want to be foisting a cache
invalidation on the clients just because someone read a file, or did
some similar activity like a readdir or readlink. The lazytime/relatime
options may mitigate the problem, but they're not a real solution.

What NFS _really_ wants is a proper change counter that doesn't
increment on read(like) operations. In practice, that comes down to just
not incrementing it on atime updates.

btrfs, ext4 and tmpfs have this (now). xfs does not because its change
attribute is incremented when an atime update is logged, and that is
evidently something that cannot be changed without an on-disk format
change.

> > There is also bcachefs which I don't think has a change attr yet. They'd
> > also likely need a on-disk format change, but hopefully that's a easier
> > thing to do there since it's a brand new filesystem.
>
> It's not a "brand new filesystem". It's been out there for quite a
> long while, and it has many users that would be impacted by on-disk
> format changes at this point in it's life. on-disk format changes
> are a fairly major deal for filesystems, and if there is any way we
> can avoid them we should.


Sure. It's new to me though. It's also not yet merged into mainline.

I'd _really_ like to see a proper change counter added before it's
merged, or at least space in the on-disk inode reserved for one until we
can get it plumbed in. That would suck for the current users, I suppose,
but at least that userbase is small now. Once it's merged, there will be
a lot more people using it and it becomes just that much more difficult.

I suppose bcachefs could try to hold out for the multigrain timestamp
work too, but that may not ever make it in.
--
Jeff Layton <[email protected]>

2023-10-20 20:07:06

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH RFC 2/9] timekeeping: new interfaces for multigrain timestamp handing

On Fri, 20 Oct 2023 at 05:12, Jeff Layton <[email protected]> wrote:.
>
> I'd _really_ like to see a proper change counter added before it's
> merged, or at least space in the on-disk inode reserved for one until we
> can get it plumbed in.

Hmm. Can we not perhaps just do an in-memory change counter, and try
to initialize it to a random value when instantiating an inode? Do we
even *require* on-disk format changes?

So on reboot, the inode would count as "changed" as far any remote
user is concerned. It would flush client caches, but isn't that what
you'd want anyway? I'd hate to waste lots of memory, but maybe people
would be ok with just a 32-bit random value. And if not...

But I actually came into this whole discussion purely through the
inode timestamp side, so I may *entirely* miss what the change counter
requirements for NFSd actually are. If it needs to be stable across
reboots, my idea is clearly complete garbage.

You can now all jump on me and point out my severe intellectual
limitations. Please use small words when you do ;)

Linus

2023-10-20 21:05:18

by Jeffrey Layton

[permalink] [raw]
Subject: Re: [PATCH RFC 2/9] timekeeping: new interfaces for multigrain timestamp handing

On Fri, 2023-10-20 at 13:06 -0700, Linus Torvalds wrote:
> On Fri, 20 Oct 2023 at 05:12, Jeff Layton <[email protected]> wrote:.
> >
> > I'd _really_ like to see a proper change counter added before it's
> > merged, or at least space in the on-disk inode reserved for one until we
> > can get it plumbed in.
>
> Hmm. Can we not perhaps just do an in-memory change counter, and try
> to initialize it to a random value when instantiating an inode? Do we
> even *require* on-disk format changes?
>
> So on reboot, the inode would count as "changed" as far any remote
> user is concerned. It would flush client caches, but isn't that what
> you'd want anyway? I'd hate to waste lots of memory, but maybe people
> would be ok with just a 32-bit random value. And if not...
>
> But I actually came into this whole discussion purely through the
> inode timestamp side, so I may *entirely* miss what the change counter
> requirements for NFSd actually are. If it needs to be stable across
> reboots, my idea is clearly complete garbage.
>
> You can now all jump on me and point out my severe intellectual
> limitations. Please use small words when you do ;)
>

Much like inode timestamps, we do depend on the change attribute
persisting across reboots. Having to invalidate all of your cached data
just because the server rebooted is particularly awful. That usually
results in the server being hammered with reads from all of the clients
at once, soon after rebooting.

--
Jeff Layton <[email protected]>

2023-10-23 14:46:29

by Jeffrey Layton

[permalink] [raw]
Subject: Re: [PATCH RFC 2/9] timekeeping: new interfaces for multigrain timestamp handing

On Mon, 2023-10-23 at 09:17 +1100, Dave Chinner wrote:
> On Fri, Oct 20, 2023 at 08:12:45AM -0400, Jeff Layton wrote:
> > On Fri, 2023-10-20 at 09:02 +1100, Dave Chinner wrote:
> > > On Thu, Oct 19, 2023 at 07:28:48AM -0400, Jeff Layton wrote:
> > > > On Thu, 2023-10-19 at 11:29 +0200, Christian Brauner wrote:
> > > > > > Back to your earlier point though:
> > > > > >
> > > > > > Is a global offset really a non-starter? I can see about doing something
> > > > > > per-superblock, but ktime_get_mg_coarse_ts64 should be roughly as cheap
> > > > > > as ktime_get_coarse_ts64. I don't see the downside there for the non-
> > > > > > multigrain filesystems to call that.
> > > > >
> > > > > I have to say that this doesn't excite me. This whole thing feels a bit
> > > > > hackish. I think that a change version is the way more sane way to go.
> > > > >
> > > >
> > > > What is it about this set that feels so much more hackish to you? Most
> > > > of this set is pretty similar to what we had to revert. Is it just the
> > > > timekeeper changes? Why do you feel those are a problem?
> > > >
> > > > > >
> > > > > > On another note: maybe I need to put this behind a Kconfig option
> > > > > > initially too?
> > > > >
> > > > > So can we for a second consider not introducing fine-grained timestamps
> > > > > at all. We let NFSv3 live with the cache problem it's been living with
> > > > > forever.
> > > > >
> > > > > And for NFSv4 we actually do introduce a proper i_version for all
> > > > > filesystems that matter to it.
> > > > >
> > > > > What filesystems exactly don't expose a proper i_version and what does
> > > > > prevent them from adding one or fixing it?
> > > >
> > > > Certainly we can drop this series altogether if that's the consensus.
> > > >
> > > > The main exportable filesystem that doesn't have a suitable change
> > > > counter now is XFS. Fixing it will require an on-disk format change to
> > > > accommodate a new version counter that doesn't increment on atime
> > > > updates. This is something the XFS folks were specifically looking to
> > > > avoid, but maybe that's the simpler option.
> > >
> > > And now we have travelled the full circle.
> > >
> >
> > LOL, yes!
> >
> > > The problem NFS has with atime updates on XFS is a result of
> > > the default behaviour of relatime - it *always* forces a persistent
> > > atime update after mtime has changed. Hence a read-after-write
> > > operation will trigger an atime update because atime is older than
> > > mtime. This is what causes XFS to run a transaction (i.e. a
> > > persistent atime update) and that bumps iversion.
> > >
> >
> > Those particular atime updates are not a problem. If we're updating the
> > mtime and ctime anyway, then bumping the change attribute is OK.
> >
> > The problem is that relatime _also_ does an on-disk update once a day
> > for just an atime update. On XFS, this means that the change attribute
> > also gets bumped and the clients invalidate their caches all at once.
> >
> > That doesn't sound like a big problem at first, but what if you're
> > sharing a multi-gigabyte r/o file between multiple clients? This sort of
> > thing is fairly common on render-farm workloads, and all of your clients
> > will end up invalidating their caches once once a day if you're serving
> > from XFS.
>
> So we have noatime inode and mount options for such specialised
> workloads that cannot tolerate cached ever being invalidated, yes?
>
> > > lazytime does not behave this way - it delays all persistent
> > > timestamp updates until the next persistent change or until the
> > > lazytime aggregation period expires (24 hours). Hence with lazytime,
> > > read-after-write operations do not trigger a persistent atime
> > > update, and so XFS does not run a transaction to update atime. Hence
> > > i_version does not get bumped, and NFS behaves as expected.
> > >
> >
> > Similar problem here. Once a day, NFS clients will invalidate the cache
> > on any static content served from XFS.
>
> Lazytime has /proc/sys/vm/dirtytime_expire_seconds to change the
> interval that triggers persistent time changes. That could easily be
> configured to be longer than a day for workloads that care about
> this sort of thing. Indeed, we could just set up timestamps that NFS
> says "do not make persistent" to only be persisted when the inode is
> removed from server memory rather than be timed out by background
> writeback....
>
> -----
>
> All I'm suggesting is that rather than using mount options for
> noatime-like behaviour for NFSD accesses, we actually have the nfsd
> accesses say "we'd like pure atime updates without iversion, please".
>
> Keep in mind that XFS does actually try to avoid bumping i_version
> on pure timestamp updates - we carved that out a long time ago (see
> the difference in XFS_ILOG_CORE vs XFS_ILOG_TIMESTAMP in
> xfs_vn_update_time() and xfs_trans_log_inode()) so that we could
> optimise fdatasync() to ignore timestamp updates that occur as a
> result of pure data overwrites.
>
> Hence XFS only bumps i_version for pure timestamp updates if the
> iversion queried flag is set. IOWs, XFS it is actually doing exactly
> what the VFS iversion implementation is telling it to do with
> timestamp updates for non-core inode metadata updates.
>
> That's the fundamental issue here: nfsd has set VFS state that tells
> the filesystem to "bump iversion on next persistent inode change",
> but the nfsd then runs operations that can change non-critical
> persistent inode state in "query-only" operations. It then expects
> filesystems to know that it should ignore the iversion queried state
> within this context. However, without external behavioural control
> flags, filesystems cannot know that an isolated metadata update has
> context specific iversion behavioural constraints.

> Hence fixing this is purely a VFS/nfsd i_version implementation
> problem - if the nfsd is running a querying operation, it should
> tell the filesystem that it should ignore iversion query state. If
> nothing the application level cache cares about is being changed
> during the query operation, it should tell the filesystem to ignore
> iversion query state because it is likely the nfsd query itself will
> set it (or have already set it itself in the case of compound
> operations).
>
> This does not need XFS on-disk format changes to fix. This does not
> need changes to timestamp infrastructure to fix. We just need the
> nfsd application to tell us that we should ignore the vfs i_version
> query state when we update non-core inode metadata within query
> operation contexts.
>


I think you're missing the point of the problem I'm trying to solve.
I'm not necessarily trying to guard nfsd against its own accesses. The
reads that trigger an eventual atime update could come from anywhere --
nfsd, userland accesses, etc.

If you are serving an XFS filesystem, with the (default) relatime mount
option, then you are guaranteed that the clients will invalidate their
cache of a file once per day, assuming that at least one read was issued
against the file during that day.

That read will cause an eventual atime bump to be logged, at which point
the change attribute will change. The client will then assume that it
needs to invalidate its cache when it sees that change.

Changing how nfsd does its own accesses won't fix anything, because the
problematic atime bump can come from any sort of read access.
--
Jeff Layton <[email protected]>

2023-10-24 00:19:01

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH RFC 2/9] timekeeping: new interfaces for multigrain timestamp handing

On Mon, 23 Oct 2023 at 13:26, Dave Chinner <[email protected]> wrote:
>
> The problem is the first read request after a modification has been
> made. That is causing relatime to see mtime > atime and triggering
> an atime update. XFS sees this, does an atime update, and in
> committing that persistent inode metadata update, it calls
> inode_maybe_inc_iversion(force = false) to check if an iversion
> update is necessary. The VFS sees I_VERSION_QUERIED, and so it bumps
> i_version and tells XFS to persist it.

Could we perhaps just have a mode where we don't increment i_version
for just atime updates?

Maybe we don't even need a mode, and could just decide that atime
updates aren't i_version updates at all?

Yes, yes, it's obviously technically a "inode modification", but does
anybody actually *want* atime updates with no actual other changes to
be version events?

Or maybe i_version can update, but callers of getattr() could have two
bits for that STATX_CHANGE_COOKIE, one for "I care about atime" and
one for others, and we'd pass that down to inode_query_version, and
we'd have a I_VERSION_QUERIED and a I_VERSION_QUERIED_STRICT, and the
"I care about atime" case ould set the strict one.

Then inode_maybe_inc_iversion() could - for atome updates - skip the
version update *unless* it sees that I_VERSION_QUERIED_STRICT bit.

Does that sound sane to people?

Because it does sound completely insane to me to say "inode changed"
and have a cache invalidation just for an atime update.

Linus

2023-10-24 04:11:16

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH RFC 2/9] timekeeping: new interfaces for multigrain timestamp handing

On Mon, 23 Oct 2023 at 17:40, Dave Chinner <[email protected]> wrote:
> >
> > Maybe we don't even need a mode, and could just decide that atime
> > updates aren't i_version updates at all?
>
> We do that already - in memory atime updates don't bump i_version at
> all. The issue is the rare persistent atime update requests that
> still happen - they are the ones that trigger an i_version bump on
> XFS, and one of the relatime heuristics tickle this specific issue.

Yes, yes, but that's what I kind of allude to - maybe people still
want the on-disk atime updates, but do they actually want the
i_version updates just because they want more aggressive atime
updates?

> > Or maybe i_version can update, but callers of getattr() could have two
> > bits for that STATX_CHANGE_COOKIE, one for "I care about atime" and
> > one for others, and we'd pass that down to inode_query_version, and
> > we'd have a I_VERSION_QUERIED and a I_VERSION_QUERIED_STRICT, and the
> > "I care about atime" case ould set the strict one.
>
> This makes correct behaviour reliant on the applicaiton using the
> query mechanism correctly. I have my doubts that userspace
> developers will be able to understand the subtle difference between
> the two options and always choose correctly....

I don't think we _have_ a user space interface.

At least the STATX_CHANGE_COOKIE bit isn't exposed to user space at
all. Not in the uapi headers, but not even in xstat():

/* STATX_CHANGE_COOKIE is kernel-only for now */
tmp.stx_mask = stat->result_mask & ~STATX_CHANGE_COOKIE;

So the *only* users of STATX_CHANGE_COOKIE seem to be entirely
in-kernel, unless there is something I'm missing where somebody uses
i_version through some other interface (random ioctl?).

End result: splitting STATX_CHANGE_COOKIE into a "I don't care about
atime" and a "give me all change events" shouldn't possibly break
anything that I can see.

The only other uses of inode_query_iversion() seem to be the explicit
directory optimizations (ie the "I'm caching the offset and don't want
to re-check that it's valid unless required, so give me the inode
version for the directory as a way to decide if I need to re-check"
thing).

And those *definitely* don't want i_version updates on any atime updates.

There might be some other use of inode_query_iversion() that I missed,
of course. But from a quick look, it really looks to me like we can
relax our i_version updates.

Linus

2023-10-24 19:07:03

by Jeffrey Layton

[permalink] [raw]
Subject: Re: [PATCH RFC 2/9] timekeeping: new interfaces for multigrain timestamp handing

On Mon, 2023-10-23 at 14:18 -1000, Linus Torvalds wrote:
> On Mon, 23 Oct 2023 at 13:26, Dave Chinner <[email protected]> wrote:
> >
> > The problem is the first read request after a modification has been
> > made. That is causing relatime to see mtime > atime and triggering
> > an atime update. XFS sees this, does an atime update, and in
> > committing that persistent inode metadata update, it calls
> > inode_maybe_inc_iversion(force = false) to check if an iversion
> > update is necessary. The VFS sees I_VERSION_QUERIED, and so it bumps
> > i_version and tells XFS to persist it.
>
> Could we perhaps just have a mode where we don't increment i_version
> for just atime updates?
>
> Maybe we don't even need a mode, and could just decide that atime
> updates aren't i_version updates at all?
>
> Yes, yes, it's obviously technically a "inode modification", but does
> anybody actually *want* atime updates with no actual other changes to
> be version events?
>
> Or maybe i_version can update, but callers of getattr() could have two
> bits for that STATX_CHANGE_COOKIE, one for "I care about atime" and
> one for others, and we'd pass that down to inode_query_version, and
> we'd have a I_VERSION_QUERIED and a I_VERSION_QUERIED_STRICT, and the
> "I care about atime" case ould set the strict one.
>
> Then inode_maybe_inc_iversion() could - for atome updates - skip the
> version update *unless* it sees that I_VERSION_QUERIED_STRICT bit.
>
> Does that sound sane to people?
>
> Because it does sound completely insane to me to say "inode changed"
> and have a cache invalidation just for an atime update.
>


The new flag idea is a good one. The catch though is that there are no
readers of i_version in-kernel other than NFSD and IMA, so there would
be no in-kernel users of I_VERSION_QUERIED_STRICT.

In earlier discussions, I was given to believe that the problem with
changing how this works in XFS involved offline filesystem access tools.
That said, I didn't press for enough details at the time, so I may have
misunderstood Dave's reticence to change how this works.
--
Jeff Layton <[email protected]>

2023-10-25 10:43:00

by Amir Goldstein

[permalink] [raw]
Subject: Re: [PATCH RFC 2/9] timekeeping: new interfaces for multigrain timestamp handing

On Wed, Oct 25, 2023 at 11:05 AM Dave Chinner <[email protected]> wrote:
>
> On Tue, Oct 24, 2023 at 02:40:06PM -0400, Jeff Layton wrote:
> > On Tue, 2023-10-24 at 10:08 +0300, Amir Goldstein wrote:
> > > On Tue, Oct 24, 2023 at 6:40 AM Dave Chinner <[email protected]> wrote:
> > > >
> > > > On Mon, Oct 23, 2023 at 02:18:12PM -1000, Linus Torvalds wrote:
> > > > > On Mon, 23 Oct 2023 at 13:26, Dave Chinner <[email protected]> wrote:
> > > > > >
> > > > > > The problem is the first read request after a modification has been
> > > > > > made. That is causing relatime to see mtime > atime and triggering
> > > > > > an atime update. XFS sees this, does an atime update, and in
> > > > > > committing that persistent inode metadata update, it calls
> > > > > > inode_maybe_inc_iversion(force = false) to check if an iversion
> > > > > > update is necessary. The VFS sees I_VERSION_QUERIED, and so it bumps
> > > > > > i_version and tells XFS to persist it.
> > > > >
> > > > > Could we perhaps just have a mode where we don't increment i_version
> > > > > for just atime updates?
> > > > >
> > > > > Maybe we don't even need a mode, and could just decide that atime
> > > > > updates aren't i_version updates at all?
> > > >
> > > > We do that already - in memory atime updates don't bump i_version at
> > > > all. The issue is the rare persistent atime update requests that
> > > > still happen - they are the ones that trigger an i_version bump on
> > > > XFS, and one of the relatime heuristics tickle this specific issue.
> > > >
> > > > If we push the problematic persistent atime updates to be in-memory
> > > > updates only, then the whole problem with i_version goes away....
> > > >
> > > > > Yes, yes, it's obviously technically a "inode modification", but does
> > > > > anybody actually *want* atime updates with no actual other changes to
> > > > > be version events?
> > > >
> > > > Well, yes, there was. That's why we defined i_version in the on disk
> > > > format this way well over a decade ago. It was part of some deep
> > > > dark magical HSM beans that allowed the application to combine
> > > > multiple scans for different inode metadata changes into a single
> > > > pass. atime changes was one of the things it needed to know about
> > > > for tiering and space scavenging purposes....
> > > >
> > >
> > > But if this is such an ancient mystical program, why do we have to
> > > keep this XFS behavior in the present?
> > > BTW, is this the same HSM whose DMAPI ioctls were deprecated
> > > a few years back?
>
> Drop the attitude, Amir.
>
> That "ancient mystical program" is this:
>
> https://buy.hpe.com/us/en/enterprise-solutions/high-performance-computing-solutions/high-performance-computing-storage-solutions/hpc-storage-solutions/hpe-data-management-framework-7/p/1010144088
>

Sorry for the attitude Dave, I somehow got the impression that you
were talking about a hypothetical old program that may be out of use.
I believe that Jeff and Linus got the same impression...

> Yup, that product is backed by a proprietary descendent of the Irix
> XFS code base XFS that is DMAPI enabled and still in use today. It's
> called HPE XFS these days....
>

What do you mean?
Do you mean that the HPE product uses patched XFS?
If so, why is that an upstream concern?

Upstream xfs indeed preserves di_dmstate,di_dmevmask, but it does
not change those state members when file changes happen.

So if mounting an HPE XFS disk on with upstream kernel is not
going to record DMAPI state changes, does it matter if upstream
xfs does not update di_changecount on atime change?

Maybe I did not understand the situation w.r.t HPE XFS.

> > > I mean, I understand that you do not want to change the behavior of
> > > i_version update without an opt-in config or mount option - let the distro
> > > make that choice.
> > > But calling this an "on-disk format change" is a very long stretch.
>
> Telling the person who created, defined and implemented the on disk
> format that they don't know what constitutes a change of that
> on-disk format seems kinda Dunning-Kruger to me....
>

OK. I will choose my words more carefully:

I still do not understand, from everything that you have told us
so far, including the mention of the specific product above,
why not updating di_changecount on atime update constitutes
an on-disk format change and not a runtime behavior change.

You also did not address my comment that xfs_repair does not
update di_changecount on any inode changes to the best of my
code reading abilities.

> There are *lots* of ways that di_changecount is now incompatible
> with the VFS change counter. That's now defined as "i_version should
> only change when [cm]time is changed".
>
> di_changecount is defined to be a count of the number of changes
> made to the attributes of the inode. It's not just atime at issue
> here - we bump di_changecount when make any inode change, including
> background work that does not otherwise change timestamps. e.g.
> allocation at writeback time, unwritten extent conversion, on-disk
> EOF extension at IO completion, removal of speculative
> pre-allocation beyond EOF, etc.
>

I see.
Does xfs update ctime on all those inode block map changes?

> IOWs, di_changecount was never defined as a linux "i_version"
> counter, regardless of the fact we originally we able to implement
> i_version with it - all extra bumps to di_changecount were not
> important to the users of i_version for about a decade.
>
> Unfortunately, the new i_version definition is very much
> incompatible with the existing di_changecount definition and that's
> the underlying problem here. i.e. the problem is not that we bump
> i_version on atime, it's that di_changecount is now completely
> incompatible with the new i_version change semantics.
>
> To implement the new i_version semantics exactly, we need to add a
> new field to the inode to hold this information.
> If we change the on disk format like this, then the atime
> problems go away because the new field would not get updated on
> atime updates. We'd still be bumping di_changecount on atime
> updates, though, because that's what is required by the on-disk
> format.
>

I fully agree with you that we should avoid on-disk format change.
This is exactly the reason that I'm insisting on the point of clarifying
how exactly, this semantic change of di_changecount is going to
break existing applications that run on upstream kernel.

Thanks,
Amir.

2023-10-25 12:26:37

by Jeffrey Layton

[permalink] [raw]
Subject: Re: [PATCH RFC 2/9] timekeeping: new interfaces for multigrain timestamp handing

On Wed, 2023-10-25 at 19:05 +1100, Dave Chinner wrote:
> On Tue, Oct 24, 2023 at 02:40:06PM -0400, Jeff Layton wrote:
> > On Tue, 2023-10-24 at 10:08 +0300, Amir Goldstein wrote:
> > > On Tue, Oct 24, 2023 at 6:40 AM Dave Chinner <[email protected]> wrote:
> > > >
> > > > On Mon, Oct 23, 2023 at 02:18:12PM -1000, Linus Torvalds wrote:
> > > > > On Mon, 23 Oct 2023 at 13:26, Dave Chinner <[email protected]> wrote:
> > > > > >
> > > > > > The problem is the first read request after a modification has been
> > > > > > made. That is causing relatime to see mtime > atime and triggering
> > > > > > an atime update. XFS sees this, does an atime update, and in
> > > > > > committing that persistent inode metadata update, it calls
> > > > > > inode_maybe_inc_iversion(force = false) to check if an iversion
> > > > > > update is necessary. The VFS sees I_VERSION_QUERIED, and so it bumps
> > > > > > i_version and tells XFS to persist it.
> > > > >
> > > > > Could we perhaps just have a mode where we don't increment i_version
> > > > > for just atime updates?
> > > > >
> > > > > Maybe we don't even need a mode, and could just decide that atime
> > > > > updates aren't i_version updates at all?
> > > >
> > > > We do that already - in memory atime updates don't bump i_version at
> > > > all. The issue is the rare persistent atime update requests that
> > > > still happen - they are the ones that trigger an i_version bump on
> > > > XFS, and one of the relatime heuristics tickle this specific issue.
> > > >
> > > > If we push the problematic persistent atime updates to be in-memory
> > > > updates only, then the whole problem with i_version goes away....
> > > >
> > > > > Yes, yes, it's obviously technically a "inode modification", but does
> > > > > anybody actually *want* atime updates with no actual other changes to
> > > > > be version events?
> > > >
> > > > Well, yes, there was. That's why we defined i_version in the on disk
> > > > format this way well over a decade ago. It was part of some deep
> > > > dark magical HSM beans that allowed the application to combine
> > > > multiple scans for different inode metadata changes into a single
> > > > pass. atime changes was one of the things it needed to know about
> > > > for tiering and space scavenging purposes....
> > > >
> > >
> > > But if this is such an ancient mystical program, why do we have to
> > > keep this XFS behavior in the present?
> > > BTW, is this the same HSM whose DMAPI ioctls were deprecated
> > > a few years back?
>
> Drop the attitude, Amir.
>
> That "ancient mystical program" is this:
>
> https://buy.hpe.com/us/en/enterprise-solutions/high-performance-computing-solutions/high-performance-computing-storage-solutions/hpc-storage-solutions/hpe-data-management-framework-7/p/1010144088
>
> Yup, that product is backed by a proprietary descendent of the Irix
> XFS code base XFS that is DMAPI enabled and still in use today. It's
> called HPE XFS these days....
>
> > > I mean, I understand that you do not want to change the behavior of
> > > i_version update without an opt-in config or mount option - let the distro
> > > make that choice.
> > > But calling this an "on-disk format change" is a very long stretch.
>
> Telling the person who created, defined and implemented the on disk
> format that they don't know what constitutes a change of that
> on-disk format seems kinda Dunning-Kruger to me....
>
> There are *lots* of ways that di_changecount is now incompatible
> with the VFS change counter. That's now defined as "i_version should
> only change when [cm]time is changed".
>
> di_changecount is defined to be a count of the number of changes
> made to the attributes of the inode. It's not just atime at issue
> here - we bump di_changecount when make any inode change, including
> background work that does not otherwise change timestamps. e.g.
> allocation at writeback time, unwritten extent conversion, on-disk
> EOF extension at IO completion, removal of speculative
> pre-allocation beyond EOF, etc.
>
> IOWs, di_changecount was never defined as a linux "i_version"
> counter, regardless of the fact we originally we able to implement
> i_version with it - all extra bumps to di_changecount were not
> important to the users of i_version for about a decade.
>
> Unfortunately, the new i_version definition is very much
> incompatible with the existing di_changecount definition and that's
> the underlying problem here. i.e. the problem is not that we bump
> i_version on atime, it's that di_changecount is now completely
> incompatible with the new i_version change semantics.
>
> To implement the new i_version semantics exactly, we need to add a
> new field to the inode to hold this information.
> If we change the on disk format like this, then the atime
> problems go away because the new field would not get updated on
> atime updates. We'd still be bumping di_changecount on atime
> updates, though, because that's what is required by the on-disk
> format.
>
> I'm really trying to avoid changing the on-disk format unless it
> is absolutely necessary. If we can get the in-memory timestamp
> updates to avoid tripping di_changecount updates then the atime
> problems go away.
>
> If we can get [cm]time sufficiently fine grained that we don't need
> i_version, then we can turn off i_version in XFS and di_changecount
> ends up being entirely internal. That's what was attempted with
> generic multi-grain timestamps, but that hasn't worked.
>
> Another options is for XFS to play it's own internal tricks with
> [cm]time granularity and turn off i_version. e.g. limit external
> timestamp visibility to 1us and use the remaining dozen bits of the
> ns field to hold a change counter for updates within a single coarse
> timer tick. This guarantees the timestamp changes within a coarse
> tick for the purposes of change detection, but we don't expose those
> bits to applications so applications that compare timestamps across
> inodes won't get things back to front like was happening with the
> multi-grain timestamps....
>
> Another option is to work around the visible symptoms of the
> semantic mismatch between i_version and di_changecount. The only
> visible symptom we currently know about is the atime vs i_version
> issue. If people are happy for us to simply ignore VFS atime
> guidelines (i.e. ignore realtime/lazytime) and do completely our own
> stuff with timestamp update deferal, then that also solve the
> immediate issues.
>
> > > Does xfs_repair guarantee that changes of atime, or any inode changes
> > > for that matter, update i_version? No, it does not.
> > > So IMO, "atime does not update i_version" is not an "on-disk format change",
> > > it is a runtime behavior change, just like lazytime is.
> >
> > This would certainly be my preference. I don't want to break any
> > existing users though.
>
> That's why I'm trying to get some kind of consensus on what
> rules and/or atime configurations people are happy for me to break
> to make it look to users like there's a viable working change
> attribute being supplied by XFS without needing to change the on
> disk format.
>

I agree that the only bone of contention is whether to count atime
updates against the change attribute. I think we have consensus that all
in-kernel users do _not_ want atime updates counted against the change
attribute. The only real question is these "legacy" users of
di_changecount.

> > Perhaps this ought to be a mkfs option? Existing XFS filesystems could
> > still behave with the legacy behavior, but we could make mkfs.xfs build
> > filesystems by default that work like NFS requires.
>
> If we require mkfs to set a flag to change behaviour, then we're
> talking about making an explicit on-disk format change to select the
> optional behaviour. That's precisely what I want to avoid.
>

Right. The on-disk di_changecount would have a (subtly) different
meaning at that point.

It's not a change that requires drastic retooling though. If we were to
do this, we wouldn't need to grow the on-disk inode. Booting to an older
kernel would cause the behavior to revert. That's sub-optimal, but not
fatal.

What I don't quite understand is how these tools are accessing
di_changecount?

XFS only accesses the di_changecount to propagate the value to and from
the i_version, and there is nothing besides NFSD and IMA that queries
the i_version value in-kernel. So, this must be done via some sort of
userland tool that is directly accessing the block device (or some 3rd
party kernel module).

In earlier discussions you alluded to some repair and/or analysis tools
that depended on this counter. I took a quick look in xfsprogs, but I
didn't see anything there. Is there a library or something that these
tools use to get at this value?
--
Jeff Layton <[email protected]>

2023-10-27 10:36:31

by Jeffrey Layton

[permalink] [raw]
Subject: Re: [PATCH RFC 2/9] timekeeping: new interfaces for multigrain timestamp handing

On Thu, 2023-10-26 at 13:20 +1100, Dave Chinner wrote:
> On Wed, Oct 25, 2023 at 08:25:35AM -0400, Jeff Layton wrote:
> > On Wed, 2023-10-25 at 19:05 +1100, Dave Chinner wrote:
> > > On Tue, Oct 24, 2023 at 02:40:06PM -0400, Jeff Layton wrote:
> > > > On Tue, 2023-10-24 at 10:08 +0300, Amir Goldstein wrote:
> > > > > On Tue, Oct 24, 2023 at 6:40 AM Dave Chinner <[email protected]> wrote:
> > > > > >
> > > > > > On Mon, Oct 23, 2023 at 02:18:12PM -1000, Linus Torvalds wrote:
> > > > > > > On Mon, 23 Oct 2023 at 13:26, Dave Chinner <[email protected]> wrote:
> > > > > Does xfs_repair guarantee that changes of atime, or any inode changes
> > > > > for that matter, update i_version? No, it does not.
> > > > > So IMO, "atime does not update i_version" is not an "on-disk format change",
> > > > > it is a runtime behavior change, just like lazytime is.
> > > >
> > > > This would certainly be my preference. I don't want to break any
> > > > existing users though.
> > >
> > > That's why I'm trying to get some kind of consensus on what
> > > rules and/or atime configurations people are happy for me to break
> > > to make it look to users like there's a viable working change
> > > attribute being supplied by XFS without needing to change the on
> > > disk format.
> > >
> >
> > I agree that the only bone of contention is whether to count atime
> > updates against the change attribute. I think we have consensus that all
> > in-kernel users do _not_ want atime updates counted against the change
> > attribute. The only real question is these "legacy" users of
> > di_changecount.
>
> Please stop refering to "legacy users" of di_changecount. Whether
> there are users or not is irrelevant - it is defined by the current
> on-disk format specification, and as such there may be applications
> we do not know about making use of the current behaviour.
>
> It's like a linux syscall - we can't remove them because there may
> be some user we don't know about still using that old syscall. We
> simply don't make changes that can potentially break user
> applications like that.
>
> The on disk format is the same - there is software out that we don't
> know about that expects a certain behaviour based on the
> specification. We don't break the on disk format by making silent
> behavioural changes - we require a feature flag to indicate
> behaviour has changed so that applications can take appropriate
> actions with stuff they don't understand.
>
> The example for this is the BIGTIME timestamp format change. The on
> disk inode structure is physically unchanged, but the contents of
> the timestamp fields are encoded very differently. Sure, the older
> kernels can read the timestamp data without any sort of problem
> occurring, except for the fact the timestamps now appear to be
> completely corrupted.
>
> Changing the meaning of ithe contents of di_changecount is no
> different. It might look OK and nothing crashes, but nothing can be
> inferred from the value in the field because we don't know how it
> has been modified.
>
> Hence we can't just change the meaning, encoding or behaviour of an
> on disk field that would result in existing kernels and applications
> doing the wrong thing with that field (either read or write) without
> adding a feature flag to indicate what behaviour that field should
> have.
>
> > > > Perhaps this ought to be a mkfs option? Existing XFS filesystems could
> > > > still behave with the legacy behavior, but we could make mkfs.xfs build
> > > > filesystems by default that work like NFS requires.
> > >
> > > If we require mkfs to set a flag to change behaviour, then we're
> > > talking about making an explicit on-disk format change to select the
> > > optional behaviour. That's precisely what I want to avoid.
> > >
> >
> > Right. The on-disk di_changecount would have a (subtly) different
> > meaning at that point.
> >
> > It's not a change that requires drastic retooling though. If we were to
> > do this, we wouldn't need to grow the on-disk inode. Booting to an older
> > kernel would cause the behavior to revert. That's sub-optimal, but not
> > fatal.
>
> See above: redefining the contents, behaviour or encoding of an on
> disk field is a change of the on-disk format specification.
>
> The rules for on disk format changes that we work to were set in
> place long before I started working on XFS. They are sane, well
> thought out rules that have stood the test of time and massive new
> feature introductions (CRCs, reflink, rmap, etc). And they only work
> because we don't allow anyone to bend them for convenience, short
> cuts or expediting their pet project.
>
> > What I don't quite understand is how these tools are accessing
> > di_changecount?
>
> As I keep saying: this is largely irrelevant to the problem at hand.
>
> > XFS only accesses the di_changecount to propagate the value to and from
> > the i_version,
>
> Yes. XFS has a strong separation between on-disk structures and
> in-memory values, and i_version is simply the in-memory field we use
> to store the current di_changecount value. We force bump i_version
> every time we modify the inode core regardless of whether anyone has
> queried i_version because that's what di_changecount requires. i.e.
> the filesystem controls the contents of i_version, not the VFS.
>
> Now that NFS is using a proper abstraction (i.e. vfs_statx()) to get
> the change cookie, we really don't need to expose di_changecount in
> i_version at all - we could simply copy an internal di_changecount
> value into the statx cookie field in xfs_vn_getattr() and there
> would be almost no change of behaviour from the perspective of NFS
> and IMA at all.
>
> > and there is nothing besides NFSD and IMA that queries
> > the i_version value in-kernel. So, this must be done via some sort of
> > userland tool that is directly accessing the block device (or some 3rd
> > party kernel module).
>
> Yup, both of those sort of applications exist. e.g. the DMAPI kernel
> module allows direct access to inode metadata through a custom
> bulkstat formatter implementation - it returns different information
> comapred to the standard XFS one in the upstream kernel.
>
> > In earlier discussions you alluded to some repair and/or analysis tools
> > that depended on this counter.
>
> Yes, and one of those "tools" is *me*.
>
> I frequently look at the di_changecount when doing forensic and/or
> failure analysis on filesystem corpses. SOE analysis, relative
> modification activity, etc all give insight into what happened to
> the filesystem to get it into the state it is currently in, and
> di_changecount provides information no other metadata in the inode
> contains.
>
> > I took a quick look in xfsprogs, but I
> > didn't see anything there. Is there a library or something that these
> > tools use to get at this value?
>
> xfs_db is the tool I use for this, such as:
>
> $ sudo xfs_db -c "sb 0" -c "a rootino" -c "p v3.change_count" /dev/mapper/fast
> v3.change_count = 35
> $
>
> The root inode in this filesystem has a change count of 35. The root
> inode has 32 dirents in it, which means that no entries have ever
> been removed or renamed. This sort of insight into the past history
> of inode metadata is largely impossible to get any other way, and
> it's been the difference between understanding failure and having no
> clue more than once.
>
> Most block device parsing applications simply write their own
> decoder that walks the on-disk format. That's pretty trivial to do,
> developers can get all the information needed to do this from the
> on-disk format specification documentation we keep on kernel.org...
>

Fair enough. I'm not here to tell you that you guys that you need to
change how di_changecount works. If it's too valuable to keep it
counting atime-only updates, then so be it.

If that's the case however, and given that the multigrain timestamp work
is effectively dead, then I don't see an alternative to growing the on-
disk inode. Do you?
--
Jeff Layton <[email protected]>

2023-10-31 13:55:21

by Jeffrey Layton

[permalink] [raw]
Subject: Re: [PATCH RFC 2/9] timekeeping: new interfaces for multigrain timestamp handing

On Tue, 2023-10-31 at 11:26 +0100, Christian Brauner wrote:
> On Thu, Oct 19, 2023 at 07:28:48AM -0400, Jeff Layton wrote:
> > On Thu, 2023-10-19 at 11:29 +0200, Christian Brauner wrote:
> > > > Back to your earlier point though:
> > > >
> > > > Is a global offset really a non-starter? I can see about doing something
> > > > per-superblock, but ktime_get_mg_coarse_ts64 should be roughly as cheap
> > > > as ktime_get_coarse_ts64. I don't see the downside there for the non-
> > > > multigrain filesystems to call that.
> > >
> > > I have to say that this doesn't excite me. This whole thing feels a bit
> > > hackish. I think that a change version is the way more sane way to go.
> > >
> >
> > What is it about this set that feels so much more hackish to you? Most
> > of this set is pretty similar to what we had to revert. Is it just the
> > timekeeper changes? Why do you feel those are a problem?
>
> So I think that the multi-grain timestamp work was well intended but it
> was ultimately a mistake. Because we added code that complicated
> timestamp timestamp handling in the vfs to a point where the costs
> clearly outweighed the benefits.
>
> And I don't think that this direction is worth going into. This whole
> thread ultimately boils down to complicating generic infrastructure
> quite extensively for nfs to handle exposing xfs without forcing an
> on-disk format change. That's even fine.
>
> That's not a problem but in the same way I don't think the solution is
> just stuffing this complexity into the vfs. IOW, if we make this a vfs
> problem then at the lowest possible cost and not by changing how
> timestamps work for everyone even if it's just internal.

I'll point out that this last posting I did was an RFC. It was invasive
to the timekeeping code, but I don't think that's a hard requirement for
doing this.

I do appreciate the feedback on this version of the series (particularly
from Thomas who gave a great technical reason why this approach was
wrong), but I don't think we necessarily have to give up on the whole
idea because this particular implementation was too costly.

The core idea for fixing the problem with the original series is sane,
IMO. There is nothing wrong with simply making it that when we stamp a
file with a fine-grained timestamp that we consider that a floor for all
later timestamp updates. The only real question is how to keep that
(global) fine-grained floor offset at a low cost. I think that's a
solvable problem.

I also believe that real, measurable fine-grained timestamp differences
are worthwhile for other use cases beyond NFS. Everyone was pointing out
the problems with lagging timestamps vs. make and rsync, but that's a
double-edged sword. With the current always coarse-grained timestamps,
the ordering of files written within the same jiffy can't be determined
since their timestamps will be identical. We could conceivably change
that with this series.

That said, if this has no chance of ever being merged, then I won't
bother working on it further, and we can try to pursue something that is
(maybe) XFS-specific.

Let me know, either way.
--
Jeff Layton <[email protected]>

2023-10-31 23:12:58

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [PATCH RFC 2/9] timekeeping: new interfaces for multigrain timestamp handing

On Tue, Oct 31, 2023 at 09:03:57AM +0200, Amir Goldstein wrote:
> On Tue, Oct 31, 2023 at 3:42 AM Dave Chinner <[email protected]> wrote:
> >
> [...]
> > .... and what is annoying is that that the new i_version just a
> > glorified ctime change counter. What we should be fixing is ctime -
> > integrating this change counting into ctime would allow us to make
> > i_version go away entirely. i.e. We don't need a persistent ctime
> > change counter if the ctime has sufficient resolution or persistent
> > encoding that it does not need an external persistent change
> > counter.
> >
> > That was reasoning behind the multi-grain timestamps. While the mgts
> > implementation was flawed, the reasoning behind it certainly isn't.
> > We should be trying to get rid of i_version by integrating it into
> > ctime updates, not arguing how atime vs i_version should work.
> >
> > > So I don't think the issue here is "i_version" per se. I think in a
> > > vacuum, the best option of i_version is pretty obvious. But if you
> > > want i_version to track di_changecount, *then* you end up with that
> > > situation where the persistence of atime matters, and i_version needs
> > > to update whenever a (persistent) atime update happens.
> >
> > Yet I don't want i_version to track di_changecount.
> >
> > I want to *stop supporting i_version altogether* in XFS.
> >
> > I want i_version as filesystem internal metadata to die completely.
> >
> > I don't want to change the on disk format to add a new i_version
> > field because we'll be straight back in this same siutation when the
> > next i_version bug is found and semantics get changed yet again.
> >
> > Hence if we can encode the necessary change attributes into ctime,
> > we can drop VFS i_version support altogether. Then the "atime bumps
> > i_version" problem also goes away because then we *don't use
> > i_version*.
> >
> > But if we can't get the VFS to do this with ctime, at least we have
> > the abstractions available to us (i.e. timestamp granularity and
> > statx change cookie) to allow XFS to implement this sort of
> > ctime-with-integrated-change-counter internally to the filesystem
> > and be able to drop i_version support....
> >
>
> I don't know if it was mentioned before in one of the many threads,
> but there is another benefit of ctime-with-integrated-change-counter
> approach - it is the ability to extend the solution with some adaptations
> also to mtime.
>
> The "change cookie" is used to know if inode metadata cache should
> be invalidated and mtime is often used to know if data cache should
> be invalidated, or if data comparison could be skipped (e.g. rsync).
>
> The difference is that mtime can be set by user, so using lower nsec
> bits for modification counter would require to truncate the user set
> time granularity to 100ns - that is probably acceptable, but only as
> an opt-in behavior.
>
> The special value 0 for mtime-change-counter could be reserved for
> mtime that was set by the user or for upgrade of existing inode,
> where 0 counter means that mtime cannot be trusted as an accurate
> data modification-cookie.

What about write faults on an mmap region? The first ro->rw transition
results in an mtime update, but not again until the page gets cleaned.

> This feature is going to be useful for the vfs HSM implementation [1]
> that I am working on and it actually rhymes with the XFS DMAPI
> patches that were never fully merged upstream.

Kudos, I cannot figure out a non-pejorative word that rhymes with
"**API". ;)

--D

> Speaking on behalf of my employer, we would love to see the data
> modification-cookie feature implemented, whether in vfs or in xfs.
>
> *IF* the result on this thread is that the chosen solution is
> ctime-with-change-counter in XFS
> *AND* if there is agreement among XFS developers to extend it with
> an opt-in mkfs/mount option to 100ns-mtime-with-change-counter in XFS
> *THEN* I think I will be able to allocate resources to drive this xfs work.
>
> Thanks,
> Amir.
>
> [1] https://github.com/amir73il/fsnotify-utils/wiki/Hierarchical-Storage-Management-API
>

2023-11-01 10:17:02

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH RFC 2/9] timekeeping: new interfaces for multigrain timestamp handing

On Wed 01-11-23 08:57:09, Dave Chinner wrote:
> 5. When-ever the inode is persisted, the timestamp is copied to the
> on-disk structure and the current change counter is folded in.
>
> This means the on-disk structure always contains the latest
> change attribute that has been persisted, just like we
> currently do with i_version now.
>
> 6. When-ever we read the inode off disk, we split the change counter
> from the timestamp and update the appropriate internal structures
> with this information.
>
> This ensures that the VFS and userspace never see the change
> counter implementation in the inode timestamps.

OK, but is this compatible with the current XFS behavior? AFAICS currently
XFS sets sb->s_time_gran to 1 so timestamps currently stored on disk will
have some mostly random garbage in low bits of the ctime. Now if you look
at such inode with a kernel using this new scheme, stat(2) will report
ctime with low bits zeroed-out so if the ctime fetched in the old kernel was
stored in some external database and compared to the newly fetched ctime, it
will appear that ctime has gone backwards... Maybe we don't care but it is
a user visible change that can potentially confuse something.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2023-11-01 11:39:03

by Amir Goldstein

[permalink] [raw]
Subject: Re: [PATCH RFC 2/9] timekeeping: new interfaces for multigrain timestamp handing

On Wed, Nov 1, 2023 at 12:16 PM Jan Kara <[email protected]> wrote:
>
> On Wed 01-11-23 08:57:09, Dave Chinner wrote:
> > 5. When-ever the inode is persisted, the timestamp is copied to the
> > on-disk structure and the current change counter is folded in.
> >
> > This means the on-disk structure always contains the latest
> > change attribute that has been persisted, just like we
> > currently do with i_version now.
> >
> > 6. When-ever we read the inode off disk, we split the change counter
> > from the timestamp and update the appropriate internal structures
> > with this information.
> >
> > This ensures that the VFS and userspace never see the change
> > counter implementation in the inode timestamps.
>
> OK, but is this compatible with the current XFS behavior? AFAICS currently
> XFS sets sb->s_time_gran to 1 so timestamps currently stored on disk will
> have some mostly random garbage in low bits of the ctime. Now if you look
> at such inode with a kernel using this new scheme, stat(2) will report
> ctime with low bits zeroed-out so if the ctime fetched in the old kernel was
> stored in some external database and compared to the newly fetched ctime, it
> will appear that ctime has gone backwards... Maybe we don't care but it is
> a user visible change that can potentially confuse something.

See xfs_inode_has_bigtime() and auto-upgrade of inode format in
xfs_inode_item_precommit().

In the case of BIGTIME inode format, admin needs to opt-in to
BIGTIME format conversion by setting an INCOMPAT_BIGTIME
sb feature flag.

I imagine that something similar (inode flag + sb flag) would need
to be done for the versioned-timestamp, but I think that in that case,
the feature flag could be RO_COMPAT - there is no harm in exposing
made-up nsec lower bits if fs would be mounted read-only on an old
kernel, is there?

The same RO_COMPAT feature flag could also be used to determine
s_time_gran, because IIUC, s_time_gran for timestamp updates
is uniform across all inodes.

I know that Dave said he wants to avoid changing on-disk format,
but I am hoping that this well defined and backward compat with
lazy upgrade, on-disk format change may be acceptable?

Thanks,
Amir.

2023-11-02 10:18:04

by Jeffrey Layton

[permalink] [raw]
Subject: Re: [PATCH RFC 2/9] timekeeping: new interfaces for multigrain timestamp handing

On Wed, 2023-11-01 at 13:38 +0200, Amir Goldstein wrote:
> On Wed, Nov 1, 2023 at 12:16 PM Jan Kara <[email protected]> wrote:
> >
> > On Wed 01-11-23 08:57:09, Dave Chinner wrote:
> > > 5. When-ever the inode is persisted, the timestamp is copied to the
> > > on-disk structure and the current change counter is folded in.
> > >
> > > This means the on-disk structure always contains the latest
> > > change attribute that has been persisted, just like we
> > > currently do with i_version now.
> > >
> > > 6. When-ever we read the inode off disk, we split the change counter
> > > from the timestamp and update the appropriate internal structures
> > > with this information.
> > >
> > > This ensures that the VFS and userspace never see the change
> > > counter implementation in the inode timestamps.
> >
> > OK, but is this compatible with the current XFS behavior? AFAICS currently
> > XFS sets sb->s_time_gran to 1 so timestamps currently stored on disk will
> > have some mostly random garbage in low bits of the ctime. Now if you look
> > at such inode with a kernel using this new scheme, stat(2) will report
> > ctime with low bits zeroed-out so if the ctime fetched in the old kernel was
> > stored in some external database and compared to the newly fetched ctime, it
> > will appear that ctime has gone backwards... Maybe we don't care but it is
> > a user visible change that can potentially confuse something.
>
> See xfs_inode_has_bigtime() and auto-upgrade of inode format in
> xfs_inode_item_precommit().
>
> In the case of BIGTIME inode format, admin needs to opt-in to
> BIGTIME format conversion by setting an INCOMPAT_BIGTIME
> sb feature flag.
>
> I imagine that something similar (inode flag + sb flag) would need
> to be done for the versioned-timestamp, but I think that in that case,
> the feature flag could be RO_COMPAT - there is no harm in exposing
> made-up nsec lower bits if fs would be mounted read-only on an old
> kernel, is there?
>
> The same RO_COMPAT feature flag could also be used to determine
> s_time_gran, because IIUC, s_time_gran for timestamp updates
> is uniform across all inodes.
>
> I know that Dave said he wants to avoid changing on-disk format,
> but I am hoping that this well defined and backward compat with
> lazy upgrade, on-disk format change may be acceptable?


With the ctime, we're somewhat saved by the fact that it's not settable
by users, so we don't need to worry as much about returning specific
values there, I think.

With the scheme Dave is proposing, booting to a new kernel vs. an old
kernel might show a different ctime on an inode though. That might be
enough to justify needing a way to opt-in to the change on existing
filesystems.
--
Jeff Layton <[email protected]>

2023-11-02 10:29:49

by Jeffrey Layton

[permalink] [raw]
Subject: Re: [PATCH RFC 2/9] timekeeping: new interfaces for multigrain timestamp handing

On Thu, 2023-11-02 at 10:29 +1100, Dave Chinner wrote:
> On Wed, Nov 01, 2023 at 09:34:57PM +0000, Trond Myklebust wrote:
> > On Wed, 2023-11-01 at 10:10 -1000, Linus Torvalds wrote:
> > > The above does not expose *any* changes to timestamps to users, and
> > > should work across a wide variety of filesystems, without requiring
> > > any special code from the filesystem itself.
> > >
> > > And now please all jump on me and say "No, Linus, that won't work,
> > > because XYZ".
> > >
> > > Because it is *entirely* possible that I missed something truly
> > > fundamental, and the above is completely broken for some obvious
> > > reason that I just didn't think of.
> > >
> >
> > My client writes to the file and immediately reads the ctime. A 3rd
> > party client then writes immediately after my ctime read.
> > A reboot occurs (maybe minutes later), then I re-read the ctime, and
> > get the same value as before the 3rd party write.
> >
> > Yes, most of the time that is better than the naked ctime, but not
> > across a reboot.
>
> This sort of "crash immediately after 3rd party data write" scenario
> has never worked properly, even with i_version.
>
> The issue is that 3rd party (local) buffered writes or metadata
> changes do not require any integrity or metadata stability
> operations to be performed by the filesystem unless O_[D]SYNC is set
> on the fd, RWF_[D]SYNC is set on the IO, or f{data}sync() is
> performed on the file.
>
> Hence no local filesystem currently persists i_version or ctime
> outside of operations with specific data integrity semantics.
>
> nfsd based modifications have application specific persistence
> requirements and that is triggered by the nfsd calling
> ->commit_metadata prior to returning the operation result to the
> client. This is what persists i_version/timestamp changes that were
> made during the nfsd operation - this persistence behaviour is not
> driven by the local filesystem.
>
> IOWs, this "change attribute failure" scenario is an existing
> problem with the current i_version implementation. It has always
> been flawed in this way but this didn't matter a decade ago because
> it's only purpose (and user) was nfsd and that had the required
> persistence semantics to hide these flaws within the application's
> context.
>
> Now that we are trying to expose i_version as a "generic change
> attribute", these persistence flaws get exposed because local
> filesystem operations do not have the same enforced persistence
> semantics as the NFS server.
>
> This is another reason I want i_version to die.
>
> What we need is a clear set of well defined semantics around statx
> change attribute sampling. Correct crash-recovery/integrity behaviour
> requires this rule:
>
> If the change attribute has been sampled, then the next
> modification to the filesystem that bumps change attribute *must*
> persist the change attribute modification atomically with the
> modification that requires it to change, or submit and complete
> persistence of the change attribute modification before the
> modification that requires it starts.
>
> e.g. a truncate can bump the change attribute atomically with the
> metadata changes in a transaction-based filesystem (ext4, XFS,
> btrfs, bcachefs, etc).
>
> Data writes are much harder, though. Some filesysetm structures can
> write data and metadata in a single update e.g. log structured or
> COW filesystems that can mix data and metadata like btrfs.
> Journalling filesystems require ordering between journal writes and
> the data writes to guarantee the change attribute is persistent
> before we write the data. Non-journalling filesystems require inode
> vs data write ordering.
>
> Hence I strongly doubt that a persistent change attribute is best
> implemented at the VFS - optimal, efficient implementations are
> highly filesystem specific regardless of how the change attribute is
> encoded in filesysetm metadata.
>
> This is another reason I want to change how the inode timestamp code
> is structured to call into the filesystem first rather than last.
> Different filesystems will need to do different things to persist
> a "ctime change counter" attribute correctly and efficiently -
> it's not a one-size fits all situation....

FWIW, the big danger for nfsd is is i_version rollback after a crash:

We can end up handing out an i_version value to the client before it
ever makes it to disk. If that happens, and the server crashes before it
ever makes it to disk, then the client can see the old i_version when it
queries it again (assuming the earlier write was lost).

That, in an of itself, is not a _huge_ problem for NFS clients. They'll
typically just invalidate their cache if that occurs and reread any data
they need.

The real danger is that you can have a write that occurs after the
reboot that is different from the earlier one and hand out a change
attribute that is a duplicate of the one viewed earlier. Now you have
the same change attribute that refers to two different states of the
file (and potential data corruption).

We mitigate that today by factoring in the ctime on regular files when
generating the change attribute (see nfsd4_change_attribute()). In
theory, i_version rolling back + a clock jump backward could generate
change attr collisions, even with that, but that's a bit harder to
contrive so we mostly don't worry about it.

I'm all for coming up with a way to make this more resilient though. If
we can offer the guarantee that you're proposing above, then that would
be a very nice thing.
--
Jeff Layton <[email protected]>