(Apologies for the resend, but I didn't send this with a wide enough
distribution list originally).
A few weeks ago, during one of the discussions around i_version, Dave
Chinner wrote this:
"You've missed the part where I suggested lifting the "nfsd sampled
i_version" state into an inode state flag rather than hiding it in
the i_version field. At that point, we could optimise away the
secondary ctime updates just like you are proposing we do with the
i_version updates. Further, we could also use that state it to
decide whether we need to use high resolution timestamps when
recording ctime updates - if the nfsd has not sampled the
ctime/i_version, we don't need high res timestamps to be recorded
for ctime...."
While I don't think we can practically optimize away ctime updates
like we do with i_version, I do like the idea of using this scheme to
indicate when we need to use a high-res timestamp.
This patchset is a first stab at a scheme to do this. It declares a new
i_state flag for this purpose and adds two new vfs-layer functions to
implement conditional high-res timestamp fetching. It then converts both
tmpfs and xfs to use it.
This seems to behave fine under xfstests, but I haven't yet done
any performance testing with it. I wouldn't expect it to create huge
regressions though since we're only grabbing high res timestamps after
each query.
I like this scheme because we can potentially convert any filesystem to
use it. No special storage requirements like with i_version field. I
think it'd potentially improve NFS cache coherency with a whole swath of
exportable filesystems, and helps out NFSv3 too.
This is really just a proof-of-concept. There are a number of things we
could change:
1/ We could use the top bit in the tv_sec field as the flag. That'd give
us different flags for ctime and mtime. We also wouldn't need to use
a spinlock.
2/ We could probably optimize away the high-res timestamp fetch in more
cases. Basically, always do a coarse-grained ts fetch and only fetch
the high-res ts when the QUERIED flag is set and the existing time
hasn't changed.
If this approach looks reasonable, I'll plan to start working on
converting more filesystems.
One thing I'm not clear on is how widely available high res timestamps
are. Is this something we need to gate on particular CONFIG_* options?
Thoughts?
Jeff Layton (3):
fs: add infrastructure for opportunistic high-res ctime/mtime updates
shmem: mark for high-res timestamps on next update after getattr
xfs: mark the inode for high-res timestamp update in getattr
fs/inode.c | 40 +++++++++++++++++++++++++++++++--
fs/stat.c | 10 +++++++++
fs/xfs/libxfs/xfs_trans_inode.c | 2 +-
fs/xfs/xfs_acl.c | 2 +-
fs/xfs/xfs_inode.c | 2 +-
fs/xfs/xfs_iops.c | 15 ++++++++++---
include/linux/fs.h | 5 ++++-
mm/shmem.c | 23 ++++++++++---------
8 files changed, 80 insertions(+), 19 deletions(-)
--
2.39.2
The VFS always uses coarse-grained timestamp updates for filling out the
ctime and mtime after a change. This has the benefit of allowing
filesystems to optimize away metadata updates.
Unfortunately, this has always been an issue when we're exporting via
NFSv3, which relies on timestamps to validate caches. Even with NFSv4, a
lot of exported filesystems don't properly support a change attribute
and are subject to the same problem of timestamp granularity. Other
applications have similar issues (e.g backup applications).
Switching to always using high resolution timestamps would improve the
situation for NFS, but that becomes rather expensive, as we'd have to
log a lot more metadata updates.
This patch grabs a new i_state bit to use as a flag that filesystems can
set in their getattr routine to indicate that the mtime or ctime was
queried since it was last updated.
It then adds a new current_cmtime function that acts like the
current_time helper, but will conditionally grab high-res timestamps
when the i_state flag is set in the inode.
This allows NFS and other applications to reap the benefits of high-res
ctime and mtime timestamps, but at a substantially lower cost than
fetching them every time.
Cc: Dave Chinner <[email protected]>
Signed-off-by: Jeff Layton <[email protected]>
---
fs/inode.c | 40 ++++++++++++++++++++++++++++++++++++++--
fs/stat.c | 10 ++++++++++
include/linux/fs.h | 5 ++++-
3 files changed, 52 insertions(+), 3 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index 4558dc2f1355..3630f67fd042 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -2062,6 +2062,42 @@ static int __file_update_time(struct file *file, struct timespec64 *now,
return ret;
}
+/**
+ * current_cmtime - Return FS time (possibly high-res)
+ * @inode: inode.
+ *
+ * Return the current time truncated to the time granularity supported by
+ * the fs, as suitable for a ctime or mtime change. If something recently
+ * fetched the ctime or mtime out of the inode via getattr, then get a
+ * high-resolution timestamp.
+ *
+ * Note that inode and inode->sb cannot be NULL.
+ * Otherwise, the function warns and returns coarse time without truncation.
+ */
+struct timespec64 current_cmtime(struct inode *inode)
+{
+ struct timespec64 now;
+
+ if (unlikely(!inode->i_sb)) {
+ WARN(1, "%s() called with uninitialized super_block in the inode", __func__);
+ ktime_get_coarse_real_ts64(&now);
+ return now;
+ }
+
+ /* Do a lockless check for the flag before taking the spinlock */
+ if (READ_ONCE(inode->i_state) & I_CMTIME_QUERIED) {
+ ktime_get_real_ts64(&now);
+ spin_lock(&inode->i_lock);
+ inode->i_state &= ~I_CMTIME_QUERIED;
+ spin_unlock(&inode->i_lock);
+ } else {
+ ktime_get_coarse_real_ts64(&now);
+ }
+
+ return timestamp_truncate(now, inode);
+}
+EXPORT_SYMBOL(current_cmtime);
+
/**
* file_update_time - update mtime and ctime time
* @file: file accessed
@@ -2080,7 +2116,7 @@ int file_update_time(struct file *file)
{
int ret;
struct inode *inode = file_inode(file);
- struct timespec64 now = current_time(inode);
+ struct timespec64 now = current_cmtime(inode);
ret = inode_needs_update_time(inode, &now);
if (ret <= 0)
@@ -2109,7 +2145,7 @@ static int file_modified_flags(struct file *file, int flags)
{
int ret;
struct inode *inode = file_inode(file);
- struct timespec64 now = current_time(inode);
+ struct timespec64 now = current_cmtime(inode);
/*
* Clear the security bits if the process is not being run by root.
diff --git a/fs/stat.c b/fs/stat.c
index 7c238da22ef0..d8b80a2e36b7 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -64,6 +64,16 @@ void generic_fillattr(struct mnt_idmap *idmap, struct inode *inode,
}
EXPORT_SYMBOL(generic_fillattr);
+void fill_cmtime_and_mark(struct inode *inode, struct kstat *stat)
+{
+ spin_lock(&inode->i_lock);
+ inode->i_state |= I_CMTIME_QUERIED;
+ stat->ctime = inode->i_ctime;
+ stat->mtime = inode->i_mtime;
+ spin_unlock(&inode->i_lock);
+}
+EXPORT_SYMBOL(fill_cmtime_and_mark);
+
/**
* generic_fill_statx_attr - Fill in the statx attributes from the inode flags
* @inode: Inode to use as the source
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c85916e9f7db..7dece4390979 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1457,7 +1457,8 @@ static inline bool fsuidgid_has_mapping(struct super_block *sb,
kgid_has_mapping(fs_userns, kgid);
}
-extern struct timespec64 current_time(struct inode *inode);
+struct timespec64 current_time(struct inode *inode);
+struct timespec64 current_cmtime(struct inode *inode);
/*
* Snapshotting support.
@@ -2116,6 +2117,7 @@ static inline void kiocb_clone(struct kiocb *kiocb, struct kiocb *kiocb_src,
#define I_DONTCACHE (1 << 16)
#define I_SYNC_QUEUED (1 << 17)
#define I_PINNING_FSCACHE_WB (1 << 18)
+#define I_CMTIME_QUERIED (1 << 19)
#define I_DIRTY_INODE (I_DIRTY_SYNC | I_DIRTY_DATASYNC)
#define I_DIRTY (I_DIRTY_INODE | I_DIRTY_PAGES)
@@ -2839,6 +2841,7 @@ extern int page_symlink(struct inode *inode, const char *symname, int len);
extern const struct inode_operations page_symlink_inode_operations;
extern void kfree_link(void *);
void generic_fillattr(struct mnt_idmap *, struct inode *, struct kstat *);
+void fill_cmtime_and_mark(struct inode *inode, struct kstat *stat);
void generic_fill_statx_attr(struct inode *inode, struct kstat *stat);
extern int vfs_getattr_nosec(const struct path *, struct kstat *, u32, unsigned int);
extern int vfs_getattr(const struct path *, struct kstat *, u32, unsigned int);
--
2.39.2
When the mtime or ctime is being queried via getattr, ensure that we
mark the inode for a high-res timestamp update on the next pass. Also,
switch to current_cmtime for other c/mtime updates.
Signed-off-by: Jeff Layton <[email protected]>
---
mm/shmem.c | 23 +++++++++++++----------
1 file changed, 13 insertions(+), 10 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c
index 448f393d8ab2..75dd09492c36 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1039,7 +1039,7 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
void shmem_truncate_range(struct inode *inode, loff_t lstart, loff_t lend)
{
shmem_undo_range(inode, lstart, lend, false);
- inode->i_ctime = inode->i_mtime = current_time(inode);
+ inode->i_ctime = inode->i_mtime = current_cmtime(inode);
inode_inc_iversion(inode);
}
EXPORT_SYMBOL_GPL(shmem_truncate_range);
@@ -1065,7 +1065,10 @@ static int shmem_getattr(struct mnt_idmap *idmap,
stat->attributes_mask |= (STATX_ATTR_APPEND |
STATX_ATTR_IMMUTABLE |
STATX_ATTR_NODUMP);
+
generic_fillattr(idmap, inode, stat);
+ if (request_mask & (STATX_CTIME|STATX_MTIME))
+ fill_cmtime_and_mark(inode, stat);
if (shmem_is_huge(inode, 0, false, NULL, 0))
stat->blksize = HPAGE_PMD_SIZE;
@@ -1136,7 +1139,7 @@ static int shmem_setattr(struct mnt_idmap *idmap,
if (attr->ia_valid & ATTR_MODE)
error = posix_acl_chmod(idmap, dentry, inode->i_mode);
if (!error && update_ctime) {
- inode->i_ctime = current_time(inode);
+ inode->i_ctime = current_cmtime(inode);
if (update_mtime)
inode->i_mtime = inode->i_ctime;
inode_inc_iversion(inode);
@@ -2361,7 +2364,7 @@ static struct inode *shmem_get_inode(struct mnt_idmap *idmap, struct super_block
inode->i_ino = ino;
inode_init_owner(idmap, inode, dir, mode);
inode->i_blocks = 0;
- inode->i_atime = inode->i_mtime = inode->i_ctime = current_time(inode);
+ inode->i_atime = inode->i_mtime = inode->i_ctime = current_cmtime(inode);
inode->i_generation = get_random_u32();
info = SHMEM_I(inode);
memset(info, 0, (char *)inode - (char *)info);
@@ -2940,7 +2943,7 @@ shmem_mknod(struct mnt_idmap *idmap, struct inode *dir,
error = 0;
dir->i_size += BOGO_DIRENT_SIZE;
- dir->i_ctime = dir->i_mtime = current_time(dir);
+ dir->i_ctime = dir->i_mtime = current_cmtime(dir);
inode_inc_iversion(dir);
d_instantiate(dentry, inode);
dget(dentry); /* Extra count - pin the dentry in core */
@@ -3016,7 +3019,7 @@ static int shmem_link(struct dentry *old_dentry, struct inode *dir, struct dentr
}
dir->i_size += BOGO_DIRENT_SIZE;
- inode->i_ctime = dir->i_ctime = dir->i_mtime = current_time(inode);
+ inode->i_ctime = dir->i_ctime = dir->i_mtime = current_cmtime(inode);
inode_inc_iversion(dir);
inc_nlink(inode);
ihold(inode); /* New dentry reference */
@@ -3034,7 +3037,7 @@ static int shmem_unlink(struct inode *dir, struct dentry *dentry)
shmem_free_inode(inode->i_sb);
dir->i_size -= BOGO_DIRENT_SIZE;
- inode->i_ctime = dir->i_ctime = dir->i_mtime = current_time(inode);
+ inode->i_ctime = dir->i_ctime = dir->i_mtime = current_cmtime(inode);
inode_inc_iversion(dir);
drop_nlink(inode);
dput(dentry); /* Undo the count from "create" - this does all the work */
@@ -3124,7 +3127,7 @@ static int shmem_rename2(struct mnt_idmap *idmap,
new_dir->i_size += BOGO_DIRENT_SIZE;
old_dir->i_ctime = old_dir->i_mtime =
new_dir->i_ctime = new_dir->i_mtime =
- inode->i_ctime = current_time(old_dir);
+ inode->i_ctime = current_cmtime(old_dir);
inode_inc_iversion(old_dir);
inode_inc_iversion(new_dir);
return 0;
@@ -3178,7 +3181,7 @@ static int shmem_symlink(struct mnt_idmap *idmap, struct inode *dir,
folio_put(folio);
}
dir->i_size += BOGO_DIRENT_SIZE;
- dir->i_ctime = dir->i_mtime = current_time(dir);
+ dir->i_ctime = dir->i_mtime = current_cmtime(dir);
inode_inc_iversion(dir);
d_instantiate(dentry, inode);
dget(dentry);
@@ -3250,7 +3253,7 @@ static int shmem_fileattr_set(struct mnt_idmap *idmap,
(fa->flags & SHMEM_FL_USER_MODIFIABLE);
shmem_set_inode_flags(inode, info->fsflags);
- inode->i_ctime = current_time(inode);
+ inode->i_ctime = current_cmtime(inode);
inode_inc_iversion(inode);
return 0;
}
@@ -3320,7 +3323,7 @@ static int shmem_xattr_handler_set(const struct xattr_handler *handler,
name = xattr_full_name(handler, name);
err = simple_xattr_set(&info->xattrs, name, value, size, flags, NULL);
if (!err) {
- inode->i_ctime = current_time(inode);
+ inode->i_ctime = current_cmtime(inode);
inode_inc_iversion(inode);
}
return err;
--
2.39.2
When the mtime or ctime is being queried via getattr, ensure that we
mark the inode for a high-res timestamp update on the next pass. Also,
switch to current_cmtime for other c/mtime updates.
With this change, we're better off having the NFS server just ignore
the i_version field and have it use the ctime instead, so clear the
STATX_CHANGE_COOKIE flag in the result mask in ->getattr.
Signed-off-by: Jeff Layton <[email protected]>
---
fs/xfs/libxfs/xfs_trans_inode.c | 2 +-
fs/xfs/xfs_acl.c | 2 +-
fs/xfs/xfs_inode.c | 2 +-
fs/xfs/xfs_iops.c | 15 ++++++++++++---
4 files changed, 15 insertions(+), 6 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
index 8b5547073379..9ad7c229c617 100644
--- a/fs/xfs/libxfs/xfs_trans_inode.c
+++ b/fs/xfs/libxfs/xfs_trans_inode.c
@@ -63,7 +63,7 @@ xfs_trans_ichgtime(
ASSERT(tp);
ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
- tv = current_time(inode);
+ tv = current_cmtime(inode);
if (flags & XFS_ICHGTIME_MOD)
inode->i_mtime = tv;
diff --git a/fs/xfs/xfs_acl.c b/fs/xfs/xfs_acl.c
index 791db7d9c849..461adc58cf8c 100644
--- a/fs/xfs/xfs_acl.c
+++ b/fs/xfs/xfs_acl.c
@@ -233,7 +233,7 @@ xfs_acl_set_mode(
xfs_ilock(ip, XFS_ILOCK_EXCL);
xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
inode->i_mode = mode;
- inode->i_ctime = current_time(inode);
+ inode->i_ctime = current_cmtime(inode);
xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
if (xfs_has_wsync(mp))
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 5808abab786c..80f9d731e261 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -843,7 +843,7 @@ xfs_init_new_inode(
ip->i_df.if_nextents = 0;
ASSERT(ip->i_nblocks == 0);
- tv = current_time(inode);
+ tv = current_cmtime(inode);
inode->i_mtime = tv;
inode->i_atime = tv;
inode->i_ctime = tv;
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 24718adb3c16..a0b07f90e16c 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -565,6 +565,15 @@ xfs_vn_getattr(
if (xfs_is_shutdown(mp))
return -EIO;
+ /*
+ * XFS uses the i_version infrastructure to track any change to
+ * the inode, including atime updates. This means that the i_version
+ * returned by getattr doesn't conform to what the callers expect.
+ * Clear it here so that nfsd will fake up a change cookie from the
+ * ctime instead.
+ */
+ stat->result_mask &= ~STATX_CHANGE_COOKIE;
+
stat->size = XFS_ISIZE(ip);
stat->dev = inode->i_sb->s_dev;
stat->mode = inode->i_mode;
@@ -573,8 +582,8 @@ xfs_vn_getattr(
stat->gid = vfsgid_into_kgid(vfsgid);
stat->ino = ip->i_ino;
stat->atime = inode->i_atime;
- stat->mtime = inode->i_mtime;
- stat->ctime = inode->i_ctime;
+ if (request_mask & (STATX_CTIME|STATX_MTIME))
+ fill_cmtime_and_mark(inode, stat);
stat->blocks = XFS_FSB_TO_BB(mp, ip->i_nblocks + ip->i_delayed_blks);
if (xfs_has_v3inodes(mp)) {
@@ -917,7 +926,7 @@ xfs_setattr_size(
if (newsize != oldsize &&
!(iattr->ia_valid & (ATTR_CTIME | ATTR_MTIME))) {
iattr->ia_ctime = iattr->ia_mtime =
- current_time(inode);
+ current_cmtime(inode);
iattr->ia_valid |= ATTR_CTIME | ATTR_MTIME;
}
--
2.39.2
On Tue, Apr 11, 2023 at 10:37:02AM -0400, Jeff Layton wrote:
> When the mtime or ctime is being queried via getattr, ensure that we
> mark the inode for a high-res timestamp update on the next pass. Also,
> switch to current_cmtime for other c/mtime updates.
>
> With this change, we're better off having the NFS server just ignore
> the i_version field and have it use the ctime instead, so clear the
> STATX_CHANGE_COOKIE flag in the result mask in ->getattr.
>
> Signed-off-by: Jeff Layton <[email protected]>
> ---
> fs/xfs/libxfs/xfs_trans_inode.c | 2 +-
> fs/xfs/xfs_acl.c | 2 +-
> fs/xfs/xfs_inode.c | 2 +-
> fs/xfs/xfs_iops.c | 15 ++++++++++++---
> 4 files changed, 15 insertions(+), 6 deletions(-)
>
> diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
> index 8b5547073379..9ad7c229c617 100644
> --- a/fs/xfs/libxfs/xfs_trans_inode.c
> +++ b/fs/xfs/libxfs/xfs_trans_inode.c
> @@ -63,7 +63,7 @@ xfs_trans_ichgtime(
> ASSERT(tp);
> ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
>
> - tv = current_time(inode);
> + tv = current_cmtime(inode);
>
> if (flags & XFS_ICHGTIME_MOD)
> inode->i_mtime = tv;
> diff --git a/fs/xfs/xfs_acl.c b/fs/xfs/xfs_acl.c
> index 791db7d9c849..461adc58cf8c 100644
> --- a/fs/xfs/xfs_acl.c
> +++ b/fs/xfs/xfs_acl.c
> @@ -233,7 +233,7 @@ xfs_acl_set_mode(
> xfs_ilock(ip, XFS_ILOCK_EXCL);
> xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
> inode->i_mode = mode;
> - inode->i_ctime = current_time(inode);
> + inode->i_ctime = current_cmtime(inode);
Hmm, now we're adding a spinlock to all these updates.
Does lockdep have anything exciting to say about this?
(I don't think it will, just wondering aloud...)
> xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
>
> if (xfs_has_wsync(mp))
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 5808abab786c..80f9d731e261 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -843,7 +843,7 @@ xfs_init_new_inode(
> ip->i_df.if_nextents = 0;
> ASSERT(ip->i_nblocks == 0);
>
> - tv = current_time(inode);
> + tv = current_cmtime(inode);
> inode->i_mtime = tv;
> inode->i_atime = tv;
> inode->i_ctime = tv;
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index 24718adb3c16..a0b07f90e16c 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -565,6 +565,15 @@ xfs_vn_getattr(
> if (xfs_is_shutdown(mp))
> return -EIO;
>
> + /*
> + * XFS uses the i_version infrastructure to track any change to
> + * the inode, including atime updates. This means that the i_version
> + * returned by getattr doesn't conform to what the callers expect.
> + * Clear it here so that nfsd will fake up a change cookie from the
> + * ctime instead.
> + */
> + stat->result_mask &= ~STATX_CHANGE_COOKIE;
> +
> stat->size = XFS_ISIZE(ip);
> stat->dev = inode->i_sb->s_dev;
> stat->mode = inode->i_mode;
> @@ -573,8 +582,8 @@ xfs_vn_getattr(
> stat->gid = vfsgid_into_kgid(vfsgid);
> stat->ino = ip->i_ino;
> stat->atime = inode->i_atime;
> - stat->mtime = inode->i_mtime;
> - stat->ctime = inode->i_ctime;
> + if (request_mask & (STATX_CTIME|STATX_MTIME))
> + fill_cmtime_and_mark(inode, stat);
Should we be setting STATX_[CM]TIME in the result_mask, just in case the
caller asked for ctime and not mtime?
--D
> stat->blocks = XFS_FSB_TO_BB(mp, ip->i_nblocks + ip->i_delayed_blks);
>
> if (xfs_has_v3inodes(mp)) {
> @@ -917,7 +926,7 @@ xfs_setattr_size(
> if (newsize != oldsize &&
> !(iattr->ia_valid & (ATTR_CTIME | ATTR_MTIME))) {
> iattr->ia_ctime = iattr->ia_mtime =
> - current_time(inode);
> + current_cmtime(inode);
> iattr->ia_valid |= ATTR_CTIME | ATTR_MTIME;
> }
>
> --
> 2.39.2
>
On Tue, Apr 11, 2023 at 07:54:46AM -0700, Darrick J. Wong wrote:
> On Tue, Apr 11, 2023 at 10:37:02AM -0400, Jeff Layton wrote:
> > When the mtime or ctime is being queried via getattr, ensure that we
> > mark the inode for a high-res timestamp update on the next pass. Also,
> > switch to current_cmtime for other c/mtime updates.
> >
> > With this change, we're better off having the NFS server just ignore
> > the i_version field and have it use the ctime instead, so clear the
> > STATX_CHANGE_COOKIE flag in the result mask in ->getattr.
> >
> > Signed-off-by: Jeff Layton <[email protected]>
> > ---
> > fs/xfs/libxfs/xfs_trans_inode.c | 2 +-
> > fs/xfs/xfs_acl.c | 2 +-
> > fs/xfs/xfs_inode.c | 2 +-
> > fs/xfs/xfs_iops.c | 15 ++++++++++++---
> > 4 files changed, 15 insertions(+), 6 deletions(-)
> >
> > diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
> > index 8b5547073379..9ad7c229c617 100644
> > --- a/fs/xfs/libxfs/xfs_trans_inode.c
> > +++ b/fs/xfs/libxfs/xfs_trans_inode.c
> > @@ -63,7 +63,7 @@ xfs_trans_ichgtime(
> > ASSERT(tp);
> > ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
> >
> > - tv = current_time(inode);
> > + tv = current_cmtime(inode);
> >
> > if (flags & XFS_ICHGTIME_MOD)
> > inode->i_mtime = tv;
> > diff --git a/fs/xfs/xfs_acl.c b/fs/xfs/xfs_acl.c
> > index 791db7d9c849..461adc58cf8c 100644
> > --- a/fs/xfs/xfs_acl.c
> > +++ b/fs/xfs/xfs_acl.c
> > @@ -233,7 +233,7 @@ xfs_acl_set_mode(
> > xfs_ilock(ip, XFS_ILOCK_EXCL);
> > xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
> > inode->i_mode = mode;
> > - inode->i_ctime = current_time(inode);
> > + inode->i_ctime = current_cmtime(inode);
>
> Hmm, now we're adding a spinlock to all these updates.
> Does lockdep have anything exciting to say about this?
>
> (I don't think it will, just wondering aloud...)
>
> > xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
> >
> > if (xfs_has_wsync(mp))
> > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > index 5808abab786c..80f9d731e261 100644
> > --- a/fs/xfs/xfs_inode.c
> > +++ b/fs/xfs/xfs_inode.c
> > @@ -843,7 +843,7 @@ xfs_init_new_inode(
> > ip->i_df.if_nextents = 0;
> > ASSERT(ip->i_nblocks == 0);
> >
> > - tv = current_time(inode);
> > + tv = current_cmtime(inode);
> > inode->i_mtime = tv;
> > inode->i_atime = tv;
> > inode->i_ctime = tv;
> > diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> > index 24718adb3c16..a0b07f90e16c 100644
> > --- a/fs/xfs/xfs_iops.c
> > +++ b/fs/xfs/xfs_iops.c
> > @@ -565,6 +565,15 @@ xfs_vn_getattr(
> > if (xfs_is_shutdown(mp))
> > return -EIO;
> >
> > + /*
> > + * XFS uses the i_version infrastructure to track any change to
> > + * the inode, including atime updates. This means that the i_version
> > + * returned by getattr doesn't conform to what the callers expect.
> > + * Clear it here so that nfsd will fake up a change cookie from the
> > + * ctime instead.
> > + */
> > + stat->result_mask &= ~STATX_CHANGE_COOKIE;
> > +
> > stat->size = XFS_ISIZE(ip);
> > stat->dev = inode->i_sb->s_dev;
> > stat->mode = inode->i_mode;
> > @@ -573,8 +582,8 @@ xfs_vn_getattr(
> > stat->gid = vfsgid_into_kgid(vfsgid);
> > stat->ino = ip->i_ino;
> > stat->atime = inode->i_atime;
> > - stat->mtime = inode->i_mtime;
> > - stat->ctime = inode->i_ctime;
> > + if (request_mask & (STATX_CTIME|STATX_MTIME))
> > + fill_cmtime_and_mark(inode, stat);
>
> Should we be setting STATX_[CM]TIME in the result_mask, just in case the
> caller asked for ctime and not mtime?
I think the expectation is that everything in STATX_BASIC_MASK is always
returned to allow equivalence between stat() and statx(). So requesting
STATX_CTIME separately from STATX_MTIME isn't implemented widely, maybe
even not at atll?, yet.
On Tue, 2023-04-11 at 17:07 +0200, Christian Brauner wrote:
> On Tue, Apr 11, 2023 at 10:37:00AM -0400, Jeff Layton wrote:
> > The VFS always uses coarse-grained timestamp updates for filling out the
> > ctime and mtime after a change. This has the benefit of allowing
> > filesystems to optimize away metadata updates.
> >
> > Unfortunately, this has always been an issue when we're exporting via
> > NFSv3, which relies on timestamps to validate caches. Even with NFSv4, a
> > lot of exported filesystems don't properly support a change attribute
> > and are subject to the same problem of timestamp granularity. Other
> > applications have similar issues (e.g backup applications).
> >
> > Switching to always using high resolution timestamps would improve the
> > situation for NFS, but that becomes rather expensive, as we'd have to
> > log a lot more metadata updates.
> >
> > This patch grabs a new i_state bit to use as a flag that filesystems can
> > set in their getattr routine to indicate that the mtime or ctime was
> > queried since it was last updated.
> >
> > It then adds a new current_cmtime function that acts like the
> > current_time helper, but will conditionally grab high-res timestamps
> > when the i_state flag is set in the inode.
> >
> > This allows NFS and other applications to reap the benefits of high-res
> > ctime and mtime timestamps, but at a substantially lower cost than
> > fetching them every time.
> >
> > Cc: Dave Chinner <[email protected]>
> > Signed-off-by: Jeff Layton <[email protected]>
> > ---
> > fs/inode.c | 40 ++++++++++++++++++++++++++++++++++++++--
> > fs/stat.c | 10 ++++++++++
> > include/linux/fs.h | 5 ++++-
> > 3 files changed, 52 insertions(+), 3 deletions(-)
> >
> > diff --git a/fs/inode.c b/fs/inode.c
> > index 4558dc2f1355..3630f67fd042 100644
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -2062,6 +2062,42 @@ static int __file_update_time(struct file *file, struct timespec64 *now,
> > return ret;
> > }
> >
> > +/**
> > + * current_cmtime - Return FS time (possibly high-res)
> > + * @inode: inode.
> > + *
> > + * Return the current time truncated to the time granularity supported by
> > + * the fs, as suitable for a ctime or mtime change. If something recently
> > + * fetched the ctime or mtime out of the inode via getattr, then get a
> > + * high-resolution timestamp.
> > + *
> > + * Note that inode and inode->sb cannot be NULL.
> > + * Otherwise, the function warns and returns coarse time without truncation.
> > + */
> > +struct timespec64 current_cmtime(struct inode *inode)
> > +{
> > + struct timespec64 now;
> > +
> > + if (unlikely(!inode->i_sb)) {
> > + WARN(1, "%s() called with uninitialized super_block in the inode", __func__);
>
> How would this happen? Seems weird to even bother checking this.
>
Agreed. I copied this from current_time. I'm fine with leaving that out.
Maybe we should remove it from current_time as well?
> > + ktime_get_coarse_real_ts64(&now);
> > + return now;
> > + }
> > +
> > + /* Do a lockless check for the flag before taking the spinlock */
> > + if (READ_ONCE(inode->i_state) & I_CMTIME_QUERIED) {
> > + ktime_get_real_ts64(&now);
> > + spin_lock(&inode->i_lock);
> > + inode->i_state &= ~I_CMTIME_QUERIED;
> > + spin_unlock(&inode->i_lock);
> > + } else {
> > + ktime_get_coarse_real_ts64(&now);
> > + }
> > +
> > + return timestamp_truncate(now, inode);
> > +}
> > +EXPORT_SYMBOL(current_cmtime);
> > +
> > /**
> > * file_update_time - update mtime and ctime time
> > * @file: file accessed
> > @@ -2080,7 +2116,7 @@ int file_update_time(struct file *file)
> > {
> > int ret;
> > struct inode *inode = file_inode(file);
> > - struct timespec64 now = current_time(inode);
> > + struct timespec64 now = current_cmtime(inode);
> >
> > ret = inode_needs_update_time(inode, &now);
> > if (ret <= 0)
> > @@ -2109,7 +2145,7 @@ static int file_modified_flags(struct file *file, int flags)
> > {
> > int ret;
> > struct inode *inode = file_inode(file);
> > - struct timespec64 now = current_time(inode);
> > + struct timespec64 now = current_cmtime(inode);
> >
> > /*
> > * Clear the security bits if the process is not being run by root.
> > diff --git a/fs/stat.c b/fs/stat.c
> > index 7c238da22ef0..d8b80a2e36b7 100644
> > --- a/fs/stat.c
> > +++ b/fs/stat.c
> > @@ -64,6 +64,16 @@ void generic_fillattr(struct mnt_idmap *idmap, struct inode *inode,
> > }
> > EXPORT_SYMBOL(generic_fillattr);
> >
> > +void fill_cmtime_and_mark(struct inode *inode, struct kstat *stat)
> > +{
> > + spin_lock(&inode->i_lock);
> > + inode->i_state |= I_CMTIME_QUERIED;
> > + stat->ctime = inode->i_ctime;
> > + stat->mtime = inode->i_mtime;
> > + spin_unlock(&inode->i_lock);
> > +}
> > +EXPORT_SYMBOL(fill_cmtime_and_mark);
>
> So that means that each stat call would mark an inode for a
> high-resolution update.
>
Yep. At least any statx call with STATX_CTIME|STATX_MTIME set (which
includes legacy stat() calls of course).
> There's some performance concerns here. Calling
> stat() is super common and it would potentially make the next iop more
> expensive. Recursively changing ownership in the container use-case come
> to mind which are already expensive.
stat() is common, but not generally as common as write calls are. I
expect that we'll get somewhat similar results tochanged i_version over
to use a similar QUERIED flag.
The i_version field was originally very expensive and required metadata
updates on every write. After making that change, we got the same
performance back in most tests that we got without the i_version field
being enabled at all. Basically, this just means we'll end up logging an
extra journal transaction on some writes that follow a stat() call,
which turns out to be line noise for most workloads.
I do agree that performance is a concern here though. We'll need to
benchmark this somehow.
--
Jeff Layton <[email protected]>
On Tue, Apr 11, 2023 at 10:36:59AM -0400, Jeff Layton wrote:
> (Apologies for the resend, but I didn't send this with a wide enough
> distribution list originally).
>
> A few weeks ago, during one of the discussions around i_version, Dave
> Chinner wrote this:
>
> "You've missed the part where I suggested lifting the "nfsd sampled
> i_version" state into an inode state flag rather than hiding it in
> the i_version field. At that point, we could optimise away the
> secondary ctime updates just like you are proposing we do with the
> i_version updates. Further, we could also use that state it to
> decide whether we need to use high resolution timestamps when
> recording ctime updates - if the nfsd has not sampled the
> ctime/i_version, we don't need high res timestamps to be recorded
> for ctime...."
>
> While I don't think we can practically optimize away ctime updates
> like we do with i_version, I do like the idea of using this scheme to
> indicate when we need to use a high-res timestamp.
>
> This patchset is a first stab at a scheme to do this. It declares a new
> i_state flag for this purpose and adds two new vfs-layer functions to
> implement conditional high-res timestamp fetching. It then converts both
> tmpfs and xfs to use it.
>
> This seems to behave fine under xfstests, but I haven't yet done
> any performance testing with it. I wouldn't expect it to create huge
> regressions though since we're only grabbing high res timestamps after
> each query.
>
> I like this scheme because we can potentially convert any filesystem to
> use it. No special storage requirements like with i_version field. I
> think it'd potentially improve NFS cache coherency with a whole swath of
> exportable filesystems, and helps out NFSv3 too.
>
> This is really just a proof-of-concept. There are a number of things we
> could change:
>
> 1/ We could use the top bit in the tv_sec field as the flag. That'd give
> us different flags for ctime and mtime. We also wouldn't need to use
> a spinlock.
>
> 2/ We could probably optimize away the high-res timestamp fetch in more
> cases. Basically, always do a coarse-grained ts fetch and only fetch
> the high-res ts when the QUERIED flag is set and the existing time
> hasn't changed.
>
> If this approach looks reasonable, I'll plan to start working on
> converting more filesystems.
Seems reasonable to me. In terms of testing, I suspect the main
impact is going to be the additionaly overhead of taking a spinlock
in normal stat calls. In which case, testing common tools like giti
would be useful. e.g. `git status` runs about 170k stat calls on a
typical kernel tree. If anything is going to be noticed by users
that actually care, it'll be workloads like this...
If we manage to elide the spinlock altogether, then I don't think
we're going to be able to measure any sort perf difference on modern
hardware short of high end NFS benchmarks that drive servers to
their CPU usage limits....
> One thing I'm not clear on is how widely available high res timestamps
> are. Is this something we need to gate on particular CONFIG_* options?
Don't think so - the kernel should always provide the highest
resoultion it can through the get_time interfaces - the _coarse
variants simple return what was read from the high res timer at the
last scheduler tick, hence avoiding the hardware timer overhead when
high res timer resolution is not needed.....
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Tue, Apr 11, 2023 at 5:38 PM Jeff Layton <[email protected]> wrote:
>
> (Apologies for the resend, but I didn't send this with a wide enough
> distribution list originally).
>
> A few weeks ago, during one of the discussions around i_version, Dave
> Chinner wrote this:
>
> "You've missed the part where I suggested lifting the "nfsd sampled
> i_version" state into an inode state flag rather than hiding it in
> the i_version field. At that point, we could optimise away the
> secondary ctime updates just like you are proposing we do with the
> i_version updates. Further, we could also use that state it to
> decide whether we need to use high resolution timestamps when
> recording ctime updates - if the nfsd has not sampled the
> ctime/i_version, we don't need high res timestamps to be recorded
> for ctime...."
>
> While I don't think we can practically optimize away ctime updates
> like we do with i_version, I do like the idea of using this scheme to
> indicate when we need to use a high-res timestamp.
>
> This patchset is a first stab at a scheme to do this. It declares a new
> i_state flag for this purpose and adds two new vfs-layer functions to
> implement conditional high-res timestamp fetching. It then converts both
> tmpfs and xfs to use it.
>
> This seems to behave fine under xfstests, but I haven't yet done
> any performance testing with it. I wouldn't expect it to create huge
> regressions though since we're only grabbing high res timestamps after
> each query.
>
> I like this scheme because we can potentially convert any filesystem to
> use it. No special storage requirements like with i_version field. I
> think it'd potentially improve NFS cache coherency with a whole swath of
> exportable filesystems, and helps out NFSv3 too.
>
> This is really just a proof-of-concept. There are a number of things we
> could change:
>
> 1/ We could use the top bit in the tv_sec field as the flag. That'd give
> us different flags for ctime and mtime. We also wouldn't need to use
> a spinlock.
>
> 2/ We could probably optimize away the high-res timestamp fetch in more
> cases. Basically, always do a coarse-grained ts fetch and only fetch
> the high-res ts when the QUERIED flag is set and the existing time
> hasn't changed.
>
> If this approach looks reasonable, I'll plan to start working on
> converting more filesystems.
>
> One thing I'm not clear on is how widely available high res timestamps
> are. Is this something we need to gate on particular CONFIG_* options?
>
> Thoughts?
Jeff,
Considering that this proposal is pretty uncontroversial,
do you still want to discuss/lead a session on i_version changes in LSF/MM?
I noticed that Chuck listed "timespamt resolution and i_version" as part
of his NFSD BoF topic proposal [1], but I do not think all of these topics
can fit in one 30 minute session.
Dave,
I would like to use this opportunity to invite you and any developers that
are involved in fs development and not going to attend LSF/MM in-person,
to join LSF/MM virtually for some sessions that you may be interested in.
See this lore query [2] for TOPICs proposed this year.
You can let me know privately which sessions you are interested in
attending and your time zone and I will do my best to schedule those
sessions in time slots that would be more convenient for your time zone.
Obviously, I am referring to FS track sessions.
Cross track sessions are going to be harder to accommodate,
but I can try.
Thanks,
Amir.
[1] https://lore.kernel.org/linux-fsdevel/[email protected]/
[2] https://lore.kernel.org/linux-fsdevel/?q=LSF+TOPIC+-re+d%3A4.months.ago..
On Sat, 2023-04-15 at 14:35 +0300, Amir Goldstein wrote:
> On Tue, Apr 11, 2023 at 5:38 PM Jeff Layton <[email protected]> wrote:
> >
> > (Apologies for the resend, but I didn't send this with a wide enough
> > distribution list originally).
> >
> > A few weeks ago, during one of the discussions around i_version, Dave
> > Chinner wrote this:
> >
> > "You've missed the part where I suggested lifting the "nfsd sampled
> > i_version" state into an inode state flag rather than hiding it in
> > the i_version field. At that point, we could optimise away the
> > secondary ctime updates just like you are proposing we do with the
> > i_version updates. Further, we could also use that state it to
> > decide whether we need to use high resolution timestamps when
> > recording ctime updates - if the nfsd has not sampled the
> > ctime/i_version, we don't need high res timestamps to be recorded
> > for ctime...."
> >
> > While I don't think we can practically optimize away ctime updates
> > like we do with i_version, I do like the idea of using this scheme to
> > indicate when we need to use a high-res timestamp.
> >
> > This patchset is a first stab at a scheme to do this. It declares a new
> > i_state flag for this purpose and adds two new vfs-layer functions to
> > implement conditional high-res timestamp fetching. It then converts both
> > tmpfs and xfs to use it.
> >
> > This seems to behave fine under xfstests, but I haven't yet done
> > any performance testing with it. I wouldn't expect it to create huge
> > regressions though since we're only grabbing high res timestamps after
> > each query.
> >
> > I like this scheme because we can potentially convert any filesystem to
> > use it. No special storage requirements like with i_version field. I
> > think it'd potentially improve NFS cache coherency with a whole swath of
> > exportable filesystems, and helps out NFSv3 too.
> >
> > This is really just a proof-of-concept. There are a number of things we
> > could change:
> >
> > 1/ We could use the top bit in the tv_sec field as the flag. That'd give
> > us different flags for ctime and mtime. We also wouldn't need to use
> > a spinlock.
> >
> > 2/ We could probably optimize away the high-res timestamp fetch in more
> > cases. Basically, always do a coarse-grained ts fetch and only fetch
> > the high-res ts when the QUERIED flag is set and the existing time
> > hasn't changed.
> >
> > If this approach looks reasonable, I'll plan to start working on
> > converting more filesystems.
> >
> > One thing I'm not clear on is how widely available high res timestamps
> > are. Is this something we need to gate on particular CONFIG_* options?
> >
> > Thoughts?
>
> Jeff,
>
> Considering that this proposal is pretty uncontroversial,
> do you still want to discuss/lead a session on i_version changes in LSF/MM?
>
> I noticed that Chuck listed "timespamt resolution and i_version" as part
> of his NFSD BoF topic proposal [1], but I do not think all of these topics
> can fit in one 30 minute session.
>
Agreed. I think we'll need an hour for the nfsd BoF.
I probably don't need a full 30 min slot for this topic, between the
nfsd BoF and hallway track.
I've started a TOPIC email for this about 5 times now, and keep deleting
it. I think most of the more controversial bits are pretty much settled
at this point, and the rest (crash resilience) is still too embryonic
for discussion.
I might want a lightning talk at some point about what I'd _really_ like
to do long term with the i_version counter (basically: I want to be able
to do a write that is gated on the i_version not having changed).
> Dave,
>
> I would like to use this opportunity to invite you and any developers that
> are involved in fs development and not going to attend LSF/MM in-person,
> to join LSF/MM virtually for some sessions that you may be interested in.
> See this lore query [2] for TOPICs proposed this year.
>
> You can let me know privately which sessions you are interested in
> attending and your time zone and I will do my best to schedule those
> sessions in time slots that would be more convenient for your time zone.
>
> Obviously, I am referring to FS track sessions.
> Cross track sessions are going to be harder to accommodate,
> but I can try.
>
> Thanks,
> Amir.
>
> [1] https://lore.kernel.org/linux-fsdevel/[email protected]/
> [2] https://lore.kernel.org/linux-fsdevel/?q=LSF+TOPIC+-re+d%3A4.months.ago..
--
Jeff Layton <[email protected]>
On Tue 11-04-23 12:04:36, Jeff Layton wrote:
> On Tue, 2023-04-11 at 17:07 +0200, Christian Brauner wrote:
> > On Tue, Apr 11, 2023 at 10:37:00AM -0400, Jeff Layton wrote:
> > There's some performance concerns here. Calling
> > stat() is super common and it would potentially make the next iop more
> > expensive. Recursively changing ownership in the container use-case come
> > to mind which are already expensive.
>
> stat() is common, but not generally as common as write calls are. I
> expect that we'll get somewhat similar results tochanged i_version over
> to use a similar QUERIED flag.
>
> The i_version field was originally very expensive and required metadata
> updates on every write. After making that change, we got the same
> performance back in most tests that we got without the i_version field
> being enabled at all. Basically, this just means we'll end up logging an
> extra journal transaction on some writes that follow a stat() call,
> which turns out to be line noise for most workloads.
>
> I do agree that performance is a concern here though. We'll need to
> benchmark this somehow.
So for stat-intensive read-only workloads the additional inode_lock locking
during stat may be noticeable. I suppose a stress test stating the same
file in a loop from all CPUs the machine has will certainly notice :) But
that's just an unrealistic worst case.
We could check whether the QUERIED flag is already set and if yes, skip the
locking. That should fix the read-only workload case. We just have to think
whether there would not be some unpleasant races created.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR