v4:
- add request_mask argument to generic_fillattr
- Drop current_ctime helper and just code functionality into current_time
- rework i_ctime accessor functions
A few weeks ago, during one of the discussions around i_version, Dave
Chinner wrote this:
"You've missed the part where I suggested lifting the "nfsd sampled
i_version" state into an inode state flag rather than hiding it in
the i_version field. At that point, we could optimise away the
secondary ctime updates just like you are proposing we do with the
i_version updates. Further, we could also use that state it to
decide whether we need to use high resolution timestamps when
recording ctime updates - if the nfsd has not sampled the
ctime/i_version, we don't need high res timestamps to be recorded
for ctime...."
While I don't think we can practically optimize away ctime updates
like we do with i_version, I do like the idea of using this scheme to
indicate when we need to use a high-res timestamp.
The basic idea here is to use an unused bit in the timespec64.tv_nsec
field to act as a flag to indicate that the value was queried since
the last time we updated it. If that flag is set when we go to update
the timestamp, we'll clear it and grab a fine-grained ktime value for
the update.
The first couple of patches add the necessary infrastructure, and the
last several patches update various filesystems to use it. For now, I'm
focusing on widely-used, exportable filesystems, but this scheme is
probably suitable for most filesystems in the kernel.
Note that this does cause at least one test failure with LTP's statx06
test. I have submitted a patch to fix the issue (by changing how it
fetches the "after" timestamp in that test).
Jeff Layton (9):
fs: pass the request_mask to generic_fillattr
fs: add infrastructure for multigrain inode i_m/ctime
overlayfs: allow it to handle multigrain timestamps
nfsd: ensure we use ctime_peek to grab the inode->i_ctime
ksmbd: use ctime_peek to grab the ctime out of the inode
tmpfs: add support for multigrain timestamps
xfs: switch to multigrain timestamps
ext4: convert to multigrain timestamps
btrfs: convert to multigrain timestamps
fs/9p/vfs_inode.c | 4 +--
fs/9p/vfs_inode_dotl.c | 4 +--
fs/afs/inode.c | 2 +-
fs/btrfs/delayed-inode.c | 2 +-
fs/btrfs/inode.c | 4 +--
fs/btrfs/super.c | 5 +--
fs/btrfs/tree-log.c | 2 +-
fs/ceph/inode.c | 2 +-
fs/cifs/inode.c | 2 +-
fs/coda/inode.c | 3 +-
fs/ecryptfs/inode.c | 5 +--
fs/erofs/inode.c | 2 +-
fs/exfat/file.c | 2 +-
fs/ext2/inode.c | 2 +-
fs/ext4/inode.c | 19 ++++++++--
fs/ext4/super.c | 2 +-
fs/f2fs/file.c | 2 +-
fs/fat/file.c | 2 +-
fs/fuse/dir.c | 2 +-
fs/gfs2/inode.c | 2 +-
fs/hfsplus/inode.c | 2 +-
fs/inode.c | 48 +++++++++++++++++++++----
fs/kernfs/inode.c | 2 +-
fs/ksmbd/smb2pdu.c | 28 +++++++--------
fs/ksmbd/vfs.c | 3 +-
fs/libfs.c | 4 +--
fs/minix/inode.c | 2 +-
fs/nfs/inode.c | 2 +-
fs/nfs/namespace.c | 3 +-
fs/nfsd/nfsfh.c | 11 ++++--
fs/ntfs3/file.c | 2 +-
fs/ocfs2/file.c | 2 +-
fs/orangefs/inode.c | 2 +-
fs/overlayfs/file.c | 7 ++--
fs/overlayfs/util.c | 2 +-
fs/proc/base.c | 4 +--
fs/proc/fd.c | 2 +-
fs/proc/generic.c | 2 +-
fs/proc/proc_net.c | 2 +-
fs/proc/proc_sysctl.c | 2 +-
fs/proc/root.c | 3 +-
fs/stat.c | 59 ++++++++++++++++++++++++------
fs/sysv/itree.c | 3 +-
fs/ubifs/dir.c | 2 +-
fs/udf/symlink.c | 2 +-
fs/vboxsf/utils.c | 2 +-
fs/xfs/libxfs/xfs_inode_buf.c | 2 +-
fs/xfs/xfs_inode_item.c | 2 +-
fs/xfs/xfs_iops.c | 4 +--
fs/xfs/xfs_super.c | 2 +-
include/linux/fs.h | 68 +++++++++++++++++++++++++++++++++--
mm/shmem.c | 4 +--
52 files changed, 260 insertions(+), 95 deletions(-)
--
2.40.1
The VFS always uses coarse-grained timestamp updates for filling out the
ctime and mtime after a change. This has the benefit of allowing
filesystems to optimize away a lot metadata updates, down to around 1
per jiffy, even when a file is under heavy writes.
Unfortunately, this has always been an issue when we're exporting via
NFSv3, which relies on timestamps to validate caches. Even with NFSv4, a
lot of exported filesystems don't properly support a change attribute
and are subject to the same problems with timestamp granularity. Other
applications have similar issues (e.g backup applications).
Switching to always using fine-grained timestamps would improve the
situation, but that becomes rather expensive, as the underlying
filesystem will have to log a lot more metadata updates.
What we need is a way to only use fine-grained timestamps when they are
being actively queried.
The kernel always stores normalized ctime values, so only the first 30
bits of the tv_nsec field are ever used. Whenever the mtime changes, the
ctime must also change.
Use the 31st bit of the ctime tv_nsec field to indicate that something
has queried the inode for the i_mtime or i_ctime. When this flag is set,
on the next timestamp update, the kernel can fetch a fine-grained
timestamp instead of the usual coarse-grained one.
This patch adds the infrastructure this scheme. Filesytems can opt
into it by setting the FS_MULTIGRAIN_TS flag in the fstype.
Later patches will convert individual filesystems over to use it.
Signed-off-by: Jeff Layton <[email protected]>
---
fs/inode.c | 48 ++++++++++++++++++++++++++++-----
fs/stat.c | 41 ++++++++++++++++++++++++++--
include/linux/fs.h | 66 +++++++++++++++++++++++++++++++++++++++++++++-
3 files changed, 145 insertions(+), 10 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index 577799b7855f..24769e08fbaa 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -2029,6 +2029,7 @@ EXPORT_SYMBOL(file_remove_privs);
static int inode_needs_update_time(struct inode *inode, struct timespec64 *now)
{
int sync_it = 0;
+ struct timespec64 ctime;
/* First try to exhaust all avenues to not sync */
if (IS_NOCMTIME(inode))
@@ -2037,7 +2038,8 @@ static int inode_needs_update_time(struct inode *inode, struct timespec64 *now)
if (!timespec64_equal(&inode->i_mtime, now))
sync_it = S_MTIME;
- if (!timespec64_equal(&inode->i_ctime, now))
+ ctime = ctime_peek(inode);
+ if (!timespec64_equal(&ctime, now))
sync_it |= S_CTIME;
if (IS_I_VERSION(inode) && inode_iversion_need_inc(inode))
@@ -2431,6 +2433,40 @@ struct timespec64 timestamp_truncate(struct timespec64 t, struct inode *inode)
}
EXPORT_SYMBOL(timestamp_truncate);
+/**
+ * current_mg_time - Return FS time (possibly fine-grained)
+ * @inode: inode.
+ *
+ * Return the current time truncated to the time granularity supported by
+ * the fs, as suitable for a ctime/mtime change. If the ctime is flagged
+ * as having been QUERIED, get a fine-grained timestamp.
+ */
+static struct timespec64 current_mg_time(struct inode *inode)
+{
+ struct timespec64 now;
+ atomic_long_t *pnsec = (atomic_long_t *)&inode->i_ctime.tv_nsec;
+ long nsec = atomic_long_fetch_andnot(I_CTIME_QUERIED, pnsec);
+
+ if (nsec & I_CTIME_QUERIED) {
+ ktime_get_real_ts64(&now);
+ } else {
+ struct timespec64 ctime;
+
+ ktime_get_coarse_real_ts64(&now);
+
+ /*
+ * If we've recently fetched a fine-grained timestamp
+ * then the coarse-grained one may still be earlier than the
+ * existing one. Just keep the existing ctime if so.
+ */
+ ctime = ctime_peek(inode);
+ if (timespec64_compare(&ctime, &now) > 0)
+ now = ctime;
+ }
+
+ return now;
+}
+
/**
* current_time - Return FS time
* @inode: inode.
@@ -2445,12 +2481,10 @@ struct timespec64 current_time(struct inode *inode)
{
struct timespec64 now;
- ktime_get_coarse_real_ts64(&now);
-
- if (unlikely(!inode->i_sb)) {
- WARN(1, "current_time() called with uninitialized super_block in the inode");
- return now;
- }
+ if (is_multigrain_ts(inode))
+ now = current_mg_time(inode);
+ else
+ ktime_get_coarse_real_ts64(&now);
return timestamp_truncate(now, inode);
}
diff --git a/fs/stat.c b/fs/stat.c
index 9b513a142a56..74d8283cc5c6 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -26,6 +26,38 @@
#include "internal.h"
#include "mount.h"
+/**
+ * fill_multigrain_cmtime - Fill in the mtime and ctime and flag ctime as QUERIED
+ * @request_mask: STATX_* values requested
+ * @inode: inode from which to grab the c/mtime
+ * @stat: where to store the resulting values
+ *
+ * Given @inode, grab the ctime and mtime out if it and store the result
+ * in @stat. When fetching the value, flag it as queried so the next write
+ * will use a fine-grained timestamp.
+ */
+void fill_multigrain_cmtime(u32 request_mask, struct inode *inode,
+ struct kstat *stat)
+{
+ atomic_long_t *pnsec = (atomic_long_t *)&inode->i_ctime.tv_nsec;
+
+ /* If neither time was requested, then don't report them */
+ if (!(request_mask & (STATX_CTIME|STATX_MTIME))) {
+ stat->result_mask &= ~(STATX_CTIME|STATX_MTIME);
+ return;
+ }
+
+ stat->mtime = inode->i_mtime;
+ stat->ctime.tv_sec = inode->i_ctime.tv_sec;
+ /*
+ * Atomically set the QUERIED flag and fetch the new value with
+ * the flag masked off.
+ */
+ stat->ctime.tv_nsec = atomic_long_fetch_or(I_CTIME_QUERIED, pnsec) &
+ ~I_CTIME_QUERIED;
+}
+EXPORT_SYMBOL(fill_multigrain_cmtime);
+
/**
* generic_fillattr - Fill in the basic attributes from the inode struct
* @idmap: idmap of the mount the inode was found from
@@ -58,11 +90,16 @@ void generic_fillattr(struct mnt_idmap *idmap, u32 request_mask,
stat->rdev = inode->i_rdev;
stat->size = i_size_read(inode);
stat->atime = inode->i_atime;
- stat->mtime = inode->i_mtime;
- stat->ctime = inode->i_ctime;
stat->blksize = i_blocksize(inode);
stat->blocks = inode->i_blocks;
+ if (is_multigrain_ts(inode)) {
+ fill_multigrain_cmtime(request_mask, inode, stat);
+ } else {
+ stat->mtime = inode->i_mtime;
+ stat->ctime = inode->i_ctime;
+ }
+
if ((request_mask & STATX_CHANGE_COOKIE) && IS_I_VERSION(inode)) {
stat->result_mask |= STATX_CHANGE_COOKIE;
stat->change_cookie = inode_query_iversion(inode);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index d5896f90093a..1f670cf1edbd 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1474,7 +1474,7 @@ static inline bool fsuidgid_has_mapping(struct super_block *sb,
kgid_has_mapping(fs_userns, kgid);
}
-extern struct timespec64 current_time(struct inode *inode);
+struct timespec64 current_time(struct inode *inode);
/*
* Snapshotting support.
@@ -2212,6 +2212,7 @@ struct file_system_type {
#define FS_USERNS_MOUNT 8 /* Can be mounted by userns root */
#define FS_DISALLOW_NOTIFY_PERM 16 /* Disable fanotify permission events */
#define FS_ALLOW_IDMAP 32 /* FS has been updated to handle vfs idmappings. */
+#define FS_MULTIGRAIN_TS 64 /* Filesystem uses multigrain timestamps */
#define FS_RENAME_DOES_D_MOVE 32768 /* FS will handle d_move() during rename() internally. */
int (*init_fs_context)(struct fs_context *);
const struct fs_parameter_spec *parameters;
@@ -2235,6 +2236,67 @@ struct file_system_type {
#define MODULE_ALIAS_FS(NAME) MODULE_ALIAS("fs-" NAME)
+/*
+ * Multigrain timestamps
+ *
+ * Conditionally use fine-grained ctime and mtime timestamps when there
+ * are users actively observing them via getattr. The primary use-case
+ * for this is NFS clients that use the ctime to distinguish between
+ * different states of the file, and that are often fooled by multiple
+ * operations that occur in the same coarse-grained timer tick.
+ */
+static inline bool is_multigrain_ts(const struct inode *inode)
+{
+ return inode->i_sb->s_type->fs_flags & FS_MULTIGRAIN_TS;
+}
+
+/*
+ * The kernel always keeps normalized struct timespec64 values in the ctime,
+ * which means that only the first 30 bits of the value are used. Use the
+ * 31st bit of the ctime's tv_nsec field as a flag to indicate that the value
+ * has been queried since it was last updated.
+ */
+#define I_CTIME_QUERIED (1L<<30)
+
+/**
+ * ctime_nsec_peek - peek at (but don't query) the ctime tv_nsec field
+ * @inode: inode to fetch the ctime from
+ *
+ * Grab the current ctime tv_nsec field from the inode, mask off the
+ * I_CTIME_QUERIED flag and return it. This is mostly intended for use by
+ * internal consumers of the ctime that aren't concerned with ensuring a
+ * fine-grained update on the next change (e.g. when preparing to store
+ * the value in the backing store for later retrieval).
+ *
+ * This is safe to call regardless of whether the underlying filesystem
+ * is using multigrain timestamps.
+ */
+static inline long ctime_nsec_peek(const struct inode *inode)
+{
+ return inode->i_ctime.tv_nsec &~ I_CTIME_QUERIED;
+}
+
+/**
+ * ctime_peek - peek at (but don't query) the ctime
+ * @inode: inode to fetch the ctime from
+ *
+ * Grab the current ctime from the inode, sans I_CTIME_QUERIED flag. For
+ * use by internal consumers that don't require a fine-grained update on
+ * the next change.
+ *
+ * This is safe to call regardless of whether the underlying filesystem
+ * is using multigrain timestamps.
+ */
+static inline struct timespec64 ctime_peek(const struct inode *inode)
+{
+ struct timespec64 ctime;
+
+ ctime.tv_sec = inode->i_ctime.tv_sec;
+ ctime.tv_nsec = ctime_nsec_peek(inode);
+
+ return ctime;
+}
+
extern struct dentry *mount_bdev(struct file_system_type *fs_type,
int flags, const char *dev_name, void *data,
int (*fill_super)(struct super_block *, void *, int));
@@ -2857,6 +2919,8 @@ extern void page_put_link(void *);
extern int page_symlink(struct inode *inode, const char *symname, int len);
extern const struct inode_operations page_symlink_inode_operations;
extern void kfree_link(void *);
+void fill_multigrain_cmtime(u32 request_mask, struct inode *inode,
+ struct kstat *stat);
void generic_fillattr(struct mnt_idmap *, u32, struct inode *, struct kstat *);
void generic_fill_statx_attr(struct inode *inode, struct kstat *stat);
extern int vfs_getattr_nosec(const struct path *, struct kstat *, u32, unsigned int);
--
2.40.1
Ensure that we strip off the I_CTIME_QUERIED bit when copying up.
Signed-off-by: Jeff Layton <[email protected]>
---
fs/overlayfs/file.c | 7 +++++--
fs/overlayfs/util.c | 2 +-
2 files changed, 6 insertions(+), 3 deletions(-)
diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
index 7c04f033aadd..cad715df8c4e 100644
--- a/fs/overlayfs/file.c
+++ b/fs/overlayfs/file.c
@@ -222,6 +222,7 @@ static loff_t ovl_llseek(struct file *file, loff_t offset, int whence)
static void ovl_file_accessed(struct file *file)
{
struct inode *inode, *upperinode;
+ struct timespec64 ctime, uctime;
if (file->f_flags & O_NOATIME)
return;
@@ -232,10 +233,12 @@ static void ovl_file_accessed(struct file *file)
if (!upperinode)
return;
+ ctime = ctime_peek(inode);
+ uctime = ctime_peek(upperinode);
if ((!timespec64_equal(&inode->i_mtime, &upperinode->i_mtime) ||
- !timespec64_equal(&inode->i_ctime, &upperinode->i_ctime))) {
+ !timespec64_equal(&ctime, &uctime))) {
inode->i_mtime = upperinode->i_mtime;
- inode->i_ctime = upperinode->i_ctime;
+ inode->i_ctime = uctime;
}
touch_atime(&file->f_path);
diff --git a/fs/overlayfs/util.c b/fs/overlayfs/util.c
index 923d66d131c1..f4f9d7e189ef 100644
--- a/fs/overlayfs/util.c
+++ b/fs/overlayfs/util.c
@@ -1117,6 +1117,6 @@ void ovl_copyattr(struct inode *inode)
inode->i_mode = realinode->i_mode;
inode->i_atime = realinode->i_atime;
inode->i_mtime = realinode->i_mtime;
- inode->i_ctime = realinode->i_ctime;
+ inode->i_ctime = ctime_peek(realinode);
i_size_write(inode, i_size_read(realinode));
}
--
2.40.1
This ensures that the flag is masked off in the result.
Signed-off-by: Jeff Layton <[email protected]>
---
fs/ksmbd/smb2pdu.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/fs/ksmbd/smb2pdu.c b/fs/ksmbd/smb2pdu.c
index d39ddb344417..c33128570448 100644
--- a/fs/ksmbd/smb2pdu.c
+++ b/fs/ksmbd/smb2pdu.c
@@ -4745,7 +4745,7 @@ static int find_file_posix_info(struct smb2_query_info_rsp *rsp,
file_info->LastAccessTime = cpu_to_le64(time);
time = ksmbd_UnixTimeToNT(inode->i_mtime);
file_info->LastWriteTime = cpu_to_le64(time);
- time = ksmbd_UnixTimeToNT(inode->i_ctime);
+ time = ksmbd_UnixTimeToNT(ctime_peek(inode));
file_info->ChangeTime = cpu_to_le64(time);
file_info->DosAttributes = fp->f_ci->m_fattr;
file_info->Inode = cpu_to_le64(inode->i_ino);
@@ -5386,7 +5386,7 @@ int smb2_close(struct ksmbd_work *work)
rsp->LastAccessTime = cpu_to_le64(time);
time = ksmbd_UnixTimeToNT(inode->i_mtime);
rsp->LastWriteTime = cpu_to_le64(time);
- time = ksmbd_UnixTimeToNT(inode->i_ctime);
+ time = ksmbd_UnixTimeToNT(ctime_peek(inode));
rsp->ChangeTime = cpu_to_le64(time);
ksmbd_fd_put(work, fp);
} else {
@@ -5605,7 +5605,7 @@ static int set_file_basic_info(struct ksmbd_file *fp,
if (file_info->ChangeTime)
attrs.ia_ctime = ksmbd_NTtimeToUnix(file_info->ChangeTime);
else
- attrs.ia_ctime = inode->i_ctime;
+ attrs.ia_ctime = ctime_peek(inode);
if (file_info->LastWriteTime) {
attrs.ia_mtime = ksmbd_NTtimeToUnix(file_info->LastWriteTime);
--
2.40.1
Signed-off-by: Jeff Layton <[email protected]>
---
mm/shmem.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/shmem.c b/mm/shmem.c
index 8208d4f85dff..94ea3086eacc 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -4085,7 +4085,7 @@ static struct file_system_type shmem_fs_type = {
#endif
.kill_sb = kill_litter_super,
#ifdef CONFIG_SHMEM
- .fs_flags = FS_USERNS_MOUNT | FS_ALLOW_IDMAP,
+ .fs_flags = FS_USERNS_MOUNT | FS_ALLOW_IDMAP | FS_MULTIGRAIN_TS,
#else
.fs_flags = FS_USERNS_MOUNT,
#endif
--
2.40.1
Signed-off-by: Jeff Layton <[email protected]>
---
fs/ext4/inode.c | 17 +++++++++++++++--
fs/ext4/super.c | 2 +-
2 files changed, 16 insertions(+), 3 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index e0bbcf7a07b5..37840aeb7ff9 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4235,6 +4235,19 @@ static int ext4_inode_blocks_set(struct ext4_inode *raw_inode,
return 0;
}
+static void ext4_inode_set_ctime(struct inode *inode, struct ext4_inode *raw_inode)
+{
+ struct timespec64 ctime = ctime_peek(inode);
+
+ if (EXT4_FITS_IN_INODE(raw_inode, EXT4_I(inode), i_ctime_extra)) {
+ raw_inode->i_ctime = cpu_to_le32(ctime.tv_sec);
+ raw_inode->i_ctime_extra = ext4_encode_extra_time(&ctime);
+ } else {
+ raw_inode->i_ctime = cpu_to_le32(clamp_t(int32_t,
+ ctime.tv_sec, S32_MIN, S32_MAX));
+ }
+}
+
static int ext4_fill_raw_inode(struct inode *inode, struct ext4_inode *raw_inode)
{
struct ext4_inode_info *ei = EXT4_I(inode);
@@ -4275,7 +4288,7 @@ static int ext4_fill_raw_inode(struct inode *inode, struct ext4_inode *raw_inode
}
raw_inode->i_links_count = cpu_to_le16(inode->i_nlink);
- EXT4_INODE_SET_XTIME(i_ctime, inode, raw_inode);
+ ext4_inode_set_ctime(inode, raw_inode);
EXT4_INODE_SET_XTIME(i_mtime, inode, raw_inode);
EXT4_INODE_SET_XTIME(i_atime, inode, raw_inode);
EXT4_EINODE_SET_XTIME(i_crtime, ei, raw_inode);
@@ -4983,7 +4996,7 @@ static void __ext4_update_other_inode_time(struct super_block *sb,
spin_unlock(&inode->i_lock);
spin_lock(&ei->i_raw_lock);
- EXT4_INODE_SET_XTIME(i_ctime, inode, raw_inode);
+ ext4_inode_set_ctime(inode, raw_inode);
EXT4_INODE_SET_XTIME(i_mtime, inode, raw_inode);
EXT4_INODE_SET_XTIME(i_atime, inode, raw_inode);
ext4_inode_csum_set(inode, raw_inode, ei);
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 9680fe753e59..4de4977dcb21 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -7258,7 +7258,7 @@ static struct file_system_type ext4_fs_type = {
.init_fs_context = ext4_init_fs_context,
.parameters = ext4_param_specs,
.kill_sb = kill_block_super,
- .fs_flags = FS_REQUIRES_DEV | FS_ALLOW_IDMAP,
+ .fs_flags = FS_REQUIRES_DEV | FS_ALLOW_IDMAP | FS_MULTIGRAIN_TS,
};
MODULE_ALIAS_FS("ext4");
--
2.40.1
Acked-by: Theodore Ts'o <[email protected]>
On Thu, 18 May 2023 07:47:33 -0400, Jeff Layton wrote:
> v4:
> - add request_mask argument to generic_fillattr
> - Drop current_ctime helper and just code functionality into current_time
> - rework i_ctime accessor functions
>
> A few weeks ago, during one of the discussions around i_version, Dave
> Chinner wrote this:
>
> [...]
Let's get this into -next so we can see whether this leads to any
performance or other regressions. It's moved to a vfs.unstable.* branch
for now. If nothing bad happens it'll be upgraded to a vfs.* branch.
Filesystems that prefer to carry their fs specific patch themselves can
request a stable tag for the generic changes.
---
Applied to the vfs.unstable.ctime branch of the vfs/vfs.git tree.
Patches in the vfs.unstable.ctime branch should appear in linux-next soon.
Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.
It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.
tree: https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs.unstable.ctime
[1/9] fs: pass the request_mask to generic_fillattr
https://git.kernel.org/vfs/vfs/c/ec1239bab5fd
[2/9] fs: add infrastructure for multigrain inode i_m/ctime
https://git.kernel.org/vfs/vfs/c/97e9fbb03240
[3/9] overlayfs: allow it to handle multigrain timestamps
https://git.kernel.org/vfs/vfs/c/ada0fe43f748
[4/9] nfsd: ensure we use ctime_peek to grab the inode->i_ctime
https://git.kernel.org/vfs/vfs/c/39493918b700
[5/9] ksmbd: use ctime_peek to grab the ctime out of the inode
https://git.kernel.org/vfs/vfs/c/35527cdc7840
[6/9] tmpfs: add support for multigrain timestamps
https://git.kernel.org/vfs/vfs/c/ce1dc211dbde
[7/9] xfs: switch to multigrain timestamps
https://git.kernel.org/vfs/vfs/c/78bbdfd2fb74
[8/9] ext4: convert to multigrain timestamps
https://git.kernel.org/vfs/vfs/c/d2100ca52e14
[9/9] btrfs: convert to multigrain timestamps
https://git.kernel.org/vfs/vfs/c/c725b40cfbd5
On Thu 18-05-23 07:47:35, Jeff Layton wrote:
> The VFS always uses coarse-grained timestamp updates for filling out the
> ctime and mtime after a change. This has the benefit of allowing
> filesystems to optimize away a lot metadata updates, down to around 1
> per jiffy, even when a file is under heavy writes.
>
> Unfortunately, this has always been an issue when we're exporting via
> NFSv3, which relies on timestamps to validate caches. Even with NFSv4, a
> lot of exported filesystems don't properly support a change attribute
> and are subject to the same problems with timestamp granularity. Other
> applications have similar issues (e.g backup applications).
>
> Switching to always using fine-grained timestamps would improve the
> situation, but that becomes rather expensive, as the underlying
> filesystem will have to log a lot more metadata updates.
>
> What we need is a way to only use fine-grained timestamps when they are
> being actively queried.
>
> The kernel always stores normalized ctime values, so only the first 30
> bits of the tv_nsec field are ever used. Whenever the mtime changes, the
> ctime must also change.
>
> Use the 31st bit of the ctime tv_nsec field to indicate that something
> has queried the inode for the i_mtime or i_ctime. When this flag is set,
> on the next timestamp update, the kernel can fetch a fine-grained
> timestamp instead of the usual coarse-grained one.
>
> This patch adds the infrastructure this scheme. Filesytems can opt
> into it by setting the FS_MULTIGRAIN_TS flag in the fstype.
>
> Later patches will convert individual filesystems over to use it.
>
> Signed-off-by: Jeff Layton <[email protected]>
So there are two things I dislike about this series because I think they
are fragile:
1) If we have a filesystem supporting multigrain ts and someone
accidentally directly uses the value of inode->i_ctime, he can get bogus
value (with QUERIED flag). This mistake is very easy to do. So I think we
should rename i_ctime to something like __i_ctime and always use accessor
function for it.
2) As I already commented in a previous version of the series, the scheme
with just one flag for both ctime and mtime and flag getting cleared in
current_time() relies on the fact that filesystems always do an equivalent
of:
inode->i_mtime = inode->i_ctime = current_time();
Otherwise we can do coarse grained update where we should have done a fine
grained one. Filesystems often update timestamps like this but not
universally. Grepping shows some instances where only inode->i_mtime is set
from current_time() e.g. in autofs or bfs. Again a mistake that is rather
easy to make and results in subtle issues. I think this would be also
nicely solved by renaming i_ctime to __i_ctime and using a function to set
ctime. Mtime could then be updated with inode->i_mtime = ctime_peek().
I understand this is quite some churn but a very mechanical one that could
be just done with Coccinelle and a few manual fixups. So IMHO it is worth
the more robust result.
Some more nits below.
> +/**
> + * current_mg_time - Return FS time (possibly fine-grained)
> + * @inode: inode.
> + *
> + * Return the current time truncated to the time granularity supported by
> + * the fs, as suitable for a ctime/mtime change. If the ctime is flagged
> + * as having been QUERIED, get a fine-grained timestamp.
> + */
The comment should also mention that QUERIED flag is cleared from the ctime.
> +static struct timespec64 current_mg_time(struct inode *inode)
> +{
> + struct timespec64 now;
> + atomic_long_t *pnsec = (atomic_long_t *)&inode->i_ctime.tv_nsec;
> + long nsec = atomic_long_fetch_andnot(I_CTIME_QUERIED, pnsec);
> +
> + if (nsec & I_CTIME_QUERIED) {
> + ktime_get_real_ts64(&now);
> + } else {
> + struct timespec64 ctime;
> +
> + ktime_get_coarse_real_ts64(&now);
> +
> + /*
> + * If we've recently fetched a fine-grained timestamp
> + * then the coarse-grained one may still be earlier than the
> + * existing one. Just keep the existing ctime if so.
> + */
> + ctime = ctime_peek(inode);
> + if (timespec64_compare(&ctime, &now) > 0)
> + now = ctime;
> + }
> +
> + return now;
> +}
> +
...
> +/**
> + * ctime_nsec_peek - peek at (but don't query) the ctime tv_nsec field
> + * @inode: inode to fetch the ctime from
> + *
> + * Grab the current ctime tv_nsec field from the inode, mask off the
> + * I_CTIME_QUERIED flag and return it. This is mostly intended for use by
> + * internal consumers of the ctime that aren't concerned with ensuring a
> + * fine-grained update on the next change (e.g. when preparing to store
> + * the value in the backing store for later retrieval).
> + *
> + * This is safe to call regardless of whether the underlying filesystem
> + * is using multigrain timestamps.
> + */
> +static inline long ctime_nsec_peek(const struct inode *inode)
> +{
> + return inode->i_ctime.tv_nsec &~ I_CTIME_QUERIED;
This is somewhat unusual spacing. I'd use:
inode->i_ctime.tv_nsec & ~I_CTIME_QUERIED
> +}
> +
> +/**
> + * ctime_peek - peek at (but don't query) the ctime
> + * @inode: inode to fetch the ctime from
> + *
> + * Grab the current ctime from the inode, sans I_CTIME_QUERIED flag. For
> + * use by internal consumers that don't require a fine-grained update on
> + * the next change.
> + *
> + * This is safe to call regardless of whether the underlying filesystem
> + * is using multigrain timestamps.
> + */
> +static inline struct timespec64 ctime_peek(const struct inode *inode)
> +{
> + struct timespec64 ctime;
> +
> + ctime.tv_sec = inode->i_ctime.tv_sec;
> + ctime.tv_nsec = ctime_nsec_peek(inode);
> +
> + return ctime;
> +}
Given this is in a header that gets included in a lot of places, maybe we
should call it like inode_ctime_peek() or inode_ctime_get() to reduce
chances of a name clash?
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Tue 23-05-23 12:02:40, Jan Kara wrote:
> On Thu 18-05-23 07:47:35, Jeff Layton wrote:
> > The VFS always uses coarse-grained timestamp updates for filling out the
> > ctime and mtime after a change. This has the benefit of allowing
> > filesystems to optimize away a lot metadata updates, down to around 1
> > per jiffy, even when a file is under heavy writes.
> >
> > Unfortunately, this has always been an issue when we're exporting via
> > NFSv3, which relies on timestamps to validate caches. Even with NFSv4, a
> > lot of exported filesystems don't properly support a change attribute
> > and are subject to the same problems with timestamp granularity. Other
> > applications have similar issues (e.g backup applications).
> >
> > Switching to always using fine-grained timestamps would improve the
> > situation, but that becomes rather expensive, as the underlying
> > filesystem will have to log a lot more metadata updates.
> >
> > What we need is a way to only use fine-grained timestamps when they are
> > being actively queried.
> >
> > The kernel always stores normalized ctime values, so only the first 30
> > bits of the tv_nsec field are ever used. Whenever the mtime changes, the
> > ctime must also change.
> >
> > Use the 31st bit of the ctime tv_nsec field to indicate that something
> > has queried the inode for the i_mtime or i_ctime. When this flag is set,
> > on the next timestamp update, the kernel can fetch a fine-grained
> > timestamp instead of the usual coarse-grained one.
> >
> > This patch adds the infrastructure this scheme. Filesytems can opt
> > into it by setting the FS_MULTIGRAIN_TS flag in the fstype.
> >
> > Later patches will convert individual filesystems over to use it.
> >
> > Signed-off-by: Jeff Layton <[email protected]>
>
> So there are two things I dislike about this series because I think they
> are fragile:
>
> 1) If we have a filesystem supporting multigrain ts and someone
> accidentally directly uses the value of inode->i_ctime, he can get bogus
> value (with QUERIED flag). This mistake is very easy to do. So I think we
> should rename i_ctime to something like __i_ctime and always use accessor
> function for it.
>
> 2) As I already commented in a previous version of the series, the scheme
> with just one flag for both ctime and mtime and flag getting cleared in
> current_time() relies on the fact that filesystems always do an equivalent
> of:
>
> inode->i_mtime = inode->i_ctime = current_time();
>
> Otherwise we can do coarse grained update where we should have done a fine
> grained one. Filesystems often update timestamps like this but not
> universally. Grepping shows some instances where only inode->i_mtime is set
> from current_time() e.g. in autofs or bfs. Again a mistake that is rather
> easy to make and results in subtle issues. I think this would be also
> nicely solved by renaming i_ctime to __i_ctime and using a function to set
> ctime. Mtime could then be updated with inode->i_mtime = ctime_peek().
>
> I understand this is quite some churn but a very mechanical one that could
> be just done with Coccinelle and a few manual fixups. So IMHO it is worth
> the more robust result.
Also as I'm thinking about it your current scheme is slightly racy. Suppose
the filesystem does:
CPU1 CPU2
statx()
inode->i_ctime = current_time()
current_mg_time()
nsec = atomic_long_fetch_andnot(QUERIED, &inode->i_ctime.tv_nsec)
nsec = atomic_long_fetch_or(QUERIED, &inode->i_ctime.tv_nsec)
if (nsec & QUERIED) - not set
ktime_get_coarse_real_ts64(&now)
return timestamp_truncate(now, inode);
- QUERIED flag in the inode->i_ctime gets overwritten by the assignment
=> we need not update ctime due to granularity although it was queried
One more reason to use explicit function to update inode->i_ctime ;)
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Tue, May 23, 2023 at 12:02:40PM +0200, Jan Kara wrote:
> On Thu 18-05-23 07:47:35, Jeff Layton wrote:
> > The VFS always uses coarse-grained timestamp updates for filling out the
> > ctime and mtime after a change. This has the benefit of allowing
> > filesystems to optimize away a lot metadata updates, down to around 1
> > per jiffy, even when a file is under heavy writes.
> >
> > Unfortunately, this has always been an issue when we're exporting via
> > NFSv3, which relies on timestamps to validate caches. Even with NFSv4, a
> > lot of exported filesystems don't properly support a change attribute
> > and are subject to the same problems with timestamp granularity. Other
> > applications have similar issues (e.g backup applications).
> >
> > Switching to always using fine-grained timestamps would improve the
> > situation, but that becomes rather expensive, as the underlying
> > filesystem will have to log a lot more metadata updates.
> >
> > What we need is a way to only use fine-grained timestamps when they are
> > being actively queried.
> >
> > The kernel always stores normalized ctime values, so only the first 30
> > bits of the tv_nsec field are ever used. Whenever the mtime changes, the
> > ctime must also change.
> >
> > Use the 31st bit of the ctime tv_nsec field to indicate that something
> > has queried the inode for the i_mtime or i_ctime. When this flag is set,
> > on the next timestamp update, the kernel can fetch a fine-grained
> > timestamp instead of the usual coarse-grained one.
> >
> > This patch adds the infrastructure this scheme. Filesytems can opt
> > into it by setting the FS_MULTIGRAIN_TS flag in the fstype.
> >
> > Later patches will convert individual filesystems over to use it.
> >
> > Signed-off-by: Jeff Layton <[email protected]>
>
> So there are two things I dislike about this series because I think they
> are fragile:
>
> 1) If we have a filesystem supporting multigrain ts and someone
> accidentally directly uses the value of inode->i_ctime, he can get bogus
> value (with QUERIED flag). This mistake is very easy to do. So I think we
> should rename i_ctime to something like __i_ctime and always use accessor
> function for it.
>
> 2) As I already commented in a previous version of the series, the scheme
> with just one flag for both ctime and mtime and flag getting cleared in
> current_time() relies on the fact that filesystems always do an equivalent
> of:
>
> inode->i_mtime = inode->i_ctime = current_time();
>
> Otherwise we can do coarse grained update where we should have done a fine
> grained one. Filesystems often update timestamps like this but not
> universally. Grepping shows some instances where only inode->i_mtime is set
> from current_time() e.g. in autofs or bfs. Again a mistake that is rather
> easy to make and results in subtle issues. I think this would be also
> nicely solved by renaming i_ctime to __i_ctime and using a function to set
> ctime. Mtime could then be updated with inode->i_mtime = ctime_peek().
>
> I understand this is quite some churn but a very mechanical one that could
> be just done with Coccinelle and a few manual fixups. So IMHO it is worth
> the more robust result.
Yeah, these are all good points.
>
> Some more nits below.
>
> > +/**
> > + * current_mg_time - Return FS time (possibly fine-grained)
> > + * @inode: inode.
> > + *
> > + * Return the current time truncated to the time granularity supported by
> > + * the fs, as suitable for a ctime/mtime change. If the ctime is flagged
> > + * as having been QUERIED, get a fine-grained timestamp.
> > + */
>
> The comment should also mention that QUERIED flag is cleared from the ctime.
>
> > +static struct timespec64 current_mg_time(struct inode *inode)
> > +{
> > + struct timespec64 now;
> > + atomic_long_t *pnsec = (atomic_long_t *)&inode->i_ctime.tv_nsec;
> > + long nsec = atomic_long_fetch_andnot(I_CTIME_QUERIED, pnsec);
> > +
> > + if (nsec & I_CTIME_QUERIED) {
> > + ktime_get_real_ts64(&now);
> > + } else {
> > + struct timespec64 ctime;
> > +
> > + ktime_get_coarse_real_ts64(&now);
> > +
> > + /*
> > + * If we've recently fetched a fine-grained timestamp
> > + * then the coarse-grained one may still be earlier than the
> > + * existing one. Just keep the existing ctime if so.
> > + */
> > + ctime = ctime_peek(inode);
> > + if (timespec64_compare(&ctime, &now) > 0)
> > + now = ctime;
> > + }
> > +
> > + return now;
> > +}
> > +
>
> ...
>
> > +/**
> > + * ctime_nsec_peek - peek at (but don't query) the ctime tv_nsec field
> > + * @inode: inode to fetch the ctime from
> > + *
> > + * Grab the current ctime tv_nsec field from the inode, mask off the
> > + * I_CTIME_QUERIED flag and return it. This is mostly intended for use by
> > + * internal consumers of the ctime that aren't concerned with ensuring a
> > + * fine-grained update on the next change (e.g. when preparing to store
> > + * the value in the backing store for later retrieval).
> > + *
> > + * This is safe to call regardless of whether the underlying filesystem
> > + * is using multigrain timestamps.
> > + */
> > +static inline long ctime_nsec_peek(const struct inode *inode)
> > +{
> > + return inode->i_ctime.tv_nsec &~ I_CTIME_QUERIED;
>
> This is somewhat unusual spacing. I'd use:
>
> inode->i_ctime.tv_nsec & ~I_CTIME_QUERIED
>
> > +}
> > +
> > +/**
> > + * ctime_peek - peek at (but don't query) the ctime
> > + * @inode: inode to fetch the ctime from
> > + *
> > + * Grab the current ctime from the inode, sans I_CTIME_QUERIED flag. For
> > + * use by internal consumers that don't require a fine-grained update on
> > + * the next change.
> > + *
> > + * This is safe to call regardless of whether the underlying filesystem
> > + * is using multigrain timestamps.
> > + */
> > +static inline struct timespec64 ctime_peek(const struct inode *inode)
> > +{
> > + struct timespec64 ctime;
> > +
> > + ctime.tv_sec = inode->i_ctime.tv_sec;
> > + ctime.tv_nsec = ctime_nsec_peek(inode);
> > +
> > + return ctime;
> > +}
>
> Given this is in a header that gets included in a lot of places, maybe we
> should call it like inode_ctime_peek() or inode_ctime_get() to reduce
> chances of a name clash?
I think I mentioned this in an earlier comment. Independent of this
series, it would be kinda nice if we could start moving stuff out of
fs.h so we end up with a finer grained split of fs.h.
On Tue, 2023-05-23 at 12:02 +0200, Jan Kara wrote:
> On Thu 18-05-23 07:47:35, Jeff Layton wrote:
> > The VFS always uses coarse-grained timestamp updates for filling out the
> > ctime and mtime after a change. This has the benefit of allowing
> > filesystems to optimize away a lot metadata updates, down to around 1
> > per jiffy, even when a file is under heavy writes.
> >
> > Unfortunately, this has always been an issue when we're exporting via
> > NFSv3, which relies on timestamps to validate caches. Even with NFSv4, a
> > lot of exported filesystems don't properly support a change attribute
> > and are subject to the same problems with timestamp granularity. Other
> > applications have similar issues (e.g backup applications).
> >
> > Switching to always using fine-grained timestamps would improve the
> > situation, but that becomes rather expensive, as the underlying
> > filesystem will have to log a lot more metadata updates.
> >
> > What we need is a way to only use fine-grained timestamps when they are
> > being actively queried.
> >
> > The kernel always stores normalized ctime values, so only the first 30
> > bits of the tv_nsec field are ever used. Whenever the mtime changes, the
> > ctime must also change.
> >
> > Use the 31st bit of the ctime tv_nsec field to indicate that something
> > has queried the inode for the i_mtime or i_ctime. When this flag is set,
> > on the next timestamp update, the kernel can fetch a fine-grained
> > timestamp instead of the usual coarse-grained one.
> >
> > This patch adds the infrastructure this scheme. Filesytems can opt
> > into it by setting the FS_MULTIGRAIN_TS flag in the fstype.
> >
> > Later patches will convert individual filesystems over to use it.
> >
> > Signed-off-by: Jeff Layton <[email protected]>
>
> So there are two things I dislike about this series because I think they
> are fragile:
>
> 1) If we have a filesystem supporting multigrain ts and someone
> accidentally directly uses the value of inode->i_ctime, he can get bogus
> value (with QUERIED flag). This mistake is very easy to do. So I think we
> should rename i_ctime to something like __i_ctime and always use accessor
> function for it.
>
We could do this, but it'll be quite invasive. We'd have to change any
place that touches i_ctime (and there are a lot of them), even on
filesystems that are not being converted.
> 2) As I already commented in a previous version of the series, the scheme
> with just one flag for both ctime and mtime and flag getting cleared in
> current_time() relies on the fact that filesystems always do an equivalent
> of:
>
> inode->i_mtime = inode->i_ctime = current_time();
>
> Otherwise we can do coarse grained update where we should have done a fine
> grained one. Filesystems often update timestamps like this but not
> universally. Grepping shows some instances where only inode->i_mtime is set
> from current_time() e.g. in autofs or bfs. Again a mistake that is rather
> easy to make and results in subtle issues. I think this would be also
> nicely solved by renaming i_ctime to __i_ctime and using a function to set
> ctime. Mtime could then be updated with inode->i_mtime = ctime_peek().
>
> I understand this is quite some churn but a very mechanical one that could
> be just done with Coccinelle and a few manual fixups. So IMHO it is worth
> the more robust result.
AFAICT, under POSIX, you must _always_ set the ctime when you set the
mtime, but the reverse is not true. That's why keeping the flag in the
ctime makes sense. If we're updating the mtime, then we necessarily must
update the ctime.
> Some more nits below.
>
> > +/**
> > + * current_mg_time - Return FS time (possibly fine-grained)
> > + * @inode: inode.
> > + *
> > + * Return the current time truncated to the time granularity supported by
> > + * the fs, as suitable for a ctime/mtime change. If the ctime is flagged
> > + * as having been QUERIED, get a fine-grained timestamp.
> > + */
>
> The comment should also mention that QUERIED flag is cleared from the ctime.
>
Fair point. I can fix that up.
> > +static struct timespec64 current_mg_time(struct inode *inode)
> > +{
> > + struct timespec64 now;
> > + atomic_long_t *pnsec = (atomic_long_t *)&inode->i_ctime.tv_nsec;
> > + long nsec = atomic_long_fetch_andnot(I_CTIME_QUERIED, pnsec);
> > +
> > + if (nsec & I_CTIME_QUERIED) {
> > + ktime_get_real_ts64(&now);
> > + } else {
> > + struct timespec64 ctime;
> > +
> > + ktime_get_coarse_real_ts64(&now);
> > +
> > + /*
> > + * If we've recently fetched a fine-grained timestamp
> > + * then the coarse-grained one may still be earlier than the
> > + * existing one. Just keep the existing ctime if so.
> > + */
> > + ctime = ctime_peek(inode);
> > + if (timespec64_compare(&ctime, &now) > 0)
> > + now = ctime;
> > + }
> > +
> > + return now;
> > +}
> > +
>
> ...
>
> > +/**
> > + * ctime_nsec_peek - peek at (but don't query) the ctime tv_nsec field
> > + * @inode: inode to fetch the ctime from
> > + *
> > + * Grab the current ctime tv_nsec field from the inode, mask off the
> > + * I_CTIME_QUERIED flag and return it. This is mostly intended for use by
> > + * internal consumers of the ctime that aren't concerned with ensuring a
> > + * fine-grained update on the next change (e.g. when preparing to store
> > + * the value in the backing store for later retrieval).
> > + *
> > + * This is safe to call regardless of whether the underlying filesystem
> > + * is using multigrain timestamps.
> > + */
> > +static inline long ctime_nsec_peek(const struct inode *inode)
> > +{
> > + return inode->i_ctime.tv_nsec &~ I_CTIME_QUERIED;
>
> This is somewhat unusual spacing. I'd use:
>
> inode->i_ctime.tv_nsec & ~I_CTIME_QUERIED
>
Yeah, I don't know what happened there. I'll fix that up.
> > +}
> > +
> > +/**
> > + * ctime_peek - peek at (but don't query) the ctime
> > + * @inode: inode to fetch the ctime from
> > + *
> > + * Grab the current ctime from the inode, sans I_CTIME_QUERIED flag. For
> > + * use by internal consumers that don't require a fine-grained update on
> > + * the next change.
> > + *
> > + * This is safe to call regardless of whether the underlying filesystem
> > + * is using multigrain timestamps.
> > + */
> > +static inline struct timespec64 ctime_peek(const struct inode *inode)
> > +{
> > + struct timespec64 ctime;
> > +
> > + ctime.tv_sec = inode->i_ctime.tv_sec;
> > + ctime.tv_nsec = ctime_nsec_peek(inode);
> > +
> > + return ctime;
> > +}
>
> Given this is in a header that gets included in a lot of places, maybe we
> should call it like inode_ctime_peek() or inode_ctime_get() to reduce
> chances of a name clash?
I'd be fine with that, but "ctime" sort of implies inode->i_ctime to me.
We don't really use that nomenclature elsewhere.
--
Jeff Layton <[email protected]>
On Tue, 2023-05-23 at 12:17 +0200, Jan Kara wrote:
> On Tue 23-05-23 12:02:40, Jan Kara wrote:
> > On Thu 18-05-23 07:47:35, Jeff Layton wrote:
> > > The VFS always uses coarse-grained timestamp updates for filling out the
> > > ctime and mtime after a change. This has the benefit of allowing
> > > filesystems to optimize away a lot metadata updates, down to around 1
> > > per jiffy, even when a file is under heavy writes.
> > >
> > > Unfortunately, this has always been an issue when we're exporting via
> > > NFSv3, which relies on timestamps to validate caches. Even with NFSv4, a
> > > lot of exported filesystems don't properly support a change attribute
> > > and are subject to the same problems with timestamp granularity. Other
> > > applications have similar issues (e.g backup applications).
> > >
> > > Switching to always using fine-grained timestamps would improve the
> > > situation, but that becomes rather expensive, as the underlying
> > > filesystem will have to log a lot more metadata updates.
> > >
> > > What we need is a way to only use fine-grained timestamps when they are
> > > being actively queried.
> > >
> > > The kernel always stores normalized ctime values, so only the first 30
> > > bits of the tv_nsec field are ever used. Whenever the mtime changes, the
> > > ctime must also change.
> > >
> > > Use the 31st bit of the ctime tv_nsec field to indicate that something
> > > has queried the inode for the i_mtime or i_ctime. When this flag is set,
> > > on the next timestamp update, the kernel can fetch a fine-grained
> > > timestamp instead of the usual coarse-grained one.
> > >
> > > This patch adds the infrastructure this scheme. Filesytems can opt
> > > into it by setting the FS_MULTIGRAIN_TS flag in the fstype.
> > >
> > > Later patches will convert individual filesystems over to use it.
> > >
> > > Signed-off-by: Jeff Layton <[email protected]>
> >
> > So there are two things I dislike about this series because I think they
> > are fragile:
> >
> > 1) If we have a filesystem supporting multigrain ts and someone
> > accidentally directly uses the value of inode->i_ctime, he can get bogus
> > value (with QUERIED flag). This mistake is very easy to do. So I think we
> > should rename i_ctime to something like __i_ctime and always use accessor
> > function for it.
> >
> > 2) As I already commented in a previous version of the series, the scheme
> > with just one flag for both ctime and mtime and flag getting cleared in
> > current_time() relies on the fact that filesystems always do an equivalent
> > of:
> >
> > inode->i_mtime = inode->i_ctime = current_time();
> >
> > Otherwise we can do coarse grained update where we should have done a fine
> > grained one. Filesystems often update timestamps like this but not
> > universally. Grepping shows some instances where only inode->i_mtime is set
> > from current_time() e.g. in autofs or bfs. Again a mistake that is rather
> > easy to make and results in subtle issues. I think this would be also
> > nicely solved by renaming i_ctime to __i_ctime and using a function to set
> > ctime. Mtime could then be updated with inode->i_mtime = ctime_peek().
> >
> > I understand this is quite some churn but a very mechanical one that could
> > be just done with Coccinelle and a few manual fixups. So IMHO it is worth
> > the more robust result.
>
> Also as I'm thinking about it your current scheme is slightly racy. Suppose
> the filesystem does:
>
> CPU1 CPU2
>
> statx()
> inode->i_ctime = current_time()
> current_mg_time()
> nsec = atomic_long_fetch_andnot(QUERIED, &inode->i_ctime.tv_nsec)
> nsec = atomic_long_fetch_or(QUERIED, &inode->i_ctime.tv_nsec)
> if (nsec & QUERIED) - not set
> ktime_get_coarse_real_ts64(&now)
> return timestamp_truncate(now, inode);
> - QUERIED flag in the inode->i_ctime gets overwritten by the assignment
> => we need not update ctime due to granularity although it was queried
>
> One more reason to use explicit function to update inode->i_ctime ;)
When we store the new time in the i_ctime field, the flag gets cleared
because at that point we're storing a new (unseen) time.
However, you're correct: if the i_ctime in your above example starts at
the same value that is currently being returned by
ktime_get_coarse_real_ts64, then we'll lose the flag set in statx.
I think the right fix there would be to not update the ctime at all if
it's a coarse grained time, and the value wouldn't have an apparent
change to an observer. That would leave the flag intact.
That does mean we'd need to move to a function that does clock fetch and
assigns it to i_ctime in one go (like you suggest). Something like:
inode_update_ctime(inode);
How we do that with atomic operations over two values (the tv_sec and
tv_nsec) is a bit tricky. I'll have to think about it.
Christian, given Jan's concerns do you want to drop this series for now
and let me respin it?
Thanks,
--
Jeff Layton <[email protected]>
On Tue, May 23, 2023 at 06:56:11AM -0400, Jeff Layton wrote:
> On Tue, 2023-05-23 at 12:17 +0200, Jan Kara wrote:
> > On Tue 23-05-23 12:02:40, Jan Kara wrote:
> > > On Thu 18-05-23 07:47:35, Jeff Layton wrote:
> > > > The VFS always uses coarse-grained timestamp updates for filling out the
> > > > ctime and mtime after a change. This has the benefit of allowing
> > > > filesystems to optimize away a lot metadata updates, down to around 1
> > > > per jiffy, even when a file is under heavy writes.
> > > >
> > > > Unfortunately, this has always been an issue when we're exporting via
> > > > NFSv3, which relies on timestamps to validate caches. Even with NFSv4, a
> > > > lot of exported filesystems don't properly support a change attribute
> > > > and are subject to the same problems with timestamp granularity. Other
> > > > applications have similar issues (e.g backup applications).
> > > >
> > > > Switching to always using fine-grained timestamps would improve the
> > > > situation, but that becomes rather expensive, as the underlying
> > > > filesystem will have to log a lot more metadata updates.
> > > >
> > > > What we need is a way to only use fine-grained timestamps when they are
> > > > being actively queried.
> > > >
> > > > The kernel always stores normalized ctime values, so only the first 30
> > > > bits of the tv_nsec field are ever used. Whenever the mtime changes, the
> > > > ctime must also change.
> > > >
> > > > Use the 31st bit of the ctime tv_nsec field to indicate that something
> > > > has queried the inode for the i_mtime or i_ctime. When this flag is set,
> > > > on the next timestamp update, the kernel can fetch a fine-grained
> > > > timestamp instead of the usual coarse-grained one.
> > > >
> > > > This patch adds the infrastructure this scheme. Filesytems can opt
> > > > into it by setting the FS_MULTIGRAIN_TS flag in the fstype.
> > > >
> > > > Later patches will convert individual filesystems over to use it.
> > > >
> > > > Signed-off-by: Jeff Layton <[email protected]>
> > >
> > > So there are two things I dislike about this series because I think they
> > > are fragile:
> > >
> > > 1) If we have a filesystem supporting multigrain ts and someone
> > > accidentally directly uses the value of inode->i_ctime, he can get bogus
> > > value (with QUERIED flag). This mistake is very easy to do. So I think we
> > > should rename i_ctime to something like __i_ctime and always use accessor
> > > function for it.
> > >
> > > 2) As I already commented in a previous version of the series, the scheme
> > > with just one flag for both ctime and mtime and flag getting cleared in
> > > current_time() relies on the fact that filesystems always do an equivalent
> > > of:
> > >
> > > inode->i_mtime = inode->i_ctime = current_time();
> > >
> > > Otherwise we can do coarse grained update where we should have done a fine
> > > grained one. Filesystems often update timestamps like this but not
> > > universally. Grepping shows some instances where only inode->i_mtime is set
> > > from current_time() e.g. in autofs or bfs. Again a mistake that is rather
> > > easy to make and results in subtle issues. I think this would be also
> > > nicely solved by renaming i_ctime to __i_ctime and using a function to set
> > > ctime. Mtime could then be updated with inode->i_mtime = ctime_peek().
> > >
> > > I understand this is quite some churn but a very mechanical one that could
> > > be just done with Coccinelle and a few manual fixups. So IMHO it is worth
> > > the more robust result.
> >
> > Also as I'm thinking about it your current scheme is slightly racy. Suppose
> > the filesystem does:
> >
> > CPU1 CPU2
> >
> > statx()
> > inode->i_ctime = current_time()
> > current_mg_time()
> > nsec = atomic_long_fetch_andnot(QUERIED, &inode->i_ctime.tv_nsec)
> > nsec = atomic_long_fetch_or(QUERIED, &inode->i_ctime.tv_nsec)
> > if (nsec & QUERIED) - not set
> > ktime_get_coarse_real_ts64(&now)
> > return timestamp_truncate(now, inode);
> > - QUERIED flag in the inode->i_ctime gets overwritten by the assignment
> > => we need not update ctime due to granularity although it was queried
> >
> > One more reason to use explicit function to update inode->i_ctime ;)
>
> When we store the new time in the i_ctime field, the flag gets cleared
> because at that point we're storing a new (unseen) time.
>
> However, you're correct: if the i_ctime in your above example starts at
> the same value that is currently being returned by
> ktime_get_coarse_real_ts64, then we'll lose the flag set in statx.
>
> I think the right fix there would be to not update the ctime at all if
> it's a coarse grained time, and the value wouldn't have an apparent
> change to an observer. That would leave the flag intact.
>
> That does mean we'd need to move to a function that does clock fetch and
> assigns it to i_ctime in one go (like you suggest). Something like:
>
> inode_update_ctime(inode);
>
> How we do that with atomic operations over two values (the tv_sec and
> tv_nsec) is a bit tricky. I'll have to think about it.
>
> Christian, given Jan's concerns do you want to drop this series for now
> and let me respin it?
I deliberately put it into a vfs.unstable.* branch. I would leave it
there until you send a new one then drop it. If we get lucky the bots
that run on -next will have time to report potential perf issues while
it's not currently causing conflicts.
On Tue, 2023-05-23 at 13:01 +0200, Christian Brauner wrote:
> On Tue, May 23, 2023 at 06:56:11AM -0400, Jeff Layton wrote:
> > On Tue, 2023-05-23 at 12:17 +0200, Jan Kara wrote:
> > > On Tue 23-05-23 12:02:40, Jan Kara wrote:
> > > > On Thu 18-05-23 07:47:35, Jeff Layton wrote:
> > > > > The VFS always uses coarse-grained timestamp updates for filling out the
> > > > > ctime and mtime after a change. This has the benefit of allowing
> > > > > filesystems to optimize away a lot metadata updates, down to around 1
> > > > > per jiffy, even when a file is under heavy writes.
> > > > >
> > > > > Unfortunately, this has always been an issue when we're exporting via
> > > > > NFSv3, which relies on timestamps to validate caches. Even with NFSv4, a
> > > > > lot of exported filesystems don't properly support a change attribute
> > > > > and are subject to the same problems with timestamp granularity. Other
> > > > > applications have similar issues (e.g backup applications).
> > > > >
> > > > > Switching to always using fine-grained timestamps would improve the
> > > > > situation, but that becomes rather expensive, as the underlying
> > > > > filesystem will have to log a lot more metadata updates.
> > > > >
> > > > > What we need is a way to only use fine-grained timestamps when they are
> > > > > being actively queried.
> > > > >
> > > > > The kernel always stores normalized ctime values, so only the first 30
> > > > > bits of the tv_nsec field are ever used. Whenever the mtime changes, the
> > > > > ctime must also change.
> > > > >
> > > > > Use the 31st bit of the ctime tv_nsec field to indicate that something
> > > > > has queried the inode for the i_mtime or i_ctime. When this flag is set,
> > > > > on the next timestamp update, the kernel can fetch a fine-grained
> > > > > timestamp instead of the usual coarse-grained one.
> > > > >
> > > > > This patch adds the infrastructure this scheme. Filesytems can opt
> > > > > into it by setting the FS_MULTIGRAIN_TS flag in the fstype.
> > > > >
> > > > > Later patches will convert individual filesystems over to use it.
> > > > >
> > > > > Signed-off-by: Jeff Layton <[email protected]>
> > > >
> > > > So there are two things I dislike about this series because I think they
> > > > are fragile:
> > > >
> > > > 1) If we have a filesystem supporting multigrain ts and someone
> > > > accidentally directly uses the value of inode->i_ctime, he can get bogus
> > > > value (with QUERIED flag). This mistake is very easy to do. So I think we
> > > > should rename i_ctime to something like __i_ctime and always use accessor
> > > > function for it.
> > > >
> > > > 2) As I already commented in a previous version of the series, the scheme
> > > > with just one flag for both ctime and mtime and flag getting cleared in
> > > > current_time() relies on the fact that filesystems always do an equivalent
> > > > of:
> > > >
> > > > inode->i_mtime = inode->i_ctime = current_time();
> > > >
> > > > Otherwise we can do coarse grained update where we should have done a fine
> > > > grained one. Filesystems often update timestamps like this but not
> > > > universally. Grepping shows some instances where only inode->i_mtime is set
> > > > from current_time() e.g. in autofs or bfs. Again a mistake that is rather
> > > > easy to make and results in subtle issues. I think this would be also
> > > > nicely solved by renaming i_ctime to __i_ctime and using a function to set
> > > > ctime. Mtime could then be updated with inode->i_mtime = ctime_peek().
> > > >
> > > > I understand this is quite some churn but a very mechanical one that could
> > > > be just done with Coccinelle and a few manual fixups. So IMHO it is worth
> > > > the more robust result.
> > >
> > > Also as I'm thinking about it your current scheme is slightly racy. Suppose
> > > the filesystem does:
> > >
> > > CPU1 CPU2
> > >
> > > statx()
> > > inode->i_ctime = current_time()
> > > current_mg_time()
> > > nsec = atomic_long_fetch_andnot(QUERIED, &inode->i_ctime.tv_nsec)
> > > nsec = atomic_long_fetch_or(QUERIED, &inode->i_ctime.tv_nsec)
> > > if (nsec & QUERIED) - not set
> > > ktime_get_coarse_real_ts64(&now)
> > > return timestamp_truncate(now, inode);
> > > - QUERIED flag in the inode->i_ctime gets overwritten by the assignment
> > > => we need not update ctime due to granularity although it was queried
> > >
> > > One more reason to use explicit function to update inode->i_ctime ;)
> >
> > When we store the new time in the i_ctime field, the flag gets cleared
> > because at that point we're storing a new (unseen) time.
> >
> > However, you're correct: if the i_ctime in your above example starts at
> > the same value that is currently being returned by
> > ktime_get_coarse_real_ts64, then we'll lose the flag set in statx.
> >
> > I think the right fix there would be to not update the ctime at all if
> > it's a coarse grained time, and the value wouldn't have an apparent
> > change to an observer. That would leave the flag intact.
> >
> > That does mean we'd need to move to a function that does clock fetch and
> > assigns it to i_ctime in one go (like you suggest). Something like:
> >
> > inode_update_ctime(inode);
> >
> > How we do that with atomic operations over two values (the tv_sec and
> > tv_nsec) is a bit tricky. I'll have to think about it.
> >
> > Christian, given Jan's concerns do you want to drop this series for now
> > and let me respin it?
>
> I deliberately put it into a vfs.unstable.* branch. I would leave it
> there until you send a new one then drop it. If we get lucky the bots
> that run on -next will have time to report potential perf issues while
> it's not currently causing conflicts.
Sounds good to me. Thanks!
--
Jeff Layton <[email protected]>
On Tue, 2023-05-23 at 14:46 +0200, Jan Kara wrote:
> On Tue 23-05-23 06:40:08, Jeff Layton wrote:
> > On Tue, 2023-05-23 at 12:02 +0200, Jan Kara wrote:
> > >
> > > So there are two things I dislike about this series because I think they
> > > are fragile:
> > >
> > > 1) If we have a filesystem supporting multigrain ts and someone
> > > accidentally directly uses the value of inode->i_ctime, he can get bogus
> > > value (with QUERIED flag). This mistake is very easy to do. So I think we
> > > should rename i_ctime to something like __i_ctime and always use accessor
> > > function for it.
> > >
> >
> > We could do this, but it'll be quite invasive. We'd have to change any
> > place that touches i_ctime (and there are a lot of them), even on
> > filesystems that are not being converted.
>
> Yes, that's why I suggested Coccinelle to deal with this.
I've done the work to convert all of the accesses of i_ctime into
accessor functions in the kernel. The current state of it is here:
https://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux.git/commit/?h=ctime
As expected, it touches a lot of code, all over the place. So far I have
most of the conversion in one giant patch, and I need to split it up
(probably per-subsystem).
What's the best way to feed this change into mainline? Should I try to
get subsystem maintainers to pick these up, or are we better off feeding
this in via a separate branch?
For reference, the diffstat for the big conversion patch is below:
arch/powerpc/platforms/cell/spufs/inode.c | 2 +-
arch/s390/hypfs/inode.c | 4 +-
drivers/android/binderfs.c | 8 ++--
drivers/infiniband/hw/qib/qib_fs.c | 4 +-
drivers/misc/ibmasm/ibmasmfs.c | 2 +-
drivers/misc/ibmvmc.c | 2 +-
drivers/usb/core/devio.c | 16 ++++----
drivers/usb/gadget/function/f_fs.c | 6 +--
drivers/usb/gadget/legacy/inode.c | 3 +-
fs/9p/vfs_inode.c | 6 ++-
fs/9p/vfs_inode_dotl.c | 11 +++---
fs/adfs/inode.c | 4 +-
fs/affs/amigaffs.c | 6 +--
fs/affs/inode.c | 17 ++++----
fs/afs/dynroot.c | 2 +-
fs/afs/inode.c | 6 +--
fs/attr.c | 2 +-
fs/autofs/inode.c | 2 +-
fs/autofs/root.c | 6 +--
fs/bad_inode.c | 3 +-
fs/befs/linuxvfs.c | 2 +-
fs/bfs/dir.c | 16 ++++----
fs/bfs/inode.c | 6 +--
fs/binfmt_misc.c | 3 +-
fs/btrfs/delayed-inode.c | 10 +++--
fs/btrfs/file.c | 24 ++++-------
fs/btrfs/inode.c | 66 ++++++++++++------------
-------
fs/btrfs/ioctl.c | 2 +-
fs/btrfs/reflink.c | 7 ++--
fs/btrfs/transaction.c | 3 +-
fs/btrfs/tree-log.c | 4 +-
fs/btrfs/xattr.c | 4 +-
fs/ceph/acl.c | 2 +-
fs/ceph/caps.c | 2 +-
fs/ceph/inode.c | 17 ++++----
fs/ceph/snap.c | 2 +-
fs/ceph/xattr.c | 2 +-
fs/coda/coda_linux.c | 2 +-
fs/coda/dir.c | 2 +-
fs/coda/file.c | 2 +-
fs/coda/inode.c | 2 +-
fs/configfs/inode.c | 6 +--
fs/cramfs/inode.c | 2 +-
fs/debugfs/inode.c | 2 +-
fs/devpts/inode.c | 6 +--
fs/ecryptfs/inode.c | 2 +-
fs/efivarfs/file.c | 2 +-
fs/efivarfs/inode.c | 2 +-
fs/efs/inode.c | 5 ++-
fs/erofs/inode.c | 16 ++++----
fs/exfat/file.c | 4 +-
fs/exfat/inode.c | 6 +--
fs/exfat/namei.c | 29 +++++++-------
fs/exfat/super.c | 4 +-
fs/ext2/acl.c | 2 +-
fs/ext2/dir.c | 6 +--
fs/ext2/ialloc.c | 2 +-
fs/ext2/inode.c | 11 +++---
fs/ext2/ioctl.c | 4 +-
fs/ext2/namei.c | 8 ++--
fs/ext2/super.c | 2 +-
fs/ext2/xattr.c | 2 +-
fs/ext4/acl.c | 2 +-
fs/ext4/ext4.h | 20 ++++++++++
fs/ext4/extents.c | 12 +++---
fs/ext4/ialloc.c | 2 +-
fs/ext4/inline.c | 4 +-
fs/ext4/inode.c | 16 ++++----
fs/ext4/ioctl.c | 9 +++--
fs/ext4/namei.c | 26 ++++++------
fs/ext4/super.c | 2 +-
fs/ext4/xattr.c | 6 +--
fs/f2fs/dir.c | 8 ++--
fs/f2fs/f2fs.h | 5 ++-
fs/f2fs/file.c | 16 ++++----
fs/f2fs/inline.c | 2 +-
fs/f2fs/inode.c | 10 ++---
fs/f2fs/namei.c | 12 +++---
fs/f2fs/recovery.c | 4 +-
fs/f2fs/super.c | 2 +-
fs/f2fs/xattr.c | 2 +-
fs/fat/inode.c | 8 ++--
fs/fat/misc.c | 7 +++-
fs/freevxfs/vxfs_inode.c | 4 +-
fs/fuse/control.c | 2 +-
fs/fuse/dir.c | 8 ++--
fs/fuse/inode.c | 18 +++++----
fs/gfs2/acl.c | 2 +-
fs/gfs2/bmap.c | 11 +++---
fs/gfs2/dir.c | 15 +++----
fs/gfs2/file.c | 2 +-
fs/gfs2/glops.c | 4 +-
fs/gfs2/inode.c | 8 ++--
fs/gfs2/super.c | 4 +-
fs/gfs2/xattr.c | 8 ++--
fs/hfs/catalog.c | 8 ++--
fs/hfs/dir.c | 2 +-
fs/hfs/inode.c | 13 +++---
fs/hfs/sysdep.c | 2 +-
fs/hfsplus/catalog.c | 8 ++--
fs/hfsplus/dir.c | 6 +--
fs/hfsplus/inode.c | 14 +++----
fs/hostfs/hostfs_kern.c | 5 ++-
fs/hpfs/dir.c | 8 ++--
fs/hpfs/inode.c | 6 +--
fs/hpfs/namei.c | 26 ++++++------
fs/hpfs/super.c | 5 ++-
fs/hugetlbfs/inode.c | 12 +++---
fs/inode.c | 12 ++++--
fs/isofs/inode.c | 4 +-
fs/isofs/rock.c | 16 ++++----
fs/jffs2/dir.c | 19 ++++-----
fs/jffs2/file.c | 3 +-
fs/jffs2/fs.c | 10 ++---
fs/jffs2/os-linux.h | 2 +-
fs/jfs/acl.c | 2 +-
fs/jfs/inode.c | 2 +-
fs/jfs/ioctl.c | 2 +-
fs/jfs/jfs_imap.c | 8 ++--
fs/jfs/jfs_inode.c | 4 +-
fs/jfs/namei.c | 25 ++++++------
fs/jfs/super.c | 2 +-
fs/jfs/xattr.c | 2 +-
fs/kernfs/inode.c | 4 +-
fs/libfs.c | 32 ++++++++-------
fs/minix/bitmap.c | 2 +-
fs/minix/dir.c | 6 +--
fs/minix/inode.c | 11 +++---
fs/minix/itree_common.c | 4 +-
fs/minix/namei.c | 6 +--
fs/nfs/callback_proc.c | 2 +-
fs/nfs/fscache.h | 4 +-
fs/nfs/inode.c | 21 +++++-----
fs/nfsd/nfsctl.c | 2 +-
fs/nfsd/nfsfh.c | 4 +-
fs/nfsd/vfs.c | 2 +-
fs/nilfs2/dir.c | 6 +--
fs/nilfs2/inode.c | 12 +++---
fs/nilfs2/ioctl.c | 2 +-
fs/nilfs2/namei.c | 8 ++--
fs/nsfs.c | 2 +-
fs/ntfs/inode.c | 15 +++----
fs/ntfs/mft.c | 3 +-
fs/ntfs3/file.c | 6 +--
fs/ntfs3/frecord.c | 4 +-
fs/ntfs3/inode.c | 14 ++++---
fs/ntfs3/namei.c | 10 ++---
fs/ntfs3/xattr.c | 4 +-
fs/ocfs2/acl.c | 6 +--
fs/ocfs2/alloc.c | 6 +--
fs/ocfs2/aops.c | 2 +-
fs/ocfs2/dir.c | 8 ++--
fs/ocfs2/dlmfs/dlmfs.c | 4 +-
fs/ocfs2/dlmglue.c | 10 +++--
fs/ocfs2/file.c | 16 ++++----
fs/ocfs2/inode.c | 14 ++++---
fs/ocfs2/move_extents.c | 6 +--
fs/ocfs2/namei.c | 22 ++++++-----
fs/ocfs2/refcounttree.c | 14 +++----
fs/ocfs2/xattr.c | 6 +--
fs/omfs/dir.c | 4 +-
fs/omfs/inode.c | 10 ++---
fs/openpromfs/inode.c | 4 +-
fs/orangefs/namei.c | 2 +-
fs/orangefs/orangefs-utils.c | 6 +--
fs/overlayfs/file.c | 7 +++-
fs/overlayfs/util.c | 2 +-
fs/pipe.c | 2 +-
fs/posix_acl.c | 2 +-
fs/proc/base.c | 2 +-
fs/proc/inode.c | 2 +-
fs/proc/proc_sysctl.c | 2 +-
fs/proc/self.c | 2 +-
fs/proc/thread_self.c | 2 +-
fs/pstore/inode.c | 4 +-
fs/qnx4/inode.c | 4 +-
fs/qnx6/inode.c | 4 +-
fs/ramfs/inode.c | 6 +--
fs/reiserfs/inode.c | 14 +++----
fs/reiserfs/ioctl.c | 4 +-
fs/reiserfs/namei.c | 21 +++++-----
fs/reiserfs/stree.c | 4 +-
fs/reiserfs/super.c | 2 +-
fs/reiserfs/xattr.c | 5 ++-
fs/reiserfs/xattr_acl.c | 2 +-
fs/romfs/super.c | 4 +-
fs/smb/client/file.c | 4 +-
fs/smb/client/fscache.h | 5 ++-
fs/smb/client/inode.c | 15 ++++---
fs/smb/client/smb2ops.c | 2 +-
fs/smb/server/smb2pdu.c | 8 ++--
fs/squashfs/inode.c | 2 +-
fs/stack.c | 2 +-
fs/stat.c | 2 +-
fs/sysv/dir.c | 6 +--
fs/sysv/ialloc.c | 2 +-
fs/sysv/inode.c | 6 +--
fs/sysv/itree.c | 4 +-
fs/sysv/namei.c | 6 +--
fs/tracefs/inode.c | 2 +-
fs/ubifs/debug.c | 4 +-
fs/ubifs/dir.c | 39 +++++++++---------
fs/ubifs/file.c | 16 ++++----
fs/ubifs/ioctl.c | 2 +-
fs/ubifs/journal.c | 4 +-
fs/ubifs/super.c | 4 +-
fs/ubifs/xattr.c | 6 +--
fs/udf/ialloc.c | 2 +-
fs/udf/inode.c | 17 ++++----
fs/udf/namei.c | 24 +++++------
fs/ufs/dir.c | 6 +--
fs/ufs/ialloc.c | 2 +-
fs/ufs/inode.c | 23 ++++++-----
fs/ufs/namei.c | 8 ++--
fs/vboxsf/utils.c | 4 +-
fs/xfs/libxfs/xfs_inode_buf.c | 4 +-
fs/xfs/libxfs/xfs_trans_inode.c | 2 +-
fs/xfs/xfs_acl.c | 2 +-
fs/xfs/xfs_bmap_util.c | 6 ++-
fs/xfs/xfs_inode.c | 2 +-
fs/xfs/xfs_inode_item.c | 2 +-
fs/xfs/xfs_iops.c | 4 +-
fs/xfs/xfs_itable.c | 4 +-
fs/zonefs/super.c | 8 ++--
include/linux/fs.h | 1 +
include/linux/fs_stack.h | 2 +-
ipc/mqueue.c | 20 +++++-----
kernel/bpf/inode.c | 4 +-
mm/shmem.c | 28 +++++++------
net/sunrpc/rpc_pipe.c | 2 +-
security/apparmor/apparmorfs.c | 6 +--
security/apparmor/policy_unpack.c | 4 +-
security/inode.c | 2 +-
security/selinux/selinuxfs.c | 2 +-
234 files changed, 851 insertions(+), 808 deletions(-)
--
Jeff Layton <[email protected]>
On Tue, 2023-05-23 at 12:17 +0200, Jan Kara wrote:
> On Tue 23-05-23 12:02:40, Jan Kara wrote:
> > On Thu 18-05-23 07:47:35, Jeff Layton wrote:
> > > The VFS always uses coarse-grained timestamp updates for filling out the
> > > ctime and mtime after a change. This has the benefit of allowing
> > > filesystems to optimize away a lot metadata updates, down to around 1
> > > per jiffy, even when a file is under heavy writes.
> > >
> > > Unfortunately, this has always been an issue when we're exporting via
> > > NFSv3, which relies on timestamps to validate caches. Even with NFSv4, a
> > > lot of exported filesystems don't properly support a change attribute
> > > and are subject to the same problems with timestamp granularity. Other
> > > applications have similar issues (e.g backup applications).
> > >
> > > Switching to always using fine-grained timestamps would improve the
> > > situation, but that becomes rather expensive, as the underlying
> > > filesystem will have to log a lot more metadata updates.
> > >
> > > What we need is a way to only use fine-grained timestamps when they are
> > > being actively queried.
> > >
> > > The kernel always stores normalized ctime values, so only the first 30
> > > bits of the tv_nsec field are ever used. Whenever the mtime changes, the
> > > ctime must also change.
> > >
> > > Use the 31st bit of the ctime tv_nsec field to indicate that something
> > > has queried the inode for the i_mtime or i_ctime. When this flag is set,
> > > on the next timestamp update, the kernel can fetch a fine-grained
> > > timestamp instead of the usual coarse-grained one.
> > >
> > > This patch adds the infrastructure this scheme. Filesytems can opt
> > > into it by setting the FS_MULTIGRAIN_TS flag in the fstype.
> > >
> > > Later patches will convert individual filesystems over to use it.
> > >
> > > Signed-off-by: Jeff Layton <[email protected]>
> >
> > So there are two things I dislike about this series because I think they
> > are fragile:
> >
> > 1) If we have a filesystem supporting multigrain ts and someone
> > accidentally directly uses the value of inode->i_ctime, he can get bogus
> > value (with QUERIED flag). This mistake is very easy to do. So I think we
> > should rename i_ctime to something like __i_ctime and always use accessor
> > function for it.
> >
> > 2) As I already commented in a previous version of the series, the scheme
> > with just one flag for both ctime and mtime and flag getting cleared in
> > current_time() relies on the fact that filesystems always do an equivalent
> > of:
> >
> > inode->i_mtime = inode->i_ctime = current_time();
> >
> > Otherwise we can do coarse grained update where we should have done a fine
> > grained one. Filesystems often update timestamps like this but not
> > universally. Grepping shows some instances where only inode->i_mtime is set
> > from current_time() e.g. in autofs or bfs. Again a mistake that is rather
> > easy to make and results in subtle issues. I think this would be also
> > nicely solved by renaming i_ctime to __i_ctime and using a function to set
> > ctime. Mtime could then be updated with inode->i_mtime = ctime_peek().
> >
> > I understand this is quite some churn but a very mechanical one that could
> > be just done with Coccinelle and a few manual fixups. So IMHO it is worth
> > the more robust result.
>
> Also as I'm thinking about it your current scheme is slightly racy. Suppose
> the filesystem does:
>
> CPU1 CPU2
>
> statx()
> inode->i_ctime = current_time()
> current_mg_time()
> nsec = atomic_long_fetch_andnot(QUERIED, &inode->i_ctime.tv_nsec)
> nsec = atomic_long_fetch_or(QUERIED, &inode->i_ctime.tv_nsec)
> if (nsec & QUERIED) - not set
> ktime_get_coarse_real_ts64(&now)
> return timestamp_truncate(now, inode);
> - QUERIED flag in the inode->i_ctime gets overwritten by the assignment
> => we need not update ctime due to granularity although it was queried
>
> One more reason to use explicit function to update inode->i_ctime ;)
Thinking about this some more, I think we can fix the race you pointed
out by just not clearing the queried flag when we fetch the
i_ctime.tv_nsec field when we're updating.
So, instead of atomic_long_fetch_andnot, we'd just want to use an
atomic_long_fetch there, and just let the eventual assignment of
inode->__i_ctime.tv_nsec be what clears the flag.
Any task that goes to update the time during the interim window will
fetch a fine-grained time, but that's what we want anyway.
Since you bring up races though, there are a couple of other things we
should be aware of. Note that both problems exist today too:
1) it's possible for two tasks to race in such a way that the ctime goes
backward. There's no synchronization between tasks doing the updating,
so an older time can overwrite a newer one. I think you'd need a pretty
tight race to observe this though.
2) it's possible to fetch a "torn" timestamp out of the inode.
timespec64 is two words, and we don't order its loads or stores. We
could consider adding a seqcount_t and some barriers and fixing it that
way. I'm not sure it's worth it though, given that we haven't worried
about this in the past.
For now, I propose that we ignore both problems, unless and until we can
prove that they are more than theoretical.
--
Jeff Layton <[email protected]>
On Tue, Jun 13, 2023 at 09:09:29AM -0400, Jeff Layton wrote:
> On Tue, 2023-05-23 at 14:46 +0200, Jan Kara wrote:
> > On Tue 23-05-23 06:40:08, Jeff Layton wrote:
> > > On Tue, 2023-05-23 at 12:02 +0200, Jan Kara wrote:
> > > >
> > > > So there are two things I dislike about this series because I think they
> > > > are fragile:
> > > >
> > > > 1) If we have a filesystem supporting multigrain ts and someone
> > > > accidentally directly uses the value of inode->i_ctime, he can get bogus
> > > > value (with QUERIED flag). This mistake is very easy to do. So I think we
> > > > should rename i_ctime to something like __i_ctime and always use accessor
> > > > function for it.
> > > >
> > >
> > > We could do this, but it'll be quite invasive. We'd have to change any
> > > place that touches i_ctime (and there are a lot of them), even on
> > > filesystems that are not being converted.
> >
> > Yes, that's why I suggested Coccinelle to deal with this.
>
>
> I've done the work to convert all of the accesses of i_ctime into
> accessor functions in the kernel. The current state of it is here:
>
>
> https://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux.git/commit/?h=ctime
>
> As expected, it touches a lot of code, all over the place. So far I have
> most of the conversion in one giant patch, and I need to split it up
> (probably per-subsystem).
Yeah, you have time since it'll be v6.6 material.
>
> What's the best way to feed this change into mainline? Should I try to
> get subsystem maintainers to pick these up, or are we better off feeding
> this in via a separate branch?
I would prefer if we send them all through the vfs tree since trickle
down conversions are otherwise very painful and potentially very slow.