tl;dr: I think we can greatly reduce the cost of the inode->i_version
counter, by exploiting the fact that we don't need to increment it
if no one is looking at it. We can also clean up the code to prepare
to eventually expose this value via statx().
The inode->i_version field is supposed to be a value that changes
whenever there is any data or metadata change to the inode. Some
filesystems use it internally to detect directory changes during
readdir. knfsd will use it if the filesystem has MS_I_VERSION
set. IMA will also use it (though it's not clear to me that that
works 100% -- no check for MS_I_VERSION there).
Only btrfs, ext4, and xfs implement it for data changes. Because of
this, these filesystems must log the inode to disk whenever the
i_version counter changes. That has a non-zero performance impact,
especially on write-heavy workloads, because we end up dirtying the
inode metadata on every write, not just when the times change. [1]
It turns out though that none of these users of i_version require that
i_version change on every change to the file. The only real requirement
is that it be different if _something_ changed since the last time we
queried for it. [2]
So, if we simply keep track of when something queries the value, we
can avoid bumping the counter and that metadata update.
This patchset implements this:
It starts with some small cleanup patches to just remove any mention of
the i_version counter in filesystems that don't actually use it.
Then, we add a new set of static inlines for managing the counter. The
initial version should work identically to what we have now. Then, all
of the remaining filesystems that use i_version are converted to the new
inlines.
Once that's in place, we switch to a new implementation that allows us
to track readers of i_version counter, and only bump it when it's
necessary or convenient (i.e. we're going to disk anyway).
The final patch switches from a scheme that uses the i_lock to serialize
the counter updates during write to an atomic64_t. That's a wash
performance-wise in my testing, but I like not having to take the i_lock
down where it could end up nested inside other locks.
With this, we reduce inode metadata updates across all 3 filesystems
down to roughly the frequency of the timestamp granularity, particularly
when it's not being queried (the vastly common case).
The pessimal workload here is 1 byte writes, and it helps that
significantly. Of course, that's not a real-world workload.
A tiobench-example.fio workload also shows some modest performance
gains, and I've gotten mails from the kernel test robot that show some
significant performance gains on some microbenchmarks (case-msync-mt in
the vm-scalability testsuite to be specific).
I'm happy to run other workloads if anyone can suggest them.
At this point, the patchset works and does what it's expected to do in
my own testing. It seems like it's at least a modest performance win
across all 3 major disk-based filesystems. It may also encourage others
to implement i_version as well since it reduces that cost.
Is this an avenue that's worthwhile to pursue?
Note that I think we may have other changes coming in the future that
will make this sort of cleanup necessary anyway. I'd like to plug in the
Ceph change attribute here eventually, and that will require something
like this anyway.
Thoughts, comments and suggestions are welcome...
---
[1]: On ext4 it must be turned on with the i_version mount option,
mostly due to fears of incurring this impact, AFAICT.
[2]: NFS also recommends that it appear to increase in value over time, so
that clients can discard metadata updates that are older than ones
they've already seen.
Jeff Layton (30):
lustre: don't set f_version in ll_readdir
ecryptfs: remove unnecessary i_version bump
ceph: remove the bump of i_version
f2fs: don't bother setting i_version
hpfs: don't bother with the i_version counter
jfs: remove initialization of i_version counter
nilfs2: remove inode->i_version initialization
orangefs: remove initialization of i_version
reiserfs: remove unneeded i_version bump
ntfs: remove i_version handling
fs: new API for handling i_version
fat: convert to new i_version API
affs: convert to new i_version API
afs: convert to new i_version API
btrfs: convert to new i_version API
exofs: switch to new i_version API
ext2: convert to new i_version API
ext4: convert to new i_version API
nfs: convert to new i_version API
nfsd: convert to new i_version API
ocfs2: convert to new i_version API
ufs: use new i_version API
xfs: convert to new i_version API
IMA: switch IMA over to new i_version API
fs: add a "force" parameter to inode_inc_iversion
fs: only set S_VERSION when updating times if it has been queried
xfs: avoid setting XFS_ILOG_CORE if i_version doesn't need
incrementing
btrfs: only dirty the inode in btrfs_update_time if something was
changed
fs: track whether the i_version has been queried with an i_state flag
fs: convert i_version counter over to an atomic64_t
drivers/staging/lustre/lustre/llite/dir.c | 3 -
fs/affs/amigaffs.c | 4 +-
fs/affs/dir.c | 4 +-
fs/affs/super.c | 2 +-
fs/afs/fsclient.c | 2 +-
fs/afs/inode.c | 4 +-
fs/btrfs/delayed-inode.c | 4 +-
fs/btrfs/file.c | 4 +-
fs/btrfs/inode.c | 41 ++++----
fs/btrfs/ioctl.c | 4 +-
fs/btrfs/tree-log.c | 2 +-
fs/btrfs/xattr.c | 2 +-
fs/ceph/inode.c | 1 -
fs/ecryptfs/inode.c | 1 -
fs/exofs/dir.c | 8 +-
fs/exofs/super.c | 2 +-
fs/ext2/dir.c | 8 +-
fs/ext2/super.c | 4 +-
fs/ext4/dir.c | 8 +-
fs/ext4/inline.c | 6 +-
fs/ext4/inode.c | 16 ++--
fs/ext4/ioctl.c | 2 +-
fs/ext4/namei.c | 8 +-
fs/ext4/super.c | 2 +-
fs/f2fs/super.c | 1 -
fs/fat/dir.c | 2 +-
fs/fat/inode.c | 8 +-
fs/fat/namei_msdos.c | 6 +-
fs/fat/namei_vfat.c | 20 ++--
fs/hpfs/dir.c | 1 -
fs/hpfs/dnode.c | 2 -
fs/hpfs/super.c | 1 -
fs/inode.c | 9 +-
fs/jfs/super.c | 1 -
fs/nfs/delegation.c | 2 +-
fs/nfs/fscache-index.c | 4 +-
fs/nfs/inode.c | 16 ++--
fs/nfs/nfs4proc.c | 4 +-
fs/nfs/nfstrace.h | 4 +-
fs/nfs/write.c | 2 +-
fs/nfsd/nfs3xdr.c | 2 +-
fs/nfsd/nfs4xdr.c | 2 +-
fs/nfsd/nfsfh.h | 2 +-
fs/nilfs2/super.c | 1 -
fs/ntfs/inode.c | 9 --
fs/ntfs/mft.c | 6 --
fs/ocfs2/dir.c | 14 +--
fs/ocfs2/inode.c | 2 +-
fs/ocfs2/namei.c | 2 +-
fs/ocfs2/quota_global.c | 2 +-
fs/orangefs/super.c | 2 -
fs/reiserfs/super.c | 1 -
fs/ufs/dir.c | 8 +-
fs/ufs/inode.c | 2 +-
fs/ufs/super.c | 2 +-
fs/xfs/libxfs/xfs_inode_buf.c | 4 +-
fs/xfs/xfs_icache.c | 4 +-
fs/xfs/xfs_inode.c | 2 +-
fs/xfs/xfs_inode_item.c | 2 +-
fs/xfs/xfs_trans_inode.c | 12 +--
include/linux/fs.h | 151 ++++++++++++++++++++++++++++--
security/integrity/ima/ima_api.c | 2 +-
security/integrity/ima/ima_main.c | 2 +-
63 files changed, 288 insertions(+), 173 deletions(-)
--
2.7.4
f_version is only ever used by filesystem-specific code, and nothing in
lustre ever looks at it.
Signed-off-by: Jeff Layton <[email protected]>
---
drivers/staging/lustre/lustre/llite/dir.c | 3 ---
1 file changed, 3 deletions(-)
diff --git a/drivers/staging/lustre/lustre/llite/dir.c b/drivers/staging/lustre/lustre/llite/dir.c
index ea5d247a3f70..1d288ec5f03e 100644
--- a/drivers/staging/lustre/lustre/llite/dir.c
+++ b/drivers/staging/lustre/lustre/llite/dir.c
@@ -367,8 +367,6 @@ static int ll_readdir(struct file *filp, struct dir_context *ctx)
}
ctx->pos = pos;
ll_finish_md_op_data(op_data);
- filp->f_version = inode->i_version;
-
out:
if (!rc)
ll_stats_ops_tally(sbi, LPROC_LL_READDIR, 1);
@@ -1671,7 +1669,6 @@ static loff_t ll_dir_seek(struct file *file, loff_t offset, int origin)
else
fd->lfd_pos = offset;
file->f_pos = offset;
- file->f_version = 0;
}
ret = offset;
}
--
2.7.4
Nothing uses it on HPFS.
Signed-off-by: Jeff Layton <[email protected]>
---
fs/hpfs/dir.c | 1 -
fs/hpfs/dnode.c | 2 --
fs/hpfs/super.c | 1 -
3 files changed, 4 deletions(-)
diff --git a/fs/hpfs/dir.c b/fs/hpfs/dir.c
index 7b9150c2e75c..1a9fded00378 100644
--- a/fs/hpfs/dir.c
+++ b/fs/hpfs/dir.c
@@ -149,7 +149,6 @@ static int hpfs_readdir(struct file *file, struct dir_context *ctx)
if (unlikely(ret < 0))
goto out;
ctx->pos = ((loff_t) hpfs_de_as_down_as_possible(inode->i_sb, hpfs_inode->i_dno) << 4) + 1;
- file->f_version = inode->i_version;
}
next_pos = ctx->pos;
if (!(de = map_pos_dirent(inode, &next_pos, &qbh))) {
diff --git a/fs/hpfs/dnode.c b/fs/hpfs/dnode.c
index 86ab7e790b4e..454acb0a009a 100644
--- a/fs/hpfs/dnode.c
+++ b/fs/hpfs/dnode.c
@@ -418,7 +418,6 @@ int hpfs_add_dirent(struct inode *i,
c = 1;
goto ret;
}
- i->i_version++;
c = hpfs_add_to_dnode(i, dno, name, namelen, new_de, 0);
ret:
return c;
@@ -725,7 +724,6 @@ int hpfs_remove_dirent(struct inode *i, dnode_secno dno, struct hpfs_dirent *de,
return 2;
}
}
- i->i_version++;
for_all_poss(i, hpfs_pos_del, (t = get_pos(dnode, de)) + 1, 1);
hpfs_delete_de(i->i_sb, dnode, de);
hpfs_mark_4buffers_dirty(qbh);
diff --git a/fs/hpfs/super.c b/fs/hpfs/super.c
index 82067ca22f2b..2fd66c854255 100644
--- a/fs/hpfs/super.c
+++ b/fs/hpfs/super.c
@@ -235,7 +235,6 @@ static struct inode *hpfs_alloc_inode(struct super_block *sb)
ei = kmem_cache_alloc(hpfs_inode_cachep, GFP_NOFS);
if (!ei)
return NULL;
- ei->vfs_inode.i_version = 1;
return &ei->vfs_inode;
}
--
2.7.4
This change is straightforward, but I have some serious doubts about
this code.
Signed-off-by: Jeff Layton <[email protected]>
---
security/integrity/ima/ima_api.c | 2 +-
security/integrity/ima/ima_main.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/security/integrity/ima/ima_api.c b/security/integrity/ima/ima_api.c
index 9df26a2b75ba..1f676313c30b 100644
--- a/security/integrity/ima/ima_api.c
+++ b/security/integrity/ima/ima_api.c
@@ -204,7 +204,7 @@ int ima_collect_measurement(struct integrity_iint_cache *iint,
} hash;
if (!(iint->flags & IMA_COLLECTED)) {
- u64 i_version = file_inode(file)->i_version;
+ u64 i_version = inode_get_iversion(file_inode(file));
if (file->f_flags & O_DIRECT) {
audit_cause = "failed(directio)";
diff --git a/security/integrity/ima/ima_main.c b/security/integrity/ima/ima_main.c
index 423d111b3b94..8d95cf42d20c 100644
--- a/security/integrity/ima/ima_main.c
+++ b/security/integrity/ima/ima_main.c
@@ -122,7 +122,7 @@ static void ima_check_last_writer(struct integrity_iint_cache *iint,
inode_lock(inode);
if (atomic_read(&inode->i_writecount) == 1) {
- if ((iint->version != inode->i_version) ||
+ if (inode_cmp_iversion(inode, iint->version) ||
(iint->flags & IMA_NEW_FILE)) {
iint->flags &= ~(IMA_DONE_MASK | IMA_NEW_FILE);
iint->measured_pcrs = 0;
--
2.7.4
Signed-off-by: Jeff Layton <[email protected]>
---
fs/affs/amigaffs.c | 4 ++--
fs/affs/dir.c | 4 ++--
fs/affs/super.c | 2 +-
3 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/fs/affs/amigaffs.c b/fs/affs/amigaffs.c
index 0ec65c133b93..00aff7cfc883 100644
--- a/fs/affs/amigaffs.c
+++ b/fs/affs/amigaffs.c
@@ -59,7 +59,7 @@ affs_insert_hash(struct inode *dir, struct buffer_head *bh)
affs_brelse(dir_bh);
dir->i_mtime = dir->i_ctime = current_time(dir);
- dir->i_version++;
+ inode_inc_iversion_locked(dir);
mark_inode_dirty(dir);
return 0;
@@ -113,7 +113,7 @@ affs_remove_hash(struct inode *dir, struct buffer_head *rem_bh)
affs_brelse(bh);
dir->i_mtime = dir->i_ctime = current_time(dir);
- dir->i_version++;
+ inode_inc_iversion_locked(dir);
mark_inode_dirty(dir);
return retval;
diff --git a/fs/affs/dir.c b/fs/affs/dir.c
index f1e7294381c5..a587d57668e3 100644
--- a/fs/affs/dir.c
+++ b/fs/affs/dir.c
@@ -79,7 +79,7 @@ affs_readdir(struct file *file, struct dir_context *ctx)
* we can jump directly to where we left off.
*/
ino = (u32)(long)file->private_data;
- if (ino && file->f_version == inode->i_version) {
+ if (ino && inode_cmp_iversion(inode, file->f_version) == 0) {
pr_debug("readdir() left off=%d\n", ino);
goto inside;
}
@@ -129,7 +129,7 @@ affs_readdir(struct file *file, struct dir_context *ctx)
} while (ino);
}
done:
- file->f_version = inode->i_version;
+ file->f_version = inode_get_iversion(inode);
file->private_data = (void *)(long)ino;
affs_brelse(fh_bh);
diff --git a/fs/affs/super.c b/fs/affs/super.c
index d6384863192c..b3ebf90f9798 100644
--- a/fs/affs/super.c
+++ b/fs/affs/super.c
@@ -99,7 +99,7 @@ static struct inode *affs_alloc_inode(struct super_block *sb)
if (!i)
return NULL;
- i->vfs_inode.i_version = 1;
+ inode_set_iversion(&i->vfs_inode, 1);
i->i_lc = NULL;
i->i_ext_bh = NULL;
i->i_pa_cnt = 0;
--
2.7.4
Mostly just making sure we use the "get" wrappers so we know when
it is being fetched for later use.
Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfsd/nfs3xdr.c | 2 +-
fs/nfsd/nfs4xdr.c | 2 +-
fs/nfsd/nfsfh.h | 2 +-
3 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
index dba2ff8eaa68..a645b1dbc4e3 100644
--- a/fs/nfsd/nfs3xdr.c
+++ b/fs/nfsd/nfs3xdr.c
@@ -260,7 +260,7 @@ void fill_post_wcc(struct svc_fh *fhp)
printk("nfsd: inode locked twice during operation.\n");
err = fh_getattr(fhp, &fhp->fh_post_attr);
- fhp->fh_post_change = d_inode(fhp->fh_dentry)->i_version;
+ fhp->fh_post_change = inode_get_iversion(d_inode(fhp->fh_dentry));
if (err) {
fhp->fh_post_saved = false;
/* Grab the ctime anyway - set_change_info might use it */
diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index 7ecf16be4a44..98bd030f849a 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -1969,7 +1969,7 @@ nfsd4_decode_compound(struct nfsd4_compoundargs *argp)
static __be32 *encode_change(__be32 *p, struct kstat *stat, struct inode *inode)
{
if (IS_I_VERSION(inode)) {
- p = xdr_encode_hyper(p, inode->i_version);
+ p = xdr_encode_hyper(p, inode_get_iversion(inode));
} else {
*p++ = cpu_to_be32(stat->ctime.tv_sec);
*p++ = cpu_to_be32(stat->ctime.tv_nsec);
diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h
index f84fe6bf9aee..7a66065f6e74 100644
--- a/fs/nfsd/nfsfh.h
+++ b/fs/nfsd/nfsfh.h
@@ -253,7 +253,7 @@ fill_pre_wcc(struct svc_fh *fhp)
fhp->fh_pre_mtime = inode->i_mtime;
fhp->fh_pre_ctime = inode->i_ctime;
fhp->fh_pre_size = inode->i_size;
- fhp->fh_pre_change = inode->i_version;
+ fhp->fh_pre_change = inode_get_iversion(inode);
fhp->fh_pre_saved = true;
}
}
--
2.7.4
NFSv4 has some pretty relaxed rules for the i_version counter that
we can exploit for our own (performance) gain. The rules basically
boil down to:
1) it must steadily increase so that a client can discard change
attributes that are older than ones it has already seen.
2) the value must be different from the last time we checked it if
there was a data or metadata change.
This last bit is important, as we don't necessarily need to bump the
counter when no one is querying for it. On a write-intensive workload
this can add up to the metadata being written a lot less.
Add a new I_VERS_BUMP i_state flag that we can use to track when the
i_version has been queried. When it's queried we take the i_lock,
get the value and set the flag and then drop the lock and return it.
When we would go to bump it, we check the flag and only bump the
the counter if it's set and we weren't requested to forcibly bump it.
Signed-off-by: Jeff Layton <[email protected]>
---
include/linux/fs.h | 66 +++++++++++++++++++++++++++++++++++++-----------------
1 file changed, 46 insertions(+), 20 deletions(-)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 75323e7b6954..917557faa8e8 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1909,6 +1909,9 @@ static inline bool HAS_UNMAPPED_ID(struct inode *inode)
* wb stat updates to grab mapping->tree_lock. See
* inode_switch_wb_work_fn() for details.
*
+ * I_VERS_BUMP inode->i_version counter must be bumped on the next
+ * change. See the inode_*_iversion functions.
+ *
* Q: What is the difference between I_WILL_FREE and I_FREEING?
*/
#define I_DIRTY_SYNC (1 << 0)
@@ -1929,6 +1932,7 @@ static inline bool HAS_UNMAPPED_ID(struct inode *inode)
#define __I_DIRTY_TIME_EXPIRED 12
#define I_DIRTY_TIME_EXPIRED (1 << __I_DIRTY_TIME_EXPIRED)
#define I_WB_SWITCH (1 << 13)
+#define I_VERS_BUMP (1 << 14)
#define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES)
#define I_DIRTY_ALL (I_DIRTY | I_DIRTY_TIME)
@@ -1976,20 +1980,6 @@ inode_set_iversion(struct inode *inode, const u64 new)
}
/**
- * inode_inc_iversion_locked - increment i_version while protected
- * @inode: inode to be updated
- *
- * Increment the i_version field in the inode. This version is usable
- * when there is some other sort of lock in play that would prevent
- * concurrent accessors.
- */
-static inline void
-inode_inc_iversion_locked(struct inode *inode)
-{
- inode->i_version++;
-}
-
-/**
* inode_set_iversion_read - set i_version to a particular value and flag
* set flag to indicate that it has been viewed
* @inode: inode to set
@@ -2002,7 +1992,10 @@ inode_inc_iversion_locked(struct inode *inode)
static inline void
inode_set_iversion_read(struct inode *inode, const u64 new)
{
+ spin_lock(&inode->i_lock);
inode_set_iversion(inode, new);
+ inode->i_state |= I_VERS_BUMP;
+ spin_unlock(&inode->i_lock);
}
/**
@@ -2011,14 +2004,36 @@ inode_set_iversion_read(struct inode *inode, const u64 new)
*
* Every time the inode is modified, the i_version field will be incremented.
* The filesystem has to be mounted with MS_I_VERSION flag.
+ *
+ * Returns true if counter was bumped, and false if it wasn't necessary.
*/
static inline bool
inode_inc_iversion(struct inode *inode, bool force)
{
+ bool ret = false;
+
spin_lock(&inode->i_lock);
- inode_inc_iversion_locked(inode);
+ if (force || (inode->i_state & I_VERS_BUMP)) {
+ inode->i_version++;
+ inode->i_state &= ~I_VERS_BUMP;
+ ret = true;
+ }
spin_unlock(&inode->i_lock);
- return true;
+ return ret;
+}
+
+/**
+ * inode_inc_iversion_locked - increment i_version while protected
+ * @inode: inode to be updated
+ *
+ * Increment the i_version field in the inode. This version is usable
+ * when there is some other sort of lock in play that would prevent
+ * concurrent increments (typically inode->i_rwsem for write).
+ */
+static inline void
+inode_inc_iversion_locked(struct inode *inode)
+{
+ inode_inc_iversion(inode, true);
}
/**
@@ -2043,9 +2058,15 @@ inode_get_iversion_raw(const struct inode *inode)
* to store the returned i_version for later comparison.
*/
static inline u64
-inode_get_iversion(const struct inode *inode)
+inode_get_iversion(struct inode *inode)
{
- return inode_get_iversion_raw(inode);
+ u64 ret;
+
+ spin_lock(&inode->i_lock);
+ inode->i_state |= I_VERS_BUMP;
+ ret = inode->i_version;
+ spin_unlock(&inode->i_lock);
+ return ret;
}
/**
@@ -2054,7 +2075,7 @@ inode_get_iversion(const struct inode *inode)
* @old: old value to check against its i_version
*
* Compare an i_version counter with a previous one. Returns 0 if they are
- * the same or non-zero if they are different.
+ * the same, greater than zero if the inode's is "later" than the old value.
*/
static inline s64
inode_cmp_iversion(const struct inode *inode, const u64 old)
@@ -2072,7 +2093,12 @@ inode_cmp_iversion(const struct inode *inode, const u64 old)
static inline bool
inode_iversion_need_inc(struct inode *inode)
{
- return true;
+ bool ret;
+
+ spin_lock(&inode->i_lock);
+ ret = inode->i_state & I_VERS_BUMP;
+ spin_unlock(&inode->i_lock);
+ return ret;
}
enum file_time_flags {
--
2.7.4
At this point, we know that "now" and the file times may differ, and we
suspect that the i_version has been flagged to be bumped. Attempt to
bump the i_version, and only mark the inode dirty if that actually
occurred or if one of the times was updated.
Signed-off-by: Jeff Layton <[email protected]>
---
fs/btrfs/inode.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index a03e5a1d5e05..65a7065c0fbf 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6025,19 +6025,20 @@ static int btrfs_update_time(struct inode *inode, struct timespec *now,
int flags)
{
struct btrfs_root *root = BTRFS_I(inode)->root;
+ bool dirty = flags & ~S_VERSION;
if (btrfs_root_readonly(root))
return -EROFS;
if (flags & S_VERSION)
- inode_inc_iversion(inode, true);
+ dirty |= inode_inc_iversion(inode, dirty);
if (flags & S_CTIME)
inode->i_ctime = *now;
if (flags & S_MTIME)
inode->i_mtime = *now;
if (flags & S_ATIME)
inode->i_atime = *now;
- return btrfs_dirty_inode(inode);
+ return dirty ? btrfs_dirty_inode(inode) : 0;
}
/*
--
2.7.4
Signed-off-by: Jeff Layton <[email protected]>
---
fs/ext4/dir.c | 8 ++++----
fs/ext4/inline.c | 6 +++---
fs/ext4/inode.c | 12 ++++++++----
fs/ext4/ioctl.c | 2 +-
fs/ext4/namei.c | 8 ++++----
fs/ext4/super.c | 2 +-
6 files changed, 21 insertions(+), 17 deletions(-)
diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c
index e8b365000d73..cd84f15d3d41 100644
--- a/fs/ext4/dir.c
+++ b/fs/ext4/dir.c
@@ -207,7 +207,7 @@ static int ext4_readdir(struct file *file, struct dir_context *ctx)
* readdir(2), then we might be pointing to an invalid
* dirent right now. Scan from the start of the block
* to make sure. */
- if (file->f_version != inode->i_version) {
+ if (inode_cmp_iversion(inode, file->f_version)) {
for (i = 0; i < sb->s_blocksize && i < offset; ) {
de = (struct ext4_dir_entry_2 *)
(bh->b_data + i);
@@ -226,7 +226,7 @@ static int ext4_readdir(struct file *file, struct dir_context *ctx)
offset = i;
ctx->pos = (ctx->pos & ~(sb->s_blocksize - 1))
| offset;
- file->f_version = inode->i_version;
+ file->f_version = inode_get_iversion(inode);
}
while (ctx->pos < inode->i_size
@@ -567,10 +567,10 @@ static int ext4_dx_readdir(struct file *file, struct dir_context *ctx)
* cached entries.
*/
if ((!info->curr_node) ||
- (file->f_version != inode->i_version)) {
+ inode_cmp_iversion(inode, file->f_version)) {
info->curr_node = NULL;
free_rb_tree_fname(&info->root);
- file->f_version = inode->i_version;
+ file->f_version = inode_get_iversion(inode);
ret = ext4_htree_fill_tree(file, info->curr_hash,
info->curr_minor_hash,
&info->next_hash);
diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c
index 437df6a1a841..6a3d173cb9ec 100644
--- a/fs/ext4/inline.c
+++ b/fs/ext4/inline.c
@@ -1042,7 +1042,7 @@ static int ext4_add_dirent_to_inline(handle_t *handle,
*/
dir->i_mtime = dir->i_ctime = current_time(dir);
ext4_update_dx_flag(dir);
- dir->i_version++;
+ inode_inc_iversion_locked(dir);
ext4_mark_inode_dirty(handle, dir);
return 1;
}
@@ -1496,7 +1496,7 @@ int ext4_read_inline_dir(struct file *file,
* dirent right now. Scan from the start of the inline
* dir to make sure.
*/
- if (file->f_version != inode->i_version) {
+ if (inode_cmp_iversion(inode, file->f_version)) {
for (i = 0; i < extra_size && i < offset;) {
/*
* "." is with offset 0 and
@@ -1528,7 +1528,7 @@ int ext4_read_inline_dir(struct file *file,
}
offset = i;
ctx->pos = offset;
- file->f_version = inode->i_version;
+ file->f_version = inode_get_iversion(inode);
}
while (ctx->pos < extra_size) {
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 88d57af1b516..5603e1782c65 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4709,12 +4709,14 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
EXT4_EINODE_GET_XTIME(i_crtime, ei, raw_inode);
if (likely(!test_opt2(inode->i_sb, HURD_COMPAT))) {
- inode->i_version = le32_to_cpu(raw_inode->i_disk_version);
+ u64 ivers = le32_to_cpu(raw_inode->i_disk_version);
+
if (EXT4_INODE_SIZE(inode->i_sb) > EXT4_GOOD_OLD_INODE_SIZE) {
if (EXT4_FITS_IN_INODE(raw_inode, ei, i_version_hi))
- inode->i_version |=
+ ivers |=
(__u64)(le32_to_cpu(raw_inode->i_version_hi)) << 32;
}
+ inode_set_iversion_read(inode, ivers);
}
ret = 0;
@@ -5000,11 +5002,13 @@ static int ext4_do_update_inode(handle_t *handle,
}
if (likely(!test_opt2(inode->i_sb, HURD_COMPAT))) {
- raw_inode->i_disk_version = cpu_to_le32(inode->i_version);
+ u64 ivers = inode_get_iversion_raw(inode);
+
+ raw_inode->i_disk_version = cpu_to_le32(ivers);
if (ei->i_extra_isize) {
if (EXT4_FITS_IN_INODE(raw_inode, ei, i_version_hi))
raw_inode->i_version_hi =
- cpu_to_le32(inode->i_version >> 32);
+ cpu_to_le32(ivers >> 32);
raw_inode->i_extra_isize =
cpu_to_le16(ei->i_extra_isize);
}
diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index fb844da4836f..7e1d56e3def0 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -139,7 +139,7 @@ static long swap_inode_boot_loader(struct super_block *sb,
i_gid_write(inode_bl, 0);
inode_bl->i_flags = 0;
ei_bl->i_flags = 0;
- inode_bl->i_version = 1;
+ inode_set_iversion(inode_bl, 1);
i_size_write(inode_bl, 0);
inode_bl->i_mode = S_IFREG;
if (ext4_has_feature_extents(sb)) {
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index eadba919f26b..a8f812162dd6 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -1943,7 +1943,7 @@ static int add_dirent_to_buf(handle_t *handle, struct ext4_filename *fname,
*/
dir->i_mtime = dir->i_ctime = current_time(dir);
ext4_update_dx_flag(dir);
- dir->i_version++;
+ inode_inc_iversion_locked(dir);
ext4_mark_inode_dirty(handle, dir);
BUFFER_TRACE(bh, "call ext4_handle_dirty_metadata");
err = ext4_handle_dirty_dirent_node(handle, dir, bh);
@@ -2350,7 +2350,7 @@ int ext4_generic_delete_entry(handle_t *handle,
blocksize);
else
de->inode = 0;
- dir->i_version++;
+ inode_inc_iversion_locked(dir);
return 0;
}
i += ext4_rec_len_from_disk(de->rec_len, blocksize);
@@ -2980,7 +2980,7 @@ static int ext4_rmdir(struct inode *dir, struct dentry *dentry)
"empty directory '%.*s' has too many links (%u)",
dentry->d_name.len, dentry->d_name.name,
inode->i_nlink);
- inode->i_version++;
+ inode_inc_iversion_locked(inode);
clear_nlink(inode);
/* There's no need to set i_disksize: the fact that i_nlink is
* zero will ensure that the right thing happens during any
@@ -3379,7 +3379,7 @@ static int ext4_setent(handle_t *handle, struct ext4_renament *ent,
ent->de->inode = cpu_to_le32(ino);
if (ext4_has_feature_filetype(ent->dir->i_sb))
ent->de->file_type = file_type;
- ent->dir->i_version++;
+ inode_inc_iversion(ent->dir);
ent->dir->i_ctime = ent->dir->i_mtime =
current_time(ent->dir);
ext4_mark_inode_dirty(handle, ent->dir);
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 63a6b6332682..7938d852733b 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -921,7 +921,7 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
if (!ei)
return NULL;
- ei->vfs_inode.i_version = 1;
+ inode_set_iversion(&ei->vfs_inode, 1);
spin_lock_init(&ei->i_raw_lock);
INIT_LIST_HEAD(&ei->i_prealloc_list);
spin_lock_init(&ei->i_prealloc_lock);
--
2.7.4
The spinlock is only used to serialize callers that want to increment
the counter. We can achieve the same thing with an atomic64_t and
get the i_lock out of this codepath.
Drop the I_VERS_BUMP flag, and instead, borrow the most significant bit
in the counter to use as the flag. With this change, we can stop taking
the i_lock in this codepath, and can use atomics instead to manage the
thing.
On the query side, if the flag is already set, then we just return the
counter value. Otherwise, we set the flag in our in-memory copy and use
cmpxchg to swap it into place if it hasn't changed. If it has, then we
use the value from the cmpxchg as the new "old" value and try again.
When we go to bump the thing, we fetch the value and check the flag bit.
If it's clear then we don't need to do anything if the update isn't
being forced.
If we do need to update, then we clear the flag in our in-memory copy
and bump the counter (handling any overflow into the flag bit by
resetting the counter to zero). We then do a cmpxchg to swap the updated
value into place if it hasn't changed. If it has changed, then we use
the value we got back from cmpxchg to try again.
Signed-off-by: Jeff Layton <[email protected]>
---
include/linux/fs.h | 82 ++++++++++++++++++++++++++++++++----------------------
1 file changed, 48 insertions(+), 34 deletions(-)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 917557faa8e8..401e38d76171 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -621,7 +621,7 @@ struct inode {
struct hlist_head i_dentry;
struct rcu_head i_rcu;
};
- u64 i_version;
+ atomic64_t i_version;
atomic_t i_count;
atomic_t i_dio_count;
atomic_t i_writecount;
@@ -1909,9 +1909,6 @@ static inline bool HAS_UNMAPPED_ID(struct inode *inode)
* wb stat updates to grab mapping->tree_lock. See
* inode_switch_wb_work_fn() for details.
*
- * I_VERS_BUMP inode->i_version counter must be bumped on the next
- * change. See the inode_*_iversion functions.
- *
* Q: What is the difference between I_WILL_FREE and I_FREEING?
*/
#define I_DIRTY_SYNC (1 << 0)
@@ -1932,7 +1929,6 @@ static inline bool HAS_UNMAPPED_ID(struct inode *inode)
#define __I_DIRTY_TIME_EXPIRED 12
#define I_DIRTY_TIME_EXPIRED (1 << __I_DIRTY_TIME_EXPIRED)
#define I_WB_SWITCH (1 << 13)
-#define I_VERS_BUMP (1 << 14)
#define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES)
#define I_DIRTY_ALL (I_DIRTY | I_DIRTY_TIME)
@@ -1965,6 +1961,14 @@ static inline void inode_dec_link_count(struct inode *inode)
mark_inode_dirty(inode);
}
+/*
+ * We borrow the top bit in the i_version to use as a flag to tell us whether
+ * it has been queried since we last bumped it. If it has, then we must bump
+ * it and set the flag. Note that this means that we have to handle wrapping
+ * manually.
+ */
+#define INODE_I_VERSION_QUERIED (1ULL<<63)
+
/**
* inode_set_iversion - set i_version to a particular value
* @inode: inode to set
@@ -1976,7 +1980,7 @@ static inline void inode_dec_link_count(struct inode *inode)
static inline void
inode_set_iversion(struct inode *inode, const u64 new)
{
- inode->i_version = new;
+ atomic64_set(&inode->i_version, new);
}
/**
@@ -1992,10 +1996,7 @@ inode_set_iversion(struct inode *inode, const u64 new)
static inline void
inode_set_iversion_read(struct inode *inode, const u64 new)
{
- spin_lock(&inode->i_lock);
- inode_set_iversion(inode, new);
- inode->i_state |= I_VERS_BUMP;
- spin_unlock(&inode->i_lock);
+ inode_set_iversion(inode, new | INODE_I_VERSION_QUERIED);
}
/**
@@ -2010,16 +2011,26 @@ inode_set_iversion_read(struct inode *inode, const u64 new)
static inline bool
inode_inc_iversion(struct inode *inode, bool force)
{
- bool ret = false;
+ u64 cur, old, new;
+
+ cur = (u64)atomic64_read(&inode->i_version);
+ for (;;) {
+ /* If flag is clear then we needn't do anything */
+ if (!force && !(cur & INODE_I_VERSION_QUERIED))
+ return false;
+
+ new = (cur & ~INODE_I_VERSION_QUERIED) + 1;
+
+ /* Did we overflow into flag bit? Reset to 0 if so. */
+ if (unlikely(new == INODE_I_VERSION_QUERIED))
+ new = 0;
- spin_lock(&inode->i_lock);
- if (force || (inode->i_state & I_VERS_BUMP)) {
- inode->i_version++;
- inode->i_state &= ~I_VERS_BUMP;
- ret = true;
+ old = atomic64_cmpxchg(&inode->i_version, cur, new);
+ if (likely(old == cur))
+ break;
+ cur = old;
}
- spin_unlock(&inode->i_lock);
- return ret;
+ return true;
}
/**
@@ -2027,8 +2038,9 @@ inode_inc_iversion(struct inode *inode, bool force)
* @inode: inode to be updated
*
* Increment the i_version field in the inode. This version is usable
- * when there is some other sort of lock in play that would prevent
- * concurrent increments (typically inode->i_rwsem for write).
+ * when there is some other sort of lock in play (e.g. i_rwsem for write)
+ * that would prevent concurrent incrementors, and is typically used on
+ * directories or other non-regular files.
*/
static inline void
inode_inc_iversion_locked(struct inode *inode)
@@ -2047,7 +2059,7 @@ inode_inc_iversion_locked(struct inode *inode)
static inline u64
inode_get_iversion_raw(const struct inode *inode)
{
- return inode->i_version;
+ return atomic64_read(&inode->i_version) & ~INODE_I_VERSION_QUERIED;
}
/**
@@ -2060,13 +2072,20 @@ inode_get_iversion_raw(const struct inode *inode)
static inline u64
inode_get_iversion(struct inode *inode)
{
- u64 ret;
+ u64 cur, old, new;
- spin_lock(&inode->i_lock);
- inode->i_state |= I_VERS_BUMP;
- ret = inode->i_version;
- spin_unlock(&inode->i_lock);
- return ret;
+ cur = atomic64_read(&inode->i_version);
+ for (;;) {
+ if (cur & INODE_I_VERSION_QUERIED)
+ return (cur & ~INODE_I_VERSION_QUERIED);
+
+ new = (cur | INODE_I_VERSION_QUERIED);
+ old = atomic64_cmpxchg(&inode->i_version, cur, new);
+ if (old == cur)
+ break;
+ cur = old;
+ }
+ return cur;
}
/**
@@ -2080,7 +2099,7 @@ inode_get_iversion(struct inode *inode)
static inline s64
inode_cmp_iversion(const struct inode *inode, const u64 old)
{
- return (s64)inode->i_version - (s64)old;
+ return (s64)(atomic64_read(&inode->i_version) << 1) - (s64)(old << 1);
}
/**
@@ -2093,12 +2112,7 @@ inode_cmp_iversion(const struct inode *inode, const u64 old)
static inline bool
inode_iversion_need_inc(struct inode *inode)
{
- bool ret;
We only really need to update i_version if someone has queried for it
since we last incremented it. By doing that, we can avoid having to
update the inode if the times haven't changed.
If the times have changed, then we go ahead and forcibly increment the
counter, under the assumption that we'll be going to the storage
anyway, and the increment itself is relatively cheap.
Signed-off-by: Jeff Layton <[email protected]>
---
fs/inode.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index 2484d91ae8ef..584516a32905 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1621,17 +1621,18 @@ static int relatime_need_update(const struct path *path, struct inode *inode,
int generic_update_time(struct inode *inode, struct timespec *time, int flags)
{
int iflags = I_DIRTY_TIME;
+ bool dirty = flags & ~S_VERSION;
if (flags & S_ATIME)
inode->i_atime = *time;
- if (flags & S_VERSION)
- inode_inc_iversion(inode, true);
if (flags & S_CTIME)
inode->i_ctime = *time;
if (flags & S_MTIME)
inode->i_mtime = *time;
+ if (flags & S_VERSION)
+ dirty |= inode_inc_iversion(inode, dirty);
- if (!(inode->i_sb->s_flags & MS_LAZYTIME) || (flags & S_VERSION))
+ if (dirty || !(inode->i_sb->s_flags & MS_LAZYTIME))
iflags |= I_DIRTY_SYNC;
__mark_inode_dirty(inode, iflags);
return 0;
@@ -1850,7 +1851,7 @@ int file_update_time(struct file *file)
if (!timespec_equal(&inode->i_ctime, &now))
sync_it |= S_CTIME;
- if (IS_I_VERSION(inode))
+ if (IS_I_VERSION(inode) && inode_iversion_need_inc(inode))
sync_it |= S_VERSION;
if (!sync_it)
--
2.7.4
We do go ahead and increment it though if XFS_ILOG_CORE is already
set when we get to this point.
Signed-off-by: Jeff Layton <[email protected]>
---
fs/xfs/xfs_trans_inode.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/fs/xfs/xfs_trans_inode.c b/fs/xfs/xfs_trans_inode.c
index 844c08886170..fac422da0bbe 100644
--- a/fs/xfs/xfs_trans_inode.c
+++ b/fs/xfs/xfs_trans_inode.c
@@ -110,15 +110,15 @@ xfs_trans_log_inode(
/*
* First time we log the inode in a transaction, bump the inode change
- * counter if it is configured for this to occur. We don't use
- * inode_inc_version() because there is no need for extra locking around
- * i_version as we already hold the inode locked exclusively for
- * metadata modification.
+ * counter if it is configured for this to occur. While we hold the
+ * inode locked exclusively for metadata modification, we still use
+ * inode_inc_iversion as it allows us to avoid setting XFS_ILOG_CORE
+ * if the version hasn't been queried since the last bump.
*/
if (!(ip->i_itemp->ili_item.li_desc->lid_flags & XFS_LID_DIRTY) &&
IS_I_VERSION(VFS_I(ip))) {
- inode_inc_iversion_locked(VFS_I(ip));
- flags |= XFS_ILOG_CORE;
+ if (inode_inc_iversion(VFS_I(ip), flags & XFS_ILOG_CORE))
+ flags |= XFS_ILOG_CORE;
}
tp->t_flags |= XFS_TRANS_DIRTY;
--
2.7.4
Signed-off-by: Jeff Layton <[email protected]>
---
fs/xfs/libxfs/xfs_inode_buf.c | 4 ++--
fs/xfs/xfs_icache.c | 4 ++--
fs/xfs/xfs_inode.c | 2 +-
fs/xfs/xfs_inode_item.c | 2 +-
fs/xfs/xfs_trans_inode.c | 2 +-
5 files changed, 7 insertions(+), 7 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
index dd483e2767f7..a45e9ae5a17e 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.c
+++ b/fs/xfs/libxfs/xfs_inode_buf.c
@@ -263,7 +263,7 @@ xfs_inode_from_disk(
to->di_flags = be16_to_cpu(from->di_flags);
if (to->di_version == 3) {
- inode->i_version = be64_to_cpu(from->di_changecount);
+ inode_set_iversion_read(inode, be64_to_cpu(from->di_changecount));
to->di_crtime.t_sec = be32_to_cpu(from->di_crtime.t_sec);
to->di_crtime.t_nsec = be32_to_cpu(from->di_crtime.t_nsec);
to->di_flags2 = be64_to_cpu(from->di_flags2);
@@ -313,7 +313,7 @@ xfs_inode_to_disk(
to->di_flags = cpu_to_be16(from->di_flags);
if (from->di_version == 3) {
- to->di_changecount = cpu_to_be64(inode->i_version);
+ to->di_changecount = cpu_to_be64(inode_get_iversion_raw(inode));
to->di_crtime.t_sec = cpu_to_be32(from->di_crtime.t_sec);
to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.t_nsec);
to->di_flags2 = cpu_to_be64(from->di_flags2);
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index ff4d6311c7f4..f5be7fef8f59 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -278,14 +278,14 @@ xfs_reinit_inode(
int error;
uint32_t nlink = inode->i_nlink;
uint32_t generation = inode->i_generation;
- uint64_t version = inode->i_version;
+ uint64_t version = inode_get_iversion_raw(inode);
umode_t mode = inode->i_mode;
error = inode_init_always(mp->m_super, inode);
set_nlink(inode, nlink);
inode->i_generation = generation;
- inode->i_version = version;
+ inode_set_iversion_read(inode, version);
inode->i_mode = mode;
return error;
}
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index b9557795eb74..dad9c6476d6f 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -851,7 +851,7 @@ xfs_ialloc(
ip->i_d.di_flags = 0;
if (ip->i_d.di_version == 3) {
- inode->i_version = 1;
+ inode_set_iversion(inode, 1);
ip->i_d.di_flags2 = 0;
ip->i_d.di_cowextsize = 0;
ip->i_d.di_crtime.t_sec = (__int32_t)tv.tv_sec;
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index d90e7811ccdd..838a045cf9e5 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -364,7 +364,7 @@ xfs_inode_to_log_dinode(
to->di_flags = from->di_flags;
if (from->di_version == 3) {
- to->di_changecount = inode->i_version;
+ to->di_changecount = inode_get_iversion_raw(inode);
to->di_crtime.t_sec = from->di_crtime.t_sec;
to->di_crtime.t_nsec = from->di_crtime.t_nsec;
to->di_flags2 = from->di_flags2;
diff --git a/fs/xfs/xfs_trans_inode.c b/fs/xfs/xfs_trans_inode.c
index dab8daa676f9..844c08886170 100644
--- a/fs/xfs/xfs_trans_inode.c
+++ b/fs/xfs/xfs_trans_inode.c
@@ -117,7 +117,7 @@ xfs_trans_log_inode(
*/
if (!(ip->i_itemp->ili_item.li_desc->lid_flags & XFS_LID_DIRTY) &&
IS_I_VERSION(VFS_I(ip))) {
- VFS_I(ip)->i_version++;
+ inode_inc_iversion_locked(VFS_I(ip));
flags |= XFS_ILOG_CORE;
}
--
2.7.4
In some coming changes, we will want to avoid changing the i_version
counter in some cases as that may allow us to avoid having to log a
metadata change to the inode. In other cases we will want to forcibly
change it even when it's not strictly required (i.e. if we're going to
log a time change to disk anyway).
Add a parameter to tell the incrementing routine to forcibly bump the
counter. For now, everyone sets this to "true" and the value is ignored,
but later patches will change this.
Signed-off-by: Jeff Layton <[email protected]>
---
fs/btrfs/file.c | 4 ++--
fs/btrfs/inode.c | 34 +++++++++++++++++-----------------
fs/btrfs/ioctl.c | 4 ++--
fs/btrfs/xattr.c | 2 +-
fs/ext2/super.c | 2 +-
fs/ext4/inode.c | 4 ++--
fs/ext4/namei.c | 2 +-
fs/inode.c | 2 +-
fs/ocfs2/namei.c | 2 +-
include/linux/fs.h | 2 +-
10 files changed, 29 insertions(+), 29 deletions(-)
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index b5c5da215d05..f301afaf9af5 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1804,7 +1804,7 @@ static void update_time_for_write(struct inode *inode)
inode->i_ctime = now;
if (IS_I_VERSION(inode))
- inode_inc_iversion(inode);
+ inode_inc_iversion(inode, true);
}
static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
@@ -2636,7 +2636,7 @@ static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
if (!trans)
goto out_free;
- inode_inc_iversion(inode);
+ inode_inc_iversion(inode, true);
inode->i_mtime = inode->i_ctime = current_time(inode);
trans->block_rsv = &fs_info->trans_block_rsv;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index a56a45a3e992..a03e5a1d5e05 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4086,8 +4086,8 @@ static int __btrfs_unlink_inode(struct btrfs_trans_handle *trans,
goto out;
btrfs_i_size_write(dir, dir->i_size - name_len * 2);
- inode_inc_iversion(inode);
- inode_inc_iversion(dir);
+ inode_inc_iversion(inode, true);
+ inode_inc_iversion(dir, true);
inode->i_ctime = dir->i_mtime =
dir->i_ctime = current_time(inode);
ret = btrfs_update_inode(trans, root, dir);
@@ -4232,7 +4232,7 @@ int btrfs_unlink_subvol(struct btrfs_trans_handle *trans,
}
btrfs_i_size_write(dir, dir->i_size - name_len * 2);
- inode_inc_iversion(dir);
+ inode_inc_iversion(dir, true);
dir->i_mtime = dir->i_ctime = current_time(dir);
ret = btrfs_update_inode_fallback(trans, root, dir);
if (ret)
@@ -5000,7 +5000,7 @@ static int btrfs_setsize(struct inode *inode, struct iattr *attr)
* explicitly if it wants a timestamp update.
*/
if (newsize != oldsize) {
- inode_inc_iversion(inode);
+ inode_inc_iversion(inode, true);
if (!(mask & (ATTR_CTIME | ATTR_MTIME)))
inode->i_ctime = inode->i_mtime =
current_time(inode);
@@ -5122,7 +5122,7 @@ static int btrfs_setattr(struct dentry *dentry, struct iattr *attr)
if (attr->ia_valid) {
setattr_copy(inode, attr);
- inode_inc_iversion(inode);
+ inode_inc_iversion(inode, true);
err = btrfs_dirty_inode(inode);
if (!err && attr->ia_valid & ATTR_MODE)
@@ -6030,7 +6030,7 @@ static int btrfs_update_time(struct inode *inode, struct timespec *now,
return -EROFS;
if (flags & S_VERSION)
- inode_inc_iversion(inode);
+ inode_inc_iversion(inode, true);
if (flags & S_CTIME)
inode->i_ctime = *now;
if (flags & S_MTIME)
@@ -6354,7 +6354,7 @@ int btrfs_add_link(struct btrfs_trans_handle *trans,
btrfs_i_size_write(parent_inode, parent_inode->i_size +
name_len * 2);
- inode_inc_iversion(parent_inode);
+ inode_inc_iversion(parent_inode, true);
parent_inode->i_mtime = parent_inode->i_ctime =
current_time(parent_inode);
ret = btrfs_update_inode(trans, root, parent_inode);
@@ -6576,7 +6576,7 @@ static int btrfs_link(struct dentry *old_dentry, struct inode *dir,
/* There are several dir indexes for this inode, clear the cache. */
BTRFS_I(inode)->dir_index = 0ULL;
inc_nlink(inode);
- inode_inc_iversion(inode);
+ inode_inc_iversion(inode, true);
inode->i_ctime = current_time(inode);
ihold(inode);
set_bit(BTRFS_INODE_COPY_EVERYTHING, &BTRFS_I(inode)->runtime_flags);
@@ -9576,10 +9576,10 @@ static int btrfs_rename_exchange(struct inode *old_dir,
}
/* Update inode version and ctime/mtime. */
- inode_inc_iversion(old_dir);
- inode_inc_iversion(new_dir);
- inode_inc_iversion(old_inode);
- inode_inc_iversion(new_inode);
+ inode_inc_iversion(old_dir, true);
+ inode_inc_iversion(new_dir, true);
+ inode_inc_iversion(old_inode, true);
+ inode_inc_iversion(new_inode, true);
old_dir->i_ctime = old_dir->i_mtime = ctime;
new_dir->i_ctime = new_dir->i_mtime = ctime;
old_inode->i_ctime = ctime;
@@ -9858,9 +9858,9 @@ static int btrfs_rename(struct inode *old_dir, struct dentry *old_dentry,
goto out_fail;
}
- inode_inc_iversion(old_dir);
- inode_inc_iversion(new_dir);
- inode_inc_iversion(old_inode);
+ inode_inc_iversion(old_dir, true);
+ inode_inc_iversion(new_dir, true);
+ inode_inc_iversion(old_inode, true);
old_dir->i_ctime = old_dir->i_mtime =
new_dir->i_ctime = new_dir->i_mtime =
old_inode->i_ctime = current_time(old_dir);
@@ -9887,7 +9887,7 @@ static int btrfs_rename(struct inode *old_dir, struct dentry *old_dentry,
}
if (new_inode) {
- inode_inc_iversion(new_inode);
+ inode_inc_iversion(new_inode, true);
new_inode->i_ctime = current_time(new_inode);
if (unlikely(btrfs_ino(new_inode) ==
BTRFS_EMPTY_SUBVOL_DIR_OBJECTID)) {
@@ -10408,7 +10408,7 @@ static int __btrfs_prealloc_file_range(struct inode *inode, int mode,
cur_offset += ins.offset;
*alloc_hint = ins.objectid + ins.offset;
- inode_inc_iversion(inode);
+ inode_inc_iversion(inode, true);
inode->i_ctime = current_time(inode);
BTRFS_I(inode)->flags |= BTRFS_INODE_PREALLOC;
if (!(mode & FALLOC_FL_KEEP_SIZE) &&
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 33f967d30b2a..c2d845b73e58 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -348,7 +348,7 @@ static int btrfs_ioctl_setflags(struct file *file, void __user *arg)
}
btrfs_update_iflags(inode);
- inode_inc_iversion(inode);
+ inode_inc_iversion(inode, true);
inode->i_ctime = current_time(inode);
ret = btrfs_update_inode(trans, root, inode);
@@ -3294,7 +3294,7 @@ static int clone_finish_inode_update(struct btrfs_trans_handle *trans,
struct btrfs_root *root = BTRFS_I(inode)->root;
int ret;
- inode_inc_iversion(inode);
+ inode_inc_iversion(inode, true);
if (!no_time_update)
inode->i_mtime = inode->i_ctime = current_time(inode);
/*
diff --git a/fs/btrfs/xattr.c b/fs/btrfs/xattr.c
index 9621c7f2503e..b9a5475754b8 100644
--- a/fs/btrfs/xattr.c
+++ b/fs/btrfs/xattr.c
@@ -253,7 +253,7 @@ int __btrfs_setxattr(struct btrfs_trans_handle *trans,
if (ret)
goto out;
- inode_inc_iversion(inode);
+ inode_inc_iversion(inode, true);
inode->i_ctime = current_time(inode);
set_bit(BTRFS_INODE_COPY_EVERYTHING, &BTRFS_I(inode)->runtime_flags);
ret = btrfs_update_inode(trans, root, inode);
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index 5c9ba3b3f77b..db8251e9fbd9 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -1542,7 +1542,7 @@ static ssize_t ext2_quota_write(struct super_block *sb, int type,
return err;
if (inode->i_size < off+len-towrite)
i_size_write(inode, off+len-towrite);
- inode_inc_iversion(inode);
+ inode_inc_iversion(inode, true);
inode->i_mtime = inode->i_ctime = current_time(inode);
mark_inode_dirty(inode);
return len - towrite;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 5603e1782c65..cd1c255944b3 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5253,7 +5253,7 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
return -EINVAL;
if (IS_I_VERSION(inode) && attr->ia_size != inode->i_size)
- inode_inc_iversion(inode);
+ inode_inc_iversion(inode, true);
if (ext4_should_order_data(inode) &&
(attr->ia_size < inode->i_size)) {
@@ -5488,7 +5488,7 @@ int ext4_mark_iloc_dirty(handle_t *handle,
int err = 0;
if (IS_I_VERSION(inode))
- inode_inc_iversion(inode);
+ inode_inc_iversion(inode, true);
/* the do_update_inode consumes one bh->b_count */
get_bh(iloc->bh);
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index a8f812162dd6..3c5110817a55 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -3379,7 +3379,7 @@ static int ext4_setent(handle_t *handle, struct ext4_renament *ent,
ent->de->inode = cpu_to_le32(ino);
if (ext4_has_feature_filetype(ent->dir->i_sb))
ent->de->file_type = file_type;
- inode_inc_iversion(ent->dir);
+ inode_inc_iversion(ent->dir, true);
ent->dir->i_ctime = ent->dir->i_mtime =
current_time(ent->dir);
ext4_mark_inode_dirty(handle, ent->dir);
diff --git a/fs/inode.c b/fs/inode.c
index 88110fd0b282..2484d91ae8ef 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1625,7 +1625,7 @@ int generic_update_time(struct inode *inode, struct timespec *time, int flags)
if (flags & S_ATIME)
inode->i_atime = *time;
if (flags & S_VERSION)
- inode_inc_iversion(inode);
+ inode_inc_iversion(inode, true);
if (flags & S_CTIME)
inode->i_ctime = *time;
if (flags & S_MTIME)
diff --git a/fs/ocfs2/namei.c b/fs/ocfs2/namei.c
index c045826b716a..387cfa36c7b7 100644
--- a/fs/ocfs2/namei.c
+++ b/fs/ocfs2/namei.c
@@ -1520,7 +1520,7 @@ static int ocfs2_rename(struct inode *old_dir,
mlog_errno(status);
goto bail;
}
- inode_inc_iversion(new_dir);
+ inode_inc_iversion(new_dir, true);
if (S_ISDIR(new_inode->i_mode))
ocfs2_set_links_count(newfe, 0);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 075c915fe2b1..75323e7b6954 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2013,7 +2013,7 @@ inode_set_iversion_read(struct inode *inode, const u64 new)
* The filesystem has to be mounted with MS_I_VERSION flag.
*/
static inline bool
-inode_inc_iversion(struct inode *inode)
+inode_inc_iversion(struct inode *inode, bool force)
{
spin_lock(&inode->i_lock);
inode_inc_iversion_locked(inode);
--
2.7.4
Signed-off-by: Jeff Layton <[email protected]>
---
fs/ufs/dir.c | 8 ++++----
fs/ufs/inode.c | 2 +-
fs/ufs/super.c | 2 +-
3 files changed, 6 insertions(+), 6 deletions(-)
diff --git a/fs/ufs/dir.c b/fs/ufs/dir.c
index de01b8f2aa78..c6e670732cc5 100644
--- a/fs/ufs/dir.c
+++ b/fs/ufs/dir.c
@@ -46,7 +46,7 @@ static int ufs_commit_chunk(struct page *page, loff_t pos, unsigned len)
struct inode *dir = mapping->host;
int err = 0;
- dir->i_version++;
+ inode_inc_iversion_locked(dir);
block_write_end(NULL, mapping, pos, len, len, page, NULL);
if (pos+len > dir->i_size) {
i_size_write(dir, pos+len);
@@ -427,7 +427,7 @@ ufs_readdir(struct file *file, struct dir_context *ctx)
unsigned long n = pos >> PAGE_SHIFT;
unsigned long npages = dir_pages(inode);
unsigned chunk_mask = ~(UFS_SB(sb)->s_uspi->s_dirblksize - 1);
- int need_revalidate = file->f_version != inode->i_version;
+ bool need_revalidate = inode_cmp_iversion(inode, file->f_version);
unsigned flags = UFS_SB(sb)->s_flags;
UFSD("BEGIN\n");
@@ -454,8 +454,8 @@ ufs_readdir(struct file *file, struct dir_context *ctx)
offset = ufs_validate_entry(sb, kaddr, offset, chunk_mask);
ctx->pos = (n<<PAGE_SHIFT) + offset;
}
- file->f_version = inode->i_version;
- need_revalidate = 0;
+ file->f_version = inode_get_iversion(inode);
+ need_revalidate = false;
}
de = (struct ufs_dir_entry *)(kaddr+offset);
limit = kaddr + ufs_last_byte(inode, n) - UFS_DIR_REC_LEN(1);
diff --git a/fs/ufs/inode.c b/fs/ufs/inode.c
index 1bc0bd6a9848..4cb39d555df6 100644
--- a/fs/ufs/inode.c
+++ b/fs/ufs/inode.c
@@ -680,7 +680,7 @@ struct inode *ufs_iget(struct super_block *sb, unsigned long ino)
if (err)
goto bad_inode;
- inode->i_version++;
+ inode_inc_iversion_locked(inode);
ufsi->i_lastfrag =
(inode->i_size + uspi->s_fsize - 1) >> uspi->s_fshift;
ufsi->i_dir_start_lookup = 0;
diff --git a/fs/ufs/super.c b/fs/ufs/super.c
index f04ab232d08d..ed7038aaa258 100644
--- a/fs/ufs/super.c
+++ b/fs/ufs/super.c
@@ -1398,7 +1398,7 @@ static struct inode *ufs_alloc_inode(struct super_block *sb)
if (!ei)
return NULL;
- ei->vfs_inode.i_version = 1;
+ inode_set_iversion(&ei->vfs_inode, 1);
seqlock_init(&ei->meta_lock);
mutex_init(&ei->truncate_mutex);
return &ei->vfs_inode;
--
2.7.4
Signed-off-by: Jeff Layton <[email protected]>
---
fs/ocfs2/dir.c | 14 +++++++-------
fs/ocfs2/inode.c | 2 +-
fs/ocfs2/namei.c | 2 +-
fs/ocfs2/quota_global.c | 2 +-
4 files changed, 10 insertions(+), 10 deletions(-)
diff --git a/fs/ocfs2/dir.c b/fs/ocfs2/dir.c
index 3ecb9f337b7d..b78105d404b3 100644
--- a/fs/ocfs2/dir.c
+++ b/fs/ocfs2/dir.c
@@ -1174,7 +1174,7 @@ static int __ocfs2_delete_entry(handle_t *handle, struct inode *dir,
le16_add_cpu(&pde->rec_len,
le16_to_cpu(de->rec_len));
de->inode = 0;
- dir->i_version++;
+ inode_inc_iversion_locked(dir);
ocfs2_journal_dirty(handle, bh);
goto bail;
}
@@ -1729,7 +1729,7 @@ int __ocfs2_add_entry(handle_t *handle,
if (ocfs2_dir_indexed(dir))
ocfs2_recalc_free_list(dir, handle, lookup);
- dir->i_version++;
+ inode_inc_iversion_locked(dir);
ocfs2_journal_dirty(handle, insert_bh);
retval = 0;
goto bail;
@@ -1775,7 +1775,7 @@ static int ocfs2_dir_foreach_blk_id(struct inode *inode,
* readdir(2), then we might be pointing to an invalid
* dirent right now. Scan from the start of the block
* to make sure. */
- if (*f_version != inode->i_version) {
+ if (inode_cmp_iversion(inode, *f_version)) {
for (i = 0; i < i_size_read(inode) && i < offset; ) {
de = (struct ocfs2_dir_entry *)
(data->id_data + i);
@@ -1791,7 +1791,7 @@ static int ocfs2_dir_foreach_blk_id(struct inode *inode,
i += le16_to_cpu(de->rec_len);
}
ctx->pos = offset = i;
- *f_version = inode->i_version;
+ *f_version = inode_get_iversion(inode);
}
de = (struct ocfs2_dir_entry *) (data->id_data + ctx->pos);
@@ -1869,7 +1869,7 @@ static int ocfs2_dir_foreach_blk_el(struct inode *inode,
* readdir(2), then we might be pointing to an invalid
* dirent right now. Scan from the start of the block
* to make sure. */
- if (*f_version != inode->i_version) {
+ if (inode_cmp_iversion(inode, *f_version)) {
for (i = 0; i < sb->s_blocksize && i < offset; ) {
de = (struct ocfs2_dir_entry *) (bh->b_data + i);
/* It's too expensive to do a full
@@ -1886,7 +1886,7 @@ static int ocfs2_dir_foreach_blk_el(struct inode *inode,
offset = i;
ctx->pos = (ctx->pos & ~(sb->s_blocksize - 1))
| offset;
- *f_version = inode->i_version;
+ *f_version = inode_get_iversion(inode);
}
while (ctx->pos < i_size_read(inode)
@@ -1940,7 +1940,7 @@ static int ocfs2_dir_foreach_blk(struct inode *inode, u64 *f_version,
*/
int ocfs2_dir_foreach(struct inode *inode, struct dir_context *ctx)
{
- u64 version = inode->i_version;
+ u64 version = inode_get_iversion(inode);
ocfs2_dir_foreach_blk(inode, &version, ctx, true);
return 0;
}
diff --git a/fs/ocfs2/inode.c b/fs/ocfs2/inode.c
index 382401d3e88f..31f1e80cd980 100644
--- a/fs/ocfs2/inode.c
+++ b/fs/ocfs2/inode.c
@@ -302,7 +302,7 @@ void ocfs2_populate_inode(struct inode *inode, struct ocfs2_dinode *fe,
OCFS2_I(inode)->ip_attr = le32_to_cpu(fe->i_attr);
OCFS2_I(inode)->ip_dyn_features = le16_to_cpu(fe->i_dyn_features);
- inode->i_version = 1;
+ inode_set_iversion(inode, 1);
inode->i_generation = le32_to_cpu(fe->i_generation);
inode->i_rdev = huge_decode_dev(le64_to_cpu(fe->id1.dev1.i_rdev));
inode->i_mode = le16_to_cpu(fe->i_mode);
diff --git a/fs/ocfs2/namei.c b/fs/ocfs2/namei.c
index 3b0a10d9b36f..c045826b716a 100644
--- a/fs/ocfs2/namei.c
+++ b/fs/ocfs2/namei.c
@@ -1520,7 +1520,7 @@ static int ocfs2_rename(struct inode *old_dir,
mlog_errno(status);
goto bail;
}
- new_dir->i_version++;
+ inode_inc_iversion(new_dir);
if (S_ISDIR(new_inode->i_mode))
ocfs2_set_links_count(newfe, 0);
diff --git a/fs/ocfs2/quota_global.c b/fs/ocfs2/quota_global.c
index cec495a921e3..a9fad85597e1 100644
--- a/fs/ocfs2/quota_global.c
+++ b/fs/ocfs2/quota_global.c
@@ -288,7 +288,7 @@ ssize_t ocfs2_quota_write(struct super_block *sb, int type,
mlog_errno(err);
return err;
}
- gqinode->i_version++;
+ inode_get_iversion(gqinode);
ocfs2_mark_inode_dirty(handle, gqinode, oinfo->dqi_gqi_bh);
return len;
}
--
2.7.4
For NFS, we just use the "raw" API since the i_version is mostly
managed by the server. The exception there is when the client
holds a write delegation, but we only need to bump it once
there anyway to handle CB_GETATTR.
Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfs/delegation.c | 2 +-
fs/nfs/fscache-index.c | 4 ++--
fs/nfs/inode.c | 16 ++++++++--------
fs/nfs/nfs4proc.c | 4 ++--
fs/nfs/nfstrace.h | 4 ++--
fs/nfs/write.c | 2 +-
6 files changed, 16 insertions(+), 16 deletions(-)
diff --git a/fs/nfs/delegation.c b/fs/nfs/delegation.c
index d7df5e67b0c1..36d8a000e664 100644
--- a/fs/nfs/delegation.c
+++ b/fs/nfs/delegation.c
@@ -347,7 +347,7 @@ int nfs_inode_set_delegation(struct inode *inode, struct rpc_cred *cred, struct
nfs4_stateid_copy(&delegation->stateid, &res->delegation);
delegation->type = res->delegation_type;
delegation->pagemod_limit = res->pagemod_limit;
- delegation->change_attr = inode->i_version;
+ delegation->change_attr = inode_get_iversion_raw(inode);
delegation->cred = get_rpccred(cred);
delegation->inode = inode;
delegation->flags = 1<<NFS_DELEGATION_REFERENCED;
diff --git a/fs/nfs/fscache-index.c b/fs/nfs/fscache-index.c
index 777b055063f6..40f2ea03cf4f 100644
--- a/fs/nfs/fscache-index.c
+++ b/fs/nfs/fscache-index.c
@@ -211,7 +211,7 @@ static uint16_t nfs_fscache_inode_get_aux(const void *cookie_netfs_data,
auxdata.ctime = nfsi->vfs_inode.i_ctime;
if (NFS_SERVER(&nfsi->vfs_inode)->nfs_client->rpc_ops->version == 4)
- auxdata.change_attr = nfsi->vfs_inode.i_version;
+ auxdata.change_attr = inode_get_iversion_raw(&nfsi->vfs_inode);
if (bufmax > sizeof(auxdata))
bufmax = sizeof(auxdata);
@@ -243,7 +243,7 @@ enum fscache_checkaux nfs_fscache_inode_check_aux(void *cookie_netfs_data,
auxdata.ctime = nfsi->vfs_inode.i_ctime;
if (NFS_SERVER(&nfsi->vfs_inode)->nfs_client->rpc_ops->version == 4)
- auxdata.change_attr = nfsi->vfs_inode.i_version;
+ auxdata.change_attr = inode_get_iversion_raw(&nfsi->vfs_inode);
if (memcmp(data, &auxdata, datalen) != 0)
return FSCACHE_CHECKAUX_OBSOLETE;
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index 5864146e05e6..d38ec9c92c1d 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -424,7 +424,7 @@ nfs_fhget(struct super_block *sb, struct nfs_fh *fh, struct nfs_fattr *fattr, st
memset(&inode->i_atime, 0, sizeof(inode->i_atime));
memset(&inode->i_mtime, 0, sizeof(inode->i_mtime));
memset(&inode->i_ctime, 0, sizeof(inode->i_ctime));
- inode->i_version = 0;
+ inode_set_iversion(inode, 0);
inode->i_size = 0;
clear_nlink(inode);
inode->i_uid = make_kuid(&init_user_ns, -2);
@@ -449,7 +449,7 @@ nfs_fhget(struct super_block *sb, struct nfs_fh *fh, struct nfs_fattr *fattr, st
else if (nfs_server_capable(inode, NFS_CAP_CTIME))
nfs_set_cache_invalid(inode, NFS_INO_INVALID_ATTR);
if (fattr->valid & NFS_ATTR_FATTR_CHANGE)
- inode->i_version = fattr->change_attr;
+ inode_set_iversion(inode, fattr->change_attr);
else
nfs_set_cache_invalid(inode, NFS_INO_INVALID_ATTR
| NFS_INO_REVAL_PAGECACHE);
@@ -1244,8 +1244,8 @@ static unsigned long nfs_wcc_update_inode(struct inode *inode, struct nfs_fattr
if ((fattr->valid & NFS_ATTR_FATTR_PRECHANGE)
&& (fattr->valid & NFS_ATTR_FATTR_CHANGE)
- && inode->i_version == fattr->pre_change_attr) {
- inode->i_version = fattr->change_attr;
+ && !inode_cmp_iversion(inode, fattr->pre_change_attr)) {
+ inode_set_iversion(inode, fattr->change_attr);
if (S_ISDIR(inode->i_mode))
nfs_set_cache_invalid(inode, NFS_INO_INVALID_DATA);
ret |= NFS_INO_INVALID_ATTR;
@@ -1303,7 +1303,7 @@ static int nfs_check_inode_attributes(struct inode *inode, struct nfs_fattr *fat
if (!nfs_file_has_buffered_writers(nfsi)) {
/* Verify a few of the more important attributes */
- if ((fattr->valid & NFS_ATTR_FATTR_CHANGE) != 0 && inode->i_version != fattr->change_attr)
+ if ((fattr->valid & NFS_ATTR_FATTR_CHANGE) != 0 && inode_cmp_iversion(inode, fattr->change_attr))
invalid |= NFS_INO_INVALID_ATTR | NFS_INO_REVAL_PAGECACHE;
if ((fattr->valid & NFS_ATTR_FATTR_MTIME) && !timespec_equal(&inode->i_mtime, &fattr->mtime))
@@ -1604,7 +1604,7 @@ int nfs_post_op_update_inode_force_wcc_locked(struct inode *inode, struct nfs_fa
}
if ((fattr->valid & NFS_ATTR_FATTR_CHANGE) != 0 &&
(fattr->valid & NFS_ATTR_FATTR_PRECHANGE) == 0) {
- fattr->pre_change_attr = inode->i_version;
+ fattr->pre_change_attr = inode_get_iversion_raw(inode);
fattr->valid |= NFS_ATTR_FATTR_PRECHANGE;
}
if ((fattr->valid & NFS_ATTR_FATTR_CTIME) != 0 &&
@@ -1740,7 +1740,7 @@ static int nfs_update_inode(struct inode *inode, struct nfs_fattr *fattr)
/* More cache consistency checks */
if (fattr->valid & NFS_ATTR_FATTR_CHANGE) {
- if (inode->i_version != fattr->change_attr) {
+ if (inode_cmp_iversion(inode, fattr->change_attr)) {
dprintk("NFS: change_attr change on server for file %s/%ld\n",
inode->i_sb->s_id, inode->i_ino);
/* Could it be a race with writeback? */
@@ -1752,7 +1752,7 @@ static int nfs_update_inode(struct inode *inode, struct nfs_fattr *fattr)
if (S_ISDIR(inode->i_mode))
nfs_force_lookup_revalidate(inode);
}
- inode->i_version = fattr->change_attr;
+ inode_set_iversion(inode, fattr->change_attr);
}
} else {
nfsi->cache_validity |= save_cache_validity;
diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index d33242c8d95d..3751f735310d 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -1089,9 +1089,9 @@ static void update_changeattr(struct inode *dir, struct nfs4_change_info *cinfo)
spin_lock(&dir->i_lock);
nfsi->cache_validity |= NFS_INO_INVALID_ATTR|NFS_INO_INVALID_DATA;
- if (!cinfo->atomic || cinfo->before != dir->i_version)
+ if (!cinfo->atomic || inode_cmp_iversion(dir, cinfo->before))
nfs_force_lookup_revalidate(dir);
- dir->i_version = cinfo->after;
+ inode_set_iversion(dir, cinfo->after);
nfsi->attr_gencount = nfs_inc_attr_generation_counter();
nfs_fscache_invalidate(dir);
spin_unlock(&dir->i_lock);
diff --git a/fs/nfs/nfstrace.h b/fs/nfs/nfstrace.h
index 2ca9167bc97d..16acb94bbf20 100644
--- a/fs/nfs/nfstrace.h
+++ b/fs/nfs/nfstrace.h
@@ -60,7 +60,7 @@ DECLARE_EVENT_CLASS(nfs_inode_event,
__entry->dev = inode->i_sb->s_dev;
__entry->fileid = nfsi->fileid;
__entry->fhandle = nfs_fhandle_hash(&nfsi->fh);
- __entry->version = inode->i_version;
+ __entry->version = inode_get_iversion_raw(inode);
),
TP_printk(
@@ -99,7 +99,7 @@ DECLARE_EVENT_CLASS(nfs_inode_event_done,
__entry->fileid = nfsi->fileid;
__entry->fhandle = nfs_fhandle_hash(&nfsi->fh);
__entry->type = nfs_umode_to_dtype(inode->i_mode);
- __entry->version = inode->i_version;
+ __entry->version = inode_get_iversion_raw(inode);
__entry->size = i_size_read(inode);
__entry->nfsi_flags = nfsi->flags;
__entry->cache_validity = nfsi->cache_validity;
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 6e761f3f4cbf..da05b57e394d 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -692,7 +692,7 @@ static void nfs_inode_add_request(struct inode *inode, struct nfs_page *req)
spin_lock(&inode->i_lock);
if (!nfsi->nrequests &&
NFS_PROTO(inode)->have_delegation(inode, FMODE_WRITE))
- inode->i_version++;
+ inode_inc_iversion_locked(inode);
/*
* Swap-space should not get truncated. Hence no need to plug the race
* with invalidate/truncate.
--
2.7.4
We already have inode_inc_iversion. Add inode_set_iversion,
inode_get_iversion, inode_cmp_iversion and inode_iversion_need_inc.
These should encompass how i_version is currently accessed and
manipulated so that we can later change the implementation.
Signed-off-by: Jeff Layton <[email protected]>
---
include/linux/fs.h | 109 ++++++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 104 insertions(+), 5 deletions(-)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2ba074328894..075c915fe2b1 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1962,18 +1962,117 @@ static inline void inode_dec_link_count(struct inode *inode)
}
/**
+ * inode_set_iversion - set i_version to a particular value
+ * @inode: inode to set
+ * @new: new i_version value to set
+ *
+ * Set the inode's i_version. Should generally be called when initializing
+ * a new inode.
+ */
+static inline void
+inode_set_iversion(struct inode *inode, const u64 new)
+{
+ inode->i_version = new;
+}
+
+/**
+ * inode_inc_iversion_locked - increment i_version while protected
+ * @inode: inode to be updated
+ *
+ * Increment the i_version field in the inode. This version is usable
+ * when there is some other sort of lock in play that would prevent
+ * concurrent accessors.
+ */
+static inline void
+inode_inc_iversion_locked(struct inode *inode)
+{
+ inode->i_version++;
+}
+
+/**
+ * inode_set_iversion_read - set i_version to a particular value and flag
+ * set flag to indicate that it has been viewed
+ * @inode: inode to set
+ * @new: new i_version value to set
+ *
+ * Set the inode's i_version, and flag the inode as if it has been viewed.
+ * Should generally be called when initializinga new inode in memory from
+ * from disk.
+ */
+static inline void
+inode_set_iversion_read(struct inode *inode, const u64 new)
+{
+ inode_set_iversion(inode, new);
+}
+
+/**
* inode_inc_iversion - increments i_version
* @inode: inode that need to be updated
*
* Every time the inode is modified, the i_version field will be incremented.
- * The filesystem has to be mounted with i_version flag
+ * The filesystem has to be mounted with MS_I_VERSION flag.
+ */
+static inline bool
+inode_inc_iversion(struct inode *inode)
+{
+ spin_lock(&inode->i_lock);
+ inode_inc_iversion_locked(inode);
+ spin_unlock(&inode->i_lock);
+ return true;
+}
+
+/**
+ * inode_get_iversion_raw - read i_version without "perturbing" it
+ * @inode: inode from which i_version should be read
+ *
+ * Read the inode i_version counter for an inode without registering it as a
+ * read. Should typically be used by local filesystems that need to store an
+ * i_version on disk.
+ */
+static inline u64
+inode_get_iversion_raw(const struct inode *inode)
+{
+ return inode->i_version;
+}
+
+/**
+ * inode_get_iversion - read i_version for later use
+ * @inode: inode from which i_version should be read
+ *
+ * Read the inode i_version counter. This should be used by callers that wish
+ * to store the returned i_version for later comparison.
+ */
+static inline u64
+inode_get_iversion(const struct inode *inode)
+{
+ return inode_get_iversion_raw(inode);
+}
+
+/**
+ * inode_cmp_iversion - check whether the i_version counter has changed
+ * @inode: inode to check
+ * @old: old value to check against its i_version
+ *
+ * Compare an i_version counter with a previous one. Returns 0 if they are
+ * the same or non-zero if they are different.
*/
+static inline s64
+inode_cmp_iversion(const struct inode *inode, const u64 old)
+{
+ return (s64)inode->i_version - (s64)old;
+}
-static inline void inode_inc_iversion(struct inode *inode)
+/**
+ * inode_iversion_need_inc - is the i_version in need of being incremented?
+ * @inode: inode to check
+ *
+ * Returns whether the inode->i_version counter needs incrementing on the next
+ * change.
+ */
+static inline bool
+inode_iversion_need_inc(struct inode *inode)
{
- spin_lock(&inode->i_lock);
- inode->i_version++;
- spin_unlock(&inode->i_lock);
+ return true;
}
enum file_time_flags {
--
2.7.4
Signed-off-by: Jeff Layton <[email protected]>
---
fs/ext2/dir.c | 8 ++++----
fs/ext2/super.c | 4 ++--
2 files changed, 6 insertions(+), 6 deletions(-)
diff --git a/fs/ext2/dir.c b/fs/ext2/dir.c
index d9650c9508e4..aae282798d18 100644
--- a/fs/ext2/dir.c
+++ b/fs/ext2/dir.c
@@ -91,7 +91,7 @@ static int ext2_commit_chunk(struct page *page, loff_t pos, unsigned len)
struct inode *dir = mapping->host;
int err = 0;
- dir->i_version++;
+ inode_inc_iversion_locked(dir);
block_write_end(NULL, mapping, pos, len, len, page, NULL);
if (pos+len > dir->i_size) {
@@ -292,7 +292,7 @@ ext2_readdir(struct file *file, struct dir_context *ctx)
unsigned long npages = dir_pages(inode);
unsigned chunk_mask = ~(ext2_chunk_size(inode)-1);
unsigned char *types = NULL;
- int need_revalidate = file->f_version != inode->i_version;
+ bool need_revalidate = inode_cmp_iversion(inode, file->f_version);
if (pos > inode->i_size - EXT2_DIR_REC_LEN(1))
return 0;
@@ -318,8 +318,8 @@ ext2_readdir(struct file *file, struct dir_context *ctx)
offset = ext2_validate_entry(kaddr, offset, chunk_mask);
ctx->pos = (n<<PAGE_SHIFT) + offset;
}
- file->f_version = inode->i_version;
- need_revalidate = 0;
+ file->f_version = inode_get_iversion(inode);
+ need_revalidate = false;
}
de = (ext2_dirent *)(kaddr+offset);
limit = kaddr + ext2_last_byte(inode, n) - EXT2_DIR_REC_LEN(1);
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index 6cb042b53b5b..5c9ba3b3f77b 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -167,7 +167,7 @@ static struct inode *ext2_alloc_inode(struct super_block *sb)
if (!ei)
return NULL;
ei->i_block_alloc_info = NULL;
- ei->vfs_inode.i_version = 1;
+ inode_set_iversion(&ei->vfs_inode, 1);
#ifdef CONFIG_QUOTA
memset(&ei->i_dquot, 0, sizeof(ei->i_dquot));
#endif
@@ -1542,7 +1542,7 @@ static ssize_t ext2_quota_write(struct super_block *sb, int type,
return err;
if (inode->i_size < off+len-towrite)
i_size_write(inode, off+len-towrite);
- inode->i_version++;
+ inode_inc_iversion(inode);
inode->i_mtime = inode->i_ctime = current_time(inode);
mark_inode_dirty(inode);
return len - towrite;
--
2.7.4
Signed-off-by: Jeff Layton <[email protected]>
---
fs/exofs/dir.c | 8 ++++----
fs/exofs/super.c | 2 +-
2 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/fs/exofs/dir.c b/fs/exofs/dir.c
index 42f9a0a0c4ca..753828ef2db1 100644
--- a/fs/exofs/dir.c
+++ b/fs/exofs/dir.c
@@ -60,7 +60,7 @@ static int exofs_commit_chunk(struct page *page, loff_t pos, unsigned len)
struct inode *dir = mapping->host;
int err = 0;
- dir->i_version++;
+ inode_inc_iversion_locked(dir);
if (!PageUptodate(page))
SetPageUptodate(page);
@@ -241,7 +241,7 @@ exofs_readdir(struct file *file, struct dir_context *ctx)
unsigned long n = pos >> PAGE_SHIFT;
unsigned long npages = dir_pages(inode);
unsigned chunk_mask = ~(exofs_chunk_size(inode)-1);
- int need_revalidate = (file->f_version != inode->i_version);
+ bool need_revalidate = inode_cmp_iversion(inode, file->f_version);
if (pos > inode->i_size - EXOFS_DIR_REC_LEN(1))
return 0;
@@ -264,8 +264,8 @@ exofs_readdir(struct file *file, struct dir_context *ctx)
chunk_mask);
ctx->pos = (n<<PAGE_SHIFT) + offset;
}
- file->f_version = inode->i_version;
- need_revalidate = 0;
+ file->f_version = inode_get_iversion(inode);
+ need_revalidate = false;
}
de = (struct exofs_dir_entry *)(kaddr + offset);
limit = kaddr + exofs_last_byte(inode, n) -
diff --git a/fs/exofs/super.c b/fs/exofs/super.c
index 1076a4233b39..74400d1dcea5 100644
--- a/fs/exofs/super.c
+++ b/fs/exofs/super.c
@@ -159,7 +159,7 @@ static struct inode *exofs_alloc_inode(struct super_block *sb)
if (!oi)
return NULL;
- oi->vfs_inode.i_version = 1;
+ inode_set_iversion(&oi->vfs_inode, 1);
return &oi->vfs_inode;
}
--
2.7.4
Signed-off-by: Jeff Layton <[email protected]>
---
fs/btrfs/delayed-inode.c | 4 ++--
fs/btrfs/inode.c | 4 ++--
fs/btrfs/tree-log.c | 2 +-
3 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index 80982a83c9fd..5e1015705c8b 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -1744,7 +1744,7 @@ static void fill_stack_inode_item(struct btrfs_trans_handle *trans,
btrfs_set_stack_inode_nbytes(inode_item, inode_get_bytes(inode));
btrfs_set_stack_inode_generation(inode_item,
BTRFS_I(inode)->generation);
- btrfs_set_stack_inode_sequence(inode_item, inode->i_version);
+ btrfs_set_stack_inode_sequence(inode_item, inode_get_iversion(inode));
btrfs_set_stack_inode_transid(inode_item, trans->transid);
btrfs_set_stack_inode_rdev(inode_item, inode->i_rdev);
btrfs_set_stack_inode_flags(inode_item, BTRFS_I(inode)->flags);
@@ -1798,7 +1798,7 @@ int btrfs_fill_inode(struct inode *inode, u32 *rdev)
BTRFS_I(inode)->generation = btrfs_stack_inode_generation(inode_item);
BTRFS_I(inode)->last_trans = btrfs_stack_inode_transid(inode_item);
- inode->i_version = btrfs_stack_inode_sequence(inode_item);
+ inode_set_iversion_read(inode, btrfs_stack_inode_sequence(inode_item));
inode->i_rdev = 0;
*rdev = btrfs_stack_inode_rdev(inode_item);
BTRFS_I(inode)->flags = btrfs_stack_inode_flags(inode_item);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index f2b281ad7af6..a56a45a3e992 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3732,7 +3732,7 @@ static int btrfs_read_locked_inode(struct inode *inode)
BTRFS_I(inode)->generation = btrfs_inode_generation(leaf, inode_item);
BTRFS_I(inode)->last_trans = btrfs_inode_transid(leaf, inode_item);
- inode->i_version = btrfs_inode_sequence(leaf, inode_item);
+ inode_set_iversion_read(inode, btrfs_inode_sequence(leaf, inode_item));
inode->i_generation = BTRFS_I(inode)->generation;
inode->i_rdev = 0;
rdev = btrfs_inode_rdev(leaf, inode_item);
@@ -3903,7 +3903,7 @@ static void fill_inode_item(struct btrfs_trans_handle *trans,
&token);
btrfs_set_token_inode_generation(leaf, item, BTRFS_I(inode)->generation,
&token);
- btrfs_set_token_inode_sequence(leaf, item, inode->i_version, &token);
+ btrfs_set_token_inode_sequence(leaf, item, inode_get_iversion(inode), &token);
btrfs_set_token_inode_transid(leaf, item, trans->transid, &token);
btrfs_set_token_inode_rdev(leaf, item, inode->i_rdev, &token);
btrfs_set_token_inode_flags(leaf, item, BTRFS_I(inode)->flags, &token);
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index f10bf5213ed8..4acc4d27383a 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -3585,7 +3585,7 @@ static void fill_inode_item(struct btrfs_trans_handle *trans,
btrfs_set_token_inode_nbytes(leaf, item, inode_get_bytes(inode),
&token);
- btrfs_set_token_inode_sequence(leaf, item, inode->i_version, &token);
+ btrfs_set_token_inode_sequence(leaf, item, inode_get_iversion(inode), &token);
btrfs_set_token_inode_transid(leaf, item, trans->transid, &token);
btrfs_set_token_inode_rdev(leaf, item, inode->i_rdev, &token);
btrfs_set_token_inode_flags(leaf, item, BTRFS_I(inode)->flags, &token);
--
2.7.4
For AFS, it's generally treated as an opaque value.
Note that AFS has quite a different definition for this counter. AFS
only increments it on changes to the data, not for the metadata. We'll
need to reconcile that somehow if we ever want to present this to
userspace via statx.
Signed-off-by: Jeff Layton <[email protected]>
---
fs/afs/fsclient.c | 2 +-
fs/afs/inode.c | 4 ++--
2 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/fs/afs/fsclient.c b/fs/afs/fsclient.c
index 31c616ab9b40..7958001a2bf7 100644
--- a/fs/afs/fsclient.c
+++ b/fs/afs/fsclient.c
@@ -108,7 +108,7 @@ static void xdr_decode_AFSFetchStatus(const __be32 **_bp,
vnode->vfs_inode.i_ctime.tv_sec = status->mtime_server;
vnode->vfs_inode.i_mtime = vnode->vfs_inode.i_ctime;
vnode->vfs_inode.i_atime = vnode->vfs_inode.i_ctime;
- vnode->vfs_inode.i_version = data_version;
+ inode_set_iversion(&vnode->vfs_inode, data_version);
}
expected_version = status->data_version;
diff --git a/fs/afs/inode.c b/fs/afs/inode.c
index 86cc7264c21c..ec3a0274dc95 100644
--- a/fs/afs/inode.c
+++ b/fs/afs/inode.c
@@ -77,7 +77,7 @@ static int afs_inode_map_status(struct afs_vnode *vnode, struct key *key)
inode->i_atime = inode->i_mtime = inode->i_ctime;
inode->i_blocks = 0;
inode->i_generation = vnode->fid.unique;
- inode->i_version = vnode->status.data_version;
+ inode_set_iversion(inode, vnode->status.data_version);
inode->i_mapping->a_ops = &afs_fs_aops;
/* check to see whether a symbolic link is really a mountpoint */
@@ -182,7 +182,7 @@ struct inode *afs_iget_autocell(struct inode *dir, const char *dev_name,
inode->i_ctime.tv_nsec = 0;
inode->i_atime = inode->i_mtime = inode->i_ctime;
inode->i_blocks = 0;
- inode->i_version = 0;
+ inode_set_iversion(inode, 0);
inode->i_generation = 0;
set_bit(AFS_VNODE_PSEUDODIR, &vnode->flags);
--
2.7.4
Signed-off-by: Jeff Layton <[email protected]>
---
fs/fat/dir.c | 2 +-
fs/fat/inode.c | 8 ++++----
fs/fat/namei_msdos.c | 6 +++---
fs/fat/namei_vfat.c | 20 ++++++++++----------
4 files changed, 18 insertions(+), 18 deletions(-)
diff --git a/fs/fat/dir.c b/fs/fat/dir.c
index 81cecbe6d7cf..76a4cace7543 100644
--- a/fs/fat/dir.c
+++ b/fs/fat/dir.c
@@ -1056,7 +1056,7 @@ int fat_remove_entries(struct inode *dir, struct fat_slot_info *sinfo)
brelse(bh);
if (err)
return err;
- dir->i_version++;
+ inode_inc_iversion_locked(dir);
if (nr_slots) {
/*
diff --git a/fs/fat/inode.c b/fs/fat/inode.c
index 338d2f73eb29..9a8a22111031 100644
--- a/fs/fat/inode.c
+++ b/fs/fat/inode.c
@@ -507,7 +507,7 @@ int fat_fill_inode(struct inode *inode, struct msdos_dir_entry *de)
MSDOS_I(inode)->i_pos = 0;
inode->i_uid = sbi->options.fs_uid;
inode->i_gid = sbi->options.fs_gid;
- inode->i_version++;
+ inode_inc_iversion_locked(inode);
inode->i_generation = get_seconds();
if ((de->attr & ATTR_DIR) && !IS_FREE(de->name)) {
@@ -590,7 +590,7 @@ struct inode *fat_build_inode(struct super_block *sb,
goto out;
}
inode->i_ino = iunique(sb, MSDOS_ROOT_INO);
- inode->i_version = 1;
+ inode_set_iversion(inode, 1);
err = fat_fill_inode(inode, de);
if (err) {
iput(inode);
@@ -1367,7 +1367,7 @@ static int fat_read_root(struct inode *inode)
MSDOS_I(inode)->i_pos = MSDOS_ROOT_INO;
inode->i_uid = sbi->options.fs_uid;
inode->i_gid = sbi->options.fs_gid;
- inode->i_version++;
+ inode_inc_iversion_locked(inode);
inode->i_generation = 0;
inode->i_mode = fat_make_mode(sbi, ATTR_DIR, S_IRWXUGO);
inode->i_op = sbi->dir_ops;
@@ -1817,7 +1817,7 @@ int fat_fill_super(struct super_block *sb, void *data, int silent, int isvfat,
if (!root_inode)
goto out_fail;
root_inode->i_ino = MSDOS_ROOT_INO;
- root_inode->i_version = 1;
+ inode_set_iversion(root_inode, 1);
error = fat_read_root(root_inode);
if (error < 0) {
iput(root_inode);
diff --git a/fs/fat/namei_msdos.c b/fs/fat/namei_msdos.c
index 7d6a105d601b..459fe1d9b578 100644
--- a/fs/fat/namei_msdos.c
+++ b/fs/fat/namei_msdos.c
@@ -480,7 +480,7 @@ static int do_msdos_rename(struct inode *old_dir, unsigned char *old_name,
} else
mark_inode_dirty(old_inode);
- old_dir->i_version++;
+ inode_inc_iversion_locked(old_dir);
old_dir->i_ctime = old_dir->i_mtime = current_time(old_dir);
if (IS_DIRSYNC(old_dir))
(void)fat_sync_inode(old_dir);
@@ -508,7 +508,7 @@ static int do_msdos_rename(struct inode *old_dir, unsigned char *old_name,
goto out;
new_i_pos = sinfo.i_pos;
}
- new_dir->i_version++;
+ inode_inc_iversion_locked(new_dir);
fat_detach(old_inode);
fat_attach(old_inode, new_i_pos);
@@ -540,7 +540,7 @@ static int do_msdos_rename(struct inode *old_dir, unsigned char *old_name,
old_sinfo.bh = NULL;
if (err)
goto error_dotdot;
- old_dir->i_version++;
+ inode_inc_iversion_locked(old_dir);
old_dir->i_ctime = old_dir->i_mtime = ts;
if (IS_DIRSYNC(old_dir))
(void)fat_sync_inode(old_dir);
diff --git a/fs/fat/namei_vfat.c b/fs/fat/namei_vfat.c
index 6a7152d0c250..d7db82751a98 100644
--- a/fs/fat/namei_vfat.c
+++ b/fs/fat/namei_vfat.c
@@ -44,7 +44,7 @@ static int vfat_revalidate_shortname(struct dentry *dentry)
{
int ret = 1;
spin_lock(&dentry->d_lock);
- if (vfat_d_version(dentry) != d_inode(dentry->d_parent)->i_version)
+ if (inode_cmp_iversion(d_inode(dentry->d_parent), vfat_d_version(dentry)))
ret = 0;
spin_unlock(&dentry->d_lock);
return ret;
@@ -770,7 +770,7 @@ static struct dentry *vfat_lookup(struct inode *dir, struct dentry *dentry,
out:
mutex_unlock(&MSDOS_SB(sb)->s_lock);
if (!inode)
- vfat_d_version_set(dentry, dir->i_version);
+ vfat_d_version_set(dentry, inode_get_iversion(dir));
return d_splice_alias(inode, dentry);
error:
mutex_unlock(&MSDOS_SB(sb)->s_lock);
@@ -792,7 +792,7 @@ static int vfat_create(struct inode *dir, struct dentry *dentry, umode_t mode,
err = vfat_add_entry(dir, &dentry->d_name, 0, 0, &ts, &sinfo);
if (err)
goto out;
- dir->i_version++;
+ inode_inc_iversion_locked(dir);
inode = fat_build_inode(sb, sinfo.de, sinfo.i_pos);
brelse(sinfo.bh);
@@ -800,7 +800,7 @@ static int vfat_create(struct inode *dir, struct dentry *dentry, umode_t mode,
err = PTR_ERR(inode);
goto out;
}
- inode->i_version++;
+ inode_inc_iversion_locked(inode);
inode->i_mtime = inode->i_atime = inode->i_ctime = ts;
/* timestamp is already written, so mark_inode_dirty() is unneeded. */
@@ -834,7 +834,7 @@ static int vfat_rmdir(struct inode *dir, struct dentry *dentry)
clear_nlink(inode);
inode->i_mtime = inode->i_atime = current_time(inode);
fat_detach(inode);
- vfat_d_version_set(dentry, dir->i_version);
+ vfat_d_version_set(dentry, inode_get_iversion(dir));
out:
mutex_unlock(&MSDOS_SB(sb)->s_lock);
@@ -860,7 +860,7 @@ static int vfat_unlink(struct inode *dir, struct dentry *dentry)
clear_nlink(inode);
inode->i_mtime = inode->i_atime = current_time(inode);
fat_detach(inode);
- vfat_d_version_set(dentry, dir->i_version);
+ vfat_d_version_set(dentry, inode_get_iversion(dir));
out:
mutex_unlock(&MSDOS_SB(sb)->s_lock);
@@ -886,7 +886,7 @@ static int vfat_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
err = vfat_add_entry(dir, &dentry->d_name, 1, cluster, &ts, &sinfo);
if (err)
goto out_free;
- dir->i_version++;
+ inode_inc_iversion_locked(dir);
inc_nlink(dir);
inode = fat_build_inode(sb, sinfo.de, sinfo.i_pos);
@@ -896,7 +896,7 @@ static int vfat_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
/* the directory was completed, just return a error */
goto out;
}
- inode->i_version++;
+ inode_inc_iversion_locked(inode);
set_nlink(inode, 2);
inode->i_mtime = inode->i_atime = inode->i_ctime = ts;
/* timestamp is already written, so mark_inode_dirty() is unneeded. */
@@ -962,7 +962,7 @@ static int vfat_rename(struct inode *old_dir, struct dentry *old_dentry,
goto out;
new_i_pos = sinfo.i_pos;
}
- new_dir->i_version++;
+ inode_inc_iversion_locked(new_dir);
fat_detach(old_inode);
fat_attach(old_inode, new_i_pos);
@@ -990,7 +990,7 @@ static int vfat_rename(struct inode *old_dir, struct dentry *old_dentry,
old_sinfo.bh = NULL;
if (err)
goto error_dotdot;
- old_dir->i_version++;
+ inode_inc_iversion_locked(old_dir);
old_dir->i_ctime = old_dir->i_mtime = ts;
if (IS_DIRSYNC(old_dir))
(void)fat_sync_inode(old_dir);
--
2.7.4
...as it's completely unused.
Signed-off-by: Jeff Layton <[email protected]>
---
fs/orangefs/super.c | 2 --
1 file changed, 2 deletions(-)
diff --git a/fs/orangefs/super.c b/fs/orangefs/super.c
index c48859f16e7b..58ad9875eafc 100644
--- a/fs/orangefs/super.c
+++ b/fs/orangefs/super.c
@@ -85,8 +85,6 @@ static void orangefs_inode_cache_ctor(void *req)
inode_init_once(&orangefs_inode->vfs_inode);
init_rwsem(&orangefs_inode->xattr_sem);
-
- orangefs_inode->vfs_inode.i_version = 1;
}
static struct inode *orangefs_alloc_inode(struct super_block *sb)
--
2.7.4
Nothing uses this, and the i_version is never incremented on ntfs.
Signed-off-by: Jeff Layton <[email protected]>
---
fs/ntfs/inode.c | 9 ---------
fs/ntfs/mft.c | 6 ------
2 files changed, 15 deletions(-)
diff --git a/fs/ntfs/inode.c b/fs/ntfs/inode.c
index 7c410f879412..1c1ee489284b 100644
--- a/fs/ntfs/inode.c
+++ b/fs/ntfs/inode.c
@@ -560,13 +560,6 @@ static int ntfs_read_locked_inode(struct inode *vi)
ntfs_debug("Entering for i_ino 0x%lx.", vi->i_ino);
/* Setup the generic vfs inode parts now. */
-
- /*
- * This is for checking whether an inode has changed w.r.t. a file so
- * that the file can be updated if necessary (compare with f_version).
- */
- vi->i_version = 1;
-
vi->i_uid = vol->uid;
vi->i_gid = vol->gid;
vi->i_mode = 0;
@@ -1240,7 +1233,6 @@ static int ntfs_read_locked_attr_inode(struct inode *base_vi, struct inode *vi)
base_ni = NTFS_I(base_vi);
/* Just mirror the values from the base inode. */
- vi->i_version = base_vi->i_version;
vi->i_uid = base_vi->i_uid;
vi->i_gid = base_vi->i_gid;
set_nlink(vi, base_vi->i_nlink);
@@ -1507,7 +1499,6 @@ static int ntfs_read_locked_index_inode(struct inode *base_vi, struct inode *vi)
ni = NTFS_I(vi);
base_ni = NTFS_I(base_vi);
/* Just mirror the values from the base inode. */
- vi->i_version = base_vi->i_version;
vi->i_uid = base_vi->i_uid;
vi->i_gid = base_vi->i_gid;
set_nlink(vi, base_vi->i_nlink);
diff --git a/fs/ntfs/mft.c b/fs/ntfs/mft.c
index b6f402194f02..165133321577 100644
--- a/fs/ntfs/mft.c
+++ b/fs/ntfs/mft.c
@@ -2641,12 +2641,6 @@ ntfs_inode *ntfs_mft_record_alloc(ntfs_volume *vol, const int mode,
goto undo_mftbmp_alloc;
}
vi->i_ino = bit;
- /*
- * This is for checking whether an inode has changed w.r.t. a
- * file so that the file can be updated if necessary (compare
- * with f_version).
- */
- vi->i_version = 1;
/* The owner and group come from the ntfs volume. */
vi->i_uid = vol->uid;
--
2.7.4
The i_version field in reiserfs is not initialized and is only ever
updated here. Nothing ever views it, so just remove it.
Signed-off-by: Jeff Layton <[email protected]>
---
fs/reiserfs/super.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/fs/reiserfs/super.c b/fs/reiserfs/super.c
index e314cb30a181..b7cf777fc2ac 100644
--- a/fs/reiserfs/super.c
+++ b/fs/reiserfs/super.c
@@ -2521,7 +2521,6 @@ static ssize_t reiserfs_quota_write(struct super_block *sb, int type,
return err;
if (inode->i_size < off + len - towrite)
i_size_write(inode, off + len - towrite);
- inode->i_version++;
inode->i_mtime = inode->i_ctime = current_time(inode);
mark_inode_dirty(inode);
return len - towrite;
--
2.7.4
It's never used in nilfs2.
Signed-off-by: Jeff Layton <[email protected]>
---
fs/nilfs2/super.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/fs/nilfs2/super.c b/fs/nilfs2/super.c
index 12eeae62a2b1..d157f6f96d9a 100644
--- a/fs/nilfs2/super.c
+++ b/fs/nilfs2/super.c
@@ -160,7 +160,6 @@ struct inode *nilfs_alloc_inode(struct super_block *sb)
ii->i_bh = NULL;
ii->i_state = 0;
ii->i_cno = 0;
- ii->vfs_inode.i_version = 1;
nilfs_mapping_init(&ii->i_btnode_cache, &ii->vfs_inode);
return &ii->vfs_inode;
}
--
2.7.4
It's not used in f2fs currently. Just remove the initialization.
Signed-off-by: Jeff Layton <[email protected]>
---
fs/f2fs/super.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
index 702638e21c76..78cceb0bed9e 100644
--- a/fs/f2fs/super.c
+++ b/fs/f2fs/super.c
@@ -572,7 +572,6 @@ static struct inode *f2fs_alloc_inode(struct super_block *sb)
init_once((void *) fi);
/* Initialize f2fs-specific inode info */
- fi->vfs_inode.i_version = 1;
atomic_set(&fi->dirty_pages, 0);
fi->i_current_depth = 1;
fi->i_advise = 0;
--
2.7.4
It's bumped on quota writes but is otherwise untouched.
Signed-off-by: Jeff Layton <[email protected]>
---
fs/jfs/super.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/fs/jfs/super.c b/fs/jfs/super.c
index 85671f7f8518..2739600f826e 100644
--- a/fs/jfs/super.c
+++ b/fs/jfs/super.c
@@ -829,7 +829,6 @@ static ssize_t jfs_quota_write(struct super_block *sb, int type,
}
if (inode->i_size < off+len-towrite)
i_size_write(inode, off+len-towrite);
- inode->i_version++;
inode->i_mtime = inode->i_ctime = current_time(inode);
mark_inode_dirty(inode);
inode_unlock(inode);
--
2.7.4
Eventually, we'll want to wire it up to use the change attribute that
the cluster tracks instead, but for now this is unneeded.
Signed-off-by: Jeff Layton <[email protected]>
---
fs/ceph/inode.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 398e5328b309..c3406acb36e7 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -796,7 +796,6 @@ static int fill_inode(struct inode *inode, struct page *locked_page,
/* update inode */
ci->i_version = le64_to_cpu(info->version);
- inode->i_version++;
inode->i_rdev = le32_to_cpu(info->rdev);
inode->i_blkbits = fls(le32_to_cpu(info->layout.fl_stripe_unit)) - 1;
--
2.7.4
ecryptfs bumps the i_version counter when a new inode is
instantiated, but never touches it again.
Signed-off-by: Jeff Layton <[email protected]>
---
fs/ecryptfs/inode.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/fs/ecryptfs/inode.c b/fs/ecryptfs/inode.c
index e7413f82d27b..9493e8614de1 100644
--- a/fs/ecryptfs/inode.c
+++ b/fs/ecryptfs/inode.c
@@ -64,7 +64,6 @@ static int ecryptfs_inode_set(struct inode *inode, void *opaque)
/* i_size will be overwritten for encrypted regular files */
fsstack_copy_inode_size(inode, lower_inode);
inode->i_ino = lower_inode->i_ino;
- inode->i_version++;
inode->i_mapping->a_ops = &ecryptfs_aops;
if (S_ISLNK(inode->i_mode))
--
2.7.4
On Wed, Dec 21, 2016 at 7:03 PM, Jeff Layton <[email protected]> wrote:
> The spinlock is only used to serialize callers that want to increment
> the counter. We can achieve the same thing with an atomic64_t and
> get the i_lock out of this codepath.
>
Cool work! See some nits and suggestions below.
> +/*
> + * We borrow the top bit in the i_version to use as a flag to tell us whether
> + * it has been queried since we last bumped it. If it has, then we must bump
> + * it and set the flag. Note that this means that we have to handle wrapping
> + * manually.
> + */
> +#define INODE_I_VERSION_QUERIED (1ULL<<63)
> +
> /**
> * inode_set_iversion - set i_version to a particular value
> * @inode: inode to set
> @@ -1976,7 +1980,7 @@ static inline void inode_dec_link_count(struct inode *inode)
> static inline void
> inode_set_iversion(struct inode *inode, const u64 new)
> {
> - inode->i_version = new;
> + atomic64_set(&inode->i_version, new);
> }
>
Maybe needs an overflow sanity check !(new & INODE_I_VERSION_QUERIED)??
See API change suggestion below.
> /**
> @@ -2010,16 +2011,26 @@ inode_set_iversion_read(struct inode *inode, const u64 new)
> static inline bool
> inode_inc_iversion(struct inode *inode, bool force)
> {
> - bool ret = false;
> + u64 cur, old, new;
> +
> + cur = (u64)atomic64_read(&inode->i_version);
> + for (;;) {
> + /* If flag is clear then we needn't do anything */
> + if (!force && !(cur & INODE_I_VERSION_QUERIED))
> + return false;
> +
> + new = (cur & ~INODE_I_VERSION_QUERIED) + 1;
> +
> + /* Did we overflow into flag bit? Reset to 0 if so. */
> + if (unlikely(new == INODE_I_VERSION_QUERIED))
> + new = 0;
>
Did you consider changing f_version type and the signature of the new
i_version API to set/get s64 instead of u64?
It makes a bit more sense from API users perspective to know that
the valid range for version is >=0.
file->f_version is not the only struct member used to store&compare
i_version. nfs and xfs have other struct members for that, but even
if all those members are not changed to type s64, the explicit cast
to (s64) and back to (u64) will serve as a good documentation in
the code about the valid range of version in the new API.
> /**
> @@ -2080,7 +2099,7 @@ inode_get_iversion(struct inode *inode)
> static inline s64
> inode_cmp_iversion(const struct inode *inode, const u64 old)
> {
> - return (s64)inode->i_version - (s64)old;
> + return (s64)(atomic64_read(&inode->i_version) << 1) - (s64)(old << 1);
> }
>
IMO, it is better for the API to determine that 'old' is valid a value
returned from
inode_get_iversion* and therefore should not have the MSB set.
Unless the reason you chose to shift those 2 values is because it is cheaper
then masking INODE_I_VERSION_QUERIED??
Cheers,
Amir.
On Wed, Dec 21, 2016 at 12:03:17PM -0500, Jeff Layton wrote:
> Only btrfs, ext4, and xfs implement it for data changes. Because of
> this, these filesystems must log the inode to disk whenever the
> i_version counter changes. That has a non-zero performance impact,
> especially on write-heavy workloads, because we end up dirtying the
> inode metadata on every write, not just when the times change. [1]
Do you have numbers to justify these changes?
On Thu, 2016-12-22 at 10:38 +0200, Amir Goldstein wrote:
> On Wed, Dec 21, 2016 at 7:03 PM, Jeff Layton <[email protected]> wrote:
> >
> > The spinlock is only used to serialize callers that want to increment
> > the counter. We can achieve the same thing with an atomic64_t and
> > get the i_lock out of this codepath.
> >
>
> Cool work! See some nits and suggestions below.
>
> >
> > +/*
> > + * We borrow the top bit in the i_version to use as a flag to tell us whether
> > + * it has been queried since we last bumped it. If it has, then we must bump
> > + * it and set the flag. Note that this means that we have to handle wrapping
> > + * manually.
> > + */
> > +#define INODE_I_VERSION_QUERIED (1ULL<<63)
> > +
> > /**
> > * inode_set_iversion - set i_version to a particular value
> > * @inode: inode to set
> > @@ -1976,7 +1980,7 @@ static inline void inode_dec_link_count(struct inode *inode)
> > static inline void
> > inode_set_iversion(struct inode *inode, const u64 new)
> > {
> > - inode->i_version = new;
> > + atomic64_set(&inode->i_version, new);
> > }
> >
>
> Maybe needs an overflow sanity check !(new & INODE_I_VERSION_QUERIED)??
> See API change suggestion below.
>
>
Possibly. Note that in some cases (when the i_version can be stored on
disk across a remount), we need to ensure that we set this flag when the
inode is read in from disk. It's always possible that we'll get a query
for it, and then crash so we always set the flag just in case.
> >
> > /**
> > @@ -2010,16 +2011,26 @@ inode_set_iversion_read(struct inode *inode, const u64 new)
> > static inline bool
> > inode_inc_iversion(struct inode *inode, bool force)
> > {
> > - bool ret = false;
> > + u64 cur, old, new;
> > +
> > + cur = (u64)atomic64_read(&inode->i_version);
> > + for (;;) {
> > + /* If flag is clear then we needn't do anything */
> > + if (!force && !(cur & INODE_I_VERSION_QUERIED))
> > + return false;
> > +
> > + new = (cur & ~INODE_I_VERSION_QUERIED) + 1;
> > +
> > + /* Did we overflow into flag bit? Reset to 0 if so. */
> > + if (unlikely(new == INODE_I_VERSION_QUERIED))
> > + new = 0;
> >
>
> Did you consider changing f_version type and the signature of the new
> i_version API to set/get s64 instead of u64?
>
> It makes a bit more sense from API users perspective to know that
> the valid range for version is >=0.
>
> file->f_version is not the only struct member used to store&compare
> i_version. nfs and xfs have other struct members for that, but even
> if all those members are not changed to type s64, the explicit cast
> to (s64) and back to (u64) will serve as a good documentation in
> the code about the valid range of version in the new API.
>
This API is definitely not set in stone. That said, we have to consider
that there are really three classes of filesystems here:
1) ones that treat i_version as an opaque value: Mostly AFS and NFS,
as they get this value from the server. These both can also use the
entire u64 field, so we need to ensure that we don't monkey with the
flag bit on them.
2) filesystems that just use it internally: These don't set MS_I_VERSION
and mostly use it to detect directory changes that occur during readdir.
i_version is initialized to some value (0 or 1) when the struct inode is
allocated and bump it on directory changes.
3) filesystems where the kernel manages it completely: these set
MS_I_VERSION and the kernel handles bumping it on writes. Currently,
this is btrfs, ext4 and xfs. These are persistent across remounts as
well.
So, we have to ensure that this API encompasses all 3 of these use
cases.
> > /**
> > @@ -2080,7 +2099,7 @@ inode_get_iversion(struct inode *inode)
> > static inline s64
> > inode_cmp_iversion(const struct inode *inode, const u64 old)
> > {
> > - return (s64)inode->i_version - (s64)old;
> > + return (s64)(atomic64_read(&inode->i_version) << 1) - (s64)(old << 1);
> > }
> >
>
> IMO, it is better for the API to determine that 'old' is valid a value
> returned from
> inode_get_iversion* and therefore should not have the MSB set.
> Unless the reason you chose to shift those 2 values is because it is cheaper
> then masking INODE_I_VERSION_QUERIED??
>
>
No, we need to do that in order to handle wraparound correctly. We want
this check to work something like the time_before/after macros in the
kernel that handle jiffies wraparound.
So, the sign returned here matters, as positive values indicate that the
current one is "newer" than the old one. That's the main reason for the
shift here.
Note that that that should be documented here too, I'll plan to add that
for the next revision.
Thanks for the comments so far!
--
Jeff Layton <[email protected]>
On Thu, 2016-12-22 at 00:45 -0800, Christoph Hellwig wrote:
> On Wed, Dec 21, 2016 at 12:03:17PM -0500, Jeff Layton wrote:
> >
> > Only btrfs, ext4, and xfs implement it for data changes. Because of
> > this, these filesystems must log the inode to disk whenever the
> > i_version counter changes. That has a non-zero performance impact,
> > especially on write-heavy workloads, because we end up dirtying the
> > inode metadata on every write, not just when the times change. [1]
>
> Do you have numbers to justify these changes?
I have numbers. As to whether they justify the changes, I'm not sure.
This helps a lot on a (admittedly nonsensical) 1-byte write workload. On
XFS, with this fio jobfile:
--------------------8<------------------
[global]
direct=0
size=2g
filesize=512m
bsrange=1-1
timeout=60
numjobs=1
directory=/mnt/scratch
[f1]
filename=randwrite
rw=randwrite
--------------------8<------------------
Unpatched kernel:
 WRITE: io=7707KB, aggrb=128KB/s, minb=128KB/s, maxb=128KB/s, mint=60000msec, maxt=60000msec
Patched kernel:
 WRITE: io=12701KB, aggrb=211KB/s, minb=211KB/s, maxb=211KB/s, mint=60000msec, maxt=60000msec
So quite a difference there and it's pretty consistent across runs. If I
change the jobfile to have "direct=1" and "bsrange=4k-4k", then any
variation between the two doesn't seem to be significant (numbers vary
as much between runs on the same kernels and are roughly the same).
Playing with buffered I/O sizes between 1 byte and 4k shows that as the
I/O sizes get larger, this makes less difference (which is what I'd
expect).
Previous testing with ext4 shows roughly the same results. btrfs shows
some benefit here but significantly less than with ext4 or xfs. Not sure
why that is yet -- maybe CoW effects?
That said, I don't have a great test rig for this. I'm using VMs with a
dedicated LVM volume that's on a random SSD I had laying around. It
could use testing on a wider set of configurations and workloads.
I was also hoping that others may have workloads that they think might
be (postively or negatively) affected by these changes. If you can think
of any in particular, then I'm interested to hear about them.
--
Jeff Layton <[email protected]>
The patch ordering is a little annoying as I'd like to be able to be
able to verify the implementation at the same time these new interfaces
are added, but, I don't know, I don't have a better idea.
Anyway, various nits:
On Wed, Dec 21, 2016 at 12:03:28PM -0500, Jeff Layton wrote:
> We already have inode_inc_iversion. Add inode_set_iversion,
> inode_get_iversion, inode_cmp_iversion and inode_iversion_need_inc.
>
> These should encompass how i_version is currently accessed and
> manipulated so that we can later change the implementation.
>
> Signed-off-by: Jeff Layton <[email protected]>
> ---
> include/linux/fs.h | 109 ++++++++++++++++++++++++++++++++++++++++++++++++++---
> 1 file changed, 104 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 2ba074328894..075c915fe2b1 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1962,18 +1962,117 @@ static inline void inode_dec_link_count(struct inode *inode)
> }
>
> /**
> + * inode_set_iversion - set i_version to a particular value
> + * @inode: inode to set
> + * @new: new i_version value to set
> + *
> + * Set the inode's i_version. Should generally be called when initializing
> + * a new inode.
> + */
> +static inline void
> +inode_set_iversion(struct inode *inode, const u64 new)
> +{
> + inode->i_version = new;
> +}
> +
> +/**
> + * inode_inc_iversion_locked - increment i_version while protected
> + * @inode: inode to be updated
> + *
> + * Increment the i_version field in the inode. This version is usable
> + * when there is some other sort of lock in play that would prevent
> + * concurrent accessors.
> + */
> +static inline void
> +inode_inc_iversion_locked(struct inode *inode)
> +{
> + inode->i_version++;
> +}
> +
> +/**
> + * inode_set_iversion_read - set i_version to a particular value and flag
> + * set flag to indicate that it has been viewed
s/flag set flag/set flag/.
> + * @inode: inode to set
> + * @new: new i_version value to set
> + *
> + * Set the inode's i_version, and flag the inode as if it has been viewed.
> + * Should generally be called when initializinga new inode in memory from
> + * from disk.
s/from from/from/.
> + */
> +static inline void
> +inode_set_iversion_read(struct inode *inode, const u64 new)
> +{
> + inode_set_iversion(inode, new);
> +}
> +
> +/**
> * inode_inc_iversion - increments i_version
> * @inode: inode that need to be updated
> *
> * Every time the inode is modified, the i_version field will be incremented.
> - * The filesystem has to be mounted with i_version flag
> + * The filesystem has to be mounted with MS_I_VERSION flag.
I'm not sure why that comment's there. Filesystems can choose to
increment i_version without requiring the mount option if they want,
can't they?
> + */
> +static inline bool
Document what the return value means?
> +inode_inc_iversion(struct inode *inode)
> +{
> + spin_lock(&inode->i_lock);
> + inode_inc_iversion_locked(inode);
> + spin_unlock(&inode->i_lock);
> + return true;
> +}
> +
> +/**
> + * inode_get_iversion_raw - read i_version without "perturbing" it
> + * @inode: inode from which i_version should be read
> + *
> + * Read the inode i_version counter for an inode without registering it as a
> + * read. Should typically be used by local filesystems that need to store an
> + * i_version on disk.
> + */
> +static inline u64
> +inode_get_iversion_raw(const struct inode *inode)
> +{
> + return inode->i_version;
> +}
> +
> +/**
> + * inode_get_iversion - read i_version for later use
> + * @inode: inode from which i_version should be read
> + *
> + * Read the inode i_version counter. This should be used by callers that wish
> + * to store the returned i_version for later comparison.
I'm not sure what "for later comparison" means. This is the "read"
operation that actually flags the i_version as read, which you'd use
(for example) to implement NFS getattr? I wonder if there's a better
way to say that.
> + */
> +static inline u64
> +inode_get_iversion(const struct inode *inode)
> +{
> + return inode_get_iversion_raw(inode);
> +}
> +
> +/**
> + * inode_cmp_iversion - check whether the i_version counter has changed
> + * @inode: inode to check
> + * @old: old value to check against its i_version
> + *
> + * Compare an i_version counter with a previous one. Returns 0 if they are
> + * the same or non-zero if they are different.
Does this flag the i_version as read? What's it for? (OK, I guess I'll
find out after a few more patches...).
--b.
> */
> +static inline s64
> +inode_cmp_iversion(const struct inode *inode, const u64 old)
> +{
> + return (s64)inode->i_version - (s64)old;
> +}
>
> -static inline void inode_inc_iversion(struct inode *inode)
> +/**
> + * inode_iversion_need_inc - is the i_version in need of being incremented?
> + * @inode: inode to check
> + *
> + * Returns whether the inode->i_version counter needs incrementing on the next
> + * change.
> + */
> +static inline bool
> +inode_iversion_need_inc(struct inode *inode)
> {
> - spin_lock(&inode->i_lock);
> - inode->i_version++;
> - spin_unlock(&inode->i_lock);
> + return true;
> }
>
> enum file_time_flags {
> --
> 2.7.4
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Dec 21, 2016 at 12:03:17PM -0500, Jeff Layton wrote:
> tl;dr: I think we can greatly reduce the cost of the inode->i_version
> counter, by exploiting the fact that we don't need to increment it
> if no one is looking at it. We can also clean up the code to prepare
> to eventually expose this value via statx().
>
> The inode->i_version field is supposed to be a value that changes
> whenever there is any data or metadata change to the inode. Some
> filesystems use it internally to detect directory changes during
> readdir. knfsd will use it if the filesystem has MS_I_VERSION
> set. IMA will also use it (though it's not clear to me that that
> works 100% -- no check for MS_I_VERSION there).
I'm a little confused about the internal uses for directories. Is it
necessarily kept on disk? In cases it's not, then there's not the same
performance issue, right? Is there any risk these patches make
performance slightly worse in that case?
> Only btrfs, ext4, and xfs implement it for data changes. Because of
> this, these filesystems must log the inode to disk whenever the
> i_version counter changes.
On those filesystems that's done for both files and directories, right?
--b.
> That has a non-zero performance impact,
> especially on write-heavy workloads, because we end up dirtying the
> inode metadata on every write, not just when the times change. [1]
>
> It turns out though that none of these users of i_version require that
> i_version change on every change to the file. The only real requirement
> is that it be different if _something_ changed since the last time we
> queried for it. [2]
>
> So, if we simply keep track of when something queries the value, we
> can avoid bumping the counter and that metadata update.
>
> This patchset implements this:
>
> It starts with some small cleanup patches to just remove any mention of
> the i_version counter in filesystems that don't actually use it.
>
> Then, we add a new set of static inlines for managing the counter. The
> initial version should work identically to what we have now. Then, all
> of the remaining filesystems that use i_version are converted to the new
> inlines.
>
> Once that's in place, we switch to a new implementation that allows us
> to track readers of i_version counter, and only bump it when it's
> necessary or convenient (i.e. we're going to disk anyway).
>
> The final patch switches from a scheme that uses the i_lock to serialize
> the counter updates during write to an atomic64_t. That's a wash
> performance-wise in my testing, but I like not having to take the i_lock
> down where it could end up nested inside other locks.
>
> With this, we reduce inode metadata updates across all 3 filesystems
> down to roughly the frequency of the timestamp granularity, particularly
> when it's not being queried (the vastly common case).
>
> The pessimal workload here is 1 byte writes, and it helps that
> significantly. Of course, that's not a real-world workload.
>
> A tiobench-example.fio workload also shows some modest performance
> gains, and I've gotten mails from the kernel test robot that show some
> significant performance gains on some microbenchmarks (case-msync-mt in
> the vm-scalability testsuite to be specific).
>
> I'm happy to run other workloads if anyone can suggest them.
>
> At this point, the patchset works and does what it's expected to do in
> my own testing. It seems like it's at least a modest performance win
> across all 3 major disk-based filesystems. It may also encourage others
> to implement i_version as well since it reduces that cost.
>
> Is this an avenue that's worthwhile to pursue?
>
> Note that I think we may have other changes coming in the future that
> will make this sort of cleanup necessary anyway. I'd like to plug in the
> Ceph change attribute here eventually, and that will require something
> like this anyway.
>
> Thoughts, comments and suggestions are welcome...
>
> ---
>
> [1]: On ext4 it must be turned on with the i_version mount option,
> mostly due to fears of incurring this impact, AFAICT.
>
> [2]: NFS also recommends that it appear to increase in value over time, so
> that clients can discard metadata updates that are older than ones
> they've already seen.
>
> Jeff Layton (30):
> lustre: don't set f_version in ll_readdir
> ecryptfs: remove unnecessary i_version bump
> ceph: remove the bump of i_version
> f2fs: don't bother setting i_version
> hpfs: don't bother with the i_version counter
> jfs: remove initialization of i_version counter
> nilfs2: remove inode->i_version initialization
> orangefs: remove initialization of i_version
> reiserfs: remove unneeded i_version bump
> ntfs: remove i_version handling
> fs: new API for handling i_version
> fat: convert to new i_version API
> affs: convert to new i_version API
> afs: convert to new i_version API
> btrfs: convert to new i_version API
> exofs: switch to new i_version API
> ext2: convert to new i_version API
> ext4: convert to new i_version API
> nfs: convert to new i_version API
> nfsd: convert to new i_version API
> ocfs2: convert to new i_version API
> ufs: use new i_version API
> xfs: convert to new i_version API
> IMA: switch IMA over to new i_version API
> fs: add a "force" parameter to inode_inc_iversion
> fs: only set S_VERSION when updating times if it has been queried
> xfs: avoid setting XFS_ILOG_CORE if i_version doesn't need
> incrementing
> btrfs: only dirty the inode in btrfs_update_time if something was
> changed
> fs: track whether the i_version has been queried with an i_state flag
> fs: convert i_version counter over to an atomic64_t
>
> drivers/staging/lustre/lustre/llite/dir.c | 3 -
> fs/affs/amigaffs.c | 4 +-
> fs/affs/dir.c | 4 +-
> fs/affs/super.c | 2 +-
> fs/afs/fsclient.c | 2 +-
> fs/afs/inode.c | 4 +-
> fs/btrfs/delayed-inode.c | 4 +-
> fs/btrfs/file.c | 4 +-
> fs/btrfs/inode.c | 41 ++++----
> fs/btrfs/ioctl.c | 4 +-
> fs/btrfs/tree-log.c | 2 +-
> fs/btrfs/xattr.c | 2 +-
> fs/ceph/inode.c | 1 -
> fs/ecryptfs/inode.c | 1 -
> fs/exofs/dir.c | 8 +-
> fs/exofs/super.c | 2 +-
> fs/ext2/dir.c | 8 +-
> fs/ext2/super.c | 4 +-
> fs/ext4/dir.c | 8 +-
> fs/ext4/inline.c | 6 +-
> fs/ext4/inode.c | 16 ++--
> fs/ext4/ioctl.c | 2 +-
> fs/ext4/namei.c | 8 +-
> fs/ext4/super.c | 2 +-
> fs/f2fs/super.c | 1 -
> fs/fat/dir.c | 2 +-
> fs/fat/inode.c | 8 +-
> fs/fat/namei_msdos.c | 6 +-
> fs/fat/namei_vfat.c | 20 ++--
> fs/hpfs/dir.c | 1 -
> fs/hpfs/dnode.c | 2 -
> fs/hpfs/super.c | 1 -
> fs/inode.c | 9 +-
> fs/jfs/super.c | 1 -
> fs/nfs/delegation.c | 2 +-
> fs/nfs/fscache-index.c | 4 +-
> fs/nfs/inode.c | 16 ++--
> fs/nfs/nfs4proc.c | 4 +-
> fs/nfs/nfstrace.h | 4 +-
> fs/nfs/write.c | 2 +-
> fs/nfsd/nfs3xdr.c | 2 +-
> fs/nfsd/nfs4xdr.c | 2 +-
> fs/nfsd/nfsfh.h | 2 +-
> fs/nilfs2/super.c | 1 -
> fs/ntfs/inode.c | 9 --
> fs/ntfs/mft.c | 6 --
> fs/ocfs2/dir.c | 14 +--
> fs/ocfs2/inode.c | 2 +-
> fs/ocfs2/namei.c | 2 +-
> fs/ocfs2/quota_global.c | 2 +-
> fs/orangefs/super.c | 2 -
> fs/reiserfs/super.c | 1 -
> fs/ufs/dir.c | 8 +-
> fs/ufs/inode.c | 2 +-
> fs/ufs/super.c | 2 +-
> fs/xfs/libxfs/xfs_inode_buf.c | 4 +-
> fs/xfs/xfs_icache.c | 4 +-
> fs/xfs/xfs_inode.c | 2 +-
> fs/xfs/xfs_inode_item.c | 2 +-
> fs/xfs/xfs_trans_inode.c | 12 +--
> include/linux/fs.h | 151 ++++++++++++++++++++++++++++--
> security/integrity/ima/ima_api.c | 2 +-
> security/integrity/ima/ima_main.c | 2 +-
> 63 files changed, 288 insertions(+), 173 deletions(-)
>
> --
> 2.7.4
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Dec 21 2016, Jeff Layton wrote:
> We already have inode_inc_iversion. Add inode_set_iversion,
> inode_get_iversion, inode_cmp_iversion and inode_iversion_need_inc.
This list of added interfaces is incomplete.
And some of these interfaces could really use some justification up
front.
You later add a "force" parameter to inode_inc_version.
Why not do that up front here?
> +
> +/**
> + * inode_get_iversion - read i_version for later use
> + * @inode: inode from which i_version should be read
> + *
> + * Read the inode i_version counter. This should be used by callers that wish
> + * to store the returned i_version for later comparison.
> + */
> +static inline u64
> +inode_get_iversion(const struct inode *inode)
> +{
> + return inode_get_iversion_raw(inode);
> +}
I don't understand why this can use the _raw version rather than the
_read version.
Surely you need to know about any changes after this read.
Thanks,
NeilBrown
On Wed, Dec 21 2016, Jeff Layton wrote:
>
> +/*
> + * We borrow the top bit in the i_version to use as a flag to tell us whether
> + * it has been queried since we last bumped it. If it has, then we must bump
> + * it and set the flag. Note that this means that we have to handle wrapping
> + * manually.
> + */
> +#define INODE_I_VERSION_QUERIED (1ULL<<63)
> +
I would prefer that the least significant bit were used, rather than the
most significant.
Partly, this is because the "queried" state is less significant than
that "number has changed" state.
But most, this would mean we wouldn't need inode_cmp_iversion() at all.
We could just use "<" or ">=" or whatever.
The number returned by inode_get_iversion() would always be even (or
maybe odd) and wrapping (after the end of time) would "just work".
Thanks,
NeilBrown
On Fri, 2017-03-03 at 17:36 -0500, J. Bruce Fields wrote:
> The patch ordering is a little annoying as I'd like to be able to be
> able to verify the implementation at the same time these new interfaces
> are added, but, I don't know, I don't have a better idea.
>
Fair point. My thinking was "add new API, convert everything over to it,
and then convert the implementation to do something different". Maybe
that's the wrong approach?
> Anyway, various nits:
>
> On Wed, Dec 21, 2016 at 12:03:28PM -0500, Jeff Layton wrote:
> > We already have inode_inc_iversion. Add inode_set_iversion,
> > inode_get_iversion, inode_cmp_iversion and inode_iversion_need_inc.
> >
> > These should encompass how i_version is currently accessed and
> > manipulated so that we can later change the implementation.
> >
> > Signed-off-by: Jeff Layton <[email protected]>
> > ---
> > include/linux/fs.h | 109 ++++++++++++++++++++++++++++++++++++++++++++++++++---
> > 1 file changed, 104 insertions(+), 5 deletions(-)
> >
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 2ba074328894..075c915fe2b1 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -1962,18 +1962,117 @@ static inline void inode_dec_link_count(struct inode *inode)
> > }
> >
> > /**
> > + * inode_set_iversion - set i_version to a particular value
> > + * @inode: inode to set
> > + * @new: new i_version value to set
> > + *
> > + * Set the inode's i_version. Should generally be called when initializing
> > + * a new inode.
> > + */
> > +static inline void
> > +inode_set_iversion(struct inode *inode, const u64 new)
> > +{
> > + inode->i_version = new;
> > +}
> > +
> > +/**
> > + * inode_inc_iversion_locked - increment i_version while protected
> > + * @inode: inode to be updated
> > + *
> > + * Increment the i_version field in the inode. This version is usable
> > + * when there is some other sort of lock in play that would prevent
> > + * concurrent accessors.
> > + */
> > +static inline void
> > +inode_inc_iversion_locked(struct inode *inode)
> > +{
> > + inode->i_version++;
> > +}
> > +
> > +/**
> > + * inode_set_iversion_read - set i_version to a particular value and flag
> > + * set flag to indicate that it has been viewed
>
> s/flag set flag/set flag/.
>
> > + * @inode: inode to set
> > + * @new: new i_version value to set
> > + *
> > + * Set the inode's i_version, and flag the inode as if it has been viewed.
> > + * Should generally be called when initializinga new inode in memory from
> > + * from disk.
>
> s/from from/from/.
>
> > + */
> > +static inline void
> > +inode_set_iversion_read(struct inode *inode, const u64 new)
> > +{
> > + inode_set_iversion(inode, new);
> > +}
> > +
> > +/**
> > * inode_inc_iversion - increments i_version
> > * @inode: inode that need to be updated
> > *
> > * Every time the inode is modified, the i_version field will be incremented.
> > - * The filesystem has to be mounted with i_version flag
> > + * The filesystem has to be mounted with MS_I_VERSION flag.
>
> I'm not sure why that comment's there. Filesystems can choose to
> increment i_version without requiring the mount option if they want,
> can't they?
>
Yeah, it should probably be removed. It honestly doesn't make much sense
since we can end up bumping the thing regardless of whether it's set.
> > + */
> > +static inline bool
>
> Document what the return value means?
>
Will do.
> > +inode_inc_iversion(struct inode *inode)
> > +{
> > + spin_lock(&inode->i_lock);
> > + inode_inc_iversion_locked(inode);
> > + spin_unlock(&inode->i_lock);
> > + return true;
> > +}
> > +
> > +/**
> > + * inode_get_iversion_raw - read i_version without "perturbing" it
> > + * @inode: inode from which i_version should be read
> > + *
> > + * Read the inode i_version counter for an inode without registering it as a
> > + * read. Should typically be used by local filesystems that need to store an
> > + * i_version on disk.
> > + */
> > +static inline u64
> > +inode_get_iversion_raw(const struct inode *inode)
> > +{
> > + return inode->i_version;
> > +}
> > +
> > +/**
> > + * inode_get_iversion - read i_version for later use
> > + * @inode: inode from which i_version should be read
> > + *
> > + * Read the inode i_version counter. This should be used by callers that wish
> > + * to store the returned i_version for later comparison.
>
> I'm not sure what "for later comparison" means. This is the "read"
> operation that actually flags the i_version as read, which you'd use
> (for example) to implement NFS getattr? I wonder if there's a better
> way to say that.
>
Yeah, I'm definitely open to better suggestions on naming.
> > + */
> > +static inline u64
> > +inode_get_iversion(const struct inode *inode)
> > +{
> > + return inode_get_iversion_raw(inode);
> > +}
> > +
> > +/**
> > + * inode_cmp_iversion - check whether the i_version counter has changed
> > + * @inode: inode to check
> > + * @old: old value to check against its i_version
> > + *
> > + * Compare an i_version counter with a previous one. Returns 0 if they are
> > + * the same or non-zero if they are different.
>
> Does this flag the i_version as read? What's it for? (OK, I guess I'll
> find out after a few more patches...).
>
No, that won't flag it as read.
Fetching an i_version value at a point in time must flag it as having
been read. But, we can always "peek" at the i_version and see if it has
changed from a given value without having to flag the new value as
having been read. We haven't recorded its value, so we don't have to
ensure that it changes on subsequent reads from where it is now.
This function implements the latter. Given a value that we fetched
earlier, has the current i_version value changed?
> > */
> > +static inline s64
> > +inode_cmp_iversion(const struct inode *inode, const u64 old)
> > +{
> > + return (s64)inode->i_version - (s64)old;
> > +}
> >
> > -static inline void inode_inc_iversion(struct inode *inode)
> > +/**
> > + * inode_iversion_need_inc - is the i_version in need of being incremented?
> > + * @inode: inode to check
> > + *
> > + * Returns whether the inode->i_version counter needs incrementing on the next
> > + * change.
> > + */
> > +static inline bool
> > +inode_iversion_need_inc(struct inode *inode)
> > {
> > - spin_lock(&inode->i_lock);
> > - inode->i_version++;
> > - spin_unlock(&inode->i_lock);
> > + return true;
> > }
> >
> > enum file_time_flags {
> > --
> > 2.7.4
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> > the body of a message to [email protected]
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Jeff Layton <[email protected]>
On Wed, Dec 21 2016, Jeff Layton wrote:
> @@ -2072,7 +2093,12 @@ inode_cmp_iversion(const struct inode *inode, const u64 old)
> static inline bool
> inode_iversion_need_inc(struct inode *inode)
> {
> - return true;
> + bool ret;
> +
> + spin_lock(&inode->i_lock);
> + ret = inode->i_state & I_VERS_BUMP;
> + spin_unlock(&inode->i_lock);
> + return ret;
> }
>
I know this code gets removed, so this isn't really important.
By why do you take the spinlock here? What are you racing again?
Thanks,
NeilBrown
On Sat, 2017-03-04 at 11:03 +1100, NeilBrown wrote:
> On Wed, Dec 21 2016, Jeff Layton wrote:
>
> > @@ -2072,7 +2093,12 @@ inode_cmp_iversion(const struct inode *inode, const u64 old)
> > static inline bool
> > inode_iversion_need_inc(struct inode *inode)
> > {
> > - return true;
> > + bool ret;
> > +
> > + spin_lock(&inode->i_lock);
> > + ret = inode->i_state & I_VERS_BUMP;
> > + spin_unlock(&inode->i_lock);
> > + return ret;
> > }
> >
>
> I know this code gets removed, so this isn't really important.
> By why do you take the spinlock here? What are you racing again?
>
> Thanks,
> NeilBrown
I think I was worried about I_VERS_BUMP being set or cleared during an
increment or query. It is quite possible that that spinlock is not
necessary.
--
Jeff Layton <[email protected]>
On Fri, 2017-03-03 at 18:00 -0500, J. Bruce Fields wrote:
> On Wed, Dec 21, 2016 at 12:03:17PM -0500, Jeff Layton wrote:
> > tl;dr: I think we can greatly reduce the cost of the inode->i_version
> > counter, by exploiting the fact that we don't need to increment it
> > if no one is looking at it. We can also clean up the code to prepare
> > to eventually expose this value via statx().
> >
> > The inode->i_version field is supposed to be a value that changes
> > whenever there is any data or metadata change to the inode. Some
> > filesystems use it internally to detect directory changes during
> > readdir. knfsd will use it if the filesystem has MS_I_VERSION
> > set. IMA will also use it (though it's not clear to me that that
> > works 100% -- no check for MS_I_VERSION there).
>
> I'm a little confused about the internal uses for directories. Is it
> necessarily kept on disk?
No, they aren't necessarily stored on disk, and in fact they aren't on
most (maybe all?) of those filesystems. It's just a convenient place to
store a dir change value that is subsequently checked in readdir
operations.
That's part of the "fun" of untangling this mess. ;)
> In cases it's not, then there's not the same
> performance issue, right?
Right, I don't think it's really a big deal for most of those and I'm
not terribly concerned about the i_version counter on directory change
operations. An extra atomic op on a directory change seems unlikely to
hurt anything.
The main purpose of this is to improve the situation with small writes.
This should also help pave the way for fs' like NFS and Ceph that
implement a version counter but may not necessarily bump it on every
change.
I think once we have things more consistent, we'll be able to consider
exposing the i_version counter to userland via statx.
> Is there any risk these patches make
> performance slightly worse in that case?
>
Maybe, but I think that risk is pretty low. Note that I haven't measured
that specifically here, so I could be completely wrong.
If it is a problem, we could consider unioning this thing with a non-
atomic field for the directory change cases, but that would add some
complexity and I'm not sure it's worth it. I'd want to measure it first.
> > Only btrfs, ext4, and xfs implement it for data changes. Because of
> > this, these filesystems must log the inode to disk whenever the
> > i_version counter changes.
>
> On those filesystems that's done for both files and directories, right?
>
Yes.
> > That has a non-zero performance impact,
> > especially on write-heavy workloads, because we end up dirtying the
> > inode metadata on every write, not just when the times change. [1]
> >
> > It turns out though that none of these users of i_version require that
> > i_version change on every change to the file. The only real requirement
> > is that it be different if _something_ changed since the last time we
> > queried for it. [2]
> >
> > So, if we simply keep track of when something queries the value, we
> > can avoid bumping the counter and that metadata update.
> >
> > This patchset implements this:
> >
> > It starts with some small cleanup patches to just remove any mention of
> > the i_version counter in filesystems that don't actually use it.
> >
> > Then, we add a new set of static inlines for managing the counter. The
> > initial version should work identically to what we have now. Then, all
> > of the remaining filesystems that use i_version are converted to the new
> > inlines.
> >
> > Once that's in place, we switch to a new implementation that allows us
> > to track readers of i_version counter, and only bump it when it's
> > necessary or convenient (i.e. we're going to disk anyway).
> >
> > The final patch switches from a scheme that uses the i_lock to serialize
> > the counter updates during write to an atomic64_t. That's a wash
> > performance-wise in my testing, but I like not having to take the i_lock
> > down where it could end up nested inside other locks.
> >
> > With this, we reduce inode metadata updates across all 3 filesystems
> > down to roughly the frequency of the timestamp granularity, particularly
> > when it's not being queried (the vastly common case).
> >
> > The pessimal workload here is 1 byte writes, and it helps that
> > significantly. Of course, that's not a real-world workload.
> >
> > A tiobench-example.fio workload also shows some modest performance
> > gains, and I've gotten mails from the kernel test robot that show some
> > significant performance gains on some microbenchmarks (case-msync-mt in
> > the vm-scalability testsuite to be specific).
> >
> > I'm happy to run other workloads if anyone can suggest them.
> >
> > At this point, the patchset works and does what it's expected to do in
> > my own testing. It seems like it's at least a modest performance win
> > across all 3 major disk-based filesystems. It may also encourage others
> > to implement i_version as well since it reduces that cost.
> >
> > Is this an avenue that's worthwhile to pursue?
> >
> > Note that I think we may have other changes coming in the future that
> > will make this sort of cleanup necessary anyway. I'd like to plug in the
> > Ceph change attribute here eventually, and that will require something
> > like this anyway.
> >
> > Thoughts, comments and suggestions are welcome...
> >
> > ---
> >
> > [1]: On ext4 it must be turned on with the i_version mount option,
> > mostly due to fears of incurring this impact, AFAICT.
> >
> > [2]: NFS also recommends that it appear to increase in value over time, so
> > that clients can discard metadata updates that are older than ones
> > they've already seen.
> >
> > Jeff Layton (30):
> > lustre: don't set f_version in ll_readdir
> > ecryptfs: remove unnecessary i_version bump
> > ceph: remove the bump of i_version
> > f2fs: don't bother setting i_version
> > hpfs: don't bother with the i_version counter
> > jfs: remove initialization of i_version counter
> > nilfs2: remove inode->i_version initialization
> > orangefs: remove initialization of i_version
> > reiserfs: remove unneeded i_version bump
> > ntfs: remove i_version handling
> > fs: new API for handling i_version
> > fat: convert to new i_version API
> > affs: convert to new i_version API
> > afs: convert to new i_version API
> > btrfs: convert to new i_version API
> > exofs: switch to new i_version API
> > ext2: convert to new i_version API
> > ext4: convert to new i_version API
> > nfs: convert to new i_version API
> > nfsd: convert to new i_version API
> > ocfs2: convert to new i_version API
> > ufs: use new i_version API
> > xfs: convert to new i_version API
> > IMA: switch IMA over to new i_version API
> > fs: add a "force" parameter to inode_inc_iversion
> > fs: only set S_VERSION when updating times if it has been queried
> > xfs: avoid setting XFS_ILOG_CORE if i_version doesn't need
> > incrementing
> > btrfs: only dirty the inode in btrfs_update_time if something was
> > changed
> > fs: track whether the i_version has been queried with an i_state flag
> > fs: convert i_version counter over to an atomic64_t
> >
> > drivers/staging/lustre/lustre/llite/dir.c | 3 -
> > fs/affs/amigaffs.c | 4 +-
> > fs/affs/dir.c | 4 +-
> > fs/affs/super.c | 2 +-
> > fs/afs/fsclient.c | 2 +-
> > fs/afs/inode.c | 4 +-
> > fs/btrfs/delayed-inode.c | 4 +-
> > fs/btrfs/file.c | 4 +-
> > fs/btrfs/inode.c | 41 ++++----
> > fs/btrfs/ioctl.c | 4 +-
> > fs/btrfs/tree-log.c | 2 +-
> > fs/btrfs/xattr.c | 2 +-
> > fs/ceph/inode.c | 1 -
> > fs/ecryptfs/inode.c | 1 -
> > fs/exofs/dir.c | 8 +-
> > fs/exofs/super.c | 2 +-
> > fs/ext2/dir.c | 8 +-
> > fs/ext2/super.c | 4 +-
> > fs/ext4/dir.c | 8 +-
> > fs/ext4/inline.c | 6 +-
> > fs/ext4/inode.c | 16 ++--
> > fs/ext4/ioctl.c | 2 +-
> > fs/ext4/namei.c | 8 +-
> > fs/ext4/super.c | 2 +-
> > fs/f2fs/super.c | 1 -
> > fs/fat/dir.c | 2 +-
> > fs/fat/inode.c | 8 +-
> > fs/fat/namei_msdos.c | 6 +-
> > fs/fat/namei_vfat.c | 20 ++--
> > fs/hpfs/dir.c | 1 -
> > fs/hpfs/dnode.c | 2 -
> > fs/hpfs/super.c | 1 -
> > fs/inode.c | 9 +-
> > fs/jfs/super.c | 1 -
> > fs/nfs/delegation.c | 2 +-
> > fs/nfs/fscache-index.c | 4 +-
> > fs/nfs/inode.c | 16 ++--
> > fs/nfs/nfs4proc.c | 4 +-
> > fs/nfs/nfstrace.h | 4 +-
> > fs/nfs/write.c | 2 +-
> > fs/nfsd/nfs3xdr.c | 2 +-
> > fs/nfsd/nfs4xdr.c | 2 +-
> > fs/nfsd/nfsfh.h | 2 +-
> > fs/nilfs2/super.c | 1 -
> > fs/ntfs/inode.c | 9 --
> > fs/ntfs/mft.c | 6 --
> > fs/ocfs2/dir.c | 14 +--
> > fs/ocfs2/inode.c | 2 +-
> > fs/ocfs2/namei.c | 2 +-
> > fs/ocfs2/quota_global.c | 2 +-
> > fs/orangefs/super.c | 2 -
> > fs/reiserfs/super.c | 1 -
> > fs/ufs/dir.c | 8 +-
> > fs/ufs/inode.c | 2 +-
> > fs/ufs/super.c | 2 +-
> > fs/xfs/libxfs/xfs_inode_buf.c | 4 +-
> > fs/xfs/xfs_icache.c | 4 +-
> > fs/xfs/xfs_inode.c | 2 +-
> > fs/xfs/xfs_inode_item.c | 2 +-
> > fs/xfs/xfs_trans_inode.c | 12 +--
> > include/linux/fs.h | 151 ++++++++++++++++++++++++++++--
> > security/integrity/ima/ima_api.c | 2 +-
> > security/integrity/ima/ima_main.c | 2 +-
> > 63 files changed, 288 insertions(+), 173 deletions(-)
> >
> > --
> > 2.7.4
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> > the body of a message to [email protected]
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Jeff Layton <[email protected]>
On Sat, 2017-03-04 at 10:55 +1100, NeilBrown wrote:
> On Wed, Dec 21 2016, Jeff Layton wrote:
>
> > We already have inode_inc_iversion. Add inode_set_iversion,
> > inode_get_iversion, inode_cmp_iversion and inode_iversion_need_inc.
>
> This list of added interfaces is incomplete.
> And some of these interfaces could really use some justification up
> front.
>
> You later add a "force" parameter to inode_inc_version.
> Why not do that up front here?
>
First, thanks to you and Bruce for having a look.
Yes, it's clear from this and the earlier emails that I didn't do
enough documentation and explanation. I'll plan to respin this when I
get a chance, and lay out the justification and discussion a bit more.
I'll also make sure the not change the API midstream like this when I
respin.
> > +
> > +/**
> > + * inode_get_iversion - read i_version for later use
> > + * @inode: inode from which i_version should be read
> > + *
> > + * Read the inode i_version counter. This should be used by callers that wish
> > + * to store the returned i_version for later comparison.
> > + */
> > +static inline u64
> > +inode_get_iversion(const struct inode *inode)
> > +{
> > + return inode_get_iversion_raw(inode);
> > +}
>
> I don't understand why this can use the _raw version rather than the
> _read version.
> Surely you need to know about any changes after this read.
>
The approach here was to convert everything to a new API and have it
work much like the code works today and then to morph it into something
that only conditionally bumps the counter.
In the later implementation, yes you do need to know about changes
after the read, but in the initial implementation it doesn't matter
since the counter is bumped on every change anyway.
I'll try to do a better job laying out this rationale in follow-on
postings.
--
Jeff Layton <[email protected]>
On Fri, Mar 03, 2017 at 07:53:57PM -0500, Jeff Layton wrote:
> On Fri, 2017-03-03 at 18:00 -0500, J. Bruce Fields wrote:
> > On Wed, Dec 21, 2016 at 12:03:17PM -0500, Jeff Layton wrote:
> > > tl;dr: I think we can greatly reduce the cost of the inode->i_version
> > > counter, by exploiting the fact that we don't need to increment it
> > > if no one is looking at it. We can also clean up the code to prepare
> > > to eventually expose this value via statx().
> > >
> > > The inode->i_version field is supposed to be a value that changes
> > > whenever there is any data or metadata change to the inode. Some
> > > filesystems use it internally to detect directory changes during
> > > readdir. knfsd will use it if the filesystem has MS_I_VERSION
> > > set. IMA will also use it (though it's not clear to me that that
> > > works 100% -- no check for MS_I_VERSION there).
> >
> > I'm a little confused about the internal uses for directories. Is it
> > necessarily kept on disk?
>
> No, they aren't necessarily stored on disk, and in fact they aren't on
> most (maybe all?) of those filesystems. It's just a convenient place to
> store a dir change value that is subsequently checked in readdir
> operations.
>
> That's part of the "fun" of untangling this mess. ;)
>
> > In cases it's not, then there's not the same
> > performance issue, right?
>
> Right, I don't think it's really a big deal for most of those and I'm
> not terribly concerned about the i_version counter on directory change
> operations. An extra atomic op on a directory change seems unlikely to
> hurt anything.
>
> The main purpose of this is to improve the situation with small writes.
> This should also help pave the way for fs' like NFS and Ceph that
> implement a version counter but may not necessarily bump it on every
> change.
>
> I think once we have things more consistent, we'll be able to consider
> exposing the i_version counter to userland via statx.
>
> > Is there any risk these patches make
> > performance slightly worse in that case?
> >
>
> Maybe, but I think that risk is pretty low. Note that I haven't measured
> that specifically here, so I could be completely wrong.
>
> If it is a problem, we could consider unioning this thing with a non-
> atomic field for the directory change cases, but that would add some
> complexity and I'm not sure it's worth it. I'd want to measure it first.
Makes sense to me. I agree that it's unlikely to be a problem, just
wanted to make sure I understood....
Anyway, I'm a great fan of the basic idea, I hope we can get this done.
--b.
On Thu, Dec 22, 2016 at 09:42:04AM -0500, Jeff Layton wrote:
> On Thu, 2016-12-22 at 00:45 -0800, Christoph Hellwig wrote:
> > On Wed, Dec 21, 2016 at 12:03:17PM -0500, Jeff Layton wrote:
> > >
> > > Only btrfs, ext4, and xfs implement it for data changes. Because of
> > > this, these filesystems must log the inode to disk whenever the
> > > i_version counter changes. That has a non-zero performance impact,
> > > especially on write-heavy workloads, because we end up dirtying the
> > > inode metadata on every write, not just when the times change. [1]
> >
> > Do you have numbers to justify these changes?
>
> I have numbers. As to whether they justify the changes, I'm not sure.
> This helps a lot on a (admittedly nonsensical) 1-byte write workload. On
> XFS, with this fio jobfile:
To me, the interesting question is whether this allows us to turn on
i_version updates by default on xfs and ext4.
When Josef looked at doing that previously he withdrew the patch due to
performance regressions. I think the most useful thread started here:
http://lkml.kernel.org/r/[email protected]
Skimming quickly.... I think the regression was also in the small-write
case. So apparently that was thought to reveal a real problem?
So if you've mostly eliminated that regression, then that's good
motivation for your patches. (Though I think in addition to comparing
the patched and unpatched i_version case, we need to compare to the
unpatched not-i_version case. I'm not clear whether you did that.)
--b.
>
> --------------------8<------------------
> [global]
> direct=0
> size=2g
> filesize=512m
> bsrange=1-1
> timeout=60
> numjobs=1
> directory=/mnt/scratch
>
> [f1]
> filename=randwrite
> rw=randwrite
> --------------------8<------------------
>
> Unpatched kernel:
> Â WRITE: io=7707KB, aggrb=128KB/s, minb=128KB/s, maxb=128KB/s, mint=60000msec, maxt=60000msec
>
> Patched kernel:
> Â WRITE: io=12701KB, aggrb=211KB/s, minb=211KB/s, maxb=211KB/s, mint=60000msec, maxt=60000msec
>
> So quite a difference there and it's pretty consistent across runs. If I
> change the jobfile to have "direct=1" and "bsrange=4k-4k", then any
> variation between the two doesn't seem to be significant (numbers vary
> as much between runs on the same kernels and are roughly the same).
>
> Playing with buffered I/O sizes between 1 byte and 4k shows that as the
> I/O sizes get larger, this makes less difference (which is what I'd
> expect).
>
> Previous testing with ext4 shows roughly the same results. btrfs shows
> some benefit here but significantly less than with ext4 or xfs. Not sure
> why that is yet -- maybe CoW effects?
>
> That said, I don't have a great test rig for this. I'm using VMs with a
> dedicated LVM volume that's on a random SSD I had laying around. It
> could use testing on a wider set of configurations and workloads.
>
> I was also hoping that others may have workloads that they think might
> be (postively or negatively) affected by these changes. If you can think
> of any in particular, then I'm interested to hear about them.
>
> --
> Jeff Layton <[email protected]>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Mar 20, 2017 at 05:43:27PM -0400, J. Bruce Fields wrote:
> To me, the interesting question is whether this allows us to turn on
> i_version updates by default on xfs and ext4.
XFS v5 file systems have it on by default. Although we'll still need
to agree on the exact semantics of i_version before it's going to be
useful.
On Tue, Mar 21, 2017 at 06:45:00AM -0700, Christoph Hellwig wrote:
> On Mon, Mar 20, 2017 at 05:43:27PM -0400, J. Bruce Fields wrote:
> > To me, the interesting question is whether this allows us to turn on
> > i_version updates by default on xfs and ext4.
>
> XFS v5 file systems have it on by default.
Great, thanks.
> Although we'll still need to agree on the exact semantics of i_version
> before it's going to be useful.
Once it's figured out maybe we should write it up for a manpage that
could be used if statx starts exposing it to userspace.
A first attempt:
- It's a u64.
- It works for regular files and directories. (What about symlinks or
other special types?)
- It changes between two checks if and only if there were intervening
data or metadata changes. The change will always be an increase, but
the amount of the increase is meaningless.
- NFS doesn't actually require that it increases, but I think it
should. I assume 64 bits means we don't need a discussion of
wraparound.
- AFS wants an actual counter: if you get i_version X, then
write twice, then get i_version X+2, you're allowed to assume
your writes were the only modifications. Let's ignore this
for now. In the future if someone explains how to count
operations, then we can extend the interface to tell the
caller it can get those extra semantics.
- It's durable; the above comparison still works if there were reboots
between the two i_version checks.
- I don't know how realistic this is--we may need to figure out
if there's a weaker guarantee that's still useful. Do
filesystems actually make ctime/mtime/i_version changes
atomically with the changes that caused them? What if a
change attribute is exposed to an NFS client but doesn't make
it to disk, and then that value is reused after reboot?
Am I missing any issues?
--b.
On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote:
> On Tue, Mar 21, 2017 at 06:45:00AM -0700, Christoph Hellwig wrote:
> > On Mon, Mar 20, 2017 at 05:43:27PM -0400, J. Bruce Fields wrote:
> > > To me, the interesting question is whether this allows us to turn on
> > > i_version updates by default on xfs and ext4.
> >
> > XFS v5 file systems have it on by default.
>
> Great, thanks.
>
> > Although we'll still need to agree on the exact semantics of i_version
> > before it's going to be useful.
>
> Once it's figured out maybe we should write it up for a manpage that
> could be used if statx starts exposing it to userspace.
>
> A first attempt:
>
> - It's a u64.
>
> - It works for regular files and directories. (What about symlinks or
> other special types?)
>
> - It changes between two checks if and only if there were intervening
> data or metadata changes. The change will always be an increase, but
> the amount of the increase is meaningless.
> - NFS doesn't actually require that it increases, but I think it
> should. I assume 64 bits means we don't need a discussion of
> wraparound.
I thought NFS spec required that you be able to recognize old change
attributes so that they can be discarded. I could be wrong here though.
I'd have to go back and look through the spec to be sure.
> - AFS wants an actual counter: if you get i_version X, then
> write twice, then get i_version X+2, you're allowed to assume
> your writes were the only modifications. Let's ignore this
> for now. In the future if someone explains how to count
> operations, then we can extend the interface to tell the
> caller it can get those extra semantics.
>
> - It's durable; the above comparison still works if there were reboots
> between the two i_version checks.
> - I don't know how realistic this is--we may need to figure out
> if there's a weaker guarantee that's still useful. Do
> filesystems actually make ctime/mtime/i_version changes
> atomically with the changes that caused them? What if a
> change attribute is exposed to an NFS client but doesn't make
> it to disk, and then that value is reused after reboot?
>
Yeah, there could be atomicity there. If we bump i_version, we'll mark
the inode dirty and I think that will end up with the new i_version at
least being journalled before __mark_inode_dirty returns.
That said, I suppose it is possible for us to bump the counter, hand
that new counter value out to a NFS client and then the box crashes
before it makes it to the journal.
Not sure how big a problem that really is.
> Am I missing any issues?
>
No, I think you have it covered, and that's pretty much exactly what I
had in mind as far as semantics go. Thanks for writing it up!
--
Jeff Layton <[email protected]>
On Tue, Mar 21, 2017 at 01:23:24PM -0400, Jeff Layton wrote:
> On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote:
> > - NFS doesn't actually require that it increases, but I think it
> > should. I assume 64 bits means we don't need a discussion of
> > wraparound.
>
> I thought NFS spec required that you be able to recognize old change
> attributes so that they can be discarded. I could be wrong here though.
> I'd have to go back and look through the spec to be sure.
https://tools.ietf.org/html/rfc7862#section-10
--b.
On Tue, Mar 21, 2017 at 01:37:04PM -0400, J. Bruce Fields wrote:
> On Tue, Mar 21, 2017 at 01:23:24PM -0400, Jeff Layton wrote:
> > On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote:
> > > - NFS doesn't actually require that it increases, but I think it
> > > should. I assume 64 bits means we don't need a discussion of
> > > wraparound.
> >
> > I thought NFS spec required that you be able to recognize old change
> > attributes so that they can be discarded. I could be wrong here though.
> > I'd have to go back and look through the spec to be sure.
>
> https://tools.ietf.org/html/rfc7862#section-10
So, I'm suggesting we implement this one:
NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR: The change attribute value
MUST monotonically increase for every atomic change to the file
attributes, data, or directory contents.
It may be a slight lie--after your patches we wouldn't actually increase
"for every atomic change". I think that's OK.
--b.
On Tue, Mar 21, 2017 at 01:23:24PM -0400, Jeff Layton wrote:
> On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote:
> > - It's durable; the above comparison still works if there were reboots
> > between the two i_version checks.
> > - I don't know how realistic this is--we may need to figure out
> > if there's a weaker guarantee that's still useful. Do
> > filesystems actually make ctime/mtime/i_version changes
> > atomically with the changes that caused them? What if a
> > change attribute is exposed to an NFS client but doesn't make
> > it to disk, and then that value is reused after reboot?
> >
>
> Yeah, there could be atomicity there. If we bump i_version, we'll mark
> the inode dirty and I think that will end up with the new i_version at
> least being journalled before __mark_inode_dirty returns.
So you think the filesystem can provide the atomicity? In more detail:
- if I write to a file, a simultaneous reader should see either
(old data, old i_version) or (new data, new i_version), not a
combination of the two.
- ditto for metadata modifications.
- the same should be true if there's a crash.
(If that's not possible, then I think we could live with a brief window
of (new data, old i_version) as long as it doesn't persist beyond sync?)
> That said, I suppose it is possible for us to bump the counter, hand
> that new counter value out to a NFS client and then the box crashes
> before it makes it to the journal.
>
> Not sure how big a problem that really is.
The other case I was wondering about may have been unclear. Represent
the state of a file by a (data, i_version) pair. Say:
- file is modified from (F, V) to (F', V+1).
- client reads and caches (F', V+1).
- server crashes before writeback, so disk still has (F, V).
- after restart, someone else modifies file to (F'', V+1).
- original client revalidates its cache, sees V+1, concludes
file data is still F'.
This may not cause a real problem for clients depending only on
traditional NFS close-to-open semantics.
--b.
On Tue, 2017-03-21 at 14:30 -0400, J. Bruce Fields wrote:
> On Tue, Mar 21, 2017 at 01:23:24PM -0400, Jeff Layton wrote:
> > On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote:
> > > - It's durable; the above comparison still works if there were reboots
> > > between the two i_version checks.
> > > - I don't know how realistic this is--we may need to figure out
> > > if there's a weaker guarantee that's still useful. Do
> > > filesystems actually make ctime/mtime/i_version changes
> > > atomically with the changes that caused them? What if a
> > > change attribute is exposed to an NFS client but doesn't make
> > > it to disk, and then that value is reused after reboot?
> > >
> >
> > Yeah, there could be atomicity there. If we bump i_version, we'll mark
> > the inode dirty and I think that will end up with the new i_version at
> > least being journalled before __mark_inode_dirty returns.
>
> So you think the filesystem can provide the atomicity? In more detail:
>
Sorry, I hit send too quickly. That should have read:
"Yeah, there could be atomicity issues there."
I think providing that level of atomicity may be difficult, though
maybe there's some way to make the querying of i_version block until
the inode update has been journalled?
> - if I write to a file, a simultaneous reader should see either
> (old data, old i_version) or (new data, new i_version), not a
> combination of the two.
> - ditto for metadata modifications.
> - the same should be true if there's a crash.
>
> (If that's not possible, then I think we could live with a brief window
> of (new data, old i_version) as long as it doesn't persist beyond sync?)
>
> > That said, I suppose it is possible for us to bump the counter, hand
> > that new counter value out to a NFS client and then the box crashes
> > before it makes it to the journal.
> >
> > Not sure how big a problem that really is.
>
> The other case I was wondering about may have been unclear. Represent
> the state of a file by a (data, i_version) pair. Say:
>
> - file is modified from (F, V) to (F', V+1).
> - client reads and caches (F', V+1).
> - server crashes before writeback, so disk still has (F, V).
> - after restart, someone else modifies file to (F'', V+1).
> - original client revalidates its cache, sees V+1, concludes
> file data is still F'.
>
> This may not cause a real problem for clients depending only on
> traditional NFS close-to-open semantics.
>
>
No, I think that is a legitimate problem.
That said, after F'', the mtime would almost certainly be different
from the time after F', and that would likely be enough to prevent
confusion in NFS clients.
Applications that are just looking at i_version via statx() could get
confused though.
--
Jeff Layton <[email protected]>
On Tue, Mar 21, 2017 at 02:46:53PM -0400, Jeff Layton wrote:
> On Tue, 2017-03-21 at 14:30 -0400, J. Bruce Fields wrote:
> > On Tue, Mar 21, 2017 at 01:23:24PM -0400, Jeff Layton wrote:
> > > On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote:
> > > > - It's durable; the above comparison still works if there were reboots
> > > > between the two i_version checks.
> > > > - I don't know how realistic this is--we may need to figure out
> > > > if there's a weaker guarantee that's still useful. Do
> > > > filesystems actually make ctime/mtime/i_version changes
> > > > atomically with the changes that caused them? What if a
> > > > change attribute is exposed to an NFS client but doesn't make
> > > > it to disk, and then that value is reused after reboot?
> > > >
> > >
> > > Yeah, there could be atomicity there. If we bump i_version, we'll mark
> > > the inode dirty and I think that will end up with the new i_version at
> > > least being journalled before __mark_inode_dirty returns.
> >
> > So you think the filesystem can provide the atomicity? In more detail:
> >
>
> Sorry, I hit send too quickly. That should have read:
>
> "Yeah, there could be atomicity issues there."
>
> I think providing that level of atomicity may be difficult, though
> maybe there's some way to make the querying of i_version block until
> the inode update has been journalled?
No idea. Anyway, I'd like to figure out some reasonable requirement
that we can document.
>
> > - if I write to a file, a simultaneous reader should see either
> > (old data, old i_version) or (new data, new i_version), not a
> > combination of the two.
> > - ditto for metadata modifications.
> > - the same should be true if there's a crash.
> >
> > (If that's not possible, then I think we could live with a brief window
> > of (new data, old i_version) as long as it doesn't persist beyond sync?)
> >
> > > That said, I suppose it is possible for us to bump the counter, hand
> > > that new counter value out to a NFS client and then the box crashes
> > > before it makes it to the journal.
> > >
> > > Not sure how big a problem that really is.
> >
> > The other case I was wondering about may have been unclear. Represent
> > the state of a file by a (data, i_version) pair. Say:
> >
> > - file is modified from (F, V) to (F', V+1).
> > - client reads and caches (F', V+1).
> > - server crashes before writeback, so disk still has (F, V).
> > - after restart, someone else modifies file to (F'', V+1).
> > - original client revalidates its cache, sees V+1, concludes
> > file data is still F'.
> >
> > This may not cause a real problem for clients depending only on
> > traditional NFS close-to-open semantics.
> >
> >
>
> No, I think that is a legitimate problem.
>
> That said, after F'', the mtime would almost certainly be different
> from the time after F', and that would likely be enough to prevent
> confusion in NFS clients.
Oh, good point. So, may be worth saying that anyone wanting to make
sense of these across reboot should compare times as well (maybe that
should be in nfs rfc's too). I think that should be ctime not mtime,
though?
--b.
On Tue, Mar 21, 2017 at 01:23:24PM -0400, Jeff Layton wrote:
> On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote:
> > - It's durable; the above comparison still works if there were reboots
> > between the two i_version checks.
> > - I don't know how realistic this is--we may need to figure out
> > if there's a weaker guarantee that's still useful. Do
> > filesystems actually make ctime/mtime/i_version changes
> > atomically with the changes that caused them? What if a
> > change attribute is exposed to an NFS client but doesn't make
> > it to disk, and then that value is reused after reboot?
> >
>
> Yeah, there could be atomicity there. If we bump i_version, we'll mark
> the inode dirty and I think that will end up with the new i_version at
> least being journalled before __mark_inode_dirty returns.
The change may be journalled, but it isn't guaranteed stable until
fsync is run on the inode.
NFS server operations commit the metadata changed by a modification
through ->commit_metadata or sync_inode_metadata() before the
response is sent back to the client, hence guaranteeing that
i_version changes through the NFS server are stable and durable.
This is not the case for normal operations done through the POSIX
API - the journalling is asynchronous and the only durability
guarantees are provided by fsync()....
> That said, I suppose it is possible for us to bump the counter, hand
> that new counter value out to a NFS client and then the box crashes
> before it makes it to the journal.
Yup, this has aways been a problem when you mix posix applications
running on the NFS server modifying the same files as the NFS
clients are accessing and requiring synchronisation.
> Not sure how big a problem that really is.
This coherency problem has always existed on the server side...
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Tue, 2017-03-21 at 15:13 -0400, J. Bruce Fields wrote:
> On Tue, Mar 21, 2017 at 02:46:53PM -0400, Jeff Layton wrote:
> > On Tue, 2017-03-21 at 14:30 -0400, J. Bruce Fields wrote:
> > > On Tue, Mar 21, 2017 at 01:23:24PM -0400, Jeff Layton wrote:
> > > > On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote:
> > > > > - It's durable; the above comparison still works if there were reboots
> > > > > between the two i_version checks.
> > > > > - I don't know how realistic this is--we may need to figure out
> > > > > if there's a weaker guarantee that's still useful. Do
> > > > > filesystems actually make ctime/mtime/i_version changes
> > > > > atomically with the changes that caused them? What if a
> > > > > change attribute is exposed to an NFS client but doesn't make
> > > > > it to disk, and then that value is reused after reboot?
> > > > >
> > > >
> > > > Yeah, there could be atomicity there. If we bump i_version, we'll mark
> > > > the inode dirty and I think that will end up with the new i_version at
> > > > least being journalled before __mark_inode_dirty returns.
> > >
> > > So you think the filesystem can provide the atomicity? In more detail:
> > >
> >
> > Sorry, I hit send too quickly. That should have read:
> >
> > "Yeah, there could be atomicity issues there."
> >
> > I think providing that level of atomicity may be difficult, though
> > maybe there's some way to make the querying of i_version block until
> > the inode update has been journalled?
>
> No idea. Anyway, I'd like to figure out some reasonable requirement
> that we can document.
>
> >
> > > - if I write to a file, a simultaneous reader should see either
> > > (old data, old i_version) or (new data, new i_version), not a
> > > combination of the two.
> > > - ditto for metadata modifications.
> > > - the same should be true if there's a crash.
> > >
> > > (If that's not possible, then I think we could live with a brief window
> > > of (new data, old i_version) as long as it doesn't persist beyond sync?)
> > >
> > > > That said, I suppose it is possible for us to bump the counter, hand
> > > > that new counter value out to a NFS client and then the box crashes
> > > > before it makes it to the journal.
> > > >
> > > > Not sure how big a problem that really is.
> > >
> > > The other case I was wondering about may have been unclear. Represent
> > > the state of a file by a (data, i_version) pair. Say:
> > >
> > > - file is modified from (F, V) to (F', V+1).
> > > - client reads and caches (F', V+1).
> > > - server crashes before writeback, so disk still has (F, V).
> > > - after restart, someone else modifies file to (F'', V+1).
> > > - original client revalidates its cache, sees V+1, concludes
> > > file data is still F'.
> > >
> > > This may not cause a real problem for clients depending only on
> > > traditional NFS close-to-open semantics.
> > >
> > >
> >
> > No, I think that is a legitimate problem.
> >
> > That said, after F'', the mtime would almost certainly be different
> > from the time after F', and that would likely be enough to prevent
> > confusion in NFS clients.
>
> Oh, good point. So, may be worth saying that anyone wanting to make
> sense of these across reboot should compare times as well (maybe that
> should be in nfs rfc's too). I think that should be ctime not mtime,
> though?
>
Yes, it might be worth a mention there. IIRC, it does mention that you
shouldn't just look at a single attribute for cache validation
purposes, but the wording is a bit vague. I can't find the section at
the moment though.
The more I think about it though, simply ensuring that we don't publish
a new change attr until the inode update has hit the journal may be
the best we can do. I'd have to think about how to implement that
though.
--
Jeff Layton <[email protected]>
On Wed, 2017-03-22 at 08:45 +1100, Dave Chinner wrote:
> On Tue, Mar 21, 2017 at 01:23:24PM -0400, Jeff Layton wrote:
> > On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote:
> > > - It's durable; the above comparison still works if there were reboots
> > > between the two i_version checks.
> > > - I don't know how realistic this is--we may need to figure out
> > > if there's a weaker guarantee that's still useful. Do
> > > filesystems actually make ctime/mtime/i_version changes
> > > atomically with the changes that caused them? What if a
> > > change attribute is exposed to an NFS client but doesn't make
> > > it to disk, and then that value is reused after reboot?
> > >
> >
> > Yeah, there could be atomicity there. If we bump i_version, we'll mark
> > the inode dirty and I think that will end up with the new i_version at
> > least being journalled before __mark_inode_dirty returns.
>
> The change may be journalled, but it isn't guaranteed stable until
> fsync is run on the inode.
>
> NFS server operations commit the metadata changed by a modification
> through ->commit_metadata or sync_inode_metadata() before the
> response is sent back to the client, hence guaranteeing that
> i_version changes through the NFS server are stable and durable.
>
> This is not the case for normal operations done through the POSIX
> API - the journalling is asynchronous and the only durability
> guarantees are provided by fsync()....
>
Ahh ok, I missed that...thanks.
I think we'll have a hard time making this fully atomic. We may end up
having to settle for something less (and doing our best to warn users
of that possibility).
One idea might be to tie the behavior to AT_FORCE/DONT_SYNC. In the
don't sync case, allow the kernel to hand out the i_version without
syncing it to disk. In the FORCE_SYNC case, do an fsync internally
before returning.
> > That said, I suppose it is possible for us to bump the counter, hand
> > that new counter value out to a NFS client and then the box crashes
> > before it makes it to the journal.
>
> Yup, this has aways been a problem when you mix posix applications
> running on the NFS server modifying the same files as the NFS
> clients are accessing and requiring synchronisation.
>
> > Not sure how big a problem that really is.
>
> This coherency problem has always existed on the server side...
>
Yes. I don't think this patchset makes anything worse in this regard.
We will need well-defined semantics here before i_version can be
exposed to userland via statx however.
--
Jeff Layton <[email protected]>
On Tue 21-03-17 14:46:53, Jeff Layton wrote:
> On Tue, 2017-03-21 at 14:30 -0400, J. Bruce Fields wrote:
> > On Tue, Mar 21, 2017 at 01:23:24PM -0400, Jeff Layton wrote:
> > > On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote:
> > > > - It's durable; the above comparison still works if there were reboots
> > > > between the two i_version checks.
> > > > - I don't know how realistic this is--we may need to figure out
> > > > if there's a weaker guarantee that's still useful. Do
> > > > filesystems actually make ctime/mtime/i_version changes
> > > > atomically with the changes that caused them? What if a
> > > > change attribute is exposed to an NFS client but doesn't make
> > > > it to disk, and then that value is reused after reboot?
> > > >
> > >
> > > Yeah, there could be atomicity there. If we bump i_version, we'll mark
> > > the inode dirty and I think that will end up with the new i_version at
> > > least being journalled before __mark_inode_dirty returns.
> >
> > So you think the filesystem can provide the atomicity? In more detail:
> >
>
> Sorry, I hit send too quickly. That should have read:
>
> "Yeah, there could be atomicity issues there."
>
> I think providing that level of atomicity may be difficult, though
> maybe there's some way to make the querying of i_version block until
> the inode update has been journalled?
Just to complement what Dave said from ext4 side - similarly as with XFS
ext4 doesn't guarantee atomicity unless fsync() has completed on the file.
Until that you can see arbitrary combination of data & i_version after the
crash. We do take care to keep data and metadata in sync only when there
are security implications to that (like exposing uninitialized disk blocks)
and if not, we are as lazy as we can to improve performance...
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Wed, 2017-03-29 at 13:15 +0200, Jan Kara wrote:
> On Tue 21-03-17 14:46:53, Jeff Layton wrote:
> > On Tue, 2017-03-21 at 14:30 -0400, J. Bruce Fields wrote:
> > > On Tue, Mar 21, 2017 at 01:23:24PM -0400, Jeff Layton wrote:
> > > > On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote:
> > > > > - It's durable; the above comparison still works if there were reboots
> > > > > between the two i_version checks.
> > > > > - I don't know how realistic this is--we may need to figure out
> > > > > if there's a weaker guarantee that's still useful. Do
> > > > > filesystems actually make ctime/mtime/i_version changes
> > > > > atomically with the changes that caused them? What if a
> > > > > change attribute is exposed to an NFS client but doesn't make
> > > > > it to disk, and then that value is reused after reboot?
> > > > >
> > > >
> > > > Yeah, there could be atomicity there. If we bump i_version, we'll mark
> > > > the inode dirty and I think that will end up with the new i_version at
> > > > least being journalled before __mark_inode_dirty returns.
> > >
> > > So you think the filesystem can provide the atomicity? In more detail:
> > >
> >
> > Sorry, I hit send too quickly. That should have read:
> >
> > "Yeah, there could be atomicity issues there."
> >
> > I think providing that level of atomicity may be difficult, though
> > maybe there's some way to make the querying of i_version block until
> > the inode update has been journalled?
>
> Just to complement what Dave said from ext4 side - similarly as with XFS
> ext4 doesn't guarantee atomicity unless fsync() has completed on the file.
> Until that you can see arbitrary combination of data & i_version after the
> crash. We do take care to keep data and metadata in sync only when there
> are security implications to that (like exposing uninitialized disk blocks)
> and if not, we are as lazy as we can to improve performance...
>
>
Yeah, I think what we'll have to do here is ensure that those
filesystems do an fsync prior to reporting the i_version getattr
codepath. It's not pretty, but I don't see a real alternative.
--
Jeff Layton <[email protected]>
On Wed, Mar 29, 2017 at 01:54:31PM -0400, Jeff Layton wrote:
> On Wed, 2017-03-29 at 13:15 +0200, Jan Kara wrote:
> > On Tue 21-03-17 14:46:53, Jeff Layton wrote:
> > > On Tue, 2017-03-21 at 14:30 -0400, J. Bruce Fields wrote:
> > > > On Tue, Mar 21, 2017 at 01:23:24PM -0400, Jeff Layton wrote:
> > > > > On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote:
> > > > > > - It's durable; the above comparison still works if there were reboots
> > > > > > between the two i_version checks.
> > > > > > - I don't know how realistic this is--we may need to figure out
> > > > > > if there's a weaker guarantee that's still useful. Do
> > > > > > filesystems actually make ctime/mtime/i_version changes
> > > > > > atomically with the changes that caused them? What if a
> > > > > > change attribute is exposed to an NFS client but doesn't make
> > > > > > it to disk, and then that value is reused after reboot?
> > > > > >
> > > > >
> > > > > Yeah, there could be atomicity there. If we bump i_version, we'll mark
> > > > > the inode dirty and I think that will end up with the new i_version at
> > > > > least being journalled before __mark_inode_dirty returns.
> > > >
> > > > So you think the filesystem can provide the atomicity? In more detail:
> > > >
> > >
> > > Sorry, I hit send too quickly. That should have read:
> > >
> > > "Yeah, there could be atomicity issues there."
> > >
> > > I think providing that level of atomicity may be difficult, though
> > > maybe there's some way to make the querying of i_version block until
> > > the inode update has been journalled?
> >
> > Just to complement what Dave said from ext4 side - similarly as with XFS
> > ext4 doesn't guarantee atomicity unless fsync() has completed on the file.
> > Until that you can see arbitrary combination of data & i_version after the
> > crash. We do take care to keep data and metadata in sync only when there
> > are security implications to that (like exposing uninitialized disk blocks)
> > and if not, we are as lazy as we can to improve performance...
> >
> >
>
> Yeah, I think what we'll have to do here is ensure that those
> filesystems do an fsync prior to reporting the i_version getattr
> codepath. It's not pretty, but I don't see a real alternative.
I think that's even more problematic. ->getattr currently runs
completely unlocked for performance reasons - it's racy w.r.t. to
ongoing modifications to begin with, so /nothing/ that is returned
to userspace via stat/statx can be guaranteed to be "coherent".
Linus will be very unhappy if you make his git workload (which is
/very/ stat heavy) run slower by adding any sort of locking in this
hot path.
Even if we did put an fsync() into ->getattr() (and dealt with all
the locking issues that entails), by the time the statx syscall
returns to userspace the i_version value may not match the
data/metadata in the inode(*). IOWs, by the time i_version gets
to userspace, it is out of date and any use of it for data
versioning from userspace is going to be prone to race conditions.
Cheers,
Dave.
(*) fiemap has exactly the same "stale the moment internal fs
locks are released" race conditions, which is why it cannot safely
be used for mapping holes when copying file data....
--
Dave Chinner
[email protected]
On Wed 29-03-17 13:54:31, Jeff Layton wrote:
> On Wed, 2017-03-29 at 13:15 +0200, Jan Kara wrote:
> > On Tue 21-03-17 14:46:53, Jeff Layton wrote:
> > > On Tue, 2017-03-21 at 14:30 -0400, J. Bruce Fields wrote:
> > > > On Tue, Mar 21, 2017 at 01:23:24PM -0400, Jeff Layton wrote:
> > > > > On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote:
> > > > > > - It's durable; the above comparison still works if there were reboots
> > > > > > between the two i_version checks.
> > > > > > - I don't know how realistic this is--we may need to figure out
> > > > > > if there's a weaker guarantee that's still useful. Do
> > > > > > filesystems actually make ctime/mtime/i_version changes
> > > > > > atomically with the changes that caused them? What if a
> > > > > > change attribute is exposed to an NFS client but doesn't make
> > > > > > it to disk, and then that value is reused after reboot?
> > > > > >
> > > > >
> > > > > Yeah, there could be atomicity there. If we bump i_version, we'll mark
> > > > > the inode dirty and I think that will end up with the new i_version at
> > > > > least being journalled before __mark_inode_dirty returns.
> > > >
> > > > So you think the filesystem can provide the atomicity? In more detail:
> > > >
> > >
> > > Sorry, I hit send too quickly. That should have read:
> > >
> > > "Yeah, there could be atomicity issues there."
> > >
> > > I think providing that level of atomicity may be difficult, though
> > > maybe there's some way to make the querying of i_version block until
> > > the inode update has been journalled?
> >
> > Just to complement what Dave said from ext4 side - similarly as with XFS
> > ext4 doesn't guarantee atomicity unless fsync() has completed on the file.
> > Until that you can see arbitrary combination of data & i_version after the
> > crash. We do take care to keep data and metadata in sync only when there
> > are security implications to that (like exposing uninitialized disk blocks)
> > and if not, we are as lazy as we can to improve performance...
> >
> >
>
> Yeah, I think what we'll have to do here is ensure that those
> filesystems do an fsync prior to reporting the i_version getattr
> codepath. It's not pretty, but I don't see a real alternative.
Hum, so are we fine if i_version just changes (increases) for all inodes
after a server crash? If I understand its use right, it would mean
invalidation of all client's caches but that is not such a big deal given
how frequent server crashes should be, right?
Because if above is acceptable we could make reported i_version to be a sum
of "superblock crash counter" and "inode i_version". We increment
"superblock crash counter" whenever we detect unclean filesystem shutdown.
That way after a crash we are guaranteed each inode will report new
i_version (the sum would probably have to look like "superblock crash
counter" * 65536 + "inode i_version" so that we avoid reusing possible
i_version numbers we gave away but did not write to disk but still...).
Thoughts?
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Thu, 2017-03-30 at 08:47 +0200, Jan Kara wrote:
> On Wed 29-03-17 13:54:31, Jeff Layton wrote:
> > On Wed, 2017-03-29 at 13:15 +0200, Jan Kara wrote:
> > > On Tue 21-03-17 14:46:53, Jeff Layton wrote:
> > > > On Tue, 2017-03-21 at 14:30 -0400, J. Bruce Fields wrote:
> > > > > On Tue, Mar 21, 2017 at 01:23:24PM -0400, Jeff Layton wrote:
> > > > > > On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote:
> > > > > > > - It's durable; the above comparison still works if there were reboots
> > > > > > > between the two i_version checks.
> > > > > > > - I don't know how realistic this is--we may need to figure out
> > > > > > > if there's a weaker guarantee that's still useful. Do
> > > > > > > filesystems actually make ctime/mtime/i_version changes
> > > > > > > atomically with the changes that caused them? What if a
> > > > > > > change attribute is exposed to an NFS client but doesn't make
> > > > > > > it to disk, and then that value is reused after reboot?
> > > > > > >
> > > > > >
> > > > > > Yeah, there could be atomicity there. If we bump i_version, we'll mark
> > > > > > the inode dirty and I think that will end up with the new i_version at
> > > > > > least being journalled before __mark_inode_dirty returns.
> > > > >
> > > > > So you think the filesystem can provide the atomicity? In more detail:
> > > > >
> > > >
> > > > Sorry, I hit send too quickly. That should have read:
> > > >
> > > > "Yeah, there could be atomicity issues there."
> > > >
> > > > I think providing that level of atomicity may be difficult, though
> > > > maybe there's some way to make the querying of i_version block until
> > > > the inode update has been journalled?
> > >
> > > Just to complement what Dave said from ext4 side - similarly as with XFS
> > > ext4 doesn't guarantee atomicity unless fsync() has completed on the file.
> > > Until that you can see arbitrary combination of data & i_version after the
> > > crash. We do take care to keep data and metadata in sync only when there
> > > are security implications to that (like exposing uninitialized disk blocks)
> > > and if not, we are as lazy as we can to improve performance...
> > >
> > >
> >
> > Yeah, I think what we'll have to do here is ensure that those
> > filesystems do an fsync prior to reporting the i_version getattr
> > codepath. It's not pretty, but I don't see a real alternative.
>
> Hum, so are we fine if i_version just changes (increases) for all inodes
> after a server crash? If I understand its use right, it would mean
> invalidation of all client's caches but that is not such a big deal given
> how frequent server crashes should be, right?
>
> Because if above is acceptable we could make reported i_version to be a sum
> of "superblock crash counter" and "inode i_version". We increment
> "superblock crash counter" whenever we detect unclean filesystem shutdown.
> That way after a crash we are guaranteed each inode will report new
> i_version (the sum would probably have to look like "superblock crash
> counter" * 65536 + "inode i_version" so that we avoid reusing possible
> i_version numbers we gave away but did not write to disk but still...).
> Thoughts?
>
That does sound like a good idea. This is a 64 bit value, so we should
be able to carve out some upper bits for a crash counter without risking
wrapping.
The other constraint here is that we'd like any later version of the
counter to be larger than any earlier value that was handed out. I think
this idea would still satisfy that.
--
Jeff Layton <[email protected]>
On Thu, 2017-03-30 at 10:41 +1100, Dave Chinner wrote:
> On Wed, Mar 29, 2017 at 01:54:31PM -0400, Jeff Layton wrote:
> > On Wed, 2017-03-29 at 13:15 +0200, Jan Kara wrote:
> > > On Tue 21-03-17 14:46:53, Jeff Layton wrote:
> > > > On Tue, 2017-03-21 at 14:30 -0400, J. Bruce Fields wrote:
> > > > > On Tue, Mar 21, 2017 at 01:23:24PM -0400, Jeff Layton wrote:
> > > > > > On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote:
> > > > > > > - It's durable; the above comparison still works if there were reboots
> > > > > > > between the two i_version checks.
> > > > > > > - I don't know how realistic this is--we may need to figure out
> > > > > > > if there's a weaker guarantee that's still useful. Do
> > > > > > > filesystems actually make ctime/mtime/i_version changes
> > > > > > > atomically with the changes that caused them? What if a
> > > > > > > change attribute is exposed to an NFS client but doesn't make
> > > > > > > it to disk, and then that value is reused after reboot?
> > > > > > >
> > > > > >
> > > > > > Yeah, there could be atomicity there. If we bump i_version, we'll mark
> > > > > > the inode dirty and I think that will end up with the new i_version at
> > > > > > least being journalled before __mark_inode_dirty returns.
> > > > >
> > > > > So you think the filesystem can provide the atomicity? In more detail:
> > > > >
> > > >
> > > > Sorry, I hit send too quickly. That should have read:
> > > >
> > > > "Yeah, there could be atomicity issues there."
> > > >
> > > > I think providing that level of atomicity may be difficult, though
> > > > maybe there's some way to make the querying of i_version block until
> > > > the inode update has been journalled?
> > >
> > > Just to complement what Dave said from ext4 side - similarly as with XFS
> > > ext4 doesn't guarantee atomicity unless fsync() has completed on the file.
> > > Until that you can see arbitrary combination of data & i_version after the
> > > crash. We do take care to keep data and metadata in sync only when there
> > > are security implications to that (like exposing uninitialized disk blocks)
> > > and if not, we are as lazy as we can to improve performance...
> > >
> > >
> >
> > Yeah, I think what we'll have to do here is ensure that those
> > filesystems do an fsync prior to reporting the i_version getattr
> > codepath. It's not pretty, but I don't see a real alternative.
>
> I think that's even more problematic. ->getattr currently runs
> completely unlocked for performance reasons - it's racy w.r.t. to
> ongoing modifications to begin with, so /nothing/ that is returned
> to userspace via stat/statx can be guaranteed to be "coherent".
> Linus will be very unhappy if you make his git workload (which is
> /very/ stat heavy) run slower by adding any sort of locking in this
> hot path.
>
> Even if we did put an fsync() into ->getattr() (and dealt with all
> the locking issues that entails), by the time the statx syscall
> returns to userspace the i_version value may not match the
> data/metadata in the inode(*). IOWs, by the time i_version gets
> to userspace, it is out of date and any use of it for data
> versioning from userspace is going to be prone to race conditions.
>
> Cheers,
>
> Dave.
>
> (*) fiemap has exactly the same "stale the moment internal fs
> locks are released" race conditions, which is why it cannot safely
> be used for mapping holes when copying file data....
>
FWIW, I'm not terribly worried about atomicity for all of the reasons
you describe. My main concern is reusing an i_version value that has
already been handed out when the inode is now in an entirely different
state. To that end, I was only considering doing an fsync iff
STATX_VERSION was requested. If it wasn't we wouldn't need to do one.
A lot of ->getattr implementations already call filemap_write_and_wait,
but you're correct that flushing out the metadata is different matter.Â
It'd be nice to avoid that if we can.
--
Jeff Layton <[email protected]>
On Thu, Mar 30, 2017 at 07:11:48AM -0400, Jeff Layton wrote:
> On Thu, 2017-03-30 at 08:47 +0200, Jan Kara wrote:
> > Hum, so are we fine if i_version just changes (increases) for all inodes
> > after a server crash? If I understand its use right, it would mean
> > invalidation of all client's caches but that is not such a big deal given
> > how frequent server crashes should be, right?
Even if it's rare, it may be really painful when all your clients are
forced to throw out and repopulate their caches after a crash. But,
yes, maybe we can live with it.
> > Because if above is acceptable we could make reported i_version to be a sum
> > of "superblock crash counter" and "inode i_version". We increment
> > "superblock crash counter" whenever we detect unclean filesystem shutdown.
> > That way after a crash we are guaranteed each inode will report new
> > i_version (the sum would probably have to look like "superblock crash
> > counter" * 65536 + "inode i_version" so that we avoid reusing possible
> > i_version numbers we gave away but did not write to disk but still...).
> > Thoughts?
How hard is this for filesystems to support? Do they need an on-disk
format change to keep track of the crash counter? Maybe not, maybe the
high bits of the i_version counters are all they need.
> That does sound like a good idea. This is a 64 bit value, so we should
> be able to carve out some upper bits for a crash counter without risking
> wrapping.
>
> The other constraint here is that we'd like any later version of the
> counter to be larger than any earlier value that was handed out. I think
> this idea would still satisfy that.
I guess we just want to have some back-of-the-envelope estimates of
maximum number of i_version increments possible between crashes and
maximum number of crashes possible over lifetime of a filesystem, to
decide how to split up the bits.
I wonder if we could get away with using the new crash counter only for
*new* values of the i_version? After a crash, use the on disk i_version
as is, and put off using the new crash counter until the next time the
file's modified.
That would still eliminate the risk of accidental reuse of an old
i_version value. It still leaves some cases where the client could fail
to notice an update indefinitely. All these cases I think have to
assume that a writer made some changes that it failed to ever sync, so
as long as we care only about close-to-open semantics perhaps those
cases don't matter.
I wonder if repeated crashes can lead to any odd corner cases.
--b.
On Thu, 2017-03-30 at 12:12 -0400, J. Bruce Fields wrote:
> On Thu, Mar 30, 2017 at 07:11:48AM -0400, Jeff Layton wrote:
> > On Thu, 2017-03-30 at 08:47 +0200, Jan Kara wrote:
> > > Hum, so are we fine if i_version just changes (increases) for all inodes
> > > after a server crash? If I understand its use right, it would mean
> > > invalidation of all client's caches but that is not such a big deal given
> > > how frequent server crashes should be, right?
>
> Even if it's rare, it may be really painful when all your clients are
> forced to throw out and repopulate their caches after a crash. But,
> yes, maybe we can live with it.
>
Yeah, assuming that normal reboots wouldn't cause this, then I don't see
it as being too bad.
> > > Because if above is acceptable we could make reported i_version to be a sum
> > > of "superblock crash counter" and "inode i_version". We increment
> > > "superblock crash counter" whenever we detect unclean filesystem shutdown.
> > > That way after a crash we are guaranteed each inode will report new
> > > i_version (the sum would probably have to look like "superblock crash
> > > counter" * 65536 + "inode i_version" so that we avoid reusing possible
> > > i_version numbers we gave away but did not write to disk but still...).
> > > Thoughts?
>
> How hard is this for filesystems to support? Do they need an on-disk
> format change to keep track of the crash counter? Maybe not, maybe the
> high bits of the i_version counters are all they need.
>
Yeah, I imagine we'd need a on-disk change for this unless there's
something already present that we could use in place of a crash counter.
> > That does sound like a good idea. This is a 64 bit value, so we should
> > be able to carve out some upper bits for a crash counter without risking
> > wrapping.
> >
> > The other constraint here is that we'd like any later version of the
> > counter to be larger than any earlier value that was handed out. I think
> > this idea would still satisfy that.
>
> I guess we just want to have some back-of-the-envelope estimates of
> maximum number of i_version increments possible between crashes and
> maximum number of crashes possible over lifetime of a filesystem, to
> decide how to split up the bits.
>
> I wonder if we could get away with using the new crash counter only for
> *new* values of the i_version? After a crash, use the on disk i_version
> as is, and put off using the new crash counter until the next time the
> file's modified.
>
That sounds difficult to get right. Suppose I have an inode that has not
been updated in a long time. Someone writes to it and then queries the
i_version. How do I know whether there were crashes since the last time
I updated it? Or am I misunderstanding what you're proposing here?
> That would still eliminate the risk of accidental reuse of an old
> i_version value. It still leaves some cases where the client could fail
> to notice an update indefinitely. All these cases I think have to
> assume that a writer made some changes that it failed to ever sync, so
> as long as we care only about close-to-open semantics perhaps those
> cases don't matter.
>
> I wonder if repeated crashes can lead to any odd corner cases.
>
--
Jeff Layton <[email protected]>
On 03/30/2017 09:35 PM, Jeff Layton wrote:
<>
> Yeah, I imagine we'd need a on-disk change for this unless there's
> something already present that we could use in place of a crash counter.
>
Perhaps we can use s_mtime and/or s_wtime in some way, I'm not sure
what is a parallel for that in xfs.
s_mtime - time-of-last mount
s_wtime - time-of-last mount, umount, freez, unfreez, remount, ...
Of course you'll need a per FS vector to access that.
But this will need some math foo to get the bits compacted correctly
just a thought.
Thanks
Boaz
On Thu, Mar 30, 2017 at 12:12:31PM -0400, J. Bruce Fields wrote:
> On Thu, Mar 30, 2017 at 07:11:48AM -0400, Jeff Layton wrote:
> > On Thu, 2017-03-30 at 08:47 +0200, Jan Kara wrote:
> > > Because if above is acceptable we could make reported i_version to be a sum
> > > of "superblock crash counter" and "inode i_version". We increment
> > > "superblock crash counter" whenever we detect unclean filesystem shutdown.
> > > That way after a crash we are guaranteed each inode will report new
> > > i_version (the sum would probably have to look like "superblock crash
> > > counter" * 65536 + "inode i_version" so that we avoid reusing possible
> > > i_version numbers we gave away but did not write to disk but still...).
> > > Thoughts?
>
> How hard is this for filesystems to support? Do they need an on-disk
> format change to keep track of the crash counter?
Yes. We'll need version counter in the superblock, and we'll need to
know what the increment semantics are.
The big question is how do we know there was a crash? The only thing
a journalling filesystem knows at mount time is whether it is clean
or requires recovery. Filesystems can require recovery for many
reasons that don't involve a crash (e.g. root fs is never unmounted
cleanly, so always requires recovery). Further, some filesystems may
not even know there was a crash at mount time because their
architecture always leaves a consistent filesystem on disk (e.g. COW
filesystems)....
> I wonder if repeated crashes can lead to any odd corner cases.
WIthout defined, locked down behavour of the superblock counter, the
almost certainly corner cases will exist...
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Sun 02-04-17 09:05:26, Dave Chinner wrote:
> On Thu, Mar 30, 2017 at 12:12:31PM -0400, J. Bruce Fields wrote:
> > On Thu, Mar 30, 2017 at 07:11:48AM -0400, Jeff Layton wrote:
> > > On Thu, 2017-03-30 at 08:47 +0200, Jan Kara wrote:
> > > > Because if above is acceptable we could make reported i_version to be a sum
> > > > of "superblock crash counter" and "inode i_version". We increment
> > > > "superblock crash counter" whenever we detect unclean filesystem shutdown.
> > > > That way after a crash we are guaranteed each inode will report new
> > > > i_version (the sum would probably have to look like "superblock crash
> > > > counter" * 65536 + "inode i_version" so that we avoid reusing possible
> > > > i_version numbers we gave away but did not write to disk but still...).
> > > > Thoughts?
> >
> > How hard is this for filesystems to support? Do they need an on-disk
> > format change to keep track of the crash counter?
>
> Yes. We'll need version counter in the superblock, and we'll need to
> know what the increment semantics are.
>
> The big question is how do we know there was a crash? The only thing
> a journalling filesystem knows at mount time is whether it is clean
> or requires recovery. Filesystems can require recovery for many
> reasons that don't involve a crash (e.g. root fs is never unmounted
> cleanly, so always requires recovery). Further, some filesystems may
> not even know there was a crash at mount time because their
> architecture always leaves a consistent filesystem on disk (e.g. COW
> filesystems)....
What filesystems can or cannot easily do obviously differs. Ext4 has a
recovery flag set in superblock on RW mount/remount and cleared on
umount/RO remount. This flag being set on mount would imply incrementing
the crash counter. It should be pretty easy for each filesystem to
implement such flag and the counter but I agree it requires an on-disk
format change.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Mon, Apr 03, 2017 at 04:00:55PM +0200, Jan Kara wrote:
> On Sun 02-04-17 09:05:26, Dave Chinner wrote:
> > On Thu, Mar 30, 2017 at 12:12:31PM -0400, J. Bruce Fields wrote:
> > > On Thu, Mar 30, 2017 at 07:11:48AM -0400, Jeff Layton wrote:
> > > > On Thu, 2017-03-30 at 08:47 +0200, Jan Kara wrote:
> > > > > Because if above is acceptable we could make reported i_version to be a sum
> > > > > of "superblock crash counter" and "inode i_version". We increment
> > > > > "superblock crash counter" whenever we detect unclean filesystem shutdown.
> > > > > That way after a crash we are guaranteed each inode will report new
> > > > > i_version (the sum would probably have to look like "superblock crash
> > > > > counter" * 65536 + "inode i_version" so that we avoid reusing possible
> > > > > i_version numbers we gave away but did not write to disk but still...).
> > > > > Thoughts?
> > >
> > > How hard is this for filesystems to support? Do they need an on-disk
> > > format change to keep track of the crash counter?
> >
> > Yes. We'll need version counter in the superblock, and we'll need to
> > know what the increment semantics are.
> >
> > The big question is how do we know there was a crash? The only thing
> > a journalling filesystem knows at mount time is whether it is clean
> > or requires recovery. Filesystems can require recovery for many
> > reasons that don't involve a crash (e.g. root fs is never unmounted
> > cleanly, so always requires recovery). Further, some filesystems may
> > not even know there was a crash at mount time because their
> > architecture always leaves a consistent filesystem on disk (e.g. COW
> > filesystems)....
>
> What filesystems can or cannot easily do obviously differs. Ext4 has a
> recovery flag set in superblock on RW mount/remount and cleared on
> umount/RO remount.
Even this doesn't help. A recent bug that was reported to the XFS
list - turns out that systemd can't remount-ro the root
filesystem sucessfully on shutdown because there are open write fds
on the root filesystem when it attempts the remount. So it just
reboots without a remount-ro. This uncovered a bug in grub in
that it (still!) thinks sync(1) is sufficient to get all the
metadata that points to a kernel image onto disk in places it can
read. XFS, like ext4, leaves it in the journal and so the system then fails to
boot because systemd didn't remount-ro the root fs and hence the
journal was never flushed before reboot and so grub can't find the
kernel and so everything fails....
> This flag being set on mount would imply incrementing the crash
> counter. It should be pretty easy for each filesystem to implement
> such flag and the counter but I agree it requires an on-disk
> format change.
Yup, anything we want that is persistent and consistent across
filesystems will need on-disk format changes. Hence we need a solid
specification first, not to mention tests to validate correct
behaviour across all filesystems in xfstests...
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Tue, Apr 04, 2017 at 10:34:14PM +1000, Dave Chinner wrote:
> On Mon, Apr 03, 2017 at 04:00:55PM +0200, Jan Kara wrote:
> > What filesystems can or cannot easily do obviously differs. Ext4 has a
> > recovery flag set in superblock on RW mount/remount and cleared on
> > umount/RO remount.
>
> Even this doesn't help. A recent bug that was reported to the XFS
> list - turns out that systemd can't remount-ro the root
> filesystem sucessfully on shutdown because there are open write fds
> on the root filesystem when it attempts the remount. So it just
> reboots without a remount-ro.
I'd certainly rather not invalidate caches on *every* boot.
On the other hand, if the only cases involve the root filesystem, then
from the point of view of NFS, we probably don't care much.
> > This flag being set on mount would imply incrementing the crash
> > counter. It should be pretty easy for each filesystem to implement
> > such flag and the counter but I agree it requires an on-disk
> > format change.
>
> Yup, anything we want that is persistent and consistent across
> filesystems will need on-disk format changes. Hence we need a solid
> specification first, not to mention tests to validate correct
> behaviour across all filesystems in xfstests...
For xfstests we'll need a way to get i_version (patch it into statx, I
guess?). Ideally we'd have a way to test behavior across crashes too,
any advice there?
--b.
On Thu, Mar 30, 2017 at 02:35:32PM -0400, Jeff Layton wrote:
> On Thu, 2017-03-30 at 12:12 -0400, J. Bruce Fields wrote:
> > On Thu, Mar 30, 2017 at 07:11:48AM -0400, Jeff Layton wrote:
> > > On Thu, 2017-03-30 at 08:47 +0200, Jan Kara wrote:
> > > > Because if above is acceptable we could make reported i_version to be a sum
> > > > of "superblock crash counter" and "inode i_version". We increment
> > > > "superblock crash counter" whenever we detect unclean filesystem shutdown.
> > > > That way after a crash we are guaranteed each inode will report new
> > > > i_version (the sum would probably have to look like "superblock crash
> > > > counter" * 65536 + "inode i_version" so that we avoid reusing possible
> > > > i_version numbers we gave away but did not write to disk but still...).
> > > > Thoughts?
> >
> > How hard is this for filesystems to support? Do they need an on-disk
> > format change to keep track of the crash counter? Maybe not, maybe the
> > high bits of the i_version counters are all they need.
> >
>
> Yeah, I imagine we'd need a on-disk change for this unless there's
> something already present that we could use in place of a crash counter.
We could consider using the current time instead. So, put the current
time (or time of last boot, or this inode's ctime, or something) in the
high bits of the change attribute, and keep the low bits as a counter.
Then as long as we trust our clock, we're no longer at risk of reusing
an i_version value.
> > I guess we just want to have some back-of-the-envelope estimates of
> > maximum number of i_version increments possible between crashes and
> > maximum number of crashes possible over lifetime of a filesystem, to
> > decide how to split up the bits.
> >
> > I wonder if we could get away with using the new crash counter only for
> > *new* values of the i_version? After a crash, use the on disk i_version
> > as is, and put off using the new crash counter until the next time the
> > file's modified.
> >
>
> That sounds difficult to get right. Suppose I have an inode that has not
> been updated in a long time. Someone writes to it and then queries the
> i_version. How do I know whether there were crashes since the last time
> I updated it? Or am I misunderstanding what you're proposing here?
I believe Jan was suggesting that we keep the i_version as-is on disk
but combine with with the crash counter when it's queried.
I was suggesting instead that on write, when we bump the i_version, we
replace the simple increment by something that increments *and* sticks
the current crash counter (or maybe just a time) in the high bits. And
that combination will be what goes in i_version and is written to disk.
That guarantees that we never reuse an old value when we increment.
It also avoids having to invalidate absolutely every cache, even of
completely static files.
Clients could still see the change attribute go backwards, though, if
they query a file with dirty data and the server crashes before it
writes out the new change attribute.
Also, they could fail to detect data that reverted after boot if they
cache data while the file's dirty on the server side, and the server
crash preserves the i_version update but not the new data.
Presumably both are possible already for ctime (depending on the
filesystem), and I'm not convinced they're a big problem.
In both cases the reading client is caching while somebody's still
writing, and as long as the writer comes back and finishes its job,
readers will thereafter see the right thing. So I think it's adequate
for close-to-open.
--b.
On Thu, Mar 30, 2017 at 10:41:37AM +1100, Dave Chinner wrote:
> On Wed, Mar 29, 2017 at 01:54:31PM -0400, Jeff Layton wrote:
> > On Wed, 2017-03-29 at 13:15 +0200, Jan Kara wrote:
> > > On Tue 21-03-17 14:46:53, Jeff Layton wrote:
> > > > On Tue, 2017-03-21 at 14:30 -0400, J. Bruce Fields wrote:
> > > > > On Tue, Mar 21, 2017 at 01:23:24PM -0400, Jeff Layton wrote:
> > > > > > On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote:
> > > > > > > - It's durable; the above comparison still works if there were reboots
> > > > > > > between the two i_version checks.
> > > > > > > - I don't know how realistic this is--we may need to figure out
> > > > > > > if there's a weaker guarantee that's still useful. Do
> > > > > > > filesystems actually make ctime/mtime/i_version changes
> > > > > > > atomically with the changes that caused them? What if a
> > > > > > > change attribute is exposed to an NFS client but doesn't make
> > > > > > > it to disk, and then that value is reused after reboot?
> > > > > > >
> > > > > >
> > > > > > Yeah, there could be atomicity there. If we bump i_version, we'll mark
> > > > > > the inode dirty and I think that will end up with the new i_version at
> > > > > > least being journalled before __mark_inode_dirty returns.
> > > > >
> > > > > So you think the filesystem can provide the atomicity? In more detail:
> > > > >
> > > >
> > > > Sorry, I hit send too quickly. That should have read:
> > > >
> > > > "Yeah, there could be atomicity issues there."
> > > >
> > > > I think providing that level of atomicity may be difficult, though
> > > > maybe there's some way to make the querying of i_version block until
> > > > the inode update has been journalled?
> > >
> > > Just to complement what Dave said from ext4 side - similarly as with XFS
> > > ext4 doesn't guarantee atomicity unless fsync() has completed on the file.
> > > Until that you can see arbitrary combination of data & i_version after the
> > > crash. We do take care to keep data and metadata in sync only when there
> > > are security implications to that (like exposing uninitialized disk blocks)
> > > and if not, we are as lazy as we can to improve performance...
> > >
> > >
> >
> > Yeah, I think what we'll have to do here is ensure that those
> > filesystems do an fsync prior to reporting the i_version getattr
> > codepath. It's not pretty, but I don't see a real alternative.
>
> I think that's even more problematic. ->getattr currently runs
> completely unlocked for performance reasons - it's racy w.r.t. to
> ongoing modifications to begin with, so /nothing/ that is returned
> to userspace via stat/statx can be guaranteed to be "coherent".
> Linus will be very unhappy if you make his git workload (which is
> /very/ stat heavy) run slower by adding any sort of locking in this
> hot path.
>
> Even if we did put an fsync() into ->getattr() (and dealt with all
> the locking issues that entails), by the time the statx syscall
> returns to userspace the i_version value may not match the
> data/metadata in the inode(*). IOWs, by the time i_version gets
> to userspace, it is out of date and any use of it for data
> versioning from userspace is going to be prone to race conditions.
A slightly out-of-date i_version is fine, I think. It's just the
reverse we want to avoid. E.g., assuming an i_version-supporting
statux, if somebody could called statx then read, and got the new
i_version followed by the old data, that would cause problems.
--b.
On Tue, Apr 04 2017, Dave Chinner wrote:
> On Mon, Apr 03, 2017 at 04:00:55PM +0200, Jan Kara wrote:
>> On Sun 02-04-17 09:05:26, Dave Chinner wrote:
>> > On Thu, Mar 30, 2017 at 12:12:31PM -0400, J. Bruce Fields wrote:
>> > > On Thu, Mar 30, 2017 at 07:11:48AM -0400, Jeff Layton wrote:
>> > > > On Thu, 2017-03-30 at 08:47 +0200, Jan Kara wrote:
>> > > > > Because if above is acceptable we could make reported i_version to be a sum
>> > > > > of "superblock crash counter" and "inode i_version". We increment
>> > > > > "superblock crash counter" whenever we detect unclean filesystem shutdown.
>> > > > > That way after a crash we are guaranteed each inode will report new
>> > > > > i_version (the sum would probably have to look like "superblock crash
>> > > > > counter" * 65536 + "inode i_version" so that we avoid reusing possible
>> > > > > i_version numbers we gave away but did not write to disk but still...).
>> > > > > Thoughts?
>> > >
>> > > How hard is this for filesystems to support? Do they need an on-disk
>> > > format change to keep track of the crash counter?
>> >
>> > Yes. We'll need version counter in the superblock, and we'll need to
>> > know what the increment semantics are.
>> >
>> > The big question is how do we know there was a crash? The only thing
>> > a journalling filesystem knows at mount time is whether it is clean
>> > or requires recovery. Filesystems can require recovery for many
>> > reasons that don't involve a crash (e.g. root fs is never unmounted
>> > cleanly, so always requires recovery). Further, some filesystems may
>> > not even know there was a crash at mount time because their
>> > architecture always leaves a consistent filesystem on disk (e.g. COW
>> > filesystems)....
>>
>> What filesystems can or cannot easily do obviously differs. Ext4 has a
>> recovery flag set in superblock on RW mount/remount and cleared on
>> umount/RO remount.
>
> Even this doesn't help. A recent bug that was reported to the XFS
> list - turns out that systemd can't remount-ro the root
> filesystem sucessfully on shutdown because there are open write fds
> on the root filesystem when it attempts the remount. So it just
> reboots without a remount-ro. This uncovered a bug in grub in
Filesystems could use register_reboot_notifier() to get a notification
that even systemd cannot stuff-up. It could check for dirty data and, if
there is none (which there shouldn't be if a sync happened), it does a
single write to disk to update the superblock (or a single write to each
disk... or something).
md does this, because getting the root device to be marked read-only is
even harder than getting the root filesystem to be remounted read-only.
NeilBrown
On Tue, Apr 04 2017, J. Bruce Fields wrote:
> On Thu, Mar 30, 2017 at 02:35:32PM -0400, Jeff Layton wrote:
>> On Thu, 2017-03-30 at 12:12 -0400, J. Bruce Fields wrote:
>> > On Thu, Mar 30, 2017 at 07:11:48AM -0400, Jeff Layton wrote:
>> > > On Thu, 2017-03-30 at 08:47 +0200, Jan Kara wrote:
>> > > > Because if above is acceptable we could make reported i_version to be a sum
>> > > > of "superblock crash counter" and "inode i_version". We increment
>> > > > "superblock crash counter" whenever we detect unclean filesystem shutdown.
>> > > > That way after a crash we are guaranteed each inode will report new
>> > > > i_version (the sum would probably have to look like "superblock crash
>> > > > counter" * 65536 + "inode i_version" so that we avoid reusing possible
>> > > > i_version numbers we gave away but did not write to disk but still...).
>> > > > Thoughts?
>> >
>> > How hard is this for filesystems to support? Do they need an on-disk
>> > format change to keep track of the crash counter? Maybe not, maybe the
>> > high bits of the i_version counters are all they need.
>> >
>>
>> Yeah, I imagine we'd need a on-disk change for this unless there's
>> something already present that we could use in place of a crash counter.
>
> We could consider using the current time instead. So, put the current
> time (or time of last boot, or this inode's ctime, or something) in the
> high bits of the change attribute, and keep the low bits as a counter.
This is a very different proposal.
I don't think Jan was suggesting that the i_version be split into two
bit fields, one the change-counter and one the crash-counter.
Rather, the crash-counter was multiplied by a large-number and added to
the change-counter with the expectation that while not ever
change-counter landed on disk, at least 1 in every large-number would.
So after each crash we effectively add large-number to the
change-counter, and can be sure that number hasn't been used already.
To store the crash-counter in each inode (which does appeal) you would
need to be able to remove it before adding the new crash counter, and
that requires bit-fields. Maybe there are enough bits.
If you want to ensure read-only files can remain cached over a crash,
then you would have to mark a file in some way on stable storage
*before* allowing any change.
e.g. you could use the lsb. Odd i_versions might have been changed
recently and crash-count*large-number needs to be added.
Even i_versions have not been changed recently and nothing need be
added.
If you want to change a file with an even i_version, you subtract
crash-count*large-number
to the i_version, then set lsb. This is written to stable storage before
the change.
If a file has not been changed for a while, you can add
crash-count*large-number
and clear lsb.
The lsb of the i_version would be for internal use only. It would not
be visible outside the filesystem.
It feels a bit clunky, but I think it would work and is the best
combination of Jan's idea and your requirement.
The biggest cost would be switching to 'odd' before an changes, and the
unknown is when does it make sense to switch to 'even'.
NeilBrown
On Wed 05-04-17 11:43:32, NeilBrown wrote:
> On Tue, Apr 04 2017, J. Bruce Fields wrote:
>
> > On Thu, Mar 30, 2017 at 02:35:32PM -0400, Jeff Layton wrote:
> >> On Thu, 2017-03-30 at 12:12 -0400, J. Bruce Fields wrote:
> >> > On Thu, Mar 30, 2017 at 07:11:48AM -0400, Jeff Layton wrote:
> >> > > On Thu, 2017-03-30 at 08:47 +0200, Jan Kara wrote:
> >> > > > Because if above is acceptable we could make reported i_version to be a sum
> >> > > > of "superblock crash counter" and "inode i_version". We increment
> >> > > > "superblock crash counter" whenever we detect unclean filesystem shutdown.
> >> > > > That way after a crash we are guaranteed each inode will report new
> >> > > > i_version (the sum would probably have to look like "superblock crash
> >> > > > counter" * 65536 + "inode i_version" so that we avoid reusing possible
> >> > > > i_version numbers we gave away but did not write to disk but still...).
> >> > > > Thoughts?
> >> >
> >> > How hard is this for filesystems to support? Do they need an on-disk
> >> > format change to keep track of the crash counter? Maybe not, maybe the
> >> > high bits of the i_version counters are all they need.
> >> >
> >>
> >> Yeah, I imagine we'd need a on-disk change for this unless there's
> >> something already present that we could use in place of a crash counter.
> >
> > We could consider using the current time instead. So, put the current
> > time (or time of last boot, or this inode's ctime, or something) in the
> > high bits of the change attribute, and keep the low bits as a counter.
>
> This is a very different proposal.
> I don't think Jan was suggesting that the i_version be split into two
> bit fields, one the change-counter and one the crash-counter.
> Rather, the crash-counter was multiplied by a large-number and added to
> the change-counter with the expectation that while not ever
> change-counter landed on disk, at least 1 in every large-number would.
> So after each crash we effectively add large-number to the
> change-counter, and can be sure that number hasn't been used already.
Yes, that was my thinking.
> To store the crash-counter in each inode (which does appeal) you would
> need to be able to remove it before adding the new crash counter, and
> that requires bit-fields. Maybe there are enough bits.
Furthermore you'd have a potential problem that you need to change
i_version on disk just because you are reading after a crash and such
changes tend to be problematic (think of read-only mounts and stuff like
that).
> If you want to ensure read-only files can remain cached over a crash,
> then you would have to mark a file in some way on stable storage
> *before* allowing any change.
> e.g. you could use the lsb. Odd i_versions might have been changed
> recently and crash-count*large-number needs to be added.
> Even i_versions have not been changed recently and nothing need be
> added.
>
> If you want to change a file with an even i_version, you subtract
> crash-count*large-number
> to the i_version, then set lsb. This is written to stable storage before
> the change.
>
> If a file has not been changed for a while, you can add
> crash-count*large-number
> and clear lsb.
>
> The lsb of the i_version would be for internal use only. It would not
> be visible outside the filesystem.
>
> It feels a bit clunky, but I think it would work and is the best
> combination of Jan's idea and your requirement.
> The biggest cost would be switching to 'odd' before an changes, and the
> unknown is when does it make sense to switch to 'even'.
Well, there is also a problem that you would need to somehow remember with
which 'crash count' the i_version has been previously reported as that is
not stored on disk with my scheme. So I don't think we can easily use your
scheme.
So the options we have are:
1) Keep i_version as is, make clients also check for i_ctime.
Pro: No on-disk format changes.
Cons: After a crash, i_version can go backwards (but when file changes
i_version, i_ctime pair should be still different) or not, data can be
old or not.
2) Fsync when reporting i_version.
Pro: No on-disk format changes, strong consistency of i_version and
data.
Cons: Difficult to implement for filesystems due to locking constrains.
High performance overhead or i_version reporting.
3) Some variant of crash counter.
Pro: i_version cannot go backwards.
Cons: Requires on-disk format changes. After a crash data can be old
(however i_version increased).
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Wed, Apr 05, 2017 at 11:43:32AM +1000, NeilBrown wrote:
> On Tue, Apr 04 2017, J. Bruce Fields wrote:
>
> > On Thu, Mar 30, 2017 at 02:35:32PM -0400, Jeff Layton wrote:
> >> On Thu, 2017-03-30 at 12:12 -0400, J. Bruce Fields wrote:
> >> > On Thu, Mar 30, 2017 at 07:11:48AM -0400, Jeff Layton wrote:
> >> > > On Thu, 2017-03-30 at 08:47 +0200, Jan Kara wrote:
> >> > > > Because if above is acceptable we could make reported i_version to be a sum
> >> > > > of "superblock crash counter" and "inode i_version". We increment
> >> > > > "superblock crash counter" whenever we detect unclean filesystem shutdown.
> >> > > > That way after a crash we are guaranteed each inode will report new
> >> > > > i_version (the sum would probably have to look like "superblock crash
> >> > > > counter" * 65536 + "inode i_version" so that we avoid reusing possible
> >> > > > i_version numbers we gave away but did not write to disk but still...).
> >> > > > Thoughts?
> >> >
> >> > How hard is this for filesystems to support? Do they need an on-disk
> >> > format change to keep track of the crash counter? Maybe not, maybe the
> >> > high bits of the i_version counters are all they need.
> >> >
> >>
> >> Yeah, I imagine we'd need a on-disk change for this unless there's
> >> something already present that we could use in place of a crash counter.
> >
> > We could consider using the current time instead. So, put the current
> > time (or time of last boot, or this inode's ctime, or something) in the
> > high bits of the change attribute, and keep the low bits as a counter.
>
> This is a very different proposal.
> I don't think Jan was suggesting that the i_version be split into two
> bit fields, one the change-counter and one the crash-counter.
> Rather, the crash-counter was multiplied by a large-number and added to
> the change-counter with the expectation that while not ever
> change-counter landed on disk, at least 1 in every large-number would.
> So after each crash we effectively add large-number to the
> change-counter, and can be sure that number hasn't been used already.
I was sort of ignoring the distinction between concatenate(A,B) and
A*m+B, but, sure, multiplying's probably better.
> To store the crash-counter in each inode (which does appeal) you would
> need to be able to remove it before adding the new crash counter, and
> that requires bit-fields. Maybe there are enough bits.
i_version and the NFSv4 change attribute are 64 bits which gives us a
fair amount of flexibility.
> If you want to ensure read-only files can remain cached over a crash,
> then you would have to mark a file in some way on stable storage
> *before* allowing any change.
> e.g. you could use the lsb. Odd i_versions might have been changed
> recently and crash-count*large-number needs to be added.
> Even i_versions have not been changed recently and nothing need be
> added.
>
> If you want to change a file with an even i_version, you subtract
> crash-count*large-number
> to the i_version, then set lsb. This is written to stable storage before
> the change.
>
> If a file has not been changed for a while, you can add
> crash-count*large-number
> and clear lsb.
>
> The lsb of the i_version would be for internal use only. It would not
> be visible outside the filesystem.
>
> It feels a bit clunky, but I think it would work and is the best
> combination of Jan's idea and your requirement.
> The biggest cost would be switching to 'odd' before an changes, and the
> unknown is when does it make sense to switch to 'even'.
I'm not sure how to model the costs. Something like that might work.
--b.
On Wed, Apr 05, 2017 at 10:05:51AM +0200, Jan Kara wrote:
> 1) Keep i_version as is, make clients also check for i_ctime.
That would be a protocol revision, which we'd definitely rather avoid.
But can't we accomplish the same by using something like
ctime * (some constant) + i_version
?
> Pro: No on-disk format changes.
> Cons: After a crash, i_version can go backwards (but when file changes
> i_version, i_ctime pair should be still different) or not, data can be
> old or not.
This is probably good enough for NFS purposes: typically on an NFS
filesystem, results of a read in the face of a concurrent write open are
undefined. And writers sync before close.
So after a crash with a dirty inode, we're in a situation where an NFS
client still needs to resend some writes, sync, and close. I'm OK with
things being inconsistent during this window.
I do expect things to return to normal once that client's has resent its
writes--hence the worry about actually resuing old values after boot
(such as if i_version regresses on boot and then increments back to the
same value after further writes). Factoring in ctime fixes that.
> 2) Fsync when reporting i_version.
> Pro: No on-disk format changes, strong consistency of i_version and
> data.
> Cons: Difficult to implement for filesystems due to locking constrains.
> High performance overhead or i_version reporting.
Sounds painful.
> 3) Some variant of crash counter.
> Pro: i_version cannot go backwards.
> Cons: Requires on-disk format changes. After a crash data can be old
> (however i_version increased).
Also, some unnecessary invalidations. Which maybe can be mostly avoided
by some variation of Neil's scheme.
It looks to me like option (1) is doable now and introduces no
regressions compared to the current situation. (2) and (3) are more
copmlicated and involve some tradeoffs.
Also, we can implement (1) and switch to (2) or (3) later. We'd want to
do it without reported i_versions decreasing on kernel upgrade, but
there are multiple ways of handling that. (Worst case, just restrict
the change to newly created filesystems.)
--b.
On Wed, Apr 05 2017, Jan Kara wrote:
> On Wed 05-04-17 11:43:32, NeilBrown wrote:
>> On Tue, Apr 04 2017, J. Bruce Fields wrote:
>>
>> > On Thu, Mar 30, 2017 at 02:35:32PM -0400, Jeff Layton wrote:
>> >> On Thu, 2017-03-30 at 12:12 -0400, J. Bruce Fields wrote:
>> >> > On Thu, Mar 30, 2017 at 07:11:48AM -0400, Jeff Layton wrote:
>> >> > > On Thu, 2017-03-30 at 08:47 +0200, Jan Kara wrote:
>> >> > > > Because if above is acceptable we could make reported i_version to be a sum
>> >> > > > of "superblock crash counter" and "inode i_version". We increment
>> >> > > > "superblock crash counter" whenever we detect unclean filesystem shutdown.
>> >> > > > That way after a crash we are guaranteed each inode will report new
>> >> > > > i_version (the sum would probably have to look like "superblock crash
>> >> > > > counter" * 65536 + "inode i_version" so that we avoid reusing possible
>> >> > > > i_version numbers we gave away but did not write to disk but still...).
>> >> > > > Thoughts?
>> >> >
>> >> > How hard is this for filesystems to support? Do they need an on-disk
>> >> > format change to keep track of the crash counter? Maybe not, maybe the
>> >> > high bits of the i_version counters are all they need.
>> >> >
>> >>
>> >> Yeah, I imagine we'd need a on-disk change for this unless there's
>> >> something already present that we could use in place of a crash counter.
>> >
>> > We could consider using the current time instead. So, put the current
>> > time (or time of last boot, or this inode's ctime, or something) in the
>> > high bits of the change attribute, and keep the low bits as a counter.
>>
>> This is a very different proposal.
>> I don't think Jan was suggesting that the i_version be split into two
>> bit fields, one the change-counter and one the crash-counter.
>> Rather, the crash-counter was multiplied by a large-number and added to
>> the change-counter with the expectation that while not ever
>> change-counter landed on disk, at least 1 in every large-number would.
>> So after each crash we effectively add large-number to the
>> change-counter, and can be sure that number hasn't been used already.
>
> Yes, that was my thinking.
>
>> To store the crash-counter in each inode (which does appeal) you would
>> need to be able to remove it before adding the new crash counter, and
>> that requires bit-fields. Maybe there are enough bits.
>
> Furthermore you'd have a potential problem that you need to change
> i_version on disk just because you are reading after a crash and such
> changes tend to be problematic (think of read-only mounts and stuff like
> that).
>
>> If you want to ensure read-only files can remain cached over a crash,
>> then you would have to mark a file in some way on stable storage
>> *before* allowing any change.
>> e.g. you could use the lsb. Odd i_versions might have been changed
>> recently and crash-count*large-number needs to be added.
>> Even i_versions have not been changed recently and nothing need be
>> added.
>>
>> If you want to change a file with an even i_version, you subtract
>> crash-count*large-number
>> to the i_version, then set lsb. This is written to stable storage before
>> the change.
>>
>> If a file has not been changed for a while, you can add
>> crash-count*large-number
>> and clear lsb.
>>
>> The lsb of the i_version would be for internal use only. It would not
>> be visible outside the filesystem.
>>
>> It feels a bit clunky, but I think it would work and is the best
>> combination of Jan's idea and your requirement.
>> The biggest cost would be switching to 'odd' before an changes, and the
>> unknown is when does it make sense to switch to 'even'.
>
> Well, there is also a problem that you would need to somehow remember with
> which 'crash count' the i_version has been previously reported as that is
> not stored on disk with my scheme. So I don't think we can easily use your
> scheme.
I don't think there is a problem here.... maybe I didn't explain
properly or something.
I'm assuming there is a crash-count that is stored once per filesystem.
This might be a disk-format change, or maybe the "Last checked" time
could be used with ext4 (that is a bit horrible though).
Every on-disk i_version has a flag to choose between:
- use this number as it is, but update it on-disk before any change
- add multiple of current crash-count to this number before use.
If you crash during an update, the i_version is thus automatically
increased.
To change from the first option to the second option you subtract the
multiple of the current crash-count (which might make the stored
i_version negative), and flip the bit.
To change from the second option to the first, you add the multiple
of the current crash-count, and flip the bit.
In each case, the externally visible i_version does not change.
Nothing needs to be stored except the per-inode i_version and the per-fs
crash_count.
>
> So the options we have are:
>
> 1) Keep i_version as is, make clients also check for i_ctime.
> Pro: No on-disk format changes.
> Cons: After a crash, i_version can go backwards (but when file changes
> i_version, i_ctime pair should be still different) or not, data can be
> old or not.
I like to think of this approach as using the i_version as an extension
to the i_ctime.
i_ctime doesn't necessarily change on every file modification, either
because it is not a modification that is meant to change i_ctime, or
because i_ctime doesn't have the resolution to show a very small change
in time, or because the clock that is used to update i_ctime doesn't
have much resolution.
So when a change happens, if the stored c_time changes, set i_version to
zero, otherwise increment i_version.
Then the externally visible i-version is a combination of the stored
c_time and the stored i_version.
If you only used 1-second ctime resolution for versioning purposes, you
could provide a 64bit i_version as 34 bits of ctime and 30 bits of
changes-in-one-second.
It is important that the resolution of ctime used is less that the
fastest possible restart after a crash.
I don't think that i_version going backwards should be a problem, as
long as an old version means exactly the same old data. Presumably
journalling would ensure that the data and ctime/version are updated
atomically.
>
> 2) Fsync when reporting i_version.
> Pro: No on-disk format changes, strong consistency of i_version and
> data.
> Cons: Difficult to implement for filesystems due to locking constrains.
> High performance overhead or i_version reporting.
This reminds me of the old ext3 fsync-when-renaming a file. People
might depend on it for all the wrong reasons, and other people might
studiously avoid it due to the performance implications.
>
> 3) Some variant of crash counter.
> Pro: i_version cannot go backwards.
> Cons: Requires on-disk format changes. After a crash data can be old
> (however i_version increased).
If it is essential for i_version to always go forward, then I think this
is the best approach.
If an i_version reset can be tolerated, then I think a
time-plus-version-count approach is likely to be best.
Thanks,
NeilBrown
>
> Honza
> --
> Jan Kara <jack-IBi9RG/[email protected]>
> SUSE Labs, CR
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu 06-04-17 11:12:02, NeilBrown wrote:
> On Wed, Apr 05 2017, Jan Kara wrote:
> >> If you want to ensure read-only files can remain cached over a crash,
> >> then you would have to mark a file in some way on stable storage
> >> *before* allowing any change.
> >> e.g. you could use the lsb. Odd i_versions might have been changed
> >> recently and crash-count*large-number needs to be added.
> >> Even i_versions have not been changed recently and nothing need be
> >> added.
> >>
> >> If you want to change a file with an even i_version, you subtract
> >> crash-count*large-number
> >> to the i_version, then set lsb. This is written to stable storage before
> >> the change.
> >>
> >> If a file has not been changed for a while, you can add
> >> crash-count*large-number
> >> and clear lsb.
> >>
> >> The lsb of the i_version would be for internal use only. It would not
> >> be visible outside the filesystem.
> >>
> >> It feels a bit clunky, but I think it would work and is the best
> >> combination of Jan's idea and your requirement.
> >> The biggest cost would be switching to 'odd' before an changes, and the
> >> unknown is when does it make sense to switch to 'even'.
> >
> > Well, there is also a problem that you would need to somehow remember with
> > which 'crash count' the i_version has been previously reported as that is
> > not stored on disk with my scheme. So I don't think we can easily use your
> > scheme.
>
> I don't think there is a problem here.... maybe I didn't explain
> properly or something.
>
> I'm assuming there is a crash-count that is stored once per filesystem.
> This might be a disk-format change, or maybe the "Last checked" time
> could be used with ext4 (that is a bit horrible though).
>
> Every on-disk i_version has a flag to choose between:
> - use this number as it is, but update it on-disk before any change
> - add multiple of current crash-count to this number before use.
> If you crash during an update, the i_version is thus automatically
> increased.
>
> To change from the first option to the second option you subtract the
> multiple of the current crash-count (which might make the stored
> i_version negative), and flip the bit.
> To change from the second option to the first, you add the multiple
> of the current crash-count, and flip the bit.
> In each case, the externally visible i_version does not change.
> Nothing needs to be stored except the per-inode i_version and the per-fs
> crash_count.
Right, I didn't realize you would subtract crash counter when flipping the
bit and then add it back when flipping again. That would work.
> > So the options we have are:
> >
> > 1) Keep i_version as is, make clients also check for i_ctime.
> > Pro: No on-disk format changes.
> > Cons: After a crash, i_version can go backwards (but when file changes
> > i_version, i_ctime pair should be still different) or not, data can be
> > old or not.
>
> I like to think of this approach as using the i_version as an extension
> to the i_ctime.
> i_ctime doesn't necessarily change on every file modification, either
> because it is not a modification that is meant to change i_ctime, or
> because i_ctime doesn't have the resolution to show a very small change
> in time, or because the clock that is used to update i_ctime doesn't
> have much resolution.
> So when a change happens, if the stored c_time changes, set i_version to
> zero, otherwise increment i_version.
> Then the externally visible i-version is a combination of the stored
> c_time and the stored i_version.
> If you only used 1-second ctime resolution for versioning purposes, you
> could provide a 64bit i_version as 34 bits of ctime and 30 bits of
> changes-in-one-second.
> It is important that the resolution of ctime used is less that the
> fastest possible restart after a crash.
>
> I don't think that i_version going backwards should be a problem, as
> long as an old version means exactly the same old data. Presumably
> journalling would ensure that the data and ctime/version are updated
> atomically.
So as Dave and I wrote earlier in this thread, journalling does not ensure
data vs ctime/version consistency (well, except for ext4 in data=journal
mode but people rarely run that due to performance implications). So you
can get old data and new version as well as new data and old version after
a crash. The only thing filesystems guarantee is that you will not see
uninitialized blocks and that fsync makes both data & ctime/version
persistent. But as Bruce wrote for NFS open-to-close semantics this may be
actually good enough.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Wed, Apr 05, 2017 at 02:14:09PM -0400, J. Bruce Fields wrote:
> On Wed, Apr 05, 2017 at 10:05:51AM +0200, Jan Kara wrote:
> > 1) Keep i_version as is, make clients also check for i_ctime.
>
> That would be a protocol revision, which we'd definitely rather avoid.
>
> But can't we accomplish the same by using something like
>
> ctime * (some constant) + i_version
>
> ?
>
> > Pro: No on-disk format changes.
> > Cons: After a crash, i_version can go backwards (but when file changes
> > i_version, i_ctime pair should be still different) or not, data can be
> > old or not.
>
> This is probably good enough for NFS purposes: typically on an NFS
> filesystem, results of a read in the face of a concurrent write open are
> undefined. And writers sync before close.
>
> So after a crash with a dirty inode, we're in a situation where an NFS
> client still needs to resend some writes, sync, and close. I'm OK with
> things being inconsistent during this window.
>
> I do expect things to return to normal once that client's has resent its
> writes--hence the worry about actually resuing old values after boot
> (such as if i_version regresses on boot and then increments back to the
> same value after further writes). Factoring in ctime fixes that.
So for now I'm thinking of just doing something like the following.
Only nfsd needs it for now, but it could be moved to a vfs helper for
statx, or for individual filesystems that want to do something
different. (The NFSv4 client will want to use the server's change
attribute instead, I think. And other filesystems might want to try
something more ambitious like Neil's proposal.)
--b.
diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
index 12feac6ee2fd..9636c9a60aba 100644
diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h
index f84fe6bf9aee..14f09f1ef605 100644
--- a/fs/nfsd/nfsfh.h
+++ b/fs/nfsd/nfsfh.h
@@ -240,6 +240,16 @@ fh_clear_wcc(struct svc_fh *fhp)
fhp->fh_pre_saved = false;
}
+static inline u64 nfsd4_change_attribute(struct inode *inode)
+{
+ u64 chattr;
+
+ chattr = inode->i_ctime.tv_sec << 30;
+ chattr += inode->i_ctime.tv_nsec;
+ chattr += inode->i_version;
+ return chattr;
+}
+
/*
* Fill in the pre_op attr for the wcc data
*/
@@ -253,7 +263,7 @@ fill_pre_wcc(struct svc_fh *fhp)
fhp->fh_pre_mtime = inode->i_mtime;
fhp->fh_pre_ctime = inode->i_ctime;
fhp->fh_pre_size = inode->i_size;
- fhp->fh_pre_change = inode->i_version;
+ fhp->fh_pre_change = nfsd4_change_attribute(inode);
fhp->fh_pre_saved = true;
}
}
--- a/fs/nfsd/nfs3xdr.c
+++ b/fs/nfsd/nfs3xdr.c
@@ -260,7 +260,7 @@ void fill_post_wcc(struct svc_fh *fhp)
printk("nfsd: inode locked twice during operation.\n");
err = fh_getattr(fhp, &fhp->fh_post_attr);
- fhp->fh_post_change = d_inode(fhp->fh_dentry)->i_version;
+ fhp->fh_post_change = nfsd4_change_attribute(d_inode(fhp->fh_dentry));
if (err) {
fhp->fh_post_saved = false;
/* Grab the ctime anyway - set_change_info might use it */
diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index 26780d53a6f9..a09532d4a383 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -1973,7 +1973,7 @@ static __be32 *encode_change(__be32 *p, struct kstat *stat, struct inode *inode,
*p++ = cpu_to_be32(convert_to_wallclock(exp->cd->flush_time));
*p++ = 0;
} else if (IS_I_VERSION(inode)) {
- p = xdr_encode_hyper(p, inode->i_version);
+ p = xdr_encode_hyper(p, nfsd4_change_attribute(inode));
} else {
*p++ = cpu_to_be32(stat->ctime.tv_sec);
*p++ = cpu_to_be32(stat->ctime.tv_nsec);
On Thu, May 11 2017, J. Bruce Fields wrote:
> On Wed, Apr 05, 2017 at 02:14:09PM -0400, J. Bruce Fields wrote:
>> On Wed, Apr 05, 2017 at 10:05:51AM +0200, Jan Kara wrote:
>> > 1) Keep i_version as is, make clients also check for i_ctime.
>>
>> That would be a protocol revision, which we'd definitely rather avoid.
>>
>> But can't we accomplish the same by using something like
>>
>> ctime * (some constant) + i_version
>>
>> ?
>>
>> > Pro: No on-disk format changes.
>> > Cons: After a crash, i_version can go backwards (but when file changes
>> > i_version, i_ctime pair should be still different) or not, data can be
>> > old or not.
>>
>> This is probably good enough for NFS purposes: typically on an NFS
>> filesystem, results of a read in the face of a concurrent write open are
>> undefined. And writers sync before close.
>>
>> So after a crash with a dirty inode, we're in a situation where an NFS
>> client still needs to resend some writes, sync, and close. I'm OK with
>> things being inconsistent during this window.
>>
>> I do expect things to return to normal once that client's has resent its
>> writes--hence the worry about actually resuing old values after boot
>> (such as if i_version regresses on boot and then increments back to the
>> same value after further writes). Factoring in ctime fixes that.
>
> So for now I'm thinking of just doing something like the following.
>
> Only nfsd needs it for now, but it could be moved to a vfs helper for
> statx, or for individual filesystems that want to do something
> different. (The NFSv4 client will want to use the server's change
> attribute instead, I think. And other filesystems might want to try
> something more ambitious like Neil's proposal.)
>
> --b.
>
> diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
> index 12feac6ee2fd..9636c9a60aba 100644
> diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h
> index f84fe6bf9aee..14f09f1ef605 100644
> --- a/fs/nfsd/nfsfh.h
> +++ b/fs/nfsd/nfsfh.h
> @@ -240,6 +240,16 @@ fh_clear_wcc(struct svc_fh *fhp)
> fhp->fh_pre_saved = false;
> }
>
> +static inline u64 nfsd4_change_attribute(struct inode *inode)
> +{
> + u64 chattr;
> +
> + chattr = inode->i_ctime.tv_sec << 30;
> + chattr += inode->i_ctime.tv_nsec;
> + chattr += inode->i_version;
> + return chattr;
So if I chmod a file, all clients will need to flush the content from their cache?
Maybe they already do? Maybe it is a boring corner case?
> +}
> +
> /*
> * Fill in the pre_op attr for the wcc data
> */
> @@ -253,7 +263,7 @@ fill_pre_wcc(struct svc_fh *fhp)
> fhp->fh_pre_mtime = inode->i_mtime;
> fhp->fh_pre_ctime = inode->i_ctime;
> fhp->fh_pre_size = inode->i_size;
> - fhp->fh_pre_change = inode->i_version;
> + fhp->fh_pre_change = nfsd4_change_attribute(inode);
> fhp->fh_pre_saved = true;
> }
> }
> --- a/fs/nfsd/nfs3xdr.c
> +++ b/fs/nfsd/nfs3xdr.c
> @@ -260,7 +260,7 @@ void fill_post_wcc(struct svc_fh *fhp)
> printk("nfsd: inode locked twice during operation.\n");
>
> err = fh_getattr(fhp, &fhp->fh_post_attr);
> - fhp->fh_post_change = d_inode(fhp->fh_dentry)->i_version;
> + fhp->fh_post_change = nfsd4_change_attribute(d_inode(fhp->fh_dentry));
> if (err) {
> fhp->fh_post_saved = false;
> /* Grab the ctime anyway - set_change_info might use it */
> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> index 26780d53a6f9..a09532d4a383 100644
> --- a/fs/nfsd/nfs4xdr.c
> +++ b/fs/nfsd/nfs4xdr.c
> @@ -1973,7 +1973,7 @@ static __be32 *encode_change(__be32 *p, struct kstat *stat, struct inode *inode,
> *p++ = cpu_to_be32(convert_to_wallclock(exp->cd->flush_time));
> *p++ = 0;
> } else if (IS_I_VERSION(inode)) {
> - p = xdr_encode_hyper(p, inode->i_version);
> + p = xdr_encode_hyper(p, nfsd4_change_attribute(inode));
> } else {
> *p++ = cpu_to_be32(stat->ctime.tv_sec);
> *p++ = cpu_to_be32(stat->ctime.tv_nsec);
It is *really* confusing to find that fh_post_change is only set in nfs3
code, and only used in nfs4 code.
It is probably time to get a 'version' field in 'struct kstat'.
That would allow this code to get a little cleaner.
(to me, this exercise is just a reminder that the NFSv4 change attribute
is poorly designed ... so it just makes me grumpy).
NeilBrown
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu 11-05-17 14:59:43, J. Bruce Fields wrote:
> On Wed, Apr 05, 2017 at 02:14:09PM -0400, J. Bruce Fields wrote:
> > On Wed, Apr 05, 2017 at 10:05:51AM +0200, Jan Kara wrote:
> > > 1) Keep i_version as is, make clients also check for i_ctime.
> >
> > That would be a protocol revision, which we'd definitely rather avoid.
> >
> > But can't we accomplish the same by using something like
> >
> > ctime * (some constant) + i_version
> >
> > ?
> >
> > > Pro: No on-disk format changes.
> > > Cons: After a crash, i_version can go backwards (but when file changes
> > > i_version, i_ctime pair should be still different) or not, data can be
> > > old or not.
> >
> > This is probably good enough for NFS purposes: typically on an NFS
> > filesystem, results of a read in the face of a concurrent write open are
> > undefined. And writers sync before close.
> >
> > So after a crash with a dirty inode, we're in a situation where an NFS
> > client still needs to resend some writes, sync, and close. I'm OK with
> > things being inconsistent during this window.
> >
> > I do expect things to return to normal once that client's has resent its
> > writes--hence the worry about actually resuing old values after boot
> > (such as if i_version regresses on boot and then increments back to the
> > same value after further writes). Factoring in ctime fixes that.
>
> So for now I'm thinking of just doing something like the following.
>
> Only nfsd needs it for now, but it could be moved to a vfs helper for
> statx, or for individual filesystems that want to do something
> different. (The NFSv4 client will want to use the server's change
> attribute instead, I think. And other filesystems might want to try
> something more ambitious like Neil's proposal.)
>
> --b.
>
> diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
> index 12feac6ee2fd..9636c9a60aba 100644
> diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h
> index f84fe6bf9aee..14f09f1ef605 100644
> --- a/fs/nfsd/nfsfh.h
> +++ b/fs/nfsd/nfsfh.h
> @@ -240,6 +240,16 @@ fh_clear_wcc(struct svc_fh *fhp)
> fhp->fh_pre_saved = false;
> }
>
> +static inline u64 nfsd4_change_attribute(struct inode *inode)
> +{
> + u64 chattr;
> +
> + chattr = inode->i_ctime.tv_sec << 30;
Won't this overflow on 32-bit archs? tv_sec seems to be defined as long?
Probably you need explicit (u64) cast... Otherwise I'm fine with this.
Honza
> + chattr += inode->i_ctime.tv_nsec;
> + chattr += inode->i_version;
> + return chattr;
> +}
> +
> /*
> * Fill in the pre_op attr for the wcc data
> */
> @@ -253,7 +263,7 @@ fill_pre_wcc(struct svc_fh *fhp)
> fhp->fh_pre_mtime = inode->i_mtime;
> fhp->fh_pre_ctime = inode->i_ctime;
> fhp->fh_pre_size = inode->i_size;
> - fhp->fh_pre_change = inode->i_version;
> + fhp->fh_pre_change = nfsd4_change_attribute(inode);
> fhp->fh_pre_saved = true;
> }
> }
> --- a/fs/nfsd/nfs3xdr.c
> +++ b/fs/nfsd/nfs3xdr.c
> @@ -260,7 +260,7 @@ void fill_post_wcc(struct svc_fh *fhp)
> printk("nfsd: inode locked twice during operation.\n");
>
> err = fh_getattr(fhp, &fhp->fh_post_attr);
> - fhp->fh_post_change = d_inode(fhp->fh_dentry)->i_version;
> + fhp->fh_post_change = nfsd4_change_attribute(d_inode(fhp->fh_dentry));
> if (err) {
> fhp->fh_post_saved = false;
> /* Grab the ctime anyway - set_change_info might use it */
> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> index 26780d53a6f9..a09532d4a383 100644
> --- a/fs/nfsd/nfs4xdr.c
> +++ b/fs/nfsd/nfs4xdr.c
> @@ -1973,7 +1973,7 @@ static __be32 *encode_change(__be32 *p, struct kstat *stat, struct inode *inode,
> *p++ = cpu_to_be32(convert_to_wallclock(exp->cd->flush_time));
> *p++ = 0;
> } else if (IS_I_VERSION(inode)) {
> - p = xdr_encode_hyper(p, inode->i_version);
> + p = xdr_encode_hyper(p, nfsd4_change_attribute(inode));
> } else {
> *p++ = cpu_to_be32(stat->ctime.tv_sec);
> *p++ = cpu_to_be32(stat->ctime.tv_nsec);
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Thu, 2017-05-11 at 14:59 -0400, J. Bruce Fields wrote:
> On Wed, Apr 05, 2017 at 02:14:09PM -0400, J. Bruce Fields wrote:
> > On Wed, Apr 05, 2017 at 10:05:51AM +0200, Jan Kara wrote:
> > > 1) Keep i_version as is, make clients also check for i_ctime.
> >
> > That would be a protocol revision, which we'd definitely rather avoid.
> >
> > But can't we accomplish the same by using something like
> >
> > ctime * (some constant) + i_version
> >
> > ?
> >
> > > Pro: No on-disk format changes.
> > > Cons: After a crash, i_version can go backwards (but when file changes
> > > i_version, i_ctime pair should be still different) or not, data can be
> > > old or not.
> >
> > This is probably good enough for NFS purposes: typically on an NFS
> > filesystem, results of a read in the face of a concurrent write open are
> > undefined. And writers sync before close.
> >
> > So after a crash with a dirty inode, we're in a situation where an NFS
> > client still needs to resend some writes, sync, and close. I'm OK with
> > things being inconsistent during this window.
> >
> > I do expect things to return to normal once that client's has resent its
> > writes--hence the worry about actually resuing old values after boot
> > (such as if i_version regresses on boot and then increments back to the
> > same value after further writes). Factoring in ctime fixes that.
>
> So for now I'm thinking of just doing something like the following.
>
> Only nfsd needs it for now, but it could be moved to a vfs helper for
> statx, or for individual filesystems that want to do something
> different. (The NFSv4 client will want to use the server's change
> attribute instead, I think. And other filesystems might want to try
> something more ambitious like Neil's proposal.)
>
> --b.
>
> diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
> index 12feac6ee2fd..9636c9a60aba 100644
> diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h
> index f84fe6bf9aee..14f09f1ef605 100644
> --- a/fs/nfsd/nfsfh.h
> +++ b/fs/nfsd/nfsfh.h
> @@ -240,6 +240,16 @@ fh_clear_wcc(struct svc_fh *fhp)
> fhp->fh_pre_saved = false;
> }
>
> +static inline u64 nfsd4_change_attribute(struct inode *inode)
> +{
> + u64 chattr;
> +
> + chattr = inode->i_ctime.tv_sec << 30;
> + chattr += inode->i_ctime.tv_nsec;
> + chattr += inode->i_version;
> + return chattr;
> +}
> +
> /*
> * Fill in the pre_op attr for the wcc data
> */
> @@ -253,7 +263,7 @@ fill_pre_wcc(struct svc_fh *fhp)
> fhp->fh_pre_mtime = inode->i_mtime;
> fhp->fh_pre_ctime = inode->i_ctime;
> fhp->fh_pre_size = inode->i_size;
> - fhp->fh_pre_change = inode->i_version;
> + fhp->fh_pre_change = nfsd4_change_attribute(inode);
> fhp->fh_pre_saved = true;
> }
> }
> --- a/fs/nfsd/nfs3xdr.c
> +++ b/fs/nfsd/nfs3xdr.c
> @@ -260,7 +260,7 @@ void fill_post_wcc(struct svc_fh *fhp)
> printk("nfsd: inode locked twice during operation.\n");
>
> err = fh_getattr(fhp, &fhp->fh_post_attr);
> - fhp->fh_post_change = d_inode(fhp->fh_dentry)->i_version;
> + fhp->fh_post_change = nfsd4_change_attribute(d_inode(fhp->fh_dentry));
> if (err) {
> fhp->fh_post_saved = false;
> /* Grab the ctime anyway - set_change_info might use it */
> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> index 26780d53a6f9..a09532d4a383 100644
> --- a/fs/nfsd/nfs4xdr.c
> +++ b/fs/nfsd/nfs4xdr.c
> @@ -1973,7 +1973,7 @@ static __be32 *encode_change(__be32 *p, struct kstat *stat, struct inode *inode,
> *p++ = cpu_to_be32(convert_to_wallclock(exp->cd->flush_time));
> *p++ = 0;
> } else if (IS_I_VERSION(inode)) {
> - p = xdr_encode_hyper(p, inode->i_version);
> + p = xdr_encode_hyper(p, nfsd4_change_attribute(inode));
> } else {
> *p++ = cpu_to_be32(stat->ctime.tv_sec);
> *p++ = cpu_to_be32(stat->ctime.tv_nsec);
Sorry I've been MIA on this discussion. I've had a very busy spring...
This looks reasonable to me (modulo Jan's comment about casting tv_sec
to u64).
To be clear, I think this is mostly orthogonal to the changes that I was
originally proposing, right? I think we can still benefit from only
bumping and storing i_version values after they've been queried.
--
Jeff Layton <[email protected]>
On Fri, May 12, 2017 at 10:27:54AM +0200, Jan Kara wrote:
> On Thu 11-05-17 14:59:43, J. Bruce Fields wrote:
> > On Wed, Apr 05, 2017 at 02:14:09PM -0400, J. Bruce Fields wrote:
> > > On Wed, Apr 05, 2017 at 10:05:51AM +0200, Jan Kara wrote:
> > > > 1) Keep i_version as is, make clients also check for i_ctime.
> > >
> > > That would be a protocol revision, which we'd definitely rather avoid.
> > >
> > > But can't we accomplish the same by using something like
> > >
> > > ctime * (some constant) + i_version
> > >
> > > ?
> > >
> > > > Pro: No on-disk format changes.
> > > > Cons: After a crash, i_version can go backwards (but when file changes
> > > > i_version, i_ctime pair should be still different) or not, data can be
> > > > old or not.
> > >
> > > This is probably good enough for NFS purposes: typically on an NFS
> > > filesystem, results of a read in the face of a concurrent write open are
> > > undefined. And writers sync before close.
> > >
> > > So after a crash with a dirty inode, we're in a situation where an NFS
> > > client still needs to resend some writes, sync, and close. I'm OK with
> > > things being inconsistent during this window.
> > >
> > > I do expect things to return to normal once that client's has resent its
> > > writes--hence the worry about actually resuing old values after boot
> > > (such as if i_version regresses on boot and then increments back to the
> > > same value after further writes). Factoring in ctime fixes that.
> >
> > So for now I'm thinking of just doing something like the following.
> >
> > Only nfsd needs it for now, but it could be moved to a vfs helper for
> > statx, or for individual filesystems that want to do something
> > different. (The NFSv4 client will want to use the server's change
> > attribute instead, I think. And other filesystems might want to try
> > something more ambitious like Neil's proposal.)
> >
> > --b.
> >
> > diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
> > index 12feac6ee2fd..9636c9a60aba 100644
> > diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h
> > index f84fe6bf9aee..14f09f1ef605 100644
> > --- a/fs/nfsd/nfsfh.h
> > +++ b/fs/nfsd/nfsfh.h
> > @@ -240,6 +240,16 @@ fh_clear_wcc(struct svc_fh *fhp)
> > fhp->fh_pre_saved = false;
> > }
> >
> > +static inline u64 nfsd4_change_attribute(struct inode *inode)
> > +{
> > + u64 chattr;
> > +
> > + chattr = inode->i_ctime.tv_sec << 30;
>
> Won't this overflow on 32-bit archs? tv_sec seems to be defined as long?
> Probably you need explicit (u64) cast... Otherwise I'm fine with this.
Whoops, yes. Or just assign to chattr as a separate step. I'll fix
that.
--b.
> > + chattr += inode->i_ctime.tv_nsec;
> > + chattr += inode->i_version;
> > + return chattr;
> > +}
> > +
> > /*
> > * Fill in the pre_op attr for the wcc data
> > */
> > @@ -253,7 +263,7 @@ fill_pre_wcc(struct svc_fh *fhp)
> > fhp->fh_pre_mtime = inode->i_mtime;
> > fhp->fh_pre_ctime = inode->i_ctime;
> > fhp->fh_pre_size = inode->i_size;
> > - fhp->fh_pre_change = inode->i_version;
> > + fhp->fh_pre_change = nfsd4_change_attribute(inode);
> > fhp->fh_pre_saved = true;
> > }
> > }
> > --- a/fs/nfsd/nfs3xdr.c
> > +++ b/fs/nfsd/nfs3xdr.c
> > @@ -260,7 +260,7 @@ void fill_post_wcc(struct svc_fh *fhp)
> > printk("nfsd: inode locked twice during operation.\n");
> >
> > err = fh_getattr(fhp, &fhp->fh_post_attr);
> > - fhp->fh_post_change = d_inode(fhp->fh_dentry)->i_version;
> > + fhp->fh_post_change = nfsd4_change_attribute(d_inode(fhp->fh_dentry));
> > if (err) {
> > fhp->fh_post_saved = false;
> > /* Grab the ctime anyway - set_change_info might use it */
> > diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> > index 26780d53a6f9..a09532d4a383 100644
> > --- a/fs/nfsd/nfs4xdr.c
> > +++ b/fs/nfsd/nfs4xdr.c
> > @@ -1973,7 +1973,7 @@ static __be32 *encode_change(__be32 *p, struct kstat *stat, struct inode *inode,
> > *p++ = cpu_to_be32(convert_to_wallclock(exp->cd->flush_time));
> > *p++ = 0;
> > } else if (IS_I_VERSION(inode)) {
> > - p = xdr_encode_hyper(p, inode->i_version);
> > + p = xdr_encode_hyper(p, nfsd4_change_attribute(inode));
> > } else {
> > *p++ = cpu_to_be32(stat->ctime.tv_sec);
> > *p++ = cpu_to_be32(stat->ctime.tv_nsec);
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR
On Fri, May 12, 2017 at 07:01:25AM -0400, Jeff Layton wrote:
> This looks reasonable to me (modulo Jan's comment about casting tv_sec
> to u64).
>
> To be clear, I think this is mostly orthogonal to the changes that I was
> originally proposing, right? I think we can still benefit from only
> bumping and storing i_version values after they've been queried.
Definitely, yes.
--b.
On Fri, May 12, 2017 at 08:22:23AM +1000, NeilBrown wrote:
> On Thu, May 11 2017, J. Bruce Fields wrote:
> > +static inline u64 nfsd4_change_attribute(struct inode *inode)
> > +{
> > + u64 chattr;
> > +
> > + chattr = inode->i_ctime.tv_sec << 30;
> > + chattr += inode->i_ctime.tv_nsec;
> > + chattr += inode->i_version;
> > + return chattr;
>
> So if I chmod a file, all clients will need to flush the content from their cache?
> Maybe they already do? Maybe it is a boring corner case?
Yeah, that's the assumption, maybe it's wrong. I can't recall
complaints about anyone bitten by that case.
> > /*
> > * Fill in the pre_op attr for the wcc data
> > */
> > @@ -253,7 +263,7 @@ fill_pre_wcc(struct svc_fh *fhp)
> > fhp->fh_pre_mtime = inode->i_mtime;
> > fhp->fh_pre_ctime = inode->i_ctime;
> > fhp->fh_pre_size = inode->i_size;
> > - fhp->fh_pre_change = inode->i_version;
> > + fhp->fh_pre_change = nfsd4_change_attribute(inode);
> > fhp->fh_pre_saved = true;
> > }
> > }
> > --- a/fs/nfsd/nfs3xdr.c
> > +++ b/fs/nfsd/nfs3xdr.c
> > @@ -260,7 +260,7 @@ void fill_post_wcc(struct svc_fh *fhp)
> > printk("nfsd: inode locked twice during operation.\n");
> >
> > err = fh_getattr(fhp, &fhp->fh_post_attr);
> > - fhp->fh_post_change = d_inode(fhp->fh_dentry)->i_version;
> > + fhp->fh_post_change = nfsd4_change_attribute(d_inode(fhp->fh_dentry));
> > if (err) {
> > fhp->fh_post_saved = false;
> > /* Grab the ctime anyway - set_change_info might use it */
> > diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> > index 26780d53a6f9..a09532d4a383 100644
> > --- a/fs/nfsd/nfs4xdr.c
> > +++ b/fs/nfsd/nfs4xdr.c
> > @@ -1973,7 +1973,7 @@ static __be32 *encode_change(__be32 *p, struct kstat *stat, struct inode *inode,
> > *p++ = cpu_to_be32(convert_to_wallclock(exp->cd->flush_time));
> > *p++ = 0;
> > } else if (IS_I_VERSION(inode)) {
> > - p = xdr_encode_hyper(p, inode->i_version);
> > + p = xdr_encode_hyper(p, nfsd4_change_attribute(inode));
> > } else {
> > *p++ = cpu_to_be32(stat->ctime.tv_sec);
> > *p++ = cpu_to_be32(stat->ctime.tv_nsec);
>
> It is *really* confusing to find that fh_post_change is only set in nfs3
> code, and only used in nfs4 code.
Yup.
> It is probably time to get a 'version' field in 'struct kstat'.
The pre/post_wcc code doesn't seem to be doing an explicit stat, I
wonder if that matters?
--b.
> That would allow this code to get a little cleaner.
>
> (to me, this exercise is just a reminder that the NFSv4 change attribute
> is poorly designed ... so it just makes me grumpy).
>
> NeilBrown
>
>
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> > the body of a message to [email protected]
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, 2017-05-12 at 12:21 -0400, J. Bruce Fields wrote:
> On Fri, May 12, 2017 at 08:22:23AM +1000, NeilBrown wrote:
> > On Thu, May 11 2017, J. Bruce Fields wrote:
> > > +static inline u64 nfsd4_change_attribute(struct inode *inode)
> > > +{
> > > + u64 chattr;
> > > +
> > > + chattr = inode->i_ctime.tv_sec << 30;
> > > + chattr += inode->i_ctime.tv_nsec;
> > > + chattr += inode->i_version;
> > > + return chattr;
> >
> > So if I chmod a file, all clients will need to flush the content from their cache?
> > Maybe they already do? Maybe it is a boring corner case?
>
> Yeah, that's the assumption, maybe it's wrong. I can't recall
> complaints about anyone bitten by that case.
>
I'm pretty sure that's required by the RFC. The change attribute changes
with both data and metadata changes, and there is no way to tell what
sort of change it was. You have to dump everything out of the cache when
it changes.
> > > /*
> > > * Fill in the pre_op attr for the wcc data
> > > */
> > > @@ -253,7 +263,7 @@ fill_pre_wcc(struct svc_fh *fhp)
> > > fhp->fh_pre_mtime = inode->i_mtime;
> > > fhp->fh_pre_ctime = inode->i_ctime;
> > > fhp->fh_pre_size = inode->i_size;
> > > - fhp->fh_pre_change = inode->i_version;
> > > + fhp->fh_pre_change = nfsd4_change_attribute(inode);
> > > fhp->fh_pre_saved = true;
> > > }
> > > }
> > > --- a/fs/nfsd/nfs3xdr.c
> > > +++ b/fs/nfsd/nfs3xdr.c
> > > @@ -260,7 +260,7 @@ void fill_post_wcc(struct svc_fh *fhp)
> > > printk("nfsd: inode locked twice during operation.\n");
> > >
> > > err = fh_getattr(fhp, &fhp->fh_post_attr);
> > > - fhp->fh_post_change = d_inode(fhp->fh_dentry)->i_version;
> > > + fhp->fh_post_change = nfsd4_change_attribute(d_inode(fhp->fh_dentry));
> > > if (err) {
> > > fhp->fh_post_saved = false;
> > > /* Grab the ctime anyway - set_change_info might use it */
> > > diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> > > index 26780d53a6f9..a09532d4a383 100644
> > > --- a/fs/nfsd/nfs4xdr.c
> > > +++ b/fs/nfsd/nfs4xdr.c
> > > @@ -1973,7 +1973,7 @@ static __be32 *encode_change(__be32 *p, struct kstat *stat, struct inode *inode,
> > > *p++ = cpu_to_be32(convert_to_wallclock(exp->cd->flush_time));
> > > *p++ = 0;
> > > } else if (IS_I_VERSION(inode)) {
> > > - p = xdr_encode_hyper(p, inode->i_version);
> > > + p = xdr_encode_hyper(p, nfsd4_change_attribute(inode));
> > > } else {
> > > *p++ = cpu_to_be32(stat->ctime.tv_sec);
> > > *p++ = cpu_to_be32(stat->ctime.tv_nsec);
> >
> > It is *really* confusing to find that fh_post_change is only set in nfs3
> > code, and only used in nfs4 code.
>
> Yup.
>
> > It is probably time to get a 'version' field in 'struct kstat'.
>
> The pre/post_wcc code doesn't seem to be doing an explicit stat, I
> wonder if that matters?
>
Probably not for now. We only use this for namespace altering operations
anyway (create, link, unlink, and rename).
The post code does do a fh_getattr. It's only the pre-op i_version that
comes out of the cache. Only btrfs, xfs, and ext4 have a real i_version
counter today, and they just scrape that info out of the in-core inode.
So while not completely atomic, you should see a difference in the
change_info4 during any of those operations.
FWIW, userland cephfs now supports a cluster-coherent change attribute,
though the kernel client still needs some work before we can implement
it there. Eventually we'll add that, and at that point we might need to
have nfsd do a getattr in the pre part as well.
--
Jeff Layton <[email protected]>