v6: add support for STATX_ATTR_VERSION_MONOTONIC
add patch to expose i_version counter to userland
patches to update times/version after copying write data
An updated set of i_version handling changes! I've dropped the earlier
ext4 patches since Ted has picked up the relevant ext4 ones.
This set is based on linux-next, to make sure we don't collide with the
statx DIO alignment patches, and some other i_version cleanups that are
in flight. I'm hoping those land in 6.1.
There are a few changes since v5, mostly centered around adding
STATX_ATTR_VERSION_MONOTONIC. I've also re-added the patch to expose
STATX_VERSION to userland via statx. What I'm proposing should now
(mostly) conform to the semantics I layed out in the manpage patch I
sent recently [1].
Finally, I've added two patches to make __generic_file_write_iter and
ext4 update the c/mtime after copying file data instead of before, which
Neil pointed out makes for better cache-coherency handling. Those should
take care of ext4 and tmpfs. xfs and btrfs will need to make the same
changes.
One thing I'm not sure of is what we should do if update_times fails
after an otherwise successful write. Should we just ignore that and move
on (and maybe WARN)? Return an error? Set a writeback error? What's the
right recourse there?
I'd like to go ahead and get the first 6 patches from this series into
linux-next fairly soon, so if anyone has objections, please speak up!
[1]: https://lore.kernel.org/linux-nfs/[email protected]/T/#u
Jeff Layton (9):
iversion: move inode_query_iversion to libfs.c
iversion: clarify when the i_version counter must be updated
vfs: plumb i_version handling into struct kstat
nfs: report the inode version in getattr if requested
ceph: report the inode version in getattr if requested
nfsd: use the getattr operation to fetch i_version
vfs: expose STATX_VERSION to userland
vfs: update times after copying data in __generic_file_write_iter
ext4: update times after I/O in write codepaths
fs/ceph/inode.c | 16 +++++++++----
fs/ext4/file.c | 20 +++++++++++++---
fs/libfs.c | 36 +++++++++++++++++++++++++++++
fs/nfs/export.c | 7 ------
fs/nfs/inode.c | 10 ++++++--
fs/nfsd/nfs4xdr.c | 4 +++-
fs/nfsd/nfsfh.c | 40 ++++++++++++++++++++++++++++++++
fs/nfsd/nfsfh.h | 29 +----------------------
fs/nfsd/vfs.h | 7 +++++-
fs/stat.c | 7 ++++++
include/linux/exportfs.h | 1 -
include/linux/iversion.h | 48 ++++++++-------------------------------
include/linux/stat.h | 2 +-
include/uapi/linux/stat.h | 6 +++--
mm/filemap.c | 17 ++++++++++----
samples/vfs/test-statx.c | 8 +++++--
16 files changed, 163 insertions(+), 95 deletions(-)
--
2.37.3
From: Jeff Layton <[email protected]>
The NFS server has a lot of special handling for different types of
change attribute access, depending on what sort of inode we have. In
most cases, it's doing a getattr anyway and then fetching that value
after the fact.
Rather that do that, add a new STATX_VERSION flag that is a kernel-only
symbol (for now). If requested and getattr can implement it, it can fill
out this field. For IS_I_VERSION inodes, add a generic implementation in
vfs_getattr_nosec. Take care to mask STATX_VERSION off in requests from
userland and in the result mask.
Eventually if we decide to make this available to userland, we can just
designate a field for it in struct statx, and move the STATX_VERSION
definition to the uapi header.
Signed-off-by: Jeff Layton <[email protected]>
---
fs/stat.c | 17 +++++++++++++++--
include/linux/stat.h | 9 +++++++++
2 files changed, 24 insertions(+), 2 deletions(-)
diff --git a/fs/stat.c b/fs/stat.c
index a7930d744483..e7f8cd4b24e1 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -17,6 +17,7 @@
#include <linux/syscalls.h>
#include <linux/pagemap.h>
#include <linux/compat.h>
+#include <linux/iversion.h>
#include <linux/uaccess.h>
#include <asm/unistd.h>
@@ -118,6 +119,11 @@ int vfs_getattr_nosec(const struct path *path, struct kstat *stat,
stat->attributes_mask |= (STATX_ATTR_AUTOMOUNT |
STATX_ATTR_DAX);
+ if ((request_mask & STATX_VERSION) && IS_I_VERSION(inode)) {
+ stat->result_mask |= STATX_VERSION;
+ stat->version = inode_query_iversion(inode);
+ }
+
mnt_userns = mnt_user_ns(path->mnt);
if (inode->i_op->getattr)
return inode->i_op->getattr(mnt_userns, path, stat,
@@ -587,9 +593,11 @@ cp_statx(const struct kstat *stat, struct statx __user *buffer)
memset(&tmp, 0, sizeof(tmp));
- tmp.stx_mask = stat->result_mask;
+ /* STATX_VERSION is kernel-only for now */
+ tmp.stx_mask = stat->result_mask & ~STATX_VERSION;
tmp.stx_blksize = stat->blksize;
- tmp.stx_attributes = stat->attributes;
+ /* STATX_ATTR_VERSION_MONOTONIC is kernel-only for now */
+ tmp.stx_attributes = stat->attributes & ~STATX_ATTR_VERSION_MONOTONIC;
tmp.stx_nlink = stat->nlink;
tmp.stx_uid = from_kuid_munged(current_user_ns(), stat->uid);
tmp.stx_gid = from_kgid_munged(current_user_ns(), stat->gid);
@@ -628,6 +636,11 @@ int do_statx(int dfd, struct filename *filename, unsigned int flags,
if ((flags & AT_STATX_SYNC_TYPE) == AT_STATX_SYNC_TYPE)
return -EINVAL;
+ /* STATX_VERSION is kernel-only for now. Ignore requests
+ * from userland.
+ */
+ mask &= ~STATX_VERSION;
+
error = vfs_statx(dfd, filename, flags, &stat, mask);
if (error)
return error;
diff --git a/include/linux/stat.h b/include/linux/stat.h
index ff277ced50e9..4e9428d86a3a 100644
--- a/include/linux/stat.h
+++ b/include/linux/stat.h
@@ -52,6 +52,15 @@ struct kstat {
u64 mnt_id;
u32 dio_mem_align;
u32 dio_offset_align;
+ u64 version;
};
+/* These definitions are internal to the kernel for now. Mainly used by nfsd. */
+
+/* mask values */
+#define STATX_VERSION 0x40000000U /* Want/got stx_change_attr */
+
+/* file attribute values */
+#define STATX_ATTR_VERSION_MONOTONIC 0x8000000000000000ULL /* version monotonically increases */
+
#endif
--
2.37.3
The c/mtime and i_version currently get updated before the data is
copied (or a DIO write is issued), which is problematic for NFS.
READ+GETATTR can race with a write (even a local one) in such a way as
to make the client associate the state of the file with the wrong change
attribute. That association can persist indefinitely if the file sees no
further changes.
Move the setting of times to the bottom of the function in
__generic_file_write_iter and only update it if something was
successfully written.
If the time update fails, log a warning once, but don't fail the write.
All of the existing callers use update_time functions that don't fail,
so we should never trip this.
Signed-off-by: Jeff Layton <[email protected]>
---
mm/filemap.c | 17 +++++++++++++----
1 file changed, 13 insertions(+), 4 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index 15800334147b..72c0ceb75176 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3812,10 +3812,6 @@ ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
if (err)
goto out;
- err = file_update_time(file);
- if (err)
- goto out;
-
if (iocb->ki_flags & IOCB_DIRECT) {
loff_t pos, endbyte;
@@ -3868,6 +3864,19 @@ ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
iocb->ki_pos += written;
}
out:
+ if (written > 0) {
+ err = file_update_time(file);
+ /*
+ * There isn't much we can do at this point if updating the
+ * times fails after a successful write. The times and i_version
+ * should still be updated in the inode, and it should still be
+ * marked dirty, so hopefully the next inode update will catch it.
+ * Log a warning once so we have a record that something untoward
+ * has occurred.
+ */
+ WARN_ONCE(err, "Failed to update m/ctime after write: %ld\n", err);
+ }
+
current->backing_dev_info = NULL;
return written ? written : err;
}
--
2.37.3
From: Jeff Layton <[email protected]>
Claim one of the spare fields in struct statx to hold a 64-bit inode
version attribute. When userland requests STATX_VERSION, copy the
value from the kstat struct there, and stop masking off
STATX_ATTR_VERSION_MONOTONIC.
Update the test-statx sample program to output the change attr and
MountId.
Signed-off-by: Jeff Layton <[email protected]>
---
fs/stat.c | 12 +++---------
include/linux/stat.h | 9 ---------
include/uapi/linux/stat.h | 6 ++++--
samples/vfs/test-statx.c | 8 ++++++--
4 files changed, 13 insertions(+), 22 deletions(-)
diff --git a/fs/stat.c b/fs/stat.c
index e7f8cd4b24e1..8396c372022f 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -593,11 +593,9 @@ cp_statx(const struct kstat *stat, struct statx __user *buffer)
memset(&tmp, 0, sizeof(tmp));
- /* STATX_VERSION is kernel-only for now */
- tmp.stx_mask = stat->result_mask & ~STATX_VERSION;
+ tmp.stx_mask = stat->result_mask;
tmp.stx_blksize = stat->blksize;
- /* STATX_ATTR_VERSION_MONOTONIC is kernel-only for now */
- tmp.stx_attributes = stat->attributes & ~STATX_ATTR_VERSION_MONOTONIC;
+ tmp.stx_attributes = stat->attributes;
tmp.stx_nlink = stat->nlink;
tmp.stx_uid = from_kuid_munged(current_user_ns(), stat->uid);
tmp.stx_gid = from_kgid_munged(current_user_ns(), stat->gid);
@@ -621,6 +619,7 @@ cp_statx(const struct kstat *stat, struct statx __user *buffer)
tmp.stx_mnt_id = stat->mnt_id;
tmp.stx_dio_mem_align = stat->dio_mem_align;
tmp.stx_dio_offset_align = stat->dio_offset_align;
+ tmp.stx_version = stat->version;
return copy_to_user(buffer, &tmp, sizeof(tmp)) ? -EFAULT : 0;
}
@@ -636,11 +635,6 @@ int do_statx(int dfd, struct filename *filename, unsigned int flags,
if ((flags & AT_STATX_SYNC_TYPE) == AT_STATX_SYNC_TYPE)
return -EINVAL;
- /* STATX_VERSION is kernel-only for now. Ignore requests
- * from userland.
- */
- mask &= ~STATX_VERSION;
-
error = vfs_statx(dfd, filename, flags, &stat, mask);
if (error)
return error;
diff --git a/include/linux/stat.h b/include/linux/stat.h
index 4e9428d86a3a..69c79e4fd1b1 100644
--- a/include/linux/stat.h
+++ b/include/linux/stat.h
@@ -54,13 +54,4 @@ struct kstat {
u32 dio_offset_align;
u64 version;
};
-
-/* These definitions are internal to the kernel for now. Mainly used by nfsd. */
-
-/* mask values */
-#define STATX_VERSION 0x40000000U /* Want/got stx_change_attr */
-
-/* file attribute values */
-#define STATX_ATTR_VERSION_MONOTONIC 0x8000000000000000ULL /* version monotonically increases */
-
#endif
diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
index 7cab2c65d3d7..4a0a1f27c059 100644
--- a/include/uapi/linux/stat.h
+++ b/include/uapi/linux/stat.h
@@ -127,7 +127,8 @@ struct statx {
__u32 stx_dio_mem_align; /* Memory buffer alignment for direct I/O */
__u32 stx_dio_offset_align; /* File offset alignment for direct I/O */
/* 0xa0 */
- __u64 __spare3[12]; /* Spare space for future expansion */
+ __u64 stx_version; /* Inode change attribute */
+ __u64 __spare3[11]; /* Spare space for future expansion */
/* 0x100 */
};
@@ -154,6 +155,7 @@ struct statx {
#define STATX_BTIME 0x00000800U /* Want/got stx_btime */
#define STATX_MNT_ID 0x00001000U /* Got stx_mnt_id */
#define STATX_DIOALIGN 0x00002000U /* Want/got direct I/O alignment info */
+#define STATX_VERSION 0x00004000U /* Want/got stx_version */
#define STATX__RESERVED 0x80000000U /* Reserved for future struct statx expansion */
@@ -189,6 +191,6 @@ struct statx {
#define STATX_ATTR_MOUNT_ROOT 0x00002000 /* Root of a mount */
#define STATX_ATTR_VERITY 0x00100000 /* [I] Verity protected file */
#define STATX_ATTR_DAX 0x00200000 /* File is currently in DAX state */
-
+#define STATX_ATTR_VERSION_MONOTONIC 0x00400000 /* stx_version increases w/ every change */
#endif /* _UAPI_LINUX_STAT_H */
diff --git a/samples/vfs/test-statx.c b/samples/vfs/test-statx.c
index 49c7a46cee07..bdbc371c9774 100644
--- a/samples/vfs/test-statx.c
+++ b/samples/vfs/test-statx.c
@@ -107,6 +107,8 @@ static void dump_statx(struct statx *stx)
printf("Device: %-15s", buffer);
if (stx->stx_mask & STATX_INO)
printf(" Inode: %-11llu", (unsigned long long) stx->stx_ino);
+ if (stx->stx_mask & STATX_MNT_ID)
+ printf(" MountId: %llx", stx->stx_mnt_id);
if (stx->stx_mask & STATX_NLINK)
printf(" Links: %-5u", stx->stx_nlink);
if (stx->stx_mask & STATX_TYPE) {
@@ -145,7 +147,9 @@ static void dump_statx(struct statx *stx)
if (stx->stx_mask & STATX_CTIME)
print_time("Change: ", &stx->stx_ctime);
if (stx->stx_mask & STATX_BTIME)
- print_time(" Birth: ", &stx->stx_btime);
+ print_time("Birth: ", &stx->stx_btime);
+ if (stx->stx_mask & STATX_VERSION)
+ printf("Inode Version: 0x%llx\n", stx->stx_version);
if (stx->stx_attributes_mask) {
unsigned char bits, mbits;
@@ -218,7 +222,7 @@ int main(int argc, char **argv)
struct statx stx;
int ret, raw = 0, atflag = AT_SYMLINK_NOFOLLOW;
- unsigned int mask = STATX_BASIC_STATS | STATX_BTIME;
+ unsigned int mask = STATX_BASIC_STATS | STATX_BTIME | STATX_MNT_ID | STATX_VERSION;
for (argv++; *argv; argv++) {
if (strcmp(*argv, "-F") == 0) {
--
2.37.3
When getattr requests the STX_VERSION, request the full gamut of caps
(similarly to how ctime is handled). When the change attribute seems to
be valid, return it in the ino_version field and set the flag in the
reply mask. Also, unconditionally enable STATX_ATTR_VERSION_MONOTONIC.
Reviewed-by: Xiubo Li <[email protected]>
Signed-off-by: Jeff Layton <[email protected]>
---
fs/ceph/inode.c | 16 +++++++++++-----
1 file changed, 11 insertions(+), 5 deletions(-)
diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 42351d7a0dd6..bcab855bf1ae 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -2415,10 +2415,10 @@ static int statx_to_caps(u32 want, umode_t mode)
{
int mask = 0;
- if (want & (STATX_MODE|STATX_UID|STATX_GID|STATX_CTIME|STATX_BTIME))
+ if (want & (STATX_MODE|STATX_UID|STATX_GID|STATX_CTIME|STATX_BTIME|STATX_VERSION))
mask |= CEPH_CAP_AUTH_SHARED;
- if (want & (STATX_NLINK|STATX_CTIME)) {
+ if (want & (STATX_NLINK|STATX_CTIME|STATX_VERSION)) {
/*
* The link count for directories depends on inode->i_subdirs,
* and that is only updated when Fs caps are held.
@@ -2429,11 +2429,10 @@ static int statx_to_caps(u32 want, umode_t mode)
mask |= CEPH_CAP_LINK_SHARED;
}
- if (want & (STATX_ATIME|STATX_MTIME|STATX_CTIME|STATX_SIZE|
- STATX_BLOCKS))
+ if (want & (STATX_ATIME|STATX_MTIME|STATX_CTIME|STATX_SIZE|STATX_BLOCKS|STATX_VERSION))
mask |= CEPH_CAP_FILE_SHARED;
- if (want & (STATX_CTIME))
+ if (want & (STATX_CTIME|STATX_VERSION))
mask |= CEPH_CAP_XATTR_SHARED;
return mask;
@@ -2475,6 +2474,11 @@ int ceph_getattr(struct user_namespace *mnt_userns, const struct path *path,
valid_mask |= STATX_BTIME;
}
+ if (request_mask & STATX_VERSION) {
+ stat->version = inode_peek_iversion_raw(inode);
+ valid_mask |= STATX_VERSION;
+ }
+
if (ceph_snap(inode) == CEPH_NOSNAP)
stat->dev = inode->i_sb->s_dev;
else
@@ -2498,6 +2502,8 @@ int ceph_getattr(struct user_namespace *mnt_userns, const struct path *path,
stat->nlink = 1 + 1 + ci->i_subdirs;
}
+ stat->attributes_mask |= STATX_ATTR_VERSION_MONOTONIC;
+ stat->attributes |= STATX_ATTR_VERSION_MONOTONIC;
stat->result_mask = request_mask & valid_mask;
return err;
}
--
2.37.3
Hi Jeff,
I love your patch! Perhaps something to improve:
[auto build test WARNING on next-20220929]
[cannot apply to tytso-ext4/dev trondmy-nfs/linux-next ceph-client/for-linus linus/master v6.0-rc7 v6.0-rc6 v6.0-rc5 v6.0-rc7]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Jeff-Layton/vfs-nfsd-clean-up-handling-of-i_version-counter/20220930-192844
base: 1c6c4f42b3de4f18ea96d62950d0e266ca35a055
config: i386-randconfig-a001
compiler: gcc-11 (Debian 11.3.0-5) 11.3.0
reproduce (this is a W=1 build):
# https://github.com/intel-lab-lkp/linux/commit/b305ff6981591917824b59e5ac7f833afea77ce6
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Jeff-Layton/vfs-nfsd-clean-up-handling-of-i_version-counter/20220930-192844
git checkout b305ff6981591917824b59e5ac7f833afea77ce6
# save the config file
mkdir build_dir && cp config build_dir/.config
make W=1 O=build_dir ARCH=i386 SHELL=/bin/bash
If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <[email protected]>
All warnings (new ones prefixed by >>):
In file included from arch/x86/include/asm/bug.h:87,
from include/linux/bug.h:5,
from include/linux/thread_info.h:13,
from arch/x86/include/asm/preempt.h:7,
from include/linux/preempt.h:78,
from include/linux/spinlock.h:56,
from include/linux/wait.h:9,
from include/linux/wait_bit.h:8,
from include/linux/fs.h:6,
from include/linux/dax.h:5,
from mm/filemap.c:15:
mm/filemap.c: In function '__generic_file_write_iter':
>> mm/filemap.c:3892:32: warning: format '%ld' expects argument of type 'long int', but argument 2 has type 'ssize_t' {aka 'int'} [-Wformat=]
3892 | WARN_ONCE(err, "Failed to update m/ctime after write: %ld\n", err);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~
| |
| ssize_t {aka int}
include/asm-generic/bug.h:105:31: note: in definition of macro '__WARN_printf'
105 | __warn_printk(arg); \
| ^~~
include/linux/once_lite.h:31:25: note: in expansion of macro 'WARN'
31 | func(__VA_ARGS__); \
| ^~~~
include/asm-generic/bug.h:151:9: note: in expansion of macro 'DO_ONCE_LITE_IF'
151 | DO_ONCE_LITE_IF(condition, WARN, 1, format)
| ^~~~~~~~~~~~~~~
mm/filemap.c:3892:17: note: in expansion of macro 'WARN_ONCE'
3892 | WARN_ONCE(err, "Failed to update m/ctime after write: %ld\n", err);
| ^~~~~~~~~
mm/filemap.c:3892:73: note: format string is defined here
3892 | WARN_ONCE(err, "Failed to update m/ctime after write: %ld\n", err);
| ~~^
| |
| long int
| %d
vim +3892 mm/filemap.c
3793
3794 /**
3795 * __generic_file_write_iter - write data to a file
3796 * @iocb: IO state structure (file, offset, etc.)
3797 * @from: iov_iter with data to write
3798 *
3799 * This function does all the work needed for actually writing data to a
3800 * file. It does all basic checks, removes SUID from the file, updates
3801 * modification times and calls proper subroutines depending on whether we
3802 * do direct IO or a standard buffered write.
3803 *
3804 * It expects i_rwsem to be grabbed unless we work on a block device or similar
3805 * object which does not need locking at all.
3806 *
3807 * This function does *not* take care of syncing data in case of O_SYNC write.
3808 * A caller has to handle it. This is mainly due to the fact that we want to
3809 * avoid syncing under i_rwsem.
3810 *
3811 * Return:
3812 * * number of bytes written, even for truncated writes
3813 * * negative error code if no data has been written at all
3814 */
3815 ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
3816 {
3817 struct file *file = iocb->ki_filp;
3818 struct address_space *mapping = file->f_mapping;
3819 struct inode *inode = mapping->host;
3820 ssize_t written = 0;
3821 ssize_t err;
3822 ssize_t status;
3823
3824 /* We can write back this queue in page reclaim */
3825 current->backing_dev_info = inode_to_bdi(inode);
3826 err = file_remove_privs(file);
3827 if (err)
3828 goto out;
3829
3830 if (iocb->ki_flags & IOCB_DIRECT) {
3831 loff_t pos, endbyte;
3832
3833 written = generic_file_direct_write(iocb, from);
3834 /*
3835 * If the write stopped short of completing, fall back to
3836 * buffered writes. Some filesystems do this for writes to
3837 * holes, for example. For DAX files, a buffered write will
3838 * not succeed (even if it did, DAX does not handle dirty
3839 * page-cache pages correctly).
3840 */
3841 if (written < 0 || !iov_iter_count(from) || IS_DAX(inode))
3842 goto out;
3843
3844 pos = iocb->ki_pos;
3845 status = generic_perform_write(iocb, from);
3846 /*
3847 * If generic_perform_write() returned a synchronous error
3848 * then we want to return the number of bytes which were
3849 * direct-written, or the error code if that was zero. Note
3850 * that this differs from normal direct-io semantics, which
3851 * will return -EFOO even if some bytes were written.
3852 */
3853 if (unlikely(status < 0)) {
3854 err = status;
3855 goto out;
3856 }
3857 /*
3858 * We need to ensure that the page cache pages are written to
3859 * disk and invalidated to preserve the expected O_DIRECT
3860 * semantics.
3861 */
3862 endbyte = pos + status - 1;
3863 err = filemap_write_and_wait_range(mapping, pos, endbyte);
3864 if (err == 0) {
3865 iocb->ki_pos = endbyte + 1;
3866 written += status;
3867 invalidate_mapping_pages(mapping,
3868 pos >> PAGE_SHIFT,
3869 endbyte >> PAGE_SHIFT);
3870 } else {
3871 /*
3872 * We don't know how much we wrote, so just return
3873 * the number of bytes which were direct-written
3874 */
3875 }
3876 } else {
3877 written = generic_perform_write(iocb, from);
3878 if (likely(written > 0))
3879 iocb->ki_pos += written;
3880 }
3881 out:
3882 if (written > 0) {
3883 err = file_update_time(file);
3884 /*
3885 * There isn't much we can do at this point if updating the
3886 * times fails after a successful write. The times and i_version
3887 * should still be updated in the inode, and it should still be
3888 * marked dirty, so hopefully the next inode update will catch it.
3889 * Log a warning once so we have a record that something untoward
3890 * has occurred.
3891 */
> 3892 WARN_ONCE(err, "Failed to update m/ctime after write: %ld\n", err);
3893 }
3894
3895 current->backing_dev_info = NULL;
3896 return written ? written : err;
3897 }
3898 EXPORT_SYMBOL(__generic_file_write_iter);
3899
--
0-DAY CI Kernel Test Service
https://01.org/lkp
Hi Jeff,
I love your patch! Perhaps something to improve:
[auto build test WARNING on next-20220929]
[cannot apply to tytso-ext4/dev trondmy-nfs/linux-next ceph-client/for-linus linus/master v6.0-rc7 v6.0-rc6 v6.0-rc5 v6.0-rc7]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Jeff-Layton/vfs-nfsd-clean-up-handling-of-i_version-counter/20220930-192844
base: 1c6c4f42b3de4f18ea96d62950d0e266ca35a055
config: i386-randconfig-a002
compiler: clang version 14.0.6 (https://github.com/llvm/llvm-project f28c006a5895fc0e329fe15fead81e37457cb1d1)
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/intel-lab-lkp/linux/commit/b305ff6981591917824b59e5ac7f833afea77ce6
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Jeff-Layton/vfs-nfsd-clean-up-handling-of-i_version-counter/20220930-192844
git checkout b305ff6981591917824b59e5ac7f833afea77ce6
# save the config file
mkdir build_dir && cp config build_dir/.config
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=i386 SHELL=/bin/bash
If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <[email protected]>
All warnings (new ones prefixed by >>):
>> mm/filemap.c:3892:65: warning: format specifies type 'long' but the argument has type 'ssize_t' (aka 'int') [-Wformat]
WARN_ONCE(err, "Failed to update m/ctime after write: %ld\n", err);
~~~ ^~~
%zd
include/asm-generic/bug.h:151:38: note: expanded from macro 'WARN_ONCE'
DO_ONCE_LITE_IF(condition, WARN, 1, format)
^~~~~~
include/linux/once_lite.h:31:9: note: expanded from macro 'DO_ONCE_LITE_IF'
func(__VA_ARGS__); \
^~~~~~~~~~~
include/asm-generic/bug.h:133:29: note: expanded from macro 'WARN'
__WARN_printf(TAINT_WARN, format); \
^~~~~~
include/asm-generic/bug.h:105:17: note: expanded from macro '__WARN_printf'
__warn_printk(arg); \
^~~
1 warning generated.
vim +3892 mm/filemap.c
3793
3794 /**
3795 * __generic_file_write_iter - write data to a file
3796 * @iocb: IO state structure (file, offset, etc.)
3797 * @from: iov_iter with data to write
3798 *
3799 * This function does all the work needed for actually writing data to a
3800 * file. It does all basic checks, removes SUID from the file, updates
3801 * modification times and calls proper subroutines depending on whether we
3802 * do direct IO or a standard buffered write.
3803 *
3804 * It expects i_rwsem to be grabbed unless we work on a block device or similar
3805 * object which does not need locking at all.
3806 *
3807 * This function does *not* take care of syncing data in case of O_SYNC write.
3808 * A caller has to handle it. This is mainly due to the fact that we want to
3809 * avoid syncing under i_rwsem.
3810 *
3811 * Return:
3812 * * number of bytes written, even for truncated writes
3813 * * negative error code if no data has been written at all
3814 */
3815 ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
3816 {
3817 struct file *file = iocb->ki_filp;
3818 struct address_space *mapping = file->f_mapping;
3819 struct inode *inode = mapping->host;
3820 ssize_t written = 0;
3821 ssize_t err;
3822 ssize_t status;
3823
3824 /* We can write back this queue in page reclaim */
3825 current->backing_dev_info = inode_to_bdi(inode);
3826 err = file_remove_privs(file);
3827 if (err)
3828 goto out;
3829
3830 if (iocb->ki_flags & IOCB_DIRECT) {
3831 loff_t pos, endbyte;
3832
3833 written = generic_file_direct_write(iocb, from);
3834 /*
3835 * If the write stopped short of completing, fall back to
3836 * buffered writes. Some filesystems do this for writes to
3837 * holes, for example. For DAX files, a buffered write will
3838 * not succeed (even if it did, DAX does not handle dirty
3839 * page-cache pages correctly).
3840 */
3841 if (written < 0 || !iov_iter_count(from) || IS_DAX(inode))
3842 goto out;
3843
3844 pos = iocb->ki_pos;
3845 status = generic_perform_write(iocb, from);
3846 /*
3847 * If generic_perform_write() returned a synchronous error
3848 * then we want to return the number of bytes which were
3849 * direct-written, or the error code if that was zero. Note
3850 * that this differs from normal direct-io semantics, which
3851 * will return -EFOO even if some bytes were written.
3852 */
3853 if (unlikely(status < 0)) {
3854 err = status;
3855 goto out;
3856 }
3857 /*
3858 * We need to ensure that the page cache pages are written to
3859 * disk and invalidated to preserve the expected O_DIRECT
3860 * semantics.
3861 */
3862 endbyte = pos + status - 1;
3863 err = filemap_write_and_wait_range(mapping, pos, endbyte);
3864 if (err == 0) {
3865 iocb->ki_pos = endbyte + 1;
3866 written += status;
3867 invalidate_mapping_pages(mapping,
3868 pos >> PAGE_SHIFT,
3869 endbyte >> PAGE_SHIFT);
3870 } else {
3871 /*
3872 * We don't know how much we wrote, so just return
3873 * the number of bytes which were direct-written
3874 */
3875 }
3876 } else {
3877 written = generic_perform_write(iocb, from);
3878 if (likely(written > 0))
3879 iocb->ki_pos += written;
3880 }
3881 out:
3882 if (written > 0) {
3883 err = file_update_time(file);
3884 /*
3885 * There isn't much we can do at this point if updating the
3886 * times fails after a successful write. The times and i_version
3887 * should still be updated in the inode, and it should still be
3888 * marked dirty, so hopefully the next inode update will catch it.
3889 * Log a warning once so we have a record that something untoward
3890 * has occurred.
3891 */
> 3892 WARN_ONCE(err, "Failed to update m/ctime after write: %ld\n", err);
3893 }
3894
3895 current->backing_dev_info = NULL;
3896 return written ? written : err;
3897 }
3898 EXPORT_SYMBOL(__generic_file_write_iter);
3899
--
0-DAY CI Kernel Test Service
https://01.org/lkp
On Fri, Sep 30, 2022 at 2:30 PM Jeff Layton <[email protected]> wrote:
>
> The c/mtime and i_version currently get updated before the data is
> copied (or a DIO write is issued), which is problematic for NFS.
>
> READ+GETATTR can race with a write (even a local one) in such a way as
> to make the client associate the state of the file with the wrong change
> attribute. That association can persist indefinitely if the file sees no
> further changes.
>
> Move the setting of times to the bottom of the function in
> __generic_file_write_iter and only update it if something was
> successfully written.
>
This solution is wrong for several reasons:
1. There is still file_update_time() in ->page_mkwrite() so you haven't
solved the problem completely
2. The other side of the coin is that post crash state is more likely to end
up data changes without mtime/ctime change
If I read the problem description correctly, then a solution that invalidates
the NFS cache before AND after the write would be acceptable. Right?
Would an extra i_version bump after the write solve the race?
> If the time update fails, log a warning once, but don't fail the write.
> All of the existing callers use update_time functions that don't fail,
> so we should never trip this.
>
> Signed-off-by: Jeff Layton <[email protected]>
> ---
> mm/filemap.c | 17 +++++++++++++----
> 1 file changed, 13 insertions(+), 4 deletions(-)
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 15800334147b..72c0ceb75176 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3812,10 +3812,6 @@ ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
> if (err)
> goto out;
>
> - err = file_update_time(file);
> - if (err)
> - goto out;
> -
> if (iocb->ki_flags & IOCB_DIRECT) {
> loff_t pos, endbyte;
>
> @@ -3868,6 +3864,19 @@ ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
> iocb->ki_pos += written;
> }
> out:
> + if (written > 0) {
> + err = file_update_time(file);
> + /*
> + * There isn't much we can do at this point if updating the
> + * times fails after a successful write. The times and i_version
> + * should still be updated in the inode, and it should still be
> + * marked dirty, so hopefully the next inode update will catch it.
> + * Log a warning once so we have a record that something untoward
> + * has occurred.
> + */
> + WARN_ONCE(err, "Failed to update m/ctime after write: %ld\n", err);
pr_warn_once() please - this is not a programming assertion.
Thanks,
Amir.
On Sun, 2022-10-02 at 10:08 +0300, Amir Goldstein wrote:
> On Fri, Sep 30, 2022 at 2:30 PM Jeff Layton <[email protected]> wrote:
> >
> > The c/mtime and i_version currently get updated before the data is
> > copied (or a DIO write is issued), which is problematic for NFS.
> >
> > READ+GETATTR can race with a write (even a local one) in such a way as
> > to make the client associate the state of the file with the wrong change
> > attribute. That association can persist indefinitely if the file sees no
> > further changes.
> >
> > Move the setting of times to the bottom of the function in
> > __generic_file_write_iter and only update it if something was
> > successfully written.
> >
>
> This solution is wrong for several reasons:
>
> 1. There is still file_update_time() in ->page_mkwrite() so you haven't
> solved the problem completely
Right. I don't think there is a way to solve the problem vs. mmap.
Userland can write to a writeable mmap'ed page at any time and we'd
never know. We have to specifically carve out mmap as an exception here.
I'll plan to add something to the manpage patch for this.
> 2. The other side of the coin is that post crash state is more likely to end
> up data changes without mtime/ctime change
>
Is this really something filesystems rely on? I suppose the danger is
that some cached data gets written to disk before the write returns and
the inode on disk never gets updated.
But...isn't that a danger now? Some of the cached data could get written
out and the updated inode just never makes it to disk before a crash
(AFAIU). I'm not sure that this increases our exposure to that problem.
> If I read the problem description correctly, then a solution that invalidates
> the NFS cache before AND after the write would be acceptable. Right?
> Would an extra i_version bump after the write solve the race?
>
I based this patch on Neil's assertion that updating the time before an
operation was pointless if we were going to do it afterward. The NFS
client only really cares about seeing it change after a write.
Doing both would be fine from a correctness standpoint, and in most
cases, the second would be a no-op anyway since a query would have to
race in between the two for that to happen.
FWIW, I think we should update the m/ctime and version at the same time.
If the version changes, then there is always the potential that a timer
tick has occurred. So, that would translate to a second call to
file_update_time in here.
The downside of bumping the times/version both before and after is that
these are hot codepaths, and we'd be adding extra operations there. Even
in the case where nothing has changed, we'd have to call
inode_needs_update_time a second time for every write. Is that worth the
cost?
> > If the time update fails, log a warning once, but don't fail the write.
> > All of the existing callers use update_time functions that don't fail,
> > so we should never trip this.
> >
> > Signed-off-by: Jeff Layton <[email protected]>
> > ---
> > mm/filemap.c | 17 +++++++++++++----
> > 1 file changed, 13 insertions(+), 4 deletions(-)
> >
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index 15800334147b..72c0ceb75176 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -3812,10 +3812,6 @@ ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
> > if (err)
> > goto out;
> >
> > - err = file_update_time(file);
> > - if (err)
> > - goto out;
> > -
> > if (iocb->ki_flags & IOCB_DIRECT) {
> > loff_t pos, endbyte;
> >
> > @@ -3868,6 +3864,19 @@ ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
> > iocb->ki_pos += written;
> > }
> > out:
> > + if (written > 0) {
> > + err = file_update_time(file);
> > + /*
> > + * There isn't much we can do at this point if updating the
> > + * times fails after a successful write. The times and i_version
> > + * should still be updated in the inode, and it should still be
> > + * marked dirty, so hopefully the next inode update will catch it.
> > + * Log a warning once so we have a record that something untoward
> > + * has occurred.
> > + */
> > + WARN_ONCE(err, "Failed to update m/ctime after write: %ld\n", err);
>
> pr_warn_once() please - this is not a programming assertion.
>
ACK. I'll change that.
--
Jeff Layton <[email protected]>
On Mon, Oct 3, 2022 at 4:01 PM Jeff Layton <[email protected]> wrote:
>
> On Sun, 2022-10-02 at 10:08 +0300, Amir Goldstein wrote:
> > On Fri, Sep 30, 2022 at 2:30 PM Jeff Layton <[email protected]> wrote:
> > >
> > > The c/mtime and i_version currently get updated before the data is
> > > copied (or a DIO write is issued), which is problematic for NFS.
> > >
> > > READ+GETATTR can race with a write (even a local one) in such a way as
> > > to make the client associate the state of the file with the wrong change
> > > attribute. That association can persist indefinitely if the file sees no
> > > further changes.
> > >
> > > Move the setting of times to the bottom of the function in
> > > __generic_file_write_iter and only update it if something was
> > > successfully written.
> > >
> >
> > This solution is wrong for several reasons:
> >
> > 1. There is still file_update_time() in ->page_mkwrite() so you haven't
> > solved the problem completely
>
> Right. I don't think there is a way to solve the problem vs. mmap.
> Userland can write to a writeable mmap'ed page at any time and we'd
> never know. We have to specifically carve out mmap as an exception here.
> I'll plan to add something to the manpage patch for this.
>
> > 2. The other side of the coin is that post crash state is more likely to end
> > up data changes without mtime/ctime change
> >
>
> Is this really something filesystems rely on? I suppose the danger is
> that some cached data gets written to disk before the write returns and
> the inode on disk never gets updated.
>
> But...isn't that a danger now? Some of the cached data could get written
> out and the updated inode just never makes it to disk before a crash
> (AFAIU). I'm not sure that this increases our exposure to that problem.
>
>
You are correct that that danger exists, but it only exists for overwriting
to allocated blocks.
For writing to new blocks, mtime change is recorded in transaction
before the block mapping is recorded in transaction so there is no
danger in this case (before your patch).
Also, observing size change without observing mtime change
after crash seems like a very bad outcome that may be possible
after your change.
These are just a few cases that I could think of, they may be filesystem
dependent, but my gut feeling is that if you remove the time update before
the operation, that has been like that forever, a lot of s#!t is going to float
for various filesystems and applications.
And it is not one of those things that are discovered during rc or even
stable kernel testing - they are discovered much later when users start to
realize their applications got bogged up after crash, so it feels like to me
like playing with fire.
> > If I read the problem description correctly, then a solution that invalidates
> > the NFS cache before AND after the write would be acceptable. Right?
> > Would an extra i_version bump after the write solve the race?
> >
>
> I based this patch on Neil's assertion that updating the time before an
> operation was pointless if we were going to do it afterward. The NFS
> client only really cares about seeing it change after a write.
>
Pointless to NFS client maybe.
Whether or not this is not changing user behavior for other applications
is up to you to prove and I doubt that you can prove it because I doubt
that it is true.
> Doing both would be fine from a correctness standpoint, and in most
> cases, the second would be a no-op anyway since a query would have to
> race in between the two for that to happen.
>
> FWIW, I think we should update the m/ctime and version at the same time.
> If the version changes, then there is always the potential that a timer
> tick has occurred. So, that would translate to a second call to
> file_update_time in here.
>
> The downside of bumping the times/version both before and after is that
> these are hot codepaths, and we'd be adding extra operations there. Even
> in the case where nothing has changed, we'd have to call
> inode_needs_update_time a second time for every write. Is that worth the
> cost?
Is there a practical cost for iversion bump AFTER write as I suggested?
If you NEED m/ctime update AFTER write and iversion update is not enough
then I did not understand from your commit message why that is.
Thanks,
Amir.
On Tue, 04 Oct 2022, Amir Goldstein wrote:
> On Mon, Oct 3, 2022 at 4:01 PM Jeff Layton <[email protected]> wrote:
> >
> > On Sun, 2022-10-02 at 10:08 +0300, Amir Goldstein wrote:
> > > On Fri, Sep 30, 2022 at 2:30 PM Jeff Layton <[email protected]> wrote:
> > > >
> > > > The c/mtime and i_version currently get updated before the data is
> > > > copied (or a DIO write is issued), which is problematic for NFS.
> > > >
> > > > READ+GETATTR can race with a write (even a local one) in such a way as
> > > > to make the client associate the state of the file with the wrong change
> > > > attribute. That association can persist indefinitely if the file sees no
> > > > further changes.
> > > >
> > > > Move the setting of times to the bottom of the function in
> > > > __generic_file_write_iter and only update it if something was
> > > > successfully written.
> > > >
> > >
> > > This solution is wrong for several reasons:
> > >
> > > 1. There is still file_update_time() in ->page_mkwrite() so you haven't
> > > solved the problem completely
> >
> > Right. I don't think there is a way to solve the problem vs. mmap.
> > Userland can write to a writeable mmap'ed page at any time and we'd
> > never know. We have to specifically carve out mmap as an exception here.
> > I'll plan to add something to the manpage patch for this.
> >
> > > 2. The other side of the coin is that post crash state is more likely to end
> > > up data changes without mtime/ctime change
> > >
> >
> > Is this really something filesystems rely on? I suppose the danger is
> > that some cached data gets written to disk before the write returns and
> > the inode on disk never gets updated.
> >
> > But...isn't that a danger now? Some of the cached data could get written
> > out and the updated inode just never makes it to disk before a crash
> > (AFAIU). I'm not sure that this increases our exposure to that problem.
> >
> >
>
> You are correct that that danger exists, but it only exists for overwriting
> to allocated blocks.
>
> For writing to new blocks, mtime change is recorded in transaction
> before the block mapping is recorded in transaction so there is no
> danger in this case (before your patch).
>
> Also, observing size change without observing mtime change
> after crash seems like a very bad outcome that may be possible
> after your change.
>
> These are just a few cases that I could think of, they may be filesystem
> dependent, but my gut feeling is that if you remove the time update before
> the operation, that has been like that forever, a lot of s#!t is going to float
> for various filesystems and applications.
>
> And it is not one of those things that are discovered during rc or even
> stable kernel testing - they are discovered much later when users start to
> realize their applications got bogged up after crash, so it feels like to me
> like playing with fire.
>
> > > If I read the problem description correctly, then a solution that invalidates
> > > the NFS cache before AND after the write would be acceptable. Right?
> > > Would an extra i_version bump after the write solve the race?
> > >
> >
> > I based this patch on Neil's assertion that updating the time before an
> > operation was pointless if we were going to do it afterward. The NFS
> > client only really cares about seeing it change after a write.
> >
>
> Pointless to NFS client maybe.
> Whether or not this is not changing user behavior for other applications
> is up to you to prove and I doubt that you can prove it because I doubt
> that it is true.
>
> > Doing both would be fine from a correctness standpoint, and in most
> > cases, the second would be a no-op anyway since a query would have to
> > race in between the two for that to happen.
> >
> > FWIW, I think we should update the m/ctime and version at the same time.
> > If the version changes, then there is always the potential that a timer
> > tick has occurred. So, that would translate to a second call to
> > file_update_time in here.
> >
> > The downside of bumping the times/version both before and after is that
> > these are hot codepaths, and we'd be adding extra operations there. Even
> > in the case where nothing has changed, we'd have to call
> > inode_needs_update_time a second time for every write. Is that worth the
> > cost?
>
> Is there a practical cost for iversion bump AFTER write as I suggested?
> If you NEED m/ctime update AFTER write and iversion update is not enough
> then I did not understand from your commit message why that is.
>
> Thanks,
> Amir.
>
Maybe we should split i_version updates from ctime updates.
While it isn't true that ctime updates have happened before the write
"forever" it has been true since 2.3.43[1] which is close to forever.
For ctime there doesn't appear to be a strong specification of when the
change happens, so history provides a good case for leaving it before.
For i_version we want to provide clear and unambiguous semantics.
Performing 2 updates makes the specification muddy.
So I would prefer a single update for i_version, performed after the
change becomes visible. If that means it has to be separate from ctime,
then so be it.
NeilBrown
[1]: https://git.kernel.org/pub/scm/linux/kernel/git/history/history.git/commit/?id=636b38438001a00b25f23e38747a91cb8428af29
On Fri, 30 Sep 2022, Jeff Layton wrote:
> From: Jeff Layton <[email protected]>
>
> Claim one of the spare fields in struct statx to hold a 64-bit inode
> version attribute. When userland requests STATX_VERSION, copy the
> value from the kstat struct there, and stop masking off
> STATX_ATTR_VERSION_MONOTONIC.
>
> Update the test-statx sample program to output the change attr and
> MountId.
>
> Signed-off-by: Jeff Layton <[email protected]>
> ---
> fs/stat.c | 12 +++---------
> include/linux/stat.h | 9 ---------
> include/uapi/linux/stat.h | 6 ++++--
> samples/vfs/test-statx.c | 8 ++++++--
> 4 files changed, 13 insertions(+), 22 deletions(-)
>
> diff --git a/fs/stat.c b/fs/stat.c
> index e7f8cd4b24e1..8396c372022f 100644
> --- a/fs/stat.c
> +++ b/fs/stat.c
> @@ -593,11 +593,9 @@ cp_statx(const struct kstat *stat, struct statx __user *buffer)
>
> memset(&tmp, 0, sizeof(tmp));
>
> - /* STATX_VERSION is kernel-only for now */
> - tmp.stx_mask = stat->result_mask & ~STATX_VERSION;
> + tmp.stx_mask = stat->result_mask;
> tmp.stx_blksize = stat->blksize;
> - /* STATX_ATTR_VERSION_MONOTONIC is kernel-only for now */
> - tmp.stx_attributes = stat->attributes & ~STATX_ATTR_VERSION_MONOTONIC;
> + tmp.stx_attributes = stat->attributes;
> tmp.stx_nlink = stat->nlink;
> tmp.stx_uid = from_kuid_munged(current_user_ns(), stat->uid);
> tmp.stx_gid = from_kgid_munged(current_user_ns(), stat->gid);
> @@ -621,6 +619,7 @@ cp_statx(const struct kstat *stat, struct statx __user *buffer)
> tmp.stx_mnt_id = stat->mnt_id;
> tmp.stx_dio_mem_align = stat->dio_mem_align;
> tmp.stx_dio_offset_align = stat->dio_offset_align;
> + tmp.stx_version = stat->version;
>
> return copy_to_user(buffer, &tmp, sizeof(tmp)) ? -EFAULT : 0;
> }
> @@ -636,11 +635,6 @@ int do_statx(int dfd, struct filename *filename, unsigned int flags,
> if ((flags & AT_STATX_SYNC_TYPE) == AT_STATX_SYNC_TYPE)
> return -EINVAL;
>
> - /* STATX_VERSION is kernel-only for now. Ignore requests
> - * from userland.
> - */
> - mask &= ~STATX_VERSION;
> -
> error = vfs_statx(dfd, filename, flags, &stat, mask);
> if (error)
> return error;
> diff --git a/include/linux/stat.h b/include/linux/stat.h
> index 4e9428d86a3a..69c79e4fd1b1 100644
> --- a/include/linux/stat.h
> +++ b/include/linux/stat.h
> @@ -54,13 +54,4 @@ struct kstat {
> u32 dio_offset_align;
> u64 version;
> };
> -
> -/* These definitions are internal to the kernel for now. Mainly used by nfsd. */
> -
> -/* mask values */
> -#define STATX_VERSION 0x40000000U /* Want/got stx_change_attr */
> -
> -/* file attribute values */
> -#define STATX_ATTR_VERSION_MONOTONIC 0x8000000000000000ULL /* version monotonically increases */
> -
> #endif
> diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
> index 7cab2c65d3d7..4a0a1f27c059 100644
> --- a/include/uapi/linux/stat.h
> +++ b/include/uapi/linux/stat.h
> @@ -127,7 +127,8 @@ struct statx {
> __u32 stx_dio_mem_align; /* Memory buffer alignment for direct I/O */
> __u32 stx_dio_offset_align; /* File offset alignment for direct I/O */
> /* 0xa0 */
> - __u64 __spare3[12]; /* Spare space for future expansion */
> + __u64 stx_version; /* Inode change attribute */
> + __u64 __spare3[11]; /* Spare space for future expansion */
> /* 0x100 */
> };
>
> @@ -154,6 +155,7 @@ struct statx {
> #define STATX_BTIME 0x00000800U /* Want/got stx_btime */
> #define STATX_MNT_ID 0x00001000U /* Got stx_mnt_id */
> #define STATX_DIOALIGN 0x00002000U /* Want/got direct I/O alignment info */
> +#define STATX_VERSION 0x00004000U /* Want/got stx_version */
>
> #define STATX__RESERVED 0x80000000U /* Reserved for future struct statx expansion */
>
> @@ -189,6 +191,6 @@ struct statx {
> #define STATX_ATTR_MOUNT_ROOT 0x00002000 /* Root of a mount */
> #define STATX_ATTR_VERITY 0x00100000 /* [I] Verity protected file */
> #define STATX_ATTR_DAX 0x00200000 /* File is currently in DAX state */
> -
> +#define STATX_ATTR_VERSION_MONOTONIC 0x00400000 /* stx_version increases w/ every change */
>
> #endif /* _UAPI_LINUX_STAT_H */
> diff --git a/samples/vfs/test-statx.c b/samples/vfs/test-statx.c
> index 49c7a46cee07..bdbc371c9774 100644
> --- a/samples/vfs/test-statx.c
> +++ b/samples/vfs/test-statx.c
> @@ -107,6 +107,8 @@ static void dump_statx(struct statx *stx)
> printf("Device: %-15s", buffer);
> if (stx->stx_mask & STATX_INO)
> printf(" Inode: %-11llu", (unsigned long long) stx->stx_ino);
> + if (stx->stx_mask & STATX_MNT_ID)
> + printf(" MountId: %llx", stx->stx_mnt_id);
> if (stx->stx_mask & STATX_NLINK)
> printf(" Links: %-5u", stx->stx_nlink);
> if (stx->stx_mask & STATX_TYPE) {
> @@ -145,7 +147,9 @@ static void dump_statx(struct statx *stx)
> if (stx->stx_mask & STATX_CTIME)
> print_time("Change: ", &stx->stx_ctime);
> if (stx->stx_mask & STATX_BTIME)
> - print_time(" Birth: ", &stx->stx_btime);
> + print_time("Birth: ", &stx->stx_btime);
> + if (stx->stx_mask & STATX_VERSION)
> + printf("Inode Version: 0x%llx\n", stx->stx_version);
Why hex? not decimal? I don't really care but it seems like an odd choice.
>
> if (stx->stx_attributes_mask) {
> unsigned char bits, mbits;
> @@ -218,7 +222,7 @@ int main(int argc, char **argv)
> struct statx stx;
> int ret, raw = 0, atflag = AT_SYMLINK_NOFOLLOW;
>
> - unsigned int mask = STATX_BASIC_STATS | STATX_BTIME;
> + unsigned int mask = STATX_BASIC_STATS | STATX_BTIME | STATX_MNT_ID | STATX_VERSION;
>
> for (argv++; *argv; argv++) {
> if (strcmp(*argv, "-F") == 0) {
> --
> 2.37.3
>
>
Reviewed-by: NeilBrown <[email protected]>
Thanks,
NeilBrown
On Tue, 2022-10-04 at 10:42 +1100, NeilBrown wrote:
> On Fri, 30 Sep 2022, Jeff Layton wrote:
> > From: Jeff Layton <[email protected]>
> >
> > Claim one of the spare fields in struct statx to hold a 64-bit inode
> > version attribute. When userland requests STATX_VERSION, copy the
> > value from the kstat struct there, and stop masking off
> > STATX_ATTR_VERSION_MONOTONIC.
> >
> > Update the test-statx sample program to output the change attr and
> > MountId.
> >
> > Signed-off-by: Jeff Layton <[email protected]>
> > ---
> > fs/stat.c | 12 +++---------
> > include/linux/stat.h | 9 ---------
> > include/uapi/linux/stat.h | 6 ++++--
> > samples/vfs/test-statx.c | 8 ++++++--
> > 4 files changed, 13 insertions(+), 22 deletions(-)
> >
> > diff --git a/fs/stat.c b/fs/stat.c
> > index e7f8cd4b24e1..8396c372022f 100644
> > --- a/fs/stat.c
> > +++ b/fs/stat.c
> > @@ -593,11 +593,9 @@ cp_statx(const struct kstat *stat, struct statx __user *buffer)
> >
> > memset(&tmp, 0, sizeof(tmp));
> >
> > - /* STATX_VERSION is kernel-only for now */
> > - tmp.stx_mask = stat->result_mask & ~STATX_VERSION;
> > + tmp.stx_mask = stat->result_mask;
> > tmp.stx_blksize = stat->blksize;
> > - /* STATX_ATTR_VERSION_MONOTONIC is kernel-only for now */
> > - tmp.stx_attributes = stat->attributes & ~STATX_ATTR_VERSION_MONOTONIC;
> > + tmp.stx_attributes = stat->attributes;
> > tmp.stx_nlink = stat->nlink;
> > tmp.stx_uid = from_kuid_munged(current_user_ns(), stat->uid);
> > tmp.stx_gid = from_kgid_munged(current_user_ns(), stat->gid);
> > @@ -621,6 +619,7 @@ cp_statx(const struct kstat *stat, struct statx __user *buffer)
> > tmp.stx_mnt_id = stat->mnt_id;
> > tmp.stx_dio_mem_align = stat->dio_mem_align;
> > tmp.stx_dio_offset_align = stat->dio_offset_align;
> > + tmp.stx_version = stat->version;
> >
> > return copy_to_user(buffer, &tmp, sizeof(tmp)) ? -EFAULT : 0;
> > }
> > @@ -636,11 +635,6 @@ int do_statx(int dfd, struct filename *filename, unsigned int flags,
> > if ((flags & AT_STATX_SYNC_TYPE) == AT_STATX_SYNC_TYPE)
> > return -EINVAL;
> >
> > - /* STATX_VERSION is kernel-only for now. Ignore requests
> > - * from userland.
> > - */
> > - mask &= ~STATX_VERSION;
> > -
> > error = vfs_statx(dfd, filename, flags, &stat, mask);
> > if (error)
> > return error;
> > diff --git a/include/linux/stat.h b/include/linux/stat.h
> > index 4e9428d86a3a..69c79e4fd1b1 100644
> > --- a/include/linux/stat.h
> > +++ b/include/linux/stat.h
> > @@ -54,13 +54,4 @@ struct kstat {
> > u32 dio_offset_align;
> > u64 version;
> > };
> > -
> > -/* These definitions are internal to the kernel for now. Mainly used by nfsd. */
> > -
> > -/* mask values */
> > -#define STATX_VERSION 0x40000000U /* Want/got stx_change_attr */
> > -
> > -/* file attribute values */
> > -#define STATX_ATTR_VERSION_MONOTONIC 0x8000000000000000ULL /* version monotonically increases */
> > -
> > #endif
> > diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
> > index 7cab2c65d3d7..4a0a1f27c059 100644
> > --- a/include/uapi/linux/stat.h
> > +++ b/include/uapi/linux/stat.h
> > @@ -127,7 +127,8 @@ struct statx {
> > __u32 stx_dio_mem_align; /* Memory buffer alignment for direct I/O */
> > __u32 stx_dio_offset_align; /* File offset alignment for direct I/O */
> > /* 0xa0 */
> > - __u64 __spare3[12]; /* Spare space for future expansion */
> > + __u64 stx_version; /* Inode change attribute */
> > + __u64 __spare3[11]; /* Spare space for future expansion */
> > /* 0x100 */
> > };
> >
> > @@ -154,6 +155,7 @@ struct statx {
> > #define STATX_BTIME 0x00000800U /* Want/got stx_btime */
> > #define STATX_MNT_ID 0x00001000U /* Got stx_mnt_id */
> > #define STATX_DIOALIGN 0x00002000U /* Want/got direct I/O alignment info */
> > +#define STATX_VERSION 0x00004000U /* Want/got stx_version */
> >
> > #define STATX__RESERVED 0x80000000U /* Reserved for future struct statx expansion */
> >
> > @@ -189,6 +191,6 @@ struct statx {
> > #define STATX_ATTR_MOUNT_ROOT 0x00002000 /* Root of a mount */
> > #define STATX_ATTR_VERITY 0x00100000 /* [I] Verity protected file */
> > #define STATX_ATTR_DAX 0x00200000 /* File is currently in DAX state */
> > -
> > +#define STATX_ATTR_VERSION_MONOTONIC 0x00400000 /* stx_version increases w/ every change */
> >
> > #endif /* _UAPI_LINUX_STAT_H */
> > diff --git a/samples/vfs/test-statx.c b/samples/vfs/test-statx.c
> > index 49c7a46cee07..bdbc371c9774 100644
> > --- a/samples/vfs/test-statx.c
> > +++ b/samples/vfs/test-statx.c
> > @@ -107,6 +107,8 @@ static void dump_statx(struct statx *stx)
> > printf("Device: %-15s", buffer);
> > if (stx->stx_mask & STATX_INO)
> > printf(" Inode: %-11llu", (unsigned long long) stx->stx_ino);
> > + if (stx->stx_mask & STATX_MNT_ID)
> > + printf(" MountId: %llx", stx->stx_mnt_id);
> > if (stx->stx_mask & STATX_NLINK)
> > printf(" Links: %-5u", stx->stx_nlink);
> > if (stx->stx_mask & STATX_TYPE) {
> > @@ -145,7 +147,9 @@ static void dump_statx(struct statx *stx)
> > if (stx->stx_mask & STATX_CTIME)
> > print_time("Change: ", &stx->stx_ctime);
> > if (stx->stx_mask & STATX_BTIME)
> > - print_time(" Birth: ", &stx->stx_btime);
> > + print_time("Birth: ", &stx->stx_btime);
> > + if (stx->stx_mask & STATX_VERSION)
> > + printf("Inode Version: 0x%llx\n", stx->stx_version);
>
> Why hex? not decimal? I don't really care but it seems like an odd choice.
>
Habit. You're probably right that this is better viewed in decimal. I'll
change it in the next iteration. We should probably also have this
display the new DIOALIGN fields as well. I'll roll that in too.
> >
> > if (stx->stx_attributes_mask) {
> > unsigned char bits, mbits;
> > @@ -218,7 +222,7 @@ int main(int argc, char **argv)
> > struct statx stx;
> > int ret, raw = 0, atflag = AT_SYMLINK_NOFOLLOW;
> >
> > - unsigned int mask = STATX_BASIC_STATS | STATX_BTIME;
> > + unsigned int mask = STATX_BASIC_STATS | STATX_BTIME | STATX_MNT_ID | STATX_VERSION;
> >
> > for (argv++; *argv; argv++) {
> > if (strcmp(*argv, "-F") == 0) {
> > --
> > 2.37.3
> >
> >
>
> Reviewed-by: NeilBrown <[email protected]>
>
> Thanks,
> NeilBrown
--
Jeff Layton <[email protected]>
On Tue, 2022-10-04 at 09:56 +1100, NeilBrown wrote:
> On Tue, 04 Oct 2022, Amir Goldstein wrote:
> > On Mon, Oct 3, 2022 at 4:01 PM Jeff Layton <[email protected]> wrote:
> > >
> > > On Sun, 2022-10-02 at 10:08 +0300, Amir Goldstein wrote:
> > > > On Fri, Sep 30, 2022 at 2:30 PM Jeff Layton <[email protected]> wrote:
> > > > >
> > > > > The c/mtime and i_version currently get updated before the data is
> > > > > copied (or a DIO write is issued), which is problematic for NFS.
> > > > >
> > > > > READ+GETATTR can race with a write (even a local one) in such a way as
> > > > > to make the client associate the state of the file with the wrong change
> > > > > attribute. That association can persist indefinitely if the file sees no
> > > > > further changes.
> > > > >
> > > > > Move the setting of times to the bottom of the function in
> > > > > __generic_file_write_iter and only update it if something was
> > > > > successfully written.
> > > > >
> > > >
> > > > This solution is wrong for several reasons:
> > > >
> > > > 1. There is still file_update_time() in ->page_mkwrite() so you haven't
> > > > solved the problem completely
> > >
> > > Right. I don't think there is a way to solve the problem vs. mmap.
> > > Userland can write to a writeable mmap'ed page at any time and we'd
> > > never know. We have to specifically carve out mmap as an exception here.
> > > I'll plan to add something to the manpage patch for this.
> > >
> > > > 2. The other side of the coin is that post crash state is more likely to end
> > > > up data changes without mtime/ctime change
> > > >
> > >
> > > Is this really something filesystems rely on? I suppose the danger is
> > > that some cached data gets written to disk before the write returns and
> > > the inode on disk never gets updated.
> > >
> > > But...isn't that a danger now? Some of the cached data could get written
> > > out and the updated inode just never makes it to disk before a crash
> > > (AFAIU). I'm not sure that this increases our exposure to that problem.
> > >
> > >
> >
> > You are correct that that danger exists, but it only exists for overwriting
> > to allocated blocks.
> >
> > For writing to new blocks, mtime change is recorded in transaction
> > before the block mapping is recorded in transaction so there is no
> > danger in this case (before your patch).
> >
> > Also, observing size change without observing mtime change
> > after crash seems like a very bad outcome that may be possible
> > after your change.
> >
> > These are just a few cases that I could think of, they may be filesystem
> > dependent, but my gut feeling is that if you remove the time update before
> > the operation, that has been like that forever, a lot of s#!t is going to float
> > for various filesystems and applications.
> >
> > And it is not one of those things that are discovered during rc or even
> > stable kernel testing - they are discovered much later when users start to
> > realize their applications got bogged up after crash, so it feels like to me
> > like playing with fire.
> >
> > > > If I read the problem description correctly, then a solution that invalidates
> > > > the NFS cache before AND after the write would be acceptable. Right?
> > > > Would an extra i_version bump after the write solve the race?
> > > >
> > >
> > > I based this patch on Neil's assertion that updating the time before an
> > > operation was pointless if we were going to do it afterward. The NFS
> > > client only really cares about seeing it change after a write.
> > >
> >
> > Pointless to NFS client maybe.
> > Whether or not this is not changing user behavior for other applications
> > is up to you to prove and I doubt that you can prove it because I doubt
> > that it is true.
> >
> > > Doing both would be fine from a correctness standpoint, and in most
> > > cases, the second would be a no-op anyway since a query would have to
> > > race in between the two for that to happen.
> > >
> > > FWIW, I think we should update the m/ctime and version at the same time.
> > > If the version changes, then there is always the potential that a timer
> > > tick has occurred. So, that would translate to a second call to
> > > file_update_time in here.
> > >
> > > The downside of bumping the times/version both before and after is that
> > > these are hot codepaths, and we'd be adding extra operations there. Even
> > > in the case where nothing has changed, we'd have to call
> > > inode_needs_update_time a second time for every write. Is that worth the
> > > cost?
> >
> > Is there a practical cost for iversion bump AFTER write as I suggested?
> > If you NEED m/ctime update AFTER write and iversion update is not enough
> > then I did not understand from your commit message why that is.
> >
> > Thanks,
> > Amir.
> >
>
> Maybe we should split i_version updates from ctime updates.
>
> While it isn't true that ctime updates have happened before the write
> "forever" it has been true since 2.3.43[1] which is close to forever.
>
> For ctime there doesn't appear to be a strong specification of when the
> change happens, so history provides a good case for leaving it before.
> For i_version we want to provide clear and unambiguous semantics.
> Performing 2 updates makes the specification muddy.
>
> So I would prefer a single update for i_version, performed after the
> change becomes visible. If that means it has to be separate from ctime,
> then so be it.
>
> NeilBrown
>
>
> [1]: https://git.kernel.org/pub/scm/linux/kernel/git/history/history.git/commit/?id=636b38438001a00b25f23e38747a91cb8428af29
Not necessarily. We can document it in such a way that bumping it twice
is allowed, but not required.
My main concern with splitting them up is that we'd have to dirty the
inode twice if both the times and the i_version need updating. If the
inode gets written out in between, then we end up doing twice the I/O.
The interim on-disk metadata would be in sort of a weird state too --
the ctime would have changed but the version would still be old.
It might be worthwhile to just go ahead and continue bumping it in
file_update_time, and then we'd just attempt to bump the i_version again
afterward. The second bump will almost always be a no-op anyway.
--
Jeff Layton <[email protected]>
On Thu, 06 Oct 2022, Jeff Layton wrote:
> On Tue, 2022-10-04 at 09:56 +1100, NeilBrown wrote:
> > On Tue, 04 Oct 2022, Amir Goldstein wrote:
> > > On Mon, Oct 3, 2022 at 4:01 PM Jeff Layton <[email protected]> wrote:
> > > >
> > > > On Sun, 2022-10-02 at 10:08 +0300, Amir Goldstein wrote:
> > > > > On Fri, Sep 30, 2022 at 2:30 PM Jeff Layton <[email protected]> wrote:
> > > > > >
> > > > > > The c/mtime and i_version currently get updated before the data is
> > > > > > copied (or a DIO write is issued), which is problematic for NFS.
> > > > > >
> > > > > > READ+GETATTR can race with a write (even a local one) in such a way as
> > > > > > to make the client associate the state of the file with the wrong change
> > > > > > attribute. That association can persist indefinitely if the file sees no
> > > > > > further changes.
> > > > > >
> > > > > > Move the setting of times to the bottom of the function in
> > > > > > __generic_file_write_iter and only update it if something was
> > > > > > successfully written.
> > > > > >
> > > > >
> > > > > This solution is wrong for several reasons:
> > > > >
> > > > > 1. There is still file_update_time() in ->page_mkwrite() so you haven't
> > > > > solved the problem completely
> > > >
> > > > Right. I don't think there is a way to solve the problem vs. mmap.
> > > > Userland can write to a writeable mmap'ed page at any time and we'd
> > > > never know. We have to specifically carve out mmap as an exception here.
> > > > I'll plan to add something to the manpage patch for this.
> > > >
> > > > > 2. The other side of the coin is that post crash state is more likely to end
> > > > > up data changes without mtime/ctime change
> > > > >
> > > >
> > > > Is this really something filesystems rely on? I suppose the danger is
> > > > that some cached data gets written to disk before the write returns and
> > > > the inode on disk never gets updated.
> > > >
> > > > But...isn't that a danger now? Some of the cached data could get written
> > > > out and the updated inode just never makes it to disk before a crash
> > > > (AFAIU). I'm not sure that this increases our exposure to that problem.
> > > >
> > > >
> > >
> > > You are correct that that danger exists, but it only exists for overwriting
> > > to allocated blocks.
> > >
> > > For writing to new blocks, mtime change is recorded in transaction
> > > before the block mapping is recorded in transaction so there is no
> > > danger in this case (before your patch).
> > >
> > > Also, observing size change without observing mtime change
> > > after crash seems like a very bad outcome that may be possible
> > > after your change.
> > >
> > > These are just a few cases that I could think of, they may be filesystem
> > > dependent, but my gut feeling is that if you remove the time update before
> > > the operation, that has been like that forever, a lot of s#!t is going to float
> > > for various filesystems and applications.
> > >
> > > And it is not one of those things that are discovered during rc or even
> > > stable kernel testing - they are discovered much later when users start to
> > > realize their applications got bogged up after crash, so it feels like to me
> > > like playing with fire.
> > >
> > > > > If I read the problem description correctly, then a solution that invalidates
> > > > > the NFS cache before AND after the write would be acceptable. Right?
> > > > > Would an extra i_version bump after the write solve the race?
> > > > >
> > > >
> > > > I based this patch on Neil's assertion that updating the time before an
> > > > operation was pointless if we were going to do it afterward. The NFS
> > > > client only really cares about seeing it change after a write.
> > > >
> > >
> > > Pointless to NFS client maybe.
> > > Whether or not this is not changing user behavior for other applications
> > > is up to you to prove and I doubt that you can prove it because I doubt
> > > that it is true.
> > >
> > > > Doing both would be fine from a correctness standpoint, and in most
> > > > cases, the second would be a no-op anyway since a query would have to
> > > > race in between the two for that to happen.
> > > >
> > > > FWIW, I think we should update the m/ctime and version at the same time.
> > > > If the version changes, then there is always the potential that a timer
> > > > tick has occurred. So, that would translate to a second call to
> > > > file_update_time in here.
> > > >
> > > > The downside of bumping the times/version both before and after is that
> > > > these are hot codepaths, and we'd be adding extra operations there. Even
> > > > in the case where nothing has changed, we'd have to call
> > > > inode_needs_update_time a second time for every write. Is that worth the
> > > > cost?
> > >
> > > Is there a practical cost for iversion bump AFTER write as I suggested?
> > > If you NEED m/ctime update AFTER write and iversion update is not enough
> > > then I did not understand from your commit message why that is.
> > >
> > > Thanks,
> > > Amir.
> > >
> >
> > Maybe we should split i_version updates from ctime updates.
> >
> > While it isn't true that ctime updates have happened before the write
> > "forever" it has been true since 2.3.43[1] which is close to forever.
> >
> > For ctime there doesn't appear to be a strong specification of when the
> > change happens, so history provides a good case for leaving it before.
> > For i_version we want to provide clear and unambiguous semantics.
> > Performing 2 updates makes the specification muddy.
> >
> > So I would prefer a single update for i_version, performed after the
> > change becomes visible. If that means it has to be separate from ctime,
> > then so be it.
> >
> > NeilBrown
> >
> >
> > [1]: https://git.kernel.org/pub/scm/linux/kernel/git/history/history.git/commit/?id=636b38438001a00b25f23e38747a91cb8428af29
>
>
> Not necessarily. We can document it in such a way that bumping it twice
> is allowed, but not required.
>
> My main concern with splitting them up is that we'd have to dirty the
> inode twice if both the times and the i_version need updating. If the
> inode gets written out in between, then we end up doing twice the I/O.
> The interim on-disk metadata would be in sort of a weird state too --
> the ctime would have changed but the version would still be old.
>
> It might be worthwhile to just go ahead and continue bumping it in
> file_update_time, and then we'd just attempt to bump the i_version again
> afterward. The second bump will almost always be a no-op anyway.
I"m probably starting to sound like a scratched record here, but this is
why I think it should be up to the filesystem to bump i_version when it
determines that it should. It should be in a position to include the
i_version update any time that it writes the inode and so avoid a double
write.
Having that vfs/mm do so much of the work makes it hard for the
filesystem to do the right amount of work. The common code should
provide libraries of useful code, the filesystems should call that as
appropriate. Some of our code is structured that way, some of it isn't.
Most callers of file_update_time() are inside filesystems and that is
good - they are in control.
There are 3 in mm/*.c. Those are all in callbacks from the filesystem,
so the fs could avoid them, but only by duplicating lots of code to
avoid using the callback. Instead these file_update_time() calls should
become more explicit calls into the filesystem telling the filesystem
what has just happened, or is about to happen. Then the filesystem can
do the right thing, rather than having something done to it.
See also https://lwn.net/Articles/336262/ and the "midlayer mistake".
But yes, doing the bump afterwards as well is likely to be a no-op most
of the time and is probably the easy solution. Ugly, but easy.
NeilBrown