2022-08-26 21:48:51

by Jeff Layton

Subject: [PATCH v3 0/7] vfs: clean up i_version behavior and expose it via statx

The purpose of this patchset is two-fold:

1/ correct performance issues in the existing i_version counter
implementations and (hopefully) bring them into behavioral
alignment.

2/ expose the i_version field to userland via statx. This is useful for
testing the various i_version implementations, but may also be useful for
userland applications that want a way to tell whether a file might
have changed.

The i_version field in Linux has been around since 1994, but its meaning
and behavior have subtly changed over time [1]. The first patch in this
series fleshes out the comments in iversion.h to try and give a clear
explanation of what's expected from the filesystem. My first ask is for
feedback on this -- does the proposed definition seem reasonable for
presenting to userland?

There are two main consumers of i_version in the kernel: nfsd and IMA.
They both only want to see a change to the i_version iff there was an
explicit change to the inode. They can cope with an implementation that
does more increments than that, but that measurably harms performance.

I'll argue that atime-only updates SHOULD be excluded from i_version
bumps since they don't represent a "real" change to the inode. Spurious
updates to the i_version have real, measurable performance impacts with
NFSv4, and possibly with IMA.
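
A toy model of that cost, not kernel code: every name below (model_inode,
cached_attrs, model_run and so on) is hypothetical. It assumes a consumer
that revalidates after each access and discards its cache whenever the
counter differs from the value it last saw. It ignores the real
implementation's "only bump if queried" optimization and assumes every
read triggers an atime update.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical server-side inode: only the change counter is modelled. */
struct model_inode {
	uint64_t i_version;
};

/* Hypothetical consumer state, e.g. what an NFS client might cache. */
struct cached_attrs {
	uint64_t change;		/* last counter value seen */
	unsigned int invalidations;	/* times the cache was discarded */
};

enum update_kind { UPDATE_ATIME_ONLY, UPDATE_EXPLICIT };

/* Apply one update; bump_on_atime selects the filesystem's policy. */
static void model_update(struct model_inode *inode, enum update_kind kind,
			 bool bump_on_atime)
{
	if (kind == UPDATE_EXPLICIT || bump_on_atime)
		inode->i_version++;
}

/* Compare counters; returns true if the cache had to be discarded. */
static bool model_revalidate(struct cached_attrs *cache,
			     const struct model_inode *inode)
{
	if (cache->change == inode->i_version)
		return false;
	cache->change = inode->i_version;
	cache->invalidations++;
	return true;
}

/* nreads atime-only updates followed by one explicit change. */
static unsigned int model_run(unsigned int nreads, bool bump_on_atime)
{
	struct model_inode inode = { .i_version = 1 };
	struct cached_attrs cache = { .change = 1, .invalidations = 0 };

	for (unsigned int i = 0; i < nreads; i++) {
		model_update(&inode, UPDATE_ATIME_ONLY, bump_on_atime);
		model_revalidate(&cache, &inode);
	}
	model_update(&inode, UPDATE_EXPLICIT, bump_on_atime);
	model_revalidate(&cache, &inode);
	return cache.invalidations;
}
```

In this model, model_run(5, true) discards the cache six times while
model_run(5, false) discards it once, for the single explicit change.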

There are three kernel-managed i_version implementations: btrfs, ext4
and xfs.

btrfs seems to work as we'd expect. It doesn't bump the i_version
on atime-only updates and seems to bump it appropriately for other
activity.

ext4 currently bumps the i_version even when only the atime is being
updated. I have a patch to fix this that Jan and Christian have
Reviewed, but I haven't yet heard from Ted or Andreas.

xfs has the same issue as ext4: it bumps i_version on atime-only updates.
Dave has NACK'ed the patch I proposed since there are evidently tools that
depend on every log transaction being represented in i_version.

I've included the xfs patch in this series. If it's not suitable, I'm
open to fixing this in other ways, but I'll need some feedback as to what
the xfs developers would like to do here. Should we add a new on-disk
field to the inode? Try to do something clever with "unused" parts of
the ctime? What would be best?

Lastly, there are patches to allow NFS and Ceph to report this value
as well. They should be fairly straightforward once the earlier pile is
resolved.

Note that I dropped the patch to make AFS report STATX_INO_VERSION since
its semantics don't match the proposed definition.

[1]: Now, for your entertainment...

A BRIEF HISTORY OF THE I_VERSION FIELD
======================================

PRE-GIT-HISTORY ERA:
--------------------
The i_version field first appears in v1.1.30 (summer 1994) with more increments
added to ext2 over the next few v1.1.x versions. There were ioctls to set and fetch
the value in ext2. They're still there but they access the i_generation
counter now.

It was mostly used to catch races in lookup and readdir due to directory
changes, and a lot of filesystems still use it that way today. Non-directory
inodes would have this value set, but the kernel didn't do much with it.

GIT HISTORY ERA:
----------------
Then in 2.6.24, Jean Noel Cordenner from Bull increased the size to 64
bits, added the MS_I_VERSION flag, and started incrementing it in
file_update_time:

commit 7a224228ed79d587ece2304869000aad1b8e97dd
Author: Jean Noel Cordenner <[email protected]>
Date: Mon Jan 28 23:58:27 2008 -0500

vfs: Add 64 bit i_version support

The i_version field of the inode is changed to be a 64-bit counter that
is set on every inode creation and that is incremented every time the
inode data is modified (similarly to the "ctime" time-stamp).
The aim is to fulfill a NFSv4 requirement for rfc3530.
This first part concerns the vfs, it converts the 32-bit i_version in
the generic inode to a 64-bit, a flag is added in the super block in
order to check if the feature is enabled and the i_version is
incremented in the vfs.

Signed-off-by: Mingming Cao <[email protected]>
Signed-off-by: Jean Noel Cordenner <[email protected]>
Signed-off-by: Kalpak Shah <[email protected]>

Then he added support to ext4, plus the mount option to enable it. The
problem with the i_version being incremented during atime updates
probably dates back to this patch. I imagine it was just an oversight,
though it could also have been due to an unclear definition of the
change attr in the NFSv4.0 spec:

commit 25ec56b518257a56d2ff41a941d288e4b5ff9488
Author: Jean Noel Cordenner <[email protected]>
Date: Mon Jan 28 23:58:27 2008 -0500

ext4: Add inode version support in ext4

This patch adds 64-bit inode version support to ext4. The lower 32 bits
are stored in the osd1.linux1.l_i_version field while the high 32 bits
are stored in the i_version_hi field newly created in the ext4_inode.
This field is incremented in case the ext4_inode is large enough. A
i_version mount option has been added to enable the feature.

Signed-off-by: Mingming Cao <[email protected]>
Signed-off-by: Andreas Dilger <[email protected]>
Signed-off-by: Kalpak Shah <[email protected]>
Signed-off-by: Aneesh Kumar K.V <[email protected]>
Signed-off-by: Jean Noel Cordenner <[email protected]>

Bruce then added support for it to nfsd:

commit c654b8a9cba6002aad1c01919e4928a79a4a6dcf
Author: J. Bruce Fields <[email protected]>
Date: Thu Apr 16 17:33:25 2009 -0400

nfsd: support ext4 i_version

ext4 supports a real NFSv4 change attribute, which is bumped whenever
the ctime would be updated, including times when two updates arrive
within a jiffy of each other. (Note that although ext4 has space for
nanosecond-precision ctime, the real resolution is lower: it actually
uses jiffies as the time-source.) This ensures clients will invalidate
their caches when they need to.

There is some fear that keeping the i_version up-to-date could have
performance drawbacks, so for now it's turned on only by a mount option.
We hope to do something better eventually.

Signed-off-by: J. Bruce Fields <[email protected]>
Cc: Theodore Tso <[email protected]>

Josef converted btrfs to use it instead of their own internal counter.
It looks like the btrfs implementation has probably avoided the issue
with atime updates causing i_version bumps.

commit 0c4d2d95d06e920e0c61707e62c7fffc9c57f63a
Author: Josef Bacik <[email protected]>
Date: Thu Apr 5 15:03:02 2012 -0400

Btrfs: use i_version instead of our own sequence

We've been keeping around the inode sequence number in hopes that somebody
would use it, but nobody uses it and people actually use i_version which
serves the same purpose, so use i_version where we used the incore inode's
sequence number and that way the sequence is updated properly across the
board, and not just in file write. Thanks,

Signed-off-by: Josef Bacik <[email protected]>

Then, in 2013, Dave added support for xfs with v3 superblocks. There were
some later changes to how it was stored, but its behavior has largely
been the same on xfs since then. Note that at the time, the stated
reason for adding this was to provide NFSv4 semantics:

commit dc037ad7d24f3711e431a45c053b5d425995e9e4
Author: Dave Chinner <[email protected]>
Date: Thu Jun 27 16:04:59 2013 +1000

xfs: implement inode change count

For CRC enabled filesystems, add support for the monotonic inode
version change counter that is needed by protocols like NFSv4 for
determining if the inode has changed in any way at all between two
unrelated operations on the inode.

This bumps the change count the first time an inode is dirtied in a
transaction. Since all modifications to the inode are logged, this
will catch all changes that are made to the inode, including
timestamp updates that occur during data writes.

Signed-off-by: Dave Chinner <[email protected]>
Reviewed-by: Mark Tinguely <[email protected]>
Reviewed-by: Chandra Seetharaman <[email protected]>
Signed-off-by: Ben Myers <[email protected]>

Jeff Layton (7):
iversion: update comments with info about atime updates
ext4: fix i_version handling in ext4
ext4: unconditionally enable the i_version counter
xfs: don't bump the i_version on an atime update in xfs_vn_update_time
vfs: report an inode version in statx for IS_I_VERSION inodes
nfs: report the inode version in statx if requested
ceph: fill in the change attribute in statx requests

fs/ceph/inode.c | 14 +++++++++-----
fs/ext4/inode.c | 10 +++++-----
fs/ext4/ioctl.c | 4 ++++
fs/ext4/move_extent.c | 6 ++++++
fs/ext4/super.c | 13 ++++---------
fs/ext4/xattr.c | 1 +
fs/nfs/inode.c | 7 +++++--
fs/stat.c | 7 +++++++
fs/xfs/libxfs/xfs_log_format.h | 2 +-
fs/xfs/libxfs/xfs_trans_inode.c | 2 +-
fs/xfs/xfs_iops.c | 11 +++++++++--
include/linux/iversion.h | 23 +++++++++++++++++++++--
include/linux/stat.h | 1 +
include/uapi/linux/stat.h | 3 ++-
samples/vfs/test-statx.c | 8 ++++++--
15 files changed, 82 insertions(+), 30 deletions(-)

--
2.37.2


2022-08-26 21:49:16

by Jeff Layton

Subject: [PATCH v3 4/7] xfs: don't bump the i_version on an atime update in xfs_vn_update_time

xfs will update the i_version when updating only the atime value, which
is not desirable for any of the current consumers of i_version. Doing so
leads to unnecessary cache invalidations on NFS and extra measurement
activity in IMA.

Add a new XFS_ILOG_NOIVER flag, and use that to indicate that the
transaction should not update the i_version. Set that value in
xfs_vn_update_time if we're only updating the atime.

Cc: Dave Chinner <[email protected]>
Cc: NeilBrown <[email protected]>
Cc: Trond Myklebust <[email protected]>
Cc: David Wysochanski <[email protected]>
Signed-off-by: Jeff Layton <[email protected]>
---
fs/xfs/libxfs/xfs_log_format.h | 2 +-
fs/xfs/libxfs/xfs_trans_inode.c | 2 +-
fs/xfs/xfs_iops.c | 11 +++++++++--
3 files changed, 11 insertions(+), 4 deletions(-)

Dave has NACK'ed this patch, but I'm sending it as a way to illustrate
the problem. I still think this approach should at least fix the worst
problems with atime updates being counted. We can look to carve out
other "spurious" i_version updates as we identify them.

If however there are offline analysis tools that require atime updates
to be counted, then we won't be able to do this. If that's the case, how
can we fix this such that serving xfs via NFSv4 doesn't suck?

diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
index b351b9dc6561..866a4c5cf70c 100644
--- a/fs/xfs/libxfs/xfs_log_format.h
+++ b/fs/xfs/libxfs/xfs_log_format.h
@@ -323,7 +323,7 @@ struct xfs_inode_log_format_32 {
#define XFS_ILOG_ABROOT 0x100 /* log i_af.i_broot */
#define XFS_ILOG_DOWNER 0x200 /* change the data fork owner on replay */
#define XFS_ILOG_AOWNER 0x400 /* change the attr fork owner on replay */
-
+#define XFS_ILOG_NOIVER 0x800 /* don't bump i_version */

/*
* The timestamps are dirty, but not necessarily anything else in the inode
diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
index 8b5547073379..ffe6d296e7f9 100644
--- a/fs/xfs/libxfs/xfs_trans_inode.c
+++ b/fs/xfs/libxfs/xfs_trans_inode.c
@@ -126,7 +126,7 @@ xfs_trans_log_inode(
* unconditionally.
*/
if (!test_and_set_bit(XFS_LI_DIRTY, &iip->ili_item.li_flags)) {
- if (IS_I_VERSION(inode) &&
+ if (!(flags & XFS_ILOG_NOIVER) && IS_I_VERSION(inode) &&
inode_maybe_inc_iversion(inode, flags & XFS_ILOG_CORE))
iversion_flags = XFS_ILOG_CORE;
}
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 45518b8c613c..94f14d96641b 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1041,10 +1041,17 @@ xfs_vn_update_time(
return error;

xfs_ilock(ip, XFS_ILOCK_EXCL);
- if (flags & S_CTIME)
+
+ if (!(flags & S_VERSION))
+ log_flags |= XFS_ILOG_NOIVER;
+ if (flags & S_CTIME) {
inode->i_ctime = *now;
- if (flags & S_MTIME)
+ log_flags &= ~XFS_ILOG_NOIVER;
+ }
+ if (flags & S_MTIME) {
inode->i_mtime = *now;
+ log_flags &= ~XFS_ILOG_NOIVER;
+ }
if (flags & S_ATIME)
inode->i_atime = *now;

--
2.37.2
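
The flag handling in the xfs_vn_update_time hunk above can be read as a
small truth table. The sketch below is a standalone model of that logic
only, not the kernel function: the MODEL_* constants and both helper names
are hypothetical stand-ins, the numeric values are illustrative, and the
model tracks nothing but the "don't bump" bit.

```c
#include <stdbool.h>

/* Illustrative stand-ins for the S_* update flags passed by the VFS. */
#define MODEL_S_ATIME		0x1
#define MODEL_S_MTIME		0x2
#define MODEL_S_CTIME		0x4
#define MODEL_S_VERSION		0x8

/* Stand-in for the XFS_ILOG_NOIVER flag the patch introduces. */
#define MODEL_ILOG_NOIVER	0x800

/*
 * Mirror the patch's decision: start by suppressing the bump when the
 * caller did not ask for a version update, then clear the suppression
 * again if the ctime or mtime is being changed.
 */
static unsigned int model_update_time_log_flags(int flags)
{
	unsigned int log_flags = 0;

	if (!(flags & MODEL_S_VERSION))
		log_flags |= MODEL_ILOG_NOIVER;
	if (flags & MODEL_S_CTIME)
		log_flags &= ~MODEL_ILOG_NOIVER;
	if (flags & MODEL_S_MTIME)
		log_flags &= ~MODEL_ILOG_NOIVER;
	return log_flags;
}

/* Mirror the added check in xfs_trans_log_inode(). */
static bool model_may_bump_iversion(unsigned int log_flags)
{
	return !(log_flags & MODEL_ILOG_NOIVER);
}
```

Under these assumptions, only a request without S_VERSION, S_CTIME or
S_MTIME set (for instance, S_ATIME alone) leaves the suppression flag in
place.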

2022-08-27 17:29:36

by Trond Myklebust

Subject: Re: [PATCH v3 4/7] xfs: don't bump the i_version on an atime update in xfs_vn_update_time

On Sat, 2022-08-27 at 12:10 -0400, Jeff Layton wrote:
> On Sat, 2022-08-27 at 16:03 +0000, Trond Myklebust wrote:
> > On Sat, 2022-08-27 at 08:46 -0700, Darrick J. Wong wrote:
> > > On Sat, Aug 27, 2022 at 09:14:30AM -0400, Jeff Layton wrote:
> > > > On Sat, 2022-08-27 at 11:01 +0300, Amir Goldstein wrote:
> > > > > On Sat, Aug 27, 2022 at 10:26 AM Amir Goldstein
> > > > > <[email protected]> wrote:
> > > > > >
> > > > > > On Sat, Aug 27, 2022 at 12:49 AM Jeff Layton
> > > > > > <[email protected]> wrote:
> > > > > > >
> > > > > > > xfs will update the i_version when updating only the
> > > > > > > atime
> > > > > > > value, which
> > > > > > > is not desirable for any of the current consumers of
> > > > > > > i_version. Doing so
> > > > > > > leads to unnecessary cache invalidations on NFS and extra
> > > > > > > measurement
> > > > > > > activity in IMA.
> > > > > > >
> > > > > > > Add a new XFS_ILOG_NOIVER flag, and use that to indicate
> > > > > > > that
> > > > > > > the
> > > > > > > transaction should not update the i_version. Set that
> > > > > > > value
> > > > > > > in
> > > > > > > xfs_vn_update_time if we're only updating the atime.
> > > > > > >
> > > > > > > Cc: Dave Chinner <[email protected]>
> > > > > > > Cc: NeilBrown <[email protected]>
> > > > > > > Cc: Trond Myklebust <[email protected]>
> > > > > > > Cc: David Wysochanski <[email protected]>
> > > > > > > Signed-off-by: Jeff Layton <[email protected]>
> > > > > > > ---
> > > > > > >  fs/xfs/libxfs/xfs_log_format.h  |  2 +-
> > > > > > >  fs/xfs/libxfs/xfs_trans_inode.c |  2 +-
> > > > > > >  fs/xfs/xfs_iops.c               | 11 +++++++++--
> > > > > > >  3 files changed, 11 insertions(+), 4 deletions(-)
> > > > > > >
> > > > > > > Dave has NACK'ed this patch, but I'm sending it as a way
> > > > > > > to
> > > > > > > illustrate
> > > > > > > the problem. I still think this approach should at least
> > > > > > > fix
> > > > > > > the worst
> > > > > > > problems with atime updates being counted. We can look to
> > > > > > > carve out
> > > > > > > other "spurious" i_version updates as we identify them.
> > > > > > >
> > > > > >
> > > > > > AFAIK, "spurious" is only inode blocks map changes due to
> > > > > > writeback
> > > > > > of dirty pages. Anybody know about other cases?
> > > > > >
> > > > > > Regarding inode blocks map changes, first of all, I don't
> > > > > > think
> > > > > > that there is
> > > > > > any practical loss from invalidating NFS client cache on
> > > > > > dirty
> > > > > > data writeback,
> > > > > > because NFS server should be serving cold data most of the
> > > > > > time.
> > > > > > If there are a few unneeded cache invalidations they would
> > > > > > only
> > > > > > be temporary.
> > > > > >
> > > > >
> > > > > Unless there is an issue with a writer NFS client that
> > > > > invalidates its
> > > > > own attribute
> > > > > caches on server data writeback?
> > > > >
> > > >
> > > > The client just looks at the file attributes (of which
> > > > i_version is
> > > > but
> > > > one), and if certain attributes have changed (mtime, ctime,
> > > > i_version,
> > > > etc...) then it invalidates its cache.
> > > >
> > > > In the case of blocks map changes, could that mean a difference
> > > > in
> > > > the
> > > > observable sparse regions of the file? If so, then a READ_PLUS
> > > > before
> > > > the change and a READ_PLUS after could give different results.
> > > > Since
> > > > that difference is observable by the client, I'd think we'd
> > > > want to
> > > > bump
> > > > i_version for that anyway.
> > >
> > > How /is/ READ_PLUS supposed to detect sparse regions, anyway?  I
> > > know
> > > that's been the subject of recent debate.  At least as far as XFS
> > > is
> > > concerned, a file range can go from hole -> delayed allocation
> > > reservation -> unwritten extent -> (actual writeback) -> written
> > > extent.
> > > The dance became rather more complex when we added COW.  If any
> > > of
> > > that
> > > will make a difference for READ_PLUS, then yes, I think you'd
> > > want
> > > file
> > > writeback activities to bump iversion to cause client
> > > invalidations,
> > > like (I think) Dave said.
> > >
> > > The fs/iomap/ implementation of SEEK_DATA/SEEK_HOLE reports data
> > > for
> > > written and delalloc extents; and an unwritten extent will report
> > > data
> > > for any pagecache it finds.
> > >
> >
> > READ_PLUS should never return anything different than a read()
> > system
> > call would return for any given area. The way it reports sparse
> > regions
> > vs. data regions is purely an RPC formatting convenience.
> >
> > The only point to note about NFS READ and READ_PLUS is that because
> > the
> > client is forced to send multiple RPC calls if the user is trying
> > to
> > read a region that is larger than the 'rsize' value, it is possible
> > that these READ/READ_PLUS calls may be processed out of order, and
> > so
> > the result may end up looking different than if you had executed a
> > read() call for the full region directly on the server.
> > However each individual READ / READ_PLUS reply should look as if
> > the
> > user had called read() on that rsize-sized section of the file.
> > > >
>
> Yeah, thinking about it some more, simply changing the block
> allocation
> is not something that should affect the ctime, so we probably don't
> want
> to bump i_version on it. It's an implicit change, IOW, not an
> explicit
> one.

As you say, it is unfortunate that XFS does this, and it is unfortunate
that it then changes the blocks allocated attribute post-fsync(), since
all that does cause confusion for certain applications.
However I agree 100% that this is an implicit change that is driven by
the filesystem and not the user application. Hence it is not an action
that needs to be recorded with a change attribute bump.


--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]