2017-12-22 12:06:01

by Jeff Layton

Subject: [PATCH v4 00/19] fs: rework and optimize i_version handling in filesystems

From: Jeff Layton <[email protected]>

v4:
- fix SB_LAZYTIME handling in generic_update_time
- add memory barriers to patch to convert i_version field to atomic64_t

v3:
- move i_version handling functions to new header file
- document that the kernel-managed i_version implementation will appear to
increase over time
- fix inode_cmp_iversion to handle wraparound correctly

v2:
- xfs should use inode_peek_iversion instead of inode_peek_iversion_raw
- rework file_update_time patch
- don't dirty inode when only S_ATIME is set and SB_LAZYTIME is enabled
- better comments and documentation

I think this is now approaching merge readiness.

Special thanks to Jan Kara and Dave Chinner who helped me tighten up the
memory barriers in the final patch.

tl;dr: I think we can greatly reduce the cost of the inode->i_version
counter, by exploiting the fact that we don't need to increment it if no
one is looking at it. We can also clean up the code to prepare to
eventually expose this value via statx().

Note that this set relies on a few patches that are in other trees. The
full stack that I've been testing with is here:

https://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux.git/log/?h=iversion

The inode->i_version field is supposed to be a value that changes
whenever there is any data or metadata change to the inode. Some
filesystems use it internally to detect directory changes during
readdir. knfsd will use it if the filesystem has MS_I_VERSION set. IMA
will also use it to optimize away some remeasurement if it's available.
NFS and AFS just use it to store an opaque change attribute from the
server.

Only btrfs, ext4, and xfs increment it for data changes. Because of
this, these filesystems must log the inode to disk whenever the
i_version counter changes. That has a non-zero performance impact,
especially on write-heavy workloads, because we end up dirtying the
inode metadata on every write, not just when the times change.

It turns out though that none of these users of i_version require that
it change on every change to the file. The only real requirement is that
it be different if something changed since the last time we queried for
it.

If we keep track of when something queries the value, we can skip both
the counter bump and the on-disk update whenever nothing else has
changed and no one has queried the value since it was last incremented.
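
As a rough illustration of that idea (a single-threaded toy model, not the
code in this series), the bookkeeping can be sketched as below. The names
model_inode, model_query and model_write are hypothetical and exist only for
this example; the real implementation packs the flag into the counter and
uses atomics.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical toy model; this is NOT the kernel's struct inode. */
struct model_inode {
	uint64_t version;	/* value that observers get to see */
	bool queried;		/* has anyone looked since the last bump? */
	unsigned int dirtied;	/* times we would have logged the inode */
};

/* An observer reads the value and intends to compare against it later. */
static uint64_t model_query(struct model_inode *ino)
{
	ino->queried = true;
	return ino->version;
}

/*
 * A write happened. Bump (and dirty) only if someone has seen the
 * current value. Returns true if the counter was bumped.
 */
static bool model_write(struct model_inode *ino)
{
	if (!ino->queried)
		return false;
	ino->version++;
	ino->queried = false;
	ino->dirtied++;
	return true;
}
```

In this model, a run of writes with no query in between costs at most one
bump, yet any observer that queried earlier still sees a different value on
its next query.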

This patchset changes the code to only bump the i_version counter when
it's strictly necessary, or when we're updating the inode metadata
anyway (e.g. when times change).

It takes the approach of converting the existing accessors of i_version
to use a new API, while leaving the underlying implementation mostly the
same. The last patch then converts the existing implementation to keep
track of whether the value has been queried since it was last
incremented. It then uses that to avoid incrementing the counter when
it can.

With this, we reduce inode metadata updates across all three filesystems
to roughly the frequency of the timestamp granularity, particularly
when the i_version is not being queried (by far the most common case).

I can see measurable performance gains on xfs and ext4 with iversion
enabled, when streaming small (4k) I/Os.

btrfs shows some slight gain in testing, but not quite the magnitude
that xfs and ext4 show. I'm not sure why yet and would appreciate some
input from btrfs folks.

My goal is to get this into linux-next fairly soon. If it shows no
problems then we can look at merging it for 4.16, or 4.17 if all of the
prerequisite patches are not yet merged.

Jeff Layton (19):
fs: new API for handling inode->i_version
fs: don't take the i_lock in inode_inc_iversion
fat: convert to new i_version API
affs: convert to new i_version API
afs: convert to new i_version API
btrfs: convert to new i_version API
exofs: switch to new i_version API
ext2: convert to new i_version API
ext4: convert to new i_version API
nfs: convert to new i_version API
nfsd: convert to new i_version API
ocfs2: convert to new i_version API
ufs: use new i_version API
xfs: convert to new i_version API
IMA: switch IMA over to new i_version API
fs: only set S_VERSION when updating times if necessary
xfs: avoid setting XFS_ILOG_CORE if i_version doesn't need
incrementing
btrfs: only dirty the inode in btrfs_update_time if something was
changed
fs: handle inode->i_version more efficiently

fs/affs/amigaffs.c | 5 +-
fs/affs/dir.c | 5 +-
fs/affs/super.c | 3 +-
fs/afs/fsclient.c | 3 +-
fs/afs/inode.c | 5 +-
fs/btrfs/delayed-inode.c | 7 +-
fs/btrfs/file.c | 1 +
fs/btrfs/inode.c | 12 +-
fs/btrfs/ioctl.c | 1 +
fs/btrfs/tree-log.c | 4 +-
fs/btrfs/xattr.c | 1 +
fs/exofs/dir.c | 9 +-
fs/exofs/super.c | 3 +-
fs/ext2/dir.c | 9 +-
fs/ext2/super.c | 5 +-
fs/ext4/dir.c | 9 +-
fs/ext4/inline.c | 7 +-
fs/ext4/inode.c | 13 +-
fs/ext4/ioctl.c | 3 +-
fs/ext4/namei.c | 5 +-
fs/ext4/super.c | 3 +-
fs/ext4/xattr.c | 5 +-
fs/fat/dir.c | 3 +-
fs/fat/inode.c | 9 +-
fs/fat/namei_msdos.c | 7 +-
fs/fat/namei_vfat.c | 22 +--
fs/inode.c | 11 +-
fs/nfs/delegation.c | 3 +-
fs/nfs/fscache-index.c | 5 +-
fs/nfs/inode.c | 18 +--
fs/nfs/nfs4proc.c | 10 +-
fs/nfs/nfstrace.h | 5 +-
fs/nfs/write.c | 8 +-
fs/nfsd/nfsfh.h | 3 +-
fs/ocfs2/dir.c | 15 +-
fs/ocfs2/inode.c | 3 +-
fs/ocfs2/namei.c | 3 +-
fs/ocfs2/quota_global.c | 3 +-
fs/ufs/dir.c | 9 +-
fs/ufs/inode.c | 3 +-
fs/ufs/super.c | 3 +-
fs/xfs/libxfs/xfs_inode_buf.c | 7 +-
fs/xfs/xfs_icache.c | 5 +-
fs/xfs/xfs_inode.c | 3 +-
fs/xfs/xfs_inode_item.c | 3 +-
fs/xfs/xfs_trans_inode.c | 16 +-
include/linux/fs.h | 17 +--
include/linux/iversion.h | 304 ++++++++++++++++++++++++++++++++++++++
security/integrity/ima/ima_api.c | 3 +-
security/integrity/ima/ima_main.c | 3 +-
50 files changed, 487 insertions(+), 135 deletions(-)
create mode 100644 include/linux/iversion.h

--
2.14.3



2017-12-22 12:06:04

by Jeff Layton

Subject: [PATCH v4 01/19] fs: new API for handling inode->i_version

From: Jeff Layton <[email protected]>

Add a documentation blob that explains what the i_version field is, how
it is expected to work, and how it is currently implemented by various
filesystems.

We already have inode_inc_iversion. Add several other functions for
manipulating and accessing the i_version counter. For now, the
implementation is trivial and basically works the way that all of the
open-coded i_version accesses work today.

Future patches will convert existing users of i_version to use the new
API, and then convert the backend implementation to do things more
efficiently.

Signed-off-by: Jeff Layton <[email protected]>
---
fs/btrfs/file.c | 1 +
fs/btrfs/inode.c | 1 +
fs/btrfs/ioctl.c | 1 +
fs/btrfs/xattr.c | 1 +
fs/ext4/inode.c | 1 +
fs/ext4/namei.c | 1 +
fs/inode.c | 1 +
include/linux/fs.h | 15 ----
include/linux/iversion.h | 205 +++++++++++++++++++++++++++++++++++++++++++++++
9 files changed, 212 insertions(+), 15 deletions(-)
create mode 100644 include/linux/iversion.h

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index eb1bac7c8553..c95d7b2efefb 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -31,6 +31,7 @@
#include <linux/slab.h>
#include <linux/btrfs.h>
#include <linux/uio.h>
+#include <linux/iversion.h>
#include "ctree.h"
#include "disk-io.h"
#include "transaction.h"
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e1a7f3cb5be9..27f008b33fc1 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -43,6 +43,7 @@
#include <linux/posix_acl_xattr.h>
#include <linux/uio.h>
#include <linux/magic.h>
+#include <linux/iversion.h>
#include "ctree.h"
#include "disk-io.h"
#include "transaction.h"
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 2ef8acaac688..aa452c9e2eff 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -43,6 +43,7 @@
#include <linux/uuid.h>
#include <linux/btrfs.h>
#include <linux/uaccess.h>
+#include <linux/iversion.h>
#include "ctree.h"
#include "disk-io.h"
#include "transaction.h"
diff --git a/fs/btrfs/xattr.c b/fs/btrfs/xattr.c
index 2c7e53f9ff1b..5258c1714830 100644
--- a/fs/btrfs/xattr.c
+++ b/fs/btrfs/xattr.c
@@ -23,6 +23,7 @@
#include <linux/xattr.h>
#include <linux/security.h>
#include <linux/posix_acl_xattr.h>
+#include <linux/iversion.h>
#include "ctree.h"
#include "btrfs_inode.h"
#include "transaction.h"
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 7df2c5644e59..fa5d8bc52d2d 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -39,6 +39,7 @@
#include <linux/slab.h>
#include <linux/bitops.h>
#include <linux/iomap.h>
+#include <linux/iversion.h>

#include "ext4_jbd2.h"
#include "xattr.h"
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 798b3ac680db..bcf0dff517be 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -34,6 +34,7 @@
#include <linux/quotaops.h>
#include <linux/buffer_head.h>
#include <linux/bio.h>
+#include <linux/iversion.h>
#include "ext4.h"
#include "ext4_jbd2.h"

diff --git a/fs/inode.c b/fs/inode.c
index 03102d6ef044..19e72f500f71 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -18,6 +18,7 @@
#include <linux/buffer_head.h> /* for inode_has_buffers */
#include <linux/ratelimit.h>
#include <linux/list_lru.h>
+#include <linux/iversion.h>
#include <trace/events/writeback.h>
#include "internal.h"

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 511fbaabf624..76382c24e9d0 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2036,21 +2036,6 @@ static inline void inode_dec_link_count(struct inode *inode)
mark_inode_dirty(inode);
}

-/**
- * inode_inc_iversion - increments i_version
- * @inode: inode that need to be updated
- *
- * Every time the inode is modified, the i_version field will be incremented.
- * The filesystem has to be mounted with i_version flag
- */
-
-static inline void inode_inc_iversion(struct inode *inode)
-{
- spin_lock(&inode->i_lock);
- inode->i_version++;
- spin_unlock(&inode->i_lock);
-}
-
enum file_time_flags {
S_ATIME = 1,
S_MTIME = 2,
diff --git a/include/linux/iversion.h b/include/linux/iversion.h
new file mode 100644
index 000000000000..bb50d27c71f9
--- /dev/null
+++ b/include/linux/iversion.h
@@ -0,0 +1,205 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_IVERSION_H
+#define _LINUX_IVERSION_H
+
+#include <linux/fs.h>
+
+/*
+ * The change attribute (i_version) is mandated by NFSv4 and is mostly for
+ * knfsd, but is also used for other purposes (e.g. IMA). The i_version must
+ * appear different to observers if there was a change to the inode's data or
+ * metadata since it was last queried.
+ *
+ * It should be considered an opaque value by observers. If it remains the same
+ * since it was last checked, then nothing has changed in the inode. If it's
+ * different then something has changed. Observers cannot infer anything about
+ * the nature or magnitude of the changes from the value, only that the inode
+ * has changed in some fashion.
+ *
+ * Not all filesystems properly implement the i_version counter. Subsystems that
+ * want to use i_version field on an inode should first check whether the
+ * filesystem sets the SB_I_VERSION flag (usually via the IS_I_VERSION macro).
+ *
+ * Those that set SB_I_VERSION will automatically have their i_version counter
+ * incremented on writes to normal files. If the SB_I_VERSION is not set, then
+ * the VFS will not touch it on writes, and the filesystem can use it how it
+ * wishes. Note that the filesystem is always responsible for updating the
+ * i_version on namespace changes in directories (mkdir, rmdir, unlink, etc.).
+ * We consider these sorts of filesystems to have a kernel-managed i_version.
+ *
+ * Note that some filesystems (e.g. NFS and AFS) just use the field to store
+ * a server-provided value (for the most part). For that reason, those
+ * filesystems do not set SB_I_VERSION. These filesystems are considered to
+ * have a self-managed i_version.
+ */
+
+/**
+ * inode_set_iversion_raw - set i_version to the specified raw value
+ * @inode: inode to set
+ * @new: new i_version value to set
+ *
+ * Set @inode's i_version field to @new. This function is for use by
+ * filesystems that self-manage the i_version.
+ *
+ * For example, the NFS client stores its NFSv4 change attribute in this way,
+ * and the AFS client stores the data_version from the server here.
+ */
+static inline void
+inode_set_iversion_raw(struct inode *inode, const u64 new)
+{
+ inode->i_version = new;
+}
+
+/**
+ * inode_set_iversion - set i_version to a particular value
+ * @inode: inode to set
+ * @new: new i_version value to set
+ *
+ * Set @inode's i_version field to @new. This function is for filesystems with
+ * a kernel-managed i_version.
+ *
+ * For now, this just does the same thing as the _raw variant.
+ */
+static inline void
+inode_set_iversion(struct inode *inode, const u64 new)
+{
+ inode_set_iversion_raw(inode, new);
+}
+
+/**
+ * inode_set_iversion_queried - set i_version to a particular value and set
+ * flag to indicate that it has been viewed
+ * @inode: inode to set
+ * @new: new i_version value to set
+ *
+ * When loading in an i_version value from a backing store, we typically don't
+ * know whether it was previously viewed before being stored or not. Thus, we
+ * must assume that it was, to ensure that any changes will result in the
+ * value changing.
+ *
+ * This function will set the inode's i_version, and possibly flag the value
+ * as if it has already been viewed at least once.
+ *
+ * For now, this just does what inode_set_iversion does.
+ */
+static inline void
+inode_set_iversion_queried(struct inode *inode, const u64 new)
+{
+ inode_set_iversion(inode, new);
+}
+
+/**
+ * inode_maybe_inc_iversion - increments i_version
+ * @inode: inode with the i_version that should be updated
+ * @force: increment the counter even if it's not necessary
+ *
+ * Every time the inode is modified, the i_version field must be seen to have
+ * changed by any observer.
+ *
+ * In this implementation, we always increment it after taking the i_lock to
+ * ensure that we don't race with other incrementors.
+ *
+ * Returns true if counter was bumped, and false if it wasn't.
+ */
+static inline bool
+inode_maybe_inc_iversion(struct inode *inode, bool force)
+{
+ spin_lock(&inode->i_lock);
+ inode->i_version++;
+ spin_unlock(&inode->i_lock);
+ return true;
+}
+
+/**
+ * inode_inc_iversion - forcibly increment i_version
+ * @inode: inode that needs to be updated
+ *
+ * Forcibly increment the i_version field. This always results in a change to
+ * the observable value.
+ */
+static inline void
+inode_inc_iversion(struct inode *inode)
+{
+ inode_maybe_inc_iversion(inode, true);
+}
+
+/**
+ * inode_iversion_need_inc - is the i_version in need of being incremented?
+ * @inode: inode to check
+ *
+ * Returns whether the inode->i_version counter needs incrementing on the next
+ * change.
+ *
+ * For now, we assume that it always does.
+ */
+static inline bool
+inode_iversion_need_inc(struct inode *inode)
+{
+ return true;
+}
+
+/**
+ * inode_peek_iversion_raw - grab a "raw" iversion value
+ * @inode: inode from which i_version should be read
+ *
+ * Grab a "raw" inode->i_version value and return it. The i_version is not
+ * flagged or converted in any way. This is mostly used to access a self-managed
+ * i_version.
+ *
+ * With those filesystems, we want to treat the i_version as an entirely
+ * opaque value.
+ */
+static inline u64
+inode_peek_iversion_raw(const struct inode *inode)
+{
+ return inode->i_version;
+}
+
+/**
+ * inode_peek_iversion - read i_version without flagging it to be incremented
+ * @inode: inode from which i_version should be read
+ *
+ * Read the inode i_version counter for an inode without registering it as a
+ * query.
+ *
+ * This is typically used by local filesystems that need to store an i_version
+ * on disk. In that situation, it's not necessary to flag it as having been
+ * viewed, as the result won't be used to gauge changes from that point.
+ */
+static inline u64
+inode_peek_iversion(const struct inode *inode)
+{
+ return inode_peek_iversion_raw(inode);
+}
+
+/**
+ * inode_query_iversion - read i_version for later use
+ * @inode: inode from which i_version should be read
+ *
+ * Read the inode i_version counter. This should be used by callers that wish
+ * to store the returned i_version for later comparison. This will guarantee
+ * that a later query of the i_version will result in a different value if
+ * anything has changed.
+ *
+ * This implementation just does a peek.
+ */
+static inline u64
+inode_query_iversion(struct inode *inode)
+{
+ return inode_peek_iversion(inode);
+}
+
+/**
+ * inode_cmp_iversion - check whether the i_version counter has changed
+ * @inode: inode to check
+ * @old: old value to check against its i_version
+ *
+ * Compare an i_version counter with a previous one. Returns 0 if they are
+ * the same or non-zero if they are different.
+ */
+static inline s64
+inode_cmp_iversion(const struct inode *inode, const u64 old)
+{
+ return (s64)inode_peek_iversion(inode) - (s64)old;
+}
+#endif
--
2.14.3


2017-12-22 12:06:28

by Jeff Layton

Subject: [PATCH v4 11/19] nfsd: convert to new i_version API

From: Jeff Layton <[email protected]>

Mostly just making sure we use the "query" wrapper (inode_query_iversion)
so we know when it is being fetched for later use.
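
For context, the helper touched by the hunk below folds the queried value
into the NFSv4 change attribute together with the ctime. A standalone sketch
of that arithmetic follows. The function name and its plain integer
parameters are hypothetical stand-ins for the inode fields, and the
arithmetic is copied from the context lines of the diff.

```c
#include <stdint.h>

/*
 * Hypothetical standalone version of the arithmetic in
 * nfsd4_change_attribute(): ctime seconds shifted up by 30 bits, plus
 * ctime nanoseconds, plus the queried i_version.
 */
static uint64_t sketch_change_attribute(uint64_t ctime_sec,
					uint32_t ctime_nsec,
					uint64_t iversion)
{
	uint64_t chattr = ctime_sec;

	chattr <<= 30;
	chattr += ctime_nsec;
	chattr += iversion;
	return chattr;
}
```

Because the result is handed to clients, which hold on to it for later
comparison, the i_version has to be fetched with the query variant rather
than a peek.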

Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfsd/nfsfh.h | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h
index 43f31cf49bae..b8444189223b 100644
--- a/fs/nfsd/nfsfh.h
+++ b/fs/nfsd/nfsfh.h
@@ -11,6 +11,7 @@
#include <linux/crc32.h>
#include <linux/sunrpc/svc.h>
#include <uapi/linux/nfsd/nfsfh.h>
+#include <linux/iversion.h>

static inline __u32 ino_t_to_u32(ino_t ino)
{
@@ -259,7 +260,7 @@ static inline u64 nfsd4_change_attribute(struct inode *inode)
chattr = inode->i_ctime.tv_sec;
chattr <<= 30;
chattr += inode->i_ctime.tv_nsec;
- chattr += inode->i_version;
+ chattr += inode_query_iversion(inode);
return chattr;
}

--
2.14.3


2017-12-22 12:06:48

by Jeff Layton

Subject: [PATCH v4 19/19] fs: handle inode->i_version more efficiently

From: Jeff Layton <[email protected]>

Since i_version is mostly treated as an opaque value, we can exploit that
fact to avoid incrementing it when no one is watching. With that change,
we can avoid incrementing the counter on writes, unless someone has
queried for it since it was last incremented. If the a/c/mtime don't
change, and the i_version hasn't changed, then there's no need to dirty
the inode metadata on a write.

Convert the i_version counter to an atomic64_t, and use the lowest order
bit to hold a flag that will tell whether anyone has queried the value
since it was last incremented.

When we go to maybe increment it, we fetch the value and check the flag
bit. If it's clear then we don't need to do anything if the update
isn't being forced.

If we do need to update, then we increment the counter by 2, and clear
the flag bit, and then use a CAS op to swap it into place. If that
works, we return true. If it doesn't then do it again with the value
that we fetch from the CAS operation.

On the query side, if the flag is already set, then we just shift the
value down by 1 bit and return it. Otherwise, we set the flag in our
on-stack value and again use cmpxchg to swap it into place if it hasn't
changed. If it has, then we use the value from the cmpxchg as the new
"old" value and try again.

This method allows us to avoid incrementing the counter on writes (and
dirtying the metadata) under typical workloads. We only need to increment
if it has been queried since it was last changed.
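
The two loops described above can be sketched in userspace with C11 atomics.
This is only an approximation under stated assumptions: sketch_maybe_inc and
sketch_query are hypothetical names, and the default sequentially consistent
ordering of <stdatomic.h> stands in for the kernel's atomic64_cmpxchg() and
smp_mb() calls in the patch below.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Low bit is the "queried" flag; the counter lives in the upper 63 bits. */
#define QUERIED_SHIFT	1
#define QUERIED		((uint64_t)1 << (QUERIED_SHIFT - 1))
#define INCREMENT	((uint64_t)1 << QUERIED_SHIFT)

/* Bump the counter if forced or if it has been queried. */
static bool sketch_maybe_inc(_Atomic uint64_t *v, bool force)
{
	uint64_t cur = atomic_load(v);

	for (;;) {
		uint64_t new;

		/* Nobody has looked and we are not forced: nothing to do. */
		if (!force && !(cur & QUERIED))
			return false;

		/* Clear the flag and step past it by adding 2. */
		new = (cur & ~QUERIED) + INCREMENT;

		/* On failure, cur is reloaded with the current value. */
		if (atomic_compare_exchange_weak(v, &cur, new))
			return true;
	}
}

/* Read the counter and make sure the flag is set afterwards. */
static uint64_t sketch_query(_Atomic uint64_t *v)
{
	uint64_t cur = atomic_load(v);

	while (!(cur & QUERIED)) {
		/* Try to publish the flag; retry with the fresh value. */
		if (atomic_compare_exchange_weak(v, &cur, cur | QUERIED))
			break;
	}
	return cur >> QUERIED_SHIFT;
}
```

Starting from 0, an unforced sketch_maybe_inc() is a no-op until the first
sketch_query(), after which the next change moves the raw value from 1 to 2
and clears the flag again.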

Signed-off-by: Jeff Layton <[email protected]>
---
include/linux/fs.h | 2 +-
include/linux/iversion.h | 208 ++++++++++++++++++++++++++++++++++-------------
2 files changed, 154 insertions(+), 56 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 76382c24e9d0..6804d075933e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -639,7 +639,7 @@ struct inode {
struct hlist_head i_dentry;
struct rcu_head i_rcu;
};
- u64 i_version;
+ atomic64_t i_version;
atomic_t i_count;
atomic_t i_dio_count;
atomic_t i_writecount;
diff --git a/include/linux/iversion.h b/include/linux/iversion.h
index e08c634779df..cef242e54489 100644
--- a/include/linux/iversion.h
+++ b/include/linux/iversion.h
@@ -5,6 +5,8 @@
#include <linux/fs.h>

/*
+ * The inode->i_version field:
+ * ---------------------------
* The change attribute (i_version) is mandated by NFSv4 and is mostly for
* knfsd, but is also used for other purposes (e.g. IMA). The i_version must
* appear different to observers if there was a change to the inode's data or
@@ -27,86 +29,171 @@
* i_version on namespace changes in directories (mkdir, rmdir, unlink, etc.).
* We consider these sorts of filesystems to have a kernel-managed i_version.
*
+ * This implementation uses the low bit in the i_version field as a flag to
+ * track when the value has been queried. If it has not been queried since it
+ * was last incremented, we can skip the increment in most cases.
+ *
+ * In the event that we're updating the ctime, we will usually go ahead and
+ * bump the i_version anyway. Since that has to go to stable storage in some
+ * fashion, we might as well increment it as well.
+ *
+ * With this implementation, the value should always appear to observers to
+ * increase over time if the file has changed. It's recommended to use
+ * inode_cmp_iversion() helper to compare values.
+ *
* Note that some filesystems (e.g. NFS and AFS) just use the field to store
- * a server-provided value (for the most part). For that reason, those
+ * a server-provided value for the most part. For that reason, those
* filesystems do not set SB_I_VERSION. These filesystems are considered to
* have a self-managed i_version.
+ *
+ * Persistently storing the i_version
+ * ----------------------------------
+ * Queries of the i_version field are not gated on them hitting the backing
+ * store. It's always possible that the host could crash after allowing
+ * a query of the value but before it has made it to disk.
+ *
+ * To mitigate this problem, filesystems should always use
+ * inode_set_iversion_queried when loading an existing inode from disk. This
+ * ensures that the next attempted inode increment will result in the value
+ * changing.
+ *
+ * Storing the value to disk therefore does not count as a query, so those
+ * filesystems should use inode_peek_iversion to grab the value to be stored.
+ * There is no need to flag the value as having been queried in that case.
*/

+/*
+ * We borrow the lowest bit in the i_version to use as a flag to tell whether
+ * it has been queried since we last incremented it. If it has, then we must
+ * increment it on the next change. After that, we can clear the flag and
+ * avoid incrementing it again until it has again been queried.
+ */
+#define I_VERSION_QUERIED_SHIFT (1)
+#define I_VERSION_QUERIED (1ULL << (I_VERSION_QUERIED_SHIFT - 1))
+#define I_VERSION_INCREMENT (1ULL << I_VERSION_QUERIED_SHIFT)
+
/**
* inode_set_iversion_raw - set i_version to the specified raw value
* @inode: inode to set
- * @new: new i_version value to set
+ * @val: new i_version value to set
*
- * Set @inode's i_version field to @new. This function is for use by
+ * Set @inode's i_version field to @val. This function is for use by
* filesystems that self-manage the i_version.
*
* For example, the NFS client stores its NFSv4 change attribute in this way,
* and the AFS client stores the data_version from the server here.
*/
static inline void
-inode_set_iversion_raw(struct inode *inode, const u64 new)
+inode_set_iversion_raw(struct inode *inode, const u64 val)
+{
+ atomic64_set(&inode->i_version, val);
+}
+
+/**
+ * inode_peek_iversion_raw - grab a "raw" iversion value
+ * @inode: inode from which i_version should be read
+ *
+ * Grab a "raw" inode->i_version value and return it. The i_version is not
+ * flagged or converted in any way. This is mostly used to access a self-managed
+ * i_version.
+ *
+ * With those filesystems, we want to treat the i_version as an entirely
+ * opaque value.
+ */
+static inline u64
+inode_peek_iversion_raw(const struct inode *inode)
{
- inode->i_version = new;
+ return atomic64_read(&inode->i_version);
}

/**
* inode_set_iversion - set i_version to a particular value
* @inode: inode to set
- * @new: new i_version value to set
+ * @val: new i_version value to set
*
- * Set @inode's i_version field to @new. This function is for filesystems with
- * a kernel-managed i_version.
+ * Set @inode's i_version field to @val. This function is for filesystems with
+ * a kernel-managed i_version, for initializing a newly-created inode from
+ * scratch.
*
- * For now, this just does the same thing as the _raw variant.
+ * In this case, we do not set the QUERIED flag since we know that this value
+ * has never been queried.
*/
static inline void
-inode_set_iversion(struct inode *inode, const u64 new)
+inode_set_iversion(struct inode *inode, const u64 val)
{
- inode_set_iversion_raw(inode, new);
+ inode_set_iversion_raw(inode, val << I_VERSION_QUERIED_SHIFT);
}

/**
- * inode_set_iversion_queried - set i_version to a particular value and set
- * flag to indicate that it has been viewed
+ * inode_set_iversion_queried - set i_version to a particular value as queried
* @inode: inode to set
- * @new: new i_version value to set
+ * @val: new i_version value to set
*
- * When loading in an i_version value from a backing store, we typically don't
- * know whether it was previously viewed before being stored or not. Thus, we
- * must assume that it was, to ensure that any changes will result in the
- * value changing.
+ * Set @inode's i_version field to @val, and flag it for increment on the next
+ * change.
*
- * This function will set the inode's i_version, and possibly flag the value
- * as if it has already been viewed at least once.
+ * Filesystems that persistently store the i_version on disk should use this
+ * when loading an existing inode from disk.
*
- * For now, this just does what inode_set_iversion does.
+ * When loading in an i_version value from a backing store, we can't be certain
+ * that it wasn't previously viewed before being stored. Thus, we must assume
+ * that it was, to ensure that we don't end up handing out the same value for
+ * different versions of the same inode.
*/
static inline void
-inode_set_iversion_queried(struct inode *inode, const u64 new)
+inode_set_iversion_queried(struct inode *inode, const u64 val)
{
- inode_set_iversion(inode, new);
+ inode_set_iversion_raw(inode, (val << I_VERSION_QUERIED_SHIFT) |
+ I_VERSION_QUERIED);
}

/**
* inode_maybe_inc_iversion - increments i_version
* @inode: inode with the i_version that should be updated
- * @force: increment the counter even if it's not necessary
+ * @force: increment the counter even if it's not necessary?
*
* Every time the inode is modified, the i_version field must be seen to have
* changed by any observer.
*
- * In this implementation, we always increment it after taking the i_lock to
- * ensure that we don't race with other incrementors.
+ * If "force" is set or the QUERIED flag is set, then ensure that we increment
+ * the value, and clear the queried flag.
*
- * Returns true if counter was bumped, and false if it wasn't.
+ * In the common case where neither is set, then we can return "false" without
+ * updating i_version.
+ *
+ * If this function returns false, and no other metadata has changed, then we
+ * can avoid logging the metadata.
*/
static inline bool
inode_maybe_inc_iversion(struct inode *inode, bool force)
{
- atomic64_t *ivp = (atomic64_t *)&inode->i_version;
+ u64 cur, old, new;
+
+ /*
+ * The i_version field is not strictly ordered with any other inode
+ * information, but the legacy inode_inc_iversion code used a spinlock
+ * to serialize increments.
+ *
+ * Here, we add full memory barriers to ensure that any de-facto
+ * ordering with other info is preserved.
+ *
+ * This barrier pairs with the barrier in inode_query_iversion()
+ */
+ smp_mb();
+ cur = inode_peek_iversion_raw(inode);
+ for (;;) {
+ /* If flag is clear then we needn't do anything */
+ if (!force && !(cur & I_VERSION_QUERIED))
+ return false;

- atomic64_inc(ivp);
+ /* Since lowest bit is flag, add 2 to avoid it */
+ new = (cur & ~I_VERSION_QUERIED) + I_VERSION_INCREMENT;
+
+ old = atomic64_cmpxchg(&inode->i_version, cur, new);
+ if (likely(old == cur))
+ break;
+ cur = old;
+ }
return true;
}

@@ -129,31 +216,12 @@ inode_inc_iversion(struct inode *inode)
* @inode: inode to check
*
* Returns whether the inode->i_version counter needs incrementing on the next
- * change.
- *
- * For now, we assume that it always does.
+ * change. Just fetch the value and check the QUERIED flag.
*/
static inline bool
inode_iversion_need_inc(struct inode *inode)
{
- return true;
-}
-
-/**
- * inode_peek_iversion_raw - grab a "raw" iversion value
- * @inode: inode from which i_version should be read
- *
- * Grab a "raw" inode->i_version value and return it. The i_version is not
- * flagged or converted in any way. This is mostly used to access a self-managed
- * i_version.
- *
- * With those filesystems, we want to treat the i_version as an entirely
- * opaque value.
- */
-static inline u64
-inode_peek_iversion_raw(const struct inode *inode)
-{
- return inode->i_version;
+ return inode_peek_iversion_raw(inode) & I_VERSION_QUERIED;
}

/**
@@ -170,7 +238,7 @@ inode_peek_iversion_raw(const struct inode *inode)
static inline u64
inode_peek_iversion(const struct inode *inode)
{
- return inode_peek_iversion_raw(inode);
+ return inode_peek_iversion_raw(inode) >> I_VERSION_QUERIED_SHIFT;
}

/**
@@ -182,12 +250,35 @@ inode_peek_iversion(const struct inode *inode)
* that a later query of the i_version will result in a different value if
* anything has changed.
*
- * This implementation just does a peek.
+ * In this implementation, we fetch the current value, set the QUERIED flag and
+ * then try to swap it into place with a cmpxchg, if it wasn't already set. If
+ * that fails, we try again with the newly fetched value from the cmpxchg.
*/
static inline u64
inode_query_iversion(struct inode *inode)
{
- return inode_peek_iversion(inode);
+ u64 cur, old, new;
+
+ cur = inode_peek_iversion_raw(inode);
+ for (;;) {
+ /* If flag is already set, then no need to swap */
+ if (cur & I_VERSION_QUERIED) {
+ /*
+ * This barrier (and the implicit barrier in the
+ * cmpxchg below) pairs with the barrier in
+ * inode_maybe_inc_iversion().
+ */
+ smp_mb();
+ break;
+ }
+
+ new = cur | I_VERSION_QUERIED;
+ old = atomic64_cmpxchg(&inode->i_version, cur, new);
+ if (likely(old == cur))
+ break;
+ cur = old;
+ }
+ return cur >> I_VERSION_QUERIED_SHIFT;
}

/**
@@ -196,11 +287,18 @@ inode_query_iversion(struct inode *inode)
* @old: old value to check against its i_version
*
* Compare an i_version counter with a previous one. Returns 0 if they are
- * the same or non-zero if they are different.
+ * the same, a positive value if the one in the inode appears newer than @old,
+ * and a negative value if @old appears to be newer than the one in the
+ * inode.
+ *
+ * Note that we don't need to set the QUERIED flag in this case, as the value
+ * in the inode is not being recorded for later use.
*/
+
static inline s64
inode_cmp_iversion(const struct inode *inode, const u64 old)
{
- return (s64)inode_peek_iversion(inode) - (s64)old;
+ return (s64)(inode_peek_iversion_raw(inode) & ~I_VERSION_QUERIED) -
+ (s64)(old << I_VERSION_QUERIED_SHIFT);
}
#endif
--
2.14.3


2017-12-22 12:06:45

by Jeff Layton

Subject: [PATCH v4 18/19] btrfs: only dirty the inode in btrfs_update_time if something was changed

From: Jeff Layton <[email protected]>

At this point, we know that "now" and the file times may differ, and we
suspect that the i_version has been flagged to be bumped. Attempt to
bump the i_version, and only mark the inode dirty if that actually
occurred or if one of the times was updated.
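
As a rough illustration only (not part of this patch), the dirtying decision described above can be modelled in plain userspace C. The names sketch_update_time and struct sketch_result, the SK_S_* values and the "queried" parameter are hypothetical stand-ins; "queried" represents whether someone has looked at i_version since the last increment.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the S_* update flags. */
#define SK_S_ATIME   1
#define SK_S_MTIME   2
#define SK_S_CTIME   4
#define SK_S_VERSION 8

struct sketch_result {
	bool bumped;	/* was the version counter incremented? */
	bool dirty;	/* does the inode need to be marked dirty? */
};

/*
 * Model of the decision: a timestamp change always dirties the inode
 * and forces the version bump; a lone S_VERSION request only bumps
 * (and dirties) if the counter was queried since the last increment.
 */
static struct sketch_result sketch_update_time(int flags, bool queried)
{
	struct sketch_result res = { false, false };
	bool dirty = (flags & ~SK_S_VERSION) != 0;

	if (flags & SK_S_VERSION) {
		res.bumped = dirty || queried;
		dirty = dirty || res.bumped;
	}
	res.dirty = dirty;
	return res;
}
```

With only SK_S_VERSION set and queried == false, neither field is set, which is the case where the write to storage is skipped.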

Signed-off-by: Jeff Layton <[email protected]>
---
fs/btrfs/inode.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index ac8692849a81..76245323a7c8 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6107,19 +6107,20 @@ static int btrfs_update_time(struct inode *inode, struct timespec *now,
int flags)
{
struct btrfs_root *root = BTRFS_I(inode)->root;
+ bool dirty = flags & ~S_VERSION;

if (btrfs_root_readonly(root))
return -EROFS;

if (flags & S_VERSION)
- inode_inc_iversion(inode);
+ dirty |= inode_maybe_inc_iversion(inode, dirty);
if (flags & S_CTIME)
inode->i_ctime = *now;
if (flags & S_MTIME)
inode->i_mtime = *now;
if (flags & S_ATIME)
inode->i_atime = *now;
- return btrfs_dirty_inode(inode);
+ return dirty ? btrfs_dirty_inode(inode) : 0;
}

/*
--
2.14.3


2017-12-22 12:06:43

by Jeff Layton

Subject: [PATCH v4 17/19] xfs: avoid setting XFS_ILOG_CORE if i_version doesn't need incrementing

From: Jeff Layton <[email protected]>

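If XFS_ILOG_CORE is already set, then go ahead and increment i_version
unconditionally. Otherwise, only increment it (and set XFS_ILOG_CORE) if
someone has queried the value since the last time it was incremented.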

Signed-off-by: Jeff Layton <[email protected]>
---
fs/xfs/xfs_trans_inode.c | 14 ++++++++------
1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/xfs_trans_inode.c b/fs/xfs/xfs_trans_inode.c
index 225544327c4f..4a89da4b6fe7 100644
--- a/fs/xfs/xfs_trans_inode.c
+++ b/fs/xfs/xfs_trans_inode.c
@@ -112,15 +112,17 @@ xfs_trans_log_inode(

/*
* First time we log the inode in a transaction, bump the inode change
- * counter if it is configured for this to occur. We don't use
- * inode_inc_version() because there is no need for extra locking around
- * i_version as we already hold the inode locked exclusively for
- * metadata modification.
+ * counter if it is configured for this to occur. While we have the
+ * inode locked exclusively for metadata modification, we can usually
+ * avoid setting XFS_ILOG_CORE if no one has queried the value since
+ * the last time it was incremented. If we have XFS_ILOG_CORE already
+ * set however, then go ahead and bump the i_version counter
+ * unconditionally.
*/
if (!(ip->i_itemp->ili_item.li_desc->lid_flags & XFS_LID_DIRTY) &&
IS_I_VERSION(VFS_I(ip))) {
- inode_inc_iversion(VFS_I(ip));
- flags |= XFS_ILOG_CORE;
+ if (inode_maybe_inc_iversion(VFS_I(ip), flags & XFS_ILOG_CORE))
+ flags |= XFS_ILOG_CORE;
}

tp->t_flags |= XFS_TRANS_DIRTY;
--
2.14.3


2017-12-22 12:06:31

by Jeff Layton

Subject: [PATCH v4 12/19] ocfs2: convert to new i_version API

From: Jeff Layton <[email protected]>

Signed-off-by: Jeff Layton <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
---
fs/ocfs2/dir.c | 15 ++++++++-------
fs/ocfs2/inode.c | 3 ++-
fs/ocfs2/namei.c | 3 ++-
fs/ocfs2/quota_global.c | 3 ++-
4 files changed, 14 insertions(+), 10 deletions(-)

diff --git a/fs/ocfs2/dir.c b/fs/ocfs2/dir.c
index febe6312ceff..32f9c72dff17 100644
--- a/fs/ocfs2/dir.c
+++ b/fs/ocfs2/dir.c
@@ -42,6 +42,7 @@
#include <linux/highmem.h>
#include <linux/quotaops.h>
#include <linux/sort.h>
+#include <linux/iversion.h>

#include <cluster/masklog.h>

@@ -1174,7 +1175,7 @@ static int __ocfs2_delete_entry(handle_t *handle, struct inode *dir,
le16_add_cpu(&pde->rec_len,
le16_to_cpu(de->rec_len));
de->inode = 0;
- dir->i_version++;
+ inode_inc_iversion(dir);
ocfs2_journal_dirty(handle, bh);
goto bail;
}
@@ -1729,7 +1730,7 @@ int __ocfs2_add_entry(handle_t *handle,
if (ocfs2_dir_indexed(dir))
ocfs2_recalc_free_list(dir, handle, lookup);

- dir->i_version++;
+ inode_inc_iversion(dir);
ocfs2_journal_dirty(handle, insert_bh);
retval = 0;
goto bail;
@@ -1775,7 +1776,7 @@ static int ocfs2_dir_foreach_blk_id(struct inode *inode,
* readdir(2), then we might be pointing to an invalid
* dirent right now. Scan from the start of the block
* to make sure. */
- if (*f_version != inode->i_version) {
+ if (inode_cmp_iversion(inode, *f_version)) {
for (i = 0; i < i_size_read(inode) && i < offset; ) {
de = (struct ocfs2_dir_entry *)
(data->id_data + i);
@@ -1791,7 +1792,7 @@ static int ocfs2_dir_foreach_blk_id(struct inode *inode,
i += le16_to_cpu(de->rec_len);
}
ctx->pos = offset = i;
- *f_version = inode->i_version;
+ *f_version = inode_query_iversion(inode);
}

de = (struct ocfs2_dir_entry *) (data->id_data + ctx->pos);
@@ -1869,7 +1870,7 @@ static int ocfs2_dir_foreach_blk_el(struct inode *inode,
* readdir(2), then we might be pointing to an invalid
* dirent right now. Scan from the start of the block
* to make sure. */
- if (*f_version != inode->i_version) {
+ if (inode_cmp_iversion(inode, *f_version)) {
for (i = 0; i < sb->s_blocksize && i < offset; ) {
de = (struct ocfs2_dir_entry *) (bh->b_data + i);
/* It's too expensive to do a full
@@ -1886,7 +1887,7 @@ static int ocfs2_dir_foreach_blk_el(struct inode *inode,
offset = i;
ctx->pos = (ctx->pos & ~(sb->s_blocksize - 1))
| offset;
- *f_version = inode->i_version;
+ *f_version = inode_query_iversion(inode);
}

while (ctx->pos < i_size_read(inode)
@@ -1940,7 +1941,7 @@ static int ocfs2_dir_foreach_blk(struct inode *inode, u64 *f_version,
*/
int ocfs2_dir_foreach(struct inode *inode, struct dir_context *ctx)
{
- u64 version = inode->i_version;
+ u64 version = inode_query_iversion(inode);
ocfs2_dir_foreach_blk(inode, &version, ctx, true);
return 0;
}
diff --git a/fs/ocfs2/inode.c b/fs/ocfs2/inode.c
index 1a1e0078ab38..d51b80edd972 100644
--- a/fs/ocfs2/inode.c
+++ b/fs/ocfs2/inode.c
@@ -28,6 +28,7 @@
#include <linux/highmem.h>
#include <linux/pagemap.h>
#include <linux/quotaops.h>
+#include <linux/iversion.h>

#include <asm/byteorder.h>

@@ -302,7 +303,7 @@ void ocfs2_populate_inode(struct inode *inode, struct ocfs2_dinode *fe,
OCFS2_I(inode)->ip_attr = le32_to_cpu(fe->i_attr);
OCFS2_I(inode)->ip_dyn_features = le16_to_cpu(fe->i_dyn_features);

- inode->i_version = 1;
+ inode_set_iversion(inode, 1);
inode->i_generation = le32_to_cpu(fe->i_generation);
inode->i_rdev = huge_decode_dev(le64_to_cpu(fe->id1.dev1.i_rdev));
inode->i_mode = le16_to_cpu(fe->i_mode);
diff --git a/fs/ocfs2/namei.c b/fs/ocfs2/namei.c
index 3b0a10d9b36f..c801eddc4bf3 100644
--- a/fs/ocfs2/namei.c
+++ b/fs/ocfs2/namei.c
@@ -41,6 +41,7 @@
#include <linux/slab.h>
#include <linux/highmem.h>
#include <linux/quotaops.h>
+#include <linux/iversion.h>

#include <cluster/masklog.h>

@@ -1520,7 +1521,7 @@ static int ocfs2_rename(struct inode *old_dir,
mlog_errno(status);
goto bail;
}
- new_dir->i_version++;
+ inode_inc_iversion(new_dir);

if (S_ISDIR(new_inode->i_mode))
ocfs2_set_links_count(newfe, 0);
diff --git a/fs/ocfs2/quota_global.c b/fs/ocfs2/quota_global.c
index b39d14cbfa34..d03411784aaf 100644
--- a/fs/ocfs2/quota_global.c
+++ b/fs/ocfs2/quota_global.c
@@ -12,6 +12,7 @@
#include <linux/writeback.h>
#include <linux/workqueue.h>
#include <linux/llist.h>
+#include <linux/iversion.h>

#include <cluster/masklog.h>

@@ -289,7 +290,7 @@ ssize_t ocfs2_quota_write(struct super_block *sb, int type,
mlog_errno(err);
return err;
}
- gqinode->i_version++;
+ inode_inc_iversion(gqinode);
ocfs2_mark_inode_dirty(handle, gqinode, oinfo->dqi_gqi_bh);
return len;
}
--
2.14.3


2017-12-22 12:06:40

by Jeff Layton

Subject: [PATCH v4 16/19] fs: only set S_VERSION when updating times if necessary

From: Jeff Layton <[email protected]>

We only really need to update i_version if someone has queried for it
since we last incremented it. By doing that, we can avoid having to
update the inode if the times haven't changed.

If the times have changed, then we go ahead and forcibly increment the
counter, under the assumption that we'll be going to the storage
anyway, and the increment itself is relatively cheap.
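
To make the "only bump if queried" idea concrete, here is a minimal userspace sketch using C11 atomics. It assumes the scheme from the final patch in this series, where the low bit of the stored value is a QUERIED flag and the counter lives in the upper bits. All names here (struct model_inode, model_set, model_query, model_maybe_inc and the QUERIED_* macros) are hypothetical and this is not the kernel implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define QUERIED_SHIFT 1
#define QUERIED_FLAG  ((uint64_t)1)
#define INCREMENT     ((uint64_t)1 << QUERIED_SHIFT)

struct model_inode {
	_Atomic uint64_t i_version;	/* counter << 1 | QUERIED flag */
};

/* Store a counter value with the QUERIED flag clear. */
static void model_set(struct model_inode *inode, uint64_t val)
{
	assert(val <= (UINT64_MAX >> QUERIED_SHIFT));
	atomic_store(&inode->i_version, val << QUERIED_SHIFT);
}

/* Read the counter for later comparison, marking it as queried. */
static uint64_t model_query(struct model_inode *inode)
{
	uint64_t cur = atomic_load(&inode->i_version);

	/* On CAS failure, cur is reloaded with the current value. */
	while (!(cur & QUERIED_FLAG)) {
		if (atomic_compare_exchange_weak(&inode->i_version, &cur,
						 cur | QUERIED_FLAG))
			break;
	}
	return cur >> QUERIED_SHIFT;
}

/*
 * Bump the counter if forced, or if it was queried since the last
 * bump. Returns true when an increment happened, i.e. when the
 * caller would need to write the inode back.
 */
static bool model_maybe_inc(struct model_inode *inode, bool force)
{
	uint64_t cur = atomic_load(&inode->i_version);
	uint64_t new;

	do {
		if (!force && !(cur & QUERIED_FLAG))
			return false;
		new = (cur & ~QUERIED_FLAG) + INCREMENT;
	} while (!atomic_compare_exchange_weak(&inode->i_version, &cur, new));
	return true;
}
```

In this model, a run of writes with no intervening model_query() call leaves the stored value untouched, so nothing has to be logged for the version counter alone.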

Signed-off-by: Jeff Layton <[email protected]>
---
fs/inode.c | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 19e72f500f71..2fa920188759 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1635,17 +1635,21 @@ static int relatime_need_update(const struct path *path, struct inode *inode,
int generic_update_time(struct inode *inode, struct timespec *time, int flags)
{
int iflags = I_DIRTY_TIME;
+ bool dirty = false;

if (flags & S_ATIME)
inode->i_atime = *time;
if (flags & S_VERSION)
- inode_inc_iversion(inode);
+ dirty |= inode_maybe_inc_iversion(inode, dirty);
if (flags & S_CTIME)
inode->i_ctime = *time;
if (flags & S_MTIME)
inode->i_mtime = *time;
+ if ((flags & (S_ATIME | S_CTIME | S_MTIME)) &&
+ !(inode->i_sb->s_flags & SB_LAZYTIME))
+ dirty = true;

- if (!(inode->i_sb->s_flags & SB_LAZYTIME) || (flags & S_VERSION))
+ if (dirty)
iflags |= I_DIRTY_SYNC;
__mark_inode_dirty(inode, iflags);
return 0;
@@ -1864,7 +1868,7 @@ int file_update_time(struct file *file)
if (!timespec_equal(&inode->i_ctime, &now))
sync_it |= S_CTIME;

- if (IS_I_VERSION(inode))
+ if (IS_I_VERSION(inode) && inode_iversion_need_inc(inode))
sync_it |= S_VERSION;

if (!sync_it)
--
2.14.3


2017-12-22 12:06:38

by Jeff Layton

Subject: [PATCH v4 15/19] IMA: switch IMA over to new i_version API

From: Jeff Layton <[email protected]>

Signed-off-by: Jeff Layton <[email protected]>
---
security/integrity/ima/ima_api.c | 3 ++-
security/integrity/ima/ima_main.c | 3 ++-
2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/security/integrity/ima/ima_api.c b/security/integrity/ima/ima_api.c
index c7e8db0ea4c0..c6ae42266270 100644
--- a/security/integrity/ima/ima_api.c
+++ b/security/integrity/ima/ima_api.c
@@ -18,6 +18,7 @@
#include <linux/fs.h>
#include <linux/xattr.h>
#include <linux/evm.h>
+#include <linux/iversion.h>

#include "ima.h"

@@ -215,7 +216,7 @@ int ima_collect_measurement(struct integrity_iint_cache *iint,
* which do not support i_version, support is limited to an initial
* measurement/appraisal/audit.
*/
- i_version = file_inode(file)->i_version;
+ i_version = inode_query_iversion(inode);
hash.hdr.algo = algo;

/* Initialize hash digest to 0's in case of failure */
diff --git a/security/integrity/ima/ima_main.c b/security/integrity/ima/ima_main.c
index 50b82599994d..06a70c5a2329 100644
--- a/security/integrity/ima/ima_main.c
+++ b/security/integrity/ima/ima_main.c
@@ -24,6 +24,7 @@
#include <linux/slab.h>
#include <linux/xattr.h>
#include <linux/ima.h>
+#include <linux/iversion.h>

#include "ima.h"

@@ -128,7 +129,7 @@ static void ima_check_last_writer(struct integrity_iint_cache *iint,
inode_lock(inode);
if (atomic_read(&inode->i_writecount) == 1) {
if (!IS_I_VERSION(inode) ||
- (iint->version != inode->i_version) ||
+ inode_cmp_iversion(inode, iint->version) ||
(iint->flags & IMA_NEW_FILE)) {
iint->flags &= ~(IMA_DONE_MASK | IMA_NEW_FILE);
iint->measured_pcrs = 0;
--
2.14.3


2017-12-22 12:06:36

by Jeff Layton

Subject: [PATCH v4 14/19] xfs: convert to new i_version API

From: Jeff Layton <[email protected]>

Signed-off-by: Jeff Layton <[email protected]>
---
fs/xfs/libxfs/xfs_inode_buf.c | 7 +++++--
fs/xfs/xfs_icache.c | 5 +++--
fs/xfs/xfs_inode.c | 3 ++-
fs/xfs/xfs_inode_item.c | 3 ++-
fs/xfs/xfs_trans_inode.c | 4 +++-
5 files changed, 15 insertions(+), 7 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
index 6b7989038d75..b9c0bf80669c 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.c
+++ b/fs/xfs/libxfs/xfs_inode_buf.c
@@ -32,6 +32,8 @@
#include "xfs_ialloc.h"
#include "xfs_dir2.h"

+#include <linux/iversion.h>
+
/*
* Check that none of the inode's in the buffer have a next
* unlinked field of 0.
@@ -264,7 +266,8 @@ xfs_inode_from_disk(
to->di_flags = be16_to_cpu(from->di_flags);

if (to->di_version == 3) {
- inode->i_version = be64_to_cpu(from->di_changecount);
+ inode_set_iversion_queried(inode,
+ be64_to_cpu(from->di_changecount));
to->di_crtime.t_sec = be32_to_cpu(from->di_crtime.t_sec);
to->di_crtime.t_nsec = be32_to_cpu(from->di_crtime.t_nsec);
to->di_flags2 = be64_to_cpu(from->di_flags2);
@@ -314,7 +317,7 @@ xfs_inode_to_disk(
to->di_flags = cpu_to_be16(from->di_flags);

if (from->di_version == 3) {
- to->di_changecount = cpu_to_be64(inode->i_version);
+ to->di_changecount = cpu_to_be64(inode_peek_iversion(inode));
to->di_crtime.t_sec = cpu_to_be32(from->di_crtime.t_sec);
to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.t_nsec);
to->di_flags2 = cpu_to_be64(from->di_flags2);
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 43005fbe8b1e..4c315adb05e6 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -37,6 +37,7 @@

#include <linux/kthread.h>
#include <linux/freezer.h>
+#include <linux/iversion.h>

/*
* Allocate and initialise an xfs_inode.
@@ -293,14 +294,14 @@ xfs_reinit_inode(
int error;
uint32_t nlink = inode->i_nlink;
uint32_t generation = inode->i_generation;
- uint64_t version = inode->i_version;
+ uint64_t version = inode_peek_iversion(inode);
umode_t mode = inode->i_mode;

error = inode_init_always(mp->m_super, inode);

set_nlink(inode, nlink);
inode->i_generation = generation;
- inode->i_version = version;
+ inode_set_iversion_queried(inode, version);
inode->i_mode = mode;
return error;
}
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 801274126648..dfc5e60d8af3 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -16,6 +16,7 @@
* Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
*/
#include <linux/log2.h>
+#include <linux/iversion.h>

#include "xfs.h"
#include "xfs_fs.h"
@@ -833,7 +834,7 @@ xfs_ialloc(
ip->i_d.di_flags = 0;

if (ip->i_d.di_version == 3) {
- inode->i_version = 1;
+ inode_set_iversion(inode, 1);
ip->i_d.di_flags2 = 0;
ip->i_d.di_cowextsize = 0;
ip->i_d.di_crtime.t_sec = (int32_t)tv.tv_sec;
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index 6ee5c3bf19ad..7571abf5dfb3 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -30,6 +30,7 @@
#include "xfs_buf_item.h"
#include "xfs_log.h"

+#include <linux/iversion.h>

kmem_zone_t *xfs_ili_zone; /* inode log item zone */

@@ -354,7 +355,7 @@ xfs_inode_to_log_dinode(
to->di_next_unlinked = NULLAGINO;

if (from->di_version == 3) {
- to->di_changecount = inode->i_version;
+ to->di_changecount = inode_peek_iversion(inode);
to->di_crtime.t_sec = from->di_crtime.t_sec;
to->di_crtime.t_nsec = from->di_crtime.t_nsec;
to->di_flags2 = from->di_flags2;
diff --git a/fs/xfs/xfs_trans_inode.c b/fs/xfs/xfs_trans_inode.c
index daa7615497f9..225544327c4f 100644
--- a/fs/xfs/xfs_trans_inode.c
+++ b/fs/xfs/xfs_trans_inode.c
@@ -28,6 +28,8 @@
#include "xfs_inode_item.h"
#include "xfs_trace.h"

+#include <linux/iversion.h>
+
/*
* Add a locked inode to the transaction.
*
@@ -117,7 +119,7 @@ xfs_trans_log_inode(
*/
if (!(ip->i_itemp->ili_item.li_desc->lid_flags & XFS_LID_DIRTY) &&
IS_I_VERSION(VFS_I(ip))) {
- VFS_I(ip)->i_version++;
+ inode_inc_iversion(VFS_I(ip));
flags |= XFS_ILOG_CORE;
}

--
2.14.3


2017-12-22 12:06:33

by Jeff Layton

Subject: [PATCH v4 13/19] ufs: use new i_version API

From: Jeff Layton <[email protected]>

Signed-off-by: Jeff Layton <[email protected]>
---
fs/ufs/dir.c | 9 +++++----
fs/ufs/inode.c | 3 ++-
fs/ufs/super.c | 3 ++-
3 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/fs/ufs/dir.c b/fs/ufs/dir.c
index 2edc1755b7c5..50dfce000864 100644
--- a/fs/ufs/dir.c
+++ b/fs/ufs/dir.c
@@ -20,6 +20,7 @@
#include <linux/time.h>
#include <linux/fs.h>
#include <linux/swap.h>
+#include <linux/iversion.h>

#include "ufs_fs.h"
#include "ufs.h"
@@ -47,7 +48,7 @@ static int ufs_commit_chunk(struct page *page, loff_t pos, unsigned len)
struct inode *dir = mapping->host;
int err = 0;

- dir->i_version++;
+ inode_inc_iversion(dir);
block_write_end(NULL, mapping, pos, len, len, page, NULL);
if (pos+len > dir->i_size) {
i_size_write(dir, pos+len);
@@ -428,7 +429,7 @@ ufs_readdir(struct file *file, struct dir_context *ctx)
unsigned long n = pos >> PAGE_SHIFT;
unsigned long npages = dir_pages(inode);
unsigned chunk_mask = ~(UFS_SB(sb)->s_uspi->s_dirblksize - 1);
- int need_revalidate = file->f_version != inode->i_version;
+ bool need_revalidate = inode_cmp_iversion(inode, file->f_version);
unsigned flags = UFS_SB(sb)->s_flags;

UFSD("BEGIN\n");
@@ -455,8 +456,8 @@ ufs_readdir(struct file *file, struct dir_context *ctx)
offset = ufs_validate_entry(sb, kaddr, offset, chunk_mask);
ctx->pos = (n<<PAGE_SHIFT) + offset;
}
- file->f_version = inode->i_version;
- need_revalidate = 0;
+ file->f_version = inode_query_iversion(inode);
+ need_revalidate = false;
}
de = (struct ufs_dir_entry *)(kaddr+offset);
limit = kaddr + ufs_last_byte(inode, n) - UFS_DIR_REC_LEN(1);
diff --git a/fs/ufs/inode.c b/fs/ufs/inode.c
index afb601c0dda0..c843ec858cf7 100644
--- a/fs/ufs/inode.c
+++ b/fs/ufs/inode.c
@@ -36,6 +36,7 @@
#include <linux/mm.h>
#include <linux/buffer_head.h>
#include <linux/writeback.h>
+#include <linux/iversion.h>

#include "ufs_fs.h"
#include "ufs.h"
@@ -693,7 +694,7 @@ struct inode *ufs_iget(struct super_block *sb, unsigned long ino)
if (err)
goto bad_inode;

- inode->i_version++;
+ inode_inc_iversion(inode);
ufsi->i_lastfrag =
(inode->i_size + uspi->s_fsize - 1) >> uspi->s_fshift;
ufsi->i_dir_start_lookup = 0;
diff --git a/fs/ufs/super.c b/fs/ufs/super.c
index 4d497e9c6883..b6ba80e05bff 100644
--- a/fs/ufs/super.c
+++ b/fs/ufs/super.c
@@ -88,6 +88,7 @@
#include <linux/log2.h>
#include <linux/mount.h>
#include <linux/seq_file.h>
+#include <linux/iversion.h>

#include "ufs_fs.h"
#include "ufs.h"
@@ -1440,7 +1441,7 @@ static struct inode *ufs_alloc_inode(struct super_block *sb)
if (!ei)
return NULL;

- ei->vfs_inode.i_version = 1;
+ inode_set_iversion(&ei->vfs_inode, 1);
seqlock_init(&ei->meta_lock);
mutex_init(&ei->truncate_mutex);
return &ei->vfs_inode;
--
2.14.3


2017-12-22 12:06:26

by Jeff Layton

Subject: [PATCH v4 10/19] nfs: convert to new i_version API

From: Jeff Layton <[email protected]>

For NFS, we just use the "raw" API since the i_version is mostly
managed by the server. The exception there is when the client
holds a write delegation, but we only need to bump it once
there anyway to handle CB_GETATTR.
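
A small userspace sketch of why the "raw" accessors matter for an opaque, server-supplied value, assuming the low-bit QUERIED flag layout from the final patch in this series. The sketch_* helpers are hypothetical models, not the kernel functions: the raw variants treat the stored 64 bits as-is, while the kernel-managed variants shift out the flag bit.

```c
#include <stdint.h>

/* Raw view: the stored value is exactly what the server sent. */
static uint64_t sketch_peek_raw(uint64_t stored)
{
	return stored;
}

/* Kernel-managed view: the low bit is a flag, the rest is the counter. */
static uint64_t sketch_peek(uint64_t stored)
{
	return stored >> 1;
}

/* Raw comparison: zero only if all 64 bits match. */
static int64_t sketch_cmp_raw(uint64_t stored, uint64_t old)
{
	return (int64_t)(stored - old);
}

/*
 * Kernel-managed comparison: masks the flag in the stored value and
 * shifts the old value up, so it only makes sense for values that
 * were written through the shifting interface.
 */
static int64_t sketch_cmp(uint64_t stored, uint64_t old)
{
	return (int64_t)((stored & ~(uint64_t)1) - (old << 1));
}
```

For a change attribute of 5 stored raw, sketch_cmp_raw(5, 5) is zero, whereas sketch_cmp(5, 5) is not, so mixing the two interfaces would report a change that never happened.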

Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfs/delegation.c | 3 ++-
fs/nfs/fscache-index.c | 5 +++--
fs/nfs/inode.c | 18 +++++++++---------
fs/nfs/nfs4proc.c | 10 ++++++----
fs/nfs/nfstrace.h | 5 +++--
fs/nfs/write.c | 8 +++-----
6 files changed, 26 insertions(+), 23 deletions(-)

diff --git a/fs/nfs/delegation.c b/fs/nfs/delegation.c
index ade44ca0c66c..d8b47624fee2 100644
--- a/fs/nfs/delegation.c
+++ b/fs/nfs/delegation.c
@@ -12,6 +12,7 @@
#include <linux/sched.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
+#include <linux/iversion.h>

#include <linux/nfs4.h>
#include <linux/nfs_fs.h>
@@ -347,7 +348,7 @@ int nfs_inode_set_delegation(struct inode *inode, struct rpc_cred *cred, struct
nfs4_stateid_copy(&delegation->stateid, &res->delegation);
delegation->type = res->delegation_type;
delegation->pagemod_limit = res->pagemod_limit;
- delegation->change_attr = inode->i_version;
+ delegation->change_attr = inode_peek_iversion_raw(inode);
delegation->cred = get_rpccred(cred);
delegation->inode = inode;
delegation->flags = 1<<NFS_DELEGATION_REFERENCED;
diff --git a/fs/nfs/fscache-index.c b/fs/nfs/fscache-index.c
index 3025fe8584a0..0ee4b93d36ea 100644
--- a/fs/nfs/fscache-index.c
+++ b/fs/nfs/fscache-index.c
@@ -16,6 +16,7 @@
#include <linux/nfs_fs.h>
#include <linux/nfs_fs_sb.h>
#include <linux/in6.h>
+#include <linux/iversion.h>

#include "internal.h"
#include "fscache.h"
@@ -211,7 +212,7 @@ static uint16_t nfs_fscache_inode_get_aux(const void *cookie_netfs_data,
auxdata.ctime = nfsi->vfs_inode.i_ctime;

if (NFS_SERVER(&nfsi->vfs_inode)->nfs_client->rpc_ops->version == 4)
- auxdata.change_attr = nfsi->vfs_inode.i_version;
+ auxdata.change_attr = inode_peek_iversion_raw(&nfsi->vfs_inode);

if (bufmax > sizeof(auxdata))
bufmax = sizeof(auxdata);
@@ -243,7 +244,7 @@ enum fscache_checkaux nfs_fscache_inode_check_aux(void *cookie_netfs_data,
auxdata.ctime = nfsi->vfs_inode.i_ctime;

if (NFS_SERVER(&nfsi->vfs_inode)->nfs_client->rpc_ops->version == 4)
- auxdata.change_attr = nfsi->vfs_inode.i_version;
+ auxdata.change_attr = inode_peek_iversion_raw(&nfsi->vfs_inode);

if (memcmp(data, &auxdata, datalen) != 0)
return FSCACHE_CHECKAUX_OBSOLETE;
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index b992d2382ffa..0b85cca1184b 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -38,8 +38,8 @@
#include <linux/slab.h>
#include <linux/compat.h>
#include <linux/freezer.h>
-
#include <linux/uaccess.h>
+#include <linux/iversion.h>

#include "nfs4_fs.h"
#include "callback.h"
@@ -483,7 +483,7 @@ nfs_fhget(struct super_block *sb, struct nfs_fh *fh, struct nfs_fattr *fattr, st
memset(&inode->i_atime, 0, sizeof(inode->i_atime));
memset(&inode->i_mtime, 0, sizeof(inode->i_mtime));
memset(&inode->i_ctime, 0, sizeof(inode->i_ctime));
- inode->i_version = 0;
+ inode_set_iversion_raw(inode, 0);
inode->i_size = 0;
clear_nlink(inode);
inode->i_uid = make_kuid(&init_user_ns, -2);
@@ -508,7 +508,7 @@ nfs_fhget(struct super_block *sb, struct nfs_fh *fh, struct nfs_fattr *fattr, st
else if (nfs_server_capable(inode, NFS_CAP_CTIME))
nfs_set_cache_invalid(inode, NFS_INO_INVALID_ATTR);
if (fattr->valid & NFS_ATTR_FATTR_CHANGE)
- inode->i_version = fattr->change_attr;
+ inode_set_iversion_raw(inode, fattr->change_attr);
else
nfs_set_cache_invalid(inode, NFS_INO_INVALID_ATTR
| NFS_INO_REVAL_PAGECACHE);
@@ -1289,8 +1289,8 @@ static unsigned long nfs_wcc_update_inode(struct inode *inode, struct nfs_fattr

if ((fattr->valid & NFS_ATTR_FATTR_PRECHANGE)
&& (fattr->valid & NFS_ATTR_FATTR_CHANGE)
- && inode->i_version == fattr->pre_change_attr) {
- inode->i_version = fattr->change_attr;
+ && !inode_cmp_iversion_raw(inode, fattr->pre_change_attr)) {
+ inode_set_iversion_raw(inode, fattr->change_attr);
if (S_ISDIR(inode->i_mode))
nfs_set_cache_invalid(inode, NFS_INO_INVALID_DATA);
ret |= NFS_INO_INVALID_ATTR;
@@ -1348,7 +1348,7 @@ static int nfs_check_inode_attributes(struct inode *inode, struct nfs_fattr *fat

if (!nfs_file_has_buffered_writers(nfsi)) {
/* Verify a few of the more important attributes */
- if ((fattr->valid & NFS_ATTR_FATTR_CHANGE) != 0 && inode->i_version != fattr->change_attr)
+ if ((fattr->valid & NFS_ATTR_FATTR_CHANGE) != 0 && inode_cmp_iversion_raw(inode, fattr->change_attr))
invalid |= NFS_INO_INVALID_ATTR | NFS_INO_REVAL_PAGECACHE;

if ((fattr->valid & NFS_ATTR_FATTR_MTIME) && !timespec_equal(&inode->i_mtime, &fattr->mtime))
@@ -1642,7 +1642,7 @@ int nfs_post_op_update_inode_force_wcc_locked(struct inode *inode, struct nfs_fa
}
if ((fattr->valid & NFS_ATTR_FATTR_CHANGE) != 0 &&
(fattr->valid & NFS_ATTR_FATTR_PRECHANGE) == 0) {
- fattr->pre_change_attr = inode->i_version;
+ fattr->pre_change_attr = inode_peek_iversion_raw(inode);
fattr->valid |= NFS_ATTR_FATTR_PRECHANGE;
}
if ((fattr->valid & NFS_ATTR_FATTR_CTIME) != 0 &&
@@ -1778,7 +1778,7 @@ static int nfs_update_inode(struct inode *inode, struct nfs_fattr *fattr)

/* More cache consistency checks */
if (fattr->valid & NFS_ATTR_FATTR_CHANGE) {
- if (inode->i_version != fattr->change_attr) {
+ if (inode_cmp_iversion_raw(inode, fattr->change_attr)) {
dprintk("NFS: change_attr change on server for file %s/%ld\n",
inode->i_sb->s_id, inode->i_ino);
/* Could it be a race with writeback? */
@@ -1790,7 +1790,7 @@ static int nfs_update_inode(struct inode *inode, struct nfs_fattr *fattr)
if (S_ISDIR(inode->i_mode))
nfs_force_lookup_revalidate(inode);
}
- inode->i_version = fattr->change_attr;
+ inode_set_iversion_raw(inode, fattr->change_attr);
}
} else {
nfsi->cache_validity |= save_cache_validity;
diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index 56fa5a16e097..17a03f2c4330 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -54,6 +54,7 @@
#include <linux/xattr.h>
#include <linux/utsname.h>
#include <linux/freezer.h>
+#include <linux/iversion.h>

#include "nfs4_fs.h"
#include "delegation.h"
@@ -1045,16 +1046,16 @@ static void update_changeattr(struct inode *dir, struct nfs4_change_info *cinfo,

spin_lock(&dir->i_lock);
nfsi->cache_validity |= NFS_INO_INVALID_ATTR|NFS_INO_INVALID_DATA;
- if (cinfo->atomic && cinfo->before == dir->i_version) {
+ if (cinfo->atomic && cinfo->before == inode_peek_iversion_raw(dir)) {
nfsi->cache_validity &= ~NFS_INO_REVAL_PAGECACHE;
nfsi->attrtimeo_timestamp = jiffies;
} else {
nfs_force_lookup_revalidate(dir);
- if (cinfo->before != dir->i_version)
+ if (cinfo->before != inode_peek_iversion_raw(dir))
nfsi->cache_validity |= NFS_INO_INVALID_ACCESS |
NFS_INO_INVALID_ACL;
}
- dir->i_version = cinfo->after;
+ inode_set_iversion_raw(dir, cinfo->after);
nfsi->read_cache_jiffies = timestamp;
nfsi->attr_gencount = nfs_inc_attr_generation_counter();
nfs_fscache_invalidate(dir);
@@ -2454,7 +2455,8 @@ static int _nfs4_proc_open(struct nfs4_opendata *data)
data->file_created = true;
else if (o_res->cinfo.before != o_res->cinfo.after)
data->file_created = true;
- if (data->file_created || dir->i_version != o_res->cinfo.after)
+ if (data->file_created ||
+ inode_peek_iversion_raw(dir) != o_res->cinfo.after)
update_changeattr(dir, &o_res->cinfo,
o_res->f_attr->time_start);
}
diff --git a/fs/nfs/nfstrace.h b/fs/nfs/nfstrace.h
index 093290c42d7c..610d89d8942e 100644
--- a/fs/nfs/nfstrace.h
+++ b/fs/nfs/nfstrace.h
@@ -9,6 +9,7 @@
#define _TRACE_NFS_H

#include <linux/tracepoint.h>
+#include <linux/iversion.h>

#define nfs_show_file_type(ftype) \
__print_symbolic(ftype, \
@@ -61,7 +62,7 @@ DECLARE_EVENT_CLASS(nfs_inode_event,
__entry->dev = inode->i_sb->s_dev;
__entry->fileid = nfsi->fileid;
__entry->fhandle = nfs_fhandle_hash(&nfsi->fh);
- __entry->version = inode->i_version;
+ __entry->version = inode_peek_iversion_raw(inode);
),

TP_printk(
@@ -100,7 +101,7 @@ DECLARE_EVENT_CLASS(nfs_inode_event_done,
__entry->fileid = nfsi->fileid;
__entry->fhandle = nfs_fhandle_hash(&nfsi->fh);
__entry->type = nfs_umode_to_dtype(inode->i_mode);
- __entry->version = inode->i_version;
+ __entry->version = inode_peek_iversion_raw(inode);
__entry->size = i_size_read(inode);
__entry->nfsi_flags = nfsi->flags;
__entry->cache_validity = nfsi->cache_validity;
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 5b5f464f6f2a..a03fbac1f88c 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -23,6 +23,7 @@
#include <linux/export.h>
#include <linux/freezer.h>
#include <linux/wait.h>
+#include <linux/iversion.h>

#include <linux/uaccess.h>

@@ -753,11 +754,8 @@ static void nfs_inode_add_request(struct inode *inode, struct nfs_page *req)
*/
spin_lock(&mapping->private_lock);
if (!nfs_have_writebacks(inode) &&
- NFS_PROTO(inode)->have_delegation(inode, FMODE_WRITE)) {
- spin_lock(&inode->i_lock);
- inode->i_version++;
- spin_unlock(&inode->i_lock);
- }
+ NFS_PROTO(inode)->have_delegation(inode, FMODE_WRITE))
+ inode_inc_iversion(inode);
if (likely(!PageSwapCache(req->wb_page))) {
set_bit(PG_MAPPED, &req->wb_flags);
SetPagePrivate(req->wb_page);
--
2.14.3


2017-12-22 12:06:24

by Jeff Layton

Subject: [PATCH v4 09/19] ext4: convert to new i_version API

From: Jeff Layton <[email protected]>

Signed-off-by: Jeff Layton <[email protected]>
Acked-by: Theodore Ts'o <[email protected]>
---
fs/ext4/dir.c | 9 +++++----
fs/ext4/inline.c | 7 ++++---
fs/ext4/inode.c | 12 ++++++++----
fs/ext4/ioctl.c | 3 ++-
fs/ext4/namei.c | 4 ++--
fs/ext4/super.c | 3 ++-
fs/ext4/xattr.c | 5 +++--
7 files changed, 26 insertions(+), 17 deletions(-)

diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c
index d5babc9f222b..afda0a0499ce 100644
--- a/fs/ext4/dir.c
+++ b/fs/ext4/dir.c
@@ -25,6 +25,7 @@
#include <linux/fs.h>
#include <linux/buffer_head.h>
#include <linux/slab.h>
+#include <linux/iversion.h>
#include "ext4.h"
#include "xattr.h"

@@ -208,7 +209,7 @@ static int ext4_readdir(struct file *file, struct dir_context *ctx)
* readdir(2), then we might be pointing to an invalid
* dirent right now. Scan from the start of the block
* to make sure. */
- if (file->f_version != inode->i_version) {
+ if (inode_cmp_iversion(inode, file->f_version)) {
for (i = 0; i < sb->s_blocksize && i < offset; ) {
de = (struct ext4_dir_entry_2 *)
(bh->b_data + i);
@@ -227,7 +228,7 @@ static int ext4_readdir(struct file *file, struct dir_context *ctx)
offset = i;
ctx->pos = (ctx->pos & ~(sb->s_blocksize - 1))
| offset;
- file->f_version = inode->i_version;
+ file->f_version = inode_query_iversion(inode);
}

while (ctx->pos < inode->i_size
@@ -568,10 +569,10 @@ static int ext4_dx_readdir(struct file *file, struct dir_context *ctx)
* cached entries.
*/
if ((!info->curr_node) ||
- (file->f_version != inode->i_version)) {
+ inode_cmp_iversion(inode, file->f_version)) {
info->curr_node = NULL;
free_rb_tree_fname(&info->root);
- file->f_version = inode->i_version;
+ file->f_version = inode_query_iversion(inode);
ret = ext4_htree_fill_tree(file, info->curr_hash,
info->curr_minor_hash,
&info->next_hash);
diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c
index 1367553c43bb..a8b987b71173 100644
--- a/fs/ext4/inline.c
+++ b/fs/ext4/inline.c
@@ -14,6 +14,7 @@

#include <linux/iomap.h>
#include <linux/fiemap.h>
+#include <linux/iversion.h>

#include "ext4_jbd2.h"
#include "ext4.h"
@@ -1042,7 +1043,7 @@ static int ext4_add_dirent_to_inline(handle_t *handle,
*/
dir->i_mtime = dir->i_ctime = current_time(dir);
ext4_update_dx_flag(dir);
- dir->i_version++;
+ inode_inc_iversion(dir);
return 1;
}

@@ -1494,7 +1495,7 @@ int ext4_read_inline_dir(struct file *file,
* dirent right now. Scan from the start of the inline
* dir to make sure.
*/
- if (file->f_version != inode->i_version) {
+ if (inode_cmp_iversion(inode, file->f_version)) {
for (i = 0; i < extra_size && i < offset;) {
/*
* "." is with offset 0 and
@@ -1526,7 +1527,7 @@ int ext4_read_inline_dir(struct file *file,
}
offset = i;
ctx->pos = offset;
- file->f_version = inode->i_version;
+ file->f_version = inode_query_iversion(inode);
}

while (ctx->pos < extra_size) {
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index fa5d8bc52d2d..1b0d54b372f2 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4874,12 +4874,14 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
EXT4_EINODE_GET_XTIME(i_crtime, ei, raw_inode);

if (likely(!test_opt2(inode->i_sb, HURD_COMPAT))) {
- inode->i_version = le32_to_cpu(raw_inode->i_disk_version);
+ u64 ivers = le32_to_cpu(raw_inode->i_disk_version);
+
if (EXT4_INODE_SIZE(inode->i_sb) > EXT4_GOOD_OLD_INODE_SIZE) {
if (EXT4_FITS_IN_INODE(raw_inode, ei, i_version_hi))
- inode->i_version |=
+ ivers |=
(__u64)(le32_to_cpu(raw_inode->i_version_hi)) << 32;
}
+ inode_set_iversion_queried(inode, ivers);
}

ret = 0;
@@ -5165,11 +5167,13 @@ static int ext4_do_update_inode(handle_t *handle,
}

if (likely(!test_opt2(inode->i_sb, HURD_COMPAT))) {
- raw_inode->i_disk_version = cpu_to_le32(inode->i_version);
+ u64 ivers = inode_peek_iversion(inode);
+
+ raw_inode->i_disk_version = cpu_to_le32(ivers);
if (ei->i_extra_isize) {
if (EXT4_FITS_IN_INODE(raw_inode, ei, i_version_hi))
raw_inode->i_version_hi =
- cpu_to_le32(inode->i_version >> 32);
+ cpu_to_le32(ivers >> 32);
raw_inode->i_extra_isize =
cpu_to_le16(ei->i_extra_isize);
}
diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index 1eec25014f62..7e99ad02f1ba 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -19,6 +19,7 @@
#include <linux/uuid.h>
#include <linux/uaccess.h>
#include <linux/delay.h>
+#include <linux/iversion.h>
#include "ext4_jbd2.h"
#include "ext4.h"
#include <linux/fsmap.h>
@@ -144,7 +145,7 @@ static long swap_inode_boot_loader(struct super_block *sb,
i_gid_write(inode_bl, 0);
inode_bl->i_flags = 0;
ei_bl->i_flags = 0;
- inode_bl->i_version = 1;
+ inode_set_iversion(inode_bl, 1);
i_size_write(inode_bl, 0);
inode_bl->i_mode = S_IFREG;
if (ext4_has_feature_extents(sb)) {
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index bcf0dff517be..55f6e38de5ba 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2956,7 +2956,7 @@ static int ext4_rmdir(struct inode *dir, struct dentry *dentry)
"empty directory '%.*s' has too many links (%u)",
dentry->d_name.len, dentry->d_name.name,
inode->i_nlink);
- inode->i_version++;
+ inode_inc_iversion(inode);
clear_nlink(inode);
/* There's no need to set i_disksize: the fact that i_nlink is
* zero will ensure that the right thing happens during any
@@ -3362,7 +3362,7 @@ static int ext4_setent(handle_t *handle, struct ext4_renament *ent,
ent->de->inode = cpu_to_le32(ino);
if (ext4_has_feature_filetype(ent->dir->i_sb))
ent->de->file_type = file_type;
- ent->dir->i_version++;
+ inode_inc_iversion(ent->dir);
ent->dir->i_ctime = ent->dir->i_mtime =
current_time(ent->dir);
ext4_mark_inode_dirty(handle, ent->dir);
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 7c46693a14d7..5de959fb0244 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -40,6 +40,7 @@
#include <linux/dax.h>
#include <linux/cleancache.h>
#include <linux/uaccess.h>
+#include <linux/iversion.h>

#include <linux/kthread.h>
#include <linux/freezer.h>
@@ -967,7 +968,7 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
if (!ei)
return NULL;

- ei->vfs_inode.i_version = 1;
+ inode_set_iversion(&ei->vfs_inode, 1);
spin_lock_init(&ei->i_raw_lock);
INIT_LIST_HEAD(&ei->i_prealloc_list);
spin_lock_init(&ei->i_prealloc_lock);
diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index 218a7ba57819..ba6fd5439aa4 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -56,6 +56,7 @@
#include <linux/slab.h>
#include <linux/mbcache.h>
#include <linux/quotaops.h>
+#include <linux/iversion.h>
#include "ext4_jbd2.h"
#include "ext4.h"
#include "xattr.h"
@@ -294,13 +295,13 @@ ext4_xattr_inode_hash(struct ext4_sb_info *sbi, const void *buffer, size_t size)
static u64 ext4_xattr_inode_get_ref(struct inode *ea_inode)
{
return ((u64)ea_inode->i_ctime.tv_sec << 32) |
- ((u32)ea_inode->i_version);
+ (u32) inode_peek_iversion(ea_inode);
}

static void ext4_xattr_inode_set_ref(struct inode *ea_inode, u64 ref_count)
{
ea_inode->i_ctime.tv_sec = (u32)(ref_count >> 32);
- ea_inode->i_version = (u32)ref_count;
+ inode_set_iversion(ea_inode, ref_count & 0xffffffff);
}

static u32 ext4_xattr_inode_get_hash(struct inode *ea_inode)
--
2.14.3
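
Editorial note on the ext4 hunks above: they build the 64-bit counter from two 32-bit on-disk fields and split it again on writeback. A minimal user-space sketch of that split and join follows. The struct and function names are hypothetical, and the little-endian conversion done by le32_to_cpu/cpu_to_le32 in the real code is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two on-disk fields touched by the patch. */
struct raw_inode_sketch {
	uint32_t i_disk_version;	/* low 32 bits of the counter */
	uint32_t i_version_hi;		/* high 32 bits, large inodes only */
};

/* Join the halves, as the ext4_iget hunk does before setting i_version. */
static uint64_t load_iversion(const struct raw_inode_sketch *raw, bool has_hi)
{
	uint64_t ivers = raw->i_disk_version;

	if (has_hi)
		ivers |= (uint64_t)raw->i_version_hi << 32;
	return ivers;
}

/* Split the counter, as the ext4_do_update_inode hunk does on writeback. */
static void store_iversion(struct raw_inode_sketch *raw, uint64_t ivers,
			   bool has_hi)
{
	raw->i_disk_version = (uint32_t)ivers;
	if (has_hi)
		raw->i_version_hi = (uint32_t)(ivers >> 32);
}

/* Usage: a round trip keeps all 64 bits only when the high field exists. */
static void split_selftest(void)
{
	struct raw_inode_sketch raw = { 0, 0 };

	store_iversion(&raw, 0x100000002ULL, true);
	assert(raw.i_disk_version == 2 && raw.i_version_hi == 1);
	assert(load_iversion(&raw, true) == 0x100000002ULL);
	/* Without the high field only the low half is read back. */
	assert(load_iversion(&raw, false) == 2);
}
```

Reading the value once into a local (ivers in the patch) means both halves come from the same sample of the counter.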


2017-12-22 12:06:22

by Jeff Layton

Subject: [PATCH v4 08/19] ext2: convert to new i_version API

From: Jeff Layton <[email protected]>

Signed-off-by: Jeff Layton <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
---
fs/ext2/dir.c | 9 +++++----
fs/ext2/super.c | 5 +++--
2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/fs/ext2/dir.c b/fs/ext2/dir.c
index 987647986f47..4111085a129f 100644
--- a/fs/ext2/dir.c
+++ b/fs/ext2/dir.c
@@ -26,6 +26,7 @@
#include <linux/buffer_head.h>
#include <linux/pagemap.h>
#include <linux/swap.h>
+#include <linux/iversion.h>

typedef struct ext2_dir_entry_2 ext2_dirent;

@@ -92,7 +93,7 @@ static int ext2_commit_chunk(struct page *page, loff_t pos, unsigned len)
struct inode *dir = mapping->host;
int err = 0;

- dir->i_version++;
+ inode_inc_iversion(dir);
block_write_end(NULL, mapping, pos, len, len, page, NULL);

if (pos+len > dir->i_size) {
@@ -293,7 +294,7 @@ ext2_readdir(struct file *file, struct dir_context *ctx)
unsigned long npages = dir_pages(inode);
unsigned chunk_mask = ~(ext2_chunk_size(inode)-1);
unsigned char *types = NULL;
- int need_revalidate = file->f_version != inode->i_version;
+ bool need_revalidate = inode_cmp_iversion(inode, file->f_version);

if (pos > inode->i_size - EXT2_DIR_REC_LEN(1))
return 0;
@@ -319,8 +320,8 @@ ext2_readdir(struct file *file, struct dir_context *ctx)
offset = ext2_validate_entry(kaddr, offset, chunk_mask);
ctx->pos = (n<<PAGE_SHIFT) + offset;
}
- file->f_version = inode->i_version;
- need_revalidate = 0;
+ file->f_version = inode_query_iversion(inode);
+ need_revalidate = false;
}
de = (ext2_dirent *)(kaddr+offset);
limit = kaddr + ext2_last_byte(inode, n) - EXT2_DIR_REC_LEN(1);
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index 7646818ab266..554c98b8a93a 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -33,6 +33,7 @@
#include <linux/quotaops.h>
#include <linux/uaccess.h>
#include <linux/dax.h>
+#include <linux/iversion.h>
#include "ext2.h"
#include "xattr.h"
#include "acl.h"
@@ -184,7 +185,7 @@ static struct inode *ext2_alloc_inode(struct super_block *sb)
if (!ei)
return NULL;
ei->i_block_alloc_info = NULL;
- ei->vfs_inode.i_version = 1;
+ inode_set_iversion(&ei->vfs_inode, 1);
#ifdef CONFIG_QUOTA
memset(&ei->i_dquot, 0, sizeof(ei->i_dquot));
#endif
@@ -1569,7 +1570,7 @@ static ssize_t ext2_quota_write(struct super_block *sb, int type,
return err;
if (inode->i_size < off+len-towrite)
i_size_write(inode, off+len-towrite);
- inode->i_version++;
+ inode_inc_iversion(inode);
inode->i_mtime = inode->i_ctime = current_time(inode);
mark_inode_dirty(inode);
return len - towrite;
--
2.14.3
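
Editorial note on the ext2_readdir hunks above: the file caches the directory's version in f_version and revalidates its position only when the directory has changed since. The user-space sketch below models that pattern. All names are hypothetical, and the signed-difference comparison is an assumption about how a wraparound-tolerant compare might look, not a copy of inode_cmp_iversion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct dir_sketch  { uint64_t version; };	/* stands in for i_version */
struct file_sketch { uint64_t f_version; };	/* stands in for f_version */

/* Zero when nothing changed since `old` was sampled, nonzero otherwise. */
static int64_t dir_cmp_version(const struct dir_sketch *dir, uint64_t old)
{
	return (int64_t)(dir->version - old);
}

/* Returns true if the caller had to revalidate its cached position. */
static bool readdir_step(struct file_sketch *file, const struct dir_sketch *dir)
{
	bool need_revalidate = dir_cmp_version(dir, file->f_version) != 0;

	if (need_revalidate) {
		/* A real filesystem would re-check its offset here. */
		file->f_version = dir->version;
	}
	return need_revalidate;
}

/* Usage: one revalidation per directory change, none in between. */
static void readdir_selftest(void)
{
	struct dir_sketch dir = { .version = 1 };
	struct file_sketch file = { .f_version = 1 };

	assert(!readdir_step(&file, &dir));	/* unchanged directory */
	dir.version++;				/* e.g. an entry was added */
	assert(readdir_step(&file, &dir));	/* revalidate once */
	assert(!readdir_step(&file, &dir));	/* cached version matches again */
}
```

The patch samples the counter with inode_query_iversion when storing it in f_version, which tells the counter that someone is now watching it; the sketch does not model that.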


2017-12-22 12:06:11

by Jeff Layton

Subject: [PATCH v4 04/19] affs: convert to new i_version API

From: Jeff Layton <[email protected]>

Signed-off-by: Jeff Layton <[email protected]>
---
fs/affs/amigaffs.c | 5 +++--
fs/affs/dir.c | 5 +++--
fs/affs/super.c | 3 ++-
3 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/fs/affs/amigaffs.c b/fs/affs/amigaffs.c
index 0f0e6925e97d..14a6c1b90c9f 100644
--- a/fs/affs/amigaffs.c
+++ b/fs/affs/amigaffs.c
@@ -10,6 +10,7 @@
*/

#include <linux/math64.h>
+#include <linux/iversion.h>
#include "affs.h"

/*
@@ -60,7 +61,7 @@ affs_insert_hash(struct inode *dir, struct buffer_head *bh)
affs_brelse(dir_bh);

dir->i_mtime = dir->i_ctime = current_time(dir);
- dir->i_version++;
+ inode_inc_iversion(dir);
mark_inode_dirty(dir);

return 0;
@@ -114,7 +115,7 @@ affs_remove_hash(struct inode *dir, struct buffer_head *rem_bh)
affs_brelse(bh);

dir->i_mtime = dir->i_ctime = current_time(dir);
- dir->i_version++;
+ inode_inc_iversion(dir);
mark_inode_dirty(dir);

return retval;
diff --git a/fs/affs/dir.c b/fs/affs/dir.c
index a105e77df2c1..d180b46453cf 100644
--- a/fs/affs/dir.c
+++ b/fs/affs/dir.c
@@ -14,6 +14,7 @@
*
*/

+#include <linux/iversion.h>
#include "affs.h"

static int affs_readdir(struct file *, struct dir_context *);
@@ -80,7 +81,7 @@ affs_readdir(struct file *file, struct dir_context *ctx)
* we can jump directly to where we left off.
*/
ino = (u32)(long)file->private_data;
- if (ino && file->f_version == inode->i_version) {
+ if (ino && inode_cmp_iversion(inode, file->f_version) == 0) {
pr_debug("readdir() left off=%d\n", ino);
goto inside;
}
@@ -130,7 +131,7 @@ affs_readdir(struct file *file, struct dir_context *ctx)
} while (ino);
}
done:
- file->f_version = inode->i_version;
+ file->f_version = inode_query_iversion(inode);
file->private_data = (void *)(long)ino;
affs_brelse(fh_bh);

diff --git a/fs/affs/super.c b/fs/affs/super.c
index 1117e36134cc..e602619aed9d 100644
--- a/fs/affs/super.c
+++ b/fs/affs/super.c
@@ -21,6 +21,7 @@
#include <linux/writeback.h>
#include <linux/blkdev.h>
#include <linux/seq_file.h>
+#include <linux/iversion.h>
#include "affs.h"

static int affs_statfs(struct dentry *dentry, struct kstatfs *buf);
@@ -102,7 +103,7 @@ static struct inode *affs_alloc_inode(struct super_block *sb)
if (!i)
return NULL;

- i->vfs_inode.i_version = 1;
+ inode_set_iversion(&i->vfs_inode, 1);
i->i_lc = NULL;
i->i_ext_bh = NULL;
i->i_pa_cnt = 0;
--
2.14.3


2017-12-22 12:06:20

by Jeff Layton

Subject: [PATCH v4 07/19] exofs: switch to new i_version API

From: Jeff Layton <[email protected]>

Signed-off-by: Jeff Layton <[email protected]>
---
fs/exofs/dir.c | 9 +++++----
fs/exofs/super.c | 3 ++-
2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/fs/exofs/dir.c b/fs/exofs/dir.c
index 98233a97b7b8..c5a53fcc43ea 100644
--- a/fs/exofs/dir.c
+++ b/fs/exofs/dir.c
@@ -31,6 +31,7 @@
* Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
*/

+#include <linux/iversion.h>
#include "exofs.h"

static inline unsigned exofs_chunk_size(struct inode *inode)
@@ -60,7 +61,7 @@ static int exofs_commit_chunk(struct page *page, loff_t pos, unsigned len)
struct inode *dir = mapping->host;
int err = 0;

- dir->i_version++;
+ inode_inc_iversion(dir);

if (!PageUptodate(page))
SetPageUptodate(page);
@@ -241,7 +242,7 @@ exofs_readdir(struct file *file, struct dir_context *ctx)
unsigned long n = pos >> PAGE_SHIFT;
unsigned long npages = dir_pages(inode);
unsigned chunk_mask = ~(exofs_chunk_size(inode)-1);
- int need_revalidate = (file->f_version != inode->i_version);
+ bool need_revalidate = inode_cmp_iversion(inode, file->f_version);

if (pos > inode->i_size - EXOFS_DIR_REC_LEN(1))
return 0;
@@ -264,8 +265,8 @@ exofs_readdir(struct file *file, struct dir_context *ctx)
chunk_mask);
ctx->pos = (n<<PAGE_SHIFT) + offset;
}
- file->f_version = inode->i_version;
- need_revalidate = 0;
+ file->f_version = inode_query_iversion(inode);
+ need_revalidate = false;
}
de = (struct exofs_dir_entry *)(kaddr + offset);
limit = kaddr + exofs_last_byte(inode, n) -
diff --git a/fs/exofs/super.c b/fs/exofs/super.c
index 819624cfc8da..7e244093c0e5 100644
--- a/fs/exofs/super.c
+++ b/fs/exofs/super.c
@@ -38,6 +38,7 @@
#include <linux/module.h>
#include <linux/exportfs.h>
#include <linux/slab.h>
+#include <linux/iversion.h>

#include "exofs.h"

@@ -159,7 +160,7 @@ static struct inode *exofs_alloc_inode(struct super_block *sb)
if (!oi)
return NULL;

- oi->vfs_inode.i_version = 1;
+ inode_set_iversion(&oi->vfs_inode, 1);
return &oi->vfs_inode;
}

--
2.14.3


2017-12-22 12:06:10

by Jeff Layton

Subject: [PATCH v4 02/19] fs: don't take the i_lock in inode_inc_iversion

From: Jeff Layton <[email protected]>

The rationale for taking the i_lock when incrementing this value is
lost in antiquity. The readers of the field don't take it (at least
not universally), so my assumption is that it was only done here to
serialize incrementors.

If that is indeed the case, then we can drop the i_lock from this
codepath and treat it as an atomic64_t for the purposes of
incrementing it. This allows us to use inode_inc_iversion without
any danger of lock inversion.

Note that the read side is not fetched atomically with this change.
The assumption here is that that is not a critical issue since the
i_version is not fully synchronized with anything else anyway.

Signed-off-by: Jeff Layton <[email protected]>
---
include/linux/iversion.h | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/include/linux/iversion.h b/include/linux/iversion.h
index bb50d27c71f9..e08c634779df 100644
--- a/include/linux/iversion.h
+++ b/include/linux/iversion.h
@@ -104,12 +104,13 @@ inode_set_iversion_queried(struct inode *inode, const u64 new)
static inline bool
inode_maybe_inc_iversion(struct inode *inode, bool force)
{
- spin_lock(&inode->i_lock);
- inode->i_version++;
- spin_unlock(&inode->i_lock);
+ atomic64_t *ivp = (atomic64_t *)&inode->i_version;
+
+ atomic64_inc(ivp);
return true;
}

+
/**
* inode_inc_iversion - forcibly increment i_version
* @inode: inode that needs to be updated
--
2.14.3
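
Editorial note on patch 02 above: the commit message argues that the i_lock only serialized incrementors, so an atomic increment can replace it. The sketch below is a user-space C11 analogue using <stdatomic.h> rather than the kernel's atomic64_t. The type and function names are hypothetical. Unlike the patch, which leaves the read side as a plain load, this sketch also reads atomically.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the one field of struct inode that matters here. */
struct inode_sketch {
	_Atomic uint64_t i_version;
};

/*
 * Lock-free increment: concurrent callers each add exactly one, so no
 * update is lost even though no lock is held.
 */
static void sketch_inc_iversion(struct inode_sketch *inode)
{
	atomic_fetch_add_explicit(&inode->i_version, 1, memory_order_relaxed);
}

/* Atomic read of the counter (the patch itself uses a plain load). */
static uint64_t sketch_peek_iversion(struct inode_sketch *inode)
{
	return atomic_load_explicit(&inode->i_version, memory_order_relaxed);
}

/* Usage: two increments move the counter from 1 to 3. */
static void inc_selftest(void)
{
	struct inode_sketch inode;

	atomic_init(&inode.i_version, 1);
	sketch_inc_iversion(&inode);
	sketch_inc_iversion(&inode);
	assert(sketch_peek_iversion(&inode) == 3);
}
```

Relaxed ordering is chosen here because the commit message states that i_version is not synchronized with anything else; the cover letter notes that memory barriers are added in a later patch of the series.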


2017-12-22 12:06:16

by Jeff Layton

Subject: [PATCH v4 06/19] btrfs: convert to new i_version API

From: Jeff Layton <[email protected]>

Signed-off-by: Jeff Layton <[email protected]>
---
fs/btrfs/delayed-inode.c | 7 +++++--
fs/btrfs/inode.c | 6 ++++--
fs/btrfs/tree-log.c | 4 +++-
3 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index 5d73f79ded8b..6a246ae2bcb2 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -18,6 +18,7 @@
*/

#include <linux/slab.h>
+#include <linux/iversion.h>
#include "delayed-inode.h"
#include "disk-io.h"
#include "transaction.h"
@@ -1700,7 +1701,8 @@ static void fill_stack_inode_item(struct btrfs_trans_handle *trans,
btrfs_set_stack_inode_nbytes(inode_item, inode_get_bytes(inode));
btrfs_set_stack_inode_generation(inode_item,
BTRFS_I(inode)->generation);
- btrfs_set_stack_inode_sequence(inode_item, inode->i_version);
+ btrfs_set_stack_inode_sequence(inode_item,
+ inode_peek_iversion(inode));
btrfs_set_stack_inode_transid(inode_item, trans->transid);
btrfs_set_stack_inode_rdev(inode_item, inode->i_rdev);
btrfs_set_stack_inode_flags(inode_item, BTRFS_I(inode)->flags);
@@ -1754,7 +1756,8 @@ int btrfs_fill_inode(struct inode *inode, u32 *rdev)
BTRFS_I(inode)->generation = btrfs_stack_inode_generation(inode_item);
BTRFS_I(inode)->last_trans = btrfs_stack_inode_transid(inode_item);

- inode->i_version = btrfs_stack_inode_sequence(inode_item);
+ inode_set_iversion_queried(inode,
+ btrfs_stack_inode_sequence(inode_item));
inode->i_rdev = 0;
*rdev = btrfs_stack_inode_rdev(inode_item);
BTRFS_I(inode)->flags = btrfs_stack_inode_flags(inode_item);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 27f008b33fc1..ac8692849a81 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3778,7 +3778,8 @@ static int btrfs_read_locked_inode(struct inode *inode)
BTRFS_I(inode)->generation = btrfs_inode_generation(leaf, inode_item);
BTRFS_I(inode)->last_trans = btrfs_inode_transid(leaf, inode_item);

- inode->i_version = btrfs_inode_sequence(leaf, inode_item);
+ inode_set_iversion_queried(inode,
+ btrfs_inode_sequence(leaf, inode_item));
inode->i_generation = BTRFS_I(inode)->generation;
inode->i_rdev = 0;
rdev = btrfs_inode_rdev(leaf, inode_item);
@@ -3946,7 +3947,8 @@ static void fill_inode_item(struct btrfs_trans_handle *trans,
&token);
btrfs_set_token_inode_generation(leaf, item, BTRFS_I(inode)->generation,
&token);
- btrfs_set_token_inode_sequence(leaf, item, inode->i_version, &token);
+ btrfs_set_token_inode_sequence(leaf, item, inode_peek_iversion(inode),
+ &token);
btrfs_set_token_inode_transid(leaf, item, trans->transid, &token);
btrfs_set_token_inode_rdev(leaf, item, inode->i_rdev, &token);
btrfs_set_token_inode_flags(leaf, item, BTRFS_I(inode)->flags, &token);
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 7bf9b31561db..1b7d92075c1f 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -20,6 +20,7 @@
#include <linux/slab.h>
#include <linux/blkdev.h>
#include <linux/list_sort.h>
+#include <linux/iversion.h>
#include "tree-log.h"
#include "disk-io.h"
#include "locking.h"
@@ -3609,7 +3610,8 @@ static void fill_inode_item(struct btrfs_trans_handle *trans,
btrfs_set_token_inode_nbytes(leaf, item, inode_get_bytes(inode),
&token);

- btrfs_set_token_inode_sequence(leaf, item, inode->i_version, &token);
+ btrfs_set_token_inode_sequence(leaf, item,
+ inode_peek_iversion(inode), &token);
btrfs_set_token_inode_transid(leaf, item, trans->transid, &token);
btrfs_set_token_inode_rdev(leaf, item, inode->i_rdev, &token);
btrfs_set_token_inode_flags(leaf, item, BTRFS_I(inode)->flags, &token);
--
2.14.3


2017-12-22 12:06:14

by Jeff Layton

Subject: [PATCH v4 05/19] afs: convert to new i_version API

From: Jeff Layton <[email protected]>

For AFS, it's generally treated as an opaque value, so we use the
*_raw variants of the API here.

Note that AFS has quite a different definition for this counter. AFS
only increments it on changes to the data, not for the metadata. We'll
need to reconcile that somehow if we ever want to present this to
userspace via statx.

Signed-off-by: Jeff Layton <[email protected]>
---
fs/afs/fsclient.c | 3 ++-
fs/afs/inode.c | 5 +++--
2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/fs/afs/fsclient.c b/fs/afs/fsclient.c
index b90ef39ae914..88ec38c2d83c 100644
--- a/fs/afs/fsclient.c
+++ b/fs/afs/fsclient.c
@@ -13,6 +13,7 @@
#include <linux/slab.h>
#include <linux/sched.h>
#include <linux/circ_buf.h>
+#include <linux/iversion.h>
#include "internal.h"
#include "afs_fs.h"

@@ -124,7 +125,7 @@ static void xdr_decode_AFSFetchStatus(const __be32 **_bp,
vnode->vfs_inode.i_ctime.tv_sec = status->mtime_client;
vnode->vfs_inode.i_mtime = vnode->vfs_inode.i_ctime;
vnode->vfs_inode.i_atime = vnode->vfs_inode.i_ctime;
- vnode->vfs_inode.i_version = data_version;
+ inode_set_iversion_raw(&vnode->vfs_inode, data_version);
}

expected_version = status->data_version;
diff --git a/fs/afs/inode.c b/fs/afs/inode.c
index 3415eb7484f6..dcd2e08d6cdb 100644
--- a/fs/afs/inode.c
+++ b/fs/afs/inode.c
@@ -21,6 +21,7 @@
#include <linux/sched.h>
#include <linux/mount.h>
#include <linux/namei.h>
+#include <linux/iversion.h>
#include "internal.h"

static const struct inode_operations afs_symlink_inode_operations = {
@@ -89,7 +90,7 @@ static int afs_inode_map_status(struct afs_vnode *vnode, struct key *key)
inode->i_atime = inode->i_mtime = inode->i_ctime;
inode->i_blocks = 0;
inode->i_generation = vnode->fid.unique;
- inode->i_version = vnode->status.data_version;
+ inode_set_iversion_raw(inode, vnode->status.data_version);
inode->i_mapping->a_ops = &afs_fs_aops;

read_sequnlock_excl(&vnode->cb_lock);
@@ -218,7 +219,7 @@ struct inode *afs_iget_autocell(struct inode *dir, const char *dev_name,
inode->i_ctime.tv_nsec = 0;
inode->i_atime = inode->i_mtime = inode->i_ctime;
inode->i_blocks = 0;
- inode->i_version = 0;
+ inode_set_iversion_raw(inode, 0);
inode->i_generation = 0;

set_bit(AFS_VNODE_PSEUDODIR, &vnode->flags);
--
2.14.3
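
Editorial note on the *_raw variants used for AFS above: at this point in the series the raw and kernel-managed accessors behave identically, so the distinction only matters once the backend changes. The sketch below assumes, purely for illustration, that the kernel-managed accessors reserve the low bit of the field as a "queried" flag, in line with the cover letter's idea of skipping increments nobody has looked at. All names and the bit layout are hypothetical, and the code is single-threaded with no atomics.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_QUERIED	((uint64_t)1)	/* assumed flag bit */
#define SKETCH_SHIFT	1		/* assumed shift of the counter */

struct inode_sketch { uint64_t i_version; };

/* Raw accessors: the field holds an opaque server value, untouched. */
static void set_raw(struct inode_sketch *i, uint64_t v) { i->i_version = v; }
static uint64_t peek_raw(const struct inode_sketch *i) { return i->i_version; }

/* Kernel-managed accessors: the counter lives above the flag bit. */
static void set_managed(struct inode_sketch *i, uint64_t v)
{
	i->i_version = v << SKETCH_SHIFT;
}

static uint64_t peek_managed(const struct inode_sketch *i)
{
	return i->i_version >> SKETCH_SHIFT;
}

/* Read the counter and record that an observer has seen this value. */
static uint64_t query_managed(struct inode_sketch *i)
{
	i->i_version |= SKETCH_QUERIED;
	return i->i_version >> SKETCH_SHIFT;
}

/* Increment only if forced or if the current value has been observed. */
static bool maybe_inc(struct inode_sketch *i, bool force)
{
	if (!force && !(i->i_version & SKETCH_QUERIED))
		return false;
	i->i_version = (i->i_version & ~SKETCH_QUERIED) +
		       ((uint64_t)1 << SKETCH_SHIFT);
	return true;
}

/* Usage: unobserved changes are free, observed ones bump the counter. */
static void queried_selftest(void)
{
	struct inode_sketch managed, afs;

	set_managed(&managed, 1);
	assert(!maybe_inc(&managed, false));	/* nobody looked yet */
	assert(query_managed(&managed) == 1);	/* now someone has */
	assert(maybe_inc(&managed, false));	/* so the change must show */
	assert(peek_managed(&managed) == 2);

	set_raw(&afs, 0xdeadbeefULL);		/* opaque server value */
	assert(peek_raw(&afs) == 0xdeadbeefULL);
}
```

Under that assumption, a filesystem that stores a server-provided value needs the raw accessors so that no bit of the value is reinterpreted as a flag.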


2017-12-22 12:06:09

by Jeff Layton

Subject: [PATCH v4 03/19] fat: convert to new i_version API

From: Jeff Layton <[email protected]>

Signed-off-by: Jeff Layton <[email protected]>
---
fs/fat/dir.c | 3 ++-
fs/fat/inode.c | 9 +++++----
fs/fat/namei_msdos.c | 7 ++++---
fs/fat/namei_vfat.c | 22 +++++++++++-----------
4 files changed, 22 insertions(+), 19 deletions(-)

diff --git a/fs/fat/dir.c b/fs/fat/dir.c
index b833ffeee1e1..8e100c3bf72c 100644
--- a/fs/fat/dir.c
+++ b/fs/fat/dir.c
@@ -16,6 +16,7 @@
#include <linux/slab.h>
#include <linux/compat.h>
#include <linux/uaccess.h>
+#include <linux/iversion.h>
#include "fat.h"

/*
@@ -1055,7 +1056,7 @@ int fat_remove_entries(struct inode *dir, struct fat_slot_info *sinfo)
brelse(bh);
if (err)
return err;
- dir->i_version++;
+ inode_inc_iversion(dir);

if (nr_slots) {
/*
diff --git a/fs/fat/inode.c b/fs/fat/inode.c
index 20a0a89eaca5..ffbbf0520d9e 100644
--- a/fs/fat/inode.c
+++ b/fs/fat/inode.c
@@ -20,6 +20,7 @@
#include <linux/blkdev.h>
#include <linux/backing-dev.h>
#include <asm/unaligned.h>
+#include <linux/iversion.h>
#include "fat.h"

#ifndef CONFIG_FAT_DEFAULT_IOCHARSET
@@ -507,7 +508,7 @@ int fat_fill_inode(struct inode *inode, struct msdos_dir_entry *de)
MSDOS_I(inode)->i_pos = 0;
inode->i_uid = sbi->options.fs_uid;
inode->i_gid = sbi->options.fs_gid;
- inode->i_version++;
+ inode_inc_iversion(inode);
inode->i_generation = get_seconds();

if ((de->attr & ATTR_DIR) && !IS_FREE(de->name)) {
@@ -590,7 +591,7 @@ struct inode *fat_build_inode(struct super_block *sb,
goto out;
}
inode->i_ino = iunique(sb, MSDOS_ROOT_INO);
- inode->i_version = 1;
+ inode_set_iversion(inode, 1);
err = fat_fill_inode(inode, de);
if (err) {
iput(inode);
@@ -1377,7 +1378,7 @@ static int fat_read_root(struct inode *inode)
MSDOS_I(inode)->i_pos = MSDOS_ROOT_INO;
inode->i_uid = sbi->options.fs_uid;
inode->i_gid = sbi->options.fs_gid;
- inode->i_version++;
+ inode_inc_iversion(inode);
inode->i_generation = 0;
inode->i_mode = fat_make_mode(sbi, ATTR_DIR, S_IRWXUGO);
inode->i_op = sbi->dir_ops;
@@ -1828,7 +1829,7 @@ int fat_fill_super(struct super_block *sb, void *data, int silent, int isvfat,
if (!root_inode)
goto out_fail;
root_inode->i_ino = MSDOS_ROOT_INO;
- root_inode->i_version = 1;
+ inode_set_iversion(root_inode, 1);
error = fat_read_root(root_inode);
if (error < 0) {
iput(root_inode);
diff --git a/fs/fat/namei_msdos.c b/fs/fat/namei_msdos.c
index d24d2758a363..582ca731a6c9 100644
--- a/fs/fat/namei_msdos.c
+++ b/fs/fat/namei_msdos.c
@@ -7,6 +7,7 @@
*/

#include <linux/module.h>
+#include <linux/iversion.h>
#include "fat.h"

/* Characters that are undesirable in an MS-DOS file name */
@@ -480,7 +481,7 @@ static int do_msdos_rename(struct inode *old_dir, unsigned char *old_name,
} else
mark_inode_dirty(old_inode);

- old_dir->i_version++;
+ inode_inc_iversion(old_dir);
old_dir->i_ctime = old_dir->i_mtime = current_time(old_dir);
if (IS_DIRSYNC(old_dir))
(void)fat_sync_inode(old_dir);
@@ -508,7 +509,7 @@ static int do_msdos_rename(struct inode *old_dir, unsigned char *old_name,
goto out;
new_i_pos = sinfo.i_pos;
}
- new_dir->i_version++;
+ inode_inc_iversion(new_dir);

fat_detach(old_inode);
fat_attach(old_inode, new_i_pos);
@@ -540,7 +541,7 @@ static int do_msdos_rename(struct inode *old_dir, unsigned char *old_name,
old_sinfo.bh = NULL;
if (err)
goto error_dotdot;
- old_dir->i_version++;
+ inode_inc_iversion(old_dir);
old_dir->i_ctime = old_dir->i_mtime = ts;
if (IS_DIRSYNC(old_dir))
(void)fat_sync_inode(old_dir);
diff --git a/fs/fat/namei_vfat.c b/fs/fat/namei_vfat.c
index 02c066663a3a..cefea792cde8 100644
--- a/fs/fat/namei_vfat.c
+++ b/fs/fat/namei_vfat.c
@@ -20,7 +20,7 @@
#include <linux/slab.h>
#include <linux/namei.h>
#include <linux/kernel.h>
-
+#include <linux/iversion.h>
#include "fat.h"

static inline unsigned long vfat_d_version(struct dentry *dentry)
@@ -46,7 +46,7 @@ static int vfat_revalidate_shortname(struct dentry *dentry)
{
int ret = 1;
spin_lock(&dentry->d_lock);
- if (vfat_d_version(dentry) != d_inode(dentry->d_parent)->i_version)
+ if (inode_cmp_iversion(d_inode(dentry->d_parent), vfat_d_version(dentry)))
ret = 0;
spin_unlock(&dentry->d_lock);
return ret;
@@ -759,7 +759,7 @@ static struct dentry *vfat_lookup(struct inode *dir, struct dentry *dentry,
out:
mutex_unlock(&MSDOS_SB(sb)->s_lock);
if (!inode)
- vfat_d_version_set(dentry, dir->i_version);
+ vfat_d_version_set(dentry, inode_query_iversion(dir));
return d_splice_alias(inode, dentry);
error:
mutex_unlock(&MSDOS_SB(sb)->s_lock);
@@ -781,7 +781,7 @@ static int vfat_create(struct inode *dir, struct dentry *dentry, umode_t mode,
err = vfat_add_entry(dir, &dentry->d_name, 0, 0, &ts, &sinfo);
if (err)
goto out;
- dir->i_version++;
+ inode_inc_iversion(dir);

inode = fat_build_inode(sb, sinfo.de, sinfo.i_pos);
brelse(sinfo.bh);
@@ -789,7 +789,7 @@ static int vfat_create(struct inode *dir, struct dentry *dentry, umode_t mode,
err = PTR_ERR(inode);
goto out;
}
- inode->i_version++;
+ inode_inc_iversion(inode);
inode->i_mtime = inode->i_atime = inode->i_ctime = ts;
/* timestamp is already written, so mark_inode_dirty() is unneeded. */

@@ -823,7 +823,7 @@ static int vfat_rmdir(struct inode *dir, struct dentry *dentry)
clear_nlink(inode);
inode->i_mtime = inode->i_atime = current_time(inode);
fat_detach(inode);
- vfat_d_version_set(dentry, dir->i_version);
+ vfat_d_version_set(dentry, inode_query_iversion(dir));
out:
mutex_unlock(&MSDOS_SB(sb)->s_lock);

@@ -849,7 +849,7 @@ static int vfat_unlink(struct inode *dir, struct dentry *dentry)
clear_nlink(inode);
inode->i_mtime = inode->i_atime = current_time(inode);
fat_detach(inode);
- vfat_d_version_set(dentry, dir->i_version);
+ vfat_d_version_set(dentry, inode_query_iversion(dir));
out:
mutex_unlock(&MSDOS_SB(sb)->s_lock);

@@ -875,7 +875,7 @@ static int vfat_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
err = vfat_add_entry(dir, &dentry->d_name, 1, cluster, &ts, &sinfo);
if (err)
goto out_free;
- dir->i_version++;
+ inode_inc_iversion(dir);
inc_nlink(dir);

inode = fat_build_inode(sb, sinfo.de, sinfo.i_pos);
@@ -885,7 +885,7 @@ static int vfat_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
/* the directory was completed, just return a error */
goto out;
}
- inode->i_version++;
+ inode_inc_iversion(inode);
set_nlink(inode, 2);
inode->i_mtime = inode->i_atime = inode->i_ctime = ts;
/* timestamp is already written, so mark_inode_dirty() is unneeded. */
@@ -951,7 +951,7 @@ static int vfat_rename(struct inode *old_dir, struct dentry *old_dentry,
goto out;
new_i_pos = sinfo.i_pos;
}
- new_dir->i_version++;
+ inode_inc_iversion(new_dir);

fat_detach(old_inode);
fat_attach(old_inode, new_i_pos);
@@ -979,7 +979,7 @@ static int vfat_rename(struct inode *old_dir, struct dentry *old_dentry,
old_sinfo.bh = NULL;
if (err)
goto error_dotdot;
- old_dir->i_version++;
+ inode_inc_iversion(old_dir);
old_dir->i_ctime = old_dir->i_mtime = ts;
if (IS_DIRSYNC(old_dir))
(void)fat_sync_inode(old_dir);
--
2.14.3


2017-12-22 23:14:39

by NeilBrown

Subject: Re: [PATCH v4 01/19] fs: new API for handling inode->i_version

On Fri, Dec 22 2017, Jeff Layton wrote:

> From: Jeff Layton <[email protected]>
>
> Add a documentation blob that explains what the i_version field is, how
> it is expected to work, and how it is currently implemented by various
> filesystems.
>
> We already have inode_inc_iversion. Add several other functions for
> manipulating and accessing the i_version counter. For now, the
> implementation is trivial and basically works the way that all of the
> open-coded i_version accesses work today.
>
> Future patches will convert existing users of i_version to use the new
> API, and then convert the backend implementation to do things more
> efficiently.
>
> Signed-off-by: Jeff Layton <[email protected]>
> ---
> fs/btrfs/file.c | 1 +
> fs/btrfs/inode.c | 1 +
> fs/btrfs/ioctl.c | 1 +
> fs/btrfs/xattr.c | 1 +
> fs/ext4/inode.c | 1 +
> fs/ext4/namei.c | 1 +
> fs/inode.c | 1 +
> include/linux/fs.h | 15 ----
> include/linux/iversion.h | 205 +++++++++++++++++++++++++++++++++++++++++++++++
> 9 files changed, 212 insertions(+), 15 deletions(-)
> create mode 100644 include/linux/iversion.h
>
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index eb1bac7c8553..c95d7b2efefb 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -31,6 +31,7 @@
> #include <linux/slab.h>
> #include <linux/btrfs.h>
> #include <linux/uio.h>
> +#include <linux/iversion.h>
> #include "ctree.h"
> #include "disk-io.h"
> #include "transaction.h"
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index e1a7f3cb5be9..27f008b33fc1 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -43,6 +43,7 @@
> #include <linux/posix_acl_xattr.h>
> #include <linux/uio.h>
> #include <linux/magic.h>
> +#include <linux/iversion.h>
> #include "ctree.h"
> #include "disk-io.h"
> #include "transaction.h"
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index 2ef8acaac688..aa452c9e2eff 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -43,6 +43,7 @@
> #include <linux/uuid.h>
> #include <linux/btrfs.h>
> #include <linux/uaccess.h>
> +#include <linux/iversion.h>
> #include "ctree.h"
> #include "disk-io.h"
> #include "transaction.h"
> diff --git a/fs/btrfs/xattr.c b/fs/btrfs/xattr.c
> index 2c7e53f9ff1b..5258c1714830 100644
> --- a/fs/btrfs/xattr.c
> +++ b/fs/btrfs/xattr.c
> @@ -23,6 +23,7 @@
> #include <linux/xattr.h>
> #include <linux/security.h>
> #include <linux/posix_acl_xattr.h>
> +#include <linux/iversion.h>
> #include "ctree.h"
> #include "btrfs_inode.h"
> #include "transaction.h"
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 7df2c5644e59..fa5d8bc52d2d 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -39,6 +39,7 @@
> #include <linux/slab.h>
> #include <linux/bitops.h>
> #include <linux/iomap.h>
> +#include <linux/iversion.h>
>
> #include "ext4_jbd2.h"
> #include "xattr.h"
> diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
> index 798b3ac680db..bcf0dff517be 100644
> --- a/fs/ext4/namei.c
> +++ b/fs/ext4/namei.c
> @@ -34,6 +34,7 @@
> #include <linux/quotaops.h>
> #include <linux/buffer_head.h>
> #include <linux/bio.h>
> +#include <linux/iversion.h>
> #include "ext4.h"
> #include "ext4_jbd2.h"
>
> diff --git a/fs/inode.c b/fs/inode.c
> index 03102d6ef044..19e72f500f71 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -18,6 +18,7 @@
> #include <linux/buffer_head.h> /* for inode_has_buffers */
> #include <linux/ratelimit.h>
> #include <linux/list_lru.h>
> +#include <linux/iversion.h>
> #include <trace/events/writeback.h>
> #include "internal.h"
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 511fbaabf624..76382c24e9d0 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2036,21 +2036,6 @@ static inline void inode_dec_link_count(struct inode *inode)
> mark_inode_dirty(inode);
> }
>
> -/**
> - * inode_inc_iversion - increments i_version
> - * @inode: inode that need to be updated
> - *
> - * Every time the inode is modified, the i_version field will be incremented.
> - * The filesystem has to be mounted with i_version flag
> - */
> -
> -static inline void inode_inc_iversion(struct inode *inode)
> -{
> - spin_lock(&inode->i_lock);
> - inode->i_version++;
> - spin_unlock(&inode->i_lock);
> -}
> -
> enum file_time_flags {
> S_ATIME = 1,
> S_MTIME = 2,
> diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> new file mode 100644
> index 000000000000..bb50d27c71f9
> --- /dev/null
> +++ b/include/linux/iversion.h
> @@ -0,0 +1,205 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_IVERSION_H
> +#define _LINUX_IVERSION_H
> +
> +#include <linux/fs.h>
> +
> +/*
> + * The change attribute (i_version) is mandated by NFSv4 and is mostly for
> + * knfsd, but is also used for other purposes (e.g. IMA). The i_version must
> + * appear different to observers if there was a change to the inode's data or
> + * metadata since it was last queried.
> + *
> + * It should be considered an opaque value by observers. If it remains the same
^^^^^^^^^^^^

You keep using that word ... I don't think it means what you think it
means.
Change that sentence to:

Observers see i_version as a 64-bit number which never decreases.

and the rest still makes perfect sense.

Thanks,
NeilBrown


> + * since it was last checked, then nothing has changed in the inode. If it's
> + * different then something has changed. Observers cannot infer anything about
> + * the nature or magnitude of the changes from the value, only that the inode
> + * has changed in some fashion.
> + *
> + * Not all filesystems properly implement the i_version counter. Subsystems that
> + * want to use i_version field on an inode should first check whether the
> + * filesystem sets the SB_I_VERSION flag (usually via the IS_I_VERSION macro).
> + *
> + * Those that set SB_I_VERSION will automatically have their i_version counter
> + * incremented on writes to normal files. If the SB_I_VERSION is not set, then
> + * the VFS will not touch it on writes, and the filesystem can use it how it
> + * wishes. Note that the filesystem is always responsible for updating the
> + * i_version on namespace changes in directories (mkdir, rmdir, unlink, etc.).
> + * We consider these sorts of filesystems to have a kernel-managed i_version.
> + *
> + * Note that some filesystems (e.g. NFS and AFS) just use the field to store
> + * a server-provided value (for the most part). For that reason, those
> + * filesystems do not set SB_I_VERSION. These filesystems are considered to
> + * have a self-managed i_version.
> + */
> +
> +/**
> + * inode_set_iversion_raw - set i_version to the specified raw value
> + * @inode: inode to set
> + * @new: new i_version value to set
> + *
> + * Set @inode's i_version field to @new. This function is for use by
> + * filesystems that self-manage the i_version.
> + *
> + * For example, the NFS client stores its NFSv4 change attribute in this way,
> + * and the AFS client stores the data_version from the server here.
> + */
> +static inline void
> +inode_set_iversion_raw(struct inode *inode, const u64 new)
> +{
> + inode->i_version = new;
> +}
> +
> +/**
> + * inode_set_iversion - set i_version to a particular value
> + * @inode: inode to set
> + * @new: new i_version value to set
> + *
> + * Set @inode's i_version field to @new. This function is for filesystems with
> + * a kernel-managed i_version.
> + *
> + * For now, this just does the same thing as the _raw variant.
> + */
> +static inline void
> +inode_set_iversion(struct inode *inode, const u64 new)
> +{
> + inode_set_iversion_raw(inode, new);
> +}
> +
> +/**
> + * inode_set_iversion_queried - set i_version to a particular value and set
> + * flag to indicate that it has been viewed
> + * @inode: inode to set
> + * @new: new i_version value to set
> + *
> + * When loading in an i_version value from a backing store, we typically don't
> + * know whether it was previously viewed before being stored or not. Thus, we
> + * must assume that it was, to ensure that any changes will result in the
> + * value changing.
> + *
> + * This function will set the inode's i_version, and possibly flag the value
> + * as if it has already been viewed at least once.
> + *
> + * For now, this just does what inode_set_iversion does.
> + */
> +static inline void
> +inode_set_iversion_queried(struct inode *inode, const u64 new)
> +{
> + inode_set_iversion(inode, new);
> +}
> +
> +/**
> + * inode_maybe_inc_iversion - increments i_version
> + * @inode: inode with the i_version that should be updated
> + * @force: increment the counter even if it's not necessary
> + *
> + * Every time the inode is modified, the i_version field must be seen to have
> + * changed by any observer.
> + *
> + * In this implementation, we always increment it after taking the i_lock to
> + * ensure that we don't race with other incrementors.
> + *
> + * Returns true if counter was bumped, and false if it wasn't.
> + */
> +static inline bool
> +inode_maybe_inc_iversion(struct inode *inode, bool force)
> +{
> + spin_lock(&inode->i_lock);
> + inode->i_version++;
> + spin_unlock(&inode->i_lock);
> + return true;
> +}
> +
> +/**
> + * inode_inc_iversion - forcibly increment i_version
> + * @inode: inode that needs to be updated
> + *
> + * Forcibly increment the i_version field. This always results in a change to
> + * the observable value.
> + */
> +static inline void
> +inode_inc_iversion(struct inode *inode)
> +{
> + inode_maybe_inc_iversion(inode, true);
> +}
> +
> +/**
> + * inode_iversion_need_inc - is the i_version in need of being incremented?
> + * @inode: inode to check
> + *
> + * Returns whether the inode->i_version counter needs incrementing on the next
> + * change.
> + *
> + * For now, we assume that it always does.
> + */
> +static inline bool
> +inode_iversion_need_inc(struct inode *inode)
> +{
> + return true;
> +}
> +
> +/**
> + * inode_peek_iversion_raw - grab a "raw" iversion value
> + * @inode: inode from which i_version should be read
> + *
> + * Grab a "raw" inode->i_version value and return it. The i_version is not
> + * flagged or converted in any way. This is mostly used to access a self-managed
> + * i_version.
> + *
> + * With those filesystems, we want to treat the i_version as an entirely
> + * opaque value.
> + */
> +static inline u64
> +inode_peek_iversion_raw(const struct inode *inode)
> +{
> + return inode->i_version;
> +}
> +
> +/**
> + * inode_peek_iversion - read i_version without flagging it to be incremented
> + * @inode: inode from which i_version should be read
> + *
> + * Read the inode i_version counter for an inode without registering it as a
> + * query.
> + *
> + * This is typically used by local filesystems that need to store an i_version
> + * on disk. In that situation, it's not necessary to flag it as having been
> + * viewed, as the result won't be used to gauge changes from that point.
> + */
> +static inline u64
> +inode_peek_iversion(const struct inode *inode)
> +{
> + return inode_peek_iversion_raw(inode);
> +}
> +
> +/**
> + * inode_query_iversion - read i_version for later use
> + * @inode: inode from which i_version should be read
> + *
> + * Read the inode i_version counter. This should be used by callers that wish
> + * to store the returned i_version for later comparison. This will guarantee
> + * that a later query of the i_version will result in a different value if
> + * anything has changed.
> + *
> + * This implementation just does a peek.
> + */
> +static inline u64
> +inode_query_iversion(struct inode *inode)
> +{
> + return inode_peek_iversion(inode);
> +}
> +
> +/**
> + * inode_cmp_iversion - check whether the i_version counter has changed
> + * @inode: inode to check
> + * @old: old value to check against its i_version
> + *
> + * Compare an i_version counter with a previous one. Returns 0 if they are
> + * the same or non-zero if they are different.
> + */
> +static inline s64
> +inode_cmp_iversion(const struct inode *inode, const u64 old)
> +{
> + return (s64)inode_peek_iversion(inode) - (s64)old;
> +}
> +#endif
> --
> 2.14.3
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html



2017-12-22 23:55:02

by Jeff Layton

Subject: Re: [PATCH v4 01/19] fs: new API for handling inode->i_version

On Sat, 2017-12-23 at 10:14 +1100, NeilBrown wrote:
> On Fri, Dec 22 2017, Jeff Layton wrote:
>
> > From: Jeff Layton <[email protected]>
> >
> > Add a documentation blob that explains what the i_version field is, how
> > it is expected to work, and how it is currently implemented by various
> > filesystems.
> >
> > We already have inode_inc_iversion. Add several other functions for
> > manipulating and accessing the i_version counter. For now, the
> > implementation is trivial and basically works the way that all of the
> > open-coded i_version accesses work today.
> >
> > Future patches will convert existing users of i_version to use the new
> > API, and then convert the backend implementation to do things more
> > efficiently.
> >
> > Signed-off-by: Jeff Layton <[email protected]>
> > ---
> > fs/btrfs/file.c | 1 +
> > fs/btrfs/inode.c | 1 +
> > fs/btrfs/ioctl.c | 1 +
> > fs/btrfs/xattr.c | 1 +
> > fs/ext4/inode.c | 1 +
> > fs/ext4/namei.c | 1 +
> > fs/inode.c | 1 +
> > include/linux/fs.h | 15 ----
> > include/linux/iversion.h | 205 +++++++++++++++++++++++++++++++++++++++++++++++
> > 9 files changed, 212 insertions(+), 15 deletions(-)
> > create mode 100644 include/linux/iversion.h
> >
> > diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> > index eb1bac7c8553..c95d7b2efefb 100644
> > --- a/fs/btrfs/file.c
> > +++ b/fs/btrfs/file.c
> > @@ -31,6 +31,7 @@
> > #include <linux/slab.h>
> > #include <linux/btrfs.h>
> > #include <linux/uio.h>
> > +#include <linux/iversion.h>
> > #include "ctree.h"
> > #include "disk-io.h"
> > #include "transaction.h"
> > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > index e1a7f3cb5be9..27f008b33fc1 100644
> > --- a/fs/btrfs/inode.c
> > +++ b/fs/btrfs/inode.c
> > @@ -43,6 +43,7 @@
> > #include <linux/posix_acl_xattr.h>
> > #include <linux/uio.h>
> > #include <linux/magic.h>
> > +#include <linux/iversion.h>
> > #include "ctree.h"
> > #include "disk-io.h"
> > #include "transaction.h"
> > diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> > index 2ef8acaac688..aa452c9e2eff 100644
> > --- a/fs/btrfs/ioctl.c
> > +++ b/fs/btrfs/ioctl.c
> > @@ -43,6 +43,7 @@
> > #include <linux/uuid.h>
> > #include <linux/btrfs.h>
> > #include <linux/uaccess.h>
> > +#include <linux/iversion.h>
> > #include "ctree.h"
> > #include "disk-io.h"
> > #include "transaction.h"
> > diff --git a/fs/btrfs/xattr.c b/fs/btrfs/xattr.c
> > index 2c7e53f9ff1b..5258c1714830 100644
> > --- a/fs/btrfs/xattr.c
> > +++ b/fs/btrfs/xattr.c
> > @@ -23,6 +23,7 @@
> > #include <linux/xattr.h>
> > #include <linux/security.h>
> > #include <linux/posix_acl_xattr.h>
> > +#include <linux/iversion.h>
> > #include "ctree.h"
> > #include "btrfs_inode.h"
> > #include "transaction.h"
> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > index 7df2c5644e59..fa5d8bc52d2d 100644
> > --- a/fs/ext4/inode.c
> > +++ b/fs/ext4/inode.c
> > @@ -39,6 +39,7 @@
> > #include <linux/slab.h>
> > #include <linux/bitops.h>
> > #include <linux/iomap.h>
> > +#include <linux/iversion.h>
> >
> > #include "ext4_jbd2.h"
> > #include "xattr.h"
> > diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
> > index 798b3ac680db..bcf0dff517be 100644
> > --- a/fs/ext4/namei.c
> > +++ b/fs/ext4/namei.c
> > @@ -34,6 +34,7 @@
> > #include <linux/quotaops.h>
> > #include <linux/buffer_head.h>
> > #include <linux/bio.h>
> > +#include <linux/iversion.h>
> > #include "ext4.h"
> > #include "ext4_jbd2.h"
> >
> > diff --git a/fs/inode.c b/fs/inode.c
> > index 03102d6ef044..19e72f500f71 100644
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -18,6 +18,7 @@
> > #include <linux/buffer_head.h> /* for inode_has_buffers */
> > #include <linux/ratelimit.h>
> > #include <linux/list_lru.h>
> > +#include <linux/iversion.h>
> > #include <trace/events/writeback.h>
> > #include "internal.h"
> >
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 511fbaabf624..76382c24e9d0 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -2036,21 +2036,6 @@ static inline void inode_dec_link_count(struct inode *inode)
> > mark_inode_dirty(inode);
> > }
> >
> > -/**
> > - * inode_inc_iversion - increments i_version
> > - * @inode: inode that need to be updated
> > - *
> > - * Every time the inode is modified, the i_version field will be incremented.
> > - * The filesystem has to be mounted with i_version flag
> > - */
> > -
> > -static inline void inode_inc_iversion(struct inode *inode)
> > -{
> > - spin_lock(&inode->i_lock);
> > - inode->i_version++;
> > - spin_unlock(&inode->i_lock);
> > -}
> > -
> > enum file_time_flags {
> > S_ATIME = 1,
> > S_MTIME = 2,
> > diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> > new file mode 100644
> > index 000000000000..bb50d27c71f9
> > --- /dev/null
> > +++ b/include/linux/iversion.h
> > @@ -0,0 +1,205 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +#ifndef _LINUX_IVERSION_H
> > +#define _LINUX_IVERSION_H
> > +
> > +#include <linux/fs.h>
> > +
> > +/*
> > + * The change attribute (i_version) is mandated by NFSv4 and is mostly for
> > + * knfsd, but is also used for other purposes (e.g. IMA). The i_version must
> > + * appear different to observers if there was a change to the inode's data or
> > + * metadata since it was last queried.
> > + *
> > + * It should be considered an opaque value by observers. If it remains the same
>
> ^^^^^^^^^^^^
>
> You keep using that word ... I don't think it means what you think it
> means.
> Change that sentence to:
>
> Observers see i_version as a 64-bit number which never decreases.
>
> and the rest still makes perfect sense.
>

Thanks! Fixed in my tree. I'll not resend the set just for that though.

> > + * since it was last checked, then nothing has changed in the inode. If it's
> > + * different then something has changed. Observers cannot infer anything about
> > + * the nature or magnitude of the changes from the value, only that the inode
> > + * has changed in some fashion.
> > + *
> > + * Not all filesystems properly implement the i_version counter. Subsystems that
> > + * want to use i_version field on an inode should first check whether the
> > + * filesystem sets the SB_I_VERSION flag (usually via the IS_I_VERSION macro).
> > + *
> > + * Those that set SB_I_VERSION will automatically have their i_version counter
> > + * incremented on writes to normal files. If the SB_I_VERSION is not set, then
> > + * the VFS will not touch it on writes, and the filesystem can use it how it
> > + * wishes. Note that the filesystem is always responsible for updating the
> > + * i_version on namespace changes in directories (mkdir, rmdir, unlink, etc.).
> > + * We consider these sorts of filesystems to have a kernel-managed i_version.
> > + *
> > + * Note that some filesystems (e.g. NFS and AFS) just use the field to store
> > + * a server-provided value (for the most part). For that reason, those
> > + * filesystems do not set SB_I_VERSION. These filesystems are considered to
> > + * have a self-managed i_version.
> > + */
> > +
> > +/**
> > + * inode_set_iversion_raw - set i_version to the specified raw value
> > + * @inode: inode to set
> > + * @new: new i_version value to set
> > + *
> > + * Set @inode's i_version field to @new. This function is for use by
> > + * filesystems that self-manage the i_version.
> > + *
> > + * For example, the NFS client stores its NFSv4 change attribute in this way,
> > + * and the AFS client stores the data_version from the server here.
> > + */
> > +static inline void
> > +inode_set_iversion_raw(struct inode *inode, const u64 new)
> > +{
> > + inode->i_version = new;
> > +}
> > +
> > +/**
> > + * inode_set_iversion - set i_version to a particular value
> > + * @inode: inode to set
> > + * @new: new i_version value to set
> > + *
> > + * Set @inode's i_version field to @new. This function is for filesystems with
> > + * a kernel-managed i_version.
> > + *
> > + * For now, this just does the same thing as the _raw variant.
> > + */
> > +static inline void
> > +inode_set_iversion(struct inode *inode, const u64 new)
> > +{
> > + inode_set_iversion_raw(inode, new);
> > +}
> > +
> > +/**
> > + * inode_set_iversion_queried - set i_version to a particular value and set
> > + * flag to indicate that it has been viewed
> > + * @inode: inode to set
> > + * @new: new i_version value to set
> > + *
> > + * When loading in an i_version value from a backing store, we typically don't
> > + * know whether it was previously viewed before being stored or not. Thus, we
> > + * must assume that it was, to ensure that any changes will result in the
> > + * value changing.
> > + *
> > + * This function will set the inode's i_version, and possibly flag the value
> > + * as if it has already been viewed at least once.
> > + *
> > + * For now, this just does what inode_set_iversion does.
> > + */
> > +static inline void
> > +inode_set_iversion_queried(struct inode *inode, const u64 new)
> > +{
> > + inode_set_iversion(inode, new);
> > +}
> > +
> > +/**
> > + * inode_maybe_inc_iversion - increments i_version
> > + * @inode: inode with the i_version that should be updated
> > + * @force: increment the counter even if it's not necessary
> > + *
> > + * Every time the inode is modified, the i_version field must be seen to have
> > + * changed by any observer.
> > + *
> > + * In this implementation, we always increment it after taking the i_lock to
> > + * ensure that we don't race with other incrementors.
> > + *
> > + * Returns true if counter was bumped, and false if it wasn't.
> > + */
> > +static inline bool
> > +inode_maybe_inc_iversion(struct inode *inode, bool force)
> > +{
> > + spin_lock(&inode->i_lock);
> > + inode->i_version++;
> > + spin_unlock(&inode->i_lock);
> > + return true;
> > +}
> > +
> > +/**
> > + * inode_inc_iversion - forcibly increment i_version
> > + * @inode: inode that needs to be updated
> > + *
> > + * Forcibly increment the i_version field. This always results in a change to
> > + * the observable value.
> > + */
> > +static inline void
> > +inode_inc_iversion(struct inode *inode)
> > +{
> > + inode_maybe_inc_iversion(inode, true);
> > +}
> > +
> > +/**
> > + * inode_iversion_need_inc - is the i_version in need of being incremented?
> > + * @inode: inode to check
> > + *
> > + * Returns whether the inode->i_version counter needs incrementing on the next
> > + * change.
> > + *
> > + * For now, we assume that it always does.
> > + */
> > +static inline bool
> > +inode_iversion_need_inc(struct inode *inode)
> > +{
> > + return true;
> > +}
> > +
> > +/**
> > + * inode_peek_iversion_raw - grab a "raw" iversion value
> > + * @inode: inode from which i_version should be read
> > + *
> > + * Grab a "raw" inode->i_version value and return it. The i_version is not
> > + * flagged or converted in any way. This is mostly used to access a self-managed
> > + * i_version.
> > + *
> > + * With those filesystems, we want to treat the i_version as an entirely
> > + * opaque value.
> > + */
> > +static inline u64
> > +inode_peek_iversion_raw(const struct inode *inode)
> > +{
> > + return inode->i_version;
> > +}
> > +
> > +/**
> > + * inode_peek_iversion - read i_version without flagging it to be incremented
> > + * @inode: inode from which i_version should be read
> > + *
> > + * Read the inode i_version counter for an inode without registering it as a
> > + * query.
> > + *
> > + * This is typically used by local filesystems that need to store an i_version
> > + * on disk. In that situation, it's not necessary to flag it as having been
> > + * viewed, as the result won't be used to gauge changes from that point.
> > + */
> > +static inline u64
> > +inode_peek_iversion(const struct inode *inode)
> > +{
> > + return inode_peek_iversion_raw(inode);
> > +}
> > +
> > +/**
> > + * inode_query_iversion - read i_version for later use
> > + * @inode: inode from which i_version should be read
> > + *
> > + * Read the inode i_version counter. This should be used by callers that wish
> > + * to store the returned i_version for later comparison. This will guarantee
> > + * that a later query of the i_version will result in a different value if
> > + * anything has changed.
> > + *
> > + * This implementation just does a peek.
> > + */
> > +static inline u64
> > +inode_query_iversion(struct inode *inode)
> > +{
> > + return inode_peek_iversion(inode);
> > +}
> > +
> > +/**
> > + * inode_cmp_iversion - check whether the i_version counter has changed
> > + * @inode: inode to check
> > + * @old: old value to check against its i_version
> > + *
> > + * Compare an i_version counter with a previous one. Returns 0 if they are
> > + * the same or non-zero if they are different.
> > + */
> > +static inline s64
> > +inode_cmp_iversion(const struct inode *inode, const u64 old)
> > +{
> > + return (s64)inode_peek_iversion(inode) - (s64)old;
> > +}
> > +#endif
> > --
> > 2.14.3
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> > the body of a message to [email protected]
> > More majordomo info at http://vger.kernel.org/majordomo-info.html

--
Jeff Layton <[email protected]>
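
For readers following the "never decreases" point and the inode_cmp_iversion() helper quoted above, here is a minimal userspace sketch of a wraparound-tolerant comparison. It is not the code from the patch: the name model_cmp_iversion is hypothetical, and it subtracts as unsigned before converting to a signed type (the patch casts each operand to s64 first) so that the arithmetic stays well-defined in portable C. It assumes the usual two's-complement conversion to int64_t.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical userspace model of a wraparound-tolerant comparison.
 * Returns 0 if the values match, > 0 if @cur is "newer" than @old,
 * and < 0 if it is "older", even when the counter has wrapped.
 */
static inline int64_t model_cmp_iversion(uint64_t cur, uint64_t old)
{
	/* Unsigned subtraction is modulo 2^64; the cast reinterprets it. */
	return (int64_t)(cur - old);
}
```

With this shape, a counter that has wrapped from UINT64_MAX to 1 still compares as newer than the stored old value, as long as the two values are less than 2^63 apart.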

2017-12-23 00:16:03

by Darrick J. Wong

Subject: Re: [PATCH v4 17/19] xfs: avoid setting XFS_ILOG_CORE if i_version doesn't need incrementing

On Fri, Dec 22, 2017 at 07:05:54AM -0500, Jeff Layton wrote:
> From: Jeff Layton <[email protected]>
>
> If XFS_ILOG_CORE is already set, then go ahead and increment the
> i_version counter unconditionally.
>
> Signed-off-by: Jeff Layton <[email protected]>

Looks ok,
Acked-by: Darrick J. Wong <[email protected]>

> ---
> fs/xfs/xfs_trans_inode.c | 14 ++++++++------
> 1 file changed, 8 insertions(+), 6 deletions(-)
>
> diff --git a/fs/xfs/xfs_trans_inode.c b/fs/xfs/xfs_trans_inode.c
> index 225544327c4f..4a89da4b6fe7 100644
> --- a/fs/xfs/xfs_trans_inode.c
> +++ b/fs/xfs/xfs_trans_inode.c
> @@ -112,15 +112,17 @@ xfs_trans_log_inode(
>
> /*
> * First time we log the inode in a transaction, bump the inode change
> - * counter if it is configured for this to occur. We don't use
> - * inode_inc_version() because there is no need for extra locking around
> - * i_version as we already hold the inode locked exclusively for
> - * metadata modification.
> + * counter if it is configured for this to occur. While we have the
> + * inode locked exclusively for metadata modification, we can usually
> + * avoid setting XFS_ILOG_CORE if no one has queried the value since
> + * the last time it was incremented. If we have XFS_ILOG_CORE already
> + * set however, then go ahead and bump the i_version counter
> + * unconditionally.
> */
> if (!(ip->i_itemp->ili_item.li_desc->lid_flags & XFS_LID_DIRTY) &&
> IS_I_VERSION(VFS_I(ip))) {
> - inode_inc_iversion(VFS_I(ip));
> - flags |= XFS_ILOG_CORE;
> + if (inode_maybe_inc_iversion(VFS_I(ip), flags & XFS_ILOG_CORE))
> + flags |= XFS_ILOG_CORE;
> }
>
> tp->t_flags |= XFS_TRANS_DIRTY;
> --
> 2.14.3
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2017-12-23 00:16:49

by Darrick J. Wong

Subject: Re: [PATCH v4 14/19] xfs: convert to new i_version API

On Fri, Dec 22, 2017 at 07:05:51AM -0500, Jeff Layton wrote:
> From: Jeff Layton <[email protected]>
>
> Signed-off-by: Jeff Layton <[email protected]>
> ---
> fs/xfs/libxfs/xfs_inode_buf.c | 7 +++++--
> fs/xfs/xfs_icache.c | 5 +++--
> fs/xfs/xfs_inode.c | 3 ++-
> fs/xfs/xfs_inode_item.c | 3 ++-
> fs/xfs/xfs_trans_inode.c | 4 +++-
> 5 files changed, 15 insertions(+), 7 deletions(-)
>
> diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
> index 6b7989038d75..b9c0bf80669c 100644
> --- a/fs/xfs/libxfs/xfs_inode_buf.c
> +++ b/fs/xfs/libxfs/xfs_inode_buf.c
> @@ -32,6 +32,8 @@
> #include "xfs_ialloc.h"
> #include "xfs_dir2.h"
>
> +#include <linux/iversion.h>

/me wonders if these ought to be in fs/xfs/xfs_linux.h since this is
libxfs, but seeing as I already let that horse escape I might as well
clean it up separately.

Looks ok,
Acked-by: Darrick J. Wong <[email protected]>

--D
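
The load/store pairing this patch relies on (inode_set_iversion_queried() when reading from disk, inode_peek_iversion() when writing back) can be sketched in userspace as follows. This is only an illustration: it assumes the flag-bit layout that patch 19/19 of the series introduces, it is single-threaded with no atomics, and the model_* names are hypothetical stand-ins rather than the kernel helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Layout assumed from patch 19/19: bit 0 is the QUERIED flag. */
#define I_VERSION_QUERIED_SHIFT 1
#define I_VERSION_QUERIED   (1ULL << (I_VERSION_QUERIED_SHIFT - 1))
#define I_VERSION_INCREMENT (1ULL << I_VERSION_QUERIED_SHIFT)

/* Hypothetical stand-in for struct inode; holds the raw (shifted) value. */
struct model_inode {
	uint64_t i_version;
};

/* Load from disk: we cannot know whether it was viewed, so assume it was. */
static void model_set_iversion_queried(struct model_inode *inode, uint64_t val)
{
	inode->i_version = (val << I_VERSION_QUERIED_SHIFT) | I_VERSION_QUERIED;
}

/* Store to disk: read the counter without setting the QUERIED flag. */
static uint64_t model_peek_iversion(const struct model_inode *inode)
{
	return inode->i_version >> I_VERSION_QUERIED_SHIFT;
}

/* Bump only if forced or if the value was queried since the last bump. */
static bool model_maybe_inc_iversion(struct model_inode *inode, bool force)
{
	if (!force && !(inode->i_version & I_VERSION_QUERIED))
		return false;
	inode->i_version = (inode->i_version & ~I_VERSION_QUERIED) +
			   I_VERSION_INCREMENT;
	return true;
}
```

Because the loaded value is flagged as queried, the first change after a load always produces a new counter value. Peeking for writeback leaves the flag alone, so later unobserved changes can still skip the increment.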

> +
> /*
> * Check that none of the inode's in the buffer have a next
> * unlinked field of 0.
> @@ -264,7 +266,8 @@ xfs_inode_from_disk(
> to->di_flags = be16_to_cpu(from->di_flags);
>
> if (to->di_version == 3) {
> - inode->i_version = be64_to_cpu(from->di_changecount);
> + inode_set_iversion_queried(inode,
> + be64_to_cpu(from->di_changecount));
> to->di_crtime.t_sec = be32_to_cpu(from->di_crtime.t_sec);
> to->di_crtime.t_nsec = be32_to_cpu(from->di_crtime.t_nsec);
> to->di_flags2 = be64_to_cpu(from->di_flags2);
> @@ -314,7 +317,7 @@ xfs_inode_to_disk(
> to->di_flags = cpu_to_be16(from->di_flags);
>
> if (from->di_version == 3) {
> - to->di_changecount = cpu_to_be64(inode->i_version);
> + to->di_changecount = cpu_to_be64(inode_peek_iversion(inode));
> to->di_crtime.t_sec = cpu_to_be32(from->di_crtime.t_sec);
> to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.t_nsec);
> to->di_flags2 = cpu_to_be64(from->di_flags2);
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 43005fbe8b1e..4c315adb05e6 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -37,6 +37,7 @@
>
> #include <linux/kthread.h>
> #include <linux/freezer.h>
> +#include <linux/iversion.h>
>
> /*
> * Allocate and initialise an xfs_inode.
> @@ -293,14 +294,14 @@ xfs_reinit_inode(
> int error;
> uint32_t nlink = inode->i_nlink;
> uint32_t generation = inode->i_generation;
> - uint64_t version = inode->i_version;
> + uint64_t version = inode_peek_iversion(inode);
> umode_t mode = inode->i_mode;
>
> error = inode_init_always(mp->m_super, inode);
>
> set_nlink(inode, nlink);
> inode->i_generation = generation;
> - inode->i_version = version;
> + inode_set_iversion_queried(inode, version);
> inode->i_mode = mode;
> return error;
> }
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 801274126648..dfc5e60d8af3 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -16,6 +16,7 @@
> * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
> */
> #include <linux/log2.h>
> +#include <linux/iversion.h>
>
> #include "xfs.h"
> #include "xfs_fs.h"
> @@ -833,7 +834,7 @@ xfs_ialloc(
> ip->i_d.di_flags = 0;
>
> if (ip->i_d.di_version == 3) {
> - inode->i_version = 1;
> + inode_set_iversion(inode, 1);
> ip->i_d.di_flags2 = 0;
> ip->i_d.di_cowextsize = 0;
> ip->i_d.di_crtime.t_sec = (int32_t)tv.tv_sec;
> diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> index 6ee5c3bf19ad..7571abf5dfb3 100644
> --- a/fs/xfs/xfs_inode_item.c
> +++ b/fs/xfs/xfs_inode_item.c
> @@ -30,6 +30,7 @@
> #include "xfs_buf_item.h"
> #include "xfs_log.h"
>
> +#include <linux/iversion.h>
>
> kmem_zone_t *xfs_ili_zone; /* inode log item zone */
>
> @@ -354,7 +355,7 @@ xfs_inode_to_log_dinode(
> to->di_next_unlinked = NULLAGINO;
>
> if (from->di_version == 3) {
> - to->di_changecount = inode->i_version;
> + to->di_changecount = inode_peek_iversion(inode);
> to->di_crtime.t_sec = from->di_crtime.t_sec;
> to->di_crtime.t_nsec = from->di_crtime.t_nsec;
> to->di_flags2 = from->di_flags2;
> diff --git a/fs/xfs/xfs_trans_inode.c b/fs/xfs/xfs_trans_inode.c
> index daa7615497f9..225544327c4f 100644
> --- a/fs/xfs/xfs_trans_inode.c
> +++ b/fs/xfs/xfs_trans_inode.c
> @@ -28,6 +28,8 @@
> #include "xfs_inode_item.h"
> #include "xfs_trace.h"
>
> +#include <linux/iversion.h>
> +
> /*
> * Add a locked inode to the transaction.
> *
> @@ -117,7 +119,7 @@ xfs_trans_log_inode(
> */
> if (!(ip->i_itemp->ili_item.li_desc->lid_flags & XFS_LID_DIRTY) &&
> IS_I_VERSION(VFS_I(ip))) {
> - VFS_I(ip)->i_version++;
> + inode_inc_iversion(VFS_I(ip));
> flags |= XFS_ILOG_CORE;
> }
>
> --
> 2.14.3
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2018-01-02 16:50:38

by Jan Kara

Subject: Re: [PATCH v4 16/19] fs: only set S_VERSION when updating times if necessary

On Fri 22-12-17 07:05:53, Jeff Layton wrote:
> From: Jeff Layton <[email protected]>
>
> We only really need to update i_version if someone has queried for it
> since we last incremented it. By doing that, we can avoid having to
> update the inode if the times haven't changed.
>
> If the times have changed, then we go ahead and forcibly increment the
> counter, under the assumption that we'll be going to the storage
> anyway, and the increment itself is relatively cheap.
>
> Signed-off-by: Jeff Layton <[email protected]>
> ---
> fs/inode.c | 10 +++++++---
> 1 file changed, 7 insertions(+), 3 deletions(-)
>
> diff --git a/fs/inode.c b/fs/inode.c
> index 19e72f500f71..2fa920188759 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -1635,17 +1635,21 @@ static int relatime_need_update(const struct path *path, struct inode *inode,
> int generic_update_time(struct inode *inode, struct timespec *time, int flags)
> {
> int iflags = I_DIRTY_TIME;
> + bool dirty = false;
>
> if (flags & S_ATIME)
> inode->i_atime = *time;
> if (flags & S_VERSION)
> - inode_inc_iversion(inode);
> + dirty |= inode_maybe_inc_iversion(inode, dirty);
> if (flags & S_CTIME)
> inode->i_ctime = *time;
> if (flags & S_MTIME)
> inode->i_mtime = *time;
> + if ((flags & (S_ATIME | S_CTIME | S_MTIME)) &&
> + !(inode->i_sb->s_flags & SB_LAZYTIME))
> + dirty = true;

When you pass 'dirty' to inode_maybe_inc_iversion(), it is always false.
Maybe this condition should be at the beginning of the function? Once you
fix that the patch looks good so you can add:

Reviewed-by: Jan Kara <[email protected]>


Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
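
One way to read the reordering suggested above is sketched below as a small userspace model. It is not the code that was eventually merged: the model_* names, the struct fields, and the S_CTIME and S_VERSION values are assumptions for illustration only. The point it shows is that the timestamp check runs first, so its result is available as the "force" argument when the i_version is considered.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative flag values; the real ones live in include/linux/fs.h. */
enum { S_ATIME = 1, S_MTIME = 2, S_CTIME = 4, S_VERSION = 8 };

/* Hypothetical stand-in for the inode and superblock state involved. */
struct model_inode {
	uint64_t i_version;	/* plain counter, no flag bit in this sketch */
	bool queried;		/* has anyone looked since the last bump? */
	bool lazytime;		/* models SB_LAZYTIME on the superblock */
};

static bool model_maybe_inc_iversion(struct model_inode *inode, bool force)
{
	if (!force && !inode->queried)
		return false;
	inode->i_version++;
	inode->queried = false;
	return true;
}

/* Returns true if the inode would need more than I_DIRTY_TIME. */
static bool model_update_time(struct model_inode *inode, int flags)
{
	bool dirty = false;

	/* Timestamp decision first, so it can force the increment. */
	if ((flags & (S_ATIME | S_CTIME | S_MTIME)) && !inode->lazytime)
		dirty = true;
	if (flags & S_VERSION)
		dirty |= model_maybe_inc_iversion(inode, dirty);
	return dirty;
}
```

In this model, an S_VERSION-only update on an unqueried inode leaves the counter alone, while a timestamp update without lazytime forces the bump, which matches the intent described in the commit message.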

2018-01-02 17:00:08

by Jan Kara

Subject: Re: [PATCH v4 19/19] fs: handle inode->i_version more efficiently

On Fri 22-12-17 07:05:56, Jeff Layton wrote:
> From: Jeff Layton <[email protected]>
>
> Since i_version is mostly treated as an opaque value, we can exploit that
> fact to avoid incrementing it when no one is watching. With that change,
> we can avoid incrementing the counter on writes, unless someone has
> queried for it since it was last incremented. If the a/c/mtime don't
> change, and the i_version hasn't changed, then there's no need to dirty
> the inode metadata on a write.
>
> Convert the i_version counter to an atomic64_t, and use the lowest order
> bit to hold a flag that will tell whether anyone has queried the value
> since it was last incremented.
>
> When we go to maybe increment it, we fetch the value and check the flag
> bit. If it's clear then we don't need to do anything if the update
> isn't being forced.
>
> If we do need to update, then we increment the counter by 2, and clear
> the flag bit, and then use a CAS op to swap it into place. If that
> works, we return true. If it doesn't then do it again with the value
> that we fetch from the CAS operation.
>
> On the query side, if the flag is already set, then we just shift the
> value down by 1 bit and return it. Otherwise, we set the flag in our
> on-stack value and again use cmpxchg to swap it into place if it hasn't
> changed. If it has, then we use the value from the cmpxchg as the new
> "old" value and try again.
>
> This method allows us to avoid incrementing the counter on writes (and
> dirtying the metadata) under typical workloads. We only need to increment
> if it has been queried since it was last changed.
>
> Signed-off-by: Jeff Layton <[email protected]>

Looks good to me. You can add:

Reviewed-by: Jan Kara <[email protected]>

Honza
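
The increment and query steps described in the commit message above can be sketched in userspace with C11 atomics. This is a rough model under stated assumptions, not the kernel implementation: it uses <stdatomic.h> with default sequentially consistent ordering instead of atomic64_t and the explicit memory barriers the patch adds, and struct model_inode and the model_* functions are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Same layout as the patch: bit 0 is the QUERIED flag. */
#define I_VERSION_QUERIED_SHIFT 1
#define I_VERSION_QUERIED   (1ULL << (I_VERSION_QUERIED_SHIFT - 1))
#define I_VERSION_INCREMENT (1ULL << I_VERSION_QUERIED_SHIFT)

/* Hypothetical stand-in for struct inode; only the counter is modeled. */
struct model_inode {
	_Atomic uint64_t i_version;
};

/* Initialize a new inode: store the counter with QUERIED clear. */
static inline void model_set_iversion(struct model_inode *inode, uint64_t val)
{
	atomic_store(&inode->i_version, val << I_VERSION_QUERIED_SHIFT);
}

/* Bump only if forced or if the value was queried since the last bump. */
static inline bool model_maybe_inc_iversion(struct model_inode *inode,
					    bool force)
{
	uint64_t cur = atomic_load(&inode->i_version);
	uint64_t new;

	do {
		if (!force && !(cur & I_VERSION_QUERIED))
			return false;
		/* One counter step is +2; the flag bit is cleared. */
		new = (cur & ~I_VERSION_QUERIED) + I_VERSION_INCREMENT;
		/* On failure, cur is refreshed with the stored value. */
	} while (!atomic_compare_exchange_weak(&inode->i_version, &cur, new));

	return true;
}

/* Read the counter for later comparison, ensuring QUERIED is set. */
static inline uint64_t model_query_iversion(struct model_inode *inode)
{
	uint64_t cur = atomic_load(&inode->i_version);

	while (!(cur & I_VERSION_QUERIED)) {
		if (atomic_compare_exchange_weak(&inode->i_version, &cur,
						 cur | I_VERSION_QUERIED))
			break;
	}
	return cur >> I_VERSION_QUERIED_SHIFT;
}
```

In this model a run of writes with no intervening query costs only an atomic load per write, and the counter moves again only after model_query_iversion() has set the flag.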

> ---
> include/linux/fs.h | 2 +-
> include/linux/iversion.h | 208 ++++++++++++++++++++++++++++++++++-------------
> 2 files changed, 154 insertions(+), 56 deletions(-)
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 76382c24e9d0..6804d075933e 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -639,7 +639,7 @@ struct inode {
> struct hlist_head i_dentry;
> struct rcu_head i_rcu;
> };
> - u64 i_version;
> + atomic64_t i_version;
> atomic_t i_count;
> atomic_t i_dio_count;
> atomic_t i_writecount;
> diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> index e08c634779df..cef242e54489 100644
> --- a/include/linux/iversion.h
> +++ b/include/linux/iversion.h
> @@ -5,6 +5,8 @@
> #include <linux/fs.h>
>
> /*
> + * The inode->i_version field:
> + * ---------------------------
> * The change attribute (i_version) is mandated by NFSv4 and is mostly for
> * knfsd, but is also used for other purposes (e.g. IMA). The i_version must
> * appear different to observers if there was a change to the inode's data or
> @@ -27,86 +29,171 @@
> * i_version on namespace changes in directories (mkdir, rmdir, unlink, etc.).
> * We consider these sorts of filesystems to have a kernel-managed i_version.
> *
> + * This implementation uses the low bit in the i_version field as a flag to
> + * track when the value has been queried. If it has not been queried since it
> + * was last incremented, we can skip the increment in most cases.
> + *
> + * In the event that we're updating the ctime, we will usually go ahead and
> + * bump the i_version anyway. Since that has to go to stable storage in some
> + * fashion, we might as well increment it as well.
> + *
> + * With this implementation, the value should always appear to observers to
> + * increase over time if the file has changed. It's recommended to use
> + * inode_cmp_iversion() helper to compare values.
> + *
> * Note that some filesystems (e.g. NFS and AFS) just use the field to store
> - * a server-provided value (for the most part). For that reason, those
> + * a server-provided value for the most part. For that reason, those
> * filesystems do not set SB_I_VERSION. These filesystems are considered to
> * have a self-managed i_version.
> + *
> + * Persistently storing the i_version
> + * ----------------------------------
> + * Queries of the i_version field are not gated on them hitting the backing
> + * store. It's always possible that the host could crash after allowing
> + * a query of the value but before it has made it to disk.
> + *
> + * To mitigate this problem, filesystems should always use
> + * inode_set_iversion_queried when loading an existing inode from disk. This
> + * ensures that the next attempted inode increment will result in the value
> + * changing.
> + *
> + * Storing the value to disk therefore does not count as a query, so those
> + * filesystems should use inode_peek_iversion to grab the value to be stored.
> + * There is no need to flag the value as having been queried in that case.
> */
>
> +/*
> + * We borrow the lowest bit in the i_version to use as a flag to tell whether
> + * it has been queried since we last incremented it. If it has, then we must
> + * increment it on the next change. After that, we can clear the flag and
> + * avoid incrementing it again until it has again been queried.
> + */
> +#define I_VERSION_QUERIED_SHIFT (1)
> +#define I_VERSION_QUERIED (1ULL << (I_VERSION_QUERIED_SHIFT - 1))
> +#define I_VERSION_INCREMENT (1ULL << I_VERSION_QUERIED_SHIFT)
> +
> /**
> * inode_set_iversion_raw - set i_version to the specified raw value
> * @inode: inode to set
> - * @new: new i_version value to set
> + * @val: new i_version value to set
> *
> - * Set @inode's i_version field to @new. This function is for use by
> + * Set @inode's i_version field to @val. This function is for use by
> * filesystems that self-manage the i_version.
> *
> * For example, the NFS client stores its NFSv4 change attribute in this way,
> * and the AFS client stores the data_version from the server here.
> */
> static inline void
> -inode_set_iversion_raw(struct inode *inode, const u64 new)
> +inode_set_iversion_raw(struct inode *inode, const u64 val)
> +{
> + atomic64_set(&inode->i_version, val);
> +}
> +
> +/**
> + * inode_peek_iversion_raw - grab a "raw" iversion value
> + * @inode: inode from which i_version should be read
> + *
> + * Grab a "raw" inode->i_version value and return it. The i_version is not
> + * flagged or converted in any way. This is mostly used to access a self-managed
> + * i_version.
> + *
> + * With those filesystems, we want to treat the i_version as an entirely
> + * opaque value.
> + */
> +static inline u64
> +inode_peek_iversion_raw(const struct inode *inode)
> {
> - inode->i_version = new;
> + return atomic64_read(&inode->i_version);
> }
>
> /**
> * inode_set_iversion - set i_version to a particular value
> * @inode: inode to set
> - * @new: new i_version value to set
> + * @val: new i_version value to set
> *
> - * Set @inode's i_version field to @new. This function is for filesystems with
> - * a kernel-managed i_version.
> + * Set @inode's i_version field to @val. This function is for filesystems with
> + * a kernel-managed i_version, for initializing a newly-created inode from
> + * scratch.
> *
> - * For now, this just does the same thing as the _raw variant.
> + * In this case, we do not set the QUERIED flag since we know that this value
> + * has never been queried.
> */
> static inline void
> -inode_set_iversion(struct inode *inode, const u64 new)
> +inode_set_iversion(struct inode *inode, const u64 val)
> {
> - inode_set_iversion_raw(inode, new);
> + inode_set_iversion_raw(inode, val << I_VERSION_QUERIED_SHIFT);
> }
>
> /**
> - * inode_set_iversion_queried - set i_version to a particular value and set
> - * flag to indicate that it has been viewed
> + * inode_set_iversion_queried - set i_version to a particular value as queried
> * @inode: inode to set
> - * @new: new i_version value to set
> + * @val: new i_version value to set
> *
> - * When loading in an i_version value from a backing store, we typically don't
> - * know whether it was previously viewed before being stored or not. Thus, we
> - * must assume that it was, to ensure that any changes will result in the
> - * value changing.
> + * Set @inode's i_version field to @val, and flag it for increment on the next
> + * change.
> *
> - * This function will set the inode's i_version, and possibly flag the value
> - * as if it has already been viewed at least once.
> + * Filesystems that persistently store the i_version on disk should use this
> + * when loading an existing inode from disk.
> *
> - * For now, this just does what inode_set_iversion does.
> + * When loading in an i_version value from a backing store, we can't be certain
> + * that it wasn't previously viewed before being stored. Thus, we must assume
> + * that it was, to ensure that we don't end up handing out the same value for
> + * different versions of the same inode.
> */
> static inline void
> -inode_set_iversion_queried(struct inode *inode, const u64 new)
> +inode_set_iversion_queried(struct inode *inode, const u64 val)
> {
> - inode_set_iversion(inode, new);
> + inode_set_iversion_raw(inode, (val << I_VERSION_QUERIED_SHIFT) |
> + I_VERSION_QUERIED);
> }
>
> /**
> * inode_maybe_inc_iversion - increments i_version
> * @inode: inode with the i_version that should be updated
> - * @force: increment the counter even if it's not necessary
> + * @force: increment the counter even if it's not necessary?
> *
> * Every time the inode is modified, the i_version field must be seen to have
> * changed by any observer.
> *
> - * In this implementation, we always increment it after taking the i_lock to
> - * ensure that we don't race with other incrementors.
> + * If "force" is set or the QUERIED flag is set, then ensure that we increment
> + * the value, and clear the queried flag.
> *
> - * Returns true if counter was bumped, and false if it wasn't.
> + * In the common case where neither is set, then we can return "false" without
> + * updating i_version.
> + *
> + * If this function returns false, and no other metadata has changed, then we
> + * can avoid logging the metadata.
> */
> static inline bool
> inode_maybe_inc_iversion(struct inode *inode, bool force)
> {
> - atomic64_t *ivp = (atomic64_t *)&inode->i_version;
> + u64 cur, old, new;
> +
> + /*
> + * The i_version field is not strictly ordered with any other inode
> + * information, but the legacy inode_inc_iversion code used a spinlock
> + * to serialize increments.
> + *
> + * Here, we add full memory barriers to ensure that any de-facto
> + * ordering with other info is preserved.
> + *
> + * This barrier pairs with the barrier in inode_query_iversion()
> + */
> + smp_mb();
> + cur = inode_peek_iversion_raw(inode);
> + for (;;) {
> + /* If flag is clear then we needn't do anything */
> + if (!force && !(cur & I_VERSION_QUERIED))
> + return false;
>
> - atomic64_inc(ivp);
> + /* Since lowest bit is flag, add 2 to avoid it */
> + new = (cur & ~I_VERSION_QUERIED) + I_VERSION_INCREMENT;
> +
> + old = atomic64_cmpxchg(&inode->i_version, cur, new);
> + if (likely(old == cur))
> + break;
> + cur = old;
> + }
> return true;
> }
>
> @@ -129,31 +216,12 @@ inode_inc_iversion(struct inode *inode)
> * @inode: inode to check
> *
> * Returns whether the inode->i_version counter needs incrementing on the next
> - * change.
> - *
> - * For now, we assume that it always does.
> + * change. Just fetch the value and check the QUERIED flag.
> */
> static inline bool
> inode_iversion_need_inc(struct inode *inode)
> {
> - return true;
> -}
> -
> -/**
> - * inode_peek_iversion_raw - grab a "raw" iversion value
> - * @inode: inode from which i_version should be read
> - *
> - * Grab a "raw" inode->i_version value and return it. The i_version is not
> - * flagged or converted in any way. This is mostly used to access a self-managed
> - * i_version.
> - *
> - * With those filesystems, we want to treat the i_version as an entirely
> - * opaque value.
> - */
> -static inline u64
> -inode_peek_iversion_raw(const struct inode *inode)
> -{
> - return inode->i_version;
> + return inode_peek_iversion_raw(inode) & I_VERSION_QUERIED;
> }
>
> /**
> @@ -170,7 +238,7 @@ inode_peek_iversion_raw(const struct inode *inode)
> static inline u64
> inode_peek_iversion(const struct inode *inode)
> {
> - return inode_peek_iversion_raw(inode);
> + return inode_peek_iversion_raw(inode) >> I_VERSION_QUERIED_SHIFT;
> }
>
> /**
> @@ -182,12 +250,35 @@ inode_peek_iversion(const struct inode *inode)
> * that a later query of the i_version will result in a different value if
> * anything has changed.
> *
> - * This implementation just does a peek.
> + * In this implementation, we fetch the current value, set the QUERIED flag and
> + * then try to swap it into place with a cmpxchg, if it wasn't already set. If
> + * that fails, we try again with the newly fetched value from the cmpxchg.
> */
> static inline u64
> inode_query_iversion(struct inode *inode)
> {
> - return inode_peek_iversion(inode);
> + u64 cur, old, new;
> +
> + cur = inode_peek_iversion_raw(inode);
> + for (;;) {
> + /* If flag is already set, then no need to swap */
> + if (cur & I_VERSION_QUERIED) {
> + /*
> + * This barrier (and the implicit barrier in the
> + * cmpxchg below) pairs with the barrier in
> + * inode_maybe_inc_iversion().
> + */
> + smp_mb();
> + break;
> + }
> +
> + new = cur | I_VERSION_QUERIED;
> + old = atomic64_cmpxchg(&inode->i_version, cur, new);
> + if (likely(old == cur))
> + break;
> + cur = old;
> + }
> + return cur >> I_VERSION_QUERIED_SHIFT;
> }
>
> /**
> @@ -196,11 +287,18 @@ inode_query_iversion(struct inode *inode)
> * @old: old value to check against its i_version
> *
> * Compare an i_version counter with a previous one. Returns 0 if they are
> - * the same or non-zero if they are different.
> + * the same, a positive value if the one in the inode appears newer than @old,
> + * and a negative value if @old appears to be newer than the one in the
> + * inode.
> + *
> + * Note that we don't need to set the QUERIED flag in this case, as the value
> + * in the inode is not being recorded for later use.
> */
> +
> static inline s64
> inode_cmp_iversion(const struct inode *inode, const u64 old)
> {
> - return (s64)inode_peek_iversion(inode) - (s64)old;
> + return (s64)(inode_peek_iversion_raw(inode) & ~I_VERSION_QUERIED) -
> + (s64)(old << I_VERSION_QUERIED_SHIFT);
> }
> #endif
> --
> 2.14.3
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2018-01-02 17:01:10

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v4 01/19] fs: new API for handling inode->i_version

On Fri 22-12-17 18:54:57, Jeff Layton wrote:
> On Sat, 2017-12-23 at 10:14 +1100, NeilBrown wrote:
> > > +#include <linux/fs.h>
> > > +
> > > +/*
> > > + * The change attribute (i_version) is mandated by NFSv4 and is mostly for
> > > + * knfsd, but is also used for other purposes (e.g. IMA). The i_version must
> > > + * appear different to observers if there was a change to the inode's data or
> > > + * metadata since it was last queried.
> > > + *
> > > + * It should be considered an opaque value by observers. If it remains the same
> >
> > ^^^^^^^^^^^^
> >
> > You keep using that word ... I don't think it means what you think it
> > means.
> > Change that sentence to:
> >
> > Observers see i_version as a 64-bit number which never decreases.
> >
> > and the rest still makes perfect sense.
> >
>
> Thanks! Fixed in my tree. I'll not resend the set just for that though.

With this fixed the patch looks good to me. You can add:

Reviewed-by: Jan Kara <[email protected]>

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR

2018-01-02 17:20:15

by David Howells

[permalink] [raw]
Subject: Re: [PATCH v4 05/19] afs: convert to new i_version API

Jeff Layton <[email protected]> wrote:

> Note that AFS has quite a different definition for this counter. AFS
> only increments it on changes to the data, not for the metadata.

This also applies to AFS directories: create, mkdir, unlink, rmdir, link,
symlink, rename, and mountpoint creation/removal all bump the data version
number on a directory by exactly one if they change it.

David

2018-01-02 18:57:46

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH v4 05/19] afs: convert to new i_version API

On Tue, 2018-01-02 at 17:20 +0000, David Howells wrote:
> Jeff Layton <[email protected]> wrote:
>
> > Note that AFS has quite a different definition for this counter. AFS
> > only increments it on changes to the data, not for the metadata.
>
> This also applies to AFS directories: create, mkdir, unlink, rmdir, link,
> symlink, rename, and mountpoint creation/removal all bump the data version
> number on a directory by exactly one if they change it.
>

Thanks! I updated that part of the commit log to read:

Note that AFS has quite a different definition for this counter. AFS
only increments it on changes to the data in regular files
and contents of the directories. Inode metadata changes do not result
in a version increment.

--
Jeff Layton <[email protected]>

2018-01-02 19:03:21

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH v4 16/19] fs: only set S_VERSION when updating times if necessary

On Tue, 2018-01-02 at 17:50 +0100, Jan Kara wrote:
> On Fri 22-12-17 07:05:53, Jeff Layton wrote:
> > From: Jeff Layton <[email protected]>
> >
> > We only really need to update i_version if someone has queried for it
> > since we last incremented it. By doing that, we can avoid having to
> > update the inode if the times haven't changed.
> >
> > If the times have changed, then we go ahead and forcibly increment the
> > counter, under the assumption that we'll be going to the storage
> > anyway, and the increment itself is relatively cheap.
> >
> > Signed-off-by: Jeff Layton <[email protected]>
> > ---
> > fs/inode.c | 10 +++++++---
> > 1 file changed, 7 insertions(+), 3 deletions(-)
> >
> > diff --git a/fs/inode.c b/fs/inode.c
> > index 19e72f500f71..2fa920188759 100644
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -1635,17 +1635,21 @@ static int relatime_need_update(const struct path *path, struct inode *inode,
> > int generic_update_time(struct inode *inode, struct timespec *time, int flags)
> > {
> > int iflags = I_DIRTY_TIME;
> > + bool dirty = false;
> >
> > if (flags & S_ATIME)
> > inode->i_atime = *time;
> > if (flags & S_VERSION)
> > - inode_inc_iversion(inode);
> > + dirty |= inode_maybe_inc_iversion(inode, dirty);
> > if (flags & S_CTIME)
> > inode->i_ctime = *time;
> > if (flags & S_MTIME)
> > inode->i_mtime = *time;
> > + if ((flags & (S_ATIME | S_CTIME | S_MTIME)) &&
> > + !(inode->i_sb->s_flags & SB_LAZYTIME))
> > + dirty = true;
>
> When you pass 'dirty' to inode_maybe_inc_iversion(), it is always false.
> Maybe this condition should be at the beginning of the function? Once you
> fix that the patch looks good so you can add:
>
> Reviewed-by: Jan Kara <[email protected]>
>

Thanks for the review! I've fixed it in my tree. I'll not re-post the
set unless I have to make another significant change or someone requests
it.

I did make one other change, and that was to drop the "const" qualifiers
on the integer arguments in the new API. David Howells pointed out that
they don't really help anything, and the prototypes look cleaner without
them.

This set is now in linux-next as well, so I'm going to try to get this
merged into v4.16, assuming no problems between now and the merge
window.
--
Jeff Layton <[email protected]>

2018-01-03 16:28:18

by David Howells

[permalink] [raw]
Subject: Re: [PATCH v4 05/19] afs: convert to new i_version API

Jeff Layton <[email protected]> wrote:

> Thanks! I updated that part of the commit log to read:
>
> Note that AFS has quite a different definition for this counter. AFS
> only increments it on changes to the data in regular files
> and contents of the directories. Inode metadata changes do not result
> in a version increment.

Sounds good.

David

2018-01-04 13:34:37

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH v4 12/19] ocfs2: convert to new i_version API

On Fri, 2017-12-22 at 07:05 -0500, Jeff Layton wrote:
> From: Jeff Layton <[email protected]>
>
> Signed-off-by: Jeff Layton <[email protected]>
> Reviewed-by: Jan Kara <[email protected]>
> ---
> fs/ocfs2/dir.c | 15 ++++++++-------
> fs/ocfs2/inode.c | 3 ++-
> fs/ocfs2/namei.c | 3 ++-
> fs/ocfs2/quota_global.c | 3 ++-
> 4 files changed, 14 insertions(+), 10 deletions(-)
>
> diff --git a/fs/ocfs2/dir.c b/fs/ocfs2/dir.c
> index febe6312ceff..32f9c72dff17 100644
> --- a/fs/ocfs2/dir.c
> +++ b/fs/ocfs2/dir.c
> @@ -42,6 +42,7 @@
> #include <linux/highmem.h>
> #include <linux/quotaops.h>
> #include <linux/sort.h>
> +#include <linux/iversion.h>
>
> #include <cluster/masklog.h>
>
> @@ -1174,7 +1175,7 @@ static int __ocfs2_delete_entry(handle_t *handle, struct inode *dir,
> le16_add_cpu(&pde->rec_len,
> le16_to_cpu(de->rec_len));
> de->inode = 0;
> - dir->i_version++;
> + inode_inc_iversion(dir);
> ocfs2_journal_dirty(handle, bh);
> goto bail;
> }
> @@ -1729,7 +1730,7 @@ int __ocfs2_add_entry(handle_t *handle,
> if (ocfs2_dir_indexed(dir))
> ocfs2_recalc_free_list(dir, handle, lookup);
>
> - dir->i_version++;
> + inode_inc_iversion(dir);
> ocfs2_journal_dirty(handle, insert_bh);
> retval = 0;
> goto bail;
> @@ -1775,7 +1776,7 @@ static int ocfs2_dir_foreach_blk_id(struct inode *inode,
> * readdir(2), then we might be pointing to an invalid
> * dirent right now. Scan from the start of the block
> * to make sure. */
> - if (*f_version != inode->i_version) {
> + if (inode_cmp_iversion(inode, *f_version)) {
> for (i = 0; i < i_size_read(inode) && i < offset; ) {
> de = (struct ocfs2_dir_entry *)
> (data->id_data + i);
> @@ -1791,7 +1792,7 @@ static int ocfs2_dir_foreach_blk_id(struct inode *inode,
> i += le16_to_cpu(de->rec_len);
> }
> ctx->pos = offset = i;
> - *f_version = inode->i_version;
> + *f_version = inode_query_iversion(inode);
> }
>
> de = (struct ocfs2_dir_entry *) (data->id_data + ctx->pos);
> @@ -1869,7 +1870,7 @@ static int ocfs2_dir_foreach_blk_el(struct inode *inode,
> * readdir(2), then we might be pointing to an invalid
> * dirent right now. Scan from the start of the block
> * to make sure. */
> - if (*f_version != inode->i_version) {
> + if (inode_cmp_iversion(inode, *f_version)) {
> for (i = 0; i < sb->s_blocksize && i < offset; ) {
> de = (struct ocfs2_dir_entry *) (bh->b_data + i);
> /* It's too expensive to do a full
> @@ -1886,7 +1887,7 @@ static int ocfs2_dir_foreach_blk_el(struct inode *inode,
> offset = i;
> ctx->pos = (ctx->pos & ~(sb->s_blocksize - 1))
> | offset;
> - *f_version = inode->i_version;
> + *f_version = inode_query_iversion(inode);
> }
>
> while (ctx->pos < i_size_read(inode)
> @@ -1940,7 +1941,7 @@ static int ocfs2_dir_foreach_blk(struct inode *inode, u64 *f_version,
> */
> int ocfs2_dir_foreach(struct inode *inode, struct dir_context *ctx)
> {
> - u64 version = inode->i_version;
> + u64 version = inode_query_iversion(inode);
> ocfs2_dir_foreach_blk(inode, &version, ctx, true);
> return 0;
> }
> diff --git a/fs/ocfs2/inode.c b/fs/ocfs2/inode.c
> index 1a1e0078ab38..d51b80edd972 100644
> --- a/fs/ocfs2/inode.c
> +++ b/fs/ocfs2/inode.c
> @@ -28,6 +28,7 @@
> #include <linux/highmem.h>
> #include <linux/pagemap.h>
> #include <linux/quotaops.h>
> +#include <linux/iversion.h>
>
> #include <asm/byteorder.h>
>
> @@ -302,7 +303,7 @@ void ocfs2_populate_inode(struct inode *inode, struct ocfs2_dinode *fe,
> OCFS2_I(inode)->ip_attr = le32_to_cpu(fe->i_attr);
> OCFS2_I(inode)->ip_dyn_features = le16_to_cpu(fe->i_dyn_features);
>
> - inode->i_version = 1;
> + inode_set_iversion(inode, 1);
> inode->i_generation = le32_to_cpu(fe->i_generation);
> inode->i_rdev = huge_decode_dev(le64_to_cpu(fe->id1.dev1.i_rdev));
> inode->i_mode = le16_to_cpu(fe->i_mode);
> diff --git a/fs/ocfs2/namei.c b/fs/ocfs2/namei.c
> index 3b0a10d9b36f..c801eddc4bf3 100644
> --- a/fs/ocfs2/namei.c
> +++ b/fs/ocfs2/namei.c
> @@ -41,6 +41,7 @@
> #include <linux/slab.h>
> #include <linux/highmem.h>
> #include <linux/quotaops.h>
> +#include <linux/iversion.h>
>
> #include <cluster/masklog.h>
>
> @@ -1520,7 +1521,7 @@ static int ocfs2_rename(struct inode *old_dir,
> mlog_errno(status);
> goto bail;
> }
> - new_dir->i_version++;
> + inode_inc_iversion(new_dir);
>
> if (S_ISDIR(new_inode->i_mode))
> ocfs2_set_links_count(newfe, 0);
> diff --git a/fs/ocfs2/quota_global.c b/fs/ocfs2/quota_global.c
> index b39d14cbfa34..d03411784aaf 100644
> --- a/fs/ocfs2/quota_global.c
> +++ b/fs/ocfs2/quota_global.c
> @@ -12,6 +12,7 @@
> #include <linux/writeback.h>
> #include <linux/workqueue.h>
> #include <linux/llist.h>
> +#include <linux/iversion.h>
>
> #include <cluster/masklog.h>
>
> @@ -289,7 +290,7 @@ ssize_t ocfs2_quota_write(struct super_block *sb, int type,
> mlog_errno(err);
> return err;
> }
> - gqinode->i_version++;
> + inode_query_iversion(gqinode);

Bug here: this should have been an inode_inc_iversion. Fixed in tree.

> ocfs2_mark_inode_dirty(handle, gqinode, oinfo->dqi_gqi_bh);
> return len;
> }

--
Jeff Layton <[email protected]>

2018-01-07 12:44:56

by Krzysztof Kozlowski

[permalink] [raw]
Subject: Re: [PATCH v4 19/19] fs: handle inode->i_version more efficiently

On Fri, Dec 22, 2017 at 1:05 PM, Jeff Layton <[email protected]> wrote:
> From: Jeff Layton <[email protected]>
>
> Since i_version is mostly treated as an opaque value, we can exploit that
> fact to avoid incrementing it when no one is watching. With that change,
> we can avoid incrementing the counter on writes, unless someone has
> queried for it since it was last incremented. If the a/c/mtime don't
> change, and the i_version hasn't changed, then there's no need to dirty
> the inode metadata on a write.
>
> Convert the i_version counter to an atomic64_t, and use the lowest order
> bit to hold a flag that will tell whether anyone has queried the value
> since it was last incremented.
>
> When we go to maybe increment it, we fetch the value and check the flag
> bit. If it's clear then we don't need to do anything if the update
> isn't being forced.
>
> If we do need to update, then we increment the counter by 2, and clear
> the flag bit, and then use a CAS op to swap it into place. If that
> works, we return true. If it doesn't then do it again with the value
> that we fetch from the CAS operation.
>
> On the query side, if the flag is already set, then we just shift the
> value down by 1 bit and return it. Otherwise, we set the flag in our
> on-stack value and again use cmpxchg to swap it into place if it hasn't
> changed. If it has, then we use the value from the cmpxchg as the new
> "old" value and try again.
>
> This method allows us to avoid incrementing the counter on writes (and
> dirtying the metadata) under typical workloads. We only need to increment
> if it has been queried since it was last changed.
>
> Signed-off-by: Jeff Layton <[email protected]>
> ---
> include/linux/fs.h | 2 +-
> include/linux/iversion.h | 208 ++++++++++++++++++++++++++++++++++-------------
> 2 files changed, 154 insertions(+), 56 deletions(-)
>

Hi,

On recent linux-next my ARM/Exynos boards fail to boot over nfsroot.
This commit popped up through bisect (log at the end). systemd
times out on some device-specific services, including mounting ext4
/home:

[ *** ] (1 of 4) A start job is running for…ress polling (1min 41s / no limit)
[ TIME ] Timed out waiting for device dev-ttySAC2.device.
Jan 07 13:29:38 [DEPEND] Dependency failed for Serial Getty on ttySAC2.
Jan 07 13:29:38 [ TIME ] Timed out waiting for device
dev-disk-by\x2dlabel-home.device.
Jan 07 13:29:38 [DEPEND] Dependency failed for /home.
Jan 07 13:29:38 [DEPEND] Dependency failed for Local File Systems.
Jan 07 13:29:38 [DEPEND] Dependency failed for File System Check on
/dev/disk/by-label/home.
Jan 07 13:30:02 [ *** ] (1 of 2) A start job is running for…ress
polling (1min 53s / no limit)

Kernel command line:
console=tty1 console=ttySAC2,115200n8
ip=192.168.1.11:192.168.1.10:192.168.1.1:255.255.255.0::eth0:none
nfsrootdebug root=/dev/nfs
nfsroot=192.168.1.10:/srv/nfs/odroidxu3,vers=4,nolock rootwait rw
smsc95xx.macaddr=00:1e:06:61:7a:93 no_console_suspend

/home is /dev/mmcblk1p2:
kozik@odroidxu3:~$ tune2fs -l /dev/mmcblk1p2
tune2fs 1.43.7 (16-Oct-2017)
Filesystem volume name: home
Last mounted on: /home
Filesystem UUID: 3f9dbeba-2738-45d3-807e-c1b2e21128ed
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr resize_inode dir_index
filetype needs_recovery extent flex_bg sparse_super large_file
uninit_bg dir_nlink extra_isize
Filesystem flags: signed_directory_hash
Default mount options: user_xattr acl
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 1430800
Block count: 5717760
Reserved block count: 285888
Free blocks: 5467576
Free inodes: 1428301
First block: 0
Block size: 4096
Fragment size: 4096
Reserved GDT blocks: 1022
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 8176
Inode blocks per group: 511
Flex block group size: 16
Filesystem created: Thu May 21 12:17:05 2015
Last mount time: Thu Dec 21 13:31:26 2017
Last write time: Thu Dec 21 13:31:26 2017
Mount count: 1
Maximum mount count: -1
Last checked: Thu Dec 21 13:31:25 2017
Check interval: 0 (<none>)
Lifetime writes: 126 GB
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 256
Required extra isize: 28
Desired extra isize: 28
Journal inode: 8
Default directory hash: half_md4
Directory Hash Seed: 42e17e23-86b2-4356-ad63-78aa51651d03
Journal backup: inode blocks


Full dmesg log:
http://www.krzk.eu/#/builders/1/builds/1258/steps/10/logs/serial0

The regular boot from rootfs on SD card also fails - but without any
serial console logs (just "Starting kernel...") so the real cause is
unknown.

Any hints?

Best regards,
Krzysztof


bisect log:
# bad: [73005e1a35fd67c644b0645c9e4c1efabd0fe62c] Add linux-next
specific files for 20180103
# good: [30a7acd573899fd8b8ac39236eff6468b195ac7d] Linux 4.15-rc6
git bisect start 'next/master' 'next/stable'
# bad: [c1d290f9ce8daa2b0a79d2fe48c1b7c3c5370f5a] Merge
remote-tracking branch 'crypto/master'
git bisect bad c1d290f9ce8daa2b0a79d2fe48c1b7c3c5370f5a
# bad: [55695d94f0915121d106cd2d1ab94983a32f3e9a] Merge
remote-tracking branch 'hid/for-next'
git bisect bad 55695d94f0915121d106cd2d1ab94983a32f3e9a
# good: [cffae1eead0dd91be1a3069a8348127bb00158f3] Merge
remote-tracking branch 'realtek/for-next'
git bisect good cffae1eead0dd91be1a3069a8348127bb00158f3
# good: [5f889f1176dc99636c6bf8af7c286decc888c007] Merge
remote-tracking branch 'btrfs/next'
git bisect good 5f889f1176dc99636c6bf8af7c286decc888c007
# good: [984c35877f36bee305e43a1c58176169854d85cf] Merge
remote-tracking branch 'xfs/for-next'
git bisect good 984c35877f36bee305e43a1c58176169854d85cf
# bad: [f9fec502daea2a869232b6dff33ba3de79dd0d61] Merge
remote-tracking branch 'printk/for-next'
git bisect bad f9fec502daea2a869232b6dff33ba3de79dd0d61
# good: [c71d227fc4133f949dae620ed5e3a250b43f2415] make kernel-side
POLL... arch-independent
git bisect good c71d227fc4133f949dae620ed5e3a250b43f2415
# good: [416d20e8c31107f5dfd45d1d80d1e6c8e4871180] Merge branches
'work.get_user_pages_fast', 'work.wmci', 'work.sock_recvmsg',
'misc.netdrv', 'misc.poll', 'work.mqueue', 'work.whack-a-mole' and
'work.misc' into for-next
git bisect good 416d20e8c31107f5dfd45d1d80d1e6c8e4871180
# good: [325a1de4a691512a48c1426b943a7b0b9f8d6744] xfs: convert to new
i_version API
git bisect good 325a1de4a691512a48c1426b943a7b0b9f8d6744
# good: [a94fe10fb114c169e7ddaecd8251521886409121] checkpatch: add
pF/pf deprecation warning
git bisect good a94fe10fb114c169e7ddaecd8251521886409121
# good: [6b3911dffd1184fdcd63299a5fee59ac000f2067] btrfs: only dirty
the inode in btrfs_update_time if something was changed
git bisect good 6b3911dffd1184fdcd63299a5fee59ac000f2067
# bad: [448f8c749a7a0ae03505823910ec45a112678048] Merge
remote-tracking branch 'iversion/iversion-next'
git bisect bad 448f8c749a7a0ae03505823910ec45a112678048
# bad: [8618bff776758ebff5b55211e7b5a60a0fc119a5] fs: handle
inode->i_version more efficiently
git bisect bad 8618bff776758ebff5b55211e7b5a60a0fc119a5
# first bad commit: [8618bff776758ebff5b55211e7b5a60a0fc119a5] fs:
handle inode->i_version more efficiently

2018-01-08 12:56:53

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH v4 19/19] fs: handle inode->i_version more efficiently

On Sun, 2018-01-07 at 13:44 +0100, Krzysztof Kozlowski wrote:
> On Fri, Dec 22, 2017 at 1:05 PM, Jeff Layton <[email protected]> wrote:
> > From: Jeff Layton <[email protected]>
> >
> > Since i_version is mostly treated as an opaque value, we can exploit that
> > fact to avoid incrementing it when no one is watching. With that change,
> > we can avoid incrementing the counter on writes, unless someone has
> > queried for it since it was last incremented. If the a/c/mtime don't
> > change, and the i_version hasn't changed, then there's no need to dirty
> > the inode metadata on a write.
> >
> > Convert the i_version counter to an atomic64_t, and use the lowest order
> > bit to hold a flag that will tell whether anyone has queried the value
> > since it was last incremented.
> >
> > When we go to maybe increment it, we fetch the value and check the flag
> > bit. If it's clear then we don't need to do anything if the update
> > isn't being forced.
> >
> > If we do need to update, then we increment the counter by 2, and clear
> > the flag bit, and then use a CAS op to swap it into place. If that
> > works, we return true. If it doesn't then do it again with the value
> > that we fetch from the CAS operation.
> >
> > On the query side, if the flag is already set, then we just shift the
> > value down by 1 bit and return it. Otherwise, we set the flag in our
> > on-stack value and again use cmpxchg to swap it into place if it hasn't
> > changed. If it has, then we use the value from the cmpxchg as the new
> > "old" value and try again.
> >
> > This method allows us to avoid incrementing the counter on writes (and
> > dirtying the metadata) under typical workloads. We only need to increment
> > if it has been queried since it was last changed.
> >
> > Signed-off-by: Jeff Layton <[email protected]>
> > ---
> > include/linux/fs.h | 2 +-
> > include/linux/iversion.h | 208 ++++++++++++++++++++++++++++++++++-------------
> > 2 files changed, 154 insertions(+), 56 deletions(-)
> >
>
> Hi,
>
> On recent linux-next my ARM/Exynos boards fail to boot over nfsroot.
> This commit popped up through bisect (log at the end). systemd
> times out on some device-specific services, including mounting ext4
> /home:
>
> [ *** ] (1 of 4) A start job is running for…ress polling (1min 41s / no limit)
> [ TIME ] Timed out waiting for device dev-ttySAC2.device.
> Jan 07 13:29:38 [DEPEND] Dependency failed for Serial Getty on ttySAC2.
> Jan 07 13:29:38 [ TIME ] Timed out waiting for device
> dev-disk-by\x2dlabel-home.device.
> Jan 07 13:29:38 [DEPEND] Dependency failed for /home.
> Jan 07 13:29:38 [DEPEND] Dependency failed for Local File Systems.
> Jan 07 13:29:38 [DEPEND] Dependency failed for File System Check on
> /dev/disk/by-label/home.
> Jan 07 13:30:02 [ *** ] (1 of 2) A start job is running for…ress
> polling (1min 53s / no limit)
>
> Kernel command line:
> console=tty1 console=ttySAC2,115200n8
> ip=192.168.1.11:192.168.1.10:192.168.1.1:255.255.255.0::eth0:none
> nfsrootdebug root=/dev/nfs
> nfsroot=192.168.1.10:/srv/nfs/odroidxu3,vers=4,nolock rootwait rw
> smsc95xx.macaddr=00:1e:06:61:7a:93 no_console_suspend
>
> /home is /dev/mmcblk1p2:
> kozik@odroidxu3:~$ tune2fs -l /dev/mmcblk1p2
> tune2fs 1.43.7 (16-Oct-2017)
> Filesystem volume name: home
> Last mounted on: /home
> Filesystem UUID: 3f9dbeba-2738-45d3-807e-c1b2e21128ed
> Filesystem magic number: 0xEF53
> Filesystem revision #: 1 (dynamic)
> Filesystem features: has_journal ext_attr resize_inode dir_index
> filetype needs_recovery extent flex_bg sparse_super large_file
> uninit_bg dir_nlink extra_isize
> Filesystem flags: signed_directory_hash
> Default mount options: user_xattr acl
> Filesystem state: clean
> Errors behavior: Continue
> Filesystem OS type: Linux
> Inode count: 1430800
> Block count: 5717760
> Reserved block count: 285888
> Free blocks: 5467576
> Free inodes: 1428301
> First block: 0
> Block size: 4096
> Fragment size: 4096
> Reserved GDT blocks: 1022
> Blocks per group: 32768
> Fragments per group: 32768
> Inodes per group: 8176
> Inode blocks per group: 511
> Flex block group size: 16
> Filesystem created: Thu May 21 12:17:05 2015
> Last mount time: Thu Dec 21 13:31:26 2017
> Last write time: Thu Dec 21 13:31:26 2017
> Mount count: 1
> Maximum mount count: -1
> Last checked: Thu Dec 21 13:31:25 2017
> Check interval: 0 (<none>)
> Lifetime writes: 126 GB
> Reserved blocks uid: 0 (user root)
> Reserved blocks gid: 0 (group root)
> First inode: 11
> Inode size: 256
> Required extra isize: 28
> Desired extra isize: 28
> Journal inode: 8
> Default directory hash: half_md4
> Directory Hash Seed: 42e17e23-86b2-4356-ad63-78aa51651d03
> Journal backup: inode blocks
>
>
> Full dmesg log:
> http://www.krzk.eu/#/builders/1/builds/1258/steps/10/logs/serial0
>
> The regular boot from rootfs on SD card also fails - but without any
> serial console logs (just "Starting kernel...") so the real cause is
> unknown.
>
> Any hints?
>
> Best regards,
> Krzysztof
>
>
> bisect log:
> # bad: [73005e1a35fd67c644b0645c9e4c1efabd0fe62c] Add linux-next
> specific files for 20180103
> # good: [30a7acd573899fd8b8ac39236eff6468b195ac7d] Linux 4.15-rc6
> git bisect start 'next/master' 'next/stable'
> # bad: [c1d290f9ce8daa2b0a79d2fe48c1b7c3c5370f5a] Merge
> remote-tracking branch 'crypto/master'
> git bisect bad c1d290f9ce8daa2b0a79d2fe48c1b7c3c5370f5a
> # bad: [55695d94f0915121d106cd2d1ab94983a32f3e9a] Merge
> remote-tracking branch 'hid/for-next'
> git bisect bad 55695d94f0915121d106cd2d1ab94983a32f3e9a
> # good: [cffae1eead0dd91be1a3069a8348127bb00158f3] Merge
> remote-tracking branch 'realtek/for-next'
> git bisect good cffae1eead0dd91be1a3069a8348127bb00158f3
> # good: [5f889f1176dc99636c6bf8af7c286decc888c007] Merge
> remote-tracking branch 'btrfs/next'
> git bisect good 5f889f1176dc99636c6bf8af7c286decc888c007
> # good: [984c35877f36bee305e43a1c58176169854d85cf] Merge
> remote-tracking branch 'xfs/for-next'
> git bisect good 984c35877f36bee305e43a1c58176169854d85cf
> # bad: [f9fec502daea2a869232b6dff33ba3de79dd0d61] Merge
> remote-tracking branch 'printk/for-next'
> git bisect bad f9fec502daea2a869232b6dff33ba3de79dd0d61
> # good: [c71d227fc4133f949dae620ed5e3a250b43f2415] make kernel-side
> POLL... arch-independent
> git bisect good c71d227fc4133f949dae620ed5e3a250b43f2415
> # good: [416d20e8c31107f5dfd45d1d80d1e6c8e4871180] Merge branches
> 'work.get_user_pages_fast', 'work.wmci', 'work.sock_recvmsg',
> 'misc.netdrv', 'misc.poll', 'work.mqueue', 'work.whack-a-mole' and
> 'work.misc' into for-next
> git bisect good 416d20e8c31107f5dfd45d1d80d1e6c8e4871180
> # good: [325a1de4a691512a48c1426b943a7b0b9f8d6744] xfs: convert to new
> i_version API
> git bisect good 325a1de4a691512a48c1426b943a7b0b9f8d6744
> # good: [a94fe10fb114c169e7ddaecd8251521886409121] checkpatch: add
> pF/pf deprecation warning
> git bisect good a94fe10fb114c169e7ddaecd8251521886409121
> # good: [6b3911dffd1184fdcd63299a5fee59ac000f2067] btrfs: only dirty
> the inode in btrfs_update_time if something was changed
> git bisect good 6b3911dffd1184fdcd63299a5fee59ac000f2067
> # bad: [448f8c749a7a0ae03505823910ec45a112678048] Merge
> remote-tracking branch 'iversion/iversion-next'
> git bisect bad 448f8c749a7a0ae03505823910ec45a112678048
> # bad: [8618bff776758ebff5b55211e7b5a60a0fc119a5] fs: handle
> inode->i_version more efficiently
> git bisect bad 8618bff776758ebff5b55211e7b5a60a0fc119a5
> # first bad commit: [8618bff776758ebff5b55211e7b5a60a0fc119a5] fs:
> handle inode->i_version more efficiently

That's really strange. I'm afraid I have no idea what could be going on.

With NFS, we really just treat i_version as an opaque value, so I'm not
sure how this patch in particular would affect anything there. We _do_
increment it if you have a write delegation in some cases, but not many
servers hand those out.

ext4 will only touch the i_version field if you mount it with '-o
iversion'. Are you doing that here?

Have you run the bisect more than once? Is this maybe an intermittent
problem, and the bisect has landed on the wrong commit?

Thanks,
--
Jeff Layton <[email protected]>

2018-01-08 13:21:07

by Krzysztof Kozlowski

Subject: Re: [PATCH v4 19/19] fs: handle inode->i_version more efficiently

On Mon, Jan 8, 2018 at 1:56 PM, Jeff Layton <[email protected]> wrote:
> On Sun, 2018-01-07 at 13:44 +0100, Krzysztof Kozlowski wrote:
>> On Fri, Dec 22, 2017 at 1:05 PM, Jeff Layton <[email protected]> wrote:
>> > From: Jeff Layton <[email protected]>
>> >
>> > Since i_version is mostly treated as an opaque value, we can exploit that
>> > fact to avoid incrementing it when no one is watching. With that change,
>> > we can avoid incrementing the counter on writes, unless someone has
>> > queried for it since it was last incremented. If the a/c/mtime don't
>> > change, and the i_version hasn't changed, then there's no need to dirty
>> > the inode metadata on a write.
>> >
>> > Convert the i_version counter to an atomic64_t, and use the lowest order
>> > bit to hold a flag that will tell whether anyone has queried the value
>> > since it was last incremented.
>> >
>> > When we go to maybe increment it, we fetch the value and check the flag
>> > bit. If it's clear then we don't need to do anything if the update
>> > isn't being forced.
>> >
>> > If we do need to update, then we increment the counter by 2, and clear
>> > the flag bit, and then use a CAS op to swap it into place. If that
>> > works, we return true. If it doesn't then do it again with the value
>> > that we fetch from the CAS operation.
>> >
>> > On the query side, if the flag is already set, then we just shift the
>> > value down by 1 bit and return it. Otherwise, we set the flag in our
>> > on-stack value and again use cmpxchg to swap it into place if it hasn't
>> > changed. If it has, then we use the value from the cmpxchg as the new
>> > "old" value and try again.
>> >
>> > This method allows us to avoid incrementing the counter on writes (and
>> > dirtying the metadata) under typical workloads. We only need to increment
>> > if it has been queried since it was last changed.
>> >
>> > Signed-off-by: Jeff Layton <[email protected]>
>> > ---
>> > include/linux/fs.h | 2 +-
>> > include/linux/iversion.h | 208 ++++++++++++++++++++++++++++++++++-------------
>> > 2 files changed, 154 insertions(+), 56 deletions(-)
>> >
>>
>> Hi,
>>
>> On recent linux-next my ARM/Exynos boards fail to boot over nfsroot.
>> This commit popped up through bisect (log at the end). Systemd
>> timeouts on some device-specific services, including mounting ext4
>> /home:
>>
>> [ *** ] (1 of 4) A start job is running for…ress polling (1min 41s / no limit)
>> [ TIME ] Timed out waiting for device dev-ttySAC2.device.
>> Jan 07 13:29:38 [DEPEND] Dependency failed for Serial Getty on ttySAC2.
>> Jan 07 13:29:38 [ TIME ] Timed out waiting for device
>> dev-disk-by\x2dlabel-home.device.
>> Jan 07 13:29:38 [DEPEND] Dependency failed for /home.
>> Jan 07 13:29:38 [DEPEND] Dependency failed for Local File Systems.
>> Jan 07 13:29:38 [DEPEND] Dependency failed for File System Check on
>> /dev/disk/by-label/home.
>> Jan 07 13:30:02 [ *** ] (1 of 2) A start job is running for…ress
>> polling (1min 53s / no limit)
>>
>> Kernel command line:
>> console=tty1 console=ttySAC2,115200n8
>> ip=192.168.1.11:192.168.1.10:192.168.1.1:255.255.255.0::eth0:none
>> nfsrootdebug root=/dev/nfs
>> nfsroot=192.168.1.10:/srv/nfs/odroidxu3,vers=4,nolock rootwait rw
>> smsc95xx.macaddr=00:1e:06:61:7a:93 no_console_suspend
>>
>> /home is /dev/mmcblk1p2:
>> kozik@odroidxu3:~$ tune2fs -l /dev/mmcblk1p2
>> tune2fs 1.43.7 (16-Oct-2017)
>> Filesystem volume name: home
>> Last mounted on: /home
>> Filesystem UUID: 3f9dbeba-2738-45d3-807e-c1b2e21128ed
>> Filesystem magic number: 0xEF53
>> Filesystem revision #: 1 (dynamic)
>> Filesystem features: has_journal ext_attr resize_inode dir_index
>> filetype needs_recovery extent flex_bg sparse_super large_file
>> uninit_bg dir_nlink extra_isize
>> Filesystem flags: signed_directory_hash
>> Default mount options: user_xattr acl
>> Filesystem state: clean
>> Errors behavior: Continue
>> Filesystem OS type: Linux
>> Inode count: 1430800
>> Block count: 5717760
>> Reserved block count: 285888
>> Free blocks: 5467576
>> Free inodes: 1428301
>> First block: 0
>> Block size: 4096
>> Fragment size: 4096
>> Reserved GDT blocks: 1022
>> Blocks per group: 32768
>> Fragments per group: 32768
>> Inodes per group: 8176
>> Inode blocks per group: 511
>> Flex block group size: 16
>> Filesystem created: Thu May 21 12:17:05 2015
>> Last mount time: Thu Dec 21 13:31:26 2017
>> Last write time: Thu Dec 21 13:31:26 2017
>> Mount count: 1
>> Maximum mount count: -1
>> Last checked: Thu Dec 21 13:31:25 2017
>> Check interval: 0 (<none>)
>> Lifetime writes: 126 GB
>> Reserved blocks uid: 0 (user root)
>> Reserved blocks gid: 0 (group root)
>> First inode: 11
>> Inode size: 256
>> Required extra isize: 28
>> Desired extra isize: 28
>> Journal inode: 8
>> Default directory hash: half_md4
>> Directory Hash Seed: 42e17e23-86b2-4356-ad63-78aa51651d03
>> Journal backup: inode blocks
>>
>>
>> Full dmesg log:
>> http://www.krzk.eu/#/builders/1/builds/1258/steps/10/logs/serial0
>>
>> The regular boot from rootfs on SD card also fails - but without any
>> serial console logs (just "Starting kernel...") so the real cause is
>> unknown.
>>
>> Any hints?
>>
>> Best regards,
>> Krzysztof
>>
>>
>> bisect log:
>> # bad: [73005e1a35fd67c644b0645c9e4c1efabd0fe62c] Add linux-next
>> specific files for 20180103
>> # good: [30a7acd573899fd8b8ac39236eff6468b195ac7d] Linux 4.15-rc6
>> git bisect start 'next/master' 'next/stable'
>> # bad: [c1d290f9ce8daa2b0a79d2fe48c1b7c3c5370f5a] Merge
>> remote-tracking branch 'crypto/master'
>> git bisect bad c1d290f9ce8daa2b0a79d2fe48c1b7c3c5370f5a
>> # bad: [55695d94f0915121d106cd2d1ab94983a32f3e9a] Merge
>> remote-tracking branch 'hid/for-next'
>> git bisect bad 55695d94f0915121d106cd2d1ab94983a32f3e9a
>> # good: [cffae1eead0dd91be1a3069a8348127bb00158f3] Merge
>> remote-tracking branch 'realtek/for-next'
>> git bisect good cffae1eead0dd91be1a3069a8348127bb00158f3
>> # good: [5f889f1176dc99636c6bf8af7c286decc888c007] Merge
>> remote-tracking branch 'btrfs/next'
>> git bisect good 5f889f1176dc99636c6bf8af7c286decc888c007
>> # good: [984c35877f36bee305e43a1c58176169854d85cf] Merge
>> remote-tracking branch 'xfs/for-next'
>> git bisect good 984c35877f36bee305e43a1c58176169854d85cf
>> # bad: [f9fec502daea2a869232b6dff33ba3de79dd0d61] Merge
>> remote-tracking branch 'printk/for-next'
>> git bisect bad f9fec502daea2a869232b6dff33ba3de79dd0d61
>> # good: [c71d227fc4133f949dae620ed5e3a250b43f2415] make kernel-side
>> POLL... arch-independent
>> git bisect good c71d227fc4133f949dae620ed5e3a250b43f2415
>> # good: [416d20e8c31107f5dfd45d1d80d1e6c8e4871180] Merge branches
>> 'work.get_user_pages_fast', 'work.wmci', 'work.sock_recvmsg',
>> 'misc.netdrv', 'misc.poll', 'work.mqueue', 'work.whack-a-mole' and
>> 'work.misc' into for-next
>> git bisect good 416d20e8c31107f5dfd45d1d80d1e6c8e4871180
>> # good: [325a1de4a691512a48c1426b943a7b0b9f8d6744] xfs: convert to new
>> i_version API
>> git bisect good 325a1de4a691512a48c1426b943a7b0b9f8d6744
>> # good: [a94fe10fb114c169e7ddaecd8251521886409121] checkpatch: add
>> pF/pf deprecation warning
>> git bisect good a94fe10fb114c169e7ddaecd8251521886409121
>> # good: [6b3911dffd1184fdcd63299a5fee59ac000f2067] btrfs: only dirty
>> the inode in btrfs_update_time if something was changed
>> git bisect good 6b3911dffd1184fdcd63299a5fee59ac000f2067
>> # bad: [448f8c749a7a0ae03505823910ec45a112678048] Merge
>> remote-tracking branch 'iversion/iversion-next'
>> git bisect bad 448f8c749a7a0ae03505823910ec45a112678048
>> # bad: [8618bff776758ebff5b55211e7b5a60a0fc119a5] fs: handle
>> inode->i_version more efficiently
>> git bisect bad 8618bff776758ebff5b55211e7b5a60a0fc119a5
>> # first bad commit: [8618bff776758ebff5b55211e7b5a60a0fc119a5] fs:
>> handle inode->i_version more efficiently
>
> That's really strange. I'm afraid I have no idea what could be going on.
>
> With NFS, we really just treat i_version as an opaque value, so I'm not
> sure how this patch in particular would affect anything there. We _do_
> increment it if you have a write delegation in some cases, but not many
> servers hand those out.

The NFS server runs Arch Linux on a Raspberry Pi (so 32-bit, ARMv6):
Linux pi 4.9.70-1-ARCH #1 SMP Mon Dec 18 19:38:00 UTC 2017 armv6l GNU/Linux

The /etc/nfs.conf is default except:
[nfsd]
vers2=n
vers3=n

The client logs for nfsroot mounts are:
Jan 08 14:07:25 :: running hook [net_nfs4]
Jan 08 14:07:25 IP-Config: eth0 hardware address ba:17:70:7e:87:d1 mtu 1500
Jan 08 14:07:25 IP-Config: eth0 guessed broadcast address 192.168.1.255
Jan 08 14:07:25 IP-Config: eth0 complete (from 192.168.1.10):
Jan 08 14:07:25 address: 192.168.1.11 broadcast: 192.168.1.255
netmask: 255.255.255.0
Jan 08 14:07:25 gateway: 192.168.1.1 dns0 : 0.0.0.0 dns1 : 0.0.0.0
Jan 08 14:07:25 rootserver: 192.168.1.10 rootpath:
Jan 08 14:07:25 filename :
Jan 08 14:07:25 NFS-Mount: 192.168.1.10:/srv/nfs/odroidxu3
Jan 08 14:07:25 Waiting 10 seconds for device /dev/nfs ...
Jan 08 14:07:36 ERROR: device '/dev/nfs' not found. Skipping fsck.
Jan 08 14:07:36 Mount cmd:
Jan 08 14:07:36 /opt/tools/buildbot/arch-arm-bin/mount.nfs4 -o
vers=4,nolock 192.168.1.10:/srv/nfs/odroidxu3 /new_root

Only root (/) is from NFS. /home comes from the SD card (/dev/mmcblk1).

> ext4 will only touch the i_version field if you mount it with '-o
> iversion'. Are you doing that here?

The /home is mounted by systemd from /etc/fstab:
LABEL=home /home ext4 defaults 0 2

>
> Have you run the bisect more than once? Is this maybe an intermittent
> problem, and the bisect has landed on the wrong commit?

I just re-ran tests on the surrounding commits (not a full bisect) on
current next (next-20180108):
fbf97ece47d66d22 - works
3da7bcdd695bae43 ("fs: handle inode->i_version more efficiently") -
fails: http://www.krzk.eu/#/builders/1/builds/1267


Best regards,
Krzysztof

2018-01-08 13:30:35

by Matthew Wilcox

Subject: Re: [PATCH v4 19/19] fs: handle inode->i_version more efficiently

On Fri, Dec 22, 2017 at 07:05:56AM -0500, Jeff Layton wrote:
> + cur = inode_peek_iversion_raw(inode);
> + for (;;) {
> + /* If flag is clear then we needn't do anything */
> + if (!force && !(cur & I_VERSION_QUERIED))
> + return false;
> + /* Since lowest bit is flag, add 2 to avoid it */
> + new = (cur & ~I_VERSION_QUERIED) + I_VERSION_INCREMENT;

Isn't this an extraordinarily complicated way of spelling:

new = cur + 1;

We know 'cur' has I_VERSION_QUERIED set, so clearing that bit and adding
two is going to be the same as adding 1 ... right?


2018-01-08 13:29:29

by Jeff Layton

Subject: Re: [PATCH v4 19/19] fs: handle inode->i_version more efficiently

On Mon, 2018-01-08 at 14:21 +0100, Krzysztof Kozlowski wrote:
> On Mon, Jan 8, 2018 at 1:56 PM, Jeff Layton <[email protected]> wrote:
> > On Sun, 2018-01-07 at 13:44 +0100, Krzysztof Kozlowski wrote:
> > > On Fri, Dec 22, 2017 at 1:05 PM, Jeff Layton <[email protected]> wrote:
> > > > From: Jeff Layton <[email protected]>
> > > >
> > > > Since i_version is mostly treated as an opaque value, we can exploit that
> > > > fact to avoid incrementing it when no one is watching. With that change,
> > > > we can avoid incrementing the counter on writes, unless someone has
> > > > queried for it since it was last incremented. If the a/c/mtime don't
> > > > change, and the i_version hasn't changed, then there's no need to dirty
> > > > the inode metadata on a write.
> > > >
> > > > Convert the i_version counter to an atomic64_t, and use the lowest order
> > > > bit to hold a flag that will tell whether anyone has queried the value
> > > > since it was last incremented.
> > > >
> > > > When we go to maybe increment it, we fetch the value and check the flag
> > > > bit. If it's clear then we don't need to do anything if the update
> > > > isn't being forced.
> > > >
> > > > If we do need to update, then we increment the counter by 2, and clear
> > > > the flag bit, and then use a CAS op to swap it into place. If that
> > > > works, we return true. If it doesn't then do it again with the value
> > > > that we fetch from the CAS operation.
> > > >
> > > > On the query side, if the flag is already set, then we just shift the
> > > > value down by 1 bit and return it. Otherwise, we set the flag in our
> > > > on-stack value and again use cmpxchg to swap it into place if it hasn't
> > > > changed. If it has, then we use the value from the cmpxchg as the new
> > > > "old" value and try again.
> > > >
> > > > This method allows us to avoid incrementing the counter on writes (and
> > > > dirtying the metadata) under typical workloads. We only need to increment
> > > > if it has been queried since it was last changed.
> > > >
> > > > Signed-off-by: Jeff Layton <[email protected]>
> > > > ---
> > > > include/linux/fs.h | 2 +-
> > > > include/linux/iversion.h | 208 ++++++++++++++++++++++++++++++++++-------------
> > > > 2 files changed, 154 insertions(+), 56 deletions(-)
> > > >
> > >
> > > Hi,
> > >
> > > On recent linux-next my ARM/Exynos boards fail to boot over nfsroot.
> > > This commit popped up through bisect (log at the end). Systemd
> > > timeouts on some device-specific services, including mounting ext4
> > > /home:
> > >
> > > [ *** ] (1 of 4) A start job is running for…ress polling (1min 41s / no limit)
> > > [ TIME ] Timed out waiting for device dev-ttySAC2.device.
> > > Jan 07 13:29:38 [DEPEND] Dependency failed for Serial Getty on ttySAC2.
> > > Jan 07 13:29:38 [ TIME ] Timed out waiting for device
> > > dev-disk-by\x2dlabel-home.device.
> > > Jan 07 13:29:38 [DEPEND] Dependency failed for /home.
> > > Jan 07 13:29:38 [DEPEND] Dependency failed for Local File Systems.
> > > Jan 07 13:29:38 [DEPEND] Dependency failed for File System Check on
> > > /dev/disk/by-label/home.
> > > Jan 07 13:30:02 [ *** ] (1 of 2) A start job is running for…ress
> > > polling (1min 53s / no limit)
> > >
> > > Kernel command line:
> > > console=tty1 console=ttySAC2,115200n8
> > > ip=192.168.1.11:192.168.1.10:192.168.1.1:255.255.255.0::eth0:none
> > > nfsrootdebug root=/dev/nfs
> > > nfsroot=192.168.1.10:/srv/nfs/odroidxu3,vers=4,nolock rootwait rw
> > > smsc95xx.macaddr=00:1e:06:61:7a:93 no_console_suspend
> > >
> > > /home is /dev/mmcblk1p2:
> > > kozik@odroidxu3:~$ tune2fs -l /dev/mmcblk1p2
> > > tune2fs 1.43.7 (16-Oct-2017)
> > > Filesystem volume name: home
> > > Last mounted on: /home
> > > Filesystem UUID: 3f9dbeba-2738-45d3-807e-c1b2e21128ed
> > > Filesystem magic number: 0xEF53
> > > Filesystem revision #: 1 (dynamic)
> > > Filesystem features: has_journal ext_attr resize_inode dir_index
> > > filetype needs_recovery extent flex_bg sparse_super large_file
> > > uninit_bg dir_nlink extra_isize
> > > Filesystem flags: signed_directory_hash
> > > Default mount options: user_xattr acl
> > > Filesystem state: clean
> > > Errors behavior: Continue
> > > Filesystem OS type: Linux
> > > Inode count: 1430800
> > > Block count: 5717760
> > > Reserved block count: 285888
> > > Free blocks: 5467576
> > > Free inodes: 1428301
> > > First block: 0
> > > Block size: 4096
> > > Fragment size: 4096
> > > Reserved GDT blocks: 1022
> > > Blocks per group: 32768
> > > Fragments per group: 32768
> > > Inodes per group: 8176
> > > Inode blocks per group: 511
> > > Flex block group size: 16
> > > Filesystem created: Thu May 21 12:17:05 2015
> > > Last mount time: Thu Dec 21 13:31:26 2017
> > > Last write time: Thu Dec 21 13:31:26 2017
> > > Mount count: 1
> > > Maximum mount count: -1
> > > Last checked: Thu Dec 21 13:31:25 2017
> > > Check interval: 0 (<none>)
> > > Lifetime writes: 126 GB
> > > Reserved blocks uid: 0 (user root)
> > > Reserved blocks gid: 0 (group root)
> > > First inode: 11
> > > Inode size: 256
> > > Required extra isize: 28
> > > Desired extra isize: 28
> > > Journal inode: 8
> > > Default directory hash: half_md4
> > > Directory Hash Seed: 42e17e23-86b2-4356-ad63-78aa51651d03
> > > Journal backup: inode blocks
> > >
> > >
> > > Full dmesg log:
> > > http://www.krzk.eu/#/builders/1/builds/1258/steps/10/logs/serial0
> > >
> > > The regular boot from rootfs on SD card also fails - but without any
> > > serial console logs (just "Starting kernel...") so the real cause is
> > > unknown.
> > >
> > > Any hints?
> > >
> > > Best regards,
> > > Krzysztof
> > >
> > >
> > > bisect log:
> > > # bad: [73005e1a35fd67c644b0645c9e4c1efabd0fe62c] Add linux-next
> > > specific files for 20180103
> > > # good: [30a7acd573899fd8b8ac39236eff6468b195ac7d] Linux 4.15-rc6
> > > git bisect start 'next/master' 'next/stable'
> > > # bad: [c1d290f9ce8daa2b0a79d2fe48c1b7c3c5370f5a] Merge
> > > remote-tracking branch 'crypto/master'
> > > git bisect bad c1d290f9ce8daa2b0a79d2fe48c1b7c3c5370f5a
> > > # bad: [55695d94f0915121d106cd2d1ab94983a32f3e9a] Merge
> > > remote-tracking branch 'hid/for-next'
> > > git bisect bad 55695d94f0915121d106cd2d1ab94983a32f3e9a
> > > # good: [cffae1eead0dd91be1a3069a8348127bb00158f3] Merge
> > > remote-tracking branch 'realtek/for-next'
> > > git bisect good cffae1eead0dd91be1a3069a8348127bb00158f3
> > > # good: [5f889f1176dc99636c6bf8af7c286decc888c007] Merge
> > > remote-tracking branch 'btrfs/next'
> > > git bisect good 5f889f1176dc99636c6bf8af7c286decc888c007
> > > # good: [984c35877f36bee305e43a1c58176169854d85cf] Merge
> > > remote-tracking branch 'xfs/for-next'
> > > git bisect good 984c35877f36bee305e43a1c58176169854d85cf
> > > # bad: [f9fec502daea2a869232b6dff33ba3de79dd0d61] Merge
> > > remote-tracking branch 'printk/for-next'
> > > git bisect bad f9fec502daea2a869232b6dff33ba3de79dd0d61
> > > # good: [c71d227fc4133f949dae620ed5e3a250b43f2415] make kernel-side
> > > POLL... arch-independent
> > > git bisect good c71d227fc4133f949dae620ed5e3a250b43f2415
> > > # good: [416d20e8c31107f5dfd45d1d80d1e6c8e4871180] Merge branches
> > > 'work.get_user_pages_fast', 'work.wmci', 'work.sock_recvmsg',
> > > 'misc.netdrv', 'misc.poll', 'work.mqueue', 'work.whack-a-mole' and
> > > 'work.misc' into for-next
> > > git bisect good 416d20e8c31107f5dfd45d1d80d1e6c8e4871180
> > > # good: [325a1de4a691512a48c1426b943a7b0b9f8d6744] xfs: convert to new
> > > i_version API
> > > git bisect good 325a1de4a691512a48c1426b943a7b0b9f8d6744
> > > # good: [a94fe10fb114c169e7ddaecd8251521886409121] checkpatch: add
> > > pF/pf deprecation warning
> > > git bisect good a94fe10fb114c169e7ddaecd8251521886409121
> > > # good: [6b3911dffd1184fdcd63299a5fee59ac000f2067] btrfs: only dirty
> > > the inode in btrfs_update_time if something was changed
> > > git bisect good 6b3911dffd1184fdcd63299a5fee59ac000f2067
> > > # bad: [448f8c749a7a0ae03505823910ec45a112678048] Merge
> > > remote-tracking branch 'iversion/iversion-next'
> > > git bisect bad 448f8c749a7a0ae03505823910ec45a112678048
> > > # bad: [8618bff776758ebff5b55211e7b5a60a0fc119a5] fs: handle
> > > inode->i_version more efficiently
> > > git bisect bad 8618bff776758ebff5b55211e7b5a60a0fc119a5
> > > # first bad commit: [8618bff776758ebff5b55211e7b5a60a0fc119a5] fs:
> > > handle inode->i_version more efficiently
> >
> > That's really strange. I'm afraid I have no idea what could be going on.
> >
> > With NFS, we really just treat i_version as an opaque value, so I'm not
> > sure how this patch in particular would affect anything there. We _do_
> > increment it if you have a write delegation in some cases, but not many
> > servers hand those out.
>
> About the NFS server, Arch Linux on Raspberry Pi (so 32-bit, ARMv6):
> Linux pi 4.9.70-1-ARCH #1 SMP Mon Dec 18 19:38:00 UTC 2017 armv6l GNU/Linux
>
> The /etc/nfs.conf is default except:
> [nfsd]
> vers2=n
> vers3=n
>
> The client logs for nfsroot mounts are:
> Jan 08 14:07:25 :: running hook [net_nfs4]
> Jan 08 14:07:25 IP-Config: eth0 hardware address ba:17:70:7e:87:d1 mtu 1500
> Jan 08 14:07:25 IP-Config: eth0 guessed broadcast address 192.168.1.255
> Jan 08 14:07:25 IP-Config: eth0 complete (from 192.168.1.10):
> Jan 08 14:07:25 address: 192.168.1.11 broadcast: 192.168.1.255
> netmask: 255.255.255.0
> Jan 08 14:07:25 gateway: 192.168.1.1 dns0 : 0.0.0.0 dns1 : 0.0.0.0
> Jan 08 14:07:25 rootserver: 192.168.1.10 rootpath:
> Jan 08 14:07:25 filename :
> Jan 08 14:07:25 NFS-Mount: 192.168.1.10:/srv/nfs/odroidxu3
> Jan 08 14:07:25 Waiting 10 seconds for device /dev/nfs ...
> Jan 08 14:07:36 ERROR: device '/dev/nfs' not found. Skipping fsck.
> Jan 08 14:07:36 Mount cmd:
> Jan 08 14:07:36 /opt/tools/buildbot/arch-arm-bin/mount.nfs4 -o
> vers=4,nolock 192.168.1.10:/srv/nfs/odroidxu3 /new_root
>
> Only root (/) is from NFS. /home comes from the SD card (/dev/mmcblk1).
>
> > ext4 will only touch the i_version field if you mount it with '-o
> > iversion'. Are you doing that here?
>
> The /home is mounted by systemd from /etc/fstab:
> LABEL=home /home ext4 defaults 0 2
>
> >
> > Have you run the bisect more than once? Is this maybe an intermittent
> > problem, and the bisect has landed on the wrong commit?
>
> I just re-ran tests on the surrounding commits (not a full bisect) on
> current next (next-20180108):
> fbf97ece47d66d22 - works
> 3da7bcdd695bae43 ("fs: handle inode->i_version more efficiently") -
> fails: http://www.krzk.eu/#/builders/1/builds/1267
>
>
> Best regards,
> Krzysztof

Ok, thanks. If you're seeing hangs then that might imply that we have
some sort of excessive looping going on in the cmpxchg loops.

Could you apply the patch below and let me know if it causes either of
the warnings to pop? That might at least point us in the right
direction:

--------------------8<----------------------

[PATCH] DEBUG: throw warning if we are looping excessively in new
iversion code

Signed-off-by: Jeff Layton <[email protected]>
---
include/linux/iversion.h | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/include/linux/iversion.h b/include/linux/iversion.h
index 36ca56005c36..fe92bcc29f4c 100644
--- a/include/linux/iversion.h
+++ b/include/linux/iversion.h
@@ -168,6 +168,7 @@ static inline bool
 inode_maybe_inc_iversion(struct inode *inode, bool force)
 {
 	u64 cur, old, new;
+	int i = 0;

 	/*
 	 * The i_version field is not strictly ordered with any other inode
@@ -193,6 +194,8 @@ inode_maybe_inc_iversion(struct inode *inode, bool force)
 		if (likely(old == cur))
 			break;
 		cur = old;
+		if (++i > 1000)
+			WARN_ONCE(1, "Too much looping!");
 	}
 	return true;
 }
@@ -258,6 +261,7 @@ static inline u64
 inode_query_iversion(struct inode *inode)
 {
 	u64 cur, old, new;
+	int i = 0;

 	cur = inode_peek_iversion_raw(inode);
 	for (;;) {
@@ -277,6 +281,8 @@ inode_query_iversion(struct inode *inode)
 		if (likely(old == cur))
 			break;
 		cur = old;
+		if (++i > 1000)
+			WARN_ONCE(1, "Too much looping!");
 	}
 	return cur >> I_VERSION_QUERIED_SHIFT;
 }
--
2.14.3
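
For anyone who wants to poke at the two loops outside the kernel, here is a
rough userspace model of them. It is only a sketch: it uses C11
<stdatomic.h> in place of atomic64_cmpxchg(), it leaves out the memory
barriers, and the model_* names, the MODEL_* macros and the "retries"
out-parameter are made up for illustration. The bit layout (flag in the
lowest bit, counter bumped by 2) is assumed from the patch description.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: bit 0 is the "queried" flag, the counter lives above it. */
#define MODEL_QUERIED_SHIFT	1
#define MODEL_QUERIED		(1ULL << (MODEL_QUERIED_SHIFT - 1))
#define MODEL_INCREMENT		(1ULL << MODEL_QUERIED_SHIFT)

/*
 * Model of the increment side. Returns true if the counter was bumped.
 * *retries reports how many times the compare-and-swap had to be redone,
 * which is the quantity the debug patch warns about.
 */
static inline bool model_maybe_inc(_Atomic uint64_t *v, bool force,
				   unsigned int *retries)
{
	uint64_t cur, next;

	assert(v && retries);
	*retries = 0;
	cur = atomic_load(v);
	for (;;) {
		/* Nobody looked since the last bump: nothing to do. */
		if (!force && !(cur & MODEL_QUERIED))
			return false;
		/* Clear the flag and add 2 so the flag bit is skipped. */
		next = (cur & ~MODEL_QUERIED) + MODEL_INCREMENT;
		/* On failure, cur is refreshed with the value actually found. */
		if (atomic_compare_exchange_strong(v, &cur, next))
			return true;
		++*retries;
	}
}

/* Model of the query side. Sets the flag if needed, returns the counter. */
static inline uint64_t model_query(_Atomic uint64_t *v, unsigned int *retries)
{
	uint64_t cur, next;

	assert(v && retries);
	*retries = 0;
	cur = atomic_load(v);
	for (;;) {
		/* Flag already set: just report the counter. */
		if (cur & MODEL_QUERIED)
			break;
		next = cur | MODEL_QUERIED;
		if (atomic_compare_exchange_strong(v, &cur, next)) {
			cur = next;
			break;
		}
		++*retries;
	}
	return cur >> MODEL_QUERIED_SHIFT;
}
```

In this model a retry only happens when another thread changed the value
between the load and the compare-and-swap, so with a single thread the
retry count stays at zero. A count that keeps growing would suggest the
compare-and-swap is not behaving as the loop expects.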


2018-01-08 13:46:29

by Jeff Layton

Subject: Re: [PATCH v4 19/19] fs: handle inode->i_version more efficiently

On Mon, 2018-01-08 at 05:30 -0800, Matthew Wilcox wrote:
> On Fri, Dec 22, 2017 at 07:05:56AM -0500, Jeff Layton wrote:
> > + cur = inode_peek_iversion_raw(inode);
> > + for (;;) {
> > + /* If flag is clear then we needn't do anything */
> > + if (!force && !(cur & I_VERSION_QUERIED))
> > + return false;
> > + /* Since lowest bit is flag, add 2 to avoid it */
> > + new = (cur & ~I_VERSION_QUERIED) + I_VERSION_INCREMENT;
>
> Isn't this an extraordinarily complicated way of spelling:
>
> new = cur + 1;
>
> We know 'cur' has I_VERSION_QUERIED set, so clearing that bit and adding
> two is going to be the same as adding 1 ... right?
>

It would be, but if "force" is true, then I_VERSION_QUERIED may not be
set.
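
To make the two cases concrete, here is a small standalone C sketch that
compares the expression from the patch with the proposed "cur + 1". The
macro values are assumed from the comment in the quoted hunk (flag in the
lowest bit, increment of 2), and the helper names are hypothetical, not
part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values: flag in bit 0, counter starts at bit 1. */
#define I_VERSION_QUERIED_SHIFT	1
#define I_VERSION_QUERIED	(1ULL << (I_VERSION_QUERIED_SHIFT - 1))
#define I_VERSION_INCREMENT	(1ULL << I_VERSION_QUERIED_SHIFT)

/* The increment must step over the flag bit. */
static_assert(I_VERSION_INCREMENT == 2 * I_VERSION_QUERIED,
	      "increment must skip the flag bit");

/* The expression used in the patch: clear the flag, then add 2. */
static inline uint64_t bump_masked(uint64_t cur)
{
	return (cur & ~I_VERSION_QUERIED) + I_VERSION_INCREMENT;
}

/* The proposed shorthand. */
static inline uint64_t bump_plus_one(uint64_t cur)
{
	return cur + 1;
}

/* The counter value a caller would see for a given raw value. */
static inline uint64_t counter_of(uint64_t raw)
{
	return raw >> I_VERSION_QUERIED_SHIFT;
}
```

With the flag set (raw value 11, counter 5) both helpers return 12, so the
two spellings agree. With the flag clear (raw value 10, counter 5), which
can happen when "force" is true, bump_masked() returns 12 (counter 6) while
bump_plus_one() returns 11, which only sets the flag and leaves the counter
at 5.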
--
Jeff Layton <[email protected]>

2018-01-08 17:29:34

by Krzysztof Kozlowski

Subject: Re: [PATCH v4 19/19] fs: handle inode->i_version more efficiently

On Mon, Jan 08, 2018 at 08:29:24AM -0500, Jeff Layton wrote:
> On Mon, 2018-01-08 at 14:21 +0100, Krzysztof Kozlowski wrote:
> > On Mon, Jan 8, 2018 at 1:56 PM, Jeff Layton <[email protected]> wrote:
> > > On Sun, 2018-01-07 at 13:44 +0100, Krzysztof Kozlowski wrote:
> > > > On Fri, Dec 22, 2017 at 1:05 PM, Jeff Layton <[email protected]> wrote:
> > > > > From: Jeff Layton <[email protected]>
> > > > >
> > > > > Since i_version is mostly treated as an opaque value, we can exploit that
> > > > > fact to avoid incrementing it when no one is watching. With that change,
> > > > > we can avoid incrementing the counter on writes, unless someone has
> > > > > queried for it since it was last incremented. If the a/c/mtime don't
> > > > > change, and the i_version hasn't changed, then there's no need to dirty
> > > > > the inode metadata on a write.
> > > > >
> > > > > Convert the i_version counter to an atomic64_t, and use the lowest order
> > > > > bit to hold a flag that will tell whether anyone has queried the value
> > > > > since it was last incremented.
> > > > >
> > > > > When we go to maybe increment it, we fetch the value and check the flag
> > > > > bit. If it's clear then we don't need to do anything if the update
> > > > > isn't being forced.
> > > > >
> > > > > If we do need to update, then we increment the counter by 2, and clear
> > > > > the flag bit, and then use a CAS op to swap it into place. If that
> > > > > works, we return true. If it doesn't then do it again with the value
> > > > > that we fetch from the CAS operation.
> > > > >
> > > > > On the query side, if the flag is already set, then we just shift the
> > > > > value down by 1 bit and return it. Otherwise, we set the flag in our
> > > > > on-stack value and again use cmpxchg to swap it into place if it hasn't
> > > > > changed. If it has, then we use the value from the cmpxchg as the new
> > > > > "old" value and try again.
> > > > >
> > > > > This method allows us to avoid incrementing the counter on writes (and
> > > > > dirtying the metadata) under typical workloads. We only need to increment
> > > > > if it has been queried since it was last changed.
> > > > >
> > > > > Signed-off-by: Jeff Layton <[email protected]>
> > > > > ---
> > > > > include/linux/fs.h | 2 +-
> > > > > include/linux/iversion.h | 208 ++++++++++++++++++++++++++++++++++-------------
> > > > > 2 files changed, 154 insertions(+), 56 deletions(-)
> > > > >
> > > >
> > > > Hi,
> > > >
> > > > On recent linux-next my ARM/Exynos boards fail to boot over nfsroot.
> > > > This commit popped up through bisect (log at the end). Systemd
> > > > timeouts on some device-specific services, including mounting ext4
> > > > /home:
> > > >
> > > > [ *** ] (1 of 4) A start job is running for…ress polling (1min 41s / no limit)
> > > > [ TIME ] Timed out waiting for device dev-ttySAC2.device.
> > > > Jan 07 13:29:38 [DEPEND] Dependency failed for Serial Getty on ttySAC2.
> > > > Jan 07 13:29:38 [ TIME ] Timed out waiting for device
> > > > dev-disk-by\x2dlabel-home.device.
> > > > Jan 07 13:29:38 [DEPEND] Dependency failed for /home.
> > > > Jan 07 13:29:38 [DEPEND] Dependency failed for Local File Systems.
> > > > Jan 07 13:29:38 [DEPEND] Dependency failed for File System Check on
> > > > /dev/disk/by-label/home.
> > > > Jan 07 13:30:02 [ *** ] (1 of 2) A start job is running for…ress
> > > > polling (1min 53s / no limit)
> > > >
> > > > Kernel command line:
> > > > console=tty1 console=ttySAC2,115200n8
> > > > ip=192.168.1.11:192.168.1.10:192.168.1.1:255.255.255.0::eth0:none
> > > > nfsrootdebug root=/dev/nfs
> > > > nfsroot=192.168.1.10:/srv/nfs/odroidxu3,vers=4,nolock rootwait rw
> > > > smsc95xx.macaddr=00:1e:06:61:7a:93 no_console_suspend
> > > >
> > > > /home is /dev/mmcblk1p2:
> > > > kozik@odroidxu3:~$ tune2fs -l /dev/mmcblk1p2
> > > > tune2fs 1.43.7 (16-Oct-2017)
> > > > Filesystem volume name: home
> > > > Last mounted on: /home
> > > > Filesystem UUID: 3f9dbeba-2738-45d3-807e-c1b2e21128ed
> > > > Filesystem magic number: 0xEF53
> > > > Filesystem revision #: 1 (dynamic)
> > > > Filesystem features: has_journal ext_attr resize_inode dir_index
> > > > filetype needs_recovery extent flex_bg sparse_super large_file
> > > > uninit_bg dir_nlink extra_isize
> > > > Filesystem flags: signed_directory_hash
> > > > Default mount options: user_xattr acl
> > > > Filesystem state: clean
> > > > Errors behavior: Continue
> > > > Filesystem OS type: Linux
> > > > Inode count: 1430800
> > > > Block count: 5717760
> > > > Reserved block count: 285888
> > > > Free blocks: 5467576
> > > > Free inodes: 1428301
> > > > First block: 0
> > > > Block size: 4096
> > > > Fragment size: 4096
> > > > Reserved GDT blocks: 1022
> > > > Blocks per group: 32768
> > > > Fragments per group: 32768
> > > > Inodes per group: 8176
> > > > Inode blocks per group: 511
> > > > Flex block group size: 16
> > > > Filesystem created: Thu May 21 12:17:05 2015
> > > > Last mount time: Thu Dec 21 13:31:26 2017
> > > > Last write time: Thu Dec 21 13:31:26 2017
> > > > Mount count: 1
> > > > Maximum mount count: -1
> > > > Last checked: Thu Dec 21 13:31:25 2017
> > > > Check interval: 0 (<none>)
> > > > Lifetime writes: 126 GB
> > > > Reserved blocks uid: 0 (user root)
> > > > Reserved blocks gid: 0 (group root)
> > > > First inode: 11
> > > > Inode size: 256
> > > > Required extra isize: 28
> > > > Desired extra isize: 28
> > > > Journal inode: 8
> > > > Default directory hash: half_md4
> > > > Directory Hash Seed: 42e17e23-86b2-4356-ad63-78aa51651d03
> > > > Journal backup: inode blocks
> > > >
> > > >
> > > > Full dmesg log:
> > > > http://www.krzk.eu/#/builders/1/builds/1258/steps/10/logs/serial0
> > > >
> > > > The regular boot from rootfs on SD card also fails - but without any
> > > > serial console logs (just "Starting kernel...") so the real cause is
> > > > unknown.
> > > >
> > > > Any hints?
> > > >
> > > > Best regards,
> > > > Krzysztof
> > > >
> > > >
> > > > bisect log:
> > > > # bad: [73005e1a35fd67c644b0645c9e4c1efabd0fe62c] Add linux-next
> > > > specific files for 20180103
> > > > # good: [30a7acd573899fd8b8ac39236eff6468b195ac7d] Linux 4.15-rc6
> > > > git bisect start 'next/master' 'next/stable'
> > > > # bad: [c1d290f9ce8daa2b0a79d2fe48c1b7c3c5370f5a] Merge
> > > > remote-tracking branch 'crypto/master'
> > > > git bisect bad c1d290f9ce8daa2b0a79d2fe48c1b7c3c5370f5a
> > > > # bad: [55695d94f0915121d106cd2d1ab94983a32f3e9a] Merge
> > > > remote-tracking branch 'hid/for-next'
> > > > git bisect bad 55695d94f0915121d106cd2d1ab94983a32f3e9a
> > > > # good: [cffae1eead0dd91be1a3069a8348127bb00158f3] Merge
> > > > remote-tracking branch 'realtek/for-next'
> > > > git bisect good cffae1eead0dd91be1a3069a8348127bb00158f3
> > > > # good: [5f889f1176dc99636c6bf8af7c286decc888c007] Merge
> > > > remote-tracking branch 'btrfs/next'
> > > > git bisect good 5f889f1176dc99636c6bf8af7c286decc888c007
> > > > # good: [984c35877f36bee305e43a1c58176169854d85cf] Merge
> > > > remote-tracking branch 'xfs/for-next'
> > > > git bisect good 984c35877f36bee305e43a1c58176169854d85cf
> > > > # bad: [f9fec502daea2a869232b6dff33ba3de79dd0d61] Merge
> > > > remote-tracking branch 'printk/for-next'
> > > > git bisect bad f9fec502daea2a869232b6dff33ba3de79dd0d61
> > > > # good: [c71d227fc4133f949dae620ed5e3a250b43f2415] make kernel-side
> > > > POLL... arch-independent
> > > > git bisect good c71d227fc4133f949dae620ed5e3a250b43f2415
> > > > # good: [416d20e8c31107f5dfd45d1d80d1e6c8e4871180] Merge branches
> > > > 'work.get_user_pages_fast', 'work.wmci', 'work.sock_recvmsg',
> > > > 'misc.netdrv', 'misc.poll', 'work.mqueue', 'work.whack-a-mole' and
> > > > 'work.misc' into for-next
> > > > git bisect good 416d20e8c31107f5dfd45d1d80d1e6c8e4871180
> > > > # good: [325a1de4a691512a48c1426b943a7b0b9f8d6744] xfs: convert to new
> > > > i_version API
> > > > git bisect good 325a1de4a691512a48c1426b943a7b0b9f8d6744
> > > > # good: [a94fe10fb114c169e7ddaecd8251521886409121] checkpatch: add
> > > > pF/pf deprecation warning
> > > > git bisect good a94fe10fb114c169e7ddaecd8251521886409121
> > > > # good: [6b3911dffd1184fdcd63299a5fee59ac000f2067] btrfs: only dirty
> > > > the inode in btrfs_update_time if something was changed
> > > > git bisect good 6b3911dffd1184fdcd63299a5fee59ac000f2067
> > > > # bad: [448f8c749a7a0ae03505823910ec45a112678048] Merge
> > > > remote-tracking branch 'iversion/iversion-next'
> > > > git bisect bad 448f8c749a7a0ae03505823910ec45a112678048
> > > > # bad: [8618bff776758ebff5b55211e7b5a60a0fc119a5] fs: handle
> > > > inode->i_version more efficiently
> > > > git bisect bad 8618bff776758ebff5b55211e7b5a60a0fc119a5
> > > > # first bad commit: [8618bff776758ebff5b55211e7b5a60a0fc119a5] fs:
> > > > handle inode->i_version more efficiently
> > >
> > > That's really strange. I'm afraid I have no idea what could be going on.
> > >
> > > With NFS, we really just treat i_version as an opaque value, so I'm not
> > > sure how this patch in particular would affect anything there. We _do_
> > > increment it if you have a write delegation in some cases, but not many
> > > servers hand those out.
> >
> > About the NFS server, Arch Linux on Raspberry Pi (so 32-bit, ARMv6):
> > Linux pi 4.9.70-1-ARCH #1 SMP Mon Dec 18 19:38:00 UTC 2017 armv6l GNU/Linux
> >
> > The /etc/nfs.conf is default except:
> > [nfsd]
> > vers2=n
> > vers3=n
> >
> > The client logs for nfsroot mounts are:
> > Jan 08 14:07:25 :: running hook [net_nfs4]
> > Jan 08 14:07:25 IP-Config: eth0 hardware address ba:17:70:7e:87:d1 mtu 1500
> > Jan 08 14:07:25 IP-Config: eth0 guessed broadcast address 192.168.1.255
> > Jan 08 14:07:25 IP-Config: eth0 complete (from 192.168.1.10):
> > Jan 08 14:07:25 address: 192.168.1.11 broadcast: 192.168.1.255
> > netmask: 255.255.255.0
> > Jan 08 14:07:25 gateway: 192.168.1.1 dns0 : 0.0.0.0 dns1 : 0.0.0.0
> > Jan 08 14:07:25 rootserver: 192.168.1.10 rootpath:
> > Jan 08 14:07:25 filename :
> > Jan 08 14:07:25 NFS-Mount: 192.168.1.10:/srv/nfs/odroidxu3
> > Jan 08 14:07:25 Waiting 10 seconds for device /dev/nfs ...
> > Jan 08 14:07:36 ERROR: device '/dev/nfs' not found. Skipping fsck.
> > Jan 08 14:07:36 Mount cmd:
> > Jan 08 14:07:36 /opt/tools/buildbot/arch-arm-bin/mount.nfs4 -o
> > vers=4,nolock 192.168.1.10:/srv/nfs/odroidxu3 /new_root
> >
> > Only root (/) is from NFS. /home comes from the SD card (/dev/mmcblk1).
> >
> > > ext4 will only touch the i_version field if you mount it with '-o
> > > iversion'. Are you doing that here?
> >
> > The /home is mounted by systemd from /etc/fstab:
> > LABEL=home /home ext4 defaults 0 2
> >
> > >
> > > Have you run the bisect more than once? Is this maybe an intermittent
> > > problem, and the bisect has landed on the wrong commit?
> >
> > I just re-ran tests on the surrounding commits (but not a full bisect) on
> > current next (next-20180108):
> > fbf97ece47d66d22 - works
> > 3da7bcdd695bae43 ("fs: handle inode->i_version more efficiently") -
> > fails: http://www.krzk.eu/#/builders/1/builds/1267
> >
> >
> > Best regards,
> > Krzysztof
>
> Ok, thanks. If you're seeing hangs then that might imply that we have
> some sort of excessive looping going on in the cmpxchg loops.
>
> Could you apply the patch below and let me know if it causes either of
> the warnings to pop? That might at least point us in the right
> direction:

No new warnings with the attached patch (except the already existing lockdep one:
"INFO: trying to register non-static key.").

Systemd times out mounting /home, but after entering the rescue shell there
is no problem running mount /home:
Give root password for maintenance
(or press Control-D to continue):
root@odroidxu3:~# mount /home
[ 220.659331] EXT4-fs (mmcblk1p2): mounted filesystem with ordered data mode. Opts: (null)

Best regards,
Krzysztof


2018-01-08 18:00:24

by Jeff Layton

Subject: Re: [PATCH v4 19/19] fs: handle inode->i_version more efficiently

On Mon, 2018-01-08 at 18:29 +0100, Krzysztof Kozlowski wrote:
> On Mon, Jan 08, 2018 at 08:29:24AM -0500, Jeff Layton wrote:
> > On Mon, 2018-01-08 at 14:21 +0100, Krzysztof Kozlowski wrote:
> > > On Mon, Jan 8, 2018 at 1:56 PM, Jeff Layton <[email protected]> wrote:
> > > > On Sun, 2018-01-07 at 13:44 +0100, Krzysztof Kozlowski wrote:
> > > > > On Fri, Dec 22, 2017 at 1:05 PM, Jeff Layton <[email protected]> wrote:
> > > > > > From: Jeff Layton <[email protected]>
> > > > > >
> > > > > > Since i_version is mostly treated as an opaque value, we can exploit that
> > > > > > fact to avoid incrementing it when no one is watching. With that change,
> > > > > > we can avoid incrementing the counter on writes, unless someone has
> > > > > > queried for it since it was last incremented. If the a/c/mtime don't
> > > > > > change, and the i_version hasn't changed, then there's no need to dirty
> > > > > > the inode metadata on a write.
> > > > > >
> > > > > > Convert the i_version counter to an atomic64_t, and use the lowest order
> > > > > > bit to hold a flag that will tell whether anyone has queried the value
> > > > > > since it was last incremented.
> > > > > >
> > > > > > When we go to maybe increment it, we fetch the value and check the flag
> > > > > > bit. If it's clear then we don't need to do anything if the update
> > > > > > isn't being forced.
> > > > > >
> > > > > > If we do need to update, then we increment the counter by 2, and clear
> > > > > > the flag bit, and then use a CAS op to swap it into place. If that
> > > > > > works, we return true. If it doesn't then do it again with the value
> > > > > > that we fetch from the CAS operation.
> > > > > >
> > > > > > On the query side, if the flag is already set, then we just shift the
> > > > > > value down by 1 bit and return it. Otherwise, we set the flag in our
> > > > > > on-stack value and again use cmpxchg to swap it into place if it hasn't
> > > > > > changed. If it has, then we use the value from the cmpxchg as the new
> > > > > > "old" value and try again.
> > > > > >
> > > > > > This method allows us to avoid incrementing the counter on writes (and
> > > > > > dirtying the metadata) under typical workloads. We only need to increment
> > > > > > if it has been queried since it was last changed.
> > > > > >
> > > > > > Signed-off-by: Jeff Layton <[email protected]>
> > > > > > ---
> > > > > > include/linux/fs.h | 2 +-
> > > > > > include/linux/iversion.h | 208 ++++++++++++++++++++++++++++++++++-------------
> > > > > > 2 files changed, 154 insertions(+), 56 deletions(-)
> > > > > >
> > > > >
> > > > > Hi,
> > > > >
> > > > > On recent linux-next my ARM/Exynos boards fail to boot over nfsroot.
> > > > > This commit popped up through bisect (log at the end). Systemd
> > > > > timeouts on some device-specific services, including mounting ext4
> > > > > /home:
> > > > >
> > > > > [ *** ] (1 of 4) A start job is running for…ress polling (1min 41s / no limit)
> > > > > [ TIME ] Timed out waiting for device dev-ttySAC2.device.
> > > > > Jan 07 13:29:38 [DEPEND] Dependency failed for Serial Getty on ttySAC2.
> > > > > Jan 07 13:29:38 [ TIME ] Timed out waiting for device
> > > > > dev-disk-by\x2dlabel-home.device.
> > > > > Jan 07 13:29:38 [DEPEND] Dependency failed for /home.
> > > > > Jan 07 13:29:38 [DEPEND] Dependency failed for Local File Systems.
> > > > > Jan 07 13:29:38 [DEPEND] Dependency failed for File System Check on
> > > > > /dev/disk/by-label/home.
> > > > > Jan 07 13:30:02 [ *** ] (1 of 2) A start job is running for…ress
> > > > > polling (1min 53s / no limit)
> > > > >
> > > > > Kernel command line:
> > > > > console=tty1 console=ttySAC2,115200n8
> > > > > ip=192.168.1.11:192.168.1.10:192.168.1.1:255.255.255.0::eth0:none
> > > > > nfsrootdebug root=/dev/nfs
> > > > > nfsroot=192.168.1.10:/srv/nfs/odroidxu3,vers=4,nolock rootwait rw
> > > > > smsc95xx.macaddr=00:1e:06:61:7a:93 no_console_suspend
> > > > >
> > > > > /home is /dev/mmcblk1p2:
> > > > > kozik@odroidxu3:~$ tune2fs -l /dev/mmcblk1p2
> > > > > tune2fs 1.43.7 (16-Oct-2017)
> > > > > Filesystem volume name: home
> > > > > Last mounted on: /home
> > > > > Filesystem UUID: 3f9dbeba-2738-45d3-807e-c1b2e21128ed
> > > > > Filesystem magic number: 0xEF53
> > > > > Filesystem revision #: 1 (dynamic)
> > > > > Filesystem features: has_journal ext_attr resize_inode dir_index
> > > > > filetype needs_recovery extent flex_bg sparse_super large_file
> > > > > uninit_bg dir_nlink extra_isize
> > > > > Filesystem flags: signed_directory_hash
> > > > > Default mount options: user_xattr acl
> > > > > Filesystem state: clean
> > > > > Errors behavior: Continue
> > > > > Filesystem OS type: Linux
> > > > > Inode count: 1430800
> > > > > Block count: 5717760
> > > > > Reserved block count: 285888
> > > > > Free blocks: 5467576
> > > > > Free inodes: 1428301
> > > > > First block: 0
> > > > > Block size: 4096
> > > > > Fragment size: 4096
> > > > > Reserved GDT blocks: 1022
> > > > > Blocks per group: 32768
> > > > > Fragments per group: 32768
> > > > > Inodes per group: 8176
> > > > > Inode blocks per group: 511
> > > > > Flex block group size: 16
> > > > > Filesystem created: Thu May 21 12:17:05 2015
> > > > > Last mount time: Thu Dec 21 13:31:26 2017
> > > > > Last write time: Thu Dec 21 13:31:26 2017
> > > > > Mount count: 1
> > > > > Maximum mount count: -1
> > > > > Last checked: Thu Dec 21 13:31:25 2017
> > > > > Check interval: 0 (<none>)
> > > > > Lifetime writes: 126 GB
> > > > > Reserved blocks uid: 0 (user root)
> > > > > Reserved blocks gid: 0 (group root)
> > > > > First inode: 11
> > > > > Inode size: 256
> > > > > Required extra isize: 28
> > > > > Desired extra isize: 28
> > > > > Journal inode: 8
> > > > > Default directory hash: half_md4
> > > > > Directory Hash Seed: 42e17e23-86b2-4356-ad63-78aa51651d03
> > > > > Journal backup: inode blocks
> > > > >
> > > > >
> > > > > Full dmesg log:
> > > > > http://www.krzk.eu/#/builders/1/builds/1258/steps/10/logs/serial0
> > > > >
> > > > > The regular boot from rootfs on SD card also fails - but without any
> > > > > serial console logs (just "Starting kernel...") so the real cause is
> > > > > unknown.
> > > > >
> > > > > Any hints?
> > > > >
> > > > > Best regards,
> > > > > Krzysztof
> > > > >
> > > > >
> > > > > bisect log:
> > > > > # bad: [73005e1a35fd67c644b0645c9e4c1efabd0fe62c] Add linux-next
> > > > > specific files for 20180103
> > > > > # good: [30a7acd573899fd8b8ac39236eff6468b195ac7d] Linux 4.15-rc6
> > > > > git bisect start 'next/master' 'next/stable'
> > > > > # bad: [c1d290f9ce8daa2b0a79d2fe48c1b7c3c5370f5a] Merge
> > > > > remote-tracking branch 'crypto/master'
> > > > > git bisect bad c1d290f9ce8daa2b0a79d2fe48c1b7c3c5370f5a
> > > > > # bad: [55695d94f0915121d106cd2d1ab94983a32f3e9a] Merge
> > > > > remote-tracking branch 'hid/for-next'
> > > > > git bisect bad 55695d94f0915121d106cd2d1ab94983a32f3e9a
> > > > > # good: [cffae1eead0dd91be1a3069a8348127bb00158f3] Merge
> > > > > remote-tracking branch 'realtek/for-next'
> > > > > git bisect good cffae1eead0dd91be1a3069a8348127bb00158f3
> > > > > # good: [5f889f1176dc99636c6bf8af7c286decc888c007] Merge
> > > > > remote-tracking branch 'btrfs/next'
> > > > > git bisect good 5f889f1176dc99636c6bf8af7c286decc888c007
> > > > > # good: [984c35877f36bee305e43a1c58176169854d85cf] Merge
> > > > > remote-tracking branch 'xfs/for-next'
> > > > > git bisect good 984c35877f36bee305e43a1c58176169854d85cf
> > > > > # bad: [f9fec502daea2a869232b6dff33ba3de79dd0d61] Merge
> > > > > remote-tracking branch 'printk/for-next'
> > > > > git bisect bad f9fec502daea2a869232b6dff33ba3de79dd0d61
> > > > > # good: [c71d227fc4133f949dae620ed5e3a250b43f2415] make kernel-side
> > > > > POLL... arch-independent
> > > > > git bisect good c71d227fc4133f949dae620ed5e3a250b43f2415
> > > > > # good: [416d20e8c31107f5dfd45d1d80d1e6c8e4871180] Merge branches
> > > > > 'work.get_user_pages_fast', 'work.wmci', 'work.sock_recvmsg',
> > > > > 'misc.netdrv', 'misc.poll', 'work.mqueue', 'work.whack-a-mole' and
> > > > > 'work.misc' into for-next
> > > > > git bisect good 416d20e8c31107f5dfd45d1d80d1e6c8e4871180
> > > > > # good: [325a1de4a691512a48c1426b943a7b0b9f8d6744] xfs: convert to new
> > > > > i_version API
> > > > > git bisect good 325a1de4a691512a48c1426b943a7b0b9f8d6744
> > > > > # good: [a94fe10fb114c169e7ddaecd8251521886409121] checkpatch: add
> > > > > pF/pf deprecation warning
> > > > > git bisect good a94fe10fb114c169e7ddaecd8251521886409121
> > > > > # good: [6b3911dffd1184fdcd63299a5fee59ac000f2067] btrfs: only dirty
> > > > > the inode in btrfs_update_time if something was changed
> > > > > git bisect good 6b3911dffd1184fdcd63299a5fee59ac000f2067
> > > > > # bad: [448f8c749a7a0ae03505823910ec45a112678048] Merge
> > > > > remote-tracking branch 'iversion/iversion-next'
> > > > > git bisect bad 448f8c749a7a0ae03505823910ec45a112678048
> > > > > # bad: [8618bff776758ebff5b55211e7b5a60a0fc119a5] fs: handle
> > > > > inode->i_version more efficiently
> > > > > git bisect bad 8618bff776758ebff5b55211e7b5a60a0fc119a5
> > > > > # first bad commit: [8618bff776758ebff5b55211e7b5a60a0fc119a5] fs:
> > > > > handle inode->i_version more efficiently
> > > >
> > > > That's really strange. I'm afraid I have no idea what could be going on.
> > > >
> > > > With NFS, we really just treat i_version as an opaque value, so I'm not
> > > > sure how this patch in particular would affect anything there. We _do_
> > > > increment it if you have a write delegation in some cases, but not many
> > > > servers hand those out.
> > >
> > > About the NFS server, Arch Linux on Raspberry Pi (so 32-bit, ARMv6):
> > > Linux pi 4.9.70-1-ARCH #1 SMP Mon Dec 18 19:38:00 UTC 2017 armv6l GNU/Linux
> > >
> > > The /etc/nfs.conf is default except:
> > > [nfsd]
> > > vers2=n
> > > vers3=n
> > >
> > > The client logs for nfsroot mounts are:
> > > Jan 08 14:07:25 :: running hook [net_nfs4]
> > > Jan 08 14:07:25 IP-Config: eth0 hardware address ba:17:70:7e:87:d1 mtu 1500
> > > Jan 08 14:07:25 IP-Config: eth0 guessed broadcast address 192.168.1.255
> > > Jan 08 14:07:25 IP-Config: eth0 complete (from 192.168.1.10):
> > > Jan 08 14:07:25 address: 192.168.1.11 broadcast: 192.168.1.255
> > > netmask: 255.255.255.0
> > > Jan 08 14:07:25 gateway: 192.168.1.1 dns0 : 0.0.0.0 dns1 : 0.0.0.0
> > > Jan 08 14:07:25 rootserver: 192.168.1.10 rootpath:
> > > Jan 08 14:07:25 filename :
> > > Jan 08 14:07:25 NFS-Mount: 192.168.1.10:/srv/nfs/odroidxu3
> > > Jan 08 14:07:25 Waiting 10 seconds for device /dev/nfs ...
> > > Jan 08 14:07:36 ERROR: device '/dev/nfs' not found. Skipping fsck.
> > > Jan 08 14:07:36 Mount cmd:
> > > Jan 08 14:07:36 /opt/tools/buildbot/arch-arm-bin/mount.nfs4 -o
> > > vers=4,nolock 192.168.1.10:/srv/nfs/odroidxu3 /new_root
> > >
> > > Only root (/) is from NFS. /home comes from the SD card (/dev/mmcblk1).
> > >
> > > > ext4 will only touch the i_version field if you mount it with '-o
> > > > iversion'. Are you doing that here?
> > >
> > > The /home is mounted by systemd from /etc/fstab:
> > > LABEL=home /home ext4 defaults 0 2
> > >
> > > >
> > > > Have you run the bisect more than once? Is this maybe an intermittent
> > > > problem, and the bisect has landed on the wrong commit?
> > >
> > > I just re-ran tests on the surrounding commits (but not a full bisect) on
> > > current next (next-20180108):
> > > fbf97ece47d66d22 - works
> > > 3da7bcdd695bae43 ("fs: handle inode->i_version more efficiently") -
> > > fails: http://www.krzk.eu/#/builders/1/builds/1267
> > >
> > >
> > > Best regards,
> > > Krzysztof
> >
> > Ok, thanks. If you're seeing hangs then that might imply that we have
> > some sort of excessive looping going on in the cmpxchg loops.
> >
> > Could you apply the patch below and let me know if it causes either of
> > the warnings to pop? That might at least point us in the right
> > direction:
>
> No new warnings with the attached patch (except the already existing lockdep one:
> "INFO: trying to register non-static key.").
>

Yeah, I saw that in the original logs and it looks unrelated (and
harmless).

> Systemd times out mounting /home, but after entering the rescue shell there
> is no problem running mount /home:
> Give root password for maintenance
> (or press Control-D to continue):
> root@odroidxu3:~# mount /home
> [ 220.659331] EXT4-fs (mmcblk1p2): mounted filesystem with ordered data mode. Opts: (null)
>

Ok, thanks for testing it. So I guess we can probably rule out excessive
looping in those functions as the issue.
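
For reference, the flag-bit scheme from the commit message quoted above can
be sketched in userspace C11. This is only an illustration under stated
assumptions, not the code from the patch: the names maybe_inc_iversion() and
query_iversion() are made up for this sketch, and it uses the default
sequentially consistent C11 atomics instead of the kernel's atomic64_t
helpers and explicit memory barriers.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout for this sketch: bit 0 is the "queried" flag and the
 * counter proper lives in the upper 63 bits. */
#define IVER_QUERIED   1ULL
#define IVER_INCREMENT 2ULL

/* Hypothetical helper: bump the counter only if someone has queried it
 * since the last bump, or if the caller forces the update.
 * Returns true if the stored value was changed. */
static bool maybe_inc_iversion(_Atomic uint64_t *ver, bool force)
{
    uint64_t cur = atomic_load(ver);

    for (;;) {
        /* Nobody looked since the last increment: nothing to do. */
        if (!force && !(cur & IVER_QUERIED))
            return false;

        /* Clear the flag and add 2, i.e. increment the counter by one. */
        uint64_t next = (cur & ~IVER_QUERIED) + IVER_INCREMENT;

        /* On failure, cur is reloaded with the current value; retry. */
        if (atomic_compare_exchange_weak(ver, &cur, next))
            return true;
    }
}

/* Hypothetical helper: return the counter and make sure the "queried"
 * flag is set so that the next change will bump it. */
static uint64_t query_iversion(_Atomic uint64_t *ver)
{
    uint64_t cur = atomic_load(ver);

    for (;;) {
        /* Flag already set: no store needed. */
        if (cur & IVER_QUERIED)
            break;

        /* Try to set the flag; on failure cur is refreshed and we retry. */
        if (atomic_compare_exchange_weak(ver, &cur, cur | IVER_QUERIED))
            break;
    }

    /* Drop the flag bit; the caller only sees the counter. */
    return cur >> 1;
}
```

Both loops only go around again when another CPU changed the value between
the load and the compare-and-swap, which is why a hang here would have
implied unexpectedly heavy contention or a value that never settles.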

To make sure I understand the problem: When systemd tries to do the
initial mount of /home (which is an ext4 filesystem), it hangs. But once
it drops to the shell, it works, if you do the mount by hand.

Is that correct?

If so, then is it possible to trigger sysrq commands during the hanging
mount attempt? Maybe you could use e.g. sysrq-l, sysrq-w, etc. to
determine what it's blocking on?

Thanks,
--
Jeff Layton <[email protected]>

2018-01-08 18:01:20

by David Sterba

Subject: Re: [PATCH v4 06/19] btrfs: convert to new i_version API

On Fri, Dec 22, 2017 at 07:05:43AM -0500, Jeff Layton wrote:
> From: Jeff Layton <[email protected]>
>
> Signed-off-by: Jeff Layton <[email protected]>

Acked-by: David Sterba <[email protected]>

2018-01-08 18:01:29

by David Sterba

Subject: Re: [PATCH v4 18/19] btrfs: only dirty the inode in btrfs_update_time if something was changed

On Fri, Dec 22, 2017 at 07:05:55AM -0500, Jeff Layton wrote:
> From: Jeff Layton <[email protected]>
>
> At this point, we know that "now" and the file times may differ, and we
> suspect that the i_version has been flagged to be bumped. Attempt to
> bump the i_version, and only mark the inode dirty if that actually
> occurred or if one of the times was updated.
>
> Signed-off-by: Jeff Layton <[email protected]>

Acked-by: David Sterba <[email protected]>

2018-01-08 18:33:59

by Krzysztof Kozlowski

Subject: Re: [PATCH v4 19/19] fs: handle inode->i_version more efficiently

On Mon, Jan 08, 2018 at 01:00:19PM -0500, Jeff Layton wrote:
> On Mon, 2018-01-08 at 18:29 +0100, Krzysztof Kozlowski wrote:

(...)

> > > Ok, thanks. If you're seeing hangs then that might imply that we have
> > > some sort of excessive looping going on in the cmpxchg loops.
> > >
> > > Could you apply the patch below and let me know if it causes either of
> > > the warnings to pop? That might at least point us in the right
> > > direction:
> >
> > No new warnings with the attached patch (except the already existing lockdep one:
> > "INFO: trying to register non-static key.").
> >
>
> Yeah, I saw that in the original logs and it looks unrelated (and
> harmless).
>
> > Systemd times out mounting /home, but after entering the rescue shell there
> > is no problem running mount /home:
> > Give root password for maintenance
> > (or press Control-D to continue):
> > root@odroidxu3:~# mount /home
> > [ 220.659331] EXT4-fs (mmcblk1p2): mounted filesystem with ordered data mode. Opts: (null)
> >
>
> Ok, thanks for testing it. So I guess we can probably rule out excessive
> looping in those functions as the issue.
>
> To make sure I understand the problem: When systemd tries to do the
> initial mount of /home (which is an ext4 filesystem), it hangs. But once
> it drops to the shell, it works, if you do the mount by hand.
>
> Is that correct?

Yes, although it also times out setting up /dev/ttySAC2 (serial
console).

> If so, then is it possible to trigger sysrq commands during the hanging
> mount attempt? Maybe you could use e.g. sysrq-l, sysrq-w, etc. to
> determine what it's blocking on?

Yes, I have sysrq. Blocked state (w):
##################
[ *** ] (2 of 4) A start job is running for…v-ttySAC2.device (1min / 1min 30s)
*** break sent ***
[*** ] (2 of 4) A start job is running for…v-ttySAC2.device (1min / 1min 30s)
[ 110.962917] sysrq: SysRq : Show Blocked State
[ 110.965931] task PC stack pid father
[ 110.971360] lvm D 0 219 1 0x00000000
[ 110.976649] [<c087682c>] (__schedule) from [<c0876ca0>] (schedule+0x4c/0xac)
[ 110.983643] [<c0876ca0>] (schedule) from [<c07ae19c>] (rpc_wait_bit_killable+0x2c/0xf0)
[ 110.991621] [<c07ae19c>] (rpc_wait_bit_killable) from [<c0877328>] (__wait_on_bit+0x80/0xb4)
[ 111.000025] [<c0877328>] (__wait_on_bit) from [<c08773e8>] (out_of_line_wait_on_bit+0x8c/0x94)
[ 111.008603] [<c08773e8>] (out_of_line_wait_on_bit) from [<c07aec78>] (__rpc_execute+0x19c/0x274)
[ 111.017370] [<c07aec78>] (__rpc_execute) from [<c07a5eb4>] (rpc_run_task+0x134/0x14c)
[ 111.025164] [<c07a5eb4>] (rpc_run_task) from [<c0315b20>] (nfs4_call_sync_sequence+0x58/0x78)
[ 111.033648] [<c0315b20>] (nfs4_call_sync_sequence) from [<c031c490>] (nfs4_proc_access+0xf4/0x12c)
[ 111.042586] [<c031c490>] (nfs4_proc_access) from [<c02fe408>] (nfs_do_access+0x230/0x3e8)
[ 111.050716] [<c02fe408>] (nfs_do_access) from [<c02fe88c>] (nfs_permission+0x2a0/0x2c0)
[ 111.058694] [<c02fe88c>] (nfs_permission) from [<c0224aa4>] (link_path_walk+0x6c/0x508)
[ 111.066658] [<c0224aa4>] (link_path_walk) from [<c0225030>] (path_lookupat+0x8c/0x1ec)
[ 111.074541] [<c0225030>] (path_lookupat) from [<c02270fc>] (path_openat+0x770/0x964)
[ 111.082253] [<c02270fc>] (path_openat) from [<c0227e78>] (do_filp_open+0x60/0xc4)
[ 111.089711] [<c0227e78>] (do_filp_open) from [<c0215a14>] (do_sys_open+0x114/0x1c4)
[ 111.097339] [<c0215a14>] (do_sys_open) from [<c0108880>] (ret_fast_syscall+0x0/0x28)
[ 111.105036] udevadm D 0 259 1 0x00000000
[ 111.110482] [<c087682c>] (__schedule) from [<c0876ca0>] (schedule+0x4c/0xac)
[ 111.117508] [<c0876ca0>] (schedule) from [<c07ae19c>] (rpc_wait_bit_killable+0x2c/0xf0)
[ 111.125494] [<c07ae19c>] (rpc_wait_bit_killable) from [<c0877328>] (__wait_on_bit+0x80/0xb4)
[ 111.133903] [<c0877328>] (__wait_on_bit) from [<c08773e8>] (out_of_line_wait_on_bit+0x8c/0x94)
[ 111.142482] [<c08773e8>] (out_of_line_wait_on_bit) from [<c07aec78>] (__rpc_execute+0x19c/0x274)
[ 111.151236] [<c07aec78>] (__rpc_execute) from [<c07a5eb4>] (rpc_run_task+0x134/0x14c)
[ 111.159028] [<c07a5eb4>] (rpc_run_task) from [<c0315b20>] (nfs4_call_sync_sequence+0x58/0x78)
[ 111.167523] [<c0315b20>] (nfs4_call_sync_sequence) from [<c031c68c>] (nfs4_proc_getattr+0xbc/0xe4)
[ 111.176452] [<c031c68c>] (nfs4_proc_getattr) from [<c0303c90>] (__nfs_revalidate_inode+0x8c/0x130)
[ 111.185378] [<c0303c90>] (__nfs_revalidate_inode) from [<c02fe758>] (nfs_permission+0x16c/0x2c0)
[ 111.194132] [<c02fe758>] (nfs_permission) from [<c0224aa4>] (link_path_walk+0x6c/0x508)
[ 111.202096] [<c0224aa4>] (link_path_walk) from [<c0225030>] (path_lookupat+0x8c/0x1ec)
[ 111.209980] [<c0225030>] (path_lookupat) from [<c0227714>] (filename_lookup+0x8c/0xe8)
[ 111.217865] [<c0227714>] (filename_lookup) from [<c0215014>] (SyS_faccessat+0x9c/0x208)
[ 111.225839] [<c0215014>] (SyS_faccessat) from [<c0108880>] (ret_fast_syscall+0x0/0x28)
[*** ] (3 of 4) A start job is running for…l-home.device (1min 2s / 1min 30s)
##################

Another blocked-state dump (after a few seconds):
[ 125.364311] sysrq: SysRq : Show Blocked State
[ 125.367292] task PC stack pid father
[ 125.372688] udevadm D 0 259 1 0x00000000
[ 125.377998] [<c087682c>] (__schedule) from [<c0876ca0>] (schedule+0x4c/0xac)
[ 125.384990] [<c0876ca0>] (schedule) from [<c07ae19c>] (rpc_wait_bit_killable+0x2c/0xf0)
[ 125.392961] [<c07ae19c>] (rpc_wait_bit_killable) from [<c0877328>] (__wait_on_bit+0x80/0xb4)
[ 125.401365] [<c0877328>] (__wait_on_bit) from [<c08773e8>] (out_of_line_wait_on_bit+0x8c/0x94)
[ 125.409944] [<c08773e8>] (out_of_line_wait_on_bit) from [<c07aec78>] (__rpc_execute+0x19c/0x274)
[ 125.418711] [<c07aec78>] (__rpc_execute) from [<c07a5eb4>] (rpc_run_task+0x134/0x14c)
[ 125.426505] [<c07a5eb4>] (rpc_run_task) from [<c0315b20>] (nfs4_call_sync_sequence+0x58/0x78)
[ 125.434989] [<c0315b20>] (nfs4_call_sync_sequence) from [<c031eab0>] (nfs4_proc_lookup_common+0x100/0x354)
[ 125.444608] [<c031eab0>] (nfs4_proc_lookup_common) from [<c031edac>] (nfs4_proc_lookup+0x3c/0x88)
[ 125.453456] [<c031edac>] (nfs4_proc_lookup) from [<c02ff46c>] (nfs_lookup_revalidate+0x224/0x35c)
[ 125.462294] [<c02ff46c>] (nfs_lookup_revalidate) from [<c02233c4>] (lookup_fast+0x294/0x478)
[ 125.470689] [<c02233c4>] (lookup_fast) from [<c02235d4>] (walk_component+0x2c/0x2e0)
[ 125.478395] [<c02235d4>] (walk_component) from [<c0224bd4>] (link_path_walk+0x19c/0x508)
[ 125.486455] [<c0224bd4>] (link_path_walk) from [<c0225030>] (path_lookupat+0x8c/0x1ec)
[ 125.494342] [<c0225030>] (path_lookupat) from [<c0227714>] (filename_lookup+0x8c/0xe8)
[ 125.502226] [<c0227714>] (filename_lookup) from [<c021d074>] (SyS_readlinkat+0x44/0xec)
[ 125.510208] [<c021d074>] (SyS_readlinkat) from [<c0108880>] (ret_fast_syscall+0x0/0x28)
[ **] (3 of 4) A start job is running for…-home.device (1min 20s / 1min 30s)

##################

Another blocked-state dump (after a few seconds):
[ **] (3 of 4) A start job is running for…-home.device (1min 26s / 1min 30s)
[ 136.281417] sysrq: SysRq : Show Blocked State
[ 136.284403] task PC stack pid father
[ 136.289818] udevadm D 0 259 1 0x00000000
[ 136.295116] [<c087682c>] (__schedule) from [<c0876ca0>] (schedule+0x4c/0xac)
[ 136.302105] [<c0876ca0>] (schedule) from [<c07ae19c>] (rpc_wait_bit_killable+0x2c/0xf0)
[ 136.310079] [<c07ae19c>] (rpc_wait_bit_killable) from [<c0877328>] (__wait_on_bit+0x80/0xb4)
[ 136.318484] [<c0877328>] (__wait_on_bit) from [<c08773e8>] (out_of_line_wait_on_bit+0x8c/0x94)
[ 136.327064] [<c08773e8>] (out_of_line_wait_on_bit) from [<c07aec78>] (__rpc_execute+0x19c/0x274)
[ 136.335830] [<c07aec78>] (__rpc_execute) from [<c07a5eb4>] (rpc_run_task+0x134/0x14c)
[ 136.343625] [<c07a5eb4>] (rpc_run_task) from [<c0315b20>] (nfs4_call_sync_sequence+0x58/0x78)
[ 136.352110] [<c0315b20>] (nfs4_call_sync_sequence) from [<c031eab0>] (nfs4_proc_lookup_common+0x100/0x354)
[ 136.361729] [<c031eab0>] (nfs4_proc_lookup_common) from [<c031edac>] (nfs4_proc_lookup+0x3c/0x88)
[ 136.370576] [<c031edac>] (nfs4_proc_lookup) from [<c02ff46c>] (nfs_lookup_revalidate+0x224/0x35c)
[ 136.379416] [<c02ff46c>] (nfs_lookup_revalidate) from [<c02233c4>] (lookup_fast+0x294/0x478)
[ 136.387809] [<c02233c4>] (lookup_fast) from [<c02235d4>] (walk_component+0x2c/0x2e0)
[ 136.395514] [<c02235d4>] (walk_component) from [<c0224bd4>] (link_path_walk+0x19c/0x508)
[ 136.403575] [<c0224bd4>] (link_path_walk) from [<c02269f4>] (path_openat+0x68/0x964)
[ 136.411285] [<c02269f4>] (path_openat) from [<c0227e78>] (do_filp_open+0x60/0xc4)
[ 136.418743] [<c0227e78>] (do_filp_open) from [<c0215a14>] (do_sys_open+0x114/0x1c4)
[ 136.426372] [<c0215a14>] (do_sys_open) from [<c0108880>] (ret_fast_syscall+0x0/0x28)

##################


And the list of all CPU backtraces (l):

*** break sent ***
[ 87.197309] sysrq: SysRq : Show backtrace of all active CPUs
[ 87.201610] NMI backtrace for cpu 0
[ 87.205044] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.15.0-rc3-00022-g3da7bcdd695b #1101
[ 87.213297] Hardware name: SAMSUNG EXYNOS (Flattened Device Tree)
[ 87.219399] [<c01103a4>] (unwind_backtrace) from [<c010cac0>] (show_stack+0x10/0x14)
[ 87.227098] [<c010cac0>] (show_stack) from [<c08620ac>] (dump_stack+0x98/0xc4)
[ 87.234280] [<c08620ac>] (dump_stack) from [<c0868078>] (nmi_cpu_backtrace+0x10c/0x110)
[ 87.242251] [<c0868078>] (nmi_cpu_backtrace) from [<c0868184>] (nmi_trigger_cpumask_backtrace+0x108/0x150)
[ 87.251884] [<c0868184>] (nmi_trigger_cpumask_backtrace) from [<c041f384>] (__handle_sysrq+0x170/0x264)
[ 87.261237] [<c041f384>] (__handle_sysrq) from [<c0439888>] (s3c24xx_serial_rx_drain_fifo+0x1ec/0x214)
[ 87.270506] [<c0439888>] (s3c24xx_serial_rx_drain_fifo) from [<c043a6b8>] (s3c24xx_serial_rx_chars+0x6c/0x1a0)
[ 87.280473] [<c043a6b8>] (s3c24xx_serial_rx_chars) from [<c043a834>] (s3c64xx_serial_handle_irq+0x48/0x60)
[ 87.290104] [<c043a834>] (s3c64xx_serial_handle_irq) from [<c017ceb0>] (__handle_irq_event_percpu+0x9c/0x128)
[ 87.299975] [<c017ceb0>] (__handle_irq_event_percpu) from [<c017cf58>] (handle_irq_event_percpu+0x1c/0x58)
[ 87.309589] [<c017cf58>] (handle_irq_event_percpu) from [<c017cfcc>] (handle_irq_event+0x38/0x5c)
[ 87.318425] [<c017cfcc>] (handle_irq_event) from [<c0180740>] (handle_fasteoi_irq+0xb8/0x174)
[ 87.326914] [<c0180740>] (handle_fasteoi_irq) from [<c017c15c>] (generic_handle_irq+0x24/0x34)
[ 87.335492] [<c017c15c>] (generic_handle_irq) from [<c017c714>] (__handle_domain_irq+0x7c/0xec)
[ 87.344159] [<c017c714>] (__handle_domain_irq) from [<c0101544>] (gic_handle_irq+0x58/0x9c)
[ 87.352473] [<c0101544>] (gic_handle_irq) from [<c010d7b0>] (__irq_svc+0x70/0xb0)
[ 87.359912] Exception stack(0xc0d01ee8 to 0xc0d01f30)
[ 87.364927] 1ee0: c087cc70 c0d0bf80 00000000 00000000 eef6d3c0 00000000
[ 87.373094] 1f00: ed4e8000 00000102 cfe02400 c0d08db8 ed491880 c0d01f6c 00000001 c0d01f38
[ 87.381234] 1f20: c087cc70 c087cc74 60070013 ffffffff
[ 87.386254] [<c010d7b0>] (__irq_svc) from [<c087cc74>] (_raw_spin_unlock_irq+0x28/0x5c)
[ 87.394252] [<c087cc74>] (_raw_spin_unlock_irq) from [<c0144b94>] (finish_task_switch+0xa0/0x1b0)
[ 87.403097] [<c0144b94>] (finish_task_switch) from [<c0876830>] (__schedule+0x238/0x65c)
[ 87.411140] [<c0876830>] (__schedule) from [<c0877158>] (schedule_idle+0x38/0x78)
[ 87.418591] [<c0877158>] (schedule_idle) from [<c0162030>] (cpu_startup_entry+0x18/0x1c)
[ 87.426658] [<c0162030>] (cpu_startup_entry) from [<c0c00c88>] (start_kernel+0x39c/0x3a8)
[ 87.434792] Sending NMI from CPU 0 to CPUs 1-7:
[ 87.439278] NMI backtrace for cpu 3
[ 87.439294] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.15.0-rc3-00022-g3da7bcdd695b #1101
[ 87.439300] Hardware name: SAMSUNG EXYNOS (Flattened Device Tree)
[ 87.439322] PC is at arch_cpu_idle+0x24/0x3c
[ 87.439333] LR is at arch_cpu_idle+0x20/0x3c
[ 87.439341] pc : [<c01093d8>] lr : [<c01093d4>] psr: 60000013
[ 87.439348] sp : ee8d5fb8 ip : 00000001 fp : c0afe104
[ 87.439354] r10: 00000000 r9 : 00000000 r8 : c0c65730
[ 87.439363] r7 : 00000008 r6 : c0d08c8c r5 : c0d08c20 r4 : ee8d4000
[ 87.439370] r3 : 00000000 r2 : 00000000 r1 : 00000000 r0 : c01093d4
[ 87.439381] Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment none
[ 87.439389] Control: 10c5387d Table: 6cd4806a DAC: 00000051
[ 87.439399] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.15.0-rc3-00022-g3da7bcdd695b #1101
[ 87.439405] Hardware name: SAMSUNG EXYNOS (Flattened Device Tree)
[ 87.439425] [<c01103a4>] (unwind_backtrace) from [<c010cac0>] (show_stack+0x10/0x14)
[ 87.439442] [<c010cac0>] (show_stack) from [<c08620ac>] (dump_stack+0x98/0xc4)
[ 87.439460] [<c08620ac>] (dump_stack) from [<c086801c>] (nmi_cpu_backtrace+0xb0/0x110)
[ 87.439476] [<c086801c>] (nmi_cpu_backtrace) from [<c010f2ac>] (handle_IPI+0xd8/0x1b0)
[ 87.439492] [<c010f2ac>] (handle_IPI) from [<c0101584>] (gic_handle_irq+0x98/0x9c)
[ 87.439506] [<c0101584>] (gic_handle_irq) from [<c010d7b0>] (__irq_svc+0x70/0xb0)
[ 87.439514] Exception stack(0xee8d5f68 to 0xee8d5fb0)
[ 87.439527] 5f60: c01093d4 00000000 00000000 00000000 ee8d4000 c0d08c20
[ 87.439542] 5f80: c0d08c8c 00000008 c0c65730 00000000 00000000 c0afe104 00000001 ee8d5fb8
[ 87.439552] 5fa0: c01093d4 c01093d8 60000013 ffffffff
[ 87.439569] [<c010d7b0>] (__irq_svc) from [<c01093d8>] (arch_cpu_idle+0x24/0x3c)
[ 87.439586] [<c01093d8>] (arch_cpu_idle) from [<c0161c20>] (do_idle+0x190/0x230)
[ 87.439600] [<c0161c20>] (do_idle) from [<c0162030>] (cpu_startup_entry+0x18/0x1c)
[ 87.439615] [<c0162030>] (cpu_startup_entry) from [<4010192c>] (0x4010192c)
[ 87.439626] NMI backtrace for cpu 1
[ 87.439641] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.15.0-rc3-00022-g3da7bcdd695b #1101
[ 87.439647] Hardware name: SAMSUNG EXYNOS (Flattened Device Tree)
[ 87.439659] PC is at arch_cpu_idle+0x24/0x3c
[ 87.439669] LR is at arch_cpu_idle+0x20/0x3c
[ 87.439677] pc : [<c01093d8>] lr : [<c01093d4>] psr: 600d0013
[ 87.439684] sp : ee8d1fb8 ip : 00000001 fp : c0afe104
[ 87.439691] r10: 00000000 r9 : 00000000 r8 : c0c65730
[ 87.439699] r7 : 00000002 r6 : c0d08c8c r5 : c0d08c20 r4 : ee8d0000
[ 87.439707] r3 : 00000000 r2 : 00000000 r1 : 00000000 r0 : c01093d4
[ 87.439715] Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment none
[ 87.439723] Control: 10c5387d Table: 6cd4806a DAC: 00000051
[ 87.439733] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.15.0-rc3-00022-g3da7bcdd695b #1101
[ 87.439738] Hardware name: SAMSUNG EXYNOS (Flattened Device Tree)
[ 87.439758] [<c01103a4>] (unwind_backtrace) from [<c010cac0>] (show_stack+0x10/0x14)
[ 87.439775] [<c010cac0>] (show_stack) from [<c08620ac>] (dump_stack+0x98/0xc4)
[ 87.439793] [<c08620ac>] (dump_stack) from [<c086801c>] (nmi_cpu_backtrace+0xb0/0x110)
[ 87.439808] [<c086801c>] (nmi_cpu_backtrace) from [<c010f2ac>] (handle_IPI+0xd8/0x1b0)
[ 87.439824] [<c010f2ac>] (handle_IPI) from [<c0101584>] (gic_handle_irq+0x98/0x9c)
[ 87.439838] [<c0101584>] (gic_handle_irq) from [<c010d7b0>] (__irq_svc+0x70/0xb0)
[ 87.439845] Exception stack(0xee8d1f68 to 0xee8d1fb0)
[ 87.439859] 1f60: c01093d4 00000000 00000000 00000000 ee8d0000 c0d08c20
[ 87.439873] 1f80: c0d08c8c 00000002 c0c65730 00000000 00000000 c0afe104 00000001 ee8d1fb8
[ 87.439884] 1fa0: c01093d4 c01093d8 600d0013 ffffffff
[ 87.439901] [<c010d7b0>] (__irq_svc) from [<c01093d8>] (arch_cpu_idle+0x24/0x3c)
[ 87.439917] [<c01093d8>] (arch_cpu_idle) from [<c0161c20>] (do_idle+0x190/0x230)
[ 87.439931] [<c0161c20>] (do_idle) from [<c0162030>] (cpu_startup_entry+0x18/0x1c)
[ 87.439944] [<c0162030>] (cpu_startup_entry) from [<4010192c>] (0x4010192c)
[ 87.439955] NMI backtrace for cpu 2
[ 87.439968] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.15.0-rc3-00022-g3da7bcdd695b #1101
[ 87.439979] Hardware name: SAMSUNG EXYNOS (Flattened Device Tree)
[ 87.439993] PC is at arch_cpu_idle+0x24/0x3c
[ 87.440003] LR is at arch_cpu_idle+0x20/0x3c
[ 87.440011] pc : [<c01093d8>] lr : [<c01093d4>] psr: 60070013
[ 87.440019] sp : ee8d3fb8 ip : 00000001 fp : c0afe104
[ 87.440027] r10: 00000000 r9 : 00000000 r8 : c0c65730
[ 87.440036] r7 : 00000004 r6 : c0d08c8c r5 : c0d08c20 r4 : ee8d2000
[ 87.440044] r3 : 00000000 r2 : 00000000 r1 : 00000000 r0 : c01093d4
[ 87.440054] Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment none
[ 87.440064] Control: 10c5387d Table: 6cd3006a DAC: 00000051
[ 87.440076] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.15.0-rc3-00022-g3da7bcdd695b #1101
[ 87.440082] Hardware name: SAMSUNG EXYNOS (Flattened Device Tree)
[ 87.440102] [<c01103a4>] (unwind_backtrace) from [<c010cac0>] (show_stack+0x10/0x14)
[ 87.440118] [<c010cac0>] (show_stack) from [<c08620ac>] (dump_stack+0x98/0xc4)
[ 87.440135] [<c08620ac>] (dump_stack) from [<c086801c>] (nmi_cpu_backtrace+0xb0/0x110)
[ 87.440153] [<c086801c>] (nmi_cpu_backtrace) from [<c010f2ac>] (handle_IPI+0xd8/0x1b0)
[ 87.440169] [<c010f2ac>] (handle_IPI) from [<c0101584>] (gic_handle_irq+0x98/0x9c)
[ 87.440184] [<c0101584>] (gic_handle_irq) from [<c010d7b0>] (__irq_svc+0x70/0xb0)
[ 87.440192] Exception stack(0xee8d3f68 to 0xee8d3fb0)
[ 87.440206] 3f60: c01093d4 00000000 00000000 00000000 ee8d2000 c0d08c20
[ 87.440221] 3f80: c0d08c8c 00000004 c0c65730 00000000 00000000 c0afe104 00000001 ee8d3fb8
[ 87.440233] 3fa0: c01093d4 c01093d8 60070013 ffffffff
[ 87.440251] [<c010d7b0>] (__irq_svc) from [<c01093d8>] (arch_cpu_idle+0x24/0x3c)
[ 87.440269] [<c01093d8>] (arch_cpu_idle) from [<c0161c20>] (do_idle+0x190/0x230)
[ 87.440285] [<c0161c20>] (do_idle) from [<c0162030>] (cpu_startup_entry+0x18/0x1c)
[ 87.440299] [<c0162030>] (cpu_startup_entry) from [<4010192c>] (0x4010192c)
[ 87.440323] NMI backtrace for cpu 4
[ 87.440359] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 4.15.0-rc3-00022-g3da7bcdd695b #1101
[ 87.440375] Hardware name: SAMSUNG EXYNOS (Flattened Device Tree)
[ 87.440422] PC is at arch_cpu_idle+0x24/0x3c
[ 87.440450] LR is at arch_cpu_idle+0x20/0x3c
[ 87.440469] pc : [<c01093d8>] lr : [<c01093d4>] psr: 600d0013
[ 87.440487] sp : ee8d7fb8 ip : 00000001 fp : c0afe104
[ 87.440504] r10: 00000000 r9 : 00000000 r8 : c0c65730
[ 87.440523] r7 : 00000010 r6 : c0d08c8c r5 : c0d08c20 r4 : ee8d6000
[ 87.440541] r3 : 00000000 r2 : 00000000 r1 : 00000000 r0 : c01093d4
[ 87.440565] Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment none
[ 87.440586] Control: 10c5387d Table: 6cd5006a DAC: 00000051
[ 87.440615] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 4.15.0-rc3-00022-g3da7bcdd695b #1101
[ 87.440629] Hardware name: SAMSUNG EXYNOS (Flattened Device Tree)
[ 87.440690] [<c01103a4>] (unwind_backtrace) from [<c010cac0>] (show_stack+0x10/0x14)
[ 87.440736] [<c010cac0>] (show_stack) from [<c08620ac>] (dump_stack+0x98/0xc4)
[ 87.440782] [<c08620ac>] (dump_stack) from [<c086801c>] (nmi_cpu_backtrace+0xb0/0x110)
[ 87.440822] [<c086801c>] (nmi_cpu_backtrace) from [<c010f2ac>] (handle_IPI+0xd8/0x1b0)
[ 87.440859] [<c010f2ac>] (handle_IPI) from [<c0101584>] (gic_handle_irq+0x98/0x9c)
[ 87.440891] [<c0101584>] (gic_handle_irq) from [<c010d7b0>] (__irq_svc+0x70/0xb0)
[ 87.440908] Exception stack(0xee8d7f68 to 0xee8d7fb0)
[ 87.440944] 7f60: c01093d4 00000000 00000000 00000000 ee8d6000 c0d08c20
[ 87.440979] 7f80: c0d08c8c 00000010 c0c65730 00000000 00000000 c0afe104 00000001 ee8d7fb8
[ 87.441003] 7fa0: c01093d4 c01093d8 600d0013 ffffffff
[ 87.441048] [<c010d7b0>] (__irq_svc) from [<c01093d8>] (arch_cpu_idle+0x24/0x3c)
[ 87.441089] [<c01093d8>] (arch_cpu_idle) from [<c0161c20>] (do_idle+0x190/0x230)
[ 87.441124] [<c0161c20>] (do_idle) from [<c0162030>] (cpu_startup_entry+0x18/0x1c)
[ 87.441155] [<c0162030>] (cpu_startup_entry) from [<4010192c>] (0x4010192c)
[ 87.441181] NMI backtrace for cpu 5
[ 87.441216] CPU: 5 PID: 0 Comm: swapper/5 Not tainted 4.15.0-rc3-00022-g3da7bcdd695b #1101
[ 87.441232] Hardware name: SAMSUNG EXYNOS (Flattened Device Tree)
[ 87.441269] PC is at arch_cpu_idle+0x24/0x3c
[ 87.441298] LR is at arch_cpu_idle+0x20/0x3c
[ 87.441318] pc : [<c01093d8>] lr : [<c01093d4>] psr: 60010113
[ 87.441336] sp : ee8e1fb8 ip : 00000001 fp : c0afe104
[ 87.441352] r10: 00000000 r9 : 00000000 r8 : c0c65730
[ 87.441372] r7 : 00000020 r6 : c0d08c8c r5 : c0d08c20 r4 : ee8e0000
[ 87.441389] r3 : 00000000 r2 : 00000000 r1 : 00000000 r0 : c01093d4
[ 87.441412] Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment none
[ 87.441433] Control: 10c5387d Table: 6ccd406a DAC: 00000051
[ 87.441461] CPU: 5 PID: 0 Comm: swapper/5 Not tainted 4.15.0-rc3-00022-g3da7bcdd695b #1101
[ 87.441476] Hardware name: SAMSUNG EXYNOS (Flattened Device Tree)
[ 87.441528] [<c01103a4>] (unwind_backtrace) from [<c010cac0>] (show_stack+0x10/0x14)
[ 87.441574] [<c010cac0>] (show_stack) from [<c08620ac>] (dump_stack+0x98/0xc4)
[ 87.441619] [<c08620ac>] (dump_stack) from [<c086801c>] (nmi_cpu_backtrace+0xb0/0x110)
[ 87.441659] [<c086801c>] (nmi_cpu_backtrace) from [<c010f2ac>] (handle_IPI+0xd8/0x1b0)
[ 87.441694] [<c010f2ac>] (handle_IPI) from [<c0101584>] (gic_handle_irq+0x98/0x9c)
[ 87.441727] [<c0101584>] (gic_handle_irq) from [<c010d7b0>] (__irq_svc+0x70/0xb0)
[ 87.441745] Exception stack(0xee8e1f68 to 0xee8e1fb0)
[ 87.441780] 1f60: c01093d4 00000000 00000000 00000000 ee8e0000 c0d08c20
[ 87.441815] 1f80: c0d08c8c 00000020 c0c65730 00000000 00000000 c0afe104 00000001 ee8e1fb8
[ 87.441839] 1fa0: c01093d4 c01093d8 60010113 ffffffff
[ 87.441884] [<c010d7b0>] (__irq_svc) from [<c01093d8>] (arch_cpu_idle+0x24/0x3c)
[ 87.441925] [<c01093d8>] (arch_cpu_idle) from [<c0161c20>] (do_idle+0x190/0x230)
[ 87.441959] [<c0161c20>] (do_idle) from [<c0162030>] (cpu_startup_entry+0x18/0x1c)
[ 87.441992] [<c0162030>] (cpu_startup_entry) from [<4010192c>] (0x4010192c)
[ 87.442017] NMI backtrace for cpu 6
[ 87.442051] CPU: 6 PID: 0 Comm: swapper/6 Not tainted 4.15.0-rc3-00022-g3da7bcdd695b #1101
[ 87.442067] Hardware name: SAMSUNG EXYNOS (Flattened Device Tree)
[ 87.442103] PC is at arch_cpu_idle+0x24/0x3c
[ 87.442132] LR is at arch_cpu_idle+0x20/0x3c
[ 87.442152] pc : [<c01093d8>] lr : [<c01093d4>] psr: 60010013
[ 87.442169] sp : ee8e3fb8 ip : 00000001 fp : c0afe104
[ 87.442185] r10: 00000000 r9 : 00000000 r8 : c0c65730
[ 87.442204] r7 : 00000040 r6 : c0d08c8c r5 : c0d08c20 r4 : ee8e2000
[ 87.442222] r3 : 00000000 r2 : 00000000 r1 : 00000000 r0 : c01093d4
[ 87.442244] Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment none
[ 87.442264] Control: 10c5387d Table: 6ccd006a DAC: 00000051
[ 87.442292] CPU: 6 PID: 0 Comm: swapper/6 Not tainted 4.15.0-rc3-00022-g3da7bcdd695b #1101
[ 87.442306] Hardware name: SAMSUNG EXYNOS (Flattened Device Tree)
[ 87.442359] [<c01103a4>] (unwind_backtrace) from [<c010cac0>] (show_stack+0x10/0x14)
[ 87.442404] [<c010cac0>] (show_stack) from [<c08620ac>] (dump_stack+0x98/0xc4)
[ 87.442448] [<c08620ac>] (dump_stack) from [<c086801c>] (nmi_cpu_backtrace+0xb0/0x110)
[ 87.442486] [<c086801c>] (nmi_cpu_backtrace) from [<c010f2ac>] (handle_IPI+0xd8/0x1b0)
[ 87.442522] [<c010f2ac>] (handle_IPI) from [<c0101584>] (gic_handle_irq+0x98/0x9c)
[ 87.442555] [<c0101584>] (gic_handle_irq) from [<c010d7b0>] (__irq_svc+0x70/0xb0)
[ 87.442572] Exception stack(0xee8e3f68 to 0xee8e3fb0)
[ 87.442605] 3f60: c01093d4 00000000 00000000 00000000 ee8e2000 c0d08c20
[ 87.442642] 3f80: c0d08c8c 00000040 c0c65730 00000000 00000000 c0afe104 00000001 ee8e3fb8
[ 87.442666] 3fa0: c01093d4 c01093d8 60010013 ffffffff
[ 87.442708] [<c010d7b0>] (__irq_svc) from [<c01093d8>] (arch_cpu_idle+0x24/0x3c)
[ 87.442747] [<c01093d8>] (arch_cpu_idle) from [<c0161c20>] (do_idle+0x190/0x230)
[ 87.442783] [<c0161c20>] (do_idle) from [<c0162030>] (cpu_startup_entry+0x18/0x1c)
[ 87.442814] [<c0162030>] (cpu_startup_entry) from [<4010192c>] (0x4010192c)
[ 87.442839] NMI backtrace for cpu 7
[ 87.442873] CPU: 7 PID: 0 Comm: swapper/7 Not tainted 4.15.0-rc3-00022-g3da7bcdd695b #1101
[ 87.442888] Hardware name: SAMSUNG EXYNOS (Flattened Device Tree)
[ 87.442925] PC is at arch_cpu_idle+0x24/0x3c
[ 87.442954] LR is at arch_cpu_idle+0x20/0x3c
[ 87.442973] pc : [<c01093d8>] lr : [<c01093d4>] psr: 60000013
[ 87.442990] sp : ee8e5fb8 ip : 00000001 fp : c0afe104
[ 87.443007] r10: 00000000 r9 : 00000000 r8 : c0c65730
[ 87.443025] r7 : 00000080 r6 : c0d08c8c r5 : c0d08c20 r4 : ee8e4000
[ 87.443043] r3 : 00000000 r2 : 00000000 r1 : 00000000 r0 : c01093d4
[ 87.443065] Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment none
[ 87.443086] Control: 10c5387d Table: 6ccb006a DAC: 00000051
[ 87.443113] CPU: 7 PID: 0 Comm: swapper/7 Not tainted 4.15.0-rc3-00022-g3da7bcdd695b #1101
[ 87.443127] Hardware name: SAMSUNG EXYNOS (Flattened Device Tree)
[ 87.443179] [<c01103a4>] (unwind_backtrace) from [<c010cac0>] (show_stack+0x10/0x14)
[ 87.443223] [<c010cac0>] (show_stack) from [<c08620ac>] (dump_stack+0x98/0xc4)
[ 87.443265] [<c08620ac>] (dump_stack) from [<c086801c>] (nmi_cpu_backtrace+0xb0/0x110)
[ 87.443303] [<c086801c>] (nmi_cpu_backtrace) from [<c010f2ac>] (handle_IPI+0xd8/0x1b0)
[ 87.443340] [<c010f2ac>] (handle_IPI) from [<c0101584>] (gic_handle_irq+0x98/0x9c)
[ 87.443374] [<c0101584>] (gic_handle_irq) from [<c010d7b0>] (__irq_svc+0x70/0xb0)
[ 87.443391] Exception stack(0xee8e5f68 to 0xee8e5fb0)
[ 87.443424] 5f60: c01093d4 00000000 00000000 00000000 ee8e4000 c0d08c20
[ 87.443459] 5f80: c0d08c8c 00000080 c0c65730 00000000 00000000 c0afe104 00000001 ee8e5fb8
[ 87.443482] 5fa0: c01093d4 c01093d8 60000013 ffffffff
[ 87.443527] [<c010d7b0>] (__irq_svc) from [<c01093d8>] (arch_cpu_idle+0x24/0x3c)
[ 87.443568] [<c01093d8>] (arch_cpu_idle) from [<c0161c20>] (do_idle+0x190/0x230)
[ 87.443604] [<c0161c20>] (do_idle) from [<c0162030>] (cpu_startup_entry+0x18/0x1c)
[ 87.443634] [<c0162030>] (cpu_startup_entry) from [<4010192c>] (0x4010192c)
[ 89.892591] systemd-journald[225]: Received request to flush runtime journal from PID 1
[ 90.289569] systemd-journald[225]: File /var/log/journal/7a4e9999283c4067874e93fe7b5d73eb/system.journal corrupted or uncleanly shut down




Best regards,
Krzysztof


2018-01-08 19:15:34

by Jeff Layton

Subject: Re: [PATCH v4 19/19] fs: handle inode->i_version more efficiently

On Mon, 2018-01-08 at 19:33 +0100, Krzysztof Kozlowski wrote:
> On Mon, Jan 08, 2018 at 01:00:19PM -0500, Jeff Layton wrote:
> > On Mon, 2018-01-08 at 18:29 +0100, Krzysztof Kozlowski wrote:
>
> (...)
>
> > > > Ok, thanks. If you're seeing hangs then that might imply that we have
> > > > some sort of excessive looping going on in the cmpxchg loops.
> > > >
> > > > Could you apply the patch below and let me know if it causes either of
> > > > the warnings to pop? That might at least point us in the right
> > > > direction:
> > >
> > > No new warnings with attached patch (except existing already lockdep:
> > > "INFO: trying to register non-static key.").
> > >
> >
> > Yeah, I saw that in the original logs and it looks unrelated (and
> > harmless).
> >
> > > Systemd timeouts on mounting /home but after entering rescue shell there
> > > is no problem running mount /home:
> > > Give root password for maintenance
> > > (or press Control-D to continue):
> > > root@odroidxu3:~# mount /home
> > > [ 220.659331] EXT4-fs (mmcblk1p2): mounted filesystem with ordered data mode. Opts: (null)
> > >
> >
> > Ok, thanks for testing it. So I guess we can probably rule out excessive
> > looping in those functions as the issue.
> >
> > To make sure I understand the problem: When systemd tries to do the
> > initial mount of /home (which is an ext4 filesystem), it hangs. But once
> > it drops to the shell, it works, if you do the mount by hand.
> >
> > Is that correct?
>
> Yes, although it also timeouts on setting up /dev/ttySAC2 (serial
> console).
>
> > If so, then is it possible to trigger sysrq commands during the hanging
> > mount attempt? Maybe you could use e.g. sysrq-l, sysrq-w, etc. to
> > determine what it's blocking on?

(trimming the output)

Thanks. I don't really see anything obvious in that info,
unfortunately. What we really need to do is find the systemd task
performing the mount, and see what it's doing.

We do have one questionable bug in the NFS changes though. Does this
patch help at all?

-------------------------------8<---------------------------------

SQUASH: nfs: fix i_version increment when adding a request

NFS treats this value as an opaque value with no flag, so we must
increment it as such instead of using inode_inc_iversion.

Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfs/write.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index a03fbac1f88c..48837b6250e9 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -755,7 +755,7 @@ static void nfs_inode_add_request(struct inode *inode, struct nfs_page *req)
 	spin_lock(&mapping->private_lock);
 	if (!nfs_have_writebacks(inode) &&
 	    NFS_PROTO(inode)->have_delegation(inode, FMODE_WRITE))
-		inode_inc_iversion(inode);
+		atomic64_inc(&inode->i_version);
 	if (likely(!PageSwapCache(req->wb_page))) {
 		set_bit(PG_MAPPED, &req->wb_flags);
 		SetPagePrivate(req->wb_page);
--
2.14.3
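
To make the distinction in this patch concrete, here is a minimal userspace sketch of the two increment styles. It assumes, as I understand the series, that the flag-aware counter keeps a "queried" flag in bit 0 and the counter proper in the bits above it. All model_* names are hypothetical and exist only for this illustration; none of this is the kernel implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Model only: bit 0 is a "queried" flag, the counter sits above it. */
#define MODEL_QUERIED   ((uint64_t)1)
#define MODEL_INCREMENT ((uint64_t)2)

struct model_inode {
	_Atomic uint64_t i_version;
};

/* Flag-aware increment, loosely modelled on what the series describes. */
static bool model_maybe_inc_iversion(struct model_inode *inode, bool force)
{
	uint64_t cur = atomic_load(&inode->i_version);
	uint64_t new;

	do {
		/* Nobody looked since the last bump: nothing to do. */
		if (!force && !(cur & MODEL_QUERIED))
			return false;
		/* Clear the flag and step the counter above it. */
		new = (cur & ~MODEL_QUERIED) + MODEL_INCREMENT;
	} while (!atomic_compare_exchange_weak(&inode->i_version, &cur, new));

	return true;
}

/* Opaque increment: the whole 64-bit word is the value. */
static void model_inc_raw(struct model_inode *inode)
{
	atomic_fetch_add(&inode->i_version, 1);
}

static void model_demo_increment(void)
{
	struct model_inode inode;

	/* An opaque, server-style value stored unshifted. */
	atomic_init(&inode.i_version, 4);

	/* Unforced and flag-aware: bit 0 is clear, so nothing happens. */
	assert(!model_maybe_inc_iversion(&inode, false));
	assert(atomic_load(&inode.i_version) == 4);

	/* Forced and flag-aware: the stored word moves by 2, not 1. */
	assert(model_maybe_inc_iversion(&inode, true));
	assert(atomic_load(&inode.i_version) == 6);

	/* Opaque increment: the stored word moves by exactly 1. */
	model_inc_raw(&inode);
	assert(atomic_load(&inode.i_version) == 7);
}
```

Calling model_demo_increment() from a main() exercises all three cases. The point is only that a flag-aware increment reinterprets bit 0 of whatever is stored, which is not what a filesystem wants if it treats the field as an opaque value with no flag.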


2018-01-08 20:05:06

by Jeff Layton

Subject: Re: [PATCH v4 19/19] fs: handle inode->i_version more efficiently

On Mon, 2018-01-08 at 14:15 -0500, Jeff Layton wrote:
> On Mon, 2018-01-08 at 19:33 +0100, Krzysztof Kozlowski wrote:
> > On Mon, Jan 08, 2018 at 01:00:19PM -0500, Jeff Layton wrote:
> > > On Mon, 2018-01-08 at 18:29 +0100, Krzysztof Kozlowski wrote:
> >
> > (...)
> >
> > > > > Ok, thanks. If you're seeing hangs then that might imply that we have
> > > > > some sort of excessive looping going on in the cmpxchg loops.
> > > > >
> > > > > Could you apply the patch below and let me know if it causes either of
> > > > > the warnings to pop? That might at least point us in the right
> > > > > direction:
> > > >
> > > > No new warnings with attached patch (except existing already lockdep:
> > > > "INFO: trying to register non-static key.").
> > > >
> > >
> > > Yeah, I saw that in the original logs and it looks unrelated (and
> > > harmless).
> > >
> > > > Systemd timeouts on mounting /home but after entering rescue shell there
> > > > is no problem running mount /home:
> > > > Give root password for maintenance
> > > > (or press Control-D to continue):
> > > > root@odroidxu3:~# mount /home
> > > > [ 220.659331] EXT4-fs (mmcblk1p2): mounted filesystem with ordered data mode. Opts: (null)
> > > >
> > >
> > > Ok, thanks for testing it. So I guess we can probably rule out excessive
> > > looping in those functions as the issue.
> > >
> > > To make sure I understand the problem: When systemd tries to do the
> > > initial mount of /home (which is an ext4 filesystem), it hangs. But once
> > > it drops to the shell, it works, if you do the mount by hand.
> > >
> > > Is that correct?
> >
> > Yes, although it also timeouts on setting up /dev/ttySAC2 (serial
> > console).
> >
> > > If so, then is it possible to trigger sysrq commands during the hanging
> > > mount attempt? Maybe you could use e.g. sysrq-l, sysrq-w, etc. to
> > > determine what it's blocking on?
>
> (trimming the output)
>
> Thanks. I don't really see anything obvious in that info,
> unfortunately. What we really need to do is find the systemd task
> performing the mount, and see what it's doing.
>
> We do have one questionable bug in the NFS changes though. Does this
> patch help at all?
>
> -------------------------------8<---------------------------------
>
> SQUASH: nfs: fix i_version increment when adding a request
>
> NFS treats this value as an opaque value with no flag, so we must
> increment it as such instead of using inode_inc_iversion.
>
> Signed-off-by: Jeff Layton <[email protected]>
> ---
> fs/nfs/write.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> index a03fbac1f88c..48837b6250e9 100644
> --- a/fs/nfs/write.c
> +++ b/fs/nfs/write.c
> @@ -755,7 +755,7 @@ static void nfs_inode_add_request(struct inode *inode, struct nfs_page *req)
> spin_lock(&mapping->private_lock);
> if (!nfs_have_writebacks(inode) &&
> NFS_PROTO(inode)->have_delegation(inode, FMODE_WRITE))
> - inode_inc_iversion(inode);
> + atomic64_inc(&inode->i_version);
> if (likely(!PageSwapCache(req->wb_page))) {
> set_bit(PG_MAPPED, &req->wb_flags);
> SetPagePrivate(req->wb_page);

Sorry for the slow dribble of patches. I found a bug in the ext4 code
too. Might want to try this ext4 patch as well:

----------------------8<-----------------

SQUASH: ext4: use raw API for xattr inode refcounts

This could distort the values otherwise.

(Side Q: Why do we do this across the ctime and iversion like this?)

Signed-off-by: Jeff Layton <[email protected]>
---
fs/ext4/xattr.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index ba6fd5439aa4..63656dbafdc4 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -295,13 +295,13 @@ ext4_xattr_inode_hash(struct ext4_sb_info *sbi, const void *buffer, size_t size)
 static u64 ext4_xattr_inode_get_ref(struct inode *ea_inode)
 {
 	return ((u64)ea_inode->i_ctime.tv_sec << 32) |
-	       (u32) inode_peek_iversion(ea_inode);
+	       (u32) inode_peek_iversion_raw(ea_inode);
 }

 static void ext4_xattr_inode_set_ref(struct inode *ea_inode, u64 ref_count)
 {
 	ea_inode->i_ctime.tv_sec = (u32)(ref_count >> 32);
-	inode_set_iversion(ea_inode, ref_count & 0xffffffff);
+	inode_set_iversion_raw(ea_inode, ref_count & 0xffffffff);
 }

 static u32 ext4_xattr_inode_get_hash(struct inode *ea_inode)
--
2.14.3
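
As a rough illustration of what these two helpers do, the sketch below packs a 64-bit reference count into two fields the same way the hunk does: the upper 32 bits go into the ctime seconds and the lower 32 bits into the version word. The struct and all model_* names are hypothetical stand-ins, and the one-bit shift in model_peek_shifted() is an assumption about how a flag-aware accessor behaves. It does not try to reproduce the exact distortion the commit message has in mind; it only shows what happens in this model when a value stored raw is read back through a shifting accessor.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the two inode fields being reused. */
struct model_ea_inode {
	int64_t  ctime_sec;   /* carries the upper 32 bits of the count */
	uint64_t i_version;   /* carries the lower 32 bits of the count */
};

/* Raw accessors: the stored word is the value. */
static uint64_t model_peek_raw(const struct model_ea_inode *ea)
{
	return ea->i_version;
}

static void model_set_raw(struct model_ea_inode *ea, uint64_t val)
{
	ea->i_version = val;
}

/* Assumed flag-aware read: drops bit 0 by shifting right by one. */
static uint64_t model_peek_shifted(const struct model_ea_inode *ea)
{
	return ea->i_version >> 1;
}

/* Split a 64-bit count across the two fields. */
static void model_set_ref(struct model_ea_inode *ea, uint64_t ref_count)
{
	ea->ctime_sec = (uint32_t)(ref_count >> 32);
	model_set_raw(ea, ref_count & 0xffffffff);
}

/* Reassemble the count from the two fields. */
static uint64_t model_get_ref(const struct model_ea_inode *ea)
{
	return ((uint64_t)ea->ctime_sec << 32) | (uint32_t)model_peek_raw(ea);
}

/* Same as above, but reading the low half through the shifting accessor. */
static uint64_t model_get_ref_mixed(const struct model_ea_inode *ea)
{
	return ((uint64_t)ea->ctime_sec << 32) | (uint32_t)model_peek_shifted(ea);
}

static void model_demo_refcount(void)
{
	struct model_ea_inode ea = { 0 };
	const uint64_t ref = ((uint64_t)5 << 32) | 7;

	model_set_ref(&ea, ref);
	assert(ea.ctime_sec == 5);
	assert(ea.i_version == 7);

	/* Raw set followed by raw peek round-trips the count. */
	assert(model_get_ref(&ea) == ref);

	/* Raw set followed by a shifted peek loses the low bit: 7 becomes 3. */
	assert(model_get_ref_mixed(&ea) == (((uint64_t)5 << 32) | 3));
}
```

In this model the count only survives a round trip when the setter and getter agree on whether bit 0 is data or a flag.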




2018-01-08 20:17:19

by Krzysztof Kozlowski

Subject: Re: [PATCH v4 19/19] fs: handle inode->i_version more efficiently

On Mon, Jan 08, 2018 at 02:15:29PM -0500, Jeff Layton wrote:
> On Mon, 2018-01-08 at 19:33 +0100, Krzysztof Kozlowski wrote:
> > On Mon, Jan 08, 2018 at 01:00:19PM -0500, Jeff Layton wrote:
> > > On Mon, 2018-01-08 at 18:29 +0100, Krzysztof Kozlowski wrote:
> >
> > (...)
> >
> > > > > Ok, thanks. If you're seeing hangs then that might imply that we have
> > > > > some sort of excessive looping going on in the cmpxchg loops.
> > > > >
> > > > > Could you apply the patch below and let me know if it causes either of
> > > > > the warnings to pop? That might at least point us in the right
> > > > > direction:
> > > >
> > > > No new warnings with attached patch (except existing already lockdep:
> > > > "INFO: trying to register non-static key.").
> > > >
> > >
> > > Yeah, I saw that in the original logs and it looks unrelated (and
> > > harmless).
> > >
> > > > Systemd timeouts on mounting /home but after entering rescue shell there
> > > > is no problem running mount /home:
> > > > Give root password for maintenance
> > > > (or press Control-D to continue):
> > > > root@odroidxu3:~# mount /home
> > > > [ 220.659331] EXT4-fs (mmcblk1p2): mounted filesystem with ordered data mode. Opts: (null)
> > > >
> > >
> > > Ok, thanks for testing it. So I guess we can probably rule out excessive
> > > looping in those functions as the issue.
> > >
> > > To make sure I understand the problem: When systemd tries to do the
> > > initial mount of /home (which is an ext4 filesystem), it hangs. But once
> > > it drops to the shell, it works, if you do the mount by hand.
> > >
> > > Is that correct?
> >
> > Yes, although it also timeouts on setting up /dev/ttySAC2 (serial
> > console).
> >
> > > If so, then is it possible to trigger sysrq commands during the hanging
> > > mount attempt? Maybe you could use e.g. sysrq-l, sysrq-w, etc. to
> > > determine what it's blocking on?
>
> (trimming the output)
>
> Thanks. I don't really see anything obvious in that info,
> unfortunately. What we really need to do is find the systemd task
> performing the mount, and see what it's doing.

It's systemd 236.0-2 coming from Arch Linux for ARM. All packages are
updated.

> We do have one questionable bug in the NFS changes though. Does this
> patch help at all?

The patches do not change anything (I tried both "SQUASH: nfs: fix
i_version increment when adding a request" and "SQUASH: ext4: use raw
API for xattr inode refcounts").

I tried a regular boot with the root filesystem on the SD card again, and
it succeeded. Only nfsroot fails.

Best regards,
Krzysztof


>
> -------------------------------8<---------------------------------
>
> SQUASH: nfs: fix i_version increment when adding a request
>
> NFS treats this value as an opaque value with no flag, so we must
> increment it as such instead of using inode_inc_iversion.
>
> Signed-off-by: Jeff Layton <[email protected]>
> ---
> fs/nfs/write.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> index a03fbac1f88c..48837b6250e9 100644
> --- a/fs/nfs/write.c
> +++ b/fs/nfs/write.c
> @@ -755,7 +755,7 @@ static void nfs_inode_add_request(struct inode *inode, struct nfs_page *req)
> spin_lock(&mapping->private_lock);
> if (!nfs_have_writebacks(inode) &&
> NFS_PROTO(inode)->have_delegation(inode, FMODE_WRITE))
> - inode_inc_iversion(inode);
> + atomic64_inc(&inode->i_version);
> if (likely(!PageSwapCache(req->wb_page))) {
> set_bit(PG_MAPPED, &req->wb_flags);
> SetPagePrivate(req->wb_page);
> --
> 2.14.3
>

2018-01-08 21:39:58

by Jeff Layton

Subject: Re: [PATCH v4 19/19] fs: handle inode->i_version more efficiently

On Mon, 2018-01-08 at 21:17 +0100, Krzysztof Kozlowski wrote:
> On Mon, Jan 08, 2018 at 02:15:29PM -0500, Jeff Layton wrote:
> > On Mon, 2018-01-08 at 19:33 +0100, Krzysztof Kozlowski wrote:
> > > On Mon, Jan 08, 2018 at 01:00:19PM -0500, Jeff Layton wrote:
> > > > On Mon, 2018-01-08 at 18:29 +0100, Krzysztof Kozlowski wrote:
> > >
> > > (...)
> > >
> > > > > > Ok, thanks. If you're seeing hangs then that might imply that we have
> > > > > > some sort of excessive looping going on in the cmpxchg loops.
> > > > > >
> > > > > > Could you apply the patch below and let me know if it causes either of
> > > > > > the warnings to pop? That might at least point us in the right
> > > > > > direction:
> > > > >
> > > > > No new warnings with attached patch (except existing already lockdep:
> > > > > "INFO: trying to register non-static key.").
> > > > >
> > > >
> > > > Yeah, I saw that in the original logs and it looks unrelated (and
> > > > harmless).
> > > >
> > > > > Systemd timeouts on mounting /home but after entering rescue shell there
> > > > > is no problem running mount /home:
> > > > > Give root password for maintenance
> > > > > (or press Control-D to continue):
> > > > > root@odroidxu3:~# mount /home
> > > > > [ 220.659331] EXT4-fs (mmcblk1p2): mounted filesystem with ordered data mode. Opts: (null)
> > > > >
> > > >
> > > > Ok, thanks for testing it. So I guess we can probably rule out excessive
> > > > looping in those functions as the issue.
> > > >
> > > > To make sure I understand the problem: When systemd tries to do the
> > > > initial mount of /home (which is an ext4 filesystem), it hangs. But once
> > > > it drops to the shell, it works, if you do the mount by hand.
> > > >
> > > > Is that correct?
> > >
> > > Yes, although it also timeouts on setting up /dev/ttySAC2 (serial
> > > console).
> > >
> > > > If so, then is it possible to trigger sysrq commands during the hanging
> > > > mount attempt? Maybe you could use e.g. sysrq-l, sysrq-w, etc. to
> > > > determine what it's blocking on?
> >
> > (trimming the output)
> >
> > Thanks. I don't really see anything obvious in that info,
> > unfortunately. What we really need to do is find the systemd task
> > performing the mount, and see what it's doing.
>
> It's systemd 236.0-2 coming from Arch Linux for ARM. All packages are
> updated.
>
> > We do have one questionable bug in the NFS changes though. Does this
> > patch help at all?
>
> Patches do not change anything (I tried "SQUASH: nfs: fix i_version
> increment when adding a request" and "SQUASH: ext4: use raw API for
> xattr inode refcounts").
>
> I tried again regular SDcard-root boot and it succeeded. Only nfsroot
> fails.
>
> Best regards,
> Krzysztof
>

Got it, that's helpful. Does this patch help (on top of the others)?

------------------------8<--------------------------

SQUASH: nfs: compare raw iversion counter since that's what's
being stored

Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfs/inode.c | 6 +++---
include/linux/iversion.h | 13 +++++++++++++
2 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index 0b85cca1184b..93552c482992 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -1289,7 +1289,7 @@ static unsigned long nfs_wcc_update_inode(struct inode *inode, struct nfs_fattr

 	if ((fattr->valid & NFS_ATTR_FATTR_PRECHANGE)
 			&& (fattr->valid & NFS_ATTR_FATTR_CHANGE)
-			&& !inode_cmp_iversion(inode, fattr->pre_change_attr)) {
+			&& !inode_cmp_iversion_raw(inode, fattr->pre_change_attr)) {
 		inode_set_iversion_raw(inode, fattr->change_attr);
 		if (S_ISDIR(inode->i_mode))
 			nfs_set_cache_invalid(inode, NFS_INO_INVALID_DATA);
@@ -1348,7 +1348,7 @@ static int nfs_check_inode_attributes(struct inode *inode, struct nfs_fattr *fat

 	if (!nfs_file_has_buffered_writers(nfsi)) {
 		/* Verify a few of the more important attributes */
-		if ((fattr->valid & NFS_ATTR_FATTR_CHANGE) != 0 && inode_cmp_iversion(inode, fattr->change_attr))
+		if ((fattr->valid & NFS_ATTR_FATTR_CHANGE) != 0 && inode_cmp_iversion_raw(inode, fattr->change_attr))
 			invalid |= NFS_INO_INVALID_ATTR | NFS_INO_REVAL_PAGECACHE;

 		if ((fattr->valid & NFS_ATTR_FATTR_MTIME) && !timespec_equal(&inode->i_mtime, &fattr->mtime))
@@ -1778,7 +1778,7 @@ static int nfs_update_inode(struct inode *inode, struct nfs_fattr *fattr)

 	/* More cache consistency checks */
 	if (fattr->valid & NFS_ATTR_FATTR_CHANGE) {
-		if (inode_cmp_iversion(inode, fattr->change_attr)) {
+		if (inode_cmp_iversion_raw(inode, fattr->change_attr)) {
 			dprintk("NFS: change_attr change on server for file %s/%ld\n",
 					inode->i_sb->s_id, inode->i_ino);
 			/* Could it be a race with writeback? */
diff --git a/include/linux/iversion.h b/include/linux/iversion.h
index 107fcb3ec809..8c97f67ffbbc 100644
--- a/include/linux/iversion.h
+++ b/include/linux/iversion.h
@@ -190,6 +190,19 @@ inode_query_iversion(struct inode *inode)
 	return inode_peek_iversion(inode);
 }

+/**
+ * inode_cmp_iversion_raw - check whether the raw i_version counter has changed
+ * @inode: inode to check
+ * @old: old value to check against its i_version
+ *
+ * Compare the current raw i_version counter with a previous one. Returns 0 if
+ * they are the same or non-zero if they are different.
+ */
+static inline s64
+inode_cmp_iversion_raw(const struct inode *inode, u64 old)
+{
+	return (s64)inode_peek_iversion_raw(inode) - (s64)old;
+}
/**
* inode_cmp_iversion - check whether the i_version counter has changed
* @inode: inode to check
--
2.14.3
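
The reasoning in the subject line ("compare raw ... since that's what's being stored") can be sketched in userspace as follows. This is a toy model under the assumption that the flag-aware peek shifts the stored word right by one bit to drop a flag, while the raw peek returns the word unchanged. The model_* names are hypothetical and only mirror the shape of the helpers in the patch.

```c
#include <assert.h>
#include <stdint.h>

struct model_inode {
	uint64_t i_version;
};

/* Store an opaque value unshifted, as a raw setter would. */
static void model_set_iversion_raw(struct model_inode *inode, uint64_t val)
{
	inode->i_version = val;
}

/* Raw read: the stored word is the value. */
static uint64_t model_peek_iversion_raw(const struct model_inode *inode)
{
	return inode->i_version;
}

/* Assumed flag-aware read: bit 0 is a flag, so shift it away. */
static uint64_t model_peek_iversion(const struct model_inode *inode)
{
	return inode->i_version >> 1;
}

/* Returns 0 if unchanged, non-zero otherwise (flag-aware view). */
static int64_t model_cmp_iversion(const struct model_inode *inode, uint64_t old)
{
	return (int64_t)model_peek_iversion(inode) - (int64_t)old;
}

/* Returns 0 if unchanged, non-zero otherwise (raw view). */
static int64_t model_cmp_iversion_raw(const struct model_inode *inode, uint64_t old)
{
	return (int64_t)model_peek_iversion_raw(inode) - (int64_t)old;
}

static void model_demo_compare(void)
{
	struct model_inode inode = { 0 };

	/* The client stores the server's change attribute unshifted. */
	model_set_iversion_raw(&inode, 10);

	/* The server reports the same attribute again: raw compare agrees. */
	assert(model_cmp_iversion_raw(&inode, 10) == 0);

	/* The flag-aware compare sees 5 instead of 10 and reports a change. */
	assert(model_cmp_iversion(&inode, 10) == -5);
}
```

In this model, a value stored raw and compared through the shifted view never matches unless it is zero, so every attribute check would look like a change on the server.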


2018-01-09 09:27:54

by Krzysztof Kozlowski

Subject: Re: [PATCH v4 19/19] fs: handle inode->i_version more efficiently

On Mon, Jan 8, 2018 at 10:39 PM, Jeff Layton <[email protected]> wrote:
>>
>
> Got it, that's helpful. Does this patch help (on top of the others) ?
>
> ------------------------8<--------------------------
>
> SQUASH: nfs: compare raw iversion counter since that's what's
> being stored
>

It did not apply cleanly (in include/linux/iversion.h), but that was easy
to fix up. However, there is still no improvement. I applied it on top of
the others.

Best regards,
Krzysztof