2023-02-14 00:12:19

by NeilBrown

Subject: [PATCH v2] NFSv3: handle out-of-order write replies.


NFSv3 includes pre/post wcc attributes which allow the client to
determine if all changes to the file have been made by the client
itself, or if any might have been made by some other client.

If there are gaps in the pre/post ctime sequence it must be assumed that
some other client changed the file in that gap and the local cache must
be suspect. The next time the file is opened the cache should be
invalidated.

Since Commit 1c341b777501 ("NFS: Add deferred cache invalidation for
close-to-open consistency violations") in linux 5.3 the Linux client has
been triggering this invalidation. The chunk in nfs_update_inode() in
particularly triggers.

Unfortunately Linux NFS assumes that all replies will be processed in
the order sent, and will arrive in the order processed. This is not
true in general. Consequently Linux NFS might ignore the wcc info in a
WRITE reply because the reply is in response to a WRITE that was sent
before some other request for which a reply has already been seen. This
is detected by Linux using the gencount tests in nfs_inode_attr_cmp().

Also, when the gencount tests pass it is still possible that the requests
were processed on the server in a different order, and a gap seen in
the ctime sequence might be filled in by a subsequent reply, so gaps
should not immediately trigger delayed invalidation.

The net result is that writing to a server and then reading the file
back can result in going to the server for the read rather than serving
it from cache - all because a couple of replies arrived out-of-order.
This is a performance regression over kernels before 5.3, though the
change in 5.3 is a correctness improvement.

This has been seen with Linux writing to a Netapp server which
occasionally re-orders requests. In testing the majority of requests
were in-order, but a few (maybe two or three at a time) could be
re-ordered.

This patch addresses the problem by recording any gaps seen in the
pre/post ctime sequence and not triggering invalidation until either
there are too many gaps to fit in the table, or until there are no more
active writes and the remaining gaps cannot be resolved.

We allocate a table of 16 gaps on demand. If the allocation fails we
revert to current behaviour which is of little cost as we are unlikely
to be able to cache the writes anyway.

In the table we store a "start->end" pair when iversion is updated and
an "end<-start" pair for each pre/post pair reported by the server. Usually these
exactly cancel out and so nothing is stored. When there are
out-of-order replies we do store gaps and these will eventually be
cancelled against later replies when this client is the only writer.
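
As an illustration only (not part of the patch), here is a minimal
user-space sketch of that bookkeeping. It reuses the merge logic of
nfs_ooo_merge() from the diff below, but with a fixed static table, no
locking and no overflow handling; the function and variable names and
the change numbers are invented for the example.

#include <stdio.h>

static struct {
	int cnt;
	struct { unsigned long long start, end; } gap[16];
} ooo;

static void ooo_merge(unsigned long long start, unsigned long long end)
{
	int i, cnt = ooo.cnt;

	for (i = 0; i < cnt; i++) {
		if (end == ooo.gap[i].start)
			end = ooo.gap[i].end;
		else if (start == ooo.gap[i].end)
			start = ooo.gap[i].start;
		else
			continue;
		/* remove entry 'i', then rescan with the widened range */
		cnt -= 1;
		ooo.gap[i] = ooo.gap[cnt];
		i = -1;
	}
	if (start != end) {
		ooo.gap[cnt].start = start;
		ooo.gap[cnt].end = end;
		cnt += 1;
	}
	ooo.cnt = cnt;
}

int main(void)
{
	unsigned long long iversion = 100;

	/* The server processed W1 (change 100->101) then W2 (101->102),
	 * but the client handles W2's reply first.  Its pre-change (101)
	 * does not match iversion (100), so both the reversed reply pair
	 * and the iversion jump are recorded; they merge into one gap
	 * covering the missing 100->101 step.
	 */
	ooo_merge(102, 101);		/* reply pair, reversed: post<-pre */
	ooo_merge(iversion, 102);	/* iversion jump: 100->102 */
	iversion = 102;
	printf("after W2 reply: %d gap(s)\n", ooo.cnt);	/* prints 1 */

	/* W1's reply is handled later (pre=100, post=101), again reversed.
	 * It exactly cancels the recorded gap, so nothing is left and no
	 * invalidation is needed at the next open.
	 */
	ooo_merge(101, 100);
	printf("after W1 reply: %d gap(s)\n", ooo.cnt);	/* prints 0 */
	return 0;
}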

If the final write is out-of-order there may be one gap remaining when
the file is closed. This will be noticed and if there is precisely one
gap and if the iversion can be advanced to match it, then we do so.

One difficulty in this is that in some circumstances,
NFS_ATTR_FATTR_PRECHANGE is cleared. This causes the gap tracking to
fail because we can only add pre/post attributes to the table if both
NFS_ATTR_FATTR_PRECHANGE and NFS_ATTR_FATTR_CHANGE are set. If one is
cleared, we have a problem.

This patch works around that by introducing a new
NFS_ATTR_FATTR_PRECHANGE_RAW which is set when the normal PRECHANGE is
cleared. If either PRECHANGE or PRECHANGE_RAW is set then we can use the
prechange value. Obviously this is a hack. I would be keen to
understand why PRECHANGE is being cleared so I can find a better
resolution.

Signed-off-by: NeilBrown <[email protected]>
---
fs/nfs/dir.c | 2 +-
fs/nfs/inode.c | 105 ++++++++++++++++++++++++++++++++++------
include/linux/nfs_fs.h | 47 ++++++++++++++++++
include/linux/nfs_xdr.h | 1 +
4 files changed, 139 insertions(+), 16 deletions(-)

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index f7e4a88d5d92..5ec8030d9fe8 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -84,7 +84,7 @@ alloc_nfs_open_dir_context(struct inode *dir)
ctx->dtsize = NFS_INIT_DTSIZE;
spin_lock(&dir->i_lock);
if (list_empty(&nfsi->open_files) &&
- (nfsi->cache_validity & NFS_INO_DATA_INVAL_DEFER))
+ nfs_ooo_test(nfsi))
nfs_set_cache_invalid(dir,
NFS_INO_INVALID_DATA |
NFS_INO_REVAL_FORCED);
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index e98ee7599eeb..82b56761294e 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -208,11 +208,12 @@ void nfs_set_cache_invalid(struct inode *inode, unsigned long flags)

nfsi->cache_validity |= flags;

- if (inode->i_mapping->nrpages == 0)
- nfsi->cache_validity &= ~(NFS_INO_INVALID_DATA |
- NFS_INO_DATA_INVAL_DEFER);
- else if (nfsi->cache_validity & NFS_INO_INVALID_DATA)
- nfsi->cache_validity &= ~NFS_INO_DATA_INVAL_DEFER;
+ if (inode->i_mapping->nrpages == 0) {
+ nfsi->cache_validity &= ~NFS_INO_INVALID_DATA;
+ nfs_ooo_clear(nfsi);
+ } else if (nfsi->cache_validity & NFS_INO_INVALID_DATA) {
+ nfs_ooo_clear(nfsi);
+ }
trace_nfs_set_cache_invalid(inode, 0);
}
EXPORT_SYMBOL_GPL(nfs_set_cache_invalid);
@@ -677,9 +678,10 @@ static int nfs_vmtruncate(struct inode * inode, loff_t offset)
trace_nfs_size_truncate(inode, offset);
i_size_write(inode, offset);
/* Optimisation */
- if (offset == 0)
- NFS_I(inode)->cache_validity &= ~(NFS_INO_INVALID_DATA |
- NFS_INO_DATA_INVAL_DEFER);
+ if (offset == 0) {
+ NFS_I(inode)->cache_validity &= ~NFS_INO_INVALID_DATA;
+ nfs_ooo_clear(NFS_I(inode));
+ }
NFS_I(inode)->cache_validity &= ~NFS_INO_INVALID_SIZE;

spin_unlock(&inode->i_lock);
@@ -1101,7 +1103,7 @@ void nfs_inode_attach_open_context(struct nfs_open_context *ctx)

spin_lock(&inode->i_lock);
if (list_empty(&nfsi->open_files) &&
- (nfsi->cache_validity & NFS_INO_DATA_INVAL_DEFER))
+ nfs_ooo_test(nfsi))
nfs_set_cache_invalid(inode, NFS_INO_INVALID_DATA |
NFS_INO_REVAL_FORCED);
list_add_tail_rcu(&ctx->list, &nfsi->open_files);
@@ -1345,8 +1347,8 @@ int nfs_clear_invalid_mapping(struct address_space *mapping)

set_bit(NFS_INO_INVALIDATING, bitlock);
smp_wmb();
- nfsi->cache_validity &=
- ~(NFS_INO_INVALID_DATA | NFS_INO_DATA_INVAL_DEFER);
+ nfsi->cache_validity &= ~NFS_INO_INVALID_DATA;
+ nfs_ooo_clear(nfsi);
spin_unlock(&inode->i_lock);
trace_nfs_invalidate_mapping_enter(inode);
ret = nfs_invalidate_mapping(inode, mapping);
@@ -1808,6 +1810,52 @@ static int nfs_inode_finish_partial_attr_update(const struct nfs_fattr *fattr,
return 0;
}

+static void nfs_ooo_merge(struct nfs_inode *nfsi,
+ u64 start, u64 end)
+{
+ int i, cnt;
+
+ if (nfsi->cache_validity & NFS_INO_DATA_INVAL_DEFER)
+ /* No point merging anything */
+ return;
+
+ if (!nfsi->ooo) {
+ nfsi->ooo = kmalloc(sizeof(*nfsi->ooo), GFP_ATOMIC);
+ if (!nfsi->ooo) {
+ nfsi->cache_validity |= NFS_INO_DATA_INVAL_DEFER;
+ return;
+ }
+ nfsi->ooo->cnt = 0;
+ }
+
+ /* add this range, merging if possible */
+ cnt = nfsi->ooo->cnt;
+ for (i = 0; i < cnt; i++) {
+ if (end == nfsi->ooo->gap[i].start)
+ end = nfsi->ooo->gap[i].end;
+ else if (start == nfsi->ooo->gap[i].end)
+ start = nfsi->ooo->gap[i].start;
+ else
+ continue;
+ /* Remove 'i' from table and loop to insert the new range */
+ cnt -= 1;
+ nfsi->ooo->gap[i] = nfsi->ooo->gap[cnt];
+ i = -1;
+ }
+ if (start != end) {
+ if (cnt >= ARRAY_SIZE(nfsi->ooo->gap)) {
+ nfsi->cache_validity |= NFS_INO_DATA_INVAL_DEFER;
+ kfree(nfsi->ooo);
+ nfsi->ooo = NULL;
+ return;
+ }
+ nfsi->ooo->gap[cnt].start = start;
+ nfsi->ooo->gap[cnt].end = end;
+ cnt += 1;
+ }
+ nfsi->ooo->cnt = cnt;
+}
+
static int nfs_refresh_inode_locked(struct inode *inode,
struct nfs_fattr *fattr)
{
@@ -1818,8 +1866,17 @@ static int nfs_refresh_inode_locked(struct inode *inode,

if (attr_cmp > 0 || nfs_inode_finish_partial_attr_update(fattr, inode))
ret = nfs_update_inode(inode, fattr);
- else if (attr_cmp == 0)
- ret = nfs_check_inode_attributes(inode, fattr);
+ else {
+ if ((fattr->valid & NFS_ATTR_FATTR_CHANGE) &&
+ (fattr->valid & (NFS_ATTR_FATTR_PRECHANGE |
+ NFS_ATTR_FATTR_PRECHANGE_RAW)))
+ nfs_ooo_merge(NFS_I(inode),
+ fattr->change_attr,
+ fattr->pre_change_attr);
+
+ if (attr_cmp == 0)
+ ret = nfs_check_inode_attributes(inode, fattr);
+ }

trace_nfs_refresh_inode_exit(inode, ret);
return ret;
@@ -1910,6 +1967,8 @@ int nfs_post_op_update_inode_force_wcc_locked(struct inode *inode, struct nfs_fa
if (attr_cmp < 0)
return 0;
if ((fattr->valid & NFS_ATTR_FATTR) == 0 || !attr_cmp) {
+ if (fattr->valid & NFS_ATTR_FATTR_PRECHANGE)
+ fattr->valid |= NFS_ATTR_FATTR_PRECHANGE_RAW;
fattr->valid &= ~(NFS_ATTR_FATTR_PRECHANGE
| NFS_ATTR_FATTR_PRESIZE
| NFS_ATTR_FATTR_PREMTIME
@@ -2064,6 +2123,15 @@ static int nfs_update_inode(struct inode *inode, struct nfs_fattr *fattr)

/* More cache consistency checks */
if (fattr->valid & NFS_ATTR_FATTR_CHANGE) {
+ if (!have_writers && nfsi->ooo && nfsi->ooo->cnt == 1 &&
+ nfsi->ooo->gap[0].end == inode_peek_iversion_raw(inode)) {
+ /* There is one remaining gap that hasn't been
+ * merged into iversion - do that now.
+ */
+ inode_set_iversion_raw(inode, nfsi->ooo->gap[0].start);
+ kfree(nfsi->ooo);
+ nfsi->ooo = NULL;
+ }
if (!inode_eq_iversion_raw(inode, fattr->change_attr)) {
/* Could it be a race with writeback? */
if (!(have_writers || have_delegation)) {
@@ -2085,8 +2153,13 @@ static int nfs_update_inode(struct inode *inode, struct nfs_fattr *fattr)
dprintk("NFS: change_attr change on server for file %s/%ld\n",
inode->i_sb->s_id,
inode->i_ino);
- } else if (!have_delegation)
- nfsi->cache_validity |= NFS_INO_DATA_INVAL_DEFER;
+ } else if (!have_delegation) {
+ if (fattr->valid & NFS_ATTR_FATTR_PRECHANGE)
+ nfs_ooo_merge(nfsi, fattr->change_attr,
+ fattr->pre_change_attr);
+ nfs_ooo_merge(nfsi, inode_peek_iversion_raw(inode),
+ fattr->change_attr);
+ }
inode_set_iversion_raw(inode, fattr->change_attr);
}
} else {
@@ -2240,6 +2313,7 @@ struct inode *nfs_alloc_inode(struct super_block *sb)
return NULL;
nfsi->flags = 0UL;
nfsi->cache_validity = 0UL;
+ nfsi->ooo = NULL;
#if IS_ENABLED(CONFIG_NFS_V4)
nfsi->nfs4_acl = NULL;
#endif /* CONFIG_NFS_V4 */
@@ -2252,6 +2326,7 @@ EXPORT_SYMBOL_GPL(nfs_alloc_inode);

void nfs_free_inode(struct inode *inode)
{
+ kfree(NFS_I(inode)->ooo);
kmem_cache_free(nfs_inode_cachep, NFS_I(inode));
}
EXPORT_SYMBOL_GPL(nfs_free_inode);
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index d92fdfd2444c..bd186a8a6153 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -191,6 +191,39 @@ struct nfs_inode {
/* Open contexts for shared mmap writes */
struct list_head open_files;

+ /* Keep track of out-of-order replies.
+ * The ooo array contains start/end pairs of
+ * numbers from the changeid sequence when
+ * the inode's iversion has been updated.
+ * It also contains end/start pair (i.e. reverse order)
+ * of sections of the changeid sequence that have
+ * been seen in replies from the server.
+ * Normally these should match and when both
+ * A:B and B:A are found in ooo, they are both removed.
+ * And if a reply with A:B causes an iversion update
+ * of A:B, then neither are added.
+ * When a reply has pre_change that doesn't match
+ * iversion, then the changeid pair, and any consequent
+ * change in iversion ARE added. Later replies
+ * might fill in the gaps, or possibly a gap is caused
+ * by a change from another client.
+ * When a file or directory is opened, if the ooo table
+ * is not empty, then we assume the gaps were due to
+ * another client and we invalidate the cached data.
+ *
+ * We can only track a limited number of concurrent gaps.
+ * Currently that limit is 16.
+ * We allocate the table on demand. If there is insufficient
+ * memory, then we probably cannot cache the file anyway
+ * so there is no loss.
+ */
+ struct {
+ int cnt;
+ struct {
+ u64 start, end;
+ } gap[16];
+ } *ooo;
+
#if IS_ENABLED(CONFIG_NFS_V4)
struct nfs4_cached_acl *nfs4_acl;
/* NFSv4 state */
@@ -616,6 +649,20 @@ nfs_fileid_to_ino_t(u64 fileid)
return ino;
}

+static inline void nfs_ooo_clear(struct nfs_inode *nfsi)
+{
+ nfsi->cache_validity &= ~NFS_INO_DATA_INVAL_DEFER;
+ kfree(nfsi->ooo);
+ nfsi->ooo = NULL;
+}
+
+static inline bool nfs_ooo_test(struct nfs_inode *nfsi)
+{
+ return (nfsi->cache_validity & NFS_INO_DATA_INVAL_DEFER) ||
+ (nfsi->ooo && nfsi->ooo->cnt > 0);
+
+}
+
#define NFS_JUKEBOX_RETRY_TIME (5 * HZ)

/* We need to block new opens while a file is being unlinked.
diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
index e86cf6642d21..b18a877b16eb 100644
--- a/include/linux/nfs_xdr.h
+++ b/include/linux/nfs_xdr.h
@@ -106,6 +106,7 @@ struct nfs_fattr {
#define NFS_ATTR_FATTR_OWNER_NAME (1U << 23)
#define NFS_ATTR_FATTR_GROUP_NAME (1U << 24)
#define NFS_ATTR_FATTR_V4_SECURITY_LABEL (1U << 25)
+#define NFS_ATTR_FATTR_PRECHANGE_RAW (1U << 26)

#define NFS_ATTR_FATTR (NFS_ATTR_FATTR_TYPE \
| NFS_ATTR_FATTR_MODE \
--
2.39.1



2023-02-15 05:13:24

by NeilBrown

Subject: [PATCH v3] NFSv3: handle out-of-order write replies.


NFSv3 includes pre/post wcc attributes which allow the client to
determine if all changes to the file have been made by the client
itself, or if any might have been made by some other client.

If there are gaps in the pre/post ctime sequence it must be assumed that
some other client changed the file in that gap and the local cache must
be suspect. The next time the file is opened the cache should be
invalidated.

Since Commit 1c341b777501 ("NFS: Add deferred cache invalidation for
close-to-open consistency violations") in linux 5.3 the Linux client has
been triggering this invalidation. The chunk in nfs_update_inode() in
particularly triggers.

Unfortunately Linux NFS assumes that all replies will be processed in
the order sent, and will arrive in the order processed. This is not
true in general. Consequently Linux NFS might ignore the wcc info in a
WRITE reply because the reply is in response to a WRITE that was sent
before some other request for which a reply has already been seen. This
is detected by Linux using the gencount tests in nfs_inode_attr_cmp().

Also, when the gencount tests pass it is still possible that the requests
were processed on the server in a different order, and a gap seen in
the ctime sequence might be filled in by a subsequent reply, so gaps
should not immediately trigger delayed invalidation.

The net result is that writing to a server and then reading the file
back can result in going to the server for the read rather than serving
it from cache - all because a couple of replies arrived out-of-order.
This is a performance regression over kernels before 5.3, though the
change in 5.3 is a correctness improvement.

This has been seen with Linux writing to a Netapp server which
occasionally re-orders requests. In testing the majority of requests
were in-order, but a few (maybe two or three at a time) could be
re-ordered.

This patch addresses the problem by recording any gaps seen in the
pre/post ctime sequence and not triggering invalidation until either
there are too many gaps to fit in the table, or until there are no more
active writes and the remaining gaps cannot be resolved.

We allocate a table of 16 gaps on demand. If the allocation fails we
revert to current behaviour which is of little cost as we are unlikely
to be able to cache the writes anyway.

In the table we store a "start->end" pair when iversion is updated and
an "end<-start" pair for each pre/post pair reported by the server. Usually these
exactly cancel out and so nothing is stored. When there are
out-of-order replies we do store gaps and these will eventually be
cancelled against later replies when this client is the only writer.

If the final write is out-of-order there may be one gap remaining when
the file is closed. This will be noticed and if there is precisely one
gap and if the iversion can be advanced to match it, then we do so.

This patch makes no attempt to handle directories correctly. The same
problem potentially exists there: out-of-order replies to create/unlink
requests can cause future lookup requests to be sent to the server
unnecessarily. A similar scheme using the same primitives could be used
to notice and handle out-of-order replies.

Signed-off-by: NeilBrown <[email protected]>
---
fs/nfs/inode.c | 120 +++++++++++++++++++++++++++++++++++------
include/linux/nfs_fs.h | 47 ++++++++++++++++
2 files changed, 152 insertions(+), 15 deletions(-)

Sorry for the new version so soon :-)

This version removes the PRE_CHANGE_RAW flag - I realised that before
clearing PRE_CHANGE I can simply record the pre/post info if relevant.

I also now understand the ways in which directories are different and so
make no attempt to address directories. The changelog explains that
there is room for improvement there, but it is a separate issue needing
separate testing.

Thanks,
NeilBrown



diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index e98ee7599eeb..8f1a78837ec9 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -208,11 +208,12 @@ void nfs_set_cache_invalid(struct inode *inode, unsigned long flags)

nfsi->cache_validity |= flags;

- if (inode->i_mapping->nrpages == 0)
- nfsi->cache_validity &= ~(NFS_INO_INVALID_DATA |
- NFS_INO_DATA_INVAL_DEFER);
- else if (nfsi->cache_validity & NFS_INO_INVALID_DATA)
- nfsi->cache_validity &= ~NFS_INO_DATA_INVAL_DEFER;
+ if (inode->i_mapping->nrpages == 0) {
+ nfsi->cache_validity &= ~NFS_INO_INVALID_DATA;
+ nfs_ooo_clear(nfsi);
+ } else if (nfsi->cache_validity & NFS_INO_INVALID_DATA) {
+ nfs_ooo_clear(nfsi);
+ }
trace_nfs_set_cache_invalid(inode, 0);
}
EXPORT_SYMBOL_GPL(nfs_set_cache_invalid);
@@ -677,9 +678,10 @@ static int nfs_vmtruncate(struct inode * inode, loff_t offset)
trace_nfs_size_truncate(inode, offset);
i_size_write(inode, offset);
/* Optimisation */
- if (offset == 0)
- NFS_I(inode)->cache_validity &= ~(NFS_INO_INVALID_DATA |
- NFS_INO_DATA_INVAL_DEFER);
+ if (offset == 0) {
+ NFS_I(inode)->cache_validity &= ~NFS_INO_INVALID_DATA;
+ nfs_ooo_clear(NFS_I(inode));
+ }
NFS_I(inode)->cache_validity &= ~NFS_INO_INVALID_SIZE;

spin_unlock(&inode->i_lock);
@@ -1101,7 +1103,7 @@ void nfs_inode_attach_open_context(struct nfs_open_context *ctx)

spin_lock(&inode->i_lock);
if (list_empty(&nfsi->open_files) &&
- (nfsi->cache_validity & NFS_INO_DATA_INVAL_DEFER))
+ nfs_ooo_test(nfsi))
nfs_set_cache_invalid(inode, NFS_INO_INVALID_DATA |
NFS_INO_REVAL_FORCED);
list_add_tail_rcu(&ctx->list, &nfsi->open_files);
@@ -1345,8 +1347,8 @@ int nfs_clear_invalid_mapping(struct address_space *mapping)

set_bit(NFS_INO_INVALIDATING, bitlock);
smp_wmb();
- nfsi->cache_validity &=
- ~(NFS_INO_INVALID_DATA | NFS_INO_DATA_INVAL_DEFER);
+ nfsi->cache_validity &= ~NFS_INO_INVALID_DATA;
+ nfs_ooo_clear(nfsi);
spin_unlock(&inode->i_lock);
trace_nfs_invalidate_mapping_enter(inode);
ret = nfs_invalidate_mapping(inode, mapping);
@@ -1808,6 +1810,74 @@ static int nfs_inode_finish_partial_attr_update(const struct nfs_fattr *fattr,
return 0;
}

+static void nfs_ooo_merge(struct nfs_inode *nfsi,
+ u64 start, u64 end)
+{
+ int i, cnt;
+
+ if (nfsi->cache_validity & NFS_INO_DATA_INVAL_DEFER)
+ /* No point merging anything */
+ return;
+ if (!S_ISREG(nfsi->vfs_inode.mode))
+ /* We flush cached data for a directory on any change
+ * because we cannot insert new entries into the cache,
+ * or delete old entries.
+ * So there is no point being concerned about out-of-order
+ * replies.
+ */
+ return;
+
+ if (!nfsi->ooo) {
+ nfsi->ooo = kmalloc(sizeof(*nfsi->ooo), GFP_ATOMIC);
+ if (!nfsi->ooo) {
+ nfsi->cache_validity |= NFS_INO_DATA_INVAL_DEFER;
+ return;
+ }
+ nfsi->ooo->cnt = 0;
+ }
+
+ /* add this range, merging if possible */
+ cnt = nfsi->ooo->cnt;
+ for (i = 0; i < cnt; i++) {
+ if (end == nfsi->ooo->gap[i].start)
+ end = nfsi->ooo->gap[i].end;
+ else if (start == nfsi->ooo->gap[i].end)
+ start = nfsi->ooo->gap[i].start;
+ else
+ continue;
+ /* Remove 'i' from table and loop to insert the new range */
+ cnt -= 1;
+ nfsi->ooo->gap[i] = nfsi->ooo->gap[cnt];
+ i = -1;
+ }
+ if (start != end) {
+ if (cnt >= ARRAY_SIZE(nfsi->ooo->gap)) {
+ nfsi->cache_validity |= NFS_INO_DATA_INVAL_DEFER;
+ kfree(nfsi->ooo);
+ nfsi->ooo = NULL;
+ return;
+ }
+ nfsi->ooo->gap[cnt].start = start;
+ nfsi->ooo->gap[cnt].end = end;
+ cnt += 1;
+ }
+ nfsi->ooo->cnt = cnt;
+}
+
+static void nfs_ooo_record(struct nfs_inode *nfsi,
+ struct nfs_fattr *fattr)
+{
+ /* This reply was out-of-order, so record the
+ * pre/post change ids, possibly cancelling
+ * gaps created when iversion was jumped forward.
+ */
+ if ((fattr->valid & NFS_ATTR_FATTR_CHANGE) &&
+ (fattr->valid & NFS_ATTR_FATTR_PRECHANGE))
+ nfs_ooo_merge(nfsi,
+ fattr->change_attr,
+ fattr->pre_change_attr);
+}
+
static int nfs_refresh_inode_locked(struct inode *inode,
struct nfs_fattr *fattr)
{
@@ -1818,8 +1888,12 @@ static int nfs_refresh_inode_locked(struct inode *inode,

if (attr_cmp > 0 || nfs_inode_finish_partial_attr_update(fattr, inode))
ret = nfs_update_inode(inode, fattr);
- else if (attr_cmp == 0)
- ret = nfs_check_inode_attributes(inode, fattr);
+ else {
+ nfs_ooo_record(NFS_I(inode), fattr);
+
+ if (attr_cmp == 0)
+ ret = nfs_check_inode_attributes(inode, fattr);
+ }

trace_nfs_refresh_inode_exit(inode, ret);
return ret;
@@ -1910,6 +1984,8 @@ int nfs_post_op_update_inode_force_wcc_locked(struct inode *inode, struct nfs_fa
if (attr_cmp < 0)
return 0;
if ((fattr->valid & NFS_ATTR_FATTR) == 0 || !attr_cmp) {
+ /* Record the pre/post change info before clearing PRECHANGE */
+ nfs_ooo_record(NFS_I(inode), fattr);
fattr->valid &= ~(NFS_ATTR_FATTR_PRECHANGE
| NFS_ATTR_FATTR_PRESIZE
| NFS_ATTR_FATTR_PREMTIME
@@ -2064,6 +2140,15 @@ static int nfs_update_inode(struct inode *inode, struct nfs_fattr *fattr)

/* More cache consistency checks */
if (fattr->valid & NFS_ATTR_FATTR_CHANGE) {
+ if (!have_writers && nfsi->ooo && nfsi->ooo->cnt == 1 &&
+ nfsi->ooo->gap[0].end == inode_peek_iversion_raw(inode)) {
+ /* There is one remaining gap that hasn't been
+ * merged into iversion - do that now.
+ */
+ inode_set_iversion_raw(inode, nfsi->ooo->gap[0].start);
+ kfree(nfsi->ooo);
+ nfsi->ooo = NULL;
+ }
if (!inode_eq_iversion_raw(inode, fattr->change_attr)) {
/* Could it be a race with writeback? */
if (!(have_writers || have_delegation)) {
@@ -2085,8 +2170,11 @@ static int nfs_update_inode(struct inode *inode, struct nfs_fattr *fattr)
dprintk("NFS: change_attr change on server for file %s/%ld\n",
inode->i_sb->s_id,
inode->i_ino);
- } else if (!have_delegation)
- nfsi->cache_validity |= NFS_INO_DATA_INVAL_DEFER;
+ } else if (!have_delegation) {
+ nfs_ooo_record(nfsi, fattr);
+ nfs_ooo_merge(nfsi, inode_peek_iversion_raw(inode),
+ fattr->change_attr);
+ }
inode_set_iversion_raw(inode, fattr->change_attr);
}
} else {
@@ -2240,6 +2328,7 @@ struct inode *nfs_alloc_inode(struct super_block *sb)
return NULL;
nfsi->flags = 0UL;
nfsi->cache_validity = 0UL;
+ nfsi->ooo = NULL;
#if IS_ENABLED(CONFIG_NFS_V4)
nfsi->nfs4_acl = NULL;
#endif /* CONFIG_NFS_V4 */
@@ -2252,6 +2341,7 @@ EXPORT_SYMBOL_GPL(nfs_alloc_inode);

void nfs_free_inode(struct inode *inode)
{
+ kfree(NFS_I(inode)->ooo);
kmem_cache_free(nfs_inode_cachep, NFS_I(inode));
}
EXPORT_SYMBOL_GPL(nfs_free_inode);
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index d92fdfd2444c..776afe09b63b 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -191,6 +191,39 @@ struct nfs_inode {
/* Open contexts for shared mmap writes */
struct list_head open_files;

+ /* Keep track of out-of-order replies.
+ * The ooo array contains start/end pairs of
+ * numbers from the changeid sequence when
+ * the inode's iversion has been updated.
+ * It also contains end/start pair (i.e. reverse order)
+ * of sections of the changeid sequence that have
+ * been seen in replies from the server.
+ * Normally these should match and when both
+ * A:B and B:A are found in ooo, they are both removed.
+ * And if a reply with A:B causes an iversion update
+ * of A:B, then neither are added.
+ * When a reply has pre_change that doesn't match
+ * iversion, then the changeid pair and any consequent
+ * change in iversion ARE added. Later replies
+ * might fill in the gaps, or possibly a gap is caused
+ * by a change from another client.
+ * When a file or directory is opened, if the ooo table
+ * is not empty, then we assume the gaps were due to
+ * another client and we invalidate the cached data.
+ *
+ * We can only track a limited number of concurrent gaps.
+ * Currently that limit is 16.
+ * We allocate the table on demand. If there is insufficient
+ * memory, then we probably cannot cache the file anyway
+ * so there is no loss.
+ */
+ struct {
+ int cnt;
+ struct {
+ u64 start, end;
+ } gap[16];
+ } *ooo;
+
#if IS_ENABLED(CONFIG_NFS_V4)
struct nfs4_cached_acl *nfs4_acl;
/* NFSv4 state */
@@ -616,6 +649,20 @@ nfs_fileid_to_ino_t(u64 fileid)
return ino;
}

+static inline void nfs_ooo_clear(struct nfs_inode *nfsi)
+{
+ nfsi->cache_validity &= ~NFS_INO_DATA_INVAL_DEFER;
+ kfree(nfsi->ooo);
+ nfsi->ooo = NULL;
+}
+
+static inline bool nfs_ooo_test(struct nfs_inode *nfsi)
+{
+ return (nfsi->cache_validity & NFS_INO_DATA_INVAL_DEFER) ||
+ (nfsi->ooo && nfsi->ooo->cnt > 0);
+
+}
+
#define NFS_JUKEBOX_RETRY_TIME (5 * HZ)

/* We need to block new opens while a file is being unlinked.
--
2.39.1



2023-02-15 19:59:15

by Anna Schumaker

Subject: Re: [PATCH v3] NFSv3: handle out-of-order write replies.

Hi Neil,

On Wed, Feb 15, 2023 at 12:13 AM NeilBrown <[email protected]> wrote:
>
>
> NFSv3 includes pre/post wcc attributes which allow the client to
> determine if all changes to the file have been made by the client
> itself, or if any might have been made by some other client.
>
> If there are gaps in the pre/post ctime sequence it must be assumed that
> some other client changed the file in that gap and the local cache must
> be suspect. The next time the file is opened the cache should be
> invalidated.
>
> Since Commit 1c341b777501 ("NFS: Add deferred cache invalidation for
> close-to-open consistency violations") in Linux 5.3 the Linux client has
> been triggering this invalidation. The chunk in nfs_update_inode() in
> particular triggers it.
>
> Unfortunately Linux NFS assumes that all replies will be processed in
> the order sent, and will arrive in the order processed. This is not
> true in general. Consequently Linux NFS might ignore the wcc info in a
> WRITE reply because the reply is in response to a WRITE that was sent
> before some other request for which a reply has already been seen. This
> is detected by Linux using the gencount tests in nfs_inode_attr_cmp().
>
> Also, when the gencount tests pass it is still possible that the requests
> were processed on the server in a different order, and a gap seen in
> the ctime sequence might be filled in by a subsequent reply, so gaps
> should not immediately trigger delayed invalidation.
>
> The net result is that writing to a server and then reading the file
> back can result in going to the server for the read rather than serving
> it from cache - all because a couple of replies arrived out-of-order.
> This is a performance regression over kernels before 5.3, though the
> change in 5.3 is a correctness improvement.
>
> This has been seen with Linux writing to a Netapp server which
> occasionally re-orders requests. In testing the majority of requests
> were in-order, but a few (maybe two or three at a time) could be
> re-ordered.
>
> This patch addresses the problem by recording any gaps seen in the
> pre/post ctime sequence and not triggering invalidation until either
> there are too many gaps to fit in the table, or until there are no more
> active writes and the remaining gaps cannot be resolved.
>
> We allocate a table of 16 gaps on demand. If the allocation fails we
> revert to current behaviour which is of little cost as we are unlikely
> to be able to cache the writes anyway.
>
> In the table we store a "start->end" pair when iversion is updated and
> an "end<-start" pair for each pre/post pair reported by the server. Usually these
> exactly cancel out and so nothing is stored. When there are
> out-of-order replies we do store gaps and these will eventually be
> cancelled against later replies when this client is the only writer.
>
> If the final write is out-of-order there may be one gap remaining when
> the file is closed. This will be noticed and if there is precisely one
> gap and if the iversion can be advanced to match it, then we do so.
>
> This patch makes no attempt to handle directories correctly. The same
> problem potentially exists there: out-of-order replies to create/unlink
> requests can cause future lookup requests to be sent to the server
> unnecessarily. A similar scheme using the same primitives could be used
> to notice and handle out-of-order replies.
>
> Signed-off-by: NeilBrown <[email protected]>
> ---
> fs/nfs/inode.c | 120 +++++++++++++++++++++++++++++++++++------
> include/linux/nfs_fs.h | 47 ++++++++++++++++
> 2 files changed, 152 insertions(+), 15 deletions(-)
>
> Sorry for the new version so soon :-)
>
> This version removes the PRE_CHANGE_RAW flag - I realised that before
> clearing PRE_CHANGE I can simply record the pre/post info if relevant.
>
> I also now understand the ways in which directories are different and so
> make no attempt to address directories. Change comment explains that
> there is room for improvement there, but it is a separate issue needing
> separate testing.
>
> Thanks,
> NeilBrown
>
>
>
> diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
> index e98ee7599eeb..8f1a78837ec9 100644
> --- a/fs/nfs/inode.c
> +++ b/fs/nfs/inode.c
> @@ -208,11 +208,12 @@ void nfs_set_cache_invalid(struct inode *inode, unsigned long flags)
>
> nfsi->cache_validity |= flags;
>
> - if (inode->i_mapping->nrpages == 0)
> - nfsi->cache_validity &= ~(NFS_INO_INVALID_DATA |
> - NFS_INO_DATA_INVAL_DEFER);
> - else if (nfsi->cache_validity & NFS_INO_INVALID_DATA)
> - nfsi->cache_validity &= ~NFS_INO_DATA_INVAL_DEFER;
> + if (inode->i_mapping->nrpages == 0) {
> + nfsi->cache_validity &= ~NFS_INO_INVALID_DATA;
> + nfs_ooo_clear(nfsi);
> + } else if (nfsi->cache_validity & NFS_INO_INVALID_DATA) {
> + nfs_ooo_clear(nfsi);
> + }
> trace_nfs_set_cache_invalid(inode, 0);
> }
> EXPORT_SYMBOL_GPL(nfs_set_cache_invalid);
> @@ -677,9 +678,10 @@ static int nfs_vmtruncate(struct inode * inode, loff_t offset)
> trace_nfs_size_truncate(inode, offset);
> i_size_write(inode, offset);
> /* Optimisation */
> - if (offset == 0)
> - NFS_I(inode)->cache_validity &= ~(NFS_INO_INVALID_DATA |
> - NFS_INO_DATA_INVAL_DEFER);
> + if (offset == 0) {
> + NFS_I(inode)->cache_validity &= ~NFS_INO_INVALID_DATA;
> + nfs_ooo_clear(NFS_I(inode));
> + }
> NFS_I(inode)->cache_validity &= ~NFS_INO_INVALID_SIZE;
>
> spin_unlock(&inode->i_lock);
> @@ -1101,7 +1103,7 @@ void nfs_inode_attach_open_context(struct nfs_open_context *ctx)
>
> spin_lock(&inode->i_lock);
> if (list_empty(&nfsi->open_files) &&
> - (nfsi->cache_validity & NFS_INO_DATA_INVAL_DEFER))
> + nfs_ooo_test(nfsi))
> nfs_set_cache_invalid(inode, NFS_INO_INVALID_DATA |
> NFS_INO_REVAL_FORCED);
> list_add_tail_rcu(&ctx->list, &nfsi->open_files);
> @@ -1345,8 +1347,8 @@ int nfs_clear_invalid_mapping(struct address_space *mapping)
>
> set_bit(NFS_INO_INVALIDATING, bitlock);
> smp_wmb();
> - nfsi->cache_validity &=
> - ~(NFS_INO_INVALID_DATA | NFS_INO_DATA_INVAL_DEFER);
> + nfsi->cache_validity &= ~NFS_INO_INVALID_DATA;
> + nfs_ooo_clear(nfsi);
> spin_unlock(&inode->i_lock);
> trace_nfs_invalidate_mapping_enter(inode);
> ret = nfs_invalidate_mapping(inode, mapping);
> @@ -1808,6 +1810,74 @@ static int nfs_inode_finish_partial_attr_update(const struct nfs_fattr *fattr,
> return 0;
> }
>
> +static void nfs_ooo_merge(struct nfs_inode *nfsi,
> + u64 start, u64 end)
> +{
> + int i, cnt;
> +
> + if (nfsi->cache_validity & NFS_INO_DATA_INVAL_DEFER)
> + /* No point merging anything */
> + return;
> + if (!S_ISREG(nfsi->vfs_inode.mode))

Um, should this be "vfs_inode.i_mode"? I'm also curious how you've
tested this, since the code won't compile without this change?

Thanks,
Anna

> + /* We flush cached data for a directory on any change
> + * because we cannot insert new entries into the cache,
> + * or delete old entries.
> + * So there is no point being concerned about out-of-order
> + * replies.
> + */
> + return;
> +
> + if (!nfsi->ooo) {
> + nfsi->ooo = kmalloc(sizeof(*nfsi->ooo), GFP_ATOMIC);
> + if (!nfsi->ooo) {
> + nfsi->cache_validity |= NFS_INO_DATA_INVAL_DEFER;
> + return;
> + }
> + nfsi->ooo->cnt = 0;
> + }
> +
> + /* add this range, merging if possible */
> + cnt = nfsi->ooo->cnt;
> + for (i = 0; i < cnt; i++) {
> + if (end == nfsi->ooo->gap[i].start)
> + end = nfsi->ooo->gap[i].end;
> + else if (start == nfsi->ooo->gap[i].end)
> + start = nfsi->ooo->gap[i].start;
> + else
> + continue;
> + /* Remove 'i' from table and loop to insert the new range */
> + cnt -= 1;
> + nfsi->ooo->gap[i] = nfsi->ooo->gap[cnt];
> + i = -1;
> + }
> + if (start != end) {
> + if (cnt >= ARRAY_SIZE(nfsi->ooo->gap)) {
> + nfsi->cache_validity |= NFS_INO_DATA_INVAL_DEFER;
> + kfree(nfsi->ooo);
> + nfsi->ooo = NULL;
> + return;
> + }
> + nfsi->ooo->gap[cnt].start = start;
> + nfsi->ooo->gap[cnt].end = end;
> + cnt += 1;
> + }
> + nfsi->ooo->cnt = cnt;
> +}
> +
> +static void nfs_ooo_record(struct nfs_inode *nfsi,
> + struct nfs_fattr *fattr)
> +{
> + /* This reply was out-of-order, so record the
> + * pre/post change ids, possibly cancelling
> + * gaps created when iversion was jumped forward.
> + */
> + if ((fattr->valid & NFS_ATTR_FATTR_CHANGE) &&
> + (fattr->valid & NFS_ATTR_FATTR_PRECHANGE))
> + nfs_ooo_merge(nfsi,
> + fattr->change_attr,
> + fattr->pre_change_attr);
> +}
> +
> static int nfs_refresh_inode_locked(struct inode *inode,
> struct nfs_fattr *fattr)
> {
> @@ -1818,8 +1888,12 @@ static int nfs_refresh_inode_locked(struct inode *inode,
>
> if (attr_cmp > 0 || nfs_inode_finish_partial_attr_update(fattr, inode))
> ret = nfs_update_inode(inode, fattr);
> - else if (attr_cmp == 0)
> - ret = nfs_check_inode_attributes(inode, fattr);
> + else {
> + nfs_ooo_record(NFS_I(inode), fattr);
> +
> + if (attr_cmp == 0)
> + ret = nfs_check_inode_attributes(inode, fattr);
> + }
>
> trace_nfs_refresh_inode_exit(inode, ret);
> return ret;
> @@ -1910,6 +1984,8 @@ int nfs_post_op_update_inode_force_wcc_locked(struct inode *inode, struct nfs_fa
> if (attr_cmp < 0)
> return 0;
> if ((fattr->valid & NFS_ATTR_FATTR) == 0 || !attr_cmp) {
> + /* Record the pre/post change info before clearing PRECHANGE */
> + nfs_ooo_record(NFS_I(inode), fattr);
> fattr->valid &= ~(NFS_ATTR_FATTR_PRECHANGE
> | NFS_ATTR_FATTR_PRESIZE
> | NFS_ATTR_FATTR_PREMTIME
> @@ -2064,6 +2140,15 @@ static int nfs_update_inode(struct inode *inode, struct nfs_fattr *fattr)
>
> /* More cache consistency checks */
> if (fattr->valid & NFS_ATTR_FATTR_CHANGE) {
> + if (!have_writers && nfsi->ooo && nfsi->ooo->cnt == 1 &&
> + nfsi->ooo->gap[0].end == inode_peek_iversion_raw(inode)) {
> + /* There is one remaining gap that hasn't been
> + * merged into iversion - do that now.
> + */
> + inode_set_iversion_raw(inode, nfsi->ooo->gap[0].start);
> + kfree(nfsi->ooo);
> + nfsi->ooo = NULL;
> + }
> if (!inode_eq_iversion_raw(inode, fattr->change_attr)) {
> /* Could it be a race with writeback? */
> if (!(have_writers || have_delegation)) {
> @@ -2085,8 +2170,11 @@ static int nfs_update_inode(struct inode *inode, struct nfs_fattr *fattr)
> dprintk("NFS: change_attr change on server for file %s/%ld\n",
> inode->i_sb->s_id,
> inode->i_ino);
> - } else if (!have_delegation)
> - nfsi->cache_validity |= NFS_INO_DATA_INVAL_DEFER;
> + } else if (!have_delegation) {
> + nfs_ooo_record(nfsi, fattr);
> + nfs_ooo_merge(nfsi, inode_peek_iversion_raw(inode),
> + fattr->change_attr);
> + }
> inode_set_iversion_raw(inode, fattr->change_attr);
> }
> } else {
> @@ -2240,6 +2328,7 @@ struct inode *nfs_alloc_inode(struct super_block *sb)
> return NULL;
> nfsi->flags = 0UL;
> nfsi->cache_validity = 0UL;
> + nfsi->ooo = NULL;
> #if IS_ENABLED(CONFIG_NFS_V4)
> nfsi->nfs4_acl = NULL;
> #endif /* CONFIG_NFS_V4 */
> @@ -2252,6 +2341,7 @@ EXPORT_SYMBOL_GPL(nfs_alloc_inode);
>
> void nfs_free_inode(struct inode *inode)
> {
> + kfree(NFS_I(inode)->ooo);
> kmem_cache_free(nfs_inode_cachep, NFS_I(inode));
> }
> EXPORT_SYMBOL_GPL(nfs_free_inode);
> diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
> index d92fdfd2444c..776afe09b63b 100644
> --- a/include/linux/nfs_fs.h
> +++ b/include/linux/nfs_fs.h
> @@ -191,6 +191,39 @@ struct nfs_inode {
> /* Open contexts for shared mmap writes */
> struct list_head open_files;
>
> + /* Keep track of out-of-order replies.
> + * The ooo array contains start/end pairs of
> + * numbers from the changeid sequence when
> + * the inode's iversion has been updated.
> + * It also contains end/start pair (i.e. reverse order)
> + * of sections of the changeid sequence that have
> + * been seen in replies from the server.
> + * Normally these should match and when both
> + * A:B and B:A are found in ooo, they are both removed.
> + * And if a reply with A:B causes an iversion update
> + * of A:B, then neither are added.
> + * When a reply has pre_change that doesn't match
> + * iversion, then the changeid pair and any consequent
> + * change in iversion ARE added. Later replies
> + * might fill in the gaps, or possibly a gap is caused
> + * by a change from another client.
> + * When a file or directory is opened, if the ooo table
> + * is not empty, then we assume the gaps were due to
> + * another client and we invalidate the cached data.
> + *
> + * We can only track a limited number of concurrent gaps.
> + * Currently that limit is 16.
> + * We allocate the table on demand. If there is insufficient
> + * memory, then we probably cannot cache the file anyway
> + * so there is no loss.
> + */
> + struct {
> + int cnt;
> + struct {
> + u64 start, end;
> + } gap[16];
> + } *ooo;
> +
> #if IS_ENABLED(CONFIG_NFS_V4)
> struct nfs4_cached_acl *nfs4_acl;
> /* NFSv4 state */
> @@ -616,6 +649,20 @@ nfs_fileid_to_ino_t(u64 fileid)
> return ino;
> }
>
> +static inline void nfs_ooo_clear(struct nfs_inode *nfsi)
> +{
> + nfsi->cache_validity &= ~NFS_INO_DATA_INVAL_DEFER;
> + kfree(nfsi->ooo);
> + nfsi->ooo = NULL;
> +}
> +
> +static inline bool nfs_ooo_test(struct nfs_inode *nfsi)
> +{
> + return (nfsi->cache_validity & NFS_INO_DATA_INVAL_DEFER) ||
> + (nfsi->ooo && nfsi->ooo->cnt > 0);
> +
> +}
> +
> #define NFS_JUKEBOX_RETRY_TIME (5 * HZ)
>
> /* We need to block new opens while a file is being unlinked.
> --
> 2.39.1
>
>

2023-02-15 20:30:23

by NeilBrown

Subject: Re: [PATCH v3] NFSv3: handle out-of-order write replies.

On Thu, 16 Feb 2023, Anna Schumaker wrote:
> Hi Neil,
>
> On Wed, Feb 15, 2023 at 12:13 AM NeilBrown <[email protected]> wrote:
> >
> > +static void nfs_ooo_merge(struct nfs_inode *nfsi,
> > + u64 start, u64 end)
> > +{
> > + int i, cnt;
> > +
> > + if (nfsi->cache_validity & NFS_INO_DATA_INVAL_DEFER)
> > + /* No point merging anything */
> > + return;
> > + if (!S_ISREG(nfsi->vfs_inode.mode))
>
> Um, should this be "vfs_inode.i_mode"? I'm also curious how you've
> tested this, since the code won't compile without this change?

I had removed this whole 'if' because I realised it wasn't really
needed.
The problem was that I hadn't refreshed that patch after final testing
:-(

I'll resend.

Thanks,
NeilBrown

2023-02-15 20:32:35

by NeilBrown

Subject: [PATCH v4] NFSv3: handle out-of-order write replies.


NFSv3 includes pre/post wcc attributes which allow the client to
determine if all changes to the file have been made by the client
itself, or if any might have been made by some other client.

If there are gaps in the pre/post ctime sequence it must be assumed that
some other client changed the file in that gap and the local cache must
be suspect. The next time the file is opened the cache should be
invalidated.

Since Commit 1c341b777501 ("NFS: Add deferred cache invalidation for
close-to-open consistency violations") in linux 5.3 the Linux client has
been triggering this invalidation. The chunk in nfs_update_inode() in
particularly triggers.

Unfortunately Linux NFS assumes that all replies will be processed in
the order sent, and will arrive in the order processed. This is not
true in general. Consequently Linux NFS might ignore the wcc info in a
WRITE reply because the reply is in response to a WRITE that was sent
before some other request for which a reply has already been seen. This
is detected by Linux using the gencount tests in nfs_inode_attr_cmp().

Also, when the gencount tests pass it is still possible that the requests
were processed on the server in a different order, and a gap seen in
the ctime sequence might be filled in by a subsequent reply, so gaps
should not immediately trigger delayed invalidation.

The net result is that writing to a server and then reading the file
back can result in going to the server for the read rather than serving
it from cache - all because a couple of replies arrived out-of-order.
This is a performance regression over kernels before 5.3, though the
change in 5.3 is a correctness improvement.

This has been seen with Linux writing to a Netapp server which
occasionally re-orders requests. In testing the majority of requests
were in-order, but a few (maybe two or three at a time) could be
re-ordered.

This patch addresses the problem by recording any gaps seen in the
pre/post ctime sequence and not triggering invalidation until either
there are too many gaps to fit in the table, or until there are no more
active writes and the remaining gaps cannot be resolved.

We allocate a table of 16 gaps on demand. If the allocation fails we
revert to current behaviour which is of little cost as we are unlikely
to be able to cache the writes anyway.

In the table we store a "start->end" pair when iversion is updated and
an "end<-start" pair for each pre/post pair reported by the server. Usually these
exactly cancel out and so nothing is stored. When there are
out-of-order replies we do store gaps and these will eventually be
cancelled against later replies when this client is the only writer.

If the final write is out-of-order there may be one gap remaining when
the file is closed. This will be noticed and if there is precisely one
gap and if the iversion can be advanced to match it, then we do so.
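
To make that fixup concrete, here is a small stand-alone sketch (not
part of the patch; the variable names and the change numbers 105/106 are
invented for the example). The real check is the nfsi->ooo->cnt == 1
test added to nfs_update_inode() in the diff below, which only applies
once there are no more active writers.

#include <stdio.h>

int main(void)
{
	/* The wcc info of the final WRITE reply (pre=105, post=106) was
	 * recorded as the reversed pair 106<-105, but because the reply
	 * looked out-of-order its change attribute was never merged into
	 * iversion, which is still 105.
	 */
	unsigned long long iversion = 105;
	struct { unsigned long long start, end; } gap = { 106, 105 };
	int cnt = 1;

	/* The single remaining gap ends exactly at iversion, so the only
	 * unexplained change is our own write: advance iversion to the
	 * gap's start (the post-change value) and drop the gap.
	 */
	if (cnt == 1 && gap.end == iversion) {
		iversion = gap.start;
		cnt = 0;
	}
	printf("iversion=%llu, gaps=%d\n", iversion, cnt);	/* 106, 0 */
	return 0;
}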

This patch makes no attempt to handle directories correctly. The same
problem potentially exists there: out-of-order replies to create/unlink
requests can cause future lookup requests to be sent to the server
unnecessarily. A similar scheme using the same primitives could be used
to notice and handle out-of-order replies.

Signed-off-by: NeilBrown <[email protected]>
---
fs/nfs/inode.c | 112 +++++++++++++++++++++++++++++++++++------
include/linux/nfs_fs.h | 47 +++++++++++++++++
2 files changed, 144 insertions(+), 15 deletions(-)

diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index e98ee7599eeb..ef46c854930d 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -208,11 +208,12 @@ void nfs_set_cache_invalid(struct inode *inode, unsigned long flags)

nfsi->cache_validity |= flags;

- if (inode->i_mapping->nrpages == 0)
- nfsi->cache_validity &= ~(NFS_INO_INVALID_DATA |
- NFS_INO_DATA_INVAL_DEFER);
- else if (nfsi->cache_validity & NFS_INO_INVALID_DATA)
- nfsi->cache_validity &= ~NFS_INO_DATA_INVAL_DEFER;
+ if (inode->i_mapping->nrpages == 0) {
+ nfsi->cache_validity &= ~NFS_INO_INVALID_DATA;
+ nfs_ooo_clear(nfsi);
+ } else if (nfsi->cache_validity & NFS_INO_INVALID_DATA) {
+ nfs_ooo_clear(nfsi);
+ }
trace_nfs_set_cache_invalid(inode, 0);
}
EXPORT_SYMBOL_GPL(nfs_set_cache_invalid);
@@ -677,9 +678,10 @@ static int nfs_vmtruncate(struct inode * inode, loff_t offset)
trace_nfs_size_truncate(inode, offset);
i_size_write(inode, offset);
/* Optimisation */
- if (offset == 0)
- NFS_I(inode)->cache_validity &= ~(NFS_INO_INVALID_DATA |
- NFS_INO_DATA_INVAL_DEFER);
+ if (offset == 0) {
+ NFS_I(inode)->cache_validity &= ~NFS_INO_INVALID_DATA;
+ nfs_ooo_clear(NFS_I(inode));
+ }
NFS_I(inode)->cache_validity &= ~NFS_INO_INVALID_SIZE;

spin_unlock(&inode->i_lock);
@@ -1101,7 +1103,7 @@ void nfs_inode_attach_open_context(struct nfs_open_context *ctx)

spin_lock(&inode->i_lock);
if (list_empty(&nfsi->open_files) &&
- (nfsi->cache_validity & NFS_INO_DATA_INVAL_DEFER))
+ nfs_ooo_test(nfsi))
nfs_set_cache_invalid(inode, NFS_INO_INVALID_DATA |
NFS_INO_REVAL_FORCED);
list_add_tail_rcu(&ctx->list, &nfsi->open_files);
@@ -1345,8 +1347,8 @@ int nfs_clear_invalid_mapping(struct address_space *mapping)

set_bit(NFS_INO_INVALIDATING, bitlock);
smp_wmb();
- nfsi->cache_validity &=
- ~(NFS_INO_INVALID_DATA | NFS_INO_DATA_INVAL_DEFER);
+ nfsi->cache_validity &= ~NFS_INO_INVALID_DATA;
+ nfs_ooo_clear(nfsi);
spin_unlock(&inode->i_lock);
trace_nfs_invalidate_mapping_enter(inode);
ret = nfs_invalidate_mapping(inode, mapping);
@@ -1808,6 +1810,66 @@ static int nfs_inode_finish_partial_attr_update(const struct nfs_fattr *fattr,
return 0;
}

+static void nfs_ooo_merge(struct nfs_inode *nfsi,
+ u64 start, u64 end)
+{
+ int i, cnt;
+
+ if (nfsi->cache_validity & NFS_INO_DATA_INVAL_DEFER)
+ /* No point merging anything */
+ return;
+
+ if (!nfsi->ooo) {
+ nfsi->ooo = kmalloc(sizeof(*nfsi->ooo), GFP_ATOMIC);
+ if (!nfsi->ooo) {
+ nfsi->cache_validity |= NFS_INO_DATA_INVAL_DEFER;
+ return;
+ }
+ nfsi->ooo->cnt = 0;
+ }
+
+ /* add this range, merging if possible */
+ cnt = nfsi->ooo->cnt;
+ for (i = 0; i < cnt; i++) {
+ if (end == nfsi->ooo->gap[i].start)
+ end = nfsi->ooo->gap[i].end;
+ else if (start == nfsi->ooo->gap[i].end)
+ start = nfsi->ooo->gap[i].start;
+ else
+ continue;
+ /* Remove 'i' from table and loop to insert the new range */
+ cnt -= 1;
+ nfsi->ooo->gap[i] = nfsi->ooo->gap[cnt];
+ i = -1;
+ }
+ if (start != end) {
+ if (cnt >= ARRAY_SIZE(nfsi->ooo->gap)) {
+ nfsi->cache_validity |= NFS_INO_DATA_INVAL_DEFER;
+ kfree(nfsi->ooo);
+ nfsi->ooo = NULL;
+ return;
+ }
+ nfsi->ooo->gap[cnt].start = start;
+ nfsi->ooo->gap[cnt].end = end;
+ cnt += 1;
+ }
+ nfsi->ooo->cnt = cnt;
+}
+
+static void nfs_ooo_record(struct nfs_inode *nfsi,
+ struct nfs_fattr *fattr)
+{
+ /* This reply was out-of-order, so record the
+ * pre/post change ids, possibly cancelling
+ * gaps created when iversion was jumped forward.
+ */
+ if ((fattr->valid & NFS_ATTR_FATTR_CHANGE) &&
+ (fattr->valid & NFS_ATTR_FATTR_PRECHANGE))
+ nfs_ooo_merge(nfsi,
+ fattr->change_attr,
+ fattr->pre_change_attr);
+}
+
static int nfs_refresh_inode_locked(struct inode *inode,
struct nfs_fattr *fattr)
{
@@ -1818,8 +1880,12 @@ static int nfs_refresh_inode_locked(struct inode *inode,

if (attr_cmp > 0 || nfs_inode_finish_partial_attr_update(fattr, inode))
ret = nfs_update_inode(inode, fattr);
- else if (attr_cmp == 0)
- ret = nfs_check_inode_attributes(inode, fattr);
+ else {
+ nfs_ooo_record(NFS_I(inode), fattr);
+
+ if (attr_cmp == 0)
+ ret = nfs_check_inode_attributes(inode, fattr);
+ }

trace_nfs_refresh_inode_exit(inode, ret);
return ret;
@@ -1910,6 +1976,8 @@ int nfs_post_op_update_inode_force_wcc_locked(struct inode *inode, struct nfs_fa
if (attr_cmp < 0)
return 0;
if ((fattr->valid & NFS_ATTR_FATTR) == 0 || !attr_cmp) {
+ /* Record the pre/post change info before clearing PRECHANGE */
+ nfs_ooo_record(NFS_I(inode), fattr);
fattr->valid &= ~(NFS_ATTR_FATTR_PRECHANGE
| NFS_ATTR_FATTR_PRESIZE
| NFS_ATTR_FATTR_PREMTIME
@@ -2064,6 +2132,15 @@ static int nfs_update_inode(struct inode *inode, struct nfs_fattr *fattr)

/* More cache consistency checks */
if (fattr->valid & NFS_ATTR_FATTR_CHANGE) {
+ if (!have_writers && nfsi->ooo && nfsi->ooo->cnt == 1 &&
+ nfsi->ooo->gap[0].end == inode_peek_iversion_raw(inode)) {
+ /* There is one remaining gap that hasn't been
+ * merged into iversion - do that now.
+ */
+ inode_set_iversion_raw(inode, nfsi->ooo->gap[0].start);
+ kfree(nfsi->ooo);
+ nfsi->ooo = NULL;
+ }
if (!inode_eq_iversion_raw(inode, fattr->change_attr)) {
/* Could it be a race with writeback? */
if (!(have_writers || have_delegation)) {
@@ -2085,8 +2162,11 @@ static int nfs_update_inode(struct inode *inode, struct nfs_fattr *fattr)
dprintk("NFS: change_attr change on server for file %s/%ld\n",
inode->i_sb->s_id,
inode->i_ino);
- } else if (!have_delegation)
- nfsi->cache_validity |= NFS_INO_DATA_INVAL_DEFER;
+ } else if (!have_delegation) {
+ nfs_ooo_record(nfsi, fattr);
+ nfs_ooo_merge(nfsi, inode_peek_iversion_raw(inode),
+ fattr->change_attr);
+ }
inode_set_iversion_raw(inode, fattr->change_attr);
}
} else {
@@ -2240,6 +2320,7 @@ struct inode *nfs_alloc_inode(struct super_block *sb)
return NULL;
nfsi->flags = 0UL;
nfsi->cache_validity = 0UL;
+ nfsi->ooo = NULL;
#if IS_ENABLED(CONFIG_NFS_V4)
nfsi->nfs4_acl = NULL;
#endif /* CONFIG_NFS_V4 */
@@ -2252,6 +2333,7 @@ EXPORT_SYMBOL_GPL(nfs_alloc_inode);

void nfs_free_inode(struct inode *inode)
{
+ kfree(NFS_I(inode)->ooo);
kmem_cache_free(nfs_inode_cachep, NFS_I(inode));
}
EXPORT_SYMBOL_GPL(nfs_free_inode);
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index d92fdfd2444c..776afe09b63b 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -191,6 +191,39 @@ struct nfs_inode {
/* Open contexts for shared mmap writes */
struct list_head open_files;

+ /* Keep track of out-of-order replies.
+ * The ooo array contains start/end pairs of
+ * numbers from the changeid sequence when
+ * the inode's iversion has been updated.
+ * It also contains end/start pair (i.e. reverse order)
+ * of sections of the changeid sequence that have
+ * been seen in replies from the server.
+ * Normally these should match and when both
+ * A:B and B:A are found in ooo, they are both removed.
+ * And if a reply with A:B causes an iversion update
+ * of A:B, then neither are added.
+ * When a reply has pre_change that doesn't match
+ * iversion, then the changeid pair and any consequent
+ * change in iversion ARE added. Later replies
+ * might fill in the gaps, or possibly a gap is caused
+ * by a change from another client.
+ * When a file or directory is opened, if the ooo table
+ * is not empty, then we assume the gaps were due to
+ * another client and we invalidate the cached data.
+ *
+ * We can only track a limited number of concurrent gaps.
+ * Currently that limit is 16.
+ * We allocate the table on demand. If there is insufficient
+ * memory, then we probably cannot cache the file anyway
+ * so there is no loss.
+ */
+ struct {
+ int cnt;
+ struct {
+ u64 start, end;
+ } gap[16];
+ } *ooo;
+
#if IS_ENABLED(CONFIG_NFS_V4)
struct nfs4_cached_acl *nfs4_acl;
/* NFSv4 state */
@@ -616,6 +649,20 @@ nfs_fileid_to_ino_t(u64 fileid)
return ino;
}

+static inline void nfs_ooo_clear(struct nfs_inode *nfsi)
+{
+ nfsi->cache_validity &= ~NFS_INO_DATA_INVAL_DEFER;
+ kfree(nfsi->ooo);
+ nfsi->ooo = NULL;
+}
+
+static inline bool nfs_ooo_test(struct nfs_inode *nfsi)
+{
+ return (nfsi->cache_validity & NFS_INO_DATA_INVAL_DEFER) ||
+ (nfsi->ooo && nfsi->ooo->cnt > 0);
+
+}
+
#define NFS_JUKEBOX_RETRY_TIME (5 * HZ)

/* We need to block new opens while a file is being unlinked.
--
2.39.1