2022-10-18 19:59:44

by Chuck Lever III

[permalink] [raw]
Subject: [PATCH v4 0/7] A course adjustment, for sure

I'm proposing this series for v6.2 (for-next). I intend to open this
branch soon, now that the v6.1 merge window is closed.

For quite some time, we've been encouraged to disable filecache
garbage collection for NFSv4 files, and I think I found a surgical
way to do just that. That is presented in "NFSD: Add an NFSD_FILE_GC
flag to enable nfsd_file garbage collection".

The other major change in this set is reworking the file_hashtbl to
resize itself dynamically. This reduces the average size of its
bucket chains, greatly speeding up hash insertion, which holds the
state_lock.

Comments and opinions are welcome.

Changes since v3:
- the new filehandle alias check was still not right

Changes since v2:
- Converted nfs4_file_rhashtbl to nfs4_file_rhltable
- Addressed most or all other review comments


Changes since RFC:
- checking nfs4_files for inode aliases is now done only on hash
insertion
- the nfs4_file reference count is now bumped only while the RCU
read lock is held
- comments and function names have been revised and clarified

---

Chuck Lever (7):
NFSD: Pass the target nfsd_file to nfsd_commit()
NFSD: Revert "NFSD: NFSv4 CLOSE should release an nfsd_file immediately"
NFSD: Add an NFSD_FILE_GC flag to enable nfsd_file garbage collection
NFSD: Use const pointers as parameters to fh_ helpers
NFSD: Use rhashtable for managing nfs4_file objects
NFSD: Clean up nfs4_preprocess_stateid_op() call sites
NFSD: Trace delegation revocations


fs/nfsd/filecache.c | 79 +++++++++++-----
fs/nfsd/filecache.h | 4 +-
fs/nfsd/nfs3proc.c | 10 +-
fs/nfsd/nfs4proc.c | 42 ++++-----
fs/nfsd/nfs4state.c | 221 ++++++++++++++++++++++++++++++--------------
fs/nfsd/nfsfh.h | 10 +-
fs/nfsd/state.h | 5 +-
fs/nfsd/trace.h | 58 +++++++++++-
fs/nfsd/vfs.c | 19 ++--
fs/nfsd/vfs.h | 3 +-
10 files changed, 305 insertions(+), 146 deletions(-)

--
Chuck Lever


2022-10-18 19:59:44

by Chuck Lever III

[permalink] [raw]
Subject: [PATCH v4 3/7] NFSD: Add an NFSD_FILE_GC flag to enable nfsd_file garbage collection

NFSv4 operations manage the lifetime of nfsd_file items they use by
means of NFSv4 OPEN and CLOSE. Hence there's no need for them to be
garbage collected.

Introduce a mechanism to enable garbage collection for nfsd_file
items used only by NFSv2/3 callers.

Note that the change in nfsd_file_put() ensures that both CLOSE and
DELEGRETURN will actually close out and free an nfsd_file on last
reference of a non-garbage-collected file.

Link: https://bugzilla.linux-nfs.org/show_bug.cgi?id=394
Suggested-by: Trond Myklebust <[email protected]>
Tested-by: Jeff Layton <[email protected]>
Signed-off-by: Chuck Lever <[email protected]>
---
fs/nfsd/filecache.c | 61 +++++++++++++++++++++++++++++++++++++++++++++------
fs/nfsd/filecache.h | 3 +++
fs/nfsd/nfs3proc.c | 4 ++-
fs/nfsd/trace.h | 3 ++-
fs/nfsd/vfs.c | 4 ++-
5 files changed, 63 insertions(+), 12 deletions(-)

diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
index b7aa523c2010..87fce5c95726 100644
--- a/fs/nfsd/filecache.c
+++ b/fs/nfsd/filecache.c
@@ -63,6 +63,7 @@ struct nfsd_file_lookup_key {
struct net *net;
const struct cred *cred;
unsigned char need;
+ unsigned char gc:1;
enum nfsd_file_lookup_type type;
};

@@ -162,6 +163,8 @@ static int nfsd_file_obj_cmpfn(struct rhashtable_compare_arg *arg,
return 1;
if (!nfsd_match_cred(nf->nf_cred, key->cred))
return 1;
+ if (test_bit(NFSD_FILE_GC, &nf->nf_flags) != key->gc)
+ return 1;
if (test_bit(NFSD_FILE_HASHED, &nf->nf_flags) == 0)
return 1;
break;
@@ -297,6 +300,8 @@ nfsd_file_alloc(struct nfsd_file_lookup_key *key, unsigned int may)
nf->nf_flags = 0;
__set_bit(NFSD_FILE_HASHED, &nf->nf_flags);
__set_bit(NFSD_FILE_PENDING, &nf->nf_flags);
+ if (key->gc)
+ __set_bit(NFSD_FILE_GC, &nf->nf_flags);
nf->nf_inode = key->inode;
/* nf_ref is pre-incremented for hash table */
refcount_set(&nf->nf_ref, 2);
@@ -428,16 +433,27 @@ nfsd_file_put_noref(struct nfsd_file *nf)
}
}

+static void
+nfsd_file_unhash_and_put(struct nfsd_file *nf)
+{
+ if (nfsd_file_unhash(nf))
+ nfsd_file_put_noref(nf);
+}
+
void
nfsd_file_put(struct nfsd_file *nf)
{
might_sleep();

- nfsd_file_lru_add(nf);
+ if (test_bit(NFSD_FILE_GC, &nf->nf_flags) == 1)
+ nfsd_file_lru_add(nf);
+ else if (refcount_read(&nf->nf_ref) == 2)
+ nfsd_file_unhash_and_put(nf);
+
if (test_bit(NFSD_FILE_HASHED, &nf->nf_flags) == 0) {
nfsd_file_flush(nf);
nfsd_file_put_noref(nf);
- } else if (nf->nf_file) {
+ } else if (nf->nf_file && test_bit(NFSD_FILE_GC, &nf->nf_flags) == 1) {
nfsd_file_put_noref(nf);
nfsd_file_schedule_laundrette();
} else
@@ -1017,12 +1033,14 @@ nfsd_file_is_cached(struct inode *inode)

static __be32
nfsd_file_do_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
- unsigned int may_flags, struct nfsd_file **pnf, bool open)
+ unsigned int may_flags, struct nfsd_file **pnf,
+ bool open, int want_gc)
{
struct nfsd_file_lookup_key key = {
.type = NFSD_FILE_KEY_FULL,
.need = may_flags & NFSD_FILE_MAY_MASK,
.net = SVC_NET(rqstp),
+ .gc = want_gc,
};
bool open_retry = true;
struct nfsd_file *nf;
@@ -1117,14 +1135,35 @@ nfsd_file_do_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
* then unhash.
*/
if (status != nfs_ok || key.inode->i_nlink == 0)
- if (nfsd_file_unhash(nf))
- nfsd_file_put_noref(nf);
+ nfsd_file_unhash_and_put(nf);
clear_bit_unlock(NFSD_FILE_PENDING, &nf->nf_flags);
smp_mb__after_atomic();
wake_up_bit(&nf->nf_flags, NFSD_FILE_PENDING);
goto out;
}

+/**
+ * nfsd_file_acquire_gc - Get a struct nfsd_file with an open file
+ * @rqstp: the RPC transaction being executed
+ * @fhp: the NFS filehandle of the file to be opened
+ * @may_flags: NFSD_MAY_ settings for the file
+ * @pnf: OUT: new or found "struct nfsd_file" object
+ *
+ * The nfsd_file object returned by this API is reference-counted
+ * and garbage-collected. The object is retained for a few
+ * seconds after the final nfsd_file_put() in case the caller
+ * wants to re-use it.
+ *
+ * Returns nfs_ok and sets @pnf on success; otherwise an nfsstat in
+ * network byte order is returned.
+ */
+__be32
+nfsd_file_acquire_gc(struct svc_rqst *rqstp, struct svc_fh *fhp,
+ unsigned int may_flags, struct nfsd_file **pnf)
+{
+ return nfsd_file_do_acquire(rqstp, fhp, may_flags, pnf, true, 1);
+}
+
/**
* nfsd_file_acquire - Get a struct nfsd_file with an open file
* @rqstp: the RPC transaction being executed
@@ -1132,6 +1171,10 @@ nfsd_file_do_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
* @may_flags: NFSD_MAY_ settings for the file
* @pnf: OUT: new or found "struct nfsd_file" object
*
+ * The nfsd_file_object returned by this API is reference-counted
+ * but not garbage-collected. The object is unhashed after the
+ * final nfsd_file_put().
+ *
* Returns nfs_ok and sets @pnf on success; otherwise an nfsstat in
* network byte order is returned.
*/
@@ -1139,7 +1182,7 @@ __be32
nfsd_file_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
unsigned int may_flags, struct nfsd_file **pnf)
{
- return nfsd_file_do_acquire(rqstp, fhp, may_flags, pnf, true);
+ return nfsd_file_do_acquire(rqstp, fhp, may_flags, pnf, true, 0);
}

/**
@@ -1149,6 +1192,10 @@ nfsd_file_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
* @may_flags: NFSD_MAY_ settings for the file
* @pnf: OUT: new or found "struct nfsd_file" object
*
+ * The nfsd_file_object returned by this API is reference-counted
+ * but not garbage-collected. The object is released immediately
+ * one RCU grace period after the final nfsd_file_put().
+ *
* Returns nfs_ok and sets @pnf on success; otherwise an nfsstat in
* network byte order is returned.
*/
@@ -1156,7 +1203,7 @@ __be32
nfsd_file_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
unsigned int may_flags, struct nfsd_file **pnf)
{
- return nfsd_file_do_acquire(rqstp, fhp, may_flags, pnf, false);
+ return nfsd_file_do_acquire(rqstp, fhp, may_flags, pnf, false, 0);
}

/*
diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h
index f81c198f4ed6..0f6546bcd3e0 100644
--- a/fs/nfsd/filecache.h
+++ b/fs/nfsd/filecache.h
@@ -38,6 +38,7 @@ struct nfsd_file {
#define NFSD_FILE_HASHED (0)
#define NFSD_FILE_PENDING (1)
#define NFSD_FILE_REFERENCED (2)
+#define NFSD_FILE_GC (3)
unsigned long nf_flags;
struct inode *nf_inode; /* don't deref */
refcount_t nf_ref;
@@ -55,6 +56,8 @@ void nfsd_file_put(struct nfsd_file *nf);
struct nfsd_file *nfsd_file_get(struct nfsd_file *nf);
void nfsd_file_close_inode_sync(struct inode *inode);
bool nfsd_file_is_cached(struct inode *inode);
+__be32 nfsd_file_acquire_gc(struct svc_rqst *rqstp, struct svc_fh *fhp,
+ unsigned int may_flags, struct nfsd_file **nfp);
__be32 nfsd_file_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
unsigned int may_flags, struct nfsd_file **nfp);
__be32 nfsd_file_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
diff --git a/fs/nfsd/nfs3proc.c b/fs/nfsd/nfs3proc.c
index d12823371857..6a5ad6c91d8e 100644
--- a/fs/nfsd/nfs3proc.c
+++ b/fs/nfsd/nfs3proc.c
@@ -779,8 +779,8 @@ nfsd3_proc_commit(struct svc_rqst *rqstp)
(unsigned long long) argp->offset);

fh_copy(&resp->fh, &argp->fh);
- resp->status = nfsd_file_acquire(rqstp, &resp->fh, NFSD_MAY_WRITE |
- NFSD_MAY_NOT_BREAK_LEASE, &nf);
+ resp->status = nfsd_file_acquire_gc(rqstp, &resp->fh, NFSD_MAY_WRITE |
+ NFSD_MAY_NOT_BREAK_LEASE, &nf);
if (resp->status)
goto out;
resp->status = nfsd_commit(rqstp, &resp->fh, nf, argp->offset,
diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
index 9ebd67d461f9..4921144880d3 100644
--- a/fs/nfsd/trace.h
+++ b/fs/nfsd/trace.h
@@ -742,7 +742,8 @@ DEFINE_CLID_EVENT(confirmed_r);
__print_flags(val, "|", \
{ 1 << NFSD_FILE_HASHED, "HASHED" }, \
{ 1 << NFSD_FILE_PENDING, "PENDING" }, \
- { 1 << NFSD_FILE_REFERENCED, "REFERENCED"})
+ { 1 << NFSD_FILE_REFERENCED, "REFERENCED"}, \
+ { 1 << NFSD_FILE_GC, "GC"})

DECLARE_EVENT_CLASS(nfsd_file_class,
TP_PROTO(struct nfsd_file *nf),
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 44f210ba17cc..89d682a56fc4 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1060,7 +1060,7 @@ __be32 nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
__be32 err;

trace_nfsd_read_start(rqstp, fhp, offset, *count);
- err = nfsd_file_acquire(rqstp, fhp, NFSD_MAY_READ, &nf);
+ err = nfsd_file_acquire_gc(rqstp, fhp, NFSD_MAY_READ, &nf);
if (err)
return err;

@@ -1092,7 +1092,7 @@ nfsd_write(struct svc_rqst *rqstp, struct svc_fh *fhp, loff_t offset,

trace_nfsd_write_start(rqstp, fhp, offset, *cnt);

- err = nfsd_file_acquire(rqstp, fhp, NFSD_MAY_WRITE, &nf);
+ err = nfsd_file_acquire_gc(rqstp, fhp, NFSD_MAY_WRITE, &nf);
if (err)
goto out;



2022-10-18 20:00:49

by Chuck Lever III

[permalink] [raw]
Subject: [PATCH v4 4/7] NFSD: Use const pointers as parameters to fh_ helpers

Enable callers to use const pointers where they are able to.

Reviewed-by: Jeff Layton <[email protected]>
Tested-by: Jeff Layton <[email protected]>
Signed-off-by: Chuck Lever <[email protected]>
---
fs/nfsd/nfsfh.h | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h
index c3ae6414fc5c..513e028b0bbe 100644
--- a/fs/nfsd/nfsfh.h
+++ b/fs/nfsd/nfsfh.h
@@ -220,7 +220,7 @@ __be32 fh_update(struct svc_fh *);
void fh_put(struct svc_fh *);

static __inline__ struct svc_fh *
-fh_copy(struct svc_fh *dst, struct svc_fh *src)
+fh_copy(struct svc_fh *dst, const struct svc_fh *src)
{
WARN_ON(src->fh_dentry);

@@ -229,7 +229,7 @@ fh_copy(struct svc_fh *dst, struct svc_fh *src)
}

static inline void
-fh_copy_shallow(struct knfsd_fh *dst, struct knfsd_fh *src)
+fh_copy_shallow(struct knfsd_fh *dst, const struct knfsd_fh *src)
{
dst->fh_size = src->fh_size;
memcpy(&dst->fh_raw, &src->fh_raw, src->fh_size);
@@ -243,7 +243,8 @@ fh_init(struct svc_fh *fhp, int maxsize)
return fhp;
}

-static inline bool fh_match(struct knfsd_fh *fh1, struct knfsd_fh *fh2)
+static inline bool fh_match(const struct knfsd_fh *fh1,
+ const struct knfsd_fh *fh2)
{
if (fh1->fh_size != fh2->fh_size)
return false;
@@ -252,7 +253,8 @@ static inline bool fh_match(struct knfsd_fh *fh1, struct knfsd_fh *fh2)
return true;
}

-static inline bool fh_fsid_match(struct knfsd_fh *fh1, struct knfsd_fh *fh2)
+static inline bool fh_fsid_match(const struct knfsd_fh *fh1,
+ const struct knfsd_fh *fh2)
{
if (fh1->fh_fsid_type != fh2->fh_fsid_type)
return false;


2022-10-18 20:00:49

by Chuck Lever III

[permalink] [raw]
Subject: [PATCH v4 5/7] NFSD: Use rhashtable for managing nfs4_file objects

fh_match() is costly, especially when filehandles are large (as is
the case for NFSv4). It needs to be used sparingly when searching
data structures. Unfortunately, with common workloads, I see
multiple thousands of objects stored in file_hashtbl[], which always
has only 256 buckets, which makes the hash chains quite lengthy.

Walking long hash chains with the state_lock held blocks other
activity that needs that lock.

To help mitigate the cost of searching with fh_match(), replace the
nfs4_file hash table with an rhashtable, which can dynamically
resize its bucket array to minimize hash chain length. The ideal for
this use case is one bucket per inode.

The result is an improvement in the latency of NFSv4 operations
and the reduction of nfsd CPU utilization due to the cost of
fh_match() and the CPU cache misses incurred while walking long
hash chains in the nfs4_file hash table.

Note that hash removal is no longer protected by the state_lock.
This is because insert_nfs4_file() takes the RCU read lock then the
state_lock. rhltable_remove() takes the RCU read lock internally;
if remove_nfs4_file() took the state_lock before calling
rhltable_remove(), that would result in a lock inversion.

Signed-off-by: Chuck Lever <[email protected]>
---
fs/nfsd/nfs4state.c | 215 +++++++++++++++++++++++++++++++++++----------------
fs/nfsd/state.h | 5 -
2 files changed, 147 insertions(+), 73 deletions(-)

diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index 2b850de288cf..a63334ad61f6 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -44,7 +44,9 @@
#include <linux/jhash.h>
#include <linux/string_helpers.h>
#include <linux/fsnotify.h>
+#include <linux/rhashtable.h>
#include <linux/nfs_ssc.h>
+
#include "xdr4.h"
#include "xdr4cb.h"
#include "vfs.h"
@@ -84,6 +86,7 @@ static bool check_for_locks(struct nfs4_file *fp, struct nfs4_lockowner *lowner)
static void nfs4_free_ol_stateid(struct nfs4_stid *stid);
void nfsd4_end_grace(struct nfsd_net *nn);
static void _free_cpntf_state_locked(struct nfsd_net *nn, struct nfs4_cpntf_state *cps);
+static void remove_nfs4_file(struct nfs4_file *fi);

/* Locking: */

@@ -577,11 +580,8 @@ static void nfsd4_free_file_rcu(struct rcu_head *rcu)
void
put_nfs4_file(struct nfs4_file *fi)
{
- might_lock(&state_lock);
-
- if (refcount_dec_and_lock(&fi->fi_ref, &state_lock)) {
- hlist_del_rcu(&fi->fi_hash);
- spin_unlock(&state_lock);
+ if (refcount_dec_and_test(&fi->fi_ref)) {
+ remove_nfs4_file(fi);
WARN_ON_ONCE(!list_empty(&fi->fi_clnt_odstate));
WARN_ON_ONCE(!list_empty(&fi->fi_delegations));
call_rcu(&fi->fi_rcu, nfsd4_free_file_rcu);
@@ -695,19 +695,82 @@ static unsigned int ownerstr_hashval(struct xdr_netobj *ownername)
return ret & OWNER_HASH_MASK;
}

-/* hash table for nfs4_file */
-#define FILE_HASH_BITS 8
-#define FILE_HASH_SIZE (1 << FILE_HASH_BITS)
+static struct rhltable nfs4_file_rhltable ____cacheline_aligned_in_smp;
+
+/*
+ * The returned hash value is based solely on the address of an in-code
+ * inode, a pointer to a slab-allocated object. The entropy in such a
+ * pointer is concentrated in its middle bits.
+ */
+static u32 nfs4_file_inode_hash(const struct inode *inode, u32 seed)
+{
+ u32 k;
+
+ k = ((unsigned long)inode) >> L1_CACHE_SHIFT;
+ return jhash2(&k, 1, seed);
+}

-static unsigned int file_hashval(struct svc_fh *fh)
+/**
+ * nfs4_file_key_hashfn - Compute the hash value of a lookup key
+ * @data: key on which to compute the hash value
+ * @len: rhash table's key_len parameter (unused)
+ * @seed: rhash table's random seed of the day
+ *
+ * Return value:
+ * Computed 32-bit hash value
+ */
+static u32 nfs4_file_key_hashfn(const void *data, u32 len, u32 seed)
{
- struct inode *inode = d_inode(fh->fh_dentry);
+ const struct svc_fh *fhp = data;

- /* XXX: why not (here & in file cache) use inode? */
- return (unsigned int)hash_long(inode->i_ino, FILE_HASH_BITS);
+ return nfs4_file_inode_hash(d_inode(fhp->fh_dentry), seed);
}

-static struct hlist_head file_hashtbl[FILE_HASH_SIZE];
+/**
+ * nfs4_file_obj_hashfn - Compute the hash value of an nfs4_file object
+ * @data: object on which to compute the hash value
+ * @len: rhash table's key_len parameter (unused)
+ * @seed: rhash table's random seed of the day
+ *
+ * Return value:
+ * Computed 32-bit hash value
+ */
+static u32 nfs4_file_obj_hashfn(const void *data, u32 len, u32 seed)
+{
+ const struct nfs4_file *fi = data;
+
+ return nfs4_file_inode_hash(fi->fi_inode, seed);
+}
+
+/**
+ * nfs4_file_obj_cmpfn - Match a cache item against search criteria
+ * @arg: search criteria
+ * @ptr: cache item to check
+ *
+ * Return values:
+ * %0 - Success; item matches search criteria
+ */
+static int nfs4_file_obj_cmpfn(struct rhashtable_compare_arg *arg,
+ const void *ptr)
+{
+ const struct svc_fh *fhp = arg->key;
+ const struct nfs4_file *fi = ptr;
+
+ return fh_match(&fi->fi_fhandle, &fhp->fh_handle) ? 0 : 1;
+}
+
+static const struct rhashtable_params nfs4_file_rhash_params = {
+ .key_len = sizeof_field(struct nfs4_file, fi_inode),
+ .key_offset = offsetof(struct nfs4_file, fi_inode),
+ .head_offset = offsetof(struct nfs4_file, fi_rhash),
+ .hashfn = nfs4_file_key_hashfn,
+ .obj_hashfn = nfs4_file_obj_hashfn,
+ .obj_cmpfn = nfs4_file_obj_cmpfn,
+
+ /* Reduce resizing churn on light workloads */
+ .min_size = 512, /* buckets */
+ .automatic_shrinking = true,
+};

/*
* Check if courtesy clients have conflicting access and resolve it if possible
@@ -4251,11 +4314,8 @@ static struct nfs4_file *nfsd4_alloc_file(void)
}

/* OPEN Share state helper functions */
-static void nfsd4_init_file(struct svc_fh *fh, unsigned int hashval,
- struct nfs4_file *fp)
+static void init_nfs4_file(const struct svc_fh *fh, struct nfs4_file *fp)
{
- lockdep_assert_held(&state_lock);
-
refcount_set(&fp->fi_ref, 1);
spin_lock_init(&fp->fi_lock);
INIT_LIST_HEAD(&fp->fi_stateids);
@@ -4273,7 +4333,6 @@ static void nfsd4_init_file(struct svc_fh *fh, unsigned int hashval,
INIT_LIST_HEAD(&fp->fi_lo_states);
atomic_set(&fp->fi_lo_recalls, 0);
#endif
- hlist_add_head_rcu(&fp->fi_hash, &file_hashtbl[hashval]);
}

void
@@ -4626,71 +4685,75 @@ move_to_close_lru(struct nfs4_ol_stateid *s, struct net *net)
nfs4_put_stid(&last->st_stid);
}

-/* search file_hashtbl[] for file */
-static struct nfs4_file *
-find_file_locked(struct svc_fh *fh, unsigned int hashval)
+static struct nfs4_file *find_nfs4_file(const struct svc_fh *fhp)
{
- struct nfs4_file *fp;
+ struct rhlist_head *tmp, *list;
+ struct nfs4_file *fi = NULL;

- hlist_for_each_entry_rcu(fp, &file_hashtbl[hashval], fi_hash,
- lockdep_is_held(&state_lock)) {
- if (fh_match(&fp->fi_fhandle, &fh->fh_handle)) {
- if (refcount_inc_not_zero(&fp->fi_ref))
- return fp;
+ list = rhltable_lookup(&nfs4_file_rhltable, fhp,
+ nfs4_file_rhash_params);
+ rhl_for_each_entry_rcu(fi, tmp, list, fi_rhash)
+ if (fh_match(&fi->fi_fhandle, &fhp->fh_handle)) {
+ if (!refcount_inc_not_zero(&fi->fi_ref))
+ fi = NULL;
+ break;
}
- }
- return NULL;
+ return fi;
}

-static struct nfs4_file *insert_file(struct nfs4_file *new, struct svc_fh *fh,
- unsigned int hashval)
+/*
+ * On hash insertion, identify entries with the same inode but
+ * distinct filehandles. They will all be in the same hash bucket
+ * because we hash by the address of the nfs4_file's struct inode.
+ */
+static struct nfs4_file *insert_file(struct nfs4_file *new,
+ const struct svc_fh *fhp)
{
- struct nfs4_file *fp;
- struct nfs4_file *ret = NULL;
- bool alias_found = false;
+ struct rhlist_head *tmp, *list;
+ struct nfs4_file *fi;
+ int err;

spin_lock(&state_lock);
- hlist_for_each_entry_rcu(fp, &file_hashtbl[hashval], fi_hash,
- lockdep_is_held(&state_lock)) {
- if (fh_match(&fp->fi_fhandle, &fh->fh_handle)) {
- if (refcount_inc_not_zero(&fp->fi_ref))
- ret = fp;
- } else if (d_inode(fh->fh_dentry) == fp->fi_inode)
- fp->fi_aliased = alias_found = true;
- }
- if (likely(ret == NULL)) {
- nfsd4_init_file(fh, hashval, new);
- new->fi_aliased = alias_found;
- ret = new;
+
+ err = rhltable_insert_key(&nfs4_file_rhltable, fhp,
+ &new->fi_rhash,
+ nfs4_file_rhash_params);
+ if (err) {
+ spin_unlock(&state_lock);
+ return NULL;
}
+
+ list = rhltable_lookup(&nfs4_file_rhltable, fhp,
+ nfs4_file_rhash_params);
+ rhl_for_each_entry_rcu(fi, tmp, list, fi_rhash)
+ if (fi->fi_inode == d_inode(fhp->fh_dentry) &&
+ !fh_match(&fi->fi_fhandle, &fhp->fh_handle)) {
+ fi->fi_aliased = true;
+ new->fi_aliased = true;
+ }
+
spin_unlock(&state_lock);
- return ret;
+ return new;
}

-static struct nfs4_file * find_file(struct svc_fh *fh)
+static noinline struct nfs4_file *
+insert_nfs4_file(struct nfs4_file *new, const struct svc_fh *fhp)
{
- struct nfs4_file *fp;
- unsigned int hashval = file_hashval(fh);
+ struct nfs4_file *fi;

- rcu_read_lock();
- fp = find_file_locked(fh, hashval);
- rcu_read_unlock();
- return fp;
+ /* Do not hash the same filehandle more than once */
+ fi = find_nfs4_file(fhp);
+ if (unlikely(fi))
+ return fi;
+
+ init_nfs4_file(fhp, new);
+ return insert_file(new, fhp);
}

-static struct nfs4_file *
-find_or_add_file(struct nfs4_file *new, struct svc_fh *fh)
+static noinline void remove_nfs4_file(struct nfs4_file *fi)
{
- struct nfs4_file *fp;
- unsigned int hashval = file_hashval(fh);
-
- rcu_read_lock();
- fp = find_file_locked(fh, hashval);
- rcu_read_unlock();
- if (fp)
- return fp;
-
- return insert_file(new, fh, hashval);
+ rhltable_remove(&nfs4_file_rhltable, &fi->fi_rhash,
+ nfs4_file_rhash_params);
}

/*
@@ -4703,9 +4766,12 @@ nfs4_share_conflict(struct svc_fh *current_fh, unsigned int deny_type)
struct nfs4_file *fp;
__be32 ret = nfs_ok;

- fp = find_file(current_fh);
+ rcu_read_lock();
+ fp = find_nfs4_file(current_fh);
+ rcu_read_unlock();
if (!fp)
return ret;
+
/* Check for conflicting share reservations */
spin_lock(&fp->fi_lock);
if (fp->fi_share_deny & deny_type)
@@ -5548,7 +5614,11 @@ nfsd4_process_open2(struct svc_rqst *rqstp, struct svc_fh *current_fh, struct nf
* and check for delegations in the process of being recalled.
* If not found, create the nfs4_file struct
*/
- fp = find_or_add_file(open->op_file, current_fh);
+ rcu_read_lock();
+ fp = insert_nfs4_file(open->op_file, current_fh);
+ rcu_read_unlock();
+ if (unlikely(!fp))
+ return nfserr_jukebox;
if (fp != open->op_file) {
status = nfs4_check_deleg(cl, open, &dp);
if (status)
@@ -7905,10 +7975,16 @@ nfs4_state_start(void)
{
int ret;

- ret = nfsd4_create_callback_queue();
+ ret = rhltable_init(&nfs4_file_rhltable, &nfs4_file_rhash_params);
if (ret)
return ret;

+ ret = nfsd4_create_callback_queue();
+ if (ret) {
+ rhltable_destroy(&nfs4_file_rhltable);
+ return ret;
+ }
+
set_max_delegations();
return 0;
}
@@ -7939,6 +8015,7 @@ nfs4_state_shutdown_net(struct net *net)

nfsd4_client_tracking_exit(net);
nfs4_state_destroy_net(net);
+ rhltable_destroy(&nfs4_file_rhltable);
#ifdef CONFIG_NFSD_V4_2_INTER_SSC
nfsd4_ssc_shutdown_umount(nn);
#endif
diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h
index ae596dbf8667..ad20467c9dee 100644
--- a/fs/nfsd/state.h
+++ b/fs/nfsd/state.h
@@ -536,16 +536,13 @@ struct nfs4_clnt_odstate {
* inode can have multiple filehandles associated with it, so there is
* (potentially) a many to one relationship between this struct and struct
* inode.
- *
- * These are hashed by filehandle in the file_hashtbl, which is protected by
- * the global state_lock spinlock.
*/
struct nfs4_file {
refcount_t fi_ref;
struct inode * fi_inode;
bool fi_aliased;
spinlock_t fi_lock;
- struct hlist_node fi_hash; /* hash on fi_fhandle */
+ struct rhlist_head fi_rhash;
struct list_head fi_stateids;
union {
struct list_head fi_delegations;


2022-10-18 20:00:49

by Chuck Lever III

[permalink] [raw]
Subject: [PATCH v4 7/7] NFSD: Trace delegation revocations

Delegation revocation is an exceptional event that is not otherwise
visible externally (eg, no network traffic is emitted). Generate a
trace record when it occurs so that revocation can be observed or
other activity can be triggered. Example:

nfsd-1104 [005] 1912.002544: nfsd_stid_revoke: client 633c9343:4e82788d stateid 00000003:00000001 ref=2 type=DELEG

Trace infrastructure is provided for subsequent additional tracing
related to nfs4_stid activity.

Reviewed-by: Jeff Layton <[email protected]>
Tested-by: Jeff Layton <[email protected]>
Signed-off-by: Chuck Lever <[email protected]>
---
fs/nfsd/nfs4state.c | 2 ++
fs/nfsd/trace.h | 55 +++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 57 insertions(+)

diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index a63334ad61f6..c37f07bb874d 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -1401,6 +1401,8 @@ static void revoke_delegation(struct nfs4_delegation *dp)

WARN_ON(!list_empty(&dp->dl_recall_lru));

+ trace_nfsd_stid_revoke(&dp->dl_stid);
+
if (clp->cl_minorversion) {
dp->dl_stid.sc_type = NFS4_REVOKED_DELEG_STID;
refcount_inc(&dp->dl_stid.sc_count);
diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
index 4921144880d3..23fb39c957af 100644
--- a/fs/nfsd/trace.h
+++ b/fs/nfsd/trace.h
@@ -561,6 +561,61 @@ DEFINE_EVENT(nfsd_stateseqid_class, nfsd_##name, \
DEFINE_STATESEQID_EVENT(preprocess);
DEFINE_STATESEQID_EVENT(open_confirm);

+TRACE_DEFINE_ENUM(NFS4_OPEN_STID);
+TRACE_DEFINE_ENUM(NFS4_LOCK_STID);
+TRACE_DEFINE_ENUM(NFS4_DELEG_STID);
+TRACE_DEFINE_ENUM(NFS4_CLOSED_STID);
+TRACE_DEFINE_ENUM(NFS4_REVOKED_DELEG_STID);
+TRACE_DEFINE_ENUM(NFS4_CLOSED_DELEG_STID);
+TRACE_DEFINE_ENUM(NFS4_LAYOUT_STID);
+
+#define show_stid_type(x) \
+ __print_flags(x, "|", \
+ { NFS4_OPEN_STID, "OPEN" }, \
+ { NFS4_LOCK_STID, "LOCK" }, \
+ { NFS4_DELEG_STID, "DELEG" }, \
+ { NFS4_CLOSED_STID, "CLOSED" }, \
+ { NFS4_REVOKED_DELEG_STID, "REVOKED" }, \
+ { NFS4_CLOSED_DELEG_STID, "CLOSED_DELEG" }, \
+ { NFS4_LAYOUT_STID, "LAYOUT" })
+
+DECLARE_EVENT_CLASS(nfsd_stid_class,
+ TP_PROTO(
+ const struct nfs4_stid *stid
+ ),
+ TP_ARGS(stid),
+ TP_STRUCT__entry(
+ __field(unsigned long, sc_type)
+ __field(int, sc_count)
+ __field(u32, cl_boot)
+ __field(u32, cl_id)
+ __field(u32, si_id)
+ __field(u32, si_generation)
+ ),
+ TP_fast_assign(
+ const stateid_t *stp = &stid->sc_stateid;
+
+ __entry->sc_type = stid->sc_type;
+ __entry->sc_count = refcount_read(&stid->sc_count);
+ __entry->cl_boot = stp->si_opaque.so_clid.cl_boot;
+ __entry->cl_id = stp->si_opaque.so_clid.cl_id;
+ __entry->si_id = stp->si_opaque.so_id;
+ __entry->si_generation = stp->si_generation;
+ ),
+ TP_printk("client %08x:%08x stateid %08x:%08x ref=%d type=%s",
+ __entry->cl_boot, __entry->cl_id,
+ __entry->si_id, __entry->si_generation,
+ __entry->sc_count, show_stid_type(__entry->sc_type)
+ )
+);
+
+#define DEFINE_STID_EVENT(name) \
+DEFINE_EVENT(nfsd_stid_class, nfsd_stid_##name, \
+ TP_PROTO(const struct nfs4_stid *stid), \
+ TP_ARGS(stid))
+
+DEFINE_STID_EVENT(revoke);
+
DECLARE_EVENT_CLASS(nfsd_clientid_class,
TP_PROTO(const clientid_t *clid),
TP_ARGS(clid),


2022-10-18 20:02:17

by Chuck Lever III

[permalink] [raw]
Subject: [PATCH v4 6/7] NFSD: Clean up nfs4_preprocess_stateid_op() call sites

Remove the lame-duck dprintk()s around nfs4_preprocess_stateid_op()
call sites.

Reviewed-by: Jeff Layton <[email protected]>
Tested-by: Jeff Layton <[email protected]>
Signed-off-by: Chuck Lever <[email protected]>
---
fs/nfsd/nfs4proc.c | 31 +++++++------------------------
1 file changed, 7 insertions(+), 24 deletions(-)

diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
index d0d976f847ca..42b81e88ea14 100644
--- a/fs/nfsd/nfs4proc.c
+++ b/fs/nfsd/nfs4proc.c
@@ -951,12 +951,7 @@ nfsd4_read(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
status = nfs4_preprocess_stateid_op(rqstp, cstate, &cstate->current_fh,
&read->rd_stateid, RD_STATE,
&read->rd_nf, NULL);
- if (status) {
- dprintk("NFSD: nfsd4_read: couldn't process stateid!\n");
- goto out;
- }
- status = nfs_ok;
-out:
+
read->rd_rqstp = rqstp;
read->rd_fhp = &cstate->current_fh;
return status;
@@ -1125,10 +1120,8 @@ nfsd4_setattr(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
status = nfs4_preprocess_stateid_op(rqstp, cstate,
&cstate->current_fh, &setattr->sa_stateid,
WR_STATE, NULL, NULL);
- if (status) {
- dprintk("NFSD: nfsd4_setattr: couldn't process stateid!\n");
+ if (status)
return status;
- }
}
err = fh_want_write(&cstate->current_fh);
if (err)
@@ -1176,10 +1169,8 @@ nfsd4_write(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
write->wr_offset, cnt);
status = nfs4_preprocess_stateid_op(rqstp, cstate, &cstate->current_fh,
stateid, WR_STATE, &nf, NULL);
- if (status) {
- dprintk("NFSD: nfsd4_write: couldn't process stateid!\n");
+ if (status)
return status;
- }

write->wr_how_written = write->wr_stable_how;

@@ -1210,17 +1201,13 @@ nfsd4_verify_copy(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,

status = nfs4_preprocess_stateid_op(rqstp, cstate, &cstate->save_fh,
src_stateid, RD_STATE, src, NULL);
- if (status) {
- dprintk("NFSD: %s: couldn't process src stateid!\n", __func__);
+ if (status)
goto out;
- }

status = nfs4_preprocess_stateid_op(rqstp, cstate, &cstate->current_fh,
dst_stateid, WR_STATE, dst, NULL);
- if (status) {
- dprintk("NFSD: %s: couldn't process dst stateid!\n", __func__);
+ if (status)
goto out_put_src;
- }

/* fix up for NFS-specific error code */
if (!S_ISREG(file_inode((*src)->nf_file)->i_mode) ||
@@ -1955,10 +1942,8 @@ nfsd4_fallocate(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
status = nfs4_preprocess_stateid_op(rqstp, cstate, &cstate->current_fh,
&fallocate->falloc_stateid,
WR_STATE, &nf, NULL);
- if (status != nfs_ok) {
- dprintk("NFSD: nfsd4_fallocate: couldn't process stateid!\n");
+ if (status != nfs_ok)
return status;
- }

status = nfsd4_vfs_fallocate(rqstp, &cstate->current_fh, nf->nf_file,
fallocate->falloc_offset,
@@ -2014,10 +1999,8 @@ nfsd4_seek(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
status = nfs4_preprocess_stateid_op(rqstp, cstate, &cstate->current_fh,
&seek->seek_stateid,
RD_STATE, &nf, NULL);
- if (status) {
- dprintk("NFSD: nfsd4_seek: couldn't process stateid!\n");
+ if (status)
return status;
- }

switch (seek->seek_whence) {
case NFS4_CONTENT_DATA:


2022-10-19 12:07:30

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH v4 3/7] NFSD: Add an NFSD_FILE_GC flag to enable nfsd_file garbage collection

On Tue, 2022-10-18 at 15:58 -0400, Chuck Lever wrote:
> NFSv4 operations manage the lifetime of nfsd_file items they use by
> means of NFSv4 OPEN and CLOSE. Hence there's no need for them to be
> garbage collected.
>
> Introduce a mechanism to enable garbage collection for nfsd_file
> items used only by NFSv2/3 callers.
>
> Note that the change in nfsd_file_put() ensures that both CLOSE and
> DELEGRETURN will actually close out and free an nfsd_file on last
> reference of a non-garbage-collected file.
>
> Link: https://bugzilla.linux-nfs.org/show_bug.cgi?id=394
> Suggested-by: Trond Myklebust <[email protected]>
> Tested-by: Jeff Layton <[email protected]>
> Signed-off-by: Chuck Lever <[email protected]>
> ---
> fs/nfsd/filecache.c | 61 +++++++++++++++++++++++++++++++++++++++++++++------
> fs/nfsd/filecache.h | 3 +++
> fs/nfsd/nfs3proc.c | 4 ++-
> fs/nfsd/trace.h | 3 ++-
> fs/nfsd/vfs.c | 4 ++-
> 5 files changed, 63 insertions(+), 12 deletions(-)
>
> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> index b7aa523c2010..87fce5c95726 100644
> --- a/fs/nfsd/filecache.c
> +++ b/fs/nfsd/filecache.c
> @@ -63,6 +63,7 @@ struct nfsd_file_lookup_key {
> struct net *net;
> const struct cred *cred;
> unsigned char need;
> + unsigned char gc:1;

Is it worth it to mess around with bitfields here? There are exising
holes in this struct, so you're really not saving anything by using one
here.

Also, it seems like "need" is used as a bool anyway. Maybe both of those
fields should just be bools? It looks like changing that won't change
the size of the struct anyway. You might even be able to shrink it by
moving the "type" above "need".

> enum nfsd_file_lookup_type type;
> };
>
> @@ -162,6 +163,8 @@ static int nfsd_file_obj_cmpfn(struct rhashtable_compare_arg *arg,
> return 1;
> if (!nfsd_match_cred(nf->nf_cred, key->cred))
> return 1;
> + if (test_bit(NFSD_FILE_GC, &nf->nf_flags) != key->gc)
> + return 1;
> if (test_bit(NFSD_FILE_HASHED, &nf->nf_flags) == 0)
> return 1;
> break;
> @@ -297,6 +300,8 @@ nfsd_file_alloc(struct nfsd_file_lookup_key *key, unsigned int may)
> nf->nf_flags = 0;
> __set_bit(NFSD_FILE_HASHED, &nf->nf_flags);
> __set_bit(NFSD_FILE_PENDING, &nf->nf_flags);
> + if (key->gc)
> + __set_bit(NFSD_FILE_GC, &nf->nf_flags);
> nf->nf_inode = key->inode;
> /* nf_ref is pre-incremented for hash table */
> refcount_set(&nf->nf_ref, 2);
> @@ -428,16 +433,27 @@ nfsd_file_put_noref(struct nfsd_file *nf)
> }
> }
>
> +static void
> +nfsd_file_unhash_and_put(struct nfsd_file *nf)
> +{
> + if (nfsd_file_unhash(nf))
> + nfsd_file_put_noref(nf);
> +}
> +
> void
> nfsd_file_put(struct nfsd_file *nf)
> {
> might_sleep();
>
> - nfsd_file_lru_add(nf);
> + if (test_bit(NFSD_FILE_GC, &nf->nf_flags) == 1)
> + nfsd_file_lru_add(nf);
> + else if (refcount_read(&nf->nf_ref) == 2)
> + nfsd_file_unhash_and_put(nf);
> +
> if (test_bit(NFSD_FILE_HASHED, &nf->nf_flags) == 0) {
> nfsd_file_flush(nf);
> nfsd_file_put_noref(nf);
> - } else if (nf->nf_file) {
> + } else if (nf->nf_file && test_bit(NFSD_FILE_GC, &nf->nf_flags) == 1) {
> nfsd_file_put_noref(nf);
> nfsd_file_schedule_laundrette();
> } else
> @@ -1017,12 +1033,14 @@ nfsd_file_is_cached(struct inode *inode)
>
> static __be32
> nfsd_file_do_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
> - unsigned int may_flags, struct nfsd_file **pnf, bool open)
> + unsigned int may_flags, struct nfsd_file **pnf,
> + bool open, int want_gc)
> {
> struct nfsd_file_lookup_key key = {
> .type = NFSD_FILE_KEY_FULL,
> .need = may_flags & NFSD_FILE_MAY_MASK,
> .net = SVC_NET(rqstp),
> + .gc = want_gc,
> };
> bool open_retry = true;
> struct nfsd_file *nf;
> @@ -1117,14 +1135,35 @@ nfsd_file_do_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
> * then unhash.
> */
> if (status != nfs_ok || key.inode->i_nlink == 0)
> - if (nfsd_file_unhash(nf))
> - nfsd_file_put_noref(nf);
> + nfsd_file_unhash_and_put(nf);
> clear_bit_unlock(NFSD_FILE_PENDING, &nf->nf_flags);
> smp_mb__after_atomic();
> wake_up_bit(&nf->nf_flags, NFSD_FILE_PENDING);
> goto out;
> }
>
> +/**
> + * nfsd_file_acquire_gc - Get a struct nfsd_file with an open file
> + * @rqstp: the RPC transaction being executed
> + * @fhp: the NFS filehandle of the file to be opened
> + * @may_flags: NFSD_MAY_ settings for the file
> + * @pnf: OUT: new or found "struct nfsd_file" object
> + *
> + * The nfsd_file object returned by this API is reference-counted
> + * and garbage-collected. The object is retained for a few
> + * seconds after the final nfsd_file_put() in case the caller
> + * wants to re-use it.
> + *
> + * Returns nfs_ok and sets @pnf on success; otherwise an nfsstat in
> + * network byte order is returned.
> + */
> +__be32
> +nfsd_file_acquire_gc(struct svc_rqst *rqstp, struct svc_fh *fhp,
> + unsigned int may_flags, struct nfsd_file **pnf)
> +{
> + return nfsd_file_do_acquire(rqstp, fhp, may_flags, pnf, true, 1);
> +}
> +
> /**
> * nfsd_file_acquire - Get a struct nfsd_file with an open file
> * @rqstp: the RPC transaction being executed
> @@ -1132,6 +1171,10 @@ nfsd_file_do_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
> * @may_flags: NFSD_MAY_ settings for the file
> * @pnf: OUT: new or found "struct nfsd_file" object
> *
> + * The nfsd_file_object returned by this API is reference-counted
> + * but not garbage-collected. The object is unhashed after the
> + * final nfsd_file_put().
> + *
> * Returns nfs_ok and sets @pnf on success; otherwise an nfsstat in
> * network byte order is returned.
> */
> @@ -1139,7 +1182,7 @@ __be32
> nfsd_file_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
> unsigned int may_flags, struct nfsd_file **pnf)
> {
> - return nfsd_file_do_acquire(rqstp, fhp, may_flags, pnf, true);
> + return nfsd_file_do_acquire(rqstp, fhp, may_flags, pnf, true, 0);
> }
>
> /**
> @@ -1149,6 +1192,10 @@ nfsd_file_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
> * @may_flags: NFSD_MAY_ settings for the file
> * @pnf: OUT: new or found "struct nfsd_file" object
> *
> + * The nfsd_file_object returned by this API is reference-counted
> + * but not garbage-collected. The object is released immediately
> + * one RCU grace period after the final nfsd_file_put().
> + *
> * Returns nfs_ok and sets @pnf on success; otherwise an nfsstat in
> * network byte order is returned.
> */
> @@ -1156,7 +1203,7 @@ __be32
> nfsd_file_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
> unsigned int may_flags, struct nfsd_file **pnf)
> {
> - return nfsd_file_do_acquire(rqstp, fhp, may_flags, pnf, false);
> + return nfsd_file_do_acquire(rqstp, fhp, may_flags, pnf, false, 0);
> }
>
> /*
> diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h
> index f81c198f4ed6..0f6546bcd3e0 100644
> --- a/fs/nfsd/filecache.h
> +++ b/fs/nfsd/filecache.h
> @@ -38,6 +38,7 @@ struct nfsd_file {
> #define NFSD_FILE_HASHED (0)
> #define NFSD_FILE_PENDING (1)
> #define NFSD_FILE_REFERENCED (2)
> +#define NFSD_FILE_GC (3)
> unsigned long nf_flags;
> struct inode *nf_inode; /* don't deref */
> refcount_t nf_ref;
> @@ -55,6 +56,8 @@ void nfsd_file_put(struct nfsd_file *nf);
> struct nfsd_file *nfsd_file_get(struct nfsd_file *nf);
> void nfsd_file_close_inode_sync(struct inode *inode);
> bool nfsd_file_is_cached(struct inode *inode);
> +__be32 nfsd_file_acquire_gc(struct svc_rqst *rqstp, struct svc_fh *fhp,
> + unsigned int may_flags, struct nfsd_file **nfp);
> __be32 nfsd_file_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
> unsigned int may_flags, struct nfsd_file **nfp);
> __be32 nfsd_file_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
> diff --git a/fs/nfsd/nfs3proc.c b/fs/nfsd/nfs3proc.c
> index d12823371857..6a5ad6c91d8e 100644
> --- a/fs/nfsd/nfs3proc.c
> +++ b/fs/nfsd/nfs3proc.c
> @@ -779,8 +779,8 @@ nfsd3_proc_commit(struct svc_rqst *rqstp)
> (unsigned long long) argp->offset);
>
> fh_copy(&resp->fh, &argp->fh);
> - resp->status = nfsd_file_acquire(rqstp, &resp->fh, NFSD_MAY_WRITE |
> - NFSD_MAY_NOT_BREAK_LEASE, &nf);
> + resp->status = nfsd_file_acquire_gc(rqstp, &resp->fh, NFSD_MAY_WRITE |
> + NFSD_MAY_NOT_BREAK_LEASE, &nf);
> if (resp->status)
> goto out;
> resp->status = nfsd_commit(rqstp, &resp->fh, nf, argp->offset,
> diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
> index 9ebd67d461f9..4921144880d3 100644
> --- a/fs/nfsd/trace.h
> +++ b/fs/nfsd/trace.h
> @@ -742,7 +742,8 @@ DEFINE_CLID_EVENT(confirmed_r);
> __print_flags(val, "|", \
> { 1 << NFSD_FILE_HASHED, "HASHED" }, \
> { 1 << NFSD_FILE_PENDING, "PENDING" }, \
> - { 1 << NFSD_FILE_REFERENCED, "REFERENCED"})
> + { 1 << NFSD_FILE_REFERENCED, "REFERENCED"}, \
> + { 1 << NFSD_FILE_GC, "GC"})
>
> DECLARE_EVENT_CLASS(nfsd_file_class,
> TP_PROTO(struct nfsd_file *nf),
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 44f210ba17cc..89d682a56fc4 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -1060,7 +1060,7 @@ __be32 nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> __be32 err;
>
> trace_nfsd_read_start(rqstp, fhp, offset, *count);
> - err = nfsd_file_acquire(rqstp, fhp, NFSD_MAY_READ, &nf);
> + err = nfsd_file_acquire_gc(rqstp, fhp, NFSD_MAY_READ, &nf);
> if (err)
> return err;
>
> @@ -1092,7 +1092,7 @@ nfsd_write(struct svc_rqst *rqstp, struct svc_fh *fhp, loff_t offset,
>
> trace_nfsd_write_start(rqstp, fhp, offset, *cnt);
>
> - err = nfsd_file_acquire(rqstp, fhp, NFSD_MAY_WRITE, &nf);
> + err = nfsd_file_acquire_gc(rqstp, fhp, NFSD_MAY_WRITE, &nf);
> if (err)
> goto out;
>
>
>

Patch itself looks fine though. You can add:

Reviewed-by: Jeff Layton <[email protected]>

...but I'd probably prefer to see the flags in that struct as bools
instead if you're amenable to the change.

2022-10-19 12:07:58

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH v4 5/7] NFSD: Use rhashtable for managing nfs4_file objects

On Tue, 2022-10-18 at 15:58 -0400, Chuck Lever wrote:
> fh_match() is costly, especially when filehandles are large (as is
> the case for NFSv4). It needs to be used sparingly when searching
> data structures. Unfortunately, with common workloads, I see
> multiple thousands of objects stored in file_hashtbl[], which always
> has only 256 buckets, which makes the hash chains quite lengthy.
>
> Walking long hash chains with the state_lock held blocks other
> activity that needs that lock.
>
> To help mitigate the cost of searching with fh_match(), replace the
> nfs4_file hash table with an rhashtable, which can dynamically
> resize its bucket array to minimize hash chain length. The ideal for
> this use case is one bucket per inode.
>
> The result is an improvement in the latency of NFSv4 operations
> and the reduction of nfsd CPU utilization due to the cost of
> fh_match() and the CPU cache misses incurred while walking long
> hash chains in the nfs4_file hash table.
>
> Note that hash removal is no longer protected by the state_lock.
> This is because insert_nfs4_file() takes the RCU read lock then the
> state_lock. rhltable_remove() takes the RCU read lock internally;
> if remove_nfs4_file() took the state_lock before calling
> rhltable_remove(), that would result in a lock inversion.
>
> Signed-off-by: Chuck Lever <[email protected]>
> ---
> fs/nfsd/nfs4state.c | 215 +++++++++++++++++++++++++++++++++++----------------
> fs/nfsd/state.h | 5 -
> 2 files changed, 147 insertions(+), 73 deletions(-)
>
> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> index 2b850de288cf..a63334ad61f6 100644
> --- a/fs/nfsd/nfs4state.c
> +++ b/fs/nfsd/nfs4state.c
> @@ -44,7 +44,9 @@
> #include <linux/jhash.h>
> #include <linux/string_helpers.h>
> #include <linux/fsnotify.h>
> +#include <linux/rhashtable.h>
> #include <linux/nfs_ssc.h>
> +
> #include "xdr4.h"
> #include "xdr4cb.h"
> #include "vfs.h"
> @@ -84,6 +86,7 @@ static bool check_for_locks(struct nfs4_file *fp, struct nfs4_lockowner *lowner)
> static void nfs4_free_ol_stateid(struct nfs4_stid *stid);
> void nfsd4_end_grace(struct nfsd_net *nn);
> static void _free_cpntf_state_locked(struct nfsd_net *nn, struct nfs4_cpntf_state *cps);
> +static void remove_nfs4_file(struct nfs4_file *fi);
>
> /* Locking: */
>
> @@ -577,11 +580,8 @@ static void nfsd4_free_file_rcu(struct rcu_head *rcu)
> void
> put_nfs4_file(struct nfs4_file *fi)
> {
> - might_lock(&state_lock);
> -
> - if (refcount_dec_and_lock(&fi->fi_ref, &state_lock)) {
> - hlist_del_rcu(&fi->fi_hash);
> - spin_unlock(&state_lock);
> + if (refcount_dec_and_test(&fi->fi_ref)) {
> + remove_nfs4_file(fi);
> WARN_ON_ONCE(!list_empty(&fi->fi_clnt_odstate));
> WARN_ON_ONCE(!list_empty(&fi->fi_delegations));
> call_rcu(&fi->fi_rcu, nfsd4_free_file_rcu);
> @@ -695,19 +695,82 @@ static unsigned int ownerstr_hashval(struct xdr_netobj *ownername)
> return ret & OWNER_HASH_MASK;
> }
>
> -/* hash table for nfs4_file */
> -#define FILE_HASH_BITS 8
> -#define FILE_HASH_SIZE (1 << FILE_HASH_BITS)
> +static struct rhltable nfs4_file_rhltable ____cacheline_aligned_in_smp;
> +
> +/*
> + * The returned hash value is based solely on the address of an in-code
> + * inode, a pointer to a slab-allocated object. The entropy in such a
> + * pointer is concentrated in its middle bits.
> + */
> +static u32 nfs4_file_inode_hash(const struct inode *inode, u32 seed)
> +{
> + u32 k;
> +
> + k = ((unsigned long)inode) >> L1_CACHE_SHIFT;
> + return jhash2(&k, 1, seed);
> +}
>
> -static unsigned int file_hashval(struct svc_fh *fh)
> +/**
> + * nfs4_file_key_hashfn - Compute the hash value of a lookup key
> + * @data: key on which to compute the hash value
> + * @len: rhash table's key_len parameter (unused)
> + * @seed: rhash table's random seed of the day
> + *
> + * Return value:
> + * Computed 32-bit hash value
> + */
> +static u32 nfs4_file_key_hashfn(const void *data, u32 len, u32 seed)
> {
> - struct inode *inode = d_inode(fh->fh_dentry);
> + const struct svc_fh *fhp = data;
>
> - /* XXX: why not (here & in file cache) use inode? */
> - return (unsigned int)hash_long(inode->i_ino, FILE_HASH_BITS);
> + return nfs4_file_inode_hash(d_inode(fhp->fh_dentry), seed);
> }
>
> -static struct hlist_head file_hashtbl[FILE_HASH_SIZE];
> +/**
> + * nfs4_file_obj_hashfn - Compute the hash value of an nfs4_file object
> + * @data: object on which to compute the hash value
> + * @len: rhash table's key_len parameter (unused)
> + * @seed: rhash table's random seed of the day
> + *
> + * Return value:
> + * Computed 32-bit hash value
> + */
> +static u32 nfs4_file_obj_hashfn(const void *data, u32 len, u32 seed)
> +{
> + const struct nfs4_file *fi = data;
> +
> + return nfs4_file_inode_hash(fi->fi_inode, seed);
> +}
> +
> +/**
> + * nfs4_file_obj_cmpfn - Match a cache item against search criteria
> + * @arg: search criteria
> + * @ptr: cache item to check
> + *
> + * Return values:
> + * %0 - Success; item matches search criteria
> + */
> +static int nfs4_file_obj_cmpfn(struct rhashtable_compare_arg *arg,
> + const void *ptr)
> +{
> + const struct svc_fh *fhp = arg->key;
> + const struct nfs4_file *fi = ptr;
> +
> + return fh_match(&fi->fi_fhandle, &fhp->fh_handle) ? 0 : 1;
> +}
> +
> +static const struct rhashtable_params nfs4_file_rhash_params = {
> + .key_len = sizeof_field(struct nfs4_file, fi_inode),
> + .key_offset = offsetof(struct nfs4_file, fi_inode),
> + .head_offset = offsetof(struct nfs4_file, fi_rhash),
> + .hashfn = nfs4_file_key_hashfn,
> + .obj_hashfn = nfs4_file_obj_hashfn,
> + .obj_cmpfn = nfs4_file_obj_cmpfn,
> +
> + /* Reduce resizing churn on light workloads */
> + .min_size = 512, /* buckets */
> + .automatic_shrinking = true,
> +};
>
> /*
> * Check if courtesy clients have conflicting access and resolve it if possible
> @@ -4251,11 +4314,8 @@ static struct nfs4_file *nfsd4_alloc_file(void)
> }
>
> /* OPEN Share state helper functions */
> -static void nfsd4_init_file(struct svc_fh *fh, unsigned int hashval,
> - struct nfs4_file *fp)
> +static void init_nfs4_file(const struct svc_fh *fh, struct nfs4_file *fp)
> {
> - lockdep_assert_held(&state_lock);
> -
> refcount_set(&fp->fi_ref, 1);
> spin_lock_init(&fp->fi_lock);
> INIT_LIST_HEAD(&fp->fi_stateids);
> @@ -4273,7 +4333,6 @@ static void nfsd4_init_file(struct svc_fh *fh, unsigned int hashval,
> INIT_LIST_HEAD(&fp->fi_lo_states);
> atomic_set(&fp->fi_lo_recalls, 0);
> #endif
> - hlist_add_head_rcu(&fp->fi_hash, &file_hashtbl[hashval]);
> }
>
> void
> @@ -4626,71 +4685,75 @@ move_to_close_lru(struct nfs4_ol_stateid *s, struct net *net)
> nfs4_put_stid(&last->st_stid);
> }
>
> -/* search file_hashtbl[] for file */
> -static struct nfs4_file *
> -find_file_locked(struct svc_fh *fh, unsigned int hashval)
> +static struct nfs4_file *find_nfs4_file(const struct svc_fh *fhp)
> {
> - struct nfs4_file *fp;
> + struct rhlist_head *tmp, *list;
> + struct nfs4_file *fi = NULL;
>
> - hlist_for_each_entry_rcu(fp, &file_hashtbl[hashval], fi_hash,
> - lockdep_is_held(&state_lock)) {
> - if (fh_match(&fp->fi_fhandle, &fh->fh_handle)) {
> - if (refcount_inc_not_zero(&fp->fi_ref))
> - return fp;
> + list = rhltable_lookup(&nfs4_file_rhltable, fhp,
> + nfs4_file_rhash_params);
> + rhl_for_each_entry_rcu(fi, tmp, list, fi_rhash)
> + if (fh_match(&fi->fi_fhandle, &fhp->fh_handle)) {
> + if (!refcount_inc_not_zero(&fi->fi_ref))
> + fi = NULL;
> + break;
> }
> - }
> - return NULL;
> + return fi;
> }
>
> -static struct nfs4_file *insert_file(struct nfs4_file *new, struct svc_fh *fh,
> - unsigned int hashval)
> +/*
> + * On hash insertion, identify entries with the same inode but
> + * distinct filehandles. They will all be in the same hash bucket
> + * because we hash by the address of the nfs4_file's struct inode.
> + */
> +static struct nfs4_file *insert_file(struct nfs4_file *new,
> + const struct svc_fh *fhp)
> {
> - struct nfs4_file *fp;
> - struct nfs4_file *ret = NULL;
> - bool alias_found = false;
> + struct rhlist_head *tmp, *list;
> + struct nfs4_file *fi;
> + int err;
>
> spin_lock(&state_lock);
> - hlist_for_each_entry_rcu(fp, &file_hashtbl[hashval], fi_hash,
> - lockdep_is_held(&state_lock)) {
> - if (fh_match(&fp->fi_fhandle, &fh->fh_handle)) {
> - if (refcount_inc_not_zero(&fp->fi_ref))
> - ret = fp;
> - } else if (d_inode(fh->fh_dentry) == fp->fi_inode)
> - fp->fi_aliased = alias_found = true;
> - }
> - if (likely(ret == NULL)) {
> - nfsd4_init_file(fh, hashval, new);
> - new->fi_aliased = alias_found;
> - ret = new;
> +
> + err = rhltable_insert_key(&nfs4_file_rhltable, fhp,
> + &new->fi_rhash,
> + nfs4_file_rhash_params);
> + if (err) {
> + spin_unlock(&state_lock);
> + return NULL;
> }
> +
> + list = rhltable_lookup(&nfs4_file_rhltable, fhp,
> + nfs4_file_rhash_params);
> + rhl_for_each_entry_rcu(fi, tmp, list, fi_rhash)
> + if (fi->fi_inode == d_inode(fhp->fh_dentry) &&
> + !fh_match(&fi->fi_fhandle, &fhp->fh_handle)) {
> + fi->fi_aliased = true;
> + new->fi_aliased = true;
> + }
> +
> spin_unlock(&state_lock);
> - return ret;
> + return new;
> }
>
> -static struct nfs4_file * find_file(struct svc_fh *fh)
> +static noinline struct nfs4_file *
> +insert_nfs4_file(struct nfs4_file *new, const struct svc_fh *fhp)
> {
> - struct nfs4_file *fp;
> - unsigned int hashval = file_hashval(fh);
> + struct nfs4_file *fi;
>
> - rcu_read_lock();
> - fp = find_file_locked(fh, hashval);
> - rcu_read_unlock();
> - return fp;
> + /* Do not hash the same filehandle more than once */
> + fi = find_nfs4_file(fhp);
> + if (unlikely(fi))
> + return fi;
> +
> + init_nfs4_file(fhp, new);
> + return insert_file(new, fhp);
> }
>
> -static struct nfs4_file *
> -find_or_add_file(struct nfs4_file *new, struct svc_fh *fh)
> +static noinline void remove_nfs4_file(struct nfs4_file *fi)
> {
> - struct nfs4_file *fp;
> - unsigned int hashval = file_hashval(fh);
> -
> - rcu_read_lock();
> - fp = find_file_locked(fh, hashval);
> - rcu_read_unlock();
> - if (fp)
> - return fp;
> -
> - return insert_file(new, fh, hashval);
> + rhltable_remove(&nfs4_file_rhltable, &fi->fi_rhash,
> + nfs4_file_rhash_params);
> }
>
> /*
> @@ -4703,9 +4766,12 @@ nfs4_share_conflict(struct svc_fh *current_fh, unsigned int deny_type)
> struct nfs4_file *fp;
> __be32 ret = nfs_ok;
>
> - fp = find_file(current_fh);
> + rcu_read_lock();
> + fp = find_nfs4_file(current_fh);
> + rcu_read_unlock();
> if (!fp)
> return ret;
> +
> /* Check for conflicting share reservations */
> spin_lock(&fp->fi_lock);
> if (fp->fi_share_deny & deny_type)
> @@ -5548,7 +5614,11 @@ nfsd4_process_open2(struct svc_rqst *rqstp, struct svc_fh *current_fh, struct nf
> * and check for delegations in the process of being recalled.
> * If not found, create the nfs4_file struct
> */
> - fp = find_or_add_file(open->op_file, current_fh);
> + rcu_read_lock();
> + fp = insert_nfs4_file(open->op_file, current_fh);
> + rcu_read_unlock();

It'd probably be better to push this rcu_read_lock down into
insert_nfs4_file. You don't need to hold it over the actual insertion,
since that requires the state_lock.


> + if (unlikely(!fp))
> + return nfserr_jukebox;
> if (fp != open->op_file) {
> status = nfs4_check_deleg(cl, open, &dp);
> if (status)
> @@ -7905,10 +7975,16 @@ nfs4_state_start(void)
> {
> int ret;
>
> - ret = nfsd4_create_callback_queue();
> + ret = rhltable_init(&nfs4_file_rhltable, &nfs4_file_rhash_params);
> if (ret)
> return ret;
>
> + ret = nfsd4_create_callback_queue();
> + if (ret) {
> + rhltable_destroy(&nfs4_file_rhltable);
> + return ret;
> + }
> +
> set_max_delegations();
> return 0;
> }
> @@ -7939,6 +8015,7 @@ nfs4_state_shutdown_net(struct net *net)
>
> nfsd4_client_tracking_exit(net);
> nfs4_state_destroy_net(net);
> + rhltable_destroy(&nfs4_file_rhltable);
> #ifdef CONFIG_NFSD_V4_2_INTER_SSC
> nfsd4_ssc_shutdown_umount(nn);
> #endif
> diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h
> index ae596dbf8667..ad20467c9dee 100644
> --- a/fs/nfsd/state.h
> +++ b/fs/nfsd/state.h
> @@ -536,16 +536,13 @@ struct nfs4_clnt_odstate {
> * inode can have multiple filehandles associated with it, so there is
> * (potentially) a many to one relationship between this struct and struct
> * inode.
> - *
> - * These are hashed by filehandle in the file_hashtbl, which is protected by
> - * the global state_lock spinlock.
> */
> struct nfs4_file {
> refcount_t fi_ref;
> struct inode * fi_inode;
> bool fi_aliased;
> spinlock_t fi_lock;
> - struct hlist_node fi_hash; /* hash on fi_fhandle */
> + struct rhlist_head fi_rhash;
> struct list_head fi_stateids;
> union {
> struct list_head fi_delegations;
>
>

--
Jeff Layton <[email protected]>

2022-10-19 14:47:57

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH v4 5/7] NFSD: Use rhashtable for managing nfs4_file objects



> On Oct 19, 2022, at 7:39 AM, Jeff Layton <[email protected]> wrote:
>
> On Tue, 2022-10-18 at 15:58 -0400, Chuck Lever wrote:
>> fh_match() is costly, especially when filehandles are large (as is
>> the case for NFSv4). It needs to be used sparingly when searching
>> data structures. Unfortunately, with common workloads, I see
>> multiple thousands of objects stored in file_hashtbl[], which always
>> has only 256 buckets, which makes the hash chains quite lengthy.
>>
>> Walking long hash chains with the state_lock held blocks other
>> activity that needs that lock.
>>
>> To help mitigate the cost of searching with fh_match(), replace the
>> nfs4_file hash table with an rhashtable, which can dynamically
>> resize its bucket array to minimize hash chain length. The ideal for
>> this use case is one bucket per inode.
>>
>> The result is an improvement in the latency of NFSv4 operations
>> and the reduction of nfsd CPU utilization due to the cost of
>> fh_match() and the CPU cache misses incurred while walking long
>> hash chains in the nfs4_file hash table.
>>
>> Note that hash removal is no longer protected by the state_lock.
>> This is because insert_nfs4_file() takes the RCU read lock then the
>> state_lock. rhltable_remove() takes the RCU read lock internally;
>> if remove_nfs4_file() took the state_lock before calling
>> rhltable_remove(), that would result in a lock inversion.
>>
>> Signed-off-by: Chuck Lever <[email protected]>
>> ---
>> fs/nfsd/nfs4state.c | 215 +++++++++++++++++++++++++++++++++++----------------
>> fs/nfsd/state.h | 5 -
>> 2 files changed, 147 insertions(+), 73 deletions(-)
>>
>> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
>> index 2b850de288cf..a63334ad61f6 100644
>> --- a/fs/nfsd/nfs4state.c
>> +++ b/fs/nfsd/nfs4state.c
>> @@ -44,7 +44,9 @@
>> #include <linux/jhash.h>
>> #include <linux/string_helpers.h>
>> #include <linux/fsnotify.h>
>> +#include <linux/rhashtable.h>
>> #include <linux/nfs_ssc.h>
>> +
>> #include "xdr4.h"
>> #include "xdr4cb.h"
>> #include "vfs.h"
>> @@ -84,6 +86,7 @@ static bool check_for_locks(struct nfs4_file *fp, struct nfs4_lockowner *lowner)
>> static void nfs4_free_ol_stateid(struct nfs4_stid *stid);
>> void nfsd4_end_grace(struct nfsd_net *nn);
>> static void _free_cpntf_state_locked(struct nfsd_net *nn, struct nfs4_cpntf_state *cps);
>> +static void remove_nfs4_file(struct nfs4_file *fi);
>>
>> /* Locking: */
>>
>> @@ -577,11 +580,8 @@ static void nfsd4_free_file_rcu(struct rcu_head *rcu)
>> void
>> put_nfs4_file(struct nfs4_file *fi)
>> {
>> - might_lock(&state_lock);
>> -
>> - if (refcount_dec_and_lock(&fi->fi_ref, &state_lock)) {
>> - hlist_del_rcu(&fi->fi_hash);
>> - spin_unlock(&state_lock);
>> + if (refcount_dec_and_test(&fi->fi_ref)) {
>> + remove_nfs4_file(fi);
>> WARN_ON_ONCE(!list_empty(&fi->fi_clnt_odstate));
>> WARN_ON_ONCE(!list_empty(&fi->fi_delegations));
>> call_rcu(&fi->fi_rcu, nfsd4_free_file_rcu);
>> @@ -695,19 +695,82 @@ static unsigned int ownerstr_hashval(struct xdr_netobj *ownername)
>> return ret & OWNER_HASH_MASK;
>> }
>>
>> -/* hash table for nfs4_file */
>> -#define FILE_HASH_BITS 8
>> -#define FILE_HASH_SIZE (1 << FILE_HASH_BITS)
>> +static struct rhltable nfs4_file_rhltable ____cacheline_aligned_in_smp;
>> +
>> +/*
>> + * The returned hash value is based solely on the address of an in-code
>> + * inode, a pointer to a slab-allocated object. The entropy in such a
>> + * pointer is concentrated in its middle bits.
>> + */
>> +static u32 nfs4_file_inode_hash(const struct inode *inode, u32 seed)
>> +{
>> + u32 k;
>> +
>> + k = ((unsigned long)inode) >> L1_CACHE_SHIFT;
>> + return jhash2(&k, 1, seed);
>> +}
>>
>> -static unsigned int file_hashval(struct svc_fh *fh)
>> +/**
>> + * nfs4_file_key_hashfn - Compute the hash value of a lookup key
>> + * @data: key on which to compute the hash value
>> + * @len: rhash table's key_len parameter (unused)
>> + * @seed: rhash table's random seed of the day
>> + *
>> + * Return value:
>> + * Computed 32-bit hash value
>> + */
>> +static u32 nfs4_file_key_hashfn(const void *data, u32 len, u32 seed)
>> {
>> - struct inode *inode = d_inode(fh->fh_dentry);
>> + const struct svc_fh *fhp = data;
>>
>> - /* XXX: why not (here & in file cache) use inode? */
>> - return (unsigned int)hash_long(inode->i_ino, FILE_HASH_BITS);
>> + return nfs4_file_inode_hash(d_inode(fhp->fh_dentry), seed);
>> }
>>
>> -static struct hlist_head file_hashtbl[FILE_HASH_SIZE];
>> +/**
>> + * nfs4_file_obj_hashfn - Compute the hash value of an nfs4_file object
>> + * @data: object on which to compute the hash value
>> + * @len: rhash table's key_len parameter (unused)
>> + * @seed: rhash table's random seed of the day
>> + *
>> + * Return value:
>> + * Computed 32-bit hash value
>> + */
>> +static u32 nfs4_file_obj_hashfn(const void *data, u32 len, u32 seed)
>> +{
>> + const struct nfs4_file *fi = data;
>> +
>> + return nfs4_file_inode_hash(fi->fi_inode, seed);
>> +}
>> +
>> +/**
>> + * nfs4_file_obj_cmpfn - Match a cache item against search criteria
>> + * @arg: search criteria
>> + * @ptr: cache item to check
>> + *
>> + * Return values:
>> + * %0 - Success; item matches search criteria
>> + */
>> +static int nfs4_file_obj_cmpfn(struct rhashtable_compare_arg *arg,
>> + const void *ptr)
>> +{
>> + const struct svc_fh *fhp = arg->key;
>> + const struct nfs4_file *fi = ptr;
>> +
>> + return fh_match(&fi->fi_fhandle, &fhp->fh_handle) ? 0 : 1;
>> +}
>> +
>> +static const struct rhashtable_params nfs4_file_rhash_params = {
>> + .key_len = sizeof_field(struct nfs4_file, fi_inode),
>> + .key_offset = offsetof(struct nfs4_file, fi_inode),
>> + .head_offset = offsetof(struct nfs4_file, fi_rhash),
>> + .hashfn = nfs4_file_key_hashfn,
>> + .obj_hashfn = nfs4_file_obj_hashfn,
>> + .obj_cmpfn = nfs4_file_obj_cmpfn,
>> +
>> + /* Reduce resizing churn on light workloads */
>> + .min_size = 512, /* buckets */
>> + .automatic_shrinking = true,
>> +};
>>
>> /*
>> * Check if courtesy clients have conflicting access and resolve it if possible
>> @@ -4251,11 +4314,8 @@ static struct nfs4_file *nfsd4_alloc_file(void)
>> }
>>
>> /* OPEN Share state helper functions */
>> -static void nfsd4_init_file(struct svc_fh *fh, unsigned int hashval,
>> - struct nfs4_file *fp)
>> +static void init_nfs4_file(const struct svc_fh *fh, struct nfs4_file *fp)
>> {
>> - lockdep_assert_held(&state_lock);
>> -
>> refcount_set(&fp->fi_ref, 1);
>> spin_lock_init(&fp->fi_lock);
>> INIT_LIST_HEAD(&fp->fi_stateids);
>> @@ -4273,7 +4333,6 @@ static void nfsd4_init_file(struct svc_fh *fh, unsigned int hashval,
>> INIT_LIST_HEAD(&fp->fi_lo_states);
>> atomic_set(&fp->fi_lo_recalls, 0);
>> #endif
>> - hlist_add_head_rcu(&fp->fi_hash, &file_hashtbl[hashval]);
>> }
>>
>> void
>> @@ -4626,71 +4685,75 @@ move_to_close_lru(struct nfs4_ol_stateid *s, struct net *net)
>> nfs4_put_stid(&last->st_stid);
>> }
>>
>> -/* search file_hashtbl[] for file */
>> -static struct nfs4_file *
>> -find_file_locked(struct svc_fh *fh, unsigned int hashval)
>> +static struct nfs4_file *find_nfs4_file(const struct svc_fh *fhp)
>> {
>> - struct nfs4_file *fp;
>> + struct rhlist_head *tmp, *list;
>> + struct nfs4_file *fi = NULL;
>>
>> - hlist_for_each_entry_rcu(fp, &file_hashtbl[hashval], fi_hash,
>> - lockdep_is_held(&state_lock)) {
>> - if (fh_match(&fp->fi_fhandle, &fh->fh_handle)) {
>> - if (refcount_inc_not_zero(&fp->fi_ref))
>> - return fp;
>> + list = rhltable_lookup(&nfs4_file_rhltable, fhp,
>> + nfs4_file_rhash_params);
>> + rhl_for_each_entry_rcu(fi, tmp, list, fi_rhash)
>> + if (fh_match(&fi->fi_fhandle, &fhp->fh_handle)) {
>> + if (!refcount_inc_not_zero(&fi->fi_ref))
>> + fi = NULL;
>> + break;
>> }
>> - }
>> - return NULL;
>> + return fi;
>> }
>>
>> -static struct nfs4_file *insert_file(struct nfs4_file *new, struct svc_fh *fh,
>> - unsigned int hashval)
>> +/*
>> + * On hash insertion, identify entries with the same inode but
>> + * distinct filehandles. They will all be in the same hash bucket
>> + * because we hash by the address of the nfs4_file's struct inode.
>> + */
>> +static struct nfs4_file *insert_file(struct nfs4_file *new,
>> + const struct svc_fh *fhp)
>> {
>> - struct nfs4_file *fp;
>> - struct nfs4_file *ret = NULL;
>> - bool alias_found = false;
>> + struct rhlist_head *tmp, *list;
>> + struct nfs4_file *fi;
>> + int err;
>>
>> spin_lock(&state_lock);
>> - hlist_for_each_entry_rcu(fp, &file_hashtbl[hashval], fi_hash,
>> - lockdep_is_held(&state_lock)) {
>> - if (fh_match(&fp->fi_fhandle, &fh->fh_handle)) {
>> - if (refcount_inc_not_zero(&fp->fi_ref))
>> - ret = fp;
>> - } else if (d_inode(fh->fh_dentry) == fp->fi_inode)
>> - fp->fi_aliased = alias_found = true;
>> - }
>> - if (likely(ret == NULL)) {
>> - nfsd4_init_file(fh, hashval, new);
>> - new->fi_aliased = alias_found;
>> - ret = new;
>> +
>> + err = rhltable_insert_key(&nfs4_file_rhltable, fhp,
>> + &new->fi_rhash,
>> + nfs4_file_rhash_params);
>> + if (err) {
>> + spin_unlock(&state_lock);
>> + return NULL;
>> }
>> +
>> + list = rhltable_lookup(&nfs4_file_rhltable, fhp,
>> + nfs4_file_rhash_params);
>> + rhl_for_each_entry_rcu(fi, tmp, list, fi_rhash)
>> + if (fi->fi_inode == d_inode(fhp->fh_dentry) &&
>> + !fh_match(&fi->fi_fhandle, &fhp->fh_handle)) {
>> + fi->fi_aliased = true;
>> + new->fi_aliased = true;
>> + }
>> +
>> spin_unlock(&state_lock);
>> - return ret;
>> + return new;
>> }
>>
>> -static struct nfs4_file * find_file(struct svc_fh *fh)
>> +static noinline struct nfs4_file *
>> +insert_nfs4_file(struct nfs4_file *new, const struct svc_fh *fhp)
>> {
>> - struct nfs4_file *fp;
>> - unsigned int hashval = file_hashval(fh);
>> + struct nfs4_file *fi;
>>
>> - rcu_read_lock();
>> - fp = find_file_locked(fh, hashval);
>> - rcu_read_unlock();
>> - return fp;
>> + /* Do not hash the same filehandle more than once */
>> + fi = find_nfs4_file(fhp);
>> + if (unlikely(fi))
>> + return fi;
>> +
>> + init_nfs4_file(fhp, new);
>> + return insert_file(new, fhp);
>> }
>>
>> -static struct nfs4_file *
>> -find_or_add_file(struct nfs4_file *new, struct svc_fh *fh)
>> +static noinline void remove_nfs4_file(struct nfs4_file *fi)
>> {
>> - struct nfs4_file *fp;
>> - unsigned int hashval = file_hashval(fh);
>> -
>> - rcu_read_lock();
>> - fp = find_file_locked(fh, hashval);
>> - rcu_read_unlock();
>> - if (fp)
>> - return fp;
>> -
>> - return insert_file(new, fh, hashval);
>> + rhltable_remove(&nfs4_file_rhltable, &fi->fi_rhash,
>> + nfs4_file_rhash_params);
>> }
>>
>> /*
>> @@ -4703,9 +4766,12 @@ nfs4_share_conflict(struct svc_fh *current_fh, unsigned int deny_type)
>> struct nfs4_file *fp;
>> __be32 ret = nfs_ok;
>>
>> - fp = find_file(current_fh);
>> + rcu_read_lock();
>> + fp = find_nfs4_file(current_fh);
>> + rcu_read_unlock();
>> if (!fp)
>> return ret;
>> +
>> /* Check for conflicting share reservations */
>> spin_lock(&fp->fi_lock);
>> if (fp->fi_share_deny & deny_type)
>> @@ -5548,7 +5614,11 @@ nfsd4_process_open2(struct svc_rqst *rqstp, struct svc_fh *current_fh, struct nf
>> * and check for delegations in the process of being recalled.
>> * If not found, create the nfs4_file struct
>> */
>> - fp = find_or_add_file(open->op_file, current_fh);
>> + rcu_read_lock();
>> + fp = insert_nfs4_file(open->op_file, current_fh);
>> + rcu_read_unlock();
>
> It'd probably be better to push this rcu_read_lock down into
> insert_nfs4_file. You don't need to hold it over the actual insertion,
> since that requires the state_lock.

I used this arrangement because:

insert_nfs4_file() invokes only find_nfs4_file() and the
insert_file() helper. Both find_nfs4_file() and the
insert_file() helper invoke rhltable_lookup(), which
must be called with the RCU read lock held.

And this is the reason why put_nfs4_file() no longer takes
the state_lock: it would take the state_lock first and
then the RCU read lock (which is implicitly taken in
rhltable_remove()), which results in a lock inversion
relative to insert_nfs4_file(), which takes the RCU read
lock first, then the state_lock.


I'm certainly not an expert, so I'm willing to listen to
alternative approaches. Can we rely on only the RCU read
lock for exclusion on hash insertion?


>> + if (unlikely(!fp))
>> + return nfserr_jukebox;
>> if (fp != open->op_file) {
>> status = nfs4_check_deleg(cl, open, &dp);
>> if (status)
>> @@ -7905,10 +7975,16 @@ nfs4_state_start(void)
>> {
>> int ret;
>>
>> - ret = nfsd4_create_callback_queue();
>> + ret = rhltable_init(&nfs4_file_rhltable, &nfs4_file_rhash_params);
>> if (ret)
>> return ret;
>>
>> + ret = nfsd4_create_callback_queue();
>> + if (ret) {
>> + rhltable_destroy(&nfs4_file_rhltable);
>> + return ret;
>> + }
>> +
>> set_max_delegations();
>> return 0;
>> }
>> @@ -7939,6 +8015,7 @@ nfs4_state_shutdown_net(struct net *net)
>>
>> nfsd4_client_tracking_exit(net);
>> nfs4_state_destroy_net(net);
>> + rhltable_destroy(&nfs4_file_rhltable);
>> #ifdef CONFIG_NFSD_V4_2_INTER_SSC
>> nfsd4_ssc_shutdown_umount(nn);
>> #endif
>> diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h
>> index ae596dbf8667..ad20467c9dee 100644
>> --- a/fs/nfsd/state.h
>> +++ b/fs/nfsd/state.h
>> @@ -536,16 +536,13 @@ struct nfs4_clnt_odstate {
>> * inode can have multiple filehandles associated with it, so there is
>> * (potentially) a many to one relationship between this struct and struct
>> * inode.
>> - *
>> - * These are hashed by filehandle in the file_hashtbl, which is protected by
>> - * the global state_lock spinlock.
>> */
>> struct nfs4_file {
>> refcount_t fi_ref;
>> struct inode * fi_inode;
>> bool fi_aliased;
>> spinlock_t fi_lock;
>> - struct hlist_node fi_hash; /* hash on fi_fhandle */
>> + struct rhlist_head fi_rhash;
>> struct list_head fi_stateids;
>> union {
>> struct list_head fi_delegations;
>>
>>
>
> --
> Jeff Layton <[email protected]>

--
Chuck Lever



2022-10-19 14:52:19

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH v4 3/7] NFSD: Add an NFSD_FILE_GC flag to enable nfsd_file garbage collection



> On Oct 19, 2022, at 7:24 AM, Jeff Layton <[email protected]> wrote:
>
> On Tue, 2022-10-18 at 15:58 -0400, Chuck Lever wrote:
>> NFSv4 operations manage the lifetime of nfsd_file items they use by
>> means of NFSv4 OPEN and CLOSE. Hence there's no need for them to be
>> garbage collected.
>>
>> Introduce a mechanism to enable garbage collection for nfsd_file
>> items used only by NFSv2/3 callers.
>>
>> Note that the change in nfsd_file_put() ensures that both CLOSE and
>> DELEGRETURN will actually close out and free an nfsd_file on last
>> reference of a non-garbage-collected file.
>>
>> Link: https://bugzilla.linux-nfs.org/show_bug.cgi?id=394
>> Suggested-by: Trond Myklebust <[email protected]>
>> Tested-by: Jeff Layton <[email protected]>
>> Signed-off-by: Chuck Lever <[email protected]>
>> ---
>> fs/nfsd/filecache.c | 61 +++++++++++++++++++++++++++++++++++++++++++++------
>> fs/nfsd/filecache.h | 3 +++
>> fs/nfsd/nfs3proc.c | 4 ++-
>> fs/nfsd/trace.h | 3 ++-
>> fs/nfsd/vfs.c | 4 ++-
>> 5 files changed, 63 insertions(+), 12 deletions(-)
>>
>> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
>> index b7aa523c2010..87fce5c95726 100644
>> --- a/fs/nfsd/filecache.c
>> +++ b/fs/nfsd/filecache.c
>> @@ -63,6 +63,7 @@ struct nfsd_file_lookup_key {
>> struct net *net;
>> const struct cred *cred;
>> unsigned char need;
>> + unsigned char gc:1;
>
> Is it worth it to mess around with bitfields here? There are exising
> holes in this struct, so you're really not saving anything by using one
> here.
>
> Also, it seems like "need" is used as a bool anyway.

AFAICS, @need is a set of NFSD_MAY flags, not a bool.


> Maybe both of those
> fields should just be bools? It looks like changing that won't change
> the size of the struct anyway. You might even be able to shrink it by
> moving the "type" above "need".

I forgot to check this with pahole. I assumed the compiler would
squeeze these fields into a single 32-bit word, but in fact it
does not do that.

But, I don't think there's any arrangement of these fields that
avoids a hole. We can revisit if more fields are added to the
lookup key struct.


>> enum nfsd_file_lookup_type type;
>> };
>>
>> @@ -162,6 +163,8 @@ static int nfsd_file_obj_cmpfn(struct rhashtable_compare_arg *arg,
>> return 1;
>> if (!nfsd_match_cred(nf->nf_cred, key->cred))
>> return 1;
>> + if (test_bit(NFSD_FILE_GC, &nf->nf_flags) != key->gc)
>> + return 1;
>> if (test_bit(NFSD_FILE_HASHED, &nf->nf_flags) == 0)
>> return 1;
>> break;
>> @@ -297,6 +300,8 @@ nfsd_file_alloc(struct nfsd_file_lookup_key *key, unsigned int may)
>> nf->nf_flags = 0;
>> __set_bit(NFSD_FILE_HASHED, &nf->nf_flags);
>> __set_bit(NFSD_FILE_PENDING, &nf->nf_flags);
>> + if (key->gc)
>> + __set_bit(NFSD_FILE_GC, &nf->nf_flags);
>> nf->nf_inode = key->inode;
>> /* nf_ref is pre-incremented for hash table */
>> refcount_set(&nf->nf_ref, 2);
>> @@ -428,16 +433,27 @@ nfsd_file_put_noref(struct nfsd_file *nf)
>> }
>> }
>>
>> +static void
>> +nfsd_file_unhash_and_put(struct nfsd_file *nf)
>> +{
>> + if (nfsd_file_unhash(nf))
>> + nfsd_file_put_noref(nf);
>> +}
>> +
>> void
>> nfsd_file_put(struct nfsd_file *nf)
>> {
>> might_sleep();
>>
>> - nfsd_file_lru_add(nf);
>> + if (test_bit(NFSD_FILE_GC, &nf->nf_flags) == 1)
>> + nfsd_file_lru_add(nf);
>> + else if (refcount_read(&nf->nf_ref) == 2)
>> + nfsd_file_unhash_and_put(nf);
>> +
>> if (test_bit(NFSD_FILE_HASHED, &nf->nf_flags) == 0) {
>> nfsd_file_flush(nf);
>> nfsd_file_put_noref(nf);
>> - } else if (nf->nf_file) {
>> + } else if (nf->nf_file && test_bit(NFSD_FILE_GC, &nf->nf_flags) == 1) {
>> nfsd_file_put_noref(nf);
>> nfsd_file_schedule_laundrette();
>> } else
>> @@ -1017,12 +1033,14 @@ nfsd_file_is_cached(struct inode *inode)
>>
>> static __be32
>> nfsd_file_do_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
>> - unsigned int may_flags, struct nfsd_file **pnf, bool open)
>> + unsigned int may_flags, struct nfsd_file **pnf,
>> + bool open, int want_gc)
>> {
>> struct nfsd_file_lookup_key key = {
>> .type = NFSD_FILE_KEY_FULL,
>> .need = may_flags & NFSD_FILE_MAY_MASK,
>> .net = SVC_NET(rqstp),
>> + .gc = want_gc,
>> };
>> bool open_retry = true;
>> struct nfsd_file *nf;
>> @@ -1117,14 +1135,35 @@ nfsd_file_do_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
>> * then unhash.
>> */
>> if (status != nfs_ok || key.inode->i_nlink == 0)
>> - if (nfsd_file_unhash(nf))
>> - nfsd_file_put_noref(nf);
>> + nfsd_file_unhash_and_put(nf);
>> clear_bit_unlock(NFSD_FILE_PENDING, &nf->nf_flags);
>> smp_mb__after_atomic();
>> wake_up_bit(&nf->nf_flags, NFSD_FILE_PENDING);
>> goto out;
>> }
>>
>> +/**
>> + * nfsd_file_acquire_gc - Get a struct nfsd_file with an open file
>> + * @rqstp: the RPC transaction being executed
>> + * @fhp: the NFS filehandle of the file to be opened
>> + * @may_flags: NFSD_MAY_ settings for the file
>> + * @pnf: OUT: new or found "struct nfsd_file" object
>> + *
>> + * The nfsd_file object returned by this API is reference-counted
>> + * and garbage-collected. The object is retained for a few
>> + * seconds after the final nfsd_file_put() in case the caller
>> + * wants to re-use it.
>> + *
>> + * Returns nfs_ok and sets @pnf on success; otherwise an nfsstat in
>> + * network byte order is returned.
>> + */
>> +__be32
>> +nfsd_file_acquire_gc(struct svc_rqst *rqstp, struct svc_fh *fhp,
>> + unsigned int may_flags, struct nfsd_file **pnf)
>> +{
>> + return nfsd_file_do_acquire(rqstp, fhp, may_flags, pnf, true, 1);
>> +}
>> +
>> /**
>> * nfsd_file_acquire - Get a struct nfsd_file with an open file
>> * @rqstp: the RPC transaction being executed
>> @@ -1132,6 +1171,10 @@ nfsd_file_do_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
>> * @may_flags: NFSD_MAY_ settings for the file
>> * @pnf: OUT: new or found "struct nfsd_file" object
>> *
>> + * The nfsd_file_object returned by this API is reference-counted
>> + * but not garbage-collected. The object is unhashed after the
>> + * final nfsd_file_put().
>> + *
>> * Returns nfs_ok and sets @pnf on success; otherwise an nfsstat in
>> * network byte order is returned.
>> */
>> @@ -1139,7 +1182,7 @@ __be32
>> nfsd_file_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
>> unsigned int may_flags, struct nfsd_file **pnf)
>> {
>> - return nfsd_file_do_acquire(rqstp, fhp, may_flags, pnf, true);
>> + return nfsd_file_do_acquire(rqstp, fhp, may_flags, pnf, true, 0);
>> }
>>
>> /**
>> @@ -1149,6 +1192,10 @@ nfsd_file_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
>> * @may_flags: NFSD_MAY_ settings for the file
>> * @pnf: OUT: new or found "struct nfsd_file" object
>> *
>> + * The nfsd_file_object returned by this API is reference-counted
>> + * but not garbage-collected. The object is released immediately
>> + * one RCU grace period after the final nfsd_file_put().
>> + *
>> * Returns nfs_ok and sets @pnf on success; otherwise an nfsstat in
>> * network byte order is returned.
>> */
>> @@ -1156,7 +1203,7 @@ __be32
>> nfsd_file_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
>> unsigned int may_flags, struct nfsd_file **pnf)
>> {
>> - return nfsd_file_do_acquire(rqstp, fhp, may_flags, pnf, false);
>> + return nfsd_file_do_acquire(rqstp, fhp, may_flags, pnf, false, 0);
>> }
>>
>> /*
>> diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h
>> index f81c198f4ed6..0f6546bcd3e0 100644
>> --- a/fs/nfsd/filecache.h
>> +++ b/fs/nfsd/filecache.h
>> @@ -38,6 +38,7 @@ struct nfsd_file {
>> #define NFSD_FILE_HASHED (0)
>> #define NFSD_FILE_PENDING (1)
>> #define NFSD_FILE_REFERENCED (2)
>> +#define NFSD_FILE_GC (3)
>> unsigned long nf_flags;
>> struct inode *nf_inode; /* don't deref */
>> refcount_t nf_ref;
>> @@ -55,6 +56,8 @@ void nfsd_file_put(struct nfsd_file *nf);
>> struct nfsd_file *nfsd_file_get(struct nfsd_file *nf);
>> void nfsd_file_close_inode_sync(struct inode *inode);
>> bool nfsd_file_is_cached(struct inode *inode);
>> +__be32 nfsd_file_acquire_gc(struct svc_rqst *rqstp, struct svc_fh *fhp,
>> + unsigned int may_flags, struct nfsd_file **nfp);
>> __be32 nfsd_file_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
>> unsigned int may_flags, struct nfsd_file **nfp);
>> __be32 nfsd_file_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
>> diff --git a/fs/nfsd/nfs3proc.c b/fs/nfsd/nfs3proc.c
>> index d12823371857..6a5ad6c91d8e 100644
>> --- a/fs/nfsd/nfs3proc.c
>> +++ b/fs/nfsd/nfs3proc.c
>> @@ -779,8 +779,8 @@ nfsd3_proc_commit(struct svc_rqst *rqstp)
>> (unsigned long long) argp->offset);
>>
>> fh_copy(&resp->fh, &argp->fh);
>> - resp->status = nfsd_file_acquire(rqstp, &resp->fh, NFSD_MAY_WRITE |
>> - NFSD_MAY_NOT_BREAK_LEASE, &nf);
>> + resp->status = nfsd_file_acquire_gc(rqstp, &resp->fh, NFSD_MAY_WRITE |
>> + NFSD_MAY_NOT_BREAK_LEASE, &nf);
>> if (resp->status)
>> goto out;
>> resp->status = nfsd_commit(rqstp, &resp->fh, nf, argp->offset,
>> diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
>> index 9ebd67d461f9..4921144880d3 100644
>> --- a/fs/nfsd/trace.h
>> +++ b/fs/nfsd/trace.h
>> @@ -742,7 +742,8 @@ DEFINE_CLID_EVENT(confirmed_r);
>> __print_flags(val, "|", \
>> { 1 << NFSD_FILE_HASHED, "HASHED" }, \
>> { 1 << NFSD_FILE_PENDING, "PENDING" }, \
>> - { 1 << NFSD_FILE_REFERENCED, "REFERENCED"})
>> + { 1 << NFSD_FILE_REFERENCED, "REFERENCED"}, \
>> + { 1 << NFSD_FILE_GC, "GC"})
>>
>> DECLARE_EVENT_CLASS(nfsd_file_class,
>> TP_PROTO(struct nfsd_file *nf),
>> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
>> index 44f210ba17cc..89d682a56fc4 100644
>> --- a/fs/nfsd/vfs.c
>> +++ b/fs/nfsd/vfs.c
>> @@ -1060,7 +1060,7 @@ __be32 nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
>> __be32 err;
>>
>> trace_nfsd_read_start(rqstp, fhp, offset, *count);
>> - err = nfsd_file_acquire(rqstp, fhp, NFSD_MAY_READ, &nf);
>> + err = nfsd_file_acquire_gc(rqstp, fhp, NFSD_MAY_READ, &nf);
>> if (err)
>> return err;
>>
>> @@ -1092,7 +1092,7 @@ nfsd_write(struct svc_rqst *rqstp, struct svc_fh *fhp, loff_t offset,
>>
>> trace_nfsd_write_start(rqstp, fhp, offset, *cnt);
>>
>> - err = nfsd_file_acquire(rqstp, fhp, NFSD_MAY_WRITE, &nf);
>> + err = nfsd_file_acquire_gc(rqstp, fhp, NFSD_MAY_WRITE, &nf);
>> if (err)
>> goto out;
>>
>>
>>
>
> Patch itself looks fine though. You can add:
>
> Reviewed-by: Jeff Layton <[email protected]>
>
> ...but I'd probably prefer to see the flags in that struct as bools
> instead if you're amenable to the change.

--
Chuck Lever



2022-10-24 02:29:23

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH v4 5/7] NFSD: Use rhashtable for managing nfs4_file objects

On Thu, 20 Oct 2022, Chuck Lever III wrote:
>
> > On Oct 19, 2022, at 7:39 AM, Jeff Layton <[email protected]> wrote:
> >
> >> - fp = find_or_add_file(open->op_file, current_fh);
> >> + rcu_read_lock();
> >> + fp = insert_nfs4_file(open->op_file, current_fh);
> >> + rcu_read_unlock();
> >
> > It'd probably be better to push this rcu_read_lock down into
> > insert_nfs4_file. You don't need to hold it over the actual insertion,
> > since that requires the state_lock.
>
> I used this arrangement because:
>
> insert_nfs4_file() invokes only find_nfs4_file() and the
> insert_file() helper. Both find_nfs4_file() and the
> insert_file() helper invoke rhltable_lookup(), which
> must be called with the RCU read lock held.
>
> And this is the reason why put_nfs4_file() no longer takes
> the state_lock: it would take the state_lock first and
> then the RCU read lock (which is implicitly taken in
> rhltable_remove()), which results in a lock inversion
> relative to insert_nfs4_file(), which takes the RCU read
> lock first, then the state_lock.

It doesn't make any sense to talk about lock inversion with
rcu_read_lock(). It isn't really a lock in any traditional sense in
that it can never block (which is what cause lock-inversion problems).
I prefer to think for rcu_read_lock() as taking a reference on some
global state.

>
>
> I'm certainly not an expert, so I'm willing to listen to
> alternative approaches. Can we rely on only the RCU read
> lock for exclusion on hash insertion?

Probably we can. I'll read through all the patches now and provide some
review.

NeilBrown

2022-10-24 02:38:01

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH v4 3/7] NFSD: Add an NFSD_FILE_GC flag to enable nfsd_file garbage collection

On Wed, 19 Oct 2022, Chuck Lever wrote:
> NFSv4 operations manage the lifetime of nfsd_file items they use by
> means of NFSv4 OPEN and CLOSE. Hence there's no need for them to be
> garbage collected.
>
> Introduce a mechanism to enable garbage collection for nfsd_file
> items used only by NFSv2/3 callers.
>
> Note that the change in nfsd_file_put() ensures that both CLOSE and
> DELEGRETURN will actually close out and free an nfsd_file on last
> reference of a non-garbage-collected file.
>
> Link: https://bugzilla.linux-nfs.org/show_bug.cgi?id=394
> Suggested-by: Trond Myklebust <[email protected]>
> Tested-by: Jeff Layton <[email protected]>
> Signed-off-by: Chuck Lever <[email protected]>
> ---
> fs/nfsd/filecache.c | 61 +++++++++++++++++++++++++++++++++++++++++++++------
> fs/nfsd/filecache.h | 3 +++
> fs/nfsd/nfs3proc.c | 4 ++-
> fs/nfsd/trace.h | 3 ++-
> fs/nfsd/vfs.c | 4 ++-
> 5 files changed, 63 insertions(+), 12 deletions(-)
>
> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> index b7aa523c2010..87fce5c95726 100644
> --- a/fs/nfsd/filecache.c
> +++ b/fs/nfsd/filecache.c
> @@ -63,6 +63,7 @@ struct nfsd_file_lookup_key {
> struct net *net;
> const struct cred *cred;
> unsigned char need;
> + unsigned char gc:1;
> enum nfsd_file_lookup_type type;
> };
>
> @@ -162,6 +163,8 @@ static int nfsd_file_obj_cmpfn(struct rhashtable_compare_arg *arg,
> return 1;
> if (!nfsd_match_cred(nf->nf_cred, key->cred))
> return 1;
> + if (test_bit(NFSD_FILE_GC, &nf->nf_flags) != key->gc)
> + return 1;
> if (test_bit(NFSD_FILE_HASHED, &nf->nf_flags) == 0)
> return 1;
> break;
> @@ -297,6 +300,8 @@ nfsd_file_alloc(struct nfsd_file_lookup_key *key, unsigned int may)
> nf->nf_flags = 0;
> __set_bit(NFSD_FILE_HASHED, &nf->nf_flags);
> __set_bit(NFSD_FILE_PENDING, &nf->nf_flags);
> + if (key->gc)
> + __set_bit(NFSD_FILE_GC, &nf->nf_flags);
> nf->nf_inode = key->inode;
> /* nf_ref is pre-incremented for hash table */
> refcount_set(&nf->nf_ref, 2);
> @@ -428,16 +433,27 @@ nfsd_file_put_noref(struct nfsd_file *nf)
> }
> }
>
> +static void
> +nfsd_file_unhash_and_put(struct nfsd_file *nf)
> +{
> + if (nfsd_file_unhash(nf))
> + nfsd_file_put_noref(nf);
> +}
> +
> void
> nfsd_file_put(struct nfsd_file *nf)
> {
> might_sleep();
>
> - nfsd_file_lru_add(nf);
> + if (test_bit(NFSD_FILE_GC, &nf->nf_flags) == 1)

Clearly this is a style choice on which sensible people might disagree,
but I much prefer to leave out the "== 1" That is what most callers in
fs/nfsd/ do - only exceptions are here in filecache.c.
Even "!= 0" would be better than "== 1".
I think test_bit() is declared as a bool, but it is hard to be certain.

> + nfsd_file_lru_add(nf);
> + else if (refcount_read(&nf->nf_ref) == 2)
> + nfsd_file_unhash_and_put(nf);

Tests on the value of a refcount are almost always racy.
I suspect there is an implication that as NFSD_FILE_GC is not set, this
*must* be hashed which implies there is guaranteed to be a refcount from
the hashtable. So this is really a test to see if the pre-biased
refcount is one. The safe way to test if a refcount is 1 is dec_and_test.

i.e. linkage from the hash table should not count as a reference (in the
not-GC case). Lookup in the hash table should fail if the found entry
cannot achieve an inc_not_zero. When dec_and_test says the refcount is
zero, we remove from the hash table (certain that no new references will
be taken).


> +
> if (test_bit(NFSD_FILE_HASHED, &nf->nf_flags) == 0) {
> nfsd_file_flush(nf);
> nfsd_file_put_noref(nf);

This seems weird. If the file was unhashed above (because nf_ref was
2), it would now not be flushed. Why don't we want it to be flushed in
that case?

> - } else if (nf->nf_file) {
> + } else if (nf->nf_file && test_bit(NFSD_FILE_GC, &nf->nf_flags) == 1) {
> nfsd_file_put_noref(nf);
> nfsd_file_schedule_laundrette();
> } else
> @@ -1017,12 +1033,14 @@ nfsd_file_is_cached(struct inode *inode)
>
> static __be32
> nfsd_file_do_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
> - unsigned int may_flags, struct nfsd_file **pnf, bool open)
> + unsigned int may_flags, struct nfsd_file **pnf,
> + bool open, int want_gc)

I too would prefer "bool" for all intstance of gc and want_gc.

NeilBrown

2022-10-24 03:35:14

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH v4 5/7] NFSD: Use rhashtable for managing nfs4_file objects

On Wed, 19 Oct 2022, Chuck Lever wrote:
> fh_match() is costly, especially when filehandles are large (as is
> the case for NFSv4). It needs to be used sparingly when searching
> data structures. Unfortunately, with common workloads, I see
> multiple thousands of objects stored in file_hashtbl[], which always
> has only 256 buckets, which makes the hash chains quite lengthy.
>
> Walking long hash chains with the state_lock held blocks other
> activity that needs that lock.

Using a bit-spin-lock for each hash chain would be a simple fix if this
was the only concern. See for example fscache_hash_cookie().
I'm not at all against using rhltables, but thought I would mention that
there are other options.

>
> To help mitigate the cost of searching with fh_match(), replace the
> nfs4_file hash table with an rhashtable, which can dynamically
> resize its bucket array to minimize hash chain length. The ideal for
> this use case is one bucket per inode.
>
> The result is an improvement in the latency of NFSv4 operations
> and the reduction of nfsd CPU utilization due to the cost of
> fh_match() and the CPU cache misses incurred while walking long
> hash chains in the nfs4_file hash table.
>
> Note that hash removal is no longer protected by the state_lock.
> This is because insert_nfs4_file() takes the RCU read lock then the
> state_lock. rhltable_remove() takes the RCU read lock internally;
> if remove_nfs4_file() took the state_lock before calling
> rhltable_remove(), that would result in a lock inversion.

As mentioned separately, lock inversion is not relevant for
rcu_read_lock().
Also, I cannot see any need for state_lock to be used for protecting
additions to, or removals from, the rhash table.

>
> Signed-off-by: Chuck Lever <[email protected]>
> ---
> fs/nfsd/nfs4state.c | 215 +++++++++++++++++++++++++++++++++++----------------
> fs/nfsd/state.h | 5 -
> 2 files changed, 147 insertions(+), 73 deletions(-)
>
> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> index 2b850de288cf..a63334ad61f6 100644
> --- a/fs/nfsd/nfs4state.c
> +++ b/fs/nfsd/nfs4state.c
> @@ -44,7 +44,9 @@
> #include <linux/jhash.h>
> #include <linux/string_helpers.h>
> #include <linux/fsnotify.h>
> +#include <linux/rhashtable.h>
> #include <linux/nfs_ssc.h>
> +
> #include "xdr4.h"
> #include "xdr4cb.h"
> #include "vfs.h"
> @@ -84,6 +86,7 @@ static bool check_for_locks(struct nfs4_file *fp, struct nfs4_lockowner *lowner)
> static void nfs4_free_ol_stateid(struct nfs4_stid *stid);
> void nfsd4_end_grace(struct nfsd_net *nn);
> static void _free_cpntf_state_locked(struct nfsd_net *nn, struct nfs4_cpntf_state *cps);
> +static void remove_nfs4_file(struct nfs4_file *fi);
>
> /* Locking: */
>
> @@ -577,11 +580,8 @@ static void nfsd4_free_file_rcu(struct rcu_head *rcu)
> void
> put_nfs4_file(struct nfs4_file *fi)
> {
> - might_lock(&state_lock);
> -
> - if (refcount_dec_and_lock(&fi->fi_ref, &state_lock)) {
> - hlist_del_rcu(&fi->fi_hash);
> - spin_unlock(&state_lock);
> + if (refcount_dec_and_test(&fi->fi_ref)) {
> + remove_nfs4_file(fi);
> WARN_ON_ONCE(!list_empty(&fi->fi_clnt_odstate));
> WARN_ON_ONCE(!list_empty(&fi->fi_delegations));
> call_rcu(&fi->fi_rcu, nfsd4_free_file_rcu);
> @@ -695,19 +695,82 @@ static unsigned int ownerstr_hashval(struct xdr_netobj *ownername)
> return ret & OWNER_HASH_MASK;
> }
>
> -/* hash table for nfs4_file */
> -#define FILE_HASH_BITS 8
> -#define FILE_HASH_SIZE (1 << FILE_HASH_BITS)
> +static struct rhltable nfs4_file_rhltable ____cacheline_aligned_in_smp;
> +
> +/*
> + * The returned hash value is based solely on the address of an in-code
> + * inode, a pointer to a slab-allocated object. The entropy in such a
> + * pointer is concentrated in its middle bits.
> + */
> +static u32 nfs4_file_inode_hash(const struct inode *inode, u32 seed)
> +{
> + u32 k;
> +
> + k = ((unsigned long)inode) >> L1_CACHE_SHIFT;
> + return jhash2(&k, 1, seed);

I still don't think this makes any sense at all.

return jhash2(&inode, sizeof(inode)/4, seed);

uses all of the entropy, which is best for rhashtables.

> +}
>
> -static unsigned int file_hashval(struct svc_fh *fh)
> +/**
> + * nfs4_file_key_hashfn - Compute the hash value of a lookup key
> + * @data: key on which to compute the hash value
> + * @len: rhash table's key_len parameter (unused)
> + * @seed: rhash table's random seed of the day
> + *
> + * Return value:
> + * Computed 32-bit hash value
> + */
> +static u32 nfs4_file_key_hashfn(const void *data, u32 len, u32 seed)
> {
> - struct inode *inode = d_inode(fh->fh_dentry);
> + const struct svc_fh *fhp = data;
>
> - /* XXX: why not (here & in file cache) use inode? */
> - return (unsigned int)hash_long(inode->i_ino, FILE_HASH_BITS);
> + return nfs4_file_inode_hash(d_inode(fhp->fh_dentry), seed);
> }
>
> -static struct hlist_head file_hashtbl[FILE_HASH_SIZE];
> +/**
> + * nfs4_file_obj_hashfn - Compute the hash value of an nfs4_file object
> + * @data: object on which to compute the hash value
> + * @len: rhash table's key_len parameter (unused)
> + * @seed: rhash table's random seed of the day
> + *
> + * Return value:
> + * Computed 32-bit hash value
> + */
> +static u32 nfs4_file_obj_hashfn(const void *data, u32 len, u32 seed)
> +{
> + const struct nfs4_file *fi = data;
> +
> + return nfs4_file_inode_hash(fi->fi_inode, seed);
> +}
> +
> +/**
> + * nfs4_file_obj_cmpfn - Match a cache item against search criteria
> + * @arg: search criteria
> + * @ptr: cache item to check
> + *
> + * Return values:
> + * %0 - Success; item matches search criteria
> + */
> +static int nfs4_file_obj_cmpfn(struct rhashtable_compare_arg *arg,
> + const void *ptr)
> +{
> + const struct svc_fh *fhp = arg->key;
> + const struct nfs4_file *fi = ptr;
> +
> + return fh_match(&fi->fi_fhandle, &fhp->fh_handle) ? 0 : 1;
> +}

This doesn't make sense. Maybe the subtleties of rhl-tables are a bit
obscure.

An rhltable is like an rhashtable except that insertion can never fail.
Multiple entries with the same key can co-exist in a linked list.
Lookup does not return an entry but returns a linked list of entries.

For the nfs4_file table this isn't really a good match (though I
previously thought it was). Insertion must be able to fail if an entry
with the same filehandle exists.

One approach that might work would be to take the inode->i_lock after a
lookup fails then repeat the lookup and if it fails again, do the
insert. This keeps the locking fine-grained but means we don't have to
depend on insert failing.

So the hash and cmp functions would only look at the inode pointer.
If lookup returns something, we would walk the list calling fh_match()
to see if we have the right entry - all under RCU and using
refcount_inc_not_zero() when we find the matching filehandle.

NeilBrown

2022-10-24 11:01:14

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH v4 5/7] NFSD: Use rhashtable for managing nfs4_file objects

On Mon, 2022-10-24 at 13:07 +1100, NeilBrown wrote:
> On Thu, 20 Oct 2022, Chuck Lever III wrote:
> >
> > > On Oct 19, 2022, at 7:39 AM, Jeff Layton <[email protected]> wrote:
> > >
> > > > - fp = find_or_add_file(open->op_file, current_fh);
> > > > + rcu_read_lock();
> > > > + fp = insert_nfs4_file(open->op_file, current_fh);
> > > > + rcu_read_unlock();
> > >
> > > It'd probably be better to push this rcu_read_lock down into
> > > insert_nfs4_file. You don't need to hold it over the actual insertion,
> > > since that requires the state_lock.
> >
> > I used this arrangement because:
> >
> > insert_nfs4_file() invokes only find_nfs4_file() and the
> > insert_file() helper. Both find_nfs4_file() and the
> > insert_file() helper invoke rhltable_lookup(), which
> > must be called with the RCU read lock held.
> >
> > And this is the reason why put_nfs4_file() no longer takes
> > the state_lock: it would take the state_lock first and
> > then the RCU read lock (which is implicitly taken in
> > rhltable_remove()), which results in a lock inversion
> > relative to insert_nfs4_file(), which takes the RCU read
> > lock first, then the state_lock.
>
> It doesn't make any sense to talk about lock inversion with
> rcu_read_lock(). It isn't really a lock in any traditional sense in
> that it can never block (which is what cause lock-inversion problems).
> I prefer to think for rcu_read_lock() as taking a reference on some
> global state.
>

Right. To be clear, you can deadlock with synchronize_rcu if you use it
improperly, but the rcu_read_lock itself never blocks.

> >
> >
> > I'm certainly not an expert, so I'm willing to listen to
> > alternative approaches. Can we rely on only the RCU read
> > lock for exclusion on hash insertion?
>
> Probably we can. I'll read through all the patches now and provide some
> review.
>

The rcu_read_lock provides _no_ exclusion whatsoever, so it's not
usually suitable for things that need exclusive access (like a hash
insertion). If each rhl hash chain has its own lock though, then we may
not require other locking to serialize insertions.

--
Jeff Layton <[email protected]>

2022-10-24 18:34:09

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH v4 3/7] NFSD: Add an NFSD_FILE_GC flag to enable nfsd_file garbage collection

On Mon, 2022-10-24 at 13:33 +1100, NeilBrown wrote:
> On Wed, 19 Oct 2022, Chuck Lever wrote:
> > NFSv4 operations manage the lifetime of nfsd_file items they use by
> > means of NFSv4 OPEN and CLOSE. Hence there's no need for them to be
> > garbage collected.
> >
> > Introduce a mechanism to enable garbage collection for nfsd_file
> > items used only by NFSv2/3 callers.
> >
> > Note that the change in nfsd_file_put() ensures that both CLOSE and
> > DELEGRETURN will actually close out and free an nfsd_file on last
> > reference of a non-garbage-collected file.
> >
> > Link: https://bugzilla.linux-nfs.org/show_bug.cgi?id=394
> > Suggested-by: Trond Myklebust <[email protected]>
> > Tested-by: Jeff Layton <[email protected]>
> > Signed-off-by: Chuck Lever <[email protected]>
> > ---
> > fs/nfsd/filecache.c | 61 +++++++++++++++++++++++++++++++++++++++++++++------
> > fs/nfsd/filecache.h | 3 +++
> > fs/nfsd/nfs3proc.c | 4 ++-
> > fs/nfsd/trace.h | 3 ++-
> > fs/nfsd/vfs.c | 4 ++-
> > 5 files changed, 63 insertions(+), 12 deletions(-)
> >
> > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> > index b7aa523c2010..87fce5c95726 100644
> > --- a/fs/nfsd/filecache.c
> > +++ b/fs/nfsd/filecache.c
> > @@ -63,6 +63,7 @@ struct nfsd_file_lookup_key {
> > struct net *net;
> > const struct cred *cred;
> > unsigned char need;
> > + unsigned char gc:1;
> > enum nfsd_file_lookup_type type;
> > };
> >
> > @@ -162,6 +163,8 @@ static int nfsd_file_obj_cmpfn(struct rhashtable_compare_arg *arg,
> > return 1;
> > if (!nfsd_match_cred(nf->nf_cred, key->cred))
> > return 1;
> > + if (test_bit(NFSD_FILE_GC, &nf->nf_flags) != key->gc)
> > + return 1;
> > if (test_bit(NFSD_FILE_HASHED, &nf->nf_flags) == 0)
> > return 1;
> > break;
> > @@ -297,6 +300,8 @@ nfsd_file_alloc(struct nfsd_file_lookup_key *key, unsigned int may)
> > nf->nf_flags = 0;
> > __set_bit(NFSD_FILE_HASHED, &nf->nf_flags);
> > __set_bit(NFSD_FILE_PENDING, &nf->nf_flags);
> > + if (key->gc)
> > + __set_bit(NFSD_FILE_GC, &nf->nf_flags);
> > nf->nf_inode = key->inode;
> > /* nf_ref is pre-incremented for hash table */
> > refcount_set(&nf->nf_ref, 2);
> > @@ -428,16 +433,27 @@ nfsd_file_put_noref(struct nfsd_file *nf)
> > }
> > }
> >
> > +static void
> > +nfsd_file_unhash_and_put(struct nfsd_file *nf)
> > +{
> > + if (nfsd_file_unhash(nf))
> > + nfsd_file_put_noref(nf);
> > +}
> > +
> > void
> > nfsd_file_put(struct nfsd_file *nf)
> > {
> > might_sleep();
> >
> > - nfsd_file_lru_add(nf);
> > + if (test_bit(NFSD_FILE_GC, &nf->nf_flags) == 1)
>
> Clearly this is a style choice on which sensible people might disagree,
> but I much prefer to leave out the "== 1" That is what most callers in
> fs/nfsd/ do - only exceptions are here in filecache.c.
> Even "!= 0" would be better than "== 1".
> I think test_bit() is declared as a bool, but it is hard to be certain.
>
> > + nfsd_file_lru_add(nf);
> > + else if (refcount_read(&nf->nf_ref) == 2)
> > + nfsd_file_unhash_and_put(nf);
>
> Tests on the value of a refcount are almost always racy.

Agreed, and there's a clear race above, now that I look more closely. If
nf_ref is 3 and two puts are racing then neither of them will call
nfsd_file_unhash_and_put. We really should be letting the outcome of the
decrement drive things (like you say below).

> I suspect there is an implication that as NFSD_FILE_GC is not set, this
> *must* be hashed which implies there is guaranteed to be a refcount from
> the hashtable. So this is really a test to see if the pre-biased
> refcount is one. The safe way to test if a refcount is 1 is dec_and_test.
>
> i.e. linkage from the hash table should not count as a reference (in the
> not-GC case). Lookup in the hash table should fail if the found entry
> cannot achieve an inc_not_zero. When dec_and_test says the refcount is
> zero, we remove from the hash table (certain that no new references will
> be taken).
>

This does seem a more sensible approach. That would go a long way toward
simplifying nfsd_file_put.

>
> > +
> > if (test_bit(NFSD_FILE_HASHED, &nf->nf_flags) == 0) {
> > nfsd_file_flush(nf);
> > nfsd_file_put_noref(nf);
>
> This seems weird. If the file was unhashed above (because nf_ref was
> 2), it would now not be flushed. Why don't we want it to be flushed in
> that case?
>
> > - } else if (nf->nf_file) {
> > + } else if (nf->nf_file && test_bit(NFSD_FILE_GC, &nf->nf_flags) == 1) {
> > nfsd_file_put_noref(nf);
> > nfsd_file_schedule_laundrette();
> > } else
> > @@ -1017,12 +1033,14 @@ nfsd_file_is_cached(struct inode *inode)
> >
> > static __be32
> > nfsd_file_do_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
> > - unsigned int may_flags, struct nfsd_file **pnf, bool open)
> > + unsigned int may_flags, struct nfsd_file **pnf,
> > + bool open, int want_gc)
>
> I too would prefer "bool" for all intstance of gc and want_gc.
>
> NeilBrown

--
Jeff Layton <[email protected]>

2022-10-24 19:09:06

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH v4 5/7] NFSD: Use rhashtable for managing nfs4_file objects



> On Oct 24, 2022, at 6:57 AM, Jeff Layton <[email protected]> wrote:
>
> On Mon, 2022-10-24 at 13:07 +1100, NeilBrown wrote:
>> On Thu, 20 Oct 2022, Chuck Lever III wrote:
>>>
>>>> On Oct 19, 2022, at 7:39 AM, Jeff Layton <[email protected]> wrote:
>>>>
>>>>> - fp = find_or_add_file(open->op_file, current_fh);
>>>>> + rcu_read_lock();
>>>>> + fp = insert_nfs4_file(open->op_file, current_fh);
>>>>> + rcu_read_unlock();
>>>>
>>>> It'd probably be better to push this rcu_read_lock down into
>>>> insert_nfs4_file. You don't need to hold it over the actual insertion,
>>>> since that requires the state_lock.
>>>
>>> I used this arrangement because:
>>>
>>> insert_nfs4_file() invokes only find_nfs4_file() and the
>>> insert_file() helper. Both find_nfs4_file() and the
>>> insert_file() helper invoke rhltable_lookup(), which
>>> must be called with the RCU read lock held.
>>>
>>> And this is the reason why put_nfs4_file() no longer takes
>>> the state_lock: it would take the state_lock first and
>>> then the RCU read lock (which is implicitly taken in
>>> rhltable_remove()), which results in a lock inversion
>>> relative to insert_nfs4_file(), which takes the RCU read
>>> lock first, then the state_lock.
>>
>> It doesn't make any sense to talk about lock inversion with
>> rcu_read_lock(). It isn't really a lock in any traditional sense in
>> that it can never block (which is what cause lock-inversion problems).
>> I prefer to think for rcu_read_lock() as taking a reference on some
>> global state.
>>
>
> Right. To be clear, you can deadlock with synchronize_rcu if you use it
> improperly, but the rcu_read_lock itself never blocks.
>
>>>
>>>
>>> I'm certainly not an expert, so I'm willing to listen to
>>> alternative approaches. Can we rely on only the RCU read
>>> lock for exclusion on hash insertion?
>>
>> Probably we can. I'll read through all the patches now and provide some
>> review.
>>
>
> The rcu_read_lock provides _no_ exclusion whatsoever, so it's not
> usually suitable for things that need exclusive access (like a hash
> insertion). If each rhl hash chain has its own lock though, then we may
> not require other locking to serialize insertions.

rhash does have per-bucket serialization using bit-spin-locks,
as far as I can tell.


--
Chuck Lever



2022-10-24 20:18:55

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH v4 5/7] NFSD: Use rhashtable for managing nfs4_file objects



> On Oct 23, 2022, at 11:26 PM, NeilBrown <[email protected]> wrote:
>
> On Wed, 19 Oct 2022, Chuck Lever wrote:
>> fh_match() is costly, especially when filehandles are large (as is
>> the case for NFSv4). It needs to be used sparingly when searching
>> data structures. Unfortunately, with common workloads, I see
>> multiple thousands of objects stored in file_hashtbl[], which always
>> has only 256 buckets, which makes the hash chains quite lengthy.
>>
>> Walking long hash chains with the state_lock held blocks other
>> activity that needs that lock.
>
> Using a bit-spin-lock for each hash chain would be a simple fix if this
> was the only concern. See for example fscache_hash_cookie().
> I'm not at all against using rhltables, but thought I would mention that
> there are other options.

rhashtable uses bit-spin-locks, and as I said in the other thread,
without the state_lock this all seems to work without crashing,
lockdep or KASAN splat, or test workload failures.


>> To help mitigate the cost of searching with fh_match(), replace the
>> nfs4_file hash table with an rhashtable, which can dynamically
>> resize its bucket array to minimize hash chain length. The ideal for
>> this use case is one bucket per inode.
>>
>> The result is an improvement in the latency of NFSv4 operations
>> and the reduction of nfsd CPU utilization due to the cost of
>> fh_match() and the CPU cache misses incurred while walking long
>> hash chains in the nfs4_file hash table.
>>
>> Note that hash removal is no longer protected by the state_lock.
>> This is because insert_nfs4_file() takes the RCU read lock then the
>> state_lock. rhltable_remove() takes the RCU read lock internally;
>> if remove_nfs4_file() took the state_lock before calling
>> rhltable_remove(), that would result in a lock inversion.
>
> As mentioned separately, lock inversion is not relevant for
> rcu_read_lock().

Understood.


> Also, I cannot see any need for state_lock to be used for protecting
> additions to, or removals from, the rhash table.

Agreed.


>> Signed-off-by: Chuck Lever <[email protected]>
>> ---
>> fs/nfsd/nfs4state.c | 215 +++++++++++++++++++++++++++++++++++----------------
>> fs/nfsd/state.h | 5 -
>> 2 files changed, 147 insertions(+), 73 deletions(-)
>>
>> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
>> index 2b850de288cf..a63334ad61f6 100644
>> --- a/fs/nfsd/nfs4state.c
>> +++ b/fs/nfsd/nfs4state.c
>> @@ -44,7 +44,9 @@
>> #include <linux/jhash.h>
>> #include <linux/string_helpers.h>
>> #include <linux/fsnotify.h>
>> +#include <linux/rhashtable.h>
>> #include <linux/nfs_ssc.h>
>> +
>> #include "xdr4.h"
>> #include "xdr4cb.h"
>> #include "vfs.h"
>> @@ -84,6 +86,7 @@ static bool check_for_locks(struct nfs4_file *fp, struct nfs4_lockowner *lowner)
>> static void nfs4_free_ol_stateid(struct nfs4_stid *stid);
>> void nfsd4_end_grace(struct nfsd_net *nn);
>> static void _free_cpntf_state_locked(struct nfsd_net *nn, struct nfs4_cpntf_state *cps);
>> +static void remove_nfs4_file(struct nfs4_file *fi);
>>
>> /* Locking: */
>>
>> @@ -577,11 +580,8 @@ static void nfsd4_free_file_rcu(struct rcu_head *rcu)
>> void
>> put_nfs4_file(struct nfs4_file *fi)
>> {
>> - might_lock(&state_lock);
>> -
>> - if (refcount_dec_and_lock(&fi->fi_ref, &state_lock)) {
>> - hlist_del_rcu(&fi->fi_hash);
>> - spin_unlock(&state_lock);
>> + if (refcount_dec_and_test(&fi->fi_ref)) {
>> + remove_nfs4_file(fi);
>> WARN_ON_ONCE(!list_empty(&fi->fi_clnt_odstate));
>> WARN_ON_ONCE(!list_empty(&fi->fi_delegations));
>> call_rcu(&fi->fi_rcu, nfsd4_free_file_rcu);
>> @@ -695,19 +695,82 @@ static unsigned int ownerstr_hashval(struct xdr_netobj *ownername)
>> return ret & OWNER_HASH_MASK;
>> }
>>
>> -/* hash table for nfs4_file */
>> -#define FILE_HASH_BITS 8
>> -#define FILE_HASH_SIZE (1 << FILE_HASH_BITS)
>> +static struct rhltable nfs4_file_rhltable ____cacheline_aligned_in_smp;
>> +
>> +/*
>> + * The returned hash value is based solely on the address of an in-code
>> + * inode, a pointer to a slab-allocated object. The entropy in such a
>> + * pointer is concentrated in its middle bits.
>> + */
>> +static u32 nfs4_file_inode_hash(const struct inode *inode, u32 seed)
>> +{
>> + u32 k;
>> +
>> + k = ((unsigned long)inode) >> L1_CACHE_SHIFT;
>> + return jhash2(&k, 1, seed);
>
> I still don't think this makes any sense at all.
>
> return jhash2(&inode, sizeof(inode)/4, seed);
>
> uses all of the entropy, which is best for rhashtables.

I don't really disagree, but the L1_CACHE_SHIFT was added at
Al's behest. OK, I'll replace it.


>> +}
>>
>> -static unsigned int file_hashval(struct svc_fh *fh)
>> +/**
>> + * nfs4_file_key_hashfn - Compute the hash value of a lookup key
>> + * @data: key on which to compute the hash value
>> + * @len: rhash table's key_len parameter (unused)
>> + * @seed: rhash table's random seed of the day
>> + *
>> + * Return value:
>> + * Computed 32-bit hash value
>> + */
>> +static u32 nfs4_file_key_hashfn(const void *data, u32 len, u32 seed)
>> {
>> - struct inode *inode = d_inode(fh->fh_dentry);
>> + const struct svc_fh *fhp = data;
>>
>> - /* XXX: why not (here & in file cache) use inode? */
>> - return (unsigned int)hash_long(inode->i_ino, FILE_HASH_BITS);
>> + return nfs4_file_inode_hash(d_inode(fhp->fh_dentry), seed);
>> }
>>
>> -static struct hlist_head file_hashtbl[FILE_HASH_SIZE];
>> +/**
>> + * nfs4_file_obj_hashfn - Compute the hash value of an nfs4_file object
>> + * @data: object on which to compute the hash value
>> + * @len: rhash table's key_len parameter (unused)
>> + * @seed: rhash table's random seed of the day
>> + *
>> + * Return value:
>> + * Computed 32-bit hash value
>> + */
>> +static u32 nfs4_file_obj_hashfn(const void *data, u32 len, u32 seed)
>> +{
>> + const struct nfs4_file *fi = data;
>> +
>> + return nfs4_file_inode_hash(fi->fi_inode, seed);
>> +}
>> +
>> +/**
>> + * nfs4_file_obj_cmpfn - Match a cache item against search criteria
>> + * @arg: search criteria
>> + * @ptr: cache item to check
>> + *
>> + * Return values:
>> + * %0 - Success; item matches search criteria
>> + */
>> +static int nfs4_file_obj_cmpfn(struct rhashtable_compare_arg *arg,
>> + const void *ptr)
>> +{
>> + const struct svc_fh *fhp = arg->key;
>> + const struct nfs4_file *fi = ptr;
>> +
>> + return fh_match(&fi->fi_fhandle, &fhp->fh_handle) ? 0 : 1;
>> +}
>
> This doesn't make sense. Maybe the subtleties of rhl-tables are a bit
> obscure.
>
> An rhltable is like an rhashtable except that insertion can never fail.
> Multiple entries with the same key can co-exist in a linked list.
> Lookup does not return an entry but returns a linked list of entries.
>
> For the nfs4_file table this isn't really a good match (though I
> previously thought it was). Insertion must be able to fail if an entry
> with the same filehandle exists.

I think I've arrived at the same conclusion (about nfs4_file insertion
needing to fail for duplicate filehandles). That's more or less open-coded
in my current revision of this work rather than relying on the behavior
of rhltable_insert().


> One approach that might work would be to take the inode->i_lock after a
> lookup fails then repeat the lookup and if it fails again, do the
> insert. This keeps the locking fine-grained but means we don't have to
> depend on insert failing.
>
> So the hash and cmp functions would only look at the inode pointer.
> If lookup returns something, we would walk the list calling fh_match()
> to see if we have the right entry - all under RCU and using
> refcount_inc_not_zero() when we find the matching filehandle.

That's basically what's going on now. But I tried removing obj_cmpfn()
and the rhltable_init() failed outright, so it's still there like a
vestigial organ.

If you now believe rhltable is not a good fit, I can revert all of the
rhltable changes and go back to rhashtable quite easily.

Let me do that and post what I have.


--
Chuck Lever



2022-10-24 20:37:43

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH v4 5/7] NFSD: Use rhashtable for managing nfs4_file objects



> On Oct 23, 2022, at 10:07 PM, NeilBrown <[email protected]> wrote:
>
> On Thu, 20 Oct 2022, Chuck Lever III wrote:
>>
>>> On Oct 19, 2022, at 7:39 AM, Jeff Layton <[email protected]> wrote:
>>>
>>>> - fp = find_or_add_file(open->op_file, current_fh);
>>>> + rcu_read_lock();
>>>> + fp = insert_nfs4_file(open->op_file, current_fh);
>>>> + rcu_read_unlock();
>>>
>>> It'd probably be better to push this rcu_read_lock down into
>>> insert_nfs4_file. You don't need to hold it over the actual insertion,
>>> since that requires the state_lock.
>>
>> I used this arrangement because:
>>
>> insert_nfs4_file() invokes only find_nfs4_file() and the
>> insert_file() helper. Both find_nfs4_file() and the
>> insert_file() helper invoke rhltable_lookup(), which
>> must be called with the RCU read lock held.
>>
>> And this is the reason why put_nfs4_file() no longer takes
>> the state_lock: it would take the state_lock first and
>> then the RCU read lock (which is implicitly taken in
>> rhltable_remove()), which results in a lock inversion
>> relative to insert_nfs4_file(), which takes the RCU read
>> lock first, then the state_lock.
>
> It doesn't make any sense to talk about lock inversion with
> rcu_read_lock(). It isn't really a lock in any traditional sense in
> that it can never block (which is what cause lock-inversion problems).
> I prefer to think for rcu_read_lock() as taking a reference on some
> global state.
>
>>
>>
>> I'm certainly not an expert, so I'm willing to listen to
>> alternative approaches. Can we rely on only the RCU read
>> lock for exclusion on hash insertion?
>
> Probably we can. I'll read through all the patches now and provide some
> review.

I've been testing a version all weekend that removes the use of
state_lock. I haven't found any issues with it. I'll post what
I have in a moment.


--
Chuck Lever



2022-10-24 20:42:09

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH v4 3/7] NFSD: Add an NFSD_FILE_GC flag to enable nfsd_file garbage collection



> On Oct 24, 2022, at 12:57 PM, Jeff Layton <[email protected]> wrote:
>
> On Mon, 2022-10-24 at 13:33 +1100, NeilBrown wrote:
>> On Wed, 19 Oct 2022, Chuck Lever wrote:
>>> NFSv4 operations manage the lifetime of nfsd_file items they use by
>>> means of NFSv4 OPEN and CLOSE. Hence there's no need for them to be
>>> garbage collected.
>>>
>>> Introduce a mechanism to enable garbage collection for nfsd_file
>>> items used only by NFSv2/3 callers.
>>>
>>> Note that the change in nfsd_file_put() ensures that both CLOSE and
>>> DELEGRETURN will actually close out and free an nfsd_file on last
>>> reference of a non-garbage-collected file.
>>>
>>> Link: https://bugzilla.linux-nfs.org/show_bug.cgi?id=394
>>> Suggested-by: Trond Myklebust <[email protected]>
>>> Tested-by: Jeff Layton <[email protected]>
>>> Signed-off-by: Chuck Lever <[email protected]>
>>> ---
>>> fs/nfsd/filecache.c | 61 +++++++++++++++++++++++++++++++++++++++++++++------
>>> fs/nfsd/filecache.h | 3 +++
>>> fs/nfsd/nfs3proc.c | 4 ++-
>>> fs/nfsd/trace.h | 3 ++-
>>> fs/nfsd/vfs.c | 4 ++-
>>> 5 files changed, 63 insertions(+), 12 deletions(-)
>>>
>>> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
>>> index b7aa523c2010..87fce5c95726 100644
>>> --- a/fs/nfsd/filecache.c
>>> +++ b/fs/nfsd/filecache.c
>>> @@ -63,6 +63,7 @@ struct nfsd_file_lookup_key {
>>> struct net *net;
>>> const struct cred *cred;
>>> unsigned char need;
>>> + unsigned char gc:1;
>>> enum nfsd_file_lookup_type type;
>>> };
>>>
>>> @@ -162,6 +163,8 @@ static int nfsd_file_obj_cmpfn(struct rhashtable_compare_arg *arg,
>>> return 1;
>>> if (!nfsd_match_cred(nf->nf_cred, key->cred))
>>> return 1;
>>> + if (test_bit(NFSD_FILE_GC, &nf->nf_flags) != key->gc)
>>> + return 1;
>>> if (test_bit(NFSD_FILE_HASHED, &nf->nf_flags) == 0)
>>> return 1;
>>> break;
>>> @@ -297,6 +300,8 @@ nfsd_file_alloc(struct nfsd_file_lookup_key *key, unsigned int may)
>>> nf->nf_flags = 0;
>>> __set_bit(NFSD_FILE_HASHED, &nf->nf_flags);
>>> __set_bit(NFSD_FILE_PENDING, &nf->nf_flags);
>>> + if (key->gc)
>>> + __set_bit(NFSD_FILE_GC, &nf->nf_flags);
>>> nf->nf_inode = key->inode;
>>> /* nf_ref is pre-incremented for hash table */
>>> refcount_set(&nf->nf_ref, 2);
>>> @@ -428,16 +433,27 @@ nfsd_file_put_noref(struct nfsd_file *nf)
>>> }
>>> }
>>>
>>> +static void
>>> +nfsd_file_unhash_and_put(struct nfsd_file *nf)
>>> +{
>>> + if (nfsd_file_unhash(nf))
>>> + nfsd_file_put_noref(nf);
>>> +}
>>> +
>>> void
>>> nfsd_file_put(struct nfsd_file *nf)
>>> {
>>> might_sleep();
>>>
>>> - nfsd_file_lru_add(nf);
>>> + if (test_bit(NFSD_FILE_GC, &nf->nf_flags) == 1)
>>
>> Clearly this is a style choice on which sensible people might disagree,
>> but I much prefer to leave out the "== 1" That is what most callers in
>> fs/nfsd/ do - only exceptions are here in filecache.c.
>> Even "!= 0" would be better than "== 1".
>> I think test_bit() is declared as a bool, but it is hard to be certain.
>>
>>> + nfsd_file_lru_add(nf);
>>> + else if (refcount_read(&nf->nf_ref) == 2)
>>> + nfsd_file_unhash_and_put(nf);
>>
>> Tests on the value of a refcount are almost always racy.
>
> Agreed, and there's a clear race above, now that I look more closely. If
> nf_ref is 3 and two puts are racing then neither of them will call
> nfsd_file_unhash_and_put. We really should be letting the outcome of the
> decrement drive things (like you say below).
>
>> I suspect there is an implication that as NFSD_FILE_GC is not set, this
>> *must* be hashed which implies there is guaranteed to be a refcount from
>> the hashtable. So this is really a test to see if the pre-biased
>> refcount is one. The safe way to test if a refcount is 1 is dec_and_test.
>>
>> i.e. linkage from the hash table should not count as a reference (in the
>> not-GC case). Lookup in the hash table should fail if the found entry
>> cannot achieve an inc_not_zero. When dec_and_test says the refcount is
>> zero, we remove from the hash table (certain that no new references will
>> be taken).
>>
>
> This does seem a more sensible approach. That would go a long way toward
> simplifying nfsd_file_put.

So I cut-and-pasted the approach you used in the patch you sent a few
weeks ago. I don't object to replacing that... but I don't see exactly
where you guys are going with this.


>>> +
>>> if (test_bit(NFSD_FILE_HASHED, &nf->nf_flags) == 0) {
>>> nfsd_file_flush(nf);
>>> nfsd_file_put_noref(nf);
>>
>> This seems weird. If the file was unhashed above (because nf_ref was
>> 2), it would now not be flushed. Why don't we want it to be flushed in
>> that case?
>>
>>> - } else if (nf->nf_file) {
>>> + } else if (nf->nf_file && test_bit(NFSD_FILE_GC, &nf->nf_flags) == 1) {
>>> nfsd_file_put_noref(nf);
>>> nfsd_file_schedule_laundrette();
>>> } else
>>> @@ -1017,12 +1033,14 @@ nfsd_file_is_cached(struct inode *inode)
>>>
>>> static __be32
>>> nfsd_file_do_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
>>> - unsigned int may_flags, struct nfsd_file **pnf, bool open)
>>> + unsigned int may_flags, struct nfsd_file **pnf,
>>> + bool open, int want_gc)
>>
>> I too would prefer "bool" for all intstance of gc and want_gc.
>>
>> NeilBrown
>
> --
> Jeff Layton <[email protected]>

--
Chuck Lever



2022-10-24 22:31:51

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH v4 3/7] NFSD: Add an NFSD_FILE_GC flag to enable nfsd_file garbage collection



> On Oct 23, 2022, at 10:33 PM, NeilBrown <[email protected]> wrote:
>
> On Wed, 19 Oct 2022, Chuck Lever wrote:
>> NFSv4 operations manage the lifetime of nfsd_file items they use by
>> means of NFSv4 OPEN and CLOSE. Hence there's no need for them to be
>> garbage collected.
>>
>> Introduce a mechanism to enable garbage collection for nfsd_file
>> items used only by NFSv2/3 callers.
>>
>> Note that the change in nfsd_file_put() ensures that both CLOSE and
>> DELEGRETURN will actually close out and free an nfsd_file on last
>> reference of a non-garbage-collected file.
>>
>> Link: https://bugzilla.linux-nfs.org/show_bug.cgi?id=394
>> Suggested-by: Trond Myklebust <[email protected]>
>> Tested-by: Jeff Layton <[email protected]>
>> Signed-off-by: Chuck Lever <[email protected]>
>> ---
>> fs/nfsd/filecache.c | 61 +++++++++++++++++++++++++++++++++++++++++++++------
>> fs/nfsd/filecache.h | 3 +++
>> fs/nfsd/nfs3proc.c | 4 ++-
>> fs/nfsd/trace.h | 3 ++-
>> fs/nfsd/vfs.c | 4 ++-
>> 5 files changed, 63 insertions(+), 12 deletions(-)
>>
>> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
>> index b7aa523c2010..87fce5c95726 100644
>> --- a/fs/nfsd/filecache.c
>> +++ b/fs/nfsd/filecache.c
>> @@ -63,6 +63,7 @@ struct nfsd_file_lookup_key {
>> struct net *net;
>> const struct cred *cred;
>> unsigned char need;
>> + unsigned char gc:1;
>> enum nfsd_file_lookup_type type;
>> };
>>
>> @@ -162,6 +163,8 @@ static int nfsd_file_obj_cmpfn(struct rhashtable_compare_arg *arg,
>> return 1;
>> if (!nfsd_match_cred(nf->nf_cred, key->cred))
>> return 1;
>> + if (test_bit(NFSD_FILE_GC, &nf->nf_flags) != key->gc)
>> + return 1;
>> if (test_bit(NFSD_FILE_HASHED, &nf->nf_flags) == 0)
>> return 1;
>> break;
>> @@ -297,6 +300,8 @@ nfsd_file_alloc(struct nfsd_file_lookup_key *key, unsigned int may)
>> nf->nf_flags = 0;
>> __set_bit(NFSD_FILE_HASHED, &nf->nf_flags);
>> __set_bit(NFSD_FILE_PENDING, &nf->nf_flags);
>> + if (key->gc)
>> + __set_bit(NFSD_FILE_GC, &nf->nf_flags);
>> nf->nf_inode = key->inode;
>> /* nf_ref is pre-incremented for hash table */
>> refcount_set(&nf->nf_ref, 2);
>> @@ -428,16 +433,27 @@ nfsd_file_put_noref(struct nfsd_file *nf)
>> }
>> }
>>
>> +static void
>> +nfsd_file_unhash_and_put(struct nfsd_file *nf)
>> +{
>> + if (nfsd_file_unhash(nf))
>> + nfsd_file_put_noref(nf);
>> +}
>> +
>> void
>> nfsd_file_put(struct nfsd_file *nf)
>> {
>> might_sleep();
>>
>> - nfsd_file_lru_add(nf);
>> + if (test_bit(NFSD_FILE_GC, &nf->nf_flags) == 1)
>
> Clearly this is a style choice on which sensible people might disagree,
> but I much prefer to leave out the "== 1" That is what most callers in
> fs/nfsd/ do - only exceptions are here in filecache.c.
> Even "!= 0" would be better than "== 1".
> I think test_bit() is declared as a bool, but it is hard to be certain.

The definitions of test_bit() I've seen return "int" which is why
I wrote it this way.


>> @@ -1017,12 +1033,14 @@ nfsd_file_is_cached(struct inode *inode)
>>
>> static __be32
>> nfsd_file_do_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
>> - unsigned int may_flags, struct nfsd_file **pnf, bool open)
>> + unsigned int may_flags, struct nfsd_file **pnf,
>> + bool open, int want_gc)
>
> I too would prefer "bool" for all intstance of gc and want_gc.

OK. I think I've figured out a way to do that and still deal
safely with the test_bit() return type silliness.

v5 coming soon.


--
Chuck Lever



2022-10-24 23:16:10

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH v4 3/7] NFSD: Add an NFSD_FILE_GC flag to enable nfsd_file garbage collection

On Tue, 25 Oct 2022, Chuck Lever III wrote:
>
> > On Oct 23, 2022, at 10:33 PM, NeilBrown <[email protected]> wrote:
> >
> > On Wed, 19 Oct 2022, Chuck Lever wrote:
> >> NFSv4 operations manage the lifetime of nfsd_file items they use by
> >> means of NFSv4 OPEN and CLOSE. Hence there's no need for them to be
> >> garbage collected.
> >>
> >> Introduce a mechanism to enable garbage collection for nfsd_file
> >> items used only by NFSv2/3 callers.
> >>
> >> Note that the change in nfsd_file_put() ensures that both CLOSE and
> >> DELEGRETURN will actually close out and free an nfsd_file on last
> >> reference of a non-garbage-collected file.
> >>
> >> Link: https://bugzilla.linux-nfs.org/show_bug.cgi?id=394
> >> Suggested-by: Trond Myklebust <[email protected]>
> >> Tested-by: Jeff Layton <[email protected]>
> >> Signed-off-by: Chuck Lever <[email protected]>
> >> ---
> >> fs/nfsd/filecache.c | 61 +++++++++++++++++++++++++++++++++++++++++++++------
> >> fs/nfsd/filecache.h | 3 +++
> >> fs/nfsd/nfs3proc.c | 4 ++-
> >> fs/nfsd/trace.h | 3 ++-
> >> fs/nfsd/vfs.c | 4 ++-
> >> 5 files changed, 63 insertions(+), 12 deletions(-)
> >>
> >> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> >> index b7aa523c2010..87fce5c95726 100644
> >> --- a/fs/nfsd/filecache.c
> >> +++ b/fs/nfsd/filecache.c
> >> @@ -63,6 +63,7 @@ struct nfsd_file_lookup_key {
> >> struct net *net;
> >> const struct cred *cred;
> >> unsigned char need;
> >> + unsigned char gc:1;
> >> enum nfsd_file_lookup_type type;
> >> };
> >>
> >> @@ -162,6 +163,8 @@ static int nfsd_file_obj_cmpfn(struct rhashtable_compare_arg *arg,
> >> return 1;
> >> if (!nfsd_match_cred(nf->nf_cred, key->cred))
> >> return 1;
> >> + if (test_bit(NFSD_FILE_GC, &nf->nf_flags) != key->gc)
> >> + return 1;
> >> if (test_bit(NFSD_FILE_HASHED, &nf->nf_flags) == 0)
> >> return 1;
> >> break;
> >> @@ -297,6 +300,8 @@ nfsd_file_alloc(struct nfsd_file_lookup_key *key, unsigned int may)
> >> nf->nf_flags = 0;
> >> __set_bit(NFSD_FILE_HASHED, &nf->nf_flags);
> >> __set_bit(NFSD_FILE_PENDING, &nf->nf_flags);
> >> + if (key->gc)
> >> + __set_bit(NFSD_FILE_GC, &nf->nf_flags);
> >> nf->nf_inode = key->inode;
> >> /* nf_ref is pre-incremented for hash table */
> >> refcount_set(&nf->nf_ref, 2);
> >> @@ -428,16 +433,27 @@ nfsd_file_put_noref(struct nfsd_file *nf)
> >> }
> >> }
> >>
> >> +static void
> >> +nfsd_file_unhash_and_put(struct nfsd_file *nf)
> >> +{
> >> + if (nfsd_file_unhash(nf))
> >> + nfsd_file_put_noref(nf);
> >> +}
> >> +
> >> void
> >> nfsd_file_put(struct nfsd_file *nf)
> >> {
> >> might_sleep();
> >>
> >> - nfsd_file_lru_add(nf);
> >> + if (test_bit(NFSD_FILE_GC, &nf->nf_flags) == 1)
> >
> > Clearly this is a style choice on which sensible people might disagree,
> > but I much prefer to leave out the "== 1" That is what most callers in
> > fs/nfsd/ do - only exceptions are here in filecache.c.
> > Even "!= 0" would be better than "== 1".
> > I think test_bit() is declared as a bool, but it is hard to be certain.
>
> The definitions of test_bit() I've seen return "int" which is why
> I wrote it this way.

Just for completeness, I found

#define test_bit(nr, addr) bitop(_test_bit, nr, addr)

in linux/bitops.h.

bitop() is defined just above and (I think) in the case where nr is
constant and addr is not NULL, this becomes a call to const_test_bit().
In include/asm-generic/bitops/generic-non-atomic.h I found

static __always_inline bool
const_test_bit(unsigned long nr, const volatile unsigned long *addr)

In the non-constant case it calls _test_bit() which is

static __always_inline bool
_test_bit(unsigned long nr, const volatile unsigned long *addr)

But as an int that is known to be 0 or 1 is type-compatible with a bool,
ht shouldn't really matter from a type perspective, only for a human
readability perspective.

Thanks,
NeilBrown

2022-10-24 23:49:05

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH v4 5/7] NFSD: Use rhashtable for managing nfs4_file objects

On Tue, 25 Oct 2022, Chuck Lever III wrote:
>
> > On Oct 23, 2022, at 11:26 PM, NeilBrown <[email protected]> wrote:
> >
> > On Wed, 19 Oct 2022, Chuck Lever wrote:
> >> +/*
> >> + * The returned hash value is based solely on the address of an in-code
> >> + * inode, a pointer to a slab-allocated object. The entropy in such a
> >> + * pointer is concentrated in its middle bits.
> >> + */
> >> +static u32 nfs4_file_inode_hash(const struct inode *inode, u32 seed)
> >> +{
> >> + u32 k;
> >> +
> >> + k = ((unsigned long)inode) >> L1_CACHE_SHIFT;
> >> + return jhash2(&k, 1, seed);
> >
> > I still don't think this makes any sense at all.
> >
> > return jhash2(&inode, sizeof(inode)/4, seed);
> >
> > uses all of the entropy, which is best for rhashtables.
>
> I don't really disagree, but the L1_CACHE_SHIFT was added at
> Al's behest. OK, I'll replace it.

I think that was in a different context though.

If you are using a simple fixed array of bucket-chains for a hash
table, then shifting down L1_CACHE_SHIFT and masking off the number of
bits to match the array size is a perfectly sensible hash function. It
will occasionally produce longer chains, but no often.

But rhashtables does something a bit different. It mixes a seed into
the key as part of producing the hash, and assumes that if an
unfortunate distribution of keys results is a substantially non-uniform
distribution into buckets, then choosing a new random seed will
re-distribute the keys into buckets and will probably produce a more
uniform distribution.

The purpose of this seed is (as I understand it) primarily focused on
network-faced hash tables where an attacker might be able to choose keys
that all hash to the same value. The random seed will defeat the
attacker.

When hashing inodes there is no opportunity for a deliberate attack, and
we would probably be happy to not use a seed and accept the small
possibility of occasional long chains. However the rhashtable code
doesn't offer us that option. It insists on rehashing if any chain
exceeds 16. So we need to provide a much entropy as we can, and mix the
seed in as effectively as we can, to ensure that rehashing really works.

> >> +static int nfs4_file_obj_cmpfn(struct rhashtable_compare_arg *arg,
> >> + const void *ptr)
> >> +{
> >> + const struct svc_fh *fhp = arg->key;
> >> + const struct nfs4_file *fi = ptr;
> >> +
> >> + return fh_match(&fi->fi_fhandle, &fhp->fh_handle) ? 0 : 1;
> >> +}
> >
> > This doesn't make sense. Maybe the subtleties of rhl-tables are a bit
> > obscure.
> >
> > An rhltable is like an rhashtable except that insertion can never fail.
> > Multiple entries with the same key can co-exist in a linked list.
> > Lookup does not return an entry but returns a linked list of entries.
> >
> > For the nfs4_file table this isn't really a good match (though I
> > previously thought it was). Insertion must be able to fail if an entry
> > with the same filehandle exists.
>
> I think I've arrived at the same conclusion (about nfs4_file insertion
> needing to fail for duplicate filehandles). That's more or less open-coded
> in my current revision of this work rather than relying on the behavior
> of rhltable_insert().
>
>
> > One approach that might work would be to take the inode->i_lock after a
> > lookup fails then repeat the lookup and if it fails again, do the
> > insert. This keeps the locking fine-grained but means we don't have to
> > depend on insert failing.
> >
> > So the hash and cmp functions would only look at the inode pointer.
> > If lookup returns something, we would walk the list calling fh_match()
> > to see if we have the right entry - all under RCU and using
> > refcount_inc_not_zero() when we find the matching filehandle.
>
> That's basically what's going on now. But I tried removing obj_cmpfn()
> and the rhltable_init() failed outright, so it's still there like a
> vestigial organ.

You need an obj_cmpfn if you have an obj_hashfn. If you don't have
obj_hashfn and just provide hashfn and key_len, then you don't need the
cmpfn.

If you have a cmpfn, just compare the inode (the same thing that you
hash) - don't compare the file handle.

>
> If you now believe rhltable is not a good fit, I can revert all of the
> rhltable changes and go back to rhashtable quite easily.

I wasn't very clear, as I was still working out what I though.

I think an rhashtable cannot work as you need two different keys, the
inode and the filehandle.

I think rhltable can work but it needs help, particularly as it will
never fail an insertion.
The help we can provide is to take a lock, perform a lookup, then only
try to insert if the lookup failed (and we are still holding an
appropriate lock to stop another thread inserting). Thus any time we
try an insert, we don't want it to fail.

The lock I suggest is ->i_lock in the inode.

The rhltable should only consider the inode as a key, and should provide
a linked list of nfs4_files with the same inode.
We then code a linear search of this linked list (which is expected to
have one entry) to find if there is an entry with the required file
handle.


An alternative is to not use a resizable table at all. We could stick
with the table we currently have and make small changes.

1/ change the compare function to test the inode pointer first and only
call fh_match when the inodes match. This should make it much
faster.
2/ Instead of using state_lock for locking, use a bit-spin-lock on the
lsb of the bucket pointers in the array to lock each bucket. This
will substantially reduce locking conflicts.

I don't see a big performance difference between this approach and the
rhltable approach. It really depends which you would feel more
comfortable working with. Would you rather work with common code which
isn't quite a perfect fit, or with private code that you have complete
control over?

Thanks,
NeilBrown

2022-10-25 00:27:22

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH v4 5/7] NFSD: Use rhashtable for managing nfs4_file objects



> On Oct 24, 2022, at 6:02 PM, NeilBrown <[email protected]> wrote:
>
> On Tue, 25 Oct 2022, Chuck Lever III wrote:
>>
>>> On Oct 23, 2022, at 11:26 PM, NeilBrown <[email protected]> wrote:
>>>
>>> On Wed, 19 Oct 2022, Chuck Lever wrote:
>>>> +/*
>>>> + * The returned hash value is based solely on the address of an in-code
>>>> + * inode, a pointer to a slab-allocated object. The entropy in such a
>>>> + * pointer is concentrated in its middle bits.
>>>> + */
>>>> +static u32 nfs4_file_inode_hash(const struct inode *inode, u32 seed)
>>>> +{
>>>> + u32 k;
>>>> +
>>>> + k = ((unsigned long)inode) >> L1_CACHE_SHIFT;
>>>> + return jhash2(&k, 1, seed);
>>>
>>> I still don't think this makes any sense at all.
>>>
>>> return jhash2(&inode, sizeof(inode)/4, seed);
>>>
>>> uses all of the entropy, which is best for rhashtables.
>>
>> I don't really disagree, but the L1_CACHE_SHIFT was added at
>> Al's behest. OK, I'll replace it.
>
> I think that was in a different context though.
>
> If you are using a simple fixed array of bucket-chains for a hash
> table, then shifting down L1_CACHE_SHIFT and masking off the number of
> bits to match the array size is a perfectly sensible hash function. It
> will occasionally produce longer chains, but no often.
>
> But rhashtables does something a bit different. It mixes a seed into
> the key as part of producing the hash, and assumes that if an
> unfortunate distribution of keys results is a substantially non-uniform
> distribution into buckets, then choosing a new random seed will
> re-distribute the keys into buckets and will probably produce a more
> uniform distribution.
>
> The purpose of this seed is (as I understand it) primarily focused on
> network-faced hash tables where an attacker might be able to choose keys
> that all hash to the same value. The random seed will defeat the
> attacker.
>
> When hashing inodes there is no opportunity for a deliberate attack, and
> we would probably be happy to not use a seed and accept the small
> possibility of occasional long chains. However the rhashtable code
> doesn't offer us that option. It insists on rehashing if any chain
> exceeds 16. So we need to provide a much entropy as we can, and mix the
> seed in as effectively as we can, to ensure that rehashing really works.
>
>>>> +static int nfs4_file_obj_cmpfn(struct rhashtable_compare_arg *arg,
>>>> + const void *ptr)
>>>> +{
>>>> + const struct svc_fh *fhp = arg->key;
>>>> + const struct nfs4_file *fi = ptr;
>>>> +
>>>> + return fh_match(&fi->fi_fhandle, &fhp->fh_handle) ? 0 : 1;
>>>> +}
>>>
>>> This doesn't make sense. Maybe the subtleties of rhl-tables are a bit
>>> obscure.
>>>
>>> An rhltable is like an rhashtable except that insertion can never fail.
>>> Multiple entries with the same key can co-exist in a linked list.
>>> Lookup does not return an entry but returns a linked list of entries.
>>>
>>> For the nfs4_file table this isn't really a good match (though I
>>> previously thought it was). Insertion must be able to fail if an entry
>>> with the same filehandle exists.
>>
>> I think I've arrived at the same conclusion (about nfs4_file insertion
>> needing to fail for duplicate filehandles). That's more or less open-coded
>> in my current revision of this work rather than relying on the behavior
>> of rhltable_insert().
>>
>>
>>> One approach that might work would be to take the inode->i_lock after a
>>> lookup fails then repeat the lookup and if it fails again, do the
>>> insert. This keeps the locking fine-grained but means we don't have to
>>> depend on insert failing.
>>>
>>> So the hash and cmp functions would only look at the inode pointer.
>>> If lookup returns something, we would walk the list calling fh_match()
>>> to see if we have the right entry - all under RCU and using
>>> refcount_inc_not_zero() when we find the matching filehandle.
>>
>> That's basically what's going on now. But I tried removing obj_cmpfn()
>> and the rhltable_init() failed outright, so it's still there like a
>> vestigial organ.
>
> You need an obj_cmpfn if you have an obj_hashfn. If you don't have
> obj_hashfn and just provide hashfn and key_len, then you don't need the
> cmpfn.
>
> If you have a cmpfn, just compare the inode (the same thing that you
> hash) - don't compare the file handle.
>
>>
>> If you now believe rhltable is not a good fit, I can revert all of the
>> rhltable changes and go back to rhashtable quite easily.
>
> I wasn't very clear, as I was still working out what I though.
>
> I think an rhashtable cannot work as you need two different keys, the
> inode and the filehandle.
>
> I think rhltable can work but it needs help, particularly as it will
> never fail an insertion.
> The help we can provide is to take a lock, perform a lookup, then only
> try to insert if the lookup failed (and we are still holding an
> appropriate lock to stop another thread inserting). Thus any time we
> try an insert, we don't want it to fail.
>
> The lock I suggest is ->i_lock in the inode.

I will give that a try next.


> The rhltable should only consider the inode as a key, and should provide
> a linked list of nfs4_files with the same inode.

The implementation I've arrived at is rhltable keyed on inode only.
The bucket chains are searched with fh_match().

I've gotten rid of all of the special obj_hash/cmp functions as a
result of this simplification. If I've set up the rhashtable
parameters correctly, the rhashtable code should use jhash/jhash2
on the whole inode pointer by default.


> We then code a linear search of this linked list (which is expected to
> have one entry) to find if there is an entry with the required file
> handle.

I'm not sure about that expectation: multiple inode pointers could
hash to the same bucket. Also, there are cases where multiple file
handles refer to the same inode (the aliasing that a0ce48375a36
("nfsd: track filehandle aliasing in nfs4_files") tries to deal
with).

I will post what I have right now. It's not passing all tests due
to very recent changes (and probably because of lack of insert
serialization). We can pick up from there; I know I've likely missed
some review comments still.


> An alternative is to not use a resizable table at all. We could stick
> with the table we currently have and make small changes.
>
> 1/ change the compare function to test the inode pointer first and only
> call fh_match when the inodes match. This should make it much
> faster.

I had the same thought, but this doesn't reduce the number of CPU
cache misses I observed from long bucket chain walks. I'd prefer a
solution that maintains small buckets.


> 2/ Instead of using state_lock for locking, use a bit-spin-lock on the
> lsb of the bucket pointers in the array to lock each bucket. This
> will substantially reduce locking conflicts.
>
> I don't see a big performance difference between this approach and the
> rhltable approach. It really depends which you would feel more
> comfortable working with. Would you rather work with common code which
> isn't quite a perfect fit, or with private code that you have complete
> control over?

I think rhltable still has some legs, will keep chipping away at that.

--
Chuck Lever



2022-10-25 01:09:49

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH v4 3/7] NFSD: Add an NFSD_FILE_GC flag to enable nfsd_file garbage collection

On Tue, 25 Oct 2022, Chuck Lever III wrote:
>
> > On Oct 24, 2022, at 12:57 PM, Jeff Layton <[email protected]> wrote:
> >
> > On Mon, 2022-10-24 at 13:33 +1100, NeilBrown wrote:
> >> On Wed, 19 Oct 2022, Chuck Lever wrote:
> >>> + nfsd_file_lru_add(nf);
> >>> + else if (refcount_read(&nf->nf_ref) == 2)
> >>> + nfsd_file_unhash_and_put(nf);
> >>
> >> Tests on the value of a refcount are almost always racy.
> >
> > Agreed, and there's a clear race above, now that I look more closely. If
> > nf_ref is 3 and two puts are racing then neither of them will call
> > nfsd_file_unhash_and_put. We really should be letting the outcome of the
> > decrement drive things (like you say below).
> >
> >> I suspect there is an implication that as NFSD_FILE_GC is not set, this
> >> *must* be hashed which implies there is guaranteed to be a refcount from
> >> the hashtable. So this is really a test to see if the pre-biased
> >> refcount is one. The safe way to test if a refcount is 1 is dec_and_test.
> >>
> >> i.e. linkage from the hash table should not count as a reference (in the
> >> not-GC case). Lookup in the hash table should fail if the found entry
> >> cannot achieve an inc_not_zero. When dec_and_test says the refcount is
> >> zero, we remove from the hash table (certain that no new references will
> >> be taken).
> >>
> >
> > This does seem a more sensible approach. That would go a long way toward
> > simplifying nfsd_file_put.
>
> So I cut-and-pasted the approach you used in the patch you sent a few
> weeks ago. I don't object to replacing that... but I don't see exactly
> where you guys are going with this.
>

Where I'm going with this is to suggest that this ref-counting oddity is
possibly the cause of the occasional refcounting problems that you
have had with nfsd.

I think that the hash table should never own a reference to the
nfsd_file. If the refcount ever reaches zero, then it gets removed from
the hash table.

if (refcount_dec_and_test())
if (test_and_clear_bit(HASHED,...))
delete from hash table

Transient references are counted.
For NFSv2,3 existence in the LRU is counted (I don't think they are at present).
For NFSv4, references from nfs4_file are counted.
But presence in the hash table is only indicated by the HASHED flag.

I think this would make the ref counting more robust.

NeilBrown

2022-10-25 17:25:13

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH v4 5/7] NFSD: Use rhashtable for managing nfs4_file objects

On Mon, 2022-10-24 at 22:30 +0000, Chuck Lever III wrote:
>
> > On Oct 24, 2022, at 6:02 PM, NeilBrown <[email protected]> wrote:
> >
> > On Tue, 25 Oct 2022, Chuck Lever III wrote:
> > >
> > > > On Oct 23, 2022, at 11:26 PM, NeilBrown <[email protected]> wrote:
> > > >
> > > > On Wed, 19 Oct 2022, Chuck Lever wrote:
> > > > > +/*
> > > > > + * The returned hash value is based solely on the address of an in-code
> > > > > + * inode, a pointer to a slab-allocated object. The entropy in such a
> > > > > + * pointer is concentrated in its middle bits.
> > > > > + */
> > > > > +static u32 nfs4_file_inode_hash(const struct inode *inode, u32 seed)
> > > > > +{
> > > > > + u32 k;
> > > > > +
> > > > > + k = ((unsigned long)inode) >> L1_CACHE_SHIFT;
> > > > > + return jhash2(&k, 1, seed);
> > > >
> > > > I still don't think this makes any sense at all.
> > > >
> > > > return jhash2(&inode, sizeof(inode)/4, seed);
> > > >
> > > > uses all of the entropy, which is best for rhashtables.
> > >
> > > I don't really disagree, but the L1_CACHE_SHIFT was added at
> > > Al's behest. OK, I'll replace it.
> >
> > I think that was in a different context though.
> >
> > If you are using a simple fixed array of bucket-chains for a hash
> > table, then shifting down L1_CACHE_SHIFT and masking off the number of
> > bits to match the array size is a perfectly sensible hash function. It
> > will occasionally produce longer chains, but no often.
> >
> > But rhashtables does something a bit different. It mixes a seed into
> > the key as part of producing the hash, and assumes that if an
> > unfortunate distribution of keys results is a substantially non-uniform
> > distribution into buckets, then choosing a new random seed will
> > re-distribute the keys into buckets and will probably produce a more
> > uniform distribution.
> >
> > The purpose of this seed is (as I understand it) primarily focused on
> > network-faced hash tables where an attacker might be able to choose keys
> > that all hash to the same value. The random seed will defeat the
> > attacker.
> >
> > When hashing inodes there is no opportunity for a deliberate attack, and
> > we would probably be happy to not use a seed and accept the small
> > possibility of occasional long chains. However the rhashtable code
> > doesn't offer us that option. It insists on rehashing if any chain
> > exceeds 16. So we need to provide a much entropy as we can, and mix the
> > seed in as effectively as we can, to ensure that rehashing really works.
> >
> > > > > +static int nfs4_file_obj_cmpfn(struct rhashtable_compare_arg *arg,
> > > > > + const void *ptr)
> > > > > +{
> > > > > + const struct svc_fh *fhp = arg->key;
> > > > > + const struct nfs4_file *fi = ptr;
> > > > > +
> > > > > + return fh_match(&fi->fi_fhandle, &fhp->fh_handle) ? 0 : 1;
> > > > > +}
> > > >
> > > > This doesn't make sense. Maybe the subtleties of rhl-tables are a bit
> > > > obscure.
> > > >
> > > > An rhltable is like an rhashtable except that insertion can never fail.
> > > > Multiple entries with the same key can co-exist in a linked list.
> > > > Lookup does not return an entry but returns a linked list of entries.
> > > >
> > > > For the nfs4_file table this isn't really a good match (though I
> > > > previously thought it was). Insertion must be able to fail if an entry
> > > > with the same filehandle exists.
> > >
> > > I think I've arrived at the same conclusion (about nfs4_file insertion
> > > needing to fail for duplicate filehandles). That's more or less open-coded
> > > in my current revision of this work rather than relying on the behavior
> > > of rhltable_insert().
> > >
> > >
> > > > One approach that might work would be to take the inode->i_lock after a
> > > > lookup fails then repeat the lookup and if it fails again, do the
> > > > insert. This keeps the locking fine-grained but means we don't have to
> > > > depend on insert failing.
> > > >
> > > > So the hash and cmp functions would only look at the inode pointer.
> > > > If lookup returns something, we would walk the list calling fh_match()
> > > > to see if we have the right entry - all under RCU and using
> > > > refcount_inc_not_zero() when we find the matching filehandle.
> > >
> > > That's basically what's going on now. But I tried removing obj_cmpfn()
> > > and the rhltable_init() failed outright, so it's still there like a
> > > vestigial organ.
> >
> > You need an obj_cmpfn if you have an obj_hashfn. If you don't have
> > obj_hashfn and just provide hashfn and key_len, then you don't need the
> > cmpfn.
> >
> > If you have a cmpfn, just compare the inode (the same thing that you
> > hash) - don't compare the file handle.
> >
> > >
> > > If you now believe rhltable is not a good fit, I can revert all of the
> > > rhltable changes and go back to rhashtable quite easily.
> >
> > I wasn't very clear, as I was still working out what I though.
> >
> > I think an rhashtable cannot work as you need two different keys, the
> > inode and the filehandle.
> >
> > I think rhltable can work but it needs help, particularly as it will
> > never fail an insertion.
> > The help we can provide is to take a lock, perform a lookup, then only
> > try to insert if the lookup failed (and we are still holding an
> > appropriate lock to stop another thread inserting). Thus any time we
> > try an insert, we don't want it to fail.
> >
> > The lock I suggest is ->i_lock in the inode.
>
> I will give that a try next.
>
>
> > The rhltable should only consider the inode as a key, and should provide
> > a linked list of nfs4_files with the same inode.
>
> The implementation I've arrived at is rhltable keyed on inode only.
> The bucket chains are searched with fh_match().
>
> I've gotten rid of all of the special obj_hash/cmp functions as a
> result of this simplification. If I've set up the rhashtable
> parameters correctly, the rhashtable code should use jhash/jhash2
> on the whole inode pointer by default.
>
>
> > We then code a linear search of this linked list (which is expected to
> > have one entry) to find if there is an entry with the required file
> > handle.
>
> I'm not sure about that expectation: multiple inode pointers could
> hash to the same bucket. Also, there are cases where multiple file
> handles refer to the same inode (the aliasing that a0ce48375a36
> ("nfsd: track filehandle aliasing in nfs4_files") tries to deal
> with).
>
> I will post what I have right now. It's not passing all tests due
> to very recent changes (and probably because of lack of insert
> serialization). We can pick up from there; I know I've likely missed
> some review comments still.
>
>

Thanks. I've started looking over this and some other changes, and found
at least a couple of problems with the existing code (not necessarily
due to your patches, fwiw):

1/ the nf_lru is the linkage to the LRU, but it's also used when we move
an object to a dispose list. I don't see anything that prevents you from
trying to add an entry back onto the LRU while it's still on a dispose
list (which could cause corruption). I think we need some clear rules
around how that field get used.

2/ nfsd_file_close_inode_sync can end up putting an nf onto the dispose
list without ensuring that nf_ref has gone to zero. I think there are
some situations where the file can end up being freed out from under a
valid reference due to that.

I'm working on some patches to clean this up along the lines of the
scheme Neil is advocating for. For now, I'm basing this on top of your
series.

> > An alternative is to not use a resizable table at all. We could stick
> > with the table we currently have and make small changes.
> >
> > 1/ change the compare function to test the inode pointer first and only
> > call fh_match when the inodes match. This should make it much
> > faster.
>
> I had the same thought, but this doesn't reduce the number of CPU
> cache misses I observed from long bucket chain walks. I'd prefer a
> solution that maintains small buckets.
>
>
> > 2/ Instead of using state_lock for locking, use a bit-spin-lock on the
> > lsb of the bucket pointers in the array to lock each bucket. This
> > will substantially reduce locking conflicts.
> >
> > I don't see a big performance difference between this approach and the
> > rhltable approach. It really depends which you would feel more
> > comfortable working with. Would you rather work with common code which
> > isn't quite a perfect fit, or with private code that you have complete
> > control over?
>
> I think rhltable still has some legs, will keep chipping away at that.
>
> --
> Chuck Lever
>
>
>

--
Jeff Layton <[email protected]>

2022-10-25 17:35:07

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH v4 5/7] NFSD: Use rhashtable for managing nfs4_file objects



> On Oct 25, 2022, at 1:18 PM, Jeff Layton <[email protected]> wrote:
>
> On Mon, 2022-10-24 at 22:30 +0000, Chuck Lever III wrote:
>>
>>> On Oct 24, 2022, at 6:02 PM, NeilBrown <[email protected]> wrote:
>>>
>>> On Tue, 25 Oct 2022, Chuck Lever III wrote:
>>>>
>>>>> On Oct 23, 2022, at 11:26 PM, NeilBrown <[email protected]> wrote:
>>>>>
>>>>> On Wed, 19 Oct 2022, Chuck Lever wrote:
>>>>>> +/*
>>>>>> + * The returned hash value is based solely on the address of an in-code
>>>>>> + * inode, a pointer to a slab-allocated object. The entropy in such a
>>>>>> + * pointer is concentrated in its middle bits.
>>>>>> + */
>>>>>> +static u32 nfs4_file_inode_hash(const struct inode *inode, u32 seed)
>>>>>> +{
>>>>>> + u32 k;
>>>>>> +
>>>>>> + k = ((unsigned long)inode) >> L1_CACHE_SHIFT;
>>>>>> + return jhash2(&k, 1, seed);
>>>>>
>>>>> I still don't think this makes any sense at all.
>>>>>
>>>>> return jhash2(&inode, sizeof(inode)/4, seed);
>>>>>
>>>>> uses all of the entropy, which is best for rhashtables.
>>>>
>>>> I don't really disagree, but the L1_CACHE_SHIFT was added at
>>>> Al's behest. OK, I'll replace it.
>>>
>>> I think that was in a different context though.
>>>
>>> If you are using a simple fixed array of bucket-chains for a hash
>>> table, then shifting down L1_CACHE_SHIFT and masking off the number of
>>> bits to match the array size is a perfectly sensible hash function. It
>>> will occasionally produce longer chains, but no often.
>>>
>>> But rhashtables does something a bit different. It mixes a seed into
>>> the key as part of producing the hash, and assumes that if an
>>> unfortunate distribution of keys results is a substantially non-uniform
>>> distribution into buckets, then choosing a new random seed will
>>> re-distribute the keys into buckets and will probably produce a more
>>> uniform distribution.
>>>
>>> The purpose of this seed is (as I understand it) primarily focused on
>>> network-faced hash tables where an attacker might be able to choose keys
>>> that all hash to the same value. The random seed will defeat the
>>> attacker.
>>>
>>> When hashing inodes there is no opportunity for a deliberate attack, and
>>> we would probably be happy to not use a seed and accept the small
>>> possibility of occasional long chains. However the rhashtable code
>>> doesn't offer us that option. It insists on rehashing if any chain
>>> exceeds 16. So we need to provide a much entropy as we can, and mix the
>>> seed in as effectively as we can, to ensure that rehashing really works.
>>>
>>>>>> +static int nfs4_file_obj_cmpfn(struct rhashtable_compare_arg *arg,
>>>>>> + const void *ptr)
>>>>>> +{
>>>>>> + const struct svc_fh *fhp = arg->key;
>>>>>> + const struct nfs4_file *fi = ptr;
>>>>>> +
>>>>>> + return fh_match(&fi->fi_fhandle, &fhp->fh_handle) ? 0 : 1;
>>>>>> +}
>>>>>
>>>>> This doesn't make sense. Maybe the subtleties of rhl-tables are a bit
>>>>> obscure.
>>>>>
>>>>> An rhltable is like an rhashtable except that insertion can never fail.
>>>>> Multiple entries with the same key can co-exist in a linked list.
>>>>> Lookup does not return an entry but returns a linked list of entries.
>>>>>
>>>>> For the nfs4_file table this isn't really a good match (though I
>>>>> previously thought it was). Insertion must be able to fail if an entry
>>>>> with the same filehandle exists.
>>>>
>>>> I think I've arrived at the same conclusion (about nfs4_file insertion
>>>> needing to fail for duplicate filehandles). That's more or less open-coded
>>>> in my current revision of this work rather than relying on the behavior
>>>> of rhltable_insert().
>>>>
>>>>
>>>>> One approach that might work would be to take the inode->i_lock after a
>>>>> lookup fails then repeat the lookup and if it fails again, do the
>>>>> insert. This keeps the locking fine-grained but means we don't have to
>>>>> depend on insert failing.
>>>>>
>>>>> So the hash and cmp functions would only look at the inode pointer.
>>>>> If lookup returns something, we would walk the list calling fh_match()
>>>>> to see if we have the right entry - all under RCU and using
>>>>> refcount_inc_not_zero() when we find the matching filehandle.
>>>>
>>>> That's basically what's going on now. But I tried removing obj_cmpfn()
>>>> and the rhltable_init() failed outright, so it's still there like a
>>>> vestigial organ.
>>>
>>> You need an obj_cmpfn if you have an obj_hashfn. If you don't have
>>> obj_hashfn and just provide hashfn and key_len, then you don't need the
>>> cmpfn.
>>>
>>> If you have a cmpfn, just compare the inode (the same thing that you
>>> hash) - don't compare the file handle.
>>>
>>>>
>>>> If you now believe rhltable is not a good fit, I can revert all of the
>>>> rhltable changes and go back to rhashtable quite easily.
>>>
>>> I wasn't very clear, as I was still working out what I though.
>>>
>>> I think an rhashtable cannot work as you need two different keys, the
>>> inode and the filehandle.
>>>
>>> I think rhltable can work but it needs help, particularly as it will
>>> never fail an insertion.
>>> The help we can provide is to take a lock, perform a lookup, then only
>>> try to insert if the lookup failed (and we are still holding an
>>> appropriate lock to stop another thread inserting). Thus any time we
>>> try an insert, we don't want it to fail.
>>>
>>> The lock I suggest is ->i_lock in the inode.
>>
>> I will give that a try next.
>>
>>
>>> The rhltable should only consider the inode as a key, and should provide
>>> a linked list of nfs4_files with the same inode.
>>
>> The implementation I've arrived at is rhltable keyed on inode only.
>> The bucket chains are searched with fh_match().
>>
>> I've gotten rid of all of the special obj_hash/cmp functions as a
>> result of this simplification. If I've set up the rhashtable
>> parameters correctly, the rhashtable code should use jhash/jhash2
>> on the whole inode pointer by default.
>>
>>
>>> We then code a linear search of this linked list (which is expected to
>>> have one entry) to find if there is an entry with the required file
>>> handle.
>>
>> I'm not sure about that expectation: multiple inode pointers could
>> hash to the same bucket. Also, there are cases where multiple file
>> handles refer to the same inode (the aliasing that a0ce48375a36
>> ("nfsd: track filehandle aliasing in nfs4_files") tries to deal
>> with).
>>
>> I will post what I have right now. It's not passing all tests due
>> to very recent changes (and probably because of lack of insert
>> serialization). We can pick up from there; I know I've likely missed
>> some review comments still.
>>
>>
>
> Thanks. I've started looking over this and some other changes, and found
> at least a couple of problems with the existing code (not necessarily
> due to your patches, fwiw):
>
> 1/ the nf_lru is the linkage to the LRU, but it's also used when we move
> an object to a dispose list. I don't see anything that prevents you from
> trying to add an entry back onto the LRU while it's still on a dispose
> list (which could cause corruption). I think we need some clear rules
> around how that field get used.

I chased this when converting filecache to rhashtable. There don't seem
to be any logic errors in the current code, and it seems to be a common
trope in other consumers of list_lru, so I left it alone.

Yeah, it makes me nervous too.


> 2/ nfsd_file_close_inode_sync can end up putting an nf onto the dispose
> list without ensuring that nf_ref has gone to zero. I think there are
> some situations where the file can end up being freed out from under a
> valid reference due to that.
>
> I'm working on some patches to clean this up along the lines of the
> scheme Neil is advocating for. For now, I'm basing this on top of your
> series.

I'll try to refresh kernel.org:cel for-next later today.


>>> An alternative is to not use a resizable table at all. We could stick
>>> with the table we currently have and make small changes.
>>>
>>> 1/ change the compare function to test the inode pointer first and only
>>> call fh_match when the inodes match. This should make it much
>>> faster.
>>
>> I had the same thought, but this doesn't reduce the number of CPU
>> cache misses I observed from long bucket chain walks. I'd prefer a
>> solution that maintains small buckets.
>>
>>
>>> 2/ Instead of using state_lock for locking, use a bit-spin-lock on the
>>> lsb of the bucket pointers in the array to lock each bucket. This
>>> will substantially reduce locking conflicts.
>>>
>>> I don't see a big performance difference between this approach and the
>>> rhltable approach. It really depends which you would feel more
>>> comfortable working with. Would you rather work with common code which
>>> isn't quite a perfect fit, or with private code that you have complete
>>> control over?
>>
>> I think rhltable still has some legs, will keep chipping away at that.
>>
>> --
>> Chuck Lever
>>
>>
>>
>
> --
> Jeff Layton <[email protected]>

--
Chuck Lever