2022-10-06 16:23:58

by Chuck Lever III

[permalink] [raw]
Subject: [PATCH v2 0/9] A course adjustment, maybe...

I'm proposing this series as the first NFSD-related patchset to go
into v6.2 (for-next), though I haven't opened that yet.

For quite some time, we've been encouraged to disable filecache
garbage collection for NFSv4 files, and I think I found a surgical
way to do just that. That is presented in "NFSD: Add an NFSD_FILE_GC
flag to enable nfsd_file garbage collection".

Comments and opinions are welcome.

Changes since RFC:
- checking nfs4_files for inode aliases is now done only on hash
insertion
- the nfs4_file reference count is now bumped only while the RCU
read lock is held
- comments and function names have been revised and clarified

I haven't updated the new @want_gc parameter... jury is still out.

---

Chuck Lever (7):
NFSD: Pass the target nfsd_file to nfsd_commit()
NFSD: Revert "NFSD: NFSv4 CLOSE should release an nfsd_file immediately"
NFSD: Add an NFSD_FILE_GC flag to enable nfsd_file garbage collection
NFSD: Use const pointers as parameters to fh_ helpers.
NFSD: Use rhashtable for managing nfs4_file objects
NFSD: Clean up nfs4_preprocess_stateid_op() call sites
NFSD: Trace delegation revocations

Jeff Layton (2):
nfsd: fix nfsd_file_unhash_and_dispose
nfsd: rework hashtable handling in nfsd_do_file_acquire


fs/nfsd/filecache.c | 165 +++++++++++++++---------------
fs/nfsd/filecache.h | 4 +-
fs/nfsd/nfs3proc.c | 10 +-
fs/nfsd/nfs4proc.c | 42 ++++----
fs/nfsd/nfs4state.c | 241 ++++++++++++++++++++++++++++++--------------
fs/nfsd/nfsfh.h | 10 +-
fs/nfsd/state.h | 5 +-
fs/nfsd/trace.h | 58 ++++++++++-
fs/nfsd/vfs.c | 19 ++--
fs/nfsd/vfs.h | 3 +-
10 files changed, 351 insertions(+), 206 deletions(-)

--
Chuck Lever


2022-10-06 16:24:33

by Chuck Lever III

[permalink] [raw]
Subject: [PATCH v2 1/9] nfsd: fix nfsd_file_unhash_and_dispose

From: Jeff Layton <[email protected]>

nfsd_file_unhash_and_dispose() is called for two reasons:

We're either shutting down and purging the filecache, or we've gotten a
notification about a file delete, so we want to go ahead and unhash it
so that it'll get cleaned up when we close.

We're either walking the hashtable or doing a lookup in it and we
don't take a reference in either case. What we want to do in both cases
is to try and unhash the object and put it on the dispose list if that
was successful. If it's no longer hashed, then we don't want to touch
it, with the assumption being that something else is already cleaning
up the sentinel reference.

Instead of trying to selectively decrement the refcount in this
function, just unhash it, and if that was successful, move it to the
dispose list. Then, the disposal routine will just clean that up as
usual.

Also, just make this a void function, drop the WARN_ON_ONCE, and the
comments about deadlocking since the nature of the purported deadlock
is no longer clear.

Signed-off-by: Jeff Layton <[email protected]>
Signed-off-by: Chuck Lever <[email protected]>
---
fs/nfsd/filecache.c | 36 +++++++-----------------------------
1 file changed, 7 insertions(+), 29 deletions(-)

diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
index eeed4ae5b4ad..844c41832d50 100644
--- a/fs/nfsd/filecache.c
+++ b/fs/nfsd/filecache.c
@@ -405,22 +405,15 @@ nfsd_file_unhash(struct nfsd_file *nf)
return false;
}

-/*
- * Return true if the file was unhashed.
- */
-static bool
+static void
nfsd_file_unhash_and_dispose(struct nfsd_file *nf, struct list_head *dispose)
{
trace_nfsd_file_unhash_and_dispose(nf);
- if (!nfsd_file_unhash(nf))
- return false;
- /* keep final reference for nfsd_file_lru_dispose */
- if (refcount_dec_not_one(&nf->nf_ref))
- return true;
-
- nfsd_file_lru_remove(nf);
- list_add(&nf->nf_lru, dispose);
- return true;
+ if (nfsd_file_unhash(nf)) {
+ /* caller must call nfsd_file_dispose_list() later */
+ nfsd_file_lru_remove(nf);
+ list_add(&nf->nf_lru, dispose);
+ }
}

static void
@@ -562,8 +555,6 @@ nfsd_file_dispose_list_delayed(struct list_head *dispose)
* @lock: LRU list lock (unused)
* @arg: dispose list
*
- * Note this can deadlock with nfsd_file_cache_purge.
- *
* Return values:
* %LRU_REMOVED: @item was removed from the LRU
* %LRU_ROTATE: @item is to be moved to the LRU tail
@@ -748,8 +739,6 @@ nfsd_file_close_inode(struct inode *inode)
*
* Walk the LRU list and close any entries that have not been used since
* the last scan.
- *
- * Note this can deadlock with nfsd_file_cache_purge.
*/
static void
nfsd_file_delayed_close(struct work_struct *work)
@@ -891,16 +880,12 @@ nfsd_file_cache_init(void)
goto out;
}

-/*
- * Note this can deadlock with nfsd_file_lru_cb.
- */
static void
__nfsd_file_cache_purge(struct net *net)
{
struct rhashtable_iter iter;
struct nfsd_file *nf;
LIST_HEAD(dispose);
- bool del;

rhashtable_walk_enter(&nfsd_file_rhash_tbl, &iter);
do {
@@ -910,14 +895,7 @@ __nfsd_file_cache_purge(struct net *net)
while (!IS_ERR_OR_NULL(nf)) {
if (net && nf->nf_net != net)
continue;
- del = nfsd_file_unhash_and_dispose(nf, &dispose);
-
- /*
- * Deadlock detected! Something marked this entry as
- * unhased, but hasn't removed it from the hash list.
- */
- WARN_ON_ONCE(!del);
-
+ nfsd_file_unhash_and_dispose(nf, &dispose);
nf = rhashtable_walk_next(&iter);
}



2022-10-06 16:24:34

by Chuck Lever III

[permalink] [raw]
Subject: [PATCH v2 2/9] nfsd: rework hashtable handling in nfsd_do_file_acquire

From: Jeff Layton <[email protected]>

nfsd_file is RCU-freed, so we need to hold the rcu_read_lock long enough
to get a reference after finding it in the hash. Take the
rcu_read_lock() and call rhashtable_lookup directly.

Switch to using rhashtable_lookup_insert_key as well, and use the usual
retry mechanism if we hit an -EEXIST. Rename the "retry" bool to
open_retry, and eliminiate the insert_err goto target.

Signed-off-by: Jeff Layton <[email protected]>
Signed-off-by: Chuck Lever <[email protected]>
---
fs/nfsd/filecache.c | 52 ++++++++++++++++++++++-----------------------------
1 file changed, 22 insertions(+), 30 deletions(-)

diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
index 844c41832d50..a2adfc247648 100644
--- a/fs/nfsd/filecache.c
+++ b/fs/nfsd/filecache.c
@@ -1042,9 +1042,10 @@ nfsd_file_do_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
.need = may_flags & NFSD_FILE_MAY_MASK,
.net = SVC_NET(rqstp),
};
- struct nfsd_file *nf, *new;
- bool retry = true;
+ bool open_retry = true;
+ struct nfsd_file *nf;
__be32 status;
+ int ret;

status = fh_verify(rqstp, fhp, S_IFREG,
may_flags|NFSD_MAY_OWNER_OVERRIDE);
@@ -1054,35 +1055,33 @@ nfsd_file_do_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
key.cred = get_current_cred();

retry:
- /* Avoid allocation if the item is already in cache */
- nf = rhashtable_lookup_fast(&nfsd_file_rhash_tbl, &key,
- nfsd_file_rhash_params);
+ rcu_read_lock();
+ nf = rhashtable_lookup(&nfsd_file_rhash_tbl, &key,
+ nfsd_file_rhash_params);
if (nf)
nf = nfsd_file_get(nf);
+ rcu_read_unlock();
if (nf)
goto wait_for_construction;

- new = nfsd_file_alloc(&key, may_flags);
- if (!new) {
+ nf = nfsd_file_alloc(&key, may_flags);
+ if (!nf) {
status = nfserr_jukebox;
goto out_status;
}

- nf = rhashtable_lookup_get_insert_key(&nfsd_file_rhash_tbl,
- &key, &new->nf_rhash,
- nfsd_file_rhash_params);
- if (!nf) {
- nf = new;
- goto open_file;
- }
- if (IS_ERR(nf))
- goto insert_err;
- nf = nfsd_file_get(nf);
- if (nf == NULL) {
- nf = new;
+ ret = rhashtable_lookup_insert_key(&nfsd_file_rhash_tbl,
+ &key, &nf->nf_rhash,
+ nfsd_file_rhash_params);
+ if (likely(ret == 0))
goto open_file;
- }
- nfsd_file_slab_free(&new->nf_rcu);
+
+ nfsd_file_slab_free(&nf->nf_rcu);
+ if (ret == -EEXIST)
+ goto retry;
+ trace_nfsd_file_insert_err(rqstp, key.inode, may_flags, ret);
+ status = nfserr_jukebox;
+ goto out_status;

wait_for_construction:
wait_on_bit(&nf->nf_flags, NFSD_FILE_PENDING, TASK_UNINTERRUPTIBLE);
@@ -1090,11 +1089,11 @@ nfsd_file_do_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
/* Did construction of this file fail? */
if (!test_bit(NFSD_FILE_HASHED, &nf->nf_flags)) {
trace_nfsd_file_cons_err(rqstp, key.inode, may_flags, nf);
- if (!retry) {
+ if (!open_retry) {
status = nfserr_jukebox;
goto out;
}
- retry = false;
+ open_retry = false;
nfsd_file_put_noref(nf);
goto retry;
}
@@ -1142,13 +1141,6 @@ nfsd_file_do_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
smp_mb__after_atomic();
wake_up_bit(&nf->nf_flags, NFSD_FILE_PENDING);
goto out;
-
-insert_err:
- nfsd_file_slab_free(&new->nf_rcu);
- trace_nfsd_file_insert_err(rqstp, key.inode, may_flags, PTR_ERR(nf));
- nf = NULL;
- status = nfserr_jukebox;
- goto out_status;
}

/**


2022-10-06 16:24:34

by Chuck Lever III

[permalink] [raw]
Subject: [PATCH v2 4/9] NFSD: Revert "NFSD: NFSv4 CLOSE should release an nfsd_file immediately"

This reverts commit 5e138c4a750dc140d881dab4a8804b094bbc08d2.

That commit attempted to make files available to other users as soon
as all NFSv4 clients were done with them, rather than waiting until
the filecache LRU had garbage collected them.

It gets the reference counting wrong, for one thing.

But it also misses that DELEGRETURN should release a file in the
same fashion. In fact, any nfsd_file_put() on an file held open
by an NFSv4 client needs potentially to release the file
immediately...

Clear the way for implementing that idea.

Signed-off-by: Chuck Lever <[email protected]>
---
fs/nfsd/filecache.c | 18 ------------------
fs/nfsd/filecache.h | 1 -
fs/nfsd/nfs4state.c | 4 ++--
3 files changed, 2 insertions(+), 21 deletions(-)

diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
index a2adfc247648..b7aa523c2010 100644
--- a/fs/nfsd/filecache.c
+++ b/fs/nfsd/filecache.c
@@ -444,24 +444,6 @@ nfsd_file_put(struct nfsd_file *nf)
nfsd_file_put_noref(nf);
}

-/**
- * nfsd_file_close - Close an nfsd_file
- * @nf: nfsd_file to close
- *
- * If this is the final reference for @nf, free it immediately.
- * This reflects an on-the-wire CLOSE or DELEGRETURN into the
- * VFS and exported filesystem.
- */
-void nfsd_file_close(struct nfsd_file *nf)
-{
- nfsd_file_put(nf);
- if (refcount_dec_if_one(&nf->nf_ref)) {
- nfsd_file_unhash(nf);
- nfsd_file_lru_remove(nf);
- nfsd_file_free(nf);
- }
-}
-
struct nfsd_file *
nfsd_file_get(struct nfsd_file *nf)
{
diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h
index 8e8c0c47d67d..f81c198f4ed6 100644
--- a/fs/nfsd/filecache.h
+++ b/fs/nfsd/filecache.h
@@ -52,7 +52,6 @@ void nfsd_file_cache_shutdown(void);
int nfsd_file_cache_start_net(struct net *net);
void nfsd_file_cache_shutdown_net(struct net *net);
void nfsd_file_put(struct nfsd_file *nf);
-void nfsd_file_close(struct nfsd_file *nf);
struct nfsd_file *nfsd_file_get(struct nfsd_file *nf);
void nfsd_file_close_inode_sync(struct inode *inode);
bool nfsd_file_is_cached(struct inode *inode);
diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index c5d199d7e6b4..2b850de288cf 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -820,9 +820,9 @@ static void __nfs4_file_put_access(struct nfs4_file *fp, int oflag)
swap(f2, fp->fi_fds[O_RDWR]);
spin_unlock(&fp->fi_lock);
if (f1)
- nfsd_file_close(f1);
+ nfsd_file_put(f1);
if (f2)
- nfsd_file_close(f2);
+ nfsd_file_put(f2);
}
}



2022-10-06 16:24:34

by Chuck Lever III

[permalink] [raw]
Subject: [PATCH v2 3/9] NFSD: Pass the target nfsd_file to nfsd_commit()

In a moment I'm going to introduce separate nfsd_file types, one of
which is garbage-collected; the other, not. The garbage-collected
variety is to be used by NFSv2 and v3, and the non-garbage-collected
variety is to be used by NFSv4.

nfsd_commit() is invoked by both NFSv3 and NFSv4 consumers. We want
nfsd_commit() to find and use the correct variety of cached
nfsd_file object for the NFS version that is in use.

This is a refactoring change. No behavior change is expected.

Signed-off-by: Chuck Lever <[email protected]>
---
fs/nfsd/nfs3proc.c | 10 +++++++++-
fs/nfsd/nfs4proc.c | 11 ++++++++++-
fs/nfsd/vfs.c | 15 ++++-----------
fs/nfsd/vfs.h | 3 ++-
4 files changed, 25 insertions(+), 14 deletions(-)

diff --git a/fs/nfsd/nfs3proc.c b/fs/nfsd/nfs3proc.c
index a41cca619338..d12823371857 100644
--- a/fs/nfsd/nfs3proc.c
+++ b/fs/nfsd/nfs3proc.c
@@ -13,6 +13,7 @@
#include "cache.h"
#include "xdr3.h"
#include "vfs.h"
+#include "filecache.h"

#define NFSDDBG_FACILITY NFSDDBG_PROC

@@ -770,6 +771,7 @@ nfsd3_proc_commit(struct svc_rqst *rqstp)
{
struct nfsd3_commitargs *argp = rqstp->rq_argp;
struct nfsd3_commitres *resp = rqstp->rq_resp;
+ struct nfsd_file *nf;

dprintk("nfsd: COMMIT(3) %s %u@%Lu\n",
SVCFH_fmt(&argp->fh),
@@ -777,8 +779,14 @@ nfsd3_proc_commit(struct svc_rqst *rqstp)
(unsigned long long) argp->offset);

fh_copy(&resp->fh, &argp->fh);
- resp->status = nfsd_commit(rqstp, &resp->fh, argp->offset,
+ resp->status = nfsd_file_acquire(rqstp, &resp->fh, NFSD_MAY_WRITE |
+ NFSD_MAY_NOT_BREAK_LEASE, &nf);
+ if (resp->status)
+ goto out;
+ resp->status = nfsd_commit(rqstp, &resp->fh, nf, argp->offset,
argp->count, resp->verf);
+ nfsd_file_put(nf);
+out:
return rpc_success;
}

diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
index a72ab97f77ef..d0d976f847ca 100644
--- a/fs/nfsd/nfs4proc.c
+++ b/fs/nfsd/nfs4proc.c
@@ -739,10 +739,19 @@ nfsd4_commit(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
union nfsd4_op_u *u)
{
struct nfsd4_commit *commit = &u->commit;
+ struct nfsd_file *nf;
+ __be32 status;

- return nfsd_commit(rqstp, &cstate->current_fh, commit->co_offset,
+ status = nfsd_file_acquire(rqstp, &cstate->current_fh, NFSD_MAY_WRITE |
+ NFSD_MAY_NOT_BREAK_LEASE, &nf);
+ if (status != nfs_ok)
+ return status;
+
+ status = nfsd_commit(rqstp, &cstate->current_fh, nf, commit->co_offset,
commit->co_count,
(__be32 *)commit->co_verf.data);
+ nfsd_file_put(nf);
+ return status;
}

static __be32
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index fc17b0ac8729..44f210ba17cc 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1108,6 +1108,7 @@ nfsd_write(struct svc_rqst *rqstp, struct svc_fh *fhp, loff_t offset,
* nfsd_commit - Commit pending writes to stable storage
* @rqstp: RPC request being processed
* @fhp: NFS filehandle
+ * @nf: target file
* @offset: raw offset from beginning of file
* @count: raw count of bytes to sync
* @verf: filled in with the server's current write verifier
@@ -1124,19 +1125,13 @@ nfsd_write(struct svc_rqst *rqstp, struct svc_fh *fhp, loff_t offset,
* An nfsstat value in network byte order.
*/
__be32
-nfsd_commit(struct svc_rqst *rqstp, struct svc_fh *fhp, u64 offset,
- u32 count, __be32 *verf)
+nfsd_commit(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf,
+ u64 offset, u32 count, __be32 *verf)
{
+ __be32 err = nfs_ok;
u64 maxbytes;
loff_t start, end;
struct nfsd_net *nn;
- struct nfsd_file *nf;
- __be32 err;
-
- err = nfsd_file_acquire(rqstp, fhp,
- NFSD_MAY_WRITE|NFSD_MAY_NOT_BREAK_LEASE, &nf);
- if (err)
- goto out;

/*
* Convert the client-provided (offset, count) range to a
@@ -1177,8 +1172,6 @@ nfsd_commit(struct svc_rqst *rqstp, struct svc_fh *fhp, u64 offset,
} else
nfsd_copy_write_verifier(verf, nn);

- nfsd_file_put(nf);
-out:
return err;
}

diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h
index c95cd414b4bb..04dfd80570f0 100644
--- a/fs/nfsd/vfs.h
+++ b/fs/nfsd/vfs.h
@@ -88,7 +88,8 @@ __be32 nfsd_access(struct svc_rqst *, struct svc_fh *, u32 *, u32 *);
__be32 nfsd_create_setattr(struct svc_rqst *rqstp, struct svc_fh *fhp,
struct svc_fh *resfhp, struct nfsd_attrs *iap);
__be32 nfsd_commit(struct svc_rqst *rqst, struct svc_fh *fhp,
- u64 offset, u32 count, __be32 *verf);
+ struct nfsd_file *nf, u64 offset, u32 count,
+ __be32 *verf);
#ifdef CONFIG_NFSD_V4
__be32 nfsd_getxattr(struct svc_rqst *rqstp, struct svc_fh *fhp,
char *name, void **bufp, int *lenp);


2022-10-06 16:24:36

by Chuck Lever III

[permalink] [raw]
Subject: [PATCH v2 6/9] NFSD: Use const pointers as parameters to fh_ helpers.

Enable callers to use const pointers where they are able to.

Reviewed-by: Jeff Layton <[email protected]>
Signed-off-by: Chuck Lever <[email protected]>
---
fs/nfsd/nfsfh.h | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h
index c3ae6414fc5c..513e028b0bbe 100644
--- a/fs/nfsd/nfsfh.h
+++ b/fs/nfsd/nfsfh.h
@@ -220,7 +220,7 @@ __be32 fh_update(struct svc_fh *);
void fh_put(struct svc_fh *);

static __inline__ struct svc_fh *
-fh_copy(struct svc_fh *dst, struct svc_fh *src)
+fh_copy(struct svc_fh *dst, const struct svc_fh *src)
{
WARN_ON(src->fh_dentry);

@@ -229,7 +229,7 @@ fh_copy(struct svc_fh *dst, struct svc_fh *src)
}

static inline void
-fh_copy_shallow(struct knfsd_fh *dst, struct knfsd_fh *src)
+fh_copy_shallow(struct knfsd_fh *dst, const struct knfsd_fh *src)
{
dst->fh_size = src->fh_size;
memcpy(&dst->fh_raw, &src->fh_raw, src->fh_size);
@@ -243,7 +243,8 @@ fh_init(struct svc_fh *fhp, int maxsize)
return fhp;
}

-static inline bool fh_match(struct knfsd_fh *fh1, struct knfsd_fh *fh2)
+static inline bool fh_match(const struct knfsd_fh *fh1,
+ const struct knfsd_fh *fh2)
{
if (fh1->fh_size != fh2->fh_size)
return false;
@@ -252,7 +253,8 @@ static inline bool fh_match(struct knfsd_fh *fh1, struct knfsd_fh *fh2)
return true;
}

-static inline bool fh_fsid_match(struct knfsd_fh *fh1, struct knfsd_fh *fh2)
+static inline bool fh_fsid_match(const struct knfsd_fh *fh1,
+ const struct knfsd_fh *fh2)
{
if (fh1->fh_fsid_type != fh2->fh_fsid_type)
return false;


2022-10-06 16:24:41

by Chuck Lever III

[permalink] [raw]
Subject: [PATCH v2 8/9] NFSD: Clean up nfs4_preprocess_stateid_op() call sites

Remove the lame-duck dprintk()s around nfs4_preprocess_stateid_op()
call sites.

Reviewed-by: Jeff Layton <[email protected]>
Signed-off-by: Chuck Lever <[email protected]>
---
fs/nfsd/nfs4proc.c | 31 +++++++------------------------
1 file changed, 7 insertions(+), 24 deletions(-)

diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
index d0d976f847ca..42b81e88ea14 100644
--- a/fs/nfsd/nfs4proc.c
+++ b/fs/nfsd/nfs4proc.c
@@ -951,12 +951,7 @@ nfsd4_read(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
status = nfs4_preprocess_stateid_op(rqstp, cstate, &cstate->current_fh,
&read->rd_stateid, RD_STATE,
&read->rd_nf, NULL);
- if (status) {
- dprintk("NFSD: nfsd4_read: couldn't process stateid!\n");
- goto out;
- }
- status = nfs_ok;
-out:
+
read->rd_rqstp = rqstp;
read->rd_fhp = &cstate->current_fh;
return status;
@@ -1125,10 +1120,8 @@ nfsd4_setattr(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
status = nfs4_preprocess_stateid_op(rqstp, cstate,
&cstate->current_fh, &setattr->sa_stateid,
WR_STATE, NULL, NULL);
- if (status) {
- dprintk("NFSD: nfsd4_setattr: couldn't process stateid!\n");
+ if (status)
return status;
- }
}
err = fh_want_write(&cstate->current_fh);
if (err)
@@ -1176,10 +1169,8 @@ nfsd4_write(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
write->wr_offset, cnt);
status = nfs4_preprocess_stateid_op(rqstp, cstate, &cstate->current_fh,
stateid, WR_STATE, &nf, NULL);
- if (status) {
- dprintk("NFSD: nfsd4_write: couldn't process stateid!\n");
+ if (status)
return status;
- }

write->wr_how_written = write->wr_stable_how;

@@ -1210,17 +1201,13 @@ nfsd4_verify_copy(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,

status = nfs4_preprocess_stateid_op(rqstp, cstate, &cstate->save_fh,
src_stateid, RD_STATE, src, NULL);
- if (status) {
- dprintk("NFSD: %s: couldn't process src stateid!\n", __func__);
+ if (status)
goto out;
- }

status = nfs4_preprocess_stateid_op(rqstp, cstate, &cstate->current_fh,
dst_stateid, WR_STATE, dst, NULL);
- if (status) {
- dprintk("NFSD: %s: couldn't process dst stateid!\n", __func__);
+ if (status)
goto out_put_src;
- }

/* fix up for NFS-specific error code */
if (!S_ISREG(file_inode((*src)->nf_file)->i_mode) ||
@@ -1955,10 +1942,8 @@ nfsd4_fallocate(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
status = nfs4_preprocess_stateid_op(rqstp, cstate, &cstate->current_fh,
&fallocate->falloc_stateid,
WR_STATE, &nf, NULL);
- if (status != nfs_ok) {
- dprintk("NFSD: nfsd4_fallocate: couldn't process stateid!\n");
+ if (status != nfs_ok)
return status;
- }

status = nfsd4_vfs_fallocate(rqstp, &cstate->current_fh, nf->nf_file,
fallocate->falloc_offset,
@@ -2014,10 +1999,8 @@ nfsd4_seek(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
status = nfs4_preprocess_stateid_op(rqstp, cstate, &cstate->current_fh,
&seek->seek_stateid,
RD_STATE, &nf, NULL);
- if (status) {
- dprintk("NFSD: nfsd4_seek: couldn't process stateid!\n");
+ if (status)
return status;
- }

switch (seek->seek_whence) {
case NFS4_CONTENT_DATA:


2022-10-06 16:24:41

by Chuck Lever III

[permalink] [raw]
Subject: [PATCH v2 5/9] NFSD: Add an NFSD_FILE_GC flag to enable nfsd_file garbage collection

NFSv4 operations manage the lifetime of nfsd_file items they use by
means of NFSv4 OPEN and CLOSE. Hence there's no need for them to be
garbage collected.

Introduce a mechanism to enable garbage collection for nfsd_file
items used only by NFSv2/3 callers.

Note that the change in nfsd_file_put() ensures that both CLOSE and
DELEGRETURN will actually close out and free an nfsd_file on last
reference of a non-garbage-collected file.

Link: https://bugzilla.linux-nfs.org/show_bug.cgi?id=394
Suggested-by: Trond Myklebust <[email protected]>
Signed-off-by: Chuck Lever <[email protected]>
---
fs/nfsd/filecache.c | 61 +++++++++++++++++++++++++++++++++++++++++++++------
fs/nfsd/filecache.h | 3 +++
fs/nfsd/nfs3proc.c | 4 ++-
fs/nfsd/trace.h | 3 ++-
fs/nfsd/vfs.c | 4 ++-
5 files changed, 63 insertions(+), 12 deletions(-)

diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
index b7aa523c2010..01c27deabf83 100644
--- a/fs/nfsd/filecache.c
+++ b/fs/nfsd/filecache.c
@@ -63,6 +63,7 @@ struct nfsd_file_lookup_key {
struct net *net;
const struct cred *cred;
unsigned char need;
+ unsigned char gc:1;
enum nfsd_file_lookup_type type;
};

@@ -162,6 +163,8 @@ static int nfsd_file_obj_cmpfn(struct rhashtable_compare_arg *arg,
return 1;
if (!nfsd_match_cred(nf->nf_cred, key->cred))
return 1;
+ if (test_bit(NFSD_FILE_GC, &nf->nf_flags) != key->gc)
+ return 1;
if (test_bit(NFSD_FILE_HASHED, &nf->nf_flags) == 0)
return 1;
break;
@@ -297,6 +300,8 @@ nfsd_file_alloc(struct nfsd_file_lookup_key *key, unsigned int may)
nf->nf_flags = 0;
__set_bit(NFSD_FILE_HASHED, &nf->nf_flags);
__set_bit(NFSD_FILE_PENDING, &nf->nf_flags);
+ if (key->gc)
+ __set_bit(NFSD_FILE_GC, &nf->nf_flags);
nf->nf_inode = key->inode;
/* nf_ref is pre-incremented for hash table */
refcount_set(&nf->nf_ref, 2);
@@ -428,16 +433,27 @@ nfsd_file_put_noref(struct nfsd_file *nf)
}
}

+static void
+nfsd_file_unhash_and_put(struct nfsd_file *nf)
+{
+ if (nfsd_file_unhash(nf))
+ nfsd_file_put_noref(nf);
+}
+
void
nfsd_file_put(struct nfsd_file *nf)
{
might_sleep();

- nfsd_file_lru_add(nf);
+ if (test_bit(NFSD_FILE_GC, &nf->nf_flags) == 1)
+ nfsd_file_lru_add(nf);
+ else if (refcount_read(&nf->nf_ref) == 2)
+ nfsd_file_unhash_and_put(nf);
+
if (test_bit(NFSD_FILE_HASHED, &nf->nf_flags) == 0) {
nfsd_file_flush(nf);
nfsd_file_put_noref(nf);
- } else if (nf->nf_file) {
+ } else if (nf->nf_file && test_bit(NFSD_FILE_GC, &nf->nf_flags) == 1) {
nfsd_file_put_noref(nf);
nfsd_file_schedule_laundrette();
} else
@@ -1017,12 +1033,14 @@ nfsd_file_is_cached(struct inode *inode)

static __be32
nfsd_file_do_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
- unsigned int may_flags, struct nfsd_file **pnf, bool open)
+ unsigned int may_flags, struct nfsd_file **pnf,
+ bool open, int want_gc)
{
struct nfsd_file_lookup_key key = {
.type = NFSD_FILE_KEY_FULL,
.need = may_flags & NFSD_FILE_MAY_MASK,
.net = SVC_NET(rqstp),
+ .gc = want_gc,
};
bool open_retry = true;
struct nfsd_file *nf;
@@ -1117,14 +1135,35 @@ nfsd_file_do_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
* then unhash.
*/
if (status != nfs_ok || key.inode->i_nlink == 0)
- if (nfsd_file_unhash(nf))
- nfsd_file_put_noref(nf);
+ nfsd_file_unhash_and_put(nf);
clear_bit_unlock(NFSD_FILE_PENDING, &nf->nf_flags);
smp_mb__after_atomic();
wake_up_bit(&nf->nf_flags, NFSD_FILE_PENDING);
goto out;
}

+/**
+ * nfsd_file_acquire_gc - Get a struct nfsd_file with an open file
+ * @rqstp: the RPC transaction being executed
+ * @fhp: the NFS filehandle of the file to be opened
+ * @may_flags: NFSD_MAY_ settings for the file
+ * @pnf: OUT: new or found "struct nfsd_file" object
+ *
+ * The nfsd_file object returned by this API is reference-counted
+ * and garbage-collected. The object is retained for a few
+ * seconds after the final nfsd_file_put() in case the caller
+ * wants to re-use it.
+ *
+ * Returns nfs_ok and sets @pnf on success; otherwise an nfsstat in
+ * network byte order is returned.
+ */
+__be32
+nfsd_file_acquire_gc(struct svc_rqst *rqstp, struct svc_fh *fhp,
+ unsigned int may_flags, struct nfsd_file **pnf)
+{
+ return nfsd_file_do_acquire(rqstp, fhp, may_flags, pnf, true, 1);
+}
+
/**
* nfsd_file_acquire - Get a struct nfsd_file with an open file
* @rqstp: the RPC transaction being executed
@@ -1132,6 +1171,10 @@ nfsd_file_do_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
* @may_flags: NFSD_MAY_ settings for the file
* @pnf: OUT: new or found "struct nfsd_file" object
*
+ * The nfsd_file_object returned by this API is reference-counted
+ * but not garbage-collected. The object is released one RCU grace
+ * period after the final nfsd_file_put().
+ *
* Returns nfs_ok and sets @pnf on success; otherwise an nfsstat in
* network byte order is returned.
*/
@@ -1139,7 +1182,7 @@ __be32
nfsd_file_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
unsigned int may_flags, struct nfsd_file **pnf)
{
- return nfsd_file_do_acquire(rqstp, fhp, may_flags, pnf, true);
+ return nfsd_file_do_acquire(rqstp, fhp, may_flags, pnf, true, 0);
}

/**
@@ -1149,6 +1192,10 @@ nfsd_file_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
* @may_flags: NFSD_MAY_ settings for the file
* @pnf: OUT: new or found "struct nfsd_file" object
*
+ * The nfsd_file_object returned by this API is reference-counted
+ * but not garbage-collected. The object is released immediately
+ * one RCU grace period after the final nfsd_file_put().
+ *
* Returns nfs_ok and sets @pnf on success; otherwise an nfsstat in
* network byte order is returned.
*/
@@ -1156,7 +1203,7 @@ __be32
nfsd_file_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
unsigned int may_flags, struct nfsd_file **pnf)
{
- return nfsd_file_do_acquire(rqstp, fhp, may_flags, pnf, false);
+ return nfsd_file_do_acquire(rqstp, fhp, may_flags, pnf, false, 0);
}

/*
diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h
index f81c198f4ed6..0f6546bcd3e0 100644
--- a/fs/nfsd/filecache.h
+++ b/fs/nfsd/filecache.h
@@ -38,6 +38,7 @@ struct nfsd_file {
#define NFSD_FILE_HASHED (0)
#define NFSD_FILE_PENDING (1)
#define NFSD_FILE_REFERENCED (2)
+#define NFSD_FILE_GC (3)
unsigned long nf_flags;
struct inode *nf_inode; /* don't deref */
refcount_t nf_ref;
@@ -55,6 +56,8 @@ void nfsd_file_put(struct nfsd_file *nf);
struct nfsd_file *nfsd_file_get(struct nfsd_file *nf);
void nfsd_file_close_inode_sync(struct inode *inode);
bool nfsd_file_is_cached(struct inode *inode);
+__be32 nfsd_file_acquire_gc(struct svc_rqst *rqstp, struct svc_fh *fhp,
+ unsigned int may_flags, struct nfsd_file **nfp);
__be32 nfsd_file_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
unsigned int may_flags, struct nfsd_file **nfp);
__be32 nfsd_file_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
diff --git a/fs/nfsd/nfs3proc.c b/fs/nfsd/nfs3proc.c
index d12823371857..6a5ad6c91d8e 100644
--- a/fs/nfsd/nfs3proc.c
+++ b/fs/nfsd/nfs3proc.c
@@ -779,8 +779,8 @@ nfsd3_proc_commit(struct svc_rqst *rqstp)
(unsigned long long) argp->offset);

fh_copy(&resp->fh, &argp->fh);
- resp->status = nfsd_file_acquire(rqstp, &resp->fh, NFSD_MAY_WRITE |
- NFSD_MAY_NOT_BREAK_LEASE, &nf);
+ resp->status = nfsd_file_acquire_gc(rqstp, &resp->fh, NFSD_MAY_WRITE |
+ NFSD_MAY_NOT_BREAK_LEASE, &nf);
if (resp->status)
goto out;
resp->status = nfsd_commit(rqstp, &resp->fh, nf, argp->offset,
diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
index 9ebd67d461f9..4921144880d3 100644
--- a/fs/nfsd/trace.h
+++ b/fs/nfsd/trace.h
@@ -742,7 +742,8 @@ DEFINE_CLID_EVENT(confirmed_r);
__print_flags(val, "|", \
{ 1 << NFSD_FILE_HASHED, "HASHED" }, \
{ 1 << NFSD_FILE_PENDING, "PENDING" }, \
- { 1 << NFSD_FILE_REFERENCED, "REFERENCED"})
+ { 1 << NFSD_FILE_REFERENCED, "REFERENCED"}, \
+ { 1 << NFSD_FILE_GC, "GC"})

DECLARE_EVENT_CLASS(nfsd_file_class,
TP_PROTO(struct nfsd_file *nf),
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 44f210ba17cc..89d682a56fc4 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1060,7 +1060,7 @@ __be32 nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
__be32 err;

trace_nfsd_read_start(rqstp, fhp, offset, *count);
- err = nfsd_file_acquire(rqstp, fhp, NFSD_MAY_READ, &nf);
+ err = nfsd_file_acquire_gc(rqstp, fhp, NFSD_MAY_READ, &nf);
if (err)
return err;

@@ -1092,7 +1092,7 @@ nfsd_write(struct svc_rqst *rqstp, struct svc_fh *fhp, loff_t offset,

trace_nfsd_write_start(rqstp, fhp, offset, *cnt);

- err = nfsd_file_acquire(rqstp, fhp, NFSD_MAY_WRITE, &nf);
+ err = nfsd_file_acquire_gc(rqstp, fhp, NFSD_MAY_WRITE, &nf);
if (err)
goto out;



2022-10-06 16:24:42

by Chuck Lever III

[permalink] [raw]
Subject: [PATCH v2 9/9] NFSD: Trace delegation revocations

Delegation revocation is an exceptional event that is not otherwise
visible externally (eg, no network traffic is emitted). Generate a
trace record when it occurs so that revocation can be observed or
other activity can be triggered. Example:

nfsd-1104 [005] 1912.002544: nfsd_stid_revoke: client 633c9343:4e82788d stateid 00000003:00000001 ref=2 type=DELEG

Trace infrastructure is provided for subsequent additional tracing
related to nfs4_stid activity.

Reviewed-by: Jeff Layton <[email protected]>
Signed-off-by: Chuck Lever <[email protected]>
---
fs/nfsd/nfs4state.c | 2 ++
fs/nfsd/trace.h | 55 +++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 57 insertions(+)

diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index bb52510a6a83..5d58e71f664d 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -1404,6 +1404,8 @@ static void revoke_delegation(struct nfs4_delegation *dp)

WARN_ON(!list_empty(&dp->dl_recall_lru));

+ trace_nfsd_stid_revoke(&dp->dl_stid);
+
if (clp->cl_minorversion) {
dp->dl_stid.sc_type = NFS4_REVOKED_DELEG_STID;
refcount_inc(&dp->dl_stid.sc_count);
diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
index 4921144880d3..23fb39c957af 100644
--- a/fs/nfsd/trace.h
+++ b/fs/nfsd/trace.h
@@ -561,6 +561,61 @@ DEFINE_EVENT(nfsd_stateseqid_class, nfsd_##name, \
DEFINE_STATESEQID_EVENT(preprocess);
DEFINE_STATESEQID_EVENT(open_confirm);

+TRACE_DEFINE_ENUM(NFS4_OPEN_STID);
+TRACE_DEFINE_ENUM(NFS4_LOCK_STID);
+TRACE_DEFINE_ENUM(NFS4_DELEG_STID);
+TRACE_DEFINE_ENUM(NFS4_CLOSED_STID);
+TRACE_DEFINE_ENUM(NFS4_REVOKED_DELEG_STID);
+TRACE_DEFINE_ENUM(NFS4_CLOSED_DELEG_STID);
+TRACE_DEFINE_ENUM(NFS4_LAYOUT_STID);
+
+#define show_stid_type(x) \
+ __print_flags(x, "|", \
+ { NFS4_OPEN_STID, "OPEN" }, \
+ { NFS4_LOCK_STID, "LOCK" }, \
+ { NFS4_DELEG_STID, "DELEG" }, \
+ { NFS4_CLOSED_STID, "CLOSED" }, \
+ { NFS4_REVOKED_DELEG_STID, "REVOKED" }, \
+ { NFS4_CLOSED_DELEG_STID, "CLOSED_DELEG" }, \
+ { NFS4_LAYOUT_STID, "LAYOUT" })
+
+DECLARE_EVENT_CLASS(nfsd_stid_class,
+ TP_PROTO(
+ const struct nfs4_stid *stid
+ ),
+ TP_ARGS(stid),
+ TP_STRUCT__entry(
+ __field(unsigned long, sc_type)
+ __field(int, sc_count)
+ __field(u32, cl_boot)
+ __field(u32, cl_id)
+ __field(u32, si_id)
+ __field(u32, si_generation)
+ ),
+ TP_fast_assign(
+ const stateid_t *stp = &stid->sc_stateid;
+
+ __entry->sc_type = stid->sc_type;
+ __entry->sc_count = refcount_read(&stid->sc_count);
+ __entry->cl_boot = stp->si_opaque.so_clid.cl_boot;
+ __entry->cl_id = stp->si_opaque.so_clid.cl_id;
+ __entry->si_id = stp->si_opaque.so_id;
+ __entry->si_generation = stp->si_generation;
+ ),
+ TP_printk("client %08x:%08x stateid %08x:%08x ref=%d type=%s",
+ __entry->cl_boot, __entry->cl_id,
+ __entry->si_id, __entry->si_generation,
+ __entry->sc_count, show_stid_type(__entry->sc_type)
+ )
+);
+
+#define DEFINE_STID_EVENT(name) \
+DEFINE_EVENT(nfsd_stid_class, nfsd_stid_##name, \
+ TP_PROTO(const struct nfs4_stid *stid), \
+ TP_ARGS(stid))
+
+DEFINE_STID_EVENT(revoke);
+
DECLARE_EVENT_CLASS(nfsd_clientid_class,
TP_PROTO(const clientid_t *clid),
TP_ARGS(clid),


2022-10-06 16:24:50

by Chuck Lever III

[permalink] [raw]
Subject: [PATCH v2 7/9] NFSD: Use rhashtable for managing nfs4_file objects

fh_match() is expensive to use for hash chains that contain more
than a few objects. With common workloads, I see multiple thousands
of objects stored in file_hashtbl[], which always has only 256
buckets.

Replace it with an rhashtable, which dynamically resizes its bucket
array to keep hash chains short.

This also enables the removal of the use of state_lock to serialize
operations on the new rhashtable.

The result is an improvement in the latency of NFSv4 operations
and the reduction of nfsd CPU utilization due to the cache misses
of walking long hash chains in file_hashtbl.

Signed-off-by: Chuck Lever <[email protected]>
---
fs/nfsd/nfs4state.c | 235 ++++++++++++++++++++++++++++++++++++---------------
fs/nfsd/state.h | 5 -
2 files changed, 165 insertions(+), 75 deletions(-)

diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index 2b850de288cf..bb52510a6a83 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -44,7 +44,9 @@
#include <linux/jhash.h>
#include <linux/string_helpers.h>
#include <linux/fsnotify.h>
+#include <linux/rhashtable.h>
#include <linux/nfs_ssc.h>
+
#include "xdr4.h"
#include "xdr4cb.h"
#include "vfs.h"
@@ -84,6 +86,7 @@ static bool check_for_locks(struct nfs4_file *fp, struct nfs4_lockowner *lowner)
static void nfs4_free_ol_stateid(struct nfs4_stid *stid);
void nfsd4_end_grace(struct nfsd_net *nn);
static void _free_cpntf_state_locked(struct nfsd_net *nn, struct nfs4_cpntf_state *cps);
+static void remove_nfs4_file(struct nfs4_file *fp);

/* Locking: */

@@ -577,11 +580,8 @@ static void nfsd4_free_file_rcu(struct rcu_head *rcu)
void
put_nfs4_file(struct nfs4_file *fi)
{
- might_lock(&state_lock);
-
- if (refcount_dec_and_lock(&fi->fi_ref, &state_lock)) {
- hlist_del_rcu(&fi->fi_hash);
- spin_unlock(&state_lock);
+ if (refcount_dec_and_test(&fi->fi_ref)) {
+ remove_nfs4_file(fi);
WARN_ON_ONCE(!list_empty(&fi->fi_clnt_odstate));
WARN_ON_ONCE(!list_empty(&fi->fi_delegations));
call_rcu(&fi->fi_rcu, nfsd4_free_file_rcu);
@@ -695,19 +695,85 @@ static unsigned int ownerstr_hashval(struct xdr_netobj *ownername)
return ret & OWNER_HASH_MASK;
}

-/* hash table for nfs4_file */
-#define FILE_HASH_BITS 8
-#define FILE_HASH_SIZE (1 << FILE_HASH_BITS)
+static struct rhashtable nfs4_file_rhashtbl ____cacheline_aligned_in_smp;

-static unsigned int file_hashval(struct svc_fh *fh)
+/*
+ * The returned hash value is based solely on the address of an in-code
+ * inode, a pointer to a slab-allocated object. The entropy in such a
+ * pointer is concentrated in its middle bits.
+ */
+static u32 nfs4_file_inode_hash(const struct inode *inode, u32 seed)
{
- struct inode *inode = d_inode(fh->fh_dentry);
+ unsigned long ptr = (unsigned long)inode;
+ u32 k;

- /* XXX: why not (here & in file cache) use inode? */
- return (unsigned int)hash_long(inode->i_ino, FILE_HASH_BITS);
+ k = ptr >> L1_CACHE_SHIFT;
+ k &= 0x00ffffff;
+ return jhash2(&k, 1, seed);
}

-static struct hlist_head file_hashtbl[FILE_HASH_SIZE];
+/**
+ * nfs4_file_key_hashfn - Compute the hash value of a lookup key
+ * @data: key on which to compute the hash value
+ * @len: rhash table's key_len parameter (unused)
+ * @seed: rhash table's random seed of the day
+ *
+ * Return value:
+ * Computed 32-bit hash value
+ */
+static u32 nfs4_file_key_hashfn(const void *data, u32 len, u32 seed)
+{
+ const struct svc_fh *fhp = data;
+
+ return nfs4_file_inode_hash(d_inode(fhp->fh_dentry), seed);
+}
+
+/**
+ * nfs4_file_obj_hashfn - Compute the hash value of an nfs4_file object
+ * @data: object on which to compute the hash value
+ * @len: rhash table's key_len parameter (unused)
+ * @seed: rhash table's random seed of the day
+ *
+ * Return value:
+ * Computed 32-bit hash value
+ */
+static u32 nfs4_file_obj_hashfn(const void *data, u32 len, u32 seed)
+{
+ const struct nfs4_file *fi = data;
+
+ return nfs4_file_inode_hash(fi->fi_inode, seed);
+}
+
+/**
+ * nfs4_file_obj_cmpfn - Match a cache item against search criteria
+ * @arg: search criteria
+ * @ptr: cache item to check
+ *
+ * Return values:
+ * %0 - Item matches search criteria
+ * %1 - Item does not match search criteria
+ */
+static int nfs4_file_obj_cmpfn(struct rhashtable_compare_arg *arg,
+ const void *ptr)
+{
+ const struct svc_fh *fhp = arg->key;
+ const struct nfs4_file *fi = ptr;
+
+ return fh_match(&fi->fi_fhandle, &fhp->fh_handle) ? 0 : 1;
+}
+
+static const struct rhashtable_params nfs4_file_rhash_params = {
+ .key_len = sizeof_field(struct nfs4_file, fi_inode),
+ .key_offset = offsetof(struct nfs4_file, fi_inode),
+ .head_offset = offsetof(struct nfs4_file, fi_rhash),
+ .hashfn = nfs4_file_key_hashfn,
+ .obj_hashfn = nfs4_file_obj_hashfn,
+ .obj_cmpfn = nfs4_file_obj_cmpfn,
+
+ /* Reduce resizing churn on light workloads */
+ .min_size = 512, /* buckets */
+ .automatic_shrinking = true,
+};

/*
* Check if courtesy clients have conflicting access and resolve it if possible
@@ -4251,11 +4317,8 @@ static struct nfs4_file *nfsd4_alloc_file(void)
}

/* OPEN Share state helper functions */
-static void nfsd4_init_file(struct svc_fh *fh, unsigned int hashval,
- struct nfs4_file *fp)
+static void init_nfs4_file(const struct svc_fh *fh, struct nfs4_file *fp)
{
- lockdep_assert_held(&state_lock);
-
refcount_set(&fp->fi_ref, 1);
spin_lock_init(&fp->fi_lock);
INIT_LIST_HEAD(&fp->fi_stateids);
@@ -4273,7 +4336,6 @@ static void nfsd4_init_file(struct svc_fh *fh, unsigned int hashval,
INIT_LIST_HEAD(&fp->fi_lo_states);
atomic_set(&fp->fi_lo_recalls, 0);
#endif
- hlist_add_head_rcu(&fp->fi_hash, &file_hashtbl[hashval]);
}

void
@@ -4626,71 +4688,92 @@ move_to_close_lru(struct nfs4_ol_stateid *s, struct net *net)
nfs4_put_stid(&last->st_stid);
}

-/* search file_hashtbl[] for file */
-static struct nfs4_file *
-find_file_locked(struct svc_fh *fh, unsigned int hashval)
-{
- struct nfs4_file *fp;
+/*
+ * On hash insertion, identify any inode aliases. They will
+ * all be in the same hash bucket because we hash by the
+ * address of the struct inode.
+ */
+static void check_nfs4_file_aliases(struct nfs4_file *new,
+ const struct svc_fh *fhp)
+{
+ struct rhashtable *ht = &nfs4_file_rhashtbl;
+ struct rhash_lock_head __rcu *const *bkt;
+ struct rhashtable_compare_arg arg = {
+ .ht = ht,
+ .key = fhp,
+ };
+ struct bucket_table *tbl;
+ struct rhash_head *he;
+ unsigned int hash;

- hlist_for_each_entry_rcu(fp, &file_hashtbl[hashval], fi_hash,
- lockdep_is_held(&state_lock)) {
- if (fh_match(&fp->fi_fhandle, &fh->fh_handle)) {
- if (refcount_inc_not_zero(&fp->fi_ref))
- return fp;
+ /*
+ * rhashtable avoids large buckets, thus this loop stays
+ * efficient.
+ */
+ rcu_read_lock();
+ tbl = rht_dereference_rcu(ht->tbl, ht);
+ hash = rht_key_hashfn(ht, tbl, fhp, nfs4_file_rhash_params);
+ bkt = rht_bucket(tbl, hash);
+ rht_for_each_rcu_from(he, rht_ptr_rcu(bkt), tbl, hash) {
+ struct nfs4_file *fi;
+
+ fi = rht_obj(ht, he);
+ if (nfs4_file_obj_cmpfn(&arg, fi) == 0)
+ continue;
+ if (d_inode(fhp->fh_dentry) == fi->fi_inode) {
+ fi->fi_aliased = true;
+ new->fi_aliased = true;
}
}
- return NULL;
-}
-
-static struct nfs4_file *insert_file(struct nfs4_file *new, struct svc_fh *fh,
- unsigned int hashval)
-{
- struct nfs4_file *fp;
- struct nfs4_file *ret = NULL;
- bool alias_found = false;
-
- spin_lock(&state_lock);
- hlist_for_each_entry_rcu(fp, &file_hashtbl[hashval], fi_hash,
- lockdep_is_held(&state_lock)) {
- if (fh_match(&fp->fi_fhandle, &fh->fh_handle)) {
- if (refcount_inc_not_zero(&fp->fi_ref))
- ret = fp;
- } else if (d_inode(fh->fh_dentry) == fp->fi_inode)
- fp->fi_aliased = alias_found = true;
- }
- if (likely(ret == NULL)) {
- nfsd4_init_file(fh, hashval, new);
- new->fi_aliased = alias_found;
- ret = new;
- }
- spin_unlock(&state_lock);
- return ret;
+ rcu_read_unlock();
}

-static struct nfs4_file * find_file(struct svc_fh *fh)
+static struct nfs4_file *find_nfs4_file(const struct svc_fh *fhp)
{
- struct nfs4_file *fp;
- unsigned int hashval = file_hashval(fh);
+ struct nfs4_file *fi;

rcu_read_lock();
- fp = find_file_locked(fh, hashval);
+ fi = rhashtable_lookup(&nfs4_file_rhashtbl, fhp,
+ nfs4_file_rhash_params);
+ if (fi)
+ if (!refcount_inc_not_zero(&fi->fi_ref))
+ fi = NULL;
rcu_read_unlock();
- return fp;
+ return fi;
}

static struct nfs4_file *
-find_or_add_file(struct nfs4_file *new, struct svc_fh *fh)
+find_or_insert_nfs4_file(struct nfs4_file *new, const struct svc_fh *fhp)
{
- struct nfs4_file *fp;
- unsigned int hashval = file_hashval(fh);
+ struct nfs4_file *fi;
+ int err;

- rcu_read_lock();
- fp = find_file_locked(fh, hashval);
- rcu_read_unlock();
- if (fp)
- return fp;
+ while (true) {
+ fi = find_nfs4_file(fhp);
+ if (fi)
+ break;
+
+ init_nfs4_file(fhp, new);
+ err = rhashtable_lookup_insert_key(&nfs4_file_rhashtbl, fhp,
+ &new->fi_rhash,
+ nfs4_file_rhash_params);
+ if (!err) {
+ fi = new;
+ check_nfs4_file_aliases(fi, fhp);
+ break;
+ }
+ if (err != -EEXIST) {
+ fi = NULL;
+ break;
+ }
+ }
+ return fi;
+}

- return insert_file(new, fh, hashval);
+static void remove_nfs4_file(struct nfs4_file *fi)
+{
+ rhashtable_remove_fast(&nfs4_file_rhashtbl, &fi->fi_rhash,
+ nfs4_file_rhash_params);
}

/*
@@ -4703,9 +4786,10 @@ nfs4_share_conflict(struct svc_fh *current_fh, unsigned int deny_type)
struct nfs4_file *fp;
__be32 ret = nfs_ok;

- fp = find_file(current_fh);
+ fp = find_nfs4_file(current_fh);
if (!fp)
return ret;
+
/* Check for conflicting share reservations */
spin_lock(&fp->fi_lock);
if (fp->fi_share_deny & deny_type)
@@ -5548,7 +5632,9 @@ nfsd4_process_open2(struct svc_rqst *rqstp, struct svc_fh *current_fh, struct nf
* and check for delegations in the process of being recalled.
* If not found, create the nfs4_file struct
*/
- fp = find_or_add_file(open->op_file, current_fh);
+ fp = find_or_insert_nfs4_file(open->op_file, current_fh);
+ if (unlikely(!fp))
+ return nfserr_jukebox;
if (fp != open->op_file) {
status = nfs4_check_deleg(cl, open, &dp);
if (status)
@@ -7905,10 +7991,16 @@ nfs4_state_start(void)
{
int ret;

- ret = nfsd4_create_callback_queue();
+ ret = rhashtable_init(&nfs4_file_rhashtbl, &nfs4_file_rhash_params);
if (ret)
return ret;

+ ret = nfsd4_create_callback_queue();
+ if (ret) {
+ rhashtable_destroy(&nfs4_file_rhashtbl);
+ return ret;
+ }
+
set_max_delegations();
return 0;
}
@@ -7939,6 +8031,7 @@ nfs4_state_shutdown_net(struct net *net)

nfsd4_client_tracking_exit(net);
nfs4_state_destroy_net(net);
+ rhashtable_destroy(&nfs4_file_rhashtbl);
#ifdef CONFIG_NFSD_V4_2_INTER_SSC
nfsd4_ssc_shutdown_umount(nn);
#endif
diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h
index ae596dbf8667..879f085bc39e 100644
--- a/fs/nfsd/state.h
+++ b/fs/nfsd/state.h
@@ -536,16 +536,13 @@ struct nfs4_clnt_odstate {
* inode can have multiple filehandles associated with it, so there is
* (potentially) a many to one relationship between this struct and struct
* inode.
- *
- * These are hashed by filehandle in the file_hashtbl, which is protected by
- * the global state_lock spinlock.
*/
struct nfs4_file {
refcount_t fi_ref;
struct inode * fi_inode;
bool fi_aliased;
spinlock_t fi_lock;
- struct hlist_node fi_hash; /* hash on fi_fhandle */
+ struct rhash_head fi_rhash;
struct list_head fi_stateids;
union {
struct list_head fi_delegations;


2022-10-07 13:21:58

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH v2 0/9] A course adjustment, maybe...

On Thu, 2022-10-06 at 12:20 -0400, Chuck Lever wrote:
> I'm proposing this series as the first NFSD-related patchset to go
> into v6.2 (for-next), though I haven't opened that yet.
>
> For quite some time, we've been encouraged to disable filecache
> garbage collection for NFSv4 files, and I think I found a surgical
> way to do just that. That is presented in "NFSD: Add an NFSD_FILE_GC
> flag to enable nfsd_file garbage collection".
>
> Comments and opinions are welcome.
>
> Changes since RFC:
> - checking nfs4_files for inode aliases is now done only on hash
> insertion
> - the nfs4_file reference count is now bumped only while the RCU
> read lock is held
> - comments and function names have been revised and clarified
>
> I haven't updated the new @want_gc parameter... jury is still out.
>

It was just a nit I noticed since it looked like it was being used as a
bool. If you think it needs to be an int, then so be it.


> ---
>
> Chuck Lever (7):
> NFSD: Pass the target nfsd_file to nfsd_commit()
> NFSD: Revert "NFSD: NFSv4 CLOSE should release an nfsd_file immediately"
> NFSD: Add an NFSD_FILE_GC flag to enable nfsd_file garbage collection
> NFSD: Use const pointers as parameters to fh_ helpers.
> NFSD: Use rhashtable for managing nfs4_file objects
> NFSD: Clean up nfs4_preprocess_stateid_op() call sites
> NFSD: Trace delegation revocations
>
> Jeff Layton (2):
> nfsd: fix nfsd_file_unhash_and_dispose
> nfsd: rework hashtable handling in nfsd_do_file_acquire
>
>
> fs/nfsd/filecache.c | 165 +++++++++++++++---------------
> fs/nfsd/filecache.h | 4 +-
> fs/nfsd/nfs3proc.c | 10 +-
> fs/nfsd/nfs4proc.c | 42 ++++----
> fs/nfsd/nfs4state.c | 241 ++++++++++++++++++++++++++++++--------------
> fs/nfsd/nfsfh.h | 10 +-
> fs/nfsd/state.h | 5 +-
> fs/nfsd/trace.h | 58 ++++++++++-
> fs/nfsd/vfs.c | 19 ++--
> fs/nfsd/vfs.h | 3 +-
> 10 files changed, 351 insertions(+), 206 deletions(-)


I been doing some testing with this and it seems to be working well. You
can add:

Tested-by: Jeff Layton <[email protected]>

2022-10-07 14:23:42

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH v2 0/9] A course adjustment, maybe...



> On Oct 7, 2022, at 9:19 AM, Jeff Layton <[email protected]> wrote:
>
> On Thu, 2022-10-06 at 12:20 -0400, Chuck Lever wrote:
>> I'm proposing this series as the first NFSD-related patchset to go
>> into v6.2 (for-next), though I haven't opened that yet.
>>
>> For quite some time, we've been encouraged to disable filecache
>> garbage collection for NFSv4 files, and I think I found a surgical
>> way to do just that. That is presented in "NFSD: Add an NFSD_FILE_GC
>> flag to enable nfsd_file garbage collection".
>>
>> Comments and opinions are welcome.
>>
>> Changes since RFC:
>> - checking nfs4_files for inode aliases is now done only on hash
>> insertion
>> - the nfs4_file reference count is now bumped only while the RCU
>> read lock is held
>> - comments and function names have been revised and clarified
>>
>> I haven't updated the new @want_gc parameter... jury is still out.
>>
>
> It was just a nit I noticed since it looked like it was being used as a
> bool. If you think it needs to be an int, then so be it.

I prefer bool in these applications, but in this case, I think "int"
is the safer choice. Thanks for your review!


>> ---
>>
>> Chuck Lever (7):
>> NFSD: Pass the target nfsd_file to nfsd_commit()
>> NFSD: Revert "NFSD: NFSv4 CLOSE should release an nfsd_file immediately"
>> NFSD: Add an NFSD_FILE_GC flag to enable nfsd_file garbage collection
>> NFSD: Use const pointers as parameters to fh_ helpers.
>> NFSD: Use rhashtable for managing nfs4_file objects
>> NFSD: Clean up nfs4_preprocess_stateid_op() call sites
>> NFSD: Trace delegation revocations
>>
>> Jeff Layton (2):
>> nfsd: fix nfsd_file_unhash_and_dispose
>> nfsd: rework hashtable handling in nfsd_do_file_acquire
>>
>>
>> fs/nfsd/filecache.c | 165 +++++++++++++++---------------
>> fs/nfsd/filecache.h | 4 +-
>> fs/nfsd/nfs3proc.c | 10 +-
>> fs/nfsd/nfs4proc.c | 42 ++++----
>> fs/nfsd/nfs4state.c | 241 ++++++++++++++++++++++++++++++--------------
>> fs/nfsd/nfsfh.h | 10 +-
>> fs/nfsd/state.h | 5 +-
>> fs/nfsd/trace.h | 58 ++++++++++-
>> fs/nfsd/vfs.c | 19 ++--
>> fs/nfsd/vfs.h | 3 +-
>> 10 files changed, 351 insertions(+), 206 deletions(-)
>
>
> I been doing some testing with this and it seems to be working well. You
> can add:
>
> Tested-by: Jeff Layton <[email protected]>

--
Chuck Lever



2022-10-10 23:59:36

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH v2 5/9] NFSD: Add an NFSD_FILE_GC flag to enable nfsd_file garbage collection

On Fri, 07 Oct 2022, Chuck Lever wrote:
> NFSv4 operations manage the lifetime of nfsd_file items they use by
> means of NFSv4 OPEN and CLOSE. Hence there's no need for them to be
> garbage collected.
>
> Introduce a mechanism to enable garbage collection for nfsd_file
> items used only by NFSv2/3 callers.
>
> Note that the change in nfsd_file_put() ensures that both CLOSE and
> DELEGRETURN will actually close out and free an nfsd_file on last
> reference of a non-garbage-collected file.
>
> Link: https://bugzilla.linux-nfs.org/show_bug.cgi?id=394
> Suggested-by: Trond Myklebust <[email protected]>
> Signed-off-by: Chuck Lever <[email protected]>
> ---
> fs/nfsd/filecache.c | 61 +++++++++++++++++++++++++++++++++++++++++++++------
> fs/nfsd/filecache.h | 3 +++
> fs/nfsd/nfs3proc.c | 4 ++-
> fs/nfsd/trace.h | 3 ++-
> fs/nfsd/vfs.c | 4 ++-
> 5 files changed, 63 insertions(+), 12 deletions(-)
>
> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> index b7aa523c2010..01c27deabf83 100644
> --- a/fs/nfsd/filecache.c
> +++ b/fs/nfsd/filecache.c
> @@ -63,6 +63,7 @@ struct nfsd_file_lookup_key {
> struct net *net;
> const struct cred *cred;
> unsigned char need;
> + unsigned char gc:1;
> enum nfsd_file_lookup_type type;
> };
>
> @@ -162,6 +163,8 @@ static int nfsd_file_obj_cmpfn(struct rhashtable_compare_arg *arg,
> return 1;
> if (!nfsd_match_cred(nf->nf_cred, key->cred))
> return 1;
> + if (test_bit(NFSD_FILE_GC, &nf->nf_flags) != key->gc)
> + return 1;
> if (test_bit(NFSD_FILE_HASHED, &nf->nf_flags) == 0)
> return 1;
> break;
> @@ -297,6 +300,8 @@ nfsd_file_alloc(struct nfsd_file_lookup_key *key, unsigned int may)
> nf->nf_flags = 0;
> __set_bit(NFSD_FILE_HASHED, &nf->nf_flags);
> __set_bit(NFSD_FILE_PENDING, &nf->nf_flags);
> + if (key->gc)
> + __set_bit(NFSD_FILE_GC, &nf->nf_flags);
> nf->nf_inode = key->inode;
> /* nf_ref is pre-incremented for hash table */
> refcount_set(&nf->nf_ref, 2);
> @@ -428,16 +433,27 @@ nfsd_file_put_noref(struct nfsd_file *nf)
> }
> }
>
> +static void
> +nfsd_file_unhash_and_put(struct nfsd_file *nf)
> +{
> + if (nfsd_file_unhash(nf))
> + nfsd_file_put_noref(nf);
> +}
> +
> void
> nfsd_file_put(struct nfsd_file *nf)
> {
> might_sleep();
>
> - nfsd_file_lru_add(nf);
> + if (test_bit(NFSD_FILE_GC, &nf->nf_flags) == 1)
> + nfsd_file_lru_add(nf);
> + else if (refcount_read(&nf->nf_ref) == 2)
> + nfsd_file_unhash_and_put(nf);
> +
> if (test_bit(NFSD_FILE_HASHED, &nf->nf_flags) == 0) {
> nfsd_file_flush(nf);
> nfsd_file_put_noref(nf);
> - } else if (nf->nf_file) {
> + } else if (nf->nf_file && test_bit(NFSD_FILE_GC, &nf->nf_flags) == 1) {
> nfsd_file_put_noref(nf);
> nfsd_file_schedule_laundrette();
> } else
> @@ -1017,12 +1033,14 @@ nfsd_file_is_cached(struct inode *inode)
>
> static __be32
> nfsd_file_do_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
> - unsigned int may_flags, struct nfsd_file **pnf, bool open)
> + unsigned int may_flags, struct nfsd_file **pnf,
> + bool open, int want_gc)
> {
> struct nfsd_file_lookup_key key = {
> .type = NFSD_FILE_KEY_FULL,
> .need = may_flags & NFSD_FILE_MAY_MASK,
> .net = SVC_NET(rqstp),
> + .gc = want_gc,
> };
> bool open_retry = true;
> struct nfsd_file *nf;
> @@ -1117,14 +1135,35 @@ nfsd_file_do_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
> * then unhash.
> */
> if (status != nfs_ok || key.inode->i_nlink == 0)
> - if (nfsd_file_unhash(nf))
> - nfsd_file_put_noref(nf);
> + nfsd_file_unhash_and_put(nf);
> clear_bit_unlock(NFSD_FILE_PENDING, &nf->nf_flags);
> smp_mb__after_atomic();
> wake_up_bit(&nf->nf_flags, NFSD_FILE_PENDING);
> goto out;
> }
>
> +/**
> + * nfsd_file_acquire_gc - Get a struct nfsd_file with an open file
> + * @rqstp: the RPC transaction being executed
> + * @fhp: the NFS filehandle of the file to be opened
> + * @may_flags: NFSD_MAY_ settings for the file
> + * @pnf: OUT: new or found "struct nfsd_file" object
> + *
> + * The nfsd_file object returned by this API is reference-counted
> + * and garbage-collected. The object is retained for a few
> + * seconds after the final nfsd_file_put() in case the caller
> + * wants to re-use it.
> + *
> + * Returns nfs_ok and sets @pnf on success; otherwise an nfsstat in
> + * network byte order is returned.
> + */
> +__be32
> +nfsd_file_acquire_gc(struct svc_rqst *rqstp, struct svc_fh *fhp,
> + unsigned int may_flags, struct nfsd_file **pnf)
> +{
> + return nfsd_file_do_acquire(rqstp, fhp, may_flags, pnf, true, 1);
> +}
> +
> /**
> * nfsd_file_acquire - Get a struct nfsd_file with an open file
> * @rqstp: the RPC transaction being executed
> @@ -1132,6 +1171,10 @@ nfsd_file_do_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
> * @may_flags: NFSD_MAY_ settings for the file
> * @pnf: OUT: new or found "struct nfsd_file" object
> *
> + * The nfsd_file_object returned by this API is reference-counted
> + * but not garbage-collected. The object is released one RCU grace
> + * period after the final nfsd_file_put().
> + *
> * Returns nfs_ok and sets @pnf on success; otherwise an nfsstat in
> * network byte order is returned.
> */
> @@ -1139,7 +1182,7 @@ __be32
> nfsd_file_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
> unsigned int may_flags, struct nfsd_file **pnf)
> {
> - return nfsd_file_do_acquire(rqstp, fhp, may_flags, pnf, true);
> + return nfsd_file_do_acquire(rqstp, fhp, may_flags, pnf, true, 0);
> }
>
> /**
> @@ -1149,6 +1192,10 @@ nfsd_file_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
> * @may_flags: NFSD_MAY_ settings for the file
> * @pnf: OUT: new or found "struct nfsd_file" object
> *
> + * The nfsd_file_object returned by this API is reference-counted
> + * but not garbage-collected. The object is released immediately
> + * one RCU grace period after the final nfsd_file_put().

While that last sentence is correct, I think it is missing the point.
The reference to grace periods is really an internal implementation
details.

The important point is that on final nfsd_file_put(), the object is
removed from the hash table. This contrasts with the other version
where on final nfsd_file_put(), the object is added to the lru.

The code all seems good, just the comment seemed odd.

Thanks,
NeilBrown


> + *
> * Returns nfs_ok and sets @pnf on success; otherwise an nfsstat in
> * network byte order is returned.
> */
> @@ -1156,7 +1203,7 @@ __be32
> nfsd_file_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
> unsigned int may_flags, struct nfsd_file **pnf)
> {
> - return nfsd_file_do_acquire(rqstp, fhp, may_flags, pnf, false);
> + return nfsd_file_do_acquire(rqstp, fhp, may_flags, pnf, false, 0);
> }
>
> /*
> diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h
> index f81c198f4ed6..0f6546bcd3e0 100644
> --- a/fs/nfsd/filecache.h
> +++ b/fs/nfsd/filecache.h
> @@ -38,6 +38,7 @@ struct nfsd_file {
> #define NFSD_FILE_HASHED (0)
> #define NFSD_FILE_PENDING (1)
> #define NFSD_FILE_REFERENCED (2)
> +#define NFSD_FILE_GC (3)
> unsigned long nf_flags;
> struct inode *nf_inode; /* don't deref */
> refcount_t nf_ref;
> @@ -55,6 +56,8 @@ void nfsd_file_put(struct nfsd_file *nf);
> struct nfsd_file *nfsd_file_get(struct nfsd_file *nf);
> void nfsd_file_close_inode_sync(struct inode *inode);
> bool nfsd_file_is_cached(struct inode *inode);
> +__be32 nfsd_file_acquire_gc(struct svc_rqst *rqstp, struct svc_fh *fhp,
> + unsigned int may_flags, struct nfsd_file **nfp);
> __be32 nfsd_file_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
> unsigned int may_flags, struct nfsd_file **nfp);
> __be32 nfsd_file_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
> diff --git a/fs/nfsd/nfs3proc.c b/fs/nfsd/nfs3proc.c
> index d12823371857..6a5ad6c91d8e 100644
> --- a/fs/nfsd/nfs3proc.c
> +++ b/fs/nfsd/nfs3proc.c
> @@ -779,8 +779,8 @@ nfsd3_proc_commit(struct svc_rqst *rqstp)
> (unsigned long long) argp->offset);
>
> fh_copy(&resp->fh, &argp->fh);
> - resp->status = nfsd_file_acquire(rqstp, &resp->fh, NFSD_MAY_WRITE |
> - NFSD_MAY_NOT_BREAK_LEASE, &nf);
> + resp->status = nfsd_file_acquire_gc(rqstp, &resp->fh, NFSD_MAY_WRITE |
> + NFSD_MAY_NOT_BREAK_LEASE, &nf);
> if (resp->status)
> goto out;
> resp->status = nfsd_commit(rqstp, &resp->fh, nf, argp->offset,
> diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
> index 9ebd67d461f9..4921144880d3 100644
> --- a/fs/nfsd/trace.h
> +++ b/fs/nfsd/trace.h
> @@ -742,7 +742,8 @@ DEFINE_CLID_EVENT(confirmed_r);
> __print_flags(val, "|", \
> { 1 << NFSD_FILE_HASHED, "HASHED" }, \
> { 1 << NFSD_FILE_PENDING, "PENDING" }, \
> - { 1 << NFSD_FILE_REFERENCED, "REFERENCED"})
> + { 1 << NFSD_FILE_REFERENCED, "REFERENCED"}, \
> + { 1 << NFSD_FILE_GC, "GC"})
>
> DECLARE_EVENT_CLASS(nfsd_file_class,
> TP_PROTO(struct nfsd_file *nf),
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 44f210ba17cc..89d682a56fc4 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -1060,7 +1060,7 @@ __be32 nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> __be32 err;
>
> trace_nfsd_read_start(rqstp, fhp, offset, *count);
> - err = nfsd_file_acquire(rqstp, fhp, NFSD_MAY_READ, &nf);
> + err = nfsd_file_acquire_gc(rqstp, fhp, NFSD_MAY_READ, &nf);
> if (err)
> return err;
>
> @@ -1092,7 +1092,7 @@ nfsd_write(struct svc_rqst *rqstp, struct svc_fh *fhp, loff_t offset,
>
> trace_nfsd_write_start(rqstp, fhp, offset, *cnt);
>
> - err = nfsd_file_acquire(rqstp, fhp, NFSD_MAY_WRITE, &nf);
> + err = nfsd_file_acquire_gc(rqstp, fhp, NFSD_MAY_WRITE, &nf);
> if (err)
> goto out;
>
>
>
>

2022-10-11 00:23:36

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH v2 7/9] NFSD: Use rhashtable for managing nfs4_file objects

On Fri, 07 Oct 2022, Chuck Lever wrote:
>
> -static unsigned int file_hashval(struct svc_fh *fh)
> +/*
> + * The returned hash value is based solely on the address of an in-code
> + * inode, a pointer to a slab-allocated object. The entropy in such a
> + * pointer is concentrated in its middle bits.

I think you need more justification than that for throwing away some of
the entropy, even if you don't think it is much.
Presumably you think hashing 32 bits is faster than hashing 64 bits.
Maybe it is, but is it worth it?

rhashtable depends heavily on having a strong hash function. In
particular if any bucket ends up with more than 16 elements it will
choose a new seed and rehash. If you deliberately remove some bits that
it might have been used to spread those 16 out, then you are asking for
trouble.

We know that two different file handles can refer to the same inode
("rarely"), and you deliberately place them in the same hash bucket.
So if an attacker arranges to access 17 files with the same inode but
different file handles, then the hashtable will be continuously
rehashed.

The preferred approach when you require things to share a hash chain is
to use an rhl table. This allows multiple instances with the same key.
You would then key the rhl-table with the inode, and search a
linked-list to find the entry with the desired file handle. This would
be no worse in search time than the current code for aliased inodes, but
less susceptible to attack.

> +/**
> + * nfs4_file_obj_cmpfn - Match a cache item against search criteria
> + * @arg: search criteria
> + * @ptr: cache item to check
> + *
> + * Return values:
> + * %0 - Item matches search criteria
> + * %1 - Item does not match search criteria

I *much* prefer %-ESRCH for "does not match search criteria". It is
self-documenting. Any non-zero value will do.

NeilBrown

2022-10-11 13:04:53

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH v2 7/9] NFSD: Use rhashtable for managing nfs4_file objects


> On Oct 10, 2022, at 8:16 PM, NeilBrown <[email protected]> wrote:
>
> On Fri, 07 Oct 2022, Chuck Lever wrote:
>>
>> -static unsigned int file_hashval(struct svc_fh *fh)
>> +/*
>> + * The returned hash value is based solely on the address of an in-code
>> + * inode, a pointer to a slab-allocated object. The entropy in such a
>> + * pointer is concentrated in its middle bits.
>
> I think you need more justification than that for throwing away some of
> the entropy, even if you don't think it is much.

We might have that justification:

https://lore.kernel.org/linux-nfs/YrUFbLJ5uVbWtZbf@ZenIV/

Actually I believe we are not discarding /any/ entropy in
this function. The bits we discard are invariant.

And, note that this is exactly the same situation we just merged
in the filecache overhaul, and is a common trope amongst other
hash tables that base their function on the inode's address.


> Presumably you think hashing 32 bits is faster than hashing 64 bits.
> Maybe it is, but is it worth it?
>
> rhashtable depends heavily on having a strong hash function. In
> particular if any bucket ends up with more than 16 elements it will
> choose a new seed and rehash. If you deliberately remove some bits that
> it might have been used to spread those 16 out, then you are asking for
> trouble.
>
> We know that two different file handles can refer to the same inode
> ("rarely"), and you deliberately place them in the same hash bucket.
> So if an attacker arranges to access 17 files with the same inode but
> different file handles, then the hashtable will be continuously
> rehashed.
>
> The preferred approach when you require things to share a hash chain is
> to use an rhl table.

Again, this is the same situation for the filecache. Do you
believe it is worth reworking that? I'm guessing "yes".


> This allows multiple instances with the same key.
> You would then key the rhl-table with the inode, and search a
> linked-list to find the entry with the desired file handle. This would
> be no worse in search time than the current code for aliased inodes, but
> less susceptible to attack.
>
>> +/**
>> + * nfs4_file_obj_cmpfn - Match a cache item against search criteria
>> + * @arg: search criteria
>> + * @ptr: cache item to check
>> + *
>> + * Return values:
>> + * %0 - Item matches search criteria
>> + * %1 - Item does not match search criteria
>
> I *much* prefer %-ESRCH for "does not match search criteria". It is
> self-documenting. Any non-zero value will do.

Noted, but returning 1 appears to be the typical arrangement for
existing obj_cmpfn methods in most other areas.


--
Chuck Lever



2022-10-11 13:18:06

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH v2 5/9] NFSD: Add an NFSD_FILE_GC flag to enable nfsd_file garbage collection


> On Oct 10, 2022, at 7:50 PM, NeilBrown <[email protected]> wrote:
>
> On Fri, 07 Oct 2022, Chuck Lever wrote:
>> NFSv4 operations manage the lifetime of nfsd_file items they use by
>> means of NFSv4 OPEN and CLOSE. Hence there's no need for them to be
>> garbage collected.
>>
>> Introduce a mechanism to enable garbage collection for nfsd_file
>> items used only by NFSv2/3 callers.
>>
>> Note that the change in nfsd_file_put() ensures that both CLOSE and
>> DELEGRETURN will actually close out and free an nfsd_file on last
>> reference of a non-garbage-collected file.
>>
>> Link: https://bugzilla.linux-nfs.org/show_bug.cgi?id=394
>> Suggested-by: Trond Myklebust <[email protected]>
>> Signed-off-by: Chuck Lever <[email protected]>
>> ---
>> fs/nfsd/filecache.c | 61 +++++++++++++++++++++++++++++++++++++++++++++------
>> fs/nfsd/filecache.h | 3 +++
>> fs/nfsd/nfs3proc.c | 4 ++-
>> fs/nfsd/trace.h | 3 ++-
>> fs/nfsd/vfs.c | 4 ++-
>> 5 files changed, 63 insertions(+), 12 deletions(-)
>>
>> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
>> index b7aa523c2010..01c27deabf83 100644
>> --- a/fs/nfsd/filecache.c
>> +++ b/fs/nfsd/filecache.c
>> @@ -63,6 +63,7 @@ struct nfsd_file_lookup_key {
>> struct net *net;
>> const struct cred *cred;
>> unsigned char need;
>> + unsigned char gc:1;
>> enum nfsd_file_lookup_type type;
>> };
>>
>> @@ -162,6 +163,8 @@ static int nfsd_file_obj_cmpfn(struct rhashtable_compare_arg *arg,
>> return 1;
>> if (!nfsd_match_cred(nf->nf_cred, key->cred))
>> return 1;
>> + if (test_bit(NFSD_FILE_GC, &nf->nf_flags) != key->gc)
>> + return 1;
>> if (test_bit(NFSD_FILE_HASHED, &nf->nf_flags) == 0)
>> return 1;
>> break;
>> @@ -297,6 +300,8 @@ nfsd_file_alloc(struct nfsd_file_lookup_key *key, unsigned int may)
>> nf->nf_flags = 0;
>> __set_bit(NFSD_FILE_HASHED, &nf->nf_flags);
>> __set_bit(NFSD_FILE_PENDING, &nf->nf_flags);
>> + if (key->gc)
>> + __set_bit(NFSD_FILE_GC, &nf->nf_flags);
>> nf->nf_inode = key->inode;
>> /* nf_ref is pre-incremented for hash table */
>> refcount_set(&nf->nf_ref, 2);
>> @@ -428,16 +433,27 @@ nfsd_file_put_noref(struct nfsd_file *nf)
>> }
>> }
>>
>> +static void
>> +nfsd_file_unhash_and_put(struct nfsd_file *nf)
>> +{
>> + if (nfsd_file_unhash(nf))
>> + nfsd_file_put_noref(nf);
>> +}
>> +
>> void
>> nfsd_file_put(struct nfsd_file *nf)
>> {
>> might_sleep();
>>
>> - nfsd_file_lru_add(nf);
>> + if (test_bit(NFSD_FILE_GC, &nf->nf_flags) == 1)
>> + nfsd_file_lru_add(nf);
>> + else if (refcount_read(&nf->nf_ref) == 2)
>> + nfsd_file_unhash_and_put(nf);
>> +
>> if (test_bit(NFSD_FILE_HASHED, &nf->nf_flags) == 0) {
>> nfsd_file_flush(nf);
>> nfsd_file_put_noref(nf);
>> - } else if (nf->nf_file) {
>> + } else if (nf->nf_file && test_bit(NFSD_FILE_GC, &nf->nf_flags) == 1) {
>> nfsd_file_put_noref(nf);
>> nfsd_file_schedule_laundrette();
>> } else
>> @@ -1017,12 +1033,14 @@ nfsd_file_is_cached(struct inode *inode)
>>
>> static __be32
>> nfsd_file_do_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
>> - unsigned int may_flags, struct nfsd_file **pnf, bool open)
>> + unsigned int may_flags, struct nfsd_file **pnf,
>> + bool open, int want_gc)
>> {
>> struct nfsd_file_lookup_key key = {
>> .type = NFSD_FILE_KEY_FULL,
>> .need = may_flags & NFSD_FILE_MAY_MASK,
>> .net = SVC_NET(rqstp),
>> + .gc = want_gc,
>> };
>> bool open_retry = true;
>> struct nfsd_file *nf;
>> @@ -1117,14 +1135,35 @@ nfsd_file_do_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
>> * then unhash.
>> */
>> if (status != nfs_ok || key.inode->i_nlink == 0)
>> - if (nfsd_file_unhash(nf))
>> - nfsd_file_put_noref(nf);
>> + nfsd_file_unhash_and_put(nf);
>> clear_bit_unlock(NFSD_FILE_PENDING, &nf->nf_flags);
>> smp_mb__after_atomic();
>> wake_up_bit(&nf->nf_flags, NFSD_FILE_PENDING);
>> goto out;
>> }
>>
>> +/**
>> + * nfsd_file_acquire_gc - Get a struct nfsd_file with an open file
>> + * @rqstp: the RPC transaction being executed
>> + * @fhp: the NFS filehandle of the file to be opened
>> + * @may_flags: NFSD_MAY_ settings for the file
>> + * @pnf: OUT: new or found "struct nfsd_file" object
>> + *
>> + * The nfsd_file object returned by this API is reference-counted
>> + * and garbage-collected. The object is retained for a few
>> + * seconds after the final nfsd_file_put() in case the caller
>> + * wants to re-use it.
>> + *
>> + * Returns nfs_ok and sets @pnf on success; otherwise an nfsstat in
>> + * network byte order is returned.
>> + */
>> +__be32
>> +nfsd_file_acquire_gc(struct svc_rqst *rqstp, struct svc_fh *fhp,
>> + unsigned int may_flags, struct nfsd_file **pnf)
>> +{
>> + return nfsd_file_do_acquire(rqstp, fhp, may_flags, pnf, true, 1);
>> +}
>> +
>> /**
>> * nfsd_file_acquire - Get a struct nfsd_file with an open file
>> * @rqstp: the RPC transaction being executed
>> @@ -1132,6 +1171,10 @@ nfsd_file_do_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
>> * @may_flags: NFSD_MAY_ settings for the file
>> * @pnf: OUT: new or found "struct nfsd_file" object
>> *
>> + * The nfsd_file_object returned by this API is reference-counted
>> + * but not garbage-collected. The object is released one RCU grace
>> + * period after the final nfsd_file_put().
>> + *
>> * Returns nfs_ok and sets @pnf on success; otherwise an nfsstat in
>> * network byte order is returned.
>> */
>> @@ -1139,7 +1182,7 @@ __be32
>> nfsd_file_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
>> unsigned int may_flags, struct nfsd_file **pnf)
>> {
>> - return nfsd_file_do_acquire(rqstp, fhp, may_flags, pnf, true);
>> + return nfsd_file_do_acquire(rqstp, fhp, may_flags, pnf, true, 0);
>> }
>>
>> /**
>> @@ -1149,6 +1192,10 @@ nfsd_file_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
>> * @may_flags: NFSD_MAY_ settings for the file
>> * @pnf: OUT: new or found "struct nfsd_file" object
>> *
>> + * The nfsd_file_object returned by this API is reference-counted
>> + * but not garbage-collected. The object is released immediately
>> + * one RCU grace period after the final nfsd_file_put().
>
> While that last sentence is correct, I think it is missing the point.
> The reference to grace periods is really an internal implementation
> details.

Fair enough, document policy and not implementation. I'll rework this
based on your suggestion below.


> The important point is that on final nfsd_file_put(), the object is
> removed from the hash table. This contrasts with the other version
> where on final nfsd_file_put(), the object is added to the lru.
>
> The code all seems good, just the comment seemed odd.
>
> Thanks,
> NeilBrown
>
>
>> + *
>> * Returns nfs_ok and sets @pnf on success; otherwise an nfsstat in
>> * network byte order is returned.
>> */
>> @@ -1156,7 +1203,7 @@ __be32
>> nfsd_file_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
>> unsigned int may_flags, struct nfsd_file **pnf)
>> {
>> - return nfsd_file_do_acquire(rqstp, fhp, may_flags, pnf, false);
>> + return nfsd_file_do_acquire(rqstp, fhp, may_flags, pnf, false, 0);
>> }
>>
>> /*
>> diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h
>> index f81c198f4ed6..0f6546bcd3e0 100644
>> --- a/fs/nfsd/filecache.h
>> +++ b/fs/nfsd/filecache.h
>> @@ -38,6 +38,7 @@ struct nfsd_file {
>> #define NFSD_FILE_HASHED (0)
>> #define NFSD_FILE_PENDING (1)
>> #define NFSD_FILE_REFERENCED (2)
>> +#define NFSD_FILE_GC (3)
>> unsigned long nf_flags;
>> struct inode *nf_inode; /* don't deref */
>> refcount_t nf_ref;
>> @@ -55,6 +56,8 @@ void nfsd_file_put(struct nfsd_file *nf);
>> struct nfsd_file *nfsd_file_get(struct nfsd_file *nf);
>> void nfsd_file_close_inode_sync(struct inode *inode);
>> bool nfsd_file_is_cached(struct inode *inode);
>> +__be32 nfsd_file_acquire_gc(struct svc_rqst *rqstp, struct svc_fh *fhp,
>> + unsigned int may_flags, struct nfsd_file **nfp);
>> __be32 nfsd_file_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
>> unsigned int may_flags, struct nfsd_file **nfp);
>> __be32 nfsd_file_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
>> diff --git a/fs/nfsd/nfs3proc.c b/fs/nfsd/nfs3proc.c
>> index d12823371857..6a5ad6c91d8e 100644
>> --- a/fs/nfsd/nfs3proc.c
>> +++ b/fs/nfsd/nfs3proc.c
>> @@ -779,8 +779,8 @@ nfsd3_proc_commit(struct svc_rqst *rqstp)
>> (unsigned long long) argp->offset);
>>
>> fh_copy(&resp->fh, &argp->fh);
>> - resp->status = nfsd_file_acquire(rqstp, &resp->fh, NFSD_MAY_WRITE |
>> - NFSD_MAY_NOT_BREAK_LEASE, &nf);
>> + resp->status = nfsd_file_acquire_gc(rqstp, &resp->fh, NFSD_MAY_WRITE |
>> + NFSD_MAY_NOT_BREAK_LEASE, &nf);
>> if (resp->status)
>> goto out;
>> resp->status = nfsd_commit(rqstp, &resp->fh, nf, argp->offset,
>> diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
>> index 9ebd67d461f9..4921144880d3 100644
>> --- a/fs/nfsd/trace.h
>> +++ b/fs/nfsd/trace.h
>> @@ -742,7 +742,8 @@ DEFINE_CLID_EVENT(confirmed_r);
>> __print_flags(val, "|", \
>> { 1 << NFSD_FILE_HASHED, "HASHED" }, \
>> { 1 << NFSD_FILE_PENDING, "PENDING" }, \
>> - { 1 << NFSD_FILE_REFERENCED, "REFERENCED"})
>> + { 1 << NFSD_FILE_REFERENCED, "REFERENCED"}, \
>> + { 1 << NFSD_FILE_GC, "GC"})
>>
>> DECLARE_EVENT_CLASS(nfsd_file_class,
>> TP_PROTO(struct nfsd_file *nf),
>> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
>> index 44f210ba17cc..89d682a56fc4 100644
>> --- a/fs/nfsd/vfs.c
>> +++ b/fs/nfsd/vfs.c
>> @@ -1060,7 +1060,7 @@ __be32 nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
>> __be32 err;
>>
>> trace_nfsd_read_start(rqstp, fhp, offset, *count);
>> - err = nfsd_file_acquire(rqstp, fhp, NFSD_MAY_READ, &nf);
>> + err = nfsd_file_acquire_gc(rqstp, fhp, NFSD_MAY_READ, &nf);
>> if (err)
>> return err;
>>
>> @@ -1092,7 +1092,7 @@ nfsd_write(struct svc_rqst *rqstp, struct svc_fh *fhp, loff_t offset,
>>
>> trace_nfsd_write_start(rqstp, fhp, offset, *cnt);
>>
>> - err = nfsd_file_acquire(rqstp, fhp, NFSD_MAY_WRITE, &nf);
>> + err = nfsd_file_acquire_gc(rqstp, fhp, NFSD_MAY_WRITE, &nf);
>> if (err)
>> goto out;

--
Chuck Lever



2022-10-11 23:44:04

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH v2 7/9] NFSD: Use rhashtable for managing nfs4_file objects

On Tue, 11 Oct 2022, Chuck Lever III wrote:
> > On Oct 10, 2022, at 8:16 PM, NeilBrown <[email protected]> wrote:
> >
> > On Fri, 07 Oct 2022, Chuck Lever wrote:
> >>
> >> -static unsigned int file_hashval(struct svc_fh *fh)
> >> +/*
> >> + * The returned hash value is based solely on the address of an in-code
> >> + * inode, a pointer to a slab-allocated object. The entropy in such a
> >> + * pointer is concentrated in its middle bits.
> >
> > I think you need more justification than that for throwing away some of
> > the entropy, even if you don't think it is much.
>
> We might have that justification:
>
> https://lore.kernel.org/linux-nfs/YrUFbLJ5uVbWtZbf@ZenIV/
>
> Actually I believe we are not discarding /any/ entropy in
> this function. The bits we discard are invariant.
>

Ok, I can accept that this:

+ k = ptr >> L1_CACHE_SHIFT;

only discards invariant bits, but how can you justify this:

+ k &= 0x00ffffff;

??

And given that you pass it all to jhash anyway, why not just pass all of
it?


> And, note that this is exactly the same situation we just merged
> in the filecache overhaul, and is a common trope amongst other
> hash tables that base their function on the inode's address.
>
>
> > Presumably you think hashing 32 bits is faster than hashing 64 bits.
> > Maybe it is, but is it worth it?
> >
> > rhashtable depends heavily on having a strong hash function. In
> > particular if any bucket ends up with more than 16 elements it will
> > choose a new seed and rehash. If you deliberately remove some bits that
> > it might have been used to spread those 16 out, then you are asking for
> > trouble.
> >
> > We know that two different file handles can refer to the same inode
> > ("rarely"), and you deliberately place them in the same hash bucket.
> > So if an attacker arranges to access 17 files with the same inode but
> > different file handles, then the hashtable will be continuously
> > rehashed.
> >
> > The preferred approach when you require things to share a hash chain is
> > to use an rhl table.
>
> Again, this is the same situation for the filecache. Do you
> believe it is worth reworking that? I'm guessing "yes".

As a matter of principle: yes.
rhashtable is designed to assume that hash collisions are bad and can be
changed by choosing a different seed.
rhl_tables was explicitly added for cases when people wanted multiple
elements to hash to the same value.

The chance of it causing a problem without an attack are admittedly
tiny. Attacks are only possible with subtree_check enabled, or if the
underlying filesystem does something "clever" with file handles, so
there wouldn't be many situations where an attack would even be
possible. But if it were possible, it would likely be easy.
The cost of the attack would be a minor-to-modest performance impact.

So it is hard to argue "this code is dangerous and must be changed", but
we have different tools for a reason, and I believe that rhl-tables is
the right tool for this job.

>
>
> > This allows multiple instances with the same key.
> > You would then key the rhl-table with the inode, and search a
> > linked-list to find the entry with the desired file handle. This would
> > be no worse in search time than the current code for aliased inodes, but
> > less susceptible to attack.
> >
> >> +/**
> >> + * nfs4_file_obj_cmpfn - Match a cache item against search criteria
> >> + * @arg: search criteria
> >> + * @ptr: cache item to check
> >> + *
> >> + * Return values:
> >> + * %0 - Item matches search criteria
> >> + * %1 - Item does not match search criteria
> >
> > I *much* prefer %-ESRCH for "does not match search criteria". It is
> > self-documenting. Any non-zero value will do.
>
> Noted, but returning 1 appears to be the typical arrangement for
> existing obj_cmpfn methods in most other areas.

Some people have no imagination :-)

[Same argument could be used against implementing kernel modules in Rust
- Using C appears to be the typical arrangement for existing modules)

Thanks,
NeilBrown

2022-10-12 06:55:11

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH v2 7/9] NFSD: Use rhashtable for managing nfs4_file objects

On Tue, 11 Oct 2022, Chuck Lever III wrote:
> > On Oct 10, 2022, at 8:16 PM, NeilBrown <[email protected]> wrote:
> >
> > On Fri, 07 Oct 2022, Chuck Lever wrote:
> >>
> >> -static unsigned int file_hashval(struct svc_fh *fh)
> >> +/*
> >> + * The returned hash value is based solely on the address of an in-code
> >> + * inode, a pointer to a slab-allocated object. The entropy in such a
> >> + * pointer is concentrated in its middle bits.
> >
> > I think you need more justification than that for throwing away some of
> > the entropy, even if you don't think it is much.
>
> We might have that justification:
>
> https://lore.kernel.org/linux-nfs/YrUFbLJ5uVbWtZbf@ZenIV/
>
> Actually I believe we are not discarding /any/ entropy in
> this function. The bits we discard are invariant.
>
> And, note that this is exactly the same situation we just merged
> in the filecache overhaul, and is a common trope amongst other
> hash tables that base their function on the inode's address.

Common? I searched for ">> *L1_CACHE_SHIFT". Apart from the nfsd
filecache you mentioned I find two. One in quota and one in reiserfs.
Both work with traditional hash tables which are more forgiving of
longer chains.
Do you have other evidence of this being a common trope?

Thanks,
NeilBrown

2022-10-12 15:18:33

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH v2 7/9] NFSD: Use rhashtable for managing nfs4_file objects



> On Oct 11, 2022, at 7:37 PM, NeilBrown <[email protected]> wrote:
>
> On Tue, 11 Oct 2022, Chuck Lever III wrote:
>>> On Oct 10, 2022, at 8:16 PM, NeilBrown <[email protected]> wrote:
>>>
>>> On Fri, 07 Oct 2022, Chuck Lever wrote:
>>>>
>>>> -static unsigned int file_hashval(struct svc_fh *fh)
>>>> +/*
>>>> + * The returned hash value is based solely on the address of an in-code
>>>> + * inode, a pointer to a slab-allocated object. The entropy in such a
>>>> + * pointer is concentrated in its middle bits.
>>>
>>> I think you need more justification than that for throwing away some of
>>> the entropy, even if you don't think it is much.
>>
>> We might have that justification:
>>
>> https://lore.kernel.org/linux-nfs/YrUFbLJ5uVbWtZbf@ZenIV/
>>
>> Actually I believe we are not discarding /any/ entropy in
>> this function. The bits we discard are invariant.
>>
>
> Ok, I can accept that this:
>
> + k = ptr >> L1_CACHE_SHIFT;

> I searched for ">> *L1_CACHE_SHIFT". Apart from the nfsd
> filecache you mentioned I find two. One in quota and one in reiserfs.
> Both work with traditional hash tables which are more forgiving of
> longer chains.
> Do you have other evidence of this being a common trope?


This approach is based on the hash function in fs/inode.c,
which uses integer division instead of a shift.

509 static unsigned long hash(struct super_block *sb, unsigned long hashval)
510 {
511 unsigned long tmp;
512
513 tmp = (hashval * (unsigned long)sb) ^ (GOLDEN_RATIO_PRIME + hashval) /
514 L1_CACHE_BYTES;
515 tmp = tmp ^ ((tmp ^ GOLDEN_RATIO_PRIME) >> i_hash_shift);
516 return tmp & i_hash_mask;
517 }


> only discards invariant bits, but how can you justify this:
>
> + k &= 0x00ffffff;
>
> ??

After shifting an address, the top byte generally contains
invariant bits as well.


> And given that you pass it all to jhash anyway, why not just pass all of
> it?

I don't think this is a big deal, but these functions are
basically the same as what was recently merged without
complaint. It's not a high priority to revisit those.

It might be worth a clean-up to share this hash function
between the two hash tables... at that point we might
consider removing the extra mask.


>> And, note that this is exactly the same situation we just merged
>> in the filecache overhaul, and is a common trope amongst other
>> hash tables that base their function on the inode's address.
>>
>>
>>> Presumably you think hashing 32 bits is faster than hashing 64 bits.
>>> Maybe it is, but is it worth it?
>>>
>>> rhashtable depends heavily on having a strong hash function. In
>>> particular if any bucket ends up with more than 16 elements it will
>>> choose a new seed and rehash. If you deliberately remove some bits that
>>> it might have been used to spread those 16 out, then you are asking for
>>> trouble.
>>>
>>> We know that two different file handles can refer to the same inode
>>> ("rarely"), and you deliberately place them in the same hash bucket.
>>> So if an attacker arranges to access 17 files with the same inode but
>>> different file handles, then the hashtable will be continuously
>>> rehashed.
>>>
>>> The preferred approach when you require things to share a hash chain is
>>> to use an rhl table.
>>
>> Again, this is the same situation for the filecache. Do you
>> believe it is worth reworking that? I'm guessing "yes".
>
> As a matter of principle: yes.
> rhashtable is designed to assume that hash collisions are bad and can be
> changed by choosing a different seed.
> rhl_tables was explicitly added for cases when people wanted multiple
> elements to hash to the same value.
>
> The chance of it causing a problem without an attack are admittedly
> tiny. Attacks are only possible with subtree_check enabled, or if the
> underlying filesystem does something "clever" with file handles, so
> there wouldn't be many situations where an attack would even be
> possible. But if it were possible, it would likely be easy.
> The cost of the attack would be a minor-to-modest performance impact.
>
> So it is hard to argue "this code is dangerous and must be changed", but
> we have different tools for a reason, and I believe that rhl-tables is
> the right tool for this job.

Agreed. I wasn't suggesting it's an emergency situation, but it's
something that should get fixed at some point if there's a problem
with it, even a minor one.

I think I stopped at the non-list variant of rhashtable because
using rhl was more complex, and the non-list variant seemed to
work fine. There's no architectural reason either file_hashtbl
or the filecache must use the non-list variant.

In any event, it's worth taking the trouble now to change the
nfs4_file implementation proposed here as you suggest.


--
Chuck Lever



2022-10-12 21:21:27

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH v2 7/9] NFSD: Use rhashtable for managing nfs4_file objects

On Thu, 13 Oct 2022, Chuck Lever III wrote:
>
> I think I stopped at the non-list variant of rhashtable because
> using rhl was more complex, and the non-list variant seemed to
> work fine. There's no architectural reason either file_hashtbl
> or the filecache must use the non-list variant.
>
> In any event, it's worth taking the trouble now to change the
> nfs4_file implementation proposed here as you suggest.

If you like you could leave it as-is for now and I can provide a patch
to convert to rhl-tables later (won't be until late October).
There is one thing I would need to understand though: why are the
nfsd_files per-filehandle instead of per-inode? There is probably a
good reason, but I cannot think of one.

Thanks,
NeilBrown

2022-10-13 14:36:29

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH v2 7/9] NFSD: Use rhashtable for managing nfs4_file objects



> On Oct 12, 2022, at 5:18 PM, NeilBrown <[email protected]> wrote:
>
> On Thu, 13 Oct 2022, Chuck Lever III wrote:
>>
>> I think I stopped at the non-list variant of rhashtable because
>> using rhl was more complex, and the non-list variant seemed to
>> work fine. There's no architectural reason either file_hashtbl
>> or the filecache must use the non-list variant.
>>
>> In any event, it's worth taking the trouble now to change the
>> nfs4_file implementation proposed here as you suggest.
>
> If you like you could leave it as-is for now and I can provide a patch
> to convert to rhl-tables later (won't be until late October).
> There is one thing I would need to understand though: why are the
> nfsd_files per-filehandle instead of per-inode? There is probably a
> good reason, but I cannot think of one.

I'm not clear on your offer: do you mean converting the nfsd_file
cache from rhashtable to rhl, or converting the proposed nfs4_file
rework? I had planned to do the latter myself and post a refresh.

The nfsd_file_acquire API is the only place that seems to want a
filehandle, and it's just to lookup the underlying inode. Perhaps
I don't understand your question?


--
Chuck Lever



2022-10-13 22:22:48

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH v2 7/9] NFSD: Use rhashtable for managing nfs4_file objects

On Fri, 14 Oct 2022, Chuck Lever III wrote:
>
> > On Oct 12, 2022, at 5:18 PM, NeilBrown <[email protected]> wrote:
> >
> > On Thu, 13 Oct 2022, Chuck Lever III wrote:
> >>
> >> I think I stopped at the non-list variant of rhashtable because
> >> using rhl was more complex, and the non-list variant seemed to
> >> work fine. There's no architectural reason either file_hashtbl
> >> or the filecache must use the non-list variant.
> >>
> >> In any event, it's worth taking the trouble now to change the
> >> nfs4_file implementation proposed here as you suggest.
> >
> > If you like you could leave it as-is for now and I can provide a patch
> > to convert to rhl-tables later (won't be until late October).
> > There is one thing I would need to understand though: why are the
> > nfsd_files per-filehandle instead of per-inode? There is probably a
> > good reason, but I cannot think of one.
>
> I'm not clear on your offer: do you mean converting the nfsd_file
> cache from rhashtable to rhl, or converting the proposed nfs4_file
> rework? I had planned to do the latter myself and post a refresh.

Either/both. Of course if you do the refresh, then I'll just review it.

>
> The nfsd_file_acquire API is the only place that seems to want a
> filehandle, and it's just to lookup the underlying inode. Perhaps
> I don't understand your question?

Sorry, I meant nfs4_files, not nfsd_file: find_file() and find_or_add_file().
Why is there one nfs4_file per filehandle

I see that there can be several nfsd_file per inode - in different
network namespaces, or with different credentials or different access
modes. That really needs to be fixed.

Thanks,
NeilBrown

2022-10-14 12:54:46

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH v2 7/9] NFSD: Use rhashtable for managing nfs4_file objects



> On Oct 13, 2022, at 6:14 PM, NeilBrown <[email protected]> wrote:
>
> On Fri, 14 Oct 2022, Chuck Lever III wrote:
>>
>>> On Oct 12, 2022, at 5:18 PM, NeilBrown <[email protected]> wrote:
>>>
>>> On Thu, 13 Oct 2022, Chuck Lever III wrote:
>>>>
>>>> I think I stopped at the non-list variant of rhashtable because
>>>> using rhl was more complex, and the non-list variant seemed to
>>>> work fine. There's no architectural reason either file_hashtbl
>>>> or the filecache must use the non-list variant.
>>>>
>>>> In any event, it's worth taking the trouble now to change the
>>>> nfs4_file implementation proposed here as you suggest.
>>>
>>> If you like you could leave it as-is for now and I can provide a patch
>>> to convert to rhl-tables later (won't be until late October).
>>> There is one thing I would need to understand though: why are the
>>> nfsd_files per-filehandle instead of per-inode? There is probably a
>>> good reason, but I cannot think of one.
>>
>> I'm not clear on your offer: do you mean converting the nfsd_file
>> cache from rhashtable to rhl, or converting the proposed nfs4_file
>> rework? I had planned to do the latter myself and post a refresh.
>
> Either/both. Of course if you do the refresh, then I'll just review it.

Yep, I plan to repost, as part of addressing (your) review comments.


>> The nfsd_file_acquire API is the only place that seems to want a
>> filehandle, and it's just to lookup the underlying inode. Perhaps
>> I don't understand your question?
>
> Sorry, I meant nfs4_files, not nfsd_file: find_file() and find_or_add_file().
> Why is there one nfs4_file per filehandle

I can't answer that (yet), but I suspect there is some semantic
association between the [current] filehandle and a particular
state ID that makes this a sensible arrangement.


> I see that there can be several nfsd_file per inode - in different
> network namespaces, or with different credentials or different access
> modes.

My impression is that is by design. Each namespace and unique
credential needs a distinct nfsd_file. Each nfsd_file acts like
an open file descriptor in user space.


> That really needs to be fixed.

I'm not sure I agree, but maybe "that" and "fixed" are doing some
heavy lifting. Jeff might be able to add some insight on design
purpose, and we can write a patch that adds that as a documenting
comment somewhere.

--
Chuck Lever