There are a number of changes here from the last set, but I've tried to
address Neil and Chuck's comments (thanks for the review, btw).
I think I've solved the race between adding to the LRU and unhashing.
That's in patch #3. We may want to squash that into #2 before merging.
Jeff Layton (4):
nfsd: remove the pages_flushed statistic from filecache
nfsd: rework refcounting in filecache
nfsd: close race between unhashing and LRU addition
nfsd: start non-blocking writeback after adding nfsd_file to the LRU
fs/nfsd/filecache.c | 385 +++++++++++++++++++++++---------------------
fs/nfsd/filecache.h | 1 +
fs/nfsd/trace.h | 5 +-
3 files changed, 207 insertions(+), 184 deletions(-)
--
2.37.3
We're counting mapping->nrpages, but not all of those are necessarily
dirty. We don't really have a simple way to count just the dirty pages,
so just remove this stat since it's not accurate.
Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfsd/filecache.c | 7 +------
1 file changed, 1 insertion(+), 6 deletions(-)
diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
index 98c6b5f51bc8..f8ebbf7daa18 100644
--- a/fs/nfsd/filecache.c
+++ b/fs/nfsd/filecache.c
@@ -32,7 +32,6 @@ static DEFINE_PER_CPU(unsigned long, nfsd_file_cache_hits);
static DEFINE_PER_CPU(unsigned long, nfsd_file_acquisitions);
static DEFINE_PER_CPU(unsigned long, nfsd_file_releases);
static DEFINE_PER_CPU(unsigned long, nfsd_file_total_age);
-static DEFINE_PER_CPU(unsigned long, nfsd_file_pages_flushed);
static DEFINE_PER_CPU(unsigned long, nfsd_file_evictions);
struct nfsd_fcache_disposal {
@@ -370,7 +369,6 @@ nfsd_file_flush(struct nfsd_file *nf)
if (!file || !(file->f_mode & FMODE_WRITE))
return;
- this_cpu_add(nfsd_file_pages_flushed, file->f_mapping->nrpages);
if (vfs_fsync(file, 1) != 0)
nfsd_reset_write_verifier(net_generic(nf->nf_net, nfsd_net_id));
}
@@ -998,7 +996,6 @@ nfsd_file_cache_shutdown(void)
per_cpu(nfsd_file_acquisitions, i) = 0;
per_cpu(nfsd_file_releases, i) = 0;
per_cpu(nfsd_file_total_age, i) = 0;
- per_cpu(nfsd_file_pages_flushed, i) = 0;
per_cpu(nfsd_file_evictions, i) = 0;
}
}
@@ -1212,7 +1209,7 @@ nfsd_file_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
*/
int nfsd_file_cache_stats_show(struct seq_file *m, void *v)
{
- unsigned long releases = 0, pages_flushed = 0, evictions = 0;
+ unsigned long releases = 0, evictions = 0;
unsigned long hits = 0, acquisitions = 0;
unsigned int i, count = 0, buckets = 0;
unsigned long lru = 0, total_age = 0;
@@ -1240,7 +1237,6 @@ int nfsd_file_cache_stats_show(struct seq_file *m, void *v)
releases += per_cpu(nfsd_file_releases, i);
total_age += per_cpu(nfsd_file_total_age, i);
evictions += per_cpu(nfsd_file_evictions, i);
- pages_flushed += per_cpu(nfsd_file_pages_flushed, i);
}
seq_printf(m, "total entries: %u\n", count);
@@ -1254,6 +1250,5 @@ int nfsd_file_cache_stats_show(struct seq_file *m, void *v)
seq_printf(m, "mean age (ms): %ld\n", total_age / releases);
else
seq_printf(m, "mean age (ms): -\n");
- seq_printf(m, "pages flushed: %lu\n", pages_flushed);
return 0;
}
--
2.37.3
The list_lru_add and list_lru_del functions use list_empty checks to see
whether the object is already on the LRU. That's fine in most cases, but
we occasionally repurpose nf_lru after unhashing. It's possible for an
LRU removal to remove it from a different list altogether if we lose a
race.
Add a new NFSD_FILE_LRU flag, which indicates that the object actually
resides on the LRU and not some other list. Use that when adding and
removing it from the LRU instead of just relying on list_empty checks.
Add an extra HASHED check after adding the entry to the LRU. If that flag
is now clear, remove the entry from the LRU again, and put the reference
if the removal succeeds.
Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfsd/filecache.c | 44 +++++++++++++++++++++++++++++---------------
fs/nfsd/filecache.h | 1 +
2 files changed, 30 insertions(+), 15 deletions(-)
diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
index d928c5e38eeb..47cdc6129a7b 100644
--- a/fs/nfsd/filecache.c
+++ b/fs/nfsd/filecache.c
@@ -403,18 +403,22 @@ nfsd_file_check_writeback(struct nfsd_file *nf)
static bool nfsd_file_lru_add(struct nfsd_file *nf)
{
set_bit(NFSD_FILE_REFERENCED, &nf->nf_flags);
- if (list_lru_add(&nfsd_file_lru, &nf->nf_lru)) {
- trace_nfsd_file_lru_add(nf);
- return true;
+ if (!test_and_set_bit(NFSD_FILE_LRU, &nf->nf_flags)) {
+ if (list_lru_add(&nfsd_file_lru, &nf->nf_lru)) {
+ trace_nfsd_file_lru_add(nf);
+ return true;
+ }
}
return false;
}
static bool nfsd_file_lru_remove(struct nfsd_file *nf)
{
- if (list_lru_del(&nfsd_file_lru, &nf->nf_lru)) {
- trace_nfsd_file_lru_del(nf);
- return true;
+ if (test_and_clear_bit(NFSD_FILE_LRU, &nf->nf_flags)) {
+ if (list_lru_del(&nfsd_file_lru, &nf->nf_lru)) {
+ trace_nfsd_file_lru_del(nf);
+ return true;
+ }
}
return false;
}
@@ -469,20 +473,30 @@ nfsd_file_put(struct nfsd_file *nf)
{
trace_nfsd_file_put(nf);
- /*
- * The HASHED check is racy. We may end up with the occasional
- * unhashed entry on the LRU, but they should get cleaned up
- * like any other.
- */
if (test_bit(NFSD_FILE_GC, &nf->nf_flags) &&
test_bit(NFSD_FILE_HASHED, &nf->nf_flags)) {
/*
- * If this is the last reference (nf_ref == 1), then transfer
- * it to the LRU. If the add to the LRU fails, just put it as
- * usual.
+ * If this is the last reference (nf_ref == 1), then try to
+ * transfer it to the LRU.
*/
- if (refcount_dec_not_one(&nf->nf_ref) || nfsd_file_lru_add(nf))
+ if (refcount_dec_not_one(&nf->nf_ref))
return;
+
+ /* Try to add it to the LRU. If that fails, decrement. */
+ if (nfsd_file_lru_add(nf)) {
+ /* If it's still hashed, we're done */
+ if (test_bit(NFSD_FILE_HASHED, &nf->nf_flags))
+ return;
+
+ /*
+ * We're racing with unhashing, so try to remove it from
+ * the LRU. If removal fails, then someone else already
+ * has our reference and we're done. If it succeeds,
+ * fall through to decrement.
+ */
+ if (!nfsd_file_lru_remove(nf))
+ return;
+ }
}
if (refcount_dec_and_test(&nf->nf_ref))
nfsd_file_free(nf);
diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h
index b7efb2c3ddb1..e52ab7d5a44c 100644
--- a/fs/nfsd/filecache.h
+++ b/fs/nfsd/filecache.h
@@ -39,6 +39,7 @@ struct nfsd_file {
#define NFSD_FILE_PENDING (1)
#define NFSD_FILE_REFERENCED (2)
#define NFSD_FILE_GC (3)
+#define NFSD_FILE_LRU (4) /* file is on LRU */
unsigned long nf_flags;
struct inode *nf_inode; /* don't deref */
refcount_t nf_ref;
--
2.37.3
The filecache refcounting is a bit non-standard for something searchable
by RCU, in that we maintain a sentinel reference while it's hashed. This
in turn requires that we do things differently in the "put"
depending on whether it's hashed, which we believe to have led to races.
There are other problems in here too. nfsd_file_close_inode_sync can end
up freeing an nfsd_file while there are still outstanding references to
it, and there are a number of subtle ToC/ToU races.
Rework the code so that the refcount is what drives the lifecycle. When
the refcount goes to zero, then unhash and rcu free the object.
With this change, the LRU carries a reference. Take special care to
deal with it when removing an entry from the list.
Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfsd/filecache.c | 357 ++++++++++++++++++++++----------------------
fs/nfsd/trace.h | 5 +-
2 files changed, 178 insertions(+), 184 deletions(-)
diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
index f8ebbf7daa18..d928c5e38eeb 100644
--- a/fs/nfsd/filecache.c
+++ b/fs/nfsd/filecache.c
@@ -1,6 +1,12 @@
// SPDX-License-Identifier: GPL-2.0
/*
* The NFSD open file cache.
+ *
+ * Each nfsd_file is created in response to client activity -- either regular
+ * file I/O for v2/v3, or opening a file for v4. Files opened via v4 are
+ * cleaned up as soon as their refcount goes to 0. Entries for v2/v3 are
+ * flagged with NFSD_FILE_GC. On their last put, they are added to the LRU for
+ * eventual disposal if they aren't used again within a short time period.
*/
#include <linux/hash.h>
@@ -301,55 +307,22 @@ nfsd_file_alloc(struct nfsd_file_lookup_key *key, unsigned int may)
if (key->gc)
__set_bit(NFSD_FILE_GC, &nf->nf_flags);
nf->nf_inode = key->inode;
- /* nf_ref is pre-incremented for hash table */
- refcount_set(&nf->nf_ref, 2);
+ refcount_set(&nf->nf_ref, 1);
nf->nf_may = key->need;
nf->nf_mark = NULL;
}
return nf;
}
-static bool
-nfsd_file_free(struct nfsd_file *nf)
-{
- s64 age = ktime_to_ms(ktime_sub(ktime_get(), nf->nf_birthtime));
- bool flush = false;
-
- this_cpu_inc(nfsd_file_releases);
- this_cpu_add(nfsd_file_total_age, age);
-
- trace_nfsd_file_put_final(nf);
- if (nf->nf_mark)
- nfsd_file_mark_put(nf->nf_mark);
- if (nf->nf_file) {
- get_file(nf->nf_file);
- filp_close(nf->nf_file, NULL);
- fput(nf->nf_file);
- flush = true;
- }
-
- /*
- * If this item is still linked via nf_lru, that's a bug.
- * WARN and leak it to preserve system stability.
- */
- if (WARN_ON_ONCE(!list_empty(&nf->nf_lru)))
- return flush;
-
- call_rcu(&nf->nf_rcu, nfsd_file_slab_free);
- return flush;
-}
-
-static bool
-nfsd_file_check_writeback(struct nfsd_file *nf)
+static void
+nfsd_file_fsync(struct nfsd_file *nf)
{
struct file *file = nf->nf_file;
- struct address_space *mapping;
if (!file || !(file->f_mode & FMODE_WRITE))
- return false;
- mapping = file->f_mapping;
- return mapping_tagged(mapping, PAGECACHE_TAG_DIRTY) ||
- mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK);
+ return;
+ if (vfs_fsync(file, 1) != 0)
+ nfsd_reset_write_verifier(net_generic(nf->nf_net, nfsd_net_id));
}
static int
@@ -362,30 +335,6 @@ nfsd_file_check_write_error(struct nfsd_file *nf)
return filemap_check_wb_err(file->f_mapping, READ_ONCE(file->f_wb_err));
}
-static void
-nfsd_file_flush(struct nfsd_file *nf)
-{
- struct file *file = nf->nf_file;
-
- if (!file || !(file->f_mode & FMODE_WRITE))
- return;
- if (vfs_fsync(file, 1) != 0)
- nfsd_reset_write_verifier(net_generic(nf->nf_net, nfsd_net_id));
-}
-
-static void nfsd_file_lru_add(struct nfsd_file *nf)
-{
- set_bit(NFSD_FILE_REFERENCED, &nf->nf_flags);
- if (list_lru_add(&nfsd_file_lru, &nf->nf_lru))
- trace_nfsd_file_lru_add(nf);
-}
-
-static void nfsd_file_lru_remove(struct nfsd_file *nf)
-{
- if (list_lru_del(&nfsd_file_lru, &nf->nf_lru))
- trace_nfsd_file_lru_del(nf);
-}
-
static void
nfsd_file_hash_remove(struct nfsd_file *nf)
{
@@ -408,53 +357,66 @@ nfsd_file_unhash(struct nfsd_file *nf)
}
static void
-nfsd_file_unhash_and_dispose(struct nfsd_file *nf, struct list_head *dispose)
+nfsd_file_free(struct nfsd_file *nf)
{
- trace_nfsd_file_unhash_and_dispose(nf);
- if (nfsd_file_unhash(nf)) {
- /* caller must call nfsd_file_dispose_list() later */
- nfsd_file_lru_remove(nf);
- list_add(&nf->nf_lru, dispose);
+ s64 age = ktime_to_ms(ktime_sub(ktime_get(), nf->nf_birthtime));
+
+ trace_nfsd_file_free(nf);
+
+ this_cpu_inc(nfsd_file_releases);
+ this_cpu_add(nfsd_file_total_age, age);
+
+ nfsd_file_unhash(nf);
+ nfsd_file_fsync(nf);
+
+ if (nf->nf_mark)
+ nfsd_file_mark_put(nf->nf_mark);
+ if (nf->nf_file) {
+ get_file(nf->nf_file);
+ filp_close(nf->nf_file, NULL);
+ fput(nf->nf_file);
}
+
+ /*
+ * If this item is still linked via nf_lru, that's a bug.
+ * WARN and leak it to preserve system stability.
+ */
+ if (WARN_ON_ONCE(!list_empty(&nf->nf_lru)))
+ return;
+
+ call_rcu(&nf->nf_rcu, nfsd_file_slab_free);
}
-static void
-nfsd_file_put_noref(struct nfsd_file *nf)
+static bool
+nfsd_file_check_writeback(struct nfsd_file *nf)
{
- trace_nfsd_file_put(nf);
+ struct file *file = nf->nf_file;
+ struct address_space *mapping;
- if (refcount_dec_and_test(&nf->nf_ref)) {
- WARN_ON(test_bit(NFSD_FILE_HASHED, &nf->nf_flags));
- nfsd_file_lru_remove(nf);
- nfsd_file_free(nf);
- }
+ if (!file || !(file->f_mode & FMODE_WRITE))
+ return false;
+ mapping = file->f_mapping;
+ return mapping_tagged(mapping, PAGECACHE_TAG_DIRTY) ||
+ mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK);
}
-static void
-nfsd_file_unhash_and_put(struct nfsd_file *nf)
+static bool nfsd_file_lru_add(struct nfsd_file *nf)
{
- if (nfsd_file_unhash(nf))
- nfsd_file_put_noref(nf);
+ set_bit(NFSD_FILE_REFERENCED, &nf->nf_flags);
+ if (list_lru_add(&nfsd_file_lru, &nf->nf_lru)) {
+ trace_nfsd_file_lru_add(nf);
+ return true;
+ }
+ return false;
}
-void
-nfsd_file_put(struct nfsd_file *nf)
+static bool nfsd_file_lru_remove(struct nfsd_file *nf)
{
- might_sleep();
-
- if (test_bit(NFSD_FILE_GC, &nf->nf_flags))
- nfsd_file_lru_add(nf);
- else if (refcount_read(&nf->nf_ref) == 2)
- nfsd_file_unhash_and_put(nf);
-
- if (!test_bit(NFSD_FILE_HASHED, &nf->nf_flags)) {
- nfsd_file_flush(nf);
- nfsd_file_put_noref(nf);
- } else if (nf->nf_file && test_bit(NFSD_FILE_GC, &nf->nf_flags)) {
- nfsd_file_put_noref(nf);
- nfsd_file_schedule_laundrette();
- } else
- nfsd_file_put_noref(nf);
+ if (list_lru_del(&nfsd_file_lru, &nf->nf_lru)) {
+ trace_nfsd_file_lru_del(nf);
+ return true;
+ }
+ return false;
}
struct nfsd_file *
@@ -465,36 +427,77 @@ nfsd_file_get(struct nfsd_file *nf)
return NULL;
}
-static void
-nfsd_file_dispose_list(struct list_head *dispose)
+/**
+ * nfsd_file_unhash_and_queue - unhash a file and queue it to the dispose list
+ * @nf: nfsd_file to be unhashed and queued
+ * @dispose: list to which it should be queued
+ *
+ * Attempt to unhash a nfsd_file and queue it to the given list. Each file
+ * will have a reference held on behalf of the list. That reference may come
+ * from the LRU, or we may need to take one. If we can't get a reference,
+ * ignore it altogether.
+ */
+static bool
+nfsd_file_unhash_and_queue(struct nfsd_file *nf, struct list_head *dispose)
{
- struct nfsd_file *nf;
+ trace_nfsd_file_unhash_and_queue(nf);
+ if (nfsd_file_unhash(nf)) {
+ /*
+ * If we remove it from the LRU, then just use that
+ * reference for the dispose list. Otherwise, we need
+ * to take a reference. If that fails, just ignore
+ * the file altogether.
+ */
+ if (!nfsd_file_lru_remove(nf) && !nfsd_file_get(nf))
+ return false;
+ list_add(&nf->nf_lru, dispose);
+ return true;
+ }
+ return false;
+}
- while(!list_empty(dispose)) {
- nf = list_first_entry(dispose, struct nfsd_file, nf_lru);
- list_del_init(&nf->nf_lru);
- nfsd_file_flush(nf);
- nfsd_file_put_noref(nf);
+/**
+ * nfsd_file_put - put the reference to a nfsd_file
+ * @nf: nfsd_file of which to put the reference
+ *
+ * Put a reference to a nfsd_file. In the v4 case, we just put the
+ * reference immediately. In the v2/3 case, if the reference would be
+ * the last one, then put it on the LRU instead to be cleaned up later.
+ */
+void
+nfsd_file_put(struct nfsd_file *nf)
+{
+ trace_nfsd_file_put(nf);
+
+ /*
+ * The HASHED check is racy. We may end up with the occasional
+ * unhashed entry on the LRU, but they should get cleaned up
+ * like any other.
+ */
+ if (test_bit(NFSD_FILE_GC, &nf->nf_flags) &&
+ test_bit(NFSD_FILE_HASHED, &nf->nf_flags)) {
+ /*
+ * If this is the last reference (nf_ref == 1), then transfer
+ * it to the LRU. If the add to the LRU fails, just put it as
+ * usual.
+ */
+ if (refcount_dec_not_one(&nf->nf_ref) || nfsd_file_lru_add(nf))
+ return;
}
+ if (refcount_dec_and_test(&nf->nf_ref))
+ nfsd_file_free(nf);
}
static void
-nfsd_file_dispose_list_sync(struct list_head *dispose)
+nfsd_file_dispose_list(struct list_head *dispose)
{
- bool flush = false;
struct nfsd_file *nf;
while(!list_empty(dispose)) {
nf = list_first_entry(dispose, struct nfsd_file, nf_lru);
list_del_init(&nf->nf_lru);
- nfsd_file_flush(nf);
- if (!refcount_dec_and_test(&nf->nf_ref))
- continue;
- if (nfsd_file_free(nf))
- flush = true;
+ nfsd_file_free(nf);
}
- if (flush)
- flush_delayed_fput();
}
static void
@@ -564,21 +567,8 @@ nfsd_file_lru_cb(struct list_head *item, struct list_lru_one *lru,
struct list_head *head = arg;
struct nfsd_file *nf = list_entry(item, struct nfsd_file, nf_lru);
- /*
- * Do a lockless refcount check. The hashtable holds one reference, so
- * we look to see if anything else has a reference, or if any have
- * been put since the shrinker last ran. Those don't get unhashed and
- * released.
- *
- * Note that in the put path, we set the flag and then decrement the
- * counter. Here we check the counter and then test and clear the flag.
- * That order is deliberate to ensure that we can do this locklessly.
- */
- if (refcount_read(&nf->nf_ref) > 1) {
- list_lru_isolate(lru, &nf->nf_lru);
- trace_nfsd_file_gc_in_use(nf);
- return LRU_REMOVED;
- }
+ /* We should only be dealing with v2/3 entries here */
+ WARN_ON_ONCE(!test_bit(NFSD_FILE_GC, &nf->nf_flags));
/*
* Don't throw out files that are still undergoing I/O or
@@ -589,40 +579,30 @@ nfsd_file_lru_cb(struct list_head *item, struct list_lru_one *lru,
return LRU_SKIP;
}
+ /* If it was recently added to the list, skip it */
if (test_and_clear_bit(NFSD_FILE_REFERENCED, &nf->nf_flags)) {
trace_nfsd_file_gc_referenced(nf);
return LRU_ROTATE;
}
- if (!test_and_clear_bit(NFSD_FILE_HASHED, &nf->nf_flags)) {
- trace_nfsd_file_gc_hashed(nf);
- return LRU_SKIP;
+ /*
+ * Put the reference held on behalf of the LRU. If it wasn't the last
+ * one, then just remove it from the LRU and ignore it.
+ */
+ if (!refcount_dec_and_test(&nf->nf_ref)) {
+ trace_nfsd_file_gc_in_use(nf);
+ list_lru_isolate(lru, &nf->nf_lru);
+ return LRU_REMOVED;
}
+ /* Refcount went to zero. Unhash it and queue it to the dispose list */
+ nfsd_file_unhash(nf);
list_lru_isolate_move(lru, &nf->nf_lru, head);
this_cpu_inc(nfsd_file_evictions);
trace_nfsd_file_gc_disposed(nf);
return LRU_REMOVED;
}
-/*
- * Unhash items on @dispose immediately, then queue them on the
- * disposal workqueue to finish releasing them in the background.
- *
- * cel: Note that between the time list_lru_shrink_walk runs and
- * now, these items are in the hash table but marked unhashed.
- * Why release these outside of lru_cb ? There's no lock ordering
- * problem since lru_cb currently takes no lock.
- */
-static void nfsd_file_gc_dispose_list(struct list_head *dispose)
-{
- struct nfsd_file *nf;
-
- list_for_each_entry(nf, dispose, nf_lru)
- nfsd_file_hash_remove(nf);
- nfsd_file_dispose_list_delayed(dispose);
-}
-
static void
nfsd_file_gc(void)
{
@@ -632,7 +612,7 @@ nfsd_file_gc(void)
ret = list_lru_walk(&nfsd_file_lru, nfsd_file_lru_cb,
&dispose, list_lru_count(&nfsd_file_lru));
trace_nfsd_file_gc_removed(ret, list_lru_count(&nfsd_file_lru));
- nfsd_file_gc_dispose_list(&dispose);
+ nfsd_file_dispose_list_delayed(&dispose);
}
static void
@@ -657,7 +637,7 @@ nfsd_file_lru_scan(struct shrinker *s, struct shrink_control *sc)
ret = list_lru_shrink_walk(&nfsd_file_lru, sc,
nfsd_file_lru_cb, &dispose);
trace_nfsd_file_shrinker_removed(ret, list_lru_count(&nfsd_file_lru));
- nfsd_file_gc_dispose_list(&dispose);
+ nfsd_file_dispose_list_delayed(&dispose);
return ret;
}
@@ -668,8 +648,11 @@ static struct shrinker nfsd_file_shrinker = {
};
/*
- * Find all cache items across all net namespaces that match @inode and
- * move them to @dispose. The lookup is atomic wrt nfsd_file_acquire().
+ * Find all cache items across all net namespaces that match @inode, unhash
+ * them, take references and then put them on @dispose if that was successful.
+ *
+ * The nfsd_file objects on the list will be unhashed, and each will have a
+ * reference taken.
*/
static unsigned int
__nfsd_file_close_inode(struct inode *inode, struct list_head *dispose)
@@ -687,52 +670,59 @@ __nfsd_file_close_inode(struct inode *inode, struct list_head *dispose)
nfsd_file_rhash_params);
if (!nf)
break;
- nfsd_file_unhash_and_dispose(nf, dispose);
- count++;
+
+ if (nfsd_file_unhash_and_queue(nf, dispose))
+ count++;
} while (1);
rcu_read_unlock();
return count;
}
/**
- * nfsd_file_close_inode_sync - attempt to forcibly close a nfsd_file
+ * nfsd_file_close_inode - attempt a delayed close of a nfsd_file
* @inode: inode of the file to attempt to remove
*
- * Unhash and put, then flush and fput all cache items associated with @inode.
+ * Unhash and put all cache items associated with @inode.
*/
-void
-nfsd_file_close_inode_sync(struct inode *inode)
+static unsigned int
+nfsd_file_close_inode(struct inode *inode)
{
- LIST_HEAD(dispose);
+ struct nfsd_file *nf;
unsigned int count;
+ LIST_HEAD(dispose);
count = __nfsd_file_close_inode(inode, &dispose);
- trace_nfsd_file_close_inode_sync(inode, count);
- nfsd_file_dispose_list_sync(&dispose);
+ trace_nfsd_file_close_inode(inode, count);
+ if (count) {
+ while(!list_empty(&dispose)) {
+ nf = list_first_entry(&dispose, struct nfsd_file, nf_lru);
+ list_del_init(&nf->nf_lru);
+ trace_nfsd_file_closing(nf);
+ if (refcount_dec_and_test(&nf->nf_ref))
+ nfsd_file_free(nf);
+ }
+ }
+ return count;
}
/**
- * nfsd_file_close_inode - attempt a delayed close of a nfsd_file
+ * nfsd_file_close_inode_sync - attempt to forcibly close a nfsd_file
* @inode: inode of the file to attempt to remove
*
- * Unhash and put all cache item associated with @inode.
+ * Unhash and put, then flush and fput all cache items associated with @inode.
*/
-static void
-nfsd_file_close_inode(struct inode *inode)
+void
+nfsd_file_close_inode_sync(struct inode *inode)
{
- LIST_HEAD(dispose);
- unsigned int count;
-
- count = __nfsd_file_close_inode(inode, &dispose);
- trace_nfsd_file_close_inode(inode, count);
- nfsd_file_dispose_list_delayed(&dispose);
+ if (nfsd_file_close_inode(inode))
+ flush_delayed_fput();
}
/**
* nfsd_file_delayed_close - close unused nfsd_files
* @work: dummy
*
- * Walk the LRU list and close any entries that have not been used since
+ * Walk the LRU list and destroy any entries that have not been used since
* the last scan.
*/
static void
@@ -890,7 +880,7 @@ __nfsd_file_cache_purge(struct net *net)
while (!IS_ERR_OR_NULL(nf)) {
if (net && nf->nf_net != net)
continue;
- nfsd_file_unhash_and_dispose(nf, &dispose);
+ nfsd_file_unhash_and_queue(nf, &dispose);
nf = rhashtable_walk_next(&iter);
}
@@ -1054,8 +1044,10 @@ nfsd_file_do_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
rcu_read_lock();
nf = rhashtable_lookup(&nfsd_file_rhash_tbl, &key,
nfsd_file_rhash_params);
- if (nf)
- nf = nfsd_file_get(nf);
+ if (nf) {
+ if (!nfsd_file_lru_remove(nf))
+ nf = nfsd_file_get(nf);
+ }
rcu_read_unlock();
if (nf)
goto wait_for_construction;
@@ -1090,11 +1082,11 @@ nfsd_file_do_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
goto out;
}
open_retry = false;
- nfsd_file_put_noref(nf);
+ if (refcount_dec_and_test(&nf->nf_ref))
+ nfsd_file_free(nf);
goto retry;
}
- nfsd_file_lru_remove(nf);
this_cpu_inc(nfsd_file_cache_hits);
status = nfserrno(nfsd_open_break_lease(file_inode(nf->nf_file), may_flags));
@@ -1104,7 +1096,8 @@ nfsd_file_do_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
this_cpu_inc(nfsd_file_acquisitions);
*pnf = nf;
} else {
- nfsd_file_put(nf);
+ if (refcount_dec_and_test(&nf->nf_ref))
+ nfsd_file_free(nf);
nf = NULL;
}
@@ -1131,7 +1124,7 @@ nfsd_file_do_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
* then unhash.
*/
if (status != nfs_ok || key.inode->i_nlink == 0)
- nfsd_file_unhash_and_put(nf);
+ nfsd_file_unhash(nf);
clear_bit_unlock(NFSD_FILE_PENDING, &nf->nf_flags);
smp_mb__after_atomic();
wake_up_bit(&nf->nf_flags, NFSD_FILE_PENDING);
diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
index b09ab4f92d43..a44ded06af87 100644
--- a/fs/nfsd/trace.h
+++ b/fs/nfsd/trace.h
@@ -903,10 +903,11 @@ DEFINE_EVENT(nfsd_file_class, name, \
TP_PROTO(struct nfsd_file *nf), \
TP_ARGS(nf))
-DEFINE_NFSD_FILE_EVENT(nfsd_file_put_final);
+DEFINE_NFSD_FILE_EVENT(nfsd_file_free);
DEFINE_NFSD_FILE_EVENT(nfsd_file_unhash);
DEFINE_NFSD_FILE_EVENT(nfsd_file_put);
-DEFINE_NFSD_FILE_EVENT(nfsd_file_unhash_and_dispose);
+DEFINE_NFSD_FILE_EVENT(nfsd_file_closing);
+DEFINE_NFSD_FILE_EVENT(nfsd_file_unhash_and_queue);
TRACE_EVENT(nfsd_file_alloc,
TP_PROTO(
--
2.37.3
When a GC entry gets added to the LRU, kick off SYNC_NONE writeback
so that we can be ready to close it when the time comes. This should
help minimize delays when freeing objects reaped from the LRU.
Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfsd/filecache.c | 23 +++++++++++++++++++++--
1 file changed, 21 insertions(+), 2 deletions(-)
diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
index 47cdc6129a7b..c43b6cff03e2 100644
--- a/fs/nfsd/filecache.c
+++ b/fs/nfsd/filecache.c
@@ -325,6 +325,20 @@ nfsd_file_fsync(struct nfsd_file *nf)
nfsd_reset_write_verifier(net_generic(nf->nf_net, nfsd_net_id));
}
+static void
+nfsd_file_flush(struct nfsd_file *nf)
+{
+ struct file *file = nf->nf_file;
+ struct address_space *mapping;
+
+ if (!file || !(file->f_mode & FMODE_WRITE))
+ return;
+
+ mapping = file->f_mapping;
+ if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
+ filemap_flush(mapping);
+}
+
static int
nfsd_file_check_write_error(struct nfsd_file *nf)
{
@@ -484,9 +498,14 @@ nfsd_file_put(struct nfsd_file *nf)
/* Try to add it to the LRU. If that fails, decrement. */
if (nfsd_file_lru_add(nf)) {
- /* If it's still hashed, we're done */
- if (test_bit(NFSD_FILE_HASHED, &nf->nf_flags))
+ /*
+ * If it's still hashed, we can just return now,
+ * after kicking off SYNC_NONE writeback.
+ */
+ if (test_bit(NFSD_FILE_HASHED, &nf->nf_flags)) {
+ nfsd_file_flush(nf);
return;
+ }
/*
* We're racing with unhashing, so try to remove it from
--
2.37.3
> On Oct 28, 2022, at 2:57 PM, Jeff Layton <[email protected]> wrote:
>
> We're counting mapping->nrpages, but not all of those are necessarily
> dirty. We don't really have a simple way to count just the dirty pages,
> so just remove this stat since it's not accurate.
OK. I'll apply this one when the others are ready, so just leave it
in this series when you re-post.
> Signed-off-by: Jeff Layton <[email protected]>
> ---
> fs/nfsd/filecache.c | 7 +------
> 1 file changed, 1 insertion(+), 6 deletions(-)
>
> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> index 98c6b5f51bc8..f8ebbf7daa18 100644
> --- a/fs/nfsd/filecache.c
> +++ b/fs/nfsd/filecache.c
> @@ -32,7 +32,6 @@ static DEFINE_PER_CPU(unsigned long, nfsd_file_cache_hits);
> static DEFINE_PER_CPU(unsigned long, nfsd_file_acquisitions);
> static DEFINE_PER_CPU(unsigned long, nfsd_file_releases);
> static DEFINE_PER_CPU(unsigned long, nfsd_file_total_age);
> -static DEFINE_PER_CPU(unsigned long, nfsd_file_pages_flushed);
> static DEFINE_PER_CPU(unsigned long, nfsd_file_evictions);
>
> struct nfsd_fcache_disposal {
> @@ -370,7 +369,6 @@ nfsd_file_flush(struct nfsd_file *nf)
>
> if (!file || !(file->f_mode & FMODE_WRITE))
> return;
> - this_cpu_add(nfsd_file_pages_flushed, file->f_mapping->nrpages);
> if (vfs_fsync(file, 1) != 0)
> nfsd_reset_write_verifier(net_generic(nf->nf_net, nfsd_net_id));
> }
> @@ -998,7 +996,6 @@ nfsd_file_cache_shutdown(void)
> per_cpu(nfsd_file_acquisitions, i) = 0;
> per_cpu(nfsd_file_releases, i) = 0;
> per_cpu(nfsd_file_total_age, i) = 0;
> - per_cpu(nfsd_file_pages_flushed, i) = 0;
> per_cpu(nfsd_file_evictions, i) = 0;
> }
> }
> @@ -1212,7 +1209,7 @@ nfsd_file_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
> */
> int nfsd_file_cache_stats_show(struct seq_file *m, void *v)
> {
> - unsigned long releases = 0, pages_flushed = 0, evictions = 0;
> + unsigned long releases = 0, evictions = 0;
> unsigned long hits = 0, acquisitions = 0;
> unsigned int i, count = 0, buckets = 0;
> unsigned long lru = 0, total_age = 0;
> @@ -1240,7 +1237,6 @@ int nfsd_file_cache_stats_show(struct seq_file *m, void *v)
> releases += per_cpu(nfsd_file_releases, i);
> total_age += per_cpu(nfsd_file_total_age, i);
> evictions += per_cpu(nfsd_file_evictions, i);
> - pages_flushed += per_cpu(nfsd_file_pages_flushed, i);
> }
>
> seq_printf(m, "total entries: %u\n", count);
> @@ -1254,6 +1250,5 @@ int nfsd_file_cache_stats_show(struct seq_file *m, void *v)
> seq_printf(m, "mean age (ms): %ld\n", total_age / releases);
> else
> seq_printf(m, "mean age (ms): -\n");
> - seq_printf(m, "pages flushed: %lu\n", pages_flushed);
> return 0;
> }
> --
> 2.37.3
>
--
Chuck Lever
> On Oct 28, 2022, at 2:57 PM, Jeff Layton <[email protected]> wrote:
>
> The filecache refcounting is a bit non-standard for something searchable
> by RCU, in that we maintain a sentinel reference while it's hashed. This
> in turn requires that we have to do things differently in the "put"
> depending on whether its hashed, which we believe to have led to races.
>
> There are other problems in here too. nfsd_file_close_inode_sync can end
> up freeing an nfsd_file while there are still outstanding references to
> it, and there are a number of subtle ToC/ToU races.
>
> Rework the code so that the refcount is what drives the lifecycle. When
> the refcount goes to zero, then unhash and rcu free the object.
>
> With this change, the LRU carries a reference. Take special care to
> deal with it when removing an entry from the list.
I can see a way of making this patch a lot cleaner. It looks like there's
a fair bit of renaming and moving of functions -- that can go in clean-up
patches before doing the heavy lifting.
I'm still not sold on the idea of a synchronous flush in nfsd_file_free().
That feels like a deadlock waiting to happen and quite difficult to
reproduce because I/O there is rarely needed. It could help to put a
might_sleep() in nfsd_file_fsync(), at least, but I would prefer not to
drive I/O in that path at all.
> Signed-off-by: Jeff Layton <[email protected]>
> ---
> fs/nfsd/filecache.c | 357 ++++++++++++++++++++++----------------------
> fs/nfsd/trace.h | 5 +-
> 2 files changed, 178 insertions(+), 184 deletions(-)
>
> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> index f8ebbf7daa18..d928c5e38eeb 100644
> --- a/fs/nfsd/filecache.c
> +++ b/fs/nfsd/filecache.c
> @@ -1,6 +1,12 @@
> // SPDX-License-Identifier: GPL-2.0
> /*
> * The NFSD open file cache.
> + *
> + * Each nfsd_file is created in response to client activity -- either regular
> + * file I/O for v2/v3, or opening a file for v4. Files opened via v4 are
> + * cleaned up as soon as their refcount goes to 0. Entries for v2/v3 are
> + * flagged with NFSD_FILE_GC. On their last put, they are added to the LRU for
> + * eventual disposal if they aren't used again within a short time period.
> */
>
> #include <linux/hash.h>
> @@ -301,55 +307,22 @@ nfsd_file_alloc(struct nfsd_file_lookup_key *key, unsigned int may)
> if (key->gc)
> __set_bit(NFSD_FILE_GC, &nf->nf_flags);
> nf->nf_inode = key->inode;
> - /* nf_ref is pre-incremented for hash table */
> - refcount_set(&nf->nf_ref, 2);
> + refcount_set(&nf->nf_ref, 1);
> nf->nf_may = key->need;
> nf->nf_mark = NULL;
> }
> return nf;
> }
>
> -static bool
> -nfsd_file_free(struct nfsd_file *nf)
> -{
> - s64 age = ktime_to_ms(ktime_sub(ktime_get(), nf->nf_birthtime));
> - bool flush = false;
> -
> - this_cpu_inc(nfsd_file_releases);
> - this_cpu_add(nfsd_file_total_age, age);
> -
> - trace_nfsd_file_put_final(nf);
> - if (nf->nf_mark)
> - nfsd_file_mark_put(nf->nf_mark);
> - if (nf->nf_file) {
> - get_file(nf->nf_file);
> - filp_close(nf->nf_file, NULL);
> - fput(nf->nf_file);
> - flush = true;
> - }
> -
> - /*
> - * If this item is still linked via nf_lru, that's a bug.
> - * WARN and leak it to preserve system stability.
> - */
> - if (WARN_ON_ONCE(!list_empty(&nf->nf_lru)))
> - return flush;
> -
> - call_rcu(&nf->nf_rcu, nfsd_file_slab_free);
> - return flush;
> -}
> -
> -static bool
> -nfsd_file_check_writeback(struct nfsd_file *nf)
> +static void
> +nfsd_file_fsync(struct nfsd_file *nf)
> {
> struct file *file = nf->nf_file;
> - struct address_space *mapping;
>
> if (!file || !(file->f_mode & FMODE_WRITE))
> - return false;
> - mapping = file->f_mapping;
> - return mapping_tagged(mapping, PAGECACHE_TAG_DIRTY) ||
> - mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK);
> + return;
> + if (vfs_fsync(file, 1) != 0)
> + nfsd_reset_write_verifier(net_generic(nf->nf_net, nfsd_net_id));
> }
>
> static int
> @@ -362,30 +335,6 @@ nfsd_file_check_write_error(struct nfsd_file *nf)
> return filemap_check_wb_err(file->f_mapping, READ_ONCE(file->f_wb_err));
> }
>
> -static void
> -nfsd_file_flush(struct nfsd_file *nf)
> -{
> - struct file *file = nf->nf_file;
> -
> - if (!file || !(file->f_mode & FMODE_WRITE))
> - return;
> - if (vfs_fsync(file, 1) != 0)
> - nfsd_reset_write_verifier(net_generic(nf->nf_net, nfsd_net_id));
> -}
> -
> -static void nfsd_file_lru_add(struct nfsd_file *nf)
> -{
> - set_bit(NFSD_FILE_REFERENCED, &nf->nf_flags);
> - if (list_lru_add(&nfsd_file_lru, &nf->nf_lru))
> - trace_nfsd_file_lru_add(nf);
> -}
> -
> -static void nfsd_file_lru_remove(struct nfsd_file *nf)
> -{
> - if (list_lru_del(&nfsd_file_lru, &nf->nf_lru))
> - trace_nfsd_file_lru_del(nf);
> -}
> -
> static void
> nfsd_file_hash_remove(struct nfsd_file *nf)
> {
> @@ -408,53 +357,66 @@ nfsd_file_unhash(struct nfsd_file *nf)
> }
>
> static void
> -nfsd_file_unhash_and_dispose(struct nfsd_file *nf, struct list_head *dispose)
> +nfsd_file_free(struct nfsd_file *nf)
> {
> - trace_nfsd_file_unhash_and_dispose(nf);
> - if (nfsd_file_unhash(nf)) {
> - /* caller must call nfsd_file_dispose_list() later */
> - nfsd_file_lru_remove(nf);
> - list_add(&nf->nf_lru, dispose);
> + s64 age = ktime_to_ms(ktime_sub(ktime_get(), nf->nf_birthtime));
> +
> + trace_nfsd_file_free(nf);
> +
> + this_cpu_inc(nfsd_file_releases);
> + this_cpu_add(nfsd_file_total_age, age);
> +
> + nfsd_file_unhash(nf);
> + nfsd_file_fsync(nf);
> +
> + if (nf->nf_mark)
> + nfsd_file_mark_put(nf->nf_mark);
> + if (nf->nf_file) {
> + get_file(nf->nf_file);
> + filp_close(nf->nf_file, NULL);
> + fput(nf->nf_file);
> }
> +
> + /*
> + * If this item is still linked via nf_lru, that's a bug.
> + * WARN and leak it to preserve system stability.
> + */
> + if (WARN_ON_ONCE(!list_empty(&nf->nf_lru)))
> + return;
> +
> + call_rcu(&nf->nf_rcu, nfsd_file_slab_free);
> }
>
> -static void
> -nfsd_file_put_noref(struct nfsd_file *nf)
> +static bool
> +nfsd_file_check_writeback(struct nfsd_file *nf)
> {
> - trace_nfsd_file_put(nf);
> + struct file *file = nf->nf_file;
> + struct address_space *mapping;
>
> - if (refcount_dec_and_test(&nf->nf_ref)) {
> - WARN_ON(test_bit(NFSD_FILE_HASHED, &nf->nf_flags));
> - nfsd_file_lru_remove(nf);
> - nfsd_file_free(nf);
> - }
> + if (!file || !(file->f_mode & FMODE_WRITE))
> + return false;
> + mapping = file->f_mapping;
> + return mapping_tagged(mapping, PAGECACHE_TAG_DIRTY) ||
> + mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK);
> }
>
> -static void
> -nfsd_file_unhash_and_put(struct nfsd_file *nf)
> +static bool nfsd_file_lru_add(struct nfsd_file *nf)
> {
> - if (nfsd_file_unhash(nf))
> - nfsd_file_put_noref(nf);
> + set_bit(NFSD_FILE_REFERENCED, &nf->nf_flags);
> + if (list_lru_add(&nfsd_file_lru, &nf->nf_lru)) {
> + trace_nfsd_file_lru_add(nf);
> + return true;
> + }
> + return false;
> }
>
> -void
> -nfsd_file_put(struct nfsd_file *nf)
> +static bool nfsd_file_lru_remove(struct nfsd_file *nf)
> {
> - might_sleep();
> -
> - if (test_bit(NFSD_FILE_GC, &nf->nf_flags))
> - nfsd_file_lru_add(nf);
> - else if (refcount_read(&nf->nf_ref) == 2)
> - nfsd_file_unhash_and_put(nf);
> -
> - if (!test_bit(NFSD_FILE_HASHED, &nf->nf_flags)) {
> - nfsd_file_flush(nf);
> - nfsd_file_put_noref(nf);
> - } else if (nf->nf_file && test_bit(NFSD_FILE_GC, &nf->nf_flags)) {
> - nfsd_file_put_noref(nf);
> - nfsd_file_schedule_laundrette();
> - } else
> - nfsd_file_put_noref(nf);
> + if (list_lru_del(&nfsd_file_lru, &nf->nf_lru)) {
> + trace_nfsd_file_lru_del(nf);
> + return true;
> + }
> + return false;
> }
>
> struct nfsd_file *
> @@ -465,36 +427,77 @@ nfsd_file_get(struct nfsd_file *nf)
> return NULL;
> }
>
> -static void
> -nfsd_file_dispose_list(struct list_head *dispose)
> +/**
> + * nfsd_file_unhash_and_queue - unhash a file and queue it to the dispose list
> + * @nf: nfsd_file to be unhashed and queued
> + * @dispose: list to which it should be queued
> + *
> + * Attempt to unhash a nfsd_file and queue it to the given list. Each file
> + * will have a reference held on behalf of the list. That reference may come
> + * from the LRU, or we may need to take one. If we can't get a reference,
> + * ignore it altogether.
> + */
> +static bool
> +nfsd_file_unhash_and_queue(struct nfsd_file *nf, struct list_head *dispose)
> {
> - struct nfsd_file *nf;
> + trace_nfsd_file_unhash_and_queue(nf);
> + if (nfsd_file_unhash(nf)) {
> + /*
> + * If we remove it from the LRU, then just use that
> + * reference for the dispose list. Otherwise, we need
> + * to take a reference. If that fails, just ignore
> + * the file altogether.
> + */
> + if (!nfsd_file_lru_remove(nf) && !nfsd_file_get(nf))
> + return false;
> + list_add(&nf->nf_lru, dispose);
> + return true;
> + }
> + return false;
> +}
>
> - while(!list_empty(dispose)) {
> - nf = list_first_entry(dispose, struct nfsd_file, nf_lru);
> - list_del_init(&nf->nf_lru);
> - nfsd_file_flush(nf);
> - nfsd_file_put_noref(nf);
> +/**
> + * nfsd_file_put - put the reference to a nfsd_file
> + * @nf: nfsd_file of which to put the reference
> + *
> + * Put a reference to a nfsd_file. In the v4 case, we just put the
> + * reference immediately. In the v2/3 case, if the reference would be
> + * the last one, then put it on the LRU instead to be cleaned up later.
> + */
> +void
> +nfsd_file_put(struct nfsd_file *nf)
> +{
> + trace_nfsd_file_put(nf);
> +
> + /*
> + * The HASHED check is racy. We may end up with the occasional
> + * unhashed entry on the LRU, but they should get cleaned up
> + * like any other.
> + */
> + if (test_bit(NFSD_FILE_GC, &nf->nf_flags) &&
> + test_bit(NFSD_FILE_HASHED, &nf->nf_flags)) {
> + /*
> + * If this is the last reference (nf_ref == 1), then transfer
> + * it to the LRU. If the add to the LRU fails, just put it as
> + * usual.
> + */
> + if (refcount_dec_not_one(&nf->nf_ref) || nfsd_file_lru_add(nf))
> + return;
> }
> + if (refcount_dec_and_test(&nf->nf_ref))
> + nfsd_file_free(nf);
> }
>
> static void
> -nfsd_file_dispose_list_sync(struct list_head *dispose)
> +nfsd_file_dispose_list(struct list_head *dispose)
> {
> - bool flush = false;
> struct nfsd_file *nf;
>
> while(!list_empty(dispose)) {
> nf = list_first_entry(dispose, struct nfsd_file, nf_lru);
> list_del_init(&nf->nf_lru);
> - nfsd_file_flush(nf);
> - if (!refcount_dec_and_test(&nf->nf_ref))
> - continue;
> - if (nfsd_file_free(nf))
> - flush = true;
> + nfsd_file_free(nf);
> }
> - if (flush)
> - flush_delayed_fput();
> }
>
> static void
> @@ -564,21 +567,8 @@ nfsd_file_lru_cb(struct list_head *item, struct list_lru_one *lru,
> struct list_head *head = arg;
> struct nfsd_file *nf = list_entry(item, struct nfsd_file, nf_lru);
>
> - /*
> - * Do a lockless refcount check. The hashtable holds one reference, so
> - * we look to see if anything else has a reference, or if any have
> - * been put since the shrinker last ran. Those don't get unhashed and
> - * released.
> - *
> - * Note that in the put path, we set the flag and then decrement the
> - * counter. Here we check the counter and then test and clear the flag.
> - * That order is deliberate to ensure that we can do this locklessly.
> - */
> - if (refcount_read(&nf->nf_ref) > 1) {
> - list_lru_isolate(lru, &nf->nf_lru);
> - trace_nfsd_file_gc_in_use(nf);
> - return LRU_REMOVED;
> - }
> + /* We should only be dealing with v2/3 entries here */
> + WARN_ON_ONCE(!test_bit(NFSD_FILE_GC, &nf->nf_flags));
>
> /*
> * Don't throw out files that are still undergoing I/O or
> @@ -589,40 +579,30 @@ nfsd_file_lru_cb(struct list_head *item, struct list_lru_one *lru,
> return LRU_SKIP;
> }
>
> + /* If it was recently added to the list, skip it */
> if (test_and_clear_bit(NFSD_FILE_REFERENCED, &nf->nf_flags)) {
> trace_nfsd_file_gc_referenced(nf);
> return LRU_ROTATE;
> }
>
> - if (!test_and_clear_bit(NFSD_FILE_HASHED, &nf->nf_flags)) {
> - trace_nfsd_file_gc_hashed(nf);
> - return LRU_SKIP;
> + /*
> + * Put the reference held on behalf of the LRU. If it wasn't the last
> + * one, then just remove it from the LRU and ignore it.
> + */
> + if (!refcount_dec_and_test(&nf->nf_ref)) {
> + trace_nfsd_file_gc_in_use(nf);
> + list_lru_isolate(lru, &nf->nf_lru);
> + return LRU_REMOVED;
> }
>
> + /* Refcount went to zero. Unhash it and queue it to the dispose list */
> + nfsd_file_unhash(nf);
> list_lru_isolate_move(lru, &nf->nf_lru, head);
> this_cpu_inc(nfsd_file_evictions);
> trace_nfsd_file_gc_disposed(nf);
> return LRU_REMOVED;
> }
>
> -/*
> - * Unhash items on @dispose immediately, then queue them on the
> - * disposal workqueue to finish releasing them in the background.
> - *
> - * cel: Note that between the time list_lru_shrink_walk runs and
> - * now, these items are in the hash table but marked unhashed.
> - * Why release these outside of lru_cb ? There's no lock ordering
> - * problem since lru_cb currently takes no lock.
> - */
> -static void nfsd_file_gc_dispose_list(struct list_head *dispose)
> -{
> - struct nfsd_file *nf;
> -
> - list_for_each_entry(nf, dispose, nf_lru)
> - nfsd_file_hash_remove(nf);
> - nfsd_file_dispose_list_delayed(dispose);
> -}
> -
> static void
> nfsd_file_gc(void)
> {
> @@ -632,7 +612,7 @@ nfsd_file_gc(void)
> ret = list_lru_walk(&nfsd_file_lru, nfsd_file_lru_cb,
> &dispose, list_lru_count(&nfsd_file_lru));
> trace_nfsd_file_gc_removed(ret, list_lru_count(&nfsd_file_lru));
> - nfsd_file_gc_dispose_list(&dispose);
> + nfsd_file_dispose_list_delayed(&dispose);
> }
>
> static void
> @@ -657,7 +637,7 @@ nfsd_file_lru_scan(struct shrinker *s, struct shrink_control *sc)
> ret = list_lru_shrink_walk(&nfsd_file_lru, sc,
> nfsd_file_lru_cb, &dispose);
> trace_nfsd_file_shrinker_removed(ret, list_lru_count(&nfsd_file_lru));
> - nfsd_file_gc_dispose_list(&dispose);
> + nfsd_file_dispose_list_delayed(&dispose);
> return ret;
> }
>
> @@ -668,8 +648,11 @@ static struct shrinker nfsd_file_shrinker = {
> };
>
> /*
> - * Find all cache items across all net namespaces that match @inode and
> - * move them to @dispose. The lookup is atomic wrt nfsd_file_acquire().
> + * Find all cache items across all net namespaces that match @inode, unhash
> + * them, take references and then put them on @dispose if that was successful.
> + *
> + * The nfsd_file objects on the list will be unhashed, and each will have a
> + * reference taken.
> */
> static unsigned int
> __nfsd_file_close_inode(struct inode *inode, struct list_head *dispose)
> @@ -687,52 +670,59 @@ __nfsd_file_close_inode(struct inode *inode, struct list_head *dispose)
> nfsd_file_rhash_params);
> if (!nf)
> break;
> - nfsd_file_unhash_and_dispose(nf, dispose);
> - count++;
> +
> + if (nfsd_file_unhash_and_queue(nf, dispose))
> + count++;
> } while (1);
> rcu_read_unlock();
> return count;
> }
>
> /**
> - * nfsd_file_close_inode_sync - attempt to forcibly close a nfsd_file
> + * nfsd_file_close_inode - attempt a delayed close of a nfsd_file
> * @inode: inode of the file to attempt to remove
> *
> - * Unhash and put, then flush and fput all cache items associated with @inode.
> + * Unhash and put all cache items associated with @inode.
> */
> -void
> -nfsd_file_close_inode_sync(struct inode *inode)
> +static unsigned int
> +nfsd_file_close_inode(struct inode *inode)
> {
> - LIST_HEAD(dispose);
> + struct nfsd_file *nf;
> unsigned int count;
> + LIST_HEAD(dispose);
>
> count = __nfsd_file_close_inode(inode, &dispose);
> - trace_nfsd_file_close_inode_sync(inode, count);
> - nfsd_file_dispose_list_sync(&dispose);
> + trace_nfsd_file_close_inode(inode, count);
> + if (count) {
> + while(!list_empty(&dispose)) {
> + nf = list_first_entry(&dispose, struct nfsd_file, nf_lru);
> + list_del_init(&nf->nf_lru);
> + trace_nfsd_file_closing(nf);
> + if (refcount_dec_and_test(&nf->nf_ref))
> + nfsd_file_free(nf);
> + }
> + }
> + return count;
> }
>
> /**
> - * nfsd_file_close_inode - attempt a delayed close of a nfsd_file
> + * nfsd_file_close_inode_sync - attempt to forcibly close a nfsd_file
> * @inode: inode of the file to attempt to remove
> *
> - * Unhash and put all cache item associated with @inode.
> + * Unhash and put, then flush and fput all cache items associated with @inode.
> */
> -static void
> -nfsd_file_close_inode(struct inode *inode)
> +void
> +nfsd_file_close_inode_sync(struct inode *inode)
> {
> - LIST_HEAD(dispose);
> - unsigned int count;
> -
> - count = __nfsd_file_close_inode(inode, &dispose);
> - trace_nfsd_file_close_inode(inode, count);
> - nfsd_file_dispose_list_delayed(&dispose);
> + if (nfsd_file_close_inode(inode))
> + flush_delayed_fput();
> }
>
> /**
> * nfsd_file_delayed_close - close unused nfsd_files
> * @work: dummy
> *
> - * Walk the LRU list and close any entries that have not been used since
> + * Walk the LRU list and destroy any entries that have not been used since
> * the last scan.
> */
> static void
> @@ -890,7 +880,7 @@ __nfsd_file_cache_purge(struct net *net)
> while (!IS_ERR_OR_NULL(nf)) {
> if (net && nf->nf_net != net)
> continue;
> - nfsd_file_unhash_and_dispose(nf, &dispose);
> + nfsd_file_unhash_and_queue(nf, &dispose);
> nf = rhashtable_walk_next(&iter);
> }
>
> @@ -1054,8 +1044,10 @@ nfsd_file_do_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
> rcu_read_lock();
> nf = rhashtable_lookup(&nfsd_file_rhash_tbl, &key,
> nfsd_file_rhash_params);
> - if (nf)
> - nf = nfsd_file_get(nf);
> + if (nf) {
> + if (!nfsd_file_lru_remove(nf))
> + nf = nfsd_file_get(nf);
> + }
> rcu_read_unlock();
> if (nf)
> goto wait_for_construction;
> @@ -1090,11 +1082,11 @@ nfsd_file_do_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
> goto out;
> }
> open_retry = false;
> - nfsd_file_put_noref(nf);
> + if (refcount_dec_and_test(&nf->nf_ref))
> + nfsd_file_free(nf);
> goto retry;
> }
>
> - nfsd_file_lru_remove(nf);
> this_cpu_inc(nfsd_file_cache_hits);
>
> status = nfserrno(nfsd_open_break_lease(file_inode(nf->nf_file), may_flags));
> @@ -1104,7 +1096,8 @@ nfsd_file_do_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
> this_cpu_inc(nfsd_file_acquisitions);
> *pnf = nf;
> } else {
> - nfsd_file_put(nf);
> + if (refcount_dec_and_test(&nf->nf_ref))
> + nfsd_file_free(nf);
> nf = NULL;
> }
>
> @@ -1131,7 +1124,7 @@ nfsd_file_do_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
> * then unhash.
> */
> if (status != nfs_ok || key.inode->i_nlink == 0)
> - nfsd_file_unhash_and_put(nf);
> + nfsd_file_unhash(nf);
> clear_bit_unlock(NFSD_FILE_PENDING, &nf->nf_flags);
> smp_mb__after_atomic();
> wake_up_bit(&nf->nf_flags, NFSD_FILE_PENDING);
> diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
> index b09ab4f92d43..a44ded06af87 100644
> --- a/fs/nfsd/trace.h
> +++ b/fs/nfsd/trace.h
> @@ -903,10 +903,11 @@ DEFINE_EVENT(nfsd_file_class, name, \
> TP_PROTO(struct nfsd_file *nf), \
> TP_ARGS(nf))
>
> -DEFINE_NFSD_FILE_EVENT(nfsd_file_put_final);
> +DEFINE_NFSD_FILE_EVENT(nfsd_file_free);
> DEFINE_NFSD_FILE_EVENT(nfsd_file_unhash);
> DEFINE_NFSD_FILE_EVENT(nfsd_file_put);
> -DEFINE_NFSD_FILE_EVENT(nfsd_file_unhash_and_dispose);
> +DEFINE_NFSD_FILE_EVENT(nfsd_file_closing);
> +DEFINE_NFSD_FILE_EVENT(nfsd_file_unhash_and_queue);
>
> TRACE_EVENT(nfsd_file_alloc,
> TP_PROTO(
> --
> 2.37.3
>
--
Chuck Lever
> On Oct 28, 2022, at 2:57 PM, Jeff Layton <[email protected]> wrote:
>
> The list_lru_add and list_lru_del functions use list_empty checks to see
> whether the object is already on the LRU. That's fine in most cases, but
> we occasionally repurpose nf_lru after unhashing. It's possible for an
> LRU removal to remove it from a different list altogether if we lose a
> race.
I've never seen that happen. lru field re-use is actually used in other
places in the kernel. Shouldn't we try to find and fix such races?
Wasn't the idea to reduce the complexity of nfsd_file_put?
> Add a new NFSD_FILE_LRU flag, which indicates that the object actually
> resides on the LRU and not some other list. Use that when adding and
> removing it from the LRU instead of just relying on list_empty checks.
>
> Add an extra HASHED check after adding the entry to the LRU. If it's now
> clear, just remove it from the LRU again and put the reference if that
> remove is successful.
>
> Signed-off-by: Jeff Layton <[email protected]>
> ---
> fs/nfsd/filecache.c | 44 +++++++++++++++++++++++++++++---------------
> fs/nfsd/filecache.h | 1 +
> 2 files changed, 30 insertions(+), 15 deletions(-)
>
> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> index d928c5e38eeb..47cdc6129a7b 100644
> --- a/fs/nfsd/filecache.c
> +++ b/fs/nfsd/filecache.c
> @@ -403,18 +403,22 @@ nfsd_file_check_writeback(struct nfsd_file *nf)
> static bool nfsd_file_lru_add(struct nfsd_file *nf)
> {
> set_bit(NFSD_FILE_REFERENCED, &nf->nf_flags);
> - if (list_lru_add(&nfsd_file_lru, &nf->nf_lru)) {
> - trace_nfsd_file_lru_add(nf);
> - return true;
> + if (!test_and_set_bit(NFSD_FILE_LRU, &nf->nf_flags)) {
> + if (list_lru_add(&nfsd_file_lru, &nf->nf_lru)) {
> + trace_nfsd_file_lru_add(nf);
> + return true;
> + }
> }
> return false;
> }
>
> static bool nfsd_file_lru_remove(struct nfsd_file *nf)
> {
> - if (list_lru_del(&nfsd_file_lru, &nf->nf_lru)) {
> - trace_nfsd_file_lru_del(nf);
> - return true;
> + if (test_and_clear_bit(NFSD_FILE_LRU, &nf->nf_flags)) {
> + if (list_lru_del(&nfsd_file_lru, &nf->nf_lru)) {
> + trace_nfsd_file_lru_del(nf);
> + return true;
> + }
> }
> return false;
> }
> @@ -469,20 +473,30 @@ nfsd_file_put(struct nfsd_file *nf)
> {
> trace_nfsd_file_put(nf);
>
> - /*
> - * The HASHED check is racy. We may end up with the occasional
> - * unhashed entry on the LRU, but they should get cleaned up
> - * like any other.
> - */
> if (test_bit(NFSD_FILE_GC, &nf->nf_flags) &&
> test_bit(NFSD_FILE_HASHED, &nf->nf_flags)) {
> /*
> - * If this is the last reference (nf_ref == 1), then transfer
> - * it to the LRU. If the add to the LRU fails, just put it as
> - * usual.
> + * If this is the last reference (nf_ref == 1), then try to
> + * transfer it to the LRU.
> */
> - if (refcount_dec_not_one(&nf->nf_ref) || nfsd_file_lru_add(nf))
> + if (refcount_dec_not_one(&nf->nf_ref))
> return;
> +
> + /* Try to add it to the LRU. If that fails, decrement. */
> + if (nfsd_file_lru_add(nf)) {
> + /* If it's still hashed, we're done */
> + if (test_bit(NFSD_FILE_HASHED, &nf->nf_flags))
> + return;
> +
> + /*
> + * We're racing with unhashing, so try to remove it from
> + * the LRU. If removal fails, then someone else already
> + * has our reference and we're done. If it succeeds,
> + * fall through to decrement.
> + */
> + if (!nfsd_file_lru_remove(nf))
> + return;
> + }
> }
> if (refcount_dec_and_test(&nf->nf_ref))
> nfsd_file_free(nf);
> diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h
> index b7efb2c3ddb1..e52ab7d5a44c 100644
> --- a/fs/nfsd/filecache.h
> +++ b/fs/nfsd/filecache.h
> @@ -39,6 +39,7 @@ struct nfsd_file {
> #define NFSD_FILE_PENDING (1)
> #define NFSD_FILE_REFERENCED (2)
> #define NFSD_FILE_GC (3)
> +#define NFSD_FILE_LRU (4) /* file is on LRU */
> unsigned long nf_flags;
> struct inode *nf_inode; /* don't deref */
> refcount_t nf_ref;
> --
> 2.37.3
>
--
Chuck Lever
> On Oct 28, 2022, at 2:57 PM, Jeff Layton <[email protected]> wrote:
>
> When a GC entry gets added to the LRU, kick off SYNC_NONE writeback
> so that we can be ready to close it when the time comes. This should
> help minimize delays when freeing objects reaped from the LRU.
>
> Signed-off-by: Jeff Layton <[email protected]>
> ---
> fs/nfsd/filecache.c | 23 +++++++++++++++++++++--
> 1 file changed, 21 insertions(+), 2 deletions(-)
>
> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> index 47cdc6129a7b..c43b6cff03e2 100644
> --- a/fs/nfsd/filecache.c
> +++ b/fs/nfsd/filecache.c
> @@ -325,6 +325,20 @@ nfsd_file_fsync(struct nfsd_file *nf)
> nfsd_reset_write_verifier(net_generic(nf->nf_net, nfsd_net_id));
> }
>
> +static void
> +nfsd_file_flush(struct nfsd_file *nf)
> +{
> + struct file *file = nf->nf_file;
> + struct address_space *mapping;
> +
> + if (!file || !(file->f_mode & FMODE_WRITE))
> + return;
> +
> + mapping = file->f_mapping;
> + if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
> + filemap_flush(mapping);
> +}
> +
> static int
> nfsd_file_check_write_error(struct nfsd_file *nf)
> {
> @@ -484,9 +498,14 @@ nfsd_file_put(struct nfsd_file *nf)
>
> /* Try to add it to the LRU. If that fails, decrement. */
> if (nfsd_file_lru_add(nf)) {
> - /* If it's still hashed, we're done */
> - if (test_bit(NFSD_FILE_HASHED, &nf->nf_flags))
> + /*
> + * If it's still hashed, we can just return now,
> + * after kicking off SYNC_NONE writeback.
> + */
> + if (test_bit(NFSD_FILE_HASHED, &nf->nf_flags)) {
> + nfsd_file_flush(nf);
> return;
> + }
nfsd_write() calls nfsd_file_put() after every nfsd_vfs_write(). In some
cases, this new logic adds an async flush after every UNSTABLE NFSv3 WRITE.
I'll need to see performance measurements demonstrating no negative
impact on throughput or latency of NFSv3 WRITEs with large payloads.
> /*
> * We're racing with unhashing, so try to remove it from
> --
> 2.37.3
>
--
Chuck Lever
On Fri, 2022-10-28 at 19:50 +0000, Chuck Lever III wrote:
>
> > On Oct 28, 2022, at 2:57 PM, Jeff Layton <[email protected]> wrote:
> >
> > The list_lru_add and list_lru_del functions use list_empty checks to see
> > whether the object is already on the LRU. That's fine in most cases, but
> > we occasionally repurpose nf_lru after unhashing. It's possible for an
> > LRU removal to remove it from a different list altogether if we lose a
> > race.
>
> I've never seen that happen. lru field re-use is actually used in other
> places in the kernel. Shouldn't we try to find and fix such races?
>
> Wasn't the idea to reduce the complexity of nfsd_file_put?
>
It certainly seems theoretically possible here. Maybe those other places
have other ways to ensure that it doesn't occur. We are dealing with
RCU-freed structures here, so we can't always rely on locks to keep things
nice and serialized.
FWIW, I left this as a separate patch just to illustrate the race and
fix, but we probably would want to squash this one into the first.
--
Jeff Layton <[email protected]>
On Fri, 2022-10-28 at 19:49 +0000, Chuck Lever III wrote:
>
> > On Oct 28, 2022, at 2:57 PM, Jeff Layton <[email protected]> wrote:
> >
> > The filecache refcounting is a bit non-standard for something searchable
> > by RCU, in that we maintain a sentinel reference while it's hashed. This
> > in turn requires that we have to do things differently in the "put"
> > depending on whether its hashed, which we believe to have led to races.
> >
> > There are other problems in here too. nfsd_file_close_inode_sync can end
> > up freeing an nfsd_file while there are still outstanding references to
> > it, and there are a number of subtle ToC/ToU races.
> >
> > Rework the code so that the refcount is what drives the lifecycle. When
> > the refcount goes to zero, then unhash and rcu free the object.
> >
> > With this change, the LRU carries a reference. Take special care to
> > deal with it when removing an entry from the list.
>
> I can see a way of making this patch a lot cleaner. It looks like there's
> a fair bit of renaming and moving of functions -- that can go in clean
> up patches before doing the heavy lifting.
>
Is this something that's really needed? I'm already basically rewriting
this code. Reshuffling the old code around first will take a lot of time
and we'll still end up with the same result.
> I'm still not sold on the idea of a synchronous flush in nfsd_file_free().
I think that we need to call this there to ensure that writeback errors
are handled. I worry that if we try to do this piecemeal, we could end up
missing errors when they fall off the LRU.
> That feels like a deadlock waiting to happen and quite difficult to
> reproduce because I/O there is rarely needed. It could help to put a
> might_sleep() in nfsd_file_fsync(), at least, but I would prefer not to
> drive I/O in that path at all.
I don't quite grok the potential for a deadlock here. nfsd_file_free
already has to deal with blocking activities since it effectively does a
close(). This is just another one. That's why nfsd_file_put has a
might_sleep in it (to warn its callers).
What's the deadlock scenario you envision?
> > Signed-off-by: Jeff Layton <[email protected]>
> > ---
> > fs/nfsd/filecache.c | 357 ++++++++++++++++++++++----------------------
> > fs/nfsd/trace.h | 5 +-
> > 2 files changed, 178 insertions(+), 184 deletions(-)
> >
> > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> > index f8ebbf7daa18..d928c5e38eeb 100644
> > --- a/fs/nfsd/filecache.c
> > +++ b/fs/nfsd/filecache.c
> > @@ -1,6 +1,12 @@
> > // SPDX-License-Identifier: GPL-2.0
> > /*
> > * The NFSD open file cache.
> > + *
> > + * Each nfsd_file is created in response to client activity -- either regular
> > + * file I/O for v2/v3, or opening a file for v4. Files opened via v4 are
> > + * cleaned up as soon as their refcount goes to 0. Entries for v2/v3 are
> > + * flagged with NFSD_FILE_GC. On their last put, they are added to the LRU for
> > + * eventual disposal if they aren't used again within a short time period.
> > */
> >
> > #include <linux/hash.h>
> > @@ -301,55 +307,22 @@ nfsd_file_alloc(struct nfsd_file_lookup_key *key, unsigned int may)
> > if (key->gc)
> > __set_bit(NFSD_FILE_GC, &nf->nf_flags);
> > nf->nf_inode = key->inode;
> > - /* nf_ref is pre-incremented for hash table */
> > - refcount_set(&nf->nf_ref, 2);
> > + refcount_set(&nf->nf_ref, 1);
> > nf->nf_may = key->need;
> > nf->nf_mark = NULL;
> > }
> > return nf;
> > }
> >
> > -static bool
> > -nfsd_file_free(struct nfsd_file *nf)
> > -{
> > - s64 age = ktime_to_ms(ktime_sub(ktime_get(), nf->nf_birthtime));
> > - bool flush = false;
> > -
> > - this_cpu_inc(nfsd_file_releases);
> > - this_cpu_add(nfsd_file_total_age, age);
> > -
> > - trace_nfsd_file_put_final(nf);
> > - if (nf->nf_mark)
> > - nfsd_file_mark_put(nf->nf_mark);
> > - if (nf->nf_file) {
> > - get_file(nf->nf_file);
> > - filp_close(nf->nf_file, NULL);
> > - fput(nf->nf_file);
> > - flush = true;
> > - }
> > -
> > - /*
> > - * If this item is still linked via nf_lru, that's a bug.
> > - * WARN and leak it to preserve system stability.
> > - */
> > - if (WARN_ON_ONCE(!list_empty(&nf->nf_lru)))
> > - return flush;
> > -
> > - call_rcu(&nf->nf_rcu, nfsd_file_slab_free);
> > - return flush;
> > -}
> > -
> > -static bool
> > -nfsd_file_check_writeback(struct nfsd_file *nf)
> > +static void
> > +nfsd_file_fsync(struct nfsd_file *nf)
> > {
> > struct file *file = nf->nf_file;
> > - struct address_space *mapping;
> >
> > if (!file || !(file->f_mode & FMODE_WRITE))
> > - return false;
> > - mapping = file->f_mapping;
> > - return mapping_tagged(mapping, PAGECACHE_TAG_DIRTY) ||
> > - mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK);
> > + return;
> > + if (vfs_fsync(file, 1) != 0)
> > + nfsd_reset_write_verifier(net_generic(nf->nf_net, nfsd_net_id));
> > }
> >
> > static int
> > @@ -362,30 +335,6 @@ nfsd_file_check_write_error(struct nfsd_file *nf)
> > return filemap_check_wb_err(file->f_mapping, READ_ONCE(file->f_wb_err));
> > }
> >
> > -static void
> > -nfsd_file_flush(struct nfsd_file *nf)
> > -{
> > - struct file *file = nf->nf_file;
> > -
> > - if (!file || !(file->f_mode & FMODE_WRITE))
> > - return;
> > - if (vfs_fsync(file, 1) != 0)
> > - nfsd_reset_write_verifier(net_generic(nf->nf_net, nfsd_net_id));
> > -}
> > -
> > -static void nfsd_file_lru_add(struct nfsd_file *nf)
> > -{
> > - set_bit(NFSD_FILE_REFERENCED, &nf->nf_flags);
> > - if (list_lru_add(&nfsd_file_lru, &nf->nf_lru))
> > - trace_nfsd_file_lru_add(nf);
> > -}
> > -
> > -static void nfsd_file_lru_remove(struct nfsd_file *nf)
> > -{
> > - if (list_lru_del(&nfsd_file_lru, &nf->nf_lru))
> > - trace_nfsd_file_lru_del(nf);
> > -}
> > -
> > static void
> > nfsd_file_hash_remove(struct nfsd_file *nf)
> > {
> > @@ -408,53 +357,66 @@ nfsd_file_unhash(struct nfsd_file *nf)
> > }
> >
> > static void
> > -nfsd_file_unhash_and_dispose(struct nfsd_file *nf, struct list_head *dispose)
> > +nfsd_file_free(struct nfsd_file *nf)
> > {
> > - trace_nfsd_file_unhash_and_dispose(nf);
> > - if (nfsd_file_unhash(nf)) {
> > - /* caller must call nfsd_file_dispose_list() later */
> > - nfsd_file_lru_remove(nf);
> > - list_add(&nf->nf_lru, dispose);
> > + s64 age = ktime_to_ms(ktime_sub(ktime_get(), nf->nf_birthtime));
> > +
> > + trace_nfsd_file_free(nf);
> > +
> > + this_cpu_inc(nfsd_file_releases);
> > + this_cpu_add(nfsd_file_total_age, age);
> > +
> > + nfsd_file_unhash(nf);
> > + nfsd_file_fsync(nf);
> > +
> > + if (nf->nf_mark)
> > + nfsd_file_mark_put(nf->nf_mark);
> > + if (nf->nf_file) {
> > + get_file(nf->nf_file);
> > + filp_close(nf->nf_file, NULL);
> > + fput(nf->nf_file);
> > }
> > +
> > + /*
> > + * If this item is still linked via nf_lru, that's a bug.
> > + * WARN and leak it to preserve system stability.
> > + */
> > + if (WARN_ON_ONCE(!list_empty(&nf->nf_lru)))
> > + return;
> > +
> > + call_rcu(&nf->nf_rcu, nfsd_file_slab_free);
> > }
> >
> > -static void
> > -nfsd_file_put_noref(struct nfsd_file *nf)
> > +static bool
> > +nfsd_file_check_writeback(struct nfsd_file *nf)
> > {
> > - trace_nfsd_file_put(nf);
> > + struct file *file = nf->nf_file;
> > + struct address_space *mapping;
> >
> > - if (refcount_dec_and_test(&nf->nf_ref)) {
> > - WARN_ON(test_bit(NFSD_FILE_HASHED, &nf->nf_flags));
> > - nfsd_file_lru_remove(nf);
> > - nfsd_file_free(nf);
> > - }
> > + if (!file || !(file->f_mode & FMODE_WRITE))
> > + return false;
> > + mapping = file->f_mapping;
> > + return mapping_tagged(mapping, PAGECACHE_TAG_DIRTY) ||
> > + mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK);
> > }
> >
> > -static void
> > -nfsd_file_unhash_and_put(struct nfsd_file *nf)
> > +static bool nfsd_file_lru_add(struct nfsd_file *nf)
> > {
> > - if (nfsd_file_unhash(nf))
> > - nfsd_file_put_noref(nf);
> > + set_bit(NFSD_FILE_REFERENCED, &nf->nf_flags);
> > + if (list_lru_add(&nfsd_file_lru, &nf->nf_lru)) {
> > + trace_nfsd_file_lru_add(nf);
> > + return true;
> > + }
> > + return false;
> > }
> >
> > -void
> > -nfsd_file_put(struct nfsd_file *nf)
> > +static bool nfsd_file_lru_remove(struct nfsd_file *nf)
> > {
> > - might_sleep();
> > -
> > - if (test_bit(NFSD_FILE_GC, &nf->nf_flags))
> > - nfsd_file_lru_add(nf);
> > - else if (refcount_read(&nf->nf_ref) == 2)
> > - nfsd_file_unhash_and_put(nf);
> > -
> > - if (!test_bit(NFSD_FILE_HASHED, &nf->nf_flags)) {
> > - nfsd_file_flush(nf);
> > - nfsd_file_put_noref(nf);
> > - } else if (nf->nf_file && test_bit(NFSD_FILE_GC, &nf->nf_flags)) {
> > - nfsd_file_put_noref(nf);
> > - nfsd_file_schedule_laundrette();
> > - } else
> > - nfsd_file_put_noref(nf);
> > + if (list_lru_del(&nfsd_file_lru, &nf->nf_lru)) {
> > + trace_nfsd_file_lru_del(nf);
> > + return true;
> > + }
> > + return false;
> > }
> >
> > struct nfsd_file *
> > @@ -465,36 +427,77 @@ nfsd_file_get(struct nfsd_file *nf)
> > return NULL;
> > }
> >
> > -static void
> > -nfsd_file_dispose_list(struct list_head *dispose)
> > +/**
> > + * nfsd_file_unhash_and_queue - unhash a file and queue it to the dispose list
> > + * @nf: nfsd_file to be unhashed and queued
> > + * @dispose: list to which it should be queued
> > + *
> > + * Attempt to unhash a nfsd_file and queue it to the given list. Each file
> > + * will have a reference held on behalf of the list. That reference may come
> > + * from the LRU, or we may need to take one. If we can't get a reference,
> > + * ignore it altogether.
> > + */
> > +static bool
> > +nfsd_file_unhash_and_queue(struct nfsd_file *nf, struct list_head *dispose)
> > {
> > - struct nfsd_file *nf;
> > + trace_nfsd_file_unhash_and_queue(nf);
> > + if (nfsd_file_unhash(nf)) {
> > + /*
> > + * If we remove it from the LRU, then just use that
> > + * reference for the dispose list. Otherwise, we need
> > + * to take a reference. If that fails, just ignore
> > + * the file altogether.
> > + */
> > + if (!nfsd_file_lru_remove(nf) && !nfsd_file_get(nf))
> > + return false;
> > + list_add(&nf->nf_lru, dispose);
> > + return true;
> > + }
> > + return false;
> > +}
> >
> > - while(!list_empty(dispose)) {
> > - nf = list_first_entry(dispose, struct nfsd_file, nf_lru);
> > - list_del_init(&nf->nf_lru);
> > - nfsd_file_flush(nf);
> > - nfsd_file_put_noref(nf);
> > +/**
> > + * nfsd_file_put - put the reference to a nfsd_file
> > + * @nf: nfsd_file of which to put the reference
> > + *
> > + * Put a reference to a nfsd_file. In the v4 case, we just put the
> > + * reference immediately. In the v2/3 case, if the reference would be
> > + * the last one, then put it on the LRU instead to be cleaned up later.
> > + */
> > +void
> > +nfsd_file_put(struct nfsd_file *nf)
> > +{
> > + trace_nfsd_file_put(nf);
> > +
> > + /*
> > + * The HASHED check is racy. We may end up with the occasional
> > + * unhashed entry on the LRU, but they should get cleaned up
> > + * like any other.
> > + */
> > + if (test_bit(NFSD_FILE_GC, &nf->nf_flags) &&
> > + test_bit(NFSD_FILE_HASHED, &nf->nf_flags)) {
> > + /*
> > + * If this is the last reference (nf_ref == 1), then transfer
> > + * it to the LRU. If the add to the LRU fails, just put it as
> > + * usual.
> > + */
> > + if (refcount_dec_not_one(&nf->nf_ref) || nfsd_file_lru_add(nf))
> > + return;
> > }
> > + if (refcount_dec_and_test(&nf->nf_ref))
> > + nfsd_file_free(nf);
> > }
> >
> > static void
> > -nfsd_file_dispose_list_sync(struct list_head *dispose)
> > +nfsd_file_dispose_list(struct list_head *dispose)
> > {
> > - bool flush = false;
> > struct nfsd_file *nf;
> >
> > while(!list_empty(dispose)) {
> > nf = list_first_entry(dispose, struct nfsd_file, nf_lru);
> > list_del_init(&nf->nf_lru);
> > - nfsd_file_flush(nf);
> > - if (!refcount_dec_and_test(&nf->nf_ref))
> > - continue;
> > - if (nfsd_file_free(nf))
> > - flush = true;
> > + nfsd_file_free(nf);
> > }
> > - if (flush)
> > - flush_delayed_fput();
> > }
> >
> > static void
> > @@ -564,21 +567,8 @@ nfsd_file_lru_cb(struct list_head *item, struct list_lru_one *lru,
> > struct list_head *head = arg;
> > struct nfsd_file *nf = list_entry(item, struct nfsd_file, nf_lru);
> >
> > - /*
> > - * Do a lockless refcount check. The hashtable holds one reference, so
> > - * we look to see if anything else has a reference, or if any have
> > - * been put since the shrinker last ran. Those don't get unhashed and
> > - * released.
> > - *
> > - * Note that in the put path, we set the flag and then decrement the
> > - * counter. Here we check the counter and then test and clear the flag.
> > - * That order is deliberate to ensure that we can do this locklessly.
> > - */
> > - if (refcount_read(&nf->nf_ref) > 1) {
> > - list_lru_isolate(lru, &nf->nf_lru);
> > - trace_nfsd_file_gc_in_use(nf);
> > - return LRU_REMOVED;
> > - }
> > + /* We should only be dealing with v2/3 entries here */
> > + WARN_ON_ONCE(!test_bit(NFSD_FILE_GC, &nf->nf_flags));
> >
> > /*
> > * Don't throw out files that are still undergoing I/O or
> > @@ -589,40 +579,30 @@ nfsd_file_lru_cb(struct list_head *item, struct list_lru_one *lru,
> > return LRU_SKIP;
> > }
> >
> > + /* If it was recently added to the list, skip it */
> > if (test_and_clear_bit(NFSD_FILE_REFERENCED, &nf->nf_flags)) {
> > trace_nfsd_file_gc_referenced(nf);
> > return LRU_ROTATE;
> > }
> >
> > - if (!test_and_clear_bit(NFSD_FILE_HASHED, &nf->nf_flags)) {
> > - trace_nfsd_file_gc_hashed(nf);
> > - return LRU_SKIP;
> > + /*
> > + * Put the reference held on behalf of the LRU. If it wasn't the last
> > + * one, then just remove it from the LRU and ignore it.
> > + */
> > + if (!refcount_dec_and_test(&nf->nf_ref)) {
> > + trace_nfsd_file_gc_in_use(nf);
> > + list_lru_isolate(lru, &nf->nf_lru);
> > + return LRU_REMOVED;
> > }
> >
> > + /* Refcount went to zero. Unhash it and queue it to the dispose list */
> > + nfsd_file_unhash(nf);
> > list_lru_isolate_move(lru, &nf->nf_lru, head);
> > this_cpu_inc(nfsd_file_evictions);
> > trace_nfsd_file_gc_disposed(nf);
> > return LRU_REMOVED;
> > }
> >
> > -/*
> > - * Unhash items on @dispose immediately, then queue them on the
> > - * disposal workqueue to finish releasing them in the background.
> > - *
> > - * cel: Note that between the time list_lru_shrink_walk runs and
> > - * now, these items are in the hash table but marked unhashed.
> > - * Why release these outside of lru_cb ? There's no lock ordering
> > - * problem since lru_cb currently takes no lock.
> > - */
> > -static void nfsd_file_gc_dispose_list(struct list_head *dispose)
> > -{
> > - struct nfsd_file *nf;
> > -
> > - list_for_each_entry(nf, dispose, nf_lru)
> > - nfsd_file_hash_remove(nf);
> > - nfsd_file_dispose_list_delayed(dispose);
> > -}
> > -
> > static void
> > nfsd_file_gc(void)
> > {
> > @@ -632,7 +612,7 @@ nfsd_file_gc(void)
> > ret = list_lru_walk(&nfsd_file_lru, nfsd_file_lru_cb,
> > &dispose, list_lru_count(&nfsd_file_lru));
> > trace_nfsd_file_gc_removed(ret, list_lru_count(&nfsd_file_lru));
> > - nfsd_file_gc_dispose_list(&dispose);
> > + nfsd_file_dispose_list_delayed(&dispose);
> > }
> >
> > static void
> > @@ -657,7 +637,7 @@ nfsd_file_lru_scan(struct shrinker *s, struct shrink_control *sc)
> > ret = list_lru_shrink_walk(&nfsd_file_lru, sc,
> > nfsd_file_lru_cb, &dispose);
> > trace_nfsd_file_shrinker_removed(ret, list_lru_count(&nfsd_file_lru));
> > - nfsd_file_gc_dispose_list(&dispose);
> > + nfsd_file_dispose_list_delayed(&dispose);
> > return ret;
> > }
> >
> > @@ -668,8 +648,11 @@ static struct shrinker nfsd_file_shrinker = {
> > };
> >
> > /*
> > - * Find all cache items across all net namespaces that match @inode and
> > - * move them to @dispose. The lookup is atomic wrt nfsd_file_acquire().
> > + * Find all cache items across all net namespaces that match @inode, unhash
> > + * them, take references and then put them on @dispose if that was successful.
> > + *
> > + * The nfsd_file objects on the list will be unhashed, and each will have a
> > + * reference taken.
> > */
> > static unsigned int
> > __nfsd_file_close_inode(struct inode *inode, struct list_head *dispose)
> > @@ -687,52 +670,59 @@ __nfsd_file_close_inode(struct inode *inode, struct list_head *dispose)
> > nfsd_file_rhash_params);
> > if (!nf)
> > break;
> > - nfsd_file_unhash_and_dispose(nf, dispose);
> > - count++;
> > +
> > + if (nfsd_file_unhash_and_queue(nf, dispose))
> > + count++;
> > } while (1);
> > rcu_read_unlock();
> > return count;
> > }
> >
> > /**
> > - * nfsd_file_close_inode_sync - attempt to forcibly close a nfsd_file
> > + * nfsd_file_close_inode - attempt a delayed close of a nfsd_file
> > * @inode: inode of the file to attempt to remove
> > *
> > - * Unhash and put, then flush and fput all cache items associated with @inode.
> > + * Unhash and put all cache items associated with @inode.
> > */
> > -void
> > -nfsd_file_close_inode_sync(struct inode *inode)
> > +static unsigned int
> > +nfsd_file_close_inode(struct inode *inode)
> > {
> > - LIST_HEAD(dispose);
> > + struct nfsd_file *nf;
> > unsigned int count;
> > + LIST_HEAD(dispose);
> >
> > count = __nfsd_file_close_inode(inode, &dispose);
> > - trace_nfsd_file_close_inode_sync(inode, count);
> > - nfsd_file_dispose_list_sync(&dispose);
> > + trace_nfsd_file_close_inode(inode, count);
> > + if (count) {
> > + while(!list_empty(&dispose)) {
> > + nf = list_first_entry(&dispose, struct nfsd_file, nf_lru);
> > + list_del_init(&nf->nf_lru);
> > + trace_nfsd_file_closing(nf);
> > + if (refcount_dec_and_test(&nf->nf_ref))
> > + nfsd_file_free(nf);
> > + }
> > + }
> > + return count;
> > }
> >
> > /**
> > - * nfsd_file_close_inode - attempt a delayed close of a nfsd_file
> > + * nfsd_file_close_inode_sync - attempt to forcibly close a nfsd_file
> > * @inode: inode of the file to attempt to remove
> > *
> > - * Unhash and put all cache items associated with @inode.
> > + * Unhash and put, then flush and fput all cache items associated with @inode.
> > */
> > -static void
> > -nfsd_file_close_inode(struct inode *inode)
> > +void
> > +nfsd_file_close_inode_sync(struct inode *inode)
> > {
> > - LIST_HEAD(dispose);
> > - unsigned int count;
> > -
> > - count = __nfsd_file_close_inode(inode, &dispose);
> > - trace_nfsd_file_close_inode(inode, count);
> > - nfsd_file_dispose_list_delayed(&dispose);
> > + if (nfsd_file_close_inode(inode))
> > + flush_delayed_fput();
> > }
> >
> > /**
> > * nfsd_file_delayed_close - close unused nfsd_files
> > * @work: dummy
> > *
> > - * Walk the LRU list and close any entries that have not been used since
> > + * Walk the LRU list and destroy any entries that have not been used since
> > * the last scan.
> > */
> > static void
> > @@ -890,7 +880,7 @@ __nfsd_file_cache_purge(struct net *net)
> > while (!IS_ERR_OR_NULL(nf)) {
> > if (net && nf->nf_net != net)
> > continue;
> > - nfsd_file_unhash_and_dispose(nf, &dispose);
> > + nfsd_file_unhash_and_queue(nf, &dispose);
> > nf = rhashtable_walk_next(&iter);
> > }
> >
> > @@ -1054,8 +1044,10 @@ nfsd_file_do_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
> > rcu_read_lock();
> > nf = rhashtable_lookup(&nfsd_file_rhash_tbl, &key,
> > nfsd_file_rhash_params);
> > - if (nf)
> > - nf = nfsd_file_get(nf);
> > + if (nf) {
> > + if (!nfsd_file_lru_remove(nf))
> > + nf = nfsd_file_get(nf);
> > + }
> > rcu_read_unlock();
> > if (nf)
> > goto wait_for_construction;
> > @@ -1090,11 +1082,11 @@ nfsd_file_do_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
> > goto out;
> > }
> > open_retry = false;
> > - nfsd_file_put_noref(nf);
> > + if (refcount_dec_and_test(&nf->nf_ref))
> > + nfsd_file_free(nf);
> > goto retry;
> > }
> >
> > - nfsd_file_lru_remove(nf);
> > this_cpu_inc(nfsd_file_cache_hits);
> >
> > status = nfserrno(nfsd_open_break_lease(file_inode(nf->nf_file), may_flags));
> > @@ -1104,7 +1096,8 @@ nfsd_file_do_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
> > this_cpu_inc(nfsd_file_acquisitions);
> > *pnf = nf;
> > } else {
> > - nfsd_file_put(nf);
> > + if (refcount_dec_and_test(&nf->nf_ref))
> > + nfsd_file_free(nf);
> > nf = NULL;
> > }
> >
> > @@ -1131,7 +1124,7 @@ nfsd_file_do_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
> > * then unhash.
> > */
> > if (status != nfs_ok || key.inode->i_nlink == 0)
> > - nfsd_file_unhash_and_put(nf);
> > + nfsd_file_unhash(nf);
> > clear_bit_unlock(NFSD_FILE_PENDING, &nf->nf_flags);
> > smp_mb__after_atomic();
> > wake_up_bit(&nf->nf_flags, NFSD_FILE_PENDING);
> > diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
> > index b09ab4f92d43..a44ded06af87 100644
> > --- a/fs/nfsd/trace.h
> > +++ b/fs/nfsd/trace.h
> > @@ -903,10 +903,11 @@ DEFINE_EVENT(nfsd_file_class, name, \
> > TP_PROTO(struct nfsd_file *nf), \
> > TP_ARGS(nf))
> >
> > -DEFINE_NFSD_FILE_EVENT(nfsd_file_put_final);
> > +DEFINE_NFSD_FILE_EVENT(nfsd_file_free);
> > DEFINE_NFSD_FILE_EVENT(nfsd_file_unhash);
> > DEFINE_NFSD_FILE_EVENT(nfsd_file_put);
> > -DEFINE_NFSD_FILE_EVENT(nfsd_file_unhash_and_dispose);
> > +DEFINE_NFSD_FILE_EVENT(nfsd_file_closing);
> > +DEFINE_NFSD_FILE_EVENT(nfsd_file_unhash_and_queue);
> >
> > TRACE_EVENT(nfsd_file_alloc,
> > TP_PROTO(
> > --
> > 2.37.3
> >
>
> --
> Chuck Lever
>
>
>
--
Jeff Layton <[email protected]>
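The nfsd_file_put() rework quoted above hinges on refcount_dec_not_one(): drop the reference unless it is the last one, and otherwise try to hand that last reference to the LRU. A user-space model of just that transfer, as a sketch using C11 atomics — the struct, lru_add() and the freed flag are stand-ins for the kernel objects, not the kernel API:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Decrement the count unless it is 1; mirrors refcount_dec_not_one(). */
static bool dec_not_one(_Atomic int *ref)
{
	int old = atomic_load(ref);

	while (old != 1) {
		/* On failure, old is reloaded and the test repeats. */
		if (atomic_compare_exchange_weak(ref, &old, old - 1))
			return true;
	}
	return false;
}

struct fake_nf {
	_Atomic int nf_ref;
	bool on_lru;	/* stand-in for list_lru_add() succeeding */
	bool freed;
};

/* Stand-in for list_lru_add(): fails if the entry is already listed. */
static bool lru_add(struct fake_nf *nf)
{
	if (nf->on_lru)
		return false;
	nf->on_lru = true;
	return true;
}

/*
 * Sketch of nfsd_file_put() for a GC (v2/v3) entry: if ours is not the
 * last reference, just drop it; if it is, try to transfer it to the
 * LRU, and free only when the LRU would not take it.
 */
static void put(struct fake_nf *nf)
{
	if (dec_not_one(&nf->nf_ref) || lru_add(nf))
		return;
	if (atomic_fetch_sub(&nf->nf_ref, 1) == 1)
		nf->freed = true;
}
```

The point of the model: after the second put the object is neither freed nor at refcount zero — the LRU now owns the last reference, which is what the GC callback later drops.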
On Fri, 2022-10-28 at 19:50 +0000, Chuck Lever III wrote:
>
> > On Oct 28, 2022, at 2:57 PM, Jeff Layton <[email protected]> wrote:
> >
> > When a GC entry gets added to the LRU, kick off SYNC_NONE writeback
> > so that we can be ready to close it when the time comes. This should
> > help minimize delays when freeing objects reaped from the LRU.
> >
> > Signed-off-by: Jeff Layton <[email protected]>
> > ---
> > fs/nfsd/filecache.c | 23 +++++++++++++++++++++--
> > 1 file changed, 21 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> > index 47cdc6129a7b..c43b6cff03e2 100644
> > --- a/fs/nfsd/filecache.c
> > +++ b/fs/nfsd/filecache.c
> > @@ -325,6 +325,20 @@ nfsd_file_fsync(struct nfsd_file *nf)
> > nfsd_reset_write_verifier(net_generic(nf->nf_net, nfsd_net_id));
> > }
> >
> > +static void
> > +nfsd_file_flush(struct nfsd_file *nf)
> > +{
> > + struct file *file = nf->nf_file;
> > + struct address_space *mapping;
> > +
> > + if (!file || !(file->f_mode & FMODE_WRITE))
> > + return;
> > +
> > + mapping = file->f_mapping;
> > + if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
> > + filemap_flush(mapping);
> > +}
> > +
> > static int
> > nfsd_file_check_write_error(struct nfsd_file *nf)
> > {
> > @@ -484,9 +498,14 @@ nfsd_file_put(struct nfsd_file *nf)
> >
> > /* Try to add it to the LRU. If that fails, decrement. */
> > if (nfsd_file_lru_add(nf)) {
> > - /* If it's still hashed, we're done */
> > - if (test_bit(NFSD_FILE_HASHED, &nf->nf_flags))
> > + /*
> > + * If it's still hashed, we can just return now,
> > + * after kicking off SYNC_NONE writeback.
> > + */
> > + if (test_bit(NFSD_FILE_HASHED, &nf->nf_flags)) {
> > + nfsd_file_flush(nf);
> > return;
> > + }
>
> nfsd_write() calls nfsd_file_put() after every nfsd_vfs_write(). In some
> cases, this new logic adds an async flush after every UNSTABLE NFSv3 WRITE.
>
> I'll need to see performance measurements demonstrating no negative
> impact on throughput or latency of NFSv3 WRITEs with large payloads.
>
>
In your earlier mail, you mentioned that you wanted the writeback work
to be done in the context of nfsd threads. nfsd_file_put is how nfsd
threads put their references, so this seems like the right place to do
it.
If you're concerned about calling filemap_flush too often because we
have an entry that's cycling onto and off of the LRU, then another
(maybe better) idea might be to kick off writeback when we clear the
REFERENCED flag. That would need to be done in the gc thread context,
however.
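A user-space model of that alternative — flush only when the REFERENCED bit is observed set and cleared during a GC pass — might look like the following sketch with C11 atomics; `gc_entry`, `kick_writeback` and the flag name are stand-ins for illustration, not the kernel API:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define ENTRY_REFERENCED (1u << 0)

struct gc_entry {
	_Atomic unsigned int flags;
	int flushes;		/* counts stand-in writeback kicks */
};

/* Stand-in for filemap_flush(): just record that writeback started. */
static void kick_writeback(struct gc_entry *e)
{
	e->flushes++;
}

/* Mark the entry used, as each put would. No flush happens here. */
static void mark_used(struct gc_entry *e)
{
	atomic_fetch_or(&e->flags, ENTRY_REFERENCED);
}

/*
 * One GC pass over a single entry: clear REFERENCED, and only if it
 * was set kick off background writeback and rotate. An entry cycling
 * on and off the LRU is thus flushed once per idle period, not once
 * per put. Returns true when the entry would be reaped.
 */
static bool gc_scan_one(struct gc_entry *e)
{
	unsigned int old = atomic_fetch_and(&e->flags, ~ENTRY_REFERENCED);

	if (old & ENTRY_REFERENCED) {
		kick_writeback(e);	/* first idle pass: start flush */
		return false;		/* i.e. LRU_ROTATE */
	}
	return true;			/* second idle pass: reap */
}
```

However many puts happen between scans, the flush fires once per rotation, which is the property that distinguishes this from flushing in nfsd_file_put().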
> > /*
> > * We're racing with unhashing, so try to remove it from
> > --
> > 2.37.3
> >
>
> --
> Chuck Lever
>
>
>
--
Jeff Layton <[email protected]>
> On Oct 28, 2022, at 4:13 PM, Jeff Layton <[email protected]> wrote:
>
> On Fri, 2022-10-28 at 19:49 +0000, Chuck Lever III wrote:
>>
>>> On Oct 28, 2022, at 2:57 PM, Jeff Layton <[email protected]> wrote:
>>>
>>> The filecache refcounting is a bit non-standard for something searchable
>>> by RCU, in that we maintain a sentinel reference while it's hashed. This
>>> in turn requires that we have to do things differently in the "put"
>>> depending on whether it's hashed, which we believe to have led to races.
>>>
>>> There are other problems in here too. nfsd_file_close_inode_sync can end
>>> up freeing an nfsd_file while there are still outstanding references to
>>> it, and there are a number of subtle ToC/ToU races.
>>>
>>> Rework the code so that the refcount is what drives the lifecycle. When
>>> the refcount goes to zero, then unhash and rcu free the object.
>>>
>>> With this change, the LRU carries a reference. Take special care to
>>> deal with it when removing an entry from the list.
>>
>> I can see a way of making this patch a lot cleaner. It looks like there's
>> a fair bit of renaming and moving of functions -- that can go in clean
>> up patches before doing the heavy lifting.
>>
>
> Is this something that's really needed? I'm already basically rewriting
> this code. Reshuffling the old code around first will take a lot of time
> and we'll still end up with the same result.
I did exactly this for the nfs4_file rhash changes. It took just a couple
of hours. The outcome is that you can see exactly, in the final patch in
that series, how the file_hashtbl -> rhltable substitution is done.
Making sure each of the changes is more or less mechanical and obvious
is a good way to ensure no-one is doing something incorrect. That's why
folks like to use Coccinelle.
Trust me, it will be much easier to figure out in a year when we have
new bugs in this code if we split up this commit just a little.
>> I'm still not sold on the idea of a synchronous flush in nfsd_file_free().
>
> I think that we need to call this there to ensure that writeback errors
> are handled. I worry that if we try to do this piecemeal, we could end up
> missing errors when they fall off the LRU.
>
>> That feels like a deadlock waiting to happen and quite difficult to
>> reproduce because I/O there is rarely needed. It could help to put a
>> might_sleep() in nfsd_file_fsync(), at least, but I would prefer not to
>> drive I/O in that path at all.
>
> I don't quite grok the potential for a deadlock here. nfsd_file_free
> already has to deal with blocking activities due to it effectively doing a
> close(). This is just another one. That's why nfsd_file_put has a
> might_sleep in it (to warn its callers).
Currently nfsd_file_put() calls nfsd_file_flush(), which calls
vfs_fsync(). That can't be called while holding a spinlock.
> What's the deadlock scenario you envision?
OK, filp_close() does call f_op->flush(). So we have this call
here and there aren't problems today. I still say this is a
problem waiting to occur, but I guess I can live with it.
If filp_close() already calls f_op->flush(), why do we need an
explicit vfs_fsync() there?
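For what the extra vfs_fsync() buys over ->flush alone, the user-space analogue is close() versus fsync(): close need not wait for the data to reach stable storage, while fsync both writes back dirty pages and reports any writeback error that would otherwise be lost. A sketch under that analogy — write_durably() is a made-up helper, not kernel code:

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Write buf to path and only report success once the data has been
 * handed to stable storage. Without the fsync(), close() below could
 * succeed and an I/O error hit during later background writeback
 * would never be seen by this caller.
 */
int write_durably(const char *path, const char *buf, size_t len)
{
	int ret = 0;
	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);

	if (fd < 0)
		return -1;
	if (write(fd, buf, len) != (ssize_t)len)
		ret = -1;
	if (ret == 0 && fsync(fd) != 0)
		ret = -1;	/* writeback error surfaces here */
	if (close(fd) != 0)
		ret = -1;
	return ret;
}
```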
>>> Signed-off-by: Jeff Layton <[email protected]>
>>> ---
>>> fs/nfsd/filecache.c | 357 ++++++++++++++++++++++----------------------
>>> fs/nfsd/trace.h | 5 +-
>>> 2 files changed, 178 insertions(+), 184 deletions(-)
>>>
>>> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
>>> index f8ebbf7daa18..d928c5e38eeb 100644
>>> --- a/fs/nfsd/filecache.c
>>> +++ b/fs/nfsd/filecache.c
>>> @@ -1,6 +1,12 @@
>>> // SPDX-License-Identifier: GPL-2.0
>>> /*
>>> * The NFSD open file cache.
>>> + *
>>> + * Each nfsd_file is created in response to client activity -- either regular
>>> + * file I/O for v2/v3, or opening a file for v4. Files opened via v4 are
>>> + * cleaned up as soon as their refcount goes to 0. Entries for v2/v3 are
>>> + * flagged with NFSD_FILE_GC. On their last put, they are added to the LRU for
>>> + * eventual disposal if they aren't used again within a short time period.
>>> */
>>>
>>> #include <linux/hash.h>
>>> @@ -301,55 +307,22 @@ nfsd_file_alloc(struct nfsd_file_lookup_key *key, unsigned int may)
>>> if (key->gc)
>>> __set_bit(NFSD_FILE_GC, &nf->nf_flags);
>>> nf->nf_inode = key->inode;
>>> - /* nf_ref is pre-incremented for hash table */
>>> - refcount_set(&nf->nf_ref, 2);
>>> + refcount_set(&nf->nf_ref, 1);
>>> nf->nf_may = key->need;
>>> nf->nf_mark = NULL;
>>> }
>>> return nf;
>>> }
>>>
>>> -static bool
>>> -nfsd_file_free(struct nfsd_file *nf)
>>> -{
>>> - s64 age = ktime_to_ms(ktime_sub(ktime_get(), nf->nf_birthtime));
>>> - bool flush = false;
>>> -
>>> - this_cpu_inc(nfsd_file_releases);
>>> - this_cpu_add(nfsd_file_total_age, age);
>>> -
>>> - trace_nfsd_file_put_final(nf);
>>> - if (nf->nf_mark)
>>> - nfsd_file_mark_put(nf->nf_mark);
>>> - if (nf->nf_file) {
>>> - get_file(nf->nf_file);
>>> - filp_close(nf->nf_file, NULL);
>>> - fput(nf->nf_file);
>>> - flush = true;
>>> - }
>>> -
>>> - /*
>>> - * If this item is still linked via nf_lru, that's a bug.
>>> - * WARN and leak it to preserve system stability.
>>> - */
>>> - if (WARN_ON_ONCE(!list_empty(&nf->nf_lru)))
>>> - return flush;
>>> -
>>> - call_rcu(&nf->nf_rcu, nfsd_file_slab_free);
>>> - return flush;
>>> -}
>>> -
>>> -static bool
>>> -nfsd_file_check_writeback(struct nfsd_file *nf)
>>> +static void
>>> +nfsd_file_fsync(struct nfsd_file *nf)
>>> {
>>> struct file *file = nf->nf_file;
>>> - struct address_space *mapping;
>>>
>>> if (!file || !(file->f_mode & FMODE_WRITE))
>>> - return false;
>>> - mapping = file->f_mapping;
>>> - return mapping_tagged(mapping, PAGECACHE_TAG_DIRTY) ||
>>> - mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK);
>>> + return;
>>> + if (vfs_fsync(file, 1) != 0)
>>> + nfsd_reset_write_verifier(net_generic(nf->nf_net, nfsd_net_id));
>>> }
>>>
>>> static int
>>> @@ -362,30 +335,6 @@ nfsd_file_check_write_error(struct nfsd_file *nf)
>>> return filemap_check_wb_err(file->f_mapping, READ_ONCE(file->f_wb_err));
>>> }
>>>
>>> -static void
>>> -nfsd_file_flush(struct nfsd_file *nf)
>>> -{
>>> - struct file *file = nf->nf_file;
>>> -
>>> - if (!file || !(file->f_mode & FMODE_WRITE))
>>> - return;
>>> - if (vfs_fsync(file, 1) != 0)
>>> - nfsd_reset_write_verifier(net_generic(nf->nf_net, nfsd_net_id));
>>> -}
>>> -
>>> -static void nfsd_file_lru_add(struct nfsd_file *nf)
>>> -{
>>> - set_bit(NFSD_FILE_REFERENCED, &nf->nf_flags);
>>> - if (list_lru_add(&nfsd_file_lru, &nf->nf_lru))
>>> - trace_nfsd_file_lru_add(nf);
>>> -}
>>> -
>>> -static void nfsd_file_lru_remove(struct nfsd_file *nf)
>>> -{
>>> - if (list_lru_del(&nfsd_file_lru, &nf->nf_lru))
>>> - trace_nfsd_file_lru_del(nf);
>>> -}
>>> -
>>> static void
>>> nfsd_file_hash_remove(struct nfsd_file *nf)
>>> {
>>> @@ -408,53 +357,66 @@ nfsd_file_unhash(struct nfsd_file *nf)
>>> }
>>>
>>> static void
>>> -nfsd_file_unhash_and_dispose(struct nfsd_file *nf, struct list_head *dispose)
>>> +nfsd_file_free(struct nfsd_file *nf)
>>> {
>>> - trace_nfsd_file_unhash_and_dispose(nf);
>>> - if (nfsd_file_unhash(nf)) {
>>> - /* caller must call nfsd_file_dispose_list() later */
>>> - nfsd_file_lru_remove(nf);
>>> - list_add(&nf->nf_lru, dispose);
>>> + s64 age = ktime_to_ms(ktime_sub(ktime_get(), nf->nf_birthtime));
>>> +
>>> + trace_nfsd_file_free(nf);
>>> +
>>> + this_cpu_inc(nfsd_file_releases);
>>> + this_cpu_add(nfsd_file_total_age, age);
>>> +
>>> + nfsd_file_unhash(nf);
>>> + nfsd_file_fsync(nf);
>>> +
>>> + if (nf->nf_mark)
>>> + nfsd_file_mark_put(nf->nf_mark);
>>> + if (nf->nf_file) {
>>> + get_file(nf->nf_file);
>>> + filp_close(nf->nf_file, NULL);
>>> + fput(nf->nf_file);
>>> }
>>> +
>>> + /*
>>> + * If this item is still linked via nf_lru, that's a bug.
>>> + * WARN and leak it to preserve system stability.
>>> + */
>>> + if (WARN_ON_ONCE(!list_empty(&nf->nf_lru)))
>>> + return;
>>> +
>>> + call_rcu(&nf->nf_rcu, nfsd_file_slab_free);
>>> }
>>>
>>> -static void
>>> -nfsd_file_put_noref(struct nfsd_file *nf)
>>> +static bool
>>> +nfsd_file_check_writeback(struct nfsd_file *nf)
>>> {
>>> - trace_nfsd_file_put(nf);
>>> + struct file *file = nf->nf_file;
>>> + struct address_space *mapping;
>>>
>>> - if (refcount_dec_and_test(&nf->nf_ref)) {
>>> - WARN_ON(test_bit(NFSD_FILE_HASHED, &nf->nf_flags));
>>> - nfsd_file_lru_remove(nf);
>>> - nfsd_file_free(nf);
>>> - }
>>> + if (!file || !(file->f_mode & FMODE_WRITE))
>>> + return false;
>>> + mapping = file->f_mapping;
>>> + return mapping_tagged(mapping, PAGECACHE_TAG_DIRTY) ||
>>> + mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK);
>>> }
>>>
>>> -static void
>>> -nfsd_file_unhash_and_put(struct nfsd_file *nf)
>>> +static bool nfsd_file_lru_add(struct nfsd_file *nf)
>>> {
>>> - if (nfsd_file_unhash(nf))
>>> - nfsd_file_put_noref(nf);
>>> + set_bit(NFSD_FILE_REFERENCED, &nf->nf_flags);
>>> + if (list_lru_add(&nfsd_file_lru, &nf->nf_lru)) {
>>> + trace_nfsd_file_lru_add(nf);
>>> + return true;
>>> + }
>>> + return false;
>>> }
>>>
>>> -void
>>> -nfsd_file_put(struct nfsd_file *nf)
>>> +static bool nfsd_file_lru_remove(struct nfsd_file *nf)
>>> {
>>> - might_sleep();
>>> -
>>> - if (test_bit(NFSD_FILE_GC, &nf->nf_flags))
>>> - nfsd_file_lru_add(nf);
>>> - else if (refcount_read(&nf->nf_ref) == 2)
>>> - nfsd_file_unhash_and_put(nf);
>>> -
>>> - if (!test_bit(NFSD_FILE_HASHED, &nf->nf_flags)) {
>>> - nfsd_file_flush(nf);
>>> - nfsd_file_put_noref(nf);
>>> - } else if (nf->nf_file && test_bit(NFSD_FILE_GC, &nf->nf_flags)) {
>>> - nfsd_file_put_noref(nf);
>>> - nfsd_file_schedule_laundrette();
>>> - } else
>>> - nfsd_file_put_noref(nf);
>>> + if (list_lru_del(&nfsd_file_lru, &nf->nf_lru)) {
>>> + trace_nfsd_file_lru_del(nf);
>>> + return true;
>>> + }
>>> + return false;
>>> }
>>>
>>> struct nfsd_file *
>>> @@ -465,36 +427,77 @@ nfsd_file_get(struct nfsd_file *nf)
>>> return NULL;
>>> }
>>>
>>> -static void
>>> -nfsd_file_dispose_list(struct list_head *dispose)
>>> +/**
>>> + * nfsd_file_unhash_and_queue - unhash a file and queue it to the dispose list
>>> + * @nf: nfsd_file to be unhashed and queued
>>> + * @dispose: list to which it should be queued
>>> + *
>>> + * Attempt to unhash a nfsd_file and queue it to the given list. Each file
>>> + * will have a reference held on behalf of the list. That reference may come
>>> + * from the LRU, or we may need to take one. If we can't get a reference,
>>> + * ignore it altogether.
>>> + */
>>> +static bool
>>> +nfsd_file_unhash_and_queue(struct nfsd_file *nf, struct list_head *dispose)
>>> {
>>> - struct nfsd_file *nf;
>>> + trace_nfsd_file_unhash_and_queue(nf);
>>> + if (nfsd_file_unhash(nf)) {
>>> + /*
>>> + * If we remove it from the LRU, then just use that
>>> + * reference for the dispose list. Otherwise, we need
>>> + * to take a reference. If that fails, just ignore
>>> + * the file altogether.
>>> + */
>>> + if (!nfsd_file_lru_remove(nf) && !nfsd_file_get(nf))
>>> + return false;
>>> + list_add(&nf->nf_lru, dispose);
>>> + return true;
>>> + }
>>> + return false;
>>> +}
>>>
>>> - while(!list_empty(dispose)) {
>>> - nf = list_first_entry(dispose, struct nfsd_file, nf_lru);
>>> - list_del_init(&nf->nf_lru);
>>> - nfsd_file_flush(nf);
>>> - nfsd_file_put_noref(nf);
>>> +/**
>>> + * nfsd_file_put - put the reference to a nfsd_file
>>> + * @nf: nfsd_file of which to put the reference
>>> + *
>>> + * Put a reference to a nfsd_file. In the v4 case, we just put the
>>> + * reference immediately. In the v2/3 case, if the reference would be
>>> + * the last one, then put it on the LRU instead to be cleaned up later.
>>> + */
>>> +void
>>> +nfsd_file_put(struct nfsd_file *nf)
>>> +{
>>> + trace_nfsd_file_put(nf);
>>> +
>>> + /*
>>> + * The HASHED check is racy. We may end up with the occasional
>>> + * unhashed entry on the LRU, but they should get cleaned up
>>> + * like any other.
>>> + */
>>> + if (test_bit(NFSD_FILE_GC, &nf->nf_flags) &&
>>> + test_bit(NFSD_FILE_HASHED, &nf->nf_flags)) {
>>> + /*
>>> + * If this is the last reference (nf_ref == 1), then transfer
>>> + * it to the LRU. If the add to the LRU fails, just put it as
>>> + * usual.
>>> + */
>>> + if (refcount_dec_not_one(&nf->nf_ref) || nfsd_file_lru_add(nf))
>>> + return;
>>> }
>>> + if (refcount_dec_and_test(&nf->nf_ref))
>>> + nfsd_file_free(nf);
>>> }
>>>
>>> static void
>>> -nfsd_file_dispose_list_sync(struct list_head *dispose)
>>> +nfsd_file_dispose_list(struct list_head *dispose)
>>> {
>>> - bool flush = false;
>>> struct nfsd_file *nf;
>>>
>>> while(!list_empty(dispose)) {
>>> nf = list_first_entry(dispose, struct nfsd_file, nf_lru);
>>> list_del_init(&nf->nf_lru);
>>> - nfsd_file_flush(nf);
>>> - if (!refcount_dec_and_test(&nf->nf_ref))
>>> - continue;
>>> - if (nfsd_file_free(nf))
>>> - flush = true;
>>> + nfsd_file_free(nf);
>>> }
>>> - if (flush)
>>> - flush_delayed_fput();
>>> }
>>>
>>> static void
>>> @@ -564,21 +567,8 @@ nfsd_file_lru_cb(struct list_head *item, struct list_lru_one *lru,
>>> struct list_head *head = arg;
>>> struct nfsd_file *nf = list_entry(item, struct nfsd_file, nf_lru);
>>>
>>> - /*
>>> - * Do a lockless refcount check. The hashtable holds one reference, so
>>> - * we look to see if anything else has a reference, or if any have
>>> - * been put since the shrinker last ran. Those don't get unhashed and
>>> - * released.
>>> - *
>>> - * Note that in the put path, we set the flag and then decrement the
>>> - * counter. Here we check the counter and then test and clear the flag.
>>> - * That order is deliberate to ensure that we can do this locklessly.
>>> - */
>>> - if (refcount_read(&nf->nf_ref) > 1) {
>>> - list_lru_isolate(lru, &nf->nf_lru);
>>> - trace_nfsd_file_gc_in_use(nf);
>>> - return LRU_REMOVED;
>>> - }
>>> + /* We should only be dealing with v2/3 entries here */
>>> + WARN_ON_ONCE(!test_bit(NFSD_FILE_GC, &nf->nf_flags));
>>>
>>> /*
>>> * Don't throw out files that are still undergoing I/O or
>>> @@ -589,40 +579,30 @@ nfsd_file_lru_cb(struct list_head *item, struct list_lru_one *lru,
>>> return LRU_SKIP;
>>> }
>>>
>>> + /* If it was recently added to the list, skip it */
>>> if (test_and_clear_bit(NFSD_FILE_REFERENCED, &nf->nf_flags)) {
>>> trace_nfsd_file_gc_referenced(nf);
>>> return LRU_ROTATE;
>>> }
>>>
>>> - if (!test_and_clear_bit(NFSD_FILE_HASHED, &nf->nf_flags)) {
>>> - trace_nfsd_file_gc_hashed(nf);
>>> - return LRU_SKIP;
>>> + /*
>>> + * Put the reference held on behalf of the LRU. If it wasn't the last
>>> + * one, then just remove it from the LRU and ignore it.
>>> + */
>>> + if (!refcount_dec_and_test(&nf->nf_ref)) {
>>> + trace_nfsd_file_gc_in_use(nf);
>>> + list_lru_isolate(lru, &nf->nf_lru);
>>> + return LRU_REMOVED;
>>> }
>>>
>>> + /* Refcount went to zero. Unhash it and queue it to the dispose list */
>>> + nfsd_file_unhash(nf);
>>> list_lru_isolate_move(lru, &nf->nf_lru, head);
>>> this_cpu_inc(nfsd_file_evictions);
>>> trace_nfsd_file_gc_disposed(nf);
>>> return LRU_REMOVED;
>>> }
>>>
>>> -/*
>>> - * Unhash items on @dispose immediately, then queue them on the
>>> - * disposal workqueue to finish releasing them in the background.
>>> - *
>>> - * cel: Note that between the time list_lru_shrink_walk runs and
>>> - * now, these items are in the hash table but marked unhashed.
>>> - * Why release these outside of lru_cb ? There's no lock ordering
>>> - * problem since lru_cb currently takes no lock.
>>> - */
>>> -static void nfsd_file_gc_dispose_list(struct list_head *dispose)
>>> -{
>>> - struct nfsd_file *nf;
>>> -
>>> - list_for_each_entry(nf, dispose, nf_lru)
>>> - nfsd_file_hash_remove(nf);
>>> - nfsd_file_dispose_list_delayed(dispose);
>>> -}
>>> -
>>> static void
>>> nfsd_file_gc(void)
>>> {
>>> @@ -632,7 +612,7 @@ nfsd_file_gc(void)
>>> ret = list_lru_walk(&nfsd_file_lru, nfsd_file_lru_cb,
>>> &dispose, list_lru_count(&nfsd_file_lru));
>>> trace_nfsd_file_gc_removed(ret, list_lru_count(&nfsd_file_lru));
>>> - nfsd_file_gc_dispose_list(&dispose);
>>> + nfsd_file_dispose_list_delayed(&dispose);
>>> }
>>>
>>> static void
>>> @@ -657,7 +637,7 @@ nfsd_file_lru_scan(struct shrinker *s, struct shrink_control *sc)
>>> ret = list_lru_shrink_walk(&nfsd_file_lru, sc,
>>> nfsd_file_lru_cb, &dispose);
>>> trace_nfsd_file_shrinker_removed(ret, list_lru_count(&nfsd_file_lru));
>>> - nfsd_file_gc_dispose_list(&dispose);
>>> + nfsd_file_dispose_list_delayed(&dispose);
>>> return ret;
>>> }
>>>
>>> @@ -668,8 +648,11 @@ static struct shrinker nfsd_file_shrinker = {
>>> };
>>>
>>> /*
>>> - * Find all cache items across all net namespaces that match @inode and
>>> - * move them to @dispose. The lookup is atomic wrt nfsd_file_acquire().
>>> + * Find all cache items across all net namespaces that match @inode, unhash
>>> + * them, take references and then put them on @dispose if that was successful.
>>> + *
>>> + * The nfsd_file objects on the list will be unhashed, and each will have a
>>> + * reference taken.
>>> */
>>> static unsigned int
>>> __nfsd_file_close_inode(struct inode *inode, struct list_head *dispose)
>>> @@ -687,52 +670,59 @@ __nfsd_file_close_inode(struct inode *inode, struct list_head *dispose)
>>> nfsd_file_rhash_params);
>>> if (!nf)
>>> break;
>>> - nfsd_file_unhash_and_dispose(nf, dispose);
>>> - count++;
>>> +
>>> + if (nfsd_file_unhash_and_queue(nf, dispose))
>>> + count++;
>>> } while (1);
>>> rcu_read_unlock();
>>> return count;
>>> }
>>>
>>> /**
>>> - * nfsd_file_close_inode_sync - attempt to forcibly close a nfsd_file
>>> + * nfsd_file_close_inode - attempt a delayed close of a nfsd_file
>>> * @inode: inode of the file to attempt to remove
>>> *
>>> - * Unhash and put, then flush and fput all cache items associated with @inode.
>>> + * Unhash and put all cache items associated with @inode.
>>> */
>>> -void
>>> -nfsd_file_close_inode_sync(struct inode *inode)
>>> +static unsigned int
>>> +nfsd_file_close_inode(struct inode *inode)
>>> {
>>> - LIST_HEAD(dispose);
>>> + struct nfsd_file *nf;
>>> unsigned int count;
>>> + LIST_HEAD(dispose);
>>>
>>> count = __nfsd_file_close_inode(inode, &dispose);
>>> - trace_nfsd_file_close_inode_sync(inode, count);
>>> - nfsd_file_dispose_list_sync(&dispose);
>>> + trace_nfsd_file_close_inode(inode, count);
>>> + if (count) {
>>> + while(!list_empty(&dispose)) {
>>> + nf = list_first_entry(&dispose, struct nfsd_file, nf_lru);
>>> + list_del_init(&nf->nf_lru);
>>> + trace_nfsd_file_closing(nf);
>>> + if (refcount_dec_and_test(&nf->nf_ref))
>>> + nfsd_file_free(nf);
>>> + }
>>> + }
>>> + return count;
>>> }
>>>
>>> /**
>>> - * nfsd_file_close_inode - attempt a delayed close of a nfsd_file
>>> + * nfsd_file_close_inode_sync - attempt to forcibly close a nfsd_file
>>> * @inode: inode of the file to attempt to remove
>>> *
>>> - * Unhash and put all cache item associated with @inode.
>>> + * Unhash and put, then flush and fput all cache items associated with @inode.
>>> */
>>> -static void
>>> -nfsd_file_close_inode(struct inode *inode)
>>> +void
>>> +nfsd_file_close_inode_sync(struct inode *inode)
>>> {
>>> - LIST_HEAD(dispose);
>>> - unsigned int count;
>>> -
>>> - count = __nfsd_file_close_inode(inode, &dispose);
>>> - trace_nfsd_file_close_inode(inode, count);
>>> - nfsd_file_dispose_list_delayed(&dispose);
>>> + if (nfsd_file_close_inode(inode))
>>> + flush_delayed_fput();
>>> }
>>>
>>> /**
>>> * nfsd_file_delayed_close - close unused nfsd_files
>>> * @work: dummy
>>> *
>>> - * Walk the LRU list and close any entries that have not been used since
>>> + * Walk the LRU list and destroy any entries that have not been used since
>>> * the last scan.
>>> */
>>> static void
>>> @@ -890,7 +880,7 @@ __nfsd_file_cache_purge(struct net *net)
>>> while (!IS_ERR_OR_NULL(nf)) {
>>> if (net && nf->nf_net != net)
>>> continue;
>>> - nfsd_file_unhash_and_dispose(nf, &dispose);
>>> + nfsd_file_unhash_and_queue(nf, &dispose);
>>> nf = rhashtable_walk_next(&iter);
>>> }
>>>
>>> @@ -1054,8 +1044,10 @@ nfsd_file_do_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
>>> rcu_read_lock();
>>> nf = rhashtable_lookup(&nfsd_file_rhash_tbl, &key,
>>> nfsd_file_rhash_params);
>>> - if (nf)
>>> - nf = nfsd_file_get(nf);
>>> + if (nf) {
>>> + if (!nfsd_file_lru_remove(nf))
>>> + nf = nfsd_file_get(nf);
>>> + }
>>> rcu_read_unlock();
>>> if (nf)
>>> goto wait_for_construction;
>>> @@ -1090,11 +1082,11 @@ nfsd_file_do_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
>>> goto out;
>>> }
>>> open_retry = false;
>>> - nfsd_file_put_noref(nf);
>>> + if (refcount_dec_and_test(&nf->nf_ref))
>>> + nfsd_file_free(nf);
>>> goto retry;
>>> }
>>>
>>> - nfsd_file_lru_remove(nf);
>>> this_cpu_inc(nfsd_file_cache_hits);
>>>
>>> status = nfserrno(nfsd_open_break_lease(file_inode(nf->nf_file), may_flags));
>>> @@ -1104,7 +1096,8 @@ nfsd_file_do_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
>>> this_cpu_inc(nfsd_file_acquisitions);
>>> *pnf = nf;
>>> } else {
>>> - nfsd_file_put(nf);
>>> + if (refcount_dec_and_test(&nf->nf_ref))
>>> + nfsd_file_free(nf);
>>> nf = NULL;
>>> }
>>>
>>> @@ -1131,7 +1124,7 @@ nfsd_file_do_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
>>> * then unhash.
>>> */
>>> if (status != nfs_ok || key.inode->i_nlink == 0)
>>> - nfsd_file_unhash_and_put(nf);
>>> + nfsd_file_unhash(nf);
>>> clear_bit_unlock(NFSD_FILE_PENDING, &nf->nf_flags);
>>> smp_mb__after_atomic();
>>> wake_up_bit(&nf->nf_flags, NFSD_FILE_PENDING);
>>> diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
>>> index b09ab4f92d43..a44ded06af87 100644
>>> --- a/fs/nfsd/trace.h
>>> +++ b/fs/nfsd/trace.h
>>> @@ -903,10 +903,11 @@ DEFINE_EVENT(nfsd_file_class, name, \
>>> TP_PROTO(struct nfsd_file *nf), \
>>> TP_ARGS(nf))
>>>
>>> -DEFINE_NFSD_FILE_EVENT(nfsd_file_put_final);
>>> +DEFINE_NFSD_FILE_EVENT(nfsd_file_free);
>>> DEFINE_NFSD_FILE_EVENT(nfsd_file_unhash);
>>> DEFINE_NFSD_FILE_EVENT(nfsd_file_put);
>>> -DEFINE_NFSD_FILE_EVENT(nfsd_file_unhash_and_dispose);
>>> +DEFINE_NFSD_FILE_EVENT(nfsd_file_closing);
>>> +DEFINE_NFSD_FILE_EVENT(nfsd_file_unhash_and_queue);
>>>
>>> TRACE_EVENT(nfsd_file_alloc,
>>> TP_PROTO(
>>> --
>>> 2.37.3
>>>
>>
>> --
>> Chuck Lever
>>
>>
>>
>
> --
> Jeff Layton <[email protected]>
--
Chuck Lever
> On Oct 28, 2022, at 4:30 PM, Jeff Layton <[email protected]> wrote:
>
> On Fri, 2022-10-28 at 19:50 +0000, Chuck Lever III wrote:
>>
>>> On Oct 28, 2022, at 2:57 PM, Jeff Layton <[email protected]> wrote:
>>>
>>> When a GC entry gets added to the LRU, kick off SYNC_NONE writeback
>>> so that we can be ready to close it when the time comes. This should
>>> help minimize delays when freeing objects reaped from the LRU.
>>>
>>> Signed-off-by: Jeff Layton <[email protected]>
>>> ---
>>> fs/nfsd/filecache.c | 23 +++++++++++++++++++++--
>>> 1 file changed, 21 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
>>> index 47cdc6129a7b..c43b6cff03e2 100644
>>> --- a/fs/nfsd/filecache.c
>>> +++ b/fs/nfsd/filecache.c
>>> @@ -325,6 +325,20 @@ nfsd_file_fsync(struct nfsd_file *nf)
>>> nfsd_reset_write_verifier(net_generic(nf->nf_net, nfsd_net_id));
>>> }
>>>
>>> +static void
>>> +nfsd_file_flush(struct nfsd_file *nf)
>>> +{
>>> + struct file *file = nf->nf_file;
>>> + struct address_space *mapping;
>>> +
>>> + if (!file || !(file->f_mode & FMODE_WRITE))
>>> + return;
>>> +
>>> + mapping = file->f_mapping;
>>> + if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
>>> + filemap_flush(mapping);
>>> +}
>>> +
>>> static int
>>> nfsd_file_check_write_error(struct nfsd_file *nf)
>>> {
>>> @@ -484,9 +498,14 @@ nfsd_file_put(struct nfsd_file *nf)
>>>
>>> /* Try to add it to the LRU. If that fails, decrement. */
>>> if (nfsd_file_lru_add(nf)) {
>>> - /* If it's still hashed, we're done */
>>> - if (test_bit(NFSD_FILE_HASHED, &nf->nf_flags))
>>> + /*
>>> + * If it's still hashed, we can just return now,
>>> + * after kicking off SYNC_NONE writeback.
>>> + */
>>> + if (test_bit(NFSD_FILE_HASHED, &nf->nf_flags)) {
>>> + nfsd_file_flush(nf);
>>> return;
>>> + }
>>
>> nfsd_write() calls nfsd_file_put() after every nfsd_vfs_write(). In some
>> cases, this new logic adds an async flush after every UNSTABLE NFSv3 WRITE.
>>
>> I'll need to see performance measurements demonstrating no negative
>> impact on throughput or latency of NFSv3 WRITEs with large payloads.
>>
>>
>
> In your earlier mail, you mentioned that you wanted the writeback work
> to be done in the context of nfsd threads. nfsd_file_put is how nfsd
> threads put their references so this seems like the right place to do
> it.
>
> If you're concerned about calling filemap_flush too often because we
> have an entry that's cycling onto and off of the LRU, then another
> (maybe better) idea might be to kick off writeback when we clear the
> REFERENCED flag.
I think we are doing just about that today by flushing in nfsd_file_put
right when the REFERENCED bit is set. :-)
But yes: that is essentially it. nfsd is a good place to do the flush,
but we don't want to flush too often, because that will be noticeable.
> That would need to be done in the gc thread context, however.
Apparently it is already doing this via filp_close(), though it's
not clear how often that call needs to wait for I/O. You could
schedule a worker to complete the teardown if the open file has
dirty pages.
To catch errors that might occur when the client is delaying its
COMMITs for a long while, maybe don't evict nfsd_files that have
dirty pages...?
>>> /*
>>> * We're racing with unhashing, so try to remove it from
>>> --
>>> 2.37.3
>>>
>>
>> --
>> Chuck Lever
>>
>>
>>
>
> --
> Jeff Layton <[email protected]>
--
Chuck Lever
On Fri, 2022-10-28 at 20:39 +0000, Chuck Lever III wrote:
>
> > On Oct 28, 2022, at 4:13 PM, Jeff Layton <[email protected]> wrote:
> >
> > On Fri, 2022-10-28 at 19:49 +0000, Chuck Lever III wrote:
> > >
> > > > On Oct 28, 2022, at 2:57 PM, Jeff Layton <[email protected]> wrote:
> > > >
> > > > The filecache refcounting is a bit non-standard for something searchable
> > > > by RCU, in that we maintain a sentinel reference while it's hashed. This
> > > > in turn requires that we do things differently in the "put"
> > > > depending on whether it's hashed, which we believe to have led to races.
> > > >
> > > > There are other problems in here too. nfsd_file_close_inode_sync can end
> > > > up freeing an nfsd_file while there are still outstanding references to
> > > > it, and there are a number of subtle ToC/ToU races.
> > > >
> > > > Rework the code so that the refcount is what drives the lifecycle. When
> > > > the refcount goes to zero, then unhash and rcu free the object.
> > > >
> > > > With this change, the LRU carries a reference. Take special care to
> > > > deal with it when removing an entry from the list.
> > >
> > > I can see a way of making this patch a lot cleaner. It looks like there's
> > > a fair bit of renaming and moving of functions -- that can go in clean
> > > up patches before doing the heavy lifting.
> > >
> >
> > Is this something that's really needed? I'm already basically rewriting
> > this code. Reshuffling the old code around first will take a lot of time
> > and we'll still end up with the same result.
>
> I did exactly this for the nfs4_file rhash changes. It took just a couple
> of hours. The outcome is that you can see exactly, in the final patch in
> that series, how the file_hashtbl -> rhltable substitution is done.
>
> Making sure each of the changes is more or less mechanical and obvious
> is a good way to ensure no-one is doing something incorrect. That's why
> folks like to use Coccinelle.
>
> Trust me, it will be much easier to figure out in a year when we have
> new bugs in this code if we split up this commit just a little.
>
Sigh. It seems pointless to rearrange code that is going to be replaced,
but I'll do it. It'll probably be next week though.
>
> > > I'm still not sold on the idea of a synchronous flush in nfsd_file_free().
> >
> > I think that we need to call this there to ensure that writeback errors
> > are handled. I worry that if try to do this piecemeal, we could end up
> > missing errors when they fall off the LRU.
> >
> > > That feels like a deadlock waiting to happen and quite difficult to
> > > reproduce because I/O there is rarely needed. It could help to put a
> > > might_sleep() in nfsd_file_fsync(), at least, but I would prefer not to
> > > drive I/O in that path at all.
> >
> > I don't quite grok the potential for a deadlock here. nfsd_file_free
> > already has to deal with blocking activities due to it effectively doing a
> > close(). This is just another one. That's why nfsd_file_put has a
> > might_sleep in it (to warn its callers).
>
> Currently nfsd_file_put() calls nfsd_file_flush(), which calls
> vfs_fsync(). That can't be called while holding a spinlock.
>
>
nfsd_file_free (and hence, nfsd_file_put) can never be called with a
spinlock held. That's true even before this set. Both functions can
block.
> > What's the deadlock scenario you envision?
>
> OK, filp_close() does call f_op->flush(). So we have this call
> here and there aren't problems today. I still say this is a
> problem waiting to occur, but I guess I can live with it.
>
> If filp_close() already calls f_op->flush(), why do we need an
> explicit vfs_fsync() there?
>
>
->flush doesn't do anything on some filesystems, and while it does
return an error code today, it should have been a void return function.
The error from it can't be counted on.
vfs_fsync is what ensures that everything gets written back and returns
info about writeback errors. I don't see a way around calling it at
least once before we close a file, if we want to keep up the "trick" of
resetting the verifier when we see errors.
IMO, the goal ought to be to ensure that we don't end up having to do
any writeback when we get to GC'ing it, and that's what patch 4/4 should
do.
> > > > Signed-off-by: Jeff Layton <[email protected]>
> > > > ---
> > > > fs/nfsd/filecache.c | 357 ++++++++++++++++++++++----------------------
> > > > fs/nfsd/trace.h | 5 +-
> > > > 2 files changed, 178 insertions(+), 184 deletions(-)
> > > >
> > > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> > > > index f8ebbf7daa18..d928c5e38eeb 100644
> > > > --- a/fs/nfsd/filecache.c
> > > > +++ b/fs/nfsd/filecache.c
> > > > @@ -1,6 +1,12 @@
> > > > // SPDX-License-Identifier: GPL-2.0
> > > > /*
> > > > * The NFSD open file cache.
> > > > + *
> > > > + * Each nfsd_file is created in response to client activity -- either regular
> > > > + * file I/O for v2/v3, or opening a file for v4. Files opened via v4 are
> > > > + * cleaned up as soon as their refcount goes to 0. Entries for v2/v3 are
> > > > + * flagged with NFSD_FILE_GC. On their last put, they are added to the LRU for
> > > > + * eventual disposal if they aren't used again within a short time period.
> > > > */
> > > >
> > > > #include <linux/hash.h>
> > > > @@ -301,55 +307,22 @@ nfsd_file_alloc(struct nfsd_file_lookup_key *key, unsigned int may)
> > > > if (key->gc)
> > > > __set_bit(NFSD_FILE_GC, &nf->nf_flags);
> > > > nf->nf_inode = key->inode;
> > > > - /* nf_ref is pre-incremented for hash table */
> > > > - refcount_set(&nf->nf_ref, 2);
> > > > + refcount_set(&nf->nf_ref, 1);
> > > > nf->nf_may = key->need;
> > > > nf->nf_mark = NULL;
> > > > }
> > > > return nf;
> > > > }
> > > >
> > > > -static bool
> > > > -nfsd_file_free(struct nfsd_file *nf)
> > > > -{
> > > > - s64 age = ktime_to_ms(ktime_sub(ktime_get(), nf->nf_birthtime));
> > > > - bool flush = false;
> > > > -
> > > > - this_cpu_inc(nfsd_file_releases);
> > > > - this_cpu_add(nfsd_file_total_age, age);
> > > > -
> > > > - trace_nfsd_file_put_final(nf);
> > > > - if (nf->nf_mark)
> > > > - nfsd_file_mark_put(nf->nf_mark);
> > > > - if (nf->nf_file) {
> > > > - get_file(nf->nf_file);
> > > > - filp_close(nf->nf_file, NULL);
> > > > - fput(nf->nf_file);
> > > > - flush = true;
> > > > - }
> > > > -
> > > > - /*
> > > > - * If this item is still linked via nf_lru, that's a bug.
> > > > - * WARN and leak it to preserve system stability.
> > > > - */
> > > > - if (WARN_ON_ONCE(!list_empty(&nf->nf_lru)))
> > > > - return flush;
> > > > -
> > > > - call_rcu(&nf->nf_rcu, nfsd_file_slab_free);
> > > > - return flush;
> > > > -}
> > > > -
> > > > -static bool
> > > > -nfsd_file_check_writeback(struct nfsd_file *nf)
> > > > +static void
> > > > +nfsd_file_fsync(struct nfsd_file *nf)
> > > > {
> > > > struct file *file = nf->nf_file;
> > > > - struct address_space *mapping;
> > > >
> > > > if (!file || !(file->f_mode & FMODE_WRITE))
> > > > - return false;
> > > > - mapping = file->f_mapping;
> > > > - return mapping_tagged(mapping, PAGECACHE_TAG_DIRTY) ||
> > > > - mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK);
> > > > + return;
> > > > + if (vfs_fsync(file, 1) != 0)
> > > > + nfsd_reset_write_verifier(net_generic(nf->nf_net, nfsd_net_id));
> > > > }
> > > >
> > > > static int
> > > > @@ -362,30 +335,6 @@ nfsd_file_check_write_error(struct nfsd_file *nf)
> > > > return filemap_check_wb_err(file->f_mapping, READ_ONCE(file->f_wb_err));
> > > > }
> > > >
> > > > -static void
> > > > -nfsd_file_flush(struct nfsd_file *nf)
> > > > -{
> > > > - struct file *file = nf->nf_file;
> > > > -
> > > > - if (!file || !(file->f_mode & FMODE_WRITE))
> > > > - return;
> > > > - if (vfs_fsync(file, 1) != 0)
> > > > - nfsd_reset_write_verifier(net_generic(nf->nf_net, nfsd_net_id));
> > > > -}
> > > > -
> > > > -static void nfsd_file_lru_add(struct nfsd_file *nf)
> > > > -{
> > > > - set_bit(NFSD_FILE_REFERENCED, &nf->nf_flags);
> > > > - if (list_lru_add(&nfsd_file_lru, &nf->nf_lru))
> > > > - trace_nfsd_file_lru_add(nf);
> > > > -}
> > > > -
> > > > -static void nfsd_file_lru_remove(struct nfsd_file *nf)
> > > > -{
> > > > - if (list_lru_del(&nfsd_file_lru, &nf->nf_lru))
> > > > - trace_nfsd_file_lru_del(nf);
> > > > -}
> > > > -
> > > > static void
> > > > nfsd_file_hash_remove(struct nfsd_file *nf)
> > > > {
> > > > @@ -408,53 +357,66 @@ nfsd_file_unhash(struct nfsd_file *nf)
> > > > }
> > > >
> > > > static void
> > > > -nfsd_file_unhash_and_dispose(struct nfsd_file *nf, struct list_head *dispose)
> > > > +nfsd_file_free(struct nfsd_file *nf)
> > > > {
> > > > - trace_nfsd_file_unhash_and_dispose(nf);
> > > > - if (nfsd_file_unhash(nf)) {
> > > > - /* caller must call nfsd_file_dispose_list() later */
> > > > - nfsd_file_lru_remove(nf);
> > > > - list_add(&nf->nf_lru, dispose);
> > > > + s64 age = ktime_to_ms(ktime_sub(ktime_get(), nf->nf_birthtime));
> > > > +
> > > > + trace_nfsd_file_free(nf);
> > > > +
> > > > + this_cpu_inc(nfsd_file_releases);
> > > > + this_cpu_add(nfsd_file_total_age, age);
> > > > +
> > > > + nfsd_file_unhash(nf);
> > > > + nfsd_file_fsync(nf);
> > > > +
> > > > + if (nf->nf_mark)
> > > > + nfsd_file_mark_put(nf->nf_mark);
> > > > + if (nf->nf_file) {
> > > > + get_file(nf->nf_file);
> > > > + filp_close(nf->nf_file, NULL);
> > > > + fput(nf->nf_file);
> > > > }
> > > > +
> > > > + /*
> > > > + * If this item is still linked via nf_lru, that's a bug.
> > > > + * WARN and leak it to preserve system stability.
> > > > + */
> > > > + if (WARN_ON_ONCE(!list_empty(&nf->nf_lru)))
> > > > + return;
> > > > +
> > > > + call_rcu(&nf->nf_rcu, nfsd_file_slab_free);
> > > > }
> > > >
> > > > -static void
> > > > -nfsd_file_put_noref(struct nfsd_file *nf)
> > > > +static bool
> > > > +nfsd_file_check_writeback(struct nfsd_file *nf)
> > > > {
> > > > - trace_nfsd_file_put(nf);
> > > > + struct file *file = nf->nf_file;
> > > > + struct address_space *mapping;
> > > >
> > > > - if (refcount_dec_and_test(&nf->nf_ref)) {
> > > > - WARN_ON(test_bit(NFSD_FILE_HASHED, &nf->nf_flags));
> > > > - nfsd_file_lru_remove(nf);
> > > > - nfsd_file_free(nf);
> > > > - }
> > > > + if (!file || !(file->f_mode & FMODE_WRITE))
> > > > + return false;
> > > > + mapping = file->f_mapping;
> > > > + return mapping_tagged(mapping, PAGECACHE_TAG_DIRTY) ||
> > > > + mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK);
> > > > }
> > > >
> > > > -static void
> > > > -nfsd_file_unhash_and_put(struct nfsd_file *nf)
> > > > +static bool nfsd_file_lru_add(struct nfsd_file *nf)
> > > > {
> > > > - if (nfsd_file_unhash(nf))
> > > > - nfsd_file_put_noref(nf);
> > > > + set_bit(NFSD_FILE_REFERENCED, &nf->nf_flags);
> > > > + if (list_lru_add(&nfsd_file_lru, &nf->nf_lru)) {
> > > > + trace_nfsd_file_lru_add(nf);
> > > > + return true;
> > > > + }
> > > > + return false;
> > > > }
> > > >
> > > > -void
> > > > -nfsd_file_put(struct nfsd_file *nf)
> > > > +static bool nfsd_file_lru_remove(struct nfsd_file *nf)
> > > > {
> > > > - might_sleep();
> > > > -
> > > > - if (test_bit(NFSD_FILE_GC, &nf->nf_flags))
> > > > - nfsd_file_lru_add(nf);
> > > > - else if (refcount_read(&nf->nf_ref) == 2)
> > > > - nfsd_file_unhash_and_put(nf);
> > > > -
> > > > - if (!test_bit(NFSD_FILE_HASHED, &nf->nf_flags)) {
> > > > - nfsd_file_flush(nf);
> > > > - nfsd_file_put_noref(nf);
> > > > - } else if (nf->nf_file && test_bit(NFSD_FILE_GC, &nf->nf_flags)) {
> > > > - nfsd_file_put_noref(nf);
> > > > - nfsd_file_schedule_laundrette();
> > > > - } else
> > > > - nfsd_file_put_noref(nf);
> > > > + if (list_lru_del(&nfsd_file_lru, &nf->nf_lru)) {
> > > > + trace_nfsd_file_lru_del(nf);
> > > > + return true;
> > > > + }
> > > > + return false;
> > > > }
> > > >
> > > > struct nfsd_file *
> > > > @@ -465,36 +427,77 @@ nfsd_file_get(struct nfsd_file *nf)
> > > > return NULL;
> > > > }
> > > >
> > > > -static void
> > > > -nfsd_file_dispose_list(struct list_head *dispose)
> > > > +/**
> > > > + * nfsd_file_unhash_and_queue - unhash a file and queue it to the dispose list
> > > > + * @nf: nfsd_file to be unhashed and queued
> > > > + * @dispose: list to which it should be queued
> > > > + *
> > > > + * Attempt to unhash a nfsd_file and queue it to the given list. Each file
> > > > + * will have a reference held on behalf of the list. That reference may come
> > > > + * from the LRU, or we may need to take one. If we can't get a reference,
> > > > + * ignore it altogether.
> > > > + */
> > > > +static bool
> > > > +nfsd_file_unhash_and_queue(struct nfsd_file *nf, struct list_head *dispose)
> > > > {
> > > > - struct nfsd_file *nf;
> > > > + trace_nfsd_file_unhash_and_queue(nf);
> > > > + if (nfsd_file_unhash(nf)) {
> > > > + /*
> > > > + * If we remove it from the LRU, then just use that
> > > > + * reference for the dispose list. Otherwise, we need
> > > > + * to take a reference. If that fails, just ignore
> > > > + * the file altogether.
> > > > + */
> > > > + if (!nfsd_file_lru_remove(nf) && !nfsd_file_get(nf))
> > > > + return false;
> > > > + list_add(&nf->nf_lru, dispose);
> > > > + return true;
> > > > + }
> > > > + return false;
> > > > +}
> > > >
> > > > - while(!list_empty(dispose)) {
> > > > - nf = list_first_entry(dispose, struct nfsd_file, nf_lru);
> > > > - list_del_init(&nf->nf_lru);
> > > > - nfsd_file_flush(nf);
> > > > - nfsd_file_put_noref(nf);
> > > > +/**
> > > > + * nfsd_file_put - put the reference to a nfsd_file
> > > > + * @nf: nfsd_file of which to put the reference
> > > > + *
> > > > + * Put a reference to a nfsd_file. In the v4 case, we just put the
> > > > + * reference immediately. In the v2/3 case, if the reference would be
> > > > + * the last one, then put it on the LRU instead to be cleaned up later.
> > > > + */
> > > > +void
> > > > +nfsd_file_put(struct nfsd_file *nf)
> > > > +{
> > > > + trace_nfsd_file_put(nf);
> > > > +
> > > > + /*
> > > > + * The HASHED check is racy. We may end up with the occasional
> > > > + * unhashed entry on the LRU, but they should get cleaned up
> > > > + * like any other.
> > > > + */
> > > > + if (test_bit(NFSD_FILE_GC, &nf->nf_flags) &&
> > > > + test_bit(NFSD_FILE_HASHED, &nf->nf_flags)) {
> > > > + /*
> > > > + * If this is the last reference (nf_ref == 1), then transfer
> > > > + * it to the LRU. If the add to the LRU fails, just put it as
> > > > + * usual.
> > > > + */
> > > > + if (refcount_dec_not_one(&nf->nf_ref) || nfsd_file_lru_add(nf))
> > > > + return;
> > > > }
> > > > + if (refcount_dec_and_test(&nf->nf_ref))
> > > > + nfsd_file_free(nf);
> > > > }
> > > >
> > > > static void
> > > > -nfsd_file_dispose_list_sync(struct list_head *dispose)
> > > > +nfsd_file_dispose_list(struct list_head *dispose)
> > > > {
> > > > - bool flush = false;
> > > > struct nfsd_file *nf;
> > > >
> > > > while(!list_empty(dispose)) {
> > > > nf = list_first_entry(dispose, struct nfsd_file, nf_lru);
> > > > list_del_init(&nf->nf_lru);
> > > > - nfsd_file_flush(nf);
> > > > - if (!refcount_dec_and_test(&nf->nf_ref))
> > > > - continue;
> > > > - if (nfsd_file_free(nf))
> > > > - flush = true;
> > > > + nfsd_file_free(nf);
> > > > }
> > > > - if (flush)
> > > > - flush_delayed_fput();
> > > > }
> > > >
> > > > static void
> > > > @@ -564,21 +567,8 @@ nfsd_file_lru_cb(struct list_head *item, struct list_lru_one *lru,
> > > > struct list_head *head = arg;
> > > > struct nfsd_file *nf = list_entry(item, struct nfsd_file, nf_lru);
> > > >
> > > > - /*
> > > > - * Do a lockless refcount check. The hashtable holds one reference, so
> > > > - * we look to see if anything else has a reference, or if any have
> > > > - * been put since the shrinker last ran. Those don't get unhashed and
> > > > - * released.
> > > > - *
> > > > - * Note that in the put path, we set the flag and then decrement the
> > > > - * counter. Here we check the counter and then test and clear the flag.
> > > > - * That order is deliberate to ensure that we can do this locklessly.
> > > > - */
> > > > - if (refcount_read(&nf->nf_ref) > 1) {
> > > > - list_lru_isolate(lru, &nf->nf_lru);
> > > > - trace_nfsd_file_gc_in_use(nf);
> > > > - return LRU_REMOVED;
> > > > - }
> > > > + /* We should only be dealing with v2/3 entries here */
> > > > + WARN_ON_ONCE(!test_bit(NFSD_FILE_GC, &nf->nf_flags));
> > > >
> > > > /*
> > > > * Don't throw out files that are still undergoing I/O or
> > > > @@ -589,40 +579,30 @@ nfsd_file_lru_cb(struct list_head *item, struct list_lru_one *lru,
> > > > return LRU_SKIP;
> > > > }
> > > >
> > > > + /* If it was recently added to the list, skip it */
> > > > if (test_and_clear_bit(NFSD_FILE_REFERENCED, &nf->nf_flags)) {
> > > > trace_nfsd_file_gc_referenced(nf);
> > > > return LRU_ROTATE;
> > > > }
> > > >
> > > > - if (!test_and_clear_bit(NFSD_FILE_HASHED, &nf->nf_flags)) {
> > > > - trace_nfsd_file_gc_hashed(nf);
> > > > - return LRU_SKIP;
> > > > + /*
> > > > + * Put the reference held on behalf of the LRU. If it wasn't the last
> > > > + * one, then just remove it from the LRU and ignore it.
> > > > + */
> > > > + if (!refcount_dec_and_test(&nf->nf_ref)) {
> > > > + trace_nfsd_file_gc_in_use(nf);
> > > > + list_lru_isolate(lru, &nf->nf_lru);
> > > > + return LRU_REMOVED;
> > > > }
> > > >
> > > > + /* Refcount went to zero. Unhash it and queue it to the dispose list */
> > > > + nfsd_file_unhash(nf);
> > > > list_lru_isolate_move(lru, &nf->nf_lru, head);
> > > > this_cpu_inc(nfsd_file_evictions);
> > > > trace_nfsd_file_gc_disposed(nf);
> > > > return LRU_REMOVED;
> > > > }
> > > >
> > > > -/*
> > > > - * Unhash items on @dispose immediately, then queue them on the
> > > > - * disposal workqueue to finish releasing them in the background.
> > > > - *
> > > > - * cel: Note that between the time list_lru_shrink_walk runs and
> > > > - * now, these items are in the hash table but marked unhashed.
> > > > - * Why release these outside of lru_cb ? There's no lock ordering
> > > > - * problem since lru_cb currently takes no lock.
> > > > - */
> > > > -static void nfsd_file_gc_dispose_list(struct list_head *dispose)
> > > > -{
> > > > - struct nfsd_file *nf;
> > > > -
> > > > - list_for_each_entry(nf, dispose, nf_lru)
> > > > - nfsd_file_hash_remove(nf);
> > > > - nfsd_file_dispose_list_delayed(dispose);
> > > > -}
> > > > -
> > > > static void
> > > > nfsd_file_gc(void)
> > > > {
> > > > @@ -632,7 +612,7 @@ nfsd_file_gc(void)
> > > > ret = list_lru_walk(&nfsd_file_lru, nfsd_file_lru_cb,
> > > > &dispose, list_lru_count(&nfsd_file_lru));
> > > > trace_nfsd_file_gc_removed(ret, list_lru_count(&nfsd_file_lru));
> > > > - nfsd_file_gc_dispose_list(&dispose);
> > > > + nfsd_file_dispose_list_delayed(&dispose);
> > > > }
> > > >
> > > > static void
> > > > @@ -657,7 +637,7 @@ nfsd_file_lru_scan(struct shrinker *s, struct shrink_control *sc)
> > > > ret = list_lru_shrink_walk(&nfsd_file_lru, sc,
> > > > nfsd_file_lru_cb, &dispose);
> > > > trace_nfsd_file_shrinker_removed(ret, list_lru_count(&nfsd_file_lru));
> > > > - nfsd_file_gc_dispose_list(&dispose);
> > > > + nfsd_file_dispose_list_delayed(&dispose);
> > > > return ret;
> > > > }
> > > >
> > > > @@ -668,8 +648,11 @@ static struct shrinker nfsd_file_shrinker = {
> > > > };
> > > >
> > > > /*
> > > > - * Find all cache items across all net namespaces that match @inode and
> > > > - * move them to @dispose. The lookup is atomic wrt nfsd_file_acquire().
> > > > + * Find all cache items across all net namespaces that match @inode, unhash
> > > > + * them, take references and then put them on @dispose if that was successful.
> > > > + *
> > > > + * The nfsd_file objects on the list will be unhashed, and each will have a
> > > > + * reference taken.
> > > > */
> > > > static unsigned int
> > > > __nfsd_file_close_inode(struct inode *inode, struct list_head *dispose)
> > > > @@ -687,52 +670,59 @@ __nfsd_file_close_inode(struct inode *inode, struct list_head *dispose)
> > > > nfsd_file_rhash_params);
> > > > if (!nf)
> > > > break;
> > > > - nfsd_file_unhash_and_dispose(nf, dispose);
> > > > - count++;
> > > > +
> > > > + if (nfsd_file_unhash_and_queue(nf, dispose))
> > > > + count++;
> > > > } while (1);
> > > > rcu_read_unlock();
> > > > return count;
> > > > }
> > > >
> > > > /**
> > > > - * nfsd_file_close_inode_sync - attempt to forcibly close a nfsd_file
> > > > + * nfsd_file_close_inode - attempt a delayed close of a nfsd_file
> > > > * @inode: inode of the file to attempt to remove
> > > > *
> > > > - * Unhash and put, then flush and fput all cache items associated with @inode.
> > > > + * Unhash and put all cache items associated with @inode.
> > > > */
> > > > -void
> > > > -nfsd_file_close_inode_sync(struct inode *inode)
> > > > +static unsigned int
> > > > +nfsd_file_close_inode(struct inode *inode)
> > > > {
> > > > - LIST_HEAD(dispose);
> > > > + struct nfsd_file *nf;
> > > > unsigned int count;
> > > > + LIST_HEAD(dispose);
> > > >
> > > > count = __nfsd_file_close_inode(inode, &dispose);
> > > > - trace_nfsd_file_close_inode_sync(inode, count);
> > > > - nfsd_file_dispose_list_sync(&dispose);
> > > > + trace_nfsd_file_close_inode(inode, count);
> > > > + if (count) {
> > > > + while(!list_empty(&dispose)) {
> > > > + nf = list_first_entry(&dispose, struct nfsd_file, nf_lru);
> > > > + list_del_init(&nf->nf_lru);
> > > > + trace_nfsd_file_closing(nf);
> > > > + if (refcount_dec_and_test(&nf->nf_ref))
> > > > + nfsd_file_free(nf);
> > > > + }
> > > > + }
> > > > + return count;
> > > > }
> > > >
> > > > /**
> > > > - * nfsd_file_close_inode - attempt a delayed close of a nfsd_file
> > > > + * nfsd_file_close_inode_sync - attempt to forcibly close a nfsd_file
> > > > * @inode: inode of the file to attempt to remove
> > > > *
> > > > - * Unhash and put all cache item associated with @inode.
> > > > + * Unhash and put, then flush and fput all cache items associated with @inode.
> > > > */
> > > > -static void
> > > > -nfsd_file_close_inode(struct inode *inode)
> > > > +void
> > > > +nfsd_file_close_inode_sync(struct inode *inode)
> > > > {
> > > > - LIST_HEAD(dispose);
> > > > - unsigned int count;
> > > > -
> > > > - count = __nfsd_file_close_inode(inode, &dispose);
> > > > - trace_nfsd_file_close_inode(inode, count);
> > > > - nfsd_file_dispose_list_delayed(&dispose);
> > > > + if (nfsd_file_close_inode(inode))
> > > > + flush_delayed_fput();
> > > > }
> > > >
> > > > /**
> > > > * nfsd_file_delayed_close - close unused nfsd_files
> > > > * @work: dummy
> > > > *
> > > > - * Walk the LRU list and close any entries that have not been used since
> > > > + * Walk the LRU list and destroy any entries that have not been used since
> > > > * the last scan.
> > > > */
> > > > static void
> > > > @@ -890,7 +880,7 @@ __nfsd_file_cache_purge(struct net *net)
> > > > while (!IS_ERR_OR_NULL(nf)) {
> > > > if (net && nf->nf_net != net)
> > > > continue;
> > > > - nfsd_file_unhash_and_dispose(nf, &dispose);
> > > > + nfsd_file_unhash_and_queue(nf, &dispose);
> > > > nf = rhashtable_walk_next(&iter);
> > > > }
> > > >
> > > > @@ -1054,8 +1044,10 @@ nfsd_file_do_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
> > > > rcu_read_lock();
> > > > nf = rhashtable_lookup(&nfsd_file_rhash_tbl, &key,
> > > > nfsd_file_rhash_params);
> > > > - if (nf)
> > > > - nf = nfsd_file_get(nf);
> > > > + if (nf) {
> > > > + if (!nfsd_file_lru_remove(nf))
> > > > + nf = nfsd_file_get(nf);
> > > > + }
> > > > rcu_read_unlock();
> > > > if (nf)
> > > > goto wait_for_construction;
> > > > @@ -1090,11 +1082,11 @@ nfsd_file_do_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
> > > > goto out;
> > > > }
> > > > open_retry = false;
> > > > - nfsd_file_put_noref(nf);
> > > > + if (refcount_dec_and_test(&nf->nf_ref))
> > > > + nfsd_file_free(nf);
> > > > goto retry;
> > > > }
> > > >
> > > > - nfsd_file_lru_remove(nf);
> > > > this_cpu_inc(nfsd_file_cache_hits);
> > > >
> > > > status = nfserrno(nfsd_open_break_lease(file_inode(nf->nf_file), may_flags));
> > > > @@ -1104,7 +1096,8 @@ nfsd_file_do_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
> > > > this_cpu_inc(nfsd_file_acquisitions);
> > > > *pnf = nf;
> > > > } else {
> > > > - nfsd_file_put(nf);
> > > > + if (refcount_dec_and_test(&nf->nf_ref))
> > > > + nfsd_file_free(nf);
> > > > nf = NULL;
> > > > }
> > > >
> > > > @@ -1131,7 +1124,7 @@ nfsd_file_do_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
> > > > * then unhash.
> > > > */
> > > > if (status != nfs_ok || key.inode->i_nlink == 0)
> > > > - nfsd_file_unhash_and_put(nf);
> > > > + nfsd_file_unhash(nf);
> > > > clear_bit_unlock(NFSD_FILE_PENDING, &nf->nf_flags);
> > > > smp_mb__after_atomic();
> > > > wake_up_bit(&nf->nf_flags, NFSD_FILE_PENDING);
> > > > diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
> > > > index b09ab4f92d43..a44ded06af87 100644
> > > > --- a/fs/nfsd/trace.h
> > > > +++ b/fs/nfsd/trace.h
> > > > @@ -903,10 +903,11 @@ DEFINE_EVENT(nfsd_file_class, name, \
> > > > TP_PROTO(struct nfsd_file *nf), \
> > > > TP_ARGS(nf))
> > > >
> > > > -DEFINE_NFSD_FILE_EVENT(nfsd_file_put_final);
> > > > +DEFINE_NFSD_FILE_EVENT(nfsd_file_free);
> > > > DEFINE_NFSD_FILE_EVENT(nfsd_file_unhash);
> > > > DEFINE_NFSD_FILE_EVENT(nfsd_file_put);
> > > > -DEFINE_NFSD_FILE_EVENT(nfsd_file_unhash_and_dispose);
> > > > +DEFINE_NFSD_FILE_EVENT(nfsd_file_closing);
> > > > +DEFINE_NFSD_FILE_EVENT(nfsd_file_unhash_and_queue);
> > > >
> > > > TRACE_EVENT(nfsd_file_alloc,
> > > > TP_PROTO(
> > > > --
> > > > 2.37.3
> > > >
> > >
> > > --
> > > Chuck Lever
> > >
> > >
> > >
> >
> > --
> > Jeff Layton <[email protected]>
>
> --
> Chuck Lever
>
>
>
--
Jeff Layton <[email protected]>
> On Oct 28, 2022, at 5:03 PM, Jeff Layton <[email protected]> wrote:
>
> On Fri, 2022-10-28 at 20:39 +0000, Chuck Lever III wrote:
>>
>>> On Oct 28, 2022, at 4:13 PM, Jeff Layton <[email protected]> wrote:
>>>
>>> On Fri, 2022-10-28 at 19:49 +0000, Chuck Lever III wrote:
>>>>
>>>>> On Oct 28, 2022, at 2:57 PM, Jeff Layton <[email protected]> wrote:
>>>>>
>>>>> The filecache refcounting is a bit non-standard for something searchable
>>>>> by RCU, in that we maintain a sentinel reference while it's hashed. This
>>>>> in turn requires that we have to do things differently in the "put"
>>>>> depending on whether it's hashed, which we believe to have led to races.
>>>>>
>>>>> There are other problems in here too. nfsd_file_close_inode_sync can end
>>>>> up freeing an nfsd_file while there are still outstanding references to
>>>>> it, and there are a number of subtle ToC/ToU races.
>>>>>
>>>>> Rework the code so that the refcount is what drives the lifecycle. When
>>>>> the refcount goes to zero, then unhash and rcu free the object.
>>>>>
>>>>> With this change, the LRU carries a reference. Take special care to
>>>>> deal with it when removing an entry from the list.
>>>>
>>>> I can see a way of making this patch a lot cleaner. It looks like there's
>>>> a fair bit of renaming and moving of functions -- that can go in clean
>>>> up patches before doing the heavy lifting.
>>>>
>>>
>>> Is this something that's really needed? I'm already basically rewriting
>>> this code. Reshuffling the old code around first will take a lot of time
>>> and we'll still end up with the same result.
>>
>> I did exactly this for the nfs4_file rhash changes. It took just a couple
>> of hours. The outcome is that you can see exactly, in the final patch in
>> that series, how the file_hashtbl -> rhltable substitution is done.
>>
>> Making sure each of the changes is more or less mechanical and obvious
>> is a good way to ensure no-one is doing something incorrect. That's why
>> folks like to use cocchinelle.
>>
>> Trust me, it will be much easier to figure out in a year when we have
>> new bugs in this code if we split up this commit just a little.
>>
>
> Sigh. It seems pointless to rearrange code that is going to be replaced,
> but I'll do it. It'll probably be next week though.
>
>>
>>>> I'm still not sold on the idea of a synchronous flush in nfsd_file_free().
>>>
>>> I think that we need to call this there to ensure that writeback errors
>>> are handled. I worry that if we try to do this piecemeal, we could end up
>>> missing errors when they fall off the LRU.
>>>
>>>> That feels like a deadlock waiting to happen and quite difficult to
>>>> reproduce because I/O there is rarely needed. It could help to put a
>>>> might_sleep() in nfsd_file_fsync(), at least, but I would prefer not to
>>>> drive I/O in that path at all.
>>>
>>> I don't quite grok the potential for a deadlock here. nfsd_file_free
>>> already has to deal with blocking activities due to it effectively doing a
>>> close(). This is just another one. That's why nfsd_file_put has a
>>> might_sleep in it (to warn its callers).
>>
>> Currently nfsd_file_put() calls nfsd_file_flush(), which calls
>> vfs_fsync(). That can't be called while holding a spinlock.
>>
>>
>
> nfsd_file_free (and hence, nfsd_file_put) can never be called with a
> spinlock held. That's true even before this set. Both functions can
> block.
Dead horse: in the current code base, nfsd_file_free() can be called
via nfsd_file_close_inode_sync(), which is an API external to
filecache.c. But, I agree now that both functions can block.
>>> What's the deadlock scenario you envision?
>>
>> OK, filp_close() does call f_op->flush(). So we have this call
>> here and there aren't problems today. I still say this is a
>> problem waiting to occur, but I guess I can live with it.
>>
>> If filp_close() already calls f_op->flush(), why do we need an
>> explicit vfs_fsync() there?
>>
>>
>
> ->flush doesn't do anything on some filesystems, and while it does
> return an error code today, it should have been a void return function.
> The error from it can't be counted on.
OK. The goal is detecting writeback errors, and ->flush is not a
reliable way to do that.
> vfs_fsync is what ensures that everything gets written back and returns
> info about writeback errors. I don't see a way around calling it at
> least once before we close a file, if we want to keep up the "trick" of
> resetting the verifier when we see errors.
Fair enough, but maybe it's not worth a complexity and performance
impact. Writeback errors are supposed to be rare compared to I/O.
If we can find another way (especially, a more reliable way) to
monitor for writeback errors, that might be an improvement over
using vfs_fsync, which seems heavyweight no matter how you cut it.
> IMO, the goal ought to be to ensure that we don't end up having to do
> any writeback when we get to GC'ing it, and that's what patch 4/4 should
> do.
Understood. I don't disagree with the goal, since filp_close()
won't be able to avoid a potentially costly ->flush in some cases.
The issue is even a SYNC_NONE flush has appreciable cost.
--
Chuck Lever
On Sat, 29 Oct 2022, Chuck Lever III wrote:
>
> > On Oct 28, 2022, at 2:57 PM, Jeff Layton <[email protected]> wrote:
> >
> > The list_lru_add and list_lru_del functions use list_empty checks to see
> > whether the object is already on the LRU. That's fine in most cases, but
> > we occasionally repurpose nf_lru after unhashing. It's possible for an
> > LRU removal to remove it from a different list altogether if we lose a
> > race.
>
> I've never seen that happen. lru field re-use is actually used in other
> places in the kernel. Shouldn't we try to find and fix such races?
>
> Wasn't the idea to reduce the complexity of nfsd_file_put ?
>
I think the nfsd filecache is different from those "other places"
because of nfsd_file_close_inode() and related code. If I understand
correctly, nfsd can remove a file from the filecache while it is still
in use. In this case it needs to be unhashed and removed from the lru -
and then added to a dispose list - while it might still be active for
some IO request.
Prior to Commit 8d0d254b15cc ("nfsd: fix nfsd_file_unhash_and_dispose")
unhash_and_dispose() wouldn't add to the dispose list unless the
refcount was one. I'm not sure how racy this test was, but it would
mean that it is unlikely for an nfsd_file to be added to the dispose list
if it was still in use.
After that commit it seems more likely that a nfsd_file will be added to
a dispose list while it is in use.
As we add/remove things to a dispose list without a lock, we need to be
careful that no other thread will add the nfsd_file to an lru at the
same time. refcounts will no longer provide that protection. Maybe we
should restore the refcount protection, or maybe we need a bit as Jeff
has added here.
NeilBrown
> On Oct 30, 2022, at 5:45 PM, NeilBrown <[email protected]> wrote:
>
> On Sat, 29 Oct 2022, Chuck Lever III wrote:
>>
>>> On Oct 28, 2022, at 2:57 PM, Jeff Layton <[email protected]> wrote:
>>>
>>> The list_lru_add and list_lru_del functions use list_empty checks to see
>>> whether the object is already on the LRU. That's fine in most cases, but
>>> we occasionally repurpose nf_lru after unhashing. It's possible for an
>>> LRU removal to remove it from a different list altogether if we lose a
>>> race.
Can that issue be resolved by simply adding a "struct list_head nf_dispose"
field? That might be more straightforward than adding conditional logic.
>> I've never seen that happen. lru field re-use is actually used in other
>> places in the kernel. Shouldn't we try to find and fix such races?
>>
>> Wasn't the idea to reduce the complexity of nfsd_file_put ?
>>
>
> I think the nfsd filecache is different from those "other places"
> because of nfsd_file_close_inode() and related code. If I understand
> correctly, nfsd can remove a file from the filecache while it is still
> in use.
Not sure about that; I think nfsd_file_close_inode() is invoked only
when a file is deleted. I could be remembering incorrectly, but that
seems like a really difficult race to hit.
> In this case it needs to be unhashed and removed from the lru -
> and then added to a dispose list - while it might still be active for
> some IO request.
>
> Prior to Commit 8d0d254b15cc ("nfsd: fix nfsd_file_unhash_and_dispose")
> unhash_and_dispose() wouldn't add to the dispose list unless the
> refcount was one. I'm not sure how racy this test was, but it would
> mean that it is unlikely for an nfsd_file to be added to the dispose list
> if it was still in use.
>
> After that commit it seems more likely that a nfsd_file will be added to
> a dispose list while it is in use.
After it's linked to a dispose list via nf_lru, list_lru_add won't put
it on the LRU -- it becomes a no-op because nf_lru is now !empty. I
think we would have seen LRU corruption pretty quickly. Re-reading
Jeff's patch description, that might not be the problem he's trying
to address here.
But, it would be easy to do some reality testing. I think you could
add a WARN_ON or tracepoint in nfsd_file_free() or somewhere in the
dispose-list path to catch an in-use nfsd_file?
> As we add/remove things to a dispose list without a lock, we need to be
> careful that no other thread will add the nfsd_file to an lru at the
> same time. refcounts will no longer provide that protection. Maybe we
> should restore the refcount protection, or maybe we need a bit as Jeff
> has added here.
I'm not opposed to defensive changes, in general. This one seems to be
adding significant complexity without a clear hazard. I'd like to have
a better understanding of exactly what misbehavior is being addressed.
--
Chuck Lever
On Fri, 2022-10-28 at 20:57 +0000, Chuck Lever III wrote:
>
> > On Oct 28, 2022, at 4:30 PM, Jeff Layton <[email protected]> wrote:
> >
> > On Fri, 2022-10-28 at 19:50 +0000, Chuck Lever III wrote:
> > >
> > > > On Oct 28, 2022, at 2:57 PM, Jeff Layton <[email protected]> wrote:
> > > >
> > > > When a GC entry gets added to the LRU, kick off SYNC_NONE writeback
> > > > so that we can be ready to close it when the time comes. This should
> > > > help minimize delays when freeing objects reaped from the LRU.
> > > >
> > > > Signed-off-by: Jeff Layton <[email protected]>
> > > > ---
> > > > fs/nfsd/filecache.c | 23 +++++++++++++++++++++--
> > > > 1 file changed, 21 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> > > > index 47cdc6129a7b..c43b6cff03e2 100644
> > > > --- a/fs/nfsd/filecache.c
> > > > +++ b/fs/nfsd/filecache.c
> > > > @@ -325,6 +325,20 @@ nfsd_file_fsync(struct nfsd_file *nf)
> > > > nfsd_reset_write_verifier(net_generic(nf->nf_net, nfsd_net_id));
> > > > }
> > > >
> > > > +static void
> > > > +nfsd_file_flush(struct nfsd_file *nf)
> > > > +{
> > > > + struct file *file = nf->nf_file;
> > > > + struct address_space *mapping;
> > > > +
> > > > + if (!file || !(file->f_mode & FMODE_WRITE))
> > > > + return;
> > > > +
> > > > + mapping = file->f_mapping;
> > > > + if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
> > > > + filemap_flush(mapping);
> > > > +}
> > > > +
> > > > static int
> > > > nfsd_file_check_write_error(struct nfsd_file *nf)
> > > > {
> > > > @@ -484,9 +498,14 @@ nfsd_file_put(struct nfsd_file *nf)
> > > >
> > > > /* Try to add it to the LRU. If that fails, decrement. */
> > > > if (nfsd_file_lru_add(nf)) {
> > > > - /* If it's still hashed, we're done */
> > > > - if (test_bit(NFSD_FILE_HASHED, &nf->nf_flags))
> > > > + /*
> > > > + * If it's still hashed, we can just return now,
> > > > + * after kicking off SYNC_NONE writeback.
> > > > + */
> > > > + if (test_bit(NFSD_FILE_HASHED, &nf->nf_flags)) {
> > > > + nfsd_file_flush(nf);
> > > > return;
> > > > + }
> > >
> > > nfsd_write() calls nfsd_file_put() after every nfsd_vfs_write(). In some
> > > cases, this new logic adds an async flush after every UNSTABLE NFSv3 WRITE.
> > >
> > > I'll need to see performance measurements demonstrating no negative
> > > impact on throughput or latency of NFSv3 WRITEs with large payloads.
> > >
> > >
> >
> > In your earlier mail, you mentioned that you wanted the writeback work
> > to be done in the context of nfsd threads. nfsd_file_put is how nfsd
> > threads put their references so this seems like the right place to do
> > it.
> >
> > If you're concerned about calling filemap_flush too often because we
> > have an entry that's cycling onto and off of the LRU, then another
> > (maybe better) idea might be to kick off writeback when we clear the
> > REFERENCED flag.
>
> I think we are doing just about that today by flushing in nfsd_file_put
> right when the REFERENCED bit is set. :-)
>
> But yes: that is essentially it. nfsd is a good place to do the flush,
> but we don't want to flush too often, because that will be noticeable.
>
>
> > That would need to be done in the gc thread context, however.
>
> Apparently it is already doing this via filp_close(), though it's
> not clear how often that call needs to wait for I/O. You could
> schedule a worker to complete the tear down if the open file has
> dirty pages.
>
Most filesystems do not flush data on close(). NFS is an exception here
due to CTO.
> To catch errors that might occur when the client is delaying its
> COMMITs for a long while, maybe don't evict nfsd_files that have
> dirty pages...?
>
We could do that, I suppose, but then I'd start worrying about memory
pressure. When we have dirty data, it eventually has to be written out.
Are we really helping performance by gaming which thread kicks off
writeback? I'm skeptical.
>
> > > > /*
> > > > * We're racing with unhashing, so try to remove it from
> > > > --
> > > > 2.37.3
> > > >
> > >
> > > --
> > > Chuck Lever
> > >
> > >
> > >
> >
> > --
> > Jeff Layton <[email protected]>
>
> --
> Chuck Lever
>
>
>
--
Jeff Layton <[email protected]>
On Fri, 2022-10-28 at 21:23 +0000, Chuck Lever III wrote:
>
> > On Oct 28, 2022, at 5:03 PM, Jeff Layton <[email protected]> wrote:
> >
> > On Fri, 2022-10-28 at 20:39 +0000, Chuck Lever III wrote:
> > >
> > > > On Oct 28, 2022, at 4:13 PM, Jeff Layton <[email protected]> wrote:
> > > >
> > > > On Fri, 2022-10-28 at 19:49 +0000, Chuck Lever III wrote:
> > > > >
> > > > > > On Oct 28, 2022, at 2:57 PM, Jeff Layton <[email protected]> wrote:
> > > > > >
> > > > > > The filecache refcounting is a bit non-standard for something searchable
> > > > > > by RCU, in that we maintain a sentinel reference while it's hashed. This
> > > > > > in turn requires that we have to do things differently in the "put"
> > > > > > depending on whether it's hashed, which we believe to have led to races.
> > > > > >
> > > > > > There are other problems in here too. nfsd_file_close_inode_sync can end
> > > > > > up freeing an nfsd_file while there are still outstanding references to
> > > > > > it, and there are a number of subtle ToC/ToU races.
> > > > > >
> > > > > > Rework the code so that the refcount is what drives the lifecycle. When
> > > > > > the refcount goes to zero, then unhash and rcu free the object.
> > > > > >
> > > > > > With this change, the LRU carries a reference. Take special care to
> > > > > > deal with it when removing an entry from the list.
> > > > >
> > > > > I can see a way of making this patch a lot cleaner. It looks like there's
> > > > > a fair bit of renaming and moving of functions -- that can go in clean
> > > > > up patches before doing the heavy lifting.
> > > > >
> > > >
> > > > Is this something that's really needed? I'm already basically rewriting
> > > > this code. Reshuffling the old code around first will take a lot of time
> > > > and we'll still end up with the same result.
> > >
> > > I did exactly this for the nfs4_file rhash changes. It took just a couple
> > > of hours. The outcome is that you can see exactly, in the final patch in
> > > that series, how the file_hashtbl -> rhltable substitution is done.
> > >
> > > Making sure each of the changes is more or less mechanical and obvious
> > > is a good way to ensure no-one is doing something incorrect. That's why
> > > folks like to use cocchinelle.
> > >
> > > Trust me, it will be much easier to figure out in a year when we have
> > > new bugs in this code if we split up this commit just a little.
> > >
> >
> > Sigh. It seems pointless to rearrange code that is going to be replaced,
> > but I'll do it. It'll probably be next week though.
> >
> > >
> > > > > I'm still not sold on the idea of a synchronous flush in nfsd_file_free().
> > > >
> > > > I think that we need to call this there to ensure that writeback errors
> > > > are handled. I worry that if we try to do this piecemeal, we could end up
> > > > missing errors when they fall off the LRU.
> > > >
> > > > > That feels like a deadlock waiting to happen and quite difficult to
> > > > > reproduce because I/O there is rarely needed. It could help to put a
> > > > > might_sleep() in nfsd_file_fsync(), at least, but I would prefer not to
> > > > > drive I/O in that path at all.
> > > >
> > > > I don't quite grok the potential for a deadlock here. nfsd_file_free
> > > > already has to deal with blocking activities due to it effectively doing a
> > > > close(). This is just another one. That's why nfsd_file_put has a
> > > > might_sleep in it (to warn its callers).
> > >
> > > Currently nfsd_file_put() calls nfsd_file_flush(), which calls
> > > vfs_fsync(). That can't be called while holding a spinlock.
> > >
> > >
> >
> > nfsd_file_free (and hence, nfsd_file_put) can never be called with a
> > spinlock held. That's true even before this set. Both functions can
> > block.
>
> Dead horse: in the current code base, nfsd_file_free() can be called
> via nfsd_file_close_inode_sync(), which is an API external to
> filecache.c. But, I agree now that both functions can block.
>
>
> > > > What's the deadlock scenario you envision?
> > >
> > > OK, filp_close() does call f_op->flush(). So we have this call
> > > here and there aren't problems today. I still say this is a
> > > problem waiting to occur, but I guess I can live with it.
> > >
> > > If filp_close() already calls f_op->flush(), why do we need an
> > > explicit vfs_fsync() there?
> > >
> > >
> >
> > ->flush doesn't do anything on some filesystems, and while it does
> > return an error code today, it should have been a void return function.
> > The error from it can't be counted on.
>
> OK. The goal is detecting writeback errors, and ->flush is not a
> reliable way to do that.
>
>
> > vfs_fsync is what ensures that everything gets written back and returns
> > info about writeback errors. I don't see a way around calling it at
> > least once before we close a file, if we want to keep up the "trick" of
> > resetting the verifier when we see errors.
>
> Fair enough, but maybe it's not worth a complexity and performance
> impact. Writeback errors are supposed to be rare compared to I/O.
>
Absolutely, but they _do_ happen, and in a world with more complex I/O
stacks, we see them more often than before.
> If we can find another way (especially, a more reliable way) to
> monitor for writeback errors, that might be an improvement over
> using vfs_fsync, which seems heavyweight no matter how you cut it.
>
There really isn't a way, as we can't guarantee that everything has been
written back without calling vfs_fsync.
>
> > IMO, the goal ought to be to ensure that we don't end up having to do
> > any writeback when we get to GC'ing it, and that's what patch 4/4 should
> > do.
>
> Understood. I don't disagree with the goal, since filp_close()
> won't be able to avoid a potentially costly ->flush in some cases.
>
> The issue is even a SYNC_NONE flush has appreciable cost.
Sure, but the machine will have to pay that cost eventually. We can try
to offload that onto other threads and whatnot, but I'm not convinced
that's really better for overall performance.
--
Jeff Layton <[email protected]>
On Mon, 2022-10-31 at 08:45 +1100, NeilBrown wrote:
> On Sat, 29 Oct 2022, Chuck Lever III wrote:
> >
> > > On Oct 28, 2022, at 2:57 PM, Jeff Layton <[email protected]> wrote:
> > >
> > > The list_lru_add and list_lru_del functions use list_empty checks to see
> > > whether the object is already on the LRU. That's fine in most cases, but
> > > we occasionally repurpose nf_lru after unhashing. It's possible for an
> > > LRU removal to remove it from a different list altogether if we lose a
> > > race.
> >
> > I've never seen that happen. lru field re-use is actually used in other
> > places in the kernel. Shouldn't we try to find and fix such races?
> >
> > Wasn't the idea to reduce the complexity of nfsd_file_put ?
> >
>
> I think the nfsd filecache is different from those "other places"
> because of nfsd_file_close_inode() and related code. If I understand
> correctly, nfsd can remove a file from the filecache while it is still
> in use. In this case it needs to be unhashed and removed from the lru -
> and then added to a dispose list - while it might still be active for
> some IO request.
>
> Prior to Commit 8d0d254b15cc ("nfsd: fix nfsd_file_unhash_and_dispose")
> unhash_and_dispose() wouldn't add to the dispose list unless the
> refcount was one. I'm not sure how racy this test was, but it would
> mean that it is unlikely for an nfsd_file to be added to the dispose list
> if it was still in use.
>
I'm pretty sure that was racy. By the time you go to put it on the
dispose list, the refcount may no longer be one since it can be
incremented at any time.
> After that commit it seems more likely that a nfsd_file will be added to
> a dispose list while it is in use.
> As we add/remove things to a dispose list without a lock, we need to be
> careful that no other thread will add the nfsd_file to an lru at the
> same time. refcounts will no longer provide that protection. Maybe we
> should restore the refcount protection, or maybe we need a bit as Jeff
> has added here.
>
If we have an entry on a list, then it can't be added to the LRU (since
the list_empty check would fail). The real danger is that we could end
up trying to remove it from the LRU and inadvertently remove it from a
private dispose list instead. Since that is done without a lock, we
could easily corrupt memory.
--
Jeff Layton <[email protected]>
On Mon, 2022-10-31 at 02:51 +0000, Chuck Lever III wrote:
>
> > On Oct 30, 2022, at 5:45 PM, NeilBrown <[email protected]> wrote:
> >
> > On Sat, 29 Oct 2022, Chuck Lever III wrote:
> > >
> > > > On Oct 28, 2022, at 2:57 PM, Jeff Layton <[email protected]> wrote:
> > > >
> > > > The list_lru_add and list_lru_del functions use list_empty checks to see
> > > > whether the object is already on the LRU. That's fine in most cases, but
> > > > we occasionally repurpose nf_lru after unhashing. It's possible for an
> > > > LRU removal to remove it from a different list altogether if we lose a
> > > > race.
>
> Can that issue be resolved by simply adding a "struct list_head nf_dispose"
> field? That might be more straightforward than adding conditional logic.
>
Yes, though that would take more memory.
>
> > > I've never seen that happen. lru field re-use is actually used in other
> > > places in the kernel. Shouldn't we try to find and fix such races?
> > >
> > > Wasn't the idea to reduce the complexity of nfsd_file_put ?
> > >
> >
> > I think the nfsd filecache is different from those "other places"
> > because of nfsd_file_close_inode() and related code. If I understand
> > correctly, nfsd can remove a file from the filecache while it is still
> > in use.
>
> Not sure about that; I think nfsd_file_close_inode() is invoked only
> when a file is deleted. I could be remembering incorrectly, but that
> seems like a really difficult race to hit.
>
I think that's correct. We have a notifier set up that tells us when the
link count goes to zero. At that point we want to unhash the thing and
try to get it out of the cache ASAP.
FWIW, the main impetus for this was NFS reexport. Keeping unlinked files
in the cache could cause the reexporting server to do a bunch of
unnecessary silly-renaming.
> > In this case it needs to be unhashed and removed from the lru -
> > and then added to a dispose list - while it might still be active for
> > some IO request.
> >
> > Prior to Commit 8d0d254b15cc ("nfsd: fix nfsd_file_unhash_and_dispose")
> > unhash_and_dispose() wouldn't add to the dispose list unless the
> > refcount was one. I'm not sure how racy this test was, but it would
> > mean that it is unlikely for an nfsd_file to be added to the dispose list
> > if it was still in use.
> >
> > After that commit it seems more likely that a nfsd_file will be added to
> > a dispose list while it is in use.
>
> After it's linked to a dispose list via nf_lru, list_lru_add won't put
> it on the LRU -- it becomes a no-op because nf_lru is now !empty. I
> think we would have seen LRU corruption pretty quickly. Re-reading
> Jeff's patch description, that might not be the problem he's trying
> to address here.
>
> But, it would be easy to do some reality testing. I think you could
> add a WARN_ON or tracepoint in nfsd_file_free() or somewhere in the
> dispose-list path to catch an in-use nfsd_file?
>
>
> > As we add/remove things to a dispose list without a lock, we need to be
> > careful that no other thread will add the nfsd_file to an lru at the
> > same time. refcounts will no longer provide that protection. Maybe we
> > should restore the refcount protection, or maybe we need a bit as Jeff
> > has added here.
>
> I'm not opposed to defensive changes, in general. This one seems to be
> adding significant complexity without a clear hazard. I'd like to have
> a better understanding of exactly what misbehavior is being addressed.
>
The real danger is that we could end up trying to remove an entry from
the LRU, when it's actually sitting on a dispose list. Without some way
to distinguish where it is, we could cause memory corruption.
--
Jeff Layton <[email protected]>
> On Oct 31, 2022, at 6:08 AM, Jeff Layton <[email protected]> wrote:
>
> On Mon, 2022-10-31 at 02:51 +0000, Chuck Lever III wrote:
>>
>>> On Oct 30, 2022, at 5:45 PM, NeilBrown <[email protected]> wrote:
>>>
>>> On Sat, 29 Oct 2022, Chuck Lever III wrote:
>>>>
>>>>> On Oct 28, 2022, at 2:57 PM, Jeff Layton <[email protected]> wrote:
>>>>>
>>>>> The list_lru_add and list_lru_del functions use list_empty checks to see
>>>>> whether the object is already on the LRU. That's fine in most cases, but
>>>>> we occasionally repurpose nf_lru after unhashing. It's possible for an
>>>>> LRU removal to remove it from a different list altogether if we lose a
>>>>> race.
>>
>> Can that issue be resolved by simply adding a "struct list_head nf_dispose"
>> field? That might be more straightforward than adding conditional logic.
>>
>
> Yes, though that would take more memory.
Not really. pahole says struct nfsd_file is currently 40 bytes short
of two cache lines. So adding a list_head field should not push the
size of nfsd_file to the point where kmalloc would have to allocate
more memory per object.
I'm wondering if a separate list_head field would help simplify
nfsd_file_put() ?
--
Chuck Lever
On Mon, 2022-10-31 at 13:14 +0000, Chuck Lever III wrote:
>
> > On Oct 31, 2022, at 6:08 AM, Jeff Layton <[email protected]> wrote:
> >
> > On Mon, 2022-10-31 at 02:51 +0000, Chuck Lever III wrote:
> > >
> > > > On Oct 30, 2022, at 5:45 PM, NeilBrown <[email protected]> wrote:
> > > >
> > > > On Sat, 29 Oct 2022, Chuck Lever III wrote:
> > > > >
> > > > > > On Oct 28, 2022, at 2:57 PM, Jeff Layton <[email protected]> wrote:
> > > > > >
> > > > > > The list_lru_add and list_lru_del functions use list_empty checks to see
> > > > > > whether the object is already on the LRU. That's fine in most cases, but
> > > > > > we occasionally repurpose nf_lru after unhashing. It's possible for an
> > > > > > LRU removal to remove it from a different list altogether if we lose a
> > > > > > race.
> > >
> > > Can that issue be resolved by simply adding a "struct list_head nf_dispose"
> > > field? That might be more straightforward than adding conditional logic.
> > >
> >
> > Yes, though that would take more memory.
>
> Not really. pahole says struct nfsd_file is currently 40 bytes short
> of two cache lines. So adding a list_head field should not push the
> size of nfsd_file to the point where kmalloc would have to allocate
> more memory per object.
>
> I'm wondering if a separate list_head field would help simplify
> nfsd_file_put() ?
>
Probably not. It wouldn't need the flag anymore, but the logic would
still be roughly the same. We still have to check for the race between
unhashing and adding to the LRU either way.
--
Jeff Layton <[email protected]>
> On Oct 28, 2022, at 4:13 PM, Jeff Layton <[email protected]> wrote:
>
> On Fri, 2022-10-28 at 19:49 +0000, Chuck Lever III wrote:
>>
>>> On Oct 28, 2022, at 2:57 PM, Jeff Layton <[email protected]> wrote:
>
>> I'm still not sold on the idea of a synchronous flush in nfsd_file_free().
>
> I think that we need to call this there to ensure that writeback errors
> are handled. I worry that if we try to do this piecemeal, we could end
> up missing errors when entries fall off the LRU.
>
>> That feels like a deadlock waiting to happen and quite difficult to
>> reproduce because I/O there is rarely needed. It could help to put a
>> might_sleep() in nfsd_file_fsync(), at least, but I would prefer not to
>> drive I/O in that path at all.
>
> I don't quite grok the potential for a deadlock here. nfsd_file_free
> already has to deal with blocking activities due to it effective doing a
> close(). This is just another one. That's why nfsd_file_put has a
> might_sleep in it (to warn its callers).
>
> What's the deadlock scenario you envision?
I never answered this question.
I'll say up front that I believe this problem exists in the current code
base, so what follows is meant to document an existing issue rather than
a problem with this patch series.
The filecache sets up a shrinker callback. This callback uses the same
or similar code paths as the filecache garbage collector.
Dai has found that trying to fsync inside a shrinker callback will
lead to deadlocks on some filesystems (notably I believe he was testing
btrfs at the time).
To address this, the filecache shrinker callback could avoid evicting
nfsd_file items that are open for write.
--
Chuck Lever
On Tue, 2022-11-01 at 13:58 +0000, Chuck Lever III wrote:
>
> > On Oct 28, 2022, at 4:13 PM, Jeff Layton <[email protected]> wrote:
> >
> > On Fri, 2022-10-28 at 19:49 +0000, Chuck Lever III wrote:
> > >
> > > > On Oct 28, 2022, at 2:57 PM, Jeff Layton <[email protected]> wrote:
> >
> > > I'm still not sold on the idea of a synchronous flush in nfsd_file_free().
> >
> > I think that we need to call this there to ensure that writeback errors
> > are handled. I worry that if we try to do this piecemeal, we could
> > end up missing errors when entries fall off the LRU.
> >
> > > That feels like a deadlock waiting to happen and quite difficult to
> > > reproduce because I/O there is rarely needed. It could help to put a
> > > might_sleep() in nfsd_file_fsync(), at least, but I would prefer not to
> > > drive I/O in that path at all.
> >
> > I don't quite grok the potential for a deadlock here. nfsd_file_free
> > already has to deal with blocking activities due to it effectively doing a
> > close(). This is just another one. That's why nfsd_file_put has a
> > might_sleep in it (to warn its callers).
> >
> > What's the deadlock scenario you envision?
>
> I never answered this question.
>
> I'll say up front that I believe this problem exists in the current code
> base, so what follows is meant to document an existing issue rather than
> a problem with this patch series.
>
> The filecache sets up a shrinker callback. This callback uses the same
> or similar code paths as the filecache garbage collector.
>
> Dai has found that trying to fsync inside a shrinker callback will
> lead to deadlocks on some filesystems (notably I believe he was testing
> btrfs at the time).
>
> To address this, the filecache shrinker callback could avoid evicting
> nfsd_file items that are open for write.
>
Ok, that makes sense, particularly if btrfs needs to wait on reclaim in
order to allocate.
We already skip entries that are dirty or under writeback:
	/*
	 * Don't throw out files that are still undergoing I/O or
	 * that have uncleared errors pending.
	 */
	if (nfsd_file_check_writeback(nf)) {
		trace_nfsd_file_gc_writeback(nf);
		return LRU_SKIP;
	}
...and nfsd_file_check_writeback does:
	return mapping_tagged(mapping, PAGECACHE_TAG_DIRTY) ||
		mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK);
If the deadlock came from a flush in the shrinker path, then you'd have
to hit a pretty tight race window, I think.
Actually...this may have been due to a race in the existing lru
callback. It didn't properly decrement the refcount when isolating
entries from the LRU, which left a pretty big window open for new
references to be grabbed while we were trying to reap the thing. These
patches may improve that if so.
FWIW, I think we're "within our rights" to not fsync on close here. The
downside of not doing that though is that if there are writeback errors,
they may go unreported to the client (aka silent data corruption). I'd
prefer to keep that in, unless we can demonstrate that it's a problem.
--
Jeff Layton <[email protected]>