Hi,
here is a revised version of my previous patch titled:
[PATCH/RFC] core/nfsd: allow kernel threads to use task_work.
There are now two patches - one that changes core code to allow nfsd
to handle its own __dput calls, and one to make various changes to
nfsd.
It would probably make sense for the first patch to land through the
VFS tree, and the second to follow through the NFSD tree, maybe after the relevant rc1 ??
Details of the problem and explanation of the solution are in the individual patches.
Thanks for all the review and suggestions.
NeilBrown
[PATCH 1/2] Allow a kthread to declare that it calls task_work_run()
[PATCH 2/2] nfsd: Don't leave work of closing files to a work queue.
User-space processes always call task_work_run() as needed when
returning from a system call. Kernel-threads generally do not.
Because of this some work that is best run in the task_works context
(guaranteed that no locks are held) cannot be queued to task_works from
kernel threads and so are queued to a (single) work_time to be managed
on a work queue.
This means that any cost for doing the work is not imposed on the kernel
thread, and importantly excessive amounts of work cannot apply
back-pressure to reduce the amount of new work queued.
I have evidence from a customer site when nfsd (which runs as kernel
threads) is being asked to modify many millions of files which causes
sufficient memory pressure that some cache (in XFS I think) gets cleaned
earlier than would be ideal. When __dput (from the workqueue) calls
__dentry_kill, xfs_fs_destroy_inode() needs to synchronously read back
previously cached info from storage. This slows down the single thread
that is making all the final __dput() calls for all the nfsd threads
with the net result that files are added to the delayed_fput_list faster
than they are removed, and the system eventually runs out of memory.
This happens because there is no back-pressure: the nfsd isn't forced to
slow down when __dput() is slow for any reason. To fix this we can
change the nfsd threads to call task_work_run() regularly (much like
user-space processes do) and allow it to declare this so that work does
get queued to task_works rather than to a work queue.
This patch adds a new process flag PF_RUNS_TASK_WORK which is now used
instead of PF_KTHREAD to determine whether it is sensible to queue
something to task_works. This flag is always set for non-kernel threads.
task_work_run() is also exported so that it can be called from a module
such as nfsd.
Signed-off-by: NeilBrown <[email protected]>
---
fs/file_table.c | 3 ++-
fs/namespace.c | 2 +-
include/linux/sched.h | 2 +-
kernel/fork.c | 2 ++
kernel/task_work.c | 1 +
5 files changed, 7 insertions(+), 3 deletions(-)
diff --git a/fs/file_table.c b/fs/file_table.c
index ee21b3da9d08..d36cade6e366 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -435,7 +435,8 @@ void fput(struct file *file)
if (atomic_long_dec_and_test(&file->f_count)) {
struct task_struct *task = current;
- if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) {
+ if (likely(!in_interrupt() &&
+ (task->flags & PF_RUNS_TASK_WORK))) {
init_task_work(&file->f_rcuhead, ____fput);
if (!task_work_add(task, &file->f_rcuhead, TWA_RESUME))
return;
diff --git a/fs/namespace.c b/fs/namespace.c
index e157efc54023..46d640b70ca9 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1328,7 +1328,7 @@ static void mntput_no_expire(struct mount *mnt)
if (likely(!(mnt->mnt.mnt_flags & MNT_INTERNAL))) {
struct task_struct *task = current;
- if (likely(!(task->flags & PF_KTHREAD))) {
+ if (likely((task->flags & PF_RUNS_TASK_WORK))) {
init_task_work(&mnt->mnt_rcu, __cleanup_mnt);
if (!task_work_add(task, &mnt->mnt_rcu, TWA_RESUME))
return;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 77f01ac385f7..e4eebac708e7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1747,7 +1747,7 @@ extern struct pid *cad_pid;
* I am cleaning dirty pages from some other bdi. */
#define PF_KTHREAD 0x00200000 /* I am a kernel thread */
#define PF_RANDOMIZE 0x00400000 /* Randomize virtual address space */
-#define PF__HOLE__00800000 0x00800000
+#define PF_RUNS_TASK_WORK 0x00800000 /* Will call task_work_run() periodically */
#define PF__HOLE__01000000 0x01000000
#define PF__HOLE__02000000 0x02000000
#define PF_NO_SETAFFINITY 0x04000000 /* Userland is not allowed to meddle with cpus_mask */
diff --git a/kernel/fork.c b/kernel/fork.c
index 3b6d20dfb9a8..d612d8f14861 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2330,6 +2330,8 @@ __latent_entropy struct task_struct *copy_process(
p->flags &= ~PF_KTHREAD;
if (args->kthread)
p->flags |= PF_KTHREAD;
+ else
+ p->flags |= PF_RUNS_TASK_WORK;
if (args->user_worker) {
/*
* Mark us a user worker, and block any signal that isn't
diff --git a/kernel/task_work.c b/kernel/task_work.c
index 95a7e1b7f1da..aec19876e121 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -183,3 +183,4 @@ void task_work_run(void)
} while (work);
}
}
+EXPORT_SYMBOL(task_work_run);
--
2.43.0
The work of closing a file can have non-trivial cost. Doing it in a
separate work queue thread means that cost isn't imposed on the nfsd
threads and an imbalance can be created.
I have evidence from a customer site when nfsd is being asked to modify
many millions of files which causes sufficient memory pressure that some
cache (in XFS I think) gets cleaned earlier than would be ideal. When
__dput (from the workqueue) calls __dentry_kill, xfs_fs_destroy_inode()
needs to synchronously read back previously cached info from storage.
This slows down the single thread that is making all the final __dput()
calls for all the nfsd threads with the net result that files are added
to the delayed_fput_list faster than they are removed, and the system
eventually runs out of memory.
To avoid this work imbalance that exhausts memory, this patch moves all
work for closing files into the nfsd threads. This means that when the
work imposes a cost, that cost appears where it would be expected - in
the work of the nfsd thread.
There are two changes to achieve this.
1/ PF_RUNS_TASK_WORK is set and task_work_run() is called, so that the
final __dput() for any file closed by a given nfsd thread is handled
by that thread. This ensures that the number of files that are
queued for a final close is limited by the number of threads and
cannot grow without bound.
2/ Files opened for NFSv3 are never explicitly closed by the client and are
kept open by the server in the "filecache", which responds to memory
pressure, is garbage collected even when there is no pressure, and
sometimes closes files when there is particular need such as for
rename. These files currently have filp_close() called in a dedicated
work queue, so their __dput() can have no effect on nfsd threads.
This patch discards the work queue and instead has each nfsd thread
call flip_close() on as many as 8 files from the filecache each time
it acts on a client request (or finds there are no pending client
requests). If there are more to be closed, more threads are woken.
This spreads the work of __dput() over multiple threads and imposes
any cost on those threads.
The number 8 is somewhat arbitrary. It needs to be greater than 1 to
ensure that files are closed more quickly than they can be added to
the cache. It needs to be small enough to limit the per-request
delays that will be imposed on clients when all threads are busy
closing files.
Signed-off-by: NeilBrown <[email protected]>
---
fs/nfsd/filecache.c | 62 ++++++++++++++++++---------------------------
fs/nfsd/filecache.h | 1 +
fs/nfsd/nfssvc.c | 6 +++++
3 files changed, 32 insertions(+), 37 deletions(-)
diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
index ee9c923192e0..55268b7362d4 100644
--- a/fs/nfsd/filecache.c
+++ b/fs/nfsd/filecache.c
@@ -39,6 +39,7 @@
#include <linux/fsnotify.h>
#include <linux/seq_file.h>
#include <linux/rhashtable.h>
+#include <linux/task_work.h>
#include "vfs.h"
#include "nfsd.h"
@@ -61,13 +62,10 @@ static DEFINE_PER_CPU(unsigned long, nfsd_file_total_age);
static DEFINE_PER_CPU(unsigned long, nfsd_file_evictions);
struct nfsd_fcache_disposal {
- struct work_struct work;
spinlock_t lock;
struct list_head freeme;
};
-static struct workqueue_struct *nfsd_filecache_wq __read_mostly;
-
static struct kmem_cache *nfsd_file_slab;
static struct kmem_cache *nfsd_file_mark_slab;
static struct list_lru nfsd_file_lru;
@@ -421,10 +419,31 @@ nfsd_file_dispose_list_delayed(struct list_head *dispose)
spin_lock(&l->lock);
list_move_tail(&nf->nf_lru, &l->freeme);
spin_unlock(&l->lock);
- queue_work(nfsd_filecache_wq, &l->work);
+ svc_wake_up(nn->nfsd_serv);
}
}
+/**
+ * nfsd_file_dispose_some
+ *
+ */
+void nfsd_file_dispose_some(struct nfsd_net *nn)
+{
+ struct nfsd_fcache_disposal *l = nn->fcache_disposal;
+ LIST_HEAD(dispose);
+ int i;
+
+ if (list_empty(&l->freeme))
+ return;
+ spin_lock(&l->lock);
+ for (i = 0; i < 8 && !list_empty(&l->freeme); i++)
+ list_move(l->freeme.next, &dispose);
+ spin_unlock(&l->lock);
+ if (!list_empty(&l->freeme))
+ svc_wake_up(nn->nfsd_serv);
+ nfsd_file_dispose_list(&dispose);
+}
+
/**
* nfsd_file_lru_cb - Examine an entry on the LRU list
* @item: LRU entry to examine
@@ -635,28 +654,8 @@ nfsd_file_close_inode_sync(struct inode *inode)
list_del_init(&nf->nf_lru);
nfsd_file_free(nf);
}
- flush_delayed_fput();
-}
-
-/**
- * nfsd_file_delayed_close - close unused nfsd_files
- * @work: dummy
- *
- * Scrape the freeme list for this nfsd_net, and then dispose of them
- * all.
- */
-static void
-nfsd_file_delayed_close(struct work_struct *work)
-{
- LIST_HEAD(head);
- struct nfsd_fcache_disposal *l = container_of(work,
- struct nfsd_fcache_disposal, work);
-
- spin_lock(&l->lock);
- list_splice_init(&l->freeme, &head);
- spin_unlock(&l->lock);
-
- nfsd_file_dispose_list(&head);
+ /* Flush any delayed fput */
+ task_work_run();
}
static int
@@ -721,10 +720,6 @@ nfsd_file_cache_init(void)
return ret;
ret = -ENOMEM;
- nfsd_filecache_wq = alloc_workqueue("nfsd_filecache", 0, 0);
- if (!nfsd_filecache_wq)
- goto out;
-
nfsd_file_slab = kmem_cache_create("nfsd_file",
sizeof(struct nfsd_file), 0, 0, NULL);
if (!nfsd_file_slab) {
@@ -739,7 +734,6 @@ nfsd_file_cache_init(void)
goto out_err;
}
-
ret = list_lru_init(&nfsd_file_lru);
if (ret) {
pr_err("nfsd: failed to init nfsd_file_lru: %d\n", ret);
@@ -782,8 +776,6 @@ nfsd_file_cache_init(void)
nfsd_file_slab = NULL;
kmem_cache_destroy(nfsd_file_mark_slab);
nfsd_file_mark_slab = NULL;
- destroy_workqueue(nfsd_filecache_wq);
- nfsd_filecache_wq = NULL;
rhltable_destroy(&nfsd_file_rhltable);
goto out;
}
@@ -829,7 +821,6 @@ nfsd_alloc_fcache_disposal(void)
l = kmalloc(sizeof(*l), GFP_KERNEL);
if (!l)
return NULL;
- INIT_WORK(&l->work, nfsd_file_delayed_close);
spin_lock_init(&l->lock);
INIT_LIST_HEAD(&l->freeme);
return l;
@@ -838,7 +829,6 @@ nfsd_alloc_fcache_disposal(void)
static void
nfsd_free_fcache_disposal(struct nfsd_fcache_disposal *l)
{
- cancel_work_sync(&l->work);
nfsd_file_dispose_list(&l->freeme);
kfree(l);
}
@@ -907,8 +897,6 @@ nfsd_file_cache_shutdown(void)
fsnotify_wait_marks_destroyed();
kmem_cache_destroy(nfsd_file_mark_slab);
nfsd_file_mark_slab = NULL;
- destroy_workqueue(nfsd_filecache_wq);
- nfsd_filecache_wq = NULL;
rhltable_destroy(&nfsd_file_rhltable);
for_each_possible_cpu(i) {
diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h
index e54165a3224f..bc8c3363bbdf 100644
--- a/fs/nfsd/filecache.h
+++ b/fs/nfsd/filecache.h
@@ -56,6 +56,7 @@ void nfsd_file_cache_shutdown_net(struct net *net);
void nfsd_file_put(struct nfsd_file *nf);
struct nfsd_file *nfsd_file_get(struct nfsd_file *nf);
void nfsd_file_close_inode_sync(struct inode *inode);
+void nfsd_file_dispose_some(struct nfsd_net *nn);
bool nfsd_file_is_cached(struct inode *inode);
__be32 nfsd_file_acquire_gc(struct svc_rqst *rqstp, struct svc_fh *fhp,
unsigned int may_flags, struct nfsd_file **nfp);
diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
index c7af1095f6b5..02ea16636b54 100644
--- a/fs/nfsd/nfssvc.c
+++ b/fs/nfsd/nfssvc.c
@@ -13,6 +13,7 @@
#include <linux/fs_struct.h>
#include <linux/swap.h>
#include <linux/siphash.h>
+#include <linux/task_work.h>
#include <linux/sunrpc/stats.h>
#include <linux/sunrpc/svcsock.h>
@@ -949,6 +950,7 @@ nfsd(void *vrqstp)
}
current->fs->umask = 0;
+ current->flags |= PF_RUNS_TASK_WORK;
atomic_inc(&nfsdstats.th_cnt);
@@ -963,6 +965,10 @@ nfsd(void *vrqstp)
svc_recv(rqstp);
validate_process_creds();
+
+ nfsd_file_dispose_some(nn);
+ if (task_work_pending(current))
+ task_work_run();
}
atomic_dec(&nfsdstats.th_cnt);
--
2.43.0
On Mon, Dec 04, 2023 at 12:36:41PM +1100, NeilBrown wrote:
> +++ b/fs/namespace.c
> @@ -1328,7 +1328,7 @@ static void mntput_no_expire(struct mount *mnt)
>
> if (likely(!(mnt->mnt.mnt_flags & MNT_INTERNAL))) {
> struct task_struct *task = current;
> - if (likely(!(task->flags & PF_KTHREAD))) {
> + if (likely((task->flags & PF_RUNS_TASK_WORK))) {
You could lose one set of parens here ...
if (likely(task->flags & PF_RUNS_TASK_WORK)) {
> #define PF_RANDOMIZE 0x00400000 /* Randomize virtual address space */
> -#define PF__HOLE__00800000 0x00800000
> +#define PF_RUNS_TASK_WORK 0x00800000 /* Will call task_work_run() periodically */
And you could lose "Will" here:
#define PF_RUNS_TASK_WORK 0x00800000 /* Calls task_work_run() periodically */
> diff --git a/kernel/task_work.c b/kernel/task_work.c
> index 95a7e1b7f1da..aec19876e121 100644
> --- a/kernel/task_work.c
> +++ b/kernel/task_work.c
> @@ -183,3 +183,4 @@ void task_work_run(void)
> } while (work);
> }
> }
> +EXPORT_SYMBOL(task_work_run);
_GPL?
On Mon, Dec 04, 2023 at 12:36:41PM +1100, NeilBrown wrote:
> This means that any cost for doing the work is not imposed on the kernel
> thread, and importantly excessive amounts of work cannot apply
> back-pressure to reduce the amount of new work queued.
It also means that a stuck ->release() won't end up with stuck
kernel thread...
> earlier than would be ideal. When __dput (from the workqueue) calls
WTF is that __dput thing? __fput, perhaps?
> This patch adds a new process flag PF_RUNS_TASK_WORK which is now used
> instead of PF_KTHREAD to determine whether it is sensible to queue
> something to task_works. This flag is always set for non-kernel threads.
*ugh*
What's that flag for? task_work_add() always can fail; any caller must
have a fallback to cope with that possibility; fput() certainly does.
Just have the kernel threads born with ->task_works set to &work_exited
and provide a primitive that would flip it from that to NULL.
> @@ -1328,7 +1328,7 @@ static void mntput_no_expire(struct mount *mnt)
>
> if (likely(!(mnt->mnt.mnt_flags & MNT_INTERNAL))) {
> struct task_struct *task = current;
> - if (likely(!(task->flags & PF_KTHREAD))) {
> + if (likely((task->flags & PF_RUNS_TASK_WORK))) {
> init_task_work(&mnt->mnt_rcu, __cleanup_mnt);
> if (!task_work_add(task, &mnt->mnt_rcu, TWA_RESUME))
> return;
Now, *that* is something I have much stronger objections to.
Stuck filesystem shutdown is far more likely than stuck
->release(). You are seriously asking for trouble here.
Why would you want to have nfsd block on that?
I am sick and can't read emails, just one note
On 12/04, Al Viro wrote:
>
> Just have the kernel threads born with ->task_works set to &work_exited
Then irq_thread()->task_work_add() will silently fail,
> and provide a primitive that would flip it from that to NULL.
OK, so this change should update irq_thread(). But what else can fail?
And what if another kthread uses task_work_add(current) to add the
desctructor and does fput() without task_work_run() ?
Oleg.
On Mon, Dec 04, 2023 at 12:36:42PM +1100, NeilBrown wrote:
> The work of closing a file can have non-trivial cost. Doing it in a
> separate work queue thread means that cost isn't imposed on the nfsd
> threads and an imbalance can be created.
>
> I have evidence from a customer site when nfsd is being asked to modify
> many millions of files which causes sufficient memory pressure that some
> cache (in XFS I think) gets cleaned earlier than would be ideal. When
> __dput (from the workqueue) calls __dentry_kill, xfs_fs_destroy_inode()
> needs to synchronously read back previously cached info from storage.
> This slows down the single thread that is making all the final __dput()
> calls for all the nfsd threads with the net result that files are added
> to the delayed_fput_list faster than they are removed, and the system
> eventually runs out of memory.
>
> To avoid this work imbalance that exhausts memory, this patch moves all
> work for closing files into the nfsd threads. This means that when the
> work imposes a cost, that cost appears where it would be expected - in
> the work of the nfsd thread.
Thanks for pursuing this next step in the evolution of the NFSD
file cache.
Your problem statement should mention whether you have observed the
issue with an NFSv3 or an NFSv4 workload or if you see this issue
with both, since those two use cases are handled very differently
within the file cache implementation.
> There are two changes to achieve this.
>
> 1/ PF_RUNS_TASK_WORK is set and task_work_run() is called, so that the
> final __dput() for any file closed by a given nfsd thread is handled
> by that thread. This ensures that the number of files that are
> queued for a final close is limited by the number of threads and
> cannot grow without bound.
>
> 2/ Files opened for NFSv3 are never explicitly closed by the client and are
> kept open by the server in the "filecache", which responds to memory
> pressure, is garbage collected even when there is no pressure, and
> sometimes closes files when there is particular need such as for
> rename.
There is a good reason for close-on-rename: IIRC we want to avoid
triggering a silly-rename on NFS re-exports.
Also, I think we do want to close cached garbage-collected files
quickly, even without memory pressure. Files left open in this way
can conflict with subsequent NFSv4 OPENs that might hand out a
delegation as long as no other clients are using them. Files held
open by the file cache will interfere with that.
> These files currently have filp_close() called in a dedicated
> work queue, so their __dput() can have no effect on nfsd threads.
>
> This patch discards the work queue and instead has each nfsd thread
> call flip_close() on as many as 8 files from the filecache each time
> it acts on a client request (or finds there are no pending client
> requests). If there are more to be closed, more threads are woken.
> This spreads the work of __dput() over multiple threads and imposes
> any cost on those threads.
>
> The number 8 is somewhat arbitrary. It needs to be greater than 1 to
> ensure that files are closed more quickly than they can be added to
> the cache. It needs to be small enough to limit the per-request
> delays that will be imposed on clients when all threads are busy
> closing files.
IMO we want to explicitly separate the mechanisms of handling
garbage-collected files and non-garbage-collected files.
In the non-garbage-collected (NFSv4) case, the kthread can wait
for everything it has opened to be closed. task_work seems
appropriate for that IIUC.
The problem with handling a limited number of garbage-collected
items is that once the RPC workload stops, any remaining open
files will remain open because garbage collection has effectively
stopped. We really need those files closed out within a couple of
seconds.
We used to have watermarks in the nfsd_file_put() path to kick
garbage-collection if there were too many open files. Instead,
waiting for the GC thread to make progress before recycling the
kthread might be beneficial.
And, as we discussed in a previous thread, replacing the per-
namespace worker with a parallel mechanism would help GC proceed
more quickly to reduce the flush/close backlog for NFSv3.
> Signed-off-by: NeilBrown <[email protected]>
> ---
> fs/nfsd/filecache.c | 62 ++++++++++++++++++---------------------------
> fs/nfsd/filecache.h | 1 +
> fs/nfsd/nfssvc.c | 6 +++++
> 3 files changed, 32 insertions(+), 37 deletions(-)
>
> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> index ee9c923192e0..55268b7362d4 100644
> --- a/fs/nfsd/filecache.c
> +++ b/fs/nfsd/filecache.c
> @@ -39,6 +39,7 @@
> #include <linux/fsnotify.h>
> #include <linux/seq_file.h>
> #include <linux/rhashtable.h>
> +#include <linux/task_work.h>
>
> #include "vfs.h"
> #include "nfsd.h"
> @@ -61,13 +62,10 @@ static DEFINE_PER_CPU(unsigned long, nfsd_file_total_age);
> static DEFINE_PER_CPU(unsigned long, nfsd_file_evictions);
>
> struct nfsd_fcache_disposal {
> - struct work_struct work;
> spinlock_t lock;
> struct list_head freeme;
> };
>
> -static struct workqueue_struct *nfsd_filecache_wq __read_mostly;
> -
> static struct kmem_cache *nfsd_file_slab;
> static struct kmem_cache *nfsd_file_mark_slab;
> static struct list_lru nfsd_file_lru;
> @@ -421,10 +419,31 @@ nfsd_file_dispose_list_delayed(struct list_head *dispose)
> spin_lock(&l->lock);
> list_move_tail(&nf->nf_lru, &l->freeme);
> spin_unlock(&l->lock);
> - queue_work(nfsd_filecache_wq, &l->work);
> + svc_wake_up(nn->nfsd_serv);
> }
> }
>
> +/**
> + * nfsd_file_dispose_some
This needs a short description and:
* @nn: namespace to check
Or something more enlightening than that.
Also, the function name exposes mechanism; I think I'd prefer a name
that is more abstract, such as nfsd_file_net_release() ?
> + *
> + */
> +void nfsd_file_dispose_some(struct nfsd_net *nn)
> +{
> + struct nfsd_fcache_disposal *l = nn->fcache_disposal;
> + LIST_HEAD(dispose);
> + int i;
> +
> + if (list_empty(&l->freeme))
> + return;
> + spin_lock(&l->lock);
> + for (i = 0; i < 8 && !list_empty(&l->freeme); i++)
> + list_move(l->freeme.next, &dispose);
> + spin_unlock(&l->lock);
> + if (!list_empty(&l->freeme))
> + svc_wake_up(nn->nfsd_serv);
> + nfsd_file_dispose_list(&dispose);
> +}
> +
> /**
> * nfsd_file_lru_cb - Examine an entry on the LRU list
> * @item: LRU entry to examine
> @@ -635,28 +654,8 @@ nfsd_file_close_inode_sync(struct inode *inode)
> list_del_init(&nf->nf_lru);
> nfsd_file_free(nf);
> }
> - flush_delayed_fput();
> -}
> -
> -/**
> - * nfsd_file_delayed_close - close unused nfsd_files
> - * @work: dummy
> - *
> - * Scrape the freeme list for this nfsd_net, and then dispose of them
> - * all.
> - */
> -static void
> -nfsd_file_delayed_close(struct work_struct *work)
> -{
> - LIST_HEAD(head);
> - struct nfsd_fcache_disposal *l = container_of(work,
> - struct nfsd_fcache_disposal, work);
> -
> - spin_lock(&l->lock);
> - list_splice_init(&l->freeme, &head);
> - spin_unlock(&l->lock);
> -
> - nfsd_file_dispose_list(&head);
> + /* Flush any delayed fput */
> + task_work_run();
> }
>
> static int
> @@ -721,10 +720,6 @@ nfsd_file_cache_init(void)
> return ret;
>
> ret = -ENOMEM;
> - nfsd_filecache_wq = alloc_workqueue("nfsd_filecache", 0, 0);
> - if (!nfsd_filecache_wq)
> - goto out;
> -
> nfsd_file_slab = kmem_cache_create("nfsd_file",
> sizeof(struct nfsd_file), 0, 0, NULL);
> if (!nfsd_file_slab) {
> @@ -739,7 +734,6 @@ nfsd_file_cache_init(void)
> goto out_err;
> }
>
> -
> ret = list_lru_init(&nfsd_file_lru);
> if (ret) {
> pr_err("nfsd: failed to init nfsd_file_lru: %d\n", ret);
> @@ -782,8 +776,6 @@ nfsd_file_cache_init(void)
> nfsd_file_slab = NULL;
> kmem_cache_destroy(nfsd_file_mark_slab);
> nfsd_file_mark_slab = NULL;
> - destroy_workqueue(nfsd_filecache_wq);
> - nfsd_filecache_wq = NULL;
> rhltable_destroy(&nfsd_file_rhltable);
> goto out;
> }
> @@ -829,7 +821,6 @@ nfsd_alloc_fcache_disposal(void)
> l = kmalloc(sizeof(*l), GFP_KERNEL);
> if (!l)
> return NULL;
> - INIT_WORK(&l->work, nfsd_file_delayed_close);
> spin_lock_init(&l->lock);
> INIT_LIST_HEAD(&l->freeme);
> return l;
> @@ -838,7 +829,6 @@ nfsd_alloc_fcache_disposal(void)
> static void
> nfsd_free_fcache_disposal(struct nfsd_fcache_disposal *l)
> {
> - cancel_work_sync(&l->work);
> nfsd_file_dispose_list(&l->freeme);
> kfree(l);
> }
> @@ -907,8 +897,6 @@ nfsd_file_cache_shutdown(void)
> fsnotify_wait_marks_destroyed();
> kmem_cache_destroy(nfsd_file_mark_slab);
> nfsd_file_mark_slab = NULL;
> - destroy_workqueue(nfsd_filecache_wq);
> - nfsd_filecache_wq = NULL;
> rhltable_destroy(&nfsd_file_rhltable);
>
> for_each_possible_cpu(i) {
> diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h
> index e54165a3224f..bc8c3363bbdf 100644
> --- a/fs/nfsd/filecache.h
> +++ b/fs/nfsd/filecache.h
> @@ -56,6 +56,7 @@ void nfsd_file_cache_shutdown_net(struct net *net);
> void nfsd_file_put(struct nfsd_file *nf);
> struct nfsd_file *nfsd_file_get(struct nfsd_file *nf);
> void nfsd_file_close_inode_sync(struct inode *inode);
> +void nfsd_file_dispose_some(struct nfsd_net *nn);
> bool nfsd_file_is_cached(struct inode *inode);
> __be32 nfsd_file_acquire_gc(struct svc_rqst *rqstp, struct svc_fh *fhp,
> unsigned int may_flags, struct nfsd_file **nfp);
> diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
> index c7af1095f6b5..02ea16636b54 100644
> --- a/fs/nfsd/nfssvc.c
> +++ b/fs/nfsd/nfssvc.c
> @@ -13,6 +13,7 @@
> #include <linux/fs_struct.h>
> #include <linux/swap.h>
> #include <linux/siphash.h>
> +#include <linux/task_work.h>
>
> #include <linux/sunrpc/stats.h>
> #include <linux/sunrpc/svcsock.h>
> @@ -949,6 +950,7 @@ nfsd(void *vrqstp)
> }
>
> current->fs->umask = 0;
> + current->flags |= PF_RUNS_TASK_WORK;
>
> atomic_inc(&nfsdstats.th_cnt);
>
> @@ -963,6 +965,10 @@ nfsd(void *vrqstp)
>
> svc_recv(rqstp);
> validate_process_creds();
> +
> + nfsd_file_dispose_some(nn);
> + if (task_work_pending(current))
> + task_work_run();
I'd prefer that these task_work details reside inside
nfsd_file_dispose_some(), or whatever we want to call to call it ...
> }
>
> atomic_dec(&nfsdstats.th_cnt);
> --
> 2.43.0
>
--
Chuck Lever
On Mon, 04 Dec 2023, Jens Axboe wrote:
> On 12/3/23 6:36 PM, NeilBrown wrote:
> > diff --git a/fs/namespace.c b/fs/namespace.c
> > index e157efc54023..46d640b70ca9 100644
> > --- a/fs/namespace.c
> > +++ b/fs/namespace.c
> > @@ -1328,7 +1328,7 @@ static void mntput_no_expire(struct mount *mnt)
> >
> > if (likely(!(mnt->mnt.mnt_flags & MNT_INTERNAL))) {
> > struct task_struct *task = current;
> > - if (likely(!(task->flags & PF_KTHREAD))) {
> > + if (likely((task->flags & PF_RUNS_TASK_WORK))) {
>
> Extraneous parens here.
Thanks - and thanks to Matthew Wilcox too. Fixed.
>
> > diff --git a/kernel/task_work.c b/kernel/task_work.c
> > index 95a7e1b7f1da..aec19876e121 100644
> > --- a/kernel/task_work.c
> > +++ b/kernel/task_work.c
> > @@ -183,3 +183,4 @@ void task_work_run(void)
> > } while (work);
> > }
> > }
> > +EXPORT_SYMBOL(task_work_run);
>
> If we're exporting this, then I think that function needs a big
> disclaimer on exactly when it is safe to call it. And it most certainly
> needs to be a _GPL export.
I've added
* Can be used by a kernel thread but only when no locks are held and the
* thread won't be waited for by other code that might hold locks. It
* can be useful in the top-level loop of a file-serving thread to ensure
* files get closed promptly.
to the documentation comment.
It isn't clear to me what _GPL is appropriate, but maybe the rules
changed since last I looked..... are there rules?
My reasoning was that the call is effectively part of the user-space
ABI. A user-space process can call this trivially by invoking any
system call. The user-space ABI is explicitly a boundary which the GPL
does not cross. So it doesn't seem appropriate to prevent non-GPL
kernel code from doing something that non-GPL user-space code can
trivially do.
But if there are other strong opinions, or clearly documented rules that
contradict my opinion, I have not problem with adding _GPL.
Thanks,
NeilBrown
>
> --
> Jens Axboe
>
>
On Mon, 04 Dec 2023, Matthew Wilcox wrote:
> On Mon, Dec 04, 2023 at 12:36:41PM +1100, NeilBrown wrote:
> > +++ b/fs/namespace.c
> > @@ -1328,7 +1328,7 @@ static void mntput_no_expire(struct mount *mnt)
> >
> > if (likely(!(mnt->mnt.mnt_flags & MNT_INTERNAL))) {
> > struct task_struct *task = current;
> > - if (likely(!(task->flags & PF_KTHREAD))) {
> > + if (likely((task->flags & PF_RUNS_TASK_WORK))) {
>
> You could lose one set of parens here ...
>
> if (likely(task->flags & PF_RUNS_TASK_WORK)) {
Done.
>
> > #define PF_RANDOMIZE 0x00400000 /* Randomize virtual address space */
> > -#define PF__HOLE__00800000 0x00800000
> > +#define PF_RUNS_TASK_WORK 0x00800000 /* Will call task_work_run() periodically */
>
> And you could lose "Will" here:
>
> #define PF_RUNS_TASK_WORK 0x00800000 /* Calls task_work_run() periodically */
Better - thanks.
>
> > diff --git a/kernel/task_work.c b/kernel/task_work.c
> > index 95a7e1b7f1da..aec19876e121 100644
> > --- a/kernel/task_work.c
> > +++ b/kernel/task_work.c
> > @@ -183,3 +183,4 @@ void task_work_run(void)
> > } while (work);
> > }
> > }
> > +EXPORT_SYMBOL(task_work_run);
>
> _GPL?
Justification?
Thanks,
NeilBrown
On Mon, 04 Dec 2023, Al Viro wrote:
> On Mon, Dec 04, 2023 at 12:36:41PM +1100, NeilBrown wrote:
>
> > This means that any cost for doing the work is not imposed on the kernel
> > thread, and importantly excessive amounts of work cannot apply
> > back-pressure to reduce the amount of new work queued.
>
> It also means that a stuck ->release() won't end up with stuck
> kernel thread...
Is a stuck kernel thread any worse than a stuck user-space thread?
>
> > earlier than would be ideal. When __dput (from the workqueue) calls
>
> WTF is that __dput thing? __fput, perhaps?
Either __fput or dput :-)
->release isn't the problem that I am seeing.
The call trace that I see causing problems is
__fput -> dput -> dentry_kill -> destroy_inode -> xfs_fs_destroy_inode
so both __fput and dput are there, but most of the code is dput related.
So both "put"s were swimming in by brain and the wrong combination came
out.
I changed it to __fput - thanks.
>
> > This patch adds a new process flag PF_RUNS_TASK_WORK which is now used
> > instead of PF_KTHREAD to determine whether it is sensible to queue
> > something to task_works. This flag is always set for non-kernel threads.
>
> *ugh*
>
> What's that flag for? task_work_add() always can fail; any caller must
> have a fallback to cope with that possibility; fput() certainly does.
As Oleg pointed out, all threads including kernel threads call
task_work_run() at exit, and some kernel threads depend on this. So
disabling task_work_add() early for all kernel threads would break
things.
Currently task_work_add() fails only once the process has started
exiting. Only code that might run during the exit handling need check.
>
> Just have the kernel threads born with ->task_works set to &work_exited
> and provide a primitive that would flip it from that to NULL.
>
> > @@ -1328,7 +1328,7 @@ static void mntput_no_expire(struct mount *mnt)
> >
> > if (likely(!(mnt->mnt.mnt_flags & MNT_INTERNAL))) {
> > struct task_struct *task = current;
> > - if (likely(!(task->flags & PF_KTHREAD))) {
> > + if (likely((task->flags & PF_RUNS_TASK_WORK))) {
> > init_task_work(&mnt->mnt_rcu, __cleanup_mnt);
> > if (!task_work_add(task, &mnt->mnt_rcu, TWA_RESUME))
> > return;
>
> Now, *that* is something I have much stronger objections to.
> Stuck filesystem shutdown is far more likely than stuck
> ->release(). You are seriously asking for trouble here.
>
> Why would you want to have nfsd block on that?
>
I don't *want* nfsd block on that, but I don't care if it does. nfsd
will only call task_work_run() at a safe time. This is no different to
user-space processes only calling task_work_run() at a safe time.
The new flag isn't "I_AM_NFSD" or "QUEUE_FPUT_WORK_TO_TASK". It is
"RUNS_TASK_WORK". So any code that would prefer to call task_work_add()
but has a fall-back for tasks that don't call run_task_work() should
test the new flag. Doing otherwise would be inconsistent and
potentially confusing.
I don't think that nfsd getting stuck would be any worse than systemd
getting stuck, or automount getting stuck, or udiskd getting stuck.
Thanks,
NeilBrown
On 12/4/23 2:02 PM, NeilBrown wrote:
> It isn't clear to me what _GPL is appropriate, but maybe the rules
> changed since last I looked..... are there rules?
>
> My reasoning was that the call is effectively part of the user-space
> ABI. A user-space process can call this trivially by invoking any
> system call. The user-space ABI is explicitly a boundary which the GPL
> does not cross. So it doesn't seem appropriate to prevent non-GPL
> kernel code from doing something that non-GPL user-space code can
> trivially do.
By that reasoning, basically everything in the kernel should be non-GPL
marked. And while task_work can get used by the application, it happens
only indirectly or implicitly. So I don't think this reasoning is sound
at all, it's not an exported ABI or API by itself.
For me, the more core of an export it is, the stronger the reason it
should be GPL. FWIW, I don't think exporting task_work functionality is
a good idea in the first place, but if there's a strong reason to do so,
it should most certainly not be accessible to non-GPL modules. Basically
NO new export should be non-GPL.
--
Jens Axboe
On Tue, 05 Dec 2023, Chuck Lever wrote:
> On Mon, Dec 04, 2023 at 12:36:42PM +1100, NeilBrown wrote:
> > The work of closing a file can have non-trivial cost. Doing it in a
> > separate work queue thread means that cost isn't imposed on the nfsd
> > threads and an imbalance can be created.
> >
> > I have evidence from a customer site when nfsd is being asked to modify
> > many millions of files which causes sufficient memory pressure that some
> > cache (in XFS I think) gets cleaned earlier than would be ideal. When
> > __dput (from the workqueue) calls __dentry_kill, xfs_fs_destroy_inode()
> > needs to synchronously read back previously cached info from storage.
> > This slows down the single thread that is making all the final __dput()
> > calls for all the nfsd threads with the net result that files are added
> > to the delayed_fput_list faster than they are removed, and the system
> > eventually runs out of memory.
> >
> > To avoid this work imbalance that exhausts memory, this patch moves all
> > work for closing files into the nfsd threads. This means that when the
> > work imposes a cost, that cost appears where it would be expected - in
> > the work of the nfsd thread.
>
> Thanks for pursuing this next step in the evolution of the NFSD
> file cache.
>
> Your problem statement should mention whether you have observed the
> issue with an NFSv3 or an NFSv4 workload or if you see this issue
> with both, since those two use cases are handled very differently
> within the file cache implementation.
I have added:
=============
The customer was using NFSv4. I can demonstrate the same problem using
NFSv3 or NFSv4 (which close files in different ways) by adding
msleep(25) to for FMODE_WRITE files in __fput(). This simulates
slowness in the final close and when writing through nfsd it causes
/proc/sys/fs/file-nr to grow without bound.
==============
>
>
> > There are two changes to achieve this.
> >
> > 1/ PF_RUNS_TASK_WORK is set and task_work_run() is called, so that the
> > final __dput() for any file closed by a given nfsd thread is handled
> > by that thread. This ensures that the number of files that are
> > queued for a final close is limited by the number of threads and
> > cannot grow without bound.
> >
> > 2/ Files opened for NFSv3 are never explicitly closed by the client and are
> > kept open by the server in the "filecache", which responds to memory
> > pressure, is garbage collected even when there is no pressure, and
> > sometimes closes files when there is particular need such as for
> > rename.
>
> There is a good reason for close-on-rename: IIRC we want to avoid
> triggering a silly-rename on NFS re-exports.
>
> Also, I think we do want to close cached garbage-collected files
> quickly, even without memory pressure. Files left open in this way
> can conflict with subsequent NFSv4 OPENs that might hand out a
> delegation as long as no other clients are using them. Files held
> open by the file cache will interfere with that.
Yes - I agree all this behaviour is appropriate. I was just setting out
the current behaviour of the filecache so that effect of the proposed
changes would be easier to understand.
>
>
> > These files currently have filp_close() called in a dedicated
> > work queue, so their __dput() can have no effect on nfsd threads.
> >
> > This patch discards the work queue and instead has each nfsd thread
> > call flip_close() on as many as 8 files from the filecache each time
> > it acts on a client request (or finds there are no pending client
> > requests). If there are more to be closed, more threads are woken.
> > This spreads the work of __dput() over multiple threads and imposes
> > any cost on those threads.
> >
> > The number 8 is somewhat arbitrary. It needs to be greater than 1 to
> > ensure that files are closed more quickly than they can be added to
> > the cache. It needs to be small enough to limit the per-request
> > delays that will be imposed on clients when all threads are busy
> > closing files.
>
> IMO we want to explicitly separate the mechanisms of handling
> garbage-collected files and non-garbage-collected files.
I think we already have explicit separation.
garbage-collected files are handled to nfsd_file_display_list_delayed(),
either when they fall off the lru or through nfsd_file_close_inode() -
which is used by lease and fsnotify callbacks.
non-garbage collected files are closed directly by nfsd_file_put().
>
> In the non-garbage-collected (NFSv4) case, the kthread can wait
> for everything it has opened to be closed. task_work seems
> appropriate for that IIUC.
Agreed. The task_work change is all that we need for NFSv4.
>
> The problem with handling a limited number of garbage-collected
> items is that once the RPC workload stops, any remaining open
> files will remain open because garbage collection has effectively
> stopped. We really need those files closed out within a couple of
> seconds.
Why would garbage collection stop?
nfsd_filecache_laundrette is still running on the system_wq. It will
continue to garbage collect and queue files using
nfsd_file_display_list_delayed().
That will wake up an nfsd thread if none is running. The thread will
close a few, but will first wake another thread if there was more than
it was willing to manage. So the closing of files should proceed
promptly, and if any close operation takes a non-trivial amount of time,
more threads will be woken and work will proceed in parallel.
>
> We used to have watermarks in the nfsd_file_put() path to kick
> garbage-collection if there were too many open files. Instead,
> waiting for the GC thread to make progress before recycling the
> kthread might be beneficial.
"too many" is only meaningful in the context of memory usage. Having
the shrinker callback is exactly the right way to address this - nothing
else is needed.
The GC thread is expected to be CPU intensive. The main cause of delay
is skipping over lots of files that cannot be closed yet - looking for
files that can. This could delay the closing of files, but not nearly
as much as the delays I saw caused by synchronous IO.
We might be able to improve the situation a bit by queuing files as soon
as list_lru_walk finds them, rather than gathering them all into a list
and the queuing them one by one from that list.
It isn't clear to me that there is an issue here that needs fixing.
>
> And, as we discussed in a previous thread, replacing the per-
> namespace worker with a parallel mechanism would help GC proceed
> more quickly to reduce the flush/close backlog for NFSv3.
This patch discards the per-namespace worker.
The GC step (searching the LRU list for "garbage") is still
single-threaded. The filecache is shared by all net-namespaces and
there is a single GC thread for the filecache.
Files that are found *were* filp_close()ed by per-net-fs work-items.
With this patch the filp_close() is called by the nfsd threads.
The file __fput of those files *was* handled by a single system-wide
work-item. With this patch they are called by the nfsd thread which
called the filp_close().
>
>
> > Signed-off-by: NeilBrown <[email protected]>
> > ---
> > fs/nfsd/filecache.c | 62 ++++++++++++++++++---------------------------
> > fs/nfsd/filecache.h | 1 +
> > fs/nfsd/nfssvc.c | 6 +++++
> > 3 files changed, 32 insertions(+), 37 deletions(-)
> >
> > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> > index ee9c923192e0..55268b7362d4 100644
> > --- a/fs/nfsd/filecache.c
> > +++ b/fs/nfsd/filecache.c
> > @@ -39,6 +39,7 @@
> > #include <linux/fsnotify.h>
> > #include <linux/seq_file.h>
> > #include <linux/rhashtable.h>
> > +#include <linux/task_work.h>
> >
> > #include "vfs.h"
> > #include "nfsd.h"
> > @@ -61,13 +62,10 @@ static DEFINE_PER_CPU(unsigned long, nfsd_file_total_age);
> > static DEFINE_PER_CPU(unsigned long, nfsd_file_evictions);
> >
> > struct nfsd_fcache_disposal {
> > - struct work_struct work;
> > spinlock_t lock;
> > struct list_head freeme;
> > };
> >
> > -static struct workqueue_struct *nfsd_filecache_wq __read_mostly;
> > -
> > static struct kmem_cache *nfsd_file_slab;
> > static struct kmem_cache *nfsd_file_mark_slab;
> > static struct list_lru nfsd_file_lru;
> > @@ -421,10 +419,31 @@ nfsd_file_dispose_list_delayed(struct list_head *dispose)
> > spin_lock(&l->lock);
> > list_move_tail(&nf->nf_lru, &l->freeme);
> > spin_unlock(&l->lock);
> > - queue_work(nfsd_filecache_wq, &l->work);
> > + svc_wake_up(nn->nfsd_serv);
> > }
> > }
> >
> > +/**
> > + * nfsd_file_dispose_some
>
> This needs a short description and:
>
> * @nn: namespace to check
>
> Or something more enlightening than that.
>
> Also, the function name exposes mechanism; I think I'd prefer a name
> that is more abstract, such as nfsd_file_net_release() ?
Sometimes exposing mechanism is a good thing. It means the casual reader
can get a sense of what the function does without having to look at the
function.
So I still prefer my name, but I changed to nfsd_file_net_dispose() so
as suit your preference, but follow the established pattern of using the
word "dispose". "release" usually just drops a reference. "dispose"
makes it clear that the thing is going away now.
/**
* nfsd_file_net_dispose - deal with nfsd_files wait to be disposed.
* @nn: nfsd_net in which to find files to be disposed.
*
* When files held open for nfsv3 are removed from the filecache, whether
* due to memory pressure or garbage collection, they are queued to
* a per-net-ns queue. This function completes the disposal, either
* directly or by waking another nfsd thread to help with the work.
*/
>
> > + *
> > + */
> > +void nfsd_file_dispose_some(struct nfsd_net *nn)
> > +{
> > + struct nfsd_fcache_disposal *l = nn->fcache_disposal;
> > + LIST_HEAD(dispose);
> > + int i;
> > +
> > + if (list_empty(&l->freeme))
> > + return;
> > + spin_lock(&l->lock);
> > + for (i = 0; i < 8 && !list_empty(&l->freeme); i++)
> > + list_move(l->freeme.next, &dispose);
> > + spin_unlock(&l->lock);
> > + if (!list_empty(&l->freeme))
> > + svc_wake_up(nn->nfsd_serv);
> > + nfsd_file_dispose_list(&dispose);
..
> > @@ -949,6 +950,7 @@ nfsd(void *vrqstp)
> > }
> >
> > current->fs->umask = 0;
> > + current->flags |= PF_RUNS_TASK_WORK;
> >
> > atomic_inc(&nfsdstats.th_cnt);
> >
> > @@ -963,6 +965,10 @@ nfsd(void *vrqstp)
> >
> > svc_recv(rqstp);
> > validate_process_creds();
> > +
> > + nfsd_file_dispose_some(nn);
> > + if (task_work_pending(current))
> > + task_work_run();
>
> I'd prefer that these task_work details reside inside
> nfsd_file_dispose_some(), or whatever we want to call to call it ...
I don't agree. They are performing quite separate tasks.
nfsd_file_net_dispose() is disposing files queued for this net.
task_run_work() is finalising the close of any file closed by this
thread, including those used for NFSv4 that are not touched by
nfsd_file_dispose_some().
I don't think they belong in the same function.
Thanks,
NeilBrown
On Tue, 05 Dec 2023, Jens Axboe wrote:
> On 12/4/23 2:02 PM, NeilBrown wrote:
> > It isn't clear to me what _GPL is appropriate, but maybe the rules
> > changed since last I looked..... are there rules?
> >
> > My reasoning was that the call is effectively part of the user-space
> > ABI. A user-space process can call this trivially by invoking any
> > system call. The user-space ABI is explicitly a boundary which the GPL
> > does not cross. So it doesn't seem appropriate to prevent non-GPL
> > kernel code from doing something that non-GPL user-space code can
> > trivially do.
>
> By that reasoning, basically everything in the kernel should be non-GPL
> marked. And while task_work can get used by the application, it happens
> only indirectly or implicitly. So I don't think this reasoning is sound
> at all, it's not an exported ABI or API by itself.
>
> For me, the more core of an export it is, the stronger the reason it
> should be GPL. FWIW, I don't think exporting task_work functionality is
> a good idea in the first place, but if there's a strong reason to do so,
> it should most certainly not be accessible to non-GPL modules. Basically
> NO new export should be non-GPL.
An alternate to exporting task_work_run() might be to call it from
try_to_freeze(). I think that too should only be called from a context
where no locks are held etc. Obviously try_to_freeze would only call
task_work_run() if PF_RUNS_TASK_WORK were set.
I'm not sure this is a *good* idea, but it is an idea that would avoid
the export.
For now I change the export to _GPL.
Thanks,
NeilBrown
On Tue, Dec 05, 2023 at 09:21:08AM +1100, NeilBrown wrote:
> On Tue, 05 Dec 2023, Chuck Lever wrote:
> > On Mon, Dec 04, 2023 at 12:36:42PM +1100, NeilBrown wrote:
> > > The work of closing a file can have non-trivial cost. Doing it in a
> > > separate work queue thread means that cost isn't imposed on the nfsd
> > > threads and an imbalance can be created.
> > >
> > > I have evidence from a customer site when nfsd is being asked to modify
> > > many millions of files which causes sufficient memory pressure that some
> > > cache (in XFS I think) gets cleaned earlier than would be ideal. When
> > > __dput (from the workqueue) calls __dentry_kill, xfs_fs_destroy_inode()
> > > needs to synchronously read back previously cached info from storage.
> > > This slows down the single thread that is making all the final __dput()
> > > calls for all the nfsd threads with the net result that files are added
> > > to the delayed_fput_list faster than they are removed, and the system
> > > eventually runs out of memory.
> > >
> > > To avoid this work imbalance that exhausts memory, this patch moves all
> > > work for closing files into the nfsd threads. This means that when the
> > > work imposes a cost, that cost appears where it would be expected - in
> > > the work of the nfsd thread.
> >
> > Thanks for pursuing this next step in the evolution of the NFSD
> > file cache.
> >
> > Your problem statement should mention whether you have observed the
> > issue with an NFSv3 or an NFSv4 workload or if you see this issue
> > with both, since those two use cases are handled very differently
> > within the file cache implementation.
>
> I have added:
>
> =============
> The customer was using NFSv4. I can demonstrate the same problem using
> NFSv3 or NFSv4 (which close files in different ways) by adding
> msleep(25) to for FMODE_WRITE files in __fput(). This simulates
> slowness in the final close and when writing through nfsd it causes
> /proc/sys/fs/file-nr to grow without bound.
> ==============
>
> >
> >
> > > There are two changes to achieve this.
> > >
> > > 1/ PF_RUNS_TASK_WORK is set and task_work_run() is called, so that the
> > > final __dput() for any file closed by a given nfsd thread is handled
> > > by that thread. This ensures that the number of files that are
> > > queued for a final close is limited by the number of threads and
> > > cannot grow without bound.
> > >
> > > 2/ Files opened for NFSv3 are never explicitly closed by the client and are
> > > kept open by the server in the "filecache", which responds to memory
> > > pressure, is garbage collected even when there is no pressure, and
> > > sometimes closes files when there is particular need such as for
> > > rename.
> >
> > There is a good reason for close-on-rename: IIRC we want to avoid
> > triggering a silly-rename on NFS re-exports.
> >
> > Also, I think we do want to close cached garbage-collected files
> > quickly, even without memory pressure. Files left open in this way
> > can conflict with subsequent NFSv4 OPENs that might hand out a
> > delegation as long as no other clients are using them. Files held
> > open by the file cache will interfere with that.
>
> Yes - I agree all this behaviour is appropriate. I was just setting out
> the current behaviour of the filecache so that effect of the proposed
> changes would be easier to understand.
Ok, I misread "when there is particular need" as "when there is no
particular need." My bad.
> > > These files currently have filp_close() called in a dedicated
> > > work queue, so their __dput() can have no effect on nfsd threads.
> > >
> > > This patch discards the work queue and instead has each nfsd thread
> > > call flip_close() on as many as 8 files from the filecache each time
> > > it acts on a client request (or finds there are no pending client
> > > requests). If there are more to be closed, more threads are woken.
> > > This spreads the work of __dput() over multiple threads and imposes
> > > any cost on those threads.
> > >
> > > The number 8 is somewhat arbitrary. It needs to be greater than 1 to
> > > ensure that files are closed more quickly than they can be added to
> > > the cache. It needs to be small enough to limit the per-request
> > > delays that will be imposed on clients when all threads are busy
> > > closing files.
> >
> > IMO we want to explicitly separate the mechanisms of handling
> > garbage-collected files and non-garbage-collected files.
>
> I think we already have explicit separation.
> garbage-collected files are handled to nfsd_file_display_list_delayed(),
> either when they fall off the lru or through nfsd_file_close_inode() -
> which is used by lease and fsnotify callbacks.
>
> non-garbage collected files are closed directly by nfsd_file_put().
The separation is more clear to me now. Building this all into a
single patch kind of blurred the edges between the two.
> > In the non-garbage-collected (NFSv4) case, the kthread can wait
> > for everything it has opened to be closed. task_work seems
> > appropriate for that IIUC.
>
> Agreed. The task_work change is all that we need for NFSv4.
>
> > The problem with handling a limited number of garbage-collected
> > items is that once the RPC workload stops, any remaining open
> > files will remain open because garbage collection has effectively
> > stopped. We really need those files closed out within a couple of
> > seconds.
>
> Why would garbage collection stop?
Because with your patch GC now appears to be driven through
nfsd_file_dispose_some(). I see now that there is a hidden
recursion that wakes more nfsd threads if there's more GC to
be done. So file disposal is indeed not dependent on more
ingress RPC traffic.
The "If there are more to be closed" remark above in the patch
description was ambiguous to me, but I think I get it now.
> nfsd_filecache_laundrette is still running on the system_wq. It will
> continue to garbage collect and queue files using
> nfsd_file_display_list_delayed().
> That will wake up an nfsd thread if none is running. The thread will
> close a few, but will first wake another thread if there was more than
> it was willing to manage. So the closing of files should proceed
> promptly, and if any close operation takes a non-trivial amount of time,
> more threads will be woken and work will proceed in parallel.
OK, that is what the svc_wake_up()s are doing.
> > And, as we discussed in a previous thread, replacing the per-
> > namespace worker with a parallel mechanism would help GC proceed
> > more quickly to reduce the flush/close backlog for NFSv3.
>
> This patch discards the per-namespace worker.
>
> The GC step (searching the LRU list for "garbage") is still
> single-threaded. The filecache is shared by all net-namespaces and
> there is a single GC thread for the filecache.
Agreed.
> Files that are found *were* filp_close()ed by per-net-fs work-items.
> With this patch the filp_close() is called by the nfsd threads.
>
> The file __fput of those files *was* handled by a single system-wide
> work-item. With this patch they are called by the nfsd thread which
> called the filp_close().
Fwiw, none of that is obvious to me when looking at the diff.
> > > Signed-off-by: NeilBrown <[email protected]>
> > > ---
> > > fs/nfsd/filecache.c | 62 ++++++++++++++++++---------------------------
> > > fs/nfsd/filecache.h | 1 +
> > > fs/nfsd/nfssvc.c | 6 +++++
> > > 3 files changed, 32 insertions(+), 37 deletions(-)
> > >
> > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> > > index ee9c923192e0..55268b7362d4 100644
> > > --- a/fs/nfsd/filecache.c
> > > +++ b/fs/nfsd/filecache.c
> > > @@ -39,6 +39,7 @@
> > > #include <linux/fsnotify.h>
> > > #include <linux/seq_file.h>
> > > #include <linux/rhashtable.h>
> > > +#include <linux/task_work.h>
> > >
> > > #include "vfs.h"
> > > #include "nfsd.h"
> > > @@ -61,13 +62,10 @@ static DEFINE_PER_CPU(unsigned long, nfsd_file_total_age);
> > > static DEFINE_PER_CPU(unsigned long, nfsd_file_evictions);
> > >
> > > struct nfsd_fcache_disposal {
> > > - struct work_struct work;
> > > spinlock_t lock;
> > > struct list_head freeme;
> > > };
> > >
> > > -static struct workqueue_struct *nfsd_filecache_wq __read_mostly;
> > > -
> > > static struct kmem_cache *nfsd_file_slab;
> > > static struct kmem_cache *nfsd_file_mark_slab;
> > > static struct list_lru nfsd_file_lru;
> > > @@ -421,10 +419,31 @@ nfsd_file_dispose_list_delayed(struct list_head *dispose)
> > > spin_lock(&l->lock);
> > > list_move_tail(&nf->nf_lru, &l->freeme);
> > > spin_unlock(&l->lock);
> > > - queue_work(nfsd_filecache_wq, &l->work);
> > > + svc_wake_up(nn->nfsd_serv);
> > > }
> > > }
> > >
> > > +/**
> > > + * nfsd_file_dispose_some
> >
> > This needs a short description and:
> >
> > * @nn: namespace to check
> >
> > Or something more enlightening than that.
> >
> > Also, the function name exposes mechanism; I think I'd prefer a name
> > that is more abstract, such as nfsd_file_net_release() ?
>
> Sometimes exposing mechanism is a good thing. It means the casual reader
> can get a sense of what the function does without having to look at the
> function.
> So I still prefer my name, but I changed to nfsd_file_net_dispose() so
> as suit your preference, but follow the established pattern of using the
> word "dispose". "release" usually just drops a reference. "dispose"
> makes it clear that the thing is going away now.
>
> /**
> * nfsd_file_net_dispose - deal with nfsd_files wait to be disposed.
> * @nn: nfsd_net in which to find files to be disposed.
> *
> * When files held open for nfsv3 are removed from the filecache, whether
This comment is helpful. But note that we quite purposely do not
refer to NFS versions in filecache.c -- it's either garbage-
collected or not garbage-collected files. IIRC on occasion NFSv3
wants to use a non-garbage-collected file, and NFSv4 might sometimes
use a GC-d file. I've forgotten the details.
> * due to memory pressure or garbage collection, they are queued to
> * a per-net-ns queue. This function completes the disposal, either
> * directly or by waking another nfsd thread to help with the work.
> */
I understand why you want to keep this name: this function handles
only garbage-collected files.
I would still like nfsd() to call a wrapper function to handle
the details of closing both types of files rather than open-coding
calling nfsd_file_net_dispose() and task_run_work(), especially
because there is no code comment explaining why the task_run_work()
call is needed. That level of filecache implementation detail
doesn't belong in nfsd().
--
Chuck Lever
On Tue, Dec 05, 2023 at 08:20:31AM +1100, NeilBrown wrote:
> On Mon, 04 Dec 2023, Al Viro wrote:
> > On Mon, Dec 04, 2023 at 12:36:41PM +1100, NeilBrown wrote:
> >
> > > This means that any cost for doing the work is not imposed on the kernel
> > > thread, and importantly excessive amounts of work cannot apply
> > > back-pressure to reduce the amount of new work queued.
> >
> > It also means that a stuck ->release() won't end up with stuck
> > kernel thread...
>
> Is a stuck kernel thread any worse than a stuck user-space thread?
>
> >
> > > earlier than would be ideal. When __dput (from the workqueue) calls
> >
> > WTF is that __dput thing? __fput, perhaps?
>
> Either __fput or dput :-)
> ->release isn't the problem that I am seeing.
> The call trace that I see causing problems is
> __fput -> dput -> dentry_kill -> destroy_inode -> xfs_fs_destroy_inode
What problem, exactly, are you having with xfs_fs_destroy_inode()?
-Dave.
--
Dave Chinner
[email protected]
Hi NeilBrown,
kernel test robot noticed the following build warnings:
[auto build test WARNING on tip/sched/core]
[also build test WARNING on linus/master v6.7-rc4]
[cannot apply to brauner-vfs/vfs.all next-20231205]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/NeilBrown/Allow-a-kthread-to-declare-that-it-calls-task_work_run/20231204-094248
base: tip/sched/core
patch link: https://lore.kernel.org/r/20231204014042.6754-3-neilb%40suse.de
patch subject: [PATCH 2/2] nfsd: Don't leave work of closing files to a work queue.
config: sh-allmodconfig (https://download.01.org/0day-ci/archive/20231205/[email protected]/config)
compiler: sh4-linux-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231205/[email protected]/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
All warnings (new ones prefixed by >>):
>> fs/nfsd/filecache.c:431: warning: Function parameter or member 'nn' not described in 'nfsd_file_dispose_some'
vim +431 fs/nfsd/filecache.c
425
426 /**
427 * nfsd_file_dispose_some
428 *
429 */
430 void nfsd_file_dispose_some(struct nfsd_net *nn)
> 431 {
432 struct nfsd_fcache_disposal *l = nn->fcache_disposal;
433 LIST_HEAD(dispose);
434 int i;
435
436 if (list_empty(&l->freeme))
437 return;
438 spin_lock(&l->lock);
439 for (i = 0; i < 8 && !list_empty(&l->freeme); i++)
440 list_move(l->freeme.next, &dispose);
441 spin_unlock(&l->lock);
442 if (!list_empty(&l->freeme))
443 svc_wake_up(nn->nfsd_serv);
444 nfsd_file_dispose_list(&dispose);
445 }
446
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
On Mon, Dec 04, 2023 at 12:36:41PM +1100, NeilBrown wrote:
> User-space processes always call task_work_run() as needed when
> returning from a system call. Kernel-threads generally do not.
> Because of this some work that is best run in the task_works context
> (guaranteed that no locks are held) cannot be queued to task_works from
> kernel threads and so are queued to a (single) work_time to be managed
> on a work queue.
>
> This means that any cost for doing the work is not imposed on the kernel
> thread, and importantly excessive amounts of work cannot apply
> back-pressure to reduce the amount of new work queued.
>
> I have evidence from a customer site when nfsd (which runs as kernel
> threads) is being asked to modify many millions of files which causes
> sufficient memory pressure that some cache (in XFS I think) gets cleaned
> earlier than would be ideal. When __dput (from the workqueue) calls
> __dentry_kill, xfs_fs_destroy_inode() needs to synchronously read back
> previously cached info from storage.
We fixed that specific XFS problem in 5.9.
https://lore.kernel.org/linux-xfs/[email protected]/
Can you reproduce these issues on a current TOT kernel?
If not, there's no bugs to fix in the upstream kernel. If you can,
then we've got more XFS issues to work through and fix.
Fundamentally, though, we should not be papering over an XFS issue
by changing how core task_work infrastructure is used. So let's deal
with the XFS issue first....
-Dave.
--
Dave Chinner
[email protected]
On Tue, 05 Dec 2023, Dave Chinner wrote:
> On Mon, Dec 04, 2023 at 12:36:41PM +1100, NeilBrown wrote:
> > User-space processes always call task_work_run() as needed when
> > returning from a system call. Kernel-threads generally do not.
> > Because of this some work that is best run in the task_works context
> > (guaranteed that no locks are held) cannot be queued to task_works from
> > kernel threads and so are queued to a (single) work_time to be managed
> > on a work queue.
> >
> > This means that any cost for doing the work is not imposed on the kernel
> > thread, and importantly excessive amounts of work cannot apply
> > back-pressure to reduce the amount of new work queued.
> >
> > I have evidence from a customer site when nfsd (which runs as kernel
> > threads) is being asked to modify many millions of files which causes
> > sufficient memory pressure that some cache (in XFS I think) gets cleaned
> > earlier than would be ideal. When __dput (from the workqueue) calls
> > __dentry_kill, xfs_fs_destroy_inode() needs to synchronously read back
> > previously cached info from storage.
>
> We fixed that specific XFS problem in 5.9.
>
> https://lore.kernel.org/linux-xfs/[email protected]/
Good to know - thanks.
>
> Can you reproduce these issues on a current TOT kernel?
I haven't tried. I don't know if I know enough details of the work load
to attempt it.
>
> If not, there's no bugs to fix in the upstream kernel. If you can,
> then we've got more XFS issues to work through and fix.
>
> Fundamentally, though, we should not be papering over an XFS issue
> by changing how core task_work infrastructure is used. So let's deal
> with the XFS issue first....
I disagree. This customer experience has demonstrated both a bug in XFS
and bug in the interaction between fput, task_work, and nfsd.
If a bug in a filesystem that only causes a modest performance impact
when used through the syscall API can bring the system to its knees
through memory exhaustion when used by nfsd, then that is a robustness
issue for nfsd.
I want to fix that robustness issue so that unusual behaviour in
filesystems does not cause out-of-proportion bad behaviour in nfsd.
I highlighted this in the cover letter to the first version of my patch:
https://lore.kernel.org/all/[email protected]/
While this might point to a problem with the filesystem not handling the
final close efficiently, such problems should only hurt throughput, not
lead to memory exhaustion.
Thanks,
NeilBrown
>
> -Dave.
> --
> Dave Chinner
> [email protected]
>
On Mon, Dec 04, 2023 at 03:09:44PM -0700, Jens Axboe wrote:
> On 12/4/23 2:02 PM, NeilBrown wrote:
> > It isn't clear to me what _GPL is appropriate, but maybe the rules
> > changed since last I looked..... are there rules?
> >
> > My reasoning was that the call is effectively part of the user-space
> > ABI. A user-space process can call this trivially by invoking any
> > system call. The user-space ABI is explicitly a boundary which the GPL
> > does not cross. So it doesn't seem appropriate to prevent non-GPL
> > kernel code from doing something that non-GPL user-space code can
> > trivially do.
>
> By that reasoning, basically everything in the kernel should be non-GPL
> marked. And while task_work can get used by the application, it happens
> only indirectly or implicitly. So I don't think this reasoning is sound
> at all, it's not an exported ABI or API by itself.
>
> For me, the more core of an export it is, the stronger the reason it
> should be GPL. FWIW, I don't think exporting task_work functionality is
> a good idea in the first place, but if there's a strong reason to do so,
Yeah, I'm not too fond of that part as well. I don't think we want to
give modules the ability to mess with task work. This is just asking for
trouble.
On Tue, Dec 05, 2023 at 07:48:20PM +1100, NeilBrown wrote:
> On Tue, 05 Dec 2023, Dave Chinner wrote:
> > On Mon, Dec 04, 2023 at 12:36:41PM +1100, NeilBrown wrote:
> > > User-space processes always call task_work_run() as needed when
> > > returning from a system call. Kernel-threads generally do not.
> > > Because of this some work that is best run in the task_works context
> > > (guaranteed that no locks are held) cannot be queued to task_works from
> > > kernel threads and so are queued to a (single) work_time to be managed
> > > on a work queue.
> > >
> > > This means that any cost for doing the work is not imposed on the kernel
> > > thread, and importantly excessive amounts of work cannot apply
> > > back-pressure to reduce the amount of new work queued.
> > >
> > > I have evidence from a customer site when nfsd (which runs as kernel
> > > threads) is being asked to modify many millions of files which causes
> > > sufficient memory pressure that some cache (in XFS I think) gets cleaned
> > > earlier than would be ideal. When __dput (from the workqueue) calls
> > > __dentry_kill, xfs_fs_destroy_inode() needs to synchronously read back
> > > previously cached info from storage.
> >
> > We fixed that specific XFS problem in 5.9.
> >
> > https://lore.kernel.org/linux-xfs/[email protected]/
>
> Good to know - thanks.
>
> >
> > Can you reproduce these issues on a current TOT kernel?
>
> I haven't tried. I don't know if I know enough details of the work load
> to attempt it.
>
> >
> > If not, there's no bugs to fix in the upstream kernel. If you can,
> > then we've got more XFS issues to work through and fix.
> >
> > Fundamentally, though, we should not be papering over an XFS issue
> > by changing how core task_work infrastructure is used. So let's deal
> > with the XFS issue first....
>
> I disagree. This customer experience has demonstrated both a bug in XFS
> and bug in the interaction between fput, task_work, and nfsd.
>
> If a bug in a filesystem that only causes a modest performance impact
> when used through the syscall API can bring the system to its knees
> through memory exhaustion when used by nfsd, then that is a robustness
> issue for nfsd.
>
> I want to fix that robustness issue so that unusual behaviour in
> filesystems does not cause out-of-proportion bad behaviour in nfsd.
>
> I highlighted this in the cover letter to the first version of my patch:
>
> https://lore.kernel.org/all/[email protected]/
>
> While this might point to a problem with the filesystem not handling the
> final close efficiently, such problems should only hurt throughput, not
> lead to memory exhaustion.
I'm still confused about this memory exhaustion claim?
If this is a filesystem problem it's pretty annoying that we have to
work around it by exposing task work to random modules.
On Tue, 2023-12-05 at 12:14 +0100, Christian Brauner wrote:
> On Mon, Dec 04, 2023 at 03:09:44PM -0700, Jens Axboe wrote:
> > On 12/4/23 2:02 PM, NeilBrown wrote:
> > > It isn't clear to me what _GPL is appropriate, but maybe the rules
> > > changed since last I looked..... are there rules?
> > >
> > > My reasoning was that the call is effectively part of the user-space
> > > ABI. A user-space process can call this trivially by invoking any
> > > system call. The user-space ABI is explicitly a boundary which the GPL
> > > does not cross. So it doesn't seem appropriate to prevent non-GPL
> > > kernel code from doing something that non-GPL user-space code can
> > > trivially do.
> >
> > By that reasoning, basically everything in the kernel should be non-GPL
> > marked. And while task_work can get used by the application, it happens
> > only indirectly or implicitly. So I don't think this reasoning is sound
> > at all, it's not an exported ABI or API by itself.
> >
> > For me, the more core of an export it is, the stronger the reason it
> > should be GPL. FWIW, I don't think exporting task_work functionality is
> > a good idea in the first place, but if there's a strong reason to do so,
>
> Yeah, I'm not too fond of that part as well. I don't think we want to
> give modules the ability to mess with task work. This is just asking for
> trouble.
The fact that nfsd has to queue all of the delayed fput activity to a
workqueue has always been a horrible hack though. We export all kinds of
functionality to modules that you can screw up.
I think that nfsd's use-case is legitimate. ksmbd may also want to
follow suit.
--
Jeff Layton <[email protected]>
On Tue, 05 Dec 2023, Christian Brauner wrote:
> On Mon, Dec 04, 2023 at 03:09:44PM -0700, Jens Axboe wrote:
> > On 12/4/23 2:02 PM, NeilBrown wrote:
> > > It isn't clear to me what _GPL is appropriate, but maybe the rules
> > > changed since last I looked..... are there rules?
> > >
> > > My reasoning was that the call is effectively part of the user-space
> > > ABI. A user-space process can call this trivially by invoking any
> > > system call. The user-space ABI is explicitly a boundary which the GPL
> > > does not cross. So it doesn't seem appropriate to prevent non-GPL
> > > kernel code from doing something that non-GPL user-space code can
> > > trivially do.
> >
> > By that reasoning, basically everything in the kernel should be non-GPL
> > marked. And while task_work can get used by the application, it happens
> > only indirectly or implicitly. So I don't think this reasoning is sound
> > at all, it's not an exported ABI or API by itself.
> >
> > For me, the more core of an export it is, the stronger the reason it
> > should be GPL. FWIW, I don't think exporting task_work functionality is
> > a good idea in the first place, but if there's a strong reason to do so,
>
> Yeah, I'm not too fond of that part as well. I don't think we want to
> give modules the ability to mess with task work. This is just asking for
> trouble.
>
Ok, maybe we need to reframe the problem then.
Currently fput(), and hence filp_close(), take control away from kernel
threads in that they cannot be sure that a "close" has actually
completed.
This is already a problem for nfsd. When renaming a file, nfsd needs to
ensure any cached "open" that it has on the file is closed (else when
re-exporting an NFS filesystem it can result in a silly-rename).
nfsd currently handles this case by calling flush_delayed_fput(). I
suspect you are no more happy about exporting that than you are about
exporting task_work_run(), but this solution isn't actually 100%
reliable. If some other thread calls flush_delayed_fput() between nfsd
calling filp_close() and that same nfsd calling flush_delayed_fput(),
then the second flush can return before the first flush (in the other
thread) completes all the work it took on.
What we really need - both for handling renames and for avoiding
possible memory exhaustion - is for nfsd to be able to reliably wait for
any fput() that it initiated to complete.
How would you like the VFS to provide that service?
Thanks,
NeilBrown
On 12/5/23 2:28 PM, NeilBrown wrote:
> On Tue, 05 Dec 2023, Christian Brauner wrote:
>> On Mon, Dec 04, 2023 at 03:09:44PM -0700, Jens Axboe wrote:
>>> On 12/4/23 2:02 PM, NeilBrown wrote:
>>>> It isn't clear to me what _GPL is appropriate, but maybe the rules
>>>> changed since last I looked..... are there rules?
>>>>
>>>> My reasoning was that the call is effectively part of the user-space
>>>> ABI. A user-space process can call this trivially by invoking any
>>>> system call. The user-space ABI is explicitly a boundary which the GPL
>>>> does not cross. So it doesn't seem appropriate to prevent non-GPL
>>>> kernel code from doing something that non-GPL user-space code can
>>>> trivially do.
>>>
>>> By that reasoning, basically everything in the kernel should be non-GPL
>>> marked. And while task_work can get used by the application, it happens
>>> only indirectly or implicitly. So I don't think this reasoning is sound
>>> at all, it's not an exported ABI or API by itself.
>>>
>>> For me, the more core of an export it is, the stronger the reason it
>>> should be GPL. FWIW, I don't think exporting task_work functionality is
>>
>> Yeah, I'm not too fond of that part as well. I don't think we want to
>> give modules the ability to mess with task work. This is just asking for
>> trouble.
>>
>
> Ok, maybe we need to reframe the problem then.
>
> Currently fput(), and hence filp_close(), take control away from kernel
> threads in that they cannot be sure that a "close" has actually
> completed.
>
> This is already a problem for nfsd. When renaming a file, nfsd needs to
> ensure any cached "open" that it has on the file is closed (else when
> re-exporting an NFS filesystem it can result in a silly-rename).
>
> nfsd currently handles this case by calling flush_delayed_fput(). I
> suspect you are no more happy about exporting that than you are about
> exporting task_work_run(), but this solution isn't actually 100%
> reliable. If some other thread calls flush_delayed_fput() between nfsd
> calling filp_close() and that same nfsd calling flush_delayed_fput(),
> then the second flush can return before the first flush (in the other
> thread) completes all the work it took on.
>
> What we really need - both for handling renames and for avoiding
> possible memory exhaustion - is for nfsd to be able to reliably wait for
> any fput() that it initiated to complete.
>
> How would you like the VFS to provide that service?
Since task_work happens in the context of your task already, why not
just have a way to get it stashed into a list when final fput is done?
This avoids all of this "let's expose task_work" and using the task list
for that, which seems kind of pointless as you're just going to run it
later on manually anyway.
In semi pseudo code:
bool fput_put_ref(struct file *file)
{
return atomic_dec_and_test(&file->f_count);
}
void fput(struct file *file)
{
if (fput_put_ref(file)) {
...
}
}
and then your nfsd_file_free() could do:
ret = filp_flush(file, id);
if (fput_put_ref(file))
llist_add(&file->f_llist, &l->to_free_llist);
or something like that, where l->to_free_llist is where ever you'd
otherwise punt the actual freeing to.
--
Jens Axboe
On 12/5/23 2:58 PM, Jens Axboe wrote:
> On 12/5/23 2:28 PM, NeilBrown wrote:
>> On Tue, 05 Dec 2023, Christian Brauner wrote:
>>> On Mon, Dec 04, 2023 at 03:09:44PM -0700, Jens Axboe wrote:
>>>> On 12/4/23 2:02 PM, NeilBrown wrote:
>>>>> It isn't clear to me what _GPL is appropriate, but maybe the rules
>>>>> changed since last I looked..... are there rules?
>>>>>
>>>>> My reasoning was that the call is effectively part of the user-space
>>>>> ABI. A user-space process can call this trivially by invoking any
>>>>> system call. The user-space ABI is explicitly a boundary which the GPL
>>>>> does not cross. So it doesn't seem appropriate to prevent non-GPL
>>>>> kernel code from doing something that non-GPL user-space code can
>>>>> trivially do.
>>>>
>>>> By that reasoning, basically everything in the kernel should be non-GPL
>>>> marked. And while task_work can get used by the application, it happens
>>>> only indirectly or implicitly. So I don't think this reasoning is sound
>>>> at all, it's not an exported ABI or API by itself.
>>>>
>>>> For me, the more core of an export it is, the stronger the reason it
>>>> should be GPL. FWIW, I don't think exporting task_work functionality is
>
>>>
>>> Yeah, I'm not too fond of that part as well. I don't think we want to
>>> give modules the ability to mess with task work. This is just asking for
>>> trouble.
>>>
>>
>> Ok, maybe we need to reframe the problem then.
>>
>> Currently fput(), and hence filp_close(), take control away from kernel
>> threads in that they cannot be sure that a "close" has actually
>> completed.
>>
>> This is already a problem for nfsd. When renaming a file, nfsd needs to
>> ensure any cached "open" that it has on the file is closed (else when
>> re-exporting an NFS filesystem it can result in a silly-rename).
>>
>> nfsd currently handles this case by calling flush_delayed_fput(). I
>> suspect you are no more happy about exporting that than you are about
>> exporting task_work_run(), but this solution isn't actually 100%
>> reliable. If some other thread calls flush_delayed_fput() between nfsd
>> calling filp_close() and that same nfsd calling flush_delayed_fput(),
>> then the second flush can return before the first flush (in the other
>> thread) completes all the work it took on.
>>
>> What we really need - both for handling renames and for avoiding
>> possible memory exhaustion - is for nfsd to be able to reliably wait for
>> any fput() that it initiated to complete.
>>
>> How would you like the VFS to provide that service?
>
> Since task_work happens in the context of your task already, why not
> just have a way to get it stashed into a list when final fput is done?
> This avoids all of this "let's expose task_work" and using the task list
> for that, which seems kind of pointless as you're just going to run it
> later on manually anyway.
>
> In semi pseudo code:
>
> bool fput_put_ref(struct file *file)
> {
> return atomic_dec_and_test(&file->f_count);
> }
>
> void fput(struct file *file)
> {
> if (fput_put_ref(file)) {
> ...
> }
> }
>
> and then your nfsd_file_free() could do:
>
> ret = filp_flush(file, id);
> if (fput_put_ref(file))
> llist_add(&file->f_llist, &l->to_free_llist);
>
> or something like that, where l->to_free_llist is where ever you'd
> otherwise punt the actual freeing to.
Should probably have the put_ref or whatever helper also init the
task_work, and then reuse the list in the callback_head there. Then
whoever flushes it has to call ->func() and avoid exposing ____fput() to
random users. But you get the idea.
--
Jens Axboe
On Wed, 06 Dec 2023, Jens Axboe wrote:
> On 12/5/23 2:58 PM, Jens Axboe wrote:
> > On 12/5/23 2:28 PM, NeilBrown wrote:
> >> On Tue, 05 Dec 2023, Christian Brauner wrote:
> >>> On Mon, Dec 04, 2023 at 03:09:44PM -0700, Jens Axboe wrote:
> >>>> On 12/4/23 2:02 PM, NeilBrown wrote:
> >>>>> It isn't clear to me what _GPL is appropriate, but maybe the rules
> >>>>> changed since last I looked..... are there rules?
> >>>>>
> >>>>> My reasoning was that the call is effectively part of the user-space
> >>>>> ABI. A user-space process can call this trivially by invoking any
> >>>>> system call. The user-space ABI is explicitly a boundary which the GPL
> >>>>> does not cross. So it doesn't seem appropriate to prevent non-GPL
> >>>>> kernel code from doing something that non-GPL user-space code can
> >>>>> trivially do.
> >>>>
> >>>> By that reasoning, basically everything in the kernel should be non-GPL
> >>>> marked. And while task_work can get used by the application, it happens
> >>>> only indirectly or implicitly. So I don't think this reasoning is sound
> >>>> at all, it's not an exported ABI or API by itself.
> >>>>
> >>>> For me, the more core of an export it is, the stronger the reason it
> >>>> should be GPL. FWIW, I don't think exporting task_work functionality is
> >
> >>>
> >>> Yeah, I'm not too fond of that part as well. I don't think we want to
> >>> give modules the ability to mess with task work. This is just asking for
> >>> trouble.
> >>>
> >>
> >> Ok, maybe we need to reframe the problem then.
> >>
> >> Currently fput(), and hence filp_close(), take control away from kernel
> >> threads in that they cannot be sure that a "close" has actually
> >> completed.
> >>
> >> This is already a problem for nfsd. When renaming a file, nfsd needs to
> >> ensure any cached "open" that it has on the file is closed (else when
> >> re-exporting an NFS filesystem it can result in a silly-rename).
> >>
> >> nfsd currently handles this case by calling flush_delayed_fput(). I
> >> suspect you are no more happy about exporting that than you are about
> >> exporting task_work_run(), but this solution isn't actually 100%
> >> reliable. If some other thread calls flush_delayed_fput() between nfsd
> >> calling filp_close() and that same nfsd calling flush_delayed_fput(),
> >> then the second flush can return before the first flush (in the other
> >> thread) completes all the work it took on.
> >>
> >> What we really need - both for handling renames and for avoiding
> >> possible memory exhaustion - is for nfsd to be able to reliably wait for
> >> any fput() that it initiated to complete.
> >>
> >> How would you like the VFS to provide that service?
> >
> > Since task_work happens in the context of your task already, why not
> > just have a way to get it stashed into a list when final fput is done?
> > This avoids all of this "let's expose task_work" and using the task list
> > for that, which seems kind of pointless as you're just going to run it
> > later on manually anyway.
> >
> > In semi pseudo code:
> >
> > bool fput_put_ref(struct file *file)
> > {
> > return atomic_dec_and_test(&file->f_count);
> > }
> >
> > void fput(struct file *file)
> > {
> > if (fput_put_ref(file)) {
> > ...
> > }
> > }
> >
> > and then your nfsd_file_free() could do:
> >
> > ret = filp_flush(file, id);
> > if (fput_put_ref(file))
> > llist_add(&file->f_llist, &l->to_free_llist);
> >
> > or something like that, where l->to_free_llist is where ever you'd
> > otherwise punt the actual freeing to.
>
> Should probably have the put_ref or whatever helper also init the
> task_work, and then reuse the list in the callback_head there. Then
> whoever flushes it has to call ->func() and avoid exposing ____fput() to
> random users. But you get the idea.
Interesting ideas - thanks.
So maybe the new API would be
fput_queued(struct file *f, struct llist_head *q)
and
flush_fput_queue(struct llist_head *q)
with the meaning being that fput_queued() is just like fput() except
that any file needing __fput() is added to the 'q'; and that
flush_fput_queue() calls __fput() on any files in 'q'.
So to close a file nfsd would:
fget(f);
flip_close(f);
fput_queued(f, &my_queue);
though possibly we could have a
filp_close_queued(f, q)
as well.
I'll try that out - but am happy to hear alternate suggestions for names :-)
Thanks,
NeilBrown
On Wed, 06 Dec 2023, NeilBrown wrote:
> On Wed, 06 Dec 2023, Jens Axboe wrote:
> > On 12/5/23 2:58 PM, Jens Axboe wrote:
> > > On 12/5/23 2:28 PM, NeilBrown wrote:
> > >> On Tue, 05 Dec 2023, Christian Brauner wrote:
> > >>> On Mon, Dec 04, 2023 at 03:09:44PM -0700, Jens Axboe wrote:
> > >>>> On 12/4/23 2:02 PM, NeilBrown wrote:
> > >>>>> It isn't clear to me what _GPL is appropriate, but maybe the rules
> > >>>>> changed since last I looked..... are there rules?
> > >>>>>
> > >>>>> My reasoning was that the call is effectively part of the user-space
> > >>>>> ABI. A user-space process can call this trivially by invoking any
> > >>>>> system call. The user-space ABI is explicitly a boundary which the GPL
> > >>>>> does not cross. So it doesn't seem appropriate to prevent non-GPL
> > >>>>> kernel code from doing something that non-GPL user-space code can
> > >>>>> trivially do.
> > >>>>
> > >>>> By that reasoning, basically everything in the kernel should be non-GPL
> > >>>> marked. And while task_work can get used by the application, it happens
> > >>>> only indirectly or implicitly. So I don't think this reasoning is sound
> > >>>> at all, it's not an exported ABI or API by itself.
> > >>>>
> > >>>> For me, the more core of an export it is, the stronger the reason it
> > >>>> should be GPL. FWIW, I don't think exporting task_work functionality is
> > >
> > >>>
> > >>> Yeah, I'm not too fond of that part as well. I don't think we want to
> > >>> give modules the ability to mess with task work. This is just asking for
> > >>> trouble.
> > >>>
> > >>
> > >> Ok, maybe we need to reframe the problem then.
> > >>
> > >> Currently fput(), and hence filp_close(), take control away from kernel
> > >> threads in that they cannot be sure that a "close" has actually
> > >> completed.
> > >>
> > >> This is already a problem for nfsd. When renaming a file, nfsd needs to
> > >> ensure any cached "open" that it has on the file is closed (else when
> > >> re-exporting an NFS filesystem it can result in a silly-rename).
> > >>
> > >> nfsd currently handles this case by calling flush_delayed_fput(). I
> > >> suspect you are no more happy about exporting that than you are about
> > >> exporting task_work_run(), but this solution isn't actually 100%
> > >> reliable. If some other thread calls flush_delayed_fput() between nfsd
> > >> calling filp_close() and that same nfsd calling flush_delayed_fput(),
> > >> then the second flush can return before the first flush (in the other
> > >> thread) completes all the work it took on.
> > >>
> > >> What we really need - both for handling renames and for avoiding
> > >> possible memory exhaustion - is for nfsd to be able to reliably wait for
> > >> any fput() that it initiated to complete.
> > >>
> > >> How would you like the VFS to provide that service?
> > >
> > > Since task_work happens in the context of your task already, why not
> > > just have a way to get it stashed into a list when final fput is done?
> > > This avoids all of this "let's expose task_work" and using the task list
> > > for that, which seems kind of pointless as you're just going to run it
> > > later on manually anyway.
> > >
> > > In semi pseudo code:
> > >
> > > bool fput_put_ref(struct file *file)
> > > {
> > > return atomic_dec_and_test(&file->f_count);
> > > }
> > >
> > > void fput(struct file *file)
> > > {
> > > if (fput_put_ref(file)) {
> > > ...
> > > }
> > > }
> > >
> > > and then your nfsd_file_free() could do:
> > >
> > > ret = filp_flush(file, id);
> > > if (fput_put_ref(file))
> > > llist_add(&file->f_llist, &l->to_free_llist);
> > >
> > > or something like that, where l->to_free_llist is where ever you'd
> > > otherwise punt the actual freeing to.
> >
> > Should probably have the put_ref or whatever helper also init the
> > task_work, and then reuse the list in the callback_head there. Then
> > whoever flushes it has to call ->func() and avoid exposing ____fput() to
> > random users. But you get the idea.
>
> Interesting ideas - thanks.
>
> So maybe the new API would be
>
> fput_queued(struct file *f, struct llist_head *q)
> and
> flush_fput_queue(struct llist_head *q)
>
> with the meaning being that fput_queued() is just like fput() except
> that any file needing __fput() is added to the 'q'; and that
> flush_fput_queue() calls __fput() on any files in 'q'.
>
> So to close a file nfsd would:
>
> fget(f);
> flip_close(f);
> fput_queued(f, &my_queue);
>
> though possibly we could have a
> filp_close_queued(f, q)
> as well.
>
> I'll try that out - but am happy to hear alternate suggestions for names :-)
>
Actually .... I'm beginning to wonder if we should just use
__fput_sync() in nfsd.
It has a big warning about not doing that blindly, but the detail in the
warning doesn't seem to apply to nfsd...
NeilBrown
On 12/5/23 4:23 PM, NeilBrown wrote:
> On Wed, 06 Dec 2023, NeilBrown wrote:
>> On Wed, 06 Dec 2023, Jens Axboe wrote:
>>> On 12/5/23 2:58 PM, Jens Axboe wrote:
>>>> On 12/5/23 2:28 PM, NeilBrown wrote:
>>>>> On Tue, 05 Dec 2023, Christian Brauner wrote:
>>>>>> On Mon, Dec 04, 2023 at 03:09:44PM -0700, Jens Axboe wrote:
>>>>>>> On 12/4/23 2:02 PM, NeilBrown wrote:
>>>>>>>> It isn't clear to me what _GPL is appropriate, but maybe the rules
>>>>>>>> changed since last I looked..... are there rules?
>>>>>>>>
>>>>>>>> My reasoning was that the call is effectively part of the user-space
>>>>>>>> ABI. A user-space process can call this trivially by invoking any
>>>>>>>> system call. The user-space ABI is explicitly a boundary which the GPL
>>>>>>>> does not cross. So it doesn't seem appropriate to prevent non-GPL
>>>>>>>> kernel code from doing something that non-GPL user-space code can
>>>>>>>> trivially do.
>>>>>>>
>>>>>>> By that reasoning, basically everything in the kernel should be non-GPL
>>>>>>> marked. And while task_work can get used by the application, it happens
>>>>>>> only indirectly or implicitly. So I don't think this reasoning is sound
>>>>>>> at all, it's not an exported ABI or API by itself.
>>>>>>>
>>>>>>> For me, the more core of an export it is, the stronger the reason it
>>>>>>> should be GPL. FWIW, I don't think exporting task_work functionality is
>>>>
>>>>>>
>>>>>> Yeah, I'm not too fond of that part as well. I don't think we want to
>>>>>> give modules the ability to mess with task work. This is just asking for
>>>>>> trouble.
>>>>>>
>>>>>
>>>>> Ok, maybe we need to reframe the problem then.
>>>>>
>>>>> Currently fput(), and hence filp_close(), take control away from kernel
>>>>> threads in that they cannot be sure that a "close" has actually
>>>>> completed.
>>>>>
>>>>> This is already a problem for nfsd. When renaming a file, nfsd needs to
>>>>> ensure any cached "open" that it has on the file is closed (else when
>>>>> re-exporting an NFS filesystem it can result in a silly-rename).
>>>>>
>>>>> nfsd currently handles this case by calling flush_delayed_fput(). I
>>>>> suspect you are no more happy about exporting that than you are about
>>>>> exporting task_work_run(), but this solution isn't actually 100%
>>>>> reliable. If some other thread calls flush_delayed_fput() between nfsd
>>>>> calling filp_close() and that same nfsd calling flush_delayed_fput(),
>>>>> then the second flush can return before the first flush (in the other
>>>>> thread) completes all the work it took on.
>>>>>
>>>>> What we really need - both for handling renames and for avoiding
>>>>> possible memory exhaustion - is for nfsd to be able to reliably wait for
>>>>> any fput() that it initiated to complete.
>>>>>
>>>>> How would you like the VFS to provide that service?
>>>>
>>>> Since task_work happens in the context of your task already, why not
>>>> just have a way to get it stashed into a list when final fput is done?
>>>> This avoids all of this "let's expose task_work" and using the task list
>>>> for that, which seems kind of pointless as you're just going to run it
>>>> later on manually anyway.
>>>>
>>>> In semi pseudo code:
>>>>
>>>> bool fput_put_ref(struct file *file)
>>>> {
>>>> return atomic_dec_and_test(&file->f_count);
>>>> }
>>>>
>>>> void fput(struct file *file)
>>>> {
>>>> if (fput_put_ref(file)) {
>>>> ...
>>>> }
>>>> }
>>>>
>>>> and then your nfsd_file_free() could do:
>>>>
>>>> ret = filp_flush(file, id);
>>>> if (fput_put_ref(file))
>>>> llist_add(&file->f_llist, &l->to_free_llist);
>>>>
>>>> or something like that, where l->to_free_llist is where ever you'd
>>>> otherwise punt the actual freeing to.
>>>
>>> Should probably have the put_ref or whatever helper also init the
>>> task_work, and then reuse the list in the callback_head there. Then
>>> whoever flushes it has to call ->func() and avoid exposing ____fput() to
>>> random users. But you get the idea.
>>
>> Interesting ideas - thanks.
>>
>> So maybe the new API would be
>>
>> fput_queued(struct file *f, struct llist_head *q)
>> and
>> flush_fput_queue(struct llist_head *q)
>>
>> with the meaning being that fput_queued() is just like fput() except
>> that any file needing __fput() is added to the 'q'; and that
>> flush_fput_queue() calls __fput() on any files in 'q'.
>>
>> So to close a file nfsd would:
>>
>> fget(f);
>> flip_close(f);
>> fput_queued(f, &my_queue);
>>
>> though possibly we could have a
>> filp_close_queued(f, q)
>> as well.
>>
>> I'll try that out - but am happy to hear alternate suggestions for names :-)
>>
>
> Actually .... I'm beginning to wonder if we should just use
> __fput_sync() in nfsd.
> It has a big warning about not doing that blindly, but the detail in the
> warning doesn't seem to apply to nfsd...
If you can do it from the context where you do the filp_close() right
now, then yeah there's no reason to over-complicate this at all... FWIW,
the reason task_work exists is just to ensure a clean context to perform
these operations from the task itself. The more I think about it, it
doesn't make a lot of sense to utilize it for this purpose, which is
where my alternate suggestion came from. But if you can just call it
directly, then that makes everything much easier.
--
Jens Axboe
On Tue, Dec 05, 2023 at 12:14:29PM +0100, Christian Brauner wrote:
> > For me, the more core of an export it is, the stronger the reason it
> > should be GPL. FWIW, I don't think exporting task_work functionality is
> > a good idea in the first place, but if there's a strong reason to do so,
>
> Yeah, I'm not too fond of that part as well. I don't think we want to
> give modules the ability to mess with task work. This is just asking for
> trouble.
It just seems like a really bad idea. At the same time it fixes a real
problem. If we go a step back how could we fix it in a better way?
Do we even need the task_run based delay for file usage from kernel
threads?
On Tue, Dec 05, 2023 at 04:31:51PM -0700, Jens Axboe wrote:
> On 12/5/23 4:23 PM, NeilBrown wrote:
> > On Wed, 06 Dec 2023, NeilBrown wrote:
> >> On Wed, 06 Dec 2023, Jens Axboe wrote:
> >>> On 12/5/23 2:58 PM, Jens Axboe wrote:
> >>>> On 12/5/23 2:28 PM, NeilBrown wrote:
> >>>>> On Tue, 05 Dec 2023, Christian Brauner wrote:
> >>>>>> On Mon, Dec 04, 2023 at 03:09:44PM -0700, Jens Axboe wrote:
> >>>>>>> On 12/4/23 2:02 PM, NeilBrown wrote:
> >>>>>>>> It isn't clear to me what _GPL is appropriate, but maybe the rules
> >>>>>>>> changed since last I looked..... are there rules?
> >>>>>>>>
> >>>>>>>> My reasoning was that the call is effectively part of the user-space
> >>>>>>>> ABI. A user-space process can call this trivially by invoking any
> >>>>>>>> system call. The user-space ABI is explicitly a boundary which the GPL
> >>>>>>>> does not cross. So it doesn't seem appropriate to prevent non-GPL
> >>>>>>>> kernel code from doing something that non-GPL user-space code can
> >>>>>>>> trivially do.
> >>>>>>>
> >>>>>>> By that reasoning, basically everything in the kernel should be non-GPL
> >>>>>>> marked. And while task_work can get used by the application, it happens
> >>>>>>> only indirectly or implicitly. So I don't think this reasoning is sound
> >>>>>>> at all, it's not an exported ABI or API by itself.
> >>>>>>>
> >>>>>>> For me, the more core of an export it is, the stronger the reason it
> >>>>>>> should be GPL. FWIW, I don't think exporting task_work functionality is
> >>>>
> >>>>>>
> >>>>>> Yeah, I'm not too fond of that part as well. I don't think we want to
> >>>>>> give modules the ability to mess with task work. This is just asking for
> >>>>>> trouble.
> >>>>>>
> >>>>>
> >>>>> Ok, maybe we need to reframe the problem then.
> >>>>>
> >>>>> Currently fput(), and hence filp_close(), take control away from kernel
> >>>>> threads in that they cannot be sure that a "close" has actually
> >>>>> completed.
> >>>>>
> >>>>> This is already a problem for nfsd. When renaming a file, nfsd needs to
> >>>>> ensure any cached "open" that it has on the file is closed (else when
> >>>>> re-exporting an NFS filesystem it can result in a silly-rename).
> >>>>>
> >>>>> nfsd currently handles this case by calling flush_delayed_fput(). I
> >>>>> suspect you are no more happy about exporting that than you are about
> >>>>> exporting task_work_run(), but this solution isn't actually 100%
> >>>>> reliable. If some other thread calls flush_delayed_fput() between nfsd
> >>>>> calling filp_close() and that same nfsd calling flush_delayed_fput(),
> >>>>> then the second flush can return before the first flush (in the other
> >>>>> thread) completes all the work it took on.
> >>>>>
> >>>>> What we really need - both for handling renames and for avoiding
> >>>>> possible memory exhaustion - is for nfsd to be able to reliably wait for
> >>>>> any fput() that it initiated to complete.
> >>>>>
> >>>>> How would you like the VFS to provide that service?
> >>>>
> >>>> Since task_work happens in the context of your task already, why not
> >>>> just have a way to get it stashed into a list when final fput is done?
> >>>> This avoids all of this "let's expose task_work" and using the task list
> >>>> for that, which seems kind of pointless as you're just going to run it
> >>>> later on manually anyway.
> >>>>
> >>>> In semi pseudo code:
> >>>>
> >>>> bool fput_put_ref(struct file *file)
> >>>> {
> >>>> return atomic_dec_and_test(&file->f_count);
> >>>> }
> >>>>
> >>>> void fput(struct file *file)
> >>>> {
> >>>> if (fput_put_ref(file)) {
> >>>> ...
> >>>> }
> >>>> }
> >>>>
> >>>> and then your nfsd_file_free() could do:
> >>>>
> >>>> ret = filp_flush(file, id);
> >>>> if (fput_put_ref(file))
> >>>> llist_add(&file->f_llist, &l->to_free_llist);
> >>>>
> >>>> or something like that, where l->to_free_llist is where ever you'd
> >>>> otherwise punt the actual freeing to.
> >>>
> >>> Should probably have the put_ref or whatever helper also init the
> >>> task_work, and then reuse the list in the callback_head there. Then
> >>> whoever flushes it has to call ->func() and avoid exposing ____fput() to
> >>> random users. But you get the idea.
> >>
> >> Interesting ideas - thanks.
> >>
> >> So maybe the new API would be
> >>
> >> fput_queued(struct file *f, struct llist_head *q)
> >> and
> >> flush_fput_queue(struct llist_head *q)
> >>
> >> with the meaning being that fput_queued() is just like fput() except
> >> that any file needing __fput() is added to the 'q'; and that
> >> flush_fput_queue() calls __fput() on any files in 'q'.
> >>
> >> So to close a file nfsd would:
> >>
> >> fget(f);
> >> flip_close(f);
> >> fput_queued(f, &my_queue);
> >>
> >> though possibly we could have a
> >> filp_close_queued(f, q)
> >> as well.
> >>
> >> I'll try that out - but am happy to hear alternate suggestions for names :-)
> >>
> >
> > Actually .... I'm beginning to wonder if we should just use
> > __fput_sync() in nfsd.
> > It has a big warning about not doing that blindly, but the detail in the
> > warning doesn't seem to apply to nfsd...
>
> If you can do it from the context where you do the filp_close() right
> now, then yeah there's no reason to over-complicate this at all... FWIW,
As long as nfsd doesn't care that it may get stuck on umount or
->release...
> the reason task_work exists is just to ensure a clean context to perform
> these operations from the task itself. The more I think about it, it
> doesn't make a lot of sense to utilize it for this purpose, which is
> where my alternate suggestion came from. But if you can just call it
> directly, then that makes everything much easier.
And for better or worse we already expose __fput_sync(). We've recently
switched close(2) over to it as well as it was needlessly punting to
task work.
On Wed, Dec 06, 2023 at 08:28:15AM +1100, NeilBrown wrote:
> On Tue, 05 Dec 2023, Christian Brauner wrote:
> > On Mon, Dec 04, 2023 at 03:09:44PM -0700, Jens Axboe wrote:
> > > On 12/4/23 2:02 PM, NeilBrown wrote:
> > > > It isn't clear to me what _GPL is appropriate, but maybe the rules
> > > > changed since last I looked..... are there rules?
> > > >
> > > > My reasoning was that the call is effectively part of the user-space
> > > > ABI. A user-space process can call this trivially by invoking any
> > > > system call. The user-space ABI is explicitly a boundary which the GPL
> > > > does not cross. So it doesn't seem appropriate to prevent non-GPL
> > > > kernel code from doing something that non-GPL user-space code can
> > > > trivially do.
> > >
> > > By that reasoning, basically everything in the kernel should be non-GPL
> > > marked. And while task_work can get used by the application, it happens
> > > only indirectly or implicitly. So I don't think this reasoning is sound
> > > at all, it's not an exported ABI or API by itself.
> > >
> > > For me, the more core of an export it is, the stronger the reason it
> > > should be GPL. FWIW, I don't think exporting task_work functionality is
> > > a good idea in the first place, but if there's a strong reason to do so,
> >
> > Yeah, I'm not too fond of that part as well. I don't think we want to
> > give modules the ability to mess with task work. This is just asking for
> > trouble.
> >
>
> Ok, maybe we need to reframe the problem then.
>
> Currently fput(), and hence filp_close(), take control away from kernel
> threads in that they cannot be sure that a "close" has actually
> completed.
>
> This is already a problem for nfsd. When renaming a file, nfsd needs to
> ensure any cached "open" that it has on the file is closed (else when
> re-exporting an NFS filesystem it can result in a silly-rename).
>
> nfsd currently handles this case by calling flush_delayed_fput(). I
> suspect you are no more happy about exporting that than you are about
> exporting task_work_run(), but this solution isn't actually 100%
> reliable. If some other thread calls flush_delayed_fput() between nfsd
> calling filp_close() and that same nfsd calling flush_delayed_fput(),
> then the second flush can return before the first flush (in the other
> thread) completes all the work it took on.
>
> What we really need - both for handling renames and for avoiding
> possible memory exhaustion - is for nfsd to be able to reliably wait for
> any fput() that it initiated to complete.
Yeah, I acknowledge the problem and I said I'm not opposed to your
solution I just would like to do it differently if we can.
>
> How would you like the VFS to provide that service?
If you really don't care about getting stuck on an fput() somehow then
what Jens suggested might actually be fine.
And the proposal - queue the files on a list - isn't that already what
nfsd is kinda doing already for the file cache. That's at least the
impression I got from reading over nfsd_file_free(). It's just that it
doesn't free directly but used that flush_delayed_fput(). So really,
taking this on step further and doing the fput synchronously might work.
On Thu, 07 Dec 2023, Christian Brauner wrote:
> On Tue, Dec 05, 2023 at 04:31:51PM -0700, Jens Axboe wrote:
> > On 12/5/23 4:23 PM, NeilBrown wrote:
> > > On Wed, 06 Dec 2023, NeilBrown wrote:
> > >> On Wed, 06 Dec 2023, Jens Axboe wrote:
> > >>> On 12/5/23 2:58 PM, Jens Axboe wrote:
> > >>>> On 12/5/23 2:28 PM, NeilBrown wrote:
> > >>>>> On Tue, 05 Dec 2023, Christian Brauner wrote:
> > >>>>>> On Mon, Dec 04, 2023 at 03:09:44PM -0700, Jens Axboe wrote:
> > >>>>>>> On 12/4/23 2:02 PM, NeilBrown wrote:
> > >>>>>>>> It isn't clear to me what _GPL is appropriate, but maybe the rules
> > >>>>>>>> changed since last I looked..... are there rules?
> > >>>>>>>>
> > >>>>>>>> My reasoning was that the call is effectively part of the user-space
> > >>>>>>>> ABI. A user-space process can call this trivially by invoking any
> > >>>>>>>> system call. The user-space ABI is explicitly a boundary which the GPL
> > >>>>>>>> does not cross. So it doesn't seem appropriate to prevent non-GPL
> > >>>>>>>> kernel code from doing something that non-GPL user-space code can
> > >>>>>>>> trivially do.
> > >>>>>>>
> > >>>>>>> By that reasoning, basically everything in the kernel should be non-GPL
> > >>>>>>> marked. And while task_work can get used by the application, it happens
> > >>>>>>> only indirectly or implicitly. So I don't think this reasoning is sound
> > >>>>>>> at all, it's not an exported ABI or API by itself.
> > >>>>>>>
> > >>>>>>> For me, the more core of an export it is, the stronger the reason it
> > >>>>>>> should be GPL. FWIW, I don't think exporting task_work functionality is
> > >>>>
> > >>>>>>
> > >>>>>> Yeah, I'm not too fond of that part as well. I don't think we want to
> > >>>>>> give modules the ability to mess with task work. This is just asking for
> > >>>>>> trouble.
> > >>>>>>
> > >>>>>
> > >>>>> Ok, maybe we need to reframe the problem then.
> > >>>>>
> > >>>>> Currently fput(), and hence filp_close(), take control away from kernel
> > >>>>> threads in that they cannot be sure that a "close" has actually
> > >>>>> completed.
> > >>>>>
> > >>>>> This is already a problem for nfsd. When renaming a file, nfsd needs to
> > >>>>> ensure any cached "open" that it has on the file is closed (else when
> > >>>>> re-exporting an NFS filesystem it can result in a silly-rename).
> > >>>>>
> > >>>>> nfsd currently handles this case by calling flush_delayed_fput(). I
> > >>>>> suspect you are no more happy about exporting that than you are about
> > >>>>> exporting task_work_run(), but this solution isn't actually 100%
> > >>>>> reliable. If some other thread calls flush_delayed_fput() between nfsd
> > >>>>> calling filp_close() and that same nfsd calling flush_delayed_fput(),
> > >>>>> then the second flush can return before the first flush (in the other
> > >>>>> thread) completes all the work it took on.
> > >>>>>
> > >>>>> What we really need - both for handling renames and for avoiding
> > >>>>> possible memory exhaustion - is for nfsd to be able to reliably wait for
> > >>>>> any fput() that it initiated to complete.
> > >>>>>
> > >>>>> How would you like the VFS to provide that service?
> > >>>>
> > >>>> Since task_work happens in the context of your task already, why not
> > >>>> just have a way to get it stashed into a list when final fput is done?
> > >>>> This avoids all of this "let's expose task_work" and using the task list
> > >>>> for that, which seems kind of pointless as you're just going to run it
> > >>>> later on manually anyway.
> > >>>>
> > >>>> In semi pseudo code:
> > >>>>
> > >>>> bool fput_put_ref(struct file *file)
> > >>>> {
> > >>>> return atomic_dec_and_test(&file->f_count);
> > >>>> }
> > >>>>
> > >>>> void fput(struct file *file)
> > >>>> {
> > >>>> if (fput_put_ref(file)) {
> > >>>> ...
> > >>>> }
> > >>>> }
> > >>>>
> > >>>> and then your nfsd_file_free() could do:
> > >>>>
> > >>>> ret = filp_flush(file, id);
> > >>>> if (fput_put_ref(file))
> > >>>> llist_add(&file->f_llist, &l->to_free_llist);
> > >>>>
> > >>>> or something like that, where l->to_free_llist is where ever you'd
> > >>>> otherwise punt the actual freeing to.
> > >>>
> > >>> Should probably have the put_ref or whatever helper also init the
> > >>> task_work, and then reuse the list in the callback_head there. Then
> > >>> whoever flushes it has to call ->func() and avoid exposing ____fput() to
> > >>> random users. But you get the idea.
> > >>
> > >> Interesting ideas - thanks.
> > >>
> > >> So maybe the new API would be
> > >>
> > >> fput_queued(struct file *f, struct llist_head *q)
> > >> and
> > >> flush_fput_queue(struct llist_head *q)
> > >>
> > >> with the meaning being that fput_queued() is just like fput() except
> > >> that any file needing __fput() is added to the 'q'; and that
> > >> flush_fput_queue() calls __fput() on any files in 'q'.
> > >>
> > >> So to close a file nfsd would:
> > >>
> > >> fget(f);
> > >> flip_close(f);
> > >> fput_queued(f, &my_queue);
> > >>
> > >> though possibly we could have a
> > >> filp_close_queued(f, q)
> > >> as well.
> > >>
> > >> I'll try that out - but am happy to hear alternate suggestions for names :-)
> > >>
> > >
> > > Actually .... I'm beginning to wonder if we should just use
> > > __fput_sync() in nfsd.
> > > It has a big warning about not doing that blindly, but the detail in the
> > > warning doesn't seem to apply to nfsd...
> >
> > If you can do it from the context where you do the filp_close() right
> > now, then yeah there's no reason to over-complicate this at all... FWIW,
>
> As long as nfsd doesn't care that it may get stuck on umount or
> ->release...
I think we do *care* about getting stuck. But I don't think we would
*expect* to get stuck..
I had a look at varous ->release function. Quite few do fsync or
similar which isn't a problem. nfsd often waits for writes to complete.
Some lock the inode, which again is something that nfsd threads often
do.
Is there something special that ->release might do but that other
filesystem operation don't do?
I'd really like to understand why __fput is so special that we often
queue it to a separate thread.
>
> > the reason task_work exists is just to ensure a clean context to perform
> > these operations from the task itself. The more I think about it, it
> > doesn't make a lot of sense to utilize it for this purpose, which is
> > where my alternate suggestion came from. But if you can just call it
> > directly, then that makes everything much easier.
>
> And for better or worse we already expose __fput_sync(). We've recently
> switched close(2) over to it as well as it was needlessly punting to
> task work.
>
exit_files() would be another good candidate for using __fput_sync().
Oleg Nesterov has reported problems when a process which a large number
of files exits - this currently puts lots of entries on the task_works
lists. If task_work_cancel is then called before those are all dealt
with, it can have a long list to search while holding a hot lock. (I
hope I got that description right).
Thanks,
NeilBrown