On Mon, Dec 04, 2023 at 12:36:41PM +1100, NeilBrown wrote:
> User-space processes always call task_work_run() as needed when
> returning from a system call. Kernel-threads generally do not.
> Because of this some work that is best run in the task_works context
> (guaranteed that no locks are held) cannot be queued to task_works from
> kernel threads and so are queued to a (single) work_time to be managed
> on a work queue.
>
> This means that any cost for doing the work is not imposed on the kernel
> thread, and importantly excessive amounts of work cannot apply
> back-pressure to reduce the amount of new work queued.
>
> I have evidence from a customer site when nfsd (which runs as kernel
> threads) is being asked to modify many millions of files which causes
> sufficient memory pressure that some cache (in XFS I think) gets cleaned
> earlier than would be ideal. When __dput (from the workqueue) calls
> __dentry_kill, xfs_fs_destroy_inode() needs to synchronously read back
> previously cached info from storage. This slows down the single thread
> that is making all the final __dput() calls for all the nfsd threads
> with the net result that files are added to the delayed_fput_list faster
> than they are removed, and the system eventually runs out of memory.
>
> This happens because there is no back-pressure: the nfsd isn't forced to
> slow down when __dput() is slow for any reason. To fix this we can
> change the nfsd threads to call task_work_run() regularly (much like
> user-space processes do) and allow it to declare this so that work does
> get queued to task_works rather than to a work queue.
>
> This patch adds a new process flag PF_RUNS_TASK_WORK which is now used
> instead of PF_KTHREAD to determine whether it is sensible to queue
> something to task_works. This flag is always set for non-kernel threads.
>
> task_work_run() is also exported so that it can be called from a module
> such as nfsd.
>
> Signed-off-by: NeilBrown <[email protected]>
> ---
The thing that bugs me the most about this is that we expose task work
infrastructure to modules which I think is a really bad idea. File
handling code brings so many driver to their knees and now we're handing
them another footgun.
I'm not per se opposed to all of this but is this really what the other
NFS maintainers want to switch to as well? And is this really that badly
needed and that common that we want to go down that road? I wouldn't
mind not having to do all this if we can get by via other means.
> fs/file_table.c | 3 ++-
> fs/namespace.c | 2 +-
> include/linux/sched.h | 2 +-
> kernel/fork.c | 2 ++
> kernel/task_work.c | 1 +
> 5 files changed, 7 insertions(+), 3 deletions(-)
>
> diff --git a/fs/file_table.c b/fs/file_table.c
> index ee21b3da9d08..d36cade6e366 100644
> --- a/fs/file_table.c
> +++ b/fs/file_table.c
> @@ -435,7 +435,8 @@ void fput(struct file *file)
> if (atomic_long_dec_and_test(&file->f_count)) {
> struct task_struct *task = current;
>
> - if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) {
> + if (likely(!in_interrupt() &&
> + (task->flags & PF_RUNS_TASK_WORK))) {
> init_task_work(&file->f_rcuhead, ____fput);
> if (!task_work_add(task, &file->f_rcuhead, TWA_RESUME))
> return;
> diff --git a/fs/namespace.c b/fs/namespace.c
> index e157efc54023..46d640b70ca9 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -1328,7 +1328,7 @@ static void mntput_no_expire(struct mount *mnt)
>
> if (likely(!(mnt->mnt.mnt_flags & MNT_INTERNAL))) {
> struct task_struct *task = current;
> - if (likely(!(task->flags & PF_KTHREAD))) {
> + if (likely((task->flags & PF_RUNS_TASK_WORK))) {
> init_task_work(&mnt->mnt_rcu, __cleanup_mnt);
> if (!task_work_add(task, &mnt->mnt_rcu, TWA_RESUME))
> return;
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 77f01ac385f7..e4eebac708e7 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1747,7 +1747,7 @@ extern struct pid *cad_pid;
> * I am cleaning dirty pages from some other bdi. */
> #define PF_KTHREAD 0x00200000 /* I am a kernel thread */
> #define PF_RANDOMIZE 0x00400000 /* Randomize virtual address space */
> -#define PF__HOLE__00800000 0x00800000
> +#define PF_RUNS_TASK_WORK 0x00800000 /* Will call task_work_run() periodically */
The flag seems better to me than just relying on exit_work as itt's
easier to reason about.
> #define PF__HOLE__01000000 0x01000000
> #define PF__HOLE__02000000 0x02000000
> #define PF_NO_SETAFFINITY 0x04000000 /* Userland is not allowed to meddle with cpus_mask */
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 3b6d20dfb9a8..d612d8f14861 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -2330,6 +2330,8 @@ __latent_entropy struct task_struct *copy_process(
> p->flags &= ~PF_KTHREAD;
> if (args->kthread)
> p->flags |= PF_KTHREAD;
> + else
> + p->flags |= PF_RUNS_TASK_WORK;
> if (args->user_worker) {
> /*
> * Mark us a user worker, and block any signal that isn't
> diff --git a/kernel/task_work.c b/kernel/task_work.c
> index 95a7e1b7f1da..aec19876e121 100644
> --- a/kernel/task_work.c
> +++ b/kernel/task_work.c
> @@ -183,3 +183,4 @@ void task_work_run(void)
> } while (work);
> }
> }
> +EXPORT_SYMBOL(task_work_run);
> --
> 2.43.0
>
On Tue, Dec 05, 2023 at 12:25:40PM +0100, Christian Brauner wrote:
> On Mon, Dec 04, 2023 at 12:36:41PM +1100, NeilBrown wrote:
> > User-space processes always call task_work_run() as needed when
> > returning from a system call. Kernel-threads generally do not.
> > Because of this some work that is best run in the task_works context
> > (guaranteed that no locks are held) cannot be queued to task_works from
> > kernel threads and so are queued to a (single) work_time to be managed
> > on a work queue.
> >
> > This means that any cost for doing the work is not imposed on the kernel
> > thread, and importantly excessive amounts of work cannot apply
> > back-pressure to reduce the amount of new work queued.
> >
> > I have evidence from a customer site when nfsd (which runs as kernel
> > threads) is being asked to modify many millions of files which causes
> > sufficient memory pressure that some cache (in XFS I think) gets cleaned
> > earlier than would be ideal. When __dput (from the workqueue) calls
> > __dentry_kill, xfs_fs_destroy_inode() needs to synchronously read back
> > previously cached info from storage. This slows down the single thread
> > that is making all the final __dput() calls for all the nfsd threads
> > with the net result that files are added to the delayed_fput_list faster
> > than they are removed, and the system eventually runs out of memory.
> >
> > This happens because there is no back-pressure: the nfsd isn't forced to
> > slow down when __dput() is slow for any reason. To fix this we can
> > change the nfsd threads to call task_work_run() regularly (much like
> > user-space processes do) and allow it to declare this so that work does
> > get queued to task_works rather than to a work queue.
> >
> > This patch adds a new process flag PF_RUNS_TASK_WORK which is now used
> > instead of PF_KTHREAD to determine whether it is sensible to queue
> > something to task_works. This flag is always set for non-kernel threads.
> >
> > task_work_run() is also exported so that it can be called from a module
> > such as nfsd.
> >
> > Signed-off-by: NeilBrown <[email protected]>
> > ---
>
> The thing that bugs me the most about this is that we expose task work
> infrastructure to modules which I think is a really bad idea. File
> handling code brings so many driver to their knees and now we're handing
> them another footgun.
>
> I'm not per se opposed to all of this but is this really what the other
> NFS maintainers want to switch to as well? And is this really that badly
> needed and that common that we want to go down that road? I wouldn't
> mind not having to do all this if we can get by via other means.
The problem of slow flushing during close is not limited to XFS or
any particular underlying file system. Sometimes it is due to
performance or scalability bugs, but sometimes it's just a slow
storage stack on the NFS server (eg NFS re-export).
One slow synchronous flush in a single-threaded queue will result
in head-of-queue blocking. That is something that needs to be
addressed (IMHO, first).
Adding back pressure on NFS clients when NFSD is not able to get
dirty data onto durable storage fast enough is a long term solution,
but it's probably a heavier lift. I'm not wedded to using task_work
to do that, but it does seem to fit the problem at hand.
> > fs/file_table.c | 3 ++-
> > fs/namespace.c | 2 +-
> > include/linux/sched.h | 2 +-
> > kernel/fork.c | 2 ++
> > kernel/task_work.c | 1 +
> > 5 files changed, 7 insertions(+), 3 deletions(-)
> >
> > diff --git a/fs/file_table.c b/fs/file_table.c
> > index ee21b3da9d08..d36cade6e366 100644
> > --- a/fs/file_table.c
> > +++ b/fs/file_table.c
> > @@ -435,7 +435,8 @@ void fput(struct file *file)
> > if (atomic_long_dec_and_test(&file->f_count)) {
> > struct task_struct *task = current;
> >
> > - if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) {
> > + if (likely(!in_interrupt() &&
> > + (task->flags & PF_RUNS_TASK_WORK))) {
> > init_task_work(&file->f_rcuhead, ____fput);
> > if (!task_work_add(task, &file->f_rcuhead, TWA_RESUME))
> > return;
> > diff --git a/fs/namespace.c b/fs/namespace.c
> > index e157efc54023..46d640b70ca9 100644
> > --- a/fs/namespace.c
> > +++ b/fs/namespace.c
> > @@ -1328,7 +1328,7 @@ static void mntput_no_expire(struct mount *mnt)
> >
> > if (likely(!(mnt->mnt.mnt_flags & MNT_INTERNAL))) {
> > struct task_struct *task = current;
> > - if (likely(!(task->flags & PF_KTHREAD))) {
> > + if (likely((task->flags & PF_RUNS_TASK_WORK))) {
> > init_task_work(&mnt->mnt_rcu, __cleanup_mnt);
> > if (!task_work_add(task, &mnt->mnt_rcu, TWA_RESUME))
> > return;
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 77f01ac385f7..e4eebac708e7 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1747,7 +1747,7 @@ extern struct pid *cad_pid;
> > * I am cleaning dirty pages from some other bdi. */
> > #define PF_KTHREAD 0x00200000 /* I am a kernel thread */
> > #define PF_RANDOMIZE 0x00400000 /* Randomize virtual address space */
> > -#define PF__HOLE__00800000 0x00800000
> > +#define PF_RUNS_TASK_WORK 0x00800000 /* Will call task_work_run() periodically */
>
> The flag seems better to me than just relying on exit_work as itt's
> easier to reason about.
>
> > #define PF__HOLE__01000000 0x01000000
> > #define PF__HOLE__02000000 0x02000000
> > #define PF_NO_SETAFFINITY 0x04000000 /* Userland is not allowed to meddle with cpus_mask */
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 3b6d20dfb9a8..d612d8f14861 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -2330,6 +2330,8 @@ __latent_entropy struct task_struct *copy_process(
> > p->flags &= ~PF_KTHREAD;
> > if (args->kthread)
> > p->flags |= PF_KTHREAD;
> > + else
> > + p->flags |= PF_RUNS_TASK_WORK;
> > if (args->user_worker) {
> > /*
> > * Mark us a user worker, and block any signal that isn't
> > diff --git a/kernel/task_work.c b/kernel/task_work.c
> > index 95a7e1b7f1da..aec19876e121 100644
> > --- a/kernel/task_work.c
> > +++ b/kernel/task_work.c
> > @@ -183,3 +183,4 @@ void task_work_run(void)
> > } while (work);
> > }
> > }
> > +EXPORT_SYMBOL(task_work_run);
> > --
> > 2.43.0
> >
--
Chuck Lever