Subject: [PATCH REPOST 0/8] kernel/fork: Move thread stack free otu of the scheduler path.

[ This is a repost of https://lkml.kernel.org/r/[email protected] ]

This is a follup-up on the patch
sched: Delay task stack freeing on RT
https://lkml.kernel.org/r/[email protected]

It addresses the review feedback:
- Decouple stack accounting from its free invocation. The accounting
happens in do_exit(), the final free call happens later.

- Add put_task_stack_sched() to finish_task_switch(). Here the VMAP
stack is cached only. If it fails, or in the !VMAP case then the final
free happens in delayed_put_task_struct(). This is also an oportunity
to cache the stack.

From testing I observe the following:

| bash-1715 [006] ..... 124.901510: copy_process: allocC ffffc90007e70000
| sh-cmds.sh-1746 [007] ..... 124.907389: copy_process: allocC ffffc90007dc4000
| <idle>-0 [019] ...1. 124.918126: free_thread_stack: cache ffffc90007dc4000
| sh-cmds.sh-1746 [007] ..... 124.918279: copy_process: allocC ffffc90007de8000
| <idle>-0 [004] ...1. 124.920121: free_thread_stack: delay ffffc90007de8001
| <idle>-0 [007] ...1. 124.920299: free_thread_stack: cache ffffc90007e70000
| <idle>-0 [007] ..s1. 124.945433: free_thread_stack: cache ffffc90007de8000

TS 124.901510, bash started sh-cmds.sh, obtained stack from cache.
TS 124.907389, script invokes its first command, obtained stacak from
cache. As you can see bash was running on CPU6 but its child was moved
CPU7.
TS 124.918126, the first command is done, stack is ached on CPU19.
TS 124.918279, script's second command, ache from stack.
TS 124.920121, the command is done. The stack cache on CPU4 is full.
TS 124.920299, the script is done, caches stack on CPU7.
TS 124.945433, the RCU-callback of last command is now happening. On
CPU7, which is where the command was invoked (but not running). Instead
of freeing the stack, it was cached since CPU7 had an empty slot.

If I pin the script to CPU5 and run it with multiple commands then it
works as expected:

| bash-1799 [005] ..... 993.608131: copy_process: allocC ffffc90007fa0000
| sh-cmds.sh-1827 [005] ..... 993.608888: copy_process: allocC ffffc90007fa8000
| sh-cmds.sh-1827 [005] ..... 993.610734: copy_process: allocV ffffc90007ff4000
| sh-cmds.sh-1829 [005] ...1. 993.610757: free_thread_stack: cache ffffc90007fa8000
| sh-cmds.sh-1827 [005] ..... 993.612401: copy_process: allocC ffffc90007fa8000
| <...>-1830 [005] ...1. 993.612416: free_thread_stack: cache ffffc90007ff4000
| sh-cmds.sh-1827 [005] ..... 993.613707: copy_process: allocC ffffc90007ff4000
| sh-cmds.sh-1831 [005] ...1. 993.613723: free_thread_stack: cache ffffc90007fa8000
| sh-cmds.sh-1827 [005] ..... 993.615024: copy_process: allocC ffffc90007fa8000
| <...>-1832 [005] ...1. 993.615040: free_thread_stack: cache ffffc90007ff4000
| sh-cmds.sh-1827 [005] ..... 993.616380: copy_process: allocC ffffc90007ff4000
| <...>-1833 [005] ...1. 993.616397: free_thread_stack: cache ffffc90007fa8000
| bash-1799 [005] ...1. 993.617759: free_thread_stack: cache ffffc90007fa0000
| <idle>-0 [005] ...1. 993.617871: free_thread_stack: delay ffffc90007ff4001
| <idle>-0 [005] ..s1. 993.638311: free_thread_stack: free ffffc90007ff4000

and no new is allocated during its runtime and a cached stack is used.

Sebastian



Subject: [PATCH 6/8] kernel/fork: Move task stack account to do_exit().

There is no need to perform the stack accounting of the outgoing task in
its final schedule() invocation which happens with disabled preemption.
The task is leaving, the resources will be freed and the accounting can
happen in do_exit() before the actual schedule invocation which
frees the stack memory.

Move the accounting of the stack memory from release_task_stack() to
exit_task_stack_account() which then can be invoked from do_exit().

Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---
include/linux/sched/task_stack.h | 2 ++
kernel/exit.c | 1 +
kernel/fork.c | 35 +++++++++++++++++++++-----------
3 files changed, 26 insertions(+), 12 deletions(-)

diff --git a/include/linux/sched/task_stack.h b/include/linux/sched/task_stack.h
index d10150587d819..892562ebbd3aa 100644
--- a/include/linux/sched/task_stack.h
+++ b/include/linux/sched/task_stack.h
@@ -79,6 +79,8 @@ static inline void *try_get_task_stack(struct task_struct *tsk)
static inline void put_task_stack(struct task_struct *tsk) {}
#endif

+void exit_task_stack_account(struct task_struct *tsk);
+
#define task_stack_end_corrupted(task) \
(*(end_of_stack(task)) != STACK_END_MAGIC)

diff --git a/kernel/exit.c b/kernel/exit.c
index b00a25bb4ab93..c303cffe7fdb4 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -845,6 +845,7 @@ void __noreturn do_exit(long code)
put_page(tsk->task_frag.page);

validate_creds_for_do_exit(tsk);
+ exit_task_stack_account(tsk);

check_stack_usage();
preempt_disable();
diff --git a/kernel/fork.c b/kernel/fork.c
index 73f644482e932..5f4e659a922e1 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -211,9 +211,8 @@ static int free_vm_stack_cache(unsigned int cpu)
return 0;
}

-static int memcg_charge_kernel_stack(struct task_struct *tsk)
+static int memcg_charge_kernel_stack(struct vm_struct *vm)
{
- struct vm_struct *vm = task_stack_vm_area(tsk);
int i;
int ret;

@@ -239,6 +238,7 @@ static int memcg_charge_kernel_stack(struct task_struct *tsk)

static int alloc_thread_stack_node(struct task_struct *tsk, int node)
{
+ struct vm_struct *vm;
void *stack;
int i;

@@ -256,7 +256,7 @@ static int alloc_thread_stack_node(struct task_struct *tsk, int node)
/* Clear stale pointers from reused stack. */
memset(s->addr, 0, THREAD_SIZE);

- if (memcg_charge_kernel_stack(tsk)) {
+ if (memcg_charge_kernel_stack(s)) {
vfree(s->addr);
return -ENOMEM;
}
@@ -279,7 +279,8 @@ static int alloc_thread_stack_node(struct task_struct *tsk, int node)
if (!stack)
return -ENOMEM;

- if (memcg_charge_kernel_stack(tsk)) {
+ vm = find_vm_area(stack);
+ if (memcg_charge_kernel_stack(vm)) {
vfree(stack);
return -ENOMEM;
}
@@ -288,19 +289,15 @@ static int alloc_thread_stack_node(struct task_struct *tsk, int node)
* free_thread_stack() can be called in interrupt context,
* so cache the vm_struct.
*/
- tsk->stack_vm_area = find_vm_area(stack);
+ tsk->stack_vm_area = vm;
tsk->stack = stack;
return 0;
}

static void free_thread_stack(struct task_struct *tsk)
{
- struct vm_struct *vm = task_stack_vm_area(tsk);
int i;

- for (i = 0; i < THREAD_SIZE / PAGE_SIZE; i++)
- memcg_kmem_uncharge_page(vm->pages[i], 0);
-
for (i = 0; i < NR_CACHED_STACKS; i++) {
if (this_cpu_cmpxchg(cached_stacks[i], NULL,
tsk->stack_vm_area) != NULL)
@@ -453,12 +450,25 @@ static void account_kernel_stack(struct task_struct *tsk, int account)
}
}

+void exit_task_stack_account(struct task_struct *tsk)
+{
+ account_kernel_stack(tsk, -1);
+
+ if (IS_ENABLED(CONFIG_VMAP_STACK)) {
+ struct vm_struct *vm;
+ int i;
+
+ vm = task_stack_vm_area(tsk);
+ for (i = 0; i < THREAD_SIZE / PAGE_SIZE; i++)
+ memcg_kmem_uncharge_page(vm->pages[i], 0);
+ }
+}
+
static void release_task_stack(struct task_struct *tsk)
{
if (WARN_ON(READ_ONCE(tsk->__state) != TASK_DEAD))
return; /* Better to leak the stack than to free prematurely */

- account_kernel_stack(tsk, -1);
free_thread_stack(tsk);
}

@@ -917,6 +927,7 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
#ifdef CONFIG_THREAD_INFO_IN_TASK
refcount_set(&tsk->stack_refcount, 1);
#endif
+ account_kernel_stack(tsk, 1);

err = scs_prepare(tsk, node);
if (err)
@@ -960,8 +971,6 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
tsk->wake_q.next = NULL;
tsk->worker_private = NULL;

- account_kernel_stack(tsk, 1);
-
kcov_task_init(tsk);
kmap_local_fork(tsk);

@@ -980,6 +989,7 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
return tsk;

free_stack:
+ exit_task_stack_account(tsk);
free_thread_stack(tsk);
free_tsk:
free_task_struct(tsk);
@@ -2448,6 +2458,7 @@ static __latent_entropy struct task_struct *copy_process(
exit_creds(p);
bad_fork_free:
WRITE_ONCE(p->__state, TASK_DEAD);
+ exit_task_stack_account(p);
put_task_stack(p);
delayed_free_task(p);
fork_out:
--
2.34.1

Subject: [PATCH 3/8] kernel/fork, IA64: Provide a alloc_thread_stack_node() for IA64.

Provide a generic alloc_thread_stack_node() for IA64/
CONFIG_ARCH_THREAD_STACK_ALLOCATOR which returns stack pointer and sets
task_struct::stack so it behaves exactly like the other implementations.

Rename IA64's alloc_thread_stack_node() and add the generic version to
the fork code so it is in one place _and_ to drastically lower chances
of fat fingering the IA64 code.
Do the same for free_thread_stack().

Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---
arch/ia64/include/asm/thread_info.h | 6 +++---
kernel/fork.c | 16 ++++++++++++++++
2 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/arch/ia64/include/asm/thread_info.h b/arch/ia64/include/asm/thread_info.h
index 51d20cb377062..1684716f08201 100644
--- a/arch/ia64/include/asm/thread_info.h
+++ b/arch/ia64/include/asm/thread_info.h
@@ -55,15 +55,15 @@ struct thread_info {
#ifndef ASM_OFFSETS_C
/* how to get the thread information struct from C */
#define current_thread_info() ((struct thread_info *) ((char *) current + IA64_TASK_SIZE))
-#define alloc_thread_stack_node(tsk, node) \
+#define arch_alloc_thread_stack_node(tsk, node) \
((unsigned long *) ((char *) (tsk) + IA64_TASK_SIZE))
#define task_thread_info(tsk) ((struct thread_info *) ((char *) (tsk) + IA64_TASK_SIZE))
#else
#define current_thread_info() ((struct thread_info *) 0)
-#define alloc_thread_stack_node(tsk, node) ((unsigned long *) 0)
+#define arch_alloc_thread_stack_node(tsk, node) ((unsigned long *) 0)
#define task_thread_info(tsk) ((struct thread_info *) 0)
#endif
-#define free_thread_stack(tsk) /* nothing */
+#define arch_free_thread_stack(tsk) /* nothing */
#define task_stack_page(tsk) ((void *)(tsk))

#define __HAVE_THREAD_FUNCTIONS
diff --git a/kernel/fork.c b/kernel/fork.c
index c47dcba5d66d2..a0d58ae6fac76 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -330,6 +330,22 @@ void thread_stack_cache_init(void)
}

# endif /* THREAD_SIZE >= PAGE_SIZE || defined(CONFIG_VMAP_STACK) */
+#else /* CONFIG_ARCH_THREAD_STACK_ALLOCATOR */
+
+static unsigned long *alloc_thread_stack_node(struct task_struct *tsk, int node)
+{
+ unsigned long *stack;
+
+ stack = arch_alloc_thread_stack_node(tsk, node);
+ tsk->stack = stack;
+ return stack;
+}
+
+static void free_thread_stack(struct task_struct *tsk, bool cache_only)
+{
+ arch_free_thread_stack(tsk);
+}
+
#endif /* !CONFIG_ARCH_THREAD_STACK_ALLOCATOR */

/* SLAB cache for signal_struct structures (tsk->signal) */
--
2.34.1

Subject: Re: [PATCH REPOST 0/8] kernel/fork: Move thread stack free otu of the scheduler path.

On 2022-01-25 16:26:44 [+0100], To [email protected] wrote:
> [ This is a repost of https://lkml.kernel.org/r/[email protected] ]
>
> This is a follup-up on the patch
> sched: Delay task stack freeing on RT
> https://lkml.kernel.org/r/[email protected]
>
> It addresses the review feedback:
> - Decouple stack accounting from its free invocation. The accounting
> happens in do_exit(), the final free call happens later.
>
> - Add put_task_stack_sched() to finish_task_switch(). Here the VMAP
> stack is cached only. If it fails, or in the !VMAP case then the final
> free happens in delayed_put_task_struct(). This is also an oportunity
> to cache the stack.
>
> >From testing I observe the following:
>
> | bash-1715 [006] ..... 124.901510: copy_process: allocC ffffc90007e70000
> | sh-cmds.sh-1746 [007] ..... 124.907389: copy_process: allocC ffffc90007dc4000
> | <idle>-0 [019] ...1. 124.918126: free_thread_stack: cache ffffc90007dc4000
> | sh-cmds.sh-1746 [007] ..... 124.918279: copy_process: allocC ffffc90007de8000
> | <idle>-0 [004] ...1. 124.920121: free_thread_stack: delay ffffc90007de8001
> | <idle>-0 [007] ...1. 124.920299: free_thread_stack: cache ffffc90007e70000
> | <idle>-0 [007] ..s1. 124.945433: free_thread_stack: cache ffffc90007de8000
>
> TS 124.901510, bash started sh-cmds.sh, obtained stack from cache.
> TS 124.907389, script invokes its first command, obtained stacak from
> cache. As you can see bash was running on CPU6 but its child was moved
> CPU7.
> TS 124.918126, the first command is done, stack is ached on CPU19.
> TS 124.918279, script's second command, ache from stack.
> TS 124.920121, the command is done. The stack cache on CPU4 is full.
> TS 124.920299, the script is done, caches stack on CPU7.
> TS 124.945433, the RCU-callback of last command is now happening. On
> CPU7, which is where the command was invoked (but not running). Instead
> of freeing the stack, it was cached since CPU7 had an empty slot.
>
> If I pin the script to CPU5 and run it with multiple commands then it
> works as expected:
>
> | bash-1799 [005] ..... 993.608131: copy_process: allocC ffffc90007fa0000
> | sh-cmds.sh-1827 [005] ..... 993.608888: copy_process: allocC ffffc90007fa8000
> | sh-cmds.sh-1827 [005] ..... 993.610734: copy_process: allocV ffffc90007ff4000
> | sh-cmds.sh-1829 [005] ...1. 993.610757: free_thread_stack: cache ffffc90007fa8000
> | sh-cmds.sh-1827 [005] ..... 993.612401: copy_process: allocC ffffc90007fa8000
> | <...>-1830 [005] ...1. 993.612416: free_thread_stack: cache ffffc90007ff4000
> | sh-cmds.sh-1827 [005] ..... 993.613707: copy_process: allocC ffffc90007ff4000
> | sh-cmds.sh-1831 [005] ...1. 993.613723: free_thread_stack: cache ffffc90007fa8000
> | sh-cmds.sh-1827 [005] ..... 993.615024: copy_process: allocC ffffc90007fa8000
> | <...>-1832 [005] ...1. 993.615040: free_thread_stack: cache ffffc90007ff4000
> | sh-cmds.sh-1827 [005] ..... 993.616380: copy_process: allocC ffffc90007ff4000
> | <...>-1833 [005] ...1. 993.616397: free_thread_stack: cache ffffc90007fa8000
> | bash-1799 [005] ...1. 993.617759: free_thread_stack: cache ffffc90007fa0000
> | <idle>-0 [005] ...1. 993.617871: free_thread_stack: delay ffffc90007ff4001
> | <idle>-0 [005] ..s1. 993.638311: free_thread_stack: free ffffc90007ff4000
>
> and no new is allocated during its runtime and a cached stack is used.

ping

Sebastian

2022-02-13 14:17:55

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 6/8] kernel/fork: Move task stack account to do_exit().

On 1/25/22 07:26, Sebastian Andrzej Siewior wrote:
> There is no need to perform the stack accounting of the outgoing task in
> its final schedule() invocation which happens with disabled preemption.
> The task is leaving, the resources will be freed and the accounting can
> happen in do_exit() before the actual schedule invocation which
> frees the stack memory.
>
> Move the accounting of the stack memory from release_task_stack() to
> exit_task_stack_account() which then can be invoked from do_exit().

Seems reasonable.

Subject: Re: [PATCH 3/8] kernel/fork, IA64: Provide a alloc_thread_stack_node() for IA64.

On 2022-01-25 16:26:47 [+0100], To [email protected] wrote:
> diff --git a/kernel/fork.c b/kernel/fork.c
> index c47dcba5d66d2..a0d58ae6fac76 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -330,6 +330,22 @@ void thread_stack_cache_init(void)

>
> +static void free_thread_stack(struct task_struct *tsk, bool cache_only)

This cache_only parameter shouldn't be here…

> +{
> + arch_free_thread_stack(tsk);
> +}
> +
> #endif /* !CONFIG_ARCH_THREAD_STACK_ALLOCATOR */
>
> /* SLAB cache for signal_struct structures (tsk->signal) */

Sebastian