2024-03-11 16:46:54

by Pasha Tatashin

[permalink] [raw]
Subject: [RFC 00/14] Dynamic Kernel Stacks

This is a follow-up to the LSF/MM proposal [1]. Please provide your
thoughts and comments about the dynamic kernel stacks feature. This is a
WIP: it has not been tested beyond booting on some machines and running
the LKDTM thread-exhaustion tests. The series also lacks selftests and
documentation.

This feature allows the kernel stack to grow dynamically, from 4KiB up
to THREAD_SIZE. The intent is to save memory on fleet machines. Initial
experiments show that it saves, on average, 70-75% of the kernel stack
memory.

The average stack depth of a kernel thread depends on the workload,
profiling, virtualization, compiler optimizations, and driver
implementations. However, the table below shows the amount of kernel
stack memory before vs. after on freshly booted, idling machines:

CPU            #Cores  #Stacks  BASE(kb)  Dynamic(kb)  Saving
AMD Genoa         384     5786     92576        23388  74.74%
Intel Skylake     112     3182     50912        12860  74.74%
AMD Rome          128     3401     54416        14784  72.83%
AMD Rome          256     4908     78528        20876  73.42%
Intel Haswell      72     2644     42304        10624  74.89%
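
(For reference, BASE above is simply #Stacks * 16 KiB, the current
THREAD_SIZE, while the dynamic numbers average out to roughly 4.0-4.3 KiB
per stack, i.e. almost all idle stacks never grow beyond their single
pre-allocated 4 KiB page; that is where the ~73-75% saving comes from.)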

Some workloads that have millions of threads could benefit significantly
from this feature.

[1] https://lore.kernel.org/all/CA+CK2bBYt9RAVqASB2eLyRQxYT5aiL0fGhUu3TumQCyJCNTWvw@mail.gmail.com

Pasha Tatashin (14):
task_stack.h: remove obsolete __HAVE_ARCH_KSTACK_END check
fork: Clean-up ifdef logic around stack allocation
fork: Clean-up naming of vm_stack/vm_struct variables in vmap stacks
code
fork: Remove assumption that vm_area->nr_pages equals to THREAD_SIZE
fork: check charging success before zeroing stack
fork: zero vmap stack using clear_page() instead of memset()
fork: use the first page in stack to store vm_stack in cached_stacks
fork: separate vmap stack allocation and free calls
mm/vmalloc: Add a get_vm_area_node() and vmap_pages_range_noflush()
public functions
fork: Dynamic Kernel Stacks
x86: add support for Dynamic Kernel Stacks
task_stack.h: Clean-up stack_not_used() implementation
task_stack.h: Add stack_not_used() support for dynamic stack
fork: Dynamic Kernel Stack accounting

arch/Kconfig | 33 +++
arch/x86/Kconfig | 1 +
arch/x86/kernel/traps.c | 3 +
arch/x86/mm/fault.c | 3 +
include/linux/mmzone.h | 3 +
include/linux/sched.h | 2 +-
include/linux/sched/task_stack.h | 94 ++++++--
include/linux/vmalloc.h | 15 ++
kernel/fork.c | 388 ++++++++++++++++++++++++++-----
kernel/sched/core.c | 1 +
mm/internal.h | 9 -
mm/vmalloc.c | 24 ++
mm/vmstat.c | 3 +
13 files changed, 487 insertions(+), 92 deletions(-)

--
2.44.0.278.ge034bb2e1d-goog



2024-03-11 16:47:09

by Pasha Tatashin

[permalink] [raw]
Subject: [RFC 02/14] fork: Clean-up ifdef logic around stack allocation

There is an unneeded OR in the ifdef logic that selects between the
direct-map and vmap based kernel stack allocation and free functions.
Adding dynamic stack support would complicate this logic even further.

Therefore, clean up by changing the order of the ifdefs so the OR is no
longer needed.

Signed-off-by: Pasha Tatashin <[email protected]>
---
kernel/fork.c | 22 +++++++++++-----------
1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 0d944e92a43f..32600bf2422a 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -179,13 +179,7 @@ static inline void free_task_struct(struct task_struct *tsk)
kmem_cache_free(task_struct_cachep, tsk);
}

-/*
- * Allocate pages if THREAD_SIZE is >= PAGE_SIZE, otherwise use a
- * kmemcache based allocator.
- */
-# if THREAD_SIZE >= PAGE_SIZE || defined(CONFIG_VMAP_STACK)
-
-# ifdef CONFIG_VMAP_STACK
+#ifdef CONFIG_VMAP_STACK
/*
* vmalloc() is a bit slow, and calling vfree() enough times will force a TLB
* flush. Try to minimize the number of calls by caching stacks.
@@ -337,7 +331,13 @@ static void free_thread_stack(struct task_struct *tsk)
tsk->stack_vm_area = NULL;
}

-# else /* !CONFIG_VMAP_STACK */
+#else /* !CONFIG_VMAP_STACK */
+
+/*
+ * Allocate pages if THREAD_SIZE is >= PAGE_SIZE, otherwise use a
+ * kmemcache based allocator.
+ */
+#if THREAD_SIZE >= PAGE_SIZE

static void thread_stack_free_rcu(struct rcu_head *rh)
{
@@ -369,8 +369,7 @@ static void free_thread_stack(struct task_struct *tsk)
tsk->stack = NULL;
}

-# endif /* CONFIG_VMAP_STACK */
-# else /* !(THREAD_SIZE >= PAGE_SIZE || defined(CONFIG_VMAP_STACK)) */
+#else /* !(THREAD_SIZE >= PAGE_SIZE) */

static struct kmem_cache *thread_stack_cache;

@@ -409,7 +408,8 @@ void thread_stack_cache_init(void)
BUG_ON(thread_stack_cache == NULL);
}

-# endif /* THREAD_SIZE >= PAGE_SIZE || defined(CONFIG_VMAP_STACK) */
+#endif /* THREAD_SIZE >= PAGE_SIZE */
+#endif /* CONFIG_VMAP_STACK */

/* SLAB cache for signal_struct structures (tsk->signal) */
static struct kmem_cache *signal_cachep;
--
2.44.0.278.ge034bb2e1d-goog


2024-03-11 16:47:11

by Pasha Tatashin

[permalink] [raw]
Subject: [RFC 01/14] task_stack.h: remove obsolete __HAVE_ARCH_KSTACK_END check

Remove __HAVE_ARCH_KSTACK_END as it has been obsolete since the removal
of the metag architecture in v4.17.

Signed-off-by: Pasha Tatashin <[email protected]>
---
include/linux/sched/task_stack.h | 2 --
1 file changed, 2 deletions(-)

diff --git a/include/linux/sched/task_stack.h b/include/linux/sched/task_stack.h
index ccd72b978e1f..860faea06883 100644
--- a/include/linux/sched/task_stack.h
+++ b/include/linux/sched/task_stack.h
@@ -116,7 +116,6 @@ static inline unsigned long stack_not_used(struct task_struct *p)
#endif
extern void set_task_stack_end_magic(struct task_struct *tsk);

-#ifndef __HAVE_ARCH_KSTACK_END
static inline int kstack_end(void *addr)
{
/* Reliable end of stack detection:
@@ -124,6 +123,5 @@ static inline int kstack_end(void *addr)
*/
return !(((unsigned long)addr+sizeof(void*)-1) & (THREAD_SIZE-sizeof(void*)));
}
-#endif

#endif /* _LINUX_SCHED_TASK_STACK_H */
--
2.44.0.278.ge034bb2e1d-goog


2024-03-11 16:47:23

by Pasha Tatashin

[permalink] [raw]
Subject: [RFC 03/14] fork: Clean-up naming of vm_stack/vm_struct variables in vmap stacks code

There are two data types, "struct vm_struct" and "struct vm_stack", whose
local variables share the same names (vm_stack, vm, or s), which makes
the code confusing to read.

Change the code so the naming is consistent:

struct vm_struct is always called vm_area
struct vm_stack is always called vm_stack

Signed-off-by: Pasha Tatashin <[email protected]>
---
kernel/fork.c | 38 ++++++++++++++++++--------------------
1 file changed, 18 insertions(+), 20 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 32600bf2422a..60e812825a7a 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -192,12 +192,12 @@ struct vm_stack {
struct vm_struct *stack_vm_area;
};

-static bool try_release_thread_stack_to_cache(struct vm_struct *vm)
+static bool try_release_thread_stack_to_cache(struct vm_struct *vm_area)
{
unsigned int i;

for (i = 0; i < NR_CACHED_STACKS; i++) {
- if (this_cpu_cmpxchg(cached_stacks[i], NULL, vm) != NULL)
+ if (this_cpu_cmpxchg(cached_stacks[i], NULL, vm_area) != NULL)
continue;
return true;
}
@@ -207,11 +207,12 @@ static bool try_release_thread_stack_to_cache(struct vm_struct *vm)
static void thread_stack_free_rcu(struct rcu_head *rh)
{
struct vm_stack *vm_stack = container_of(rh, struct vm_stack, rcu);
+ struct vm_struct *vm_area = vm_stack->stack_vm_area;

if (try_release_thread_stack_to_cache(vm_stack->stack_vm_area))
return;

- vfree(vm_stack);
+ vfree(vm_area->addr);
}

static void thread_stack_delayed_free(struct task_struct *tsk)
@@ -228,12 +229,12 @@ static int free_vm_stack_cache(unsigned int cpu)
int i;

for (i = 0; i < NR_CACHED_STACKS; i++) {
- struct vm_struct *vm_stack = cached_vm_stacks[i];
+ struct vm_struct *vm_area = cached_vm_stacks[i];

- if (!vm_stack)
+ if (!vm_area)
continue;

- vfree(vm_stack->addr);
+ vfree(vm_area->addr);
cached_vm_stacks[i] = NULL;
}

@@ -263,32 +264,29 @@ static int memcg_charge_kernel_stack(struct vm_struct *vm)

static int alloc_thread_stack_node(struct task_struct *tsk, int node)
{
- struct vm_struct *vm;
+ struct vm_struct *vm_area;
void *stack;
int i;

for (i = 0; i < NR_CACHED_STACKS; i++) {
- struct vm_struct *s;
-
- s = this_cpu_xchg(cached_stacks[i], NULL);
-
- if (!s)
+ vm_area = this_cpu_xchg(cached_stacks[i], NULL);
+ if (!vm_area)
continue;

/* Reset stack metadata. */
- kasan_unpoison_range(s->addr, THREAD_SIZE);
+ kasan_unpoison_range(vm_area->addr, THREAD_SIZE);

- stack = kasan_reset_tag(s->addr);
+ stack = kasan_reset_tag(vm_area->addr);

/* Clear stale pointers from reused stack. */
memset(stack, 0, THREAD_SIZE);

- if (memcg_charge_kernel_stack(s)) {
- vfree(s->addr);
+ if (memcg_charge_kernel_stack(vm_area)) {
+ vfree(vm_area->addr);
return -ENOMEM;
}

- tsk->stack_vm_area = s;
+ tsk->stack_vm_area = vm_area;
tsk->stack = stack;
return 0;
}
@@ -306,8 +304,8 @@ static int alloc_thread_stack_node(struct task_struct *tsk, int node)
if (!stack)
return -ENOMEM;

- vm = find_vm_area(stack);
- if (memcg_charge_kernel_stack(vm)) {
+ vm_area = find_vm_area(stack);
+ if (memcg_charge_kernel_stack(vm_area)) {
vfree(stack);
return -ENOMEM;
}
@@ -316,7 +314,7 @@ static int alloc_thread_stack_node(struct task_struct *tsk, int node)
* free_thread_stack() can be called in interrupt context,
* so cache the vm_struct.
*/
- tsk->stack_vm_area = vm;
+ tsk->stack_vm_area = vm_area;
stack = kasan_reset_tag(stack);
tsk->stack = stack;
return 0;
--
2.44.0.278.ge034bb2e1d-goog


2024-03-11 16:47:36

by Pasha Tatashin

[permalink] [raw]
Subject: [RFC 04/14] fork: Remove assumption that vm_area->nr_pages equals to THREAD_SIZE

In many places the number of pages in the stack is determined via
(THREAD_SIZE / PAGE_SIZE). There is also a BUG_ON() that ensures that
(THREAD_SIZE / PAGE_SIZE) indeed equals vm_area->nr_pages.

However, with dynamic stacks the number of pages in vm_area will grow
with the stack; therefore, use vm_area->nr_pages to determine the actual
number of pages allocated for the stack.

Signed-off-by: Pasha Tatashin <[email protected]>
---
kernel/fork.c | 18 +++++++++---------
1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 60e812825a7a..a35f4008afa0 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -243,13 +243,11 @@ static int free_vm_stack_cache(unsigned int cpu)

static int memcg_charge_kernel_stack(struct vm_struct *vm)
{
- int i;
- int ret;
+ int i, ret, nr_pages;
int nr_charged = 0;

- BUG_ON(vm->nr_pages != THREAD_SIZE / PAGE_SIZE);
-
- for (i = 0; i < THREAD_SIZE / PAGE_SIZE; i++) {
+ nr_pages = vm->nr_pages;
+ for (i = 0; i < nr_pages; i++) {
ret = memcg_kmem_charge_page(vm->pages[i], GFP_KERNEL, 0);
if (ret)
goto err;
@@ -531,9 +529,10 @@ static void account_kernel_stack(struct task_struct *tsk, int account)
{
if (IS_ENABLED(CONFIG_VMAP_STACK)) {
struct vm_struct *vm = task_stack_vm_area(tsk);
- int i;
+ int i, nr_pages;

- for (i = 0; i < THREAD_SIZE / PAGE_SIZE; i++)
+ nr_pages = vm->nr_pages;
+ for (i = 0; i < nr_pages; i++)
mod_lruvec_page_state(vm->pages[i], NR_KERNEL_STACK_KB,
account * (PAGE_SIZE / 1024));
} else {
@@ -551,10 +550,11 @@ void exit_task_stack_account(struct task_struct *tsk)

if (IS_ENABLED(CONFIG_VMAP_STACK)) {
struct vm_struct *vm;
- int i;
+ int i, nr_pages;

vm = task_stack_vm_area(tsk);
- for (i = 0; i < THREAD_SIZE / PAGE_SIZE; i++)
+ nr_pages = vm->nr_pages;
+ for (i = 0; i < nr_pages; i++)
memcg_kmem_uncharge_page(vm->pages[i], 0);
}
}
--
2.44.0.278.ge034bb2e1d-goog


2024-03-11 16:47:56

by Pasha Tatashin

[permalink] [raw]
Subject: [RFC 06/14] fork: zero vmap stack using clear_page() instead of memset()

In preparation for dynamic kernel stacks, do not zero the whole span of
the stack, but instead only the pages that are part of the vm_area.

This is because with dynamic stacks we might have only partially
populated stacks.

Signed-off-by: Pasha Tatashin <[email protected]>
---
kernel/fork.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 6a2f2c85e09f..41e0baee79d2 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -263,8 +263,8 @@ static int memcg_charge_kernel_stack(struct vm_struct *vm)
static int alloc_thread_stack_node(struct task_struct *tsk, int node)
{
struct vm_struct *vm_area;
+ int i, j, nr_pages;
void *stack;
- int i;

for (i = 0; i < NR_CACHED_STACKS; i++) {
vm_area = this_cpu_xchg(cached_stacks[i], NULL);
@@ -282,7 +282,9 @@ static int alloc_thread_stack_node(struct task_struct *tsk, int node)
stack = kasan_reset_tag(vm_area->addr);

/* Clear stale pointers from reused stack. */
- memset(stack, 0, THREAD_SIZE);
+ nr_pages = vm_area->nr_pages;
+ for (j = 0; j < nr_pages; j++)
+ clear_page(page_address(vm_area->pages[j]));

tsk->stack_vm_area = vm_area;
tsk->stack = stack;
--
2.44.0.278.ge034bb2e1d-goog


2024-03-11 16:48:04

by Pasha Tatashin

[permalink] [raw]
Subject: [RFC 05/14] fork: check charging success before zeroing stack

There is no need to zero the cached stack if the memcg charge fails, so
move the charging attempt before the memset operation.

Signed-off-by: Pasha Tatashin <[email protected]>
---
kernel/fork.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index a35f4008afa0..6a2f2c85e09f 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -271,6 +271,11 @@ static int alloc_thread_stack_node(struct task_struct *tsk, int node)
if (!vm_area)
continue;

+ if (memcg_charge_kernel_stack(vm_area)) {
+ vfree(vm_area->addr);
+ return -ENOMEM;
+ }
+
/* Reset stack metadata. */
kasan_unpoison_range(vm_area->addr, THREAD_SIZE);

@@ -279,11 +284,6 @@ static int alloc_thread_stack_node(struct task_struct *tsk, int node)
/* Clear stale pointers from reused stack. */
memset(stack, 0, THREAD_SIZE);

- if (memcg_charge_kernel_stack(vm_area)) {
- vfree(vm_area->addr);
- return -ENOMEM;
- }
-
tsk->stack_vm_area = vm_area;
tsk->stack = stack;
return 0;
--
2.44.0.278.ge034bb2e1d-goog


2024-03-11 16:48:23

by Pasha Tatashin

[permalink] [raw]
Subject: [RFC 07/14] fork: use the first page in stack to store vm_stack in cached_stacks

vmap stacks are stored in the per-cpu cached_stacks array in order to
reduce the number of allocation and free calls. However, the stacks are
stored using the bottom address of the stack. Since stacks normally grow
down, this is a problem with dynamic stacks, as the lower pages might not
even be allocated. Instead, use the first available page from the
vm_area.

Signed-off-by: Pasha Tatashin <[email protected]>
---
kernel/fork.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 41e0baee79d2..3004e6ce6c65 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -217,9 +217,10 @@ static void thread_stack_free_rcu(struct rcu_head *rh)

static void thread_stack_delayed_free(struct task_struct *tsk)
{
- struct vm_stack *vm_stack = tsk->stack;
+ struct vm_struct *vm_area = tsk->stack_vm_area;
+ struct vm_stack *vm_stack = page_address(vm_area->pages[0]);

- vm_stack->stack_vm_area = tsk->stack_vm_area;
+ vm_stack->stack_vm_area = vm_area;
call_rcu(&vm_stack->rcu, thread_stack_free_rcu);
}

--
2.44.0.278.ge034bb2e1d-goog


2024-03-11 16:48:32

by Pasha Tatashin

[permalink] [raw]
Subject: [RFC 08/14] fork: separate vmap stack allocation and free calls

In preparation for the dynamic stacks, separate out the
__vmalloc_node_range and vfree calls from the vmap based stack
allocations. The dynamic stacks will use their own variants of these
functions.

Signed-off-by: Pasha Tatashin <[email protected]>
---
kernel/fork.c | 53 ++++++++++++++++++++++++++++++---------------------
1 file changed, 31 insertions(+), 22 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 3004e6ce6c65..bbae5f705773 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -204,6 +204,29 @@ static bool try_release_thread_stack_to_cache(struct vm_struct *vm_area)
return false;
}

+static inline struct vm_struct *alloc_vmap_stack(int node)
+{
+ void *stack;
+
+ /*
+ * Allocated stacks are cached and later reused by new threads,
+ * so memcg accounting is performed manually on assigning/releasing
+ * stacks to tasks. Drop __GFP_ACCOUNT.
+ */
+ stack = __vmalloc_node_range(THREAD_SIZE, THREAD_ALIGN,
+ VMALLOC_START, VMALLOC_END,
+ THREADINFO_GFP & ~__GFP_ACCOUNT,
+ PAGE_KERNEL,
+ 0, node, __builtin_return_address(0));
+
+ return (stack) ? find_vm_area(stack) : NULL;
+}
+
+static inline void free_vmap_stack(struct vm_struct *vm_area)
+{
+ vfree(vm_area->addr);
+}
+
static void thread_stack_free_rcu(struct rcu_head *rh)
{
struct vm_stack *vm_stack = container_of(rh, struct vm_stack, rcu);
@@ -212,7 +235,7 @@ static void thread_stack_free_rcu(struct rcu_head *rh)
if (try_release_thread_stack_to_cache(vm_stack->stack_vm_area))
return;

- vfree(vm_area->addr);
+ free_vmap_stack(vm_area);
}

static void thread_stack_delayed_free(struct task_struct *tsk)
@@ -235,7 +258,7 @@ static int free_vm_stack_cache(unsigned int cpu)
if (!vm_area)
continue;

- vfree(vm_area->addr);
+ free_vmap_stack(vm_area);
cached_vm_stacks[i] = NULL;
}

@@ -265,7 +288,6 @@ static int alloc_thread_stack_node(struct task_struct *tsk, int node)
{
struct vm_struct *vm_area;
int i, j, nr_pages;
- void *stack;

for (i = 0; i < NR_CACHED_STACKS; i++) {
vm_area = this_cpu_xchg(cached_stacks[i], NULL);
@@ -273,14 +295,13 @@ static int alloc_thread_stack_node(struct task_struct *tsk, int node)
continue;

if (memcg_charge_kernel_stack(vm_area)) {
- vfree(vm_area->addr);
+ free_vmap_stack(vm_area);
return -ENOMEM;
}

/* Reset stack metadata. */
kasan_unpoison_range(vm_area->addr, THREAD_SIZE);
-
- stack = kasan_reset_tag(vm_area->addr);
+ tsk->stack = kasan_reset_tag(vm_area->addr);

/* Clear stale pointers from reused stack. */
nr_pages = vm_area->nr_pages;
@@ -288,26 +309,15 @@ static int alloc_thread_stack_node(struct task_struct *tsk, int node)
clear_page(page_address(vm_area->pages[j]));

tsk->stack_vm_area = vm_area;
- tsk->stack = stack;
return 0;
}

- /*
- * Allocated stacks are cached and later reused by new threads,
- * so memcg accounting is performed manually on assigning/releasing
- * stacks to tasks. Drop __GFP_ACCOUNT.
- */
- stack = __vmalloc_node_range(THREAD_SIZE, THREAD_ALIGN,
- VMALLOC_START, VMALLOC_END,
- THREADINFO_GFP & ~__GFP_ACCOUNT,
- PAGE_KERNEL,
- 0, node, __builtin_return_address(0));
- if (!stack)
+ vm_area = alloc_vmap_stack(node);
+ if (!vm_area)
return -ENOMEM;

- vm_area = find_vm_area(stack);
if (memcg_charge_kernel_stack(vm_area)) {
- vfree(stack);
+ free_vmap_stack(vm_area);
return -ENOMEM;
}
/*
@@ -316,8 +326,7 @@ static int alloc_thread_stack_node(struct task_struct *tsk, int node)
* so cache the vm_struct.
*/
tsk->stack_vm_area = vm_area;
- stack = kasan_reset_tag(stack);
- tsk->stack = stack;
+ tsk->stack = kasan_reset_tag(vm_area->addr);
return 0;
}

--
2.44.0.278.ge034bb2e1d-goog


2024-03-11 16:48:48

by Pasha Tatashin

[permalink] [raw]
Subject: [RFC 09/14] mm/vmalloc: Add a get_vm_area_node() and vmap_pages_range_noflush() public functions

get_vm_area_node()
Unlike the other public get_vm_area_* variants, this one accepts the node
from which to allocate the area data structure, and also an align
argument, which allows creating a vm area with a specific alignment.

This call is going to be used by dynamic stacks in order to ensure that
the stack VM area has a specific alignment, and that even if there is
only one page mapped, no page table allocations are going to be needed
to map the other stack pages.

vmap_pages_range_noflush()
This is already a global function, but it was exported through
mm/internal.h. Since we will need it from kernel/fork.c in order to map
the initial stack pages, move the forward declaration of this function
to the linux/vmalloc.h header.
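
For example, with THREAD_SIZE = 16 KiB and THREAD_ALIGN = 16 KiB on
x86-64, an aligned stack can never cross a 2 MiB boundary, so all four
of its PTEs live in the same last-level page table; mapping the
pre-allocated page allocates that table, and later stack faults only
need to install a PTE, never a page table page.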

Signed-off-by: Pasha Tatashin <[email protected]>
---
include/linux/vmalloc.h | 15 +++++++++++++++
mm/internal.h | 9 ---------
mm/vmalloc.c | 24 ++++++++++++++++++++++++
3 files changed, 39 insertions(+), 9 deletions(-)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index c720be70c8dd..e18b6ab1584b 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -210,6 +210,9 @@ extern struct vm_struct *__get_vm_area_caller(unsigned long size,
unsigned long flags,
unsigned long start, unsigned long end,
const void *caller);
+struct vm_struct *get_vm_area_node(unsigned long size, unsigned long align,
+ unsigned long flags, int node, gfp_t gfp,
+ const void *caller);
void free_vm_area(struct vm_struct *area);
extern struct vm_struct *remove_vm_area(const void *addr);
extern struct vm_struct *find_vm_area(const void *addr);
@@ -241,10 +244,22 @@ static inline void set_vm_flush_reset_perms(void *addr)
vm->flags |= VM_FLUSH_RESET_PERMS;
}

+int __must_check vmap_pages_range_noflush(unsigned long addr, unsigned long end,
+ pgprot_t prot, struct page **pages,
+ unsigned int page_shift);
+
#else
static inline void set_vm_flush_reset_perms(void *addr)
{
}
+
+static inline
+int __must_check vmap_pages_range_noflush(unsigned long addr, unsigned long end,
+ pgprot_t prot, struct page **pages,
+ unsigned int page_shift)
+{
+ return -EINVAL;
+}
#endif

/* for /proc/kcore */
diff --git a/mm/internal.h b/mm/internal.h
index f309a010d50f..ba1e2ce68157 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -969,19 +969,10 @@ size_t splice_folio_into_pipe(struct pipe_inode_info *pipe,
*/
#ifdef CONFIG_MMU
void __init vmalloc_init(void);
-int __must_check vmap_pages_range_noflush(unsigned long addr, unsigned long end,
- pgprot_t prot, struct page **pages, unsigned int page_shift);
#else
static inline void vmalloc_init(void)
{
}
-
-static inline
-int __must_check vmap_pages_range_noflush(unsigned long addr, unsigned long end,
- pgprot_t prot, struct page **pages, unsigned int page_shift)
-{
- return -EINVAL;
-}
#endif

int __must_check __vmap_pages_range_noflush(unsigned long addr,
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index d12a17fc0c17..7dcba463ff99 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2650,6 +2650,30 @@ struct vm_struct *get_vm_area_caller(unsigned long size, unsigned long flags,
NUMA_NO_NODE, GFP_KERNEL, caller);
}

+/**
+ * get_vm_area_node - reserve a contiguous and aligned kernel virtual area
+ * @size: size of the area
+ * @align: alignment of the start address of the area
+ * @flags: %VM_IOREMAP for I/O mappings
+ * @node: NUMA node from which to allocate the area data structure
+ * @gfp: Flags to pass to the allocator
+ * @caller: Caller to be stored in the vm area data structure
+ *
+ * Search an area of @size/align in the kernel virtual mapping area,
+ * and reserved it for out purposes. Returns the area descriptor
+ * on success or %NULL on failure.
+ *
+ * Return: the area descriptor on success or %NULL on failure.
+ */
+struct vm_struct *get_vm_area_node(unsigned long size, unsigned long align,
+ unsigned long flags, int node, gfp_t gfp,
+ const void *caller)
+{
+ return __get_vm_area_node(size, align, PAGE_SHIFT, flags,
+ VMALLOC_START, VMALLOC_END,
+ node, gfp, caller);
+}
+
/**
* find_vm_area - find a continuous kernel virtual area
* @addr: base address
--
2.44.0.278.ge034bb2e1d-goog


2024-03-11 16:49:08

by Pasha Tatashin

[permalink] [raw]
Subject: [RFC 11/14] x86: add support for Dynamic Kernel Stacks

Add dynamic_stack_fault() calls to the kernel fault handlers, and also
select HAVE_ARCH_DYNAMIC_STACK, so that dynamic kernel stacks can be
enabled on the x86 architecture.

Signed-off-by: Pasha Tatashin <[email protected]>
---
arch/x86/Kconfig | 1 +
arch/x86/kernel/traps.c | 3 +++
arch/x86/mm/fault.c | 3 +++
3 files changed, 7 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 5edec175b9bf..9bb0da3110fa 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -197,6 +197,7 @@ config X86
select HAVE_ARCH_USERFAULTFD_WP if X86_64 && USERFAULTFD
select HAVE_ARCH_USERFAULTFD_MINOR if X86_64 && USERFAULTFD
select HAVE_ARCH_VMAP_STACK if X86_64
+ select HAVE_ARCH_DYNAMIC_STACK if X86_64
select HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET
select HAVE_ARCH_WITHIN_STACK_FRAMES
select HAVE_ASM_MODVERSIONS
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index c3b2f863acf0..cc05401e729f 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -413,6 +413,9 @@ DEFINE_IDTENTRY_DF(exc_double_fault)
}
#endif

+ if (dynamic_stack_fault(current, address))
+ return;
+
irqentry_nmi_enter(regs);
instrumentation_begin();
notify_die(DIE_TRAP, str, regs, error_code, X86_TRAP_DF, SIGSEGV);
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index d6375b3c633b..651c558b10eb 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1198,6 +1198,9 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code,
if (is_f00f_bug(regs, hw_error_code, address))
return;

+ if (dynamic_stack_fault(current, address))
+ return;
+
/* Was the fault spurious, caused by lazy TLB invalidation? */
if (spurious_kernel_fault(hw_error_code, address))
return;
--
2.44.0.278.ge034bb2e1d-goog


2024-03-11 16:49:10

by Pasha Tatashin

[permalink] [raw]
Subject: [RFC 10/14] fork: Dynamic Kernel Stacks

The core implementation of dynamic kernel stacks.

Unlike traditional kernel stacks, these stacks grow automatically as
they are used. This saves a significant amount of memory in fleet
environments. It also means the default kernel stack size can
potentially be increased in order to prevent stack overflows, without
compromising on the overall memory overhead.

The dynamic kernel stacks interface provides two global functions:

1. dynamic_stack_fault()
Architectures that support dynamic kernel stacks must call this function
in order to handle a fault in the stack.

It allocates and maps new pages into the stack. The pages are
maintained in a per-cpu data structure.

2. dynamic_stack()
Must be called when a thread is leaving the CPU to check whether the
thread has allocated dynamic stack pages, i.e. (tsk->flags &
PF_DYNAMIC_STACK) is set. If this is the case, two things need to be
performed:
a. Charge the thread for the allocated stack pages.
b. Refill the per-cpu array so the next thread can also fault.

Dynamic kernel stacks do not support "STACK_END_MAGIC", as the last
page does not have to be faulted in. However, since they are based on
vmap stacks, the guard pages always protect the dynamic kernel stacks
from overflow.

The average stack depth of a kernel thread depends on the workload,
profiling, virtualization, compiler optimizations, and driver
implementations.

Therefore, the numbers should be measured for a specific workload. From
my tests I found the following values on freshly booted, idling
machines:

CPU            #Cores  #Stacks  Regular(kb)  Dynamic(kb)
AMD Genoa         384     5786        92576        23388
Intel Skylake     112     3182        50912        12860
AMD Rome          128     3401        54416        14784
AMD Rome          256     4908        78528        20876
Intel Haswell      72     2644        42304        10624

On all machines, dynamic kernel stacks use about 25% of the original
stack memory. Only 5% of active tasks performed a stack page fault
during their life cycles.

Signed-off-by: Pasha Tatashin <[email protected]>
---
arch/Kconfig | 34 +++++
include/linux/sched.h | 2 +-
include/linux/sched/task_stack.h | 41 +++++-
kernel/fork.c | 239 +++++++++++++++++++++++++++++++
kernel/sched/core.c | 1 +
5 files changed, 315 insertions(+), 2 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index a5af0edd3eb8..da3df347b069 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -1241,6 +1241,40 @@ config VMAP_STACK
backing virtual mappings with real shadow memory, and KASAN_VMALLOC
must be enabled.

+config HAVE_ARCH_DYNAMIC_STACK
+ def_bool n
+ help
+ An arch should select this symbol if it can support kernel stacks
+ dynamic growth.
+
+ - Arch must have support for HAVE_ARCH_VMAP_STACK, in order to handle
+ stack related page faults
+
+ - Arch must be able to faults from interrupt context.
+ - Arch must allows the kernel to handle stack faults gracefully, even
+ during interrupt handling.
+
+ - Exceptions such as no pages available should be handled the same
+ in the consitent and predictable way. I.e. the exception should be
+ handled the same as when stack overflow occurs when guard pages are
+ touched with extra information about the allocation error.
+
+config DYNAMIC_STACK
+ default y
+ bool "Dynamically grow kernel stacks"
+ depends on THREAD_INFO_IN_TASK
+ depends on HAVE_ARCH_DYNAMIC_STACK
+ depends on VMAP_STACK
+ depends on !KASAN
+ depends on !DEBUG_STACK_USAGE
+ depends on !STACK_GROWSUP
+ help
+ Dynamic kernel stacks allow to save memory on machines with a lot of
+ threads by starting with small stacks, and grow them only when needed.
+ On workloads where most of the stack depth do not reach over one page
+ the memory saving can be subsentantial. The feature requires virtually
+ mapped kernel stacks in order to handle page faults.
+
config HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET
def_bool n
help
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ffe8f618ab86..d3ce3cd065ce 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1628,7 +1628,7 @@ extern struct pid *cad_pid;
#define PF_USED_MATH 0x00002000 /* If unset the fpu must be initialized before use */
#define PF_USER_WORKER 0x00004000 /* Kernel thread cloned from userspace thread */
#define PF_NOFREEZE 0x00008000 /* This thread should not be frozen */
-#define PF__HOLE__00010000 0x00010000
+#define PF_DYNAMIC_STACK 0x00010000 /* This thread allocated dynamic stack pages */
#define PF_KSWAPD 0x00020000 /* I am kswapd */
#define PF_MEMALLOC_NOFS 0x00040000 /* All allocation requests will inherit GFP_NOFS */
#define PF_MEMALLOC_NOIO 0x00080000 /* All allocation requests will inherit GFP_NOIO */
diff --git a/include/linux/sched/task_stack.h b/include/linux/sched/task_stack.h
index 860faea06883..4934bfd65ad1 100644
--- a/include/linux/sched/task_stack.h
+++ b/include/linux/sched/task_stack.h
@@ -82,9 +82,49 @@ static inline void put_task_stack(struct task_struct *tsk) {}

void exit_task_stack_account(struct task_struct *tsk);

+#ifdef CONFIG_DYNAMIC_STACK
+
+#define task_stack_end_corrupted(task) 0
+
+#ifndef THREAD_PREALLOC_PAGES
+#define THREAD_PREALLOC_PAGES 1
+#endif
+
+#define THREAD_DYNAMIC_PAGES \
+ ((THREAD_SIZE >> PAGE_SHIFT) - THREAD_PREALLOC_PAGES)
+
+void dynamic_stack_refill_pages(void);
+bool dynamic_stack_fault(struct task_struct *tsk, unsigned long address);
+
+/*
+ * Refill and charge for the used pages.
+ */
+static inline void dynamic_stack(struct task_struct *tsk)
+{
+ if (unlikely(tsk->flags & PF_DYNAMIC_STACK)) {
+ dynamic_stack_refill_pages();
+ tsk->flags &= ~PF_DYNAMIC_STACK;
+ }
+}
+
+static inline void set_task_stack_end_magic(struct task_struct *tsk) {}
+
+#else /* !CONFIG_DYNAMIC_STACK */
+
#define task_stack_end_corrupted(task) \
(*(end_of_stack(task)) != STACK_END_MAGIC)

+void set_task_stack_end_magic(struct task_struct *tsk);
+static inline void dynamic_stack(struct task_struct *tsk) {}
+
+static inline bool dynamic_stack_fault(struct task_struct *tsk,
+ unsigned long address)
+{
+ return false;
+}
+
+#endif /* CONFIG_DYNAMIC_STACK */
+
static inline int object_is_on_stack(const void *obj)
{
void *stack = task_stack_page(current);
@@ -114,7 +154,6 @@ static inline unsigned long stack_not_used(struct task_struct *p)
# endif
}
#endif
-extern void set_task_stack_end_magic(struct task_struct *tsk);

static inline int kstack_end(void *addr)
{
diff --git a/kernel/fork.c b/kernel/fork.c
index bbae5f705773..63e1fd661e17 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -204,6 +204,232 @@ static bool try_release_thread_stack_to_cache(struct vm_struct *vm_area)
return false;
}

+#ifdef CONFIG_DYNAMIC_STACK
+
+static DEFINE_PER_CPU(struct page *, dynamic_stack_pages[THREAD_DYNAMIC_PAGES]);
+
+static struct vm_struct *alloc_vmap_stack(int node)
+{
+ gfp_t gfp = THREADINFO_GFP & ~__GFP_ACCOUNT;
+ unsigned long addr, end;
+ struct vm_struct *vm_area;
+ int err, i;
+
+ vm_area = get_vm_area_node(THREAD_SIZE, THREAD_ALIGN, VM_MAP, node,
+ gfp, __builtin_return_address(0));
+ if (!vm_area)
+ return NULL;
+
+ vm_area->pages = kmalloc_node(sizeof(void *) *
+ (THREAD_SIZE >> PAGE_SHIFT), gfp, node);
+ if (!vm_area->pages)
+ goto cleanup_err;
+
+ for (i = 0; i < THREAD_PREALLOC_PAGES; i++) {
+ vm_area->pages[i] = alloc_pages(gfp, 0);
+ if (!vm_area->pages[i])
+ goto cleanup_err;
+ vm_area->nr_pages++;
+ }
+
+ addr = (unsigned long)vm_area->addr +
+ (THREAD_DYNAMIC_PAGES << PAGE_SHIFT);
+ end = (unsigned long)vm_area->addr + THREAD_SIZE;
+ err = vmap_pages_range_noflush(addr, end, PAGE_KERNEL, vm_area->pages,
+ PAGE_SHIFT);
+ if (err)
+ goto cleanup_err;
+
+ return vm_area;
+cleanup_err:
+ for (i = 0; i < vm_area->nr_pages; i++)
+ __free_page(vm_area->pages[i]);
+ kfree(vm_area->pages);
+ kfree(vm_area);
+
+ return NULL;
+}
+
+static void free_vmap_stack(struct vm_struct *vm_area)
+{
+ int i, nr_pages;
+
+ remove_vm_area(vm_area->addr);
+
+ nr_pages = vm_area->nr_pages;
+ for (i = 0; i < nr_pages; i++)
+ __free_page(vm_area->pages[i]);
+
+ kfree(vm_area->pages);
+ kfree(vm_area);
+}
+
+/*
+ * This flag is used to pass information from fault handler to refill about
+ * which pages were allocated, and should be charged to memcg.
+ */
+#define DYNAMIC_STACK_PAGE_AQUIRED_FLAG 0x1
+
+static struct page *dynamic_stack_get_page(void)
+{
+ struct page **pages = this_cpu_ptr(dynamic_stack_pages);
+ int i;
+
+ for (i = 0; i < THREAD_DYNAMIC_PAGES; i++) {
+ struct page *page = pages[i];
+
+ if (page && !((uintptr_t)page & DYNAMIC_STACK_PAGE_AQUIRED_FLAG)) {
+ pages[i] = (void *)((uintptr_t)pages[i] | DYNAMIC_STACK_PAGE_AQUIRED_FLAG);
+ return page;
+ }
+ }
+
+ return NULL;
+}
+
+static int dynamic_stack_refill_pages_cpu(unsigned int cpu)
+{
+ struct page **pages = per_cpu_ptr(dynamic_stack_pages, cpu);
+ int i;
+
+ for (i = 0; i < THREAD_DYNAMIC_PAGES; i++) {
+ if (pages[i])
+ break;
+ pages[i] = alloc_pages(THREADINFO_GFP & ~__GFP_ACCOUNT, 0);
+ if (unlikely(!pages[i])) {
+ pr_err("failed to allocate dynamic stack page for cpu[%d]\n",
+ cpu);
+ }
+ }
+
+ return 0;
+}
+
+static int dynamic_stack_free_pages_cpu(unsigned int cpu)
+{
+ struct page **pages = per_cpu_ptr(dynamic_stack_pages, cpu);
+ int i;
+
+ for (i = 0; i < THREAD_DYNAMIC_PAGES; i++) {
+ if (!pages[i])
+ continue;
+ __free_page(pages[i]);
+ pages[i] = NULL;
+ }
+
+ return 0;
+}
+
+void dynamic_stack_refill_pages(void)
+{
+ struct page **pages = this_cpu_ptr(dynamic_stack_pages);
+ int i, ret;
+
+ for (i = 0; i < THREAD_DYNAMIC_PAGES; i++) {
+ struct page *page = pages[i];
+
+ if (!((uintptr_t)page & DYNAMIC_STACK_PAGE_AQUIRED_FLAG))
+ break;
+
+ page = (void *)((uintptr_t)page & ~DYNAMIC_STACK_PAGE_AQUIRED_FLAG);
+ ret = memcg_kmem_charge_page(page, GFP_KERNEL, 0);
+ /*
+ * XXX Since stack pages were already allocated, we should never
+ * fail charging. Therefore, we should probably induce force
+ * charge and oom killing if charge fails.
+ */
+ if (unlikely(ret))
+ pr_warn_ratelimited("dynamic stack: charge for allocated page failed\n");
+
+ mod_lruvec_page_state(page, NR_KERNEL_STACK_KB,
+ PAGE_SIZE / 1024);
+
+ page = alloc_pages(THREADINFO_GFP & ~__GFP_ACCOUNT, 0);
+ if (unlikely(!page))
+ pr_err_ratelimited("failed to refill per-cpu dynamic stack\n");
+ pages[i] = page;
+ }
+}
+
+bool noinstr dynamic_stack_fault(struct task_struct *tsk, unsigned long address)
+{
+ unsigned long stack, hole_end, addr;
+ struct vm_struct *vm_area;
+ struct page *page;
+ int nr_pages;
+ pte_t *pte;
+
+ /* check if address is inside the kernel stack area */
+ stack = (unsigned long)tsk->stack;
+ if (address < stack || address >= stack + THREAD_SIZE)
+ return false;
+
+ vm_area = tsk->stack_vm_area;
+ if (!vm_area)
+ return false;
+
+ /*
+ * check if this stack can still grow, otherwise fault will be reported
+ * as guard page access.
+ */
+ nr_pages = vm_area->nr_pages;
+ if (nr_pages >= (THREAD_SIZE >> PAGE_SHIFT))
+ return false;
+
+ /* Check if fault address is within the stack hole */
+ hole_end = stack + THREAD_SIZE - (nr_pages << PAGE_SHIFT);
+ if (address >= hole_end)
+ return false;
+
+ /*
+ * Most likely we faulted in the page right next to the last mapped
+ * page in the stack, however, it is possible (but very unlikely) that
+ * the faulted page is actually skips some pages in the stack. Make sure
+ * we do not create more than one holes in the stack, and map every
+ * page between the current fault address and the last page that is
+ * mapped in the stack.
+ */
+ address = PAGE_ALIGN_DOWN(address);
+ for (addr = hole_end - PAGE_SIZE; addr >= address; addr -= PAGE_SIZE) {
+ /* Take the next page from the per-cpu list */
+ page = dynamic_stack_get_page();
+ if (!page) {
+ instrumentation_begin();
+ pr_emerg("Failed to allocate a page during kernel_stack_fault\n");
+ instrumentation_end();
+ return false;
+ }
+
+ /* Store the new page in the stack's vm_area */
+ vm_area->pages[nr_pages] = page;
+ vm_area->nr_pages = nr_pages + 1;
+
+ /* Add the new page entry to the page table */
+ pte = virt_to_kpte(addr);
+ if (!pte) {
+ instrumentation_begin();
+ pr_emerg("The PTE page table for a kernel stack is not found\n");
+ instrumentation_end();
+ return false;
+ }
+
+ /* Make sure there are no existing mappings at this address */
+ if (pte_present(*pte)) {
+ instrumentation_begin();
+ pr_emerg("The PTE contains a mapping\n");
+ instrumentation_end();
+ return false;
+ }
+ set_pte_at(&init_mm, addr, pte, mk_pte(page, PAGE_KERNEL));
+ }
+
+ /* Refill the pcp stack pages during context switch */
+ tsk->flags |= PF_DYNAMIC_STACK;
+
+ return true;
+}
+
+#else /* !CONFIG_DYNAMIC_STACK */
static inline struct vm_struct *alloc_vmap_stack(int node)
{
void *stack;
@@ -226,6 +452,7 @@ static inline void free_vmap_stack(struct vm_struct *vm_area)
{
vfree(vm_area->addr);
}
+#endif /* CONFIG_DYNAMIC_STACK */

static void thread_stack_free_rcu(struct rcu_head *rh)
{
@@ -1083,6 +1310,16 @@ void __init fork_init(void)
NULL, free_vm_stack_cache);
#endif

+#ifdef CONFIG_DYNAMIC_STACK
+ cpuhp_setup_state(CPUHP_BP_PREPARE_DYN, "fork:dynamic_stack",
+ dynamic_stack_refill_pages_cpu,
+ dynamic_stack_free_pages_cpu);
+ /*
+ * Fill the dynamic stack pages for the boot CPU, others will be filled
+ * as CPUs are onlined.
+ */
+ dynamic_stack_refill_pages_cpu(smp_processor_id());
+#endif
scs_init();

lockdep_init_task(&init_task);
@@ -1096,6 +1333,7 @@ int __weak arch_dup_task_struct(struct task_struct *dst,
return 0;
}

+#ifndef CONFIG_DYNAMIC_STACK
void set_task_stack_end_magic(struct task_struct *tsk)
{
unsigned long *stackend;
@@ -1103,6 +1341,7 @@ void set_task_stack_end_magic(struct task_struct *tsk)
stackend = end_of_stack(tsk);
*stackend = STACK_END_MAGIC; /* for overflow detection */
}
+#endif

static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
{
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9116bcc90346..20f9523c3159 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6617,6 +6617,7 @@ static void __sched notrace __schedule(unsigned int sched_mode)
rq = cpu_rq(cpu);
prev = rq->curr;

+ dynamic_stack(prev);
schedule_debug(prev, !!sched_mode);

if (sched_feat(HRTICK) || sched_feat(HRTICK_DL))
--
2.44.0.278.ge034bb2e1d-goog


2024-03-11 16:49:17

by Pasha Tatashin

[permalink] [raw]
Subject: [RFC 12/14] task_stack.h: Clean-up stack_not_used() implementation

Inside the small stack_not_used() function there are several ifdefs for
the growing-up vs. regular stack versions. Instead, implement this
function twice: once for growing-up stacks and once for regular ones.

This is needed because there will be a third implementation of this
function for dynamic stacks.

Signed-off-by: Pasha Tatashin <[email protected]>
---
include/linux/sched/task_stack.h | 23 ++++++++++++++---------
1 file changed, 14 insertions(+), 9 deletions(-)

diff --git a/include/linux/sched/task_stack.h b/include/linux/sched/task_stack.h
index 4934bfd65ad1..396d5418ae32 100644
--- a/include/linux/sched/task_stack.h
+++ b/include/linux/sched/task_stack.h
@@ -135,25 +135,30 @@ static inline int object_is_on_stack(const void *obj)
extern void thread_stack_cache_init(void);

#ifdef CONFIG_DEBUG_STACK_USAGE
+#ifdef CONFIG_STACK_GROWSUP
static inline unsigned long stack_not_used(struct task_struct *p)
{
unsigned long *n = end_of_stack(p);

- do { /* Skip over canary */
-# ifdef CONFIG_STACK_GROWSUP
+ do { /* Skip over canary */
n--;
-# else
- n++;
-# endif
} while (!*n);

-# ifdef CONFIG_STACK_GROWSUP
return (unsigned long)end_of_stack(p) - (unsigned long)n;
-# else
+}
+#else /* !CONFIG_STACK_GROWSUP */
+static inline unsigned long stack_not_used(struct task_struct *p)
+{
+ unsigned long *n = end_of_stack(p);
+
+ do { /* Skip over canary */
+ n++;
+ } while (!*n);
+
return (unsigned long)n - (unsigned long)end_of_stack(p);
-# endif
}
-#endif
+#endif /* CONFIG_STACK_GROWSUP */
+#endif /* CONFIG_DEBUG_STACK_USAGE */

static inline int kstack_end(void *addr)
{
--
2.44.0.278.ge034bb2e1d-goog


2024-03-11 16:49:32

by Pasha Tatashin

[permalink] [raw]
Subject: [RFC 13/14] task_stack.h: Add stack_not_used() support for dynamic stack

CONFIG_DEBUG_STACK_USAGE is enabled by default on most architectures.

Its purpose is to determine and print the maximum stack depth on
thread exit.

It works by starting from the bottom of the stack and searching for the
first non-zero word. With dynamic stacks this does not work very well,
as it would fault in every page of every stack.

Instead, add a specific version of stack_not_used() for dynamic stacks
where, instead of starting from the bottom of the stack, we start from
the last page mapped in the stack.

In addition to avoiding unnecessary page faults, the search is faster
because it skips the unmapped part of the stack.

Also, because a dynamic stack does not end with STACK_END_MAGIC, there
is no need to skip the bottom-most word in the stack.

Signed-off-by: Pasha Tatashin <[email protected]>
---
arch/Kconfig | 1 -
include/linux/sched/task_stack.h | 38 +++++++++++++++++++++++---------
2 files changed, 27 insertions(+), 12 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index da3df347b069..759b2bb7edb6 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -1266,7 +1266,6 @@ config DYNAMIC_STACK
depends on HAVE_ARCH_DYNAMIC_STACK
depends on VMAP_STACK
depends on !KASAN
- depends on !DEBUG_STACK_USAGE
depends on !STACK_GROWSUP
help
Dynamic kernel stacks allow to save memory on machines with a lot of
diff --git a/include/linux/sched/task_stack.h b/include/linux/sched/task_stack.h
index 396d5418ae32..c5fb679b31ee 100644
--- a/include/linux/sched/task_stack.h
+++ b/include/linux/sched/task_stack.h
@@ -9,6 +9,7 @@
#include <linux/sched.h>
#include <linux/magic.h>
#include <linux/refcount.h>
+#include <linux/vmalloc.h>

#ifdef CONFIG_THREAD_INFO_IN_TASK

@@ -109,6 +110,21 @@ static inline void dynamic_stack(struct task_struct *tsk)

static inline void set_task_stack_end_magic(struct task_struct *tsk) {}

+#ifdef CONFIG_DEBUG_STACK_USAGE
+static inline unsigned long stack_not_used(struct task_struct *p)
+{
+ struct vm_struct *vm_area = p->stack_vm_area;
+ unsigned long alloc_size = vm_area->nr_pages << PAGE_SHIFT;
+ unsigned long stack = (unsigned long)p->stack;
+ unsigned long *n = (unsigned long *)(stack + THREAD_SIZE - alloc_size);
+
+ while (!*n)
+ n++;
+
+ return (unsigned long)n - stack;
+}
+#endif /* CONFIG_DEBUG_STACK_USAGE */
+
#else /* !CONFIG_DYNAMIC_STACK */

#define task_stack_end_corrupted(task) \
@@ -123,17 +139,6 @@ static inline bool dynamic_stack_fault(struct task_struct *tsk,
return false;
}

-#endif /* CONFIG_DYNAMIC_STACK */
-
-static inline int object_is_on_stack(const void *obj)
-{
- void *stack = task_stack_page(current);
-
- return (obj >= stack) && (obj < (stack + THREAD_SIZE));
-}
-
-extern void thread_stack_cache_init(void);
-
#ifdef CONFIG_DEBUG_STACK_USAGE
#ifdef CONFIG_STACK_GROWSUP
static inline unsigned long stack_not_used(struct task_struct *p)
@@ -160,6 +165,17 @@ static inline unsigned long stack_not_used(struct task_struct *p)
#endif /* CONFIG_STACK_GROWSUP */
#endif /* CONFIG_DEBUG_STACK_USAGE */

+#endif /* CONFIG_DYNAMIC_STACK */
+
+static inline int object_is_on_stack(const void *obj)
+{
+ void *stack = task_stack_page(current);
+
+ return (obj >= stack) && (obj < (stack + THREAD_SIZE));
+}
+
+extern void thread_stack_cache_init(void);
+
static inline int kstack_end(void *addr)
{
/* Reliable end of stack detection:
--
2.44.0.278.ge034bb2e1d-goog


2024-03-11 16:49:48

by Pasha Tatashin

[permalink] [raw]
Subject: [RFC 14/14] fork: Dynamic Kernel Stack accounting

Add accounting for the amount of stack memory that has been faulted in
and is currently in use.

Example use case:
$ cat /proc/vmstat | grep stack
nr_kernel_stack 18684
nr_dynamic_stacks_faults 156

The above shows that the kernel stacks use 18684 KiB in total, out of
which 156 KiB were faulted in.

Given that the pre-allocated stacks are 4KiB, we can determine the total
number of tasks:

tasks = (nr_kernel_stack - nr_dynamic_stacks_faults) / 4 = 4632.

The amount of kernel stack memory without dynamic stacks on this machine
would be:

4632 * 16 KiB = 74,112 KiB

Therefore, in this example dynamic stacks save 74,112 - 18,684 = 55,428 KiB.

Signed-off-by: Pasha Tatashin <[email protected]>
---
include/linux/mmzone.h | 3 +++
kernel/fork.c | 13 ++++++++++++-
mm/vmstat.c | 3 +++
3 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index a497f189d988..ba4f1d148c3f 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -198,6 +198,9 @@ enum node_stat_item {
NR_FOLL_PIN_ACQUIRED, /* via: pin_user_page(), gup flag: FOLL_PIN */
NR_FOLL_PIN_RELEASED, /* pages returned via unpin_user_page() */
NR_KERNEL_STACK_KB, /* measured in KiB */
+#ifdef CONFIG_DYNAMIC_STACK
+ NR_DYNAMIC_STACKS_FAULTS_KB, /* KiB of faulted kernel stack memory */
+#endif
#if IS_ENABLED(CONFIG_SHADOW_CALL_STACK)
NR_KERNEL_SCS_KB, /* measured in KiB */
#endif
diff --git a/kernel/fork.c b/kernel/fork.c
index 63e1fd661e17..2520583d160a 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -343,6 +343,9 @@ void dynamic_stack_refill_pages(void)

mod_lruvec_page_state(page, NR_KERNEL_STACK_KB,
PAGE_SIZE / 1024);
+ mod_lruvec_page_state(page,
+ NR_DYNAMIC_STACKS_FAULTS_KB,
+ PAGE_SIZE / 1024);

page = alloc_pages(THREADINFO_GFP & ~__GFP_ACCOUNT, 0);
if (unlikely(!page))
@@ -771,9 +774,17 @@ static void account_kernel_stack(struct task_struct *tsk, int account)
int i, nr_pages;

nr_pages = vm->nr_pages;
- for (i = 0; i < nr_pages; i++)
+ for (i = 0; i < nr_pages; i++) {
mod_lruvec_page_state(vm->pages[i], NR_KERNEL_STACK_KB,
account * (PAGE_SIZE / 1024));
+#ifdef CONFIG_DYNAMIC_STACK
+ if (i >= THREAD_PREALLOC_PAGES) {
+ mod_lruvec_page_state(vm->pages[i],
+ NR_DYNAMIC_STACKS_FAULTS_KB,
+ account * (PAGE_SIZE / 1024));
+ }
+#endif
+ }
} else {
void *stack = task_stack_page(tsk);

diff --git a/mm/vmstat.c b/mm/vmstat.c
index db79935e4a54..1ad6eede3d85 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1237,6 +1237,9 @@ const char * const vmstat_text[] = {
"nr_foll_pin_acquired",
"nr_foll_pin_released",
"nr_kernel_stack",
+#ifdef CONFIG_DYNAMIC_STACK
+ "nr_dynamic_stacks_faults",
+#endif
#if IS_ENABLED(CONFIG_SHADOW_CALL_STACK)
"nr_shadow_call_stack",
#endif
--
2.44.0.278.ge034bb2e1d-goog


2024-03-11 17:09:31

by Mateusz Guzik

[permalink] [raw]
Subject: Re: [RFC 00/14] Dynamic Kernel Stacks

On 3/11/24, Pasha Tatashin <[email protected]> wrote:
> This is a follow-up to the LSF/MM proposal [1]. Please provide your
> thoughts and comments about the dynamic kernel stacks feature. This is a
> WIP: it has not been tested beyond booting on some machines and running
> the LKDTM thread-exhaustion tests. The series also lacks selftests and
> documentation.
>
> This feature allows the kernel stack to grow dynamically, from 4KiB up
> to THREAD_SIZE. The intent is to save memory on fleet machines. Initial
> experiments show that it saves, on average, 70-75% of the kernel stack
> memory.
>

Can you please elaborate how this works? I have trouble figuring it
out from cursory reading of the patchset and commit messages, that
aside I would argue this should have been explained in the cover
letter.

For example, say a thread takes a bunch of random locks (most notably
spinlocks) and/or disables preemption, then pushes some stuff onto the
stack which now faults. That is to say the fault can happen in rather
arbitrary context.

If any of the conditions described below are prevented in the first
place it really needs to be described how.

That said, from top of my head:
1. what about faults when the thread holds a bunch of arbitrary locks
or has preemption disabled? is the allocation lockless?
2. what happens if there is no memory from which to map extra pages in
the first place? you may be in position where you can't go off cpu

--
Mateusz Guzik <mjguzik gmail.com>

2024-03-11 18:59:33

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [RFC 00/14] Dynamic Kernel Stacks

On Mon, Mar 11, 2024 at 1:09 PM Mateusz Guzik <[email protected]> wrote:
>
> On 3/11/24, Pasha Tatashin <[email protected]> wrote:
> > This is a follow-up to the LSF/MM proposal [1]. Please provide your
> > thoughts and comments about the dynamic kernel stacks feature. This is a
> > WIP: it has not been tested beyond booting on some machines and running
> > the LKDTM thread-exhaustion tests. The series also lacks selftests and
> > documentation.
> >
> > This feature allows the kernel stack to grow dynamically, from 4KiB up
> > to THREAD_SIZE. The intent is to save memory on fleet machines. Initial
> > experiments show that it saves, on average, 70-75% of the kernel stack
> > memory.
> >
>

Hi Mateusz,

> Can you please elaborate how this works? I have trouble figuring it
> out from cursory reading of the patchset and commit messages, that
> aside I would argue this should have been explained in the cover
> letter.

Sure, I answered your questions below.

> For example, say a thread takes a bunch of random locks (most notably
> spinlocks) and/or disables preemption, then pushes some stuff onto the
> stack which now faults. That is to say the fault can happen in rather
> arbitrary context.
>
> If any of the conditions described below are prevented in the first
> place it really needs to be described how.
>
> That said, from top of my head:
> 1. what about faults when the thread holds a bunch of arbitrary locks
> or has preemption disabled? is the allocation lockless?

Each thread has a stack with 4 pages.
Pre-allocated page: This page is always allocated and mapped at thread creation.
Dynamic pages (3): These pages are mapped dynamically upon stack faults.

A per-CPU data structure holds 3 dynamic pages for each CPU. These
pages are used to handle stack faults occurring when a running thread
faults (even within interrupt-disabled contexts). Typically, only one
page is needed, but in the rare case where the thread accesses beyond
that, we might use up to all three pages in a single fault. This
structure allows for atomic handling of stack faults, preventing
conflicts from other processes. Additionally, the thread's 16K-aligned
virtual address (VA) and guaranteed pre-allocated page means no page
table allocation is required during the fault.

When a thread leaves the CPU in normal kernel mode, we check a flag to
see if it has experienced stack faults. If so, we charge the thread
for the new stack pages and refill the per-CPU data structure with any
missing pages.
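
To make the flow concrete, here is a condensed sketch of the two paths
(simplified from the [RFC 10/14] patch; the range checks, the vm_area
bookkeeping, error reporting and the multi-page loop are omitted):

	/*
	 * Fault path: may run with preemption and interrupts disabled and
	 * takes no locks; it only consumes pages that already sit in the
	 * per-CPU pool.
	 */
	bool dynamic_stack_fault(struct task_struct *tsk, unsigned long address)
	{
		struct page *page = dynamic_stack_get_page(); /* per-CPU pool */

		if (!page)
			return false;	/* reported like a guard-page access */

		address = PAGE_ALIGN_DOWN(address);
		set_pte_at(&init_mm, address, virt_to_kpte(address),
			   mk_pte(page, PAGE_KERNEL));
		tsk->flags |= PF_DYNAMIC_STACK;	/* refill at next switch */
		return true;
	}

	/*
	 * Context-switch path, called from __schedule() while sleeping is
	 * still allowed: charge the consumed pages to memcg and refill the
	 * per-CPU pool.
	 */
	static inline void dynamic_stack(struct task_struct *tsk)
	{
		if (unlikely(tsk->flags & PF_DYNAMIC_STACK)) {
			dynamic_stack_refill_pages();
			tsk->flags &= ~PF_DYNAMIC_STACK;
		}
	}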

> 2. what happens if there is no memory from which to map extra pages in
> the first place? you may be in position where you can't go off cpu

When the per-CPU data structure cannot be refilled, and a new thread
faults, we issue a message indicating a critical stack fault. This
triggers a system-wide panic similar to a guard page access violation

Pasha

2024-03-11 19:21:14

by Mateusz Guzik

[permalink] [raw]
Subject: Re: [RFC 00/14] Dynamic Kernel Stacks

On 3/11/24, Pasha Tatashin <[email protected]> wrote:
> On Mon, Mar 11, 2024 at 1:09 PM Mateusz Guzik <[email protected]> wrote:
>> 1. what about faults when the thread holds a bunch of arbitrary locks
>> or has preemption disabled? is the allocation lockless?
>
> Each thread has a stack with 4 pages.
> Pre-allocated page: This page is always allocated and mapped at thread
> creation.
> Dynamic pages (3): These pages are mapped dynamically upon stack faults.
>
> A per-CPU data structure holds 3 dynamic pages for each CPU. These
> pages are used to handle stack faults occurring when a running thread
> faults (even within interrupt-disabled contexts). Typically, only one
> page is needed, but in the rare case where the thread accesses beyond
> that, we might use up to all three pages in a single fault. This
> structure allows for atomic handling of stack faults, preventing
> conflicts from other processes. Additionally, the thread's 16K-aligned
> virtual address (VA) and guaranteed pre-allocated page means no page
> table allocation is required during the fault.
>
> When a thread leaves the CPU in normal kernel mode, we check a flag to
> see if it has experienced stack faults. If so, we charge the thread
> for the new stack pages and refill the per-CPU data structure with any
> missing pages.
>

So this also has to happen if the thread holds a bunch of arbitrary
semaphores and goes off cpu with them? Anyhow, see below.

>> 2. what happens if there is no memory from which to map extra pages in
>> the first place? you may be in position where you can't go off cpu
>
> When the per-CPU data structure cannot be refilled, and a new thread
> faults, we issue a message indicating a critical stack fault. This
> triggers a system-wide panic similar to a guard page access violation
>

OOM handling is fundamentally what I was worried about. I'm confident
this failure mode makes the feature unsuitable for general-purpose
deployments.

Now, I have no vote here, it may be this is perfectly fine as an
optional feature, which it is in your patchset. However, if this is to
go in, the option description definitely needs a big fat warning about
possible panics if enabled.

I fully agree something(tm) should be done about stacks and the
current usage is a massive bummer. I wonder if things would be ok if
they shrinked to just 12K? Perhaps that would provide big enough
saving (of course smaller than the one you are getting now), while
avoiding any of the above.

All that said, it's not my call what do here. Thank you for the explanation.

--
Mateusz Guzik <mjguzik gmail.com>

2024-03-11 19:32:20

by Randy Dunlap

[permalink] [raw]
Subject: Re: [RFC 10/14] fork: Dynamic Kernel Stacks

Hi,

just typos etc.

On 3/11/24 09:46, Pasha Tatashin wrote:
> The core implementation of dynamic kernel stacks.
>

..

>
> Signed-off-by: Pasha Tatashin <[email protected]>
> ---
> arch/Kconfig | 34 +++++
> include/linux/sched.h | 2 +-
> include/linux/sched/task_stack.h | 41 +++++-
> kernel/fork.c | 239 +++++++++++++++++++++++++++++++
> kernel/sched/core.c | 1 +
> 5 files changed, 315 insertions(+), 2 deletions(-)
>
> diff --git a/arch/Kconfig b/arch/Kconfig
> index a5af0edd3eb8..da3df347b069 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -1241,6 +1241,40 @@ config VMAP_STACK
> backing virtual mappings with real shadow memory, and KASAN_VMALLOC
> must be enabled.
>
> +config HAVE_ARCH_DYNAMIC_STACK
> + def_bool n
> + help
> + An arch should select this symbol if it can support kernel stacks
> + dynamic growth.
> +
> + - Arch must have support for HAVE_ARCH_VMAP_STACK, in order to handle
> + stack related page faults

stack-related

> +
> + - Arch must be able to faults from interrupt context.

fault

> + - Arch must allows the kernel to handle stack faults gracefully, even

allow

> + during interrupt handling.
> +
> + - Exceptions such as no pages available should be handled the same

handled in the same

> + in the consitent and predictable way. I.e. the exception should be

consistent

> + handled the same as when stack overflow occurs when guard pages are
> + touched with extra information about the allocation error.
> +
> +config DYNAMIC_STACK
> + default y
> + bool "Dynamically grow kernel stacks"
> + depends on THREAD_INFO_IN_TASK
> + depends on HAVE_ARCH_DYNAMIC_STACK
> + depends on VMAP_STACK
> + depends on !KASAN
> + depends on !DEBUG_STACK_USAGE
> + depends on !STACK_GROWSUP
> + help
> + Dynamic kernel stacks allow to save memory on machines with a lot of
> + threads by starting with small stacks, and grow them only when needed.
> + On workloads where most of the stack depth do not reach over one page

does

> + the memory saving can be subsentantial. The feature requires virtually

substantial.

> + mapped kernel stacks in order to handle page faults.
> +
> config HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET
> def_bool n
> help



> +/*
> + * This flag is used to pass information from fault handler to refill about
> + * which pages were allocated, and should be charged to memcg.
> + */
> +#define DYNAMIC_STACK_PAGE_AQUIRED_FLAG 0x1

ACQUIRED
please



--
#Randy

2024-03-11 19:55:54

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [RFC 00/14] Dynamic Kernel Stacks

On Mon, Mar 11, 2024 at 3:21 PM Mateusz Guzik <[email protected]> wrote:
>
> On 3/11/24, Pasha Tatashin <[email protected]> wrote:
> > On Mon, Mar 11, 2024 at 1:09 PM Mateusz Guzik <[email protected]> wrote:
> >> 1. what about faults when the thread holds a bunch of arbitrary locks
> >> or has preemption disabled? is the allocation lockless?
> >
> > Each thread has a stack with 4 pages.
> > Pre-allocated page: This page is always allocated and mapped at thread
> > creation.
> > Dynamic pages (3): These pages are mapped dynamically upon stack faults.
> >
> > A per-CPU data structure holds 3 dynamic pages for each CPU. These
> > pages are used to handle stack faults occurring when a running thread
> > faults (even within interrupt-disabled contexts). Typically, only one
> > page is needed, but in the rare case where the thread accesses beyond
> > that, we might use up to all three pages in a single fault. This
> > structure allows for atomic handling of stack faults, preventing
> > conflicts from other processes. Additionally, the thread's 16K-aligned
> > virtual address (VA) and guaranteed pre-allocated page means no page
> > table allocation is required during the fault.
> >
> > When a thread leaves the CPU in normal kernel mode, we check a flag to
> > see if it has experienced stack faults. If so, we charge the thread
> > for the new stack pages and refill the per-CPU data structure with any
> > missing pages.
> >
>
> So this also has to happen if the thread holds a bunch of arbitrary
> semaphores and goes off cpu with them? Anyhow, see below.

Yes, this is all right: if a thread is allowed to sleep, it should not
hold any alloc_pages() locks.

> >> 2. what happens if there is no memory from which to map extra pages in
> >> the first place? you may be in position where you can't go off cpu
> >
> > When the per-CPU data structure cannot be refilled, and a new thread
> > faults, we issue a message indicating a critical stack fault. This
> > triggers a system-wide panic similar to a guard page access violation
> >
>
> OOM handling is fundamentally what I was worried about. I'm confident
> this failure mode makes the feature unsuitable for general-purpose
> deployments.

The primary goal of this series is to enhance system safety, not
introduce additional risks. Memory saving is a welcome side effect.
Please see below for explanations.

>
> Now, I have no vote here, it may be this is perfectly fine as an
> optional feature, which it is in your patchset. However, if this is to
> go in, the option description definitely needs a big fat warning about
> possible panics if enabled.
>
> I fully agree something(tm) should be done about stacks and the
> current usage is a massive bummer. I wonder if things would be ok if
> they shrinked to just 12K? Perhaps that would provide big enough


The current setting of 1 pre-allocated page plus 3 dynamic pages is
just WIP; we can very well change it to 2 pre-allocated and 2 dynamic
pages, 3/1, etc.

At Google, we still utilize 8K stacks (we did not increase them to 16K
when upstream did in 2014) and are only now encountering
extreme cases where the 8K limit is reached. Consequently, we plan to
increase the limit to 16K. Dynamic Kernel Stacks allow us to maintain
an 8K pre-allocated stack while handling page faults only in
exceptionally rare circumstances.

Another example is to increase THREAD_SIZE to 32K, and keep 16K
pre-allocated. This is the same as what upstream has today, but avoids
panics with guard pages thus making the systems safer for everyone.

Pasha

2024-03-11 19:56:53

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [RFC 10/14] fork: Dynamic Kernel Stacks

On Mon, Mar 11, 2024 at 3:32 PM Randy Dunlap <[email protected]> wrote:
>
> Hi,
>
> just typos etc.
>
> On 3/11/24 09:46, Pasha Tatashin wrote:
> > The core implementation of dynamic kernel stacks.
> >
>
> ...
>
> >
> > Signed-off-by: Pasha Tatashin <[email protected]>
> > ---
> > arch/Kconfig | 34 +++++
> > include/linux/sched.h | 2 +-
> > include/linux/sched/task_stack.h | 41 +++++-
> > kernel/fork.c | 239 +++++++++++++++++++++++++++++++
> > kernel/sched/core.c | 1 +
> > 5 files changed, 315 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/Kconfig b/arch/Kconfig
> > index a5af0edd3eb8..da3df347b069 100644
> > --- a/arch/Kconfig
> > +++ b/arch/Kconfig
> > @@ -1241,6 +1241,40 @@ config VMAP_STACK
> > backing virtual mappings with real shadow memory, and KASAN_VMALLOC
> > must be enabled.
> >
> > +config HAVE_ARCH_DYNAMIC_STACK
> > + def_bool n
> > + help
> > + An arch should select this symbol if it can support kernel stacks
> > + dynamic growth.
> > +
> > + - Arch must have support for HAVE_ARCH_VMAP_STACK, in order to handle
> > + stack related page faults
>
> stack-related
>
> > +
> > + - Arch must be able to faults from interrupt context.
>
> fault
>
> > + - Arch must allows the kernel to handle stack faults gracefully, even
>
> allow
>
> > + during interrupt handling.
> > +
> > + - Exceptions such as no pages available should be handled the same
>
> handled in the same
>
> > + in the consitent and predictable way. I.e. the exception should be
>
> consistent
>
> > + handled the same as when stack overflow occurs when guard pages are
> > + touched with extra information about the allocation error.
> > +
> > +config DYNAMIC_STACK
> > + default y
> > + bool "Dynamically grow kernel stacks"
> > + depends on THREAD_INFO_IN_TASK
> > + depends on HAVE_ARCH_DYNAMIC_STACK
> > + depends on VMAP_STACK
> > + depends on !KASAN
> > + depends on !DEBUG_STACK_USAGE
> > + depends on !STACK_GROWSUP
> > + help
> > + Dynamic kernel stacks allow to save memory on machines with a lot of
> > + threads by starting with small stacks, and grow them only when needed.
> > + On workloads where most of the stack depth do not reach over one page
>
> does
>
> > + the memory saving can be subsentantial. The feature requires virtually
>
> substantial.
>
> > + mapped kernel stacks in order to handle page faults.
> > +
> > config HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET
> > def_bool n
> > help
>
>
>
> > +/*
> > + * This flag is used to pass information from fault handler to refill about
> > + * which pages were allocated, and should be charged to memcg.
> > + */
> > +#define DYNAMIC_STACK_PAGE_AQUIRED_FLAG 0x1
>
> ACQUIRED
> please

Thank you Randy, I will address your comments in my next revision.

Pasha

>
>
>
> --
> #Randy

2024-03-11 22:18:07

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RFC 11/14] x86: add support for Dynamic Kernel Stacks



On Mon, Mar 11, 2024, at 9:46 AM, Pasha Tatashin wrote:
> Add dynamic_stack_fault() calls to the kernel faults, and also declare
> HAVE_ARCH_DYNAMIC_STACK = y, so that dynamic kernel stacks can be
> enabled on x86 architecture.
>
> Signed-off-by: Pasha Tatashin <[email protected]>
> ---
> arch/x86/Kconfig | 1 +
> arch/x86/kernel/traps.c | 3 +++
> arch/x86/mm/fault.c | 3 +++
> 3 files changed, 7 insertions(+)
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 5edec175b9bf..9bb0da3110fa 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -197,6 +197,7 @@ config X86
> select HAVE_ARCH_USERFAULTFD_WP if X86_64 && USERFAULTFD
> select HAVE_ARCH_USERFAULTFD_MINOR if X86_64 && USERFAULTFD
> select HAVE_ARCH_VMAP_STACK if X86_64
> + select HAVE_ARCH_DYNAMIC_STACK if X86_64
> select HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET
> select HAVE_ARCH_WITHIN_STACK_FRAMES
> select HAVE_ASM_MODVERSIONS
> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> index c3b2f863acf0..cc05401e729f 100644
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -413,6 +413,9 @@ DEFINE_IDTENTRY_DF(exc_double_fault)
> }
> #endif
>
> + if (dynamic_stack_fault(current, address))
> + return;
> +

Sorry, but no, you can't necessarily do this. I say this as the person who wrote this code, and I justified my code on the basis that we are not recovering -- we're jumping out to a different context, and we won't crash if the origin context for the fault is corrupt. The SDM is really quite unambiguous about it: we're in an "abort" context, and returning is not allowed. And I think this may well be the real deal -- the microcode does not promise to have the return frame and the actual faulting context matched up here, and there is no architectural guarantee that returning will do the right thing.

Now we do have some history of getting a special exception, e.g. for espfix64. But espfix64 is a very special case, and the situation you're looking at is very general. So unless Intel and AMD are both willing to publicly document that it's okay to handle stack overflow, where any instruction in the ISA may have caused the overflow, like this, then we're not going to do it.

There are some other options: you could pre-map

Also, I think the whole memory allocation concept in this whole series is a bit odd. Fundamentally, we *can't* block on these stack faults -- we may be in a context where blocking will deadlock. We may be in the page allocator. Panicking due to kernel stack allocation would be very unpleasant. But perhaps we could have a rule that a task can only be scheduled in if there is sufficient memory available for its stack. And perhaps we could avoid ever page-faulting by filling in the PTEs for the potential stack pages but leaving them un-accessed. I *think* that all x86 implementations won't fill the TLB for a non-accessed page without also setting the accessed bit, so the performance hit of filling the PTEs, running the task, and then doing the appropriate synchronization to clear the PTEs and read the accessed bit on schedule-out to release the pages may not be too bad. But you would need to do this cautiously in the scheduler, possibly in the *next* task but before the prev task is actually released enough to be run on a different CPU. It's going to be messy.
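
A rough sketch of that pre-map-and-check idea, purely for illustration;
the names (dstack_slot, dstack_premap, dstack_unmap_unused) are invented
here, and the unlocked read/clear relies on the assumption above that
nothing else can touch the task's stack at this point:

#include <linux/mm.h>
#include <linux/pgtable.h>

/* Hypothetical per-task bookkeeping for the not-yet-committed stack pages. */
struct dstack_slot {
        pte_t           *ptep;  /* kernel PTE covering this stack page */
        unsigned long   addr;   /* address of the page in the vmap area */
        struct page     *page;  /* pre-allocated backing page */
};

/* Schedule-in: install the PTEs with the accessed bit clear. */
static void dstack_premap(struct dstack_slot *slot, int nr)
{
        int i;

        for (i = 0; i < nr; i++)
                set_pte(slot[i].ptep,
                        pte_mkold(mk_pte(slot[i].page, PAGE_KERNEL)));
}

/* Schedule-out: if the CPU never set the accessed bit, take the page back. */
static void dstack_unmap_unused(struct dstack_slot *slot, int nr)
{
        int i;

        for (i = 0; i < nr; i++) {
                if (pte_young(ptep_get(slot[i].ptep)))
                        continue;       /* touched: charge it and keep it mapped */
                pte_clear(&init_mm, slot[i].addr, slot[i].ptep);
        }
}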

2024-03-11 23:11:08

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [RFC 11/14] x86: add support for Dynamic Kernel Stacks

On Mon, Mar 11, 2024 at 6:17 PM Andy Lutomirski <[email protected]> wrote:
>
>
>
> On Mon, Mar 11, 2024, at 9:46 AM, Pasha Tatashin wrote:
> > Add dynamic_stack_fault() calls to the kernel faults, and also declare
> > HAVE_ARCH_DYNAMIC_STACK = y, so that dynamic kernel stacks can be
> > enabled on x86 architecture.
> >
> > Signed-off-by: Pasha Tatashin <[email protected]>
> > ---
> > arch/x86/Kconfig | 1 +
> > arch/x86/kernel/traps.c | 3 +++
> > arch/x86/mm/fault.c | 3 +++
> > 3 files changed, 7 insertions(+)
> >
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index 5edec175b9bf..9bb0da3110fa 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -197,6 +197,7 @@ config X86
> > select HAVE_ARCH_USERFAULTFD_WP if X86_64 && USERFAULTFD
> > select HAVE_ARCH_USERFAULTFD_MINOR if X86_64 && USERFAULTFD
> > select HAVE_ARCH_VMAP_STACK if X86_64
> > + select HAVE_ARCH_DYNAMIC_STACK if X86_64
> > select HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET
> > select HAVE_ARCH_WITHIN_STACK_FRAMES
> > select HAVE_ASM_MODVERSIONS
> > diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> > index c3b2f863acf0..cc05401e729f 100644
> > --- a/arch/x86/kernel/traps.c
> > +++ b/arch/x86/kernel/traps.c
> > @@ -413,6 +413,9 @@ DEFINE_IDTENTRY_DF(exc_double_fault)
> > }
> > #endif
> >
> > + if (dynamic_stack_fault(current, address))
> > + return;
> > +
>
> Sorry, but no, you can't necessarily do this. I say this as the person who write this code, and I justified my code on the basis that we are not recovering -- we're jumping out to a different context, and we won't crash if the origin context for the fault is corrupt. The SDM is really quite unambiguous about it: we're in an "abort" context, and returning is not allowed And I this may well be is the real deal -- the microcode does not promise to have the return frame and the actual faulting context matched up here, and there's is no architectural guarantee that returning will do the right thing.
>
> Now we do have some history of getting a special exception, e.g. for espfix64. But espfix64 is a very special case, and the situation you're looking at is very general. So unless Intel and AMD are both wiling to publicly document that it's okay to handle stack overflow, where any instruction in the ISA may have caused the overflow, like this, then we're not going to do it.

Hi Andy,

Thank you for the insightful feedback.

I'm somewhat confused about why we end up in exc_double_fault() in the
first place. My initial assumption was that dynamic_stack_fault()
would only be needed within do_kern_addr_fault(). However, while
testing in QEMU, I found that when using memset() on a stack variable,
code like this:

rep stos %rax,%es:(%rdi)

causes a double fault instead of a regular fault. I added it to
exc_double_fault() as a result, but I'm curious if you have any
insights into why this behavior occurs.

> There are some other options: you could pre-map

Pre-mapping would be expensive. It would mean pre-mapping the dynamic
pages for every scheduled thread, and we'd still need to check the
access bit every time a thread leaves the CPU. Dynamic thread faults
should be considered rare events and thus shouldn't significantly
affect the performance of normal context switch operations. With 8K
stacks, we might encounter only 0.00001% of stacks requiring an extra
page, and even fewer needing 16K.

> Also, I think the whole memory allocation concept in this whole series is a bit odd. Fundamentally, we *can't* block on these stack faults -- we may be in a context where blocking will deadlock. We may be in the page allocator. Panicing due to kernel stack allocation would be very unpleasant.

We never block during handling stack faults. There's a per-CPU page
pool, guaranteeing availability for the faulting thread. The thread
simply takes pages from this per-CPU data structure and refills the
pool when leaving the CPU. The faulting routine is efficient,
requiring a fixed number of loads without any locks, stalling, or even
cmpxchg operations.
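
For illustration, the per-CPU pool could look roughly like this; the
names dstack_pool, dstack_take_page and dstack_refill_pool are made up
here rather than taken from the patchset:

#include <linux/gfp.h>
#include <linux/percpu.h>

#define DYNAMIC_STACK_PAGES     3

struct dstack_pool {
        struct page *pages[DYNAMIC_STACK_PAGES];
};
static DEFINE_PER_CPU(struct dstack_pool, dstack_pool);

/* Fault path: interrupts are off, no locks, just consume from this CPU. */
static struct page *dstack_take_page(void)
{
        struct dstack_pool *pool = this_cpu_ptr(&dstack_pool);
        int i;

        for (i = 0; i < DYNAMIC_STACK_PAGES; i++) {
                struct page *page = pool->pages[i];

                if (page) {
                        pool->pages[i] = NULL;
                        return page;
                }
        }
        return NULL;    /* pool empty: the "critical stack fault" case */
}

/* Refill path, run when the thread leaves the CPU; allocation happens here. */
static void dstack_refill_pool(void)
{
        struct dstack_pool *pool = this_cpu_ptr(&dstack_pool);
        int i;

        for (i = 0; i < DYNAMIC_STACK_PAGES; i++) {
                if (!pool->pages[i])
                        pool->pages[i] = alloc_page(GFP_KERNEL);
        }
}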

> But perhaps we could have a rule that a task can only be scheduled in if there is sufficient memory available for its stack.

Yes, I've considered this as well. We might implement this to avoid
crashes due to page faults. Basically, if the per-CPU pool cannot be
refilled, we'd prevent task scheduling until it is. At that point we're
already so short on memory that the kernel can't even allocate 3 pages.

Thank you,
Pasha

> And perhaps we could avoid every page-faulting by filling in the PTEs for the potential stack pages but leaving them un-accessed. I *think* that all x86 implementations won't fill the TLB for a non-accessed page without also setting the accessed bit, so the performance hit of filling the PTEs, running the task, and then doing the appropriate synchronization to clear the PTEs and read the accessed bit on schedule-out to release the pages may not be too bad. But you would need to do this cautiously in the scheduler, possibly in the *next* task but before the prev task is actually released enough to be run on a different CPU. It's going to be messy.

2024-03-11 23:34:10

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [RFC 11/14] x86: add support for Dynamic Kernel Stacks

On Mon, Mar 11 2024 at 19:10, Pasha Tatashin wrote:
> On Mon, Mar 11, 2024 at 6:17 PM Andy Lutomirski <[email protected]> wrote:
>> Also, I think the whole memory allocation concept in this whole
>> series is a bit odd. Fundamentally, we *can't* block on these stack
>> faults -- we may be in a context where blocking will deadlock. We
>> may be in the page allocator. Panicing due to kernel stack
>> allocation would be very unpleasant.
>
> We never block during handling stack faults. There's a per-CPU page
> pool, guaranteeing availability for the faulting thread. The thread
> simply takes pages from this per-CPU data structure and refills the
> pool when leaving the CPU. The faulting routine is efficient,
> requiring a fixed number of loads without any locks, stalling, or even
> cmpxchg operations.

Is this true for any context including nested exceptions and #NMI?

Thanks,

tglx

2024-03-11 23:34:45

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RFC 11/14] x86: add support for Dynamic Kernel Stacks

On Mon, Mar 11, 2024, at 4:10 PM, Pasha Tatashin wrote:
> On Mon, Mar 11, 2024 at 6:17 PM Andy Lutomirski <[email protected]> wrote:
>>
>>
>>
>> On Mon, Mar 11, 2024, at 9:46 AM, Pasha Tatashin wrote:
>> > Add dynamic_stack_fault() calls to the kernel faults, and also declare
>> > HAVE_ARCH_DYNAMIC_STACK = y, so that dynamic kernel stacks can be
>> > enabled on x86 architecture.
>> >
>> > Signed-off-by: Pasha Tatashin <[email protected]>
>> > ---
>> > arch/x86/Kconfig | 1 +
>> > arch/x86/kernel/traps.c | 3 +++
>> > arch/x86/mm/fault.c | 3 +++
>> > 3 files changed, 7 insertions(+)
>> >
>> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>> > index 5edec175b9bf..9bb0da3110fa 100644
>> > --- a/arch/x86/Kconfig
>> > +++ b/arch/x86/Kconfig
>> > @@ -197,6 +197,7 @@ config X86
>> > select HAVE_ARCH_USERFAULTFD_WP if X86_64 && USERFAULTFD
>> > select HAVE_ARCH_USERFAULTFD_MINOR if X86_64 && USERFAULTFD
>> > select HAVE_ARCH_VMAP_STACK if X86_64
>> > + select HAVE_ARCH_DYNAMIC_STACK if X86_64
>> > select HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET
>> > select HAVE_ARCH_WITHIN_STACK_FRAMES
>> > select HAVE_ASM_MODVERSIONS
>> > diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
>> > index c3b2f863acf0..cc05401e729f 100644
>> > --- a/arch/x86/kernel/traps.c
>> > +++ b/arch/x86/kernel/traps.c
>> > @@ -413,6 +413,9 @@ DEFINE_IDTENTRY_DF(exc_double_fault)
>> > }
>> > #endif
>> >
>> > + if (dynamic_stack_fault(current, address))
>> > + return;
>> > +
>>
>> Sorry, but no, you can't necessarily do this. I say this as the person who write this code, and I justified my code on the basis that we are not recovering -- we're jumping out to a different context, and we won't crash if the origin context for the fault is corrupt. The SDM is really quite unambiguous about it: we're in an "abort" context, and returning is not allowed. And I this may well be is the real deal -- the microcode does not promise to have the return frame and the actual faulting context matched up here, and there's is no architectural guarantee that returning will do the right thing.
>>
>> Now we do have some history of getting a special exception, e.g. for espfix64. But espfix64 is a very special case, and the situation you're looking at is very general. So unless Intel and AMD are both wiling to publicly document that it's okay to handle stack overflow, where any instruction in the ISA may have caused the overflow, like this, then we're not going to do it.
>
> Hi Andy,
>
> Thank you for the insightful feedback.
>
> I'm somewhat confused about why we end up in exc_double_fault() in the
> first place. My initial assumption was that dynamic_stack_fault()
> would only be needed within do_kern_addr_fault(). However, while
> testing in QEMU, I found that when using memset() on a stack variable,
> code like this:
>
> rep stos %rax,%es:(%rdi)
>
> causes a double fault instead of a regular fault. I added it to
> exc_double_fault() as a result, but I'm curious if you have any
> insights into why this behavior occurs.
>

Imagine you're a CPU running kernel code, on a fairly traditional architecture like x86. The code tries to access some swapped out user memory. You say "sorry, that memory is not present" and generate a page fault. You save the current state *to the stack* and change the program counter to point to the page fault handler. The page fault handler does its thing, then pops the old state off the stack and resumes the faulting code.

A few microseconds later, the kernel fills up its stack and then does:

PUSH something

but that would write to a not-present stack page, because you already filled the stack. Okay, a page fault -- no big deal, we know how to handle that. So you push the current state to the stack. Oh wait, you *can't* push the current state to the stack, because that would involve writing to an unmapped page of memory.

So you trigger a double-fault. You push some state to the double-fault handler's special emergency stack. But wait, *what* state do you push? Is it the state that did the "PUSH something" and overflowed the stack? Or is it some virtual state that's a mixture of that and the failed page fault handler? What if the stack wasn't quite full and you actually succeeded in pushing the old stack pointer but not the old program counter? What saved state goes where?

This is a complicated mess, so the people who designed all this said 'hey, wait a minute, let's not call double faults a "fault" -- let's call them an "abort"' so we can stop confusing ourselves and ship CPUs to customers. And "abort" means "the saved state is not well defined -- don't rely on it having any particular meaning".

So, until a few years ago, we would just print something like "PANIC: double fault" and kill the whole system. A few years ago, I decided this was lame, and I wanted to have stack guard pages, so I added real fancy new logic: instead, we do our best to display the old state, but it's a guess and all we're doing with it is printk -- if it's wrong, it's annoying, but that's all. And then we kill the running thread -- instead of trying to return (and violating our sacred contract with the x86 architecture), we *reset* the current crashing thread's state to a known-good state. Then we return to *that* state. Now we're off the emergency stack and we're running something resembling normal kernel code, but we can't return, as there is nowhere to return to. But that's fine -- instead we kill the current thread, kind of like _exit(). That never returns, so it's okay that we can't return.

But your patch adds a return statement to this whole mess, which will return to the moderately-likely-to-be-corrupt state that caused a double fault inside the microcode for the page fault path, and you have stepped outside the well-defined path in the x86 architecture, and you've triggered something akin to Undefined Behavior. The CPU won't catch fire, but it reserves the right to execute from an incorrect RSP and/or RIP, to be in the middle of an instruction, etc.

(For that matter, what if there was exactly enough room to enter the page fault handler, but the very first instruction of the page fault handler overflowed the stack? Then you allocate more memory, get lucky and successfully resume the page fault handler, and then promptly OOPS because you run the page fault handler and it thinks you got a kernel page fault? My OOPS code handles that, but, again, it's not trying to recover.)

>> There are some other options: you could pre-map
>
> Pre-mapping would be expensive. It would mean pre-mapping the dynamic
> pages for every scheduled thread, and we'd still need to check the
> access bit every time a thread leaves the CPU.

That's a write to four consecutive words in memory, with no locking required.

> Dynamic thread faults
> should be considered rare events and thus shouldn't significantly
> affect the performance of normal context switch operations. With 8K
> stacks, we might encounter only 0.00001% of stacks requiring an extra
> page, and even fewer needing 16K.

Well yes, but if you crash 0.0001% of the time due to the microcode not liking you, you lose. :)

>
>> Also, I think the whole memory allocation concept in this whole series is a bit odd. Fundamentally, we *can't* block on these stack faults -- we may be in a context where blocking will deadlock. We may be in the page allocator. Panicing due to kernel stack allocation would be very unpleasant.
>
> We never block during handling stack faults. There's a per-CPU page
> pool, guaranteeing availability for the faulting thread. The thread
> simply takes pages from this per-CPU data structure and refills the
> pool when leaving the CPU. The faulting routine is efficient,
> requiring a fixed number of loads without any locks, stalling, or even
> cmpxchg operations.

You can't block when scheduling, either. What if you can't refill the pool?

2024-03-11 23:34:58

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC 11/14] x86: add support for Dynamic Kernel Stacks

On 3/11/24 15:17, Andy Lutomirski wrote:
> I *think* that all x86 implementations won't fill the TLB for a
> non-accessed page without also setting the accessed bit,

That's my understanding as well. The SDM is a little more obtuse about it:

> Whenever the processor uses a paging-structure entry as part of
> linear-address translation, it sets the accessed flag in that entry
> (if it is not already set).

but it's there.

But if we start needing Accessed=1 to be accurate, clearing those PTEs
gets more expensive because it needs to be atomic to lock out the page
walker. It basically needs to start getting treated similarly to what
is done for Dirty=1 on userspace PTEs. Not the end of the world, of
course, but one more source of overhead.
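
A minimal sketch of that atomic variant, using the existing
ptep_get_and_clear() helper; TLB flushing and the question of when this
is actually safe are deliberately left out, and the helper name is
invented:

#include <linux/pgtable.h>

static bool dstack_try_reclaim(unsigned long addr, pte_t *ptep)
{
        pte_t old = ptep_get_and_clear(&init_mm, addr, ptep);

        if (pte_young(old)) {
                /* Used (or the hardware walker raced us): keep it mapped. */
                set_pte(ptep, old);
                return false;
        }
        /* Never accessed: caller may free the backing page. */
        return true;
}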


2024-03-11 23:42:03

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RFC 11/14] x86: add support for Dynamic Kernel Stacks

On Mon, Mar 11, 2024, at 4:34 PM, Dave Hansen wrote:
> On 3/11/24 15:17, Andy Lutomirski wrote:
>> I *think* that all x86 implementations won't fill the TLB for a
>> non-accessed page without also setting the accessed bit,
>
> That's my understanding as well. The SDM is a little more obtuse about it:
>
>> Whenever the processor uses a paging-structure entry as part of
>> linear-address translation, it sets the accessed flag in that entry
>> (if it is not already set).
>
> but it's there.
>
> But if we start needing Accessed=1 to be accurate, clearing those PTEs
> gets more expensive because it needs to be atomic to lock out the page
> walker. It basically needs to start getting treated similarly to what
> is done for Dirty=1 on userspace PTEs. Not the end of the world, of
> course, but one more source of overhead.

In my fantasy land where I understand the x86 paging machinery, suppose we're in finish_task_switch(), and suppose prev is Not Horribly Buggy (TM). In particular, suppose that no other CPU is concurrently (non-speculatively!) accessing prev's stack. Prev can't be running, because whatever magic lock prevents it from being migrated hasn't been released yet. (I have no idea what lock this is, but it had darned well better exist so prev isn't migrated before switch_to() even returns.)

So the current CPU is not accessing the memory, and no other CPU is accessing the memory, and BPF doesn't exist, so no one is being utterly daft with a kernel read probe, and perf isn't up to any funny business, etc. And a CPU will never *speculatively* set the accessed bit (I told you it's fantasy land), so we just do it unlocked:

if (!pte->accessed) {
        *pte = 0;
        reuse the memory;
}

What could possibly go wrong?

I admit this is not the best idea I've ever had, and I will not waste anyone's time by trying very hard to defend it :)

2024-03-11 23:56:39

by Nadav Amit

[permalink] [raw]
Subject: Re: [RFC 11/14] x86: add support for Dynamic Kernel Stacks



> On 12 Mar 2024, at 1:41, Andy Lutomirski <[email protected]> wrote:
>
> On Mon, Mar 11, 2024, at 4:34 PM, Dave Hansen wrote:
>> On 3/11/24 15:17, Andy Lutomirski wrote:
>>> I *think* that all x86 implementations won't fill the TLB for a
>>> non-accessed page without also setting the accessed bit,
>>
>> That's my understanding as well. The SDM is a little more obtuse about it:
>>
>>> Whenever the processor uses a paging-structure entry as part of
>>> linear-address translation, it sets the accessed flag in that entry
>>> (if it is not already set).
>>
>> but it's there.
>>
>> But if we start needing Accessed=1 to be accurate, clearing those PTEs
>> gets more expensive because it needs to be atomic to lock out the page
>> walker. It basically needs to start getting treated similarly to what
>> is done for Dirty=1 on userspace PTEs. Not the end of the world, of
>> course, but one more source of overhead.
>
> In my fantasy land where I understand the x86 paging machinery, suppose we're in finish_task_switch(), and suppose prev is Not Horribly Buggy (TM). In particular, suppose that no other CPU is concurrently (non-speculatively!) accessing prev's stack. Prev can't be running, because whatever magic lock prevents it from being migrated hasn't been released yet. (I have no idea what lock this is, but it had darned well better exist so prev isn't migrated before switch_to() even returns.)
>
> So the current CPU is not accessing the memory, and no other CPU is accessing the memory, and BPF doesn't exist, so no one is being utterly daft and a kernel read probe, and perf isn't up to any funny business, etc. And a CPU will never *speculatively* set the accessed bit (I told you it's fantasy land), so we just do it unlocked:
>
> if (!pte->accessed) {
> *pte = 0;
> reuse the memory;
> }
>
> What could possibly go wrong?
>
> I admit this is not the best idea I've ever had, and I will not waste anyone's time by trying very hard to defend it :)
>

Just a thought: you don’t care if someone only reads from the stack's page (you can just install another page later). IOW: you only care if someone writes.

So you can look on the dirty-bit, which is not being set speculatively and save yourself one problem.


2024-03-12 00:03:14

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RFC 11/14] x86: add support for Dynamic Kernel Stacks



On Mon, Mar 11, 2024, at 4:56 PM, Nadav Amit wrote:
>> On 12 Mar 2024, at 1:41, Andy Lutomirski <[email protected]> wrote:
>>
>> On Mon, Mar 11, 2024, at 4:34 PM, Dave Hansen wrote:
>>> On 3/11/24 15:17, Andy Lutomirski wrote:
>>>> I *think* that all x86 implementations won't fill the TLB for a
>>>> non-accessed page without also setting the accessed bit,
>>>
>>> That's my understanding as well. The SDM is a little more obtuse about it:
>>>
>>>> Whenever the processor uses a paging-structure entry as part of
>>>> linear-address translation, it sets the accessed flag in that entry
>>>> (if it is not already set).
>>>
>>> but it's there.
>>>
>>> But if we start needing Accessed=1 to be accurate, clearing those PTEs
>>> gets more expensive because it needs to be atomic to lock out the page
>>> walker. It basically needs to start getting treated similarly to what
>>> is done for Dirty=1 on userspace PTEs. Not the end of the world, of
>>> course, but one more source of overhead.
>>
>> In my fantasy land where I understand the x86 paging machinery, suppose we're in finish_task_switch(), and suppose prev is Not Horribly Buggy (TM). In particular, suppose that no other CPU is concurrently (non-speculatively!) accessing prev's stack. Prev can't be running, because whatever magic lock prevents it from being migrated hasn't been released yet. (I have no idea what lock this is, but it had darned well better exist so prev isn't migrated before switch_to() even returns.)
>>
>> So the current CPU is not accessing the memory, and no other CPU is accessing the memory, and BPF doesn't exist, so no one is being utterly daft and a kernel read probe, and perf isn't up to any funny business, etc. And a CPU will never *speculatively* set the accessed bit (I told you it's fantasy land), so we just do it unlocked:
>>
>> if (!pte->accessed) {
>> *pte = 0;
>> reuse the memory;
>> }
>>
>> What could possibly go wrong?
>>
>> I admit this is not the best idea I've ever had, and I will not waste anyone's time by trying very hard to defend it :)
>>
>
> Just a thought: you don’t care if someone only reads from the stack's
> page (you can just install another page later). IOW: you only care if
> someone writes.
>
> So you can look on the dirty-bit, which is not being set speculatively
> and save yourself one problem.

Doesn't this buy a new problem? Install a page, run the thread without using the page but speculatively load the PTE as read-only into the TLB, context-switch out the thread, (entirely safely and correctly) determine that the page wasn't used, remove it from the PTE, use it for something else and fill it with things that aren't zero, run the thread again, and read from it. Now it has some other thread's data!

One might slightly credibly argue that this isn't a problem -- the area between RSP and the bottom of what one nominally considers to be the stack is allowed to return arbitrary garbage, especially in the kernel where there's no red zone (until someone builds a kernel with a redzone on a FRED system, hmm), but this is still really weird. If you *write* in that area, the CPU hopefully puts the *correct* value in the TLB and life goes on, but how much do you trust anyone to have validated what happens when a PTE is present, writable and clean but the TLB contains a stale entry pointing somewhere else? And is it really okay to do this to the poor kernel?

If we're going to add a TLB flush on context switch, then (a) we are being rather silly and (b) we might as well just use atomics to play with the accessed bit instead, I think.

2024-03-12 00:09:29

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [RFC 11/14] x86: add support for Dynamic Kernel Stacks

> >> There are some other options: you could pre-map
> >
> > Pre-mapping would be expensive. It would mean pre-mapping the dynamic
> > pages for every scheduled thread, and we'd still need to check the
> > access bit every time a thread leaves the CPU.
>
> That's a write to four consecutive words in memory, with no locking required.

You convinced me, this might not be that bad. At thread creation
time we will save the locations of the unmapped thread PTEs, and set
them on every schedule. There is a slight increase in scheduling cost,
but perhaps it is not as bad as I initially thought. This approach,
however, makes the dynamic stack feature much safer, and can be easily
extended to all arches that support access/dirty bit tracking.
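
Something along these lines, perhaps; this is only a sketch, the
structure and helper names are invented, error handling is skipped, and
it assumes the pre-allocated page sits at the top of a downward-growing
16K stack so the three lower pages are the dynamic ones:

#include <linux/mm.h>

/* Hypothetical stash of the PTE locations for the dynamic part of a stack. */
struct dstack_ptes {
        pte_t           *ptep[3];
        unsigned long   addr[3];
        int             nr;
};

static int dstack_collect_pte(pte_t *ptep, unsigned long addr, void *data)
{
        struct dstack_ptes *dp = data;

        dp->ptep[dp->nr] = ptep;
        dp->addr[dp->nr] = addr;
        dp->nr++;
        return 0;
}

/* At thread creation: remember where the dynamic-stack PTEs live. */
static int dstack_save_ptes(unsigned long stack_base, struct dstack_ptes *dp)
{
        dp->nr = 0;
        /* Page tables already exist because one page of the stack is populated. */
        return apply_to_page_range(&init_mm, stack_base, 3 * PAGE_SIZE,
                                   dstack_collect_pte, dp);
}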

>
> > Dynamic thread faults
> > should be considered rare events and thus shouldn't significantly
> > affect the performance of normal context switch operations. With 8K
> > stacks, we might encounter only 0.00001% of stacks requiring an extra
> > page, and even fewer needing 16K.
>
> Well yes, but if you crash 0.0001% of the time due to the microcode not liking you, you lose. :)
>
> >
> >> Also, I think the whole memory allocation concept in this whole series is a bit odd. Fundamentally, we *can't* block on these stack faults -- we may be in a context where blocking will deadlock. We may be in the page allocator. Panicing due to kernel stack allocation would be very unpleasant.
> >
> > We never block during handling stack faults. There's a per-CPU page
> > pool, guaranteeing availability for the faulting thread. The thread
> > simply takes pages from this per-CPU data structure and refills the
> > pool when leaving the CPU. The faulting routine is efficient,
> > requiring a fixed number of loads without any locks, stalling, or even
> > cmpxchg operations.
>
> You can't block when scheduling, either. What if you can't refill the pool?

Why can't we (I am not a scheduler guy)? IRQs are not yet disabled;
what prevents us from blocking while the old process has not yet been
removed from the CPU?

2024-03-12 00:24:31

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [RFC 11/14] x86: add support for Dynamic Kernel Stacks

> >
> > You can't block when scheduling, either. What if you can't refill the pool?
>
> Why can't we (I am not a scheduler guy)? IRQ's are not yet disabled,
> what prevents us from blocking while the old process has not yet been
> removed from the CPU?

Answering my own question: on a single-CPU machine, the thread is going
away but tries to refill the dynamic pages; alloc_page() goes into the
slow path (i.e. __alloc_pages_slowpath()), where it performs
cond_resched() while waiting for OOM kills, and yet we cannot leave
the CPU.

2024-03-12 00:53:47

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC 11/14] x86: add support for Dynamic Kernel Stacks

On 3/11/24 16:56, Nadav Amit wrote:
> So you can look on the dirty-bit, which is not being set
> speculatively and save yourself one problem.
Define "set speculatively". :)

> If software on one logical processor writes to a page while software
> on another logical processor concurrently clears the R/W flag in the
> paging-structure entry that maps the page, execution on some
> processors may result in the entry’s dirty flag being set (due to the
> write on the first logical processor) and the entry’s R/W flag being
> clear (due to the update to the entry on the second logical
> processor).

In other words, you'll see both a fault *AND* the dirty bit. The write
never retired and the dirty bit is set.

Does that count as being set speculatively?

That's just the behavior that the SDM explicitly admits to.

2024-03-12 01:28:04

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC 11/14] x86: add support for Dynamic Kernel Stacks

On March 11, 2024 5:53:33 PM PDT, Dave Hansen <[email protected]> wrote:
>On 3/11/24 16:56, Nadav Amit wrote:
>> So you can look on the dirty-bit, which is not being set
>> speculatively and save yourself one problem.
>Define "set speculatively". :)
>
>> If software on one logical processor writes to a page while software
>> on another logical processor concurrently clears the R/W flag in the
>> paging-structure entry that maps the page, execution on some
>> processors may result in the entry’s dirty flag being set (due to the
>> write on the first logical processor) and the entry’s R/W flag being
>> clear (due to the update to the entry on the second logical
>> processor).
>
>In other words, you'll see both a fault *AND* the dirty bit. The write
>never retired and the dirty bit is set.
>
>Does that count as being set speculatively?
>
>That's just the behavior that the SDM explicitly admits to.

Indeed; both the A and D bits are by design permissive; that is, the hardware can set them at any time.

The only guarantees are:

1. The hardware will not set the A bit on a not-present page, nor the D bit on a read-only page.

2. *Provided that the user has invalidated the page entry in the TLB*, hardware guarantees the respective bits will be set before a dependent memory access is made visible. Thus the bits are guaranteed to reflect a strict superset of operations performed architecturally.

2024-03-12 02:17:12

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RFC 11/14] x86: add support for Dynamic Kernel Stacks

On Mon, Mar 11, 2024, at 6:25 PM, H. Peter Anvin wrote:
> On March 11, 2024 5:53:33 PM PDT, Dave Hansen <[email protected]> wrote:
>>On 3/11/24 16:56, Nadav Amit wrote:
>>> So you can look on the dirty-bit, which is not being set
>>> speculatively and save yourself one problem.
>>Define "set speculatively". :)
>>
>>> If software on one logical processor writes to a page while software
>>> on another logical processor concurrently clears the R/W flag in the
>>> paging-structure entry that maps the page, execution on some
>>> processors may result in the entry’s dirty flag being set (due to the
>>> write on the first logical processor) and the entry’s R/W flag being
>>> clear (due to the update to the entry on the second logical
>>> processor).
>>
>>In other words, you'll see both a fault *AND* the dirty bit. The write
>>never retired and the dirty bit is set.
>>
>>Does that count as being set speculatively?
>>
>>That's just the behavior that the SDM explicitly admits to.
>
> Indeed; both the A and D bits are by design permissive; that is, the
> hardware can set them at any time.
>
> The only guarantees are:
>
> 1. The hardware will not set the A bit on a not present late, nor the D
> bit on a read only page.

Wait a sec. What about setting the D bit on a not-present page?

I always assumed that the actual intended purpose of the D bit was for things like file mapping. Imagine an alternate universe in which Linux used hardware dirty tracking instead of relying on do_wp_page, etc.

mmap(..., MAP_SHARED): PTE is created, read-write, clean

user program may or may not write to the page.

Now either munmap is called or the kernel needs to reclaim memory. So the kernel checks if the page is dirty and, if so, writes it back, and then unmaps it.

Now some silly people invented SMP, so this needs an atomic operation: xchg the PTE to all-zeros, see if the dirty bit is set, and, if it's set, write back the page. Otherwise discard it.

Does this really not work on Intel CPU?

2024-03-12 02:22:13

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC 11/14] x86: add support for Dynamic Kernel Stacks

On March 11, 2024 7:16:38 PM PDT, Andy Lutomirski <[email protected]> wrote:
>On Mon, Mar 11, 2024, at 6:25 PM, H. Peter Anvin wrote:
>> On March 11, 2024 5:53:33 PM PDT, Dave Hansen <[email protected]> wrote:
>>>On 3/11/24 16:56, Nadav Amit wrote:
>>>> So you can look on the dirty-bit, which is not being set
>>>> speculatively and save yourself one problem.
>>>Define "set speculatively". :)
>>>
>>>> If software on one logical processor writes to a page while software
>>>> on another logical processor concurrently clears the R/W flag in the
>>>> paging-structure entry that maps the page, execution on some
>>>> processors may result in the entry’s dirty flag being set (due to the
>>>> write on the first logical processor) and the entry’s R/W flag being
>>>> clear (due to the update to the entry on the second logical
>>>> processor).
>>>
>>>In other words, you'll see both a fault *AND* the dirty bit. The write
>>>never retired and the dirty bit is set.
>>>
>>>Does that count as being set speculatively?
>>>
>>>That's just the behavior that the SDM explicitly admits to.
>>
>> Indeed; both the A and D bits are by design permissive; that is, the
>> hardware can set them at any time.
>>
>> The only guarantees are:
>>
>> 1. The hardware will not set the A bit on a not present late, nor the D
>> bit on a read only page.
>
>Wait a sec. What about setting the D bit on a not-present page?
>
>I always assumed that the actual intended purpose of the D bit was for things like file mapping. Imagine an alternate universe in which Linux used hardware dirty tracking instead of relying on do_wp_page, etc.
>
>mmap(..., MAP_SHARED): PTE is created, read-write, clean
>
>user program may or may not write to the page.
>
>Now either munmap is called or the kernel needs to reclaim memory. So the kernel checks if the page is dirty and, if so, writes it back, and then unmaps it.
>
>Now some silly people invented SMP, so this needs an atomic operation: xchg the PTE to all-zeros, see if the dirty bit is set, and, if itt's set, write back the page. Otherwise discard it.
>
>Does this really not work on Intel CPU?
>

Sorry, I should have been more clear.

Hardware will not set a bit that would correspond to a prohibited access.

2024-03-12 07:16:15

by Nikolay Borisov

[permalink] [raw]
Subject: Re: [RFC 06/14] fork: zero vmap stack using clear_page() instead of memset()



On 11.03.24 г. 18:46 ч., Pasha Tatashin wrote:
> In preporation for dynamic kernel stacks do not zero the whole span of
> the stack, but instead only the pages that are part of the vm_area.
>
> This is because with dynamic stacks we might have only partially
> populated stacks.
>
> Signed-off-by: Pasha Tatashin <[email protected]>
> ---
> kernel/fork.c | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 6a2f2c85e09f..41e0baee79d2 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -263,8 +263,8 @@ static int memcg_charge_kernel_stack(struct vm_struct *vm)
> static int alloc_thread_stack_node(struct task_struct *tsk, int node)
> {
> struct vm_struct *vm_area;
> + int i, j, nr_pages;
> void *stack;
> - int i;
>
> for (i = 0; i < NR_CACHED_STACKS; i++) {
> vm_area = this_cpu_xchg(cached_stacks[i], NULL);
> @@ -282,7 +282,9 @@ static int alloc_thread_stack_node(struct task_struct *tsk, int node)
> stack = kasan_reset_tag(vm_area->addr);
>
> /* Clear stale pointers from reused stack. */
> - memset(stack, 0, THREAD_SIZE);
> + nr_pages = vm_area->nr_pages;
> + for (j = 0; j < nr_pages; j++)
> + clear_page(page_address(vm_area->pages[j]));

Can't this be memset(stack, 0, nr_pages*PAGE_SIZE) ?
>
> tsk->stack_vm_area = vm_area;
> tsk->stack = stack;

2024-03-12 07:21:05

by Nadav Amit

[permalink] [raw]
Subject: Re: [RFC 11/14] x86: add support for Dynamic Kernel Stacks



> On 12 Mar 2024, at 2:02, Andy Lutomirski <[email protected]> wrote:
>
> Doesn't this buy a new problem? Install a page, run the thread without using the page but speculatively load the PTE as read-only into the TLB, context-switch out the thread, (entirely safely and correctly) determine that the page wasn't used, remove it from the PTE, use it for something else and fill it with things that aren't zero, run the thread again, and read from it. Now it has some other thread's data!

Yes, you are correct. Bad idea of mine. Regardless of the data leak, it opens the door to subtle hard-to-analyze bugs where two reads return different values.

2024-03-12 16:52:59

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [RFC 05/14] fork: check charging success before zeroing stack

On Tue, Mar 12, 2024 at 11:57 AM Kirill A. Shutemov
<[email protected]> wrote:
>
> On Mon, Mar 11, 2024 at 04:46:29PM +0000, Pasha Tatashin wrote:
> > No need to do zero cahced stack if memcg charge fails, so move the
>
> Typo.

Thanks, I will fix this.

>
> --
> Kiryl Shutsemau / Kirill A. Shutemov

2024-03-12 16:54:04

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [RFC 06/14] fork: zero vmap stack using clear_page() instead of memset()

On Tue, Mar 12, 2024 at 3:16 AM Nikolay Borisov <[email protected]> wrote:
>
>
>
> On 11.03.24 г. 18:46 ч., Pasha Tatashin wrote:
> > In preporation for dynamic kernel stacks do not zero the whole span of
> > the stack, but instead only the pages that are part of the vm_area.
> >
> > This is because with dynamic stacks we might have only partially
> > populated stacks.
> >
> > Signed-off-by: Pasha Tatashin <[email protected]>
> > ---
> > kernel/fork.c | 6 ++++--
> > 1 file changed, 4 insertions(+), 2 deletions(-)
> >
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 6a2f2c85e09f..41e0baee79d2 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -263,8 +263,8 @@ static int memcg_charge_kernel_stack(struct vm_struct *vm)
> > static int alloc_thread_stack_node(struct task_struct *tsk, int node)
> > {
> > struct vm_struct *vm_area;
> > + int i, j, nr_pages;
> > void *stack;
> > - int i;
> >
> > for (i = 0; i < NR_CACHED_STACKS; i++) {
> > vm_area = this_cpu_xchg(cached_stacks[i], NULL);
> > @@ -282,7 +282,9 @@ static int alloc_thread_stack_node(struct task_struct *tsk, int node)
> > stack = kasan_reset_tag(vm_area->addr);
> >
> > /* Clear stale pointers from reused stack. */
> > - memset(stack, 0, THREAD_SIZE);
> > + nr_pages = vm_area->nr_pages;
> > + for (j = 0; j < nr_pages; j++)
> > + clear_page(page_address(vm_area->pages[j]));
>
> Can't this be memset(stack, 0, nr_pages*PAGE_SIZE) ?

No, we can't, because the pages can be physically discontiguous.

Pasha

2024-03-12 17:19:40

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC 00/14] Dynamic Kernel Stacks



On 3/11/24 09:46, Pasha Tatashin wrote:
> This is follow-up to the LSF/MM proposal [1]. Please provide your
> thoughts and comments about dynamic kernel stacks feature. This is a WIP
> has not been tested beside booting on some machines, and running LKDTM
> thread exhaust tests. The series also lacks selftests, and
> documentations.
>
> This feature allows to grow kernel stack dynamically, from 4KiB and up
> to the THREAD_SIZE. The intend is to save memory on fleet machines. From
> the initial experiments it shows to save on average 70-75% of the kernel
> stack memory.
>
> The average depth of a kernel thread depends on the workload, profiling,
> virtualization, compiler optimizations, and driver implementations.
> However, the table below shows the amount of kernel stack memory before
> vs. after on idling freshly booted machines:
>
> CPU #Cores #Stacks BASE(kb) Dynamic(kb) Saving
> AMD Genoa 384 5786 92576 23388 74.74%
> Intel Skylake 112 3182 50912 12860 74.74%
> AMD Rome 128 3401 54416 14784 72.83%
> AMD Rome 256 4908 78528 20876 73.42%
> Intel Haswell 72 2644 42304 10624 74.89%
>
> Some workloads with that have millions of threads would can benefit
> significantly from this feature.
>

Ok, first of all, talking about "kernel memory" here is misleading.
Unless your threads are spending nearly all their time sleeping, the
threads will occupy stack and TLS memory in user space as well.

Second, non-dynamic kernel memory is one of the core design decisions in
Linux from early on. This means there are a lot of deeply embedded
assumptions which would have to be untangled.

Linus would, of course, be the real authority on this, but if someone
would ask me what the fundamental design philosophies of the Linux
kernel are -- the design decisions which make Linux Linux, if you will
-- I would say:

1. Non-dynamic kernel memory
2. Permanent mapping of physical memory
3. Kernel API modeled closely after the POSIX API
(no complicated user space layers)
4. Fast system call entry/exit (a necessity for a
kernel API based on simple system calls)
5. Monolithic (but modular) kernel environment
(not cross-privilege, coroutine or message passing)

Third, *IF* this is something that should be done (and I personally
strongly suspect it should not), at least on x86-64 it probably should
be for FRED hardware only. With FRED, it is possible to set the #PF
event stack level to 1, which will cause an automatic stack switch for
#PF in kernel space (only). However, even in kernel space, #PF can sleep
if it references a user space page, in which case it would have to be
demoted back onto the ring 0 stack (there are multiple ways of doing
that, but it does entail an overhead.)

-hpa

2024-03-12 19:46:33

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [RFC 00/14] Dynamic Kernel Stacks

On Tue, Mar 12, 2024 at 1:19 PM H. Peter Anvin <[email protected]> wrote:
>
>
>
> On 3/11/24 09:46, Pasha Tatashin wrote:
> > This is follow-up to the LSF/MM proposal [1]. Please provide your
> > thoughts and comments about dynamic kernel stacks feature. This is a WIP
> > has not been tested beside booting on some machines, and running LKDTM
> > thread exhaust tests. The series also lacks selftests, and
> > documentations.
> >
> > This feature allows to grow kernel stack dynamically, from 4KiB and up
> > to the THREAD_SIZE. The intend is to save memory on fleet machines. From
> > the initial experiments it shows to save on average 70-75% of the kernel
> > stack memory.
> >
> > The average depth of a kernel thread depends on the workload, profiling,
> > virtualization, compiler optimizations, and driver implementations.
> > However, the table below shows the amount of kernel stack memory before
> > vs. after on idling freshly booted machines:
> >
> > CPU #Cores #Stacks BASE(kb) Dynamic(kb) Saving
> > AMD Genoa 384 5786 92576 23388 74.74%
> > Intel Skylake 112 3182 50912 12860 74.74%
> > AMD Rome 128 3401 54416 14784 72.83%
> > AMD Rome 256 4908 78528 20876 73.42%
> > Intel Haswell 72 2644 42304 10624 74.89%
> >
> > Some workloads with that have millions of threads would can benefit
> > significantly from this feature.
> >
>
> Ok, first of all, talking about "kernel memory" here is misleading.

Hi Peter,

I re-read my cover letter, and I do not see where "kernel memory" is
mentioned. We are talking about kernel stack overhead that is
proportional to the user workload, as every active thread has an
associated kernel stack. The idea is to save memory by not
pre-allocating all pages of kernel stacks, and instead to treat the
extra pages as a safeguard for when a stack actually becomes deep;
that is, come up with a solution that handles the rare deeper stacks
only when needed. This could be done through faulting on supported
hardware (as proposed in this series), or via pre-mapping on every
schedule event and checking the access bit when the thread goes off
CPU (as proposed by Andy Lutomirski to avoid double faults on x86).

In other words, this feature is only about one very specific type of
kernel memory that is not even directly mapped (the feature requires
vmapped stacks).

> Unless your threads are spending nearly all their time sleeping, the
> threads will occupy stack and TLS memory in user space as well.

Can you please elaborate: what data is contained in the kernel stack
when a thread is in user space? My series requires thread_info not to
be in the stack, by depending on THREAD_INFO_IN_TASK.

> Second, non-dynamic kernel memory is one of the core design decisions in
> Linux from early on. This means there are lot of deeply embedded
> assumptions which would have to be untangled.
>
> Linus would, of course, be the real authority on this, but if someone
> would ask me what the fundamental design philosophies of the Linux
> kernel are -- the design decisions which make Linux Linux, if you will
> -- I would say:
>
> 1. Non-dynamic kernel memory
> 2. Permanent mapping of physical memory

Points one and two are correlated. Given that all memory is directly
mapped, the kernel core cannot be relocatable, swappable, faultable,
etc.

> 3. Kernel API modeled closely after the POSIX API
> (no complicated user space layers)
> 4. Fast system call entry/exit (a necessity for a
> kernel API based on simple system calls)
> 5. Monolithic (but modular) kernel environment
> (not cross-privilege, coroutine or message passing)
>
> Third, *IF* this is something that should be done (and I personally
> strongly suspect it should not), at least on x86-64 it probably should
> be for FRED hardware only. With FRED, it is possible to set the #PF
> event stack level to 1, which will cause an automatic stack switch for
> #PF in kernel space (only). However, even in kernel space, #PF can sleep
> if it references a user space page, in which case it would have to be
> demoted back onto the ring 0 stack (there are multiple ways of doing
> that, but it does entail an overhead.)

My understanding is that with the proposed approach only double faults
cannot be used. Pre-map/check-access could still work, even though it
would add some cost to context switching.

Thank you,
Pasha

2024-03-12 21:38:12

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC 00/14] Dynamic Kernel Stacks

On 3/12/24 12:45, Pasha Tatashin wrote:
>>
>> Ok, first of all, talking about "kernel memory" here is misleading.
>
> Hi Peter,
>
> I re-read my cover letter, and I do not see where "kernel memory" is
> mentioned. We are talking about kernel stacks overhead that is
> proportional to the user workload, as every active thread has an
> associated kernel stack. The idea is to save memory by not
> pre-allocating all pages of kernel-stacks, but instead use it as a
> safeguard when a stack actually becomes deep. Come-up with a solution
> that can handle rare deeper stacks only when needed. This could be
> done through faulting on the supported hardware (as proposed in this
> series), or via pre-map on every schedule event, and checking the
> access when thread goes off cpu (as proposed by Andy Lutomirski to
> avoid double faults on x86) .
>
> In other words, this feature is only about one very specific type of
> kernel memory that is not even directly mapped (the feature required
> vmapped stacks).
>
>> Unless your threads are spending nearly all their time sleeping, the
>> threads will occupy stack and TLS memory in user space as well.
>
> Can you please elaborate, what data is contained in the kernel stack
> when thread is in user space? My series requires thread_info not to be
> in the stack by depending on THREAD_INFO_IN_TASK.
>

My point is that what matters is total memory use, not just memory used
in the kernel. Amdahl's law.

-hpa


2024-03-12 21:57:01

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [RFC 05/14] fork: check charging success before zeroing stack

On Mon, Mar 11, 2024 at 04:46:29PM +0000, Pasha Tatashin wrote:
> No need to do zero cahced stack if memcg charge fails, so move the

Typo.

--
Kiryl Shutsemau / Kirill A. Shutemov

2024-03-12 21:58:11

by Andi Kleen

[permalink] [raw]
Subject: Re: [RFC 11/14] x86: add support for Dynamic Kernel Stacks

Pasha Tatashin <[email protected]> writes:

> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index d6375b3c633b..651c558b10eb 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -1198,6 +1198,9 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code,
> if (is_f00f_bug(regs, hw_error_code, address))
> return;
>
> + if (dynamic_stack_fault(current, address))
> + return;

Probably I'm missing something here, but since you are on the same stack as
you are trying to fix up, how can this possibly work?

Fault on missing stack
#PF
<will push more stuff onto the missing stack causing a double fault>

You would need a separate stack just to handle this. But the normal
page fault handler couldn't use it because it needs to be able to block.

Ah I get it -- you handle it in the double fault handler? So every
stack growth will be a #DF too? That's scary.


-Andi

2024-03-12 22:19:00

by David Laight

[permalink] [raw]
Subject: RE: [RFC 00/14] Dynamic Kernel Stacks

...
> I re-read my cover letter, and I do not see where "kernel memory" is
> mentioned. We are talking about kernel stacks overhead that is
> proportional to the user workload, as every active thread has an
> associated kernel stack. The idea is to save memory by not
> pre-allocating all pages of kernel-stacks, but instead use it as a
> safeguard when a stack actually becomes deep. Come-up with a solution
> that can handle rare deeper stacks only when needed. This could be
> done through faulting on the supported hardware (as proposed in this
> series), or via pre-map on every schedule event, and checking the
> access when thread goes off cpu (as proposed by Andy Lutomirski to
> avoid double faults on x86) .
>
> In other words, this feature is only about one very specific type of
> kernel memory that is not even directly mapped (the feature required
> vmapped stacks).

Just for interest how big does the register save area get?
In the 'good old days' it could be allocated from the low end of the
stack memory. But AVX512 starts making it large - never mind some
other things that (IIRC) might get to 8k.
Even the task area is probably non-trivial since far fewer things
can be shared than one might hope.

I'm sure I remember someone contemplating not allocating stacks to
each thread. I think that requires waking up with a system call
restart for some system calls - plausibly possible for futex() and poll().

Another option is to do a proper static analysis of stack usage
and fix the paths that have deep stacks and remove all recursion.
I'm pretty sure objtool knows the stack offsets of every call instruction.
The indirect call hashes (FineIBT?) should allow indirect calls to
be handled as well as direct calls.
Processing the 'A calls B at offset n' to generate a max depth
is just a SMOP.

At the moment I think all 'void (*)(void *)' functions have the same hash?
So the compiler would need a function attribute to seed the hash.

With that you might be able to remove all the code paths that actually
use a lot of stack - instead of just guessing and limiting individual
stack frames.
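
For illustration only, a toy version of that "A calls B at offset n"
processing; the data structures here are invented, and a real tool
(driven by objtool output) would also need to handle indirect calls
and recursion/cycles:

    struct call_edge {
            int callee;                     /* index of the called function */
            unsigned int offset;            /* stack offset at the call instruction */
    };

    struct func {
            struct call_edge *calls;
            int nr_calls;
            unsigned int frame_size;        /* stack used by the function itself */
    };

    /* Worst-case depth reachable from funcs[idx], assuming an acyclic graph. */
    static unsigned int max_stack_depth(const struct func *funcs, int idx)
    {
            const struct func *f = &funcs[idx];
            unsigned int worst = f->frame_size;
            int i;

            for (i = 0; i < f->nr_calls; i++) {
                    unsigned int depth = f->calls[i].offset +
                                         max_stack_depth(funcs, f->calls[i].callee);

                    if (depth > worst)
                            worst = depth;
            }

            return worst;
    }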

My 'gut feel' from calculating the stack use that way for an embedded
system back in the early 1980s is that the max use will be inside
printk() inside an obscure error path, and if you actually hit it
things will explode.
(We didn't have enough memory to allocate big enough stacks!)

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2024-03-13 10:31:58

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [RFC 11/14] x86: add support for Dynamic Kernel Stacks

On Mon, Mar 11 2024 at 16:46, Pasha Tatashin wrote:
> @@ -413,6 +413,9 @@ DEFINE_IDTENTRY_DF(exc_double_fault)
> }
> #endif
>
> + if (dynamic_stack_fault(current, address))
> + return;
> +
> irqentry_nmi_enter(regs);
> instrumentation_begin();
> notify_die(DIE_TRAP, str, regs, error_code, X86_TRAP_DF, SIGSEGV);
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index d6375b3c633b..651c558b10eb 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -1198,6 +1198,9 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code,
> if (is_f00f_bug(regs, hw_error_code, address))
> return;
>
> + if (dynamic_stack_fault(current, address))
> + return;

T1 schedules out with stack used close to the fault boundary.

switch_to(T2)

Now T1 schedules back in

switch_to(T1)
__switch_to_asm()
...
switch_stacks() <- SP on T1 stack
! ...
! jmp __switch_to()
! __switch_to()
! ...
! raw_cpu_write(pcpu_hot.current_task, next_p);

After switching SP to T1's stack and up to the point where
pcpu_hot.current_task (aka current) is updated to T1 a stack fault will
invoke dynamic_stack_fault(T2, address) which will return false here:

/* check if address is inside the kernel stack area */
stack = (unsigned long)tsk->stack;
if (address < stack || address >= stack + THREAD_SIZE)
return false;

because T2's stack does obviously not cover the faulting address on T1's
stack. As a consequence double fault will panic the machine.

Thanks,

tglx

2024-03-13 13:44:36

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [RFC 11/14] x86: add support for Dynamic Kernel Stacks

On Wed, Mar 13, 2024 at 6:23 AM Thomas Gleixner <[email protected]> wrote:
>
> On Mon, Mar 11 2024 at 16:46, Pasha Tatashin wrote:
> > @@ -413,6 +413,9 @@ DEFINE_IDTENTRY_DF(exc_double_fault)
> > }
> > #endif
> >
> > + if (dynamic_stack_fault(current, address))
> > + return;
> > +
> > irqentry_nmi_enter(regs);
> > instrumentation_begin();
> > notify_die(DIE_TRAP, str, regs, error_code, X86_TRAP_DF, SIGSEGV);
> > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> > index d6375b3c633b..651c558b10eb 100644
> > --- a/arch/x86/mm/fault.c
> > +++ b/arch/x86/mm/fault.c
> > @@ -1198,6 +1198,9 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code,
> > if (is_f00f_bug(regs, hw_error_code, address))
> > return;
> >
> > + if (dynamic_stack_fault(current, address))
> > + return;
>
> T1 schedules out with stack used close to the fault boundary.
>
> switch_to(T2)
>
> Now T1 schedules back in
>
> switch_to(T1)
> __switch_to_asm()
> ...
> switch_stacks() <- SP on T1 stack
> ! ...
> ! jmp __switch_to()
> ! __switch_to()
> ! ...
> ! raw_cpu_write(pcpu_hot.current_task, next_p);
>
> After switching SP to T1's stack and up to the point where
> pcpu_hot.current_task (aka current) is updated to T1 a stack fault will
> invoke dynamic_stack_fault(T2, address) which will return false here:
>
> /* check if address is inside the kernel stack area */
> stack = (unsigned long)tsk->stack;
> if (address < stack || address >= stack + THREAD_SIZE)
> return false;
>
> because T2's stack does obviously not cover the faulting address on T1's
> stack. As a consequence double fault will panic the machine.

Hi Thomas,

Thank you, you are absolutely right, we can't trust "current" in the
fault handler.

We can change dynamic_stack_fault() to accept only the fault_address
as an argument, and let it determine the correct task_struct pointer
internally.

Here's a potential solution that is fast, avoids locking, and ensures atomicity:

1. Kernel Stack VA Space
Dedicate a virtual address range ([KSTACK_START_VA - KSTACK_END_VA])
exclusively for kernel stacks. This simplifies validation of faulting
addresses to be part of a stack.

2. Finding the faulty task
- Use ALIGN(fault_address, THREAD_SIZE) to calculate the end of the
topmost stack page (since stack addresses are aligned to THREAD_SIZE).
- Store the task_struct pointer as the last word on this topmost page,
which is always present as it is a pre-allocated stack page.

3. Stack Padding
Increase padding to 8 bytes on x86_64 (TOP_OF_KERNEL_STACK_PADDING 8)
to accommodate the task_struct pointer.
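
For illustration, a minimal sketch of how the fault path could then
find the owning task from the faulting address alone; KSTACK_START_VA,
KSTACK_END_VA and the helper name are assumptions of this proposal,
not existing interfaces:

    static struct task_struct *stack_fault_task(unsigned long fault_address)
    {
            unsigned long stack_end;

            /* Only addresses inside the dedicated kernel stack VA range qualify. */
            if (fault_address < KSTACK_START_VA || fault_address >= KSTACK_END_VA)
                    return NULL;

            /* Stacks are THREAD_SIZE aligned, so round up to the end of this stack. */
            stack_end = ALIGN(fault_address, THREAD_SIZE);

            /*
             * The topmost stack page is always present, and the owning
             * task_struct pointer is stored in its last word (hence the
             * 8 bytes of TOP_OF_KERNEL_STACK_PADDING).
             */
            return *(struct task_struct **)(stack_end - sizeof(struct task_struct *));
    }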

Another issue that this race brings is that 3 pages per-CPU might not
be enough; we might need up to 6 pages: 3 to cover the going-away task,
and 3 to cover the new task.

Pasha

2024-03-13 15:29:15

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [RFC 11/14] x86: add support for Dynamic Kernel Stacks

On Wed, Mar 13, 2024 at 9:43 AM Pasha Tatashin
<[email protected]> wrote:
>
> On Wed, Mar 13, 2024 at 6:23 AM Thomas Gleixner <[email protected]> wrote:
> >
> > On Mon, Mar 11 2024 at 16:46, Pasha Tatashin wrote:
> > > @@ -413,6 +413,9 @@ DEFINE_IDTENTRY_DF(exc_double_fault)
> > > }
> > > #endif
> > >
> > > + if (dynamic_stack_fault(current, address))
> > > + return;
> > > +
> > > irqentry_nmi_enter(regs);
> > > instrumentation_begin();
> > > notify_die(DIE_TRAP, str, regs, error_code, X86_TRAP_DF, SIGSEGV);
> > > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> > > index d6375b3c633b..651c558b10eb 100644
> > > --- a/arch/x86/mm/fault.c
> > > +++ b/arch/x86/mm/fault.c
> > > @@ -1198,6 +1198,9 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code,
> > > if (is_f00f_bug(regs, hw_error_code, address))
> > > return;
> > >
> > > + if (dynamic_stack_fault(current, address))
> > > + return;
> >
> > T1 schedules out with stack used close to the fault boundary.
> >
> > switch_to(T2)
> >
> > Now T1 schedules back in
> >
> > switch_to(T1)
> > __switch_to_asm()
> > ...
> > switch_stacks() <- SP on T1 stack
> > ! ...
> > ! jmp __switch_to()
> > ! __switch_to()
> > ! ...
> > ! raw_cpu_write(pcpu_hot.current_task, next_p);
> >
> > After switching SP to T1's stack and up to the point where
> > pcpu_hot.current_task (aka current) is updated to T1 a stack fault will
> > invoke dynamic_stack_fault(T2, address) which will return false here:
> >
> > /* check if address is inside the kernel stack area */
> > stack = (unsigned long)tsk->stack;
> > if (address < stack || address >= stack + THREAD_SIZE)
> > return false;
> >
> > because T2's stack does obviously not cover the faulting address on T1's
> > stack. As a consequence double fault will panic the machine.
>
> Hi Thomas,
>
> Thank you, you are absolutely right, we can't trust "current" in the
> fault handler.
>
> We can change dynamic_stack_fault() to only accept fault_address as an
> argument, and let it determine the right task_struct pointer
> internally.
>
> Let's modify dynamic_stack_fault() to accept only the fault_address.
> It can then determine the correct task_struct pointer internally.
>
> Here's a potential solution that is fast, avoids locking, and ensures atomicity:
>
> 1. Kernel Stack VA Space
> Dedicate a virtual address range ([KSTACK_START_VA - KSTACK_END_VA])
> exclusively for kernel stacks. This simplifies validation of faulting
> addresses to be part of a stack.
>
> 2. Finding the faulty task
> - Use ALIGN(fault_address, THREAD_SIZE) to calculate the end of the
> topmost stack page (since stack addresses are aligned to THREAD_SIZE).
> - Store the task_struct pointer as the last word on this topmost page,
> that is always present as it is a pre-allcated stack page.
>
> 3. Stack Padding
> Increase padding to 8 bytes on x86_64 (TOP_OF_KERNEL_STACK_PADDING 8)
> to accommodate the task_struct pointer.

Alternatively, do not even look up the task_struct in
dynamic_stack_fault(), but only install the mapping to the faulting
address, store the VA in a per-CPU array, and handle the rest in
dynamic_stack() during context switching. At that point spin locks can
be taken, and we can do a find_vm_area(addr) call.

This way, we would not need to modify TOP_OF_KERNEL_STACK_PADDING to
keep the task_struct pointer in there.
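
A rough sketch of this variant, reusing the KSTACK_* range idea from
the previous mail; dstack_pending_va and the elided page-table bits
are illustrative assumptions only:

    static DEFINE_PER_CPU(unsigned long, dstack_pending_va);

    /* Fault path: no locks, and no reliance on 'current'. */
    bool dynamic_stack_fault(unsigned long address)
    {
            if (address < KSTACK_START_VA || address >= KSTACK_END_VA)
                    return false;

            /* ... map one of the per-CPU pre-allocated pages at the faulting page ... */

            this_cpu_write(dstack_pending_va, address & PAGE_MASK);
            return true;
    }

    /* Context switch path: spin locks can be taken here. */
    void dynamic_stack(void)
    {
            unsigned long va = this_cpu_read(dstack_pending_va);
            struct vm_struct *vm_area;

            if (!va)
                    return;

            this_cpu_write(dstack_pending_va, 0);
            vm_area = find_vm_area((void *)va);
            /* ... account the new page to the owning stack and refill the per-CPU pool ... */
    }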

Pasha

2024-03-13 16:12:41

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [RFC 11/14] x86: add support for Dynamic Kernel Stacks

On Wed, Mar 13 2024 at 11:28, Pasha Tatashin wrote:
> On Wed, Mar 13, 2024 at 9:43 AM Pasha Tatashin
> <[email protected]> wrote:
>> Here's a potential solution that is fast, avoids locking, and ensures atomicity:
>>
>> 1. Kernel Stack VA Space
>> Dedicate a virtual address range ([KSTACK_START_VA - KSTACK_END_VA])
>> exclusively for kernel stacks. This simplifies validation of faulting
>> addresses to be part of a stack.
>>
>> 2. Finding the faulty task
>> - Use ALIGN(fault_address, THREAD_SIZE) to calculate the end of the
>> topmost stack page (since stack addresses are aligned to THREAD_SIZE).
>> - Store the task_struct pointer as the last word on this topmost page,
>> that is always present as it is a pre-allcated stack page.
>>
>> 3. Stack Padding
>> Increase padding to 8 bytes on x86_64 (TOP_OF_KERNEL_STACK_PADDING 8)
>> to accommodate the task_struct pointer.
>
> Alternatively, do not even look-up the task_struct in
> dynamic_stack_fault(), but only install the mapping to the faulting
> address, store va in the per-cpu array, and handle the rest in
> dynamic_stack() during context switching. At that time spin locks can
> be taken, and we can do a find_vm_area(addr) call.
>
> This way, we would not need to modify TOP_OF_KERNEL_STACK_PADDING to
> keep task_struct in there.

Why not simply do the 'current' update right next to the stack
switching in __switch_to_asm(), which has no way of faulting?

That needs to validate whether anything uses current between the stack
switch and the place where current is updated today. I think nothing
should do so, but I would not be surprised either if it were the
case. Such code would already today just work by chance, I think.

That should not be hard to analyze and fix up if necessary.

So that's fixable, but I'm not really convinced that all of this is safe
and correct under all circumstances. That needs a lot more analysis than
just the trivial one I did for switch_to().

Thanks,

tglx

2024-03-14 07:55:19

by Christophe Leroy

[permalink] [raw]
Subject: Re: [RFC 06/14] fork: zero vmap stack using clear_page() instead of memset()



Le 12/03/2024 à 17:53, Pasha Tatashin a écrit :
> On Tue, Mar 12, 2024 at 3:16 AM Nikolay Borisov <[email protected]> wrote:
>>
>>
>>
>> On 11.03.24 г. 18:46 ч., Pasha Tatashin wrote:
>>> In preporation for dynamic kernel stacks do not zero the whole span of
>>> the stack, but instead only the pages that are part of the vm_area.
>>>
>>> This is because with dynamic stacks we might have only partially
>>> populated stacks.
>>>
>>> Signed-off-by: Pasha Tatashin <[email protected]>
>>> ---
>>> kernel/fork.c | 6 ++++--
>>> 1 file changed, 4 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/kernel/fork.c b/kernel/fork.c
>>> index 6a2f2c85e09f..41e0baee79d2 100644
>>> --- a/kernel/fork.c
>>> +++ b/kernel/fork.c
>>> @@ -263,8 +263,8 @@ static int memcg_charge_kernel_stack(struct vm_struct *vm)
>>> static int alloc_thread_stack_node(struct task_struct *tsk, int node)
>>> {
>>> struct vm_struct *vm_area;
>>> + int i, j, nr_pages;
>>> void *stack;
>>> - int i;
>>>
>>> for (i = 0; i < NR_CACHED_STACKS; i++) {
>>> vm_area = this_cpu_xchg(cached_stacks[i], NULL);
>>> @@ -282,7 +282,9 @@ static int alloc_thread_stack_node(struct task_struct *tsk, int node)
>>> stack = kasan_reset_tag(vm_area->addr);
>>>
>>> /* Clear stale pointers from reused stack. */
>>> - memset(stack, 0, THREAD_SIZE);
>>> + nr_pages = vm_area->nr_pages;
>>> + for (j = 0; j < nr_pages; j++)
>>> + clear_page(page_address(vm_area->pages[j]));
>>
>> Can't this be memset(stack, 0, nr_pages*PAGE_SIZE) ?
>
> No, we can't, because the pages can be physically discontiguous.
>

But the pages were already physically discontiguous before your change,
so what's the difference?

It doesn't matter that the pages are physically discontiguous as long as
they are virtually contiguous, which should still be the case here for a
stack.

Nevertheless, from the powerpc point of view I'm happy with clear_page(),
which is more optimised than memset(0).

Christophe

2024-03-14 14:03:15

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [RFC 06/14] fork: zero vmap stack using clear_page() instead of memset()

> But the pages were already physically discontiguous before your change,
> what's the difference ?

Pages were not physically contiguous before my change. They were
allocated with __vmalloc_node_range(), which allocates sparse pages and
maps them to a virtually contiguous span of memory within the
[VMALLOC_START, VMALLOC_END) range.

> It doesn't matter that the pages are physically discontiguous as far as
> they are virtually contiguous, which should still be the case here for a
> stack.

This patch is a preparation patch for the "dynamic kernel stack"
feature; in the description it says:
This is because with dynamic stacks we might have only partially
populated stacks.

We could compute the populated part of the stack, and determine its
start and end mapped VA range by using vm_area->pages[] and
vm_area->nr_pages, but that would make the code a little uglier,
especially because we would need to take into account whether the
stack grows up or down. Therefore, using clear_page() is simpler and
should be fast enough.

Thanks,
Pasha

2024-03-14 14:04:44

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [RFC 11/14] x86: add support for Dynamic Kernel Stacks

On Wed, Mar 13, 2024 at 12:12 PM Thomas Gleixner <[email protected]> wrote:
>
> On Wed, Mar 13 2024 at 11:28, Pasha Tatashin wrote:
> > On Wed, Mar 13, 2024 at 9:43 AM Pasha Tatashin
> > <[email protected]> wrote:
> >> Here's a potential solution that is fast, avoids locking, and ensures atomicity:
> >>
> >> 1. Kernel Stack VA Space
> >> Dedicate a virtual address range ([KSTACK_START_VA - KSTACK_END_VA])
> >> exclusively for kernel stacks. This simplifies validation of faulting
> >> addresses to be part of a stack.
> >>
> >> 2. Finding the faulty task
> >> - Use ALIGN(fault_address, THREAD_SIZE) to calculate the end of the
> >> topmost stack page (since stack addresses are aligned to THREAD_SIZE).
> >> - Store the task_struct pointer as the last word on this topmost page,
> >> that is always present as it is a pre-allcated stack page.
> >>
> >> 3. Stack Padding
> >> Increase padding to 8 bytes on x86_64 (TOP_OF_KERNEL_STACK_PADDING 8)
> >> to accommodate the task_struct pointer.
> >
> > Alternatively, do not even look-up the task_struct in
> > dynamic_stack_fault(), but only install the mapping to the faulting
> > address, store va in the per-cpu array, and handle the rest in
> > dynamic_stack() during context switching. At that time spin locks can
> > be taken, and we can do a find_vm_area(addr) call.
> >
> > This way, we would not need to modify TOP_OF_KERNEL_STACK_PADDING to
> > keep task_struct in there.
>
> Why not simply doing the 'current' update right next to the stack
> switching in __switch_to_asm() which has no way of faulting.
>
> That needs to validate whether anything uses current between the stack
> switch and the place where current is updated today. I think nothing
> should do so, but I would not be surprised either if it would be the
> case. Such code would already today just work by chance I think,
>
> That should not be hard to analyze and fixup if necessary.
>
> So that's fixable, but I'm not really convinced that all of this is safe
> and correct under all circumstances. That needs a lot more analysis than
> just the trivial one I did for switch_to().

Agreed, if the current task pointer can be switched later, after loads
and stores to the stack, that would be a better solution. I will
incorporate this approach into my next version.

I also concur that this proposal necessitates more rigorous analysis.
This work remains in the investigative phase, where I am seeking a
viable solution to the problem.

The core issue is that kernel stacks consume excessive memory for
certain workloads. However, we cannot simply reduce their size, as
this leads to machine crashes in the infrequent instances where stacks
do run deep.

Thanks,
Pasha

2024-03-14 15:18:44

by Jeff Xie

[permalink] [raw]
Subject: Re: [RFC 08/14] fork: separate vmap stack alloction and free calls

On Tue, Mar 12, 2024 at 12:47 AM Pasha Tatashin
<[email protected]> wrote:
>
> In preparation for the dynamic stacks, separate out the
> __vmalloc_node_range and vfree calls from the vmap based stack
> allocations. The dynamic stacks will use their own variants of these
> functions.
>
> Signed-off-by: Pasha Tatashin <[email protected]>
> ---
> kernel/fork.c | 53 ++++++++++++++++++++++++++++++---------------------
> 1 file changed, 31 insertions(+), 22 deletions(-)
>
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 3004e6ce6c65..bbae5f705773 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -204,6 +204,29 @@ static bool try_release_thread_stack_to_cache(struct vm_struct *vm_area)
> return false;
> }
>
> +static inline struct vm_struct *alloc_vmap_stack(int node)
> +{
> + void *stack;
> +
> + /*
> + * Allocated stacks are cached and later reused by new threads,
> + * so memcg accounting is performed manually on assigning/releasing
> + * stacks to tasks. Drop __GFP_ACCOUNT.
> + */
> + stack = __vmalloc_node_range(THREAD_SIZE, THREAD_ALIGN,
> + VMALLOC_START, VMALLOC_END,
> + THREADINFO_GFP & ~__GFP_ACCOUNT,
> + PAGE_KERNEL,
> + 0, node, __builtin_return_address(0));
> +
> + return (stack) ? find_vm_area(stack) : NULL;
> +}
> +
> +static inline void free_vmap_stack(struct vm_struct *vm_area)
> +{
> + vfree(vm_area->addr);
> +}
> +
> static void thread_stack_free_rcu(struct rcu_head *rh)
> {
> struct vm_stack *vm_stack = container_of(rh, struct vm_stack, rcu);
> @@ -212,7 +235,7 @@ static void thread_stack_free_rcu(struct rcu_head *rh)
> if (try_release_thread_stack_to_cache(vm_stack->stack_vm_area))
> return;
>
> - vfree(vm_area->addr);
> + free_vmap_stack(vm_area);
> }

I've discovered that the function free_vmap_stack() can trigger a warning.
It appears that free_vmap_stack() should handle interrupt context and
task context separately, as vfree() does.

[root@JeffXie ]# poweroff
[root@JeffXie ]# umount: devtmpfs busy - remounted read-only
[ 93.036872] EXT4-fs (vda): re-mounted
2e1f057b-471f-4c08-a7b8-611457b221f2 ro. Quota mode: none.
The system is going down NOW!
Sent SIGTERM to all processes
Sent SIGKILL to all processes
Requesting system poweroff
[ 94.043540] ------------[ cut here ]------------
[ 94.043977] WARNING: CPU: 0 PID: 0 at kernel/smp.c:786
smp_call_function_many_cond+0x4e5/0x550
[ 94.044744] Modules linked in:
[ 94.045024] CPU: 0 PID: 0 Comm: swapper/0 Not tainted
6.8.0-00014-g82270db6e1f0 #91
[ 94.045697] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009),
BIOS 1.15.0-1 04/01/2014
[ 94.046399] RIP: 0010:smp_call_function_many_cond+0x4e5/0x550
[ 94.046914] Code: 48 8b 78 08 48 c7 c1 a0 84 16 81 4c 89 f6 e8 22
11 f6 ff 65 ff 0d 23 38 ec 7e 0f 85 a1 fc ff ff 0f 1f 44 00 00 e9 97
fc ff ff <0f> 0b e9 61
[ 94.048509] RSP: 0018:ffffc90000003e48 EFLAGS: 00010206
[ 94.048965] RAX: ffffffff82cb3fd0 RBX: ffff88811862cbc0 RCX: 0000000000000003
[ 94.049598] RDX: 0000000000000100 RSI: 0000000000000000 RDI: 0000000000000000
[ 94.050226] RBP: ffff8881052c5090 R08: 0000000000000000 R09: 0000000000000001
[ 94.050861] R10: ffffffff82a060c0 R11: 0000000000008847 R12: ffff888102eb3500
[ 94.051480] R13: ffff88811862b800 R14: ffff88811862cc38 R15: 0000000000000000
[ 94.052109] FS: 0000000000000000(0000) GS:ffff888118600000(0000)
knlGS:0000000000000000
[ 94.052812] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 94.053318] CR2: 00000000004759e0 CR3: 0000000002a2e000 CR4: 0000000000750ef0
[ 94.053955] PKRU: 55555554
[ 94.054203] Call Trace:
[ 94.054433] <IRQ>
[ 94.054632] ? __warn+0x84/0x140
[ 94.054925] ? smp_call_function_many_cond+0x4e5/0x550
[ 94.055362] ? report_bug+0x199/0x1b0
[ 94.055697] ? handle_bug+0x3c/0x70
[ 94.056010] ? exc_invalid_op+0x18/0x70
[ 94.056350] ? asm_exc_invalid_op+0x1a/0x20
[ 94.056728] ? smp_call_function_many_cond+0x4e5/0x550
[ 94.057179] ? __pfx_do_kernel_range_flush+0x10/0x10
[ 94.057622] on_each_cpu_cond_mask+0x24/0x40
[ 94.057999] flush_tlb_kernel_range+0x98/0xb0
[ 94.058390] free_unmap_vmap_area+0x2d/0x40
[ 94.058768] remove_vm_area+0x3a/0x70
[ 94.059094] free_vmap_stack+0x15/0x60
[ 94.059427] rcu_core+0x2bf/0x980
[ 94.059735] ? rcu_core+0x244/0x980
[ 94.060046] ? kvm_clock_get_cycles+0x18/0x30
[ 94.060431] __do_softirq+0xc2/0x292
[ 94.060760] irq_exit_rcu+0x6a/0x90
[ 94.061074] sysvec_apic_timer_interrupt+0x6e/0x90
[ 94.061507] </IRQ>
[ 94.061704] <TASK>
[ 94.061903] asm_sysvec_apic_timer_interrupt+0x1a/0x20
[ 94.062367] RIP: 0010:default_idle+0xf/0x20
[ 94.062746] Code: 4c 01 c7 4c 29 c2 e9 72 ff ff ff 90 90 90 90 90
90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa eb 07 0f 00 2d 33 b4 2a
00 fb f4 <fa> c3 cc c0
[ 94.064342] RSP: 0018:ffffffff82a03e70 EFLAGS: 00000212
[ 94.064805] RAX: ffff888118628608 RBX: ffffffff82a0c980 RCX: 0000000000000000
[ 94.065429] RDX: 4000000000000000 RSI: ffffffff82725be8 RDI: 000000000000a14c
[ 94.066066] RBP: 0000000000000000 R08: 000000000000a14c R09: 0000000000000001
[ 94.066705] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 94.067311] R13: 0000000000000000 R14: ffffffff82a0c030 R15: 00000000000000ac
[ 94.067936] default_idle_call+0x2c/0xd0
[ 94.068284] do_idle+0x1ce/0x210
[ 94.068584] cpu_startup_entry+0x2a/0x30
[ 94.068931] rest_init+0xc5/0xd0
[ 94.069224] arch_call_rest_init+0xe/0x30
[ 94.069597] start_kernel+0x58e/0x8d0
[ 94.069929] x86_64_start_reservations+0x18/0x30
[ 94.070353] x86_64_start_kernel+0xc6/0xe0
[ 94.070725] secondary_startup_64_no_verify+0x16d/0x17b
[ 94.071189] </TASK>
[ 94.071392] ---[ end trace 0000000000000000 ]---
[ 95.040718] e1000e: EEE TX LPI TIMER: 00000000
[ 95.055005] ACPI: PM: Preparing to enter system sleep state S5
[ 95.055619] reboot: Power down


./scripts/faddr2line ./vmlinux smp_call_function_many_cond+0x4e5/0x550
smp_call_function_many_cond+0x4e5/0x550:
smp_call_function_many_cond at kernel/smp.c:786 (discriminator 1)

756 static void smp_call_function_many_cond(const struct cpumask *mask,
757 smp_call_func_t func, void *info,
758 unsigned int scf_flags,
759 smp_cond_func_t cond_func)
[...]
781 * When @wait we can deadlock when we interrupt between llist_add() and
782 * arch_send_call_function_ipi*(); when !@wait we can deadlock due to
783 * csd_lock() on because the interrupt context uses the same csd
784 * storage.
785 */
786 WARN_ON_ONCE(!in_task());
// <<< warning here
[...]



--
Thanks,
JeffXie

2024-03-14 17:15:45

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [RFC 08/14] fork: separate vmap stack alloction and free calls

> I've discovered that the function free_vmap_stack() can trigger a warning.
> It appears that free_vmap_stack() should handle interrupt context and
> task context separately as vfree().

Hi Jeff,

Thank you for reporting this. Yes, it appears free_vmap_stack() may
get called from interrupt context, and yet we call
remove_vm_area(), which takes locks. I will fix it in the next version,
similarly to what you suggested, by adding an in_interrupt() case.

Thank you,
Pasha

> [root@JeffXie ]# poweroff
> [root@JeffXie ]# umount: devtmpfs busy - remounted read-only
> [ 93.036872] EXT4-fs (vda): re-mounted
> 2e1f057b-471f-4c08-a7b8-611457b221f2 ro. Quota mode: none.
> The system is going down NOW!
> Sent SIGTERM to all processes
> Sent SIGKILL to all processes
> Requesting system poweroff
> [ 94.043540] ------------[ cut here ]------------
> [ 94.043977] WARNING: CPU: 0 PID: 0 at kernel/smp.c:786
> smp_call_function_many_cond+0x4e5/0x550
> [ 94.044744] Modules linked in:
> [ 94.045024] CPU: 0 PID: 0 Comm: swapper/0 Not tainted
> 6.8.0-00014-g82270db6e1f0 #91
> [ 94.045697] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009),
> BIOS 1.15.0-1 04/01/2014
> [ 94.046399] RIP: 0010:smp_call_function_many_cond+0x4e5/0x550
> [ 94.046914] Code: 48 8b 78 08 48 c7 c1 a0 84 16 81 4c 89 f6 e8 22
> 11 f6 ff 65 ff 0d 23 38 ec 7e 0f 85 a1 fc ff ff 0f 1f 44 00 00 e9 97
> fc ff ff <0f> 0b e9 61
> [ 94.048509] RSP: 0018:ffffc90000003e48 EFLAGS: 00010206
> [ 94.048965] RAX: ffffffff82cb3fd0 RBX: ffff88811862cbc0 RCX: 0000000000000003
> [ 94.049598] RDX: 0000000000000100 RSI: 0000000000000000 RDI: 0000000000000000
> [ 94.050226] RBP: ffff8881052c5090 R08: 0000000000000000 R09: 0000000000000001
> [ 94.050861] R10: ffffffff82a060c0 R11: 0000000000008847 R12: ffff888102eb3500
> [ 94.051480] R13: ffff88811862b800 R14: ffff88811862cc38 R15: 0000000000000000
> [ 94.052109] FS: 0000000000000000(0000) GS:ffff888118600000(0000)
> knlGS:0000000000000000
> [ 94.052812] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 94.053318] CR2: 00000000004759e0 CR3: 0000000002a2e000 CR4: 0000000000750ef0
> [ 94.053955] PKRU: 55555554
> [ 94.054203] Call Trace:
> [ 94.054433] <IRQ>
> [ 94.054632] ? __warn+0x84/0x140
> [ 94.054925] ? smp_call_function_many_cond+0x4e5/0x550
> [ 94.055362] ? report_bug+0x199/0x1b0
> [ 94.055697] ? handle_bug+0x3c/0x70
> [ 94.056010] ? exc_invalid_op+0x18/0x70
> [ 94.056350] ? asm_exc_invalid_op+0x1a/0x20
> [ 94.056728] ? smp_call_function_many_cond+0x4e5/0x550
> [ 94.057179] ? __pfx_do_kernel_range_flush+0x10/0x10
> [ 94.057622] on_each_cpu_cond_mask+0x24/0x40
> [ 94.057999] flush_tlb_kernel_range+0x98/0xb0
> [ 94.058390] free_unmap_vmap_area+0x2d/0x40
> [ 94.058768] remove_vm_area+0x3a/0x70
> [ 94.059094] free_vmap_stack+0x15/0x60
> [ 94.059427] rcu_core+0x2bf/0x980
> [ 94.059735] ? rcu_core+0x244/0x980
> [ 94.060046] ? kvm_clock_get_cycles+0x18/0x30
> [ 94.060431] __do_softirq+0xc2/0x292
> [ 94.060760] irq_exit_rcu+0x6a/0x90
> [ 94.061074] sysvec_apic_timer_interrupt+0x6e/0x90
> [ 94.061507] </IRQ>
> [ 94.061704] <TASK>
> [ 94.061903] asm_sysvec_apic_timer_interrupt+0x1a/0x20
> [ 94.062367] RIP: 0010:default_idle+0xf/0x20
> [ 94.062746] Code: 4c 01 c7 4c 29 c2 e9 72 ff ff ff 90 90 90 90 90
> 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa eb 07 0f 00 2d 33 b4 2a
> 00 fb f4 <fa> c3 cc c0
> [ 94.064342] RSP: 0018:ffffffff82a03e70 EFLAGS: 00000212
> [ 94.064805] RAX: ffff888118628608 RBX: ffffffff82a0c980 RCX: 0000000000000000
> [ 94.065429] RDX: 4000000000000000 RSI: ffffffff82725be8 RDI: 000000000000a14c
> [ 94.066066] RBP: 0000000000000000 R08: 000000000000a14c R09: 0000000000000001
> [ 94.066705] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> [ 94.067311] R13: 0000000000000000 R14: ffffffff82a0c030 R15: 00000000000000ac
> [ 94.067936] default_idle_call+0x2c/0xd0
> [ 94.068284] do_idle+0x1ce/0x210
> [ 94.068584] cpu_startup_entry+0x2a/0x30
> [ 94.068931] rest_init+0xc5/0xd0
> [ 94.069224] arch_call_rest_init+0xe/0x30
> [ 94.069597] start_kernel+0x58e/0x8d0
> [ 94.069929] x86_64_start_reservations+0x18/0x30
> [ 94.070353] x86_64_start_kernel+0xc6/0xe0
> [ 94.070725] secondary_startup_64_no_verify+0x16d/0x17b
> [ 94.071189] </TASK>
> [ 94.071392] ---[ end trace 0000000000000000 ]---
> [ 95.040718] e1000e: EEE TX LPI TIMER: 00000000
> [ 95.055005] ACPI: PM: Preparing to enter system sleep state S5
> [ 95.055619] reboot: Power down
>
>
> ./scripts/faddr2line ./vmlinux smp_call_function_many_cond+0x4e5/0x550
> smp_call_function_many_cond+0x4e5/0x550:
> smp_call_function_many_cond at kernel/smp.c:786 (discriminator 1)
>
> 756 static void smp_call_function_many_cond(const struct cpumask *mask,
> 757 smp_call_func_t func, void *info,
> 758 unsigned int scf_flags,
> 759 smp_cond_func_t cond_func)
> [...]
> 781 * When @wait we can deadlock when we interrupt between
> llist_add() and
> 782 * arch_send_call_function_ipi*(); when !@wait we can
> deadlock due to
> 783 * csd_lock() on because the interrupt context uses the same csd
> 784 * storage.
> 785 */
> 786 WARN_ON_ONCE(!in_task());
> // <<< warning here
> [...]
>
>
>
> --
> Thanks,
> JeffXie

2024-03-14 18:26:26

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [RFC 11/14] x86: add support for Dynamic Kernel Stacks

On Thu, Mar 14 2024 at 10:03, Pasha Tatashin wrote:
> On Wed, Mar 13, 2024 at 12:12 PM Thomas Gleixner <[email protected]> wrote:
>> That needs to validate whether anything uses current between the stack
>> switch and the place where current is updated today. I think nothing
>> should do so, but I would not be surprised either if it would be the
>> case. Such code would already today just work by chance I think,
>>
>> That should not be hard to analyze and fixup if necessary.
>>
>> So that's fixable, but I'm not really convinced that all of this is safe
>> and correct under all circumstances. That needs a lot more analysis than
>> just the trivial one I did for switch_to().
>
> Agreed, if the current task pointer can be switched later, after loads
> and stores to the stack, that would be a better solution. I will
> incorporate this approach into my next version.

No. You need to ensure that there is neither a load nor a store on the
stack between:

movq %rsp, TASK_threadsp(%rdi)
movq TASK_threadsp(%rsi), %rsp

and update_current(). IOW, you need to move the update of
pcpu_hot.current to ASM right after the RSP switch.

> I also concur that this proposal necessitates more rigorous analysis.

Glad we agree here :)

Thanks,

tglx

2024-03-14 19:06:01

by Kent Overstreet

[permalink] [raw]
Subject: Re: [RFC 00/14] Dynamic Kernel Stacks

On Tue, Mar 12, 2024 at 02:36:27PM -0700, H. Peter Anvin wrote:
> On 3/12/24 12:45, Pasha Tatashin wrote:
> > >
> > > Ok, first of all, talking about "kernel memory" here is misleading.
> >
> > Hi Peter,
> >
> > I re-read my cover letter, and I do not see where "kernel memory" is
> > mentioned. We are talking about kernel stacks overhead that is
> > proportional to the user workload, as every active thread has an
> > associated kernel stack. The idea is to save memory by not
> > pre-allocating all pages of kernel-stacks, but instead use it as a
> > safeguard when a stack actually becomes deep. Come-up with a solution
> > that can handle rare deeper stacks only when needed. This could be
> > done through faulting on the supported hardware (as proposed in this
> > series), or via pre-map on every schedule event, and checking the
> > access when thread goes off cpu (as proposed by Andy Lutomirski to
> > avoid double faults on x86) .
> >
> > In other words, this feature is only about one very specific type of
> > kernel memory that is not even directly mapped (the feature required
> > vmapped stacks).
> >
> > > Unless your threads are spending nearly all their time sleeping, the
> > > threads will occupy stack and TLS memory in user space as well.
> >
> > Can you please elaborate, what data is contained in the kernel stack
> > when thread is in user space? My series requires thread_info not to be
> > in the stack by depending on THREAD_INFO_IN_TASK.
> >
>
> My point is that what matters is total memory use, not just memory used in
> the kernel. Amdahl's law.

If userspace is running a few processes with many threads and the
userspace stacks are small, kernel stacks could end up dominating.

I'd like to see some numbers though.

2024-03-14 19:24:06

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [RFC 00/14] Dynamic Kernel Stacks

> >
> > My point is that what matters is total memory use, not just memory used in
> > the kernel. Amdahl's law.
>
> If userspace is running a few processes with many threads and the
> userspace stacks are small, kernel stacks could end up dominating.
>
> I'd like to see some numbers though.

The unused kernel stack pages occupy petabytes of memory across the fleet [1].

I also submitted a patch [2] that can help visualize the maximum stack
page access distribution.

[1] https://lore.kernel.org/all/CA+CK2bBYt9RAVqASB2eLyRQxYT5aiL0fGhUu3TumQCyJCNTWvw@mail.gmail.com
[2] https://lore.kernel.org/all/[email protected]

2024-03-14 19:29:17

by Kent Overstreet

[permalink] [raw]
Subject: Re: [RFC 00/14] Dynamic Kernel Stacks

On Thu, Mar 14, 2024 at 03:23:08PM -0400, Pasha Tatashin wrote:
> > >
> > > My point is that what matters is total memory use, not just memory used in
> > > the kernel. Amdahl's law.
> >
> > If userspace is running a few processes with many threads and the
> > userspace stacks are small, kernel stacks could end up dominating.
> >
> > I'd like to see some numbers though.
>
> The unused kernel stack pages occupy petabytes of memory across the fleet [1].

Raw number doesn't mean much here (I know how many machines Google has,
of course it's going to be petabytes ;), percentage of system memory
would be better.

What I'd _really_ like to see is raw output from memory allocation
profiling, so we can see how much memory is going to kernel stacks vs.
other kernel allocations.

Number of kernel threads vs. number of user threads would also be good
to know - I've been seeing ps output lately where we've got a lot more
workqueue workers than we should, perhaps that's something that could be
addressed.

2024-03-14 19:34:50

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [RFC 00/14] Dynamic Kernel Stacks

On Thu, Mar 14, 2024 at 3:29 PM Kent Overstreet
<[email protected]> wrote:
>
> On Thu, Mar 14, 2024 at 03:23:08PM -0400, Pasha Tatashin wrote:
> > > >
> > > > My point is that what matters is total memory use, not just memory used in
> > > > the kernel. Amdahl's law.
> > >
> > > If userspace is running a few processes with many threads and the
> > > userspace stacks are small, kernel stacks could end up dominating.
> > >
> > > I'd like to see some numbers though.
> >
> > The unused kernel stack pages occupy petabytes of memory across the fleet [1].
>
> Raw number doesn't mean much here (I know how many machines Google has,
> of course it's going to be petabytes ;), percentage of system memory
> would be better.
>
> What I'd _really_ like to see is raw output from memory allocation
> profiling, so we can see how much memory is going to kernel stacks vs.
> other kernel allocations.

I've heard there is memory profiling work that can help with that...

While I do not have the data you are asking for, data on the other
kernel allocations might be useful; this particular project, however,
is targeted at reducing overhead where the memory is not used, or is
used only in very rare, extreme cases.

> Number of kernel threads vs. number of user threads would also be good
> to know - I've been seeing ps output lately where we've got a lot more
> workqueue workers than we should, perhaps that's something that could be
> addressed.

Yes, doing other optimizations makes sense; reducing the total number
of kernel threads, if possible, might help as well. I will look into
this to see how many user threads vs. kernel threads we have.

2024-03-14 19:43:25

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [RFC 00/14] Dynamic Kernel Stacks

On Tue, Mar 12, 2024 at 10:18:10AM -0700, H. Peter Anvin wrote:
> Second, non-dynamic kernel memory is one of the core design decisions in
> Linux from early on. This means there are lot of deeply embedded assumptions
> which would have to be untangled.

I think there are other ways of getting the benefit that Pasha is seeking
without moving to dynamically allocated kernel memory. One icky thing
that XFS does is punt work over to a kernel thread in order to use more
stack! That breaks a number of things including lockdep (because the
kernel thread doesn't own the lock, the thread waiting for the kernel
thread owns the lock).

If we had segmented stacks, XFS could say "I need at least 6kB of stack",
and if less than that was available, we could allocate a temporary
stack and switch to it. I suspect Google would also be able to use this
API for their rare cases when they need more than 8kB of kernel stack.
Who knows, we might all be able to use such a thing.

I'd been thinking about this from the point of view of allocating more
stack elsewhere in kernel space, but combining what Pasha has done here
with this idea might lead to a hybrid approach that works better; allocate
32kB of vmap space per kernel thread, put 12kB of memory at the top of it,
rely on people using this "I need more stack" API correctly, and free the
excess pages on return to userspace. No complicated "switch stacks" API
needed, just an "ensure we have at least N bytes of stack remaining" API.
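
A minimal sketch of what that API might look like; ensure_stack(),
dynamic_stack_populate() and the lowest_populated field are invented
names layered on the 32kB/12kB layout described above:

    /*
     * Ask for @bytes of stack headroom before entering a known-deep path.
     * Populating the missing vmap pages may sleep, so this must be called
     * while sleeping is still allowed.
     */
    static inline void ensure_stack(unsigned long bytes)
    {
            unsigned long need = current_stack_pointer - bytes;

            if (need < current->lowest_populated)           /* hypothetical field */
                    dynamic_stack_populate(current, need);  /* hypothetical helper */
    }

XFS could then call something like ensure_stack(6 * 1024) at the top of
its deep paths instead of punting to a worker thread, and the extra
pages would be dropped again on return to userspace.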

2024-03-14 19:49:42

by Kent Overstreet

[permalink] [raw]
Subject: Re: [RFC 00/14] Dynamic Kernel Stacks

On Thu, Mar 14, 2024 at 03:34:03PM -0400, Pasha Tatashin wrote:
> On Thu, Mar 14, 2024 at 3:29 PM Kent Overstreet
> <[email protected]> wrote:
> >
> > On Thu, Mar 14, 2024 at 03:23:08PM -0400, Pasha Tatashin wrote:
> > > > >
> > > > > My point is that what matters is total memory use, not just memory used in
> > > > > the kernel. Amdahl's law.
> > > >
> > > > If userspace is running a few processes with many threads and the
> > > > userspace stacks are small, kernel stacks could end up dominating.
> > > >
> > > > I'd like to see some numbers though.
> > >
> > > The unused kernel stack pages occupy petabytes of memory across the fleet [1].
> >
> > Raw number doesn't mean much here (I know how many machines Google has,
> > of course it's going to be petabytes ;), percentage of system memory
> > would be better.
> >
> > What I'd _really_ like to see is raw output from memory allocation
> > profiling, so we can see how much memory is going to kernel stacks vs.
> > other kernel allocations.
>
> I've heard there is memory profiling working that can help with that...

I heard you've tried it out, too :)

> While I do not have the data you are asking for, the other kernel
> allocations might be useful, but this particular project is targeted
> to help with reducing overhead where the memory is not used, or used
> in very extreme rare cases.

Well, do you think you could gather it? We shouldn't be blindly applying
performance optimizations; we need to know where to focus our efforts.

e.g. on my laptop I've currently got 356 processes for < 6M of kernel
stack out of 32G total ram, so clearly this isn't much use to me. If the
ratio is similar on your servers - nah, don't want it. I expect the
ratio is not similar and you are burning proportionally more memory on
kernel stacks, but we still need to gather the data and do the math :)

>
> > Number of kernel threads vs. number of user threads would also be good
> > to know - I've been seeing ps output lately where we've got a lot more
> > workqueue workers than we should, perhaps that's something that could be
> > addressed.
>
> Yes, doing other optimizations make sense, reducing the total number
> kernel threads if possible might help as well. I will look into this
> as well to see how many user threads vs kernel threads we have.

Great, that will help too.

2024-03-14 19:54:02

by Kent Overstreet

[permalink] [raw]
Subject: Re: [RFC 00/14] Dynamic Kernel Stacks

On Thu, Mar 14, 2024 at 07:43:06PM +0000, Matthew Wilcox wrote:
> On Tue, Mar 12, 2024 at 10:18:10AM -0700, H. Peter Anvin wrote:
> > Second, non-dynamic kernel memory is one of the core design decisions in
> > Linux from early on. This means there are lot of deeply embedded assumptions
> > which would have to be untangled.
>
> I think there are other ways of getting the benefit that Pasha is seeking
> without moving to dynamically allocated kernel memory. One icky thing
> that XFS does is punt work over to a kernel thread in order to use more
> stack! That breaks a number of things including lockdep (because the
> kernel thread doesn't own the lock, the thread waiting for the kernel
> thread owns the lock).
>
> If we had segmented stacks, XFS could say "I need at least 6kB of stack",
> and if less than that was available, we could allocate a temporary
> stack and switch to it. I suspect Google would also be able to use this
> API for their rare cases when they need more than 8kB of kernel stack.
> Who knows, we might all be able to use such a thing.
>
> I'd been thinking about this from the point of view of allocating more
> stack elsewhere in kernel space, but combining what Pasha has done here
> with this idea might lead to a hybrid approach that works better; allocate
> 32kB of vmap space per kernel thread, put 12kB of memory at the top of it,
> rely on people using this "I need more stack" API correctly, and free the
> excess pages on return to userspace. No complicated "switch stacks" API
> needed, just an "ensure we have at least N bytes of stack remaining" API.

Why would we need an "I need more stack" API? Pasha's approach seems
like everything we need for what you're talking about.

2024-03-14 19:57:37

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [RFC 00/14] Dynamic Kernel Stacks

On Thu, Mar 14, 2024 at 03:53:39PM -0400, Kent Overstreet wrote:
> On Thu, Mar 14, 2024 at 07:43:06PM +0000, Matthew Wilcox wrote:
> > On Tue, Mar 12, 2024 at 10:18:10AM -0700, H. Peter Anvin wrote:
> > > Second, non-dynamic kernel memory is one of the core design decisions in
> > > Linux from early on. This means there are lot of deeply embedded assumptions
> > > which would have to be untangled.
> >
> > I think there are other ways of getting the benefit that Pasha is seeking
> > without moving to dynamically allocated kernel memory. One icky thing
> > that XFS does is punt work over to a kernel thread in order to use more
> > stack! That breaks a number of things including lockdep (because the
> > kernel thread doesn't own the lock, the thread waiting for the kernel
> > thread owns the lock).
> >
> > If we had segmented stacks, XFS could say "I need at least 6kB of stack",
> > and if less than that was available, we could allocate a temporary
> > stack and switch to it. I suspect Google would also be able to use this
> > API for their rare cases when they need more than 8kB of kernel stack.
> > Who knows, we might all be able to use such a thing.
> >
> > I'd been thinking about this from the point of view of allocating more
> > stack elsewhere in kernel space, but combining what Pasha has done here
> > with this idea might lead to a hybrid approach that works better; allocate
> > 32kB of vmap space per kernel thread, put 12kB of memory at the top of it,
> > rely on people using this "I need more stack" API correctly, and free the
> > excess pages on return to userspace. No complicated "switch stacks" API
> > needed, just an "ensure we have at least N bytes of stack remaining" API.
>
> Why would we need an "I need more stack" API? Pasha's approach seems
> like everything we need for what you're talking about.

Because double faults are hard, possibly impossible, and the FRED approach
Peter described has extra overhead? This was all described up-thread.

2024-03-14 19:58:23

by Kent Overstreet

[permalink] [raw]
Subject: Re: [RFC 00/14] Dynamic Kernel Stacks

On Thu, Mar 14, 2024 at 07:57:22PM +0000, Matthew Wilcox wrote:
> On Thu, Mar 14, 2024 at 03:53:39PM -0400, Kent Overstreet wrote:
> > On Thu, Mar 14, 2024 at 07:43:06PM +0000, Matthew Wilcox wrote:
> > > On Tue, Mar 12, 2024 at 10:18:10AM -0700, H. Peter Anvin wrote:
> > > > Second, non-dynamic kernel memory is one of the core design decisions in
> > > > Linux from early on. This means there are lot of deeply embedded assumptions
> > > > which would have to be untangled.
> > >
> > > I think there are other ways of getting the benefit that Pasha is seeking
> > > without moving to dynamically allocated kernel memory. One icky thing
> > > that XFS does is punt work over to a kernel thread in order to use more
> > > stack! That breaks a number of things including lockdep (because the
> > > kernel thread doesn't own the lock, the thread waiting for the kernel
> > > thread owns the lock).
> > >
> > > If we had segmented stacks, XFS could say "I need at least 6kB of stack",
> > > and if less than that was available, we could allocate a temporary
> > > stack and switch to it. I suspect Google would also be able to use this
> > > API for their rare cases when they need more than 8kB of kernel stack.
> > > Who knows, we might all be able to use such a thing.
> > >
> > > I'd been thinking about this from the point of view of allocating more
> > > stack elsewhere in kernel space, but combining what Pasha has done here
> > > with this idea might lead to a hybrid approach that works better; allocate
> > > 32kB of vmap space per kernel thread, put 12kB of memory at the top of it,
> > > rely on people using this "I need more stack" API correctly, and free the
> > > excess pages on return to userspace. No complicated "switch stacks" API
> > > needed, just an "ensure we have at least N bytes of stack remaining" API.
> >
> > Why would we need an "I need more stack" API? Pasha's approach seems
> > like everything we need for what you're talking about.
>
> Because double faults are hard, possibly impossible, and the FRED approach
> Peter described has extra overhead? This was all described up-thread.

*nod*

2024-03-15 03:14:41

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [RFC 00/14] Dynamic Kernel Stacks

On Thu, Mar 14, 2024 at 3:57 PM Matthew Wilcox <[email protected]> wrote:
>
> On Thu, Mar 14, 2024 at 03:53:39PM -0400, Kent Overstreet wrote:
> > On Thu, Mar 14, 2024 at 07:43:06PM +0000, Matthew Wilcox wrote:
> > > On Tue, Mar 12, 2024 at 10:18:10AM -0700, H. Peter Anvin wrote:
> > > > Second, non-dynamic kernel memory is one of the core design decisions in
> > > > Linux from early on. This means there are lot of deeply embedded assumptions
> > > > which would have to be untangled.
> > >
> > > I think there are other ways of getting the benefit that Pasha is seeking
> > > without moving to dynamically allocated kernel memory. One icky thing
> > > that XFS does is punt work over to a kernel thread in order to use more
> > > stack! That breaks a number of things including lockdep (because the
> > > kernel thread doesn't own the lock, the thread waiting for the kernel
> > > thread owns the lock).
> > >
> > > If we had segmented stacks, XFS could say "I need at least 6kB of stack",
> > > and if less than that was available, we could allocate a temporary
> > > stack and switch to it. I suspect Google would also be able to use this
> > > API for their rare cases when they need more than 8kB of kernel stack.
> > > Who knows, we might all be able to use such a thing.
> > >
> > > I'd been thinking about this from the point of view of allocating more
> > > stack elsewhere in kernel space, but combining what Pasha has done here
> > > with this idea might lead to a hybrid approach that works better; allocate
> > > 32kB of vmap space per kernel thread, put 12kB of memory at the top of it,
> > > rely on people using this "I need more stack" API correctly, and free the
> > > excess pages on return to userspace. No complicated "switch stacks" API
> > > needed, just an "ensure we have at least N bytes of stack remaining" API.

I like this approach! I think we could also consider having permanent
big stacks for some kernel-only threads like kvm-vcpu. A cooperative
stack increase framework could work well and wouldn't negatively
impact the performance of context switching. However, thorough
analysis would be necessary to proactively identify potential stack
overflow situations.

> > Why would we need an "I need more stack" API? Pasha's approach seems
> > like everything we need for what you're talking about.
>
> Because double faults are hard, possibly impossible, and the FRED approach
> Peter described has extra overhead? This was all described up-thread.

Handling faults in #DF is possible. It requires code inspection to
handle race conditions such as the one shown by tglx. However, as
Andy pointed out, this is not supported by the SDM, as #DF is an abort
context (yet we return from it because of ESPFIX64, so return is
possible).

My question, however, is this: if we ignore memory savings and only
consider the reliability aspect of this feature, what is better?
Unconditionally crashing the machine because a guard page was reached,
or printing a huge warning with backtrace information about the
offending stack, handling the fault, and surviving? I know that
historically Linus preferred WARN() to BUG() [1]. But this is a
somewhat different scenario compared to simple BUG vs WARN.

Pasha

[1] https://lore.kernel.org/all/[email protected]

2024-03-15 03:42:05

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC 00/14] Dynamic Kernel Stacks

On March 14, 2024 8:13:56 PM PDT, Pasha Tatashin <[email protected]> wrote:
>On Thu, Mar 14, 2024 at 3:57 PM Matthew Wilcox <[email protected]> wrote:
>>
>> On Thu, Mar 14, 2024 at 03:53:39PM -0400, Kent Overstreet wrote:
>> > On Thu, Mar 14, 2024 at 07:43:06PM +0000, Matthew Wilcox wrote:
>> > > On Tue, Mar 12, 2024 at 10:18:10AM -0700, H. Peter Anvin wrote:
>> > > > Second, non-dynamic kernel memory is one of the core design decisions in
>> > > > Linux from early on. This means there are lot of deeply embedded assumptions
>> > > > which would have to be untangled.
>> > >
>> > > I think there are other ways of getting the benefit that Pasha is seeking
>> > > without moving to dynamically allocated kernel memory. One icky thing
>> > > that XFS does is punt work over to a kernel thread in order to use more
>> > > stack! That breaks a number of things including lockdep (because the
>> > > kernel thread doesn't own the lock, the thread waiting for the kernel
>> > > thread owns the lock).
>> > >
>> > > If we had segmented stacks, XFS could say "I need at least 6kB of stack",
>> > > and if less than that was available, we could allocate a temporary
>> > > stack and switch to it. I suspect Google would also be able to use this
>> > > API for their rare cases when they need more than 8kB of kernel stack.
>> > > Who knows, we might all be able to use such a thing.
>> > >
>> > > I'd been thinking about this from the point of view of allocating more
>> > > stack elsewhere in kernel space, but combining what Pasha has done here
>> > > with this idea might lead to a hybrid approach that works better; allocate
>> > > 32kB of vmap space per kernel thread, put 12kB of memory at the top of it,
>> > > rely on people using this "I need more stack" API correctly, and free the
>> > > excess pages on return to userspace. No complicated "switch stacks" API
>> > > needed, just an "ensure we have at least N bytes of stack remaining" API.
>
>I like this approach! I think we could also consider having permanent
>big stacks for some kernel only threads like kvm-vcpu. A cooperative
>stack increase framework could work well and wouldn't negatively
>impact the performance of context switching. However, thorough
>analysis would be necessary to proactively identify potential stack
>overflow situations.
>
>> > Why would we need an "I need more stack" API? Pasha's approach seems
>> > like everything we need for what you're talking about.
>>
>> Because double faults are hard, possibly impossible, and the FRED approach
>> Peter described has extra overhead? This was all described up-thread.
>
>Handling faults in #DF is possible. It requires code inspection to
>handle race conditions such as what was shown by tglx. However, as
>Andy pointed out, this is not supported by SDM as it is an abort
>context (yet we return from it because of ESPFIX64, so return is
>possible).
>
>My question, however, if we ignore memory savings and only consider
>reliability aspect of this feature. What is better unconditionally
>crashing the machine because a guard page was reached, or printing a
>huge warning with a backtracing information about the offending stack,
>handling the fault, and survive? I know that historically Linus
>preferred WARN() to BUG() [1]. But, this is a somewhat different
>scenario compared to simple BUG vs WARN.
>
>Pasha
>
>[1] https://lore.kernel.org/all/[email protected]
>

The real issue with using #DF is that if the event that caused it was asynchronous, you could lose the event.

2024-03-15 04:18:59

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC 00/14] Dynamic Kernel Stacks

On March 14, 2024 8:13:56 PM PDT, Pasha Tatashin <[email protected]> wrote:
>On Thu, Mar 14, 2024 at 3:57 PM Matthew Wilcox <[email protected]> wrote:
>>
>> On Thu, Mar 14, 2024 at 03:53:39PM -0400, Kent Overstreet wrote:
>> > On Thu, Mar 14, 2024 at 07:43:06PM +0000, Matthew Wilcox wrote:
>> > > On Tue, Mar 12, 2024 at 10:18:10AM -0700, H. Peter Anvin wrote:
>> > > > Second, non-dynamic kernel memory is one of the core design decisions in
>> > > > Linux from early on. This means there are lot of deeply embedded assumptions
>> > > > which would have to be untangled.
>> > >
>> > > I think there are other ways of getting the benefit that Pasha is seeking
>> > > without moving to dynamically allocated kernel memory. One icky thing
>> > > that XFS does is punt work over to a kernel thread in order to use more
>> > > stack! That breaks a number of things including lockdep (because the
>> > > kernel thread doesn't own the lock, the thread waiting for the kernel
>> > > thread owns the lock).
>> > >
>> > > If we had segmented stacks, XFS could say "I need at least 6kB of stack",
>> > > and if less than that was available, we could allocate a temporary
>> > > stack and switch to it. I suspect Google would also be able to use this
>> > > API for their rare cases when they need more than 8kB of kernel stack.
>> > > Who knows, we might all be able to use such a thing.
>> > >
>> > > I'd been thinking about this from the point of view of allocating more
>> > > stack elsewhere in kernel space, but combining what Pasha has done here
>> > > with this idea might lead to a hybrid approach that works better; allocate
>> > > 32kB of vmap space per kernel thread, put 12kB of memory at the top of it,
>> > > rely on people using this "I need more stack" API correctly, and free the
>> > > excess pages on return to userspace. No complicated "switch stacks" API
>> > > needed, just an "ensure we have at least N bytes of stack remaining" API.
>
>I like this approach! I think we could also consider having permanent
>big stacks for some kernel only threads like kvm-vcpu. A cooperative
>stack increase framework could work well and wouldn't negatively
>impact the performance of context switching. However, thorough
>analysis would be necessary to proactively identify potential stack
>overflow situations.
>
>> > Why would we need an "I need more stack" API? Pasha's approach seems
>> > like everything we need for what you're talking about.
>>
>> Because double faults are hard, possibly impossible, and the FRED approach
>> Peter described has extra overhead? This was all described up-thread.
>
>Handling faults in #DF is possible. It requires code inspection to
>handle race conditions such as what was shown by tglx. However, as
>Andy pointed out, this is not supported by SDM as it is an abort
>context (yet we return from it because of ESPFIX64, so return is
>possible).
>
>My question, however, if we ignore memory savings and only consider
>reliability aspect of this feature. What is better unconditionally
>crashing the machine because a guard page was reached, or printing a
>huge warning with a backtracing information about the offending stack,
>handling the fault, and survive? I know that historically Linus
>preferred WARN() to BUG() [1]. But, this is a somewhat different
>scenario compared to simple BUG vs WARN.
>
>Pasha
>
>[1] https://lore.kernel.org/all/[email protected]
>

From a reliability point of view it is better to die than to proceed with possible data loss. The latter is extremely serious.

However, the one way that this could be made to work would be with stack probes, which could be compiler-inserted. The point is that you touch an offset below the stack pointer large enough to cover not only the maximum amount of stack the function needs, but also an additional margin with enough space that you can safely take the #PF on the remaining stack.
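
Roughly, such a probe could look like the sketch below; the names and the margin are made up for illustration (current_stack_pointer is the x86 register-variable helper), and this is not an existing interface:

#include <linux/compiler.h>

#define STACK_PROBE_MARGIN	1024	/* assumed margin for taking the #PF safely */

/*
 * Illustrative only: touch the stack deep enough that any fault happens
 * here, synchronously, while enough mapped stack remains to handle it.
 */
static __always_inline void stack_probe(unsigned long frame_size)
{
	volatile char *sp = (volatile char *)current_stack_pointer;

	(void)sp[-(long)(frame_size + STACK_PROBE_MARGIN)];
}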

2024-03-16 19:18:44

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [RFC 00/14] Dynamic Kernel Stacks

On Thu, Mar 14, 2024 at 11:40 PM H. Peter Anvin <[email protected]> wrote:
>
> On March 14, 2024 8:13:56 PM PDT, Pasha Tatashin <[email protected]> wrote:
> >On Thu, Mar 14, 2024 at 3:57 PM Matthew Wilcox <[email protected]> wrote:
> >>
> >> On Thu, Mar 14, 2024 at 03:53:39PM -0400, Kent Overstreet wrote:
> >> > On Thu, Mar 14, 2024 at 07:43:06PM +0000, Matthew Wilcox wrote:
> >> > > On Tue, Mar 12, 2024 at 10:18:10AM -0700, H. Peter Anvin wrote:
> >> > > > Second, non-dynamic kernel memory is one of the core design decisions in
> >> > > > Linux from early on. This means there are lot of deeply embedded assumptions
> >> > > > which would have to be untangled.
> >> > >
> >> > > I think there are other ways of getting the benefit that Pasha is seeking
> >> > > without moving to dynamically allocated kernel memory. One icky thing
> >> > > that XFS does is punt work over to a kernel thread in order to use more
> >> > > stack! That breaks a number of things including lockdep (because the
> >> > > kernel thread doesn't own the lock, the thread waiting for the kernel
> >> > > thread owns the lock).
> >> > >
> >> > > If we had segmented stacks, XFS could say "I need at least 6kB of stack",
> >> > > and if less than that was available, we could allocate a temporary
> >> > > stack and switch to it. I suspect Google would also be able to use this
> >> > > API for their rare cases when they need more than 8kB of kernel stack.
> >> > > Who knows, we might all be able to use such a thing.
> >> > >
> >> > > I'd been thinking about this from the point of view of allocating more
> >> > > stack elsewhere in kernel space, but combining what Pasha has done here
> >> > > with this idea might lead to a hybrid approach that works better; allocate
> >> > > 32kB of vmap space per kernel thread, put 12kB of memory at the top of it,
> >> > > rely on people using this "I need more stack" API correctly, and free the
> >> > > excess pages on return to userspace. No complicated "switch stacks" API
> >> > > needed, just an "ensure we have at least N bytes of stack remaining" API.
> >
> >I like this approach! I think we could also consider having permanent
> >big stacks for some kernel only threads like kvm-vcpu. A cooperative
> >stack increase framework could work well and wouldn't negatively
> >impact the performance of context switching. However, thorough
> >analysis would be necessary to proactively identify potential stack
> >overflow situations.
> >
> >> > Why would we need an "I need more stack" API? Pasha's approach seems
> >> > like everything we need for what you're talking about.
> >>
> >> Because double faults are hard, possibly impossible, and the FRED approach
> >> Peter described has extra overhead? This was all described up-thread.
> >
> >Handling faults in #DF is possible. It requires code inspection to
> >handle race conditions such as what was shown by tglx. However, as
> >Andy pointed out, this is not supported by SDM as it is an abort
> >context (yet we return from it because of ESPFIX64, so return is
> >possible).
> >
> >My question, however, if we ignore memory savings and only consider
> >reliability aspect of this feature. What is better unconditionally
> >crashing the machine because a guard page was reached, or printing a
> >huge warning with a backtracing information about the offending stack,
> >handling the fault, and survive? I know that historically Linus
> >preferred WARN() to BUG() [1]. But, this is a somewhat different
> >scenario compared to simple BUG vs WARN.
> >
> >Pasha
> >
> >[1] https://lore.kernel.org/all/[email protected]
> >
>
> The real issue with using #DF is that if the event that caused it was asynchronous, you could lose the event.

Got it. So, using a #DF handler for stack page faults isn't feasible.
I suppose the only way for this to work would be to use a dedicated
Interrupt Stack Table (IST) entry for page faults (#PF), but I suspect
that might introduce other complications.

Expanding on Matthew's idea of an interface for dynamic kernel stack
sizes, here's what I'm thinking:

- Kernel Threads: Create all kernel threads with a fully populated
THREAD_SIZE stack. (i.e. 16K)
- User Threads: Create all user threads with THREAD_SIZE kernel stack
but only the top page mapped. (i.e. 4K)
- In enter_from_user_mode(): Expand the thread stack to 16K by mapping
three additional pages from the per-CPU stack cache. This function is
called early in kernel entry points.
- exit_to_user_mode(): Unmap the extra three pages and return them to
the per-CPU cache. This function is called late in the kernel exit
path.

Both of the above hooks are called with IRQs disabled on all kernel
entries, whether through interrupts or syscalls, and they are called
early/late enough that 4K is enough to handle the rest of entry/exit.
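
Very roughly, the two hooks could be shaped like the sketch below; dynamic_stack_map_pages()/dynamic_stack_unmap_pages() and the page-count macro are made-up names for illustration, not code from this series:

#include <linux/sched.h>

#define DYN_STACK_EXTRA_PAGES	3	/* 4K always mapped + 3 pages = 16K */

/* Early on every kernel entry from user mode, with IRQs still disabled. */
static __always_inline void dyn_stack_enter_from_user_mode(void)
{
	/* Kernel threads keep a fully populated stack and are skipped. */
	if (!(current->flags & PF_KTHREAD))
		dynamic_stack_map_pages(current, DYN_STACK_EXTRA_PAGES);
}

/* Late on the return to user mode, again with IRQs disabled. */
static __always_inline void dyn_stack_exit_to_user_mode(void)
{
	if (!(current->flags & PF_KTHREAD))
		dynamic_stack_unmap_pages(current, DYN_STACK_EXTRA_PAGES);
}

where the map helper would take pages from the per-CPU stack cache and install them in the thread's vmap stack, and the unmap helper would return them to the cache.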

Pasha

2024-03-17 00:41:58

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [RFC 00/14] Dynamic Kernel Stacks

On Sat, Mar 16, 2024 at 03:17:57PM -0400, Pasha Tatashin wrote:
> Expanding on Mathew's idea of an interface for dynamic kernel stack
> sizes, here's what I'm thinking:
>
> - Kernel Threads: Create all kernel threads with a fully populated
> THREAD_SIZE stack. (i.e. 16K)
> - User Threads: Create all user threads with THREAD_SIZE kernel stack
> but only the top page mapped. (i.e. 4K)
> - In enter_from_user_mode(): Expand the thread stack to 16K by mapping
> three additional pages from the per-CPU stack cache. This function is
> called early in kernel entry points.
> - exit_to_user_mode(): Unmap the extra three pages and return them to
> the per-CPU cache. This function is called late in the kernel exit
> path.
>
> Both of the above hooks are called with IRQ disabled on all kernel
> entries whether through interrupts and syscalls, and they are called
> early/late enough that 4K is enough to handle the rest of entry/exit.

At what point do we replenish the per-CPU stash of pages? If we're
12kB deep in the stack and call mutex_lock(), we can be scheduled out,
and then the new thread can make a syscall. Do we just assume that
get_free_page() can sleep at kernel entry (seems reasonable)? I don't
think this is an infeasible problem, I'd just like it to be described.

2024-03-17 00:50:12

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC 00/14] Dynamic Kernel Stacks

On March 14, 2024 12:43:06 PM PDT, Matthew Wilcox <[email protected]> wrote:
>On Tue, Mar 12, 2024 at 10:18:10AM -0700, H. Peter Anvin wrote:
>> Second, non-dynamic kernel memory is one of the core design decisions in
>> Linux from early on. This means there are lot of deeply embedded assumptions
>> which would have to be untangled.
>
>I think there are other ways of getting the benefit that Pasha is seeking
>without moving to dynamically allocated kernel memory. One icky thing
>that XFS does is punt work over to a kernel thread in order to use more
>stack! That breaks a number of things including lockdep (because the
>kernel thread doesn't own the lock, the thread waiting for the kernel
>thread owns the lock).
>
>If we had segmented stacks, XFS could say "I need at least 6kB of stack",
>and if less than that was available, we could allocate a temporary
>stack and switch to it. I suspect Google would also be able to use this
>API for their rare cases when they need more than 8kB of kernel stack.
>Who knows, we might all be able to use such a thing.
>
>I'd been thinking about this from the point of view of allocating more
>stack elsewhere in kernel space, but combining what Pasha has done here
>with this idea might lead to a hybrid approach that works better; allocate
>32kB of vmap space per kernel thread, put 12kB of memory at the top of it,
>rely on people using this "I need more stack" API correctly, and free the
>excess pages on return to userspace. No complicated "switch stacks" API
>needed, just an "ensure we have at least N bytes of stack remaining" API.

This is what stack probes basically do. They provide a very cheap "API" that goes via the #PF (not #DF!) path in the slow case, synchronously at a well-defined point, and are virtually free in the common case. As a side benefit, they can be compiler-generated, as some operating systems require them.

2024-03-17 01:33:08

by Kent Overstreet

[permalink] [raw]
Subject: Re: [RFC 00/14] Dynamic Kernel Stacks

On Sun, Mar 17, 2024 at 12:41:33AM +0000, Matthew Wilcox wrote:
> On Sat, Mar 16, 2024 at 03:17:57PM -0400, Pasha Tatashin wrote:
> > Expanding on Mathew's idea of an interface for dynamic kernel stack
> > sizes, here's what I'm thinking:
> >
> > - Kernel Threads: Create all kernel threads with a fully populated
> > THREAD_SIZE stack. (i.e. 16K)
> > - User Threads: Create all user threads with THREAD_SIZE kernel stack
> > but only the top page mapped. (i.e. 4K)
> > - In enter_from_user_mode(): Expand the thread stack to 16K by mapping
> > three additional pages from the per-CPU stack cache. This function is
> > called early in kernel entry points.
> > - exit_to_user_mode(): Unmap the extra three pages and return them to
> > the per-CPU cache. This function is called late in the kernel exit
> > path.
> >
> > Both of the above hooks are called with IRQ disabled on all kernel
> > entries whether through interrupts and syscalls, and they are called
> > early/late enough that 4K is enough to handle the rest of entry/exit.
>
> At what point do we replenish the per-CPU stash of pages? If we're
> 12kB deep in the stack and call mutex_lock(), we can be scheduled out,
> and then the new thread can make a syscall. Do we just assume that
> get_free_page() can sleep at kernel entry (seems reasonable)? I don't
> think this is an infeasible problem, I'd just like it to be described.

schedule() or return to userspace, I believe was mentioned

2024-03-17 14:20:06

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [RFC 00/14] Dynamic Kernel Stacks

On Sat, Mar 16, 2024 at 8:41 PM Matthew Wilcox <[email protected]> wrote:
>
> On Sat, Mar 16, 2024 at 03:17:57PM -0400, Pasha Tatashin wrote:
> > Expanding on Mathew's idea of an interface for dynamic kernel stack
> > sizes, here's what I'm thinking:
> >
> > - Kernel Threads: Create all kernel threads with a fully populated
> > THREAD_SIZE stack. (i.e. 16K)
> > - User Threads: Create all user threads with THREAD_SIZE kernel stack
> > but only the top page mapped. (i.e. 4K)
> > - In enter_from_user_mode(): Expand the thread stack to 16K by mapping
> > three additional pages from the per-CPU stack cache. This function is
> > called early in kernel entry points.
> > - exit_to_user_mode(): Unmap the extra three pages and return them to
> > the per-CPU cache. This function is called late in the kernel exit
> > path.
> >
> > Both of the above hooks are called with IRQ disabled on all kernel
> > entries whether through interrupts and syscalls, and they are called
> > early/late enough that 4K is enough to handle the rest of entry/exit.
>
> At what point do we replenish the per-CPU stash of pages? If we're
> 12kB deep in the stack and call mutex_lock(), we can be scheduled out,
> and then the new thread can make a syscall. Do we just assume that
> get_free_page() can sleep at kernel entry (seems reasonable)? I don't
> think this is an infeasible problem, I'd just like it to be described.

Once IRQs are enabled, it is perfectly OK to sleep and wait for the
stack pages to become available.

The following user entry paths enable interrupts:
do_user_addr_fault()
local_irq_enable()

do_syscall_64()
syscall_enter_from_user_mode()
local_irq_enable()

__do_fast_syscall_32()
syscall_enter_from_user_mode_prepare()
local_irq_enable()

exc_debug_user()
local_irq_enable()

do_int3_user()
cond_local_irq_enable()

On those paths it is perfectly OK to sleep and wait for a page to
become available when the per-CPU cache is empty and
alloc_page(GFP_NOWAIT) does not succeed.

The other interrupts from userland never enable IRQs. We can keep
3 pages reserved per CPU specifically for those IRQs-never-enabled
cases, since at most one such reserve can ever be needed at a time.
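
As a sketch of that refill policy (dyn_stack_cache_pop() is a made-up name for the per-CPU cache helper):

#include <linux/gfp.h>

static struct page *dyn_stack_get_page(bool can_sleep)
{
	struct page *page;

	/* Fast path: take a page from the per-CPU stack cache. */
	page = dyn_stack_cache_pop();
	if (page)
		return page;

	/* Atomic attempt; usable even before local_irq_enable(). */
	page = alloc_page(GFP_NOWAIT | __GFP_ZERO);
	if (page || !can_sleep)
		return page;

	/* On the entry paths above IRQs are already enabled, so sleeping is fine. */
	return alloc_page(GFP_KERNEL | __GFP_ZERO);
}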

Pasha

2024-03-17 14:37:48

by Christophe JAILLET

[permalink] [raw]
Subject: Re: [RFC 01/14] task_stack.h: remove obsolete __HAVE_ARCH_KSTACK_END check

Le 11/03/2024 à 17:46, Pasha Tatashin a écrit :
> Remove __HAVE_ARCH_KSTACK_END as it has been osolete since removal of
> metag architecture in v4.17.

Nit: obsolete

>
> Signed-off-by: Pasha Tatashin <[email protected]>
> ---
> include/linux/sched/task_stack.h | 2 --
> 1 file changed, 2 deletions(-)
>
> diff --git a/include/linux/sched/task_stack.h b/include/linux/sched/task_stack.h
> index ccd72b978e1f..860faea06883 100644
> --- a/include/linux/sched/task_stack.h
> +++ b/include/linux/sched/task_stack.h
> @@ -116,7 +116,6 @@ static inline unsigned long stack_not_used(struct task_struct *p)
> #endif
> extern void set_task_stack_end_magic(struct task_struct *tsk);
>
> -#ifndef __HAVE_ARCH_KSTACK_END
> static inline int kstack_end(void *addr)
> {
> /* Reliable end of stack detection:
> @@ -124,6 +123,5 @@ static inline int kstack_end(void *addr)
> */
> return !(((unsigned long)addr+sizeof(void*)-1) & (THREAD_SIZE-sizeof(void*)));
> }
> -#endif
>
> #endif /* _LINUX_SCHED_TASK_STACK_H */


2024-03-17 14:43:42

by Brian Gerst

[permalink] [raw]
Subject: Re: [RFC 00/14] Dynamic Kernel Stacks

On Sat, Mar 16, 2024 at 3:18 PM Pasha Tatashin
<[email protected]> wrote:
>
> On Thu, Mar 14, 2024 at 11:40 PM H. Peter Anvin <[email protected]> wrote:
> >
> > On March 14, 2024 8:13:56 PM PDT, Pasha Tatashin <pasha.tatashin@soleencom> wrote:
> > >On Thu, Mar 14, 2024 at 3:57 PM Matthew Wilcox <[email protected]> wrote:
> > >>
> > >> On Thu, Mar 14, 2024 at 03:53:39PM -0400, Kent Overstreet wrote:
> > >> > On Thu, Mar 14, 2024 at 07:43:06PM +0000, Matthew Wilcox wrote:
> > >> > > On Tue, Mar 12, 2024 at 10:18:10AM -0700, H. Peter Anvin wrote:
> > >> > > > Second, non-dynamic kernel memory is one of the core design decisions in
> > >> > > > Linux from early on. This means there are lot of deeply embedded assumptions
> > >> > > > which would have to be untangled.
> > >> > >
> > >> > > I think there are other ways of getting the benefit that Pasha is seeking
> > >> > > without moving to dynamically allocated kernel memory. One icky thing
> > >> > > that XFS does is punt work over to a kernel thread in order to use more
> > >> > > stack! That breaks a number of things including lockdep (because the
> > >> > > kernel thread doesn't own the lock, the thread waiting for the kernel
> > >> > > thread owns the lock).
> > >> > >
> > >> > > If we had segmented stacks, XFS could say "I need at least 6kB of stack",
> > >> > > and if less than that was available, we could allocate a temporary
> > >> > > stack and switch to it. I suspect Google would also be able to use this
> > >> > > API for their rare cases when they need more than 8kB of kernel stack.
> > >> > > Who knows, we might all be able to use such a thing.
> > >> > >
> > >> > > I'd been thinking about this from the point of view of allocating more
> > >> > > stack elsewhere in kernel space, but combining what Pasha has done here
> > >> > > with this idea might lead to a hybrid approach that works better; allocate
> > >> > > 32kB of vmap space per kernel thread, put 12kB of memory at the top of it,
> > >> > > rely on people using this "I need more stack" API correctly, and free the
> > >> > > excess pages on return to userspace. No complicated "switch stacks" API
> > >> > > needed, just an "ensure we have at least N bytes of stack remaining" API.
> > >
> > >I like this approach! I think we could also consider having permanent
> > >big stacks for some kernel only threads like kvm-vcpu. A cooperative
> > >stack increase framework could work well and wouldn't negatively
> > >impact the performance of context switching. However, thorough
> > >analysis would be necessary to proactively identify potential stack
> > >overflow situations.
> > >
> > >> > Why would we need an "I need more stack" API? Pasha's approach seems
> > >> > like everything we need for what you're talking about.
> > >>
> > >> Because double faults are hard, possibly impossible, and the FRED approach
> > >> Peter described has extra overhead? This was all described up-thread.
> > >
> > >Handling faults in #DF is possible. It requires code inspection to
> > >handle race conditions such as what was shown by tglx. However, as
> > >Andy pointed out, this is not supported by SDM as it is an abort
> > >context (yet we return from it because of ESPFIX64, so return is
> > >possible).
> > >
> > >My question, however, if we ignore memory savings and only consider
> > >reliability aspect of this feature. What is better unconditionally
> > >crashing the machine because a guard page was reached, or printing a
> > >huge warning with a backtracing information about the offending stack,
> > >handling the fault, and survive? I know that historically Linus
> > >preferred WARN() to BUG() [1]. But, this is a somewhat different
> > >scenario compared to simple BUG vs WARN.
> > >
> > >Pasha
> > >
> > >[1] https://lore.kernel.org/all/[email protected]
> > >
> >
> > The real issue with using #DF is that if the event that caused it was asynchronous, you could lose the event.
>
> Got it. So, using a #DF handler for stack page faults isn't feasible.
> I suppose the only way for this to work would be to use a dedicated
> Interrupt Stack Table (IST) entry for page faults (#PF), but I suspect
> that might introduce other complications.
>
> Expanding on Mathew's idea of an interface for dynamic kernel stack
> sizes, here's what I'm thinking:
>
> - Kernel Threads: Create all kernel threads with a fully populated
> THREAD_SIZE stack. (i.e. 16K)
> - User Threads: Create all user threads with THREAD_SIZE kernel stack
> but only the top page mapped. (i.e. 4K)
> - In enter_from_user_mode(): Expand the thread stack to 16K by mapping
> three additional pages from the per-CPU stack cache. This function is
> called early in kernel entry points.
> - exit_to_user_mode(): Unmap the extra three pages and return them to
> the per-CPU cache. This function is called late in the kernel exit
> path.
>
> Both of the above hooks are called with IRQ disabled on all kernel
> entries whether through interrupts and syscalls, and they are called
> early/late enough that 4K is enough to handle the rest of entry/exit.

This proposal will not have the memory savings that you are looking
for, since sleeping tasks would still have a fully allocated stack.
This also would add extra overhead to each entry and exit (including
syscalls) that can happen multiple times before a context switch. It
also doesn't make much sense because a task running in user mode will
quickly need those stack pages back when it returns to kernel mode.
Even if it doesn't make a syscall, the timer interrupt will kick it
out of user mode.

What should happen is that the unused stack is reclaimed when a task
goes to sleep. The kernel does not use a red zone, so any stack pages
below the saved stack pointer of a sleeping task (task->thread.sp) can
be safely discarded. Before context switching to a task, fully
populate its task stack. After context switching from a task, reclaim
its unused stack. This way, the task stack in use is always fully
allocated and we don't have to deal with page faults.

To make this happen, __switch_to() would have to be split into two
parts, to cleanly separate what happens before and after the stack
switch. The first part saves processor context for the previous task,
and prepares the next task. Populating the next task's stack would
happen here. Then it would return to the assembly code to do the
stack switch. The second part then loads the context of the next
task, and finalizes any work for the previous task. Reclaiming the
unused stack pages of the previous task would happen here.
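
Loosely sketched, that split could look like the following; the helpers are hypothetical, thread.sp is the x86 field, and the exact division of work would need care:

#include <linux/sched.h>

/* Part 1: still running on prev's stack, before the assembly stack switch. */
static void switch_to_prepare(struct task_struct *prev, struct task_struct *next)
{
	save_prev_processor_context(prev);	/* hypothetical */
	/* next's stack must be fully mapped before we start running on it. */
	dynamic_stack_populate(next);		/* hypothetical */
}

/* Part 2: now running on next's stack, after the assembly stack switch. */
static void switch_to_finish(struct task_struct *prev, struct task_struct *next)
{
	load_next_processor_context(next);	/* hypothetical */
	/*
	 * prev now sleeps with its stack pointer saved in prev->thread.sp;
	 * with no red zone, pages below that address are unused and can be
	 * returned to the stack cache.
	 */
	dynamic_stack_reclaim(prev, prev->thread.sp);	/* hypothetical */
}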


Brian Gerst

2024-03-17 14:45:50

by Christophe JAILLET

[permalink] [raw]
Subject: Re: [RFC 04/14] fork: Remove assumption that vm_area->nr_pages equals to THREAD_SIZE

Le 11/03/2024 à 17:46, Pasha Tatashin a écrit :
> In many places number of pages in the stack is detremined via
> (THREAD_SIZE / PAGE_SIZE). There is also a BUG_ON() that ensures that
> (THREAD_SIZE / PAGE_SIZE) is indeed equals to vm_area->nr_pages.
>
> However, with dynamic stacks, the number of pages in vm_area will grow
> with stack, therefore, use vm_area->nr_pages to determine the actual
> number of pages allocated in stack.
>
> Signed-off-by: Pasha Tatashin <[email protected]>
> ---
> kernel/fork.c | 18 +++++++++---------
> 1 file changed, 9 insertions(+), 9 deletions(-)
>
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 60e812825a7a..a35f4008afa0 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -243,13 +243,11 @@ static int free_vm_stack_cache(unsigned int cpu)
>
> static int memcg_charge_kernel_stack(struct vm_struct *vm)

Maybe s/vm/vm_area/ as done in 03/14?

CJ

> {
> - int i;
> - int ret;
> + int i, ret, nr_pages;
> int nr_charged = 0;
>
> - BUG_ON(vm->nr_pages != THREAD_SIZE / PAGE_SIZE);
> -
> - for (i = 0; i < THREAD_SIZE / PAGE_SIZE; i++) {
> + nr_pages = vm->nr_pages;
> + for (i = 0; i < nr_pages; i++) {
> ret = memcg_kmem_charge_page(vm->pages[i], GFP_KERNEL, 0);
> if (ret)
> goto err;
> @@ -531,9 +529,10 @@ static void account_kernel_stack(struct task_struct *tsk, int account)
> {
> if (IS_ENABLED(CONFIG_VMAP_STACK)) {
> struct vm_struct *vm = task_stack_vm_area(tsk);
> - int i;
> + int i, nr_pages;
>
> - for (i = 0; i < THREAD_SIZE / PAGE_SIZE; i++)
> + nr_pages = vm->nr_pages;
> + for (i = 0; i < nr_pages; i++)
> mod_lruvec_page_state(vm->pages[i], NR_KERNEL_STACK_KB,
> account * (PAGE_SIZE / 1024));
> } else {
> @@ -551,10 +550,11 @@ void exit_task_stack_account(struct task_struct *tsk)
>
> if (IS_ENABLED(CONFIG_VMAP_STACK)) {
> struct vm_struct *vm;
> - int i;
> + int i, nr_pages;
>
> vm = task_stack_vm_area(tsk);
> - for (i = 0; i < THREAD_SIZE / PAGE_SIZE; i++)
> + nr_pages = vm->nr_pages;
> + for (i = 0; i < nr_pages; i++)
> memcg_kmem_uncharge_page(vm->pages[i], 0);
> }
> }


2024-03-17 14:52:02

by Christophe JAILLET

[permalink] [raw]
Subject: Re: [RFC 03/14] fork: Clean-up naming of vm_strack/vm_struct variables in vmap stacks code

Le 11/03/2024 à 17:46, Pasha Tatashin a écrit :
> There are two data types: "struct vm_struct" and "struct vm_stack" that
> have the same local variable names: vm_stack, or vm, or s, which makes
> code confusing to read.
>
> Change the code so the naming is consisent:

Nit: consistent

>
> struct vm_struct is always called vm_area
> struct vm_stack is always called vm_stack
>
> Signed-off-by: Pasha Tatashin <[email protected]>
> ---
> kernel/fork.c | 38 ++++++++++++++++++--------------------
> 1 file changed, 18 insertions(+), 20 deletions(-)
>
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 32600bf2422a..60e812825a7a 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -192,12 +192,12 @@ struct vm_stack {
> struct vm_struct *stack_vm_area;
> };
>
> -static bool try_release_thread_stack_to_cache(struct vm_struct *vm)
> +static bool try_release_thread_stack_to_cache(struct vm_struct *vm_area)
> {
> unsigned int i;
>
> for (i = 0; i < NR_CACHED_STACKS; i++) {
> - if (this_cpu_cmpxchg(cached_stacks[i], NULL, vm) != NULL)
> + if (this_cpu_cmpxchg(cached_stacks[i], NULL, vm_area) != NULL)
> continue;
> return true;
> }
> @@ -207,11 +207,12 @@ static bool try_release_thread_stack_to_cache(struct vm_struct *vm)
> static void thread_stack_free_rcu(struct rcu_head *rh)
> {
> struct vm_stack *vm_stack = container_of(rh, struct vm_stack, rcu);
> + struct vm_struct *vm_area = vm_stack->stack_vm_area;
>
> if (try_release_thread_stack_to_cache(vm_stack->stack_vm_area))
> return;
>
> - vfree(vm_stack);
> + vfree(vm_area->addr);

This does not look like only a renaming of a variable. Is it?

If no, should there be a Fixes tag and should it be detailed in the
commit description?

CJ

> }
>
> static void thread_stack_delayed_free(struct task_struct *tsk)
> @@ -228,12 +229,12 @@ static int free_vm_stack_cache(unsigned int cpu)
> int i;
>
> for (i = 0; i < NR_CACHED_STACKS; i++) {
> - struct vm_struct *vm_stack = cached_vm_stacks[i];
> + struct vm_struct *vm_area = cached_vm_stacks[i];
>
> - if (!vm_stack)
> + if (!vm_area)
> continue;
>
> - vfree(vm_stack->addr);
> + vfree(vm_area->addr);
> cached_vm_stacks[i] = NULL;
> }
>
> @@ -263,32 +264,29 @@ static int memcg_charge_kernel_stack(struct vm_struct *vm)
>
> static int alloc_thread_stack_node(struct task_struct *tsk, int node)
> {
> - struct vm_struct *vm;
> + struct vm_struct *vm_area;
> void *stack;
> int i;
>
> for (i = 0; i < NR_CACHED_STACKS; i++) {
> - struct vm_struct *s;
> -
> - s = this_cpu_xchg(cached_stacks[i], NULL);
> -
> - if (!s)
> + vm_area = this_cpu_xchg(cached_stacks[i], NULL);
> + if (!vm_area)
> continue;
>
> /* Reset stack metadata. */
> - kasan_unpoison_range(s->addr, THREAD_SIZE);
> + kasan_unpoison_range(vm_area->addr, THREAD_SIZE);
>
> - stack = kasan_reset_tag(s->addr);
> + stack = kasan_reset_tag(vm_area->addr);
>
> /* Clear stale pointers from reused stack. */
> memset(stack, 0, THREAD_SIZE);
>
> - if (memcg_charge_kernel_stack(s)) {
> - vfree(s->addr);
> + if (memcg_charge_kernel_stack(vm_area)) {
> + vfree(vm_area->addr);
> return -ENOMEM;
> }
>
> - tsk->stack_vm_area = s;
> + tsk->stack_vm_area = vm_area;
> tsk->stack = stack;
> return 0;
> }
> @@ -306,8 +304,8 @@ static int alloc_thread_stack_node(struct task_struct *tsk, int node)
> if (!stack)
> return -ENOMEM;
>
> - vm = find_vm_area(stack);
> - if (memcg_charge_kernel_stack(vm)) {
> + vm_area = find_vm_area(stack);
> + if (memcg_charge_kernel_stack(vm_area)) {
> vfree(stack);
> return -ENOMEM;
> }
> @@ -316,7 +314,7 @@ static int alloc_thread_stack_node(struct task_struct *tsk, int node)
> * free_thread_stack() can be called in interrupt context,
> * so cache the vm_struct.
> */
> - tsk->stack_vm_area = vm;
> + tsk->stack_vm_area = vm_area;
> stack = kasan_reset_tag(stack);
> tsk->stack = stack;
> return 0;


2024-03-17 15:13:46

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [RFC 01/14] task_stack.h: remove obsolete __HAVE_ARCH_KSTACK_END check

On Sun, Mar 17, 2024 at 10:36 AM Christophe JAILLET
<[email protected]> wrote:
>
> Le 11/03/2024 à 17:46, Pasha Tatashin a écrit :
> > Remove __HAVE_ARCH_KSTACK_END as it has been osolete since removal of
> > metag architecture in v4.17.
>
> Nit: obsolete

Thank you, I will fix it.

Pasha

>
> >
> > Signed-off-by: Pasha Tatashin <[email protected]>
> > ---
> > include/linux/sched/task_stack.h | 2 --
> > 1 file changed, 2 deletions(-)
> >
> > diff --git a/include/linux/sched/task_stack.h b/include/linux/sched/task_stack.h
> > index ccd72b978e1f..860faea06883 100644
> > --- a/include/linux/sched/task_stack.h
> > +++ b/include/linux/sched/task_stack.h
> > @@ -116,7 +116,6 @@ static inline unsigned long stack_not_used(struct task_struct *p)
> > #endif
> > extern void set_task_stack_end_magic(struct task_struct *tsk);
> >
> > -#ifndef __HAVE_ARCH_KSTACK_END
> > static inline int kstack_end(void *addr)
> > {
> > /* Reliable end of stack detection:
> > @@ -124,6 +123,5 @@ static inline int kstack_end(void *addr)
> > */
> > return !(((unsigned long)addr+sizeof(void*)-1) & (THREAD_SIZE-sizeof(void*)));
> > }
> > -#endif
> >
> > #endif /* _LINUX_SCHED_TASK_STACK_H */
>

2024-03-17 15:15:25

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [RFC 04/14] fork: Remove assumption that vm_area->nr_pages equals to THREAD_SIZE

On Sun, Mar 17, 2024 at 10:45 AM Christophe JAILLET
<[email protected]> wrote:
>
> Le 11/03/2024 à 17:46, Pasha Tatashin a écrit :
> > In many places number of pages in the stack is detremined via
> > (THREAD_SIZE / PAGE_SIZE). There is also a BUG_ON() that ensures that
> > (THREAD_SIZE / PAGE_SIZE) is indeed equals to vm_area->nr_pages.
> >
> > However, with dynamic stacks, the number of pages in vm_area will grow
> > with stack, therefore, use vm_area->nr_pages to determine the actual
> > number of pages allocated in stack.
> >
> > Signed-off-by: Pasha Tatashin <[email protected]>
> > ---
> > kernel/fork.c | 18 +++++++++---------
> > 1 file changed, 9 insertions(+), 9 deletions(-)
> >
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 60e812825a7a..a35f4008afa0 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -243,13 +243,11 @@ static int free_vm_stack_cache(unsigned int cpu)
> >
> > static int memcg_charge_kernel_stack(struct vm_struct *vm)
>
> Maybe s/vm/vm_area/ as done in 03/14?

Yes, I will add it to 03/14.

Thank you,
Pasha

>
> CJ
>
> > {
> > - int i;
> > - int ret;
> > + int i, ret, nr_pages;
> > int nr_charged = 0;
> >
> > - BUG_ON(vm->nr_pages != THREAD_SIZE / PAGE_SIZE);
> > -
> > - for (i = 0; i < THREAD_SIZE / PAGE_SIZE; i++) {
> > + nr_pages = vm->nr_pages;
> > + for (i = 0; i < nr_pages; i++) {
> > ret = memcg_kmem_charge_page(vm->pages[i], GFP_KERNEL, 0);
> > if (ret)
> > goto err;
> > @@ -531,9 +529,10 @@ static void account_kernel_stack(struct task_struct *tsk, int account)
> > {
> > if (IS_ENABLED(CONFIG_VMAP_STACK)) {
> > struct vm_struct *vm = task_stack_vm_area(tsk);
> > - int i;
> > + int i, nr_pages;
> >
> > - for (i = 0; i < THREAD_SIZE / PAGE_SIZE; i++)
> > + nr_pages = vm->nr_pages;
> > + for (i = 0; i < nr_pages; i++)
> > mod_lruvec_page_state(vm->pages[i], NR_KERNEL_STACK_KB,
> > account * (PAGE_SIZE / 1024));
> > } else {
> > @@ -551,10 +550,11 @@ void exit_task_stack_account(struct task_struct *tsk)
> >
> > if (IS_ENABLED(CONFIG_VMAP_STACK)) {
> > struct vm_struct *vm;
> > - int i;
> > + int i, nr_pages;
> >
> > vm = task_stack_vm_area(tsk);
> > - for (i = 0; i < THREAD_SIZE / PAGE_SIZE; i++)
> > + nr_pages = vm->nr_pages;
> > + for (i = 0; i < nr_pages; i++)
> > memcg_kmem_uncharge_page(vm->pages[i], 0);
> > }
> > }
>

2024-03-17 16:16:09

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [RFC 00/14] Dynamic Kernel Stacks

On Sun, Mar 17, 2024 at 10:43 AM Brian Gerst <[email protected]> wrote:
>
> On Sat, Mar 16, 2024 at 3:18 PM Pasha Tatashin
> <[email protected]> wrote:
> >
> > On Thu, Mar 14, 2024 at 11:40 PM H. Peter Anvin <[email protected]> wrote:
> > >
> > > On March 14, 2024 8:13:56 PM PDT, Pasha Tatashin <[email protected]> wrote:
> > > >On Thu, Mar 14, 2024 at 3:57 PM Matthew Wilcox <[email protected]> wrote:
> > > >>
> > > >> On Thu, Mar 14, 2024 at 03:53:39PM -0400, Kent Overstreet wrote:
> > > >> > On Thu, Mar 14, 2024 at 07:43:06PM +0000, Matthew Wilcox wrote:
> > > >> > > On Tue, Mar 12, 2024 at 10:18:10AM -0700, H. Peter Anvin wrote:
> > > >> > > > Second, non-dynamic kernel memory is one of the core design decisions in
> > > >> > > > Linux from early on. This means there are lot of deeply embedded assumptions
> > > >> > > > which would have to be untangled.
> > > >> > >
> > > >> > > I think there are other ways of getting the benefit that Pasha is seeking
> > > >> > > without moving to dynamically allocated kernel memory. One icky thing
> > > >> > > that XFS does is punt work over to a kernel thread in order to use more
> > > >> > > stack! That breaks a number of things including lockdep (because the
> > > >> > > kernel thread doesn't own the lock, the thread waiting for the kernel
> > > >> > > thread owns the lock).
> > > >> > >
> > > >> > > If we had segmented stacks, XFS could say "I need at least 6kB of stack",
> > > >> > > and if less than that was available, we could allocate a temporary
> > > >> > > stack and switch to it. I suspect Google would also be able to use this
> > > >> > > API for their rare cases when they need more than 8kB of kernel stack.
> > > >> > > Who knows, we might all be able to use such a thing.
> > > >> > >
> > > >> > > I'd been thinking about this from the point of view of allocating more
> > > >> > > stack elsewhere in kernel space, but combining what Pasha has done here
> > > >> > > with this idea might lead to a hybrid approach that works better; allocate
> > > >> > > 32kB of vmap space per kernel thread, put 12kB of memory at the top of it,
> > > >> > > rely on people using this "I need more stack" API correctly, and free the
> > > >> > > excess pages on return to userspace. No complicated "switch stacks" API
> > > >> > > needed, just an "ensure we have at least N bytes of stack remaining" API.
> > > >
> > > >I like this approach! I think we could also consider having permanent
> > > >big stacks for some kernel only threads like kvm-vcpu. A cooperative
> > > >stack increase framework could work well and wouldn't negatively
> > > >impact the performance of context switching. However, thorough
> > > >analysis would be necessary to proactively identify potential stack
> > > >overflow situations.
> > > >
> > > >> > Why would we need an "I need more stack" API? Pasha's approach seems
> > > >> > like everything we need for what you're talking about.
> > > >>
> > > >> Because double faults are hard, possibly impossible, and the FRED approach
> > > >> Peter described has extra overhead? This was all described up-thread.
> > > >
> > > >Handling faults in #DF is possible. It requires code inspection to
> > > >handle race conditions such as what was shown by tglx. However, as
> > > >Andy pointed out, this is not supported by SDM as it is an abort
> > > >context (yet we return from it because of ESPFIX64, so return is
> > > >possible).
> > > >
> > > >My question, however, if we ignore memory savings and only consider
> > > >reliability aspect of this feature. What is better unconditionally
> > > >crashing the machine because a guard page was reached, or printing a
> > > >huge warning with a backtracing information about the offending stack,
> > > >handling the fault, and survive? I know that historically Linus
> > > >preferred WARN() to BUG() [1]. But, this is a somewhat different
> > > >scenario compared to simple BUG vs WARN.
> > > >
> > > >Pasha
> > > >
> > > >[1] https://lore.kernel.org/all/[email protected]
> > > >
> > >
> > > The real issue with using #DF is that if the event that caused it was asynchronous, you could lose the event.
> >
> > Got it. So, using a #DF handler for stack page faults isn't feasible.
> > I suppose the only way for this to work would be to use a dedicated
> > Interrupt Stack Table (IST) entry for page faults (#PF), but I suspect
> > that might introduce other complications.
> >
> > Expanding on Mathew's idea of an interface for dynamic kernel stack
> > sizes, here's what I'm thinking:
> >
> > - Kernel Threads: Create all kernel threads with a fully populated
> > THREAD_SIZE stack. (i.e. 16K)
> > - User Threads: Create all user threads with THREAD_SIZE kernel stack
> > but only the top page mapped. (i.e. 4K)
> > - In enter_from_user_mode(): Expand the thread stack to 16K by mapping
> > three additional pages from the per-CPU stack cache. This function is
> > called early in kernel entry points.
> > - exit_to_user_mode(): Unmap the extra three pages and return them to
> > the per-CPU cache. This function is called late in the kernel exit
> > path.
> >
> > Both of the above hooks are called with IRQ disabled on all kernel
> > entries whether through interrupts and syscalls, and they are called
> > early/late enough that 4K is enough to handle the rest of entry/exit.

Hi Brian,

> This proposal will not have the memory savings that you are looking
> for, since sleeping tasks would still have a fully allocated stack.

The tasks that were descheduled while running in user mode would not
grow their stacks. The potential saving is greater than in the
original proposal, because in the original proposal we never shrink
stacks after faults.

> This also would add extra overhead to each entry and exit (including
> syscalls) that can happen multiple times before a context switch. It
> also doesn't make much sense because a task running in user mode will
> quickly need those stack pages back when it returns to kernel mode.
> Even if it doesn't make a syscall, the timer interrupt will kick it
> out of user mode.
>
> What should happen is that the unused stack is reclaimed when a task
> goes to sleep. The kernel does not use a red zone, so any stack pages
> below the saved stack pointer of a sleeping task (task->thread.sp) can
> be safely discarded. Before context switching to a task, fully

Excellent observation, this makes Andy Lutomirski's per-map proposal [1]
usable without tracking dirty/accessed bits. It is more reliable, and
also platform independent.

> populate its task stack. After context switching from a task, reclaim
> its unused stack. This way, the task stack in use is always fully
> allocated and we don't have to deal with page faults.
>
> To make this happen, __switch_to() would have to be split into two
> parts, to cleanly separate what happens before and after the stack
> switch. The first part saves processor context for the previous task,
> and prepares the next task.

By knowing the stack requirements of __switch_to(), can't we actually
do all that in the common code in context_switch() right before
__switch_to()? We would do an arch-specific call to get the
__switch_to() stack requirement, and use that to set the value of
task->thread.sp to where the stack pointer is going to be while the
task sleeps. At that point we could unmap the unused stack pages from
the previous task and map pages for the next task.

> Populating the next task's stack would
> happen here. Then it would return to the assembly code to do the
> stack switch. The second part then loads the context of the next
> task, and finalizes any work for the previous task. Reclaiming the
> unused stack pages of the previous task would happen here.

The problem with this (and Andy's original approach) is that we
cannot sleep here. What happens if the per-CPU stack cache gets
exhausted because several threads sleep while having deep stacks? How
can we schedule the next task? This is probably a corner case, but it
needs a proper handling solution. One solution is, while in
schedule() and while interrupts are still enabled before going to
switch_to(), to pre-allocate 3 pages in the per-CPU cache. However,
what if the pre-allocation itself calls cond_resched() because it
enters the page allocator slowpath?

Other than the above concern, I concur, this approach looks to be the
best so far. I will think more about it.

Thank you,
Pasha

[1] https://lore.kernel.org/all/[email protected]

2024-03-17 19:00:17

by David Laight

[permalink] [raw]
Subject: RE: [RFC 00/14] Dynamic Kernel Stacks

From: Pasha Tatashin
> Sent: 16 March 2024 19:18
...
> Expanding on Mathew's idea of an interface for dynamic kernel stack
> sizes, here's what I'm thinking:
>
> - Kernel Threads: Create all kernel threads with a fully populated
> THREAD_SIZE stack. (i.e. 16K)
> - User Threads: Create all user threads with THREAD_SIZE kernel stack
> but only the top page mapped. (i.e. 4K)
> - In enter_from_user_mode(): Expand the thread stack to 16K by mapping
> three additional pages from the per-CPU stack cache. This function is
> called early in kernel entry points.
> - exit_to_user_mode(): Unmap the extra three pages and return them to
> the per-CPU cache. This function is called late in the kernel exit
> path.

Isn't that entirely horrid for TLB use, and so will require a lot of IPIs?

Remember, if a thread sleeps in 'extra stack' and is then rescheduled
on a different cpu, the extra pages get 'pumped' from one cpu to
another.

I also suspect a stack_probe() is likely to end up being a cache miss
and also slow???
So you wouldn't want one on all calls.
I'm not sure you'd want a conditional branch either.

The explicit request for 'more stack' can be required to be made from
a context that is allowed to sleep - removing a lot of issues.
It would also be portable to all architectures.
I'd also suspect that any thread that needs extra stack is likely
to need it again.
So while the memory could be recovered, I'd bet it isn't worth
doing except under memory pressure.
The call could also return 'no' - perhaps useful for (broken) code
that insists on being recursive.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2024-03-17 21:30:54

by Brian Gerst

[permalink] [raw]
Subject: Re: [RFC 00/14] Dynamic Kernel Stacks

On Sun, Mar 17, 2024 at 12:15 PM Pasha Tatashin
<[email protected]> wrote:
>
> On Sun, Mar 17, 2024 at 10:43 AM Brian Gerst <[email protected]> wrote:
> >
> > On Sat, Mar 16, 2024 at 3:18 PM Pasha Tatashin
> > <[email protected]> wrote:
> > >
> > > On Thu, Mar 14, 2024 at 11:40 PM H. Peter Anvin <[email protected]> wrote:
> > > >
> > > > On March 14, 2024 8:13:56 PM PDT, Pasha Tatashin <[email protected]> wrote:
> > > > >On Thu, Mar 14, 2024 at 3:57 PM Matthew Wilcox <[email protected]> wrote:
> > > > >>
> > > > >> On Thu, Mar 14, 2024 at 03:53:39PM -0400, Kent Overstreet wrote:
> > > > >> > On Thu, Mar 14, 2024 at 07:43:06PM +0000, Matthew Wilcox wrote:
> > > > >> > > On Tue, Mar 12, 2024 at 10:18:10AM -0700, H. Peter Anvin wrote:
> > > > >> > > > Second, non-dynamic kernel memory is one of the core design decisions in
> > > > >> > > > Linux from early on. This means there are lot of deeply embedded assumptions
> > > > >> > > > which would have to be untangled.
> > > > >> > >
> > > > >> > > I think there are other ways of getting the benefit that Pasha is seeking
> > > > >> > > without moving to dynamically allocated kernel memory. One icky thing
> > > > >> > > that XFS does is punt work over to a kernel thread in order to use more
> > > > >> > > stack! That breaks a number of things including lockdep (because the
> > > > >> > > kernel thread doesn't own the lock, the thread waiting for the kernel
> > > > >> > > thread owns the lock).
> > > > >> > >
> > > > >> > > If we had segmented stacks, XFS could say "I need at least 6kB of stack",
> > > > >> > > and if less than that was available, we could allocate a temporary
> > > > >> > > stack and switch to it. I suspect Google would also be able to use this
> > > > >> > > API for their rare cases when they need more than 8kB of kernel stack.
> > > > >> > > Who knows, we might all be able to use such a thing.
> > > > >> > >
> > > > >> > > I'd been thinking about this from the point of view of allocating more
> > > > >> > > stack elsewhere in kernel space, but combining what Pasha has done here
> > > > >> > > with this idea might lead to a hybrid approach that works better; allocate
> > > > >> > > 32kB of vmap space per kernel thread, put 12kB of memory at the top of it,
> > > > >> > > rely on people using this "I need more stack" API correctly, and free the
> > > > >> > > excess pages on return to userspace. No complicated "switch stacks" API
> > > > >> > > needed, just an "ensure we have at least N bytes of stack remaining" API.
> > > > >
> > > > >I like this approach! I think we could also consider having permanent
> > > > >big stacks for some kernel only threads like kvm-vcpu. A cooperative
> > > > >stack increase framework could work well and wouldn't negatively
> > > > >impact the performance of context switching. However, thorough
> > > > >analysis would be necessary to proactively identify potential stack
> > > > >overflow situations.
> > > > >
> > > > >> > Why would we need an "I need more stack" API? Pasha's approach seems
> > > > >> > like everything we need for what you're talking about.
> > > > >>
> > > > >> Because double faults are hard, possibly impossible, and the FRED approach
> > > > >> Peter described has extra overhead? This was all described up-thread.
> > > > >
> > > > >Handling faults in #DF is possible. It requires code inspection to
> > > > >handle race conditions such as what was shown by tglx. However, as
> > > > >Andy pointed out, this is not supported by SDM as it is an abort
> > > > >context (yet we return from it because of ESPFIX64, so return is
> > > > >possible).
> > > > >
> > > > >My question, however, if we ignore memory savings and only consider
> > > > >reliability aspect of this feature. What is better unconditionally
> > > > >crashing the machine because a guard page was reached, or printing a
> > > > >huge warning with a backtracing information about the offending stack,
> > > > >handling the fault, and survive? I know that historically Linus
> > > > >preferred WARN() to BUG() [1]. But, this is a somewhat different
> > > > >scenario compared to simple BUG vs WARN.
> > > > >
> > > > >Pasha
> > > > >
> > > > >[1] https://lore.kernel.org/all/[email protected]
> > > > >
> > > >
> > > > The real issue with using #DF is that if the event that caused it was asynchronous, you could lose the event.
> > >
> > > Got it. So, using a #DF handler for stack page faults isn't feasible.
> > > I suppose the only way for this to work would be to use a dedicated
> > > Interrupt Stack Table (IST) entry for page faults (#PF), but I suspect
> > > that might introduce other complications.
> > >
> > > Expanding on Mathew's idea of an interface for dynamic kernel stack
> > > sizes, here's what I'm thinking:
> > >
> > > - Kernel Threads: Create all kernel threads with a fully populated
> > > THREAD_SIZE stack. (i.e. 16K)
> > > - User Threads: Create all user threads with THREAD_SIZE kernel stack
> > > but only the top page mapped. (i.e. 4K)
> > > - In enter_from_user_mode(): Expand the thread stack to 16K by mapping
> > > three additional pages from the per-CPU stack cache. This function is
> > > called early in kernel entry points.
> > > - exit_to_user_mode(): Unmap the extra three pages and return them to
> > > the per-CPU cache. This function is called late in the kernel exit
> > > path.
> > >
> > > Both of the above hooks are called with IRQ disabled on all kernel
> > > entries whether through interrupts and syscalls, and they are called
> > > early/late enough that 4K is enough to handle the rest of entry/exit.
>
> Hi Brian,
>
> > This proposal will not have the memory savings that you are looking
> > for, since sleeping tasks would still have a fully allocated stack.
>
> The tasks that were descheduled while running in user mode should not
> increase their stack. The potential saving is greater than the
> origianl proposal, because in the origianl proposal we never shrink
> stacks after faults.

A task has to enter kernel mode in order to be rescheduled. If it
doesn't make a syscall or hit an exception, then the timer interrupt
will eventually kick it out of user mode. At some point schedule() is
called, the task is put to sleep and context is switched to the next
task. A sleeping task will always be using some amount of kernel
stack. How much depends a lot on what caused the task to sleep. If
the timeslice expired it could switch right before the return to user
mode. A page fault could go deep into filesystem and device code
waiting on an I/O operation.

> > This also would add extra overhead to each entry and exit (including
> > syscalls) that can happen multiple times before a context switch. It
> > also doesn't make much sense because a task running in user mode will
> > quickly need those stack pages back when it returns to kernel mode.
> > Even if it doesn't make a syscall, the timer interrupt will kick it
> > out of user mode.
> >
> > What should happen is that the unused stack is reclaimed when a task
> > goes to sleep. The kernel does not use a red zone, so any stack pages
> > below the saved stack pointer of a sleeping task (task->thread.sp) can
> > be safely discarded. Before context switching to a task, fully
>
> Excellent observation, this makes Andy Lutomirski per-map proposal [1]
> usable without tracking dirty/accessed bits. More reliable, and also
> platform independent.

This is x86-specific. Other architectures will likely have differences.

> > populate its task stack. After context switching from a task, reclaim
> > its unused stack. This way, the task stack in use is always fully
> > allocated and we don't have to deal with page faults.
> >
> > To make this happen, __switch_to() would have to be split into two
> > parts, to cleanly separate what happens before and after the stack
> > switch. The first part saves processor context for the previous task,
> > and prepares the next task.
>
> By knowing the stack requirements of __switch_to(), can't we actually
> do all that in the common code in context_switch() right before
> __switch_to()? We would do an arch specific call to get the
> __switch_to() stack requirement, and use that to change the value of
> task->thread.sp to know where the stack is going to be while sleeping.
> At this time we can do the unmapping of the stack pages from the
> previous task, and mapping the pages to the next task.

task->thread.sp is set in __switch_to_asm(), and is pretty much the
last thing done in the context of the previous task. Trying to
predict that value ahead of time is way too fragile. Also, the key
point I was trying to make is that you cannot safely shrink the active
stack. It can only be done after the stack switch to the new task.

> > Populating the next task's stack would
> > happen here. Then it would return to the assembly code to do the
> > stack switch. The second part then loads the context of the next
> > task, and finalizes any work for the previous task. Reclaiming the
> > unused stack pages of the previous task would happen here.
>
> The problem with this (and the origianl Andy's approach), is that we
> cannot sleep here. What happens if we get per-cpu stack cache
> exhausted because several threads sleep while having deep stacks? How
> can we schedule the next task? This is probably a corner case, but it
> needs to have a proper handling solution. One solution is while in
> schedule() and while interrupts are still enabled before going to
> switch_to() we must pre-allocate 3-page in the per-cpu. However, what
> if the pre-allocation itself calls cond_resched() because it enters
> page allocator slowpath?

You would have to keep extra pages in reserve for allocation failures.
mempool could probably help with that.
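
For instance, the existing page-pool flavor of mempool could back such a reserve; the pool name and size below are arbitrary examples:

#include <linux/init.h>
#include <linux/mempool.h>

static mempool_t *dyn_stack_reserve;	/* hypothetical reserve pool */

static int __init dyn_stack_reserve_init(void)
{
	/* Keep 16 order-0 pages in reserve; 16 is just an example value. */
	dyn_stack_reserve = mempool_create_page_pool(16, 0);
	return dyn_stack_reserve ? 0 : -ENOMEM;
}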

Brian Gerst

2024-03-18 15:00:31

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [RFC 00/14] Dynamic Kernel Stacks

On Sun, Mar 17, 2024 at 5:30 PM Brian Gerst <[email protected]> wrote:
>
> On Sun, Mar 17, 2024 at 12:15 PM Pasha Tatashin
> <[email protected]> wrote:
> >
> > On Sun, Mar 17, 2024 at 10:43 AM Brian Gerst <[email protected]> wrote:
> > >
> > > On Sat, Mar 16, 2024 at 3:18 PM Pasha Tatashin
> > > <[email protected]> wrote:
> > > >
> > > > On Thu, Mar 14, 2024 at 11:40 PM H. Peter Anvin <[email protected]> wrote:
> > > > >
> > > > > On March 14, 2024 8:13:56 PM PDT, Pasha Tatashin <[email protected]> wrote:
> > > > > >On Thu, Mar 14, 2024 at 3:57 PM Matthew Wilcox <[email protected]> wrote:
> > > > > >>
> > > > > >> On Thu, Mar 14, 2024 at 03:53:39PM -0400, Kent Overstreet wrote:
> > > > > >> > On Thu, Mar 14, 2024 at 07:43:06PM +0000, Matthew Wilcox wrote:
> > > > > >> > > On Tue, Mar 12, 2024 at 10:18:10AM -0700, H. Peter Anvin wrote:
> > > > > >> > > > Second, non-dynamic kernel memory is one of the core design decisions in
> > > > > >> > > > Linux from early on. This means there are lot of deeply embedded assumptions
> > > > > >> > > > which would have to be untangled.
> > > > > >> > >
> > > > > >> > > I think there are other ways of getting the benefit that Pasha is seeking
> > > > > >> > > without moving to dynamically allocated kernel memory. One icky thing
> > > > > >> > > that XFS does is punt work over to a kernel thread in order to use more
> > > > > >> > > stack! That breaks a number of things including lockdep (because the
> > > > > >> > > kernel thread doesn't own the lock, the thread waiting for the kernel
> > > > > >> > > thread owns the lock).
> > > > > >> > >
> > > > > >> > > If we had segmented stacks, XFS could say "I need at least 6kB of stack",
> > > > > >> > > and if less than that was available, we could allocate a temporary
> > > > > >> > > stack and switch to it. I suspect Google would also be able to use this
> > > > > >> > > API for their rare cases when they need more than 8kB of kernel stack.
> > > > > >> > > Who knows, we might all be able to use such a thing.
> > > > > >> > >
> > > > > >> > > I'd been thinking about this from the point of view of allocating more
> > > > > >> > > stack elsewhere in kernel space, but combining what Pasha has done here
> > > > > >> > > with this idea might lead to a hybrid approach that works better; allocate
> > > > > >> > > 32kB of vmap space per kernel thread, put 12kB of memory at the top of it,
> > > > > >> > > rely on people using this "I need more stack" API correctly, and free the
> > > > > >> > > excess pages on return to userspace. No complicated "switch stacks" API
> > > > > >> > > needed, just an "ensure we have at least N bytes of stack remaining" API.
> > > > > >
> > > > > >I like this approach! I think we could also consider having permanent
> > > > > >big stacks for some kernel only threads like kvm-vcpu. A cooperative
> > > > > >stack increase framework could work well and wouldn't negatively
> > > > > >impact the performance of context switching. However, thorough
> > > > > >analysis would be necessary to proactively identify potential stack
> > > > > >overflow situations.
> > > > > >
> > > > > >> > Why would we need an "I need more stack" API? Pasha's approach seems
> > > > > >> > like everything we need for what you're talking about.
> > > > > >>
> > > > > >> Because double faults are hard, possibly impossible, and the FRED approach
> > > > > >> Peter described has extra overhead? This was all described up-thread.
> > > > > >
> > > > > >Handling faults in #DF is possible. It requires code inspection to
> > > > > >handle race conditions such as what was shown by tglx. However, as
> > > > > >Andy pointed out, this is not supported by SDM as it is an abort
> > > > > >context (yet we return from it because of ESPFIX64, so return is
> > > > > >possible).
> > > > > >
> > > > > >My question, however, if we ignore memory savings and only consider
> > > > > >reliability aspect of this feature. What is better unconditionally
> > > > > >crashing the machine because a guard page was reached, or printing a
> > > > > >huge warning with a backtracing information about the offending stack,
> > > > > >handling the fault, and survive? I know that historically Linus
> > > > > >preferred WARN() to BUG() [1]. But, this is a somewhat different
> > > > > >scenario compared to simple BUG vs WARN.
> > > > > >
> > > > > >Pasha
> > > > > >
> > > > > >[1] https://lore.kernel.org/all/[email protected]
> > > > > >
> > > > >
> > > > > The real issue with using #DF is that if the event that caused it was asynchronous, you could lose the event.
> > > >
> > > > Got it. So, using a #DF handler for stack page faults isn't feasible.
> > > > I suppose the only way for this to work would be to use a dedicated
> > > > Interrupt Stack Table (IST) entry for page faults (#PF), but I suspect
> > > > that might introduce other complications.
> > > >
> > > > Expanding on Mathew's idea of an interface for dynamic kernel stack
> > > > sizes, here's what I'm thinking:
> > > >
> > > > - Kernel Threads: Create all kernel threads with a fully populated
> > > > THREAD_SIZE stack. (i.e. 16K)
> > > > - User Threads: Create all user threads with THREAD_SIZE kernel stack
> > > > but only the top page mapped. (i.e. 4K)
> > > > - In enter_from_user_mode(): Expand the thread stack to 16K by mapping
> > > > three additional pages from the per-CPU stack cache. This function is
> > > > called early in kernel entry points.
> > > > - exit_to_user_mode(): Unmap the extra three pages and return them to
> > > > the per-CPU cache. This function is called late in the kernel exit
> > > > path.
> > > >
> > > > Both of the above hooks are called with IRQ disabled on all kernel
> > > > entries whether through interrupts and syscalls, and they are called
> > > > early/late enough that 4K is enough to handle the rest of entry/exit.
> >
> > Hi Brian,
> >
> > > This proposal will not have the memory savings that you are looking
> > > for, since sleeping tasks would still have a fully allocated stack.
> >
> > The tasks that were descheduled while running in user mode should not
> > increase their stack. The potential saving is greater than the
> > original proposal, because in the original proposal we never shrink
> > stacks after faults.
>
> A task has to enter kernel mode in order to be rescheduled. If it
> doesn't make a syscall or hit an exception, then the timer interrupt
> will eventually kick it out of user mode. At some point schedule() is
> called, the task is put to sleep and context is switched to the next
> task. A sleeping task will always be using some amount of kernel
> stack. How much depends a lot on what caused the task to sleep. If
> the timeslice expired it could switch right before the return to user
> mode. A page fault could go deep into filesystem and device code
> waiting on an I/O operation.
>
> > > This also would add extra overhead to each entry and exit (including
> > > syscalls) that can happen multiple times before a context switch. It
> > > also doesn't make much sense because a task running in user mode will
> > > quickly need those stack pages back when it returns to kernel mode.
> > > Even if it doesn't make a syscall, the timer interrupt will kick it
> > > out of user mode.
> > >
> > > What should happen is that the unused stack is reclaimed when a task
> > > goes to sleep. The kernel does not use a red zone, so any stack pages
> > > below the saved stack pointer of a sleeping task (task->thread.sp) can
> > > be safely discarded. Before context switching to a task, fully
> >
> > Excellent observation, this makes Andy Lutomirski per-map proposal [1]
> > usable without tracking dirty/accessed bits. More reliable, and also
> > platform independent.
>
> This is x86-specific. Other architectures will likely have differences.
>
> > > populate its task stack. After context switching from a task, reclaim
> > > its unused stack. This way, the task stack in use is always fully
> > > allocated and we don't have to deal with page faults.
> > >
> > > To make this happen, __switch_to() would have to be split into two
> > > parts, to cleanly separate what happens before and after the stack
> > > switch. The first part saves processor context for the previous task,
> > > and prepares the next task.
> >
> > By knowing the stack requirements of __switch_to(), can't we actually
> > do all that in the common code in context_switch() right before
> > __switch_to()? We would do an arch specific call to get the
> > __switch_to() stack requirement, and use that to change the value of
> > task->thread.sp to know where the stack is going to be while sleeping.
> > At this time we can do the unmapping of the stack pages from the
> > previous task, and mapping the pages to the next task.
>
> task->thread.sp is set in __switch_to_asm(), and is pretty much the
> last thing done in the context of the previous task. Trying to
> predict that value ahead of time is way too fragile.

We don't require an exact value, but rather an approximate upper
limit. To illustrate: subtract 1K from the current .sp, round the
result down to a page boundary, and use that to decide how many pages
need unmapping. The primary advantage is that we can avoid
platform-specific ifdefs for DYNAMIC_STACKS within the arch-specific
switch_to() function. Instead, each platform can provide an
appropriate upper bound for switch_to() operations. We know how much
data these routines will store on the stack, and since interrupts are
disabled the stack is not used for anything else there, so I do not
see a problem with determining a reasonable upper bound.
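
For illustration only, here is a minimal sketch of that calculation.
The helpers arch_switch_stack_margin() and unmap_stack_pages() are
made-up names (nothing like this exists in the series yet), and
current_stack_pointer is the x86 helper:

static void shrink_stack_before_switch(struct task_struct *prev)
{
        /* Arch-provided upper bound on stack used by switch_to() itself. */
        unsigned long margin = arch_switch_stack_margin();
        unsigned long low = (unsigned long)task_stack_page(prev);
        unsigned long sp = current_stack_pointer;

        /* Everything below (sp - margin), rounded down to a page, is unused. */
        sp = ALIGN_DOWN(sp - margin, PAGE_SIZE);
        if (sp > low)
                unmap_stack_pages(prev, low, sp);
}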

> Also, the key
> point I was trying to make is that you cannot safely shrink the active
> stack. It can only be done after the stack switch to the new task.

Can you please elaborate on why this is so? If the lowest pages are not
used, and interrupts are disabled, what is unsafe about removing them
from the page table?

I am not against the idea of unmapping in __switch_to(), I just want
to understand the reasons why a more generic, but perhaps less
precise, approach would not work.

> > > Populating the next task's stack would
> > > happen here. Then it would return to the assembly code to do the
> > > stack switch. The second part then loads the context of the next
> > > task, and finalizes any work for the previous task. Reclaiming the
> > > unused stack pages of the previous task would happen here.
> >
> > The problem with this (and the origianl Andy's approach), is that we
> > cannot sleep here. What happens if we get per-cpu stack cache
> > exhausted because several threads sleep while having deep stacks? How
> > can we schedule the next task? This is probably a corner case, but it
> > needs to have a proper handling solution. One solution is while in
> > schedule() and while interrupts are still enabled before going to
> > switch_to() we must pre-allocate 3-page in the per-cpu. However, what
> > if the pre-allocation itself calls cond_resched() because it enters
> > page allocator slowpath?
>
> You would have to keep extra pages in reserve for allocation failures.
> mempool could probably help with that.

Right. Mempools do not work when interrupts are disabled, but perhaps
we can use one to keep the per-CPU cache filled from a separate
thread. I will think about it.
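
Something along these lines is what I have in mind (a rough sketch
only; the names are made up, and all locking, initialization, and
CPU-hotplug handling are omitted):

struct stack_page_reserve {
        struct page             *pages[6];      /* arbitrary reserve size */
        int                     nr;
        struct work_struct      refill;
};
static DEFINE_PER_CPU(struct stack_page_reserve, stack_page_reserve);

/* Runs in a kworker, so it may sleep and take the allocator slow path. */
static void stack_reserve_refill(struct work_struct *work)
{
        struct stack_page_reserve *res =
                container_of(work, struct stack_page_reserve, refill);

        while (res->nr < ARRAY_SIZE(res->pages)) {
                struct page *page = alloc_page(GFP_KERNEL);

                if (!page)
                        break;
                res->pages[res->nr++] = page;
        }
}

/* Called with interrupts disabled on the switch path; never allocates. */
static struct page *stack_reserve_take(void)
{
        struct stack_page_reserve *res = this_cpu_ptr(&stack_page_reserve);

        if (res->nr <= ARRAY_SIZE(res->pages) / 2)
                schedule_work_on(smp_processor_id(), &res->refill);
        return res->nr ? res->pages[--res->nr] : NULL;
}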

Thanks,
Pasha

2024-03-18 15:10:34

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [RFC 00/14] Dynamic Kernel Stacks

On Sun, Mar 17, 2024 at 2:58 PM David Laight <[email protected]> wrote:
>
> From: Pasha Tatashin
> > Sent: 16 March 2024 19:18
> ...
> > Expanding on Mathew's idea of an interface for dynamic kernel stack
> > sizes, here's what I'm thinking:
> >
> > - Kernel Threads: Create all kernel threads with a fully populated
> > THREAD_SIZE stack. (i.e. 16K)
> > - User Threads: Create all user threads with THREAD_SIZE kernel stack
> > but only the top page mapped. (i.e. 4K)
> > - In enter_from_user_mode(): Expand the thread stack to 16K by mapping
> > three additional pages from the per-CPU stack cache. This function is
> > called early in kernel entry points.
> > - exit_to_user_mode(): Unmap the extra three pages and return them to
> > the per-CPU cache. This function is called late in the kernel exit
> > path.
>
> Isn't that entirely horrid for TLB use and so will require a lot of IPI?

The TLB load is going to be exactly the same as today; we already use
small pages for VMAP stacks. We won't need extra flushing either: the
mappings are in kernel space, and once pages are removed from the page
table, no one is going to access that VA space until that thread
enters the kernel again. We will need to invalidate the VA range only
when the pages are mapped, and only on the local CPU.

> Remember, if a thread sleeps in 'extra stack' and is then resheduled
> on a different cpu the extra pages get 'pumped' from one cpu to
> another.

Yes, the per-CPU cache can get unbalanced this way; we can remember
the original CPU from which we acquired the pages and return them to
the same place.

> I also suspect a stack_probe() is likely to end up being a cache miss
> and also slow???

Can you please elaborate on this point? I am not aware of
stack_probe() or how it is used.

> So you wouldn't want one on all calls.
> I'm not sure you'd want a conditional branch either.
>
> The explicit request for 'more stack' can be required to be allowed
> to sleep - removing a lot of issues.
> It would also be portable to all architectures.
> I'd also suspect that any thread that needs extra stack is likely
> to need to again.
> So while the memory could be recovered, I'd bet is isn't worth
> doing except under memory pressure.
> The call could also return 'no' - perhaps useful for (broken) code
> that insists on being recursive.

The approach currently under discussion is somewhat different from an
explicit 'more stack' request API. I am investigating how feasible it
is to multiplex kernel stack pages, so the same pages can be reused by
many threads while they are actually needed. If the multiplexing
approach doesn't work out, I will come back to the explicit 'more
stack' API.

> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)

2024-03-18 15:14:18

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [RFC 00/14] Dynamic Kernel Stacks

On Mon, Mar 18, 2024 at 11:09 AM Pasha Tatashin
<[email protected]> wrote:
>
> On Sun, Mar 17, 2024 at 2:58 PM David Laight <[email protected]> wrote:
> >
> > From: Pasha Tatashin
> > > Sent: 16 March 2024 19:18
> > ...
> > > Expanding on Mathew's idea of an interface for dynamic kernel stack
> > > sizes, here's what I'm thinking:
> > >
> > > - Kernel Threads: Create all kernel threads with a fully populated
> > > THREAD_SIZE stack. (i.e. 16K)
> > > - User Threads: Create all user threads with THREAD_SIZE kernel stack
> > > but only the top page mapped. (i.e. 4K)
> > > - In enter_from_user_mode(): Expand the thread stack to 16K by mapping
> > > three additional pages from the per-CPU stack cache. This function is
> > > called early in kernel entry points.
> > > - exit_to_user_mode(): Unmap the extra three pages and return them to
> > > the per-CPU cache. This function is called late in the kernel exit
> > > path.
> >
> > Isn't that entirely horrid for TLB use and so will require a lot of IPI?
>
> The TLB load is going to be exactly the same as today, we already use
> small pages for VMA mapped stacks. We won't need to have extra
> flushing either, the mappings are in the kernel space, and once pages
> are removed from the page table, no one is going to access that VA
> space until that thread enters the kernel again. We will need to
> invalidate the VA range only when the pages are mapped, and only on
> the local cpu.

The TLB miss rate is going to increase, but only very slightly:
stacks are small (4 pages, of which only 3 are dynamic), so there are
at most 2-3 new misses per syscall, and only for the complicated, deep
syscalls. Therefore, I suspect it won't affect real-world performance.

> > Remember, if a thread sleeps in 'extra stack' and is then resheduled
> > on a different cpu the extra pages get 'pumped' from one cpu to
> > another.
>
> Yes, the per-cpu cache can get unbalanced this way, we can remember
> the original CPU where we acquired the pages to return to the same
> place.
>
> > I also suspect a stack_probe() is likely to end up being a cache miss
> > and also slow???
>
> Can you please elaborate on this point. I am not aware of
> stack_probe() and how it is used.
>
> > So you wouldn't want one on all calls.
> > I'm not sure you'd want a conditional branch either.
> >
> > The explicit request for 'more stack' can be required to be allowed
> > to sleep - removing a lot of issues.
> > It would also be portable to all architectures.
> > I'd also suspect that any thread that needs extra stack is likely
> > to need to again.
> > So while the memory could be recovered, I'd bet is isn't worth
> > doing except under memory pressure.
> > The call could also return 'no' - perhaps useful for (broken) code
> > that insists on being recursive.
>
> The current approach discussed is somewhat different from explicit
> more stack requests API. I am investigating how feasible it is to use
> kernel stack multiplexing, so the same pages can be re-used by many
> threads when they are actually used. If the multiplexing approach
> won't work, I will come back to the explicit more stack API.
>
> > Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> > Registration No: 1397386 (Wales)

2024-03-18 15:20:13

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [RFC 00/14] Dynamic Kernel Stacks

On Mon, Mar 18, 2024 at 11:09:47AM -0400, Pasha Tatashin wrote:
> The TLB load is going to be exactly the same as today, we already use
> small pages for VMA mapped stacks. We won't need to have extra
> flushing either, the mappings are in the kernel space, and once pages
> are removed from the page table, no one is going to access that VA
> space until that thread enters the kernel again. We will need to
> invalidate the VA range only when the pages are mapped, and only on
> the local cpu.

No; we can pass pointers to our kernel stack to other threads. The
obvious one is a mutex; we put a mutex_waiter on our own stack and
add its list_head to the mutex's waiter list. I'm sure you can
think of many other places we do this (eg wait queues, poll(), select(),
etc).
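
For reference, the mutex case looks roughly like this: a simplified
sketch modeled on __mutex_lock_common() in kernel/locking/mutex.c, with
ww-mutex handling, optimistic spinning, and error paths omitted, so
treat it as an illustration only:

static int example_lock(struct mutex *lock)
{
        struct mutex_waiter waiter;     /* lives on this task's kernel stack */

        waiter.task = current;

        raw_spin_lock(&lock->wait_lock);
        list_add_tail(&waiter.list, &lock->wait_list);
        set_current_state(TASK_UNINTERRUPTIBLE);
        raw_spin_unlock(&lock->wait_lock);

        /*
         * Sleep until the owner hands the lock over.  The unlocking CPU
         * walks lock->wait_list, i.e. it dereferences &waiter, which
         * points into this task's stack, from another CPU.
         */
        schedule();

        raw_spin_lock(&lock->wait_lock);
        list_del(&waiter.list);
        raw_spin_unlock(&lock->wait_lock);
        return 0;
}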


2024-03-18 15:31:29

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [RFC 00/14] Dynamic Kernel Stacks

On Mon, Mar 18, 2024 at 11:19 AM Matthew Wilcox <[email protected]> wrote:
>
> On Mon, Mar 18, 2024 at 11:09:47AM -0400, Pasha Tatashin wrote:
> > The TLB load is going to be exactly the same as today, we already use
> > small pages for VMA mapped stacks. We won't need to have extra
> > flushing either, the mappings are in the kernel space, and once pages
> > are removed from the page table, no one is going to access that VA
> > space until that thread enters the kernel again. We will need to
> > invalidate the VA range only when the pages are mapped, and only on
> > the local cpu.
>
> No; we can pass pointers to our kernel stack to other threads. The
> obvious one is a mutex; we put a mutex_waiter on our own stack and
> add its list_head to the mutex's waiter list. I'm sure you can
> think of many other places we do this (eg wait queues, poll(), select(),
> etc).

Hm, it means that a task can be sleeping in kernel space with its
stack pages mapped and invalidated only on the local CPU, while access
to those stack pages from a remote CPU would be problematic.

I think we still won't need an IPI, but VA-range invalidation is
actually needed on unmap, and should happen during the context switch,
i.e. every time we go off-CPU. Therefore, what Brian/Andy have
suggested makes more sense than doing this in the kernel entry/exit
paths.

Pasha

2024-03-18 15:39:20

by David Laight

[permalink] [raw]
Subject: RE: [RFC 00/14] Dynamic Kernel Stacks

...
> - exit_to_user_mode(): Unmap the extra three pages and return them to
> the per-CPU cache. This function is called late in the kernel exit
> path.

Why bother?
The number of tasks running in user mode is limited to the number
of CPUs. So the most you save is a few pages per CPU.

Plausibly a context switch from an interrupt (eg timer tick)
could suspend a task without saving anything on its kernel stack.
But how common is that in reality?
In a well behaved system most user threads will be sleeping on
some event - so with an active kernel stack.

I can also imagine that something like sys_epoll() actually
sleeps with not (that much) stack allocated.
But the calls into all the drivers to check the status
could easily go into another page.
You really wouldn't want to keep allocating and deallocating
physical pages (which I'm sure has TLB flushing costs)
all the time for those processes.

Perhaps a 'garbage collection' activity that reclaims stack
pages from processes that have been asleep 'for a while' or
haven't used a lot of stack recently (if hw 'page accessed'
bit can be used) might make more sense.

Have you done any instrumentation to see which system calls
are actually using more than (say) 8k of stack?
And how often the user threads that make those calls do so?

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2024-03-18 15:53:50

by David Laight

[permalink] [raw]
Subject: RE: [RFC 00/14] Dynamic Kernel Stacks

From: Pasha Tatashin
> Sent: 18 March 2024 15:31
>
> On Mon, Mar 18, 2024 at 11:19 AM Matthew Wilcox <[email protected]> wrote:
> >
> > On Mon, Mar 18, 2024 at 11:09:47AM -0400, Pasha Tatashin wrote:
> > > The TLB load is going to be exactly the same as today, we already use
> > > small pages for VMA mapped stacks. We won't need to have extra
> > > flushing either, the mappings are in the kernel space, and once pages
> > > are removed from the page table, no one is going to access that VA
> > > space until that thread enters the kernel again. We will need to
> > > invalidate the VA range only when the pages are mapped, and only on
> > > the local cpu.
> >
> > No; we can pass pointers to our kernel stack to other threads. The
> > obvious one is a mutex; we put a mutex_waiter on our own stack and
> > add its list_head to the mutex's waiter list. I'm sure you can
> > think of many other places we do this (eg wait queues, poll(), select(),
> > etc).
>
> Hm, it means that stack is sleeping in the kernel space, and has its
> stack pages mapped and invalidated on the local CPU, but access from
> the remote CPU to that stack pages would be problematic.
>
> I think we still won't need IPI, but VA-range invalidation is actually
> needed on unmaps, and should happen during context switch so every
> time we go off-cpu. Therefore, what Brian/Andy have suggested makes
> more sense instead of kernel/enter/exit paths.

I think you'll need to broadcast an invalidate.
Consider:
CPU A: task allocates extra pages and adds something to some list.
CPU B: accesses that data and maybe modifies it.
Some page-table walk sets up the TLB.
CPU A: task detects the modify, removes the item from the list,
collapses back the stack and sleeps.
Stack pages freed.
CPU A: task wakes up (on the same cpu for simplicity).
Goes down a deep stack and puts an item on a list.
Different physical pages are allocated.
CPU B: accesses the associated KVA.
It better not have a cached TLB.

Doesn't that need an IPI?

Freeing the pages is much harder than allocating them.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2024-03-18 17:01:47

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [RFC 00/14] Dynamic Kernel Stacks

On Mon, Mar 18, 2024 at 11:39 AM David Laight <[email protected]> wrote:
>
> ...
> > - exit_to_user_mode(): Unmap the extra three pages and return them to
> > the per-CPU cache. This function is called late in the kernel exit
> > path.
>
> Why bother?
> The number of tasks running in user_mode is limited to the number
> of cpu. So the most you save is a few pages per cpu.
>
> Plausibly a context switch from an interrupt (eg timer tick)
> could suspend a task without saving anything on its kernel stack.
> But how common is that in reality?
> In a well behaved system most user threads will be sleeping on
> some event - so with an active kernel stack.
>
> I can also imagine that something like sys_epoll() actually
> sleeps with not (that much) stack allocated.
> But the calls into all the drivers to check the status
> could easily go into another page.
> You really wouldn't to keep allocating and deallocating
> physical pages (which I'm sure has TLB flushing costs)
> all the time for those processes.
>
> Perhaps a 'garbage collection' activity that reclaims stack
> pages from processes that have been asleep 'for a while' or
> haven't used a lot of stack recently (if hw 'page accessed'
> bit can be used) might make more sense.
>
> Have you done any instrumentation to see which system calls
> are actually using more than (say) 8k of stack?
> And how often the user threads that make those calls do so?

None of our syscalls, AFAIK.

Pasha

>
> David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)

2024-03-18 17:07:50

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [RFC 00/14] Dynamic Kernel Stacks

> I think you'll need to broadcast an invalidate.
> Consider:
> CPU A: task allocates extra pages and adds something to some list.
> CPU B: accesses that data and maybe modifies it.
> Some page-table walk sets up the TLB.
> CPU A: task detects the modify, removes the item from the list,
> collapses back the stack and sleeps.
> Stack pages freed.
> CPU A: task wakes up (on the same cpu for simplicity).
> Goes down a deep stack and puts an item on a list.
> Different physical pages are allocated.
> CPU B: accesses the associated KVA.
> It better not have a cached TLB.
>
> Doesn't that need an IPI?

Yes, this is annoying. If we share a stack page with another CPU, then
back the same VA with a new page, and share it again with another CPU,
we get in trouble. Yet, an IPI on every context switch would kill
performance :-\

I wonder if there is a way to optimize this scenario, such as doing
the IPI invalidation only after the stack has been shared?

Pasha

2024-03-18 17:58:28

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [RFC 00/14] Dynamic Kernel Stacks

> > Perhaps a 'garbage collection' activity that reclaims stack
> > pages from processes that have been asleep 'for a while' or
> > haven't used a lot of stack recently (if hw 'page accessed'
> > bit can be used) might make more sense.

Interesting approach: we could take Andy's original suggestion of
using the accessed bit to find stack pages that were never used and
unmap them at context switch, and, as an extra optimization, have a
'garbage collector' that unmaps stack pages of long-sleeping, rarely
used threads. I will think about this.
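
As a strawman, such a collector could look something like this; the
dynamic_stack_*() helpers are hypothetical names, and locking,
accounting, and TLB maintenance are ignored:

static void dynamic_stack_gc(struct work_struct *work)
{
        struct task_struct *g, *t;

        rcu_read_lock();
        for_each_process_thread(g, t) {
                /* Only consider tasks that are off-CPU right now. */
                if (task_is_running(t))
                        continue;
                /* Hypothetical: skip tasks that went to sleep recently. */
                if (!dynamic_stack_idle_long_enough(t))
                        continue;
                /*
                 * Hypothetical: unmap stack pages below the saved stack
                 * pointer, or pages whose accessed bit is still clear.
                 */
                dynamic_stack_reclaim(t);
        }
        rcu_read_unlock();
}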

Thanks,
Pasha

2024-03-18 21:03:53

by Brian Gerst

[permalink] [raw]
Subject: Re: [RFC 00/14] Dynamic Kernel Stacks

On Mon, Mar 18, 2024 at 11:00 AM Pasha Tatashin
<[email protected]> wrote:
>
> On Sun, Mar 17, 2024 at 5:30 PM Brian Gerst <[email protected]> wrote:
> >
> > On Sun, Mar 17, 2024 at 12:15 PM Pasha Tatashin
> > <[email protected]> wrote:
> > >
> > > On Sun, Mar 17, 2024 at 10:43 AM Brian Gerst <[email protected]> wrote:
> > > >
> > > > On Sat, Mar 16, 2024 at 3:18 PM Pasha Tatashin
> > > > <[email protected]> wrote:
> > > > >
> > > > > On Thu, Mar 14, 2024 at 11:40 PM H. Peter Anvin <[email protected]> wrote:
> > > > > >
> > > > > > On March 14, 2024 8:13:56 PM PDT, Pasha Tatashin <[email protected]> wrote:
> > > > > > >On Thu, Mar 14, 2024 at 3:57 PM Matthew Wilcox <[email protected]> wrote:
> > > > > > >>
> > > > > > >> On Thu, Mar 14, 2024 at 03:53:39PM -0400, Kent Overstreet wrote:
> > > > > > >> > On Thu, Mar 14, 2024 at 07:43:06PM +0000, Matthew Wilcox wrote:
> > > > > > >> > > On Tue, Mar 12, 2024 at 10:18:10AM -0700, H. Peter Anvin wrote:
> > > > > > >> > > > Second, non-dynamic kernel memory is one of the core design decisions in
> > > > > > >> > > > Linux from early on. This means there are lot of deeply embedded assumptions
> > > > > > >> > > > which would have to be untangled.
> > > > > > >> > >
> > > > > > >> > > I think there are other ways of getting the benefit that Pasha is seeking
> > > > > > >> > > without moving to dynamically allocated kernel memory. One icky thing
> > > > > > >> > > that XFS does is punt work over to a kernel thread in order to use more
> > > > > > >> > > stack! That breaks a number of things including lockdep (because the
> > > > > > >> > > kernel thread doesn't own the lock, the thread waiting for the kernel
> > > > > > >> > > thread owns the lock).
> > > > > > >> > >
> > > > > > >> > > If we had segmented stacks, XFS could say "I need at least 6kB of stack",
> > > > > > >> > > and if less than that was available, we could allocate a temporary
> > > > > > >> > > stack and switch to it. I suspect Google would also be able to use this
> > > > > > >> > > API for their rare cases when they need more than 8kB of kernel stack.
> > > > > > >> > > Who knows, we might all be able to use such a thing.
> > > > > > >> > >
> > > > > > >> > > I'd been thinking about this from the point of view of allocating more
> > > > > > >> > > stack elsewhere in kernel space, but combining what Pasha has done here
> > > > > > >> > > with this idea might lead to a hybrid approach that works better; allocate
> > > > > > >> > > 32kB of vmap space per kernel thread, put 12kB of memory at the top of it,
> > > > > > >> > > rely on people using this "I need more stack" API correctly, and free the
> > > > > > >> > > excess pages on return to userspace. No complicated "switch stacks" API
> > > > > > >> > > needed, just an "ensure we have at least N bytes of stack remaining" API.
> > > > > > >
> > > > > > >I like this approach! I think we could also consider having permanent
> > > > > > >big stacks for some kernel only threads like kvm-vcpu. A cooperative
> > > > > > >stack increase framework could work well and wouldn't negatively
> > > > > > >impact the performance of context switching. However, thorough
> > > > > > >analysis would be necessary to proactively identify potential stack
> > > > > > >overflow situations.
> > > > > > >
> > > > > > >> > Why would we need an "I need more stack" API? Pasha's approach seems
> > > > > > >> > like everything we need for what you're talking about.
> > > > > > >>
> > > > > > >> Because double faults are hard, possibly impossible, and the FRED approach
> > > > > > >> Peter described has extra overhead? This was all described up-thread.
> > > > > > >
> > > > > > >Handling faults in #DF is possible. It requires code inspection to
> > > > > > >handle race conditions such as what was shown by tglx. However, as
> > > > > > >Andy pointed out, this is not supported by SDM as it is an abort
> > > > > > >context (yet we return from it because of ESPFIX64, so return is
> > > > > > >possible).
> > > > > > >
> > > > > > >My question, however, if we ignore memory savings and only consider
> > > > > > >reliability aspect of this feature. What is better unconditionally
> > > > > > >crashing the machine because a guard page was reached, or printing a
> > > > > > >huge warning with a backtracing information about the offending stack,
> > > > > > >handling the fault, and survive? I know that historically Linus
> > > > > > >preferred WARN() to BUG() [1]. But, this is a somewhat different
> > > > > > >scenario compared to simple BUG vs WARN.
> > > > > > >
> > > > > > >Pasha
> > > > > > >
> > > > > > >[1] https://lore.kernel.org/all/[email protected]
> > > > > > >
> > > > > >
> > > > > > The real issue with using #DF is that if the event that caused it was asynchronous, you could lose the event.
> > > > >
> > > > > Got it. So, using a #DF handler for stack page faults isn't feasible.
> > > > > I suppose the only way for this to work would be to use a dedicated
> > > > > Interrupt Stack Table (IST) entry for page faults (#PF), but I suspect
> > > > > that might introduce other complications.
> > > > >
> > > > > Expanding on Mathew's idea of an interface for dynamic kernel stack
> > > > > sizes, here's what I'm thinking:
> > > > >
> > > > > - Kernel Threads: Create all kernel threads with a fully populated
> > > > > THREAD_SIZE stack. (i.e. 16K)
> > > > > - User Threads: Create all user threads with THREAD_SIZE kernel stack
> > > > > but only the top page mapped. (i.e. 4K)
> > > > > - In enter_from_user_mode(): Expand the thread stack to 16K by mapping
> > > > > three additional pages from the per-CPU stack cache. This function is
> > > > > called early in kernel entry points.
> > > > > - exit_to_user_mode(): Unmap the extra three pages and return them to
> > > > > the per-CPU cache. This function is called late in the kernel exit
> > > > > path.
> > > > >
> > > > > Both of the above hooks are called with IRQ disabled on all kernel
> > > > > entries whether through interrupts and syscalls, and they are called
> > > > > early/late enough that 4K is enough to handle the rest of entry/exit.
> > >
> > > Hi Brian,
> > >
> > > > This proposal will not have the memory savings that you are looking
> > > > for, since sleeping tasks would still have a fully allocated stack.
> > >
> > > The tasks that were descheduled while running in user mode should not
> > > increase their stack. The potential saving is greater than the
> > > original proposal, because in the original proposal we never shrink
> > > stacks after faults.
> >
> > A task has to enter kernel mode in order to be rescheduled. If it
> > doesn't make a syscall or hit an exception, then the timer interrupt
> > will eventually kick it out of user mode. At some point schedule() is
> > called, the task is put to sleep and context is switched to the next
> > task. A sleeping task will always be using some amount of kernel
> > stack. How much depends a lot on what caused the task to sleep. If
> > the timeslice expired it could switch right before the return to user
> > mode. A page fault could go deep into filesystem and device code
> > waiting on an I/O operation.
> >
> > > > This also would add extra overhead to each entry and exit (including
> > > > syscalls) that can happen multiple times before a context switch. It
> > > > also doesn't make much sense because a task running in user mode will
> > > > quickly need those stack pages back when it returns to kernel mode.
> > > > Even if it doesn't make a syscall, the timer interrupt will kick it
> > > > out of user mode.
> > > >
> > > > What should happen is that the unused stack is reclaimed when a task
> > > > goes to sleep. The kernel does not use a red zone, so any stack pages
> > > > below the saved stack pointer of a sleeping task (task->thread.sp) can
> > > > be safely discarded. Before context switching to a task, fully
> > >
> > > Excellent observation, this makes Andy Lutomirski per-map proposal [1]
> > > usable without tracking dirty/accessed bits. More reliable, and also
> > > platform independent.
> >
> > This is x86-specific. Other architectures will likely have differences.
> >
> > > > populate its task stack. After context switching from a task, reclaim
> > > > its unused stack. This way, the task stack in use is always fully
> > > > allocated and we don't have to deal with page faults.
> > > >
> > > > To make this happen, __switch_to() would have to be split into two
> > > > parts, to cleanly separate what happens before and after the stack
> > > > switch. The first part saves processor context for the previous task,
> > > > and prepares the next task.
> > >
> > > By knowing the stack requirements of __switch_to(), can't we actually
> > > do all that in the common code in context_switch() right before
> > > __switch_to()? We would do an arch specific call to get the
> > > __switch_to() stack requirement, and use that to change the value of
> > > task->thread.sp to know where the stack is going to be while sleeping.
> > > At this time we can do the unmapping of the stack pages from the
> > > previous task, and mapping the pages to the next task.
> >
> > task->thread.sp is set in __switch_to_asm(), and is pretty much the
> > last thing done in the context of the previous task. Trying to
> > predict that value ahead of time is way too fragile.
>
> We don't require an exact value, but rather an approximate upper
> limit. To illustrate, subtract 1K from the current .sp, then determine
> the corresponding page to decide the number of pages needing
> unmapping. The primary advantage is that we can avoid
> platform-specific ifdefs for DYNAMIC_STACKS within the arch-specific
> switch_to() function. Instead, each platform can provide an
> appropriate upper bound for switch_to() operations. We know the amount
> of information is going to be stored on the stack by the routines, and
> also since interrupts are disabled stacks are not used for anything
> else there, so I do not see a problem with determining a reasonable
> upper bound.

The stack usage will vary depending on compiler version and
optimization settings. Making an educated guess is possible, but may
not be enough in the future.

What would be nice is to get some actual data on stack usage under
various workloads, both maximum depth and depth at context switch.
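
One cheap way to collect that (a sketch, not something in the series):
sample the depth right before each switch. stack_not_used() already
gives the high-water mark when CONFIG_DEBUG_STACK_USAGE is enabled,
current_stack_pointer is the x86 helper, and the per-CPU histograms
below are made up:

static DEFINE_PER_CPU(unsigned long, depth_now_hist[THREAD_SIZE / 1024 + 1]);
static DEFINE_PER_CPU(unsigned long, depth_max_hist[THREAD_SIZE / 1024 + 1]);

/* Call just before switching away from 'current', e.g. from schedule(). */
static inline void record_stack_depth_at_switch(void)
{
        unsigned long top = (unsigned long)task_stack_page(current) + THREAD_SIZE;
        unsigned long depth_now = top - current_stack_pointer;
        unsigned long depth_max = THREAD_SIZE - stack_not_used(current);

        this_cpu_inc(depth_now_hist[depth_now / 1024]);
        this_cpu_inc(depth_max_hist[depth_max / 1024]);
}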

> > Also, the key
> > point I was trying to make is that you cannot safely shrink the active
> > stack. It can only be done after the stack switch to the new task.
>
> Can you please elaborate why this is so? If the lowest pages are not
> used, and interrupts are disabled what is not safe about removing them
> from the page table?
>
> I am not against the idea of unmapping in __switch_to(), I just want
> to understand the reasons why more generic but perhaps not as precise
> approach would not work.

As long as a wide buffer is given, it would probably be safe. But it
would still be safer and more precise if done after the switch.



Brian Gerst

2024-03-19 14:57:10

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [RFC 00/14] Dynamic Kernel Stacks

On Mon, Mar 18, 2024 at 5:02 PM Brian Gerst <[email protected]> wrote:
>
> On Mon, Mar 18, 2024 at 11:00 AM Pasha Tatashin
> <[email protected]> wrote:
> >
> > On Sun, Mar 17, 2024 at 5:30 PM Brian Gerst <[email protected]> wrote:
> > >
> > > On Sun, Mar 17, 2024 at 12:15 PM Pasha Tatashin
> > > <[email protected]> wrote:
> > > >
> > > > On Sun, Mar 17, 2024 at 10:43 AM Brian Gerst <[email protected]> wrote:
> > > > >
> > > > > On Sat, Mar 16, 2024 at 3:18 PM Pasha Tatashin
> > > > > <[email protected]> wrote:
> > > > > >
> > > > > > On Thu, Mar 14, 2024 at 11:40 PM H. Peter Anvin <[email protected]> wrote:
> > > > > > >
> > > > > > > On March 14, 2024 8:13:56 PM PDT, Pasha Tatashin <[email protected]> wrote:
> > > > > > > >On Thu, Mar 14, 2024 at 3:57 PM Matthew Wilcox <[email protected]> wrote:
> > > > > > > >>
> > > > > > > >> On Thu, Mar 14, 2024 at 03:53:39PM -0400, Kent Overstreet wrote:
> > > > > > > >> > On Thu, Mar 14, 2024 at 07:43:06PM +0000, Matthew Wilcox wrote:
> > > > > > > >> > > On Tue, Mar 12, 2024 at 10:18:10AM -0700, H. Peter Anvin wrote:
> > > > > > > >> > > > Second, non-dynamic kernel memory is one of the core design decisions in
> > > > > > > >> > > > Linux from early on. This means there are lot of deeply embedded assumptions
> > > > > > > >> > > > which would have to be untangled.
> > > > > > > >> > >
> > > > > > > >> > > I think there are other ways of getting the benefit that Pasha is seeking
> > > > > > > >> > > without moving to dynamically allocated kernel memory. One icky thing
> > > > > > > >> > > that XFS does is punt work over to a kernel thread in order to use more
> > > > > > > >> > > stack! That breaks a number of things including lockdep (because the
> > > > > > > >> > > kernel thread doesn't own the lock, the thread waiting for the kernel
> > > > > > > >> > > thread owns the lock).
> > > > > > > >> > >
> > > > > > > >> > > If we had segmented stacks, XFS could say "I need at least 6kB of stack",
> > > > > > > >> > > and if less than that was available, we could allocate a temporary
> > > > > > > >> > > stack and switch to it. I suspect Google would also be able to use this
> > > > > > > >> > > API for their rare cases when they need more than 8kB of kernel stack.
> > > > > > > >> > > Who knows, we might all be able to use such a thing.
> > > > > > > >> > >
> > > > > > > >> > > I'd been thinking about this from the point of view of allocating more
> > > > > > > >> > > stack elsewhere in kernel space, but combining what Pasha has done here
> > > > > > > >> > > with this idea might lead to a hybrid approach that works better; allocate
> > > > > > > >> > > 32kB of vmap space per kernel thread, put 12kB of memory at the top of it,
> > > > > > > >> > > rely on people using this "I need more stack" API correctly, and free the
> > > > > > > >> > > excess pages on return to userspace. No complicated "switch stacks" API
> > > > > > > >> > > needed, just an "ensure we have at least N bytes of stack remaining" API.
> > > > > > > >
> > > > > > > >I like this approach! I think we could also consider having permanent
> > > > > > > >big stacks for some kernel only threads like kvm-vcpu. A cooperative
> > > > > > > >stack increase framework could work well and wouldn't negatively
> > > > > > > >impact the performance of context switching. However, thorough
> > > > > > > >analysis would be necessary to proactively identify potential stack
> > > > > > > >overflow situations.
> > > > > > > >
> > > > > > > >> > Why would we need an "I need more stack" API? Pasha's approach seems
> > > > > > > >> > like everything we need for what you're talking about.
> > > > > > > >>
> > > > > > > >> Because double faults are hard, possibly impossible, and the FRED approach
> > > > > > > >> Peter described has extra overhead? This was all described up-thread.
> > > > > > > >
> > > > > > > >Handling faults in #DF is possible. It requires code inspection to
> > > > > > > >handle race conditions such as what was shown by tglx. However, as
> > > > > > > >Andy pointed out, this is not supported by SDM as it is an abort
> > > > > > > >context (yet we return from it because of ESPFIX64, so return is
> > > > > > > >possible).
> > > > > > > >
> > > > > > > >My question, however, if we ignore memory savings and only consider
> > > > > > > >reliability aspect of this feature. What is better unconditionally
> > > > > > > >crashing the machine because a guard page was reached, or printing a
> > > > > > > >huge warning with a backtracing information about the offending stack,
> > > > > > > >handling the fault, and survive? I know that historically Linus
> > > > > > > >preferred WARN() to BUG() [1]. But, this is a somewhat different
> > > > > > > >scenario compared to simple BUG vs WARN.
> > > > > > > >
> > > > > > > >Pasha
> > > > > > > >
> > > > > > > >[1] https://lore.kernel.org/all/[email protected]
> > > > > > > >
> > > > > > >
> > > > > > > The real issue with using #DF is that if the event that caused it was asynchronous, you could lose the event.
> > > > > >
> > > > > > Got it. So, using a #DF handler for stack page faults isn't feasible.
> > > > > > I suppose the only way for this to work would be to use a dedicated
> > > > > > Interrupt Stack Table (IST) entry for page faults (#PF), but I suspect
> > > > > > that might introduce other complications.
> > > > > >
> > > > > > Expanding on Mathew's idea of an interface for dynamic kernel stack
> > > > > > sizes, here's what I'm thinking:
> > > > > >
> > > > > > - Kernel Threads: Create all kernel threads with a fully populated
> > > > > > THREAD_SIZE stack. (i.e. 16K)
> > > > > > - User Threads: Create all user threads with THREAD_SIZE kernel stack
> > > > > > but only the top page mapped. (i.e. 4K)
> > > > > > - In enter_from_user_mode(): Expand the thread stack to 16K by mapping
> > > > > > three additional pages from the per-CPU stack cache. This function is
> > > > > > called early in kernel entry points.
> > > > > > - exit_to_user_mode(): Unmap the extra three pages and return them to
> > > > > > the per-CPU cache. This function is called late in the kernel exit
> > > > > > path.
> > > > > >
> > > > > > Both of the above hooks are called with IRQ disabled on all kernel
> > > > > > entries whether through interrupts and syscalls, and they are called
> > > > > > early/late enough that 4K is enough to handle the rest of entry/exit.
> > > >
> > > > Hi Brian,
> > > >
> > > > > This proposal will not have the memory savings that you are looking
> > > > > for, since sleeping tasks would still have a fully allocated stack.
> > > >
> > > > The tasks that were descheduled while running in user mode should not
> > > > increase their stack. The potential saving is greater than the
> > > > origianl proposal, because in the origianl proposal we never shrink
> > > > stacks after faults.
> > >
> > > A task has to enter kernel mode in order to be rescheduled. If it
> > > doesn't make a syscall or hit an exception, then the timer interrupt
> > > will eventually kick it out of user mode. At some point schedule() is
> > > called, the task is put to sleep and context is switched to the next
> > > task. A sleeping task will always be using some amount of kernel
> > > stack. How much depends a lot on what caused the task to sleep. If
> > > the timeslice expired it could switch right before the return to user
> > > mode. A page fault could go deep into filesystem and device code
> > > waiting on an I/O operation.
> > >
> > > > > This also would add extra overhead to each entry and exit (including
> > > > > syscalls) that can happen multiple times before a context switch. It
> > > > > also doesn't make much sense because a task running in user mode will
> > > > > quickly need those stack pages back when it returns to kernel mode.
> > > > > Even if it doesn't make a syscall, the timer interrupt will kick it
> > > > > out of user mode.
> > > > >
> > > > > What should happen is that the unused stack is reclaimed when a task
> > > > > goes to sleep. The kernel does not use a red zone, so any stack pages
> > > > > below the saved stack pointer of a sleeping task (task->thread.sp) can
> > > > > be safely discarded. Before context switching to a task, fully
> > > >
> > > > Excellent observation, this makes Andy Lutomirski per-map proposal [1]
> > > > usable without tracking dirty/accessed bits. More reliable, and also
> > > > platform independent.
> > >
> > > This is x86-specific. Other architectures will likely have differences.
> > >
> > > > > populate its task stack. After context switching from a task, reclaim
> > > > > its unused stack. This way, the task stack in use is always fully
> > > > > allocated and we don't have to deal with page faults.
> > > > >
> > > > > To make this happen, __switch_to() would have to be split into two
> > > > > parts, to cleanly separate what happens before and after the stack
> > > > > switch. The first part saves processor context for the previous task,
> > > > > and prepares the next task.
> > > >
> > > > By knowing the stack requirements of __switch_to(), can't we actually
> > > > do all that in the common code in context_switch() right before
> > > > __switch_to()? We would do an arch specific call to get the
> > > > __switch_to() stack requirement, and use that to change the value of
> > > > task->thread.sp to know where the stack is going to be while sleeping.
> > > > At this time we can do the unmapping of the stack pages from the
> > > > previous task, and mapping the pages to the next task.
> > >
> > > task->thread.sp is set in __switch_to_asm(), and is pretty much the
> > > last thing done in the context of the previous task. Trying to
> > > predict that value ahead of time is way too fragile.
> >
> > We don't require an exact value, but rather an approximate upper
> > limit. To illustrate, subtract 1K from the current .sp, then determine
> > the corresponding page to decide the number of pages needing
> > unmapping. The primary advantage is that we can avoid
> > platform-specific ifdefs for DYNAMIC_STACKS within the arch-specific
> > switch_to() function. Instead, each platform can provide an
> > appropriate upper bound for switch_to() operations. We know the amount
> > of information is going to be stored on the stack by the routines, and
> > also since interrupts are disabled stacks are not used for anything
> > else there, so I do not see a problem with determining a reasonable
> > upper bound.
>
> The stack usage will vary depending on compiler version and
> optimization settings. Making an educated guess is possible, but may
> not be enough in the future.
>
> What would be nice is to get some actual data on stack usage under
> various workloads, both maximum depth and depth at context switch.
>
> > > Also, the key
> > > point I was trying to make is that you cannot safely shrink the active
> > > stack. It can only be done after the stack switch to the new task.
> >
> > Can you please elaborate why this is so? If the lowest pages are not
> > used, and interrupts are disabled what is not safe about removing them
> > from the page table?
> >
> > I am not against the idea of unmapping in __switch_to(), I just want
> > to understand the reasons why more generic but perhaps not as precise
> > approach would not work.
>
> As long as a wide buffer is given, it would probably be safe. But it
> would still be safer and more precise if done after the switch.

Makes sense. It looks like using task->thread.sp during context switch
is not possible, because the pages might have been shared with another
CPU. We would need to do an IPI TLB invalidation, which would be too
expensive for the context switch. Therefore, the PTE accessed bit is
more reliable for determining which pages can be unmapped. However, we
could still use task->thread.sp in a garbage collector.
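
A sketch of what checking the accessed bit of a stack PTE could look
like. Kernel stacks are vmalloc mappings, so this walks init_mm with
apply_to_existing_page_range(); locking and TLB maintenance are
deliberately ignored here:

static int stack_pte_test_and_clear_young(pte_t *pte, unsigned long addr,
                                          void *data)
{
        bool *young = data;

        if (pte_present(*pte) && pte_young(*pte)) {
                *young = true;
                set_pte(pte, pte_mkold(*pte));  /* TLB flush omitted */
        }
        return 0;
}

/* Returns true if the stack page at 'addr' was touched since the last scan. */
static bool stack_page_was_used(unsigned long addr)
{
        bool young = false;

        apply_to_existing_page_range(&init_mm, addr, PAGE_SIZE,
                                     stack_pte_test_and_clear_young, &young);
        return young;
}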

Pasha

2024-03-19 16:42:48

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [RFC 03/14] fork: Clean-up naming of vm_strack/vm_struct variables in vmap stacks code

On Sun, Mar 17, 2024 at 10:42 AM Christophe JAILLET
<[email protected]> wrote:
>
> Le 11/03/2024 à 17:46, Pasha Tatashin a écrit :
> > There are two data types: "struct vm_struct" and "struct vm_stack" that
> > have the same local variable names: vm_stack, or vm, or s, which makes
> > code confusing to read.
> >
> > Change the code so the naming is consisent:
>
> Nit: consistent
>
> >
> > struct vm_struct is always called vm_area
> > struct vm_stack is always called vm_stack
> >
> > Signed-off-by: Pasha Tatashin <[email protected]>
> > ---
> > kernel/fork.c | 38 ++++++++++++++++++--------------------
> > 1 file changed, 18 insertions(+), 20 deletions(-)
> >
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 32600bf2422a..60e812825a7a 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -192,12 +192,12 @@ struct vm_stack {
> > struct vm_struct *stack_vm_area;
> > };
> >
> > -static bool try_release_thread_stack_to_cache(struct vm_struct *vm)
> > +static bool try_release_thread_stack_to_cache(struct vm_struct *vm_area)
> > {
> > unsigned int i;
> >
> > for (i = 0; i < NR_CACHED_STACKS; i++) {
> > - if (this_cpu_cmpxchg(cached_stacks[i], NULL, vm) != NULL)
> > + if (this_cpu_cmpxchg(cached_stacks[i], NULL, vm_area) != NULL)
> > continue;
> > return true;
> > }
> > @@ -207,11 +207,12 @@ static bool try_release_thread_stack_to_cache(struct vm_struct *vm)
> > static void thread_stack_free_rcu(struct rcu_head *rh)
> > {
> > struct vm_stack *vm_stack = container_of(rh, struct vm_stack, rcu);
> > + struct vm_struct *vm_area = vm_stack->stack_vm_area;
> >
> > if (try_release_thread_stack_to_cache(vm_stack->stack_vm_area))
> > return;
> >
> > - vfree(vm_stack);
> > + vfree(vm_area->addr);
>
> This does not look like only a renaming of a variable. Is it?
>
> If no, should there be a Fixes tag and should it be detailed in the
> commit description?

This change is only for readability purposes. vm_stack is stored at
the start of vm_area, so vfree(vm_stack) is equivalent to
vfree(vm_area->addr), but harder to read. I will add this to the
changelog.
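
For reference, the code in question is roughly this; the vm_stack
header is placed at the very start of the stack allocation, so the
pointer is the same as vm_area->addr:

static void thread_stack_delayed_free(struct task_struct *tsk)
{
        struct vm_stack *vm_stack = tsk->stack; /* == vm_area->addr */

        vm_stack->stack_vm_area = tsk->stack_vm_area;
        call_rcu(&vm_stack->rcu, thread_stack_free_rcu);
}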

>
> CJ
>
> > }
> >
> > static void thread_stack_delayed_free(struct task_struct *tsk)
> > @@ -228,12 +229,12 @@ static int free_vm_stack_cache(unsigned int cpu)
> > int i;
> >
> > for (i = 0; i < NR_CACHED_STACKS; i++) {
> > - struct vm_struct *vm_stack = cached_vm_stacks[i];
> > + struct vm_struct *vm_area = cached_vm_stacks[i];
> >
> > - if (!vm_stack)
> > + if (!vm_area)
> > continue;
> >
> > - vfree(vm_stack->addr);
> > + vfree(vm_area->addr);
> > cached_vm_stacks[i] = NULL;
> > }
> >
> > @@ -263,32 +264,29 @@ static int memcg_charge_kernel_stack(struct vm_struct *vm)
> >
> > static int alloc_thread_stack_node(struct task_struct *tsk, int node)
> > {
> > - struct vm_struct *vm;
> > + struct vm_struct *vm_area;
> > void *stack;
> > int i;
> >
> > for (i = 0; i < NR_CACHED_STACKS; i++) {
> > - struct vm_struct *s;
> > -
> > - s = this_cpu_xchg(cached_stacks[i], NULL);
> > -
> > - if (!s)
> > + vm_area = this_cpu_xchg(cached_stacks[i], NULL);
> > + if (!vm_area)
> > continue;
> >
> > /* Reset stack metadata. */
> > - kasan_unpoison_range(s->addr, THREAD_SIZE);
> > + kasan_unpoison_range(vm_area->addr, THREAD_SIZE);
> >
> > - stack = kasan_reset_tag(s->addr);
> > + stack = kasan_reset_tag(vm_area->addr);
> >
> > /* Clear stale pointers from reused stack. */
> > memset(stack, 0, THREAD_SIZE);
> >
> > - if (memcg_charge_kernel_stack(s)) {
> > - vfree(s->addr);
> > + if (memcg_charge_kernel_stack(vm_area)) {
> > + vfree(vm_area->addr);
> > return -ENOMEM;
> > }
> >
> > - tsk->stack_vm_area = s;
> > + tsk->stack_vm_area = vm_area;
> > tsk->stack = stack;
> > return 0;
> > }
> > @@ -306,8 +304,8 @@ static int alloc_thread_stack_node(struct task_struct *tsk, int node)
> > if (!stack)
> > return -ENOMEM;
> >
> > - vm = find_vm_area(stack);
> > - if (memcg_charge_kernel_stack(vm)) {
> > + vm_area = find_vm_area(stack);
> > + if (memcg_charge_kernel_stack(vm_area)) {
> > vfree(stack);
> > return -ENOMEM;
> > }
> > @@ -316,7 +314,7 @@ static int alloc_thread_stack_node(struct task_struct *tsk, int node)
> > * free_thread_stack() can be called in interrupt context,
> > * so cache the vm_struct.
> > */
> > - tsk->stack_vm_area = vm;
> > + tsk->stack_vm_area = vm_area;
> > stack = kasan_reset_tag(stack);
> > tsk->stack = stack;
> > return 0;
>

2024-03-17 15:15:54

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [RFC 06/14] fork: zero vmap stack using clear_page() instead of memset()

On Sun, Mar 17, 2024 at 10:48 AM Christophe JAILLET
<[email protected]> wrote:
>
> Le 11/03/2024 à 17:46, Pasha Tatashin a écrit :
> > In preporation for dynamic kernel stacks do not zero the whole span of
>
> Nit: preparation

Thank you,
Pasha

>
> > the stack, but instead only the pages that are part of the vm_area.
> >
> > This is because with dynamic stacks we might have only partially
> > populated stacks.
> >
> > Signed-off-by: Pasha Tatashin <[email protected]>
> > ---
> > kernel/fork.c | 6 ++++--
> > 1 file changed, 4 insertions(+), 2 deletions(-)
>
> ...
>

2024-03-17 14:50:08

by Christophe JAILLET

[permalink] [raw]
Subject: Re: [RFC 06/14] fork: zero vmap stack using clear_page() instead of memset()

Le 11/03/2024 à 17:46, Pasha Tatashin a écrit :
> In preporation for dynamic kernel stacks do not zero the whole span of

Nit: preparation

> the stack, but instead only the pages that are part of the vm_area.
>
> This is because with dynamic stacks we might have only partially
> populated stacks.
>
> Signed-off-by: Pasha Tatashin <[email protected]>
> ---
> kernel/fork.c | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)

..


2024-03-17 15:16:15

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [RFC 08/14] fork: separate vmap stack alloction and free calls

On Sun, Mar 17, 2024 at 10:52 AM Christophe JAILLET
<[email protected]> wrote:
>
> Le 11/03/2024 à 17:46, Pasha Tatashin a écrit :
> > In preparation for the dynamic stacks, separate out the
> > __vmalloc_node_range and vfree calls from the vmap based stack
> > allocations. The dynamic stacks will use their own variants of these
> > functions.
> >
> > Signed-off-by: Pasha Tatashin <[email protected]>
> > ---
> > kernel/fork.c | 53 ++++++++++++++++++++++++++++++---------------------
> > 1 file changed, 31 insertions(+), 22 deletions(-)
> >
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 3004e6ce6c65..bbae5f705773 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -204,6 +204,29 @@ static bool try_release_thread_stack_to_cache(struct vm_struct *vm_area)
> > return false;
> > }
> >
> > +static inline struct vm_struct *alloc_vmap_stack(int node)
> > +{
> > + void *stack;
> > +
> > + /*
> > + * Allocated stacks are cached and later reused by new threads,
> > + * so memcg accounting is performed manually on assigning/releasing
> > + * stacks to tasks. Drop __GFP_ACCOUNT.
> > + */
> > + stack = __vmalloc_node_range(THREAD_SIZE, THREAD_ALIGN,
> > + VMALLOC_START, VMALLOC_END,
> > + THREADINFO_GFP & ~__GFP_ACCOUNT,
> > + PAGE_KERNEL,
> > + 0, node, __builtin_return_address(0));
> > +
> > + return (stack) ? find_vm_area(stack) : NULL;
>
> Nit: superfluous ()

Thank you.

>
> > +}
>
> ...
>

2024-03-17 14:52:16

by Christophe JAILLET

[permalink] [raw]
Subject: Re: [RFC 08/14] fork: separate vmap stack alloction and free calls

Le 11/03/2024 à 17:46, Pasha Tatashin a écrit :
> In preparation for the dynamic stacks, separate out the
> __vmalloc_node_range and vfree calls from the vmap based stack
> allocations. The dynamic stacks will use their own variants of these
> functions.
>
> Signed-off-by: Pasha Tatashin <[email protected]>
> ---
> kernel/fork.c | 53 ++++++++++++++++++++++++++++++---------------------
> 1 file changed, 31 insertions(+), 22 deletions(-)
>
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 3004e6ce6c65..bbae5f705773 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -204,6 +204,29 @@ static bool try_release_thread_stack_to_cache(struct vm_struct *vm_area)
> return false;
> }
>
> +static inline struct vm_struct *alloc_vmap_stack(int node)
> +{
> + void *stack;
> +
> + /*
> + * Allocated stacks are cached and later reused by new threads,
> + * so memcg accounting is performed manually on assigning/releasing
> + * stacks to tasks. Drop __GFP_ACCOUNT.
> + */
> + stack = __vmalloc_node_range(THREAD_SIZE, THREAD_ALIGN,
> + VMALLOC_START, VMALLOC_END,
> + THREADINFO_GFP & ~__GFP_ACCOUNT,
> + PAGE_KERNEL,
> + 0, node, __builtin_return_address(0));
> +
> + return (stack) ? find_vm_area(stack) : NULL;

Nit: superfluous ()

> +}

..