The kernel has recently added support for shadow stacks, currently
x86 only using its CET feature, but both arm64 and RISC-V have
equivalent features (GCS and Zisslpcfi respectively); I am actively
working on GCS[1]. With shadow stacks the hardware maintains an
additional stack containing only the return addresses for branch
instructions. This stack is not generally writeable by userspace, and
the hardware ensures that any returns are to the recorded addresses.
This provides some protection against ROP attacks and makes it easier
to collect call stacks. These shadow stacks are allocated in the
address space of the userspace process.
Our API for shadow stacks does not currently offer userspace any
flexibility for managing the allocation of shadow stacks for newly
created threads; instead the kernel allocates a new shadow stack with
the same size as the normal stack whenever a thread is created with the
feature enabled. The stacks allocated in this way are freed by the
kernel when the thread exits or shadow stacks are disabled for the
thread. This lack of flexibility and control isn't ideal: in the vast
majority of cases the shadow stack will be over-allocated, and the
implicit allocation and deallocation is not consistent with other
interfaces. As far as I can tell the interface is done in this manner
mainly because the shadow stack patches had been in development since
before clone3() was implemented.
Since clone3() is readily extensible, let's add support for specifying a
shadow stack when creating a new thread or process in a similar manner
to how the normal stack is specified, keeping the current implicit
allocation behaviour if one is not specified either with clone3() or
through the use of clone(). When the shadow stack is specified
explicitly the kernel will not free it; the inconsistency with
implicitly allocated shadow stacks is a bit awkward, but that's existing
ABI so we can't change it.
The memory provided must have been allocated for use as a shadow stack;
the expectation is that this will be done using the map_shadow_stack()
syscall. I opted not to add validation for this in clone3() since it
will be enforced by hardware anyway.
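As a rough sketch of the expected usage from userspace (hedged: this is
illustrative rather than part of the series; the struct layout mirrors
the uapi change in these patches and the syscall number is the x86-64
assignment used by the selftests below):

#define _GNU_SOURCE
#include <sched.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <linux/types.h>

#ifndef __NR_map_shadow_stack
#define __NR_map_shadow_stack 453
#endif

/* struct clone_args as extended by this series (CLONE_ARGS_SIZE_VER3) */
struct clone_args_v3 {
	__aligned_u64 flags;
	__aligned_u64 pidfd;
	__aligned_u64 child_tid;
	__aligned_u64 parent_tid;
	__aligned_u64 exit_signal;
	__aligned_u64 stack;
	__aligned_u64 stack_size;
	__aligned_u64 tls;
	__aligned_u64 set_tid;
	__aligned_u64 set_tid_size;
	__aligned_u64 cgroup;
	__aligned_u64 shadow_stack;
	__aligned_u64 shadow_stack_size;
};

static pid_t spawn_thread_with_shadow_stack(void *stack, size_t stack_size)
{
	size_t ss_size = getpagesize();
	/* The memory must be mapped as shadow stack (VM_SHADOW_STACK) */
	uint64_t ss = syscall(__NR_map_shadow_stack, 0, ss_size, 0);
	struct clone_args_v3 args;

	if (ss == (uint64_t)-1)
		return -1;

	memset(&args, 0, sizeof(args));
	args.flags = CLONE_VM;	/* shadow stacks only apply with CLONE_VM */
	args.stack = (uintptr_t)stack;
	args.stack_size = stack_size;
	args.shadow_stack = ss;	/* lowest address of the shadow stack */
	args.shadow_stack_size = ss_size;

	/* With an explicit shadow stack the caller must free both stacks */
	return syscall(__NR_clone3, &args, sizeof(args));
}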
Please note that the x86 portions of this code are build tested only; I
don't appear to have a system that can run CET available to me. I have
done testing with an integration into my pending work for GCS. There is
some possibility that the arm64 implementation may require the use of
clone3() and explicit userspace allocation of shadow stacks; this is
still under discussion.
A new architecture feature Kconfig option for shadow stacks is added
here; this was suggested as part of the review comments for the arm64
GCS series, and since we need to detect if shadow stacks are supported
it seemed sensible to roll it in here.
The selftest portions of this depend on 34dce23f7e40 ("selftests/clone3:
Report descriptive test names") in -next[2].
[1] https://lore.kernel.org/r/[email protected]/
[2] https://lore.kernel.org/r/[email protected]
Signed-off-by: Mark Brown <[email protected]>
---
Mark Brown (5):
mm: Introduce ARCH_HAS_USER_SHADOW_STACK
fork: Add shadow stack support to clone3()
selftests/clone3: Factor more of main loop into test_clone3()
selftests/clone3: Allow tests to flag if -E2BIG is a valid error code
kselftest/clone3: Test shadow stack support
arch/x86/Kconfig | 1 +
arch/x86/include/asm/shstk.h | 11 +-
arch/x86/kernel/process.c | 2 +-
arch/x86/kernel/shstk.c | 36 ++++-
fs/proc/task_mmu.c | 2 +-
include/linux/mm.h | 2 +-
include/linux/sched/task.h | 2 +
include/uapi/linux/sched.h | 17 +-
kernel/fork.c | 40 ++++-
mm/Kconfig | 6 +
tools/testing/selftests/clone3/clone3.c | 180 +++++++++++++++++-----
tools/testing/selftests/clone3/clone3_selftests.h | 5 +
12 files changed, 247 insertions(+), 57 deletions(-)
---
base-commit: 80ab9b52e8d4add7735abdfb935877354b69edb6
change-id: 20231019-clone3-shadow-stack-15d40d2bf536
Best regards,
--
Mark Brown <[email protected]>
Since multiple architectures have support for shadow stacks, and we need
to select support for this feature in several places in the generic
code, provide a generic config option that the architectures can select.
Suggested-by: David Hildenbrand <[email protected]>
Signed-off-by: Mark Brown <[email protected]>
---
arch/x86/Kconfig | 1 +
fs/proc/task_mmu.c | 2 +-
include/linux/mm.h | 2 +-
mm/Kconfig | 6 ++++++
4 files changed, 9 insertions(+), 2 deletions(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 66bfabae8814..2b72bae0d877 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1926,6 +1926,7 @@ config X86_USER_SHADOW_STACK
depends on AS_WRUSS
depends on X86_64
select ARCH_USES_HIGH_VMA_FLAGS
+ select ARCH_HAS_USER_SHADOW_STACK
select X86_CET
help
Shadow stack protection is a hardware feature that detects function
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 3dd5be96691b..4101e741663a 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -697,7 +697,7 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR
[ilog2(VM_UFFD_MINOR)] = "ui",
#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_MINOR */
-#ifdef CONFIG_X86_USER_SHADOW_STACK
+#ifdef CONFIG_ARCH_HAS_USER_SHADOW_STACK
[ilog2(VM_SHADOW_STACK)] = "ss",
#endif
};
diff --git a/include/linux/mm.h b/include/linux/mm.h
index bf5d0b1b16f4..1728cb77540d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -341,7 +341,7 @@ extern unsigned int kobjsize(const void *objp);
#endif
#endif /* CONFIG_ARCH_HAS_PKEYS */
-#ifdef CONFIG_X86_USER_SHADOW_STACK
+#ifdef CONFIG_ARCH_HAS_USER_SHADOW_STACK
/*
* VM_SHADOW_STACK should not be set with VM_SHARED because of lack of
* support core mm.
diff --git a/mm/Kconfig b/mm/Kconfig
index 264a2df5ecf5..aaa2c353ea67 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1258,6 +1258,12 @@ config LOCK_MM_AND_FIND_VMA
bool
depends on !STACK_GROWSUP
+config ARCH_HAS_USER_SHADOW_STACK
+ bool
+ help
+ The architecture has hardware support for userspace shadow call
+ stacks (eg, x86 CET, arm64 GCS, RISC-V Zisslpcfi).
+
source "mm/damon/Kconfig"
endmenu
--
2.30.2
Unlike with the normal stack there is no API for configuring the shadow
stack for a new thread; instead the kernel will dynamically allocate a
new shadow stack with the same size as the normal stack. This appears to
be due to the shadow stack series having been in development since before
the more extensible clone3() was added, rather than anything more
deliberate.
Add parameters to clone3() specifying the address and size of a shadow
stack for the newly created process. We validate that the range specified
is accessible to userspace but do not validate that it has been mapped
appropriately for use as a shadow stack (normally via map_shadow_stack()).
If the shadow stack is specified in this way then the caller is responsible
for freeing the memory as with the main stack. If no shadow stack is
specified then the existing implicit allocation and freeing behaviour is
maintained.
If the architecture does not support shadow stacks the shadow stack
parameters must be zero; architectures that do support the feature are
expected to have the same requirement on individual systems that lack
shadow stack support.
Update the existing x86 implementation to pay attention to the newly added
arguments; in order to maintain compatibility we use the existing behaviour
if no shadow stack is specified. Minimal validation is done of the supplied
parameters, with detailed enforcement left to when the thread is executed.
Since we are now using four fields from the kernel_clone_args, we pass that
into the shadow stack code rather than individual fields.
Signed-off-by: Mark Brown <[email protected]>
---
arch/x86/include/asm/shstk.h | 11 +++++++----
arch/x86/kernel/process.c | 2 +-
arch/x86/kernel/shstk.c | 36 +++++++++++++++++++++++++++++++-----
include/linux/sched/task.h | 2 ++
include/uapi/linux/sched.h | 17 ++++++++++++++---
kernel/fork.c | 40 ++++++++++++++++++++++++++++++++++++++--
6 files changed, 93 insertions(+), 15 deletions(-)
diff --git a/arch/x86/include/asm/shstk.h b/arch/x86/include/asm/shstk.h
index 42fee8959df7..8be7b0a909c3 100644
--- a/arch/x86/include/asm/shstk.h
+++ b/arch/x86/include/asm/shstk.h
@@ -6,6 +6,7 @@
#include <linux/types.h>
struct task_struct;
+struct kernel_clone_args;
struct ksignal;
#ifdef CONFIG_X86_USER_SHADOW_STACK
@@ -16,8 +17,8 @@ struct thread_shstk {
long shstk_prctl(struct task_struct *task, int option, unsigned long arg2);
void reset_thread_features(void);
-unsigned long shstk_alloc_thread_stack(struct task_struct *p, unsigned long clone_flags,
- unsigned long stack_size);
+unsigned long shstk_alloc_thread_stack(struct task_struct *p,
+ const struct kernel_clone_args *args);
void shstk_free(struct task_struct *p);
int setup_signal_shadow_stack(struct ksignal *ksig);
int restore_signal_shadow_stack(void);
@@ -26,8 +27,10 @@ static inline long shstk_prctl(struct task_struct *task, int option,
unsigned long arg2) { return -EINVAL; }
static inline void reset_thread_features(void) {}
static inline unsigned long shstk_alloc_thread_stack(struct task_struct *p,
- unsigned long clone_flags,
- unsigned long stack_size) { return 0; }
+ const struct kernel_clone_args *args)
+{
+ return 0;
+}
static inline void shstk_free(struct task_struct *p) {}
static inline int setup_signal_shadow_stack(struct ksignal *ksig) { return 0; }
static inline int restore_signal_shadow_stack(void) { return 0; }
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index b6f4e8399fca..a9ca80ea5056 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -207,7 +207,7 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
* is disabled, new_ssp will remain 0, and fpu_clone() will know not to
* update it.
*/
- new_ssp = shstk_alloc_thread_stack(p, clone_flags, args->stack_size);
+ new_ssp = shstk_alloc_thread_stack(p, args);
if (IS_ERR_VALUE(new_ssp))
return PTR_ERR((void *)new_ssp);
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index 59e15dd8d0f8..3ae5c3d657dc 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -191,18 +191,44 @@ void reset_thread_features(void)
current->thread.features_locked = 0;
}
-unsigned long shstk_alloc_thread_stack(struct task_struct *tsk, unsigned long clone_flags,
- unsigned long stack_size)
+unsigned long shstk_alloc_thread_stack(struct task_struct *tsk,
+ const struct kernel_clone_args *args)
{
struct thread_shstk *shstk = &tsk->thread.shstk;
+ unsigned long clone_flags = args->flags;
unsigned long addr, size;
/*
* If shadow stack is not enabled on the new thread, skip any
- * switch to a new shadow stack.
+ * implicit switch to a new shadow stack and reject attempts to
+ * explicitly specify one.
*/
- if (!features_enabled(ARCH_SHSTK_SHSTK))
+ if (!features_enabled(ARCH_SHSTK_SHSTK)) {
+ if (args->shadow_stack)
+ return (unsigned long)ERR_PTR(-EINVAL);
+
return 0;
+ }
+
+ /*
+ * If the user specified a shadow stack then do some basic
+ * validation and use it. The caller is responsible for
+ * freeing the shadow stack.
+ */
+ if (args->shadow_stack) {
+ addr = args->shadow_stack;
+ size = args->shadow_stack_size;
+
+ if (!IS_ALIGNED(addr, 8))
+ return (unsigned long)ERR_PTR(-EINVAL);
+ if (size < 8)
+ return (unsigned long)ERR_PTR(-EINVAL);
+
+ shstk->base = 0;
+ shstk->size = 0;
+
+ return addr + size;
+ }
/*
* For CLONE_VFORK the child will share the parents shadow stack.
@@ -222,7 +248,7 @@ unsigned long shstk_alloc_thread_stack(struct task_struct *tsk, unsigned long cl
if (!(clone_flags & CLONE_VM))
return 0;
- size = adjust_shstk_size(stack_size);
+ size = adjust_shstk_size(args->stack_size);
addr = alloc_shstk(0, size, 0, false);
if (IS_ERR_VALUE(addr))
return addr;
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index a23af225c898..94e7cf62be51 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -41,6 +41,8 @@ struct kernel_clone_args {
void *fn_arg;
struct cgroup *cgrp;
struct css_set *cset;
+ unsigned long shadow_stack;
+ unsigned long shadow_stack_size;
};
/*
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 3bac0a8ceab2..1bd1b956834d 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -84,6 +84,14 @@
* kernel's limit of nested PID namespaces.
* @cgroup: If CLONE_INTO_CGROUP is specified set this to
* a file descriptor for the cgroup.
+ * @shadow_stack: Specify the location of the shadow stack for the
+ * child process.
+ * Note, @shadow_stack is expected to point to the
+ * lowest address. The stack direction will be
+ * determined by the kernel and set up
+ * appropriately based on @shadow_stack_size.
+ * @shadow_stack_size: The size of the shadow stack for the child
+ * process.
*
* The structure is versioned by size and thus extensible.
* New struct members must go at the end of the struct and
@@ -101,12 +109,15 @@ struct clone_args {
__aligned_u64 set_tid;
__aligned_u64 set_tid_size;
__aligned_u64 cgroup;
+ __aligned_u64 shadow_stack;
+ __aligned_u64 shadow_stack_size;
};
#endif
-#define CLONE_ARGS_SIZE_VER0 64 /* sizeof first published struct */
-#define CLONE_ARGS_SIZE_VER1 80 /* sizeof second published struct */
-#define CLONE_ARGS_SIZE_VER2 88 /* sizeof third published struct */
+#define CLONE_ARGS_SIZE_VER0 64 /* sizeof first published struct */
+#define CLONE_ARGS_SIZE_VER1 80 /* sizeof second published struct */
+#define CLONE_ARGS_SIZE_VER2 88 /* sizeof third published struct */
+#define CLONE_ARGS_SIZE_VER3 104 /* sizeof fourth published struct */
/*
* Scheduling policies
diff --git a/kernel/fork.c b/kernel/fork.c
index 3b6d20dfb9a8..bd61aa7353b0 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -3069,7 +3069,9 @@ noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs,
CLONE_ARGS_SIZE_VER1);
BUILD_BUG_ON(offsetofend(struct clone_args, cgroup) !=
CLONE_ARGS_SIZE_VER2);
- BUILD_BUG_ON(sizeof(struct clone_args) != CLONE_ARGS_SIZE_VER2);
+ BUILD_BUG_ON(offsetofend(struct clone_args, shadow_stack_size) !=
+ CLONE_ARGS_SIZE_VER3);
+ BUILD_BUG_ON(sizeof(struct clone_args) != CLONE_ARGS_SIZE_VER3);
if (unlikely(usize > PAGE_SIZE))
return -E2BIG;
@@ -3112,6 +3114,8 @@ noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs,
.tls = args.tls,
.set_tid_size = args.set_tid_size,
.cgroup = args.cgroup,
+ .shadow_stack = args.shadow_stack,
+ .shadow_stack_size = args.shadow_stack_size,
};
if (args.set_tid &&
@@ -3152,6 +3156,38 @@ static inline bool clone3_stack_valid(struct kernel_clone_args *kargs)
return true;
}
+/**
+ * clone3_shadow_stack_valid - check and prepare shadow stack
+ * @kargs: kernel clone args
+ *
+ * Verify that the shadow stack arguments userspace gave us are sane.
+ */
+static inline bool clone3_shadow_stack_valid(struct kernel_clone_args *kargs)
+{
+#ifdef CONFIG_ARCH_HAS_USER_SHADOW_STACK
+ if (kargs->shadow_stack) {
+ if (!kargs->shadow_stack_size)
+ return false;
+
+ /*
+ * This doesn't validate that the addresses are mapped
+ * VM_SHADOW_STACK, just that they're mapped at all.
+ */
+ if (!access_ok((void __user *)kargs->shadow_stack,
+ kargs->shadow_stack_size))
+ return false;
+ } else {
+ if (kargs->shadow_stack_size)
+ return false;
+ }
+
+ return true;
+#else
+ /* The architecture does not support shadow stacks */
+ return !kargs->shadow_stack && !kargs->shadow_stack_size;
+#endif
+}
+
static bool clone3_args_valid(struct kernel_clone_args *kargs)
{
/* Verify that no unknown flags are passed along. */
@@ -3174,7 +3210,7 @@ static bool clone3_args_valid(struct kernel_clone_args *kargs)
kargs->exit_signal)
return false;
- if (!clone3_stack_valid(kargs))
+ if (!clone3_stack_valid(kargs) || !clone3_shadow_stack_valid(kargs))
return false;
return true;
--
2.30.2
In order to make it easier to add more configuration for the tests, and
more support for runtime detection of when tests can be run, pass the
structure describing the tests into test_clone3() rather than picking
the arguments out of it, and have that function do all the per-test work.
No functional change.
Signed-off-by: Mark Brown <[email protected]>
---
tools/testing/selftests/clone3/clone3.c | 77 ++++++++++++++++-----------------
1 file changed, 37 insertions(+), 40 deletions(-)
diff --git a/tools/testing/selftests/clone3/clone3.c b/tools/testing/selftests/clone3/clone3.c
index 9429d361059e..afe383689a67 100644
--- a/tools/testing/selftests/clone3/clone3.c
+++ b/tools/testing/selftests/clone3/clone3.c
@@ -30,6 +30,19 @@ enum test_mode {
CLONE3_ARGS_INVAL_EXIT_SIGNAL_NSIG,
};
+typedef bool (*filter_function)(void);
+typedef size_t (*size_function)(void);
+
+struct test {
+ const char *name;
+ uint64_t flags;
+ size_t size;
+ size_function size_function;
+ int expected;
+ enum test_mode test_mode;
+ filter_function filter;
+};
+
static int call_clone3(uint64_t flags, size_t size, enum test_mode test_mode)
{
struct __clone_args args = {
@@ -104,30 +117,40 @@ static int call_clone3(uint64_t flags, size_t size, enum test_mode test_mode)
return 0;
}
-static bool test_clone3(uint64_t flags, size_t size, int expected,
- enum test_mode test_mode)
+static void test_clone3(const struct test *test)
{
+ size_t size;
int ret;
+ if (test->filter && test->filter()) {
+ ksft_test_result_skip("%s\n", test->name);
+ return;
+ }
+
+ if (test->size_function)
+ size = test->size_function();
+ else
+ size = test->size;
+
+ ksft_print_msg("Running test '%s'\n", test->name);
+
ksft_print_msg(
"[%d] Trying clone3() with flags %#" PRIx64 " (size %zu)\n",
- getpid(), flags, size);
- ret = call_clone3(flags, size, test_mode);
+ getpid(), test->flags, size);
+ ret = call_clone3(test->flags, size, test->test_mode);
ksft_print_msg("[%d] clone3() with flags says: %d expected %d\n",
- getpid(), ret, expected);
- if (ret != expected) {
+ getpid(), ret, test->expected);
+ if (ret != test->expected) {
ksft_print_msg(
"[%d] Result (%d) is different than expected (%d)\n",
- getpid(), ret, expected);
- return false;
+ getpid(), ret, test->expected);
+ ksft_test_result_fail("%s\n", test->name);
+ return;
}
- return true;
+ ksft_test_result_pass("%s\n", test->name);
}
-typedef bool (*filter_function)(void);
-typedef size_t (*size_function)(void);
-
static bool not_root(void)
{
if (getuid() != 0) {
@@ -143,16 +166,6 @@ static size_t page_size_plus_8(void)
return getpagesize() + 8;
}
-struct test {
- const char *name;
- uint64_t flags;
- size_t size;
- size_function size_function;
- int expected;
- enum test_mode test_mode;
- filter_function filter;
-};
-
static const struct test tests[] = {
{
.name = "simple clone3()",
@@ -301,24 +314,8 @@ int main(int argc, char *argv[])
ksft_set_plan(ARRAY_SIZE(tests));
test_clone3_supported();
- for (i = 0; i < ARRAY_SIZE(tests); i++) {
- if (tests[i].filter && tests[i].filter()) {
- ksft_test_result_skip("%s\n", tests[i].name);
- continue;
- }
-
- if (tests[i].size_function)
- size = tests[i].size_function();
- else
- size = tests[i].size;
-
- ksft_print_msg("Running test '%s'\n", tests[i].name);
-
- ksft_test_result(test_clone3(tests[i].flags, size,
- tests[i].expected,
- tests[i].test_mode),
- "%s\n", tests[i].name);
- }
+ for (i = 0; i < ARRAY_SIZE(tests); i++)
+ test_clone3(&tests[i]);
ksft_finished();
}
--
2.30.2
The clone_args structure is extensible, with the syscall passing in the
length of the structure. Inside the kernel we use copy_struct_from_user()
to read the struct, but this has the unfortunate side effect of silently
accepting some overrun in the structure size provided the extra data is
all zeros. This means that we can't discover the clone3() features that
the running kernel supports by simply probing with various struct sizes.
We need to check this for the benefit of test systems which run newer
kselftests on old kernels.
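As a rough illustration of the resulting probe (a hedged sketch, not
part of the patch; it assumes the extended struct __clone_args and
CLONE_ARGS_SIZE_VER3 from this series plus the selftest's usual
includes, and ignores kernels that lack clone3() entirely):

/*
 * An all-zero extension is silently accepted by old kernels, so probe
 * with a deliberately invalid non-zero new field: a kernel that knows
 * about the field rejects the combination with -EINVAL, while an old
 * kernel rejects the oversized struct with -E2BIG. No child is created
 * in either case.
 */
static bool kernel_has_clone3_shadow_stack(void)
{
	struct __clone_args args = {
		.shadow_stack_size = getpagesize(),	/* size, no pointer */
	};

	return syscall(__NR_clone3, &args, CLONE_ARGS_SIZE_VER3) == -1 &&
	       errno != E2BIG;
}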
Add a flag which can be set on a test to indicate that clone3() may return
-E2BIG due to the use of newer struct versions. Currently no tests need
this, but it will become an issue for testing clone3() support for shadow
stacks; support for shadow stacks is already present on x86.
Signed-off-by: Mark Brown <[email protected]>
---
tools/testing/selftests/clone3/clone3.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/tools/testing/selftests/clone3/clone3.c b/tools/testing/selftests/clone3/clone3.c
index afe383689a67..f1802db82e4e 100644
--- a/tools/testing/selftests/clone3/clone3.c
+++ b/tools/testing/selftests/clone3/clone3.c
@@ -39,6 +39,7 @@ struct test {
size_t size;
size_function size_function;
int expected;
+ bool e2big_valid;
enum test_mode test_mode;
filter_function filter;
};
@@ -141,6 +142,11 @@ static void test_clone3(const struct test *test)
ksft_print_msg("[%d] clone3() with flags says: %d expected %d\n",
getpid(), ret, test->expected);
if (ret != test->expected) {
+ if (test->e2big_valid && ret == -E2BIG) {
+ ksft_print_msg("Test reported -E2BIG\n");
+ ksft_test_result_skip("%s\n", test->name);
+ return;
+ }
ksft_print_msg(
"[%d] Result (%d) is different than expected (%d)\n",
getpid(), ret, test->expected);
--
2.30.2
Add basic test coverage for specifying the shadow stack for a newly
created thread via clone3(), including coverage of the newly extended
argument structure. We detect support for shadow stacks on the running
system by attempting to allocate a shadow stack page during initialisation
using map_shadow_stack().
Signed-off-by: Mark Brown <[email protected]>
---
tools/testing/selftests/clone3/clone3.c | 97 +++++++++++++++++++++++
tools/testing/selftests/clone3/clone3_selftests.h | 5 ++
2 files changed, 102 insertions(+)
diff --git a/tools/testing/selftests/clone3/clone3.c b/tools/testing/selftests/clone3/clone3.c
index f1802db82e4e..33c35fdfcdfc 100644
--- a/tools/testing/selftests/clone3/clone3.c
+++ b/tools/testing/selftests/clone3/clone3.c
@@ -11,6 +11,7 @@
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
+#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/un.h>
@@ -21,6 +22,10 @@
#include "../kselftest.h"
#include "clone3_selftests.h"
+static bool shadow_stack_supported;
+static __u64 shadow_stack;
+static size_t max_supported_args_size;
+
enum test_mode {
CLONE3_ARGS_NO_TEST,
CLONE3_ARGS_ALL_0,
@@ -28,6 +33,9 @@ enum test_mode {
CLONE3_ARGS_INVAL_EXIT_SIGNAL_NEG,
CLONE3_ARGS_INVAL_EXIT_SIGNAL_CSIG,
CLONE3_ARGS_INVAL_EXIT_SIGNAL_NSIG,
+ CLONE3_ARGS_SHADOW_STACK,
+ CLONE3_ARGS_SHADOW_STACK_SIZE_ONLY,
+ CLONE3_ARGS_SHADOW_STACK_POINTER_ONLY,
};
typedef bool (*filter_function)(void);
@@ -44,6 +52,28 @@ struct test {
filter_function filter;
};
+#ifndef __NR_map_shadow_stack
+#define __NR_map_shadow_stack 453
+#endif
+
+static void test_shadow_stack_supported(void)
+{
+ shadow_stack = syscall(__NR_map_shadow_stack, 0, getpagesize(), 0);
+ if (shadow_stack == -1) {
+ ksft_print_msg("map_shadow_stack() not supported\n");
+ } else if ((void *)shadow_stack == MAP_FAILED) {
+ ksft_print_msg("Failed to map shadow stack\n");
+ } else {
+ ksft_print_msg("Shadow stack supportd\n");
+ shadow_stack_supported = true;
+ }
+
+ /* Dummy stack to use for validating error checks */
+ if (!shadow_stack_supported) {
+ shadow_stack = (__u64)malloc(getpagesize());
+ }
+}
+
static int call_clone3(uint64_t flags, size_t size, enum test_mode test_mode)
{
struct __clone_args args = {
@@ -89,6 +119,16 @@ static int call_clone3(uint64_t flags, size_t size, enum test_mode test_mode)
case CLONE3_ARGS_INVAL_EXIT_SIGNAL_NSIG:
args.exit_signal = 0x00000000000000f0ULL;
break;
+ case CLONE3_ARGS_SHADOW_STACK:
+ args.shadow_stack = shadow_stack;
+ args.shadow_stack_size = getpagesize();
+ break;
+ case CLONE3_ARGS_SHADOW_STACK_SIZE_ONLY:
+ args.shadow_stack_size = getpagesize();
+ break;
+ case CLONE3_ARGS_SHADOW_STACK_POINTER_ONLY:
+ args.shadow_stack = shadow_stack;
+ break;
}
memcpy(&args_ext.args, &args, sizeof(struct __clone_args));
@@ -167,6 +207,26 @@ static bool not_root(void)
return false;
}
+static bool have_shadow_stack(void)
+{
+ if (shadow_stack_supported) {
+ ksft_print_msg("Shadow stack supported\n");
+ return true;
+ }
+
+ return false;
+}
+
+static bool no_shadow_stack(void)
+{
+ if (!shadow_stack_supported) {
+ ksft_print_msg("Shadow stack not supported\n");
+ return true;
+ }
+
+ return false;
+}
+
static size_t page_size_plus_8(void)
{
return getpagesize() + 8;
@@ -309,6 +369,42 @@ static const struct test tests[] = {
.expected = -EINVAL,
.test_mode = CLONE3_ARGS_NO_TEST,
},
+ {
+ .name = "Shadow stack on system with shadow stack",
+ .flags = 0,
+ .size = 0,
+ .expected = 0,
+ .e2big_valid = true,
+ .test_mode = CLONE3_ARGS_SHADOW_STACK,
+ .filter = no_shadow_stack,
+ },
+ {
+ .name = "Shadow stack with only size specified",
+ .flags = 0,
+ .size = 0,
+ .expected = -EINVAL,
+ .e2big_valid = true,
+ .test_mode = CLONE3_ARGS_SHADOW_STACK_SIZE_ONLY,
+ .filter = no_shadow_stack,
+ },
+ {
+ .name = "Shadow stack with only pointer specified",
+ .flags = 0,
+ .size = 0,
+ .expected = -EINVAL,
+ .e2big_valid = true,
+ .test_mode = CLONE3_ARGS_SHADOW_STACK_POINTER_ONLY,
+ .filter = no_shadow_stack,
+ },
+ {
+ .name = "Shadow stack on system without shadow stack",
+ .flags = 0,
+ .size = 0,
+ .expected = -EINVAL,
+ .e2big_valid = true,
+ .test_mode = CLONE3_ARGS_SHADOW_STACK,
+ .filter = have_shadow_stack,
+ },
};
int main(int argc, char *argv[])
@@ -319,6 +415,7 @@ int main(int argc, char *argv[])
ksft_print_header();
ksft_set_plan(ARRAY_SIZE(tests));
test_clone3_supported();
+ test_shadow_stack_supported();
for (i = 0; i < ARRAY_SIZE(tests); i++)
test_clone3(&tests[i]);
diff --git a/tools/testing/selftests/clone3/clone3_selftests.h b/tools/testing/selftests/clone3/clone3_selftests.h
index e81ffaaee02b..a77db460211b 100644
--- a/tools/testing/selftests/clone3/clone3_selftests.h
+++ b/tools/testing/selftests/clone3/clone3_selftests.h
@@ -43,6 +43,11 @@ struct __clone_args {
__aligned_u64 cgroup;
#ifndef CLONE_ARGS_SIZE_VER2
#define CLONE_ARGS_SIZE_VER2 88 /* sizeof third published struct */
+#endif
+ __aligned_u64 shadow_stack;
+ __aligned_u64 shadow_stack_size;
+#ifndef CLONE_ARGS_SIZE_VER3
+#define CLONE_ARGS_SIZE_VER3 104 /* sizeof fourth published struct */
#endif
};
--
2.30.2
+Some security folks
On Mon, 2023-10-23 at 14:20 +0100, Mark Brown wrote:
> Unlike with the normal stack there is no API for configuring the
> shadow
> stack for a new thread; instead the kernel will dynamically allocate
> a new
> shadow stack with the same size as the normal stack. This appears to
> be due
> to the shadow stack series having been in development since before
> the more
> extensible clone3() was added rather than anything more deliberate.
>
> Add parameters to clone3() specifying the address and size of a
> shadow
> stack for the newly created process. We validate that the range
> specified
> is accessible to userspace but do not validate that it has been
> mapped
> appropriately for use as a shadow stack (normally via
> map_shadow_stack()).
> If the shadow stack is specified in this way then the caller is
> responsible
> for freeing the memory as with the main stack. If no shadow stack is
> specified then the existing implicit allocation and freeing behaviour
> is
> maintained.
>
> If the architecture does not support shadow stacks the shadow stack
> parameters must be zero; architectures that do support the feature
> are
> expected to have the same requirement on individual systems that lack
> shadow stack support.
>
> Update the existing x86 implementation to pay attention to the newly
> added
> arguments; in order to maintain compatibility we use the existing
> behaviour
> if no shadow stack is specified. Minimal validation is done of the
> supplied
> parameters, detailed enforcement is left to when the thread is
> executed.
> Since we are now using four fields from the kernel_clone_args, we pass
> that
> into the shadow stack code rather than individual fields.
This will give userspace new powers, very close to a "set SSP" ability.
They could start a new thread on an active shadow stack, corrupt it,
etc.
One way to avoid this would be to have shstk_alloc_thread_stack()
consume a token on the shadow stack passed in the clone args. But it's
tricky because there is not a CMPXCHG, on x86 at least, that works with
shadow stack accesses. So the kernel would probably have to GUP the
page and do a normal CMPXCHG off of the direct map.
That said, it's already possible to get two threads on the same shadow
stack by unmapping one and mapping another shadow stack in the same
place, while the target thread is not doing a call/ret. I don't know if
there is anything we could do about that without serious compatibility
restrictions. But this patch would make it a bit more trivial.
I might lean towards the token solution, even if it becomes more heavy
weight to use clone3 in this way. It depends on whether the above is
worth defending.
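For concreteness, a very rough sketch of what such token consumption
might look like on the kernel side (purely illustrative, not from any
posted series; the helper name, token handling and error returns are all
assumptions):

static int consume_shstk_token(unsigned long token_addr, u64 expected)
{
	struct page *page;
	void *kaddr;
	int ret = 0;

	/* Pin the page backing the token so it can't move under us */
	if (get_user_pages_fast(token_addr, 1, FOLL_WRITE, &page) != 1)
		return -EFAULT;

	/*
	 * x86 has no CMPXCHG that works with shadow stack accesses, so
	 * do the atomic exchange through the kernel mapping rather than
	 * the user's shadow stack mapping.
	 */
	kaddr = kmap_local_page(page);
	if (cmpxchg((u64 *)(kaddr + offset_in_page(token_addr)),
		    expected, 0) != expected)
		ret = -EINVAL;
	kunmap_local(kaddr);

	put_page(page);
	return ret;
}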
>
> Signed-off-by: Mark Brown <[email protected]>
> ---
> arch/x86/include/asm/shstk.h | 11 +++++++----
> arch/x86/kernel/process.c | 2 +-
> arch/x86/kernel/shstk.c | 36 +++++++++++++++++++++++++++++++---
> --
> include/linux/sched/task.h | 2 ++
> include/uapi/linux/sched.h | 17 ++++++++++++++---
> kernel/fork.c | 40
> ++++++++++++++++++++++++++++++++++++++--
> 6 files changed, 93 insertions(+), 15 deletions(-)
>
> diff --git a/arch/x86/include/asm/shstk.h
> b/arch/x86/include/asm/shstk.h
> index 42fee8959df7..8be7b0a909c3 100644
> --- a/arch/x86/include/asm/shstk.h
> +++ b/arch/x86/include/asm/shstk.h
> @@ -6,6 +6,7 @@
> #include <linux/types.h>
>
> struct task_struct;
> +struct kernel_clone_args;
> struct ksignal;
>
> #ifdef CONFIG_X86_USER_SHADOW_STACK
> @@ -16,8 +17,8 @@ struct thread_shstk {
>
> long shstk_prctl(struct task_struct *task, int option, unsigned long
> arg2);
> void reset_thread_features(void);
> -unsigned long shstk_alloc_thread_stack(struct task_struct *p,
> unsigned long clone_flags,
> - unsigned long stack_size);
> +unsigned long shstk_alloc_thread_stack(struct task_struct *p,
> + const struct kernel_clone_args
> *args);
> void shstk_free(struct task_struct *p);
> int setup_signal_shadow_stack(struct ksignal *ksig);
> int restore_signal_shadow_stack(void);
> @@ -26,8 +27,10 @@ static inline long shstk_prctl(struct task_struct
> *task, int option,
> unsigned long arg2) { return -EINVAL;
> }
> static inline void reset_thread_features(void) {}
> static inline unsigned long shstk_alloc_thread_stack(struct
> task_struct *p,
> - unsigned long
> clone_flags,
> - unsigned long
> stack_size) { return 0; }
> + const struct
> kernel_clone_args *args)
> +{
> + return 0;
> +}
> static inline void shstk_free(struct task_struct *p) {}
> static inline int setup_signal_shadow_stack(struct ksignal *ksig) {
> return 0; }
> static inline int restore_signal_shadow_stack(void) { return 0; }
> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> index b6f4e8399fca..a9ca80ea5056 100644
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -207,7 +207,7 @@ int copy_thread(struct task_struct *p, const
> struct kernel_clone_args *args)
> * is disabled, new_ssp will remain 0, and fpu_clone() will
> know not to
> * update it.
> */
> - new_ssp = shstk_alloc_thread_stack(p, clone_flags, args-
> >stack_size);
> + new_ssp = shstk_alloc_thread_stack(p, args);
> if (IS_ERR_VALUE(new_ssp))
> return PTR_ERR((void *)new_ssp);
>
> diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
> index 59e15dd8d0f8..3ae5c3d657dc 100644
> --- a/arch/x86/kernel/shstk.c
> +++ b/arch/x86/kernel/shstk.c
> @@ -191,18 +191,44 @@ void reset_thread_features(void)
> current->thread.features_locked = 0;
> }
>
> -unsigned long shstk_alloc_thread_stack(struct task_struct *tsk,
> unsigned long clone_flags,
> - unsigned long stack_size)
> +unsigned long shstk_alloc_thread_stack(struct task_struct *tsk,
> + const struct kernel_clone_args
> *args)
> {
> struct thread_shstk *shstk = &tsk->thread.shstk;
> + unsigned long clone_flags = args->flags;
> unsigned long addr, size;
>
> /*
> * If shadow stack is not enabled on the new thread, skip any
> - * switch to a new shadow stack.
> + * implicit switch to a new shadow stack and reject attempts
> to
> + * explicitly specify one.
> */
> - if (!features_enabled(ARCH_SHSTK_SHSTK))
> + if (!features_enabled(ARCH_SHSTK_SHSTK)) {
> + if (args->shadow_stack)
> + return (unsigned long)ERR_PTR(-EINVAL);
> +
> return 0;
> + }
> +
> + /*
> + * If the user specified a shadow stack then do some basic
> + * validation and use it. The caller is responsible for
> + * freeing the shadow stack.
> + */
> + if (args->shadow_stack) {
> + addr = args->shadow_stack;
> + size = args->shadow_stack_size;
> +
> + if (!IS_ALIGNED(addr, 8))
> + return (unsigned long)ERR_PTR(-EINVAL);
> + if (size < 8)
> + return (unsigned long)ERR_PTR(-EINVAL);
> +
> + shstk->base = 0;
> + shstk->size = 0;
> +
> + return addr + size;
> + }
>
> /*
> * For CLONE_VFORK the child will share the parents shadow
> stack.
> @@ -222,7 +248,7 @@ unsigned long shstk_alloc_thread_stack(struct
> task_struct *tsk, unsigned long cl
> if (!(clone_flags & CLONE_VM))
> return 0;
>
> - size = adjust_shstk_size(stack_size);
> + size = adjust_shstk_size(args->stack_size);
> addr = alloc_shstk(0, size, 0, false);
> if (IS_ERR_VALUE(addr))
> return addr;
> diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
> index a23af225c898..94e7cf62be51 100644
> --- a/include/linux/sched/task.h
> +++ b/include/linux/sched/task.h
> @@ -41,6 +41,8 @@ struct kernel_clone_args {
> void *fn_arg;
> struct cgroup *cgrp;
> struct css_set *cset;
> + unsigned long shadow_stack;
> + unsigned long shadow_stack_size;
> };
>
> /*
> diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
> index 3bac0a8ceab2..1bd1b956834d 100644
> --- a/include/uapi/linux/sched.h
> +++ b/include/uapi/linux/sched.h
> @@ -84,6 +84,14 @@
> * kernel's limit of nested PID namespaces.
> * @cgroup: If CLONE_INTO_CGROUP is specified set this to
> * a file descriptor for the cgroup.
> + * @shadow_stack: Specify the location of the shadow stack for the
> + * child process.
> + * Note, @shadow_stack is expected to point to the
> + * lowest address. The stack direction will be
> + * determined by the kernel and set up
> + * appropriately based on @shadow_stack_size.
> + * @shadow_stack_size: The size of the shadow stack for the child
> + * process.
> *
> * The structure is versioned by size and thus extensible.
> * New struct members must go at the end of the struct and
> @@ -101,12 +109,15 @@ struct clone_args {
> __aligned_u64 set_tid;
> __aligned_u64 set_tid_size;
> __aligned_u64 cgroup;
> + __aligned_u64 shadow_stack;
> + __aligned_u64 shadow_stack_size;
> };
> #endif
>
> -#define CLONE_ARGS_SIZE_VER0 64 /* sizeof first published struct */
> -#define CLONE_ARGS_SIZE_VER1 80 /* sizeof second published struct */
> -#define CLONE_ARGS_SIZE_VER2 88 /* sizeof third published struct */
> +#define CLONE_ARGS_SIZE_VER0 64 /* sizeof first published struct */
> +#define CLONE_ARGS_SIZE_VER1 80 /* sizeof second published struct
> */
> +#define CLONE_ARGS_SIZE_VER2 88 /* sizeof third published struct */
> +#define CLONE_ARGS_SIZE_VER3 104 /* sizeof fourth published struct
> */
>
> /*
> * Scheduling policies
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 3b6d20dfb9a8..bd61aa7353b0 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -3069,7 +3069,9 @@ noinline static int
> copy_clone_args_from_user(struct kernel_clone_args *kargs,
> CLONE_ARGS_SIZE_VER1);
> BUILD_BUG_ON(offsetofend(struct clone_args, cgroup) !=
> CLONE_ARGS_SIZE_VER2);
> - BUILD_BUG_ON(sizeof(struct clone_args) !=
> CLONE_ARGS_SIZE_VER2);
> + BUILD_BUG_ON(offsetofend(struct clone_args,
> shadow_stack_size) !=
> + CLONE_ARGS_SIZE_VER3);
> + BUILD_BUG_ON(sizeof(struct clone_args) !=
> CLONE_ARGS_SIZE_VER3);
>
> if (unlikely(usize > PAGE_SIZE))
> return -E2BIG;
> @@ -3112,6 +3114,8 @@ noinline static int
> copy_clone_args_from_user(struct kernel_clone_args *kargs,
> .tls = args.tls,
> .set_tid_size = args.set_tid_size,
> .cgroup = args.cgroup,
> + .shadow_stack = args.shadow_stack,
> + .shadow_stack_size = args.shadow_stack_size,
> };
>
> if (args.set_tid &&
> @@ -3152,6 +3156,38 @@ static inline bool clone3_stack_valid(struct
> kernel_clone_args *kargs)
> return true;
> }
>
> +/**
> + * clone3_shadow_stack_valid - check and prepare shadow stack
> + * @kargs: kernel clone args
> + *
> + * Verify that the shadow stack arguments userspace gave us are
> sane.
> + */
> +static inline bool clone3_shadow_stack_valid(struct
> kernel_clone_args *kargs)
> +{
> +#ifdef CONFIG_ARCH_HAS_USER_SHADOW_STACK
> + if (kargs->shadow_stack) {
> + if (!kargs->shadow_stack_size)
> + return false;
> +
> + /*
> + * This doesn't validate that the addresses are
> mapped
> + * VM_SHADOW_STACK, just that they're mapped at all.
> + */
It just checks the range, right?
> + if (!access_ok((void __user *)kargs->shadow_stack,
> + kargs->shadow_stack_size))
> + return false;
> + } else {
> + if (kargs->shadow_stack_size)
> + return false;
> + }
> +
> + return true;
> +#else
> + /* The architecture does not support shadow stacks */
> + return !kargs->shadow_stack && !kargs->shadow_stack_size;
> +#endif
> +}
> +
> static bool clone3_args_valid(struct kernel_clone_args *kargs)
> {
> /* Verify that no unknown flags are passed along. */
> @@ -3174,7 +3210,7 @@ static bool clone3_args_valid(struct
> kernel_clone_args *kargs)
> kargs->exit_signal)
> return false;
>
> - if (!clone3_stack_valid(kargs))
> + if (!clone3_stack_valid(kargs) ||
> !clone3_shadow_stack_valid(kargs))
> return false;
>
> return true;
>
On Mon, Oct 23, 2023 at 04:32:22PM +0000, Edgecombe, Rick P wrote:
> On Mon, 2023-10-23 at 14:20 +0100, Mark Brown wrote:
> +Some security folks
I *think* I captured everyone for future versions but I might've missed
some; it's a long Cc list.
> > Add parameters to clone3() specifying the address and size of a
> > shadow
> > stack for the newly created process. We validate that the range
> > specified
> > is accessible to userspace but do not validate that it has been
> > mapped
> > appropriately for use as a shadow stack (normally via
> > map_shadow_stack()).
> > If the shadow stack is specified in this way then the caller is
> > responsible
> > for freeing the memory as with the main stack. If no shadow stack is
> > specified then the existing implicit allocation and freeing behaviour
> > is
> > maintained.
> This will give userspace new powers, very close to a "set SSP" ability.
> They could start a new thread on an active shadow stack, corrupt it,
> etc.
That's true.
> One way to avoid this would be to have shstk_alloc_thread_stack()
> consume a token on the shadow stack passed in the clone args. But it's
> tricky because there is not a CMPXCHG, on x86 at least, that works with
> shadow stack accesses. So the kernel would probably have to GUP the
> page and do a normal CMPXCHG off of the direct map.
> That said, it's already possible to get two threads on the same shadow
> stack by unmapping one and mapping another shadow stack in the same
> place, while the target thread is not doing a call/ret. I don't know if
> there is anything we could do about that without serious compatibility
> restrictions. But this patch would make it a bit more trivial.
> I might lean towards the token solution, even if it becomes more heavy
> weight to use clone3 in this way. It depends on whether the above is
> worth defending.
Right. We're already adding the cost of the extra map_shadow_stack() so
it doesn't seem that out of scope. We could also allow clone3() to be
used for allocation, potentially removing the ability to specify the
address entirely and only specifying the size. I did consider that
option but it felt awkward in the API, though equally the whole shadow
stack allocation thing is a bit that way. That would avoid concerns
about placing and validating tokens entirely but gives less control to
userspace.
This also doesn't do anything to stop anyone trying to allocate sub-page
shadow stacks if they're trying to save memory, with all the lack of
overrun protection that implies, though that seems to me to be much more
of a deliberate decision that people might make; a token would prevent
that too unless write access to the shadow stack is enabled.
> > +		/*
> > +		 * This doesn't validate that the addresses are mapped
> > +		 * VM_SHADOW_STACK, just that they're mapped at all.
> > +		 */
> It just checks the range, right?
Yes, same check as for the normal stack.
On Mon, 2023-10-23 at 19:32 +0100, Mark Brown wrote:
> Right. We're already adding the cost of the extra map_shadow_stack()
> so
> it doesn't seem that out of scope. We could also allow clone3() to
> be
> used for allocation, potentially removing the ability to specify the
> address entirely and only specifying the size. I did consider that
> option but it felt awkward in the API, though equally the whole
> shadow
> stack allocation thing is a bit that way. That would avoid concerns
> about placing and validating tokens entirely but gives less control
> to
> userspace.
There is also cost in the form of extra complexity. Not to throw FUD,
but GUP has been the source of thorny problems. And here we would be
doing it around security races. We're probably helped that shadow stack
is only private/anonymous memory, so maybe it's enough of a normal case
to not worry about it.
Still, there is some extra complexity, and I'm not sure if we really
need it. The justification seems to mostly be that it's not as flexible
as normal stacks with clone3.
I don't understand why doing size-only is awkward. Just because it
doesn't match the regular stack clone3 semantics?
>
> This also doesn't do anything to stop anyone trying to allocate sub
> page
> shadow stacks if they're trying to save memory with all the lack of
> overrun protection that implies, though that seems to me to be much
> more
> of a deliberate decision that people might make, a token would
> prevent
> that too unless write access to the shadow stack is enabled.
Sorry, I'm not following. Sub-page shadow stacks?
>
> > > + /*
> > > + * This doesn't validate that the addresses are
> > > mapped
> > > + * VM_SHADOW_STACK, just that they're mapped at
> > > all.
> > > + */
>
> > It just checks the range, right?
>
> Yes, same check as for the normal stack.
What looked wrong is that the comment says that it checks if the
addresses are mapped, but the code just does access_ok(). It's a minor
thing in any case.
On Thu, Oct 26, 2023 at 05:10:47PM +0000, Edgecombe, Rick P wrote:
> On Mon, 2023-10-23 at 19:32 +0100, Mark Brown wrote:
> > Right. We're already adding the cost of the extra map_shadow_stack()
> > so
> > it doesn't seem that out of scope. We could also allow clone3() to
> > be
> > used for allocation, potentially removing the ability to specify the
> > address entirely and only specifying the size. I did consider that
> > option but it felt awkward in the API, though equally the whole
> > shadow
> > stack allocation thing is a bit that way. That would avoid concerns
> > about placing and validating tokens entirely but gives less control
> > to
> > userspace.
> There is also cost in the form of extra complexity. Not to throw FUD,
> but GUP has been the source of thorny problems. And here we would be
> doing it around security races. We're probably helped that shadow stack
> is only private/anonymous memory, so maybe it's enough of a normal case
> to not worry about it.
> Still, there is some extra complexity, and I'm not sure if we really
> need it. The justification seems to mostly be that it's not as flexible
> as normal stacks with clone3.
I definitely agree on the complexity; trying to validate a token is
going to be more code doing fiddly things and there's always the risk
that something will change around it and invalidate assumptions the code
makes. Particularly given my inability to test x86 I'm certainly way
more happy pushing this series forward implementing size only than I am
doing token validation.
> I don't understand why doing size-only is awkward. Just because it
> doesn't match the regular stack clone3 semantics?
Basically, yes - we don't allocate userspace pages in clone3() for the
normal stack and we do offer userspace control over where to place
things. There was some grumbling about this in the current ABI from the
arm64 side, though the limited control of the size is more of the issue
really.
I'm not sure placement control is essential but the other bit of it is
the freeing of the shadow stack; especially if userspace is doing stack
switches, the current behaviour where we free the stack when the thread
is exiting doesn't feel great exactly. It's mainly an issue for
programs that pivot stacks which isn't the common case but it is a
general sharp edge.
> > This also doesn't do anything to stop anyone trying to allocate sub
> > page
> > shadow stacks if they're trying to save memory with all the lack of
> > overrun protection that implies, though that seems to me to be much
> > more
> > of a deliberate decision that people might make, a token would
> > prevent
> > that too unless write access to the shadow stack is enabled.
> Sorry, I'm not following. Sub-page shadow stacks?
If someone decides to allocate a page of shadow stack, then point thread
A at the first half of the page and thread B at the second half of the
page, nothing would stop them. There are obvious issues with this but I
can see someone trying to do it in a system that creates lots of
threads and has memory constraints.
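(As a purely hypothetical sketch of that pattern, reusing the
clone_args layout sketched in the cover letter; nothing here is from the
series:)

/*
 * One shadow stack page split between two threads: nothing in the
 * proposed API prevents it, but neither half has any guard against
 * overflowing into the other.
 */
static void split_page_between_two_threads(struct clone_args_v3 *a,
					   struct clone_args_v3 *b)
{
	uint64_t page = syscall(__NR_map_shadow_stack, 0, getpagesize(), 0);
	size_t half = getpagesize() / 2;

	a->shadow_stack = page;			/* thread A: low half */
	a->shadow_stack_size = half;
	b->shadow_stack = page + half;		/* thread B: high half */
	b->shadow_stack_size = half;
}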
> > > > +		/*
> > > > +		 * This doesn't validate that the addresses are mapped
> > > > +		 * VM_SHADOW_STACK, just that they're mapped at all.
> > > > +		 */
> > > It just checks the range, right?
> > Yes, same check as for the normal stack.
> What looked wrong is that the comment says that it checks if the
> addresses are mapped, but the code just does access_ok(). It's a minor
> thing in any case.
Oh, I see, yes.
On Thu, Oct 26, 2023 at 06:53:37PM +0100, Mark Brown wrote:
>On Thu, Oct 26, 2023 at 05:10:47PM +0000, Edgecombe, Rick P wrote:
>> On Mon, 2023-10-23 at 19:32 +0100, Mark Brown wrote:
>
>> > Right. We're already adding the cost of the extra map_shadow_stack()
>> > so
>> > it doesn't seem that out of scope. We could also allow clone3() to
>> > be
>> > used for allocation, potentially removing the ability to specify the
>> > address entirely and only specifying the size. I did consider that
>> > option but it felt awkward in the API, though equally the whole
>> > shadow
>> > stack allocation thing is a bit that way. That would avoid concerns
>> > about placing and validating tokens entirely but gives less control
>> > to
>> > userspace.
>
>> There is also cost in the form of extra complexity. Not to throw FUD,
>> but GUP has been the source of thorny problems. And here we would be
>> doing it around security races. We're probably helped that shadow stack
>> is only private/anonymous memory, so maybe it's enough of a normal case
>> to not worry about it.
>
>> Still, there is some extra complexity, and I'm not sure if we really
>> need it. The justification seems to mostly be that it's not as flexible
>> as normal stacks with clone3.
>
>I definitely agree on the complexity; trying to validate a token is
>going to be more code doing fiddly things and there's always the risk
>that something will change around it and invalidate assumptions the code
>makes. Particularly given my inability to test x86 I'm certainly way
>more happy pushing this series forward implementing size only than I am
>doing token validation.
>
FWIW, from an arch-specific perspective, the RISC-V shadow stack
extension has `ssamoswap` to perform this token exchange. But I
understand x86 has this limitation (not sure about arm GCS).
From a security perspective: someone with the ability to execute clone3
with controlled input has probably already achieved some level of control
flow bending, because they need to corrupt memory and then carefully
control the register inputs to clone3. Although if it is purely a
data-oriented gadget, I think it is possible.
Since this RFC is mostly concerned with the `size` of the shadow stack, I
think we should limit it to size only.
>> I don't understand why doing size-only is awkward. Just because it
>> doesn't match the regular stack clone3 semantics?
>
>Basically, yes - we don't allocate userspace pages in clone3() for the
>normal stack and we do offer userspace control over where to place
>things. There was some grumbling about this in the current ABI from the
>arm64 side, though the limited control of the size is more of the issue
>really.
>
>I'm not sure placement control is essential but the other bit of it is
>the freeing of the shadow stack; especially if userspace is doing stack
>switches, the current behaviour where we free the stack when the thread
>is exiting doesn't feel great exactly. It's mainly an issue for
>programs that pivot stacks which isn't the common case but it is a
>general sharp edge.
In general, I am assuming such placement requirements emanate because
regular stack holds data (local args, etc) as well and thus software may
make assumptions about how stack frame is prepared and may worry about
layout and such. In case of shadow stack, it can only hold return
addresses and tokens (created by user mode itself). Both of them endup
there as result of call or user sw own way of setting up tokens.
So I don't see a need for software to specify address.
>
>> > This also doesn't do anything to stop anyone trying to allocate sub
>> > page
>> > shadow stacks if they're trying to save memory with all the lack of
>> > overrun protection that implies, though that seems to me to be much
>> > more
>> > of a deliberate decision that people might make, a token would
>> > prevent
>> > that too unless write access to the shadow stack is enabled.
>
>> Sorry, I'm not following. Sub-page shadow stacks?
>
>If someone decides to allocate a page of shadow stack, then point thread
>A at the first half of the page and thread B at the second half of the
>page, nothing would stop them. There are obvious issues with this but I
>can see someone trying to do it in a system that creates lots of
>threads and has memory constraints.
>
>> > > > +		/*
>> > > > +		 * This doesn't validate that the addresses are mapped
>> > > > +		 * VM_SHADOW_STACK, just that they're mapped at all.
>> > > > +		 */
>
>> > > It just checks the range, right?
>
>> > Yes, same check as for the normal stack.
>
>> What looked wrong is that the comment says that it checks if the
>> addresses are mapped, but the code just does access_ok(). It's a minor
>> thing in any case.
>
>Oh, I see, yes.
On Thu, 2023-10-26 at 18:53 +0100, Mark Brown wrote:
> Particularly given my inability to test x86 I'm certainly way
> more happy pushing this series forward implementing size only than I
> am
> doing token validation.
I can help with testing/development once we get the plan settled on.
On Thu, 2023-10-26 at 13:40 -0700, Deepak Gupta wrote:
>
> FWIW, from an arch-specific perspective, the RISC-V shadow stack
> extension has
> `ssamoswap` to perform this token exchange. But I understand x86 has
> this
> limitation (not sure about arm GCS).
>
> From a security perspective: someone with the ability to execute
> clone3 with controlled input has probably already achieved some level
> of control flow bending, because they need to corrupt memory and then
> carefully control the register inputs to clone3. Although if it is
> purely a data-oriented gadget, I think it is possible.
struct clone_args should be data somewhere, at least temporarily.
>
> Since this RFC is mostly concerned with the `size` of the shadow stack,
> I think we should limit it to size only.
Seems reasonable to me. It still leaves open the option of adding a
shadow stack address field later AFAICT.
The 10/26/2023 13:40, Deepak Gupta wrote:
> On Thu, Oct 26, 2023 at 06:53:37PM +0100, Mark Brown wrote:
> > I'm not sure placement control is essential but the other bit of it is
> > the freeing of the shadow stack; especially if userspace is doing stack
> > switches, the current behaviour where we free the stack when the thread
> > is exiting doesn't feel great exactly. It's mainly an issue for
> > programs that pivot stacks which isn't the common case but it is a
> > general sharp edge.
>
> In general, I am assuming such placement requirements emanate because
> regular stack holds data (local args, etc) as well and thus software may
> make assumptions about how stack frame is prepared and may worry about
> layout and such. In case of shadow stack, it can only hold return
no. the lifetime is the issue: a stack in principle can outlive
a thread and be resumed even after the original thread exited.
for that to work the shadow stack has to outlive the thread too.
(or the other way around: a stack can be freed before the thread
exits, if the thread pivots away from that stack.)
posix threads etc. don't allow this, but the linux syscall abi
(clone) does allow it.
i think it is reasonable to tie the shadow stack lifetime to the
thread lifetime, but this clearly introduces a limitation on how
the clone api can be used. such constraint on the userspace
programming model is normally a bad decision, but given that most
software (including all posix conforming code) is not affected,
i think it is acceptable for an opt-in feature like shadow stack.
On Fri, Oct 27, 2023 at 12:49:59PM +0100, [email protected] wrote:
> The 10/26/2023 13:40, Deepak Gupta wrote:
> > In general, I am assuming such placement requirements emanate because
> > regular stack holds data (local args, etc) as well and thus software may
> > make assumptions about how stack frame is prepared and may worry about
> > layout and such. In case of shadow stack, it can only hold return
> no. the lifetime is the issue: a stack in principle can outlive
> a thread and be resumed even after the original thread exited.
> for that to work the shadow stack has to outlive the thread too.
> (or the other way around: a stack can be freed before the thread
> exits, if the thread pivots away from that stack.)
> posix threads etc. don't allow this, but the linux syscall abi
> (clone) does allow it.
> i think it is reasonable to tie the shadow stack lifetime to the
> thread lifetime, but this clearly introduces a limitation on how
> the clone api can be used. such constraint on the userspace
> programming model is normally a bad decision, but given that most
> software (including all posix conforming code) is not affected,
> i think it is acceptable for an opt-in feature like shadow stack.
I tend to agree - software that's doing a lot of stack pivoting could do
something like allocate a small stack with clone3() and then immediately
pivot away from it, so they never actually use the stack that's tied
to the thread. It's a bit clunky and wasteful but should work.
Since everyone seems OK with dealing with the placement issues by
specifying size only, I'm planning on sending a new version that does
that after the merge window, assuming nobody else raises concerns.
On Fri, 2023-10-27 at 12:49 +0100, [email protected] wrote:
> no. the lifetime is the issue: a stack in principle can outlive
> a thread and be resumed even after the original thread exited.
> for that to work the shadow stack has to outlive the thread too.
Hmm, this makes me think about the tracing usages.
>
> (or the other way around: a stack can be freed before the thread
> exits, if the thread pivots away from that stack.)
>
> posix threads etc. don't allow this, but the linux syscall abi
> (clone) does allow it.
>
> i think it is reasonable to tie the shadow stack lifetime to the
> thread lifetime, but this clearly introduces a limitation on how
> the clone api can be used. such constraint on the userspace
> programming model is normally a bad decision, but given that most
> software (including all posix conforming code) is not affected,
> i think it is acceptable for an opt-in feature like shadow stack.
Do you have any updated plans to share around your earlier ideas for
token schemes that try to shoot for more compatibility or security?
The 10/27/2023 15:55, Edgecombe, Rick P wrote:
> Do you have any updated plans to share around your earlier ideas for
> token schemes that try to shoot for more compatibility or security?
not really.
i don't like that shadow stack overflow cannot be handled,
so we have to allocate huge shadow stacks to avoid overflow.
i had ideas for handling overflow with sigaltstack, but it
is complicated (unwinding, longjmp, swapcontext, ...) so
"allocate huge shadow stack" is currently our best design.
the compatibility issues i see:
- dlopen of incompatible lib (multi thread case)
- makecontext shadow stack leaks (userspace api issue)
- huge shadow stack runs into resource limits
- hand crafted stack switching code needs updates
- minor issues around sigaltstack + swapcontext
i don't have an alternate design that makes these go away.
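for reference, "allocate huge shadow stack" maps onto the merged x86
interface roughly as below (the syscall number and token flag are from
the x86 uapi, the fallback defines are for older headers, and the size
policy is only an example):

#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_map_shadow_stack
#define __NR_map_shadow_stack 453
#endif
#ifndef SHADOW_STACK_SET_TOKEN
#define SHADOW_STACK_SET_TOKEN 0x1	/* write a restore token at the top */
#endif

/* Over-allocate the shadow stack: overflow cannot be handled, so the
 * only defence is making it unlikely.  Pages are faulted in on demand,
 * but the reservation still eats address space and counts against
 * resource limits - one of the issues listed above. */
static void *map_huge_shadow_stack(size_t size)
{
	long ret = syscall(__NR_map_shadow_stack, 0, size,
			   SHADOW_STACK_SET_TOKEN);

	return ret == -1 ? NULL : (void *)ret;
}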
On Fri, Oct 27, 2023 at 12:49:59PM +0100, [email protected] wrote:
>The 10/26/2023 13:40, Deepak Gupta wrote:
>> On Thu, Oct 26, 2023 at 06:53:37PM +0100, Mark Brown wrote:
>> > I'm not sure placement control is essential, but the other bit of it is
>> > the freeing of the shadow stack: especially if userspace is doing stack
>> > switches, the current behaviour where we free the stack when the thread
>> > is exiting doesn't feel great exactly. It's mainly an issue for
>> > programs that pivot stacks, which isn't the common case, but it is a
>> > general sharp edge.
>>
>> In general, I am assuming such placement requirements emanate because
>> the regular stack holds data (local args, etc.) as well and thus software
>> may make assumptions about how the stack frame is prepared and may worry
>> about layout and such. In case of shadow stack, it can only hold return
>
>no. the lifetime is the issue: a stack in principle can outlive
>a thread and be resumed even after the original thread exited.
>for that to work the shadow stack has to outlive the thread too.
>
I understand an application can pre-allocate a pool of stacks and re-use
them whenever it's spawning new threads using the clone3 system call.
However, once a new thread has been spawned, how can it resume?
By resume I mean consume the callstack context from an earlier thread.
Or did you mean something else by `resume` here?
Can you give an example of such an application or runtime where a newly
created thread consumes callstack context created by a going-away thread?
>(or the other way around: a stack can be freed before the thread
>exits, if the thread pivots away from that stack.)
This is simply a thread saying that I am moving to a different stack.
Again, I am interested in learning why a thread would do that. If I have
to speculate on reasons, I could think of a user runtime managing its own
pool of worker items (some people call them green threads) or the current
stack becoming too small.
JIT runtimes (and things like goroutines) do such things, but in those
cases the kernel has no idea about it. From the kernel's perspective
there is a main thread stack (the hosting thread for the JIT), and the
main thread can decide to switch stacks to execute JITted code.
But in that case all it needs is a shadow stack, and managing the lifetime
of such a shadow stack using `clone` wouldn't be helpful; perhaps
`map_shadow_stack` should be used to create a shadow stack on the fly.
Another case I can think of for a thread moving to a different stack is
when the current stack was too small and it wants larger memory. In such
cases as well, I imagine that particular thread would be issuing `mmap`
to allocate larger memory, and thus that thread can very well issue
`map_shadow_stack` too.
In both of these cases, a stack free actually means the thread
(application) issuing a system call to free the outgoing stack memory.
It can free the outgoing shadow stack memory in the same way using
`unmap_shadow_stack`.
Let me know if I misunderstood something or am missing some other use case
of a stack being freed before the thread exits.
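To sketch that user-managed lifecycle (treating `unmap_shadow_stack` as
plain `munmap`, since the merged x86 interface has no separate unmap
syscall and frees shadow stack mappings that way):

#include <stddef.h>
#include <sys/mman.h>

/* A user-managed stack pair: the thread pivots onto these and later
 * frees them itself rather than relying on kernel cleanup at exit. */
struct user_stack {
	void	*stack;		/* ordinary anonymous mmap() */
	size_t	stack_size;
	void	*shstk;		/* from map_shadow_stack() */
	size_t	shstk_size;
};

/* Called once the thread has pivoted away from this pair. */
static void free_user_stack(struct user_stack *us)
{
	munmap(us->stack, us->stack_size);
	munmap(us->shstk, us->shstk_size);	/* unmaps like ordinary memory */
}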
>
>posix threads etc. don't allow this, but the linux syscall abi
>(clone) does allow it.
>
>i think it is reasonable to tie the shadow stack lifetime to the
>thread lifetime, but this clearly introduces a limitation on how
>the clone api can be used. such a constraint on the userspace
>programming model is normally a bad decision, but given that most
>software (including all posix conforming code) is not affected,
>i think it is acceptable for an opt-in feature like shadow stack.
>
The 10/27/2023 16:24, Deepak Gupta wrote:
> On Fri, Oct 27, 2023 at 12:49:59PM +0100, [email protected] wrote:
> > no. the lifetime is the issue: a stack in principle can outlive
> > a thread and be resumed even after the original thread exited.
> > for that to work the shadow stack has to outlive the thread too.
>
> I understand an application can pre-allocate a pool of stacks and re-use
> them whenever it's spawning new threads using the clone3 system call.
>
> However, once a new thread has been spawned, how can it resume?
a thread can getcontext then exit. later another thread
can setcontext and execute on the stack of the exited
thread and return to a previous stack frame there.
(unlikely to work on runtimes where tls or thread id is
exposed and thus may be cached on the stack. so not for
posix.. but e.g. a go runtime could do this)
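to make the pattern concrete, a minimal sketch with plain ucontext and
a caller-owned stack. it deliberately ignores the tls caveat above (the
resumed frame runs with the second thread's tls), and with a
kernel-managed shadow stack the exited thread's shadow stack would
already be freed, so the resumed frame's returns would fault:

#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <ucontext.h>

static ucontext_t saved;
static volatile int captured;

static void *first_thread(void *arg)
{
	getcontext(&saved);	/* capture a frame on this stack */
	if (!captured) {
		captured = 1;
		return NULL;	/* exit; caller-owned stack stays mapped */
	}
	/* Reached when another thread setcontext()s to the frame
	 * captured above, after this thread has exited. */
	printf("resumed an exited thread's frame\n");
	exit(0);
}

static void *second_thread(void *arg)
{
	setcontext(&saved);	/* pivot onto the dead thread's stack */
	return NULL;		/* not reached */
}

int main(void)
{
	size_t sz = 1 << 20;
	void *stk = mmap(NULL, sz, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	pthread_attr_t attr;
	pthread_t t;

	pthread_attr_init(&attr);
	pthread_attr_setstack(&attr, stk, sz);	/* we own this stack */
	pthread_create(&t, &attr, first_thread, NULL);
	pthread_join(t, NULL);

	pthread_create(&t, NULL, second_thread, NULL);
	pthread_join(t, NULL);	/* not reached: second thread exits
				   via the resumed frame */
	return 1;
}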
> By resume I mean consume the callstack context from an earlier thread.
> Or did you mean something else by `resume` here?
>
> Can you give an example of such an application or runtime where a newly
> created thread consumes callstack context created by a going-away thread?
my claim was not that existing runtimes are doing this,
but that the linux interface contract allows this and
tying the stack lifetime to the thread is a change of
contract.
> > (or the other way around: a stack can be freed before the thread
> > exits, if the thread pivots away from that stack.)
>
> This is simply a thread saying that I am moving to a different stack.
> Again, I am interested in learning why a thread would do that. If I have
> to speculate on reasons, I could think of a user runtime managing its own
> pool of worker items (some people call them green threads) or the current
> stack becoming too small.
switching stack is common, freeing the original stack may not be,
but there is nothing that prevents this and then the corresponding
shadow stack is clearly leaked if the kernel manages it. the amount
of leak is proportional to the number of live threads and the sum
of their original stack sizes, which can be big.
but as i said i think this lifetime issue is minor compared
to other shadow stack issues, so it is ok if the shadow stack
is kernel managed.
On Mon, Oct 30, 2023 at 11:39:17AM +0000, [email protected] wrote:
>The 10/27/2023 16:24, Deepak Gupta wrote:
>> On Fri, Oct 27, 2023 at 12:49:59PM +0100, [email protected] wrote:
>> > no. the lifetime is the issue: a stack in principle can outlive
>> > a thread and be resumed even after the original thread exited.
>> > for that to work the shadow stack has to outlive the thread too.
>>
>> I understand an application can pre-allocate a pool of stacks and re-use
>> them whenever it's spawning new threads using the clone3 system call.
>>
>> However, once a new thread has been spawned, how can it resume?
>
>a thread can getcontext then exit. later another thread
>can setcontext and execute on the stack of the exited
>thread and return to a previous stack frame there.
>
>(unlikely to work on runtimes where tls or thread id is
>exposed and thus may be cached on the stack. so not for
>posix.. but e.g. a go runtime could do this)
Aah, then as you mentioned, we basically need clear lifetime rules around
their creation and deletion, because `getcontext/swapcontext/setcontext`
can be updated to save a shadow stack token on the stack itself and use
that to resume. It's just the lifetime that needs to be managed.
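One possible shape for that, purely as a sketch: carry the shadow stack
pointer alongside the ucontext_t and consume a restore token when
resuming. `shstk_get_ssp()` and `shstk_restore()` are hypothetical
names standing in for the architecture-specific parts (e.g. RDSSPQ /
RSTORSSP on x86), not real libc or kernel interfaces:

#include <stdint.h>
#include <ucontext.h>

/* Hypothetical helpers for the architecture-specific shadow stack
 * operations; no such libc functions exist today. */
extern uint64_t shstk_get_ssp(void);
extern void shstk_restore(uint64_t ssp);

struct shstk_ucontext {
	ucontext_t uc;	/* registers and normal stack, as today */
	uint64_t ssp;	/* where we were on the shadow stack, with a
			   restore token left behind at capture time */
};

static int shstk_getcontext(struct shstk_ucontext *sctx)
{
	sctx->ssp = shstk_get_ssp();
	return getcontext(&sctx->uc);
}

static void shstk_setcontext(struct shstk_ucontext *sctx)
{
	/* Pivot the shadow stack first by consuming the token, then
	 * restore ordinary register state.  Only safe while the old
	 * shadow stack mapping is still alive - which is exactly the
	 * lifetime question being discussed. */
	shstk_restore(sctx->ssp);
	setcontext(&sctx->uc);
}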
>
>> By resume I mean consume the callstack context from an earlier thread.
>> Or did you mean something else by `resume` here?
>>
>> Can you give an example of such an application or runtime where a newly
>> created thread consumes callstack context created by a going-away thread?
>
>my claim was not that existing runtimes are doing this,
>but that the linux interface contract allows this and
>tying the stack lifetime to the thread is a change of
>contract.
>
>> > (or the other way around: a stack can be freed before the thread
>> > exits, if the thread pivots away from that stack.)
>>
>> This is simply a thread saying that I am moving to a different stack.
>> Again, I am interested in learning why a thread would do that. If I have
>> to speculate on reasons, I could think of a user runtime managing its own
>> pool of worker items (some people call them green threads) or the current
>> stack becoming too small.
>
>switching stack is common, freeing the original stack may not be,
>but there is nothing that prevents this and then the corresponding
>shadow stack is clearly leaked if the kernel manages it. the amount
>of leak is proportional to the number of live threads and the sum
>of their original stack sizes, which can be big.
>
>but as i said i think this lifetime issue is minor compared
>to other shadow stack issues, so it is ok if the shadow stack
>is kernel managed.