The kernel has recently added support for shadow stacks, currently
x86 only using its CET feature, but both arm64 and RISC-V have
equivalent features (GCS and Zicfiss respectively); I am actively
working on GCS[1]. With shadow stacks the hardware maintains an
additional stack containing only the return addresses pushed by
function calls which is not generally writeable by userspace and
ensures that any returns are to the recorded addresses. This provides
some protection against ROP attacks and makes it easier to collect
call stacks. These shadow stacks are allocated in the address space
of the userspace process.
Our API for shadow stacks does not currently offer userspace any
flexibility for managing the allocation of shadow stacks for newly
created threads; instead the kernel allocates a new shadow stack with
the same size as the normal stack whenever a thread is created with the
feature enabled. The stacks allocated in this way are freed by the
kernel when the thread exits or shadow stacks are disabled for the
thread. This lack of flexibility and control isn't ideal: in the vast
majority of cases the shadow stack will be over-allocated, and the
implicit allocation and deallocation is not consistent with other
interfaces. As far as I can tell the interface is done in this manner
mainly because the shadow stack patches had been in development since
before clone3() was implemented.
Since clone3() is readily extensible, let's add support for specifying a
shadow stack when creating a new thread or process in a similar manner
to how the normal stack is specified, keeping the current implicit
allocation behaviour if one is not specified either with clone3() or
through the use of clone(). Unlike normal stacks only the shadow stack
size is specified; similar issues to those that led to the creation of
map_shadow_stack() apply.
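As an illustration only, here is a rough sketch of how a caller might
ask for a shadow stack for a new child, mirroring what the updated
selftest does and using the struct __clone_args and sys_clone3()
wrapper from clone3_selftests.h (error handling omitted, the sizes are
placeholders):
        struct __clone_args args = {
                .flags             = CLONE_VM,
                .exit_signal       = SIGCHLD,
                /* Ask the kernel to allocate a one page shadow stack */
                .shadow_stack_size = getpagesize(),
        };
        pid_t pid;
        pid = sys_clone3(&args, sizeof(args));
        if (pid == 0)
                /* Child runs on the newly allocated shadow stack */
                _exit(EXIT_SUCCESS);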
Please note that the x86 portions of this code are build tested only; I
don't appear to have a system that can run CET available to me, but I
have done testing with an integration into my pending work for GCS.
There is some possibility that the arm64 implementation may require the
use of clone3() and explicit userspace allocation of shadow stacks;
this is still under discussion.
A new architecture feature Kconfig option for shadow stacks is added
here; this was suggested as part of the review comments for the arm64
GCS series and, since we need to detect if shadow stacks are supported,
it seemed sensible to roll it in here.
[1] https://lore.kernel.org/r/[email protected]/
Signed-off-by: Mark Brown <[email protected]>
---
Changes in v3:
- Rebase onto v6.7-rc2.
- Remove stale shadow_stack in internal kargs.
- If a shadow stack is specified unconditionally use it regardless of
CLONE_ parameters.
- Force enable shadow stacks in the selftest.
- Update changelogs for RISC-V feature rename.
- Link to v2: https://lore.kernel.org/r/[email protected]
Changes in v2:
- Rebase onto v6.7-rc1.
- Remove ability to provide preallocated shadow stack, just specify the
desired size.
- Link to v1: https://lore.kernel.org/r/[email protected]
---
Mark Brown (5):
mm: Introduce ARCH_HAS_USER_SHADOW_STACK
fork: Add shadow stack support to clone3()
selftests/clone3: Factor more of main loop into test_clone3()
selftests/clone3: Allow tests to flag if -E2BIG is a valid error code
kselftest/clone3: Test shadow stack support
arch/x86/Kconfig | 1 +
arch/x86/include/asm/shstk.h | 11 +-
arch/x86/kernel/process.c | 2 +-
arch/x86/kernel/shstk.c | 59 +++++--
fs/proc/task_mmu.c | 2 +-
include/linux/mm.h | 2 +-
include/linux/sched/task.h | 1 +
include/uapi/linux/sched.h | 4 +
kernel/fork.c | 22 ++-
mm/Kconfig | 6 +
tools/testing/selftests/clone3/clone3.c | 200 +++++++++++++++++-----
tools/testing/selftests/clone3/clone3_selftests.h | 7 +
12 files changed, 250 insertions(+), 67 deletions(-)
---
base-commit: 98b1cc82c4affc16f5598d4fa14b1858671b2263
change-id: 20231019-clone3-shadow-stack-15d40d2bf536
Best regards,
--
Mark Brown <[email protected]>
In order to make it easier to add more configuration for the tests and
more support for runtime detection of when tests can be run, pass the
structure describing the tests into test_clone3() rather than picking
the arguments out of it, and have that function do all the per-test work.
No functional change.
Signed-off-by: Mark Brown <[email protected]>
---
tools/testing/selftests/clone3/clone3.c | 77 ++++++++++++++++-----------------
1 file changed, 37 insertions(+), 40 deletions(-)
diff --git a/tools/testing/selftests/clone3/clone3.c b/tools/testing/selftests/clone3/clone3.c
index 3c9bf0cd82a8..1108bd8e36d6 100644
--- a/tools/testing/selftests/clone3/clone3.c
+++ b/tools/testing/selftests/clone3/clone3.c
@@ -30,6 +30,19 @@ enum test_mode {
CLONE3_ARGS_INVAL_EXIT_SIGNAL_NSIG,
};
+typedef bool (*filter_function)(void);
+typedef size_t (*size_function)(void);
+
+struct test {
+ const char *name;
+ uint64_t flags;
+ size_t size;
+ size_function size_function;
+ int expected;
+ enum test_mode test_mode;
+ filter_function filter;
+};
+
static int call_clone3(uint64_t flags, size_t size, enum test_mode test_mode)
{
struct __clone_args args = {
@@ -104,30 +117,40 @@ static int call_clone3(uint64_t flags, size_t size, enum test_mode test_mode)
return 0;
}
-static bool test_clone3(uint64_t flags, size_t size, int expected,
- enum test_mode test_mode)
+static void test_clone3(const struct test *test)
{
+ size_t size;
int ret;
+ if (test->filter && test->filter()) {
+ ksft_test_result_skip("%s\n", test->name);
+ return;
+ }
+
+ if (test->size_function)
+ size = test->size_function();
+ else
+ size = test->size;
+
+ ksft_print_msg("Running test '%s'\n", test->name);
+
ksft_print_msg(
"[%d] Trying clone3() with flags %#" PRIx64 " (size %zu)\n",
- getpid(), flags, size);
- ret = call_clone3(flags, size, test_mode);
+ getpid(), test->flags, size);
+ ret = call_clone3(test->flags, size, test->test_mode);
ksft_print_msg("[%d] clone3() with flags says: %d expected %d\n",
- getpid(), ret, expected);
- if (ret != expected) {
+ getpid(), ret, test->expected);
+ if (ret != test->expected) {
ksft_print_msg(
"[%d] Result (%d) is different than expected (%d)\n",
- getpid(), ret, expected);
- return false;
+ getpid(), ret, test->expected);
+ ksft_test_result_fail("%s\n", test->name);
+ return;
}
- return true;
+ ksft_test_result_pass("%s\n", test->name);
}
-typedef bool (*filter_function)(void);
-typedef size_t (*size_function)(void);
-
static bool not_root(void)
{
if (getuid() != 0) {
@@ -155,16 +178,6 @@ static size_t page_size_plus_8(void)
return getpagesize() + 8;
}
-struct test {
- const char *name;
- uint64_t flags;
- size_t size;
- size_function size_function;
- int expected;
- enum test_mode test_mode;
- filter_function filter;
-};
-
static const struct test tests[] = {
{
.name = "simple clone3()",
@@ -314,24 +327,8 @@ int main(int argc, char *argv[])
ksft_set_plan(ARRAY_SIZE(tests));
test_clone3_supported();
- for (i = 0; i < ARRAY_SIZE(tests); i++) {
- if (tests[i].filter && tests[i].filter()) {
- ksft_test_result_skip("%s\n", tests[i].name);
- continue;
- }
-
- if (tests[i].size_function)
- size = tests[i].size_function();
- else
- size = tests[i].size;
-
- ksft_print_msg("Running test '%s'\n", tests[i].name);
-
- ksft_test_result(test_clone3(tests[i].flags, size,
- tests[i].expected,
- tests[i].test_mode),
- "%s\n", tests[i].name);
- }
+ for (i = 0; i < ARRAY_SIZE(tests); i++)
+ test_clone3(&tests[i]);
ksft_finished();
}
--
2.30.2
Add basic test coverage for specifying the shadow stack for a newly
created thread via clone3(), including coverage of the newly extended
argument structure.
In order to facilitate testing on systems without userspace shadow stack
support we manually enable shadow stacks on startup; this is architecture
specific due to the use of an arch_prctl() on x86. Due to interactions with
potential userspace locking of features we actually detect support for
shadow stacks on the running system by attempting to allocate a shadow
stack page during initialisation using map_shadow_stack(), warning if this
succeeds when the enable failed.
Signed-off-by: Mark Brown <[email protected]>
---
tools/testing/selftests/clone3/clone3.c | 117 ++++++++++++++++++++++
tools/testing/selftests/clone3/clone3_selftests.h | 7 ++
2 files changed, 124 insertions(+)
diff --git a/tools/testing/selftests/clone3/clone3.c b/tools/testing/selftests/clone3/clone3.c
index 6adbfd14c841..0f9f99dc5aac 100644
--- a/tools/testing/selftests/clone3/clone3.c
+++ b/tools/testing/selftests/clone3/clone3.c
@@ -11,6 +11,7 @@
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
+#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/un.h>
@@ -21,6 +22,10 @@
#include "../kselftest.h"
#include "clone3_selftests.h"
+static bool shadow_stack_enabled;
+static bool shadow_stack_supported;
+static size_t max_supported_args_size;
+
enum test_mode {
CLONE3_ARGS_NO_TEST,
CLONE3_ARGS_ALL_0,
@@ -28,6 +33,7 @@ enum test_mode {
CLONE3_ARGS_INVAL_EXIT_SIGNAL_NEG,
CLONE3_ARGS_INVAL_EXIT_SIGNAL_CSIG,
CLONE3_ARGS_INVAL_EXIT_SIGNAL_NSIG,
+ CLONE3_ARGS_SHADOW_STACK,
};
typedef bool (*filter_function)(void);
@@ -44,6 +50,36 @@ struct test {
filter_function filter;
};
+#ifndef __NR_map_shadow_stack
+#define __NR_map_shadow_stack 453
+#endif
+
+/*
+ * We check for shadow stack support by attempting to use
+ * map_shadow_stack() since features may have been locked by the
+ * dynamic linker resulting in spurious errors when we attempt to
+ * enable on startup. We warn if the enable failed.
+ */
+static void test_shadow_stack_supported(void)
+{
+ long shadow_stack;
+
+ shadow_stack = syscall(__NR_map_shadow_stack, 0, getpagesize(), 0);
+ if (shadow_stack == -1) {
+ ksft_print_msg("map_shadow_stack() not supported\n");
+ } else if ((void *)shadow_stack == MAP_FAILED) {
+ ksft_print_msg("Failed to map shadow stack\n");
+ } else {
+ ksft_print_msg("Shadow stack supportd\n");
+ shadow_stack_supported = true;
+
+ if (!shadow_stack_enabled)
+ ksft_print_msg("Mapped but did not enable shadow stack\n");
+
+ munmap((void *)shadow_stack, getpagesize());
+ }
+}
+
static int call_clone3(uint64_t flags, size_t size, enum test_mode test_mode)
{
struct __clone_args args = {
@@ -89,6 +125,9 @@ static int call_clone3(uint64_t flags, size_t size, enum test_mode test_mode)
case CLONE3_ARGS_INVAL_EXIT_SIGNAL_NSIG:
args.exit_signal = 0x00000000000000f0ULL;
break;
+ case CLONE3_ARGS_SHADOW_STACK:
+ args.shadow_stack_size = getpagesize();
+ break;
}
memcpy(&args_ext.args, &args, sizeof(struct __clone_args));
@@ -179,6 +218,26 @@ static bool no_timenamespace(void)
return true;
}
+static bool have_shadow_stack(void)
+{
+ if (shadow_stack_supported) {
+ ksft_print_msg("Shadow stack supported\n");
+ return true;
+ }
+
+ return false;
+}
+
+static bool no_shadow_stack(void)
+{
+ if (!shadow_stack_supported) {
+ ksft_print_msg("Shadow stack not supported\n");
+ return true;
+ }
+
+ return false;
+}
+
static size_t page_size_plus_8(void)
{
return getpagesize() + 8;
@@ -322,16 +381,74 @@ static const struct test tests[] = {
.expected = -EINVAL,
.test_mode = CLONE3_ARGS_NO_TEST,
},
+ {
+ .name = "Shadow stack on system with shadow stack",
+ .flags = CLONE_VM,
+ .size = 0,
+ .expected = 0,
+ .e2big_valid = true,
+ .test_mode = CLONE3_ARGS_SHADOW_STACK,
+ .filter = no_shadow_stack,
+ },
+ {
+ .name = "Shadow stack on system without shadow stack",
+ .flags = CLONE_VM,
+ .size = 0,
+ .expected = -EINVAL,
+ .e2big_valid = true,
+ .test_mode = CLONE3_ARGS_SHADOW_STACK,
+ .filter = have_shadow_stack,
+ },
};
+#ifdef __x86_64__
+#define ARCH_SHSTK_ENABLE 0x5001
+#define ARCH_SHSTK_SHSTK (1ULL << 0)
+
+#define ARCH_PRCTL(arg1, arg2) \
+({ \
+ long _ret; \
+ register long _num asm("eax") = __NR_arch_prctl; \
+ register long _arg1 asm("rdi") = (long)(arg1); \
+ register long _arg2 asm("rsi") = (long)(arg2); \
+ \
+ asm volatile ( \
+ "syscall\n" \
+ : "=a"(_ret) \
+ : "r"(_arg1), "r"(_arg2), \
+ "0"(_num) \
+ : "rcx", "r11", "memory", "cc" \
+ ); \
+ _ret; \
+})
+
+#define ENABLED_SHADOW_STACK
+static inline void enable_shadow_stack(void)
+{
+ int ret = ARCH_PRCTL(ARCH_SHSTK_ENABLE, ARCH_SHSTK_SHSTK);
+ if (ret == 0)
+ shadow_stack_enabled = true;
+}
+
+#endif
+
+#ifndef ENABLE_SHADOW_STACK
+static void enable_shadow_stack(void)
+{
+}
+#endif
+
int main(int argc, char *argv[])
{
size_t size;
int i;
+ enable_shadow_stack();
+
ksft_print_header();
ksft_set_plan(ARRAY_SIZE(tests));
test_clone3_supported();
+ test_shadow_stack_supported();
for (i = 0; i < ARRAY_SIZE(tests); i++)
test_clone3(&tests[i]);
diff --git a/tools/testing/selftests/clone3/clone3_selftests.h b/tools/testing/selftests/clone3/clone3_selftests.h
index 3d2663fe50ba..2e06127091f5 100644
--- a/tools/testing/selftests/clone3/clone3_selftests.h
+++ b/tools/testing/selftests/clone3/clone3_selftests.h
@@ -31,6 +31,13 @@ struct __clone_args {
__aligned_u64 set_tid;
__aligned_u64 set_tid_size;
__aligned_u64 cgroup;
+#ifndef CLONE_ARGS_SIZE_VER2
+#define CLONE_ARGS_SIZE_VER2 88 /* sizeof third published struct */
+#endif
+ __aligned_u64 shadow_stack_size;
+#ifndef CLONE_ARGS_SIZE_VER3
+#define CLONE_ARGS_SIZE_VER3 96 /* sizeof fourth published struct */
+#endif
};
static pid_t sys_clone3(struct __clone_args *args, size_t size)
--
2.30.2
Unlike with the normal stack there is no API for configuring the shadow
stack for a new thread; instead the kernel will dynamically allocate a
new shadow stack with the same size as the normal stack. This appears to
be due to the shadow stack series having been in development since
before the more extensible clone3() was added, rather than anything more
deliberate.
Add a parameter to clone3() specifying the size of a shadow stack for
the newly created process. If no shadow stack is specified then the
existing implicit allocation behaviour is maintained.
If the architecture does not support shadow stacks the shadow stack size
parameter must be zero; architectures that do support the feature are
expected to enforce the same requirement on individual systems that lack
shadow stack support.
Update the existing x86 implementation to pay attention to the newly
added arguments; in order to maintain compatibility we use the existing
behaviour if no shadow stack is specified. Minimal validation is done of
the supplied parameters; detailed enforcement is left to when the thread
is executed. Since we are now using more fields from the
kernel_clone_args we pass that into the shadow stack code rather than
individual fields.
Signed-off-by: Mark Brown <[email protected]>
---
arch/x86/include/asm/shstk.h | 11 ++++++---
arch/x86/kernel/process.c | 2 +-
arch/x86/kernel/shstk.c | 59 ++++++++++++++++++++++++++++++--------------
include/linux/sched/task.h | 1 +
include/uapi/linux/sched.h | 4 +++
kernel/fork.c | 22 +++++++++++++++--
6 files changed, 74 insertions(+), 25 deletions(-)
diff --git a/arch/x86/include/asm/shstk.h b/arch/x86/include/asm/shstk.h
index 42fee8959df7..8be7b0a909c3 100644
--- a/arch/x86/include/asm/shstk.h
+++ b/arch/x86/include/asm/shstk.h
@@ -6,6 +6,7 @@
#include <linux/types.h>
struct task_struct;
+struct kernel_clone_args;
struct ksignal;
#ifdef CONFIG_X86_USER_SHADOW_STACK
@@ -16,8 +17,8 @@ struct thread_shstk {
long shstk_prctl(struct task_struct *task, int option, unsigned long arg2);
void reset_thread_features(void);
-unsigned long shstk_alloc_thread_stack(struct task_struct *p, unsigned long clone_flags,
- unsigned long stack_size);
+unsigned long shstk_alloc_thread_stack(struct task_struct *p,
+ const struct kernel_clone_args *args);
void shstk_free(struct task_struct *p);
int setup_signal_shadow_stack(struct ksignal *ksig);
int restore_signal_shadow_stack(void);
@@ -26,8 +27,10 @@ static inline long shstk_prctl(struct task_struct *task, int option,
unsigned long arg2) { return -EINVAL; }
static inline void reset_thread_features(void) {}
static inline unsigned long shstk_alloc_thread_stack(struct task_struct *p,
- unsigned long clone_flags,
- unsigned long stack_size) { return 0; }
+ const struct kernel_clone_args *args)
+{
+ return 0;
+}
static inline void shstk_free(struct task_struct *p) {}
static inline int setup_signal_shadow_stack(struct ksignal *ksig) { return 0; }
static inline int restore_signal_shadow_stack(void) { return 0; }
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index b6f4e8399fca..a9ca80ea5056 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -207,7 +207,7 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
* is disabled, new_ssp will remain 0, and fpu_clone() will know not to
* update it.
*/
- new_ssp = shstk_alloc_thread_stack(p, clone_flags, args->stack_size);
+ new_ssp = shstk_alloc_thread_stack(p, args);
if (IS_ERR_VALUE(new_ssp))
return PTR_ERR((void *)new_ssp);
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index 59e15dd8d0f8..a14f47d70dfb 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -191,38 +191,61 @@ void reset_thread_features(void)
current->thread.features_locked = 0;
}
-unsigned long shstk_alloc_thread_stack(struct task_struct *tsk, unsigned long clone_flags,
- unsigned long stack_size)
+unsigned long shstk_alloc_thread_stack(struct task_struct *tsk,
+ const struct kernel_clone_args *args)
{
struct thread_shstk *shstk = &tsk->thread.shstk;
+ unsigned long clone_flags = args->flags;
unsigned long addr, size;
/*
* If shadow stack is not enabled on the new thread, skip any
- * switch to a new shadow stack.
+ * implicit switch to a new shadow stack and reject attempts to
+ * explciitly specify one.
*/
- if (!features_enabled(ARCH_SHSTK_SHSTK))
- return 0;
+ if (!features_enabled(ARCH_SHSTK_SHSTK)) {
+ if (args->shadow_stack_size)
+ return (unsigned long)ERR_PTR(-EINVAL);
- /*
- * For CLONE_VFORK the child will share the parents shadow stack.
- * Make sure to clear the internal tracking of the thread shadow
- * stack so the freeing logic run for child knows to leave it alone.
- */
- if (clone_flags & CLONE_VFORK) {
- shstk->base = 0;
- shstk->size = 0;
return 0;
}
/*
- * For !CLONE_VM the child will use a copy of the parents shadow
- * stack.
+ * If the user specified a shadow stack then do some basic
+ * validation and use it, otherwise fall back to a default
+ * shadow stack size if the clone_flags don't indicate an
+ * allocation is unneeded.
*/
- if (!(clone_flags & CLONE_VM))
- return 0;
+ if (args->shadow_stack_size) {
+ size = args->shadow_stack_size;
+
+ if (size < 8)
+ return (unsigned long)ERR_PTR(-EINVAL);
+ } else {
+ /*
+ * For CLONE_VFORK the child will share the parents
+ * shadow stack. Make sure to clear the internal
+ * tracking of the thread shadow stack so the freeing
+ * logic run for child knows to leave it alone.
+ */
+ if (clone_flags & CLONE_VFORK) {
+ shstk->base = 0;
+ shstk->size = 0;
+ return 0;
+ }
+
+ /*
+ * For !CLONE_VM the child will use a copy of the
+ * parents shadow stack.
+ */
+ if (!(clone_flags & CLONE_VM))
+ return 0;
+
+ size = args->stack_size;
+
+ }
- size = adjust_shstk_size(stack_size);
+ size = adjust_shstk_size(size);
addr = alloc_shstk(0, size, 0, false);
if (IS_ERR_VALUE(addr))
return addr;
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index a23af225c898..e86a09cfccd8 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -41,6 +41,7 @@ struct kernel_clone_args {
void *fn_arg;
struct cgroup *cgrp;
struct css_set *cset;
+ unsigned long shadow_stack_size;
};
/*
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 3bac0a8ceab2..a998b6d0c897 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -84,6 +84,8 @@
* kernel's limit of nested PID namespaces.
* @cgroup: If CLONE_INTO_CGROUP is specified set this to
* a file descriptor for the cgroup.
+ * @shadow_stack_size: Specify the size of the shadow stack to allocate
+ * for the child process.
*
* The structure is versioned by size and thus extensible.
* New struct members must go at the end of the struct and
@@ -101,12 +103,14 @@ struct clone_args {
__aligned_u64 set_tid;
__aligned_u64 set_tid_size;
__aligned_u64 cgroup;
+ __aligned_u64 shadow_stack_size;
};
#endif
#define CLONE_ARGS_SIZE_VER0 64 /* sizeof first published struct */
#define CLONE_ARGS_SIZE_VER1 80 /* sizeof second published struct */
#define CLONE_ARGS_SIZE_VER2 88 /* sizeof third published struct */
+#define CLONE_ARGS_SIZE_VER3 96 /* sizeof fourth published struct */
/*
* Scheduling policies
diff --git a/kernel/fork.c b/kernel/fork.c
index 10917c3e1f03..b8ca8194bca5 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -3067,7 +3067,9 @@ noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs,
CLONE_ARGS_SIZE_VER1);
BUILD_BUG_ON(offsetofend(struct clone_args, cgroup) !=
CLONE_ARGS_SIZE_VER2);
- BUILD_BUG_ON(sizeof(struct clone_args) != CLONE_ARGS_SIZE_VER2);
+ BUILD_BUG_ON(offsetofend(struct clone_args, shadow_stack_size) !=
+ CLONE_ARGS_SIZE_VER3);
+ BUILD_BUG_ON(sizeof(struct clone_args) != CLONE_ARGS_SIZE_VER3);
if (unlikely(usize > PAGE_SIZE))
return -E2BIG;
@@ -3110,6 +3112,7 @@ noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs,
.tls = args.tls,
.set_tid_size = args.set_tid_size,
.cgroup = args.cgroup,
+ .shadow_stack_size = args.shadow_stack_size,
};
if (args.set_tid &&
@@ -3150,6 +3153,21 @@ static inline bool clone3_stack_valid(struct kernel_clone_args *kargs)
return true;
}
+/**
+ * clone3_shadow_stack_valid - check and prepare shadow stack
+ * @kargs: kernel clone args
+ *
+ * Verify that shadow stacks are only enabled if supported.
+ */
+static inline bool clone3_shadow_stack_valid(struct kernel_clone_args *kargs)
+{
+ if (!kargs->shadow_stack_size)
+ return true;
+
+ /* The architecture must check support on the specific machine */
+ return IS_ENABLED(CONFIG_ARCH_HAS_USER_SHADOW_STACK);
+}
+
static bool clone3_args_valid(struct kernel_clone_args *kargs)
{
/* Verify that no unknown flags are passed along. */
@@ -3172,7 +3190,7 @@ static bool clone3_args_valid(struct kernel_clone_args *kargs)
kargs->exit_signal)
return false;
- if (!clone3_stack_valid(kargs))
+ if (!clone3_stack_valid(kargs) || !clone3_shadow_stack_valid(kargs))
return false;
return true;
--
2.30.2
Since multiple architectures have support for shadow stacks and we need
to select support for this feature in several places in the generic
code, provide a generic config option that the architectures can select.
Suggested-by: David Hildenbrand <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Signed-off-by: Mark Brown <[email protected]>
---
arch/x86/Kconfig | 1 +
fs/proc/task_mmu.c | 2 +-
include/linux/mm.h | 2 +-
mm/Kconfig | 6 ++++++
4 files changed, 9 insertions(+), 2 deletions(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 3762f41bb092..14b7703a9a2b 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1952,6 +1952,7 @@ config X86_USER_SHADOW_STACK
depends on AS_WRUSS
depends on X86_64
select ARCH_USES_HIGH_VMA_FLAGS
+ select ARCH_HAS_USER_SHADOW_STACK
select X86_CET
help
Shadow stack protection is a hardware feature that detects function
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index ef2eb12906da..f0a904aeee8e 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -699,7 +699,7 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR
[ilog2(VM_UFFD_MINOR)] = "ui",
#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_MINOR */
-#ifdef CONFIG_X86_USER_SHADOW_STACK
+#ifdef CONFIG_ARCH_HAS_USER_SHADOW_STACK
[ilog2(VM_SHADOW_STACK)] = "ss",
#endif
};
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 418d26608ece..10462f354614 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -341,7 +341,7 @@ extern unsigned int kobjsize(const void *objp);
#endif
#endif /* CONFIG_ARCH_HAS_PKEYS */
-#ifdef CONFIG_X86_USER_SHADOW_STACK
+#ifdef CONFIG_ARCH_HAS_USER_SHADOW_STACK
/*
* VM_SHADOW_STACK should not be set with VM_SHARED because of lack of
* support core mm.
diff --git a/mm/Kconfig b/mm/Kconfig
index 89971a894b60..6713bb3b0b48 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1270,6 +1270,12 @@ config LOCK_MM_AND_FIND_VMA
bool
depends on !STACK_GROWSUP
+config ARCH_HAS_USER_SHADOW_STACK
+ bool
+ help
+ The architecture has hardware support for userspace shadow call
+ stacks (eg, x86 CET, arm64 GCS or RISC-V Zicfiss).
+
source "mm/damon/Kconfig"
endmenu
--
2.30.2
The clone_args structure is extensible, with the syscall passing in the
length of the structure. Inside the kernel we use copy_struct_from_user()
to read the struct but this has the unfortunate side effect of silently
accepting some overrun in the structure size providing the extra data is
all zeros. This means that we can't discover the clone3() features that
the running kernel supports by simply probing with various struct sizes.
We need to check this for the benefit of test systems which run newer
kselftests on old kernels.
Add a flag which can be set on a test to indicate that clone3() may return
-E2BIG due to the use of newer struct versions. Currently no tests need
this but it will become an issue for testing clone3() support for shadow
stacks; the support for shadow stacks is already present on x86.
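As a rough illustration only (assuming the selftest's sys_clone3()
wrapper and the CLONE_ARGS_SIZE_VER3 definition added elsewhere in this
series, running against a kernel which only knows CLONE_ARGS_SIZE_VER2):
        struct __clone_args args = {
                .flags       = CLONE_VM,
                .exit_signal = SIGCHLD,
        };
        /*
         * Accepted: the bytes beyond the old kernel's struct are all
         * zero, so probing with a larger size tells us nothing about
         * which fields are supported.
         */
        sys_clone3(&args, CLONE_ARGS_SIZE_VER3);
        /*
         * Rejected with E2BIG rather than the EINVAL a newer kernel
         * without shadow stack support would give, which is the case
         * the new e2big_valid flag lets a test treat as a skip.
         */
        args.shadow_stack_size = getpagesize();
        sys_clone3(&args, CLONE_ARGS_SIZE_VER3);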
Signed-off-by: Mark Brown <[email protected]>
---
tools/testing/selftests/clone3/clone3.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/tools/testing/selftests/clone3/clone3.c b/tools/testing/selftests/clone3/clone3.c
index 1108bd8e36d6..6adbfd14c841 100644
--- a/tools/testing/selftests/clone3/clone3.c
+++ b/tools/testing/selftests/clone3/clone3.c
@@ -39,6 +39,7 @@ struct test {
size_t size;
size_function size_function;
int expected;
+ bool e2big_valid;
enum test_mode test_mode;
filter_function filter;
};
@@ -141,6 +142,11 @@ static void test_clone3(const struct test *test)
ksft_print_msg("[%d] clone3() with flags says: %d expected %d\n",
getpid(), ret, test->expected);
if (ret != test->expected) {
+ if (test->e2big_valid && ret == -E2BIG) {
+ ksft_print_msg("Test reported -E2BIG\n");
+ ksft_test_result_skip("%s\n", test->name);
+ return;
+ }
ksft_print_msg(
"[%d] Result (%d) is different than expected (%d)\n",
getpid(), ret, test->expected);
--
2.30.2
On Mon, Nov 20, 2023 at 11:54:28PM +0000, Mark Brown wrote:
> The kernel has recently added support for shadow stacks, currently
> x86 only using their CET feature but both arm64 and RISC-V have
> equivalent features (GCS and Zicfiss respectively), I am actively
> working on GCS[1]. With shadow stacks the hardware maintains an
> additional stack containing only the return addresses for branch
> instructions which is not generally writeable by userspace and ensures
> that any returns are to the recorded addresses. This provides some
> protection against ROP attacks and making it easier to collect call
> stacks. These shadow stacks are allocated in the address space of the
> userspace process.
>
> Our API for shadow stacks does not currently offer userspace any
> flexiblity for managing the allocation of shadow stacks for newly
> created threads, instead the kernel allocates a new shadow stack with
> the same size as the normal stack whenever a thread is created with the
> feature enabled. The stacks allocated in this way are freed by the
> kernel when the thread exits or shadow stacks are disabled for the
> thread. This lack of flexibility and control isn't ideal, in the vast
> majority of cases the shadow stack will be over allocated and the
> implicit allocation and deallocation is not consistent with other
> interfaces. As far as I can tell the interface is done in this manner
> mainly because the shadow stack patches were in development since before
> clone3() was implemented.
>
> Since clone3() is readily extensible let's add support for specifying a
> shadow stack when creating a new thread or process in a similar manner
So while I made clone3() readily extensible I don't want it to ever
devolve into a fancier version of a prctl().
I would really like to see a strong reason for allowing userspace to
configure the shadow stack size at this point in time.
I have a few questions that are probably me just not knowing much about
shadow stacks, so hopefully I'm not asking you to write a thesis by
accident:
(1) What does it mean for a shadow stack to be over-allocated and is
    over-allocation really that much of a problem out in the wild that
    we need to give userspace a knob to control a kernel security
feature?
(2) With what other interfaces is implicit allocation and deallocation
not consistent? I don't understand this argument. The kernel creates
a shadow stack as a security measure to store return addresses. It
seems to me exactly that the kernel should implicitly allocate and
deallocate the shadow stack and not have userspace muck around with
its size?
(3) Why is it safe for userspace to request the shadow stack size? What
if they request a tiny shadow stack size? Should this interface
require any privilege?
(4) Why isn't the @stack_size argument I added for clone3() enough?
    If it is specified can't the size of the shadow stack be derived from it?
And my current main objection is that shadow stacks were just released
to userspace. There can't be a massive amount of users yet - outside of
maybe early adopters.
The fact that there are other architectures that bring in a similar
feature makes me even more hesitant. If they have all agreed _and_
implemented shadow stacks and have unified semantics then we can
consider exposing control knobs to userspace that aren't implicitly
architecture specific currently.
So I don't have anything against the patches per se, obviously, but
rather with the wider context.
Thanks!
The 11/21/2023 11:17, Christian Brauner wrote:
> On Mon, Nov 20, 2023 at 11:54:28PM +0000, Mark Brown wrote:
> > The kernel has recently added support for shadow stacks, currently
> > x86 only using their CET feature but both arm64 and RISC-V have
> > equivalent features (GCS and Zicfiss respectively), I am actively
> > working on GCS[1]. With shadow stacks the hardware maintains an
> > additional stack containing only the return addresses for branch
> > instructions which is not generally writeable by userspace and ensures
> > that any returns are to the recorded addresses. This provides some
> > protection against ROP attacks and making it easier to collect call
> > stacks. These shadow stacks are allocated in the address space of the
> > userspace process.
> >
> > Our API for shadow stacks does not currently offer userspace any
> > flexiblity for managing the allocation of shadow stacks for newly
> > created threads, instead the kernel allocates a new shadow stack with
> > the same size as the normal stack whenever a thread is created with the
> > feature enabled. The stacks allocated in this way are freed by the
> > kernel when the thread exits or shadow stacks are disabled for the
> > thread. This lack of flexibility and control isn't ideal, in the vast
> > majority of cases the shadow stack will be over allocated and the
> > implicit allocation and deallocation is not consistent with other
> > interfaces. As far as I can tell the interface is done in this manner
> > mainly because the shadow stack patches were in development since before
> > clone3() was implemented.
> >
> > Since clone3() is readily extensible let's add support for specifying a
> > shadow stack when creating a new thread or process in a similar manner
>
> So while I made clone3() readily extensible I don't want it to ever
> devolve into a fancier version of a prctl().
>
> I would really like to see a strong reason for allowing userspace to
> configure the shadow stack size at this point in time.
>
> I have a few questions that are probably me just not knowing much about
> shadow stacks so hopefully I'm not asking you write a thesis by
> accident:
>
> (1) What does it mean for a shadow stack to be over allocated and is
> over-allocation really that much of a problem out in the wild that
> we need to give I userspace a knob to control a kernel security
> feature?
over-allocation: allocating 100M shadow stack (RLIMIT_DATA, RLIMIT_AS)
for a thread that keeps arrays of data on the stack, instead of say 8k,
and thus running into resource limits.
under-allocation: small thread stack with runaway recursion, but large
sigaltstack: the stack overflow handler can run out of space because it
uses the same shadow stack as the thread.
> (2) With what other interfaces is implicit allocation and deallocation
> not consistent? I don't understand this argument. The kernel creates
> a shadow stack as a security measure to store return addresses. It
> seems to me exactly that the kernel should implicitly allocate and
> deallocate the shadow stack and not have userspace muck around with
> its size?
the kernel is not supposed to impose stack size policy or a particular
programming model that limits the stack management options nor prevent
the handling of stack overflows.
> (3) Why is it safe for userspace to request the shadow stack size? What
> if they request a tiny shadow stack size? Should this interface
> require any privilege?
user can allocate huge or tiny stacks already.
and i think userspace can take control over shadow stack management:
it can disable signals, start a clone child with stack_size == 1 page,
map_shadow_stack and switch to it, enable signals. however this is
complicated, leaks 1 page of kernel allocated shadow stack (+reserved
guard page, i guess userspace could unmap, not sure if that works
currently) and requires additional syscalls.
> (4) Why isn't the @stack_size argument I added for clone3() enough?
> If it is specified can't the size of the shadow stack derived from it?
shadow stack only contains return addresses so it is proportional
to the number of stack frames, not the stack size and it must
account for sigaltstack too, not just the thread stack.
if you make minimal assumptions about stack usage and ignore the
sigaltstack issue then the worst case shadow stack requirement
is indeed proportional to the stack_size, but this upper bound
can be pessimistic and userspace knows the tradeoffs better.
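(as a rough worked example, numbers purely illustrative: with 8 byte
return addresses a call depth of 1000 frames only needs about 8k of
shadow stack whether the normal stack is 64k or 8M, while a thread
that puts a 1M buffer on its stack needs none of that reflected in
the shadow stack. sizing the shadow stack from stack_size effectively
assumes up to one call frame per 8 bytes of normal stack.)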
>
> And my current main objection is that shadow stacks were just released
> to userspace. There can't be a massive amount of users yet - outside of
> maybe early adopters.
no upstream libc has code to enable shadow stacks at this point
so there are exactly 0 users in the open. (this feature requires
runtime support)
the change is expected to allow wider deployability. (e.g. not
just in glibc)
>
> The fact that there are other architectures that bring in a similar
> feature makes me even more hesitant. If they have all agreed _and_
> implemented shadow stacks and have unified semantics then we can
> consider exposing control knobs to userspace that aren't implicitly
> architecture specific currently.
>
> So I don't have anything against the patches per obviously but with the
> wider context.
>
> Thanks!
On Tue, Nov 21, 2023 at 12:21:37PM +0000, Szabolcs Nagy wrote:
> The 11/21/2023 11:17, Christian Brauner wrote:
> > I have a few questions that are probably me just not knowing much about
> > shadow stacks so hopefully I'm not asking you write a thesis by
> > accident:
One thing it feels like it's worth saying up front here is that shadow
stacks are userspace memory with special permissions and instructions
for access - they are mapped into the userspace address range and
userspace can directly interact with them in restricted ways. For
example there's some thought to using shadow stacks in unwinders since
all the return addresses are stored in a single convenient block of
memory which it's much harder to corrupt. Overflowing a shadow stack
results in userspace getting a memory access fault just as with other
memory access issues.
> > (2) With what other interfaces is implicit allocation and deallocation
> > not consistent? I don't understand this argument. The kernel creates
> > a shadow stack as a security measure to store return addresses. It
> > seems to me exactly that the kernel should implicitly allocate and
> > deallocate the shadow stack and not have userspace muck around with
> > its size?
> the kernel is not supposed to impose stack size policy or a particular
> programming model that limits the stack management options nor prevent
> the handling of stack overflows.
The inconsistency here is with the management of the standard stack -
with the standard stack userspace passes an already allocated address
range to the kernel. A constant tension during review of the shadow
stack interfaces has been that shadow stack memory is userspace memory
but the security constraints mean that we've come down on the side of
having a custom allocation syscall for it instead of using flags on
mmap() and friends like people often expect, and now having it allocated
as part of clone3(). The aim is to highlight that this difference is
deliberately chosen for specific reasons rather than just carelessness.
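As a rough sketch of that allocation interface (the flag value is from
the x86 UAPI and the selftests define __NR_map_shadow_stack as 453
where headers lack it; the size here is just an example):
        #include <sys/syscall.h>
        #include <unistd.h>
        #ifndef SHADOW_STACK_SET_TOKEN
        #define SHADOW_STACK_SET_TOKEN  (1ULL << 0)
        #endif
        /*
         * Allocate a shadow stack with a restore token at the top so
         * that userspace can later switch onto it.
         */
        unsigned long ssp = syscall(__NR_map_shadow_stack, 0,
                                    8 * getpagesize(),
                                    SHADOW_STACK_SET_TOKEN);
The kernel picks the address, maps the memory with shadow stack
permissions and, when the token is requested, writes the marker that
the shadow stack switch instructions consume.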
> > (3) Why is it safe for userspace to request the shadow stack size? What
> > if they request a tiny shadow stack size? Should this interface
> > require any privilege?
> user can allocate huge or tiny stacks already.
> and i think userspace can take control over shadow stack management:
> it can disable signals, start a clone child with stack_size == 1 page,
> map_shadow_stack and switch to it, enable signals. however this is
> complicated, leaks 1 page of kernel allocated shadow stack (+reserved
> guard page, i guess userspace could unmap, not sure if that works
> currently) and requires additional syscalls.
The other thing here is that if userspace gets this wrong it'll result
in the userspace process hitting the top of the stack and getting fatal
signals in a similar manner to what happens if it gets the size of
the standard stack wrong (the kernel allocation does mean that there
should always be guard pages and it's harder to overrun the stack and
corrupt adjacent memory). There doesn't seem to be any meaningful risk
here over what userspace can already do to itself anyway as part of
thread allocation.
> > (4) Why isn't the @stack_size argument I added for clone3() enough?
> > If it is specified can't the size of the shadow stack derived from it?
> shadow stack only contains return addresses so it is proportional
> to the number of stack frames, not the stack size and it must
> account for sigaltstack too, not just the thread stack.
> if you make minimal assumptions about stack usage and ignore the
> sigaltstack issue then the worst case shadow stack requirement
> is indeed proportional to the stack_size, but this upper bound
> can be pessimistic and userspace knows the tradeoffs better.
It's also worth pointing out here that the existing shadow stack support
for x86 and the in-review code for arm64 make exactly these assumptions
and guess at a shadow stack size based on the stack_size for the thread.
There's just been a general lack of enthusiasm for the fact that due to
the need to store variables on the normal stack the resulting shadow
stack is very likely to be substantially overallocated but we can't
safely reduce the size without information from userspace.
> > And my current main objection is that shadow stacks were just released
> > to userspace. There can't be a massive amount of users yet - outside of
> > maybe early adopters.
> no upstream libc has code to enable shadow stacks at this point
> so there are exactly 0 users in the open. (this feature requires
> runtime support)
> the change is expected to allow wider deployability. (e.g. not
> just in glibc)
Right, and the lack of any userspace control of the shadow stack size
has been a review concern with the arm64 GCS series which I'm trying to
address here. The main concern is that userspaces that start a lot of
threads are going to start using a lot more address space than they need
to when shadow stacks are enabled. Given the fairly long deployment
pipeline from extending a syscall to end users who might be using the
feature in conjunction with imposing resource limits it does seem like a
reasonable problem to anticipate.
> > The fact that there are other architectures that bring in a similar
> > feature makes me even more hesitant. If they have all agreed _and_
> > implemented shadow stacks and have unified semantics then we can
> > consider exposing control knobs to userspace that aren't implicitly
> > architecture specific currently.
To be clear, the reason I'm working on this is that I've implemented the
arm64 support; I don't even have any access to x86 systems that have the
feature (hence the RFT in the subject line) - Rick Edgecombe did the x86
work. The arm64 code is still in review; the userspace interface is
very similar to that for x86 and there doesn't seem to be any
controversy there, which makes me expect that a change is unlikely.
Unlike x86 we only have a spec and virtual implementations at present,
there's no imminent hardware, so we're taking our time with making sure
that
everything is working well. Deepak Gupta (in CC) has been reviewing the
series from the point of view of RISC-V. I think we're all fairly well
aligned on the requirements here.
On 2023-11-20 23:54, Mark Brown wrote:
> Add basic test coverage for specifying the shadow stack for a newly
> created thread via clone3(), including coverage of the newly extended
> argument structure.
>
> In order to facilitate testing on systems without userspace shadow stack
> support we manually enable shadow stacks on startup, this is architecture
> specific due to the use of an arch_prctl() on x86. Due to interactions with
> potential userspace locking of features we actually detect support for
> shadow stacks on the running system by attempting to allocate a shadow
> stack page during initialisation using map_shadow_stack(), warning if this
> succeeds when the enable failed.
>
> Signed-off-by: Mark Brown <[email protected]>
> ---
> +#ifdef __x86_64__
> +#define ARCH_SHSTK_ENABLE 0x5001
> +#define ARCH_SHSTK_SHSTK (1ULL << 0)
> +
> +#define ARCH_PRCTL(arg1, arg2) \
> +({ \
> + long _ret; \
> + register long _num asm("eax") = __NR_arch_prctl; \
> + register long _arg1 asm("rdi") = (long)(arg1); \
> + register long _arg2 asm("rsi") = (long)(arg2); \
> + \
> + asm volatile ( \
> + "syscall\n" \
> + : "=a"(_ret) \
> + : "r"(_arg1), "r"(_arg2), \
> + "0"(_num) \
> + : "rcx", "r11", "memory", "cc" \
> + ); \
> + _ret; \
> +})
> +
> +#define ENABLED_SHADOW_STACK
> +static inline void enable_shadow_stack(void)
> +{
> + int ret = ARCH_PRCTL(ARCH_SHSTK_ENABLE, ARCH_SHSTK_SHSTK);
> + if (ret == 0)
> + shadow_stack_enabled = true;
> +}
> +
> +#endif
> +
> +#ifndef ENABLE_SHADOW_STACK
Should this be ENABLED_SHADOW_STACK ?
Building this patchset for x86 gave me this build error:
make[4]: Entering directory '/home/anders/src/kernel/linux/tools/testing/selftests/clone3'
x86_64-linux-gnu-gcc -g -std=gnu99 -isystem /home/anders/.cache/tuxmake/builds/513/build/usr/include clone3.c -lcap -o /home/anders/.cache/tuxmake/builds/513/build/kselftest/clone3/clone3
clone3.c:436:13: error: redefinition of 'enable_shadow_stack'
436 | static void enable_shadow_stack(void)
| ^~~~~~~~~~~~~~~~~~~
clone3.c:426:20: note: previous definition of 'enable_shadow_stack' with type 'void(void)'
426 | static inline void enable_shadow_stack(void)
| ^~~~~~~~~~~~~~~~~~~
make[4]: Leaving directory '/home/anders/src/kernel/linux/tools/testing/selftests/clone3'
make[4]: *** [../lib.mk:181: /home/anders/.cache/tuxmake/builds/513/build/kselftest/clone3/clone3] Error 1
make[3]: *** [Makefile:178: all] Error 2
make[3]: Target 'install' not remade because of errors.
make[2]: *** [/home/anders/src/kernel/linux/Makefile:1362: kselftest-install] Error 2
Cheers,
Anders
On Wed, Nov 22, 2023 at 12:19:49PM +0100, Anders Roxell wrote:
> On 2023-11-20 23:54, Mark Brown wrote:
> > +#ifndef ENABLE_SHADOW_STACK
> Should this be ENABLED_SHADOW_STACK ?
Yes, I already fixed this locally.
Please delete unneeded context from mails when replying. Doing this
makes it much easier to find your reply in the message, helping ensure
it won't be missed by people scrolling through the irrelevant quoted
material.
On Tue, Nov 21, 2023 at 04:09:40PM +0000, Mark Brown wrote:
> On Tue, Nov 21, 2023 at 12:21:37PM +0000, Szabolcs Nagy wrote:
> > The 11/21/2023 11:17, Christian Brauner wrote:
>
> > > I have a few questions that are probably me just not knowing much about
> > > shadow stacks so hopefully I'm not asking you write a thesis by
> > > accident:
>
> One thing it feels like it's worth saying up front here is that shadow
> stacks are userspace memory with special permissions and instructions
> for access - they are mapped into the userspace address range and
> userspace can directly interact with them in restricted ways. For
> example there's some thought to using shadow stacks in unwinders since
> all the return addresses are stored in a single convenient block of
> memory which it's much harder to corrupt. Overflowing a shadow stack
> results in userspace getting a memory access fault just as with other
> memory access issues.
Thanks for that summary.
>
> > > (2) With what other interfaces is implicit allocation and deallocation
> > > not consistent? I don't understand this argument. The kernel creates
> > > a shadow stack as a security measure to store return addresses. It
> > > seems to me exactly that the kernel should implicitly allocate and
> > > deallocate the shadow stack and not have userspace muck around with
> > > its size?
>
> > the kernel is not supposed to impose stack size policy or a particular
> > programming model that limits the stack management options nor prevent
> > the handling of stack overflows.
>
> The inconsistency here is with the management of the standard stack -
> with the standard stack userspace passes an already allocated address
> range to the kernel. A constant tension during review of the shadow
> stack interfaces has been that shadow stack memory is userspace memory
> but the security constraints mean that we've come down on the side of
> having a custom allocation syscall for it instead of using flags on
> mmap() and friends like people often expect, and now having it allocated
> as part of clone3(). The aim is to highlight that this difference is
So you have two interfaces for allocating a shadow stack. The first one
is to explicitly allocate a shadow stack via map_shadow_stack(). The
second one is an implicit allocation during clone3() and you want to
allow explicitly influencing that.
> deliberately chosen for specific reasons rather than just carelessness.
>
> > > (3) Why is it safe for userspace to request the shadow stack size? What
> > > if they request a tiny shadow stack size? Should this interface
> > > require any privilege?
>
> > user can allocate huge or tiny stacks already.
>
> > and i think userspace can take control over shadow stack management:
> > it can disable signals, start a clone child with stack_size == 1 page,
> > map_shadow_stack and switch to it, enable signals. however this is
> > complicated, leaks 1 page of kernel allocated shadow stack (+reserved
> > guard page, i guess userspace could unmap, not sure if that works
> > currently) and requires additional syscalls.
>
> The other thing here is that if userspace gets this wrong it'll result
> in the userspace process hitting the top of the stack and getting fatal
> signals in a similar manner to what happens if it gets the size of
> the standard stack wrong (the kernel allocation does mean that there
> should always be guard pages and it's harder to overrun the stack and
> corrupt adjacent memory). There doesn't seem to be any meaningful risk
> here over what userspace can already do to itself anyway as part of
> thread allocation.
clone3() _aimed_ to clean up the stack handling a bit, but we had
concerns that deviating too much from legacy clone() would mean
userspace couldn't fully replace it. So we would have liked to clean up
stack handling a lot more but there are limits to that. We do, however,
perform basic sanity checks now.
>
> > > (4) Why isn't the @stack_size argument I added for clone3() enough?
> > > If it is specified can't the size of the shadow stack derived from it?
>
> > shadow stack only contains return addresses so it is proportional
> > to the number of stack frames, not the stack size and it must
> > account for sigaltstack too, not just the thread stack.
>
> > if you make minimal assumptions about stack usage and ignore the
> > sigaltstack issue then the worst case shadow stack requirement
> > is indeed proportional to the stack_size, but this upper bound
> > can be pessimistic and userspace knows the tradeoffs better.
>
> It's also worth pointing out here that the existing shadow stack support
> for x86 and in review code for arm64 make exactly these assumptions and
> guesses at a shadow stack size based on the stack_size for the thread.
Ok.
> There's just been a general lack of enthusiasm for the fact that due to
> the need to store variables on the normal stack the resulting shadow
> stack is very likely to be substantially overallocated but we can't
> safely reduce the size without information from userspace.
Ok.
>
> > > And my current main objection is that shadow stacks were just released
> > > to userspace. There can't be a massive amount of users yet - outside of
> > > maybe early adopters.
>
> > no upstream libc has code to enable shadow stacks at this point
> > so there are exactly 0 users in the open. (this feature requires
> > runtime support)
>
> > the change is expected to allow wider deployability. (e.g. not
> > just in glibc)
>
> Right, and the lack of any userspace control of the shadow stack size
> has been a review concern with the arm64 GCS series which I'm trying to
> address here. The main concern is that userspaces that start a lot of
> threads are going to start using a lot more address space than they need
> to when shadow stacks are enabled. Given the fairly long deployment
> pipeline from extending a syscall to end users who might be using the
> feature in conjuction with imposing resource limits it does seem like a
> reasonable problem to anticipate.
Ok, I can see that argument.
>
> > > The fact that there are other architectures that bring in a similar
> > > feature makes me even more hesitant. If they have all agreed _and_
> > > implemented shadow stacks and have unified semantics then we can
> > > consider exposing control knobs to userspace that aren't implicitly
> > > architecture specific currently.
>
> To be clear the reason I'm working on this is that I've implemented the
> arm64 support, I don't even have any access to x86 systems that have the
Yes, I'm aware.
> feature (hence the RFT in the subject line) - Rick Edgecombe did the x86
> work. The arm64 code is still in review, the userspace interface is
> very similar to that for x86 and there doesn't seem to be any
> controversy there which makes me expect that a change is likely. Unlike
> x86 we only have a spec and virtual implementations at present, there's
> no immintent hardware, so we're taking our time with making sure that
> everything is working well. Deepak Gupta (in CC) has been reviewing the
> series from the point of view of RISC-V. I think we're all fairly well
> aligned on the requirements here.
I'm still not enthusiastic that we only have one implementation for this
in the kernel. What's the harm in waiting until the arm patches are
merged? This shouldn't result in chicken and egg: if the implementations
are sufficiently similar then we can do an appropriate clone3()
extension.
On Mon, Nov 20, 2023 at 11:54:30PM +0000, Mark Brown wrote:
> Unlike with the normal stack there is no API for configuring the the shadow
> stack for a new thread, instead the kernel will dynamically allocate a new
> shadow stack with the same size as the normal stack. This appears to be due
> to the shadow stack series having been in development since before the more
> extensible clone3() was added rather than anything more deliberate.
>
> Add a parameter to clone3() specifying the size of a shadow stack for
> the newly created process. If no shadow stack is specified then the
> existing implicit allocation behaviour is maintained.
>
> If the architecture does not support shadow stacks the shadow stack size
> parameter must be zero, architectures that do support the feature are
> expected to enforce the same requirement on individual systems that lack
> shadow stack support.
>
> Update the existing x86 implementation to pay attention to the newly added
> arguments, in order to maintain compatibility we use the existing behaviour
> if no shadow stack is specified. Minimal validation is done of the supplied
> parameters, detailed enforcement is left to when the thread is executed.
> Since we are now using more fields from the kernel_clone_args we pass that
> into the shadow stack code rather than individual fields.
>
> Signed-off-by: Mark Brown <[email protected]>
> ---
> arch/x86/include/asm/shstk.h | 11 ++++++---
> arch/x86/kernel/process.c | 2 +-
> arch/x86/kernel/shstk.c | 59 ++++++++++++++++++++++++++++++--------------
> include/linux/sched/task.h | 1 +
> include/uapi/linux/sched.h | 4 +++
> kernel/fork.c | 22 +++++++++++++++--
> 6 files changed, 74 insertions(+), 25 deletions(-)
>
> diff --git a/arch/x86/include/asm/shstk.h b/arch/x86/include/asm/shstk.h
> index 42fee8959df7..8be7b0a909c3 100644
> --- a/arch/x86/include/asm/shstk.h
> +++ b/arch/x86/include/asm/shstk.h
> @@ -6,6 +6,7 @@
> #include <linux/types.h>
>
> struct task_struct;
> +struct kernel_clone_args;
> struct ksignal;
>
> #ifdef CONFIG_X86_USER_SHADOW_STACK
> @@ -16,8 +17,8 @@ struct thread_shstk {
>
> long shstk_prctl(struct task_struct *task, int option, unsigned long arg2);
> void reset_thread_features(void);
> -unsigned long shstk_alloc_thread_stack(struct task_struct *p, unsigned long clone_flags,
> - unsigned long stack_size);
> +unsigned long shstk_alloc_thread_stack(struct task_struct *p,
> + const struct kernel_clone_args *args);
> void shstk_free(struct task_struct *p);
> int setup_signal_shadow_stack(struct ksignal *ksig);
> int restore_signal_shadow_stack(void);
> @@ -26,8 +27,10 @@ static inline long shstk_prctl(struct task_struct *task, int option,
> unsigned long arg2) { return -EINVAL; }
> static inline void reset_thread_features(void) {}
> static inline unsigned long shstk_alloc_thread_stack(struct task_struct *p,
> - unsigned long clone_flags,
> - unsigned long stack_size) { return 0; }
> + const struct kernel_clone_args *args)
> +{
> + return 0;
> +}
> static inline void shstk_free(struct task_struct *p) {}
> static inline int setup_signal_shadow_stack(struct ksignal *ksig) { return 0; }
> static inline int restore_signal_shadow_stack(void) { return 0; }
> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> index b6f4e8399fca..a9ca80ea5056 100644
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -207,7 +207,7 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
> * is disabled, new_ssp will remain 0, and fpu_clone() will know not to
> * update it.
> */
> - new_ssp = shstk_alloc_thread_stack(p, clone_flags, args->stack_size);
> + new_ssp = shstk_alloc_thread_stack(p, args);
> if (IS_ERR_VALUE(new_ssp))
> return PTR_ERR((void *)new_ssp);
>
> diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
> index 59e15dd8d0f8..a14f47d70dfb 100644
> --- a/arch/x86/kernel/shstk.c
> +++ b/arch/x86/kernel/shstk.c
> @@ -191,38 +191,61 @@ void reset_thread_features(void)
> current->thread.features_locked = 0;
> }
>
> -unsigned long shstk_alloc_thread_stack(struct task_struct *tsk, unsigned long clone_flags,
> - unsigned long stack_size)
> +unsigned long shstk_alloc_thread_stack(struct task_struct *tsk,
> + const struct kernel_clone_args *args)
> {
> struct thread_shstk *shstk = &tsk->thread.shstk;
> + unsigned long clone_flags = args->flags;
> unsigned long addr, size;
>
> /*
> * If shadow stack is not enabled on the new thread, skip any
> - * switch to a new shadow stack.
> + * implicit switch to a new shadow stack and reject attempts to
> +	 * explicitly specify one.
> */
> - if (!features_enabled(ARCH_SHSTK_SHSTK))
> - return 0;
> + if (!features_enabled(ARCH_SHSTK_SHSTK)) {
> + if (args->shadow_stack_size)
> + return (unsigned long)ERR_PTR(-EINVAL);
>
> - /*
> - * For CLONE_VFORK the child will share the parents shadow stack.
> - * Make sure to clear the internal tracking of the thread shadow
> - * stack so the freeing logic run for child knows to leave it alone.
> - */
> - if (clone_flags & CLONE_VFORK) {
> - shstk->base = 0;
> - shstk->size = 0;
> return 0;
> }
>
> /*
> - * For !CLONE_VM the child will use a copy of the parents shadow
> - * stack.
> + * If the user specified a shadow stack then do some basic
> + * validation and use it, otherwise fall back to a default
> + * shadow stack size if the clone_flags don't indicate an
> + * allocation is unneeded.
> */
> - if (!(clone_flags & CLONE_VM))
> - return 0;
> + if (args->shadow_stack_size) {
> + size = args->shadow_stack_size;
> +
> + if (size < 8)
> + return (unsigned long)ERR_PTR(-EINVAL);
It would probably be useful to add a
#define SHADOW_STACK_SIZE_MIN 8
instead of a raw number here.
Any reasonable maximum that should be assumed here? IOW, what happens if
userspace starts specifying 4G shadow_stack_size with each clone3() call
for lolz?
And I think we should move the shadow_stack_size validation into
clone3_shadow_stack_valid() instead of having each architecture do its
own thing in its own handler. IOW, share as much common code as
possible. Another reason to wait for that arm support to land...
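Roughly, a sketch of what that shared helper could look like (the define
name and exact checks are illustrative suggestions, not code from this
series):

	#define SHADOW_STACK_SIZE_MIN	8	/* room for one return address */

	static inline bool clone3_shadow_stack_valid(struct kernel_clone_args *kargs)
	{
		if (!kargs->shadow_stack_size)
			return true;

		/* The architecture still checks support on the specific machine. */
		if (!IS_ENABLED(CONFIG_ARCH_HAS_USER_SHADOW_STACK))
			return false;

		/* Common minimum size check instead of per-arch magic numbers. */
		return kargs->shadow_stack_size >= SHADOW_STACK_SIZE_MIN;
	}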
> + } else {
> + /*
> + * For CLONE_VFORK the child will share the parents
> + * shadow stack. Make sure to clear the internal
> + * tracking of the thread shadow stack so the freeing
> + * logic run for child knows to leave it alone.
> + */
> + if (clone_flags & CLONE_VFORK) {
> + shstk->base = 0;
> + shstk->size = 0;
> + return 0;
> + }
Why is the CLONE_VFORK handling only necessary if shadow_stack_size is
unset? In general, a comment or explanation on the interaction between
CLONE_VFORK and shadow_stack_size would be helpful.
> +
> + /*
> + * For !CLONE_VM the child will use a copy of the
> + * parents shadow stack.
> + */
> + if (!(clone_flags & CLONE_VM))
> + return 0;
> +
> + size = args->stack_size;
> +
> + }
>
> - size = adjust_shstk_size(stack_size);
> + size = adjust_shstk_size(size);
> addr = alloc_shstk(0, size, 0, false);
> if (IS_ERR_VALUE(addr))
> return addr;
> diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
> index a23af225c898..e86a09cfccd8 100644
> --- a/include/linux/sched/task.h
> +++ b/include/linux/sched/task.h
> @@ -41,6 +41,7 @@ struct kernel_clone_args {
> void *fn_arg;
> struct cgroup *cgrp;
> struct css_set *cset;
> + unsigned long shadow_stack_size;
> };
>
> /*
> diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
> index 3bac0a8ceab2..a998b6d0c897 100644
> --- a/include/uapi/linux/sched.h
> +++ b/include/uapi/linux/sched.h
> @@ -84,6 +84,8 @@
> * kernel's limit of nested PID namespaces.
> * @cgroup: If CLONE_INTO_CGROUP is specified set this to
> * a file descriptor for the cgroup.
> + * @shadow_stack_size: Specify the size of the shadow stack to allocate
> + * for the child process.
> *
> * The structure is versioned by size and thus extensible.
> * New struct members must go at the end of the struct and
> @@ -101,12 +103,14 @@ struct clone_args {
> __aligned_u64 set_tid;
> __aligned_u64 set_tid_size;
> __aligned_u64 cgroup;
> + __aligned_u64 shadow_stack_size;
> };
> #endif
>
> #define CLONE_ARGS_SIZE_VER0 64 /* sizeof first published struct */
> #define CLONE_ARGS_SIZE_VER1 80 /* sizeof second published struct */
> #define CLONE_ARGS_SIZE_VER2 88 /* sizeof third published struct */
> +#define CLONE_ARGS_SIZE_VER3 96 /* sizeof fourth published struct */
>
> /*
> * Scheduling policies
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 10917c3e1f03..b8ca8194bca5 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -3067,7 +3067,9 @@ noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs,
> CLONE_ARGS_SIZE_VER1);
> BUILD_BUG_ON(offsetofend(struct clone_args, cgroup) !=
> CLONE_ARGS_SIZE_VER2);
> - BUILD_BUG_ON(sizeof(struct clone_args) != CLONE_ARGS_SIZE_VER2);
> + BUILD_BUG_ON(offsetofend(struct clone_args, shadow_stack_size) !=
> + CLONE_ARGS_SIZE_VER3);
> + BUILD_BUG_ON(sizeof(struct clone_args) != CLONE_ARGS_SIZE_VER3);
>
> if (unlikely(usize > PAGE_SIZE))
> return -E2BIG;
> @@ -3110,6 +3112,7 @@ noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs,
> .tls = args.tls,
> .set_tid_size = args.set_tid_size,
> .cgroup = args.cgroup,
> + .shadow_stack_size = args.shadow_stack_size,
Mild personal OCD: Can you keep these all aligned, please?
> };
>
> if (args.set_tid &&
> @@ -3150,6 +3153,21 @@ static inline bool clone3_stack_valid(struct kernel_clone_args *kargs)
> return true;
> }
>
> +/**
> + * clone3_shadow_stack_valid - check and prepare shadow stack
> + * @kargs: kernel clone args
> + *
> + * Verify that shadow stacks are only enabled if supported.
> + */
> +static inline bool clone3_shadow_stack_valid(struct kernel_clone_args *kargs)
> +{
> + if (!kargs->shadow_stack_size)
> + return true;
> +
> + /* The architecture must check support on the specific machine */
> + return IS_ENABLED(CONFIG_ARCH_HAS_USER_SHADOW_STACK);
> +}
> +
> static bool clone3_args_valid(struct kernel_clone_args *kargs)
> {
> /* Verify that no unknown flags are passed along. */
> @@ -3172,7 +3190,7 @@ static bool clone3_args_valid(struct kernel_clone_args *kargs)
> kargs->exit_signal)
> return false;
>
> - if (!clone3_stack_valid(kargs))
> + if (!clone3_stack_valid(kargs) || !clone3_shadow_stack_valid(kargs))
> return false;
>
> return true;
>
> --
> 2.30.2
>
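For illustration only, a minimal userspace sketch of how the new field
might be used (this assumes uapi headers that already carry
shadow_stack_size; a real thread would need more CLONE_ flags and
careful handling of the new stack):

	#define _GNU_SOURCE
	#include <linux/sched.h>	/* struct clone_args, CLONE_VM */
	#include <linux/types.h>
	#include <stdint.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	static pid_t spawn_with_shadow_stack(void *stack, size_t stack_size)
	{
		struct clone_args args = {
			.flags             = CLONE_VM,
			.stack             = (__u64)(uintptr_t)stack,
			.stack_size        = stack_size,
			/*
			 * Only the size is specified; the kernel still
			 * allocates and places the shadow stack itself.
			 */
			.shadow_stack_size = 4096,
		};

		return syscall(__NR_clone3, &args, sizeof(args));
	}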
On Thu, Nov 23, 2023 at 11:10:24AM +0100, Christian Brauner wrote:
> On Tue, Nov 21, 2023 at 04:09:40PM +0000, Mark Brown wrote:
> > On Tue, Nov 21, 2023 at 12:21:37PM +0000, Szabolcs Nagy wrote:
> > > The 11/21/2023 11:17, Christian Brauner wrote:
> > > > (2) With what other interfaces is implicit allocation and deallocation
> > > > not consistent? I don't understand this argument. The kernel creates
> > > > a shadow stack as a security measure to store return addresses. It
> > > > seems to me exactly that the kernel should implicitly allocate and
> > > > deallocate the shadow stack and not have userspace muck around with
> > > > its size?
...
> > The inconsistency here is with the management of the standard stack -
> > with the standard stack userspace passes an already allocated address
> > range to the kernel. A constant tension during review of the shadow
> > stack interfaces has been that shadow stack memory is userspace memory
> > but the security constraints mean that we've come down on the side of
> > having a custom allocation syscall for it instead of using flags on
> > mmap() and friends like people often expect, and now having it allocated
> > as part of clone3(). The aim is to highlight that this difference is
> So you have two interfaces for allocating a shadow stack. The first one
> is to explicitly alloc a shadow stack via the map_shadow_stack(). The
> second one is an implicit allocation during clone3() and you want to
> allow explicitly influencing that.
Yes. Shadow stacks are also allocated when the initial call to enable
shadow stacks is done; the clone()/clone3() behaviour is to implicitly
allocate when a thread is created by a thread that itself has shadow
stacks enabled (so we avoid gaps in coverage).
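For reference, the initial opt-in looks roughly like this on x86 (the
arch_prctl() constants come from the upstream uapi asm/prctl.h; the
arm64 and RISC-V equivalents are still being defined):

	#include <asm/prctl.h>		/* ARCH_SHSTK_ENABLE, ARCH_SHSTK_SHSTK */
	#include <sys/syscall.h>
	#include <unistd.h>

	/*
	 * Enable shadow stacks for the calling thread; this is the point
	 * where the first shadow stack gets implicitly allocated.
	 */
	static int enable_shadow_stack(void)
	{
		return syscall(SYS_arch_prctl, ARCH_SHSTK_ENABLE, ARCH_SHSTK_SHSTK);
	}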
> > feature (hence the RFT in the subject line) - Rick Edgecombe did the x86
> > work. The arm64 code is still in review, the userspace interface is
> > very similar to that for x86 and there doesn't seem to be any
> > controversy there, which makes me expect that a change is unlikely. Unlike
> > x86 we only have a spec and virtual implementations at present, there's
> > no imminent hardware, so we're taking our time with making sure that
> > everything is working well. Deepak Gupta (in CC) has been reviewing the
> > series from the point of view of RISC-V. I think we're all fairly well
> > aligned on the requirements here.
> I'm still not enthusiastic that we only have one implementation for this
> in the kernel. What's the harm in waiting until the arm patches are
> merged? This shouldn't result in chicken and egg: if the implementations
> are sufficiently similar then we can do an appropriate clone3()
> extension.
The main thing would be that it would mean that people doing userspace
enablement based on the merged x86 support can't use the stack size
control. It's not the end of the world if that has to wait a bit, it's
a bit of a detail thing, but it would make life easier. I guess the
userspace people can let us know if it's starting to be a real hassle
and we can reevaluate if that happens.
It's also currently a dependency for the arm64 code so it'd be good to
at least get agreement that, assuming nothing comes up in testing, the
patches can go in along with the arm64 series; removing the dependency
and then adding it as an incremental thing would be a hassle. It's
likely that the arm64 series will be held out of tree for a while as
more complete userspace support is built up and validated so things
might be sitting for a while - we don't have hardware right now so we
can be cautious with the testing.
On Thu, Nov 23, 2023 at 11:28:47AM +0100, Christian Brauner wrote:
> On Mon, Nov 20, 2023 at 11:54:30PM +0000, Mark Brown wrote:
> Any reasonable maximum that should be assumed here? IOW, what happens if
> userspace starts specifying 4G shadow_stack_size with each clone3() call
> for lolz?
I guess we could impose RLIMIT_STACK?
> > + } else {
> > + /*
> > + * For CLONE_VFORK the child will share the parents
> > + * shadow stack. Make sure to clear the internal
> > + * tracking of the thread shadow stack so the freeing
> > + * logic run for child knows to leave it alone.
> > + */
> > + if (clone_flags & CLONE_VFORK) {
> > + shstk->base = 0;
> > + shstk->size = 0;
> > + return 0;
> > + }
> Why is the CLONE_VFORK handling only necessary if shadow_stack_size is
> unset? In general, a comment or explanation on the interaction between
> CLONE_VFORK and shadow_stack_size would be helpful.
This is the existing implicit behaviour that clone() has, it's the current
ABI for x86. The intent is that if the user has explicitly configured a
shadow stack then we just do whatever they asked us to do; if they
didn't we try to guess something sensible. The comment at the top of
this block, where we check if shadow_stack_size is set, is intended
to capture this requirement:
/*
* If the user specified a shadow stack then do some basic
* validation and use it, otherwise fall back to a default
* shadow stack size if the clone_flags don't indicate an
* allocation is unneeded.
*/
if (args->shadow_stack_size) {
size = args->shadow_stack_size;
if (size < 8)
return (unsigned long)ERR_PTR(-EINVAL);
} else {
I'm not immediately seeing anywhere else to add something?
On Thu, Nov 23, 2023 at 11:37:54AM +0000, Mark Brown wrote:
> On Thu, Nov 23, 2023 at 11:10:24AM +0100, Christian Brauner wrote:
> > On Tue, Nov 21, 2023 at 04:09:40PM +0000, Mark Brown wrote:
> > > On Tue, Nov 21, 2023 at 12:21:37PM +0000, Szabolcs Nagy wrote:
> > > > The 11/21/2023 11:17, Christian Brauner wrote:
>
> > I'm still not enthusiastic that we only have one implementation for this
> > in the kernel. What's the harm in waiting until the arm patches are
> > merged? This shouldn't result in chicken and egg: if the implementations
> > are sufficiently similar then we can do an appropriate clone3()
> > extension.
>
> The main thing would be that it would mean that people doing userspace
> enablement based on the merged x86 support can't use the stack size
> control. It's not the end of the world if that has to wait a bit, it's
> a bit of a detail thing, but it would make life easier. I guess the
> userspace people can let us know if it's starting to be a real hassle
> and we can reevaluate if that happens.
>
> It's also currently a dependency for the arm64 code so it'd be good to
> at least get agreement that, assuming nothing comes up in testing, the
> patches can go in along with the arm64 series; removing the dependency
Oh yeah, I'm not fussed about whose tree this goes through. By all means, take
it with the arm64 series.
> and then adding it as an incremental thing would be a hassle. It's
> likely that the arm64 series will be held out of tree for a while as
> more complete userspace support is built up and validated so things
> might be sitting for a while - we don't have hardware right now so we
> can be cautious with the testing.
Ok, that sounds good to me.
On Thu, Nov 23, 2023 at 12:17:19PM +0000, Mark Brown wrote:
> On Thu, Nov 23, 2023 at 11:28:47AM +0100, Christian Brauner wrote:
> > On Mon, Nov 20, 2023 at 11:54:30PM +0000, Mark Brown wrote:
>
> > Any reasonable maximum that should be assumed here? IOW, what happens if
> > userspace starts specifying 4G shadow_stack_size with each clone3() call
> > for lolz?
>
> I guess we could impose RLIMIT_STACK?
Yeah, that also seems to be what acct_stack_growth() is using.
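As a purely illustrative sketch, the cap could sit next to the existing
minimum size check in the allocation path, something like:

	if (args->shadow_stack_size) {
		size = args->shadow_stack_size;

		/*
		 * Hypothetical bounds: at least one return address and no
		 * more than the normal stack rlimit, mirroring what
		 * acct_stack_growth() enforces for ordinary stacks.
		 */
		if (size < 8 || size > rlimit(RLIMIT_STACK))
			return (unsigned long)ERR_PTR(-EINVAL);
	}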
>
> > > + } else {
> > > + /*
> > > + * For CLONE_VFORK the child will share the parents
> > > + * shadow stack. Make sure to clear the internal
> > > + * tracking of the thread shadow stack so the freeing
> > > + * logic run for child knows to leave it alone.
> > > + */
> > > + if (clone_flags & CLONE_VFORK) {
> > > + shstk->base = 0;
> > > + shstk->size = 0;
> > > + return 0;
> > > + }
>
> > Why is the CLONE_VFORK handling only necessary if shadow_stack_size is
> > unset? In general, a comment or explanation on the interaction between
> > CLONE_VFORK and shadow_stack_size would be helpful.
>
> This is the existing implicit behaviour that clone() has, it's the current
> ABI for x86. The intent is that if the user has explicitly configured a
> shadow stack then we just do whatever they asked us to do, if they
So what I'm asking is: if the calling process is suspended until the
child exits or exec's, does it make sense for the child to even get a
shadow stack? I don't know the answer which is why I'm asking.
On Thu, Nov 23, 2023 at 05:33:05PM +0100, Christian Brauner wrote:
> On Thu, Nov 23, 2023 at 12:17:19PM +0000, Mark Brown wrote:
> > > > + if (clone_flags & CLONE_VFORK) {
> > > > + shstk->base = 0;
> > > > + shstk->size = 0;
> > > > + return 0;
> > > > + }
> > > Why is the CLONE_VFORK handling only necessary if shadow_stack_size is
> > > unset? In general, a comment or explanation on the interaction between
> > > CLONE_VFORK and shadow_stack_size would be helpful.
> > This is the existing implicit behaviour that clone() has, it's current
> > ABI for x86. The intent is that if the user has explicitly configured a
> > shadow stack then we just do whatever they asked us to do, if they
> So what I'm asking is: if the calling process is suspended until the
> child exits or exec's, does it make sense for the child to even get a
> shadow stack? I don't know the answer which is why I'm asking.
We were initially doing some suppression of stack creation based on the
flags but, after prior discussion, we decided it wasn't worth it.
There was some question about corner cases (IIRC the main one was
posix_spawn()), but generally the thinking here was that since userspace
explicitly asked for the shadow stack, in the worst case it'll just be
inefficient and userspace can fix things by just not doing that. If we
just create the shadow stack whenever it's requested then it makes the
kernel side handling really simple to implement/verify and we don't have
to worry about having missed any use cases with combinations of flags
that we've not anticipated.
On Mon, Nov 20, 2023 at 11:54:29PM +0000, Mark Brown wrote:
> Since multiple architectures have support for shadow stacks and we need to
> select support for this feature in several places in the generic code,
> provide a generic config option that the architectures can select.
>
> Suggested-by: David Hildenbrand <[email protected]>
> Acked-by: David Hildenbrand <[email protected]>
> Signed-off-by: Mark Brown <[email protected]>
> ---
> arch/x86/Kconfig | 1 +
> fs/proc/task_mmu.c | 2 +-
> include/linux/mm.h | 2 +-
> mm/Kconfig | 6 ++++++
> 4 files changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 3762f41bb092..14b7703a9a2b 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1952,6 +1952,7 @@ config X86_USER_SHADOW_STACK
> depends on AS_WRUSS
> depends on X86_64
> select ARCH_USES_HIGH_VMA_FLAGS
> + select ARCH_HAS_USER_SHADOW_STACK
> select X86_CET
> help
> Shadow stack protection is a hardware feature that detects function
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index ef2eb12906da..f0a904aeee8e 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -699,7 +699,7 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
> #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR
> [ilog2(VM_UFFD_MINOR)] = "ui",
> #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_MINOR */
> -#ifdef CONFIG_X86_USER_SHADOW_STACK
> +#ifdef CONFIG_ARCH_HAS_USER_SHADOW_STACK
> [ilog2(VM_SHADOW_STACK)] = "ss",
> #endif
> };
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 418d26608ece..10462f354614 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -341,7 +341,7 @@ extern unsigned int kobjsize(const void *objp);
> #endif
> #endif /* CONFIG_ARCH_HAS_PKEYS */
>
> -#ifdef CONFIG_X86_USER_SHADOW_STACK
> +#ifdef CONFIG_ARCH_HAS_USER_SHADOW_STACK
> /*
> * VM_SHADOW_STACK should not be set with VM_SHARED because of lack of
> * support core mm.
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 89971a894b60..6713bb3b0b48 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -1270,6 +1270,12 @@ config LOCK_MM_AND_FIND_VMA
> bool
> depends on !STACK_GROWSUP
>
> +config ARCH_HAS_USER_SHADOW_STACK
> + bool
> + help
> + The architecture has hardware support for userspace shadow call
> + stacks (eg, x86 CET, arm64 GCS or RISC-V Zicfiss).
The whitespace looks suspicious, I think there should be a leading tab.
Otherwise
Reviewed-by: Mike Rapoport (IBM) <[email protected]>
> +
> source "mm/damon/Kconfig"
>
> endmenu
>
> --
> 2.30.2
>
--
Sincerely yours,
Mike.