Received: by 2002:a05:6a10:9848:0:0:0:0 with SMTP id x8csp381666pxf; Thu, 8 Apr 2021 05:16:10 -0700 (PDT) X-Google-Smtp-Source: ABdhPJxx+BuiNeulLTWmJb58FZuMc+Fa9CLlwQJgchHl3G/5STu7nlYeyzz8+sJcUy9WEuzODEPv X-Received: by 2002:a17:906:f953:: with SMTP id ld19mr9610897ejb.164.1617884169925; Thu, 08 Apr 2021 05:16:09 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1617884169; cv=none; d=google.com; s=arc-20160816; b=ZKU2GHqjXGFOxdQSEzDeKrLPO7Lhyt5M4JEsh+BNupKHP5hnRcDnjBUGMuqbB6aaid as/jO2hR07Ujo00VDTLZYqAvW1eeCmB1LAi2lIzF8nAEdXPq4YHl5r7DZwDJof+YV7mz oNPCmMWohLWYJ4I2HwbkHljwFBvfUxrKw9uG4MzS/ZbwFIWn45B2S9n2f7gmfMajWUsA 8qbPKn4NokIGoREiubcvBOc5TJTgqCvZCsPMKW08R0Xvu9zSRjR0elKXNmQZtrhMLVzU Ev5oI5osx2wbL2o3pSw/4vmWxMFJInsfY6Is3fYxiSzDg3vkQilticoRpcJGA2CI48mO 4chA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:robot-unsubscribe :robot-id:message-id:mime-version:references:in-reply-to:cc:subject :to:reply-to:sender:from:dkim-signature:dkim-signature:date; bh=u7pXEz57qcpWFdMZYiY0MaEsFqi2SXg5q0Nt3yPmKX0=; b=sv2yZo6N72WzatL95zaUCPnRetGEq/dvAj/64XK4A/+PattfEdg9xcxjFxfGwzpNbx 4t4SmxRVs1S7QkNDn3IqruOQrNzuNBx5cxsRBJsUDuCScN+JkZWUnDBM2rNT9FlLuRCz UEpz9D4iQBdpt1892XHe4GrQjVRjKWpDGO4PLhNPvmL/mP6Db6Bt1M8nSrp30pClv0HF eiLDW0zX0ONeajnlOwSJkGHPMvZzFpqoivkIvd/wXyB0ZqixLK6Rw618y3us47cIR7zC wMCWl5cEX2zvVH8ciH/zSinZWGDNG7Jg9OKjcHMeapaftcednVXn+8l+pEPp20ynPAB9 Ay5g== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@linutronix.de header.s=2020 header.b="wxXi/NqA"; dkim=neutral (no key) header.i=@linutronix.de header.s=2020e header.b=ErzKGUKN; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=linutronix.de Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id fx16si20904462ejb.342.2021.04.08.05.15.46; Thu, 08 Apr 2021 05:16:09 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; dkim=pass header.i=@linutronix.de header.s=2020 header.b="wxXi/NqA"; dkim=neutral (no key) header.i=@linutronix.de header.s=2020e header.b=ErzKGUKN; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=linutronix.de Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231567AbhDHMOB (ORCPT + 99 others); Thu, 8 Apr 2021 08:14:01 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:50192 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231484AbhDHMNz (ORCPT ); Thu, 8 Apr 2021 08:13:55 -0400 Received: from galois.linutronix.de (Galois.linutronix.de [IPv6:2a0a:51c0:0:12e:550::1]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 6F3ABC061763; Thu, 8 Apr 2021 05:13:44 -0700 (PDT) Date: Thu, 08 Apr 2021 12:13:41 -0000 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020; t=1617884021; h=from:from:sender:sender:reply-to:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=u7pXEz57qcpWFdMZYiY0MaEsFqi2SXg5q0Nt3yPmKX0=; b=wxXi/NqAC/kHNkZ+CW5J1KEAYD/96o9lBiR0xdCm5oDf6OcH+GGHmhbCjzzSYlEb8umLOe UQWxaoyy/TssleHjuMhiEmsafosz8tZ8SmxLna09bAa86gHmiZY2WEvgKGR/Fusz4kbCx5 +nC+e8gm9tK2y3B70yjTQujB0BuXTHjlhhfCLfjJxTnggdSOomC/XSHQJda6/p05phMhiK vioQAA1aDi193hvWhH+6AxzGEE/qRdH2uJXtQLNuYp5iq0SI4ADqpKsZ8H11/ZVeQuh2dn WdeLRz2ATqFG+r3Puc7PKUnOHYpBsHqjW1nCx87ogYzIbCNbVFn8UN+MdKuOiQ== DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020e; t=1617884021; h=from:from:sender:sender:reply-to:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=u7pXEz57qcpWFdMZYiY0MaEsFqi2SXg5q0Nt3yPmKX0=; b=ErzKGUKNC+cwMtvbNNYUd1rCm0AGWNnT4NpiGGGKCZYZGuo3HXeldkPx0jFG61Gi/Nnihe UYU9e/DewJotzwAQ== From: "tip-bot2 for Kees Cook" Sender: tip-bot2@linutronix.de Reply-to: linux-kernel@vger.kernel.org To: linux-tip-commits@vger.kernel.org Subject: [tip: x86/entry] stack: Optionally randomize kernel stack offset each syscall Cc: Elena Reshetova , Kees Cook , Thomas Gleixner , x86@kernel.org, linux-kernel@vger.kernel.org In-Reply-To: <20210401232347.2791257-4-keescook@chromium.org> References: <20210401232347.2791257-4-keescook@chromium.org> MIME-Version: 1.0 Message-ID: <161788402128.29796.7811100408931498231.tip-bot2@tip-bot2> Robot-ID: Robot-Unsubscribe: Contact to get blacklisted from these emails Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org The following commit has been merged into the x86/entry branch of tip: Commit-ID: 39218ff4c625dbf2e68224024fe0acaa60bcd51a Gitweb: https://git.kernel.org/tip/39218ff4c625dbf2e68224024fe0acaa60bcd51a Author: Kees Cook AuthorDate: Thu, 01 Apr 2021 16:23:44 -07:00 Committer: Thomas Gleixner CommitterDate: Thu, 08 Apr 2021 14:05:19 +02:00 stack: Optionally randomize kernel stack offset each syscall This provides the ability for architectures to enable kernel stack base address offset randomization. This feature is controlled by the boot param "randomize_kstack_offset=on/off", with its default value set by CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT. This feature is based on the original idea from the last public release of PaX's RANDKSTACK feature: https://pax.grsecurity.net/docs/randkstack.txt All the credit for the original idea goes to the PaX team. Note that the design and implementation of this upstream randomize_kstack_offset feature differs greatly from the RANDKSTACK feature (see below). Reasoning for the feature: This feature aims to make harder the various stack-based attacks that rely on deterministic stack structure. We have had many such attacks in past (just to name few): https://jon.oberheide.org/files/infiltrate12-thestackisback.pdf https://jon.oberheide.org/files/stackjacking-infiltrate11.pdf https://googleprojectzero.blogspot.com/2016/06/exploiting-recursion-in-linux-kernel_20.html As Linux kernel stack protections have been constantly improving (vmap-based stack allocation with guard pages, removal of thread_info, STACKLEAK), attackers have had to find new ways for their exploits to work. They have done so, continuing to rely on the kernel's stack determinism, in situations where VMAP_STACK and THREAD_INFO_IN_TASK_STRUCT were not relevant. For example, the following recent attacks would have been hampered if the stack offset was non-deterministic between syscalls: https://repositorio-aberto.up.pt/bitstream/10216/125357/2/374717.pdf (page 70: targeting the pt_regs copy with linear stack overflow) https://a13xp0p0v.github.io/2020/02/15/CVE-2019-18683.html (leaked stack address from one syscall as a target during next syscall) The main idea is that since the stack offset is randomized on each system call, it is harder for an attack to reliably land in any particular place on the thread stack, even with address exposures, as the stack base will change on the next syscall. Also, since randomization is performed after placing pt_regs, the ptrace-based approach[1] to discover the randomized offset during a long-running syscall should not be possible. Design description: During most of the kernel's execution, it runs on the "thread stack", which is pretty deterministic in its structure: it is fixed in size, and on every entry from userspace to kernel on a syscall the thread stack starts construction from an address fetched from the per-cpu cpu_current_top_of_stack variable. The first element to be pushed to the thread stack is the pt_regs struct that stores all required CPU registers and syscall parameters. Finally the specific syscall function is called, with the stack being used as the kernel executes the resulting request. The goal of randomize_kstack_offset feature is to add a random offset after the pt_regs has been pushed to the stack and before the rest of the thread stack is used during the syscall processing, and to change it every time a process issues a syscall. The source of randomness is currently architecture-defined (but x86 is using the low byte of rdtsc()). Future improvements for different entropy sources is possible, but out of scope for this patch. Further more, to add more unpredictability, new offsets are chosen at the end of syscalls (the timing of which should be less easy to measure from userspace than at syscall entry time), and stored in a per-CPU variable, so that the life of the value does not stay explicitly tied to a single task. As suggested by Andy Lutomirski, the offset is added using alloca() and an empty asm() statement with an output constraint, since it avoids changes to assembly syscall entry code, to the unwinder, and provides correct stack alignment as defined by the compiler. In order to make this available by default with zero performance impact for those that don't want it, it is boot-time selectable with static branches. This way, if the overhead is not wanted, it can just be left turned off with no performance impact. The generated assembly for x86_64 with GCC looks like this: ... ffffffff81003977: 65 8b 05 02 ea 00 7f mov %gs:0x7f00ea02(%rip),%eax # 12380 ffffffff8100397e: 25 ff 03 00 00 and $0x3ff,%eax ffffffff81003983: 48 83 c0 0f add $0xf,%rax ffffffff81003987: 25 f8 07 00 00 and $0x7f8,%eax ffffffff8100398c: 48 29 c4 sub %rax,%rsp ffffffff8100398f: 48 8d 44 24 0f lea 0xf(%rsp),%rax ffffffff81003994: 48 83 e0 f0 and $0xfffffffffffffff0,%rax ... As a result of the above stack alignment, this patch introduces about 5 bits of randomness after pt_regs is spilled to the thread stack on x86_64, and 6 bits on x86_32 (since its has 1 fewer bit required for stack alignment). The amount of entropy could be adjusted based on how much of the stack space we wish to trade for security. My measure of syscall performance overhead (on x86_64): lmbench: /usr/lib/lmbench/bin/x86_64-linux-gnu/lat_syscall -N 10000 null randomize_kstack_offset=y Simple syscall: 0.7082 microseconds randomize_kstack_offset=n Simple syscall: 0.7016 microseconds So, roughly 0.9% overhead growth for a no-op syscall, which is very manageable. And for people that don't want this, it's off by default. There are two gotchas with using the alloca() trick. First, compilers that have Stack Clash protection (-fstack-clash-protection) enabled by default (e.g. Ubuntu[3]) add pagesize stack probes to any dynamic stack allocations. While the randomization offset is always less than a page, the resulting assembly would still contain (unreachable!) probing routines, bloating the resulting assembly. To avoid this, -fno-stack-clash-protection is unconditionally added to the kernel Makefile since this is the only dynamic stack allocation in the kernel (now that VLAs have been removed) and it is provably safe from Stack Clash style attacks. The second gotcha with alloca() is a negative interaction with -fstack-protector*, in that it sees the alloca() as an array allocation, which triggers the unconditional addition of the stack canary function pre/post-amble which slows down syscalls regardless of the static branch. In order to avoid adding this unneeded check and its associated performance impact, architectures need to carefully remove uses of -fstack-protector-strong (or -fstack-protector) in the compilation units that use the add_random_kstack() macro and to audit the resulting stack mitigation coverage (to make sure no desired coverage disappears). No change is visible for this on x86 because the stack protector is already unconditionally disabled for the compilation unit, but the change is required on arm64. There is, unfortunately, no attribute that can be used to disable stack protector for specific functions. Comparison to PaX RANDKSTACK feature: The RANDKSTACK feature randomizes the location of the stack start (cpu_current_top_of_stack), i.e. including the location of pt_regs structure itself on the stack. Initially this patch followed the same approach, but during the recent discussions[2], it has been determined to be of a little value since, if ptrace functionality is available for an attacker, they can use PTRACE_PEEKUSR/PTRACE_POKEUSR to read/write different offsets in the pt_regs struct, observe the cache behavior of the pt_regs accesses, and figure out the random stack offset. Another difference is that the random offset is stored in a per-cpu variable, rather than having it be per-thread. As a result, these implementations differ a fair bit in their implementation details and results, though obviously the intent is similar. [1] https://lore.kernel.org/kernel-hardening/2236FBA76BA1254E88B949DDB74E612BA4BC57C1@IRSMSX102.ger.corp.intel.com/ [2] https://lore.kernel.org/kernel-hardening/20190329081358.30497-1-elena.reshetova@intel.com/ [3] https://lists.ubuntu.com/archives/ubuntu-devel/2019-June/040741.html Co-developed-by: Elena Reshetova Signed-off-by: Elena Reshetova Signed-off-by: Kees Cook Signed-off-by: Thomas Gleixner Reviewed-by: Thomas Gleixner Link: https://lore.kernel.org/r/20210401232347.2791257-4-keescook@chromium.org --- Documentation/admin-guide/kernel-parameters.txt | 11 +++- Makefile | 4 +- arch/Kconfig | 23 +++++++- include/linux/randomize_kstack.h | 54 ++++++++++++++++- init/main.c | 23 +++++++- 5 files changed, 115 insertions(+) create mode 100644 include/linux/randomize_kstack.h diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 0454572..bee8644 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -4061,6 +4061,17 @@ fully seed the kernel's CRNG. Default is controlled by CONFIG_RANDOM_TRUST_CPU. + randomize_kstack_offset= + [KNL] Enable or disable kernel stack offset + randomization, which provides roughly 5 bits of + entropy, frustrating memory corruption attacks + that depend on stack address determinism or + cross-syscall address exposures. This is only + available on architectures that have defined + CONFIG_HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET. + Format: (1/Y/y=enable, 0/N/n=disable) + Default is CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT. + ras=option[,option,...] [KNL] RAS-specific options cec_disable [X86] diff --git a/Makefile b/Makefile index cc77fd4..d3bf503 100644 --- a/Makefile +++ b/Makefile @@ -813,6 +813,10 @@ KBUILD_CFLAGS += -ftrivial-auto-var-init=zero KBUILD_CFLAGS += -enable-trivial-auto-var-init-zero-knowing-it-will-be-removed-from-clang endif +# While VLAs have been removed, GCC produces unreachable stack probes +# for the randomize_kstack_offset feature. Disable it for all compilers. +KBUILD_CFLAGS += $(call cc-option, -fno-stack-clash-protection) + DEBUG_CFLAGS := # Workaround for GCC versions < 5.0 diff --git a/arch/Kconfig b/arch/Kconfig index ecfd352..6b11c82 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -1054,6 +1054,29 @@ config VMAP_STACK backing virtual mappings with real shadow memory, and KASAN_VMALLOC must be enabled. +config HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET + def_bool n + help + An arch should select this symbol if it can support kernel stack + offset randomization with calls to add_random_kstack_offset() + during syscall entry and choose_random_kstack_offset() during + syscall exit. Careful removal of -fstack-protector-strong and + -fstack-protector should also be applied to the entry code and + closely examined, as the artificial stack bump looks like an array + to the compiler, so it will attempt to add canary checks regardless + of the static branch state. + +config RANDOMIZE_KSTACK_OFFSET_DEFAULT + bool "Randomize kernel stack offset on syscall entry" + depends on HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET + help + The kernel stack offset can be randomized (after pt_regs) by + roughly 5 bits of entropy, frustrating memory corruption + attacks that depend on stack address determinism or + cross-syscall address exposures. This feature is controlled + by kernel boot param "randomize_kstack_offset=on/off", and this + config chooses the default boot state. + config ARCH_OPTIONAL_KERNEL_RWX def_bool n diff --git a/include/linux/randomize_kstack.h b/include/linux/randomize_kstack.h new file mode 100644 index 0000000..fd80fab --- /dev/null +++ b/include/linux/randomize_kstack.h @@ -0,0 +1,54 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +#ifndef _LINUX_RANDOMIZE_KSTACK_H +#define _LINUX_RANDOMIZE_KSTACK_H + +#include +#include +#include + +DECLARE_STATIC_KEY_MAYBE(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT, + randomize_kstack_offset); +DECLARE_PER_CPU(u32, kstack_offset); + +/* + * Do not use this anywhere else in the kernel. This is used here because + * it provides an arch-agnostic way to grow the stack with correct + * alignment. Also, since this use is being explicitly masked to a max of + * 10 bits, stack-clash style attacks are unlikely. For more details see + * "VLAs" in Documentation/process/deprecated.rst + */ +void *__builtin_alloca(size_t size); +/* + * Use, at most, 10 bits of entropy. We explicitly cap this to keep the + * "VLA" from being unbounded (see above). 10 bits leaves enough room for + * per-arch offset masks to reduce entropy (by removing higher bits, since + * high entropy may overly constrain usable stack space), and for + * compiler/arch-specific stack alignment to remove the lower bits. + */ +#define KSTACK_OFFSET_MAX(x) ((x) & 0x3FF) + +/* + * These macros must be used during syscall entry when interrupts and + * preempt are disabled, and after user registers have been stored to + * the stack. + */ +#define add_random_kstack_offset() do { \ + if (static_branch_maybe(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT, \ + &randomize_kstack_offset)) { \ + u32 offset = raw_cpu_read(kstack_offset); \ + u8 *ptr = __builtin_alloca(KSTACK_OFFSET_MAX(offset)); \ + /* Keep allocation even after "ptr" loses scope. */ \ + asm volatile("" : "=o"(*ptr) :: "memory"); \ + } \ +} while (0) + +#define choose_random_kstack_offset(rand) do { \ + if (static_branch_maybe(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT, \ + &randomize_kstack_offset)) { \ + u32 offset = raw_cpu_read(kstack_offset); \ + offset ^= (rand); \ + raw_cpu_write(kstack_offset, offset); \ + } \ +} while (0) + +#endif diff --git a/init/main.c b/init/main.c index 53b2788..f498aac 100644 --- a/init/main.c +++ b/init/main.c @@ -844,6 +844,29 @@ static void __init mm_init(void) pti_init(); } +#ifdef CONFIG_HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET +DEFINE_STATIC_KEY_MAYBE_RO(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT, + randomize_kstack_offset); +DEFINE_PER_CPU(u32, kstack_offset); + +static int __init early_randomize_kstack_offset(char *buf) +{ + int ret; + bool bool_result; + + ret = kstrtobool(buf, &bool_result); + if (ret) + return ret; + + if (bool_result) + static_branch_enable(&randomize_kstack_offset); + else + static_branch_disable(&randomize_kstack_offset); + return 0; +} +early_param("randomize_kstack_offset", early_randomize_kstack_offset); +#endif + void __init __weak arch_call_rest_init(void) { rest_init();