From: Paul Turner
Date: Mon, 8 Jan 2018 02:42:13 -0800
Subject: Re: [PATCH v6 00/10] Retpoline: Avoid speculative indirect calls in kernel
To: David Woodhouse
Cc: Andi Kleen, LKML, Linus Torvalds, Greg Kroah-Hartman, Tim Chen, Dave Hansen, Thomas Gleixner, Kees Cook, Rik van Riel, Peter Zijlstra, Andy Lutomirski, Jiri Kosina, One Thousand Gnomes

[ First send did not make the list because gmail dropped its plain-text forcing when I pasted content. ]

One detail that is missing is that we still need RSB refill in some cases.
This is not because the retpoline sequence itself will underflow (it is actually guaranteed not to, since it consumes only RSB entries that it generates), but rather to avoid poisoning of the RSB entries themselves, or to avoid the hardware turning to alternate predictors on RSB underflow.

Enumerating the cases we care about:

user->kernel in the absence of SMEP:
In the absence of SMEP, we must worry about user-generated RSB entries being consumable by kernel execution.
Generally speaking, for synchronous execution this will not occur (e.g. syscall, interrupt); however, one important case remains.
When we context switch between two threads, we should flush the RSB so that execution generated from the unbalanced return path on the thread that we just scheduled into cannot consume RSB entries potentially installed by the prior thread.

kernel->kernel, independent of SMEP:
While much harder to coordinate, facilities such as eBPF potentially allow exploitable return targets to be created.
Generally speaking (particularly if eBPF has been disabled) the risk is _much_ lower here, since we can only return into kernel execution that was already occurring on another thread (which could likely be attacked directly there, independent of RSB poisoning).

guest->hypervisor, independent of SMEP:
For guest ring0 -> host ring0 transitions, it is possible that the RSB tagging records only that an entry was generated in a ring0 context, meaning that a guest-generated entry may be consumed by the host. This admits:

hypervisor_run_vcpu_implementation() {
  … run virtualized work                                                 (1)
  < update vmcs state, prior to any function return >                    (2)
  < return from hypervisor_run_vcpu_implementation() to handle VMEXIT >  (3)
}

A guest may craft poisoned entries at (1) which, if not flushed at (2), are immediately eligible for consumption at (3).
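
To make the window concrete, here is a rough C sketch of the flow above. It is illustrative only, not KVM's actual code: struct vcpu, run_vcpu_hw(), save_vmcs_state() and fill_rsb() are hypothetical names. It only shows where an RSB refill would sit so that entries planted at (1) cannot be consumed at (3).

  /* Illustrative sketch only; every name below is hypothetical, not KVM's API. */
  struct vcpu;                                 /* opaque guest-CPU state            */
  extern int  run_vcpu_hw(struct vcpu *v);     /* VMLAUNCH/VMRESUME wrapper         */
  extern void save_vmcs_state(struct vcpu *v); /* read exit information out of VMCS */
  extern void fill_rsb(void);                  /* overwrite all 16 RSB entries      */

  int hypervisor_run_vcpu_implementation(struct vcpu *v)
  {
          int exit_reason;

          exit_reason = run_vcpu_hw(v);  /* (1) guest runs; may plant RSB entries    */
          save_vmcs_state(v);            /* (2) balanced call/ret, consumes only the */
                                         /*     entry its own call pushed            */
          fill_rsb();                    /*     refill before any unbalanced return  */
          return exit_reason;            /* (3) this return can no longer consume a  */
                                         /*     guest-controlled entry               */
  }

A self-contained sketch of the fill_rsb() refill itself follows the underflow discussion below.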

The cases above involve the crafting and use of poisoned entries. Recall also that one of the initial conditions was that we should avoid RSB underflow, as some CPUs may try to use other indirect predictors when this occurs.

The cases we care about here are:
- When we return _into_ protected execution. For the kernel, this means when we exit interrupt context into kernel context, since we may have emptied or reduced the number of RSB entries while in interrupt context.
- Context switch (even if we are returning to user code, we still need to unwind the scheduler/triggering frames that preempted it previously; considering that detail, this is a subset of the above, but it is listed for completeness).
- On VMEXIT (it turns out we need to worry about both poisoned entries and no entries; the solution is a single refill nonetheless).
- Leaving deeper (>C1) c-states, which may have flushed hardware state.
- Where we are unwinding call-chains of >16 entries[*]

[*] This is obviously the trickiest case. Fortunately, it is tough to exploit, since such call-chains are reasonably rare and the action must typically be predicted at a considerable distance from where current execution lies; both of these dramatically reduce the feasibility of an attack and lower its bit-rate (the number of ops per attempt is necessarily increased). For our systems, since we control the binary image, we can determine the affected call-chains through aggregate profiling of every machine in the fleet. I'm happy to provide those symbols, but they obviously fall short of complete coverage due to code differences. Generally, this is a level of paranoia no typical user will likely care about, and it only applies to a subset of CPUs.

A sequence for efficiently refilling the RSB is:

        mov     $8, %rax
        .align  16
  3:    call    4f
  3p:   pause
        call    3p
        .align  16
  4:    call    5f
  4p:   pause
        call    4p
        .align  16
  5:    dec     %rax
        jnz     3b
        add     $(16*8), %rsp

This implementation uses 8 loop iterations, with 2 calls per iteration; this is marginally faster than a single call per iteration. We did not observe useful benefit (particularly relative to text size) from further unrolling. It may also be usefully split into smaller (e.g. 4 or 8 call) segments that can be pipelined/intermixed with other operations. It includes retpoline-style traps so that if an entry is consumed, it cannot lead to controlled speculation. On my test system it took ~43 cycles on average. Note that non-zero-displacement calls should be used, as zero-displacement calls may be optimized to not interact with the RSB due to their use in fetching RIP for 32-bit relocations.
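
For reference, here is a self-contained user-space sketch of the refill sequence above, written as GNU C inline assembly. This is my adaptation for illustration, not the kernel's eventual macro: it uses numeric local labels and a pause/lfence/jmp trap body instead of the label style above, and fill_rsb()/main() are demo names. Eight loop iterations of two calls each leave all 16 RSB entries pointing at benign speculation traps, after which the 16 accumulated return addresses are popped off the stack.

  /*
   * rsb_fill_demo.c -- illustrative sketch only; not the kernel implementation.
   * Build with:  gcc -O2 -mno-red-zone rsb_fill_demo.c
   * (-mno-red-zone because the calls below briefly push data beneath %rsp.)
   */
  #include <stdio.h>

  static void fill_rsb(void)
  {
          unsigned long loops;

          asm volatile(
                  "       mov     $8, %[loops]    \n"
                  "       .align  16              \n"
                  "1:     call    2f              \n" /* push RSB entry -> trap at 11 */
                  "11:    pause                   \n" /* speculation trap             */
                  "       lfence                  \n"
                  "       jmp     11b             \n"
                  "       .align  16              \n"
                  "2:     call    3f              \n" /* push RSB entry -> trap at 22 */
                  "22:    pause                   \n" /* speculation trap             */
                  "       lfence                  \n"
                  "       jmp     22b             \n"
                  "       .align  16              \n"
                  "3:     dec     %[loops]        \n"
                  "       jnz     1b              \n" /* 8 iterations x 2 calls = 16  */
                  "       add     $(16*8), %%rsp  \n" /* drop the 16 return addresses */
                  : [loops] "=r" (loops)
                  :
                  : "cc", "memory");
  }

  int main(void)
  {
          fill_rsb();     /* the RSB now contains only the trap addresses pushed above */
          puts("RSB refilled");
          return 0;
  }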

On Sun, Jan 7, 2018 at 2:11 PM, David Woodhouse wrote:
>
> This is a mitigation for the 'variant 2' attack described in
>
> https://googleprojectzero.blogspot.com/2018/01/reading-privileged-memory-with-side.html
>
> Using GCC patches available from the hjl/indirect/gcc-7-branch/master
> branch of https://github.com/hjl-tools/gcc/commits/hjl and by manually
> patching assembler code, all vulnerable indirect branches (that occur
> after userspace first runs) are eliminated from the kernel.
>
> They are replaced with a 'retpoline' call sequence which deliberately
> prevents speculation.
>
> Fedora 27 packages of the updated compiler are available at
> https://koji.fedoraproject.org/koji/taskinfo?taskID=24065739
>
>
> v1: Initial post.
> v2: Add CONFIG_RETPOLINE to build kernel without it.
>     Change warning messages.
>     Hide modpost warning message
> v3: Update to the latest CET-capable retpoline version
>     Reinstate ALTERNATIVE support
> v4: Finish reconciling Andi's and my patch sets, bug fixes.
>     Exclude objtool support for now
>     Add 'noretpoline' boot option
>     Add AMD retpoline alternative
> v5: Silence MODVERSIONS warnings
>     Use pause;jmp loop instead of lfence;jmp
>     Switch to X86_FEATURE_RETPOLINE positive feature logic
>     Emit thunks inline from assembler macros
>     Merge AMD support into initial patch
> v6: Update to latest GCC patches with no dots in symbols
>     Fix MODVERSIONS properly(ish)
>     Fix typo breaking 32-bit, introduced in V5
>     Never set X86_FEATURE_RETPOLINE_AMD yet, pending confirmation
>
> Andi Kleen (3):
>   x86/retpoline/irq32: Convert assembler indirect jumps
>   x86/retpoline: Add boot time option to disable retpoline
>   x86/retpoline: Exclude objtool with retpoline
>
> David Woodhouse (7):
>   x86/retpoline: Add initial retpoline support
>   x86/retpoline/crypto: Convert crypto assembler indirect jumps
>   x86/retpoline/entry: Convert entry assembler indirect jumps
>   x86/retpoline/ftrace: Convert ftrace assembler indirect jumps
>   x86/retpoline/hyperv: Convert assembler indirect jumps
>   x86/retpoline/xen: Convert Xen hypercall indirect jumps
>   x86/retpoline/checksum32: Convert assembler indirect jumps
>
>  Documentation/admin-guide/kernel-parameters.txt | 3 +
>  arch/x86/Kconfig | 17 ++++-
>  arch/x86/Kconfig.debug | 6 +-
>  arch/x86/Makefile | 10 +++
>  arch/x86/crypto/aesni-intel_asm.S | 5 +-
>  arch/x86/crypto/camellia-aesni-avx-asm_64.S | 3 +-
>  arch/x86/crypto/camellia-aesni-avx2-asm_64.S | 3 +-
>  arch/x86/crypto/crc32c-pcl-intel-asm_64.S | 3 +-
>  arch/x86/entry/entry_32.S | 5 +-
>  arch/x86/entry/entry_64.S | 12 +++-
>  arch/x86/include/asm/asm-prototypes.h | 25 +++++++
>  arch/x86/include/asm/cpufeatures.h | 2 +
>  arch/x86/include/asm/mshyperv.h | 18 ++---
>  arch/x86/include/asm/nospec-branch.h | 92 +++++++++++++++++++++++++
>  arch/x86/include/asm/xen/hypercall.h | 5 +-
>  arch/x86/kernel/cpu/common.c | 3 +
>  arch/x86/kernel/cpu/intel.c | 11 +++
>  arch/x86/kernel/ftrace_32.S | 6 +-
>  arch/x86/kernel/ftrace_64.S | 8 +--
>  arch/x86/kernel/irq_32.c | 9 +--
>  arch/x86/lib/Makefile | 1 +
>  arch/x86/lib/checksum_32.S | 7 +
>  arch/x86/lib/retpoline.S | 48 +++++++++++++
>  23 files changed, 264 insertions(+), 38 deletions(-)
>  create mode 100644 arch/x86/include/asm/nospec-branch.h
>  create mode 100644 arch/x86/lib/retpoline.S
>
> --
> 2.7.4