Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753364AbbDBLO3 (ORCPT ); Thu, 2 Apr 2015 07:14:29 -0400 Received: from mail-wg0-f44.google.com ([74.125.82.44]:36380 "EHLO mail-wg0-f44.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751054AbbDBLO0 (ORCPT ); Thu, 2 Apr 2015 07:14:26 -0400 MIME-Version: 1.0 In-Reply-To: <20150402103735.GA21105@gmail.com> References: <9472f1ca4c19a38ecda45bba9c91b7168135fcfa.1427923514.git.luto@kernel.org> <20150402090744.GA26846@gmail.com> <551D14D3.1070907@redhat.com> <20150402103735.GA21105@gmail.com> Date: Thu, 2 Apr 2015 07:14:25 -0400 Message-ID: Subject: Re: [PATCH urgent v2] x86, asm: Disable opportunistic SYSRET if regs->flags has TF set From: Brian Gerst To: Ingo Molnar Cc: Denys Vlasenko , Andy Lutomirski , "the arch/x86 maintainers" , Linux Kernel Mailing List , Borislav Petkov , Linus Torvalds , Borislav Petkov Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 4931 Lines: 118 On Thu, Apr 2, 2015 at 6:37 AM, Ingo Molnar wrote: > > * Denys Vlasenko wrote: > >> On 04/02/2015 11:07 AM, Ingo Molnar wrote: >> > >> > * Andy Lutomirski wrote: >> > >> >> When I wrote the opportunistic SYSRET code, I missed an important >> >> difference between SYSRET and IRET. Both instructions are capable >> >> of setting EFLAGS.TF, but they behave differently when doing so. >> >> IRET will not issue a #DB trap after execution when it sets TF This >> >> is critical -- otherwise you'd never be able to make forward >> >> progress when returning to userspace. SYSRET, on the other hand, >> >> will trap with #DB immediately after returning to CPL3, and the next >> >> instruction will never execute. >> >> >> >> This breaks anything that opportunistically SYSRETs to a user >> >> context with TF set. For example, running this code with TF set and >> >> a SIGTRAP handler loaded never gets past post_nop. >> >> >> >> extern unsigned char post_nop[]; >> >> asm volatile ("pushfq\n\t" >> >> "popq %%r11\n\t" >> >> "nop\n\t" >> >> "post_nop:" >> >> : : "c" (post_nop) : "r11"); >> >> >> >> In my defense, I can't find this documented in the AMD or Intel >> >> manual. >> >> >> >> Fix it by using IRET to restore TF. >> >> >> >> Fixes: 2a23c6b8a9c4 x86_64, entry: Use sysret to return to userspace when possible >> >> Signed-off-by: Andy Lutomirski >> >> --- >> >> >> >> This affects 4.0-rc as well as -tip. A full test case lives here: >> >> >> >> https://git.kernel.org/cgit/linux/kernel/git/luto/misc-tests.git/ >> >> >> >> It's called single_step_syscall_64. >> >> >> >> On Intel systems, the 32-bit version of that test fails for unrelated >> >> reasons, but that's not a regression, and fixing it will be much more >> >> intrusive. >> >> >> >> Changes from v1: >> >> - Remove mention of testl from changelog. >> >> - Improve comment per Denys' suggestion. >> >> >> >> arch/x86/kernel/entry_64.S | 16 +++++++++++++++- >> >> 1 file changed, 15 insertions(+), 1 deletion(-) >> >> >> >> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S >> >> index 750c6efcb718..537716380959 100644 >> >> --- a/arch/x86/kernel/entry_64.S >> >> +++ b/arch/x86/kernel/entry_64.S >> >> @@ -715,7 +715,21 @@ retint_swapgs: /* return to user-space */ >> >> cmpq %r11,EFLAGS(%rsp) /* R11 == RFLAGS */ >> >> jne opportunistic_sysret_failed >> >> >> >> - testq $X86_EFLAGS_RF,%r11 /* sysret can't restore RF */ >> >> + /* >> >> + * SYSRET can't restore RF. SYSRET can restore TF, but unlike IRET, >> >> + * restoring TF results in a trap from userspace immediately after >> >> + * SYSRET. This would cause an infinite loop whenever #DB happens >> >> + * with register state that satisfies the opportunistic SYSRET >> >> + * conditions. For example, single-stepping this user code: >> >> + * >> >> + * movq $stuck_here,%rcx >> >> + * pushfq >> >> + * popq %r11 >> >> + * stuck_here: >> >> + * >> >> + * would never get past stuck_here. >> >> + */ >> >> + testq $(X86_EFLAGS_RF|X86_EFLAGS_TF),%r11 >> >> jnz opportunistic_sysret_failed >> > >> > So I merged this as it's an obvious bugfix, but in hindsight I'm >> > really uneasy about the whole opportunistic SYSRET concept: it appears >> > that the chance that %rcx matches return-%rip is astronomical - this >> > is why this bug wasn't noticed live so far. >> > >> > So should we really be doing this? >> >> Andy does this not for the off-chance that userspace's RCX is equal >> to return address and R11 == RFLAGS. The chances of that are >> astronomically small. >> >> This code path triggers when ptrace/audit/seccomp is active. Instead >> of torturing ourselves trying to not divert into IRET return, now >> code is steered that way. But then immediately before actual IRET, >> we check again: "do we really need IRET?" IOW "did ptrace really >> touch pt_regs->ss? ->flags? ->rip? ->rcx?" which in vast majority of >> cases will not be true. > > I keep forgetting about that, my test systems have the audit muck > turned off ;-) > > Fair enough - and it's sensible to share the IRET path between > interrupts and complex-return system calls, even though the check > is unnecessary overhead for the pure interrupt return path... Maybe we could reintroduce TIF_IRET for this purpose instead of (ab)using TIF_NOTIFY_RESUME. Then we would only do the opportunistic check for those cases (ptrace, audit, exec, sigreturn, etc.), and skip it for interrupts. -- Brian Gerst -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/