From: Brian Gerst
Date: Tue, 31 Oct 2017 20:43:49 -0400
Subject: Re: [PATCH 01/23] x86, kaiser: prepare assembly for entry/exit CR3 switching
To: Dave Hansen
Cc: Linux Kernel Mailing List, Linux-MM, moritz.lipp@iaik.tugraz.at, daniel.gruss@iaik.tugraz.at, michael.schwarz@iaik.tugraz.at, Andy Lutomirski, Linus Torvalds, Kees Cook, hughd@google.com, "the arch/x86 maintainers"
In-Reply-To: <20171031223148.5334003A@viggo.jf.intel.com>
References: <20171031223146.6B47C861@viggo.jf.intel.com> <20171031223148.5334003A@viggo.jf.intel.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Oct 31, 2017 at 6:31 PM, Dave Hansen wrote:
>
> This is largely code from Andy Lutomirski.  I fixed a few bugs
> in it, and added a few SWITCH_TO_* spots.
>
> KAISER needs to switch to a different CR3 value when it enters
> the kernel and switch back when it exits.  This essentially
> needs to be done before we leave assembly code.
>
> This is extra challenging because the context in which we have to
> make this switch is tricky: the registers we are allowed to
> clobber can vary.  It's also hard to store things on the stack
> because there are already things on it with an established ABI
> (ptregs) or the stack is unsafe to use at all.
>
> This patch establishes a set of macros that allow changing to
> the user and kernel CR3 values, but do not actually switch
> CR3.  The code will, however, clobber the registers that it
> says it will and also does perform *writes* to CR3.  So, this
> patch by itself tests that the registers we are clobbering
> and restoring from are OK, and that things like our stack
> manipulation are in safe places.
>
> In other words, if you bisect to here, this *does* introduce
> changes that can break things.
>
> Interactions with SWAPGS: previous versions of the KAISER code
> relied on having per-cpu scratch space, so we had a register
> to clobber for our CR3 MOV.  The %GS register is what we use
> to index into our per-cpu space, so SWAPGS *had* to be done
> before the CR3 switch.  That scratch space is gone now, but we
> still keep the semantic that SWAPGS must be done before the
> CR3 MOV.  This is good to keep because it is not that hard to
> do, and it allows us to do things like add per-cpu debugging
> information to help us figure out what goes wrong sometimes.
>
> What this does in the NMI code is worth pointing out.  NMIs
> can interrupt *any* context, and they can also be nested, with
> NMIs interrupting other NMIs.  The comments below
> ".Lnmi_from_kernel" explain the format of the stack we use to
> handle this situation.  Changing the format of this stack is
> not a fun exercise: I tried.  Instead of storing the old CR3
> value on the stack, we depend on the *regular* register
> save/restore mechanism and then use %r14 to hold CR3 during
> the NMI.  It will not be clobbered by the C NMI handlers that
> get called.
>
> Signed-off-by: Dave Hansen
> Cc: Moritz Lipp
> Cc: Daniel Gruss
> Cc: Michael Schwarz
> Cc: Andy Lutomirski
> Cc: Linus Torvalds
> Cc: Kees Cook
> Cc: Hugh Dickins
> Cc: x86@kernel.org
> ---
>
>  b/arch/x86/entry/calling.h         |   40 +++++++++++++++++++++++++++++++++++++
>  b/arch/x86/entry/entry_64.S        |   33 +++++++++++++++++++++++++-----
>  b/arch/x86/entry/entry_64_compat.S |   13 ++++++++++++
>  3 files changed, 81 insertions(+), 5 deletions(-)
>
> diff -puN arch/x86/entry/calling.h~kaiser-luto-base-cr3-work arch/x86/entry/calling.h
> --- a/arch/x86/entry/calling.h~kaiser-luto-base-cr3-work	2017-10-31 15:03:48.105007253 -0700
> +++ b/arch/x86/entry/calling.h	2017-10-31 15:03:48.113007631 -0700
> @@ -1,5 +1,6 @@
>  #include
>  #include
> +#include
>
>  /*
>
> @@ -217,6 +218,45 @@ For 32-bit we have the following convent
>  #endif
>  .endm
>
> +.macro ADJUST_KERNEL_CR3 reg:req
> +.endm
> +
> +.macro ADJUST_USER_CR3 reg:req
> +.endm
> +
> +.macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
> +	mov	%cr3, \scratch_reg
> +	ADJUST_KERNEL_CR3 \scratch_reg
> +	mov	\scratch_reg, %cr3
> +.endm
> +
> +.macro SWITCH_TO_USER_CR3 scratch_reg:req
> +	mov	%cr3, \scratch_reg
> +	ADJUST_USER_CR3 \scratch_reg
> +	mov	\scratch_reg, %cr3
> +.endm
> +
> +.macro SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg:req save_reg:req
> +	movq	%cr3, %r\scratch_reg
> +	movq	%r\scratch_reg, \save_reg
> +	/*
> +	 * Just stick a random bit in here that never gets set.  Fixed
> +	 * up in real KAISER patches in a moment.
> +	 */
> +	bt	$63, %r\scratch_reg
> +	jz	.Ldone_\@
> +
> +	ADJUST_KERNEL_CR3 %r\scratch_reg
> +	movq	%r\scratch_reg, %cr3
> +
> +.Ldone_\@:
> +.endm
> +
> +.macro RESTORE_CR3 save_reg:req
> +	/* optimize this */
> +	movq	\save_reg, %cr3
> +.endm
> +
>  #endif /* CONFIG_X86_64 */
>
>  /*
> diff -puN arch/x86/entry/entry_64_compat.S~kaiser-luto-base-cr3-work arch/x86/entry/entry_64_compat.S
> --- a/arch/x86/entry/entry_64_compat.S~kaiser-luto-base-cr3-work	2017-10-31 15:03:48.107007348 -0700
> +++ b/arch/x86/entry/entry_64_compat.S	2017-10-31 15:03:48.113007631 -0700
> @@ -48,8 +48,13 @@
>  ENTRY(entry_SYSENTER_compat)
>  	/* Interrupts are off on entry. */
>  	SWAPGS_UNSAFE_STACK
> +
>  	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
>
> +	pushq	%rdi
> +	SWITCH_TO_KERNEL_CR3 scratch_reg=%rdi
> +	popq	%rdi
> +
>  	/*
>  	 * User tracing code (ptrace or signal handlers) might assume that
>  	 * the saved RAX contains a 32-bit number when we're invoking a 32-bit
> @@ -91,6 +96,9 @@ ENTRY(entry_SYSENTER_compat)
>  	pushq	$0			/* pt_regs->r15 = 0 */
>  	cld
>
> +	pushq	%rdi
> +	SWITCH_TO_KERNEL_CR3 scratch_reg=%rdi
> +	popq	%rdi
>  	/*
>  	 * SYSENTER doesn't filter flags, so we need to clear NT and AC
>  	 * ourselves.  To save a few cycles, we can check whether
> @@ -214,6 +222,8 @@ GLOBAL(entry_SYSCALL_compat_after_hwfram
>  	pushq	$0			/* pt_regs->r14 = 0 */
>  	pushq	$0			/* pt_regs->r15 = 0 */
>
> +	SWITCH_TO_KERNEL_CR3 scratch_reg=%rdi
> +
>  	/*
>  	 * User mode is traced as though IRQs are on, and SYSENTER
>  	 * turned them off.
> @@ -240,6 +250,7 @@ sysret32_from_system_call:
>  	popq	%rsi			/* pt_regs->si */
>  	popq	%rdi			/* pt_regs->di */
>
> +	SWITCH_TO_USER_CR3 scratch_reg=%r8
>  	/*
>  	 * USERGS_SYSRET32 does:
>  	 *  GSBASE = user's GS base
> @@ -324,6 +335,7 @@ ENTRY(entry_INT80_compat)
>  	pushq	%r15			/* pt_regs->r15 */
>  	cld
>
> +	SWITCH_TO_KERNEL_CR3 scratch_reg=%r11
>  	/*
>  	 * User mode is traced as though IRQs are on, and the interrupt
>  	 * gate turned them off.
> @@ -337,6 +349,7 @@ ENTRY(entry_INT80_compat)
>  	/* Go back to user mode. */
>  	TRACE_IRQS_ON
>  	SWAPGS
> +	SWITCH_TO_USER_CR3 scratch_reg=%r11
>  	jmp	restore_regs_and_iret
>  END(entry_INT80_compat)
>
> diff -puN arch/x86/entry/entry_64.S~kaiser-luto-base-cr3-work arch/x86/entry/entry_64.S
> --- a/arch/x86/entry/entry_64.S~kaiser-luto-base-cr3-work	2017-10-31 15:03:48.109007442 -0700
> +++ b/arch/x86/entry/entry_64.S	2017-10-31 15:03:48.115007726 -0700
> @@ -147,8 +147,6 @@ ENTRY(entry_SYSCALL_64)
>  	movq	%rsp, PER_CPU_VAR(rsp_scratch)
>  	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
>
> -	TRACE_IRQS_OFF
> -
>  	/* Construct struct pt_regs on stack */
>  	pushq	$__USER_DS			/* pt_regs->ss */
>  	pushq	PER_CPU_VAR(rsp_scratch)	/* pt_regs->sp */
> @@ -169,6 +167,13 @@ GLOBAL(entry_SYSCALL_64_after_hwframe)
>  	sub	$(6*8), %rsp			/* pt_regs->bp, bx, r12-15 not saved */
>  	UNWIND_HINT_REGS extra=0
>
> +	/* NB: right here, all regs except r11 are live. */
> +
> +	SWITCH_TO_KERNEL_CR3 scratch_reg=%r11
> +
> +	/* Must wait until we have the kernel CR3 to call C functions: */
> +	TRACE_IRQS_OFF
> +
>  	/*
>  	 * If we need to do entry work or if we guess we'll need to do
>  	 * exit work, go straight to the slow path.
> @@ -220,6 +225,7 @@ entry_SYSCALL_64_fastpath:
>  	TRACE_IRQS_ON		/* user mode is traced as IRQs on */
>  	movq	RIP(%rsp), %rcx
>  	movq	EFLAGS(%rsp), %r11
> +	SWITCH_TO_USER_CR3 scratch_reg=%rdi
>  	RESTORE_C_REGS_EXCEPT_RCX_R11
>  	movq	RSP(%rsp), %rsp
>  	UNWIND_HINT_EMPTY
> @@ -313,6 +319,7 @@ return_from_SYSCALL_64:
>  	 * perf profiles. Nothing jumps here.
>  	 */
>  syscall_return_via_sysret:
> +	SWITCH_TO_USER_CR3 scratch_reg=%rdi
>  	/* rcx and r11 are already restored (see code above) */
>  	RESTORE_C_REGS_EXCEPT_RCX_R11
>  	movq	RSP(%rsp), %rsp
> @@ -320,6 +327,7 @@ syscall_return_via_sysret:
>  	USERGS_SYSRET64
>
>  opportunistic_sysret_failed:
> +	SWITCH_TO_USER_CR3 scratch_reg=%rdi
>  	SWAPGS
>  	jmp	restore_c_regs_and_iret
>  END(entry_SYSCALL_64)
> @@ -422,6 +430,7 @@ ENTRY(ret_from_fork)
>  	movq	%rsp, %rdi
>  	call	syscall_return_slowpath	/* returns with IRQs disabled */
>  	TRACE_IRQS_ON		/* user mode is traced as IRQS on */
> +	SWITCH_TO_USER_CR3 scratch_reg=%rdi
>  	SWAPGS
>  	jmp	restore_regs_and_iret
>
> @@ -611,6 +620,7 @@ GLOBAL(retint_user)
>  	mov	%rsp,%rdi
>  	call	prepare_exit_to_usermode
>  	TRACE_IRQS_IRETQ
> +	SWITCH_TO_USER_CR3 scratch_reg=%rdi
>  	SWAPGS
>  	jmp	restore_regs_and_iret
>
> @@ -1091,7 +1101,11 @@ ENTRY(paranoid_entry)
>  	js	1f				/* negative -> in kernel */
>  	SWAPGS
>  	xorl	%ebx, %ebx
> -1:	ret
> +
> +1:
> +	SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg=ax save_reg=%r14
> +
> +	ret
>  END(paranoid_entry)
>
>  /*
> @@ -1118,6 +1132,7 @@ ENTRY(paranoid_exit)
>  paranoid_exit_no_swapgs:
>  	TRACE_IRQS_IRETQ_DEBUG
>  paranoid_exit_restore:
> +	RESTORE_CR3	%r14
>  	RESTORE_EXTRA_REGS
>  	RESTORE_C_REGS
>  	REMOVE_PT_GPREGS_FROM_STACK 8
> @@ -1144,6 +1159,9 @@ ENTRY(error_entry)
>  	 */
>  	SWAPGS
>
> +	/* We have user CR3.  Change to kernel CR3. */
> +	SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
> +
>  .Lerror_entry_from_usermode_after_swapgs:
>  	/*
>  	 * We need to tell lockdep that IRQs are off.  We can't do this until
> @@ -1190,9 +1208,10 @@ ENTRY(error_entry)
>
>  .Lerror_bad_iret:
>  	/*
> -	 * We came from an IRET to user mode, so we have user gsbase.
> -	 * Switch to kernel gsbase:
> +	 * We came from an IRET to user mode, so we have user
> +	 * gsbase and CR3.  Switch to kernel gsbase and CR3:
>  	 */
> +	SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
>  	SWAPGS
>
>  	/*
> @@ -1313,6 +1332,7 @@ ENTRY(nmi)
>  	UNWIND_HINT_REGS
>  	ENCODE_FRAME_POINTER
>
> +	SWITCH_TO_KERNEL_CR3 scratch_reg=%rdi
>  	/*
>  	 * At this point we no longer need to worry about stack damage
>  	 * due to nesting -- we're on the normal thread stack and we're
> @@ -1328,6 +1348,7 @@ ENTRY(nmi)
>  	 * work, because we don't want to enable interrupts.
>  	 */
>  	SWAPGS
> +	SWITCH_TO_USER_CR3 scratch_reg=%rdi
>  	jmp	restore_regs_and_iret
>
> .Lnmi_from_kernel:
> @@ -1538,6 +1559,8 @@ end_repeat_nmi:
>  	movq	$-1, %rsi
>  	call	do_nmi
>
> +	RESTORE_CR3 save_reg=%r14
> +
>  	testl	%ebx, %ebx			/* swapgs needed? */
>  	jnz	nmi_restore
>  nmi_swapgs:
> _

This all needs to be conditional on a config option.  Something with
this amount of performance impact needs to be 100% optional.

--
Brian Gerst