Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932440AbdHWRaT (ORCPT ); Wed, 23 Aug 2017 13:30:19 -0400 Received: from mail-qt0-f181.google.com ([209.85.216.181]:33259 "EHLO mail-qt0-f181.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932272AbdHWRaR (ORCPT ); Wed, 23 Aug 2017 13:30:17 -0400 MIME-Version: 1.0 In-Reply-To: <20170823152059.aix2zb24xsndfntq@treble> References: <20170822175115.k3tdjlkltua7lkiu@treble> <20170823152059.aix2zb24xsndfntq@treble> From: oliver yang Date: Thu, 24 Aug 2017 01:30:16 +0800 Message-ID: Subject: Re: [PATCH] x86/nmi/64: avoid passing user space rsp of pt_regs to nmi handler To: Josh Poimboeuf Cc: yang oliver , "tglx@linutronix.de" , "mingo@redhat.com" , "hpa@zytor.com" , "luto@kernel.org" , "x86@kernel.org" , "rostedt@goodmis.org" , "linux-kernel@vger.kernel.org" Content-Type: text/plain; charset="UTF-8" Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 5119 Lines: 119 2017-08-23 23:20 GMT+08:00 Josh Poimboeuf : > On Wed, Aug 23, 2017 at 10:19:02PM +0800, oliver yang wrote: >> 2017-08-23 1:51 GMT+08:00 Josh Poimboeuf : >> > On Tue, Aug 22, 2017 at 05:09:19PM +0000, yang oliver wrote: >> >> From: Yong Yang >> >> >> >> While NMI interrupts the very beginning of SYSCALL, the rsp could >> >> be still user space address without loading kernel address. Then >> >> the pt_regs constructed by the NMI entry point could have a user >> >> space rsp. If a NMI handler tried to dump stack by using this rsp, >> >> it can cause the kernel panic. >> > >> > To me this sounds like an unwinder bug. The unwinder is supposed to >> > have checks to prevent it from accessing user space. >> >> What kind of checking should we do here? >> >> While the NMI interrupts the first instruction of entry_SYSCALL_64, >> the rip is kernel address, but rsp is still a user space address. >> >> Usually, kernel uses the "user_mode" check to determine user space >> pt_regs, but this check doesn't work for this situation. > > The unwinder tries to make sure that all its memory accesses are within > the bounds of the kernel stacks (task, irq, and exception stacks). See > update_stack_state() in a recent kernel: > > /* > * If the next bp isn't on the current stack, switch to the next one. > * > * We may have to traverse multiple stacks to deal with the possibility > * that info->next_sp could point to an empty stack and the next bp > * could be on a subsequent stack. > */ > while (!on_stack(info, frame, len)) > if (get_stack_info(info->next_sp, state->task, info, > &state->stack_mask)) > return false; > >> > I know you had >> > previously reported this for an older (pre-4.9) kernel. Have you >> > verified the bug still exists on a recent kernel? >> >> Yes, I ran into kernel panic on 3.10 kernel, and the root case is >> show_stack_log_lvl in nmi handler dump the user space stack, >> but that memory got swapped out or freed. >> >> >> On latest kernel, show_stack_log_lvl never dump stack contents. >> But I'm not sure other NMI handler, or kernel trace code could dump >> a user space stack dump while the kernel mode pt_regs which has >> a user space sp. >> >> I wrote a module which register a NMI handler, it did the check like >> below, >> >> if (!user_mode(regs)) { /* kernel mode, but could contain user address */ >> rsp = regs->sp; >> >> if (rsp < __PAGE_OFFSET) { >> pr_err("User Stack Pointer: %016lx", rsp); >> pr_err("User Stack Dump: %016lx", *((unsigned long *)rsp)); >> } >> } >> >> I manage to make NMI interrupted the entry_SYSCALL_64 boundary, >> then I could trigger the issue. >> >> Not sure whether it is good enough. >> >> Let me know if you have other suggestions. > > The frame pointer unwinder was completely rewritten in 4.9, and > show_stack_log_lvl() was removed. AFAIK, it shouldn't panic when > unwinding from an NMI in early syscall code. If you can recreate the > problem on a recent kernel, please let me know. > OK. The show_stack_log_lvl was removed, and the dump_stack() won't dump stack memory directly, that means no panic any more. For any customized NMI handler, I don't think unwinder could prevent NMI handler to dump the stack memory. However, it seemed we don't want NMI handler to have some design to avoid this issue. Then it is the NMI handler responsibilities to avoid the panic. Here is the show_regs output while I inject NMI at the boundary of SYSCALL, just FYI. In order kernel there was "Stack:" memory dump, but new kernel won't have: [ 100.209221] User Stack Pointer: 00007f06aebcfc58 [ 100.209221] User Stack Dump: 0000000000000000 [ 100.209221] CPU: 0 PID: 2196 Comm: staragent-core Tainted: G O 4.13.0-rc2+ #36 [ 100.209221] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 [ 100.209221] task: ffff88006b400000 task.stack: ffffc900009e0000 [ 100.209221] RIP: 0010:entry_SYSCALL_64+0x0/0x3 [ 100.209221] RSP: 0018:00007f06aebcfc58 EFLAGS: 00000046 [ 100.209221] RAX: 0000000000000060 RBX: 0000000000027aee RCX: 00007ffefe487c63 [ 100.209221] RDX: 00000000599da7ff RSI: 0000000000000000 RDI: 00007f06aebcfc90 [ 100.209221] RBP: 00007f06aebcfc80 R08: 0000000032692a2d R09: 0000000000000894 [ 100.209221] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000000 [ 100.209221] R13: 0019349516db8c69 R14: 00007f06aebd0700 R15: 0000000000000000 [ 100.209221] FS: 00007f06aebd0700(0000) GS:ffff880079600000(0000) knlGS:0000000000000000 [ 100.209221] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 100.209221] CR2: 000000c42040e000 CR3: 00000000749a2000 CR4: 00000000000406f0 [ 100.209221] Call Trace: [ 100.209221] Code: 08 48 8d 7f 18 e8 b1 dc 64 ff 48 89 df e8 79 19 65 ff 48 8b 7d 08 be 01 02 00 00 e8 8b 0e 60 ff 5b 5d c3 90 90 90 90 90 90 90 90 <0f> 01 f8 65 48 89 24 25 a8 b7 00 00 65 48 8b 24 25 84 87 1d 00 -- ------------------ Oliver Yang