Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933273Ab2EWAjo (ORCPT ); Tue, 22 May 2012 20:39:44 -0400 Received: from hrndva-omtalb.mail.rr.com ([71.74.56.122]:6180 "EHLO hrndva-omtalb.mail.rr.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752235Ab2EWAji (ORCPT ); Tue, 22 May 2012 20:39:38 -0400 X-Authority-Analysis: v=2.0 cv=D8PF24tj c=1 sm=0 a=ZycB6UtQUfgMyuk2+PxD7w==:17 a=XQbtiDEiEegA:10 a=5SG0PmZfjMsA:10 a=Q9fys5e9bTEA:10 a=meVymXHHAAAA:8 a=ayC55rCoAAAA:8 a=I21yznq-B6Y38OGxoyoA:9 a=vlEEBiu4gVfvJgIWYDMA:7 a=PUjeQqilurYA:10 a=jeBq3FmKZ4MA:10 a=cK_KYy3NRiW2m4Pd:21 a=vxwn3eC9U5vWuEkc:21 a=ZycB6UtQUfgMyuk2+PxD7w==:117 X-Cloudmark-Score: 0 X-Originating-IP: 74.67.80.29 Message-ID: <1337733575.13348.134.camel@gandalf.stny.rr.com> Subject: Re: NMI vs #PF clash From: Steven Rostedt To: Linus Torvalds Cc: Avi Kivity , linux-kernel , Ingo Molnar , "H. Peter Anvin" , Thomas Gleixner , Paul Turner , Peter Zijlstra , Frederic Weisbecker , Mathieu Desnoyers Date: Tue, 22 May 2012 20:39:35 -0400 In-Reply-To: References: <4FBB8C40.6080304@redhat.com> <1337693441.13348.36.camel@gandalf.stny.rr.com> <4FBB986F.5030306@redhat.com> <1337695780.13348.41.camel@gandalf.stny.rr.com> <4FBBA094.3090703@redhat.com> <1337696825.13348.44.camel@gandalf.stny.rr.com> Content-Type: text/plain; charset="ISO-8859-15" X-Mailer: Evolution 3.2.2-1 Content-Transfer-Encoding: 7bit Mime-Version: 1.0 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 4925 Lines: 159 On Tue, 2012-05-22 at 08:33 -0700, Linus Torvalds wrote: > On Tue, May 22, 2012 at 7:27 AM, Steven Rostedt wrote: > > > > Is reading it fast? Then we could do a two reads and only write when > > needed. > > Even better: we could do nothing at all. > > We could just say: let's make sure that any #PF case that can happen > in #NMI can also be re-done with arbitrary 'error_code' and 'struct > regs' contents. > > At that point, what could happen is > - #PF > - NMI > - #PF > - read cr2 for NMI fault > - handle the NMI #PF > - return from #PF > - return from #NMI > - read cr2 for original #PF fault - but get the NMI cr2 again > - hande the #PF again (this should be a no-op now) I tested this by forcing all NMIs to have a page fault, and then running perf against hackbench to see what breaks. This doesn't work because the cr2 contains where the fault happened. If the NMI faulting address was in kernel space, the page fault it interrupted will think the user task is trying to access kernel memory and fault. Basically, the address will not be valid for the user task and this will make the task take a SEGFAULT. I had the NMI fault on the address 0x12348, but had a exception table to handle it. hackbench[2236]: segfault at 12348 ip 0000003054a091c4 sp 00007fff22c6e840 error 4 in ld-2.12.2.so[3054a00000+1e000] With the below patch, the faulting NMIs don't seem to be harming anything. > - return from #PF > - instruction restart causes new #PF > - now we do the original page fault > > So one option is to just make sure that the few cases (just the > vmalloc area?) that NMI can trigger are all ok to be re-done with > other state. > > I note that right now we have > > if (unlikely(fault_in_kernel_space(address))) { > if (!(error_code & (PF_RSVD | PF_USER | PF_PROT))) { > if (vmalloc_fault(address) >= 0) > return; > > and that the error_code check means that the retried NMI #PF would not > go through that. But maybe we don't even need that check? > > That error_code thing seems to literally be the only thing that keeps > us from just re-doing the vmalloc_fault() silently. Avi, you want to test this? -- Steve Index: linux-trace.git/arch/x86/kernel/entry_64.S =================================================================== --- linux-trace.git.orig/arch/x86/kernel/entry_64.S +++ linux-trace.git/arch/x86/kernel/entry_64.S @@ -1586,7 +1586,7 @@ ENTRY(nmi) * Check the special variable on the stack to see if NMIs are * executing. */ - cmpl $1, -8(%rsp) + cmpl $1, -16(%rsp) je nested_nmi /* @@ -1615,7 +1615,7 @@ nested_nmi: 1: /* Set up the interrupted NMIs stack to jump to repeat_nmi */ - leaq -6*8(%rsp), %rdx + leaq -7*8(%rsp), %rdx movq %rdx, %rsp CFI_ADJUST_CFA_OFFSET 6*8 pushq_cfi $__KERNEL_DS @@ -1625,7 +1625,7 @@ nested_nmi: pushq_cfi $repeat_nmi /* Put stack back */ - addq $(11*8), %rsp + addq $(12*8), %rsp CFI_ADJUST_CFA_OFFSET -11*8 nested_nmi_out: @@ -1650,6 +1650,8 @@ first_nmi: * +-------------------------+ * | temp storage for rdx | * +-------------------------+ + * | saved cr2 | + * +-------------------------+ * | NMI executing variable | * +-------------------------+ * | Saved SS | @@ -1672,8 +1674,20 @@ first_nmi: * to the repeat_nmi. The original stack frame and the temp storage * is also used by nested NMIs and can not be trusted on exit. */ + + /* + * Save off the CR2 register. If we take a page fault in the NMI then + * it could corrupt the CR2 value. If the NMI preempts a page fault + * handler before it was able to read the CR2 register, and then the + * NMI itself takes a page fault, the page fault that was preempted + * will read the information from the NMI page fault and not the + * origin fault. Save it off and restore it if it changes. + */ + movq %cr2, %rdx + pushq_cfi %rdx + /* Do not pop rdx, nested NMIs will corrupt that part of the stack */ - movq (%rsp), %rdx + movq 8(%rsp), %rdx CFI_RESTORE rdx /* Leave room for the "in NMI" variable. */ @@ -1682,7 +1696,7 @@ first_nmi: /* Copy the stack frame to the Saved frame */ .rept 5 - pushq_cfi 6*8(%rsp) + pushq_cfi 7*8(%rsp) .endr CFI_DEF_CFA_OFFSET SS+8-RIP @@ -1734,6 +1748,13 @@ end_repeat_nmi: nmi_swapgs: SWAPGS_UNSAFE_STACK nmi_restore: + /* Test if the cr2 reg changed */ + movq ORIG_RAX-R15+(12*8)(%rsp), %rdx + movq %cr2, %rcx + cmp %rdx, %rcx + je 1f + movq %rdx, %cr2 +1: RESTORE_ALL 8 /* Clear the NMI executing stack variable */ movq $0, 10*8(%rsp) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/