Date: Mon, 25 Jan 2021 12:38:59 -0500
From: Steven Rostedt
To: Lai Jiangshan
Cc: linux-kernel@vger.kernel.org, Peter Zijlstra, Lai Jiangshan,
    Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
    x86@kernel.org, "H. Peter Anvin"
Subject: Re: [PATCH V2] x86/entry/64: De-Xen-ify our NMI code further
Message-ID: <20210125123859.39b244ca@gandalf.local.home>
In-Reply-To: <20210125074506.15064-1-jiangshanlai@gmail.com>
References: <20210125074506.15064-1-jiangshanlai@gmail.com>
X-Mailer: Claws Mail 3.17.8 (GTK+ 2.24.33; x86_64-pc-linux-gnu)
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, 25 Jan 2021 15:45:06 +0800
Lai Jiangshan wrote:

> From: Lai Jiangshan
>
> Commit 929bacec21478 ("x86/entry/64: De-Xen-ify our NMI code") simplified
> the NMI code by changing paravirt code into native code, and it left a
> comment about "inspecting RIP instead". But until now, "inspecting RIP
> instead" has not happened; this patch completes it.
>
> The comments in the code are from Andy Lutomirski. Thanks!
>
> Signed-off-by: Lai Jiangshan
> ---
>  arch/x86/entry/entry_64.S | 44 ++++++++++-----------------------------
>  1 file changed, 11 insertions(+), 33 deletions(-)
>
> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> index cad08703c4ad..21f67ea62341 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -1268,32 +1268,14 @@ SYM_CODE_START(asm_exc_nmi)
>  	je	nested_nmi
>
>  	/*
> -	 * Now test if the previous stack was an NMI stack. This covers
> -	 * the case where we interrupt an outer NMI after it clears
> -	 * "NMI executing" but before IRET. We need to be careful, though:
> -	 * there is one case in which RSP could point to the NMI stack
> -	 * despite there being no NMI active: naughty userspace controls
> -	 * RSP at the very beginning of the SYSCALL targets. We can
> -	 * pull a fast one on naughty userspace, though: we program
> -	 * SYSCALL to mask DF, so userspace cannot cause DF to be set
> -	 * if it controls the kernel's RSP. We set DF before we clear
> -	 * "NMI executing".
> +	 * Now test if we interrupted an outer NMI that just cleared "NMI
> +	 * executing" and is about to IRET. This is a single-instruction
> +	 * window. This check does not handle the case in which we get a
> +	 * nested interrupt (#MC, #VE, #VC, etc.) after clearing
> +	 * "NMI executing" but before the outer NMI executes IRET.
>  	 */
> -	lea	6*8(%rsp), %rdx
> -	/* Compare the NMI stack (rdx) with the stack we came from (4*8(%rsp)) */
> -	cmpq	%rdx, 4*8(%rsp)
> -	/* If the stack pointer is above the NMI stack, this is a normal NMI */
> -	ja	first_nmi
> -
> -	subq	$EXCEPTION_STKSZ, %rdx
> -	cmpq	%rdx, 4*8(%rsp)
> -	/* If it is below the NMI stack, it is a normal NMI */
> -	jb	first_nmi
> -
> -	/* Ah, it is within the NMI stack. */
> -
> -	testb	$(X86_EFLAGS_DF >> 8), (3*8 + 1)(%rsp)
> -	jz	first_nmi	/* RSP was user controlled. */

So we no longer check to see if the current stack is on the NMI stack.
That makes sense, since this beginning of the NMI code cannot be
interrupted: no breakpoints or faults can occur there.

The "NMI executing" variable is set in all other locations except
between repeat_nmi and end_repeat_nmi, and at the iretq itself
(.Lnmi_iret). Thus, by checking the "NMI executing" variable on the
stack along with the RIP compared against repeat_nmi..end_repeat_nmi
and .Lnmi_iret, we can safely tell whether the NMI is nested or not.
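To make that concrete, here is a minimal C rendering of the nesting
test just described. This is illustrative only; the real check is the
assembly above, and the *_addr symbols are stand-ins for the addresses
of the entry-code labels:

	#include <stdbool.h>
	#include <stdint.h>

	/* Stand-ins for the addresses of the entry_64.S labels. */
	extern uintptr_t repeat_nmi_addr;     /* repeat_nmi     */
	extern uintptr_t end_repeat_nmi_addr; /* end_repeat_nmi */
	extern uintptr_t nmi_iret_addr;       /* .Lnmi_iret     */

	static bool nmi_is_nested(uintptr_t saved_rip, bool nmi_executing)
	{
		/* "NMI executing" still set on the stack: clearly nested. */
		if (nmi_executing)
			return true;

		/*
		 * Replaying the NMI frame between repeat_nmi and
		 * end_repeat_nmi, where "NMI executing" is legitimately
		 * clear.
		 */
		if (saved_rip >= repeat_nmi_addr &&
		    saved_rip < end_repeat_nmi_addr)
			return true;

		/* The single-instruction window at the iretq itself. */
		if (saved_rip == nmi_iret_addr)
			return true;

		return false;
	}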
I was working on rewriting the beginning comments to reflect these
updates, and discovered a possible bug that exists (unrelated to this
patch).

> +	cmpq	$.Lnmi_iret, 8(%rsp)
> +	jne	first_nmi
>

On triggering an NMI from user space, I see that the switch to the
thread stack is done and then "exc_nmi" is called. The problem I see
with this is that exc_nmi is called on the thread stack; if it were to
take an exception, NMIs would be re-enabled, allowing a nested NMI to
run. From what I can tell, nothing stops that NMI from executing over
the currently running NMI. That is, this means that NMI handlers are
now re-entrant.

Yes, the stack issue is not a problem here, but NMI handlers are not
allowed to be re-entrant. For example, we have spin locks in NMI
handlers that are considered fine if they are only used in NMI
handlers. But because there is now a possible way to make NMI handlers
re-entrant, these spin locks can deadlock.

I'm guessing that we need to add some tricks to the user-space path to
set and clear the "NMI executing" variable, but the return may become
a bit complex in clearing that without races.

-- 
Steve
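To illustrate the deadlock scenario described above: a minimal C
sketch, using hypothetical names rather than actual kernel code, of a
spin lock used only from NMI context that becomes unsafe once NMI
handlers can nest on the same CPU:

	#include <stdbool.h>

	static volatile bool nmi_lock; /* taken only from NMI context */

	static void nmi_handler(void)
	{
		/*
		 * Safe only while NMI handlers cannot nest: the lock
		 * owner always finishes before another NMI handler can
		 * run on this CPU.
		 */
		while (__atomic_test_and_set(&nmi_lock, __ATOMIC_ACQUIRE))
			;	/* spin */

		/*
		 * If an exception taken here re-enables NMIs, a second
		 * NMI can re-enter nmi_handler() on this CPU. It spins
		 * on nmi_lock forever, and the interrupted owner never
		 * resumes to release it: deadlock.
		 */

		__atomic_clear(&nmi_lock, __ATOMIC_RELEASE);
	}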