Message-ID: <1437755591.5522.0.camel@gmail.com>
Subject: Re: Dealing with the NMI mess
From: Raymond Jennings
To: Andy Lutomirski
Cc: X86 ML, "linux-kernel@vger.kernel.org", Willy Tarreau, Borislav Petkov, Thomas Gleixner, Peter Zijlstra, Linus Torvalds, Steven Rostedt, Brian Gerst
Date: Fri, 24 Jul 2015 09:33:11 -0700

On Thu, 2015-07-23 at 13:21 -0700, Andy Lutomirski wrote:
> [moved to a new thread, cc list trimmed]
>
> Hi all-
>
> We've considered two approaches to dealing with NMIs:
>
> 1. Allow nesting.  We know quite well how messy that is.

This might be a stupid question, but

1.  What exactly does the NMI handler handle?

2.  Is it possible for the NMI handler to just increment a counter and
return if it nests, and let the outer handler notice and rerun itself?

> 2. Forbid IRET inside NMIs.  Doable but maybe not that pretty.
>
> We haven't considered:
>
> 3. Forbid faults (other than MCE) inside NMI.
>
> Option 3 is almost easy.  There are really only two kinds of faults
> that can legitimately nest inside NMI: #PF and #DB.  #DB is easy to
> fix (e.g. with my patches or Peter's patches).
>
> What if we went all out and forbade page faults in NMI as well?  There
> are two reasons that I can think of that we might page fault inside an
> NMI:
>
> a) vmalloc fault.  I think Ingo already half-implemented a rework to
> eliminate vmalloc faults entirely.
>
> b) User memory access faults.
>
> The reason we access user state in general from an NMI is to allow
> perf to capture enough user stack data to let the tooling backtrace
> back to user space.  What if we did it differently?  Instead of
> capturing this data in NMI context, capture it in
> prepare_exit_to_usermode.  That would let us capture user state
> *correctly*, which we currently can't really do.  There's a
> never-ending series of minor bugs in which we try to guess the user
> register state from NMI context, and it sort of works.  In
> prepare_exit_to_usermode, we really truly know the user state.
> There's a race where an NMI hits during or after
> prepare_exit_to_usermode, but maybe that's okay -- just admit defeat
> in that case and don't show the user state.  (Realistically, without
> CFI data, we're not going to be guaranteed to get the right state
> anyway.)
>
> To make this work, we'd have to teach NMI-from-userspace to call the
> callback itself.  It would look like:
>
> prepare_exit_to_usermode() {
>   ...
>   while (blah blah blah) {
>     if (cached_flags & TIF_PERF_CAPTURE_USER_STATE)
>       perf_capture_user_state();
>     ...
>   }
>   ...
> }
>
> and then, on NMI exit, we'd call perf_capture_user_state directly,
> since we don't want to enable IRQs or do opportunistic sysret on exit
> from NMI.  (Why not?  Because NMIs are still masked, and we don't want
> to pay for double-IRET to unmask them, so we really want to leave IRQs
> off and IRET straight back to user mode.)
>
> There's an unavoidable race in which we enter user mode with
> TIF_PERF_CAPTURE_USER_STATE still set.  In principle, we could
> IPI-to-self from the NMI handler to cover that case (mostly -- we
> capture the wrong state if we're on our way to an IRET fault), or we
> could just check on entry if the flag is still set and, if so, admit
> defeat.
>
> Peter, can this be done without breaking the perf ABI?  If we were
> designing all of this stuff from scratch right now, I'd suggest doing
> it this way, but I'm not sure whether it makes sense to try to
> retrofit it in.
>
>
> If we decide to stick with option 2, then I've now convinced myself
> that banning all kernel breakpoints and watchpoints during NMI
> processing is probably for the best.  Maybe we should go one step
> farther and ban all DR7 breakpoints period.  Sure, it will slow down
> perf if there are user breakpoints or watchpoints set, but, having
> looked at the asm, returning from #DB using RET is, while doable,
> distinctly ugly.
>
> --Andy
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
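
To make question 2 above (count nested NMIs and let the outer handler
rerun itself) a bit more concrete, here is a rough user-space model in C.
Every name in it (nmi_entry, do_nmi_work, nmi_pending, inject_nested) is
made up for illustration and none of it is kernel code. It only models
the counting logic; it says nothing about the stack or IRET problems
discussed in this thread, which are presumably the hard part.

```c
#include <stdatomic.h>

/* Hypothetical user-space model; all names here are invented. */
static atomic_int nmi_pending; /* NMIs recorded but not yet handled */
static int nmi_work_done;      /* how many times the real work ran */
static int inject_nested;      /* simulate this many NMIs arriving mid-handler */

static void nmi_entry(void);

static void do_nmi_work(void)
{
    nmi_work_done++;

    /* Pretend another NMI fires while the outer handler is still running. */
    if (inject_nested > 0) {
        inject_nested--;
        nmi_entry();
    }
}

static void nmi_entry(void)
{
    /* A nested entry only records itself and returns right away. */
    if (atomic_fetch_add(&nmi_pending, 1) > 0)
        return;

    /* Outermost entry: rerun the work until no recorded NMI is left. */
    do {
        do_nmi_work();
    } while (atomic_fetch_sub(&nmi_pending, 1) > 1);
}
```

Setting inject_nested to 2 and calling nmi_entry() once runs the work
three times in this model (once for the outer NMI, once per recorded
nested one) and leaves nmi_pending at zero.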