From: Andy Lutomirski
Date: Thu, 23 Jul 2015 13:21:16 -0700
Subject: Dealing with the NMI mess
To: X86 ML, linux-kernel@vger.kernel.org, Willy Tarreau, Borislav Petkov,
    Thomas Gleixner, Peter Zijlstra, Linus Torvalds, Steven Rostedt,
    Brian Gerst

[moved to a new thread, cc list trimmed]

Hi all-

We've considered two approaches to dealing with NMIs:

1. Allow nesting.  We know quite well how messy that is.

2. Forbid IRET inside NMIs.  Doable, but maybe not that pretty.

We haven't considered:

3. Forbid faults (other than MCE) inside NMIs.

Option 3 is almost easy.  There are really only two kinds of faults
that can legitimately nest inside an NMI: #PF and #DB.  #DB is easy to
fix (e.g. with my patches or Peter's patches).

What if we went all out and forbade page faults inside NMIs as well?
There are two reasons I can think of that we might page fault inside
an NMI:

a) vmalloc fault.  I think Ingo already half-implemented a rework to
eliminate vmalloc faults entirely.

b) User memory access fault.  The reason we access user state in
general from an NMI is to allow perf to capture enough user stack data
to let the tooling backtrace back to user space.  What if we did it
differently?  Instead of capturing this data in NMI context, capture
it in prepare_exit_to_usermode.  That would let us capture user state
*correctly*, which we currently can't really do.
There's a never-ending series of minor bugs in which we try to guess
the user register state from NMI context, and it only sort of works.
In prepare_exit_to_usermode, we really, truly know the user state.
There's a race where an NMI hits during or after
prepare_exit_to_usermode, but maybe that's okay -- just admit defeat
in that case and don't show the user state.  (Realistically, without
CFI data, we're not guaranteed to get the right state anyway.)

To make this work, we'd have to teach NMI-from-userspace to call the
callback itself.  It would look like:

prepare_exit_to_usermode()
{
    ...
    while (blah blah blah) {
        if (cached_flags & TIF_PERF_CAPTURE_USER_STATE)
            perf_capture_user_state();
        ...
    }
    ...
}

and then, on NMI exit, we'd call perf_capture_user_state directly,
since we don't want to enable IRQs or do opportunistic sysret on exit
from NMI.  (Why not?  Because NMIs are still masked, and we don't want
to pay for a double IRET to unmask them, so we really want to leave
IRQs off and IRET straight back to user mode.)

There's an unavoidable race in which we enter user mode with
TIF_PERF_CAPTURE_USER_STATE still set.  In principle, we could
IPI-to-self from the NMI handler to cover that case (mostly -- we'd
still capture the wrong state if we're on our way to an IRET fault),
or we could just check on entry whether the flag is still set and, if
so, admit defeat.

Peter, can this be done without breaking the perf ABI?  If we were
designing all of this stuff from scratch right now, I'd suggest doing
it this way, but I'm not sure whether it makes sense to try to
retrofit it in.

If we decide to stick with option 2, then I've now convinced myself
that banning all kernel breakpoints and watchpoints during NMI
processing is probably for the best.  Maybe we should go one step
further and ban all DR7 breakpoints, period.  Sure, it would slow down
perf if there are user breakpoints or watchpoints set, but, having
looked at the asm, returning from #DB using RET is, while doable,
distinctly ugly.
--Andy