Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S964775AbbGWVuf (ORCPT ); Thu, 23 Jul 2015 17:50:35 -0400 Received: from mail-la0-f45.google.com ([209.85.215.45]:34101 "EHLO mail-la0-f45.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753876AbbGWVuc (ORCPT ); Thu, 23 Jul 2015 17:50:32 -0400 MIME-Version: 1.0 In-Reply-To: References: <20150723173105.6795c0dc@gandalf.local.home> From: Andy Lutomirski Date: Thu, 23 Jul 2015 14:50:11 -0700 Message-ID: Subject: Re: Dealing with the NMI mess To: Linus Torvalds Cc: Steven Rostedt , X86 ML , "linux-kernel@vger.kernel.org" , Willy Tarreau , Borislav Petkov , Thomas Gleixner , Peter Zijlstra , Brian Gerst Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 2939 Lines: 68 On Thu, Jul 23, 2015 at 2:48 PM, Linus Torvalds wrote: > On Thu, Jul 23, 2015 at 2:31 PM, Steven Rostedt wrote: >> >> Let me get this straight. The idea is in the #DB handler to detect that >> it was triggered in NMI context, and if so, simply disarm that >> breakpoint permanently, right? > > No, for simplicity, I'd make it cover not just NMI code, but any > "kernel code with interrupts disabled". > > Because that's the test we'd use for "use ret instead of iret". > > And that wider test is exactly because it's so damn hard to get the > exact instruction boundaries right. Let's *not* go down the path > (again) of having to get the whole %rip range and "magic stack pointer > values" etc. > > Make it simple and completely unambiguous. The rule really would be: > > - if we return to kernel space and interrupts are disabled, we will > use "ret" rather than "iret" > > Hard rule. Simple. Straightforward. No random %rip values. No > random %rsp values. NO CRAP. > > - but because we use "ret" rather than "iret" we can't get RF > semantics, it means that #DB is special. RF is supposed to make us > make forward progress > > So for that reason, #DB just says "if the breakpoint happened > during that interrupts-ff reghion, I will clear %dr7 to guarantee > forward progress" What if we relax it slightly: "if the breakpoint happened during that interrupts-off region, I will clear all *kernel breakpoints* in %dr7 to guarantee forward progress"? Watchpoints don't need RF to make forward progress, and, by leaving watchpoints alone, we avoid breaking gdb. > > So those would be the two main rules. Very simple, and avoiding all nasty cases. > > Now, I'd be willing to then hide the "oops, we clear dr7 very > agrressively" issue by having a few additional _heuristics_. But I > call them "heuristics" because unlike the current NMI nesting games, > they aren't about core stability. They are about "ok, maybe somebody > wants to trigger those faults, and we'll be _nice_ and try to make it > easy for them", but nothing more. > > So for example, if that "#DB clears %dr7" happened, it sounds easy to > set _TIF_USER_WORK_MASK, and just force %dr7 to be re-loaded from a > cached value, so that if we disabled things because of some user stack > trace access, it will be re-enabled by the time we return to user > space. I think that sounds reasonable, but it's not something the core > low-level entry x86 assembly code needs to even care about. It's not > that level of "core", it's just being polite. Once we limit it to instruction breakpoints, I don't think re-enabling before returning to userspace matters. --Andy -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/