MIME-Version: 1.0
In-Reply-To: <20150723173105.6795c0dc@gandalf.local.home>
References: <CALCETrUf9s-o-ETMiSxxjMGxVeH7di4O9vTi0Oe7wS-RCiVXLA@mail.gmail.com>
	<CA+55aFwR8mHw=wm+Uecy0ERgrD7WbijBn9kj_ZAd47L4GyG5Xw@mail.gmail.com>
	<CALCETrVAzhE7w3BDjqRack54BLncZALbnAOZyeXHx1cSTryy4g@mail.gmail.com>
	<CA+55aFyxs8Q5WrjN9o4Zmfd_4+muLkcoO8cXyv5Nt+Pf8c0TBQ@mail.gmail.com>
	<20150723173105.6795c0dc@gandalf.local.home>
Date: Thu, 23 Jul 2015 14:48:13 -0700
Message-ID: <CA+55aFy0-rj7hp3zOUAZD5y5Zp=v6Cu3TG0SHB-buj3oYTJcZg@mail.gmail.com>
Subject: Re: Dealing with the NMI mess
From: Linus Torvalds <torvalds@linux-foundation.org>
To: Steven Rostedt <rostedt@goodmis.org>
Cc: Andy Lutomirski <luto@amacapital.net>, X86 ML <x86@kernel.org>,
        "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
        Willy Tarreau <w@1wt.eu>, Borislav Petkov <bp@alien8.de>,
        Thomas Gleixner <tglx@linutronix.de>,
        Peter Zijlstra <peterz@infradead.org>, Brian Gerst <brgerst@gmail.com>
Content-Type: text/plain; charset=UTF-8
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 2392
Lines: 55

On Thu, Jul 23, 2015 at 2:31 PM, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> Let me get this straight. The idea is in the #DB handler to detect that
> it was triggered in NMI context, and if so, simply disarm that
> breakpoint permanently, right?

No, for simplicity, I'd make it cover not just NMI code, but any
"kernel code with interrupts disabled".

Because that's the test we'd use for "use ret instead of iret".

And that wider test is exactly because it's so damn hard to get the
exact instruction boundaries right. Let's *not* go down the path
(again) of having to get the whole %rip range and "magic stack pointer
values" etc.

Make it simple and completely unambiguous. The rule really would be:

 - if we return to kernel space and interrupts are disabled, we will
use "ret" rather than "iret"

   Hard rule. Simple. Straightforward. No random %rip values. No
random %rsp values. NO CRAP.

 - but because we use "ret" rather than "iret", we can't get RF
semantics, which means that #DB is special. RF is supposed to
guarantee that we make forward progress

   So for that reason, #DB just says "if the breakpoint happened
during that interrupts-off region, I will clear %dr7 to guarantee
forward progress"

So those would be the two main rules. Very simple, and avoiding all nasty cases.
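
As a rough illustration only (this is not actual kernel code), the two
rules could be modeled in C along these lines. The struct frame type and
the names choose_exit() and db_fixup_dr7() are made up for this sketch;
the only architectural fact used is that IF is bit 9 of the flags
register.

```c
#include <stdbool.h>
#include <stdint.h>

/* Interrupt-enable flag: bit 9 of EFLAGS/RFLAGS. */
#define X86_EFLAGS_IF (UINT64_C(1) << 9)

/* Hypothetical, simplified view of the interrupted context. */
struct frame {
	bool     user_mode;	/* true if we interrupted user space */
	uint64_t flags;		/* saved flags register */
};

enum exit_path { EXIT_IRET, EXIT_RET };

/*
 * Rule 1: returning to kernel space with interrupts disabled uses
 * "ret". No %rip ranges, no %rsp values, just these two bits of state.
 */
enum exit_path choose_exit(const struct frame *f)
{
	if (!f->user_mode && !(f->flags & X86_EFLAGS_IF))
		return EXIT_RET;
	return EXIT_IRET;
}

/*
 * Rule 2: "ret" gives no RF semantics, so a #DB taken in that same
 * region clears %dr7 to guarantee forward progress. Returns the value
 * %dr7 should hold after the handler; 0 means all breakpoints disarmed.
 */
uint64_t db_fixup_dr7(const struct frame *f, uint64_t dr7)
{
	if (choose_exit(f) == EXIT_RET)
		return 0;
	return dr7;
}
```

The point of the sketch is that both decisions hang off the same test, so
there is exactly one condition to get right.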

Now, I'd be willing to then hide the "oops, we clear dr7 very
aggressively" issue by having a few additional _heuristics_. But I
call them "heuristics" because unlike the current NMI nesting games,
they aren't about core stability. They are about "ok, maybe somebody
wants to trigger those faults, and we'll be _nice_ and try to make it
easy for them", but nothing more.

So for example, if that "#DB clears %dr7" happened, it sounds easy to
set _TIF_USER_WORK_MASK, and just force %dr7 to be re-loaded from a
cached value, so that if we disabled things because of some user stack
trace access, it will be re-enabled by the time we return to user
space. I think that sounds reasonable, but it's not something the core
low-level entry x86 assembly code needs to even care about. It's not
that level of "core", it's just being polite.
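
A minimal sketch of that heuristic, assuming a per-task cached copy of
%dr7 and a single "work pending" bit checked on the way back to user
space. Everything named here (struct task_model, WORK_RELOAD_DR7,
db_clear_dr7(), exit_to_user_work()) is hypothetical and only models the
idea; it is not how the kernel's TIF flags are actually laid out.

```c
#include <stdint.h>

/* Hypothetical work bit meaning "reload %dr7 before returning to user". */
#define WORK_RELOAD_DR7 (1u << 0)

/* Hypothetical per-task state; dr7_hw stands in for the real register. */
struct task_model {
	uint64_t     dr7_cached;	/* value the debugger asked for */
	uint64_t     dr7_hw;		/* what the hardware currently holds */
	unsigned int work;		/* pending exit-to-user work bits */
};

/* #DB hit in the interrupts-off region: disarm now, remember to re-arm. */
void db_clear_dr7(struct task_model *t)
{
	t->dr7_hw = 0;
	t->work |= WORK_RELOAD_DR7;
}

/* Slow path on return to user space: restore %dr7 from the cached value. */
void exit_to_user_work(struct task_model *t)
{
	if (t->work & WORK_RELOAD_DR7) {
		t->dr7_hw = t->dr7_cached;
		t->work &= ~WORK_RELOAD_DR7;
	}
}
```

Note that none of this lives in the low-level entry path: the re-arm
happens in ordinary C on the exit-to-user slow path.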

                 Linus