Date: Fri, 24 Jul 2015 19:10:18 +0200
From: Willy Tarreau <w@1wt.eu>
To: Andy Lutomirski <luto@amacapital.net>
Cc: Peter Zijlstra <peterz@infradead.org>,
        Linus Torvalds <torvalds@linux-foundation.org>,
        Steven Rostedt <rostedt@goodmis.org>, X86 ML <x86@kernel.org>,
        "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
        Borislav Petkov <bp@alien8.de>, Thomas Gleixner <tglx@linutronix.de>,
        Brian Gerst <brgerst@gmail.com>
Subject: Re: Dealing with the NMI mess
Message-ID: <20150724171018.GH3612@1wt.eu>
References: <CALCETrUf9s-o-ETMiSxxjMGxVeH7di4O9vTi0Oe7wS-RCiVXLA@mail.gmail.com> <CA+55aFwR8mHw=wm+Uecy0ERgrD7WbijBn9kj_ZAd47L4GyG5Xw@mail.gmail.com> <CALCETrVAzhE7w3BDjqRack54BLncZALbnAOZyeXHx1cSTryy4g@mail.gmail.com> <CA+55aFyxs8Q5WrjN9o4Zmfd_4+muLkcoO8cXyv5Nt+Pf8c0TBQ@mail.gmail.com> <20150723173105.6795c0dc@gandalf.local.home> <CA+55aFy0-rj7hp3zOUAZD5y5Zp=v6Cu3TG0SHB-buj3oYTJcZg@mail.gmail.com> <CALCETrWMgsrgEYWpzPFapOj+-SvZfadDAZ7SH7O8bFsR2b6F1Q@mail.gmail.com> <CA+55aFzma9NgODkzz08zpEKSWVnwxuCvwPt_JnO8HaHwRnBPdQ@mail.gmail.com> <20150724081326.GO25159@twins.programming.kicks-ass.net> <CALCETrWjzU79ASDK+0RJQyCy6qTdM3FPTa4ZM0d5sVW66yhcug@mail.gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <CALCETrWjzU79ASDK+0RJQyCy6qTdM3FPTa4ZM0d5sVW66yhcug@mail.gmail.com>
User-Agent: Mutt/1.4.2.3i
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 2548
Lines: 54

On Fri, Jul 24, 2015 at 08:48:57AM -0700, Andy Lutomirski wrote:
> So by the time we detect that we've hit a watchpoint, the instruction
> that tripped it is done and we don't need RF.  Furthermore, after
> reading 17.3.1.1: I *think* that regs->flags will have RF *clear* if
> we hit a watchpoint.  So this might be as simple as:
> 
> if ((dr6 & (0xf * DR_TRAP0)) && (regs->flags & (X86_EFLAGS_RF |
> X86_EFLAGS_IF)) == X86_EFLAGS_RF && !user_mode(regs))
>   for (i = 0; i < 4; i++)
>     if (dr6 & (DR_TRAP0<<i)) {
>       /* hit a kernel breakpoint with IF clear */
>       dr7 &= ~(DR_GLOBAL_ENABLE << (i * DR_ENABLE_SHIFT));
>     }
> 
> I'm not saying that your code is wrong, but I think this is simpler
> and avoids poking at yet more per-cpu state from NMI context, which is
> kind of nice.
> 
> If you don't like the RF games above, it would also be straightforward
> to parse dr0..dr3 for each DR_TRAP bit that's set and see if it's a
> breakpoint.
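
To make sure I read the quoted check correctly, here is a rough userspace
model of it. This is only a sketch and not kernel code: the constants are
redefined locally from the architectural bit layout (DR6 B0..B3 in bits 0..3,
DR7 G0..G3 in bits 1, 3, 5, 7, EFLAGS.IF in bit 9, EFLAGS.RF in bit 16). The
names mask_hit_breakpoints(), DR_ENABLE_STRIDE and self_check() are made up
for the illustration, and the DR6 test is written with a bitwise AND, which
is what I assume was intended.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Architectural bit positions, redefined locally for this sketch. */
#define DR_TRAP0          0x1u         /* DR6.B0; B1..B3 follow in bits 1..3 */
#define DR_GLOBAL_ENABLE  0x2u         /* DR7.G0; G1..G3 follow every 2 bits */
#define DR_ENABLE_STRIDE  2            /* hypothetical name: L/G pair width  */
#define X86_EFLAGS_IF     0x00000200u  /* interrupt enable flag              */
#define X86_EFLAGS_RF     0x00010000u  /* resume flag                        */

/*
 * Hypothetical helper: return the DR7 value with the global-enable bit
 * cleared for every breakpoint reported in DR6, but only when the trap
 * came from kernel mode with RF set and IF clear. In all other cases
 * DR7 is returned unchanged.
 */
uint64_t mask_hit_breakpoints(uint64_t dr6, uint64_t dr7,
                              uint64_t flags, bool from_user)
{
    int i;

    if (!(dr6 & (0xf * DR_TRAP0)))
        return dr7;                     /* no DR_TRAP bit set */
    if ((flags & (X86_EFLAGS_RF | X86_EFLAGS_IF)) != X86_EFLAGS_RF)
        return dr7;                     /* not "RF set, IF clear" */
    if (from_user)
        return dr7;                     /* only kernel-mode hits matter */

    for (i = 0; i < 4; i++)
        if (dr6 & ((uint64_t)DR_TRAP0 << i))
            dr7 &= ~((uint64_t)DR_GLOBAL_ENABLE << (i * DR_ENABLE_STRIDE));

    return dr7;
}

/* Small self-check: with G0..G3 all set (0xaa), a hit on breakpoint 0 in
 * kernel mode with IF clear drops only G0. */
void self_check(void)
{
    assert(mask_hit_breakpoints(0x1, 0xaa, X86_EFLAGS_RF, false) == 0xa8);
    assert(mask_hit_breakpoints(0x1, 0xaa, X86_EFLAGS_RF, true) == 0xaa);
}
```

Taking and returning DR7 by value keeps the model free of any per-cpu
state, which is the property argued for in the quoted text.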

Andy, section 5.8 of the SDM makes me think we could possibly abuse SYSRET
to emulate IRET, and then possibly simplify the flags processing. It says
that it takes the CPL3 code segment, but nowhere does it say that the target
is validated as actually being userland, and it further suggests that it
doesn't validate anything:

  "It is the responsibility of the OS to ensure the descriptors in
   the GDT/LDT correspond to the selectors loaded by SYSCALL/SYSRET
   (consistent with the base, limit, and attribute values forced by
   the instructions)."

The OS has to set the RSP by itself before doing SYSRET, which opens a
race between "mov rsp" and "sysret", but if we only take that path once
we figure we come from NMI (using just IF+RSP), we know that IRQs and
NMIs are still disabled and cannot strike at this instant. Maybe MCEs
can, but they would execute within the NMI's stack just as if they were
triggered inside the NMI as well so I don't see a problem here.
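
To be concrete about what I mean by "using just IF+RSP", here is a rough
userspace model of the test. It is only a sketch under my own assumptions:
frame_is_inside_nmi() and the nmi_stack_low/nmi_stack_high bounds are
made-up names, and I'm assuming that "we come from NMI" can be approximated
by "the saved flags have IF clear and the saved RSP points into the NMI
stack".

```c
#include <stdbool.h>
#include <stdint.h>

#define X86_EFLAGS_IF 0x00000200u  /* EFLAGS.IF, bit 9 */

/*
 * Hypothetical helper: decide whether a saved exception frame looks like
 * it interrupted the NMI handler. The stack bounds are inclusive and are
 * assumptions of this sketch, not values taken from the kernel.
 */
bool frame_is_inside_nmi(uint64_t saved_flags, uint64_t saved_rsp,
                         uint64_t nmi_stack_low, uint64_t nmi_stack_high)
{
    if (saved_flags & X86_EFLAGS_IF)
        return false;   /* interrupts were enabled: not NMI context */

    return saved_rsp >= nmi_stack_low && saved_rsp <= nmi_stack_high;
}
```

Only when such a test says yes would the return path consider the
"mov rsp" + "sysret" sequence; everything else would keep using IRET.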

I tried to imagine a case where the kernel page faults, then an NMI comes in,
then debug strikes and we have to return from debug to NMI and then to the
fault handler, and I don't think we break the chain. Of course there are many
subtleties I can't grasp because I don't understand all the details.

Do you think that could simplify things, or is it another stupid idea?

Willy
