From: Dave Hansen <dave.hansen@intel.com>
Subject: Re: [PATCH 3/7] x86/enter: Use IBRS on syscall and interrupts
To: Tim Chen <tim.c.chen@linux.intel.com>,
        Thomas Gleixner <tglx@linutronix.de>,
        Andy Lutomirski <luto@kernel.org>,
        Linus Torvalds <torvalds@linux-foundation.org>,
        Greg KH <gregkh@linuxfoundation.org>
References: <cover.1515086770.git.tim.c.chen@linux.intel.com>
 <0c525c4c6c817e9c42c7ed583d86dc591a86efde.1515086770.git.tim.c.chen@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>,
        Andi Kleen <ak@linux.intel.com>,
        Arjan Van De Ven <arjan.van.de.ven@intel.com>,
        linux-kernel@vger.kernel.org
Message-ID: <e2b435e0-7ecc-1ed4-4937-230fed226866@intel.com>
Date: Thu, 4 Jan 2018 12:45:14 -0800
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101
 Thunderbird/52.5.0
MIME-Version: 1.0
In-Reply-To: <0c525c4c6c817e9c42c7ed583d86dc591a86efde.1515086770.git.tim.c.chen@linux.intel.com>
Content-Type: text/plain; charset=utf-8
Content-Language: en-US
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org

On 01/04/2018 09:56 AM, Tim Chen wrote:
> If NMI runs when exiting kernel between IBRS_DISABLE and
> SWAPGS, the NMI would have turned on IBRS bit 0 and then it would have
> left enabled when exiting the NMI. IBRS bit 0 would then be left
> enabled in userland until the next enter kernel.
> 
> That is a minor inefficiency only, but we can eliminate it by saving
> the MSR when entering the NMI in save_paranoid and restoring it when
> exiting the NMI.

Can I suggest and alternate description for the NMI case?  This is
long-winded, but it should keep me from having to think through it yet
again. :)

"
The normal interrupt code uses the 'error_entry' path which uses the
Code Segment (CS) of the instruction that was interrupted to tell
whether it interrupted the kernel or userspace and thus has to switch
IBRS, or leave it alone.

The NMI code is different.  It uses 'paranoid_entry' because it can
interrupt the kernel while it is running with a userspace IBRS (and %GS
and CR3) value, but has a kernel CS.  If we used the same approach as
the normal interrupt code, we might do the following;

	SYSENTER_entry
<-------------- NMI HERE
	IBRS=1
		do_something()
	IBRS=0
	SYSRET

The NMI code might notice that we are running in the kernel and decide
that it is OK to skip the IBRS=1.  This would leave it running
unprotected with IBRS=0, which is bad.

However, if we unconditionally set IBRS=1, in the NMI, we might get the
following case:

	SYSENTER_entry
	IBRS=1
		do_something()
	IBRS=0
<-------------- NMI HERE (set IBRS=1)
	SYSRET

and we would return to userspace with IBRS=1.  Userspace would run
slowly until we entered and exited the kernel again.  (This is the case
Tim is alluding to in the patch description).

Instead of those two approaches, we chose a third one where we simply
save the IBRS value in a scratch register (%r13) and then restore that
value, verbatim.  This is what PTI does with CR3 and it works beautifully.
"