Date: Tue, 6 Dec 2016 08:52:27 +0100
From: Borislav Petkov
To: Andy Lutomirski
Cc: x86@kernel.org, One Thousand Gnomes, linux-kernel@vger.kernel.org,
 Brian Gerst, Matthew Whitehead, Henrique de Moraes Holschuh,
 Peter Zijlstra, xen-devel, Juergen Gross, Boris Ostrovsky,
 Andrew Cooper, "H. Peter Anvin"
Subject: Re: [PATCH v3 4/4] x86/asm: Rewrite sync_core() to use IRET-to-self
Message-ID: <20161206075227.utcmqghjjcdofx3w@pd.tnic>
User-Agent: NeoMutt/20161014 (1.7.1)

On Mon, Dec 05, 2016 at 01:32:43PM -0800, Andy Lutomirski wrote:
> Aside from being excessively slow, CPUID is problematic: Linux runs
> on a handful of CPUs that don't have CPUID. Use IRET-to-self
> instead. IRET-to-self works everywhere, so it makes testing easy.
>
> For reference, on my laptop, IRET-to-self is ~110ns,
> CPUID(eax=1, ecx=0) is ~83ns on native and very, very slow under KVM,
> and MOV-to-CR2 is ~42ns.
>
> While we're at it: sync_core() serves a very specific purpose.
> Document it.
>
> Cc: "H. Peter Anvin"
> Signed-off-by: Andy Lutomirski
> ---
>  arch/x86/include/asm/processor.h | 77 ++++++++++++++++++++++++++++------------
>  1 file changed, 55 insertions(+), 22 deletions(-)
>
> diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
> index 64fbc937d586..201a956e345f 100644
> --- a/arch/x86/include/asm/processor.h
> +++ b/arch/x86/include/asm/processor.h
> @@ -590,33 +590,66 @@ static __always_inline void cpu_relax(void)
>
>  #define cpu_relax_lowlatency() cpu_relax()
>
> -/* Stop speculative execution and prefetching of modified code. */
> +/*
> + * This function forces the icache and prefetched instruction stream to
> + * catch up with reality in two very specific cases:
> + *
> + * a) Text was modified using one virtual address and is about to be executed
> + *    from the same physical page at a different virtual address.
> + *
> + * b) Text was modified on a different CPU, may subsequently be
> + *    executed on this CPU, and you want to make sure the new version
> + *    gets executed.  This generally means you're calling this in an IPI.
> + *
> + * If you're calling this for a different reason, you're probably doing
> + * it wrong.

"... and think hard before you call this - it is slow."

I'd add that, now that it is even slower than CPUID.

> + */
>  static inline void sync_core(void)
>  {
> -	int tmp;
> -
> -#ifdef CONFIG_X86_32
>  	/*
> -	 * Do a CPUID if available, otherwise do a jump.  The jump
> -	 * can conveniently enough be the jump around CPUID.
> +	 * There are quite a few ways to do this.  IRET-to-self is nice
> +	 * because it works on every CPU, at any CPL (so it's compatible
> +	 * with paravirtualization), and it never exits to a hypervisor.
> +	 * The only down sides are that it's a bit slow (it seems to be
> +	 * a bit more than 2x slower than the fastest options) and that
> +	 * it unmasks NMIs.

Ewww, I hadn't thought of that angle.
Aren't we going to get into all kinds of hard-to-debug issues due to
that couple-of-cycles window of unmasked NMIs? We sync_core() in some
NMI handler and then, right in the middle of it, we get another NMI.
Yeah, we still have the nested-NMI stuff, but I'd like to avoid
complications if possible.

> +	 * The "push %cs" is needed because, in
> +	 * paravirtual environments, __KERNEL_CS may not be a valid CS
> +	 * value when we do IRET directly.
> +	 *
> +	 * In case NMI unmasking or performance ever becomes a problem,
> +	 * the next best option appears to be MOV-to-CR2 and an
> +	 * unconditional jump.  That sequence also works on all CPUs,
> +	 * but it will fault at CPL3.

Does it really have to be non-privileged?

If not, there are a couple more serializing insns:

"• Privileged serializing instructions — INVD, INVEPT, INVLPG, INVVPID,
LGDT, LIDT, LLDT, LTR, MOV (to control register, with the exception of
MOV CR8), MOV (to debug register), WBINVD, and WRMSR"

What about INVD? It is expensive too :-)

Can't we use MOV %dr or so? With previously saving/restoring the reg?

WBINVD could be another candidate, albeit a big hammer. WRMSR maybe
too. If it faults, that's fine as well, because then you get the IRET
automagically. Hell, we could even make it fault on purpose by writing
to an invalid MSR, but then we're back to the unmasked NMIs... :-\

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.