From: Andy Lutomirski
Date: Tue, 6 Dec 2016 09:49:42 -0800
Subject: Re: [PATCH v3 4/4] x86/asm: Rewrite sync_core() to use IRET-to-self
To: Borislav Petkov
Cc: Andy Lutomirski, X86 ML, One Thousand Gnomes, linux-kernel@vger.kernel.org,
    Brian Gerst, Matthew Whitehead, Henrique de Moraes Holschuh, Peter Zijlstra,
    xen-devel, Juergen Gross, Boris Ostrovsky, Andrew Cooper, "H. Peter Anvin"

On Mon, Dec 5, 2016 at 11:52 PM, Borislav Petkov wrote:
> On Mon, Dec 05, 2016 at 01:32:43PM -0800, Andy Lutomirski wrote:
>> Aside from being excessively slow, CPUID is problematic: Linux runs
>> on a handful of CPUs that don't have CPUID. Use IRET-to-self
>> instead. IRET-to-self works everywhere, so it makes testing easy.
>>
>> For reference, on my laptop, IRET-to-self is ~110ns,
>> CPUID(eax=1, ecx=0) is ~83ns on native and very, very slow under KVM,
>> and MOV-to-CR2 is ~42ns.
>>
>> While we're at it: sync_core() serves a very specific purpose.
>> Document it.
>>
>> Cc: "H. Peter Anvin"
>> Signed-off-by: Andy Lutomirski
>> ---
>>  arch/x86/include/asm/processor.h | 77 ++++++++++++++++++++++++++++------------
>>  1 file changed, 55 insertions(+), 22 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
>> index 64fbc937d586..201a956e345f 100644
>> --- a/arch/x86/include/asm/processor.h
>> +++ b/arch/x86/include/asm/processor.h
>> @@ -590,33 +590,66 @@ static __always_inline void cpu_relax(void)
>>
>>  #define cpu_relax_lowlatency() cpu_relax()
>>
>> -/* Stop speculative execution and prefetching of modified code. */
>> +/*
>> + * This function forces the icache and prefetched instruction stream to
>> + * catch up with reality in two very specific cases:
>> + *
>> + * a) Text was modified using one virtual address and is about to be
>> + *    executed from the same physical page at a different virtual address.
>> + *
>> + * b) Text was modified on a different CPU, may subsequently be
>> + *    executed on this CPU, and you want to make sure the new version
>> + *    gets executed.  This generally means you're calling this in an IPI.
>> + *
>> + * If you're calling this for a different reason, you're probably doing
>> + * it wrong.
>
> "... and think hard before you call this - it is slow."
>
> I'd add that, now that it is even slower than CPUID.

But only barely.  And it's hugely faster than CPUID under KVM or
similar.  And it works on all CPUs.

>
>> + */
>>  static inline void sync_core(void)
>>  {
>> -	int tmp;
>> -
>> -#ifdef CONFIG_X86_32
>>  	/*
>> -	 * Do a CPUID if available, otherwise do a jump.  The jump
>> -	 * can conveniently enough be the jump around CPUID.
>> +	 * There are quite a few ways to do this.  IRET-to-self is nice
>> +	 * because it works on every CPU, at any CPL (so it's compatible
>> +	 * with paravirtualization), and it never exits to a hypervisor.
>> +	 * The only downsides are that it's a bit slow (it seems to be
>> +	 * a bit more than 2x slower than the fastest options) and that
>> +	 * it unmasks NMIs.
>
> Ewww, I hadn't thought of that angle. Aren't we going to get into all
> kinds of hard-to-debug issues due to that couple-of-cycles window of
> unmasked NMIs?
>
> We sync_core() in some NMI handler and then right in the middle of it
> we get another NMI. Yeah, we still have the nested-NMI stuff, but I'd
> like to avoid complications if possible.

I'll counter with:

1. Why on earth would an NMI handler call sync_core()?

2. We have very careful code to handle this issue, and NMIs really can
cause page faults, which have exactly the same problem.

So I'm not too worried.
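(For reference, the sequence in question is roughly the following.
This is a simplified 32-bit sketch, minus the exact asm constraints
and stack-pointer bookkeeping in the actual patch:

	asm volatile (
		"pushfl\n\t"		/* EFLAGS to return with */
		"pushl %%cs\n\t"	/* current CS, whatever it is */
		"pushl $1f\n\t"		/* EIP to resume at */
		"iret\n\t"		/* pops EIP/CS/EFLAGS; serializing */
		"1:"
		: : : "memory");

The IRET at the end is the serializing part, and it's also the point
at which NMIs get unmasked.)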
>
>> The "push %cs" is needed because, in
>> + * paravirtual environments, __KERNEL_CS may not be a valid CS
>> + * value when we do IRET directly.
>> + *
>> + * In case NMI unmasking or performance ever becomes a problem,
>> + * the next best option appears to be MOV-to-CR2 and an
>> + * unconditional jump.  That sequence also works on all CPUs,
>> + * but it will fault at CPL3.
>
> Does it really have to be non-privileged?

Unless we want to declare lguest unsupported, delete it from the tree,
or, sigh, actually maintain it, then yes :(  native_write_cr2() would
work on Xen, but it's slow.

>
> If not, there are a couple more serializing insns:
>
> "• Privileged serializing instructions — INVD, INVEPT, INVLPG,
> INVVPID, LGDT, LIDT, LLDT, LTR, MOV (to control register, with the
> exception of MOV CR8), MOV (to debug register), WBINVD, and WRMSR"
>
> What about INVD? It is expensive too :-)

Only if you write the patch and label it:

Snickered-at-by: Andy Lutomirski

>
> Can't we use MOV %dr or so? With previously saving/restoring the reg?
>
> WBINVD could be another candidate, albeit a big hammer.
>
> WRMSR maybe too. If it faults, it's fine too, because then you get the
> IRET automagically. Hell, we could even make it fault on purpose by
> writing to an invalid MSR, but then we're back to unmasking NMIs...
> :-\

I still like MOV-to-CR2 better than all of these.

--Andy
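P.S. For concreteness, the MOV-to-CR2 fallback would look something
like this.  This is a hypothetical sketch, not something the patch
implements, and the sync_core_cr2() name is made up:

	/* Serialize at CPL0 without unmasking NMIs (hypothetical). */
	static inline void sync_core_cr2(void)
	{
		unsigned long tmp;

		asm volatile (
			"mov %%cr2, %0\n\t"	/* save CR2 so we don't clobber it */
			"mov %0, %%cr2\n\t"	/* MOV to a control register serializes */
			"jmp 1f\n\t"		/* unconditional jump flushes prefetch on old CPUs */
			"1:"
			: "=r" (tmp) : : "memory");
	}

Unlike IRET-to-self, this leaves NMIs masked, but MOV to CR2 is
privileged, which is why it faults at CPL3.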