Subject: Re: [PATCH 2/9] x86/fpu: Hard-disable lazy fpu mode
From: Paolo Bonzini
To: Andy Lutomirski
Cc: Rik van Riel, linux-kernel@vger.kernel.org, Dave Hansen, X86 ML,
    Thomas Gleixner, Ingo Molnar, Andrew Lutomirski, pa@zytor.com,
    Borislav Petkov
Date: Wed, 5 Oct 2016 18:09:50 +0200

On 05/10/2016 17:59, Andy Lutomirski wrote:
> I actually benchmarked the underlying instructions quite a bit on
> Intel. (Not on AMD, but I doubt the results are very different.)
> Writes to CR0.TS are *incredibly* slow, as are device-not-available
> exceptions. Keep in mind that, while there's a (slow) CLTS
> instruction, there is no corresponding STTS instruction, so we're left
> with a fully serializing, slowly microcoded move to CR0. On SVM, I
> think it's worse, because IIRC SVM doesn't have fancy execution
> controls that let MOV to CR0 avoid exiting.

SVM lets you choose whether to trap on TS and MP; update_cr0_intercept
is where KVM does that (the "selective CR0 write" intercept is always
on, while the "CR0 write" intercept is toggled in that function).

> We're talking a couple
> hundred cycles best case for a TS set/clear pair, and thousands of
> cycles if we actually take a fault.
>
> In contrast, an unconditional XSAVE + XRSTOR was considerably faster.
> This leads to the counterintuitive result that, if we switch from task
> A to B and back and task A is heavily using the FPU, then it's faster
> to unconditionally save and restore the full state both ways than it
> is to set and clear TS so we can avoid it.

Did you also do a comparison against FXSAVE/FXRSTOR (on either pre- or
post-Sandy Bridge processors)?

But yeah, it's possible that the lack of STTS screws the whole plan,
despite the fpu.preload optimization in switch_fpu_prepare.

Paolo
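
To put code next to the discussion: the TS half of the trade-off boils
down to the helpers below. The names mirror the kernel's
clts()/stts()/read_cr0()/write_cr0(), but this is an illustrative
sketch rather than the exact kernel source, and of course it can only
execute in ring 0:

#define X86_CR0_TS	0x00000008UL	/* CR0 bit 3, Task Switched */

/* Clearing TS has a dedicated (if slow) instruction. */
static inline void clts(void)
{
	asm volatile ("clts");
}

static inline unsigned long read_cr0(void)
{
	unsigned long cr0;

	asm volatile ("mov %%cr0, %0" : "=r" (cr0));
	return cr0;
}

/* MOV to CR0 is fully serializing and slowly microcoded. */
static inline void write_cr0(unsigned long cr0)
{
	asm volatile ("mov %0, %%cr0" : : "r" (cr0));
}

/*
 * There is no STTS instruction, so setting TS back is a full
 * read-modify-write of CR0 -- this is the expensive half.
 */
static inline void stts(void)
{
	write_cr0(read_cr0() | X86_CR0_TS);
}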
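
The SVM side I mentioned looks roughly like this; it is a paraphrase
from memory of update_cr0_intercept() with stand-in types and helpers
(the real code lives in arch/x86/kvm/svm.c), but it shows the idea:
copy the guest's TS/MP bits into the CR0 the hardware sees, and only
keep the full "CR0 write" intercept when guest and hardware CR0 still
disagree elsewhere:

#define X86_CR0_MP	0x00000002UL	/* CR0 bit 1, Monitor Coprocessor */

/* TS and MP are what the "selective CR0 write" intercept watches. */
#define SVM_CR0_SELECTIVE_MASK	(X86_CR0_TS | X86_CR0_MP)

static void update_cr0_intercept(struct vcpu_svm *svm)
{
	unsigned long gcr0 = svm->vcpu.arch.cr0;  /* guest's view of CR0 */
	u64 *hcr0 = &svm->vmcb->save.cr0;         /* CR0 the hardware uses */

	/* Let the guest's TS/MP bits through to the hardware CR0. */
	*hcr0 = (*hcr0 & ~SVM_CR0_SELECTIVE_MASK)
		| (gcr0 & SVM_CR0_SELECTIVE_MASK);

	if (gcr0 == *hcr0) {
		/*
		 * Guest and hardware agree: plain CR0 writes need not
		 * exit; the always-on selective intercept still catches
		 * writes that change TS or MP.
		 */
		clr_cr_intercept(svm, INTERCEPT_CR0_WRITE);
	} else {
		set_cr_intercept(svm, INTERCEPT_CR0_WRITE);
	}
}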
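
On the measurement question, the FXSAVE/FXRSTOR vs. XSAVE/XRSTOR half
is easy to reproduce from user space (the CR0.TS half is not, since
CLTS and MOV to CR0 are ring-0 only). A rough microbenchmark along
these lines -- buffer size, alignment and iteration count are
illustrative, and plain RDTSC without serialization only gives
ballpark numbers:

#include <stdint.h>
#include <stdio.h>

/* 64-byte alignment satisfies XSAVE; FXSAVE only needs 16. */
static struct { uint8_t bytes[4096]; } buf __attribute__((aligned(64)));

static inline uint64_t rdtsc(void)
{
	uint32_t lo, hi;

	asm volatile ("rdtsc" : "=a" (lo), "=d" (hi));
	return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
	const int n = 100000;
	uint64_t t0, t1;
	int i;

	t0 = rdtsc();
	for (i = 0; i < n; i++)
		asm volatile ("fxsave %0; fxrstor %0" : "+m" (buf));
	t1 = rdtsc();
	printf("fxsave+fxrstor: %llu cycles/iter\n",
	       (unsigned long long)(t1 - t0) / n);

	t0 = rdtsc();
	for (i = 0; i < n; i++) {
		/* RFBM = all ones in edx:eax: every enabled component. */
		asm volatile ("xsave %0" : "+m" (buf) : "a" (-1), "d" (-1));
		asm volatile ("xrstor %0" : : "m" (buf), "a" (-1), "d" (-1));
	}
	t1 = rdtsc();
	printf("xsave+xrstor:   %llu cycles/iter\n",
	       (unsigned long long)(t1 - t0) / n);

	return 0;
}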
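
Finally, the counterintuitive result in code form. This is a
deliberately stripped-down sketch of the two strategies with a
hypothetical task structure and hypothetical xsave()/xrstor() helpers,
not the real switch_fpu_prepare/switch_fpu_finish pair (which also has
the fpu.preload heuristic layered on top):

/* Hypothetical task representation, for illustration only. */
struct task {
	int used_fpu;			/* did the task touch the FPU? */
	struct xsave_area *fpu;		/* saved extended state */
};

/* Lazy: skip the restore, pay for stts() now and maybe #NM later. */
static void lazy_switch(struct task *prev, struct task *next)
{
	if (prev->used_fpu)
		xsave(prev->fpu);	/* save only if prev used it */
	stts();				/* slow read-modify-write of CR0 */
	/*
	 * If next then touches the FPU, it takes a device-not-available
	 * fault (#NM, thousands of cycles), whose handler does clts()
	 * and xrstor(next->fpu).
	 */
}

/* Eager: unconditional save + restore; no TS writes, no faults. */
static void eager_switch(struct task *prev, struct task *next)
{
	xsave(prev->fpu);
	xrstor(next->fpu);
}

With the A->B->A pattern Andy describes, the lazy path eats two CR0
writes plus an #NM per round trip, while the eager path pays only for
the XSAVE/XRSTOR pairs -- which is why eager wins as soon as the task
actually uses the FPU.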