From: Andy Lutomirski
Date: Wed, 5 Oct 2016 08:59:33 -0700
Subject: Re: [PATCH 2/9] x86/fpu: Hard-disable lazy fpu mode
To: Paolo Bonzini
Cc: Rik van Riel, "linux-kernel@vger.kernel.org", Dave Hansen, X86 ML, Thomas Gleixner, Ingo Molnar, Andrew Lutomirski, pa@zytor.com, Borislav Petkov

On Wed, Oct 5, 2016 at 7:03 AM, Paolo Bonzini wrote:
>
> On 05/10/2016 15:57, Rik van Riel wrote:
>> On Wed, 2016-10-05 at 09:14 +0200, Paolo Bonzini wrote:
>>>
>>> On 05/10/2016 02:34, riel@redhat.com wrote:
>>>>
>>>> From: Andy Lutomirski
>>>>
>>>> Since commit 58122bf1d856 ("x86/fpu: Default eagerfpu=on on all
>>>> CPUs") in Linux 4.6, eager FPU mode has been the default on all x86
>>>> systems, and no one has reported any regressions.
>>>>
>>>> This patch removes the ability to enable lazy mode: use_eager_fpu()
>>>> becomes "return true" and all of the FPU mode selection machinery
>>>> is removed.
>>>
>>> I haven't quite followed up on my promise to benchmark lazy vs. eager
>>> FPU, but I probably should do that now...
>>>
>>> I see two possible issues with this. First, AMD as far as I know does
>>> not have XSAVEOPT. Second, when using virtualization, depending on
>>> how you configure your cluster it's enough to have one pre-SandyBridge
>>> Intel machine to force no XSAVE on all machines.
>>
>> The "OPT" part of XSAVEOPT does not work across the
>> host/guest boundary, anyway.
>
> Yes, but it works for bare metal (and in fact eager FPU was keyed on
> XSAVEOPT before 58122bf1d856, not XSAVE).
>
> I'm not talking about KVM here; I am just saying that the lazy FPU code
> might be used more than we'd like to, because of AMD machines and of
> cases where XSAVE is hidden altogether from guests. Of course it is
> quite unlikely that it be reported as a regression, since things just
> work. But as far as I know 58122bf1d856 went in without any substantial
> (or not-so-substantial) benchmarking.

I actually benchmarked the underlying instructions quite a bit on
Intel. (Not on AMD, but I doubt the results are very different.)
Writes to CR0.TS are *incredibly* slow, as are device-not-available
exceptions. Keep in mind that, while there's a (slow) CLTS instruction,
there is no corresponding STTS instruction, so we're left with a fully
serializing, slowly microcoded move to CR0. On SVM, I think it's worse,
because IIRC SVM doesn't have fancy execution controls that let MOV to
CR0 avoid exiting. We're talking a couple hundred cycles best case for
a TS set/clear pair, and thousands of cycles if we actually take a
fault.

In contrast, an unconditional XSAVE + XRSTOR was considerably faster.
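(For concreteness, a minimal sketch of what the TS set/clear pair boils
down to. This is illustrative only: it has to run at CPL 0, and the
helper names here are mine, not necessarily the kernel's exact ones.)

/* Sketch only: must run in kernel context (CPL 0). */
static inline unsigned long sketch_read_cr0(void)
{
        unsigned long val;

        asm volatile("mov %%cr0, %0" : "=r" (val));
        return val;
}

static inline void sketch_write_cr0(unsigned long val)
{
        /* MOV to CR0 is fully serializing and microcoded -- this is
         * where the "couple hundred cycles" go. */
        asm volatile("mov %0, %%cr0" : : "r" (val));
}

#define SKETCH_CR0_TS   (1UL << 3)      /* CR0.TS is bit 3 */

static inline void sketch_stts(void)
{
        /* There is no STTS instruction, so setting TS means a full
         * read-modify-write of CR0. */
        sketch_write_cr0(sketch_read_cr0() | SKETCH_CR0_TS);
}

static inline void sketch_clts(void)
{
        /* CLTS clears TS directly, but is itself slow. */
        asm volatile("clts");
}

A lazy switch leaves TS set and relies on the #NM (device-not-available)
fault on the next FPU instruction to trigger the restore; that fault
path is the "thousands of cycles" case above.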
This leads to the counterintuitive result that, if we switch from task
A to B and back and task A is heavily using the FPU, then it's faster
to unconditionally save and restore the full state both ways than it
is to set and clear TS so we can avoid it.

I would guess that the lazy mode hasn't been a win under most workloads
for many years. It's worse on 64-bit CPUs, since almost all userspace
uses XMM regs for memcpy. At least on 32-bit CPUs, SIMD instructions
weren't always available and userspace was conservative.

--Andy
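(For anyone who wants a feel for the XSAVE + XRSTOR side of this, a
rough userspace timing sketch follows. It is mine, not the benchmark
described above; it assumes an x86-64 CPU and kernel with XSAVE
enabled, only times the x87/SSE components, and cannot show the
CR0.TS / #NM costs, which need ring 0.)

/* Build: gcc -O2 xsave_timing.c
 * Will die with SIGILL on a CPU/kernel without XSAVE enabled. */
#include <stdio.h>

static unsigned char xarea[4096] __attribute__((aligned(64)));

static inline unsigned long long rdtsc(void)
{
        unsigned int lo, hi;

        asm volatile("lfence\n\trdtsc" : "=a" (lo), "=d" (hi));
        return ((unsigned long long)hi << 32) | lo;
}

static inline void do_xsave(void *buf, unsigned int lo, unsigned int hi)
{
        asm volatile("xsave (%0)"
                     : : "r" (buf), "a" (lo), "d" (hi) : "memory");
}

static inline void do_xrstor(void *buf, unsigned int lo, unsigned int hi)
{
        asm volatile("xrstor (%0)"
                     : : "r" (buf), "a" (lo), "d" (hi) : "memory");
}

int main(void)
{
        /* Requested-feature bitmap: x87 + SSE only, which any
         * XSAVE-capable CPU/OS combination supports. */
        const unsigned int mask_lo = 0x3, mask_hi = 0;
        const long iters = 1000000;
        unsigned long long start, end;
        long i;

        do_xsave(xarea, mask_lo, mask_hi);      /* write a valid header */

        start = rdtsc();
        for (i = 0; i < iters; i++) {
                do_xsave(xarea, mask_lo, mask_hi);
                do_xrstor(xarea, mask_lo, mask_hi);
        }
        end = rdtsc();

        printf("~%llu cycles per XSAVE+XRSTOR pair\n",
               (end - start) / iters);
        return 0;
}

Run it pinned to one CPU (e.g. taskset -c 0) so the TSC readings stay
stable. Timing the lazy path needs a kernel module, since CR0 writes
and #NM handling are privileged.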