Subject: Re: [PATCH 2/9] x86/fpu: Hard-disable lazy fpu mode
From: Paolo Bonzini
To: Andy Lutomirski
Cc: Rik van Riel, linux-kernel@vger.kernel.org, Dave Hansen, X86 ML,
    Thomas Gleixner, Ingo Molnar, Andrew Lutomirski, pa@zytor.com,
    Borislav Petkov
Date: Wed, 5 Oct 2016 18:09:50 +0200

On 05/10/2016 17:59, Andy Lutomirski wrote:
> I actually benchmarked the underlying instructions quite a bit on
> Intel. (Not on AMD, but I doubt the results are very different.)
> Writes to CR0.TS are *incredibly* slow, as are device-not-available
> exceptions. Keep in mind that, while there's a (slow) CLTS
> instruction, there is no corresponding STTS instruction, so we're left
> with a fully serializing, slowly microcoded move to CR0. On SVM, I
> think it's worse, because IIRC SVM doesn't have fancy execution
> controls that let MOV to CR0 avoid exiting.

SVM lets you choose whether to trap on TS and MP; update_cr0_intercept
is where KVM does that (the "selective CR0 write" intercept is always
on, while the "CR0 write" intercept is toggled in that function).

> We're talking a couple
> hundred cycles best case for a TS set/clear pair, and thousands of
> cycles if we actually take a fault.
>
> In contrast, an unconditional XSAVE + XRSTOR was considerably faster.
> This leads to the counterintuitive result that, if we switch from task
> A to B and back and task A is heavily using the FPU, then it's faster
> to unconditionally save and restore the full state both ways than it
> is to set and clear TS so we can avoid it.

Did you also do a comparison against FXSAVE/FXRSTOR (on either pre- or
post-Sandy Bridge processors)?

But yeah, it's possible that the lack of STTS screws the whole plan,
despite the fpu.preload optimization in switch_fpu_prepare.

Paolo
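
To put code next to the discussion: the TS half of the trade-off boils
down to the helpers below. The names mirror the kernel's
clts()/stts()/read_cr0()/write_cr0(), but this is an illustrative
sketch rather than the exact kernel source, and of course it can only
execute in ring 0:

#define X86_CR0_TS	0x00000008UL	/* CR0 bit 3, Task Switched */

/* Clearing TS has a dedicated (if slow) instruction. */
static inline void clts(void)
{
	asm volatile ("clts");
}

static inline unsigned long read_cr0(void)
{
	unsigned long cr0;

	asm volatile ("mov %%cr0, %0" : "=r" (cr0));
	return cr0;
}

/* MOV to CR0 is fully serializing and slowly microcoded. */
static inline void write_cr0(unsigned long cr0)
{
	asm volatile ("mov %0, %%cr0" : : "r" (cr0));
}

/*
 * There is no STTS instruction, so setting TS back is a full
 * read-modify-write of CR0 -- this is the expensive half.
 */
static inline void stts(void)
{
	write_cr0(read_cr0() | X86_CR0_TS);
}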
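
The SVM side I mentioned looks roughly like this; it is a paraphrase
from memory of update_cr0_intercept() with stand-in types and helpers
(the real code lives in arch/x86/kvm/svm.c), but it shows the idea:
copy the guest's TS/MP bits into the CR0 the hardware sees, and only
keep the full "CR0 write" intercept when guest and hardware CR0 still
disagree elsewhere:

#define X86_CR0_MP	0x00000002UL	/* CR0 bit 1, Monitor Coprocessor */

/* TS and MP are what the "selective CR0 write" intercept watches. */
#define SVM_CR0_SELECTIVE_MASK	(X86_CR0_TS | X86_CR0_MP)

static void update_cr0_intercept(struct vcpu_svm *svm)
{
	unsigned long gcr0 = svm->vcpu.arch.cr0;  /* guest's view of CR0 */
	u64 *hcr0 = &svm->vmcb->save.cr0;         /* CR0 the hardware uses */

	/* Let the guest's TS/MP bits through to the hardware CR0. */
	*hcr0 = (*hcr0 & ~SVM_CR0_SELECTIVE_MASK)
		| (gcr0 & SVM_CR0_SELECTIVE_MASK);

	if (gcr0 == *hcr0) {
		/*
		 * Guest and hardware agree: plain CR0 writes need not
		 * exit; the always-on selective intercept still catches
		 * writes that change TS or MP.
		 */
		clr_cr_intercept(svm, INTERCEPT_CR0_WRITE);
	} else {
		set_cr_intercept(svm, INTERCEPT_CR0_WRITE);
	}
}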
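
On the measurement question, the FXSAVE/FXRSTOR vs. XSAVE/XRSTOR half
is easy to reproduce from user space (the CR0.TS half is not, since
CLTS and MOV to CR0 are ring-0 only). A rough microbenchmark along
these lines -- buffer size, alignment and iteration count are
illustrative, and plain RDTSC without serialization only gives
ballpark numbers:

#include <stdint.h>
#include <stdio.h>

/* 64-byte alignment satisfies XSAVE; FXSAVE only needs 16. */
static struct { uint8_t bytes[4096]; } buf __attribute__((aligned(64)));

static inline uint64_t rdtsc(void)
{
	uint32_t lo, hi;

	asm volatile ("rdtsc" : "=a" (lo), "=d" (hi));
	return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
	const int n = 100000;
	uint64_t t0, t1;
	int i;

	t0 = rdtsc();
	for (i = 0; i < n; i++)
		asm volatile ("fxsave %0; fxrstor %0" : "+m" (buf));
	t1 = rdtsc();
	printf("fxsave+fxrstor: %llu cycles/iter\n",
	       (unsigned long long)(t1 - t0) / n);

	t0 = rdtsc();
	for (i = 0; i < n; i++) {
		/* RFBM = all ones in edx:eax: every enabled component. */
		asm volatile ("xsave %0" : "+m" (buf) : "a" (-1), "d" (-1));
		asm volatile ("xrstor %0" : : "m" (buf), "a" (-1), "d" (-1));
	}
	t1 = rdtsc();
	printf("xsave+xrstor:   %llu cycles/iter\n",
	       (unsigned long long)(t1 - t0) / n);

	return 0;
}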
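
Finally, the counterintuitive result in code form. This is a
deliberately stripped-down sketch of the two strategies with a
hypothetical task structure and hypothetical xsave()/xrstor() helpers,
not the real switch_fpu_prepare/switch_fpu_finish pair (which also has
the fpu.preload heuristic layered on top):

/* Hypothetical task representation, for illustration only. */
struct task {
	int used_fpu;			/* did the task touch the FPU? */
	struct xsave_area *fpu;		/* saved extended state */
};

/* Lazy: skip the restore, pay for stts() now and maybe #NM later. */
static void lazy_switch(struct task *prev, struct task *next)
{
	if (prev->used_fpu)
		xsave(prev->fpu);	/* save only if prev used it */
	stts();				/* slow read-modify-write of CR0 */
	/*
	 * If next then touches the FPU, it takes a device-not-available
	 * fault (#NM, thousands of cycles), whose handler does clts()
	 * and xrstor(next->fpu).
	 */
}

/* Eager: unconditional save + restore; no TS writes, no faults. */
static void eager_switch(struct task *prev, struct task *next)
{
	xsave(prev->fpu);
	xrstor(next->fpu);
}

With the A->B->A pattern Andy describes, the lazy path eats two CR0
writes plus an #NM per round trip, while the eager path pays only for
the XSAVE/XRSTOR pairs -- which is why eager wins as soon as the task
actually uses the FPU.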