Date: Sun, 22 Feb 2015 00:34:08 +0000 (GMT)
From: "Maciej W. Rozycki"
To: Borislav Petkov
cc: Ingo Molnar, Andy Lutomirski, Oleg Nesterov, Rik van Riel, x86@kernel.org, linux-kernel@vger.kernel.org, Linus Torvalds
Subject: Re: [RFC PATCH] x86, fpu: Use eagerfpu by default on all CPUs
In-Reply-To: <20150221172914.GB32073@pd.tnic>
References: <20150221093150.GA27841@gmail.com> <20150221163840.GA32073@pd.tnic> <20150221172914.GB32073@pd.tnic>

On Sat, 21 Feb 2015, Borislav Petkov wrote:

> Provided I've not made a mistake, this leads me to think that this
> simple workload and pretty much everything else uses the FPU through
> glibc which does the SSE memcpy and so on. Which basically kills the
> whole idea behind lazy FPU as practically you don't really encounter
> workloads nowadays which don't use the FPU thanks to glibc and the lazy
> strategy doesn't really bring anything.
>
> Which would then mean, we don't really need the lazy handling as
> userspace is making it eager, so to speak, for us.

Please correct me if I'm wrong, but it looks to me like you're confusing lazy FPU context allocation and lazy FPU context switching. These build on the same hardware principles, but they are different concepts.
Your "userspace is making it eager" statement in the context of glibc using SSE for `memcpy' is certainly true for lazy FPU context allocation. However, I wouldn't be so sure about lazy FPU context switching, and a kernel compilation (or in fact any compilation) does not appear to be a representative benchmark to me. I am sure lots of software won't be calling `memcpy' all the time, and there should be context switches between which the FPU is not referred to at all.

Also, does `__builtin_memcpy' expand to SSE as well? I'd expect it, rather than the external `memcpy', to be used by GCC for copying fixed amounts of data. That would apply especially to smaller copies, such as when passing structures by value in function calls or for string operations like `strdup' or suchlike. These I'd expect to be ubiquitous, whereas I'd expect the external `memcpy' to be called only from time to time.

Additionally, I believe long-executing FPU instructions (e.g. transcendentals) can take advantage of continuing to execute in parallel once the context has been switched. An eager FPU context switch would instead stall until the FPU instruction has completed.

And last but not least, why does the handling of CR0.TS traps have to be complicated? It does not look like rocket science to me. It should be a mere handful of instructions, and the time required to move the two FP contexts out of and into the FPU respectively should dominate processing time. Where quoted, the optimisation manual states 250 cycles for FXSAVE and FXRSTOR combined.

And of course you can install the right handler (i.e. FSAVE vs FXSAVE) at bootstrap depending on processor features, so you don't have to do all the run-time checks on every trap. You can even optimise the FSAVE handler away at build time if, based on the minimal supported processor family selected, you know it won't ever be used.
Do you happen to know, or can you determine, how much time (in clock cycles) a CR0.TS trap itself takes? That should include any time required to preserve the execution state in the handler, such as pushing/popping GPRs to/from the stack, as opposed to the processing time spent moving the FP contexts back and forth. Is there no room for improvement left there? How many task scheduling slots, say per million, must be poking at the FPU for eager FPU context switching to gain an advantage over the lazy one?

Maciej