From: Andy Lutomirski
Subject: Re: HalfSipHash Acceptable Usage
Date: Wed, 21 Dec 2016 17:54:26 -0800
References: <20161221155540.29529.qmail@ns.sciencehorizons.net>
Reply-To: kernel-hardening@lists.openwall.com
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Cc: George Spelvin, "Jason A. Donenfeld", Andi Kleen, David Miller,
 David Laight, "Daniel J . Bernstein", Eric Biggers, Eric Dumazet,
 Hannes Frederic Sowa, Jean-Philippe Aumasson,
 "kernel-hardening@lists.openwall.com", Linux Crypto Mailing List,
 Linux Kernel Mailing List, Network Development, Tom Herbert,
 "Theodore Ts'o", Vegard Nossum
To: Linus Torvalds
List-Id: linux-crypto.vger.kernel.org

On Wed, Dec 21, 2016 at 9:25 AM, Linus Torvalds wrote:
> On Wed, Dec 21, 2016 at 7:55 AM, George Spelvin wrote:
>>
>> How much does kernel_fpu_begin()/kernel_fpu_end() cost?
>
> It's now better than it used to be, but it's absolutely disastrous
> still. We're talking easily many hundreds of cycles. Under some loads,
> thousands.
>
> And I warn you already: it will _benchmark_ a hell of a lot better
> than it will work in reality. In benchmarks, you'll hit all the
> optimizations ("oh, I've already saved away all the FP registers, no
> need to do it again").
>
> In contrast, in reality, especially with things like "do it once or
> twice per incoming packet", you'll easily hit the absolute worst
> cases, where not only does it take a few hundred cycles to save the FP
> state, you'll then return to user space in between packets, which
> triggers the slow-path return code and reloads the FP state, which is
> another few hundred cycles plus.

Hah, you're thinking that the x86 code works the way that Rik and I
want it to work, and you just made my day. :)

What actually happens is that the state is saved in kernel_fpu_begin()
and restored in kernel_fpu_end(), and it'll take a few hundred cycles
best case.
If you do it a bunch of times in a loop, you *might* trigger a CPU
optimization that notices that the state being saved is the same state
that was just restored, but you're still going to pay the full restore
cost each round trip no matter what.

The code is much clearer in 4.10 kernels now that I deleted the unused
"lazy" branches.

>
> Similarly, in benchmarks you'll hit the "modern CPU's power on the AVX
> unit and keep it powered up for a while afterwards", while in real
> life you would quite easily hit the "oh, AVX is powered down because
> we were idle, now it powers up at half speed which is another latency
> hit _and_ the AVX unit won't run full out anyway".

I *think* that was mostly fixed in Broadwell or thereabouts (in terms
of latency -- throughput and power consumption still suffer).
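
To put rough numbers on the save-in-begin, restore-in-end behaviour
described above, here is a small user-space cost model. It is only a
sketch and not kernel code: the struct, the function names, and every
cycle count are hypothetical and picked for illustration. The one thing
it takes from the discussion is that each kernel_fpu_begin()/
kernel_fpu_end() pair is charged a full save plus a full restore, with
no amortization across repeated calls.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical cost model. None of these fields correspond to real
 * kernel structures; the values a caller fills in are assumptions.
 */
struct fpu_cost_model {
	uint64_t save_cycles;        /* assumed cost of the save in kernel_fpu_begin() */
	uint64_t restore_cycles;     /* assumed cost of the restore in kernel_fpu_end() */
	uint64_t simd_work_cycles;   /* assumed cost of one hash using SIMD registers */
	uint64_t scalar_work_cycles; /* assumed cost of one hash using integer registers */
};

/*
 * Total cycles for `calls` independent SIMD sections. Every section
 * pays the full save and the full restore, so the overhead scales
 * linearly with the number of calls.
 */
uint64_t simd_total_cycles(const struct fpu_cost_model *m, uint64_t calls)
{
	assert(m != NULL);
	return calls * (m->save_cycles + m->restore_cycles + m->simd_work_cycles);
}

/* Total cycles for the same number of calls done with integer code only. */
uint64_t scalar_total_cycles(const struct fpu_cost_model *m, uint64_t calls)
{
	assert(m != NULL);
	return calls * m->scalar_work_cycles;
}

/* Returns 1 if a single SIMD call beats a single scalar call, else 0. */
int simd_is_worthwhile(const struct fpu_cost_model *m)
{
	assert(m != NULL);
	return m->save_cycles + m->restore_cycles + m->simd_work_cycles
		< m->scalar_work_cycles;
}
```

With assumed values of 200 cycles each for save and restore, 20 cycles
of SIMD work and 60 cycles of scalar work, a single SIMD call costs 420
cycles against 60 for the scalar one, and the gap does not close however
many calls are made. Real costs depend on the CPU and the size of the
FP state, so measure rather than trust these placeholders.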