Date: Tue, 28 Apr 2015 18:58:07 +0200
From: Borislav Petkov
To: Linus Torvalds
Cc: "H. Peter Anvin", Andy Lutomirski, X86 ML, Denys Vlasenko,
    Brian Gerst, Ingo Molnar, Steven Rostedt, Oleg Nesterov,
    Frederic Weisbecker, Alexei Starovoitov, Will Drewry, Kees Cook,
    Linux Kernel Mailing List, Mel Gorman
Subject: Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue
Message-ID: <20150428165807.GI19025@pd.tnic>

On Tue, Apr 28, 2015 at 09:28:52AM -0700, Linus Torvalds wrote:
> On Tue, Apr 28, 2015 at 8:55 AM, Borislav Petkov wrote:
> >
> > Provided it is correct, it shows that the 0x66-prefixed 3-byte NOPs are
> > better than the 0F 1F 00 suggested by the manual (Haha!):
>
> That's which AMD CPU?

F16h.

> On my intel i7-4770S, they are the same cost (I cut down your loop
> numbers by an order of magnitude each because I couldn't be arsed to
> wait for it, so it might be off by a cycle or two):
>
>     Running 60 times, 1000000 loops per run.
>     nop_0x90 average: 81.065681
>     nop_3_byte average: 80.230101
>
> That said, I think your benchmark tests the speed of "rdtsc" rather
> than the no-ops. Putting the read_tsc inside the inner loop basically
> makes it swamp everything else.

Whoops, now that you mention it... of course, that RDTSC *along* with
the barriers around it is much much more expensive than the NOPs.

> > $ taskset -c 3 ./nops
> > Running 600 times, 10000000 loops per run.
> > nop_0x90 average: 439.805220
> > nop_3_byte average: 442.412915
>
> I think that's in the noise, and could be explained by random
> alignment of the loop too, or even random factors like "the CPU heated
> up, so the later run was slightly slower". The difference between 439
> and 442 doesn't strike me as all that significant.
>
> It might be better to *not* inline, and instead make a real function
> call to something that has a lot of no-ops (do some preprocessor magic
> to make more no-ops in one go). At least that way the alignment is
> likely the same for the two cases.

malloc a page, populate it with NOPs, slap a RET at the end and jump to
it? Maybe even more than 1 page?

> Or if not that, then I think you're better off with something like
>
>     p1 = read_tsc();
>     for (i = 0; i < LOOPS; i++) {
>         nop_0x90();
>
>     }
>     p2 = read_tsc();
>     r = (p2 - p1);
>
> because while you're now measuring the loop overhead too, that's
> *much* smaller than the rdtsc overhead. So I get something like

Yap, that looks better.

>     Running 600 times, 1000000 loops per run.
>     nop_0x90 average: 3.786935
>     nop_3_byte average: 3.677228
>
> and notice the difference between "~80 cycles" and "~3.7 cycles".
> Yeah, that's rdtsc. I bet your 440 is about the same thing too.
>
> Btw, the whole thing about "averaging cycles" is not the right thing
> to do either. You should probably take the *minimum* cycles count, not
> the average, because anything non-minimal means "some perturbation"
> (ie interrupt etc).

My train of thought was: if you do a *lot* of runs, perturbations would
average out. But ok, noted.

> So I think something like the attached would be better. It gives an
> approximate "cycles per one four-byte nop", and I get
>
>     [torvalds@i7 ~]$ taskset -c 3 ./a.out
>     Running 60 times, 1000000 loops per run.
>     nop_0x90 average: 0.200479
>     nop_3_byte average: 0.199694
>
> which sounds suspiciously good to me (5 nops per cycle? uop cache and
> nop compression, I guess).

Well, AFAIK, NOPs do require resources for tracking in the machine. I
was hoping that hw would be smarter and discard them at decode time but
there probably are reasons that it can't be done (...yet). So they most
likely get discarded at retire time and I can't imagine what an
otherwise relatively idle core's ROB with a gazillion NOPs would look
like. Those things need hw traces. Maybe in another life. :-)

$ taskset -c 3 ./t
Running 60 times, 1000000 loops per run.
nop_0x90 average: 0.390625
nop_3_byte average: 0.390625

and those exact numbers are actually reproducible pretty reliably.

$ taskset -c 3 ./t
Running 60 times, 1000000 loops per run.
nop_0x90 average: 0.390625
nop_3_byte average: 0.390625

$ taskset -c 3 ./t
Running 60 times, 1000000 loops per run.
nop_0x90 average: 0.390625
nop_3_byte average: 0.390625

$ taskset -c 3 ./t
Running 60 times, 1000000 loops per run.
nop_0x90 average: 0.390625
nop_3_byte average: 0.390625

Hmm, so what are we saying? Modern CPUs should use one set of NOPs and
that's it...

Maybe we need to do more measurements...

Hmmm.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--
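
For reference, the measurement shape discussed above (timer reads only
outside the inner loop, and the minimum over many runs rather than the
average) could be sketched roughly as below. This is a minimal sketch
and not the attached test program. The names read_ticks(),
time_one_run(), min_sample() and mean_sample() are hypothetical, and
clock_gettime(CLOCK_MONOTONIC) is swapped in as a portable stand-in for
read_tsc(), so the units are nanoseconds and not cycles. The inline asm
assumes GCC or Clang.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

#define LOOPS 1000000u  /* inner-loop iterations per run, as in the thread */

/* Hypothetical stand-in for read_tsc(): monotonic time in nanoseconds. */
uint64_t read_ticks(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* One run: the timer is read once before and once after the whole loop,
 * so its cost is paid per run and not per iteration. */
uint64_t time_one_run(void)
{
    uint64_t p1 = read_ticks();
    for (unsigned int i = 0; i < LOOPS; i++)
        __asm__ volatile("nop");  /* plain one-instruction NOP */
    uint64_t p2 = read_ticks();
    return p2 - p1;
}

/* Smallest sample: the run least disturbed by interrupts and the like. */
uint64_t min_sample(const uint64_t *s, size_t n)
{
    assert(n > 0);
    uint64_t m = s[0];
    for (size_t i = 1; i < n; i++)
        if (s[i] < m)
            m = s[i];
    return m;
}

/* Arithmetic mean, for comparison; a single outlier pulls it upward. */
double mean_sample(const uint64_t *s, size_t n)
{
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += (double)s[i];
    return sum / (double)n;
}
```

A caller would fill an array with time_one_run() results for each run
and report min_sample() divided by LOOPS. With samples such as
{5, 3, 9, 3, 100}, the minimum stays at 3 while the mean is dragged up
to 24 by the one perturbed run.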