Date: Fri, 5 Jan 2018 02:28:24 -0800
From: Paul Turner
To: David Woodhouse
Cc: Alexei Starovoitov, Linus Torvalds, Andi Kleen, LKML,
    Greg Kroah-Hartman, Tim Chen, Dave Hansen, Thomas Gleixner,
    Kees Cook, Rik van Riel, Peter Zijlstra, Andy Lutomirski,
    Jiri Kosina, One Thousand Gnomes
Subject: Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support
Message-ID: <20180105102824.GA247671@google.com>
References: <1515058213.12987.89.camel@amazon.co.uk>
 <20180104143710.8961-1-dwmw@amazon.co.uk>
 <20180104181744.komdplek7nfdvlsw@ast-mbp>
 <20180104183559.wlqoxmp7rf4d44ku@ast-mbp>
 <1515094078.29312.17.camel@infradead.org>
In-Reply-To: <1515094078.29312.17.camel@infradead.org>

On Thu, Jan 04, 2018 at 07:27:58PM +0000, David Woodhouse wrote:
> On Thu, 2018-01-04 at 10:36 -0800, Alexei Starovoitov wrote:
> >
> > Pretty much.
> > Paul's writeup: https://support.google.com/faqs/answer/7625886
> > tldr: jmp *%r11 gets converted to:
> > call set_up_target;
> > capture_spec:
> >   pause;
> >   jmp capture_spec;
> > set_up_target:
> >   mov %r11, (%rsp);
> >   ret;
> > where capture_spec part will be looping speculatively.
>
> That is almost identical to what's in my latest patch set, except that
> the capture_spec loop has 'lfence' instead of 'pause'.

When choosing this sequence I benchmarked several alternatives, including
nothing, nops, fences, and other serializing instructions such as cpuid.
The "pause; jmp" sequence proved minutely faster than "lfence; jmp", which
is why it was chosen.

  "pause; jmp"   33.231 cycles/call    9.517 ns/call
  "lfence; jmp"  33.354 cycles/call    9.552 ns/call

(Timings are for a complete retpolined indirect branch.)

> As Andi says, I'd want to see explicit approval from the CPU architects
> for making that change.

Beyond guaranteeing that speculative execution is constrained, the choice
of sequence here is a performance detail and not one of correctness.

> We've already had false starts there — for a long time, Intel thought
> that a much simpler option with an lfence after the register load was
> sufficient, and then eventually worked out that in some rare cases it
> wasn't. While AMD still seem to think it *is* sufficient for them,
> apparently.

As an interesting aside, the fact that speculation proceeds beyond an
lfence can be trivially proven using the timings above. If we substitute
only "lfence" (with no jmp), we see:

  "lfence"       29.573 cycles/call    8.469 ns/call

Now, the only way for this timing to be different is if speculation beyond
the lfence is being executed differently. That said, while this is a
negative result, it does suggest that the jmp is contributing a
larger-than-realized cost to our speculative loop. We can likely shave off
some additional time with some unrolling.
I did try this previously but did not see results above the noise floor;
it seems worth trying again, and I will take a look tomorrow.
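
To make the unrolling idea concrete, a rough sketch of what an unrolled
capture loop might look like is below (AT&T syntax; the label and symbol
names are illustrative only and are not taken from David's patch set):

    retpoline_r11:                      /* illustrative name, sketch only */
            call    1f                  /* return address (2: below) is the predicted target */
    2:      pause                       /* reached only speculatively, via the return prediction */
            pause
            pause
            pause                       /* unrolled: fewer jmps per speculative pass */
            jmp     2b
    1:      mov     %r11, (%rsp)        /* overwrite the return address with the real target */
            ret                         /* architecturally lands at *%r11 */

Since only the call/mov/ret path ever retires, the extra pauses are
architecturally free; they only change how much of the speculative loop's
time goes to the jmp, which is what would need to be re-measured.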