Date: Fri, 5 Jan 2018 02:28:24 -0800
From: Paul Turner
To: David Woodhouse
Cc: Alexei Starovoitov, Linus Torvalds, Andi Kleen, LKML,
    Greg Kroah-Hartman, Tim Chen, Dave Hansen, Thomas Gleixner,
    Kees Cook, Rik van Riel, Peter Zijlstra, Andy Lutomirski,
    Jiri Kosina, One Thousand Gnomes
Subject: Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support
Message-ID: <20180105102824.GA247671@google.com>
References: <1515058213.12987.89.camel@amazon.co.uk>
 <20180104143710.8961-1-dwmw@amazon.co.uk>
 <20180104181744.komdplek7nfdvlsw@ast-mbp>
 <20180104183559.wlqoxmp7rf4d44ku@ast-mbp>
 <1515094078.29312.17.camel@infradead.org>
In-Reply-To: <1515094078.29312.17.camel@infradead.org>

On Thu, Jan 04, 2018 at 07:27:58PM +0000, David Woodhouse wrote:
> On Thu, 2018-01-04 at 10:36 -0800, Alexei Starovoitov wrote:
> >
> > Pretty much.
> > Paul's writeup: https://support.google.com/faqs/answer/7625886
> > tldr: jmp *%r11 gets converted to:
> > call set_up_target;
> > capture_spec:
> >   pause;
> >   jmp capture_spec;
> > set_up_target:
> >   mov %r11, (%rsp);
> >   ret;
> > where capture_spec part will be looping speculatively.
>
> That is almost identical to what's in my latest patch set, except that
> the capture_spec loop has 'lfence' instead of 'pause'.

When choosing this sequence I benchmarked several alternatives, including
nothing, nops, fences, and other serializing instructions such as cpuid.
The "pause; jmp" sequence proved minutely faster than "lfence; jmp", which
is why it was chosen.

  "pause; jmp"   33.231 cycles/call    9.517 ns/call
  "lfence; jmp"  33.354 cycles/call    9.552 ns/call

(Timings are for a complete retpolined indirect branch.)

> As Andi says, I'd want to see explicit approval from the CPU architects
> for making that change.

Beyond guaranteeing that speculative execution is constrained, the choice
of sequence here is a performance detail and not one of correctness.

> We've already had false starts there — for a long time, Intel thought
> that a much simpler option with an lfence after the register load was
> sufficient, and then eventually worked out that in some rare cases it
> wasn't. While AMD still seem to think it *is* sufficient for them,
> apparently.

As an interesting aside, the fact that speculation proceeds beyond an
lfence can be trivially proven using the timings above. If we substitute
only "lfence" (with no jmp), we see:

  "lfence"       29.573 cycles/call    8.469 ns/call

Now, the only way for this timing to be different is if speculation beyond
the lfence is being executed differently. That said, while this is a
negative result, it does suggest that the jmp is contributing a
larger-than-realized cost to our speculative loop. We can likely shave off
some additional time with some unrolling.
I did try this previously but did not see results above the noise floor;
it seems worth trying again, and I will take a look tomorrow.
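
To make the unrolling idea concrete, a rough sketch of what an unrolled
capture loop might look like is below (AT&T syntax; the label and symbol
names are illustrative only and are not taken from David's patch set):

    retpoline_r11:                      /* illustrative name, sketch only */
            call    1f                  /* return address (2: below) is the predicted target */
    2:      pause                       /* reached only speculatively, via the return prediction */
            pause
            pause
            pause                       /* unrolled: fewer jmps per speculative pass */
            jmp     2b
    1:      mov     %r11, (%rsp)        /* overwrite the return address with the real target */
            ret                         /* architecturally lands at *%r11 */

Since only the call/mov/ret path ever retires, the extra pauses are
architecturally free; they only change how much of the speculative loop's
time goes to the jmp, which is what would need to be re-measured.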