2018-01-04 09:11:21

by Paul Turner

[permalink] [raw]
Subject: [RFC] Retpoline: Binary mitigation for branch-target-injection (aka "Spectre")

Apologies for the discombobulation around today's disclosure. Obviously the
original goal was to communicate this a little more coherently, but the
unscheduled advances in the disclosure disrupted the efforts to pull this
together more cleanly.

I wanted to open discussion of the "retpoline" approach and define its
requirements so that we can separate the core
details from questions regarding any particular implementation thereof.

As a starting point, a full write-up describing the approach is available at:
https://support.google.com/faqs/answer/7625886

The 30 second version is:
Returns are a special type of indirect branch. As function returns are intended
to pair with function calls, processors often implement dedicated return stack
predictors. The choice of this branch prediction allows us to generate an
indirect branch in which speculative execution is intentionally redirected into
a controlled location by a return stack target that we control, preventing
branch target injection (also known as "Spectre") against these binaries.
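
For concreteness, a sketch of the construct described in the write-up, used in
place of an indirect branch such as "jmp *%rax" (labels are illustrative; the
write-up and the kernel thunks later in this thread use the same shape):

	call	2f		# pushes the address of 1: as the return target
1:	lfence			# speculative execution of the 'ret' lands here
	jmp	1b		#  and spins harmlessly until it is squashed
2:	mov	%rax, (%rsp)	# overwrite the pushed return address with the
	ret			#  real target; architecturally this is jmp *%rax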

On the targets (Intel Xeon) we have measured so far, the cost is within a few
cycles of a "native" indirect branch for which branch prediction hardware has
been disabled. This is unfortunately measurable -- from about 3 cycles on
average to about 30. However the cost is largely mitigated for many workloads,
since the kernel uses comparatively few indirect branches (versus, say, a C++
binary). With some effort we have the average overall overhead within the
0-1.5% range for our internal workloads, including some particularly
high-rate packet processing engines.

There are several components, the majority of which are independent of kernel
modifications:

(1) A compiler supporting retpoline transformations.
(1a) Optionally: annotations for hand-coded indirect jmps, so that they may be
made compatible with (1).
[ Note: The only known indirect jmp which is not safe to convert is the
early virtual address check in head entry. ]
(2) Kernel modifications for preventing return-stack underflow (see the
document above).
The key points where this occurs are:
- context switches (into protected targets)
- interrupt return (we return into potentially unwinding execution)
- sleep state exit (flushes caches)
- guest exit.
(These can be run-time gated; a full refill costs 30-45 cycles. A sketch of
such a refill sequence follows this list.)
(3) Optional: Optimizations so that direct branches can be used for hot kernel
indirects. While, as discussed above, kernel execution generally depends on
fewer indirect branches, there are a few places (in particular, the
networking stack) where we have chained sequences of indirects on hot paths.
(4) More general support for guarding against RSB underflow in an affected
target. While this is harder to exploit and may not be required for many
users, the approaches we have used here are not generally applicable.
Further discussion is required.
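
The refill in (2) can be pictured roughly as follows (a sketch only -- the
macro name is made up, and the real sequence will differ in count, unrolling
and how it is gated at run time):

	.macro REFILL_RSB_SKETCH nr=32
	.rept \nr
	call	1f		# push a benign entry into the return-stack predictor
2:	lfence			# speculative returns land here ...
	jmp	2b		#  ... and spin harmlessly
1:
	.endr
	add	$(\nr * 8), %rsp  # drop the real return addresses again (64-bit)
	.endm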

With respect to what these deltas mean for an unmodified kernel:
(1a) At minimum, annotation only. More complicated, config- and
run-time-gated options are also possible.
(2) Trivially run-time & config gated.
(3) The de-virtualizing of these branches improves performance in both the
retpoline and non-retpoline cases.

For an out-of-the-box kernel to be reasonably protected, (1)-(3) are required.

I apologize that this does not come with a clean set of patches merging the
things that we and Intel have looked at here. That was one of the original
goals for this week. Strictly speaking, I think that Andi, David, and I have
a fair amount of merging and clean-up to do here. This is an attempt
to keep discussion of the fundamentals at least independent of that.

I'm trying to keep the above reasonably compact/dense. I'm happy to expand on
any details in sub-threads. I'll also link back some of the other compiler work
which is landing for (1).

Thanks,

- Paul


2018-01-04 09:25:26

by Paul Turner

[permalink] [raw]
Subject: Re: [RFC] Retpoline: Binary mitigation for branch-target-injection (aka "Spectre")

On Thu, Jan 4, 2018 at 1:10 AM, Paul Turner <[email protected]> wrote:
> Apologies for the discombobulation around today's disclosure. Obviously the
> original goal was to communicate this a little more coherently, but the
> unscheduled advances in the disclosure disrupted the efforts to pull this
> together more cleanly.
>
> I wanted to open discussion of the "retpoline" approach and define its
> requirements so that we can separate the core
> details from questions regarding any particular implementation thereof.
>
> As a starting point, a full write-up describing the approach is available at:
> https://support.google.com/faqs/answer/7625886
>
> The 30 second version is:
> Returns are a special type of indirect branch. As function returns are intended
> to pair with function calls, processors often implement dedicated return stack
> predictors. The choice of this branch prediction allows us to generate an
> indirect branch in which speculative execution is intentionally redirected into
> a controlled location by a return stack target that we control, preventing
> branch target injection (also known as "Spectre") against these binaries.
>
> On the targets (Intel Xeon) we have measured so far, cost is within cycles of a
> "native" indirect branch for which branch prediction hardware has been disabled.
> This is unfortunately measurable -- from 3 cycles on average to about 30.
> However the cost is largely mitigated for many workloads since the kernel uses
> comparatively few indirect branches (versus say, a C++ binary). With some
> effort we have the average overall overhead within the 0-1.5% range for our
> internal workloads, including some particularly high packet processing engines.
>
> There are several components, the majority of which are independent of kernel
> modifications:
>
> (1) A compiler supporting retpoline transformations.

An implementation for LLVM is available at:
https://reviews.llvm.org/D41723

> (1a) Optionally: annotations for hand-coded indirect jmps, so that they may be
> made compatible with (1).
> [ Note: The only known indirect jmp which is not safe to convert is the
> early virtual address check in head entry. ]
> (2) Kernel modifications for preventing return-stack underflow (see document
> above).
> The key points where this occurs are:
> - Context switches (into protected targets)
> - interrupt return (we return into potentially unwinding execution)
> - sleep state exit (flushes caches)
> - guest exit.
> (These can be run-time gated, a full refill costs 30-45 cycles.)
> (3) Optional: Optimizations so that direct branches can be used for hot kernel
> indirects. While as discussed above, kernel execution generally depends on
> fewer indirect branches, there are a few places (in particular, the
> networking stack) where we have chained sequences of indirects on hot paths.
> (4) More general support for guarding against RSB underflow in an affected
> target. While this is harder to exploit and may not be required for many
> users, the approaches we have used here are not generally applicable.
> Further discussion is required.
>
> With respect to what these deltas mean for an unmodified kernel:

Sorry, this should have been: a kernel that does not care about this protection.

It has been a long day :-).

> (1a) At minimum annotation only. More complicated, config and
> run-time gated options are also possible.
> (2) Trivially run-time & config gated.
> (3) The de-virtualizing of these branches improves performance in both the
> retpoline and non-retpoline cases.
>
> For an out of the box kernel that is reasonably protected, (1)-(3) are required.
>
> I apologize that this does not come with a clean set of patches, merging the
> things that we and Intel have looked at here. That was one of the original
> goals for this week. Strictly speaking, I think that Andi, David, and I have
> a fair amount of merging and clean-up to do here. This is an attempt
> to keep discussion of the fundamentals at least independent of that.
>
> I'm trying to keep the above reasonably compact/dense. I'm happy to expand on
> any details in sub-threads. I'll also link back some of the other compiler work
> which is landing for (1).
>
> Thanks,
>
> - Paul

2018-01-04 09:35:33

by Woodhouse, David

[permalink] [raw]
Subject: Re: [RFC] Retpoline: Binary mitigation for branch-target-injection (aka "Spectre")

On Thu, 2018-01-04 at 01:10 -0800, Paul Turner wrote:
> Apologies for the discombobulation around today's disclosure.  Obviously the
> original goal was to communicate this a little more coherently, but the
> unscheduled advances in the disclosure disrupted the efforts to pull this
> together more cleanly.
>
> I wanted to open discussion of the "retpoline" approach and define its
> requirements so that we can separate the core
> details from questions regarding any particular implementation thereof.
>
> As a starting point, a full write-up describing the approach is available at:
>   https://support.google.com/faqs/answer/7625886

Note that (ab)using 'ret' in this way is incompatible with CET on
upcoming processors. HJ added a -mno-indirect-branch-register option to
the latest round of GCC patches, which puts the branch target in a
register instead of on the stack. My kernel patches (which I'm about to
reconcile with Andi's tweaks and post) do the same.

That means that in the cases where at runtime we want to ALTERNATIVE
out the retpoline, it just turns back into a bare 'jmp *\reg'.
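
For clarity, a compact sketch of that register-based form (mirroring the
thunks posted later in this thread; the label numbers and the
X86_BUG_NO_RETPOLINE feature name are taken from those patches and are not
final):

	# Call sites use the target in a register, e.g.:
	#	jmp	__x86.indirect_thunk.rax
	# and the thunk itself is:
	ALTERNATIVE "call 2f", "jmp *%rax", X86_BUG_NO_RETPOLINE
1:	lfence			# captured speculation spins here
	jmp	1b
2:	mov	%rax, (%rsp)	# overwrite the return address with the real target
	ret

When the alternative is applied at run time, the first line is patched to the
bare 'jmp *%rax' and the rest of the thunk is never reached.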



2018-01-04 09:48:32

by Greg KH

[permalink] [raw]
Subject: Re: [RFC] Retpoline: Binary mitigation for branch-target-injection (aka "Spectre")

On Thu, Jan 04, 2018 at 01:24:41AM -0800, Paul Turner wrote:
> On Thu, Jan 4, 2018 at 1:10 AM, Paul Turner <[email protected]> wrote:
> > Apologies for the discombobulation around today's disclosure. Obviously the
> > original goal was to communicate this a little more coherently, but the
> > unscheduled advances in the disclosure disrupted the efforts to pull this
> > together more cleanly.
> >
> > I wanted to open discussion of the "retpoline" approach and define its
> > requirements so that we can separate the core
> > details from questions regarding any particular implementation thereof.
> >
> > As a starting point, a full write-up describing the approach is available at:
> > https://support.google.com/faqs/answer/7625886
> >
> > The 30 second version is:
> > Returns are a special type of indirect branch. As function returns are intended
> > to pair with function calls, processors often implement dedicated return stack
> > predictors. The choice of this branch prediction allows us to generate an
> > indirect branch in which speculative execution is intentionally redirected into
> > a controlled location by a return stack target that we control, preventing
> > branch target injection (also known as "Spectre") against these binaries.
> >
> > On the targets (Intel Xeon) we have measured so far, cost is within cycles of a
> > "native" indirect branch for which branch prediction hardware has been disabled.
> > This is unfortunately measurable -- from 3 cycles on average to about 30.
> > However the cost is largely mitigated for many workloads since the kernel uses
> > comparatively few indirect branches (versus say, a C++ binary). With some
> > effort we have the average overall overhead within the 0-1.5% range for our
> > internal workloads, including some particularly high packet processing engines.
> >
> > There are several components, the majority of which are independent of kernel
> > modifications:
> >
> > (1) A compiler supporting retpoline transformations.
>
> An implementation for LLVM is available at:
> https://reviews.llvm.org/D41723

Nice, thanks for the link and the write up. There is also a patch for
gcc floating around somewhere, does anyone have the link for that?

thanks,

greg k-h

2018-01-04 09:59:25

by Woodhouse, David

[permalink] [raw]
Subject: Re: [RFC] Retpoline: Binary mitigation for branch-target-injection (aka "Spectre")

On Thu, 2018-01-04 at 10:48 +0100, Greg Kroah-Hartman wrote:
>
> Nice, thanks for the link and the write up.  There is also a patch for
> gcc floating around somewhere, does anyone have the link for that?

http://git.infradead.org/users/dwmw2/gcc-retpoline.git/shortlog/refs/heads/gcc-7_2_0-retpoline-20171219

I put packages for Fedora 27 at ftp://ftp.infradead.org/pub/retpoline/



2018-01-04 14:37:26

by Woodhouse, David

[permalink] [raw]
Subject: [PATCH v3 08/13] x86/alternatives: Add missing \n at end of ALTERNATIVE inline asm

Where an ALTERNATIVE is used in the middle of an inline asm block, this
would otherwise lead to the following instruction being appended directly
to the trailing ".popsection", and a failed compile.

Signed-off-by: David Woodhouse <[email protected]>
---
arch/x86/include/asm/alternative.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/alternative.h b/arch/x86/include/asm/alternative.h
index dbfd0854651f..cf5961ca8677 100644
--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -140,7 +140,7 @@ static inline int alternatives_text_reserved(void *start, void *end)
".popsection\n" \
".pushsection .altinstr_replacement, \"ax\"\n" \
ALTINSTR_REPLACEMENT(newinstr, feature, 1) \
- ".popsection"
+ ".popsection\n"

#define ALTERNATIVE_2(oldinstr, newinstr1, feature1, newinstr2, feature2)\
OLDINSTR_2(oldinstr, 1, 2) \
@@ -151,7 +151,7 @@ static inline int alternatives_text_reserved(void *start, void *end)
".pushsection .altinstr_replacement, \"ax\"\n" \
ALTINSTR_REPLACEMENT(newinstr1, feature1, 1) \
ALTINSTR_REPLACEMENT(newinstr2, feature2, 2) \
- ".popsection"
+ ".popsection\n"

/*
* Alternative instructions for different CPU types or capabilities.
--
2.14.3

2018-01-04 14:37:29

by Woodhouse, David

[permalink] [raw]
Subject: [PATCH v3 03/13] x86/retpoline/entry: Convert entry assembler indirect jumps

Convert indirect jumps in core 32/64bit entry assembler code to use
non-speculative sequences when CONFIG_RETPOLINE is enabled.

KPTI complicates this a little; the one in entry_SYSCALL_64_trampoline
can't just jump to the thunk because the thunk isn't mapped. So it gets
its own copy of the thunk, inline.

Signed-off-by: David Woodhouse <[email protected]>
---
arch/x86/entry/entry_32.S | 5 +++--
arch/x86/entry/entry_64.S | 20 ++++++++++++++++----
2 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index ace8f321a5a1..abd1e5dd487d 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -44,6 +44,7 @@
#include <asm/asm.h>
#include <asm/smap.h>
#include <asm/frame.h>
+#include <asm/nospec-branch.h>

.section .entry.text, "ax"

@@ -290,7 +291,7 @@ ENTRY(ret_from_fork)

/* kernel thread */
1: movl %edi, %eax
- call *%ebx
+ NOSPEC_CALL ebx
/*
* A kernel thread is allowed to return here after successfully
* calling do_execve(). Exit to userspace to complete the execve()
@@ -919,7 +920,7 @@ common_exception:
movl %ecx, %es
TRACE_IRQS_OFF
movl %esp, %eax # pt_regs pointer
- call *%edi
+ NOSPEC_CALL edi
jmp ret_from_exception
END(common_exception)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index f048e384ff54..9e449701115a 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -37,6 +37,7 @@
#include <asm/pgtable_types.h>
#include <asm/export.h>
#include <asm/frame.h>
+#include <asm/nospec-branch.h>
#include <linux/err.h>

#include "calling.h"
@@ -191,7 +192,17 @@ ENTRY(entry_SYSCALL_64_trampoline)
*/
pushq %rdi
movq $entry_SYSCALL_64_stage2, %rdi
- jmp *%rdi
+ /*
+ * Open-code the retpoline from retpoline.S, because we can't
+ * just jump to it directly.
+ */
+ ALTERNATIVE "call 2f", "jmp *%rdi", X86_BUG_NO_RETPOLINE
+1:
+ lfence
+ jmp 1b
+2:
+ mov %rdi, (%rsp)
+ ret
END(entry_SYSCALL_64_trampoline)

.popsection
@@ -270,7 +281,8 @@ entry_SYSCALL_64_fastpath:
* It might end up jumping to the slow path. If it jumps, RAX
* and all argument registers are clobbered.
*/
- call *sys_call_table(, %rax, 8)
+ movq sys_call_table(, %rax, 8), %rax
+ NOSPEC_CALL rax
.Lentry_SYSCALL_64_after_fastpath_call:

movq %rax, RAX(%rsp)
@@ -442,7 +454,7 @@ ENTRY(stub_ptregs_64)
jmp entry_SYSCALL64_slow_path

1:
- jmp *%rax /* Called from C */
+ NOSPEC_JMP rax /* Called from C */
END(stub_ptregs_64)

.macro ptregs_stub func
@@ -521,7 +533,7 @@ ENTRY(ret_from_fork)
1:
/* kernel thread */
movq %r12, %rdi
- call *%rbx
+ NOSPEC_CALL rbx
/*
* A kernel thread is allowed to return here after successfully
* calling do_execve(). Exit to userspace to complete the execve()
--
2.14.3

2018-01-04 14:37:28

by Woodhouse, David

[permalink] [raw]
Subject: [PATCH v3 06/13] x86/retpoline/xen: Convert Xen hypercall indirect jumps

Convert indirect call in Xen hypercall to use non-speculative sequence,
when CONFIG_RETPOLINE is enabled.

Signed-off-by: David Woodhouse <[email protected]>
---
arch/x86/include/asm/xen/hypercall.h | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/xen/hypercall.h b/arch/x86/include/asm/xen/hypercall.h
index 7cb282e9e587..393c0048c63e 100644
--- a/arch/x86/include/asm/xen/hypercall.h
+++ b/arch/x86/include/asm/xen/hypercall.h
@@ -44,6 +44,7 @@
#include <asm/page.h>
#include <asm/pgtable.h>
#include <asm/smap.h>
+#include <asm/nospec-branch.h>

#include <xen/interface/xen.h>
#include <xen/interface/sched.h>
@@ -217,9 +218,9 @@ privcmd_call(unsigned call,
__HYPERCALL_5ARG(a1, a2, a3, a4, a5);

stac();
- asm volatile("call *%[call]"
+ asm volatile(NOSPEC_CALL
: __HYPERCALL_5PARAM
- : [call] "a" (&hypercall_page[call])
+ : [thunk_target] "a" (&hypercall_page[call])
: __HYPERCALL_CLOBBER5);
clac();

--
2.14.3

2018-01-04 14:38:14

by Woodhouse, David

[permalink] [raw]
Subject: [PATCH v3 10/13] x86/retpoline/pvops: Convert assembler indirect jumps

Convert pvops invocations to use non-speculative call sequences, when
CONFIG_RETPOLINE is enabled.

There is scope for future optimisation here — once the pvops methods are
actually set, we could just turn the damn things into *direct* jumps.
But this is perfectly sufficient for now, without that added complexity.

Signed-off-by: David Woodhouse <[email protected]>
---
arch/x86/include/asm/paravirt_types.h | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 6ec54d01972d..54b735b8ae12 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -336,11 +336,17 @@ extern struct pv_lock_ops pv_lock_ops;
#define PARAVIRT_PATCH(x) \
(offsetof(struct paravirt_patch_template, x) / sizeof(void *))

+#define paravirt_clobber(clobber) \
+ [paravirt_clobber] "i" (clobber)
+#ifdef CONFIG_RETPOLINE
+#define paravirt_type(op) \
+ [paravirt_typenum] "i" (PARAVIRT_PATCH(op)), \
+ [paravirt_opptr] "r" ((op))
+#else
#define paravirt_type(op) \
[paravirt_typenum] "i" (PARAVIRT_PATCH(op)), \
[paravirt_opptr] "i" (&(op))
-#define paravirt_clobber(clobber) \
- [paravirt_clobber] "i" (clobber)
+#endif

/*
* Generate some code, and mark it as patchable by the
@@ -392,7 +398,11 @@ int paravirt_disable_iospace(void);
* offset into the paravirt_patch_template structure, and can therefore be
* freely converted back into a structure offset.
*/
+#ifdef CONFIG_RETPOLINE
+#define PARAVIRT_CALL "call __x86.indirect_thunk.%V[paravirt_opptr];"
+#else
#define PARAVIRT_CALL "call *%c[paravirt_opptr];"
+#endif

/*
* These macros are intended to wrap calls through one of the paravirt
--
2.14.3

2018-01-04 14:38:18

by Woodhouse, David

[permalink] [raw]
Subject: [PATCH v3 11/13] retpoline/taint: Taint kernel for missing retpoline in compiler

From: Andi Kleen <[email protected]>

When the kernel or a module hasn't been compiled with a retpoline-aware
compiler, print a warning and set a taint flag.

For modules this is checked at compile time; however, it cannot check
assembler or other non-compiled objects used in the module link.

For lack of a better letter, it uses taint flag 'Z'.

v2: Change warning message
Signed-off-by: Andi Kleen <[email protected]>
Signed-off-by: David Woodhouse <[email protected]>
---
Documentation/admin-guide/tainted-kernels.rst | 3 +++
arch/x86/kernel/setup.c | 6 ++++++
include/linux/kernel.h | 4 +++-
kernel/module.c | 11 ++++++++++-
kernel/panic.c | 1 +
scripts/mod/modpost.c | 9 +++++++++
6 files changed, 32 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/tainted-kernels.rst b/Documentation/admin-guide/tainted-kernels.rst
index 1df03b5cb02f..800261b6bd6f 100644
--- a/Documentation/admin-guide/tainted-kernels.rst
+++ b/Documentation/admin-guide/tainted-kernels.rst
@@ -52,6 +52,9 @@ characters, each representing a particular tainted value.

16) ``K`` if the kernel has been live patched.

+ 17) ``Z`` if the x86 kernel or a module hasn't been compiled with
+ a retpoline aware compiler and may be vulnerable to data leaks.
+
The primary reason for the **'Tainted: '** string is to tell kernel
debuggers if this is a clean kernel or if anything unusual has
occurred. Tainting is permanent: even if an offending module is
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 8af2e8d0c0a1..cc880b46b756 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1296,6 +1296,12 @@ void __init setup_arch(char **cmdline_p)
#endif

unwind_init();
+
+#ifndef RETPOLINE
+ add_taint(TAINT_NO_RETPOLINE, LOCKDEP_STILL_OK);
+ pr_warn("No support for retpoline in kernel compiler\n");
+ pr_warn("System may be vulnerable to data leaks.\n");
+#endif
}

#ifdef CONFIG_X86_32
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index ce51455e2adf..fbb4d3baffcc 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -550,7 +550,9 @@ extern enum system_states {
#define TAINT_SOFTLOCKUP 14
#define TAINT_LIVEPATCH 15
#define TAINT_AUX 16
-#define TAINT_FLAGS_COUNT 17
+#define TAINT_NO_RETPOLINE 17
+
+#define TAINT_FLAGS_COUNT 18

struct taint_flag {
char c_true; /* character printed when tainted */
diff --git a/kernel/module.c b/kernel/module.c
index dea01ac9cb74..92db3f59a29a 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -3028,7 +3028,16 @@ static int check_modinfo(struct module *mod, struct load_info *info, int flags)
mod->name);
add_taint_module(mod, TAINT_OOT_MODULE, LOCKDEP_STILL_OK);
}
-
+#ifdef RETPOLINE
+ if (!get_modinfo(info, "retpoline")) {
+ if (!test_taint(TAINT_NO_RETPOLINE)) {
+ pr_warn("%s: loading module not compiled with retpoline compiler.\n",
+ mod->name);
+ pr_warn("Kernel may be vulnerable to data leaks.\n");
+ }
+ add_taint_module(mod, TAINT_NO_RETPOLINE, LOCKDEP_STILL_OK);
+ }
+#endif
if (get_modinfo(info, "staging")) {
add_taint_module(mod, TAINT_CRAP, LOCKDEP_STILL_OK);
pr_warn("%s: module is from the staging directory, the quality "
diff --git a/kernel/panic.c b/kernel/panic.c
index 2cfef408fec9..6686c67b6e4b 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -325,6 +325,7 @@ const struct taint_flag taint_flags[TAINT_FLAGS_COUNT] = {
{ 'L', ' ', false }, /* TAINT_SOFTLOCKUP */
{ 'K', ' ', true }, /* TAINT_LIVEPATCH */
{ 'X', ' ', true }, /* TAINT_AUX */
+ { 'Z', ' ', true }, /* TAINT_NO_RETPOLINE */
};

/**
diff --git a/scripts/mod/modpost.c b/scripts/mod/modpost.c
index f51cf977c65b..6510536c06df 100644
--- a/scripts/mod/modpost.c
+++ b/scripts/mod/modpost.c
@@ -2165,6 +2165,14 @@ static void add_intree_flag(struct buffer *b, int is_intree)
buf_printf(b, "\nMODULE_INFO(intree, \"Y\");\n");
}

+/* Cannot check for assembler */
+static void add_retpoline(struct buffer *b)
+{
+ buf_printf(b, "\n#ifdef RETPOLINE\n");
+ buf_printf(b, "MODULE_INFO(retpoline, \"Y\");\n");
+ buf_printf(b, "#endif\n");
+}
+
static void add_staging_flag(struct buffer *b, const char *name)
{
static const char *staging_dir = "drivers/staging";
@@ -2506,6 +2514,7 @@ int main(int argc, char **argv)
err |= check_modname_len(mod);
add_header(&buf, mod);
add_intree_flag(&buf, !external_module);
+ add_retpoline(&buf);
add_staging_flag(&buf, mod->name);
err |= add_versions(&buf, mod);
add_depends(&buf, mod, modules);
--
2.14.3

2018-01-04 14:38:12

by Woodhouse, David

[permalink] [raw]
Subject: [PATCH v3 04/13] x86/retpoline/ftrace: Convert ftrace assembler indirect jumps

Convert all indirect jumps in ftrace assembler code to use non-speculative
sequences when CONFIG_RETPOLINE is enabled.

Signed-off-by: David Woodhouse <[email protected]>
---
arch/x86/kernel/ftrace_32.S | 6 ++++--
arch/x86/kernel/ftrace_64.S | 8 ++++----
2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/ftrace_32.S b/arch/x86/kernel/ftrace_32.S
index b6c6468e10bc..eb9a56fbda02 100644
--- a/arch/x86/kernel/ftrace_32.S
+++ b/arch/x86/kernel/ftrace_32.S
@@ -8,6 +8,7 @@
#include <asm/segment.h>
#include <asm/export.h>
#include <asm/ftrace.h>
+#include <asm/nospec-branch.h>

#ifdef CC_USING_FENTRY
# define function_hook __fentry__
@@ -197,7 +198,8 @@ ftrace_stub:
movl 0x4(%ebp), %edx
subl $MCOUNT_INSN_SIZE, %eax

- call *ftrace_trace_function
+ movl ftrace_trace_function, %ecx
+ NOSPEC_CALL ecx

popl %edx
popl %ecx
@@ -241,5 +243,5 @@ return_to_handler:
movl %eax, %ecx
popl %edx
popl %eax
- jmp *%ecx
+ NOSPEC_JMP ecx
#endif
diff --git a/arch/x86/kernel/ftrace_64.S b/arch/x86/kernel/ftrace_64.S
index c832291d948a..d4611d8bdbbf 100644
--- a/arch/x86/kernel/ftrace_64.S
+++ b/arch/x86/kernel/ftrace_64.S
@@ -7,7 +7,7 @@
#include <asm/ptrace.h>
#include <asm/ftrace.h>
#include <asm/export.h>
-
+#include <asm/nospec-branch.h>

.code64
.section .entry.text, "ax"
@@ -286,8 +286,8 @@ trace:
* ip and parent ip are used and the list function is called when
* function tracing is enabled.
*/
- call *ftrace_trace_function
-
+ movq ftrace_trace_function, %r8
+ NOSPEC_CALL r8
restore_mcount_regs

jmp fgraph_trace
@@ -329,5 +329,5 @@ GLOBAL(return_to_handler)
movq 8(%rsp), %rdx
movq (%rsp), %rax
addq $24, %rsp
- jmp *%rdi
+ NOSPEC_JMP rdi
#endif
--
2.14.3

2018-01-04 14:38:09

by Woodhouse, David

[permalink] [raw]
Subject: [PATCH v3 12/13] retpoline/objtool: Disable some objtool warnings

From: Andi Kleen <[email protected]>

With the indirect call thunks enabled in the compiler, two objtool
warnings are triggered very frequently and make the build
very noisy.

I don't see a good way to avoid them, so just disable them
for now.

Signed-off-by: Andi Kleen <[email protected]>
Signed-off-by: David Woodhouse <[email protected]>
---
tools/objtool/check.c | 11 +++++++++++
1 file changed, 11 insertions(+)

diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index 9b341584eb1b..435c71f944dc 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -503,8 +503,13 @@ static int add_call_destinations(struct objtool_file *file)
insn->call_dest = find_symbol_by_offset(insn->sec,
dest_off);
if (!insn->call_dest) {
+#if 0
+ /* Compilers with -mindirect-branch=thunk-extern trigger
+ * this everywhere on x86. Disable for now.
+ */
WARN_FUNC("can't find call dest symbol at offset 0x%lx",
insn->sec, insn->offset, dest_off);
+#endif
return -1;
}
} else if (rela->sym->type == STT_SECTION) {
@@ -1716,8 +1721,14 @@ static int validate_branch(struct objtool_file *file, struct instruction *first,
return 1;

} else if (func && has_modified_stack_frame(&state)) {
+#if 0
+ /* Compilers with -mindirect-branch=thunk-extern trigger
+ * this everywhere on x86. Disable for now.
+ */
+
WARN_FUNC("sibling call from callable instruction with modified stack frame",
sec, insn->offset);
+#endif
return 1;
}

--
2.14.3

2018-01-04 14:42:17

by Woodhouse, David

[permalink] [raw]
Subject: [PATCH v3 09/13] x86/retpoline/irq32: Convert assembler indirect jumps

From: Andi Kleen <[email protected]>

Convert all indirect jumps in 32bit irq inline asm code to use
non-speculative sequences.

Signed-off-by: Andi Kleen <[email protected]>
Signed-off-by: David Woodhouse <[email protected]>
---
arch/x86/kernel/irq_32.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/irq_32.c b/arch/x86/kernel/irq_32.c
index a83b3346a0e1..e1e58f738c3d 100644
--- a/arch/x86/kernel/irq_32.c
+++ b/arch/x86/kernel/irq_32.c
@@ -20,6 +20,7 @@
#include <linux/mm.h>

#include <asm/apic.h>
+#include <asm/nospec-branch.h>

#ifdef CONFIG_DEBUG_STACKOVERFLOW

@@ -55,11 +56,11 @@ DEFINE_PER_CPU(struct irq_stack *, softirq_stack);
static void call_on_stack(void *func, void *stack)
{
asm volatile("xchgl %%ebx,%%esp \n"
- "call *%%edi \n"
+ NOSPEC_CALL
"movl %%ebx,%%esp \n"
: "=b" (stack)
: "0" (stack),
- "D"(func)
+ [thunk_target] "D"(func)
: "memory", "cc", "edx", "ecx", "eax");
}

@@ -95,11 +96,11 @@ static inline int execute_on_irq_stack(int overflow, struct irq_desc *desc)
call_on_stack(print_stack_overflow, isp);

asm volatile("xchgl %%ebx,%%esp \n"
- "call *%%edi \n"
+ NOSPEC_CALL
"movl %%ebx,%%esp \n"
: "=a" (arg1), "=b" (isp)
: "0" (desc), "1" (isp),
- "D" (desc->handle_irq)
+ [thunk_target] "D" (desc->handle_irq)
: "memory", "cc", "ecx");
return 1;
}
--
2.14.3

2018-01-04 14:42:33

by Woodhouse, David

[permalink] [raw]
Subject: [PATCH v3 13/13] retpoline: Attempt to quieten objtool warning for unreachable code

From: Andi Kleen <[email protected]>

The speculative jump trampoline has to contain unreachable code.
objtool keeps complaining

arch/x86/lib/retpoline.o: warning: objtool: __x86.indirect_thunk()+0x8: unreachable instruction

I tried to fix it here by adding an ASM_UNREACHABLE annotation (after
adding support for it in pure assembler), but it still complains.
Seems like an objtool bug?

So it doesn't actually fix the warning yet.
Of course it's just a warning, so the kernel will still work fine.

Perhaps Josh can figure it out

Cc: [email protected]
Not-Signed-off-by: Andi Kleen <[email protected]>
Not-Signed-off-by: David Woodhouse <[email protected]>
---
arch/x86/lib/retpoline.S | 5 +++++
include/linux/compiler.h | 10 +++++++++-
2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/arch/x86/lib/retpoline.S b/arch/x86/lib/retpoline.S
index bbdda5cc136e..8112feaea6ff 100644
--- a/arch/x86/lib/retpoline.S
+++ b/arch/x86/lib/retpoline.S
@@ -2,6 +2,7 @@

#include <linux/stringify.h>
#include <linux/linkage.h>
+#include <linux/compiler.h>
#include <asm/dwarf2.h>
#include <asm/cpufeatures.h>
#include <asm/alternative-asm.h>
@@ -14,7 +15,9 @@ ENTRY(__x86.indirect_thunk.\reg)
ALTERNATIVE "call 2f", __stringify(jmp *%\reg), X86_BUG_NO_RETPOLINE
1:
lfence
+ ASM_UNREACHABLE
jmp 1b
+ ASM_UNREACHABLE
2:
mov %\reg, (%\sp)
ret
@@ -40,7 +43,9 @@ ENTRY(__x86.indirect_thunk)
ALTERNATIVE "call 2f", "ret", X86_BUG_NO_RETPOLINE
1:
lfence
+ ASM_UNREACHABLE
jmp 1b
+ ASM_UNREACHABLE
2:
lea 4(%esp), %esp
ret
diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index 52e611ab9a6c..cfba91acc79a 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -269,7 +269,15 @@ static __always_inline void __write_once_size(volatile void *p, void *res, int s

#endif /* __KERNEL__ */

-#endif /* __ASSEMBLY__ */
+#else /* __ASSEMBLY__ */
+
+#define ASM_UNREACHABLE \
+ 999: \
+ .pushsection .discard.unreachable; \
+ .long 999b - .; \
+ .popsection
+
+#endif /* !__ASSEMBLY__ */

/* Compile time object size, -1 for unknown */
#ifndef __compiletime_object_size
--
2.14.3

2018-01-04 14:43:30

by Woodhouse, David

[permalink] [raw]
Subject: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support

Enable the use of -mindirect-branch=thunk-extern in newer GCC, and provide
the corresponding thunks. Provide assembler macros for invoking the thunks
in the same way that GCC does, from native and inline assembler.

This adds an X86_BUG_NO_RETPOLINE "feature" for runtime patching out
of the thunks. This is a placeholder for now; the patches which support
the new Intel/AMD microcode features will flesh out the precise conditions
under which we disable the retpoline and do other things instead.

[Andi Kleen: Rename the macros and add CONFIG_RETPOLINE option]

Signed-off-by: David Woodhouse <[email protected]>
---
arch/x86/Kconfig | 8 ++++++
arch/x86/Makefile | 10 ++++++++
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/lib/Makefile | 1 +
arch/x86/lib/retpoline.S | 50 ++++++++++++++++++++++++++++++++++++++
5 files changed, 70 insertions(+)
create mode 100644 arch/x86/lib/retpoline.S

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index d4fc98c50378..8b0facfa35be 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -429,6 +429,14 @@ config GOLDFISH
def_bool y
depends on X86_GOLDFISH

+config RETPOLINE
+ bool "Avoid speculative indirect branches in kernel"
+ default y
+ help
+ Compile kernel with the retpoline compiler options to guard against
+ kernel to user data leaks by avoiding speculative indirect
+ branches. Requires a new enough compiler. The kernel may run slower.
+
config INTEL_RDT
bool "Intel Resource Director Technology support"
default n
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 3e73bc255e4e..f772b3fef202 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -230,6 +230,16 @@ KBUILD_CFLAGS += -Wno-sign-compare
#
KBUILD_CFLAGS += -fno-asynchronous-unwind-tables

+# Avoid indirect branches in kernel to deal with Spectre
+ifdef CONFIG_RETPOLINE
+ RETPOLINE_CFLAGS += $(call cc-option,-mindirect-branch=thunk-extern -mindirect-branch-register)
+ ifneq ($(RETPOLINE_CFLAGS),)
+ KBUILD_CFLAGS += $(RETPOLINE_CFLAGS) -DRETPOLINE
+ else
+ $(warning Retpoline not supported in compiler. System may be insecure.)
+ endif
+endif
+
archscripts: scripts_basic
$(Q)$(MAKE) $(build)=arch/x86/tools relocs

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 07cdd1715705..900fa7016d3f 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -342,5 +342,6 @@
#define X86_BUG_MONITOR X86_BUG(12) /* IPI required to wake up remote CPU */
#define X86_BUG_AMD_E400 X86_BUG(13) /* CPU is among the affected by Erratum 400 */
#define X86_BUG_CPU_INSECURE X86_BUG(14) /* CPU is insecure and needs kernel page table isolation */
+#define X86_BUG_NO_RETPOLINE X86_BUG(15) /* Placeholder: disable retpoline branch thunks */

#endif /* _ASM_X86_CPUFEATURES_H */
diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile
index 7b181b61170e..f23934bbaf4e 100644
--- a/arch/x86/lib/Makefile
+++ b/arch/x86/lib/Makefile
@@ -26,6 +26,7 @@ lib-y += memcpy_$(BITS).o
lib-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem.o
lib-$(CONFIG_INSTRUCTION_DECODER) += insn.o inat.o insn-eval.o
lib-$(CONFIG_RANDOMIZE_BASE) += kaslr.o
+lib-$(CONFIG_RETPOLINE) += retpoline.o

obj-y += msr.o msr-reg.o msr-reg-export.o hweight.o

diff --git a/arch/x86/lib/retpoline.S b/arch/x86/lib/retpoline.S
new file mode 100644
index 000000000000..bbdda5cc136e
--- /dev/null
+++ b/arch/x86/lib/retpoline.S
@@ -0,0 +1,50 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#include <linux/stringify.h>
+#include <linux/linkage.h>
+#include <asm/dwarf2.h>
+#include <asm/cpufeatures.h>
+#include <asm/alternative-asm.h>
+
+.macro THUNK sp reg
+ .section .text.__x86.indirect_thunk.\reg
+
+ENTRY(__x86.indirect_thunk.\reg)
+ CFI_STARTPROC
+ ALTERNATIVE "call 2f", __stringify(jmp *%\reg), X86_BUG_NO_RETPOLINE
+1:
+ lfence
+ jmp 1b
+2:
+ mov %\reg, (%\sp)
+ ret
+ CFI_ENDPROC
+ENDPROC(__x86.indirect_thunk.\reg)
+.endm
+
+#ifdef CONFIG_64BIT
+.irp reg rax rbx rcx rdx rsi rdi rbp r8 r9 r10 r11 r12 r13 r14 r15
+ THUNK rsp \reg
+.endr
+#else
+.irp reg eax ebx ecx edx esi edi ebp
+ THUNK esp \reg
+.endr
+
+/*
+ * Also provide the original ret-equivalent retpoline for i386 because it's
+ * so register-starved, and we don't care about CET compatibility here.
+ */
+ENTRY(__x86.indirect_thunk)
+ CFI_STARTPROC
+ ALTERNATIVE "call 2f", "ret", X86_BUG_NO_RETPOLINE
+1:
+ lfence
+ jmp 1b
+2:
+ lea 4(%esp), %esp
+ ret
+ CFI_ENDPROC
+ENDPROC(__x86.indirect_thunk)
+
+#endif
--
2.14.3

2018-01-04 14:43:33

by Woodhouse, David

[permalink] [raw]
Subject: [PATCH v3 02/13] x86/retpoline/crypto: Convert crypto assembler indirect jumps

Convert all indirect jumps in crypto assembler code to use non-speculative
sequences when CONFIG_RETPOLINE is enabled.

Signed-off-by: David Woodhouse <[email protected]>
---
arch/x86/crypto/aesni-intel_asm.S | 5 +++--
arch/x86/crypto/camellia-aesni-avx-asm_64.S | 3 ++-
arch/x86/crypto/camellia-aesni-avx2-asm_64.S | 3 ++-
arch/x86/crypto/crc32c-pcl-intel-asm_64.S | 3 ++-
4 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/arch/x86/crypto/aesni-intel_asm.S b/arch/x86/crypto/aesni-intel_asm.S
index 16627fec80b2..074c13767c9f 100644
--- a/arch/x86/crypto/aesni-intel_asm.S
+++ b/arch/x86/crypto/aesni-intel_asm.S
@@ -32,6 +32,7 @@
#include <linux/linkage.h>
#include <asm/inst.h>
#include <asm/frame.h>
+#include <asm/nospec-branch.h>

/*
* The following macros are used to move an (un)aligned 16 byte value to/from
@@ -2884,7 +2885,7 @@ ENTRY(aesni_xts_crypt8)
pxor INC, STATE4
movdqu IV, 0x30(OUTP)

- call *%r11
+ NOSPEC_CALL r11

movdqu 0x00(OUTP), INC
pxor INC, STATE1
@@ -2929,7 +2930,7 @@ ENTRY(aesni_xts_crypt8)
_aesni_gf128mul_x_ble()
movups IV, (IVP)

- call *%r11
+ NOSPEC_CALL r11

movdqu 0x40(OUTP), INC
pxor INC, STATE1
diff --git a/arch/x86/crypto/camellia-aesni-avx-asm_64.S b/arch/x86/crypto/camellia-aesni-avx-asm_64.S
index f7c495e2863c..98a717ba5e1a 100644
--- a/arch/x86/crypto/camellia-aesni-avx-asm_64.S
+++ b/arch/x86/crypto/camellia-aesni-avx-asm_64.S
@@ -17,6 +17,7 @@

#include <linux/linkage.h>
#include <asm/frame.h>
+#include <asm/nospec-branch.h>

#define CAMELLIA_TABLE_BYTE_LEN 272

@@ -1227,7 +1228,7 @@ camellia_xts_crypt_16way:
vpxor 14 * 16(%rax), %xmm15, %xmm14;
vpxor 15 * 16(%rax), %xmm15, %xmm15;

- call *%r9;
+ NOSPEC_CALL r9;

addq $(16 * 16), %rsp;

diff --git a/arch/x86/crypto/camellia-aesni-avx2-asm_64.S b/arch/x86/crypto/camellia-aesni-avx2-asm_64.S
index eee5b3982cfd..99d09d3166a5 100644
--- a/arch/x86/crypto/camellia-aesni-avx2-asm_64.S
+++ b/arch/x86/crypto/camellia-aesni-avx2-asm_64.S
@@ -12,6 +12,7 @@

#include <linux/linkage.h>
#include <asm/frame.h>
+#include <asm/nospec-branch.h>

#define CAMELLIA_TABLE_BYTE_LEN 272

@@ -1343,7 +1344,7 @@ camellia_xts_crypt_32way:
vpxor 14 * 32(%rax), %ymm15, %ymm14;
vpxor 15 * 32(%rax), %ymm15, %ymm15;

- call *%r9;
+ NOSPEC_CALL r9;

addq $(16 * 32), %rsp;

diff --git a/arch/x86/crypto/crc32c-pcl-intel-asm_64.S b/arch/x86/crypto/crc32c-pcl-intel-asm_64.S
index 7a7de27c6f41..05178b44317d 100644
--- a/arch/x86/crypto/crc32c-pcl-intel-asm_64.S
+++ b/arch/x86/crypto/crc32c-pcl-intel-asm_64.S
@@ -45,6 +45,7 @@

#include <asm/inst.h>
#include <linux/linkage.h>
+#include <asm/nospec-branch.h>

## ISCSI CRC 32 Implementation with crc32 and pclmulqdq Instruction

@@ -172,7 +173,7 @@ continue_block:
movzxw (bufp, %rax, 2), len
lea crc_array(%rip), bufp
lea (bufp, len, 1), bufp
- jmp *bufp
+ NOSPEC_JMP bufp

################################################################
## 2a) PROCESS FULL BLOCKS:
--
2.14.3

2018-01-04 14:43:27

by Woodhouse, David

[permalink] [raw]
Subject: [PATCH v3 05/13] x86/retpoline/hyperv: Convert assembler indirect jumps

Convert all indirect jumps in hyperv inline asm code to use non-speculative
sequences when CONFIG_RETPOLINE is enabled.

Signed-off-by: David Woodhouse <[email protected]>
---
arch/x86/include/asm/mshyperv.h | 18 ++++++++++--------
1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index 5400add2885b..532ab441f39a 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -7,6 +7,7 @@
#include <linux/nmi.h>
#include <asm/io.h>
#include <asm/hyperv.h>
+#include <asm/nospec-branch.h>

/*
* The below CPUID leaves are present if VersionAndFeatures.HypervisorPresent
@@ -186,10 +187,11 @@ static inline u64 hv_do_hypercall(u64 control, void *input, void *output)
return U64_MAX;

__asm__ __volatile__("mov %4, %%r8\n"
- "call *%5"
+ NOSPEC_CALL
: "=a" (hv_status), ASM_CALL_CONSTRAINT,
"+c" (control), "+d" (input_address)
- : "r" (output_address), "m" (hv_hypercall_pg)
+ : "r" (output_address),
+ THUNK_TARGET(hv_hypercall_pg)
: "cc", "memory", "r8", "r9", "r10", "r11");
#else
u32 input_address_hi = upper_32_bits(input_address);
@@ -200,13 +202,13 @@ static inline u64 hv_do_hypercall(u64 control, void *input, void *output)
if (!hv_hypercall_pg)
return U64_MAX;

- __asm__ __volatile__("call *%7"
+ __asm__ __volatile__(NOSPEC_CALL
: "=A" (hv_status),
"+c" (input_address_lo), ASM_CALL_CONSTRAINT
: "A" (control),
"b" (input_address_hi),
"D"(output_address_hi), "S"(output_address_lo),
- "m" (hv_hypercall_pg)
+ THUNK_TARGET(hv_hypercall_pg)
: "cc", "memory");
#endif /* !x86_64 */
return hv_status;
@@ -227,10 +229,10 @@ static inline u64 hv_do_fast_hypercall8(u16 code, u64 input1)

#ifdef CONFIG_X86_64
{
- __asm__ __volatile__("call *%4"
+ __asm__ __volatile__(NOSPEC_CALL
: "=a" (hv_status), ASM_CALL_CONSTRAINT,
"+c" (control), "+d" (input1)
- : "m" (hv_hypercall_pg)
+ : THUNK_TARGET(hv_hypercall_pg)
: "cc", "r8", "r9", "r10", "r11");
}
#else
@@ -238,13 +240,13 @@ static inline u64 hv_do_fast_hypercall8(u16 code, u64 input1)
u32 input1_hi = upper_32_bits(input1);
u32 input1_lo = lower_32_bits(input1);

- __asm__ __volatile__ ("call *%5"
+ __asm__ __volatile__ (NOSPEC_CALL
: "=A"(hv_status),
"+c"(input1_lo),
ASM_CALL_CONSTRAINT
: "A" (control),
"b" (input1_hi),
- "m" (hv_hypercall_pg)
+ THUNK_TARGET(hv_hypercall_pg)
: "cc", "edi", "esi");
}
#endif
--
2.14.3

2018-01-04 14:43:25

by Woodhouse, David

[permalink] [raw]
Subject: [PATCH v3 07/13] x86/retpoline/checksum32: Convert assembler indirect jumps

Convert all indirect jumps in 32bit checksum assembler code to use
non-speculative sequences when CONFIG_RETPOLINE is enabled.

Signed-off-by: David Woodhouse <[email protected]>
---
arch/x86/lib/checksum_32.S | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/x86/lib/checksum_32.S b/arch/x86/lib/checksum_32.S
index 4d34bb548b41..d5ef7cc0efd8 100644
--- a/arch/x86/lib/checksum_32.S
+++ b/arch/x86/lib/checksum_32.S
@@ -29,7 +29,8 @@
#include <asm/errno.h>
#include <asm/asm.h>
#include <asm/export.h>
-
+#include <asm/nospec-branch.h>
+
/*
* computes a partial checksum, e.g. for TCP/UDP fragments
*/
@@ -156,7 +157,7 @@ ENTRY(csum_partial)
negl %ebx
lea 45f(%ebx,%ebx,2), %ebx
testl %esi, %esi
- jmp *%ebx
+ NOSPEC_JMP ebx

# Handle 2-byte-aligned regions
20: addw (%esi), %ax
@@ -439,7 +440,7 @@ ENTRY(csum_partial_copy_generic)
andl $-32,%edx
lea 3f(%ebx,%ebx), %ebx
testl %esi, %esi
- jmp *%ebx
+ NOSPEC_JMP ebx
1: addl $64,%esi
addl $64,%edi
SRC(movb -32(%edx),%bl) ; SRC(movb (%edx),%bl)
--
2.14.3

2018-01-04 14:46:36

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v3 03/13] x86/retpoline/entry: Convert entry assembler indirect jumps

On 01/04/2018 06:37 AM, David Woodhouse wrote:
> KPTI complicates this a little; the one in entry_SYSCALL_64_trampoline
> can't just jump to the thunk because the thunk isn't mapped. So it gets
> its own copy of the thunk, inline.

This one call site isn't too painful, of course.

But, is there anything keeping us from just sticking the thunk in the
entry text section where it would be available while still in the
trampoline?

2018-01-04 14:51:13

by Woodhouse, David

[permalink] [raw]
Subject: Re: [PATCH v3 03/13] x86/retpoline/entry: Convert entry assembler indirect jumps

On Thu, 2018-01-04 at 06:46 -0800, Dave Hansen wrote:
> On 01/04/2018 06:37 AM, David Woodhouse wrote:
> > KPTI complicates this a little; the one in entry_SYSCALL_64_trampoline
> > can't just jump to the thunk because the thunk isn't mapped. So it gets
> > its own copy of the thunk, inline.
>
> This one call site isn't too painful, of course.
>
> But, is there anything keeping us from just sticking the thunk in the
> entry text section where it would be available while still in the
> trampoline?

Not really. Since we have a thunk per register and the trampoline is
limited in size, if you wanted to put just the *one* thunk that's
needed into the trampoline then it's slightly icky, but all perfectly
feasible.



2018-01-04 15:02:11

by Juergen Gross

[permalink] [raw]
Subject: Re: [PATCH v3 10/13] x86/retpoline/pvops: Convert assembler indirect jumps

On 04/01/18 15:37, David Woodhouse wrote:
> Convert pvops invocations to use non-speculative call sequences, when
> CONFIG_RETPOLINE is enabled.
>
> There is scope for future optimisation here — once the pvops methods are
> actually set, we could just turn the damn things into *direct* jumps.
> But this is perfectly sufficient for now, without that added complexity.

I don't see the need to modify the pvops calls.

All indirect calls are replaced by either direct calls or other code
long before any user code is active.

For modules the replacements are in place before the module is being
used.


Juergen

>
> Signed-off-by: David Woodhouse <[email protected]>
> ---
> arch/x86/include/asm/paravirt_types.h | 14 ++++++++++++--
> 1 file changed, 12 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
> index 6ec54d01972d..54b735b8ae12 100644
> --- a/arch/x86/include/asm/paravirt_types.h
> +++ b/arch/x86/include/asm/paravirt_types.h
> @@ -336,11 +336,17 @@ extern struct pv_lock_ops pv_lock_ops;
> #define PARAVIRT_PATCH(x) \
> (offsetof(struct paravirt_patch_template, x) / sizeof(void *))
>
> +#define paravirt_clobber(clobber) \
> + [paravirt_clobber] "i" (clobber)
> +#ifdef CONFIG_RETPOLINE
> +#define paravirt_type(op) \
> + [paravirt_typenum] "i" (PARAVIRT_PATCH(op)), \
> + [paravirt_opptr] "r" ((op))
> +#else
> #define paravirt_type(op) \
> [paravirt_typenum] "i" (PARAVIRT_PATCH(op)), \
> [paravirt_opptr] "i" (&(op))
> -#define paravirt_clobber(clobber) \
> - [paravirt_clobber] "i" (clobber)
> +#endif
>
> /*
> * Generate some code, and mark it as patchable by the
> @@ -392,7 +398,11 @@ int paravirt_disable_iospace(void);
> * offset into the paravirt_patch_template structure, and can therefore be
> * freely converted back into a structure offset.
> */
> +#ifdef CONFIG_RETPOLINE
> +#define PARAVIRT_CALL "call __x86.indirect_thunk.%V[paravirt_opptr];"
> +#else
> #define PARAVIRT_CALL "call *%c[paravirt_opptr];"
> +#endif
>
> /*
> * These macros are intended to wrap calls through one of the paravirt
>

2018-01-04 15:10:37

by Juergen Gross

[permalink] [raw]
Subject: Re: [PATCH v3 06/13] x86/retpoline/xen: Convert Xen hypercall indirect jumps

On 04/01/18 15:37, David Woodhouse wrote:
> Convert indirect call in Xen hypercall to use non-speculative sequence,
> when CONFIG_RETPOLINE is enabled.
>
> Signed-off-by: David Woodhouse <[email protected]>
> ---
> arch/x86/include/asm/xen/hypercall.h | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/include/asm/xen/hypercall.h b/arch/x86/include/asm/xen/hypercall.h
> index 7cb282e9e587..393c0048c63e 100644
> --- a/arch/x86/include/asm/xen/hypercall.h
> +++ b/arch/x86/include/asm/xen/hypercall.h
> @@ -44,6 +44,7 @@
> #include <asm/page.h>
> #include <asm/pgtable.h>
> #include <asm/smap.h>
> +#include <asm/nospec-branch.h>

Where does this file come from? It isn't added in any of the patches.


Juergen

2018-01-04 15:13:18

by Woodhouse, David

[permalink] [raw]
Subject: Re: [PATCH v3 10/13] x86/retpoline/pvops: Convert assembler indirect jumps

On Thu, 2018-01-04 at 16:02 +0100, Juergen Gross wrote:
> On 04/01/18 15:37, David Woodhouse wrote:
> > Convert pvops invocations to use non-speculative call sequences, when
> > CONFIG_RETPOLINE is enabled.
> > 
> > There is scope for future optimisation here — once the pvops methods are
> > actually set, we could just turn the damn things into *direct* jumps.
> > But this is perfectly sufficient for now, without that added complexity

2018-01-04 15:19:09

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH v3 06/13] x86/retpoline/xen: Convert Xen hypercall indirect jumps

On Thu, 2018-01-04 at 16:10 +0100, Juergen Gross wrote:
> On 04/01/18 15:37, David Woodhouse wrote:
> >
> > Convert indirect call in Xen hypercall to use non-speculative sequence,
> > when CONFIG_RETPOLINE is enabled.
> >
> > Signed-off-by: David Woodhouse <[email protected]>
> > ---
> >  arch/x86/include/asm/xen/hypercall.h | 5 +++--
> >  1 file changed, 3 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/xen/hypercall.h
> > b/arch/x86/include/asm/xen/hypercall.h
> > index 7cb282e9e587..393c0048c63e 100644
> > --- a/arch/x86/include/asm/xen/hypercall.h
> > +++ b/arch/x86/include/asm/xen/hypercall.h
> > @@ -44,6 +44,7 @@
> >  #include <asm/page.h>
> >  #include <asm/pgtable.h>
> >  #include <asm/smap.h>
> > +#include <asm/nospec-branch.h>

> Where does this file come from? It isn't added in any of the patches

Dammit, sorry. Fixed now:

http://git.infradead.org/users/dwmw2/linux-retpoline.git/commitdiff/be7e80781
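
For reference, the assembler side of that header provides macros roughly along
these lines -- a sketch inferred from the thunk names in patch 01/13 and the
open-coded variant in patch 03/13, not the actual header contents (the header
also needs inline-asm flavours using the [thunk_target] constraint seen in
patches 05-09):

	.macro NOSPEC_JMP reg:req
#ifdef CONFIG_RETPOLINE
	jmp	__x86.indirect_thunk.\reg	# speculation-safe thunk
#else
	jmp	*%\reg				# plain indirect jump
#endif
	.endm

	.macro NOSPEC_CALL reg:req
#ifdef CONFIG_RETPOLINE
	call	__x86.indirect_thunk.\reg
#else
	call	*%\reg
#endif
	.endm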



2018-01-04 15:49:11

by Andrew Cooper

[permalink] [raw]
Subject: Re: [PATCH v3 10/13] x86/retpoline/pvops: Convert assembler indirect jumps

On 04/01/18 15:02, Juergen Gross wrote:
> On 04/01/18 15:37, David Woodhouse wrote:
>> Convert pvops invocations to use non-speculative call sequences, when
>> CONFIG_RETPOLINE is enabled.
>>
>> There is scope for future optimisation here — once the pvops methods are
>> actually set, we could just turn the damn things into *direct* jumps.
>> But this is perfectly sufficient for now, without that added complexity.
> I don't see the need to modify the pvops calls.
>
> All indirect calls are replaced by either direct calls or other code
> long before any user code is active.
>
> For modules the replacements are in place before the module is being
> used.

When booting virtualised, sibling hyperthreads can arrange VM-to-VM SP2
attacks.

One mitigating consideration, though, is whether there is any interesting
data to leak that early during boot.

~Andrew

2018-01-04 15:54:41

by Juergen Gross

[permalink] [raw]
Subject: Re: [PATCH v3 06/13] x86/retpoline/xen: Convert Xen hypercall indirect jumps

On 04/01/18 15:37, David Woodhouse wrote:
> Convert indirect call in Xen hypercall to use non-speculative sequence,
> when CONFIG_RETPOLINE is enabled.
>
> Signed-off-by: David Woodhouse <[email protected]>

Reviewed-by: Juergen Gross <[email protected]>


Juergen

2018-01-04 16:04:41

by Juergen Gross

[permalink] [raw]
Subject: Re: [PATCH v3 10/13] x86/retpoline/pvops: Convert assembler indirect jumps

On 04/01/18 16:18, Andrew Cooper wrote:
> On 04/01/18 15:02, Juergen Gross wrote:
>> On 04/01/18 15:37, David Woodhouse wrote:
>>> Convert pvops invocations to use non-speculative call sequences, when
>>> CONFIG_RETPOLINE is enabled.
>>>
>>> There is scope for future optimisation here — once the pvops methods are
>>> actually set, we could just turn the damn things into *direct* jumps.
>>> But this is perfectly sufficient for now, without that added complexity.
>> I don't see the need to modify the pvops calls.
>>
>> All indirect calls are replaced by either direct calls or other code
>> long before any user code is active.
>>
>> For modules the replacements are in place before the module is being
>> used.
>
> When booting virtualised, sibling hyperthreads can arrange VM-to-VM SP2
> attacks.
>
> One mitigation though is to consider if there is any interesting data to
> leak that early during boot.

Right. And if you are able to detect a booting VM in the other
hyperthread, obtain enough information about its kernel layout and
extract the information via statistical methods in the very short time
frame of the boot before pvops patching takes place. Not to forget the
vast amount of data the booting VM will pull through the caches, making
side-channel attacks a rather flaky endeavor...

I'd opt for leaving pvops calls untouched. The Reviewed-by: I gave for
the patch was just for its correctness. :-)


Juergen

2018-01-04 16:19:21

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RFC] Retpoline: Binary mitigation for branch-target-injection (aka "Spectre")

On Thu, Jan 4, 2018 at 1:30 AM, Woodhouse, David <[email protected]> wrote:
> On Thu, 2018-01-04 at 01:10 -0800, Paul Turner wrote:
>> Apologies for the discombobulation around today's disclosure. Obviously the
>> original goal was to communicate this a little more coherently, but the
>> unscheduled advances in the disclosure disrupted the efforts to pull this
>> together more cleanly.
>>
>> I wanted to open discussion the "retpoline" approach and and define its
>> requirements so that we can separate the core
>> details from questions regarding any particular implementation thereof.
>>
>> As a starting point, a full write-up describing the approach is available at:
>> https://support.google.com/faqs/answer/7625886
>
> Note that (ab)using 'ret' in this way is incompatible with CET on
> upcoming processors. HJ added a -mno-indirect-branch-register option to
> the latest round of GCC patches, which puts the branch target in a
> register instead of on the stack. My kernel patches (which I'm about to
> reconcile with Andi's tweaks and post) do the same.
>
> That means that in the cases where at runtime we want to ALTERNATIVE
> out the retpoline, it just turns back into a bare 'jmp *\reg'.
>
>

I hate to say this, but I think Intel should postpone CET until the
dust settles. Intel should also consider a hardware-protected stack
that is only accessible with PUSH, POP, CALL, RET, and a new MOVSTACK
instruction. That, by itself, would give considerable protection.
But we still need JMP_NO_SPECULATE. Or, better yet, get the CPU to
stop leaking data during speculative execution.

2018-01-04 16:24:44

by David Woodhouse

[permalink] [raw]
Subject: Re: [RFC] Retpoline: Binary mitigation for branch-target-injection (aka "Spectre")

On Thu, 2018-01-04 at 08:18 -0800, Andy Lutomirski wrote:
> I hate to say this, but I think Intel should postpone CET until the
> dust settles.

CET isn't a *problem* for retpoline. We've had a CET-compatible version
for a while now, and I posted it earlier. It's just that Andi was
working from an older version of my patches.

Of course, there's a school of thought that says that Intel should
postpone *everything* until this is all fixed sanely, but there's
nothing special about CET in that respect.



2018-01-04 16:37:34

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH v3 10/13] x86/retpoline/pvops: Convert assembler indirect jumps

On Thu, Jan 04, 2018 at 04:02:06PM +0100, Juergen Gross wrote:
> On 04/01/18 15:37, David Woodhouse wrote:
> > Convert pvops invocations to use non-speculative call sequences, when
> > CONFIG_RETPOLINE is enabled.
> >
> > There is scope for future optimisation here — once the pvops methods are
> > actually set, we could just turn the damn things into *direct* jumps.
> > But this is perfectly sufficient for now, without that added complexity.
>
> I don't see the need to modify the pvops calls.
>
> All indirect calls are replaced by either direct calls or other code
> long before any user code is active.
>
> For modules the replacements are in place before the module is being
> used.

Agreed. This shouldn't be needed.

-Andi

2018-01-04 18:03:22

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support

David,
these are all marked as spam, because your emails have screwed up
DKIM. You used

From: David Woodhouse <[email protected]>

but then you used infradead as a mailer, so it has the DKIM signature
from infradead, not from Amazon.co.uk.

The DKIM signature does pass for infradead, but amazon dmarc - quite
reasonably - wants the from to match.

End result:

dmarc=fail (p=QUARANTINE sp=QUARANTINE dis=QUARANTINE)
header.from=amazon.co.uk

and everything was in spam.

Please don't do this. There's enough spam in the world that we don't
need people mis-configuring their emails and making real emails look
like spam too.

Linus

On Thu, Jan 4, 2018 at 6:36 AM, David Woodhouse <[email protected]> wrote:
> Enable the use of -mindirect-branch=thunk-extern in newer GCC, and provide
> the corresponding thunks. Provide assembler macros for invoking the thunks
> in the same way that GCC does, from native and inline assembler.

2018-01-04 18:17:52

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support

On Thu, Jan 04, 2018 at 02:36:58PM +0000, David Woodhouse wrote:
> Enable the use of -mindirect-branch=thunk-extern in newer GCC, and provide
> the corresponding thunks. Provide assembler macros for invoking the thunks
> in the same way that GCC does, from native and inline assembler.
>
> This adds an X86_BUG_NO_RETPOLINE "feature" for runtime patching out
> of the thunks. This is a placeholder for now; the patches which support
> the new Intel/AMD microcode features will flesh out the precise conditions
> under which we disable the retpoline and do other things instead.
>
> [Andi Kleen: Rename the macros and add CONFIG_RETPOLINE option]
>
> Signed-off-by: David Woodhouse <[email protected]>
...
> +.macro THUNK sp reg
> + .section .text.__x86.indirect_thunk.\reg
> +
> +ENTRY(__x86.indirect_thunk.\reg)
> + CFI_STARTPROC
> + ALTERNATIVE "call 2f", __stringify(jmp *%\reg), X86_BUG_NO_RETPOLINE
> +1:
> + lfence
> + jmp 1b
> +2:
> + mov %\reg, (%\sp)
> + ret
> + CFI_ENDPROC
> +ENDPROC(__x86.indirect_thunk.\reg)

Clearly Paul's approach to retpoline without lfence is faster.
I'm guessing it wasn't shared with amazon/intel until now and
this set of patches is going to adopt it, right?

Paul, could you share a link to a set of alternative gcc patches
that do retpoline similar to llvm diff ?

2018-01-04 18:25:39

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support

On Thu, Jan 4, 2018 at 10:17 AM, Alexei Starovoitov
<[email protected]> wrote:
>
> Clearly Paul's approach to retpoline without lfence is faster.
> I'm guessing it wasn't shared with amazon/intel until now and
> this set of patches is going to adopt it, right?
>
> Paul, could you share a link to a set of alternative gcc patches
> that do retpoline similar to llvm diff ?

What is the alternative approach? Is it literally just doing a

call 1f
1: mov real_target,(%rsp)
ret

on the assumption that the "ret" will always just predict to that "1"
due to the call stack?

Linus

2018-01-04 18:36:06

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support

On Thu, Jan 04, 2018 at 10:25:35AM -0800, Linus Torvalds wrote:
> On Thu, Jan 4, 2018 at 10:17 AM, Alexei Starovoitov
> <[email protected]> wrote:
> >
> > Clearly Paul's approach to retpoline without lfence is faster.
> > I'm guessing it wasn't shared with amazon/intel until now and
> > this set of patches is going to adopt it, right?
> >
> > Paul, could you share a link to a set of alternative gcc patches
> > that do retpoline similar to llvm diff ?
>
> What is the alternative approach? Is it literally just doing a
>
> call 1f
> 1: mov real_target,(%rsp)
> ret
>
> on the assumption that the "ret" will always just predict to that "1"
> due to the call stack?

Pretty much.
Paul's writeup: https://support.google.com/faqs/answer/7625886
tldr: jmp *%r11 gets converted to:
call set_up_target;
capture_spec:
pause;
jmp capture_spec;
set_up_target:
mov %r11, (%rsp);
ret;
where capture_spec part will be looping speculatively.
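For concreteness, here is a rough sketch (in GAS syntax) of both converted
forms; the use of %r11 and the label names are illustrative assumptions. The
jmp form is exactly the tldr above; the call form jumps over the retpoline body
and then calls into it, so that the callee's eventual ret returns to the
original call site (essentially the shape that later appears as RETPOLINE_CALL
in David's series).

	/* jmp *%r11 becomes: */
	call	.Lset_up_target
.Lcapture_spec:			/* only ever executed speculatively */
	pause
	jmp	.Lcapture_spec
.Lset_up_target:
	mov	%r11, (%rsp)	/* overwrite the pushed return address with the real target */
	ret			/* architecturally: jump to *%r11 */

	/* call *%r11 becomes: */
	jmp	.Ldo_call
.Lretpoline:
	call	.Lfixup
.Lspec_trap:			/* speculative trap loop, as above */
	pause
	jmp	.Lspec_trap
.Lfixup:
	mov	%r11, (%rsp)
	ret
.Ldo_call:
	call	.Lretpoline	/* pushes the real return address, then jumps via the retpoline */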

2018-01-04 18:40:33

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support

> Clearly Paul's approach to retpoline without lfence is faster.
> I'm guessing it wasn't shared with amazon/intel until now and
> this set of patches is going to adopt it, right?
>
> Paul, could you share a link to a set of alternative gcc patches
> that do retpoline similar to llvm diff ?

I don't think it's a good idea to use any sequence not signed
off by CPU designers and extensively tested.

While another one may work for most tests, it could always fail in
some corner case.

Right now we have the more heavyweight one, and I would
suggest staying with that one for now. Then we can worry
about more optimizations later.

Correctness first.

-Andi

2018-01-04 19:28:05

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support

On Thu, 2018-01-04 at 10:36 -0800, Alexei Starovoitov wrote:
>
> Pretty much.
> Paul's writeup: https://support.google.com/faqs/answer/7625886
> tldr: jmp *%r11 gets converted to:
> call set_up_target;
> capture_spec:
>   pause;
>   jmp capture_spec;
> set_up_target:
>   mov %r11, (%rsp);
>   ret;
> where capture_spec part will be looping speculatively.

That is almost identical to what's in my latest patch set, except that
the capture_spec loop has 'lfence' instead of 'pause'.

As Andi says, I'd want to see explicit approval from the CPU architects
for making that change.

We've already had false starts there — for a long time, Intel thought
that a much simpler option with an lfence after the register load was
sufficient, and then eventually worked out that in some rare cases it
wasn't. While AMD still seem to think it *is* sufficient for them,
apparently.
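For reference, the simpler sequence being described (an lfence after the
register load, with no retpoline) would look roughly like the following
sketch; the use of %r11 as the target register is an illustrative assumption:

	/* %r11 already holds the branch target */
	lfence			/* serialize: intended to stop speculation past this point */
	jmp	*%r11		/* believed sufficient on AMD, but not in all cases on Intel */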



2018-01-04 22:06:05

by Justin Forbes

[permalink] [raw]
Subject: Re: [PATCH v3 11/13] retpoline/taint: Taint kernel for missing retpoline in compiler

On Thu, Jan 4, 2018 at 8:37 AM, David Woodhouse <[email protected]> wrote:
> From: Andi Kleen <[email protected]>
>
> When the kernel or a module hasn't been compiled with a retpoline
> aware compiler, print a warning and set a taint flag.
>
> For modules it is checked at compile time; however, it cannot
> check assembler or other non-compiled objects used in the module link.
>
> For lack of a better letter, it uses taint option 'Z'.
>

Is taint really the right thing to do here? Why not just do pr_info?

2018-01-05 10:28:31

by Paul Turner

[permalink] [raw]
Subject: Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support

On Thu, Jan 04, 2018 at 07:27:58PM +0000, David Woodhouse wrote:
> On Thu, 2018-01-04 at 10:36 -0800, Alexei Starovoitov wrote:
> >
> > Pretty much.
> > Paul's writeup: https://support.google.com/faqs/answer/7625886
> > tldr: jmp *%r11 gets converted to:
> > call set_up_target;
> > capture_spec:
> >   pause;
> >   jmp capture_spec;
> > set_up_target:
> >   mov %r11, (%rsp);
> >   ret;
> > where capture_spec part will be looping speculatively.
>
> That is almost identical to what's in my latest patch set, except that
> the capture_spec loop has 'lfence' instead of 'pause'.

When choosing this sequence I benchmarked several alternatives here, including
(nothing, nops, fences, and other serializing instructions such as cpuid).

The "pause; jmp" sequence proved minutely faster than "lfence;jmp" which is why
it was chosen.

"pause; jmp" 33.231 cycles/call 9.517 ns/call
"lfence; jmp" 33.354 cycles/call 9.552 ns/call

(Timings are for a complete retpolined indirect branch.)
>
> As Andi says, I'd want to see explicit approval from the CPU architects
> for making that change.

Beyond guaranteeing that speculative execution is constrained, the choice of
sequence here is a performance detail and not one of correctness.

>
> We've already had false starts there — for a long time, Intel thought
> that a much simpler option with an lfence after the register load was
> sufficient, and then eventually worked out that in some rare cases it
> wasn't. While AMD still seem to think it *is* sufficient for them,
> apparently.

As an interesting aside, the fact that speculation proceeds beyond an lfence can
be trivially demonstrated using the timings above. In fact, if we substitute
only "lfence" (with no jmp), we see:

29.573 cycles/call 8.469 ns/call

The only way for this timing to differ is if speculation beyond the lfence
executed differently.

That said, while this is a negative result, it does suggest that the jmp is
contributing a larger-than-realized cost to our speculative loop. We can likely
shave off some additional time with some unrolling. I tried this previously but
did not see results above the noise floor; it seems worth trying again, and I
will take a look tomorrow.


2018-01-05 10:32:56

by Paul Turner

[permalink] [raw]
Subject: Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support

On Thu, Jan 04, 2018 at 10:40:23AM -0800, Andi Kleen wrote:
> > Clearly Paul's approach to retpoline without lfence is faster.
> > I'm guessing it wasn't shared with amazon/intel until now and
> > this set of patches is going to adopt it, right?
> >
> > Paul, could you share a link to a set of alternative gcc patches
> > that do retpoline similar to llvm diff ?
>
> I don't think it's a good idea to use any sequence not signed
> off by CPU designers and extensively tested.

I can confirm that "pause; jmp" has been previously reviewed by your side.

That said, again, per the other email: once we have guaranteed that speculative
execution will reach this point and that it does something "safe", the choice of
sequence here is a performance detail rather than one of correctness.

>
> While another one may work for most tests, it could always fail in
> some corner case.
>
> Right now we have the more heavyweight one, and I would
> suggest staying with that one for now. Then we can worry
> about more optimizations later.
>
> Correctness first.
>
> -Andi

2018-01-05 10:41:01

by Paul Turner

[permalink] [raw]
Subject: Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support

On Thu, Jan 04, 2018 at 10:25:35AM -0800, Linus Torvalds wrote:
> On Thu, Jan 4, 2018 at 10:17 AM, Alexei Starovoitov
> <[email protected]> wrote:
> >
> > Clearly Paul's approach to retpoline without lfence is faster.

Using pause rather than lfence does not represent a fundamental difference here.

A protected indirect branch always adds ~25-30 cycles of overhead.

That this can be avoided in practice is a function of two key factors:
(1) Kernel code uses fewer indirect branches.
(2) The overhead can be avoided for hot indirect branches via devirtualization,
    e.g. the semantic equivalent of:

        if (ptr == foo)
                foo();
        else
                (*ptr)();

    allowing foo() to be called directly, even though it was provided as an
    indirect (a rough assembly sketch of this follows below).
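As a sketch of what (2) can look like at the assembly level, assuming a
hypothetical hot target foo, the function pointer in %rax, and the NOSPEC_CALL
macro from this series for the fallback:

	cmpq	$foo, %rax	/* is the target the expected hot function? */
	jne	1f
	call	foo		/* common case: direct, predictable call, no retpoline cost */
	jmp	2f
1:	NOSPEC_CALL rax		/* uncommon case: retpolined indirect call */
2: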

> > I'm guessing it wasn't shared with amazon/intel until now and
> > this set of patches is going to adopt it, right?
> >
> > Paul, could you share a link to a set of alternative gcc patches
> > that do retpoline similar to llvm diff ?
>
> What is the alternative approach? Is it literally just doing a
>
> call 1f
> 1: mov real_target,(%rsp)
> ret
>
> on the assumption that the "ret" will always just predict to that "1"
> due to the call stack?
>
> Linus

2018-01-05 10:49:51

by Paul Turner

[permalink] [raw]
Subject: Re: [RFC] Retpoline: Binary mitigation for branch-target-injection (aka "Spectre")

On Thu, Jan 04, 2018 at 08:18:57AM -0800, Andy Lutomirski wrote:
> On Thu, Jan 4, 2018 at 1:30 AM, Woodhouse, David <[email protected]> wrote:
> > On Thu, 2018-01-04 at 01:10 -0800, Paul Turner wrote:
> >> Apologies for the discombobulation around today's disclosure. Obviously the
> >> original goal was to communicate this a little more coherently, but the
> >> unscheduled advances in the disclosure disrupted the efforts to pull this
> >> together more cleanly.
> >>
> >> I wanted to open discussion the "retpoline" approach and and define its
> >> requirements so that we can separate the core
> >> details from questions regarding any particular implementation thereof.
> >>
> >> As a starting point, a full write-up describing the approach is available at:
> >> https://support.google.com/faqs/answer/7625886
> >
> > Note that (ab)using 'ret' in this way is incompatible with CET on
> > upcoming processors. HJ added a -mno-indirect-branch-register option to
> > the latest round of GCC patches, which puts the branch target in a
> > register instead of on the stack. My kernel patches (which I'm about to
> > reconcile with Andi's tweaks and post) do the same.
> >
> > That means that in the cases where at runtime we want to ALTERNATIVE
> > out the retpoline, it just turns back into a bare 'jmp *\reg'.
> >
> >
>
> I hate to say this, but I think Intel should postpone CET until the
> dust settles. Intel should also consider a hardware-protected stack
> that is only accessible with PUSH, POP, CALL, RET, and a new MOVSTACK
> instruction. That, by itself, would give considerable protection.
> But we still need JMP_NO_SPECULATE. Or, better yet, get the CPU to
> stop leaking data during speculative execution.

Echoing Andy's thoughts, but from a slightly different angle:

1) BTI is worse than the current classes of return attack. Given this,
considered as a binary choice, it's equivalent to the current state of the
world (e.g. no CET).
2) CET will not be "free". I suspect that in its initial revisions it will be
more valuable for protecting end users than enterprise workloads (the cost is
not observable for interactive workloads because there's tons of headroom in
the first place).

While the potential incompatibility is unfortunate, I'm not sure it makes a
significant difference to the adoption rate of CET.

2018-01-05 10:55:55

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support

On Fri, 2018-01-05 at 02:28 -0800, Paul Turner wrote:
> On Thu, Jan 04, 2018 at 07:27:58PM +0000, David Woodhouse wrote:
> > On Thu, 2018-01-04 at 10:36 -0800, Alexei Starovoitov wrote:
> > > 
> > > Pretty much.
> > > Paul's writeup: https://support.google.com/faqs/answer/7625886
> > > tldr: jmp *%r11 gets converted to:
> > > call set_up_target;
> > > capture_spec:
> > >   pause;
> > >   jmp capture_spec;
> > > set_up_target:
> > >   mov %r11, (%rsp);
> > >   ret;
> > > where capture_spec part will be looping speculatively.
> > 
> > That is almost identical to what's in my latest patch set, except that
> > the capture_spec loop has 'lfence' instead of 'pause'.
>
> When choosing this sequence I benchmarked several alternatives here, including
> (nothing, nops, fences, and other serializing instructions such as cpuid).
>
> The "pause; jmp" sequence proved minutely faster than "lfence;jmp" which is why
> it was chosen.
>
>   "pause; jmp" 33.231 cycles/call 9.517 ns/call
>   "lfence; jmp" 33.354 cycles/call 9.552 ns/call
>
> (Timings are for a complete retpolined indirect branch.)

Yeah, I studiously ignored you here and went with only what Intel had
*assured* me was correct and put into the GCC patches, rather than
chasing those 35 picoseconds ;)

The GCC patch set already had about four different variants over time,
with associated "oh shit, that one doesn't actually work; try this".
What we have in my patch set is precisely what GCC emits at the moment.

I'm all for optimising it further, but maybe not this week.

Other than that, is there any other development from your side that I
haven't captured in the latest (v4) series?
http://git.infradead.org/users/dwmw2/linux-retpoline.git/



2018-01-05 11:19:38

by Paul Turner

[permalink] [raw]
Subject: Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support

On Fri, Jan 05, 2018 at 10:55:38AM +0000, David Woodhouse wrote:
> On Fri, 2018-01-05 at 02:28 -0800, Paul Turner wrote:
> > On Thu, Jan 04, 2018 at 07:27:58PM +0000, David Woodhouse wrote:
> > > On Thu, 2018-01-04 at 10:36 -0800, Alexei Starovoitov wrote:
> > > >
> > > > Pretty much.
> > > > Paul's writeup: https://support.google.com/faqs/answer/7625886
> > > > tldr: jmp *%r11 gets converted to:
> > > > call set_up_target;
> > > > capture_spec:
> > > >   pause;
> > > >   jmp capture_spec;
> > > > set_up_target:
> > > >   mov %r11, (%rsp);
> > > >   ret;
> > > > where capture_spec part will be looping speculatively.
> > >
> > > That is almost identical to what's in my latest patch set, except that
> > > the capture_spec loop has 'lfence' instead of 'pause'.
> >
> > When choosing this sequence I benchmarked several alternatives here, including
> > (nothing, nops, fences, and other serializing instructions such as cpuid).
> >
> > The "pause; jmp" sequence proved minutely faster than "lfence;jmp" which is why
> > it was chosen.
> >
> > ? "pause; jmp" 33.231 cycles/call 9.517 ns/call
> > ? "lfence; jmp" 33.354 cycles/call 9.552 ns/call
> >
> > (Timings are for a complete retpolined indirect branch.)
>
> Yeah, I studiously ignored you here and went with only what Intel had
> *assured* me was correct and put into the GCC patches, rather than
> chasing those 35 picoseconds ;)
>
> The GCC patch set already had about four different variants over time,
> with associated "oh shit, that one doesn't actually work; try this".
> What we have in my patch set is precisely what GCC emits at the moment.

While I can't speak to the various iterations of the gcc patches, I can confirm
that the details originally provided were reviewed by Intel.

>
> I'm all for optimising it further, but maybe not this week.
>
> Other than that, is there any other development from your side that I
> haven't captured in the latest (v4) series?
> http://git.infradead.org/users/dwmw2/linux-retpoline.git/


2018-01-05 11:25:15

by Paul Turner

[permalink] [raw]
Subject: Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support

On Fri, Jan 05, 2018 at 10:55:38AM +0000, David Woodhouse wrote:
> On Fri, 2018-01-05 at 02:28 -0800, Paul Turner wrote:
> > On Thu, Jan 04, 2018 at 07:27:58PM +0000, David Woodhouse wrote:
> > > On Thu, 2018-01-04 at 10:36 -0800, Alexei Starovoitov wrote:
> > > >
> > > > Pretty much.
> > > > Paul's writeup: https://support.google.com/faqs/answer/7625886
> > > > tldr: jmp *%r11 gets converted to:
> > > > call set_up_target;
> > > > capture_spec:
> > > >   pause;
> > > >   jmp capture_spec;
> > > > set_up_target:
> > > >   mov %r11, (%rsp);
> > > >   ret;
> > > > where capture_spec part will be looping speculatively.
> > >
> > > That is almost identical to what's in my latest patch set, except that
> > > the capture_spec loop has 'lfence' instead of 'pause'.
> >
> > When choosing this sequence I benchmarked several alternatives here, including
> > (nothing, nops, fences, and other serializing instructions such as cpuid).
> >
> > The "pause; jmp" sequence proved minutely faster than "lfence;jmp" which is why
> > it was chosen.
> >
> > ? "pause; jmp" 33.231 cycles/call 9.517 ns/call
> > ? "lfence; jmp" 33.354 cycles/call 9.552 ns/call
> >
> > (Timings are for a complete retpolined indirect branch.)
>
> Yeah, I studiously ignored you here and went with only what Intel had
> *assured* me was correct and put into the GCC patches, rather than
> chasing those 35 picoseconds ;)

It's also notable here that while the difference is small in terms of absolute
values, it's likely due to reduced variation:

I would expect:
- pause to be extremely consistent in its timings
- pause and lfence to be close on their average timings, particularly in a
micro-benchmark.

This suggests that the difference may be larger in the occasional cases where
you get "unlucky" and hit some other uarch interaction in the lfence path.
>
> The GCC patch set already had about four different variants over time,
> with associated "oh shit, that one doesn't actually work; try this".
> What we have in my patch set is precisely what GCC emits at the moment.
>
> I'm all for optimising it further, but maybe not this week.
>
> Other than that, is there any other development from your side that I
> haven't captured in the latest (v4) series?
> http://git.infradead.org/users/dwmw2/linux-retpoline.git/


2018-01-05 11:26:34

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support

On 05/01/2018 11:28, Paul Turner wrote:
>
> The "pause; jmp" sequence proved minutely faster than "lfence;jmp" which is why
> it was chosen.
>
> "pause; jmp" 33.231 cycles/call 9.517 ns/call
> "lfence; jmp" 33.354 cycles/call 9.552 ns/call

Do you have timings for a non-retpolined indirect branch with the
predictor suppressed via IBRS=1? So at least we can compute the break
even point.

Paolo

2018-01-05 12:20:36

by Paul Turner

[permalink] [raw]
Subject: Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support

On Fri, Jan 5, 2018 at 3:26 AM, Paolo Bonzini <[email protected]> wrote:
> On 05/01/2018 11:28, Paul Turner wrote:
>>
>> The "pause; jmp" sequence proved minutely faster than "lfence;jmp" which is why
>> it was chosen.
>>
>> "pause; jmp" 33.231 cycles/call 9.517 ns/call
>> "lfence; jmp" 33.354 cycles/call 9.552 ns/call
>
> Do you have timings for a non-retpolined indirect branch with the
> predictor suppressed via IBRS=1? So at least we can compute the break
> even point.

The data I collected here previously had the run-time cost as a wash:
on Skylake, an indirect branch with IBRS=1 and a retpolined indirect branch
had costs within a few cycles of each other.

The costs to consider when making a choice here are:

- The transition overheads, i.e. how frequently you will be switching in
and out of protected code (IBRS needs to be enabled and disabled at these
boundaries).
- The frequency at which you will be executing protected code on one
sibling and unprotected code on another (enabling IBRS may affect sibling
execution, depending on SKU).
- The implementation cost (retpoline requires auditing/rebuilding your
target, while IBRS can be used out of the box).


>
> Paolo

2018-01-05 12:54:50

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support

On Thu, 4 Jan 2018, David Woodhouse wrote:
> diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
> index 07cdd1715705..900fa7016d3f 100644
> --- a/arch/x86/include/asm/cpufeatures.h
> +++ b/arch/x86/include/asm/cpufeatures.h
> @@ -342,5 +342,6 @@
> #define X86_BUG_MONITOR X86_BUG(12) /* IPI required to wake up remote CPU */
> #define X86_BUG_AMD_E400 X86_BUG(13) /* CPU is among the affected by Erratum 400 */
> #define X86_BUG_CPU_INSECURE X86_BUG(14) /* CPU is insecure and needs kernel page table isolation */
> +#define X86_BUG_NO_RETPOLINE X86_BUG(15) /* Placeholder: disable retpoline branch thunks */

I think this is the wrong approach. We have X86_BUG_CPU_INSECURE, which now
should be renamed to X86_BUG_CPU_MELTDOWN_V3 or something like that. It
tells the kernel, that the CPU is affected by variant 3.

If the kernel detects that and has PTI support then it sets the 'pti'
feature bit which tells that the mitigation is in place.

So what we really want is

X86_BUG_MELTDOWN_V1/2/3

which get set when the CPU is affected by a particular variant and then
have feature flags

X86_FEATURE_RETPOLINE
X86_FEATURE_IBRS
X86_FEATURE_NOSPEC

or whatever it takes to signal that a mitigation is in place. Then we make
all actions depend on those feature flags, very much in the way we do for
FEATURE_PTI.

If CPUs come along which are not affected by a particular variant the BUG
flag does not get set.

Thanks,

tglx

2018-01-05 13:01:08

by Juergen Gross

[permalink] [raw]
Subject: Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support

On 05/01/18 13:54, Thomas Gleixner wrote:
> On Thu, 4 Jan 2018, David Woodhouse wrote:
>> diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
>> index 07cdd1715705..900fa7016d3f 100644
>> --- a/arch/x86/include/asm/cpufeatures.h
>> +++ b/arch/x86/include/asm/cpufeatures.h
>> @@ -342,5 +342,6 @@
>> #define X86_BUG_MONITOR X86_BUG(12) /* IPI required to wake up remote CPU */
>> #define X86_BUG_AMD_E400 X86_BUG(13) /* CPU is among the affected by Erratum 400 */
>> #define X86_BUG_CPU_INSECURE X86_BUG(14) /* CPU is insecure and needs kernel page table isolation */
>> +#define X86_BUG_NO_RETPOLINE X86_BUG(15) /* Placeholder: disable retpoline branch thunks */
>
> I think this is the wrong approach. We have X86_BUG_CPU_INSECURE, which now
> should be renamed to X86_BUG_CPU_MELTDOWN_V3 or something like that. It
> tells the kernel, that the CPU is affected by variant 3.

MELTDOWN is variant 3.

>
> If the kernel detects that and has PTI support then it sets the 'pti'
> feature bit which tells that the mitigation is in place.
>
> So what we really want is
>
> X86_BUG_MELTDOWN_V1/2/3

X86_BUG_MELTDOWN, X86_BUG_SPECTRE_V1, X86_BUG_SPECTRE_V2


Juergen

2018-01-05 13:04:09

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support

On Fri, 5 Jan 2018, Juergen Gross wrote:
> On 05/01/18 13:54, Thomas Gleixner wrote:
> > On Thu, 4 Jan 2018, David Woodhouse wrote:
> >> diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
> >> index 07cdd1715705..900fa7016d3f 100644
> >> --- a/arch/x86/include/asm/cpufeatures.h
> >> +++ b/arch/x86/include/asm/cpufeatures.h
> >> @@ -342,5 +342,6 @@
> >> #define X86_BUG_MONITOR X86_BUG(12) /* IPI required to wake up remote CPU */
> >> #define X86_BUG_AMD_E400 X86_BUG(13) /* CPU is among the affected by Erratum 400 */
> >> #define X86_BUG_CPU_INSECURE X86_BUG(14) /* CPU is insecure and needs kernel page table isolation */
> >> +#define X86_BUG_NO_RETPOLINE X86_BUG(15) /* Placeholder: disable retpoline branch thunks */
> >
> > I think this is the wrong approach. We have X86_BUG_CPU_INSECURE, which now
> > should be renamed to X86_BUG_CPU_MELTDOWN_V3 or something like that. It
> > tells the kernel, that the CPU is affected by variant 3.
>
> MELTDOWN is variant 3.
>
> >
> > If the kernel detects that and has PTI support then it sets the 'pti'
> > feature bit which tells that the mitigation is in place.
> >
> > So what we really want is
> >
> > X86_BUG_MELTDOWN_V1/2/3
>
> X86_BUG_MELTDOWN, X86_BUG_SPECTRE_V1, X86_BUG_SPECTRE_V2

Right. I'm confused as always :)

Subject: [tip:x86/pti] x86/alternatives: Add missing '\n' at end of ALTERNATIVE inline asm

Commit-ID: b9e705ef7cfaf22db0daab91ad3cd33b0fa32eb9
Gitweb: https://git.kernel.org/tip/b9e705ef7cfaf22db0daab91ad3cd33b0fa32eb9
Author: David Woodhouse <[email protected]>
AuthorDate: Thu, 4 Jan 2018 14:37:05 +0000
Committer: Thomas Gleixner <[email protected]>
CommitDate: Fri, 5 Jan 2018 14:01:15 +0100

x86/alternatives: Add missing '\n' at end of ALTERNATIVE inline asm

Where an ALTERNATIVE is used in the middle of an inline asm block, this
would otherwise lead to the following instruction being appended directly
to the trailing ".popsection", and a failed compile.

Fixes: 9cebed423c84 ("x86, alternative: Use .pushsection/.popsection")
Signed-off-by: David Woodhouse <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: Rik van Riel <[email protected]>
Cc: [email protected]
Cc: Tim Chen <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Jiri Kosina <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: [email protected]
Link: https://lkml.kernel.org/r/[email protected]
---
arch/x86/include/asm/alternative.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/alternative.h b/arch/x86/include/asm/alternative.h
index dbfd085..cf5961c 100644
--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -140,7 +140,7 @@ static inline int alternatives_text_reserved(void *start, void *end)
".popsection\n" \
".pushsection .altinstr_replacement, \"ax\"\n" \
ALTINSTR_REPLACEMENT(newinstr, feature, 1) \
- ".popsection"
+ ".popsection\n"

#define ALTERNATIVE_2(oldinstr, newinstr1, feature1, newinstr2, feature2)\
OLDINSTR_2(oldinstr, 1, 2) \
@@ -151,7 +151,7 @@ static inline int alternatives_text_reserved(void *start, void *end)
".pushsection .altinstr_replacement, \"ax\"\n" \
ALTINSTR_REPLACEMENT(newinstr1, feature1, 1) \
ALTINSTR_REPLACEMENT(newinstr2, feature2, 2) \
- ".popsection"
+ ".popsection\n"

/*
* Alternative instructions for different CPU types or capabilities.

2018-01-05 14:32:39

by Woodhouse, David

[permalink] [raw]
Subject: Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support

On Fri, 2018-01-05 at 13:54 +0100, Thomas Gleixner wrote:
> On Thu, 4 Jan 2018, David Woodhouse wrote:
> > diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
> > index 07cdd1715705..900fa7016d3f 100644
> > --- a/arch/x86/include/asm/cpufeatures.h
> > +++ b/arch/x86/include/asm/cpufeatures.h
> > @@ -342,5 +342,6 @@
> >  #define X86_BUG_MONITOR                      X86_BUG(12) /* IPI required to wake up remote CPU */
> >  #define X86_BUG_AMD_E400             X86_BUG(13) /* CPU is among the affected by Erratum 400 */
> >  #define X86_BUG_CPU_INSECURE         X86_BUG(14) /* CPU is insecure and needs kernel page table isolation */
> > +#define X86_BUG_NO_RETPOLINE         X86_BUG(15) /* Placeholder: disable retpoline branch thunks */
>
> I think this is the wrong approach. We have X86_BUG_CPU_INSECURE, which now
> should be renamed to X86_BUG_CPU_MELTDOWN_V3 or something like that. It
> tells the kernel, that the CPU is affected by variant 3.

As it says, it's a placeholder. The actual conditions depend on whether
we decide to use IBRS or not, which also depends on whether we find
IBRS_ATT or whether we're on Skylake+.

The IBRS patch series should be updating it.

> So what we really want is
>
>    X86_BUG_MELTDOWN_V1/2/3
>
> which get set when the CPU is affected by a particular variant and then
> have feature flags
>
>    X86_FEATURE_RETPOLINE
>    X86_FEATURE_IBRS
>    X86_FEATURE_NOSPEC

At some point during this whole painful mess, I had come to the
conclusion that having relocations in altinstr didn't work, and that's
why I had X86_xx_NO_RETPOLINE instead of X86_xx_RETPOLINE. I now think
that something else was wrong when I was testing that, and relocs in
altinstr do work. So sure, X86_FEATURE_RETPOLINE ought to work. I can
change that round, and it's simpler for the IBRS patch set to take it
into account and set it when appropriate.



2018-01-05 16:42:31

by Woodhouse, David

[permalink] [raw]
Subject: Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support

On Fri, 2018-01-05 at 13:56 +0000, Woodhouse, David wrote:
>
> At some point during this whole painful mess, I had come to the
> conclusion that having relocations in altinstr didn't work, and that's
> why I had X86_xx_NO_RETPOLINE instead of X86_xx_RETPOLINE. I now think
> that something else was wrong when I was testing that, and relocs in
> altinstr do work. So sure, X86_FEATURE_RETPOLINE ought to work. I can
> change that round, and it's simpler for the IBRS patch set to take it
> into account and set it when appropriate.

+bpetkov

Nope, alternatives are broken. Only a jmp as the *first* opcode of
altinstr gets handled by recompute_jump(), while any subsequent insn is
just copied untouched.

To fix that and handle every instruction, the alternative code would
need to know about instruction lengths. I think we need to stick with
the inverted X86_FEATURE_NO_RETPOLINE flag for the moment, and not tie
it to a complex bugfix there.



2018-01-05 16:45:15

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support

On Fri, Jan 05, 2018 at 04:41:46PM +0000, Woodhouse, David wrote:
> Nope, alternatives are broken. Only a jmp as the *first* opcode of
> altinstr gets handled by recompute_jump(), while any subsequent insn is
> just copied untouched.

Not broken - simply no one needed it until now. I'm looking into it.
Looks like the insn decoder might come in handy finally.

:-)

--
Regards/Gruss,
Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
--

2018-01-05 17:08:16

by Josh Poimboeuf

[permalink] [raw]
Subject: Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support

On Fri, Jan 05, 2018 at 05:45:06PM +0100, Borislav Petkov wrote:
> On Fri, Jan 05, 2018 at 04:41:46PM +0000, Woodhouse, David wrote:
> > Nope, alternatives are broken. Only a jmp as the *first* opcode of
> > altinstr gets handled by recompute_jump(), while any subsequent insn is
> > just copied untouched.
>
> Not broken - simply no one needed it until now. I'm looking into it.
> Looks like the insn decoder might come in handy finally.
>
> :-)

I seem to recall that we also discussed the need for this for converting
pvops to use alternatives, though the "why" is eluding me at the moment.

--
Josh

2018-01-05 17:23:31

by Woodhouse, David

[permalink] [raw]
Subject: Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support

On Fri, 2018-01-05 at 17:45 +0100, Borislav Petkov wrote:
> On Fri, Jan 05, 2018 at 04:41:46PM +0000, Woodhouse, David wrote:
> > Nope, alternatives are broken. Only a jmp as the *first* opcode of
> > altinstr gets handled by recompute_jump(), while any subsequent insn is
> > just copied untouched.
>
> Not broken - simply no one needed it until now. I'm looking into it.
> Looks like the insn decoder might come in handy finally.

I typed 'jmp __x86.indirect_thunk' and it actually jumped to an address
which I believe is (__x86.indirect_thunk + &altinstr - &oldinstr).
Which made me sad, and took a while to debug.

Let's call it "not working for this use case", and I'll stick with
X86_FEATURE_NO_RETPOLINE for now, so the jump can go in the oldinstr
case not in altinstr.



2018-01-05 17:28:55

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support

On Fri, Jan 5, 2018 at 9:12 AM, Woodhouse, David <[email protected]> wrote:
>
> I typed 'jmp __x86.indirect_thunk' and it actually jumped to an address
> which I believe is (__x86.indirect_thunk + &altinstr - &oldinstr).
> Which made me sad, and took a while to debug.

Yes, I would suggest against expecting altinstructions to have
relocation information. They are generated in a different place, so..

That said, I honestly like the inline version (the one that is in the
google paper first) of the retpoline more than the out-of-line one.
And that one shouldn't have any relocation issues, because all the
offsets are relative.

We want to use that one for the entry stub anyway, can't we just
standardize on that one for all our assembly?

If the *compiler* uses the out-of-line version, that's a separate
thing. But for our asm cases, let's just make it all be the inline
case, ok?

It also should simplify the whole target generation. None of this
silly "__x86.indirect_thunk.\reg" crap with different targets for
different register choices.

Hmm?

Linus

2018-01-05 17:48:49

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support

On Fri, 2018-01-05 at 09:28 -0800, Linus Torvalds wrote:
>
> Yes, I would suggest against expecting altinstructions to have
> relocation information. They are generated in a different place, so..
>
> That said, I honestly like the inline version (the one that is in the
> google paper first) of the retpoline more than the out-of-line one.
> And that one shouldn't have any relocation issues, because all the
> offsets are relative.
>
> We want to use that one for the entry stub anyway, can't we just
> standardize on that one for all our assembly?

Sure, that's just a case of tweaking the macros a little to do it
inline instead of jumping to the thunk. I think judicious use of
__stringify() has resolved any issues I might have had with that
approach in the first place. However...

> If the *compiler* uses the out-of-line version, that's a separate
> thing. But for our asm cases, let's just make it all be the inline
> case, ok?
>
> It also should simplify the whole target generation. None of this
> silly "__x86.indirect_thunk.\reg" crap with different targets for
> different register choices.

If we're going to let the compiler use the out-of-line version (which
we *want* to because otherwise we don't get to ALTERNATIVE it away as
appropriate), then we still need to emit the various
__x86.indirect_thunk.\reg thunks for the compiler to use anyway.

At that point, there isn't a *huge* benefit to doing the retpoline
inline in our own asm, when it might as well just be a jump to the
thunks which exist anyway. But neither do I have an argument *against*
doing so, so I'll happily tweak the NOSPEC_CALL/NOSPEC_JMP macros
accordingly if you really want...



2018-01-05 18:05:40

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support

> If the *compiler* uses the out-of-line version, that's a separate
> thing. But for our asm cases, let's just make it all be the inline
> case, ok?

Should be a simple change.

>
> It also should simplify the whole target generation. None of this
> silly "__x86.indirect_thunk.\reg" crap with different targets for
> different register choices.
>
> Hmm?

We need the different thunks anyway for the compiler-generated code,
so they won't go away.

-Andi

2018-01-05 20:33:41

by Woodhouse, David

[permalink] [raw]
Subject: Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support

On Fri, 2018-01-05 at 09:28 -0800, Linus Torvalds wrote:
>
> Yes, I would suggest against expecting altinstructions to have
> relocation information. They are generated in a different place, so..
>
> That said, I honestly like the inline version (the one that is in the
> google paper first) of the retpoline more than the out-of-line one.
> And that one shouldn't have any relocation issues, because all the
> offsets are relative.

Note that the *only* issue with the relocation is that it pushes me to
use X86_FEATURE_NO_RETPOLINE for my feature instead of
X86_FEATURE_RETPOLINE as might be natural. And actually there's a
motivation to do that anyway, because of the way three-way alternatives
interact.

With the existing negative flag I can do 

 ALTERNATIVE_2(retpoline, K8: lfence+jmp; NO_RETPOLINE: jmp)

But if I invert it, I think I need two feature flags to get the same functionality — X86_FEATURE_RETPOLINE and X86_FEATURE_RETPOLINE_AMD:

 ALTERNATIVE_2(jmp, RETPOLINE: retpoline, RETPOLINE_AMD: lfence+jmp)

So I was completely prepared to live with the slightly unnatural
inverse logic of the feature flag. But since you asked...

> We want to use that one for the entry stub anyway, can't we just
> standardize on that one for all our assembly?
>
> If the *compiler* uses the out-of-line version, that's a separate
> thing. But for our asm cases, let's just make it all be the inline
> case, ok?

OK.... it starts off looking a bit like this. You're right; with the
caveats above it will let me invert the logic to X86_FEATURE_RETPOLINE
because the alternatives mechanism no longer needs to adjust any part
of the retpoline code path when it's in 'altinstr'.

And it does let me use a simple NOSPEC_JMP in the entry trampoline
instead of open-coding it again, which is nice.

But the first pass of it, below, is fugly as hell. I'll take another
look at *using* the ALTERNATIVE_2 macro instead of reimplementing it
for NOSPEC_CALL, but I strongly suspect that's just going to land me
with a fairly unreadable __stringify(jmp;call;lfence;jmp;call;mov;ret)
monstrosity all on a single line. Assembler macros are... brittle.

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 76f94bbacaec..8f7e1129f493 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -188,17 +188,7 @@ ENTRY(entry_SYSCALL_64_trampoline)
   */
  pushq %rdi
  movq $entry_SYSCALL_64_stage2, %rdi
- /*
-  * Open-code the retpoline from retpoline.S, because we can't
-  * just jump to it directly.
-  */
- ALTERNATIVE "call 2f", "jmp *%rdi", X86_FEATURE_NO_RETPOLINE
-1:
- lfence
- jmp 1b
-2:
- mov %rdi, (%rsp)
- ret
+ NOSPEC_JMP rdi
 END(entry_SYSCALL_64_trampoline)
 
  .popsection
diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
index eced0dfaddc9..1c8312ff186a 100644
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -14,15 +14,54 @@
  */
 .macro NOSPEC_JMP reg:req
 #ifdef CONFIG_RETPOLINE
- ALTERNATIVE __stringify(jmp __x86.indirect_thunk.\reg), __stringify(jmp *%\reg), X86_FEATURE_NO_RETPOLINE
+ ALTERNATIVE_2 "call 1112f", __stringify(lfence;jmp *%\reg), X86_FEATURE_K8, __stringify(jmp *%\reg), X86_FEATURE_NO_RETPOLINE
+1111:
+ lfence
+ jmp 1111b
+1112:
+ mov %\reg, (%_ASM_SP)
+ ret
 #else
- jmp *%\reg
+ jmp *%\reg
 #endif
 .endm
 
+/*
+ * Even __stringify() on the arguments doesn't really make it nice to use
+ * the existing ALTERNATIVE_2 macro here. So open-code our own version...
+ */
 .macro NOSPEC_CALL reg:req
 #ifdef CONFIG_RETPOLINE
- ALTERNATIVE __stringify(call __x86.indirect_thunk.\reg), __stringify(call *%\reg), X86_FEATURE_NO_RETPOLINE
+140:
+ jmp 1113f
+1110:
+ call 1112f
+1111:
+ lfence
+ jmp 1111b
+1112:
+ mov %\reg, (%_ASM_SP)
+ ret
+1113:
+ call 1110b
+141:
+ .skip -((alt_max_short(new_len1, new_len2) - (old_len)) > 0) * \
+ (alt_max_short(new_len1, new_len2) - (old_len)),0x90
+142:
+
+ .pushsection .altinstructions,"a"
+ altinstruction_entry 140b,143f,X86_FEATURE_K8,142b-140b,144f-143f,142b-141b
+ altinstruction_entry 140b,144f,X86_FEATURE_NO_RETPOLINE,142b-140b,145f-144f,142b-141b
+ .popsection
+
+ .pushsection .altinstr_replacement,"ax"
+143:
+ lfence
+ call *%\reg
+144:
+ call *%\reg
+145:
+ .popsection
 #else
  call *%\reg
 #endif
diff --git a/arch/x86/lib/retpoline.S b/arch/x86/lib/retpoline.S
index 2a4b1f09eb84..5c15e4307da5 100644
--- a/arch/x86/lib/retpoline.S
+++ b/arch/x86/lib/retpoline.S
@@ -6,19 +6,14 @@
 #include <asm/cpufeatures.h>
 #include <asm/alternative-asm.h>
 #include <asm/export.h>
+#include <asm/nospec-branch.h>
 
 .macro THUNK sp reg
  .section .text.__x86.indirect_thunk.\reg
 
 ENTRY(__x86.indirect_thunk.\reg)
  CFI_STARTPROC
- ALTERNATIVE_2 "call 2f", __stringify(lfence;jmp *%\reg), X86_FEATURE_K8, __stringify(jmp *%\reg), X86_FEATURE_NO_RETPOLINE
-1:
- lfence
- jmp 1b
-2:
- mov %\reg, (%\sp)
- ret
+ NOSPEC_JMP \reg
  CFI_ENDPROC
 ENDPROC(__x86.indirect_thunk.\reg)
 EXPORT_SYMBOL(__x86.indirect_thunk.\reg)



2018-01-05 21:11:10

by Brian Gerst

[permalink] [raw]
Subject: Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support

On Fri, Jan 5, 2018 at 3:32 PM, Woodhouse, David <[email protected]> wrote:
> On Fri, 2018-01-05 at 09:28 -0800, Linus Torvalds wrote:
>>
>> Yes, I would suggest against expecting altinstructions to have
>> relocation information. They are generated in a different place, so..
>>
>> That said, I honestly like the inline version (the one that is in the
>> google paper first) of the retpoline more than the out-of-line one.
>> And that one shouldn't have any relocation issues, because all the
>> offsets are relative.
>
> Note that the *only* issue with the relocation is that it pushes me to
> use X86_FEATURE_NO_RETPOLINE for my feature instead of
> X86_FEATURE_RETPOLINE as might be natural. And actually there's a
> motivation to do that anyway, because of the way three-way alternatives
> interact.
>
> With the existing negative flag I can do
>
> ALTERNATIVE_2(retpoline, K8: lfence+jmp; NO_RETPOLINE: jmp)
>
> But if I invert it, I think I need two feature flags to get the same functionality — X86_FEATURE_RETPOLINE and X86_FEATURE_RETPOLINE_AMD:
>
> ALTERNATIVE_2(jmp, RETPOLINE: retpoline, RETPOLINE_AMD: lfence+jmp)

Another way to do it is with two consecutive alternatives:

ALTERNATIVE(NOP, K8: lfence)
ALTERNATIVE(jmp indirect, RETPOLINE: jmp thunk)

This also avoids the issue with the relocation of the jmp target when
the replacement is more than one instruction.
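Rendered in the style of the NOSPEC_JMP macro from David's series, that
suggestion might look roughly like the sketch below (a hypothetical variant,
not an actual patch; the macro and flag names follow the existing series). The
jmp to the thunk is the first and only instruction of its altinstr, so
recompute_jump() can fix up its target:

.macro NOSPEC_JMP_STACKED reg:req
	/* default: nops; AMD (K8 and later): lfence */
	ALTERNATIVE "", "lfence", X86_FEATURE_K8
	/* default: plain indirect jmp; retpoline: jmp to the out-of-line thunk */
	ALTERNATIVE __stringify(jmp *%\reg), \
		    __stringify(jmp __x86.indirect_thunk.\reg), X86_FEATURE_RETPOLINE
.endm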

--
Brian Gerst

2018-01-05 22:02:01

by Woodhouse, David

[permalink] [raw]
Subject: Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support

On Fri, 2018-01-05 at 09:28 -0800, Linus Torvalds wrote:
> That said, I honestly like the inline version (the one that is in the
> google paper first) of the retpoline more than the out-of-line one.
> And that one shouldn't have any relocation issues, because all the
> offsets are relative.
>
> We want to use that one for the entry stub anyway, can't we just
> standardize on that one for all our assembly?
>
> If the *compiler* uses the out-of-line version, that's a separate
> thing. But for our asm cases, let's just make it all be the inline
> case, ok?

OK, this one looks saner, and I think I've tested all the 32/64 bit
retpoline/amd/none permutations. Pushed to
http://git.infradead.org/users/dwmw2/linux-retpoline.git/

If this matches what you were thinking, I'll refactor the series in the
morning to do it this way.

From cfddb3bcae1524da52e782398da2809ec8faa200 Mon Sep 17 00:00:00 2001
From: David Woodhouse <[email protected]>
Date: Fri, 5 Jan 2018 21:50:41 +0000
Subject: [PATCH] x86/retpoline: Clean up inverted X86_FEATURE_NO_RETPOLINE
 logic

If we make the thunks inline so they don't need relocations to branch to
__x86.indirect_thunk.xxx then we don't fall foul of the alternatives
handling, and we can have the retpoline variant in 'altinstr'. This
means that we can use X86_FEATURE_RETPOLINE which is more natural.

Unfortunately, it does mean that the X86_FEATURE_K8 trick doesn't work
any more, so we need an additional X86_FEATURE_RETPOLINE_AMD for that.

Signed-off-by: David Woodhouse <[email protected]>
---
 arch/x86/entry/entry_64.S            | 12 +--------
 arch/x86/include/asm/cpufeatures.h   |  3 ++-
 arch/x86/include/asm/nospec-branch.h | 50 +++++++++++++++++++++++++++---------
 arch/x86/kernel/cpu/common.c         |  5 ++++
 arch/x86/kernel/cpu/intel.c          |  3 ++-
 arch/x86/lib/retpoline.S             | 17 +++++-------
 6 files changed, 54 insertions(+), 36 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 76f94bbacaec..8f7e1129f493 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -188,17 +188,7 @@ ENTRY(entry_SYSCALL_64_trampoline)
   */
  pushq %rdi
  movq $entry_SYSCALL_64_stage2, %rdi
- /*
-  * Open-code the retpoline from retpoline.S, because we can't
-  * just jump to it directly.
-  */
- ALTERNATIVE "call 2f", "jmp *%rdi", X86_FEATURE_NO_RETPOLINE
-1:
- lfence
- jmp 1b
-2:
- mov %rdi, (%rsp)
- ret
+ NOSPEC_JMP rdi
 END(entry_SYSCALL_64_trampoline)
 
  .popsection
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 2d916fd13bf9..6f10edabbf82 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -203,7 +203,8 @@
 #define X86_FEATURE_PROC_FEEDBACK ( 7*32+ 9) /* AMD ProcFeedbackInterface */
 #define X86_FEATURE_SME ( 7*32+10) /* AMD Secure Memory Encryption */
 #define X86_FEATURE_PTI ( 7*32+11) /* Kernel Page Table Isolation enabled */
-#define X86_FEATURE_NO_RETPOLINE ( 7*32+12) /* Retpoline mitigation for Spectre variant 2 */
+#define X86_FEATURE_RETPOLINE ( 7*32+12) /* Intel Retpoline mitigation for Spectre variant 2 */
+#define X86_FEATURE_RETPOLINE_AMD ( 7*32+13) /* AMD Retpoline mitigation for Spectre variant 2 */
 #define X86_FEATURE_INTEL_PPIN ( 7*32+14) /* Intel Processor Inventory Number */
 #define X86_FEATURE_INTEL_PT ( 7*32+15) /* Intel Processor Trace */
 #define X86_FEATURE_AVX512_4VNNIW ( 7*32+16) /* AVX-512 Neural Network Instructions */
diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
index eced0dfaddc9..6e92edf64e53 100644
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -8,23 +8,44 @@
 #include <asm/cpufeatures.h>
 
 #ifdef __ASSEMBLY__
+
 /*
- * The asm code uses CONFIG_RETPOLINE; this part will happen even if
- * the toolchain isn't retpoline-capable.
+ * These are the bare retpoline primitives for indirect jmp and call.
+ * Do not use these directly.
  */
+.macro RETPOLINE_JMP reg:req
+ call 1112f
+1111: lfence
+ jmp 1111b
+1112: mov %\reg, (%_ASM_SP)
+ ret
+.endm
+
+.macro RETPOLINE_CALL reg:req
+ jmp 1113f
+1110: RETPOLINE_JMP \reg
+1113: call 1110b
+.endm
+
+
+
 .macro NOSPEC_JMP reg:req
 #ifdef CONFIG_RETPOLINE
- ALTERNATIVE __stringify(jmp __x86.indirect_thunk.\reg), __stringify(jmp *%\reg), X86_FEATURE_NO_RETPOLINE
+ ALTERNATIVE_2 __stringify(jmp *%\reg), \
+ __stringify(RETPOLINE_JMP \reg), X86_FEATURE_RETPOLINE, \
+ __stringify(lfence; jmp *%\reg), X86_FEATURE_RETPOLINE_AMD
 #else
- jmp *%\reg
+ jmp *%\reg
 #endif
 .endm
 
 .macro NOSPEC_CALL reg:req
 #ifdef CONFIG_RETPOLINE
- ALTERNATIVE __stringify(call __x86.indirect_thunk.\reg), __stringify(call *%\reg), X86_FEATURE_NO_RETPOLINE
+ ALTERNATIVE_2 __stringify(call *%\reg), \
+ __stringify(RETPOLINE_CALL \reg), X86_FEATURE_RETPOLINE,\
+ __stringify(lfence; call *%\reg), X86_FEATURE_RETPOLINE_AMD
 #else
- call *%\reg
+ call *%\reg
 #endif
 .endm
 
@@ -36,16 +57,21 @@
  */
 #if defined(CONFIG_X86_64) && defined(RETPOLINE)
 #  define NOSPEC_CALL ALTERNATIVE( \
+ "call *%[thunk_target]\n", \
  "call __x86.indirect_thunk.%V[thunk_target]\n", \
- "call *%[thunk_target]\n", X86_FEATURE_NO_RETPOLINE)
+ X86_FEATURE_RETPOLINE)
 #  define THUNK_TARGET(addr) [thunk_target] "r" (addr)
 #elif defined(CONFIG_X86_64) && defined(CONFIG_RETPOLINE)
 # define NOSPEC_CALL ALTERNATIVE( \
- "       jmp 1221f; " \
- "1222:  push %[thunk_target];" \
- "       jmp __x86.indirect_thunk;" \
- "1221:  call 1222b;\n", \
- "call *%[thunk_target]\n", X86_FEATURE_NO_RETPOLINE)
+ "call *%[thunk_target]\n", \
+ "       jmp    1113f; " \
+ "1110:  call   1112f; " \
+ "1111: lfence; " \
+ "       jmp    1111b; " \
+ "1112: movl   %[thunk_target], (%esp); " \
+ "       ret; " \
+ "1113:  call   1110b;\n", \
+ X86_FEATURE_RETPOLINE)
 # define THUNK_TARGET(addr) [thunk_target] "rm" (addr)
 #else
 # define NOSPEC_CALL "call *%[thunk_target]\n"
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 372ba3fb400f..40e6e54d8501 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -904,6 +904,11 @@ static void __init early_identify_cpu(struct cpuinfo_x86 *c)
 
  setup_force_cpu_bug(X86_BUG_SPECTRE_V1);
  setup_force_cpu_bug(X86_BUG_SPECTRE_V2);
+#ifdef CONFIG_RETPOLINE
+ setup_force_cpu_cap(X86_FEATURE_RETPOLINE);
+ if (c->x86_vendor == X86_VENDOR_AMD)
+ setup_force_cpu_cap(X86_FEATURE_RETPOLINE_AMD);
+#endif
 
  fpu__init_system(c);
 
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index e1812d07b53e..35e123e5f413 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -35,7 +35,8 @@
 static int __init noretpoline_setup(char *__unused)
 {
  pr_info("Retpoline runtime disabled\n");
- setup_force_cpu_cap(X86_FEATURE_NO_RETPOLINE);
+ setup_clear_cpu_cap(X86_FEATURE_RETPOLINE);
+ setup_clear_cpu_cap(X86_FEATURE_RETPOLINE_AMD);
  return 1;
 }
 __setup("noretpoline", noretpoline_setup);
diff --git a/arch/x86/lib/retpoline.S b/arch/x86/lib/retpoline.S
index 2a4b1f09eb84..90d9a1589a54 100644
--- a/arch/x86/lib/retpoline.S
+++ b/arch/x86/lib/retpoline.S
@@ -6,19 +6,14 @@
 #include <asm/cpufeatures.h>
 #include <asm/alternative-asm.h>
 #include <asm/export.h>
+#include <asm/nospec-branch.h>
 
-.macro THUNK sp reg
+.macro THUNK reg
  .section .text.__x86.indirect_thunk.\reg
 
 ENTRY(__x86.indirect_thunk.\reg)
  CFI_STARTPROC
- ALTERNATIVE_2 "call 2f", __stringify(lfence;jmp *%\reg), X86_FEATURE_K8, __stringify(jmp *%\reg), X86_FEATURE_NO_RETPOLINE
-1:
- lfence
- jmp 1b
-2:
- mov %\reg, (%\sp)
- ret
+ NOSPEC_JMP \reg
  CFI_ENDPROC
 ENDPROC(__x86.indirect_thunk.\reg)
 EXPORT_SYMBOL(__x86.indirect_thunk.\reg)
@@ -26,11 +21,11 @@ EXPORT_SYMBOL(__x86.indirect_thunk.\reg)
 
 #ifdef CONFIG_64BIT
 .irp reg rax rbx rcx rdx rsi rdi rbp r8 r9 r10 r11 r12 r13 r14 r15
- THUNK rsp \reg
+ THUNK \reg
 .endr
 #else
 .irp reg eax ebx ecx edx esi edi ebp
- THUNK esp \reg
+ THUNK \reg
 .endr
 
 /*
@@ -39,7 +34,7 @@ EXPORT_SYMBOL(__x86.indirect_thunk.\reg)
  */
 ENTRY(__x86.indirect_thunk)
  CFI_STARTPROC
- ALTERNATIVE "call 2f", "ret", X86_FEATURE_NO_RETPOLINE
+ ALTERNATIVE "ret", "call 2f", X86_FEATURE_RETPOLINE
 1:
  lfence
  jmp 1b
-- 
2.14.3



2018-01-05 22:06:43

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support

On Fri, Jan 05, 2018 at 10:00:19PM +0000, Woodhouse, David wrote:
> OK, this one looks saner, and I think I've tested all the 32/64 bit

Dunno, I think Brian's suggestion will make this even simpler:

ALTERNATIVE(NOP, K8: lfence)
ALTERNATIVE(jmp indirect, RETPOLINE: jmp thunk)

Hmm?

--
Regards/Gruss,
Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
--

2018-01-05 22:44:06

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support

On Fri, Jan 05, 2018 at 10:16:54PM +0000, Woodhouse, David wrote:
> You'd still want a RETPOLINE_AMD flag to enable that lfence; it's not
> just K8.

I think you're forgetting that we set K8 on everything >= K8 on AMD. So
this:

+ if (c->x86_vendor == X86_VENDOR_AMD)
+ setup_force_cpu_cap(X86_FEATURE_RETPOLINE_AMD);

can just as well be dropped and X86_FEATURE_K8 used everywhere.
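
Spelled out on the common.c side, the simplification would be just (a sketch
only, mirroring the hunk above minus the AMD lines):

#ifdef CONFIG_RETPOLINE
	/*
	 * No separate RETPOLINE_AMD flag: the lfence variant in the thunk
	 * would key off X86_FEATURE_K8, which AMD has set on K8 and
	 * everything newer anyway.
	 */
	setup_force_cpu_cap(X86_FEATURE_RETPOLINE);
#endif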

--
Regards/Gruss,
Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
--

2018-01-05 23:50:33

by Linus Torvalds


On Fri, Jan 5, 2018 at 2:00 PM, Woodhouse, David <[email protected]> wrote:
> +.macro RETPOLINE_JMP reg:req
> + call 1112f
> +1111: lfence
> + jmp 1111b
> +1112: mov %\reg, (%_ASM_SP)
> + ret
> +.endm
> +
> +.macro RETPOLINE_CALL reg:req
> + jmp 1113f
> +1110: RETPOLINE_JMP \reg
> +1113: call 1110b
> +.endm

Why do these still force a register name?

Is it because this is the only user?

> index 2a4b1f09eb84..90d9a1589a54 100644
> --- a/arch/x86/lib/retpoline.S
> +++ b/arch/x86/lib/retpoline.S
> @@ -6,19 +6,14 @@
> #include <asm/cpufeatures.h>
> #include <asm/alternative-asm.h>
> #include <asm/export.h>
> +#include <asm/nospec-branch.h>
>
> -.macro THUNK sp reg
> +.macro THUNK reg
> .section .text.__x86.indirect_thunk.\reg
>
> ENTRY(__x86.indirect_thunk.\reg)
> CFI_STARTPROC
> - ALTERNATIVE_2 "call 2f", __stringify(lfence;jmp *%\reg), X86_FEATURE_K8, __stringify(jmp *%\reg), X86_FEATURE_NO_RETPOLINE
> -1:
> - lfence
> - jmp 1b
> -2:
> - mov %\reg, (%\sp)
> - ret
> + NOSPEC_JMP \reg
> CFI_ENDPROC

Can we just move those macros into retpoline.S because using them
outside that shouldn't happen?

Linus

2018-01-06 00:31:10

by Borislav Petkov


On Fri, Jan 05, 2018 at 11:08:06AM -0600, Josh Poimboeuf wrote:
> I seem to recall that we also discussed the need for this for converting
> pvops to use alternatives, though the "why" is eluding me at the moment.

Ok, here's something which seems to work in my VM here. I'll continue
playing with it tomorrow. Josh, if you have some example sequences for
me to try, send them my way pls.

Anyway, here's an example:

alternative("", "xor %%rdi, %%rdi; jmp startup_64", X86_FEATURE_K8);

which did this:

[ 0.921013] apply_alternatives: feat: 3*32+4, old: (ffffffff81027429, len: 8), repl: (ffffffff824759d2, len: 8), pad: 8
[ 0.924002] ffffffff81027429: old_insn: 90 90 90 90 90 90 90 90
[ 0.928003] ffffffff824759d2: rpl_insn: 48 31 ff e9 26 a6 b8 fe
[ 0.930212] process_jumps: repl[0]: 0x48
[ 0.932002] process_jumps: insn len: 3
[ 0.932814] process_jumps: repl[0]: 0xe9
[ 0.934003] recompute_jump: o_dspl: 0xfeb8a626
[ 0.934914] recompute_jump: target RIP: ffffffff81000000, new_displ: 0xfffd8bd7
[ 0.936001] recompute_jump: final displ: 0xfffd8bd2, JMP 0xffffffff81000000
[ 0.937240] process_jumps: insn len: 5
[ 0.938053] ffffffff81027429: final_insn: e9 d2 8b fd ff a6 b8 fe

Apparently our insn decoder is smart enough to parse the insn and get
its length, so I can use that. It jumps over the first 3-byte XOR and
then massages the following 5-byte jump.

---
From: Borislav Petkov <[email protected]>
Date: Fri, 5 Jan 2018 20:32:58 +0100
Subject: [PATCH] WIP

Signed-off-by: Borislav Petkov <[email protected]>
---
arch/x86/kernel/alternative.c | 51 +++++++++++++++++++++++++++++++------------
1 file changed, 37 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index dbaf14d69ebd..14a855789a50 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -21,6 +21,7 @@
#include <asm/tlbflush.h>
#include <asm/io.h>
#include <asm/fixmap.h>
+#include <asm/insn.h>

int __read_mostly alternatives_patched;

@@ -281,24 +282,24 @@ static inline bool is_jmp(const u8 opcode)
}

static void __init_or_module
-recompute_jump(struct alt_instr *a, u8 *orig_insn, u8 *repl_insn, u8 *insnbuf)
+recompute_jump(struct alt_instr *a, u8 *orig_insn, u8 *repl_insn, u8 repl_len, u8 *insnbuf)
{
u8 *next_rip, *tgt_rip;
s32 n_dspl, o_dspl;
- int repl_len;

- if (a->replacementlen != 5)
+ if (repl_len != 5)
return;

- o_dspl = *(s32 *)(insnbuf + 1);
+ o_dspl = *(s32 *)(repl_insn + 1);

/* next_rip of the replacement JMP */
- next_rip = repl_insn + a->replacementlen;
+ next_rip = repl_insn + repl_len;
+
/* target rip of the replacement JMP */
tgt_rip = next_rip + o_dspl;
n_dspl = tgt_rip - orig_insn;

- DPRINTK("target RIP: %p, new_displ: 0x%x", tgt_rip, n_dspl);
+ DPRINTK("target RIP: %px, new_displ: 0x%x", tgt_rip, n_dspl);

if (tgt_rip - orig_insn >= 0) {
if (n_dspl - 2 <= 127)
@@ -337,6 +338,29 @@ recompute_jump(struct alt_instr *a, u8 *orig_insn, u8 *repl_insn, u8 *insnbuf)
n_dspl, (unsigned long)orig_insn + n_dspl + repl_len);
}

+static void __init_or_module process_jumps(struct alt_instr *a, u8 *insnbuf)
+{
+ u8 *repl = (u8 *)&a->repl_offset + a->repl_offset;
+ u8 *instr = (u8 *)&a->instr_offset + a->instr_offset;
+ struct insn insn;
+ int i = 0;
+
+ if (!a->replacementlen)
+ return;
+
+ while (i < a->replacementlen) {
+ kernel_insn_init(&insn, repl, a->replacementlen);
+
+ insn_get_length(&insn);
+
+ if (is_jmp(repl[0]))
+ recompute_jump(a, instr, repl, insn.length, insnbuf);
+
+ i += insn.length;
+ repl += insn.length;
+ }
+}
+
/*
* "noinline" to cause control flow change and thus invalidate I$ and
* cause refetch after modification.
@@ -352,7 +376,7 @@ static void __init_or_module noinline optimize_nops(struct alt_instr *a, u8 *ins
add_nops(instr + (a->instrlen - a->padlen), a->padlen);
local_irq_restore(flags);

- DUMP_BYTES(instr, a->instrlen, "%p: [%d:%d) optimized NOPs: ",
+ DUMP_BYTES(instr, a->instrlen, "%px: [%d:%d) optimized NOPs: ",
instr, a->instrlen - a->padlen, a->padlen);
}

@@ -373,7 +397,7 @@ void __init_or_module noinline apply_alternatives(struct alt_instr *start,
u8 *instr, *replacement;
u8 insnbuf[MAX_PATCH_LEN];

- DPRINTK("alt table %p -> %p", start, end);
+ DPRINTK("alt table %px -> %px", start, end);
/*
* The scan order should be from start to end. A later scanned
* alternative code can overwrite previously scanned alternative code.
@@ -397,14 +421,14 @@ void __init_or_module noinline apply_alternatives(struct alt_instr *start,
continue;
}

- DPRINTK("feat: %d*32+%d, old: (%p, len: %d), repl: (%p, len: %d), pad: %d",
+ DPRINTK("feat: %d*32+%d, old: (%px, len: %d), repl: (%px, len: %d), pad: %d",
a->cpuid >> 5,
a->cpuid & 0x1f,
instr, a->instrlen,
replacement, a->replacementlen, a->padlen);

- DUMP_BYTES(instr, a->instrlen, "%p: old_insn: ", instr);
- DUMP_BYTES(replacement, a->replacementlen, "%p: rpl_insn: ", replacement);
+ DUMP_BYTES(instr, a->instrlen, "%px: old_insn: ", instr);
+ DUMP_BYTES(replacement, a->replacementlen, "%px: rpl_insn: ", replacement);

memcpy(insnbuf, replacement, a->replacementlen);
insnbuf_sz = a->replacementlen;
@@ -422,15 +446,14 @@ void __init_or_module noinline apply_alternatives(struct alt_instr *start,
(unsigned long)instr + *(s32 *)(insnbuf + 1) + 5);
}

- if (a->replacementlen && is_jmp(replacement[0]))
- recompute_jump(a, instr, replacement, insnbuf);
+ process_jumps(a, insnbuf);

if (a->instrlen > a->replacementlen) {
add_nops(insnbuf + a->replacementlen,
a->instrlen - a->replacementlen);
insnbuf_sz += a->instrlen - a->replacementlen;
}
- DUMP_BYTES(insnbuf, insnbuf_sz, "%p: final_insn: ", instr);
+ DUMP_BYTES(insnbuf, insnbuf_sz, "%px: final_insn: ", instr);

text_poke_early(instr, insnbuf, insnbuf_sz);
}
--
2.13.0

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
--

2018-01-06 08:23:43

by David Woodhouse


On Sat, 2018-01-06 at 01:30 +0100, Borislav Petkov wrote:
> On Fri, Jan 05, 2018 at 11:08:06AM -0600, Josh Poimboeuf wrote:
> > I seem to recall that we also discussed the need for this for converting
> > pvops to use alternatives, though the "why" is eluding me at the moment.
>
> Ok, here's something which seems to work in my VM here. I'll continue
> playing with it tomorrow. Josh, if you have some example sequences for
> me to try, send them my way pls.
>
> Anyway, here's an example:
>
> alternative("", "xor %%rdi, %%rdi; jmp startup_64", X86_FEATURE_K8);
>
> which did this:
>
> [    0.921013] apply_alternatives: feat: 3*32+4, old: (ffffffff81027429, len: 8), repl: (ffffffff824759d2, len: 8), pad: 8
> [    0.924002] ffffffff81027429: old_insn: 90 90 90 90 90 90 90 90
> [    0.928003] ffffffff824759d2: rpl_insn: 48 31 ff e9 26 a6 b8 fe
> [    0.930212] process_jumps: repl[0]: 0x48
> [    0.932002] process_jumps: insn len: 3
> [    0.932814] process_jumps: repl[0]: 0xe9
> [    0.934003] recompute_jump: o_dspl: 0xfeb8a626
> [    0.934914] recompute_jump: target RIP: ffffffff81000000, new_displ: 0xfffd8bd7
> [    0.936001] recompute_jump: final displ: 0xfffd8bd2, JMP 0xffffffff81000000
> [    0.937240] process_jumps: insn len: 5
> [    0.938053] ffffffff81027429: final_insn: e9 d2 8b fd ff a6 b8 fe
>
> Apparently our insn decoder is smart enough to parse the insn and get
> its length, so I can use that. It jumps over the first 3-byte XOR and
> then massages the following 5-byte jump.

Thanks. From code inspection, I couldn't see that it was smart enough
*not* to process a relative jump in the 'altinstr' section which was
jumping to a target *within* that same altinstr section, and thus
didn't need to be touched at all. Does this work?

alternative("", "xor %%rdi, %%rdi; jmp 2f; 2: jmp startup_64", X86_FEATURE_K8);


2018-01-06 10:58:20

by Woodhouse, David


On Fri, 2018-01-05 at 15:50 -0800, Linus Torvalds wrote:
>
> > +
> > +.macro RETPOLINE_CALL reg:req
> > +       jmp     1113f
> > +1110:  RETPOLINE_JMP \reg
> > +1113:  call    1110b
> > +.endm


(Note that RETPOLINE_CALL is purely internal to nospec-branch.h, used
only from the NOSPEC_CALL macro to make the ALTERNATIVE_2 invocation
less ugly than it would have been. But the same applies to NOSPEC_CALL
which is the one that gets used in our .S files including retpoline.S)

> Why do these still force a register name?
>
> Is it because this is the only user?

By "register name" are you asking why it's invoked as
"NOSPEC_CALL rbx" instead of "NOSPEC_CALL %rbx" with the percent sign?

I think I can probably clean that up now, and turn a lot of %\reg
occurrences into just \reg. The only remaining %\reg would be left in
retpoline.S inside the THUNK macro used from .irp.





2018-01-06 17:02:51

by Borislav Petkov


On Sat, Jan 06, 2018 at 08:23:21AM +0000, David Woodhouse wrote:
> Thanks. From code inspection, I couldn't see that it was smart enough
> *not* to process a relative jump in the 'altinstr' section which was
> jumping to a target *within* that same altinstr section, and thus
> didn't need to be touched at all. Does this work?
>
> alternative("", "xor %%rdi, %%rdi; jmp 2f; 2: jmp startup_64", X86_FEATURE_K8);

So this is fine because it gets turned into a two-byte jump:

[ 0.816005] apply_alternatives: feat: 3*32+4, old: (ffffffff810273c9, len: 10), repl: (ffffffff824759d2, len: 10), pad: 10
[ 0.820001] ffffffff810273c9: old_insn: 90 90 90 90 90 90 90 90 90 90
[ 0.821247] ffffffff824759d2: rpl_insn: 48 31 ff eb 00 e9 24 a6 b8 fe
[ 0.822455] process_jumps: insn start 0x48, at 0, len: 3
[ 0.823496] process_jumps: insn start 0xeb, at 3, len: 2
[ 0.824002] process_jumps: insn start 0xe9, at 5, len: 5
[ 0.825120] recompute_jump: target RIP: ffffffff81000000, new_displ: 0xfffd8c37
[ 0.826567] recompute_jump: final displ: 0xfffd8c32, JMP 0xffffffff81000000
[ 0.828001] ffffffff810273c9: final_insn: e9 32 8c fd ff e9 24 a6 b8 fe

i.e., notice the "eb 00" thing.

Which, when copied into the kernel proper, will simply work, as it is a
small offset referring to other code which gets copied *together* with
it. I.e., we're not changing the offsets during the copy, so all good.

It becomes more tricky when you force a 5-byte jump:

alternative("", "xor %%rdi, %%rdi; .byte 0xe9; .long 2f - .altinstr_replacement; 2: jmp startup_64", X86_FEATURE_K8);

because then you need to know whether the offset is within the
.altinstr_replacement section itself or it is meant to be an absolute
offset like jmp startup_64 or within another section.
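
The distinction itself is easy enough to state as a check -- a hypothetical
helper just to illustrate the idea, not part of the patch below:

#include <stdbool.h>
#include <stdint.h>

/*
 * A relative jump inside the replacement needs no displacement fix-up if
 * its target also lies inside the replacement block: the block is copied
 * into .text in one piece, so internal offsets stay valid.
 */
static bool jmp_target_is_local(const uint8_t *repl, int repl_len,
				const uint8_t *jmp, int jmp_len, int32_t dspl)
{
	const uint8_t *target = jmp + jmp_len + dspl;

	return target >= repl && target < repl + repl_len;
}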

On that I need to sleep more to figure out what a reliable way to do it
would be. I mean, not that we need it now. If all we care about are
two-byte offsets, those should work now. The stress being on "should".

The current version:

---
From: Borislav Petkov <[email protected]>
Date: Fri, 5 Jan 2018 20:32:58 +0100
Subject: [PATCH] WIP

Signed-off-by: Borislav Petkov <[email protected]>
---
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index dbaf14d69ebd..0cb4f886e6d7 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -21,6 +21,7 @@
#include <asm/tlbflush.h>
#include <asm/io.h>
#include <asm/fixmap.h>
+#include <asm/insn.h>

int __read_mostly alternatives_patched;

@@ -280,25 +281,35 @@ static inline bool is_jmp(const u8 opcode)
return opcode == 0xeb || opcode == 0xe9;
}

+/**
+ * @orig_insn: pointer to the original insn
+ * @repl_insn: pointer to the replacement insn
+ * @repl_len: length of the replacement insn
+ * @insnbuf: buffer we're working on massaging the insns
+ * @i_off: offset within the buffer
+ */
static void __init_or_module
-recompute_jump(struct alt_instr *a, u8 *orig_insn, u8 *repl_insn, u8 *insnbuf)
+recompute_jump(struct alt_instr *a, u8 *orig_insn, u8 *repl_insn, u8 repl_len,
+ u8 *insnbuf, u8 i_off)
{
u8 *next_rip, *tgt_rip;
s32 n_dspl, o_dspl;
- int repl_len;

- if (a->replacementlen != 5)
+ if (repl_len != 5)
return;

- o_dspl = *(s32 *)(insnbuf + 1);
+ o_dspl = *(s32 *)(repl_insn + 1);
+
+ DPRINTK("o_dspl: 0x%x, orig_insn: %px", o_dspl, orig_insn);

/* next_rip of the replacement JMP */
- next_rip = repl_insn + a->replacementlen;
+ next_rip = repl_insn + repl_len;
+
/* target rip of the replacement JMP */
tgt_rip = next_rip + o_dspl;
n_dspl = tgt_rip - orig_insn;

- DPRINTK("target RIP: %p, new_displ: 0x%x", tgt_rip, n_dspl);
+ DPRINTK("target RIP: %px, new_displ: 0x%x", tgt_rip, n_dspl);

if (tgt_rip - orig_insn >= 0) {
if (n_dspl - 2 <= 127)
@@ -316,8 +327,8 @@ recompute_jump(struct alt_instr *a, u8 *orig_insn, u8 *repl_insn, u8 *insnbuf)
two_byte_jmp:
n_dspl -= 2;

- insnbuf[0] = 0xeb;
- insnbuf[1] = (s8)n_dspl;
+ insnbuf[i_off] = 0xeb;
+ insnbuf[i_off + 1] = (s8)n_dspl;
add_nops(insnbuf + 2, 3);

repl_len = 2;
@@ -326,8 +337,8 @@ recompute_jump(struct alt_instr *a, u8 *orig_insn, u8 *repl_insn, u8 *insnbuf)
five_byte_jmp:
n_dspl -= 5;

- insnbuf[0] = 0xe9;
- *(s32 *)&insnbuf[1] = n_dspl;
+ insnbuf[i_off] = 0xe9;
+ *(s32 *)&insnbuf[i_off + 1] = n_dspl;

repl_len = 5;

@@ -337,6 +348,32 @@ recompute_jump(struct alt_instr *a, u8 *orig_insn, u8 *repl_insn, u8 *insnbuf)
n_dspl, (unsigned long)orig_insn + n_dspl + repl_len);
}

+static void __init_or_module process_jumps(struct alt_instr *a, u8 *insnbuf)
+{
+ u8 *repl = (u8 *)&a->repl_offset + a->repl_offset;
+ u8 *instr = (u8 *)&a->instr_offset + a->instr_offset;
+ struct insn insn;
+ int i = 0;
+
+ if (!a->replacementlen)
+ return;
+
+ while (i < a->replacementlen) {
+ kernel_insn_init(&insn, repl, a->replacementlen);
+
+ insn_get_length(&insn);
+
+ DPRINTK("insn start 0x%x, at %d, len: %d", repl[0], i, insn.length);
+
+ if (is_jmp(repl[0]))
+ recompute_jump(a, instr, repl, insn.length, insnbuf, i);
+
+ i += insn.length;
+ repl += insn.length;
+ instr += insn.length;
+ }
+}
+
/*
* "noinline" to cause control flow change and thus invalidate I$ and
* cause refetch after modification.
@@ -352,7 +389,7 @@ static void __init_or_module noinline optimize_nops(struct alt_instr *a, u8 *ins
add_nops(instr + (a->instrlen - a->padlen), a->padlen);
local_irq_restore(flags);

- DUMP_BYTES(instr, a->instrlen, "%p: [%d:%d) optimized NOPs: ",
+ DUMP_BYTES(instr, a->instrlen, "%px: [%d:%d) optimized NOPs: ",
instr, a->instrlen - a->padlen, a->padlen);
}

@@ -373,7 +410,7 @@ void __init_or_module noinline apply_alternatives(struct alt_instr *start,
u8 *instr, *replacement;
u8 insnbuf[MAX_PATCH_LEN];

- DPRINTK("alt table %p -> %p", start, end);
+ DPRINTK("alt table %px -> %px", start, end);
/*
* The scan order should be from start to end. A later scanned
* alternative code can overwrite previously scanned alternative code.
@@ -397,14 +434,14 @@ void __init_or_module noinline apply_alternatives(struct alt_instr *start,
continue;
}

- DPRINTK("feat: %d*32+%d, old: (%p, len: %d), repl: (%p, len: %d), pad: %d",
+ DPRINTK("feat: %d*32+%d, old: (%px, len: %d), repl: (%px, len: %d), pad: %d",
a->cpuid >> 5,
a->cpuid & 0x1f,
instr, a->instrlen,
replacement, a->replacementlen, a->padlen);

- DUMP_BYTES(instr, a->instrlen, "%p: old_insn: ", instr);
- DUMP_BYTES(replacement, a->replacementlen, "%p: rpl_insn: ", replacement);
+ DUMP_BYTES(instr, a->instrlen, "%px: old_insn: ", instr);
+ DUMP_BYTES(replacement, a->replacementlen, "%px: rpl_insn: ", replacement);

memcpy(insnbuf, replacement, a->replacementlen);
insnbuf_sz = a->replacementlen;
@@ -422,15 +459,14 @@ void __init_or_module noinline apply_alternatives(struct alt_instr *start,
(unsigned long)instr + *(s32 *)(insnbuf + 1) + 5);
}

- if (a->replacementlen && is_jmp(replacement[0]))
- recompute_jump(a, instr, replacement, insnbuf);
+ process_jumps(a, insnbuf);

if (a->instrlen > a->replacementlen) {
add_nops(insnbuf + a->replacementlen,
a->instrlen - a->replacementlen);
insnbuf_sz += a->instrlen - a->replacementlen;
}
- DUMP_BYTES(insnbuf, insnbuf_sz, "%p: final_insn: ", instr);
+ DUMP_BYTES(insnbuf, insnbuf_sz, "%px: final_insn: ", instr);

text_poke_early(instr, insnbuf, insnbuf_sz);
}
--
2.13.0

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
--

2018-01-07 09:41:00

by David Woodhouse


On Sat, 2018-01-06 at 18:02 +0100, Borislav Petkov wrote:
> On Sat, Jan 06, 2018 at 08:23:21AM +0000, David Woodhouse wrote:
> > Thanks. From code inspection, I couldn't see that it was smart enough
> > *not* to process a relative jump in the 'altinstr' section which was
> > jumping to a target *within* that same altinstr section, and thus
> > didn't need to be touched at all. Does this work?
> > 
> > alternative("", "xor %%rdi, %%rdi; jmp 2f; 2: jmp startup_64", X86_FEATURE_K8);
>
> So this is fine because it gets turned into a two-byte jump:
>
> [    0.816005] apply_alternatives: feat: 3*32+4, old: (ffffffff810273c9, len: 10), repl: (ffffffff824759d2, len: 10), pad: 10
> [    0.820001] ffffffff810273c9: old_insn: 90 90 90 90 90 90 90 90 90 90
> [    0.821247] ffffffff824759d2: rpl_insn: 48 31 ff eb 00 e9 24 a6 b8 fe
> [    0.822455] process_jumps: insn start 0x48, at 0, len: 3
> [    0.823496] process_jumps: insn start 0xeb, at 3, len: 2
> [    0.824002] process_jumps: insn start 0xe9, at 5, len: 5
> [    0.825120] recompute_jump: target RIP: ffffffff81000000, new_displ: 0xfffd8c37
> [    0.826567] recompute_jump: final displ: 0xfffd8c32, JMP 0xffffffff81000000
> [    0.828001] ffffffff810273c9: final_insn: e9 32 8c fd ff e9 24 a6 b8 fe
>
> i.e., notice the "eb 00" thing.
>
> Which, when copied into the kernel proper, will simply work, as it is a
> small offset referring to other code which gets copied *together* with
> it. I.e., we're not changing the offsets during the copy, so all good.
>
> It becomes more tricky when you force a 5-byte jump:
>
>         alternative("", "xor %%rdi, %%rdi; .byte 0xe9; .long 2f - .altinstr_replacement; 2: jmp startup_64", X86_FEATURE_K8);
>
> because then you need to know whether the offset is within the
> .altinstr_replacement section itself or it is meant to be an absolute
> offset like jmp startup_64 or within another section.

Right, so it all tends to work out OK purely by virtue of the fact that
oldinstr and altinstr end up far enough apart in the image that they're
5-byte jumps. Which isn't perfect but we've lived with worse.

I'm relatively pleased that we've managed to eliminate this as a
dependency for inverting the X86_FEATURE_RETPOLINE logic though, by
following Linus' suggestion to just emit the thunk inline instead of
calling the same one as GCC.

The other fun one for alternatives is in entry_64.S, where we really
need the return address of the call instruction to be *precisely* the 
.Lentry_SYSCALL_64_after_fastpath_call label, so we have to eschew the
normal NOSPEC_CALL there:

/*
* This call instruction is handled specially in stub_ptregs_64.
* It might end up jumping to the slow path.  If it jumps, RAX
* and all argument registers are clobbered.
*/
#ifdef CONFIG_RETPOLINE
movq sys_call_table(, %rax, 8), %rax
call __x86.indirect_thunk.rax
#else
call *sys_call_table(, %rax, 8)
#endif
.Lentry_SYSCALL_64_after_fastpath_call:

Now it's not like an unconditional branch to the out-of-line thunk is
really going to be much of a problem, even in the case where that out-
of-line thunk is alternative'd into a bare 'jmp *%rax'. But it would be
slightly nicer to avoid it.

At the moment though, it's really hard to ensure that the 'call'
instruction that leaves its address on the stack is right at the end.

Am I missing a trick there, other than manually inserting leading NOPs
(instead of the automatic trailing ones) to ensure that everything
lines up, and making assumptions about how the assembler will encode
instructions (not that it has *much* choice, but it has some)?

On the whole I think it's fine as it is, but if you have a simple fix
then that would be nice.


2018-01-07 11:46:57

by Borislav Petkov


On Sun, Jan 07, 2018 at 09:40:42AM +0000, David Woodhouse wrote:
> Right, so it all tends to work out OK purely by virtue of the fact that
> oldinstr and altinstr end up far enough apart in the image that they're
> 5-byte jumps. Which isn't perfect but we've lived with worse.

Well, the reference point is important. And I don't think we've done
more involved things than jumping back to something in .text proper.
However, I think I know how to fix this so that arbitrary jump offsets
would work but I need to talk to our gcc guys first.

If the jump is close enough for 2 bytes, then it should work as long as
the offset to the target doesn't change.

The main thing recompute_jump() does is turn 5-byte jumps - which gas
creates because the jump target is in .text but the jump itself is in
.altinstr_replacement - into 2-byte ones. Because when you copy the jump
back into .text, the offset might fit in a signed byte all of a sudden.
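
For the archives, that fix-up amounts to roughly the following -- a
standalone model of the logic, not the kernel code itself; function and
variable names are mine:

#include <stddef.h>
#include <stdint.h>

/*
 * 'repl' points at a 5-byte JMP rel32 as gas emitted it into
 * .altinstr_replacement, 'orig' is the patch site in .text the insn will
 * be copied to, 'buf' receives the re-encoded jump.  Returns the encoded
 * length: 2 if the new displacement fits in a signed byte, else 5.
 */
static int reencode_jmp(uint8_t *buf, const uint8_t *orig, const uint8_t *repl)
{
	int32_t o_dspl = *(const int32_t *)(repl + 1);	/* rel32 as assembled   */
	const uint8_t *target = repl + 5 + o_dspl;	/* absolute jump target */
	ptrdiff_t rel8 = target - (orig + 2);		/* displ. as a short JMP */

	if (rel8 >= -128 && rel8 <= 127) {
		buf[0] = 0xeb;				/* JMP rel8  */
		buf[1] = (uint8_t)(int8_t)rel8;
		return 2;
	}

	buf[0] = 0xe9;					/* JMP rel32 */
	*(int32_t *)&buf[1] = (int32_t)(target - (orig + 5));
	return 5;
}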

There are still some nasties with forcing 5-byte jumps but I think I
know how to fix those. Stay tuned...

> I'm relatively pleased that we've managed to eliminate this as a
> dependency for inverting the X86_FEATURE_RETPOLINE logic though, by
> following Linus' suggestion to just emit the thunk inline instead of
> calling the same one as GCC.
>
> The other fun one for alternatives is in entry_64.S, where we really
> need the return address of the call instruction to be *precisely* the 
> .Lentry_SYSCALL_64_after_fastpath_call label, so we have to eschew the
> normal NOSPEC_CALL there:

So CALL, as the doc says, pushes the offset of the *next* insn onto the
stack and branches to the target address.

So I'm thinking, as long as the next insn doesn't move and gcc doesn't
pad anything, you're fine.

However, I suspect that I'm missing something else here and I guess I'll
have more clue if I look at the whole thing. So can you point me to your
current branch so that I can take a look at the code?

Thx.

--
Regards/Gruss,
Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
--

2018-01-07 12:21:53

by David Woodhouse


On Sun, 2018-01-07 at 12:46 +0100, Borislav Petkov wrote:
>
> > 
> > The other fun one for alternatives is in entry_64.S, where we really
> > need the return address of the call instruction to be *precisely* the 
> > .Lentry_SYSCALL_64_after_fastpath_call label, so we have to eschew the
> > normal NOSPEC_CALL there:
>
> So CALL, as the doc says, pushes the offset of the *next* insn onto the
> stack and branches to the target address.
>
> So I'm thinking, as long as the next insn doesn't move and gcc doesn't
> pad anything, you're fine.
>
> However, I suspect that I'm missing something else here and I guess I'll
> have more clue if I look at the whole thing. So can you point me to your
> current branch so that I can take a look at the code?

http://git.infradead.org/users/dwmw2/linux-retpoline.git

In particular, this call site in entry_64.S:
http://git.infradead.org/users/dwmw2/linux-retpoline.git/blob/0f5c54a36e:/arch/x86/entry/entry_64.S#l270

It's still just unconditionally calling the out-of-line thunk and not
using ALTERNATIVE in the CONFIG_RETPOLINE case. I can't just use the
NOSPEC_CALL macro from
http://git.infradead.org/users/dwmw2/linux-retpoline.git/blob/0f5c54a36e:/arch/x86/include/asm/nospec-branch.h#l46
because of that requirement that the return address (on the stack) for
the CALL instruction must be precisely at the end, in all cases.

Each of the three alternatives *does* end with the CALL, it's just that
for the two which are shorter than the full retpoline one, they'll get
padded with NOPs at the end, so the return address on the stack *won't*
be what's expected.

Explicitly padding the alternatives with leading NOPs so that they are
all precisely the same length would work, and if the alternatives
mechanism were to pad the shorter ones with leading NOPs instead of
trailing NOPs, that would *also* work (though that would be fairly
difficult, especially for oldinstr).

I'm not sure I *see* a simple answer, and it isn't really that bad to
just do what GCC is doing and unconditionally call the out-of-line
thunk. So feel free to just throw your hands up in horror and say "no,
we can't cope with that" :)


2018-01-07 14:04:04

by Borislav Petkov


On Sun, Jan 07, 2018 at 12:21:29PM +0000, David Woodhouse wrote:
> http://git.infradead.org/users/dwmw2/linux-retpoline.git
>
> In particular, this call site in entry_64.S:
> http://git.infradead.org/users/dwmw2/linux-retpoline.git/blob/0f5c54a36e:/arch/x86/entry/entry_64.S#l270
>
> It's still just unconditionally calling the out-of-line thunk and not
> using ALTERNATIVE in the CONFIG_RETPOLINE case. I can't just use the
> NOSPEC_CALL macro from
> http://git.infradead.org/users/dwmw2/linux-retpoline.git/blob/0f5c54a36e:/arch/x86/include/asm/nospec-branch.h#l46
> because of that requirement that the return address (on the stack) for
> the CALL instruction must be precisely at the end, in all cases.

Thanks, this makes it all clear.

> Each of the three alternatives *does* end with the CALL, it's just that
> for the two which are shorter than the full retpoline one, they'll get
> padded with NOPs at the end, so the return address on the stack *won't*
> be what's expected.

Right.

> Explicitly padding the alternatives with leading NOPs so that they are
> all precisely the same length would work, and if the alternatives
> mechanism were to pad the shorter ones with leading NOPs instead of
> trailing NOPs, that would *also* work (though that would be fairly
> difficult, especially for oldinstr).

Right, so doing this reliably would need adding flags to struct
alt_instr - and this need has arisen before and we've managed not to do
it :). And I'm not really persuaded we need it now either because this
would be a one-off case where we need the padding in the front.
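
(For reference, that would mean something along these lines -- the existing
fields are as in arch/x86/include/asm/alternative.h, the new field and flag
name are invented -- plus the matching change to the ALTERNATIVE() macros
that emit these entries, which is part of why it keeps being avoided:)

struct alt_instr {
	s32 instr_offset;	/* original instruction */
	s32 repl_offset;	/* offset to replacement instruction */
	u16 cpuid;		/* cpuid bit set for replacement */
	u8  instrlen;		/* length of original instruction */
	u8  replacementlen;	/* length of new instruction */
	u8  padlen;		/* length of build-time padding */
	u8  flags;		/* hypothetical: e.g. ALT_FLAG_PAD_FRONT */
} __packed;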

> I'm not sure I *see* a simple answer, and it isn't really that bad to
> just do what GCC is doing and unconditionally call the out-of-line
> thunk. So feel free to just throw your hands up in horror and say "no,
> we can't cope with that" :)

Right, or we can do something like this - diff on top of yours below. We used to
do this before the automatic padding - explicit NOP padding by hand :-)

First split the ALTERNATIVE_2 into two alternative calls, as Brian suggested:

ALTERNATIVE "", "lfence", X86_FEATURE_RETPOLINE_AMD

and then the second:

ALTERNATIVE __stringify(ASM_NOP8; ASM_NOP8; ASM_NOP3; call *\reg), \
__stringify(RETPOLINE_CALL \reg), X86_FEATURE_RETPOLINE

with frontal NOP padding.

When compiled, it looks like this:

ffffffff818002e2: 90 nop
ffffffff818002e3: 90 nop
ffffffff818002e4: 90 nop

these first three bytes become LFENCE on AMD. On Intel, it gets optimized:

[ 0.042954] ffffffff818002e2: [0:3) optimized NOPs: 0f 1f 00

Then:

ffffffff818002e5: 66 66 66 90 data16 data16 xchg %ax,%ax
ffffffff818002e9: 66 66 66 90 data16 data16 xchg %ax,%ax
ffffffff818002ed: 66 66 66 90 data16 data16 xchg %ax,%ax
ffffffff818002f1: 66 66 66 90 data16 data16 xchg %ax,%ax
ffffffff818002f5: 66 66 90 data16 xchg %ax,%ax
ffffffff818002f8: ff d3 callq *%rbx

This thing remains like this on AMD and on Intel gets turned into:

[ 0.044004] apply_alternatives: feat: 7*32+12, old: (ffffffff818002e5, len: 21), repl: (ffffffff82219edc, len: 21), pad: 0
[ 0.045967] ffffffff818002e5: old_insn: 66 66 66 90 66 66 66 90 66 66 66 90 66 66 66 90 66 66 90 ff d3
[ 0.048006] ffffffff82219edc: rpl_insn: eb 0e e8 04 00 00 00 f3 90 eb fc 48 89 1c 24 c3 e8 ed ff ff ff
[ 0.050581] ffffffff818002e5: final_insn: eb 0e e8 04 00 00 00 f3 90 eb fc 48 89 1c 24 c3 e8 ed ff ff ff
[ 0.052003] ffffffff8180123b: [0:3) optimized NOPs: 0f 1f 00


ffffffff818002e5: eb 0e jmp ffffffff818002f5
ffffffff818002e7: e8 04 00 00 00 callq ffffffff818002f0
ffffffff818002ec: f3 90 pause
ffffffff818002ee: eb fc jmp ffffffff818002ec
ffffffff818002f0: 48 89 1c 24 mov %rbx,(%rsp)
ffffffff818002f4: c3 ret
ffffffff818002f5: e8 ed ff ff ff callq ffffffff818002e7

so that in both cases, the CALL remains last.

My fear is if some funky compiler changes the sizes of the insns in
RETPOLINE_CALL/JMP and then the padding becomes wrong. But looking at the
labels, they're all close so you have a 2-byte jmp already and the

call 1112f

should be ok. The MOV is reg,(reg), which shouldn't change from its 4-byte size...

But I'm remaining cautious here.

And we'd need to do the respective thing for NOSPEC_JMP.

Btw, I've moved the setting of X86_FEATURE_RETPOLINE and
X86_FEATURE_RETPOLINE_AMD to the respective intel.c and amd.c files. I
think this is what we want.

And if we're doing two different RETPOLINE flavors, then we should rename
X86_FEATURE_RETPOLINE to X86_FEATURE_RETPOLINE_INTEL.

Does that whole thing make some sense...?

---
diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
index c4d08fa29d6c..70825ce909ba 100644
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -6,6 +6,7 @@
#include <asm/alternative.h>
#include <asm/alternative-asm.h>
#include <asm/cpufeatures.h>
+#include <asm/nops.h>

#ifdef __ASSEMBLY__

@@ -43,11 +44,15 @@
#endif
.endm

+/*
+ * RETPOLINE_CALL is 21 bytes so pad the front with 19 bytes + 2 bytes for the
+ * CALL insn so that all CALL instructions remain last.
+ */
.macro NOSPEC_CALL reg:req
#ifdef CONFIG_RETPOLINE
- ALTERNATIVE_2 __stringify(call *\reg), \
- __stringify(RETPOLINE_CALL \reg), X86_FEATURE_RETPOLINE,\
- __stringify(lfence; call *\reg), X86_FEATURE_RETPOLINE_AMD
+ ALTERNATIVE "", "lfence", X86_FEATURE_RETPOLINE_AMD
+ ALTERNATIVE __stringify(ASM_NOP8; ASM_NOP8; ASM_NOP3; call *\reg), \
+ __stringify(RETPOLINE_CALL \reg), X86_FEATURE_RETPOLINE
#else
call *\reg
#endif
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index b221fe507640..938111e77dd0 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -554,6 +554,8 @@ static void bsp_init_amd(struct cpuinfo_x86 *c)
rdmsrl(MSR_FAM10H_NODE_ID, value);
nodes_per_socket = ((value >> 3) & 7) + 1;
}
+
+ setup_force_cpu_cap(X86_FEATURE_RETPOLINE_AMD);
}

static void early_init_amd(struct cpuinfo_x86 *c)
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index cfa5042232a2..372ba3fb400f 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -904,9 +904,6 @@ static void __init early_identify_cpu(struct cpuinfo_x86 *c)

setup_force_cpu_bug(X86_BUG_SPECTRE_V1);
setup_force_cpu_bug(X86_BUG_SPECTRE_V2);
-#ifdef CONFIG_RETPOLINE
- setup_force_cpu_cap(X86_FEATURE_RETPOLINE);
-#endif

fpu__init_system(c);

diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index 35e123e5f413..a9e00edaba46 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -524,6 +524,13 @@ static void init_intel_misc_features(struct cpuinfo_x86 *c)
wrmsrl(MSR_MISC_FEATURES_ENABLES, msr);
}

+static void bsp_init_intel(struct cpuinfo_x86 *c)
+{
+#ifdef CONFIG_RETPOLINE
+ setup_force_cpu_cap(X86_FEATURE_RETPOLINE);
+#endif
+}
+
static void init_intel(struct cpuinfo_x86 *c)
{
unsigned int l2 = 0;
@@ -893,6 +900,7 @@ static const struct cpu_dev intel_cpu_dev = {
.c_detect_tlb = intel_detect_tlb,
.c_early_init = early_init_intel,
.c_init = init_intel,
+ .c_bsp_init = bsp_init_intel,
.c_bsp_resume = intel_bsp_resume,
.c_x86_vendor = X86_VENDOR_INTEL,
};

--
Regards/Gruss,
Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
--

2018-01-08 05:06:19

by Josh Poimboeuf


On Sat, Jan 06, 2018 at 01:30:59AM +0100, Borislav Petkov wrote:
> On Fri, Jan 05, 2018 at 11:08:06AM -0600, Josh Poimboeuf wrote:
> > I seem to recall that we also discussed the need for this for converting
> > pvops to use alternatives, though the "why" is eluding me at the moment.
>
> Ok, here's something which seems to work in my VM here. I'll continue
> playing with it tomorrow. Josh, if you have some example sequences for
> me to try, send them my way pls.

Here's the use case I had in mind before. With paravirt,

ENABLE_INTERRUPTS(CLBR_NONE)

becomes

push %rax
call *pv_irq_ops.irq_enable
pop %rax

and I wanted to apply those instructions with an alternative. It
doesn't work currently because the 'call' isn't first.
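
Purely to illustrate the shape of it (the feature bit and the callee are
placeholders, not a real proposal):

	/*
	 * Native 'sti' normally; a paravirt call sequence otherwise.  The
	 * point is that the CALL is the *second* insn of the replacement,
	 * which the current code can't relocate because it only looks at
	 * the first instruction.
	 */
	alternative("sti",
		    "push %%rax; call pv_irq_enable_stub; pop %%rax",
		    X86_FEATURE_HYPERVISOR);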

--
Josh

2018-01-08 08:37:11

by Woodhouse, David


On Sun, 2018-01-07 at 23:06 -0600, Josh Poimboeuf wrote:
>
> Here's the use case I had in mind before.  With paravirt,
>
>   ENABLE_INTERRUPTS(CLBR_NONE)
>
> becomes
>
>   push  %rax
>   call  *pv_irq_ops.irq_enable
>   pop   %rax
>
> and I wanted to apply those instructions with an alternative.  It
> doesn't work currently because the 'call' isn't first.

I believe Borislav has made that work... however, if you're literally
doing the above then you'd be introducing new indirect branches which
is precisely what I've been trying to eliminate.

I believe I was told to stop prodding at pvops and just trust that they
all get turned into *direct* jumps at runtime. For example the above
call would not be literally 'call *pv_irq_ops.irq_enable', because by
the time the pvops are patched we'd *know* the final value of the
irq_enable method, and we'd turn it into a *direct* call instead.

Do I need to start looking at pvops again?


2018-01-08 21:50:58

by David Woodhouse


On Sun, 2018-01-07 at 15:03 +0100, Borislav Petkov wrote:
>
> My fear is if some funky compiler changes the sizes of the insns in
> RETPOLINE_CALL/JMP and then the padding becomes wrong. But looking at the
> labels, they're all close so you have a 2-byte jmp already and the
>
> call    1112f
>
> should be ok. The MOV is reg,(reg) which should not change size of 4...
>
> But I'm remaining cautious here.

Right. I forget the specifics, but I've *watched* LLVM break carefully
hand-crafted asm code by emitting 4-byte variants when we expected 2-
byte, etc.

On the whole, I'm sufficiently unhappy with making such assumptions
that I think the cure is worse than the disease. We can live with that
*one* out-of-line call to the thunk in the syscall case, and that was
the *only* one that really needed the call to be at the end.

Note that in the alternative case there, we don't even need to load it
into a register at all. We could do our own alternatives specially for
that case, and hand-tune the lengths only for them. But *with* a sanity
check to break the build on mismatch.

I don't think it's worth it at this point though.

