2018-01-04 09:11:21

by Paul Turner

[permalink] [raw]
Subject: [RFC] Retpoline: Binary mitigation for branch-target-injection (aka "Spectre")

Apologies for the discombobulation around today's disclosure. Obviously the
original goal was to communicate this a little more coherently, but the
unscheduled advances in the disclosure disrupted the efforts to pull this
together more cleanly.

I wanted to open discussion of the "retpoline" approach and define its
requirements so that we can separate the core
details from questions regarding any particular implementation thereof.

As a starting point, a full write-up describing the approach is available at:
https://support.google.com/faqs/answer/7625886

The 30 second version is:
Returns are a special type of indirect branch. As function returns are intended
to pair with function calls, processors often implement dedicated return stack
predictors. The choice of this branch prediction allows us to generate an
indirect branch in which speculative execution is intentionally redirected into
a controlled location by a return stack target that we control, preventing
branch target injection (also known as "Spectre") against these binaries.
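
For concreteness, a sketch of the construct described in the write-up, used in
place of an indirect branch such as "jmp *%rax" (labels are illustrative; the
write-up and the kernel thunks later in this thread use the same shape):

	call	2f		# pushes the address of 1: as the return target
1:	lfence			# speculative execution of the 'ret' lands here
	jmp	1b		#  and spins harmlessly until it is squashed
2:	mov	%rax, (%rsp)	# overwrite the pushed return address with the
	ret			#  real target; architecturally this is jmp *%rax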

On the targets (Intel Xeon) we have measured so far, the cost is within a few
cycles of a "native" indirect branch for which branch prediction hardware has
been disabled. This is unfortunately measurable -- from about 3 cycles on
average to about 30. However the cost is largely mitigated for many workloads,
since the kernel uses comparatively few indirect branches (versus, say, a C++
binary). With some effort we have the average overall overhead within the
0-1.5% range for our internal workloads, including some particularly
high-rate packet processing engines.

There are several components, the majority of which are independent of kernel
modifications:

(1) A compiler supporting retpoline transformations.
(1a) Optionally: annotations for hand-coded indirect jmps, so that they may be
made compatible with (1).
[ Note: The only known indirect jmp which is not safe to convert is the
early virtual address check in head entry. ]
(2) Kernel modifications for preventing return-stack underflow (see the
document above).
The key points where this occurs are:
- context switches (into protected targets)
- interrupt return (we return into potentially unwinding execution)
- sleep state exit (flushes caches)
- guest exit.
(These can be run-time gated; a full refill costs 30-45 cycles. A sketch of
such a refill sequence follows this list.)
(3) Optional: Optimizations so that direct branches can be used for hot kernel
indirects. While, as discussed above, kernel execution generally depends on
fewer indirect branches, there are a few places (in particular, the
networking stack) where we have chained sequences of indirects on hot paths.
(4) More general support for guarding against RSB underflow in an affected
target. While this is harder to exploit and may not be required for many
users, the approaches we have used here are not generally applicable.
Further discussion is required.
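
The refill in (2) can be pictured roughly as follows (a sketch only -- the
macro name is made up, and the real sequence will differ in count, unrolling
and how it is gated at run time):

	.macro REFILL_RSB_SKETCH nr=32
	.rept \nr
	call	1f		# push a benign entry into the return-stack predictor
2:	lfence			# speculative returns land here ...
	jmp	2b		#  ... and spin harmlessly
1:
	.endr
	add	$(\nr * 8), %rsp  # drop the real return addresses again (64-bit)
	.endm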

With respect to what these deltas mean for an unmodified kernel:
(1a) At minimum, annotation only. More complicated, config- and
run-time-gated options are also possible.
(2) Trivially run-time & config gated.
(3) The de-virtualizing of these branches improves performance in both the
retpoline and non-retpoline cases.

For an out-of-the-box kernel to be reasonably protected, (1)-(3) are required.

I apologize that this does not come with a clean set of patches merging the
things that we and Intel have looked at here. That was one of the original
goals for this week. Strictly speaking, I think that Andi, David, and I have
a fair amount of merging and clean-up to do here. This is an attempt
to keep discussion of the fundamentals at least independent of that.

I'm trying to keep the above reasonably compact/dense. I'm happy to expand on
any details in sub-threads. I'll also link back some of the other compiler work
which is landing for (1).

Thanks,

- Paul


2018-01-04 09:25:26

by Paul Turner

[permalink] [raw]
Subject: Re: [RFC] Retpoline: Binary mitigation for branch-target-injection (aka "Spectre")

On Thu, Jan 4, 2018 at 1:10 AM, Paul Turner <[email protected]> wrote:
> Apologies for the discombobulation around today's disclosure. Obviously the
> original goal was to communicate this a little more coherently, but the
> unscheduled advances in the disclosure disrupted the efforts to pull this
> together more cleanly.
>
> I wanted to open discussion of the "retpoline" approach and define its
> requirements so that we can separate the core
> details from questions regarding any particular implementation thereof.
>
> As a starting point, a full write-up describing the approach is available at:
> https://support.google.com/faqs/answer/7625886
>
> The 30 second version is:
> Returns are a special type of indirect branch. As function returns are intended
> to pair with function calls, processors often implement dedicated return stack
> predictors. The choice of this branch prediction allows us to generate an
> indirect branch in which speculative execution is intentionally redirected into
> a controlled location by a return stack target that we control, preventing
> branch target injection (also known as "Spectre") against these binaries.
>
> On the targets (Intel Xeon) we have measured so far, cost is within cycles of a
> "native" indirect branch for which branch prediction hardware has been disabled.
> This is unfortunately measurable -- from 3 cycles on average to about 30.
> However the cost is largely mitigated for many workloads since the kernel uses
> comparatively few indirect branches (versus say, a C++ binary). With some
> effort we have the average overall overhead within the 0-1.5% range for our
> internal workloads, including some particularly high packet processing engines.
>
> There are several components, the majority of which are independent of kernel
> modifications:
>
> (1) A compiler supporting retpoline transformations.

An implementation for LLVM is available at:
https://reviews.llvm.org/D41723

> (1a) Optionally: annotations for hand-coded indirect jmps, so that they may be
> made compatible with (1).
> [ Note: The only known indirect jmp which is not safe to convert is the
> early virtual address check in head entry. ]
> (2) Kernel modifications for preventing return-stack underflow (see document
> above).
> The key points where this occurs are:
> - Context switches (into protected targets)
> - interrupt return (we return into potentially unwinding execution)
> - sleep state exit (flushes caches)
> - guest exit.
> (These can be run-time gated, a full refill costs 30-45 cycles.)
> (3) Optional: Optimizations so that direct branches can be used for hot kernel
> indirects. While as discussed above, kernel execution generally depends on
> fewer indirect branches, there are a few places (in particular, the
> networking stack) where we have chained sequences of indirects on hot paths.
> (4) More general support for guarding against RSB underflow in an affected
> target. While this is harder to exploit and may not be required for many
> users, the approaches we have used here are not generally applicable.
> Further discussion is required.
>
> With respect to what these deltas mean for an unmodified kernel:

Sorry, this should have been: a kernel that does not care about this protection.

It has been a long day :-).

> (1a) At minimum annotation only. More complicated, config and
> run-time gated options are also possible.
> (2) Trivially run-time & config gated.
> (3) The de-virtualizing of these branches improves performance in both the
> retpoline and non-retpoline cases.
>
> For an out of the box kernel that is reasonably protected, (1)-(3) are required.
>
> I apologize that this does not come with a clean set of patches, merging the
> things that we and Intel have looked at here. That was one of the original
> goals for this week. Strictly speaking, I think that Andi, David, and I have
> a fair amount of merging and clean-up to do here. This is an attempt
> to keep discussion of the fundamentals at least independent of that.
>
> I'm trying to keep the above reasonably compact/dense. I'm happy to expand on
> any details in sub-threads. I'll also link back some of the other compiler work
> which is landing for (1).
>
> Thanks,
>
> - Paul

2018-01-04 09:35:33

by Woodhouse, David

[permalink] [raw]
Subject: Re: [RFC] Retpoline: Binary mitigation for branch-target-injection (aka "Spectre")

On Thu, 2018-01-04 at 01:10 -0800, Paul Turner wrote:
> Apologies for the discombobulation around today's disclosure.  Obviously the
> original goal was to communicate this a little more coherently, but the
> unscheduled advances in the disclosure disrupted the efforts to pull this
> together more cleanly.
>
> I wanted to open discussion of the "retpoline" approach and define its
> requirements so that we can separate the core
> details from questions regarding any particular implementation thereof.
>
> As a starting point, a full write-up describing the approach is available at:
>   https://support.google.com/faqs/answer/7625886

Note that (ab)using 'ret' in this way is incompatible with CET on
upcoming processors. HJ added a -mno-indirect-branch-register option to
the latest round of GCC patches, which puts the branch target in a
register instead of on the stack. My kernel patches (which I'm about to
reconcile with Andi's tweaks and post) do the same.

That means that in the cases where at runtime we want to ALTERNATIVE
out the retpoline, it just turns back into a bare 'jmp *\reg'.
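
For clarity, a compact sketch of that register-based form (mirroring the
thunks posted later in this thread; the label numbers and the
X86_BUG_NO_RETPOLINE feature name are taken from those patches and are not
final):

	# Call sites use the target in a register, e.g.:
	#	jmp	__x86.indirect_thunk.rax
	# and the thunk itself is:
	ALTERNATIVE "call 2f", "jmp *%rax", X86_BUG_NO_RETPOLINE
1:	lfence			# captured speculation spins here
	jmp	1b
2:	mov	%rax, (%rsp)	# overwrite the return address with the real target
	ret

When the alternative is applied at run time, the first line is patched to the
bare 'jmp *%rax' and the rest of the thunk is never reached.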



2018-01-04 09:48:32

by Greg KH

[permalink] [raw]
Subject: Re: [RFC] Retpoline: Binary mitigation for branch-target-injection (aka "Spectre")

On Thu, Jan 04, 2018 at 01:24:41AM -0800, Paul Turner wrote:
> On Thu, Jan 4, 2018 at 1:10 AM, Paul Turner <[email protected]> wrote:
> > Apologies for the discombobulation around today's disclosure. Obviously the
> > original goal was to communicate this a little more coherently, but the
> > unscheduled advances in the disclosure disrupted the efforts to pull this
> > together more cleanly.
> >
> > I wanted to open discussion of the "retpoline" approach and define its
> > requirements so that we can separate the core
> > details from questions regarding any particular implementation thereof.
> >
> > As a starting point, a full write-up describing the approach is available at:
> > https://support.google.com/faqs/answer/7625886
> >
> > The 30 second version is:
> > Returns are a special type of indirect branch. As function returns are intended
> > to pair with function calls, processors often implement dedicated return stack
> > predictors. The choice of this branch prediction allows us to generate an
> > indirect branch in which speculative execution is intentionally redirected into
> > a controlled location by a return stack target that we control, preventing
> > branch target injection (also known as "Spectre") against these binaries.
> >
> > On the targets (Intel Xeon) we have measured so far, cost is within cycles of a
> > "native" indirect branch for which branch prediction hardware has been disabled.
> > This is unfortunately measurable -- from 3 cycles on average to about 30.
> > However the cost is largely mitigated for many workloads since the kernel uses
> > comparatively few indirect branches (versus say, a C++ binary). With some
> > effort we have the average overall overhead within the 0-1.5% range for our
> > internal workloads, including some particularly high packet processing engines.
> >
> > There are several components, the majority of which are independent of kernel
> > modifications:
> >
> > (1) A compiler supporting retpoline transformations.
>
> An implementation for LLVM is available at:
> https://reviews.llvm.org/D41723

Nice, thanks for the link and the write up. There is also a patch for
gcc floating around somewhere, does anyone have the link for that?

thanks,

greg k-h

2018-01-04 09:59:25

by Woodhouse, David

[permalink] [raw]
Subject: Re: [RFC] Retpoline: Binary mitigation for branch-target-injection (aka "Spectre")

On Thu, 2018-01-04 at 10:48 +0100, Greg Kroah-Hartman wrote:
>
> Nice, thanks for the link and the write up.  There is also a patch for
> gcc floating around somewhere, does anyone have the link for that?

http://git.infradead.org/users/dwmw2/gcc-retpoline.git/shortlog/refs/heads/gcc-7_2_0-retpoline-20171219

I put packages for Fedora 27 at ftp://ftp.infradead.org/pub/retpoline/



2018-01-04 14:37:26

by Woodhouse, David

[permalink] [raw]
Subject: [PATCH v3 08/13] x86/alternatives: Add missing \n at end of ALTERNATIVE inline asm

Where an ALTERNATIVE is used in the middle of an inline asm block, this
would otherwise lead to the following instruction being appended directly
to the trailing ".popsection", and a failed compile.

Signed-off-by: David Woodhouse <[email protected]>
---
arch/x86/include/asm/alternative.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/alternative.h b/arch/x86/include/asm/alternative.h
index dbfd0854651f..cf5961ca8677 100644
--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -140,7 +140,7 @@ static inline int alternatives_text_reserved(void *start, void *end)
".popsection\n" \
".pushsection .altinstr_replacement, \"ax\"\n" \
ALTINSTR_REPLACEMENT(newinstr, feature, 1) \
- ".popsection"
+ ".popsection\n"

#define ALTERNATIVE_2(oldinstr, newinstr1, feature1, newinstr2, feature2)\
OLDINSTR_2(oldinstr, 1, 2) \
@@ -151,7 +151,7 @@ static inline int alternatives_text_reserved(void *start, void *end)
".pushsection .altinstr_replacement, \"ax\"\n" \
ALTINSTR_REPLACEMENT(newinstr1, feature1, 1) \
ALTINSTR_REPLACEMENT(newinstr2, feature2, 2) \
- ".popsection"
+ ".popsection\n"

/*
* Alternative instructions for different CPU types or capabilities.
--
2.14.3

2018-01-04 14:37:29

by Woodhouse, David

[permalink] [raw]
Subject: [PATCH v3 03/13] x86/retpoline/entry: Convert entry assembler indirect jumps

Convert indirect jumps in core 32/64bit entry assembler code to use
non-speculative sequences when CONFIG_RETPOLINE is enabled.

KPTI complicates this a little; the one in entry_SYSCALL_64_trampoline
can't just jump to the thunk because the thunk isn't mapped. So it gets
its own copy of the thunk, inline.

Signed-off-by: David Woodhouse <[email protected]>
---
arch/x86/entry/entry_32.S | 5 +++--
arch/x86/entry/entry_64.S | 20 ++++++++++++++++----
2 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index ace8f321a5a1..abd1e5dd487d 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -44,6 +44,7 @@
#include <asm/asm.h>
#include <asm/smap.h>
#include <asm/frame.h>
+#include <asm/nospec-branch.h>

.section .entry.text, "ax"

@@ -290,7 +291,7 @@ ENTRY(ret_from_fork)

/* kernel thread */
1: movl %edi, %eax
- call *%ebx
+ NOSPEC_CALL ebx
/*
* A kernel thread is allowed to return here after successfully
* calling do_execve(). Exit to userspace to complete the execve()
@@ -919,7 +920,7 @@ common_exception:
movl %ecx, %es
TRACE_IRQS_OFF
movl %esp, %eax # pt_regs pointer
- call *%edi
+ NOSPEC_CALL edi
jmp ret_from_exception
END(common_exception)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index f048e384ff54..9e449701115a 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -37,6 +37,7 @@
#include <asm/pgtable_types.h>
#include <asm/export.h>
#include <asm/frame.h>
+#include <asm/nospec-branch.h>
#include <linux/err.h>

#include "calling.h"
@@ -191,7 +192,17 @@ ENTRY(entry_SYSCALL_64_trampoline)
*/
pushq %rdi
movq $entry_SYSCALL_64_stage2, %rdi
- jmp *%rdi
+ /*
+ * Open-code the retpoline from retpoline.S, because we can't
+ * just jump to it directly.
+ */
+ ALTERNATIVE "call 2f", "jmp *%rdi", X86_BUG_NO_RETPOLINE
+1:
+ lfence
+ jmp 1b
+2:
+ mov %rdi, (%rsp)
+ ret
END(entry_SYSCALL_64_trampoline)

.popsection
@@ -270,7 +281,8 @@ entry_SYSCALL_64_fastpath:
* It might end up jumping to the slow path. If it jumps, RAX
* and all argument registers are clobbered.
*/
- call *sys_call_table(, %rax, 8)
+ movq sys_call_table(, %rax, 8), %rax
+ NOSPEC_CALL rax
.Lentry_SYSCALL_64_after_fastpath_call:

movq %rax, RAX(%rsp)
@@ -442,7 +454,7 @@ ENTRY(stub_ptregs_64)
jmp entry_SYSCALL64_slow_path

1:
- jmp *%rax /* Called from C */
+ NOSPEC_JMP rax /* Called from C */
END(stub_ptregs_64)

.macro ptregs_stub func
@@ -521,7 +533,7 @@ ENTRY(ret_from_fork)
1:
/* kernel thread */
movq %r12, %rdi
- call *%rbx
+ NOSPEC_CALL rbx
/*
* A kernel thread is allowed to return here after successfully
* calling do_execve(). Exit to userspace to complete the execve()
--
2.14.3

2018-01-04 14:37:28

by Woodhouse, David

[permalink] [raw]
Subject: [PATCH v3 06/13] x86/retpoline/xen: Convert Xen hypercall indirect jumps

Convert indirect call in Xen hypercall to use non-speculative sequence,
when CONFIG_RETPOLINE is enabled.

Signed-off-by: David Woodhouse <[email protected]>
---
arch/x86/include/asm/xen/hypercall.h | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/xen/hypercall.h b/arch/x86/include/asm/xen/hypercall.h
index 7cb282e9e587..393c0048c63e 100644
--- a/arch/x86/include/asm/xen/hypercall.h
+++ b/arch/x86/include/asm/xen/hypercall.h
@@ -44,6 +44,7 @@
#include <asm/page.h>
#include <asm/pgtable.h>
#include <asm/smap.h>
+#include <asm/nospec-branch.h>

#include <xen/interface/xen.h>
#include <xen/interface/sched.h>
@@ -217,9 +218,9 @@ privcmd_call(unsigned call,
__HYPERCALL_5ARG(a1, a2, a3, a4, a5);

stac();
- asm volatile("call *%[call]"
+ asm volatile(NOSPEC_CALL
: __HYPERCALL_5PARAM
- : [call] "a" (&hypercall_page[call])
+ : [thunk_target] "a" (&hypercall_page[call])
: __HYPERCALL_CLOBBER5);
clac();

--
2.14.3

2018-01-04 14:38:14

by Woodhouse, David

[permalink] [raw]
Subject: [PATCH v3 10/13] x86/retpoline/pvops: Convert assembler indirect jumps

Convert pvops invocations to use non-speculative call sequences, when
CONFIG_RETPOLINE is enabled.

There is scope for future optimisation here — once the pvops methods are
actually set, we could just turn the damn things into *direct* jumps.
But this is perfectly sufficient for now, without that added complexity.

Signed-off-by: David Woodhouse <[email protected]>
---
arch/x86/include/asm/paravirt_types.h | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 6ec54d01972d..54b735b8ae12 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -336,11 +336,17 @@ extern struct pv_lock_ops pv_lock_ops;
#define PARAVIRT_PATCH(x) \
(offsetof(struct paravirt_patch_template, x) / sizeof(void *))

+#define paravirt_clobber(clobber) \
+ [paravirt_clobber] "i" (clobber)
+#ifdef CONFIG_RETPOLINE
+#define paravirt_type(op) \
+ [paravirt_typenum] "i" (PARAVIRT_PATCH(op)), \
+ [paravirt_opptr] "r" ((op))
+#else
#define paravirt_type(op) \
[paravirt_typenum] "i" (PARAVIRT_PATCH(op)), \
[paravirt_opptr] "i" (&(op))
-#define paravirt_clobber(clobber) \
- [paravirt_clobber] "i" (clobber)
+#endif

/*
* Generate some code, and mark it as patchable by the
@@ -392,7 +398,11 @@ int paravirt_disable_iospace(void);
* offset into the paravirt_patch_template structure, and can therefore be
* freely converted back into a structure offset.
*/
+#ifdef CONFIG_RETPOLINE
+#define PARAVIRT_CALL "call __x86.indirect_thunk.%V[paravirt_opptr];"
+#else
#define PARAVIRT_CALL "call *%c[paravirt_opptr];"
+#endif

/*
* These macros are intended to wrap calls through one of the paravirt
--
2.14.3

2018-01-04 14:38:18

by Woodhouse, David

[permalink] [raw]
Subject: [PATCH v3 11/13] retpoline/taint: Taint kernel for missing retpoline in compiler

From: Andi Kleen <[email protected]>

When the kernel or a module hasn't been compiled with a retpoline-aware
compiler, print a warning and set a taint flag.

For modules this is checked at compile time; however, it cannot check
assembler or other non-compiled objects used in the module link.

For lack of a better letter, it uses taint flag 'Z'.

v2: Change warning message
Signed-off-by: Andi Kleen <[email protected]>
Signed-off-by: David Woodhouse <[email protected]>
---
Documentation/admin-guide/tainted-kernels.rst | 3 +++
arch/x86/kernel/setup.c | 6 ++++++
include/linux/kernel.h | 4 +++-
kernel/module.c | 11 ++++++++++-
kernel/panic.c | 1 +
scripts/mod/modpost.c | 9 +++++++++
6 files changed, 32 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/tainted-kernels.rst b/Documentation/admin-guide/tainted-kernels.rst
index 1df03b5cb02f..800261b6bd6f 100644
--- a/Documentation/admin-guide/tainted-kernels.rst
+++ b/Documentation/admin-guide/tainted-kernels.rst
@@ -52,6 +52,9 @@ characters, each representing a particular tainted value.

16) ``K`` if the kernel has been live patched.

+ 17) ``Z`` if the x86 kernel or a module hasn't been compiled with
+ a retpoline aware compiler and may be vulnerable to data leaks.
+
The primary reason for the **'Tainted: '** string is to tell kernel
debuggers if this is a clean kernel or if anything unusual has
occurred. Tainting is permanent: even if an offending module is
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 8af2e8d0c0a1..cc880b46b756 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1296,6 +1296,12 @@ void __init setup_arch(char **cmdline_p)
#endif

unwind_init();
+
+#ifndef RETPOLINE
+ add_taint(TAINT_NO_RETPOLINE, LOCKDEP_STILL_OK);
+ pr_warn("No support for retpoline in kernel compiler\n");
+ pr_warn("System may be vulnerable to data leaks.\n");
+#endif
}

#ifdef CONFIG_X86_32
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index ce51455e2adf..fbb4d3baffcc 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -550,7 +550,9 @@ extern enum system_states {
#define TAINT_SOFTLOCKUP 14
#define TAINT_LIVEPATCH 15
#define TAINT_AUX 16
-#define TAINT_FLAGS_COUNT 17
+#define TAINT_NO_RETPOLINE 17
+
+#define TAINT_FLAGS_COUNT 18

struct taint_flag {
char c_true; /* character printed when tainted */
diff --git a/kernel/module.c b/kernel/module.c
index dea01ac9cb74..92db3f59a29a 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -3028,7 +3028,16 @@ static int check_modinfo(struct module *mod, struct load_info *info, int flags)
mod->name);
add_taint_module(mod, TAINT_OOT_MODULE, LOCKDEP_STILL_OK);
}
-
+#ifdef RETPOLINE
+ if (!get_modinfo(info, "retpoline")) {
+ if (!test_taint(TAINT_NO_RETPOLINE)) {
+ pr_warn("%s: loading module not compiled with retpoline compiler.\n",
+ mod->name);
+ pr_warn("Kernel may be vulnerable to data leaks.\n");
+ }
+ add_taint_module(mod, TAINT_NO_RETPOLINE, LOCKDEP_STILL_OK);
+ }
+#endif
if (get_modinfo(info, "staging")) {
add_taint_module(mod, TAINT_CRAP, LOCKDEP_STILL_OK);
pr_warn("%s: module is from the staging directory, the quality "
diff --git a/kernel/panic.c b/kernel/panic.c
index 2cfef408fec9..6686c67b6e4b 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -325,6 +325,7 @@ const struct taint_flag taint_flags[TAINT_FLAGS_COUNT] = {
{ 'L', ' ', false }, /* TAINT_SOFTLOCKUP */
{ 'K', ' ', true }, /* TAINT_LIVEPATCH */
{ 'X', ' ', true }, /* TAINT_AUX */
+ { 'Z', ' ', true }, /* TAINT_NO_RETPOLINE */
};

/**
diff --git a/scripts/mod/modpost.c b/scripts/mod/modpost.c
index f51cf977c65b..6510536c06df 100644
--- a/scripts/mod/modpost.c
+++ b/scripts/mod/modpost.c
@@ -2165,6 +2165,14 @@ static void add_intree_flag(struct buffer *b, int is_intree)
buf_printf(b, "\nMODULE_INFO(intree, \"Y\");\n");
}

+/* Cannot check for assembler */
+static void add_retpoline(struct buffer *b)
+{
+ buf_printf(b, "\n#ifdef RETPOLINE\n");
+ buf_printf(b, "MODULE_INFO(retpoline, \"Y\");\n");
+ buf_printf(b, "#endif\n");
+}
+
static void add_staging_flag(struct buffer *b, const char *name)
{
static const char *staging_dir = "drivers/staging";
@@ -2506,6 +2514,7 @@ int main(int argc, char **argv)
err |= check_modname_len(mod);
add_header(&buf, mod);
add_intree_flag(&buf, !external_module);
+ add_retpoline(&buf);
add_staging_flag(&buf, mod->name);
err |= add_versions(&buf, mod);
add_depends(&buf, mod, modules);
--
2.14.3

2018-01-04 14:38:12

by Woodhouse, David

[permalink] [raw]
Subject: [PATCH v3 04/13] x86/retpoline/ftrace: Convert ftrace assembler indirect jumps

Convert all indirect jumps in ftrace assembler code to use non-speculative
sequences when CONFIG_RETPOLINE is enabled.

Signed-off-by: David Woodhouse <[email protected]>
---
arch/x86/kernel/ftrace_32.S | 6 ++++--
arch/x86/kernel/ftrace_64.S | 8 ++++----
2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/ftrace_32.S b/arch/x86/kernel/ftrace_32.S
index b6c6468e10bc..eb9a56fbda02 100644
--- a/arch/x86/kernel/ftrace_32.S
+++ b/arch/x86/kernel/ftrace_32.S
@@ -8,6 +8,7 @@
#include <asm/segment.h>
#include <asm/export.h>
#include <asm/ftrace.h>
+#include <asm/nospec-branch.h>

#ifdef CC_USING_FENTRY
# define function_hook __fentry__
@@ -197,7 +198,8 @@ ftrace_stub:
movl 0x4(%ebp), %edx
subl $MCOUNT_INSN_SIZE, %eax

- call *ftrace_trace_function
+ movl ftrace_trace_function, %ecx
+ NOSPEC_CALL ecx

popl %edx
popl %ecx
@@ -241,5 +243,5 @@ return_to_handler:
movl %eax, %ecx
popl %edx
popl %eax
- jmp *%ecx
+ NOSPEC_JMP ecx
#endif
diff --git a/arch/x86/kernel/ftrace_64.S b/arch/x86/kernel/ftrace_64.S
index c832291d948a..d4611d8bdbbf 100644
--- a/arch/x86/kernel/ftrace_64.S
+++ b/arch/x86/kernel/ftrace_64.S
@@ -7,7 +7,7 @@
#include <asm/ptrace.h>
#include <asm/ftrace.h>
#include <asm/export.h>
-
+#include <asm/nospec-branch.h>

.code64
.section .entry.text, "ax"
@@ -286,8 +286,8 @@ trace:
* ip and parent ip are used and the list function is called when
* function tracing is enabled.
*/
- call *ftrace_trace_function
-
+ movq ftrace_trace_function, %r8
+ NOSPEC_CALL r8
restore_mcount_regs

jmp fgraph_trace
@@ -329,5 +329,5 @@ GLOBAL(return_to_handler)
movq 8(%rsp), %rdx
movq (%rsp), %rax
addq $24, %rsp
- jmp *%rdi
+ NOSPEC_JMP rdi
#endif
--
2.14.3

2018-01-04 14:38:09

by Woodhouse, David

[permalink] [raw]
Subject: [PATCH v3 12/13] retpoline/objtool: Disable some objtool warnings

From: Andi Kleen <[email protected]>

With the indirect call thunks enabled in the compiler, two objtool
warnings are triggered very frequently and make the build
very noisy.

I don't see a good way to avoid them, so just disable them
for now.

Signed-off-by: Andi Kleen <[email protected]>
Signed-off-by: David Woodhouse <[email protected]>
---
tools/objtool/check.c | 11 +++++++++++
1 file changed, 11 insertions(+)

diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index 9b341584eb1b..435c71f944dc 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -503,8 +503,13 @@ static int add_call_destinations(struct objtool_file *file)
insn->call_dest = find_symbol_by_offset(insn->sec,
dest_off);
if (!insn->call_dest) {
+#if 0
+ /* Compilers with -mindirect-branch=thunk-extern trigger
+ * this everywhere on x86. Disable for now.
+ */
WARN_FUNC("can't find call dest symbol at offset 0x%lx",
insn->sec, insn->offset, dest_off);
+#endif
return -1;
}
} else if (rela->sym->type == STT_SECTION) {
@@ -1716,8 +1721,14 @@ static int validate_branch(struct objtool_file *file, struct instruction *first,
return 1;

} else if (func && has_modified_stack_frame(&state)) {
+#if 0
+ /* Compilers with -mindirect-branch=thunk-extern trigger
+ * this everywhere on x86. Disable for now.
+ */
+
WARN_FUNC("sibling call from callable instruction with modified stack frame",
sec, insn->offset);
+#endif
return 1;
}

--
2.14.3

2018-01-04 14:42:17

by Woodhouse, David

[permalink] [raw]
Subject: [PATCH v3 09/13] x86/retpoline/irq32: Convert assembler indirect jumps

From: Andi Kleen <[email protected]>

Convert all indirect jumps in 32bit irq inline asm code to use
non-speculative sequences.

Signed-off-by: Andi Kleen <[email protected]>
Signed-off-by: David Woodhouse <[email protected]>
---
arch/x86/kernel/irq_32.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/irq_32.c b/arch/x86/kernel/irq_32.c
index a83b3346a0e1..e1e58f738c3d 100644
--- a/arch/x86/kernel/irq_32.c
+++ b/arch/x86/kernel/irq_32.c
@@ -20,6 +20,7 @@
#include <linux/mm.h>

#include <asm/apic.h>
+#include <asm/nospec-branch.h>

#ifdef CONFIG_DEBUG_STACKOVERFLOW

@@ -55,11 +56,11 @@ DEFINE_PER_CPU(struct irq_stack *, softirq_stack);
static void call_on_stack(void *func, void *stack)
{
asm volatile("xchgl %%ebx,%%esp \n"
- "call *%%edi \n"
+ NOSPEC_CALL
"movl %%ebx,%%esp \n"
: "=b" (stack)
: "0" (stack),
- "D"(func)
+ [thunk_target] "D"(func)
: "memory", "cc", "edx", "ecx", "eax");
}

@@ -95,11 +96,11 @@ static inline int execute_on_irq_stack(int overflow, struct irq_desc *desc)
call_on_stack(print_stack_overflow, isp);

asm volatile("xchgl %%ebx,%%esp \n"
- "call *%%edi \n"
+ NOSPEC_CALL
"movl %%ebx,%%esp \n"
: "=a" (arg1), "=b" (isp)
: "0" (desc), "1" (isp),
- "D" (desc->handle_irq)
+ [thunk_target] "D" (desc->handle_irq)
: "memory", "cc", "ecx");
return 1;
}
--
2.14.3

2018-01-04 14:42:33

by Woodhouse, David

[permalink] [raw]
Subject: [PATCH v3 13/13] retpoline: Attempt to quieten objtool warning for unreachable code

From: Andi Kleen <[email protected]>

The speculative jump trampoline has to contain unreachable code.
objtool keeps complaining

arch/x86/lib/retpoline.o: warning: objtool: __x86.indirect_thunk()+0x8: unreachable instruction

I tried to fix it here by adding an ASM_UNREACHABLE annotation (after
adding support for it in pure assembler), but it still complains.
Seems like an objtool bug?

So it doesn't actually fix the warning yet.
Of course it's just a warning, so the kernel will still work fine.

Perhaps Josh can figure it out

Cc: [email protected]
Not-Signed-off-by: Andi Kleen <[email protected]>
Not-Signed-off-by: David Woodhouse <[email protected]>
---
arch/x86/lib/retpoline.S | 5 +++++
include/linux/compiler.h | 10 +++++++++-
2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/arch/x86/lib/retpoline.S b/arch/x86/lib/retpoline.S
index bbdda5cc136e..8112feaea6ff 100644
--- a/arch/x86/lib/retpoline.S
+++ b/arch/x86/lib/retpoline.S
@@ -2,6 +2,7 @@

#include <linux/stringify.h>
#include <linux/linkage.h>
+#include <linux/compiler.h>
#include <asm/dwarf2.h>
#include <asm/cpufeatures.h>
#include <asm/alternative-asm.h>
@@ -14,7 +15,9 @@ ENTRY(__x86.indirect_thunk.\reg)
ALTERNATIVE "call 2f", __stringify(jmp *%\reg), X86_BUG_NO_RETPOLINE
1:
lfence
+ ASM_UNREACHABLE
jmp 1b
+ ASM_UNREACHABLE
2:
mov %\reg, (%\sp)
ret
@@ -40,7 +43,9 @@ ENTRY(__x86.indirect_thunk)
ALTERNATIVE "call 2f", "ret", X86_BUG_NO_RETPOLINE
1:
lfence
+ ASM_UNREACHABLE
jmp 1b
+ ASM_UNREACHABLE
2:
lea 4(%esp), %esp
ret
diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index 52e611ab9a6c..cfba91acc79a 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -269,7 +269,15 @@ static __always_inline void __write_once_size(volatile void *p, void *res, int s

#endif /* __KERNEL__ */

-#endif /* __ASSEMBLY__ */
+#else /* __ASSEMBLY__ */
+
+#define ASM_UNREACHABLE \
+ 999: \
+ .pushsection .discard.unreachable; \
+ .long 999b - .; \
+ .popsection
+
+#endif /* !__ASSEMBLY__ */

/* Compile time object size, -1 for unknown */
#ifndef __compiletime_object_size
--
2.14.3

2018-01-04 14:43:30

by Woodhouse, David

[permalink] [raw]
Subject: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support

Enable the use of -mindirect-branch=thunk-extern in newer GCC, and provide
the corresponding thunks. Provide assembler macros for invoking the thunks
in the same way that GCC does, from native and inline assembler.

This adds an X86_BUG_NO_RETPOLINE "feature" for runtime patching out
of the thunks. This is a placeholder for now; the patches which support
the new Intel/AMD microcode features will flesh out the precise conditions
under which we disable the retpoline and do other things instead.

[Andi Kleen: Rename the macros and add CONFIG_RETPOLINE option]

Signed-off-by: David Woodhouse <[email protected]>
---
arch/x86/Kconfig | 8 ++++++
arch/x86/Makefile | 10 ++++++++
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/lib/Makefile | 1 +
arch/x86/lib/retpoline.S | 50 ++++++++++++++++++++++++++++++++++++++
5 files changed, 70 insertions(+)
create mode 100644 arch/x86/lib/retpoline.S

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index d4fc98c50378..8b0facfa35be 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -429,6 +429,14 @@ config GOLDFISH
def_bool y
depends on X86_GOLDFISH

+config RETPOLINE
+ bool "Avoid speculative indirect branches in kernel"
+ default y
+ help
+ Compile kernel with the retpoline compiler options to guard against
+ kernel to user data leaks by avoiding speculative indirect
+ branches. Requires a new enough compiler. The kernel may run slower.
+
config INTEL_RDT
bool "Intel Resource Director Technology support"
default n
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 3e73bc255e4e..f772b3fef202 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -230,6 +230,16 @@ KBUILD_CFLAGS += -Wno-sign-compare
#
KBUILD_CFLAGS += -fno-asynchronous-unwind-tables

+# Avoid indirect branches in kernel to deal with Spectre
+ifdef CONFIG_RETPOLINE
+ RETPOLINE_CFLAGS += $(call cc-option,-mindirect-branch=thunk-extern -mindirect-branch-register)
+ ifneq ($(RETPOLINE_CFLAGS),)
+ KBUILD_CFLAGS += $(RETPOLINE_CFLAGS) -DRETPOLINE
+ else
+ $(warning Retpoline not supported in compiler. System may be insecure.)
+ endif
+endif
+
archscripts: scripts_basic
$(Q)$(MAKE) $(build)=arch/x86/tools relocs

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 07cdd1715705..900fa7016d3f 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -342,5 +342,6 @@
#define X86_BUG_MONITOR X86_BUG(12) /* IPI required to wake up remote CPU */
#define X86_BUG_AMD_E400 X86_BUG(13) /* CPU is among the affected by Erratum 400 */
#define X86_BUG_CPU_INSECURE X86_BUG(14) /* CPU is insecure and needs kernel page table isolation */
+#define X86_BUG_NO_RETPOLINE X86_BUG(15) /* Placeholder: disable retpoline branch thunks */

#endif /* _ASM_X86_CPUFEATURES_H */
diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile
index 7b181b61170e..f23934bbaf4e 100644
--- a/arch/x86/lib/Makefile
+++ b/arch/x86/lib/Makefile
@@ -26,6 +26,7 @@ lib-y += memcpy_$(BITS).o
lib-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem.o
lib-$(CONFIG_INSTRUCTION_DECODER) += insn.o inat.o insn-eval.o
lib-$(CONFIG_RANDOMIZE_BASE) += kaslr.o
+lib-$(CONFIG_RETPOLINE) += retpoline.o

obj-y += msr.o msr-reg.o msr-reg-export.o hweight.o

diff --git a/arch/x86/lib/retpoline.S b/arch/x86/lib/retpoline.S
new file mode 100644
index 000000000000..bbdda5cc136e
--- /dev/null
+++ b/arch/x86/lib/retpoline.S
@@ -0,0 +1,50 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#include <linux/stringify.h>
+#include <linux/linkage.h>
+#include <asm/dwarf2.h>
+#include <asm/cpufeatures.h>
+#include <asm/alternative-asm.h>
+
+.macro THUNK sp reg
+ .section .text.__x86.indirect_thunk.\reg
+
+ENTRY(__x86.indirect_thunk.\reg)
+ CFI_STARTPROC
+ ALTERNATIVE "call 2f", __stringify(jmp *%\reg), X86_BUG_NO_RETPOLINE
+1:
+ lfence
+ jmp 1b
+2:
+ mov %\reg, (%\sp)
+ ret
+ CFI_ENDPROC
+ENDPROC(__x86.indirect_thunk.\reg)
+.endm
+
+#ifdef CONFIG_64BIT
+.irp reg rax rbx rcx rdx rsi rdi rbp r8 r9 r10 r11 r12 r13 r14 r15
+ THUNK rsp \reg
+.endr
+#else
+.irp reg eax ebx ecx edx esi edi ebp
+ THUNK esp \reg
+.endr
+
+/*
+ * Also provide the original ret-equivalent retpoline for i386 because it's
+ * so register-starved, and we don't care about CET compatibility here.
+ */
+ENTRY(__x86.indirect_thunk)
+ CFI_STARTPROC
+ ALTERNATIVE "call 2f", "ret", X86_BUG_NO_RETPOLINE
+1:
+ lfence
+ jmp 1b
+2:
+ lea 4(%esp), %esp
+ ret
+ CFI_ENDPROC
+ENDPROC(__x86.indirect_thunk)
+
+#endif
--
2.14.3

2018-01-04 14:43:33

by Woodhouse, David

[permalink] [raw]
Subject: [PATCH v3 02/13] x86/retpoline/crypto: Convert crypto assembler indirect jumps

Convert all indirect jumps in crypto assembler code to use non-speculative
sequences when CONFIG_RETPOLINE is enabled.

Signed-off-by: David Woodhouse <[email protected]>
---
arch/x86/crypto/aesni-intel_asm.S | 5 +++--
arch/x86/crypto/camellia-aesni-avx-asm_64.S | 3 ++-
arch/x86/crypto/camellia-aesni-avx2-asm_64.S | 3 ++-
arch/x86/crypto/crc32c-pcl-intel-asm_64.S | 3 ++-
4 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/arch/x86/crypto/aesni-intel_asm.S b/arch/x86/crypto/aesni-intel_asm.S
index 16627fec80b2..074c13767c9f 100644
--- a/arch/x86/crypto/aesni-intel_asm.S
+++ b/arch/x86/crypto/aesni-intel_asm.S
@@ -32,6 +32,7 @@
#include <linux/linkage.h>
#include <asm/inst.h>
#include <asm/frame.h>
+#include <asm/nospec-branch.h>

/*
* The following macros are used to move an (un)aligned 16 byte value to/from
@@ -2884,7 +2885,7 @@ ENTRY(aesni_xts_crypt8)
pxor INC, STATE4
movdqu IV, 0x30(OUTP)

- call *%r11
+ NOSPEC_CALL r11

movdqu 0x00(OUTP), INC
pxor INC, STATE1
@@ -2929,7 +2930,7 @@ ENTRY(aesni_xts_crypt8)
_aesni_gf128mul_x_ble()
movups IV, (IVP)

- call *%r11
+ NOSPEC_CALL r11

movdqu 0x40(OUTP), INC
pxor INC, STATE1
diff --git a/arch/x86/crypto/camellia-aesni-avx-asm_64.S b/arch/x86/crypto/camellia-aesni-avx-asm_64.S
index f7c495e2863c..98a717ba5e1a 100644
--- a/arch/x86/crypto/camellia-aesni-avx-asm_64.S
+++ b/arch/x86/crypto/camellia-aesni-avx-asm_64.S
@@ -17,6 +17,7 @@

#include <linux/linkage.h>
#include <asm/frame.h>
+#include <asm/nospec-branch.h>

#define CAMELLIA_TABLE_BYTE_LEN 272

@@ -1227,7 +1228,7 @@ camellia_xts_crypt_16way:
vpxor 14 * 16(%rax), %xmm15, %xmm14;
vpxor 15 * 16(%rax), %xmm15, %xmm15;

- call *%r9;
+ NOSPEC_CALL r9;

addq $(16 * 16), %rsp;

diff --git a/arch/x86/crypto/camellia-aesni-avx2-asm_64.S b/arch/x86/crypto/camellia-aesni-avx2-asm_64.S
index eee5b3982cfd..99d09d3166a5 100644
--- a/arch/x86/crypto/camellia-aesni-avx2-asm_64.S
+++ b/arch/x86/crypto/camellia-aesni-avx2-asm_64.S
@@ -12,6 +12,7 @@

#include <linux/linkage.h>
#include <asm/frame.h>
+#include <asm/nospec-branch.h>

#define CAMELLIA_TABLE_BYTE_LEN 272

@@ -1343,7 +1344,7 @@ camellia_xts_crypt_32way:
vpxor 14 * 32(%rax), %ymm15, %ymm14;
vpxor 15 * 32(%rax), %ymm15, %ymm15;

- call *%r9;
+ NOSPEC_CALL r9;

addq $(16 * 32), %rsp;

diff --git a/arch/x86/crypto/crc32c-pcl-intel-asm_64.S b/arch/x86/crypto/crc32c-pcl-intel-asm_64.S
index 7a7de27c6f41..05178b44317d 100644
--- a/arch/x86/crypto/crc32c-pcl-intel-asm_64.S
+++ b/arch/x86/crypto/crc32c-pcl-intel-asm_64.S
@@ -45,6 +45,7 @@

#include <asm/inst.h>
#include <linux/linkage.h>
+#include <asm/nospec-branch.h>

## ISCSI CRC 32 Implementation with crc32 and pclmulqdq Instruction

@@ -172,7 +173,7 @@ continue_block:
movzxw (bufp, %rax, 2), len
lea crc_array(%rip), bufp
lea (bufp, len, 1), bufp
- jmp *bufp
+ NOSPEC_JMP bufp

################################################################
## 2a) PROCESS FULL BLOCKS:
--
2.14.3

2018-01-04 14:43:27

by Woodhouse, David

[permalink] [raw]
Subject: [PATCH v3 05/13] x86/retpoline/hyperv: Convert assembler indirect jumps

Convert all indirect jumps in hyperv inline asm code to use non-speculative
sequences when CONFIG_RETPOLINE is enabled.

Signed-off-by: David Woodhouse <[email protected]>
---
arch/x86/include/asm/mshyperv.h | 18 ++++++++++--------
1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index 5400add2885b..532ab441f39a 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -7,6 +7,7 @@
#include <linux/nmi.h>
#include <asm/io.h>
#include <asm/hyperv.h>
+#include <asm/nospec-branch.h>

/*
* The below CPUID leaves are present if VersionAndFeatures.HypervisorPresent
@@ -186,10 +187,11 @@ static inline u64 hv_do_hypercall(u64 control, void *input, void *output)
return U64_MAX;

__asm__ __volatile__("mov %4, %%r8\n"
- "call *%5"
+ NOSPEC_CALL
: "=a" (hv_status), ASM_CALL_CONSTRAINT,
"+c" (control), "+d" (input_address)
- : "r" (output_address), "m" (hv_hypercall_pg)
+ : "r" (output_address),
+ THUNK_TARGET(hv_hypercall_pg)
: "cc", "memory", "r8", "r9", "r10", "r11");
#else
u32 input_address_hi = upper_32_bits(input_address);
@@ -200,13 +202,13 @@ static inline u64 hv_do_hypercall(u64 control, void *input, void *output)
if (!hv_hypercall_pg)
return U64_MAX;

- __asm__ __volatile__("call *%7"
+ __asm__ __volatile__(NOSPEC_CALL
: "=A" (hv_status),
"+c" (input_address_lo), ASM_CALL_CONSTRAINT
: "A" (control),
"b" (input_address_hi),
"D"(output_address_hi), "S"(output_address_lo),
- "m" (hv_hypercall_pg)
+ THUNK_TARGET(hv_hypercall_pg)
: "cc", "memory");
#endif /* !x86_64 */
return hv_status;
@@ -227,10 +229,10 @@ static inline u64 hv_do_fast_hypercall8(u16 code, u64 input1)

#ifdef CONFIG_X86_64
{
- __asm__ __volatile__("call *%4"
+ __asm__ __volatile__(NOSPEC_CALL
: "=a" (hv_status), ASM_CALL_CONSTRAINT,
"+c" (control), "+d" (input1)
- : "m" (hv_hypercall_pg)
+ : THUNK_TARGET(hv_hypercall_pg)
: "cc", "r8", "r9", "r10", "r11");
}
#else
@@ -238,13 +240,13 @@ static inline u64 hv_do_fast_hypercall8(u16 code, u64 input1)
u32 input1_hi = upper_32_bits(input1);
u32 input1_lo = lower_32_bits(input1);

- __asm__ __volatile__ ("call *%5"
+ __asm__ __volatile__ (NOSPEC_CALL
: "=A"(hv_status),
"+c"(input1_lo),
ASM_CALL_CONSTRAINT
: "A" (control),
"b" (input1_hi),
- "m" (hv_hypercall_pg)
+ THUNK_TARGET(hv_hypercall_pg)
: "cc", "edi", "esi");
}
#endif
--
2.14.3

2018-01-04 14:43:25

by Woodhouse, David

[permalink] [raw]
Subject: [PATCH v3 07/13] x86/retpoline/checksum32: Convert assembler indirect jumps

Convert all indirect jumps in 32bit checksum assembler code to use
non-speculative sequences when CONFIG_RETPOLINE is enabled.

Signed-off-by: David Woodhouse <[email protected]>
---
arch/x86/lib/checksum_32.S | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/x86/lib/checksum_32.S b/arch/x86/lib/checksum_32.S
index 4d34bb548b41..d5ef7cc0efd8 100644
--- a/arch/x86/lib/checksum_32.S
+++ b/arch/x86/lib/checksum_32.S
@@ -29,7 +29,8 @@
#include <asm/errno.h>
#include <asm/asm.h>
#include <asm/export.h>
-
+#include <asm/nospec-branch.h>
+
/*
* computes a partial checksum, e.g. for TCP/UDP fragments
*/
@@ -156,7 +157,7 @@ ENTRY(csum_partial)
negl %ebx
lea 45f(%ebx,%ebx,2), %ebx
testl %esi, %esi
- jmp *%ebx
+ NOSPEC_JMP ebx

# Handle 2-byte-aligned regions
20: addw (%esi), %ax
@@ -439,7 +440,7 @@ ENTRY(csum_partial_copy_generic)
andl $-32,%edx
lea 3f(%ebx,%ebx), %ebx
testl %esi, %esi
- jmp *%ebx
+ NOSPEC_JMP ebx
1: addl $64,%esi
addl $64,%edi
SRC(movb -32(%edx),%bl) ; SRC(movb (%edx),%bl)
--
2.14.3

2018-01-04 14:46:36

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v3 03/13] x86/retpoline/entry: Convert entry assembler indirect jumps

On 01/04/2018 06:37 AM, David Woodhouse wrote:
> KPTI complicates this a little; the one in entry_SYSCALL_64_trampoline
> can't just jump to the thunk because the thunk isn't mapped. So it gets
> its own copy of the thunk, inline.

This one call site isn't too painful, of course.

But, is there anything keeping us from just sticking the thunk in the
entry text section where it would be available while still in the
trampoline?

2018-01-04 14:51:13

by Woodhouse, David

[permalink] [raw]
Subject: Re: [PATCH v3 03/13] x86/retpoline/entry: Convert entry assembler indirect jumps

On Thu, 2018-01-04 at 06:46 -0800, Dave Hansen wrote:
> On 01/04/2018 06:37 AM, David Woodhouse wrote:
> > KPTI complicates this a little; the one in entry_SYSCALL_64_trampoline
> > can't just jump to the thunk because the thunk isn't mapped. So it gets
> > its own copy of the thunk, inline.
>
> This one call site isn't too painful, of course.
>
> But, is there anything keeping us from just sticking the thunk in the
> entry text section where it would be available while still in the
> trampoline?

Not really. Since we have a thunk per register and the trampoline is
limited in size, if you wanted to put just the *one* thunk that's
needed into the trampoline then it's slightly icky, but all perfectly
feasible.



2018-01-04 15:02:11

by Juergen Gross

[permalink] [raw]
Subject: Re: [PATCH v3 10/13] x86/retpoline/pvops: Convert assembler indirect jumps

On 04/01/18 15:37, David Woodhouse wrote:
> Convert pvops invocations to use non-speculative call sequences, when
> CONFIG_RETPOLINE is enabled.
>
> There is scope for future optimisation here — once the pvops methods are
> actually set, we could just turn the damn things into *direct* jumps.
> But this is perfectly sufficient for now, without that added complexity.

I don't see the need to modify the pvops calls.

All indirect calls are replaced by either direct calls or other code
long before any user code is active.

For modules the replacements are in place before the module is being
used.


Juergen

>
> Signed-off-by: David Woodhouse <[email protected]>
> ---
> arch/x86/include/asm/paravirt_types.h | 14 ++++++++++++--
> 1 file changed, 12 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
> index 6ec54d01972d..54b735b8ae12 100644
> --- a/arch/x86/include/asm/paravirt_types.h
> +++ b/arch/x86/include/asm/paravirt_types.h
> @@ -336,11 +336,17 @@ extern struct pv_lock_ops pv_lock_ops;
> #define PARAVIRT_PATCH(x) \
> (offsetof(struct paravirt_patch_template, x) / sizeof(void *))
>
> +#define paravirt_clobber(clobber) \
> + [paravirt_clobber] "i" (clobber)
> +#ifdef CONFIG_RETPOLINE
> +#define paravirt_type(op) \
> + [paravirt_typenum] "i" (PARAVIRT_PATCH(op)), \
> + [paravirt_opptr] "r" ((op))
> +#else
> #define paravirt_type(op) \
> [paravirt_typenum] "i" (PARAVIRT_PATCH(op)), \
> [paravirt_opptr] "i" (&(op))
> -#define paravirt_clobber(clobber) \
> - [paravirt_clobber] "i" (clobber)
> +#endif
>
> /*
> * Generate some code, and mark it as patchable by the
> @@ -392,7 +398,11 @@ int paravirt_disable_iospace(void);
> * offset into the paravirt_patch_template structure, and can therefore be
> * freely converted back into a structure offset.
> */
> +#ifdef CONFIG_RETPOLINE
> +#define PARAVIRT_CALL "call __x86.indirect_thunk.%V[paravirt_opptr];"
> +#else
> #define PARAVIRT_CALL "call *%c[paravirt_opptr];"
> +#endif
>
> /*
> * These macros are intended to wrap calls through one of the paravirt
>

2018-01-04 15:10:37

by Juergen Gross

[permalink] [raw]
Subject: Re: [PATCH v3 06/13] x86/retpoline/xen: Convert Xen hypercall indirect jumps

On 04/01/18 15:37, David Woodhouse wrote:
> Convert indirect call in Xen hypercall to use non-speculative sequence,
> when CONFIG_RETPOLINE is enabled.
>
> Signed-off-by: David Woodhouse <[email protected]>
> ---
> arch/x86/include/asm/xen/hypercall.h | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/include/asm/xen/hypercall.h b/arch/x86/include/asm/xen/hypercall.h
> index 7cb282e9e587..393c0048c63e 100644
> --- a/arch/x86/include/asm/xen/hypercall.h
> +++ b/arch/x86/include/asm/xen/hypercall.h
> @@ -44,6 +44,7 @@
> #include <asm/page.h>
> #include <asm/pgtable.h>
> #include <asm/smap.h>
> +#include <asm/nospec-branch.h>

Where does this file come from? It isn't added in any of the patches.


Juergen

2018-01-04 15:13:18

by Woodhouse, David

[permalink] [raw]
Subject: Re: [PATCH v3 10/13] x86/retpoline/pvops: Convert assembler indirect jumps

On Thu, 2018-01-04 at 16:02 +0100, Juergen Gross wrote:
> On 04/01/18 15:37, David Woodhouse wrote:
> > Convert pvops invocations to use non-speculative call sequences, when
> > CONFIG_RETPOLINE is enabled.
> > 
> > There is scope for future optimisation here — once the pvops methods are
> > actually set, we could just turn the damn things into *direct* jumps.
> > But this is perfectly sufficient for now, without that added complexity

2018-01-04 15:19:09

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH v3 06/13] x86/retpoline/xen: Convert Xen hypercall indirect jumps

On Thu, 2018-01-04 at 16:10 +0100, Juergen Gross wrote:
> On 04/01/18 15:37, David Woodhouse wrote:
> >
> > Convert indirect call in Xen hypercall to use non-speculative sequence,
> > when CONFIG_RETPOLINE is enabled.
> >
> > Signed-off-by: David Woodhouse <[email protected]>
> > ---
> >  arch/x86/include/asm/xen/hypercall.h | 5 +++--
> >  1 file changed, 3 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/xen/hypercall.h
> > b/arch/x86/include/asm/xen/hypercall.h
> > index 7cb282e9e587..393c0048c63e 100644
> > --- a/arch/x86/include/asm/xen/hypercall.h
> > +++ b/arch/x86/include/asm/xen/hypercall.h
> > @@ -44,6 +44,7 @@
> >  #include <asm/page.h>
> >  #include <asm/pgtable.h>
> >  #include <asm/smap.h>
> > +#include <asm/nospec-branch.h>

> Where does this file come from? It isn't added in any of the patches

Dammit, sorry. Fixed now:

http://git.infradead.org/users/dwmw2/linux-retpoline.git/commitdiff/be7e80781
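
For reference, the assembler side of that header provides macros roughly along
these lines -- a sketch inferred from the thunk names in patch 01/13 and the
open-coded variant in patch 03/13, not the actual header contents (the header
also needs inline-asm flavours using the [thunk_target] constraint seen in
patches 05-09):

	.macro NOSPEC_JMP reg:req
#ifdef CONFIG_RETPOLINE
	jmp	__x86.indirect_thunk.\reg	# speculation-safe thunk
#else
	jmp	*%\reg				# plain indirect jump
#endif
	.endm

	.macro NOSPEC_CALL reg:req
#ifdef CONFIG_RETPOLINE
	call	__x86.indirect_thunk.\reg
#else
	call	*%\reg
#endif
	.endm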



2018-01-04 15:49:11

by Andrew Cooper

[permalink] [raw]
Subject: Re: [PATCH v3 10/13] x86/retpoline/pvops: Convert assembler indirect jumps

On 04/01/18 15:02, Juergen Gross wrote:
> On 04/01/18 15:37, David Woodhouse wrote:
>> Convert pvops invocations to use non-speculative call sequences, when
>> CONFIG_RETPOLINE is enabled.
>>
>> There is scope for future optimisation here — once the pvops methods are
>> actually set, we could just turn the damn things into *direct* jumps.
>> But this is perfectly sufficient for now, without that added complexity.
> I don't see the need to modify the pvops calls.
>
> All indirect calls are replaced by either direct calls or other code
> long before any user code is active.
>
> For modules the replacements are in place before the module is being
> used.

When booting virtualised, sibling hyperthreads can arrange VM-to-VM SP2
attacks.

One mitigating consideration, though, is whether there is any interesting
data to leak that early during boot.

~Andrew

2018-01-04 15:54:41

by Juergen Gross

[permalink] [raw]
Subject: Re: [PATCH v3 06/13] x86/retpoline/xen: Convert Xen hypercall indirect jumps

On 04/01/18 15:37, David Woodhouse wrote:
> Convert indirect call in Xen hypercall to use non-speculative sequence,
> when CONFIG_RETPOLINE is enabled.
>
> Signed-off-by: David Woodhouse <[email protected]>

Reviewed-by: Juergen Gross <[email protected]>


Juergen

2018-01-04 16:04:41

by Juergen Gross

[permalink] [raw]
Subject: Re: [PATCH v3 10/13] x86/retpoline/pvops: Convert assembler indirect jumps

On 04/01/18 16:18, Andrew Cooper wrote:
> On 04/01/18 15:02, Juergen Gross wrote:
>> On 04/01/18 15:37, David Woodhouse wrote:
>>> Convert pvops invocations to use non-speculative call sequences, when
>>> CONFIG_RETPOLINE is enabled.
>>>
>>> There is scope for future optimisation here — once the pvops methods are
>>> actually set, we could just turn the damn things into *direct* jumps.
>>> But this is perfectly sufficient for now, without that added complexity.
>> I don't see the need to modify the pvops calls.
>>
>> All indirect calls are replaced by either direct calls or other code
>> long before any user code is active.
>>
>> For modules the replacements are in place before the module is being
>> used.
>
> When booting virtualised, sibling hyperthreads can arrange VM-to-VM SP2
> attacks.
>
> One mitigation though is to consider if there is any interesting data to
> leak that early during boot.

Right. And if you are able to detect a booting VM in the other
hyperthread, obtain enough information about its kernel layout and
extract the information via statistical methods in the very short time
frame of the boot before pvops patching takes place. Not to forget the
vast amount of data the booting VM will pull through the caches, making
side-channel attacks a rather flaky endeavor...

I'd opt for leaving pvops calls untouched. The Reviewed-by: I gave for
the patch was just for its correctness. :-)


Juergen

2018-01-04 16:19:21

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RFC] Retpoline: Binary mitigation for branch-target-injection (aka "Spectre")

On Thu, Jan 4, 2018 at 1:30 AM, Woodhouse, David <[email protected]> wrote:
> On Thu, 2018-01-04 at 01:10 -0800, Paul Turner wrote:
>> Apologies for the discombobulation around today's disclosure. Obviously the
>> original goal was to communicate this a little more coherently, but the
>> unscheduled advances in the disclosure disrupted the efforts to pull this
>> together more cleanly.
>>
>> I wanted to open discussion the "retpoline" approach and and define its
>> requirements so that we can separate the core
>> details from questions regarding any particular implementation thereof.
>>
>> As a starting point, a full write-up describing the approach is available at:
>> https://support.google.com/faqs/answer/7625886
>
> Note that (ab)using 'ret' in this way is incompatible with CET on
> upcoming processors. HJ added a -mno-indirect-branch-register option to
> the latest round of GCC patches, which puts the branch target in a
> register instead of on the stack. My kernel patches (which I'm about to
> reconcile with Andi's tweaks and post) do the same.
>
> That means that in the cases where at runtime we want to ALTERNATIVE
> out the retpoline, it just turns back into a bare 'jmp *\reg'.
>
>

I hate to say this, but I think Intel should postpone CET until the
dust settles. Intel should also consider a hardware-protected stack
that is only accessible with PUSH, POP, CALL, RET, and a new MOVSTACK
instruction. That, by itself, would give considerable protection.
But we still need JMP_NO_SPECULATE. Or, better yet, get the CPU to
stop leaking data during speculative execution.

2018-01-04 16:24:44

by David Woodhouse

[permalink] [raw]
Subject: Re: [RFC] Retpoline: Binary mitigation for branch-target-injection (aka "Spectre")

On Thu, 2018-01-04 at 08:18 -0800, Andy Lutomirski wrote:
> I hate to say this, but I think Intel should postpone CET until the
> dust settles.

CET isn't a *problem* for retpoline. We've had a CET-compatible version
for a while now, and I posted it earlier. It's just that Andi was
working from an older version of my patches.

Of course, there's a school of thought that says that Intel should
postpone *everything* until this is all fixed sanely, but there's
nothing special about CET in that respect.



2018-01-04 16:37:34

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH v3 10/13] x86/retpoline/pvops: Convert assembler indirect jumps

On Thu, Jan 04, 2018 at 04:02:06PM +0100, Juergen Gross wrote:
> On 04/01/18 15:37, David Woodhouse wrote:
> > Convert pvops invocations to use non-speculative call sequences, when
> > CONFIG_RETPOLINE is enabled.
> >
> > There is scope for future optimisation here — once the pvops methods are
> > actually set, we could just turn the damn things into *direct* jumps.
> > But this is perfectly sufficient for now, without that added complexity.
>
> I don't see the need to modify the pvops calls.
>
> All indirect calls are replaced by either direct calls or other code
> long before any user code is active.
>
> For modules the replacements are in place before the module is being
> used.

Agreed. This shouldn't be needed.

-Andi

2018-01-04 18:03:22

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support

David,
these are all marked as spam, because your emails have screwed up
DKIM. You used

From: David Woodhouse <[email protected]>

but then you used infradead as a mailer, so it has the DKIM signature
from infradead, not from Amazon.co.uk.

The DKIM signature does pass for infradead, but amazon dmarc - quite
reasonably - wants the from to match.

End result:

dmarc=fail (p=QUARANTINE sp=QUARANTINE dis=QUARANTINE)
header.from=amazon.co.uk

and everything was in spam.

Please don't do this. There's enough spam in the world that we don't
need people mis-configuring their emails and making real emails look
like spam too.

Linus

On Thu, Jan 4, 2018 at 6:36 AM, David Woodhouse <[email protected]> wrote:
> Enable the use of -mindirect-branch=thunk-extern in newer GCC, and provide
> the corresponding thunks. Provide assembler macros for invoking the thunks
> in the same way that GCC does, from native and inline assembler.

2018-01-04 18:17:52

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support

On Thu, Jan 04, 2018 at 02:36:58PM +0000, David Woodhouse wrote:
> Enable the use of -mindirect-branch=thunk-extern in newer GCC, and provide
> the corresponding thunks. Provide assembler macros for invoking the thunks
> in the same way that GCC does, from native and inline assembler.
>
> This adds an X86_BUG_NO_RETPOLINE "feature" for runtime patching out
> of the thunks. This is a placeholder for now; the patches which support
> the new Intel/AMD microcode features will flesh out the precise conditions
> under which we disable the retpoline and do other things instead.
>
> [Andi Kleen: Rename the macros and add CONFIG_RETPOLINE option]
>
> Signed-off-by: David Woodhouse <[email protected]>
...
> +.macro THUNK sp reg
> + .section .text.__x86.indirect_thunk.\reg
> +
> +ENTRY(__x86.indirect_thunk.\reg)
> + CFI_STARTPROC
> + ALTERNATIVE "call 2f", __stringify(jmp *%\reg), X86_BUG_NO_RETPOLINE
> +1:
> + lfence
> + jmp 1b
> +2:
> + mov %\reg, (%\sp)
> + ret
> + CFI_ENDPROC
> +ENDPROC(__x86.indirect_thunk.\reg)

Clearly Paul's approach to retpoline without lfence is faster.
I'm guessing it wasn't shared with amazon/intel until now and
this set of patches is going to adopt it, right?

Paul, could you share a link to a set of alternative gcc patches
that do retpoline similar to llvm diff ?

2018-01-04 18:25:39

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support

On Thu, Jan 4, 2018 at 10:17 AM, Alexei Starovoitov
<[email protected]> wrote:
>
> Clearly Paul's approach to retpoline without lfence is faster.
> I'm guessing it wasn't shared with amazon/intel until now and
> this set of patches is going to adopt it, right?
>
> Paul, could you share a link to a set of alternative gcc patches
> that do retpoline similar to llvm diff ?

What is the alternative approach? Is it literally just doing a

call 1f
1: mov real_target,(%rsp)
ret

on the assumption that the "ret" will always just predict to that "1"
due to the call stack?

Linus

2018-01-04 18:36:06

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support

On Thu, Jan 04, 2018 at 10:25:35AM -0800, Linus Torvalds wrote:
> On Thu, Jan 4, 2018 at 10:17 AM, Alexei Starovoitov
> <[email protected]> wrote:
> >
> > Clearly Paul's approach to retpoline without lfence is faster.
> > I'm guessing it wasn't shared with amazon/intel until now and
> > this set of patches is going to adopt it, right?
> >
> > Paul, could you share a link to a set of alternative gcc patches
> > that do retpoline similar to llvm diff ?
>
> What is the alternative approach? Is it literally just doing a
>
> call 1f
> 1: mov real_target,(%rsp)
> ret
>
> on the assumption that the "ret" will always just predict to that "1"
> due to the call stack?

Pretty much.
Paul's writeup: https://support.google.com/faqs/answer/7625886
tldr: jmp *%r11 gets converted to:
call set_up_target;
capture_spec:
pause;
jmp capture_spec;
set_up_target:
mov %r11, (%rsp);
ret;
where capture_spec part will be looping speculatively.
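For concreteness, here is a rough sketch (in GAS syntax) of both converted
forms; the use of %r11 and the label names are illustrative assumptions. The
jmp form is exactly the tldr above; the call form jumps over the retpoline body
and then calls into it, so that the callee's eventual ret returns to the
original call site (essentially the shape that later appears as RETPOLINE_CALL
in David's series).

	/* jmp *%r11 becomes: */
	call	.Lset_up_target
.Lcapture_spec:			/* only ever executed speculatively */
	pause
	jmp	.Lcapture_spec
.Lset_up_target:
	mov	%r11, (%rsp)	/* overwrite the pushed return address with the real target */
	ret			/* architecturally: jump to *%r11 */

	/* call *%r11 becomes: */
	jmp	.Ldo_call
.Lretpoline:
	call	.Lfixup
.Lspec_trap:			/* speculative trap loop, as above */
	pause
	jmp	.Lspec_trap
.Lfixup:
	mov	%r11, (%rsp)
	ret
.Ldo_call:
	call	.Lretpoline	/* pushes the real return address, then jumps via the retpoline */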

2018-01-04 18:40:33

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support

> Clearly Paul's approach to retpoline without lfence is faster.
> I'm guessing it wasn't shared with amazon/intel until now and
> this set of patches is going to adopt it, right?
>
> Paul, could you share a link to a set of alternative gcc patches
> that do retpoline similar to llvm diff ?

I don't think it's a good idea to use any sequence not signed
off by CPU designers and extensively tested.

While another one may work for most tests, it could always fail in
some corner case.

Right now we have the more heavyweight one, and I would
suggest staying with that one for now. Then we can worry
about more optimizations later.

Correctness first.

-Andi

2018-01-04 19:28:05

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support

On Thu, 2018-01-04 at 10:36 -0800, Alexei Starovoitov wrote:
>
> Pretty much.
> Paul's writeup: https://support.google.com/faqs/answer/7625886
> tldr: jmp *%r11 gets converted to:
> call set_up_target;
> capture_spec:
>   pause;
>   jmp capture_spec;
> set_up_target:
>   mov %r11, (%rsp);
>   ret;
> where capture_spec part will be looping speculatively.

That is almost identical to what's in my latest patch set, except that
the capture_spec loop has 'lfence' instead of 'pause'.

As Andi says, I'd want to see explicit approval from the CPU architects
for making that change.

We've already had false starts there — for a long time, Intel thought
that a much simpler option with an lfence after the register load was
sufficient, and then eventually worked out that in some rare cases it
wasn't. While AMD still seem to think it *is* sufficient for them,
apparently.
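For reference, the simpler sequence being described (an lfence after the
register load, with no retpoline) would look roughly like the following
sketch; the use of %r11 as the target register is an illustrative assumption:

	/* %r11 already holds the branch target */
	lfence			/* serialize: intended to stop speculation past this point */
	jmp	*%r11		/* believed sufficient on AMD, but not in all cases on Intel */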



2018-01-04 22:06:05

by Justin Forbes

[permalink] [raw]
Subject: Re: [PATCH v3 11/13] retpoline/taint: Taint kernel for missing retpoline in compiler

On Thu, Jan 4, 2018 at 8:37 AM, David Woodhouse <[email protected]> wrote:
> From: Andi Kleen <[email protected]>
>
> When the kernel or a module hasn't been compiled with a retpoline
> aware compiler, print a warning and set a taint flag.
>
> For modules it is checked at compile time; however, it cannot
> check assembler or other non-compiled objects used in the module link.
>
> For lack of a better letter, it uses taint option 'Z'.
>

Is taint really the right thing to do here? Why not just do pr_info?

2018-01-05 10:28:31

by Paul Turner

[permalink] [raw]
Subject: Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support

On Thu, Jan 04, 2018 at 07:27:58PM +0000, David Woodhouse wrote:
> On Thu, 2018-01-04 at 10:36 -0800, Alexei Starovoitov wrote:
> >
> > Pretty much.
> > Paul's writeup: https://support.google.com/faqs/answer/7625886
> > tldr: jmp *%r11 gets converted to:
> > call set_up_target;
> > capture_spec:
> >   pause;
> >   jmp capture_spec;
> > set_up_target:
> >   mov %r11, (%rsp);
> >   ret;
> > where capture_spec part will be looping speculatively.
>
> That is almost identical to what's in my latest patch set, except that
> the capture_spec loop has 'lfence' instead of 'pause'.

When choosing this sequence I benchmarked several alternatives here, including
(nothing, nops, fences, and other serializing instructions such as cpuid).

The "pause; jmp" sequence proved minutely faster than "lfence;jmp" which is why
it was chosen.

"pause; jmp" 33.231 cycles/call 9.517 ns/call
"lfence; jmp" 33.354 cycles/call 9.552 ns/call

(Timings are for a complete retpolined indirect branch.)
>
> As Andi says, I'd want to see explicit approval from the CPU architects
> for making that change.

Beyond guaranteeing that speculative execution is constrained, the choice of
sequence here is a performance detail and not one of correctness.

>
> We've already had false starts there — for a long time, Intel thought
> that a much simpler option with an lfence after the register load was
> sufficient, and then eventually worked out that in some rare cases it
> wasn't. While AMD still seem to think it *is* sufficient for them,
> apparently.

As an interesting aside, the fact that speculation proceeds beyond an lfence can
be trivially demonstrated using the timings above. In fact, if we substitute
only "lfence" (with no jmp), we see:

29.573 cycles/call 8.469 ns/call

The only way for this timing to differ is if speculation beyond the lfence
executed differently.

That said, while this is a negative result, it does suggest that the jmp is
contributing a larger-than-realized cost to our speculative loop. We can likely
shave off some additional time with some unrolling. I tried this previously but
did not see results above the noise floor; it seems worth trying again, and I
will take a look tomorrow.


2018-01-05 10:32:56

by Paul Turner

[permalink] [raw]
Subject: Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support

On Thu, Jan 04, 2018 at 10:40:23AM -0800, Andi Kleen wrote:
> > Clearly Paul's approach to retpoline without lfence is faster.
> > I'm guessing it wasn't shared with amazon/intel until now and
> > this set of patches is going to adopt it, right?
> >
> > Paul, could you share a link to a set of alternative gcc patches
> > that do retpoline similar to llvm diff ?
>
> I don't think it's a good idea to use any sequence not signed
> off by CPU designers and extensively tested.

I can confirm that "pause; jmp" has been previously reviewed by your side.

That said, again, per the other email: once we have guaranteed that speculative
execution will reach this point and that it does something "safe", the choice of
sequence here is a performance detail rather than one of correctness.

>
> While another one may work for most tests, it could always fail in
> some corner case.
>
> Right now we have the more heavyweight one, and I would
> suggest staying with that one for now. Then we can worry
> about more optimizations later.
>
> Correctness first.
>
> -Andi

2018-01-05 10:41:01

by Paul Turner

[permalink] [raw]
Subject: Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support

On Thu, Jan 04, 2018 at 10:25:35AM -0800, Linus Torvalds wrote:
> On Thu, Jan 4, 2018 at 10:17 AM, Alexei Starovoitov
> <[email protected]> wrote:
> >
> > Clearly Paul's approach to retpoline without lfence is faster.

Using pause rather than lfence does not represent a fundamental difference here.

A protected indirect branch always adds ~25-30 cycles of overhead.

That this can be avoided in practice is a function of two key factors:
(1) Kernel code uses fewer indirect branches.
(2) The overhead can be avoided for hot indirect branches via devirtualization,
    e.g. the semantic equivalent of:

        if (ptr == foo)
                foo();
        else
                (*ptr)();

    allowing foo() to be called directly, even though it was provided as an
    indirect (a rough assembly sketch of this follows below).
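As a sketch of what (2) can look like at the assembly level, assuming a
hypothetical hot target foo, the function pointer in %rax, and the NOSPEC_CALL
macro from this series for the fallback:

	cmpq	$foo, %rax	/* is the target the expected hot function? */
	jne	1f
	call	foo		/* common case: direct, predictable call, no retpoline cost */
	jmp	2f
1:	NOSPEC_CALL rax		/* uncommon case: retpolined indirect call */
2: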

> > I'm guessing it wasn't shared with amazon/intel until now and
> > this set of patches is going to adopt it, right?
> >
> > Paul, could you share a link to a set of alternative gcc patches
> > that do retpoline similar to llvm diff ?
>
> What is the alternative approach? Is it literally just doing a
>
> call 1f
> 1: mov real_target,(%rsp)
> ret
>
> on the assumption that the "ret" will always just predict to that "1"
> due to the call stack?
>
> Linus

2018-01-05 10:49:51

by Paul Turner

[permalink] [raw]
Subject: Re: [RFC] Retpoline: Binary mitigation for branch-target-injection (aka "Spectre")

On Thu, Jan 04, 2018 at 08:18:57AM -0800, Andy Lutomirski wrote:
> On Thu, Jan 4, 2018 at 1:30 AM, Woodhouse, David <[email protected]> wrote:
> > On Thu, 2018-01-04 at 01:10 -0800, Paul Turner wrote:
> >> Apologies for the discombobulation around today's disclosure. Obviously the
> >> original goal was to communicate this a little more coherently, but the
> >> unscheduled advances in the disclosure disrupted the efforts to pull this
> >> together more cleanly.
> >>
> >> I wanted to open discussion the "retpoline" approach and and define its
> >> requirements so that we can separate the core
> >> details from questions regarding any particular implementation thereof.
> >>
> >> As a starting point, a full write-up describing the approach is available at:
> >> https://support.google.com/faqs/answer/7625886
> >
> > Note that (ab)using 'ret' in this way is incompatible with CET on
> > upcoming processors. HJ added a -mno-indirect-branch-register option to
> > the latest round of GCC patches, which puts the branch target in a
> > register instead of on the stack. My kernel patches (which I'm about to
> > reconcile with Andi's tweaks and post) do the same.
> >
> > That means that in the cases where at runtime we want to ALTERNATIVE
> > out the retpoline, it just turns back into a bare 'jmp *\reg'.
> >
> >
>
> I hate to say this, but I think Intel should postpone CET until the
> dust settles. Intel should also consider a hardware-protected stack
> that is only accessible with PUSH, POP, CALL, RET, and a new MOVSTACK
> instruction. That, by itself, would give considerable protection.
> But we still need JMP_NO_SPECULATE. Or, better yet, get the CPU to
> stop leaking data during speculative execution.

Echoing Andy's thoughts, but from a slightly different angle:

1) BTI is worse than the current classes of return attack. Given this,
considered as a binary choice, it's equivalent to the current state of the
world (e.g. no CET).
2) CET will not be "free". I suspect that in its initial revisions it will be
more valuable for protecting end users than enterprise workloads (the cost is
not observable for interactive workloads because there's tons of headroom in
the first place).

While the potential incompatibility is unfortunate, I'm not sure it makes a
significant difference to the adoption rate of CET.

2018-01-05 10:55:55

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support

On Fri, 2018-01-05 at 02:28 -0800, Paul Turner wrote:
> On Thu, Jan 04, 2018 at 07:27:58PM +0000, David Woodhouse wrote:
> > On Thu, 2018-01-04 at 10:36 -0800, Alexei Starovoitov wrote:
> > > 
> > > Pretty much.
> > > Paul's writeup: https://support.google.com/faqs/answer/7625886
> > > tldr: jmp *%r11 gets converted to:
> > > call set_up_target;
> > > capture_spec:
> > >   pause;
> > >   jmp capture_spec;
> > > set_up_target:
> > >   mov %r11, (%rsp);
> > >   ret;
> > > where capture_spec part will be looping speculatively.
> > 
> > That is almost identical to what's in my latest patch set, except that
> > the capture_spec loop has 'lfence' instead of 'pause'.
>
> When choosing this sequence I benchmarked several alternatives here, including
> (nothing, nops, fences, and other serializing instructions such as cpuid).
>
> The "pause; jmp" sequence proved minutely faster than "lfence;jmp" which is why
> it was chosen.
>
>   "pause; jmp" 33.231 cycles/call 9.517 ns/call
>   "lfence; jmp" 33.354 cycles/call 9.552 ns/call
>
> (Timings are for a complete retpolined indirect branch.)

Yeah, I studiously ignored you here and went with only what Intel had
*assured* me was correct and put into the GCC patches, rather than
chasing those 35 picoseconds ;)

The GCC patch set already had about four different variants over time,
with associated "oh shit, that one doesn't actually work; try this".
What we have in my patch set is precisely what GCC emits at the moment.

I'm all for optimising it further, but maybe not this week.

Other than that, is there any other development from your side that I
haven't captured in the latest (v4) series?
http://git.infradead.org/users/dwmw2/linux-retpoline.git/



2018-01-05 11:19:38

by Paul Turner

[permalink] [raw]
Subject: Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support

On Fri, Jan 05, 2018 at 10:55:38AM +0000, David Woodhouse wrote:
> On Fri, 2018-01-05 at 02:28 -0800, Paul Turner wrote:
> > On Thu, Jan 04, 2018 at 07:27:58PM +0000, David Woodhouse wrote:
> > > On Thu, 2018-01-04 at 10:36 -0800, Alexei Starovoitov wrote:
> > > >
> > > > Pretty much.
> > > > Paul's writeup: https://support.google.com/faqs/answer/7625886
> > > > tldr: jmp *%r11 gets converted to:
> > > > call set_up_target;
> > > > capture_spec:
> > > >   pause;
> > > >   jmp capture_spec;
> > > > set_up_target:
> > > >   mov %r11, (%rsp);
> > > >   ret;
> > > > where capture_spec part will be looping speculatively.
> > >
> > > That is almost identical to what's in my latest patch set, except that
> > > the capture_spec loop has 'lfence' instead of 'pause'.
> >
> > When choosing this sequence I benchmarked several alternatives here, including
> > (nothing, nops, fences, and other serializing instructions such as cpuid).
> >
> > The "pause; jmp" sequence proved minutely faster than "lfence;jmp" which is why
> > it was chosen.
> >
> > ? "pause; jmp" 33.231 cycles/call 9.517 ns/call
> > ? "lfence; jmp" 33.354 cycles/call 9.552 ns/call
> >
> > (Timings are for a complete retpolined indirect branch.)
>
> Yeah, I studiously ignored you here and went with only what Intel had
> *assured* me was correct and put into the GCC patches, rather than
> chasing those 35 picoseconds ;)
>
> The GCC patch set already had about four different variants over time,
> with associated "oh shit, that one doesn't actually work; try this".
> What we have in my patch set is precisely what GCC emits at the moment.

While I can't speak to the various iterations of the gcc patches, I can confirm
that the details originally provided were reviewed by Intel.

>
> I'm all for optimising it further, but maybe not this week.
>
> Other than that, is there any other development from your side that I
> haven't captured in the latest (v4) series?
> http://git.infradead.org/users/dwmw2/linux-retpoline.git/


2018-01-05 11:25:15

by Paul Turner

[permalink] [raw]
Subject: Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support

On Fri, Jan 05, 2018 at 10:55:38AM +0000, David Woodhouse wrote:
> On Fri, 2018-01-05 at 02:28 -0800, Paul Turner wrote:
> > On Thu, Jan 04, 2018 at 07:27:58PM +0000, David Woodhouse wrote:
> > > On Thu, 2018-01-04 at 10:36 -0800, Alexei Starovoitov wrote:
> > > >
> > > > Pretty much.
> > > > Paul's writeup: https://support.google.com/faqs/answer/7625886
> > > > tldr: jmp *%r11 gets converted to:
> > > > call set_up_target;
> > > > capture_spec:
> > > >   pause;
> > > >   jmp capture_spec;
> > > > set_up_target:
> > > >   mov %r11, (%rsp);
> > > >   ret;
> > > > where capture_spec part will be looping speculatively.
> > >
> > > That is almost identical to what's in my latest patch set, except that
> > > the capture_spec loop has 'lfence' instead of 'pause'.
> >
> > When choosing this sequence I benchmarked several alternatives here, including
> > (nothing, nops, fences, and other serializing instructions such as cpuid).
> >
> > The "pause; jmp" sequence proved minutely faster than "lfence;jmp" which is why
> > it was chosen.
> >
> > ? "pause; jmp" 33.231 cycles/call 9.517 ns/call
> > ? "lfence; jmp" 33.354 cycles/call 9.552 ns/call
> >
> > (Timings are for a complete retpolined indirect branch.)
>
> Yeah, I studiously ignored you here and went with only what Intel had
> *assured* me was correct and put into the GCC patches, rather than
> chasing those 35 picoseconds ;)

It's also notable here that while the difference is small in terms of absolute
values, it's likely due to reduced variation:

I would expect:
- pause to be extremely consistent in its timings
- pause and lfence to be close on their average timings, particularly in a
micro-benchmark.

This suggests that the difference may be larger in the occasional cases where
you get "unlucky" and hit some other uarch interaction in the lfence path.
>
> The GCC patch set already had about four different variants over time,
> with associated "oh shit, that one doesn't actually work; try this".
> What we have in my patch set is precisely what GCC emits at the moment.
>
> I'm all for optimising it further, but maybe not this week.
>
> Other than that, is there any other development from your side that I
> haven't captured in the latest (v4) series?
> http://git.infradead.org/users/dwmw2/linux-retpoline.git/


2018-01-05 11:26:34

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support

On 05/01/2018 11:28, Paul Turner wrote:
>
> The "pause; jmp" sequence proved minutely faster than "lfence;jmp" which is why
> it was chosen.
>
> "pause; jmp" 33.231 cycles/call 9.517 ns/call
> "lfence; jmp" 33.354 cycles/call 9.552 ns/call

Do you have timings for a non-retpolined indirect branch with the
predictor suppressed via IBRS=1? So at least we can compute the break
even point.

Paolo

2018-01-05 12:20:36

by Paul Turner

[permalink] [raw]
Subject: Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support

On Fri, Jan 5, 2018 at 3:26 AM, Paolo Bonzini <[email protected]> wrote:
> On 05/01/2018 11:28, Paul Turner wrote:
>>
>> The "pause; jmp" sequence proved minutely faster than "lfence;jmp" which is why
>> it was chosen.
>>
>> "pause; jmp" 33.231 cycles/call 9.517 ns/call
>> "lfence; jmp" 33.354 cycles/call 9.552 ns/call
>
> Do you have timings for a non-retpolined indirect branch with the
> predictor suppressed via IBRS=1? So at least we can compute the break
> even point.

The data I collected here previously had the run-time cost as a wash:
on Skylake, an indirect branch with IBRS=1 and a retpolined indirect branch
had costs within a few cycles of each other.

The costs to consider when making a choice here are:

- The transition overheads, i.e. how frequently you will be switching in
and out of protected code (IBRS needs to be enabled and disabled at these
boundaries).
- The frequency at which you will be executing protected code on one
sibling and unprotected code on another (enabling IBRS may affect sibling
execution, depending on SKU).
- The implementation cost (retpoline requires auditing/rebuilding your
target, while IBRS can be used out of the box).


>
> Paolo

2018-01-05 12:54:50

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support

On Thu, 4 Jan 2018, David Woodhouse wrote:
> diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
> index 07cdd1715705..900fa7016d3f 100644
> --- a/arch/x86/include/asm/cpufeatures.h
> +++ b/arch/x86/include/asm/cpufeatures.h
> @@ -342,5 +342,6 @@
> #define X86_BUG_MONITOR X86_BUG(12) /* IPI required to wake up remote CPU */
> #define X86_BUG_AMD_E400 X86_BUG(13) /* CPU is among the affected by Erratum 400 */
> #define X86_BUG_CPU_INSECURE X86_BUG(14) /* CPU is insecure and needs kernel page table isolation */
> +#define X86_BUG_NO_RETPOLINE X86_BUG(15) /* Placeholder: disable retpoline branch thunks */

I think this is the wrong approach. We have X86_BUG_CPU_INSECURE, which now
should be renamed to X86_BUG_CPU_MELTDOWN_V3 or something like that. It
tells the kernel, that the CPU is affected by variant 3.

If the kernel detects that and has PTI support then it sets the 'pti'
feature bit which tells that the mitigation is in place.

So what we really want is

X86_BUG_MELTDOWN_V1/2/3

which get set when the CPU is affected by a particular variant and then
have feature flags

X86_FEATURE_RETPOLINE
X86_FEATURE_IBRS
X86_FEATURE_NOSPEC

or whatever it takes to signal that a mitigation is in place. Then we make
all actions depend on those feature flags, very much in the way we do for
FEATURE_PTI.

If CPUs come along which are not affected by a particular variant the BUG
flag does not get set.

Thanks,

tglx

2018-01-05 13:01:08

by Juergen Gross

[permalink] [raw]
Subject: Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support

On 05/01/18 13:54, Thomas Gleixner wrote:
> On Thu, 4 Jan 2018, David Woodhouse wrote:
>> diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
>> index 07cdd1715705..900fa7016d3f 100644
>> --- a/arch/x86/include/asm/cpufeatures.h
>> +++ b/arch/x86/include/asm/cpufeatures.h
>> @@ -342,5 +342,6 @@
>> #define X86_BUG_MONITOR X86_BUG(12) /* IPI required to wake up remote CPU */
>> #define X86_BUG_AMD_E400 X86_BUG(13) /* CPU is among the affected by Erratum 400 */
>> #define X86_BUG_CPU_INSECURE X86_BUG(14) /* CPU is insecure and needs kernel page table isolation */
>> +#define X86_BUG_NO_RETPOLINE X86_BUG(15) /* Placeholder: disable retpoline branch thunks */
>
> I think this is the wrong approach. We have X86_BUG_CPU_INSECURE, which now
> should be renamed to X86_BUG_CPU_MELTDOWN_V3 or something like that. It
> tells the kernel, that the CPU is affected by variant 3.

MELTDOWN is variant 3.

>
> If the kernel detects that and has PTI support then it sets the 'pti'
> feature bit which tells that the mitigation is in place.
>
> So what we really want is
>
> X86_BUG_MELTDOWN_V1/2/3

X86_BUG_MELTDOWN, X86_BUG_SPECTRE_V1, X86_BUG_SPECTRE_V2


Juergen

2018-01-05 13:04:09

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support

On Fri, 5 Jan 2018, Juergen Gross wrote:
> On 05/01/18 13:54, Thomas Gleixner wrote:
> > On Thu, 4 Jan 2018, David Woodhouse wrote:
> >> diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
> >> index 07cdd1715705..900fa7016d3f 100644
> >> --- a/arch/x86/include/asm/cpufeatures.h
> >> +++ b/arch/x86/include/asm/cpufeatures.h
> >> @@ -342,5 +342,6 @@
> >> #define X86_BUG_MONITOR X86_BUG(12) /* IPI required to wake up remote CPU */
> >> #define X86_BUG_AMD_E400 X86_BUG(13) /* CPU is among the affected by Erratum 400 */
> >> #define X86_BUG_CPU_INSECURE X86_BUG(14) /* CPU is insecure and needs kernel page table isolation */
> >> +#define X86_BUG_NO_RETPOLINE X86_BUG(15) /* Placeholder: disable retpoline branch thunks */
> >
> > I think this is the wrong approach. We have X86_BUG_CPU_INSECURE, which now
> > should be renamed to X86_BUG_CPU_MELTDOWN_V3 or something like that. It
> > tells the kernel, that the CPU is affected by variant 3.
>
> MELTDOWN is variant 3.
>
> >
> > If the kernel detects that and has PTI support then it sets the 'pti'
> > feature bit which tells that the mitigation is in place.
> >
> > So what we really want is
> >
> > X86_BUG_MELTDOWN_V1/2/3
>
> X86_BUG_MELTDOWN, X86_BUG_SPECTRE_V1, X86_BUG_SPECTRE_V2

Right. I'm confused as always :)

Subject: [tip:x86/pti] x86/alternatives: Add missing '\n' at end of ALTERNATIVE inline asm

Commit-ID: b9e705ef7cfaf22db0daab91ad3cd33b0fa32eb9
Gitweb: https://git.kernel.org/tip/b9e705ef7cfaf22db0daab91ad3cd33b0fa32eb9
Author: David Woodhouse <[email protected]>
AuthorDate: Thu, 4 Jan 2018 14:37:05 +0000
Committer: Thomas Gleixner <[email protected]>
CommitDate: Fri, 5 Jan 2018 14:01:15 +0100

x86/alternatives: Add missing '\n' at end of ALTERNATIVE inline asm

Where an ALTERNATIVE is used in the middle of an inline asm block, this
would otherwise lead to the following instruction being appended directly
to the trailing ".popsection", and a failed compile.

Fixes: 9cebed423c84 ("x86, alternative: Use .pushsection/.popsection")
Signed-off-by: David Woodhouse <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: Rik van Riel <[email protected]>
Cc: [email protected]
Cc: Tim Chen <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Jiri Kosina <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: [email protected]
Link: https://lkml.kernel.org/r/[email protected]
---
arch/x86/include/asm/alternative.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/alternative.h b/arch/x86/include/asm/alternative.h
index dbfd085..cf5961c 100644
--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -140,7 +140,7 @@ static inline int alternatives_text_reserved(void *start, void *end)
".popsection\n" \
".pushsection .altinstr_replacement, \"ax\"\n" \
ALTINSTR_REPLACEMENT(newinstr, feature, 1) \
- ".popsection"
+ ".popsection\n"

#define ALTERNATIVE_2(oldinstr, newinstr1, feature1, newinstr2, feature2)\
OLDINSTR_2(oldinstr, 1, 2) \
@@ -151,7 +151,7 @@ static inline int alternatives_text_reserved(void *start, void *end)
".pushsection .altinstr_replacement, \"ax\"\n" \
ALTINSTR_REPLACEMENT(newinstr1, feature1, 1) \
ALTINSTR_REPLACEMENT(newinstr2, feature2, 2) \
- ".popsection"
+ ".popsection\n"

/*
* Alternative instructions for different CPU types or capabilities.

2018-01-05 14:32:39

by Woodhouse, David

[permalink] [raw]
Subject: Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support

On Fri, 2018-01-05 at 13:54 +0100, Thomas Gleixner wrote:
> On Thu, 4 Jan 2018, David Woodhouse wrote:
> > diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
> > index 07cdd1715705..900fa7016d3f 100644
> > --- a/arch/x86/include/asm/cpufeatures.h
> > +++ b/arch/x86/include/asm/cpufeatures.h
> > @@ -342,5 +342,6 @@
> >  #define X86_BUG_MONITOR                      X86_BUG(12) /* IPI required to wake up remote CPU */
> >  #define X86_BUG_AMD_E400             X86_BUG(13) /* CPU is among the affected by Erratum 400 */
> >  #define X86_BUG_CPU_INSECURE         X86_BUG(14) /* CPU is insecure and needs kernel page table isolation */
> > +#define X86_BUG_NO_RETPOLINE         X86_BUG(15) /* Placeholder: disable retpoline branch thunks */
>
> I think this is the wrong approach. We have X86_BUG_CPU_INSECURE, which now
> should be renamed to X86_BUG_CPU_MELTDOWN_V3 or something like that. It
> tells the kernel, that the CPU is affected by variant 3.

As it says, it's a placeholder. The actual conditions depend on whether
we decide to use IBRS or not, which also depends on whether we find
IBRS_ATT or whether we're on Skylake+.

The IBRS patch series should be updating it.

> So what we really want is
>
>    X86_BUG_MELTDOWN_V1/2/3
>
> which get set when the CPU is affected by a particular variant and then
> have feature flags
>
>    X86_FEATURE_RETPOLINE
>    X86_FEATURE_IBRS
>    X86_FEATURE_NOSPEC

At some point during this whole painful mess, I had come to the
conclusion that having relocations in altinstr didn't work, and that's
why I had X86_xx_NO_RETPOLINE instead of X86_xx_RETPOLINE. I now think
that something else was wrong when I was testing that, and relocs in
altinstr do work. So sure, X86_FEATURE_RETPOLINE ought to work. I can
change that round, and it's simpler for the IBRS patch set to take it
into account and set it when appropriate.



2018-01-05 16:42:31

by Woodhouse, David

[permalink] [raw]
Subject: Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support

On Fri, 2018-01-05 at 13:56 +0000, Woodhouse, David wrote:
>
> At some point during this whole painful mess, I had come to the
> conclusion that having relocations in altinstr didn't work, and that's
> why I had X86_xx_NO_RETPOLINE instead of X86_xx_RETPOLINE. I now think
> that something else was wrong when I was testing that, and relocs in
> altinstr do work. So sure, X86_FEATURE_RETPOLINE ought to work. I can
> change that round, and it's simpler for the IBRS patch set to take it
> into account and set it when appropriate.

+bpetkov

Nope, alternatives are broken. Only a jmp as the *first* opcode of
altinstr gets handled by recompute_jump(), while any subsequent insn is
just copied untouched.

To fix that and handle every instruction, the alternative code would
need to know about instruction lengths. I think we need to stick with
the inverted X86_FEATURE_NO_RETPOLINE flag for the moment, and not tie
it to a complex bugfix there.



2018-01-05 16:45:15

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support

On Fri, Jan 05, 2018 at 04:41:46PM +0000, Woodhouse, David wrote:
> Nope, alternatives are broken. Only a jmp as the *first* opcode of
> altinstr gets handled by recompute_jump(), while any subsequent insn is
> just copied untouched.

Not broken - simply no one needed it until now. I'm looking into it.
Looks like the insn decoder might come in handy finally.

:-)

--
Regards/Gruss,
Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
--

2018-01-05 17:08:16

by Josh Poimboeuf

[permalink] [raw]
Subject: Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support

On Fri, Jan 05, 2018 at 05:45:06PM +0100, Borislav Petkov wrote:
> On Fri, Jan 05, 2018 at 04:41:46PM +0000, Woodhouse, David wrote:
> > Nope, alternatives are broken. Only a jmp as the *first* opcode of
> > altinstr gets handled by recompute_jump(), while any subsequent insn is
> > just copied untouched.
>
> Not broken - simply no one needed it until now. I'm looking into it.
> Looks like the insn decoder might come in handy finally.
>
> :-)

I seem to recall that we also discussed the need for this for converting
pvops to use alternatives, though the "why" is eluding me at the moment.

--
Josh

2018-01-05 17:23:31

by Woodhouse, David

[permalink] [raw]
Subject: Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support

On Fri, 2018-01-05 at 17:45 +0100, Borislav Petkov wrote:
> On Fri, Jan 05, 2018 at 04:41:46PM +0000, Woodhouse, David wrote:
> > Nope, alternatives are broken. Only a jmp as the *first* opcode of
> > altinstr gets handled by recompute_jump(), while any subsequent insn is
> > just copied untouched.
>
> Not broken - simply no one needed it until now. I'm looking into it.
> Looks like the insn decoder might come in handy finally.

I typed 'jmp __x86.indirect_thunk' and it actually jumped to an address
which I believe is (__x86.indirect_thunk + &altinstr - &oldinstr).
Which made me sad, and took a while to debug.

Let's call it "not working for this use case", and I'll stick with
X86_FEATURE_NO_RETPOLINE for now, so the jump can go in the oldinstr
case not in altinstr.



2018-01-05 17:28:55

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support

On Fri, Jan 5, 2018 at 9:12 AM, Woodhouse, David <[email protected]> wrote:
>
> I typed 'jmp __x86.indirect_thunk' and it actually jumped to an address
> which I believe is (__x86.indirect_thunk + &altinstr - &oldinstr).
> Which made me sad, and took a while to debug.

Yes, I would suggest against expecting altinstructions to have
relocation information. They are generated in a different place, so..

That said, I honestly like the inline version (the one that is in the
google paper first) of the retpoline more than the out-of-line one.
And that one shouldn't have any relocation issues, because all the
offsets are relative.

We want to use that one for the entry stub anyway, can't we just
standardize on that one for all our assembly?

If the *compiler* uses the out-of-line version, that's a separate
thing. But for our asm cases, let's just make it all be the inline
case, ok?

It also should simplify the whole target generation. None of this
silly "__x86.indirect_thunk.\reg" crap with different targets for
different register choices.

Hmm?

Linus

2018-01-05 17:48:49

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support

On Fri, 2018-01-05 at 09:28 -0800, Linus Torvalds wrote:
>
> Yes, I would suggest against expecting altinstructions to have
> relocation information. They are generated in a different place, so..
>
> That said, I honestly like the inline version (the one that is in the
> google paper first) of the retpoline more than the out-of-line one.
> And that one shouldn't have any relocation issues, because all the
> offsets are relative.
>
> We want to use that one for the entry stub anyway, can't we just
> standardize on that one for all our assembly?

Sure, that's just a case of tweaking the macros a little to do it
inline instead of jumping to the thunk. I think judicious use of
__stringify() has resolved any issues I might have had with that
approach in the first place. However...

> If the *compiler* uses the out-of-line version, that's a separate
> thing. But for our asm cases, let's just make it all be the inline
> case, ok?
>
> It also should simplify the whole target generation. None of this
> silly "__x86.indirect_thunk.\reg" crap with different targets for
> different register choices.

If we're going to let the compiler use the out-of-line version (which
we *want* to because otherwise we don't get to ALTERNATIVE it away as
appropriate), then we still need to emit the various
__x86.indirect_thunk.\reg thunks for the compiler to use anyway.

At that point, there isn't a *huge* benefit to doing the retpoline
inline in our own asm, when it might as well just be a jump to the
thunks which exist anyway. But neither do I have an argument *against*
doing so, so I'll happily tweak the NOSPEC_CALL/NOSPEC_JMP macros
accordingly if you really want...



2018-01-05 18:05:40

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support

> If the *compiler* uses the out-of-line version, that's a separate
> thing. But for our asm cases, let's just make it all be the inline
> case, ok?

Should be a simple change.

>
> It also should simplify the whole target generation. None of this
> silly "__x86.indirect_thunk.\reg" crap with different targets for
> different register choices.
>
> Hmm?

We need the different thunks anyway for the compiler-generated code,
so they won't go away.

-Andi

2018-01-05 20:33:41

by Woodhouse, David

[permalink] [raw]
Subject: Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support

On Fri, 2018-01-05 at 09:28 -0800, Linus Torvalds wrote:
>
> Yes, I would suggest against expecting altinstructions to have
> relocation information. They are generated in a different place, so..
>
> That said, I honestly like the inline version (the one that is in the
> google paper first) of the retpoline more than the out-of-line one.
> And that one shouldn't have any relocation issues, because all the
> offsets are relative.

Note that the *only* issue with the relocation is that it pushes me to
use X86_FEATURE_NO_RETPOLINE for my feature instead of
X86_FEATURE_RETPOLINE as might be natural. And actually there's a
motivation to do that anyway, because of the way three-way alternatives
interact.

With the existing negative flag I can do 

 ALTERNATIVE_2(retpoline, K8: lfence+jmp; NO_RETPOLINE: jmp)

But if I invert it, I think I need two feature flags to get the same functionality — X86_FEATURE_RETPOLINE and X86_FEATURE_RETPOLINE_AMD:

 ALTERNATIVE_2(jmp, RETPOLINE: retpoline, RETPOLINE_AMD: lfence+jmp)

So I was completely prepared to live with the slightly unnatural
inverse logic of the feature flag. But since you asked...

> We want to use that one for the entry stub anyway, can't we just
> standardize on that one for all our assembly?
>
> If the *compiler* uses the out-of-line version, that's a separate
> thing. But for our asm cases, let's just make it all be the inline
> case, ok?

OK.... it starts off looking a bit like this. You're right; with the
caveats above it will let me invert the logic to X86_FEATURE_RETPOLINE
because the alternatives mechanism no longer needs to adjust any part
of the retpoline code path when it's in 'altinstr'.

And it does let me use a simple NOSPEC_JMP in the entry trampoline
instead of open-coding it again, which is nice.

But the first pass of it, below, is fugly as hell. I'll take another
look at *using* the ALTERNATIVE_2 macro instead of reimplementing it
for NOSPEC_CALL, but I strongly suspect that's just going to land me
with a fairly unreadable __stringify(jmp;call;lfence;jmp;call;mov;ret)
monstrosity all on a single line. Assembler macros are... brittle.

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 76f94bbacaec..8f7e1129f493 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -188,17 +188,7 @@ ENTRY(entry_SYSCALL_64_trampoline)
   */
  pushq %rdi
  movq $entry_SYSCALL_64_stage2, %rdi
- /*
-  * Open-code the retpoline from retpoline.S, because we can't
-  * just jump to it directly.
-  */
- ALTERNATIVE "call 2f", "jmp *%rdi", X86_FEATURE_NO_RETPOLINE
-1:
- lfence
- jmp 1b
-2:
- mov %rdi, (%rsp)
- ret
+ NOSPEC_JMP rdi
 END(entry_SYSCALL_64_trampoline)
 
  .popsection
diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
index eced0dfaddc9..1c8312ff186a 100644
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -14,15 +14,54 @@
  */
 .macro NOSPEC_JMP reg:req
 #ifdef CONFIG_RETPOLINE
- ALTERNATIVE __stringify(jmp __x86.indirect_thunk.\reg), __stringify(jmp *%\reg), X86_FEATURE_NO_RETPOLINE
+ ALTERNATIVE_2 "call 1112f", __stringify(lfence;jmp *%\reg), X86_FEATURE_K8, __stringify(jmp *%\reg), X86_FEATURE_NO_RETPOLINE
+1111:
+ lfence
+ jmp 1111b
+1112:
+ mov %\reg, (%_ASM_SP)
+ ret
 #else
- jmp *%\reg
+ jmp *%\reg
 #endif
 .endm
 
+/*
+ * Even __stringify() on the arguments doesn't really make it nice to use
+ * the existing ALTERNATIVE_2 macro here. So open-code our own version...
+ */
 .macro NOSPEC_CALL reg:req
 #ifdef CONFIG_RETPOLINE
- ALTERNATIVE __stringify(call __x86.indirect_thunk.\reg), __stringify(call *%\reg), X86_FEATURE_NO_RETPOLINE
+140:
+ jmp 1113f
+1110:
+ call 1112f
+1111:
+ lfence
+ jmp 1111b
+1112:
+ mov %\reg, (%_ASM_SP)
+ ret
+1113:
+ call 1110b
+141:
+ .skip -((alt_max_short(new_len1, new_len2) - (old_len)) > 0) * \
+ (alt_max_short(new_len1, new_len2) - (old_len)),0x90
+142:
+
+ .pushsection .altinstructions,"a"
+ altinstruction_entry 140b,143f,X86_FEATURE_K8,142b-140b,144f-143f,142b-141b
+ altinstruction_entry 140b,144f,X86_FEATURE_NO_RETPOLINE,142b-140b,145f-144f,142b-141b
+ .popsection
+
+ .pushsection .altinstr_replacement,"ax"
+143:
+ lfence
+ call *%\reg
+144:
+ call *%\reg
+145:
+ .popsection
 #else
  call *%\reg
 #endif
diff --git a/arch/x86/lib/retpoline.S b/arch/x86/lib/retpoline.S
index 2a4b1f09eb84..5c15e4307da5 100644
--- a/arch/x86/lib/retpoline.S
+++ b/arch/x86/lib/retpoline.S
@@ -6,19 +6,14 @@
 #include <asm/cpufeatures.h>
 #include <asm/alternative-asm.h>
 #include <asm/export.h>
+#include <asm/nospec-branch.h>
 
 .macro THUNK sp reg
  .section .text.__x86.indirect_thunk.\reg
 
 ENTRY(__x86.indirect_thunk.\reg)
  CFI_STARTPROC
- ALTERNATIVE_2 "call 2f", __stringify(lfence;jmp *%\reg), X86_FEATURE_K8, __stringify(jmp *%\reg), X86_FEATURE_NO_RETPOLINE
-1:
- lfence
- jmp 1b
-2:
- mov %\reg, (%\sp)
- ret
+ NOSPEC_JMP \reg
  CFI_ENDPROC
 ENDPROC(__x86.indirect_thunk.\reg)
 EXPORT_SYMBOL(__x86.indirect_thunk.\reg)



2018-01-05 21:11:10

by Brian Gerst

[permalink] [raw]
Subject: Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support

On Fri, Jan 5, 2018 at 3:32 PM, Woodhouse, David <[email protected]> wrote:
> On Fri, 2018-01-05 at 09:28 -0800, Linus Torvalds wrote:
>>
>> Yes, I would suggest against expecting altinstructions to have
>> relocation information. They are generated in a different place, so..
>>
>> That said, I honestly like the inline version (the one that is in the
>> google paper first) of the retpoline more than the out-of-line one.
>> And that one shouldn't have any relocation issues, because all the
>> offsets are relative.
>
> Note that the *only* issue with the relocation is that it pushes me to
> use X86_FEATURE_NO_RETPOLINE for my feature instead of
> X86_FEATURE_RETPOLINE as might be natural. And actually there's a
> motivation to do that anyway, because of the way three-way alternatives
> interact.
>
> With the existing negative flag I can do
>
> ALTERNATIVE_2(retpoline, K8: lfence+jmp; NO_RETPOLINE: jmp)
>
> But if I invert it, I think I need two feature flags to get the same functionality — X86_FEATURE_RETPOLINE and X86_FEATURE_RETPOLINE_AMD:
>
> ALTERNATIVE_2(jmp, RETPOLINE: retpoline, RETPOLINE_AMD: lfence+jmp)

Another way to do it is with two consecutive alternatives:

ALTERNATIVE(NOP, K8: lfence)
ALTERNATIVE(jmp indirect, RETPOLINE: jmp thunk)

This also avoids the issue with the relocation of the jmp target when
the replacement is more than one instruction.
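Rendered in the style of the NOSPEC_JMP macro from David's series, that
suggestion might look roughly like the sketch below (a hypothetical variant,
not an actual patch; the macro and flag names follow the existing series). The
jmp to the thunk is the first and only instruction of its altinstr, so
recompute_jump() can fix up its target:

.macro NOSPEC_JMP_STACKED reg:req
	/* default: nops; AMD (K8 and later): lfence */
	ALTERNATIVE "", "lfence", X86_FEATURE_K8
	/* default: plain indirect jmp; retpoline: jmp to the out-of-line thunk */
	ALTERNATIVE __stringify(jmp *%\reg), \
		    __stringify(jmp __x86.indirect_thunk.\reg), X86_FEATURE_RETPOLINE
.endm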

--
Brian Gerst

2018-01-05 22:02:01

by Woodhouse, David

[permalink] [raw]
Subject: Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support

On Fri, 2018-01-05 at 09:28 -0800, Linus Torvalds wrote:
> That said, I honestly like the inline version (the one that is in the
> google paper first) of the retpoline more than the out-of-line one.
> And that one shouldn't have any relocation issues, because all the
> offsets are relative.
>
> We want to use that one for the entry stub anyway, can't we just
> standardize on that one for all our assembly?
>
> If the *compiler* uses the out-of-line version, that's a separate
> thing. But for our asm cases, let's just make it all be the inline
> case, ok?

OK, this one looks saner, and I think I've tested all the 32/64 bit
retpoline/amd/none permutations. Pushed to
http://git.infradead.org/users/dwmw2/linux-retpoline.git/

If this matches what you were thinking, I'll refactor the series in the
morning to do it this way.

From cfddb3bcae1524da52e782398da2809ec8faa200 Mon Sep 17 00:00:00 2001
From: David Woodhouse <[email protected]>
Date: Fri, 5 Jan 2018 21:50:41 +0000
Subject: [PATCH] x86/retpoline: Clean up inverted X86_FEATURE_NO_RETPOLINE
 logic

If we make the thunks inline so they don't need relocations to branch to
__x86.indirect_thunk.xxx then we don't fall foul of the alternatives
handling, and we can have the retpoline variant in 'altinstr'. This
means that we can use X86_FEATURE_RETPOLINE which is more natural.

Unfortunately, it does mean that the X86_FEATURE_K8 trick doesn't work
any more, so we need an additional X86_FEATURE_RETPOLINE_AMD for that.

Signed-off-by: David Woodhouse <[email protected]>
---
 arch/x86/entry/entry_64.S            | 12 +--------
 arch/x86/include/asm/cpufeatures.h   |  3 ++-
 arch/x86/include/asm/nospec-branch.h | 50 +++++++++++++++++++++++++++---------
 arch/x86/kernel/cpu/common.c         |  5 ++++
 arch/x86/kernel/cpu/intel.c          |  3 ++-
 arch/x86/lib/retpoline.S             | 17 +++++-------
 6 files changed, 54 insertions(+), 36 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 76f94bbacaec..8f7e1129f493 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -188,17 +188,7 @@ ENTRY(entry_SYSCALL_64_trampoline)
   */
  pushq %rdi
  movq $entry_SYSCALL_64_stage2, %rdi
- /*
-  * Open-code the retpoline from retpoline.S, because we can't
-  * just jump to it directly.
-  */
- ALTERNATIVE "call 2f", "jmp *%rdi", X86_FEATURE_NO_RETPOLINE
-1:
- lfence
- jmp 1b
-2:
- mov %rdi, (%rsp)
- ret
+ NOSPEC_JMP rdi
 END(entry_SYSCALL_64_trampoline)
 
  .popsection
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 2d916fd13bf9..6f10edabbf82 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -203,7 +203,8 @@
 #define X86_FEATURE_PROC_FEEDBACK ( 7*32+ 9) /* AMD ProcFeedbackInterface */
 #define X86_FEATURE_SME ( 7*32+10) /* AMD Secure Memory Encryption */
 #define X86_FEATURE_PTI ( 7*32+11) /* Kernel Page Table Isolation enabled */
-#define X86_FEATURE_NO_RETPOLINE ( 7*32+12) /* Retpoline mitigation for Spectre variant 2 */
+#define X86_FEATURE_RETPOLINE ( 7*32+12) /* Intel Retpoline mitigation for Spectre variant 2 */
+#define X86_FEATURE_RETPOLINE_AMD ( 7*32+13) /* AMD Retpoline mitigation for Spectre variant 2 */
 #define X86_FEATURE_INTEL_PPIN ( 7*32+14) /* Intel Processor Inventory Number */
 #define X86_FEATURE_INTEL_PT ( 7*32+15) /* Intel Processor Trace */
 #define X86_FEATURE_AVX512_4VNNIW ( 7*32+16) /* AVX-512 Neural Network Instructions */
diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
index eced0dfaddc9..6e92edf64e53 100644
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -8,23 +8,44 @@
 #include <asm/cpufeatures.h>
 
 #ifdef __ASSEMBLY__
+
 /*
- * The asm code uses CONFIG_RETPOLINE; this part will happen even if
- * the toolchain isn't retpoline-capable.
+ * These are the bare retpoline primitives for indirect jmp and call.
+ * Do not use these directly.
  */
+.macro RETPOLINE_JMP reg:req
+ call 1112f
+1111: lfence
+ jmp 1111b
+1112: mov %\reg, (%_ASM_SP)
+ ret
+.endm
+
+.macro RETPOLINE_CALL reg:req
+ jmp 1113f
+1110: RETPOLINE_JMP \reg
+1113: call 1110b
+.endm
+
+
+
 .macro NOSPEC_JMP reg:req
 #ifdef CONFIG_RETPOLINE
- ALTERNATIVE __stringify(jmp __x86.indirect_thunk.\reg), __stringify(jmp *%\reg), X86_FEATURE_NO_RETPOLINE
+ ALTERNATIVE_2 __stringify(jmp *%\reg), \
+ __stringify(RETPOLINE_JMP \reg), X86_FEATURE_RETPOLINE, \
+ __stringify(lfence; jmp *%\reg), X86_FEATURE_RETPOLINE_AMD
 #else
- jmp *%\reg
+ jmp *%\reg
 #endif
 .endm
 
 .macro NOSPEC_CALL reg:req
 #ifdef CONFIG_RETPOLINE
- ALTERNATIVE __stringify(call __x86.indirect_thunk.\reg), __stringify(call *%\reg), X86_FEATURE_NO_RETPOLINE
+ ALTERNATIVE_2 __stringify(call *%\reg), \
+ __stringify(RETPOLINE_CALL \reg), X86_FEATURE_RETPOLINE,\
+ __stringify(lfence; call *%\reg), X86_FEATURE_RETPOLINE_AMD
 #else
- call *%\reg
+ call *%\reg
 #endif
 .endm
 
@@ -36,16 +57,21 @@
  */
 #if defined(CONFIG_X86_64) && defined(RETPOLINE)
 #  define NOSPEC_CALL ALTERNATIVE( \
+ "call *%[thunk_target]\n", \
  "call __x86.indirect_thunk.%V[thunk_target]\n", \
- "call *%[thunk_target]\n", X86_FEATURE_NO_RETPOLINE)
+ X86_FEATURE_RETPOLINE)
 #  define THUNK_TARGET(addr) [thunk_target] "r" (addr)
 #elif defined(CONFIG_X86_64) && defined(CONFIG_RETPOLINE)
 # define NOSPEC_CALL ALTERNATIVE( \
- "       jmp 1221f; " \
- "1222:  push %[thunk_target];" \
- "       jmp __x86.indirect_thunk;" \
- "1221:  call 1222b;\n", \
- "call *%[thunk_target]\n", X86_FEATURE_NO_RETPOLINE)
+ "call *%[thunk_target]\n", \
+ "       jmp    1113f; " \
+ "1110:  call   1112f; " \
+ "1111: lfence; " \
+ "       jmp    1111b; " \
+ "1112: movl   %[thunk_target], (%esp); " \
+ "       ret; " \
+ "1113:  call   1110b;\n", \
+ X86_FEATURE_RETPOLINE)
 # define THUNK_TARGET(addr) [thunk_target] "rm" (addr)
 #else
 # define NOSPEC_CALL "call *%[thunk_target]\n"
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 372ba3fb400f..40e6e54d8501 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -904,6 +904,11 @@ static void __init early_identify_cpu(struct cpuinfo_x86 *c)
 
  setup_force_cpu_bug(X86_BUG_SPECTRE_V1);
  setup_force_cpu_bug(X86_BUG_SPECTRE_V2);
+#ifdef CONFIG_RETPOLINE
+ setup_force_cpu_cap(X86_FEATURE_RETPOLINE);
+ if (c->x86_vendor == X86_VENDOR_AMD)
+ setup_force_cpu_cap(X86_FEATURE_RETPOLINE_AMD);
+#endif
 
  fpu__init_system(c);
 
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index e1812d07b53e..35e123e5f413 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -35,7 +35,8 @@
 static int __init noretpoline_setup(char *__unused)
 {
  pr_info("Retpoline runtime disabled\n");
- setup_force_cpu_cap(X86_FEATURE_NO_RETPOLINE);
+ setup_clear_cpu_cap(X86_FEATURE_RETPOLINE);
+ setup_clear_cpu_cap(X86_FEATURE_RETPOLINE_AMD);
  return 1;
 }
 __setup("noretpoline", noretpoline_setup);
diff --git a/arch/x86/lib/retpoline.S b/arch/x86/lib/retpoline.S
index 2a4b1f09eb84..90d9a1589a54 100644
--- a/arch/x86/lib/retpoline.S
+++ b/arch/x86/lib/retpoline.S
@@ -6,19 +6,14 @@
 #include <asm/cpufeatures.h>
 #include <asm/alternative-asm.h>
 #include <asm/export.h>
+#include <asm/nospec-branch.h>
 
-.macro THUNK sp reg
+.macro THUNK reg
  .section .text.__x86.indirect_thunk.\reg
 
 ENTRY(__x86.indirect_thunk.\reg)
  CFI_STARTPROC
- ALTERNATIVE_2 "call 2f", __stringify(lfence;jmp *%\reg), X86_FEATURE_K8, __stringify(jmp *%\reg), X86_FEATURE_NO_RETPOLINE
-1:
- lfence
- jmp 1b
-2:
- mov %\reg, (%\sp)
- ret
+ NOSPEC_JMP \reg
  CFI_ENDPROC
 ENDPROC(__x86.indirect_thunk.\reg)
 EXPORT_SYMBOL(__x86.indirect_thunk.\reg)
@@ -26,11 +21,11 @@ EXPORT_SYMBOL(__x86.indirect_thunk.\reg)
 
 #ifdef CONFIG_64BIT
 .irp reg rax rbx rcx rdx rsi rdi rbp r8 r9 r10 r11 r12 r13 r14 r15
- THUNK rsp \reg
+ THUNK \reg
 .endr
 #else
 .irp reg eax ebx ecx edx esi edi ebp
- THUNK esp \reg
+ THUNK \reg
 .endr
 
 /*
@@ -39,7 +34,7 @@ EXPORT_SYMBOL(__x86.indirect_thunk.\reg)
  */
 ENTRY(__x86.indirect_thunk)
  CFI_STARTPROC
- ALTERNATIVE "call 2f", "ret", X86_FEATURE_NO_RETPOLINE
+ ALTERNATIVE "ret", "call 2f", X86_FEATURE_RETPOLINE
 1:
  lfence
  jmp 1b
-- 
2.14.3



2018-01-05 22:06:43

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support

On Fri, Jan 05, 2018 at 10:00:19PM +0000, Woodhouse, David wrote:
> OK, this one looks saner, and I think I've tested all the 32/64 bit

Dunno, I think Brian's suggestion will make this even simpler:

ALTERNATIVE(NOP, K8: lfence)
ALTERNATIVE(jmp indirect, RETPOLINE: jmp thunk)

Hmm?

--
Regards/Gruss,
Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
--

2018-01-05 22:44:06

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support

On Fri, Jan 05, 2018 at 10:16:54PM +0000, Woodhouse, David wrote:
> You'd still want a RETPOLINE_AMD flag to enable that lfence; it's not
> just K8.

I think you're forgetting that we set K8 on everything >= K8 on AMD. So
this:

+ if (c->x86_vendor == X86_VENDOR_AMD)
+ setup_force_cpu_cap(X86_FEATURE_RETPOLINE_AMD);

can just as well be dropped and X86_FEATURE_K8 used everywhere.
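
Spelled out on the common.c side, the simplification would be just (a sketch
only, mirroring the hunk above minus the AMD lines):

#ifdef CONFIG_RETPOLINE
	/*
	 * No separate RETPOLINE_AMD flag: the lfence variant in the thunk
	 * would key off X86_FEATURE_K8, which AMD has set on K8 and
	 * everything newer anyway.
	 */
	setup_force_cpu_cap(X86_FEATURE_RETPOLINE);
#endif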

--
Regards/Gruss,
Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
--

2018-01-05 23:50:33

by Linus Torvalds


On Fri, Jan 5, 2018 at 2:00 PM, Woodhouse, David <[email protected]> wrote:
> +.macro RETPOLINE_JMP reg:req
> + call 1112f
> +1111: lfence
> + jmp 1111b
> +1112: mov %\reg, (%_ASM_SP)
> + ret
> +.endm
> +
> +.macro RETPOLINE_CALL reg:req
> + jmp 1113f
> +1110: RETPOLINE_JMP \reg
> +1113: call 1110b
> +.endm

Why do these still force a register name?

Is it because this is the only user?

> index 2a4b1f09eb84..90d9a1589a54 100644
> --- a/arch/x86/lib/retpoline.S
> +++ b/arch/x86/lib/retpoline.S
> @@ -6,19 +6,14 @@
> #include <asm/cpufeatures.h>
> #include <asm/alternative-asm.h>
> #include <asm/export.h>
> +#include <asm/nospec-branch.h>
>
> -.macro THUNK sp reg
> +.macro THUNK reg
> .section .text.__x86.indirect_thunk.\reg
>
> ENTRY(__x86.indirect_thunk.\reg)
> CFI_STARTPROC
> - ALTERNATIVE_2 "call 2f", __stringify(lfence;jmp *%\reg), X86_FEATURE_K8, __stringify(jmp *%\reg), X86_FEATURE_NO_RETPOLINE
> -1:
> - lfence
> - jmp 1b
> -2:
> - mov %\reg, (%\sp)
> - ret
> + NOSPEC_JMP \reg
> CFI_ENDPROC

Can we just move those macros into retpoline.S because using them
outside that shouldn't happen?

Linus

2018-01-06 00:31:10

by Borislav Petkov


On Fri, Jan 05, 2018 at 11:08:06AM -0600, Josh Poimboeuf wrote:
> I seem to recall that we also discussed the need for this for converting
> pvops to use alternatives, though the "why" is eluding me at the moment.

Ok, here's something which seems to work in my VM here. I'll continue
playing with it tomorrow. Josh, if you have some example sequences for
me to try, send them my way pls.

Anyway, here's an example:

alternative("", "xor %%rdi, %%rdi; jmp startup_64", X86_FEATURE_K8);

which did this:

[ 0.921013] apply_alternatives: feat: 3*32+4, old: (ffffffff81027429, len: 8), repl: (ffffffff824759d2, len: 8), pad: 8
[ 0.924002] ffffffff81027429: old_insn: 90 90 90 90 90 90 90 90
[ 0.928003] ffffffff824759d2: rpl_insn: 48 31 ff e9 26 a6 b8 fe
[ 0.930212] process_jumps: repl[0]: 0x48
[ 0.932002] process_jumps: insn len: 3
[ 0.932814] process_jumps: repl[0]: 0xe9
[ 0.934003] recompute_jump: o_dspl: 0xfeb8a626
[ 0.934914] recompute_jump: target RIP: ffffffff81000000, new_displ: 0xfffd8bd7
[ 0.936001] recompute_jump: final displ: 0xfffd8bd2, JMP 0xffffffff81000000
[ 0.937240] process_jumps: insn len: 5
[ 0.938053] ffffffff81027429: final_insn: e9 d2 8b fd ff a6 b8 fe

Apparently our insn decoder is smart enough to parse the insn and get
its length, so I can use that. It jumps over the first 3-byte XOR and
then massages the following 5-byte jump.

---
From: Borislav Petkov <[email protected]>
Date: Fri, 5 Jan 2018 20:32:58 +0100
Subject: [PATCH] WIP

Signed-off-by: Borislav Petkov <[email protected]>
---
arch/x86/kernel/alternative.c | 51 +++++++++++++++++++++++++++++++------------
1 file changed, 37 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index dbaf14d69ebd..14a855789a50 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -21,6 +21,7 @@
#include <asm/tlbflush.h>
#include <asm/io.h>
#include <asm/fixmap.h>
+#include <asm/insn.h>

int __read_mostly alternatives_patched;

@@ -281,24 +282,24 @@ static inline bool is_jmp(const u8 opcode)
}

static void __init_or_module
-recompute_jump(struct alt_instr *a, u8 *orig_insn, u8 *repl_insn, u8 *insnbuf)
+recompute_jump(struct alt_instr *a, u8 *orig_insn, u8 *repl_insn, u8 repl_len, u8 *insnbuf)
{
u8 *next_rip, *tgt_rip;
s32 n_dspl, o_dspl;
- int repl_len;

- if (a->replacementlen != 5)
+ if (repl_len != 5)
return;

- o_dspl = *(s32 *)(insnbuf + 1);
+ o_dspl = *(s32 *)(repl_insn + 1);

/* next_rip of the replacement JMP */
- next_rip = repl_insn + a->replacementlen;
+ next_rip = repl_insn + repl_len;
+
/* target rip of the replacement JMP */
tgt_rip = next_rip + o_dspl;
n_dspl = tgt_rip - orig_insn;

- DPRINTK("target RIP: %p, new_displ: 0x%x", tgt_rip, n_dspl);
+ DPRINTK("target RIP: %px, new_displ: 0x%x", tgt_rip, n_dspl);

if (tgt_rip - orig_insn >= 0) {
if (n_dspl - 2 <= 127)
@@ -337,6 +338,29 @@ recompute_jump(struct alt_instr *a, u8 *orig_insn, u8 *repl_insn, u8 *insnbuf)
n_dspl, (unsigned long)orig_insn + n_dspl + repl_len);
}

+static void __init_or_module process_jumps(struct alt_instr *a, u8 *insnbuf)
+{
+ u8 *repl = (u8 *)&a->repl_offset + a->repl_offset;
+ u8 *instr = (u8 *)&a->instr_offset + a->instr_offset;
+ struct insn insn;
+ int i = 0;
+
+ if (!a->replacementlen)
+ return;
+
+ while (i < a->replacementlen) {
+ kernel_insn_init(&insn, repl, a->replacementlen);
+
+ insn_get_length(&insn);
+
+ if (is_jmp(repl[0]))
+ recompute_jump(a, instr, repl, insn.length, insnbuf);
+
+ i += insn.length;
+ repl += insn.length;
+ }
+}
+
/*
* "noinline" to cause control flow change and thus invalidate I$ and
* cause refetch after modification.
@@ -352,7 +376,7 @@ static void __init_or_module noinline optimize_nops(struct alt_instr *a, u8 *ins
add_nops(instr + (a->instrlen - a->padlen), a->padlen);
local_irq_restore(flags);

- DUMP_BYTES(instr, a->instrlen, "%p: [%d:%d) optimized NOPs: ",
+ DUMP_BYTES(instr, a->instrlen, "%px: [%d:%d) optimized NOPs: ",
instr, a->instrlen - a->padlen, a->padlen);
}

@@ -373,7 +397,7 @@ void __init_or_module noinline apply_alternatives(struct alt_instr *start,
u8 *instr, *replacement;
u8 insnbuf[MAX_PATCH_LEN];

- DPRINTK("alt table %p -> %p", start, end);
+ DPRINTK("alt table %px -> %px", start, end);
/*
* The scan order should be from start to end. A later scanned
* alternative code can overwrite previously scanned alternative code.
@@ -397,14 +421,14 @@ void __init_or_module noinline apply_alternatives(struct alt_instr *start,
continue;
}

- DPRINTK("feat: %d*32+%d, old: (%p, len: %d), repl: (%p, len: %d), pad: %d",
+ DPRINTK("feat: %d*32+%d, old: (%px, len: %d), repl: (%px, len: %d), pad: %d",
a->cpuid >> 5,
a->cpuid & 0x1f,
instr, a->instrlen,
replacement, a->replacementlen, a->padlen);

- DUMP_BYTES(instr, a->instrlen, "%p: old_insn: ", instr);
- DUMP_BYTES(replacement, a->replacementlen, "%p: rpl_insn: ", replacement);
+ DUMP_BYTES(instr, a->instrlen, "%px: old_insn: ", instr);
+ DUMP_BYTES(replacement, a->replacementlen, "%px: rpl_insn: ", replacement);

memcpy(insnbuf, replacement, a->replacementlen);
insnbuf_sz = a->replacementlen;
@@ -422,15 +446,14 @@ void __init_or_module noinline apply_alternatives(struct alt_instr *start,
(unsigned long)instr + *(s32 *)(insnbuf + 1) + 5);
}

- if (a->replacementlen && is_jmp(replacement[0]))
- recompute_jump(a, instr, replacement, insnbuf);
+ process_jumps(a, insnbuf);

if (a->instrlen > a->replacementlen) {
add_nops(insnbuf + a->replacementlen,
a->instrlen - a->replacementlen);
insnbuf_sz += a->instrlen - a->replacementlen;
}
- DUMP_BYTES(insnbuf, insnbuf_sz, "%p: final_insn: ", instr);
+ DUMP_BYTES(insnbuf, insnbuf_sz, "%px: final_insn: ", instr);

text_poke_early(instr, insnbuf, insnbuf_sz);
}
--
2.13.0

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
--

2018-01-06 08:23:43

by David Woodhouse


On Sat, 2018-01-06 at 01:30 +0100, Borislav Petkov wrote:
> On Fri, Jan 05, 2018 at 11:08:06AM -0600, Josh Poimboeuf wrote:
> > I seem to recall that we also discussed the need for this for converting
> > pvops to use alternatives, though the "why" is eluding me at the moment.
>
> Ok, here's something which seems to work in my VM here. I'll continue
> playing with it tomorrow. Josh, if you have some example sequences for
> me to try, send them my way pls.
>
> Anyway, here's an example:
>
> alternative("", "xor %%rdi, %%rdi; jmp startup_64", X86_FEATURE_K8);
>
> which did this:
>
> [    0.921013] apply_alternatives: feat: 3*32+4, old: (ffffffff81027429, len: 8), repl: (ffffffff824759d2, len: 8), pad: 8
> [    0.924002] ffffffff81027429: old_insn: 90 90 90 90 90 90 90 90
> [    0.928003] ffffffff824759d2: rpl_insn: 48 31 ff e9 26 a6 b8 fe
> [    0.930212] process_jumps: repl[0]: 0x48
> [    0.932002] process_jumps: insn len: 3
> [    0.932814] process_jumps: repl[0]: 0xe9
> [    0.934003] recompute_jump: o_dspl: 0xfeb8a626
> [    0.934914] recompute_jump: target RIP: ffffffff81000000, new_displ: 0xfffd8bd7
> [    0.936001] recompute_jump: final displ: 0xfffd8bd2, JMP 0xffffffff81000000
> [    0.937240] process_jumps: insn len: 5
> [    0.938053] ffffffff81027429: final_insn: e9 d2 8b fd ff a6 b8 fe
>
> Apparently our insn decoder is smart enough to parse the insn and get
> its length, so I can use that. It jumps over the first 3-byte XOR and
> then massages the following 5-byte jump.

Thanks. From code inspection, I couldn't see that it was smart enough
*not* to process a relative jump in the 'altinstr' section which was
jumping to a target *within* that same altinstr section, and thus
didn't need to be touched at all. Does this work?

alternative("", "xor %%rdi, %%rdi; jmp 2f; 2: jmp startup_64", X86_FEATURE_K8);


2018-01-06 10:58:20

by Woodhouse, David


On Fri, 2018-01-05 at 15:50 -0800, Linus Torvalds wrote:
>
> > +
> > +.macro RETPOLINE_CALL reg:req
> > +       jmp     1113f
> > +1110:  RETPOLINE_JMP \reg
> > +1113:  call    1110b
> > +.endm


(Note that RETPOLINE_CALL is purely internal to nospec-branch.h, used
only from the NOSPEC_CALL macro to make the ALTERNATIVE_2 invocation
less ugly than it would have been. But the same applies to NOSPEC_CALL
which is the one that gets used in our .S files including retpoline.S)

> Why do these still force a register name?
>
> Is it because this is the only user?

By "register name" are you asking why it's invoked as
"NOSPEC_CALL rbx" instead of "NOSPEC_CALL %rbx" with the percent sign?

I think I can probably clean that up now, and turn a lot of %\reg
occurrences into just \reg. The only remaining %\reg would be left in
retpoline.S inside the THUNK macro used from .irp.





2018-01-06 17:02:51

by Borislav Petkov


On Sat, Jan 06, 2018 at 08:23:21AM +0000, David Woodhouse wrote:
> Thanks. From code inspection, I couldn't see that it was smart enough
> *not* to process a relative jump in the 'altinstr' section which was
> jumping to a target *within* that same altinstr section, and thus
> didn't need to be touched at all. Does this work?
>
> alternative("", "xor %%rdi, %%rdi; jmp 2f; 2: jmp startup_64", X86_FEATURE_K8);

So this is fine because it gets turned into a two-byte jump:

[ 0.816005] apply_alternatives: feat: 3*32+4, old: (ffffffff810273c9, len: 10), repl: (ffffffff824759d2, len: 10), pad: 10
[ 0.820001] ffffffff810273c9: old_insn: 90 90 90 90 90 90 90 90 90 90
[ 0.821247] ffffffff824759d2: rpl_insn: 48 31 ff eb 00 e9 24 a6 b8 fe
[ 0.822455] process_jumps: insn start 0x48, at 0, len: 3
[ 0.823496] process_jumps: insn start 0xeb, at 3, len: 2
[ 0.824002] process_jumps: insn start 0xe9, at 5, len: 5
[ 0.825120] recompute_jump: target RIP: ffffffff81000000, new_displ: 0xfffd8c37
[ 0.826567] recompute_jump: final displ: 0xfffd8c32, JMP 0xffffffff81000000
[ 0.828001] ffffffff810273c9: final_insn: e9 32 8c fd ff e9 24 a6 b8 fe

i.e., notice the "eb 00" thing.

Which, when copied into the kernel proper, will simply work, as it is a
small offset referring to other code which gets copied *together* with
it. I.e., we're not changing the offsets during the copy, so all good.

It becomes more tricky when you force a 5-byte jump:

alternative("", "xor %%rdi, %%rdi; .byte 0xe9; .long 2f - .altinstr_replacement; 2: jmp startup_64", X86_FEATURE_K8);

because then you need to know whether the offset is within the
.altinstr_replacement section itself or it is meant to be an absolute
offset like jmp startup_64 or within another section.
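
The distinction itself is easy enough to state as a check -- a hypothetical
helper just to illustrate the idea, not part of the patch below:

#include <stdbool.h>
#include <stdint.h>

/*
 * A relative jump inside the replacement needs no displacement fix-up if
 * its target also lies inside the replacement block: the block is copied
 * into .text in one piece, so internal offsets stay valid.
 */
static bool jmp_target_is_local(const uint8_t *repl, int repl_len,
				const uint8_t *jmp, int jmp_len, int32_t dspl)
{
	const uint8_t *target = jmp + jmp_len + dspl;

	return target >= repl && target < repl + repl_len;
}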

On that I need to sleep more to figure out what a reliable way to do it
would be. I mean, not that we need it now. If all we care about are
two-byte offsets, those should work now. The stress being on "should".

The current version:

---
From: Borislav Petkov <[email protected]>
Date: Fri, 5 Jan 2018 20:32:58 +0100
Subject: [PATCH] WIP

Signed-off-by: Borislav Petkov <[email protected]>
---
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index dbaf14d69ebd..0cb4f886e6d7 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -21,6 +21,7 @@
#include <asm/tlbflush.h>
#include <asm/io.h>
#include <asm/fixmap.h>
+#include <asm/insn.h>

int __read_mostly alternatives_patched;

@@ -280,25 +281,35 @@ static inline bool is_jmp(const u8 opcode)
return opcode == 0xeb || opcode == 0xe9;
}

+/**
+ * @orig_insn: pointer to the original insn
+ * @repl_insn: pointer to the replacement insn
+ * @repl_len: length of the replacement insn
+ * @insnbuf: buffer we're working on massaging the insns
+ * @i_off: offset within the buffer
+ */
static void __init_or_module
-recompute_jump(struct alt_instr *a, u8 *orig_insn, u8 *repl_insn, u8 *insnbuf)
+recompute_jump(struct alt_instr *a, u8 *orig_insn, u8 *repl_insn, u8 repl_len,
+ u8 *insnbuf, u8 i_off)
{
u8 *next_rip, *tgt_rip;
s32 n_dspl, o_dspl;
- int repl_len;

- if (a->replacementlen != 5)
+ if (repl_len != 5)
return;

- o_dspl = *(s32 *)(insnbuf + 1);
+ o_dspl = *(s32 *)(repl_insn + 1);
+
+ DPRINTK("o_dspl: 0x%x, orig_insn: %px", o_dspl, orig_insn);

/* next_rip of the replacement JMP */
- next_rip = repl_insn + a->replacementlen;
+ next_rip = repl_insn + repl_len;
+
/* target rip of the replacement JMP */
tgt_rip = next_rip + o_dspl;
n_dspl = tgt_rip - orig_insn;

- DPRINTK("target RIP: %p, new_displ: 0x%x", tgt_rip, n_dspl);
+ DPRINTK("target RIP: %px, new_displ: 0x%x", tgt_rip, n_dspl);

if (tgt_rip - orig_insn >= 0) {
if (n_dspl - 2 <= 127)
@@ -316,8 +327,8 @@ recompute_jump(struct alt_instr *a, u8 *orig_insn, u8 *repl_insn, u8 *insnbuf)
two_byte_jmp:
n_dspl -= 2;

- insnbuf[0] = 0xeb;
- insnbuf[1] = (s8)n_dspl;
+ insnbuf[i_off] = 0xeb;
+ insnbuf[i_off + 1] = (s8)n_dspl;
add_nops(insnbuf + 2, 3);

repl_len = 2;
@@ -326,8 +337,8 @@ recompute_jump(struct alt_instr *a, u8 *orig_insn, u8 *repl_insn, u8 *insnbuf)
five_byte_jmp:
n_dspl -= 5;

- insnbuf[0] = 0xe9;
- *(s32 *)&insnbuf[1] = n_dspl;
+ insnbuf[i_off] = 0xe9;
+ *(s32 *)&insnbuf[i_off + 1] = n_dspl;

repl_len = 5;

@@ -337,6 +348,32 @@ recompute_jump(struct alt_instr *a, u8 *orig_insn, u8 *repl_insn, u8 *insnbuf)
n_dspl, (unsigned long)orig_insn + n_dspl + repl_len);
}

+static void __init_or_module process_jumps(struct alt_instr *a, u8 *insnbuf)
+{
+ u8 *repl = (u8 *)&a->repl_offset + a->repl_offset;
+ u8 *instr = (u8 *)&a->instr_offset + a->instr_offset;
+ struct insn insn;
+ int i = 0;
+
+ if (!a->replacementlen)
+ return;
+
+ while (i < a->replacementlen) {
+ kernel_insn_init(&insn, repl, a->replacementlen);
+
+ insn_get_length(&insn);
+
+ DPRINTK("insn start 0x%x, at %d, len: %d", repl[0], i, insn.length);
+
+ if (is_jmp(repl[0]))
+ recompute_jump(a, instr, repl, insn.length, insnbuf, i);
+
+ i += insn.length;
+ repl += insn.length;
+ instr += insn.length;
+ }
+}
+
/*
* "noinline" to cause control flow change and thus invalidate I$ and
* cause refetch after modification.
@@ -352,7 +389,7 @@ static void __init_or_module noinline optimize_nops(struct alt_instr *a, u8 *ins
add_nops(instr + (a->instrlen - a->padlen), a->padlen);
local_irq_restore(flags);

- DUMP_BYTES(instr, a->instrlen, "%p: [%d:%d) optimized NOPs: ",
+ DUMP_BYTES(instr, a->instrlen, "%px: [%d:%d) optimized NOPs: ",
instr, a->instrlen - a->padlen, a->padlen);
}

@@ -373,7 +410,7 @@ void __init_or_module noinline apply_alternatives(struct alt_instr *start,
u8 *instr, *replacement;
u8 insnbuf[MAX_PATCH_LEN];

- DPRINTK("alt table %p -> %p", start, end);
+ DPRINTK("alt table %px -> %px", start, end);
/*
* The scan order should be from start to end. A later scanned
* alternative code can overwrite previously scanned alternative code.
@@ -397,14 +434,14 @@ void __init_or_module noinline apply_alternatives(struct alt_instr *start,
continue;
}

- DPRINTK("feat: %d*32+%d, old: (%p, len: %d), repl: (%p, len: %d), pad: %d",
+ DPRINTK("feat: %d*32+%d, old: (%px, len: %d), repl: (%px, len: %d), pad: %d",
a->cpuid >> 5,
a->cpuid & 0x1f,
instr, a->instrlen,
replacement, a->replacementlen, a->padlen);

- DUMP_BYTES(instr, a->instrlen, "%p: old_insn: ", instr);
- DUMP_BYTES(replacement, a->replacementlen, "%p: rpl_insn: ", replacement);
+ DUMP_BYTES(instr, a->instrlen, "%px: old_insn: ", instr);
+ DUMP_BYTES(replacement, a->replacementlen, "%px: rpl_insn: ", replacement);

memcpy(insnbuf, replacement, a->replacementlen);
insnbuf_sz = a->replacementlen;
@@ -422,15 +459,14 @@ void __init_or_module noinline apply_alternatives(struct alt_instr *start,
(unsigned long)instr + *(s32 *)(insnbuf + 1) + 5);
}

- if (a->replacementlen && is_jmp(replacement[0]))
- recompute_jump(a, instr, replacement, insnbuf);
+ process_jumps(a, insnbuf);

if (a->instrlen > a->replacementlen) {
add_nops(insnbuf + a->replacementlen,
a->instrlen - a->replacementlen);
insnbuf_sz += a->instrlen - a->replacementlen;
}
- DUMP_BYTES(insnbuf, insnbuf_sz, "%p: final_insn: ", instr);
+ DUMP_BYTES(insnbuf, insnbuf_sz, "%px: final_insn: ", instr);

text_poke_early(instr, insnbuf, insnbuf_sz);
}
--
2.13.0

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
--

2018-01-07 09:41:00

by David Woodhouse


On Sat, 2018-01-06 at 18:02 +0100, Borislav Petkov wrote:
> On Sat, Jan 06, 2018 at 08:23:21AM +0000, David Woodhouse wrote:
> > Thanks. From code inspection, I couldn't see that it was smart enough
> > *not* to process a relative jump in the 'altinstr' section which was
> > jumping to a target *within* that same altinstr section, and thus
> > didn't need to be touched at all. Does this work?
> > 
> > alternative("", "xor %%rdi, %%rdi; jmp 2f; 2: jmp startup_64", X86_FEATURE_K8);
>
> So this is fine because it gets turned into a two-byte jump:
>
> [    0.816005] apply_alternatives: feat: 3*32+4, old: (ffffffff810273c9, len: 10), repl: (ffffffff824759d2, len: 10), pad: 10
> [    0.820001] ffffffff810273c9: old_insn: 90 90 90 90 90 90 90 90 90 90
> [    0.821247] ffffffff824759d2: rpl_insn: 48 31 ff eb 00 e9 24 a6 b8 fe
> [    0.822455] process_jumps: insn start 0x48, at 0, len: 3
> [    0.823496] process_jumps: insn start 0xeb, at 3, len: 2
> [    0.824002] process_jumps: insn start 0xe9, at 5, len: 5
> [    0.825120] recompute_jump: target RIP: ffffffff81000000, new_displ: 0xfffd8c37
> [    0.826567] recompute_jump: final displ: 0xfffd8c32, JMP 0xffffffff81000000
> [    0.828001] ffffffff810273c9: final_insn: e9 32 8c fd ff e9 24 a6 b8 fe
>
> i.e., notice the "eb 00" thing.
>
> Which, when copied into the kernel proper, will simply work, as it is a
> small offset referring to other code which gets copied *together* with
> it. I.e., we're not changing the offsets during the copy, so all good.
>
> It becomes more tricky when you force a 5-byte jump:
>
>         alternative("", "xor %%rdi, %%rdi; .byte 0xe9; .long 2f - .altinstr_replacement; 2: jmp startup_64", X86_FEATURE_K8);
>
> because then you need to know whether the offset is within the
> .altinstr_replacement section itself or it is meant to be an absolute
> offset like jmp startup_64 or within another section.

Right, so it all tends to work out OK purely by virtue of the fact that
oldinstr and altinstr end up far enough apart in the image that they're
5-byte jumps. Which isn't perfect but we've lived with worse.

I'm relatively pleased that we've managed to eliminate this as a
dependency for inverting the X86_FEATURE_RETPOLINE logic though, by
following Linus' suggestion to just emit the thunk inline instead of
calling the same one as GCC.

The other fun one for alternatives is in entry_64.S, where we really
need the return address of the call instruction to be *precisely* the 
.Lentry_SYSCALL_64_after_fastpath_call label, so we have to eschew the
normal NOSPEC_CALL there:

/*
* This call instruction is handled specially in stub_ptregs_64.
* It might end up jumping to the slow path.  If it jumps, RAX
* and all argument registers are clobbered.
*/
#ifdef CONFIG_RETPOLINE
movq sys_call_table(, %rax, 8), %rax
call __x86.indirect_thunk.rax
#else
call *sys_call_table(, %rax, 8)
#endif
.Lentry_SYSCALL_64_after_fastpath_call:

Now it's not like an unconditional branch to the out-of-line thunk is
really going to be much of a problem, even in the case where that out-
of-line thunk is alternative'd into a bare 'jmp *%rax'. But it would be
slightly nicer to avoid it.

At the moment though, it's really hard to ensure that the 'call'
instruction that leaves its address on the stack is right at the end.

Am I missing a trick there, other than manually inserting leading NOPs
(instead of the automatic trailing ones) to ensure that everything
lines up, and making assumptions about how the assembler will encode
instructions (not that it has *much* choice, but it has some)?

On the whole I think it's fine as it is, but if you have a simple fix
then that would be nice.


2018-01-07 11:46:57

by Borislav Petkov


On Sun, Jan 07, 2018 at 09:40:42AM +0000, David Woodhouse wrote:
> Right, so it all tends to work out OK purely by virtue of the fact that
> oldinstr and altinstr end up far enough apart in the image that they're
> 5-byte jumps. Which isn't perfect but we've lived with worse.

Well, the reference point is important. And I don't think we've done
more involved things than jumping back to something in .text proper.
However, I think I know how to fix this so that arbitrary jump offsets
would work but I need to talk to our gcc guys first.

If the jump is close enough for 2 bytes, then it should work as long as
the offset to the target doesn't change.

The main thing recompute_jump() does is turn 5-byte jumps - which gas
creates because the jump target is in .text but the jump itself is in
.altinstr_replacement - into 2-byte ones. Because when you copy the jump
back into .text, the offset might fit in a signed byte all of a sudden.
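
For the archives, that fix-up amounts to roughly the following -- a
standalone model of the logic, not the kernel code itself; function and
variable names are mine:

#include <stddef.h>
#include <stdint.h>

/*
 * 'repl' points at a 5-byte JMP rel32 as gas emitted it into
 * .altinstr_replacement, 'orig' is the patch site in .text the insn will
 * be copied to, 'buf' receives the re-encoded jump.  Returns the encoded
 * length: 2 if the new displacement fits in a signed byte, else 5.
 */
static int reencode_jmp(uint8_t *buf, const uint8_t *orig, const uint8_t *repl)
{
	int32_t o_dspl = *(const int32_t *)(repl + 1);	/* rel32 as assembled   */
	const uint8_t *target = repl + 5 + o_dspl;	/* absolute jump target */
	ptrdiff_t rel8 = target - (orig + 2);		/* displ. as a short JMP */

	if (rel8 >= -128 && rel8 <= 127) {
		buf[0] = 0xeb;				/* JMP rel8  */
		buf[1] = (uint8_t)(int8_t)rel8;
		return 2;
	}

	buf[0] = 0xe9;					/* JMP rel32 */
	*(int32_t *)&buf[1] = (int32_t)(target - (orig + 5));
	return 5;
}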

There are still some nasties with forcing 5-byte jumps but I think I
know how to fix those. Stay tuned...

> I'm relatively pleased that we've managed to eliminate this as a
> dependency for inverting the X86_FEATURE_RETPOLINE logic though, by
> following Linus' suggestion to just emit the thunk inline instead of
> calling the same one as GCC.
>
> The other fun one for alternatives is in entry_64.S, where we really
> need the return address of the call instruction to be *precisely* the 
> .Lentry_SYSCALL_64_after_fastpath_call label, so we have to eschew the
> normal NOSPEC_CALL there:

So CALL, as the doc says, pushes the offset of the *next* insn onto the
stack and branches to the target address.

So I'm thinking, as long as the next insn doesn't move and gcc doesn't
pad anything, you're fine.

However, I suspect that I'm missing something else here and I guess I'll
have more clue if I look at the whole thing. So can you point me to your
current branch so that I can take a look at the code?

Thx.

--
Regards/Gruss,
Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
--

2018-01-07 12:21:53

by David Woodhouse


On Sun, 2018-01-07 at 12:46 +0100, Borislav Petkov wrote:
>
> > 
> > The other fun one for alternatives is in entry_64.S, where we really
> > need the return address of the call instruction to be *precisely* the 
> > .Lentry_SYSCALL_64_after_fastpath_call label, so we have to eschew the
> > normal NOSPEC_CALL there:
>
> So CALL, as the doc says, pushes the offset of the *next* insn onto the
> stack and branches to the target address.
>
> So I'm thinking, as long as the next insn doesn't move and gcc doesn't
> pad anything, you're fine.
>
> However, I suspect that I'm missing something else here and I guess I'll
> have more clue if I look at the whole thing. So can you point me to your
> current branch so that I can take a look at the code?

http://git.infradead.org/users/dwmw2/linux-retpoline.git

In particular, this call site in entry_64.S:
http://git.infradead.org/users/dwmw2/linux-retpoline.git/blob/0f5c54a36e:/arch/x86/entry/entry_64.S#l270

It's still just unconditionally calling the out-of-line thunk and not
using ALTERNATIVE in the CONFIG_RETPOLINE case. I can't just use the
NOSPEC_CALL macro from
http://git.infradead.org/users/dwmw2/linux-retpoline.git/blob/0f5c54a36e:/arch/x86/include/asm/nospec-branch.h#l46
because of that requirement that the return address (on the stack) for
the CALL instruction must be precisely at the end, in all cases.

Each of the three alternatives *does* end with the CALL, it's just that
for the two which are shorter than the full retpoline one, they'll get
padded with NOPs at the end, so the return address on the stack *won't*
be what's expected.

Explicitly padding the alternatives with leading NOPs so that they are
all precisely the same length would work, and if the alternatives
mechanism were to pad the shorter ones with leading NOPs instead of
trailing NOPs, that would *also* work (though that would be fairly
difficult, especially for oldinstr).

I'm not sure I *see* a simple answer, and it isn't really that bad to
just do what GCC is doing and unconditionally call the out-of-line
thunk. So feel free to just throw your hands up in horror and say "no,
we can't cope with that" :)


2018-01-07 14:04:04

by Borislav Petkov


On Sun, Jan 07, 2018 at 12:21:29PM +0000, David Woodhouse wrote:
> http://git.infradead.org/users/dwmw2/linux-retpoline.git
>
> In particular, this call site in entry_64.S:
> http://git.infradead.org/users/dwmw2/linux-retpoline.git/blob/0f5c54a36e:/arch/x86/entry/entry_64.S#l270
>
> It's still just unconditionally calling the out-of-line thunk and not
> using ALTERNATIVE in the CONFIG_RETPOLINE case. I can't just use the
> NOSPEC_CALL macro from
> http://git.infradead.org/users/dwmw2/linux-retpoline.git/blob/0f5c54a36e:/arch/x86/include/asm/nospec-branch.h#l46
> because of that requirement that the return address (on the stack) for
> the CALL instruction must be precisely at the end, in all cases.

Thanks, this makes it all clear.

> Each of the three alternatives *does* end with the CALL, it's just that
> for the two which are shorter than the full retpoline one, they'll get
> padded with NOPs at the end, so the return address on the stack *won't*
> be what's expected.

Right.

> Explicitly padding the alternatives with leading NOPs so that they are
> all precisely the same length would work, and if the alternatives
> mechanism were to pad the shorter ones with leading NOPs instead of
> trailing NOPs, that would *also* work (though that would be fairly
> difficult, especially for oldinstr).

Right, so doing this reliably would need adding flags to struct
alt_instr - and this need has arisen before and we've managed not to do
it :). And I'm not really persuaded we need it now either because this
would be a one-off case where we need the padding in the front.
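
(For reference, that would mean something along these lines -- the existing
fields are as in arch/x86/include/asm/alternative.h, the new field and flag
name are invented -- plus the matching change to the ALTERNATIVE() macros
that emit these entries, which is part of why it keeps being avoided:)

struct alt_instr {
	s32 instr_offset;	/* original instruction */
	s32 repl_offset;	/* offset to replacement instruction */
	u16 cpuid;		/* cpuid bit set for replacement */
	u8  instrlen;		/* length of original instruction */
	u8  replacementlen;	/* length of new instruction */
	u8  padlen;		/* length of build-time padding */
	u8  flags;		/* hypothetical: e.g. ALT_FLAG_PAD_FRONT */
} __packed;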

> I'm not sure I *see* a simple answer, and it isn't really that bad to
> just do what GCC is doing and unconditionally call the out-of-line
> thunk. So feel free to just throw your hands up in horror and say "no,
> we can't cope with that" :)

Right, or we can do something like this - diff on top of yours below. We used to
do this before the automatic padding - explicit NOP padding by hand :-)

First split the ALTERNATIVE_2 into two alternative calls, as Brian suggested:

ALTERNATIVE "", "lfence", X86_FEATURE_RETPOLINE_AMD

and then the second:

ALTERNATIVE __stringify(ASM_NOP8; ASM_NOP8; ASM_NOP3; call *\reg), \
__stringify(RETPOLINE_CALL \reg), X86_FEATURE_RETPOLINE

with frontal NOP padding.

When compiled, it looks like this:

ffffffff818002e2: 90 nop
ffffffff818002e3: 90 nop
ffffffff818002e4: 90 nop

these first three bytes become LFENCE on AMD. On Intel, it gets optimized:

[ 0.042954] ffffffff818002e2: [0:3) optimized NOPs: 0f 1f 00

Then:

ffffffff818002e5: 66 66 66 90 data16 data16 xchg %ax,%ax
ffffffff818002e9: 66 66 66 90 data16 data16 xchg %ax,%ax
ffffffff818002ed: 66 66 66 90 data16 data16 xchg %ax,%ax
ffffffff818002f1: 66 66 66 90 data16 data16 xchg %ax,%ax
ffffffff818002f5: 66 66 90 data16 xchg %ax,%ax
ffffffff818002f8: ff d3 callq *%rbx

This thing remains like this on AMD and on Intel gets turned into:

[ 0.044004] apply_alternatives: feat: 7*32+12, old: (ffffffff818002e5, len: 21), repl: (ffffffff82219edc, len: 21), pad: 0
[ 0.045967] ffffffff818002e5: old_insn: 66 66 66 90 66 66 66 90 66 66 66 90 66 66 66 90 66 66 90 ff d3
[ 0.048006] ffffffff82219edc: rpl_insn: eb 0e e8 04 00 00 00 f3 90 eb fc 48 89 1c 24 c3 e8 ed ff ff ff
[ 0.050581] ffffffff818002e5: final_insn: eb 0e e8 04 00 00 00 f3 90 eb fc 48 89 1c 24 c3 e8 ed ff ff ff
[ 0.052003] ffffffff8180123b: [0:3) optimized NOPs: 0f 1f 00


ffffffff818002e5: eb 0e jmp ffffffff818002f5
ffffffff818002e7: e8 04 00 00 00 callq ffffffff818002f0
ffffffff818002ec: f3 90 pause
ffffffff818002ee: eb fc jmp ffffffff818002ec
ffffffff818002f0: 48 89 1c 24 mov %rbx,(%rsp)
ffffffff818002f4: c3 ret
ffffffff818002f5: e8 ed ff ff ff callq ffffffff818002e7

so that in both cases, the CALL remains last.

My fear is if some funky compiler changes the sizes of the insns in
RETPOLINE_CALL/JMP and then the padding becomes wrong. But looking at the
labels, they're all close so you have a 2-byte jmp already and the

call 1112f

should be ok. The MOV is reg,(reg), which shouldn't change from its 4-byte size...

But I'm remaining cautious here.

And we'd need to do the respective thing for NOSPEC_JMP.

Btw, I've moved the setting of X86_FEATURE_RETPOLINE and
X86_FEATURE_RETPOLINE_AMD to the respective intel.c and amd.c files. I
think this is what we want.

And if we're doing two different RETPOLINE flavors, then we should rename
X86_FEATURE_RETPOLINE to X86_FEATURE_RETPOLINE_INTEL.

Does that whole thing make some sense...?

---
diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
index c4d08fa29d6c..70825ce909ba 100644
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -6,6 +6,7 @@
#include <asm/alternative.h>
#include <asm/alternative-asm.h>
#include <asm/cpufeatures.h>
+#include <asm/nops.h>

#ifdef __ASSEMBLY__

@@ -43,11 +44,15 @@
#endif
.endm

+/*
+ * RETPOLINE_CALL is 21 bytes so pad the front with 19 bytes + 2 bytes for the
+ * CALL insn so that all CALL instructions remain last.
+ */
.macro NOSPEC_CALL reg:req
#ifdef CONFIG_RETPOLINE
- ALTERNATIVE_2 __stringify(call *\reg), \
- __stringify(RETPOLINE_CALL \reg), X86_FEATURE_RETPOLINE,\
- __stringify(lfence; call *\reg), X86_FEATURE_RETPOLINE_AMD
+ ALTERNATIVE "", "lfence", X86_FEATURE_RETPOLINE_AMD
+ ALTERNATIVE __stringify(ASM_NOP8; ASM_NOP8; ASM_NOP3; call *\reg), \
+ __stringify(RETPOLINE_CALL \reg), X86_FEATURE_RETPOLINE
#else
call *\reg
#endif
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index b221fe507640..938111e77dd0 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -554,6 +554,8 @@ static void bsp_init_amd(struct cpuinfo_x86 *c)
rdmsrl(MSR_FAM10H_NODE_ID, value);
nodes_per_socket = ((value >> 3) & 7) + 1;
}
+
+ setup_force_cpu_cap(X86_FEATURE_RETPOLINE_AMD);
}

static void early_init_amd(struct cpuinfo_x86 *c)
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index cfa5042232a2..372ba3fb400f 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -904,9 +904,6 @@ static void __init early_identify_cpu(struct cpuinfo_x86 *c)

setup_force_cpu_bug(X86_BUG_SPECTRE_V1);
setup_force_cpu_bug(X86_BUG_SPECTRE_V2);
-#ifdef CONFIG_RETPOLINE
- setup_force_cpu_cap(X86_FEATURE_RETPOLINE);
-#endif

fpu__init_system(c);

diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index 35e123e5f413..a9e00edaba46 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -524,6 +524,13 @@ static void init_intel_misc_features(struct cpuinfo_x86 *c)
wrmsrl(MSR_MISC_FEATURES_ENABLES, msr);
}

+static void bsp_init_intel(struct cpuinfo_x86 *c)
+{
+#ifdef CONFIG_RETPOLINE
+ setup_force_cpu_cap(X86_FEATURE_RETPOLINE);
+#endif
+}
+
static void init_intel(struct cpuinfo_x86 *c)
{
unsigned int l2 = 0;
@@ -893,6 +900,7 @@ static const struct cpu_dev intel_cpu_dev = {
.c_detect_tlb = intel_detect_tlb,
.c_early_init = early_init_intel,
.c_init = init_intel,
+ .c_bsp_init = bsp_init_intel,
.c_bsp_resume = intel_bsp_resume,
.c_x86_vendor = X86_VENDOR_INTEL,
};

--
Regards/Gruss,
Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
--

2018-01-08 05:06:19

by Josh Poimboeuf


On Sat, Jan 06, 2018 at 01:30:59AM +0100, Borislav Petkov wrote:
> On Fri, Jan 05, 2018 at 11:08:06AM -0600, Josh Poimboeuf wrote:
> > I seem to recall that we also discussed the need for this for converting
> > pvops to use alternatives, though the "why" is eluding me at the moment.
>
> Ok, here's something which seems to work in my VM here. I'll continue
> playing with it tomorrow. Josh, if you have some example sequences for
> me to try, send them my way pls.

Here's the use case I had in mind before. With paravirt,

ENABLE_INTERRUPTS(CLBR_NONE)

becomes

push %rax
call *pv_irq_ops.irq_enable
pop %rax

and I wanted to apply those instructions with an alternative. It
doesn't work currently because the 'call' isn't first.
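
Purely to illustrate the shape of it (the feature bit and the callee are
placeholders, not a real proposal):

	/*
	 * Native 'sti' normally; a paravirt call sequence otherwise.  The
	 * point is that the CALL is the *second* insn of the replacement,
	 * which the current code can't relocate because it only looks at
	 * the first instruction.
	 */
	alternative("sti",
		    "push %%rax; call pv_irq_enable_stub; pop %%rax",
		    X86_FEATURE_HYPERVISOR);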

--
Josh

2018-01-08 08:37:11

by Woodhouse, David


On Sun, 2018-01-07 at 23:06 -0600, Josh Poimboeuf wrote:
>
> Here's the use case I had in mind before.  With paravirt,
>
>   ENABLE_INTERRUPTS(CLBR_NONE)
>
> becomes
>
>   push  %rax
>   call  *pv_irq_ops.irq_enable
>   pop   %rax
>
> and I wanted to apply those instructions with an alternative.  It
> doesn't work currently because the 'call' isn't first.

I believe Borislav has made that work... however, if you're literally
doing the above then you'd be introducing new indirect branches which
is precisely what I've been trying to eliminate.

I believe I was told to stop prodding at pvops and just trust that they
all get turned into *direct* jumps at runtime. For example the above
call would not be literally 'call *pv_irq_ops.irq_enable', because by
the time the pvops are patched we'd *know* the final value of the
irq_enable method, and we'd turn it into a *direct* call instead.

Do I need to start looking at pvops again?


2018-01-08 21:50:58

by David Woodhouse


On Sun, 2018-01-07 at 15:03 +0100, Borislav Petkov wrote:
>
> My fear is if some funky compiler changes the sizes of the insns in
> RETPOLINE_CALL/JMP and then the padding becomes wrong. But looking at the
> labels, they're all close so you have a 2-byte jmp already and the
>
> call    1112f
>
> should be ok. The MOV is reg,(reg) which should not change size of 4...
>
> But I'm remaining cautious here.

Right. I forget the specifics, but I've *watched* LLVM break carefully
hand-crafted asm code by emitting 4-byte variants when we expected 2-
byte, etc.

On the whole, I'm sufficiently unhappy with making such assumptions
that I think the cure is worse than the disease. We can live with that
*one* out-of-line call to the thunk in the syscall case, and that was
the *only* one that really needed the call to be at the end.

Note that in the alternative case there, we don't even need to load it
into a register at all. We could do our own alternatives specially for
that case, and hand-tune the lengths only for them. But *with* a sanity
check to break the build on mismatch.

I don't think it's worth it at this point though.

