2022-09-15 12:09:05

by Peter Zijlstra

Subject: [PATCH v3 48/59] x86/retbleed: Add SKL return thunk

From: Thomas Gleixner <[email protected]>

To address the Intel SKL RSB underflow issue in software it's required to
do call depth tracking.

Provide a return thunk for call depth tracking on Intel SKL CPUs.

The tracking does not use a counter. It uses arithmetic shift
right on call entry and logical shift left on return.

The depth tracking variable is initialized to 0x8000.... when the call
depth is zero. The arithmetic shift right sign extends the MSB and
saturates after the 12th call. The shift count is 5 so the tracking covers
12 nested calls. On return the variable is shifted left logically so it
becomes zero again.

CALL RET
0: 0x8000000000000000 0x0000000000000000
1: 0xfc00000000000000 0xf000000000000000
...
11: 0xfffffffffffffff8 0xfffffffffffffc00
12: 0xffffffffffffffff 0xffffffffffffffe0

After a return buffer fill the depth is credited 12 calls before the next
stuffing has to take place.
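
Purely as an illustration (not part of the patch), the bookkeeping can be
mimicked with a small user-space C sketch. RET_DEPTH_SHIFT and the 0x8000...
start value mirror the constants added in the hunks below; the helper names
are invented here, and the sketch leans on gcc/clang's two's-complement
handling of the signed shift and conversions:

#include <stdio.h>
#include <stdint.h>

#define RET_DEPTH_SHIFT	5			/* mirrors the patch */

static int64_t depth = INT64_MIN;		/* 0x8000000000000000, i.e. call depth 0 */
static int stuffs;

static void track_call(void)
{
	/* Arithmetic shift right: sign-extends and saturates at all ones */
	depth >>= RET_DEPTH_SHIFT;
}

static void track_ret(void)
{
	/* Logical shift left, like the shlq in the return thunk below */
	depth = (int64_t)((uint64_t)depth << RET_DEPTH_SHIFT);
	if (!depth) {
		/*
		 * Zero means the tracked depth is exhausted; the thunk would
		 * stuff the RSB and credit the depth again (movq $-1).
		 */
		stuffs++;
		depth = -1;
	}
}

int main(void)
{
	int i;

	for (i = 1; i <= 14; i++) {
		track_call();
		printf("call %2d: 0x%016llx\n", i, (unsigned long long)depth);
	}
	for (i = 1; i <= 14; i++) {
		track_ret();
		printf("ret  %2d: 0x%016llx stuffs=%d\n", i,
		       (unsigned long long)depth, stuffs);
	}
	return 0;
}

Running it shows the value saturating to all ones after roughly a dozen calls
and the stuffing condition (the left shift producing zero) firing once the
tracked depth is used up, after which the credit buys roughly another dozen
returns.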

There is an inaccuracy for situations like this:

10 calls
5 returns
3 calls
4 returns
3 calls
....

The shift count might cause this to be off by one in either direction, but
there is still a cushion vs. the RSB depth. The algorithm does not claim to
be perfect, but it should obfuscate the problem enough to make exploitation
extremely difficult.

The theory behind this is:

RSB is a stack with depth 16 which is filled on every call. On the return
path speculation "pops" entries to speculate down the call chain. Once the
speculative RSB is empty it switches to other predictors, e.g. the Branch
History Buffer, which can be mistrained by user space and misguide the
speculation path to a gadget.

Call depth tracking is designed to break this speculation path by stuffing
speculation trap calls into the RSB which are never getting a corresponding
return executed. This stalls the prediction path until it gets resteered.

The assumption is that stuffing at the 12th return is sufficient to break
the speculation before it hits the underflow and the fallback to the other
predictors. Testing confirms that it works. Johannes, one of the retbleed
researchers, tried to attack this approach but failed.

There is obviously no scientific proof that this will withstand future
research progress, but all we can do right now is to speculate about it.

The SAR/SHL usage was suggested by Andi Kleen.

Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
arch/x86/entry/entry_64.S | 10 +-
arch/x86/include/asm/current.h | 3
arch/x86/include/asm/nospec-branch.h | 118 +++++++++++++++++++++++++++++++++--
arch/x86/kernel/asm-offsets.c | 3
arch/x86/kvm/svm/vmenter.S | 1
arch/x86/lib/retpoline.S | 31 +++++++++
6 files changed, 156 insertions(+), 10 deletions(-)

--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -288,6 +288,7 @@ SYM_FUNC_END(__switch_to_asm)
SYM_CODE_START_NOALIGN(ret_from_fork)
UNWIND_HINT_EMPTY
ANNOTATE_NOENDBR // copy_thread
+ CALL_DEPTH_ACCOUNT
movq %rax, %rdi
call schedule_tail /* rdi: 'prev' task parameter */

@@ -332,7 +333,7 @@ SYM_CODE_START(xen_error_entry)
UNWIND_HINT_FUNC
PUSH_AND_CLEAR_REGS save_ret=1
ENCODE_FRAME_POINTER 8
- UNTRAIN_RET
+ UNTRAIN_RET_FROM_CALL
RET
SYM_CODE_END(xen_error_entry)

@@ -977,7 +978,7 @@ SYM_CODE_START(paranoid_entry)
* CR3 above, keep the old value in a callee saved register.
*/
IBRS_ENTER save_reg=%r15
- UNTRAIN_RET
+ UNTRAIN_RET_FROM_CALL

RET
SYM_CODE_END(paranoid_entry)
@@ -1062,7 +1063,7 @@ SYM_CODE_START(error_entry)
/* We have user CR3. Change to kernel CR3. */
SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
IBRS_ENTER
- UNTRAIN_RET
+ UNTRAIN_RET_FROM_CALL

leaq 8(%rsp), %rdi /* arg0 = pt_regs pointer */
/* Put us onto the real thread stack. */
@@ -1097,6 +1098,7 @@ SYM_CODE_START(error_entry)
*/
.Lerror_entry_done_lfence:
FENCE_SWAPGS_KERNEL_ENTRY
+ CALL_DEPTH_ACCOUNT
leaq 8(%rsp), %rax /* return pt_regs pointer */
ANNOTATE_UNRET_END
RET
@@ -1115,7 +1117,7 @@ SYM_CODE_START(error_entry)
FENCE_SWAPGS_USER_ENTRY
SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
IBRS_ENTER
- UNTRAIN_RET
+ UNTRAIN_RET_FROM_CALL

/*
* Pretend that the exception came from user mode: set up pt_regs
--- a/arch/x86/include/asm/current.h
+++ b/arch/x86/include/asm/current.h
@@ -17,6 +17,9 @@ struct pcpu_hot {
struct task_struct *current_task;
int preempt_count;
int cpu_number;
+#ifdef CONFIG_CALL_DEPTH_TRACKING
+ u64 call_depth;
+#endif
unsigned long top_of_stack;
void *hardirq_stack_ptr;
u16 softirq_pending;
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -12,8 +12,80 @@
#include <asm/msr-index.h>
#include <asm/unwind_hints.h>
#include <asm/percpu.h>
+#include <asm/current.h>

-#define RETPOLINE_THUNK_SIZE 32
+/*
+ * Call depth tracking for Intel SKL CPUs to address the RSB underflow
+ * issue in software.
+ *
+ * The tracking does not use a counter. It uses arithmetic shift
+ * right on call entry and logical shift left on return.
+ *
+ * The depth tracking variable is initialized to 0x8000.... when the call
+ * depth is zero. The arithmetic shift right sign extends the MSB and
+ * saturates after the 12th call. The shift count is 5 for both directions
+ * so the tracking covers 12 nested calls.
+ *
+ * Call
+ * 0: 0x8000000000000000 0x0000000000000000
+ * 1: 0xfc00000000000000 0xf000000000000000
+ * ...
+ * 11: 0xfffffffffffffff8 0xfffffffffffffc00
+ * 12: 0xffffffffffffffff 0xffffffffffffffe0
+ *
+ * After a return buffer fill the depth is credited 12 calls before the
+ * next stuffing has to take place.
+ *
+ * There is an inaccuracy for situations like this:
+ *
+ * 10 calls
+ * 5 returns
+ * 3 calls
+ * 4 returns
+ * 3 calls
+ * ....
+ *
+ * The shift count might cause this to be off by one in either direction,
+ * but there is still a cushion vs. the RSB depth. The algorithm does not
+ * claim to be perfect and it can be speculated around by the CPU, but it
+ * is considered that it obfuscates the problem enough to make exploitation
+ * extremely difficult.
+ */
+#define RET_DEPTH_SHIFT 5
+#define RSB_RET_STUFF_LOOPS 16
+#define RET_DEPTH_INIT 0x8000000000000000ULL
+#define RET_DEPTH_INIT_FROM_CALL 0xfc00000000000000ULL
+#define RET_DEPTH_CREDIT 0xffffffffffffffffULL
+
+#ifdef CONFIG_CALL_DEPTH_TRACKING
+#define CREDIT_CALL_DEPTH \
+ movq $-1, PER_CPU_VAR(pcpu_hot + X86_call_depth);
+
+#define ASM_CREDIT_CALL_DEPTH \
+ movq $-1, PER_CPU_VAR(pcpu_hot + X86_call_depth);
+
+#define RESET_CALL_DEPTH \
+ mov $0x80, %rax; \
+ shl $56, %rax; \
+ movq %rax, PER_CPU_VAR(pcpu_hot + X86_call_depth);
+
+#define RESET_CALL_DEPTH_FROM_CALL \
+ mov $0xfc, %rax; \
+ shl $56, %rax; \
+ movq %rax, PER_CPU_VAR(pcpu_hot + X86_call_depth);
+
+#define INCREMENT_CALL_DEPTH \
+ sarq $5, %gs:pcpu_hot + X86_call_depth;
+
+#define ASM_INCREMENT_CALL_DEPTH \
+ sarq $5, PER_CPU_VAR(pcpu_hot + X86_call_depth);
+
+#else
+#define CREDIT_CALL_DEPTH
+#define RESET_CALL_DEPTH
+#define INCREMENT_CALL_DEPTH
+#define RESET_CALL_DEPTH_FROM_CALL
+#endif

/*
* Fill the CPU return stack buffer.
@@ -32,6 +104,7 @@
* from C via asm(".include <asm/nospec-branch.h>") but let's not go there.
*/

+#define RETPOLINE_THUNK_SIZE 32
#define RSB_CLEAR_LOOPS 32 /* To forcibly overwrite all entries */

/*
@@ -60,7 +133,8 @@
dec reg; \
jnz 771b; \
/* barrier for jnz misprediction */ \
- lfence;
+ lfence; \
+ ASM_CREDIT_CALL_DEPTH
#else
/*
* i386 doesn't unconditionally have LFENCE, as such it can't
@@ -185,11 +259,32 @@
* where we have a stack but before any RET instruction.
*/
.macro UNTRAIN_RET
-#if defined(CONFIG_CPU_UNRET_ENTRY) || defined(CONFIG_CPU_IBPB_ENTRY)
+#if defined(CONFIG_CPU_UNRET_ENTRY) || defined(CONFIG_CPU_IBPB_ENTRY) || \
+ defined(CONFIG_X86_FEATURE_CALL_DEPTH)
ANNOTATE_UNRET_END
- ALTERNATIVE_2 "", \
- CALL_ZEN_UNTRAIN_RET, X86_FEATURE_UNRET, \
- "call entry_ibpb", X86_FEATURE_ENTRY_IBPB
+ ALTERNATIVE_3 "", \
+ CALL_ZEN_UNTRAIN_RET, X86_FEATURE_UNRET, \
+ "call entry_ibpb", X86_FEATURE_ENTRY_IBPB, \
+ __stringify(RESET_CALL_DEPTH), X86_FEATURE_CALL_DEPTH
+#endif
+.endm
+
+.macro UNTRAIN_RET_FROM_CALL
+#if defined(CONFIG_CPU_UNRET_ENTRY) || defined(CONFIG_CPU_IBPB_ENTRY) || \
+ defined(CONFIG_X86_FEATURE_CALL_DEPTH)
+ ANNOTATE_UNRET_END
+ ALTERNATIVE_3 "", \
+ CALL_ZEN_UNTRAIN_RET, X86_FEATURE_UNRET, \
+ "call entry_ibpb", X86_FEATURE_ENTRY_IBPB, \
+ __stringify(RESET_CALL_DEPTH_FROM_CALL), X86_FEATURE_CALL_DEPTH
+#endif
+.endm
+
+
+.macro CALL_DEPTH_ACCOUNT
+#ifdef CONFIG_CALL_DEPTH_TRACKING
+ ALTERNATIVE "", \
+ __stringify(ASM_INCREMENT_CALL_DEPTH), X86_FEATURE_CALL_DEPTH
#endif
.endm

@@ -214,6 +309,17 @@ extern void (*x86_return_thunk)(void);
#define x86_return_thunk (&__x86_return_thunk)
#endif

+#ifdef CONFIG_CALL_DEPTH_TRACKING
+extern void __x86_return_skl(void);
+
+static inline void x86_set_skl_return_thunk(void)
+{
+ x86_return_thunk = &__x86_return_skl;
+}
+#else
+static inline void x86_set_skl_return_thunk(void) {}
+#endif
+
#ifdef CONFIG_RETPOLINE

#define GEN(reg) \
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -110,6 +110,9 @@ static void __used common(void)
OFFSET(TSS_sp2, tss_struct, x86_tss.sp2);

OFFSET(X86_top_of_stack, pcpu_hot, top_of_stack);
+#ifdef CONFIG_CALL_DEPTH_TRACKING
+ OFFSET(X86_call_depth, pcpu_hot, call_depth);
+#endif

if (IS_ENABLED(CONFIG_KVM_INTEL)) {
BLANK();
--- a/arch/x86/kvm/svm/vmenter.S
+++ b/arch/x86/kvm/svm/vmenter.S
@@ -1,6 +1,7 @@
/* SPDX-License-Identifier: GPL-2.0 */
#include <linux/linkage.h>
#include <asm/asm.h>
+#include <asm/asm-offsets.h>
#include <asm/bitsperlong.h>
#include <asm/kvm_vcpu_regs.h>
#include <asm/nospec-branch.h>
--- a/arch/x86/lib/retpoline.S
+++ b/arch/x86/lib/retpoline.S
@@ -5,9 +5,11 @@
#include <asm/dwarf2.h>
#include <asm/cpufeatures.h>
#include <asm/alternative.h>
+#include <asm/asm-offsets.h>
#include <asm/export.h>
#include <asm/nospec-branch.h>
#include <asm/unwind_hints.h>
+#include <asm/percpu.h>
#include <asm/frame.h>

.section .text.__x86.indirect_thunk
@@ -140,3 +142,32 @@ SYM_FUNC_END(zen_untrain_ret)
EXPORT_SYMBOL(__x86_return_thunk)

#endif /* CONFIG_RETHUNK */
+
+#ifdef CONFIG_CALL_DEPTH_TRACKING
+
+ .align 64
+SYM_FUNC_START(__x86_return_skl)
+ ANNOTATE_NOENDBR
+ /* Keep the hotpath in a 16byte I-fetch */
+ shlq $5, PER_CPU_VAR(pcpu_hot + X86_call_depth)
+ jz 1f
+ ANNOTATE_UNRET_SAFE
+ ret
+ int3
+1:
+ .rept 16
+ ANNOTATE_INTRA_FUNCTION_CALL
+ call 2f
+ int3
+2:
+ .endr
+ add $(8*16), %rsp
+
+ CREDIT_CALL_DEPTH
+
+ ANNOTATE_UNRET_SAFE
+ ret
+ int3
+SYM_FUNC_END(__x86_return_skl)
+
+#endif /* CONFIG_CALL_DEPTH_TRACKING */



Subject: [tip: x86/core] x86/retbleed: Add SKL return thunk

The following commit has been merged into the x86/core branch of tip:

Commit-ID: 5d8213864ade86b48fc492584ea86d65a62f892e
Gitweb: https://git.kernel.org/tip/5d8213864ade86b48fc492584ea86d65a62f892e
Author: Thomas Gleixner <[email protected]>
AuthorDate: Thu, 15 Sep 2022 13:11:27 +02:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Mon, 17 Oct 2022 16:41:15 +02:00

x86/retbleed: Add SKL return thunk

To address the Intel SKL RSB underflow issue in software it's required to
do call depth tracking.

Provide a return thunk for call depth tracking on Intel SKL CPUs.

The tracking does not use a counter. It uses arithmetic shift
right on call entry and logical shift left on return.

The depth tracking variable is initialized to 0x8000.... when the call
depth is zero. The arithmetic shift right sign extends the MSB and
saturates after the 12th call. The shift count is 5 so the tracking covers
12 nested calls. On return the variable is shifted left logically so it
becomes zero again.

CALL RET
0: 0x8000000000000000 0x0000000000000000
1: 0xfc00000000000000 0xf000000000000000
...
11: 0xfffffffffffffff8 0xfffffffffffffc00
12: 0xffffffffffffffff 0xffffffffffffffe0

After a return buffer fill the depth is credited 12 calls before the next
stuffing has to take place.

There is an inaccuracy for situations like this:

10 calls
5 returns
3 calls
4 returns
3 calls
....

The shift count might cause this to be off by one in either direction, but
there is still a cushion vs. the RSB depth. The algorithm does not claim to
be perfect, but it should obfuscate the problem enough to make exploitation
extremely difficult.

The theory behind this is:

RSB is a stack with depth 16 which is filled on every call. On the return
path speculation "pops" entries to speculate down the call chain. Once the
speculative RSB is empty it switches to other predictors, e.g. the Branch
History Buffer, which can be mistrained by user space and misguide the
speculation path to a gadget.

Call depth tracking is designed to break this speculation path by stuffing
speculation trap calls into the RSB which are never getting a corresponding
return executed. This stalls the prediction path until it gets resteered.

The assumption is that stuffing at the 12th return is sufficient to break
the speculation before it hits the underflow and the fallback to the other
predictors. Testing confirms that it works. Johannes, one of the retbleed
researchers, tried to attack this approach but failed.

There is obviously no scientific proof that this will withstand future
research progress, but all we can do right now is to speculate about it.

The SAR/SHL usage was suggested by Andi Kleen.

Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
arch/x86/entry/entry_64.S | 10 +-
arch/x86/include/asm/current.h | 3 +-
arch/x86/include/asm/nospec-branch.h | 121 ++++++++++++++++++++++++--
arch/x86/kernel/asm-offsets.c | 3 +-
arch/x86/kvm/svm/vmenter.S | 1 +-
arch/x86/lib/retpoline.S | 31 +++++++-
6 files changed, 159 insertions(+), 10 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 4cc0125..15739a2 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -288,6 +288,7 @@ SYM_FUNC_END(__switch_to_asm)
SYM_CODE_START_NOALIGN(ret_from_fork)
UNWIND_HINT_EMPTY
ANNOTATE_NOENDBR // copy_thread
+ CALL_DEPTH_ACCOUNT
movq %rax, %rdi
call schedule_tail /* rdi: 'prev' task parameter */

@@ -332,7 +333,7 @@ SYM_CODE_START(xen_error_entry)
UNWIND_HINT_FUNC
PUSH_AND_CLEAR_REGS save_ret=1
ENCODE_FRAME_POINTER 8
- UNTRAIN_RET
+ UNTRAIN_RET_FROM_CALL
RET
SYM_CODE_END(xen_error_entry)

@@ -977,7 +978,7 @@ SYM_CODE_START(paranoid_entry)
* CR3 above, keep the old value in a callee saved register.
*/
IBRS_ENTER save_reg=%r15
- UNTRAIN_RET
+ UNTRAIN_RET_FROM_CALL

RET
SYM_CODE_END(paranoid_entry)
@@ -1062,7 +1063,7 @@ SYM_CODE_START(error_entry)
/* We have user CR3. Change to kernel CR3. */
SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
IBRS_ENTER
- UNTRAIN_RET
+ UNTRAIN_RET_FROM_CALL

leaq 8(%rsp), %rdi /* arg0 = pt_regs pointer */
/* Put us onto the real thread stack. */
@@ -1097,6 +1098,7 @@ SYM_CODE_START(error_entry)
*/
.Lerror_entry_done_lfence:
FENCE_SWAPGS_KERNEL_ENTRY
+ CALL_DEPTH_ACCOUNT
leaq 8(%rsp), %rax /* return pt_regs pointer */
ANNOTATE_UNRET_END
RET
@@ -1115,7 +1117,7 @@ SYM_CODE_START(error_entry)
FENCE_SWAPGS_USER_ENTRY
SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
IBRS_ENTER
- UNTRAIN_RET
+ UNTRAIN_RET_FROM_CALL

/*
* Pretend that the exception came from user mode: set up pt_regs
diff --git a/arch/x86/include/asm/current.h b/arch/x86/include/asm/current.h
index b89aba0..a1168e7 100644
--- a/arch/x86/include/asm/current.h
+++ b/arch/x86/include/asm/current.h
@@ -17,6 +17,9 @@ struct pcpu_hot {
struct task_struct *current_task;
int preempt_count;
int cpu_number;
+#ifdef CONFIG_CALL_DEPTH_TRACKING
+ u64 call_depth;
+#endif
unsigned long top_of_stack;
void *hardirq_stack_ptr;
u16 softirq_pending;
diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
index f10ca33..d4be826 100644
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -12,8 +12,83 @@
#include <asm/msr-index.h>
#include <asm/unwind_hints.h>
#include <asm/percpu.h>
+#include <asm/current.h>

-#define RETPOLINE_THUNK_SIZE 32
+/*
+ * Call depth tracking for Intel SKL CPUs to address the RSB underflow
+ * issue in software.
+ *
+ * The tracking does not use a counter. It uses arithmetic shift
+ * right on call entry and logical shift left on return.
+ *
+ * The depth tracking variable is initialized to 0x8000.... when the call
+ * depth is zero. The arithmetic shift right sign extends the MSB and
+ * saturates after the 12th call. The shift count is 5 for both directions
+ * so the tracking covers 12 nested calls.
+ *
+ * Call
+ * 0: 0x8000000000000000 0x0000000000000000
+ * 1: 0xfc00000000000000 0xf000000000000000
+ * ...
+ * 11: 0xfffffffffffffff8 0xfffffffffffffc00
+ * 12: 0xffffffffffffffff 0xffffffffffffffe0
+ *
+ * After a return buffer fill the depth is credited 12 calls before the
+ * next stuffing has to take place.
+ *
+ * There is an inaccuracy for situations like this:
+ *
+ * 10 calls
+ * 5 returns
+ * 3 calls
+ * 4 returns
+ * 3 calls
+ * ....
+ *
+ * The shift count might cause this to be off by one in either direction,
+ * but there is still a cushion vs. the RSB depth. The algorithm does not
+ * claim to be perfect and it can be speculated around by the CPU, but it
+ * is considered that it obfuscates the problem enough to make exploitation
+ * extremely difficult.
+ */
+#define RET_DEPTH_SHIFT 5
+#define RSB_RET_STUFF_LOOPS 16
+#define RET_DEPTH_INIT 0x8000000000000000ULL
+#define RET_DEPTH_INIT_FROM_CALL 0xfc00000000000000ULL
+#define RET_DEPTH_CREDIT 0xffffffffffffffffULL
+
+#if defined(CONFIG_CALL_DEPTH_TRACKING) && !defined(COMPILE_OFFSETS)
+
+#include <asm/asm-offsets.h>
+
+#define CREDIT_CALL_DEPTH \
+ movq $-1, PER_CPU_VAR(pcpu_hot + X86_call_depth);
+
+#define ASM_CREDIT_CALL_DEPTH \
+ movq $-1, PER_CPU_VAR(pcpu_hot + X86_call_depth);
+
+#define RESET_CALL_DEPTH \
+ mov $0x80, %rax; \
+ shl $56, %rax; \
+ movq %rax, PER_CPU_VAR(pcpu_hot + X86_call_depth);
+
+#define RESET_CALL_DEPTH_FROM_CALL \
+ mov $0xfc, %rax; \
+ shl $56, %rax; \
+ movq %rax, PER_CPU_VAR(pcpu_hot + X86_call_depth);
+
+#define INCREMENT_CALL_DEPTH \
+ sarq $5, %gs:pcpu_hot + X86_call_depth;
+
+#define ASM_INCREMENT_CALL_DEPTH \
+ sarq $5, PER_CPU_VAR(pcpu_hot + X86_call_depth);
+
+#else
+#define CREDIT_CALL_DEPTH
+#define RESET_CALL_DEPTH
+#define INCREMENT_CALL_DEPTH
+#define RESET_CALL_DEPTH_FROM_CALL
+#endif

/*
* Fill the CPU return stack buffer.
@@ -32,6 +107,7 @@
* from C via asm(".include <asm/nospec-branch.h>") but let's not go there.
*/

+#define RETPOLINE_THUNK_SIZE 32
#define RSB_CLEAR_LOOPS 32 /* To forcibly overwrite all entries */

/*
@@ -60,7 +136,8 @@
dec reg; \
jnz 771b; \
/* barrier for jnz misprediction */ \
- lfence;
+ lfence; \
+ ASM_CREDIT_CALL_DEPTH
#else
/*
* i386 doesn't unconditionally have LFENCE, as such it can't
@@ -185,11 +262,32 @@
* where we have a stack but before any RET instruction.
*/
.macro UNTRAIN_RET
-#if defined(CONFIG_CPU_UNRET_ENTRY) || defined(CONFIG_CPU_IBPB_ENTRY)
+#if defined(CONFIG_CPU_UNRET_ENTRY) || defined(CONFIG_CPU_IBPB_ENTRY) || \
+ defined(CONFIG_X86_FEATURE_CALL_DEPTH)
ANNOTATE_UNRET_END
- ALTERNATIVE_2 "", \
- CALL_ZEN_UNTRAIN_RET, X86_FEATURE_UNRET, \
- "call entry_ibpb", X86_FEATURE_ENTRY_IBPB
+ ALTERNATIVE_3 "", \
+ CALL_ZEN_UNTRAIN_RET, X86_FEATURE_UNRET, \
+ "call entry_ibpb", X86_FEATURE_ENTRY_IBPB, \
+ __stringify(RESET_CALL_DEPTH), X86_FEATURE_CALL_DEPTH
+#endif
+.endm
+
+.macro UNTRAIN_RET_FROM_CALL
+#if defined(CONFIG_CPU_UNRET_ENTRY) || defined(CONFIG_CPU_IBPB_ENTRY) || \
+ defined(CONFIG_X86_FEATURE_CALL_DEPTH)
+ ANNOTATE_UNRET_END
+ ALTERNATIVE_3 "", \
+ CALL_ZEN_UNTRAIN_RET, X86_FEATURE_UNRET, \
+ "call entry_ibpb", X86_FEATURE_ENTRY_IBPB, \
+ __stringify(RESET_CALL_DEPTH_FROM_CALL), X86_FEATURE_CALL_DEPTH
+#endif
+.endm
+
+
+.macro CALL_DEPTH_ACCOUNT
+#ifdef CONFIG_CALL_DEPTH_TRACKING
+ ALTERNATIVE "", \
+ __stringify(ASM_INCREMENT_CALL_DEPTH), X86_FEATURE_CALL_DEPTH
#endif
.endm

@@ -214,6 +312,17 @@ extern void (*x86_return_thunk)(void);
#define x86_return_thunk (&__x86_return_thunk)
#endif

+#ifdef CONFIG_CALL_DEPTH_TRACKING
+extern void __x86_return_skl(void);
+
+static inline void x86_set_skl_return_thunk(void)
+{
+ x86_return_thunk = &__x86_return_skl;
+}
+#else
+static inline void x86_set_skl_return_thunk(void) {}
+#endif
+
#ifdef CONFIG_RETPOLINE

#define GEN(reg) \
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index a982431..13afdbb 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -110,6 +110,9 @@ static void __used common(void)
OFFSET(TSS_sp2, tss_struct, x86_tss.sp2);

OFFSET(X86_top_of_stack, pcpu_hot, top_of_stack);
+#ifdef CONFIG_CALL_DEPTH_TRACKING
+ OFFSET(X86_call_depth, pcpu_hot, call_depth);
+#endif

if (IS_ENABLED(CONFIG_KVM_INTEL)) {
BLANK();
diff --git a/arch/x86/kvm/svm/vmenter.S b/arch/x86/kvm/svm/vmenter.S
index 723f853..09eacf1 100644
--- a/arch/x86/kvm/svm/vmenter.S
+++ b/arch/x86/kvm/svm/vmenter.S
@@ -1,6 +1,7 @@
/* SPDX-License-Identifier: GPL-2.0 */
#include <linux/linkage.h>
#include <asm/asm.h>
+#include <asm/asm-offsets.h>
#include <asm/bitsperlong.h>
#include <asm/kvm_vcpu_regs.h>
#include <asm/nospec-branch.h>
diff --git a/arch/x86/lib/retpoline.S b/arch/x86/lib/retpoline.S
index 073289a..1e79ecc 100644
--- a/arch/x86/lib/retpoline.S
+++ b/arch/x86/lib/retpoline.S
@@ -5,9 +5,11 @@
#include <asm/dwarf2.h>
#include <asm/cpufeatures.h>
#include <asm/alternative.h>
+#include <asm/asm-offsets.h>
#include <asm/export.h>
#include <asm/nospec-branch.h>
#include <asm/unwind_hints.h>
+#include <asm/percpu.h>
#include <asm/frame.h>

.section .text.__x86.indirect_thunk
@@ -140,3 +142,32 @@ __EXPORT_THUNK(zen_untrain_ret)
EXPORT_SYMBOL(__x86_return_thunk)

#endif /* CONFIG_RETHUNK */
+
+#ifdef CONFIG_CALL_DEPTH_TRACKING
+
+ .align 64
+SYM_FUNC_START(__x86_return_skl)
+ ANNOTATE_NOENDBR
+ /* Keep the hotpath in a 16byte I-fetch */
+ shlq $5, PER_CPU_VAR(pcpu_hot + X86_call_depth)
+ jz 1f
+ ANNOTATE_UNRET_SAFE
+ ret
+ int3
+1:
+ .rept 16
+ ANNOTATE_INTRA_FUNCTION_CALL
+ call 2f
+ int3
+2:
+ .endr
+ add $(8*16), %rsp
+
+ CREDIT_CALL_DEPTH
+
+ ANNOTATE_UNRET_SAFE
+ ret
+ int3
+SYM_FUNC_END(__x86_return_skl)
+
+#endif /* CONFIG_CALL_DEPTH_TRACKING */

2022-10-20 23:19:49

by Nathan Chancellor

Subject: Re: [PATCH v3 48/59] x86/retbleed: Add SKL return thunk

Hi Peter and Thomas,

On Thu, Sep 15, 2022 at 01:11:27PM +0200, Peter Zijlstra wrote:
> From: Thomas Gleixner <[email protected]>
>
> To address the Intel SKL RSB underflow issue in software it's required to
> do call depth tracking.
>
> Provide a return thunk for call depth tracking on Intel SKL CPUs.
>
> The tracking does not use a counter. It uses arithmetic shift
> right on call entry and logical shift left on return.
>
> The depth tracking variable is initialized to 0x8000.... when the call
> depth is zero. The arithmetic shift right sign extends the MSB and
> saturates after the 12th call. The shift count is 5 so the tracking covers
> 12 nested calls. On return the variable is shifted left logically so it
> becomes zero again.
>
> CALL RET
> 0: 0x8000000000000000 0x0000000000000000
> 1: 0xfc00000000000000 0xf000000000000000
> ...
> 11: 0xfffffffffffffff8 0xfffffffffffffc00
> 12: 0xffffffffffffffff 0xffffffffffffffe0
>
> After a return buffer fill the depth is credited 12 calls before the next
> stuffing has to take place.
>
> There is an inaccuracy for situations like this:
>
> 10 calls
> 5 returns
> 3 calls
> 4 returns
> 3 calls
> ....
>
> The shift count might cause this to be off by one in either direction, but
> there is still a cushion vs. the RSB depth. The algorithm does not claim to
> be perfect, but it should obfuscate the problem enough to make exploitation
> extremely difficult.
>
> The theory behind this is:
>
> RSB is a stack with depth 16 which is filled on every call. On the return
> path speculation "pops" entries to speculate down the call chain. Once the
> speculative RSB is empty it switches to other predictors, e.g. the Branch
> History Buffer, which can be mistrained by user space and misguide the
> speculation path to a gadget.
>
> Call depth tracking is designed to break this speculation path by stuffing
> speculation trap calls into the RSB which are never getting a corresponding
> return executed. This stalls the prediction path until it gets resteered.
>
> The assumption is that stuffing at the 12th return is sufficient to break
> the speculation before it hits the underflow and the fallback to the other
> predictors. Testing confirms that it works. Johannes, one of the retbleed
> researchers, tried to attack this approach but failed.
>
> There is obviously no scientific proof that this will withstand future
> research progress, but all we can do right now is to speculate about it.
>
> The SAR/SHL usage was suggested by Andi Kleen.
>
> Signed-off-by: Thomas Gleixner <[email protected]>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>

This commit is now in -next as commit 5d8213864ade ("x86/retbleed: Add
SKL return thunk"). I just bisected an immediate reboot on my AMD test
system when starting a virtual machine with QEMU + KVM to it (see the
bisect log below). My Intel test systems do not show this.
Unfortunately, I do not have much more information, as there are no logs
in journalctl, which makes sense as the reboot occurs immediately after
I hit the enter key for the QEMU command.

If there is any further information I can provide or patches I can test
for further debugging, I am more than happy to do so.

Cheers,
Nathan

# bad: [acee3e83b493505058d1e48fce167f623dac1a05] Add linux-next specific files for 20221020
# good: [aae703b02f92bde9264366c545e87cec451de471] Merge tag 'for-6.1-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
git bisect start 'acee3e83b493505058d1e48fce167f623dac1a05' 'aae703b02f92bde9264366c545e87cec451de471'
# good: [fb28ef66ac24412f3afde4587c66c1ff5396df53] Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next.git
git bisect good fb28ef66ac24412f3afde4587c66c1ff5396df53
# bad: [937c5137a567b139c5e598b85501c33e717630cf] Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git
git bisect bad 937c5137a567b139c5e598b85501c33e717630cf
# good: [4f134b9b72eae392d0b76087f66c6f8be0e1777e] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound.git
git bisect good 4f134b9b72eae392d0b76087f66c6f8be0e1777e
# good: [eed519a2892f2cd630d51d0627fc811f021b38c4] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux.git
git bisect good eed519a2892f2cd630d51d0627fc811f021b38c4
# good: [e7d4821209b2e01017d90750c516f3b2ff318825] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/krzk/linux-dt.git
git bisect good e7d4821209b2e01017d90750c516f3b2ff318825
# bad: [b2e9dfe54be4d023124d588d6f03d16a9c0d2507] x86/bpf: Emit call depth accounting if required
git bisect bad b2e9dfe54be4d023124d588d6f03d16a9c0d2507
# good: [5b71ac8a2a3185da34a6556e791b533b48183a41] x86: Fixup asm-offsets duplicate
git bisect good 5b71ac8a2a3185da34a6556e791b533b48183a41
# good: [fe54d0793796ccdb213d8ea7bff0b49903b6afaa] x86/alternatives: Provide text_poke_copy_locked()
git bisect good fe54d0793796ccdb213d8ea7bff0b49903b6afaa
# bad: [5d8213864ade86b48fc492584ea86d65a62f892e] x86/retbleed: Add SKL return thunk
git bisect bad 5d8213864ade86b48fc492584ea86d65a62f892e
# good: [e81dc127ef69887c72735a3e3868930e2bf313ed] x86/callthunks: Add call patching for call depth tracking
git bisect good e81dc127ef69887c72735a3e3868930e2bf313ed
# good: [770ae1b709528a6a173b5c7b183818ee9b45e376] x86/returnthunk: Allow different return thunks
git bisect good 770ae1b709528a6a173b5c7b183818ee9b45e376
# good: [52354973573cc260ff2fc661cb28ff8eaa7b879b] x86/asm: Provide ALTERNATIVE_3
git bisect good 52354973573cc260ff2fc661cb28ff8eaa7b879b
# first bad commit: [5d8213864ade86b48fc492584ea86d65a62f892e] x86/retbleed: Add SKL return thunk

2022-10-21 10:12:59

by Peter Zijlstra

Subject: Re: [PATCH v3 48/59] x86/retbleed: Add SKL return thunk

On Thu, Oct 20, 2022 at 04:10:28PM -0700, Nathan Chancellor wrote:
> This commit is now in -next as commit 5d8213864ade ("x86/retbleed: Add
> SKL return thunk"). I just bisected an immediate reboot on my AMD test
> system when starting a virtual machine with QEMU + KVM to it (see the
> bisect log below). My Intel test systems do not show this.
> Unfortunately, I do not have much more information, as there are no logs
> in journalctl, which makes sense as the reboot occurs immediately after
> I hit the enter key for the QEMU command.
>
> If there is any further information I can provide or patches I can test
> for further debugging, I am more than happy to do so.

Moo :-(

you happen to have a .config for me?

2022-10-21 15:28:19

by Nathan Chancellor

Subject: Re: [PATCH v3 48/59] x86/retbleed: Add SKL return thunk

On Fri, Oct 21, 2022 at 11:53:09AM +0200, Peter Zijlstra wrote:
> On Thu, Oct 20, 2022 at 04:10:28PM -0700, Nathan Chancellor wrote:
> > This commit is now in -next as commit 5d8213864ade ("x86/retbleed: Add
> > SKL return thunk"). I just bisected an immediate reboot on my AMD test
> > system when starting a virtual machine with QEMU + KVM to it (see the
> > bisect log below). My Intel test systems do not show this.
> > Unfortunately, I do not have much more information, as there are no logs
> > in journalctl, which makes sense as the reboot occurs immediately after
> > I hit the enter key for the QEMU command.
> >
> > If there is any further information I can provide or patches I can test
> > for further debugging, I am more than happy to do so.
>
> Moo :-(
>
> you happen to have a .config for me?

Sure thing, sorry I did not provide it in the first place! Attached. It
has been run through localmodconfig for the particular machine but I
assume the core pieces should still be present.

Cheers,
Nathan


Attachments:
.config (178.42 kB)