From: Lai Jiangshan <[email protected]>
Much of the ASM code in entry_64.S can be rewritten in C if it can be written
to be non-instrumentable and is called in the right order with respect to
whether CR3/gsbase has been switched to the kernel CR3/gsbase.
This patchset converts some of it to C code.
Patch 11 converts the non-paranoid entry (the entry for interrupts/
non-IST exceptions/IST exceptions from user mode) to C code. Patches 1-10
are preparation for it and patches 12-13 are cleanups for it. Patch 1
might fix a defect.
Patch 22 converts the paranoid entry/exit to C code. Patches 14-21 are
preparation for it and patch 23 is a cleanup for it.
Patch 24 converts a small part of the syscall ASM code to C code, namely
the check for whether SYSRET can be used to return to userspace.
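For reference, that eligibility check amounts to roughly the following.
This is a minimal C sketch, not the actual patch 24; it assumes the usual
kernel definitions (struct pt_regs, __USER_CS/__USER_DS, TASK_SIZE_MAX and
the X86_EFLAGS_* flags):

/*
 * Sketch only: SYSRET can be used for the return to userspace only when
 * the saved register state still matches what the SYSCALL instruction
 * produced and nothing that only IRET can restore is pending.
 */
static __always_inline bool can_sysret(struct pt_regs *regs)
{
	/* SYSCALL saved RIP in RCX and RFLAGS in R11; SYSRET restores from them */
	if (unlikely(regs->cx != regs->ip || regs->r11 != regs->flags))
		return false;

	/* CS and SS must be the selectors that SYSRET reloads */
	if (unlikely(regs->cs != __USER_CS || regs->ss != __USER_DS))
		return false;

	/* SYSRET to a non-canonical RIP would #GP in kernel mode on Intel CPUs */
	if (unlikely(regs->ip >= TASK_SIZE_MAX))
		return false;

	/* SYSRET cannot restore RF and mishandles TF; let IRET deal with those */
	if (unlikely(regs->flags & (X86_EFLAGS_RF | X86_EFLAGS_TF)))
		return false;

	return true;
}

The syscall exit ASM would then branch to the SYSRET or IRET return path
based on the result of such a check.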
Some other paths could also be moved to C code, for example the non-paranoid
exit and the syscall entry/exit. The PTI handling for them could be done in
C code, but it would require pt_regs to be copied/pushed to the entry stack,
which means the C code would not be efficient.
When converting ASM to C, most of the effort went into keeping them the same.
Almost no creativity was involved. The code is kept as close to the ASM as
possible and no functional change is intended, unless I misunderstood the
ASM code. The functions called by the C entry code have been checked to
ensure they are noinstr or __always_inline. Some of them have more than one
definition and need extra care from reviewers.
The comments in the ASM code are also copied to the right places in the C code.
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Lai Jiangshan (24):
x86/traps: Remove stack-protector from traps.c
x86/traps: Move arch/x86/kernel/traps.c to arch/x86/entry/
x86/traps: Move declaration of native_irq_return_iret up
x86/entry: Expose the address of .Lgs_change to traps.c
x86/entry: Introduce __entry_text for entry code written in C
x86/entry: Move PTI_USER_* to arch/x86/include/asm/processor-flags.h
x86: Mark __native_read_cr3() & native_write_cr3() as __always_inline
x86/traps: Add C version of SWITCH_TO_KERNEL_CR3 as
switch_to_kernel_cr3()
x86/traps: Add fence_swapgs_{user,kernel}_entry()
x86/traps: Move pt_regs only in fixup_bad_iret()
x86/entry: Replace the most of asm code of error_entry to C code
x86/traps: Reconstruct pt_regs on task stack directly in
fixup_bad_iret()
x86/traps: Mark sync_regs() and fixup_bad_iret() as static
__always_inline
x86/entry: Make paranoid_exit() callable
x86/entry: Call paranoid_exit() in asm_exc_nmi()
x86/entry: Use skip_rdi instead of save_ret for PUSH_AND_CLEAR_REGS
x86/entry: Introduce struct ist_regs
x86/entry: Add the C version ist_switch_to_kernel_cr3()
x86/entry: Add the C version ist_restore_cr3()
x86/entry: Add the C version get_percpu_base()
x86/entry: Add the C version ist_switch_to_kernel_gsbase()
x86/entry: Implement and use do_paranoid_entry() and paranoid_exit()
x86/entry: Remove the unused ASM macros
x86/syscall/64: Move the checking for sysret to C code
arch/x86/entry/Makefile | 5 +-
arch/x86/entry/calling.h | 144 +--------
arch/x86/entry/common.c | 73 ++++-
arch/x86/entry/entry_64.S | 366 ++++-------------------
arch/x86/{kernel => entry}/traps.c | 397 +++++++++++++++++++++++--
arch/x86/include/asm/processor-flags.h | 15 +
arch/x86/include/asm/special_insns.h | 4 +-
arch/x86/include/asm/syscall.h | 2 +-
arch/x86/include/asm/traps.h | 36 ++-
arch/x86/kernel/Makefile | 2 +-
arch/x86/kernel/asm-offsets_64.c | 2 +
11 files changed, 554 insertions(+), 492 deletions(-)
rename arch/x86/{kernel => entry}/traps.c (74%)
--
2.19.1.6.gb485710b
From: Lai Jiangshan <[email protected]>
The code between "call exc_nmi" and nmi_restore is the same as
paranoid_exit(), so just use paranoid_exit() instead of the open-coded
duplicate.
No functional change intended.
Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/entry/entry_64.S | 34 +++++-----------------------------
1 file changed, 5 insertions(+), 29 deletions(-)
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 24b121d2bf61..ac67a1109c9c 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -921,8 +921,7 @@ SYM_CODE_END(paranoid_entry)
/*
* "Paranoid" exit path from exception stack. This is invoked
- * only on return from non-NMI IST interrupts that came
- * from kernel space.
+ * only on return from IST interrupts that came from kernel space.
*
* We may be returning to very strange contexts (e.g. very early
* in syscall entry), so checking for preemption here would
@@ -1288,11 +1287,7 @@ end_repeat_nmi:
pushq $-1 /* ORIG_RAX: no syscall to restart */
/*
- * Use paranoid_entry to handle SWAPGS, but no need to use paranoid_exit
- * as we should not be calling schedule in NMI context.
- * Even with normal interrupts enabled. An NMI should not be
- * setting NEED_RESCHED or anything that normal interrupts and
- * exceptions might do.
+ * Use paranoid_entry to handle SWAPGS and CR3.
*/
call paranoid_entry
UNWIND_HINT_REGS
@@ -1301,31 +1296,12 @@ end_repeat_nmi:
movq $-1, %rsi
call exc_nmi
- /* Always restore stashed CR3 value (see paranoid_entry) */
- RESTORE_CR3 scratch_reg=%r15 save_reg=%r14
-
/*
- * The above invocation of paranoid_entry stored the GSBASE
- * related information in R/EBX depending on the availability
- * of FSGSBASE.
- *
- * If FSGSBASE is enabled, restore the saved GSBASE value
- * unconditionally, otherwise take the conditional SWAPGS path.
+ * Use paranoid_exit to handle SWAPGS and CR3, but no need to use
+ * restore_regs_and_return_to_kernel as we must handle nested NMI.
*/
- ALTERNATIVE "jmp nmi_no_fsgsbase", "", X86_FEATURE_FSGSBASE
-
- wrgsbase %rbx
- jmp nmi_restore
-
-nmi_no_fsgsbase:
- /* EBX == 0 -> invoke SWAPGS */
- testl %ebx, %ebx
- jnz nmi_restore
-
-nmi_swapgs:
- swapgs
+ call paranoid_exit
-nmi_restore:
POP_REGS
/*
--
2.19.1.6.gb485710b
From: Lai Jiangshan <[email protected]>
Implement the C version of RESTORE_CR3().
No functional difference intended, except that the ASM code uses bit test
and clear operations while the C version uses a mask check and 'AND'
operations. The resulting assembly of both versions is very similar.
Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/entry/traps.c | 46 ++++++++++++++++++++++++++++++++++++++++++
1 file changed, 46 insertions(+)
diff --git a/arch/x86/entry/traps.c b/arch/x86/entry/traps.c
index 9d6753d0b9fc..e912bfbd2a61 100644
--- a/arch/x86/entry/traps.c
+++ b/arch/x86/entry/traps.c
@@ -65,6 +65,7 @@
#ifdef CONFIG_X86_64
#include <asm/x86_init.h>
#include <asm/proto.h>
+#include <asm/tlbflush.h>
extern unsigned char native_irq_return_iret[];
extern unsigned char asm_load_gs_index_gs_change[];
@@ -799,9 +800,54 @@ static __always_inline unsigned long ist_switch_to_kernel_cr3(void)
return cr3;
}
+
+static __always_inline void pti_switch_to_user_cr3(unsigned long user_cr3)
+{
+#define KERN_PCID_MASK (CR3_PCID_MASK & ~PTI_USER_PCID_MASK)
+
+ if (static_cpu_has(X86_FEATURE_PCID)) {
+ int pcid = user_cr3 & KERN_PCID_MASK;
+ unsigned short pcid_mask = 1ull << pcid;
+
+ /*
+ * Check if there's a pending flush for the user ASID we're
+ * about to set.
+ */
+ if (!(this_cpu_read(cpu_tlbstate.user_pcid_flush_mask) & pcid_mask))
+ user_cr3 |= X86_CR3_PCID_NOFLUSH;
+ else
+ this_cpu_and(cpu_tlbstate.user_pcid_flush_mask, ~pcid_mask);
+ }
+ native_write_cr3(user_cr3);
+}
+
+static __always_inline void ist_restore_cr3(unsigned long cr3)
+{
+ if (!static_cpu_has(X86_FEATURE_PTI))
+ return;
+
+ if (unlikely(cr3 & PTI_USER_PGTABLE_MASK)) {
+ pti_switch_to_user_cr3(cr3);
+ return;
+ }
+
+ /*
+ * KERNEL pages can always resume with NOFLUSH as we do
+ * explicit flushes.
+ */
+ if (static_cpu_has(X86_FEATURE_PCID))
+ cr3 |= X86_CR3_PCID_NOFLUSH;
+
+ /*
+ * The CR3 write could be avoided when not changing its value,
+ * but would require a CR3 read.
+ */
+ native_write_cr3(cr3);
+}
#else
static __always_inline void switch_to_kernel_cr3(void) {}
static __always_inline unsigned long ist_switch_to_kernel_cr3(void) { return 0; }
+static __always_inline void ist_restore_cr3(unsigned long cr3) {}
#endif
/*
--
2.19.1.6.gb485710b
From: Lai Jiangshan <[email protected]>
The address of .Lgs_change will be used in traps.c in a later patch when
some entry code is implemented in traps.c. So expose the address of
.Lgs_change to traps.c in preparation.
The label .Lgs_change is still needed in the ASM code for the extable.
Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/entry/entry_64.S | 3 ++-
arch/x86/entry/traps.c | 1 +
2 files changed, 3 insertions(+), 1 deletion(-)
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index e38a4cf795d9..9164e85b36b8 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -729,6 +729,7 @@ _ASM_NOKPROBE(common_interrupt_return)
SYM_FUNC_START(asm_load_gs_index)
FRAME_BEGIN
swapgs
+SYM_INNER_LABEL(asm_load_gs_index_gs_change, SYM_L_GLOBAL)
.Lgs_change:
movl %edi, %gs
2: ALTERNATIVE "", "mfence", X86_BUG_SWAPGS_FENCE
@@ -1011,7 +1012,7 @@ SYM_CODE_START_LOCAL(error_entry)
movl %ecx, %eax /* zero extend */
cmpq %rax, RIP+8(%rsp)
je .Lbstep_iret
- cmpq $.Lgs_change, RIP+8(%rsp)
+ cmpq $asm_load_gs_index_gs_change, RIP+8(%rsp)
jne .Lerror_entry_done_lfence
/*
diff --git a/arch/x86/entry/traps.c b/arch/x86/entry/traps.c
index 7869343473b7..f71db7934f90 100644
--- a/arch/x86/entry/traps.c
+++ b/arch/x86/entry/traps.c
@@ -67,6 +67,7 @@
#include <asm/proto.h>
extern unsigned char native_irq_return_iret[];
+extern unsigned char asm_load_gs_index_gs_change[];
#else
#include <asm/processor-flags.h>
--
2.19.1.6.gb485710b
From: Lai Jiangshan <[email protected]>
Implement the second half of paranoid_entry(), whose job is to switch to
the kernel gsbase.
No functional difference intended.
Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/entry/traps.c | 48 ++++++++++++++++++++++++++++++++++++++++++
1 file changed, 48 insertions(+)
diff --git a/arch/x86/entry/traps.c b/arch/x86/entry/traps.c
index 9f4bc52410d0..b5c92b4e0cb5 100644
--- a/arch/x86/entry/traps.c
+++ b/arch/x86/entry/traps.c
@@ -981,6 +981,54 @@ static __always_inline unsigned long get_percpu_base(void)
return pcpu_unit_offsets;
}
#endif
+
+/*
+ * Handling GSBASE depends on the availability of FSGSBASE.
+ *
+ * Without FSGSBASE the kernel enforces that negative GSBASE
+ * values indicate kernel GSBASE. With FSGSBASE no assumptions
+ * can be made about the GSBASE value when entering from user
+ * space.
+ */
+static __always_inline unsigned long ist_switch_to_kernel_gsbase(void)
+{
+ unsigned long gsbase;
+
+ if (static_cpu_has(X86_FEATURE_FSGSBASE)) {
+ /*
+ * Read the current GSBASE for return.
+ * Retrieve and set the current CPUs kernel GSBASE.
+ *
+ * The unconditional write to GS base below ensures that
+ * no subsequent loads based on a mispredicted GS base can
+ * happen, therefore no LFENCE is needed here.
+ */
+ gsbase = rdgsbase();
+ wrgsbase(get_percpu_base());
+ return gsbase;
+ }
+
+ gsbase = __rdmsr(MSR_GS_BASE);
+
+ /*
+ * The kernel-enforced convention is a negative GSBASE indicates
+ * a kernel value. No SWAPGS needed on entry and exit.
+ */
+ if ((long)gsbase < 0)
+ return 1;
+
+ native_swapgs();
+
+ /*
+ * The above ist_switch_to_kernel_cr3() doesn't do an unconditional
+ * CR3 write, even in the PTI case. So do an lfence to prevent GS
+ * speculation, regardless of whether PTI is enabled.
+ */
+ fence_swapgs_kernel_entry();
+
+ /* SWAPGS required on exit */
+ return 0;
+}
#endif
static bool is_sysenter_singlestep(struct pt_regs *regs)
--
2.19.1.6.gb485710b
From: Lai Jiangshan <[email protected]>
fence_swapgs_{user,kernel}_entry() in traps.c are the same as the ASM
macros FENCE_SWAPGS_{USER,KERNEL}_ENTRY.
fence_swapgs_user_entry is used in the user entry swapgs code path,
to prevent a speculative swapgs when coming from kernel space.
fence_swapgs_kernel_entry is used in the kernel entry non-swapgs code
path, to prevent the swapgs from getting speculatively skipped when
coming from user space.
Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/entry/traps.c | 20 ++++++++++++++++++++
1 file changed, 20 insertions(+)
diff --git a/arch/x86/entry/traps.c b/arch/x86/entry/traps.c
index 9b7d0f15402e..efec3b4eaa5f 100644
--- a/arch/x86/entry/traps.c
+++ b/arch/x86/entry/traps.c
@@ -789,6 +789,26 @@ static __always_inline void switch_to_kernel_cr3(void)
#else
static __always_inline void switch_to_kernel_cr3(void) {}
#endif
+
+/*
+ * Mitigate Spectre v1 for conditional swapgs code paths.
+ *
+ * fence_swapgs_user_entry is used in the user entry swapgs code path, to
+ * prevent a speculative swapgs when coming from kernel space.
+ *
+ * fence_swapgs_kernel_entry is used in the kernel entry non-swapgs code path,
+ * to prevent the swapgs from getting speculatively skipped when coming from
+ * user space.
+ */
+static __always_inline void fence_swapgs_user_entry(void)
+{
+ alternative("", "lfence", X86_FEATURE_FENCE_SWAPGS_USER);
+}
+
+static __always_inline void fence_swapgs_kernel_entry(void)
+{
+ alternative("", "lfence", X86_FEATURE_FENCE_SWAPGS_KERNEL);
+}
#endif
static bool is_sysenter_singlestep(struct pt_regs *regs)
--
2.19.1.6.gb485710b
From: Lai Jiangshan <[email protected]>
struct ist_regs is the topmost stack frame for IST interrupts; it
consists of struct pt_regs and other fields which will be added in
later patches.
Make vc_switch_off_ist() take ist_regs as its argument; it will switch
the whole ist_regs if needed.
Make the frame before calling paranoid_entry() and paranoid_exit() be
struct ist_regs.
This patch prepares for converting paranoid_entry() and paranoid_exit()
into C code, which will need the additional fields to store the results
of paranoid_entry() and to use them in paranoid_exit().
The C interrupt handlers don't use struct ist_regs since they don't need
the additional fields; they still use pt_regs.
No functional change intended, and IST_pt_regs is still 0.
Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/entry/entry_64.S | 37 ++++++++++++++++++--------------
arch/x86/entry/traps.c | 16 +++++++-------
arch/x86/include/asm/traps.h | 10 ++++++++-
arch/x86/kernel/asm-offsets_64.c | 2 ++
4 files changed, 40 insertions(+), 25 deletions(-)
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index e968074046c3..1ae10ca351f4 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -431,13 +431,14 @@ SYM_CODE_START(\asmsym)
/* paranoid_entry returns GS information for paranoid_exit in EBX. */
call paranoid_entry
- UNWIND_HINT_REGS
+ UNWIND_HINT_REGS offset=IST_pt_regs
- movq %rsp, %rdi /* pt_regs pointer */
+ leaq IST_pt_regs(%rsp), %rdi /* pt_regs pointer */
call \cfunc
call paranoid_exit
+ addq $IST_pt_regs, %rsp /* put %rsp back to pt_regs */
jmp restore_regs_and_return_to_kernel
/* Switch to the regular task stack and use the noist entry point */
@@ -488,7 +489,7 @@ SYM_CODE_START(\asmsym)
*/
call paranoid_entry
- UNWIND_HINT_REGS
+ UNWIND_HINT_REGS offset=IST_pt_regs
/*
* Switch off the IST stack to make it free for nested exceptions. The
@@ -496,17 +497,17 @@ SYM_CODE_START(\asmsym)
* stack if it is safe to do so. If not it switches to the VC fall-back
* stack.
*/
- movq %rsp, %rdi /* pt_regs pointer */
+ movq %rsp, %rdi /* ist_regs pointer */
call vc_switch_off_ist
movq %rax, %rsp /* Switch to new stack */
- UNWIND_HINT_REGS
+ UNWIND_HINT_REGS offset=IST_pt_regs
- /* Update pt_regs */
- movq ORIG_RAX(%rsp), %rsi /* get error code into 2nd argument*/
- movq $-1, ORIG_RAX(%rsp) /* no syscall to restart */
+ leaq IST_pt_regs(%rsp), %rdi /* pt_regs pointer */
- movq %rsp, %rdi /* pt_regs pointer */
+ /* Update pt_regs */
+ movq ORIG_RAX(%rdi), %rsi /* get error code into 2nd argument*/
+ movq $-1, ORIG_RAX(%rdi) /* no syscall to restart */
call kernel_\cfunc
@@ -516,6 +517,7 @@ SYM_CODE_START(\asmsym)
* so it is definitely mapped even with PTI enabled.
*/
call paranoid_exit
+ addq $IST_pt_regs, %rsp /* put %rsp back to pt_regs */
jmp restore_regs_and_return_to_kernel
/* Switch to the regular task stack */
@@ -539,14 +541,15 @@ SYM_CODE_START(\asmsym)
/* paranoid_entry returns GS information for paranoid_exit in EBX. */
call paranoid_entry
- UNWIND_HINT_REGS
+ UNWIND_HINT_REGS offset=IST_pt_regs
- movq %rsp, %rdi /* pt_regs pointer into first argument */
- movq ORIG_RAX(%rsp), %rsi /* get error code into 2nd argument*/
- movq $-1, ORIG_RAX(%rsp) /* no syscall to restart */
+ leaq IST_pt_regs(%rsp), %rdi /* pt_regs pointer into first argument */
+ movq ORIG_RAX(%rdi), %rsi /* get error code into 2nd argument*/
+ movq $-1, ORIG_RAX(%rdi) /* no syscall to restart */
call \cfunc
call paranoid_exit
+ addq $IST_pt_regs, %rsp /* put %rsp back to pt_regs */
jmp restore_regs_and_return_to_kernel
_ASM_NOKPROBE(\asmsym)
@@ -852,8 +855,9 @@ SYM_CODE_START_LOCAL(paranoid_entry)
PUSH_AND_CLEAR_REGS skip_rdi=1
movq RDI(%rsp), %rsi /* temporarily store the return address in %rsi */
movq %rdi, RDI(%rsp) /* put %rdi onto pt_regs */
+ subq $IST_pt_regs, %rsp /* reserve room for ist_regs */
pushq %rsi /* put the return address onto the stack */
- ENCODE_FRAME_POINTER 8
+ ENCODE_FRAME_POINTER 8+IST_pt_regs
/*
* Always stash CR3 in %r14. This value will be restored,
@@ -1294,9 +1298,9 @@ end_repeat_nmi:
* Use paranoid_entry to handle SWAPGS and CR3.
*/
call paranoid_entry
- UNWIND_HINT_REGS
+ UNWIND_HINT_REGS offset=IST_pt_regs
- movq %rsp, %rdi
+ leaq IST_pt_regs(%rsp), %rdi /* pt_regs pointer */
movq $-1, %rsi
call exc_nmi
@@ -1305,6 +1309,7 @@ end_repeat_nmi:
* restore_regs_and_return_to_kernel as we must handle nested NMI.
*/
call paranoid_exit
+ addq $IST_pt_regs, %rsp /* put %rsp back to pt_regs */
POP_REGS
diff --git a/arch/x86/entry/traps.c b/arch/x86/entry/traps.c
index 40722a2f61ae..da55565d43ef 100644
--- a/arch/x86/entry/traps.c
+++ b/arch/x86/entry/traps.c
@@ -693,17 +693,17 @@ struct pt_regs *sync_regs(struct pt_regs *eregs)
}
#ifdef CONFIG_AMD_MEM_ENCRYPT
-asmlinkage __visible noinstr struct pt_regs *vc_switch_off_ist(struct pt_regs *regs)
+asmlinkage __visible noinstr struct ist_regs *vc_switch_off_ist(struct ist_regs *ist)
{
unsigned long sp, *stack;
struct stack_info info;
- struct pt_regs *regs_ret;
+ struct ist_regs *ist_ret;
/*
* In the SYSCALL entry path the RSP value comes from user-space - don't
* trust it and switch to the current kernel stack
*/
- if (ip_within_syscall_gap(regs)) {
+ if (ip_within_syscall_gap(&ist->regs)) {
sp = this_cpu_read(cpu_current_top_of_stack);
goto sync;
}
@@ -713,7 +713,7 @@ asmlinkage __visible noinstr struct pt_regs *vc_switch_off_ist(struct pt_regs *r
* happened from a safe stack. Not safe are the entry or unknown stacks,
* use the fall-back stack instead in this case.
*/
- sp = regs->sp;
+ sp = ist->regs.sp;
stack = (unsigned long *)sp;
if (!get_stack_info_noinstr(stack, current, &info) || info.type == STACK_TYPE_ENTRY ||
@@ -726,12 +726,12 @@ asmlinkage __visible noinstr struct pt_regs *vc_switch_off_ist(struct pt_regs *r
* IST stack. The code below only copies pt_regs, the real switch happens
* in assembly code.
*/
- sp = ALIGN_DOWN(sp, 8) - sizeof(*regs_ret);
+ sp = ALIGN_DOWN(sp, 8) - sizeof(*ist_ret);
- regs_ret = (struct pt_regs *)sp;
- *regs_ret = *regs;
+ ist_ret = (struct ist_regs *)sp;
+ *ist_ret = *ist;
- return regs_ret;
+ return ist_ret;
}
#endif
diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index 5c41a279c1e0..e24c63bbc30a 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -11,9 +11,17 @@
#include <asm/trap_pf.h>
#ifdef CONFIG_X86_64
+struct ist_regs {
+ /*
+ * ist specific fields must be defined before pt_regs
+ * and they are located below pt_regs on the stacks.
+ */
+ struct pt_regs regs;
+};
+
asmlinkage __visible notrace struct pt_regs *do_error_entry(struct pt_regs *eregs);
void __init trap_init(void);
-asmlinkage __visible noinstr struct pt_regs *vc_switch_off_ist(struct pt_regs *eregs);
+asmlinkage __visible noinstr struct ist_regs *vc_switch_off_ist(struct ist_regs *ist);
#endif
#ifdef CONFIG_X86_F00F_BUG
diff --git a/arch/x86/kernel/asm-offsets_64.c b/arch/x86/kernel/asm-offsets_64.c
index b14533af7676..3cffadf64ac2 100644
--- a/arch/x86/kernel/asm-offsets_64.c
+++ b/arch/x86/kernel/asm-offsets_64.c
@@ -4,6 +4,7 @@
#endif
#include <asm/ia32.h>
+#include <asm/traps.h>
#if defined(CONFIG_KVM_GUEST) && defined(CONFIG_PARAVIRT_SPINLOCKS)
#include <asm/kvm_para.h>
@@ -55,6 +56,7 @@ int main(void)
#undef ENTRY
BLANK();
+ DEFINE(IST_pt_regs, offsetof(struct ist_regs, regs));
#ifdef CONFIG_STACKPROTECTOR
DEFINE(stack_canary_offset, offsetof(struct fixed_percpu_data, stack_canary));
--
2.19.1.6.gb485710b
From: Lai Jiangshan <[email protected]>
Implement the C version of the ASM macro GET_PERCPU_BASE().
No functional difference intended.
Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/entry/traps.c | 36 ++++++++++++++++++++++++++++++++++++
1 file changed, 36 insertions(+)
diff --git a/arch/x86/entry/traps.c b/arch/x86/entry/traps.c
index e912bfbd2a61..9f4bc52410d0 100644
--- a/arch/x86/entry/traps.c
+++ b/arch/x86/entry/traps.c
@@ -945,6 +945,42 @@ struct pt_regs *do_error_entry(struct pt_regs *eregs)
/* Enter from kernel, don't move pt_regs */
return eregs;
}
+
+#ifdef CONFIG_SMP
+/*
+ * CPU/node NR is loaded from the limit (size) field of a special segment
+ * descriptor entry in GDT.
+ *
+ * Do not use RDPID, because KVM loads guest's TSC_AUX on vm-entry and
+ * may not restore the host's value until the CPU returns to userspace.
+ * Thus the kernel would consume a guest's TSC_AUX if an NMI arrives
+ * while running KVM's run loop.
+ */
+static __always_inline unsigned int gdt_get_cpu(void)
+{
+ unsigned int p;
+
+ asm ("lsl %[seg],%[p]": [p] "=a" (p): [seg] "r" (__CPUNODE_SEG));
+
+ return p & VDSO_CPUNODE_MASK;
+}
+
+/*
+ * Fetch the per-CPU GSBASE value for this processor.
+ *
+ * We normally use %gs for accessing per-CPU data, but we are setting up
+ * %gs here and obviously can not use %gs itself to access per-CPU data.
+ */
+static __always_inline unsigned long get_percpu_base(void)
+{
+ return __per_cpu_offset[gdt_get_cpu()];
+}
+#else
+static __always_inline unsigned long get_percpu_base(void)
+{
+ return pcpu_unit_offsets;
+}
+#endif
#endif
static bool is_sysenter_singlestep(struct pt_regs *regs)
--
2.19.1.6.gb485710b
From: Lai Jiangshan <[email protected]>
All the needed facilities are now in place in traps.c, so the main body of
error_entry() can be implemented in C in traps.c. The C version is
generally more readable and easier to update/improve.
No functional change intended.
Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/entry/entry_64.S | 71 +--------------------------------
arch/x86/entry/traps.c | 76 ++++++++++++++++++++++++++++++++++++
arch/x86/include/asm/traps.h | 1 +
3 files changed, 78 insertions(+), 70 deletions(-)
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 42d2918f5646..bc9e2f5ad370 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -972,83 +972,14 @@ SYM_CODE_START_LOCAL(error_entry)
cld
PUSH_AND_CLEAR_REGS save_ret=1
ENCODE_FRAME_POINTER 8
- testb $3, CS+8(%rsp)
- jz .Lerror_kernelspace
- /*
- * We entered from user mode or we're pretending to have entered
- * from user mode due to an IRET fault.
- */
- SWAPGS
- FENCE_SWAPGS_USER_ENTRY
- /* We have user CR3. Change to kernel CR3. */
- SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
-
-.Lerror_entry_from_usermode_after_swapgs:
- /* Put us onto the real thread stack. */
popq %r12 /* save return addr in %12 */
movq %rsp, %rdi /* arg0 = pt_regs pointer */
- call sync_regs
+ call do_error_entry
movq %rax, %rsp /* switch stack */
ENCODE_FRAME_POINTER
pushq %r12
ret
-
-.Lerror_entry_done_lfence:
- FENCE_SWAPGS_KERNEL_ENTRY
-.Lerror_entry_done:
- ret
-
- /*
- * There are two places in the kernel that can potentially fault with
- * usergs. Handle them here. B stepping K8s sometimes report a
- * truncated RIP for IRET exceptions returning to compat mode. Check
- * for these here too.
- */
-.Lerror_kernelspace:
- leaq native_irq_return_iret(%rip), %rcx
- cmpq %rcx, RIP+8(%rsp)
- je .Lerror_bad_iret
- movl %ecx, %eax /* zero extend */
- cmpq %rax, RIP+8(%rsp)
- je .Lbstep_iret
- cmpq $asm_load_gs_index_gs_change, RIP+8(%rsp)
- jne .Lerror_entry_done_lfence
-
- /*
- * hack: .Lgs_change can fail with user gsbase. If this happens, fix up
- * gsbase and proceed. We'll fix up the exception and land in
- * .Lgs_change's error handler with kernel gsbase.
- */
- SWAPGS
- FENCE_SWAPGS_USER_ENTRY
- jmp .Lerror_entry_done
-
-.Lbstep_iret:
- /* Fix truncated RIP */
- movq %rcx, RIP+8(%rsp)
- /* fall through */
-
-.Lerror_bad_iret:
- /*
- * We came from an IRET to user mode, so we have user
- * gsbase and CR3. Switch to kernel gsbase and CR3:
- */
- SWAPGS
- FENCE_SWAPGS_USER_ENTRY
- SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
-
- /*
- * Pretend that the exception came from user mode: set up pt_regs
- * as if we faulted immediately after IRET.
- */
- popq %r12 /* save return addr in %12 */
- movq %rsp, %rdi /* arg0 = pt_regs pointer */
- call fixup_bad_iret
- mov %rax, %rsp
- ENCODE_FRAME_POINTER
- pushq %r12
- jmp .Lerror_entry_from_usermode_after_swapgs
SYM_CODE_END(error_entry)
SYM_CODE_START_LOCAL(error_return)
diff --git a/arch/x86/entry/traps.c b/arch/x86/entry/traps.c
index c718663e57ef..b8fdf6a9682f 100644
--- a/arch/x86/entry/traps.c
+++ b/arch/x86/entry/traps.c
@@ -804,6 +804,82 @@ static __always_inline void fence_swapgs_kernel_entry(void)
{
alternative("", "lfence", X86_FEATURE_FENCE_SWAPGS_KERNEL);
}
+
+/*
+ * Put pt_regs onto the task stack and switch GS and CR3 if needed.
+ * The actual stack switch is done in entry_64.S.
+ *
+ * Be careful: it might be in the user CR3 and user GS base at the start
+ * of the function.
+ */
+asmlinkage __visible __entry_text
+struct pt_regs *do_error_entry(struct pt_regs *eregs)
+{
+ unsigned long iret_ip = (unsigned long)native_irq_return_iret;
+
+ /*
+ * When XENPV, it is already in the task stack, and it can't fault
+ * from native_irq_return_iret and asm_load_gs_index_gs_change()
+ * since XENPV uses its own pvops for iret and load_gs_index.
+ * So it can directly return.
+ */
+ if (static_cpu_has(X86_FEATURE_XENPV))
+ return eregs;
+
+ if (user_mode(eregs)) {
+ /*
+ * We entered from user mode.
+ * Switch to kernel gsbase and CR3.
+ */
+ native_swapgs();
+ fence_swapgs_user_entry();
+ switch_to_kernel_cr3();
+
+ /* Put pt_regs onto the task stack. */
+ return sync_regs(eregs);
+ }
+
+ /*
+ * There are two places in the kernel that can potentially fault with
+ * usergs. Handle them here. B stepping K8s sometimes report a
+ * truncated RIP for IRET exceptions returning to compat mode. Check
+ * for these here too.
+ */
+ if ((eregs->ip == iret_ip) || (eregs->ip == (unsigned int)iret_ip)) {
+ eregs->ip = iret_ip; /* Fix truncated RIP */
+
+ /*
+ * We came from an IRET to user mode, so we have user
+ * gsbase and CR3. Switch to kernel gsbase and CR3:
+ */
+ native_swapgs();
+ fence_swapgs_user_entry();
+ switch_to_kernel_cr3();
+
+ /*
+ * Pretend that the exception came from user mode: set up
+ * pt_regs as if we faulted immediately after IRET and put
+ * pt_regs onto the real task stack.
+ */
+ return sync_regs(fixup_bad_iret(eregs));
+ }
+
+ /*
+ * Hack: asm_load_gs_index_gs_change can fail with user gsbase.
+ * If this happens, fix up gsbase and proceed. We'll fix up the
+ * exception and land in asm_load_gs_index_gs_change's error
+ * handler with kernel gsbase.
+ */
+ if (eregs->ip == (unsigned long)asm_load_gs_index_gs_change) {
+ native_swapgs();
+ fence_swapgs_user_entry();
+ } else {
+ fence_swapgs_kernel_entry();
+ }
+
+ /* Enter from kernel, don't move pt_regs */
+ return eregs;
+}
#endif
static bool is_sysenter_singlestep(struct pt_regs *regs)
diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index e9e3801391d5..7b51a8081ae4 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -14,6 +14,7 @@
asmlinkage __visible notrace struct pt_regs *sync_regs(struct pt_regs *eregs);
asmlinkage __visible notrace
struct pt_regs *fixup_bad_iret(struct pt_regs *bad_regs);
+asmlinkage __visible notrace struct pt_regs *do_error_entry(struct pt_regs *eregs);
void __init trap_init(void);
asmlinkage __visible noinstr struct pt_regs *vc_switch_off_ist(struct pt_regs *eregs);
#endif
--
2.19.1.6.gb485710b
On Wed, Sep 01, 2021 at 01:50:05AM +0800, Lai Jiangshan wrote:
> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> index e38a4cf795d9..9164e85b36b8 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -729,6 +729,7 @@ _ASM_NOKPROBE(common_interrupt_return)
> SYM_FUNC_START(asm_load_gs_index)
> FRAME_BEGIN
> swapgs
> +SYM_INNER_LABEL(asm_load_gs_index_gs_change, SYM_L_GLOBAL)
> .Lgs_change:
After all the patches, there's only a single .Lgs_change reference left.
Can't we convert the _ASM_EXTABLE() entry to use the new label and
totally remove it?
> movl %edi, %gs
> 2: ALTERNATIVE "", "mfence", X86_BUG_SWAPGS_FENCE
> @@ -1011,7 +1012,7 @@ SYM_CODE_START_LOCAL(error_entry)
> movl %ecx, %eax /* zero extend */
> cmpq %rax, RIP+8(%rsp)
> je .Lbstep_iret
> - cmpq $.Lgs_change, RIP+8(%rsp)
> + cmpq $asm_load_gs_index_gs_change, RIP+8(%rsp)
> jne .Lerror_entry_done_lfence
On Thu, Sep 02, 2021 at 02:28:43PM +0800, Lai Jiangshan wrote:
>
>
> On 2021/9/1 04:44, Peter Zijlstra wrote:
> > On Wed, Sep 01, 2021 at 01:50:01AM +0800, Lai Jiangshan wrote:
> > > From: Lai Jiangshan <[email protected]>
> > >
> > > Many ASM code in entry_64.S can be rewritten in C if they can be written
> > > to be non-instrumentable and are called in the right order regarding to
> > > whether CR3/gsbase is changed to kernel CR3/gsbase.
> > >
> > > The patchset covert some of them to C code.
> >
> > Overall, very nice! There's a lot of details to verify, but I really
> > like where this ends up.
> >
> > I'm not sure about moving traps.c, but whatever, that's simple stuff.
> > I'll try and put some more eyeballs on this later.
> >
>
> Hello, Peter
>
> How is it going?
Too many patches in the inbox, sorry :/
> Do I need to send a new version?
Not at this time.
> How can I cooperate or participate in the details?
What I meant was that review of these patches will have to check all the
details which will take some time.
> Do you plan to move the other path of ASM code into C code?
I do have some plans, but currently a total lack of time. Your patches
are a nice contribution.
On 2021/9/2 17:14, Peter Zijlstra wrote:
> On Wed, Sep 01, 2021 at 01:50:05AM +0800, Lai Jiangshan wrote:
>
>> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
>> index e38a4cf795d9..9164e85b36b8 100644
>> --- a/arch/x86/entry/entry_64.S
>> +++ b/arch/x86/entry/entry_64.S
>> @@ -729,6 +729,7 @@ _ASM_NOKPROBE(common_interrupt_return)
>> SYM_FUNC_START(asm_load_gs_index)
>> FRAME_BEGIN
>> swapgs
>> +SYM_INNER_LABEL(asm_load_gs_index_gs_change, SYM_L_GLOBAL)
>> .Lgs_change:
>
> After all the patches, there's only a single .Lgs_change reference left.
> Can't we convert the _ASM_EXTABLE() entry to use the new label and
> totally remove it?
The label .Lgs_change is still needed in ASM code for extable.
I tried it the way you suggested, and the result is:
warning: objtool: __ex_table+0x0: don't know how to handle non-section reloc symbol asm_load_gs_index_gs_change
But never mind, I have already converted load_gs_index into C code.
I will send it later.
>
>> movl %edi, %gs
>> 2: ALTERNATIVE "", "mfence", X86_BUG_SWAPGS_FENCE
>> @@ -1011,7 +1012,7 @@ SYM_CODE_START_LOCAL(error_entry)
>> movl %ecx, %eax /* zero extend */
>> cmpq %rax, RIP+8(%rsp)
>> je .Lbstep_iret
>> - cmpq $.Lgs_change, RIP+8(%rsp)
>> + cmpq $asm_load_gs_index_gs_change, RIP+8(%rsp)
>> jne .Lerror_entry_done_lfence
On Wed, Sep 01, 2021 at 01:50:10AM +0800, Lai Jiangshan wrote:
> +/*
> + * Mitigate Spectre v1 for conditional swapgs code paths.
> + *
> + * fence_swapgs_user_entry is used in the user entry swapgs code path, to
> + * prevent a speculative swapgs when coming from kernel space.
> + *
> + * fence_swapgs_kernel_entry is used in the kernel entry non-swapgs code path,
> + * to prevent the swapgs from getting speculatively skipped when coming from
> + * user space.
> + */
> +static __always_inline void fence_swapgs_user_entry(void)
> +{
> + alternative("", "lfence", X86_FEATURE_FENCE_SWAPGS_USER);
> +}
> +
> +static __always_inline void fence_swapgs_kernel_entry(void)
> +{
> + alternative("", "lfence", X86_FEATURE_FENCE_SWAPGS_KERNEL);
> +}
I think slightly larger primitives might make more sense; that is
include the swapgs in these function and drop the fence_ prefix.
Something a bit like...
--- a/arch/x86/entry/traps.c
+++ b/arch/x86/entry/traps.c
@@ -853,20 +853,22 @@ static __always_inline void ist_restore_
/*
* Mitigate Spectre v1 for conditional swapgs code paths.
*
- * fence_swapgs_user_entry is used in the user entry swapgs code path, to
- * prevent a speculative swapgs when coming from kernel space.
+ * swapgs_user_entry is used in the user entry swapgs code path, to prevent a
+ * speculative swapgs when coming from kernel space.
*
- * fence_swapgs_kernel_entry is used in the kernel entry non-swapgs code path,
- * to prevent the swapgs from getting speculatively skipped when coming from
- * user space.
+ * swapgs_kernel_entry is used in the kernel entry non-swapgs code path, to
+ * prevent the swapgs from getting speculatively skipped when coming from user
+ * space.
*/
-static __always_inline void fence_swapgs_user_entry(void)
+static __always_inline void swapgs_user_entry(void)
{
+ native_swapgs();
alternative("", "lfence", X86_FEATURE_FENCE_SWAPGS_USER);
}
-static __always_inline void fence_swapgs_kernel_entry(void)
+static __always_inline void swapgs_kernel_entry(void)
{
+ native_swapgs();
alternative("", "lfence", X86_FEATURE_FENCE_SWAPGS_KERNEL);
}
@@ -896,8 +898,7 @@ struct pt_regs *do_error_entry(struct pt
* We entered from user mode.
* Switch to kernel gsbase and CR3.
*/
- native_swapgs();
- fence_swapgs_user_entry();
+ swapgs_user_entry();
switch_to_kernel_cr3();
/* Put pt_regs onto the task stack. */
@@ -917,8 +918,7 @@ struct pt_regs *do_error_entry(struct pt
* We came from an IRET to user mode, so we have user
* gsbase and CR3. Switch to kernel gsbase and CR3:
*/
- native_swapgs();
- fence_swapgs_user_entry();
+ swapgs_user_entry();
switch_to_kernel_cr3();
/*
@@ -936,8 +936,7 @@ struct pt_regs *do_error_entry(struct pt
* handler with kernel gsbase.
*/
if (eregs->ip == (unsigned long)asm_load_gs_index_gs_change) {
- native_swapgs();
- fence_swapgs_user_entry();
+ swapgs_user_entry();
} else {
fence_swapgs_kernel_entry();
}
@@ -1017,14 +1016,12 @@ static __always_inline unsigned long ist
if ((long)gsbase < 0)
return 1;
- native_swapgs();
-
/*
* The above ist_switch_to_kernel_cr3() doesn't do an unconditional
* CR3 write, even in the PTI case. So do an lfence to prevent GS
* speculation, regardless of whether PTI is enabled.
*/
- fence_swapgs_kernel_entry();
+ swapgs_kernel_entry();
/* SWAPGS required on exit */
return 0;
@@ -1089,7 +1086,7 @@ void paranoid_exit(struct ist_regs *ist)
}
if (!ist->gsbase)
- native_swapgs();
+ swapgs_user_entry();
}
#endif
On Wed, Sep 01, 2021 at 01:50:12AM +0800, Lai Jiangshan wrote:
> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> index 42d2918f5646..bc9e2f5ad370 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -972,83 +972,14 @@ SYM_CODE_START_LOCAL(error_entry)
> cld
> PUSH_AND_CLEAR_REGS save_ret=1
> ENCODE_FRAME_POINTER 8
>
> popq %r12 /* save return addr in %12 */
> movq %rsp, %rdi /* arg0 = pt_regs pointer */
> + call do_error_entry
> movq %rax, %rsp /* switch stack */
> ENCODE_FRAME_POINTER
> pushq %r12
> ret
There's only a single error_entry callsite, which is idtentry_body. One
of the things I wanted to do is change this lot so we change to the
task_stack in 'C', using an adaptation of call_on_irqstack() and
basically don't return from C until we're done with \cfunc.
That is, once we call C, stay there, and don't do this back and forth
between C and asm.
As is, the resulting asm in error_entry is somewhat confusing given that
we sometimes don't actually switch stacks.
From: Lai Jiangshan <[email protected]>
There is no constraint/limitation that forces native_load_gs_index() to be
in ASM code.
Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/entry/entry_64.S | 36 ----------------------------
arch/x86/entry/traps.c | 25 +++++++++++++++++++
arch/x86/include/asm/special_insns.h | 11 +--------
arch/x86/mm/extable.c | 15 ++++++++++++
4 files changed, 41 insertions(+), 46 deletions(-)
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 2d17fca1f503..16ee481e9b5f 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -659,42 +659,6 @@ native_irq_return_ldt:
SYM_CODE_END(common_interrupt_return)
_ASM_NOKPROBE(common_interrupt_return)
-/*
- * Reload gs selector with exception handling
- * edi: new selector
- *
- * Is in entry.text as it shouldn't be instrumented.
- */
-SYM_FUNC_START(asm_load_gs_index)
- FRAME_BEGIN
- swapgs
-SYM_INNER_LABEL(asm_load_gs_index_gs_change, SYM_L_GLOBAL)
-.Lgs_change:
- movl %edi, %gs
-2: ALTERNATIVE "", "mfence", X86_BUG_SWAPGS_FENCE
- swapgs
- FRAME_END
- ret
-SYM_FUNC_END(asm_load_gs_index)
-EXPORT_SYMBOL(asm_load_gs_index)
-
- _ASM_EXTABLE(.Lgs_change, .Lbad_gs)
- .section .fixup, "ax"
- /* running with kernelgs */
-SYM_CODE_START_LOCAL_NOALIGN(.Lbad_gs)
- swapgs /* switch back to user gs */
-.macro ZAP_GS
- /* This can't be a string because the preprocessor needs to see it. */
- movl $__USER_DS, %eax
- movl %eax, %gs
-.endm
- ALTERNATIVE "", "ZAP_GS", X86_BUG_NULL_SEG
- xorl %eax, %eax
- movl %eax, %gs
- jmp 2b
-SYM_CODE_END(.Lbad_gs)
- .previous
-
#ifdef CONFIG_XEN_PV
/*
* A note on the "critical region" in our callback handler.
diff --git a/arch/x86/entry/traps.c b/arch/x86/entry/traps.c
index 52511db6baa6..2a17d74569fd 100644
--- a/arch/x86/entry/traps.c
+++ b/arch/x86/entry/traps.c
@@ -679,6 +679,31 @@ DEFINE_IDTENTRY_RAW(exc_int3)
}
#ifdef CONFIG_X86_64
+
+/*
+ * Reload gs selector with exception handling
+ * selector: new selector
+ *
+ * Is noinstr as it shouldn't be instrumented.
+ */
+noinstr void native_load_gs_index(unsigned int selector)
+{
+ unsigned long flags;
+
+ local_irq_save(flags);
+ native_swapgs();
+ asm volatile(
+ ".global asm_load_gs_index_gs_change \n"
+ "asm_load_gs_index_gs_change: \n"
+ "1: movl %0, %%gs \n"
+ " swapgs \n"
+ "2: \n"
+ _ASM_EXTABLE_HANDLE(1b, 2b, ex_handler_clear_gs)
+ :: "r" (selector) : "memory");
+ alternative("", "mfence", X86_BUG_SWAPGS_FENCE);
+ local_irq_restore(flags);
+}
+
/*
* Help handler running on a per-cpu (IST or entry trampoline) stack
* to switch to the normal thread stack if the interrupted code was in
diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h
index 058995bb153c..c5d74ebce527 100644
--- a/arch/x86/include/asm/special_insns.h
+++ b/arch/x86/include/asm/special_insns.h
@@ -120,16 +120,7 @@ static inline void native_wbinvd(void)
asm volatile("wbinvd": : :"memory");
}
-extern asmlinkage void asm_load_gs_index(unsigned int selector);
-
-static inline void native_load_gs_index(unsigned int selector)
-{
- unsigned long flags;
-
- local_irq_save(flags);
- asm_load_gs_index(selector);
- local_irq_restore(flags);
-}
+extern asmlinkage void native_load_gs_index(unsigned int selector);
static inline unsigned long __read_cr4(void)
{
diff --git a/arch/x86/mm/extable.c b/arch/x86/mm/extable.c
index e1664e9f969c..695a580a652a 100644
--- a/arch/x86/mm/extable.c
+++ b/arch/x86/mm/extable.c
@@ -138,6 +138,21 @@ __visible bool ex_handler_clear_fs(const struct exception_table_entry *fixup,
}
EXPORT_SYMBOL(ex_handler_clear_fs);
+__visible bool ex_handler_clear_gs(const struct exception_table_entry *fixup,
+ struct pt_regs *regs, int trapnr,
+ unsigned long error_code,
+ unsigned long fault_addr)
+{
+ /*
+ * The following native_load_gs_index() should not fault, otherwise
+ * it would be infinite recursion.
+ */
+ if (static_cpu_has(X86_BUG_NULL_SEG))
+ native_load_gs_index(__USER_DS);
+ native_load_gs_index(0);
+ return ex_handler_default(fixup, regs, trapnr, error_code, fault_addr);
+}
+
enum handler_type ex_get_fault_handler_type(unsigned long ip)
{
const struct exception_table_entry *e;
--
2.19.1.6.gb485710b
On 2021/9/2 18:16, Peter Zijlstra wrote:
> On Wed, Sep 01, 2021 at 01:50:12AM +0800, Lai Jiangshan wrote:
>> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
>> index 42d2918f5646..bc9e2f5ad370 100644
>> --- a/arch/x86/entry/entry_64.S
>> +++ b/arch/x86/entry/entry_64.S
>> @@ -972,83 +972,14 @@ SYM_CODE_START_LOCAL(error_entry)
>> cld
>> PUSH_AND_CLEAR_REGS save_ret=1
>> ENCODE_FRAME_POINTER 8
>>
>> popq %r12 /* save return addr in %12 */
>> movq %rsp, %rdi /* arg0 = pt_regs pointer */
>> + call do_error_entry
>> movq %rax, %rsp /* switch stack */
>> ENCODE_FRAME_POINTER
>> pushq %r12
>> ret
>
> There's only a single error_entry callsite, which is idtentry_body. One
> of the things I wanted to do is change this lot so we change to the
> task_stack in 'C', using an adaptation of call_on_irqstack() and
> basically don't return frrom C until we're done with \cfunc.
>
> That is, once we call C, stay there, and don't do this back and forth
> between C and asm.
I haven't figured out how an adaptation of call_on_irqstack() can do it.
The original stack needs to be "free" for the next task. And we can't switch
the stack before error_entry() since CR3 has not been switched yet.
I believe the ASM code here can be simplified further and made clearer. But I
don't think going back and forth between C and ASM is a real issue if the ASM
code is short and simple enough.
>
> As is, the resulting asm in error_entry is somewhat confusing given that
> we sometimes don't actually switch stacks.
>
On 9/2/21 3:50 AM, Lai Jiangshan wrote:
> From: Lai Jiangshan <[email protected]>
>
> There is no constrain/limition to force native_load_gs_index() to be in
> ASM code.
Hi,
First of all, let me say I really like your patchset, and I will try to
review it in detail ASAP (most of the initial read pass looks very sane
to me).
However, I would like to object in part to this specific patch. It adds a
fair bit of extra code to the exception path, and adds jumps between
files, which makes the code much harder to read.
You end up doing one swapgs in assembly and one in C, which would seem
to be a very good indication that really isn't an improvement.
Note that this entire sequence is scheduled to be obsoleted by a single
atomic hardware instruction, LKGS, which will replace ALL of
native_load_gs_index(); it will no longer be necessary even to disable
interrupts as there is no non-atomic state. In that sense, doing this as
an out-of-line C function (with some inline assembly) is great, because
it makes it far easier to use LKGS as an alternative; the only (small)
disadvantage is that it ends up clobbering a lot of registers
unnecessarily (in assembly it can be implemented clobbering only two
registers; one if one uses pushf/popf to save the interrupt flag.)
noinstr void native_load_gs_index(unsigned int selector)
{
unsigned long flags;
local_irq_save(flags);
native_swapgs();
redo:
asm goto("1: movl %0,%%gs\n",
"2:\n"
_ASM_EXTABLE(%1)
: : "r" (selector) : "i" (&&bad_seg);
alternative("", "mfence", X86_BUG_SWAPGS_FENCE);
native_swapgs();
local_irq_restore(flags);
return;
bad_seg:
/* Exception entry will have restored kernel GS */
native_swapgs();
if (static_cpu_has(X86_BUG_NULL_SEG)) {
asm volatile("movl %0,%%gs"
: : "r" (__USER_DS));
}
selector = 0;
goto redo;
}
-hpa
On 9/7/21 6:38 PM, H. Peter Anvin wrote:
> On 9/2/21 3:50 AM, Lai Jiangshan wrote:
>> From: Lai Jiangshan <[email protected]>
>>
>> There is no constrain/limition to force native_load_gs_index() to be in
>> ASM code.
>
> Hi,
>
> First of all, let me say I really like your patchset, and I will try to
> review it in detail ASAP (most of the initial read pass looks very sane
> to me.
>
> However, I would like to object in part this specific patch. It adds a
> fair bit of extra code to the exception path, and adds jumps between
> files which makes the code much harder to read.
>
> You end up doing one swapgs in assembly and one in C, which would seem
> to be a very good indication that really isn't an improvement.
>
> Note that this entire sequence is scheduled to be obsoleted by a single
> atomic hardware instruction, LKGS, which will replace ALL of
> native_load_gs_index(); it will no longer be necessary even to disable
> interrupts as there is no non-atomic state. In that sense, doing this as
> an out-of-line C function (with some inline assembly) is great, because
> it makes it far easier to use LKGS as an alternative; the only (small)
> disadvantage is that it ends up clobbering a lot of registers
> unnecessarily (in assembly it can be implemented clobbering only two
> registers; one if one uses pushf/popf to save the interrupt flag.)
>
OK, here is a version which actually compiles:
(It makes me wonder if it would be useful to have an _ASM_EXTABLE_GOTO()
macro to highlight the use of an asm goto C label.)
noinstr void native_load_gs_index(unsigned int selector)
{
unsigned long flags;
local_irq_save(flags);
native_swapgs();
do_mov_gs:
asm_volatile_goto("1: mov %[seg],%%gs\n"
"2:\n"
_ASM_EXTABLE(1b,%l[bad_seg])
: : [seg] "r" (selector) : : bad_seg);
alternative("", "mfence", X86_BUG_SWAPGS_FENCE);
native_swapgs();
local_irq_restore(flags);
return;
bad_seg:
selector = 0;
/* The exception dispatch will have restored kernel GS */
native_swapgs();
if (static_cpu_has_bug(X86_BUG_NULL_SEG))
asm volatile("mov %[seg],%%gs"
: : [seg] "r" (__USER_DS));
goto do_mov_gs;
}
On 9/7/21 9:42 PM, H. Peter Anvin wrote:
>
>
> On 9/7/21 6:38 PM, H. Peter Anvin wrote:
>> On 9/2/21 3:50 AM, Lai Jiangshan wrote:
>>> From: Lai Jiangshan <[email protected]>
>>>
>>> There is no constrain/limition to force native_load_gs_index() to be in
>>> ASM code.
>>
>> Hi,
>>
>> First of all, let me say I really like your patchset, and I will try to
>> review it in detail ASAP (most of the initial read pass looks very sane
>> to me.
>>
>> However, I would like to object in part this specific patch. It adds a
>> fair bit of extra code to the exception path, and adds jumps between
>> files which makes the code much harder to read.
>>
>> You end up doing one swapgs in assembly and one in C, which would seem
>> to be a very good indication that really isn't an improvement.
>>
>> Note that this entire sequence is scheduled to be obsoleted by a single
>> atomic hardware instruction, LKGS, which will replace ALL of
>> native_load_gs_index(); it will no longer be necessary even to disable
>> interrupts as there is no non-atomic state. In that sense, doing this as
>> an out-of-line C function (with some inline assembly) is great, because
>> it makes it far easier to use LKGS as an alternative; the only (small)
>> disadvantage is that it ends up clobbering a lot of registers
>> unnecessarily (in assembly it can be implemented clobbering only two
>> registers; one if one uses pushf/popf to save the interrupt flag.)
>>
>
> OK, here is a version which actually compiles:
>
... slightly shorter and minimally better compiled code ...
noinstr void native_load_gs_index(unsigned int selector)
{
unsigned long flags;
local_irq_save(flags);
native_swapgs();
do_mov_gs:
asm_volatile_goto("1: mov %[seg],%%gs\n"
"2:\n"
_ASM_EXTABLE(1b,%l[bad_seg])
: : [seg] "r" (selector) : : bad_seg);
alternative("", "mfence", X86_BUG_SWAPGS_FENCE);
native_swapgs();
local_irq_restore(flags);
return;
bad_seg:
/* The exception dispatch will have restored kernel GS */
native_swapgs();
alternative_input("", "mov %[seg],%%gs",
X86_BUG_NULL_SEG, [seg] "r" (__USER_DS));
selector = 0;
goto do_mov_gs;
}
On 2021/9/8 13:00, H. Peter Anvin wrote:
> On 9/7/21 9:42 PM, H. Peter Anvin wrote:
>>
>>
>> On 9/7/21 6:38 PM, H. Peter Anvin wrote:
>>> On 9/2/21 3:50 AM, Lai Jiangshan wrote:
>>>> From: Lai Jiangshan <[email protected]>
>>>>
>>>> There is no constrain/limition to force native_load_gs_index() to be in
>>>> ASM code.
>>>
>>> Hi,
>>>
>>> First of all, let me say I really like your patchset, and I will try to
>>> review it in detail ASAP (most of the initial read pass looks very sane
>>> to me.
Hello
Thank you for your review.
>>>
>>> However, I would like to object in part this specific patch. It adds a
>>> fair bit of extra code to the exception path, and adds jumps between
>>> files which makes the code much harder to read.
I tried putting all code into a single C function.
But I didn't know how to use a C label in _ASM_EXTABLE, so I split it.
Your code is much better.
Thanks
Lai
>>>
>>> You end up doing one swapgs in assembly and one in C, which would seem
>>> to be a very good indication that really isn't an improvement.
>>>
>>> Note that this entire sequence is scheduled to be obsoleted by a single
>>> atomic hardware instruction, LKGS, which will replace ALL of
>>> native_load_gs_index(); it will no longer be necessary even to disable
>>> interrupts as there is no non-atomic state. In that sense, doing this as
>>> an out-of-line C function (with some inline assembly) is great, because
>>> it makes it far easier to use LKGS as an alternative; the only (small)
>>> disadvantage is that it ends up clobbering a lot of registers
>>> unnecessarily (in assembly it can be implemented clobbering only two
>>> registers; one if one uses pushf/popf to save the interrupt flag.)
>>>
>>
>> OK, here is a version which actually compiles:
>>
>
> ... slightly shorter and minimally better compiled code ...
>
> noinstr void native_load_gs_index(unsigned int selector)
> {
> unsigned long flags;
>
> local_irq_save(flags);
> native_swapgs();
> do_mov_gs:
> asm_volatile_goto("1: mov %[seg],%%gs\n"
> "2:\n"
> _ASM_EXTABLE(1b,%l[bad_seg])
> : : [seg] "r" (selector) : : bad_seg);
> alternative("", "mfence", X86_BUG_SWAPGS_FENCE);
> native_swapgs();
> local_irq_restore(flags);
> return;
>
> bad_seg:
> /* The exception dispatch will have restored kernel GS */
> native_swapgs();
> alternative_input("", "mov %[seg],%%gs",
> X86_BUG_NULL_SEG, [seg] "r" (__USER_DS));
> selector = 0;
> goto do_mov_gs;
> }
Come to think of it, I think it would be better if we were to
incorporate the X86_BUG_NULL_SEG null segment load into the main
native_load_gs_index() code while we are swapgs'd anyway. This
simplifies the code further, and should avoid calling
native_load_gs_index() twice if this bug is enabled.
(This is the traps.c code only; removing the now unnecessary instances
is left as an exercise for the reader...)
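A minimal sketch of how that could look, building on the asm-goto version
above. This is illustrative only; the exact placement of the NULL-segment
workaround is an assumption, not code taken from this thread:

noinstr void native_load_gs_index(unsigned int selector)
{
	unsigned long flags;

	local_irq_save(flags);
	native_swapgs();
do_mov_gs:
	/*
	 * Assumption: on CPUs with X86_BUG_NULL_SEG a NULL selector load
	 * does not clear the segment base, so load __USER_DS first while
	 * we are swapgs'd anyway; this also covers the retry with
	 * selector == 0 from the bad_seg path below.
	 */
	alternative_input("", "mov %[kseg],%%gs", X86_BUG_NULL_SEG,
			  [kseg] "r" (__USER_DS));
	asm_volatile_goto("1: mov %[seg],%%gs\n"
			  "2:\n"
			  _ASM_EXTABLE(1b, %l[bad_seg])
			  : : [seg] "r" (selector) : : bad_seg);
	alternative("", "mfence", X86_BUG_SWAPGS_FENCE);
	native_swapgs();
	local_irq_restore(flags);
	return;

bad_seg:
	/* The exception dispatch will have restored kernel GS */
	native_swapgs();
	selector = 0;
	goto do_mov_gs;
}

With this, the bad_seg path only needs to redo the swapgs and retry with a
NULL selector; the workaround is applied once, in the main path.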
-hpa
On 2021/9/1 01:50, Lai Jiangshan wrote:
> From: Lai Jiangshan <[email protected]>
>
> struct ist_regs is the upmost stack frame for IST interrupt, it
> consists of struct pt_regs and other fields which will be added in
> later patch.
>
> Make vc_switch_off_ist() take ist_regs as its argument and it will switch
> the whole ist_regs if needed.
>
> Make the frame before calling paranoid_entry() and paranoid_exit() be
> struct ist_regs.
>
> This patch is prepared for converting paranoid_entry() and paranoid_exit()
> into C code which will need the additinal fields to store the results in
> paranoid_entry() and to use them in paranoid_exit().
This patch was over-designed.
In ASM code, we can easily save results in the callee-saved registers.
For example, CR3 is saved in %r14 and the gsbase info is saved in %rbx.
In C code, we can't save results in registers. And I thought there was
no place to save the results because the CR3 and gsbase are not the kernel's.
So I extended pt_regs to ist_regs to save the results.
But that was incorrect. The results can be saved in percpu data at the end of
paranoid_entry() after the CR3/gsbase are settled, and the results can be
read at the beginning of paranoid_exit() before the CR3/gsbase are
switched back to the interrupted context's.
sigh.
>
> The C interrupt handlers don't use struct ist_regs due to they don't need
> the additional fields in the struct ist_regs, and they still use pt_regs.
>
On 2021/9/10 08:18, Lai Jiangshan wrote:
>
>
> On 2021/9/1 01:50, Lai Jiangshan wrote:
>> From: Lai Jiangshan <[email protected]>
>>
>> struct ist_regs is the upmost stack frame for IST interrupt, it
>> consists of struct pt_regs and other fields which will be added in
>> later patch.
>>
>> Make vc_switch_off_ist() take ist_regs as its argument and it will switch
>> the whole ist_regs if needed.
>>
>> Make the frame before calling paranoid_entry() and paranoid_exit() be
>> struct ist_regs.
>>
>> This patch is prepared for converting paranoid_entry() and paranoid_exit()
>> into C code which will need the additinal fields to store the results in
>> paranoid_entry() and to use them in paranoid_exit().
>
> This patch was over designed.
>
> In ASM code, we can easily save results in the callee-saved registers.
> For example, rc3 is saved in %r14, gsbase info is saved in %rbx.
>
> And in C code, we can't save results in registers. And I thought there was
> no place to save the results because the CR3 and gsbase are not kernel's.
> So I extended the pt_regs to ist_regs to save the results.
>
> But it was incorrect. The results can be saved in percpu data at the end of
> paranoid_entry() after the CR3/gsbase are settled down. And the results
> can be read at the beginning of paranoid_exit() before the CR3/gsbase are
> switched to the interrupted context's.
>
> sigh.
Sigh again. IST interrupts can be nested. We can't save the results in percpu
data unless we make it stack-like, which is not a good idea. The changelog
failed to express this.
>
>>
>> The C interrupt handlers don't use struct ist_regs due to they don't need
>> the additional fields in the struct ist_regs, and they still use pt_regs.
>>
>
Note: the examples in this email all compiled with:
gcc -O2 -mpreferred-stack-boundary=3 -mcmodel=kernel
The disassembly has been slightly simplified.
On 9/9/21 5:18 PM, Lai Jiangshan wrote:
>
> This patch was over designed.
>
> In ASM code, we can easily save results in the callee-saved registers.
> For example, rc3 is saved in %r14, gsbase info is saved in %rbx.
>
> And in C code, we can't save results in registers. And I thought there was
> no place to save the results because the CR3 and gsbase are not kernel's.
> So I extended the pt_regs to ist_regs to save the results.
>
> But it was incorrect. The results can be saved in percpu data at the
> end of
> paranoid_entry() after the CR3/gsbase are settled down. And the results
> can be read at the beginning of paranoid_exit() before the CR3/gsbase are
> switched to the interrupted context's.
>
OK, count me confused. Of course you can save results in callee-saved
registers in C; it is kind of what they do:
extern void bar(void);
unsigned long foo(unsigned long v)
{
bar();
return v;
}
0000000000000000 <foo>:
0: 41 54 push %r12
2: 49 89 fc mov %rdi,%r12
5: e8 00 00 00 00 callq bar
a: 4c 89 e0 mov %r12,%rax
d: 41 5c pop %r12
f: c3 retq
Now, if you need to specify *which* registers, you have to declare them
as global register variables - NOT local (which have completely
different semantics). This also means that you (probably) want to
isolate this code into its own compilation unit, because it will prevent
any other code in the same .c file from using that register as well.
For example:
register unsigned long r14 asm("%r14");
unsigned long foo(unsigned long v)
{
r14 = v;
bar();
v = r14;
return v;
}
0000000000000000 <foo>:
0: 49 89 fe mov %rdi,%r14
3: e8 00 00 00 00 callq bar
8: 4c 89 f0 mov %r14,%rax
b: c3 retq
WARNING: This also means that gcc will happily discard the old value in
%r14, so if you need it preserved you have to do so explicitly; if you
are called direct from assembly and are happy to lose the value then the
above code is fine -- and it is even slightly more efficient!
For preserving the old r14, in this case:
register unsigned long r14 asm("%r14");
unsigned long foo(unsigned long v)
{
unsigned long saved_r14 = r14;
r14 = v;
bar();
v = r14;
r14 = saved_r14;
return v;
}
0000000000000000 <foo>:
0: 53 push %rbx
1: 4c 89 f3 mov %r14,%rbx
4: 49 89 fe mov %rdi,%r14
7: e8 00 00 00 00 callq bar
c: 4c 89 f0 mov %r14,%rax
f: 49 89 de mov %rbx,%r14
12: 5b pop %rbx
13: c3 retq
HOWEVER, if you are relying on not using the stack, then using C code is
probably very much not a good idea. It is very hard to guarantee that
just because the C compiler is *currently* not using a stack, that it
won't do so *in the future*.
Again, finally, local register variables DO NOT WORK, this does NOT do
what you expect:
unsigned long foo(unsigned long v)
{
register unsigned long r14 asm("%r14") = v;
bar();
return r14;
}
0000000000000000 <foo>:
0: 41 54 push %r12
2: 49 89 fc mov %rdi,%r12
5: e8 00 00 00 00 callq bar
a: 4c 89 e0 mov %r12,%rax
d: 41 5c pop %r12
f: c3 retq
It may affect your design, I don't know, but FRED exception delivery (as currently architected per draft 2.0; it is subject to change still) pushes several fields above the conventional stack frame (and thus above pt_regs.)
On September 9, 2021 5:18:47 PM PDT, Lai Jiangshan <[email protected]> wrote:
>
>
>On 2021/9/1 01:50, Lai Jiangshan wrote:
>> From: Lai Jiangshan <[email protected]>
>>
>> struct ist_regs is the topmost stack frame for an IST interrupt; it
>> consists of struct pt_regs and other fields which will be added in a
>> later patch.
>>
>> Make vc_switch_off_ist() take ist_regs as its argument so that it
>> switches the whole ist_regs if needed.
>>
>> Make the frame before calling paranoid_entry() and paranoid_exit() be
>> struct ist_regs.
>>
>> This patch prepares for converting paranoid_entry() and paranoid_exit()
>> into C code, which will need the additional fields to store the results
>> in paranoid_entry() and to use them in paranoid_exit().
>
>This patch was over designed.
>
>In ASM code, we can easily save results in the callee-saved registers.
>For example, cr3 is saved in %r14, gsbase info is saved in %rbx.
>
>And in C code, we can't save results in registers. And I thought there was
>no place to save the results because the CR3 and gsbase are not kernel's.
>So I extended the pt_regs to ist_regs to save the results.
>
>But it was incorrect. The results can be saved in percpu data at the end of
>paranoid_entry() after the CR3/gsbase are settled down. And the results
>can be read at the beginning of paranoid_exit() before the CR3/gsbase are
>switched to the interrupted context's.
>
>sigh.
>
>>
>> The C interrupt handlers don't use struct ist_regs because they don't need
>> the additional fields in struct ist_regs; they still use pt_regs.
>>
>
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
Note: with FRED, paranoid entry is no longer needed, since the hardware enforces gs_base mode consistency.
On September 9, 2021 9:50:40 PM PDT, "H. Peter Anvin" <[email protected]> wrote:
>It may affect your design, I don't know, but FRED exception delivery (as currently architected per draft 2.0; it is subject to change still) pushes several fields above the conventional stack frame (and thus above pt_regs.)
>
>On September 9, 2021 5:18:47 PM PDT, Lai Jiangshan <[email protected]> wrote:
>>
>>
>>On 2021/9/1 01:50, Lai Jiangshan wrote:
>>> From: Lai Jiangshan <[email protected]>
>>>
>>> struct ist_regs is the topmost stack frame for an IST interrupt; it
>>> consists of struct pt_regs and other fields which will be added in a
>>> later patch.
>>>
>>> Make vc_switch_off_ist() take ist_regs as its argument so that it
>>> switches the whole ist_regs if needed.
>>>
>>> Make the frame before calling paranoid_entry() and paranoid_exit() be
>>> struct ist_regs.
>>>
>>> This patch prepares for converting paranoid_entry() and paranoid_exit()
>>> into C code, which will need the additional fields to store the results
>>> in paranoid_entry() and to use them in paranoid_exit().
>>
>>This patch was over designed.
>>
>>In ASM code, we can easily save results in the callee-saved registers.
>>For example, cr3 is saved in %r14, gsbase info is saved in %rbx.
>>
>>And in C code, we can't save results in registers. And I thought there was
>>no place to save the results because the CR3 and gsbase are not kernel's.
>>So I extended the pt_regs to ist_regs to save the results.
>>
>>But it was incorrect. The results can be saved in percpu data at the end of
>>paranoid_entry() after the CR3/gsbase are settled down. And the results
>>can be read at the beginning of paranoid_exit() before the CR3/gsbase are
>>switched to the interrupted context's.
>>
>>sigh.
>>
>>>
>>> The C interrupt handlers don't use struct ist_regs because they don't need
>>> the additional fields in struct ist_regs; they still use pt_regs.
>>>
>>
>
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
On 2021/9/10 12:31, H. Peter Anvin wrote:
> Note: the examples in this email all compiled with:
>
> gcc -O2 -mpreferred-stack-boundary=3 -mcmodel=kernel
>
> The disassembly has been slightly simplified.
Saving results in registers is inflexible, no matter whether it is done in ASM
or hacked into C like this.
Saving CR3 in ist_regs is no different from saving rax in pt_regs, and both
ist_regs and the embedded pt_regs can be moved when the stack needs to be
switched.
I prefer my original design.
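For reference, a rough sketch of the layout being discussed (the field set
and its exact placement are assumptions for this sketch, not the actual
patch):

struct ist_regs {
	/*
	 * Extra state saved by paranoid_entry(), e.g. the interrupted
	 * context's CR3.  Which fields end up here, and in what order,
	 * is illustrative only.
	 */
	unsigned long cr3;

	/*
	 * pt_regs must stay the last member so it lines up with the
	 * frame built by the asm entry code.
	 */
	struct pt_regs regs;
};

With this shape the saved CR3 is just one more slot that travels with the
frame: switching stacks copies it together with the embedded pt_regs,
exactly like the saved rax.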
>
>
> On 9/9/21 5:18 PM, Lai Jiangshan wrote:
>>
>> This patch was over designed.
>>
>> In ASM code, we can easily save results in the callee-saved registers.
>> For example, cr3 is saved in %r14, gsbase info is saved in %rbx.
>>
>> And in C code, we can't save results in registers. And I thought there was
>> no place to save the results because the CR3 and gsbase are not kernel's.
>> So I extended the pt_regs to ist_regs to save the results.
>>
>> But it was incorrect. The results can be saved in percpu data at the end of
>> paranoid_entry() after the CR3/gsbase are settled down. And the results
>> can be read at the beginning of paranoid_exit() before the CR3/gsbase are
>> switched to the interrupted context's.
>>
>
> OK, count me confused. Of course you can save results in callee-saved registers in C; that is rather what they are for:
>
>
> extern void bar(void);
>
> unsigned long foo(unsigned long v)
> {
> bar();
> return v;
> }
>
> 0000000000000000 <foo>:
> 0: 41 54 push %r12
> 2: 49 89 fc mov %rdi,%r12
> 5: e8 00 00 00 00 callq bar
> a: 4c 89 e0 mov %r12,%rax
> d: 41 5c pop %r12
> f: c3 retq
>
> Now, if you need to specify *which* registers, you have to declare them as global register variables - NOT local (which
> have completely different semantics). This also means that you (probably) want to isolate this code into its own
> compilation unit, because it will prevent any other code in the same .c file from using that register as well.
>
> For example:
>
> register unsigned long r14 asm("%r14");
> unsigned long foo(unsigned long v)
> {
> r14 = v;
> bar();
> v = r14;
> return v;
> }
>
> 0000000000000000 <foo>:
> 0: 49 89 fe mov %rdi,%r14
> 3: e8 00 00 00 00 callq bar
> 8: 4c 89 f0 mov %r14,%rax
> b: c3 retq
>
> WARNING: This also means that gcc will happily discard the old value in %r14, so if you need it preserved you have to do
> so explicitly; if you are called direct from assembly and are happy to lose the value then the above code is fine -- and
> it is even slightly more efficient!
>
> For preserving the old r14, in this case:
>
> register unsigned long r14 asm("%r14");
> unsigned long foo(unsigned long v)
> {
> unsigned long saved_r14 = r14;
> r14 = v;
> bar();
> v = r14;
> r14 = saved_r14;
> return v;
> }
>
> 0000000000000000 <foo>:
> 0: 53 push %rbx
> 1: 4c 89 f3 mov %r14,%rbx
> 4: 49 89 fe mov %rdi,%r14
> 7: e8 00 00 00 00 callq bar
> c: 4c 89 f0 mov %r14,%rax
> f: 49 89 de mov %rbx,%r14
> 12: 5b pop %rbx
> 13: c3 retq
>
>
> HOWEVER, if you are relying on not using the stack, then using C code is probably very much not a good idea. It is very
> hard to guarantee that, just because the C compiler is *currently* not using the stack, it won't do so *in the future*.
>
> Again, finally, local register variables DO NOT WORK, this does NOT do what you expect:
>
> unsigned long foo(unsigned long v)
> {
> register unsigned long r14 asm("%r14") = v;
> bar();
> return r14;
> }
>
> 0000000000000000 <foo>:
> 0: 41 54 push %r12
> 2: 49 89 fc mov %rdi,%r12
> 5: e8 00 00 00 00 callq bar
> a: 4c 89 e0 mov %r12,%rax
> d: 41 5c pop %r12
> f: c3 retq
>
>
On 9/10/21 12:13 AM, Lai Jiangshan wrote:
>
>
> On 2021/9/10 12:31, H. Peter Anvin wrote:
>> Note: the examples in this email all compiled with:
>>
>> gcc -O2 -mpreferred-stack-boundary=3 -mcmodel=kernel
>>
>> The disassembly has been slightly simplified.
>
> Saving results in registers is inflexible, no matter whether it is done
> in ASM or hacked into C like this.
>
> Saving CR3 in ist_regs is no different from saving rax in pt_regs, and
> both ist_regs and the embedded pt_regs can be moved when the stack needs
> to be switched.
>
> I prefer my original design.
>
I agree. I was having some issues following your arguments.
-hpa
On Thu, Sep 2, 2021, at 3:50 AM, Lai Jiangshan wrote:
> From: Lai Jiangshan <[email protected]>
>
> There is no constraint/limitation forcing native_load_gs_index() to be in
> ASM code.
>
> Signed-off-by: Lai Jiangshan <[email protected]>
>
> #ifdef CONFIG_X86_64
> +
> +/*
> + * Reload gs selector with exception handling
> + * selector: new selector
> + *
> + * Is noinstr as it shouldn't be instrumented.
> + */
> +noinstr void native_load_gs_index(unsigned int selector)
> +{
> + unsigned long flags;
> +
> + local_irq_save(flags);
This patch would be a bit less alarming if you moved the swapgs into asm. Also, this needs a comment explaining why skipping the swapgs back to kernel gs in the exception path is correct.
> + native_swapgs();
> + asm volatile(
> + ".global asm_load_gs_index_gs_change \n"
> + "asm_load_gs_index_gs_change: \n"
> + "1: movl %0, %%gs \n"
> + " swapgs \n"
> + "2: \n"
> + _ASM_EXTABLE_HANDLE(1b, 2b, ex_handler_clear_gs)
> + :: "r" (selector) : "memory");
> + alternative("", "mfence", X86_BUG_SWAPGS_FENCE);
> + local_irq_restore(flags);
> +}
> +
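For concreteness, a rough sketch of what moving the swapgs into the asm
statement might look like, so that no C code runs on the user gsbase
(illustrative only, not the actual patch):

/*
 * Sketch only: same body as the quoted patch, but the first swapgs is
 * moved from native_swapgs() into the asm statement itself.
 */
noinstr void native_load_gs_index(unsigned int selector)
{
	unsigned long flags;

	local_irq_save(flags);
	asm volatile(
		"	swapgs				\n"
		".global asm_load_gs_index_gs_change	\n"
		"asm_load_gs_index_gs_change:		\n"
		"1:	movl %0, %%gs			\n"
		"	swapgs				\n"
		"2:					\n"
		_ASM_EXTABLE_HANDLE(1b, 2b, ex_handler_clear_gs)
		:: "r" (selector) : "memory");
	alternative("", "mfence", X86_BUG_SWAPGS_FENCE);
	local_irq_restore(flags);
}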
On 2021/9/14 04:01, Andy Lutomirski wrote:
>
>
> On Thu, Sep 2, 2021, at 3:50 AM, Lai Jiangshan wrote:
>> From: Lai Jiangshan <[email protected]>
>>
>> There is no constraint/limitation forcing native_load_gs_index() to be in
>> ASM code.
>>
>> Signed-off-by: Lai Jiangshan <[email protected]>
>
>>
>> #ifdef CONFIG_X86_64
>> +
>> +/*
>> + * Reload gs selector with exception handling
>> + * selector: new selector
>> + *
>> + * Is noinstr as it shouldn't be instrumented.
>> + */
>> +noinstr void native_load_gs_index(unsigned int selector)
>> +{
>> + unsigned long flags;
>> +
>> + local_irq_save(flags);
>
> This patch would be a bit less alarming if you moved the swapgs into asm.
Hmm, this patch is not so clean and convincing in C.
I think Peter is still reworking the patchset, which may include improving
this patch. I'm OK if this patch is dropped.
> Also, this needs a comment explaining why skipping the swapgs back to kernel gs in the exception path is correct.
>
I think it is well known that the exception handler in ASM_EXTABLE runs in
kernel context, where the kernel gs is active.
It does need a comment explaining why the label asm_load_gs_index_gs_change
is needed and how it helps error_entry() restore back to the kernel gs.
Since the C-version error_entry() has to check asm_load_gs_index_gs_change
anyway, I think another way to handle the fault of "mov %gs" is simply to do
the %gs fixup in the C-version error_entry() (see patch 11). That would be
more direct, simple, and self-documenting, along the lines of the sketch
below.
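A minimal sketch of that idea (illustrative only; the helper name is made up
and this is not the code of patch 11):

extern const char asm_load_gs_index_gs_change[];

/*
 * Sketch: called from a C-version error_entry() when the fault came
 * from kernel mode.  At asm_load_gs_index_gs_change the CPU is still
 * on the user gsbase (the swapgs back has not run yet), so swapgs
 * here; the extable fixup then resumes at label 2: above, past the
 * second swapgs, which is why skipping it in the exception path is
 * correct.
 */
static __always_inline void fixup_gs_change_fault(struct pt_regs *regs)
{
	if (regs->ip != (unsigned long)asm_load_gs_index_gs_change)
		return;

	native_swapgs();	/* back to the kernel gsbase */
	/* plus whatever swapgs serialization the entry path requires */
	/* the %gs clearing done by ex_handler_clear_gs could live here too */
}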
Thank you for reviewing.
>> + native_swapgs();
>> + asm volatile(
>> + ".global asm_load_gs_index_gs_change \n"
>> + "asm_load_gs_index_gs_change: \n"
>> + "1: movl %0, %%gs \n"
>> + " swapgs \n"
>> + "2: \n"
>> + _ASM_EXTABLE_HANDLE(1b, 2b, ex_handler_clear_gs)
>> + :: "r" (selector) : "memory");
>> + alternative("", "mfence", X86_BUG_SWAPGS_FENCE);
>> + local_irq_restore(flags);
>> +}
>> +
On Tue, Sep 14, 2021 at 10:04:47AM +0800, Lai Jiangshan wrote:
> I think Peter is still reworking the patchset, which may include improving
> this patch. I'm OK if this patch is dropped.
No I am not.
I said I have interest, I also said you shouldn't repost after a single
bit of feedback -- there's nothing more annoying than trying to review a
large-ish and complicated series of patches when you get a new version
every other day.
But please do work on it, because I really don't have the bandwidth to
carry this atm. By now there's been a fair amount of feedback, so a new
version might be appropriate.
I really do think you've got some very good things here. Please work on
it. I will try and review :-)
On Tue, Sep 14, 2021 at 10:14:20AM +0200, Peter Zijlstra wrote:
> I really do think you've got some very good things here. Please work on
> it. I will try and review :-)
IOW, what he's saying is, you should handle your patchset the usual way.
And if you're unsure how, feel free to skim over:
Documentation/process/submitting-patches.rst
it is all explained there. :-)
HTH.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On 2021/9/14 16:14, Peter Zijlstra wrote:
> On Tue, Sep 14, 2021 at 10:04:47AM +0800, Lai Jiangshan wrote:
>> I think Peter is still reworking the patchset, which may include improving
>> this patch. I'm OK if this patch is dropped.
>
> No I am not.
I'm sorry.
>
> I said I have interest, I also said you shouldn't repost after a single
> bit of feedback -- there's nothing more annoying than trying to review a
> large-ish and complicated series of patches when you get a new version
> every other day.
>
> But please do work on it, because I really don't have the bandwidth to
> carry this atm. By now there's been a fair amount of feedback, so a new
> version might be appropriate.
Thank you, I will revise the patchset and send it.
>
> I really do think you've got some very good things here. Please work on
> it. I will try and review :-)
>