From: Lai Jiangshan <[email protected]>
Much of the ASM code in entry_64.S can be rewritten in C, provided it is
written to be non-instrumentable and is called in the correct order with
respect to whether CR3/gsbase has been switched to the kernel CR3/gsbase.
This patchset converts some of it to C code.
Patch 24 converts error_entry() to C code; patches 1-23 are fixes and
preparation for it.
Patches 25-27 convert entry_INT80_compat and do cleanup.
Patches 28-46 convert the IST entry code to C code; many of them are
preparation for the actual conversion.
Patches 47-49 do cleanup.
Patch 50 converts a small part of the syscall ASM code to C: the check
for whether sysret can be used to return to userspace.
Some other paths could also be moved to C code, for example the error
exit and the syscall entry/exit, and their PTI handling could be done in
C as well. But that would require pt_regs to be copied/pushed onto the
entry stack, which would make the C code less efficient.
When converting ASM to C, most of the effort went into keeping the two
equivalent; almost no creativity was involved. The code is kept as close
to the ASM as possible, and no functional change is intended, barring any
misunderstanding of the ASM code on my part. The functions called by the
C entry code have been checked to ensure they are noinstr or
__always_inline. Some of them have more than one definition and need
extra care from reviewers.
The comments in the ASM are also copied into the right places in the C code.
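To illustrate the conversion style, here is one of the smaller helpers
from the fence_swapgs patch later in this series: the ASM macro
FENCE_SWAPGS_USER_ENTRY becomes an __always_inline C function emitting
the same alternative, so the generated code stays the same:

	/* C equivalent of the ASM macro FENCE_SWAPGS_USER_ENTRY */
	static __always_inline void fence_swapgs_user_entry(void)
	{
		alternative("", "lfence", X86_FEATURE_FENCE_SWAPGS_USER);
	}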
Changed from V4:
Move FENCE_SWAPGS_KERNEL_ENTRY up in patch 1, and change the
corresponding C code in later patches to keep them coherent.
Jump to xenpv_restore_regs_and_return_to_usermode from
swapgs_restore_regs_and_return_to_usermode instead of calling
it everywhere.
Add Miguel Ojeda's Reviewed-by.
Changed from V3:
Add a "Reviewed-by" for the xenpv fix
Reviewed-by: Boris Ostrovsky <[email protected]>
Change __attribute((__section__(section))) to __section(section)
Move part of ist_paranoid_exit() into a new ist_restore_gsbase().
Add a new commit (patch 32) to change the ASM RESTORE_CR3; the
corresponding C version ist_restore_cr3() is changed too.
Changed from V2:
Fix two places with missed FENCE_SWAPGS_KERNEL_ENTRY.
Fix swapgs_restore_regs_and_return_to_usermode for XENPV.
Update the C error_entry()/paranoid_entry() to use
fence_swapgs_kernel_entry when running with user gsbase
in the kernel CR3.
Simplify removing stack-protector in the Makefile.
Squash the commits about removing stack-protector in the Makefile.
In V2 the C error_entry() checked xenpv first and used native_swapgs,
but the ASM error_entry() used the PV-aware SWAPGS. In V3, the
commit is split into 3 commits, so the conversion has no
semantic change.
Move cld to the start of idtentry.
Use idtentry macro for entry_INT80_compat and remove the old one.
Add cleanup for PTI_USER_PGTABLE_BIT when it is moved to header
file.
Remove pv-aware SWAPGS.
Changed from V1:
Add a fix as patch 1, found by trying to apply Peterz's
suggestion in patch 11.
The whole error_entry() is converted to C instead of only part of it.
The whole paranoid_entry() is converted to C instead of only part of it.
The ASM code of "paranoid_entry() cfunc() paranoid_exit()" is
converted to C, as suggested by Peterz.
Add entry64.c rather than move traps.c to arch/x86/entry/
The order of some commits is changed.
Remove two cleanups
[V1]: https://lore.kernel.org/all/[email protected]/
[V2]: https://lore.kernel.org/lkml/[email protected]/
[V3]: https://lore.kernel.org/lkml/[email protected]/
[V4]: https://lore.kernel.org/lkml/[email protected]/
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Joerg Roedel <[email protected]>
Lai Jiangshan (50):
x86/entry: Add fence for kernel entry swapgs in paranoid_entry()
x86/entry: Use the correct fence macro after swapgs in kernel CR3
x86/traps: Remove stack-protector from traps.c
x86/xen: Add xenpv_restore_regs_and_return_to_usermode()
x86/entry: Use swapgs and native_iret directly in
swapgs_restore_regs_and_return_to_usermode
compiler_types.h: Add __noinstr_section() for noinstr
x86/entry: Introduce __entry_text for entry code written in C
x86/entry: Move PTI_USER_* to arch/x86/include/asm/processor-flags.h
x86: Remove unused kernel_to_user_p4dp() and user_to_kernel_p4dp()
x86: Replace PTI_PGTABLE_SWITCH_BIT with PTI_USER_PGTABLE_BIT
x86: Mark __native_read_cr3() & native_write_cr3() as __always_inline
x86/traps: Move the declaration of native_irq_return_iret into proto.h
x86/entry: Add arch/x86/entry/entry64.c for C entry code
x86/entry: Expose the address of .Lgs_change to entry64.c
x86/entry: Add C verion of SWITCH_TO_KERNEL_CR3 as
switch_to_kernel_cr3()
x86/traps: Add fence_swapgs_{user,kernel}_entry()
x86/entry: Add C user_entry_swapgs_and_fence()
x86/traps: Move pt_regs only in fixup_bad_iret()
x86/entry: Switch the stack after error_entry() returns
x86/entry: move PUSH_AND_CLEAR_REGS out of error_entry
x86/entry: Move cld to the start of idtentry
x86/entry: Don't call error_entry for XENPV
x86/entry: Convert SWAPGS to swapgs in error_entry()
x86/entry: Implement the whole error_entry() as C code
x86/entry: Use idtentry macro for entry_INT80_compat
x86/entry: Convert SWAPGS to swapgs in entry_SYSENTER_compat()
x86: Remove the definition of SWAPGS
x86/entry: Make paranoid_exit() callable
x86/entry: Call paranoid_exit() in asm_exc_nmi()
x86/entry: move PUSH_AND_CLEAR_REGS out of paranoid_entry
x86/entry: Add the C version ist_switch_to_kernel_cr3()
x86/entry: Skip CR3 write when the saved CR3 is kernel CR3 in
RESTORE_CR3
x86/entry: Add the C version ist_restore_cr3()
x86/entry: Add the C version get_percpu_base()
x86/entry: Add the C version ist_switch_to_kernel_gsbase()
x86/entry: Implement the C version ist_paranoid_entry()
x86/entry: Implement the C version ist_paranoid_exit()
x86/entry: Add a C macro to define the function body for IST in
.entry.text
x86/debug, mce: Use C entry code
x86/idtentry.h: Move the definitions *IDTENTRY_{MCE|DEBUG}* up
x86/nmi: Use DEFINE_IDTENTRY_NMI for nmi
x86/nmi: Use C entry code
x86/entry: Add a C macro to define the function body for IST in
.entry.text with an error code
x86/doublefault: Use C entry code
x86/sev: Add and use ist_vc_switch_off_ist()
x86/sev: Use C entry code
x86/entry: Remove ASM function paranoid_entry() and paranoid_exit()
x86/entry: Remove the unused ASM macros
x86/entry: Remove save_ret from PUSH_AND_CLEAR_REGS
x86/syscall/64: Move the checking for sysret to C code
arch/x86/entry/Makefile | 3 +-
arch/x86/entry/calling.h | 142 +-------
arch/x86/entry/common.c | 73 +++-
arch/x86/entry/entry64.c | 346 +++++++++++++++++++
arch/x86/entry/entry_64.S | 445 ++++---------------------
arch/x86/entry/entry_64_compat.S | 104 +-----
arch/x86/include/asm/idtentry.h | 111 +++++-
arch/x86/include/asm/irqflags.h | 8 -
arch/x86/include/asm/pgtable.h | 23 +-
arch/x86/include/asm/processor-flags.h | 15 +
arch/x86/include/asm/proto.h | 5 +-
arch/x86/include/asm/special_insns.h | 4 +-
arch/x86/include/asm/syscall.h | 2 +-
arch/x86/include/asm/traps.h | 6 +-
arch/x86/kernel/Makefile | 3 +
arch/x86/kernel/cpu/mce/Makefile | 3 +
arch/x86/kernel/nmi.c | 2 +-
arch/x86/kernel/traps.c | 33 +-
arch/x86/xen/xen-asm.S | 20 ++
include/linux/compiler_types.h | 8 +-
20 files changed, 674 insertions(+), 682 deletions(-)
create mode 100644 arch/x86/entry/entry64.c
--
2.19.1.6.gb485710b
From: Lai Jiangshan <[email protected]>
Commit 18ec54fdd6d18 ("x86/speculation: Prepare entry code for Spectre
v1 swapgs mitigations") adds FENCE_SWAPGS_{KERNEL|USER}_ENTRY
for conditional swapgs. In paranoid_entry(), it uses only
FENCE_SWAPGS_KERNEL_ENTRY for both branches, because the fence is
required in both cases: the CR3 write is conditional even when PTI
is enabled.
But commit 96b2371413e8f ("x86/entry/64: Switch CR3 before SWAPGS in
paranoid entry") switches the code order and changes the branches,
and it misses the needed FENCE_SWAPGS_KERNEL_ENTRY for the user gsbase case.
Add it back by moving FENCE_SWAPGS_KERNEL_ENTRY up to cover both branches.
Fixes: 96b2371413e8f ("x86/entry/64: Switch CR3 before SWAPGS in paranoid entry")
Cc: Josh Poimboeuf <[email protected]>
Cc: Chang S. Bae <[email protected]>
Cc: Sasha Levin <[email protected]>
Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/entry/entry_64.S | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index e38a4cf795d9..14ffe12807ba 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -888,6 +888,13 @@ SYM_CODE_START_LOCAL(paranoid_entry)
ret
.Lparanoid_entry_checkgs:
+ /*
+ * The above SAVE_AND_SWITCH_TO_KERNEL_CR3 macro doesn't do an
+ * unconditional CR3 write, even in the PTI case. So do an lfence
+ * to prevent GS speculation, regardless of whether PTI is enabled.
+ */
+ FENCE_SWAPGS_KERNEL_ENTRY
+
/* EBX = 1 -> kernel GSBASE active, no restore required */
movl $1, %ebx
/*
@@ -903,13 +910,6 @@ SYM_CODE_START_LOCAL(paranoid_entry)
.Lparanoid_entry_swapgs:
swapgs
- /*
- * The above SAVE_AND_SWITCH_TO_KERNEL_CR3 macro doesn't do an
- * unconditional CR3 write, even in the PTI case. So do an lfence
- * to prevent GS speculation, regardless of whether PTI is enabled.
- */
- FENCE_SWAPGS_KERNEL_ENTRY
-
/* EBX = 0 -> SWAPGS required on exit */
xorl %ebx, %ebx
ret
--
2.19.1.6.gb485710b
From: Lai Jiangshan <[email protected]>
Commit c75890700455 ("x86/entry/64: Remove unneeded kernel CR3
switching") removes a CR3 write in the faulting path of load_gs_index().
But the path's FENCE_SWAPGS_USER_ENTRY has no fence operation if PTI
is enabled; rather, it depends on the CR3 write of SWITCH_TO_KERNEL_CR3.
So the path should use FENCE_SWAPGS_KERNEL_ENTRY if SWITCH_TO_KERNEL_CR3
is removed.
Fixes: c75890700455 ("x86/entry/64: Remove unneeded kernel CR3 switching")
Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/entry/entry_64.S | 16 +++++++++-------
1 file changed, 9 insertions(+), 7 deletions(-)
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 14ffe12807ba..6189a0dc83ab 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -993,11 +993,6 @@ SYM_CODE_START_LOCAL(error_entry)
pushq %r12
ret
-.Lerror_entry_done_lfence:
- FENCE_SWAPGS_KERNEL_ENTRY
-.Lerror_entry_done:
- ret
-
/*
* There are two places in the kernel that can potentially fault with
* usergs. Handle them here. B stepping K8s sometimes report a
@@ -1020,8 +1015,15 @@ SYM_CODE_START_LOCAL(error_entry)
* .Lgs_change's error handler with kernel gsbase.
*/
SWAPGS
- FENCE_SWAPGS_USER_ENTRY
- jmp .Lerror_entry_done
+
+ /*
+ * The above code has no serializing instruction. So do an lfence
+ * to prevent GS speculation, regardless of whether it is kernel
+ * gsbase or user gsbase.
+ */
+.Lerror_entry_done_lfence:
+ FENCE_SWAPGS_KERNEL_ENTRY
+ ret
.Lbstep_iret:
/* Fix truncated RIP */
--
2.19.1.6.gb485710b
From: Lai Jiangshan <[email protected]>
When stack-protector is enabled, the compiler adds some instrumentation
code at the beginning and the end of some functions. Many functions in
traps.c must not be instrumented. Moreover, the stack-protector code at
the beginning of an affected function accesses the canary, which might be
watched by hardware breakpoints; this also violates the non-instrumentable
nature of those functions and can cause an infinite recursive #DB because
the canary is accessed before DR7 is reset.
So it is better to remove stack-protector from traps.c.
This also prepares for later patches that move some entry code into
traps.c, parts of which can NOT use percpu registers until gsbase is
properly switched, and stack-protector depends on the percpu register
to work.
Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/kernel/Makefile | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 2ff3e600f426..8ac45801ba8b 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -50,6 +50,7 @@ KCOV_INSTRUMENT := n
CFLAGS_head$(BITS).o += -fno-stack-protector
CFLAGS_cc_platform.o += -fno-stack-protector
+CFLAGS_traps.o += -fno-stack-protector
CFLAGS_irq.o := -I $(srctree)/$(src)/../include/asm/trace
--
2.19.1.6.gb485710b
From: Lai Jiangshan <[email protected]>
Add __noinstr_section() and define noinstr in terms of it. It will be
extended for the C entry code.
Cc: Borislav Petkov <[email protected]>
Reviewed-by: Miguel Ojeda <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Suggested-by: Nick Desaulniers <[email protected]>
Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Lai Jiangshan <[email protected]>
---
include/linux/compiler_types.h | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/include/linux/compiler_types.h b/include/linux/compiler_types.h
index 05ceb2e92b0e..b36e2df98647 100644
--- a/include/linux/compiler_types.h
+++ b/include/linux/compiler_types.h
@@ -208,9 +208,11 @@ struct ftrace_likely_data {
#endif
/* Section for code which can't be instrumented at all */
-#define noinstr \
- noinline notrace __attribute((__section__(".noinstr.text"))) \
- __no_kcsan __no_sanitize_address __no_profile __no_sanitize_coverage
+#define __noinstr_section(section) \
+ noinline notrace __section(section) __no_profile \
+ __no_kcsan __no_sanitize_address __no_sanitize_coverage
+
+#define noinstr __noinstr_section(".noinstr.text")
#endif /* __KERNEL__ */
--
2.19.1.6.gb485710b
From: Lai Jiangshan <[email protected]>
kernel_to_user_p4dp() and user_to_kernel_p4dp() have no caller and can
be removed.
Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/include/asm/pgtable.h | 10 ----------
1 file changed, 10 deletions(-)
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 448cd01eb3ec..65542106464b 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1200,16 +1200,6 @@ static inline pgd_t *user_to_kernel_pgdp(pgd_t *pgdp)
{
return ptr_clear_bit(pgdp, PTI_PGTABLE_SWITCH_BIT);
}
-
-static inline p4d_t *kernel_to_user_p4dp(p4d_t *p4dp)
-{
- return ptr_set_bit(p4dp, PTI_PGTABLE_SWITCH_BIT);
-}
-
-static inline p4d_t *user_to_kernel_p4dp(p4d_t *p4dp)
-{
- return ptr_clear_bit(p4dp, PTI_PGTABLE_SWITCH_BIT);
-}
#endif /* CONFIG_PAGE_TABLE_ISOLATION */
/*
--
2.19.1.6.gb485710b
From: Lai Jiangshan <[email protected]>
swapgs_restore_regs_and_return_to_usermode() is used in native code
(non-xenpv) only now, so it doesn't need the PV-aware SWAPGS and
INTERRUPT_RETURN.
Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/entry/entry_64.S | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index ebcc17e1d7f1..fb977d958889 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -608,8 +608,8 @@ SYM_INNER_LABEL(swapgs_restore_regs_and_return_to_usermode, SYM_L_GLOBAL)
/* Restore RDI. */
popq %rdi
- SWAPGS
- INTERRUPT_RETURN
+ swapgs
+ jmp native_iret
SYM_INNER_LABEL(restore_regs_and_return_to_kernel, SYM_L_GLOBAL)
--
2.19.1.6.gb485710b
From: Lai Jiangshan <[email protected]>
These constants will also be used in C files, so move them to
arch/x86/include/asm/processor-flags.h, which already has the related
X86_CR3_PTI_PCID_USER_BIT defined in it.
Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/entry/calling.h | 10 ----------
arch/x86/include/asm/processor-flags.h | 15 +++++++++++++++
2 files changed, 15 insertions(+), 10 deletions(-)
diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index a4c061fb7c6e..996b041e92d2 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -149,16 +149,6 @@ For 32-bit we have the following conventions - kernel is built with
#ifdef CONFIG_PAGE_TABLE_ISOLATION
-/*
- * PAGE_TABLE_ISOLATION PGDs are 8k. Flip bit 12 to switch between the two
- * halves:
- */
-#define PTI_USER_PGTABLE_BIT PAGE_SHIFT
-#define PTI_USER_PGTABLE_MASK (1 << PTI_USER_PGTABLE_BIT)
-#define PTI_USER_PCID_BIT X86_CR3_PTI_PCID_USER_BIT
-#define PTI_USER_PCID_MASK (1 << PTI_USER_PCID_BIT)
-#define PTI_USER_PGTABLE_AND_PCID_MASK (PTI_USER_PCID_MASK | PTI_USER_PGTABLE_MASK)
-
.macro SET_NOFLUSH_BIT reg:req
bts $X86_CR3_PCID_NOFLUSH_BIT, \reg
.endm
diff --git a/arch/x86/include/asm/processor-flags.h b/arch/x86/include/asm/processor-flags.h
index 02c2cbda4a74..4dd2fbbc861a 100644
--- a/arch/x86/include/asm/processor-flags.h
+++ b/arch/x86/include/asm/processor-flags.h
@@ -4,6 +4,7 @@
#include <uapi/asm/processor-flags.h>
#include <linux/mem_encrypt.h>
+#include <asm/page_types.h>
#ifdef CONFIG_VM86
#define X86_VM_MASK X86_EFLAGS_VM
@@ -50,7 +51,21 @@
#endif
#ifdef CONFIG_PAGE_TABLE_ISOLATION
+
# define X86_CR3_PTI_PCID_USER_BIT 11
+
+#ifdef CONFIG_X86_64
+/*
+ * PAGE_TABLE_ISOLATION PGDs are 8k. Flip bit 12 to switch between the two
+ * halves:
+ */
+#define PTI_USER_PGTABLE_BIT PAGE_SHIFT
+#define PTI_USER_PGTABLE_MASK (1 << PTI_USER_PGTABLE_BIT)
+#define PTI_USER_PCID_BIT X86_CR3_PTI_PCID_USER_BIT
+#define PTI_USER_PCID_MASK (1 << PTI_USER_PCID_BIT)
+#define PTI_USER_PGTABLE_AND_PCID_MASK (PTI_USER_PCID_MASK | PTI_USER_PGTABLE_MASK)
+#endif
+
#endif
#endif /* _ASM_X86_PROCESSOR_FLAGS_H */
--
2.19.1.6.gb485710b
From: Lai Jiangshan <[email protected]>
Some entry code will be implemented in C files. __entry_text is needed
to place it in the .entry.text section. __entry_text disables
instrumentation like noinstr, but it doesn't disable the stack protector,
since not all compilers supported by the kernel support a function-level
attribute to disable the stack protector. It is disabled at the C file
level instead.
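For illustration, a C entry function marked with __entry_text is declared
as below (this matches how error_entry() is declared later in the series);
the function lands in .entry.text with all instrumentation disabled:

	asmlinkage __visible __entry_text
	struct pt_regs *error_entry(struct pt_regs *eregs);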
Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/include/asm/idtentry.h | 3 +++
1 file changed, 3 insertions(+)
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 1345088e9902..6779def97591 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -11,6 +11,9 @@
#include <asm/irq_stack.h>
+/* Entry code written in C. */
+#define __entry_text __noinstr_section(".entry.text")
+
/**
* DECLARE_IDTENTRY - Declare functions for simple IDT entry points
* No error code pushed by hardware
--
2.19.1.6.gb485710b
From: Lai Jiangshan <[email protected]>
__native_read_cr3() and native_write_cr3() need to be guaranteed noinstr.
This prepares for later patches which implement entry code in a C file;
some of that code needs to handle KPTI and has to read/write CR3.
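As a rough sketch only (not necessarily the exact helper added later in
the series), a CR3 switch built on these accessors could look like this,
assuming the PTI_USER_PGTABLE_AND_PCID_MASK definition from the earlier
processor-flags.h patch:

	/* Sketch: switch to the kernel CR3 when PTI is enabled. */
	static __always_inline void switch_to_kernel_cr3(void)
	{
		if (static_cpu_has(X86_FEATURE_PTI))
			native_write_cr3(__native_read_cr3() &
					 ~PTI_USER_PGTABLE_AND_PCID_MASK);
	}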
Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/include/asm/special_insns.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h
index 68c257a3de0d..fbb057ba60e6 100644
--- a/arch/x86/include/asm/special_insns.h
+++ b/arch/x86/include/asm/special_insns.h
@@ -42,14 +42,14 @@ static __always_inline void native_write_cr2(unsigned long val)
asm volatile("mov %0,%%cr2": : "r" (val) : "memory");
}
-static inline unsigned long __native_read_cr3(void)
+static __always_inline unsigned long __native_read_cr3(void)
{
unsigned long val;
asm volatile("mov %%cr3,%0\n\t" : "=r" (val) : __FORCE_ORDER);
return val;
}
-static inline void native_write_cr3(unsigned long val)
+static __always_inline void native_write_cr3(unsigned long val)
{
asm volatile("mov %0,%%cr3": : "r" (val) : "memory");
}
--
2.19.1.6.gb485710b
From: Lai Jiangshan <[email protected]>
In the native case, PER_CPU_VAR(cpu_tss_rw + TSS_sp0) is the
trampoline stack. But XEN pv doesn't use a trampoline stack, so there
PER_CPU_VAR(cpu_tss_rw + TSS_sp0) is the kernel stack as well. Hence the
source and destination stacks are identical in that case, which means
reusing swapgs_restore_regs_and_return_to_usermode() in XEN pv would
cause %rsp to move up to the top of the kernel stack and leave the IRET
frame below %rsp. That frame is in danger of being corrupted if #NMI / #MC
hit, as either of those events occurring in the middle of the stack
pushing would clobber data on the (original) stack.
Also, under XEN pv, having swapgs_restore_regs_and_return_to_usermode()
push the IRET frame onto the same address is useless and error-prone for
any future attempt to modify the code.
Fixes: 7f2590a110b8 ("x86/entry/64: Use a per-CPU trampoline stack for IDT entries")
Cc: Jan Beulich <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Juergen Gross <[email protected]>
Cc: Peter Anvin <[email protected]>
Cc: [email protected]
Reviewed-by: Boris Ostrovsky <[email protected]>
Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/entry/entry_64.S | 4 ++++
arch/x86/xen/xen-asm.S | 20 ++++++++++++++++++++
2 files changed, 24 insertions(+)
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 6189a0dc83ab..ebcc17e1d7f1 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -574,6 +574,10 @@ SYM_INNER_LABEL(swapgs_restore_regs_and_return_to_usermode, SYM_L_GLOBAL)
ud2
1:
#endif
+#ifdef CONFIG_XEN_PV
+ ALTERNATIVE "", "jmp xenpv_restore_regs_and_return_to_usermode", X86_FEATURE_XENPV
+#endif
+
POP_REGS pop_rdi=0
/*
diff --git a/arch/x86/xen/xen-asm.S b/arch/x86/xen/xen-asm.S
index 220dd9678494..444d824775f6 100644
--- a/arch/x86/xen/xen-asm.S
+++ b/arch/x86/xen/xen-asm.S
@@ -20,6 +20,7 @@
#include <linux/init.h>
#include <linux/linkage.h>
+#include <../entry/calling.h>
.pushsection .noinstr.text, "ax"
/*
@@ -192,6 +193,25 @@ SYM_CODE_START(xen_iret)
jmp hypercall_iret
SYM_CODE_END(xen_iret)
+/*
+ * XEN pv doesn't use trampoline stack, PER_CPU_VAR(cpu_tss_rw + TSS_sp0) is
+ * also the kernel stack. Reusing swapgs_restore_regs_and_return_to_usermode()
+ * in XEN pv would cause %rsp to move up to the top of the kernel stack and
+ * leave the IRET frame below %rsp, which is dangerous to be corrupted if #NMI
+ * interrupts. And swapgs_restore_regs_and_return_to_usermode() pushing the IRET
+ * frame at the same address is useless.
+ */
+SYM_CODE_START(xenpv_restore_regs_and_return_to_usermode)
+ UNWIND_HINT_REGS
+ POP_REGS
+
+ /* stackleak_erase() can work safely on the kernel stack. */
+ STACKLEAK_ERASE_NOCLOBBER
+
+ addq $8, %rsp /* skip regs->orig_ax */
+ jmp xen_iret
+SYM_CODE_END(xenpv_restore_regs_and_return_to_usermode)
+
/*
* Xen handles syscall callbacks much like ordinary exceptions, which
* means we have:
--
2.19.1.6.gb485710b
From: Lai Jiangshan <[email protected]>
The C user_entry_swapgs_and_fence() implements the ASM code:
swapgs
FENCE_SWAPGS_USER_ENTRY
It will be used in the user entry swapgs code path, doing the swapgs and
lfence to prevent a speculative swapgs when coming from kernel space.
Cc: Josh Poimboeuf <[email protected]>
Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/entry/entry64.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/arch/x86/entry/entry64.c b/arch/x86/entry/entry64.c
index bdc9540f25d3..3db503ea0703 100644
--- a/arch/x86/entry/entry64.c
+++ b/arch/x86/entry/entry64.c
@@ -49,6 +49,9 @@ static __always_inline void switch_to_kernel_cr3(void) {}
* fence_swapgs_kernel_entry is used in the kernel entry code path without
* CR3 write or with conditional CR3 write only, to prevent the swapgs from
* getting speculatively skipped when coming from user space.
+ *
+ * user_entry_swapgs_and_fence is a wrapper of swapgs and fence for user entry
+ * code path.
*/
static __always_inline void fence_swapgs_user_entry(void)
{
@@ -59,3 +62,9 @@ static __always_inline void fence_swapgs_kernel_entry(void)
{
alternative("", "lfence", X86_FEATURE_FENCE_SWAPGS_KERNEL);
}
+
+static __always_inline void user_entry_swapgs_and_fence(void)
+{
+ native_swapgs();
+ fence_swapgs_user_entry();
+}
--
2.19.1.6.gb485710b
From: Lai Jiangshan <[email protected]>
fence_swapgs_{user,kernel}_entry() in entry64.c are the same as
the ASM macros FENCE_SWAPGS_{USER,KERNEL}_ENTRY.
fence_swapgs_user_entry is used in the user entry swapgs code path,
to prevent a speculative swapgs when coming from kernel space.
fence_swapgs_kernel_entry is used in the kernel entry code path,
to prevent the swapgs from getting speculatively skipped when
coming from user space.
Cc: Josh Poimboeuf <[email protected]>
Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/entry/entry64.c | 21 +++++++++++++++++++++
1 file changed, 21 insertions(+)
diff --git a/arch/x86/entry/entry64.c b/arch/x86/entry/entry64.c
index 9a5c535b1ddf..bdc9540f25d3 100644
--- a/arch/x86/entry/entry64.c
+++ b/arch/x86/entry/entry64.c
@@ -38,3 +38,24 @@ static __always_inline void switch_to_kernel_cr3(void)
#else
static __always_inline void switch_to_kernel_cr3(void) {}
#endif
+
+/*
+ * Mitigate Spectre v1 for conditional swapgs code paths.
+ *
+ * fence_swapgs_user_entry is used in the user entry swapgs code path, to
+ * prevent a speculative swapgs when coming from kernel space. It must be
+ * used with switch_to_kernel_cr3() in the same path.
+ *
+ * fence_swapgs_kernel_entry is used in the kernel entry code path without
+ * CR3 write or with conditional CR3 write only, to prevent the swapgs from
+ * getting speculatively skipped when coming from user space.
+ */
+static __always_inline void fence_swapgs_user_entry(void)
+{
+ alternative("", "lfence", X86_FEATURE_FENCE_SWAPGS_USER);
+}
+
+static __always_inline void fence_swapgs_kernel_entry(void)
+{
+ alternative("", "lfence", X86_FEATURE_FENCE_SWAPGS_KERNEL);
+}
--
2.19.1.6.gb485710b
From: Lai Jiangshan <[email protected]>
Moving PUSH_AND_CLEAR_REGS out of error_entry doesn't change any
functionality, but it does enlarge the object size:
size arch/x86/entry/entry_64.o.before:
text data bss dec hex filename
17916 384 0 18300 477c arch/x86/entry/entry_64.o
size --format=SysV arch/x86/entry/entry_64.o.before:
.entry.text 5528 0
.orc_unwind 6456 0
.orc_unwind_ip 4304 0
size arch/x86/entry/entry_64.o.after:
text data bss dec hex filename
26868 384 0 27252 6a74 arch/x86/entry/entry_64.o
size --format=SysV arch/x86/entry/entry_64.o.after:
.entry.text 8200 0
.orc_unwind 10224 0
.orc_unwind_ip 6816 0
But .entry.text on x86_64 is 2M aligned, so enlarging it to 8.2k doesn't
enlarge the final text size.
The .orc_unwind[_ip] tables grow because the change adds many push
instructions.
This prepares for converting the whole error_entry() into C code.
Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/entry/entry_64.S | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 3667317f6825..25c534b78eb7 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -322,6 +322,9 @@ SYM_CODE_END(ret_from_fork)
*/
.macro idtentry_body cfunc has_error_code:req
+ PUSH_AND_CLEAR_REGS
+ ENCODE_FRAME_POINTER
+
call error_entry
movq %rax, %rsp /* switch stack settled by sync_regs() */
ENCODE_FRAME_POINTER
@@ -976,8 +979,6 @@ SYM_CODE_END(paranoid_exit)
SYM_CODE_START_LOCAL(error_entry)
UNWIND_HINT_FUNC
cld
- PUSH_AND_CLEAR_REGS save_ret=1
- ENCODE_FRAME_POINTER 8
testb $3, CS+8(%rsp)
jz .Lerror_kernelspace
--
2.19.1.6.gb485710b
From: Lai Jiangshan <[email protected]>
The address of .Lgs_change will be used in entry64.c in a later patch
when some entry code is implemented there, so expose it to entry64.c in
preparation.
The label .Lgs_change is still needed in the ASM code for the extable,
since the extable cannot use asm_load_gs_index_gs_change. Otherwise:
warning: objtool: __ex_table+0x0: don't know how to handle
non-section reloc symbol asm_load_gs_index_gs_change
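For reference, the C side added later in the series compares the faulting
RIP against the exposed symbol roughly like this:

	extern unsigned char asm_load_gs_index_gs_change[];

	if (eregs->ip == (unsigned long)asm_load_gs_index_gs_change)
		native_swapgs();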
Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/entry/entry64.c | 2 ++
arch/x86/entry/entry_64.S | 3 ++-
2 files changed, 4 insertions(+), 1 deletion(-)
diff --git a/arch/x86/entry/entry64.c b/arch/x86/entry/entry64.c
index 762595603ce7..9813a30dbadb 100644
--- a/arch/x86/entry/entry64.c
+++ b/arch/x86/entry/entry64.c
@@ -12,3 +12,5 @@
* is PTI user CR3 or both.
*/
#include <asm/traps.h>
+
+extern unsigned char asm_load_gs_index_gs_change[];
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index fb977d958889..e7e56665daa2 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -733,6 +733,7 @@ _ASM_NOKPROBE(common_interrupt_return)
SYM_FUNC_START(asm_load_gs_index)
FRAME_BEGIN
swapgs
+SYM_INNER_LABEL(asm_load_gs_index_gs_change, SYM_L_GLOBAL)
.Lgs_change:
movl %edi, %gs
2: ALTERNATIVE "", "mfence", X86_BUG_SWAPGS_FENCE
@@ -1010,7 +1011,7 @@ SYM_CODE_START_LOCAL(error_entry)
movl %ecx, %eax /* zero extend */
cmpq %rax, RIP+8(%rsp)
je .Lbstep_iret
- cmpq $.Lgs_change, RIP+8(%rsp)
+ cmpq $asm_load_gs_index_gs_change, RIP+8(%rsp)
jne .Lerror_entry_done_lfence
/*
--
2.19.1.6.gb485710b
From: Lai Jiangshan <[email protected]>
Make fixup_bad_iret() work like sync_regs(), which doesn't move the
return address of error_entry().
This prepares for a later patch which implements the body of
error_entry() in C code; fixup_bad_iret() can't handle the return
address when it is called from C code.
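With the return address out of the way, the later C error_entry() can
simply chain the calls when it handles a bad IRET:

	/* Pretend the exception came from user mode and move pt_regs. */
	return sync_regs(fixup_bad_iret(eregs));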
Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/entry/entry_64.S | 5 ++++-
arch/x86/include/asm/traps.h | 2 +-
arch/x86/kernel/traps.c | 17 ++++++-----------
3 files changed, 11 insertions(+), 13 deletions(-)
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index e7e56665daa2..c6b617c19fe1 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1048,9 +1048,12 @@ SYM_CODE_START_LOCAL(error_entry)
* Pretend that the exception came from user mode: set up pt_regs
* as if we faulted immediately after IRET.
*/
- mov %rsp, %rdi
popq %r12 /* save return addr in %r12 */
+ movq %rsp, %rdi /* arg0 = pt_regs pointer */
call fixup_bad_iret
mov %rax, %rsp
+ ENCODE_FRAME_POINTER
+ pushq %r12
jmp .Lerror_entry_from_usermode_after_swapgs
SYM_CODE_END(error_entry)
diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index 6221be7cafc3..1cdd7e8bcba7 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -13,7 +13,7 @@
#ifdef CONFIG_X86_64
asmlinkage __visible notrace struct pt_regs *sync_regs(struct pt_regs *eregs);
asmlinkage __visible notrace
-struct bad_iret_stack *fixup_bad_iret(struct bad_iret_stack *s);
+struct pt_regs *fixup_bad_iret(struct pt_regs *bad_regs);
void __init trap_init(void);
asmlinkage __visible noinstr struct pt_regs *vc_switch_off_ist(struct pt_regs *eregs);
#endif
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 1be5c1edad6b..4e9d306f313c 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -759,13 +759,8 @@ asmlinkage __visible noinstr struct pt_regs *vc_switch_off_ist(struct pt_regs *r
}
#endif
-struct bad_iret_stack {
- void *error_entry_ret;
- struct pt_regs regs;
-};
-
asmlinkage __visible noinstr
-struct bad_iret_stack *fixup_bad_iret(struct bad_iret_stack *s)
+struct pt_regs *fixup_bad_iret(struct pt_regs *bad_regs)
{
/*
* This is called from entry_64.S early in handling a fault
@@ -775,19 +770,19 @@ struct bad_iret_stack *fixup_bad_iret(struct bad_iret_stack *s)
* just below the IRET frame) and we want to pretend that the
* exception came from the IRET target.
*/
- struct bad_iret_stack tmp, *new_stack =
- (struct bad_iret_stack *)__this_cpu_read(cpu_tss_rw.x86_tss.sp0) - 1;
+ struct pt_regs tmp, *new_stack =
+ (struct pt_regs *)__this_cpu_read(cpu_tss_rw.x86_tss.sp0) - 1;
/* Copy the IRET target to the temporary storage. */
- __memcpy(&tmp.regs.ip, (void *)s->regs.sp, 5*8);
+ __memcpy(&tmp.ip, (void *)bad_regs->sp, 5*8);
/* Copy the remainder of the stack from the current stack. */
- __memcpy(&tmp, s, offsetof(struct bad_iret_stack, regs.ip));
+ __memcpy(&tmp, bad_regs, offsetof(struct pt_regs, ip));
/* Update the entry stack */
__memcpy(new_stack, &tmp, sizeof(tmp));
- BUG_ON(!user_mode(&new_stack->regs));
+ BUG_ON(!user_mode(new_stack));
return new_stack;
}
#endif
--
2.19.1.6.gb485710b
From: Lai Jiangshan <[email protected]>
Make it next to CLAC
Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/entry/entry_64.S | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 25c534b78eb7..74c82fb01d36 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -356,6 +356,7 @@ SYM_CODE_END(ret_from_fork)
SYM_CODE_START(\asmsym)
UNWIND_HINT_IRET_REGS offset=\has_error_code*8
ASM_CLAC
+ cld
.if \has_error_code == 0
pushq $-1 /* ORIG_RAX: no syscall to restart */
@@ -423,6 +424,7 @@ SYM_CODE_END(\asmsym)
SYM_CODE_START(\asmsym)
UNWIND_HINT_IRET_REGS
ASM_CLAC
+ cld
pushq $-1 /* ORIG_RAX: no syscall to restart */
@@ -478,6 +480,7 @@ SYM_CODE_END(\asmsym)
SYM_CODE_START(\asmsym)
UNWIND_HINT_IRET_REGS
ASM_CLAC
+ cld
/*
* If the entry is from userspace, switch stacks and treat it as
@@ -539,6 +542,7 @@ SYM_CODE_END(\asmsym)
SYM_CODE_START(\asmsym)
UNWIND_HINT_IRET_REGS offset=8
ASM_CLAC
+ cld
/* paranoid_entry returns GS information for paranoid_exit in EBX. */
call paranoid_entry
@@ -854,7 +858,6 @@ SYM_CODE_END(xen_failsafe_callback)
*/
SYM_CODE_START_LOCAL(paranoid_entry)
UNWIND_HINT_FUNC
- cld
PUSH_AND_CLEAR_REGS save_ret=1
ENCODE_FRAME_POINTER 8
@@ -978,7 +981,6 @@ SYM_CODE_END(paranoid_exit)
*/
SYM_CODE_START_LOCAL(error_entry)
UNWIND_HINT_FUNC
- cld
testb $3, CS+8(%rsp)
jz .Lerror_kernelspace
@@ -1112,6 +1114,7 @@ SYM_CODE_START(asm_exc_nmi)
*/
ASM_CLAC
+ cld
/* Use %rdx as our temp variable throughout */
pushq %rdx
@@ -1131,7 +1134,6 @@ SYM_CODE_START(asm_exc_nmi)
*/
swapgs
- cld
FENCE_SWAPGS_USER_ENTRY
SWITCH_TO_KERNEL_CR3 scratch_reg=%rdx
movq %rsp, %rdx
--
2.19.1.6.gb485710b
From: Lai Jiangshan <[email protected]>
Under XENPV, the kernel is already on the task stack, and it can't fault
at native_irq_return_iret nor asm_load_gs_index_gs_change, since
XENPV uses its own pvops for iret and load_gs_index(). It also
doesn't need to switch CR3. So it can skip invoking error_entry().
Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/entry/entry_64.S | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 74c82fb01d36..2f3883d92536 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -325,8 +325,17 @@ SYM_CODE_END(ret_from_fork)
PUSH_AND_CLEAR_REGS
ENCODE_FRAME_POINTER
- call error_entry
- movq %rax, %rsp /* switch stack settled by sync_regs() */
+ /*
+ * Call error_entry and switch stack settled by sync_regs().
+ *
+ * When in XENPV, it is already in the task stack, and it can't fault
+ * at native_irq_return_iret nor asm_load_gs_index_gs_change since
+ * XENPV uses its own pvops for iret and load_gs_index(). And it
+ * doesn't need to switch CR3. So it can skip invoking error_entry().
+ */
+ ALTERNATIVE "call error_entry; movq %rax, %rsp", \
+ "", X86_FEATURE_XENPV
+
ENCODE_FRAME_POINTER
UNWIND_HINT_REGS
--
2.19.1.6.gb485710b
From: Lai Jiangshan <[email protected]>
entry_INT80_compat is identical to the idtentry macro except for some
special handling of %rax in the prologue.
Add that prologue handling to idtentry and use idtentry for entry_INT80_compat.
Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/entry/entry_64.S | 18 ++++++
arch/x86/entry/entry_64_compat.S | 102 -------------------------------
arch/x86/include/asm/idtentry.h | 47 ++++++++++++++
arch/x86/include/asm/proto.h | 4 --
4 files changed, 65 insertions(+), 106 deletions(-)
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index dac327c56204..d8a0a40706b6 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -371,6 +371,24 @@ SYM_CODE_START(\asmsym)
pushq $-1 /* ORIG_RAX: no syscall to restart */
.endif
+ .if \vector == IA32_SYSCALL_VECTOR
+ /*
+ * User tracing code (ptrace or signal handlers) might assume
+ * that the saved RAX contains a 32-bit number when we're
+ * invoking a 32-bit syscall. Just in case the high bits are
+ * nonzero, zero-extend the syscall number. (This could almost
+ * certainly be deleted with no ill effects.)
+ */
+ movl %eax, %eax
+
+ /*
+ * do_int80_syscall_32() expects regs->orig_ax to be user ax,
+ * and regs->ax to be $-ENOSYS.
+ */
+ movq %rax, (%rsp)
+ movq $-ENOSYS, %rax
+ .endif
+
.if \vector == X86_TRAP_BP
/*
* If coming from kernel space, create a 6-word gap to allow the
diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
index 0051cf5c792d..a4fcea0cab14 100644
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -311,105 +311,3 @@ sysret32_from_system_call:
swapgs
sysretl
SYM_CODE_END(entry_SYSCALL_compat)
-
-/*
- * 32-bit legacy system call entry.
- *
- * 32-bit x86 Linux system calls traditionally used the INT $0x80
- * instruction. INT $0x80 lands here.
- *
- * This entry point can be used by 32-bit and 64-bit programs to perform
- * 32-bit system calls. Instances of INT $0x80 can be found inline in
- * various programs and libraries. It is also used by the vDSO's
- * __kernel_vsyscall fallback for hardware that doesn't support a faster
- * entry method. Restarted 32-bit system calls also fall back to INT
- * $0x80 regardless of what instruction was originally used to do the
- * system call.
- *
- * This is considered a slow path. It is not used by most libc
- * implementations on modern hardware except during process startup.
- *
- * Arguments:
- * eax system call number
- * ebx arg1
- * ecx arg2
- * edx arg3
- * esi arg4
- * edi arg5
- * ebp arg6
- */
-SYM_CODE_START(entry_INT80_compat)
- UNWIND_HINT_EMPTY
- /*
- * Interrupts are off on entry.
- */
- ASM_CLAC /* Do this early to minimize exposure */
- SWAPGS
-
- /*
- * User tracing code (ptrace or signal handlers) might assume that
- * the saved RAX contains a 32-bit number when we're invoking a 32-bit
- * syscall. Just in case the high bits are nonzero, zero-extend
- * the syscall number. (This could almost certainly be deleted
- * with no ill effects.)
- */
- movl %eax, %eax
-
- /* switch to thread stack expects orig_ax and rdi to be pushed */
- pushq %rax /* pt_regs->orig_ax */
- pushq %rdi /* pt_regs->di */
-
- /* Need to switch before accessing the thread stack. */
- SWITCH_TO_KERNEL_CR3 scratch_reg=%rdi
-
- /* In the Xen PV case we already run on the thread stack. */
- ALTERNATIVE "", "jmp .Lint80_keep_stack", X86_FEATURE_XENPV
-
- movq %rsp, %rdi
- movq PER_CPU_VAR(cpu_current_top_of_stack), %rsp
-
- pushq 6*8(%rdi) /* regs->ss */
- pushq 5*8(%rdi) /* regs->rsp */
- pushq 4*8(%rdi) /* regs->eflags */
- pushq 3*8(%rdi) /* regs->cs */
- pushq 2*8(%rdi) /* regs->ip */
- pushq 1*8(%rdi) /* regs->orig_ax */
- pushq (%rdi) /* pt_regs->di */
-.Lint80_keep_stack:
-
- pushq %rsi /* pt_regs->si */
- xorl %esi, %esi /* nospec si */
- pushq %rdx /* pt_regs->dx */
- xorl %edx, %edx /* nospec dx */
- pushq %rcx /* pt_regs->cx */
- xorl %ecx, %ecx /* nospec cx */
- pushq $-ENOSYS /* pt_regs->ax */
- pushq %r8 /* pt_regs->r8 */
- xorl %r8d, %r8d /* nospec r8 */
- pushq %r9 /* pt_regs->r9 */
- xorl %r9d, %r9d /* nospec r9 */
- pushq %r10 /* pt_regs->r10*/
- xorl %r10d, %r10d /* nospec r10 */
- pushq %r11 /* pt_regs->r11 */
- xorl %r11d, %r11d /* nospec r11 */
- pushq %rbx /* pt_regs->rbx */
- xorl %ebx, %ebx /* nospec rbx */
- pushq %rbp /* pt_regs->rbp */
- xorl %ebp, %ebp /* nospec rbp */
- pushq %r12 /* pt_regs->r12 */
- xorl %r12d, %r12d /* nospec r12 */
- pushq %r13 /* pt_regs->r13 */
- xorl %r13d, %r13d /* nospec r13 */
- pushq %r14 /* pt_regs->r14 */
- xorl %r14d, %r14d /* nospec r14 */
- pushq %r15 /* pt_regs->r15 */
- xorl %r15d, %r15d /* nospec r15 */
-
- UNWIND_HINT_REGS
-
- cld
-
- movq %rsp, %rdi
- call do_int80_syscall_32
- jmp swapgs_restore_regs_and_return_to_usermode
-SYM_CODE_END(entry_INT80_compat)
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 6779def97591..49fabc3e3f0d 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -207,6 +207,20 @@ __visible noinstr void func(struct pt_regs *regs, \
\
static noinline void __##func(struct pt_regs *regs, u32 vector)
+/**
+ * DECLARE_IDTENTRY_IA32_EMULATION - Declare functions for int80
+ * @vector: Vector number (ignored for C)
+ * @asm_func: Function name of the entry point
+ * @cfunc: The C handler called from the ASM entry point (ignored for C)
+ *
+ * Declares two functions:
+ * - The ASM entry point: asm_func
+ * - The XEN PV trap entry point: xen_##asm_func (maybe unused)
+ */
+#define DECLARE_IDTENTRY_IA32_EMULATION(vector, asm_func, cfunc) \
+ asmlinkage void asm_func(void); \
+ asmlinkage void xen_##asm_func(void)
+
/**
* DECLARE_IDTENTRY_SYSVEC - Declare functions for system vector entry points
* @vector: Vector number (ignored for C)
@@ -433,6 +447,35 @@ __visible noinstr void func(struct pt_regs *regs, \
#define DECLARE_IDTENTRY_ERRORCODE(vector, func) \
idtentry vector asm_##func func has_error_code=1
+/*
+ * 32-bit legacy system call entry.
+ *
+ * 32-bit x86 Linux system calls traditionally used the INT $0x80
+ * instruction. INT $0x80 lands here.
+ *
+ * This entry point can be used by 32-bit and 64-bit programs to perform
+ * 32-bit system calls. Instances of INT $0x80 can be found inline in
+ * various programs and libraries. It is also used by the vDSO's
+ * __kernel_vsyscall fallback for hardware that doesn't support a faster
+ * entry method. Restarted 32-bit system calls also fall back to INT
+ * $0x80 regardless of what instruction was originally used to do the
+ * system call.
+ *
+ * This is considered a slow path. It is not used by most libc
+ * implementations on modern hardware except during process startup.
+ *
+ * Arguments:
+ * eax system call number
+ * ebx arg1
+ * ecx arg2
+ * edx arg3
+ * esi arg4
+ * edi arg5
+ * ebp arg6
+ */
+#define DECLARE_IDTENTRY_IA32_EMULATION(vector, asm_func, cfunc) \
+ idtentry vector asm_func cfunc has_error_code=0
+
/* Special case for 32bit IRET 'trap'. Do not emit ASM code */
#define DECLARE_IDTENTRY_SW(vector, func)
@@ -634,6 +677,10 @@ DECLARE_IDTENTRY_IRQ(X86_TRAP_OTHER, common_interrupt);
DECLARE_IDTENTRY_IRQ(X86_TRAP_OTHER, spurious_interrupt);
#endif
+#ifdef CONFIG_IA32_EMULATION
+DECLARE_IDTENTRY_IA32_EMULATION(IA32_SYSCALL_VECTOR, entry_INT80_compat, do_int80_syscall_32);
+#endif
+
/* System vector entry points */
#ifdef CONFIG_X86_LOCAL_APIC
DECLARE_IDTENTRY_SYSVEC(ERROR_APIC_VECTOR, sysvec_error_interrupt);
diff --git a/arch/x86/include/asm/proto.h b/arch/x86/include/asm/proto.h
index 33ae276c8b34..597c767091cb 100644
--- a/arch/x86/include/asm/proto.h
+++ b/arch/x86/include/asm/proto.h
@@ -29,10 +29,6 @@ void entry_SYSENTER_compat(void);
void __end_entry_SYSENTER_compat(void);
void entry_SYSCALL_compat(void);
void entry_SYSCALL_compat_safe_stack(void);
-void entry_INT80_compat(void);
-#ifdef CONFIG_XEN_PV
-void xen_entry_INT80_compat(void);
-#endif
#endif
void x86_configure_nx(void);
--
2.19.1.6.gb485710b
From: Lai Jiangshan <[email protected]>
XENPV doesn't use error_entry() anymore, so the PV-aware SWAPGS can be
changed to native swapgs.
This prepares for a later patch converting error_entry() to C code, which
uses native_swapgs() directly. Converting SWAPGS to swapgs in the ASM
error_entry() first ensures the later patch introduces zero semantic change.
Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/entry/entry_64.S | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 2f3883d92536..ebc7419a0367 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -997,7 +997,7 @@ SYM_CODE_START_LOCAL(error_entry)
* We entered from user mode or we're pretending to have entered
* from user mode due to an IRET fault.
*/
- SWAPGS
+ swapgs
FENCE_SWAPGS_USER_ENTRY
/* We have user CR3. Change to kernel CR3. */
SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
@@ -1029,7 +1029,7 @@ SYM_CODE_START_LOCAL(error_entry)
* gsbase and proceed. We'll fix up the exception and land in
* .Lgs_change's error handler with kernel gsbase.
*/
- SWAPGS
+ swapgs
/*
* The above code has no serializing instruction. So do an lfence
@@ -1051,7 +1051,7 @@ SYM_CODE_START_LOCAL(error_entry)
* We came from an IRET to user mode, so we have user
* gsbase and CR3. Switch to kernel gsbase and CR3:
*/
- SWAPGS
+ swapgs
FENCE_SWAPGS_USER_ENTRY
SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
--
2.19.1.6.gb485710b
From: Lai Jiangshan <[email protected]>
All the needed facilities are now in place in entry64.c, so the whole
error_entry() can be implemented in C there. The C version is generally
more readable and easier to update/improve.
No functional change intended.
Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/entry/entry64.c | 68 ++++++++++++++++++++++++++++++
arch/x86/entry/entry_64.S | 82 +-----------------------------------
arch/x86/include/asm/traps.h | 1 +
3 files changed, 70 insertions(+), 81 deletions(-)
diff --git a/arch/x86/entry/entry64.c b/arch/x86/entry/entry64.c
index 3db503ea0703..0dc63ae8153a 100644
--- a/arch/x86/entry/entry64.c
+++ b/arch/x86/entry/entry64.c
@@ -68,3 +68,71 @@ static __always_inline void user_entry_swapgs_and_fence(void)
native_swapgs();
fence_swapgs_user_entry();
}
+
+/*
+ * Put pt_regs onto the task stack and switch GS and CR3 if needed.
+ * The actual stack switch is done in entry_64.S.
+ *
+ * Be careful, it might be in the user CR3 and user GS base at the start
+ * of the function.
+ */
+asmlinkage __visible __entry_text
+struct pt_regs *error_entry(struct pt_regs *eregs)
+{
+ unsigned long iret_ip = (unsigned long)native_irq_return_iret;
+
+ if (user_mode(eregs)) {
+ /*
+ * We entered from user mode.
+ * Switch to kernel gsbase and CR3.
+ */
+ user_entry_swapgs_and_fence();
+ switch_to_kernel_cr3();
+
+ /* Put pt_regs onto the task stack. */
+ return sync_regs(eregs);
+ }
+
+ /*
+ * There are two places in the kernel that can potentially fault with
+ * usergs. Handle them here. B stepping K8s sometimes report a
+ * truncated RIP for IRET exceptions returning to compat mode. Check
+ * for these here too.
+ */
+ if ((eregs->ip == iret_ip) || (eregs->ip == (unsigned int)iret_ip)) {
+ eregs->ip = iret_ip; /* Fix truncated RIP */
+
+ /*
+ * We came from an IRET to user mode, so we have user
+ * gsbase and CR3. Switch to kernel gsbase and CR3:
+ */
+ user_entry_swapgs_and_fence();
+ switch_to_kernel_cr3();
+
+ /*
+ * Pretend that the exception came from user mode: set up
+ * pt_regs as if we faulted immediately after IRET and put
+ * pt_regs onto the real task stack.
+ */
+ return sync_regs(fixup_bad_iret(eregs));
+ }
+
+ /*
+ * Hack: asm_load_gs_index_gs_change can fail with user gsbase.
+ * If this happens, fix up gsbase and proceed. We'll fix up the
+ * exception and land in asm_load_gs_index_gs_change's error
+ * handler with kernel gsbase.
+ */
+ if (eregs->ip == (unsigned long)asm_load_gs_index_gs_change)
+ native_swapgs();
+
+ /*
+ * The above code has no serializing instruction. So do an lfence
+ * to prevent GS speculation, regardless of whether it is kernel
+ * gsbase or user gsbase.
+ */
+ fence_swapgs_kernel_entry();
+
+ /* Enter from kernel, don't move pt_regs */
+ return eregs;
+}
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index ebc7419a0367..dac327c56204 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -333,7 +333,7 @@ SYM_CODE_END(ret_from_fork)
* XENPV uses its own pvops for iret and load_gs_index(). And it
* doesn't need to switch CR3. So it can skip invoking error_entry().
*/
- ALTERNATIVE "call error_entry; movq %rax, %rsp", \
+ ALTERNATIVE "movq %rsp, %rdi; call error_entry; movq %rax, %rsp", \
"", X86_FEATURE_XENPV
ENCODE_FRAME_POINTER
@@ -985,86 +985,6 @@ SYM_CODE_START_LOCAL(paranoid_exit)
jmp restore_regs_and_return_to_kernel
SYM_CODE_END(paranoid_exit)
-/*
- * Save all registers in pt_regs, and switch GS if needed.
- */
-SYM_CODE_START_LOCAL(error_entry)
- UNWIND_HINT_FUNC
- testb $3, CS+8(%rsp)
- jz .Lerror_kernelspace
-
- /*
- * We entered from user mode or we're pretending to have entered
- * from user mode due to an IRET fault.
- */
- swapgs
- FENCE_SWAPGS_USER_ENTRY
- /* We have user CR3. Change to kernel CR3. */
- SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
-
- leaq 8(%rsp), %rdi /* arg0 = pt_regs pointer */
-.Lerror_entry_from_usermode_after_swapgs:
- /* Put us onto the real thread stack. */
- call sync_regs
- ret
-
- /*
- * There are two places in the kernel that can potentially fault with
- * usergs. Handle them here. B stepping K8s sometimes report a
- * truncated RIP for IRET exceptions returning to compat mode. Check
- * for these here too.
- */
-.Lerror_kernelspace:
- leaq native_irq_return_iret(%rip), %rcx
- cmpq %rcx, RIP+8(%rsp)
- je .Lerror_bad_iret
- movl %ecx, %eax /* zero extend */
- cmpq %rax, RIP+8(%rsp)
- je .Lbstep_iret
- cmpq $asm_load_gs_index_gs_change, RIP+8(%rsp)
- jne .Lerror_entry_done_lfence
-
- /*
- * hack: .Lgs_change can fail with user gsbase. If this happens, fix up
- * gsbase and proceed. We'll fix up the exception and land in
- * .Lgs_change's error handler with kernel gsbase.
- */
- swapgs
-
- /*
- * The above code has no serializing instruction. So do an lfence
- * to prevent GS speculation, regardless of whether it is kernel
- * gsbase or user gsbase.
- */
-.Lerror_entry_done_lfence:
- FENCE_SWAPGS_KERNEL_ENTRY
- leaq 8(%rsp), %rax /* return pt_regs pointer */
- ret
-
-.Lbstep_iret:
- /* Fix truncated RIP */
- movq %rcx, RIP+8(%rsp)
- /* fall through */
-
-.Lerror_bad_iret:
- /*
- * We came from an IRET to user mode, so we have user
- * gsbase and CR3. Switch to kernel gsbase and CR3:
- */
- swapgs
- FENCE_SWAPGS_USER_ENTRY
- SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
-
- /*
- * Pretend that the exception came from user mode: set up pt_regs
- * as if we faulted immediately after IRET.
- */
- leaq 8(%rsp), %rdi /* arg0 = pt_regs pointer */
- call fixup_bad_iret
- mov %rax, %rdi
- jmp .Lerror_entry_from_usermode_after_swapgs
-SYM_CODE_END(error_entry)
-
SYM_CODE_START_LOCAL(error_return)
UNWIND_HINT_REGS
DEBUG_ENTRY_ASSERT_IRQS_OFF
diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index 1cdd7e8bcba7..686461ac9803 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -14,6 +14,7 @@
asmlinkage __visible notrace struct pt_regs *sync_regs(struct pt_regs *eregs);
asmlinkage __visible notrace
struct pt_regs *fixup_bad_iret(struct pt_regs *bad_regs);
+asmlinkage __visible notrace struct pt_regs *error_entry(struct pt_regs *eregs);
void __init trap_init(void);
asmlinkage __visible noinstr struct pt_regs *vc_switch_off_ist(struct pt_regs *eregs);
#endif
--
2.19.1.6.gb485710b
From: Lai Jiangshan <[email protected]>
XENPV has its own entry point for SYSENTER and doesn't use
entry_SYSENTER_compat, so the PV-aware SWAPGS can be changed to
swapgs.
Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/entry/entry_64_compat.S | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
index a4fcea0cab14..72e017c3941f 100644
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -49,7 +49,7 @@
SYM_CODE_START(entry_SYSENTER_compat)
UNWIND_HINT_EMPTY
/* Interrupts are off on entry. */
- SWAPGS
+ swapgs
pushq %rax
SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
--
2.19.1.6.gb485710b
From: Lai Jiangshan <[email protected]>
Move the last JMP out of paranoid_exit() and make it callable.
This allows paranoid_exit() to be rewritten in C later and also allows
asm_exc_nmi() to call it, avoiding duplicated code.
No functional change intended.
Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/entry/entry_64.S | 18 +++++++++++-------
1 file changed, 11 insertions(+), 7 deletions(-)
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index d8a0a40706b6..e6e655a1764a 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -471,7 +471,8 @@ SYM_CODE_START(\asmsym)
call \cfunc
- jmp paranoid_exit
+ call paranoid_exit
+ jmp restore_regs_and_return_to_kernel
/* Switch to the regular task stack and use the noist entry point */
.Lfrom_usermode_switch_stack_\@:
@@ -549,7 +550,8 @@ SYM_CODE_START(\asmsym)
* identical to the stack in the IRET frame or the VC fall-back stack,
* so it is definitely mapped even with PTI enabled.
*/
- jmp paranoid_exit
+ call paranoid_exit
+ jmp restore_regs_and_return_to_kernel
/* Switch to the regular task stack */
.Lfrom_usermode_switch_stack_\@:
@@ -580,7 +582,8 @@ SYM_CODE_START(\asmsym)
movq $-1, ORIG_RAX(%rsp) /* no syscall to restart */
call \cfunc
- jmp paranoid_exit
+ call paranoid_exit
+ jmp restore_regs_and_return_to_kernel
_ASM_NOKPROBE(\asmsym)
SYM_CODE_END(\asmsym)
@@ -975,7 +978,7 @@ SYM_CODE_END(paranoid_entry)
* Y User space GSBASE, must be restored unconditionally
*/
SYM_CODE_START_LOCAL(paranoid_exit)
- UNWIND_HINT_REGS
+ UNWIND_HINT_REGS offset=8
/*
* The order of operations is important. RESTORE_CR3 requires
* kernel GSBASE.
@@ -991,16 +994,17 @@ SYM_CODE_START_LOCAL(paranoid_exit)
/* With FSGSBASE enabled, unconditionally restore GSBASE */
wrgsbase %rbx
- jmp restore_regs_and_return_to_kernel
+ ret
.Lparanoid_exit_checkgs:
/* On non-FSGSBASE systems, conditionally do SWAPGS */
testl %ebx, %ebx
- jnz restore_regs_and_return_to_kernel
+ jnz .Lparanoid_exit_done
/* We are returning to a context with user GSBASE */
swapgs
- jmp restore_regs_and_return_to_kernel
+.Lparanoid_exit_done:
+ ret
SYM_CODE_END(paranoid_exit)
SYM_CODE_START_LOCAL(error_return)
--
2.19.1.6.gb485710b
From: Lai Jiangshan <[email protected]>
The code between "call exc_nmi" and nmi_restore is the same as
paranoid_exit(), so just use paranoid_exit() instead of the open-coded
duplicate.
No functional change intended.
Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/entry/entry_64.S | 34 +++++-----------------------------
1 file changed, 5 insertions(+), 29 deletions(-)
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index e6e655a1764a..3a434b179963 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -960,8 +960,7 @@ SYM_CODE_END(paranoid_entry)
/*
* "Paranoid" exit path from exception stack. This is invoked
- * only on return from non-NMI IST interrupts that came
- * from kernel space.
+ * only on return from IST interrupts that came from kernel space.
*
* We may be returning to very strange contexts (e.g. very early
* in syscall entry), so checking for preemption here would
@@ -1309,11 +1308,7 @@ end_repeat_nmi:
pushq $-1 /* ORIG_RAX: no syscall to restart */
/*
- * Use paranoid_entry to handle SWAPGS, but no need to use paranoid_exit
- * as we should not be calling schedule in NMI context.
- * Even with normal interrupts enabled. An NMI should not be
- * setting NEED_RESCHED or anything that normal interrupts and
- * exceptions might do.
+ * Use paranoid_entry to handle SWAPGS and CR3.
*/
call paranoid_entry
UNWIND_HINT_REGS
@@ -1322,31 +1317,12 @@ end_repeat_nmi:
movq $-1, %rsi
call exc_nmi
- /* Always restore stashed CR3 value (see paranoid_entry) */
- RESTORE_CR3 scratch_reg=%r15 save_reg=%r14
-
/*
- * The above invocation of paranoid_entry stored the GSBASE
- * related information in R/EBX depending on the availability
- * of FSGSBASE.
- *
- * If FSGSBASE is enabled, restore the saved GSBASE value
- * unconditionally, otherwise take the conditional SWAPGS path.
+ * Use paranoid_exit to handle SWAPGS and CR3, but no need to use
+ * restore_regs_and_return_to_kernel as we must handle nested NMI.
*/
- ALTERNATIVE "jmp nmi_no_fsgsbase", "", X86_FEATURE_FSGSBASE
-
- wrgsbase %rbx
- jmp nmi_restore
-
-nmi_no_fsgsbase:
- /* EBX == 0 -> invoke SWAPGS */
- testl %ebx, %ebx
- jnz nmi_restore
-
-nmi_swapgs:
- swapgs
+ call paranoid_exit
-nmi_restore:
POP_REGS
/*
--
2.19.1.6.gb485710b
From: Lai Jiangshan <[email protected]>
This prepares for converting the whole paranoid_entry() into C code.
No functional change intended.
Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/entry/entry_64.S | 24 +++++++++++++++++-------
1 file changed, 17 insertions(+), 7 deletions(-)
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 3a434b179963..b6bcf7fcad34 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -322,9 +322,6 @@ SYM_CODE_END(ret_from_fork)
*/
.macro idtentry_body cfunc has_error_code:req
- PUSH_AND_CLEAR_REGS
- ENCODE_FRAME_POINTER
-
/*
* Call error_entry and switch stack settled by sync_regs().
*
@@ -403,6 +400,9 @@ SYM_CODE_START(\asmsym)
.Lfrom_usermode_no_gap_\@:
.endif
+ PUSH_AND_CLEAR_REGS
+ ENCODE_FRAME_POINTER
+
idtentry_body \cfunc \has_error_code
_ASM_NOKPROBE(\asmsym)
@@ -455,11 +455,14 @@ SYM_CODE_START(\asmsym)
pushq $-1 /* ORIG_RAX: no syscall to restart */
+ PUSH_AND_CLEAR_REGS
+ ENCODE_FRAME_POINTER
+
/*
* If the entry is from userspace, switch stacks and treat it as
* a normal entry.
*/
- testb $3, CS-ORIG_RAX(%rsp)
+ testb $3, CS(%rsp)
jnz .Lfrom_usermode_switch_stack_\@
/* paranoid_entry returns GS information for paranoid_exit in EBX. */
@@ -510,11 +513,14 @@ SYM_CODE_START(\asmsym)
ASM_CLAC
cld
+ PUSH_AND_CLEAR_REGS
+ ENCODE_FRAME_POINTER
+
/*
* If the entry is from userspace, switch stacks and treat it as
* a normal entry.
*/
- testb $3, CS-ORIG_RAX(%rsp)
+ testb $3, CS(%rsp)
jnz .Lfrom_usermode_switch_stack_\@
/*
@@ -573,6 +579,9 @@ SYM_CODE_START(\asmsym)
ASM_CLAC
cld
+ PUSH_AND_CLEAR_REGS
+ ENCODE_FRAME_POINTER
+
/* paranoid_entry returns GS information for paranoid_exit in EBX. */
call paranoid_entry
UNWIND_HINT_REGS
@@ -888,8 +897,6 @@ SYM_CODE_END(xen_failsafe_callback)
*/
SYM_CODE_START_LOCAL(paranoid_entry)
UNWIND_HINT_FUNC
- PUSH_AND_CLEAR_REGS save_ret=1
- ENCODE_FRAME_POINTER 8
/*
* Always stash CR3 in %r14. This value will be restored,
@@ -1307,6 +1314,9 @@ end_repeat_nmi:
*/
pushq $-1 /* ORIG_RAX: no syscall to restart */
+ PUSH_AND_CLEAR_REGS
+ ENCODE_FRAME_POINTER
+
/*
* Use paranoid_entry to handle SWAPGS and CR3.
*/
--
2.19.1.6.gb485710b
From: Lai Jiangshan <[email protected]>
There is no user of the pv-aware SWAPGS anymore.
Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/include/asm/irqflags.h | 8 --------
1 file changed, 8 deletions(-)
diff --git a/arch/x86/include/asm/irqflags.h b/arch/x86/include/asm/irqflags.h
index c5ce9845c999..da41a80eb912 100644
--- a/arch/x86/include/asm/irqflags.h
+++ b/arch/x86/include/asm/irqflags.h
@@ -139,14 +139,6 @@ static __always_inline void arch_local_irq_restore(unsigned long flags)
if (!arch_irqs_disabled_flags(flags))
arch_local_irq_enable();
}
-#else
-#ifdef CONFIG_X86_64
-#ifdef CONFIG_XEN_PV
-#define SWAPGS ALTERNATIVE "swapgs", "", X86_FEATURE_XENPV
-#else
-#define SWAPGS swapgs
-#endif
-#endif
#endif /* !__ASSEMBLY__ */
#endif
--
2.19.1.6.gb485710b
From: Lai Jiangshan <[email protected]>
PTI_PGTABLE_SWITCH_BIT and PTI_USER_PGTABLE_BIT are the same in meaning
and value, so use the latter and drop the former.
Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/include/asm/pgtable.h | 13 +++----------
1 file changed, 3 insertions(+), 10 deletions(-)
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 65542106464b..c8909457574a 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -5,6 +5,7 @@
#include <linux/mem_encrypt.h>
#include <asm/page.h>
#include <asm/pgtable_types.h>
+#include <asm/processor-flags.h>
/*
* Macro to mark a page protection value as UC-
@@ -1164,14 +1165,6 @@ static inline bool pgdp_maps_userspace(void *__ptr)
static inline int pgd_large(pgd_t pgd) { return 0; }
#ifdef CONFIG_PAGE_TABLE_ISOLATION
-/*
- * All top-level PAGE_TABLE_ISOLATION page tables are order-1 pages
- * (8k-aligned and 8k in size). The kernel one is at the beginning 4k and
- * the user one is in the last 4k. To switch between them, you
- * just need to flip the 12th bit in their addresses.
- */
-#define PTI_PGTABLE_SWITCH_BIT PAGE_SHIFT
-
/*
* This generates better code than the inline assembly in
* __set_bit().
@@ -1193,12 +1186,12 @@ static inline void *ptr_clear_bit(void *ptr, int bit)
static inline pgd_t *kernel_to_user_pgdp(pgd_t *pgdp)
{
- return ptr_set_bit(pgdp, PTI_PGTABLE_SWITCH_BIT);
+ return ptr_set_bit(pgdp, PTI_USER_PGTABLE_BIT);
}
static inline pgd_t *user_to_kernel_pgdp(pgd_t *pgdp)
{
- return ptr_clear_bit(pgdp, PTI_PGTABLE_SWITCH_BIT);
+ return ptr_clear_bit(pgdp, PTI_USER_PGTABLE_BIT);
}
#endif /* CONFIG_PAGE_TABLE_ISOLATION */
--
2.19.1.6.gb485710b
From: Lai Jiangshan <[email protected]>
The declaration of native_irq_return_iret is currently used only in
exc_double_fault(). It will be used in other places later, so move the
declaration to a header file in preparation.
Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/include/asm/proto.h | 1 +
arch/x86/kernel/traps.c | 2 --
2 files changed, 1 insertion(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/proto.h b/arch/x86/include/asm/proto.h
index feed36d44d04..33ae276c8b34 100644
--- a/arch/x86/include/asm/proto.h
+++ b/arch/x86/include/asm/proto.h
@@ -13,6 +13,7 @@ void syscall_init(void);
#ifdef CONFIG_X86_64
void entry_SYSCALL_64(void);
void entry_SYSCALL_64_safe_stack(void);
+extern unsigned char native_irq_return_iret[];
long do_arch_prctl_64(struct task_struct *task, int option, unsigned long arg2);
#endif
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index c9d566dcf89a..1be5c1edad6b 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -359,8 +359,6 @@ DEFINE_IDTENTRY_DF(exc_double_fault)
#endif
#ifdef CONFIG_X86_ESPFIX64
- extern unsigned char native_irq_return_iret[];
-
/*
* If IRET takes a non-IST fault on the espfix64 stack, then we
* end up promoting it to a doublefault. In that case, take
--
2.19.1.6.gb485710b
From: Lai Jiangshan <[email protected]>
Add a C file "entry64.c" to hold C entry code for traps and faults,
which follows the same logic as the existing ASM code in entry_64.S.
The file is as low level as entry_64.S and its code may run in
environments where the GS base is a user-controlled value, or the CR3
is the PTI user CR3, or both.
None of the code in this file may be instrumented. Most instrumentation
facilities can be disabled by the per-function attributes bundled in
__noinstr_section. But stack-protector cannot be disabled per function
on all GCC versions supported for compiling the kernel, so
stack-protector is disabled for the whole file in the Makefile.
This prepares for later patches that implement the C version of the
entry code in entry64.c.
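As a rough illustration (the attribute spelling below is just an
example, not the kernel's actual definition), tracing and sanitizers
can be suppressed with per-function attributes, but there is no
equivalent per-function switch for stack-protector on every supported
GCC, hence the Makefile flag:

	/*
	 * Illustrative only: per-function instrumentation suppression
	 * looks roughly like this; -fstack-protector has no such
	 * per-function attribute on all supported GCC versions.
	 */
	#define __example_noinstr					\
		__attribute__((__no_instrument_function__))		\
		__attribute__((__section__(".noinstr.text")))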
Suggested-by: Joerg Roedel <[email protected]>
Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/entry/Makefile | 3 ++-
arch/x86/entry/entry64.c | 14 ++++++++++++++
2 files changed, 16 insertions(+), 1 deletion(-)
create mode 100644 arch/x86/entry/entry64.c
diff --git a/arch/x86/entry/Makefile b/arch/x86/entry/Makefile
index 7fec5dcf6438..792f7009ff32 100644
--- a/arch/x86/entry/Makefile
+++ b/arch/x86/entry/Makefile
@@ -10,13 +10,14 @@ KCOV_INSTRUMENT := n
CFLAGS_REMOVE_common.o = $(CC_FLAGS_FTRACE)
CFLAGS_common.o += -fno-stack-protector
+CFLAGS_entry64.o += -fno-stack-protector
obj-y := entry_$(BITS).o thunk_$(BITS).o syscall_$(BITS).o
obj-y += common.o
+obj-$(CONFIG_X86_64) += entry64.o
obj-y += vdso/
obj-y += vsyscall/
obj-$(CONFIG_IA32_EMULATION) += entry_64_compat.o syscall_32.o
obj-$(CONFIG_X86_X32_ABI) += syscall_x32.o
-
diff --git a/arch/x86/entry/entry64.c b/arch/x86/entry/entry64.c
new file mode 100644
index 000000000000..762595603ce7
--- /dev/null
+++ b/arch/x86/entry/entry64.c
@@ -0,0 +1,14 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 1991, 1992 Linus Torvalds
+ * Copyright (C) 2000, 2001, 2002 Andi Kleen SuSE Labs
+ * Copyright (C) 2000 Pavel Machek <[email protected]>
+ * Copyright (C) 2021 Lai Jiangshan, Alibaba
+ *
+ * Handle entries and exits for hardware traps and faults.
+ *
+ * It is as low level as entry_64.S and its code can be running in the
+ * environments that the GS base is user controlled value, or the CR3
+ * is PTI user CR3 or both.
+ */
+#include <asm/traps.h>
--
2.19.1.6.gb485710b
From: Lai Jiangshan <[email protected]>
It implements the C version of the asm macro GET_PERCPU_BASE().
No functional difference intended.
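As a hedged usage sketch (not part of this patch; the real caller is
added later in this series), the helper provides the value written to
the kernel GSBASE:

	/* Sketch: load the kernel GSBASE from the per-CPU offset table. */
	wrgsbase(get_percpu_base());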
Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/entry/entry64.c | 36 ++++++++++++++++++++++++++++++++++++
1 file changed, 36 insertions(+)
diff --git a/arch/x86/entry/entry64.c b/arch/x86/entry/entry64.c
index 5f47221d8935..3ec145c38e9e 100644
--- a/arch/x86/entry/entry64.c
+++ b/arch/x86/entry/entry64.c
@@ -183,3 +183,39 @@ struct pt_regs *error_entry(struct pt_regs *eregs)
/* Enter from kernel, don't move pt_regs */
return eregs;
}
+
+#ifdef CONFIG_SMP
+/*
+ * CPU/node NR is loaded from the limit (size) field of a special segment
+ * descriptor entry in GDT.
+ *
+ * Do not use RDPID, because KVM loads guest's TSC_AUX on vm-entry and
+ * may not restore the host's value until the CPU returns to userspace.
+ * Thus the kernel would consume a guest's TSC_AUX if an NMI arrives
+ * while running KVM's run loop.
+ */
+static __always_inline unsigned int gdt_get_cpu(void)
+{
+ unsigned int p;
+
+ asm ("lsl %[seg],%[p]" : [p] "=a" (p) : [seg] "r" (__CPUNODE_SEG));
+
+ return p & VDSO_CPUNODE_MASK;
+}
+
+/*
+ * Fetch the per-CPU GSBASE value for this processor.
+ *
+ * We normally use %gs for accessing per-CPU data, but we are setting up
+ * %gs here and obviously can not use %gs itself to access per-CPU data.
+ */
+static __always_inline unsigned long get_percpu_base(void)
+{
+ return __per_cpu_offset[gdt_get_cpu()];
+}
+#else
+static __always_inline unsigned long get_percpu_base(void)
+{
+ return pcpu_unit_offsets;
+}
+#endif
--
2.19.1.6.gb485710b
From: Lai Jiangshan <[email protected]>
The C version switch_to_kernel_cr3() implements SWITCH_TO_KERNEL_CR3().
No functional difference intended.
Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/entry/entry64.c | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)
diff --git a/arch/x86/entry/entry64.c b/arch/x86/entry/entry64.c
index 9813a30dbadb..9a5c535b1ddf 100644
--- a/arch/x86/entry/entry64.c
+++ b/arch/x86/entry/entry64.c
@@ -14,3 +14,27 @@
#include <asm/traps.h>
extern unsigned char asm_load_gs_index_gs_change[];
+
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+static __always_inline void pti_switch_to_kernel_cr3(unsigned long user_cr3)
+{
+ /*
+ * Clear PCID and "PAGE_TABLE_ISOLATION bit", point CR3
+ * at kernel pagetables:
+ */
+ unsigned long cr3 = user_cr3 & ~PTI_USER_PGTABLE_AND_PCID_MASK;
+
+ if (static_cpu_has(X86_FEATURE_PCID))
+ cr3 |= X86_CR3_PCID_NOFLUSH;
+
+ native_write_cr3(cr3);
+}
+
+static __always_inline void switch_to_kernel_cr3(void)
+{
+ if (static_cpu_has(X86_FEATURE_PTI))
+ pti_switch_to_kernel_cr3(__native_read_cr3());
+}
+#else
+static __always_inline void switch_to_kernel_cr3(void) {}
+#endif
--
2.19.1.6.gb485710b
From: Lai Jiangshan <[email protected]>
It implements the second half of paranoid_entry(), whose job is to
switch to the kernel GSBASE.
No functional difference intended.
Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/entry/entry64.c | 47 ++++++++++++++++++++++++++++++++++++++++
1 file changed, 47 insertions(+)
diff --git a/arch/x86/entry/entry64.c b/arch/x86/entry/entry64.c
index 3ec145c38e9e..6eb8ccfc5a8b 100644
--- a/arch/x86/entry/entry64.c
+++ b/arch/x86/entry/entry64.c
@@ -219,3 +219,50 @@ static __always_inline unsigned long get_percpu_base(void)
return pcpu_unit_offsets;
}
#endif
+
+/*
+ * Handle GSBASE depends on the availability of FSGSBASE.
+ *
+ * Without FSGSBASE the kernel enforces that negative GSBASE
+ * values indicate kernel GSBASE. With FSGSBASE no assumptions
+ * can be made about the GSBASE value when entering from user
+ * space.
+ */
+static __always_inline unsigned long ist_switch_to_kernel_gsbase(void)
+{
+ unsigned long gsbase;
+
+ if (static_cpu_has(X86_FEATURE_FSGSBASE)) {
+ /*
+ * Read the current GSBASE for return.
+ * Retrieve and set the current CPUs kernel GSBASE.
+ *
+ * The unconditional write to GS base below ensures that
+ * no subsequent loads based on a mispredicted GS base can
+ * happen, therefore no LFENCE is needed here.
+ */
+ gsbase = rdgsbase();
+ wrgsbase(get_percpu_base());
+ return gsbase;
+ }
+
+ /*
+ * The above ist_switch_to_kernel_cr3() doesn't do an unconditional
+ * CR3 write, even in the PTI case. So do an lfence to prevent GS
+ * speculation, regardless of whether PTI is enabled.
+ */
+ fence_swapgs_kernel_entry();
+
+ gsbase = __rdmsr(MSR_GS_BASE);
+
+ /*
+ * The kernel-enforced convention is a negative GSBASE indicates
+ * a kernel value. No SWAPGS needed on entry and exit.
+ */
+ if ((long)gsbase < 0)
+ return 1;
+
+ /* User GSBASE active, SWAPGS required on exit */
+ native_swapgs();
+ return 0;
+}
--
2.19.1.6.gb485710b
From: Lai Jiangshan <[email protected]>
error_entry() calls sync_regs() to settle/copy the pt_regs and switches
the stack directly after sync_regs(). But because error_entry() is also
called from the entry code, the switching also has to handle the return
address, which makes the behavior tangled.
Switching the stack after error_entry() returns makes the code simpler
and more intuitive.
Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/entry/entry_64.S | 16 ++++++----------
1 file changed, 6 insertions(+), 10 deletions(-)
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index c6b617c19fe1..3667317f6825 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -323,6 +323,8 @@ SYM_CODE_END(ret_from_fork)
.macro idtentry_body cfunc has_error_code:req
call error_entry
+ movq %rax, %rsp /* switch stack settled by sync_regs() */
+ ENCODE_FRAME_POINTER
UNWIND_HINT_REGS
movq %rsp, %rdi /* pt_regs pointer into 1st argument*/
@@ -988,14 +990,10 @@ SYM_CODE_START_LOCAL(error_entry)
/* We have user CR3. Change to kernel CR3. */
SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
+ leaq 8(%rsp), %rdi /* arg0 = pt_regs pointer */
.Lerror_entry_from_usermode_after_swapgs:
/* Put us onto the real thread stack. */
- popq %r12 /* save return addr in %12 */
- movq %rsp, %rdi /* arg0 = pt_regs pointer */
call sync_regs
- movq %rax, %rsp /* switch stack */
- ENCODE_FRAME_POINTER
- pushq %r12
ret
/*
@@ -1028,6 +1026,7 @@ SYM_CODE_START_LOCAL(error_entry)
*/
.Lerror_entry_done_lfence:
FENCE_SWAPGS_KERNEL_ENTRY
+ leaq 8(%rsp), %rax /* return pt_regs pointer */
ret
.Lbstep_iret:
@@ -1048,12 +1047,9 @@ SYM_CODE_START_LOCAL(error_entry)
* Pretend that the exception came from user mode: set up pt_regs
* as if we faulted immediately after IRET.
*/
- popq %r12 /* save return addr in %12 */
- movq %rsp, %rdi /* arg0 = pt_regs pointer */
+ leaq 8(%rsp), %rdi /* arg0 = pt_regs pointer */
call fixup_bad_iret
- mov %rax, %rsp
- ENCODE_FRAME_POINTER
- pushq %r12
+ mov %rax, %rdi
jmp .Lerror_entry_from_usermode_after_swapgs
SYM_CODE_END(error_entry)
--
2.19.1.6.gb485710b
From: Lai Jiangshan <[email protected]>
It implements the whole ASM paranoid_entry() in C.
No functional difference intended.
Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/entry/entry64.c | 37 +++++++++++++++++++++++++++++++++
arch/x86/include/asm/idtentry.h | 3 +++
2 files changed, 40 insertions(+)
diff --git a/arch/x86/entry/entry64.c b/arch/x86/entry/entry64.c
index 6eb8ccfc5a8b..e1af3e5720f9 100644
--- a/arch/x86/entry/entry64.c
+++ b/arch/x86/entry/entry64.c
@@ -266,3 +266,40 @@ static __always_inline unsigned long ist_switch_to_kernel_gsbase(void)
native_swapgs();
return 0;
}
+
+/*
+ * Switch and save CR3 in *@cr3 if PTI enabled. Return GSBASE related
+ * information in *@gsbase depending on the availability of the FSGSBASE
+ * instructions:
+ *
+ * FSGSBASE *@gsbase
+ * N 0 -> SWAPGS on exit
+ * 1 -> no SWAPGS on exit
+ *
+ * Y GSBASE value at entry, must be restored in ist_paranoid_exit
+ */
+__visible __entry_text
+void ist_paranoid_entry(unsigned long *cr3, unsigned long *gsbase)
+{
+ /*
+ * Always stash CR3 in *@cr3. This value will be restored,
+ * verbatim, at exit. Needed if ist_paranoid_entry interrupted
+ * another entry that already switched to the user CR3 value
+ * but has not yet returned to userspace.
+ *
+ * This is also why CS (stashed in the "iret frame" by the
+ * hardware at entry) can not be used: this may be a return
+ * to kernel code, but with a user CR3 value.
+ *
+ * Switching CR3 does not depend on kernel GSBASE so it can
+ * be done before switching to the kernel GSBASE. This is
+ * required for FSGSBASE because the kernel GSBASE has to
+ * be retrieved from a kernel internal table.
+ */
+ *cr3 = ist_switch_to_kernel_cr3();
+
+ barrier();
+
+ /* Handle GSBASE, store the return value in *@gsbase for exit. */
+ *gsbase = ist_switch_to_kernel_gsbase();
+}
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 49fabc3e3f0d..f6efa21ec242 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -307,6 +307,9 @@ static __always_inline void __##func(struct pt_regs *regs)
DECLARE_IDTENTRY(vector, func)
#ifdef CONFIG_X86_64
+__visible __entry_text
+void ist_paranoid_entry(unsigned long *cr3, unsigned long *gsbase);
+
/**
* DECLARE_IDTENTRY_IST - Declare functions for IST handling IDT entry points
* @vector: Vector number (ignored for C)
--
2.19.1.6.gb485710b
From: Lai Jiangshan <[email protected]>
It implements the C version of RESTORE_CR3().
No functional difference intended, except that the ASM code uses
bit-test-and-clear operations while the C version uses a mask check and
'AND' operations. The resulting assembly of both versions is very
similar.
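For illustration (assuming PTI_USER_PGTABLE_MASK is the single-bit mask
corresponding to PTI_USER_PGTABLE_BIT), the ASM bit test maps to a plain
mask check in C:

	/*
	 * ASM: bt $PTI_USER_PGTABLE_BIT, %reg ; jnc .Lend
	 * C:   a mask check on the saved CR3 value.
	 */
	if (cr3 & PTI_USER_PGTABLE_MASK)
		pti_switch_to_user_cr3(cr3);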
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/entry/entry64.c | 32 ++++++++++++++++++++++++++++++++
1 file changed, 32 insertions(+)
diff --git a/arch/x86/entry/entry64.c b/arch/x86/entry/entry64.c
index 283bd685a275..5f47221d8935 100644
--- a/arch/x86/entry/entry64.c
+++ b/arch/x86/entry/entry64.c
@@ -11,6 +11,7 @@
* environments that the GS base is user controlled value, or the CR3
* is PTI user CR3 or both.
*/
+#include <asm/tlbflush.h>
#include <asm/traps.h>
extern unsigned char asm_load_gs_index_gs_change[];
@@ -30,6 +31,26 @@ static __always_inline void pti_switch_to_kernel_cr3(unsigned long user_cr3)
native_write_cr3(cr3);
}
+static __always_inline void pti_switch_to_user_cr3(unsigned long user_cr3)
+{
+#define KERN_PCID_MASK (CR3_PCID_MASK & ~PTI_USER_PCID_MASK)
+
+ if (static_cpu_has(X86_FEATURE_PCID)) {
+ int pcid = user_cr3 & KERN_PCID_MASK;
+ unsigned short pcid_mask = 1ull << pcid;
+
+ /*
+ * Check if there's a pending flush for the user ASID we're
+ * about to set.
+ */
+ if (!(this_cpu_read(cpu_tlbstate.user_pcid_flush_mask) & pcid_mask))
+ user_cr3 |= X86_CR3_PCID_NOFLUSH;
+ else
+ this_cpu_and(cpu_tlbstate.user_pcid_flush_mask, ~pcid_mask);
+ }
+ native_write_cr3(user_cr3);
+}
+
static __always_inline void switch_to_kernel_cr3(void)
{
if (static_cpu_has(X86_FEATURE_PTI))
@@ -49,9 +70,20 @@ static __always_inline unsigned long ist_switch_to_kernel_cr3(void)
return cr3;
}
+
+static __always_inline void ist_restore_cr3(unsigned long cr3)
+{
+ if (!static_cpu_has(X86_FEATURE_PTI))
+ return;
+
+ /* No need to restore when @cr3 is kernel CR3. */
+ if (cr3 & PTI_USER_PGTABLE_MASK)
+ pti_switch_to_user_cr3(cr3);
+}
#else
static __always_inline void switch_to_kernel_cr3(void) {}
static __always_inline unsigned long ist_switch_to_kernel_cr3(void) { return 0; }
+static __always_inline void ist_restore_cr3(unsigned long cr3) {}
#endif
/*
--
2.19.1.6.gb485710b
From: Lai Jiangshan <[email protected]>
It switches CR3 to the kernel CR3 and returns the original CR3; the
caller should save the return value.
It is the C version of SAVE_AND_SWITCH_TO_KERNEL_CR3.
No functional difference intended.
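A hedged usage sketch (assumed shape only; the actual callers are added
elsewhere in this series):

	/* Sketch: save the original CR3 at entry, restore it at exit. */
	unsigned long cr3 = ist_switch_to_kernel_cr3();
	/* ... run the exception handler with the kernel CR3 ... */
	ist_restore_cr3(cr3);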
Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/entry/entry64.c | 15 +++++++++++++++
1 file changed, 15 insertions(+)
diff --git a/arch/x86/entry/entry64.c b/arch/x86/entry/entry64.c
index 0dc63ae8153a..283bd685a275 100644
--- a/arch/x86/entry/entry64.c
+++ b/arch/x86/entry/entry64.c
@@ -35,8 +35,23 @@ static __always_inline void switch_to_kernel_cr3(void)
if (static_cpu_has(X86_FEATURE_PTI))
pti_switch_to_kernel_cr3(__native_read_cr3());
}
+
+static __always_inline unsigned long ist_switch_to_kernel_cr3(void)
+{
+ unsigned long cr3 = 0;
+
+ if (static_cpu_has(X86_FEATURE_PTI)) {
+ cr3 = __native_read_cr3();
+
+ if (cr3 & PTI_USER_PGTABLE_MASK)
+ pti_switch_to_kernel_cr3(cr3);
+ }
+
+ return cr3;
+}
#else
static __always_inline void switch_to_kernel_cr3(void) {}
+static __always_inline unsigned long ist_switch_to_kernel_cr3(void) { return 0; }
#endif
/*
--
2.19.1.6.gb485710b
From: Lai Jiangshan <[email protected]>
DEFINE_IDTENTRY_NMI is defined but not used, so it is better to use it.
This also prepares for a later patch that defines DEFINE_IDTENTRY_NMI
differently for 32-bit and 64-bit.
Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/kernel/nmi.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index 4bce802d25fb..44c3adb68282 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -473,7 +473,7 @@ static DEFINE_PER_CPU(enum nmi_states, nmi_state);
static DEFINE_PER_CPU(unsigned long, nmi_cr2);
static DEFINE_PER_CPU(unsigned long, nmi_dr7);
-DEFINE_IDTENTRY_RAW(exc_nmi)
+DEFINE_IDTENTRY_NMI(exc_nmi)
{
irqentry_state_t irq_state;
--
2.19.1.6.gb485710b
From: Lai Jiangshan <[email protected]>
Move them closer to the related definitions and drop an #ifdef block.
Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/include/asm/idtentry.h | 18 ++++++++----------
1 file changed, 8 insertions(+), 10 deletions(-)
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index d0fd32288442..b9a6750dbba2 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -372,6 +372,14 @@ __visible __entry_text void ist_##func(struct pt_regs *regs) \
#define DEFINE_IDTENTRY_NOIST(func) \
DEFINE_IDTENTRY_RAW(noist_##func)
+#define DECLARE_IDTENTRY_MCE DECLARE_IDTENTRY_IST
+#define DEFINE_IDTENTRY_MCE DEFINE_IDTENTRY_IST
+#define DEFINE_IDTENTRY_MCE_USER DEFINE_IDTENTRY_NOIST
+
+#define DECLARE_IDTENTRY_DEBUG DECLARE_IDTENTRY_IST
+#define DEFINE_IDTENTRY_DEBUG DEFINE_IDTENTRY_IST
+#define DEFINE_IDTENTRY_DEBUG_USER DEFINE_IDTENTRY_NOIST
+
/**
* DECLARE_IDTENTRY_DF - Declare functions for double fault
* @vector: Vector number (ignored for C)
@@ -446,16 +454,6 @@ __visible noinstr void func(struct pt_regs *regs, \
#define DECLARE_IDTENTRY_NMI DECLARE_IDTENTRY_RAW
#define DEFINE_IDTENTRY_NMI DEFINE_IDTENTRY_RAW
-#ifdef CONFIG_X86_64
-#define DECLARE_IDTENTRY_MCE DECLARE_IDTENTRY_IST
-#define DEFINE_IDTENTRY_MCE DEFINE_IDTENTRY_IST
-#define DEFINE_IDTENTRY_MCE_USER DEFINE_IDTENTRY_NOIST
-
-#define DECLARE_IDTENTRY_DEBUG DECLARE_IDTENTRY_IST
-#define DEFINE_IDTENTRY_DEBUG DEFINE_IDTENTRY_IST
-#define DEFINE_IDTENTRY_DEBUG_USER DEFINE_IDTENTRY_NOIST
-#endif
-
#else /* !__ASSEMBLY__ */
/*
--
2.19.1.6.gb485710b
From: Lai Jiangshan <[email protected]>
Add the DEFINE_IDTENTRY_IST_ETNRY_ERRORCODE() macro to define C code
implementing the ASM sequence which calls paranoid_entry(), modifies
orig_ax, and calls cfunc() and paranoid_exit() in series for IST
exceptions with an error code.
No functional difference intended.
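For example, for an error-code handler such as exc_double_fault the
macro would expand roughly to (expansion sketch only):

	__visible __entry_text void ist_exc_double_fault(struct pt_regs *regs)
	{
		unsigned long cr3, gsbase, error_code = regs->orig_ax;

		ist_paranoid_entry(&cr3, &gsbase);
		regs->orig_ax = -1;	/* no syscall to restart */
		exc_double_fault(regs, error_code);
		ist_paranoid_exit(cr3, gsbase);
	}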
Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/include/asm/idtentry.h | 16 ++++++++++++++++
1 file changed, 16 insertions(+)
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 57636844b0fd..c57606948433 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -351,6 +351,22 @@ __visible __entry_text void ist_##func(struct pt_regs *regs) \
ist_paranoid_exit(cr3, gsbase); \
}
+/**
+ * DEFINE_IDTENTRY_IST_ENTRY_ERRORCODE - Emit __entry_text code for IST
+ * entry points with an error code
+ * @func: Function name of the entry point
+ */
+#define DEFINE_IDTENTRY_IST_ETNRY_ERRORCODE(func) \
+__visible __entry_text void ist_##func(struct pt_regs *regs) \
+{ \
+ unsigned long cr3, gsbase, error_code = regs->orig_ax; \
+ \
+ ist_paranoid_entry(&cr3, &gsbase); \
+ regs->orig_ax = -1; /* no syscall to restart */ \
+ func(regs, error_code); \
+ ist_paranoid_exit(cr3, gsbase); \
+}
+
/**
* DEFINE_IDTENTRY_IST - Emit code for IST entry points
* @func: Function name of the entry point
--
2.19.1.6.gb485710b
From: Lai Jiangshan <[email protected]>
They are now implemented and used in C code, so the ASM versions are
not needed any more.
FENCE_SWAPGS_USER_ENTRY is not removed because it is still used in the
NMI userspace path. It may become removable in a future entry code
enhancement.
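For reference, a minimal sketch of what the C-side replacement of
FENCE_SWAPGS_KERNEL_ENTRY presumably looks like (the actual definition
lives in an earlier patch of this series and is not shown here):

	static __always_inline void fence_swapgs_kernel_entry(void)
	{
		alternative("", "lfence", X86_FEATURE_FENCE_SWAPGS_KERNEL);
	}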
Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/entry/calling.h | 99 ----------------------------------------
1 file changed, 99 deletions(-)
diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index 9065c31d2875..d42012fc694d 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -210,53 +210,6 @@ For 32-bit we have the following conventions - kernel is built with
popq %rax
.endm
-.macro SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg:req save_reg:req
- ALTERNATIVE "jmp .Ldone_\@", "", X86_FEATURE_PTI
- movq %cr3, \scratch_reg
- movq \scratch_reg, \save_reg
- /*
- * Test the user pagetable bit. If set, then the user page tables
- * are active. If clear CR3 already has the kernel page table
- * active.
- */
- bt $PTI_USER_PGTABLE_BIT, \scratch_reg
- jnc .Ldone_\@
-
- ADJUST_KERNEL_CR3 \scratch_reg
- movq \scratch_reg, %cr3
-
-.Ldone_\@:
-.endm
-
-.macro RESTORE_CR3 scratch_reg:req save_reg:req
- ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
-
- /* No need to restore when the saved CR3 is kernel CR3. */
- bt $PTI_USER_PGTABLE_BIT, \save_reg
- jnc .Lend_\@
-
- ALTERNATIVE "jmp .Lwrcr3_\@", "", X86_FEATURE_PCID
-
- /*
- * Check if there's a pending flush for the user ASID we're
- * about to set.
- */
- movq \save_reg, \scratch_reg
- andq $(0x7FF), \scratch_reg
- bt \scratch_reg, THIS_CPU_user_pcid_flush_mask
- jnc .Lnoflush_\@
-
- btr \scratch_reg, THIS_CPU_user_pcid_flush_mask
- jmp .Lwrcr3_\@
-
-.Lnoflush_\@:
- SET_NOFLUSH_BIT \save_reg
-
-.Lwrcr3_\@:
- movq \save_reg, %cr3
-.Lend_\@:
-.endm
-
#else /* CONFIG_PAGE_TABLE_ISOLATION=n: */
.macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
@@ -265,10 +218,6 @@ For 32-bit we have the following conventions - kernel is built with
.endm
.macro SWITCH_TO_USER_CR3_STACK scratch_reg:req
.endm
-.macro SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg:req save_reg:req
-.endm
-.macro RESTORE_CR3 scratch_reg:req save_reg:req
-.endm
#endif
@@ -277,17 +226,10 @@ For 32-bit we have the following conventions - kernel is built with
*
* FENCE_SWAPGS_USER_ENTRY is used in the user entry swapgs code path, to
* prevent a speculative swapgs when coming from kernel space.
- *
- * FENCE_SWAPGS_KERNEL_ENTRY is used in the kernel entry non-swapgs code path,
- * to prevent the swapgs from getting speculatively skipped when coming from
- * user space.
*/
.macro FENCE_SWAPGS_USER_ENTRY
ALTERNATIVE "", "lfence", X86_FEATURE_FENCE_SWAPGS_USER
.endm
-.macro FENCE_SWAPGS_KERNEL_ENTRY
- ALTERNATIVE "", "lfence", X86_FEATURE_FENCE_SWAPGS_KERNEL
-.endm
.macro STACKLEAK_ERASE_NOCLOBBER
#ifdef CONFIG_GCC_PLUGIN_STACKLEAK
@@ -297,12 +239,6 @@ For 32-bit we have the following conventions - kernel is built with
#endif
.endm
-.macro SAVE_AND_SET_GSBASE scratch_reg:req save_reg:req
- rdgsbase \save_reg
- GET_PERCPU_BASE \scratch_reg
- wrgsbase \scratch_reg
-.endm
-
#else /* CONFIG_X86_64 */
# undef UNWIND_HINT_IRET_REGS
# define UNWIND_HINT_IRET_REGS
@@ -313,38 +249,3 @@ For 32-bit we have the following conventions - kernel is built with
call stackleak_erase
#endif
.endm
-
-#ifdef CONFIG_SMP
-
-/*
- * CPU/node NR is loaded from the limit (size) field of a special segment
- * descriptor entry in GDT.
- */
-.macro LOAD_CPU_AND_NODE_SEG_LIMIT reg:req
- movq $__CPUNODE_SEG, \reg
- lsl \reg, \reg
-.endm
-
-/*
- * Fetch the per-CPU GSBASE value for this processor and put it in @reg.
- * We normally use %gs for accessing per-CPU data, but we are setting up
- * %gs here and obviously can not use %gs itself to access per-CPU data.
- *
- * Do not use RDPID, because KVM loads guest's TSC_AUX on vm-entry and
- * may not restore the host's value until the CPU returns to userspace.
- * Thus the kernel would consume a guest's TSC_AUX if an NMI arrives
- * while running KVM's run loop.
- */
-.macro GET_PERCPU_BASE reg:req
- LOAD_CPU_AND_NODE_SEG_LIMIT \reg
- andq $VDSO_CPUNODE_MASK, \reg
- movq __per_cpu_offset(, \reg, 8), \reg
-.endm
-
-#else
-
-.macro GET_PERCPU_BASE reg:req
- movq pcpu_unit_offsets(%rip), \reg
-.endm
-
-#endif /* CONFIG_SMP */
--
2.19.1.6.gb485710b
From: Lai Jiangshan <[email protected]>
IST exceptions are changed to use C entry code which uses the C
functions ist_paranoid_entry() and ist_paranoid_exit(). The ASM
functions paranoid_entry() and paranoid_exit() are now unused.
Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/entry/entry_64.S | 128 --------------------------------------
1 file changed, 128 deletions(-)
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index dfef02696319..cce2673c5bb0 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -849,134 +849,6 @@ SYM_CODE_START(xen_failsafe_callback)
SYM_CODE_END(xen_failsafe_callback)
#endif /* CONFIG_XEN_PV */
-/*
- * Save all registers in pt_regs. Return GSBASE related information
- * in EBX depending on the availability of the FSGSBASE instructions:
- *
- * FSGSBASE R/EBX
- * N 0 -> SWAPGS on exit
- * 1 -> no SWAPGS on exit
- *
- * Y GSBASE value at entry, must be restored in paranoid_exit
- */
-SYM_CODE_START_LOCAL(paranoid_entry)
- UNWIND_HINT_FUNC
-
- /*
- * Always stash CR3 in %r14. This value will be restored,
- * verbatim, at exit. Needed if paranoid_entry interrupted
- * another entry that already switched to the user CR3 value
- * but has not yet returned to userspace.
- *
- * This is also why CS (stashed in the "iret frame" by the
- * hardware at entry) can not be used: this may be a return
- * to kernel code, but with a user CR3 value.
- *
- * Switching CR3 does not depend on kernel GSBASE so it can
- * be done before switching to the kernel GSBASE. This is
- * required for FSGSBASE because the kernel GSBASE has to
- * be retrieved from a kernel internal table.
- */
- SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg=%rax save_reg=%r14
-
- /*
- * Handling GSBASE depends on the availability of FSGSBASE.
- *
- * Without FSGSBASE the kernel enforces that negative GSBASE
- * values indicate kernel GSBASE. With FSGSBASE no assumptions
- * can be made about the GSBASE value when entering from user
- * space.
- */
- ALTERNATIVE "jmp .Lparanoid_entry_checkgs", "", X86_FEATURE_FSGSBASE
-
- /*
- * Read the current GSBASE and store it in %rbx unconditionally,
- * retrieve and set the current CPUs kernel GSBASE. The stored value
- * has to be restored in paranoid_exit unconditionally.
- *
- * The unconditional write to GS base below ensures that no subsequent
- * loads based on a mispredicted GS base can happen, therefore no LFENCE
- * is needed here.
- */
- SAVE_AND_SET_GSBASE scratch_reg=%rax save_reg=%rbx
- ret
-
-.Lparanoid_entry_checkgs:
- /*
- * The above SAVE_AND_SWITCH_TO_KERNEL_CR3 macro doesn't do an
- * unconditional CR3 write, even in the PTI case. So do an lfence
- * to prevent GS speculation, regardless of whether PTI is enabled.
- */
- FENCE_SWAPGS_KERNEL_ENTRY
-
- /* EBX = 1 -> kernel GSBASE active, no restore required */
- movl $1, %ebx
- /*
- * The kernel-enforced convention is a negative GSBASE indicates
- * a kernel value. No SWAPGS needed on entry and exit.
- */
- movl $MSR_GS_BASE, %ecx
- rdmsr
- testl %edx, %edx
- jns .Lparanoid_entry_swapgs
- ret
-
-.Lparanoid_entry_swapgs:
- swapgs
-
- /* EBX = 0 -> SWAPGS required on exit */
- xorl %ebx, %ebx
- ret
-SYM_CODE_END(paranoid_entry)
-
-/*
- * "Paranoid" exit path from exception stack. This is invoked
- * only on return from IST interrupts that came from kernel space.
- *
- * We may be returning to very strange contexts (e.g. very early
- * in syscall entry), so checking for preemption here would
- * be complicated. Fortunately, there's no good reason to try
- * to handle preemption here.
- *
- * R/EBX contains the GSBASE related information depending on the
- * availability of the FSGSBASE instructions:
- *
- * FSGSBASE R/EBX
- * N 0 -> SWAPGS on exit
- * 1 -> no SWAPGS on exit
- *
- * Y User space GSBASE, must be restored unconditionally
- */
-SYM_CODE_START_LOCAL(paranoid_exit)
- UNWIND_HINT_REGS offset=8
- /*
- * The order of operations is important. RESTORE_CR3 requires
- * kernel GSBASE.
- *
- * NB to anyone to try to optimize this code: this code does
- * not execute at all for exceptions from user mode. Those
- * exceptions go through error_exit instead.
- */
- RESTORE_CR3 scratch_reg=%rax save_reg=%r14
-
- /* Handle the three GSBASE cases */
- ALTERNATIVE "jmp .Lparanoid_exit_checkgs", "", X86_FEATURE_FSGSBASE
-
- /* With FSGSBASE enabled, unconditionally restore GSBASE */
- wrgsbase %rbx
- ret
-
-.Lparanoid_exit_checkgs:
- /* On non-FSGSBASE systems, conditionally do SWAPGS */
- testl %ebx, %ebx
- jnz .Lparanoid_exit_done
-
- /* We are returning to a context with user GSBASE */
- swapgs
-.Lparanoid_exit_done:
- ret
-SYM_CODE_END(paranoid_exit)
-
SYM_CODE_START_LOCAL(error_return)
UNWIND_HINT_REGS
DEBUG_ENTRY_ASSERT_IRQS_OFF
--
2.19.1.6.gb485710b
From: Lai Jiangshan <[email protected]>
PUSH_AND_CLEAR_REGS is no longer used with save_ret.
Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/entry/calling.h | 16 +++-------------
1 file changed, 3 insertions(+), 13 deletions(-)
diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index d42012fc694d..6f9de1c6da73 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -63,15 +63,9 @@ For 32-bit we have the following conventions - kernel is built with
* for assembly code:
*/
-.macro PUSH_REGS rdx=%rdx rax=%rax save_ret=0
- .if \save_ret
- pushq %rsi /* pt_regs->si */
- movq 8(%rsp), %rsi /* temporarily store the return address in %rsi */
- movq %rdi, 8(%rsp) /* pt_regs->di (overwriting original return address) */
- .else
+.macro PUSH_REGS rdx=%rdx rax=%rax
pushq %rdi /* pt_regs->di */
pushq %rsi /* pt_regs->si */
- .endif
pushq \rdx /* pt_regs->dx */
pushq %rcx /* pt_regs->cx */
pushq \rax /* pt_regs->ax */
@@ -86,10 +80,6 @@ For 32-bit we have the following conventions - kernel is built with
pushq %r14 /* pt_regs->r14 */
pushq %r15 /* pt_regs->r15 */
UNWIND_HINT_REGS
-
- .if \save_ret
- pushq %rsi /* return address on top of stack */
- .endif
.endm
.macro CLEAR_REGS
@@ -114,8 +104,8 @@ For 32-bit we have the following conventions - kernel is built with
.endm
-.macro PUSH_AND_CLEAR_REGS rdx=%rdx rax=%rax save_ret=0
- PUSH_REGS rdx=\rdx, rax=\rax, save_ret=\save_ret
+.macro PUSH_AND_CLEAR_REGS rdx=%rdx rax=%rax
+ PUSH_REGS rdx=\rdx, rax=\rax
CLEAR_REGS
.endm
--
2.19.1.6.gb485710b
From: Lai Jiangshan <[email protected]>
Like do_fast_syscall_32(), which checks whether it can return to
userspace via fast instructions before it returns, do_syscall_64() now
also checks in C whether it can use SYSRET to return to userspace
before it returns. A bunch of ASM code can then be removed.
No functional change intended.
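A small worked example of the canonical-address check added below
(48-bit case: sign-extend from bit 47 and compare):

	u64 vaddr = 0x0000800000000000ULL;	/* bit 47 set, upper bits clear */
	u64 canon = ((s64)vaddr << 16) >> 16;	/* 0xffff800000000000 */
	/* vaddr != canon, so RCX/RIP is non-canonical and IRET is used. */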
Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/entry/calling.h | 10 +----
arch/x86/entry/common.c | 73 ++++++++++++++++++++++++++++++-
arch/x86/entry/entry_64.S | 78 ++--------------------------------
arch/x86/include/asm/syscall.h | 2 +-
4 files changed, 78 insertions(+), 85 deletions(-)
diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index 6f9de1c6da73..05da3ef48ee4 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -109,27 +109,19 @@ For 32-bit we have the following conventions - kernel is built with
CLEAR_REGS
.endm
-.macro POP_REGS pop_rdi=1 skip_r11rcx=0
+.macro POP_REGS pop_rdi=1
popq %r15
popq %r14
popq %r13
popq %r12
popq %rbp
popq %rbx
- .if \skip_r11rcx
- popq %rsi
- .else
popq %r11
- .endif
popq %r10
popq %r9
popq %r8
popq %rax
- .if \skip_r11rcx
- popq %rsi
- .else
popq %rcx
- .endif
popq %rdx
popq %rsi
.if \pop_rdi
diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 6c2826417b33..718045b7a53c 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -70,7 +70,77 @@ static __always_inline bool do_syscall_x32(struct pt_regs *regs, int nr)
return false;
}
-__visible noinstr void do_syscall_64(struct pt_regs *regs, int nr)
+/*
+ * Change top bits to match the most significant bit (47th or 56th bit
+ * depending on paging mode) in the address to get canonical address.
+ *
+ * If width of "canonical tail" ever becomes variable, this will need
+ * to be updated to remain correct on both old and new CPUs.
+ */
+static __always_inline u64 canonical_address(u64 vaddr)
+{
+ if (IS_ENABLED(CONFIG_X86_5LEVEL) && static_cpu_has(X86_FEATURE_LA57))
+ return ((s64)vaddr << (64 - 57)) >> (64 - 57);
+ else
+ return ((s64)vaddr << (64 - 48)) >> (64 - 48);
+}
+
+/*
+ * Check if it can use SYSRET.
+ *
+ * Try to use SYSRET instead of IRET if we're returning to
+ * a completely clean 64-bit userspace context.
+ *
+ * Returns 0 to return using IRET or 1 to return using SYSRET.
+ */
+static __always_inline int can_sysret(struct pt_regs *regs)
+{
+ /* In the Xen PV case we must use iret anyway. */
+ if (static_cpu_has(X86_FEATURE_XENPV))
+ return 0;
+
+ /* SYSRET requires RCX == RIP && R11 == RFLAGS */
+ if (regs->ip != regs->cx || regs->flags != regs->r11)
+ return 0;
+
+ /* CS and SS must match SYSRET */
+ if (regs->cs != __USER_CS || regs->ss != __USER_DS)
+ return 0;
+
+ /*
+ * On Intel CPUs, SYSRET with non-canonical RCX/RIP will #GP
+ * in kernel space. This essentially lets the user take over
+ * the kernel, since userspace controls RSP.
+ */
+ if (regs->cx != canonical_address(regs->cx))
+ return 0;
+
+ /*
+ * SYSCALL clears RF when it saves RFLAGS in R11 and SYSRET cannot
+ * restore RF properly. If the slowpath sets it for whatever reason, we
+ * need to restore it correctly.
+ *
+ * SYSRET can restore TF, but unlike IRET, restoring TF results in a
+ * trap from userspace immediately after SYSRET. This would cause an
+ * infinite loop whenever #DB happens with register state that satisfies
+ * the opportunistic SYSRET conditions. For example, single-stepping
+ * this user code:
+ *
+ * movq $stuck_here, %rcx
+ * pushfq
+ * popq %r11
+ * stuck_here:
+ *
+ * would never get past 'stuck_here'.
+ */
+ if (regs->r11 & (X86_EFLAGS_RF | X86_EFLAGS_TF))
+ return 0;
+
+ return 1;
+}
+
+/* Returns 0 to return using IRET or 1 to return using SYSRET. */
+__visible noinstr int do_syscall_64(struct pt_regs *regs, int nr)
{
add_random_kstack_offset();
nr = syscall_enter_from_user_mode(regs, nr);
@@ -84,6 +154,7 @@ __visible noinstr void do_syscall_64(struct pt_regs *regs, int nr)
instrumentation_end();
syscall_exit_to_user_mode(regs);
+ return can_sysret(regs);
}
#endif
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index cce2673c5bb0..2016d969e3ea 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -112,85 +112,15 @@ SYM_INNER_LABEL(entry_SYSCALL_64_after_hwframe, SYM_L_GLOBAL)
movslq %eax, %rsi
call do_syscall_64 /* returns with IRQs disabled */
- /*
- * Try to use SYSRET instead of IRET if we're returning to
- * a completely clean 64-bit userspace context. If we're not,
- * go to the slow exit path.
- * In the Xen PV case we must use iret anyway.
- */
-
- ALTERNATIVE "", "jmp swapgs_restore_regs_and_return_to_usermode", \
- X86_FEATURE_XENPV
-
- movq RCX(%rsp), %rcx
- movq RIP(%rsp), %r11
-
- cmpq %rcx, %r11 /* SYSRET requires RCX == RIP */
- jne swapgs_restore_regs_and_return_to_usermode
+ testl %eax, %eax
+ jz swapgs_restore_regs_and_return_to_usermode
/*
- * On Intel CPUs, SYSRET with non-canonical RCX/RIP will #GP
- * in kernel space. This essentially lets the user take over
- * the kernel, since userspace controls RSP.
- *
- * If width of "canonical tail" ever becomes variable, this will need
- * to be updated to remain correct on both old and new CPUs.
- *
- * Change top bits to match most significant bit (47th or 56th bit
- * depending on paging mode) in the address.
- */
-#ifdef CONFIG_X86_5LEVEL
- ALTERNATIVE "shl $(64 - 48), %rcx; sar $(64 - 48), %rcx", \
- "shl $(64 - 57), %rcx; sar $(64 - 57), %rcx", X86_FEATURE_LA57
-#else
- shl $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
- sar $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
-#endif
-
- /* If this changed %rcx, it was not canonical */
- cmpq %rcx, %r11
- jne swapgs_restore_regs_and_return_to_usermode
-
- cmpq $__USER_CS, CS(%rsp) /* CS must match SYSRET */
- jne swapgs_restore_regs_and_return_to_usermode
-
- movq R11(%rsp), %r11
- cmpq %r11, EFLAGS(%rsp) /* R11 == RFLAGS */
- jne swapgs_restore_regs_and_return_to_usermode
-
- /*
- * SYSCALL clears RF when it saves RFLAGS in R11 and SYSRET cannot
- * restore RF properly. If the slowpath sets it for whatever reason, we
- * need to restore it correctly.
- *
- * SYSRET can restore TF, but unlike IRET, restoring TF results in a
- * trap from userspace immediately after SYSRET. This would cause an
- * infinite loop whenever #DB happens with register state that satisfies
- * the opportunistic SYSRET conditions. For example, single-stepping
- * this user code:
- *
- * movq $stuck_here, %rcx
- * pushfq
- * popq %r11
- * stuck_here:
- *
- * would never get past 'stuck_here'.
- */
- testq $(X86_EFLAGS_RF|X86_EFLAGS_TF), %r11
- jnz swapgs_restore_regs_and_return_to_usermode
-
- /* nothing to check for RSP */
-
- cmpq $__USER_DS, SS(%rsp) /* SS must match SYSRET */
- jne swapgs_restore_regs_and_return_to_usermode
-
- /*
- * We win! This label is here just for ease of understanding
+ * This label is here just for ease of understanding
* perf profiles. Nothing jumps here.
*/
syscall_return_via_sysret:
- /* rcx and r11 are already restored (see code above) */
- POP_REGS pop_rdi=0 skip_r11rcx=1
+ POP_REGS pop_rdi=0
/*
* Now all regs are restored except RSP and RDI.
diff --git a/arch/x86/include/asm/syscall.h b/arch/x86/include/asm/syscall.h
index f7e2d82d24fb..477adea7bac0 100644
--- a/arch/x86/include/asm/syscall.h
+++ b/arch/x86/include/asm/syscall.h
@@ -159,7 +159,7 @@ static inline int syscall_get_arch(struct task_struct *task)
? AUDIT_ARCH_I386 : AUDIT_ARCH_X86_64;
}
-void do_syscall_64(struct pt_regs *regs, int nr);
+int do_syscall_64(struct pt_regs *regs, int nr);
void do_int80_syscall_32(struct pt_regs *regs);
long do_fast_syscall_32(struct pt_regs *regs);
--
2.19.1.6.gb485710b
From: Lai Jiangshan <[email protected]>
When the original CR3 is the kernel CR3, paranoid_entry() has not
changed CR3, so CR3 does not need to be restored by paranoid_exit() in
this case.
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/entry/calling.h | 15 ++++-----------
1 file changed, 4 insertions(+), 11 deletions(-)
diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index 996b041e92d2..9065c31d2875 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -231,14 +231,11 @@ For 32-bit we have the following conventions - kernel is built with
.macro RESTORE_CR3 scratch_reg:req save_reg:req
ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
- ALTERNATIVE "jmp .Lwrcr3_\@", "", X86_FEATURE_PCID
-
- /*
- * KERNEL pages can always resume with NOFLUSH as we do
- * explicit flushes.
- */
+ /* No need to restore when the saved CR3 is kernel CR3. */
bt $PTI_USER_PGTABLE_BIT, \save_reg
- jnc .Lnoflush_\@
+ jnc .Lend_\@
+
+ ALTERNATIVE "jmp .Lwrcr3_\@", "", X86_FEATURE_PCID
/*
* Check if there's a pending flush for the user ASID we're
@@ -256,10 +253,6 @@ For 32-bit we have the following conventions - kernel is built with
SET_NOFLUSH_BIT \save_reg
.Lwrcr3_\@:
- /*
- * The CR3 write could be avoided when not changing its value,
- * but would require a CR3 read *and* a scratch register.
- */
movq \save_reg, %cr3
.Lend_\@:
.endm
--
2.19.1.6.gb485710b
From: Lai Jiangshan <[email protected]>
Use DEFINE_IDTENTRY_IST_ETNRY to emit C entry function and use the function
directly in entry_64.S.
Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/entry/entry_64.S | 10 +---------
arch/x86/include/asm/idtentry.h | 1 +
arch/x86/kernel/cpu/mce/Makefile | 3 +++
3 files changed, 5 insertions(+), 9 deletions(-)
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index b6bcf7fcad34..61e89fd5ad8a 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -465,16 +465,8 @@ SYM_CODE_START(\asmsym)
testb $3, CS(%rsp)
jnz .Lfrom_usermode_switch_stack_\@
- /* paranoid_entry returns GS information for paranoid_exit in EBX. */
- call paranoid_entry
-
- UNWIND_HINT_REGS
-
movq %rsp, %rdi /* pt_regs pointer */
-
- call \cfunc
-
- call paranoid_exit
+ call ist_\cfunc
jmp restore_regs_and_return_to_kernel
/* Switch to the regular task stack and use the noist entry point */
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 0f615943a460..d0fd32288442 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -358,6 +358,7 @@ __visible __entry_text void ist_##func(struct pt_regs *regs) \
* Maps to DEFINE_IDTENTRY_RAW
*/
#define DEFINE_IDTENTRY_IST(func) \
+ DEFINE_IDTENTRY_IST_ETNRY(func) \
DEFINE_IDTENTRY_RAW(func)
/**
diff --git a/arch/x86/kernel/cpu/mce/Makefile b/arch/x86/kernel/cpu/mce/Makefile
index 015856abdbb1..555963416ec3 100644
--- a/arch/x86/kernel/cpu/mce/Makefile
+++ b/arch/x86/kernel/cpu/mce/Makefile
@@ -1,4 +1,7 @@
# SPDX-License-Identifier: GPL-2.0
+
+CFLAGS_core.o += -fno-stack-protector
+
obj-y = core.o severity.o genpool.o
obj-$(CONFIG_X86_ANCIENT_MCE) += winchip.o p5.o
--
2.19.1.6.gb485710b
From: Lai Jiangshan <[email protected]>
It implements the whole ASM paranoid_exit() in C.
No functional difference intended.
Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/entry/entry64.c | 41 +++++++++++++++++++++++++++++++++
arch/x86/include/asm/idtentry.h | 2 ++
2 files changed, 43 insertions(+)
diff --git a/arch/x86/entry/entry64.c b/arch/x86/entry/entry64.c
index e1af3e5720f9..0e8a3bef3b25 100644
--- a/arch/x86/entry/entry64.c
+++ b/arch/x86/entry/entry64.c
@@ -267,6 +267,29 @@ static __always_inline unsigned long ist_switch_to_kernel_gsbase(void)
return 0;
}
+static __always_inline void ist_restore_gsbase(unsigned long gsbase)
+{
+ /*
+ * Handle the three GSBASE cases.
+ *
+ * @gsbase contains the GSBASE related information depending
+ * on the availability of the FSGSBASE instructions:
+ *
+ * FSGSBASE @gsbase
+ * N 0 -> SWAPGS on exit
+ * 1 -> no SWAPGS on exit
+ *
+ * Y User space GSBASE, must be restored unconditionally
+ */
+ if (static_cpu_has(X86_FEATURE_FSGSBASE)) {
+ wrgsbase(gsbase);
+ return;
+ }
+
+ if (!gsbase)
+ native_swapgs();
+}
+
/*
* Switch and save CR3 in *@cr3 if PTI enabled. Return GSBASE related
* information in *@gsbase depending on the availability of the FSGSBASE
@@ -303,3 +326,21 @@ void ist_paranoid_entry(unsigned long *cr3, unsigned long *gsbase)
/* Handle GSBASE, store the return value in *@gsbase for exit. */
*gsbase = ist_switch_to_kernel_gsbase();
}
+
+/*
+ * "Paranoid" exit path from exception stack. This is invoked
+ * only on return from IST interrupts that came from kernel space.
+ *
+ * We may be returning to very strange contexts (e.g. very early
+ * in syscall entry), so checking for preemption here would
+ * be complicated. Fortunately, there's no good reason to try
+ * to handle preemption here.
+ */
+__visible __entry_text
+void ist_paranoid_exit(unsigned long cr3, unsigned long gsbase)
+{
+ /* Restore CR3 at first, it can use kernel GSBASE. */
+ ist_restore_cr3(cr3);
+ barrier();
+ ist_restore_gsbase(gsbase);
+}
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index f6efa21ec242..cf41901227ed 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -309,6 +309,8 @@ static __always_inline void __##func(struct pt_regs *regs)
#ifdef CONFIG_X86_64
__visible __entry_text
void ist_paranoid_entry(unsigned long *cr3, unsigned long *gsbase);
+__visible __entry_text
+void ist_paranoid_exit(unsigned long cr3, unsigned long gsbase);
/**
* DECLARE_IDTENTRY_IST - Declare functions for IST handling IDT entry points
--
2.19.1.6.gb485710b
From: Lai Jiangshan <[email protected]>
Add the DEFINE_IDTENTRY_IST_ETNRY() macro to define C code implementing
the ASM sequence which calls paranoid_entry(), cfunc() and
paranoid_exit() in series for IST exceptions without an error code.
No functional difference intended.
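For example, for a handler such as exc_machine_check the macro would
expand roughly to (expansion sketch only):

	__visible __entry_text void ist_exc_machine_check(struct pt_regs *regs)
	{
		unsigned long cr3, gsbase;

		ist_paranoid_entry(&cr3, &gsbase);
		exc_machine_check(regs);
		ist_paranoid_exit(cr3, gsbase);
	}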
Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/include/asm/idtentry.h | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index cf41901227ed..0f615943a460 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -337,6 +337,20 @@ void ist_paranoid_exit(unsigned long cr3, unsigned long gsbase);
__visible noinstr void kernel_##func(struct pt_regs *regs, unsigned long error_code); \
__visible noinstr void user_##func(struct pt_regs *regs, unsigned long error_code)
+/**
+ * DEFINE_IDTENTRY_IST_ENTRY - Emit __entry_text code for IST entry points
+ * @func: Function name of the entry point
+ */
+#define DEFINE_IDTENTRY_IST_ETNRY(func) \
+__visible __entry_text void ist_##func(struct pt_regs *regs) \
+{ \
+ unsigned long cr3, gsbase; \
+ \
+ ist_paranoid_entry(&cr3, &gsbase); \
+ func(regs); \
+ ist_paranoid_exit(cr3, gsbase); \
+}
+
/**
* DEFINE_IDTENTRY_IST - Emit code for IST entry points
* @func: Function name of the entry point
--
2.19.1.6.gb485710b
From: Lai Jiangshan <[email protected]>
ist_vc_switch_off_ist() is the same as vc_switch_off_ist(), but it is
called before CR3 or GSBASE has been fixed up, so it has to call
ist_paranoid_entry() on its own.
This prepares for using C code for the other parts of idtentry_vc and
for removing the ASM paranoid_entry() and paranoid_exit().
Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/entry/entry_64.S | 20 ++++++++++----------
arch/x86/include/asm/traps.h | 3 ++-
arch/x86/kernel/traps.c | 14 +++++++++++++-
3 files changed, 25 insertions(+), 12 deletions(-)
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 9def3d2cedb7..944cf85e67da 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -515,26 +515,26 @@ SYM_CODE_START(\asmsym)
testb $3, CS(%rsp)
jnz .Lfrom_usermode_switch_stack_\@
- /*
- * paranoid_entry returns SWAPGS flag for paranoid_exit in EBX.
- * EBX == 0 -> SWAPGS, EBX == 1 -> no SWAPGS
- */
- call paranoid_entry
-
- UNWIND_HINT_REGS
-
/*
* Switch off the IST stack to make it free for nested exceptions. The
- * vc_switch_off_ist() function will switch back to the interrupted
+ * ist_vc_switch_off_ist() function will switch back to the interrupted
* stack if it is safe to do so. If not it switches to the VC fall-back
* stack.
*/
movq %rsp, %rdi /* pt_regs pointer */
- call vc_switch_off_ist
+ call ist_vc_switch_off_ist
movq %rax, %rsp /* Switch to new stack */
UNWIND_HINT_REGS
+ /*
+ * paranoid_entry returns SWAPGS flag for paranoid_exit in EBX.
+ * EBX == 0 -> SWAPGS, EBX == 1 -> no SWAPGS
+ */
+ call paranoid_entry
+
+ UNWIND_HINT_REGS
+
/* Update pt_regs */
movq ORIG_RAX(%rsp), %rsi /* get error code into 2nd argument*/
movq $-1, ORIG_RAX(%rsp) /* no syscall to restart */
diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index 686461ac9803..1aefc081d763 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -16,7 +16,8 @@ asmlinkage __visible notrace
struct pt_regs *fixup_bad_iret(struct pt_regs *bad_regs);
asmlinkage __visible notrace struct pt_regs *error_entry(struct pt_regs *eregs);
void __init trap_init(void);
-asmlinkage __visible noinstr struct pt_regs *vc_switch_off_ist(struct pt_regs *eregs);
+asmlinkage __visible __entry_text
+struct pt_regs *ist_vc_switch_off_ist(struct pt_regs *eregs);
#endif
#ifdef CONFIG_X86_F00F_BUG
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 4e9d306f313c..1a84587cb4c7 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -717,7 +717,7 @@ asmlinkage __visible noinstr struct pt_regs *sync_regs(struct pt_regs *eregs)
}
#ifdef CONFIG_AMD_MEM_ENCRYPT
-asmlinkage __visible noinstr struct pt_regs *vc_switch_off_ist(struct pt_regs *regs)
+static noinstr struct pt_regs *vc_switch_off_ist(struct pt_regs *regs)
{
unsigned long sp, *stack;
struct stack_info info;
@@ -757,6 +757,18 @@ asmlinkage __visible noinstr struct pt_regs *vc_switch_off_ist(struct pt_regs *r
return regs_ret;
}
+
+asmlinkage __visible __entry_text
+struct pt_regs *ist_vc_switch_off_ist(struct pt_regs *regs)
+{
+ unsigned long cr3, gsbase;
+
+ ist_paranoid_entry(&cr3, &gsbase);
+ regs = vc_switch_off_ist(regs);
+ ist_paranoid_exit(cr3, gsbase);
+
+ return regs;
+}
#endif
asmlinkage __visible noinstr
--
2.19.1.6.gb485710b
From: Lai Jiangshan <[email protected]>
Use DEFINE_IDTENTRY_IST_ETNRY_ERRORCODE to emit C entry function and
use the function directly in entry_64.S.
Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/entry/entry_64.S | 12 ++----------
arch/x86/include/asm/idtentry.h | 1 +
2 files changed, 3 insertions(+), 10 deletions(-)
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index c4cc47519a11..9def3d2cedb7 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -574,16 +574,8 @@ SYM_CODE_START(\asmsym)
PUSH_AND_CLEAR_REGS
ENCODE_FRAME_POINTER
- /* paranoid_entry returns GS information for paranoid_exit in EBX. */
- call paranoid_entry
- UNWIND_HINT_REGS
-
- movq %rsp, %rdi /* pt_regs pointer into first argument */
- movq ORIG_RAX(%rsp), %rsi /* get error code into 2nd argument*/
- movq $-1, ORIG_RAX(%rsp) /* no syscall to restart */
- call \cfunc
-
- call paranoid_exit
+ movq %rsp, %rdi /* pt_regs pointer */
+ call ist_\cfunc
jmp restore_regs_and_return_to_kernel
_ASM_NOKPROBE(\asmsym)
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index c57606948433..931b689f464c 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -415,6 +415,7 @@ __visible __entry_text void ist_##func(struct pt_regs *regs) \
* Maps to DEFINE_IDTENTRY_RAW_ERRORCODE
*/
#define DEFINE_IDTENTRY_DF(func) \
+ DEFINE_IDTENTRY_IST_ETNRY_ERRORCODE(func) \
DEFINE_IDTENTRY_RAW_ERRORCODE(func)
/**
--
2.19.1.6.gb485710b
From: Lai Jiangshan <[email protected]>
Use DEFINE_IDTENTRY_IST_ETNRY to emit C entry function and use the function
directly in entry_64.S.
Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/entry/entry_64.S | 17 ++---------------
arch/x86/include/asm/idtentry.h | 5 ++++-
arch/x86/kernel/Makefile | 1 +
3 files changed, 7 insertions(+), 16 deletions(-)
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 61e89fd5ad8a..c4cc47519a11 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1309,21 +1309,8 @@ end_repeat_nmi:
PUSH_AND_CLEAR_REGS
ENCODE_FRAME_POINTER
- /*
- * Use paranoid_entry to handle SWAPGS and CR3.
- */
- call paranoid_entry
- UNWIND_HINT_REGS
-
- movq %rsp, %rdi
- movq $-1, %rsi
- call exc_nmi
-
- /*
- * Use paranoid_exit to handle SWAPGS and CR3, but no need to use
- * restore_regs_and_return_to_kernel as we must handle nested NMI.
- */
- call paranoid_exit
+ movq %rsp, %rdi /* pt_regs pointer */
+ call ist_exc_nmi
POP_REGS
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index b9a6750dbba2..57636844b0fd 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -372,6 +372,8 @@ __visible __entry_text void ist_##func(struct pt_regs *regs) \
#define DEFINE_IDTENTRY_NOIST(func) \
DEFINE_IDTENTRY_RAW(noist_##func)
+#define DEFINE_IDTENTRY_NMI DEFINE_IDTENTRY_IST
+
#define DECLARE_IDTENTRY_MCE DECLARE_IDTENTRY_IST
#define DEFINE_IDTENTRY_MCE DEFINE_IDTENTRY_IST
#define DEFINE_IDTENTRY_MCE_USER DEFINE_IDTENTRY_NOIST
@@ -421,6 +423,8 @@ __visible __entry_text void ist_##func(struct pt_regs *regs) \
#else /* CONFIG_X86_64 */
+#define DEFINE_IDTENTRY_NMI DEFINE_IDTENTRY_RAW
+
/**
* DECLARE_IDTENTRY_DF - Declare functions for double fault 32bit variant
* @vector: Vector number (ignored for C)
@@ -452,7 +456,6 @@ __visible noinstr void func(struct pt_regs *regs, \
/* C-Code mapping */
#define DECLARE_IDTENTRY_NMI DECLARE_IDTENTRY_RAW
-#define DEFINE_IDTENTRY_NMI DEFINE_IDTENTRY_RAW
#else /* !__ASSEMBLY__ */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 8ac45801ba8b..28815c2e6cb2 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -51,6 +51,7 @@ KCOV_INSTRUMENT := n
CFLAGS_head$(BITS).o += -fno-stack-protector
CFLAGS_cc_platform.o += -fno-stack-protector
CFLAGS_traps.o += -fno-stack-protector
+CFLAGS_nmi.o += -fno-stack-protector
CFLAGS_irq.o := -I $(srctree)/$(src)/../include/asm/trace
--
2.19.1.6.gb485710b
From: Lai Jiangshan <[email protected]>
Use DEFINE_IDTENTRY_IST_ETNRY_ERRORCODE to emit C entry function and
use the function directly in entry_64.S.
Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/entry/entry_64.S | 22 +---------------------
arch/x86/include/asm/idtentry.h | 1 +
arch/x86/kernel/Makefile | 1 +
3 files changed, 3 insertions(+), 21 deletions(-)
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 944cf85e67da..dfef02696319 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -527,28 +527,8 @@ SYM_CODE_START(\asmsym)
UNWIND_HINT_REGS
- /*
- * paranoid_entry returns SWAPGS flag for paranoid_exit in EBX.
- * EBX == 0 -> SWAPGS, EBX == 1 -> no SWAPGS
- */
- call paranoid_entry
-
- UNWIND_HINT_REGS
-
- /* Update pt_regs */
- movq ORIG_RAX(%rsp), %rsi /* get error code into 2nd argument*/
- movq $-1, ORIG_RAX(%rsp) /* no syscall to restart */
-
movq %rsp, %rdi /* pt_regs pointer */
-
- call kernel_\cfunc
-
- /*
- * No need to switch back to the IST stack. The current stack is either
- * identical to the stack in the IRET frame or the VC fall-back stack,
- * so it is definitely mapped even with PTI enabled.
- */
- call paranoid_exit
+ call ist_kernel_\cfunc
jmp restore_regs_and_return_to_kernel
/* Switch to the regular task stack */
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 931b689f464c..84ce63f03c7f 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -426,6 +426,7 @@ __visible __entry_text void ist_##func(struct pt_regs *regs) \
* Maps to DEFINE_IDTENTRY_RAW_ERRORCODE
*/
#define DEFINE_IDTENTRY_VC_KERNEL(func) \
+ DEFINE_IDTENTRY_IST_ETNRY_ERRORCODE(kernel_##func) \
DEFINE_IDTENTRY_RAW_ERRORCODE(kernel_##func)
/**
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 28815c2e6cb2..9535d03aaa61 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -52,6 +52,7 @@ CFLAGS_head$(BITS).o += -fno-stack-protector
CFLAGS_cc_platform.o += -fno-stack-protector
CFLAGS_traps.o += -fno-stack-protector
CFLAGS_nmi.o += -fno-stack-protector
+CFLAGS_sev.o += -fno-stack-protector
CFLAGS_irq.o := -I $(srctree)/$(src)/../include/asm/trace
--
2.19.1.6.gb485710b
Ping
Thanks
Lai
On 10/11/2021 11:57, Lai Jiangshan wrote:
> From: Lai Jiangshan <[email protected]>
>
> Use DEFINE_IDTENTRY_IST_ETNRY_ERRORCODE to emit the C entry function and
> use the function directly in entry_64.S.
>
A drive-by comment as I was looking for SEV commits...
The typo in the definition names, ETNRY -> ENTRY (which impacts most patches
between 38 and 48), would likely cause confusion in the future.
Regards,
Liam
> Signed-off-by: Lai Jiangshan <[email protected]>
> ---
> arch/x86/entry/entry_64.S | 22 +---------------------
> arch/x86/include/asm/idtentry.h | 1 +
> arch/x86/kernel/Makefile | 1 +
> 3 files changed, 3 insertions(+), 21 deletions(-)
>
> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> index 944cf85e67da..dfef02696319 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -527,28 +527,8 @@ SYM_CODE_START(\asmsym)
>
> UNWIND_HINT_REGS
>
> - /*
> - * paranoid_entry returns SWAPGS flag for paranoid_exit in EBX.
> - * EBX == 0 -> SWAPGS, EBX == 1 -> no SWAPGS
> - */
> - call paranoid_entry
> -
> - UNWIND_HINT_REGS
> -
> - /* Update pt_regs */
> - movq ORIG_RAX(%rsp), %rsi /* get error code into 2nd argument*/
> - movq $-1, ORIG_RAX(%rsp) /* no syscall to restart */
> -
> movq %rsp, %rdi /* pt_regs pointer */
> -
> - call kernel_\cfunc
> -
> - /*
> - * No need to switch back to the IST stack. The current stack is either
> - * identical to the stack in the IRET frame or the VC fall-back stack,
> - * so it is definitely mapped even with PTI enabled.
> - */
> - call paranoid_exit
> + call ist_kernel_\cfunc
> jmp restore_regs_and_return_to_kernel
>
> /* Switch to the regular task stack */
> diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
> index 931b689f464c..84ce63f03c7f 100644
> --- a/arch/x86/include/asm/idtentry.h
> +++ b/arch/x86/include/asm/idtentry.h
> @@ -426,6 +426,7 @@ __visible __entry_text void ist_##func(struct pt_regs *regs) \
> * Maps to DEFINE_IDTENTRY_RAW_ERRORCODE
> */
> #define DEFINE_IDTENTRY_VC_KERNEL(func) \
> + DEFINE_IDTENTRY_IST_ETNRY_ERRORCODE(kernel_##func) \
> DEFINE_IDTENTRY_RAW_ERRORCODE(kernel_##func)
>
> /**
> diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
> index 28815c2e6cb2..9535d03aaa61 100644
> --- a/arch/x86/kernel/Makefile
> +++ b/arch/x86/kernel/Makefile
> @@ -52,6 +52,7 @@ CFLAGS_head$(BITS).o += -fno-stack-protector
> CFLAGS_cc_platform.o += -fno-stack-protector
> CFLAGS_traps.o += -fno-stack-protector
> CFLAGS_nmi.o += -fno-stack-protector
> +CFLAGS_sev.o += -fno-stack-protector
>
> CFLAGS_irq.o := -I $(srctree)/$(src)/../include/asm/trace
>
>
On Thu, Nov 18, 2021 at 5:31 PM Liam Merwick <[email protected]> wrote:
>
> On 10/11/2021 11:57, Lai Jiangshan wrote:
> > From: Lai Jiangshan <[email protected]>
> >
> > Use DEFINE_IDTENTRY_IST_ETNRY_ERRORCODE to emit the C entry function and
> > use the function directly in entry_64.S.
> >
>
> A drive-by comment as I was looking for SEV commits...
>
> The typo in the definition names, ETNRY -> ENTRY (which impacts most patches
> between 38 and 48), would likely cause confusion in the future.
>
Wow, what a stupid mistake.
Thanks
Lai
On Wed, Nov 10, 2021 at 07:56:47PM +0800, Lai Jiangshan wrote:
> From: Lai Jiangshan <[email protected]>
>
> Commit 18ec54fdd6d18 ("x86/speculation: Prepare entry code for Spectre
> v1 swapgs mitigations") adds FENCE_SWAPGS_{KERNEL|USER}_ENTRY
> for conditional swapgs. And in paranoid_entry(), it uses only
> FENCE_SWAPGS_KERNEL_ENTRY for both branches. It is because the fence
> is required for both cases since the CR3 write is conditional even when
> PTI is enabled.
>
> But commit 96b2371413e8f ("x86/entry/64: Switch CR3 before SWAPGS in
> paranoid entry") switches the code order and changes the branches.
> And it misses the needed FENCE_SWAPGS_KERNEL_ENTRY for the user gsbase case.
>
> Add it back by moving FENCE_SWAPGS_KERNEL_ENTRY up to cover both branches.
>
> Fixes: 96b2371413e8f ("x86/entry/64: Switch CR3 before SWAPGS in paranoid entry")
> Cc: Josh Poimboeuf <[email protected]>
> Cc: Chang S. Bae <[email protected]>
> Cc: Sasha Levin <[email protected]>
> Signed-off-by: Lai Jiangshan <[email protected]>
> ---
> arch/x86/entry/entry_64.S | 14 +++++++-------
> 1 file changed, 7 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> index e38a4cf795d9..14ffe12807ba 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -888,6 +888,13 @@ SYM_CODE_START_LOCAL(paranoid_entry)
> ret
>
> .Lparanoid_entry_checkgs:
> + /*
> + * The above SAVE_AND_SWITCH_TO_KERNEL_CR3 macro doesn't do an
> + * unconditional CR3 write, even in the PTI case. So do an lfence
> + * to prevent GS speculation, regardless of whether PTI is enabled.
> + */
> + FENCE_SWAPGS_KERNEL_ENTRY
> +
> /* EBX = 1 -> kernel GSBASE active, no restore required */
> movl $1, %ebx
> /*
> @@ -903,13 +910,6 @@ SYM_CODE_START_LOCAL(paranoid_entry)
> .Lparanoid_entry_swapgs:
> swapgs
>
> - /*
> - * The above SAVE_AND_SWITCH_TO_KERNEL_CR3 macro doesn't do an
> - * unconditional CR3 write, even in the PTI case. So do an lfence
> - * to prevent GS speculation, regardless of whether PTI is enabled.
> - */
> - FENCE_SWAPGS_KERNEL_ENTRY
> -
> /* EBX = 0 -> SWAPGS required on exit */
> xorl %ebx, %ebx
> ret
I'm confused, shouldn't the LFENCE be between SWAPGS and future uses of
GS prefix?
In the old code, before 96b2371413e8f, we had:
swapgs
SAVE_AND_SWITCH_TO_KERNEL_CR3
FENCE_SWAPGS_KERNEL_ENTRY
// %gs user comes here..
And the comment made sense, since if SAVE_AND_SWITCH_TO_KERNEL_CR3 would
imply an unconditional CR3 write, the LFENCE would not be needed.
Then along comes 96b2371413e8f and changes the order to:
SAVE_AND_SWITCH_TO_KERNEL_CR3
swapgs
FENCE_SWAPGS_KERNEL_ENTRY
// %gs user comes here..
But now the comment is crazy talk, because even if the CR3 write were
unconditional, it'd be pointless, since it's not after SWAPGS, but we
still have the LFENCE in the right place.
But now you want to make it:
SAVE_AND_SWITCH_TO_KERNEL_CR3
FENCE_SWAPGS_KERNEL_ENTRY
swapgs
// %gs user comes here..
And there's nothing left and speculation can use the old %gs for our
user and things go sideways. Hmm?
(on a completely unrelated note, I find KERNEL_ENTRY and USER_ENTRY
utterly confusing)
On 2021/11/18 23:54, Peter Zijlstra wrote:
>
> I'm confused, shouldn't the LFENCE be between SWAPGS and future uses of
> GS prefix?
I'm wrong again.
I once thought "it should be followed with serializing operations such
as SWITCH_TO_KERNEL_CR3", and tglx corrected me:
https://lore.kernel.org/lkml/875yumbgox.ffs@tglx/
> It does not matter whether the *serializing* is before or after
And in my brain, it was incorrectly stored as:
It does not matter whether the *fence* is before or after.
I will update patch1 and the corresponding C code in later patches.
Patch 1 in V4 is correct, but not ideal: as Borislav Petkov pointed out,
it has a duplicated FENCE_SWAPGS_KERNEL_ENTRY.
I will change it to:
	rdmsr
	if (need_swapgs) {
		swapgs
		set ebx/return value
	}
	FENCE_SWAPGS_KERNEL_ENTRY
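In C, that order would look roughly like the sketch below. This is
illustrative only: fence_swapgs_kernel_entry() and native_swapgs() are the
helpers used elsewhere in the series, but the function name and the exact
shape of the final code may differ in the updated patches.

static __always_inline unsigned long ist_switch_to_kernel_gsbase(void)
{
	unsigned long ret = 1;	/* like EBX == 1: kernel GSBASE active */
	unsigned long gsbase = __rdmsr(MSR_GS_BASE);

	if (!(gsbase & (1UL << 63))) {
		/* User GSBASE still active -> switch to the kernel one. */
		native_swapgs();
		ret = 0;	/* like EBX == 0: SWAPGS required on exit */
	}

	/*
	 * A single LFENCE covering both branches, after the (possible)
	 * swapgs, so that no speculative %gs-prefixed access can still
	 * use the old user GSBASE.
	 */
	fence_swapgs_kernel_entry();
	return ret;
}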
>
> In the old code, before 96b2371413e8f, we had:
>
> swapgs
> SAVE_AND_SWITCH_TO_KERNEL_CR3
> FENCE_SWAPGS_KERNEL_ENTRY
>
> // %gs user comes here..
>
> And the comment made sense, since if SAVE_AND_SWITCH_TO_KERNEL_CR3 would
> imply an unconditional CR3 write, the LFENCE would not be needed.
>
> Then along comes 96b2371413e8f and changes the order to:
>
> SAVE_AND_SWITCH_TO_KERNEL_CR3
> swapgs
> FENCE_SWAPGS_KERNEL_ENTRY
> // %gs user comes here..
>
> But now the comment is crazy talk, because even if the CR3 write were
> unconditional, it'd be pointless, since it's not after SWAPGS, but we
> still have the LFENCE in the right place.
I think the comment still makes sense.
If the CR3 write were unconditional before swapgs, no fence would be needed
after swapgs since a CR3 write is serializing.
>
> But now you want to make it:
>
> SAVE_AND_SWITCH_TO_KERNEL_CR3
> FENCE_SWAPGS_KERNEL_ENTRY
> swapgs
> // %gs user comes here..
>
> And there's nothing left and speculation can use the old %gs for our
> user and things go sideways. Hmm?
>
>
> (on a completely unrelated note, I find KERNEL_ENTRY and USER_ENTRY
> utterly confusing)
>
On Wed, Nov 10, 2021 at 07:56:49PM +0800, Lai Jiangshan wrote:
> From: Lai Jiangshan <[email protected]>
>
> When stack-protector is enabled, the compiler adds instrumentation code
> at the beginning and the end of some functions. Many functions in traps.c
> are non-instrumentable. Moreover, the stack-protector code at the beginning
> of the affected functions accesses the canary, which might be watched by
> hardware breakpoints; this also violates the non-instrumentable
> nature of those functions and might cause an infinite recursive #DB because
> the canary is accessed before dr7 is reset.
>
> So it is better to remove stack-protector from traps.c.
>
> It also prepares for later patches that move some entry code into
> traps.c, some of which can NOT use the percpu register until gsbase is
> properly switched. And stack-protector depends on the percpu register
> to work.
>
> Signed-off-by: Lai Jiangshan <[email protected]>
> ---
> arch/x86/kernel/Makefile | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
> index 2ff3e600f426..8ac45801ba8b 100644
> --- a/arch/x86/kernel/Makefile
> +++ b/arch/x86/kernel/Makefile
> @@ -50,6 +50,7 @@ KCOV_INSTRUMENT := n
>
> CFLAGS_head$(BITS).o += -fno-stack-protector
> CFLAGS_cc_platform.o += -fno-stack-protector
> +CFLAGS_traps.o += -fno-stack-protector
Well, there's a lot more noinstr than just in traps. There's also real C
code in traps. This isn't really a solution.
I think GCC has recently grown __attribute__((no_stack_protector)),
which should be added to noinstr (GCC-11 and above).
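Something like the sketch below, perhaps. This is only an illustration:
__no_stack_protector is a made-up wrapper name here, and the real noinstr
definition in compiler_types.h carries more annotations than shown.

#if __has_attribute(no_stack_protector)	/* GCC 11+, recent clang */
# define __no_stack_protector	__attribute__((no_stack_protector))
#else
# define __no_stack_protector
#endif

/* noinstr would then grow the new annotation, e.g.: */
#define noinstr							\
	noinline notrace __section(".noinstr.text")		\
	__no_kcsan __no_sanitize_address __no_stack_protector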
Additionally we could add code to objtool to detect this problem.
On Wed, Nov 10, 2021 at 07:57:00PM +0800, Lai Jiangshan wrote:
> From: Lai Jiangshan <[email protected]>
>
> The address of .Lgs_change will be used in traps.c in a later patch when
> some entry code is implemented in entry64.c. So the address of .Lgs_change
> is exposed to traps.c in preparation.
>
> The label .Lgs_change is still needed in the ASM code for the extable since
> it cannot use asm_load_gs_index_gs_change. Otherwise:
>
> warning: objtool: __ex_table+0x0: don't know how to handle
> non-section reloc symbol asm_load_gs_index_gs_change
>
I'm thinking commits:
24ff65257375 ("objtool: Teach get_alt_entry() about more relocation types")
4d8b35968bbf ("objtool: Remove reloc symbol type checks in get_alt_entry()")
Might have cured that.
On 2021/11/19 03:55, Peter Zijlstra wrote:
> On Wed, Nov 10, 2021 at 07:56:49PM +0800, Lai Jiangshan wrote:
>> From: Lai Jiangshan <[email protected]>
>>
>> When stack-protector is enabled, the compiler adds instrumentation code
>> at the beginning and the end of some functions. Many functions in traps.c
>> are non-instrumentable. Moreover, the stack-protector code at the beginning
>> of the affected functions accesses the canary, which might be watched by
>> hardware breakpoints; this also violates the non-instrumentable
>> nature of those functions and might cause an infinite recursive #DB because
>> the canary is accessed before dr7 is reset.
>>
>> So it is better to remove stack-protector from traps.c.
>>
>> It also prepares for later patches that move some entry code into
>> traps.c, some of which can NOT use the percpu register until gsbase is
>> properly switched. And stack-protector depends on the percpu register
>> to work.
>>
>> Signed-off-by: Lai Jiangshan <[email protected]>
>> ---
>> arch/x86/kernel/Makefile | 1 +
>> 1 file changed, 1 insertion(+)
>>
>> diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
>> index 2ff3e600f426..8ac45801ba8b 100644
>> --- a/arch/x86/kernel/Makefile
>> +++ b/arch/x86/kernel/Makefile
>> @@ -50,6 +50,7 @@ KCOV_INSTRUMENT := n
>>
>> CFLAGS_head$(BITS).o += -fno-stack-protector
>> CFLAGS_cc_platform.o += -fno-stack-protector
>> +CFLAGS_traps.o += -fno-stack-protector
>
> Well, there's a lot more noinstr than just in traps.
Although it is stupid to put a hardware breakpoint on the stack canary,
it is fatal only in traps.c, where the canary is accessed before dr7 is
reset in the #DB handler.
And this only happens when the administrator of the system is deliberately
hurting the system, so a fix for this problem is not strongly required.
The best way is to disallow hw_breakpoint from watching the stack canary.
The later patch (patch39) puts __entry_code into traps.c, which makes
no_stack_protector strongly required, so this patch simply puts
-fno-stack-protector on traps.c.
> There's also real C
> code in traps. This isn't really a solution.
This patch focuses "hardware break point on the stack canary" only.
It is not a full solution [for other unhappiness when noistr is watching
by stack protector].
I will switch to disallow hw_breakpoint to watch the stack canary.
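Something like the following sketch, in the spirit of the existing
cpu_entry_area checks in arch/x86/kernel/hw_breakpoint.c (the function name
and exact placement are only illustrative, not a final patch):

/* Reject any data breakpoint that overlaps a per-CPU stack canary. */
static bool within_stack_canary(unsigned long addr, unsigned long end)
{
	int cpu;

	for_each_possible_cpu(cpu) {
		unsigned long canary = (unsigned long)
			&per_cpu(fixed_percpu_data, cpu).stack_canary;

		/* Does [addr, end) overlap the canary word? */
		if (end > canary && addr < canary + sizeof(unsigned long))
			return true;
	}
	return false;
}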
>
> I think GCC has recently grown __attribute__((no_stack_protector)),
> which should be added to noinstr (GCC-11 and above).
>
> Additionally we could add code to objtool to detect this problem.
>