2020-11-09 14:46:51

by Alexandre Chartre

Subject: [RFC][PATCH 00/24] x86/pti: Defer CR3 switch to C code

[Resending without messing up the email addresses (hopefully!).
Please reply in this email thread so that the addresses are correct.
Sorry for the noise.]

With Page Table Isolation (PTI), syscalls as well as interrupts and
exceptions occurring in userspace enter the kernel with the user
page-table. The kernel entry code then switches the page-table
from the user page-table to the kernel page-table by updating the
CR3 control register. This CR3 switch is currently done early in
the kernel entry sequence using assembly code.

This RFC proposes to defer the PTI CR3 switch until we reach C code.
The benefit is that this simplifies the assembly entry code and makes
the PTI CR3 switch code easier to understand. This also paves the way
for further possible projects, such as an easier integration of Address
Space Isolation (ASI), or the possibility to execute selected syscall
or interrupt handlers without switching to the kernel page-table
(and thus avoid the PTI page-table switch overhead).

Deferring CR3 switch to C code means that we need to run more of the
kernel entry code with the user page-table. To do so, we need to:

- map more syscall, interrupt and exception entry code into the user
page-table (map all noinstr code);

- map additional data used in the entry code (such as the stack canary);

- run more entry code on the trampoline stack (which is mapped both
in the kernel and in the user page-table) until we switch to the
kernel page-table and then switch to the kernel stack (see the
sketch after this list);

- have a per-task trampoline stack instead of a per-cpu trampoline
stack, so the task can be scheduled out before it has switched
to the kernel stack.
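
As a reference for the stack switch mentioned above, handlers are moved
onto the kernel stack with small helpers added later in the series
(run_idt() and the CALL_ON_STACK_*() macros, also visible in the patch 17
hunk below). Roughly, assuming pti_kernel_stack() returns the top of the
kernel stack when we entered on the PTI stack (or NULL if we are already
on the kernel stack):

#define CALL_ON_STACK_1(stack, func, arg1)				\
	((stack) ?							\
		asm_call_on_stack_1(stack,				\
			(void (*)(void))(func), (void *)(arg1)) :	\
		func(arg1))

static __always_inline
void run_idt(void (*func)(struct pt_regs *), struct pt_regs *regs)
{
	/* Run the handler on the kernel stack if we are not already on it. */
	CALL_ON_STACK_1(pti_kernel_stack(regs), func, regs);
}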

Note that, for now, the CR3 switch can only be deferred as long as
interrupts remain disabled in the entry code. This is because the CR3
switch is done based on the privilege level read from the CS register
of the interrupt frame. I plan to fix this, but that adds some extra
complication (we need to track whether the user page-table is in use
or not).
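
For illustration, the user_mode() check on the saved CS is what the C
entry code uses to decide the CR3 switch; the helpers added by the last
patch of the series look roughly like this:

static __always_inline void kernel_pgtable_enter(struct pt_regs *regs)
{
	/* Switch to the kernel CR3 only if we entered from userspace. */
	if (user_mode(regs))
		switch_to_kernel_cr3();
}

static __always_inline void kernel_pgtable_exit(struct pt_regs *regs)
{
	/* Switch back to the user CR3 before returning to userspace. */
	if (user_mode(regs))
		switch_to_user_cr3();
}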

This patchset is sent as an RFC to get early feedback about the
proposal.

The code survives running a kernel build and LTP. Note that the changes
are only for 64-bit at the moment; I haven't looked at 32-bit yet but I
will definitely check it.

Code is based on v5.10-rc3.

Thanks,

alex.

-----

Alexandre Chartre (24):
x86/syscall: Add wrapper for invoking syscall function
x86/entry: Update asm_call_on_stack to support more function arguments
x86/entry: Consolidate IST entry from userspace
x86/sev-es: Define a setup stack function for the VC idtentry
x86/entry: Implement ret_from_fork body with C code
x86/pti: Provide C variants of PTI switch CR3 macros
x86/entry: Fill ESPFIX stack using C code
x86/entry: Add C version of SWAPGS and SWAPGS_UNSAFE_STACK
x86/entry: Add C version of paranoid_entry/exit
x86/pti: Introduce per-task PTI trampoline stack
x86/pti: Function to clone page-table entries from a specified mm
x86/pti: Function to map per-cpu page-table entry
x86/pti: Extend PTI user mappings
x86/pti: Use PTI stack instead of trampoline stack
x86/pti: Execute syscall functions on the kernel stack
x86/pti: Execute IDT handlers on the kernel stack
x86/pti: Execute IDT handlers with error code on the kernel stack
x86/pti: Execute system vector handlers on the kernel stack
x86/pti: Execute page fault handler on the kernel stack
x86/pti: Execute NMI handler on the kernel stack
x86/entry: Disable stack-protector for IST entry C handlers
x86/entry: Defer paranoid entry/exit to C code
x86/entry: Remove paranoid_entry and paranoid_exit
x86/pti: Defer CR3 switch to C code for non-IST and syscall entries

arch/x86/entry/common.c | 259 ++++++++++++-
arch/x86/entry/entry_64.S | 513 ++++++++------------------
arch/x86/entry/entry_64_compat.S | 22 --
arch/x86/include/asm/entry-common.h | 108 ++++++
arch/x86/include/asm/idtentry.h | 153 +++++++-
arch/x86/include/asm/irq_stack.h | 11 +
arch/x86/include/asm/page_64_types.h | 36 +-
arch/x86/include/asm/paravirt.h | 15 +
arch/x86/include/asm/paravirt_types.h | 17 +-
arch/x86/include/asm/processor.h | 3 +
arch/x86/include/asm/pti.h | 18 +
arch/x86/include/asm/switch_to.h | 7 +-
arch/x86/include/asm/traps.h | 2 +-
arch/x86/kernel/cpu/mce/core.c | 7 +-
arch/x86/kernel/espfix_64.c | 41 ++
arch/x86/kernel/nmi.c | 34 +-
arch/x86/kernel/sev-es.c | 52 +++
arch/x86/kernel/traps.c | 61 +--
arch/x86/mm/fault.c | 11 +-
arch/x86/mm/pti.c | 71 ++--
kernel/fork.c | 22 ++
21 files changed, 1002 insertions(+), 461 deletions(-)

--
2.18.4


2020-11-09 14:46:57

by Alexandre Chartre

Subject: [RFC][PATCH 02/24] x86/entry: Update asm_call_on_stack to support more function arguments

Update the asm_call_on_stack() function so that it can invoke a
function taking up to three arguments instead of only one.

Signed-off-by: Alexandre Chartre <[email protected]>
---
arch/x86/entry/entry_64.S | 15 +++++++++++----
arch/x86/include/asm/irq_stack.h | 8 ++++++++
2 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index cad08703c4ad..c42948aca0a8 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -759,9 +759,14 @@ SYM_CODE_END(.Lbad_gs)
/*
* rdi: New stack pointer points to the top word of the stack
* rsi: Function pointer
- * rdx: Function argument (can be NULL if none)
+ * rdx: Function argument 1 (can be NULL if none)
+ * rcx: Function argument 2 (can be NULL if none)
+ * r8 : Function argument 3 (can be NULL if none)
*/
SYM_FUNC_START(asm_call_on_stack)
+SYM_FUNC_START(asm_call_on_stack_1)
+SYM_FUNC_START(asm_call_on_stack_2)
+SYM_FUNC_START(asm_call_on_stack_3)
SYM_INNER_LABEL(asm_call_sysvec_on_stack, SYM_L_GLOBAL)
SYM_INNER_LABEL(asm_call_irq_on_stack, SYM_L_GLOBAL)
/*
@@ -777,15 +782,17 @@ SYM_INNER_LABEL(asm_call_irq_on_stack, SYM_L_GLOBAL)
*/
mov %rsp, (%rdi)
mov %rdi, %rsp
- /* Move the argument to the right place */
+ mov %rsi, %rax
+ /* Move arguments to the right place */
mov %rdx, %rdi
-
+ mov %rcx, %rsi
+ mov %r8, %rdx
1:
.pushsection .discard.instr_begin
.long 1b - .
.popsection

- CALL_NOSPEC rsi
+ CALL_NOSPEC rax

2:
.pushsection .discard.instr_end
diff --git a/arch/x86/include/asm/irq_stack.h b/arch/x86/include/asm/irq_stack.h
index 775816965c6a..359427216336 100644
--- a/arch/x86/include/asm/irq_stack.h
+++ b/arch/x86/include/asm/irq_stack.h
@@ -13,6 +13,14 @@ static __always_inline bool irqstack_active(void)
}

void asm_call_on_stack(void *sp, void (*func)(void), void *arg);
+
+void asm_call_on_stack_1(void *sp, void (*func)(void),
+ void *arg1);
+void asm_call_on_stack_2(void *sp, void (*func)(void),
+ void *arg1, void *arg2);
+void asm_call_on_stack_3(void *sp, void (*func)(void),
+ void *arg1, void *arg2, void *arg3);
+
void asm_call_sysvec_on_stack(void *sp, void (*func)(struct pt_regs *regs),
struct pt_regs *regs);
void asm_call_irq_on_stack(void *sp, void (*func)(struct irq_desc *desc),
--
2.18.4

2020-11-09 14:47:01

by Alexandre Chartre

Subject: [RFC][PATCH 24/24] x86/pti: Defer CR3 switch to C code for non-IST and syscall entries

With PTI, syscall/interrupt/exception entries switch the CR3 register
in assembly code to change the page-table. Move this CR3 switch into
the C code of the syscall/interrupt/exception entry handlers.

Signed-off-by: Alexandre Chartre <[email protected]>
---
arch/x86/entry/common.c | 15 ++++++++++++---
arch/x86/entry/entry_64.S | 23 +++++------------------
arch/x86/entry/entry_64_compat.S | 22 ----------------------
arch/x86/include/asm/entry-common.h | 14 ++++++++++++++
arch/x86/include/asm/idtentry.h | 25 ++++++++++++++++++++-----
arch/x86/kernel/cpu/mce/core.c | 2 ++
arch/x86/kernel/nmi.c | 2 ++
arch/x86/kernel/traps.c | 6 ++++++
arch/x86/mm/fault.c | 9 +++++++--
9 files changed, 68 insertions(+), 50 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index ead6a4c72e6a..3f4788dbbde7 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -51,6 +51,7 @@ __visible noinstr void return_from_fork(struct pt_regs *regs,
regs->ax = 0;
}
syscall_exit_to_user_mode(regs);
+ switch_to_user_cr3();
}

static __always_inline void run_syscall(sys_call_ptr_t sysfunc,
@@ -74,6 +75,7 @@ static __always_inline void run_syscall(sys_call_ptr_t sysfunc,
#ifdef CONFIG_X86_64
__visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs)
{
+ switch_to_kernel_cr3();
nr = syscall_enter_from_user_mode(regs, nr);

instrumentation_begin();
@@ -91,12 +93,14 @@ __visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs)

instrumentation_end();
syscall_exit_to_user_mode(regs);
+ switch_to_user_cr3();
}
#endif

#if defined(CONFIG_X86_32) || defined(CONFIG_IA32_EMULATION)
static __always_inline unsigned int syscall_32_enter(struct pt_regs *regs)
{
+ switch_to_kernel_cr3();
if (IS_ENABLED(CONFIG_IA32_EMULATION))
current_thread_info()->status |= TS_COMPAT;

@@ -131,11 +135,11 @@ __visible noinstr void do_int80_syscall_32(struct pt_regs *regs)

do_syscall_32_irqs_on(regs, nr);
syscall_exit_to_user_mode(regs);
+ switch_to_user_cr3();
}

-static noinstr bool __do_fast_syscall_32(struct pt_regs *regs)
+static noinstr bool __do_fast_syscall_32(struct pt_regs *regs, long nr)
{
- unsigned int nr = syscall_32_enter(regs);
int res;

/*
@@ -179,6 +183,9 @@ static noinstr bool __do_fast_syscall_32(struct pt_regs *regs)
/* Returns 0 to return using IRET or 1 to return using SYSEXIT/SYSRETL. */
__visible noinstr long do_fast_syscall_32(struct pt_regs *regs)
{
+ unsigned int nr = syscall_32_enter(regs);
+ bool syscall_done;
+
/*
* Called using the internal vDSO SYSENTER/SYSCALL32 calling
* convention. Adjust regs so it looks like we entered using int80.
@@ -194,7 +201,9 @@ __visible noinstr long do_fast_syscall_32(struct pt_regs *regs)
regs->ip = landing_pad;

/* Invoke the syscall. If it failed, keep it simple: use IRET. */
- if (!__do_fast_syscall_32(regs))
+ syscall_done = __do_fast_syscall_32(regs, nr);
+ switch_to_user_cr3();
+ if (!syscall_done)
return 0;

#ifdef CONFIG_X86_64
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 797effbe65b6..4be15a5ffe68 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -98,7 +98,6 @@ SYM_CODE_START(entry_SYSCALL_64)
swapgs
/* tss.sp2 is scratch space. */
movq %rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)
- SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
movq PER_CPU_VAR(cpu_current_top_of_stack), %rsp

SYM_INNER_LABEL(entry_SYSCALL_64_safe_stack, SYM_L_GLOBAL)
@@ -192,18 +191,14 @@ SYM_INNER_LABEL(entry_SYSCALL_64_after_hwframe, SYM_L_GLOBAL)
*/
syscall_return_via_sysret:
/* rcx and r11 are already restored (see code above) */
- POP_REGS pop_rdi=0 skip_r11rcx=1
+ POP_REGS skip_r11rcx=1

/*
- * We are on the trampoline stack. All regs except RDI are live.
* We are on the trampoline stack. All regs except RSP are live.
* We can do future final exit work right here.
*/
STACKLEAK_ERASE_NOCLOBBER

- SWITCH_TO_USER_CR3_STACK scratch_reg=%rdi
-
- popq %rdi
movq RSP-ORIG_RAX(%rsp), %rsp
USERGS_SYSRET64
SYM_CODE_END(entry_SYSCALL_64)
@@ -321,7 +316,6 @@ SYM_CODE_END(ret_from_fork)
swapgs
cld
FENCE_SWAPGS_USER_ENTRY
- SWITCH_TO_KERNEL_CR3 scratch_reg=%rdx
movq %rsp, %rdx
movq PER_CPU_VAR(cpu_current_top_of_stack), %rsp
UNWIND_HINT_IRET_REGS base=%rdx offset=8
@@ -592,19 +586,15 @@ SYM_INNER_LABEL(swapgs_restore_regs_and_return_to_usermode, SYM_L_GLOBAL)
ud2
1:
#endif
- POP_REGS pop_rdi=0
+ POP_REGS
+ addq $8, %rsp /* skip regs->orig_ax */

/*
- * We are on the trampoline stack. All regs except RDI are live.
+ * We are on the trampoline stack. All regs are live.
* We can do future final exit work right here.
*/
STACKLEAK_ERASE_NOCLOBBER

- SWITCH_TO_USER_CR3_STACK scratch_reg=%rdi
-
- /* Restore RDI. */
- popq %rdi
- addq $8, %rsp /* skip regs->orig_ax */
SWAPGS
INTERRUPT_RETURN

@@ -899,8 +889,6 @@ SYM_CODE_START_LOCAL(error_entry)
*/
SWAPGS
FENCE_SWAPGS_USER_ENTRY
- /* We have user CR3. Change to kernel CR3. */
- SWITCH_TO_KERNEL_CR3 scratch_reg=%rax

.Lerror_entry_from_usermode_after_swapgs:
/*
@@ -959,11 +947,10 @@ SYM_CODE_START_LOCAL(error_entry)
.Lerror_bad_iret:
/*
* We came from an IRET to user mode, so we have user
- * gsbase and CR3. Switch to kernel gsbase and CR3:
+ * gsbase and CR3. Switch to kernel gsbase.
*/
SWAPGS
FENCE_SWAPGS_USER_ENTRY
- SWITCH_TO_KERNEL_CR3 scratch_reg=%rax

/*
* Pretend that the exception came from user mode: set up pt_regs
diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
index 541fdaf64045..a6fb5807bf42 100644
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -51,10 +51,6 @@ SYM_CODE_START(entry_SYSENTER_compat)
/* Interrupts are off on entry. */
SWAPGS

- pushq %rax
- SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
- popq %rax
-
movq PER_CPU_VAR(cpu_current_top_of_stack), %rsp

/* Construct struct pt_regs on stack */
@@ -204,9 +200,6 @@ SYM_CODE_START(entry_SYSCALL_compat)
/* Stash user ESP */
movl %esp, %r8d

- /* Use %rsp as scratch reg. User ESP is stashed in r8 */
- SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
-
/* Switch to the kernel stack */
movq PER_CPU_VAR(cpu_current_top_of_stack), %rsp

@@ -291,18 +284,6 @@ sysret32_from_system_call:
* code. We zero R8-R10 to avoid info leaks.
*/
movq RSP-ORIG_RAX(%rsp), %rsp
-
- /*
- * The original userspace %rsp (RSP-ORIG_RAX(%rsp)) is stored
- * on the process stack which is not mapped to userspace and
- * not readable after we SWITCH_TO_USER_CR3. Delay the CR3
- * switch until after after the last reference to the process
- * stack.
- *
- * %r8/%r9 are zeroed before the sysret, thus safe to clobber.
- */
- SWITCH_TO_USER_CR3_NOSTACK scratch_reg=%r8 scratch_reg2=%r9
-
xorl %r8d, %r8d
xorl %r9d, %r9d
xorl %r10d, %r10d
@@ -357,9 +338,6 @@ SYM_CODE_START(entry_INT80_compat)
pushq %rax /* pt_regs->orig_ax */
pushq %rdi /* pt_regs->di */

- /* Need to switch before accessing the thread stack. */
- SWITCH_TO_KERNEL_CR3 scratch_reg=%rdi
-
/* In the Xen PV case we already run on the thread stack. */
ALTERNATIVE "", "jmp .Lint80_keep_stack", X86_FEATURE_XENPV

diff --git a/arch/x86/include/asm/entry-common.h b/arch/x86/include/asm/entry-common.h
index b75e9230c990..32e9f3159131 100644
--- a/arch/x86/include/asm/entry-common.h
+++ b/arch/x86/include/asm/entry-common.h
@@ -157,10 +157,24 @@ static __always_inline void switch_to_user_cr3(void)
native_write_cr3(cr3);
}

+static __always_inline void kernel_pgtable_enter(struct pt_regs *regs)
+{
+ if (user_mode(regs))
+ switch_to_kernel_cr3();
+}
+
+static __always_inline void kernel_pgtable_exit(struct pt_regs *regs)
+{
+ if (user_mode(regs))
+ switch_to_user_cr3();
+}
+
#else /* CONFIG_PAGE_TABLE_ISOLATION */

static inline void switch_to_kernel_cr3(void) {}
static inline void switch_to_user_cr3(void) {}
+static inline void kernel_pgtable_enter(struct pt_regs *regs) {};
+static inline void kernel_pgtable_exit(struct pt_regs *regs) {};

#endif /* CONFIG_PAGE_TABLE_ISOLATION */

diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 647af7ea3bf1..d8bfcd8a4db4 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -147,12 +147,15 @@ static __always_inline void __##func(struct pt_regs *regs); \
\
__visible noinstr void func(struct pt_regs *regs) \
{ \
- irqentry_state_t state = irqentry_enter(regs); \
+ irqentry_state_t state; \
\
+ kernel_pgtable_enter(regs); \
+ state = irqentry_enter(regs); \
instrumentation_begin(); \
run_idt(__##func, regs); \
instrumentation_end(); \
irqentry_exit(regs, state); \
+ kernel_pgtable_exit(regs); \
} \
\
static __always_inline void __##func(struct pt_regs *regs)
@@ -194,12 +197,15 @@ static __always_inline void __##func(struct pt_regs *regs, \
__visible noinstr void func(struct pt_regs *regs, \
unsigned long error_code) \
{ \
- irqentry_state_t state = irqentry_enter(regs); \
+ irqentry_state_t state; \
\
+ kernel_pgtable_enter(regs); \
+ state = irqentry_enter(regs); \
instrumentation_begin(); \
run_idt_errcode(__##func, regs, error_code); \
instrumentation_end(); \
irqentry_exit(regs, state); \
+ kernel_pgtable_exit(regs); \
} \
\
static __always_inline void __##func(struct pt_regs *regs, \
@@ -290,8 +296,10 @@ static __always_inline void __##func(struct pt_regs *regs, u8 vector); \
__visible noinstr void func(struct pt_regs *regs, \
unsigned long error_code) \
{ \
- irqentry_state_t state = irqentry_enter(regs); \
+ irqentry_state_t state; \
\
+ kernel_pgtable_enter(regs); \
+ state = irqentry_enter(regs); \
instrumentation_begin(); \
irq_enter_rcu(); \
kvm_set_cpu_l1tf_flush_l1d(); \
@@ -300,6 +308,7 @@ __visible noinstr void func(struct pt_regs *regs, \
irq_exit_rcu(); \
instrumentation_end(); \
irqentry_exit(regs, state); \
+ kernel_pgtable_exit(regs); \
} \
\
static __always_inline void __##func(struct pt_regs *regs, u8 vector)
@@ -333,8 +342,10 @@ static void __##func(struct pt_regs *regs); \
\
__visible noinstr void func(struct pt_regs *regs) \
{ \
- irqentry_state_t state = irqentry_enter(regs); \
+ irqentry_state_t state; \
\
+ kernel_pgtable_enter(regs); \
+ state = irqentry_enter(regs); \
instrumentation_begin(); \
irq_enter_rcu(); \
kvm_set_cpu_l1tf_flush_l1d(); \
@@ -342,6 +353,7 @@ __visible noinstr void func(struct pt_regs *regs) \
irq_exit_rcu(); \
instrumentation_end(); \
irqentry_exit(regs, state); \
+ kernel_pgtable_exit(regs); \
} \
\
static noinline void __##func(struct pt_regs *regs)
@@ -362,8 +374,10 @@ static __always_inline void __##func(struct pt_regs *regs); \
\
__visible noinstr void func(struct pt_regs *regs) \
{ \
- irqentry_state_t state = irqentry_enter(regs); \
+ irqentry_state_t state; \
\
+ kernel_pgtable_enter(regs); \
+ state = irqentry_enter(regs); \
instrumentation_begin(); \
__irq_enter_raw(); \
kvm_set_cpu_l1tf_flush_l1d(); \
@@ -371,6 +385,7 @@ __visible noinstr void func(struct pt_regs *regs) \
__irq_exit_raw(); \
instrumentation_end(); \
irqentry_exit(regs, state); \
+ kernel_pgtable_exit(regs); \
} \
\
static __always_inline void __##func(struct pt_regs *regs)
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 827088f981c6..e1ae901c4925 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -2037,9 +2037,11 @@ DEFINE_IDTENTRY_MCE_USER(exc_machine_check)
{
unsigned long dr7;

+ switch_to_kernel_cr3();
dr7 = local_db_save();
run_idt(exc_machine_check_user, regs);
local_db_restore(dr7);
+ switch_to_user_cr3();
}
#else
/* 32bit unified entry point */
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index 23c92ffd58fe..063474f5b5fe 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -542,8 +542,10 @@ DEFINE_IDTENTRY_NMI(exc_nmi)

__visible noinstr void exc_nmi_user(struct pt_regs *regs)
{
+ switch_to_kernel_cr3();
handle_nmi(regs);
mds_user_clear_cpu_buffers();
+ switch_to_user_cr3();
}

void stop_nmi(void)
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 1801791748b8..6c78eeb60d19 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -255,11 +255,13 @@ DEFINE_IDTENTRY_RAW(exc_invalid_op)
if (!user_mode(regs) && handle_bug(regs))
return;

+ kernel_pgtable_enter(regs);
state = irqentry_enter(regs);
instrumentation_begin();
run_idt(handle_invalid_op, regs);
instrumentation_end();
irqentry_exit(regs, state);
+ kernel_pgtable_exit(regs);
}

DEFINE_IDTENTRY(exc_coproc_segment_overrun)
@@ -663,11 +665,13 @@ DEFINE_IDTENTRY_RAW(exc_int3)
* including NMI.
*/
if (user_mode(regs)) {
+ switch_to_kernel_cr3();
irqentry_enter_from_user_mode(regs);
instrumentation_begin();
run_idt(do_int3_user, regs);
instrumentation_end();
irqentry_exit_to_user_mode(regs);
+ switch_to_user_cr3();
} else {
bool irq_state = idtentry_enter_nmi(regs);
instrumentation_begin();
@@ -1001,7 +1005,9 @@ DEFINE_IDTENTRY_DEBUG(exc_debug)
/* User entry, runs on regular task stack */
DEFINE_IDTENTRY_DEBUG_USER(exc_debug)
{
+ switch_to_kernel_cr3();
run_idt_errcode(exc_debug_user, regs, debug_read_clear_dr6());
+ switch_to_user_cr3();
}
#else
/* 32 bit does not have separate entry points. */
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index b9d03603d95d..613a864840ab 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1440,9 +1440,11 @@ handle_page_fault(struct pt_regs *regs, unsigned long error_code,

DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault)
{
- unsigned long address = read_cr2();
+ unsigned long address;
irqentry_state_t state;

+ kernel_pgtable_enter(regs);
+ address = read_cr2();
prefetchw(&current->mm->mmap_lock);

/*
@@ -1466,8 +1468,10 @@ DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault)
* The async #PF handling code takes care of idtentry handling
* itself.
*/
- if (kvm_handle_async_pf(regs, (u32)address))
+ if (kvm_handle_async_pf(regs, (u32)address)) {
+ kernel_pgtable_exit(regs);
return;
+ }

/*
* Entry handling for valid #PF from kernel mode is slightly
@@ -1486,4 +1490,5 @@ DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault)
instrumentation_end();

irqentry_exit(regs, state);
+ kernel_pgtable_exit(regs);
}
--
2.18.4

2020-11-09 14:47:21

by Alexandre Chartre

Subject: [RFC][PATCH 23/24] x86/entry: Remove paranoid_entry and paranoid_exit

The paranoid_entry and paranoid_exit assembly functions have been
replaced by the kernel_paranoid_entry() and kernel_paranoid_exit()
C functions. paranoid_entry/exit are no longer used and can be
removed.

Signed-off-by: Alexandre Chartre <[email protected]>
---
arch/x86/entry/entry_64.S | 131 --------------------------------------
1 file changed, 131 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 9ea8187d4405..797effbe65b6 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -882,137 +882,6 @@ SYM_CODE_START(xen_failsafe_callback)
SYM_CODE_END(xen_failsafe_callback)
#endif /* CONFIG_XEN_PV */

-/*
- * Save all registers in pt_regs. Return GSBASE related information
- * in EBX depending on the availability of the FSGSBASE instructions:
- *
- * FSGSBASE R/EBX
- * N 0 -> SWAPGS on exit
- * 1 -> no SWAPGS on exit
- *
- * Y GSBASE value at entry, must be restored in paranoid_exit
- */
-SYM_CODE_START_LOCAL(paranoid_entry)
- UNWIND_HINT_FUNC
- cld
- PUSH_AND_CLEAR_REGS save_ret=1
- ENCODE_FRAME_POINTER 8
-
- /*
- * Always stash CR3 in %r14. This value will be restored,
- * verbatim, at exit. Needed if paranoid_entry interrupted
- * another entry that already switched to the user CR3 value
- * but has not yet returned to userspace.
- *
- * This is also why CS (stashed in the "iret frame" by the
- * hardware at entry) can not be used: this may be a return
- * to kernel code, but with a user CR3 value.
- *
- * Switching CR3 does not depend on kernel GSBASE so it can
- * be done before switching to the kernel GSBASE. This is
- * required for FSGSBASE because the kernel GSBASE has to
- * be retrieved from a kernel internal table.
- */
- SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg=%rax save_reg=%r14
-
- /*
- * Handling GSBASE depends on the availability of FSGSBASE.
- *
- * Without FSGSBASE the kernel enforces that negative GSBASE
- * values indicate kernel GSBASE. With FSGSBASE no assumptions
- * can be made about the GSBASE value when entering from user
- * space.
- */
- ALTERNATIVE "jmp .Lparanoid_entry_checkgs", "", X86_FEATURE_FSGSBASE
-
- /*
- * Read the current GSBASE and store it in %rbx unconditionally,
- * retrieve and set the current CPUs kernel GSBASE. The stored value
- * has to be restored in paranoid_exit unconditionally.
- *
- * The unconditional write to GS base below ensures that no subsequent
- * loads based on a mispredicted GS base can happen, therefore no LFENCE
- * is needed here.
- */
- SAVE_AND_SET_GSBASE scratch_reg=%rax save_reg=%rbx
- ret
-
-.Lparanoid_entry_checkgs:
- /* EBX = 1 -> kernel GSBASE active, no restore required */
- movl $1, %ebx
- /*
- * The kernel-enforced convention is a negative GSBASE indicates
- * a kernel value. No SWAPGS needed on entry and exit.
- */
- movl $MSR_GS_BASE, %ecx
- rdmsr
- testl %edx, %edx
- jns .Lparanoid_entry_swapgs
- ret
-
-.Lparanoid_entry_swapgs:
- SWAPGS
-
- /*
- * The above SAVE_AND_SWITCH_TO_KERNEL_CR3 macro doesn't do an
- * unconditional CR3 write, even in the PTI case. So do an lfence
- * to prevent GS speculation, regardless of whether PTI is enabled.
- */
- FENCE_SWAPGS_KERNEL_ENTRY
-
- /* EBX = 0 -> SWAPGS required on exit */
- xorl %ebx, %ebx
- ret
-SYM_CODE_END(paranoid_entry)
-
-/*
- * "Paranoid" exit path from exception stack. This is invoked
- * only on return from non-NMI IST interrupts that came
- * from kernel space.
- *
- * We may be returning to very strange contexts (e.g. very early
- * in syscall entry), so checking for preemption here would
- * be complicated. Fortunately, there's no good reason to try
- * to handle preemption here.
- *
- * R/EBX contains the GSBASE related information depending on the
- * availability of the FSGSBASE instructions:
- *
- * FSGSBASE R/EBX
- * N 0 -> SWAPGS on exit
- * 1 -> no SWAPGS on exit
- *
- * Y User space GSBASE, must be restored unconditionally
- */
-SYM_CODE_START_LOCAL(paranoid_exit)
- UNWIND_HINT_REGS
- /*
- * The order of operations is important. RESTORE_CR3 requires
- * kernel GSBASE.
- *
- * NB to anyone to try to optimize this code: this code does
- * not execute at all for exceptions from user mode. Those
- * exceptions go through error_exit instead.
- */
- RESTORE_CR3 scratch_reg=%rax save_reg=%r14
-
- /* Handle the three GSBASE cases */
- ALTERNATIVE "jmp .Lparanoid_exit_checkgs", "", X86_FEATURE_FSGSBASE
-
- /* With FSGSBASE enabled, unconditionally restore GSBASE */
- wrgsbase %rbx
- jmp restore_regs_and_return_to_kernel
-
-.Lparanoid_exit_checkgs:
- /* On non-FSGSBASE systems, conditionally do SWAPGS */
- testl %ebx, %ebx
- jnz restore_regs_and_return_to_kernel
-
- /* We are returning to a context with user GSBASE */
- SWAPGS_UNSAFE_STACK
- jmp restore_regs_and_return_to_kernel
-SYM_CODE_END(paranoid_exit)
-
/*
* Save all registers in pt_regs, and switch GS if needed.
*/
--
2.18.4

2020-11-09 14:47:21

by Alexandre Chartre

Subject: [RFC][PATCH 05/24] x86/entry: Implement ret_from_fork body with C code

ret_from_fork is a mix of assembly code and calls to C functions.
Re-implement ret_from_fork so that it calls a single C function.

Signed-off-by: Alexandre Chartre <[email protected]>
---
arch/x86/entry/common.c | 18 ++++++++++++++++++
arch/x86/entry/entry_64.S | 28 +++++-----------------------
2 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index d222212908ad..7ee15a12c115 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -35,6 +35,24 @@
#include <asm/syscall.h>
#include <asm/irq_stack.h>

+__visible noinstr void return_from_fork(struct pt_regs *regs,
+ struct task_struct *prev,
+ void (*kfunc)(void *), void *kargs)
+{
+ schedule_tail(prev);
+ if (kfunc) {
+ /* kernel thread */
+ kfunc(kargs);
+ /*
+ * A kernel thread is allowed to return here after
+ * successfully calling kernel_execve(). Exit to
+ * userspace to complete the execve() syscall.
+ */
+ regs->ax = 0;
+ }
+ syscall_exit_to_user_mode(regs);
+}
+
static __always_inline void run_syscall(sys_call_ptr_t sysfunc,
struct pt_regs *regs)
{
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 274384644b5e..73e9cd47dc83 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -276,31 +276,13 @@ SYM_FUNC_END(__switch_to_asm)
*/
.pushsection .text, "ax"
SYM_CODE_START(ret_from_fork)
- UNWIND_HINT_EMPTY
- movq %rax, %rdi
- call schedule_tail /* rdi: 'prev' task parameter */
-
- testq %rbx, %rbx /* from kernel_thread? */
- jnz 1f /* kernel threads are uncommon */
-
-2:
UNWIND_HINT_REGS
- movq %rsp, %rdi
- call syscall_exit_to_user_mode /* returns with IRQs disabled */
+ movq %rsp, %rdi /* pt_regs */
+ movq %rax, %rsi /* 'prev' task parameter */
+ movq %rbx, %rdx /* kernel thread func */
+ movq %r12, %rcx /* kernel thread arg */
+ call return_from_fork /* returns with IRQs disabled */
jmp swapgs_restore_regs_and_return_to_usermode
-
-1:
- /* kernel thread */
- UNWIND_HINT_EMPTY
- movq %r12, %rdi
- CALL_NOSPEC rbx
- /*
- * A kernel thread is allowed to return here after successfully
- * calling kernel_execve(). Exit to userspace to complete the execve()
- * syscall.
- */
- movq $0, RAX(%rsp)
- jmp 2b
SYM_CODE_END(ret_from_fork)
.popsection

--
2.18.4

2020-11-09 14:47:39

by Alexandre Chartre

Subject: [RFC][PATCH 22/24] x86/entry: Defer paranoid entry/exit to C code

IST entries from the kernel use the paranoid_entry and paranoid_exit
assembly functions to ensure the CR3 and GS registers are updated with
correct values for the kernel. Move the update of the CR3 and GS
registers into the C code of the IST handlers.

Signed-off-by: Alexandre Chartre <[email protected]>
---
arch/x86/entry/entry_64.S | 72 ++++++++++------------------------
arch/x86/kernel/cpu/mce/core.c | 3 ++
arch/x86/kernel/nmi.c | 18 +++++++--
arch/x86/kernel/sev-es.c | 20 +++++++++-
arch/x86/kernel/traps.c | 30 ++++++++++++--
5 files changed, 83 insertions(+), 60 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 6b88a0eb8975..9ea8187d4405 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -462,16 +462,16 @@ SYM_CODE_START(\asmsym)
/* Entry from kernel */

pushq $-1 /* ORIG_RAX: no syscall to restart */
- /* paranoid_entry returns GS information for paranoid_exit in EBX. */
- call paranoid_entry
-
+ cld
+ PUSH_AND_CLEAR_REGS
+ ENCODE_FRAME_POINTER
UNWIND_HINT_REGS

movq %rsp, %rdi /* pt_regs pointer */

call \cfunc

- jmp paranoid_exit
+ jmp restore_regs_and_return_to_kernel

_ASM_NOKPROBE(\asmsym)
SYM_CODE_END(\asmsym)
@@ -507,12 +507,9 @@ SYM_CODE_START(\asmsym)
*/
ist_entry_user safe_stack_\cfunc, has_error_code=1

- /*
- * paranoid_entry returns SWAPGS flag for paranoid_exit in EBX.
- * EBX == 0 -> SWAPGS, EBX == 1 -> no SWAPGS
- */
- call paranoid_entry
-
+ cld
+ PUSH_AND_CLEAR_REGS
+ ENCODE_FRAME_POINTER
UNWIND_HINT_REGS

/*
@@ -538,7 +535,7 @@ SYM_CODE_START(\asmsym)
* identical to the stack in the IRET frame or the VC fall-back stack,
* so it is definitly mapped even with PTI enabled.
*/
- jmp paranoid_exit
+ jmp restore_regs_and_return_to_kernel

_ASM_NOKPROBE(\asmsym)
SYM_CODE_END(\asmsym)
@@ -555,8 +552,9 @@ SYM_CODE_START(\asmsym)
UNWIND_HINT_IRET_REGS offset=8
ASM_CLAC

- /* paranoid_entry returns GS information for paranoid_exit in EBX. */
- call paranoid_entry
+ cld
+ PUSH_AND_CLEAR_REGS
+ ENCODE_FRAME_POINTER
UNWIND_HINT_REGS

movq %rsp, %rdi /* pt_regs pointer into first argument */
@@ -564,7 +562,7 @@ SYM_CODE_START(\asmsym)
movq $-1, ORIG_RAX(%rsp) /* no syscall to restart */
call \cfunc

- jmp paranoid_exit
+ jmp restore_regs_and_return_to_kernel

_ASM_NOKPROBE(\asmsym)
SYM_CODE_END(\asmsym)
@@ -1119,10 +1117,6 @@ SYM_CODE_END(error_return)
/*
* Runs on exception stack. Xen PV does not go through this path at all,
* so we can use real assembly here.
- *
- * Registers:
- * %r14: Used to save/restore the CR3 of the interrupted context
- * when PAGE_TABLE_ISOLATION is in use. Do not clobber.
*/
SYM_CODE_START(asm_exc_nmi)
/*
@@ -1173,7 +1167,7 @@ SYM_CODE_START(asm_exc_nmi)
* We also must not push anything to the stack before switching
* stacks lest we corrupt the "NMI executing" variable.
*/
- ist_entry_user exc_nmi
+ ist_entry_user exc_nmi_user

/* NMI from kernel */

@@ -1346,9 +1340,7 @@ repeat_nmi:
*
* RSP is pointing to "outermost RIP". gsbase is unknown, but, if
* we're repeating an NMI, gsbase has the same value that it had on
- * the first iteration. paranoid_entry will load the kernel
- * gsbase if needed before we call exc_nmi(). "NMI executing"
- * is zero.
+ * the first iteration. "NMI executing" is zero.
*/
movq $1, 10*8(%rsp) /* Set "NMI executing". */

@@ -1372,44 +1364,20 @@ end_repeat_nmi:
pushq $-1 /* ORIG_RAX: no syscall to restart */

/*
- * Use paranoid_entry to handle SWAPGS, but no need to use paranoid_exit
- * as we should not be calling schedule in NMI context.
- * Even with normal interrupts enabled. An NMI should not be
- * setting NEED_RESCHED or anything that normal interrupts and
+ * We should not be calling schedule in NMI context. Even with
+ * normal interrupts enabled. An NMI should not be setting
+ * NEED_RESCHED or anything that normal interrupts and
* exceptions might do.
*/
- call paranoid_entry
+ cld
+ PUSH_AND_CLEAR_REGS
+ ENCODE_FRAME_POINTER
UNWIND_HINT_REGS

movq %rsp, %rdi
movq $-1, %rsi
call exc_nmi

- /* Always restore stashed CR3 value (see paranoid_entry) */
- RESTORE_CR3 scratch_reg=%r15 save_reg=%r14
-
- /*
- * The above invocation of paranoid_entry stored the GSBASE
- * related information in R/EBX depending on the availability
- * of FSGSBASE.
- *
- * If FSGSBASE is enabled, restore the saved GSBASE value
- * unconditionally, otherwise take the conditional SWAPGS path.
- */
- ALTERNATIVE "jmp nmi_no_fsgsbase", "", X86_FEATURE_FSGSBASE
-
- wrgsbase %rbx
- jmp nmi_restore
-
-nmi_no_fsgsbase:
- /* EBX == 0 -> invoke SWAPGS */
- testl %ebx, %ebx
- jnz nmi_restore
-
-nmi_swapgs:
- SWAPGS_UNSAFE_STACK
-
-nmi_restore:
POP_REGS

/*
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 9407c3cd9355..827088f981c6 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -2022,11 +2022,14 @@ static __always_inline void exc_machine_check_user(struct pt_regs *regs)
/* MCE hit kernel mode */
DEFINE_IDTENTRY_MCE(exc_machine_check)
{
+ struct kernel_entry_state entry_state;
unsigned long dr7;

+ kernel_paranoid_entry(&entry_state);
dr7 = local_db_save();
exc_machine_check_kernel(regs);
local_db_restore(dr7);
+ kernel_paranoid_exit(&entry_state);
}

/* The user mode variant. */
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index b6291b683be1..23c92ffd58fe 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -473,7 +473,7 @@ static DEFINE_PER_CPU(enum nmi_states, nmi_state);
static DEFINE_PER_CPU(unsigned long, nmi_cr2);
static DEFINE_PER_CPU(unsigned long, nmi_dr7);

-DEFINE_IDTENTRY_NMI(exc_nmi)
+static noinstr void handle_nmi(struct pt_regs *regs)
{
bool irq_state;

@@ -529,9 +529,21 @@ DEFINE_IDTENTRY_NMI(exc_nmi)
write_cr2(this_cpu_read(nmi_cr2));
if (this_cpu_dec_return(nmi_state))
goto nmi_restart;
+}
+
+DEFINE_IDTENTRY_NMI(exc_nmi)
+{
+ struct kernel_entry_state entry_state;
+
+ kernel_paranoid_entry(&entry_state);
+ handle_nmi(regs);
+ kernel_paranoid_exit(&entry_state);
+}

- if (user_mode(regs))
- mds_user_clear_cpu_buffers();
+__visible noinstr void exc_nmi_user(struct pt_regs *regs)
+{
+ handle_nmi(regs);
+ mds_user_clear_cpu_buffers();
}

void stop_nmi(void)
diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c
index bd977c917cd6..ef9a8b69c25c 100644
--- a/arch/x86/kernel/sev-es.c
+++ b/arch/x86/kernel/sev-es.c
@@ -1352,13 +1352,25 @@ DEFINE_IDTENTRY_VC_IST(exc_vmm_communication)
struct exc_vc_frame {
/* pt_regs should be first */
struct pt_regs regs;
+ /* extra parameters for the handler */
+ struct kernel_entry_state entry_state;
};

DEFINE_IDTENTRY_VC_SETUP_STACK(exc_vmm_communication)
{
+ struct kernel_entry_state entry_state;
struct exc_vc_frame *frame;
unsigned long sp;

+ /*
+ * kernel_paranoid_entry() is called first to properly set
+ * the GS register which is used to access per-cpu variables.
+ *
+ * vc_switch_off_ist() uses per-cpu variables so it has to be
+ * called after kernel_paranoid_entry().
+ */
+ kernel_paranoid_entry(&entry_state);
+
/*
* Switch off the IST stack to make it free for nested exceptions.
* The vc_switch_off_ist() function will switch back to the
@@ -1370,7 +1382,8 @@ DEFINE_IDTENTRY_VC_SETUP_STACK(exc_vmm_communication)
/*
* Found a safe stack. Set it up as if the entry has happened on
* that stack. This means that we need to have pt_regs at the top
- * of the stack.
+ * of the stack, and we can use the bottom of the stack to pass
+ * extra parameters (like the kernel entry state) to the handler.
*
* The effective stack switch happens in assembly code before
* the #VC handler is called.
@@ -1379,16 +1392,21 @@ DEFINE_IDTENTRY_VC_SETUP_STACK(exc_vmm_communication)

frame = (struct exc_vc_frame *)sp;
frame->regs = *regs;
+ frame->entry_state = entry_state;

return sp;
}

DEFINE_IDTENTRY_VC(exc_vmm_communication)
{
+ struct exc_vc_frame *frame = (struct exc_vc_frame *)regs;
+
if (likely(!on_vc_fallback_stack(regs)))
safe_stack_exc_vmm_communication(regs, error_code);
else
ist_exc_vmm_communication(regs, error_code);
+
+ kernel_paranoid_exit(&frame->entry_state);
}

bool __init handle_vc_boot_ghcb(struct pt_regs *regs)
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 9a51aa016fb3..1801791748b8 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -344,10 +344,10 @@ __visible void __noreturn handle_stack_overflow(const char *message,
DEFINE_IDTENTRY_DF(exc_double_fault)
{
static const char str[] = "double fault";
- struct task_struct *tsk = current;
-
+ struct task_struct *tsk;
+ struct kernel_entry_state entry_state;
#ifdef CONFIG_VMAP_STACK
- unsigned long address = read_cr2();
+ unsigned long address;
#endif

#ifdef CONFIG_X86_ESPFIX64
@@ -371,8 +371,12 @@ DEFINE_IDTENTRY_DF(exc_double_fault)
regs->cs == __KERNEL_CS &&
regs->ip == (unsigned long)native_irq_return_iret)
{
- struct pt_regs *gpregs = (struct pt_regs *)this_cpu_read(cpu_tss_rw.x86_tss.sp0) - 1;
unsigned long *p = (unsigned long *)regs->sp;
+ struct pt_regs *gpregs;
+
+ kernel_paranoid_entry(&entry_state);
+
+ gpregs = (struct pt_regs *)this_cpu_read(cpu_tss_rw.x86_tss.sp0) - 1;

/*
* regs->sp points to the failing IRET frame on the
@@ -401,14 +405,28 @@ DEFINE_IDTENTRY_DF(exc_double_fault)
regs->ip = (unsigned long)asm_exc_general_protection;
regs->sp = (unsigned long)&gpregs->orig_ax;

+ kernel_paranoid_exit(&entry_state);
+
return;
}
#endif

+ /*
+ * Switch to the kernel page-table. We are on an IST stack, and
+ * we are going to die so there is no need to switch to the kernel
+ * stack even if we are coming from userspace.
+ */
+ kernel_paranoid_entry(&entry_state);
+
+#ifdef CONFIG_VMAP_STACK
+ address = read_cr2();
+#endif
+
idtentry_enter_nmi(regs);
instrumentation_begin();
notify_die(DIE_TRAP, str, regs, error_code, X86_TRAP_DF, SIGSEGV);

+ tsk = current;
tsk->thread.error_code = error_code;
tsk->thread.trap_nr = X86_TRAP_DF;

@@ -973,7 +991,11 @@ static __always_inline void exc_debug_user(struct pt_regs *regs,
/* IST stack entry */
DEFINE_IDTENTRY_DEBUG(exc_debug)
{
+ struct kernel_entry_state entry_state;
+
+ kernel_paranoid_entry(&entry_state);
exc_debug_kernel(regs, debug_read_clear_dr6());
+ kernel_paranoid_exit(&entry_state);
}

/* User entry, runs on regular task stack */
--
2.18.4

2020-11-09 14:47:50

by Alexandre Chartre

Subject: [RFC][PATCH 08/24] x86/entry: Add C version of SWAPGS and SWAPGS_UNSAFE_STACK

SWAPGS and SWAPGS_UNSAFE_STACK are assembly macros. Add C versions
of these macros (swapgs() and swapgs_unsafe_stack()).

Signed-off-by: Alexandre Chartre <[email protected]>
---
arch/x86/include/asm/paravirt.h | 15 +++++++++++++++
arch/x86/include/asm/paravirt_types.h | 17 ++++++++++++-----
2 files changed, 27 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index d25cc6830e89..a4898130b36b 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -145,6 +145,21 @@ static inline void __write_cr4(unsigned long x)
PVOP_VCALL1(cpu.write_cr4, x);
}

+static inline void swapgs(void)
+{
+ PVOP_VCALL0(cpu.swapgs);
+}
+
+/*
+ * If swapgs is used while the userspace stack is still current,
+ * there's no way to call a pvop. The PV replacement *must* be
+ * inlined, or the swapgs instruction must be trapped and emulated.
+ */
+static inline void swapgs_unsafe_stack(void)
+{
+ PVOP_VCALL0_ALT(cpu.swapgs, "swapgs");
+}
+
static inline void arch_safe_halt(void)
{
PVOP_VCALL0(irq.safe_halt);
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 0fad9f61c76a..eea9acc942a3 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -532,12 +532,12 @@ int paravirt_disable_iospace(void);
pre, post, ##__VA_ARGS__)


-#define ____PVOP_VCALL(op, clbr, call_clbr, extra_clbr, pre, post, ...) \
+#define ____PVOP_VCALL(op, insn, clbr, call_clbr, extra_clbr, pre, post, ...) \
({ \
PVOP_VCALL_ARGS; \
PVOP_TEST_NULL(op); \
asm volatile(pre \
- paravirt_alt(PARAVIRT_CALL) \
+ paravirt_alt(insn) \
post \
: call_clbr, ASM_CALL_CONSTRAINT \
: paravirt_type(op), \
@@ -547,12 +547,17 @@ int paravirt_disable_iospace(void);
})

#define __PVOP_VCALL(op, pre, post, ...) \
- ____PVOP_VCALL(op, CLBR_ANY, PVOP_VCALL_CLOBBERS, \
- VEXTRA_CLOBBERS, \
+ ____PVOP_VCALL(op, PARAVIRT_CALL, CLBR_ANY, \
+ PVOP_VCALL_CLOBBERS, VEXTRA_CLOBBERS, \
pre, post, ##__VA_ARGS__)

+#define __PVOP_VCALL_ALT(op, insn) \
+ ____PVOP_VCALL(op, insn, CLBR_ANY, \
+ PVOP_VCALL_CLOBBERS, VEXTRA_CLOBBERS, \
+ "", "")
+
#define __PVOP_VCALLEESAVE(op, pre, post, ...) \
- ____PVOP_VCALL(op.func, CLBR_RET_REG, \
+ ____PVOP_VCALL(op.func, PARAVIRT_CALL, CLBR_RET_REG, \
PVOP_VCALLEE_CLOBBERS, , \
pre, post, ##__VA_ARGS__)

@@ -562,6 +567,8 @@ int paravirt_disable_iospace(void);
__PVOP_CALL(rettype, op, "", "")
#define PVOP_VCALL0(op) \
__PVOP_VCALL(op, "", "")
+#define PVOP_VCALL0_ALT(op, insn) \
+ __PVOP_VCALL_ALT(op, insn)

#define PVOP_CALLEE0(rettype, op) \
__PVOP_CALLEESAVE(rettype, op, "", "")
--
2.18.4

2020-11-09 14:47:52

by Alexandre Chartre

Subject: [RFC][PATCH 09/24] x86/entry: Add C version of paranoid_entry/exit

paranoid_entry/exit are assembly functions. Provide C versions of
these functions (kernel_paranoid_entry() and kernel_paranoid_exit()).
The C functions are functionally equivalent to the assembly code,
except that kernel_paranoid_entry() doesn't save registers in
pt_regs the way paranoid_entry does.

Signed-off-by: Alexandre Chartre <[email protected]>
---
arch/x86/entry/common.c | 157 ++++++++++++++++++++++++++++
arch/x86/include/asm/entry-common.h | 10 ++
2 files changed, 167 insertions(+)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index d09b1ded5287..54d0931801e1 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -387,3 +387,160 @@ static __always_inline unsigned long save_and_switch_to_kernel_cr3(void)
static __always_inline void restore_cr3(unsigned long cr3) {}

#endif /* CONFIG_PAGE_TABLE_ISOLATION */
+
+/*
+ * "Paranoid" entry path from exception stack. Ensure that the CR3 and
+ * GS registers are correctly set for the kernel. Return GSBASE related
+ * information in kernel_entry_state depending on the availability of
+ * the FSGSBASE instructions:
+ *
+ * FSGSBASE kernel_entry_state
+ * N swapgs=true -> SWAPGS on exit
+ * swapgs=false -> no SWAPGS on exit
+ *
+ * Y gsbase=GSBASE value at entry, must be restored in
+ * kernel_paranoid_exit()
+ *
+ * Note that per-cpu variables are accessed using the GS register,
+ * so paranoid entry code cannot access per-cpu variables before
+ * kernel_paranoid_entry() has been called.
+ */
+noinstr void kernel_paranoid_entry(struct kernel_entry_state *state)
+{
+ unsigned long gsbase;
+ unsigned int cpu;
+
+ /*
+ * Save CR3 in the kernel entry state. This value will be
+ * restored, verbatim, at exit. Needed if the paranoid entry
+ * interrupted another entry that already switched to the user
+ * CR3 value but has not yet returned to userspace.
+ *
+ * This is also why CS (stashed in the "iret frame" by the
+ * hardware at entry) can not be used: this may be a return
+ * to kernel code, but with a user CR3 value.
+ *
+ * Switching CR3 does not depend on kernel GSBASE so it can
+ * be done before switching to the kernel GSBASE. This is
+ * required for FSGSBASE because the kernel GSBASE has to
+ * be retrieved from a kernel internal table.
+ */
+ state->cr3 = save_and_switch_to_kernel_cr3();
+
+ /*
+ * Handling GSBASE depends on the availability of FSGSBASE.
+ *
+ * Without FSGSBASE the kernel enforces that negative GSBASE
+ * values indicate kernel GSBASE. With FSGSBASE no assumptions
+ * can be made about the GSBASE value when entering from user
+ * space.
+ */
+ if (static_cpu_has(X86_FEATURE_FSGSBASE)) {
+ /*
+ * Read the current GSBASE and store it in the kernel
+ * entry state unconditionally, retrieve and set the
+ * current CPUs kernel GSBASE. The stored value has to
+ * be restored at exit unconditionally.
+ *
+ * The unconditional write to GS base below ensures that
+ * no subsequent loads based on a mispredicted GS base
+ * can happen, therefore no LFENCE is needed here.
+ */
+ state->gsbase = rdgsbase();
+
+ /*
+ * Fetch the per-CPU GSBASE value for this processor. We
+ * normally use %gs for accessing per-CPU data, but we
+ * are setting up %gs here and obviously can not use %gs
+ * itself to access per-CPU data.
+ */
+ if (IS_ENABLED(CONFIG_SMP)) {
+ /*
+ * Load CPU from the GDT. Do not use RDPID,
+ * because KVM loads guest's TSC_AUX on vm-entry
+ * and may not restore the host's value until
+ * the CPU returns to userspace. Thus the kernel
+ * would consume a guest's TSC_AUX if an NMI
+ * arrives while running KVM's run loop.
+ */
+ asm_inline volatile ("lsl %[seg],%[p]"
+ : [p] "=r" (cpu)
+ : [seg] "r" (__CPUNODE_SEG));
+
+ cpu &= VDSO_CPUNODE_MASK;
+ gsbase = __per_cpu_offset[cpu];
+ } else {
+ gsbase = *pcpu_unit_offsets;
+ }
+
+ wrgsbase(gsbase);
+
+ } else {
+ /*
+ * The kernel-enforced convention is a negative GSBASE
+ * indicates a kernel value. No SWAPGS needed on entry
+ * and exit.
+ */
+ rdmsrl(MSR_GS_BASE, gsbase);
+ if (((long)gsbase) >= 0) {
+ swapgs();
+ /*
+ * Do an lfence to prevent GS speculation.
+ */
+ alternative("", "lfence",
+ X86_FEATURE_FENCE_SWAPGS_KERNEL);
+ state->swapgs = true;
+ } else {
+ state->swapgs = false;
+ }
+ }
+}
+
+/*
+ * "Paranoid" exit path from exception stack. Restore the CR3 and
+ * GS registers are as they were on entry. This is invoked only
+ * on return from IST interrupts that came from kernel space.
+ *
+ * We may be returning to very strange contexts (e.g. very early
+ * in syscall entry), so checking for preemption here would
+ * be complicated. Fortunately, there's no good reason to try
+ * to handle preemption here.
+ *
+ * The kernel_entry_state contains the GSBASE related information
+ * depending on the availability of the FSGSBASE instructions:
+ *
+ * FSGSBASE kernel_entry_state
+ * N swapgs=true -> SWAPGS on exit
+ * swapgs=false -> no SWAPGS on exit
+ *
+ * Y gsbase=GSBASE value at entry, must be restored
+ * unconditionally
+ *
+ * Note that per-cpu variables are accessed using the GS register,
+ * so paranoid entry code cannot access per-cpu variables after
+ * kernel_paranoid_exit() has been called.
+ */
+noinstr void kernel_paranoid_exit(struct kernel_entry_state *state)
+{
+ /*
+ * The order of operations is important. RESTORE_CR3 requires
+ * kernel GSBASE.
+ *
+ * NB to anyone to try to optimize this code: this code does
+ * not execute at all for exceptions from user mode. Those
+ * exceptions go through error_exit instead.
+ */
+ restore_cr3(state->cr3);
+
+ /* With FSGSBASE enabled, unconditionally restore GSBASE */
+ if (static_cpu_has(X86_FEATURE_FSGSBASE)) {
+ wrgsbase(state->gsbase);
+ return;
+ }
+
+ /* On non-FSGSBASE systems, conditionally do SWAPGS */
+ if (state->swapgs) {
+ /* We are returning to a context with user GSBASE */
+ swapgs_unsafe_stack();
+ }
+}
diff --git a/arch/x86/include/asm/entry-common.h b/arch/x86/include/asm/entry-common.h
index b05b212f5ebc..b75e9230c990 100644
--- a/arch/x86/include/asm/entry-common.h
+++ b/arch/x86/include/asm/entry-common.h
@@ -163,6 +163,16 @@ static inline void switch_to_kernel_cr3(void) {}
static inline void switch_to_user_cr3(void) {}

#endif /* CONFIG_PAGE_TABLE_ISOLATION */
+
+struct kernel_entry_state {
+ unsigned long cr3;
+ unsigned long gsbase;
+ bool swapgs;
+};
+
+void kernel_paranoid_entry(struct kernel_entry_state *state);
+void kernel_paranoid_exit(struct kernel_entry_state *state);
+
#endif /* MODULE */

#endif
--
2.18.4

2020-11-09 14:48:01

by Alexandre Chartre

Subject: [RFC][PATCH 10/24] x86/pti: Introduce per-task PTI trampoline stack

Double the size of the kernel stack when using PTI. The entire stack
is mapped into the kernel address space, and the top half of the stack
(the PTI stack) is also mapped into the user address space.

The PTI stack will be used as a per-task trampoline stack instead of
the current per-cpu trampoline stack. This will allow running more
code on the trampoline stack, in particular code that schedules the
task out.

Signed-off-by: Alexandre Chartre <[email protected]>
---
arch/x86/include/asm/page_64_types.h | 36 +++++++++++++++++++++++++++-
arch/x86/include/asm/processor.h | 3 +++
2 files changed, 38 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/page_64_types.h b/arch/x86/include/asm/page_64_types.h
index 3f49dac03617..733accc20fdb 100644
--- a/arch/x86/include/asm/page_64_types.h
+++ b/arch/x86/include/asm/page_64_types.h
@@ -12,7 +12,41 @@
#define KASAN_STACK_ORDER 0
#endif

-#define THREAD_SIZE_ORDER (2 + KASAN_STACK_ORDER)
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+/*
+ * PTI doubles the size of the stack. The entire stack is mapped into
+ * the kernel address space. However, only the top half of the stack is
+ * mapped into the user address space.
+ *
+ * On syscall or interrupt, user mode enters the kernel with the user
+ * page-table, and the stack pointer is switched to the top of the
+ * stack (which is mapped in the user address space and in the kernel).
+ * The syscall/interrupt handler will then later decide when to switch
+ * to the kernel address space, and to switch to the top of the kernel
+ * stack which is only mapped in the kernel.
+ *
+ * +-------------+
+ * | | ^ ^
+ * | kernel-only | | KERNEL_STACK_SIZE |
+ * | stack | | |
+ * | | V |
+ * +-------------+ <- top of kernel stack | THREAD_SIZE
+ * | | ^ |
+ * | kernel and | | KERNEL_STACK_SIZE |
+ * | PTI stack | | |
+ * | | V v
+ * +-------------+ <- top of stack
+ */
+#define PTI_STACK_ORDER 1
+#else
+#define PTI_STACK_ORDER 0
+#endif
+
+#define KERNEL_STACK_ORDER 2
+#define KERNEL_STACK_SIZE (PAGE_SIZE << KERNEL_STACK_ORDER)
+
+#define THREAD_SIZE_ORDER \
+ (KERNEL_STACK_ORDER + PTI_STACK_ORDER + KASAN_STACK_ORDER)
#define THREAD_SIZE (PAGE_SIZE << THREAD_SIZE_ORDER)

#define EXCEPTION_STACK_ORDER (0 + KASAN_STACK_ORDER)
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 82a08b585818..47b1b806535b 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -769,6 +769,9 @@ static inline void spin_lock_prefetch(const void *x)

#define task_top_of_stack(task) ((unsigned long)(task_pt_regs(task) + 1))

+#define task_top_of_kernel_stack(task) \
+ ((void *)(((unsigned long)task_stack_page(task)) + KERNEL_STACK_SIZE))
+
#define task_pt_regs(task) \
({ \
unsigned long __ptr = (unsigned long)task_stack_page(task); \
--
2.18.4

2020-11-09 14:48:26

by Alexandre Chartre

Subject: [RFC][PATCH 20/24] x86/pti: Execute NMI handler on the kernel stack

After an NMI from userland, the kernel is entered and it switches
the stack to the PTI stack, which is mapped both in the kernel and in
the user page-table. When executing the NMI handler, switch to the
kernel stack (which is mapped only in the kernel page-table) so that
no kernel data leaks to userland through the stack.

Signed-off-by: Alexandre Chartre <[email protected]>
---
arch/x86/kernel/nmi.c | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index 4bc77aaf1303..be0f654c3095 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -506,8 +506,18 @@ DEFINE_IDTENTRY_RAW(exc_nmi)

inc_irq_stat(__nmi_count);

- if (!ignore_nmis)
- default_do_nmi(regs);
+ if (!ignore_nmis) {
+ if (user_mode(regs)) {
+ /*
+ * If we come from userland then we are on the
+ * trampoline stack, switch to the kernel stack
+ * to execute the NMI handler.
+ */
+ run_idt(default_do_nmi, regs);
+ } else {
+ default_do_nmi(regs);
+ }
+ }

idtentry_exit_nmi(regs, irq_state);

--
2.18.4

2020-11-09 14:48:43

by Alexandre Chartre

Subject: [RFC][PATCH 12/24] x86/pti: Function to map per-cpu page-table entry

Wrap the code used by PTI to map a per-cpu page-table entry into
a new function so that this code can be re-used to map other
per-cpu entries.

Signed-off-by: Alexandre Chartre <[email protected]>
---
arch/x86/mm/pti.c | 25 ++++++++++++++++---------
1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index ebc8cd2f1cd8..71ca245d7b38 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -428,6 +428,21 @@ static void __init pti_clone_p4d(unsigned long addr)
*user_p4d = *kernel_p4d;
}

+/*
+ * Clone a single percpu page.
+ */
+static void __init pti_clone_percpu_page(void *addr)
+{
+ phys_addr_t pa = per_cpu_ptr_to_phys(addr);
+ pte_t *target_pte;
+
+ target_pte = pti_user_pagetable_walk_pte((unsigned long)addr);
+ if (WARN_ON(!target_pte))
+ return;
+
+ *target_pte = pfn_pte(pa >> PAGE_SHIFT, PAGE_KERNEL);
+}
+
/*
* Clone the CPU_ENTRY_AREA and associated data into the user space visible
* page table.
@@ -448,16 +463,8 @@ static void __init pti_clone_user_shared(void)
* This is done for all possible CPUs during boot to ensure
* that it's propagated to all mms.
*/
+ pti_clone_percpu_page(&per_cpu(cpu_tss_rw, cpu));

- unsigned long va = (unsigned long)&per_cpu(cpu_tss_rw, cpu);
- phys_addr_t pa = per_cpu_ptr_to_phys((void *)va);
- pte_t *target_pte;
-
- target_pte = pti_user_pagetable_walk_pte(va);
- if (WARN_ON(!target_pte))
- return;
-
- *target_pte = pfn_pte(pa >> PAGE_SHIFT, PAGE_KERNEL);
}
}

--
2.18.4

2020-11-09 14:48:51

by Alexandre Chartre

Subject: [RFC][PATCH 17/24] x86/pti: Execute IDT handlers with error code on the kernel stack

After an interrupt/exception in userland, the kernel is entered
and it switches the stack to the PTI stack, which is mapped both in
the kernel and in the user page-table. When executing the interrupt
handler, switch to the kernel stack (which is mapped only in the
kernel page-table) so that no kernel data leaks to userland
through the stack.

This changes the IDT handlers which have an error code.

Signed-off-by: Alexandre Chartre <[email protected]>
---
arch/x86/include/asm/idtentry.h | 18 ++++++++++++++++--
arch/x86/kernel/traps.c | 2 +-
2 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 3595a31947b3..a82e31b45442 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -25,6 +25,12 @@ void idtentry_exit_nmi(struct pt_regs *regs, bool irq_state);
(void (*)(void))(func), (void *)(arg1)) : \
func(arg1))

+#define CALL_ON_STACK_2(stack, func, arg1, arg2) \
+ ((stack) ? \
+ asm_call_on_stack_2(stack, \
+ (void (*)(void))(func), (void *)(arg1), (void *)(arg2)) : \
+ func(arg1, arg2))
+
/*
* Functions to return the top of the kernel stack if we are using the
* user page-table (and thus not running with the kernel stack). If we
@@ -53,6 +59,13 @@ void run_idt(void (*func)(struct pt_regs *), struct pt_regs *regs)
CALL_ON_STACK_1(pti_kernel_stack(regs), func, regs);
}

+static __always_inline
+void run_idt_errcode(void (*func)(struct pt_regs *, unsigned long),
+ struct pt_regs *regs, unsigned long error_code)
+{
+ CALL_ON_STACK_2(pti_kernel_stack(regs), func, regs, error_code);
+}
+
/**
* DECLARE_IDTENTRY - Declare functions for simple IDT entry points
* No error code pushed by hardware
@@ -141,7 +154,7 @@ __visible noinstr void func(struct pt_regs *regs, \
irqentry_state_t state = irqentry_enter(regs); \
\
instrumentation_begin(); \
- __##func (regs, error_code); \
+ run_idt_errcode(__##func, regs, error_code); \
instrumentation_end(); \
irqentry_exit(regs, state); \
} \
@@ -239,7 +252,8 @@ __visible noinstr void func(struct pt_regs *regs, \
instrumentation_begin(); \
irq_enter_rcu(); \
kvm_set_cpu_l1tf_flush_l1d(); \
- __##func (regs, (u8)error_code); \
+ run_idt_errcode((void (*)(struct pt_regs *, unsigned long))__##func, \
+ regs, (u8)error_code); \
irq_exit_rcu(); \
instrumentation_end(); \
irqentry_exit(regs, state); \
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 5161385b3670..9a51aa016fb3 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -979,7 +979,7 @@ DEFINE_IDTENTRY_DEBUG(exc_debug)
/* User entry, runs on regular task stack */
DEFINE_IDTENTRY_DEBUG_USER(exc_debug)
{
- exc_debug_user(regs, debug_read_clear_dr6());
+ run_idt_errcode(exc_debug_user, regs, debug_read_clear_dr6());
}
#else
/* 32 bit does not have separate entry points. */
--
2.18.4

2020-11-09 14:49:34

by Alexandre Chartre

[permalink] [raw]
Subject: [RFC][PATCH 14/24] x86/pti: Use PTI stack instead of trampoline stack

When entering the kernel from userland, use the per-task PTI stack
instead of the per-cpu trampoline stack. Like the trampoline stack,
the PTI stack is mapped both in the kernel and in the user page-table.
Using a per-task stack which is mapped into both the kernel and the user
page-table, instead of a per-cpu stack, allows more code to be executed
before switching to the kernel stack and to the kernel page-table.

Additional changes will be made later to switch to the kernel stack
(which is only mapped in the kernel page-table).
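
For illustration, the practical effect of the switch_to.h hunk below is
that on every context switch the incoming task's stack top is loaded into
the hardware entry stack pointer, so the next user-to-kernel transition
starts on that task's PTI stack (simplified sketch, 64-bit only):

static inline void update_task_stack(struct task_struct *task)
{
	/*
	 * sp0 is the stack used when entering the kernel from userland;
	 * with PTI it now points at the incoming task's PTI stack.
	 */
	if (static_cpu_has(X86_FEATURE_XENPV) || pti_enabled())
		load_sp0(task_top_of_stack(task));
}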

Signed-off-by: Alexandre Chartre <[email protected]>
---
arch/x86/entry/entry_64.S | 42 +++++++++-----------------------
arch/x86/include/asm/pti.h | 8 ++++++
arch/x86/include/asm/switch_to.h | 7 +++++-
3 files changed, 26 insertions(+), 31 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 458af12ed9a1..29beab46bedd 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -194,19 +194,9 @@ syscall_return_via_sysret:
/* rcx and r11 are already restored (see code above) */
POP_REGS pop_rdi=0 skip_r11rcx=1

- /*
- * Now all regs are restored except RSP and RDI.
- * Save old stack pointer and switch to trampoline stack.
- */
- movq %rsp, %rdi
- movq PER_CPU_VAR(cpu_tss_rw + TSS_sp0), %rsp
- UNWIND_HINT_EMPTY
-
- pushq RSP-RDI(%rdi) /* RSP */
- pushq (%rdi) /* RDI */
-
/*
- * We are on the trampoline stack. All regs except RDI are live.
+ * We are on the trampoline stack. All regs except RSP are live.
* We can do future final exit work right here.
*/
STACKLEAK_ERASE_NOCLOBBER
@@ -214,7 +204,7 @@ syscall_return_via_sysret:
SWITCH_TO_USER_CR3_STACK scratch_reg=%rdi

popq %rdi
- popq %rsp
+ movq RSP-ORIG_RAX(%rsp), %rsp
USERGS_SYSRET64
SYM_CODE_END(entry_SYSCALL_64)

@@ -606,24 +596,6 @@ SYM_INNER_LABEL(swapgs_restore_regs_and_return_to_usermode, SYM_L_GLOBAL)
#endif
POP_REGS pop_rdi=0

- /*
- * The stack is now user RDI, orig_ax, RIP, CS, EFLAGS, RSP, SS.
- * Save old stack pointer and switch to trampoline stack.
- */
- movq %rsp, %rdi
- movq PER_CPU_VAR(cpu_tss_rw + TSS_sp0), %rsp
- UNWIND_HINT_EMPTY
-
- /* Copy the IRET frame to the trampoline stack. */
- pushq 6*8(%rdi) /* SS */
- pushq 5*8(%rdi) /* RSP */
- pushq 4*8(%rdi) /* EFLAGS */
- pushq 3*8(%rdi) /* CS */
- pushq 2*8(%rdi) /* RIP */
-
- /* Push user RDI on the trampoline stack. */
- pushq (%rdi)
-
/*
* We are on the trampoline stack. All regs except RDI are live.
* We can do future final exit work right here.
@@ -634,6 +606,7 @@ SYM_INNER_LABEL(swapgs_restore_regs_and_return_to_usermode, SYM_L_GLOBAL)

/* Restore RDI. */
popq %rdi
+ addq $8, %rsp /* skip regs->orig_ax */
SWAPGS
INTERRUPT_RETURN

@@ -1062,6 +1035,15 @@ SYM_CODE_START_LOCAL(error_entry)
SWITCH_TO_KERNEL_CR3 scratch_reg=%rax

.Lerror_entry_from_usermode_after_swapgs:
+ /*
+ * We are on the trampoline stack. With PTI, the trampoline
+ * stack is a per-thread stack so we are all set and we can
+ * return.
+ *
+ * Without PTI, the trampoline stack is a per-cpu stack and
+ * we need to switch to the normal thread stack.
+ */
+ ALTERNATIVE "", "ret", X86_FEATURE_PTI
/* Put us onto the real thread stack. */
popq %r12 /* save return addr in %12 */
movq %rsp, %rdi /* arg0 = pt_regs pointer */
diff --git a/arch/x86/include/asm/pti.h b/arch/x86/include/asm/pti.h
index 5484e69ff8d3..ed211fcc3a50 100644
--- a/arch/x86/include/asm/pti.h
+++ b/arch/x86/include/asm/pti.h
@@ -17,8 +17,16 @@ extern void pti_check_boottime_disable(void);
extern void pti_finalize(void);
extern void pti_clone_pgtable(struct mm_struct *mm, unsigned long start,
unsigned long end, enum pti_clone_level level);
+static inline bool pti_enabled(void)
+{
+ return static_cpu_has(X86_FEATURE_PTI);
+}
#else
static inline void pti_check_boottime_disable(void) { }
+static inline bool pti_enabled(void)
+{
+ return false;
+}
#endif

#endif /* __ASSEMBLY__ */
diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h
index 9f69cc497f4b..457458228462 100644
--- a/arch/x86/include/asm/switch_to.h
+++ b/arch/x86/include/asm/switch_to.h
@@ -3,6 +3,7 @@
#define _ASM_X86_SWITCH_TO_H

#include <linux/sched/task_stack.h>
+#include <asm/pti.h>

struct task_struct; /* one of the stranger aspects of C forward declarations */

@@ -76,8 +77,12 @@ static inline void update_task_stack(struct task_struct *task)
* doesn't work on x86-32 because sp1 and
* cpu_current_top_of_stack have different values (because of
* the non-zero stack-padding on 32bit).
+ *
+ * If PTI is enabled, sp0 points to the PTI stack (mapped in
+ * the kernel and user page-table) which is used when entering
+ * the kernel.
*/
- if (static_cpu_has(X86_FEATURE_XENPV))
+ if (static_cpu_has(X86_FEATURE_XENPV) || pti_enabled())
load_sp0(task_top_of_stack(task));
#endif
}
--
2.18.4

2020-11-09 14:49:46

by Alexandre Chartre

[permalink] [raw]
Subject: [RFC][PATCH 11/24] x86/pti: Function to clone page-table entries from a specified mm

PTI has a function to clone page-table entries but only from the
init_mm page-table. Provide a new function to clone page-table
entries from a specified mm page-table.
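
For illustration (not part of the diff): existing callers keep the old
behaviour through an init_mm wrapper, while new code can pass another mm.
The second call below is a hypothetical example, not from this series:

	/* Existing behaviour: clone from the kernel (init_mm) page-table. */
	pti_clone_init_pgtable(start, end, PTI_CLONE_PMD);

	/* Hypothetical caller cloning from another mm's page-table. */
	pti_clone_pgtable(task->mm, start, end, PTI_CLONE_PTE);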

Signed-off-by: Alexandre Chartre <[email protected]>
---
arch/x86/include/asm/pti.h | 10 ++++++++++
arch/x86/mm/pti.c | 32 ++++++++++++++++----------------
2 files changed, 26 insertions(+), 16 deletions(-)

diff --git a/arch/x86/include/asm/pti.h b/arch/x86/include/asm/pti.h
index 07375b476c4f..5484e69ff8d3 100644
--- a/arch/x86/include/asm/pti.h
+++ b/arch/x86/include/asm/pti.h
@@ -4,9 +4,19 @@
#ifndef __ASSEMBLY__

#ifdef CONFIG_PAGE_TABLE_ISOLATION
+
+enum pti_clone_level {
+ PTI_CLONE_PMD,
+ PTI_CLONE_PTE,
+};
+
+struct mm_struct;
+
extern void pti_init(void);
extern void pti_check_boottime_disable(void);
extern void pti_finalize(void);
+extern void pti_clone_pgtable(struct mm_struct *mm, unsigned long start,
+ unsigned long end, enum pti_clone_level level);
#else
static inline void pti_check_boottime_disable(void) { }
#endif
diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index 1aab92930569..ebc8cd2f1cd8 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -294,14 +294,8 @@ static void __init pti_setup_vsyscall(void)
static void __init pti_setup_vsyscall(void) { }
#endif

-enum pti_clone_level {
- PTI_CLONE_PMD,
- PTI_CLONE_PTE,
-};
-
-static void
-pti_clone_pgtable(unsigned long start, unsigned long end,
- enum pti_clone_level level)
+void pti_clone_pgtable(struct mm_struct *mm, unsigned long start,
+ unsigned long end, enum pti_clone_level level)
{
unsigned long addr;

@@ -320,7 +314,7 @@ pti_clone_pgtable(unsigned long start, unsigned long end,
if (addr < start)
break;

- pgd = pgd_offset_k(addr);
+ pgd = pgd_offset(mm, addr);
if (WARN_ON(pgd_none(*pgd)))
return;
p4d = p4d_offset(pgd, addr);
@@ -409,6 +403,12 @@ pti_clone_pgtable(unsigned long start, unsigned long end,
}
}

+static void pti_clone_init_pgtable(unsigned long start, unsigned long end,
+ enum pti_clone_level level)
+{
+ pti_clone_pgtable(&init_mm, start, end, level);
+}
+
#ifdef CONFIG_X86_64
/*
* Clone a single p4d (i.e. a top-level entry on 4-level systems and a
@@ -476,7 +476,7 @@ static void __init pti_clone_user_shared(void)
start = CPU_ENTRY_AREA_BASE;
end = start + (PAGE_SIZE * CPU_ENTRY_AREA_PAGES);

- pti_clone_pgtable(start, end, PTI_CLONE_PMD);
+ pti_clone_init_pgtable(start, end, PTI_CLONE_PMD);
}
#endif /* CONFIG_X86_64 */

@@ -495,9 +495,9 @@ static void __init pti_setup_espfix64(void)
*/
static void pti_clone_entry_text(void)
{
- pti_clone_pgtable((unsigned long) __entry_text_start,
- (unsigned long) __entry_text_end,
- PTI_CLONE_PMD);
+ pti_clone_init_pgtable((unsigned long) __entry_text_start,
+ (unsigned long) __entry_text_end,
+ PTI_CLONE_PMD);
}

/*
@@ -572,11 +572,11 @@ static void pti_clone_kernel_text(void)
* pti_set_kernel_image_nonglobal() did to clear the
* global bit.
*/
- pti_clone_pgtable(start, end_clone, PTI_LEVEL_KERNEL_IMAGE);
+ pti_clone_init_pgtable(start, end_clone, PTI_LEVEL_KERNEL_IMAGE);

/*
- * pti_clone_pgtable() will set the global bit in any PMDs
- * that it clones, but we also need to get any PTEs in
+ * pti_clone_init_pgtable() will set the global bit in any
+ * PMDs that it clones, but we also need to get any PTEs in
* the last level for areas that are not huge-page-aligned.
*/

--
2.18.4

2020-11-09 14:50:04

by Alexandre Chartre

[permalink] [raw]
Subject: [RFC][PATCH 18/24] x86/pti: Execute system vector handlers on the kernel stack

After an interrupt/exception in userland, the kernel is entered
and it switches the stack to the PTI stack which is mapped both in
the kernel and in the user page-table. When executing the interrupt
function, switch to the kernel stack (which is mapped only in the
kernel page-table) so that no kernel data leaks to userland
through the stack.

Changes system vector handlers to execute on the kernel stack.
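
For illustration, a system vector entry then conceptually expands to
roughly the following (sketch only; the APIC timer interrupt is just an
example handler):

__visible noinstr void sysvec_apic_timer_interrupt(struct pt_regs *regs)
{
	irqentry_state_t state = irqentry_enter(regs);

	instrumentation_begin();
	irq_enter_rcu();
	kvm_set_cpu_l1tf_flush_l1d();
	/*
	 * Run on the kernel stack if we entered on the PTI stack,
	 * otherwise on the irq stack as before.
	 */
	run_sysvec(__sysvec_apic_timer_interrupt, regs);
	irq_exit_rcu();
	instrumentation_end();
	irqentry_exit(regs, state);
}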

Signed-off-by: Alexandre Chartre <[email protected]>
---
arch/x86/include/asm/idtentry.h | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index a82e31b45442..0c5d9f027112 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -66,6 +66,17 @@ void run_idt_errcode(void (*func)(struct pt_regs *, unsigned long),
CALL_ON_STACK_2(pti_kernel_stack(regs), func, regs, error_code);
}

+static __always_inline
+void run_sysvec(void (*func)(struct pt_regs *regs), struct pt_regs *regs)
+{
+ void *stack = pti_kernel_stack(regs);
+
+ if (stack)
+ asm_call_on_stack_1(stack, (void (*)(void))func, regs);
+ else
+ run_sysvec_on_irqstack_cond(func, regs);
+}
+
/**
* DECLARE_IDTENTRY - Declare functions for simple IDT entry points
* No error code pushed by hardware
@@ -295,7 +306,7 @@ __visible noinstr void func(struct pt_regs *regs) \
instrumentation_begin(); \
irq_enter_rcu(); \
kvm_set_cpu_l1tf_flush_l1d(); \
- run_sysvec_on_irqstack_cond(__##func, regs); \
+ run_sysvec(__##func, regs); \
irq_exit_rcu(); \
instrumentation_end(); \
irqentry_exit(regs, state); \
--
2.18.4

2020-11-09 14:51:50

by Alexandre Chartre

[permalink] [raw]
Subject: [RFC][PATCH 15/24] x86/pti: Execute syscall functions on the kernel stack

During a syscall, the kernel is entered and it switches the stack
to the PTI stack which is mapped both in the kernel and in the
user page-table. When executing the syscall function, switch to
the kernel stack (which is mapped only in the kernel page-table)
so that no kernel data leaks to userland through the stack.
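
For illustration, assuming the 64-bit syscall dispatch goes through the
run_syscall() wrapper modified below, the path in do_syscall_64() then
looks roughly like this (sketch only):

	nr = syscall_enter_from_user_mode(regs, nr);

	instrumentation_begin();
	if (likely(nr < NR_syscalls)) {
		nr = array_index_nospec(nr, NR_syscalls);
		/* Runs the syscall function on the kernel stack with PTI. */
		run_syscall(sys_call_table[nr], regs);
	}
	instrumentation_end();
	syscall_exit_to_user_mode(regs);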

Signed-off-by: Alexandre Chartre <[email protected]>
---
arch/x86/entry/common.c | 11 ++++++++++-
arch/x86/entry/entry_64.S | 1 +
arch/x86/include/asm/irq_stack.h | 3 +++
3 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 54d0931801e1..ead6a4c72e6a 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -56,10 +56,19 @@ __visible noinstr void return_from_fork(struct pt_regs *regs,
static __always_inline void run_syscall(sys_call_ptr_t sysfunc,
struct pt_regs *regs)
{
+ unsigned long stack;
+
if (!sysfunc)
return;

- regs->ax = sysfunc(regs);
+ if (!pti_enabled()) {
+ regs->ax = sysfunc(regs);
+ return;
+ }
+
+ stack = (unsigned long)task_top_of_kernel_stack(current);
+ regs->ax = asm_call_syscall_on_stack((void *)(stack - 8),
+ sysfunc, regs);
}

#ifdef CONFIG_X86_64
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 29beab46bedd..6b88a0eb8975 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -771,6 +771,7 @@ SYM_FUNC_START(asm_call_on_stack_2)
SYM_FUNC_START(asm_call_on_stack_3)
SYM_INNER_LABEL(asm_call_sysvec_on_stack, SYM_L_GLOBAL)
SYM_INNER_LABEL(asm_call_irq_on_stack, SYM_L_GLOBAL)
+SYM_INNER_LABEL(asm_call_syscall_on_stack, SYM_L_GLOBAL)
/*
* Save the frame pointer unconditionally. This allows the ORC
* unwinder to handle the stack switch.
diff --git a/arch/x86/include/asm/irq_stack.h b/arch/x86/include/asm/irq_stack.h
index 359427216336..108d9da7c01c 100644
--- a/arch/x86/include/asm/irq_stack.h
+++ b/arch/x86/include/asm/irq_stack.h
@@ -5,6 +5,7 @@
#include <linux/ptrace.h>

#include <asm/processor.h>
+#include <asm/syscall.h>

#ifdef CONFIG_X86_64
static __always_inline bool irqstack_active(void)
@@ -25,6 +26,8 @@ void asm_call_sysvec_on_stack(void *sp, void (*func)(struct pt_regs *regs),
struct pt_regs *regs);
void asm_call_irq_on_stack(void *sp, void (*func)(struct irq_desc *desc),
struct irq_desc *desc);
+long asm_call_syscall_on_stack(void *sp, sys_call_ptr_t func,
+ struct pt_regs *regs);

static __always_inline void __run_on_irqstack(void (*func)(void))
{
--
2.18.4

2020-11-09 19:37:43

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/24] x86/pti: Defer CR3 switch to C code

On 11/9/20 6:44 AM, Alexandre Chartre wrote:
> - map more syscall, interrupt and exception entry code into the user
> page-table (map all noinstr code);

This seems like the thing we'd want to tag explicitly rather than make
it implicit with 'noinstr' code. Worst-case, shouldn't this be:

#define __entry_func noinstr

or something?

I'd also like to see a lot more discussion about what the rules are for
the C code and the compiler. We can't, for instance, do a normal
printk() in these entry functions. Should we stick them in a special
section and have objtool look for suspect patterns or references?

I'm most worried about things like this:

if (something_weird)
pr_warn("this will oops the kernel\n");

2020-11-09 19:55:47

by Alexandre Chartre

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/24] x86/pti: Defer CR3 switch to C code


On 11/9/20 8:35 PM, Dave Hansen wrote:
> On 11/9/20 6:44 AM, Alexandre Chartre wrote:
>> - map more syscall, interrupt and exception entry code into the user
>> page-table (map all noinstr code);
>
> This seems like the thing we'd want to tag explicitly rather than make
> it implicit with 'noinstr' code. Worst-case, shouldn't this be:
>
> #define __entry_func noinstr
>
> or something?

Yes. I took the easy solution of just using noinstr because noinstr is mostly
used for entry functions. But if we want to use the user page-table beyond
the entry functions then we will definitely need a dedicated tag.
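
A hypothetical sketch of such a tag, modeled on how noinstr is defined
(the section name and attribute list here are made up, not from the
series):

/* Code that must be mapped in the user page-table. */
#define __entry_func						\
	noinline notrace					\
	__attribute__((__section__(".entry_user.text")))	\
	__no_kcsan __no_sanitize_address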

> I'd also like to see a lot more discussion about what the rules are for
> the C code and the compiler. We can't, for instance, do a normal
> printk() in this entry functions. Should we stick them in a special
> section and have objtool look for suspect patterns or references?
>
> I'm most worried about things like this:
>
> if (something_weird)
> pr_warn("this will oops the kernel\n");

That would be similar to noinstr, which uses the .noinstr.text section, and if
I remember correctly objtool detects when a noinstr function calls a non-noinstr
function. Similarly here, an entry function should not call a non-entry function.

alex.

2020-11-09 19:59:28

by Alexandre Chartre

[permalink] [raw]
Subject: Re: [RFC][PATCH 08/24] x86/entry: Add C version of SWAPGS and SWAPGS_UNSAFE_STACK


[Copying the reply to Andy in the thread with the right email addresses]

On 11/9/20 6:38 PM, Andy Lutomirski wrote:
> On Mon, Nov 9, 2020 at 3:22 AM Alexandre Chartre
> <[email protected]> wrote:
>>
>> SWAPGS and SWAPGS_UNSAFE_STACK are assembly macros. Add C versions
>> of these macros (swapgs() and swapgs_unsafe_stack()).
>
> This needs a very good justification. It also needs some kind of
> static verification that these helpers are only used by noinstr code,
> and they need to be __always_inline. And I cannot fathom how C code
> could possibly use SWAPGS_UNSAFE_STACK in a meaningful way.
>

You're right, I probably need to revisit the usage of SWAPGS_UNSAFE_STACK
in C code; that doesn't make sense. It looks like only SWAPGS is needed then.

Or maybe we can just use native_swapgs() instead?

I have added a C version of SWAPGS for moving paranoid_entry() to C because,
in this function, we need to switch CR3 before updating GS. But I really
wonder if we need a paravirt swapgs here; we can probably just use
native_swapgs().
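
If a paravirt swapgs is really not needed on that path, the C helper could
be as simple as the following rough sketch (assuming native_swapgs() is
safe to call from noinstr code):

static __always_inline void swapgs(void)
{
	/* Plain swapgs, no paravirt indirection on this path. */
	native_swapgs();
}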

Also, if we map the per-cpu offsets (__per_cpu_offset) in the user page-table
then we will be able to update GS before switching CR3. That way we can keep
the GS update in assembly code and just do the CR3 switch in C code. This would
also avoid having to disable the stack protector (patch 21).

alex.