2020-11-16 14:49:38

by Alexandre Chartre

Subject: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code

Version 2 addressing comments from Andy:

- paranoid_entry/exit is back to assembly code. This avoids having
a C version of SWAPGS and the need to disable the stack-protector
(this removes patches 8, 9 and 21 from v1).

- SAVE_AND_SWITCH_TO_KERNEL_CR3 and RESTORE_CR3 are removed from
paranoid_entry/exit and moved to C (patch 19).

- __per_cpu_offset is mapped into the user page-table (patch 11)
so that paranoid_entry can update GS before CR3 is switched.

- use a different stack canary with the user and kernel page-tables.
This is a new patch in v2 so that the kernel stack canary is not
leaked through the user page-table (patch 21).

Patches are now based on v5.10-rc4.

----

With Page Table Isolation (PTI), syscalls as well as interrupts and
exceptions occurring in userspace enter the kernel with a user
page-table. The kernel entry code will then switch the page-table
from the user page-table to the kernel page-table by updating the
CR3 control register. This CR3 switch is currently done early in
the kernel entry sequence using assembly code.

This RFC proposes to defer the PTI CR3 switch until we reach C code.
The benefit is that this simplifies the assembly entry code and makes
the PTI CR3 switch code easier to understand. This also paves the way
for further possible projects such as an easier integration of Address
Space Isolation (ASI), or the possibility to execute selected syscall
or interrupt handlers without switching to the kernel page-table
(and thus avoiding the PTI page-table switch overhead).

Deferring the CR3 switch to C code means that we need to run more of
the kernel entry code with the user page-table. To do so, we need to:

- map more syscall, interrupt and exception entry code into the user
page-table (map all noinstr code);

- map additional data used in the entry code (such as the stack canary);

- run more entry code on the trampoline stack (which is mapped both
in the kernel and in the user page-table) until we switch to the
kernel page-table and then switch to the kernel stack;

- have a per-task trampoline stack instead of a per-cpu trampoline
stack, so that the task can be scheduled out before it has switched
to the kernel stack.

Note that, for now, the CR3 switch can only be deferred as long as
interrupts remain disabled in the entry code. This is because the CR3
switch decision is based on the privilege level in the CS register
saved in the interrupt frame. I plan to fix this but it adds some
extra complication (we need to track whether the user page-table is
in use or not).
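
To make this concrete, here is a minimal sketch of how a converted
entry point decides on the switch once it is done in C. It uses the
user_pagetable_escape()/user_pagetable_return() helpers introduced in
patch 20; exc_example() and handle_example() are made-up names for
the illustration only:

__visible noinstr void exc_example(struct pt_regs *regs)
{
	/*
	 * Interrupts are still disabled here. The decision to switch
	 * CR3 is based on the privilege level saved in the CS register
	 * of the interrupt frame, i.e. user_mode(regs).
	 */
	user_pagetable_escape(regs);	/* to kernel CR3 if we came from userland */

	handle_example(regs);		/* hypothetical handler body */

	user_pagetable_return(regs);	/* back to user CR3 if we came from userland */
}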

The proposed patchset is in RFC state to get early feedback about this
proposal.

The code survives running a kernel build and LTP. Note that the
changes are 64-bit only at the moment; I haven't looked at 32-bit yet
but I will definitely check it.

Patches are based on v5.10-rc4.

Thanks,

alex.

-----

Alexandre Chartre (21):
x86/syscall: Add wrapper for invoking syscall function
x86/entry: Update asm_call_on_stack to support more function arguments
x86/entry: Consolidate IST entry from userspace
x86/sev-es: Define a setup stack function for the VC idtentry
x86/entry: Implement ret_from_fork body with C code
x86/pti: Provide C variants of PTI switch CR3 macros
x86/entry: Fill ESPFIX stack using C code
x86/pti: Introduce per-task PTI trampoline stack
x86/pti: Function to clone page-table entries from a specified mm
x86/pti: Function to map per-cpu page-table entry
x86/pti: Extend PTI user mappings
x86/pti: Use PTI stack instead of trampoline stack
x86/pti: Execute syscall functions on the kernel stack
x86/pti: Execute IDT handlers on the kernel stack
x86/pti: Execute IDT handlers with error code on the kernel stack
x86/pti: Execute system vector handlers on the kernel stack
x86/pti: Execute page fault handler on the kernel stack
x86/pti: Execute NMI handler on the kernel stack
x86/pti: Defer CR3 switch to C code for IST entries
x86/pti: Defer CR3 switch to C code for non-IST and syscall entries
x86/pti: Use a different stack canary with the user and kernel
page-table

arch/x86/entry/common.c | 58 ++++-
arch/x86/entry/entry_64.S | 346 +++++++++++---------------
arch/x86/entry/entry_64_compat.S | 22 --
arch/x86/include/asm/entry-common.h | 194 +++++++++++++++
arch/x86/include/asm/idtentry.h | 130 +++++++++-
arch/x86/include/asm/irq_stack.h | 11 +
arch/x86/include/asm/page_64_types.h | 36 ++-
arch/x86/include/asm/processor.h | 3 +
arch/x86/include/asm/pti.h | 18 ++
arch/x86/include/asm/stackprotector.h | 35 ++-
arch/x86/include/asm/switch_to.h | 7 +-
arch/x86/include/asm/traps.h | 2 +-
arch/x86/kernel/cpu/mce/core.c | 7 +-
arch/x86/kernel/espfix_64.c | 41 +++
arch/x86/kernel/nmi.c | 34 ++-
arch/x86/kernel/sev-es.c | 63 +++++
arch/x86/kernel/traps.c | 61 +++--
arch/x86/mm/fault.c | 11 +-
arch/x86/mm/pti.c | 76 ++++--
include/linux/sched.h | 8 +
kernel/fork.c | 25 ++
21 files changed, 874 insertions(+), 314 deletions(-)

--
2.18.4


2020-11-16 14:49:40

by Alexandre Chartre

Subject: [RFC][PATCH v2 07/21] x86/entry: Fill ESPFIX stack using C code

The ESPFIX stack is filled using assembly code. Move this code to a C
function so that it is easier to read and modify.

Signed-off-by: Alexandre Chartre <[email protected]>
---
arch/x86/entry/entry_64.S | 62 ++++++++++++++++++-------------------
arch/x86/kernel/espfix_64.c | 41 ++++++++++++++++++++++++
2 files changed, 72 insertions(+), 31 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 73e9cd47dc83..6e0b5b010e0b 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -684,8 +684,10 @@ native_irq_return_ldt:
* long (see ESPFIX_STACK_SIZE). espfix_waddr points to the bottom
* of the ESPFIX stack.
*
- * We clobber RAX and RDI in this code. We stash RDI on the
- * normal stack and RAX on the ESPFIX stack.
+ * We call into C code to fill the ESPFIX stack. We stash registers
+ * that the C function can clobber on the normal stack. The user RAX
+ * is stashed first so that it is adjacent to the iret frame which
+ * will be copied to the ESPFIX stack.
*
* The ESPFIX stack layout we set up looks like this:
*
@@ -699,39 +701,37 @@ native_irq_return_ldt:
* --- bottom of ESPFIX stack ---
*/

- pushq %rdi /* Stash user RDI */
- SWAPGS /* to kernel GS */
- SWITCH_TO_KERNEL_CR3 scratch_reg=%rdi /* to kernel CR3 */
-
- movq PER_CPU_VAR(espfix_waddr), %rdi
- movq %rax, (0*8)(%rdi) /* user RAX */
- movq (1*8)(%rsp), %rax /* user RIP */
- movq %rax, (1*8)(%rdi)
- movq (2*8)(%rsp), %rax /* user CS */
- movq %rax, (2*8)(%rdi)
- movq (3*8)(%rsp), %rax /* user RFLAGS */
- movq %rax, (3*8)(%rdi)
- movq (5*8)(%rsp), %rax /* user SS */
- movq %rax, (5*8)(%rdi)
- movq (4*8)(%rsp), %rax /* user RSP */
- movq %rax, (4*8)(%rdi)
- /* Now RAX == RSP. */
-
- andl $0xffff0000, %eax /* RAX = (RSP & 0xffff0000) */
+ /* save registers */
+ pushq %rax
+ pushq %rdi
+ pushq %rsi
+ pushq %rdx
+ pushq %rcx
+ pushq %r8
+ pushq %r9
+ pushq %r10
+ pushq %r11

/*
- * espfix_stack[31:16] == 0. The page tables are set up such that
- * (espfix_stack | (X & 0xffff0000)) points to a read-only alias of
- * espfix_waddr for any X. That is, there are 65536 RO aliases of
- * the same page. Set up RSP so that RSP[31:16] contains the
- * respective 16 bits of the /userspace/ RSP and RSP nonetheless
- * still points to an RO alias of the ESPFIX stack.
+ * fill_espfix_stack will copy the iret+rax frame to the ESPFIX
+ * stack and return with RAX containing a pointer to the ESPFIX
+ * stack.
*/
- orq PER_CPU_VAR(espfix_stack), %rax
+ leaq 8*8(%rsp), %rdi /* points to the iret+rax frame */
+ call fill_espfix_stack

- SWITCH_TO_USER_CR3_STACK scratch_reg=%rdi
- SWAPGS /* to user GS */
- popq %rdi /* Restore user RDI */
+ /*
+ * RAX contains a pointer to the ESPFIX stack, so restore all registers
+ * except RAX. RAX will be restored from the ESPFIX stack.
+ */
+ popq %r11
+ popq %r10
+ popq %r9
+ popq %r8
+ popq %rcx
+ popq %rdx
+ popq %rsi
+ popq %rdi

movq %rax, %rsp
UNWIND_HINT_IRET_REGS offset=8
diff --git a/arch/x86/kernel/espfix_64.c b/arch/x86/kernel/espfix_64.c
index 4fe7af58cfe1..ff4b5160b39c 100644
--- a/arch/x86/kernel/espfix_64.c
+++ b/arch/x86/kernel/espfix_64.c
@@ -33,6 +33,7 @@
#include <asm/pgalloc.h>
#include <asm/setup.h>
#include <asm/espfix.h>
+#include <asm/entry-common.h>

/*
* Note: we only need 6*8 = 48 bytes for the espfix stack, but round
@@ -205,3 +206,43 @@ void init_espfix_ap(int cpu)
per_cpu(espfix_waddr, cpu) = (unsigned long)stack_page
+ (addr & ~PAGE_MASK);
}
+
+/*
+ * iret frame with an additional user_rax register.
+ */
+struct iret_rax_frame {
+ unsigned long user_rax;
+ unsigned long rip;
+ unsigned long cs;
+ unsigned long rflags;
+ unsigned long rsp;
+ unsigned long ss;
+};
+
+noinstr unsigned long fill_espfix_stack(struct iret_rax_frame *frame)
+{
+ struct iret_rax_frame *espfix_frame;
+ unsigned long rsp;
+
+ native_swapgs();
+ user_pagetable_exit();
+
+ espfix_frame = (struct iret_rax_frame *)this_cpu_read(espfix_waddr);
+ *espfix_frame = *frame;
+
+ /*
+ * espfix_stack[31:16] == 0. The page tables are set up such that
+ * (espfix_stack | (X & 0xffff0000)) points to a read-only alias of
+ * espfix_waddr for any X. That is, there are 65536 RO aliases of
+ * the same page. Set up RSP so that RSP[31:16] contains the
+ * respective 16 bits of the /userspace/ RSP and RSP nonetheless
+ * still points to an RO alias of the ESPFIX stack.
+ */
+ rsp = ((unsigned long)espfix_frame) & 0xffff0000;
+ rsp |= this_cpu_read(espfix_stack);
+
+ user_pagetable_enter();
+ native_swapgs();
+
+ return rsp;
+}
--
2.18.4

2020-11-16 14:50:13

by Alexandre Chartre

Subject: [RFC][PATCH v2 03/21] x86/entry: Consolidate IST entry from userspace

Most IST entries (NMI, MCE, DEBUG, VC but not DF) handle an entry from
userspace the same way: they switch from the IST stack to the kernel
stack, call the handler and then return to userspace. However, NMI,
MCE/DEBUG and VC implement this same behavior using different code paths,
so consolidate this code into a single assembly macro.

Signed-off-by: Alexandre Chartre <[email protected]>
---
arch/x86/entry/entry_64.S | 137 +++++++++++++++++++++-----------------
1 file changed, 75 insertions(+), 62 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index c42948aca0a8..51df9f1871c6 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -316,6 +316,72 @@ SYM_CODE_END(ret_from_fork)
#endif
.endm

+/*
+ * Macro to handle an IDT entry defined with the IST mechanism. It should
+ * be invoked at the beginning of the IDT handler with a pointer to the C
+ * function (cfunc_user) to invoke if the IDT was entered from userspace.
+ *
+ * If the IDT was entered from userspace, the macro will switch from the
+ * IST stack to the regular task stack, call the provided function and
+ * return to userland.
+ *
+ * If IDT was entered from the kernel, the macro will just return.
+ */
+.macro ist_entry_user cfunc_user has_error_code=0
+ UNWIND_HINT_IRET_REGS
+ ASM_CLAC
+
+ /* only process entry from userspace */
+ .if \has_error_code == 1
+ testb $3, CS-ORIG_RAX(%rsp)
+ jz .List_entry_from_kernel_\@
+ .else
+ testb $3, CS-RIP(%rsp)
+ jz .List_entry_from_kernel_\@
+ pushq $-1 /* ORIG_RAX: no syscall to restart */
+ .endif
+
+ /* Use %rdx as a temp variable */
+ pushq %rdx
+
+ /*
+ * Switch from the IST stack to the regular task stack and
+ * use the provided entry point.
+ */
+ swapgs
+ cld
+ FENCE_SWAPGS_USER_ENTRY
+ SWITCH_TO_KERNEL_CR3 scratch_reg=%rdx
+ movq %rsp, %rdx
+ movq PER_CPU_VAR(cpu_current_top_of_stack), %rsp
+ UNWIND_HINT_IRET_REGS base=%rdx offset=8
+ pushq 6*8(%rdx) /* pt_regs->ss */
+ pushq 5*8(%rdx) /* pt_regs->rsp */
+ pushq 4*8(%rdx) /* pt_regs->flags */
+ pushq 3*8(%rdx) /* pt_regs->cs */
+ pushq 2*8(%rdx) /* pt_regs->rip */
+ UNWIND_HINT_IRET_REGS
+ pushq 1*8(%rdx) /* pt_regs->orig_ax */
+ PUSH_AND_CLEAR_REGS rdx=(%rdx)
+ ENCODE_FRAME_POINTER
+
+ /*
+ * At this point we no longer need to worry about stack damage
+ * due to nesting -- we're on the normal thread stack and we're
+ * done with the IST stack.
+ */
+
+ mov %rsp, %rdi
+ .if \has_error_code == 1
+ movq ORIG_RAX(%rsp), %rsi /* get error code into 2nd argument*/
+ movq $-1, ORIG_RAX(%rsp) /* no syscall to restart */
+ .endif
+ call \cfunc_user
+ jmp swapgs_restore_regs_and_return_to_usermode
+
+.List_entry_from_kernel_\@:
+.endm
+
/**
* idtentry_body - Macro to emit code calling the C function
* @cfunc: C function to be called
@@ -417,18 +483,15 @@ SYM_CODE_END(\asmsym)
*/
.macro idtentry_mce_db vector asmsym cfunc
SYM_CODE_START(\asmsym)
- UNWIND_HINT_IRET_REGS
- ASM_CLAC
-
- pushq $-1 /* ORIG_RAX: no syscall to restart */
-
/*
* If the entry is from userspace, switch stacks and treat it as
* a normal entry.
*/
- testb $3, CS-ORIG_RAX(%rsp)
- jnz .Lfrom_usermode_switch_stack_\@
+ ist_entry_user noist_\cfunc

+ /* Entry from kernel */
+
+ pushq $-1 /* ORIG_RAX: no syscall to restart */
/* paranoid_entry returns GS information for paranoid_exit in EBX. */
call paranoid_entry

@@ -440,10 +503,6 @@ SYM_CODE_START(\asmsym)

jmp paranoid_exit

- /* Switch to the regular task stack and use the noist entry point */
-.Lfrom_usermode_switch_stack_\@:
- idtentry_body noist_\cfunc, has_error_code=0
-
_ASM_NOKPROBE(\asmsym)
SYM_CODE_END(\asmsym)
.endm
@@ -472,15 +531,11 @@ SYM_CODE_END(\asmsym)
*/
.macro idtentry_vc vector asmsym cfunc
SYM_CODE_START(\asmsym)
- UNWIND_HINT_IRET_REGS
- ASM_CLAC
-
/*
* If the entry is from userspace, switch stacks and treat it as
* a normal entry.
*/
- testb $3, CS-ORIG_RAX(%rsp)
- jnz .Lfrom_usermode_switch_stack_\@
+ ist_entry_user safe_stack_\cfunc, has_error_code=1

/*
* paranoid_entry returns SWAPGS flag for paranoid_exit in EBX.
@@ -517,10 +572,6 @@ SYM_CODE_START(\asmsym)
*/
jmp paranoid_exit

- /* Switch to the regular task stack */
-.Lfrom_usermode_switch_stack_\@:
- idtentry_body safe_stack_\cfunc, has_error_code=1
-
_ASM_NOKPROBE(\asmsym)
SYM_CODE_END(\asmsym)
.endm
@@ -1113,8 +1164,6 @@ SYM_CODE_END(error_return)
* when PAGE_TABLE_ISOLATION is in use. Do not clobber.
*/
SYM_CODE_START(asm_exc_nmi)
- UNWIND_HINT_IRET_REGS
-
/*
* We allow breakpoints in NMIs. If a breakpoint occurs, then
* the iretq it performs will take us out of NMI context.
@@ -1153,14 +1202,6 @@ SYM_CODE_START(asm_exc_nmi)
* other IST entries.
*/

- ASM_CLAC
-
- /* Use %rdx as our temp variable throughout */
- pushq %rdx
-
- testb $3, CS-RIP+8(%rsp)
- jz .Lnmi_from_kernel
-
/*
* NMI from user mode. We need to run on the thread stack, but we
* can't go through the normal entry paths: NMIs are masked, and
@@ -1171,41 +1212,13 @@ SYM_CODE_START(asm_exc_nmi)
* We also must not push anything to the stack before switching
* stacks lest we corrupt the "NMI executing" variable.
*/
+ ist_entry_user exc_nmi

- swapgs
- cld
- FENCE_SWAPGS_USER_ENTRY
- SWITCH_TO_KERNEL_CR3 scratch_reg=%rdx
- movq %rsp, %rdx
- movq PER_CPU_VAR(cpu_current_top_of_stack), %rsp
- UNWIND_HINT_IRET_REGS base=%rdx offset=8
- pushq 5*8(%rdx) /* pt_regs->ss */
- pushq 4*8(%rdx) /* pt_regs->rsp */
- pushq 3*8(%rdx) /* pt_regs->flags */
- pushq 2*8(%rdx) /* pt_regs->cs */
- pushq 1*8(%rdx) /* pt_regs->rip */
- UNWIND_HINT_IRET_REGS
- pushq $-1 /* pt_regs->orig_ax */
- PUSH_AND_CLEAR_REGS rdx=(%rdx)
- ENCODE_FRAME_POINTER
-
- /*
- * At this point we no longer need to worry about stack damage
- * due to nesting -- we're on the normal thread stack and we're
- * done with the NMI stack.
- */
-
- movq %rsp, %rdi
- movq $-1, %rsi
- call exc_nmi
+ /* NMI from kernel */

- /*
- * Return back to user mode. We must *not* do the normal exit
- * work, because we don't want to enable interrupts.
- */
- jmp swapgs_restore_regs_and_return_to_usermode
+ /* Use %rdx as our temp variable throughout */
+ pushq %rdx

-.Lnmi_from_kernel:
/*
* Here's what our stack frame will look like:
* +---------------------------------------------------------+
--
2.18.4

2020-11-16 14:50:44

by Alexandre Chartre

Subject: [RFC][PATCH v2 15/21] x86/pti: Execute IDT handlers with error code on the kernel stack

After an interrupt/exception in userland, the kernel is entered
and it switches the stack to the PTI stack, which is mapped both in
the kernel and in the user page-table. When executing the interrupt
handler, switch to the kernel stack (which is mapped only in the
kernel page-table) so that no kernel data leaks to userland through
the stack.

This changes the IDT handlers which take an error code.

Signed-off-by: Alexandre Chartre <[email protected]>
---
arch/x86/include/asm/idtentry.h | 18 ++++++++++++++++--
arch/x86/kernel/traps.c | 2 +-
2 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 3595a31947b3..a82e31b45442 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -25,6 +25,12 @@ void idtentry_exit_nmi(struct pt_regs *regs, bool irq_state);
(void (*)(void))(func), (void *)(arg1)) : \
func(arg1))

+#define CALL_ON_STACK_2(stack, func, arg1, arg2) \
+ ((stack) ? \
+ asm_call_on_stack_2(stack, \
+ (void (*)(void))(func), (void *)(arg1), (void *)(arg2)) : \
+ func(arg1, arg2))
+
/*
* Functions to return the top of the kernel stack if we are using the
* user page-table (and thus not running with the kernel stack). If we
@@ -53,6 +59,13 @@ void run_idt(void (*func)(struct pt_regs *), struct pt_regs *regs)
CALL_ON_STACK_1(pti_kernel_stack(regs), func, regs);
}

+static __always_inline
+void run_idt_errcode(void (*func)(struct pt_regs *, unsigned long),
+ struct pt_regs *regs, unsigned long error_code)
+{
+ CALL_ON_STACK_2(pti_kernel_stack(regs), func, regs, error_code);
+}
+
/**
* DECLARE_IDTENTRY - Declare functions for simple IDT entry points
* No error code pushed by hardware
@@ -141,7 +154,7 @@ __visible noinstr void func(struct pt_regs *regs, \
irqentry_state_t state = irqentry_enter(regs); \
\
instrumentation_begin(); \
- __##func (regs, error_code); \
+ run_idt_errcode(__##func, regs, error_code); \
instrumentation_end(); \
irqentry_exit(regs, state); \
} \
@@ -239,7 +252,8 @@ __visible noinstr void func(struct pt_regs *regs, \
instrumentation_begin(); \
irq_enter_rcu(); \
kvm_set_cpu_l1tf_flush_l1d(); \
- __##func (regs, (u8)error_code); \
+ run_idt_errcode((void (*)(struct pt_regs *, unsigned long))__##func, \
+ regs, (u8)error_code); \
irq_exit_rcu(); \
instrumentation_end(); \
irqentry_exit(regs, state); \
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 5161385b3670..9a51aa016fb3 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -979,7 +979,7 @@ DEFINE_IDTENTRY_DEBUG(exc_debug)
/* User entry, runs on regular task stack */
DEFINE_IDTENTRY_DEBUG_USER(exc_debug)
{
- exc_debug_user(regs, debug_read_clear_dr6());
+ run_idt_errcode(exc_debug_user, regs, debug_read_clear_dr6());
}
#else
/* 32 bit does not have separate entry points. */
--
2.18.4

2020-11-16 14:50:51

by Alexandre Chartre

Subject: [RFC][PATCH v2 01/21] x86/syscall: Add wrapper for invoking syscall function

Add a wrapper function for invoking a syscall function.

Signed-off-by: Alexandre Chartre <[email protected]>
---
arch/x86/entry/common.c | 16 +++++++++++++---
1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 870efeec8bda..d222212908ad 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -35,6 +35,15 @@
#include <asm/syscall.h>
#include <asm/irq_stack.h>

+static __always_inline void run_syscall(sys_call_ptr_t sysfunc,
+ struct pt_regs *regs)
+{
+ if (!sysfunc)
+ return;
+
+ regs->ax = sysfunc(regs);
+}
+
#ifdef CONFIG_X86_64
__visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs)
{
@@ -43,15 +52,16 @@ __visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs)
instrumentation_begin();
if (likely(nr < NR_syscalls)) {
nr = array_index_nospec(nr, NR_syscalls);
- regs->ax = sys_call_table[nr](regs);
+ run_syscall(sys_call_table[nr], regs);
#ifdef CONFIG_X86_X32_ABI
} else if (likely((nr & __X32_SYSCALL_BIT) &&
(nr & ~__X32_SYSCALL_BIT) < X32_NR_syscalls)) {
nr = array_index_nospec(nr & ~__X32_SYSCALL_BIT,
X32_NR_syscalls);
- regs->ax = x32_sys_call_table[nr](regs);
+ run_syscall(x32_sys_call_table[nr], regs);
#endif
}
+
instrumentation_end();
syscall_exit_to_user_mode(regs);
}
@@ -75,7 +85,7 @@ static __always_inline void do_syscall_32_irqs_on(struct pt_regs *regs,
if (likely(nr < IA32_NR_syscalls)) {
instrumentation_begin();
nr = array_index_nospec(nr, IA32_NR_syscalls);
- regs->ax = ia32_sys_call_table[nr](regs);
+ run_syscall(ia32_sys_call_table[nr], regs);
instrumentation_end();
}
}
--
2.18.4

2020-11-16 14:50:55

by Alexandre Chartre

Subject: [RFC][PATCH v2 13/21] x86/pti: Execute syscall functions on the kernel stack

During a syscall, the kernel is entered and it switches the stack
to the PTI stack, which is mapped both in the kernel and in the
user page-table. When executing the syscall function, switch to
the kernel stack (which is mapped only in the kernel page-table)
so that no kernel data leaks to userland through the stack.

Signed-off-by: Alexandre Chartre <[email protected]>
---
arch/x86/entry/common.c | 11 ++++++++++-
arch/x86/entry/entry_64.S | 1 +
arch/x86/include/asm/irq_stack.h | 3 +++
3 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 7ee15a12c115..1aba02ecb806 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -56,10 +56,19 @@ __visible noinstr void return_from_fork(struct pt_regs *regs,
static __always_inline void run_syscall(sys_call_ptr_t sysfunc,
struct pt_regs *regs)
{
+ unsigned long stack;
+
if (!sysfunc)
return;

- regs->ax = sysfunc(regs);
+ if (!pti_enabled()) {
+ regs->ax = sysfunc(regs);
+ return;
+ }
+
+ stack = (unsigned long)task_top_of_kernel_stack(current);
+ regs->ax = asm_call_syscall_on_stack((void *)(stack - 8),
+ sysfunc, regs);
}

#ifdef CONFIG_X86_64
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 29beab46bedd..6b88a0eb8975 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -771,6 +771,7 @@ SYM_FUNC_START(asm_call_on_stack_2)
SYM_FUNC_START(asm_call_on_stack_3)
SYM_INNER_LABEL(asm_call_sysvec_on_stack, SYM_L_GLOBAL)
SYM_INNER_LABEL(asm_call_irq_on_stack, SYM_L_GLOBAL)
+SYM_INNER_LABEL(asm_call_syscall_on_stack, SYM_L_GLOBAL)
/*
* Save the frame pointer unconditionally. This allows the ORC
* unwinder to handle the stack switch.
diff --git a/arch/x86/include/asm/irq_stack.h b/arch/x86/include/asm/irq_stack.h
index 359427216336..108d9da7c01c 100644
--- a/arch/x86/include/asm/irq_stack.h
+++ b/arch/x86/include/asm/irq_stack.h
@@ -5,6 +5,7 @@
#include <linux/ptrace.h>

#include <asm/processor.h>
+#include <asm/syscall.h>

#ifdef CONFIG_X86_64
static __always_inline bool irqstack_active(void)
@@ -25,6 +26,8 @@ void asm_call_sysvec_on_stack(void *sp, void (*func)(struct pt_regs *regs),
struct pt_regs *regs);
void asm_call_irq_on_stack(void *sp, void (*func)(struct irq_desc *desc),
struct irq_desc *desc);
+long asm_call_syscall_on_stack(void *sp, sys_call_ptr_t func,
+ struct pt_regs *regs);

static __always_inline void __run_on_irqstack(void (*func)(void))
{
--
2.18.4

2020-11-16 14:51:02

by Alexandre Chartre

Subject: [RFC][PATCH v2 12/21] x86/pti: Use PTI stack instead of trampoline stack

When entering the kernel from userland, use the per-task PTI stack
instead of the per-cpu trampoline stack. Like the trampoline stack,
the PTI stack is mapped both in the kernel and in the user page-table.
Using a per-task stack which is mapped into the kernel and the user
page-table instead of a per-cpu stack will allow executing more code
before switching to the kernel stack and to the kernel page-table.

Additional changes will be made later to switch to the kernel stack
(which is only mapped in the kernel page-table).

Signed-off-by: Alexandre Chartre <[email protected]>
---
arch/x86/entry/entry_64.S | 42 +++++++++-----------------------
arch/x86/include/asm/pti.h | 8 ++++++
arch/x86/include/asm/switch_to.h | 7 +++++-
3 files changed, 26 insertions(+), 31 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 458af12ed9a1..29beab46bedd 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -194,19 +194,9 @@ syscall_return_via_sysret:
/* rcx and r11 are already restored (see code above) */
POP_REGS pop_rdi=0 skip_r11rcx=1

- /*
- * Now all regs are restored except RSP and RDI.
- * Save old stack pointer and switch to trampoline stack.
- */
- movq %rsp, %rdi
- movq PER_CPU_VAR(cpu_tss_rw + TSS_sp0), %rsp
- UNWIND_HINT_EMPTY
-
- pushq RSP-RDI(%rdi) /* RSP */
- pushq (%rdi) /* RDI */
-
/*
* We are on the trampoline stack. All regs except RDI are live.
+ * We are on the trampoline stack. All regs except RSP are live.
* We can do future final exit work right here.
*/
STACKLEAK_ERASE_NOCLOBBER
@@ -214,7 +204,7 @@ syscall_return_via_sysret:
SWITCH_TO_USER_CR3_STACK scratch_reg=%rdi

popq %rdi
- popq %rsp
+ movq RSP-ORIG_RAX(%rsp), %rsp
USERGS_SYSRET64
SYM_CODE_END(entry_SYSCALL_64)

@@ -606,24 +596,6 @@ SYM_INNER_LABEL(swapgs_restore_regs_and_return_to_usermode, SYM_L_GLOBAL)
#endif
POP_REGS pop_rdi=0

- /*
- * The stack is now user RDI, orig_ax, RIP, CS, EFLAGS, RSP, SS.
- * Save old stack pointer and switch to trampoline stack.
- */
- movq %rsp, %rdi
- movq PER_CPU_VAR(cpu_tss_rw + TSS_sp0), %rsp
- UNWIND_HINT_EMPTY
-
- /* Copy the IRET frame to the trampoline stack. */
- pushq 6*8(%rdi) /* SS */
- pushq 5*8(%rdi) /* RSP */
- pushq 4*8(%rdi) /* EFLAGS */
- pushq 3*8(%rdi) /* CS */
- pushq 2*8(%rdi) /* RIP */
-
- /* Push user RDI on the trampoline stack. */
- pushq (%rdi)
-
/*
* We are on the trampoline stack. All regs except RDI are live.
* We can do future final exit work right here.
@@ -634,6 +606,7 @@ SYM_INNER_LABEL(swapgs_restore_regs_and_return_to_usermode, SYM_L_GLOBAL)

/* Restore RDI. */
popq %rdi
+ addq $8, %rsp /* skip regs->orig_ax */
SWAPGS
INTERRUPT_RETURN

@@ -1062,6 +1035,15 @@ SYM_CODE_START_LOCAL(error_entry)
SWITCH_TO_KERNEL_CR3 scratch_reg=%rax

.Lerror_entry_from_usermode_after_swapgs:
+ /*
+ * We are on the trampoline stack. With PTI, the trampoline
+ * stack is a per-thread stack so we are all set and we can
+ * return.
+ *
+ * Without PTI, the trampoline stack is a per-cpu stack and
+ * we need to switch to the normal thread stack.
+ */
+ ALTERNATIVE "", "ret", X86_FEATURE_PTI
/* Put us onto the real thread stack. */
popq %r12 /* save return addr in %12 */
movq %rsp, %rdi /* arg0 = pt_regs pointer */
diff --git a/arch/x86/include/asm/pti.h b/arch/x86/include/asm/pti.h
index 5484e69ff8d3..ed211fcc3a50 100644
--- a/arch/x86/include/asm/pti.h
+++ b/arch/x86/include/asm/pti.h
@@ -17,8 +17,16 @@ extern void pti_check_boottime_disable(void);
extern void pti_finalize(void);
extern void pti_clone_pgtable(struct mm_struct *mm, unsigned long start,
unsigned long end, enum pti_clone_level level);
+static inline bool pti_enabled(void)
+{
+ return static_cpu_has(X86_FEATURE_PTI);
+}
#else
static inline void pti_check_boottime_disable(void) { }
+static inline bool pti_enabled(void)
+{
+ return false;
+}
#endif

#endif /* __ASSEMBLY__ */
diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h
index 9f69cc497f4b..457458228462 100644
--- a/arch/x86/include/asm/switch_to.h
+++ b/arch/x86/include/asm/switch_to.h
@@ -3,6 +3,7 @@
#define _ASM_X86_SWITCH_TO_H

#include <linux/sched/task_stack.h>
+#include <asm/pti.h>

struct task_struct; /* one of the stranger aspects of C forward declarations */

@@ -76,8 +77,12 @@ static inline void update_task_stack(struct task_struct *task)
* doesn't work on x86-32 because sp1 and
* cpu_current_top_of_stack have different values (because of
* the non-zero stack-padding on 32bit).
+ *
+ * If PTI is enabled, sp0 points to the PTI stack (mapped in
+ * the kernel and user page-table) which is used when entering
+ * the kernel.
*/
- if (static_cpu_has(X86_FEATURE_XENPV))
+ if (static_cpu_has(X86_FEATURE_XENPV) || pti_enabled())
load_sp0(task_top_of_stack(task));
#endif
}
--
2.18.4

2020-11-16 14:51:14

by Alexandre Chartre

Subject: [RFC][PATCH v2 18/21] x86/pti: Execute NMI handler on the kernel stack

After an NMI from userland, the kernel is entered and it switches
the stack to the PTI stack, which is mapped both in the kernel and
in the user page-table. When executing the NMI handler, switch to
the kernel stack (which is mapped only in the kernel page-table) so
that no kernel data leaks to userland through the stack.

Signed-off-by: Alexandre Chartre <[email protected]>
---
arch/x86/kernel/nmi.c | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index 4bc77aaf1303..be0f654c3095 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -506,8 +506,18 @@ DEFINE_IDTENTRY_RAW(exc_nmi)

inc_irq_stat(__nmi_count);

- if (!ignore_nmis)
- default_do_nmi(regs);
+ if (!ignore_nmis) {
+ if (user_mode(regs)) {
+ /*
+ * If we come from userland then we are on the
+ * trampoline stack, switch to the kernel stack
+ * to execute the NMI handler.
+ */
+ run_idt(default_do_nmi, regs);
+ } else {
+ default_do_nmi(regs);
+ }
+ }

idtentry_exit_nmi(regs, irq_state);

--
2.18.4

2020-11-16 14:51:15

by Alexandre Chartre

Subject: [RFC][PATCH v2 10/21] x86/pti: Function to map per-cpu page-table entry

Wrap the code used by PTI to map a per-cpu page-table entry into
a new function so that this code can be re-used to map other
per-cpu entries.

Signed-off-by: Alexandre Chartre <[email protected]>
---
arch/x86/mm/pti.c | 25 ++++++++++++++++---------
1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index ebc8cd2f1cd8..71ca245d7b38 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -428,6 +428,21 @@ static void __init pti_clone_p4d(unsigned long addr)
*user_p4d = *kernel_p4d;
}

+/*
+ * Clone a single percpu page.
+ */
+static void __init pti_clone_percpu_page(void *addr)
+{
+ phys_addr_t pa = per_cpu_ptr_to_phys(addr);
+ pte_t *target_pte;
+
+ target_pte = pti_user_pagetable_walk_pte((unsigned long)addr);
+ if (WARN_ON(!target_pte))
+ return;
+
+ *target_pte = pfn_pte(pa >> PAGE_SHIFT, PAGE_KERNEL);
+}
+
/*
* Clone the CPU_ENTRY_AREA and associated data into the user space visible
* page table.
@@ -448,16 +463,8 @@ static void __init pti_clone_user_shared(void)
* This is done for all possible CPUs during boot to ensure
* that it's propagated to all mms.
*/
+ pti_clone_percpu_page(&per_cpu(cpu_tss_rw, cpu));

- unsigned long va = (unsigned long)&per_cpu(cpu_tss_rw, cpu);
- phys_addr_t pa = per_cpu_ptr_to_phys((void *)va);
- pte_t *target_pte;
-
- target_pte = pti_user_pagetable_walk_pte(va);
- if (WARN_ON(!target_pte))
- return;
-
- *target_pte = pfn_pte(pa >> PAGE_SHIFT, PAGE_KERNEL);
}
}

--
2.18.4

2020-11-16 14:51:25

by Alexandre Chartre

Subject: [RFC][PATCH v2 19/21] x86/pti: Defer CR3 switch to C code for IST entries

IST entries from the kernel use the paranoid entry and exit
assembly functions to ensure the CR3 and GS registers are
updated with correct values for the kernel. Move the CR3 update
into the C code of the IST handlers.
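
The resulting pattern for an IST handler entered from the kernel is
shown below; this simply restates the exc_debug hunk from this patch:

DEFINE_IDTENTRY_DEBUG(exc_debug)
{
	unsigned long saved_cr3;

	/* We may have interrupted entry code still running on the user CR3. */
	saved_cr3 = save_and_switch_to_kernel_cr3();
	exc_debug_kernel(regs, debug_read_clear_dr6());
	restore_cr3(saved_cr3);		/* go back to whatever CR3 we interrupted */
}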

Signed-off-by: Alexandre Chartre <[email protected]>
---
arch/x86/entry/entry_64.S | 34 ++--------------------------------
arch/x86/kernel/cpu/mce/core.c | 3 +++
arch/x86/kernel/nmi.c | 18 +++++++++++++++---
arch/x86/kernel/sev-es.c | 13 ++++++++++++-
arch/x86/kernel/traps.c | 30 ++++++++++++++++++++++++++----
5 files changed, 58 insertions(+), 40 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 6b88a0eb8975..1715bc0cefff 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -900,23 +900,6 @@ SYM_CODE_START_LOCAL(paranoid_entry)
PUSH_AND_CLEAR_REGS save_ret=1
ENCODE_FRAME_POINTER 8

- /*
- * Always stash CR3 in %r14. This value will be restored,
- * verbatim, at exit. Needed if paranoid_entry interrupted
- * another entry that already switched to the user CR3 value
- * but has not yet returned to userspace.
- *
- * This is also why CS (stashed in the "iret frame" by the
- * hardware at entry) can not be used: this may be a return
- * to kernel code, but with a user CR3 value.
- *
- * Switching CR3 does not depend on kernel GSBASE so it can
- * be done before switching to the kernel GSBASE. This is
- * required for FSGSBASE because the kernel GSBASE has to
- * be retrieved from a kernel internal table.
- */
- SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg=%rax save_reg=%r14
-
/*
* Handling GSBASE depends on the availability of FSGSBASE.
*
@@ -956,9 +939,7 @@ SYM_CODE_START_LOCAL(paranoid_entry)
SWAPGS

/*
- * The above SAVE_AND_SWITCH_TO_KERNEL_CR3 macro doesn't do an
- * unconditional CR3 write, even in the PTI case. So do an lfence
- * to prevent GS speculation, regardless of whether PTI is enabled.
+ * Do an lfence to prevent GS speculation.
*/
FENCE_SWAPGS_KERNEL_ENTRY

@@ -989,14 +970,10 @@ SYM_CODE_END(paranoid_entry)
SYM_CODE_START_LOCAL(paranoid_exit)
UNWIND_HINT_REGS
/*
- * The order of operations is important. RESTORE_CR3 requires
- * kernel GSBASE.
- *
* NB to anyone to try to optimize this code: this code does
* not execute at all for exceptions from user mode. Those
* exceptions go through error_exit instead.
*/
- RESTORE_CR3 scratch_reg=%rax save_reg=%r14

/* Handle the three GSBASE cases */
ALTERNATIVE "jmp .Lparanoid_exit_checkgs", "", X86_FEATURE_FSGSBASE
@@ -1119,10 +1096,6 @@ SYM_CODE_END(error_return)
/*
* Runs on exception stack. Xen PV does not go through this path at all,
* so we can use real assembly here.
- *
- * Registers:
- * %r14: Used to save/restore the CR3 of the interrupted context
- * when PAGE_TABLE_ISOLATION is in use. Do not clobber.
*/
SYM_CODE_START(asm_exc_nmi)
/*
@@ -1173,7 +1146,7 @@ SYM_CODE_START(asm_exc_nmi)
* We also must not push anything to the stack before switching
* stacks lest we corrupt the "NMI executing" variable.
*/
- ist_entry_user exc_nmi
+ ist_entry_user exc_nmi_user

/* NMI from kernel */

@@ -1385,9 +1358,6 @@ end_repeat_nmi:
movq $-1, %rsi
call exc_nmi

- /* Always restore stashed CR3 value (see paranoid_entry) */
- RESTORE_CR3 scratch_reg=%r15 save_reg=%r14
-
/*
* The above invocation of paranoid_entry stored the GSBASE
* related information in R/EBX depending on the availability
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 9407c3cd9355..31ac01c1155d 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -2022,11 +2022,14 @@ static __always_inline void exc_machine_check_user(struct pt_regs *regs)
/* MCE hit kernel mode */
DEFINE_IDTENTRY_MCE(exc_machine_check)
{
+ unsigned long saved_cr3;
unsigned long dr7;

+ saved_cr3 = save_and_switch_to_kernel_cr3();
dr7 = local_db_save();
exc_machine_check_kernel(regs);
local_db_restore(dr7);
+ restore_cr3(saved_cr3);
}

/* The user mode variant. */
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index be0f654c3095..523d88c3fea1 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -473,7 +473,7 @@ static DEFINE_PER_CPU(enum nmi_states, nmi_state);
static DEFINE_PER_CPU(unsigned long, nmi_cr2);
static DEFINE_PER_CPU(unsigned long, nmi_dr7);

-DEFINE_IDTENTRY_RAW(exc_nmi)
+static noinstr void handle_nmi(struct pt_regs *regs)
{
bool irq_state;

@@ -529,9 +529,21 @@ DEFINE_IDTENTRY_RAW(exc_nmi)
write_cr2(this_cpu_read(nmi_cr2));
if (this_cpu_dec_return(nmi_state))
goto nmi_restart;
+}
+
+DEFINE_IDTENTRY_NMI(exc_nmi)
+{
+ unsigned long saved_cr3;

- if (user_mode(regs))
- mds_user_clear_cpu_buffers();
+ saved_cr3 = save_and_switch_to_kernel_cr3();
+ handle_nmi(regs);
+ restore_cr3(saved_cr3);
+}
+
+__visible noinstr void exc_nmi_user(struct pt_regs *regs)
+{
+ handle_nmi(regs);
+ mds_user_clear_cpu_buffers();
}

void stop_nmi(void)
diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c
index bd977c917cd6..59fc7c472525 100644
--- a/arch/x86/kernel/sev-es.c
+++ b/arch/x86/kernel/sev-es.c
@@ -1352,13 +1352,18 @@ DEFINE_IDTENTRY_VC_IST(exc_vmm_communication)
struct exc_vc_frame {
/* pt_regs should be first */
struct pt_regs regs;
+ /* extra parameters for the handler */
+ unsigned long saved_cr3;
};

DEFINE_IDTENTRY_VC_SETUP_STACK(exc_vmm_communication)
{
struct exc_vc_frame *frame;
+ unsigned long saved_cr3;
unsigned long sp;

+ saved_cr3 = save_and_switch_to_kernel_cr3();
+
/*
* Switch off the IST stack to make it free for nested exceptions.
* The vc_switch_off_ist() function will switch back to the
@@ -1370,7 +1375,8 @@ DEFINE_IDTENTRY_VC_SETUP_STACK(exc_vmm_communication)
/*
* Found a safe stack. Set it up as if the entry has happened on
* that stack. This means that we need to have pt_regs at the top
- * of the stack.
+ * of the stack, and we can use the bottom of the stack to pass
+ * extra parameters (like the saved cr3 value) to the handler.
*
* The effective stack switch happens in assembly code before
* the #VC handler is called.
@@ -1379,16 +1385,21 @@ DEFINE_IDTENTRY_VC_SETUP_STACK(exc_vmm_communication)

frame = (struct exc_vc_frame *)sp;
frame->regs = *regs;
+ frame->saved_cr3 = saved_cr3;

return sp;
}

DEFINE_IDTENTRY_VC(exc_vmm_communication)
{
+ struct exc_vc_frame *frame = (struct exc_vc_frame *)regs;
+
if (likely(!on_vc_fallback_stack(regs)))
safe_stack_exc_vmm_communication(regs, error_code);
else
ist_exc_vmm_communication(regs, error_code);
+
+ restore_cr3(frame->saved_cr3);
}

bool __init handle_vc_boot_ghcb(struct pt_regs *regs)
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 9a51aa016fb3..14d2d6f15184 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -344,10 +344,10 @@ __visible void __noreturn handle_stack_overflow(const char *message,
DEFINE_IDTENTRY_DF(exc_double_fault)
{
static const char str[] = "double fault";
- struct task_struct *tsk = current;
-
+ struct task_struct *tsk;
+ unsigned long saved_cr3;
#ifdef CONFIG_VMAP_STACK
- unsigned long address = read_cr2();
+ unsigned long address;
#endif

#ifdef CONFIG_X86_ESPFIX64
@@ -371,8 +371,12 @@ DEFINE_IDTENTRY_DF(exc_double_fault)
regs->cs == __KERNEL_CS &&
regs->ip == (unsigned long)native_irq_return_iret)
{
- struct pt_regs *gpregs = (struct pt_regs *)this_cpu_read(cpu_tss_rw.x86_tss.sp0) - 1;
unsigned long *p = (unsigned long *)regs->sp;
+ struct pt_regs *gpregs;
+
+ saved_cr3 = save_and_switch_to_kernel_cr3();
+
+ gpregs = (struct pt_regs *)this_cpu_read(cpu_tss_rw.x86_tss.sp0) - 1;

/*
* regs->sp points to the failing IRET frame on the
@@ -401,14 +405,28 @@ DEFINE_IDTENTRY_DF(exc_double_fault)
regs->ip = (unsigned long)asm_exc_general_protection;
regs->sp = (unsigned long)&gpregs->orig_ax;

+ restore_cr3(saved_cr3);
+
return;
}
#endif

+ /*
+ * Switch to the kernel page-table. We are on an IST stack, and
+ * we are going to die so there is no need to switch to the kernel
+ * stack even if we are coming from userspace.
+ */
+ saved_cr3 = save_and_switch_to_kernel_cr3();
+
+#ifdef CONFIG_VMAP_STACK
+ address = read_cr2();
+#endif
+
idtentry_enter_nmi(regs);
instrumentation_begin();
notify_die(DIE_TRAP, str, regs, error_code, X86_TRAP_DF, SIGSEGV);

+ tsk = current;
tsk->thread.error_code = error_code;
tsk->thread.trap_nr = X86_TRAP_DF;

@@ -973,7 +991,11 @@ static __always_inline void exc_debug_user(struct pt_regs *regs,
/* IST stack entry */
DEFINE_IDTENTRY_DEBUG(exc_debug)
{
+ unsigned long saved_cr3;
+
+ saved_cr3 = save_and_switch_to_kernel_cr3();
exc_debug_kernel(regs, debug_read_clear_dr6());
+ restore_cr3(saved_cr3);
}

/* User entry, runs on regular task stack */
--
2.18.4

2020-11-16 14:51:30

by Alexandre Chartre

Subject: [RFC][PATCH v2 14/21] x86/pti: Execute IDT handlers on the kernel stack

After an interrupt/exception in userland, the kernel is entered
and it switches the stack to the PTI stack, which is mapped both in
the kernel and in the user page-table. When executing the interrupt
handler, switch to the kernel stack (which is mapped only in the
kernel page-table) so that no kernel data leaks to userland through
the stack.

For now, this only changes the IDT handlers which have no argument
other than the pt_regs pointer.
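
For reference, a generated handler ends up with the following shape
(a sketch based on the DEFINE_IDTENTRY change below; exc_example is a
placeholder name):

__visible noinstr void exc_example(struct pt_regs *regs)
{
	irqentry_state_t state = irqentry_enter(regs);

	instrumentation_begin();
	/* was: __exc_example(regs) -- now runs on the kernel stack if needed */
	run_idt(__exc_example, regs);
	instrumentation_end();
	irqentry_exit(regs, state);
}

run_idt() only redirects to the kernel stack (via pti_kernel_stack()
and asm_call_on_stack_1()) when PTI is enabled and the entry came from
userland; otherwise it calls the handler directly.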

Signed-off-by: Alexandre Chartre <[email protected]>
---
arch/x86/include/asm/idtentry.h | 43 +++++++++++++++++++++++++++++++--
arch/x86/kernel/cpu/mce/core.c | 2 +-
arch/x86/kernel/traps.c | 4 +--
3 files changed, 44 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 4b4aca2b1420..3595a31947b3 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -10,10 +10,49 @@
#include <linux/hardirq.h>

#include <asm/irq_stack.h>
+#include <asm/pti.h>

bool idtentry_enter_nmi(struct pt_regs *regs);
void idtentry_exit_nmi(struct pt_regs *regs, bool irq_state);

+/*
+ * The CALL_ON_STACK_* macros call the specified function either directly
+ * if no stack is provided, or on the specified stack.
+ */
+#define CALL_ON_STACK_1(stack, func, arg1) \
+ ((stack) ? \
+ asm_call_on_stack_1(stack, \
+ (void (*)(void))(func), (void *)(arg1)) : \
+ func(arg1))
+
+/*
+ * Functions to return the top of the kernel stack if we are using the
+ * user page-table (and thus not running with the kernel stack). If we
+ * are using the kernel page-table (and so already using the kernel
+ * stack), it returns NULL.
+ */
+static __always_inline void *pti_kernel_stack(struct pt_regs *regs)
+{
+ unsigned long stack;
+
+ if (pti_enabled() && user_mode(regs)) {
+ stack = (unsigned long)task_top_of_kernel_stack(current);
+ return (void *)(stack - 8);
+ } else {
+ return NULL;
+ }
+}
+
+/*
+ * Wrappers to run an IDT handler on the kernel stack if we are not
+ * already using this stack.
+ */
+static __always_inline
+void run_idt(void (*func)(struct pt_regs *), struct pt_regs *regs)
+{
+ CALL_ON_STACK_1(pti_kernel_stack(regs), func, regs);
+}
+
/**
* DECLARE_IDTENTRY - Declare functions for simple IDT entry points
* No error code pushed by hardware
@@ -55,7 +94,7 @@ __visible noinstr void func(struct pt_regs *regs) \
irqentry_state_t state = irqentry_enter(regs); \
\
instrumentation_begin(); \
- __##func (regs); \
+ run_idt(__##func, regs); \
instrumentation_end(); \
irqentry_exit(regs, state); \
} \
@@ -271,7 +310,7 @@ __visible noinstr void func(struct pt_regs *regs) \
instrumentation_begin(); \
__irq_enter_raw(); \
kvm_set_cpu_l1tf_flush_l1d(); \
- __##func (regs); \
+ run_idt(__##func, regs); \
__irq_exit_raw(); \
instrumentation_end(); \
irqentry_exit(regs, state); \
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 4102b866e7c0..9407c3cd9355 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -2035,7 +2035,7 @@ DEFINE_IDTENTRY_MCE_USER(exc_machine_check)
unsigned long dr7;

dr7 = local_db_save();
- exc_machine_check_user(regs);
+ run_idt(exc_machine_check_user, regs);
local_db_restore(dr7);
}
#else
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 09b22a611d99..5161385b3670 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -257,7 +257,7 @@ DEFINE_IDTENTRY_RAW(exc_invalid_op)

state = irqentry_enter(regs);
instrumentation_begin();
- handle_invalid_op(regs);
+ run_idt(handle_invalid_op, regs);
instrumentation_end();
irqentry_exit(regs, state);
}
@@ -647,7 +647,7 @@ DEFINE_IDTENTRY_RAW(exc_int3)
if (user_mode(regs)) {
irqentry_enter_from_user_mode(regs);
instrumentation_begin();
- do_int3_user(regs);
+ run_idt(do_int3_user, regs);
instrumentation_end();
irqentry_exit_to_user_mode(regs);
} else {
--
2.18.4

2020-11-16 14:51:40

by Alexandre Chartre

Subject: [RFC][PATCH v2 20/21] x86/pti: Defer CR3 switch to C code for non-IST and syscall entries

With PTI, the syscall/interrupt/exception entries switch the CR3
register in assembly code to change the page-table. Move the CR3
switch into the C code of the syscall/interrupt/exception entry
handlers.
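
As an illustration, the 64-bit syscall path ends up looking like this
(a sketch combining the hunks below with patch 1; the x32 branch is
omitted for brevity):

__visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs)
{
	user_pagetable_exit();		/* user CR3 -> kernel CR3, now done in C */
	nr = syscall_enter_from_user_mode(regs, nr);

	instrumentation_begin();
	if (likely(nr < NR_syscalls)) {
		nr = array_index_nospec(nr, NR_syscalls);
		run_syscall(sys_call_table[nr], regs);
	}
	instrumentation_end();

	syscall_exit_to_user_mode(regs);
	user_pagetable_enter();		/* kernel CR3 -> user CR3 before returning */
}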

Signed-off-by: Alexandre Chartre <[email protected]>
---
arch/x86/entry/common.c | 15 ++++++++++++---
arch/x86/entry/entry_64.S | 23 +++++------------------
arch/x86/entry/entry_64_compat.S | 22 ----------------------
arch/x86/include/asm/entry-common.h | 13 +++++++++++++
arch/x86/include/asm/idtentry.h | 25 ++++++++++++++++++++-----
arch/x86/kernel/cpu/mce/core.c | 2 ++
arch/x86/kernel/nmi.c | 2 ++
arch/x86/kernel/traps.c | 6 ++++++
arch/x86/mm/fault.c | 9 +++++++--
9 files changed, 67 insertions(+), 50 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 1aba02ecb806..6ef5afc42b82 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -51,6 +51,7 @@ __visible noinstr void return_from_fork(struct pt_regs *regs,
regs->ax = 0;
}
syscall_exit_to_user_mode(regs);
+ user_pagetable_enter();
}

static __always_inline void run_syscall(sys_call_ptr_t sysfunc,
@@ -74,6 +75,7 @@ static __always_inline void run_syscall(sys_call_ptr_t sysfunc,
#ifdef CONFIG_X86_64
__visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs)
{
+ user_pagetable_exit();
nr = syscall_enter_from_user_mode(regs, nr);

instrumentation_begin();
@@ -91,12 +93,14 @@ __visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs)

instrumentation_end();
syscall_exit_to_user_mode(regs);
+ user_pagetable_enter();
}
#endif

#if defined(CONFIG_X86_32) || defined(CONFIG_IA32_EMULATION)
static __always_inline unsigned int syscall_32_enter(struct pt_regs *regs)
{
+ user_pagetable_exit();
if (IS_ENABLED(CONFIG_IA32_EMULATION))
current_thread_info()->status |= TS_COMPAT;

@@ -131,11 +135,11 @@ __visible noinstr void do_int80_syscall_32(struct pt_regs *regs)

do_syscall_32_irqs_on(regs, nr);
syscall_exit_to_user_mode(regs);
+ user_pagetable_enter();
}

-static noinstr bool __do_fast_syscall_32(struct pt_regs *regs)
+static noinstr bool __do_fast_syscall_32(struct pt_regs *regs, long nr)
{
- unsigned int nr = syscall_32_enter(regs);
int res;

/*
@@ -179,6 +183,9 @@ static noinstr bool __do_fast_syscall_32(struct pt_regs *regs)
/* Returns 0 to return using IRET or 1 to return using SYSEXIT/SYSRETL. */
__visible noinstr long do_fast_syscall_32(struct pt_regs *regs)
{
+ unsigned int nr = syscall_32_enter(regs);
+ bool syscall_done;
+
/*
* Called using the internal vDSO SYSENTER/SYSCALL32 calling
* convention. Adjust regs so it looks like we entered using int80.
@@ -194,7 +201,9 @@ __visible noinstr long do_fast_syscall_32(struct pt_regs *regs)
regs->ip = landing_pad;

/* Invoke the syscall. If it failed, keep it simple: use IRET. */
- if (!__do_fast_syscall_32(regs))
+ syscall_done = __do_fast_syscall_32(regs, nr);
+ user_pagetable_enter();
+ if (!syscall_done)
return 0;

#ifdef CONFIG_X86_64
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 1715bc0cefff..b7d9a019d001 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -98,7 +98,6 @@ SYM_CODE_START(entry_SYSCALL_64)
swapgs
/* tss.sp2 is scratch space. */
movq %rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)
- SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
movq PER_CPU_VAR(cpu_current_top_of_stack), %rsp

SYM_INNER_LABEL(entry_SYSCALL_64_safe_stack, SYM_L_GLOBAL)
@@ -192,18 +191,14 @@ SYM_INNER_LABEL(entry_SYSCALL_64_after_hwframe, SYM_L_GLOBAL)
*/
syscall_return_via_sysret:
/* rcx and r11 are already restored (see code above) */
- POP_REGS pop_rdi=0 skip_r11rcx=1
+ POP_REGS skip_r11rcx=1

/*
- * We are on the trampoline stack. All regs except RDI are live.
* We are on the trampoline stack. All regs except RSP are live.
* We can do future final exit work right here.
*/
STACKLEAK_ERASE_NOCLOBBER

- SWITCH_TO_USER_CR3_STACK scratch_reg=%rdi
-
- popq %rdi
movq RSP-ORIG_RAX(%rsp), %rsp
USERGS_SYSRET64
SYM_CODE_END(entry_SYSCALL_64)
@@ -321,7 +316,6 @@ SYM_CODE_END(ret_from_fork)
swapgs
cld
FENCE_SWAPGS_USER_ENTRY
- SWITCH_TO_KERNEL_CR3 scratch_reg=%rdx
movq %rsp, %rdx
movq PER_CPU_VAR(cpu_current_top_of_stack), %rsp
UNWIND_HINT_IRET_REGS base=%rdx offset=8
@@ -594,19 +588,15 @@ SYM_INNER_LABEL(swapgs_restore_regs_and_return_to_usermode, SYM_L_GLOBAL)
ud2
1:
#endif
- POP_REGS pop_rdi=0
+ POP_REGS
+ addq $8, %rsp /* skip regs->orig_ax */

/*
- * We are on the trampoline stack. All regs except RDI are live.
+ * We are on the trampoline stack. All regs are live.
* We can do future final exit work right here.
*/
STACKLEAK_ERASE_NOCLOBBER

- SWITCH_TO_USER_CR3_STACK scratch_reg=%rdi
-
- /* Restore RDI. */
- popq %rdi
- addq $8, %rsp /* skip regs->orig_ax */
SWAPGS
INTERRUPT_RETURN

@@ -1009,8 +999,6 @@ SYM_CODE_START_LOCAL(error_entry)
*/
SWAPGS
FENCE_SWAPGS_USER_ENTRY
- /* We have user CR3. Change to kernel CR3. */
- SWITCH_TO_KERNEL_CR3 scratch_reg=%rax

.Lerror_entry_from_usermode_after_swapgs:
/*
@@ -1069,11 +1057,10 @@ SYM_CODE_START_LOCAL(error_entry)
.Lerror_bad_iret:
/*
* We came from an IRET to user mode, so we have user
- * gsbase and CR3. Switch to kernel gsbase and CR3:
+ * gsbase and CR3. Switch to kernel gsbase.
*/
SWAPGS
FENCE_SWAPGS_USER_ENTRY
- SWITCH_TO_KERNEL_CR3 scratch_reg=%rax

/*
* Pretend that the exception came from user mode: set up pt_regs
diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
index 541fdaf64045..a6fb5807bf42 100644
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -51,10 +51,6 @@ SYM_CODE_START(entry_SYSENTER_compat)
/* Interrupts are off on entry. */
SWAPGS

- pushq %rax
- SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
- popq %rax
-
movq PER_CPU_VAR(cpu_current_top_of_stack), %rsp

/* Construct struct pt_regs on stack */
@@ -204,9 +200,6 @@ SYM_CODE_START(entry_SYSCALL_compat)
/* Stash user ESP */
movl %esp, %r8d

- /* Use %rsp as scratch reg. User ESP is stashed in r8 */
- SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
-
/* Switch to the kernel stack */
movq PER_CPU_VAR(cpu_current_top_of_stack), %rsp

@@ -291,18 +284,6 @@ sysret32_from_system_call:
* code. We zero R8-R10 to avoid info leaks.
*/
movq RSP-ORIG_RAX(%rsp), %rsp
-
- /*
- * The original userspace %rsp (RSP-ORIG_RAX(%rsp)) is stored
- * on the process stack which is not mapped to userspace and
- * not readable after we SWITCH_TO_USER_CR3. Delay the CR3
- * switch until after after the last reference to the process
- * stack.
- *
- * %r8/%r9 are zeroed before the sysret, thus safe to clobber.
- */
- SWITCH_TO_USER_CR3_NOSTACK scratch_reg=%r8 scratch_reg2=%r9
-
xorl %r8d, %r8d
xorl %r9d, %r9d
xorl %r10d, %r10d
@@ -357,9 +338,6 @@ SYM_CODE_START(entry_INT80_compat)
pushq %rax /* pt_regs->orig_ax */
pushq %rdi /* pt_regs->di */

- /* Need to switch before accessing the thread stack. */
- SWITCH_TO_KERNEL_CR3 scratch_reg=%rdi
-
/* In the Xen PV case we already run on the thread stack. */
ALTERNATIVE "", "jmp .Lint80_keep_stack", X86_FEATURE_XENPV

diff --git a/arch/x86/include/asm/entry-common.h b/arch/x86/include/asm/entry-common.h
index 46682b1433a4..e01735a181b8 100644
--- a/arch/x86/include/asm/entry-common.h
+++ b/arch/x86/include/asm/entry-common.h
@@ -193,6 +193,17 @@ static __always_inline void user_pagetable_exit(void)
switch_to_kernel_cr3(__native_read_cr3());
}

+static __always_inline void user_pagetable_return(struct pt_regs *regs)
+{
+ if (user_mode(regs))
+ user_pagetable_enter();
+}
+
+static __always_inline void user_pagetable_escape(struct pt_regs *regs)
+{
+ if (user_mode(regs))
+ user_pagetable_exit();
+}

#else /* CONFIG_PAGE_TABLE_ISOLATION */

@@ -204,6 +215,8 @@ static __always_inline void restore_cr3(unsigned long cr3) {}

static __always_inline void user_pagetable_enter(void) {};
static __always_inline void user_pagetable_exit(void) {};
+static __always_inline void user_pagetable_return(struct pt_regs *regs) {};
+static __always_inline void user_pagetable_escape(struct pt_regs *regs) {};

#endif /* CONFIG_PAGE_TABLE_ISOLATION */
#endif /* MODULE */
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index a6725afaaec0..f29bfc0700ff 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -132,12 +132,15 @@ static __always_inline void __##func(struct pt_regs *regs); \
\
__visible noinstr void func(struct pt_regs *regs) \
{ \
- irqentry_state_t state = irqentry_enter(regs); \
+ irqentry_state_t state; \
\
+ user_pagetable_escape(regs); \
+ state = irqentry_enter(regs); \
instrumentation_begin(); \
run_idt(__##func, regs); \
instrumentation_end(); \
irqentry_exit(regs, state); \
+ user_pagetable_return(regs); \
} \
\
static __always_inline void __##func(struct pt_regs *regs)
@@ -179,12 +182,15 @@ static __always_inline void __##func(struct pt_regs *regs, \
__visible noinstr void func(struct pt_regs *regs, \
unsigned long error_code) \
{ \
- irqentry_state_t state = irqentry_enter(regs); \
+ irqentry_state_t state; \
\
+ user_pagetable_escape(regs); \
+ state = irqentry_enter(regs); \
instrumentation_begin(); \
run_idt_errcode(__##func, regs, error_code); \
instrumentation_end(); \
irqentry_exit(regs, state); \
+ user_pagetable_return(regs); \
} \
\
static __always_inline void __##func(struct pt_regs *regs, \
@@ -275,8 +281,10 @@ static __always_inline void __##func(struct pt_regs *regs, u8 vector); \
__visible noinstr void func(struct pt_regs *regs, \
unsigned long error_code) \
{ \
- irqentry_state_t state = irqentry_enter(regs); \
+ irqentry_state_t state; \
\
+ user_pagetable_escape(regs); \
+ state = irqentry_enter(regs); \
instrumentation_begin(); \
irq_enter_rcu(); \
kvm_set_cpu_l1tf_flush_l1d(); \
@@ -285,6 +293,7 @@ __visible noinstr void func(struct pt_regs *regs, \
irq_exit_rcu(); \
instrumentation_end(); \
irqentry_exit(regs, state); \
+ user_pagetable_return(regs); \
} \
\
static __always_inline void __##func(struct pt_regs *regs, u8 vector)
@@ -318,8 +327,10 @@ static void __##func(struct pt_regs *regs); \
\
__visible noinstr void func(struct pt_regs *regs) \
{ \
- irqentry_state_t state = irqentry_enter(regs); \
+ irqentry_state_t state; \
\
+ user_pagetable_escape(regs); \
+ state = irqentry_enter(regs); \
instrumentation_begin(); \
irq_enter_rcu(); \
kvm_set_cpu_l1tf_flush_l1d(); \
@@ -327,6 +338,7 @@ __visible noinstr void func(struct pt_regs *regs) \
irq_exit_rcu(); \
instrumentation_end(); \
irqentry_exit(regs, state); \
+ user_pagetable_return(regs); \
} \
\
static noinline void __##func(struct pt_regs *regs)
@@ -347,8 +359,10 @@ static __always_inline void __##func(struct pt_regs *regs); \
\
__visible noinstr void func(struct pt_regs *regs) \
{ \
- irqentry_state_t state = irqentry_enter(regs); \
+ irqentry_state_t state; \
\
+ user_pagetable_escape(regs); \
+ state = irqentry_enter(regs); \
instrumentation_begin(); \
__irq_enter_raw(); \
kvm_set_cpu_l1tf_flush_l1d(); \
@@ -356,6 +370,7 @@ __visible noinstr void func(struct pt_regs *regs) \
__irq_exit_raw(); \
instrumentation_end(); \
irqentry_exit(regs, state); \
+ user_pagetable_return(regs); \
} \
\
static __always_inline void __##func(struct pt_regs *regs)
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 31ac01c1155d..0203e73711a3 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -2037,9 +2037,11 @@ DEFINE_IDTENTRY_MCE_USER(exc_machine_check)
{
unsigned long dr7;

+ user_pagetable_exit();
dr7 = local_db_save();
run_idt(exc_machine_check_user, regs);
local_db_restore(dr7);
+ user_pagetable_enter();
}
#else
/* 32bit unified entry point */
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index 523d88c3fea1..f5d0f5d0c626 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -542,8 +542,10 @@ DEFINE_IDTENTRY_NMI(exc_nmi)

__visible noinstr void exc_nmi_user(struct pt_regs *regs)
{
+ user_pagetable_exit();
handle_nmi(regs);
mds_user_clear_cpu_buffers();
+ user_pagetable_enter();
}

void stop_nmi(void)
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 14d2d6f15184..76db3d5a2965 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -255,11 +255,13 @@ DEFINE_IDTENTRY_RAW(exc_invalid_op)
if (!user_mode(regs) && handle_bug(regs))
return;

+ user_pagetable_escape(regs);
state = irqentry_enter(regs);
instrumentation_begin();
run_idt(handle_invalid_op, regs);
instrumentation_end();
irqentry_exit(regs, state);
+ user_pagetable_return(regs);
}

DEFINE_IDTENTRY(exc_coproc_segment_overrun)
@@ -663,11 +665,13 @@ DEFINE_IDTENTRY_RAW(exc_int3)
* including NMI.
*/
if (user_mode(regs)) {
+ user_pagetable_exit();
irqentry_enter_from_user_mode(regs);
instrumentation_begin();
run_idt(do_int3_user, regs);
instrumentation_end();
irqentry_exit_to_user_mode(regs);
+ user_pagetable_enter();
} else {
bool irq_state = idtentry_enter_nmi(regs);
instrumentation_begin();
@@ -1001,7 +1005,9 @@ DEFINE_IDTENTRY_DEBUG(exc_debug)
/* User entry, runs on regular task stack */
DEFINE_IDTENTRY_DEBUG_USER(exc_debug)
{
+ user_pagetable_exit();
run_idt_errcode(exc_debug_user, regs, debug_read_clear_dr6());
+ user_pagetable_enter();
}
#else
/* 32 bit does not have separate entry points. */
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index b9d03603d95d..9ca79e86d0f0 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1440,9 +1440,11 @@ handle_page_fault(struct pt_regs *regs, unsigned long error_code,

DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault)
{
- unsigned long address = read_cr2();
+ unsigned long address;
irqentry_state_t state;

+ user_pagetable_escape(regs);
+ address = read_cr2();
prefetchw(&current->mm->mmap_lock);

/*
@@ -1466,8 +1468,10 @@ DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault)
* The async #PF handling code takes care of idtentry handling
* itself.
*/
- if (kvm_handle_async_pf(regs, (u32)address))
+ if (kvm_handle_async_pf(regs, (u32)address)) {
+ user_pagetable_return(regs);
return;
+ }

/*
* Entry handling for valid #PF from kernel mode is slightly
@@ -1486,4 +1490,5 @@ DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault)
instrumentation_end();

irqentry_exit(regs, state);
+ user_pagetable_return(regs);
}
--
2.18.4

2020-11-16 14:51:43

by Alexandre Chartre

[permalink] [raw]
Subject: [RFC][PATCH v2 08/21] x86/pti: Introduce per-task PTI trampoline stack

Double the size of the kernel stack when using PTI. The entire stack
is mapped into the kernel address space, and the top half of the stack
(the PTI stack) is also mapped into the user address space.

The PTI stack will be used as a per-task trampoline stack instead of
the current per-cpu trampoline stack. This will allow running more
code on the trampoline stack, in particular code that schedules the
task out.

Signed-off-by: Alexandre Chartre <[email protected]>
---
arch/x86/include/asm/page_64_types.h | 36 +++++++++++++++++++++++++++-
arch/x86/include/asm/processor.h | 3 +++
2 files changed, 38 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/page_64_types.h b/arch/x86/include/asm/page_64_types.h
index 3f49dac03617..733accc20fdb 100644
--- a/arch/x86/include/asm/page_64_types.h
+++ b/arch/x86/include/asm/page_64_types.h
@@ -12,7 +12,41 @@
#define KASAN_STACK_ORDER 0
#endif

-#define THREAD_SIZE_ORDER (2 + KASAN_STACK_ORDER)
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+/*
+ * PTI doubles the size of the stack. The entire stack is mapped into
+ * the kernel address space. However, only the top half of the stack is
+ * mapped into the user address space.
+ *
+ * On syscall or interrupt, user mode enters the kernel with the user
+ * page-table, and the stack pointer is switched to the top of the
+ * stack (which is mapped in the user address space and in the kernel).
+ * The syscall/interrupt handler will then later decide when to switch
+ * to the kernel address space, and to switch to the top of the kernel
+ * stack which is only mapped in the kernel.
+ *
+ * +-------------+
+ * | | ^ ^
+ * | kernel-only | | KERNEL_STACK_SIZE |
+ * | stack | | |
+ * | | V |
+ * +-------------+ <- top of kernel stack | THREAD_SIZE
+ * | | ^ |
+ * | kernel and | | KERNEL_STACK_SIZE |
+ * | PTI stack | | |
+ * | | V v
+ * +-------------+ <- top of stack
+ */
+#define PTI_STACK_ORDER 1
+#else
+#define PTI_STACK_ORDER 0
+#endif
+
+#define KERNEL_STACK_ORDER 2
+#define KERNEL_STACK_SIZE (PAGE_SIZE << KERNEL_STACK_ORDER)
+
+#define THREAD_SIZE_ORDER \
+ (KERNEL_STACK_ORDER + PTI_STACK_ORDER + KASAN_STACK_ORDER)
#define THREAD_SIZE (PAGE_SIZE << THREAD_SIZE_ORDER)

#define EXCEPTION_STACK_ORDER (0 + KASAN_STACK_ORDER)
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 82a08b585818..47b1b806535b 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -769,6 +769,9 @@ static inline void spin_lock_prefetch(const void *x)

#define task_top_of_stack(task) ((unsigned long)(task_pt_regs(task) + 1))

+#define task_top_of_kernel_stack(task) \
+ ((void *)(((unsigned long)task_stack_page(task)) + KERNEL_STACK_SIZE))
+
#define task_pt_regs(task) \
({ \
unsigned long __ptr = (unsigned long)task_stack_page(task); \
--
2.18.4

2020-11-16 14:52:06

by Alexandre Chartre

[permalink] [raw]
Subject: [RFC][PATCH v2 17/21] x86/pti: Execute page fault handler on the kernel stack

After a page fault from userland, the kernel is entered and it switches
the stack to the PTI stack which is mapped both in the kernel and in
the user page-table. When executing the page fault handler, switch
to the kernel stack (which is mapped only in the kernel page-table)
so that no kernel data leaks to userland through the stack.

Signed-off-by: Alexandre Chartre <[email protected]>
---
arch/x86/include/asm/idtentry.h | 17 +++++++++++++++++
arch/x86/mm/fault.c | 2 +-
2 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 0c5d9f027112..a6725afaaec0 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -31,6 +31,13 @@ void idtentry_exit_nmi(struct pt_regs *regs, bool irq_state);
(void (*)(void))(func), (void *)(arg1), (void *)(arg2)) : \
func(arg1, arg2))

+#define CALL_ON_STACK_3(stack, func, arg1, arg2, arg3) \
+ ((stack) ? \
+ asm_call_on_stack_3(stack, \
+ (void (*)(void))(func), (void *)(arg1), (void *)(arg2), \
+ (void *)(arg3)) : \
+ func(arg1, arg2, arg3))
+
/*
* Functions to return the top of the kernel stack if we are using the
* user page-table (and thus not running with the kernel stack). If we
@@ -66,6 +73,16 @@ void run_idt_errcode(void (*func)(struct pt_regs *, unsigned long),
CALL_ON_STACK_2(pti_kernel_stack(regs), func, regs, error_code);
}

+static __always_inline
+void run_idt_pagefault(void (*func)(struct pt_regs *, unsigned long,
+ unsigned long),
+ struct pt_regs *regs, unsigned long error_code,
+ unsigned long address)
+{
+ CALL_ON_STACK_3(pti_kernel_stack(regs),
+ func, regs, error_code, address);
+}
+
static __always_inline
void run_sysvec(void (*func)(struct pt_regs *regs), struct pt_regs *regs)
{
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 82bf37a5c9ec..b9d03603d95d 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1482,7 +1482,7 @@ DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault)
state = irqentry_enter(regs);

instrumentation_begin();
- handle_page_fault(regs, error_code, address);
+ run_idt_pagefault(handle_page_fault, regs, error_code, address);
instrumentation_end();

irqentry_exit(regs, state);
--
2.18.4

2020-11-16 14:52:12

by Alexandre Chartre

[permalink] [raw]
Subject: [RFC][PATCH v2 09/21] x86/pti: Function to clone page-table entries from a specified mm

PTI has a function to clone page-table entries but only from the
init_mm page-table. Provide a new function to clone page-table
entries from a specified mm page-table.
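
For illustration, the mm parameter makes the same routine usable for both
the init_mm user page-table and a per-task page-table. A minimal usage
sketch (not part of this patch; the per-task call mirrors what patch 11
later does for the PTI stack, and mm/addr are placeholders):

/* Clone kernel entry text into the init_mm user page-table. */
pti_clone_pgtable(&init_mm, (unsigned long) __entry_text_start,
                  (unsigned long) __entry_text_end, PTI_CLONE_PMD);

/* Clone a per-task range (e.g. the PTI stack) into a task page-table. */
pti_clone_pgtable(mm, addr, addr + KERNEL_STACK_SIZE, PTI_CLONE_PTE);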

Signed-off-by: Alexandre Chartre <[email protected]>
---
arch/x86/include/asm/pti.h | 10 ++++++++++
arch/x86/mm/pti.c | 32 ++++++++++++++++----------------
2 files changed, 26 insertions(+), 16 deletions(-)

diff --git a/arch/x86/include/asm/pti.h b/arch/x86/include/asm/pti.h
index 07375b476c4f..5484e69ff8d3 100644
--- a/arch/x86/include/asm/pti.h
+++ b/arch/x86/include/asm/pti.h
@@ -4,9 +4,19 @@
#ifndef __ASSEMBLY__

#ifdef CONFIG_PAGE_TABLE_ISOLATION
+
+enum pti_clone_level {
+ PTI_CLONE_PMD,
+ PTI_CLONE_PTE,
+};
+
+struct mm_struct;
+
extern void pti_init(void);
extern void pti_check_boottime_disable(void);
extern void pti_finalize(void);
+extern void pti_clone_pgtable(struct mm_struct *mm, unsigned long start,
+ unsigned long end, enum pti_clone_level level);
#else
static inline void pti_check_boottime_disable(void) { }
#endif
diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index 1aab92930569..ebc8cd2f1cd8 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -294,14 +294,8 @@ static void __init pti_setup_vsyscall(void)
static void __init pti_setup_vsyscall(void) { }
#endif

-enum pti_clone_level {
- PTI_CLONE_PMD,
- PTI_CLONE_PTE,
-};
-
-static void
-pti_clone_pgtable(unsigned long start, unsigned long end,
- enum pti_clone_level level)
+void pti_clone_pgtable(struct mm_struct *mm, unsigned long start,
+ unsigned long end, enum pti_clone_level level)
{
unsigned long addr;

@@ -320,7 +314,7 @@ pti_clone_pgtable(unsigned long start, unsigned long end,
if (addr < start)
break;

- pgd = pgd_offset_k(addr);
+ pgd = pgd_offset(mm, addr);
if (WARN_ON(pgd_none(*pgd)))
return;
p4d = p4d_offset(pgd, addr);
@@ -409,6 +403,12 @@ pti_clone_pgtable(unsigned long start, unsigned long end,
}
}

+static void pti_clone_init_pgtable(unsigned long start, unsigned long end,
+ enum pti_clone_level level)
+{
+ pti_clone_pgtable(&init_mm, start, end, level);
+}
+
#ifdef CONFIG_X86_64
/*
* Clone a single p4d (i.e. a top-level entry on 4-level systems and a
@@ -476,7 +476,7 @@ static void __init pti_clone_user_shared(void)
start = CPU_ENTRY_AREA_BASE;
end = start + (PAGE_SIZE * CPU_ENTRY_AREA_PAGES);

- pti_clone_pgtable(start, end, PTI_CLONE_PMD);
+ pti_clone_init_pgtable(start, end, PTI_CLONE_PMD);
}
#endif /* CONFIG_X86_64 */

@@ -495,9 +495,9 @@ static void __init pti_setup_espfix64(void)
*/
static void pti_clone_entry_text(void)
{
- pti_clone_pgtable((unsigned long) __entry_text_start,
- (unsigned long) __entry_text_end,
- PTI_CLONE_PMD);
+ pti_clone_init_pgtable((unsigned long) __entry_text_start,
+ (unsigned long) __entry_text_end,
+ PTI_CLONE_PMD);
}

/*
@@ -572,11 +572,11 @@ static void pti_clone_kernel_text(void)
* pti_set_kernel_image_nonglobal() did to clear the
* global bit.
*/
- pti_clone_pgtable(start, end_clone, PTI_LEVEL_KERNEL_IMAGE);
+ pti_clone_init_pgtable(start, end_clone, PTI_LEVEL_KERNEL_IMAGE);

/*
- * pti_clone_pgtable() will set the global bit in any PMDs
- * that it clones, but we also need to get any PTEs in
+ * pti_clone_init_pgtable() will set the global bit in any
+ * PMDs that it clones, but we also need to get any PTEs in
* the last level for areas that are not huge-page-aligned.
*/

--
2.18.4

2020-11-16 14:53:29

by Alexandre Chartre

[permalink] [raw]
Subject: [RFC][PATCH v2 11/21] x86/pti: Extend PTI user mappings

Extend PTI user mappings so that more kernel entry code can be executed
with the user page-table. To do so, we need to map syscall and interrupt
entry code, per-cpu offsets (__per_cpu_offset, which is used in some of the
entry code), the stack canary, and the PTI stack (which is defined per
task).

Signed-off-by: Alexandre Chartre <[email protected]>
---
arch/x86/entry/entry_64.S | 2 --
arch/x86/mm/pti.c | 19 +++++++++++++++++++
kernel/fork.c | 22 ++++++++++++++++++++++
3 files changed, 41 insertions(+), 2 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 6e0b5b010e0b..458af12ed9a1 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -274,7 +274,6 @@ SYM_FUNC_END(__switch_to_asm)
* rbx: kernel thread func (NULL for user thread)
* r12: kernel thread arg
*/
-.pushsection .text, "ax"
SYM_CODE_START(ret_from_fork)
UNWIND_HINT_REGS
movq %rsp, %rdi /* pt_regs */
@@ -284,7 +283,6 @@ SYM_CODE_START(ret_from_fork)
call return_from_fork /* returns with IRQs disabled */
jmp swapgs_restore_regs_and_return_to_usermode
SYM_CODE_END(ret_from_fork)
-.popsection

.macro DEBUG_ENTRY_ASSERT_IRQS_OFF
#ifdef CONFIG_DEBUG_ENTRY
diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index 71ca245d7b38..e4c6cb4a4840 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -449,6 +449,7 @@ static void __init pti_clone_percpu_page(void *addr)
*/
static void __init pti_clone_user_shared(void)
{
+ unsigned long start, end;
unsigned int cpu;

pti_clone_p4d(CPU_ENTRY_AREA_BASE);
@@ -465,7 +466,16 @@ static void __init pti_clone_user_shared(void)
*/
pti_clone_percpu_page(&per_cpu(cpu_tss_rw, cpu));

+ /*
+ * Map fixed_percpu_data to get the stack canary.
+ */
+ if (IS_ENABLED(CONFIG_STACKPROTECTOR))
+ pti_clone_percpu_page(&per_cpu(fixed_percpu_data, cpu));
}
+
+ start = (unsigned long)__per_cpu_offset;
+ end = start + sizeof(__per_cpu_offset);
+ pti_clone_init_pgtable(start, end, PTI_CLONE_PTE);
}

#else /* CONFIG_X86_64 */
@@ -505,6 +515,15 @@ static void pti_clone_entry_text(void)
pti_clone_init_pgtable((unsigned long) __entry_text_start,
(unsigned long) __entry_text_end,
PTI_CLONE_PMD);
+
+ /*
+ * Syscall and interrupt entry code (which is in the noinstr
+ * section) will be entered with the user page-table, so that
+ * code has to be mapped in.
+ */
+ pti_clone_init_pgtable((unsigned long) __noinstr_text_start,
+ (unsigned long) __noinstr_text_end,
+ PTI_CLONE_PMD);
}

/*
diff --git a/kernel/fork.c b/kernel/fork.c
index 6d266388d380..31cd77dbdba3 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -999,6 +999,25 @@ static void mm_init_uprobes_state(struct mm_struct *mm)
#endif
}

+static void mm_map_task(struct mm_struct *mm, struct task_struct *tsk)
+{
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+ unsigned long addr;
+
+ if (!tsk || !static_cpu_has(X86_FEATURE_PTI))
+ return;
+
+ /*
+ * Map the task stack after the kernel stack into the user
+ * address space, so that this stack can be used when entering
+ * syscall or interrupt from user mode.
+ */
+ BUG_ON(!task_stack_page(tsk));
+ addr = (unsigned long)task_top_of_kernel_stack(tsk);
+ pti_clone_pgtable(mm, addr, addr + KERNEL_STACK_SIZE, PTI_CLONE_PTE);
+#endif
+}
+
static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
struct user_namespace *user_ns)
{
@@ -1043,6 +1062,8 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
if (init_new_context(p, mm))
goto fail_nocontext;

+ mm_map_task(mm, p);
+
mm->user_ns = get_user_ns(user_ns);
return mm;

@@ -1404,6 +1425,7 @@ static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
vmacache_flush(tsk);

if (clone_flags & CLONE_VM) {
+ mm_map_task(oldmm, tsk);
mmget(oldmm);
mm = oldmm;
goto good_mm;
--
2.18.4

2020-11-16 14:53:39

by Alexandre Chartre

[permalink] [raw]
Subject: [RFC][PATCH v2 02/21] x86/entry: Update asm_call_on_stack to support more function arguments

Update the asm_call_on_stack() function so that it can be invoked
with a function having up to three arguments instead of only one.
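
For illustration, a C caller could then run a three-argument handler on a
given stack as sketched below (hypothetical wrapper, not part of this patch;
patch 17 adds a similar CALL_ON_STACK_3() macro):

/*
 * Illustrative sketch only: invoke func(regs, error_code, address) on
 * the stack whose top word is pointed to by 'stack'.
 */
static void call_handler_on_stack(void *stack,
                                  void (*func)(struct pt_regs *, unsigned long,
                                               unsigned long),
                                  struct pt_regs *regs, unsigned long error_code,
                                  unsigned long address)
{
        asm_call_on_stack_3(stack, (void (*)(void))func, (void *)regs,
                            (void *)error_code, (void *)address);
}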

Signed-off-by: Alexandre Chartre <[email protected]>
---
arch/x86/entry/entry_64.S | 15 +++++++++++----
arch/x86/include/asm/irq_stack.h | 8 ++++++++
2 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index cad08703c4ad..c42948aca0a8 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -759,9 +759,14 @@ SYM_CODE_END(.Lbad_gs)
/*
* rdi: New stack pointer points to the top word of the stack
* rsi: Function pointer
- * rdx: Function argument (can be NULL if none)
+ * rdx: Function argument 1 (can be NULL if none)
+ * rcx: Function argument 2 (can be NULL if none)
+ * r8 : Function argument 3 (can be NULL if none)
*/
SYM_FUNC_START(asm_call_on_stack)
+SYM_FUNC_START(asm_call_on_stack_1)
+SYM_FUNC_START(asm_call_on_stack_2)
+SYM_FUNC_START(asm_call_on_stack_3)
SYM_INNER_LABEL(asm_call_sysvec_on_stack, SYM_L_GLOBAL)
SYM_INNER_LABEL(asm_call_irq_on_stack, SYM_L_GLOBAL)
/*
@@ -777,15 +782,17 @@ SYM_INNER_LABEL(asm_call_irq_on_stack, SYM_L_GLOBAL)
*/
mov %rsp, (%rdi)
mov %rdi, %rsp
- /* Move the argument to the right place */
+ mov %rsi, %rax
+ /* Move arguments to the right place */
mov %rdx, %rdi
-
+ mov %rcx, %rsi
+ mov %r8, %rdx
1:
.pushsection .discard.instr_begin
.long 1b - .
.popsection

- CALL_NOSPEC rsi
+ CALL_NOSPEC rax

2:
.pushsection .discard.instr_end
diff --git a/arch/x86/include/asm/irq_stack.h b/arch/x86/include/asm/irq_stack.h
index 775816965c6a..359427216336 100644
--- a/arch/x86/include/asm/irq_stack.h
+++ b/arch/x86/include/asm/irq_stack.h
@@ -13,6 +13,14 @@ static __always_inline bool irqstack_active(void)
}

void asm_call_on_stack(void *sp, void (*func)(void), void *arg);
+
+void asm_call_on_stack_1(void *sp, void (*func)(void),
+ void *arg1);
+void asm_call_on_stack_2(void *sp, void (*func)(void),
+ void *arg1, void *arg2);
+void asm_call_on_stack_3(void *sp, void (*func)(void),
+ void *arg1, void *arg2, void *arg3);
+
void asm_call_sysvec_on_stack(void *sp, void (*func)(struct pt_regs *regs),
struct pt_regs *regs);
void asm_call_irq_on_stack(void *sp, void (*func)(struct irq_desc *desc),
--
2.18.4

2020-11-16 14:53:48

by Alexandre Chartre

[permalink] [raw]
Subject: [RFC][PATCH v2 21/21] x86/pti: Use a different stack canary with the user and kernel page-table

Using stack protector requires the stack canary to be mapped into
the current page-table. Now that the page-table switch between the
user and kernel page-table is deferred to C code, stack protector can
be used while the user page-table is active and so the stack canary
is mapped into the user page-table.

To prevent leaking the stack canary used with the kernel page-table,
use a different canary with the user and kernel page-table. The stack
canary is changed when switching the page-table.
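
Restated as a sketch (simplified from the entry-common.h changes below, with
the X86_FEATURE_PTI checks omitted), the ordering is: switch to the kernel
page-table first and then load the kernel canary; load the user canary first
and then switch to the user page-table:

static __always_inline void user_pagetable_exit(void)
{
        switch_to_kernel_cr3(__native_read_cr3());  /* kernel page-table first... */
        set_stack_canary_kernel();                  /* ...then the kernel canary */
}

static __always_inline void user_pagetable_enter(void)
{
        set_stack_canary_user();                    /* user canary first... */
        switch_to_user_cr3(__native_read_cr3());    /* ...then the user page-table */
}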

Signed-off-by: Alexandre Chartre <[email protected]>
---
arch/x86/include/asm/entry-common.h | 56 ++++++++++++++++++++++++++-
arch/x86/include/asm/stackprotector.h | 35 +++++++++++------
arch/x86/kernel/sev-es.c | 18 +++++++++
include/linux/sched.h | 8 ++++
kernel/fork.c | 3 ++
5 files changed, 107 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/entry-common.h b/arch/x86/include/asm/entry-common.h
index e01735a181b8..5b4d0e3237a3 100644
--- a/arch/x86/include/asm/entry-common.h
+++ b/arch/x86/include/asm/entry-common.h
@@ -96,6 +96,52 @@ static __always_inline void arch_exit_to_user_mode(void)
#define PTI_USER_PGTABLE_AND_PCID_MASK \
(PTI_USER_PCID_MASK | PTI_USER_PGTABLE_MASK)

+/*
+ * Functions to set the stack canary to the kernel or user value:
+ *
+ * The kernel stack canary should be used when running with the kernel
+ * page-table, and the user stack canary should be used when running
+ * with the user page-table. Also the kernel stack canary should not
+ * leak to the user page-table.
+ *
+ * So the stack canary should be set to the kernel value when entering
+ * the kernel from userspace *after* switching to the kernel page-table.
+ * And the stack canary should be set to the user value when returning
+ * to userspace *before* switching to the user page-table.
+ *
+ * In both cases, there is a window (between the page-table switch and
+ * the stack canary setting) where we will be running with the kernel
+ * page-table and the user stack canary. This window should be as small
+ * as possible and, ideally, it should:
+ * - not call functions which require the stack protector to be used;
+ * - have interrupt disabled to prevent interrupt handlers from being
+ * processed with the user stack canary (but there is nothing we can
+ * do for NMIs).
+ */
+static __always_inline void set_stack_canary_kernel(void)
+{
+ this_cpu_write(fixed_percpu_data.stack_canary,
+ current->stack_canary);
+}
+
+static __always_inline void set_stack_canary_user(void)
+{
+ this_cpu_write(fixed_percpu_data.stack_canary,
+ current->stack_canary_user);
+}
+
+static __always_inline void switch_to_kernel_stack_canary(unsigned long cr3)
+{
+ if (cr3 & PTI_USER_PGTABLE_MASK)
+ set_stack_canary_kernel();
+}
+
+static __always_inline void restore_stack_canary(unsigned long cr3)
+{
+ if (cr3 & PTI_USER_PGTABLE_MASK)
+ set_stack_canary_user();
+}
+
static __always_inline void write_kernel_cr3(unsigned long cr3)
{
if (static_cpu_has(X86_FEATURE_PCID))
@@ -155,8 +201,10 @@ static __always_inline unsigned long save_and_switch_to_kernel_cr3(void)
return 0;

cr3 = __native_read_cr3();
- if (cr3 & PTI_USER_PGTABLE_MASK)
+ if (cr3 & PTI_USER_PGTABLE_MASK) {
switch_to_kernel_cr3(cr3);
+ set_stack_canary_kernel();
+ }

return cr3;
}
@@ -167,6 +215,7 @@ static __always_inline void restore_cr3(unsigned long cr3)
return;

if (cr3 & PTI_USER_PGTABLE_MASK) {
+ set_stack_canary_user();
switch_to_user_cr3(cr3);
} else {
/*
@@ -182,6 +231,7 @@ static __always_inline void user_pagetable_enter(void)
if (!static_cpu_has(X86_FEATURE_PTI))
return;

+ set_stack_canary_user();
switch_to_user_cr3(__native_read_cr3());
}

@@ -191,6 +241,7 @@ static __always_inline void user_pagetable_exit(void)
return;

switch_to_kernel_cr3(__native_read_cr3());
+ set_stack_canary_kernel();
}

static __always_inline void user_pagetable_return(struct pt_regs *regs)
@@ -218,6 +269,9 @@ static __always_inline void user_pagetable_exit(void) {};
static __always_inline void user_pagetable_return(struct pt_regs *regs) {};
static __always_inline void user_pagetable_escape(struct pt_regs *regs) {};

+static __always_inline void switch_to_kernel_stack_canary(unsigned long cr3) {}
+static __always_inline void restore_stack_canary(unsigned long cr3) {}
+
#endif /* CONFIG_PAGE_TABLE_ISOLATION */
#endif /* MODULE */

diff --git a/arch/x86/include/asm/stackprotector.h b/arch/x86/include/asm/stackprotector.h
index 7fb482f0f25b..be6c051bafe3 100644
--- a/arch/x86/include/asm/stackprotector.h
+++ b/arch/x86/include/asm/stackprotector.h
@@ -52,6 +52,25 @@
#define GDT_STACK_CANARY_INIT \
[GDT_ENTRY_STACK_CANARY] = GDT_ENTRY_INIT(0x4090, 0, 0x18),

+static __always_inline u64 boot_get_random_canary(void)
+{
+ u64 canary;
+ u64 tsc;
+
+ /*
+ * We both use the random pool and the current TSC as a source
+ * of randomness. The TSC only matters for very early init,
+ * there it already has some randomness on most systems. Later
+ * on during the bootup the random pool has true entropy too.
+ */
+ get_random_bytes(&canary, sizeof(canary));
+ tsc = rdtsc();
+ canary += tsc + (tsc << 32UL);
+ canary &= CANARY_MASK;
+
+ return canary;
+}
+
/*
* Initialize the stackprotector canary value.
*
@@ -66,23 +85,15 @@
static __always_inline void boot_init_stack_canary(void)
{
u64 canary;
- u64 tsc;

#ifdef CONFIG_X86_64
BUILD_BUG_ON(offsetof(struct fixed_percpu_data, stack_canary) != 40);
#endif
- /*
- * We both use the random pool and the current TSC as a source
- * of randomness. The TSC only matters for very early init,
- * there it already has some randomness on most systems. Later
- * on during the bootup the random pool has true entropy too.
- */
- get_random_bytes(&canary, sizeof(canary));
- tsc = rdtsc();
- canary += tsc + (tsc << 32UL);
- canary &= CANARY_MASK;
-
+ canary = boot_get_random_canary();
current->stack_canary = canary;
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+ current->stack_canary_user = boot_get_random_canary();
+#endif
#ifdef CONFIG_X86_64
this_cpu_write(fixed_percpu_data.stack_canary, canary);
#else
diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c
index 59fc7c472525..614fbef497bd 100644
--- a/arch/x86/kernel/sev-es.c
+++ b/arch/x86/kernel/sev-es.c
@@ -1387,6 +1387,14 @@ DEFINE_IDTENTRY_VC_SETUP_STACK(exc_vmm_communication)
frame->regs = *regs;
frame->saved_cr3 = saved_cr3;

+ /*
+ * save_and_switch_to_kernel_cr3() has switched the stack canary
+ * to the kernel stack canary. However, depending on the saved
+ * CR3 value, this function may have been entered with the user
+ * stack canary. So restore the stack canary before returning.
+ */
+ restore_stack_canary(saved_cr3);
+
return sp;
}

@@ -1394,6 +1402,16 @@ DEFINE_IDTENTRY_VC(exc_vmm_communication)
{
struct exc_vc_frame *frame = (struct exc_vc_frame *)regs;

+ /*
+ * The VC setup stack function has switched to the kernel CR3
+ * but not to the kernel stack canary. Switch to the kernel
+ * stack canary now that we are using the kernel page-table.
+ *
+ * The original stack canary will be restored by the
+ * restore_cr3() function.
+ */
+ switch_to_kernel_stack_canary(frame->saved_cr3);
+
if (likely(!on_vc_fallback_stack(regs)))
safe_stack_exc_vmm_communication(regs, error_code);
else
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 063cd120b459..a0199c5d8ae1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -816,6 +816,14 @@ struct task_struct {
#ifdef CONFIG_STACKPROTECTOR
/* Canary value for the -fstack-protector GCC feature: */
unsigned long stack_canary;
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+ /*
+ * With PTI, stack_canary_user is used when we are in the
+ * kernel but using the user page-table. Once we have switched
+ * to the kernel page-table, stack_canary is used instead.
+ */
+ unsigned long stack_canary_user;
+#endif
#endif
/*
* Pointers to the (original) parent process, youngest child, younger sibling,
diff --git a/kernel/fork.c b/kernel/fork.c
index 31cd77dbdba3..70eaba4d8191 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -909,6 +909,9 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)

#ifdef CONFIG_STACKPROTECTOR
tsk->stack_canary = get_random_canary();
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+ tsk->stack_canary_user = get_random_canary();
+#endif
#endif
if (orig->cpus_ptr == &orig->cpus_mask)
tsk->cpus_ptr = &tsk->cpus_mask;
--
2.18.4

2020-11-16 14:54:00

by Alexandre Chartre

[permalink] [raw]
Subject: [RFC][PATCH v2 06/21] x86/pti: Provide C variants of PTI switch CR3 macros

Page Table Isolation (PTI) uses assembly macros to switch the CR3
register between the kernel and user page-tables. Add C functions which
implement the same features. For now, these C functions are not
used, but they will eventually replace the assembly macros.
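
As an illustration of the intended usage (not part of this patch; the
handler name is hypothetical), a C entry point could eventually bracket
its body with these helpers instead of the assembly macros:

noinstr void example_handler(struct pt_regs *regs)
{
        unsigned long saved_cr3;

        /* Switch to the kernel page-table, remembering the entry CR3. */
        saved_cr3 = save_and_switch_to_kernel_cr3();

        /* ... handler body runs on the kernel page-table ... */

        /* Switch back to whatever page-table we entered with. */
        restore_cr3(saved_cr3);
}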

Signed-off-by: Alexandre Chartre <[email protected]>
---
arch/x86/include/asm/entry-common.h | 127 ++++++++++++++++++++++++++++
1 file changed, 127 insertions(+)

diff --git a/arch/x86/include/asm/entry-common.h b/arch/x86/include/asm/entry-common.h
index 6fe54b2813c1..46682b1433a4 100644
--- a/arch/x86/include/asm/entry-common.h
+++ b/arch/x86/include/asm/entry-common.h
@@ -7,6 +7,7 @@
#include <asm/nospec-branch.h>
#include <asm/io_bitmap.h>
#include <asm/fpu/api.h>
+#include <asm/tlbflush.h>

/* Check that the stack and regs on entry from user mode are sane. */
static __always_inline void arch_check_user_regs(struct pt_regs *regs)
@@ -81,4 +82,130 @@ static __always_inline void arch_exit_to_user_mode(void)
}
#define arch_exit_to_user_mode arch_exit_to_user_mode

+#ifndef MODULE
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+
+/*
+ * PAGE_TABLE_ISOLATION PGDs are 8k. Flip bit 12 to switch between the two
+ * halves:
+ */
+#define PTI_USER_PGTABLE_BIT PAGE_SHIFT
+#define PTI_USER_PGTABLE_MASK (1 << PTI_USER_PGTABLE_BIT)
+#define PTI_USER_PCID_BIT X86_CR3_PTI_PCID_USER_BIT
+#define PTI_USER_PCID_MASK (1 << PTI_USER_PCID_BIT)
+#define PTI_USER_PGTABLE_AND_PCID_MASK \
+ (PTI_USER_PCID_MASK | PTI_USER_PGTABLE_MASK)
+
+static __always_inline void write_kernel_cr3(unsigned long cr3)
+{
+ if (static_cpu_has(X86_FEATURE_PCID))
+ cr3 |= X86_CR3_PCID_NOFLUSH;
+
+ native_write_cr3(cr3);
+}
+
+static __always_inline void write_user_cr3(unsigned long cr3)
+{
+ unsigned short mask;
+ unsigned long asid;
+
+ if (static_cpu_has(X86_FEATURE_PCID)) {
+ /*
+ * Test if the ASID needs a flush.
+ */
+ asid = cr3 & 0x7ff;
+ mask = this_cpu_read(cpu_tlbstate.user_pcid_flush_mask);
+ if (mask & (1 << asid)) {
+ /* Flush needed, clear the bit */
+ this_cpu_and(cpu_tlbstate.user_pcid_flush_mask,
+ ~(1 << asid));
+ } else {
+ cr3 |= X86_CR3_PCID_NOFLUSH;
+ }
+ }
+
+ native_write_cr3(cr3);
+}
+
+static __always_inline void switch_to_kernel_cr3(unsigned long cr3)
+{
+ /*
+ * Clear PCID and "PAGE_TABLE_ISOLATION bit", point CR3
+ * at kernel pagetables.
+ */
+ write_kernel_cr3(cr3 & ~PTI_USER_PGTABLE_AND_PCID_MASK);
+}
+
+static __always_inline void switch_to_user_cr3(unsigned long cr3)
+{
+ if (static_cpu_has(X86_FEATURE_PCID)) {
+ /* Flip the ASID to the user version */
+ cr3 |= PTI_USER_PCID_MASK;
+ }
+
+ /* Flip the PGD to the user version */
+ write_user_cr3(cr3 | PTI_USER_PGTABLE_MASK);
+}
+
+static __always_inline unsigned long save_and_switch_to_kernel_cr3(void)
+{
+ unsigned long cr3;
+
+ if (!static_cpu_has(X86_FEATURE_PTI))
+ return 0;
+
+ cr3 = __native_read_cr3();
+ if (cr3 & PTI_USER_PGTABLE_MASK)
+ switch_to_kernel_cr3(cr3);
+
+ return cr3;
+}
+
+static __always_inline void restore_cr3(unsigned long cr3)
+{
+ if (!static_cpu_has(X86_FEATURE_PTI))
+ return;
+
+ if (cr3 & PTI_USER_PGTABLE_MASK) {
+ switch_to_user_cr3(cr3);
+ } else {
+ /*
+ * The CR3 write could be avoided when not changing
+ * its value, but would require a CR3 read.
+ */
+ write_kernel_cr3(cr3);
+ }
+}
+
+static __always_inline void user_pagetable_enter(void)
+{
+ if (!static_cpu_has(X86_FEATURE_PTI))
+ return;
+
+ switch_to_user_cr3(__native_read_cr3());
+}
+
+static __always_inline void user_pagetable_exit(void)
+{
+ if (!static_cpu_has(X86_FEATURE_PTI))
+ return;
+
+ switch_to_kernel_cr3(__native_read_cr3());
+}
+
+
+#else /* CONFIG_PAGE_TABLE_ISOLATION */
+
+static __always_inline unsigned long save_and_switch_to_kernel_cr3(void)
+{
+ return 0;
+}
+static __always_inline void restore_cr3(unsigned long cr3) {}
+
+static __always_inline void user_pagetable_enter(void) {};
+static __always_inline void user_pagetable_exit(void) {};
+
+#endif /* CONFIG_PAGE_TABLE_ISOLATION */
+#endif /* MODULE */
+
#endif
--
2.18.4

2020-11-16 16:59:57

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 21/21] x86/pti: Use a different stack canary with the user and kernel page-table

On Mon, Nov 16, 2020 at 6:48 AM Alexandre Chartre
<[email protected]> wrote:
>
> Using stack protector requires the stack canary to be mapped into
> the current page-table. Now that the page-table switch between the
> user and kernel page-table is deferred to C code, stack protector can
> be used while the user page-table is active and so the stack canary
> is mapped into the user page-table.
>
> To prevent leaking the stack canary used with the kernel page-table,
> use a different canary with the user and kernel page-table. The stack
> canary is changed when switching the page-table.

Unless I've missed something, this doesn't have the security
properties we want. One CPU can be executing with kernel CR3, and
another CPU can read the stack canary using Meltdown. I think that
doing this safely requires mapping a different page with the stack
canary in the two pagetables.

2020-11-16 17:03:06

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 12/21] x86/pti: Use PTI stack instead of trampoline stack

On Mon, Nov 16, 2020 at 6:47 AM Alexandre Chartre
<[email protected]> wrote:
>
> When entering the kernel from userland, use the per-task PTI stack
> instead of the per-cpu trampoline stack. Like the trampoline stack,
> the PTI stack is mapped both in the kernel and in the user page-table.
> Using a per-task stack which is mapped into the kernel and the user
> page-table instead of a per-cpu stack will allow executing more code
> before switching to the kernel stack and to the kernel page-table.

Why?

I can't immediately evaluate how nasty the page table setup is because
it's not in this patch. But AFAICS the only thing that this enables
is sleeping with user page tables. Do we really need to do that?

2020-11-16 18:14:18

by Alexandre Chartre

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 12/21] x86/pti: Use PTI stack instead of trampoline stack


On 11/16/20 5:57 PM, Andy Lutomirski wrote:
> On Mon, Nov 16, 2020 at 6:47 AM Alexandre Chartre
> <[email protected]> wrote:
>>
>> When entering the kernel from userland, use the per-task PTI stack
>> instead of the per-cpu trampoline stack. Like the trampoline stack,
>> the PTI stack is mapped both in the kernel and in the user page-table.
>> Using a per-task stack which is mapped into the kernel and the user
>> page-table instead of a per-cpu stack will allow executing more code
>> before switching to the kernel stack and to the kernel page-table.
>
> Why?

When executing more code in the kernel, we are likely to reach a point
where we need to sleep while we are using the user page-table, so we need
to be using a per-thread stack.

> I can't immediately evaluate how nasty the page table setup is because
> it's not in this patch.

The page-table is the regular page-table as introduced by PTI. It is just
augmented with a few additional mappings which are in patch 11 (x86/pti:
Extend PTI user mappings).

> But AFAICS the only thing that this enables is sleeping with user pagetables.

That's precisely the point: it allows sleeping with the user page-table.

> Do we really need to do that?

Actually, probably not with this particular patchset, because I do the page-table
switch at the very beginning and end of the C handler. I had some code where I
moved the page-table switch deeper in the kernel handler where you definitely
can sleep (for example, if you switch back to the user page-table before
exit_to_user_mode_prepare()).

So a first step should probably be to not introduce the per-task PTI trampoline stack,
and stick with the existing trampoline stack. The per-task PTI trampoline stack can
be introduced later when the page-table switch is moved deeper in the C handler and
we can effectively sleep while using the user page-table.

alex.

2020-11-16 18:36:49

by Alexandre Chartre

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 21/21] x86/pti: Use a different stack canary with the user and kernel page-table


On 11/16/20 5:56 PM, Andy Lutomirski wrote:
> On Mon, Nov 16, 2020 at 6:48 AM Alexandre Chartre
> <[email protected]> wrote:
>>
>> Using stack protector requires the stack canary to be mapped into
>> the current page-table. Now that the page-table switch between the
>> user and kernel page-table is deferred to C code, stack protector can
>> be used while the user page-table is active and so the stack canary
>> is mapped into the user page-table.
>>
>> To prevent leaking the stack canary used with the kernel page-table,
>> use a different canary with the user and kernel page-table. The stack
>> canary is changed when switching the page-table.
>
> Unless I've missed something, this doesn't have the security
> properties we want. One CPU can be executing with kernel CR3, and
> another CPU can read the stack canary using Meltdown.

I think you are right because we have the mapping to the stack canary in
the user page-table. From userspace, we will only read the user stack canary,
but using Meltdown we can speculatively read the kernel stack canary which
will be stored at the same place.

> I think that doing this safely requires mapping a different page with
> the stack canary in the two pagetables.

Right.

alex.

2020-11-16 18:38:06

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 12/21] x86/pti: Use PTI stack instead of trampoline stack

On Mon, Nov 16, 2020 at 10:10 AM Alexandre Chartre
<[email protected]> wrote:
>
>
> On 11/16/20 5:57 PM, Andy Lutomirski wrote:
> > On Mon, Nov 16, 2020 at 6:47 AM Alexandre Chartre
> > <[email protected]> wrote:
> >>
> >> When entering the kernel from userland, use the per-task PTI stack
> >> instead of the per-cpu trampoline stack. Like the trampoline stack,
> >> the PTI stack is mapped both in the kernel and in the user page-table.
> >> Using a per-task stack which is mapped into the kernel and the user
> >> page-table instead of a per-cpu stack will allow executing more code
> >> before switching to the kernel stack and to the kernel page-table.
> >
> > Why?
>
> When executing more code in the kernel, we are likely to reach a point
> where we need to sleep while we are using the user page-table, so we need
> to be using a per-thread stack.
>
> > I can't immediately evaluate how nasty the page table setup is because
> > it's not in this patch.
>
> The page-table is the regular page-table as introduced by PTI. It is just
> augmented with a few additional mapping which are in patch 11 (x86/pti:
> Extend PTI user mappings).
>
> > But AFAICS the only thing that this enables is sleeping with user pagetables.
>
> That's precisely the point, it allows to sleep with the user page-table.
>
> > Do we really need to do that?
>
> Actually, probably not with this particular patchset, because I do the page-table
> switch at the very beginning and end of the C handler. I had some code where I
> moved the page-table switch deeper in the kernel handler where you definitively
> can sleep (for example, if you switch back to the user page-table before
> exit_to_user_mode_prepare()).
>
> So a first step should probably be to not introduce the per-task PTI trampoline stack,
> and stick with the existing trampoline stack. The per-task PTI trampoline stack can
> be introduced later when the page-table switch is moved deeper in the C handler and
> we can effectively sleep while using the user page-table.

Seems reasonable.

Where is the code that allocates and frees these stacks hiding? I
think I should at least read it.

2020-11-16 21:33:49

by Alexandre Chartre

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 12/21] x86/pti: Use PTI stack instead of trampoline stack



On 11/16/20 7:34 PM, Andy Lutomirski wrote:
> On Mon, Nov 16, 2020 at 10:10 AM Alexandre Chartre
> <[email protected]> wrote:
>>
>>
>> On 11/16/20 5:57 PM, Andy Lutomirski wrote:
>>> On Mon, Nov 16, 2020 at 6:47 AM Alexandre Chartre
>>> <[email protected]> wrote:
>>>>
>>>> When entering the kernel from userland, use the per-task PTI stack
>>>> instead of the per-cpu trampoline stack. Like the trampoline stack,
>>>> the PTI stack is mapped both in the kernel and in the user page-table.
>>>> Using a per-task stack which is mapped into the kernel and the user
>>>> page-table instead of a per-cpu stack will allow executing more code
>>>> before switching to the kernel stack and to the kernel page-table.
>>>
>>> Why?
>>
>> When executing more code in the kernel, we are likely to reach a point
>> where we need to sleep while we are using the user page-table, so we need
>> to be using a per-thread stack.
>>
>>> I can't immediately evaluate how nasty the page table setup is because
>>> it's not in this patch.
>>
>> The page-table is the regular page-table as introduced by PTI. It is just
>> augmented with a few additional mapping which are in patch 11 (x86/pti:
>> Extend PTI user mappings).
>>
>>> But AFAICS the only thing that this enables is sleeping with user pagetables.
>>
>> That's precisely the point, it allows to sleep with the user page-table.
>>
>>> Do we really need to do that?
>>
>> Actually, probably not with this particular patchset, because I do the page-table
>> switch at the very beginning and end of the C handler. I had some code where I
>> moved the page-table switch deeper in the kernel handler where you definitively
>> can sleep (for example, if you switch back to the user page-table before
>> exit_to_user_mode_prepare()).
>>
>> So a first step should probably be to not introduce the per-task PTI trampoline stack,
>> and stick with the existing trampoline stack. The per-task PTI trampoline stack can
>> be introduced later when the page-table switch is moved deeper in the C handler and
>> we can effectively sleep while using the user page-table.
>
> Seems reasonable.
>
> Where is the code that allocates and frees these stacks hiding? I
> think I should at least read it.

Stacks are allocated/freed with the task stack, this code is unchanged (see
alloc_thread_stack_node()). The trick is that I have doubled the THREAD_SIZE
(patch 8 "x86/pti: Introduce per-task PTI trampoline stack"). Half the stack
is used as the kernel stack (mapped only in the kernel page-table), the
other half is used as the PTI stack (mapped in the kernel and user page-table).
The mapping to the user page-table is done in mm_map_task() in fork.c (patch 11
"x86/pti: Extend PTI user mappings").

alex.

2020-11-16 22:09:48

by Borislav Petkov

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code

On Mon, Nov 16, 2020 at 03:47:36PM +0100, Alexandre Chartre wrote:
> Deferring CR3 switch to C code means that we need to run more of the
> kernel entry code with the user page-table. To do so, we need to:
>
> - map more syscall, interrupt and exception entry code into the user
> page-table (map all noinstr code);
>
> - map additional data used in the entry code (such as stack canary);
>
> - run more entry code on the trampoline stack (which is mapped both
> in the kernel and in the user page-table) until we switch to the
> kernel page-table and then switch to the kernel stack;

So PTI was added exactly to *not* have kernel memory mapped in the user
page table. You're partially reversing that...

> - have a per-task trampoline stack instead of a per-cpu trampoline
> stack, so the task can be scheduled out while it hasn't switched
> to the kernel stack.

per-task? How much more memory is that per task?

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2020-11-16 22:52:59

by David Laight

[permalink] [raw]
Subject: RE: [RFC][PATCH v2 12/21] x86/pti: Use PTI stack instead of trampoline stack

From: Alexandre Chartre
> Sent: 16 November 2020 18:10
>
> On 11/16/20 5:57 PM, Andy Lutomirski wrote:
> > On Mon, Nov 16, 2020 at 6:47 AM Alexandre Chartre
> > <[email protected]> wrote:
> >>
> >> When entering the kernel from userland, use the per-task PTI stack
> >> instead of the per-cpu trampoline stack. Like the trampoline stack,
> >> the PTI stack is mapped both in the kernel and in the user page-table.
> >> Using a per-task stack which is mapped into the kernel and the user
> >> page-table instead of a per-cpu stack will allow executing more code
> >> before switching to the kernel stack and to the kernel page-table.
> >
> > Why?
>
> When executing more code in the kernel, we are likely to reach a point
> where we need to sleep while we are using the user page-table, so we need
> to be using a per-thread stack.

Isn't that going to allocate a lot more kernel memory?

ISTR some thoughts about using dynamically allocated kernel
stacks when (at least some) wakeups are done by directly
restarting the system call - so that the sleeping thread
doesn't even need a kernel stack.
(I can't remember if that was linux or one of the BSDs)

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2020-11-16 23:13:48

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 11/21] x86/pti: Extend PTI user mappings

On Mon, Nov 16, 2020 at 12:18 PM Alexandre Chartre
<[email protected]> wrote:
>
>
> On 11/16/20 8:48 PM, Andy Lutomirski wrote:
> > On Mon, Nov 16, 2020 at 6:49 AM Alexandre Chartre
> > <[email protected]> wrote:
> >>
> >> Extend PTI user mappings so that more kernel entry code can be executed
> >> with the user page-table. To do so, we need to map syscall and interrupt
> >> entry code, per cpu offsets (__per_cpu_offset, which is used some in
> >> entry code), the stack canary, and the PTI stack (which is defined per
> >> task).
> >
> > Does anything unmap the PTI stack? Mapping is easy, and unmapping
> > could be a pretty big mess.
> >
>
> No, there's no unmap. The mapping exists as long as the task page-table
> does (i.e. as long as the task mm exists). I assume that the task stack
> and mm are freed at the same time but that's not something I have checked.
>

Nope. A multi-threaded mm will free task stacks when the task exits,
but the mm may outlive the individual tasks. Additionally, if you
allocate page tables as part of mapping PTI stacks, you need to make
sure the pagetables are freed. Finally, you need to make sure that
the PTI stacks have appropriate guard pages -- just doubling the
allocation is not safe enough.

My intuition is that this is going to be far more complexity than is justified.

2020-11-17 01:54:10

by Alexandre Chartre

[permalink] [raw]
Subject: [RFC][PATCH v2 16/21] x86/pti: Execute system vector handlers on the kernel stack

After an interrupt/exception in userland, the kernel is entered
and it switches the stack to the PTI stack which is mapped both in
the kernel and in the user page-table. When executing the interrupt
function, switch to the kernel stack (which is mapped only in the
kernel page-table) so that no kernel data leaks to userland
through the stack.

Change system vector handlers to execute on the kernel stack.

Signed-off-by: Alexandre Chartre <[email protected]>
---
arch/x86/include/asm/idtentry.h | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index a82e31b45442..0c5d9f027112 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -66,6 +66,17 @@ void run_idt_errcode(void (*func)(struct pt_regs *, unsigned long),
CALL_ON_STACK_2(pti_kernel_stack(regs), func, regs, error_code);
}

+static __always_inline
+void run_sysvec(void (*func)(struct pt_regs *regs), struct pt_regs *regs)
+{
+ void *stack = pti_kernel_stack(regs);
+
+ if (stack)
+ asm_call_on_stack_1(stack, (void (*)(void))func, regs);
+ else
+ run_sysvec_on_irqstack_cond(func, regs);
+}
+
/**
* DECLARE_IDTENTRY - Declare functions for simple IDT entry points
* No error code pushed by hardware
@@ -295,7 +306,7 @@ __visible noinstr void func(struct pt_regs *regs) \
instrumentation_begin(); \
irq_enter_rcu(); \
kvm_set_cpu_l1tf_flush_l1d(); \
- run_sysvec_on_irqstack_cond(__##func, regs); \
+ run_sysvec(__##func, regs); \
irq_exit_rcu(); \
instrumentation_end(); \
irqentry_exit(regs, state); \
--
2.18.4

2020-11-17 02:03:53

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 11/21] x86/pti: Extend PTI user mappings

On Mon, Nov 16, 2020 at 6:49 AM Alexandre Chartre
<[email protected]> wrote:
>
> Extend PTI user mappings so that more kernel entry code can be executed
> with the user page-table. To do so, we need to map syscall and interrupt
> entry code, per cpu offsets (__per_cpu_offset, which is used some in
> entry code), the stack canary, and the PTI stack (which is defined per
> task).

Does anything unmap the PTI stack? Mapping is easy, and unmapping
could be a pretty big mess.

--Andy

2020-11-17 03:23:40

by Borislav Petkov

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code

On Mon, Nov 16, 2020 at 03:47:36PM +0100, Alexandre Chartre wrote:
> This RFC proposes to defer the PTI CR3 switch until we reach C code.
> The benefit is that this simplifies the assembly entry code, and make
> the PTI CR3 switch code easier to understand. This also paves the way
> for further possible projects such an easier integration of Address
> Space Isolation (ASI), or the possibilily to execute some selected
> syscall or interrupt handlers without switching to the kernel page-table

What for? What is this going to be used for in the end?

> (and thus avoid the PTI page-table switch overhead).

Overhead of how much? Why do we care?

What is the big picture justification for this diffstat

> 21 files changed, 874 insertions(+), 314 deletions(-)

and the diffstat for the ASI enablement?

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2020-11-17 03:23:44

by Alexandre Chartre

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 11/21] x86/pti: Extend PTI user mappings


On 11/16/20 8:48 PM, Andy Lutomirski wrote:
> On Mon, Nov 16, 2020 at 6:49 AM Alexandre Chartre
> <[email protected]> wrote:
>>
>> Extend PTI user mappings so that more kernel entry code can be executed
>> with the user page-table. To do so, we need to map syscall and interrupt
>> entry code, per cpu offsets (__per_cpu_offset, which is used some in
>> entry code), the stack canary, and the PTI stack (which is defined per
>> task).
>
> Does anything unmap the PTI stack? Mapping is easy, and unmapping
> could be a pretty big mess.
>

No, there's no unmap. The mapping exists as long as the task page-table
does (i.e. as long as the task mm exists). I assume that the task stack
and mm are freed at the same time but that's not something I have checked.

alex.

2020-11-17 07:59:22

by Alexandre Chartre

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code


On 11/16/20 9:17 PM, Borislav Petkov wrote:
> On Mon, Nov 16, 2020 at 03:47:36PM +0100, Alexandre Chartre wrote:
>> This RFC proposes to defer the PTI CR3 switch until we reach C code.
>> The benefit is that this simplifies the assembly entry code, and make
>> the PTI CR3 switch code easier to understand. This also paves the way
>> for further possible projects such an easier integration of Address
>> Space Isolation (ASI), or the possibility to execute some selected
>> syscall or interrupt handlers without switching to the kernel page-table
>
> What for? What is this going to be used for in the end?
>

In addition to simplifying the assembly entry code, this will also simplify
the integration of Address Space Isolation (ASI), which will certainly be
the primary beneficiary of this change. The main goal of ASI is to provide
KVM address space isolation to mitigate guest-to-host speculative attacks
like L1TF or MDS. The current ASI proposal is plugged into the CR3 switch
assembly macros, which makes the code brittle and complex (see [1]).

I am also expecting this might help with some other ideas, like having
syscalls (or interrupt handlers) which can run without switching the
page-table.


>> (and thus avoid the PTI page-table switch overhead).
>
> Overhead of how much? Why do we care?
>

PTI has a measured overhead of roughly 5% for most workloads, but it can
be much higher in some cases. The overhead is mostly due to the page-table
switch (even with PCID) so if we can run a syscall or an interrupt handler
without switching the page-table then we can get this kind of performance
back.


> What is the big picture justification for this diffstat
>
>> 21 files changed, 874 insertions(+), 314 deletions(-)
>
> and the diffstat for the ASI enablement?
>

The latest ASI RFC (RFC v4) is here [1]. This RFC has ASI plugged directly into
the CR3 switch assembly macro. We are working on a new implementation, based
on these changes which avoid having to deal with assembly code and makes the
implementation more robust.

alex.

[1] ASI RFCv4 - https://lore.kernel.org/lkml/[email protected]/

2020-11-17 08:21:22

by Alexandre Chartre

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code


On 11/16/20 9:24 PM, Borislav Petkov wrote:
> On Mon, Nov 16, 2020 at 03:47:36PM +0100, Alexandre Chartre wrote:
>> Deferring CR3 switch to C code means that we need to run more of the
>> kernel entry code with the user page-table. To do so, we need to:
>>
>> - map more syscall, interrupt and exception entry code into the user
>> page-table (map all noinstr code);
>>
>> - map additional data used in the entry code (such as stack canary);
>>
>> - run more entry code on the trampoline stack (which is mapped both
>> in the kernel and in the user page-table) until we switch to the
>> kernel page-table and then switch to the kernel stack;
>
> So PTI was added exactly to *not* have kernel memory mapped in the user
> page table. You're partially reversing that...

We are not reversing PTI, we are extending it.

PTI removes all kernel mappings from the user page-table. However, there's
no issue with mapping some kernel data into the user page-table as long as
that data contains no sensitive information.

Actually, PTI already does that, but with a very limited scope. PTI adds
to the user page-table some kernel mappings which are needed for userland
to enter the kernel (such as the kernel entry text, the ESPFIX, the
CPU_ENTRY_AREA_BASE...).

So here, we are extending the PTI mapping so that we can execute more kernel
code while using the user page-table; it's a kind of PTI on steroids.


>> - have a per-task trampoline stack instead of a per-cpu trampoline
>> stack, so the task can be scheduled out while it hasn't switched
>> to the kernel stack.
>
> per-task? How much more memory is that per task?
>

Currently, this is done by doubling the size of the task stack (patch 8),
so that's an extra 8KB. Half of the stack is used as the regular kernel
stack, and the other half used as the PTI stack:

+/*
+ * PTI doubles the size of the stack. The entire stack is mapped into
+ * the kernel address space. However, only the top half of the stack is
+ * mapped into the user address space.
+ *
+ * On syscall or interrupt, user mode enters the kernel with the user
+ * page-table, and the stack pointer is switched to the top of the
+ * stack (which is mapped in the user address space and in the kernel).
+ * The syscall/interrupt handler will then later decide when to switch
+ * to the kernel address space, and to switch to the top of the kernel
+ * stack which is only mapped in the kernel.
+ *
+ * +-------------+
+ * | | ^ ^
+ * | kernel-only | | KERNEL_STACK_SIZE |
+ * | stack | | |
+ * | | V |
+ * +-------------+ <- top of kernel stack | THREAD_SIZE
+ * | | ^ |
+ * | kernel and | | KERNEL_STACK_SIZE |
+ * | PTI stack | | |
+ * | | V v
+ * +-------------+ <- top of stack
+ */

The minimum size would be 1 page (4KB) as this is the minimum mapping size.
It's certainly enough for now as the usage of the PTI stack is limited, but
we will need a larger stack if we want to execute more kernel code with the
user page-table.
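
Concretely, the user-page-table mapping of that PTI half is established at
fork time (patch 11), roughly:

/* Sketch from mm_map_task() in patch 11: only the KERNEL_STACK_SIZE
 * bytes starting at task_top_of_kernel_stack() (the PTI half) are
 * cloned into the user page-table, at PTE granularity.
 */
addr = (unsigned long)task_top_of_kernel_stack(tsk);
pti_clone_pgtable(mm, addr, addr + KERNEL_STACK_SIZE, PTI_CLONE_PTE);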

alex.

2020-11-17 08:30:35

by Alexandre Chartre

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 12/21] x86/pti: Use PTI stack instead of trampoline stack


On 11/16/20 10:24 PM, David Laight wrote:
> From: Alexandre Chartre
>> Sent: 16 November 2020 18:10
>>
>> On 11/16/20 5:57 PM, Andy Lutomirski wrote:
>>> On Mon, Nov 16, 2020 at 6:47 AM Alexandre Chartre
>>> <[email protected]> wrote:
>>>>
>>>> When entering the kernel from userland, use the per-task PTI stack
>>>> instead of the per-cpu trampoline stack. Like the trampoline stack,
>>>> the PTI stack is mapped both in the kernel and in the user page-table.
>>>> Using a per-task stack which is mapped into the kernel and the user
>>>> page-table instead of a per-cpu stack will allow executing more code
>>>> before switching to the kernel stack and to the kernel page-table.
>>>
>>> Why?
>>
>> When executing more code in the kernel, we are likely to reach a point
>> where we need to sleep while we are using the user page-table, so we need
>> to be using a per-thread stack.
>
> Isn't that going to allocate a lot more kernel memory?

That's one of my concerns, hence this RFC. The current code is doubling the
task stack (this was an easy solution), so that's +8KB per task. See my
reply to Boris, it has a bit more detail.

alex.


> ISTR some thoughts about using dynamically allocated kernel
> stacks when (at least some) wakeups are done by directly
> restarting the system call - so that the sleeping thread
> doesn't even need a kernel stack.
> (I can't remember if that was linux or one of the BSDs)
>
> David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)
>

2020-11-17 08:45:58

by Alexandre Chartre

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 11/21] x86/pti: Extend PTI user mappings


On 11/17/20 12:06 AM, Andy Lutomirski wrote:
> On Mon, Nov 16, 2020 at 12:18 PM Alexandre Chartre
> <[email protected]> wrote:
>>
>>
>> On 11/16/20 8:48 PM, Andy Lutomirski wrote:
>>> On Mon, Nov 16, 2020 at 6:49 AM Alexandre Chartre
>>> <[email protected]> wrote:
>>>>
>>>> Extend PTI user mappings so that more kernel entry code can be executed
>>>> with the user page-table. To do so, we need to map syscall and interrupt
>>>> entry code, per cpu offsets (__per_cpu_offset, which is used some in
>>>> entry code), the stack canary, and the PTI stack (which is defined per
>>>> task).
>>>
>>> Does anything unmap the PTI stack? Mapping is easy, and unmapping
>>> could be a pretty big mess.
>>>
>>
>> No, there's no unmap. The mapping exists as long as the task page-table
>> does (i.e. as long as the task mm exits). I assume that the task stack
>> and mm are freed at the same time but that's not something I have checked.
>>
>
> Nope. A multi-threaded mm will free task stacks when the task exits,
> but the mm may outlive the individual tasks. Additionally, if you
> allocate page tables as part of mapping PTI stacks, you need to make
> sure the pagetables are freed.

So I think I just need to unmap the PTI stack from the user page-table
when the task exits. Everything else is handled because the kernel and
PTI stack are allocated in a single chunk (referenced by task->stack).


> Finally, you need to make sure that
> the PTI stacks have appropriate guard pages -- just doubling the
> allocation is not safe enough.

The PTI stack does have guard pages because it maps only a part of the task
stack into the user page-table, so pages around the PTI stack are not mapped
into the user page-table (the page below is the task stack guard, and the page
above is part of the kernel-only stack so it's never mapped into the user
page-table).

+ * +-------------+
+ * | | ^ ^
+ * | kernel-only | | KERNEL_STACK_SIZE |
+ * | stack | | |
+ * | | V |
+ * +-------------+ <- top of kernel stack | THREAD_SIZE
+ * | | ^ |
+ * | kernel and | | KERNEL_STACK_SIZE |
+ * | PTI stack | | |
+ * | | V v
+ * +-------------+ <- top of stack

> My intuition is that this is going to be far more complexity than is justified.

Sounds like only the PTI stack unmap is missing, which is hopefully not
that bad. I will check that.
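
Roughly, what I have in mind is a small hook on the task-stack release path,
something like this (a sketch only, with made-up helper names, not the actual
patch code):

/*
 * Sketch only: unmap the PTI portion of the task stack from the user
 * page-table when the task stack is released. pti_clear_user_pgtbl_range()
 * is a made-up name standing in for whatever helper ends up clearing the
 * corresponding entries in the user pgd.
 */
static void pti_unmap_task_stack(struct task_struct *tsk)
{
	unsigned long stack = (unsigned long)task_stack_page(tsk);

	if (!boot_cpu_has(X86_FEATURE_PTI) || !tsk->mm)
		return;

	/* Only the lower part of the stack is mapped into the user pgd. */
	pti_clear_user_pgtbl_range(tsk->mm, stack, KERNEL_STACK_SIZE);
}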

alex.

2020-11-17 15:10:35

by Alexandre Chartre

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 12/21] x86/pti: Use PTI stack instead of trampoline stack



On 11/16/20 7:34 PM, Andy Lutomirski wrote:
> On Mon, Nov 16, 2020 at 10:10 AM Alexandre Chartre
> <[email protected]> wrote:
>>
>>
>> On 11/16/20 5:57 PM, Andy Lutomirski wrote:
>>> On Mon, Nov 16, 2020 at 6:47 AM Alexandre Chartre
>>> <[email protected]> wrote:
>>>>
>>>> When entering the kernel from userland, use the per-task PTI stack
>>>> instead of the per-cpu trampoline stack. Like the trampoline stack,
>>>> the PTI stack is mapped both in the kernel and in the user page-table.
>>>> Using a per-task stack which is mapped into the kernel and the user
>>>> page-table instead of a per-cpu stack will allow executing more code
>>>> before switching to the kernel stack and to the kernel page-table.
>>>
>>> Why?
>>
>> When executing more code in the kernel, we are likely to reach a point
>> where we need to sleep while we are using the user page-table, so we need
>> to be using a per-thread stack.
>>
>>> I can't immediately evaluate how nasty the page table setup is because
>>> it's not in this patch.
>>
>> The page-table is the regular page-table as introduced by PTI. It is just
>> augmented with a few additional mapping which are in patch 11 (x86/pti:
>> Extend PTI user mappings).
>>
>>> But AFAICS the only thing that this enables is sleeping with user pagetables.
>>
>> That's precisely the point, it allows to sleep with the user page-table.
>>
>>> Do we really need to do that?
>>
>> Actually, probably not with this particular patchset, because I do the page-table
>> switch at the very beginning and end of the C handler. I had some code where I
>> moved the page-table switch deeper in the kernel handler where you definitively
>> can sleep (for example, if you switch back to the user page-table before
>> exit_to_user_mode_prepare()).
>>
>> So a first step should probably be to not introduce the per-task PTI trampoline stack,
>> and stick with the existing trampoline stack. The per-task PTI trampoline stack can
>> be introduced later when the page-table switch is moved deeper in the C handler and
>> we can effectively sleep while using the user page-table.
>
> Seems reasonable.
>

I finally remember why I introduced a per-task PTI trampoline stack right now:
it is to be able to move the CR3 switch anywhere in the C handler. To do so, we need
a per-task stack to enter (and return from) the C handler, as the handler can potentially
go to sleep.

Without a per-task trampoline stack, we would be limited to calling the CR3 switch functions
from the assembly entry code, before and after calling the C function handler (which is also
called from assembly).

alex.

2020-11-17 15:52:44

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 11/21] x86/pti: Extend PTI user mappings

On Tue, Nov 17, 2020 at 12:42 AM Alexandre Chartre
<[email protected]> wrote:
>
>
> On 11/17/20 12:06 AM, Andy Lutomirski wrote:
> > On Mon, Nov 16, 2020 at 12:18 PM Alexandre Chartre
> > <[email protected]> wrote:
> >>
> >>
> >> On 11/16/20 8:48 PM, Andy Lutomirski wrote:
> >>> On Mon, Nov 16, 2020 at 6:49 AM Alexandre Chartre
> >>> <[email protected]> wrote:
> >>>>
> >>>> Extend PTI user mappings so that more kernel entry code can be executed
> >>>> with the user page-table. To do so, we need to map syscall and interrupt
> >>>> entry code, per cpu offsets (__per_cpu_offset, which is used some in
> >>>> entry code), the stack canary, and the PTI stack (which is defined per
> >>>> task).
> >>>
> >>> Does anything unmap the PTI stack? Mapping is easy, and unmapping
> >>> could be a pretty big mess.
> >>>
> >>
> >> No, there's no unmap. The mapping exists as long as the task page-table
> >> does (i.e. as long as the task mm exits). I assume that the task stack
> >> and mm are freed at the same time but that's not something I have checked.
> >>
> >
> > Nope. A multi-threaded mm will free task stacks when the task exits,
> > but the mm may outlive the individual tasks. Additionally, if you
> > allocate page tables as part of mapping PTI stacks, you need to make
> > sure the pagetables are freed.
>
> So I think I just need to unmap the PTI stack from the user page-table
> when the task exits. Everything else is handled because the kernel and
> PTI stack are allocated in a single chunk (referenced by task->stack).
>
>
> > Finally, you need to make sure that
> > the PTI stacks have appropriate guard pages -- just doubling the
> > allocation is not safe enough.
>
> The PTI stack does have guard pages because it maps only a part of the task
> stack into the user page-table, so pages around the PTI stack are not mapped
> into the user-pagetable (the page below is the task stack guard, and the page
> above is part of the kernel-only stack so it's never mapped into the user
> page-table).
>
> + * +-------------+
> + * | | ^ ^
> + * | kernel-only | | KERNEL_STACK_SIZE |
> + * | stack | | |
> + * | | V |
> + * +-------------+ <- top of kernel stack | THREAD_SIZE
> + * | | ^ |
> + * | kernel and | | KERNEL_STACK_SIZE |
> + * | PTI stack | | |
> + * | | V v
> + * +-------------+ <- top of stack

There's no guard page between the stacks. That seems unfortunate.

>
> > My intuition is that this is going to be far more complexity than is justified.
>
> Sounds like only the PTI stack unmap is missing, which is hopefully not
> that bad. I will check that.
>
> alex.

2020-11-17 15:56:38

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 12/21] x86/pti: Use PTI stack instead of trampoline stack

On Tue, Nov 17, 2020 at 7:07 AM Alexandre Chartre
<[email protected]> wrote:
>
>
>
> On 11/16/20 7:34 PM, Andy Lutomirski wrote:
> > On Mon, Nov 16, 2020 at 10:10 AM Alexandre Chartre
> > <[email protected]> wrote:
> >>
> >>
> >> On 11/16/20 5:57 PM, Andy Lutomirski wrote:
> >>> On Mon, Nov 16, 2020 at 6:47 AM Alexandre Chartre
> >>> <[email protected]> wrote:
> >>>>
> >>>> When entering the kernel from userland, use the per-task PTI stack
> >>>> instead of the per-cpu trampoline stack. Like the trampoline stack,
> >>>> the PTI stack is mapped both in the kernel and in the user page-table.
> >>>> Using a per-task stack which is mapped into the kernel and the user
> >>>> page-table instead of a per-cpu stack will allow executing more code
> >>>> before switching to the kernel stack and to the kernel page-table.
> >>>
> >>> Why?
> >>
> >> When executing more code in the kernel, we are likely to reach a point
> >> where we need to sleep while we are using the user page-table, so we need
> >> to be using a per-thread stack.
> >>
> >>> I can't immediately evaluate how nasty the page table setup is because
> >>> it's not in this patch.
> >>
> >> The page-table is the regular page-table as introduced by PTI. It is just
> >> augmented with a few additional mapping which are in patch 11 (x86/pti:
> >> Extend PTI user mappings).
> >>
> >>> But AFAICS the only thing that this enables is sleeping with user pagetables.
> >>
> >> That's precisely the point, it allows to sleep with the user page-table.
> >>
> >>> Do we really need to do that?
> >>
> >> Actually, probably not with this particular patchset, because I do the page-table
> >> switch at the very beginning and end of the C handler. I had some code where I
> >> moved the page-table switch deeper in the kernel handler where you definitively
> >> can sleep (for example, if you switch back to the user page-table before
> >> exit_to_user_mode_prepare()).
> >>
> >> So a first step should probably be to not introduce the per-task PTI trampoline stack,
> >> and stick with the existing trampoline stack. The per-task PTI trampoline stack can
> >> be introduced later when the page-table switch is moved deeper in the C handler and
> >> we can effectively sleep while using the user page-table.
> >
> > Seems reasonable.
> >
>
> I finally remember why I have introduced a per-task PTI trampoline stack right now:
> that's to be able to move the CR3 switch anywhere in the C handler. To do so, we need
> a per-task stack to enter (and return) from the C handler as the handler can potentially
> go to sleep.
>
> Without a per-task trampoline stack, we would be limited to call the switch CR3 functions
> from the assembly entry code before and after calling the C function handler (also called
> from assembly).

The noinstr part of the C entry code won't sleep.

--Andy

2020-11-17 17:03:18

by Borislav Petkov

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code

On Tue, Nov 17, 2020 at 08:56:23AM +0100, Alexandre Chartre wrote:
> The main goal of ASI is to provide KVM address space isolation to
> mitigate guest-to-host speculative attacks like L1TF or MDS.

Because the current L1TF and MDS mitigations are lacking or why?

> Current proposal of ASI is plugged into the CR3 switch assembly macro
> which make the code brittle and complex. (see [1])
>
> I am also expected this might help with some other ideas like having
> syscall (or interrupt handler) which can run without switching the
> page-table.

I still fail to see why we need all that. I read, "this does this and
that" but I don't read "the current problem is this" and "this is our
suggested solution for it".

So what is the issue which needs addressing in the current kernel which
is going to justify adding all that code?

> PTI has a measured overhead of roughly 5% for most workloads, but it can
> be much higher in some cases.

"it can be"? Where? Actual use case?

> The latest ASI RFC (RFC v4) is here [1]. This RFC has ASI plugged
> directly into the CR3 switch assembly macro. We are working on a new
> implementation, based on these changes which avoid having to deal with
> assembly code and makes the implementation more robust.

This still doesn't answer my questions. I read a lot of "could be used
for" formulations but I still don't know why we need that. So what is
the problem that the kernel currently has which you're trying to address
with this?

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2020-11-17 17:03:56

by Alexandre Chartre

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 12/21] x86/pti: Use PTI stack instead of trampoline stack



On 11/17/20 4:52 PM, Andy Lutomirski wrote:
> On Tue, Nov 17, 2020 at 7:07 AM Alexandre Chartre
> <[email protected]> wrote:
>>
>>
>>
>> On 11/16/20 7:34 PM, Andy Lutomirski wrote:
>>> On Mon, Nov 16, 2020 at 10:10 AM Alexandre Chartre
>>> <[email protected]> wrote:
>>>>
>>>>
>>>> On 11/16/20 5:57 PM, Andy Lutomirski wrote:
>>>>> On Mon, Nov 16, 2020 at 6:47 AM Alexandre Chartre
>>>>> <[email protected]> wrote:
>>>>>>
>>>>>> When entering the kernel from userland, use the per-task PTI stack
>>>>>> instead of the per-cpu trampoline stack. Like the trampoline stack,
>>>>>> the PTI stack is mapped both in the kernel and in the user page-table.
>>>>>> Using a per-task stack which is mapped into the kernel and the user
>>>>>> page-table instead of a per-cpu stack will allow executing more code
>>>>>> before switching to the kernel stack and to the kernel page-table.
>>>>>
>>>>> Why?
>>>>
>>>> When executing more code in the kernel, we are likely to reach a point
>>>> where we need to sleep while we are using the user page-table, so we need
>>>> to be using a per-thread stack.
>>>>
>>>>> I can't immediately evaluate how nasty the page table setup is because
>>>>> it's not in this patch.
>>>>
>>>> The page-table is the regular page-table as introduced by PTI. It is just
>>>> augmented with a few additional mapping which are in patch 11 (x86/pti:
>>>> Extend PTI user mappings).
>>>>
>>>>> But AFAICS the only thing that this enables is sleeping with user pagetables.
>>>>
>>>> That's precisely the point, it allows to sleep with the user page-table.
>>>>
>>>>> Do we really need to do that?
>>>>
>>>> Actually, probably not with this particular patchset, because I do the page-table
>>>> switch at the very beginning and end of the C handler. I had some code where I
>>>> moved the page-table switch deeper in the kernel handler where you definitively
>>>> can sleep (for example, if you switch back to the user page-table before
>>>> exit_to_user_mode_prepare()).
>>>>
>>>> So a first step should probably be to not introduce the per-task PTI trampoline stack,
>>>> and stick with the existing trampoline stack. The per-task PTI trampoline stack can
>>>> be introduced later when the page-table switch is moved deeper in the C handler and
>>>> we can effectively sleep while using the user page-table.
>>>
>>> Seems reasonable.
>>>
>>
>> I finally remember why I have introduced a per-task PTI trampoline stack right now:
>> that's to be able to move the CR3 switch anywhere in the C handler. To do so, we need
>> a per-task stack to enter (and return) from the C handler as the handler can potentially
>> go to sleep.
>>
>> Without a per-task trampoline stack, we would be limited to call the switch CR3 functions
>> from the assembly entry code before and after calling the C function handler (also called
>> from assembly).
>
> The noinstr part of the C entry code won't sleep.
>

But the noinstr part of the handler can sleep, and if it does we will need to
preserve the trampoline stack (even if we switch to the per-task kernel stack to
execute the noinstr part).

Example:

#define DEFINE_IDTENTRY(func)                                           \
static __always_inline void __##func(struct pt_regs *regs);             \
                                                                         \
__visible noinstr void func(struct pt_regs *regs)                       \
{                                                                        \
        irqentry_state_t state;         -+                               \
                                         |                               \
        user_pagetable_escape(regs);     | use trampoline stack (1)
        state = irqentry_enter(regs);    |                               \
        instrumentation_begin();        -+                               \
        run_idt(__##func, regs);       |===| run __func() on kernel stack (this can sleep)
        instrumentation_end();          -+                               \
        irqentry_exit(regs, state);      | use trampoline stack (2)
        user_pagetable_return(regs);    -+                               \
}

Between (1) and (2) we need to preserve and use the same trampoline stack
in case __func() went sleeping.
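
For reference, run_idt() here is just meant to invoke the real handler on the
per-task kernel stack, something along these lines (sketch only; the helper
names and exact mechanics are approximations, not the patch code):

/*
 * Sketch only: run the actual handler on the per-task kernel stack while
 * the surrounding entry/exit code keeps running on the trampoline/PTI
 * stack. on_trampoline_stack() and call_on_kernel_stack() are made-up
 * names for the stack check and stack-switching helpers.
 */
static __always_inline void run_idt(void (*func)(struct pt_regs *),
				    struct pt_regs *regs)
{
	if (on_trampoline_stack())
		call_on_kernel_stack(func, regs);	/* switch %rsp, then func(regs) */
	else
		func(regs);
}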

alex.

2020-11-17 17:11:24

by Borislav Petkov

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code

On Tue, Nov 17, 2020 at 09:19:01AM +0100, Alexandre Chartre wrote:
> We are not reversing PTI, we are extending it.

You're reversing it in the sense that you're mapping more kernel memory
into the user page table than what is mapped now.

> PTI removes all kernel mapping from the user page-table. However there's
> no issue with mapping some kernel data into the user page-table as long as
> these data have no sensitive information.

I hope that is the case.

> Actually, PTI is already doing that but with a very limited scope. PTI adds
> into the user page-table some kernel mappings which are needed for userland
> to enter the kernel (such as the kernel entry text, the ESPFIX, the
> CPU_ENTRY_AREA_BASE...).
>
> So here, we are extending the PTI mapping so that we can execute more kernel
> code while using the user page-table; it's a kind of PTI on steroids.

And this is what bothers me - someone else might come after you and say,
but but, I need to map more stuff into the user pgt because I wanna do
X... and so on.

> The minimum size would be 1 page (4KB) as this is the minimum mapping size.
> It's certainly enough for now as the usage of the PTI stack is limited, but
> we will need larger stack if we won't to execute more kernel code with the
> user page-table.

So on a big machine with a million tasks, that's at least a million
pages more which is what, ~4 GB?

There better be a very good justification for the additional memory
consumption...

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2020-11-17 18:14:33

by Alexandre Chartre

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code


On 11/17/20 5:55 PM, Borislav Petkov wrote:
> On Tue, Nov 17, 2020 at 08:56:23AM +0100, Alexandre Chartre wrote:
>> The main goal of ASI is to provide KVM address space isolation to
>> mitigate guest-to-host speculative attacks like L1TF or MDS.
>
> Because the current L1TF and MDS mitigations are lacking or why?
>

Yes. L1TF/MDS allow some cross cpu-thread attacks which are not mitigated at
the moment. In particular, this allows a guest VM to attack another guest VM
or the host kernel running on a sibling cpu-thread. Core Scheduling will
mitigate the guest-to-guest attack but not the guest-to-host attack. Address
Space Isolation provides a mitigation for the guest-to-host attack.


>> Current proposal of ASI is plugged into the CR3 switch assembly macro
>> which make the code brittle and complex. (see [1])
>>
>> I am also expected this might help with some other ideas like having
>> syscall (or interrupt handler) which can run without switching the
>> page-table.
>
> I still fail to see why we need all that. I read, "this does this and
> that" but I don't read "the current problem is this" and "this is our
> suggested solution for it".
>
> So what is the issue which needs addressing in the current kernel which
> is going to justify adding all that code?

The main issue this is trying to address is that the CR3 switch is currently
done in assembly code from contexts which are very restrictive: the CR3 switch
is often done when only one or two registers are available for use, and sometimes
no stack is available. For example, the syscall entry switches CR3 with a single
register available (%sp) and no stack.

Because of this, it is fairly tricky to expand the logic for switching CR3.
This is a problem that we have faced while implementing Address Space Isolation
(ASI), where we need extra logic to drive the page-table switch. We have successfully
implemented ASI with the current CR3 switching assembly code, but this requires
complex assembly constructions. Hence this proposal to defer CR3 switching to C
code so that it can be extended more easily.
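
To illustrate, the "switch to kernel CR3" operation itself is simple once it can
be written in C; a minimal sketch (ignoring the PCID no-flush optimization that
the assembly macros handle, and using an illustrative function name) would be:

/*
 * Minimal sketch of a C-level switch to the kernel CR3, mirroring what the
 * SWITCH_TO_KERNEL_CR3 assembly macro does today. PCID no-flush handling is
 * deliberately left out; this is not the actual patch code.
 */
static __always_inline void c_switch_to_kernel_cr3(void)
{
	unsigned long cr3 = __read_cr3();

	/* The user CR3 has the user page-table (and user PCID) bits set. */
	if (cr3 & PTI_USER_PGTABLE_MASK)
		native_write_cr3(cr3 & ~PTI_USER_PGTABLE_AND_PCID_MASK);
}

The point is not the CR3 write itself, but that extra logic (such as the ASI
page-table selection) can be added around it without fighting the assembly
constraints described above.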

Hopefully this can also help make the assembly entry code less complex,
and benefit other projects.


>> PTI has a measured overhead of roughly 5% for most workloads, but it can
>> be much higher in some cases.
>
> "it can be"? Where? Actual use case?

Some benchmarks are available, in particular from phoronix:

https://www.phoronix.com/scan.php?page=article&item=linux-more-x86pti
https://www.phoronix.com/scan.php?page=news_item&px=x86-PTI-Initial-Gaming-Tests
https://www.phoronix.com/scan.php?page=article&item=linux-kpti-kvm
https://medium.com/@loganaden/linux-kpti-performance-hit-on-real-workloads-8da185482df3


>> The latest ASI RFC (RFC v4) is here [1]. This RFC has ASI plugged
>> directly into the CR3 switch assembly macro. We are working on a new
>> implementation, based on these changes which avoid having to deal with
>> assembly code and makes the implementation more robust.
>
> This still doesn't answer my questions. I read a lot of "could be used
> for" formulations but I still don't know why we need that. So what is
> the problem that the kernel currently has which you're trying to address
> with this?
>

Hopefully this is clearer with the answer I provided above.

Thanks,

alex.

2020-11-17 18:28:03

by Alexandre Chartre

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code


On 11/17/20 6:07 PM, Borislav Petkov wrote:
> On Tue, Nov 17, 2020 at 09:19:01AM +0100, Alexandre Chartre wrote:
>> We are not reversing PTI, we are extending it.
>
> You're reversing it in the sense that you're mapping more kernel memory
> into the user page table than what is mapped now.
>
>> PTI removes all kernel mapping from the user page-table. However there's
>> no issue with mapping some kernel data into the user page-table as long as
>> these data have no sensitive information.
>
> I hope that is the case.
>
>> Actually, PTI is already doing that but with a very limited scope. PTI adds
>> into the user page-table some kernel mappings which are needed for userland
>> to enter the kernel (such as the kernel entry text, the ESPFIX, the
>> CPU_ENTRY_AREA_BASE...).
>>
>> So here, we are extending the PTI mapping so that we can execute more kernel
>> code while using the user page-table; it's a kind of PTI on steroids.
>
> And this is what bothers me - someone else might come after you and say,
> but but, I need to map more stuff into the user pgt because I wanna do
> X... and so on.

Agreed, any addition should be strictly checked. I have been careful to limit
the expansion to the minimum I needed.


>> The minimum size would be 1 page (4KB) as this is the minimum mapping size.
>> It's certainly enough for now as the usage of the PTI stack is limited, but
>> we will need larger stack if we won't to execute more kernel code with the
>> user page-table.
>
> So on a big machine with a million tasks, that's at least a million
> pages more which is what, ~4 Gb?
>
> There better be a very good justification for the additional memory
> consumption...

Yeah, adding a per-task allocation is my main concern, hence this RFC.


alex.

2020-11-17 18:30:27

by Borislav Petkov

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code

On Tue, Nov 17, 2020 at 07:12:07PM +0100, Alexandre Chartre wrote:
> Yes. L1TF/MDS allow some inter cpu-thread attacks which are not mitigated at
> the moment. In particular, this allows a guest VM to attack another guest VM
> or the host kernel running on a sibling cpu-thread. Core Scheduling will
> mitigate the guest-to-guest attack but not the guest-to-host attack.

I see in vmx_vcpu_enter_exit():

/* L1D Flush includes CPU buffer clear to mitigate MDS */
if (static_branch_unlikely(&vmx_l1d_should_flush))
vmx_l1d_flush(vcpu);
else if (static_branch_unlikely(&mds_user_clear))
mds_clear_cpu_buffers();

Is that not enough?

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2020-11-17 19:05:13

by Alexandre Chartre

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code



On 11/17/20 7:28 PM, Borislav Petkov wrote:
> On Tue, Nov 17, 2020 at 07:12:07PM +0100, Alexandre Chartre wrote:
>> Yes. L1TF/MDS allow some inter cpu-thread attacks which are not mitigated at
>> the moment. In particular, this allows a guest VM to attack another guest VM
>> or the host kernel running on a sibling cpu-thread. Core Scheduling will
>> mitigate the guest-to-guest attack but not the guest-to-host attack.
>
> I see in vmx_vcpu_enter_exit():
>
> /* L1D Flush includes CPU buffer clear to mitigate MDS */
> if (static_branch_unlikely(&vmx_l1d_should_flush))
> vmx_l1d_flush(vcpu);
> else if (static_branch_unlikely(&mds_user_clear))
> mds_clear_cpu_buffers();
>
> Is that not enough?

No. This prevents the guest VM from gathering data from the host kernel on the
same cpu-thread. But there's no mitigation for a guest VM running on a cpu-thread
attacking another cpu-thread (which can be running another guest VM or the
host kernel) from the same cpu-core. You cannot use flush/clear barriers because
the two cpu-threads are running in parallel.

alex.

2020-11-17 21:28:46

by Borislav Petkov

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code

On Tue, Nov 17, 2020 at 08:02:51PM +0100, Alexandre Chartre wrote:
> No. This prevents the guest VM from gathering data from the host
> kernel on the same cpu-thread. But there's no mitigation for a guest
> VM running on a cpu-thread attacking another cpu-thread (which can be
> running another guest VM or the host kernel) from the same cpu-core.
> You cannot use flush/clear barriers because the two cpu-threads are
> running in parallel.

Now there's your justification for why you're doing this. It took a
while...

The "why" should always be part of the 0th message to provide
reviewers/maintainers with answers to the question, what this pile of
patches is all about. Please always add this rationale to your patchset
in the future.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2020-11-17 21:30:05

by Borislav Petkov

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code

On Tue, Nov 17, 2020 at 07:12:07PM +0100, Alexandre Chartre wrote:
> Some benchmarks are available, in particular from phoronix:

What I was expecting was benchmarks *you* have run which show that
perf penalty, not something one can find quickly on the internet and
something one cannot always reproduce her-/himself.

You do know that presenting convincing numbers with a patchset greatly
improves its chances of getting it upstreamed, right?

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2020-11-18 07:09:32

by Alexandre Chartre

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code


On 11/17/20 10:23 PM, Borislav Petkov wrote:
> On Tue, Nov 17, 2020 at 08:02:51PM +0100, Alexandre Chartre wrote:
>> No. This prevents the guest VM from gathering data from the host
>> kernel on the same cpu-thread. But there's no mitigation for a guest
>> VM running on a cpu-thread attacking another cpu-thread (which can be
>> running another guest VM or the host kernel) from the same cpu-core.
>> You cannot use flush/clear barriers because the two cpu-threads are
>> running in parallel.
>
> Now there's your justification for why you're doing this. It took a
> while...
>
> The "why" should always be part of the 0th message to provide
> reviewers/maintainers with answers to the question, what this pile of
> patches is all about. Please always add this rationale to your patchset
> in the future.
>

Sorry about that, I will definitely try to do better next time. :-}

Thanks,

alex.

2020-11-18 07:43:02

by Alexandre Chartre

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code


On 11/17/20 10:26 PM, Borislav Petkov wrote:
> On Tue, Nov 17, 2020 at 07:12:07PM +0100, Alexandre Chartre wrote:
>> Some benchmarks are available, in particular from phoronix:
>
> What I was expecting was benchmarks *you* have run which show that
> perf penalty, not something one can find quickly on the internet and
> something one cannot always reproduce her-/himself.
>
> You do know that presenting convincing numbers with a patchset greatly
> improves its chances of getting it upstreamed, right?
>

Well, it looks like I wrongly assumed that the KPTI performance overhead was
well known since it was introduced (because it adds extra page-table switches),
but you are right that I should be presenting my own numbers.

Thanks,

alex.

2020-11-18 09:34:08

by David Laight

[permalink] [raw]
Subject: RE: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code

From: Alexandre Chartre
> Sent: 18 November 2020 07:42
>
>
> On 11/17/20 10:26 PM, Borislav Petkov wrote:
> > On Tue, Nov 17, 2020 at 07:12:07PM +0100, Alexandre Chartre wrote:
> >> Some benchmarks are available, in particular from phoronix:
> >
> > What I was expecting was benchmarks *you* have run which show that
> > perf penalty, not something one can find quickly on the internet and
> > something one cannot always reproduce her-/himself.
> >
> > You do know that presenting convincing numbers with a patchset greatly
> > improves its chances of getting it upstreamed, right?
> >
>
> Well, it looks like I wrongfully assume that KPTI was a well known performance
> overhead since it was introduced (because it adds extra page-table switches),
> but you are right I should be presenting my own numbers.

IIRC the penalty comes from the page table switch.
Doing it at a different time is unlikely to make much difference.

For some workloads the penalty is massive - getting on for 50%.
We are still using old kernels on AWS.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2020-11-18 10:29:54

by Alexandre Chartre

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code


On 11/18/20 10:30 AM, David Laight wrote:
> From: Alexandre Chartre
>> Sent: 18 November 2020 07:42
>>
>>
>> On 11/17/20 10:26 PM, Borislav Petkov wrote:
>>> On Tue, Nov 17, 2020 at 07:12:07PM +0100, Alexandre Chartre wrote:
>>>> Some benchmarks are available, in particular from phoronix:
>>>
>>> What I was expecting was benchmarks *you* have run which show that
>>> perf penalty, not something one can find quickly on the internet and
>>> something one cannot always reproduce her-/himself.
>>>
>>> You do know that presenting convincing numbers with a patchset greatly
>>> improves its chances of getting it upstreamed, right?
>>>
>>
>> Well, it looks like I wrongfully assume that KPTI was a well known performance
>> overhead since it was introduced (because it adds extra page-table switches),
>> but you are right I should be presenting my own numbers.
>
> IIRC the penalty comes from the page table switch.
> Doing it at a different time is unlikely to make much difference.
>

Correct, this RFC does not change the overhead. However, it is a step toward
being able to execute selected syscalls or interrupt handlers without
switching to the kernel page-table. The next step would be to identify and add
the necessary mappings to the user page-table so that the selected syscalls can be
executed without switching the page-table.


> For some workloads the penalty is massive - getting on for 50%.
> We are still using old kernels on AWS.
>

Here are some micro-benchmarks of the getppid and getpid syscalls which highlight
the PTI overhead. They use the kernel tools/perf command and the getpid command
from libMicro (https://github.com/redhat-performance/libMicro):

system running 5.10-rc4 booted with nopti:
------------------------------------------

# perf bench syscall basic
# Running 'syscall/basic' benchmark:
# Executed 10000000 getppid() calls
Total time: 0.792 [sec]

0.079223 usecs/op
12622549 ops/sec

# getpid -B 100000
prc thr usecs/call samples errors cnt/samp
getpid 1 1 0.08029 102 0 100000


We can see that the getpid and getppid syscalls have about the same execution
time, around 0.08 usecs. These syscalls are very small and just return
a value, so the time is mostly spent entering/exiting the kernel.


same system booted with pti:
----------------------------

# perf bench syscall basic
# Running 'syscall/basic' benchmark:
# Executed 10000000 getppid() calls
Total time: 2.025 [sec]

0.202527 usecs/op
4937605 ops/sec

# getpid -B 100000
prc thr usecs/call samples errors cnt/samp
getpid 1 1 0.20241 102 0 100000


With PTI, the execution time jumps to 0.20 usecs (+0.12 usecs = +150%).

That's a very extreme case because these are very small syscalls, and the
overhead of switching page-tables is then significant compared to the execution
time of the syscall itself.

So with an overhead of +0.12 usecs per syscall, the PTI impact is significant
for workloads which make a lot of short syscalls. But with longer syscalls, for
example with an average execution time of 2.0 usecs per syscall, the overhead
drops to about 6%.
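
Putting the numbers above together:

  PTI overhead per syscall         : 0.20 - 0.08 = 0.12 usecs
  relative cost, 0.08 usec syscall : 0.12 / 0.08 = +150%
  relative cost, 2.0 usec syscall  : 0.12 / 2.0  = +6%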

alex.

2020-11-18 11:31:53

by Borislav Petkov

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code

On Wed, Nov 18, 2020 at 08:41:42AM +0100, Alexandre Chartre wrote:
> Well, it looks like I wrongfully assume that KPTI was a well known performance
> overhead since it was introduced (because it adds extra page-table switches),
> but you are right I should be presenting my own numbers.

Here's one recipe, courtesy of Mel:

https://github.com/gormanm/mmtests

"
./run-mmtests.sh --no-monitor --config configs/config-workload-poundsyscall test-default

# reboot the machine with pti disabled

./run-mmtests.sh --no-monitor --config configs/config-workload-poundsyscall test-nopti

poundsyscall just calls getppid() so it's a light-weight syscall and a
proxy measure for syscall entry/exit costs. To do the actual compare

cd work/log
../../compare-kernels.sh

and see what gain there is from disabling pti. If you want to compare
the other direction

../../compare-kernels.sh --baseline test-nopti --compare test-default

If you get an error about BinarySearch

(echo y;echo o conf prerequisites_policy follow;echo o conf commit)|cpan
yes | cpan List::BinarySearch

Only use the second line if you want to interactively confirm what cpan
should download and install."

I've CCed him should you have any questions.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2020-11-18 13:25:16

by David Laight

[permalink] [raw]
Subject: RE: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code

From: Alexandre Chartre
> Sent: 18 November 2020 10:30
...
> Correct, this RFC is not changing the overhead. However, it is a step forward
> for being able to execute some selected syscalls or interrupt handlers without
> switching to the kernel page-table. The next step would be to identify and add
> the necessary mapping to the user page-table so that specified syscalls can be
> executed without switching the page-table.

Remember that without PTI user space can read all kernel memory.
(I'm not 100% sure you can force a cache-line read.)
It isn't even that slow.
(Even I can understand how it works.)

So if you are worried about user space doing that you can't really
run anything on the user page tables.

System calls like getpid() are irrelevant - they aren't used (much).
Even the time of day ones are implemented in the VDSO without a
context switch.

So the overheads come from other system calls that 'do work'
without actually sleeping.
I'm guessing things like read, write, sendmsg, recvmsg.

The only interesting system call I can think of is futex.
As well as all the calls that return immediately because the
mutex has been released while entering the kernel, I suspect
that being pre-empted by a different thread (of the same process)
doesn't actually need CR3 reloading (without PTI).

I also suspect that it isn't just the CR3 reload that costs.
There could (depending on the cpu) be associated TLB and/or cache
invalidations that have a much larger effect on programs with
large working sets than on simple benchmark programs.

Now bits of data that you are 'more worried about' could be kept
in physical memory that isn't normally mapped (or referenced by
a TLB) and only mapped when needed.
But that doesn't help the general case.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2020-11-18 17:23:19

by Alexandre Chartre

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code


On 11/18/20 2:22 PM, David Laight wrote:
> From: Alexandre Chartre
>> Sent: 18 November 2020 10:30
> ...
>> Correct, this RFC is not changing the overhead. However, it is a step forward
>> for being able to execute some selected syscalls or interrupt handlers without
>> switching to the kernel page-table. The next step would be to identify and add
>> the necessary mapping to the user page-table so that specified syscalls can be
>> executed without switching the page-table.
>
> Remember that without PTI user space can read all kernel memory.
> (I'm not 100% sure you can force a cache-line read.)
> It isn't even that slow.
> (Even I can understand how it works.)
>
> So if you are worried about user space doing that you can't really
> run anything on the user page tables.

Yes, without PTI, userspace can read all kernel memory. But to run some parts
of the kernel you don't need all kernel mappings. Also, a lot of the kernel
contains non-sensitive information which can be safely exposed to userspace.
So there's probably some room for running carefully selected (and hopefully
useful) syscalls with the user page-table.


> System calls like getpid() are irrelevant - they aren't used (much).
> Even the time of day ones are implemented in the VDSO without a
> context switch.

getpid()/getppid() is interesting because it shows the amount of overhead
PTI is adding. But the impact can be larger if TLB flushes are
also required (as you mentioned below).


> So the overheads come from other system calls that 'do work'
> without actually sleeping.
> I'm guessing things like read, write, sendmsg, recvmsg.
>
> The only interesting system call I can think of is futex.
> As well as all the calls that return immediately because the
> mutex has been released while entering the kernel, I suspect
> that being pre-empted by a different thread (of the same process)
> doesn't actually need CR3 reloading (without PTI).
>
> I also suspect that it isn't just the CR3 reload that costs.
> There could (depending on the cpu) be associated TLB and/or cache
> invalidations that have a much larger effect on programs with
> large working sets than on simple benchmark programs.

Right, although the TLB flush is mitigated with PCID; the impact is
larger when PCID is not available.


> Now bits of data that you are 'more worried about' could be kept
> in physical memory that isn't normally mapped (or referenced by
> a TLB) and only mapped when needed.
> But that doesn't help the general case.
>

Note that having syscalls which can run without switching the
page-table is just one benefit you can get from this RFC. The main
benefit is for integrating Address Space Isolation (ASI), which will be
much more complex if ASI has to plug into the current assembly CR3 switch.

Thanks,

alex.

2020-11-18 19:40:36

by Alexandre Chartre

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code


On 11/18/20 12:29 PM, Borislav Petkov wrote:
> On Wed, Nov 18, 2020 at 08:41:42AM +0100, Alexandre Chartre wrote:
>> Well, it looks like I wrongfully assume that KPTI was a well known performance
>> overhead since it was introduced (because it adds extra page-table switches),
>> but you are right I should be presenting my own numbers.
>
> Here's one recipe, courtesy of Mel:
>
> https://github.com/gormanm/mmtests
>

Thanks for the detailed information. I have run the test and I see the same difference
as with the tools/perf and libMicro results I already sent: there's a 150% difference for
getpid() with and without PTI.

alex.

-----

# ../../compare-kernels.sh --baseline test-nopti --compare test-pti

poundsyscall
test test
nopti pti
Min 2 1.99 ( 0.00%) 5.08 (-155.28%)
Min 4 1.02 ( 0.00%) 2.60 (-154.90%)
Min 6 0.94 ( 0.00%) 2.07 (-120.21%)
Min 8 0.81 ( 0.00%) 1.60 ( -97.53%)
Min 12 0.85 ( 0.00%) 1.65 ( -94.12%)
Min 18 0.82 ( 0.00%) 1.61 ( -96.34%)
Min 24 0.81 ( 0.00%) 1.60 ( -97.53%)
Min 30 0.81 ( 0.00%) 1.60 ( -97.53%)
Min 32 0.81 ( 0.00%) 1.60 ( -97.53%)
Amean 2 2.02 ( 0.00%) 5.10 *-151.83%*
Amean 4 1.03 ( 0.00%) 2.61 *-151.98%*
Amean 6 0.96 ( 0.00%) 2.07 *-116.74%*
Amean 8 0.82 ( 0.00%) 1.60 * -96.56%*
Amean 12 0.87 ( 0.00%) 1.67 * -91.73%*
Amean 18 0.82 ( 0.00%) 1.63 * -97.94%*
Amean 24 0.81 ( 0.00%) 1.60 * -97.41%*
Amean 30 0.82 ( 0.00%) 1.60 * -96.93%*
Amean 32 0.82 ( 0.00%) 1.60 * -96.56%*
Stddev 2 0.02 ( 0.00%) 0.02 ( 33.78%)
Stddev 4 0.01 ( 0.00%) 0.01 ( 7.18%)
Stddev 6 0.01 ( 0.00%) 0.00 ( 68.77%)
Stddev 8 0.01 ( 0.00%) 0.01 ( 10.56%)
Stddev 12 0.01 ( 0.00%) 0.02 ( -12.69%)
Stddev 18 0.01 ( 0.00%) 0.01 (-107.25%)
Stddev 24 0.00 ( 0.00%) 0.00 ( -14.56%)
Stddev 30 0.01 ( 0.00%) 0.01 ( 0.00%)
Stddev 32 0.01 ( 0.00%) 0.00 ( 20.00%)
CoeffVar 2 1.17 ( 0.00%) 0.31 ( 73.70%)
CoeffVar 4 0.82 ( 0.00%) 0.30 ( 63.16%)
CoeffVar 6 1.41 ( 0.00%) 0.20 ( 85.59%)
CoeffVar 8 0.87 ( 0.00%) 0.39 ( 54.50%)
CoeffVar 12 1.66 ( 0.00%) 0.98 ( 41.23%)
CoeffVar 18 0.85 ( 0.00%) 0.89 ( -4.71%)
CoeffVar 24 0.52 ( 0.00%) 0.30 ( 41.97%)
CoeffVar 30 0.65 ( 0.00%) 0.33 ( 49.22%)
CoeffVar 32 0.65 ( 0.00%) 0.26 ( 59.30%)
Max 2 2.04 ( 0.00%) 5.13 (-151.47%)
Max 4 1.04 ( 0.00%) 2.62 (-151.92%)
Max 6 0.98 ( 0.00%) 2.08 (-112.24%)
Max 8 0.83 ( 0.00%) 1.62 ( -95.18%)
Max 12 0.89 ( 0.00%) 1.70 ( -91.01%)
Max 18 0.84 ( 0.00%) 1.66 ( -97.62%)
Max 24 0.82 ( 0.00%) 1.61 ( -96.34%)
Max 30 0.82 ( 0.00%) 1.61 ( -96.34%)
Max 32 0.82 ( 0.00%) 1.61 ( -96.34%)
BAmean-50 2 2.01 ( 0.00%) 5.09 (-153.39%)
BAmean-50 4 1.03 ( 0.00%) 2.60 (-152.62%)
BAmean-50 6 0.95 ( 0.00%) 2.07 (-118.82%)
BAmean-50 8 0.81 ( 0.00%) 1.60 ( -97.53%)
BAmean-50 12 0.86 ( 0.00%) 1.66 ( -92.79%)
BAmean-50 18 0.82 ( 0.00%) 1.62 ( -97.56%)
BAmean-50 24 0.81 ( 0.00%) 1.60 ( -97.53%)
BAmean-50 30 0.81 ( 0.00%) 1.60 ( -97.53%)
BAmean-50 32 0.81 ( 0.00%) 1.60 ( -97.53%)
BAmean-95 2 2.02 ( 0.00%) 5.09 (-151.87%)
BAmean-95 4 1.03 ( 0.00%) 2.61 (-151.99%)
BAmean-95 6 0.95 ( 0.00%) 2.07 (-117.25%)
BAmean-95 8 0.81 ( 0.00%) 1.60 ( -96.72%)
BAmean-95 12 0.87 ( 0.00%) 1.67 ( -91.82%)
BAmean-95 18 0.82 ( 0.00%) 1.63 ( -97.97%)
BAmean-95 24 0.81 ( 0.00%) 1.60 ( -97.53%)
BAmean-95 30 0.81 ( 0.00%) 1.60 ( -97.00%)
BAmean-95 32 0.81 ( 0.00%) 1.60 ( -96.59%)
BAmean-99 2 2.02 ( 0.00%) 5.09 (-151.87%)
BAmean-99 4 1.03 ( 0.00%) 2.61 (-151.99%)
BAmean-99 6 0.95 ( 0.00%) 2.07 (-117.25%)
BAmean-99 8 0.81 ( 0.00%) 1.60 ( -96.72%)
BAmean-99 12 0.87 ( 0.00%) 1.67 ( -91.82%)
BAmean-99 18 0.82 ( 0.00%) 1.63 ( -97.97%)
BAmean-99 24 0.81 ( 0.00%) 1.60 ( -97.53%)
BAmean-99 30 0.81 ( 0.00%) 1.60 ( -97.00%)
BAmean-99 32 0.81 ( 0.00%) 1.60 ( -96.59%)

test test
nopti pti
Duration User 150.13 432.03
Duration System 372.10 657.69
Duration Elapsed 94.17 199.27

2020-11-19 01:52:26

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 12/21] x86/pti: Use PTI stack instead of trampoline stack

On Tue, Nov 17, 2020 at 8:59 AM Alexandre Chartre
<[email protected]> wrote:
>
>
>
> On 11/17/20 4:52 PM, Andy Lutomirski wrote:
> > On Tue, Nov 17, 2020 at 7:07 AM Alexandre Chartre
> > <[email protected]> wrote:
> >>
> >>
> >>
> >> On 11/16/20 7:34 PM, Andy Lutomirski wrote:
> >>> On Mon, Nov 16, 2020 at 10:10 AM Alexandre Chartre
> >>> <[email protected]> wrote:
> >>>>
> >>>>
> >>>> On 11/16/20 5:57 PM, Andy Lutomirski wrote:
> >>>>> On Mon, Nov 16, 2020 at 6:47 AM Alexandre Chartre
> >>>>> <[email protected]> wrote:
> >>>>>>
> >>>>>> When entering the kernel from userland, use the per-task PTI stack
> >>>>>> instead of the per-cpu trampoline stack. Like the trampoline stack,
> >>>>>> the PTI stack is mapped both in the kernel and in the user page-table.
> >>>>>> Using a per-task stack which is mapped into the kernel and the user
> >>>>>> page-table instead of a per-cpu stack will allow executing more code
> >>>>>> before switching to the kernel stack and to the kernel page-table.
> >>>>>
> >>>>> Why?
> >>>>
> >>>> When executing more code in the kernel, we are likely to reach a point
> >>>> where we need to sleep while we are using the user page-table, so we need
> >>>> to be using a per-thread stack.
> >>>>
> >>>>> I can't immediately evaluate how nasty the page table setup is because
> >>>>> it's not in this patch.
> >>>>
> >>>> The page-table is the regular page-table as introduced by PTI. It is just
> >>>> augmented with a few additional mapping which are in patch 11 (x86/pti:
> >>>> Extend PTI user mappings).
> >>>>
> >>>>> But AFAICS the only thing that this enables is sleeping with user pagetables.
> >>>>
> >>>> That's precisely the point, it allows to sleep with the user page-table.
> >>>>
> >>>>> Do we really need to do that?
> >>>>
> >>>> Actually, probably not with this particular patchset, because I do the page-table
> >>>> switch at the very beginning and end of the C handler. I had some code where I
> >>>> moved the page-table switch deeper in the kernel handler where you definitively
> >>>> can sleep (for example, if you switch back to the user page-table before
> >>>> exit_to_user_mode_prepare()).
> >>>>
> >>>> So a first step should probably be to not introduce the per-task PTI trampoline stack,
> >>>> and stick with the existing trampoline stack. The per-task PTI trampoline stack can
> >>>> be introduced later when the page-table switch is moved deeper in the C handler and
> >>>> we can effectively sleep while using the user page-table.
> >>>
> >>> Seems reasonable.
> >>>
> >>
> >> I finally remember why I have introduced a per-task PTI trampoline stack right now:
> >> that's to be able to move the CR3 switch anywhere in the C handler. To do so, we need
> >> a per-task stack to enter (and return) from the C handler as the handler can potentially
> >> go to sleep.
> >>
> >> Without a per-task trampoline stack, we would be limited to call the switch CR3 functions
> >> from the assembly entry code before and after calling the C function handler (also called
> >> from assembly).
> >
> > The noinstr part of the C entry code won't sleep.
> >
>
> But the noinstr part of the handler can sleep, and if it does we will need to
> preserve the trampoline stack (even if we switch to the per-task kernel stack to
> execute the noinstr part).
>
> Example:
>
> #define DEFINE_IDTENTRY(func) \
> static __always_inline void __##func(struct pt_regs *regs); \
> \
> __visible noinstr void func(struct pt_regs *regs) \
> { \
> irqentry_state_t state; -+ \
> | \
> user_pagetable_escape(regs); | use trampoline stack (1)
> state = irqentry_enter(regs); | \
> instrumentation_begin(); -+ \
> run_idt(__##func, regs); |===| run __func() on kernel stack (this can sleep)
> instrumentation_end(); -+ \
> irqentry_exit(regs, state); | use trampoline stack (2)
> user_pagetable_return(regs); -+ \
> }
>
> Between (1) and (2) we need to preserve and use the same trampoline stack
> in case __func() went sleeping.
>

Why? Right now, we have the percpu entry stack, and we do just fine
if we enter on one percpu stack and exit from a different one. We
would need to call from asm to C on the entry stack, return back to
asm, and then switch stacks.

2020-11-19 08:09:07

by Alexandre Chartre

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 12/21] x86/pti: Use PTI stack instead of trampoline stack



On 11/19/20 2:49 AM, Andy Lutomirski wrote:
> On Tue, Nov 17, 2020 at 8:59 AM Alexandre Chartre
> <[email protected]> wrote:
>>
>>
>>
>> On 11/17/20 4:52 PM, Andy Lutomirski wrote:
>>> On Tue, Nov 17, 2020 at 7:07 AM Alexandre Chartre
>>> <[email protected]> wrote:
>>>>
>>>>
>>>>
>>>> On 11/16/20 7:34 PM, Andy Lutomirski wrote:
>>>>> On Mon, Nov 16, 2020 at 10:10 AM Alexandre Chartre
>>>>> <[email protected]> wrote:
>>>>>>
>>>>>>
>>>>>> On 11/16/20 5:57 PM, Andy Lutomirski wrote:
>>>>>>> On Mon, Nov 16, 2020 at 6:47 AM Alexandre Chartre
>>>>>>> <[email protected]> wrote:
>>>>>>>>
>>>>>>>> When entering the kernel from userland, use the per-task PTI stack
>>>>>>>> instead of the per-cpu trampoline stack. Like the trampoline stack,
>>>>>>>> the PTI stack is mapped both in the kernel and in the user page-table.
>>>>>>>> Using a per-task stack which is mapped into the kernel and the user
>>>>>>>> page-table instead of a per-cpu stack will allow executing more code
>>>>>>>> before switching to the kernel stack and to the kernel page-table.
>>>>>>>
>>>>>>> Why?
>>>>>>
>>>>>> When executing more code in the kernel, we are likely to reach a point
>>>>>> where we need to sleep while we are using the user page-table, so we need
>>>>>> to be using a per-thread stack.
>>>>>>
>>>>>>> I can't immediately evaluate how nasty the page table setup is because
>>>>>>> it's not in this patch.
>>>>>>
>>>>>> The page-table is the regular page-table as introduced by PTI. It is just
>>>>>> augmented with a few additional mapping which are in patch 11 (x86/pti:
>>>>>> Extend PTI user mappings).
>>>>>>
>>>>>>> But AFAICS the only thing that this enables is sleeping with user pagetables.
>>>>>>
>>>>>> That's precisely the point, it allows to sleep with the user page-table.
>>>>>>
>>>>>>> Do we really need to do that?
>>>>>>
>>>>>> Actually, probably not with this particular patchset, because I do the page-table
>>>>>> switch at the very beginning and end of the C handler. I had some code where I
>>>>>> moved the page-table switch deeper in the kernel handler where you definitively
>>>>>> can sleep (for example, if you switch back to the user page-table before
>>>>>> exit_to_user_mode_prepare()).
>>>>>>
>>>>>> So a first step should probably be to not introduce the per-task PTI trampoline stack,
>>>>>> and stick with the existing trampoline stack. The per-task PTI trampoline stack can
>>>>>> be introduced later when the page-table switch is moved deeper in the C handler and
>>>>>> we can effectively sleep while using the user page-table.
>>>>>
>>>>> Seems reasonable.
>>>>>
>>>>
>>>> I finally remember why I have introduced a per-task PTI trampoline stack right now:
>>>> that's to be able to move the CR3 switch anywhere in the C handler. To do so, we need
>>>> a per-task stack to enter (and return) from the C handler as the handler can potentially
>>>> go to sleep.
>>>>
>>>> Without a per-task trampoline stack, we would be limited to call the switch CR3 functions
>>>> from the assembly entry code before and after calling the C function handler (also called
>>>> from assembly).
>>>
>>> The noinstr part of the C entry code won't sleep.
>>>
>>
>> But the noinstr part of the handler can sleep, and if it does we will need to
>> preserve the trampoline stack (even if we switch to the per-task kernel stack to
>> execute the noinstr part).
>>
>> Example:
>>
>> #define DEFINE_IDTENTRY(func) \
>> static __always_inline void __##func(struct pt_regs *regs); \
>> \
>> __visible noinstr void func(struct pt_regs *regs) \
>> { \
>> irqentry_state_t state; -+ \
>> | \
>> user_pagetable_escape(regs); | use trampoline stack (1)
>> state = irqentry_enter(regs); | \
>> instrumentation_begin(); -+ \
>> run_idt(__##func, regs); |===| run __func() on kernel stack (this can sleep)
>> instrumentation_end(); -+ \
>> irqentry_exit(regs, state); | use trampoline stack (2)
>> user_pagetable_return(regs); -+ \
>> }
>>
>> Between (1) and (2) we need to preserve and use the same trampoline stack
>> in case __func() went sleeping.
>>
>
> Why? Right now, we have the percpu entry stack, and we do just fine
> if we enter on one percpu stack and exit from a different one.
>
> We would need to call from asm to C on the entry stack, return back to
> asm, and then switch stacks.
>

That's the problem: I didn't want to return to asm, so that the page-table
switch can be done anywhere in the C handler.

So yes, returning to asm to switch the stack is the solution if we want to avoid
having a per-task trampoline stack. The drawback is that this forces the
page-table switch to be done at the beginning and end of the handler; the
page-table switch cannot be moved deeper into the C handler.

But that's probably a good first step (effectively just moving the CR3 switch to C
without adding a per-task trampoline stack). I will update the patches to do that,
and we can defer the per-task trampoline stack until there's an effective
need for it.

alex.

2020-11-19 12:10:36

by Alexandre Chartre

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 12/21] x86/pti: Use PTI stack instead of trampoline stack



On 11/19/20 9:05 AM, Alexandre Chartre wrote:
>>>>>>>>>
>>>>>>>>> When entering the kernel from userland, use the per-task PTI stack
>>>>>>>>> instead of the per-cpu trampoline stack. Like the trampoline stack,
>>>>>>>>> the PTI stack is mapped both in the kernel and in the user page-table.
>>>>>>>>> Using a per-task stack which is mapped into the kernel and the user
>>>>>>>>> page-table instead of a per-cpu stack will allow executing more code
>>>>>>>>> before switching to the kernel stack and to the kernel page-table.
>>>>>>>>
>>>>>>>> Why?
>>>>>>>
>>>>>>> When executing more code in the kernel, we are likely to reach a point
>>>>>>> where we need to sleep while we are using the user page-table, so we need
>>>>>>> to be using a per-thread stack.
>>>>>>>
>>>>>>>> I can't immediately evaluate how nasty the page table setup is because
>>>>>>>> it's not in this patch.
>>>>>>>
>>>>>>> The page-table is the regular page-table as introduced by PTI. It is just
>>>>>>> augmented with a few additional mapping which are in patch 11 (x86/pti:
>>>>>>> Extend PTI user mappings).
>>>>>>>
>>>>>>>>     But AFAICS the only thing that this enables is sleeping with user pagetables.
>>>>>>>
>>>>>>> That's precisely the point, it allows to sleep with the user page-table.
>>>>>>>
>>>>>>>> Do we really need to do that?
>>>>>>>
>>>>>>> Actually, probably not with this particular patchset, because I do the page-table
>>>>>>> switch at the very beginning and end of the C handler. I had some code where I
>>>>>>> moved the page-table switch deeper in the kernel handler where you definitively
>>>>>>> can sleep (for example, if you switch back to the user page-table before
>>>>>>> exit_to_user_mode_prepare()).
>>>>>>>
>>>>>>> So a first step should probably be to not introduce the per-task PTI trampoline stack,
>>>>>>> and stick with the existing trampoline stack. The per-task PTI trampoline stack can
>>>>>>> be introduced later when the page-table switch is moved deeper in the C handler and
>>>>>>> we can effectively sleep while using the user page-table.
>>>>>>
>>>>>> Seems reasonable.
>>>>>>
>>>>>
>>>>> I finally remember why I have introduced a per-task PTI trampoline stack right now:
>>>>> that's to be able to move the CR3 switch anywhere in the C handler. To do so, we need
>>>>> a per-task stack to enter (and return) from the C handler as the handler can potentially
>>>>> go to sleep.
>>>>>
>>>>> Without a per-task trampoline stack, we would be limited to call the switch CR3 functions
>>>>> from the assembly entry code before and after calling the C function handler (also called
>>>>> from assembly).
>>>>
>>>> The noinstr part of the C entry code won't sleep.
>>>>
>>>
>>> But the noinstr part of the handler can sleep, and if it does we will need to
>>> preserve the trampoline stack (even if we switch to the per-task kernel stack to
>>> execute the noinstr part).
>>>
>>> Example:
>>>
>>> #define DEFINE_IDTENTRY(func)                                           \
>>> static __always_inline void __##func(struct pt_regs *regs);             \
>>>                                                                           \
>>> __visible noinstr void func(struct pt_regs *regs)                       \
>>> {                                                                       \
>>>           irqentry_state_t state;         -+                              \
>>>                                            |                              \
>>>           user_pagetable_escape(regs);     | use trampoline stack (1)
>>>           state = irqentry_enter(regs);    |                              \
>>>           instrumentation_begin();        -+                              \
>>>           run_idt(__##func, regs);       |===| run __func() on kernel stack (this can sleep)
>>>           instrumentation_end();          -+                              \
>>>           irqentry_exit(regs, state);      | use trampoline stack (2)
>>>           user_pagetable_return(regs);    -+                              \
>>> }
>>>
>>> Between (1) and (2) we need to preserve and use the same trampoline stack
>>> in case __func() went sleeping.
>>>
>>
>> Why?  Right now, we have the percpu entry stack, and we do just fine
>> if we enter on one percpu stack and exit from a different one.
>> We would need to call from asm to C on the entry stack, return back to
>> asm, and then switch stacks.
>>
>
> That's the problem: I didn't want to return back to asm, so that the pagetable
> switch can be done anywhere in the C handler.
>
> So yes, returning to asm to switch the stack is the solution if we want to avoid
> having per-task trampoline stack. The drawback is that this forces to do the
> page-table switch at the beginning and end of the handler; the pagetable switch
> cannot be moved deeper down into the C handler.
>
> But that's probably a good first step (effectively just moving CR3 switch to C
> without adding per-task trampoline stack). I will update the patches to do that,
> and we can defer the per-task trampoline stack to later if there's an effective
> need for it.
>

That might not be a good first step after all... Calling CR3 switch C functions
from assembly introduces extra pt_regs copies between the trampoline stack and the
kernel stack.

Currently, when entering a syscall, we immediately switch CR3 and build pt_regs
directly on the kernel stack. On return, registers are restored from pt_regs on
the kernel stack, the return frame is built on the trampoline stack, and then we
switch CR3.

To call CR3 switch C functions on syscall entry, we need to switch to the trampoline
stack, build pt_regs on the trampoline stack, call CR3 switch, switch to the kernel
stack, copy pt_regs to the kernel stack. On return, we have to copy pt_regs back to
the trampoline stack, call CR3 switch, restore registers.

This is less of an impact for interrupt because we enter on the trampoline stack and
the current code already builds pt_regs on the trampoline stack and copies it to the
kernel stack (although this can certainly be avoided in the current code).

I am not comfortable adding these extra steps in the syscall and interrupt paths
as the current code is fairly optimized. With a per-task trampoline stack, we
don't have that extra copy because we can build pt_regs directly on the trampoline
stack and it will be preserved even when switching to the kernel stack. On
syscall/interrupt return, this also saves a copy of the iret frame from the kernel
stack to the trampoline stack.
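
To make this concrete, here is a rough C sketch of the two entry sequences.
The helper names (switch_to_kernel_cr3(), switch_to_user_cr3(),
run_on_kernel_stack(), do_syscall_64_body()) are hypothetical placeholders,
not the actual functions from the patches:

/* (a) CR3 switch as a C call made from the asm entry code, using the
 * per-cpu trampoline stack. */
__visible noinstr void syscall_prepare_pagetable(struct pt_regs *regs)
{
        /* Entered from asm on the trampoline stack with the user
         * page-table; pt_regs was built on the trampoline stack. */
        switch_to_kernel_cr3();
        /* Back in asm: switch %rsp to the kernel stack, then copy
         * pt_regs from the trampoline stack to the kernel stack
         * before calling the syscall handler -- the extra copy. */
}

/* (b) Per-task PTI trampoline stack: pt_regs is built once, directly on
 * the PTI stack, which stays valid after switching to the kernel stack,
 * so no pt_regs copy is needed on entry or on return. */
__visible noinstr void syscall_enter(struct pt_regs *regs)
{
        switch_to_kernel_cr3();                         /* still on the PTI stack */
        run_on_kernel_stack(do_syscall_64_body, regs);  /* handler may sleep */
        switch_to_user_cr3();                           /* back on the PTI stack */
}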

alex.

2020-11-19 16:11:12

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 12/21] x86/pti: Use PTI stack instead of trampoline stack

On Thu, Nov 19, 2020 at 4:06 AM Alexandre Chartre
<[email protected]> wrote:
>
>
>
> On 11/19/20 9:05 AM, Alexandre Chartre wrote:
> >>>>>>>>>
> >>>>>>>>> When entering the kernel from userland, use the per-task PTI stack
> >>>>>>>>> instead of the per-cpu trampoline stack. Like the trampoline stack,
> >>>>>>>>> the PTI stack is mapped both in the kernel and in the user page-table.
> >>>>>>>>> Using a per-task stack which is mapped into the kernel and the user
> >>>>>>>>> page-table instead of a per-cpu stack will allow executing more code
> >>>>>>>>> before switching to the kernel stack and to the kernel page-table.
> >>>>>>>>
> >>>>>>>> Why?
> >>>>>>>
> >>>>>>> When executing more code in the kernel, we are likely to reach a point
> >>>>>>> where we need to sleep while we are using the user page-table, so we need
> >>>>>>> to be using a per-thread stack.
> >>>>>>>
> >>>>>>>> I can't immediately evaluate how nasty the page table setup is because
> >>>>>>>> it's not in this patch.
> >>>>>>>
> >>>>>>> The page-table is the regular page-table as introduced by PTI. It is just
> >>>>>>> augmented with a few additional mapping which are in patch 11 (x86/pti:
> >>>>>>> Extend PTI user mappings).
> >>>>>>>
> >>>>>>>> But AFAICS the only thing that this enables is sleeping with user pagetables.
> >>>>>>>
> >>>>>>> That's precisely the point, it allows to sleep with the user page-table.
> >>>>>>>
> >>>>>>>> Do we really need to do that?
> >>>>>>>
> >>>>>>> Actually, probably not with this particular patchset, because I do the page-table
> >>>>>>> switch at the very beginning and end of the C handler. I had some code where I
> >>>>>>> moved the page-table switch deeper in the kernel handler where you definitively
> >>>>>>> can sleep (for example, if you switch back to the user page-table before
> >>>>>>> exit_to_user_mode_prepare()).
> >>>>>>>
> >>>>>>> So a first step should probably be to not introduce the per-task PTI trampoline stack,
> >>>>>>> and stick with the existing trampoline stack. The per-task PTI trampoline stack can
> >>>>>>> be introduced later when the page-table switch is moved deeper in the C handler and
> >>>>>>> we can effectively sleep while using the user page-table.
> >>>>>>
> >>>>>> Seems reasonable.
> >>>>>>
> >>>>>
> >>>>> I finally remember why I have introduced a per-task PTI trampoline stack right now:
> >>>>> that's to be able to move the CR3 switch anywhere in the C handler. To do so, we need
> >>>>> a per-task stack to enter (and return) from the C handler as the handler can potentially
> >>>>> go to sleep.
> >>>>>
> >>>>> Without a per-task trampoline stack, we would be limited to call the switch CR3 functions
> >>>>> from the assembly entry code before and after calling the C function handler (also called
> >>>>> from assembly).
> >>>>
> >>>> The noinstr part of the C entry code won't sleep.
> >>>>
> >>>
> >>> But the noinstr part of the handler can sleep, and if it does we will need to
> >>> preserve the trampoline stack (even if we switch to the per-task kernel stack to
> >>> execute the noinstr part).
> >>>
> >>> Example:
> >>>
> >>> #define DEFINE_IDTENTRY(func)                                           \
> >>> static __always_inline void __##func(struct pt_regs *regs);             \
> >>>                                                                          \
> >>> __visible noinstr void func(struct pt_regs *regs)                       \
> >>> {                                                                       \
> >>>           irqentry_state_t state;         -+                              \
> >>>                                            |                              \
> >>>           user_pagetable_escape(regs);     | use trampoline stack (1)
> >>>           state = irqentry_enter(regs);    |                              \
> >>>           instrumentation_begin();        -+                              \
> >>>           run_idt(__##func, regs);       |===| run __func() on kernel stack (this can sleep)
> >>>           instrumentation_end();          -+                              \
> >>>           irqentry_exit(regs, state);      | use trampoline stack (2)
> >>>           user_pagetable_return(regs);    -+                              \
> >>> }
> >>>
> >>> Between (1) and (2) we need to preserve and use the same trampoline stack
> >>> in case __func() went sleeping.
> >>>
> >>
> >> Why? Right now, we have the percpu entry stack, and we do just fine
> >> if we enter on one percpu stack and exit from a different one.
> >> We would need to call from asm to C on the entry stack, return back to
> >> asm, and then switch stacks.
> >>
> >
> > That's the problem: I didn't want to return back to asm, so that the pagetable
> > switch can be done anywhere in the C handler.
> >
> > So yes, returning to asm to switch the stack is the solution if we want to avoid
> > having per-task trampoline stack. The drawback is that this forces to do the
> > page-table switch at the beginning and end of the handler; the pagetable switch
> > cannot be moved deeper down into the C handler.
> >
> > But that's probably a good first step (effectively just moving CR3 switch to C
> > without adding per-task trampoline stack). I will update the patches to do that,
> > and we can defer the per-task trampoline stack to later if there's an effective
> > need for it.
> >
>
> That might not be a good first step after all... Calling CR3 switch C functions
> from assembly introduces extra pt_regs copies between the trampoline stack and the
> kernel stack.
>
> Currently, when entering a syscall, we immediately switch CR3 and build pt_regs
> directly on the kernel stack. On return, registers are restored from pt_regs on
> the kernel stack, the return frame is built on the trampoline stack, and then we
> switch CR3.
>
> To call CR3 switch C functions on syscall entry, we need to switch to the trampoline
> stack, build pt_regs on the trampoline stack, call CR3 switch, switch to the kernel
> stack, copy pt_regs to the kernel stack. On return, we have to copy pt_regs back to
> the trampoline stack, call CR3 switch, restore registers.
>
> This is less of an impact for interrupt because we enter on the trampoline stack and
> the current code already builds pt_regs on the trampoline stack and copies it to the
> kernel stack (although this can certainly be avoided in the current code).
>
> I am not comfortable adding these extra steps in the syscall and interrupt paths
> as the current code is fairly optimized. With a per-task trampoline stack, we
> don't have that extra copy because we can build pt_regs directly on the trampoline
> stack and it will be preserved even when switching to the kernel stack. On
> syscall/interrupt return, this also saves a copy of the iret frame from the kernel
> stack to the trampoline stack.

IIRC during the early KPTI development, we went through a stage with
per-task stacks. It was a mess, and we got rid of it for good reason.
Your series has a bad diffstat, is a lot of churn, doesn't even handle
the unmap part yet, and, by itself, adds basically no value.

We need to take a big step back and figure out what problem any of
this is solving. As I understand it, most of the issue is that you
want to be able to switch between kernel and guest mode on one core
without exposing kernel memory to Meltdown or Spectre attacks from
guest code on another core. If so, it seems to me that a much nicer
solution would be to use a per-cpu or per-vcpu stack for the vm entry.
It could plausibly even use the existing entry stack. If you used a
per-vcpu stack, though, you could even sleep with some care.

2020-11-19 17:05:02

by Alexandre Chartre

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 12/21] x86/pti: Use PTI stack instead of trampoline stack


On 11/19/20 5:06 PM, Andy Lutomirski wrote:
> On Thu, Nov 19, 2020 at 4:06 AM Alexandre Chartre
> <[email protected]> wrote:
>>
>> On 11/19/20 9:05 AM, Alexandre Chartre wrote:
>>>>>>>>>>>
>>>>>>>>>>> When entering the kernel from userland, use the per-task PTI stack
>>>>>>>>>>> instead of the per-cpu trampoline stack. Like the trampoline stack,
>>>>>>>>>>> the PTI stack is mapped both in the kernel and in the user page-table.
>>>>>>>>>>> Using a per-task stack which is mapped into the kernel and the user
>>>>>>>>>>> page-table instead of a per-cpu stack will allow executing more code
>>>>>>>>>>> before switching to the kernel stack and to the kernel page-table.
>>>>>>>>>>
>>>>>>>>>> Why?
>>>>>>>>>
>>>>>>>>> When executing more code in the kernel, we are likely to reach a point
>>>>>>>>> where we need to sleep while we are using the user page-table, so we need
>>>>>>>>> to be using a per-thread stack.
>>>>>>>>>
>>>>>>>>>> I can't immediately evaluate how nasty the page table setup is because
>>>>>>>>>> it's not in this patch.
>>>>>>>>>
>>>>>>>>> The page-table is the regular page-table as introduced by PTI. It is just
>>>>>>>>> augmented with a few additional mapping which are in patch 11 (x86/pti:
>>>>>>>>> Extend PTI user mappings).
>>>>>>>>>
>>>>>>>>>> But AFAICS the only thing that this enables is sleeping with user pagetables.
>>>>>>>>>
>>>>>>>>> That's precisely the point, it allows to sleep with the user page-table.
>>>>>>>>>
>>>>>>>>>> Do we really need to do that?
>>>>>>>>>
>>>>>>>>> Actually, probably not with this particular patchset, because I do the page-table
>>>>>>>>> switch at the very beginning and end of the C handler. I had some code where I
>>>>>>>>> moved the page-table switch deeper in the kernel handler where you definitively
>>>>>>>>> can sleep (for example, if you switch back to the user page-table before
>>>>>>>>> exit_to_user_mode_prepare()).
>>>>>>>>>
>>>>>>>>> So a first step should probably be to not introduce the per-task PTI trampoline stack,
>>>>>>>>> and stick with the existing trampoline stack. The per-task PTI trampoline stack can
>>>>>>>>> be introduced later when the page-table switch is moved deeper in the C handler and
>>>>>>>>> we can effectively sleep while using the user page-table.
>>>>>>>>
>>>>>>>> Seems reasonable.
>>>>>>>>
>>>>>>>
>>>>>>> I finally remember why I have introduced a per-task PTI trampoline stack right now:
>>>>>>> that's to be able to move the CR3 switch anywhere in the C handler. To do so, we need
>>>>>>> a per-task stack to enter (and return) from the C handler as the handler can potentially
>>>>>>> go to sleep.
>>>>>>>
>>>>>>> Without a per-task trampoline stack, we would be limited to call the switch CR3 functions
>>>>>>> from the assembly entry code before and after calling the C function handler (also called
>>>>>>> from assembly).
>>>>>>
>>>>>> The noinstr part of the C entry code won't sleep.
>>>>>>
>>>>>
>>>>> But the noinstr part of the handler can sleep, and if it does we will need to
>>>>> preserve the trampoline stack (even if we switch to the per-task kernel stack to
>>>>> execute the noinstr part).
>>>>>
>>>>> Example:
>>>>>
>>>>> #define DEFINE_IDTENTRY(func)                                           \
>>>>> static __always_inline void __##func(struct pt_regs *regs);             \
>>>>>                                                                          \
>>>>> __visible noinstr void func(struct pt_regs *regs)                       \
>>>>> {                                                                       \
>>>>>           irqentry_state_t state;         -+                              \
>>>>>                                            |                              \
>>>>>           user_pagetable_escape(regs);     | use trampoline stack (1)
>>>>>           state = irqentry_enter(regs);    |                              \
>>>>>           instrumentation_begin();        -+                              \
>>>>>           run_idt(__##func, regs);       |===| run __func() on kernel stack (this can sleep)
>>>>>           instrumentation_end();          -+                              \
>>>>>           irqentry_exit(regs, state);      | use trampoline stack (2)
>>>>>           user_pagetable_return(regs);    -+                              \
>>>>> }
>>>>>
>>>>> Between (1) and (2) we need to preserve and use the same trampoline stack
>>>>> in case __func() went sleeping.
>>>>>
>>>>
>>>> Why? Right now, we have the percpu entry stack, and we do just fine
>>>> if we enter on one percpu stack and exit from a different one.
>>>> We would need to call from asm to C on the entry stack, return back to
>>>> asm, and then switch stacks.
>>>>
>>>
>>> That's the problem: I didn't want to return back to asm, so that the pagetable
>>> switch can be done anywhere in the C handler.
>>>
>>> So yes, returning to asm to switch the stack is the solution if we want to avoid
>>> having per-task trampoline stack. The drawback is that this forces to do the
>>> page-table switch at the beginning and end of the handler; the pagetable switch
>>> cannot be moved deeper down into the C handler.
>>>
>>> But that's probably a good first step (effectively just moving CR3 switch to C
>>> without adding per-task trampoline stack). I will update the patches to do that,
>>> and we can defer the per-task trampoline stack to later if there's an effective
>>> need for it.
>>>
>>
>> That might not be a good first step after all... Calling CR3 switch C functions
>> from assembly introduces extra pt_regs copies between the trampoline stack and the
>> kernel stack.
>>
>> Currently, when entering a syscall, we immediately switch CR3 and build pt_regs
>> directly on the kernel stack. On return, registers are restored from pt_regs on
>> the kernel stack, the return frame is built on the trampoline stack, and then we
>> switch CR3.
>>
>> To call CR3 switch C functions on syscall entry, we need to switch to the trampoline
>> stack, build pt_regs on the trampoline stack, call CR3 switch, switch to the kernel
>> stack, copy pt_regs to the kernel stack. On return, we have to copy pt_regs back to
>> the trampoline stack, call CR3 switch, restore registers.
>>
>> This is less of an impact for interrupt because we enter on the trampoline stack and
>> the current code already builds pt_regs on the trampoline stack and copies it to the
>> kernel stack (although this can certainly be avoided in the current code).
>>
>> I am not comfortable adding these extra steps in the syscall and interrupt paths
>> as the current code is fairly optimized. With a per-task trampoline stack, we
>> don't have that extra copy because we can build pt_regs directly on the trampoline
>> stack and it will be preserved even when switching to the kernel stack. On
>> syscall/interrupt return, this also saves a copy of the iret frame from the kernel
>> stack to the trampoline stack.
>
> IIRC during the early KPTI development, we went through a stage with
> per-task stacks. It was a mess, and we got rid of it for good reason.
> Your series has a bad diffstat, is a lot of churn, doesn't even handle
> the unmap part yet, and, by itself, adds basically no value.

I had some concerns about this RFC and they have been confirmed :-)

> We need to take a big step back and figure out what problem any of
> this is solving. As I understand it, most of the issue is that you
> want to be able to switch between kernel and guest mode on one core
> without exposing kernel memory to Meltdown or Spectre attacks from
> guest code on another core. If so, it seems to me that a much nicer
> solution would be to use a per-cpu or per-vcpu stack for the vm entry.
> It could plausibly even use the existing entry stack. If you used a
> per-vcpu stack, though, you could even sleep with some care.
>

A goal for this change was to prepare a smoother integration of Address
Space Isolation (ASI), in particular to meet the recommendation to have
KPTI use ASI. Given the extra complication of refactoring the CR3 switch,
and hence of plugging ASI into KPTI, I think it is probably best that we
first focus on having a clean and simple (but still generic) ASI
implementation which initially will only be used by KVM. Once we have the
ASI framework in place (with KVM being a consumer of ASI), we will certainly
be in a better position to re-evaluate whether there is a simple way to
integrate it with KPTI.

Thanks for all the feedback and for your time.

alex.

2020-11-19 19:14:48

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 12/21] x86/pti: Use PTI stack instead of trampoline stack

On Mon, Nov 16 2020 at 19:10, Alexandre Chartre wrote:
> On 11/16/20 5:57 PM, Andy Lutomirski wrote:
>> On Mon, Nov 16, 2020 at 6:47 AM Alexandre Chartre
>> <[email protected]> wrote:
> When executing more code in the kernel, we are likely to reach a point
> where we need to sleep while we are using the user page-table, so we need
> to be using a per-thread stack.
>
>> I can't immediately evaluate how nasty the page table setup is because
>> it's not in this patch.
>
> The page-table is the regular page-table as introduced by PTI. It is just
> augmented with a few additional mapping which are in patch 11 (x86/pti:
> Extend PTI user mappings).
>
>> But AFAICS the only thing that this enables is sleeping with user pagetables.
>
> That's precisely the point, it allows to sleep with the user page-table.

Coming late, but this does not make any sense to me.

Unless you map most of the kernel into the user page-table, sleeping with
the user page-table _cannot_ work. And if you do that, you broke KPTI.

Nor can you pick arbitrary points in the C code of an exception handler
to switch to the kernel mapping, unless you map everything which might be
touched before that point into user space.

How is that supposed to work?

Thanks,

tglx

2020-11-19 19:20:36

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 11/21] x86/pti: Extend PTI user mappings

On Tue, Nov 17 2020 at 09:42, Alexandre Chartre wrote:
> On 11/17/20 12:06 AM, Andy Lutomirski wrote:
> The PTI stack does have guard pages because it maps only a part of the task
> stack into the user page-table, so pages around the PTI stack are not mapped
> into the user-pagetable (the page below is the task stack guard, and the page
> above is part of the kernel-only stack so it's never mapped into the user
> page-table).
>
> + * +-------------+
> + * |             | ^                       ^
> + * | kernel-only | | KERNEL_STACK_SIZE     |
> + * |    stack    | |                       |
> + * |             | V                       |
> + * +-------------+ <- top of kernel stack  | THREAD_SIZE
> + * |             | ^                       |
> + * | kernel and  | | KERNEL_STACK_SIZE     |
> + * |  PTI stack  | |                       |
> + * |             | V                       v
> + * +-------------+ <- top of stack

Well, the PTI stack might have guard pages, but the kernel stack can now
underflow into the PTI stack. Not good.

Thanks,

tglx


2020-11-19 19:35:11

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code

On Tue, Nov 17 2020 at 09:19, Alexandre Chartre wrote:
> On 11/16/20 9:24 PM, Borislav Petkov wrote:
>> On Mon, Nov 16, 2020 at 03:47:36PM +0100, Alexandre Chartre wrote:
>> So PTI was added exactly to *not* have kernel memory mapped in the user
>> page table. You're partially reversing that...
>
> We are not reversing PTI, we are extending it.

You widen the exposure surface without providing an argument why it is safe.

> PTI removes all kernel mapping from the user page-table. However there's
> no issue with mapping some kernel data into the user page-table as long as
> these data have no sensitive information.

Define sensitive information.

> Actually, PTI is already doing that but with a very limited scope. PTI adds
> into the user page-table some kernel mappings which are needed for userland
> to enter the kernel (such as the kernel entry text, the ESPFIX, the
> CPU_ENTRY_AREA_BASE...).
>
> So here, we are extending the PTI mapping so that we can execute more kernel
> code while using the user page-table; it's a kind of PTI on steroids.

Let's just look at a syscall:

noinstr long syscall_enter_from_user_mode(struct pt_regs *regs, long syscall)
{
        long ret;

        enter_from_user_mode(regs);
        lockdep_hardirqs_off();
        user_exit_irqoff();
        trace_hardirqs_off_finish();

So just looking at the 3 calls above, how are you going to guarantee
that everything these callchains touch is mapped into user space?

Not to talk about everything which comes after that.

Thanks,

tglx


2020-11-19 19:59:18

by Alexandre Chartre

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 12/21] x86/pti: Use PTI stack instead of trampoline stack


On 11/19/20 8:10 PM, Thomas Gleixner wrote:
> On Mon, Nov 16 2020 at 19:10, Alexandre Chartre wrote:
>> On 11/16/20 5:57 PM, Andy Lutomirski wrote:
>>> On Mon, Nov 16, 2020 at 6:47 AM Alexandre Chartre
>>> <[email protected]> wrote:
>> When executing more code in the kernel, we are likely to reach a point
>> where we need to sleep while we are using the user page-table, so we need
>> to be using a per-thread stack.
>>
>>> I can't immediately evaluate how nasty the page table setup is because
>>> it's not in this patch.
>>
>> The page-table is the regular page-table as introduced by PTI. It is just
>> augmented with a few additional mapping which are in patch 11 (x86/pti:
>> Extend PTI user mappings).
>>
>>> But AFAICS the only thing that this enables is sleeping with user pagetables.
>>
>> That's precisely the point, it allows to sleep with the user page-table.
>
> Coming late, but this does not make any sense to me.
>
> Unless you map most of the kernel into the user page-table, sleeping with
> the user page-table _cannot_ work. And if you do that, you broke KPTI.
>
> Nor can you pick arbitrary points in the C code of an exception handler
> to switch to the kernel mapping, unless you map everything which might be
> touched before that point into user space.
>
> How is that supposed to work?
>

Sorry, I mixed up a few things; I got confused with my own code, which is
not a good sign...

It's not sleeping with the user page-table, which, as you mentioned, doesn't
make sense; it's sleeping with the kernel page-table but with the PTI stack.

Basically, it is:
- entering C code with (user page-table, PTI stack);
- then it switches to the kernel page-table so we have (kernel page-table, PTI stack);
- and then it switches to the kernel stack so we have (kernel page-table, kernel stack).

As this is all C code, some of which is executed with the PTI stack, we need the PTI stack
to be per-task so that the stack is preserved, in case that C code does a sleep/schedule
(no matter if this happens when using the PTI stack or the kernel stack).
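
Expressed as a purely illustrative handler skeleton (hypothetical helper
names; run_idt() is the stack-switching helper from the earlier
DEFINE_IDTENTRY example in this thread):

__visible noinstr void handler(struct pt_regs *regs)
{
        /* (1) entered from asm: user page-table, per-task PTI stack     */
        switch_to_kernel_cr3();     /* (2) kernel page-table, PTI stack  */
        run_idt(__handler, regs);   /* (3) switch to the kernel stack and
                                     *     run the body; the body may
                                     *     sleep/schedule, and the per-task
                                     *     PTI stack is preserved across it */
        switch_to_user_cr3();       /* back to the user page-table before
                                     * returning to asm                     */
}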

alex.

2020-11-19 21:23:43

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 12/21] x86/pti: Use PTI stack instead of trampoline stack

On Thu, Nov 19 2020 at 20:55, Alexandre Chartre wrote:
> On 11/19/20 8:10 PM, Thomas Gleixner wrote:
> Sorry, I mixed up a few things; I got confused with my own code, which is
> not a good sign...
>
> It's not sleeping with the user page-table, which, as you mentioned, doesn't
> make sense; it's sleeping with the kernel page-table but with the PTI stack.
>
> Basically, it is:
> - entering C code with (user page-table, PTI stack);
> - then it switches to the kernel page-table so we have (kernel page-table, PTI stack);
> - and then it switches to the kernel stack so we have (kernel page-table, kernel stack).
>
> As this is all C code, some of which is executed with the PTI stack, we need the PTI stack
> to be per-task so that the stack is preserved, in case that C code does a sleep/schedule
> (no matter if this happens when using the PTI stack or the kernel stack).

That makes some more sense, but I'm not convinced that this dual stack
is really a good thing.

Thanks,

tglx

2020-11-24 07:12:07

by kernel test robot

[permalink] [raw]
Subject: [x86/pti] 5da9e742d1: PANIC:double_fault


Greeting,

FYI, we noticed the following commit (built with gcc-9):

commit: 5da9e742d1934e86760f02caa769eecb239feafe ("[RFC][PATCH v2 12/21] x86/pti: Use PTI stack instead of trampoline stack")
url: https://github.com/0day-ci/linux/commits/Alexandre-Chartre/x86-pti-Defer-CR3-switch-to-C-code/20201116-225620
base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git 1fcd009102ee02e217f2e7635ab65517d785da8e

in testcase: boot

on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 8G

caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):


+-------------------------------------------------------+------------+------------+
|                                                       | f2df5fbe42 | 5da9e742d1 |
+-------------------------------------------------------+------------+------------+
| PANIC:double_fault                                    | 0          | 4          |
| double_fault:#[##]                                    | 0          | 4          |
| Kernel_panic-not_syncing:Fatal_exception_in_interrupt | 0          | 4          |
+-------------------------------------------------------+------------+------------+


If you fix the issue, kindly add following tag
Reported-by: kernel test robot <[email protected]>


[ 6.786553] Loading compiled-in X.509 certificates
[ 6.788127] zswap: loaded using pool lzo/zbud
[ 6.789836] Key type ._fscrypt registered
[ 6.791177] Key type .fscrypt registered
[ 6.792501] Key type fscrypt-provisioning registered
[ 6.794677] traps: PANIC: double fault, error_code: 0x0
[ 6.794679] double fault: 0000 [#1] SMP PTI
[ 6.794680] CPU: 0 PID: 87 Comm: modprobe Not tainted 5.10.0-rc1-00013-g5da9e742d193 #1
[ 6.794681] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[ 6.794681] RIP: 0023:0xf7fd507d
[ 6.794683] Code: 8b 4c 24 04 53 56 57 55 8b 01 85 c0 75 23 8b 44 24 18 8b 5c 24 1c 8b 4c 24 20 8b 54 24 24 8b 74 24 28 8b 7c 24 2c 8b 6c 24 30 <cd> 80 5d 5f 5e 5b c3 5d 5f 5e 5b e9 30 09 00 00 65 8b 15 04 00 00
[ 6.794684] RSP: 002b:00000000fffc710c EFLAGS: 00010246
[ 6.794686] RAX: 0000000000000005 RBX: 00000000fffc7330 RCX: 0000000000088000
[ 6.794687] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ 6.794688] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[ 6.794689] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 6.794690] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 6.794690] FS: 0000000000000000 GS: 00000000f7ff1de4
[ 6.794691] Modules linked in:
[ 6.824939] ---[ end trace 2d660ddeebdfa224 ]---
[ 6.824941] RIP: 0023:0xf7fd507d
[ 6.824942] Code: 8b 4c 24 04 53 56 57 55 8b 01 85 c0 75 23 8b 44 24 18 8b 5c 24 1c 8b 4c 24 20 8b 54 24 24 8b 74 24 28 8b 7c 24 2c 8b 6c 24 30 <cd> 80 5d 5f 5e 5b c3 5d 5f 5e 5b e9 30 09 00 00 65 8b 15 04 00 00
[ 6.824943] RSP: 002b:00000000fffc710c EFLAGS: 00010246
[ 6.824945] RAX: 0000000000000005 RBX: 00000000fffc7330 RCX: 0000000000088000
[ 6.824946] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ 6.824947] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[ 6.824948] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 6.824949] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 6.824950] FS: 0000000000000000(0000) GS:ffff88823fc00000(0063) knlGS:00000000f7ff1de4
[ 6.824950] CS: 0023 DS: 002b ES: 002b CR0: 0000000080050033
[ 6.824952] CR2: ffffc9000032fff8 CR3: 000000012fd52000 CR4: 00000000000406f0
[ 6.824953] Kernel panic - not syncing: Fatal exception in interrupt
[ 6.825052] Kernel Offset: disabled

Kboot worker: lkp-worker11
Elapsed time: 60



To reproduce:

# build kernel
cd linux
cp config-5.10.0-rc1-00013-g5da9e742d193 .config
make HOSTCC=gcc-9 CC=gcc-9 ARCH=x86_64 olddefconfig prepare modules_prepare bzImage

git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp qemu -k <bzImage> job-script # job-script is attached in this email



Thanks,
Oliver Sang


Attachments:
(No filename) (4.16 kB)
config-5.10.0-rc1-00013-g5da9e742d193 (192.42 kB)
job-script (4.61 kB)
dmesg.xz (10.68 kB)
Download all attachments