LinuxLists.cc - [PATCH 00/35 v5] PTI support for x32

2018-04-16 15:39:41

by Joerg Roedel

[permalink] [raw]

Subject: [PATCH 00/35 v5] PTI support for x32

Hi,

here is the 5th iteration of my PTI enablement patches for
x86-32. There are no real changes between v4 and v5 besides
that I rebased the whole patch-set to v4.17-rc1 and resolved
the numerous conflicts that this caused.

Two separate fixes came up since the last post and I sent them
out separatly. One is already in v4.17-rc1 (commit e3e288121408)
and the other was sent a few hours ago
(https://lkml.org/lkml/2018/4/16/230).

I pushed the rebased patches together with the mentioned fix
to:

git://git.kernel.org/pub/scm/linux/kernel/git/joro/linux.git pti-x32-v5

for easier testing.

I tested this version again with my load-test of running
perf-top/various x86-selftests/kernel-compile in a loop for
a couple of hours. This showed no issues. I also briefly
tested a 64bit kernel and this also worked as expected.

Previous versions of these patches can be found at:

* For v4
Post : https://marc.info/?l=linux-kernel&m=152122860630236&w=2
Git : git://git.kernel.org/pub/scm/linux/kernel/git/joro/linux.git pti-x32-v4

* For v3:
Post : https://marc.info/?l=linux-kernel&m=152024559419876&w=2
Git : git://git.kernel.org/pub/scm/linux/kernel/git/joro/linux.git pti-x32-v3

* For v2:
Post : https://marc.info/?l=linux-kernel&m=151816914932088&w=2
Git : git://git.kernel.org/pub/scm/linux/kernel/git/joro/linux.git pti-x32-v2

Please review.

Thanks,

Joerg

Joerg Roedel (35):
x86/asm-offsets: Move TSS_sp0 and TSS_sp1 to asm-offsets.c
x86/entry/32: Rename TSS_sysenter_sp0 to TSS_entry_stack
x86/entry/32: Load task stack from x86_tss.sp1 in SYSENTER handler
x86/entry/32: Put ESPFIX code into a macro
x86/entry/32: Unshare NMI return path
x86/entry/32: Split off return-to-kernel path
x86/entry/32: Enter the kernel via trampoline stack
x86/entry/32: Leave the kernel via trampoline stack
x86/entry/32: Introduce SAVE_ALL_NMI and RESTORE_ALL_NMI
x86/entry/32: Handle Entry from Kernel-Mode on Entry-Stack
x86/entry/32: Simplify debug entry point
x86/32: Use tss.sp1 as cpu_current_top_of_stack
x86/entry/32: Add PTI cr3 switch to non-NMI entry/exit points
x86/entry/32: Add PTI cr3 switches to NMI handler code
x86/pgtable: Rename pti_set_user_pgd to pti_set_user_pgtbl
x86/pgtable/pae: Unshare kernel PMDs when PTI is enabled
x86/pgtable/32: Allocate 8k page-tables when PTI is enabled
x86/pgtable: Move pgdp kernel/user conversion functions to pgtable.h
x86/pgtable: Move pti_set_user_pgtbl() to pgtable.h
x86/pgtable: Move two more functions from pgtable_64.h to pgtable.h
x86/mm/pae: Populate valid user PGD entries
x86/mm/pae: Populate the user page-table with user pgd's
x86/mm/legacy: Populate the user page-table with user pgd's
x86/mm/pti: Add an overflow check to pti_clone_pmds()
x86/mm/pti: Define X86_CR3_PTI_PCID_USER_BIT on x86_32
x86/mm/pti: Clone CPU_ENTRY_AREA on PMD level on x86_32
x86/mm/dump_pagetables: Define INIT_PGD
x86/pgtable/pae: Use separate kernel PMDs for user page-table
x86/ldt: Reserve address-space range on 32 bit for the LDT
x86/ldt: Define LDT_END_ADDR
x86/ldt: Split out sanity check in map_ldt_struct()
x86/ldt: Enable LDT user-mapping for PAE
x86/pti: Allow CONFIG_PAGE_TABLE_ISOLATION for x86_32
x86/mm/pti: Add Warning when booting on a PCID capable CPU
x86/entry/32: Add debug code to check entry/exit cr3

arch/x86/Kconfig.debug | 12 +
arch/x86/entry/entry_32.S | 640 +++++++++++++++++++++++-----
arch/x86/include/asm/mmu_context.h | 5 -
arch/x86/include/asm/pgtable-2level.h | 9 +
arch/x86/include/asm/pgtable-2level_types.h | 3 +
arch/x86/include/asm/pgtable-3level.h | 7 +
arch/x86/include/asm/pgtable-3level_types.h | 6 +-
arch/x86/include/asm/pgtable.h | 87 ++++
arch/x86/include/asm/pgtable_32.h | 2 -
arch/x86/include/asm/pgtable_32_types.h | 9 +-
arch/x86/include/asm/pgtable_64.h | 89 +---
arch/x86/include/asm/pgtable_64_types.h | 3 +
arch/x86/include/asm/pgtable_types.h | 28 +-
arch/x86/include/asm/processor-flags.h | 8 +-
arch/x86/include/asm/processor.h | 4 -
arch/x86/include/asm/switch_to.h | 6 +-
arch/x86/include/asm/thread_info.h | 2 -
arch/x86/kernel/asm-offsets.c | 5 +
arch/x86/kernel/asm-offsets_32.c | 2 +-
arch/x86/kernel/asm-offsets_64.c | 2 -
arch/x86/kernel/cpu/common.c | 9 +-
arch/x86/kernel/head_32.S | 20 +-
arch/x86/kernel/ldt.c | 137 ++++--
arch/x86/kernel/process.c | 2 -
arch/x86/kernel/process_32.c | 4 +-
arch/x86/mm/dump_pagetables.c | 21 +-
arch/x86/mm/pgtable.c | 105 ++++-
arch/x86/mm/pti.c | 42 +-
security/Kconfig | 2 +-
29 files changed, 967 insertions(+), 304 deletions(-)

--
2.7.4

2018-04-16 15:27:23

by Joerg Roedel

[permalink] [raw]

Subject: [PATCH 32/35] x86/ldt: Enable LDT user-mapping for PAE

From: Joerg Roedel <[email protected]>

This adds the needed special case for PAE to get the LDT
mapped into the user page-table when PTI is enabled. The big
difference to the other paging modes is that we don't have a
full top-level PGD entry available for the LDT, but only PMD
entry.

Signed-off-by: Joerg Roedel <[email protected]>
---
arch/x86/include/asm/mmu_context.h | 5 ----
arch/x86/kernel/ldt.c | 53 ++++++++++++++++++++++++++++++++++++++
2 files changed, 53 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 57e3785..28b2376 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -71,12 +71,7 @@ struct ldt_struct {

static inline void *ldt_slot_va(int slot)
{
-#ifdef CONFIG_X86_64
return (void *)(LDT_BASE_ADDR + LDT_SLOT_STRIDE * slot);
-#else
- BUG();
- return (void *)fix_to_virt(FIX_HOLE);
-#endif
}

/*
diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
index e68ce37..da80296 100644
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -126,6 +126,57 @@ static void do_sanity_check(struct mm_struct *mm,
}
}

+#ifdef CONFIG_X86_PAE
+
+static pmd_t *pgd_to_pmd_walk(pgd_t *pgd, unsigned long va)
+{
+ p4d_t *p4d;
+ pud_t *pud;
+
+ if (pgd->pgd == 0)
+ return NULL;
+
+ p4d = p4d_offset(pgd, va);
+ if (p4d_none(*p4d))
+ return NULL;
+
+ pud = pud_offset(p4d, va);
+ if (pud_none(*pud))
+ return NULL;
+
+ return pmd_offset(pud, va);
+}
+
+static void map_ldt_struct_to_user(struct mm_struct *mm)
+{
+ pgd_t *k_pgd = pgd_offset(mm, LDT_BASE_ADDR);
+ pgd_t *u_pgd = kernel_to_user_pgdp(k_pgd);
+ pmd_t *k_pmd, *u_pmd;
+
+ k_pmd = pgd_to_pmd_walk(k_pgd, LDT_BASE_ADDR);
+ u_pmd = pgd_to_pmd_walk(u_pgd, LDT_BASE_ADDR);
+
+ if (static_cpu_has(X86_FEATURE_PTI) && !mm->context.ldt)
+ set_pmd(u_pmd, *k_pmd);
+}
+
+static void sanity_check_ldt_mapping(struct mm_struct *mm)
+{
+ pgd_t *k_pgd = pgd_offset(mm, LDT_BASE_ADDR);
+ pgd_t *u_pgd = kernel_to_user_pgdp(k_pgd);
+ bool had_kernel, had_user;
+ pmd_t *k_pmd, *u_pmd;
+
+ k_pmd = pgd_to_pmd_walk(k_pgd, LDT_BASE_ADDR);
+ u_pmd = pgd_to_pmd_walk(u_pgd, LDT_BASE_ADDR);
+ had_kernel = (k_pmd->pmd != 0);
+ had_user = (u_pmd->pmd != 0);
+
+ do_sanity_check(mm, had_kernel, had_user);
+}
+
+#else /* !CONFIG_X86_PAE */
+
static void map_ldt_struct_to_user(struct mm_struct *mm)
{
pgd_t *pgd = pgd_offset(mm, LDT_BASE_ADDR);
@@ -143,6 +194,8 @@ static void sanity_check_ldt_mapping(struct mm_struct *mm)
do_sanity_check(mm, had_kernel, had_user);
}

+#endif /* CONFIG_X86_PAE */
+
/*
* If PTI is enabled, this maps the LDT into the kernelmode and
* usermode tables for the given mm.
--
2.7.4

2018-04-16 15:27:37

by Joerg Roedel

[permalink] [raw]

Subject: [PATCH 33/35] x86/pti: Allow CONFIG_PAGE_TABLE_ISOLATION for x86_32

From: Joerg Roedel <[email protected]>

Allow PTI to be compiled on x86_32.

Signed-off-by: Joerg Roedel <[email protected]>
---
security/Kconfig | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/security/Kconfig b/security/Kconfig
index c430206..afa91c6 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -57,7 +57,7 @@ config SECURITY_NETWORK
config PAGE_TABLE_ISOLATION
bool "Remove the kernel mapping in user mode"
default y
- depends on X86_64 && !UML
+ depends on X86 && !UML
help
This feature reduces the number of hardware side channels by
ensuring that the majority of kernel addresses are not mapped
--
2.7.4

2018-04-16 15:28:39

by Joerg Roedel

[permalink] [raw]

Subject: [PATCH 31/35] x86/ldt: Split out sanity check in map_ldt_struct()

From: Joerg Roedel <[email protected]>

This splits out the mapping sanity check and the actual
mapping of the LDT to user-space from the map_ldt_struct()
function in a way so that it is re-usable for PAE paging.

Signed-off-by: Joerg Roedel <[email protected]>
---
arch/x86/kernel/ldt.c | 82 ++++++++++++++++++++++++++++++++++++---------------
1 file changed, 58 insertions(+), 24 deletions(-)

diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
index 46c349c..e68ce37 100644
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -100,6 +100,49 @@ static struct ldt_struct *alloc_ldt_struct(unsigned int num_entries)
return new_ldt;
}

+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+
+static void do_sanity_check(struct mm_struct *mm,
+ bool had_kernel_mapping,
+ bool had_user_mapping)
+{
+ if (mm->context.ldt) {
+ /*
+ * We already had an LDT. The top-level entry should already
+ * have been allocated and synchronized with the usermode
+ * tables.
+ */
+ WARN_ON(!had_kernel_mapping);
+ if (static_cpu_has(X86_FEATURE_PTI))
+ WARN_ON(!had_user_mapping);
+ } else {
+ /*
+ * This is the first time we're mapping an LDT for this process.
+ * Sync the pgd to the usermode tables.
+ */
+ WARN_ON(had_kernel_mapping);
+ if (static_cpu_has(X86_FEATURE_PTI))
+ WARN_ON(had_user_mapping);
+ }
+}
+
+static void map_ldt_struct_to_user(struct mm_struct *mm)
+{
+ pgd_t *pgd = pgd_offset(mm, LDT_BASE_ADDR);
+
+ if (static_cpu_has(X86_FEATURE_PTI) && !mm->context.ldt)
+ set_pgd(kernel_to_user_pgdp(pgd), *pgd);
+}
+
+static void sanity_check_ldt_mapping(struct mm_struct *mm)
+{
+ pgd_t *pgd = pgd_offset(mm, LDT_BASE_ADDR);
+ bool had_kernel = (pgd->pgd != 0);
+ bool had_user = (kernel_to_user_pgdp(pgd)->pgd != 0);
+
+ do_sanity_check(mm, had_kernel, had_user);
+}
+
/*
* If PTI is enabled, this maps the LDT into the kernelmode and
* usermode tables for the given mm.
@@ -115,9 +158,8 @@ static struct ldt_struct *alloc_ldt_struct(unsigned int num_entries)
static int
map_ldt_struct(struct mm_struct *mm, struct ldt_struct *ldt, int slot)
{
-#ifdef CONFIG_PAGE_TABLE_ISOLATION
- bool is_vmalloc, had_top_level_entry;
unsigned long va;
+ bool is_vmalloc;
spinlock_t *ptl;
pgd_t *pgd;
int i;
@@ -131,13 +173,15 @@ map_ldt_struct(struct mm_struct *mm, struct ldt_struct *ldt, int slot)
*/
WARN_ON(ldt->slot != -1);

+ /* Check if the current mappings are sane */
+ sanity_check_ldt_mapping(mm);
+
/*
* Did we already have the top level entry allocated? We can't
* use pgd_none() for this because it doens't do anything on
* 4-level page table kernels.
*/
pgd = pgd_offset(mm, LDT_BASE_ADDR);
- had_top_level_entry = (pgd->pgd != 0);

is_vmalloc = is_vmalloc_addr(ldt->entries);

@@ -172,35 +216,25 @@ map_ldt_struct(struct mm_struct *mm, struct ldt_struct *ldt, int slot)
pte_unmap_unlock(ptep, ptl);
}

- if (mm->context.ldt) {
- /*
- * We already had an LDT. The top-level entry should already
- * have been allocated and synchronized with the usermode
- * tables.
- */
- WARN_ON(!had_top_level_entry);
- if (static_cpu_has(X86_FEATURE_PTI))
- WARN_ON(!kernel_to_user_pgdp(pgd)->pgd);
- } else {
- /*
- * This is the first time we're mapping an LDT for this process.
- * Sync the pgd to the usermode tables.
- */
- WARN_ON(had_top_level_entry);
- if (static_cpu_has(X86_FEATURE_PTI)) {
- WARN_ON(kernel_to_user_pgdp(pgd)->pgd);
- set_pgd(kernel_to_user_pgdp(pgd), *pgd);
- }
- }
+ /* Propagate LDT mapping to the user page-table */
+ map_ldt_struct_to_user(mm);

va = (unsigned long)ldt_slot_va(slot);
flush_tlb_mm_range(mm, va, va + LDT_SLOT_STRIDE, 0);

ldt->slot = slot;
-#endif
return 0;
}

+#else /* !CONFIG_PAGE_TABLE_ISOLATION */
+
+static int
+map_ldt_struct(struct mm_struct *mm, struct ldt_struct *ldt, int slot)
+{
+ return 0;
+}
+#endif /* CONFIG_PAGE_TABLE_ISOLATION */
+
static void free_ldt_pgtables(struct mm_struct *mm)
{
#ifdef CONFIG_PAGE_TABLE_ISOLATION
--
2.7.4

2018-04-16 15:28:52

by Joerg Roedel

[permalink] [raw]

Subject: [PATCH 08/35] x86/entry/32: Leave the kernel via trampoline stack

From: Joerg Roedel <[email protected]>

Switch back to the trampoline stack before returning to
userspace.

Signed-off-by: Joerg Roedel <[email protected]>
---
arch/x86/entry/entry_32.S | 79 +++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 77 insertions(+), 2 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 1d6b527..927df80 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -347,6 +347,60 @@
.endm

/*
+ * Switch back from the kernel stack to the entry stack.
+ *
+ * The %esp register must point to pt_regs on the task stack. It will
+ * first calculate the size of the stack-frame to copy, depending on
+ * whether we return to VM86 mode or not. With that it uses 'rep movsl'
+ * to copy the contents of the stack over to the entry stack.
+ *
+ * We must be very careful here, as we can't trust the contents of the
+ * task-stack once we switched to the entry-stack. When an NMI happens
+ * while on the entry-stack, the NMI handler will switch back to the top
+ * of the task stack, overwriting our stack-frame we are about to copy.
+ * Therefore we switch the stack only after everything is copied over.
+ */
+.macro SWITCH_TO_ENTRY_STACK
+
+ ALTERNATIVE "", "jmp .Lend_\@", X86_FEATURE_XENPV
+
+ /* Bytes to copy */
+ movl $PTREGS_SIZE, %ecx
+
+#ifdef CONFIG_VM86
+ testl $(X86_EFLAGS_VM), PT_EFLAGS(%esp)
+ jz .Lcopy_pt_regs_\@
+
+ /* Additional 4 registers to copy when returning to VM86 mode */
+ addl $(4 * 4), %ecx
+
+.Lcopy_pt_regs_\@:
+#endif
+
+ /* Initialize source and destination for movsl */
+ movl PER_CPU_VAR(cpu_tss_rw + TSS_sp0), %edi
+ subl %ecx, %edi
+ movl %esp, %esi
+
+ /* Save future stack pointer in %ebx */
+ movl %edi, %ebx
+
+ /* Copy over the stack-frame */
+ shrl $2, %ecx
+ cld
+ rep movsl
+
+ /*
+ * Switch to entry-stack - needs to happen after everything is
+ * copied because the NMI handler will overwrite the task-stack
+ * when on entry-stack
+ */
+ movl %ebx, %esp
+
+.Lend_\@:
+.endm
+
+/*
* %eax: prev task
* %edx: next task
*/
@@ -586,25 +640,45 @@ ENTRY(entry_SYSENTER_32)

/* Opportunistic SYSEXIT */
TRACE_IRQS_ON /* User mode traces as IRQs on. */
+
+ /*
+ * Setup entry stack - we keep the pointer in %eax and do the
+ * switch after almost all user-state is restored.
+ */
+
+ /* Load entry stack pointer and allocate frame for eflags/eax */
+ movl PER_CPU_VAR(cpu_tss_rw + TSS_sp0), %eax
+ subl $(2*4), %eax
+
+ /* Copy eflags and eax to entry stack */
+ movl PT_EFLAGS(%esp), %edi
+ movl PT_EAX(%esp), %esi
+ movl %edi, (%eax)
+ movl %esi, 4(%eax)
+
+ /* Restore user registers and segments */
movl PT_EIP(%esp), %edx /* pt_regs->ip */
movl PT_OLDESP(%esp), %ecx /* pt_regs->sp */
1: mov PT_FS(%esp), %fs
PTGS_TO_GS
+
popl %ebx /* pt_regs->bx */
addl $2*4, %esp /* skip pt_regs->cx and pt_regs->dx */
popl %esi /* pt_regs->si */
popl %edi /* pt_regs->di */
popl %ebp /* pt_regs->bp */
- popl %eax /* pt_regs->ax */
+
+ /* Switch to entry stack */
+ movl %eax, %esp

/*
* Restore all flags except IF. (We restore IF separately because
* STI gives a one-instruction window in which we won't be interrupted,
* whereas POPF does not.)
*/
- addl $PT_EFLAGS-PT_DS, %esp /* point esp at pt_regs->flags */
btr $X86_EFLAGS_IF_BIT, (%esp)
popfl
+ popl %eax

/*
* Return back to the vDSO, which will pop ecx and edx.
@@ -673,6 +747,7 @@ ENTRY(entry_INT80_32)

restore_all:
TRACE_IRQS_IRET
+ SWITCH_TO_ENTRY_STACK
.Lrestore_all_notrace:
CHECK_AND_APPLY_ESPFIX
.Lrestore_nocheck:
--
2.7.4

2018-04-16 15:28:53

by Joerg Roedel

[permalink] [raw]

Subject: [PATCH 35/35] x86/entry/32: Add debug code to check entry/exit cr3

From: Joerg Roedel <[email protected]>

Add a config option that enabled code to check that we enter
and leave the kernel with the correct cr3. This is needed
because we have no NX protection of user-addresses in the
kernel-cr3 on x86-32 and wouldn't notice that type of bug
otherwise.

Signed-off-by: Joerg Roedel <[email protected]>
---
arch/x86/Kconfig.debug | 12 ++++++++++++
arch/x86/entry/entry_32.S | 43 +++++++++++++++++++++++++++++++++++++++++++
2 files changed, 55 insertions(+)

diff --git a/arch/x86/Kconfig.debug b/arch/x86/Kconfig.debug
index 192e4d2..a57f556 100644
--- a/arch/x86/Kconfig.debug
+++ b/arch/x86/Kconfig.debug
@@ -337,6 +337,18 @@ config X86_DEBUG_FPU

If unsure, say N.

+config X86_DEBUG_ENTRY_CR3
+ bool "Debug CR3 for Kernel entry/exit"
+ depends on X86_32 && PAGE_TABLE_ISOLATION
+ ---help---
+ Add instructions to the x86-32 entry code to check whether the kernel
+ is entered and left with the correct CR3. When PTI is enabled, this
+ checks whether we enter the kernel with the user-space cr3 when
+ coming from user-mode and if we leave with user-cr3 back to
+ user-space.
+
+ If unsure, say N.
+
config PUNIT_ATOM_DEBUG
tristate "ATOM Punit debug driver"
depends on PCI
diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index f47e535..6b371a9 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -166,6 +166,24 @@
.Lend_\@:
.endm

+.macro BUG_IF_WRONG_CR3 no_user_check=0
+#ifdef CONFIG_X86_DEBUG_ENTRY_CR3
+ ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
+ .if \no_user_check == 0
+ /* coming from usermode? */
+ testl $SEGMENT_RPL_MASK, PT_CS(%esp)
+ jz .Lend_\@
+ .endif
+ /* On user-cr3? */
+ movl %cr3, %eax
+ testl $PTI_SWITCH_MASK, %eax
+ jnz .Lend_\@
+ /* From userspace with kernel cr3 - BUG */
+ ud2
+.Lend_\@:
+#endif
+.endm
+
/*
* Switch to kernel cr3 if not already loaded and return current cr3 in
* \scratch_reg
@@ -218,6 +236,8 @@
.macro SAVE_ALL_NMI cr3_reg:req
SAVE_ALL

+ BUG_IF_WRONG_CR3
+
/*
* Now switch the CR3 when PTI is enabled.
*
@@ -229,6 +249,7 @@

.Lend_\@:
.endm
+
/*
* This is a sneaky trick to help the unwinder find pt_regs on the stack. The
* frame pointer is replaced with an encoded pointer to pt_regs. The encoding
@@ -292,6 +313,8 @@

.Lswitched_\@:

+ BUG_IF_WRONG_CR3
+
RESTORE_REGS pop=\pop
.endm

@@ -362,6 +385,8 @@

ALTERNATIVE "", "jmp .Lend_\@", X86_FEATURE_XENPV

+ BUG_IF_WRONG_CR3
+
SWITCH_TO_KERNEL_CR3 scratch_reg=%eax

/*
@@ -803,6 +828,7 @@ ENTRY(entry_SYSENTER_32)
*/
pushfl
pushl %eax
+ BUG_IF_WRONG_CR3 no_user_check=1
SWITCH_TO_KERNEL_CR3 scratch_reg=%eax
popl %eax
popfl
@@ -897,6 +923,7 @@ ENTRY(entry_SYSENTER_32)
* whereas POPF does not.)
*/
btr $X86_EFLAGS_IF_BIT, (%esp)
+ BUG_IF_WRONG_CR3 no_user_check=1
popfl
popl %eax

@@ -974,6 +1001,8 @@ restore_all:
/* Switch back to user CR3 */
SWITCH_TO_USER_CR3 scratch_reg=%eax

+ BUG_IF_WRONG_CR3
+
/* Restore user state */
RESTORE_REGS pop=4 # skip orig_eax/error_code
.Lirq_return:
@@ -987,6 +1016,7 @@ restore_all:
restore_all_kernel:
TRACE_IRQS_IRET
PARANOID_EXIT_TO_KERNEL_MODE
+ BUG_IF_WRONG_CR3
RESTORE_REGS 4
jmp .Lirq_return

@@ -994,6 +1024,19 @@ restore_all_kernel:
ENTRY(iret_exc )
pushl $0 # no error code
pushl $do_iret_error
+
+#ifdef CONFIG_X86_DEBUG_ENTRY_CR3
+ /*
+ * The stack-frame here is the one that iret faulted on, so its a
+ * return-to-user frame. We are on kernel-cr3 because we come here from
+ * the fixup code. This confuses the CR3 checker, so switch to user-cr3
+ * as the checker expects it.
+ */
+ pushl %eax
+ SWITCH_TO_USER_CR3 scratch_reg=%eax
+ popl %eax
+#endif
+
jmp common_exception
.previous
_ASM_EXTABLE(.Lirq_return, iret_exc)
--
2.7.4

2018-04-16 15:28:57

by Joerg Roedel

[permalink] [raw]

Subject: [PATCH 34/35] x86/mm/pti: Add Warning when booting on a PCID capable CPU

From: Joerg Roedel <[email protected]>

Warn the user in case the performance can be significantly
improved by switching to a 64-bit kernel.

Suggested-by: Andy Lutomirski <[email protected]>
Signed-off-by: Joerg Roedel <[email protected]>
---
arch/x86/mm/pti.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)

diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index 60a1ee0..ad905e0 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -497,6 +497,22 @@ void __init pti_init(void)

pr_info("enabled\n");

+#ifdef CONFIG_X86_32
+ if (boot_cpu_has(X86_FEATURE_PCID)) {
+ /* Use printk to work around pr_fmt() */
+ printk(KERN_WARNING "\n");
+ printk(KERN_WARNING "************************************************************\n");
+ printk(KERN_WARNING "** WARNING! WARNING! WARNING! WARNING! WARNING! WARNING! **\n");
+ printk(KERN_WARNING "** **\n");
+ printk(KERN_WARNING "** You are using 32-bit PTI on a 64-bit PCID-capable CPU. **\n");
+ printk(KERN_WARNING "** Your performance will increase dramatically if you **\n");
+ printk(KERN_WARNING "** switch to a 64-bit kernel! **\n");
+ printk(KERN_WARNING "** **\n");
+ printk(KERN_WARNING "** WARNING! WARNING! WARNING! WARNING! WARNING! WARNING! **\n");
+ printk(KERN_WARNING "************************************************************\n");
+ }
+#endif
+
pti_clone_user_shared();

/* Undo all global bits from the init pagetables in head_64.S: */
--
2.7.4

2018-04-16 15:29:22

by Joerg Roedel

[permalink] [raw]

Subject: [PATCH 11/35] x86/entry/32: Simplify debug entry point

From: Joerg Roedel <[email protected]>

The common exception entry code now handles the
entry-from-sysenter stack situation and makes sure to leave
with the same stack as it entered the kernel.

So there is no need anymore for the special handling in the
debug entry code.

Signed-off-by: Joerg Roedel <[email protected]>
---
arch/x86/entry/entry_32.S | 35 +++--------------------------------
1 file changed, 3 insertions(+), 32 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index b3477ff..71e1cb3 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -1231,41 +1231,12 @@ END(common_exception)

ENTRY(debug)
/*
- * #DB can happen at the first instruction of
- * entry_SYSENTER_32 or in Xen's SYSENTER prologue. If this
- * happens, then we will be running on a very small stack. We
- * need to detect this condition and switch to the thread
- * stack before calling any C code at all.
- *
- * If you edit this code, keep in mind that NMIs can happen in here.
+ * Entry from sysenter is now handled in common_exception
*/
ASM_CLAC
pushl $-1 # mark this as an int
-
- SAVE_ALL
- ENCODE_FRAME_POINTER
- xorl %edx, %edx # error code 0
- movl %esp, %eax # pt_regs pointer
-
- /* Are we currently on the SYSENTER stack? */
- movl PER_CPU_VAR(cpu_entry_area), %ecx
- addl $CPU_ENTRY_AREA_entry_stack + SIZEOF_entry_stack, %ecx
- subl %eax, %ecx /* ecx = (end of entry_stack) - esp */
- cmpl $SIZEOF_entry_stack, %ecx
- jb .Ldebug_from_sysenter_stack
-
- TRACE_IRQS_OFF
- call do_debug
- jmp ret_from_exception
-
-.Ldebug_from_sysenter_stack:
- /* We're on the SYSENTER stack. Switch off. */
- movl %esp, %ebx
- movl PER_CPU_VAR(cpu_current_top_of_stack), %esp
- TRACE_IRQS_OFF
- call do_debug
- movl %ebx, %esp
- jmp ret_from_exception
+ pushl $do_debug
+ jmp common_exception
END(debug)

/*
--
2.7.4

2018-04-16 15:29:50

by Joerg Roedel

[permalink] [raw]

Subject: [PATCH 29/35] x86/ldt: Reserve address-space range on 32 bit for the LDT

From: Joerg Roedel <[email protected]>

Reserve 2MB/4MB of address-space for mapping the LDT to
user-space on 32 bit PTI kernels.

Signed-off-by: Joerg Roedel <[email protected]>
---
arch/x86/include/asm/pgtable_32_types.h | 7 +++++--
arch/x86/mm/dump_pagetables.c | 9 +++++++++
2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_32_types.h b/arch/x86/include/asm/pgtable_32_types.h
index e3225e8..1fa76c9 100644
--- a/arch/x86/include/asm/pgtable_32_types.h
+++ b/arch/x86/include/asm/pgtable_32_types.h
@@ -50,13 +50,16 @@ extern bool __vmalloc_start_set; /* set once high_memory is set */
((FIXADDR_TOT_START - PAGE_SIZE * (CPU_ENTRY_AREA_PAGES + 1)) \
& PMD_MASK)

-#define PKMAP_BASE \
+#define LDT_BASE_ADDR \
((CPU_ENTRY_AREA_BASE - PAGE_SIZE) & PMD_MASK)

+#define PKMAP_BASE \
+ ((LDT_BASE_ADDR - PAGE_SIZE) & PMD_MASK)
+
#ifdef CONFIG_HIGHMEM
# define VMALLOC_END (PKMAP_BASE - 2 * PAGE_SIZE)
#else
-# define VMALLOC_END (CPU_ENTRY_AREA_BASE - 2 * PAGE_SIZE)
+# define VMALLOC_END (LDT_BASE_ADDR - 2 * PAGE_SIZE)
#endif

#define MODULES_VADDR VMALLOC_START
diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
index 84498da..956c886 100644
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -122,6 +122,9 @@ enum address_markers_idx {
#ifdef CONFIG_HIGHMEM
PKMAP_BASE_NR,
#endif
+#ifdef CONFIG_MODIFY_LDT_SYSCALL
+ LDT_NR,
+#endif
CPU_ENTRY_AREA_NR,
FIXADDR_START_NR,
END_OF_SPACE_NR,
@@ -135,6 +138,9 @@ static struct addr_marker address_markers[] = {
#ifdef CONFIG_HIGHMEM
[PKMAP_BASE_NR] = { 0UL, "Persistent kmap() Area" },
#endif
+#ifdef CONFIG_MODIFY_LDT_SYSCALL
+ [LDT_NR] = { 0UL, "LDT remap" },
+#endif
[CPU_ENTRY_AREA_NR] = { 0UL, "CPU entry area" },
[FIXADDR_START_NR] = { 0UL, "Fixmap area" },
[END_OF_SPACE_NR] = { -1, NULL }
@@ -608,6 +614,9 @@ static int __init pt_dump_init(void)
# endif
address_markers[FIXADDR_START_NR].start_address = FIXADDR_START;
address_markers[CPU_ENTRY_AREA_NR].start_address = CPU_ENTRY_AREA_BASE;
+# ifdef CONFIG_MODIFY_LDT_SYSCALL
+ address_markers[LDT_NR].start_address = LDT_BASE_ADDR;
+# endif
#endif
return 0;
}
--
2.7.4

2018-04-16 15:29:53

by Joerg Roedel

[permalink] [raw]

Subject: [PATCH 27/35] x86/mm/dump_pagetables: Define INIT_PGD

From: Joerg Roedel <[email protected]>

Define INIT_PGD to point to the correct initial page-table
for 32 and 64 bit and use it where needed. This fixes the
build on 32 bit with CONFIG_PAGE_TABLE_ISOLATION enabled.

Signed-off-by: Joerg Roedel <[email protected]>
---
arch/x86/mm/dump_pagetables.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
index 62a7e9f..84498da 100644
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -110,6 +110,8 @@ static struct addr_marker address_markers[] = {
[END_OF_SPACE_NR] = { -1, NULL }
};

+#define INIT_PGD ((pgd_t *) &init_top_pgt)
+
#else /* CONFIG_X86_64 */

enum address_markers_idx {
@@ -138,6 +140,8 @@ static struct addr_marker address_markers[] = {
[END_OF_SPACE_NR] = { -1, NULL }
};

+#define INIT_PGD (swapper_pg_dir)
+
#endif /* !CONFIG_X86_64 */

/* Multipliers for offsets within the PTEs */
@@ -495,11 +499,7 @@ static inline bool is_hypervisor_range(int idx)
static void ptdump_walk_pgd_level_core(struct seq_file *m, pgd_t *pgd,
bool checkwx, bool dmesg)
{
-#ifdef CONFIG_X86_64
- pgd_t *start = (pgd_t *) &init_top_pgt;
-#else
- pgd_t *start = swapper_pg_dir;
-#endif
+ pgd_t *start = INIT_PGD;
pgprotval_t prot, eff;
int i;
struct pg_state st = {};
@@ -565,7 +565,7 @@ EXPORT_SYMBOL_GPL(ptdump_walk_pgd_level_debugfs);
static void ptdump_walk_user_pgd_level_checkwx(void)
{
#ifdef CONFIG_PAGE_TABLE_ISOLATION
- pgd_t *pgd = (pgd_t *) &init_top_pgt;
+ pgd_t *pgd = INIT_PGD;

if (!static_cpu_has(X86_FEATURE_PTI))
return;
--
2.7.4

2018-04-16 15:30:26

by Joerg Roedel

[permalink] [raw]

Subject: [PATCH 30/35] x86/ldt: Define LDT_END_ADDR

From: Joerg Roedel <[email protected]>

It marks the end of the address-space range reserved for the
LDT. The LDT-code will use it when unmapping the LDT for
user-space.

Signed-off-by: Joerg Roedel <[email protected]>
---
arch/x86/include/asm/pgtable_32_types.h | 2 ++
arch/x86/include/asm/pgtable_64_types.h | 1 +
arch/x86/kernel/ldt.c | 2 +-
3 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/pgtable_32_types.h b/arch/x86/include/asm/pgtable_32_types.h
index 1fa76c9..6d5f795 100644
--- a/arch/x86/include/asm/pgtable_32_types.h
+++ b/arch/x86/include/asm/pgtable_32_types.h
@@ -53,6 +53,8 @@ extern bool __vmalloc_start_set; /* set once high_memory is set */
#define LDT_BASE_ADDR \
((CPU_ENTRY_AREA_BASE - PAGE_SIZE) & PMD_MASK)

+#define LDT_END_ADDR (LDT_BASE_ADDR + PMD_SIZE)
+
#define PKMAP_BASE \
((LDT_BASE_ADDR - PAGE_SIZE) & PMD_MASK)

diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
index 355b488..f78ded7 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -104,6 +104,7 @@ extern unsigned int ptrs_per_p4d;
#define LDT_PGD_ENTRY_L5 -112UL
#define LDT_PGD_ENTRY (pgtable_l5_enabled ? LDT_PGD_ENTRY_L5 : LDT_PGD_ENTRY_L4)
#define LDT_BASE_ADDR (LDT_PGD_ENTRY << PGDIR_SHIFT)
+#define LDT_END_ADDR (LDT_BASE_ADDR + PGDIR_SIZE)

#define __VMALLOC_BASE_L4 0xffffc90000000000
#define __VMALLOC_BASE_L5 0xffa0000000000000
diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
index d41d896..46c349c 100644
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -206,7 +206,7 @@ static void free_ldt_pgtables(struct mm_struct *mm)
#ifdef CONFIG_PAGE_TABLE_ISOLATION
struct mmu_gather tlb;
unsigned long start = LDT_BASE_ADDR;
- unsigned long end = start + (1UL << PGDIR_SHIFT);
+ unsigned long end = LDT_END_ADDR;

if (!static_cpu_has(X86_FEATURE_PTI))
return;
--
2.7.4

2018-04-16 15:30:43

by Joerg Roedel

[permalink] [raw]

Subject: [PATCH 28/35] x86/pgtable/pae: Use separate kernel PMDs for user page-table

From: Joerg Roedel <[email protected]>

We need separate kernel PMDs in the user page-table when PTI
is enabled to map the per-process LDT for user-space.

Signed-off-by: Joerg Roedel <[email protected]>
---
arch/x86/mm/pgtable.c | 100 ++++++++++++++++++++++++++++++++++++++++----------
1 file changed, 81 insertions(+), 19 deletions(-)

diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index f4211d2..ae98d4c 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -178,6 +178,14 @@ static void pgd_dtor(pgd_t *pgd)
*/
#define PREALLOCATED_PMDS UNSHARED_PTRS_PER_PGD

+/*
+ * We allocate separate PMDs for the kernel part of the user page-table
+ * when PTI is enabled. We need them to map the per-process LDT into the
+ * user-space page-table.
+ */
+#define PREALLOCATED_USER_PMDS (static_cpu_has(X86_FEATURE_PTI) ? \
+ KERNEL_PGD_PTRS : 0)
+
void pud_populate(struct mm_struct *mm, pud_t *pudp, pmd_t *pmd)
{
paravirt_alloc_pmd(mm, __pa(pmd) >> PAGE_SHIFT);
@@ -198,14 +206,14 @@ void pud_populate(struct mm_struct *mm, pud_t *pudp, pmd_t *pmd)

/* No need to prepopulate any pagetable entries in non-PAE modes. */
#define PREALLOCATED_PMDS 0
-
+#define PREALLOCATED_USER_PMDS 0
#endif /* CONFIG_X86_PAE */

-static void free_pmds(struct mm_struct *mm, pmd_t *pmds[])
+static void free_pmds(struct mm_struct *mm, pmd_t *pmds[], int count)
{
int i;

- for(i = 0; i < PREALLOCATED_PMDS; i++)
+ for(i = 0; i < count; i++)
if (pmds[i]) {
pgtable_pmd_page_dtor(virt_to_page(pmds[i]));
free_page((unsigned long)pmds[i]);
@@ -213,7 +221,7 @@ static void free_pmds(struct mm_struct *mm, pmd_t *pmds[])
}
}

-static int preallocate_pmds(struct mm_struct *mm, pmd_t *pmds[])
+static int preallocate_pmds(struct mm_struct *mm, pmd_t *pmds[], int count)
{
int i;
bool failed = false;
@@ -222,7 +230,7 @@ static int preallocate_pmds(struct mm_struct *mm, pmd_t *pmds[])
if (mm == &init_mm)
gfp &= ~__GFP_ACCOUNT;

- for(i = 0; i < PREALLOCATED_PMDS; i++) {
+ for(i = 0; i < count; i++) {
pmd_t *pmd = (pmd_t *)__get_free_page(gfp);
if (!pmd)
failed = true;
@@ -237,7 +245,7 @@ static int preallocate_pmds(struct mm_struct *mm, pmd_t *pmds[])
}

if (failed) {
- free_pmds(mm, pmds);
+ free_pmds(mm, pmds, count);
return -ENOMEM;
}

@@ -250,23 +258,38 @@ static int preallocate_pmds(struct mm_struct *mm, pmd_t *pmds[])
* preallocate which never got a corresponding vma will need to be
* freed manually.
*/
+static void mop_up_one_pmd(struct mm_struct *mm, pgd_t *pgdp)
+{
+ pgd_t pgd = *pgdp;
+
+ if (pgd_val(pgd) != 0) {
+ pmd_t *pmd = (pmd_t *)pgd_page_vaddr(pgd);
+
+ *pgdp = native_make_pgd(0);
+
+ paravirt_release_pmd(pgd_val(pgd) >> PAGE_SHIFT);
+ pmd_free(mm, pmd);
+ mm_dec_nr_pmds(mm);
+ }
+}
+
static void pgd_mop_up_pmds(struct mm_struct *mm, pgd_t *pgdp)
{
int i;

- for(i = 0; i < PREALLOCATED_PMDS; i++) {
- pgd_t pgd = pgdp[i];
+ for(i = 0; i < PREALLOCATED_PMDS; i++)
+ mop_up_one_pmd(mm, &pgdp[i]);

- if (pgd_val(pgd) != 0) {
- pmd_t *pmd = (pmd_t *)pgd_page_vaddr(pgd);
+#ifdef CONFIG_PAGE_TABLE_ISOLATION

- pgdp[i] = native_make_pgd(0);
+ if (!static_cpu_has(X86_FEATURE_PTI))
+ return;

- paravirt_release_pmd(pgd_val(pgd) >> PAGE_SHIFT);
- pmd_free(mm, pmd);
- mm_dec_nr_pmds(mm);
- }
- }
+ pgdp = kernel_to_user_pgdp(pgdp);
+
+ for (i = 0; i < PREALLOCATED_USER_PMDS; i++)
+ mop_up_one_pmd(mm, &pgdp[i + KERNEL_PGD_BOUNDARY]);
+#endif
}

static void pgd_prepopulate_pmd(struct mm_struct *mm, pgd_t *pgd, pmd_t *pmds[])
@@ -292,6 +315,38 @@ static void pgd_prepopulate_pmd(struct mm_struct *mm, pgd_t *pgd, pmd_t *pmds[])
}
}

+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+static void pgd_prepopulate_user_pmd(struct mm_struct *mm,
+ pgd_t *k_pgd, pmd_t *pmds[])
+{
+ pgd_t *s_pgd = kernel_to_user_pgdp(swapper_pg_dir);
+ pgd_t *u_pgd = kernel_to_user_pgdp(k_pgd);
+ p4d_t *u_p4d;
+ pud_t *u_pud;
+ int i;
+
+ u_p4d = p4d_offset(u_pgd, 0);
+ u_pud = pud_offset(u_p4d, 0);
+
+ s_pgd += KERNEL_PGD_BOUNDARY;
+ u_pud += KERNEL_PGD_BOUNDARY;
+
+ for (i = 0; i < PREALLOCATED_USER_PMDS; i++, u_pud++, s_pgd++) {
+ pmd_t *pmd = pmds[i];
+
+ memcpy(pmd, (pmd_t *)pgd_page_vaddr(*s_pgd),
+ sizeof(pmd_t) * PTRS_PER_PMD);
+
+ pud_populate(mm, u_pud, pmd);
+ }
+
+}
+#else
+static void pgd_prepopulate_user_pmd(struct mm_struct *mm,
+ pgd_t *k_pgd, pmd_t *pmds[])
+{
+}
+#endif
/*
* Xen paravirt assumes pgd table should be in one page. 64 bit kernel also
* assumes that pgd should be in one page.
@@ -372,6 +427,7 @@ static inline void _pgd_free(pgd_t *pgd)
pgd_t *pgd_alloc(struct mm_struct *mm)
{
pgd_t *pgd;
+ pmd_t *u_pmds[PREALLOCATED_USER_PMDS];
pmd_t *pmds[PREALLOCATED_PMDS];

pgd = _pgd_alloc();
@@ -381,12 +437,15 @@ pgd_t *pgd_alloc(struct mm_struct *mm)

mm->pgd = pgd;

- if (preallocate_pmds(mm, pmds) != 0)
+ if (preallocate_pmds(mm, pmds, PREALLOCATED_PMDS) != 0)
goto out_free_pgd;

- if (paravirt_pgd_alloc(mm) != 0)
+ if (preallocate_pmds(mm, u_pmds, PREALLOCATED_USER_PMDS) != 0)
goto out_free_pmds;

+ if (paravirt_pgd_alloc(mm) != 0)
+ goto out_free_user_pmds;
+
/*
* Make sure that pre-populating the pmds is atomic with
* respect to anything walking the pgd_list, so that they
@@ -396,13 +455,16 @@ pgd_t *pgd_alloc(struct mm_struct *mm)

pgd_ctor(mm, pgd);
pgd_prepopulate_pmd(mm, pgd, pmds);
+ pgd_prepopulate_user_pmd(mm, pgd, u_pmds);

spin_unlock(&pgd_lock);

return pgd;

+out_free_user_pmds:
+ free_pmds(mm, u_pmds, PREALLOCATED_USER_PMDS);
out_free_pmds:
- free_pmds(mm, pmds);
+ free_pmds(mm, pmds, PREALLOCATED_PMDS);
out_free_pgd:
_pgd_free(pgd);
out:
--
2.7.4

2018-04-16 15:31:59

by Joerg Roedel

[permalink] [raw]

Subject: [PATCH 09/35] x86/entry/32: Introduce SAVE_ALL_NMI and RESTORE_ALL_NMI

From: Joerg Roedel <[email protected]>

These macros will be used in the NMI handler code and
replace plain SAVE_ALL and RESTORE_REGS there. We will add
the NMI-specific CR3-switch to these macros later.

Signed-off-by: Joerg Roedel <[email protected]>
---
arch/x86/entry/entry_32.S | 15 +++++++++++----
1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 927df80..e2621bf 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -186,6 +186,9 @@

.endm

+.macro SAVE_ALL_NMI
+ SAVE_ALL
+.endm
/*
* This is a sneaky trick to help the unwinder find pt_regs on the stack. The
* frame pointer is replaced with an encoded pointer to pt_regs. The encoding
@@ -232,6 +235,10 @@
POP_GS_EX
.endm

+.macro RESTORE_ALL_NMI pop=0
+ RESTORE_REGS pop=\pop
+.endm
+
.macro CHECK_AND_APPLY_ESPFIX
#ifdef CONFIG_X86_ESPFIX32
#define GDT_ESPFIX_SS PER_CPU_VAR(gdt_page) + (GDT_ENTRY_ESPFIX_SS * 8)
@@ -1166,7 +1173,7 @@ ENTRY(nmi)
#endif

pushl %eax # pt_regs->orig_ax
- SAVE_ALL
+ SAVE_ALL_NMI
ENCODE_FRAME_POINTER
xorl %edx, %edx # zero error code
movl %esp, %eax # pt_regs pointer
@@ -1194,7 +1201,7 @@ ENTRY(nmi)

.Lnmi_return:
CHECK_AND_APPLY_ESPFIX
- RESTORE_REGS 4
+ RESTORE_ALL_NMI pop=4
jmp .Lirq_return

#ifdef CONFIG_X86_ESPFIX32
@@ -1210,12 +1217,12 @@ ENTRY(nmi)
pushl 16(%esp)
.endr
pushl %eax
- SAVE_ALL
+ SAVE_ALL_NMI
ENCODE_FRAME_POINTER
FIXUP_ESPFIX_STACK # %eax == %esp
xorl %edx, %edx # zero error code
call do_nmi
- RESTORE_REGS
+ RESTORE_ALL_NMI
lss 12+4(%esp), %esp # back to espfix stack
jmp .Lirq_return
#endif
--
2.7.4

2018-04-16 15:32:10

by Joerg Roedel

[permalink] [raw]

Subject: [PATCH 23/35] x86/mm/legacy: Populate the user page-table with user pgd's

From: Joerg Roedel <[email protected]>

Also populate the user-spage pgd's in the user page-table.

Signed-off-by: Joerg Roedel <[email protected]>
---
arch/x86/include/asm/pgtable-2level.h | 9 +++++++++
1 file changed, 9 insertions(+)

diff --git a/arch/x86/include/asm/pgtable-2level.h b/arch/x86/include/asm/pgtable-2level.h
index 685ffe8..c399ea5 100644
--- a/arch/x86/include/asm/pgtable-2level.h
+++ b/arch/x86/include/asm/pgtable-2level.h
@@ -19,6 +19,9 @@ static inline void native_set_pte(pte_t *ptep , pte_t pte)

static inline void native_set_pmd(pmd_t *pmdp, pmd_t pmd)
{
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+ pmd.pud.p4d.pgd = pti_set_user_pgtbl(&pmdp->pud.p4d.pgd, pmd.pud.p4d.pgd);
+#endif
*pmdp = pmd;
}

@@ -58,6 +61,9 @@ static inline pte_t native_ptep_get_and_clear(pte_t *xp)
#ifdef CONFIG_SMP
static inline pmd_t native_pmdp_get_and_clear(pmd_t *xp)
{
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+ pti_set_user_pgtbl(&xp->pud.p4d.pgd, __pgd(0));
+#endif
return __pmd(xchg((pmdval_t *)xp, 0));
}
#else
@@ -67,6 +73,9 @@ static inline pmd_t native_pmdp_get_and_clear(pmd_t *xp)
#ifdef CONFIG_SMP
static inline pud_t native_pudp_get_and_clear(pud_t *xp)
{
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+ pti_set_user_pgtbl(&xp->p4d.pgd, __pgd(0));
+#endif
return __pud(xchg((pudval_t *)xp, 0));
}
#else
--
2.7.4

2018-04-16 15:32:16

by Joerg Roedel

[permalink] [raw]

Subject: [PATCH 24/35] x86/mm/pti: Add an overflow check to pti_clone_pmds()

From: Joerg Roedel <[email protected]>

The addr counter will overflow if we clone the last PMD of
the address space, resulting in an endless loop.

Check for that and bail out of the loop when it happens.

Signed-off-by: Joerg Roedel <[email protected]>
---
arch/x86/mm/pti.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index 9bea9c3..f967b51 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -297,6 +297,10 @@ pti_clone_pmds(unsigned long start, unsigned long end, pmdval_t clear)
p4d_t *p4d;
pud_t *pud;

+ /* Overflow check */
+ if (addr < start)
+ break;
+
pgd = pgd_offset_k(addr);
if (WARN_ON(pgd_none(*pgd)))
return;
--
2.7.4

2018-04-16 15:32:25

by Joerg Roedel

[permalink] [raw]

Subject: [PATCH 26/35] x86/mm/pti: Clone CPU_ENTRY_AREA on PMD level on x86_32

From: Joerg Roedel <[email protected]>

Cloning on the P4D level would clone the complete kernel
address space into the user-space page-tables for PAE
kernels. Cloning on PMD level is fine for PAE and legacy
paging.

Signed-off-by: Joerg Roedel <[email protected]>
---
arch/x86/mm/pti.c | 20 ++++++++++++++++++++
1 file changed, 20 insertions(+)

diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index f967b51..60a1ee0 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -348,6 +348,7 @@ pti_clone_pmds(unsigned long start, unsigned long end, pmdval_t clear)
}
}

+#ifdef CONFIG_X86_64
/*
* Clone a single p4d (i.e. a top-level entry on 4-level systems and a
* next-level entry on 5-level systems.
@@ -371,6 +372,25 @@ static void __init pti_clone_user_shared(void)
pti_clone_p4d(CPU_ENTRY_AREA_BASE);
}

+#else /* CONFIG_X86_64 */
+
+/*
+ * On 32 bit PAE systems with 1GB of Kernel address space there is only
+ * one pgd/p4d for the whole kernel. Cloning that would map the whole
+ * address space into the user page-tables, making PTI useless. So clone
+ * the page-table on the PMD level to prevent that.
+ */
+static void __init pti_clone_user_shared(void)
+{
+ unsigned long start, end;
+
+ start = CPU_ENTRY_AREA_BASE;
+ end = start + (PAGE_SIZE * CPU_ENTRY_AREA_PAGES);
+
+ pti_clone_pmds(start, end, _PAGE_GLOBAL);
+}
+#endif /* CONFIG_X86_64 */
+
/*
* Clone the ESPFIX P4D into the user space visible page table
*/
--
2.7.4

2018-04-16 15:32:58

by Joerg Roedel

[permalink] [raw]

Subject: [PATCH 21/35] x86/mm/pae: Populate valid user PGD entries

From: Joerg Roedel <[email protected]>

Generic page-table code populates all non-leaf entries with
_KERNPG_TABLE bits set. This is fine for all paging modes
except PAE.

In PAE mode only a subset of the bits is allowed to be set.
Make sure we only set allowed bits by masking out the
reserved bits.

Signed-off-by: Joerg Roedel <[email protected]>
---
arch/x86/include/asm/pgtable_types.h | 28 ++++++++++++++++++++++++++--
1 file changed, 26 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 1e5a406..65f2dbaf 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -50,6 +50,7 @@
#define _PAGE_GLOBAL (_AT(pteval_t, 1) << _PAGE_BIT_GLOBAL)
#define _PAGE_SOFTW1 (_AT(pteval_t, 1) << _PAGE_BIT_SOFTW1)
#define _PAGE_SOFTW2 (_AT(pteval_t, 1) << _PAGE_BIT_SOFTW2)
+#define _PAGE_SOFTW3 (_AT(pteval_t, 1) << _PAGE_BIT_SOFTW3)
#define _PAGE_PAT (_AT(pteval_t, 1) << _PAGE_BIT_PAT)
#define _PAGE_PAT_LARGE (_AT(pteval_t, 1) << _PAGE_BIT_PAT_LARGE)
#define _PAGE_SPECIAL (_AT(pteval_t, 1) << _PAGE_BIT_SPECIAL)
@@ -267,14 +268,37 @@ typedef struct pgprot { pgprotval_t pgprot; } pgprot_t;

typedef struct { pgdval_t pgd; } pgd_t;

+#ifdef CONFIG_X86_PAE
+
+/*
+ * PHYSICAL_PAGE_MASK might be non-constant when SME is compiled in, so we can't
+ * use it here.
+ */
+
+#define PGD_PAE_PAGE_MASK ((signed long)PAGE_MASK)
+#define PGD_PAE_PHYS_MASK (((1ULL << __PHYSICAL_MASK_SHIFT)-1) & PGD_PAE_PAGE_MASK)
+
+/*
+ * PAE allows Base Address, P, PWT, PCD and AVL bits to be set in PGD entries.
+ * All other bits are Reserved MBZ
+ */
+#define PGD_ALLOWED_BITS (PGD_PAE_PHYS_MASK | _PAGE_PRESENT | \
+ _PAGE_PWT | _PAGE_PCD | \
+ _PAGE_SOFTW1 | _PAGE_SOFTW2 | _PAGE_SOFTW3 )
+
+#else
+/* No need to mask any bits for !PAE */
+#define PGD_ALLOWED_BITS (~0ULL)
+#endif
+
static inline pgd_t native_make_pgd(pgdval_t val)
{
- return (pgd_t) { val };
+ return (pgd_t) { val & PGD_ALLOWED_BITS };
}

static inline pgdval_t native_pgd_val(pgd_t pgd)
{
- return pgd.pgd;
+ return pgd.pgd & PGD_ALLOWED_BITS;
}

static inline pgdval_t pgd_flags(pgd_t pgd)
--
2.7.4

2018-04-16 15:33:08

by Joerg Roedel

[permalink] [raw]

Subject: [PATCH 25/35] x86/mm/pti: Define X86_CR3_PTI_PCID_USER_BIT on x86_32

From: Joerg Roedel <[email protected]>

Move it out of the X86_64 specific processor defines so
that its visible for 32bit too.

Reviewed-by: Andy Lutomirski <[email protected]>
Signed-off-by: Joerg Roedel <[email protected]>
---
arch/x86/include/asm/processor-flags.h | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/processor-flags.h b/arch/x86/include/asm/processor-flags.h
index 625a52a..02c2cbd 100644
--- a/arch/x86/include/asm/processor-flags.h
+++ b/arch/x86/include/asm/processor-flags.h
@@ -39,10 +39,6 @@
#define CR3_PCID_MASK 0xFFFull
#define CR3_NOFLUSH BIT_ULL(63)

-#ifdef CONFIG_PAGE_TABLE_ISOLATION
-# define X86_CR3_PTI_PCID_USER_BIT 11
-#endif
-
#else
/*
* CR3_ADDR_MASK needs at least bits 31:5 set on PAE systems, and we save
@@ -53,4 +49,8 @@
#define CR3_NOFLUSH 0
#endif

+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+# define X86_CR3_PTI_PCID_USER_BIT 11
+#endif
+
#endif /* _ASM_X86_PROCESSOR_FLAGS_H */
--
2.7.4

2018-04-16 15:33:43

by Joerg Roedel

[permalink] [raw]

Subject: [PATCH 12/35] x86/32: Use tss.sp1 as cpu_current_top_of_stack

From: Joerg Roedel <[email protected]>

Now that we store the task-stack in tss.sp1 we can also use
it as cpu_current_top_of_stack. This unifies the handling
with x86-64.

Signed-off-by: Joerg Roedel <[email protected]>
---
arch/x86/include/asm/processor.h | 4 ----
arch/x86/include/asm/thread_info.h | 2 --
arch/x86/kernel/cpu/common.c | 4 ----
arch/x86/kernel/process_32.c | 6 ------
4 files changed, 16 deletions(-)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 4fa4206..3894f63 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -374,12 +374,8 @@ DECLARE_PER_CPU_PAGE_ALIGNED(struct tss_struct, cpu_tss_rw);
#define __KERNEL_TSS_LIMIT \
(IO_BITMAP_OFFSET + IO_BITMAP_BYTES + sizeof(unsigned long) - 1)

-#ifdef CONFIG_X86_32
-DECLARE_PER_CPU(unsigned long, cpu_current_top_of_stack);
-#else
/* The RO copy can't be accessed with this_cpu_xyz(), so use the RW copy. */
#define cpu_current_top_of_stack cpu_tss_rw.x86_tss.sp1
-#endif

/*
* Save the original ist values for checking stack pointers during debugging
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index a5d9521..943c673 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -205,9 +205,7 @@ static inline int arch_within_stack_frames(const void * const stack,

#else /* !__ASSEMBLY__ */

-#ifdef CONFIG_X86_64
# define cpu_current_top_of_stack (cpu_tss_rw + TSS_sp1)
-#endif

#endif

diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 311e988..2d67ad0 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1512,10 +1512,6 @@ EXPORT_PER_CPU_SYMBOL(__preempt_count);
* the top of the kernel stack. Use an extra percpu variable to track the
* top of the kernel stack directly.
*/
-DEFINE_PER_CPU(unsigned long, cpu_current_top_of_stack) =
- (unsigned long)&init_thread_union + THREAD_SIZE;
-EXPORT_PER_CPU_SYMBOL(cpu_current_top_of_stack);
-
#ifdef CONFIG_CC_STACKPROTECTOR
DEFINE_PER_CPU_ALIGNED(struct stack_canary, stack_canary);
#endif
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 3f3a8c6..8c29fd5 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -290,12 +290,6 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
update_sp0(next_p);
refresh_sysenter_cs(next);
this_cpu_write(cpu_current_top_of_stack, task_top_of_stack(next_p));
- /*
- * TODO: Find a way to let cpu_current_top_of_stack point to
- * cpu_tss_rw.x86_tss.sp1. Doing so now results in stack corruption with
- * iret exceptions.
- */
- this_cpu_write(cpu_tss_rw.x86_tss.sp1, next_p->thread.sp0);

/*
* Restore %gs if needed (which is common)
--
2.7.4

2018-04-16 15:33:58

by Joerg Roedel

[permalink] [raw]

Subject: [PATCH 17/35] x86/pgtable/32: Allocate 8k page-tables when PTI is enabled

From: Joerg Roedel <[email protected]>

Allocate a kernel and a user page-table root when PTI is
enabled. Also allocate a full page per root for PAE because
otherwise the bit to flip in cr3 to switch between them
would be non-constant, which creates a lot of hassle.
Keep that for a later optimization.

Signed-off-by: Joerg Roedel <[email protected]>
---
arch/x86/kernel/head_32.S | 20 +++++++++++++++-----
arch/x86/mm/pgtable.c | 5 +++--
2 files changed, 18 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/head_32.S b/arch/x86/kernel/head_32.S
index b59e4fb..12a3e8c 100644
--- a/arch/x86/kernel/head_32.S
+++ b/arch/x86/kernel/head_32.S
@@ -512,11 +512,18 @@ ENTRY(initial_code)
ENTRY(setup_once_ref)
.long setup_once

+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+#define PGD_ALIGN (2 * PAGE_SIZE)
+#define PTI_USER_PGD_FILL 1024
+#else
+#define PGD_ALIGN (PAGE_SIZE)
+#define PTI_USER_PGD_FILL 0
+#endif
/*
* BSS section
*/
__PAGE_ALIGNED_BSS
- .align PAGE_SIZE
+ .align PGD_ALIGN
#ifdef CONFIG_X86_PAE
.globl initial_pg_pmd
initial_pg_pmd:
@@ -526,14 +533,17 @@ initial_pg_pmd:
initial_page_table:
.fill 1024,4,0
#endif
+ .align PGD_ALIGN
initial_pg_fixmap:
.fill 1024,4,0
-.globl empty_zero_page
-empty_zero_page:
- .fill 4096,1,0
.globl swapper_pg_dir
+ .align PGD_ALIGN
swapper_pg_dir:
.fill 1024,4,0
+ .fill PTI_USER_PGD_FILL,4,0
+.globl empty_zero_page
+empty_zero_page:
+ .fill 4096,1,0
EXPORT_SYMBOL(empty_zero_page)

/*
@@ -542,7 +552,7 @@ EXPORT_SYMBOL(empty_zero_page)
#ifdef CONFIG_X86_PAE
__PAGE_ALIGNED_DATA
/* Page-aligned for the benefit of paravirt? */
- .align PAGE_SIZE
+ .align PGD_ALIGN
ENTRY(initial_page_table)
.long pa(initial_pg_pmd+PGD_IDENT_ATTR),0 /* low identity map */
# if KPMDS == 3
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index ffc8c13..f4211d2 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -339,7 +339,8 @@ static inline pgd_t *_pgd_alloc(void)
* We allocate one page for pgd.
*/
if (!SHARED_KERNEL_PMD)
- return (pgd_t *)__get_free_page(PGALLOC_GFP);
+ return (pgd_t *)__get_free_pages(PGALLOC_GFP,
+ PGD_ALLOCATION_ORDER);

/*
* Now PAE kernel is not running as a Xen domain. We can allocate
@@ -351,7 +352,7 @@ static inline pgd_t *_pgd_alloc(void)
static inline void _pgd_free(pgd_t *pgd)
{
if (!SHARED_KERNEL_PMD)
- free_page((unsigned long)pgd);
+ free_pages((unsigned long)pgd, PGD_ALLOCATION_ORDER);
else
kmem_cache_free(pgd_cache, pgd);
}
--
2.7.4

2018-04-16 15:34:14

by Joerg Roedel

[permalink] [raw]

Subject: [PATCH 22/35] x86/mm/pae: Populate the user page-table with user pgd's

From: Joerg Roedel <[email protected]>

When we populate a PGD entry, make sure we populate it in
the user page-table too.

Signed-off-by: Joerg Roedel <[email protected]>
---
arch/x86/include/asm/pgtable-3level.h | 7 +++++++
1 file changed, 7 insertions(+)

diff --git a/arch/x86/include/asm/pgtable-3level.h b/arch/x86/include/asm/pgtable-3level.h
index f24df59..f2ca313 100644
--- a/arch/x86/include/asm/pgtable-3level.h
+++ b/arch/x86/include/asm/pgtable-3level.h
@@ -98,6 +98,9 @@ static inline void native_set_pmd(pmd_t *pmdp, pmd_t pmd)

static inline void native_set_pud(pud_t *pudp, pud_t pud)
{
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+ pud.p4d.pgd = pti_set_user_pgtbl(&pudp->p4d.pgd, pud.p4d.pgd);
+#endif
set_64bit((unsigned long long *)(pudp), native_pud_val(pud));
}

@@ -229,6 +232,10 @@ static inline pud_t native_pudp_get_and_clear(pud_t *pudp)
{
union split_pud res, *orig = (union split_pud *)pudp;

+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+ pti_set_user_pgtbl(&pudp->p4d.pgd, __pgd(0));
+#endif
+
/* xchg acts as a barrier before setting of the high bits */
res.pud_low = xchg(&orig->pud_low, 0);
res.pud_high = orig->pud_high;
--
2.7.4

2018-04-16 15:35:00

by Linus Torvalds

[permalink] [raw]

Subject: Re: [PATCH 00/35 v5] PTI support for x32

On Mon, Apr 16, 2018 at 8:24 AM, Joerg Roedel <[email protected]> wrote:
>
> I tested this version again with my load-test of running
> perf-top/various x86-selftests/kernel-compile in a loop for
> a couple of hours. This showed no issues. I also briefly
> tested a 64bit kernel and this also worked as expected.

Andy, can you send your extra x86 self-tests to Joerg too?

Also, it would be nice to have performance numbers, and check that the
global page issues are at least fixed on 32-bit too, since we screwed
that up on x86-64 initially.

On x86-32, the global pages are likely a bigger deal since there's no PCID.

Linus

2018-04-16 15:35:09

by Joerg Roedel

[permalink] [raw]

Subject: [PATCH 14/35] x86/entry/32: Add PTI cr3 switches to NMI handler code

From: Joerg Roedel <[email protected]>

The NMI handler is special, as it needs to leave with the
same cr3 as it was entered with. We need to do this because
we could enter the NMI handler from kernel code with
user-cr3 already loaded.

Signed-off-by: Joerg Roedel <[email protected]>
---
arch/x86/entry/entry_32.S | 41 +++++++++++++++++++++++++++++++++++------
1 file changed, 35 insertions(+), 6 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index b2b0ecb..f47e535 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -77,6 +77,8 @@
#endif
.endm

+#define PTI_SWITCH_MASK (1 << PAGE_SHIFT)
+
/*
* User gs save/restore
*
@@ -213,8 +215,19 @@

.endm

-.macro SAVE_ALL_NMI
+.macro SAVE_ALL_NMI cr3_reg:req
SAVE_ALL
+
+ /*
+ * Now switch the CR3 when PTI is enabled.
+ *
+ * We can enter with either user or kernel cr3, the code will
+ * store the old cr3 in \cr3_reg and switches to the kernel cr3
+ * if necessary.
+ */
+ SWITCH_TO_KERNEL_CR3 scratch_reg=\cr3_reg
+
+.Lend_\@:
.endm
/*
* This is a sneaky trick to help the unwinder find pt_regs on the stack. The
@@ -262,7 +275,23 @@
POP_GS_EX
.endm

-.macro RESTORE_ALL_NMI pop=0
+.macro RESTORE_ALL_NMI cr3_reg:req pop=0
+ /*
+ * Now switch the CR3 when PTI is enabled.
+ *
+ * We enter with kernel cr3 and switch the cr3 to the value
+ * stored on \cr3_reg, which is either a user or a kernel cr3.
+ */
+ ALTERNATIVE "jmp .Lswitched_\@", "", X86_FEATURE_PTI
+
+ testl $PTI_SWITCH_MASK, \cr3_reg
+ jz .Lswitched_\@
+
+ /* User cr3 in \cr3_reg - write it to hardware cr3 */
+ movl \cr3_reg, %cr3
+
+.Lswitched_\@:
+
RESTORE_REGS pop=\pop
.endm

@@ -1333,7 +1362,7 @@ ENTRY(nmi)
#endif

pushl %eax # pt_regs->orig_ax
- SAVE_ALL_NMI
+ SAVE_ALL_NMI cr3_reg=%edi
ENCODE_FRAME_POINTER
xorl %edx, %edx # zero error code
movl %esp, %eax # pt_regs pointer
@@ -1361,7 +1390,7 @@ ENTRY(nmi)

.Lnmi_return:
CHECK_AND_APPLY_ESPFIX
- RESTORE_ALL_NMI pop=4
+ RESTORE_ALL_NMI cr3_reg=%edi pop=4
jmp .Lirq_return

#ifdef CONFIG_X86_ESPFIX32
@@ -1377,12 +1406,12 @@ ENTRY(nmi)
pushl 16(%esp)
.endr
pushl %eax
- SAVE_ALL_NMI
+ SAVE_ALL_NMI cr3_reg=%edi
ENCODE_FRAME_POINTER
FIXUP_ESPFIX_STACK # %eax == %esp
xorl %edx, %edx # zero error code
call do_nmi
- RESTORE_ALL_NMI
+ RESTORE_ALL_NMI cr3_reg=%edi
lss 12+4(%esp), %esp # back to espfix stack
jmp .Lirq_return
#endif
--
2.7.4

2018-04-16 15:35:19

by Joerg Roedel

[permalink] [raw]

Subject: [PATCH 16/35] x86/pgtable/pae: Unshare kernel PMDs when PTI is enabled

From: Joerg Roedel <[email protected]>

With PTI we need to map the per-process LDT into the kernel
address-space for each process, so we need separate kernel
PMDs per PGD.

Signed-off-by: Joerg Roedel <[email protected]>
---
arch/x86/include/asm/pgtable-3level_types.h | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/pgtable-3level_types.h b/arch/x86/include/asm/pgtable-3level_types.h
index 6a59a6d..78038e0 100644
--- a/arch/x86/include/asm/pgtable-3level_types.h
+++ b/arch/x86/include/asm/pgtable-3level_types.h
@@ -21,9 +21,10 @@ typedef union {
#endif /* !__ASSEMBLY__ */

#ifdef CONFIG_PARAVIRT
-#define SHARED_KERNEL_PMD (pv_info.shared_kernel_pmd)
+#define SHARED_KERNEL_PMD ((!static_cpu_has(X86_FEATURE_PTI) && \
+ (pv_info.shared_kernel_pmd)))
#else
-#define SHARED_KERNEL_PMD 1
+#define SHARED_KERNEL_PMD (!static_cpu_has(X86_FEATURE_PTI))
#endif

/*
--
2.7.4

2018-04-16 15:35:32

by Joerg Roedel

[permalink] [raw]

Subject: [PATCH 19/35] x86/pgtable: Move pti_set_user_pgtbl() to pgtable.h

From: Joerg Roedel <[email protected]>

There it is also usable from 32 bit code.

Signed-off-by: Joerg Roedel <[email protected]>
---
arch/x86/include/asm/pgtable.h | 23 +++++++++++++++++++++++
arch/x86/include/asm/pgtable_64.h | 21 ---------------------
2 files changed, 23 insertions(+), 21 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 3055c77..557ddf8 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -635,8 +635,31 @@ static inline int is_new_memtype_allowed(u64 paddr, unsigned long size,

pmd_t *populate_extra_pmd(unsigned long vaddr);
pte_t *populate_extra_pte(unsigned long vaddr);
+
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+pgd_t __pti_set_user_pgtbl(pgd_t *pgdp, pgd_t pgd);
+
+/*
+ * Take a PGD location (pgdp) and a pgd value that needs to be set there.
+ * Populates the user and returns the resulting PGD that must be set in
+ * the kernel copy of the page tables.
+ */
+static inline pgd_t pti_set_user_pgtbl(pgd_t *pgdp, pgd_t pgd)
+{
+ if (!static_cpu_has(X86_FEATURE_PTI))
+ return pgd;
+ return __pti_set_user_pgtbl(pgdp, pgd);
+}
+#else /* CONFIG_PAGE_TABLE_ISOLATION */
+static inline pgd_t pti_set_user_pgtbl(pgd_t *pgdp, pgd_t pgd)
+{
+ return pgd;
+}
+#endif /* CONFIG_PAGE_TABLE_ISOLATION */
+
#endif /* __ASSEMBLY__ */

+
#ifdef CONFIG_X86_32
# include <asm/pgtable_32.h>
#else
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index 9934115..6dd2eb6 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -146,27 +146,6 @@ static inline bool pgdp_maps_userspace(void *__ptr)
return (ptr & ~PAGE_MASK) < (PAGE_SIZE / 2);
}

-#ifdef CONFIG_PAGE_TABLE_ISOLATION
-pgd_t __pti_set_user_pgtbl(pgd_t *pgdp, pgd_t pgd);
-
-/*
- * Take a PGD location (pgdp) and a pgd value that needs to be set there.
- * Populates the user and returns the resulting PGD that must be set in
- * the kernel copy of the page tables.
- */
-static inline pgd_t pti_set_user_pgtbl(pgd_t *pgdp, pgd_t pgd)
-{
- if (!static_cpu_has(X86_FEATURE_PTI))
- return pgd;
- return __pti_set_user_pgtbl(pgdp, pgd);
-}
-#else
-static inline pgd_t pti_set_user_pgtbl(pgd_t *pgdp, pgd_t pgd)
-{
- return pgd;
-}
-#endif
-
static inline void native_set_p4d(p4d_t *p4dp, p4d_t p4d)
{
pgd_t pgd;
--
2.7.4

2018-04-16 15:35:36

by Joerg Roedel

[permalink] [raw]

Subject: [PATCH 20/35] x86/pgtable: Move two more functions from pgtable_64.h to pgtable.h

From: Joerg Roedel <[email protected]>

These two functions are required for PTI on 32 bit:

* pgdp_maps_userspace()
* pgd_large()

Also re-implement pgdp_maps_userspace() so that it will work
on 64 and 32 bit kernels.

Signed-off-by: Joerg Roedel <[email protected]>
---
arch/x86/include/asm/pgtable-2level_types.h | 3 +++
arch/x86/include/asm/pgtable-3level_types.h | 1 +
arch/x86/include/asm/pgtable.h | 15 +++++++++++++++
arch/x86/include/asm/pgtable_32.h | 2 --
arch/x86/include/asm/pgtable_64.h | 15 ---------------
arch/x86/include/asm/pgtable_64_types.h | 2 ++
6 files changed, 21 insertions(+), 17 deletions(-)

diff --git a/arch/x86/include/asm/pgtable-2level_types.h b/arch/x86/include/asm/pgtable-2level_types.h
index f982ef8..6deb6cd 100644
--- a/arch/x86/include/asm/pgtable-2level_types.h
+++ b/arch/x86/include/asm/pgtable-2level_types.h
@@ -35,4 +35,7 @@ typedef union {

#define PTRS_PER_PTE 1024

+/* This covers all VMSPLIT_* and VMSPLIT_*_OPT variants */
+#define PGD_KERNEL_START (CONFIG_PAGE_OFFSET >> PGDIR_SHIFT)
+
#endif /* _ASM_X86_PGTABLE_2LEVEL_DEFS_H */
diff --git a/arch/x86/include/asm/pgtable-3level_types.h b/arch/x86/include/asm/pgtable-3level_types.h
index 78038e0..858358a 100644
--- a/arch/x86/include/asm/pgtable-3level_types.h
+++ b/arch/x86/include/asm/pgtable-3level_types.h
@@ -46,5 +46,6 @@ typedef union {
#define PTRS_PER_PTE 512

#define MAX_POSSIBLE_PHYSMEM_BITS 36
+#define PGD_KERNEL_START (CONFIG_PAGE_OFFSET >> PGDIR_SHIFT)

#endif /* _ASM_X86_PGTABLE_3LEVEL_DEFS_H */
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 557ddf8..55c236e 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1172,6 +1172,21 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
}
}
#endif
+/*
+ * Page table pages are page-aligned. The lower half of the top
+ * level is used for userspace and the top half for the kernel.
+ *
+ * Returns true for parts of the PGD that map userspace and
+ * false for the parts that map the kernel.
+ */
+static inline bool pgdp_maps_userspace(void *__ptr)
+{
+ unsigned long ptr = (unsigned long)__ptr;
+
+ return (((ptr & ~PAGE_MASK) / sizeof(pgd_t)) < PGD_KERNEL_START);
+}
+
+static inline int pgd_large(pgd_t pgd) { return 0; }

#ifdef CONFIG_PAGE_TABLE_ISOLATION
/*
diff --git a/arch/x86/include/asm/pgtable_32.h b/arch/x86/include/asm/pgtable_32.h
index 88a056b..b3ec519 100644
--- a/arch/x86/include/asm/pgtable_32.h
+++ b/arch/x86/include/asm/pgtable_32.h
@@ -34,8 +34,6 @@ static inline void check_pgt_cache(void) { }
void paging_init(void);
void sync_initial_page_table(void);

-static inline int pgd_large(pgd_t pgd) { return 0; }
-
/*
* Define this if things work differently on an i386 and an i486:
* it will (on an i486) warn about kernel memory accesses that are
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index 6dd2eb6..84a5eb0 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -132,20 +132,6 @@ static inline pud_t native_pudp_get_and_clear(pud_t *xp)
#endif
}

-/*
- * Page table pages are page-aligned. The lower half of the top
- * level is used for userspace and the top half for the kernel.
- *
- * Returns true for parts of the PGD that map userspace and
- * false for the parts that map the kernel.
- */
-static inline bool pgdp_maps_userspace(void *__ptr)
-{
- unsigned long ptr = (unsigned long)__ptr;
-
- return (ptr & ~PAGE_MASK) < (PAGE_SIZE / 2);
-}
-
static inline void native_set_p4d(p4d_t *p4dp, p4d_t p4d)
{
pgd_t pgd;
@@ -185,7 +171,6 @@ extern void sync_global_pgds(unsigned long start, unsigned long end);
/*
* Level 4 access.
*/
-static inline int pgd_large(pgd_t pgd) { return 0; }
#define mk_kernel_pgd(address) __pgd((address) | _KERNPG_TABLE)

/* PUD - Level3 access */
diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
index d5c21a3..355b488 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -142,4 +142,6 @@ extern unsigned int ptrs_per_p4d;

#define EARLY_DYNAMIC_PAGE_TABLES 64

+#define PGD_KERNEL_START ((PAGE_SIZE / 2) / sizeof(pgd_t))
+
#endif /* _ASM_X86_PGTABLE_64_DEFS_H */
--
2.7.4

2018-04-16 15:35:52

by Joerg Roedel

[permalink] [raw]

Subject: [PATCH 18/35] x86/pgtable: Move pgdp kernel/user conversion functions to pgtable.h

From: Joerg Roedel <[email protected]>

Make them available on 32 bit and clone_pgd_range() happy.

Signed-off-by: Joerg Roedel <[email protected]>
---
arch/x86/include/asm/pgtable.h | 49 +++++++++++++++++++++++++++++++++++++++
arch/x86/include/asm/pgtable_64.h | 49 ---------------------------------------
2 files changed, 49 insertions(+), 49 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 5f49b4f..3055c77 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1150,6 +1150,55 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
}
#endif

+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+/*
+ * All top-level PAGE_TABLE_ISOLATION page tables are order-1 pages
+ * (8k-aligned and 8k in size). The kernel one is at the beginning 4k and
+ * the user one is in the last 4k. To switch between them, you
+ * just need to flip the 12th bit in their addresses.
+ */
+#define PTI_PGTABLE_SWITCH_BIT PAGE_SHIFT
+
+/*
+ * This generates better code than the inline assembly in
+ * __set_bit().
+ */
+static inline void *ptr_set_bit(void *ptr, int bit)
+{
+ unsigned long __ptr = (unsigned long)ptr;
+
+ __ptr |= BIT(bit);
+ return (void *)__ptr;
+}
+static inline void *ptr_clear_bit(void *ptr, int bit)
+{
+ unsigned long __ptr = (unsigned long)ptr;
+
+ __ptr &= ~BIT(bit);
+ return (void *)__ptr;
+}
+
+static inline pgd_t *kernel_to_user_pgdp(pgd_t *pgdp)
+{
+ return ptr_set_bit(pgdp, PTI_PGTABLE_SWITCH_BIT);
+}
+
+static inline pgd_t *user_to_kernel_pgdp(pgd_t *pgdp)
+{
+ return ptr_clear_bit(pgdp, PTI_PGTABLE_SWITCH_BIT);
+}
+
+static inline p4d_t *kernel_to_user_p4dp(p4d_t *p4dp)
+{
+ return ptr_set_bit(p4dp, PTI_PGTABLE_SWITCH_BIT);
+}
+
+static inline p4d_t *user_to_kernel_p4dp(p4d_t *p4dp)
+{
+ return ptr_clear_bit(p4dp, PTI_PGTABLE_SWITCH_BIT);
+}
+#endif /* CONFIG_PAGE_TABLE_ISOLATION */
+
/*
* clone_pgd_range(pgd_t *dst, pgd_t *src, int count);
*
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index c863816..9934115 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -132,55 +132,6 @@ static inline pud_t native_pudp_get_and_clear(pud_t *xp)
#endif
}

-#ifdef CONFIG_PAGE_TABLE_ISOLATION
-/*
- * All top-level PAGE_TABLE_ISOLATION page tables are order-1 pages
- * (8k-aligned and 8k in size). The kernel one is at the beginning 4k and
- * the user one is in the last 4k. To switch between them, you
- * just need to flip the 12th bit in their addresses.
- */
-#define PTI_PGTABLE_SWITCH_BIT PAGE_SHIFT
-
-/*
- * This generates better code than the inline assembly in
- * __set_bit().
- */
-static inline void *ptr_set_bit(void *ptr, int bit)
-{
- unsigned long __ptr = (unsigned long)ptr;
-
- __ptr |= BIT(bit);
- return (void *)__ptr;
-}
-static inline void *ptr_clear_bit(void *ptr, int bit)
-{
- unsigned long __ptr = (unsigned long)ptr;
-
- __ptr &= ~BIT(bit);
- return (void *)__ptr;
-}
-
-static inline pgd_t *kernel_to_user_pgdp(pgd_t *pgdp)
-{
- return ptr_set_bit(pgdp, PTI_PGTABLE_SWITCH_BIT);
-}
-
-static inline pgd_t *user_to_kernel_pgdp(pgd_t *pgdp)
-{
- return ptr_clear_bit(pgdp, PTI_PGTABLE_SWITCH_BIT);
-}
-
-static inline p4d_t *kernel_to_user_p4dp(p4d_t *p4dp)
-{
- return ptr_set_bit(p4dp, PTI_PGTABLE_SWITCH_BIT);
-}
-
-static inline p4d_t *user_to_kernel_p4dp(p4d_t *p4dp)
-{
- return ptr_clear_bit(p4dp, PTI_PGTABLE_SWITCH_BIT);
-}
-#endif /* CONFIG_PAGE_TABLE_ISOLATION */
-
/*
* Page table pages are page-aligned. The lower half of the top
* level is used for userspace and the top half for the kernel.
--
2.7.4

2018-04-16 15:36:25

by Joerg Roedel

[permalink] [raw]

Subject: [PATCH 03/35] x86/entry/32: Load task stack from x86_tss.sp1 in SYSENTER handler

From: Joerg Roedel <[email protected]>

We want x86_tss.sp0 point to the entry stack later to use
it as a trampoline stack for other kernel entry points
besides SYSENTER.

So store the task stack pointer in x86_tss.sp1, which is
otherwise unused by the hardware, as Linux doesn't make use
of Ring 1.

Signed-off-by: Joerg Roedel <[email protected]>
---
arch/x86/kernel/asm-offsets_32.c | 2 +-
arch/x86/kernel/process_32.c | 2 ++
2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/asm-offsets_32.c b/arch/x86/kernel/asm-offsets_32.c
index c6ac48f..5f05329 100644
--- a/arch/x86/kernel/asm-offsets_32.c
+++ b/arch/x86/kernel/asm-offsets_32.c
@@ -47,7 +47,7 @@ void foo(void)
BLANK();

/* Offset from the sysenter stack to tss.sp0 */
- DEFINE(TSS_entry_stack, offsetof(struct cpu_entry_area, tss.x86_tss.sp0) -
+ DEFINE(TSS_entry_stack, offsetof(struct cpu_entry_area, tss.x86_tss.sp1) -
offsetofend(struct cpu_entry_area, entry_stack_page.stack));

#ifdef CONFIG_CC_STACKPROTECTOR
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 5224c60..097d36a 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -292,6 +292,8 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
this_cpu_write(cpu_current_top_of_stack,
(unsigned long)task_stack_page(next_p) +
THREAD_SIZE);
+ /* SYSENTER reads the task-stack from tss.sp1 */
+ this_cpu_write(cpu_tss_rw.x86_tss.sp1, next_p->thread.sp0);

/*
* Restore %gs if needed (which is common)
--
2.7.4

2018-04-16 15:36:33

by Joerg Roedel

[permalink] [raw]

Subject: [PATCH 13/35] x86/entry/32: Add PTI cr3 switch to non-NMI entry/exit points

From: Joerg Roedel <[email protected]>

Add unconditional cr3 switches between user and kernel cr3
to all non-NMI entry and exit points.

Signed-off-by: Joerg Roedel <[email protected]>
---
arch/x86/entry/entry_32.S | 83 ++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 79 insertions(+), 4 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 71e1cb3..b2b0ecb 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -154,6 +154,33 @@

#endif /* CONFIG_X86_32_LAZY_GS */

+/* Unconditionally switch to user cr3 */
+.macro SWITCH_TO_USER_CR3 scratch_reg:req
+ ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
+
+ movl %cr3, \scratch_reg
+ orl $PTI_SWITCH_MASK, \scratch_reg
+ movl \scratch_reg, %cr3
+.Lend_\@:
+.endm
+
+/*
+ * Switch to kernel cr3 if not already loaded and return current cr3 in
+ * \scratch_reg
+ */
+.macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
+ ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
+ movl %cr3, \scratch_reg
+ /* Test if we are already on kernel CR3 */
+ testl $PTI_SWITCH_MASK, \scratch_reg
+ jz .Lend_\@
+ andl $(~PTI_SWITCH_MASK), \scratch_reg
+ movl \scratch_reg, %cr3
+ /* Return original CR3 in \scratch_reg */
+ orl $PTI_SWITCH_MASK, \scratch_reg
+.Lend_\@:
+.endm
+
.macro SAVE_ALL pt_regs_ax=%eax switch_stacks=0
cld
/* Push segment registers and %eax */
@@ -288,7 +315,6 @@
#endif /* CONFIG_X86_ESPFIX32 */
.endm

-
/*
* Called with pt_regs fully populated and kernel segments loaded,
* so we can access PER_CPU and use the integer registers.
@@ -301,11 +327,19 @@
*/

#define CS_FROM_ENTRY_STACK (1 << 31)
+#define CS_FROM_USER_CR3 (1 << 30)

.macro SWITCH_TO_KERNEL_STACK

ALTERNATIVE "", "jmp .Lend_\@", X86_FEATURE_XENPV

+ SWITCH_TO_KERNEL_CR3 scratch_reg=%eax
+
+ /*
+ * %eax now contains the entry cr3 and we carry it forward in
+ * that register for the time this macro runs
+ */
+
/* Are we on the entry stack? Bail out if not! */
movl PER_CPU_VAR(cpu_entry_area), %edi
addl $CPU_ENTRY_AREA_entry_stack, %edi
@@ -374,7 +408,8 @@
* but switch back to the entry-stack again when we approach
* iret and return to the interrupted code-path. This usually
* happens when we hit an exception while restoring user-space
- * segment registers on the way back to user-space.
+ * segment registers on the way back to user-space or when the
+ * sysenter handler runs with eflags.tf set.
*
* When we switch to the task-stack here, we can't trust the
* contents of the entry-stack anymore, as the exception handler
@@ -391,6 +426,7 @@
*
* %esi: Entry-Stack pointer (same as %esp)
* %edi: Top of the task stack
+ * %eax: CR3 on kernel entry
*/

/* Calculate number of bytes on the entry stack in %ecx */
@@ -407,6 +443,14 @@
orl $CS_FROM_ENTRY_STACK, PT_CS(%esp)

/*
+ * Test the cr3 used to enter the kernel and add a marker
+ * so that we can switch back to it before iret.
+ */
+ testl $PTI_SWITCH_MASK, %eax
+ jz .Lcopy_pt_regs_\@
+ orl $CS_FROM_USER_CR3, PT_CS(%esp)
+
+ /*
* %esi and %edi are unchanged, %ecx contains the number of
* bytes to copy. The code at .Lcopy_pt_regs_\@ will allocate
* the stack-frame on task-stack and copy everything over
@@ -472,7 +516,7 @@

/*
* This macro handles the case when we return to kernel-mode on the iret
- * path and have to switch back to the entry stack.
+ * path and have to switch back to the entry stack and/or user-cr3
*
* See the comments below the .Lentry_from_kernel_\@ label in the
* SWITCH_TO_KERNEL_STACK macro for more details.
@@ -518,6 +562,18 @@
/* Safe to switch to entry-stack now */
movl %ebx, %esp

+ /*
+ * We came from entry-stack and need to check if we also need to
+ * switch back to user cr3.
+ */
+ testl $CS_FROM_USER_CR3, PT_CS(%esp)
+ jz .Lend_\@
+
+ /* Clear marker from stack-frame */
+ andl $(~CS_FROM_USER_CR3), PT_CS(%esp)
+
+ SWITCH_TO_USER_CR3 scratch_reg=%eax
+
.Lend_\@:
.endm
/*
@@ -711,6 +767,18 @@ ENTRY(xen_sysenter_target)
* 0(%ebp) arg6
*/
ENTRY(entry_SYSENTER_32)
+ /*
+ * On entry-stack with all userspace-regs live - save and
+ * restore eflags and %eax to use it as scratch-reg for the cr3
+ * switch.
+ */
+ pushfl
+ pushl %eax
+ SWITCH_TO_KERNEL_CR3 scratch_reg=%eax
+ popl %eax
+ popfl
+
+ /* Stack empty again, switch to task stack */
movl TSS_entry_stack(%esp), %esp

.Lsysenter_past_esp:
@@ -791,6 +859,9 @@ ENTRY(entry_SYSENTER_32)
/* Switch to entry stack */
movl %eax, %esp

+ /* Now ready to switch the cr3 */
+ SWITCH_TO_USER_CR3 scratch_reg=%eax
+
/*
* Restore all flags except IF. (We restore IF separately because
* STI gives a one-instruction window in which we won't be interrupted,
@@ -871,7 +942,11 @@ restore_all:
.Lrestore_all_notrace:
CHECK_AND_APPLY_ESPFIX
.Lrestore_nocheck:
- RESTORE_REGS 4 # skip orig_eax/error_code
+ /* Switch back to user CR3 */
+ SWITCH_TO_USER_CR3 scratch_reg=%eax
+
+ /* Restore user state */
+ RESTORE_REGS pop=4 # skip orig_eax/error_code
.Lirq_return:
/*
* ARCH_HAS_MEMBARRIER_SYNC_CORE rely on IRET core serialization
--
2.7.4

2018-04-16 15:36:59

by Joerg Roedel

[permalink] [raw]

Subject: [PATCH 10/35] x86/entry/32: Handle Entry from Kernel-Mode on Entry-Stack

From: Joerg Roedel <[email protected]>

It can happen that we enter the kernel from kernel-mode and
on the entry-stack. The most common way this happens is when
we get an exception while loading the user-space segment
registers on the kernel-to-userspace exit path.

The segment loading needs to be done after the entry-stack
switch, because the stack-switch needs kernel %fs for
per_cpu access.

When this happens, we need to make sure that we leave the
kernel with the entry-stack again, so that the interrupted
code-path runs on the right stack when switching to the
user-cr3.

We do this by detecting this condition on kernel-entry by
checking CS.RPL and %esp, and if it happens, we copy over
the complete content of the entry stack to the task-stack.
This needs to be done because once we enter the exception
handlers we might be scheduled out or even migrated to a
different CPU, so that we can't rely on the entry-stack
contents. We also leave a marker in the stack-frame to
detect this condition on the exit path.

On the exit path the copy is reversed, we copy all of the
remaining task-stack back to the entry-stack and switch
to it.

Signed-off-by: Joerg Roedel <[email protected]>
---
arch/x86/entry/entry_32.S | 116 +++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 115 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index e2621bf..b3477ff 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -299,6 +299,9 @@
* copied there. So allocate the stack-frame on the task-stack and
* switch to it before we do any copying.
*/
+
+#define CS_FROM_ENTRY_STACK (1 << 31)
+
.macro SWITCH_TO_KERNEL_STACK

ALTERNATIVE "", "jmp .Lend_\@", X86_FEATURE_XENPV
@@ -320,6 +323,16 @@
/* Load top of task-stack into %edi */
movl TSS_entry_stack(%edi), %edi

+ /*
+ * Clear upper bits of the CS slot in pt_regs in case hardware
+ * didn't clear it for us
+ */
+ andl $(0x0000ffff), PT_CS(%esp)
+
+ /* Special case - entry from kernel mode via entry stack */
+ testl $SEGMENT_RPL_MASK, PT_CS(%esp)
+ jz .Lentry_from_kernel_\@
+
/* Bytes to copy */
movl $PTREGS_SIZE, %ecx

@@ -333,8 +346,8 @@
*/
addl $(4 * 4), %ecx

-.Lcopy_pt_regs_\@:
#endif
+.Lcopy_pt_regs_\@:

/* Allocate frame on task-stack */
subl %ecx, %edi
@@ -350,6 +363,56 @@
cld
rep movsl

+ jmp .Lend_\@
+
+.Lentry_from_kernel_\@:
+
+ /*
+ * This handles the case when we enter the kernel from
+ * kernel-mode and %esp points to the entry-stack. When this
+ * happens we need to switch to the task-stack to run C code,
+ * but switch back to the entry-stack again when we approach
+ * iret and return to the interrupted code-path. This usually
+ * happens when we hit an exception while restoring user-space
+ * segment registers on the way back to user-space.
+ *
+ * When we switch to the task-stack here, we can't trust the
+ * contents of the entry-stack anymore, as the exception handler
+ * might be scheduled out or moved to another CPU. Therefore we
+ * copy the complete entry-stack to the task-stack and set a
+ * marker in the iret-frame (bit 31 of the CS dword) to detect
+ * what we've done on the iret path.
+ *
+ * On the iret path we copy everything back and switch to the
+ * entry-stack, so that the interrupted kernel code-path
+ * continues on the same stack it was interrupted with.
+ *
+ * Be aware that an NMI can happen anytime in this code.
+ *
+ * %esi: Entry-Stack pointer (same as %esp)
+ * %edi: Top of the task stack
+ */
+
+ /* Calculate number of bytes on the entry stack in %ecx */
+ movl %esi, %ecx
+
+ /* %ecx to the top of entry-stack */
+ andl $(MASK_entry_stack), %ecx
+ addl $(SIZEOF_entry_stack), %ecx
+
+ /* Number of bytes on the entry stack to %ecx */
+ sub %esi, %ecx
+
+ /* Mark stackframe as coming from entry stack */
+ orl $CS_FROM_ENTRY_STACK, PT_CS(%esp)
+
+ /*
+ * %esi and %edi are unchanged, %ecx contains the number of
+ * bytes to copy. The code at .Lcopy_pt_regs_\@ will allocate
+ * the stack-frame on task-stack and copy everything over
+ */
+ jmp .Lcopy_pt_regs_\@
+
.Lend_\@:
.endm

@@ -408,6 +471,56 @@
.endm

/*
+ * This macro handles the case when we return to kernel-mode on the iret
+ * path and have to switch back to the entry stack.
+ *
+ * See the comments below the .Lentry_from_kernel_\@ label in the
+ * SWITCH_TO_KERNEL_STACK macro for more details.
+ */
+.macro PARANOID_EXIT_TO_KERNEL_MODE
+
+ /*
+ * Test if we entered the kernel with the entry-stack. Most
+ * likely we did not, because this code only runs on the
+ * return-to-kernel path.
+ */
+ testl $CS_FROM_ENTRY_STACK, PT_CS(%esp)
+ jz .Lend_\@
+
+ /* Unlikely slow-path */
+
+ /* Clear marker from stack-frame */
+ andl $(~CS_FROM_ENTRY_STACK), PT_CS(%esp)
+
+ /* Copy the remaining task-stack contents to entry-stack */
+ movl %esp, %esi
+ movl PER_CPU_VAR(cpu_tss_rw + TSS_sp0), %edi
+
+ /* Bytes on the task-stack to ecx */
+ movl PER_CPU_VAR(cpu_current_top_of_stack), %ecx
+ subl %esi, %ecx
+
+ /* Allocate stack-frame on entry-stack */
+ subl %ecx, %edi
+
+ /*
+ * Save future stack-pointer, we must not switch until the
+ * copy is done, otherwise the NMI handler could destroy the
+ * contents of the task-stack we are about to copy.
+ */
+ movl %edi, %ebx
+
+ /* Do the copy */
+ shrl $2, %ecx
+ cld
+ rep movsl
+
+ /* Safe to switch to entry-stack now */
+ movl %ebx, %esp
+
+.Lend_\@:
+.endm
+/*
* %eax: prev task
* %edx: next task
*/
@@ -769,6 +882,7 @@ restore_all:

restore_all_kernel:
TRACE_IRQS_IRET
+ PARANOID_EXIT_TO_KERNEL_MODE
RESTORE_REGS 4
jmp .Lirq_return

--
2.7.4

2018-04-16 15:37:32

by Joerg Roedel

[permalink] [raw]

Subject: [PATCH 07/35] x86/entry/32: Enter the kernel via trampoline stack

From: Joerg Roedel <[email protected]>

Use the entry-stack as a trampoline to enter the kernel. The
entry-stack is already in the cpu_entry_area and will be
mapped to userspace when PTI is enabled.

Signed-off-by: Joerg Roedel <[email protected]>
---
arch/x86/entry/entry_32.S | 136 +++++++++++++++++++++++++++++++--------
arch/x86/include/asm/switch_to.h | 6 +-
arch/x86/kernel/asm-offsets.c | 1 +
arch/x86/kernel/cpu/common.c | 5 +-
arch/x86/kernel/process.c | 2 -
arch/x86/kernel/process_32.c | 10 +--
6 files changed, 121 insertions(+), 39 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 2f04d6e..1d6b527 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -154,25 +154,36 @@

#endif /* CONFIG_X86_32_LAZY_GS */

-.macro SAVE_ALL pt_regs_ax=%eax
+.macro SAVE_ALL pt_regs_ax=%eax switch_stacks=0
cld
+ /* Push segment registers and %eax */
PUSH_GS
pushl %fs
pushl %es
pushl %ds
pushl \pt_regs_ax
+
+ /* Load kernel segments */
+ movl $(__USER_DS), %eax
+ movl %eax, %ds
+ movl %eax, %es
+ movl $(__KERNEL_PERCPU), %eax
+ movl %eax, %fs
+ SET_KERNEL_GS %eax
+
+ /* Push integer registers and complete PT_REGS */
pushl %ebp
pushl %edi
pushl %esi
pushl %edx
pushl %ecx
pushl %ebx
- movl $(__USER_DS), %edx
- movl %edx, %ds
- movl %edx, %es
- movl $(__KERNEL_PERCPU), %edx
- movl %edx, %fs
- SET_KERNEL_GS %edx
+
+ /* Switch to kernel stack if necessary */
+.if \switch_stacks > 0
+ SWITCH_TO_KERNEL_STACK
+.endif
+
.endm

/*
@@ -269,6 +280,72 @@
.Lend_\@:
#endif /* CONFIG_X86_ESPFIX32 */
.endm
+
+
+/*
+ * Called with pt_regs fully populated and kernel segments loaded,
+ * so we can access PER_CPU and use the integer registers.
+ *
+ * We need to be very careful here with the %esp switch, because an NMI
+ * can happen everywhere. If the NMI handler finds itself on the
+ * entry-stack, it will overwrite the task-stack and everything we
+ * copied there. So allocate the stack-frame on the task-stack and
+ * switch to it before we do any copying.
+ */
+.macro SWITCH_TO_KERNEL_STACK
+
+ ALTERNATIVE "", "jmp .Lend_\@", X86_FEATURE_XENPV
+
+ /* Are we on the entry stack? Bail out if not! */
+ movl PER_CPU_VAR(cpu_entry_area), %edi
+ addl $CPU_ENTRY_AREA_entry_stack, %edi
+ cmpl %esp, %edi
+ jae .Lend_\@
+
+ /* Load stack pointer into %esi and %edi */
+ movl %esp, %esi
+ movl %esi, %edi
+
+ /* Move %edi to the top of the entry stack */
+ andl $(MASK_entry_stack), %edi
+ addl $(SIZEOF_entry_stack), %edi
+
+ /* Load top of task-stack into %edi */
+ movl TSS_entry_stack(%edi), %edi
+
+ /* Bytes to copy */
+ movl $PTREGS_SIZE, %ecx
+
+#ifdef CONFIG_VM86
+ testl $X86_EFLAGS_VM, PT_EFLAGS(%esi)
+ jz .Lcopy_pt_regs_\@
+
+ /*
+ * Stack-frame contains 4 additional segment registers when
+ * coming from VM86 mode
+ */
+ addl $(4 * 4), %ecx
+
+.Lcopy_pt_regs_\@:
+#endif
+
+ /* Allocate frame on task-stack */
+ subl %ecx, %edi
+
+ /* Switch to task-stack */
+ movl %edi, %esp
+
+ /*
+ * We are now on the task-stack and can safely copy over the
+ * stack-frame
+ */
+ shrl $2, %ecx
+ cld
+ rep movsl
+
+.Lend_\@:
+.endm
+
/*
* %eax: prev task
* %edx: next task
@@ -461,6 +538,7 @@ ENTRY(xen_sysenter_target)
*/
ENTRY(entry_SYSENTER_32)
movl TSS_entry_stack(%esp), %esp
+
.Lsysenter_past_esp:
pushl $__USER_DS /* pt_regs->ss */
pushl %ebp /* pt_regs->sp (stashed in bp) */
@@ -469,7 +547,7 @@ ENTRY(entry_SYSENTER_32)
pushl $__USER_CS /* pt_regs->cs */
pushl $0 /* pt_regs->ip = 0 (placeholder) */
pushl %eax /* pt_regs->orig_ax */
- SAVE_ALL pt_regs_ax=$-ENOSYS /* save rest */
+ SAVE_ALL pt_regs_ax=$-ENOSYS /* save rest, stack already switched */

/*
* SYSENTER doesn't filter flags, so we need to clear NT, AC
@@ -580,7 +658,8 @@ ENDPROC(entry_SYSENTER_32)
ENTRY(entry_INT80_32)
ASM_CLAC
pushl %eax /* pt_regs->orig_ax */
- SAVE_ALL pt_regs_ax=$-ENOSYS /* save rest */
+
+ SAVE_ALL pt_regs_ax=$-ENOSYS switch_stacks=1 /* save rest */

/*
* User mode is traced as though IRQs are on, and the interrupt gate
@@ -677,7 +756,8 @@ END(irq_entries_start)
common_interrupt:
ASM_CLAC
addl $-0x80, (%esp) /* Adjust vector into the [-256, -1] range */
- SAVE_ALL
+
+ SAVE_ALL switch_stacks=1
ENCODE_FRAME_POINTER
TRACE_IRQS_OFF
movl %esp, %eax
@@ -685,16 +765,16 @@ common_interrupt:
jmp ret_from_intr
ENDPROC(common_interrupt)

-#define BUILD_INTERRUPT3(name, nr, fn) \
-ENTRY(name) \
- ASM_CLAC; \
- pushl $~(nr); \
- SAVE_ALL; \
- ENCODE_FRAME_POINTER; \
- TRACE_IRQS_OFF \
- movl %esp, %eax; \
- call fn; \
- jmp ret_from_intr; \
+#define BUILD_INTERRUPT3(name, nr, fn) \
+ENTRY(name) \
+ ASM_CLAC; \
+ pushl $~(nr); \
+ SAVE_ALL switch_stacks=1; \
+ ENCODE_FRAME_POINTER; \
+ TRACE_IRQS_OFF \
+ movl %esp, %eax; \
+ call fn; \
+ jmp ret_from_intr; \
ENDPROC(name)

#define BUILD_INTERRUPT(name, nr) \
@@ -926,16 +1006,20 @@ common_exception:
pushl %es
pushl %ds
pushl %eax
+ movl $(__USER_DS), %eax
+ movl %eax, %ds
+ movl %eax, %es
+ movl $(__KERNEL_PERCPU), %eax
+ movl %eax, %fs
pushl %ebp
pushl %edi
pushl %esi
pushl %edx
pushl %ecx
pushl %ebx
+ SWITCH_TO_KERNEL_STACK
ENCODE_FRAME_POINTER
cld
- movl $(__KERNEL_PERCPU), %ecx
- movl %ecx, %fs
UNWIND_ESPFIX_STACK
GS_TO_REG %ecx
movl PT_GS(%esp), %edi # get the function address
@@ -943,9 +1027,6 @@ common_exception:
movl $-1, PT_ORIG_EAX(%esp) # no syscall to restart
REG_TO_PTGS %ecx
SET_KERNEL_GS %ecx
- movl $(__USER_DS), %ecx
- movl %ecx, %ds
- movl %ecx, %es
TRACE_IRQS_OFF
movl %esp, %eax # pt_regs pointer
CALL_NOSPEC %edi
@@ -964,6 +1045,7 @@ ENTRY(debug)
*/
ASM_CLAC
pushl $-1 # mark this as an int
+
SAVE_ALL
ENCODE_FRAME_POINTER
xorl %edx, %edx # error code 0
@@ -999,6 +1081,7 @@ END(debug)
*/
ENTRY(nmi)
ASM_CLAC
+
#ifdef CONFIG_X86_ESPFIX32
pushl %eax
movl %ss, %eax
@@ -1066,7 +1149,8 @@ END(nmi)
ENTRY(int3)
ASM_CLAC
pushl $-1 # mark this as an int
- SAVE_ALL
+
+ SAVE_ALL switch_stacks=1
ENCODE_FRAME_POINTER
TRACE_IRQS_OFF
xorl %edx, %edx # zero error code
diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h
index eb5f799..20e5f7ab 100644
--- a/arch/x86/include/asm/switch_to.h
+++ b/arch/x86/include/asm/switch_to.h
@@ -89,13 +89,9 @@ static inline void refresh_sysenter_cs(struct thread_struct *thread)
/* This is used when switching tasks or entering/exiting vm86 mode. */
static inline void update_sp0(struct task_struct *task)
{
- /* On x86_64, sp0 always points to the entry trampoline stack, which is constant: */
-#ifdef CONFIG_X86_32
- load_sp0(task->thread.sp0);
-#else
+ /* sp0 always points to the entry trampoline stack, which is constant: */
if (static_cpu_has(X86_FEATURE_XENPV))
load_sp0(task_top_of_stack(task));
-#endif
}

#endif /* _ASM_X86_SWITCH_TO_H */
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 232152c..86f06e8 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -103,6 +103,7 @@ void common(void) {
OFFSET(CPU_ENTRY_AREA_entry_trampoline, cpu_entry_area, entry_trampoline);
OFFSET(CPU_ENTRY_AREA_entry_stack, cpu_entry_area, entry_stack_page);
DEFINE(SIZEOF_entry_stack, sizeof(struct entry_stack));
+ DEFINE(MASK_entry_stack, (~(sizeof(struct entry_stack) - 1)));

/* Offset for sp0 and sp1 into the tss_struct */
OFFSET(TSS_sp0, tss_struct, x86_tss.sp0);
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 8a5b185..311e988 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1718,11 +1718,12 @@ void cpu_init(void)
enter_lazy_tlb(&init_mm, curr);

/*
- * Initialize the TSS. Don't bother initializing sp0, as the initial
- * task never enters user mode.
+ * Initialize the TSS. sp0 points to the entry trampoline stack
+ * regardless of what task is running.
*/
set_tss_desc(cpu, &get_cpu_entry_area(cpu)->tss.x86_tss);
load_TR_desc();
+ load_sp0((unsigned long)(cpu_entry_stack(cpu) + 1));

load_mm_ldt(&init_mm);

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 03408b9..2b256d3 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -56,14 +56,12 @@ __visible DEFINE_PER_CPU_PAGE_ALIGNED(struct tss_struct, cpu_tss_rw) = {
*/
.sp0 = (1UL << (BITS_PER_LONG-1)) + 1,

-#ifdef CONFIG_X86_64
/*
* .sp1 is cpu_current_top_of_stack. The init task never
* runs user code, but cpu_current_top_of_stack should still
* be well defined before the first context switch.
*/
.sp1 = TOP_OF_INIT_STACK,
-#endif

#ifdef CONFIG_X86_32
.ss0 = __KERNEL_DS,
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 097d36a..3f3a8c6 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -289,10 +289,12 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
*/
update_sp0(next_p);
refresh_sysenter_cs(next);
- this_cpu_write(cpu_current_top_of_stack,
- (unsigned long)task_stack_page(next_p) +
- THREAD_SIZE);
- /* SYSENTER reads the task-stack from tss.sp1 */
+ this_cpu_write(cpu_current_top_of_stack, task_top_of_stack(next_p));
+ /*
+ * TODO: Find a way to let cpu_current_top_of_stack point to
+ * cpu_tss_rw.x86_tss.sp1. Doing so now results in stack corruption with
+ * iret exceptions.
+ */
this_cpu_write(cpu_tss_rw.x86_tss.sp1, next_p->thread.sp0);

/*
--
2.7.4

2018-04-16 15:37:47

by Joerg Roedel

[permalink] [raw]

Subject: [PATCH 15/35] x86/pgtable: Rename pti_set_user_pgd to pti_set_user_pgtbl

From: Joerg Roedel <[email protected]>

With the way page-table folding is implemented on 32 bit, we
are not only setting PGDs with this functions, but also PUDs
and even PMDs. Give the function a more generic name to
reflect that.

Signed-off-by: Joerg Roedel <[email protected]>
---
arch/x86/include/asm/pgtable_64.h | 12 ++++++------
arch/x86/mm/pti.c | 2 +-
2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index 877bc27..c863816 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -196,21 +196,21 @@ static inline bool pgdp_maps_userspace(void *__ptr)
}

#ifdef CONFIG_PAGE_TABLE_ISOLATION
-pgd_t __pti_set_user_pgd(pgd_t *pgdp, pgd_t pgd);
+pgd_t __pti_set_user_pgtbl(pgd_t *pgdp, pgd_t pgd);

/*
* Take a PGD location (pgdp) and a pgd value that needs to be set there.
* Populates the user and returns the resulting PGD that must be set in
* the kernel copy of the page tables.
*/
-static inline pgd_t pti_set_user_pgd(pgd_t *pgdp, pgd_t pgd)
+static inline pgd_t pti_set_user_pgtbl(pgd_t *pgdp, pgd_t pgd)
{
if (!static_cpu_has(X86_FEATURE_PTI))
return pgd;
- return __pti_set_user_pgd(pgdp, pgd);
+ return __pti_set_user_pgtbl(pgdp, pgd);
}
#else
-static inline pgd_t pti_set_user_pgd(pgd_t *pgdp, pgd_t pgd)
+static inline pgd_t pti_set_user_pgtbl(pgd_t *pgdp, pgd_t pgd)
{
return pgd;
}
@@ -226,7 +226,7 @@ static inline void native_set_p4d(p4d_t *p4dp, p4d_t p4d)
}

pgd = native_make_pgd(native_p4d_val(p4d));
- pgd = pti_set_user_pgd((pgd_t *)p4dp, pgd);
+ pgd = pti_set_user_pgtbl((pgd_t *)p4dp, pgd);
*p4dp = native_make_p4d(native_pgd_val(pgd));
}

@@ -237,7 +237,7 @@ static inline void native_p4d_clear(p4d_t *p4d)

static inline void native_set_pgd(pgd_t *pgdp, pgd_t pgd)
{
- *pgdp = pti_set_user_pgd(pgdp, pgd);
+ *pgdp = pti_set_user_pgtbl(pgdp, pgd);
}

static inline void native_pgd_clear(pgd_t *pgd)
diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index f1fd52f..9bea9c3 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -117,7 +117,7 @@ void __init pti_check_boottime_disable(void)
setup_force_cpu_cap(X86_FEATURE_PTI);
}

-pgd_t __pti_set_user_pgd(pgd_t *pgdp, pgd_t pgd)
+pgd_t __pti_set_user_pgtbl(pgd_t *pgdp, pgd_t pgd)
{
/*
* Changes to the high (kernel) portion of the kernelmode page
--
2.7.4

2018-04-16 15:37:52

by Joerg Roedel

[permalink] [raw]

Subject: [PATCH 05/35] x86/entry/32: Unshare NMI return path

From: Joerg Roedel <[email protected]>

NMI will no longer use most of the shared return path,
because NMI needs special handling when the CR3 switches for
PTI are added. This patch prepares for that.

Signed-off-by: Joerg Roedel <[email protected]>
---
arch/x86/entry/entry_32.S | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 118420b..3a319fd 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -1017,7 +1017,7 @@ ENTRY(nmi)

/* Not on SYSENTER stack. */
call do_nmi
- jmp .Lrestore_all_notrace
+ jmp .Lnmi_return

.Lnmi_from_sysenter_stack:
/*
@@ -1028,7 +1028,11 @@ ENTRY(nmi)
movl PER_CPU_VAR(cpu_current_top_of_stack), %esp
call do_nmi
movl %ebx, %esp
- jmp .Lrestore_all_notrace
+
+.Lnmi_return:
+ CHECK_AND_APPLY_ESPFIX
+ RESTORE_REGS 4
+ jmp .Lirq_return

#ifdef CONFIG_X86_ESPFIX32
.Lnmi_espfix_stack:
--
2.7.4

2018-04-16 15:38:02

by Joerg Roedel

[permalink] [raw]

Subject: [PATCH 01/35] x86/asm-offsets: Move TSS_sp0 and TSS_sp1 to asm-offsets.c

From: Joerg Roedel <[email protected]>

These offsets will be used in 32 bit assembly code as well,
so make them available for all of x86 code.

Signed-off-by: Joerg Roedel <[email protected]>
---
arch/x86/kernel/asm-offsets.c | 4 ++++
arch/x86/kernel/asm-offsets_64.c | 2 --
2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 76417a9..232152c 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -103,4 +103,8 @@ void common(void) {
OFFSET(CPU_ENTRY_AREA_entry_trampoline, cpu_entry_area, entry_trampoline);
OFFSET(CPU_ENTRY_AREA_entry_stack, cpu_entry_area, entry_stack_page);
DEFINE(SIZEOF_entry_stack, sizeof(struct entry_stack));
+
+ /* Offset for sp0 and sp1 into the tss_struct */
+ OFFSET(TSS_sp0, tss_struct, x86_tss.sp0);
+ OFFSET(TSS_sp1, tss_struct, x86_tss.sp1);
}
diff --git a/arch/x86/kernel/asm-offsets_64.c b/arch/x86/kernel/asm-offsets_64.c
index bf51e51..d2eba73 100644
--- a/arch/x86/kernel/asm-offsets_64.c
+++ b/arch/x86/kernel/asm-offsets_64.c
@@ -65,8 +65,6 @@ int main(void)
#undef ENTRY

OFFSET(TSS_ist, tss_struct, x86_tss.ist);
- OFFSET(TSS_sp0, tss_struct, x86_tss.sp0);
- OFFSET(TSS_sp1, tss_struct, x86_tss.sp1);
BLANK();

#ifdef CONFIG_CC_STACKPROTECTOR
--
2.7.4

2018-04-16 15:38:13

by Joerg Roedel

[permalink] [raw]

Subject: [PATCH 02/35] x86/entry/32: Rename TSS_sysenter_sp0 to TSS_entry_stack

From: Joerg Roedel <[email protected]>

The stack address doesn't need to be stored in tss.sp0 if
we switch manually like on sysenter. Rename the offset so
that it still makes sense when we change its location.

We will also use this stack for all kernel-entry points, not
just sysenter. Reflect that in the name as well.

Signed-off-by: Joerg Roedel <[email protected]>
---
arch/x86/entry/entry_32.S | 2 +-
arch/x86/kernel/asm-offsets_32.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index bef8e2b..ec288be 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -412,7 +412,7 @@ ENTRY(xen_sysenter_target)
* 0(%ebp) arg6
*/
ENTRY(entry_SYSENTER_32)
- movl TSS_sysenter_sp0(%esp), %esp
+ movl TSS_entry_stack(%esp), %esp
.Lsysenter_past_esp:
pushl $__USER_DS /* pt_regs->ss */
pushl %ebp /* pt_regs->sp (stashed in bp) */
diff --git a/arch/x86/kernel/asm-offsets_32.c b/arch/x86/kernel/asm-offsets_32.c
index f91ba53..c6ac48f 100644
--- a/arch/x86/kernel/asm-offsets_32.c
+++ b/arch/x86/kernel/asm-offsets_32.c
@@ -47,7 +47,7 @@ void foo(void)
BLANK();

/* Offset from the sysenter stack to tss.sp0 */
- DEFINE(TSS_sysenter_sp0, offsetof(struct cpu_entry_area, tss.x86_tss.sp0) -
+ DEFINE(TSS_entry_stack, offsetof(struct cpu_entry_area, tss.x86_tss.sp0) -
offsetofend(struct cpu_entry_area, entry_stack_page.stack));

#ifdef CONFIG_CC_STACKPROTECTOR
--
2.7.4

2018-04-16 15:38:31

by Joerg Roedel

[permalink] [raw]

Subject: [PATCH 04/35] x86/entry/32: Put ESPFIX code into a macro

From: Joerg Roedel <[email protected]>

This makes it easier to split up the shared iret code path.

Signed-off-by: Joerg Roedel <[email protected]>
---
arch/x86/entry/entry_32.S | 97 ++++++++++++++++++++++++-----------------------
1 file changed, 49 insertions(+), 48 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index ec288be..118420b 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -221,6 +221,54 @@
POP_GS_EX
.endm

+.macro CHECK_AND_APPLY_ESPFIX
+#ifdef CONFIG_X86_ESPFIX32
+#define GDT_ESPFIX_SS PER_CPU_VAR(gdt_page) + (GDT_ENTRY_ESPFIX_SS * 8)
+
+ ALTERNATIVE "jmp .Lend_\@", "", X86_BUG_ESPFIX
+
+ movl PT_EFLAGS(%esp), %eax # mix EFLAGS, SS and CS
+ /*
+ * Warning: PT_OLDSS(%esp) contains the wrong/random values if we
+ * are returning to the kernel.
+ * See comments in process.c:copy_thread() for details.
+ */
+ movb PT_OLDSS(%esp), %ah
+ movb PT_CS(%esp), %al
+ andl $(X86_EFLAGS_VM | (SEGMENT_TI_MASK << 8) | SEGMENT_RPL_MASK), %eax
+ cmpl $((SEGMENT_LDT << 8) | USER_RPL), %eax
+ jne .Lend_\@ # returning to user-space with LDT SS
+
+ /*
+ * Setup and switch to ESPFIX stack
+ *
+ * We're returning to userspace with a 16 bit stack. The CPU will not
+ * restore the high word of ESP for us on executing iret... This is an
+ * "official" bug of all the x86-compatible CPUs, which we can work
+ * around to make dosemu and wine happy. We do this by preloading the
+ * high word of ESP with the high word of the userspace ESP while
+ * compensating for the offset by changing to the ESPFIX segment with
+ * a base address that matches for the difference.
+ */
+ mov %esp, %edx /* load kernel esp */
+ mov PT_OLDESP(%esp), %eax /* load userspace esp */
+ mov %dx, %ax /* eax: new kernel esp */
+ sub %eax, %edx /* offset (low word is 0) */
+ shr $16, %edx
+ mov %dl, GDT_ESPFIX_SS + 4 /* bits 16..23 */
+ mov %dh, GDT_ESPFIX_SS + 7 /* bits 24..31 */
+ pushl $__ESPFIX_SS
+ pushl %eax /* new kernel esp */
+ /*
+ * Disable interrupts, but do not irqtrace this section: we
+ * will soon execute iret and the tracer was already set to
+ * the irqstate after the IRET:
+ */
+ DISABLE_INTERRUPTS(CLBR_ANY)
+ lss (%esp), %esp /* switch to espfix segment */
+.Lend_\@:
+#endif /* CONFIG_X86_ESPFIX32 */
+.endm
/*
* %eax: prev task
* %edx: next task
@@ -547,21 +595,7 @@ ENTRY(entry_INT80_32)
restore_all:
TRACE_IRQS_IRET
.Lrestore_all_notrace:
-#ifdef CONFIG_X86_ESPFIX32
- ALTERNATIVE "jmp .Lrestore_nocheck", "", X86_BUG_ESPFIX
-
- movl PT_EFLAGS(%esp), %eax # mix EFLAGS, SS and CS
- /*
- * Warning: PT_OLDSS(%esp) contains the wrong/random values if we
- * are returning to the kernel.
- * See comments in process.c:copy_thread() for details.
- */
- movb PT_OLDSS(%esp), %ah
- movb PT_CS(%esp), %al
- andl $(X86_EFLAGS_VM | (SEGMENT_TI_MASK << 8) | SEGMENT_RPL_MASK), %eax
- cmpl $((SEGMENT_LDT << 8) | USER_RPL), %eax
- je .Lldt_ss # returning to user-space with LDT SS
-#endif
+ CHECK_AND_APPLY_ESPFIX
.Lrestore_nocheck:
RESTORE_REGS 4 # skip orig_eax/error_code
.Lirq_return:
@@ -579,39 +613,6 @@ ENTRY(iret_exc )
jmp common_exception
.previous
_ASM_EXTABLE(.Lirq_return, iret_exc)
-
-#ifdef CONFIG_X86_ESPFIX32
-.Lldt_ss:
-/*
- * Setup and switch to ESPFIX stack
- *
- * We're returning to userspace with a 16 bit stack. The CPU will not
- * restore the high word of ESP for us on executing iret... This is an
- * "official" bug of all the x86-compatible CPUs, which we can work
- * around to make dosemu and wine happy. We do this by preloading the
- * high word of ESP with the high word of the userspace ESP while
- * compensating for the offset by changing to the ESPFIX segment with
- * a base address that matches for the difference.
- */
-#define GDT_ESPFIX_SS PER_CPU_VAR(gdt_page) + (GDT_ENTRY_ESPFIX_SS * 8)
- mov %esp, %edx /* load kernel esp */
- mov PT_OLDESP(%esp), %eax /* load userspace esp */
- mov %dx, %ax /* eax: new kernel esp */
- sub %eax, %edx /* offset (low word is 0) */
- shr $16, %edx
- mov %dl, GDT_ESPFIX_SS + 4 /* bits 16..23 */
- mov %dh, GDT_ESPFIX_SS + 7 /* bits 24..31 */
- pushl $__ESPFIX_SS
- pushl %eax /* new kernel esp */
- /*
- * Disable interrupts, but do not irqtrace this section: we
- * will soon execute iret and the tracer was already set to
- * the irqstate after the IRET:
- */
- DISABLE_INTERRUPTS(CLBR_ANY)
- lss (%esp), %esp /* switch to espfix segment */
- jmp .Lrestore_nocheck
-#endif
ENDPROC(entry_INT80_32)

.macro FIXUP_ESPFIX_STACK
--
2.7.4

2018-04-16 15:38:46

by Joerg Roedel

[permalink] [raw]

Subject: [PATCH 06/35] x86/entry/32: Split off return-to-kernel path

From: Joerg Roedel <[email protected]>

Use a separate return path when we know we are returning to
the kernel. This allows us to put the PTI cr3-switch and the
switch to the entry-stack into the return-to-user path
without further checking.

Signed-off-by: Joerg Roedel <[email protected]>
---
arch/x86/entry/entry_32.S | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 3a319fd..2f04d6e 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -65,7 +65,7 @@
# define preempt_stop(clobbers) DISABLE_INTERRUPTS(clobbers); TRACE_IRQS_OFF
#else
# define preempt_stop(clobbers)
-# define resume_kernel restore_all
+# define resume_kernel restore_all_kernel
#endif

.macro TRACE_IRQS_IRET
@@ -399,9 +399,9 @@ ENTRY(resume_kernel)
DISABLE_INTERRUPTS(CLBR_ANY)
.Lneed_resched:
cmpl $0, PER_CPU_VAR(__preempt_count)
- jnz restore_all
+ jnz restore_all_kernel
testl $X86_EFLAGS_IF, PT_EFLAGS(%esp) # interrupts off (exception path) ?
- jz restore_all
+ jz restore_all_kernel
call preempt_schedule_irq
jmp .Lneed_resched
END(resume_kernel)
@@ -606,6 +606,11 @@ restore_all:
*/
INTERRUPT_RETURN

+restore_all_kernel:
+ TRACE_IRQS_IRET
+ RESTORE_REGS 4
+ jmp .Lirq_return
+
.section .fixup, "ax"
ENTRY(iret_exc )
pushl $0 # no error code
--
2.7.4

2018-04-16 16:03:46

by Joerg Roedel

[permalink] [raw]

Subject: Re: [PATCH 00/35 v5] PTI support for x32

Hi Linus,

On Mon, Apr 16, 2018 at 08:32:46AM -0700, Linus Torvalds wrote:
> Also, it would be nice to have performance numbers,

Here are some numbers I gathered for Ingo on v2 of the patch-set:

https://marc.info/?l=linux-kernel&m=151844711432661&w=2

I don't think they significantly changed, but I can measure them again
if needed.

Besides those, what other numbers/tests are you interested in?

> and check that the global page issues are at least fixed on 32-bit
> too, since we screwed that up on x86-64 initially.
>
> On x86-32, the global pages are likely a bigger deal since there's no PCID.

Okay, I verify if there are any global bits left in the page-tables.
According to the PTDUMP_X86 the cpu_entry_area is mapped with G=1 (which
should be fine?) and another 4M range in the kernel mapping. I need to
check what that is.

Thanks,

Joerg

>
> Linus

2018-04-16 16:15:19

by Linus Torvalds

[permalink] [raw]

Subject: Re: [PATCH 00/35 v5] PTI support for x32

On Mon, Apr 16, 2018 at 9:01 AM, Joerg Roedel <[email protected]> wrote:
>
> Okay, I verify if there are any global bits left in the page-tables.
> According to the PTDUMP_X86 the cpu_entry_area is mapped with G=1 (which
> should be fine?) and another 4M range in the kernel mapping. I need to
> check what that is.

All the kernel entry code that is both in the user mapping and the
kernel mapping should be marked G.

We had missed a lot of it (and the impact is very small with PCID),
but if you rebased on top of 4.17-rc1 you should have it fixed at
least on 64-bit.

See for example commit 8c06c7740d19 ("x86/pti: Leave kernel text
global for !PCID") and in particular the performance numbers (that's
an Atom microserver, but it was chosen due to lack of PCID).

Linus

2018-04-16 18:02:02

by H. Peter Anvin

[permalink] [raw]

Subject: Re: [PATCH 00/35 v5] PTI support for x32

On 04/16/18 08:24, Joerg Roedel wrote:
> Hi,
>
> here is the 5th iteration of my PTI enablement patches for
> x86-32. There are no real changes between v4 and v5 besides
> that I rebased the whole patch-set to v4.17-rc1 and resolved
> the numerous conflicts that this caused.
>

Please don't use the term "x32" for i386/x86-32. "x32" generally refers
to the x32 ABI for x86-64. "x64" in Microsoft terminology for x86-64
corresponds to "x86" for x86-32.

-hpa

2018-04-18 07:37:32

by Jörg Rödel

[permalink] [raw]

Subject: Re: [PATCH 00/35 v5] PTI support for x32

On Mon, Apr 16, 2018 at 09:13:22AM -0700, Linus Torvalds wrote:
> See for example commit 8c06c7740d19 ("x86/pti: Leave kernel text
> global for !PCID") and in particular the performance numbers (that's
> an Atom microserver, but it was chosen due to lack of PCID).

Okay, I checked this on 32 bit and after some small changes I got
identical mappings with GLB set in all page-tables. The changes were:

* Don't change permission bits in pti_clone_kernel_text().
Changing them does not make a difference on 64 bit as
everything cloned in this function is RO anyway. On 32 bit
some areas are mapped RW, so it does make a difference there.

Having different permissions between kernel and user
page-table does also not make sense, because a permission
mismatch in the TLB will cause a re-walk, which is as fast as
not mapping it at all.

* Mapping kernel-text to user-space on 32 bit too. Since there
is no PCID this should improve performance. I have not
measured that yet, but will do so before posting the next
version.

I do some more testing and performance measurements and will send
version 6 of my patches beginning of next week when v4.17-rc2 is out.

Regards,

Joerg

2018-04-18 23:27:50

[permalink] [raw]

Subject: Re: [PATCH 03/35] x86/entry/32: Load task stack from x86_tss.sp1 in SYSENTER handler

Joerg Roedel <[email protected]> writes:

> From: Joerg Roedel <[email protected]>
>
> We want x86_tss.sp0 point to the entry stack later to use
> it as a trampoline stack for other kernel entry points
> besides SYSENTER.
>
> So store the task stack pointer in x86_tss.sp1, which is
> otherwise unused by the hardware, as Linux doesn't make use
> of Ring 1.

Seems like a hack. Why can't that be stored in a per cpu variable?

-Andi

2018-04-19 00:03:54

by Linus Torvalds

[permalink] [raw]

Subject: Re: [PATCH 03/35] x86/entry/32: Load task stack from x86_tss.sp1 in SYSENTER handler

On Wed, Apr 18, 2018 at 4:26 PM, Andi Kleen <[email protected]> wrote:
>
> Seems like a hack. Why can't that be stored in a per cpu variable?

It *is* a percpu variable - the whole x86_tss structure is percpu.

I guess it could be a different (separate) percpu variable, but might
as well use the space we already have allocated.

Linus

2018-04-19 00:40:04

[permalink] [raw]

Subject: Re: [PATCH 03/35] x86/entry/32: Load task stack from x86_tss.sp1 in SYSENTER handler

On Wed, Apr 18, 2018 at 05:02:02PM -0700, Linus Torvalds wrote:
> On Wed, Apr 18, 2018 at 4:26 PM, Andi Kleen <[email protected]> wrote:
> >
> > Seems like a hack. Why can't that be stored in a per cpu variable?
>
> It *is* a percpu variable - the whole x86_tss structure is percpu.
>
> I guess it could be a different (separate) percpu variable, but might
> as well use the space we already have allocated.

Would be better/cleaner to use a separate variable instead of reusing
x86 structures like this. Who knows what subtle side effects that
may have eventually.

It will be also easier to understand in the code.

-Andi

2018-04-19 00:46:42

by Andy Lutomirski

[permalink] [raw]

Subject: Re: [PATCH 03/35] x86/entry/32: Load task stack from x86_tss.sp1 in SYSENTER handler

On Wed, Apr 18, 2018 at 5:38 PM, Andi Kleen <[email protected]> wrote:
> On Wed, Apr 18, 2018 at 05:02:02PM -0700, Linus Torvalds wrote:
>> On Wed, Apr 18, 2018 at 4:26 PM, Andi Kleen <[email protected]> wrote:
>> >
>> > Seems like a hack. Why can't that be stored in a per cpu variable?
>>
>> It *is* a percpu variable - the whole x86_tss structure is percpu.
>>
>> I guess it could be a different (separate) percpu variable, but might
>> as well use the space we already have allocated.
>
> Would be better/cleaner to use a separate variable instead of reusing
> x86 structures like this. Who knows what subtle side effects that
> may have eventually.

This variable is extremely hot, and it’s used under the same
circumstances as sp0, so sharing a cache line makes sense. And x86_64
works this way.

>
> It will be also easier to understand in the code.

I suppose it could go right before the TSS, but then we have potential
alignment issues. We could also muck with unions to give the field an
alternative, clearer name, I suppose. But this patch should go in
regardless and any cleanups should be done on x86_32 and x86_64
simultaneously, I think.

2018-04-19 09:02:28

by David Laight

[permalink] [raw]

Subject: RE: [PATCH 03/35] x86/entry/32: Load task stack from x86_tss.sp1 in SYSENTER handler

From: Andi Kleen
> Sent: 19 April 2018 01:39
>
> On Wed, Apr 18, 2018 at 05:02:02PM -0700, Linus Torvalds wrote:
> > On Wed, Apr 18, 2018 at 4:26 PM, Andi Kleen <[email protected]> wrote:
> > >
> > > Seems like a hack. Why can't that be stored in a per cpu variable?
> >
> > It *is* a percpu variable - the whole x86_tss structure is percpu.
> >
> > I guess it could be a different (separate) percpu variable, but might
> > as well use the space we already have allocated.
>
> Would be better/cleaner to use a separate variable instead of reusing
> x86 structures like this. Who knows what subtle side effects that
> may have eventually.
>
> It will be also easier to understand in the code.

You could (probably) use an unnamed union in the x86_tss structure
so that it is more obvious that the two variables share a location.

David