2018-01-16 16:51:07

by Joerg Roedel

[permalink] [raw]
Subject: [RFC PATCH 00/16] PTI support for x86-32

From: Joerg Roedel <[email protected]>

Hi,

here is my current WIP code to enable PTI on x86-32. It is
still in a pretty early state, but it successfully boots my
KVM guest with PAE and with legacy paging. The existing PTI
code for x86-64 already prepares a lot of what is needed
for 32 bit too; thanks to all the people involved in its
development for that :)

The patches are split as follows:

- 1-3 contain the entry-code changes to enter and
exit the kernel via the sysenter trampoline stack.

- 4-7 are fixes to get the code to compile on 32 bit
with CONFIG_PAGE_TABLE_ISOLATION=y.

- 8-14 adapt the existing PTI code to work properly
on 32 bit and add the needed parts to 32 bit
page-table code.

- 15 switches PTI on by adding the CR3 switches to
kernel entry/exit.

- 16 enables the Kconfig option for all of X86.

The code has not run on bare metal yet; I'll test that in
the coming days once I set up a 32 bit box again. I also
haven't tested Wine and DosEMU yet, so those might be broken
as well.

With this post I'd like to ask for all kinds of constructive
feedback on the approaches I have taken, and of course on the
many things I broke along the way :)

One of the things that are surely broken is XEN_PV support.
I'd appreciate any help with testing and bugfixing on that
front.

So please review and let me know your thoughts.

Thanks,

Joerg

Joerg Roedel (16):
x86/entry/32: Rename TSS_sysenter_sp0 to TSS_sysenter_stack
x86/entry/32: Enter the kernel via trampoline stack
x86/entry/32: Leave the kernel via the trampoline stack
x86/pti: Define X86_CR3_PTI_PCID_USER_BIT on x86_32
x86/pgtable: Move pgdp kernel/user conversion functions to pgtable.h
x86/mm/ldt: Reserve high address-space range for the LDT
x86/mm: Move two more functions from pgtable_64.h to pgtable.h
x86/pgtable/32: Allocate 8k page-tables when PTI is enabled
x86/mm/pti: Clone CPU_ENTRY_AREA on PMD level on x86_32
x86/mm/pti: Populate valid user pud entries
x86/mm/pgtable: Move pti_set_user_pgd() to pgtable.h
x86/mm/pae: Populate the user page-table with user pgd's
x86/mm/pti: Add an overflow check to pti_clone_pmds()
x86/mm/legacy: Populate the user page-table with user pgd's
x86/entry/32: Switch between kernel and user cr3 on entry/exit
x86/pti: Allow CONFIG_PAGE_TABLE_ISOLATION for x86_32

arch/x86/entry/entry_32.S | 170 +++++++++++++++++++++++++++++---
arch/x86/include/asm/pgtable-2level.h | 3 +
arch/x86/include/asm/pgtable-3level.h | 3 +
arch/x86/include/asm/pgtable.h | 88 +++++++++++++++++
arch/x86/include/asm/pgtable_32_types.h | 5 +-
arch/x86/include/asm/pgtable_64.h | 85 ----------------
arch/x86/include/asm/processor-flags.h | 8 +-
arch/x86/include/asm/switch_to.h | 6 +-
arch/x86/kernel/asm-offsets_32.c | 5 +-
arch/x86/kernel/cpu/common.c | 5 +-
arch/x86/kernel/head_32.S | 23 ++++-
arch/x86/kernel/process.c | 2 -
arch/x86/kernel/process_32.c | 6 ++
arch/x86/mm/pgtable.c | 11 ++-
arch/x86/mm/pti.c | 34 ++++++-
security/Kconfig | 2 +-
16 files changed, 333 insertions(+), 123 deletions(-)

--
2.13.6


2018-01-16 16:48:38

by Joerg Roedel

[permalink] [raw]
Subject: [PATCH 08/16] x86/pgtable/32: Allocate 8k page-tables when PTI is enabled

From: Joerg Roedel <[email protected]>

Allocate a kernel and a user page-table root when PTI is
enabled. Also allocate a full page per root for PAE, because
otherwise the bit to flip in cr3 to switch between them
would be non-constant, which creates a lot of hassle.
Optimizing the PAE allocation size is left for later.
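
As a rough illustration of why the constant bit matters (a sketch
only, the helper names are made up and the real switch happens in
the entry-code asm added later in this series):

  /*
   * With an order-1 (8k-aligned) allocation the kernel root sits in
   * the first 4k and the user root in the second 4k, so switching
   * is a fixed flip of bit PAGE_SHIFT (bit 12) in the cr3 value.
   */
  static inline unsigned long kern_to_user_cr3(unsigned long cr3)
  {
          return cr3 | (1UL << PAGE_SHIFT);
  }

  static inline unsigned long user_to_kern_cr3(unsigned long cr3)
  {
          return cr3 & ~(1UL << PAGE_SHIFT);
  }

If the two roots were only 4k-aligned, the bit to toggle would
depend on where the allocator placed them and the entry code could
not use a constant mask.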

Signed-off-by: Joerg Roedel <[email protected]>
---
arch/x86/kernel/head_32.S | 23 ++++++++++++++++++-----
arch/x86/mm/pgtable.c | 11 ++++++-----
2 files changed, 24 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kernel/head_32.S b/arch/x86/kernel/head_32.S
index c29020907886..fc550559bf58 100644
--- a/arch/x86/kernel/head_32.S
+++ b/arch/x86/kernel/head_32.S
@@ -512,28 +512,41 @@ ENTRY(initial_code)
ENTRY(setup_once_ref)
.long setup_once

+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+#define PGD_ALIGN (2 * PAGE_SIZE)
+#define PTI_USER_PGD_FILL 1024
+#else
+#define PGD_ALIGN (PAGE_SIZE)
+#define PTI_USER_PGD_FILL 0
+#endif
/*
* BSS section
*/
__PAGE_ALIGNED_BSS
- .align PAGE_SIZE
+ .align PGD_ALIGN
#ifdef CONFIG_X86_PAE
.globl initial_pg_pmd
initial_pg_pmd:
.fill 1024*KPMDS,4,0
+ .fill PTI_USER_PGD_FILL,4,0
#else
.globl initial_page_table
initial_page_table:
.fill 1024,4,0
+ .fill PTI_USER_PGD_FILL,4,0
#endif
+ .align PGD_ALIGN
initial_pg_fixmap:
.fill 1024,4,0
-.globl empty_zero_page
-empty_zero_page:
- .fill 4096,1,0
+ .fill PTI_USER_PGD_FILL,4,0
.globl swapper_pg_dir
+ .align PGD_ALIGN
swapper_pg_dir:
.fill 1024,4,0
+ .fill PTI_USER_PGD_FILL,4,0
+.globl empty_zero_page
+empty_zero_page:
+ .fill 4096,1,0
EXPORT_SYMBOL(empty_zero_page)

/*
@@ -542,7 +555,7 @@ EXPORT_SYMBOL(empty_zero_page)
#ifdef CONFIG_X86_PAE
__PAGE_ALIGNED_DATA
/* Page-aligned for the benefit of paravirt? */
- .align PAGE_SIZE
+ .align PGD_ALIGN
ENTRY(initial_page_table)
.long pa(initial_pg_pmd+PGD_IDENT_ATTR),0 /* low identity map */
# if KPMDS == 3
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 004abf9ebf12..48abefd95924 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -313,7 +313,7 @@ static int __init pgd_cache_init(void)
* When PAE kernel is running as a Xen domain, it does not use
* shared kernel pmd. And this requires a whole page for pgd.
*/
- if (!SHARED_KERNEL_PMD)
+ if (static_cpu_has(X86_FEATURE_PTI) || !SHARED_KERNEL_PMD)
return 0;

/*
@@ -337,8 +337,9 @@ static inline pgd_t *_pgd_alloc(void)
* If no SHARED_KERNEL_PMD, PAE kernel is running as a Xen domain.
* We allocate one page for pgd.
*/
- if (!SHARED_KERNEL_PMD)
- return (pgd_t *)__get_free_page(PGALLOC_GFP);
+ if (static_cpu_has(X86_FEATURE_PTI) || !SHARED_KERNEL_PMD)
+ return (pgd_t *)__get_free_pages(PGALLOC_GFP,
+ PGD_ALLOCATION_ORDER);

/*
* Now PAE kernel is not running as a Xen domain. We can allocate
@@ -349,8 +350,8 @@ static inline pgd_t *_pgd_alloc(void)

static inline void _pgd_free(pgd_t *pgd)
{
- if (!SHARED_KERNEL_PMD)
- free_page((unsigned long)pgd);
+ if (static_cpu_has(X86_FEATURE_PTI) || !SHARED_KERNEL_PMD)
+ free_pages((unsigned long)pgd, PGD_ALLOCATION_ORDER);
else
kmem_cache_free(pgd_cache, pgd);
}
--
2.13.6

2018-01-16 16:48:36

by Joerg Roedel

[permalink] [raw]
Subject: [PATCH 14/16] x86/mm/legacy: Populate the user page-table with user pgd's

From: Joerg Roedel <[email protected]>

Also populate the user-space pgd's in the user page-table.

Signed-off-by: Joerg Roedel <[email protected]>
---
arch/x86/include/asm/pgtable-2level.h | 3 +++
1 file changed, 3 insertions(+)

diff --git a/arch/x86/include/asm/pgtable-2level.h b/arch/x86/include/asm/pgtable-2level.h
index 685ffe8a0eaf..d96486d23c58 100644
--- a/arch/x86/include/asm/pgtable-2level.h
+++ b/arch/x86/include/asm/pgtable-2level.h
@@ -19,6 +19,9 @@ static inline void native_set_pte(pte_t *ptep , pte_t pte)

static inline void native_set_pmd(pmd_t *pmdp, pmd_t pmd)
{
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+ pmd.pud.p4d.pgd = pti_set_user_pgd(&pmdp->pud.p4d.pgd, pmd.pud.p4d.pgd);
+#endif
*pmdp = pmd;
}

--
2.13.6

2018-01-16 16:48:53

by Joerg Roedel

[permalink] [raw]
Subject: [PATCH 09/16] x86/mm/pti: Clone CPU_ENTRY_AREA on PMD level on x86_32

From: Joerg Roedel <[email protected]>

Cloning on the P4D level would clone the complete kernel
address space into the user-space page-tables for PAE
kernels. Cloning on PMD level is fine for PAE and legacy
paging.
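
For reference, the coverage per level on 32 bit (back-of-the-envelope
numbers, assuming the default 3G/1G split):

  PAE top level:   4 entries x 1GB  -> the whole kernel is one entry
  PAE PMD level:   2MB per entry    -> the clone stays confined to the
                                       cpu_entry_area range
  Legacy 2-level:  PMD folded into the PGD, 4MB per entry

Cloning a single 1GB entry would therefore expose the complete kernel
mapping to user-space, while PMD-granular cloning only maps the few
entries that are actually needed.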

Signed-off-by: Joerg Roedel <[email protected]>
---
arch/x86/mm/pti.c | 17 +++++++++++++++++
1 file changed, 17 insertions(+)

diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index ce38f165489b..20be21301a59 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -308,6 +308,7 @@ pti_clone_pmds(unsigned long start, unsigned long end, pmdval_t clear)
}
}

+#ifdef CONFIG_X86_64
/*
* Clone a single p4d (i.e. a top-level entry on 4-level systems and a
* next-level entry on 5-level systems.
@@ -322,13 +323,29 @@ static void __init pti_clone_p4d(unsigned long addr)
kernel_p4d = p4d_offset(kernel_pgd, addr);
*user_p4d = *kernel_p4d;
}
+#endif

/*
* Clone the CPU_ENTRY_AREA into the user space visible page table.
*/
static void __init pti_clone_user_shared(void)
{
+#ifdef CONFIG_X86_32
+ /*
+ * On 32 bit PAE systems with 1GB of Kernel address space there is only
+ * one pgd/p4d for the whole kernel. Cloning that would map the whole
+ * address space into the user page-tables, making PTI useless. So clone
+ * the page-table on the PMD level to prevent that.
+ */
+ unsigned long start, end;
+
+ start = CPU_ENTRY_AREA_BASE;
+ end = start + (PAGE_SIZE * CPU_ENTRY_AREA_PAGES);
+
+ pti_clone_pmds(start, end, _PAGE_GLOBAL);
+#else
pti_clone_p4d(CPU_ENTRY_AREA_BASE);
+#endif
}

/*
--
2.13.6

2018-01-16 16:48:55

by Joerg Roedel

[permalink] [raw]
Subject: [PATCH 06/16] x86/mm/ldt: Reserve high address-space range for the LDT

From: Joerg Roedel <[email protected]>

Reserve 2MB/4MB of address space for mapping the LDT to
user-space.

Signed-off-by: Joerg Roedel <[email protected]>
---
arch/x86/include/asm/pgtable_32_types.h | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/pgtable_32_types.h b/arch/x86/include/asm/pgtable_32_types.h
index ce245b0cdfca..3c30a7fcae68 100644
--- a/arch/x86/include/asm/pgtable_32_types.h
+++ b/arch/x86/include/asm/pgtable_32_types.h
@@ -47,9 +47,12 @@ extern bool __vmalloc_start_set; /* set once high_memory is set */
#define CPU_ENTRY_AREA_BASE \
((FIXADDR_START - PAGE_SIZE * (CPU_ENTRY_AREA_PAGES + 1)) & PMD_MASK)

-#define PKMAP_BASE \
+#define LDT_BASE_ADDR \
((CPU_ENTRY_AREA_BASE - PAGE_SIZE) & PMD_MASK)

+#define PKMAP_BASE \
+ ((LDT_BASE_ADDR - PAGE_SIZE) & PMD_MASK)
+
#ifdef CONFIG_HIGHMEM
# define VMALLOC_END (PKMAP_BASE - 2 * PAGE_SIZE)
#else
--
2.13.6

2018-01-16 16:48:56

by Joerg Roedel

[permalink] [raw]
Subject: [PATCH 04/16] x86/pti: Define X86_CR3_PTI_PCID_USER_BIT on x86_32

From: Joerg Roedel <[email protected]>

Move it out of the X86_64-specific processor defines so
that it is visible for 32 bit too.

Signed-off-by: Joerg Roedel <[email protected]>
---
arch/x86/include/asm/processor-flags.h | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/processor-flags.h b/arch/x86/include/asm/processor-flags.h
index 625a52a5594f..02c2cbda4a74 100644
--- a/arch/x86/include/asm/processor-flags.h
+++ b/arch/x86/include/asm/processor-flags.h
@@ -39,10 +39,6 @@
#define CR3_PCID_MASK 0xFFFull
#define CR3_NOFLUSH BIT_ULL(63)

-#ifdef CONFIG_PAGE_TABLE_ISOLATION
-# define X86_CR3_PTI_PCID_USER_BIT 11
-#endif
-
#else
/*
* CR3_ADDR_MASK needs at least bits 31:5 set on PAE systems, and we save
@@ -53,4 +49,8 @@
#define CR3_NOFLUSH 0
#endif

+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+# define X86_CR3_PTI_PCID_USER_BIT 11
+#endif
+
#endif /* _ASM_X86_PROCESSOR_FLAGS_H */
--
2.13.6

2018-01-16 16:48:52

by Joerg Roedel

[permalink] [raw]
Subject: [PATCH 12/16] x86/mm/pae: Populate the user page-table with user pgd's

From: Joerg Roedel <[email protected]>

This is the last missing part of the page-table setup for
PAE before we can add the CR3 switch to the entry code.

Signed-off-by: Joerg Roedel <[email protected]>
---
arch/x86/include/asm/pgtable-3level.h | 3 +++
arch/x86/mm/pti.c | 7 +++++++
2 files changed, 10 insertions(+)

diff --git a/arch/x86/include/asm/pgtable-3level.h b/arch/x86/include/asm/pgtable-3level.h
index bc4af5453802..910f0b35370e 100644
--- a/arch/x86/include/asm/pgtable-3level.h
+++ b/arch/x86/include/asm/pgtable-3level.h
@@ -98,6 +98,9 @@ static inline void native_set_pmd(pmd_t *pmdp, pmd_t pmd)

static inline void native_set_pud(pud_t *pudp, pud_t pud)
{
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+ pud.p4d.pgd = pti_set_user_pgd(&pudp->p4d.pgd, pud.p4d.pgd);
+#endif
set_64bit((unsigned long long *)(pudp), native_pud_val(pud));
}

diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index 6b6bfd13350e..a561b5625d6c 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -122,6 +122,7 @@ pgd_t __pti_set_user_pgd(pgd_t *pgdp, pgd_t pgd)
*/
kernel_to_user_pgdp(pgdp)->pgd = pgd.pgd;

+#ifdef CONFIG_X86_64
/*
* If this is normal user memory, make it NX in the kernel
* pagetables so that, if we somehow screw up and return to
@@ -134,10 +135,16 @@ pgd_t __pti_set_user_pgd(pgd_t *pgdp, pgd_t pgd)
* may execute from it
* - we don't have NX support
* - we're clearing the PGD (i.e. the new pgd is not present).
+ * - We run on a 32 bit kernel. 2-level paging doesn't support NX at
+ * all and PAE paging does not support it on the PGD level. We can
+ * set it in the PMD level there in the future, but that means we
+ * need to unshare the PMDs between the kernel and the user
+ * page-tables.
*/
if ((pgd.pgd & (_PAGE_USER|_PAGE_PRESENT)) == (_PAGE_USER|_PAGE_PRESENT) &&
(__supported_pte_mask & _PAGE_NX))
pgd.pgd |= _PAGE_NX;
+#endif

/* return the copy of the PGD we want the kernel to use: */
return pgd;
--
2.13.6

2018-01-16 16:51:03

by Joerg Roedel

[permalink] [raw]
Subject: [PATCH 13/16] x86/mm/pti: Add an overflow check to pti_clone_pmds()

From: Joerg Roedel <[email protected]>

The addr counter will overflow if we clone the last PMD of
the address space, resulting in an endless loop.

Check for that and bail out of the loop when it happens.
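
The failure mode in a simplified sketch (types and the body of the
loop trimmed down for illustration):

  /*
   * pti_clone_pmds() walks the range in PMD_SIZE steps. If 'end' is
   * at the very top of the address space, the increment past the
   * last PMD wraps 'addr' around to 0, which is still < end, so the
   * loop would never terminate without an explicit check.
   */
  for (addr = start; addr < end; addr += PMD_SIZE) {
          if (addr < start)       /* wrapped past the last PMD */
                  break;
          /* ... clone the PMD covering addr ... */
  }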

Signed-off-by: Joerg Roedel <[email protected]>
---
arch/x86/mm/pti.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index a561b5625d6c..faea5faeddc5 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -293,6 +293,10 @@ pti_clone_pmds(unsigned long start, unsigned long end, pmdval_t clear)
p4d_t *p4d;
pud_t *pud;

+ /* Overflow check */
+ if (addr < start)
+ break;
+
pgd = pgd_offset_k(addr);
if (WARN_ON(pgd_none(*pgd)))
return;
--
2.13.6

2018-01-16 16:51:08

by Joerg Roedel

[permalink] [raw]
Subject: [PATCH 01/16] x86/entry/32: Rename TSS_sysenter_sp0 to TSS_sysenter_stack

From: Joerg Roedel <[email protected]>

The stack address doesn't need to be stored in tss.sp0 if
we switch stacks manually like on sysenter. Rename the offset
so that it still makes sense when we change its location.

Signed-off-by: Joerg Roedel <[email protected]>
---
arch/x86/entry/entry_32.S | 2 +-
arch/x86/kernel/asm-offsets_32.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index a1f28a54f23a..eb8c5615777b 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -401,7 +401,7 @@ ENTRY(xen_sysenter_target)
* 0(%ebp) arg6
*/
ENTRY(entry_SYSENTER_32)
- movl TSS_sysenter_sp0(%esp), %esp
+ movl TSS_sysenter_stack(%esp), %esp
.Lsysenter_past_esp:
pushl $__USER_DS /* pt_regs->ss */
pushl %ebp /* pt_regs->sp (stashed in bp) */
diff --git a/arch/x86/kernel/asm-offsets_32.c b/arch/x86/kernel/asm-offsets_32.c
index fa1261eefa16..654229bac2fc 100644
--- a/arch/x86/kernel/asm-offsets_32.c
+++ b/arch/x86/kernel/asm-offsets_32.c
@@ -47,7 +47,7 @@ void foo(void)
BLANK();

/* Offset from the sysenter stack to tss.sp0 */
- DEFINE(TSS_sysenter_sp0, offsetof(struct cpu_entry_area, tss.x86_tss.sp0) -
+ DEFINE(TSS_sysenter_stack, offsetof(struct cpu_entry_area, tss.x86_tss.sp0) -
offsetofend(struct cpu_entry_area, entry_stack_page.stack));

#ifdef CONFIG_CC_STACKPROTECTOR
--
2.13.6

2018-01-16 16:51:02

by Joerg Roedel

[permalink] [raw]
Subject: [PATCH 07/16] x86/mm: Move two more functions from pgtable_64.h to pgtable.h

From: Joerg Roedel <[email protected]>

These two functions are required for PTI on 32 bit:

* pgdp_maps_userspace()
* pgd_large()

Also re-implement pgdp_maps_userspace() so that it will work
on 64 and 32 bit kernels.
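
A quick calculation of why the old (PAGE_SIZE / 2) check cannot work
on 32 bit (numbers assume the default 3G/1G split):

  2-level: 1024 pgd entries, pgd_index(PAGE_OFFSET) = 0xC0000000 >> 22 = 768
  PAE:        4 pgd entries, pgd_index(PAGE_OFFSET) = 0xC0000000 >> 30 =   3

In neither case does the kernel/user boundary sit at half of the PGD
page, so the check has to be based on KERNEL_PGD_BOUNDARY instead.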

Signed-off-by: Joerg Roedel <[email protected]>
---
arch/x86/include/asm/pgtable.h | 16 ++++++++++++++++
arch/x86/include/asm/pgtable_64.h | 15 ---------------
2 files changed, 16 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 0a9f746cbdc1..abafe4d7fd3e 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1109,6 +1109,22 @@ static inline int pud_write(pud_t pud)
return pud_flags(pud) & _PAGE_RW;
}

+/*
+ * Page table pages are page-aligned. The lower half of the top
+ * level is used for userspace and the top half for the kernel.
+ *
+ * Returns true for parts of the PGD that map userspace and
+ * false for the parts that map the kernel.
+ */
+static inline bool pgdp_maps_userspace(void *__ptr)
+{
+ unsigned long ptr = (unsigned long)__ptr;
+
+ return (((ptr & ~PAGE_MASK) / sizeof(pgd_t)) < KERNEL_PGD_BOUNDARY);
+}
+
+static inline int pgd_large(pgd_t pgd) { return 0; }
+
#ifdef CONFIG_PAGE_TABLE_ISOLATION
/*
* All top-level PAGE_TABLE_ISOLATION page tables are order-1 pages
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index 58d7f10e937d..3c5a73c8bb50 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -131,20 +131,6 @@ static inline pud_t native_pudp_get_and_clear(pud_t *xp)
#endif
}

-/*
- * Page table pages are page-aligned. The lower half of the top
- * level is used for userspace and the top half for the kernel.
- *
- * Returns true for parts of the PGD that map userspace and
- * false for the parts that map the kernel.
- */
-static inline bool pgdp_maps_userspace(void *__ptr)
-{
- unsigned long ptr = (unsigned long)__ptr;
-
- return (ptr & ~PAGE_MASK) < (PAGE_SIZE / 2);
-}
-
#ifdef CONFIG_PAGE_TABLE_ISOLATION
pgd_t __pti_set_user_pgd(pgd_t *pgdp, pgd_t pgd);

@@ -208,7 +194,6 @@ extern void sync_global_pgds(unsigned long start, unsigned long end);
/*
* Level 4 access.
*/
-static inline int pgd_large(pgd_t pgd) { return 0; }
#define mk_kernel_pgd(address) __pgd((address) | _KERNPG_TABLE)

/* PUD - Level3 access */
--
2.13.6

2018-01-16 16:53:07

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 06/16] x86/mm/ldt: Reserve high address-space range for the LDT

On Tue, Jan 16, 2018 at 05:36:49PM +0100, Joerg Roedel wrote:
> From: Joerg Roedel <[email protected]>
>
> Reserve 2MB/4MB of address space for mapping the LDT to
> user-space.

LDT is 64k, we need 2 per CPU, and NR_CPUS <= 64 on 32bit, that gives
64K*2*64=8M > 2M.

2018-01-16 16:55:37

by Joerg Roedel

[permalink] [raw]
Subject: [PATCH 10/16] x86/mm/pti: Populate valid user pud entries

From: Joerg Roedel <[email protected]>

With PAE paging we don't have PGD and P4D levels in the
page-table; instead, the PUD level is the highest one.

In PAE page-tables most bits we usually set with
_KERNPG_TABLE are reserved at the top level, resulting in a
#GP when such an entry is loaded by the processor.

Work around this by populating PUD entries in the user
page-table only with _PAGE_PRESENT set.

I am pretty sure there is a cleaner way to do this, but
until I find it, use this #ifdef solution.

Signed-off-by: Joerg Roedel <[email protected]>
---
arch/x86/mm/pti.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index 20be21301a59..6b6bfd13350e 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -202,8 +202,12 @@ static __init pmd_t *pti_user_pagetable_walk_pmd(unsigned long address)
unsigned long new_pmd_page = __get_free_page(gfp);
if (!new_pmd_page)
return NULL;
-
+#ifdef CONFIG_X86_PAE
+ /* TODO: There must be a cleaner way to do this */
+ set_pud(pud, __pud(_PAGE_PRESENT | __pa(new_pmd_page)));
+#else
set_pud(pud, __pud(_KERNPG_TABLE | __pa(new_pmd_page)));
+#endif
}

return pmd_offset(pud, address);
--
2.13.6

2018-01-16 16:55:35

by Joerg Roedel

[permalink] [raw]
Subject: [PATCH 03/16] x86/entry/32: Leave the kernel via the trampoline stack

From: Joerg Roedel <[email protected]>

Switch back to the trampoline stack before returning to
userspace.

Signed-off-by: Joerg Roedel <[email protected]>
---
arch/x86/entry/entry_32.S | 58 ++++++++++++++++++++++++++++++++++++++++
arch/x86/kernel/asm-offsets_32.c | 1 +
2 files changed, 59 insertions(+)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 5a7bdb73be9f..14018eeb11c3 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -263,6 +263,61 @@
.endm

/*
+ * Switch back from the kernel stack to the entry stack.
+ *
+ * iret_frame > 0 adds code to copy over an iret frame from the old to
+ * the new stack. It also adds a check which bails out if
+ * we are not returning to user-space.
+ *
+ * This macro must not modify eflags when iret_frame == 0.
+ */
+.macro SWITCH_TO_ENTRY_STACK iret_frame=0
+ .if \iret_frame > 0
+ /* Are we returning to userspace? */
+ testb $3, 4(%esp) /* return CS */
+ jz .Lend_\@
+ .endif
+
+ /*
+ * We run with user-%fs already loaded from pt_regs, so we don't
+ * have access to per_cpu data anymore, and there is no swapgs
+ * equivalent on x86_32.
+ * We work around this by loading the kernel-%fs again and
+ * reading the entry stack address from there. Then we restore
+ * the user-%fs and return.
+ */
+ pushl %fs
+ pushl %edi
+
+ /* Re-load kernel-%fs, after that we can use PER_CPU_VAR */
+ movl $(__KERNEL_PERCPU), %edi
+ movl %edi, %fs
+
+ /* Save old stack pointer to copy the return frame over if needed */
+ movl %esp, %edi
+ movl PER_CPU_VAR(cpu_tss_rw + TSS_sp0), %esp
+
+ /* Now we are on the entry stack */
+
+ .if \iret_frame > 0
+ /* Stack frame: ss, esp, eflags, cs, eip, fs, edi */
+ pushl 6*4(%edi) /* ss */
+ pushl 5*4(%edi) /* esp */
+ pushl 4*4(%edi) /* eflags */
+ pushl 3*4(%edi) /* cs */
+ pushl 2*4(%edi) /* eip */
+ .endif
+
+ pushl 4(%edi) /* fs */
+
+ /* Restore user %edi and user %fs */
+ movl (%edi), %edi
+ popl %fs
+
+.Lend_\@:
+.endm
+
+/*
* %eax: prev task
* %edx: next task
*/
@@ -512,6 +567,8 @@ ENTRY(entry_SYSENTER_32)
btr $X86_EFLAGS_IF_BIT, (%esp)
popfl

+ SWITCH_TO_ENTRY_STACK
+
/*
* Return back to the vDSO, which will pop ecx and edx.
* Don't bother with DS and ES (they already contain __USER_DS).
@@ -601,6 +658,7 @@ restore_all:
.Lrestore_nocheck:
RESTORE_REGS 4 # skip orig_eax/error_code
.Lirq_return:
+ SWITCH_TO_ENTRY_STACK iret_frame=1
INTERRUPT_RETURN

.section .fixup, "ax"
diff --git a/arch/x86/kernel/asm-offsets_32.c b/arch/x86/kernel/asm-offsets_32.c
index 7270dd834f4b..b628f898edd2 100644
--- a/arch/x86/kernel/asm-offsets_32.c
+++ b/arch/x86/kernel/asm-offsets_32.c
@@ -50,6 +50,7 @@ void foo(void)
DEFINE(TSS_sysenter_stack, offsetof(struct cpu_entry_area, tss.x86_tss.sp1) -
offsetofend(struct cpu_entry_area, entry_stack_page.stack));

+ OFFSET(TSS_sp0, tss_struct, x86_tss.sp0);
OFFSET(TSS_sp1, tss_struct, x86_tss.sp1);

#ifdef CONFIG_CC_STACKPROTECTOR
--
2.13.6

2018-01-16 16:56:20

by Joerg Roedel

[permalink] [raw]
Subject: [PATCH 15/16] x86/entry/32: Switch between kernel and user cr3 on entry/exit

From: Joerg Roedel <[email protected]>

Add the cr3 switches between the kernel and the user
page-table when PTI is enabled.

Signed-off-by: Joerg Roedel <[email protected]>
---
arch/x86/entry/entry_32.S | 25 ++++++++++++++++++++++++-
1 file changed, 24 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 14018eeb11c3..6a1d9f1e1f89 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -221,6 +221,25 @@
POP_GS_EX
.endm

+#define PTI_SWITCH_MASK (1 << PAGE_SHIFT)
+
+.macro SWITCH_TO_KERNEL_CR3
+ ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
+ movl %cr3, %edi
+ andl $(~PTI_SWITCH_MASK), %edi
+ movl %edi, %cr3
+.Lend_\@:
+.endm
+
+.macro SWITCH_TO_USER_CR3
+ ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
+ mov %cr3, %edi
+ /* Flip the PGD to the user version */
+ orl $(PTI_SWITCH_MASK), %edi
+ mov %edi, %cr3
+.Lend_\@:
+.endm
+
/*
* Switch from the entry-trampoline stack to the kernel stack of the
* running task.
@@ -240,6 +259,7 @@
.endif

pushl %edi
+ SWITCH_TO_KERNEL_CR3
movl %esp, %edi

/*
@@ -309,9 +329,12 @@
.endif

pushl 4(%edi) /* fs */
+ pushl (%edi) /* edi */
+
+ SWITCH_TO_USER_CR3

/* Restore user %edi and user %fs */
- movl (%edi), %edi
+ popl %edi
popl %fs

.Lend_\@:
--
2.13.6

2018-01-16 16:56:18

by Joerg Roedel

[permalink] [raw]
Subject: [PATCH 05/16] x86/pgtable: Move pgdp kernel/user conversion functions to pgtable.h

From: Joerg Roedel <[email protected]>

Make them available on 32 bit and keep clone_pgd_range() happy.

Signed-off-by: Joerg Roedel <[email protected]>
---
arch/x86/include/asm/pgtable.h | 49 +++++++++++++++++++++++++++++++++++++++
arch/x86/include/asm/pgtable_64.h | 49 ---------------------------------------
2 files changed, 49 insertions(+), 49 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index e42b8943cb1a..0a9f746cbdc1 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1109,6 +1109,55 @@ static inline int pud_write(pud_t pud)
return pud_flags(pud) & _PAGE_RW;
}

+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+/*
+ * All top-level PAGE_TABLE_ISOLATION page tables are order-1 pages
+ * (8k-aligned and 8k in size). The kernel one is at the beginning 4k and
+ * the user one is in the last 4k. To switch between them, you
+ * just need to flip the 12th bit in their addresses.
+ */
+#define PTI_PGTABLE_SWITCH_BIT PAGE_SHIFT
+
+/*
+ * This generates better code than the inline assembly in
+ * __set_bit().
+ */
+static inline void *ptr_set_bit(void *ptr, int bit)
+{
+ unsigned long __ptr = (unsigned long)ptr;
+
+ __ptr |= BIT(bit);
+ return (void *)__ptr;
+}
+static inline void *ptr_clear_bit(void *ptr, int bit)
+{
+ unsigned long __ptr = (unsigned long)ptr;
+
+ __ptr &= ~BIT(bit);
+ return (void *)__ptr;
+}
+
+static inline pgd_t *kernel_to_user_pgdp(pgd_t *pgdp)
+{
+ return ptr_set_bit(pgdp, PTI_PGTABLE_SWITCH_BIT);
+}
+
+static inline pgd_t *user_to_kernel_pgdp(pgd_t *pgdp)
+{
+ return ptr_clear_bit(pgdp, PTI_PGTABLE_SWITCH_BIT);
+}
+
+static inline p4d_t *kernel_to_user_p4dp(p4d_t *p4dp)
+{
+ return ptr_set_bit(p4dp, PTI_PGTABLE_SWITCH_BIT);
+}
+
+static inline p4d_t *user_to_kernel_p4dp(p4d_t *p4dp)
+{
+ return ptr_clear_bit(p4dp, PTI_PGTABLE_SWITCH_BIT);
+}
+#endif /* CONFIG_PAGE_TABLE_ISOLATION */
+
/*
* clone_pgd_range(pgd_t *dst, pgd_t *src, int count);
*
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index 81462e9a34f6..58d7f10e937d 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -131,55 +131,6 @@ static inline pud_t native_pudp_get_and_clear(pud_t *xp)
#endif
}

-#ifdef CONFIG_PAGE_TABLE_ISOLATION
-/*
- * All top-level PAGE_TABLE_ISOLATION page tables are order-1 pages
- * (8k-aligned and 8k in size). The kernel one is at the beginning 4k and
- * the user one is in the last 4k. To switch between them, you
- * just need to flip the 12th bit in their addresses.
- */
-#define PTI_PGTABLE_SWITCH_BIT PAGE_SHIFT
-
-/*
- * This generates better code than the inline assembly in
- * __set_bit().
- */
-static inline void *ptr_set_bit(void *ptr, int bit)
-{
- unsigned long __ptr = (unsigned long)ptr;
-
- __ptr |= BIT(bit);
- return (void *)__ptr;
-}
-static inline void *ptr_clear_bit(void *ptr, int bit)
-{
- unsigned long __ptr = (unsigned long)ptr;
-
- __ptr &= ~BIT(bit);
- return (void *)__ptr;
-}
-
-static inline pgd_t *kernel_to_user_pgdp(pgd_t *pgdp)
-{
- return ptr_set_bit(pgdp, PTI_PGTABLE_SWITCH_BIT);
-}
-
-static inline pgd_t *user_to_kernel_pgdp(pgd_t *pgdp)
-{
- return ptr_clear_bit(pgdp, PTI_PGTABLE_SWITCH_BIT);
-}
-
-static inline p4d_t *kernel_to_user_p4dp(p4d_t *p4dp)
-{
- return ptr_set_bit(p4dp, PTI_PGTABLE_SWITCH_BIT);
-}
-
-static inline p4d_t *user_to_kernel_p4dp(p4d_t *p4dp)
-{
- return ptr_clear_bit(p4dp, PTI_PGTABLE_SWITCH_BIT);
-}
-#endif /* CONFIG_PAGE_TABLE_ISOLATION */
-
/*
* Page table pages are page-aligned. The lower half of the top
* level is used for userspace and the top half for the kernel.
--
2.13.6

2018-01-16 16:57:39

by Joerg Roedel

[permalink] [raw]
Subject: [PATCH 02/16] x86/entry/32: Enter the kernel via trampoline stack

From: Joerg Roedel <[email protected]>

Use the sysenter stack as a trampoline stack to enter the
kernel. The sysenter stack is already in the cpu_entry_area
and will be mapped to userspace when PTI is enabled.

Signed-off-by: Joerg Roedel <[email protected]>
---
arch/x86/entry/entry_32.S | 89 +++++++++++++++++++++++++++++++++++-----
arch/x86/include/asm/switch_to.h | 6 +--
arch/x86/kernel/asm-offsets_32.c | 4 +-
arch/x86/kernel/cpu/common.c | 5 ++-
arch/x86/kernel/process.c | 2 -
arch/x86/kernel/process_32.c | 6 +++
6 files changed, 91 insertions(+), 21 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index eb8c5615777b..5a7bdb73be9f 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -222,6 +222,47 @@
.endm

/*
+ * Switch from the entry-trampoline stack to the kernel stack of the
+ * running task.
+ *
+ * nr_regs is the number of dwords to push from the entry stack to the
+ * task stack. If it is > 0 it expects an irq frame at the bottom of the
+ * stack.
+ *
+ * If check_user != 0 it will add a check to only switch stacks if the
+ * kernel entry was from user-space.
+ */
+.macro SWITCH_TO_KERNEL_STACK nr_regs=0 check_user=0
+
+ .if \check_user > 0 && \nr_regs > 0
+ testb $3, (\nr_regs - 4)*4(%esp) /* CS */
+ jz .Lend_\@
+ .endif
+
+ pushl %edi
+ movl %esp, %edi
+
+ /*
+ * TSS_sysenter_stack is the offset from the bottom of the
+ * entry-stack
+ */
+ movl TSS_sysenter_stack + ((\nr_regs + 1) * 4)(%esp), %esp
+
+ /* Copy the registers over */
+ .if \nr_regs > 0
+ i = 0
+ .rept \nr_regs
+ pushl (\nr_regs - i) * 4(%edi)
+ i = i + 1
+ .endr
+ .endif
+
+ mov (%edi), %edi
+
+.Lend_\@:
+.endm
+
+/*
* %eax: prev task
* %edx: next task
*/
@@ -401,7 +442,9 @@ ENTRY(xen_sysenter_target)
* 0(%ebp) arg6
*/
ENTRY(entry_SYSENTER_32)
- movl TSS_sysenter_stack(%esp), %esp
+ /* Kernel stack is empty */
+ SWITCH_TO_KERNEL_STACK
+
.Lsysenter_past_esp:
pushl $__USER_DS /* pt_regs->ss */
pushl %ebp /* pt_regs->sp (stashed in bp) */
@@ -521,6 +564,10 @@ ENDPROC(entry_SYSENTER_32)
ENTRY(entry_INT80_32)
ASM_CLAC
pushl %eax /* pt_regs->orig_ax */
+
+ /* Stack layout: ss, esp, eflags, cs, eip, orig_eax */
+ SWITCH_TO_KERNEL_STACK nr_regs=6 check_user=1
+
SAVE_ALL pt_regs_ax=$-ENOSYS /* save rest */

/*
@@ -655,6 +702,10 @@ END(irq_entries_start)
common_interrupt:
ASM_CLAC
addl $-0x80, (%esp) /* Adjust vector into the [-256, -1] range */
+
+ /* Stack layout: ss, esp, eflags, cs, eip, vector */
+ SWITCH_TO_KERNEL_STACK nr_regs=6 check_user=1
+
SAVE_ALL
ENCODE_FRAME_POINTER
TRACE_IRQS_OFF
@@ -663,16 +714,17 @@ common_interrupt:
jmp ret_from_intr
ENDPROC(common_interrupt)

-#define BUILD_INTERRUPT3(name, nr, fn) \
-ENTRY(name) \
- ASM_CLAC; \
- pushl $~(nr); \
- SAVE_ALL; \
- ENCODE_FRAME_POINTER; \
- TRACE_IRQS_OFF \
- movl %esp, %eax; \
- call fn; \
- jmp ret_from_intr; \
+#define BUILD_INTERRUPT3(name, nr, fn) \
+ENTRY(name) \
+ ASM_CLAC; \
+ pushl $~(nr); \
+ SWITCH_TO_KERNEL_STACK nr_regs=6 check_user=1; \
+ SAVE_ALL; \
+ ENCODE_FRAME_POINTER; \
+ TRACE_IRQS_OFF \
+ movl %esp, %eax; \
+ call fn; \
+ jmp ret_from_intr; \
ENDPROC(name)

#define BUILD_INTERRUPT(name, nr) \
@@ -893,6 +945,9 @@ ENTRY(page_fault)
END(page_fault)

common_exception:
+ /* Stack layout: ss, esp, eflags, cs, eip, error_code, handler */
+ SWITCH_TO_KERNEL_STACK nr_regs=7 check_user=1
+
/* the function address is in %gs's slot on the stack */
pushl %fs
pushl %es
@@ -936,6 +991,10 @@ ENTRY(debug)
*/
ASM_CLAC
pushl $-1 # mark this as an int
+
+ /* Stack layout: ss, esp, eflags, cs, eip, $-1 */
+ SWITCH_TO_KERNEL_STACK nr_regs=6 check_user=1
+
SAVE_ALL
ENCODE_FRAME_POINTER
xorl %edx, %edx # error code 0
@@ -971,6 +1030,10 @@ END(debug)
*/
ENTRY(nmi)
ASM_CLAC
+
+ /* Stack layout: ss, esp, eflags, cs, eip */
+ SWITCH_TO_KERNEL_STACK nr_regs=5 check_user=1
+
#ifdef CONFIG_X86_ESPFIX32
pushl %eax
movl %ss, %eax
@@ -1034,6 +1097,10 @@ END(nmi)
ENTRY(int3)
ASM_CLAC
pushl $-1 # mark this as an int
+
+ /* Stack layout: ss, esp, eflags, cs, eip, vector */
+ SWITCH_TO_KERNEL_STACK nr_regs=6 check_user=1
+
SAVE_ALL
ENCODE_FRAME_POINTER
TRACE_IRQS_OFF
diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h
index eb5f7999a893..20e5f7ab8260 100644
--- a/arch/x86/include/asm/switch_to.h
+++ b/arch/x86/include/asm/switch_to.h
@@ -89,13 +89,9 @@ static inline void refresh_sysenter_cs(struct thread_struct *thread)
/* This is used when switching tasks or entering/exiting vm86 mode. */
static inline void update_sp0(struct task_struct *task)
{
- /* On x86_64, sp0 always points to the entry trampoline stack, which is constant: */
-#ifdef CONFIG_X86_32
- load_sp0(task->thread.sp0);
-#else
+ /* sp0 always points to the entry trampoline stack, which is constant: */
if (static_cpu_has(X86_FEATURE_XENPV))
load_sp0(task_top_of_stack(task));
-#endif
}

#endif /* _ASM_X86_SWITCH_TO_H */
diff --git a/arch/x86/kernel/asm-offsets_32.c b/arch/x86/kernel/asm-offsets_32.c
index 654229bac2fc..7270dd834f4b 100644
--- a/arch/x86/kernel/asm-offsets_32.c
+++ b/arch/x86/kernel/asm-offsets_32.c
@@ -47,9 +47,11 @@ void foo(void)
BLANK();

/* Offset from the sysenter stack to tss.sp0 */
- DEFINE(TSS_sysenter_stack, offsetof(struct cpu_entry_area, tss.x86_tss.sp0) -
+ DEFINE(TSS_sysenter_stack, offsetof(struct cpu_entry_area, tss.x86_tss.sp1) -
offsetofend(struct cpu_entry_area, entry_stack_page.stack));

+ OFFSET(TSS_sp1, tss_struct, x86_tss.sp1);
+
#ifdef CONFIG_CC_STACKPROTECTOR
BLANK();
OFFSET(stack_canary_offset, stack_canary, canary);
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index ef29ad001991..20a71c914e59 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1649,11 +1649,12 @@ void cpu_init(void)
enter_lazy_tlb(&init_mm, curr);

/*
- * Initialize the TSS. Don't bother initializing sp0, as the initial
- * task never enters user mode.
+ * Initialize the TSS. sp0 points to the entry trampoline stack
+ * regardless of what task is running.
*/
set_tss_desc(cpu, &get_cpu_entry_area(cpu)->tss.x86_tss);
load_TR_desc();
+ load_sp0((unsigned long)(cpu_entry_stack(cpu) + 1));

load_mm_ldt(&init_mm);

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 832a6acd730f..a9950946b263 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -57,14 +57,12 @@ __visible DEFINE_PER_CPU_PAGE_ALIGNED(struct tss_struct, cpu_tss_rw) = {
*/
.sp0 = (1UL << (BITS_PER_LONG-1)) + 1,

-#ifdef CONFIG_X86_64
/*
* .sp1 is cpu_current_top_of_stack. The init task never
* runs user code, but cpu_current_top_of_stack should still
* be well defined before the first context switch.
*/
.sp1 = TOP_OF_INIT_STACK,
-#endif

#ifdef CONFIG_X86_32
.ss0 = __KERNEL_DS,
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 5224c6099184..452eeac00b80 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -292,6 +292,12 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
this_cpu_write(cpu_current_top_of_stack,
(unsigned long)task_stack_page(next_p) +
THREAD_SIZE);
+ /*
+ * TODO: Find a way to let cpu_current_top_of_stack point to
+ * cpu_tss_rw.x86_tss.sp1. Doing so now results in stack corruption with
+ * iret exceptions.
+ */
+ this_cpu_write(cpu_tss_rw.x86_tss.sp1, next_p->thread.sp0);

/*
* Restore %gs if needed (which is common)
--
2.13.6

2018-01-16 16:57:40

by Joerg Roedel

[permalink] [raw]
Subject: [PATCH 11/16] x86/mm/pgtable: Move pti_set_user_pgd() to pgtable.h

From: Joerg Roedel <[email protected]>

There it is also usable from 32 bit code.

Signed-off-by: Joerg Roedel <[email protected]>
---
arch/x86/include/asm/pgtable.h | 23 +++++++++++++++++++++++
arch/x86/include/asm/pgtable_64.h | 21 ---------------------
2 files changed, 23 insertions(+), 21 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index abafe4d7fd3e..248721971532 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -618,8 +618,31 @@ static inline int is_new_memtype_allowed(u64 paddr, unsigned long size,

pmd_t *populate_extra_pmd(unsigned long vaddr);
pte_t *populate_extra_pte(unsigned long vaddr);
+
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+pgd_t __pti_set_user_pgd(pgd_t *pgdp, pgd_t pgd);
+
+/*
+ * Take a PGD location (pgdp) and a pgd value that needs to be set there.
+ * Populates the user and returns the resulting PGD that must be set in
+ * the kernel copy of the page tables.
+ */
+static inline pgd_t pti_set_user_pgd(pgd_t *pgdp, pgd_t pgd)
+{
+ if (!static_cpu_has(X86_FEATURE_PTI))
+ return pgd;
+ return __pti_set_user_pgd(pgdp, pgd);
+}
+#else /* CONFIG_PAGE_TABLE_ISOLATION */
+static inline pgd_t pti_set_user_pgd(pgd_t *pgdp, pgd_t pgd)
+{
+ return pgd;
+}
+#endif /* CONFIG_PAGE_TABLE_ISOLATION */
+
#endif /* __ASSEMBLY__ */

+
#ifdef CONFIG_X86_32
# include <asm/pgtable_32.h>
#else
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index 3c5a73c8bb50..50a02a32a0b3 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -131,27 +131,6 @@ static inline pud_t native_pudp_get_and_clear(pud_t *xp)
#endif
}

-#ifdef CONFIG_PAGE_TABLE_ISOLATION
-pgd_t __pti_set_user_pgd(pgd_t *pgdp, pgd_t pgd);
-
-/*
- * Take a PGD location (pgdp) and a pgd value that needs to be set there.
- * Populates the user and returns the resulting PGD that must be set in
- * the kernel copy of the page tables.
- */
-static inline pgd_t pti_set_user_pgd(pgd_t *pgdp, pgd_t pgd)
-{
- if (!static_cpu_has(X86_FEATURE_PTI))
- return pgd;
- return __pti_set_user_pgd(pgdp, pgd);
-}
-#else
-static inline pgd_t pti_set_user_pgd(pgd_t *pgdp, pgd_t pgd)
-{
- return pgd;
-}
-#endif
-
static inline void native_set_p4d(p4d_t *p4dp, p4d_t p4d)
{
#if defined(CONFIG_PAGE_TABLE_ISOLATION) && !defined(CONFIG_X86_5LEVEL)
--
2.13.6

2018-01-16 16:57:44

by Joerg Roedel

[permalink] [raw]
Subject: [PATCH 16/16] x86/pti: Allow CONFIG_PAGE_TABLE_ISOLATION for x86_32

From: Joerg Roedel <[email protected]>

Allow PTI to be compiled on x86_32.

Signed-off-by: Joerg Roedel <[email protected]>
---
security/Kconfig | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/security/Kconfig b/security/Kconfig
index b0cb9a5f9448..93d85fda0f54 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -57,7 +57,7 @@ config SECURITY_NETWORK
config PAGE_TABLE_ISOLATION
bool "Remove the kernel mapping in user mode"
default y
- depends on X86_64 && !UML
+ depends on X86 && !UML
help
This feature reduces the number of hardware side channels by
ensuring that the majority of kernel addresses are not mapped
--
2.13.6

2018-01-16 17:13:46

by Joerg Roedel

[permalink] [raw]
Subject: Re: [PATCH 06/16] x86/mm/ldt: Reserve high address-space range for the LDT

Hi Peter,

On Tue, Jan 16, 2018 at 05:52:13PM +0100, Peter Zijlstra wrote:
> On Tue, Jan 16, 2018 at 05:36:49PM +0100, Joerg Roedel wrote:
> > From: Joerg Roedel <[email protected]>
> >
> > Reserve 2MB/4MB of address space for mapping the LDT to
> > user-space.
>
> LDT is 64k, we need 2 per CPU, and NR_CPUS <= 64 on 32bit, that gives
> 64K*2*64=8M > 2M.

Thanks, I'll fix that in the next version.


Joerg

2018-01-16 17:31:53

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 06/16] x86/mm/ldt: Reserve high address-space range for the LDT

On Tue, Jan 16, 2018 at 06:13:43PM +0100, Joerg Roedel wrote:
> Hi Peter,
>
> On Tue, Jan 16, 2018 at 05:52:13PM +0100, Peter Zijlstra wrote:
> > On Tue, Jan 16, 2018 at 05:36:49PM +0100, Joerg Roedel wrote:
> > > From: Joerg Roedel <[email protected]>
> > >
> > > Reserve 2MB/4MB of address space for mapping the LDT to
> > > user-space.
> >
> > LDT is 64k, we need 2 per CPU, and NR_CPUS <= 64 on 32bit, that gives
> > 64K*2*64=8M > 2M.
>
> Thanks, I'll fix that in the next version.

Just lower the max SMP setting until it fits or something. 32bit is too
address-space starved for lots of CPUs in any case; 64 CPUs on 32bit is
absolutely insane.

2018-01-16 17:35:06

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH 06/16] x86/mm/ldt: Reserve high address-space range for the LDT

On 01/16/2018 12:31 PM, Peter Zijlstra wrote:
> On Tue, Jan 16, 2018 at 06:13:43PM +0100, Joerg Roedel wrote:
>> Hi Peter,
>>
>> On Tue, Jan 16, 2018 at 05:52:13PM +0100, Peter Zijlstra wrote:
>>> On Tue, Jan 16, 2018 at 05:36:49PM +0100, Joerg Roedel wrote:
>>>> From: Joerg Roedel <[email protected]>
>>>>
>>>> Reserve 2MB/4MB of address space for mapping the LDT to
>>>> user-space.
>>> LDT is 64k, we need 2 per CPU, and NR_CPUS <= 64 on 32bit, that gives
>>> 64K*2*64=8M > 2M.
>> Thanks, I'll fix that in the next version.
> Just lower the max SMP setting until it fits or something. 32bit is too
> address space starved for lots of CPU in any case, 64 CPUs on 32bit is
> absolutely insane.

Maybe we can just scale the amount of reserved space according to the
current NR_CPUS setting. In this way, we won't waste more memory than is
necessary.
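
For example, something like this could work (the macro names and the
PMD rounding are made up for illustration, based on the 64k-per-LDT
math quoted above):

  /* Two 64k LDT copies per possible CPU, rounded up to a PMD boundary */
  #define LDT_PER_CPU_SIZE        (2 * 64 * 1024)
  #define LDT_MAP_SIZE            ALIGN(NR_CPUS * LDT_PER_CPU_SIZE, PMD_SIZE)

  #define LDT_BASE_ADDR \
          ((CPU_ENTRY_AREA_BASE - LDT_MAP_SIZE) & PMD_MASK)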

2018-01-16 18:03:11

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 07/16] x86/mm: Move two more functions from pgtable_64.h to pgtable.h

On 01/16/2018 08:36 AM, Joerg Roedel wrote:
> +/*
> + * Page table pages are page-aligned. The lower half of the top
> + * level is used for userspace and the top half for the kernel.
> + *
> + * Returns true for parts of the PGD that map userspace and
> + * false for the parts that map the kernel.
> + */
> +static inline bool pgdp_maps_userspace(void *__ptr)
> +{
> + unsigned long ptr = (unsigned long)__ptr;
> +
> + return (((ptr & ~PAGE_MASK) / sizeof(pgd_t)) < KERNEL_PGD_BOUNDARY);
> +}

One of the reasons to implement it the other way:

- return (ptr & ~PAGE_MASK) < (PAGE_SIZE / 2);

is that the compiler can do this all quickly. KERNEL_PGD_BOUNDARY
depends on PAGE_OFFSET which depends on a variable. IOW, the compiler
can't do it.

How much worse is the code that this generates?

2018-01-16 18:06:52

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 10/16] x86/mm/pti: Populate valid user pud entries

On 01/16/2018 08:36 AM, Joerg Roedel wrote:
>
> In PAE page-tables at the top-level most bits we usually set
> with _KERNPG_TABLE are reserved, resulting in a #GP when
> they are loaded by the processor.

Can you save me the trip to the SDM and remind me which bits actually
cause trouble here?

2018-01-16 18:11:16

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 12/16] x86/mm/pae: Populate the user page-table with user pgd's

On 01/16/2018 08:36 AM, Joerg Roedel wrote:
> +#ifdef CONFIG_X86_64
> /*
> * If this is normal user memory, make it NX in the kernel
> * pagetables so that, if we somehow screw up and return to
> @@ -134,10 +135,16 @@ pgd_t __pti_set_user_pgd(pgd_t *pgdp, pgd_t pgd)
> * may execute from it
> * - we don't have NX support
> * - we're clearing the PGD (i.e. the new pgd is not present).
> + * - We run on a 32 bit kernel. 2-level paging doesn't support NX at
> + * all and PAE paging does not support it on the PGD level. We can
> + * set it in the PMD level there in the future, but that means we
> + * need to unshare the PMDs between the kernel and the user
> + * page-tables.
> */
> if ((pgd.pgd & (_PAGE_USER|_PAGE_PRESENT)) == (_PAGE_USER|_PAGE_PRESENT) &&
> (__supported_pte_mask & _PAGE_NX))
> pgd.pgd |= _PAGE_NX;
> +#endif

Ugh. The ghosts of PAE have come back to haunt us.

Could we do:

static inline bool pgd_supports_nx(unsigned long)
{
#ifdef CONFIG_X86_64
return (__supported_pte_mask & _PAGE_NX);
#else
/* No 32-bit page tables support NX at PGD level */
return 0;
#endif
}

Nobody will ever spot the #ifdef the way you laid it out.

2018-01-16 18:14:21

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC PATCH 00/16] PTI support for x86-32

Joerg,

Very cool!

I really appreciate you putting this together. I don't see any real
showstoppers or things that I think will *break* 64-bit. I just hope
that we can merge this _slowly_ in case it breaks 64-bit along the way.

I didn't look at the assembly in too much detail.

2018-01-16 18:36:03

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 01/16] x86/entry/32: Rename TSS_sysenter_sp0 to TSS_sysenter_stack

On Tue, 16 Jan 2018, Joerg Roedel wrote:

> From: Joerg Roedel <[email protected]>
>
> The stack addresss doesn't need to be stored in tss.sp0 if
> we switch manually like on sysenter. Rename the offset so
> that it still makes sense when we its location.

-ENOSENTENCE

Other than that. Makes sense.

> Signed-off-by: Joerg Roedel <[email protected]>
> ---
> arch/x86/entry/entry_32.S | 2 +-
> arch/x86/kernel/asm-offsets_32.c | 2 +-
> 2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
> index a1f28a54f23a..eb8c5615777b 100644
> --- a/arch/x86/entry/entry_32.S
> +++ b/arch/x86/entry/entry_32.S
> @@ -401,7 +401,7 @@ ENTRY(xen_sysenter_target)
> * 0(%ebp) arg6
> */
> ENTRY(entry_SYSENTER_32)
> - movl TSS_sysenter_sp0(%esp), %esp
> + movl TSS_sysenter_stack(%esp), %esp
> .Lsysenter_past_esp:
> pushl $__USER_DS /* pt_regs->ss */
> pushl %ebp /* pt_regs->sp (stashed in bp) */
> diff --git a/arch/x86/kernel/asm-offsets_32.c b/arch/x86/kernel/asm-offsets_32.c
> index fa1261eefa16..654229bac2fc 100644
> --- a/arch/x86/kernel/asm-offsets_32.c
> +++ b/arch/x86/kernel/asm-offsets_32.c
> @@ -47,7 +47,7 @@ void foo(void)
> BLANK();
>
> /* Offset from the sysenter stack to tss.sp0 */
> - DEFINE(TSS_sysenter_sp0, offsetof(struct cpu_entry_area, tss.x86_tss.sp0) -
> + DEFINE(TSS_sysenter_stack, offsetof(struct cpu_entry_area, tss.x86_tss.sp0) -
> offsetofend(struct cpu_entry_area, entry_stack_page.stack));
>
> #ifdef CONFIG_CC_STACKPROTECTOR
> --
> 2.13.6
>
>

2018-01-16 18:59:04

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC PATCH 00/16] PTI support for x86-32

On Tue, Jan 16, 2018 at 8:36 AM, Joerg Roedel <[email protected]> wrote:
>
> here is my current WIP code to enable PTI on x86-32. It is
> still in a pretty early state, but it successfully boots my
> KVM guest with PAE and with legacy paging. The existing PTI
> code for x86-64 already prepares a lot of the stuff needed
> for 32 bit too, thanks for that to all the people involved
> in its development :)

Yes, I'm very happy to see that this is actually not nearly as bad as
I feared it might be.

Some of those #ifdef's in the PTI code you added might want more
commentary about what the exact differences are. And maybe they could
be done more cleanly with some abstraction. But nothing looked
_horrible_.

> The code has not run on bare-metal yet, I'll test that in
> the next days once I setup a 32 bit box again. I also havn't
> tested Wine and DosEMU yet, so this might also be broken.

.. and please run all the segment and syscall selfchecks that Andy has written.

But yes, checking bare metal, and checking the "odd" applications like
Wine and dosemu (and kvm etc) within the PTI kernel is certainly a
good idea.

> One of the things that are surely broken is XEN_PV support.
> I'd appreciate any help with testing and bugfixing on that
> front.

Xen PV and PTI don't work together even on x86-64 afaik, the Xen
people apparently felt it wasn't worth it. See the

if (hypervisor_is_type(X86_HYPER_XEN_PV)) {
pti_print_if_insecure("disabled on XEN PV.");
return;
}

in pti_check_boottime_disable().

Linus

2018-01-16 19:02:31

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC PATCH 00/16] PTI support for x86-32

On 01/16/2018 10:59 AM, Linus Torvalds wrote:
>> The code has not run on bare-metal yet, I'll test that in
>> the next days once I setup a 32 bit box again. I also havn't
>> tested Wine and DosEMU yet, so this might also be broken.
> .. and please run all the segment and syscall selfchecks that Andy has written.
>
> But yes, checking bare metal, and checking the "odd" applications like
> Wine and dosemu (and kvm etc) within the PTI kernel is certainly a
> good idea.

I tried to document a list of the "gotchas" that tripped us up during
the 64-bit effort under "Testing":

> https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?h=x86/pti&id=01c9b17bf673b05bb401b76ec763e9730ccf1376

NMIs were a biggie too.

2018-01-16 19:11:09

by Joerg Roedel

[permalink] [raw]
Subject: Re: [PATCH 07/16] x86/mm: Move two more functions from pgtable_64.h to pgtable.h

On Tue, Jan 16, 2018 at 10:03:09AM -0800, Dave Hansen wrote:
> On 01/16/2018 08:36 AM, Joerg Roedel wrote:
> > + return (((ptr & ~PAGE_MASK) / sizeof(pgd_t)) < KERNEL_PGD_BOUNDARY);
> > +}
>
> One of the reasons to implement it the other way:
>
> - return (ptr & ~PAGE_MASK) < (PAGE_SIZE / 2);
>
> is that the compiler can do this all quickly. KERNEL_PGD_BOUNDARY
> depends on PAGE_OFFSET which depends on a variable. IOW, the compiler
> can't do it.
>
> How much worse is the code that this generates?

I haven't looked at the actual code this generates, but the
(PAGE_SIZE / 2) comparison doesn't work on 32 bit where the address
space is not always evenly split. I'll look into a better way to check
this.

Thanks,

Joerg

2018-01-16 19:31:38

by Andrew Cooper

[permalink] [raw]
Subject: Re: [RFC PATCH 00/16] PTI support for x86-32

On 16/01/18 18:59, Linus Torvalds wrote:
> On Tue, Jan 16, 2018 at 8:36 AM, Joerg Roedel <[email protected]> wrote:
>> One of the things that are surely broken is XEN_PV support.
>> I'd appreciate any help with testing and bugfixing on that
>> front.
> Xen PV and PTI don't work together even on x86-64 afaik, the Xen
> people apparently felt it wasn't worth it. See the
>
> if (hypervisor_is_type(X86_HYPER_XEN_PV)) {
> pti_print_if_insecure("disabled on XEN PV.");
> return;
> }

64bit PV guests under Xen already have split pagetables.  It is a base
and necessary part of the ABI, because segment limits stopped working in
64bit.

32bit PV guests aren't split, but by far the most efficient way of doing
this is to introduce a new enlightenment and have Xen switch all this
stuff (and IBRS, for that matter) on behalf of the guest kernel on
context switch.

~Andrew

2018-01-16 19:35:13

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 07/16] x86/mm: Move two more functions from pgtable_64.h to pgtable.h

On Tue, 16 Jan 2018, Joerg Roedel wrote:

> On Tue, Jan 16, 2018 at 10:03:09AM -0800, Dave Hansen wrote:
> > On 01/16/2018 08:36 AM, Joerg Roedel wrote:
> > > + return (((ptr & ~PAGE_MASK) / sizeof(pgd_t)) < KERNEL_PGD_BOUNDARY);
> > > +}
> >
> > One of the reasons to implement it the other way:
> >
> > - return (ptr & ~PAGE_MASK) < (PAGE_SIZE / 2);
> >
> > is that the compiler can do this all quickly. KERNEL_PGD_BOUNDARY
> > depends on PAGE_OFFSET which depends on a variable. IOW, the compiler
> > can't do it.
> >
> > How much worse is the code that this generates?
>
> I havn't looked at the actual code this generates, but the
> (PAGE_SIZE / 2) comparison doesn't work on 32 bit where the address
> space is not always evenly split. I'll look into a better way to check
> this.

It should be trivial enough to do

return (ptr & ~PAGE_MASK) < PGD_SPLIT_SIZE;

and define it as PAGE_SIZE/2 for 64bit; for PAE make it depend on the
configured address space split.
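
For PAE that could for example look like this (the 32 bit definition
is only a sketch; pgd_index(PAGE_OFFSET) is a compile time constant
there):

  #ifdef CONFIG_X86_64
  # define PGD_SPLIT_SIZE         (PAGE_SIZE / 2)
  #else
  # define PGD_SPLIT_SIZE         (pgd_index(PAGE_OFFSET) * sizeof(pgd_t))
  #endif

  static inline bool pgdp_maps_userspace(void *__ptr)
  {
          unsigned long ptr = (unsigned long)__ptr;

          return (ptr & ~PAGE_MASK) < PGD_SPLIT_SIZE;
  }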

Thanks,

tglx

2018-01-16 19:41:15

by Joerg Roedel

[permalink] [raw]
Subject: Re: [PATCH 10/16] x86/mm/pti: Populate valid user pud entries

On Tue, Jan 16, 2018 at 10:06:48AM -0800, Dave Hansen wrote:
> On 01/16/2018 08:36 AM, Joerg Roedel wrote:
> >
> > In PAE page-tables at the top-level most bits we usually set
> > with _KERNPG_TABLE are reserved, resulting in a #GP when
> > they are loaded by the processor.
>
> Can you save me the trip to the SDM and remind me which bits actually
> cause trouble here?

Everything besides PRESENT, PCD, PWT and the actual physical address;
RW and NX, for example, cause a #GP.
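
So a PAE top-level entry may only carry something like this (the mask
name is made up for illustration, see the SDM for the authoritative
list of reserved PDPTE bits):

  /* Bits a PAE top-level (PDPTE) entry may carry without faulting */
  #define PAE_PDPTE_ALLOWED_MASK \
          (PTE_PFN_MASK | _PAGE_PRESENT | _PAGE_PWT | _PAGE_PCD)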


Joerg

2018-01-16 19:44:28

by Joerg Roedel

[permalink] [raw]
Subject: Re: [PATCH 12/16] x86/mm/pae: Populate the user page-table with user pgd's

On Tue, Jan 16, 2018 at 10:11:14AM -0800, Dave Hansen wrote:
>
> Ugh. The ghosts of PAE have come back to haunt us.

:-) Yeah, PAE caused the most trouble for me while getting this running.

>
> Could we do:
>
> static inline bool pgd_supports_nx(unsigned long)
> {
> #ifdef CONFIG_X86_64
> return (__supported_pte_mask & _PAGE_NX);
> #else
> /* No 32-bit page tables support NX at PGD level */
> return 0;
> #endif
> }
>
> Nobody will ever spot the #ifdef the way you laid it out.

Right, that's a better way to do it. I'll change it in the next version.

Thanks,

Joerg

2018-01-16 19:46:17

by Joerg Roedel

[permalink] [raw]
Subject: Re: [RFC PATCH 00/16] PTI support for x86-32

On Tue, Jan 16, 2018 at 10:14:19AM -0800, Dave Hansen wrote:
> Joerg,
>
> Very cool!.

Thanks :)

> I really appreciate you putting this together. I don't see any real
> showstoppers or things that I think will *break* 64-bit. I just hope
> that we can merge this _slowly_ in case it breaks 64-bit along the way.

Sure, it needs a lot more testing and most likely fixing anyway. So
there is still some way to go before this is ready for merging.


Joerg

2018-01-16 19:55:46

by Joerg Roedel

[permalink] [raw]
Subject: Re: [RFC PATCH 00/16] PTI support for x86-32

Hi Linus,

On Tue, Jan 16, 2018 at 10:59:01AM -0800, Linus Torvalds wrote:
> Yes, I'm very happy to see that this is actually not nearly as bad as
> I feared it might be,

Yeah, I was looking at the original PTI patches and my impression was
that a lot of the complicated stuff (like setting up the cpu_entry_area)
was already in there for 32 bit too. So it was mostly about the entry
code and some changes to the 32bit page-table code.

> Some of those #ifdef's in the PTI code you added might want more
> commentary about what the exact differences are. And maybe they could
> be done more cleanly with some abstraction. But nothing looked
> _horrible_.

I'll add more comments and better abstraction; Dave has already
suggested some improvements here. Reading some of my comments again,
they need a rework anyway.

> .. and please run all the segment and syscall selfchecks that Andy has written.

Didn't know about them yet, thanks. I will run them in my testing too.

> Xen PV and PTI don't work together even on x86-64 afaik, the Xen
> people apparently felt it wasn't worth it. See the
>
> if (hypervisor_is_type(X86_HYPER_XEN_PV)) {
> pti_print_if_insecure("disabled on XEN PV.");
> return;
> }
>
> in pti_check_boottime_disable().

But I might have broken something for them anyway; honestly, I didn't
pay much attention to the XEN_PV case while I was trying to get it
running here. My hope is that someone who knows Xen better than I do
will help out :)


Regards,

Joerg

2018-01-16 20:30:26

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 02/16] x86/entry/32: Enter the kernel via trampoline stack

On Tue, 16 Jan 2018, Joerg Roedel wrote:
> @@ -89,13 +89,9 @@ static inline void refresh_sysenter_cs(struct thread_struct *thread)
> /* This is used when switching tasks or entering/exiting vm86 mode. */
> static inline void update_sp0(struct task_struct *task)
> {
> - /* On x86_64, sp0 always points to the entry trampoline stack, which is constant: */
> -#ifdef CONFIG_X86_32
> - load_sp0(task->thread.sp0);
> -#else
> + /* sp0 always points to the entry trampoline stack, which is constant: */
> if (static_cpu_has(X86_FEATURE_XENPV))
> load_sp0(task_top_of_stack(task));
> -#endif
> }
>
> #endif /* _ASM_X86_SWITCH_TO_H */
> diff --git a/arch/x86/kernel/asm-offsets_32.c b/arch/x86/kernel/asm-offsets_32.c
> index 654229bac2fc..7270dd834f4b 100644
> --- a/arch/x86/kernel/asm-offsets_32.c
> +++ b/arch/x86/kernel/asm-offsets_32.c
> @@ -47,9 +47,11 @@ void foo(void)
> BLANK();
>
> /* Offset from the sysenter stack to tss.sp0 */
> - DEFINE(TSS_sysenter_stack, offsetof(struct cpu_entry_area, tss.x86_tss.sp0) -
> + DEFINE(TSS_sysenter_stack, offsetof(struct cpu_entry_area, tss.x86_tss.sp1) -
> offsetofend(struct cpu_entry_area, entry_stack_page.stack));
>
> + OFFSET(TSS_sp1, tss_struct, x86_tss.sp1);

Can you please split out the change of TSS_sysenter_stack into a separate
patch?

Other than that, this looks good.

Thanks,

tglx

2018-01-16 21:03:31

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 09/16] x86/mm/pti: Clone CPU_ENTRY_AREA on PMD level on x86_32

On Tue, 16 Jan 2018, Joerg Roedel wrote:
> +#ifdef CONFIG_X86_64
> /*
> * Clone a single p4d (i.e. a top-level entry on 4-level systems and a
> * next-level entry on 5-level systems.
> @@ -322,13 +323,29 @@ static void __init pti_clone_p4d(unsigned long addr)
> kernel_p4d = p4d_offset(kernel_pgd, addr);
> *user_p4d = *kernel_p4d;
> }
> +#endif
>
> /*
> * Clone the CPU_ENTRY_AREA into the user space visible page table.
> */
> static void __init pti_clone_user_shared(void)
> {
> +#ifdef CONFIG_X86_32
> + /*
> + * On 32 bit PAE systems with 1GB of Kernel address space there is only
> + * one pgd/p4d for the whole kernel. Cloning that would map the whole
> + * address space into the user page-tables, making PTI useless. So clone
> + * the page-table on the PMD level to prevent that.
> + */
> + unsigned long start, end;
> +
> + start = CPU_ENTRY_AREA_BASE;
> + end = start + (PAGE_SIZE * CPU_ENTRY_AREA_PAGES);
> +
> + pti_clone_pmds(start, end, _PAGE_GLOBAL);
> +#else
> pti_clone_p4d(CPU_ENTRY_AREA_BASE);
> +#endif
> }

Just a minor nit. You already wrap pti_clone_p4d() into X86_64. So it would
be cleaner to do:

kernel_p4d = p4d_offset(kernel_pgd, addr);
*user_p4d = *kernel_p4d;
}

static void __init pti_clone_user_shared(void)
{
pti_clone_p4d(CPU_ENTRY_AREA_BASE);
}

#else /* CONFIG_X86_64 */

/*
* Big fat comment.
*/
static void __init pti_clone_user_shared(void)
{
....
}
#endif /* !CONFIG_X86_64 */

Thanks,

tglx

2018-01-16 21:06:52

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 10/16] x86/mm/pti: Populate valid user pud entries

On Tue, 16 Jan 2018, Joerg Roedel wrote:

> From: Joerg Roedel <[email protected]>
>
> With PAE paging we don't have PGD and P4D levels in the
> page-table, instead the PUD level is the highest one.
>
> In PAE page-tables at the top-level most bits we usually set
> with _KERNPG_TABLE are reserved, resulting in a #GP when
> they are loaded by the processor.
>
> Work around this by populating PUD entries in the user
> page-table only with _PAGE_PRESENT set.
>
> I am pretty sure there is a cleaner way to do this, but
> until I find it use this #ifdef solution.

Stick something like

#define _KERNELPG_TABLE_PUD_ENTRY

into the 32 and 64 bit variants of some relevant header file.

Thanks,

tglx
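
For illustration, a minimal sketch of the suggested define. The name
follows the suggestion above; the header placement, the exact bit sets,
and the single-#ifdef form (instead of the two per-variant headers tglx
mentions) are assumptions, not code from the posted series:

#ifdef CONFIG_X86_PAE
/*
 * PAE: the hardware top-level entries (PDPTEs) reserve most of the
 * _KERNPG_TABLE bits, so only Present may be set here.
 */
#define _KERNELPG_TABLE_PUD_ENTRY	(_PAGE_PRESENT)
#else
/* 64-bit (and 2-level 32-bit): the full kernel table bits are legal. */
#define _KERNELPG_TABLE_PUD_ENTRY	(_KERNPG_TABLE)
#endif

The PTI code could then populate user top-level entries with
_KERNELPG_TABLE_PUD_ENTRY unconditionally, without an #ifdef at the
call site.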

2018-01-16 21:10:59

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 12/16] x86/mm/pae: Populate the user page-table with user pgd's

On Tue, 16 Jan 2018, Joerg Roedel wrote:
>
> +#ifdef CONFIG_X86_64
> /*
> * If this is normal user memory, make it NX in the kernel
> * pagetables so that, if we somehow screw up and return to
> @@ -134,10 +135,16 @@ pgd_t __pti_set_user_pgd(pgd_t *pgdp, pgd_t pgd)
> * may execute from it
> * - we don't have NX support
> * - we're clearing the PGD (i.e. the new pgd is not present).
> + * - We run on a 32 bit kernel. 2-level paging doesn't support NX at
> + * all and PAE paging does not support it on the PGD level. We can
> + * set it in the PMD level there in the future, but that means we
> + * need to unshare the PMDs between the kernel and the user
> + * page-tables.
> */
> if ((pgd.pgd & (_PAGE_USER|_PAGE_PRESENT)) == (_PAGE_USER|_PAGE_PRESENT) &&
> (__supported_pte_mask & _PAGE_NX))
> pgd.pgd |= _PAGE_NX;

I'd suggest having:

static inline pteval_t supported_pgd_mask(void)
{
if (IS_ENABLED(CONFIG_X86_64))
return __supported_pte_mask;
return __supported_pte_mask & ~_PAGE_NX;
}

and get rid of the ifdeffery completely.

Thanks,

tglx
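
Put together, the helper and the call site from the quoted hunk would
read roughly as below; this is a sketch of the suggestion, not code from
the series:

static inline pteval_t supported_pgd_mask(void)
{
	if (IS_ENABLED(CONFIG_X86_64))
		return __supported_pte_mask;
	/* 32 bit: neither 2-level nor PAE paging supports NX at this level */
	return __supported_pte_mask & ~_PAGE_NX;
}

	/* ... in __pti_set_user_pgd(), with the #ifdef dropped: */
	if ((pgd.pgd & (_PAGE_USER|_PAGE_PRESENT)) == (_PAGE_USER|_PAGE_PRESENT) &&
	    (supported_pgd_mask() & _PAGE_NX))
		pgd.pgd |= _PAGE_NX;

On 32 bit the NX test is constant-false, so the branch compiles away.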

2018-01-16 21:15:23

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 12/16] x86/mm/pae: Populate the user page-table with user pgd's

On 01/16/2018 01:10 PM, Thomas Gleixner wrote:
>
> static inline pteval_t supported_pgd_mask(void)
> {
> if (IS_ENABLED(CONFIG_X86_64))
> return __supported_pte_mask;
> return __supported_pte_mask & ~_PAGE_NX;
> }
>
> and get rid of the ifdeffery completely.

Heh, that's an entertaining way to do it. Joerg, if you go do it this
way, it would be nice to add all the other gunk that we don't allow to
be set in the PAE pgd.

2018-01-16 21:20:46

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [RFC PATCH 00/16] PTI support for x86-32

On Tue, 16 Jan 2018, Joerg Roedel wrote:
> here is my current WIP code to enable PTI on x86-32. It is
> still in a pretty early state, but it successfully boots my
> KVM guest with PAE and with legacy paging. The existing PTI
> code for x86-64 already prepares a lot of the stuff needed
> for 32 bit too, thanks for that to all the people involved
> in its development :)

> 16 files changed, 333 insertions(+), 123 deletions(-)

Impressively small and well done !

Can you please make that patch set against

git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git x86-pti-for-linus

so we immediately have it backportable for 4.14 stable? It's only a trivial
conflict in pgtable.h, but we'd like to make the life of stable as simple
as possible. They have enough headache with the pre 4.14 trees.

We can pick some of the simple patches which make defines and inlines
available out of the pile right away and apply them to x86/pti to shrink
the amount of stuff you have to worry about.

Thanks,

tglx


2018-01-16 22:26:45

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RFC PATCH 00/16] PTI support for x86-32

On Tue, Jan 16, 2018 at 8:36 AM, Joerg Roedel <[email protected]> wrote:
> From: Joerg Roedel <[email protected]>
>
> Hi,
>
> here is my current WIP code to enable PTI on x86-32. It is
> still in a pretty early state, but it successfully boots my
> KVM guest with PAE and with legacy paging. The existing PTI
> code for x86-64 already prepares a lot of the stuff needed
> for 32 bit too, thanks for that to all the people involved
> in its development :)
>
> The patches are split as follows:
>
> - 1-3 contain the entry-code changes to enter and
> exit the kernel via the sysenter trampoline stack.
>
> - 4-7 are fixes to get the code compile on 32 bit
> with CONFIG_PAGE_TABLE_ISOLATION=y.
>
> - 8-14 adapt the existing PTI code to work properly
> on 32 bit and add the needed parts to 32 bit
> page-table code.
>
> - 15 switches PTI on by adding the CR3 switches to
> kernel entry/exit.
>
> - 16 enables the Kconfig for all of X86
>
> The code has not run on bare-metal yet, I'll test that in
> the next days once I setup a 32 bit box again. I also havn't
> tested Wine and DosEMU yet, so this might also be broken.
>

If you pass all the x86 selftests, then Wine and DOSEMU are pretty
likely to work :)

--Andy

2018-01-16 22:37:31

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 02/16] x86/entry/32: Enter the kernel via trampoline stack

On Tue, Jan 16, 2018 at 12:30 PM, Thomas Gleixner <[email protected]> wrote:
> On Tue, 16 Jan 2018, Joerg Roedel wrote:
>> @@ -89,13 +89,9 @@ static inline void refresh_sysenter_cs(struct thread_struct *thread)
>> /* This is used when switching tasks or entering/exiting vm86 mode. */
>> static inline void update_sp0(struct task_struct *task)
>> {
>> - /* On x86_64, sp0 always points to the entry trampoline stack, which is constant: */
>> -#ifdef CONFIG_X86_32
>> - load_sp0(task->thread.sp0);
>> -#else
>> + /* sp0 always points to the entry trampoline stack, which is constant: */
>> if (static_cpu_has(X86_FEATURE_XENPV))
>> load_sp0(task_top_of_stack(task));
>> -#endif
>> }
>>
>> #endif /* _ASM_X86_SWITCH_TO_H */
>> diff --git a/arch/x86/kernel/asm-offsets_32.c b/arch/x86/kernel/asm-offsets_32.c
>> index 654229bac2fc..7270dd834f4b 100644
>> --- a/arch/x86/kernel/asm-offsets_32.c
>> +++ b/arch/x86/kernel/asm-offsets_32.c
>> @@ -47,9 +47,11 @@ void foo(void)
>> BLANK();
>>
>> /* Offset from the sysenter stack to tss.sp0 */
>> - DEFINE(TSS_sysenter_stack, offsetof(struct cpu_entry_area, tss.x86_tss.sp0) -
>> + DEFINE(TSS_sysenter_stack, offsetof(struct cpu_entry_area, tss.x86_tss.sp1) -
>> offsetofend(struct cpu_entry_area, entry_stack_page.stack));

I was going to say that this is just too magical. The convention is
that STRUCT_member refers to "member" of "STRUCT". Here you're
encoding a more complicated calculation. How about putting just the
needed offsets in asm_offsets and putting the actual calculation in
the asm code or a header.

>>
>> + OFFSET(TSS_sp1, tss_struct, x86_tss.sp1);

This belongs in asm_offsets.c. Just move the asm_offsets_64.c version
there and call it a day.

2018-01-16 22:45:51

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 02/16] x86/entry/32: Enter the kernel via trampoline stack

On Tue, Jan 16, 2018 at 8:36 AM, Joerg Roedel <[email protected]> wrote:
> From: Joerg Roedel <[email protected]>
>
> Use the sysenter stack as a trampoline stack to enter the
> kernel. The sysenter stack is already in the cpu_entry_area
> and will be mapped to userspace when PTI is enabled.
>
> Signed-off-by: Joerg Roedel <[email protected]>
> ---
> arch/x86/entry/entry_32.S | 89 +++++++++++++++++++++++++++++++++++-----
> arch/x86/include/asm/switch_to.h | 6 +--
> arch/x86/kernel/asm-offsets_32.c | 4 +-
> arch/x86/kernel/cpu/common.c | 5 ++-
> arch/x86/kernel/process.c | 2 -
> arch/x86/kernel/process_32.c | 6 +++
> 6 files changed, 91 insertions(+), 21 deletions(-)
>
> diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
> index eb8c5615777b..5a7bdb73be9f 100644
> --- a/arch/x86/entry/entry_32.S
> +++ b/arch/x86/entry/entry_32.S
> @@ -222,6 +222,47 @@
> .endm
>
> /*
> + * Switch from the entry-trampline stack to the kernel stack of the
> + * running task.
> + *
> + * nr_regs is the number of dwords to push from the entry stack to the
> + * task stack. If it is > 0 it expects an irq frame at the bottom of the
> + * stack.
> + *
> + * check_user != 0 it will add a check to only switch stacks if the
> + * kernel entry was from user-space.
> + */
> +.macro SWITCH_TO_KERNEL_STACK nr_regs=0 check_user=0

How about marking nr_regs with :req to force everyone to be explicit?

> +
> + .if \check_user > 0 && \nr_regs > 0
> + testb $3, (\nr_regs - 4)*4(%esp) /* CS */
> + jz .Lend_\@
> + .endif
> +
> + pushl %edi
> + movl %esp, %edi
> +
> + /*
> + * TSS_sysenter_stack is the offset from the bottom of the
> + * entry-stack
> + */
> + movl TSS_sysenter_stack + ((\nr_regs + 1) * 4)(%esp), %esp

This is incomprehensible. You're adding what appears to be the offset
of sysenter_stack within the TSS to something based on esp and
dereferencing that to get the new esp. That's not actually what
you're doing, but please change asm_offsets.c (as in my previous
email) to avoid putting serious arithmetic in it and then do the
arithmetic right here so that it's possible to follow what's going on.

> +
> + /* Copy the registers over */
> + .if \nr_regs > 0
> + i = 0
> + .rept \nr_regs
> + pushl (\nr_regs - i) * 4(%edi)
> + i = i + 1
> + .endr
> + .endif
> +
> + mov (%edi), %edi
> +
> +.Lend_\@:
> +.endm
> +
> +/*
> * %eax: prev task
> * %edx: next task
> */
> @@ -401,7 +442,9 @@ ENTRY(xen_sysenter_target)
> * 0(%ebp) arg6
> */
> ENTRY(entry_SYSENTER_32)
> - movl TSS_sysenter_stack(%esp), %esp
> + /* Kernel stack is empty */
> + SWITCH_TO_KERNEL_STACK

This would be more readable if you put nr_regs in here.

> +
> .Lsysenter_past_esp:
> pushl $__USER_DS /* pt_regs->ss */
> pushl %ebp /* pt_regs->sp (stashed in bp) */
> @@ -521,6 +564,10 @@ ENDPROC(entry_SYSENTER_32)
> ENTRY(entry_INT80_32)
> ASM_CLAC
> pushl %eax /* pt_regs->orig_ax */
> +
> + /* Stack layout: ss, esp, eflags, cs, eip, orig_eax */
> + SWITCH_TO_KERNEL_STACK nr_regs=6 check_user=1
> +

Why check_user?

> @@ -655,6 +702,10 @@ END(irq_entries_start)
> common_interrupt:
> ASM_CLAC
> addl $-0x80, (%esp) /* Adjust vector into the [-256, -1] range */
> +
> + /* Stack layout: ss, esp, eflags, cs, eip, vector */
> + SWITCH_TO_KERNEL_STACK nr_regs=6 check_user=1

LGTM.

> ENTRY(nmi)
> ASM_CLAC
> +
> + /* Stack layout: ss, esp, eflags, cs, eip */
> + SWITCH_TO_KERNEL_STACK nr_regs=5 check_user=1

This is wrong, I think. If you get an nmi in kernel mode but while
still on the sysenter stack, you blow up. IIRC we have some crazy
code already to handle this (for nmi and #DB), and maybe that's
already adequate or can be made adequate, but at the very least this
needs a big comment explaining why it's okay.

> diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h
> index eb5f7999a893..20e5f7ab8260 100644
> --- a/arch/x86/include/asm/switch_to.h
> +++ b/arch/x86/include/asm/switch_to.h
> @@ -89,13 +89,9 @@ static inline void refresh_sysenter_cs(struct thread_struct *thread)
> /* This is used when switching tasks or entering/exiting vm86 mode. */
> static inline void update_sp0(struct task_struct *task)
> {
> - /* On x86_64, sp0 always points to the entry trampoline stack, which is constant: */
> -#ifdef CONFIG_X86_32
> - load_sp0(task->thread.sp0);
> -#else
> + /* sp0 always points to the entry trampoline stack, which is constant: */
> if (static_cpu_has(X86_FEATURE_XENPV))
> load_sp0(task_top_of_stack(task));
> -#endif
> }
>
> #endif /* _ASM_X86_SWITCH_TO_H */
> diff --git a/arch/x86/kernel/asm-offsets_32.c b/arch/x86/kernel/asm-offsets_32.c
> index 654229bac2fc..7270dd834f4b 100644
> --- a/arch/x86/kernel/asm-offsets_32.c
> +++ b/arch/x86/kernel/asm-offsets_32.c
> @@ -47,9 +47,11 @@ void foo(void)
> BLANK();
>
> /* Offset from the sysenter stack to tss.sp0 */
> - DEFINE(TSS_sysenter_stack, offsetof(struct cpu_entry_area, tss.x86_tss.sp0) -
> + DEFINE(TSS_sysenter_stack, offsetof(struct cpu_entry_area, tss.x86_tss.sp1) -
> offsetofend(struct cpu_entry_area, entry_stack_page.stack));



>
> + OFFSET(TSS_sp1, tss_struct, x86_tss.sp1);
> +
> #ifdef CONFIG_CC_STACKPROTECTOR
> BLANK();
> OFFSET(stack_canary_offset, stack_canary, canary);
> diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> index ef29ad001991..20a71c914e59 100644
> --- a/arch/x86/kernel/cpu/common.c
> +++ b/arch/x86/kernel/cpu/common.c
> @@ -1649,11 +1649,12 @@ void cpu_init(void)
> enter_lazy_tlb(&init_mm, curr);
>
> /*
> - * Initialize the TSS. Don't bother initializing sp0, as the initial
> - * task never enters user mode.
> + * Initialize the TSS. sp0 points to the entry trampoline stack
> + * regardless of what task is running.
> */
> set_tss_desc(cpu, &get_cpu_entry_area(cpu)->tss.x86_tss);
> load_TR_desc();
> + load_sp0((unsigned long)(cpu_entry_stack(cpu) + 1));

It's high time we unified the 32-bit and 64-bit versions of the code.
This isn't necessarily needed for your series, though.

> diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
> index 5224c6099184..452eeac00b80 100644
> --- a/arch/x86/kernel/process_32.c
> +++ b/arch/x86/kernel/process_32.c
> @@ -292,6 +292,12 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
> this_cpu_write(cpu_current_top_of_stack,
> (unsigned long)task_stack_page(next_p) +
> THREAD_SIZE);
> + /*
> + * TODO: Find a way to let cpu_current_top_of_stack point to
> + * cpu_tss_rw.x86_tss.sp1. Doing so now results in stack corruption with
> + * iret exceptions.
> + */
> + this_cpu_write(cpu_tss_rw.x86_tss.sp1, next_p->thread.sp0);

Do you know what the issue is?

As a general comment, the interaction between this patch and vm86 is a
bit scary. In vm86 mode, the kernel gets entered with extra stuff on
the stack, which may screw up all your offsets.

--Andy

2018-01-16 22:46:39

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 04/16] x86/pti: Define X86_CR3_PTI_PCID_USER_BIT on x86_32

On Tue, Jan 16, 2018 at 8:36 AM, Joerg Roedel <[email protected]> wrote:
> From: Joerg Roedel <[email protected]>
>
> Move it out of the X86_64 specific processor defines so
> that its visible for 32bit too.

Hmm. This is okay, I guess, but any code that actually uses this
definition is inherently wrong, since 32-bit implies !PCID.

2018-01-16 22:49:06

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 03/16] x86/entry/32: Leave the kernel via the trampoline stack

On Tue, Jan 16, 2018 at 8:36 AM, Joerg Roedel <[email protected]> wrote:
> From: Joerg Roedel <[email protected]>
>
> Switch back to the trampoline stack before returning to
> userspace.
>
> Signed-off-by: Joerg Roedel <[email protected]>
> ---
> arch/x86/entry/entry_32.S | 58 ++++++++++++++++++++++++++++++++++++++++
> arch/x86/kernel/asm-offsets_32.c | 1 +
> 2 files changed, 59 insertions(+)
>
> diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
> index 5a7bdb73be9f..14018eeb11c3 100644
> --- a/arch/x86/entry/entry_32.S
> +++ b/arch/x86/entry/entry_32.S
> @@ -263,6 +263,61 @@
> .endm
>
> /*
> + * Switch back from the kernel stack to the entry stack.
> + *
> + * iret_frame > 0 adds code to copie over an iret frame from the old to
> + * the new stack. It also adds a check which bails out if
> + * we are not returning to user-space.
> + *
> + * This macro is allowed not modify eflags when iret_frame == 0.
> + */
> +.macro SWITCH_TO_ENTRY_STACK iret_frame=0
> + .if \iret_frame > 0
> + /* Are we returning to userspace? */
> + testb $3, 4(%esp) /* return CS */
> + jz .Lend_\@
> + .endif
> +
> + /*
> + * We run with user-%fs already loaded from pt_regs, so we don't
> + * have access to per_cpu data anymore, and there is no swapgs
> + * equivalent on x86_32.
> + * We work around this by loading the kernel-%fs again and
> + * reading the entry stack address from there. Then we restore
> + * the user-%fs and return.
> + */
> + pushl %fs
> + pushl %edi
> +
> + /* Re-load kernel-%fs, after that we can use PER_CPU_VAR */
> + movl $(__KERNEL_PERCPU), %edi
> + movl %edi, %fs
> +
> + /* Save old stack pointer to copy the return frame over if needed */
> + movl %esp, %edi
> + movl PER_CPU_VAR(cpu_tss_rw + TSS_sp0), %esp
> +
> + /* Now we are on the entry stack */
> +
> + .if \iret_frame > 0
> + /* Stack frame: ss, esp, eflags, cs, eip, fs, edi */
> + pushl 6*4(%edi) /* ss */
> + pushl 5*4(%edi) /* esp */
> + pushl 4*4(%edi) /* eflags */
> + pushl 3*4(%edi) /* cs */
> + pushl 2*4(%edi) /* eip */
> + .endif
> +
> + pushl 4(%edi) /* fs */
> +
> + /* Restore user %edi and user %fs */
> + movl (%edi), %edi
> + popl %fs

Yikes! We're not *supposed* to be able to observe an asynchronous
descriptor table change, but if the LDT changes out from under you,
this is going to blow up badly. It would be really nice if you could
pull this off without percpu access or without needing to do this
dance where you load user FS, then kernel FS, then user FS. If that's
not doable, then you should at least add exception handling -- look at
the other 'pop %fs' instructions in entry_32.S.

2018-01-16 22:52:08

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 06/16] x86/mm/ldt: Reserve high address-space range for the LDT

On Tue, Jan 16, 2018 at 8:52 AM, Peter Zijlstra <[email protected]> wrote:
> On Tue, Jan 16, 2018 at 05:36:49PM +0100, Joerg Roedel wrote:
>> From: Joerg Roedel <[email protected]>
>>
>> Reserve 2MB/4MB of address space for mapping the LDT to
>> user-space.
>
> LDT is 64k, we need 2 per CPU, and NR_CPUS <= 64 on 32bit, that gives
> 64K*2*64=8M > 2M.

If this works like it does on 64-bit, it only needs 128k regardless of
the number of CPUs. The LDT mapping is specific to the mm.

How are you dealing with PAE here? That is, what's your pagetable
layout? What parts of the address space are owned by what code?

2018-01-17 02:48:27

by Boris Ostrovsky

[permalink] [raw]
Subject: Re: [PATCH 02/16] x86/entry/32: Enter the kernel via trampoline stack



On 01/16/2018 11:36 AM, Joerg Roedel wrote:

>
> /*
> + * Switch from the entry-trampline stack to the kernel stack of the
> + * running task.
> + *
> + * nr_regs is the number of dwords to push from the entry stack to the
> + * task stack. If it is > 0 it expects an irq frame at the bottom of the
> + * stack.
> + *
> + * check_user != 0 it will add a check to only switch stacks if the
> + * kernel entry was from user-space.
> + */
> +.macro SWITCH_TO_KERNEL_STACK nr_regs=0 check_user=0


This (and next patch's SWITCH_TO_ENTRY_STACK) need X86_FEATURE_PTI check.

With those macros fixed I was able to boot 32-bit Xen PV guest.

-boris

2018-01-17 07:59:58

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 06/16] x86/mm/ldt: Reserve high address-space range for the LDT

On Tue, Jan 16, 2018 at 02:51:45PM -0800, Andy Lutomirski wrote:
> On Tue, Jan 16, 2018 at 8:52 AM, Peter Zijlstra <[email protected]> wrote:
> > On Tue, Jan 16, 2018 at 05:36:49PM +0100, Joerg Roedel wrote:
> >> From: Joerg Roedel <[email protected]>
> >>
> >> Reserve 2MB/4MB of address space for mapping the LDT to
> >> user-space.
> >
> > LDT is 64k, we need 2 per CPU, and NR_CPUS <= 64 on 32bit, that gives
> > 64K*2*64=8M > 2M.
>
> If this works like it does on 64-bit, it only needs 128k regardless of
> the number of CPUs. The LDT mapping is specific to the mm.

Ah, then I got my LDT things confused again... which is certainly
possible, we had a few too many variants back then.

2018-01-17 09:02:42

by Joerg Roedel

[permalink] [raw]
Subject: Re: [PATCH 02/16] x86/entry/32: Enter the kernel via trampoline stack

Hi Boris,

thanks for testing this :)

On Tue, Jan 16, 2018 at 09:47:06PM -0500, Boris Ostrovsky wrote:
> On 01/16/2018 11:36 AM, Joerg Roedel wrote:
> >+.macro SWITCH_TO_KERNEL_STACK nr_regs=0 check_user=0
>
>
> This (and next patch's SWITCH_TO_ENTRY_STACK) need X86_FEATURE_PTI check.
>
> With those macros fixed I was able to boot 32-bit Xen PV guest.

Hmm, on bare metal the stack switch happens regardless of the
X86_FEATURE_PTI feature being set, because we always program tss.sp0
with the sysenter stack. How is the kernel entry stack set up on xen-pv?
I think something is missing there instead.

Regards,

Joerg

2018-01-17 09:18:58

by Joerg Roedel

[permalink] [raw]
Subject: Re: [PATCH 02/16] x86/entry/32: Enter the kernel via trampoline stack

On Tue, Jan 16, 2018 at 02:45:27PM -0800, Andy Lutomirski wrote:
> On Tue, Jan 16, 2018 at 8:36 AM, Joerg Roedel <[email protected]> wrote:
> > +.macro SWITCH_TO_KERNEL_STACK nr_regs=0 check_user=0
>
> How about marking nr_regs with :req to force everyone to be explicit?

Yeah, that's more readable, I'll change it.

> > + /*
> > + * TSS_sysenter_stack is the offset from the bottom of the
> > + * entry-stack
> > + */
> > + movl TSS_sysenter_stack + ((\nr_regs + 1) * 4)(%esp), %esp
>
> This is incomprehensible. You're adding what appears to be the offset
> of sysenter_stack within the TSS to something based on esp and
> dereferencing that to get the new esp. That's not actually what
> you're doing, but please change asm_offsets.c (as in my previous
> email) to avoid putting serious arithmetic in it and then do the
> arithmetic right here so that it's possible to follow what's going on.

Probably this needs better comments. So TSS_sysenter_stack is the offset
of tss.sp0 (tss.sp1 later) from the _bottom_ of the stack. But in
this macro the stack might not be empty; it has a configurable (by
\nr_regs) number of dwords on it. Before this instruction we also do a
push %edi, so we need (\nr_regs + 1).

This can't be put into asm_offsets.c, as the actual offset depends on how
much is on the stack.

> > ENTRY(entry_INT80_32)
> > ASM_CLAC
> > pushl %eax /* pt_regs->orig_ax */
> > +
> > + /* Stack layout: ss, esp, eflags, cs, eip, orig_eax */
> > + SWITCH_TO_KERNEL_STACK nr_regs=6 check_user=1
> > +
>
> Why check_user?

You are right, check_user shouldn't be needed as INT80 is never called
from kernel mode.

> > ENTRY(nmi)
> > ASM_CLAC
> > +
> > + /* Stack layout: ss, esp, eflags, cs, eip */
> > + SWITCH_TO_KERNEL_STACK nr_regs=5 check_user=1
>
> This is wrong, I think. If you get an nmi in kernel mode but while
> still on the sysenter stack, you blow up. IIRC we have some crazy
> code already to handle this (for nmi and #DB), and maybe that's
> already adequate or can be made adequate, but at the very least this
> needs a big comment explaining why it's okay.

If we get an nmi while still on the sysenter stack, then we are not
entering the handler from user-space and the above code will do
nothing and behave as before.

But you are right, it might blow up. There is a problem with the cr3
switch, because the nmi can happen in kernel mode before the cr3 is
switched; in that case this handler will not do the cr3 switch itself
and will crash the kernel. But the stack switching should be fine, I think.

> > + /*
> > + * TODO: Find a way to let cpu_current_top_of_stack point to
> > + * cpu_tss_rw.x86_tss.sp1. Doing so now results in stack corruption with
> > + * iret exceptions.
> > + */
> > + this_cpu_write(cpu_tss_rw.x86_tss.sp1, next_p->thread.sp0);
>
> Do you know what the issue is?

No, not yet, I will look into that again. But first I want to get
this series stable enough as it is.

> As a general comment, the interaction between this patch and vm86 is a
> bit scary. In vm86 mode, the kernel gets entered with extra stuff on
> the stack, which may screw up all your offsets.

Just read up on vm86 mode control transfers and the stack layout then.
Looks like I need to check for eflags.vm=1 and copy four more registers
from/to the entry stack. Thanks for pointing that out.

Thanks,

Joerg

2018-01-17 09:24:46

by Joerg Roedel

[permalink] [raw]
Subject: Re: [PATCH 03/16] x86/entry/32: Leave the kernel via the trampoline stack

On Tue, Jan 16, 2018 at 02:48:43PM -0800, Andy Lutomirski wrote:
> On Tue, Jan 16, 2018 at 8:36 AM, Joerg Roedel <[email protected]> wrote:
> > + /* Restore user %edi and user %fs */
> > + movl (%edi), %edi
> > + popl %fs
>
> Yikes! We're not *supposed* to be able to observe an asynchronous
> descriptor table change, but if the LDT changes out from under you,
> this is going to blow up badly. It would be really nice if you could
> pull this off without percpu access or without needing to do this
> dance where you load user FS, then kernel FS, then user FS. If that's
> not doable, then you should at least add exception handling -- look at
> the other 'pop %fs' instructions in entry_32.S.

You are right! This also means I need to do the 'popl %fs' before the
cr3-switch. I'll fix it in the next version.

I have no real idea on how to switch back to the entry stack without
access to per_cpu variables. I also can't access the cpu_entry_area for
the cpu yet, because for that we need to be on the entry stack already.


Joerg

2018-01-17 09:27:01

by Joerg Roedel

[permalink] [raw]
Subject: Re: [PATCH 04/16] x86/pti: Define X86_CR3_PTI_PCID_USER_BIT on x86_32

On Tue, Jan 16, 2018 at 02:46:16PM -0800, Andy Lutomirski wrote:
> On Tue, Jan 16, 2018 at 8:36 AM, Joerg Roedel <[email protected]> wrote:
> > From: Joerg Roedel <[email protected]>
> >
> > Move it out of the X86_64 specific processor defines so
> > that its visible for 32bit too.
>
> Hmm. This is okay, I guess, but any code that actually uses this
> definition is inherently wrong, since 32-bit implies !PCID.

Yes, I tried another approach first which just #ifdef'ed out the
relevant parts in tlbflush.h which use this bit. But that seemed to be
the wrong path, as there is more PCID code that is compiled in for 32
bit. So defining the bit for 32 bit seemed to be the cleaner solution
for now.


Joerg

2018-01-17 09:33:38

by Joerg Roedel

[permalink] [raw]
Subject: Re: [RFC PATCH 00/16] PTI support for x86-32

Hi Andy,

thanks a lot for your review and input, especially on the entry-code
changes!

On Tue, Jan 16, 2018 at 02:26:22PM -0800, Andy Lutomirski wrote:
> On Tue, Jan 16, 2018 at 8:36 AM, Joerg Roedel <[email protected]> wrote:
> > The code has not run on bare-metal yet, I'll test that in
> > the next days once I setup a 32 bit box again. I also havn't
> > tested Wine and DosEMU yet, so this might also be broken.
> >
>
> If you pass all the x86 selftests, then Wine and DOSEMU are pretty
> likely to work :)

Okay, good to know. I will definitely run them and make them pass :)


Joerg

2018-01-17 09:55:12

by Joerg Roedel

[permalink] [raw]
Subject: Re: [RFC PATCH 00/16] PTI support for x86-32

Hi Thomas,

thanks for your review, I'll work in your suggestions for the next post.

On Tue, Jan 16, 2018 at 10:20:40PM +0100, Thomas Gleixner wrote:
> On Tue, 16 Jan 2018, Joerg Roedel wrote:

> > 16 files changed, 333 insertions(+), 123 deletions(-)
>
> Impressively small and well done !

Thanks :)

> Can you please make that patch set against
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git x86-pti-for-linus
>
> so we immediately have it backportable for 4.14 stable? It's only a trivial
> conflict in pgtable.h, but we'd like to make the life of stable as simple
> as possible. They have enough headache with the pre 4.14 trees.

Sure, will do.

> We can pick some of the simple patches which make defines and inlines
> available out of the pile right away and apply them to x86/pti to shrink
> the amount of stuff you have to worry about.

This should be patches 4, 5, 7, 11, and I think 13 is also simple
enough. Feel free to take them, but I can also carry them forward if
needed.

Thanks,

Joerg

2018-01-17 13:57:56

by Brian Gerst

[permalink] [raw]
Subject: Re: [PATCH 03/16] x86/entry/32: Leave the kernel via the trampoline stack

On Wed, Jan 17, 2018 at 1:24 AM, Joerg Roedel <[email protected]> wrote:
> On Tue, Jan 16, 2018 at 02:48:43PM -0800, Andy Lutomirski wrote:
>> On Tue, Jan 16, 2018 at 8:36 AM, Joerg Roedel <[email protected]> wrote:
>> > + /* Restore user %edi and user %fs */
>> > + movl (%edi), %edi
>> > + popl %fs
>>
>> Yikes! We're not *supposed* to be able to observe an asynchronous
>> descriptor table change, but if the LDT changes out from under you,
>> this is going to blow up badly. It would be really nice if you could
>> pull this off without percpu access or without needing to do this
>> dance where you load user FS, then kernel FS, then user FS. If that's
>> not doable, then you should at least add exception handling -- look at
>> the other 'pop %fs' instructions in entry_32.S.
>
> You are right! This also means I need to do the 'popl %fs' before the
> cr3-switch. I'll fix it in the next version.
>
> I have no real idea on how to switch back to the entry stack without
> access to per_cpu variables. I also can't access the cpu_entry_area for
> the cpu yet, because for that we need to be on the entry stack already.

Switch to the trampoline stack before loading user segments.

--
Brian Gerst

2018-01-17 14:00:10

by Brian Gerst

[permalink] [raw]
Subject: Re: [PATCH 03/16] x86/entry/32: Leave the kernel via the trampoline stack

On Wed, Jan 17, 2018 at 5:57 AM, Brian Gerst <[email protected]> wrote:
> On Wed, Jan 17, 2018 at 1:24 AM, Joerg Roedel <[email protected]> wrote:
>> On Tue, Jan 16, 2018 at 02:48:43PM -0800, Andy Lutomirski wrote:
>>> On Tue, Jan 16, 2018 at 8:36 AM, Joerg Roedel <[email protected]> wrote:
>>> > + /* Restore user %edi and user %fs */
>>> > + movl (%edi), %edi
>>> > + popl %fs
>>>
>>> Yikes! We're not *supposed* to be able to observe an asynchronous
>>> descriptor table change, but if the LDT changes out from under you,
>>> this is going to blow up badly. It would be really nice if you could
>>> pull this off without percpu access or without needing to do this
>>> dance where you load user FS, then kernel FS, then user FS. If that's
>>> not doable, then you should at least add exception handling -- look at
>>> the other 'pop %fs' instructions in entry_32.S.
>>
>> You are right! This also means I need to do the 'popl %fs' before the
>> cr3-switch. I'll fix it in the next version.
>>
>> I have no real idea on how to switch back to the entry stack without
>> access to per_cpu variables. I also can't access the cpu_entry_area for
>> the cpu yet, because for that we need to be on the entry stack already.
>
> Switch to the trampoline stack before loading user segments.

But then again, you could take a fault on the trampoline stack if you
get a bad segment. Perhaps just pushing the new stack pointer onto
the process stack before user segment loads will be the right move.

--
Brian Gerst

2018-01-17 14:08:13

by Andrew Cooper

[permalink] [raw]
Subject: Re: [PATCH 02/16] x86/entry/32: Enter the kernel via trampoline stack

On 17/01/18 09:02, Joerg Roedel wrote:
> Hi Boris,
>
> thanks for testing this :)
>
> On Tue, Jan 16, 2018 at 09:47:06PM -0500, Boris Ostrovsky wrote:
>> On 01/16/2018 11:36 AM, Joerg Roedel wrote:
>>> +.macro SWITCH_TO_KERNEL_STACK nr_regs=0 check_user=0
>>
>> This (and next patch's SWITCH_TO_ENTRY_STACK) need X86_FEATURE_PTI check.
>>
>> With those macros fixed I was able to boot 32-bit Xen PV guest.
> Hmm, on bare metal the stack switch happens regardless of the
> X86_FEATURE_PTI feature being set, because we always program tss.sp0
> with the sysenter stack. How is the kernel entry stack set up on xen-pv?
> I think something is missing there instead.

There is one single stack registered with Xen, on which you get a normal
exception frame in all cases, even via the registered (virtual)
syscall/sysenter/failsafe handlers.

~Andrew

2018-01-17 14:10:09

by Joerg Roedel

[permalink] [raw]
Subject: Re: [PATCH 03/16] x86/entry/32: Leave the kernel via the trampoline stack

On Wed, Jan 17, 2018 at 05:57:53AM -0800, Brian Gerst wrote:
> On Wed, Jan 17, 2018 at 1:24 AM, Joerg Roedel <[email protected]> wrote:

> > I have no real idea on how to switch back to the entry stack without
> > access to per_cpu variables. I also can't access the cpu_entry_area for
> > the cpu yet, because for that we need to be on the entry stack already.
>
> Switch to the trampoline stack before loading user segments.

That requires copying most of pt_regs from the task- to the trampoline-stack;
I'm not sure if that is faster than temporarily restoring kernel %fs.


Joerg

2018-01-17 14:14:38

by Joerg Roedel

[permalink] [raw]
Subject: Re: [PATCH 03/16] x86/entry/32: Leave the kernel via the trampoline stack

On Wed, Jan 17, 2018 at 06:00:07AM -0800, Brian Gerst wrote:
> On Wed, Jan 17, 2018 at 5:57 AM, Brian Gerst <[email protected]> wrote:
> But then again, you could take a fault on the trampoline stack if you
> get a bad segment. Perhaps just pushing the new stack pointer onto
> the process stack before user segment loads will be the right move.

User segment loads pop from the stack, so having anything on-top also
doesn't work.

Maybe I can leave some space at the bottom of the task-stack at entry
time and store the pointer there on exit, if that doesn't confuse the
stack unwinder too much.


Joerg

2018-01-17 14:45:20

by Josh Poimboeuf

[permalink] [raw]
Subject: Re: [PATCH 03/16] x86/entry/32: Leave the kernel via the trampoline stack

On Wed, Jan 17, 2018 at 03:14:18PM +0100, Joerg Roedel wrote:
> On Wed, Jan 17, 2018 at 06:00:07AM -0800, Brian Gerst wrote:
> > On Wed, Jan 17, 2018 at 5:57 AM, Brian Gerst <[email protected]> wrote:
> > But then again, you could take a fault on the trampoline stack if you
> > get a bad segment. Perhaps just pushing the new stack pointer onto
> > the process stack before user segment loads will be the right move.
>
> User segment loads pop from the stack, so having anything on-top also
> doesn't work.
>
> Maybe I can leave some space at the bottom of the task-stack at entry
> time and store the pointer there on exit, if that doesn't confuse the
> stack unwinder too much.

If you put it at the end of the stack page, I _think_ all you'd have to
do is just adjust TOP_OF_KERNEL_STACK_PADDING.

--
Josh
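
For concreteness, a hypothetical sketch of that adjustment. The extra
4-byte slot and the existing 32-bit padding values (8 bytes, or 16 with
CONFIG_VM86) are assumptions about the then-current thread_info.h, not
something posted in this thread:

/*
 * Reserve one extra word between the top of the stack page and pt_regs
 * so the exit path has a scratch slot for the entry-stack pointer
 * (sketch only; sizes assumed).
 */
#ifdef CONFIG_VM86
# define TOP_OF_KERNEL_STACK_PADDING	(16 + 4)
#else
# define TOP_OF_KERNEL_STACK_PADDING	(8 + 4)
#endif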

2018-01-17 15:25:11

by Boris Ostrovsky

[permalink] [raw]
Subject: Re: [PATCH 02/16] x86/entry/32: Enter the kernel via trampoline stack

On 01/17/2018 09:04 AM, Andrew Cooper wrote:
> On 17/01/18 09:02, Joerg Roedel wrote:
>> Hi Boris,
>>
>> thanks for testing this :)
>>
>> On Tue, Jan 16, 2018 at 09:47:06PM -0500, Boris Ostrovsky wrote:
>>> On 01/16/2018 11:36 AM, Joerg Roedel wrote:
>>>> +.macro SWITCH_TO_KERNEL_STACK nr_regs=0 check_user=0
>>> This (and next patch's SWITCH_TO_ENTRY_STACK) need X86_FEATURE_PTI check.
>>>
>>> With those macros fixed I was able to boot 32-bit Xen PV guest.
>> Hmm, on bare metal the stack switch happens regardless of the
>> X86_FEATURE_PTI feature being set, because we always program tss.sp0
>> with the sysenter stack. How is the kernel entry stack set up on xen-pv?
>> I think something is missing there instead.
> There is one single stack registered with Xen, on which you get a normal
> exception frame in all cases, even via the registered (virtual)
> syscall/sysenter/failsafe handlers.

And so the check should be at least against X86_FEATURE_XENPV, not
necessarily X86_FEATURE_PTI.

But I guess you can still check against X86_FEATURE_PTI since without it
there is not much reason to switch stacks?

-boris

2018-01-17 18:11:34

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 02/16] x86/entry/32: Enter the kernel via trampoline stack

On Wed, Jan 17, 2018 at 1:18 AM, Joerg Roedel <[email protected]> wrote:
> On Tue, Jan 16, 2018 at 02:45:27PM -0800, Andy Lutomirski wrote:
>> On Tue, Jan 16, 2018 at 8:36 AM, Joerg Roedel <[email protected]> wrote:
>> > +.macro SWITCH_TO_KERNEL_STACK nr_regs=0 check_user=0
>>
>> How about marking nr_regs with :req to force everyone to be explicit?
>
> Yeah, that's more readable, I'll change it.
>
>> > + /*
>> > + * TSS_sysenter_stack is the offset from the bottom of the
>> > + * entry-stack
>> > + */
>> > + movl TSS_sysenter_stack + ((\nr_regs + 1) * 4)(%esp), %esp
>>
>> This is incomprehensible. You're adding what appears to be the offset
>> of sysenter_stack within the TSS to something based on esp and
>> dereferencing that to get the new esp. That's not actually what
>> you're doing, but please change asm_offsets.c (as in my previous
>> email) to avoid putting serious arithmetic in it and then do the
>> arithmetic right here so that it's possible to follow what's going on.
>
> Probably this needs better comments. So TSS_sysenter_stack is the offset
> of tss.sp0 (tss.sp1 later) from the _bottom_ of the stack. But in
> this macro the stack might not be empty; it has a configurable (by
> \nr_regs) number of dwords on it. Before this instruction we also do a
> push %edi, so we need (\nr_regs + 1).
>
> This can't be put into asm_offsets.c, as the actual offset depends on how
> much is on the stack.
>
>> > ENTRY(entry_INT80_32)
>> > ASM_CLAC
>> > pushl %eax /* pt_regs->orig_ax */
>> > +
>> > + /* Stack layout: ss, esp, eflags, cs, eip, orig_eax */
>> > + SWITCH_TO_KERNEL_STACK nr_regs=6 check_user=1
>> > +
>>
>> Why check_user?
>
> You are right, check_user shouldn't be needed as INT80 is never called
> from kernel mode.
>
>> > ENTRY(nmi)
>> > ASM_CLAC
>> > +
>> > + /* Stack layout: ss, esp, eflags, cs, eip */
>> > + SWITCH_TO_KERNEL_STACK nr_regs=5 check_user=1
>>
>> This is wrong, I think. If you get an nmi in kernel mode but while
>> still on the sysenter stack, you blow up. IIRC we have some crazy
>> code already to handle this (for nmi and #DB), and maybe that's
>> already adequate or can be made adequate, but at the very least this
>> needs a big comment explaining why it's okay.
>
> If we get an nmi while still on the sysenter stack, then we are not
> entering the handler from user-space and the above code will do
> nothing and behave as before.
>
> But you are right, it might blow up. There is a problem with the cr3
> switch, because the nmi can happen in kernel mode before the cr3 is
> switched; in that case this handler will not do the cr3 switch itself
> and will crash the kernel. But the stack switching should be fine, I think.
>
>> > + /*
>> > + * TODO: Find a way to let cpu_current_top_of_stack point to
>> > + * cpu_tss_rw.x86_tss.sp1. Doing so now results in stack corruption with
>> > + * iret exceptions.
>> > + */
>> > + this_cpu_write(cpu_tss_rw.x86_tss.sp1, next_p->thread.sp0);
>>
>> Do you know what the issue is?
>
> No, not yet, I will look into that again. But first I want to get
> this series stable enough as it is.
>
>> As a general comment, the interaction between this patch and vm86 is a
>> bit scary. In vm86 mode, the kernel gets entered with extra stuff on
>> the stack, which may screw up all your offsets.
>
> Just read up on vm86 mode control transfers and the stack layout then.
> Looks like I need to check for eflags.vm=1 and copy four more registers
> from/to the entry stack. Thanks for pointing that out.

You could just copy those slots unconditionally. After all, you're
slowing down entries by an epic amount due to writing CR3 on with PCID
off, so four words copied should be entirely lost in the noise. OTOH,
checking for VM86 mode is just a single bt against EFLAGS.

With the modern (rewritten a year or two ago by Brian Gerst) vm86
code, all the slots (those actually in pt_regs) are in the same
location regardless of whether we're in VM86 mode or not, but we're
still fiddling with the bottom of the stack. Since you're controlling
the switch to the kernel thread stack, you can easily just write the
frame to the correct location, so you should not need to context
switch sp1 -- you can do it sanely and leave sp1 as the actual bottom
of the kernel stack no matter what. In fact, you could probably avoid
context switching sp0, either, which would be a nice cleanup.

So I recommend the following. Keep sp0 as the bottom of the sysenter
stack no matter what. Then do:

bt $X86_EFLAGS_VM_BIT
jc .Lfrom_vm_\@

push 5 regs to real stack, starting at four-word offset (so they're in
the right place)
update %esp
...
.Lupdate_esp_\@

.Lfrom_vm_\@:
push 9 regs to real stack, starting at the bottom
jmp .Lupdate_esp_\@

Does that seem reasonable? It's arguably much nicer than what we have now.

2018-01-17 18:15:22

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 03/16] x86/entry/32: Leave the kernel via the trampoline stack

On Wed, Jan 17, 2018 at 6:10 AM, Joerg Roedel <[email protected]> wrote:
> On Wed, Jan 17, 2018 at 05:57:53AM -0800, Brian Gerst wrote:
>> On Wed, Jan 17, 2018 at 1:24 AM, Joerg Roedel <[email protected]> wrote:
>
>> > I have no real idea on how to switch back to the entry stack without
>> > access to per_cpu variables. I also can't access the cpu_entry_area for
>> > the cpu yet, because for that we need to be on the entry stack already.
>>
>> Switch to the trampoline stack before loading user segments.
>
> That requires copying most of pt_regs from the task- to the trampoline-stack;
> I'm not sure if that is faster than temporarily restoring kernel %fs.
>

I would optimize for simplicity, not speed. You're already planning
to write to CR3, which is serializing, blows away the TLB, *and* takes
the absurdly large amount of time that the microcode needs to blow
away the TLB.

(For whatever reason, Intel doesn't seem to have hardware that can
quickly wipe the TLB. I suspect that the actual implementation does
it in a loop and wipes little pieces at a time. Whatever it actually
does, the CR3 write itself is very slow.)

2018-01-17 23:42:09

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 14/16] x86/mm/legacy: Populate the user page-table with user pgd's

On Tue, Jan 16, 2018 at 8:36 AM, Joerg Roedel <[email protected]> wrote:
> From: Joerg Roedel <[email protected]>
>
> Also populate the user-spage pgd's in the user page-table.
>
> Signed-off-by: Joerg Roedel <[email protected]>
> ---
> arch/x86/include/asm/pgtable-2level.h | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/arch/x86/include/asm/pgtable-2level.h b/arch/x86/include/asm/pgtable-2level.h
> index 685ffe8a0eaf..d96486d23c58 100644
> --- a/arch/x86/include/asm/pgtable-2level.h
> +++ b/arch/x86/include/asm/pgtable-2level.h
> @@ -19,6 +19,9 @@ static inline void native_set_pte(pte_t *ptep , pte_t pte)
>
> static inline void native_set_pmd(pmd_t *pmdp, pmd_t pmd)
> {
> +#ifdef CONFIG_PAGE_TABLE_ISOLATION
> + pmd.pud.p4d.pgd = pti_set_user_pgd(&pmdp->pud.p4d.pgd, pmd.pud.p4d.pgd);
> +#endif
> *pmdp = pmd;
> }
>

Nothing against your patch, but this seems like a perfectly fine place
to rant: I *hate* the way we deal with page table folding. Grr.

2018-01-17 23:44:37

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 08/16] x86/pgtable/32: Allocate 8k page-tables when PTI is enabled

On Tue, Jan 16, 2018 at 8:36 AM, Joerg Roedel <[email protected]> wrote:
> From: Joerg Roedel <[email protected]>
>
> Allocate a kernel and a user page-table root when PTI is
> enabled. Also allocate a full page per root for PAEm because
> otherwise the bit to flip in cr3 to switch between them
> would be non-constant, which creates a lot of hassle.
> Keep that for a later optimization.
>
> Signed-off-by: Joerg Roedel <[email protected]>
> ---
> arch/x86/kernel/head_32.S | 23 ++++++++++++++++++-----
> arch/x86/mm/pgtable.c | 11 ++++++-----
> 2 files changed, 24 insertions(+), 10 deletions(-)
>
> diff --git a/arch/x86/kernel/head_32.S b/arch/x86/kernel/head_32.S
> index c29020907886..fc550559bf58 100644
> --- a/arch/x86/kernel/head_32.S
> +++ b/arch/x86/kernel/head_32.S
> @@ -512,28 +512,41 @@ ENTRY(initial_code)
> ENTRY(setup_once_ref)
> .long setup_once
>
> +#ifdef CONFIG_PAGE_TABLE_ISOLATION
> +#define PGD_ALIGN (2 * PAGE_SIZE)
> +#define PTI_USER_PGD_FILL 1024
> +#else
> +#define PGD_ALIGN (PAGE_SIZE)
> +#define PTI_USER_PGD_FILL 0
> +#endif
> /*
> * BSS section
> */
> __PAGE_ALIGNED_BSS
> - .align PAGE_SIZE
> + .align PGD_ALIGN
> #ifdef CONFIG_X86_PAE
> .globl initial_pg_pmd
> initial_pg_pmd:
> .fill 1024*KPMDS,4,0
> + .fill PTI_USER_PGD_FILL,4,0

Couldn't this be simplified to just .align PGD_ALIGN, 0 without the .fill?

2018-01-19 09:57:41

by Joerg Roedel

[permalink] [raw]
Subject: Re: [PATCH 02/16] x86/entry/32: Enter the kernel via trampoline stack

Hey Andy,

On Wed, Jan 17, 2018 at 10:10:23AM -0800, Andy Lutomirski wrote:
> On Wed, Jan 17, 2018 at 1:18 AM, Joerg Roedel <[email protected]> wrote:

> > Just read up on vm86 mode control transfers and the stack layout then.
> > Looks like I need to check for eflags.vm=1 and copy four more registers
> > from/to the entry stack. Thanks for pointing that out.
>
> You could just copy those slots unconditionally. After all, you're
> slowing down entries by an epic amount due to writing CR3 on with PCID
> off, so four words copied should be entirely lost in the noise. OTOH,
> checking for VM86 mode is just a single bt against EFLAGS.
>
> With the modern (rewritten a year or two ago by Brian Gerst) vm86
> code, all the slots (those actually in pt_regs) are in the same
> location regardless of whether we're in VM86 mode or not, but we're
> still fiddling with the bottom of the stack. Since you're controlling
> the switch to the kernel thread stack, you can easily just write the
> frame to the correct location, so you should not need to context
> switch sp1 -- you can do it sanely and leave sp1 as the actual bottom
> of the kernel stack no matter what. In fact, you could probably avoid
> context switching sp0, either, which would be a nice cleanup.

I am not sure what you mean by "not context switching sp0/sp1" ...

> So I recommend the following. Keep sp0 as the bottom of the sysenter
> stack no matter what. Then do:
>
> bt $X86_EFLAGS_VM_BIT
> jc .Lfrom_vm_\@
>
> push 5 regs to real stack, starting at four-word offset (so they're in
> the right place)
> update %esp
> ...
> .Lupdate_esp_\@
>
> .Lfrom_vm_\@:
> push 9 regs to real stack, starting at the bottom
> jmp .Lupdate_esp_\@
>
> Does that seem reasonable? It's arguably much nicer than what we have
> now.

But that looks like a good idea. Having a consistent stack with and
without vm86 is certainly a nice cleanup.


Regards,

Joerg


2018-01-19 09:58:14

by Joerg Roedel

[permalink] [raw]
Subject: Re: [PATCH 03/16] x86/entry/32: Leave the kernel via the trampoline stack

On Wed, Jan 17, 2018 at 10:12:32AM -0800, Andy Lutomirski wrote:
> I would optimize for simplicity, not speed. You're already planning
> to write to CR3, which is serializing, blows away the TLB, *and* takes
> the absurdly large amount of time that the microcode needs to blow
> away the TLB.

Okay, so I am going to do the stack-switch before pt_regs is restored.
This is at least better than playing games with hiding the entry/exit
%esp somewhere in stack-memory.


Thanks,

Joerg

2018-01-19 09:58:40

by Joerg Roedel

[permalink] [raw]
Subject: Re: [PATCH 08/16] x86/pgtable/32: Allocate 8k page-tables when PTI is enabled

On Wed, Jan 17, 2018 at 03:43:14PM -0800, Andy Lutomirski wrote:
> On Tue, Jan 16, 2018 at 8:36 AM, Joerg Roedel <[email protected]> wrote:
> > #ifdef CONFIG_X86_PAE
> > .globl initial_pg_pmd
> > initial_pg_pmd:
> > .fill 1024*KPMDS,4,0
> > + .fill PTI_USER_PGD_FILL,4,0
>
> Couldn't this be simplified to just .align PGD_ALIGN, 0 without the .fill?

You are right, will change that.

Thanks,

Joerg


2018-01-19 10:56:56

by Pavel Machek

[permalink] [raw]
Subject: Re: [RFC PATCH 00/16] PTI support for x86-32

Hi!

> From: Joerg Roedel <[email protected]>
>
> Hi,
>
> here is my current WIP code to enable PTI on x86-32. It is
> still in a pretty early state, but it successfully boots my
> KVM guest with PAE and with legacy paging. The existing PTI
> code for x86-64 already prepares a lot of the stuff needed
> for 32 bit too, thanks for that to all the people involved
> in its development :)

Thanks for doing the work.

I tried applying it on top of -next, and that did not succeed. Let me
try Linus tree...

> The code has not run on bare-metal yet, I'll test that in
> the next days once I setup a 32 bit box again. I also havn't
> tested Wine and DosEMU yet, so this might also be broken.

Um. Ok, testing is something I can do. At least I have an excuse to power
on T40p.

Ok... Testing is something I can do... If I can get it to compile.

  CC      arch/x86/mm/dump_pagetables.o
arch/x86/mm/dump_pagetables.c: In function ‘ptdump_walk_user_pgd_level_checkwx’:
arch/x86/mm/dump_pagetables.c:546:26: error: ‘init_top_pgt’ undeclared (first use in this function)
  pgd_t *pgd = (pgd_t *) &init_top_pgt;
                         ^
arch/x86/mm/dump_pagetables.c:546:26: note: each undeclared identifier is reported only once for each function it appears in
scripts/Makefile.build:316: recipe for target 'arch/x86/mm/dump_pagetables.o' failed
make[2]: *** [arch/x86/mm/dump_pagetables.o] Error 1
scripts/Makefile.build:575: recipe for target 'arch/x86/mm' failed
make[1]: *** [arch/x86/mm] Error 2
make[1]: *** Waiting for unfinished jobs....
  CC      arch/x86/platform/intel/iosf_mbi.o

Ok, I guess I can disable some config option...
Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html



2018-01-19 11:08:38

by Jörg Rödel

[permalink] [raw]
Subject: Re: [RFC PATCH 00/16] PTI support for x86-32

Hey Pavel,

On Fri, Jan 19, 2018 at 11:55:28AM +0100, Pavel Machek wrote:
> Thanks for doing the work.
>
> I tried applying it on top of -next, and that did not succeed. Let me
> try Linus tree...

Thanks for your help with testing this patch-set, but I recommend
waiting for the next version, as review has already found a couple of
bugs that might crash your system. For example, there are NMI cases that
might crash your machine because the NMI happens in kernel mode before
the cr3 switch. VM86 mode is also definitely broken.

I am about to fix that and will send a new version, if all goes well, at
some point next week.


Thanks,

Joerg


2018-01-19 13:01:05

by Pavel Machek

[permalink] [raw]
Subject: Re: [RFC PATCH 00/16] PTI support for x86-32

On Fri 2018-01-19 12:07:26, Joerg Roedel wrote:
> Hey Pavel,
>
> On Fri, Jan 19, 2018 at 11:55:28AM +0100, Pavel Machek wrote:
> > Thanks for doing the work.
> >
> > I tried applying it on top of -next, and that did not succeed. Let me
> > try Linus tree...
>
> Thanks for your help with testing this patch-set, but I recommend
> waiting for the next version, as review has already found a couple of
> bugs that might crash your system. For example, there are NMI cases that
> might crash your machine because the NMI happens in kernel mode before
> the cr3 switch. VM86 mode is also definitely broken.

Thanks for the heads-up. I guess I can disable NMI and avoid VM86.

CONFIG_X86_PTDUMP_CORE should be responsible for the build failure.
Disabling it is not at all easy, as CONFIG_EMBEDDED selects CONFIG_EXPERT,
which selects CONFIG_DEBUG_KERNEL, which selects CONFIG_X86_PTDUMP_CORE.
(Crazy, if you ask me). You may want to test with that enabled. The patch
below might fix it. (Signed-off-by: me).

Tests so far: the kernel boots in qemu. The whole system boots on a thinkpad
T40p, and vulnerabilities/meltdown says "Mitigation: PTI", so I guess it
works.

Tested-by: me. :-)

Best regards,
Pavel


diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
index 2a4849e..896b53b 100644
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -543,7 +543,11 @@ EXPORT_SYMBOL_GPL(ptdump_walk_pgd_level_debugfs);
static void ptdump_walk_user_pgd_level_checkwx(void)
{
#ifdef CONFIG_PAGE_TABLE_ISOLATION
+#ifdef CONFIG_X86_64
pgd_t *pgd = (pgd_t *) &init_top_pgt;
+#else
+ pgd_t *pgd = swapper_pg_dir;
+#endif

if (!static_cpu_has(X86_FEATURE_PTI))
return;

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html



2018-01-19 16:32:36

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 02/16] x86/entry/32: Enter the kernel via trampoline stack

On Fri, Jan 19, 2018 at 1:55 AM, Joerg Roedel <[email protected]> wrote:
> Hey Andy,
>
> On Wed, Jan 17, 2018 at 10:10:23AM -0800, Andy Lutomirski wrote:
>> On Wed, Jan 17, 2018 at 1:18 AM, Joerg Roedel <[email protected]> wrote:
>
>> > Just read up on vm86 mode control transfers and the stack layout then.
>> > Looks like I need to check for eflags.vm=1 and copy four more registers
>> > from/to the entry stack. Thanks for pointing that out.
>>
>> You could just copy those slots unconditionally. After all, you're
>> slowing down entries by an epic amount due to writing CR3 on with PCID
>> off, so four words copied should be entirely lost in the noise. OTOH,
>> checking for VM86 mode is just a single bt against EFLAGS.
>>
>> With the modern (rewritten a year or two ago by Brian Gerst) vm86
>> code, all the slots (those actually in pt_regs) are in the same
>> location regardless of whether we're in VM86 mode or not, but we're
>> still fiddling with the bottom of the stack. Since you're controlling
>> the switch to the kernel thread stack, you can easily just write the
>> frame to the correct location, so you should not need to context
>> switch sp1 -- you can do it sanely and leave sp1 as the actual bottom
>> of the kernel stack no matter what. In fact, you could probably avoid
>> context switching sp0, either, which would be a nice cleanup.
>
> I am not sure what you mean by "not context switching sp0/sp1" ...

You're supposed to read what I meant, not what I said...

I meant that we could have sp0 have a genuinely constant value per
cpu. That means that the entry trampoline ends up with RIP, etc in a
different place depending on whether VM was in use, but the entry
trampoline code should be able to handle that. sp1 would have a value
that varies by task, but it could just point to the top of the stack
instead of being changed depending on whether VM is in use. Instead,
the entry trampoline would offset the registers as needed to keep
pt_regs in the right place.

I think you already figured all of that out, though :)

2018-01-21 20:13:56

by Nadav Amit

[permalink] [raw]
Subject: Re: [RFC PATCH 00/16] PTI support for x86-32

I am looking at PTI on x86-32, but I did not manage to get the PoC to work on
this setup (kaslr disabled; a similar setup works on 64-bit).

Did you use any PoC to “test” the protection?

Thanks,
Nadav


Joerg Roedel <[email protected]> wrote:

> [...]



2018-01-21 20:45:34

by Nadav Amit

[permalink] [raw]
Subject: Re: [RFC PATCH 00/16] PTI support for x86-32

Please ignore my previous email. I got it working… Sorry for the spam.


Nadav Amit <[email protected]> wrote:

> I am looking on PTI on x86-32, but I did not mange to get the PoC to work on
> this setup (kaslr disabled, similar setup works on 64-bit).
>
> Did you use any PoC to “test” the protection?
>
> Thanks,
> Nadav
>
>
> Joerg Roedel <[email protected]> wrote:
>
>> [...]



2018-01-21 23:47:08

by Nadav Amit

[permalink] [raw]
Subject: Re: [RFC PATCH 00/16] PTI support for x86-32

I wanted to see whether segment protection can be a replacement for PTI
(yes, excluding SMEP emulation), or whether speculative execution “ignores”
limit checks, similarly to the way paging protection is skipped.

It does seem that segmentation provides sufficient protection from Meltdown.
The “reliability” test of the Graz PoC fails if the segment limit is set to
prevent access to kernel memory. [ It passes if the limit is not set,
even if DS is reloaded. ] My test is enclosed below.

So my question: wouldn’t it be much more efficient to use segmentation
protection for x86-32, and allow users to choose whether they want SMEP-like
protection if needed (and then enable PTI)?

[ There might be some corner cases in which setting a segment limit
introduces a problem, for example when modify_ldt() is used to set an invalid
limit, but I presume that these are relatively uncommon, can be detected at
runtime, and PTI can then be used as a fallback mechanism. ]

Thanks,
Nadav

-- >8 --
Subject: [PATCH] Test segmentation protection

---
libkdump/libkdump.c | 31 +++++++++++++++++++++++++++++++
1 file changed, 31 insertions(+)

diff --git a/libkdump/libkdump.c b/libkdump/libkdump.c
index c590391..db5bac3 100644
--- a/libkdump/libkdump.c
+++ b/libkdump/libkdump.c
@@ -10,6 +10,9 @@
 #include <stdarg.h>
 #include <stdlib.h>
 #include <unistd.h>
+#include <assert.h>
+#include <sys/types.h>
+#include <asm/ldt.h>
 
 libkdump_config_t libkdump_auto_config = {0};
 
@@ -500,6 +503,31 @@ int __attribute__((optimize("-Os"), noinline)) libkdump_read_tsx() {
   return 0;
 }
 
+extern int modify_ldt(int, void*, unsigned long);
+
+void change_ds(void)
+{
+  int r;
+  struct user_desc desc = {
+    .entry_number = 1,
+    .base_addr = 0,
+#ifdef NO_SEGMENTS
+    .limit = 0xffffeu,
+#else
+    .limit = 0xbffffu,
+#endif
+    .seg_32bit = 1,
+    .contents = 0,
+    .read_exec_only = 0,
+    .limit_in_pages = 1,
+    .seg_not_present = 0,
+  };
+
+  r = modify_ldt(1 /* write */, &desc, sizeof(desc));
+  assert(r == 0);
+  asm volatile ("mov %0, %%ds\n\t" : : "r"((1 << 3) | (1 << 2) | 3));
+}
+
 // ---------------------------------------------------------------------------
 int __attribute__((optimize("-Os"), noinline)) libkdump_read_signal_handler() {
   size_t retries = config.retries + 1;
@@ -507,6 +535,9 @@ int __attribute__((optimize("-Os"), noinline)) libkdump_read_signal_handler() {
 
   while (retries--) {
     if (!setjmp(buf)) {
+      /* longjmp reloads the original DS... */
+      change_ds();
+
       MELTDOWN;
     }

Nadav Amit <[email protected]> wrote:

> Please ignore my previous email. I got it working… Sorry for the spam.
>
>
> Nadav Amit <[email protected]> wrote:
>
>> I am looking on PTI on x86-32, but I did not mange to get the PoC to work on
>> this setup (kaslr disabled, similar setup works on 64-bit).
>>
>> Did you use any PoC to “test” the protection?
>>
>> Thanks,
>> Nadav
>>
>>
>> Joerg Roedel <[email protected]> wrote:
>>
>>> [...]



2018-01-22 02:11:47

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC PATCH 00/16] PTI support for x86-32

On Sun, Jan 21, 2018 at 3:46 PM, Nadav Amit <[email protected]> wrote:
> I wanted to see whether segments protection can be a replacement for PTI
> (yes, excluding SMEP emulation), or whether speculative execution “ignores”
> limit checks, similarly to the way paging protection is skipped.
>
> It does seem that segmentation provides sufficient protection from Meltdown.
> The “reliability” test of Gratz PoC fails if the segment limit is set to
> prevent access to the kernel memory. [ It passes if the limit is not set,
> even if the DS is reloaded. ] My test is enclosed below.

Interesting. It might not be entirely reliable for all
microarchitectures, though.

> So my question: wouldn’t it be much more efficient to use segmentation
> protection for x86-32, and allow users to choose whether they want SMEP-like
> protection if needed (and then enable PTI)?

That's what we did long long ago, with user space segments actually
using the limit (in fact, if you go back far enough, the kernel even
used the base).

You'd have to make sure that the LDT loading etc do not allow CPL3
segments with base+limit past TASK_SIZE, so that people can't generate
their own. And the TLS segments also need to be limited (and
remember, the limit has to be TASK_SIZE-base, not just TASK_SIZE).
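
[ For concreteness, a minimal user-space sketch of that check, assuming a
3G/1G split; the names and the 3G constant are illustrative, this is not
actual kernel code: ]

#include <stdbool.h>
#include <stdio.h>

/* Illustrative only: 3G/1G split; the real bound is the kernel's TASK_SIZE. */
#define TASK_SIZE_SKETCH 0xC0000000UL

/*
 * A CPL3 descriptor would only be acceptable if base + limit stays below
 * TASK_SIZE, i.e. the limit may be at most TASK_SIZE - base, not just
 * TASK_SIZE.  The segment limit is the last valid byte offset, hence "- 1".
 */
static bool seg_fits_user_range(unsigned long base, unsigned long limit_bytes)
{
	if (base >= TASK_SIZE_SKETCH)
		return false;
	return limit_bytes <= TASK_SIZE_SKETCH - base - 1;
}

int main(void)
{
	printf("%d\n", seg_fits_user_range(0x00000000UL, 0xBFFFFFFFUL)); /* 1: fits  */
	printf("%d\n", seg_fits_user_range(0x40000000UL, 0xBFFFFFFFUL)); /* 0: too big */
	return 0;
}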

And we should check with Intel that segment limit checking really is
guaranteed to be done before any access.

Too bad x86-64 got rid of the segments ;)

Linus

2018-01-22 02:29:07

by Nadav Amit

[permalink] [raw]
Subject: Re: [RFC PATCH 00/16] PTI support for x86-32

Linus Torvalds <[email protected]> wrote:

> On Sun, Jan 21, 2018 at 3:46 PM, Nadav Amit <[email protected]> wrote:
>> I wanted to see whether segments protection can be a replacement for PTI
>> (yes, excluding SMEP emulation), or whether speculative execution “ignores”
>> limit checks, similarly to the way paging protection is skipped.
>>
>> It does seem that segmentation provides sufficient protection from Meltdown.
>> The “reliability” test of Gratz PoC fails if the segment limit is set to
>> prevent access to the kernel memory. [ It passes if the limit is not set,
>> even if the DS is reloaded. ] My test is enclosed below.
>
> Interesting. It might not be entirely reliable for all
> microarchitectures, though.
>
>> So my question: wouldn’t it be much more efficient to use segmentation
>> protection for x86-32, and allow users to choose whether they want SMEP-like
>> protection if needed (and then enable PTI)?
>
> That's what we did long long ago, with user space segments actually
> using the limit (in fact, if you go back far enough, the kernel even
> used the base).
>
> You'd have to make sure that the LDT loading etc do not allow CPL3
> segments with base+limit past TASK_SIZE, so that people can't generate
> their own. And the TLS segments also need to be limited (and
> remember, the limit has to be TASK_SIZE-base, not just TASK_SIZE).
>
> And we should check with Intel that segment limit checking really is
> guaranteed to be done before any access.

Thanks. I’ll try to check with VMware's (my employer) Intel liaison people,
but any feedback will be appreciated.

> Too bad x86-64 got rid of the segments ;)

Actually, as I noted in a different thread, running 32-bit binaries on
x86_64 in legacy mode, without PTI, performs considerably better than x86_64
binaries with PTI for the workloads that are hit the most (e.g., Redis). By
dynamically removing the 64-bit user CS from the GDT, this mode should be
safe, as long as the CS load is not done speculatively.

Regards,
Nadav

2018-01-22 02:41:11

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC PATCH 00/16] PTI support for x86-32

On January 21, 2018 6:11:07 PM PST, Linus Torvalds <[email protected]> wrote:
> On Sun, Jan 21, 2018 at 3:46 PM, Nadav Amit <[email protected]> wrote:
>> I wanted to see whether segments protection can be a replacement for PTI
>> (yes, excluding SMEP emulation), or whether speculative execution “ignores”
>> limit checks, similarly to the way paging protection is skipped.
>>
>> It does seem that segmentation provides sufficient protection from Meltdown.
>> The “reliability” test of Gratz PoC fails if the segment limit is set to
>> prevent access to the kernel memory. [ It passes if the limit is not set,
>> even if the DS is reloaded. ] My test is enclosed below.
>
> Interesting. It might not be entirely reliable for all
> microarchitectures, though.
>
>> So my question: wouldn’t it be much more efficient to use segmentation
>> protection for x86-32, and allow users to choose whether they want SMEP-like
>> protection if needed (and then enable PTI)?
>
> That's what we did long long ago, with user space segments actually
> using the limit (in fact, if you go back far enough, the kernel even
> used the base).
>
> You'd have to make sure that the LDT loading etc do not allow CPL3
> segments with base+limit past TASK_SIZE, so that people can't generate
> their own. And the TLS segments also need to be limited (and
> remember, the limit has to be TASK_SIZE-base, not just TASK_SIZE).
>
> And we should check with Intel that segment limit checking really is
> guaranteed to be done before any access.
>
> Too bad x86-64 got rid of the segments ;)
>
> Linus

No idea about Intel, but at least on Transmeta CPUs the limit check was asynchronous with the access.
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

2018-01-22 08:58:54

by Joerg Roedel

[permalink] [raw]
Subject: Re: [RFC PATCH 00/16] PTI support for x86-32

Hey Nadav,

On Sun, Jan 21, 2018 at 03:46:24PM -0800, Nadav Amit wrote:
> It does seem that segmentation provides sufficient protection from Meltdown.

Thanks for testing this. If this turns out to be true for all affected
uarchs, it would be a better way of protection than enabling
PTI.

But I'd like an official statement from Intel on that one, as their
recommended fix is still to use PTI.

And as you said, if it turns out that this works only on some Intel
uarchs, we can also detect that at runtime and then choose the fastest
Meltdown protection mechanism.


Thanks,

Joerg


2018-01-22 09:56:05

by David Laight

[permalink] [raw]
Subject: RE: [RFC PATCH 00/16] PTI support for x86-32

From: Nadav Amit
> Sent: 21 January 2018 23:46
>
> I wanted to see whether segments protection can be a replacement for PTI
> (yes, excluding SMEP emulation), or whether speculative execution “ignores”
> limit checks, similarly to the way paging protection is skipped.

That's made me remember something about segment limits applying in 64bit mode.
I really can't remember the details at all.
I'm sure it had something to do with one of the VM implementations restricting
memory accesses.

David

2018-01-22 10:04:57

by Joerg Roedel

[permalink] [raw]
Subject: Re: [RFC PATCH 00/16] PTI support for x86-32

On Mon, Jan 22, 2018 at 09:55:31AM +0000, David Laight wrote:
> That's made me remember something about segment limits applying in 64bit mode.
> I really can't remember the details at all.
> I'm sure it had something to do with one of the VM implementations restricting
> memory accesses.

Some AMD chips have long-mode segment limits, not sure if Intel has them
too. But they are useless here because the limit is 32-bit and can only
protect the upper 4GB of virtual address space. The limits also don't
apply to the GS and CS segments.


Regards,

Joerg


2018-01-22 10:12:44

by Joerg Roedel

[permalink] [raw]
Subject: Re: [PATCH 02/16] x86/entry/32: Enter the kernel via trampoline stack

Hey Andy,

On Fri, Jan 19, 2018 at 08:30:33AM -0800, Andy Lutomirski wrote:
> I meant that we could have sp0 have a genuinely constant value per
> cpu. That means that the entry trampoline ends up with RIP, etc in a
> different place depending on whether VM was in use, but the entry
> trampoline code should be able to handle that. sp1 would have a value
> that varies by task, but it could just point to the top of the stack
> instead of being changed depending on whether VM is in use. Instead,
> the entry trampoline would offset the registers as needed to keep
> pt_regs in the right place.
>
> I think you already figured all of that out, though :)

Yes, and after looking into it for a while, it would make a nice cleanup
for the entry code. On the other hand, it would change the layout of the
in-kernel 'struct pt_regs', so that the user-visible pt_regs ends up
with a different layout than the one we use in the kernel.

This can certainly all be worked out, but it makes this nice entry-code
cleanup not so nice and clean anymore. At least the work required to
make it work without breaking user space is beyond the scope of this
patch-set.


Regards,

Joerg


2018-01-22 17:47:45

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 02/16] x86/entry/32: Enter the kernel via trampoline stack

On Mon, Jan 22, 2018 at 2:11 AM, Joerg Roedel <[email protected]> wrote:
> Hey Andy,
>
> On Fri, Jan 19, 2018 at 08:30:33AM -0800, Andy Lutomirski wrote:
>> I meant that we could have sp0 have a genuinely constant value per
>> cpu. That means that the entry trampoline ends up with RIP, etc in a
>> different place depending on whether VM was in use, but the entry
>> trampoline code should be able to handle that. sp1 would have a value
>> that varies by task, but it could just point to the top of the stack
>> instead of being changed depending on whether VM is in use. Instead,
>> the entry trampoline would offset the registers as needed to keep
>> pt_regs in the right place.
>>
>> I think you already figured all of that out, though :)
>
> Yes, and after looking a while into it, it would make a nice cleanup for
> the entry code. On the other side, it would change the layout for the
> in-kernel 'struct pt_regs', so that the user-visible pt_regs ends up
> with a different layout than the one we use in the the kernel.

I don't think this is necessarily the case. We end up with four more
fields that are logically there at the end of pt_regs (which is
already kind-of-sort-of the case), but we don't actually need to put
them in struct pt_regs. We just end up with (regs + 1) != "top of
task stack", but even that has precedent -- it's already true for
tasks in vm86 mode.
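
[ For reference, a rough sketch of that vm86 precedent, modeled on
struct kernel_vm86_regs in arch/x86/include/asm/vm86.h; the stand-in
pt_regs layout below is illustrative, not the kernel's definition: ]

#include <stdint.h>

/* Stand-in for the kernel's 32-bit struct pt_regs (illustrative only). */
struct pt_regs_sketch {
	uint32_t bx, cx, dx, si, di, bp, ax;
	uint32_t ds, es, fs, gs;
	uint32_t orig_ax, ip, cs, flags, sp, ss;
};

/*
 * The extra vm86 segment slots live past the end of pt_regs, so for a
 * task in vm86 mode (regs + 1) is already not the top of the task stack.
 */
struct kernel_vm86_regs_sketch {
	struct pt_regs_sketch pt;	/* the normal frame */
	uint16_t es, __esh;		/* extra slots pushed for vm86 */
	uint16_t ds, __dsh;
	uint16_t fs, __fsh;
	uint16_t gs, __gsh;
};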

>
> This can certainly be all worked out, but it makes this nice entry-code
> cleanup not so nice and clean anymore. At least the work required to
> make it work without breaking user-space is not in the scope of this
> patch-set.

Agreed. This should probably be saved for later. Except that your
patch set still needs to come up with some way to function correctly
on vm86.

2018-01-22 20:20:57

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC PATCH 00/16] PTI support for x86-32

On Sun, Jan 21, 2018 at 6:20 PM, <[email protected]> wrote:
>
> No idea about Intel, but at least on Transmeta CPUs the limit check was asynchronous with the access.

Yes, but TMTA had a really odd uarch and didn't check segment limits natively.

When you do it in hardware, the limit check is actually fairly natural
to do early rather than late (since it acts on the effective address,
_before_ the base add and TLB lookup).

So it's not like it can't be done late, but there are reasons why a
traditional microarchitecture might always end up doing the limit
check early and so segmentation might be a good defense against
meltdown on 32-bit Intel.

Linus

2018-01-22 21:31:07

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC PATCH 00/16] PTI support for x86-32

On 01/22/18 12:14, Linus Torvalds wrote:
> On Sun, Jan 21, 2018 at 6:20 PM, <[email protected]> wrote:
>>
>> No idea about Intel, but at least on Transmeta CPUs the limit check was asynchronous with the access.
>
> Yes, but TMTA had a really odd uarch and didn't check segment limits natively.
>

Only on TM3000 ("Wilma") and TM5000 ("Fred"), not on TM8000 ("Astro").
Astro might in fact have been more synchronous than most modern machines
(see below.)

> When you do it in hardware. the limit check is actually fairly natural
> to do early rather than late (since it acts on the linear address
> _before_ base add and TLB lookup).
>
> So it's not like it can't be done late, but there are reasons why a
> traditional microarchitecture might always end up doing the limit
> check early and so segmentation might be a good defense against
> meltdown on 32-bit Intel.

I will try to investigate, but as you can imagine the amount of
bandwidth I might be able to get on this is definitely going to be limited.

All of the below is generic discussion that almost certainly can be
found in some form in Hennessy & Patterson, and so I don't have to worry
about giving away Intel secrets:

It isn't really true that it is natural to check this early. One of the
most fundamental frequency limiters in a modern CPU architecture
(meaning anything from the last 20 years or so) has been the
data-dependent AGU-D$-AGU loop. Note that this doesn't even include the
TLB: the TLB is looked up in parallel with the D$, and if the result was
*either* a cache-TLB mismatch or a TLB miss the result is prevented from
committing.

In the case of the x86, the AGU receives up to three sources plus the
segment base, and if possible given the target process and gates
available might be designed to have a unified 4-input adder, with the
3-input case for limit checks being done separately.

Misses, and even more so exceptions (which are far less frequent than
misses), are demoted to a slower path where the goal is to prevent commit
rather than trying to race to be in the data path. So although it is
natural to *issue* the load and the limit check at the same time, the
limit check is still going to be deferred. Whether or not it is
permitted to be fully asynchronous with the load is probably a tradeoff
of timing requirements vs complexity. At least theoretically one could
imagine a machine which would take the trap after the speculative
machine had already chased the pointer loop several levels down; this
would most likely mean separate uops to allow for the existing
out-of-order machine to do the bookkeeping.

-hpa

2018-01-23 14:44:38

by Alan Cox

[permalink] [raw]
Subject: Re: [RFC PATCH 00/16] PTI support for x86-32

> of timing requirements vs complexity. At least theoretically one could
> imagine a machine which would take the trap after the speculative
> machine had already chased the pointer loop several levels down; this
> would most likely mean separate uops to allow for the existing
> out-of-order machine to do the bookkeeping.

It's not quite the same, but in the IA-64 case you can write Itanium code
that does exactly that. The speculation is expressed in software, not
hardware (because you can trigger a load, then check later whether it worked
out and respond appropriately).
Alan

2018-01-23 15:01:13

by Alan Cox

[permalink] [raw]
Subject: Re: [RFC PATCH 00/16] PTI support for x86-32

On Mon, 22 Jan 2018 09:56:25 +0100
Joerg Roedel <[email protected]> wrote:

> Hey Nadav,
>
> On Sun, Jan 21, 2018 at 03:46:24PM -0800, Nadav Amit wrote:
> > It does seem that segmentation provides sufficient protection from Meltdown.
>
> Thanks for testing this, if this turns out to be true for all affected
> uarchs it would be a great and better way of protection than enabling
> PTI.
>
> But I'd like an official statement from Intel on that one, as their
> recommended fix is still to use PTI.
>
> And as you said, if it turns out that this works only on some Intel
> uarchs, we can also detect it at runtime and then chose the fasted
> meltdown protection mechanism.

I'll follow this up and get an official statement.

Alan

2018-01-24 19:04:33

by Krzysztof Mazur

[permalink] [raw]
Subject: Re: [RFC PATCH 00/16] PTI support for x86-32

On Tue, Jan 16, 2018 at 05:36:43PM +0100, Joerg Roedel wrote:
> From: Joerg Roedel <[email protected]>
>
> Hi,
>
> here is my current WIP code to enable PTI on x86-32. It is
> still in a pretty early state, but it successfully boots my
> KVM guest with PAE and with legacy paging. The existing PTI
> code for x86-64 already prepares a lot of the stuff needed
> for 32 bit too, thanks for that to all the people involved
> in its development :)

Hi,

I've waited for these patches for a long time, until I tried to
exploit Meltdown on some old 32-bit CPUs and failed. The Pentium M
seems to speculatively execute the second load with eax
always equal to 0:

movzx (%[addr]), %%eax
shl $12, %%eax
movzx (%[target], %%eax), %%eax

And on a Pentium 4-based Xeon the second load seems to never be executed,
even without the shift (shifts are slow on some or all Pentium 4s). Maybe
not all P6 and Netburst CPUs are affected, but I'm not sure. Maybe the
kernel, at least on 32-bit, should try to exploit Meltdown to test whether
the CPU is really affected.


The series boots on the Pentium M (and crashes when I use perf, but
that is an already known issue). However, I don't like
the performance regression with CONFIG_PAGE_TABLE_ISOLATION=n
(about 7.2%). Trivial "benchmark":

--- cut here ---
#include <unistd.h>
#include <fcntl.h>

int main(void)
{
unsigned long i;
int fd;

fd = open("/dev/null", O_WRONLY);
for (i = 0; i < 10000000; i++) {
char x = 0;
write(fd, &x, 1);
}
return 0;
}
--- cut here ---

Time (on Pentium M 1.73 GHz):

baseline (4.15.0-rc8-gdda3e152):         2.415 s  (+/- 0.64%)
patched, CONFIG_PAGE_TABLE_ISOLATION=n:  2.588 s  (+/- 0.01%)
patched, nopti:                          2.597 s  (+/- 0.31%)
patched, pti:                           18.272 s
(some older kernel, pre 4.15):           2.378 s

Thanks,
Krzysiek
--
perf results:

baseline:

Performance counter stats for './bench' (5 runs):

2401.539139 task-clock:HG # 0.995 CPUs utilized ( +- 0.23% )
23 context-switches:HG # 0.009 K/sec ( +- 4.02% )
0 cpu-migrations:HG # 0.000 K/sec
30 page-faults:HG # 0.013 K/sec ( +- 1.24% )
4142375834 cycles:HG # 1.725 GHz ( +- 0.23% ) [39.99%]
385110908 stalled-cycles-frontend:HG # 9.30% frontend cycles idle ( +- 0.06% ) [40.01%]
<not supported> stalled-cycles-backend:HG
4142489274 instructions:HG # 1.00 insns per cycle
# 0.09 stalled cycles per insn ( +- 0.00% ) [40.00%]
802270380 branches:HG # 334.065 M/sec ( +- 0.00% ) [40.00%]
34278 branch-misses:HG # 0.00% of all branches ( +- 1.94% ) [40.00%]

2.414741497 seconds time elapsed ( +- 0.64% )

patched, CONFIG_PAGE_TABLE_ISOLATION=n:

Performance counter stats for './bench' (5 runs):

2587.026405 task-clock:HG # 1.000 CPUs utilized ( +- 0.01% )
28 context-switches:HG # 0.011 K/sec ( +- 5.95% )
0 cpu-migrations:HG # 0.000 K/sec
31 page-faults:HG # 0.012 K/sec ( +- 1.21% )
4462401079 cycles:HG # 1.725 GHz ( +- 0.01% ) [39.98%]
388646121 stalled-cycles-frontend:HG # 8.71% frontend cycles idle ( +- 0.05% ) [40.01%]
<not supported> stalled-cycles-backend:HG
4283638646 instructions:HG # 0.96 insns per cycle
# 0.09 stalled cycles per insn ( +- 0.00% ) [40.03%]
822484311 branches:HG # 317.927 M/sec ( +- 0.00% ) [40.01%]
39372 branch-misses:HG # 0.00% of all branches ( +- 2.33% ) [39.98%]

2.587818354 seconds time elapsed ( +- 0.01% )

2018-01-25 17:13:19

by Alan Cox

[permalink] [raw]
Subject: Re: [RFC PATCH 00/16] PTI support for x86-32

On Mon, 22 Jan 2018 09:56:25 +0100
Joerg Roedel <[email protected]> wrote:

> Hey Nadav,
>
> On Sun, Jan 21, 2018 at 03:46:24PM -0800, Nadav Amit wrote:
> > It does seem that segmentation provides sufficient protection from Meltdown.
>
> Thanks for testing this, if this turns out to be true for all affected
> uarchs it would be a great and better way of protection than enabling
> PTI.
>
> But I'd like an official statement from Intel on that one, as their
> recommended fix is still to use PTI.

It is: we don't think segmentation works on all processors as a defence.

Alan

2018-01-25 22:10:37

by Nadav Amit

[permalink] [raw]
Subject: Re: [RFC PATCH 00/16] PTI support for x86-32

Krzysztof Mazur <[email protected]> wrote:

> On Tue, Jan 16, 2018 at 05:36:43PM +0100, Joerg Roedel wrote:
>> From: Joerg Roedel <[email protected]>
>>
>> Hi,
>>
>> here is my current WIP code to enable PTI on x86-32. It is
>> still in a pretty early state, but it successfully boots my
>> KVM guest with PAE and with legacy paging. The existing PTI
>> code for x86-64 already prepares a lot of the stuff needed
>> for 32 bit too, thanks for that to all the people involved
>> in its development :)
>
> Hi,
>
> I've waited for this patches for a long time, until I've tried to
> exploit meltdown on some old 32-bit CPUs and failed. Pentium M
> seems to speculatively execute the second load with eax
> always equal to 0:
>
> movzx (%[addr]), %%eax
> shl $12, %%eax
> movzx (%[target], %%eax), %%eax
>
> And on Pentium 4-based Xeon the second load seems to be never executed,
> even without shift (shifts are slow on some or all Pentium 4's). Maybe
> not all P6 and Netbursts CPUs are affected, but I'm not sure. Maybe the
> kernel, at least on 32-bit, should try to exploit meltdown to test if
> the CPU is really affected.

The PoC apparently does not work with 3GB of memory or more on 32-bit. Does
your setup have more? Can you try the attack while setting max_addr=1G?

Thanks,
Nadav

2018-01-26 09:29:58

by Krzysztof Mazur

[permalink] [raw]
Subject: Re: [RFC PATCH 00/16] PTI support for x86-32

On Thu, Jan 25, 2018 at 02:09:40PM -0800, Nadav Amit wrote:
> The PoC apparently does not work with 3GB of memory or more on 32-bit. Does
> you setup has more? Can you try the attack while setting max_addr=1G ?

No, I tested on:

Pentium M (Dothan): 1.5 GB RAM, PAE for NX, 2GB/2GB split

CONFIG_NOHIGHMEM=y
CONFIG_VMSPLIT_2G=y
CONFIG_PAGE_OFFSET=0x80000000
CONFIG_X86_PAE=y

and

Xeon (Pentium 4): 2 GB RAM, no PAE, 1.75GB/2.25GB split
CONFIG_NOHIGHMEM=y
CONFIG_VMSPLIT_2G_OPT=y
CONFIG_PAGE_OFFSET=0x78000000


Now I'm testing with standard settings on
Pentium M: 1.5 GB RAM, no PAE, 3GB/1GB split, ~890 MB RAM available

CONFIG_NOHIGHMEM=y
CONFIG_PAGE_OFFSET=0xc0000000
CONFIG_X86_PAE=n

and it still does not work.

The reliability test from https://github.com/IAIK/meltdown reports 0.38%
(1/256 = 0.39%, i.e. "true" random), and the other libkdump tools do not work.

https://github.com/paboldin/meltdown-exploit (on linux_proc_banner
symbol) reports:
cached = 46, uncached = 515, threshold 153
read c0897020 = ff (score=0/1000)
read c0897021 = ff (score=0/1000)
read c0897022 = ff (score=0/1000)
read c0897023 = ff (score=0/1000)
read c0897024 = ff (score=0/1000)
NOT VULNERABLE

and my exploit with:

	for (i = 0; i < 256; i++) {
		unsigned char *px = p + (i << 12);

		t = rdtsc();
		readb(px);
		t = rdtsc() - t;
		if (t < 100)
			printf("%02x %lld\n", i, t);
	}
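
[ The rdtsc()/readb() helpers are not shown here; presumably they are thin
wrappers along the lines of the following sketch, which is an assumption,
not the actual test code: ]

#include <stdint.h>

/* Assumed helpers, not from the original test. */
static inline uint64_t rdtsc(void)
{
	uint32_t lo, hi;

	/* lfence keeps the timestamp read from reordering around the probe. */
	asm volatile("lfence\n\trdtsc\n\tlfence"
		     : "=a"(lo), "=d"(hi) : : "memory");
	return ((uint64_t)hi << 32) | lo;
}

static inline uint8_t readb(const volatile void *p)
{
	return *(const volatile uint8_t *)p;
}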

The loop above returns only "00 45". When I change the exploit code (now
based on the paboldin code, to be sure) to:

movzx (%[addr]), %%eax
movl $0xaa, %%eax
shl $12, %%eax
movzx (%[target], %%eax), %%eax

I always get "0xaa 51", so the CPU is speculatively executing the second
load with (0xaa << 12) in eax, and without the movl instruction, eax seems
to be always 0. I even tried to remove the shift:

movzx (%[addr]), %%eax
movzx (%[target], %%eax), %%eax

and I've been reading a known value (from /dev/mem, for instance 0x20) and
modifying the target array offset, and the CPU is still touching the "wrong"
cache line, with eax == 0 instead of 0x20. I've also tested movl instead
of movzx (with an and $0xff).


On Core 2 Quad in 64-bit mode everything works as expected, vulnerable
to Meltdown (I did not test it in 32-bit mode). I don't have any Core
"1" to test.

On that Pentium M the syscall slowdown caused by PTI is huge: 7.5 times slower
(7 times compared to the patched kernel with PTI disabled). On Skylake with
PCID the same trivial benchmark is "only" 3.5 times slower (and 5.2
times slower without PCID).

Krzysiek

2018-01-26 12:38:35

by Joerg Roedel

[permalink] [raw]
Subject: Re: [RFC PATCH 00/16] PTI support for x86-32

Hi Alan,

On Thu, Jan 25, 2018 at 05:09:25PM +0000, Alan Cox wrote:
> On Mon, 22 Jan 2018 09:56:25 +0100
> Joerg Roedel <[email protected]> wrote:
>
> > Hey Nadav,
> >
> > On Sun, Jan 21, 2018 at 03:46:24PM -0800, Nadav Amit wrote:
> > > It does seem that segmentation provides sufficient protection from Meltdown.
> >
> > Thanks for testing this, if this turns out to be true for all affected
> > uarchs it would be a great and better way of protection than enabling
> > PTI.
> >
> > But I'd like an official statement from Intel on that one, as their
> > recommended fix is still to use PTI.
>
> It is: we don't think segmentation works on all processors as a defence.

Thanks for checking and for the official statement. So the official
mitigation recommendation is still to use PTI.


Regards,

Joerg