2017-12-02 06:20:17

by Andy Lutomirski

Subject: [PATCH 0/2] KPTI: get rid of cpu_entry_area mapping duplication

I like this variant much better. It might also fix the nasty bug tglx
and peterz were chasing.

Andy Lutomirski (2):
Undo the split of setup_cpu_entry_area
x86/kpti: Reference all cpu_entry_area pagetables in the usermode tables

arch/x86/include/asm/fixmap.h | 14 +++++---
arch/x86/include/asm/kpti.h | 8 +++--
arch/x86/kernel/cpu/common.c | 3 --
arch/x86/kernel/smpboot.c | 2 ++
arch/x86/kernel/traps.c | 6 +++-
arch/x86/mm/kpti.c | 82 ++++++++++++++++++++++++++-----------------
6 files changed, 71 insertions(+), 44 deletions(-)

--
2.13.6


2017-12-02 06:20:20

by Andy Lutomirski

Subject: [PATCH 2/2] x86/kpti: Reference all cpu_entry_area pagetables in the usermode tables

We were manually configuring cpu_entry_area in the usermode tables.
This was error-prone and wasted memory. (Not much memory, but
still.) Instead, just reference the same pagetables.

This avoids needing to keep the KPTI code and the normal
cpu_entry_area code in sync, since the KPTI code no longer cares
what's in cpu_entry_area.

[This does *not* work on the current KPTI series. It requires that
all the kernelmode cpu_entry_tables are pre-allocated. That
happens in the series as I submitted it, but tglx changed it for
reasons that I haven't figured out.]

Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/include/asm/fixmap.h | 14 +++++---
arch/x86/include/asm/kpti.h | 8 +++--
arch/x86/kernel/cpu/common.c | 3 --
arch/x86/mm/kpti.c | 82 ++++++++++++++++++++++++++-----------------
4 files changed, 64 insertions(+), 43 deletions(-)

diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
index 839addd1eaec..a630cd2861f7 100644
--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -142,16 +142,20 @@ enum fixed_addresses {
#ifdef CONFIG_PARAVIRT
FIX_PARAVIRT_BOOTMAP,
#endif
- FIX_TEXT_POKE1, /* reserve 2 pages for text_poke() */
- FIX_TEXT_POKE0, /* first page is last, because allocation is backward */
#ifdef CONFIG_X86_INTEL_MID
FIX_LNW_VRTC,
#endif
- /* Fixmap entries to remap the GDTs, one per processor. */
- FIX_CPU_ENTRY_AREA_TOP,
+ FIX_TEXT_POKE1, /* reserve 2 pages for text_poke() */
+ FIX_TEXT_POKE0, /* first page is last, because allocation is backward */
+
+ /*
+ * Fixmap entries to remap the GDTs, one per processor. Align
+ * to a PMD boundary.
+ */
+ FIX_CPU_ENTRY_AREA_TOP = round_up(FIX_TEXT_POKE0 + 1, PTRS_PER_PMD),
FIX_CPU_ENTRY_AREA_BOTTOM = FIX_CPU_ENTRY_AREA_TOP + (CPU_ENTRY_AREA_PAGES * NR_CPUS) - 1,

- __end_of_permanent_fixed_addresses,
+ __end_of_permanent_fixed_addresses = round_up(FIX_CPU_ENTRY_AREA_BOTTOM + 1, PTRS_PER_PMD),

/*
* 512 temporary boot-time mappings, used by early_ioremap(),
diff --git a/arch/x86/include/asm/kpti.h b/arch/x86/include/asm/kpti.h
index 0c10e86ae3f8..df52cec2a53b 100644
--- a/arch/x86/include/asm/kpti.h
+++ b/arch/x86/include/asm/kpti.h
@@ -1,5 +1,8 @@
#ifndef _ASM_X86_KPTI_H
#define _ASM_X86_KPTI_H
+
+#include <linux/init.h>
+
/*
* Copyright(c) 2017 Intel Corporation. All rights reserved.
*
@@ -34,10 +37,9 @@ extern int kpti_add_mapping(unsigned long addr, unsigned long size,
unsigned long flags);

/**
- * kpti_add_mapping_cpu_entry - map the cpu entry area
- * @cpu: the CPU for which the entry area is being mapped
+ * kpti_clone_cpu_entry_areas - clone cpu_entry_areas to the usermode tables
*/
-extern void kpti_add_mapping_cpu_entry(int cpu);
+extern void __init kpti_clone_cpu_entry_areas(void);

/**
* kpti_remove_mapping - remove a kernel mapping from the userpage tables
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 00697119f983..3dc814519c92 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -606,9 +606,6 @@ void __init setup_cpu_entry_area(int cpu)
sizeof(struct debug_store) / PAGE_SIZE,
PAGE_KERNEL);
#endif
- /* CPU 0's mapping is done in kpti_init() */
- if (cpu)
- kpti_add_mapping_cpu_entry(cpu);
}

/* Load the original GDT from the per-cpu structure */
diff --git a/arch/x86/mm/kpti.c b/arch/x86/mm/kpti.c
index 52fd833845ba..cd81a7432f49 100644
--- a/arch/x86/mm/kpti.c
+++ b/arch/x86/mm/kpti.c
@@ -240,7 +240,7 @@ static pmd_t *kpti_shadow_pagetable_walk_pmd(unsigned long address,
* Returns a pointer to a PTE on success, or NULL on failure.
*/
static pte_t *kpti_shadow_pagetable_walk(unsigned long address,
- unsigned long flags)
+ unsigned long flags)
{
pmd_t *pmd = kpti_shadow_pagetable_walk_pmd(address, flags);
pte_t *pte;
@@ -401,28 +401,55 @@ static void __init kpti_init_all_pgds(void)
WARN_ON(__ret); \
} while (0)

-void kpti_add_mapping_cpu_entry(int cpu)
+void __init kpti_clone_cpu_entry_areas(void)
{
- kpti_add_user_map_early(get_cpu_gdt_ro(cpu), PAGE_SIZE,
- __PAGE_KERNEL_RO);
-
- kpti_add_user_map_early(&get_cpu_entry_area(cpu)->tss,
- sizeof(get_cpu_entry_area(cpu)->tss),
- __PAGE_KERNEL | _PAGE_GLOBAL);
-
- /* entry stack */
- kpti_add_user_map_early(&get_cpu_entry_area(cpu)->SYSENTER_stack_page,
- sizeof(get_cpu_entry_area(cpu)->SYSENTER_stack_page),
- __PAGE_KERNEL | _PAGE_GLOBAL);
-
- /* Entry code, so needs to be EXEC */
- kpti_add_user_map_early(&get_cpu_entry_area(cpu)->entry_trampoline,
- sizeof(get_cpu_entry_area(cpu)->entry_trampoline),
- __PAGE_KERNEL_RX | _PAGE_GLOBAL);
-
- kpti_add_user_map_early(&get_cpu_entry_area(cpu)->exception_stacks,
- sizeof(get_cpu_entry_area(cpu)->exception_stacks),
- __PAGE_KERNEL | _PAGE_GLOBAL);
+ int cpu;
+ unsigned long last_pmd_addr = 0;
+
+ /* The top of the cpu_entry_area block is meant to be PMD-aligned. */
+ WARN_ON((unsigned long)(get_cpu_entry_area(NR_CPUS-1) + 1) & ~PMD_MASK);
+
+ /*
+ * Iterate over possible CPUs, not addresses: it's possible that
+ * NR_CPUS is enough larger than the actual number of possible CPUs
+ * that we have unpopulated PMDs in the cpu_entry_area range.
+ */
+ for_each_possible_cpu(cpu) {
+ pgd_t *pgd;
+ p4d_t *p4d;
+ pud_t *pud;
+ pmd_t *pmd, *target_pmd;
+ unsigned long addr =
+ (unsigned long)get_cpu_entry_area(cpu) & PMD_MASK;
+
+ if (addr == last_pmd_addr)
+ continue;
+ last_pmd_addr = addr;
+
+ pgd = pgd_offset_k(addr);
+ if (WARN_ON(pgd_none(*pgd)))
+ return;
+ p4d = p4d_offset(pgd, addr);
+ if (WARN_ON(p4d_none(*p4d)))
+ return;
+ pud = pud_offset(p4d, addr);
+ if (WARN_ON(pud_none(*pud)))
+ return;
+ pmd = pmd_offset(pud, addr);
+ if (WARN_ON(pmd_none(*pmd)))
+ return;
+
+ target_pmd = kpti_shadow_pagetable_walk_pmd(addr, 0);
+ if (WARN_ON(!target_pmd))
+ return;
+
+ /*
+ * Copy the PMD. That is, the kernelmode and usermode tables
+ * will share all last-level page tables containing
+ * cpu_entry_area mappings.
+ */
+ *target_pmd = *pmd;
+ }
}

/*
@@ -459,16 +486,7 @@ void __init kpti_init(void)
sizeof(gate_desc) * NR_VECTORS,
__PAGE_KERNEL_RO | _PAGE_GLOBAL);

- /*
- * We delay CPU 0's mappings because these structures are created
- * before the page allocator is up. Deferring it until here lets
- * us use the plain page allocator unconditionally in the page
- * table code above.
- *
- * This is OK because kpti_init() is called long before we ever run
- * userspace and need the KERNEL_PAGE_TABLE_ISOLATION mappings.
- */
- kpti_add_mapping_cpu_entry(0);
+ kpti_clone_cpu_entry_areas();
}

int kpti_add_mapping(unsigned long addr, unsigned long size,
--
2.13.6

2017-12-02 06:20:40

by Andy Lutomirski

Subject: [PATCH 1/2] Undo the split of setup_cpu_entry_area

This is obviously a hack. Either the patch should be adjusted back to
the version I sent or trap_init should forcibly initialize all PMDs
by something like __set_fixmap(..., __mkpte(0)); or however it's spelled.
---
arch/x86/kernel/smpboot.c | 2 ++
arch/x86/kernel/traps.c | 6 +++++-
2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 26317716559d..1325844b930a 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1232,8 +1232,10 @@ void __init native_smp_prepare_cpus(unsigned int max_cpus)
* The boot CPU area has been set up in trap_init()
* already.
*/
+ /*
if (i)
setup_cpu_entry_area(i);
+ */
}

/*
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index ff4e6b595ae4..0ad92f35a75b 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -945,8 +945,12 @@ dotraplinkage void do_iret_error(struct pt_regs *regs, long error_code)

void __init trap_init(void)
{
+ int cpu;
/* Init cpu_entry_area before IST entries are set up */
- setup_cpu_entry_area(smp_processor_id());
+ for_each_possible_cpu(cpu) {
+ pr_err("setup_cpu_entry_area(%d)\n", cpu);
+ setup_cpu_entry_area(cpu);
+ }

idt_setup_traps();

--
2.13.6

2017-12-02 10:34:21

by Thomas Gleixner

Subject: Re: [PATCH 1/2] Undo the split of setup_cpu_entry_area

On Fri, 1 Dec 2017, Andy Lutomirski wrote:

> This is obviously a hack. Either the patch should be adjusted back to
> the version I sent or trap_init should forcibly initialize all PMDs
> by something like __set_fixmap(..., __mkpte(0)); or however it's spelled.

I split it because the whole thing crashed when I kept the loop you had
because it tried to allocate stuff. Had no time to figure out why, so I
went the lazy way of making it "work".

Simplifying the whole mess was on my list anyway.

Thanks,

tglx

2017-12-02 10:38:13

by Thomas Gleixner

Subject: Re: [PATCH 2/2] x86/kpti: Reference all cpu_entry_area pagetables in the usermode tables

On Fri, 1 Dec 2017, Andy Lutomirski wrote:

> We were manually configuring cpu_entry_area in the usermode tables.
> This was error-prone and wasted memory. (Not much memory, but
> still.) Instead, just reference the same pagetables.
>
> This avoids needing to keep the KPTI code and the normal
> cpu_entry_area code in sync, since the KPTI code no longer cares
> what's in cpu_entry_area.
>
> [This does *not* work on the current KPTI series. It requires that
> all the kernelmode cpu_entry_tables are pre-allocated. That
> happens in the series as I submitted it, but tglx changed it for
> reasons that I haven't figured out.]

As I said it crashed and burned for yet unknown reasons. I'll dig into
that.

Thanks,

tglx

2017-12-02 10:47:19

by Thomas Gleixner

Subject: Re: [PATCH 1/2] Undo the split of setup_cpu_entry_area

On Sat, 2 Dec 2017, Thomas Gleixner wrote:

> On Fri, 1 Dec 2017, Andy Lutomirski wrote:
>
> > This is obviously a hack. Either the patch should be adjusted back to
> > the version I sent or trap_init should forcibly initialize all PMDs
> > by something like __set_fixmap(..., __mkpte(0)); or however it's spelled.
>
> I split it because the whole thing crashed when I kept the loop you had
> because it tried to allocate stuff. Had no time to figure out why, so I
> went the lazy way of making it "work".

[ 0.000000] dump_stack+0x85/0xc5
[ 0.000000] warn_alloc+0x114/0x1c0
[ 0.000000] __alloc_pages_slowpath+0x1089/0x10d0
[ 0.000000] __alloc_pages_nodemask+0x2e8/0x370
[ 0.000000] __get_free_pages+0x10/0x40
[ 0.000000] kpti_shadow_pagetable_walk+0x2b2/0x3e0
[ 0.000000] kpti_add_user_map+0xfe/0x330
[ 0.000000] kpti_add_mapping_cpu_entry+0x5a/0x100
[ 0.000000] trap_init+0x2c/0x7b
[ 0.000000] start_kernel+0x24c/0x497
[ 0.000000] secondary_startup_64+0xa5/0xb0

Cute, isn't it? And then further down the line it triplefaults of course.

2017-12-02 13:34:23

by Andy Lutomirski

Subject: Re: [PATCH 1/2] Undo the split of setup_cpu_entry_area



> On Dec 2, 2017, at 2:47 AM, Thomas Gleixner <[email protected]> wrote:
>
>> On Sat, 2 Dec 2017, Thomas Gleixner wrote:
>>
>>> On Fri, 1 Dec 2017, Andy Lutomirski wrote:
>>>
>>> This is obviously a hack. Either the patch should be adjusted back to
>>> the version I sent or trap_init should forcibly initialize all PMDs
>>> by something like __set_fixmap(..., __mkpte(0)); or however it's spelled.
>>
>> I split it because the whole thing crashed when I kept the loop you had
>> because it tried to allocate stuff. Had no time to figure out why, so I
>> went the lazy way of making it "work".
>
> [ 0.000000] dump_stack+0x85/0xc5
> [ 0.000000] warn_alloc+0x114/0x1c0
> [ 0.000000] __alloc_pages_slowpath+0x1089/0x10d0
> [ 0.000000] __alloc_pages_nodemask+0x2e8/0x370
> [ 0.000000] __get_free_pages+0x10/0x40
> [ 0.000000] kpti_shadow_pagetable_walk+0x2b2/0x3e0
> [ 0.000000] kpti_add_user_map+0xfe/0x330
> [ 0.000000] kpti_add_mapping_cpu_entry+0x5a/0x100
> [ 0.000000] trap_init+0x2c/0x7b
> [ 0.000000] start_kernel+0x24c/0x497
> [ 0.000000] secondary_startup_64+0xa5/0xb0
>
> Cute, isn't it? And then further down the line it triplefaults of course.

Hmm. My second patch should make this go away, though.

2017-12-02 14:20:40

by Thomas Gleixner

Subject: Re: [PATCH 1/2] Undo the split of setup_cpu_entry_area

On Sat, 2 Dec 2017, Andy Lutomirski wrote:

>
>
> > On Dec 2, 2017, at 2:47 AM, Thomas Gleixner <[email protected]> wrote:
> >
> >> On Sat, 2 Dec 2017, Thomas Gleixner wrote:
> >>
> >>> On Fri, 1 Dec 2017, Andy Lutomirski wrote:
> >>>
> >>> This is obviously a hack. Either the patch should be adjusted back to
> >>> the version I sent or trap_init should forcibly initialize all PMDs
> >>> by something like __set_fixmap(..., __mkpte(0)); or however it's spelled.
> >>
> >> I split it because the whole thing crashed when I kept the loop you had
> >> because it tried to allocate stuff. Had no time to figure out why, so I
> >> went the lazy way of making it "work".
> >
> > [ 0.000000] dump_stack+0x85/0xc5
> > [ 0.000000] warn_alloc+0x114/0x1c0
> > [ 0.000000] __alloc_pages_slowpath+0x1089/0x10d0
> > [ 0.000000] __alloc_pages_nodemask+0x2e8/0x370
> > [ 0.000000] __get_free_pages+0x10/0x40
> > [ 0.000000] kpti_shadow_pagetable_walk+0x2b2/0x3e0
> > [ 0.000000] kpti_add_user_map+0xfe/0x330
> > [ 0.000000] kpti_add_mapping_cpu_entry+0x5a/0x100
> > [ 0.000000] trap_init+0x2c/0x7b
> > [ 0.000000] start_kernel+0x24c/0x497
> > [ 0.000000] secondary_startup_64+0xa5/0xb0
> >
> > Cute, isn't it? And then further down the line it triplefaults of course.
>
> Hmm. My second patch should make this go away, though.

Yes, it's all a mess and I'm cleaning it up. Fighting with the PEBS/BTS
buffers right now.

Thanks,

tglx