2017-04-06 14:01:51

by Kirill A. Shutemov

Subject: [PATCH 0/8] x86: 5-level paging enabling for v4.12, Part 4

Here's the fourth and last bunch of patches that bring initial
5-level paging enabling.

Please review and consider applying.

As Ingo requested, I've tried to rewrite the assembly parts of the boot
process in C before bringing in 5-level paging support. The only part
where I succeeded is startup_64 in arch/x86/kernel/head_64.S. Most of
its logic is now in C.

I failed to rewrite startup_32 in arch/x86/boot/compressed/head_64.S in C.
The code I need to modify is still in 32-bit mode, but if I moved it to C
it would be compiled as 64-bit. I've tried to move it into a separate
translation unit and compile it with -m32, but then the linking phase
fails due to an object-file type mismatch.

I also have trouble with rewriting secondary_startup_64. The stack breaks
as soon as we switch to the new page tables when onlining secondary CPUs.
I don't know how to get around this.

I hope it's not a show-stopper.

If you know how to get around these issues, let me know.

Kirill A. Shutemov (8):
x86/boot/64: Rewrite startup_64 in C
x86/boot/64: Rename init_level4_pgt and early_level4_pgt
x86/boot/64: Add support of additional page table level during early
boot
x86/mm: Add sync_global_pgds() for configuration with 5-level paging
x86/mm: Make kernel_physical_mapping_init() support 5-level paging
x86/mm: Add support for 5-level paging for KASLR
x86: Enable 5-level paging support
x86/mm: Allow to have userspace mappings above 47-bits

arch/x86/Kconfig | 5 +
arch/x86/boot/compressed/head_64.S | 23 ++++-
arch/x86/include/asm/elf.h | 2 +-
arch/x86/include/asm/mpx.h | 9 ++
arch/x86/include/asm/pgtable.h | 2 +-
arch/x86/include/asm/pgtable_64.h | 6 +-
arch/x86/include/asm/processor.h | 9 +-
arch/x86/include/uapi/asm/processor-flags.h | 2 +
arch/x86/kernel/espfix_64.c | 2 +-
arch/x86/kernel/head64.c | 137 +++++++++++++++++++++++++---
arch/x86/kernel/head_64.S | 132 ++++++---------------------
arch/x86/kernel/machine_kexec_64.c | 2 +-
arch/x86/kernel/sys_x86_64.c | 28 +++++-
arch/x86/mm/dump_pagetables.c | 2 +-
arch/x86/mm/hugetlbpage.c | 27 +++++-
arch/x86/mm/init_64.c | 104 +++++++++++++++++++--
arch/x86/mm/kasan_init_64.c | 12 +--
arch/x86/mm/kaslr.c | 81 ++++++++++++----
arch/x86/mm/mmap.c | 2 +-
arch/x86/mm/mpx.c | 33 ++++++-
arch/x86/realmode/init.c | 2 +-
arch/x86/xen/Kconfig | 1 +
arch/x86/xen/mmu.c | 18 ++--
arch/x86/xen/xen-pvh.S | 2 +-
24 files changed, 463 insertions(+), 180 deletions(-)

--
2.11.0


2017-04-06 14:01:25

by Kirill A. Shutemov

Subject: [PATCH 2/8] x86/boot/64: Rename init_level4_pgt and early_level4_pgt

With CONFIG_X86_5LEVEL=y, level 4 is no longer top level of page tables.

Let's give these variables more generic names: init_top_pgt and
early_top_pgt.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/include/asm/pgtable.h | 2 +-
arch/x86/include/asm/pgtable_64.h | 4 ++--
arch/x86/kernel/espfix_64.c | 2 +-
arch/x86/kernel/head64.c | 18 +++++++++---------
arch/x86/kernel/head_64.S | 14 +++++++-------
arch/x86/kernel/machine_kexec_64.c | 2 +-
arch/x86/mm/dump_pagetables.c | 2 +-
arch/x86/mm/kasan_init_64.c | 12 ++++++------
arch/x86/realmode/init.c | 2 +-
arch/x86/xen/mmu.c | 18 +++++++++---------
arch/x86/xen/xen-pvh.S | 2 +-
11 files changed, 39 insertions(+), 39 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 942482ac36a8..77037b6f1caa 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -922,7 +922,7 @@ extern pgd_t trampoline_pgd_entry;
static inline void __meminit init_trampoline_default(void)
{
/* Default trampoline pgd value */
- trampoline_pgd_entry = init_level4_pgt[pgd_index(__PAGE_OFFSET)];
+ trampoline_pgd_entry = init_top_pgt[pgd_index(__PAGE_OFFSET)];
}
# ifdef CONFIG_RANDOMIZE_MEMORY
void __meminit init_trampoline(void);
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index 12ea31274eb6..affcb2a9c563 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -20,9 +20,9 @@ extern pmd_t level2_kernel_pgt[512];
extern pmd_t level2_fixmap_pgt[512];
extern pmd_t level2_ident_pgt[512];
extern pte_t level1_fixmap_pgt[512];
-extern pgd_t init_level4_pgt[];
+extern pgd_t init_top_pgt[];

-#define swapper_pg_dir init_level4_pgt
+#define swapper_pg_dir init_top_pgt

extern void paging_init(void);

diff --git a/arch/x86/kernel/espfix_64.c b/arch/x86/kernel/espfix_64.c
index 8e598a1ad986..6b91e2eb8d3f 100644
--- a/arch/x86/kernel/espfix_64.c
+++ b/arch/x86/kernel/espfix_64.c
@@ -125,7 +125,7 @@ void __init init_espfix_bsp(void)
p4d_t *p4d;

/* Install the espfix pud into the kernel page directory */
- pgd = &init_level4_pgt[pgd_index(ESPFIX_BASE_ADDR)];
+ pgd = &init_top_pgt[pgd_index(ESPFIX_BASE_ADDR)];
p4d = p4d_alloc(&init_mm, pgd, ESPFIX_BASE_ADDR);
p4d_populate(&init_mm, p4d, espfix_pud_page);

diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index dbb5b29bf019..c46e0f62024e 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -33,7 +33,7 @@
/*
* Manage page tables very early on.
*/
-extern pgd_t early_level4_pgt[PTRS_PER_PGD];
+extern pgd_t early_top_pgt[PTRS_PER_PGD];
extern pmd_t early_dynamic_pgts[EARLY_DYNAMIC_PAGE_TABLES][PTRS_PER_PMD];
static unsigned int __initdata next_early_pgt;
pmdval_t early_pmd_flags = __PAGE_KERNEL_LARGE & ~(_PAGE_GLOBAL | _PAGE_NX);
@@ -67,7 +67,7 @@ void __init __startup_64(unsigned long physaddr)

/* Fixup the physical addresses in the page table */

- pgd = fixup_pointer(&early_level4_pgt, physaddr);
+ pgd = fixup_pointer(&early_top_pgt, physaddr);
pgd[pgd_index(__START_KERNEL_map)] += load_delta;

pud = fixup_pointer(&level3_kernel_pgt, physaddr);
@@ -120,9 +120,9 @@ void __init __startup_64(unsigned long physaddr)
/* Wipe all early page tables except for the kernel symbol map */
static void __init reset_early_page_tables(void)
{
- memset(early_level4_pgt, 0, sizeof(pgd_t)*(PTRS_PER_PGD-1));
+ memset(early_top_pgt, 0, sizeof(pgd_t)*(PTRS_PER_PGD-1));
next_early_pgt = 0;
- write_cr3(__pa_nodebug(early_level4_pgt));
+ write_cr3(__pa_nodebug(early_top_pgt));
}

/* Create a new PMD entry */
@@ -134,11 +134,11 @@ int __init early_make_pgtable(unsigned long address)
pmdval_t pmd, *pmd_p;

/* Invalid address or early pgt is done ? */
- if (physaddr >= MAXMEM || read_cr3() != __pa_nodebug(early_level4_pgt))
+ if (physaddr >= MAXMEM || read_cr3() != __pa_nodebug(early_top_pgt))
return -1;

again:
- pgd_p = &early_level4_pgt[pgd_index(address)].pgd;
+ pgd_p = &early_top_pgt[pgd_index(address)].pgd;
pgd = *pgd_p;

/*
@@ -235,7 +235,7 @@ asmlinkage __visible void __init x86_64_start_kernel(char * real_mode_data)

clear_bss();

- clear_page(init_level4_pgt);
+ clear_page(init_top_pgt);

kasan_early_init();

@@ -250,8 +250,8 @@ asmlinkage __visible void __init x86_64_start_kernel(char * real_mode_data)
*/
load_ucode_bsp();

- /* set init_level4_pgt kernel high mapping*/
- init_level4_pgt[511] = early_level4_pgt[511];
+ /* set init_top_pgt kernel high mapping*/
+ init_top_pgt[511] = early_top_pgt[511];

x86_64_start_reservations(real_mode_data);
}
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 9656c5951b98..d44c350797bf 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -75,7 +75,7 @@ startup_64:
leaq _text(%rip), %rdi
call __startup_64

- movq $(early_level4_pgt - __START_KERNEL_map), %rax
+ movq $(early_top_pgt - __START_KERNEL_map), %rax
jmp 1f
ENTRY(secondary_startup_64)
/*
@@ -95,7 +95,7 @@ ENTRY(secondary_startup_64)
/* Sanitize CPU configuration */
call verify_cpu

- movq $(init_level4_pgt - __START_KERNEL_map), %rax
+ movq $(init_top_pgt - __START_KERNEL_map), %rax
1:

/* Enable PAE mode and PGE */
@@ -326,7 +326,7 @@ GLOBAL(name)
.endr

__INITDATA
-NEXT_PAGE(early_level4_pgt)
+NEXT_PAGE(early_top_pgt)
.fill 511,8,0
.quad level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE

@@ -336,14 +336,14 @@ NEXT_PAGE(early_dynamic_pgts)
.data

#ifndef CONFIG_XEN
-NEXT_PAGE(init_level4_pgt)
+NEXT_PAGE(init_top_pgt)
.fill 512,8,0
#else
-NEXT_PAGE(init_level4_pgt)
+NEXT_PAGE(init_top_pgt)
.quad level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
- .org init_level4_pgt + L4_PAGE_OFFSET*8, 0
+ .org init_top_pgt + L4_PAGE_OFFSET*8, 0
.quad level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
- .org init_level4_pgt + L4_START_KERNEL*8, 0
+ .org init_top_pgt + L4_START_KERNEL*8, 0
/* (2^48-(2*1024*1024*1024))/(2^39) = 511 */
.quad level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE

diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index 085c3b300d32..42f502b45e62 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -342,7 +342,7 @@ void machine_kexec(struct kimage *image)
void arch_crash_save_vmcoreinfo(void)
{
VMCOREINFO_NUMBER(phys_base);
- VMCOREINFO_SYMBOL(init_level4_pgt);
+ VMCOREINFO_SYMBOL(init_top_pgt);

#ifdef CONFIG_NUMA
VMCOREINFO_SYMBOL(node_data);
diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
index 9f305be71a72..6680cefc062e 100644
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -431,7 +431,7 @@ static void ptdump_walk_pgd_level_core(struct seq_file *m, pgd_t *pgd,
bool checkwx)
{
#ifdef CONFIG_X86_64
- pgd_t *start = (pgd_t *) &init_level4_pgt;
+ pgd_t *start = (pgd_t *) &init_top_pgt;
#else
pgd_t *start = swapper_pg_dir;
#endif
diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index 0c7d8129bed6..88215ac16b24 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -12,7 +12,7 @@
#include <asm/tlbflush.h>
#include <asm/sections.h>

-extern pgd_t early_level4_pgt[PTRS_PER_PGD];
+extern pgd_t early_top_pgt[PTRS_PER_PGD];
extern struct range pfn_mapped[E820_MAX_ENTRIES];

static int __init map_range(struct range *range)
@@ -109,8 +109,8 @@ void __init kasan_early_init(void)
for (i = 0; CONFIG_PGTABLE_LEVELS >= 5 && i < PTRS_PER_P4D; i++)
kasan_zero_p4d[i] = __p4d(p4d_val);

- kasan_map_early_shadow(early_level4_pgt);
- kasan_map_early_shadow(init_level4_pgt);
+ kasan_map_early_shadow(early_top_pgt);
+ kasan_map_early_shadow(init_top_pgt);
}

void __init kasan_init(void)
@@ -121,8 +121,8 @@ void __init kasan_init(void)
register_die_notifier(&kasan_die_notifier);
#endif

- memcpy(early_level4_pgt, init_level4_pgt, sizeof(early_level4_pgt));
- load_cr3(early_level4_pgt);
+ memcpy(early_top_pgt, init_top_pgt, sizeof(early_top_pgt));
+ load_cr3(early_top_pgt);
__flush_tlb_all();

clear_pgds(KASAN_SHADOW_START, KASAN_SHADOW_END);
@@ -148,7 +148,7 @@ void __init kasan_init(void)
kasan_populate_zero_shadow(kasan_mem_to_shadow((void *)MODULES_END),
(void *)KASAN_SHADOW_END);

- load_cr3(init_level4_pgt);
+ load_cr3(init_top_pgt);
__flush_tlb_all();

/*
diff --git a/arch/x86/realmode/init.c b/arch/x86/realmode/init.c
index 5db706f14111..dc0836d5c5eb 100644
--- a/arch/x86/realmode/init.c
+++ b/arch/x86/realmode/init.c
@@ -102,7 +102,7 @@ static void __init setup_real_mode(void)

trampoline_pgd = (u64 *) __va(real_mode_header->trampoline_pgd);
trampoline_pgd[0] = trampoline_pgd_entry.pgd;
- trampoline_pgd[511] = init_level4_pgt[511].pgd;
+ trampoline_pgd[511] = init_top_pgt[511].pgd;
#endif
}

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index f226038a39ca..7c2081f78a19 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1531,8 +1531,8 @@ static void xen_write_cr3(unsigned long cr3)
* At the start of the day - when Xen launches a guest, it has already
* built pagetables for the guest. We diligently look over them
* in xen_setup_kernel_pagetable and graft as appropriate them in the
- * init_level4_pgt and its friends. Then when we are happy we load
- * the new init_level4_pgt - and continue on.
+ * init_top_pgt and its friends. Then when we are happy we load
+ * the new init_top_pgt - and continue on.
*
* The generic code starts (start_kernel) and 'init_mem_mapping' sets
* up the rest of the pagetables. When it has completed it loads the cr3.
@@ -1975,13 +1975,13 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
pt_end = pt_base + xen_start_info->nr_pt_frames;

/* Zap identity mapping */
- init_level4_pgt[0] = __pgd(0);
+ init_top_pgt[0] = __pgd(0);

if (!xen_feature(XENFEAT_auto_translated_physmap)) {
/* Pre-constructed entries are in pfn, so convert to mfn */
/* L4[272] -> level3_ident_pgt
* L4[511] -> level3_kernel_pgt */
- convert_pfn_mfn(init_level4_pgt);
+ convert_pfn_mfn(init_top_pgt);

/* L3_i[0] -> level2_ident_pgt */
convert_pfn_mfn(level3_ident_pgt);
@@ -2012,11 +2012,11 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
/* Copy the initial P->M table mappings if necessary. */
i = pgd_index(xen_start_info->mfn_list);
if (i && i < pgd_index(__START_KERNEL_map))
- init_level4_pgt[i] = ((pgd_t *)xen_start_info->pt_base)[i];
+ init_top_pgt[i] = ((pgd_t *)xen_start_info->pt_base)[i];

if (!xen_feature(XENFEAT_auto_translated_physmap)) {
/* Make pagetable pieces RO */
- set_page_prot(init_level4_pgt, PAGE_KERNEL_RO);
+ set_page_prot(init_top_pgt, PAGE_KERNEL_RO);
set_page_prot(level3_ident_pgt, PAGE_KERNEL_RO);
set_page_prot(level3_kernel_pgt, PAGE_KERNEL_RO);
set_page_prot(level3_user_vsyscall, PAGE_KERNEL_RO);
@@ -2027,7 +2027,7 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)

/* Pin down new L4 */
pin_pagetable_pfn(MMUEXT_PIN_L4_TABLE,
- PFN_DOWN(__pa_symbol(init_level4_pgt)));
+ PFN_DOWN(__pa_symbol(init_top_pgt)));

/* Unpin Xen-provided one */
pin_pagetable_pfn(MMUEXT_UNPIN_TABLE, PFN_DOWN(__pa(pgd)));
@@ -2038,10 +2038,10 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
* pgd.
*/
xen_mc_batch();
- __xen_write_cr3(true, __pa(init_level4_pgt));
+ __xen_write_cr3(true, __pa(init_top_pgt));
xen_mc_issue(PARAVIRT_LAZY_CPU);
} else
- native_write_cr3(__pa(init_level4_pgt));
+ native_write_cr3(__pa(init_top_pgt));

/* We can't that easily rip out L3 and L2, as the Xen pagetables are
* set out this way: [L4], [L1], [L2], [L3], [L1], [L1] ... for
diff --git a/arch/x86/xen/xen-pvh.S b/arch/x86/xen/xen-pvh.S
index 5e246716d58f..e1a5fbeae08d 100644
--- a/arch/x86/xen/xen-pvh.S
+++ b/arch/x86/xen/xen-pvh.S
@@ -87,7 +87,7 @@ ENTRY(pvh_start_xen)
wrmsr

/* Enable pre-constructed page tables. */
- mov $_pa(init_level4_pgt), %eax
+ mov $_pa(init_top_pgt), %eax
mov %eax, %cr3
mov $(X86_CR0_PG | X86_CR0_PE), %eax
mov %eax, %cr0
--
2.11.0

2017-04-06 14:01:29

by Kirill A. Shutemov

Subject: [PATCH 3/8] x86/boot/64: Add support of additional page table level during early boot

This patch adds support for 5-level paging during early boot.
It generalizes the boot code for 4- and 5-level paging on 64-bit systems
with a compile-time switch between them.
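
A minimal sketch of the compile-time switch (not the exact hunk below, just
its shape): with CONFIG_X86_5LEVEL an extra p4d table is inserted between
the pgd and the pud, otherwise the pgd references the pud directly.

	if (IS_ENABLED(CONFIG_X86_5LEVEL)) {
		/* extra level: pgd -> p4d -> pud */
		pgd[0] = (pgdval_t)p4d + _KERNPG_TABLE;
		p4d[0] = (p4dval_t)pud + _KERNPG_TABLE;
	} else {
		/* p4d folded: pgd -> pud */
		pgd[0] = (pgdval_t)pud + _KERNPG_TABLE;
	}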

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/boot/compressed/head_64.S | 23 ++++++++++++---
arch/x86/include/asm/pgtable_64.h | 2 ++
arch/x86/include/uapi/asm/processor-flags.h | 2 ++
arch/x86/kernel/head64.c | 44 +++++++++++++++++++++++++----
arch/x86/kernel/head_64.S | 29 +++++++++++++++----
5 files changed, 85 insertions(+), 15 deletions(-)

diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index d2ae1f821e0c..3ed26769810b 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -122,9 +122,12 @@ ENTRY(startup_32)
addl %ebp, gdt+2(%ebp)
lgdt gdt(%ebp)

- /* Enable PAE mode */
+ /* Enable PAE and LA57 mode */
movl %cr4, %eax
orl $X86_CR4_PAE, %eax
+#ifdef CONFIG_X86_5LEVEL
+ orl $X86_CR4_LA57, %eax
+#endif
movl %eax, %cr4

/*
@@ -136,13 +139,24 @@ ENTRY(startup_32)
movl $(BOOT_INIT_PGT_SIZE/4), %ecx
rep stosl

+ xorl %edx, %edx
+
+ /* Build Top Level */
+ leal pgtable(%ebx,%edx,1), %edi
+ leal 0x1007 (%edi), %eax
+ movl %eax, 0(%edi)
+
+#ifdef CONFIG_X86_5LEVEL
/* Build Level 4 */
- leal pgtable + 0(%ebx), %edi
+ addl $0x1000, %edx
+ leal pgtable(%ebx,%edx), %edi
leal 0x1007 (%edi), %eax
movl %eax, 0(%edi)
+#endif

/* Build Level 3 */
- leal pgtable + 0x1000(%ebx), %edi
+ addl $0x1000, %edx
+ leal pgtable(%ebx,%edx), %edi
leal 0x1007(%edi), %eax
movl $4, %ecx
1: movl %eax, 0x00(%edi)
@@ -152,7 +166,8 @@ ENTRY(startup_32)
jnz 1b

/* Build Level 2 */
- leal pgtable + 0x2000(%ebx), %edi
+ addl $0x1000, %edx
+ leal pgtable(%ebx,%edx), %edi
movl $0x00000183, %eax
movl $2048, %ecx
1: movl %eax, 0(%edi)
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index affcb2a9c563..2160c1fee920 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -14,6 +14,8 @@
#include <linux/bitops.h>
#include <linux/threads.h>

+extern p4d_t level4_kernel_pgt[512];
+extern p4d_t level4_ident_pgt[512];
extern pud_t level3_kernel_pgt[512];
extern pud_t level3_ident_pgt[512];
extern pmd_t level2_kernel_pgt[512];
diff --git a/arch/x86/include/uapi/asm/processor-flags.h b/arch/x86/include/uapi/asm/processor-flags.h
index 567de50a4c2a..185f3d10c194 100644
--- a/arch/x86/include/uapi/asm/processor-flags.h
+++ b/arch/x86/include/uapi/asm/processor-flags.h
@@ -104,6 +104,8 @@
#define X86_CR4_OSFXSR _BITUL(X86_CR4_OSFXSR_BIT)
#define X86_CR4_OSXMMEXCPT_BIT 10 /* enable unmasked SSE exceptions */
#define X86_CR4_OSXMMEXCPT _BITUL(X86_CR4_OSXMMEXCPT_BIT)
+#define X86_CR4_LA57_BIT 12 /* enable 5-level page tables */
+#define X86_CR4_LA57 _BITUL(X86_CR4_LA57_BIT)
#define X86_CR4_VMXE_BIT 13 /* enable VMX virtualization */
#define X86_CR4_VMXE _BITUL(X86_CR4_VMXE_BIT)
#define X86_CR4_SMXE_BIT 14 /* enable safer mode (TXT) */
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index c46e0f62024e..92935855eaaa 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -47,6 +47,7 @@ void __init __startup_64(unsigned long physaddr)
{
unsigned long load_delta, *p;
pgdval_t *pgd;
+ p4dval_t *p4d;
pudval_t *pud;
pmdval_t *pmd, pmd_entry;
int i;
@@ -70,6 +71,11 @@ void __init __startup_64(unsigned long physaddr)
pgd = fixup_pointer(&early_top_pgt, physaddr);
pgd[pgd_index(__START_KERNEL_map)] += load_delta;

+ if (IS_ENABLED(CONFIG_X86_5LEVEL)) {
+ p4d = fixup_pointer(&level4_kernel_pgt, physaddr);
+ p4d[511] += load_delta;
+ }
+
pud = fixup_pointer(&level3_kernel_pgt, physaddr);
pud[510] += load_delta;
pud[511] += load_delta;
@@ -87,8 +93,18 @@ void __init __startup_64(unsigned long physaddr)
pud = fixup_pointer(early_dynamic_pgts[next_early_pgt++], physaddr);
pmd = fixup_pointer(early_dynamic_pgts[next_early_pgt++], physaddr);

- pgd[0] = (pgdval_t)pud + _KERNPG_TABLE;
- pgd[1] = (pgdval_t)pud + _KERNPG_TABLE;
+ if (IS_ENABLED(CONFIG_X86_5LEVEL)) {
+ p4d = fixup_pointer(early_dynamic_pgts[next_early_pgt++], physaddr);
+
+ pgd[0] = (pgdval_t)p4d + _KERNPG_TABLE;
+ pgd[1] = (pgdval_t)p4d + _KERNPG_TABLE;
+
+ p4d[0] = (pgdval_t)pud + _KERNPG_TABLE;
+ p4d[1] = (pgdval_t)pud + _KERNPG_TABLE;
+ } else {
+ pgd[0] = (pgdval_t)pud + _KERNPG_TABLE;
+ pgd[1] = (pgdval_t)pud + _KERNPG_TABLE;
+ }

pud[0] = (pudval_t)pmd + _KERNPG_TABLE;
pud[1] = (pudval_t)pmd + _KERNPG_TABLE;
@@ -130,6 +146,7 @@ int __init early_make_pgtable(unsigned long address)
{
unsigned long physaddr = address - __PAGE_OFFSET;
pgdval_t pgd, *pgd_p;
+ p4dval_t p4d, *p4d_p;
pudval_t pud, *pud_p;
pmdval_t pmd, *pmd_p;

@@ -146,8 +163,25 @@ int __init early_make_pgtable(unsigned long address)
* critical -- __PAGE_OFFSET would point us back into the dynamic
* range and we might end up looping forever...
*/
- if (pgd)
- pud_p = (pudval_t *)((pgd & PTE_PFN_MASK) + __START_KERNEL_map - phys_base);
+ if (!IS_ENABLED(CONFIG_X86_5LEVEL))
+ p4d_p = pgd_p;
+ else if (pgd)
+ p4d_p = (p4dval_t *)((pgd & PTE_PFN_MASK) + __START_KERNEL_map - phys_base);
+ else {
+ if (next_early_pgt >= EARLY_DYNAMIC_PAGE_TABLES) {
+ reset_early_page_tables();
+ goto again;
+ }
+
+ p4d_p = (p4dval_t *)early_dynamic_pgts[next_early_pgt++];
+ memset(p4d_p, 0, sizeof(*p4d_p) * PTRS_PER_P4D);
+ *pgd_p = (pgdval_t)p4d_p - __START_KERNEL_map + phys_base + _KERNPG_TABLE;
+ }
+ p4d_p += p4d_index(address);
+ p4d = *p4d_p;
+
+ if (p4d)
+ pud_p = (pudval_t *)((p4d & PTE_PFN_MASK) + __START_KERNEL_map - phys_base);
else {
if (next_early_pgt >= EARLY_DYNAMIC_PAGE_TABLES) {
reset_early_page_tables();
@@ -156,7 +190,7 @@ int __init early_make_pgtable(unsigned long address)

pud_p = (pudval_t *)early_dynamic_pgts[next_early_pgt++];
memset(pud_p, 0, sizeof(*pud_p) * PTRS_PER_PUD);
- *pgd_p = (pgdval_t)pud_p - __START_KERNEL_map + phys_base + _KERNPG_TABLE;
+ *p4d_p = (p4dval_t)pud_p - __START_KERNEL_map + phys_base + _KERNPG_TABLE;
}
pud_p += pud_index(address);
pud = *pud_p;
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index d44c350797bf..b24fc575a6da 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -37,10 +37,14 @@
*
*/

+#define p4d_index(x) (((x) >> P4D_SHIFT) & (PTRS_PER_P4D-1))
#define pud_index(x) (((x) >> PUD_SHIFT) & (PTRS_PER_PUD-1))

-L4_PAGE_OFFSET = pgd_index(__PAGE_OFFSET_BASE)
-L4_START_KERNEL = pgd_index(__START_KERNEL_map)
+PGD_PAGE_OFFSET = pgd_index(__PAGE_OFFSET_BASE)
+PGD_START_KERNEL = pgd_index(__START_KERNEL_map)
+#ifdef CONFIG_X86_5LEVEL
+L4_START_KERNEL = p4d_index(__START_KERNEL_map)
+#endif
L3_START_KERNEL = pud_index(__START_KERNEL_map)

.text
@@ -98,11 +102,14 @@ ENTRY(secondary_startup_64)
movq $(init_top_pgt - __START_KERNEL_map), %rax
1:

- /* Enable PAE mode and PGE */
+ /* Enable PAE mode, PGE and LA57 */
movl $(X86_CR4_PAE | X86_CR4_PGE), %ecx
+#ifdef CONFIG_X86_5LEVEL
+ orl $X86_CR4_LA57, %ecx
+#endif
movq %rcx, %cr4

- /* Setup early boot stage 4 level pagetables. */
+ /* Setup early boot stage 4-/5-level pagetables. */
addq phys_base(%rip), %rax
movq %rax, %cr3

@@ -328,7 +335,11 @@ GLOBAL(name)
__INITDATA
NEXT_PAGE(early_top_pgt)
.fill 511,8,0
+#ifdef CONFIG_X86_5LEVEL
+ .quad level4_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE
+#else
.quad level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE
+#endif

NEXT_PAGE(early_dynamic_pgts)
.fill 512*EARLY_DYNAMIC_PAGE_TABLES,8,0
@@ -341,9 +352,9 @@ NEXT_PAGE(init_top_pgt)
#else
NEXT_PAGE(init_top_pgt)
.quad level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
- .org init_top_pgt + L4_PAGE_OFFSET*8, 0
+ .org init_top_pgt + PGD_PAGE_OFFSET*8, 0
.quad level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
- .org init_top_pgt + L4_START_KERNEL*8, 0
+ .org init_top_pgt + PGD_START_KERNEL*8, 0
/* (2^48-(2*1024*1024*1024))/(2^39) = 511 */
.quad level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE

@@ -357,6 +368,12 @@ NEXT_PAGE(level2_ident_pgt)
PMDS(0, __PAGE_KERNEL_IDENT_LARGE_EXEC, PTRS_PER_PMD)
#endif

+#ifdef CONFIG_X86_5LEVEL
+NEXT_PAGE(level4_kernel_pgt)
+ .fill 511,8,0
+ .quad level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE
+#endif
+
NEXT_PAGE(level3_kernel_pgt)
.fill L3_START_KERNEL,8,0
/* (2^48-(2*1024*1024*1024)-((2^39)*511))/(2^30) = 510 */
--
2.11.0

2017-04-06 14:01:40

by Kirill A. Shutemov

Subject: [PATCH 4/8] x86/mm: Add sync_global_pgds() for configuration with 5-level paging

This basically restores a slightly modified version of the original
sync_global_pgds() which we had before the folded p4d level was introduced.

The only modification is protection against 'address' overflow.
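
The overflow protection is the extra lower bound in the loop condition: if
'end' sits near the top of the address space, 'address += PGDIR_SIZE' can
wrap around to a small value and the loop would otherwise never terminate.
Roughly (sync_one_pgd() is a made-up placeholder for the per-PGD body):

	for (address = start;
	     address <= end && address >= start;	/* stop once address wraps */
	     address += PGDIR_SIZE)
		sync_one_pgd(address);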

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/mm/init_64.c | 35 +++++++++++++++++++++++++++++++++++
1 file changed, 35 insertions(+)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index a242139df8fe..0b62b13e8655 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -92,6 +92,40 @@ __setup("noexec32=", nonx32_setup);
* When memory was added make sure all the processes MM have
* suitable PGD entries in the local PGD level page.
*/
+#ifdef CONFIG_X86_5LEVEL
+void sync_global_pgds(unsigned long start, unsigned long end)
+{
+ unsigned long address;
+
+ for (address = start; address <= end && address >= start; address += PGDIR_SIZE) {
+ const pgd_t *pgd_ref = pgd_offset_k(address);
+ struct page *page;
+
+ if (pgd_none(*pgd_ref))
+ continue;
+
+ spin_lock(&pgd_lock);
+ list_for_each_entry(page, &pgd_list, lru) {
+ pgd_t *pgd;
+ spinlock_t *pgt_lock;
+
+ pgd = (pgd_t *)page_address(page) + pgd_index(address);
+ /* the pgt_lock only for Xen */
+ pgt_lock = &pgd_page_get_mm(page)->page_table_lock;
+ spin_lock(pgt_lock);
+
+ if (!pgd_none(*pgd_ref) && !pgd_none(*pgd))
+ BUG_ON(pgd_page_vaddr(*pgd) != pgd_page_vaddr(*pgd_ref));
+
+ if (pgd_none(*pgd))
+ set_pgd(pgd, *pgd_ref);
+
+ spin_unlock(pgt_lock);
+ }
+ spin_unlock(&pgd_lock);
+ }
+}
+#else
void sync_global_pgds(unsigned long start, unsigned long end)
{
unsigned long address;
@@ -135,6 +169,7 @@ void sync_global_pgds(unsigned long start, unsigned long end)
spin_unlock(&pgd_lock);
}
}
+#endif

/*
* NOTE: This function is marked __ref because it calls __init function
--
2.11.0

2017-04-06 14:02:07

by Kirill A. Shutemov

Subject: [PATCH 8/8] x86/mm: Allow to have userspace mappings above 47-bits

On x86, 5-level paging enables 56-bit userspace virtual address space.
Not all user space is ready to handle wide addresses. It's known that
at least some JIT compilers use higher bits in pointers to encode their
information. That collides with valid pointers once 5-level paging is
enabled and leads to crashes.

To mitigate this, we are not going to allocate virtual address space
above 47-bit by default.

But userspace can ask for an allocation from the full address space by
specifying a hint address (with or without MAP_FIXED) above 47 bits.

If the hint address is above 47 bits but MAP_FIXED is not specified, we
first try to find an unmapped area at the specified address. If it's
already occupied, we look for an unmapped area in the *full* address
space, rather than within the 47-bit window.

This approach makes it easy to teach an application's memory allocator
about the large address space without manually tracking allocated
virtual address space.
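
For illustration only (not part of the patch), a user-space program can opt
in to the full address space by passing a hint; 1UL << 48 is just an
arbitrary address above the 47-bit boundary:

	#include <stdio.h>
	#include <sys/mman.h>

	int main(void)
	{
		/* Any hint above 47 bits asks for the full address space. */
		void *hint = (void *)(1UL << 48);
		void *p = mmap(hint, 4096, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		/* Without a high hint, mmap() stays below the 47-bit window. */
		printf("mapped at %p\n", p);
		return 0;
	}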

One important case we need to handle here is interaction with MPX.
MPX (without the MAWA extension) cannot handle addresses above 47 bits,
so we need to make sure that MPX cannot be enabled if we already have a
VMA above the boundary, and forbid creating such VMAs once MPX is
enabled.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Cc: Dmitry Safonov <[email protected]>
---
arch/x86/include/asm/elf.h | 2 +-
arch/x86/include/asm/mpx.h | 9 +++++++++
arch/x86/include/asm/processor.h | 9 ++++++---
arch/x86/kernel/sys_x86_64.c | 28 +++++++++++++++++++++++++++-
arch/x86/mm/hugetlbpage.c | 27 ++++++++++++++++++++++++---
arch/x86/mm/mmap.c | 2 +-
arch/x86/mm/mpx.c | 33 ++++++++++++++++++++++++++++++++-
7 files changed, 100 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
index d4d3ed456cb7..67260dbe1688 100644
--- a/arch/x86/include/asm/elf.h
+++ b/arch/x86/include/asm/elf.h
@@ -250,7 +250,7 @@ extern int force_personality32;
the loader. We need to make sure that it is out of the way of the program
that it will "exec", and that there is sufficient room for the brk. */

-#define ELF_ET_DYN_BASE (TASK_SIZE / 3 * 2)
+#define ELF_ET_DYN_BASE (DEFAULT_MAP_WINDOW / 3 * 2)

/* This yields a mask that user programs can use to figure out what
instruction set this CPU supports. This could be done in user space,
diff --git a/arch/x86/include/asm/mpx.h b/arch/x86/include/asm/mpx.h
index a0d662be4c5b..7d7404756bb4 100644
--- a/arch/x86/include/asm/mpx.h
+++ b/arch/x86/include/asm/mpx.h
@@ -73,6 +73,9 @@ static inline void mpx_mm_init(struct mm_struct *mm)
}
void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long start, unsigned long end);
+
+unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned long len,
+ unsigned long flags);
#else
static inline siginfo_t *mpx_generate_siginfo(struct pt_regs *regs)
{
@@ -94,6 +97,12 @@ static inline void mpx_notify_unmap(struct mm_struct *mm,
unsigned long start, unsigned long end)
{
}
+
+static inline unsigned long mpx_unmapped_area_check(unsigned long addr,
+ unsigned long len, unsigned long flags)
+{
+ return addr;
+}
#endif /* CONFIG_X86_INTEL_MPX */

#endif /* _ASM_X86_MPX_H */
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 3cada998a402..9f437aea7f57 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -795,6 +795,7 @@ static inline void spin_lock_prefetch(const void *x)
#define IA32_PAGE_OFFSET PAGE_OFFSET
#define TASK_SIZE PAGE_OFFSET
#define TASK_SIZE_MAX TASK_SIZE
+#define DEFAULT_MAP_WINDOW TASK_SIZE
#define STACK_TOP TASK_SIZE
#define STACK_TOP_MAX STACK_TOP

@@ -834,7 +835,9 @@ static inline void spin_lock_prefetch(const void *x)
* particular problem by preventing anything from being mapped
* at the maximum canonical address.
*/
-#define TASK_SIZE_MAX ((1UL << 47) - PAGE_SIZE)
+#define TASK_SIZE_MAX ((1UL << __VIRTUAL_MASK_SHIFT) - PAGE_SIZE)
+
+#define DEFAULT_MAP_WINDOW ((1UL << 47) - PAGE_SIZE)

/* This decides where the kernel will search for a free chunk of vm
* space during mmap's.
@@ -847,7 +850,7 @@ static inline void spin_lock_prefetch(const void *x)
#define TASK_SIZE_OF(child) ((test_tsk_thread_flag(child, TIF_ADDR32)) ? \
IA32_PAGE_OFFSET : TASK_SIZE_MAX)

-#define STACK_TOP TASK_SIZE
+#define STACK_TOP DEFAULT_MAP_WINDOW
#define STACK_TOP_MAX TASK_SIZE_MAX

#define INIT_THREAD { \
@@ -870,7 +873,7 @@ extern void start_thread(struct pt_regs *regs, unsigned long new_ip,
* space during mmap's.
*/
#define __TASK_UNMAPPED_BASE(task_size) (PAGE_ALIGN(task_size / 3))
-#define TASK_UNMAPPED_BASE __TASK_UNMAPPED_BASE(TASK_SIZE)
+#define TASK_UNMAPPED_BASE __TASK_UNMAPPED_BASE(DEFAULT_MAP_WINDOW)

#define KSTK_EIP(task) (task_pt_regs(task)->ip)

diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c
index 207b8f2582c7..593a31e93812 100644
--- a/arch/x86/kernel/sys_x86_64.c
+++ b/arch/x86/kernel/sys_x86_64.c
@@ -21,6 +21,7 @@
#include <asm/compat.h>
#include <asm/ia32.h>
#include <asm/syscalls.h>
+#include <asm/mpx.h>

/*
* Align a virtual address to avoid aliasing in the I$ on AMD F15h.
@@ -132,6 +133,10 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
struct vm_unmapped_area_info info;
unsigned long begin, end;

+ addr = mpx_unmapped_area_check(addr, len, flags);
+ if (IS_ERR_VALUE(addr))
+ return addr;
+
if (flags & MAP_FIXED)
return addr;

@@ -151,7 +156,16 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
info.flags = 0;
info.length = len;
info.low_limit = begin;
- info.high_limit = end;
+
+ /*
+ * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
+ * in the full address space.
+ */
+ if (addr > DEFAULT_MAP_WINDOW)
+ info.high_limit = min(end, TASK_SIZE);
+ else
+ info.high_limit = min(end, DEFAULT_MAP_WINDOW);
+
info.align_mask = 0;
info.align_offset = pgoff << PAGE_SHIFT;
if (filp) {
@@ -171,6 +185,10 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
unsigned long addr = addr0;
struct vm_unmapped_area_info info;

+ addr = mpx_unmapped_area_check(addr, len, flags);
+ if (IS_ERR_VALUE(addr))
+ return addr;
+
/* requested length too big for entire address space */
if (len > TASK_SIZE)
return -ENOMEM;
@@ -195,6 +213,14 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
info.length = len;
info.low_limit = PAGE_SIZE;
info.high_limit = get_mmap_base(0);
+
+ /*
+ * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
+ * in the full address space.
+ */
+ if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
+ info.high_limit += TASK_SIZE - DEFAULT_MAP_WINDOW;
+
info.align_mask = 0;
info.align_offset = pgoff << PAGE_SHIFT;
if (filp) {
diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index 302f43fd9c28..9a0b89252c52 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -18,6 +18,7 @@
#include <asm/tlbflush.h>
#include <asm/pgalloc.h>
#include <asm/elf.h>
+#include <asm/mpx.h>

#if 0 /* This is just for testing */
struct page *
@@ -87,23 +88,38 @@ static unsigned long hugetlb_get_unmapped_area_bottomup(struct file *file,
info.low_limit = get_mmap_base(1);
info.high_limit = in_compat_syscall() ?
tasksize_32bit() : tasksize_64bit();
+
+ /*
+ * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
+ * in the full address space.
+ */
+ if (addr > DEFAULT_MAP_WINDOW)
+ info.high_limit = TASK_SIZE;
+
info.align_mask = PAGE_MASK & ~huge_page_mask(h);
info.align_offset = 0;
return vm_unmapped_area(&info);
}

static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
- unsigned long addr0, unsigned long len,
+ unsigned long addr, unsigned long len,
unsigned long pgoff, unsigned long flags)
{
struct hstate *h = hstate_file(file);
struct vm_unmapped_area_info info;
- unsigned long addr;

info.flags = VM_UNMAPPED_AREA_TOPDOWN;
info.length = len;
info.low_limit = PAGE_SIZE;
info.high_limit = get_mmap_base(0);
+
+ /*
+ * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
+ * in the full address space.
+ */
+ if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
+ info.high_limit += TASK_SIZE - DEFAULT_MAP_WINDOW;
+
info.align_mask = PAGE_MASK & ~huge_page_mask(h);
info.align_offset = 0;
addr = vm_unmapped_area(&info);
@@ -118,7 +134,7 @@ static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
VM_BUG_ON(addr != -ENOMEM);
info.flags = 0;
info.low_limit = TASK_UNMAPPED_BASE;
- info.high_limit = TASK_SIZE;
+ info.high_limit = DEFAULT_MAP_WINDOW;
addr = vm_unmapped_area(&info);
}

@@ -135,6 +151,11 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,

if (len & ~huge_page_mask(h))
return -EINVAL;
+
+ addr = mpx_unmapped_area_check(addr, len, flags);
+ if (IS_ERR_VALUE(addr))
+ return addr;
+
if (len > TASK_SIZE)
return -ENOMEM;

diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
index 19ad095b41df..d63232a31945 100644
--- a/arch/x86/mm/mmap.c
+++ b/arch/x86/mm/mmap.c
@@ -44,7 +44,7 @@ unsigned long tasksize_32bit(void)

unsigned long tasksize_64bit(void)
{
- return TASK_SIZE_MAX;
+ return DEFAULT_MAP_WINDOW;
}

static unsigned long stack_maxrandom_size(unsigned long task_size)
diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
index cd44ae727df7..a26a1b373fd0 100644
--- a/arch/x86/mm/mpx.c
+++ b/arch/x86/mm/mpx.c
@@ -355,10 +355,19 @@ int mpx_enable_management(void)
*/
bd_base = mpx_get_bounds_dir();
down_write(&mm->mmap_sem);
+
+ /* MPX doesn't support addresses above 47-bits yet. */
+ if (find_vma(mm, DEFAULT_MAP_WINDOW)) {
+ pr_warn_once("%s (%d): MPX cannot handle addresses "
+ "above 47-bits. Disabling.",
+ current->comm, current->pid);
+ ret = -ENXIO;
+ goto out;
+ }
mm->context.bd_addr = bd_base;
if (mm->context.bd_addr == MPX_INVALID_BOUNDS_DIR)
ret = -ENXIO;
-
+out:
up_write(&mm->mmap_sem);
return ret;
}
@@ -1038,3 +1047,25 @@ void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
if (ret)
force_sig(SIGSEGV, current);
}
+
+/* MPX cannot handle addresses above 47-bits yet. */
+unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned long len,
+ unsigned long flags)
+{
+ if (!kernel_managing_mpx_tables(current->mm))
+ return addr;
+ if (addr + len <= DEFAULT_MAP_WINDOW)
+ return addr;
+ if (flags & MAP_FIXED)
+ return -ENOMEM;
+
+ /*
+ * Requested len is larger than whole area we're allowed to map in.
+ * Resetting hinting address wouldn't do much good -- fail early.
+ */
+ if (len > DEFAULT_MAP_WINDOW)
+ return -ENOMEM;
+
+ /* Look for unmap area within DEFAULT_MAP_WINDOW */
+ return 0;
+}
--
2.11.0

2017-04-06 14:02:29

by Kirill A. Shutemov

Subject: [PATCH 6/8] x86/mm: Add support for 5-level paging for KASLR

With 5-level paging, randomization happens at the P4D level instead of
the PUD level.

The maximum amount of physical memory is also bumped to 52 bits for
5-level paging.
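
To put numbers on the new maximum for the direct mapping region (assuming
__PHYSICAL_MASK_SHIFT is 52 with CONFIG_X86_5LEVEL and stays 46 otherwise,
with TB_SHIFT == 40):

	1 << (__PHYSICAL_MASK_SHIFT - TB_SHIFT)
		/* 5-level: 1 << (52 - 40) = 4096 TB = 4 PB */
		/* 4-level: 1 << (46 - 40) = 64 TB, the old hard-coded value */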

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/mm/kaslr.c | 81 ++++++++++++++++++++++++++++++++++++++++-------------
1 file changed, 62 insertions(+), 19 deletions(-)

diff --git a/arch/x86/mm/kaslr.c b/arch/x86/mm/kaslr.c
index aed206475aa7..af599167fe3c 100644
--- a/arch/x86/mm/kaslr.c
+++ b/arch/x86/mm/kaslr.c
@@ -6,12 +6,12 @@
*
* Entropy is generated using the KASLR early boot functions now shared in
* the lib directory (originally written by Kees Cook). Randomization is
- * done on PGD & PUD page table levels to increase possible addresses. The
- * physical memory mapping code was adapted to support PUD level virtual
- * addresses. This implementation on the best configuration provides 30,000
- * possible virtual addresses in average for each memory region. An additional
- * low memory page is used to ensure each CPU can start with a PGD aligned
- * virtual address (for realmode).
+ * done on PGD & P4D/PUD page table levels to increase possible addresses.
+ * The physical memory mapping code was adapted to support P4D/PUD level
+ * virtual addresses. This implementation on the best configuration provides
+ * 30,000 possible virtual addresses in average for each memory region.
+ * An additional low memory page is used to ensure each CPU can start with
+ * a PGD aligned virtual address (for realmode).
*
* The order of each memory region is not changed. The feature looks at
* the available space for the regions based on different configuration
@@ -70,7 +70,7 @@ static __initdata struct kaslr_memory_region {
unsigned long *base;
unsigned long size_tb;
} kaslr_regions[] = {
- { &page_offset_base, 64/* Maximum */ },
+ { &page_offset_base, 1 << (__PHYSICAL_MASK_SHIFT - TB_SHIFT) /* Maximum */ },
{ &vmalloc_base, VMALLOC_SIZE_TB },
{ &vmemmap_base, 1 },
};
@@ -142,7 +142,10 @@ void __init kernel_randomize_memory(void)
*/
entropy = remain_entropy / (ARRAY_SIZE(kaslr_regions) - i);
prandom_bytes_state(&rand_state, &rand, sizeof(rand));
- entropy = (rand % (entropy + 1)) & PUD_MASK;
+ if (IS_ENABLED(CONFIG_X86_5LEVEL))
+ entropy = (rand % (entropy + 1)) & P4D_MASK;
+ else
+ entropy = (rand % (entropy + 1)) & PUD_MASK;
vaddr += entropy;
*kaslr_regions[i].base = vaddr;

@@ -151,27 +154,21 @@ void __init kernel_randomize_memory(void)
* randomization alignment.
*/
vaddr += get_padding(&kaslr_regions[i]);
- vaddr = round_up(vaddr + 1, PUD_SIZE);
+ if (IS_ENABLED(CONFIG_X86_5LEVEL))
+ vaddr = round_up(vaddr + 1, P4D_SIZE);
+ else
+ vaddr = round_up(vaddr + 1, PUD_SIZE);
remain_entropy -= entropy;
}
}

-/*
- * Create PGD aligned trampoline table to allow real mode initialization
- * of additional CPUs. Consume only 1 low memory page.
- */
-void __meminit init_trampoline(void)
+static void __meminit init_trampoline_pud(void)
{
unsigned long paddr, paddr_next;
pgd_t *pgd;
pud_t *pud_page, *pud_page_tramp;
int i;

- if (!kaslr_memory_enabled()) {
- init_trampoline_default();
- return;
- }
-
pud_page_tramp = alloc_low_page();

paddr = 0;
@@ -192,3 +189,49 @@ void __meminit init_trampoline(void)
set_pgd(&trampoline_pgd_entry,
__pgd(_KERNPG_TABLE | __pa(pud_page_tramp)));
}
+
+static void __meminit init_trampoline_p4d(void)
+{
+ unsigned long paddr, paddr_next;
+ pgd_t *pgd;
+ p4d_t *p4d_page, *p4d_page_tramp;
+ int i;
+
+ p4d_page_tramp = alloc_low_page();
+
+ paddr = 0;
+ pgd = pgd_offset_k((unsigned long)__va(paddr));
+ p4d_page = (p4d_t *) pgd_page_vaddr(*pgd);
+
+ for (i = p4d_index(paddr); i < PTRS_PER_P4D; i++, paddr = paddr_next) {
+ p4d_t *p4d, *p4d_tramp;
+ unsigned long vaddr = (unsigned long)__va(paddr);
+
+ p4d_tramp = p4d_page_tramp + p4d_index(paddr);
+ p4d = p4d_page + p4d_index(vaddr);
+ paddr_next = (paddr & P4D_MASK) + P4D_SIZE;
+
+ *p4d_tramp = *p4d;
+ }
+
+ set_pgd(&trampoline_pgd_entry,
+ __pgd(_KERNPG_TABLE | __pa(p4d_page_tramp)));
+}
+
+/*
+ * Create PGD aligned trampoline table to allow real mode initialization
+ * of additional CPUs. Consume only 1 low memory page.
+ */
+void __meminit init_trampoline(void)
+{
+
+ if (!kaslr_memory_enabled()) {
+ init_trampoline_default();
+ return;
+ }
+
+ if (IS_ENABLED(CONFIG_X86_5LEVEL))
+ init_trampoline_p4d();
+ else
+ init_trampoline_pud();
+}
--
2.11.0

2017-04-06 14:02:19

by Kirill A. Shutemov

Subject: [PATCH 1/8] x86/boot/64: Rewrite startup_64 in C

This patch rewrites most of the startup_64 logic in C.

This is preparation for enabling 5-level paging.
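
The central point of the C version (see fixup_pointer() in the hunk below)
is that __startup_64() runs before the kernel is mapped at its link-time
virtual address, so every global it touches has to be reached through a
pointer fixed up against physaddr. For example, the phys_base fixup
becomes:

	unsigned long *p = fixup_pointer(&phys_base, physaddr);
	*p += load_delta;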

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/kernel/head64.c | 81 ++++++++++++++++++++++++++++++++++++++++-
arch/x86/kernel/head_64.S | 93 +----------------------------------------------
2 files changed, 81 insertions(+), 93 deletions(-)

diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 43b7002f44fb..dbb5b29bf019 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -35,9 +35,88 @@
*/
extern pgd_t early_level4_pgt[PTRS_PER_PGD];
extern pmd_t early_dynamic_pgts[EARLY_DYNAMIC_PAGE_TABLES][PTRS_PER_PMD];
-static unsigned int __initdata next_early_pgt = 2;
+static unsigned int __initdata next_early_pgt;
pmdval_t early_pmd_flags = __PAGE_KERNEL_LARGE & ~(_PAGE_GLOBAL | _PAGE_NX);

+static void __init *fixup_pointer(void *ptr, unsigned long physaddr)
+{
+ return ptr - (void *)_text + (void *)physaddr;
+}
+
+void __init __startup_64(unsigned long physaddr)
+{
+ unsigned long load_delta, *p;
+ pgdval_t *pgd;
+ pudval_t *pud;
+ pmdval_t *pmd, pmd_entry;
+ int i;
+
+ /* Is the address too large? */
+ if (physaddr >> MAX_PHYSMEM_BITS)
+ for (;;);
+
+ /*
+ * Compute the delta between the address I am compiled to run at
+ * and the address I am actually running at.
+ */
+ load_delta = physaddr - (unsigned long)(_text - __START_KERNEL_map);
+
+ /* Is the address not 2M aligned? */
+ if (load_delta & ~PMD_PAGE_MASK)
+ for (;;);
+
+ /* Fixup the physical addresses in the page table */
+
+ pgd = fixup_pointer(&early_level4_pgt, physaddr);
+ pgd[pgd_index(__START_KERNEL_map)] += load_delta;
+
+ pud = fixup_pointer(&level3_kernel_pgt, physaddr);
+ pud[510] += load_delta;
+ pud[511] += load_delta;
+
+ pmd = fixup_pointer(level2_fixmap_pgt, physaddr);
+ pmd[506] += load_delta;
+
+ /*
+ * Set up the identity mapping for the switchover. These
+ * entries should *NOT* have the global bit set! This also
+ * creates a bunch of nonsense entries but that is fine --
+ * it avoids problems around wraparound.
+ */
+
+ pud = fixup_pointer(early_dynamic_pgts[next_early_pgt++], physaddr);
+ pmd = fixup_pointer(early_dynamic_pgts[next_early_pgt++], physaddr);
+
+ pgd[0] = (pgdval_t)pud + _KERNPG_TABLE;
+ pgd[1] = (pgdval_t)pud + _KERNPG_TABLE;
+
+ pud[0] = (pudval_t)pmd + _KERNPG_TABLE;
+ pud[1] = (pudval_t)pmd + _KERNPG_TABLE;
+
+ pmd_entry = __PAGE_KERNEL_LARGE_EXEC & ~_PAGE_GLOBAL;
+ pmd_entry += physaddr;
+
+ for (i = 0; i < DIV_ROUND_UP(_end - _text, PMD_SIZE); i++)
+ pmd[i + (physaddr >> PMD_SHIFT)] = pmd_entry + i * PMD_SIZE;
+
+ /*
+ * Fixup the kernel text+data virtual addresses. Note that
+ * we might write invalid pmds, when the kernel is relocated
+ * cleanup_highmap() fixes this up along with the mappings
+ * beyond _end.
+ */
+
+ pmd = fixup_pointer(level2_kernel_pgt, physaddr);
+ for (i = 0; i < PTRS_PER_PMD; i++) {
+ if (pmd[i] & _PAGE_PRESENT)
+ pmd[i] += load_delta;
+ }
+
+ /* Fixup phys_base */
+ p = fixup_pointer(&phys_base, physaddr);
+ *p += load_delta;
+}
+
/* Wipe all early page tables except for the kernel symbol map */
static void __init reset_early_page_tables(void)
{
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index ac9d327d2e42..9656c5951b98 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -72,100 +72,9 @@ startup_64:
/* Sanitize CPU configuration */
call verify_cpu

- /*
- * Compute the delta between the address I am compiled to run at and the
- * address I am actually running at.
- */
- leaq _text(%rip), %rbp
- subq $_text - __START_KERNEL_map, %rbp
-
- /* Is the address not 2M aligned? */
- testl $~PMD_PAGE_MASK, %ebp
- jnz bad_address
-
- /*
- * Is the address too large?
- */
- leaq _text(%rip), %rax
- shrq $MAX_PHYSMEM_BITS, %rax
- jnz bad_address
-
- /*
- * Fixup the physical addresses in the page table
- */
- addq %rbp, early_level4_pgt + (L4_START_KERNEL*8)(%rip)
-
- addq %rbp, level3_kernel_pgt + (510*8)(%rip)
- addq %rbp, level3_kernel_pgt + (511*8)(%rip)
-
- addq %rbp, level2_fixmap_pgt + (506*8)(%rip)
-
- /*
- * Set up the identity mapping for the switchover. These
- * entries should *NOT* have the global bit set! This also
- * creates a bunch of nonsense entries but that is fine --
- * it avoids problems around wraparound.
- */
leaq _text(%rip), %rdi
- leaq early_level4_pgt(%rip), %rbx
-
- movq %rdi, %rax
- shrq $PGDIR_SHIFT, %rax
-
- leaq (PAGE_SIZE + _KERNPG_TABLE)(%rbx), %rdx
- movq %rdx, 0(%rbx,%rax,8)
- movq %rdx, 8(%rbx,%rax,8)
-
- addq $PAGE_SIZE, %rdx
- movq %rdi, %rax
- shrq $PUD_SHIFT, %rax
- andl $(PTRS_PER_PUD-1), %eax
- movq %rdx, PAGE_SIZE(%rbx,%rax,8)
- incl %eax
- andl $(PTRS_PER_PUD-1), %eax
- movq %rdx, PAGE_SIZE(%rbx,%rax,8)
-
- addq $PAGE_SIZE * 2, %rbx
- movq %rdi, %rax
- shrq $PMD_SHIFT, %rdi
- addq $(__PAGE_KERNEL_LARGE_EXEC & ~_PAGE_GLOBAL), %rax
- leaq (_end - 1)(%rip), %rcx
- shrq $PMD_SHIFT, %rcx
- subq %rdi, %rcx
- incl %ecx
+ call __startup_64

-1:
- andq $(PTRS_PER_PMD - 1), %rdi
- movq %rax, (%rbx,%rdi,8)
- incq %rdi
- addq $PMD_SIZE, %rax
- decl %ecx
- jnz 1b
-
- test %rbp, %rbp
- jz .Lskip_fixup
-
- /*
- * Fixup the kernel text+data virtual addresses. Note that
- * we might write invalid pmds, when the kernel is relocated
- * cleanup_highmap() fixes this up along with the mappings
- * beyond _end.
- */
- leaq level2_kernel_pgt(%rip), %rdi
- leaq PAGE_SIZE(%rdi), %r8
- /* See if it is a valid page table entry */
-1: testb $_PAGE_PRESENT, 0(%rdi)
- jz 2f
- addq %rbp, 0(%rdi)
- /* Go to the next page */
-2: addq $8, %rdi
- cmp %r8, %rdi
- jne 1b
-
- /* Fixup phys_base */
- addq %rbp, phys_base(%rip)
-
-.Lskip_fixup:
movq $(early_level4_pgt - __START_KERNEL_map), %rax
jmp 1f
ENTRY(secondary_startup_64)
--
2.11.0

2017-04-06 14:02:32

by Kirill A. Shutemov

Subject: [PATCH 5/8] x86/mm: Make kernel_physical_mapping_init() support 5-level paging

Populate the additional page table level if CONFIG_X86_5LEVEL is enabled.
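
A rough sketch of the dispatch added below: with 4-level paging the p4d
level is folded into the pgd, so the new phys_p4d_init() treats the page it
is given as a pud table and delegates; with CONFIG_X86_5LEVEL it walks real
p4d entries and populates a pud table under each one:

	if (!IS_ENABLED(CONFIG_X86_5LEVEL))
		return phys_pud_init((pud_t *)p4d_page, paddr, paddr_end,
				     page_size_mask);
	/* ...otherwise iterate over PTRS_PER_P4D entries, calling
	   phys_pud_init() for each p4d entry that covers RAM */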

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/mm/init_64.c | 69 ++++++++++++++++++++++++++++++++++++++++++++-------
1 file changed, 60 insertions(+), 9 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 0b62b13e8655..53cd9fb5027b 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -620,6 +620,57 @@ phys_pud_init(pud_t *pud_page, unsigned long paddr, unsigned long paddr_end,
return paddr_last;
}

+static unsigned long __meminit
+phys_p4d_init(p4d_t *p4d_page, unsigned long paddr, unsigned long paddr_end,
+ unsigned long page_size_mask)
+{
+ unsigned long paddr_next, paddr_last = paddr_end;
+ unsigned long vaddr = (unsigned long)__va(paddr);
+ int i = p4d_index(vaddr);
+
+ if (!IS_ENABLED(CONFIG_X86_5LEVEL))
+ return phys_pud_init((pud_t *) p4d_page, paddr, paddr_end, page_size_mask);
+
+ for (; i < PTRS_PER_P4D; i++, paddr = paddr_next) {
+ p4d_t *p4d;
+ pud_t *pud;
+
+ vaddr = (unsigned long)__va(paddr);
+ p4d = p4d_page + p4d_index(vaddr);
+ paddr_next = (paddr & P4D_MASK) + P4D_SIZE;
+
+ if (paddr >= paddr_end) {
+ if (!after_bootmem &&
+ !e820__mapped_any(paddr & P4D_MASK, paddr_next,
+ E820_TYPE_RAM) &&
+ !e820__mapped_any(paddr & P4D_MASK, paddr_next,
+ E820_TYPE_RESERVED_KERN))
+ set_p4d(p4d, __p4d(0));
+ continue;
+ }
+
+ if (!p4d_none(*p4d)) {
+ pud = pud_offset(p4d, 0);
+ paddr_last = phys_pud_init(pud, paddr,
+ paddr_end,
+ page_size_mask);
+ __flush_tlb_all();
+ continue;
+ }
+
+ pud = alloc_low_page();
+ paddr_last = phys_pud_init(pud, paddr, paddr_end,
+ page_size_mask);
+
+ spin_lock(&init_mm.page_table_lock);
+ p4d_populate(&init_mm, p4d, pud);
+ spin_unlock(&init_mm.page_table_lock);
+ }
+ __flush_tlb_all();
+
+ return paddr_last;
+}
+
/*
* Create page table mapping for the physical memory for specific physical
* addresses. The virtual and physical addresses have to be aligned on PMD level
@@ -641,26 +692,26 @@ kernel_physical_mapping_init(unsigned long paddr_start,
for (; vaddr < vaddr_end; vaddr = vaddr_next) {
pgd_t *pgd = pgd_offset_k(vaddr);
p4d_t *p4d;
- pud_t *pud;

vaddr_next = (vaddr & PGDIR_MASK) + PGDIR_SIZE;

- BUILD_BUG_ON(pgd_none(*pgd));
- p4d = p4d_offset(pgd, vaddr);
- if (p4d_val(*p4d)) {
- pud = (pud_t *)p4d_page_vaddr(*p4d);
- paddr_last = phys_pud_init(pud, __pa(vaddr),
+ if (pgd_val(*pgd)) {
+ p4d = (p4d_t *)pgd_page_vaddr(*pgd);
+ paddr_last = phys_p4d_init(p4d, __pa(vaddr),
__pa(vaddr_end),
page_size_mask);
continue;
}

- pud = alloc_low_page();
- paddr_last = phys_pud_init(pud, __pa(vaddr), __pa(vaddr_end),
+ p4d = alloc_low_page();
+ paddr_last = phys_p4d_init(p4d, __pa(vaddr), __pa(vaddr_end),
page_size_mask);

spin_lock(&init_mm.page_table_lock);
- p4d_populate(&init_mm, p4d, pud);
+ if (IS_ENABLED(CONFIG_X86_5LEVEL))
+ pgd_populate(&init_mm, pgd, p4d);
+ else
+ p4d_populate(&init_mm, p4d_offset(pgd, vaddr), (pud_t *) p4d);
spin_unlock(&init_mm.page_table_lock);
pgd_changed = true;
}
--
2.11.0

2017-04-06 14:03:51

by Kirill A. Shutemov

Subject: [PATCH 7/8] x86: Enable 5-level paging support

Most things are in place and we can enable support for 5-level paging.

Enabling XEN with 5-level paging requires more work. The patch makes XEN
dependent on !X86_5LEVEL.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/Kconfig | 5 +++++
arch/x86/xen/Kconfig | 1 +
2 files changed, 6 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 4e153e93273f..7a76dcac357e 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -318,6 +318,7 @@ config FIX_EARLYCON_MEM

config PGTABLE_LEVELS
int
+ default 5 if X86_5LEVEL
default 4 if X86_64
default 3 if X86_PAE
default 2
@@ -1390,6 +1391,10 @@ config X86_PAE
has the cost of more pagetable lookup overhead, and also
consumes more pagetable space per process.

+config X86_5LEVEL
+ bool "Enable 5-level page tables support"
+ depends on X86_64
+
config ARCH_PHYS_ADDR_T_64BIT
def_bool y
depends on X86_64 || X86_PAE
diff --git a/arch/x86/xen/Kconfig b/arch/x86/xen/Kconfig
index 76b6dbd627df..b90d481ce5a1 100644
--- a/arch/x86/xen/Kconfig
+++ b/arch/x86/xen/Kconfig
@@ -5,6 +5,7 @@
config XEN
bool "Xen guest support"
depends on PARAVIRT
+ depends on !X86_5LEVEL
select PARAVIRT_CLOCK
select XEN_HAVE_PVMMU
select XEN_HAVE_VPMU
--
2.11.0

2017-04-06 14:52:22

by Jürgen Groß

Subject: Re: [PATCH 7/8] x86: Enable 5-level paging support

On 06/04/17 16:01, Kirill A. Shutemov wrote:
> Most of things are in place and we can enable support of 5-level paging.
>
> Enabling XEN with 5-level paging requires more work. The patch makes XEN
> dependent on !X86_5LEVEL.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> ---
> arch/x86/Kconfig | 5 +++++
> arch/x86/xen/Kconfig | 1 +
> 2 files changed, 6 insertions(+)
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 4e153e93273f..7a76dcac357e 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -318,6 +318,7 @@ config FIX_EARLYCON_MEM
>
> config PGTABLE_LEVELS
> int
> + default 5 if X86_5LEVEL
> default 4 if X86_64
> default 3 if X86_PAE
> default 2
> @@ -1390,6 +1391,10 @@ config X86_PAE
> has the cost of more pagetable lookup overhead, and also
> consumes more pagetable space per process.
>
> +config X86_5LEVEL
> + bool "Enable 5-level page tables support"
> + depends on X86_64
> +
> config ARCH_PHYS_ADDR_T_64BIT
> def_bool y
> depends on X86_64 || X86_PAE
> diff --git a/arch/x86/xen/Kconfig b/arch/x86/xen/Kconfig
> index 76b6dbd627df..b90d481ce5a1 100644
> --- a/arch/x86/xen/Kconfig
> +++ b/arch/x86/xen/Kconfig
> @@ -5,6 +5,7 @@
> config XEN
> bool "Xen guest support"
> depends on PARAVIRT
> + depends on !X86_5LEVEL
> select PARAVIRT_CLOCK
> select XEN_HAVE_PVMMU
> select XEN_HAVE_VPMU
>

Just a heads up: this last change will conflict with the Xen tree.

Can't we just ignore the additional level in Xen pv mode and run with
4 levels instead?


Juergen

2017-04-06 15:24:50

by Kirill A. Shutemov

Subject: Re: [PATCH 7/8] x86: Enable 5-level paging support

On Thu, Apr 06, 2017 at 04:52:11PM +0200, Juergen Gross wrote:
> On 06/04/17 16:01, Kirill A. Shutemov wrote:
> > Most of things are in place and we can enable support of 5-level paging.
> >
> > Enabling XEN with 5-level paging requires more work. The patch makes XEN
> > dependent on !X86_5LEVEL.
> >
> > Signed-off-by: Kirill A. Shutemov <[email protected]>
> > ---
> > arch/x86/Kconfig | 5 +++++
> > arch/x86/xen/Kconfig | 1 +
> > 2 files changed, 6 insertions(+)
> >
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index 4e153e93273f..7a76dcac357e 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -318,6 +318,7 @@ config FIX_EARLYCON_MEM
> >
> > config PGTABLE_LEVELS
> > int
> > + default 5 if X86_5LEVEL
> > default 4 if X86_64
> > default 3 if X86_PAE
> > default 2
> > @@ -1390,6 +1391,10 @@ config X86_PAE
> > has the cost of more pagetable lookup overhead, and also
> > consumes more pagetable space per process.
> >
> > +config X86_5LEVEL
> > + bool "Enable 5-level page tables support"
> > + depends on X86_64
> > +
> > config ARCH_PHYS_ADDR_T_64BIT
> > def_bool y
> > depends on X86_64 || X86_PAE
> > diff --git a/arch/x86/xen/Kconfig b/arch/x86/xen/Kconfig
> > index 76b6dbd627df..b90d481ce5a1 100644
> > --- a/arch/x86/xen/Kconfig
> > +++ b/arch/x86/xen/Kconfig
> > @@ -5,6 +5,7 @@
> > config XEN
> > bool "Xen guest support"
> > depends on PARAVIRT
> > + depends on !X86_5LEVEL
> > select PARAVIRT_CLOCK
> > select XEN_HAVE_PVMMU
> > select XEN_HAVE_VPMU
> >
>
> Just a heads up: this last change will conflict with the Xen tree.

It should be trivial to fix, right? It's a one-liner after all.

> Can't we just ignore the additional level in Xen pv mode and run with
> 4 levels instead?

We don't have boot-time switching between paging modes yet. It will
come later. So the answer is no.

--
Kirill A. Shutemov

2017-04-06 15:57:00

by Jürgen Groß

Subject: Re: [PATCH 7/8] x86: Enable 5-level paging support

On 06/04/17 17:24, Kirill A. Shutemov wrote:
> On Thu, Apr 06, 2017 at 04:52:11PM +0200, Juergen Gross wrote:
>> On 06/04/17 16:01, Kirill A. Shutemov wrote:
>>> Most of things are in place and we can enable support of 5-level paging.
>>>
>>> Enabling XEN with 5-level paging requires more work. The patch makes XEN
>>> dependent on !X86_5LEVEL.
>>>
>>> Signed-off-by: Kirill A. Shutemov <[email protected]>
>>> ---
>>> arch/x86/Kconfig | 5 +++++
>>> arch/x86/xen/Kconfig | 1 +
>>> 2 files changed, 6 insertions(+)
>>>
>>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>>> index 4e153e93273f..7a76dcac357e 100644
>>> --- a/arch/x86/Kconfig
>>> +++ b/arch/x86/Kconfig
>>> @@ -318,6 +318,7 @@ config FIX_EARLYCON_MEM
>>>
>>> config PGTABLE_LEVELS
>>> int
>>> + default 5 if X86_5LEVEL
>>> default 4 if X86_64
>>> default 3 if X86_PAE
>>> default 2
>>> @@ -1390,6 +1391,10 @@ config X86_PAE
>>> has the cost of more pagetable lookup overhead, and also
>>> consumes more pagetable space per process.
>>>
>>> +config X86_5LEVEL
>>> + bool "Enable 5-level page tables support"
>>> + depends on X86_64
>>> +
>>> config ARCH_PHYS_ADDR_T_64BIT
>>> def_bool y
>>> depends on X86_64 || X86_PAE
>>> diff --git a/arch/x86/xen/Kconfig b/arch/x86/xen/Kconfig
>>> index 76b6dbd627df..b90d481ce5a1 100644
>>> --- a/arch/x86/xen/Kconfig
>>> +++ b/arch/x86/xen/Kconfig
>>> @@ -5,6 +5,7 @@
>>> config XEN
>>> bool "Xen guest support"
>>> depends on PARAVIRT
>>> + depends on !X86_5LEVEL
>>> select PARAVIRT_CLOCK
>>> select XEN_HAVE_PVMMU
>>> select XEN_HAVE_VPMU
>>>
>>
>> Just a heads up: this last change will conflict with the Xen tree.
>
> It should be trivial to fix, right? It's one-liner after all.

Right. Just wanted to mention it.


Juergen

2017-04-06 18:47:32

by Dmitry Safonov

Subject: Re: [PATCH 8/8] x86/mm: Allow to have userspace mappings above 47-bits

Hi Kirill,

On 04/06/2017 05:01 PM, Kirill A. Shutemov wrote:
> On x86, 5-level paging enables 56-bit userspace virtual address space.
> Not all user space is ready to handle wide addresses. It's known that
> at least some JIT compilers use higher bits in pointers to encode their
> information. It collides with valid pointers with 5-level paging and
> leads to crashes.
>
> To mitigate this, we are not going to allocate virtual address space
> above 47-bit by default.
>
> But userspace can ask for allocation from full address space by
> specifying hint address (with or without MAP_FIXED) above 47-bits.
>
> If hint address set above 47-bit, but MAP_FIXED is not specified, we try
> to look for unmapped area by specified address. If it's already
> occupied, we look for unmapped area in *full* address space, rather than
> from 47-bit window.

Do you want the mmap() calls that follow the first over-47-bit mapping to
also return over-47-bit addresses if there is free space?
If so, you could simplify all this code by changing only mm->mmap_base
on the first over-47-bit mmap() call.
That would do the trick.
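Roughly the shape I have in mind (an untested sketch, not a patch;
widen_mmap_base_once() is a made-up helper, and it assumes the top-down
layout, where mm->mmap_base is the upper bound the search works down from):

	/* Untested sketch: widen the window once, on the first wide-hint mmap(). */
	static void widen_mmap_base_once(struct mm_struct *mm, unsigned long hint)
	{
		if (hint > DEFAULT_MAP_WINDOW &&
		    mm->mmap_base <= DEFAULT_MAP_WINDOW)
			mm->mmap_base += TASK_SIZE_MAX - DEFAULT_MAP_WINDOW;
	}

After that, later mmap() calls in the task could land above 47-bit too,
without per-call checks.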

>
> This approach helps to easily make application's memory allocator aware
> about large address space without manually tracking allocated virtual
> address space.
>
> One important case we need to handle here is interaction with MPX.
> MPX (without MAWA( extension cannot handle addresses above 47-bit, so we
> need to make sure that MPX cannot be enabled we already have VMA above
> the boundary and forbid creating such VMAs once MPX is enabled.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> Cc: Dmitry Safonov <[email protected]>
> ---
> arch/x86/include/asm/elf.h | 2 +-
> arch/x86/include/asm/mpx.h | 9 +++++++++
> arch/x86/include/asm/processor.h | 9 ++++++---
> arch/x86/kernel/sys_x86_64.c | 28 +++++++++++++++++++++++++++-
> arch/x86/mm/hugetlbpage.c | 27 ++++++++++++++++++++++++---
> arch/x86/mm/mmap.c | 2 +-
> arch/x86/mm/mpx.c | 33 ++++++++++++++++++++++++++++++++-
> 7 files changed, 100 insertions(+), 10 deletions(-)
>
> diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
> index d4d3ed456cb7..67260dbe1688 100644
> --- a/arch/x86/include/asm/elf.h
> +++ b/arch/x86/include/asm/elf.h
> @@ -250,7 +250,7 @@ extern int force_personality32;
> the loader. We need to make sure that it is out of the way of the program
> that it will "exec", and that there is sufficient room for the brk. */
>
> -#define ELF_ET_DYN_BASE (TASK_SIZE / 3 * 2)
> +#define ELF_ET_DYN_BASE (DEFAULT_MAP_WINDOW / 3 * 2)

This will kill 32-bit userspace:
since DEFAULT_MAP_WINDOW is defined as what was previously TASK_SIZE_MAX,
not TASK_SIZE, ELF_ET_DYN_BASE will be over 4Gb for ia32/x32.
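(Concretely, assuming PAGE_SIZE of 4096: DEFAULT_MAP_WINDOW / 3 * 2 works out
to roughly 0x555555554aaa, about 85 TiB, nowhere near mappable for a 32-bit
process.)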

Here is the test:
[root@localhost test]# cat hello-world.c
#include <stdio.h>

int main(int argc, char **argv)
{
printf("Maybe this world is another planet's hell.\n");
return 0;
}
[root@localhost test]# gcc -m32 hello-world.c -o hello-world
[root@localhost test]# ./hello-world
[ 35.306726] hello-world[1948]: segfault at ffa5288c ip
00000000f77b5a82 sp 00000000ffa52890 error 6 in ld-2.23.so[f77b5000+23000]
Segmentation fault (core dumped)

So the dynamic base should differ between 32-bit and 64-bit, as it did with TASK_SIZE.


>
> /* This yields a mask that user programs can use to figure out what
> instruction set this CPU supports. This could be done in user space,
> diff --git a/arch/x86/include/asm/mpx.h b/arch/x86/include/asm/mpx.h
> index a0d662be4c5b..7d7404756bb4 100644
> --- a/arch/x86/include/asm/mpx.h
> +++ b/arch/x86/include/asm/mpx.h
> @@ -73,6 +73,9 @@ static inline void mpx_mm_init(struct mm_struct *mm)
> }
> void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
> unsigned long start, unsigned long end);
> +
> +unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned long len,
> + unsigned long flags);
> #else
> static inline siginfo_t *mpx_generate_siginfo(struct pt_regs *regs)
> {
> @@ -94,6 +97,12 @@ static inline void mpx_notify_unmap(struct mm_struct *mm,
> unsigned long start, unsigned long end)
> {
> }
> +
> +static inline unsigned long mpx_unmapped_area_check(unsigned long addr,
> + unsigned long len, unsigned long flags)
> +{
> + return addr;
> +}
> #endif /* CONFIG_X86_INTEL_MPX */
>
> #endif /* _ASM_X86_MPX_H */
> diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
> index 3cada998a402..9f437aea7f57 100644
> --- a/arch/x86/include/asm/processor.h
> +++ b/arch/x86/include/asm/processor.h
> @@ -795,6 +795,7 @@ static inline void spin_lock_prefetch(const void *x)
> #define IA32_PAGE_OFFSET PAGE_OFFSET
> #define TASK_SIZE PAGE_OFFSET
> #define TASK_SIZE_MAX TASK_SIZE
> +#define DEFAULT_MAP_WINDOW TASK_SIZE
> #define STACK_TOP TASK_SIZE
> #define STACK_TOP_MAX STACK_TOP
>
> @@ -834,7 +835,9 @@ static inline void spin_lock_prefetch(const void *x)
> * particular problem by preventing anything from being mapped
> * at the maximum canonical address.
> */
> -#define TASK_SIZE_MAX ((1UL << 47) - PAGE_SIZE)
> +#define TASK_SIZE_MAX ((1UL << __VIRTUAL_MASK_SHIFT) - PAGE_SIZE)
> +
> +#define DEFAULT_MAP_WINDOW ((1UL << 47) - PAGE_SIZE)
>
> /* This decides where the kernel will search for a free chunk of vm
> * space during mmap's.
> @@ -847,7 +850,7 @@ static inline void spin_lock_prefetch(const void *x)
> #define TASK_SIZE_OF(child) ((test_tsk_thread_flag(child, TIF_ADDR32)) ? \
> IA32_PAGE_OFFSET : TASK_SIZE_MAX)
>
> -#define STACK_TOP TASK_SIZE
> +#define STACK_TOP DEFAULT_MAP_WINDOW
> #define STACK_TOP_MAX TASK_SIZE_MAX
>
> #define INIT_THREAD { \
> @@ -870,7 +873,7 @@ extern void start_thread(struct pt_regs *regs, unsigned long new_ip,
> * space during mmap's.
> */
> #define __TASK_UNMAPPED_BASE(task_size) (PAGE_ALIGN(task_size / 3))
> -#define TASK_UNMAPPED_BASE __TASK_UNMAPPED_BASE(TASK_SIZE)
> +#define TASK_UNMAPPED_BASE __TASK_UNMAPPED_BASE(DEFAULT_MAP_WINDOW)

ditto

>
> #define KSTK_EIP(task) (task_pt_regs(task)->ip)
>
> diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c
> index 207b8f2582c7..593a31e93812 100644
> --- a/arch/x86/kernel/sys_x86_64.c
> +++ b/arch/x86/kernel/sys_x86_64.c
> @@ -21,6 +21,7 @@
> #include <asm/compat.h>
> #include <asm/ia32.h>
> #include <asm/syscalls.h>
> +#include <asm/mpx.h>
>
> /*
> * Align a virtual address to avoid aliasing in the I$ on AMD F15h.
> @@ -132,6 +133,10 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
> struct vm_unmapped_area_info info;
> unsigned long begin, end;
>
> + addr = mpx_unmapped_area_check(addr, len, flags);
> + if (IS_ERR_VALUE(addr))
> + return addr;
> +
> if (flags & MAP_FIXED)
> return addr;
>
> @@ -151,7 +156,16 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
> info.flags = 0;
> info.length = len;
> info.low_limit = begin;
> - info.high_limit = end;
> +
> + /*
> + * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
> + * in the full address space.
> + */
> + if (addr > DEFAULT_MAP_WINDOW)
> + info.high_limit = min(end, TASK_SIZE);
> + else
> + info.high_limit = min(end, DEFAULT_MAP_WINDOW);
> +
> info.align_mask = 0;
> info.align_offset = pgoff << PAGE_SHIFT;
> if (filp) {
> @@ -171,6 +185,10 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
> unsigned long addr = addr0;
> struct vm_unmapped_area_info info;
>
> + addr = mpx_unmapped_area_check(addr, len, flags);
> + if (IS_ERR_VALUE(addr))
> + return addr;
> +
> /* requested length too big for entire address space */
> if (len > TASK_SIZE)
> return -ENOMEM;
> @@ -195,6 +213,14 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
> info.length = len;
> info.low_limit = PAGE_SIZE;
> info.high_limit = get_mmap_base(0);
> +
> + /*
> + * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
> + * in the full address space.
> + */
> + if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
> + info.high_limit += TASK_SIZE - DEFAULT_MAP_WINDOW;

Hmm, TASK_SIZE now depends on TIF_ADDR32, which is set during exec().
That means that for an ia32/x32 ELF, which has TASK_SIZE < 4Gb because
TIF_ADDR32 is set but which can still do 64-bit syscalls, the subtraction
will be negative.
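(Since these are unsigned longs, the "negative" result actually wraps. To
illustrate with the default ia32 layout, assuming TASK_SIZE == IA32_PAGE_OFFSET
== 0xFFFFe000 and PAGE_SIZE == 4096:

	0xFFFFe000UL - ((1UL << 47) - 4096) == 0xffff8000fffff000UL

so info.high_limit gets pushed far beyond any valid userspace address.)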


> +
> info.align_mask = 0;
> info.align_offset = pgoff << PAGE_SHIFT;
> if (filp) {
> diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
> index 302f43fd9c28..9a0b89252c52 100644
> --- a/arch/x86/mm/hugetlbpage.c
> +++ b/arch/x86/mm/hugetlbpage.c
> @@ -18,6 +18,7 @@
> #include <asm/tlbflush.h>
> #include <asm/pgalloc.h>
> #include <asm/elf.h>
> +#include <asm/mpx.h>
>
> #if 0 /* This is just for testing */
> struct page *
> @@ -87,23 +88,38 @@ static unsigned long hugetlb_get_unmapped_area_bottomup(struct file *file,
> info.low_limit = get_mmap_base(1);
> info.high_limit = in_compat_syscall() ?
> tasksize_32bit() : tasksize_64bit();
> +
> + /*
> + * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
> + * in the full address space.
> + */
> + if (addr > DEFAULT_MAP_WINDOW)
> + info.high_limit = TASK_SIZE;
> +
> info.align_mask = PAGE_MASK & ~huge_page_mask(h);
> info.align_offset = 0;
> return vm_unmapped_area(&info);
> }
>
> static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
> - unsigned long addr0, unsigned long len,
> + unsigned long addr, unsigned long len,
> unsigned long pgoff, unsigned long flags)
> {
> struct hstate *h = hstate_file(file);
> struct vm_unmapped_area_info info;
> - unsigned long addr;
>
> info.flags = VM_UNMAPPED_AREA_TOPDOWN;
> info.length = len;
> info.low_limit = PAGE_SIZE;
> info.high_limit = get_mmap_base(0);
> +
> + /*
> + * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
> + * in the full address space.
> + */
> + if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
> + info.high_limit += TASK_SIZE - DEFAULT_MAP_WINDOW;

ditto

> +
> info.align_mask = PAGE_MASK & ~huge_page_mask(h);
> info.align_offset = 0;
> addr = vm_unmapped_area(&info);
> @@ -118,7 +134,7 @@ static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
> VM_BUG_ON(addr != -ENOMEM);
> info.flags = 0;
> info.low_limit = TASK_UNMAPPED_BASE;
> - info.high_limit = TASK_SIZE;
> + info.high_limit = DEFAULT_MAP_WINDOW;

ditto about 32-bits

> addr = vm_unmapped_area(&info);
> }
>
> @@ -135,6 +151,11 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
>
> if (len & ~huge_page_mask(h))
> return -EINVAL;
> +
> + addr = mpx_unmapped_area_check(addr, len, flags);
> + if (IS_ERR_VALUE(addr))
> + return addr;
> +
> if (len > TASK_SIZE)
> return -ENOMEM;
>
> diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
> index 19ad095b41df..d63232a31945 100644
> --- a/arch/x86/mm/mmap.c
> +++ b/arch/x86/mm/mmap.c
> @@ -44,7 +44,7 @@ unsigned long tasksize_32bit(void)
>
> unsigned long tasksize_64bit(void)
> {
> - return TASK_SIZE_MAX;
> + return DEFAULT_MAP_WINDOW;
> }
>
> static unsigned long stack_maxrandom_size(unsigned long task_size)
> diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
> index cd44ae727df7..a26a1b373fd0 100644
> --- a/arch/x86/mm/mpx.c
> +++ b/arch/x86/mm/mpx.c
> @@ -355,10 +355,19 @@ int mpx_enable_management(void)
> */
> bd_base = mpx_get_bounds_dir();
> down_write(&mm->mmap_sem);
> +
> + /* MPX doesn't support addresses above 47-bits yet. */
> + if (find_vma(mm, DEFAULT_MAP_WINDOW)) {
> + pr_warn_once("%s (%d): MPX cannot handle addresses "
> + "above 47-bits. Disabling.",
> + current->comm, current->pid);
> + ret = -ENXIO;
> + goto out;
> + }
> mm->context.bd_addr = bd_base;
> if (mm->context.bd_addr == MPX_INVALID_BOUNDS_DIR)
> ret = -ENXIO;
> -
> +out:
> up_write(&mm->mmap_sem);
> return ret;
> }
> @@ -1038,3 +1047,25 @@ void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
> if (ret)
> force_sig(SIGSEGV, current);
> }
> +
> +/* MPX cannot handle addresses above 47-bits yet. */
> +unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned long len,
> + unsigned long flags)
> +{
> + if (!kernel_managing_mpx_tables(current->mm))
> + return addr;
> + if (addr + len <= DEFAULT_MAP_WINDOW)
> + return addr;
> + if (flags & MAP_FIXED)
> + return -ENOMEM;
> +
> + /*
> + * Requested len is larger than whole area we're allowed to map in.
> + * Resetting hinting address wouldn't do much good -- fail early.
> + */
> + if (len > DEFAULT_MAP_WINDOW)
> + return -ENOMEM;
> +
> + /* Look for unmap area within DEFAULT_MAP_WINDOW */
> + return 0;
> +}
>


--
Dmitry

2017-04-06 19:19:35

by Dmitry Safonov

[permalink] [raw]
Subject: Re: [PATCH 8/8] x86/mm: Allow to have userspace mappings above 47-bits

On 04/06/2017 09:43 PM, Dmitry Safonov wrote:
> Hi Kirill,
>
> On 04/06/2017 05:01 PM, Kirill A. Shutemov wrote:
>> On x86, 5-level paging enables 56-bit userspace virtual address space.
>> Not all user space is ready to handle wide addresses. It's known that
>> at least some JIT compilers use higher bits in pointers to encode their
>> information. It collides with valid pointers with 5-level paging and
>> leads to crashes.
>>
>> To mitigate this, we are not going to allocate virtual address space
>> above 47-bit by default.
>>
>> But userspace can ask for allocation from full address space by
>> specifying hint address (with or without MAP_FIXED) above 47-bits.
>>
>> If hint address set above 47-bit, but MAP_FIXED is not specified, we try
>> to look for unmapped area by specified address. If it's already
>> occupied, we look for unmapped area in *full* address space, rather than
>> from 47-bit window.
>
> Do you wish after the first over-47-bit mapping the following mmap()
> calls return also over-47-bits if there is free space?
> It so, you could simplify all this code by changing only mm->mmap_base
> on the first over-47-bit mmap() call.
> This will do simple trick.
>
>>
>> This approach helps to easily make application's memory allocator aware
>> about large address space without manually tracking allocated virtual
>> address space.
>>
>> One important case we need to handle here is interaction with MPX.
>> MPX (without MAWA( extension cannot handle addresses above 47-bit, so we
>> need to make sure that MPX cannot be enabled we already have VMA above
>> the boundary and forbid creating such VMAs once MPX is enabled.
>>
>> Signed-off-by: Kirill A. Shutemov <[email protected]>
>> Cc: Dmitry Safonov <[email protected]>
>> ---
>> arch/x86/include/asm/elf.h | 2 +-
>> arch/x86/include/asm/mpx.h | 9 +++++++++
>> arch/x86/include/asm/processor.h | 9 ++++++---
>> arch/x86/kernel/sys_x86_64.c | 28 +++++++++++++++++++++++++++-
>> arch/x86/mm/hugetlbpage.c | 27 ++++++++++++++++++++++++---
>> arch/x86/mm/mmap.c | 2 +-
>> arch/x86/mm/mpx.c | 33 ++++++++++++++++++++++++++++++++-
>> 7 files changed, 100 insertions(+), 10 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
>> index d4d3ed456cb7..67260dbe1688 100644
>> --- a/arch/x86/include/asm/elf.h
>> +++ b/arch/x86/include/asm/elf.h
>> @@ -250,7 +250,7 @@ extern int force_personality32;
>> the loader. We need to make sure that it is out of the way of the
>> program
>> that it will "exec", and that there is sufficient room for the
>> brk. */
>>
>> -#define ELF_ET_DYN_BASE (TASK_SIZE / 3 * 2)
>> +#define ELF_ET_DYN_BASE (DEFAULT_MAP_WINDOW / 3 * 2)
>
> This will kill 32-bit userspace:
> As DEFAULT_MAP_WINDOW is defined as what previously was TASK_SIZE_MAX,
> not TASK_SIZE, for ia32/x32 ELF_ET_DYN_BASE will be over 4Gb.
>
> Here is the test:
> [root@localhost test]# cat hello-world.c
> #include <stdio.h>
>
> int main(int argc, char **argv)
> {
> printf("Maybe this world is another planet's hell.\n");
> return 0;
> }
> [root@localhost test]# gcc -m32 hello-world.c -o hello-world
> [root@localhost test]# ./hello-world
> [ 35.306726] hello-world[1948]: segfault at ffa5288c ip
> 00000000f77b5a82 sp 00000000ffa52890 error 6 in ld-2.23.so[f77b5000+23000]
> Segmentation fault (core dumped)
>
> So, dynamic base should differ between 32/64-bits as it was with TASK_SIZE.

I just tried to define it like this:
-#define DEFAULT_MAP_WINDOW ((1UL << 47) - PAGE_SIZE)
+#define DEFAULT_MAP_WINDOW (test_thread_flag(TIF_ADDR32) ? \
+ IA32_PAGE_OFFSET : ((1UL << 47) - PAGE_SIZE))

And it seems to work better.

>
>
>>
>> /* This yields a mask that user programs can use to figure out what
>> instruction set this CPU supports. This could be done in user space,
>> diff --git a/arch/x86/include/asm/mpx.h b/arch/x86/include/asm/mpx.h
>> index a0d662be4c5b..7d7404756bb4 100644
>> --- a/arch/x86/include/asm/mpx.h
>> +++ b/arch/x86/include/asm/mpx.h
>> @@ -73,6 +73,9 @@ static inline void mpx_mm_init(struct mm_struct *mm)
>> }
>> void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
>> unsigned long start, unsigned long end);
>> +
>> +unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned
>> long len,
>> + unsigned long flags);
>> #else
>> static inline siginfo_t *mpx_generate_siginfo(struct pt_regs *regs)
>> {
>> @@ -94,6 +97,12 @@ static inline void mpx_notify_unmap(struct
>> mm_struct *mm,
>> unsigned long start, unsigned long end)
>> {
>> }
>> +
>> +static inline unsigned long mpx_unmapped_area_check(unsigned long addr,
>> + unsigned long len, unsigned long flags)
>> +{
>> + return addr;
>> +}
>> #endif /* CONFIG_X86_INTEL_MPX */
>>
>> #endif /* _ASM_X86_MPX_H */
>> diff --git a/arch/x86/include/asm/processor.h
>> b/arch/x86/include/asm/processor.h
>> index 3cada998a402..9f437aea7f57 100644
>> --- a/arch/x86/include/asm/processor.h
>> +++ b/arch/x86/include/asm/processor.h
>> @@ -795,6 +795,7 @@ static inline void spin_lock_prefetch(const void *x)
>> #define IA32_PAGE_OFFSET PAGE_OFFSET
>> #define TASK_SIZE PAGE_OFFSET
>> #define TASK_SIZE_MAX TASK_SIZE
>> +#define DEFAULT_MAP_WINDOW TASK_SIZE
>> #define STACK_TOP TASK_SIZE
>> #define STACK_TOP_MAX STACK_TOP
>>
>> @@ -834,7 +835,9 @@ static inline void spin_lock_prefetch(const void *x)
>> * particular problem by preventing anything from being mapped
>> * at the maximum canonical address.
>> */
>> -#define TASK_SIZE_MAX ((1UL << 47) - PAGE_SIZE)
>> +#define TASK_SIZE_MAX ((1UL << __VIRTUAL_MASK_SHIFT) - PAGE_SIZE)
>> +
>> +#define DEFAULT_MAP_WINDOW ((1UL << 47) - PAGE_SIZE)
>>
>> /* This decides where the kernel will search for a free chunk of vm
>> * space during mmap's.
>> @@ -847,7 +850,7 @@ static inline void spin_lock_prefetch(const void *x)
>> #define TASK_SIZE_OF(child) ((test_tsk_thread_flag(child,
>> TIF_ADDR32)) ? \
>> IA32_PAGE_OFFSET : TASK_SIZE_MAX)
>>
>> -#define STACK_TOP TASK_SIZE
>> +#define STACK_TOP DEFAULT_MAP_WINDOW
>> #define STACK_TOP_MAX TASK_SIZE_MAX
>>
>> #define INIT_THREAD { \
>> @@ -870,7 +873,7 @@ extern void start_thread(struct pt_regs *regs,
>> unsigned long new_ip,
>> * space during mmap's.
>> */
>> #define __TASK_UNMAPPED_BASE(task_size) (PAGE_ALIGN(task_size / 3))
>> -#define TASK_UNMAPPED_BASE __TASK_UNMAPPED_BASE(TASK_SIZE)
>> +#define TASK_UNMAPPED_BASE
>> __TASK_UNMAPPED_BASE(DEFAULT_MAP_WINDOW)
>
> ditto
>
>>
>> #define KSTK_EIP(task) (task_pt_regs(task)->ip)
>>
>> diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c
>> index 207b8f2582c7..593a31e93812 100644
>> --- a/arch/x86/kernel/sys_x86_64.c
>> +++ b/arch/x86/kernel/sys_x86_64.c
>> @@ -21,6 +21,7 @@
>> #include <asm/compat.h>
>> #include <asm/ia32.h>
>> #include <asm/syscalls.h>
>> +#include <asm/mpx.h>
>>
>> /*
>> * Align a virtual address to avoid aliasing in the I$ on AMD F15h.
>> @@ -132,6 +133,10 @@ arch_get_unmapped_area(struct file *filp,
>> unsigned long addr,
>> struct vm_unmapped_area_info info;
>> unsigned long begin, end;
>>
>> + addr = mpx_unmapped_area_check(addr, len, flags);
>> + if (IS_ERR_VALUE(addr))
>> + return addr;
>> +
>> if (flags & MAP_FIXED)
>> return addr;
>>
>> @@ -151,7 +156,16 @@ arch_get_unmapped_area(struct file *filp,
>> unsigned long addr,
>> info.flags = 0;
>> info.length = len;
>> info.low_limit = begin;
>> - info.high_limit = end;
>> +
>> + /*
>> + * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped
>> area
>> + * in the full address space.
>> + */
>> + if (addr > DEFAULT_MAP_WINDOW)
>> + info.high_limit = min(end, TASK_SIZE);
>> + else
>> + info.high_limit = min(end, DEFAULT_MAP_WINDOW);
>> +
>> info.align_mask = 0;
>> info.align_offset = pgoff << PAGE_SHIFT;
>> if (filp) {
>> @@ -171,6 +185,10 @@ arch_get_unmapped_area_topdown(struct file *filp,
>> const unsigned long addr0,
>> unsigned long addr = addr0;
>> struct vm_unmapped_area_info info;
>>
>> + addr = mpx_unmapped_area_check(addr, len, flags);
>> + if (IS_ERR_VALUE(addr))
>> + return addr;
>> +
>> /* requested length too big for entire address space */
>> if (len > TASK_SIZE)
>> return -ENOMEM;
>> @@ -195,6 +213,14 @@ arch_get_unmapped_area_topdown(struct file *filp,
>> const unsigned long addr0,
>> info.length = len;
>> info.low_limit = PAGE_SIZE;
>> info.high_limit = get_mmap_base(0);
>> +
>> + /*
>> + * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped
>> area
>> + * in the full address space.
>> + */
>> + if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
>> + info.high_limit += TASK_SIZE - DEFAULT_MAP_WINDOW;
>
> Hmm, TASK_SIZE depends now on TIF_ADDR32, which is set during exec().
> That means for ia32/x32 ELF which has TASK_SIZE < 4Gb as TIF_ADDR32
> is set, which can do 64-bit syscalls - the subtraction will be
> a negative..
>
>
>> +
>> info.align_mask = 0;
>> info.align_offset = pgoff << PAGE_SHIFT;
>> if (filp) {
>> diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
>> index 302f43fd9c28..9a0b89252c52 100644
>> --- a/arch/x86/mm/hugetlbpage.c
>> +++ b/arch/x86/mm/hugetlbpage.c
>> @@ -18,6 +18,7 @@
>> #include <asm/tlbflush.h>
>> #include <asm/pgalloc.h>
>> #include <asm/elf.h>
>> +#include <asm/mpx.h>
>>
>> #if 0 /* This is just for testing */
>> struct page *
>> @@ -87,23 +88,38 @@ static unsigned long
>> hugetlb_get_unmapped_area_bottomup(struct file *file,
>> info.low_limit = get_mmap_base(1);
>> info.high_limit = in_compat_syscall() ?
>> tasksize_32bit() : tasksize_64bit();
>> +
>> + /*
>> + * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped
>> area
>> + * in the full address space.
>> + */
>> + if (addr > DEFAULT_MAP_WINDOW)
>> + info.high_limit = TASK_SIZE;
>> +
>> info.align_mask = PAGE_MASK & ~huge_page_mask(h);
>> info.align_offset = 0;
>> return vm_unmapped_area(&info);
>> }
>>
>> static unsigned long hugetlb_get_unmapped_area_topdown(struct file
>> *file,
>> - unsigned long addr0, unsigned long len,
>> + unsigned long addr, unsigned long len,
>> unsigned long pgoff, unsigned long flags)
>> {
>> struct hstate *h = hstate_file(file);
>> struct vm_unmapped_area_info info;
>> - unsigned long addr;
>>
>> info.flags = VM_UNMAPPED_AREA_TOPDOWN;
>> info.length = len;
>> info.low_limit = PAGE_SIZE;
>> info.high_limit = get_mmap_base(0);
>> +
>> + /*
>> + * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped
>> area
>> + * in the full address space.
>> + */
>> + if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
>> + info.high_limit += TASK_SIZE - DEFAULT_MAP_WINDOW;
>
> ditto
>
>> +
>> info.align_mask = PAGE_MASK & ~huge_page_mask(h);
>> info.align_offset = 0;
>> addr = vm_unmapped_area(&info);
>> @@ -118,7 +134,7 @@ static unsigned long
>> hugetlb_get_unmapped_area_topdown(struct file *file,
>> VM_BUG_ON(addr != -ENOMEM);
>> info.flags = 0;
>> info.low_limit = TASK_UNMAPPED_BASE;
>> - info.high_limit = TASK_SIZE;
>> + info.high_limit = DEFAULT_MAP_WINDOW;
>
> ditto about 32-bits
>
>> addr = vm_unmapped_area(&info);
>> }
>>
>> @@ -135,6 +151,11 @@ hugetlb_get_unmapped_area(struct file *file,
>> unsigned long addr,
>>
>> if (len & ~huge_page_mask(h))
>> return -EINVAL;
>> +
>> + addr = mpx_unmapped_area_check(addr, len, flags);
>> + if (IS_ERR_VALUE(addr))
>> + return addr;
>> +
>> if (len > TASK_SIZE)
>> return -ENOMEM;
>>
>> diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
>> index 19ad095b41df..d63232a31945 100644
>> --- a/arch/x86/mm/mmap.c
>> +++ b/arch/x86/mm/mmap.c
>> @@ -44,7 +44,7 @@ unsigned long tasksize_32bit(void)
>>
>> unsigned long tasksize_64bit(void)
>> {
>> - return TASK_SIZE_MAX;
>> + return DEFAULT_MAP_WINDOW;
>> }
>>
>> static unsigned long stack_maxrandom_size(unsigned long task_size)
>> diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
>> index cd44ae727df7..a26a1b373fd0 100644
>> --- a/arch/x86/mm/mpx.c
>> +++ b/arch/x86/mm/mpx.c
>> @@ -355,10 +355,19 @@ int mpx_enable_management(void)
>> */
>> bd_base = mpx_get_bounds_dir();
>> down_write(&mm->mmap_sem);
>> +
>> + /* MPX doesn't support addresses above 47-bits yet. */
>> + if (find_vma(mm, DEFAULT_MAP_WINDOW)) {
>> + pr_warn_once("%s (%d): MPX cannot handle addresses "
>> + "above 47-bits. Disabling.",
>> + current->comm, current->pid);
>> + ret = -ENXIO;
>> + goto out;
>> + }
>> mm->context.bd_addr = bd_base;
>> if (mm->context.bd_addr == MPX_INVALID_BOUNDS_DIR)
>> ret = -ENXIO;
>> -
>> +out:
>> up_write(&mm->mmap_sem);
>> return ret;
>> }
>> @@ -1038,3 +1047,25 @@ void mpx_notify_unmap(struct mm_struct *mm,
>> struct vm_area_struct *vma,
>> if (ret)
>> force_sig(SIGSEGV, current);
>> }
>> +
>> +/* MPX cannot handle addresses above 47-bits yet. */
>> +unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned
>> long len,
>> + unsigned long flags)
>> +{
>> + if (!kernel_managing_mpx_tables(current->mm))
>> + return addr;
>> + if (addr + len <= DEFAULT_MAP_WINDOW)
>> + return addr;
>> + if (flags & MAP_FIXED)
>> + return -ENOMEM;
>> +
>> + /*
>> + * Requested len is larger than whole area we're allowed to map in.
>> + * Resetting hinting address wouldn't do much good -- fail early.
>> + */
>> + if (len > DEFAULT_MAP_WINDOW)
>> + return -ENOMEM;
>> +
>> + /* Look for unmap area within DEFAULT_MAP_WINDOW */
>> + return 0;
>> +}
>>
>
>


--
Dmitry

2017-04-06 23:22:21

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH 8/8] x86/mm: Allow to have userspace mappings above 47-bits

On Thu, Apr 06, 2017 at 10:15:47PM +0300, Dmitry Safonov wrote:
> On 04/06/2017 09:43 PM, Dmitry Safonov wrote:
> > Hi Kirill,
> >
> > On 04/06/2017 05:01 PM, Kirill A. Shutemov wrote:
> > > On x86, 5-level paging enables 56-bit userspace virtual address space.
> > > Not all user space is ready to handle wide addresses. It's known that
> > > at least some JIT compilers use higher bits in pointers to encode their
> > > information. It collides with valid pointers with 5-level paging and
> > > leads to crashes.
> > >
> > > To mitigate this, we are not going to allocate virtual address space
> > > above 47-bit by default.
> > >
> > > But userspace can ask for allocation from full address space by
> > > specifying hint address (with or without MAP_FIXED) above 47-bits.
> > >
> > > If hint address set above 47-bit, but MAP_FIXED is not specified, we try
> > > to look for unmapped area by specified address. If it's already
> > > occupied, we look for unmapped area in *full* address space, rather than
> > > from 47-bit window.
> >
> > Do you wish after the first over-47-bit mapping the following mmap()
> > calls return also over-47-bits if there is free space?
> > It so, you could simplify all this code by changing only mm->mmap_base
> > on the first over-47-bit mmap() call.
> > This will do simple trick.

No.

I want every allocation to explicitly opt in to the large address space. It's
an additional fail-safe: if a library can't handle large addresses, it has a
better chance of surviving if its own allocations stay within 47-bits.
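For instance (an illustrative userspace sketch only, not part of the series;
the sizes are made up), an allocator that wants one arena in the wide space
passes a high hint on just that call:

	#include <stdio.h>
	#include <sys/mman.h>

	int main(void)
	{
		/* Hint above 47-bit: this one call opts in to the full space. */
		void *wide = mmap((void *)(1UL << 47), 2UL << 20,
				  PROT_READ | PROT_WRITE,
				  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		/* No hint: stays within the 47-bit window by default. */
		void *low = mmap(NULL, 2UL << 20, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		printf("wide=%p low=%p\n", wide, low);
		return 0;
	}

Any library doing its own plain mmap(NULL, ...) keeps getting addresses below
47-bit.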

> I just tried to define it like this:
> -#define DEFAULT_MAP_WINDOW ((1UL << 47) - PAGE_SIZE)
> +#define DEFAULT_MAP_WINDOW (test_thread_flag(TIF_ADDR32) ? \
> + IA32_PAGE_OFFSET : ((1UL << 47) -
> PAGE_SIZE))
>
> And it looks working better.

Okay, thanks. I'll send v2.

> > > + if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
> > > + info.high_limit += TASK_SIZE - DEFAULT_MAP_WINDOW;
> >
> > Hmm, TASK_SIZE depends now on TIF_ADDR32, which is set during exec().
> > That means for ia32/x32 ELF which has TASK_SIZE < 4Gb as TIF_ADDR32
> > is set, which can do 64-bit syscalls - the subtraction will be
> > a negative..

With your proposed change to the DEFAULT_MAP_WINDOW definition it should be
okay, right?

--
Kirill A. Shutemov

2017-04-06 23:25:35

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv2 8/8] x86/mm: Allow to have userspace mappings above 47-bits

On x86, 5-level paging enables 56-bit userspace virtual address space.
Not all user space is ready to handle wide addresses. It's known that
at least some JIT compilers use higher bits in pointers to encode their
information. It collides with valid pointers with 5-level paging and
leads to crashes.

To mitigate this, we are not going to allocate virtual address space
above 47-bit by default.

But userspace can ask for allocation from full address space by
specifying hint address (with or without MAP_FIXED) above 47-bits.

If hint address set above 47-bit, but MAP_FIXED is not specified, we try
to look for unmapped area by specified address. If it's already
occupied, we look for unmapped area in *full* address space, rather than
from 47-bit window.

This approach helps to easily make application's memory allocator aware
about large address space without manually tracking allocated virtual
address space.

One important case we need to handle here is interaction with MPX.
MPX (without the MAWA extension) cannot handle addresses above 47-bit, so we
need to make sure that MPX cannot be enabled if we already have a VMA above
the boundary, and forbid creating such VMAs once MPX is enabled.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Cc: Dmitry Safonov <[email protected]>
---
arch/x86/include/asm/elf.h | 2 +-
arch/x86/include/asm/mpx.h | 9 +++++++++
arch/x86/include/asm/processor.h | 10 +++++++---
arch/x86/kernel/sys_x86_64.c | 28 +++++++++++++++++++++++++++-
arch/x86/mm/hugetlbpage.c | 27 ++++++++++++++++++++++++---
arch/x86/mm/mmap.c | 2 +-
arch/x86/mm/mpx.c | 33 ++++++++++++++++++++++++++++++++-
7 files changed, 101 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
index d4d3ed456cb7..67260dbe1688 100644
--- a/arch/x86/include/asm/elf.h
+++ b/arch/x86/include/asm/elf.h
@@ -250,7 +250,7 @@ extern int force_personality32;
the loader. We need to make sure that it is out of the way of the program
that it will "exec", and that there is sufficient room for the brk. */

-#define ELF_ET_DYN_BASE (TASK_SIZE / 3 * 2)
+#define ELF_ET_DYN_BASE (DEFAULT_MAP_WINDOW / 3 * 2)

/* This yields a mask that user programs can use to figure out what
instruction set this CPU supports. This could be done in user space,
diff --git a/arch/x86/include/asm/mpx.h b/arch/x86/include/asm/mpx.h
index a0d662be4c5b..7d7404756bb4 100644
--- a/arch/x86/include/asm/mpx.h
+++ b/arch/x86/include/asm/mpx.h
@@ -73,6 +73,9 @@ static inline void mpx_mm_init(struct mm_struct *mm)
}
void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long start, unsigned long end);
+
+unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned long len,
+ unsigned long flags);
#else
static inline siginfo_t *mpx_generate_siginfo(struct pt_regs *regs)
{
@@ -94,6 +97,12 @@ static inline void mpx_notify_unmap(struct mm_struct *mm,
unsigned long start, unsigned long end)
{
}
+
+static inline unsigned long mpx_unmapped_area_check(unsigned long addr,
+ unsigned long len, unsigned long flags)
+{
+ return addr;
+}
#endif /* CONFIG_X86_INTEL_MPX */

#endif /* _ASM_X86_MPX_H */
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 3cada998a402..a98395e89ac6 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -795,6 +795,7 @@ static inline void spin_lock_prefetch(const void *x)
#define IA32_PAGE_OFFSET PAGE_OFFSET
#define TASK_SIZE PAGE_OFFSET
#define TASK_SIZE_MAX TASK_SIZE
+#define DEFAULT_MAP_WINDOW TASK_SIZE
#define STACK_TOP TASK_SIZE
#define STACK_TOP_MAX STACK_TOP

@@ -834,7 +835,10 @@ static inline void spin_lock_prefetch(const void *x)
* particular problem by preventing anything from being mapped
* at the maximum canonical address.
*/
-#define TASK_SIZE_MAX ((1UL << 47) - PAGE_SIZE)
+#define TASK_SIZE_MAX ((1UL << __VIRTUAL_MASK_SHIFT) - PAGE_SIZE)
+
+#define DEFAULT_MAP_WINDOW (test_thread_flag(TIF_ADDR32) ? \
+ IA32_PAGE_OFFSET : ((1UL << 47) - PAGE_SIZE))

/* This decides where the kernel will search for a free chunk of vm
* space during mmap's.
@@ -847,7 +851,7 @@ static inline void spin_lock_prefetch(const void *x)
#define TASK_SIZE_OF(child) ((test_tsk_thread_flag(child, TIF_ADDR32)) ? \
IA32_PAGE_OFFSET : TASK_SIZE_MAX)

-#define STACK_TOP TASK_SIZE
+#define STACK_TOP DEFAULT_MAP_WINDOW
#define STACK_TOP_MAX TASK_SIZE_MAX

#define INIT_THREAD { \
@@ -870,7 +874,7 @@ extern void start_thread(struct pt_regs *regs, unsigned long new_ip,
* space during mmap's.
*/
#define __TASK_UNMAPPED_BASE(task_size) (PAGE_ALIGN(task_size / 3))
-#define TASK_UNMAPPED_BASE __TASK_UNMAPPED_BASE(TASK_SIZE)
+#define TASK_UNMAPPED_BASE __TASK_UNMAPPED_BASE(DEFAULT_MAP_WINDOW)

#define KSTK_EIP(task) (task_pt_regs(task)->ip)

diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c
index 207b8f2582c7..593a31e93812 100644
--- a/arch/x86/kernel/sys_x86_64.c
+++ b/arch/x86/kernel/sys_x86_64.c
@@ -21,6 +21,7 @@
#include <asm/compat.h>
#include <asm/ia32.h>
#include <asm/syscalls.h>
+#include <asm/mpx.h>

/*
* Align a virtual address to avoid aliasing in the I$ on AMD F15h.
@@ -132,6 +133,10 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
struct vm_unmapped_area_info info;
unsigned long begin, end;

+ addr = mpx_unmapped_area_check(addr, len, flags);
+ if (IS_ERR_VALUE(addr))
+ return addr;
+
if (flags & MAP_FIXED)
return addr;

@@ -151,7 +156,16 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
info.flags = 0;
info.length = len;
info.low_limit = begin;
- info.high_limit = end;
+
+ /*
+ * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
+ * in the full address space.
+ */
+ if (addr > DEFAULT_MAP_WINDOW)
+ info.high_limit = min(end, TASK_SIZE);
+ else
+ info.high_limit = min(end, DEFAULT_MAP_WINDOW);
+
info.align_mask = 0;
info.align_offset = pgoff << PAGE_SHIFT;
if (filp) {
@@ -171,6 +185,10 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
unsigned long addr = addr0;
struct vm_unmapped_area_info info;

+ addr = mpx_unmapped_area_check(addr, len, flags);
+ if (IS_ERR_VALUE(addr))
+ return addr;
+
/* requested length too big for entire address space */
if (len > TASK_SIZE)
return -ENOMEM;
@@ -195,6 +213,14 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
info.length = len;
info.low_limit = PAGE_SIZE;
info.high_limit = get_mmap_base(0);
+
+ /*
+ * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
+ * in the full address space.
+ */
+ if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
+ info.high_limit += TASK_SIZE - DEFAULT_MAP_WINDOW;
+
info.align_mask = 0;
info.align_offset = pgoff << PAGE_SHIFT;
if (filp) {
diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index 302f43fd9c28..9a0b89252c52 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -18,6 +18,7 @@
#include <asm/tlbflush.h>
#include <asm/pgalloc.h>
#include <asm/elf.h>
+#include <asm/mpx.h>

#if 0 /* This is just for testing */
struct page *
@@ -87,23 +88,38 @@ static unsigned long hugetlb_get_unmapped_area_bottomup(struct file *file,
info.low_limit = get_mmap_base(1);
info.high_limit = in_compat_syscall() ?
tasksize_32bit() : tasksize_64bit();
+
+ /*
+ * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
+ * in the full address space.
+ */
+ if (addr > DEFAULT_MAP_WINDOW)
+ info.high_limit = TASK_SIZE;
+
info.align_mask = PAGE_MASK & ~huge_page_mask(h);
info.align_offset = 0;
return vm_unmapped_area(&info);
}

static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
- unsigned long addr0, unsigned long len,
+ unsigned long addr, unsigned long len,
unsigned long pgoff, unsigned long flags)
{
struct hstate *h = hstate_file(file);
struct vm_unmapped_area_info info;
- unsigned long addr;

info.flags = VM_UNMAPPED_AREA_TOPDOWN;
info.length = len;
info.low_limit = PAGE_SIZE;
info.high_limit = get_mmap_base(0);
+
+ /*
+ * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
+ * in the full address space.
+ */
+ if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
+ info.high_limit += TASK_SIZE - DEFAULT_MAP_WINDOW;
+
info.align_mask = PAGE_MASK & ~huge_page_mask(h);
info.align_offset = 0;
addr = vm_unmapped_area(&info);
@@ -118,7 +134,7 @@ static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
VM_BUG_ON(addr != -ENOMEM);
info.flags = 0;
info.low_limit = TASK_UNMAPPED_BASE;
- info.high_limit = TASK_SIZE;
+ info.high_limit = DEFAULT_MAP_WINDOW;
addr = vm_unmapped_area(&info);
}

@@ -135,6 +151,11 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,

if (len & ~huge_page_mask(h))
return -EINVAL;
+
+ addr = mpx_unmapped_area_check(addr, len, flags);
+ if (IS_ERR_VALUE(addr))
+ return addr;
+
if (len > TASK_SIZE)
return -ENOMEM;

diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
index 19ad095b41df..d63232a31945 100644
--- a/arch/x86/mm/mmap.c
+++ b/arch/x86/mm/mmap.c
@@ -44,7 +44,7 @@ unsigned long tasksize_32bit(void)

unsigned long tasksize_64bit(void)
{
- return TASK_SIZE_MAX;
+ return DEFAULT_MAP_WINDOW;
}

static unsigned long stack_maxrandom_size(unsigned long task_size)
diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
index cd44ae727df7..a26a1b373fd0 100644
--- a/arch/x86/mm/mpx.c
+++ b/arch/x86/mm/mpx.c
@@ -355,10 +355,19 @@ int mpx_enable_management(void)
*/
bd_base = mpx_get_bounds_dir();
down_write(&mm->mmap_sem);
+
+ /* MPX doesn't support addresses above 47-bits yet. */
+ if (find_vma(mm, DEFAULT_MAP_WINDOW)) {
+ pr_warn_once("%s (%d): MPX cannot handle addresses "
+ "above 47-bits. Disabling.",
+ current->comm, current->pid);
+ ret = -ENXIO;
+ goto out;
+ }
mm->context.bd_addr = bd_base;
if (mm->context.bd_addr == MPX_INVALID_BOUNDS_DIR)
ret = -ENXIO;
-
+out:
up_write(&mm->mmap_sem);
return ret;
}
@@ -1038,3 +1047,25 @@ void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
if (ret)
force_sig(SIGSEGV, current);
}
+
+/* MPX cannot handle addresses above 47-bits yet. */
+unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned long len,
+ unsigned long flags)
+{
+ if (!kernel_managing_mpx_tables(current->mm))
+ return addr;
+ if (addr + len <= DEFAULT_MAP_WINDOW)
+ return addr;
+ if (flags & MAP_FIXED)
+ return -ENOMEM;
+
+ /*
+ * Requested len is larger than whole area we're allowed to map in.
+ * Resetting hinting address wouldn't do much good -- fail early.
+ */
+ if (len > DEFAULT_MAP_WINDOW)
+ return -ENOMEM;
+
+ /* Look for unmap area within DEFAULT_MAP_WINDOW */
+ return 0;
+}
--
2.11.0

2017-04-07 10:10:24

by Dmitry Safonov

[permalink] [raw]
Subject: Re: [PATCH 8/8] x86/mm: Allow to have userspace mappings above 47-bits

On 04/07/2017 02:21 AM, Kirill A. Shutemov wrote:
> On Thu, Apr 06, 2017 at 10:15:47PM +0300, Dmitry Safonov wrote:
>> On 04/06/2017 09:43 PM, Dmitry Safonov wrote:
>>> Hi Kirill,
>>>
>>> On 04/06/2017 05:01 PM, Kirill A. Shutemov wrote:
>>>> On x86, 5-level paging enables 56-bit userspace virtual address space.
>>>> Not all user space is ready to handle wide addresses. It's known that
>>>> at least some JIT compilers use higher bits in pointers to encode their
>>>> information. It collides with valid pointers with 5-level paging and
>>>> leads to crashes.
>>>>
>>>> To mitigate this, we are not going to allocate virtual address space
>>>> above 47-bit by default.
>>>>
>>>> But userspace can ask for allocation from full address space by
>>>> specifying hint address (with or without MAP_FIXED) above 47-bits.
>>>>
>>>> If hint address set above 47-bit, but MAP_FIXED is not specified, we try
>>>> to look for unmapped area by specified address. If it's already
>>>> occupied, we look for unmapped area in *full* address space, rather than
>>>> from 47-bit window.
>>>
>>> Do you wish after the first over-47-bit mapping the following mmap()
>>> calls return also over-47-bits if there is free space?
>>> It so, you could simplify all this code by changing only mm->mmap_base
>>> on the first over-47-bit mmap() call.
>>> This will do simple trick.
>
> No.
>
> I want every allocation to explicitely opt-in large address space. It's
> additional fail-safe: if a library can't handle large addresses it has
> better chance to survive if its own allocation will stay within 47-bits.

Ok

>
>> I just tried to define it like this:
>> -#define DEFAULT_MAP_WINDOW ((1UL << 47) - PAGE_SIZE)
>> +#define DEFAULT_MAP_WINDOW (test_thread_flag(TIF_ADDR32) ? \
>> + IA32_PAGE_OFFSET : ((1UL << 47) -
>> PAGE_SIZE))
>>
>> And it looks working better.
>
> Okay, thanks. I'll send v2.
>
>>>> + if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
>>>> + info.high_limit += TASK_SIZE - DEFAULT_MAP_WINDOW;
>>>
>>> Hmm, TASK_SIZE depends now on TIF_ADDR32, which is set during exec().
>>> That means for ia32/x32 ELF which has TASK_SIZE < 4Gb as TIF_ADDR32
>>> is set, which can do 64-bit syscalls - the subtraction will be
>>> a negative..
>
> With your proposed change to DEFAULT_MAP_WINDOW difinition it should be
> okay, right?

I'll comment on v2 to keep it all in one place.


--
Dmitry

2017-04-07 11:36:24

by Dmitry Safonov

[permalink] [raw]
Subject: Re: [PATCHv2 8/8] x86/mm: Allow to have userspace mappings above 47-bits

On 04/07/2017 02:24 AM, Kirill A. Shutemov wrote:
> On x86, 5-level paging enables 56-bit userspace virtual address space.
> Not all user space is ready to handle wide addresses. It's known that
> at least some JIT compilers use higher bits in pointers to encode their
> information. It collides with valid pointers with 5-level paging and
> leads to crashes.
>
> To mitigate this, we are not going to allocate virtual address space
> above 47-bit by default.
>
> But userspace can ask for allocation from full address space by
> specifying hint address (with or without MAP_FIXED) above 47-bits.
>
> If hint address set above 47-bit, but MAP_FIXED is not specified, we try
> to look for unmapped area by specified address. If it's already
> occupied, we look for unmapped area in *full* address space, rather than
> from 47-bit window.
>
> This approach helps to easily make application's memory allocator aware
> about large address space without manually tracking allocated virtual
> address space.
>
> One important case we need to handle here is interaction with MPX.
> MPX (without MAWA( extension cannot handle addresses above 47-bit, so we
> need to make sure that MPX cannot be enabled we already have VMA above
> the boundary and forbid creating such VMAs once MPX is enabled.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> Cc: Dmitry Safonov <[email protected]>
> ---
> arch/x86/include/asm/elf.h | 2 +-
> arch/x86/include/asm/mpx.h | 9 +++++++++
> arch/x86/include/asm/processor.h | 10 +++++++---
> arch/x86/kernel/sys_x86_64.c | 28 +++++++++++++++++++++++++++-
> arch/x86/mm/hugetlbpage.c | 27 ++++++++++++++++++++++++---
> arch/x86/mm/mmap.c | 2 +-
> arch/x86/mm/mpx.c | 33 ++++++++++++++++++++++++++++++++-
> 7 files changed, 101 insertions(+), 10 deletions(-)
>
> diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
> index d4d3ed456cb7..67260dbe1688 100644
> --- a/arch/x86/include/asm/elf.h
> +++ b/arch/x86/include/asm/elf.h
> @@ -250,7 +250,7 @@ extern int force_personality32;
> the loader. We need to make sure that it is out of the way of the program
> that it will "exec", and that there is sufficient room for the brk. */
>
> -#define ELF_ET_DYN_BASE (TASK_SIZE / 3 * 2)
> +#define ELF_ET_DYN_BASE (DEFAULT_MAP_WINDOW / 3 * 2)
>
> /* This yields a mask that user programs can use to figure out what
> instruction set this CPU supports. This could be done in user space,
> diff --git a/arch/x86/include/asm/mpx.h b/arch/x86/include/asm/mpx.h
> index a0d662be4c5b..7d7404756bb4 100644
> --- a/arch/x86/include/asm/mpx.h
> +++ b/arch/x86/include/asm/mpx.h
> @@ -73,6 +73,9 @@ static inline void mpx_mm_init(struct mm_struct *mm)
> }
> void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
> unsigned long start, unsigned long end);
> +
> +unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned long len,
> + unsigned long flags);
> #else
> static inline siginfo_t *mpx_generate_siginfo(struct pt_regs *regs)
> {
> @@ -94,6 +97,12 @@ static inline void mpx_notify_unmap(struct mm_struct *mm,
> unsigned long start, unsigned long end)
> {
> }
> +
> +static inline unsigned long mpx_unmapped_area_check(unsigned long addr,
> + unsigned long len, unsigned long flags)
> +{
> + return addr;
> +}
> #endif /* CONFIG_X86_INTEL_MPX */
>
> #endif /* _ASM_X86_MPX_H */
> diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
> index 3cada998a402..a98395e89ac6 100644
> --- a/arch/x86/include/asm/processor.h
> +++ b/arch/x86/include/asm/processor.h
> @@ -795,6 +795,7 @@ static inline void spin_lock_prefetch(const void *x)
> #define IA32_PAGE_OFFSET PAGE_OFFSET
> #define TASK_SIZE PAGE_OFFSET
> #define TASK_SIZE_MAX TASK_SIZE
> +#define DEFAULT_MAP_WINDOW TASK_SIZE
> #define STACK_TOP TASK_SIZE
> #define STACK_TOP_MAX STACK_TOP
>
> @@ -834,7 +835,10 @@ static inline void spin_lock_prefetch(const void *x)
> * particular problem by preventing anything from being mapped
> * at the maximum canonical address.
> */
> -#define TASK_SIZE_MAX ((1UL << 47) - PAGE_SIZE)
> +#define TASK_SIZE_MAX ((1UL << __VIRTUAL_MASK_SHIFT) - PAGE_SIZE)
> +
> +#define DEFAULT_MAP_WINDOW (test_thread_flag(TIF_ADDR32) ? \
> + IA32_PAGE_OFFSET : ((1UL << 47) - PAGE_SIZE))

That fixes 32-bit, but AFAICS we need to adjust a few more places; I'll
point them out below.

>
> /* This decides where the kernel will search for a free chunk of vm
> * space during mmap's.
> @@ -847,7 +851,7 @@ static inline void spin_lock_prefetch(const void *x)
> #define TASK_SIZE_OF(child) ((test_tsk_thread_flag(child, TIF_ADDR32)) ? \
> IA32_PAGE_OFFSET : TASK_SIZE_MAX)
>
> -#define STACK_TOP TASK_SIZE
> +#define STACK_TOP DEFAULT_MAP_WINDOW
> #define STACK_TOP_MAX TASK_SIZE_MAX
>
> #define INIT_THREAD { \
> @@ -870,7 +874,7 @@ extern void start_thread(struct pt_regs *regs, unsigned long new_ip,
> * space during mmap's.
> */
> #define __TASK_UNMAPPED_BASE(task_size) (PAGE_ALIGN(task_size / 3))
> -#define TASK_UNMAPPED_BASE __TASK_UNMAPPED_BASE(TASK_SIZE)
> +#define TASK_UNMAPPED_BASE __TASK_UNMAPPED_BASE(DEFAULT_MAP_WINDOW)
>
> #define KSTK_EIP(task) (task_pt_regs(task)->ip)
>
> diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c
> index 207b8f2582c7..593a31e93812 100644
> --- a/arch/x86/kernel/sys_x86_64.c
> +++ b/arch/x86/kernel/sys_x86_64.c
> @@ -21,6 +21,7 @@
> #include <asm/compat.h>
> #include <asm/ia32.h>
> #include <asm/syscalls.h>
> +#include <asm/mpx.h>
>
> /*
> * Align a virtual address to avoid aliasing in the I$ on AMD F15h.
> @@ -132,6 +133,10 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
> struct vm_unmapped_area_info info;
> unsigned long begin, end;
>
> + addr = mpx_unmapped_area_check(addr, len, flags);
> + if (IS_ERR_VALUE(addr))
> + return addr;
> +
> if (flags & MAP_FIXED)
> return addr;
>
> @@ -151,7 +156,16 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
> info.flags = 0;
> info.length = len;
> info.low_limit = begin;
> - info.high_limit = end;
> +
> + /*
> + * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
> + * in the full address space.
> + */
> + if (addr > DEFAULT_MAP_WINDOW)
> + info.high_limit = min(end, TASK_SIZE);
> + else
> + info.high_limit = min(end, DEFAULT_MAP_WINDOW);

That doesn't look like it works.
`end' is chosen between tasksize_32bit() and tasksize_64bit(),
which is ~4Gb or 47-bit. So info.high_limit will never go
above DEFAULT_MAP_WINDOW with this min().

Can we move this logic into find_start_end()?

Maybe it could be something like:

	if (in_compat_syscall())
		*end = tasksize_32bit();
	else if (addr > tasksize_64bit())
		*end = TASK_SIZE_MAX;
	else
		*end = tasksize_64bit();

From my point of view, it could be even simpler if we add a parameter
to tasksize_64bit():

	#define TASK_SIZE_47BIT	((1UL << 47) - PAGE_SIZE)

	unsigned long tasksize_64bit(int full_addr_space)
	{
		return full_addr_space ? TASK_SIZE_MAX : TASK_SIZE_47BIT;
	}

> +
> info.align_mask = 0;
> info.align_offset = pgoff << PAGE_SHIFT;
> if (filp) {
> @@ -171,6 +185,10 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
> unsigned long addr = addr0;
> struct vm_unmapped_area_info info;
>
> + addr = mpx_unmapped_area_check(addr, len, flags);
> + if (IS_ERR_VALUE(addr))
> + return addr;
> +
> /* requested length too big for entire address space */
> if (len > TASK_SIZE)
> return -ENOMEM;
> @@ -195,6 +213,14 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
> info.length = len;
> info.low_limit = PAGE_SIZE;
> info.high_limit = get_mmap_base(0);
> +
> + /*
> + * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
> + * in the full address space.
> + */
> + if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
> + info.high_limit += TASK_SIZE - DEFAULT_MAP_WINDOW;

Hmm, it looks like we do need in_compat_syscall() as you did,
because the x32 mmap() syscall takes an 8-byte parameter.
Maybe worth a comment.

Anyway, maybe something like this:

	if (addr > tasksize_64bit() && !in_compat_syscall())
		info.high_limit += TASK_SIZE_MAX - tasksize_64bit();

This way it's more readable and clearer, because we don't
need to keep the TIF_ADDR32 flag in mind while reading.


> +
> info.align_mask = 0;
> info.align_offset = pgoff << PAGE_SHIFT;
> if (filp) {
> diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
> index 302f43fd9c28..9a0b89252c52 100644
> --- a/arch/x86/mm/hugetlbpage.c
> +++ b/arch/x86/mm/hugetlbpage.c
> @@ -18,6 +18,7 @@
> #include <asm/tlbflush.h>
> #include <asm/pgalloc.h>
> #include <asm/elf.h>
> +#include <asm/mpx.h>
>
> #if 0 /* This is just for testing */
> struct page *
> @@ -87,23 +88,38 @@ static unsigned long hugetlb_get_unmapped_area_bottomup(struct file *file,
> info.low_limit = get_mmap_base(1);
> info.high_limit = in_compat_syscall() ?
> tasksize_32bit() : tasksize_64bit();
> +
> + /*
> + * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
> + * in the full address space.
> + */
> + if (addr > DEFAULT_MAP_WINDOW)
> + info.high_limit = TASK_SIZE;
> +
> info.align_mask = PAGE_MASK & ~huge_page_mask(h);
> info.align_offset = 0;
> return vm_unmapped_area(&info);
> }
>
> static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
> - unsigned long addr0, unsigned long len,
> + unsigned long addr, unsigned long len,
> unsigned long pgoff, unsigned long flags)
> {
> struct hstate *h = hstate_file(file);
> struct vm_unmapped_area_info info;
> - unsigned long addr;
>
> info.flags = VM_UNMAPPED_AREA_TOPDOWN;
> info.length = len;
> info.low_limit = PAGE_SIZE;
> info.high_limit = get_mmap_base(0);
> +
> + /*
> + * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
> + * in the full address space.
> + */
> + if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
> + info.high_limit += TASK_SIZE - DEFAULT_MAP_WINDOW;
> +
> info.align_mask = PAGE_MASK & ~huge_page_mask(h);
> info.align_offset = 0;
> addr = vm_unmapped_area(&info);
> @@ -118,7 +134,7 @@ static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
> VM_BUG_ON(addr != -ENOMEM);
> info.flags = 0;
> info.low_limit = TASK_UNMAPPED_BASE;
> - info.high_limit = TASK_SIZE;
> + info.high_limit = DEFAULT_MAP_WINDOW;
> addr = vm_unmapped_area(&info);
> }
>
> @@ -135,6 +151,11 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
>
> if (len & ~huge_page_mask(h))
> return -EINVAL;
> +
> + addr = mpx_unmapped_area_check(addr, len, flags);
> + if (IS_ERR_VALUE(addr))
> + return addr;
> +
> if (len > TASK_SIZE)
> return -ENOMEM;
>
> diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
> index 19ad095b41df..d63232a31945 100644
> --- a/arch/x86/mm/mmap.c
> +++ b/arch/x86/mm/mmap.c
> @@ -44,7 +44,7 @@ unsigned long tasksize_32bit(void)
>
> unsigned long tasksize_64bit(void)
> {
> - return TASK_SIZE_MAX;
> + return DEFAULT_MAP_WINDOW;

My suggestion about the new parameter is above, but at the very least
we need to stop depending on TIF_ADDR32 here and return
the 64-bit size independently of the flag value:

	#define TASK_SIZE_47BIT	((1UL << 47) - PAGE_SIZE)

	unsigned long tasksize_64bit(void)
	{
		return TASK_SIZE_47BIT;
	}

Because in your case it would always be 4Gb for 32-bit ELFs,
while 32-bit ELFs can do 64-bit syscalls.

> }
>
> static unsigned long stack_maxrandom_size(unsigned long task_size)
> diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
> index cd44ae727df7..a26a1b373fd0 100644
> --- a/arch/x86/mm/mpx.c
> +++ b/arch/x86/mm/mpx.c
> @@ -355,10 +355,19 @@ int mpx_enable_management(void)
> */
> bd_base = mpx_get_bounds_dir();
> down_write(&mm->mmap_sem);
> +
> + /* MPX doesn't support addresses above 47-bits yet. */
> + if (find_vma(mm, DEFAULT_MAP_WINDOW)) {
> + pr_warn_once("%s (%d): MPX cannot handle addresses "
> + "above 47-bits. Disabling.",
> + current->comm, current->pid);
> + ret = -ENXIO;
> + goto out;
> + }
> mm->context.bd_addr = bd_base;
> if (mm->context.bd_addr == MPX_INVALID_BOUNDS_DIR)
> ret = -ENXIO;
> -
> +out:
> up_write(&mm->mmap_sem);
> return ret;
> }
> @@ -1038,3 +1047,25 @@ void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
> if (ret)
> force_sig(SIGSEGV, current);
> }
> +
> +/* MPX cannot handle addresses above 47-bits yet. */
> +unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned long len,
> + unsigned long flags)
> +{
> + if (!kernel_managing_mpx_tables(current->mm))
> + return addr;
> + if (addr + len <= DEFAULT_MAP_WINDOW)
> + return addr;
> + if (flags & MAP_FIXED)
> + return -ENOMEM;
> +
> + /*
> + * Requested len is larger than whole area we're allowed to map in.
> + * Resetting hinting address wouldn't do much good -- fail early.
> + */
> + if (len > DEFAULT_MAP_WINDOW)
> + return -ENOMEM;
> +
> + /* Look for unmap area within DEFAULT_MAP_WINDOW */
> + return 0;
> +}
>


--
Dmitry

2017-04-07 13:35:58

by Anshuman Khandual

[permalink] [raw]
Subject: Re: [PATCH 8/8] x86/mm: Allow to have userspace mappings above 47-bits

On 04/06/2017 07:31 PM, Kirill A. Shutemov wrote:
> On x86, 5-level paging enables 56-bit userspace virtual address space.
> Not all user space is ready to handle wide addresses. It's known that
> at least some JIT compilers use higher bits in pointers to encode their
> information. It collides with valid pointers with 5-level paging and
> leads to crashes.
>
> To mitigate this, we are not going to allocate virtual address space
> above 47-bit by default.

I am wondering whether the virtual address range committed to user
space is a kind of API which needs to be maintained thereafter.
If that is the case, then we need to have some plan for when we
increase it from the current level.

Will those JIT compilers keep using the higher bit positions of
the pointer forever? That would also limit the ability of the
kernel to expand the virtual address range later. I am not
saying we should not increase it to the extent that it does not
affect any *known* user, but then we should not increase it twice:
for now, create the hint mechanism to be passed from the user to
get beyond that limit (which will settle in as an expectation
from the kernel later on), and do the same thing again when
expanding the address range next time around. I think we need
to have a plan for this, particularly around the 'hint' mechanism
and whether it should be decided per mmap() request or at the
task level.

2017-04-07 15:45:30

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv3 8/8] x86/mm: Allow to have userspace mappings above 47-bits

On x86, 5-level paging enables 56-bit userspace virtual address space.
Not all user space is ready to handle wide addresses. It's known that
at least some JIT compilers use higher bits in pointers to encode their
information. It collides with valid pointers with 5-level paging and
leads to crashes.

To mitigate this, we are not going to allocate virtual address space
above 47-bit by default.

But userspace can ask for allocation from full address space by
specifying hint address (with or without MAP_FIXED) above 47-bits.

If hint address set above 47-bit, but MAP_FIXED is not specified, we try
to look for unmapped area by specified address. If it's already
occupied, we look for unmapped area in *full* address space, rather than
from 47-bit window.

This approach helps to easily make application's memory allocator aware
about large address space without manually tracking allocated virtual
address space.

One important case we need to handle here is interaction with MPX.
MPX (without the MAWA extension) cannot handle addresses above 47-bit, so we
need to make sure that MPX cannot be enabled if we already have a VMA above
the boundary, and forbid creating such VMAs once MPX is enabled.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Cc: Dmitry Safonov <[email protected]>
---
v3:
- Address Dmitry's feedback;
- Make DEFAULT_MAP_WINDOW constant again, introduce TASK_SIZE_LOW
instead, which takes TIF_ADDR32 into account.
---
arch/x86/include/asm/elf.h | 4 ++--
arch/x86/include/asm/mpx.h | 9 +++++++++
arch/x86/include/asm/processor.h | 11 ++++++++---
arch/x86/kernel/sys_x86_64.c | 30 ++++++++++++++++++++++++++----
arch/x86/mm/hugetlbpage.c | 27 +++++++++++++++++++++++----
arch/x86/mm/mmap.c | 6 +++---
arch/x86/mm/mpx.c | 33 ++++++++++++++++++++++++++++++++-
7 files changed, 103 insertions(+), 17 deletions(-)

diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
index d4d3ed456cb7..2501ef7970f9 100644
--- a/arch/x86/include/asm/elf.h
+++ b/arch/x86/include/asm/elf.h
@@ -250,7 +250,7 @@ extern int force_personality32;
the loader. We need to make sure that it is out of the way of the program
that it will "exec", and that there is sufficient room for the brk. */

-#define ELF_ET_DYN_BASE (TASK_SIZE / 3 * 2)
+#define ELF_ET_DYN_BASE (TASK_SIZE_LOW / 3 * 2)

/* This yields a mask that user programs can use to figure out what
instruction set this CPU supports. This could be done in user space,
@@ -304,7 +304,7 @@ static inline int mmap_is_ia32(void)
}

extern unsigned long tasksize_32bit(void);
-extern unsigned long tasksize_64bit(void);
+extern unsigned long tasksize_64bit(int full_addr_space);
extern unsigned long get_mmap_base(int is_legacy);

#ifdef CONFIG_X86_32
diff --git a/arch/x86/include/asm/mpx.h b/arch/x86/include/asm/mpx.h
index a0d662be4c5b..7d7404756bb4 100644
--- a/arch/x86/include/asm/mpx.h
+++ b/arch/x86/include/asm/mpx.h
@@ -73,6 +73,9 @@ static inline void mpx_mm_init(struct mm_struct *mm)
}
void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long start, unsigned long end);
+
+unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned long len,
+ unsigned long flags);
#else
static inline siginfo_t *mpx_generate_siginfo(struct pt_regs *regs)
{
@@ -94,6 +97,12 @@ static inline void mpx_notify_unmap(struct mm_struct *mm,
unsigned long start, unsigned long end)
{
}
+
+static inline unsigned long mpx_unmapped_area_check(unsigned long addr,
+ unsigned long len, unsigned long flags)
+{
+ return addr;
+}
#endif /* CONFIG_X86_INTEL_MPX */

#endif /* _ASM_X86_MPX_H */
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 3cada998a402..aaed58b03ddb 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -795,6 +795,7 @@ static inline void spin_lock_prefetch(const void *x)
#define IA32_PAGE_OFFSET PAGE_OFFSET
#define TASK_SIZE PAGE_OFFSET
#define TASK_SIZE_MAX TASK_SIZE
+#define DEFAULT_MAP_WINDOW TASK_SIZE
#define STACK_TOP TASK_SIZE
#define STACK_TOP_MAX STACK_TOP

@@ -834,7 +835,9 @@ static inline void spin_lock_prefetch(const void *x)
* particular problem by preventing anything from being mapped
* at the maximum canonical address.
*/
-#define TASK_SIZE_MAX ((1UL << 47) - PAGE_SIZE)
+#define TASK_SIZE_MAX ((1UL << __VIRTUAL_MASK_SHIFT) - PAGE_SIZE)
+
+#define DEFAULT_MAP_WINDOW ((1UL << 47) - PAGE_SIZE)

/* This decides where the kernel will search for a free chunk of vm
* space during mmap's.
@@ -842,12 +845,14 @@ static inline void spin_lock_prefetch(const void *x)
#define IA32_PAGE_OFFSET ((current->personality & ADDR_LIMIT_3GB) ? \
0xc0000000 : 0xFFFFe000)

+#define TASK_SIZE_LOW (test_thread_flag(TIF_ADDR32) ? \
+ IA32_PAGE_OFFSET : DEFAULT_MAP_WINDOW)
#define TASK_SIZE (test_thread_flag(TIF_ADDR32) ? \
IA32_PAGE_OFFSET : TASK_SIZE_MAX)
#define TASK_SIZE_OF(child) ((test_tsk_thread_flag(child, TIF_ADDR32)) ? \
IA32_PAGE_OFFSET : TASK_SIZE_MAX)

-#define STACK_TOP TASK_SIZE
+#define STACK_TOP TASK_SIZE_LOW
#define STACK_TOP_MAX TASK_SIZE_MAX

#define INIT_THREAD { \
@@ -870,7 +875,7 @@ extern void start_thread(struct pt_regs *regs, unsigned long new_ip,
* space during mmap's.
*/
#define __TASK_UNMAPPED_BASE(task_size) (PAGE_ALIGN(task_size / 3))
-#define TASK_UNMAPPED_BASE __TASK_UNMAPPED_BASE(TASK_SIZE)
+#define TASK_UNMAPPED_BASE __TASK_UNMAPPED_BASE(TASK_SIZE_LOW)

#define KSTK_EIP(task) (task_pt_regs(task)->ip)

diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c
index 207b8f2582c7..74d1587b181d 100644
--- a/arch/x86/kernel/sys_x86_64.c
+++ b/arch/x86/kernel/sys_x86_64.c
@@ -21,6 +21,7 @@
#include <asm/compat.h>
#include <asm/ia32.h>
#include <asm/syscalls.h>
+#include <asm/mpx.h>

/*
* Align a virtual address to avoid aliasing in the I$ on AMD F15h.
@@ -100,8 +101,8 @@ SYSCALL_DEFINE6(mmap, unsigned long, addr, unsigned long, len,
return error;
}

-static void find_start_end(unsigned long flags, unsigned long *begin,
- unsigned long *end)
+static void find_start_end(unsigned long addr, unsigned long flags,
+ unsigned long *begin, unsigned long *end)
{
if (!in_compat_syscall() && (flags & MAP_32BIT)) {
/* This is usually used needed to map code in small
@@ -120,7 +121,10 @@ static void find_start_end(unsigned long flags, unsigned long *begin,
}

*begin = get_mmap_base(1);
- *end = in_compat_syscall() ? tasksize_32bit() : tasksize_64bit();
+ if (in_compat_syscall())
+ *end = tasksize_32bit();
+ else
+ *end = tasksize_64bit(addr > DEFAULT_MAP_WINDOW);
}

unsigned long
@@ -132,10 +136,14 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
struct vm_unmapped_area_info info;
unsigned long begin, end;

+ addr = mpx_unmapped_area_check(addr, len, flags);
+ if (IS_ERR_VALUE(addr))
+ return addr;
+
if (flags & MAP_FIXED)
return addr;

- find_start_end(flags, &begin, &end);
+ find_start_end(addr, flags, &begin, &end);

if (len > end)
return -ENOMEM;
@@ -171,6 +179,10 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
unsigned long addr = addr0;
struct vm_unmapped_area_info info;

+ addr = mpx_unmapped_area_check(addr, len, flags);
+ if (IS_ERR_VALUE(addr))
+ return addr;
+
/* requested length too big for entire address space */
if (len > TASK_SIZE)
return -ENOMEM;
@@ -195,6 +207,16 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
info.length = len;
info.low_limit = PAGE_SIZE;
info.high_limit = get_mmap_base(0);
+
+ /*
+ * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
+ * in the full address space.
+ *
+ * !in_compat_syscall() check to avoid high addresses for x32.
+ */
+ if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
+ info.high_limit += TASK_SIZE_MAX - DEFAULT_MAP_WINDOW;
+
info.align_mask = 0;
info.align_offset = pgoff << PAGE_SHIFT;
if (filp) {
diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index 302f43fd9c28..730f00250acb 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -18,6 +18,7 @@
#include <asm/tlbflush.h>
#include <asm/pgalloc.h>
#include <asm/elf.h>
+#include <asm/mpx.h>

#if 0 /* This is just for testing */
struct page *
@@ -85,25 +86,38 @@ static unsigned long hugetlb_get_unmapped_area_bottomup(struct file *file,
info.flags = 0;
info.length = len;
info.low_limit = get_mmap_base(1);
+
+ /*
+ * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
+ * in the full address space.
+ */
info.high_limit = in_compat_syscall() ?
- tasksize_32bit() : tasksize_64bit();
+ tasksize_32bit() : tasksize_64bit(addr > DEFAULT_MAP_WINDOW);
+
info.align_mask = PAGE_MASK & ~huge_page_mask(h);
info.align_offset = 0;
return vm_unmapped_area(&info);
}

static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
- unsigned long addr0, unsigned long len,
+ unsigned long addr, unsigned long len,
unsigned long pgoff, unsigned long flags)
{
struct hstate *h = hstate_file(file);
struct vm_unmapped_area_info info;
- unsigned long addr;

info.flags = VM_UNMAPPED_AREA_TOPDOWN;
info.length = len;
info.low_limit = PAGE_SIZE;
info.high_limit = get_mmap_base(0);
+
+ /*
+ * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
+ * in the full address space.
+ */
+ if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
+ info.high_limit += TASK_SIZE_MAX - DEFAULT_MAP_WINDOW;
+
info.align_mask = PAGE_MASK & ~huge_page_mask(h);
info.align_offset = 0;
addr = vm_unmapped_area(&info);
@@ -118,7 +132,7 @@ static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
VM_BUG_ON(addr != -ENOMEM);
info.flags = 0;
info.low_limit = TASK_UNMAPPED_BASE;
- info.high_limit = TASK_SIZE;
+ info.high_limit = TASK_SIZE_LOW;
addr = vm_unmapped_area(&info);
}

@@ -135,6 +149,11 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,

if (len & ~huge_page_mask(h))
return -EINVAL;
+
+ addr = mpx_unmapped_area_check(addr, len, flags);
+ if (IS_ERR_VALUE(addr))
+ return addr;
+
if (len > TASK_SIZE)
return -ENOMEM;

diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
index 19ad095b41df..199050249d60 100644
--- a/arch/x86/mm/mmap.c
+++ b/arch/x86/mm/mmap.c
@@ -42,9 +42,9 @@ unsigned long tasksize_32bit(void)
return IA32_PAGE_OFFSET;
}

-unsigned long tasksize_64bit(void)
+unsigned long tasksize_64bit(int full_addr_space)
{
- return TASK_SIZE_MAX;
+ return full_addr_space ? TASK_SIZE_MAX : DEFAULT_MAP_WINDOW;
}

static unsigned long stack_maxrandom_size(unsigned long task_size)
@@ -140,7 +140,7 @@ void arch_pick_mmap_layout(struct mm_struct *mm)
mm->get_unmapped_area = arch_get_unmapped_area_topdown;

arch_pick_mmap_base(&mm->mmap_base, &mm->mmap_legacy_base,
- arch_rnd(mmap64_rnd_bits), tasksize_64bit());
+ arch_rnd(mmap64_rnd_bits), tasksize_64bit(0));

#ifdef CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES
/*
diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
index cd44ae727df7..a26a1b373fd0 100644
--- a/arch/x86/mm/mpx.c
+++ b/arch/x86/mm/mpx.c
@@ -355,10 +355,19 @@ int mpx_enable_management(void)
*/
bd_base = mpx_get_bounds_dir();
down_write(&mm->mmap_sem);
+
+ /* MPX doesn't support addresses above 47-bits yet. */
+ if (find_vma(mm, DEFAULT_MAP_WINDOW)) {
+ pr_warn_once("%s (%d): MPX cannot handle addresses "
+ "above 47-bits. Disabling.",
+ current->comm, current->pid);
+ ret = -ENXIO;
+ goto out;
+ }
mm->context.bd_addr = bd_base;
if (mm->context.bd_addr == MPX_INVALID_BOUNDS_DIR)
ret = -ENXIO;
-
+out:
up_write(&mm->mmap_sem);
return ret;
}
@@ -1038,3 +1047,25 @@ void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
if (ret)
force_sig(SIGSEGV, current);
}
+
+/* MPX cannot handle addresses above 47-bits yet. */
+unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned long len,
+ unsigned long flags)
+{
+ if (!kernel_managing_mpx_tables(current->mm))
+ return addr;
+ if (addr + len <= DEFAULT_MAP_WINDOW)
+ return addr;
+ if (flags & MAP_FIXED)
+ return -ENOMEM;
+
+ /*
+ * Requested len is larger than whole area we're allowed to map in.
+ * Resetting hinting address wouldn't do much good -- fail early.
+ */
+ if (len > DEFAULT_MAP_WINDOW)
+ return -ENOMEM;
+
+ /* Look for unmap area within DEFAULT_MAP_WINDOW */
+ return 0;
+}
--
2.11.0

2017-04-07 16:00:18

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH 8/8] x86/mm: Allow to have userspace mappings above 47-bits

On Fri, Apr 07, 2017 at 07:05:26PM +0530, Anshuman Khandual wrote:
> On 04/06/2017 07:31 PM, Kirill A. Shutemov wrote:
> > On x86, 5-level paging enables 56-bit userspace virtual address space.
> > Not all user space is ready to handle wide addresses. It's known that
> > at least some JIT compilers use higher bits in pointers to encode their
> > information. It collides with valid pointers with 5-level paging and
> > leads to crashes.
> >
> > To mitigate this, we are not going to allocate virtual address space
> > above 47-bit by default.
>
> I am wondering if the commitment of a virtual space range to user space
> is a kind of API which needs to be maintained thereafter. If that is the
> case, then we need to have some plan for when we increase it beyond the
> current level.

I don't think we should ever enable the full address space for all
applications. There's no point.

/bin/true doesn't need more than 64TB of virtual memory.
And I hope it never will.

By increasing the virtual address space for everybody we would pay (assuming
the current page table format) at least one extra page per process for moving
the stack to the very end of the address space.

Yes, you can gain something in security by having more bits for ASLR, but
I don't think it's worth the cost.

> Will those JIT compilers keep using the higher bit positions of the
> pointer forever? That would also limit the kernel's ability to expand the
> virtual address range later. I am not saying we should not increase it to
> the extent that it does not affect any *known* user, but then we should
> not do this twice: for now, create the hint mechanism to be passed from
> the user to get addresses beyond that limit (which will settle in as an
> expectation from the kernel later on), and do the same thing again when
> expanding the address range next time around. I think we need a plan for
> this, particularly around the 'hint' mechanism and whether it should be
> decided per mmap() request or at the task level.

I think the reasonable way for an application to claim it's 63-bit clean
is to make allocations with (void *)-1 as hint address.

--
Kirill A. Shutemov

2017-04-07 16:20:57

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH 8/8] x86/mm: Allow to have userspace mappings above 47-bits

On Fri, Apr 07, 2017 at 09:09:27AM -0700, [email protected] wrote:
> >I think the reasonable way for an application to claim it's 63-bit
> >clean
> >is to make allocations with (void *)-1 as hint address.
>
> You realize that people have said that about just about every memory

Any better solution?

--
Kirill A. Shutemov

2017-04-07 16:20:52

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH 8/8] x86/mm: Allow to have userspace mappings above 47-bits

On April 7, 2017 8:59:45 AM PDT, "Kirill A. Shutemov" <[email protected]> wrote:
>On Fri, Apr 07, 2017 at 07:05:26PM +0530, Anshuman Khandual wrote:
>> On 04/06/2017 07:31 PM, Kirill A. Shutemov wrote:
>> > On x86, 5-level paging enables 56-bit userspace virtual address space.
>> > Not all user space is ready to handle wide addresses. It's known that
>> > at least some JIT compilers use higher bits in pointers to encode their
>> > information. It collides with valid pointers with 5-level paging and
>> > leads to crashes.
>> >
>> > To mitigate this, we are not going to allocate virtual address space
>> > above 47-bit by default.
>>
>> I am wondering if the commitment of a virtual space range to user space
>> is a kind of API which needs to be maintained thereafter. If that is the
>> case, then we need to have some plan for when we increase it beyond the
>> current level.
>
>I don't think we should ever enable the full address space for all
>applications. There's no point.
>
>/bin/true doesn't need more than 64TB of virtual memory.
>And I hope it never will.
>
>By increasing the virtual address space for everybody we would pay (assuming
>the current page table format) at least one extra page per process for moving
>the stack to the very end of the address space.
>
>Yes, you can gain something in security by having more bits for ASLR, but
>I don't think it's worth the cost.
>
>> Will those JIT compilers keep using the higher bit positions of the
>> pointer forever? That would also limit the kernel's ability to expand the
>> virtual address range later. I am not saying we should not increase it to
>> the extent that it does not affect any *known* user, but then we should
>> not do this twice: for now, create the hint mechanism to be passed from
>> the user to get addresses beyond that limit (which will settle in as an
>> expectation from the kernel later on), and do the same thing again when
>> expanding the address range next time around. I think we need a plan for
>> this, particularly around the 'hint' mechanism and whether it should be
>> decided per mmap() request or at the task level.
>
>I think the reasonable way for an application to claim it's 63-bit clean
>is to make allocations with (void *)-1 as hint address.

You realize that people have said that about just about every memory threshold from 64K onward?
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

2017-04-07 16:41:43

by Dmitry Safonov

[permalink] [raw]
Subject: Re: [PATCHv3 8/8] x86/mm: Allow to have userspace mappings above 47-bits

On 04/07/2017 06:44 PM, Kirill A. Shutemov wrote:
> On x86, 5-level paging enables 56-bit userspace virtual address space.
> Not all user space is ready to handle wide addresses. It's known that
> at least some JIT compilers use higher bits in pointers to encode their
> information. It collides with valid pointers with 5-level paging and
> leads to crashes.
>
> To mitigate this, we are not going to allocate virtual address space
> above 47-bit by default.
>
> But userspace can ask for an allocation from the full address space by
> specifying a hint address (with or without MAP_FIXED) above 47-bits.
>
> If the hint address is set above 47-bit but MAP_FIXED is not specified, we
> try to look for an unmapped area at the specified address. If it's already
> occupied, we look for an unmapped area in the *full* address space, rather
> than in the 47-bit window.
>
> This approach helps to easily make an application's memory allocator aware
> of the large address space without manually tracking allocated virtual
> address space.
>
> One important case we need to handle here is the interaction with MPX.
> MPX (without the MAWA extension) cannot handle addresses above 47-bit, so we
> need to make sure that MPX cannot be enabled if we already have a VMA above
> the boundary, and forbid creating such VMAs once MPX is enabled.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> Cc: Dmitry Safonov <[email protected]>

LGTM,
Reviewed-by: Dmitry Safonov <[email protected]>

Though I'm not very excited about the TASK_SIZE_LOW naming, I'm not good
at naming either, so maybe tglx will help.
Anyway, I don't see any problems with the code's logic now.
I've run it through the CRIU ia32 tests, where there are
32/64-bit mmap(), 64-bit mmap() from a 32-bit binary, the same with
MAP_32BIT, and some other not very pleasant corner cases.
That doesn't prove that mmap() works in *all* possible cases, though.

P.S.:
JFYI: there is a rule to send new patch versions in a new thread -
otherwise the patch can lose the maintainers' attention. So, they may ask
you to resend it.
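
For example, a new version can be posted as a fresh thread with something
like (revision number and addresses are just placeholders):

	git format-patch -v4 --cover-letter -o v4/ HEAD~8
	git send-email --to=<maintainer> --cc=<list> v4/*.patch

i.e. without passing --in-reply-to, so it starts a new thread rather than
getting buried in the old one.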

> ---
> v3:
> - Address Dmitry's feedback;
> - Make DEFAULT_MAP_WINDOW constant again, introduce TASK_SIZE_LOW
>   instead, which would take TIF_ADDR32 into account.
> ---
> arch/x86/include/asm/elf.h | 4 ++--
> arch/x86/include/asm/mpx.h | 9 +++++++++
> arch/x86/include/asm/processor.h | 11 ++++++++---
> arch/x86/kernel/sys_x86_64.c | 30 ++++++++++++++++++++++++++----
> arch/x86/mm/hugetlbpage.c | 27 +++++++++++++++++++++++----
> arch/x86/mm/mmap.c | 6 +++---
> arch/x86/mm/mpx.c | 33 ++++++++++++++++++++++++++++++++-
> 7 files changed, 103 insertions(+), 17 deletions(-)
>
> diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
> index d4d3ed456cb7..2501ef7970f9 100644
> --- a/arch/x86/include/asm/elf.h
> +++ b/arch/x86/include/asm/elf.h
> @@ -250,7 +250,7 @@ extern int force_personality32;
> the loader. We need to make sure that it is out of the way of the program
> that it will "exec", and that there is sufficient room for the brk. */
>
> -#define ELF_ET_DYN_BASE (TASK_SIZE / 3 * 2)
> +#define ELF_ET_DYN_BASE (TASK_SIZE_LOW / 3 * 2)
>
> /* This yields a mask that user programs can use to figure out what
> instruction set this CPU supports. This could be done in user space,
> @@ -304,7 +304,7 @@ static inline int mmap_is_ia32(void)
> }
>
> extern unsigned long tasksize_32bit(void);
> -extern unsigned long tasksize_64bit(void);
> +extern unsigned long tasksize_64bit(int full_addr_space);
> extern unsigned long get_mmap_base(int is_legacy);
>
> #ifdef CONFIG_X86_32
> diff --git a/arch/x86/include/asm/mpx.h b/arch/x86/include/asm/mpx.h
> index a0d662be4c5b..7d7404756bb4 100644
> --- a/arch/x86/include/asm/mpx.h
> +++ b/arch/x86/include/asm/mpx.h
> @@ -73,6 +73,9 @@ static inline void mpx_mm_init(struct mm_struct *mm)
> }
> void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
> unsigned long start, unsigned long end);
> +
> +unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned long len,
> + unsigned long flags);
> #else
> static inline siginfo_t *mpx_generate_siginfo(struct pt_regs *regs)
> {
> @@ -94,6 +97,12 @@ static inline void mpx_notify_unmap(struct mm_struct *mm,
> unsigned long start, unsigned long end)
> {
> }
> +
> +static inline unsigned long mpx_unmapped_area_check(unsigned long addr,
> + unsigned long len, unsigned long flags)
> +{
> + return addr;
> +}
> #endif /* CONFIG_X86_INTEL_MPX */
>
> #endif /* _ASM_X86_MPX_H */
> diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
> index 3cada998a402..aaed58b03ddb 100644
> --- a/arch/x86/include/asm/processor.h
> +++ b/arch/x86/include/asm/processor.h
> @@ -795,6 +795,7 @@ static inline void spin_lock_prefetch(const void *x)
> #define IA32_PAGE_OFFSET PAGE_OFFSET
> #define TASK_SIZE PAGE_OFFSET
> #define TASK_SIZE_MAX TASK_SIZE
> +#define DEFAULT_MAP_WINDOW TASK_SIZE
> #define STACK_TOP TASK_SIZE
> #define STACK_TOP_MAX STACK_TOP
>
> @@ -834,7 +835,9 @@ static inline void spin_lock_prefetch(const void *x)
> * particular problem by preventing anything from being mapped
> * at the maximum canonical address.
> */
> -#define TASK_SIZE_MAX ((1UL << 47) - PAGE_SIZE)
> +#define TASK_SIZE_MAX ((1UL << __VIRTUAL_MASK_SHIFT) - PAGE_SIZE)
> +
> +#define DEFAULT_MAP_WINDOW ((1UL << 47) - PAGE_SIZE)
>
> /* This decides where the kernel will search for a free chunk of vm
> * space during mmap's.
> @@ -842,12 +845,14 @@ static inline void spin_lock_prefetch(const void *x)
> #define IA32_PAGE_OFFSET ((current->personality & ADDR_LIMIT_3GB) ? \
> 0xc0000000 : 0xFFFFe000)
>
> +#define TASK_SIZE_LOW (test_thread_flag(TIF_ADDR32) ? \
> + IA32_PAGE_OFFSET : DEFAULT_MAP_WINDOW)
> #define TASK_SIZE (test_thread_flag(TIF_ADDR32) ? \
> IA32_PAGE_OFFSET : TASK_SIZE_MAX)
> #define TASK_SIZE_OF(child) ((test_tsk_thread_flag(child, TIF_ADDR32)) ? \
> IA32_PAGE_OFFSET : TASK_SIZE_MAX)
>
> -#define STACK_TOP TASK_SIZE
> +#define STACK_TOP TASK_SIZE_LOW
> #define STACK_TOP_MAX TASK_SIZE_MAX
>
> #define INIT_THREAD { \
> @@ -870,7 +875,7 @@ extern void start_thread(struct pt_regs *regs, unsigned long new_ip,
> * space during mmap's.
> */
> #define __TASK_UNMAPPED_BASE(task_size) (PAGE_ALIGN(task_size / 3))
> -#define TASK_UNMAPPED_BASE __TASK_UNMAPPED_BASE(TASK_SIZE)
> +#define TASK_UNMAPPED_BASE __TASK_UNMAPPED_BASE(TASK_SIZE_LOW)
>
> #define KSTK_EIP(task) (task_pt_regs(task)->ip)
>
> diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c
> index 207b8f2582c7..74d1587b181d 100644
> --- a/arch/x86/kernel/sys_x86_64.c
> +++ b/arch/x86/kernel/sys_x86_64.c
> @@ -21,6 +21,7 @@
> #include <asm/compat.h>
> #include <asm/ia32.h>
> #include <asm/syscalls.h>
> +#include <asm/mpx.h>
>
> /*
> * Align a virtual address to avoid aliasing in the I$ on AMD F15h.
> @@ -100,8 +101,8 @@ SYSCALL_DEFINE6(mmap, unsigned long, addr, unsigned long, len,
> return error;
> }
>
> -static void find_start_end(unsigned long flags, unsigned long *begin,
> - unsigned long *end)
> +static void find_start_end(unsigned long addr, unsigned long flags,
> + unsigned long *begin, unsigned long *end)
> {
> if (!in_compat_syscall() && (flags & MAP_32BIT)) {
> /* This is usually used needed to map code in small
> @@ -120,7 +121,10 @@ static void find_start_end(unsigned long flags, unsigned long *begin,
> }
>
> *begin = get_mmap_base(1);
> - *end = in_compat_syscall() ? tasksize_32bit() : tasksize_64bit();
> + if (in_compat_syscall())
> + *end = tasksize_32bit();
> + else
> + *end = tasksize_64bit(addr > DEFAULT_MAP_WINDOW);
> }
>
> unsigned long
> @@ -132,10 +136,14 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
> struct vm_unmapped_area_info info;
> unsigned long begin, end;
>
> + addr = mpx_unmapped_area_check(addr, len, flags);
> + if (IS_ERR_VALUE(addr))
> + return addr;
> +
> if (flags & MAP_FIXED)
> return addr;
>
> - find_start_end(flags, &begin, &end);
> + find_start_end(addr, flags, &begin, &end);
>
> if (len > end)
> return -ENOMEM;
> @@ -171,6 +179,10 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
> unsigned long addr = addr0;
> struct vm_unmapped_area_info info;
>
> + addr = mpx_unmapped_area_check(addr, len, flags);
> + if (IS_ERR_VALUE(addr))
> + return addr;
> +
> /* requested length too big for entire address space */
> if (len > TASK_SIZE)
> return -ENOMEM;
> @@ -195,6 +207,16 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
> info.length = len;
> info.low_limit = PAGE_SIZE;
> info.high_limit = get_mmap_base(0);
> +
> + /*
> + * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
> + * in the full address space.
> + *
> + * !in_compat_syscall() check to avoid high addresses for x32.
> + */
> + if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
> + info.high_limit += TASK_SIZE_MAX - DEFAULT_MAP_WINDOW;
> +
> info.align_mask = 0;
> info.align_offset = pgoff << PAGE_SHIFT;
> if (filp) {
> diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
> index 302f43fd9c28..730f00250acb 100644
> --- a/arch/x86/mm/hugetlbpage.c
> +++ b/arch/x86/mm/hugetlbpage.c
> @@ -18,6 +18,7 @@
> #include <asm/tlbflush.h>
> #include <asm/pgalloc.h>
> #include <asm/elf.h>
> +#include <asm/mpx.h>
>
> #if 0 /* This is just for testing */
> struct page *
> @@ -85,25 +86,38 @@ static unsigned long hugetlb_get_unmapped_area_bottomup(struct file *file,
> info.flags = 0;
> info.length = len;
> info.low_limit = get_mmap_base(1);
> +
> + /*
> + * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
> + * in the full address space.
> + */
> info.high_limit = in_compat_syscall() ?
> - tasksize_32bit() : tasksize_64bit();
> + tasksize_32bit() : tasksize_64bit(addr > DEFAULT_MAP_WINDOW);
> +
> info.align_mask = PAGE_MASK & ~huge_page_mask(h);
> info.align_offset = 0;
> return vm_unmapped_area(&info);
> }
>
> static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
> - unsigned long addr0, unsigned long len,
> + unsigned long addr, unsigned long len,
> unsigned long pgoff, unsigned long flags)
> {
> struct hstate *h = hstate_file(file);
> struct vm_unmapped_area_info info;
> - unsigned long addr;
>
> info.flags = VM_UNMAPPED_AREA_TOPDOWN;
> info.length = len;
> info.low_limit = PAGE_SIZE;
> info.high_limit = get_mmap_base(0);
> +
> + /*
> + * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
> + * in the full address space.
> + */
> + if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
> + info.high_limit += TASK_SIZE_MAX - DEFAULT_MAP_WINDOW;
> +
> info.align_mask = PAGE_MASK & ~huge_page_mask(h);
> info.align_offset = 0;
> addr = vm_unmapped_area(&info);
> @@ -118,7 +132,7 @@ static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
> VM_BUG_ON(addr != -ENOMEM);
> info.flags = 0;
> info.low_limit = TASK_UNMAPPED_BASE;
> - info.high_limit = TASK_SIZE;
> + info.high_limit = TASK_SIZE_LOW;
> addr = vm_unmapped_area(&info);
> }
>
> @@ -135,6 +149,11 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
>
> if (len & ~huge_page_mask(h))
> return -EINVAL;
> +
> + addr = mpx_unmapped_area_check(addr, len, flags);
> + if (IS_ERR_VALUE(addr))
> + return addr;
> +
> if (len > TASK_SIZE)
> return -ENOMEM;
>
> diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
> index 19ad095b41df..199050249d60 100644
> --- a/arch/x86/mm/mmap.c
> +++ b/arch/x86/mm/mmap.c
> @@ -42,9 +42,9 @@ unsigned long tasksize_32bit(void)
> return IA32_PAGE_OFFSET;
> }
>
> -unsigned long tasksize_64bit(void)
> +unsigned long tasksize_64bit(int full_addr_space)
> {
> - return TASK_SIZE_MAX;
> + return full_addr_space ? TASK_SIZE_MAX : DEFAULT_MAP_WINDOW;
> }
>
> static unsigned long stack_maxrandom_size(unsigned long task_size)
> @@ -140,7 +140,7 @@ void arch_pick_mmap_layout(struct mm_struct *mm)
> mm->get_unmapped_area = arch_get_unmapped_area_topdown;
>
> arch_pick_mmap_base(&mm->mmap_base, &mm->mmap_legacy_base,
> - arch_rnd(mmap64_rnd_bits), tasksize_64bit());
> + arch_rnd(mmap64_rnd_bits), tasksize_64bit(0));
>
> #ifdef CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES
> /*
> diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
> index cd44ae727df7..a26a1b373fd0 100644
> --- a/arch/x86/mm/mpx.c
> +++ b/arch/x86/mm/mpx.c
> @@ -355,10 +355,19 @@ int mpx_enable_management(void)
> */
> bd_base = mpx_get_bounds_dir();
> down_write(&mm->mmap_sem);
> +
> + /* MPX doesn't support addresses above 47-bits yet. */
> + if (find_vma(mm, DEFAULT_MAP_WINDOW)) {
> + pr_warn_once("%s (%d): MPX cannot handle addresses "
> + "above 47-bits. Disabling.",
> + current->comm, current->pid);
> + ret = -ENXIO;
> + goto out;
> + }
> mm->context.bd_addr = bd_base;
> if (mm->context.bd_addr == MPX_INVALID_BOUNDS_DIR)
> ret = -ENXIO;
> -
> +out:
> up_write(&mm->mmap_sem);
> return ret;
> }
> @@ -1038,3 +1047,25 @@ void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
> if (ret)
> force_sig(SIGSEGV, current);
> }
> +
> +/* MPX cannot handle addresses above 47-bits yet. */
> +unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned long len,
> + unsigned long flags)
> +{
> + if (!kernel_managing_mpx_tables(current->mm))
> + return addr;
> + if (addr + len <= DEFAULT_MAP_WINDOW)
> + return addr;
> + if (flags & MAP_FIXED)
> + return -ENOMEM;
> +
> + /*
> + * Requested len is larger than whole area we're allowed to map in.
> + * Resetting hinting address wouldn't do much good -- fail early.
> + */
> + if (len > DEFAULT_MAP_WINDOW)
> + return -ENOMEM;
> +
> + /* Look for unmap area within DEFAULT_MAP_WINDOW */
> + return 0;
> +}
>


--
Dmitry

2017-04-11 07:02:11

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 3/8] x86/boot/64: Add support of additional page table level during early boot


* Kirill A. Shutemov <[email protected]> wrote:

> This patch adds support for 5-level paging during early boot.
> It generalizes boot for 4- and 5-level paging on 64-bit systems with
> compile-time switch between them.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> ---
> arch/x86/boot/compressed/head_64.S | 23 ++++++++++++---
> arch/x86/include/asm/pgtable_64.h | 2 ++
> arch/x86/include/uapi/asm/processor-flags.h | 2 ++
> arch/x86/kernel/head64.c | 44 +++++++++++++++++++++++++----
> arch/x86/kernel/head_64.S | 29 +++++++++++++++----
> 5 files changed, 85 insertions(+), 15 deletions(-)
>
> diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
> index d2ae1f821e0c..3ed26769810b 100644
> --- a/arch/x86/boot/compressed/head_64.S
> +++ b/arch/x86/boot/compressed/head_64.S
> @@ -122,9 +122,12 @@ ENTRY(startup_32)
> addl %ebp, gdt+2(%ebp)
> lgdt gdt(%ebp)
>
> - /* Enable PAE mode */
> + /* Enable PAE and LA57 mode */
> movl %cr4, %eax
> orl $X86_CR4_PAE, %eax
> +#ifdef CONFIG_X86_5LEVEL
> + orl $X86_CR4_LA57, %eax
> +#endif
> movl %eax, %cr4
>
> /*
> @@ -136,13 +139,24 @@ ENTRY(startup_32)
> movl $(BOOT_INIT_PGT_SIZE/4), %ecx
> rep stosl
>
> + xorl %edx, %edx
> +
> + /* Build Top Level */
> + leal pgtable(%ebx,%edx,1), %edi
> + leal 0x1007 (%edi), %eax
> + movl %eax, 0(%edi)
> +
> +#ifdef CONFIG_X86_5LEVEL
> /* Build Level 4 */
> - leal pgtable + 0(%ebx), %edi
> + addl $0x1000, %edx
> + leal pgtable(%ebx,%edx), %edi
> leal 0x1007 (%edi), %eax
> movl %eax, 0(%edi)
> +#endif
>
> /* Build Level 3 */
> - leal pgtable + 0x1000(%ebx), %edi
> + addl $0x1000, %edx
> + leal pgtable(%ebx,%edx), %edi
> leal 0x1007(%edi), %eax
> movl $4, %ecx
> 1: movl %eax, 0x00(%edi)
> @@ -152,7 +166,8 @@ ENTRY(startup_32)
> jnz 1b
>
> /* Build Level 2 */
> - leal pgtable + 0x2000(%ebx), %edi
> + addl $0x1000, %edx
> + leal pgtable(%ebx,%edx), %edi
> movl $0x00000183, %eax
> movl $2048, %ecx
> 1: movl %eax, 0(%edi)

I realize that you had difficulties converting this to C, but it's not going to
get any easier in the future either, with one more paging mode/level added!

If you are stuck on where it breaks I'd suggest doing it gradually: first add a
trivial .c, build and link it in and call it separately. Then once that works,
move functionality from asm to C step by step and test it at every step.

I've applied the first two patches of this series, but we really should convert
this assembly bit to C too.

Thanks,

Ingo

Subject: [tip:x86/mm] x86/boot/64: Rename init_level4_pgt() and early_level4_pgt[]

Commit-ID: 8c86769544abd97bb9b55ae0eaa91d65b56f2272
Gitweb: http://git.kernel.org/tip/8c86769544abd97bb9b55ae0eaa91d65b56f2272
Author: Kirill A. Shutemov <[email protected]>
AuthorDate: Thu, 6 Apr 2017 17:01:00 +0300
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 11 Apr 2017 08:57:37 +0200

x86/boot/64: Rename init_level4_pgt() and early_level4_pgt[]

With CONFIG_X86_5LEVEL=y, level 4 is no longer the top level of page tables.

Let's give these variables more generic names: init_top_pgt() and
early_top_pgt[].

Signed-off-by: Kirill A. Shutemov <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/include/asm/pgtable.h | 2 +-
arch/x86/include/asm/pgtable_64.h | 4 ++--
arch/x86/kernel/espfix_64.c | 2 +-
arch/x86/kernel/head64.c | 18 +++++++++---------
arch/x86/kernel/head_64.S | 14 +++++++-------
arch/x86/kernel/machine_kexec_64.c | 2 +-
arch/x86/mm/dump_pagetables.c | 2 +-
arch/x86/mm/kasan_init_64.c | 12 ++++++------
arch/x86/realmode/init.c | 2 +-
arch/x86/xen/mmu.c | 18 +++++++++---------
arch/x86/xen/xen-pvh.S | 2 +-
11 files changed, 39 insertions(+), 39 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 942482a..77037b6 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -922,7 +922,7 @@ extern pgd_t trampoline_pgd_entry;
static inline void __meminit init_trampoline_default(void)
{
/* Default trampoline pgd value */
- trampoline_pgd_entry = init_level4_pgt[pgd_index(__PAGE_OFFSET)];
+ trampoline_pgd_entry = init_top_pgt[pgd_index(__PAGE_OFFSET)];
}
# ifdef CONFIG_RANDOMIZE_MEMORY
void __meminit init_trampoline(void);
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index 12ea312..affcb2a 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -20,9 +20,9 @@ extern pmd_t level2_kernel_pgt[512];
extern pmd_t level2_fixmap_pgt[512];
extern pmd_t level2_ident_pgt[512];
extern pte_t level1_fixmap_pgt[512];
-extern pgd_t init_level4_pgt[];
+extern pgd_t init_top_pgt[];

-#define swapper_pg_dir init_level4_pgt
+#define swapper_pg_dir init_top_pgt

extern void paging_init(void);

diff --git a/arch/x86/kernel/espfix_64.c b/arch/x86/kernel/espfix_64.c
index 8e598a1..6b91e2e 100644
--- a/arch/x86/kernel/espfix_64.c
+++ b/arch/x86/kernel/espfix_64.c
@@ -125,7 +125,7 @@ void __init init_espfix_bsp(void)
p4d_t *p4d;

/* Install the espfix pud into the kernel page directory */
- pgd = &init_level4_pgt[pgd_index(ESPFIX_BASE_ADDR)];
+ pgd = &init_top_pgt[pgd_index(ESPFIX_BASE_ADDR)];
p4d = p4d_alloc(&init_mm, pgd, ESPFIX_BASE_ADDR);
p4d_populate(&init_mm, p4d, espfix_pud_page);

diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 079b382..9b759f8 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -33,7 +33,7 @@
/*
* Manage page tables very early on.
*/
-extern pgd_t early_level4_pgt[PTRS_PER_PGD];
+extern pgd_t early_top_pgt[PTRS_PER_PGD];
extern pmd_t early_dynamic_pgts[EARLY_DYNAMIC_PAGE_TABLES][PTRS_PER_PMD];
static unsigned int __initdata next_early_pgt;
pmdval_t early_pmd_flags = __PAGE_KERNEL_LARGE & ~(_PAGE_GLOBAL | _PAGE_NX);
@@ -67,7 +67,7 @@ void __init __startup_64(unsigned long physaddr)

/* Fixup the physical addresses in the page table */

- pgd = fixup_pointer(&early_level4_pgt, physaddr);
+ pgd = fixup_pointer(&early_top_pgt, physaddr);
pgd[pgd_index(__START_KERNEL_map)] += load_delta;

pud = fixup_pointer(&level3_kernel_pgt, physaddr);
@@ -120,9 +120,9 @@ void __init __startup_64(unsigned long physaddr)
/* Wipe all early page tables except for the kernel symbol map */
static void __init reset_early_page_tables(void)
{
- memset(early_level4_pgt, 0, sizeof(pgd_t)*(PTRS_PER_PGD-1));
+ memset(early_top_pgt, 0, sizeof(pgd_t)*(PTRS_PER_PGD-1));
next_early_pgt = 0;
- write_cr3(__pa_nodebug(early_level4_pgt));
+ write_cr3(__pa_nodebug(early_top_pgt));
}

/* Create a new PMD entry */
@@ -134,11 +134,11 @@ int __init early_make_pgtable(unsigned long address)
pmdval_t pmd, *pmd_p;

/* Invalid address or early pgt is done ? */
- if (physaddr >= MAXMEM || read_cr3() != __pa_nodebug(early_level4_pgt))
+ if (physaddr >= MAXMEM || read_cr3() != __pa_nodebug(early_top_pgt))
return -1;

again:
- pgd_p = &early_level4_pgt[pgd_index(address)].pgd;
+ pgd_p = &early_top_pgt[pgd_index(address)].pgd;
pgd = *pgd_p;

/*
@@ -235,7 +235,7 @@ asmlinkage __visible void __init x86_64_start_kernel(char * real_mode_data)

clear_bss();

- clear_page(init_level4_pgt);
+ clear_page(init_top_pgt);

kasan_early_init();

@@ -250,8 +250,8 @@ asmlinkage __visible void __init x86_64_start_kernel(char * real_mode_data)
*/
load_ucode_bsp();

- /* set init_level4_pgt kernel high mapping*/
- init_level4_pgt[511] = early_level4_pgt[511];
+ /* set init_top_pgt kernel high mapping*/
+ init_top_pgt[511] = early_top_pgt[511];

x86_64_start_reservations(real_mode_data);
}
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 9656c59..d44c350 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -75,7 +75,7 @@ startup_64:
leaq _text(%rip), %rdi
call __startup_64

- movq $(early_level4_pgt - __START_KERNEL_map), %rax
+ movq $(early_top_pgt - __START_KERNEL_map), %rax
jmp 1f
ENTRY(secondary_startup_64)
/*
@@ -95,7 +95,7 @@ ENTRY(secondary_startup_64)
/* Sanitize CPU configuration */
call verify_cpu

- movq $(init_level4_pgt - __START_KERNEL_map), %rax
+ movq $(init_top_pgt - __START_KERNEL_map), %rax
1:

/* Enable PAE mode and PGE */
@@ -326,7 +326,7 @@ GLOBAL(name)
.endr

__INITDATA
-NEXT_PAGE(early_level4_pgt)
+NEXT_PAGE(early_top_pgt)
.fill 511,8,0
.quad level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE

@@ -336,14 +336,14 @@ NEXT_PAGE(early_dynamic_pgts)
.data

#ifndef CONFIG_XEN
-NEXT_PAGE(init_level4_pgt)
+NEXT_PAGE(init_top_pgt)
.fill 512,8,0
#else
-NEXT_PAGE(init_level4_pgt)
+NEXT_PAGE(init_top_pgt)
.quad level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
- .org init_level4_pgt + L4_PAGE_OFFSET*8, 0
+ .org init_top_pgt + L4_PAGE_OFFSET*8, 0
.quad level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
- .org init_level4_pgt + L4_START_KERNEL*8, 0
+ .org init_top_pgt + L4_START_KERNEL*8, 0
/* (2^48-(2*1024*1024*1024))/(2^39) = 511 */
.quad level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE

diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index 085c3b3..42f502b 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -342,7 +342,7 @@ void machine_kexec(struct kimage *image)
void arch_crash_save_vmcoreinfo(void)
{
VMCOREINFO_NUMBER(phys_base);
- VMCOREINFO_SYMBOL(init_level4_pgt);
+ VMCOREINFO_SYMBOL(init_top_pgt);

#ifdef CONFIG_NUMA
VMCOREINFO_SYMBOL(node_data);
diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
index 9f305be..6680cef 100644
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -431,7 +431,7 @@ static void ptdump_walk_pgd_level_core(struct seq_file *m, pgd_t *pgd,
bool checkwx)
{
#ifdef CONFIG_X86_64
- pgd_t *start = (pgd_t *) &init_level4_pgt;
+ pgd_t *start = (pgd_t *) &init_top_pgt;
#else
pgd_t *start = swapper_pg_dir;
#endif
diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index 0c7d812..88215ac 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -12,7 +12,7 @@
#include <asm/tlbflush.h>
#include <asm/sections.h>

-extern pgd_t early_level4_pgt[PTRS_PER_PGD];
+extern pgd_t early_top_pgt[PTRS_PER_PGD];
extern struct range pfn_mapped[E820_MAX_ENTRIES];

static int __init map_range(struct range *range)
@@ -109,8 +109,8 @@ void __init kasan_early_init(void)
for (i = 0; CONFIG_PGTABLE_LEVELS >= 5 && i < PTRS_PER_P4D; i++)
kasan_zero_p4d[i] = __p4d(p4d_val);

- kasan_map_early_shadow(early_level4_pgt);
- kasan_map_early_shadow(init_level4_pgt);
+ kasan_map_early_shadow(early_top_pgt);
+ kasan_map_early_shadow(init_top_pgt);
}

void __init kasan_init(void)
@@ -121,8 +121,8 @@ void __init kasan_init(void)
register_die_notifier(&kasan_die_notifier);
#endif

- memcpy(early_level4_pgt, init_level4_pgt, sizeof(early_level4_pgt));
- load_cr3(early_level4_pgt);
+ memcpy(early_top_pgt, init_top_pgt, sizeof(early_top_pgt));
+ load_cr3(early_top_pgt);
__flush_tlb_all();

clear_pgds(KASAN_SHADOW_START, KASAN_SHADOW_END);
@@ -148,7 +148,7 @@ void __init kasan_init(void)
kasan_populate_zero_shadow(kasan_mem_to_shadow((void *)MODULES_END),
(void *)KASAN_SHADOW_END);

- load_cr3(init_level4_pgt);
+ load_cr3(init_top_pgt);
__flush_tlb_all();

/*
diff --git a/arch/x86/realmode/init.c b/arch/x86/realmode/init.c
index 5db706f1..dc0836d 100644
--- a/arch/x86/realmode/init.c
+++ b/arch/x86/realmode/init.c
@@ -102,7 +102,7 @@ static void __init setup_real_mode(void)

trampoline_pgd = (u64 *) __va(real_mode_header->trampoline_pgd);
trampoline_pgd[0] = trampoline_pgd_entry.pgd;
- trampoline_pgd[511] = init_level4_pgt[511].pgd;
+ trampoline_pgd[511] = init_top_pgt[511].pgd;
#endif
}

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index f226038..7c2081f 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1531,8 +1531,8 @@ static void xen_write_cr3(unsigned long cr3)
* At the start of the day - when Xen launches a guest, it has already
* built pagetables for the guest. We diligently look over them
* in xen_setup_kernel_pagetable and graft as appropriate them in the
- * init_level4_pgt and its friends. Then when we are happy we load
- * the new init_level4_pgt - and continue on.
+ * init_top_pgt and its friends. Then when we are happy we load
+ * the new init_top_pgt - and continue on.
*
* The generic code starts (start_kernel) and 'init_mem_mapping' sets
* up the rest of the pagetables. When it has completed it loads the cr3.
@@ -1975,13 +1975,13 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
pt_end = pt_base + xen_start_info->nr_pt_frames;

/* Zap identity mapping */
- init_level4_pgt[0] = __pgd(0);
+ init_top_pgt[0] = __pgd(0);

if (!xen_feature(XENFEAT_auto_translated_physmap)) {
/* Pre-constructed entries are in pfn, so convert to mfn */
/* L4[272] -> level3_ident_pgt
* L4[511] -> level3_kernel_pgt */
- convert_pfn_mfn(init_level4_pgt);
+ convert_pfn_mfn(init_top_pgt);

/* L3_i[0] -> level2_ident_pgt */
convert_pfn_mfn(level3_ident_pgt);
@@ -2012,11 +2012,11 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
/* Copy the initial P->M table mappings if necessary. */
i = pgd_index(xen_start_info->mfn_list);
if (i && i < pgd_index(__START_KERNEL_map))
- init_level4_pgt[i] = ((pgd_t *)xen_start_info->pt_base)[i];
+ init_top_pgt[i] = ((pgd_t *)xen_start_info->pt_base)[i];

if (!xen_feature(XENFEAT_auto_translated_physmap)) {
/* Make pagetable pieces RO */
- set_page_prot(init_level4_pgt, PAGE_KERNEL_RO);
+ set_page_prot(init_top_pgt, PAGE_KERNEL_RO);
set_page_prot(level3_ident_pgt, PAGE_KERNEL_RO);
set_page_prot(level3_kernel_pgt, PAGE_KERNEL_RO);
set_page_prot(level3_user_vsyscall, PAGE_KERNEL_RO);
@@ -2027,7 +2027,7 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)

/* Pin down new L4 */
pin_pagetable_pfn(MMUEXT_PIN_L4_TABLE,
- PFN_DOWN(__pa_symbol(init_level4_pgt)));
+ PFN_DOWN(__pa_symbol(init_top_pgt)));

/* Unpin Xen-provided one */
pin_pagetable_pfn(MMUEXT_UNPIN_TABLE, PFN_DOWN(__pa(pgd)));
@@ -2038,10 +2038,10 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
* pgd.
*/
xen_mc_batch();
- __xen_write_cr3(true, __pa(init_level4_pgt));
+ __xen_write_cr3(true, __pa(init_top_pgt));
xen_mc_issue(PARAVIRT_LAZY_CPU);
} else
- native_write_cr3(__pa(init_level4_pgt));
+ native_write_cr3(__pa(init_top_pgt));

/* We can't that easily rip out L3 and L2, as the Xen pagetables are
* set out this way: [L4], [L1], [L2], [L3], [L1], [L1] ... for
diff --git a/arch/x86/xen/xen-pvh.S b/arch/x86/xen/xen-pvh.S
index 5e24671..e1a5fbe 100644
--- a/arch/x86/xen/xen-pvh.S
+++ b/arch/x86/xen/xen-pvh.S
@@ -87,7 +87,7 @@ ENTRY(pvh_start_xen)
wrmsr

/* Enable pre-constructed page tables. */
- mov $_pa(init_level4_pgt), %eax
+ mov $_pa(init_top_pgt), %eax
mov %eax, %cr3
mov $(X86_CR0_PG | X86_CR0_PE), %eax
mov %eax, %cr0

Subject: [tip:x86/mm] x86/boot/64: Rewrite startup_64() in C

Commit-ID: 10ee822ec8c53f3987d969ce98b2686323a7fc2a
Gitweb: http://git.kernel.org/tip/10ee822ec8c53f3987d969ce98b2686323a7fc2a
Author: Kirill A. Shutemov <[email protected]>
AuthorDate: Thu, 6 Apr 2017 17:00:59 +0300
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 11 Apr 2017 08:57:37 +0200

x86/boot/64: Rewrite startup_64() in C

The patch converts most of the startup_64 logic from assembly to C.

This is preparation for 5-level paging enabling.

No change in functionality.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
[ Small typo fixes. ]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/kernel/head64.c | 81 ++++++++++++++++++++++++++++++++++++++++-
arch/x86/kernel/head_64.S | 93 +----------------------------------------------
2 files changed, 81 insertions(+), 93 deletions(-)

diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 43b7002..079b382 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -35,9 +35,88 @@
*/
extern pgd_t early_level4_pgt[PTRS_PER_PGD];
extern pmd_t early_dynamic_pgts[EARLY_DYNAMIC_PAGE_TABLES][PTRS_PER_PMD];
-static unsigned int __initdata next_early_pgt = 2;
+static unsigned int __initdata next_early_pgt;
pmdval_t early_pmd_flags = __PAGE_KERNEL_LARGE & ~(_PAGE_GLOBAL | _PAGE_NX);

+static void __init *fixup_pointer(void *ptr, unsigned long physaddr)
+{
+ return ptr - (void *)_text + (void *)physaddr;
+}
+
+void __init __startup_64(unsigned long physaddr)
+{
+ unsigned long load_delta, *p;
+ pgdval_t *pgd;
+ pudval_t *pud;
+ pmdval_t *pmd, pmd_entry;
+ int i;
+
+ /* Is the address too large? */
+ if (physaddr >> MAX_PHYSMEM_BITS)
+ for (;;);
+
+ /*
+ * Compute the delta between the address I am compiled to run at
+ * and the address I am actually running at.
+ */
+ load_delta = physaddr - (unsigned long)(_text - __START_KERNEL_map);
+
+ /* Is the address not 2M aligned? */
+ if (load_delta & ~PMD_PAGE_MASK)
+ for (;;);
+
+ /* Fixup the physical addresses in the page table */
+
+ pgd = fixup_pointer(&early_level4_pgt, physaddr);
+ pgd[pgd_index(__START_KERNEL_map)] += load_delta;
+
+ pud = fixup_pointer(&level3_kernel_pgt, physaddr);
+ pud[510] += load_delta;
+ pud[511] += load_delta;
+
+ pmd = fixup_pointer(level2_fixmap_pgt, physaddr);
+ pmd[506] += load_delta;
+
+ /*
+ * Set up the identity mapping for the switchover. These
+ * entries should *NOT* have the global bit set! This also
+ * creates a bunch of nonsense entries but that is fine --
+ * it avoids problems around wraparound.
+ */
+
+ pud = fixup_pointer(early_dynamic_pgts[next_early_pgt++], physaddr);
+ pmd = fixup_pointer(early_dynamic_pgts[next_early_pgt++], physaddr);
+
+ pgd[0] = (pgdval_t)pud + _KERNPG_TABLE;
+ pgd[1] = (pgdval_t)pud + _KERNPG_TABLE;
+
+ pud[0] = (pudval_t)pmd + _KERNPG_TABLE;
+ pud[1] = (pudval_t)pmd + _KERNPG_TABLE;
+
+ pmd_entry = __PAGE_KERNEL_LARGE_EXEC & ~_PAGE_GLOBAL;
+ pmd_entry += physaddr;
+
+ for (i = 0; i < DIV_ROUND_UP(_end - _text, PMD_SIZE); i++)
+ pmd[i + (physaddr >> PMD_SHIFT)] = pmd_entry + i * PMD_SIZE;
+
+ /*
+ * Fix up the kernel text+data virtual addresses. Note that
+ * we might write invalid pmds, when the kernel is relocated
+ * cleanup_highmap() fixes this up along with the mappings
+ * beyond _end.
+ */
+
+ pmd = fixup_pointer(level2_kernel_pgt, physaddr);
+ for (i = 0; i < PTRS_PER_PMD; i++) {
+ if (pmd[i] & _PAGE_PRESENT)
+ pmd[i] += load_delta;
+ }
+
+ /* Fix up phys_base */
+ p = fixup_pointer(&phys_base, physaddr);
+ *p += load_delta;
+}
+
/* Wipe all early page tables except for the kernel symbol map */
static void __init reset_early_page_tables(void)
{
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index ac9d327..9656c59 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -72,100 +72,9 @@ startup_64:
/* Sanitize CPU configuration */
call verify_cpu

- /*
- * Compute the delta between the address I am compiled to run at and the
- * address I am actually running at.
- */
- leaq _text(%rip), %rbp
- subq $_text - __START_KERNEL_map, %rbp
-
- /* Is the address not 2M aligned? */
- testl $~PMD_PAGE_MASK, %ebp
- jnz bad_address
-
- /*
- * Is the address too large?
- */
- leaq _text(%rip), %rax
- shrq $MAX_PHYSMEM_BITS, %rax
- jnz bad_address
-
- /*
- * Fixup the physical addresses in the page table
- */
- addq %rbp, early_level4_pgt + (L4_START_KERNEL*8)(%rip)
-
- addq %rbp, level3_kernel_pgt + (510*8)(%rip)
- addq %rbp, level3_kernel_pgt + (511*8)(%rip)
-
- addq %rbp, level2_fixmap_pgt + (506*8)(%rip)
-
- /*
- * Set up the identity mapping for the switchover. These
- * entries should *NOT* have the global bit set! This also
- * creates a bunch of nonsense entries but that is fine --
- * it avoids problems around wraparound.
- */
leaq _text(%rip), %rdi
- leaq early_level4_pgt(%rip), %rbx
-
- movq %rdi, %rax
- shrq $PGDIR_SHIFT, %rax
-
- leaq (PAGE_SIZE + _KERNPG_TABLE)(%rbx), %rdx
- movq %rdx, 0(%rbx,%rax,8)
- movq %rdx, 8(%rbx,%rax,8)
-
- addq $PAGE_SIZE, %rdx
- movq %rdi, %rax
- shrq $PUD_SHIFT, %rax
- andl $(PTRS_PER_PUD-1), %eax
- movq %rdx, PAGE_SIZE(%rbx,%rax,8)
- incl %eax
- andl $(PTRS_PER_PUD-1), %eax
- movq %rdx, PAGE_SIZE(%rbx,%rax,8)
-
- addq $PAGE_SIZE * 2, %rbx
- movq %rdi, %rax
- shrq $PMD_SHIFT, %rdi
- addq $(__PAGE_KERNEL_LARGE_EXEC & ~_PAGE_GLOBAL), %rax
- leaq (_end - 1)(%rip), %rcx
- shrq $PMD_SHIFT, %rcx
- subq %rdi, %rcx
- incl %ecx
+ call __startup_64

-1:
- andq $(PTRS_PER_PMD - 1), %rdi
- movq %rax, (%rbx,%rdi,8)
- incq %rdi
- addq $PMD_SIZE, %rax
- decl %ecx
- jnz 1b
-
- test %rbp, %rbp
- jz .Lskip_fixup
-
- /*
- * Fixup the kernel text+data virtual addresses. Note that
- * we might write invalid pmds, when the kernel is relocated
- * cleanup_highmap() fixes this up along with the mappings
- * beyond _end.
- */
- leaq level2_kernel_pgt(%rip), %rdi
- leaq PAGE_SIZE(%rdi), %r8
- /* See if it is a valid page table entry */
-1: testb $_PAGE_PRESENT, 0(%rdi)
- jz 2f
- addq %rbp, 0(%rdi)
- /* Go to the next page */
-2: addq $8, %rdi
- cmp %r8, %rdi
- jne 1b
-
- /* Fixup phys_base */
- addq %rbp, phys_base(%rip)
-
-.Lskip_fixup:
movq $(early_level4_pgt - __START_KERNEL_map), %rax
jmp 1f
ENTRY(secondary_startup_64)

2017-04-11 08:54:58

by Ingo Molnar

[permalink] [raw]
Subject: Re: [tip:x86/mm] x86/boot/64: Rewrite startup_64() in C


* tip-bot for Kirill A. Shutemov <[email protected]> wrote:

> Commit-ID: 10ee822ec8c53f3987d969ce98b2686323a7fc2a
> Gitweb: http://git.kernel.org/tip/10ee822ec8c53f3987d969ce98b2686323a7fc2a
> Author: Kirill A. Shutemov <[email protected]>
> AuthorDate: Thu, 6 Apr 2017 17:00:59 +0300
> Committer: Ingo Molnar <[email protected]>
> CommitDate: Tue, 11 Apr 2017 08:57:37 +0200
>
> x86/boot/64: Rewrite startup_64() in C
>
> The patch converts most of the startup_64 logic from assembly to C.
>
> This is preparation for 5-level paging enabling.
>
> No change in functionality.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Andy Lutomirski <[email protected]>
> Cc: Dave Hansen <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Link: http://lkml.kernel.org/r/[email protected]
> [ Small typo fixes. ]
> Signed-off-by: Ingo Molnar <[email protected]>
> ---
> arch/x86/kernel/head64.c | 81 ++++++++++++++++++++++++++++++++++++++++-
> arch/x86/kernel/head_64.S | 93 +----------------------------------------------
> 2 files changed, 81 insertions(+), 93 deletions(-)

Hm, so I had to zap this commit as it broke booting on a 64-bit Intel and an AMD
system as well, with defconfig-ish kernels.

I've attached the failing config. There's no serial console output at all, the
boot just hangs after the Grub messages.

Thanks,

Ingo


Attachments:
.config (109.65 kB)

2017-04-11 10:53:58

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH 3/8] x86/boot/64: Add support of additional page table level during early boot

On Tue, Apr 11, 2017 at 09:02:03AM +0200, Ingo Molnar wrote:
> I realize that you had difficulties converting this to C, but it's not going to
> get any easier in the future either, with one more paging mode/level added!
>
> If you are stuck on where it breaks I'd suggest doing it gradually: first add a
> trivial .c, build and link it in and call it separately. Then once that works,
> move functionality from asm to C step by step and test it at every step.

I've described the specific issue with converting this code to C in the cover
letter: how to make the compiler generate 32-bit code for a specific function
or translation unit without breaking linking afterwards (-m32 breaks it).

I would be glad to convert it, but I'm stuck.

Do you have any idea how to get around the issue?

--
Kirill A. Shutemov

2017-04-11 11:28:54

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 3/8] x86/boot/64: Add support of additional page table level during early boot


* Kirill A. Shutemov <[email protected]> wrote:

> On Tue, Apr 11, 2017 at 09:02:03AM +0200, Ingo Molnar wrote:
> > I realize that you had difficulties converting this to C, but it's not going to
> > get any easier in the future either, with one more paging mode/level added!
> >
> > If you are stuck on where it breaks I'd suggest doing it gradually: first add a
> > trivial .c, build and link it in and call it separately. Then once that works,
> > move functionality from asm to C step by step and test it at every step.
>
> I've described the specific issue with converting this code to C in the cover
> letter: how to make the compiler generate 32-bit code for a specific function
> or translation unit without breaking linking afterwards (-m32 breaks it).

Have you tried putting it into a separate .c file, and building it 32-bit?

I think arch/x86/entry/vdso/Makefile contains an example of how to build 32-bit
code even on 64-bit kernels.

Thanks,

Ingo

2017-04-11 11:46:23

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH 3/8] x86/boot/64: Add support of additional page table level during early boot

On Tue, Apr 11, 2017 at 01:28:45PM +0200, Ingo Molnar wrote:
>
> * Kirill A. Shutemov <[email protected]> wrote:
>
> > On Tue, Apr 11, 2017 at 09:02:03AM +0200, Ingo Molnar wrote:
> > > I realize that you had difficulties converting this to C, but it's not going to
> > > get any easier in the future either, with one more paging mode/level added!
> > >
> > > If you are stuck on where it breaks I'd suggest doing it gradually: first add a
> > > trivial .c, build and link it in and call it separately. Then once that works,
> > > move functionality from asm to C step by step and test it at every step.
> >
> > I've described the specific issue with converting this code to C in the cover
> > letter: how to make the compiler generate 32-bit code for a specific function
> > or translation unit without breaking linking afterwards (-m32 breaks it).
>
> Have you tried putting it into a separate .c file, and building it 32-bit?

Yes, I have. The patch below fails linking:

ld: i386 architecture of input file `arch/x86/boot/compressed/head64.o' is incompatible with i386:x86-64 output

>
> I think arch/x86/entry/vdso/Makefile contains an example of how to build 32-bit
> code even on 64-bit kernels.

I'll look closer (the build process is rather complicated), but my
understanding is that the VDSO is a stand-alone binary and doesn't really
link with the rest of the kernel, but rather is included as a blob, no?

Andy, maybe you have an idea?

diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
index 44163e8c3868..8c1acacf408e 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -76,6 +76,8 @@ vmlinux-objs-$(CONFIG_EARLY_PRINTK) += $(obj)/early_serial_console.o
vmlinux-objs-$(CONFIG_RANDOMIZE_BASE) += $(obj)/kaslr.o
ifdef CONFIG_X86_64
vmlinux-objs-$(CONFIG_RANDOMIZE_BASE) += $(obj)/pagetable.o
+ vmlinux-objs-y += $(obj)/head64.o
+$(obj)/head64.o: KBUILD_CFLAGS := -m32 -D__KERNEL__ -O2
endif

$(obj)/eboot.o: KBUILD_CFLAGS += -fshort-wchar -mno-red-zone
diff --git a/arch/x86/boot/compressed/head64.c b/arch/x86/boot/compressed/head64.c
new file mode 100644
index 000000000000..42e1d64a15f4
--- /dev/null
+++ b/arch/x86/boot/compressed/head64.c
@@ -0,0 +1,3 @@
+void __startup32(void)
+{
+}
--
Kirill A. Shutemov

2017-04-11 12:29:45

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [tip:x86/mm] x86/boot/64: Rewrite startup_64() in C

On Tue, Apr 11, 2017 at 10:54:41AM +0200, Ingo Molnar wrote:
>
> * tip-bot for Kirill A. Shutemov <[email protected]> wrote:
>
> > Commit-ID: 10ee822ec8c53f3987d969ce98b2686323a7fc2a
> > Gitweb: http://git.kernel.org/tip/10ee822ec8c53f3987d969ce98b2686323a7fc2a
> > Author: Kirill A. Shutemov <[email protected]>
> > AuthorDate: Thu, 6 Apr 2017 17:00:59 +0300
> > Committer: Ingo Molnar <[email protected]>
> > CommitDate: Tue, 11 Apr 2017 08:57:37 +0200
> >
> > x86/boot/64: Rewrite startup_64() in C
> >
> > The patch converts most of the startup_64 logic from assembly to C.
> >
> > This is preparation for 5-level paging enabling.
> >
> > No change in functionality.
> >
> > Signed-off-by: Kirill A. Shutemov <[email protected]>
> > Cc: Andrew Morton <[email protected]>
> > Cc: Andy Lutomirski <[email protected]>
> > Cc: Dave Hansen <[email protected]>
> > Cc: Linus Torvalds <[email protected]>
> > Cc: Peter Zijlstra <[email protected]>
> > Cc: Thomas Gleixner <[email protected]>
> > Cc: [email protected]
> > Cc: [email protected]
> > Link: http://lkml.kernel.org/r/[email protected]
> > [ Small typo fixes. ]
> > Signed-off-by: Ingo Molnar <[email protected]>
> > ---
> > arch/x86/kernel/head64.c | 81 ++++++++++++++++++++++++++++++++++++++++-
> > arch/x86/kernel/head_64.S | 93 +----------------------------------------------
> > 2 files changed, 81 insertions(+), 93 deletions(-)
>
> Hm, so I had to zap this commit as it broke booting on a 64-bit Intel and an AMD
> system as well, with defconfig-ish kernels.

The fixup is below. Maybe there's a better, more idiomatic way, but I'm not
really into assembly.

Basically, we need to preserve %rsi across the call to __startup_64: %rsi
carries the pointer to the real-mode boot data, and it is not preserved
across function calls in the C calling convention.

I will post the whole thing again once we sort out what to do with the rest
of the assembly code I've touched. Or you can apply the fixup, if you feel
like it.

diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 9656c5951b98..1432d530fa35 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -73,7 +73,9 @@ startup_64:
call verify_cpu

leaq _text(%rip), %rdi
+ pushq %rsi
call __startup_64
+ popq %rsi

movq $(early_level4_pgt - __START_KERNEL_map), %rax
jmp 1f
--
Kirill A. Shutemov

2017-04-11 14:09:11

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 3/8] x86/boot/64: Add support of additional page table level during early boot

> I'll look closer (building proccess it's rather complicated), but my
> understanding is that VDSO is stand-alone binary and doesn't really links
> with the rest of the kernel, rather included as blob, no?
>
> Andy, may be you have an idea?

There isn't any way I know of to directly link them together. The ELF
format wasn't designed for that. You would need to merge the blobs and then
use manual jump vectors, like the 16-bit startup code does. It would likely
be complicated and ugly.

-Andi

2017-04-12 10:18:13

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH 3/8] x86/boot/64: Add support of additional page table level during early boot

On Tue, Apr 11, 2017 at 07:09:07AM -0700, Andi Kleen wrote:
> > I'll look closer (building proccess it's rather complicated), but my
> > understanding is that VDSO is stand-alone binary and doesn't really links
> > with the rest of the kernel, rather included as blob, no?
> >
> > Andy, may be you have an idea?
>
> There isn't any way I know of to directly link them together. The ELF
> format wasn't designed for that. You would need to merge blobs and then use
> manual jump vectors, like the 16bit startup code does. It would be likely
> complicated and ugly.

Ingo, can we proceed without converting this assembly to C?

I'm committed to converting it to C later if we find a reasonable solution
to the issue.

We're pretty late into the release cycle. It would be nice to give the
whole thing some time in tip/master and -next before the merge window.

Can I repost part 4?

--
Kirill A. Shutemov

2017-04-12 10:41:40

by Michael Ellerman

[permalink] [raw]
Subject: Re: [PATCH 8/8] x86/mm: Allow to have userspace mappings above 47-bits

Hi Kirill,

I'm interested in this because we're doing pretty much the same thing on
powerpc at the moment, and I want to make sure x86 & powerpc end up with
compatible behaviour.

"Kirill A. Shutemov" <[email protected]> writes:
> On Fri, Apr 07, 2017 at 07:05:26PM +0530, Anshuman Khandual wrote:
>> On 04/06/2017 07:31 PM, Kirill A. Shutemov wrote:
>> > On x86, 5-level paging enables 56-bit userspace virtual address space.
>> > Not all user space is ready to handle wide addresses. It's known that
>> > at least some JIT compilers use higher bits in pointers to encode their
>> > information. It collides with valid pointers with 5-level paging and
>> > leads to crashes.
>> >
>> > To mitigate this, we are not going to allocate virtual address space
>> > above 47-bit by default.
>>
>> I am wondering if the commitment of virtual space range to the
>> user space is kind of an API which needs to be maintained there
>> after. If that is the case then we need to have some plans when
>> increasing it from the current level.
>
> I don't think we should ever enable full address space for all
> applications. There's no point.
>
> /bin/true doesn't need more than 64TB of virtual memory.
> And I hope never will.
>
> By increasing virtual address space for everybody we will pay (assuming
> current page table format) at least one extra page per process for moving
> stack at very end of address space.

That assumes the current layout though; it could be different.

> Yes, you can gain something in security by having more bits for ASLR, but
> I don't think it worth the cost.

It may not be worth the cost now, for you, but that trade off will be
different for other people and at other times.

So I think it's quite likely some folks will be interested in the full
address range for ASLR.

>> expanding the address range next time around. I think we need
>> to have a plan for this and particularly around 'hint' mechanism
>> and whether it should be decided per mmap() request or at the
>> task level.
>
> I think the reasonable way for an application to claim it's 63-bit clean
> is to make allocations with (void *)-1 as hint address.

I do like the simplicity of that.

But I wouldn't be surprised if some (crappy) code out there already
passes an address of -1. Probably it won't break if it starts getting
high addresses, but who knows.

An alternative would be to only interpret the hint as requesting a large
address if it's >= 64TB && < TASK_SIZE_MAX.

If we're really worried about breaking userspace then a new MMAP flag
seems like the safest option?

I don't feel particularly strongly about any option, but like I said my
main concern is that x86 & powerpc end up with the same behaviour.

And whatever we end up with someone will need to do an update to the man
page for mmap.

cheers

2017-04-12 11:21:26

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH 8/8] x86/mm: Allow to have userspace mappings above 47-bits

On Wed, Apr 12, 2017 at 08:41:29PM +1000, Michael Ellerman wrote:
> Hi Kirill,
>
> I'm interested in this because we're doing pretty much the same thing on
> powerpc at the moment, and I want to make sure x86 & powerpc end up with
> compatible behaviour.
>
> "Kirill A. Shutemov" <[email protected]> writes:
> > On Fri, Apr 07, 2017 at 07:05:26PM +0530, Anshuman Khandual wrote:
> >> On 04/06/2017 07:31 PM, Kirill A. Shutemov wrote:
> >> > On x86, 5-level paging enables 56-bit userspace virtual address space.
> >> > Not all user space is ready to handle wide addresses. It's known that
> >> > at least some JIT compilers use higher bits in pointers to encode their
> >> > information. It collides with valid pointers with 5-level paging and
> >> > leads to crashes.
> >> >
> >> > To mitigate this, we are not going to allocate virtual address space
> >> > above 47-bit by default.
> >>
> >> I am wondering if the commitment of virtual space range to the
> >> user space is kind of an API which needs to be maintained there
> >> after. If that is the case then we need to have some plans when
> >> increasing it from the current level.
> >
> > I don't think we should ever enable full address space for all
> > applications. There's no point.
> >
> > /bin/true doesn't need more than 64TB of virtual memory.
> > And I hope never will.
> >
> > By increasing virtual address space for everybody we will pay (assuming
> > current page table format) at least one extra page per process for moving
> > stack at very end of address space.
>
> That assumes the current layout though, it could be different.

True.

> > Yes, you can gain something in security by having more bits for ASLR, but
> > I don't think it worth the cost.
>
> It may not be worth the cost now, for you, but that trade off will be
> different for other people and at other times.
>
> So I think it's quite likely some folks will be interested in the full
> address range for ASLR.

We can always extend the interface if/when userspace demand materializes.

Let's not invent interfaces unless we're sure there's demand.

> >> expanding the address range next time around. I think we need
> >> to have a plan for this and particularly around 'hint' mechanism
> >> and whether it should be decided per mmap() request or at the
> >> task level.
> >
> > I think the reasonable way for an application to claim it's 63-bit clean
> > is to make allocations with (void *)-1 as hint address.
>
> I do like the simplicity of that.
>
> But I wouldn't be surprised if some (crappy) code out there already
> passes an address of -1. Probably it won't break if it starts getting
> high addresses, but who knows.

To make an application break we need two things:

- it sets the hint address to -1 by mistake;
- it uses upper bits to encode its info.

I would be surprised if such a combination exists in the real world.

But let me know if you have any particular code in mind.
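
For illustration, here is a minimal sketch of the opt-in being discussed
(a hypothetical example, not part of the patches; the hint value is only an
assumption chosen to sit above the 47-bit window):

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
        /*
         * Hint above the 47-bit boundary, without MAP_FIXED. On a kernel
         * with this series the search may extend into the full address
         * space; an older kernel simply ignores the hint and keeps the
         * mapping below 47 bits.
         */
        void *hint = (void *)(1UL << 47);
        void *p = mmap(hint, 1UL << 20, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (p == MAP_FAILED)
                return 1;
        printf("mapped at %p\n", p);
        return 0;
}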

> An alternative would be to only interpret the hint as requesting a large
> address if it's >= 64TB && < TASK_SIZE_MAX.

Nope. That doesn't work if you take into account further extensions of the
address space.

Consider extending x86 to 6-level page tables. User space would get a
63-bit address space and TASK_SIZE_MAX would be bumped to
(1UL << 63) - PAGE_SIZE.

An application that wants access to the full address space gets recompiled
using the new TASK_SIZE_MAX as the hint address. And everything works fine.

But only on a machine with 6-level paging enabled.

If we run the same application binary on a machine with an older kernel and
5-level paging, the application will get access to only the 47-bit address
space, not 56-bit, as the hint address is more than TASK_SIZE_MAX in this
configuration.

> If we're really worried about breaking userspace then a new MMAP flag
> seems like the safest option?
>
> I don't feel particularly strongly about any option, but like I said my
> main concern is that x86 & powerpc end up with the same behaviour.
>
> And whatever we end up with someone will need to do an update to the man
> page for mmap.

Sure.

--
Kirill A. Shutemov

2017-04-13 11:30:56

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv4 0/9] x86: 5-level paging enabling for v4.12, Part 4

Here's the updated version of the fourth and last bunch of patches that
brings initial 5-level paging enabling.

Please review and consider applying.

The situation with the assembly hasn't changed much. I still don't see a
way to get it to work.

In this version I've included a patch to fix the comment in
return_from_SYSCALL_64, fixed a bug in the conversion of startup_64 to C
and updated the patch which allows opting in to the full address space.

Kirill A. Shutemov (9):
x86/asm: Fix comment in return_from_SYSCALL_64
x86/boot/64: Rewrite startup_64 in C
x86/boot/64: Rename init_level4_pgt and early_level4_pgt
x86/boot/64: Add support of additional page table level during early
boot
x86/mm: Add sync_global_pgds() for configuration with 5-level paging
x86/mm: Make kernel_physical_mapping_init() support 5-level paging
x86/mm: Add support for 5-level paging for KASLR
x86: Enable 5-level paging support
x86/mm: Allow to have userspace mappings above 47-bits

arch/x86/Kconfig | 5 +
arch/x86/boot/compressed/head_64.S | 23 ++++-
arch/x86/entry/entry_64.S | 3 +-
arch/x86/include/asm/elf.h | 4 +-
arch/x86/include/asm/mpx.h | 9 ++
arch/x86/include/asm/pgtable.h | 2 +-
arch/x86/include/asm/pgtable_64.h | 6 +-
arch/x86/include/asm/processor.h | 11 ++-
arch/x86/include/uapi/asm/processor-flags.h | 2 +
arch/x86/kernel/espfix_64.c | 2 +-
arch/x86/kernel/head64.c | 137 +++++++++++++++++++++++++---
arch/x86/kernel/head_64.S | 134 +++++++--------------------
arch/x86/kernel/machine_kexec_64.c | 2 +-
arch/x86/kernel/sys_x86_64.c | 30 +++++-
arch/x86/mm/dump_pagetables.c | 2 +-
arch/x86/mm/hugetlbpage.c | 27 +++++-
arch/x86/mm/init_64.c | 104 +++++++++++++++++++--
arch/x86/mm/kasan_init_64.c | 12 +--
arch/x86/mm/kaslr.c | 81 ++++++++++++----
arch/x86/mm/mmap.c | 6 +-
arch/x86/mm/mpx.c | 33 ++++++-
arch/x86/realmode/init.c | 2 +-
arch/x86/xen/Kconfig | 1 +
arch/x86/xen/mmu.c | 18 ++--
arch/x86/xen/xen-pvh.S | 2 +-
25 files changed, 470 insertions(+), 188 deletions(-)

--
2.11.0

2017-04-13 11:31:04

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv4 8/9] x86: Enable 5-level paging support

Most things are in place and we can enable support for 5-level paging.

Enabling XEN with 5-level paging requires more work. The patch makes XEN
dependent on !X86_5LEVEL.
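
For anyone who wants to give the series a try, a minimal config sketch
(assuming a CPU or QEMU model that supports 5-level paging ('la57'); the
choice is a compile-time switch, with no runtime fallback as far as these
patches go):

        CONFIG_X86_64=y
        CONFIG_X86_5LEVEL=y
        # CONFIG_XEN is not set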

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/Kconfig | 5 +++++
arch/x86/xen/Kconfig | 1 +
2 files changed, 6 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 4e153e93273f..7a76dcac357e 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -318,6 +318,7 @@ config FIX_EARLYCON_MEM

config PGTABLE_LEVELS
int
+ default 5 if X86_5LEVEL
default 4 if X86_64
default 3 if X86_PAE
default 2
@@ -1390,6 +1391,10 @@ config X86_PAE
has the cost of more pagetable lookup overhead, and also
consumes more pagetable space per process.

+config X86_5LEVEL
+ bool "Enable 5-level page tables support"
+ depends on X86_64
+
config ARCH_PHYS_ADDR_T_64BIT
def_bool y
depends on X86_64 || X86_PAE
diff --git a/arch/x86/xen/Kconfig b/arch/x86/xen/Kconfig
index 76b6dbd627df..b90d481ce5a1 100644
--- a/arch/x86/xen/Kconfig
+++ b/arch/x86/xen/Kconfig
@@ -5,6 +5,7 @@
config XEN
bool "Xen guest support"
depends on PARAVIRT
+ depends on !X86_5LEVEL
select PARAVIRT_CLOCK
select XEN_HAVE_PVMMU
select XEN_HAVE_VPMU
--
2.11.0

2017-04-13 11:31:07

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv4 7/9] x86/mm: Add support for 5-level paging for KASLR

With 5-level paging, randomization happens at the P4D level instead of the
PUD level.

The maximum amount of physical memory is also bumped to 52 bits for 5-level
paging.
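
For scale, a back-of-the-envelope check of the new maximum for the direct
mapping region (assuming TB_SHIFT == 40 and __PHYSICAL_MASK_SHIFT == 46
without / 52 with 5-level paging, as in this series):

        /*
         * kaslr_regions[0].size_tb for page_offset_base:
         *   4-level: 1 << (46 - 40) =   64 TB  (the old hardcoded maximum)
         *   5-level: 1 << (52 - 40) = 4096 TB
         */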

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/mm/kaslr.c | 81 ++++++++++++++++++++++++++++++++++++++++-------------
1 file changed, 62 insertions(+), 19 deletions(-)

diff --git a/arch/x86/mm/kaslr.c b/arch/x86/mm/kaslr.c
index aed206475aa7..af599167fe3c 100644
--- a/arch/x86/mm/kaslr.c
+++ b/arch/x86/mm/kaslr.c
@@ -6,12 +6,12 @@
*
* Entropy is generated using the KASLR early boot functions now shared in
* the lib directory (originally written by Kees Cook). Randomization is
- * done on PGD & PUD page table levels to increase possible addresses. The
- * physical memory mapping code was adapted to support PUD level virtual
- * addresses. This implementation on the best configuration provides 30,000
- * possible virtual addresses in average for each memory region. An additional
- * low memory page is used to ensure each CPU can start with a PGD aligned
- * virtual address (for realmode).
+ * done on PGD & P4D/PUD page table levels to increase possible addresses.
+ * The physical memory mapping code was adapted to support P4D/PUD level
+ * virtual addresses. This implementation on the best configuration provides
+ * 30,000 possible virtual addresses in average for each memory region.
+ * An additional low memory page is used to ensure each CPU can start with
+ * a PGD aligned virtual address (for realmode).
*
* The order of each memory region is not changed. The feature looks at
* the available space for the regions based on different configuration
@@ -70,7 +70,7 @@ static __initdata struct kaslr_memory_region {
unsigned long *base;
unsigned long size_tb;
} kaslr_regions[] = {
- { &page_offset_base, 64/* Maximum */ },
+ { &page_offset_base, 1 << (__PHYSICAL_MASK_SHIFT - TB_SHIFT) /* Maximum */ },
{ &vmalloc_base, VMALLOC_SIZE_TB },
{ &vmemmap_base, 1 },
};
@@ -142,7 +142,10 @@ void __init kernel_randomize_memory(void)
*/
entropy = remain_entropy / (ARRAY_SIZE(kaslr_regions) - i);
prandom_bytes_state(&rand_state, &rand, sizeof(rand));
- entropy = (rand % (entropy + 1)) & PUD_MASK;
+ if (IS_ENABLED(CONFIG_X86_5LEVEL))
+ entropy = (rand % (entropy + 1)) & P4D_MASK;
+ else
+ entropy = (rand % (entropy + 1)) & PUD_MASK;
vaddr += entropy;
*kaslr_regions[i].base = vaddr;

@@ -151,27 +154,21 @@ void __init kernel_randomize_memory(void)
* randomization alignment.
*/
vaddr += get_padding(&kaslr_regions[i]);
- vaddr = round_up(vaddr + 1, PUD_SIZE);
+ if (IS_ENABLED(CONFIG_X86_5LEVEL))
+ vaddr = round_up(vaddr + 1, P4D_SIZE);
+ else
+ vaddr = round_up(vaddr + 1, PUD_SIZE);
remain_entropy -= entropy;
}
}

-/*
- * Create PGD aligned trampoline table to allow real mode initialization
- * of additional CPUs. Consume only 1 low memory page.
- */
-void __meminit init_trampoline(void)
+static void __meminit init_trampoline_pud(void)
{
unsigned long paddr, paddr_next;
pgd_t *pgd;
pud_t *pud_page, *pud_page_tramp;
int i;

- if (!kaslr_memory_enabled()) {
- init_trampoline_default();
- return;
- }
-
pud_page_tramp = alloc_low_page();

paddr = 0;
@@ -192,3 +189,49 @@ void __meminit init_trampoline(void)
set_pgd(&trampoline_pgd_entry,
__pgd(_KERNPG_TABLE | __pa(pud_page_tramp)));
}
+
+static void __meminit init_trampoline_p4d(void)
+{
+ unsigned long paddr, paddr_next;
+ pgd_t *pgd;
+ p4d_t *p4d_page, *p4d_page_tramp;
+ int i;
+
+ p4d_page_tramp = alloc_low_page();
+
+ paddr = 0;
+ pgd = pgd_offset_k((unsigned long)__va(paddr));
+ p4d_page = (p4d_t *) pgd_page_vaddr(*pgd);
+
+ for (i = p4d_index(paddr); i < PTRS_PER_P4D; i++, paddr = paddr_next) {
+ p4d_t *p4d, *p4d_tramp;
+ unsigned long vaddr = (unsigned long)__va(paddr);
+
+ p4d_tramp = p4d_page_tramp + p4d_index(paddr);
+ p4d = p4d_page + p4d_index(vaddr);
+ paddr_next = (paddr & P4D_MASK) + P4D_SIZE;
+
+ *p4d_tramp = *p4d;
+ }
+
+ set_pgd(&trampoline_pgd_entry,
+ __pgd(_KERNPG_TABLE | __pa(p4d_page_tramp)));
+}
+
+/*
+ * Create PGD aligned trampoline table to allow real mode initialization
+ * of additional CPUs. Consume only 1 low memory page.
+ */
+void __meminit init_trampoline(void)
+{
+
+ if (!kaslr_memory_enabled()) {
+ init_trampoline_default();
+ return;
+ }
+
+ if (IS_ENABLED(CONFIG_X86_5LEVEL))
+ init_trampoline_p4d();
+ else
+ init_trampoline_pud();
+}
--
2.11.0

2017-04-13 11:31:00

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv4 2/9] x86/boot/64: Rewrite startup_64 in C

The patch rewrites most of the startup_64 logic in C.

This is preparation for 5-level paging enabling.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/kernel/head64.c | 81 +++++++++++++++++++++++++++++++++++++++-
arch/x86/kernel/head_64.S | 95 ++---------------------------------------------
2 files changed, 83 insertions(+), 93 deletions(-)

diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 43b7002f44fb..dbb5b29bf019 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -35,9 +35,88 @@
*/
extern pgd_t early_level4_pgt[PTRS_PER_PGD];
extern pmd_t early_dynamic_pgts[EARLY_DYNAMIC_PAGE_TABLES][PTRS_PER_PMD];
-static unsigned int __initdata next_early_pgt = 2;
+static unsigned int __initdata next_early_pgt;
pmdval_t early_pmd_flags = __PAGE_KERNEL_LARGE & ~(_PAGE_GLOBAL | _PAGE_NX);

+static void __init *fixup_pointer(void *ptr, unsigned long physaddr)
+{
+ return ptr - (void *)_text + (void *)physaddr;
+}
+
+void __init __startup_64(unsigned long physaddr)
+{
+ unsigned long load_delta, *p;
+ pgdval_t *pgd;
+ pudval_t *pud;
+ pmdval_t *pmd, pmd_entry;
+ int i;
+
+ /* Is the address too large? */
+ if (physaddr >> MAX_PHYSMEM_BITS)
+ for (;;);
+
+ /*
+ * Compute the delta between the address I am compiled to run at
+ * and the address I am actually running at.
+ */
+ load_delta = physaddr - (unsigned long)(_text - __START_KERNEL_map);
+
+ /* Is the address not 2M aligned? */
+ if (load_delta & ~PMD_PAGE_MASK)
+ for (;;);
+
+ /* Fixup the physical addresses in the page table */
+
+ pgd = fixup_pointer(&early_level4_pgt, physaddr);
+ pgd[pgd_index(__START_KERNEL_map)] += load_delta;
+
+ pud = fixup_pointer(&level3_kernel_pgt, physaddr);
+ pud[510] += load_delta;
+ pud[511] += load_delta;
+
+ pmd = fixup_pointer(level2_fixmap_pgt, physaddr);
+ pmd[506] += load_delta;
+
+ /*
+ * Set up the identity mapping for the switchover. These
+ * entries should *NOT* have the global bit set! This also
+ * creates a bunch of nonsense entries but that is fine --
+ * it avoids problems around wraparound.
+ */
+
+ pud = fixup_pointer(early_dynamic_pgts[next_early_pgt++], physaddr);
+ pmd = fixup_pointer(early_dynamic_pgts[next_early_pgt++], physaddr);
+
+ pgd[0] = (pgdval_t)pud + _KERNPG_TABLE;
+ pgd[1] = (pgdval_t)pud + _KERNPG_TABLE;
+
+ pud[0] = (pudval_t)pmd + _KERNPG_TABLE;
+ pud[1] = (pudval_t)pmd + _KERNPG_TABLE;
+
+ pmd_entry = __PAGE_KERNEL_LARGE_EXEC & ~_PAGE_GLOBAL;
+ pmd_entry += physaddr;
+
+ for (i = 0; i < DIV_ROUND_UP(_end - _text, PMD_SIZE); i++)
+ pmd[i + (physaddr >> PMD_SHIFT)] = pmd_entry + i * PMD_SIZE;
+
+ /*
+ * Fixup the kernel text+data virtual addresses. Note that
+ * we might write invalid pmds, when the kernel is relocated
+ * cleanup_highmap() fixes this up along with the mappings
+ * beyond _end.
+ */
+
+ pmd = fixup_pointer(level2_kernel_pgt, physaddr);
+ for (i = 0; i < PTRS_PER_PMD; i++) {
+ if (pmd[i] & _PAGE_PRESENT)
+ pmd[i] += load_delta;
+ }
+
+ /* Fixup phys_base */
+ p = fixup_pointer(&phys_base, physaddr);
+ *p += load_delta;
+}
+
/* Wipe all early page tables except for the kernel symbol map */
static void __init reset_early_page_tables(void)
{
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index ac9d327d2e42..1432d530fa35 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -72,100 +72,11 @@ startup_64:
/* Sanitize CPU configuration */
call verify_cpu

- /*
- * Compute the delta between the address I am compiled to run at and the
- * address I am actually running at.
- */
- leaq _text(%rip), %rbp
- subq $_text - __START_KERNEL_map, %rbp
-
- /* Is the address not 2M aligned? */
- testl $~PMD_PAGE_MASK, %ebp
- jnz bad_address
-
- /*
- * Is the address too large?
- */
- leaq _text(%rip), %rax
- shrq $MAX_PHYSMEM_BITS, %rax
- jnz bad_address
-
- /*
- * Fixup the physical addresses in the page table
- */
- addq %rbp, early_level4_pgt + (L4_START_KERNEL*8)(%rip)
-
- addq %rbp, level3_kernel_pgt + (510*8)(%rip)
- addq %rbp, level3_kernel_pgt + (511*8)(%rip)
-
- addq %rbp, level2_fixmap_pgt + (506*8)(%rip)
-
- /*
- * Set up the identity mapping for the switchover. These
- * entries should *NOT* have the global bit set! This also
- * creates a bunch of nonsense entries but that is fine --
- * it avoids problems around wraparound.
- */
leaq _text(%rip), %rdi
- leaq early_level4_pgt(%rip), %rbx
-
- movq %rdi, %rax
- shrq $PGDIR_SHIFT, %rax
-
- leaq (PAGE_SIZE + _KERNPG_TABLE)(%rbx), %rdx
- movq %rdx, 0(%rbx,%rax,8)
- movq %rdx, 8(%rbx,%rax,8)
-
- addq $PAGE_SIZE, %rdx
- movq %rdi, %rax
- shrq $PUD_SHIFT, %rax
- andl $(PTRS_PER_PUD-1), %eax
- movq %rdx, PAGE_SIZE(%rbx,%rax,8)
- incl %eax
- andl $(PTRS_PER_PUD-1), %eax
- movq %rdx, PAGE_SIZE(%rbx,%rax,8)
-
- addq $PAGE_SIZE * 2, %rbx
- movq %rdi, %rax
- shrq $PMD_SHIFT, %rdi
- addq $(__PAGE_KERNEL_LARGE_EXEC & ~_PAGE_GLOBAL), %rax
- leaq (_end - 1)(%rip), %rcx
- shrq $PMD_SHIFT, %rcx
- subq %rdi, %rcx
- incl %ecx
+ pushq %rsi
+ call __startup_64
+ popq %rsi

-1:
- andq $(PTRS_PER_PMD - 1), %rdi
- movq %rax, (%rbx,%rdi,8)
- incq %rdi
- addq $PMD_SIZE, %rax
- decl %ecx
- jnz 1b
-
- test %rbp, %rbp
- jz .Lskip_fixup
-
- /*
- * Fixup the kernel text+data virtual addresses. Note that
- * we might write invalid pmds, when the kernel is relocated
- * cleanup_highmap() fixes this up along with the mappings
- * beyond _end.
- */
- leaq level2_kernel_pgt(%rip), %rdi
- leaq PAGE_SIZE(%rdi), %r8
- /* See if it is a valid page table entry */
-1: testb $_PAGE_PRESENT, 0(%rdi)
- jz 2f
- addq %rbp, 0(%rdi)
- /* Go to the next page */
-2: addq $8, %rdi
- cmp %r8, %rdi
- jne 1b
-
- /* Fixup phys_base */
- addq %rbp, phys_base(%rip)
-
-.Lskip_fixup:
movq $(early_level4_pgt - __START_KERNEL_map), %rax
jmp 1f
ENTRY(secondary_startup_64)
--
2.11.0

2017-04-13 11:31:30

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv4 6/9] x86/mm: Make kernel_physical_mapping_init() support 5-level paging

Populate the additional page table level if CONFIG_X86_5LEVEL is enabled.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/mm/init_64.c | 69 ++++++++++++++++++++++++++++++++++++++++++++-------
1 file changed, 60 insertions(+), 9 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 0b62b13e8655..53cd9fb5027b 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -620,6 +620,57 @@ phys_pud_init(pud_t *pud_page, unsigned long paddr, unsigned long paddr_end,
return paddr_last;
}

+static unsigned long __meminit
+phys_p4d_init(p4d_t *p4d_page, unsigned long paddr, unsigned long paddr_end,
+ unsigned long page_size_mask)
+{
+ unsigned long paddr_next, paddr_last = paddr_end;
+ unsigned long vaddr = (unsigned long)__va(paddr);
+ int i = p4d_index(vaddr);
+
+ if (!IS_ENABLED(CONFIG_X86_5LEVEL))
+ return phys_pud_init((pud_t *) p4d_page, paddr, paddr_end, page_size_mask);
+
+ for (; i < PTRS_PER_P4D; i++, paddr = paddr_next) {
+ p4d_t *p4d;
+ pud_t *pud;
+
+ vaddr = (unsigned long)__va(paddr);
+ p4d = p4d_page + p4d_index(vaddr);
+ paddr_next = (paddr & P4D_MASK) + P4D_SIZE;
+
+ if (paddr >= paddr_end) {
+ if (!after_bootmem &&
+ !e820__mapped_any(paddr & P4D_MASK, paddr_next,
+ E820_TYPE_RAM) &&
+ !e820__mapped_any(paddr & P4D_MASK, paddr_next,
+ E820_TYPE_RESERVED_KERN))
+ set_p4d(p4d, __p4d(0));
+ continue;
+ }
+
+ if (!p4d_none(*p4d)) {
+ pud = pud_offset(p4d, 0);
+ paddr_last = phys_pud_init(pud, paddr,
+ paddr_end,
+ page_size_mask);
+ __flush_tlb_all();
+ continue;
+ }
+
+ pud = alloc_low_page();
+ paddr_last = phys_pud_init(pud, paddr, paddr_end,
+ page_size_mask);
+
+ spin_lock(&init_mm.page_table_lock);
+ p4d_populate(&init_mm, p4d, pud);
+ spin_unlock(&init_mm.page_table_lock);
+ }
+ __flush_tlb_all();
+
+ return paddr_last;
+}
+
/*
* Create page table mapping for the physical memory for specific physical
* addresses. The virtual and physical addresses have to be aligned on PMD level
@@ -641,26 +692,26 @@ kernel_physical_mapping_init(unsigned long paddr_start,
for (; vaddr < vaddr_end; vaddr = vaddr_next) {
pgd_t *pgd = pgd_offset_k(vaddr);
p4d_t *p4d;
- pud_t *pud;

vaddr_next = (vaddr & PGDIR_MASK) + PGDIR_SIZE;

- BUILD_BUG_ON(pgd_none(*pgd));
- p4d = p4d_offset(pgd, vaddr);
- if (p4d_val(*p4d)) {
- pud = (pud_t *)p4d_page_vaddr(*p4d);
- paddr_last = phys_pud_init(pud, __pa(vaddr),
+ if (pgd_val(*pgd)) {
+ p4d = (p4d_t *)pgd_page_vaddr(*pgd);
+ paddr_last = phys_p4d_init(p4d, __pa(vaddr),
__pa(vaddr_end),
page_size_mask);
continue;
}

- pud = alloc_low_page();
- paddr_last = phys_pud_init(pud, __pa(vaddr), __pa(vaddr_end),
+ p4d = alloc_low_page();
+ paddr_last = phys_p4d_init(p4d, __pa(vaddr), __pa(vaddr_end),
page_size_mask);

spin_lock(&init_mm.page_table_lock);
- p4d_populate(&init_mm, p4d, pud);
+ if (IS_ENABLED(CONFIG_X86_5LEVEL))
+ pgd_populate(&init_mm, pgd, p4d);
+ else
+ p4d_populate(&init_mm, p4d_offset(pgd, vaddr), (pud_t *) p4d);
spin_unlock(&init_mm.page_table_lock);
pgd_changed = true;
}
--
2.11.0

2017-04-13 11:31:35

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv4 5/9] x86/mm: Add sync_global_pgds() for configuration with 5-level paging

This basically restores a slightly modified version of the original
sync_global_pgds() which we had before the folded p4d level was introduced.

The only modification is protection against 'address' overflow.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/mm/init_64.c | 35 +++++++++++++++++++++++++++++++++++
1 file changed, 35 insertions(+)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index a242139df8fe..0b62b13e8655 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -92,6 +92,40 @@ __setup("noexec32=", nonx32_setup);
* When memory was added make sure all the processes MM have
* suitable PGD entries in the local PGD level page.
*/
+#ifdef CONFIG_X86_5LEVEL
+void sync_global_pgds(unsigned long start, unsigned long end)
+{
+ unsigned long address;
+
+ for (address = start; address <= end && address >= start; address += PGDIR_SIZE) {
+ const pgd_t *pgd_ref = pgd_offset_k(address);
+ struct page *page;
+
+ if (pgd_none(*pgd_ref))
+ continue;
+
+ spin_lock(&pgd_lock);
+ list_for_each_entry(page, &pgd_list, lru) {
+ pgd_t *pgd;
+ spinlock_t *pgt_lock;
+
+ pgd = (pgd_t *)page_address(page) + pgd_index(address);
+ /* the pgt_lock only for Xen */
+ pgt_lock = &pgd_page_get_mm(page)->page_table_lock;
+ spin_lock(pgt_lock);
+
+ if (!pgd_none(*pgd_ref) && !pgd_none(*pgd))
+ BUG_ON(pgd_page_vaddr(*pgd) != pgd_page_vaddr(*pgd_ref));
+
+ if (pgd_none(*pgd))
+ set_pgd(pgd, *pgd_ref);
+
+ spin_unlock(pgt_lock);
+ }
+ spin_unlock(&pgd_lock);
+ }
+}
+#else
void sync_global_pgds(unsigned long start, unsigned long end)
{
unsigned long address;
@@ -135,6 +169,7 @@ void sync_global_pgds(unsigned long start, unsigned long end)
spin_unlock(&pgd_lock);
}
}
+#endif

/*
* NOTE: This function is marked __ref because it calls __init function
--
2.11.0

2017-04-13 11:32:05

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv4 1/9] x86/asm: Fix comment in return_from_SYSCALL_64

On x86-64, __VIRTUAL_MASK_SHIFT now depends on the paging mode.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/entry/entry_64.S | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 607d72c4a485..edec30584eb8 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -266,7 +266,8 @@ return_from_SYSCALL_64:
* If width of "canonical tail" ever becomes variable, this will need
* to be updated to remain correct on both old and new CPUs.
*
- * Change top 16 bits to be the sign-extension of 47th bit
+ * Change top bits to match most significant bit (47th or 56th bit
+ * depending on paging mode) in the address.
*/
shl $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
sar $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
--
2.11.0

2017-04-13 11:32:04

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv4 9/9] x86/mm: Allow to have userspace mappings above 47-bits

On x86, 5-level paging enables a 56-bit userspace virtual address space.
Not all user space is ready to handle wide addresses. It's known that
at least some JIT compilers use higher bits in pointers to encode their
information. That collides with valid pointers under 5-level paging and
leads to crashes.

To mitigate this, we are not going to allocate virtual address space
above 47-bit by default.

But userspace can ask for allocation from full address space by
specifying hint address (with or without MAP_FIXED) above 47-bits.

If the hint address is set above 47-bit, but MAP_FIXED is not specified,
we try to look for an unmapped area at the specified address. If it's
already occupied, we look for an unmapped area in the *full* address space,
rather than in the 47-bit window.

A high hint address would only affect the allocation in question, but not
any future mmap()s.

Specifying a high hint address on an older kernel or on a machine without
5-level paging support is safe. The hint will be ignored and the kernel
will fall back to allocating from the 47-bit address space.

This approach makes it easy for an application's memory allocator to become
aware of the large address space without manually tracking the allocated
virtual address space.

One important case we need to handle here is the interaction with MPX.
MPX (without the MAWA extension) cannot handle addresses above 47-bit, so
we need to make sure that MPX cannot be enabled if we already have a VMA
above the boundary, and forbid creating such VMAs once MPX is enabled.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Dmitry Safonov <[email protected]>
Cc: [email protected]
---
arch/x86/include/asm/elf.h | 4 ++--
arch/x86/include/asm/mpx.h | 9 +++++++++
arch/x86/include/asm/processor.h | 11 ++++++++---
arch/x86/kernel/sys_x86_64.c | 30 ++++++++++++++++++++++++++----
arch/x86/mm/hugetlbpage.c | 27 +++++++++++++++++++++++----
arch/x86/mm/mmap.c | 6 +++---
arch/x86/mm/mpx.c | 33 ++++++++++++++++++++++++++++++++-
7 files changed, 103 insertions(+), 17 deletions(-)

diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
index e8ab9a46bc68..7a30513a4046 100644
--- a/arch/x86/include/asm/elf.h
+++ b/arch/x86/include/asm/elf.h
@@ -250,7 +250,7 @@ extern int force_personality32;
the loader. We need to make sure that it is out of the way of the program
that it will "exec", and that there is sufficient room for the brk. */

-#define ELF_ET_DYN_BASE (TASK_SIZE / 3 * 2)
+#define ELF_ET_DYN_BASE (TASK_SIZE_LOW / 3 * 2)

/* This yields a mask that user programs can use to figure out what
instruction set this CPU supports. This could be done in user space,
@@ -304,7 +304,7 @@ static inline int mmap_is_ia32(void)
}

extern unsigned long tasksize_32bit(void);
-extern unsigned long tasksize_64bit(void);
+extern unsigned long tasksize_64bit(int full_addr_space);
extern unsigned long get_mmap_base(int is_legacy);

#ifdef CONFIG_X86_32
diff --git a/arch/x86/include/asm/mpx.h b/arch/x86/include/asm/mpx.h
index a0d662be4c5b..7d7404756bb4 100644
--- a/arch/x86/include/asm/mpx.h
+++ b/arch/x86/include/asm/mpx.h
@@ -73,6 +73,9 @@ static inline void mpx_mm_init(struct mm_struct *mm)
}
void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long start, unsigned long end);
+
+unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned long len,
+ unsigned long flags);
#else
static inline siginfo_t *mpx_generate_siginfo(struct pt_regs *regs)
{
@@ -94,6 +97,12 @@ static inline void mpx_notify_unmap(struct mm_struct *mm,
unsigned long start, unsigned long end)
{
}
+
+static inline unsigned long mpx_unmapped_area_check(unsigned long addr,
+ unsigned long len, unsigned long flags)
+{
+ return addr;
+}
#endif /* CONFIG_X86_INTEL_MPX */

#endif /* _ASM_X86_MPX_H */
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 3cada998a402..aaed58b03ddb 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -795,6 +795,7 @@ static inline void spin_lock_prefetch(const void *x)
#define IA32_PAGE_OFFSET PAGE_OFFSET
#define TASK_SIZE PAGE_OFFSET
#define TASK_SIZE_MAX TASK_SIZE
+#define DEFAULT_MAP_WINDOW TASK_SIZE
#define STACK_TOP TASK_SIZE
#define STACK_TOP_MAX STACK_TOP

@@ -834,7 +835,9 @@ static inline void spin_lock_prefetch(const void *x)
* particular problem by preventing anything from being mapped
* at the maximum canonical address.
*/
-#define TASK_SIZE_MAX ((1UL << 47) - PAGE_SIZE)
+#define TASK_SIZE_MAX ((1UL << __VIRTUAL_MASK_SHIFT) - PAGE_SIZE)
+
+#define DEFAULT_MAP_WINDOW ((1UL << 47) - PAGE_SIZE)

/* This decides where the kernel will search for a free chunk of vm
* space during mmap's.
@@ -842,12 +845,14 @@ static inline void spin_lock_prefetch(const void *x)
#define IA32_PAGE_OFFSET ((current->personality & ADDR_LIMIT_3GB) ? \
0xc0000000 : 0xFFFFe000)

+#define TASK_SIZE_LOW (test_thread_flag(TIF_ADDR32) ? \
+ IA32_PAGE_OFFSET : DEFAULT_MAP_WINDOW)
#define TASK_SIZE (test_thread_flag(TIF_ADDR32) ? \
IA32_PAGE_OFFSET : TASK_SIZE_MAX)
#define TASK_SIZE_OF(child) ((test_tsk_thread_flag(child, TIF_ADDR32)) ? \
IA32_PAGE_OFFSET : TASK_SIZE_MAX)

-#define STACK_TOP TASK_SIZE
+#define STACK_TOP TASK_SIZE_LOW
#define STACK_TOP_MAX TASK_SIZE_MAX

#define INIT_THREAD { \
@@ -870,7 +875,7 @@ extern void start_thread(struct pt_regs *regs, unsigned long new_ip,
* space during mmap's.
*/
#define __TASK_UNMAPPED_BASE(task_size) (PAGE_ALIGN(task_size / 3))
-#define TASK_UNMAPPED_BASE __TASK_UNMAPPED_BASE(TASK_SIZE)
+#define TASK_UNMAPPED_BASE __TASK_UNMAPPED_BASE(TASK_SIZE_LOW)

#define KSTK_EIP(task) (task_pt_regs(task)->ip)

diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c
index 207b8f2582c7..74d1587b181d 100644
--- a/arch/x86/kernel/sys_x86_64.c
+++ b/arch/x86/kernel/sys_x86_64.c
@@ -21,6 +21,7 @@
#include <asm/compat.h>
#include <asm/ia32.h>
#include <asm/syscalls.h>
+#include <asm/mpx.h>

/*
* Align a virtual address to avoid aliasing in the I$ on AMD F15h.
@@ -100,8 +101,8 @@ SYSCALL_DEFINE6(mmap, unsigned long, addr, unsigned long, len,
return error;
}

-static void find_start_end(unsigned long flags, unsigned long *begin,
- unsigned long *end)
+static void find_start_end(unsigned long addr, unsigned long flags,
+ unsigned long *begin, unsigned long *end)
{
if (!in_compat_syscall() && (flags & MAP_32BIT)) {
/* This is usually used needed to map code in small
@@ -120,7 +121,10 @@ static void find_start_end(unsigned long flags, unsigned long *begin,
}

*begin = get_mmap_base(1);
- *end = in_compat_syscall() ? tasksize_32bit() : tasksize_64bit();
+ if (in_compat_syscall())
+ *end = tasksize_32bit();
+ else
+ *end = tasksize_64bit(addr > DEFAULT_MAP_WINDOW);
}

unsigned long
@@ -132,10 +136,14 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
struct vm_unmapped_area_info info;
unsigned long begin, end;

+ addr = mpx_unmapped_area_check(addr, len, flags);
+ if (IS_ERR_VALUE(addr))
+ return addr;
+
if (flags & MAP_FIXED)
return addr;

- find_start_end(flags, &begin, &end);
+ find_start_end(addr, flags, &begin, &end);

if (len > end)
return -ENOMEM;
@@ -171,6 +179,10 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
unsigned long addr = addr0;
struct vm_unmapped_area_info info;

+ addr = mpx_unmapped_area_check(addr, len, flags);
+ if (IS_ERR_VALUE(addr))
+ return addr;
+
/* requested length too big for entire address space */
if (len > TASK_SIZE)
return -ENOMEM;
@@ -195,6 +207,16 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
info.length = len;
info.low_limit = PAGE_SIZE;
info.high_limit = get_mmap_base(0);
+
+ /*
+ * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
+ * in the full address space.
+ *
+ * !in_compat_syscall() check to avoid high addresses for x32.
+ */
+ if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
+ info.high_limit += TASK_SIZE_MAX - DEFAULT_MAP_WINDOW;
+
info.align_mask = 0;
info.align_offset = pgoff << PAGE_SHIFT;
if (filp) {
diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index 302f43fd9c28..730f00250acb 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -18,6 +18,7 @@
#include <asm/tlbflush.h>
#include <asm/pgalloc.h>
#include <asm/elf.h>
+#include <asm/mpx.h>

#if 0 /* This is just for testing */
struct page *
@@ -85,25 +86,38 @@ static unsigned long hugetlb_get_unmapped_area_bottomup(struct file *file,
info.flags = 0;
info.length = len;
info.low_limit = get_mmap_base(1);
+
+ /*
+ * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
+ * in the full address space.
+ */
info.high_limit = in_compat_syscall() ?
- tasksize_32bit() : tasksize_64bit();
+ tasksize_32bit() : tasksize_64bit(addr > DEFAULT_MAP_WINDOW);
+
info.align_mask = PAGE_MASK & ~huge_page_mask(h);
info.align_offset = 0;
return vm_unmapped_area(&info);
}

static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
- unsigned long addr0, unsigned long len,
+ unsigned long addr, unsigned long len,
unsigned long pgoff, unsigned long flags)
{
struct hstate *h = hstate_file(file);
struct vm_unmapped_area_info info;
- unsigned long addr;

info.flags = VM_UNMAPPED_AREA_TOPDOWN;
info.length = len;
info.low_limit = PAGE_SIZE;
info.high_limit = get_mmap_base(0);
+
+ /*
+ * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
+ * in the full address space.
+ */
+ if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
+ info.high_limit += TASK_SIZE_MAX - DEFAULT_MAP_WINDOW;
+
info.align_mask = PAGE_MASK & ~huge_page_mask(h);
info.align_offset = 0;
addr = vm_unmapped_area(&info);
@@ -118,7 +132,7 @@ static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
VM_BUG_ON(addr != -ENOMEM);
info.flags = 0;
info.low_limit = TASK_UNMAPPED_BASE;
- info.high_limit = TASK_SIZE;
+ info.high_limit = TASK_SIZE_LOW;
addr = vm_unmapped_area(&info);
}

@@ -135,6 +149,11 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,

if (len & ~huge_page_mask(h))
return -EINVAL;
+
+ addr = mpx_unmapped_area_check(addr, len, flags);
+ if (IS_ERR_VALUE(addr))
+ return addr;
+
if (len > TASK_SIZE)
return -ENOMEM;

diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
index 19ad095b41df..199050249d60 100644
--- a/arch/x86/mm/mmap.c
+++ b/arch/x86/mm/mmap.c
@@ -42,9 +42,9 @@ unsigned long tasksize_32bit(void)
return IA32_PAGE_OFFSET;
}

-unsigned long tasksize_64bit(void)
+unsigned long tasksize_64bit(int full_addr_space)
{
- return TASK_SIZE_MAX;
+ return full_addr_space ? TASK_SIZE_MAX : DEFAULT_MAP_WINDOW;
}

static unsigned long stack_maxrandom_size(unsigned long task_size)
@@ -140,7 +140,7 @@ void arch_pick_mmap_layout(struct mm_struct *mm)
mm->get_unmapped_area = arch_get_unmapped_area_topdown;

arch_pick_mmap_base(&mm->mmap_base, &mm->mmap_legacy_base,
- arch_rnd(mmap64_rnd_bits), tasksize_64bit());
+ arch_rnd(mmap64_rnd_bits), tasksize_64bit(0));

#ifdef CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES
/*
diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
index 1c34b767c84c..8c8da27e8549 100644
--- a/arch/x86/mm/mpx.c
+++ b/arch/x86/mm/mpx.c
@@ -355,10 +355,19 @@ int mpx_enable_management(void)
*/
bd_base = mpx_get_bounds_dir();
down_write(&mm->mmap_sem);
+
+ /* MPX doesn't support addresses above 47-bits yet. */
+ if (find_vma(mm, DEFAULT_MAP_WINDOW)) {
+ pr_warn_once("%s (%d): MPX cannot handle addresses "
+ "above 47-bits. Disabling.",
+ current->comm, current->pid);
+ ret = -ENXIO;
+ goto out;
+ }
mm->context.bd_addr = bd_base;
if (mm->context.bd_addr == MPX_INVALID_BOUNDS_DIR)
ret = -ENXIO;
-
+out:
up_write(&mm->mmap_sem);
return ret;
}
@@ -1030,3 +1039,25 @@ void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
if (ret)
force_sig(SIGSEGV, current);
}
+
+/* MPX cannot handle addresses above 47-bits yet. */
+unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned long len,
+ unsigned long flags)
+{
+ if (!kernel_managing_mpx_tables(current->mm))
+ return addr;
+ if (addr + len <= DEFAULT_MAP_WINDOW)
+ return addr;
+ if (flags & MAP_FIXED)
+ return -ENOMEM;
+
+ /*
+ * Requested len is larger than whole area we're allowed to map in.
+ * Resetting hinting address wouldn't do much good -- fail early.
+ */
+ if (len > DEFAULT_MAP_WINDOW)
+ return -ENOMEM;
+
+ /* Look for unmap area within DEFAULT_MAP_WINDOW */
+ return 0;
+}
--
2.11.0

2017-04-13 11:32:34

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv4 4/9] x86/boot/64: Add support of additional page table level during early boot

This patch adds support for 5-level paging during early boot.
It generalizes the boot code for 4- and 5-level paging on 64-bit systems
with a compile-time switch between them.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/boot/compressed/head_64.S | 23 ++++++++++++---
arch/x86/include/asm/pgtable_64.h | 2 ++
arch/x86/include/uapi/asm/processor-flags.h | 2 ++
arch/x86/kernel/head64.c | 44 +++++++++++++++++++++++++----
arch/x86/kernel/head_64.S | 29 +++++++++++++++----
5 files changed, 85 insertions(+), 15 deletions(-)

diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index d2ae1f821e0c..3ed26769810b 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -122,9 +122,12 @@ ENTRY(startup_32)
addl %ebp, gdt+2(%ebp)
lgdt gdt(%ebp)

- /* Enable PAE mode */
+ /* Enable PAE and LA57 mode */
movl %cr4, %eax
orl $X86_CR4_PAE, %eax
+#ifdef CONFIG_X86_5LEVEL
+ orl $X86_CR4_LA57, %eax
+#endif
movl %eax, %cr4

/*
@@ -136,13 +139,24 @@ ENTRY(startup_32)
movl $(BOOT_INIT_PGT_SIZE/4), %ecx
rep stosl

+ xorl %edx, %edx
+
+ /* Build Top Level */
+ leal pgtable(%ebx,%edx,1), %edi
+ leal 0x1007 (%edi), %eax
+ movl %eax, 0(%edi)
+
+#ifdef CONFIG_X86_5LEVEL
/* Build Level 4 */
- leal pgtable + 0(%ebx), %edi
+ addl $0x1000, %edx
+ leal pgtable(%ebx,%edx), %edi
leal 0x1007 (%edi), %eax
movl %eax, 0(%edi)
+#endif

/* Build Level 3 */
- leal pgtable + 0x1000(%ebx), %edi
+ addl $0x1000, %edx
+ leal pgtable(%ebx,%edx), %edi
leal 0x1007(%edi), %eax
movl $4, %ecx
1: movl %eax, 0x00(%edi)
@@ -152,7 +166,8 @@ ENTRY(startup_32)
jnz 1b

/* Build Level 2 */
- leal pgtable + 0x2000(%ebx), %edi
+ addl $0x1000, %edx
+ leal pgtable(%ebx,%edx), %edi
movl $0x00000183, %eax
movl $2048, %ecx
1: movl %eax, 0(%edi)
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index affcb2a9c563..2160c1fee920 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -14,6 +14,8 @@
#include <linux/bitops.h>
#include <linux/threads.h>

+extern p4d_t level4_kernel_pgt[512];
+extern p4d_t level4_ident_pgt[512];
extern pud_t level3_kernel_pgt[512];
extern pud_t level3_ident_pgt[512];
extern pmd_t level2_kernel_pgt[512];
diff --git a/arch/x86/include/uapi/asm/processor-flags.h b/arch/x86/include/uapi/asm/processor-flags.h
index 567de50a4c2a..185f3d10c194 100644
--- a/arch/x86/include/uapi/asm/processor-flags.h
+++ b/arch/x86/include/uapi/asm/processor-flags.h
@@ -104,6 +104,8 @@
#define X86_CR4_OSFXSR _BITUL(X86_CR4_OSFXSR_BIT)
#define X86_CR4_OSXMMEXCPT_BIT 10 /* enable unmasked SSE exceptions */
#define X86_CR4_OSXMMEXCPT _BITUL(X86_CR4_OSXMMEXCPT_BIT)
+#define X86_CR4_LA57_BIT 12 /* enable 5-level page tables */
+#define X86_CR4_LA57 _BITUL(X86_CR4_LA57_BIT)
#define X86_CR4_VMXE_BIT 13 /* enable VMX virtualization */
#define X86_CR4_VMXE _BITUL(X86_CR4_VMXE_BIT)
#define X86_CR4_SMXE_BIT 14 /* enable safer mode (TXT) */
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index c46e0f62024e..92935855eaaa 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -47,6 +47,7 @@ void __init __startup_64(unsigned long physaddr)
{
unsigned long load_delta, *p;
pgdval_t *pgd;
+ p4dval_t *p4d;
pudval_t *pud;
pmdval_t *pmd, pmd_entry;
int i;
@@ -70,6 +71,11 @@ void __init __startup_64(unsigned long physaddr)
pgd = fixup_pointer(&early_top_pgt, physaddr);
pgd[pgd_index(__START_KERNEL_map)] += load_delta;

+ if (IS_ENABLED(CONFIG_X86_5LEVEL)) {
+ p4d = fixup_pointer(&level4_kernel_pgt, physaddr);
+ p4d[511] += load_delta;
+ }
+
pud = fixup_pointer(&level3_kernel_pgt, physaddr);
pud[510] += load_delta;
pud[511] += load_delta;
@@ -87,8 +93,18 @@ void __init __startup_64(unsigned long physaddr)
pud = fixup_pointer(early_dynamic_pgts[next_early_pgt++], physaddr);
pmd = fixup_pointer(early_dynamic_pgts[next_early_pgt++], physaddr);

- pgd[0] = (pgdval_t)pud + _KERNPG_TABLE;
- pgd[1] = (pgdval_t)pud + _KERNPG_TABLE;
+ if (IS_ENABLED(CONFIG_X86_5LEVEL)) {
+ p4d = fixup_pointer(early_dynamic_pgts[next_early_pgt++], physaddr);
+
+ pgd[0] = (pgdval_t)p4d + _KERNPG_TABLE;
+ pgd[1] = (pgdval_t)p4d + _KERNPG_TABLE;
+
+ p4d[0] = (pgdval_t)pud + _KERNPG_TABLE;
+ p4d[1] = (pgdval_t)pud + _KERNPG_TABLE;
+ } else {
+ pgd[0] = (pgdval_t)pud + _KERNPG_TABLE;
+ pgd[1] = (pgdval_t)pud + _KERNPG_TABLE;
+ }

pud[0] = (pudval_t)pmd + _KERNPG_TABLE;
pud[1] = (pudval_t)pmd + _KERNPG_TABLE;
@@ -130,6 +146,7 @@ int __init early_make_pgtable(unsigned long address)
{
unsigned long physaddr = address - __PAGE_OFFSET;
pgdval_t pgd, *pgd_p;
+ p4dval_t p4d, *p4d_p;
pudval_t pud, *pud_p;
pmdval_t pmd, *pmd_p;

@@ -146,8 +163,25 @@ int __init early_make_pgtable(unsigned long address)
* critical -- __PAGE_OFFSET would point us back into the dynamic
* range and we might end up looping forever...
*/
- if (pgd)
- pud_p = (pudval_t *)((pgd & PTE_PFN_MASK) + __START_KERNEL_map - phys_base);
+ if (!IS_ENABLED(CONFIG_X86_5LEVEL))
+ p4d_p = pgd_p;
+ else if (pgd)
+ p4d_p = (p4dval_t *)((pgd & PTE_PFN_MASK) + __START_KERNEL_map - phys_base);
+ else {
+ if (next_early_pgt >= EARLY_DYNAMIC_PAGE_TABLES) {
+ reset_early_page_tables();
+ goto again;
+ }
+
+ p4d_p = (p4dval_t *)early_dynamic_pgts[next_early_pgt++];
+ memset(p4d_p, 0, sizeof(*p4d_p) * PTRS_PER_P4D);
+ *pgd_p = (pgdval_t)p4d_p - __START_KERNEL_map + phys_base + _KERNPG_TABLE;
+ }
+ p4d_p += p4d_index(address);
+ p4d = *p4d_p;
+
+ if (p4d)
+ pud_p = (pudval_t *)((p4d & PTE_PFN_MASK) + __START_KERNEL_map - phys_base);
else {
if (next_early_pgt >= EARLY_DYNAMIC_PAGE_TABLES) {
reset_early_page_tables();
@@ -156,7 +190,7 @@ int __init early_make_pgtable(unsigned long address)

pud_p = (pudval_t *)early_dynamic_pgts[next_early_pgt++];
memset(pud_p, 0, sizeof(*pud_p) * PTRS_PER_PUD);
- *pgd_p = (pgdval_t)pud_p - __START_KERNEL_map + phys_base + _KERNPG_TABLE;
+ *p4d_p = (p4dval_t)pud_p - __START_KERNEL_map + phys_base + _KERNPG_TABLE;
}
pud_p += pud_index(address);
pud = *pud_p;
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 0ae0bad4d4d5..7b527fa47536 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -37,10 +37,14 @@
*
*/

+#define p4d_index(x) (((x) >> P4D_SHIFT) & (PTRS_PER_P4D-1))
#define pud_index(x) (((x) >> PUD_SHIFT) & (PTRS_PER_PUD-1))

-L4_PAGE_OFFSET = pgd_index(__PAGE_OFFSET_BASE)
-L4_START_KERNEL = pgd_index(__START_KERNEL_map)
+PGD_PAGE_OFFSET = pgd_index(__PAGE_OFFSET_BASE)
+PGD_START_KERNEL = pgd_index(__START_KERNEL_map)
+#ifdef CONFIG_X86_5LEVEL
+L4_START_KERNEL = p4d_index(__START_KERNEL_map)
+#endif
L3_START_KERNEL = pud_index(__START_KERNEL_map)

.text
@@ -100,11 +104,14 @@ ENTRY(secondary_startup_64)
movq $(init_top_pgt - __START_KERNEL_map), %rax
1:

- /* Enable PAE mode and PGE */
+ /* Enable PAE mode, PGE and LA57 */
movl $(X86_CR4_PAE | X86_CR4_PGE), %ecx
+#ifdef CONFIG_X86_5LEVEL
+ orl $X86_CR4_LA57, %ecx
+#endif
movq %rcx, %cr4

- /* Setup early boot stage 4 level pagetables. */
+ /* Setup early boot stage 4-/5-level pagetables. */
addq phys_base(%rip), %rax
movq %rax, %cr3

@@ -330,7 +337,11 @@ GLOBAL(name)
__INITDATA
NEXT_PAGE(early_top_pgt)
.fill 511,8,0
+#ifdef CONFIG_X86_5LEVEL
+ .quad level4_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE
+#else
.quad level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE
+#endif

NEXT_PAGE(early_dynamic_pgts)
.fill 512*EARLY_DYNAMIC_PAGE_TABLES,8,0
@@ -343,9 +354,9 @@ NEXT_PAGE(init_top_pgt)
#else
NEXT_PAGE(init_top_pgt)
.quad level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
- .org init_top_pgt + L4_PAGE_OFFSET*8, 0
+ .org init_top_pgt + PGD_PAGE_OFFSET*8, 0
.quad level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
- .org init_top_pgt + L4_START_KERNEL*8, 0
+ .org init_top_pgt + PGD_START_KERNEL*8, 0
/* (2^48-(2*1024*1024*1024))/(2^39) = 511 */
.quad level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE

@@ -359,6 +370,12 @@ NEXT_PAGE(level2_ident_pgt)
PMDS(0, __PAGE_KERNEL_IDENT_LARGE_EXEC, PTRS_PER_PMD)
#endif

+#ifdef CONFIG_X86_5LEVEL
+NEXT_PAGE(level4_kernel_pgt)
+ .fill 511,8,0
+ .quad level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE
+#endif
+
NEXT_PAGE(level3_kernel_pgt)
.fill L3_START_KERNEL,8,0
/* (2^48-(2*1024*1024*1024)-((2^39)*511))/(2^30) = 510 */
--
2.11.0

2017-04-13 11:33:07

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv4 3/9] x86/boot/64: Rename init_level4_pgt and early_level4_pgt

With CONFIG_X86_5LEVEL=y, level 4 is no longer top level of page tables.

Let's give these variable more generic names: init_top_pgt and
early_top_pgt.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/include/asm/pgtable.h | 2 +-
arch/x86/include/asm/pgtable_64.h | 4 ++--
arch/x86/kernel/espfix_64.c | 2 +-
arch/x86/kernel/head64.c | 18 +++++++++---------
arch/x86/kernel/head_64.S | 14 +++++++-------
arch/x86/kernel/machine_kexec_64.c | 2 +-
arch/x86/mm/dump_pagetables.c | 2 +-
arch/x86/mm/kasan_init_64.c | 12 ++++++------
arch/x86/realmode/init.c | 2 +-
arch/x86/xen/mmu.c | 18 +++++++++---------
arch/x86/xen/xen-pvh.S | 2 +-
11 files changed, 39 insertions(+), 39 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 942482ac36a8..77037b6f1caa 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -922,7 +922,7 @@ extern pgd_t trampoline_pgd_entry;
static inline void __meminit init_trampoline_default(void)
{
/* Default trampoline pgd value */
- trampoline_pgd_entry = init_level4_pgt[pgd_index(__PAGE_OFFSET)];
+ trampoline_pgd_entry = init_top_pgt[pgd_index(__PAGE_OFFSET)];
}
# ifdef CONFIG_RANDOMIZE_MEMORY
void __meminit init_trampoline(void);
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index 12ea31274eb6..affcb2a9c563 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -20,9 +20,9 @@ extern pmd_t level2_kernel_pgt[512];
extern pmd_t level2_fixmap_pgt[512];
extern pmd_t level2_ident_pgt[512];
extern pte_t level1_fixmap_pgt[512];
-extern pgd_t init_level4_pgt[];
+extern pgd_t init_top_pgt[];

-#define swapper_pg_dir init_level4_pgt
+#define swapper_pg_dir init_top_pgt

extern void paging_init(void);

diff --git a/arch/x86/kernel/espfix_64.c b/arch/x86/kernel/espfix_64.c
index 8e598a1ad986..6b91e2eb8d3f 100644
--- a/arch/x86/kernel/espfix_64.c
+++ b/arch/x86/kernel/espfix_64.c
@@ -125,7 +125,7 @@ void __init init_espfix_bsp(void)
p4d_t *p4d;

/* Install the espfix pud into the kernel page directory */
- pgd = &init_level4_pgt[pgd_index(ESPFIX_BASE_ADDR)];
+ pgd = &init_top_pgt[pgd_index(ESPFIX_BASE_ADDR)];
p4d = p4d_alloc(&init_mm, pgd, ESPFIX_BASE_ADDR);
p4d_populate(&init_mm, p4d, espfix_pud_page);

diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index dbb5b29bf019..c46e0f62024e 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -33,7 +33,7 @@
/*
* Manage page tables very early on.
*/
-extern pgd_t early_level4_pgt[PTRS_PER_PGD];
+extern pgd_t early_top_pgt[PTRS_PER_PGD];
extern pmd_t early_dynamic_pgts[EARLY_DYNAMIC_PAGE_TABLES][PTRS_PER_PMD];
static unsigned int __initdata next_early_pgt;
pmdval_t early_pmd_flags = __PAGE_KERNEL_LARGE & ~(_PAGE_GLOBAL | _PAGE_NX);
@@ -67,7 +67,7 @@ void __init __startup_64(unsigned long physaddr)

/* Fixup the physical addresses in the page table */

- pgd = fixup_pointer(&early_level4_pgt, physaddr);
+ pgd = fixup_pointer(&early_top_pgt, physaddr);
pgd[pgd_index(__START_KERNEL_map)] += load_delta;

pud = fixup_pointer(&level3_kernel_pgt, physaddr);
@@ -120,9 +120,9 @@ void __init __startup_64(unsigned long physaddr)
/* Wipe all early page tables except for the kernel symbol map */
static void __init reset_early_page_tables(void)
{
- memset(early_level4_pgt, 0, sizeof(pgd_t)*(PTRS_PER_PGD-1));
+ memset(early_top_pgt, 0, sizeof(pgd_t)*(PTRS_PER_PGD-1));
next_early_pgt = 0;
- write_cr3(__pa_nodebug(early_level4_pgt));
+ write_cr3(__pa_nodebug(early_top_pgt));
}

/* Create a new PMD entry */
@@ -134,11 +134,11 @@ int __init early_make_pgtable(unsigned long address)
pmdval_t pmd, *pmd_p;

/* Invalid address or early pgt is done ? */
- if (physaddr >= MAXMEM || read_cr3() != __pa_nodebug(early_level4_pgt))
+ if (physaddr >= MAXMEM || read_cr3() != __pa_nodebug(early_top_pgt))
return -1;

again:
- pgd_p = &early_level4_pgt[pgd_index(address)].pgd;
+ pgd_p = &early_top_pgt[pgd_index(address)].pgd;
pgd = *pgd_p;

/*
@@ -235,7 +235,7 @@ asmlinkage __visible void __init x86_64_start_kernel(char * real_mode_data)

clear_bss();

- clear_page(init_level4_pgt);
+ clear_page(init_top_pgt);

kasan_early_init();

@@ -250,8 +250,8 @@ asmlinkage __visible void __init x86_64_start_kernel(char * real_mode_data)
*/
load_ucode_bsp();

- /* set init_level4_pgt kernel high mapping*/
- init_level4_pgt[511] = early_level4_pgt[511];
+ /* set init_top_pgt kernel high mapping*/
+ init_top_pgt[511] = early_top_pgt[511];

x86_64_start_reservations(real_mode_data);
}
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 1432d530fa35..0ae0bad4d4d5 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -77,7 +77,7 @@ startup_64:
call __startup_64
popq %rsi

- movq $(early_level4_pgt - __START_KERNEL_map), %rax
+ movq $(early_top_pgt - __START_KERNEL_map), %rax
jmp 1f
ENTRY(secondary_startup_64)
/*
@@ -97,7 +97,7 @@ ENTRY(secondary_startup_64)
/* Sanitize CPU configuration */
call verify_cpu

- movq $(init_level4_pgt - __START_KERNEL_map), %rax
+ movq $(init_top_pgt - __START_KERNEL_map), %rax
1:

/* Enable PAE mode and PGE */
@@ -328,7 +328,7 @@ GLOBAL(name)
.endr

__INITDATA
-NEXT_PAGE(early_level4_pgt)
+NEXT_PAGE(early_top_pgt)
.fill 511,8,0
.quad level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE

@@ -338,14 +338,14 @@ NEXT_PAGE(early_dynamic_pgts)
.data

#ifndef CONFIG_XEN
-NEXT_PAGE(init_level4_pgt)
+NEXT_PAGE(init_top_pgt)
.fill 512,8,0
#else
-NEXT_PAGE(init_level4_pgt)
+NEXT_PAGE(init_top_pgt)
.quad level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
- .org init_level4_pgt + L4_PAGE_OFFSET*8, 0
+ .org init_top_pgt + L4_PAGE_OFFSET*8, 0
.quad level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
- .org init_level4_pgt + L4_START_KERNEL*8, 0
+ .org init_top_pgt + L4_START_KERNEL*8, 0
/* (2^48-(2*1024*1024*1024))/(2^39) = 511 */
.quad level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE

diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index 085c3b300d32..42f502b45e62 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -342,7 +342,7 @@ void machine_kexec(struct kimage *image)
void arch_crash_save_vmcoreinfo(void)
{
VMCOREINFO_NUMBER(phys_base);
- VMCOREINFO_SYMBOL(init_level4_pgt);
+ VMCOREINFO_SYMBOL(init_top_pgt);

#ifdef CONFIG_NUMA
VMCOREINFO_SYMBOL(node_data);
diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
index bce6990b1d81..0470826d2bdc 100644
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -431,7 +431,7 @@ static void ptdump_walk_pgd_level_core(struct seq_file *m, pgd_t *pgd,
bool checkwx)
{
#ifdef CONFIG_X86_64
- pgd_t *start = (pgd_t *) &init_level4_pgt;
+ pgd_t *start = (pgd_t *) &init_top_pgt;
#else
pgd_t *start = swapper_pg_dir;
#endif
diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index 0c7d8129bed6..88215ac16b24 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -12,7 +12,7 @@
#include <asm/tlbflush.h>
#include <asm/sections.h>

-extern pgd_t early_level4_pgt[PTRS_PER_PGD];
+extern pgd_t early_top_pgt[PTRS_PER_PGD];
extern struct range pfn_mapped[E820_MAX_ENTRIES];

static int __init map_range(struct range *range)
@@ -109,8 +109,8 @@ void __init kasan_early_init(void)
for (i = 0; CONFIG_PGTABLE_LEVELS >= 5 && i < PTRS_PER_P4D; i++)
kasan_zero_p4d[i] = __p4d(p4d_val);

- kasan_map_early_shadow(early_level4_pgt);
- kasan_map_early_shadow(init_level4_pgt);
+ kasan_map_early_shadow(early_top_pgt);
+ kasan_map_early_shadow(init_top_pgt);
}

void __init kasan_init(void)
@@ -121,8 +121,8 @@ void __init kasan_init(void)
register_die_notifier(&kasan_die_notifier);
#endif

- memcpy(early_level4_pgt, init_level4_pgt, sizeof(early_level4_pgt));
- load_cr3(early_level4_pgt);
+ memcpy(early_top_pgt, init_top_pgt, sizeof(early_top_pgt));
+ load_cr3(early_top_pgt);
__flush_tlb_all();

clear_pgds(KASAN_SHADOW_START, KASAN_SHADOW_END);
@@ -148,7 +148,7 @@ void __init kasan_init(void)
kasan_populate_zero_shadow(kasan_mem_to_shadow((void *)MODULES_END),
(void *)KASAN_SHADOW_END);

- load_cr3(init_level4_pgt);
+ load_cr3(init_top_pgt);
__flush_tlb_all();

/*
diff --git a/arch/x86/realmode/init.c b/arch/x86/realmode/init.c
index 5db706f14111..dc0836d5c5eb 100644
--- a/arch/x86/realmode/init.c
+++ b/arch/x86/realmode/init.c
@@ -102,7 +102,7 @@ static void __init setup_real_mode(void)

trampoline_pgd = (u64 *) __va(real_mode_header->trampoline_pgd);
trampoline_pgd[0] = trampoline_pgd_entry.pgd;
- trampoline_pgd[511] = init_level4_pgt[511].pgd;
+ trampoline_pgd[511] = init_top_pgt[511].pgd;
#endif
}

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index f226038a39ca..7c2081f78a19 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1531,8 +1531,8 @@ static void xen_write_cr3(unsigned long cr3)
* At the start of the day - when Xen launches a guest, it has already
* built pagetables for the guest. We diligently look over them
* in xen_setup_kernel_pagetable and graft as appropriate them in the
- * init_level4_pgt and its friends. Then when we are happy we load
- * the new init_level4_pgt - and continue on.
+ * init_top_pgt and its friends. Then when we are happy we load
+ * the new init_top_pgt - and continue on.
*
* The generic code starts (start_kernel) and 'init_mem_mapping' sets
* up the rest of the pagetables. When it has completed it loads the cr3.
@@ -1975,13 +1975,13 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
pt_end = pt_base + xen_start_info->nr_pt_frames;

/* Zap identity mapping */
- init_level4_pgt[0] = __pgd(0);
+ init_top_pgt[0] = __pgd(0);

if (!xen_feature(XENFEAT_auto_translated_physmap)) {
/* Pre-constructed entries are in pfn, so convert to mfn */
/* L4[272] -> level3_ident_pgt
* L4[511] -> level3_kernel_pgt */
- convert_pfn_mfn(init_level4_pgt);
+ convert_pfn_mfn(init_top_pgt);

/* L3_i[0] -> level2_ident_pgt */
convert_pfn_mfn(level3_ident_pgt);
@@ -2012,11 +2012,11 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
/* Copy the initial P->M table mappings if necessary. */
i = pgd_index(xen_start_info->mfn_list);
if (i && i < pgd_index(__START_KERNEL_map))
- init_level4_pgt[i] = ((pgd_t *)xen_start_info->pt_base)[i];
+ init_top_pgt[i] = ((pgd_t *)xen_start_info->pt_base)[i];

if (!xen_feature(XENFEAT_auto_translated_physmap)) {
/* Make pagetable pieces RO */
- set_page_prot(init_level4_pgt, PAGE_KERNEL_RO);
+ set_page_prot(init_top_pgt, PAGE_KERNEL_RO);
set_page_prot(level3_ident_pgt, PAGE_KERNEL_RO);
set_page_prot(level3_kernel_pgt, PAGE_KERNEL_RO);
set_page_prot(level3_user_vsyscall, PAGE_KERNEL_RO);
@@ -2027,7 +2027,7 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)

/* Pin down new L4 */
pin_pagetable_pfn(MMUEXT_PIN_L4_TABLE,
- PFN_DOWN(__pa_symbol(init_level4_pgt)));
+ PFN_DOWN(__pa_symbol(init_top_pgt)));

/* Unpin Xen-provided one */
pin_pagetable_pfn(MMUEXT_UNPIN_TABLE, PFN_DOWN(__pa(pgd)));
@@ -2038,10 +2038,10 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
* pgd.
*/
xen_mc_batch();
- __xen_write_cr3(true, __pa(init_level4_pgt));
+ __xen_write_cr3(true, __pa(init_top_pgt));
xen_mc_issue(PARAVIRT_LAZY_CPU);
} else
- native_write_cr3(__pa(init_level4_pgt));
+ native_write_cr3(__pa(init_top_pgt));

/* We can't that easily rip out L3 and L2, as the Xen pagetables are
* set out this way: [L4], [L1], [L2], [L3], [L1], [L1] ... for
diff --git a/arch/x86/xen/xen-pvh.S b/arch/x86/xen/xen-pvh.S
index 5e246716d58f..e1a5fbeae08d 100644
--- a/arch/x86/xen/xen-pvh.S
+++ b/arch/x86/xen/xen-pvh.S
@@ -87,7 +87,7 @@ ENTRY(pvh_start_xen)
wrmsr

/* Enable pre-constructed page tables. */
- mov $_pa(init_level4_pgt), %eax
+ mov $_pa(init_top_pgt), %eax
mov %eax, %cr3
mov $(X86_CR0_PG | X86_CR0_PE), %eax
mov %eax, %cr0
--
2.11.0

2017-04-17 10:32:32

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 3/8] x86/boot/64: Add support of additional page table level during early boot


* Kirill A. Shutemov <[email protected]> wrote:

> On Tue, Apr 11, 2017 at 07:09:07AM -0700, Andi Kleen wrote:
> > > I'll look closer (the build process is rather complicated), but my
> > > understanding is that the VDSO is a stand-alone binary and doesn't really
> > > link with the rest of the kernel, but rather is included as a blob, no?
> > >
> > > Andy, maybe you have an idea?
> >
> > There isn't any way I know of to directly link them together. The ELF
> > format wasn't designed for that. You would need to merge the blobs and then
> > use manual jump vectors, like the 16-bit startup code does. It would likely
> > be complicated and ugly.
>
> Ingo, can we proceed without converting this assembly to C?
>
> I'm committed to converting it to C later if we find a reasonable solution
> to the issue.

So one way to do it would be to build it standalone as a .o, then add it not to
the regular kernel objects' link target (as you found out, it's not possible to
link 32-bit and 64-bit objects together), but to link it in a manual fashion, as
part of vmlinux.bin.all-y in arch/x86/boot/compressed/Makefile.

But there would be other complications with this approach, such as having to
add a size field, and there might be symbol linking problems ...
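
As a very rough, untested sketch of what that first approach might look like in
arch/x86/boot/compressed/Makefile (the startup32 names below are made up purely
for illustration; only vmlinux.bin.all-y and the usual kbuild variables are
real):

# Hypothetical sketch only: build the 32-bit startup code as a standalone
# object, turn it into a flat binary, and hook it in via vmlinux.bin.all-y.
$(obj)/startup32.o: $(src)/startup32.c
	$(CC) -m32 -ffreestanding -fno-pic -c -o $@ $<

$(obj)/startup32.bin: $(obj)/startup32.o
	$(OBJCOPY) -O binary $< $@

vmlinux.bin.all-y += $(obj)/startup32.bin

Whether kbuild and the boot link would actually cope with an -m32 object fed in
this way is exactly the open question, so treat this as an illustration of the
idea rather than a recipe.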

Another, pretty hacky way would be to generate a .S from the .c, then post-process
the .S and essentially generate today's 32-bit .S from it.

Probably not worth the trouble.

Thanks,

Ingo

2017-04-18 08:59:43

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH 3/8] x86/boot/64: Add support of additional page table level during early boot

On Mon, Apr 17, 2017 at 12:32:25PM +0200, Ingo Molnar wrote:
> [...]
>
> Probably not worth the trouble.

So, do I need to do anything else to get part 4 applied?

--
Kirill A. Shutemov

2017-04-18 10:15:40

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH 3/8] x86/boot/64: Add support of additional page table level during early boot

On Tue, Apr 18, 2017 at 11:59:26AM +0300, Kirill A. Shutemov wrote:
> [...]
>
> So, do I need to do anything else to get part 4 applied?

Doh!

I've just realized we don't really need to enable 5-level paging in the
decompression code. Leaving 4-level paging there works perfectly fine.

I'll drop the changes to arch/x86/boot/compressed/head_64.S and resubmit the
patchset.

--
Kirill A. Shutemov

2017-04-18 11:10:36

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH 3/8] x86/boot/64: Add support of additional page table level during early boot

On Tue, Apr 18, 2017 at 01:15:34PM +0300, Kirill A. Shutemov wrote:
> [...]
>
> Doh!
>
> I've just realized we don't really need to enable 5-level paging in the
> decompression code. Leaving 4-level paging there works perfectly fine.
>
> I'll drop the changes to arch/x86/boot/compressed/head_64.S and resubmit the
> patchset.

No, scratch that: dropping those changes would break KASLR. The decompression
code has to use 5-level paging to keep KASLR working.

So, v4 of part 4 is up to date.

Sorry for the noise.

--
Kirill A. Shutemov