2017-03-06 14:07:07

by Kirill A. Shutemov

Subject: [PATCHv4 00/33] 5-level paging

Here is v4 of the 5-level paging patchset. Please review and consider applying.

== Overview ==

x86-64 is currently limited to 256 TiB of virtual address space and 64 TiB
of physical address space. We are already bumping into this limit: some
vendors offer servers with 64 TiB of memory today.

To overcome the limitation, upcoming hardware will introduce support for
5-level paging[1]. It is a straightforward extension of the current page
table structure, adding one more layer of translation.

It bumps the limits to 128 PiB of virtual address space and 4 PiB of
physical address space. This "ought to be enough for anybody" ©.
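
The limits quoted above follow from the address widths. A minimal sketch of
the arithmetic, assuming 48 virtual / 46 physical bits for 4-level paging
and 57 virtual / 52 physical bits for 5-level paging (the helper names are
made up for illustration):

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Bytes addressable with the given number of address bits (bits < 64). */
static uint64_t addr_space_bytes(unsigned int bits)
{
	assert(bits < 64);
	return UINT64_C(1) << bits;
}

/* Print the virtual and physical limits of one paging mode, in TiB. */
static void print_limits(const char *mode, unsigned int va_bits,
			 unsigned int pa_bits)
{
	printf("%s: %llu TiB virtual, %llu TiB physical\n", mode,
	       (unsigned long long)(addr_space_bytes(va_bits) >> 40),
	       (unsigned long long)(addr_space_bytes(pa_bits) >> 40));
}
```

Calling print_limits("4-level", 48, 46) and print_limits("5-level", 57, 52)
reports 256/64 TiB and 131072/4096 TiB, i.e. 128 PiB and 4 PiB.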

== Patches ==

The patchset is built on top of v4.11-rc1.

Current QEMU upstream git supports 5-level paging. Use "-cpu qemu64,+la57"
to enable it.

Patch 1:
Detect la57 feature for /proc/cpuinfo.

Patches 2-7:
Bring 5-level paging to generic code and convert all
architectures to it using <asm-generic/5level-fixup.h>.

Patches 8-19:
Convert x86 to properly folded p4d layer using
<asm-generic/pgtable-nop4d.h>.

Patches 20-32:
Enable real 5-level paging.

CONFIG_X86_5LEVEL=y will enable the new paging mode.

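Patch 33:
Allow userspace mappings above 47 bits. Since v4 the opt-in is an mmap()
hint address rather than the prctl(2) handles (PR_SET_MAX_VADDR and
PR_GET_MAX_VADDR) proposed in v3.

This aims to address the compatibility issue. Only x86 is supported for
now, but it should be useful for other architectures.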

Git:
git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git la57/v4

== TODO ==

There is still work to do:

- CONFIG_XEN is broken for 5-level paging.

Xen for 5-level paging requires more work to get functional.

Xen on 4-level paging works, so it's not a regression.

- Boot-time switch between 4- and 5-level paging.

We assume that distributions will be keen to avoid returning to the
i386 days where we shipped one kernel binary for each page table
layout.

As the page table format is the same for 4- and 5-level paging, it should
be possible to have a single kernel binary and switch between them at
boot-time without too much hassle.

For now I have only implemented a compile-time switch.

This will be implemented in a separate patchset.

== Changelog ==

v4:
- Rebased to v4.11-rc1;
- Use mmap() hint address to allocate virtual address space above
47-bits instead of prctl() handles.
v3:
- Rebased to v4.10-rc5;
- prctl() handles for large address space opt-in;
- Xen works for 4-level paging;
- EFI boot fixed for both 4- and 5-level paging;
- Hibernation fixed for 4-level paging;
- kexec() fixed;
- Couple of build fixes;
v2:
- Rebased to v4.10-rc1;
- RLIMIT_VADDR proposal;
- Fix virtual map and update documentation;
- Fix a few build errors;
- Rework cpuid helpers in boot code;
- Fix espfix code to work with 5-level pages;
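
From userspace, the v4 approach of opting in through the mmap() hint address
might look roughly like the sketch below. This is an illustration only: the
helper names and the choice of 1 << 47 as the hint are assumptions, and the
exact opt-in rule is defined by patch 33, not by this snippet.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Assumed boundary: the first address above the 47-bit user range. */
#define HINT_ABOVE_47BIT ((void *)(uintptr_t)(UINT64_C(1) << 47))

/*
 * Hypothetical helper: request an anonymous mapping and pass a hint above
 * 47 bits. The hint is advisory (no MAP_FIXED), so a kernel without
 * 5-level paging is free to place the mapping lower.
 */
static void *map_with_high_hint(size_t len)
{
	assert(len > 0);
	return mmap(HINT_ABOVE_47BIT, len, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}

/* Report whether a returned mapping actually landed above 47 bits. */
static int is_above_47bit(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)(UINT64_C(1) << 47);
}
```

An application that never passes such a hint keeps getting addresses below
47 bits, which is the point of the compatibility scheme.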

[1] https://software.intel.com/sites/default/files/managed/2b/80/5-level_paging_white_paper.pdf

Kirill A. Shutemov (33):
x86/cpufeature: Add 5-level paging detection
asm-generic: introduce 5level-fixup.h
asm-generic: introduce __ARCH_USE_5LEVEL_HACK
arch, mm: convert all architectures to use 5level-fixup.h
asm-generic: introduce <asm-generic/pgtable-nop4d.h>
mm: convert generic code to 5-level paging
mm: introduce __p4d_alloc()
x86: basic changes into headers for 5-level paging
x86: trivial portion of 5-level paging conversion
x86/gup: add 5-level paging support
x86/ident_map: add 5-level paging support
x86/mm: add support of p4d_t in vmalloc_fault()
x86/power: support p4d_t in hibernate code
x86/kexec: support p4d_t
x86/efi: handle p4d in EFI pagetables
x86/mm/pat: handle additional page table
x86/kasan: prepare clear_pgds() to switch to
<asm-generic/pgtable-nop4d.h>
x86/xen: convert __xen_pgd_walk() and xen_cleanmfnmap() to support p4d
x86: convert the rest of the code to support p4d_t
x86: detect 5-level paging support
x86/asm: remove __VIRTUAL_MASK_SHIFT==47 assert
x86/mm: define virtual memory map for 5-level paging
x86/paravirt: make paravirt code support 5-level paging
x86/mm: basic defines/helpers for CONFIG_X86_5LEVEL
x86/dump_pagetables: support 5-level paging
x86/kasan: extend to support 5-level paging
x86/espfix: support 5-level paging
x86/mm: add support of additional page table level during early boot
x86/mm: add sync_global_pgds() for configuration with 5-level paging
x86/mm: make kernel_physical_mapping_init() support 5-level paging
x86/mm: add support for 5-level paging for KASLR
x86: enable 5-level paging support
x86/mm: allow to have userspace mappings above 47-bits

Documentation/x86/x86_64/mm.txt | 33 +-
arch/arc/include/asm/hugepage.h | 1 +
arch/arc/include/asm/pgtable.h | 1 +
arch/arm/include/asm/pgtable.h | 1 +
arch/arm64/include/asm/pgtable-types.h | 4 +
arch/avr32/include/asm/pgtable-2level.h | 1 +
arch/cris/include/asm/pgtable.h | 1 +
arch/frv/include/asm/pgtable.h | 1 +
arch/h8300/include/asm/pgtable.h | 1 +
arch/hexagon/include/asm/pgtable.h | 1 +
arch/ia64/include/asm/pgtable.h | 2 +
arch/metag/include/asm/pgtable.h | 1 +
arch/mips/include/asm/pgtable-32.h | 1 +
arch/mips/include/asm/pgtable-64.h | 1 +
arch/mn10300/include/asm/page.h | 1 +
arch/nios2/include/asm/pgtable.h | 1 +
arch/openrisc/include/asm/pgtable.h | 1 +
arch/powerpc/include/asm/book3s/32/pgtable.h | 1 +
arch/powerpc/include/asm/book3s/64/pgtable.h | 3 +
arch/powerpc/include/asm/nohash/32/pgtable.h | 1 +
arch/powerpc/include/asm/nohash/64/pgtable-4k.h | 3 +
arch/powerpc/include/asm/nohash/64/pgtable-64k.h | 1 +
arch/s390/include/asm/pgtable.h | 1 +
arch/score/include/asm/pgtable.h | 1 +
arch/sh/include/asm/pgtable-2level.h | 1 +
arch/sh/include/asm/pgtable-3level.h | 1 +
arch/sparc/include/asm/pgtable_64.h | 1 +
arch/tile/include/asm/pgtable_32.h | 1 +
arch/tile/include/asm/pgtable_64.h | 1 +
arch/um/include/asm/pgtable-2level.h | 1 +
arch/um/include/asm/pgtable-3level.h | 1 +
arch/unicore32/include/asm/pgtable.h | 1 +
arch/x86/Kconfig | 6 +
arch/x86/boot/compressed/head_64.S | 23 +-
arch/x86/boot/cpucheck.c | 9 +
arch/x86/boot/cpuflags.c | 12 +-
arch/x86/entry/entry_64.S | 7 +-
arch/x86/include/asm/cpufeatures.h | 3 +-
arch/x86/include/asm/disabled-features.h | 8 +-
arch/x86/include/asm/elf.h | 2 +-
arch/x86/include/asm/kasan.h | 9 +-
arch/x86/include/asm/kexec.h | 1 +
arch/x86/include/asm/mpx.h | 9 +
arch/x86/include/asm/page_64_types.h | 10 +
arch/x86/include/asm/paravirt.h | 65 +++-
arch/x86/include/asm/paravirt_types.h | 17 +-
arch/x86/include/asm/pgalloc.h | 37 +-
arch/x86/include/asm/pgtable-2level_types.h | 1 +
arch/x86/include/asm/pgtable-3level_types.h | 1 +
arch/x86/include/asm/pgtable.h | 85 ++++-
arch/x86/include/asm/pgtable_64.h | 29 +-
arch/x86/include/asm/pgtable_64_types.h | 27 ++
arch/x86/include/asm/pgtable_types.h | 42 ++-
arch/x86/include/asm/processor.h | 9 +-
arch/x86/include/asm/required-features.h | 8 +-
arch/x86/include/asm/sparsemem.h | 9 +-
arch/x86/include/asm/xen/page.h | 8 +-
arch/x86/include/uapi/asm/processor-flags.h | 2 +
arch/x86/kernel/espfix_64.c | 12 +-
arch/x86/kernel/head64.c | 40 ++-
arch/x86/kernel/head_64.S | 63 +++-
arch/x86/kernel/machine_kexec_32.c | 4 +-
arch/x86/kernel/machine_kexec_64.c | 16 +-
arch/x86/kernel/paravirt.c | 13 +-
arch/x86/kernel/sys_x86_64.c | 28 +-
arch/x86/kernel/tboot.c | 6 +-
arch/x86/kernel/vm86_32.c | 6 +-
arch/x86/mm/dump_pagetables.c | 51 ++-
arch/x86/mm/fault.c | 57 ++-
arch/x86/mm/gup.c | 33 +-
arch/x86/mm/hugetlbpage.c | 31 +-
arch/x86/mm/ident_map.c | 47 ++-
arch/x86/mm/init_32.c | 22 +-
arch/x86/mm/init_64.c | 269 ++++++++++++--
arch/x86/mm/ioremap.c | 3 +-
arch/x86/mm/kasan_init_64.c | 41 ++-
arch/x86/mm/kaslr.c | 82 ++++-
arch/x86/mm/mmap.c | 4 +-
arch/x86/mm/mpx.c | 33 +-
arch/x86/mm/pageattr.c | 56 ++-
arch/x86/mm/pgtable.c | 38 +-
arch/x86/mm/pgtable_32.c | 8 +-
arch/x86/platform/efi/efi_64.c | 38 +-
arch/x86/power/hibernate_32.c | 7 +-
arch/x86/power/hibernate_64.c | 49 ++-
arch/x86/realmode/init.c | 2 +-
arch/x86/xen/Kconfig | 1 +
arch/x86/xen/mmu.c | 433 ++++++++++++++---------
arch/x86/xen/mmu.h | 1 +
arch/xtensa/include/asm/pgtable.h | 1 +
drivers/misc/sgi-gru/grufault.c | 9 +-
fs/userfaultfd.c | 6 +-
include/asm-generic/4level-fixup.h | 3 +-
include/asm-generic/5level-fixup.h | 41 +++
include/asm-generic/pgtable-nop4d-hack.h | 62 ++++
include/asm-generic/pgtable-nop4d.h | 56 +++
include/asm-generic/pgtable-nopud.h | 48 +--
include/asm-generic/pgtable.h | 48 ++-
include/asm-generic/tlb.h | 14 +-
include/linux/hugetlb.h | 5 +-
include/linux/kasan.h | 1 +
include/linux/mm.h | 34 +-
include/trace/events/xen.h | 28 +-
lib/ioremap.c | 39 +-
mm/gup.c | 46 ++-
mm/huge_memory.c | 7 +-
mm/hugetlb.c | 29 +-
mm/kasan/kasan_init.c | 35 +-
mm/memory.c | 230 ++++++++++--
mm/mlock.c | 1 +
mm/mprotect.c | 26 +-
mm/mremap.c | 13 +-
mm/page_vma_mapped.c | 6 +-
mm/pagewalk.c | 32 +-
mm/pgtable-generic.c | 6 +
mm/rmap.c | 7 +-
mm/sparse-vmemmap.c | 22 +-
mm/swapfile.c | 26 +-
mm/userfaultfd.c | 23 +-
mm/vmalloc.c | 81 +++--
120 files changed, 2413 insertions(+), 577 deletions(-)
create mode 100644 include/asm-generic/5level-fixup.h
create mode 100644 include/asm-generic/pgtable-nop4d-hack.h
create mode 100644 include/asm-generic/pgtable-nop4d.h

--
2.11.0


2017-03-06 13:57:53

by Kirill A. Shutemov

Subject: [PATCHv4 17/33] x86/kasan: prepare clear_pgds() to switch to <asm-generic/pgtable-nop4d.h>

With folded p4d, pgd_clear() is a nop. Change clear_pgds() to use
p4d_clear() instead.
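
Why the pgd-level clear does nothing once p4d is folded can be seen in a toy
userspace model. The toy_* names and types below are invented for
illustration and only loosely follow the folding headers; they are not the
kernel definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Toy types; the real kernel definitions differ. */
typedef struct { uint64_t pgd; } pgd_t;
typedef struct { pgd_t pgd; } p4d_t;	/* folded: a p4d wraps the pgd entry */

/* With a folded p4d, the pgd level has nothing of its own to clear. */
static void toy_pgd_clear(pgd_t *pgd)
{
	(void)pgd;
}

/* The folded offset helper hands back the pgd slot itself. */
static p4d_t *toy_p4d_offset(pgd_t *pgd, unsigned long addr)
{
	(void)addr;
	return (p4d_t *)pgd;
}

/* Clearing through the p4d view is what actually zeroes the entry. */
static void toy_p4d_clear(p4d_t *p4d)
{
	assert(p4d != 0);
	p4d->pgd.pgd = 0;
}
```

In this model toy_pgd_clear() leaves the entry untouched, while
toy_p4d_clear(toy_p4d_offset(pgd, addr)) zeroes it.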

Signed-off-by: Kirill A. Shutemov <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
---
arch/x86/mm/kasan_init_64.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index 8d63d7a104c3..733f8ba6a01f 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -32,8 +32,15 @@ static int __init map_range(struct range *range)
static void __init clear_pgds(unsigned long start,
unsigned long end)
{
- for (; start < end; start += PGDIR_SIZE)
- pgd_clear(pgd_offset_k(start));
+ pgd_t *pgd;
+
+ for (; start < end; start += PGDIR_SIZE) {
+ pgd = pgd_offset_k(start);
+ if (CONFIG_PGTABLE_LEVELS < 5)
+ p4d_clear(p4d_offset(pgd, start));
+ else
+ pgd_clear(pgd);
+ }
}

static void __init kasan_map_early_shadow(pgd_t *pgd)
--
2.11.0

2017-03-06 13:58:04

by Kirill A. Shutemov

Subject: [PATCHv4 29/33] x86/mm: add sync_global_pgds() for configuration with 5-level paging

This basically restores a slightly modified version of the original
sync_global_pgds(), which we had before folded p4d was introduced.

The only modification is protection against 'address' overflow.
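
The overflow in question: when the range ends at the top of the address
space, 'address += PGDIR_SIZE' wraps to a small value that still satisfies
'address <= end', so the loop would never finish. A standalone sketch of the
guarded loop, assuming a 1 << 48 step (one top-level entry with 5-level
paging) and a made-up function name:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed step: the span of one top-level entry with 5-level paging. */
#define TOY_PGDIR_SIZE (UINT64_C(1) << 48)

/* Count loop iterations using the same bounds check as the patch. */
static unsigned int count_pgd_steps(uint64_t start, uint64_t end)
{
	unsigned int steps = 0;
	uint64_t address;

	assert(start <= end);

	/* "address >= start" ends the loop once address wraps past zero. */
	for (address = start; address <= end && address >= start;
	     address += TOY_PGDIR_SIZE)
		steps++;

	return steps;
}
```

With start = 0xfffe000000000000 and end = UINT64_MAX this visits two entries
and then stops, because the third increment wraps to 0.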

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/mm/init_64.c | 37 +++++++++++++++++++++++++++++++++++++
1 file changed, 37 insertions(+)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 7bdda6f1d135..5ba99090dc3c 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -92,6 +92,42 @@ __setup("noexec32=", nonx32_setup);
* When memory was added make sure all the processes MM have
* suitable PGD entries in the local PGD level page.
*/
+#ifdef CONFIG_X86_5LEVEL
+void sync_global_pgds(unsigned long start, unsigned long end)
+{
+ unsigned long address;
+
+ for (address = start; address <= end && address >= start;
+ address += PGDIR_SIZE) {
+ const pgd_t *pgd_ref = pgd_offset_k(address);
+ struct page *page;
+
+ if (pgd_none(*pgd_ref))
+ continue;
+
+ spin_lock(&pgd_lock);
+ list_for_each_entry(page, &pgd_list, lru) {
+ pgd_t *pgd;
+ spinlock_t *pgt_lock;
+
+ pgd = (pgd_t *)page_address(page) + pgd_index(address);
+ /* the pgt_lock only for Xen */
+ pgt_lock = &pgd_page_get_mm(page)->page_table_lock;
+ spin_lock(pgt_lock);
+
+ if (!pgd_none(*pgd_ref) && !pgd_none(*pgd))
+ BUG_ON(pgd_page_vaddr(*pgd)
+ != pgd_page_vaddr(*pgd_ref));
+
+ if (pgd_none(*pgd))
+ set_pgd(pgd, *pgd_ref);
+
+ spin_unlock(pgt_lock);
+ }
+ spin_unlock(&pgd_lock);
+ }
+}
+#else
void sync_global_pgds(unsigned long start, unsigned long end)
{
unsigned long address;
@@ -135,6 +171,7 @@ void sync_global_pgds(unsigned long start, unsigned long end)
spin_unlock(&pgd_lock);
}
}
+#endif

/*
* NOTE: This function is marked __ref because it calls __init function
--
2.11.0

2017-03-06 13:58:07

by Kirill A. Shutemov

Subject: [PATCHv4 25/33] x86/dump_pagetables: support 5-level paging

Simple extension to support one more page table level.
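
For orientation, the *_LEVEL_MULT values express how many bytes one entry
covers at each level. A small sketch of that arithmetic, assuming 4 KiB
pages, 512 entries per table and all five levels present (the toy_* names
are not from the patch):

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SIZE      UINT64_C(4096)
#define TOY_PTRS_PER_TABLE UINT64_C(512)

/* Page table levels, lowest first. */
enum toy_level { TOY_PTE, TOY_PMD, TOY_PUD, TOY_P4D, TOY_PGD };

/* Bytes covered by a single entry at the given level. */
static uint64_t toy_level_mult(enum toy_level level)
{
	uint64_t span = TOY_PAGE_SIZE;
	int i;

	assert(level <= TOY_PGD);

	/* Each step up multiplies the span by the entries per table. */
	for (i = 0; i < (int)level; i++)
		span *= TOY_PTRS_PER_TABLE;

	return span;
}
```

This gives 4 KiB, 2 MiB, 1 GiB, 512 GiB and 256 TiB for the pte, pmd, pud,
p4d and pgd levels respectively.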

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/mm/dump_pagetables.c | 49 ++++++++++++++++++++++++++++++++++++-------
1 file changed, 42 insertions(+), 7 deletions(-)

diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
index 58b5bee7ea27..0effac6989cd 100644
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -110,7 +110,8 @@ static struct addr_marker address_markers[] = {
#define PTE_LEVEL_MULT (PAGE_SIZE)
#define PMD_LEVEL_MULT (PTRS_PER_PTE * PTE_LEVEL_MULT)
#define PUD_LEVEL_MULT (PTRS_PER_PMD * PMD_LEVEL_MULT)
-#define PGD_LEVEL_MULT (PTRS_PER_PUD * PUD_LEVEL_MULT)
+#define P4D_LEVEL_MULT (PTRS_PER_PUD * PUD_LEVEL_MULT)
+#define PGD_LEVEL_MULT (PTRS_PER_P4D * P4D_LEVEL_MULT)

#define pt_dump_seq_printf(m, to_dmesg, fmt, args...) \
({ \
@@ -347,7 +348,7 @@ static bool pud_already_checked(pud_t *prev_pud, pud_t *pud, bool checkwx)
return checkwx && prev_pud && (pud_val(*prev_pud) == pud_val(*pud));
}

-static void walk_pud_level(struct seq_file *m, struct pg_state *st, pgd_t addr,
+static void walk_pud_level(struct seq_file *m, struct pg_state *st, p4d_t addr,
unsigned long P)
{
int i;
@@ -355,7 +356,7 @@ static void walk_pud_level(struct seq_file *m, struct pg_state *st, pgd_t addr,
pgprotval_t prot;
pud_t *prev_pud = NULL;

- start = (pud_t *) pgd_page_vaddr(addr);
+ start = (pud_t *) p4d_page_vaddr(addr);

for (i = 0; i < PTRS_PER_PUD; i++) {
st->current_address = normalize_addr(P + i * PUD_LEVEL_MULT);
@@ -377,9 +378,43 @@ static void walk_pud_level(struct seq_file *m, struct pg_state *st, pgd_t addr,
}

#else
-#define walk_pud_level(m,s,a,p) walk_pmd_level(m,s,__pud(pgd_val(a)),p)
-#define pgd_large(a) pud_large(__pud(pgd_val(a)))
-#define pgd_none(a) pud_none(__pud(pgd_val(a)))
+#define walk_pud_level(m,s,a,p) walk_pmd_level(m,s,__pud(p4d_val(a)),p)
+#define p4d_large(a) pud_large(__pud(p4d_val(a)))
+#define p4d_none(a) pud_none(__pud(p4d_val(a)))
+#endif
+
+#if PTRS_PER_P4D > 1
+
+static void walk_p4d_level(struct seq_file *m, struct pg_state *st, pgd_t addr,
+ unsigned long P)
+{
+ int i;
+ p4d_t *start;
+ pgprotval_t prot;
+
+ start = (p4d_t *) pgd_page_vaddr(addr);
+
+ for (i = 0; i < PTRS_PER_P4D; i++) {
+ st->current_address = normalize_addr(P + i * P4D_LEVEL_MULT);
+ if (!p4d_none(*start)) {
+ if (p4d_large(*start) || !p4d_present(*start)) {
+ prot = p4d_flags(*start);
+ note_page(m, st, __pgprot(prot), 2);
+ } else {
+ walk_pud_level(m, st, *start,
+ P + i * P4D_LEVEL_MULT);
+ }
+ } else
+ note_page(m, st, __pgprot(0), 2);
+
+ start++;
+ }
+}
+
+#else
+#define walk_p4d_level(m,s,a,p) walk_pud_level(m,s,__p4d(pgd_val(a)),p)
+#define pgd_large(a) p4d_large(__p4d(pgd_val(a)))
+#define pgd_none(a) p4d_none(__p4d(pgd_val(a)))
#endif

static inline bool is_hypervisor_range(int idx)
@@ -424,7 +459,7 @@ static void ptdump_walk_pgd_level_core(struct seq_file *m, pgd_t *pgd,
prot = pgd_flags(*start);
note_page(m, &st, __pgprot(prot), 1);
} else {
- walk_pud_level(m, &st, *start,
+ walk_p4d_level(m, &st, *start,
i * PGD_LEVEL_MULT);
}
} else
--
2.11.0

2017-03-06 13:58:28

by Kirill A. Shutemov

Subject: [PATCHv4 04/33] arch, mm: convert all architectures to use 5level-fixup.h

If an architecture uses 4level-fixup.h we don't need to do anything as
it includes 5level-fixup.h.

If an architecture uses pgtable-nop*d.h, define __ARCH_USE_5LEVEL_HACK
before inclusion of the header. This makes the asm-generic code use
5level-fixup.h.

If an architecture has 4-level paging or folds levels on its own,
include 5level-fixup.h directly.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/arc/include/asm/hugepage.h | 1 +
arch/arc/include/asm/pgtable.h | 1 +
arch/arm/include/asm/pgtable.h | 1 +
arch/arm64/include/asm/pgtable-types.h | 4 ++++
arch/avr32/include/asm/pgtable-2level.h | 1 +
arch/cris/include/asm/pgtable.h | 1 +
arch/frv/include/asm/pgtable.h | 1 +
arch/h8300/include/asm/pgtable.h | 1 +
arch/hexagon/include/asm/pgtable.h | 1 +
arch/ia64/include/asm/pgtable.h | 2 ++
arch/metag/include/asm/pgtable.h | 1 +
arch/mips/include/asm/pgtable-32.h | 1 +
arch/mips/include/asm/pgtable-64.h | 1 +
arch/mn10300/include/asm/page.h | 1 +
arch/nios2/include/asm/pgtable.h | 1 +
arch/openrisc/include/asm/pgtable.h | 1 +
arch/powerpc/include/asm/book3s/32/pgtable.h | 1 +
arch/powerpc/include/asm/book3s/64/pgtable.h | 3 +++
arch/powerpc/include/asm/nohash/32/pgtable.h | 1 +
arch/powerpc/include/asm/nohash/64/pgtable-4k.h | 3 +++
arch/powerpc/include/asm/nohash/64/pgtable-64k.h | 1 +
arch/s390/include/asm/pgtable.h | 1 +
arch/score/include/asm/pgtable.h | 1 +
arch/sh/include/asm/pgtable-2level.h | 1 +
arch/sh/include/asm/pgtable-3level.h | 1 +
arch/sparc/include/asm/pgtable_64.h | 1 +
arch/tile/include/asm/pgtable_32.h | 1 +
arch/tile/include/asm/pgtable_64.h | 1 +
arch/um/include/asm/pgtable-2level.h | 1 +
arch/um/include/asm/pgtable-3level.h | 1 +
arch/unicore32/include/asm/pgtable.h | 1 +
arch/x86/include/asm/pgtable_types.h | 4 ++++
arch/xtensa/include/asm/pgtable.h | 1 +
33 files changed, 44 insertions(+)

diff --git a/arch/arc/include/asm/hugepage.h b/arch/arc/include/asm/hugepage.h
index 317ff773e1ca..b18fcb606908 100644
--- a/arch/arc/include/asm/hugepage.h
+++ b/arch/arc/include/asm/hugepage.h
@@ -11,6 +11,7 @@
#define _ASM_ARC_HUGEPAGE_H

#include <linux/types.h>
+#define __ARCH_USE_5LEVEL_HACK
#include <asm-generic/pgtable-nopmd.h>

static inline pte_t pmd_pte(pmd_t pmd)
diff --git a/arch/arc/include/asm/pgtable.h b/arch/arc/include/asm/pgtable.h
index e94ca72b974e..ee22d40afef4 100644
--- a/arch/arc/include/asm/pgtable.h
+++ b/arch/arc/include/asm/pgtable.h
@@ -37,6 +37,7 @@

#include <asm/page.h>
#include <asm/mmu.h>
+#define __ARCH_USE_5LEVEL_HACK
#include <asm-generic/pgtable-nopmd.h>
#include <linux/const.h>

diff --git a/arch/arm/include/asm/pgtable.h b/arch/arm/include/asm/pgtable.h
index a8d656d9aec7..1c462381c225 100644
--- a/arch/arm/include/asm/pgtable.h
+++ b/arch/arm/include/asm/pgtable.h
@@ -20,6 +20,7 @@

#else

+#define __ARCH_USE_5LEVEL_HACK
#include <asm-generic/pgtable-nopud.h>
#include <asm/memory.h>
#include <asm/pgtable-hwdef.h>
diff --git a/arch/arm64/include/asm/pgtable-types.h b/arch/arm64/include/asm/pgtable-types.h
index 69b2fd41503c..345a072b5856 100644
--- a/arch/arm64/include/asm/pgtable-types.h
+++ b/arch/arm64/include/asm/pgtable-types.h
@@ -55,9 +55,13 @@ typedef struct { pteval_t pgprot; } pgprot_t;
#define __pgprot(x) ((pgprot_t) { (x) } )

#if CONFIG_PGTABLE_LEVELS == 2
+#define __ARCH_USE_5LEVEL_HACK
#include <asm-generic/pgtable-nopmd.h>
#elif CONFIG_PGTABLE_LEVELS == 3
+#define __ARCH_USE_5LEVEL_HACK
#include <asm-generic/pgtable-nopud.h>
+#elif CONFIG_PGTABLE_LEVELS == 4
+#include <asm-generic/5level-fixup.h>
#endif

#endif /* __ASM_PGTABLE_TYPES_H */
diff --git a/arch/avr32/include/asm/pgtable-2level.h b/arch/avr32/include/asm/pgtable-2level.h
index 425dd567b5b9..d5b1c63993ec 100644
--- a/arch/avr32/include/asm/pgtable-2level.h
+++ b/arch/avr32/include/asm/pgtable-2level.h
@@ -8,6 +8,7 @@
#ifndef __ASM_AVR32_PGTABLE_2LEVEL_H
#define __ASM_AVR32_PGTABLE_2LEVEL_H

+#define __ARCH_USE_5LEVEL_HACK
#include <asm-generic/pgtable-nopmd.h>

/*
diff --git a/arch/cris/include/asm/pgtable.h b/arch/cris/include/asm/pgtable.h
index 2a3210ba4c72..fa3a73004cc5 100644
--- a/arch/cris/include/asm/pgtable.h
+++ b/arch/cris/include/asm/pgtable.h
@@ -6,6 +6,7 @@
#define _CRIS_PGTABLE_H

#include <asm/page.h>
+#define __ARCH_USE_5LEVEL_HACK
#include <asm-generic/pgtable-nopmd.h>

#ifndef __ASSEMBLY__
diff --git a/arch/frv/include/asm/pgtable.h b/arch/frv/include/asm/pgtable.h
index a0513d463a1f..ab6e7e961b54 100644
--- a/arch/frv/include/asm/pgtable.h
+++ b/arch/frv/include/asm/pgtable.h
@@ -16,6 +16,7 @@
#ifndef _ASM_PGTABLE_H
#define _ASM_PGTABLE_H

+#include <asm-generic/5level-fixup.h>
#include <asm/mem-layout.h>
#include <asm/setup.h>
#include <asm/processor.h>
diff --git a/arch/h8300/include/asm/pgtable.h b/arch/h8300/include/asm/pgtable.h
index 8341db67821d..7d265d28ba5e 100644
--- a/arch/h8300/include/asm/pgtable.h
+++ b/arch/h8300/include/asm/pgtable.h
@@ -1,5 +1,6 @@
#ifndef _H8300_PGTABLE_H
#define _H8300_PGTABLE_H
+#define __ARCH_USE_5LEVEL_HACK
#include <asm-generic/pgtable-nopud.h>
#include <asm-generic/pgtable.h>
#define pgtable_cache_init() do { } while (0)
diff --git a/arch/hexagon/include/asm/pgtable.h b/arch/hexagon/include/asm/pgtable.h
index 49eab8136ec3..24a9177fb897 100644
--- a/arch/hexagon/include/asm/pgtable.h
+++ b/arch/hexagon/include/asm/pgtable.h
@@ -26,6 +26,7 @@
*/
#include <linux/swap.h>
#include <asm/page.h>
+#define __ARCH_USE_5LEVEL_HACK
#include <asm-generic/pgtable-nopmd.h>

/* A handy thing to have if one has the RAM. Declared in head.S */
diff --git a/arch/ia64/include/asm/pgtable.h b/arch/ia64/include/asm/pgtable.h
index 384794e665fc..6cc22c8d8923 100644
--- a/arch/ia64/include/asm/pgtable.h
+++ b/arch/ia64/include/asm/pgtable.h
@@ -587,8 +587,10 @@ extern struct page *zero_page_memmap_ptr;


#if CONFIG_PGTABLE_LEVELS == 3
+#define __ARCH_USE_5LEVEL_HACK
#include <asm-generic/pgtable-nopud.h>
#endif
+#include <asm-generic/5level-fixup.h>
#include <asm-generic/pgtable.h>

#endif /* _ASM_IA64_PGTABLE_H */
diff --git a/arch/metag/include/asm/pgtable.h b/arch/metag/include/asm/pgtable.h
index ffa3a3a2ecad..0c151e5af079 100644
--- a/arch/metag/include/asm/pgtable.h
+++ b/arch/metag/include/asm/pgtable.h
@@ -6,6 +6,7 @@
#define _METAG_PGTABLE_H

#include <asm/pgtable-bits.h>
+#define __ARCH_USE_5LEVEL_HACK
#include <asm-generic/pgtable-nopmd.h>

/* Invalid regions on Meta: 0x00000000-0x001FFFFF and 0xFFFF0000-0xFFFFFFFF */
diff --git a/arch/mips/include/asm/pgtable-32.h b/arch/mips/include/asm/pgtable-32.h
index d21f3da7bdb6..6f94bed571c4 100644
--- a/arch/mips/include/asm/pgtable-32.h
+++ b/arch/mips/include/asm/pgtable-32.h
@@ -16,6 +16,7 @@
#include <asm/cachectl.h>
#include <asm/fixmap.h>

+#define __ARCH_USE_5LEVEL_HACK
#include <asm-generic/pgtable-nopmd.h>

extern int temp_tlb_entry;
diff --git a/arch/mips/include/asm/pgtable-64.h b/arch/mips/include/asm/pgtable-64.h
index 514cbc0a6a67..130a2a6c1531 100644
--- a/arch/mips/include/asm/pgtable-64.h
+++ b/arch/mips/include/asm/pgtable-64.h
@@ -17,6 +17,7 @@
#include <asm/cachectl.h>
#include <asm/fixmap.h>

+#define __ARCH_USE_5LEVEL_HACK
#if defined(CONFIG_PAGE_SIZE_64KB) && !defined(CONFIG_MIPS_VA_BITS_48)
#include <asm-generic/pgtable-nopmd.h>
#else
diff --git a/arch/mn10300/include/asm/page.h b/arch/mn10300/include/asm/page.h
index 3810a6f740fd..dfe730a5ede0 100644
--- a/arch/mn10300/include/asm/page.h
+++ b/arch/mn10300/include/asm/page.h
@@ -57,6 +57,7 @@ typedef struct page *pgtable_t;
#define __pgd(x) ((pgd_t) { (x) })
#define __pgprot(x) ((pgprot_t) { (x) })

+#define __ARCH_USE_5LEVEL_HACK
#include <asm-generic/pgtable-nopmd.h>

#endif /* !__ASSEMBLY__ */
diff --git a/arch/nios2/include/asm/pgtable.h b/arch/nios2/include/asm/pgtable.h
index 298393c3cb42..db4f7d179220 100644
--- a/arch/nios2/include/asm/pgtable.h
+++ b/arch/nios2/include/asm/pgtable.h
@@ -22,6 +22,7 @@
#include <asm/tlbflush.h>

#include <asm/pgtable-bits.h>
+#define __ARCH_USE_5LEVEL_HACK
#include <asm-generic/pgtable-nopmd.h>

#define FIRST_USER_ADDRESS 0UL
diff --git a/arch/openrisc/include/asm/pgtable.h b/arch/openrisc/include/asm/pgtable.h
index 3567aa7be555..ff97374ca069 100644
--- a/arch/openrisc/include/asm/pgtable.h
+++ b/arch/openrisc/include/asm/pgtable.h
@@ -25,6 +25,7 @@
#ifndef __ASM_OPENRISC_PGTABLE_H
#define __ASM_OPENRISC_PGTABLE_H

+#define __ARCH_USE_5LEVEL_HACK
#include <asm-generic/pgtable-nopmd.h>

#ifndef __ASSEMBLY__
diff --git a/arch/powerpc/include/asm/book3s/32/pgtable.h b/arch/powerpc/include/asm/book3s/32/pgtable.h
index 012223638815..26ed228d4dc6 100644
--- a/arch/powerpc/include/asm/book3s/32/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/32/pgtable.h
@@ -1,6 +1,7 @@
#ifndef _ASM_POWERPC_BOOK3S_32_PGTABLE_H
#define _ASM_POWERPC_BOOK3S_32_PGTABLE_H

+#define __ARCH_USE_5LEVEL_HACK
#include <asm-generic/pgtable-nopmd.h>

#include <asm/book3s/32/hash.h>
diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 1eeeb72c7015..13c39b6d5d64 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -1,9 +1,12 @@
#ifndef _ASM_POWERPC_BOOK3S_64_PGTABLE_H_
#define _ASM_POWERPC_BOOK3S_64_PGTABLE_H_

+#include <asm-generic/5level-fixup.h>
+
#ifndef __ASSEMBLY__
#include <linux/mmdebug.h>
#endif
+
/*
* Common bits between hash and Radix page table
*/
diff --git a/arch/powerpc/include/asm/nohash/32/pgtable.h b/arch/powerpc/include/asm/nohash/32/pgtable.h
index ba9921bf202e..5134ade2e850 100644
--- a/arch/powerpc/include/asm/nohash/32/pgtable.h
+++ b/arch/powerpc/include/asm/nohash/32/pgtable.h
@@ -1,6 +1,7 @@
#ifndef _ASM_POWERPC_NOHASH_32_PGTABLE_H
#define _ASM_POWERPC_NOHASH_32_PGTABLE_H

+#define __ARCH_USE_5LEVEL_HACK
#include <asm-generic/pgtable-nopmd.h>

#ifndef __ASSEMBLY__
diff --git a/arch/powerpc/include/asm/nohash/64/pgtable-4k.h b/arch/powerpc/include/asm/nohash/64/pgtable-4k.h
index d0db98793dd8..9f4de0a1035e 100644
--- a/arch/powerpc/include/asm/nohash/64/pgtable-4k.h
+++ b/arch/powerpc/include/asm/nohash/64/pgtable-4k.h
@@ -1,5 +1,8 @@
#ifndef _ASM_POWERPC_NOHASH_64_PGTABLE_4K_H
#define _ASM_POWERPC_NOHASH_64_PGTABLE_4K_H
+
+#include <asm-generic/5level-fixup.h>
+
/*
* Entries per page directory level. The PTE level must use a 64b record
* for each page table entry. The PMD and PGD level use a 32b record for
diff --git a/arch/powerpc/include/asm/nohash/64/pgtable-64k.h b/arch/powerpc/include/asm/nohash/64/pgtable-64k.h
index 55b28ef3409a..1facb584dd29 100644
--- a/arch/powerpc/include/asm/nohash/64/pgtable-64k.h
+++ b/arch/powerpc/include/asm/nohash/64/pgtable-64k.h
@@ -1,6 +1,7 @@
#ifndef _ASM_POWERPC_NOHASH_64_PGTABLE_64K_H
#define _ASM_POWERPC_NOHASH_64_PGTABLE_64K_H

+#define __ARCH_USE_5LEVEL_HACK
#include <asm-generic/pgtable-nopud.h>


diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 7ed1972b1920..93e37b12e882 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -24,6 +24,7 @@
* the S390 page table tree.
*/
#ifndef __ASSEMBLY__
+#include <asm-generic/5level-fixup.h>
#include <linux/sched.h>
#include <linux/mm_types.h>
#include <linux/page-flags.h>
diff --git a/arch/score/include/asm/pgtable.h b/arch/score/include/asm/pgtable.h
index 0553e5cd5985..46ff8fd678a7 100644
--- a/arch/score/include/asm/pgtable.h
+++ b/arch/score/include/asm/pgtable.h
@@ -2,6 +2,7 @@
#define _ASM_SCORE_PGTABLE_H

#include <linux/const.h>
+#define __ARCH_USE_5LEVEL_HACK
#include <asm-generic/pgtable-nopmd.h>

#include <asm/fixmap.h>
diff --git a/arch/sh/include/asm/pgtable-2level.h b/arch/sh/include/asm/pgtable-2level.h
index 19bd89db17e7..f75cf4387257 100644
--- a/arch/sh/include/asm/pgtable-2level.h
+++ b/arch/sh/include/asm/pgtable-2level.h
@@ -1,6 +1,7 @@
#ifndef __ASM_SH_PGTABLE_2LEVEL_H
#define __ASM_SH_PGTABLE_2LEVEL_H

+#define __ARCH_USE_5LEVEL_HACK
#include <asm-generic/pgtable-nopmd.h>

/*
diff --git a/arch/sh/include/asm/pgtable-3level.h b/arch/sh/include/asm/pgtable-3level.h
index 249a985d9648..9b1e776eca31 100644
--- a/arch/sh/include/asm/pgtable-3level.h
+++ b/arch/sh/include/asm/pgtable-3level.h
@@ -1,6 +1,7 @@
#ifndef __ASM_SH_PGTABLE_3LEVEL_H
#define __ASM_SH_PGTABLE_3LEVEL_H

+#define __ARCH_USE_5LEVEL_HACK
#include <asm-generic/pgtable-nopud.h>

/*
diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
index 56e49c8f770d..8a598528ec1f 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -12,6 +12,7 @@
* the SpitFire page tables.
*/

+#include <asm-generic/5level-fixup.h>
#include <linux/compiler.h>
#include <linux/const.h>
#include <asm/types.h>
diff --git a/arch/tile/include/asm/pgtable_32.h b/arch/tile/include/asm/pgtable_32.h
index d26a42279036..5f8c615cb5e9 100644
--- a/arch/tile/include/asm/pgtable_32.h
+++ b/arch/tile/include/asm/pgtable_32.h
@@ -74,6 +74,7 @@ extern unsigned long VMALLOC_RESERVE /* = CONFIG_VMALLOC_RESERVE */;
#define MAXMEM (_VMALLOC_START - PAGE_OFFSET)

/* We have no pmd or pud since we are strictly a two-level page table */
+#define __ARCH_USE_5LEVEL_HACK
#include <asm-generic/pgtable-nopmd.h>

static inline int pud_huge_page(pud_t pud) { return 0; }
diff --git a/arch/tile/include/asm/pgtable_64.h b/arch/tile/include/asm/pgtable_64.h
index e96cec52f6d8..96fe58b45118 100644
--- a/arch/tile/include/asm/pgtable_64.h
+++ b/arch/tile/include/asm/pgtable_64.h
@@ -59,6 +59,7 @@
#ifndef __ASSEMBLY__

/* We have no pud since we are a three-level page table. */
+#define __ARCH_USE_5LEVEL_HACK
#include <asm-generic/pgtable-nopud.h>

/*
diff --git a/arch/um/include/asm/pgtable-2level.h b/arch/um/include/asm/pgtable-2level.h
index cfbe59752469..179c0ea87a0c 100644
--- a/arch/um/include/asm/pgtable-2level.h
+++ b/arch/um/include/asm/pgtable-2level.h
@@ -8,6 +8,7 @@
#ifndef __UM_PGTABLE_2LEVEL_H
#define __UM_PGTABLE_2LEVEL_H

+#define __ARCH_USE_5LEVEL_HACK
#include <asm-generic/pgtable-nopmd.h>

/* PGDIR_SHIFT determines what a third-level page table entry can map */
diff --git a/arch/um/include/asm/pgtable-3level.h b/arch/um/include/asm/pgtable-3level.h
index bae8523a162f..c4d876dfb9ac 100644
--- a/arch/um/include/asm/pgtable-3level.h
+++ b/arch/um/include/asm/pgtable-3level.h
@@ -7,6 +7,7 @@
#ifndef __UM_PGTABLE_3LEVEL_H
#define __UM_PGTABLE_3LEVEL_H

+#define __ARCH_USE_5LEVEL_HACK
#include <asm-generic/pgtable-nopud.h>

/* PGDIR_SHIFT determines what a third-level page table entry can map */
diff --git a/arch/unicore32/include/asm/pgtable.h b/arch/unicore32/include/asm/pgtable.h
index 818d0f5598e3..a4f2bef37e70 100644
--- a/arch/unicore32/include/asm/pgtable.h
+++ b/arch/unicore32/include/asm/pgtable.h
@@ -12,6 +12,7 @@
#ifndef __UNICORE_PGTABLE_H__
#define __UNICORE_PGTABLE_H__

+#define __ARCH_USE_5LEVEL_HACK
#include <asm-generic/pgtable-nopmd.h>
#include <asm/cpu-single.h>

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 8b4de22d6429..62484333673d 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -273,6 +273,8 @@ static inline pgdval_t pgd_flags(pgd_t pgd)
}

#if CONFIG_PGTABLE_LEVELS > 3
+#include <asm-generic/5level-fixup.h>
+
typedef struct { pudval_t pud; } pud_t;

static inline pud_t native_make_pud(pmdval_t val)
@@ -285,6 +287,7 @@ static inline pudval_t native_pud_val(pud_t pud)
return pud.pud;
}
#else
+#define __ARCH_USE_5LEVEL_HACK
#include <asm-generic/pgtable-nopud.h>

static inline pudval_t native_pud_val(pud_t pud)
@@ -306,6 +309,7 @@ static inline pmdval_t native_pmd_val(pmd_t pmd)
return pmd.pmd;
}
#else
+#define __ARCH_USE_5LEVEL_HACK
#include <asm-generic/pgtable-nopmd.h>

static inline pmdval_t native_pmd_val(pmd_t pmd)
diff --git a/arch/xtensa/include/asm/pgtable.h b/arch/xtensa/include/asm/pgtable.h
index 8aa0e0d9cbb2..30dd5b2e4ad5 100644
--- a/arch/xtensa/include/asm/pgtable.h
+++ b/arch/xtensa/include/asm/pgtable.h
@@ -11,6 +11,7 @@
#ifndef _XTENSA_PGTABLE_H
#define _XTENSA_PGTABLE_H

+#define __ARCH_USE_5LEVEL_HACK
#include <asm-generic/pgtable-nopmd.h>
#include <asm/page.h>
#include <asm/kmem_layout.h>
--
2.11.0

2017-03-06 13:58:40

by Kirill A. Shutemov

Subject: [PATCHv4 06/33] mm: convert generic code to 5-level paging

Convert all non-architecture-specific code to 5-level paging.

The change is mostly mechanical: it adds handling for one more page table
level in places where we deal with pud_t.
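
The shape of the conversion, reduced to a toy userspace walker: a p4d lookup
is inserted between the pgd and pud steps. The structure and function names
below are invented for illustration (4 KiB pages and 9 index bits per level
are assumed); this is not the kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12
#define TOY_LEVEL_BITS 9
#define TOY_PTRS       (1u << TOY_LEVEL_BITS)

/* Toy table: 512 slots pointing at the next level (or at the final page). */
struct toy_table {
	void *slot[TOY_PTRS];
};

/* Index for a level: 0 = pte, 1 = pmd, 2 = pud, 3 = p4d, 4 = pgd. */
static unsigned int toy_index(uint64_t vaddr, int level)
{
	return (vaddr >> (TOY_PAGE_SHIFT + TOY_LEVEL_BITS * level)) &
	       (TOY_PTRS - 1);
}

/* Walk pgd -> p4d -> pud -> pmd -> pte; NULL if any level is empty. */
static void *toy_lookup(struct toy_table *pgd, uint64_t vaddr)
{
	struct toy_table *p4d, *pud, *pmd, *pte;

	assert(pgd != NULL);

	p4d = pgd->slot[toy_index(vaddr, 4)];
	if (!p4d)
		return NULL;

	/* The extra step: a p4d lookup sits between the pgd and the pud. */
	pud = p4d->slot[toy_index(vaddr, 3)];
	if (!pud)
		return NULL;

	pmd = pud->slot[toy_index(vaddr, 2)];
	if (!pmd)
		return NULL;

	pte = pmd->slot[toy_index(vaddr, 1)];
	if (!pte)
		return NULL;

	return pte->slot[toy_index(vaddr, 0)];
}
```

Each converted call site in the patch follows the same pattern: look up the
p4d from the pgd, check it, then pass the p4d (not the pgd) to pud_offset().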

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
drivers/misc/sgi-gru/grufault.c | 9 +-
fs/userfaultfd.c | 6 +-
include/asm-generic/pgtable.h | 48 +++++++++-
include/linux/hugetlb.h | 5 +-
include/linux/kasan.h | 1 +
include/linux/mm.h | 31 ++++--
lib/ioremap.c | 39 +++++++-
mm/gup.c | 46 +++++++--
mm/huge_memory.c | 7 +-
mm/hugetlb.c | 29 +++---
mm/kasan/kasan_init.c | 35 ++++++-
mm/memory.c | 207 +++++++++++++++++++++++++++++++++-------
mm/mlock.c | 1 +
mm/mprotect.c | 26 ++++-
mm/mremap.c | 13 ++-
mm/page_vma_mapped.c | 6 +-
mm/pagewalk.c | 32 ++++++-
mm/pgtable-generic.c | 6 ++
mm/rmap.c | 7 +-
mm/sparse-vmemmap.c | 22 ++++-
mm/swapfile.c | 26 ++++-
mm/userfaultfd.c | 23 +++--
mm/vmalloc.c | 81 ++++++++++++----
23 files changed, 586 insertions(+), 120 deletions(-)

diff --git a/drivers/misc/sgi-gru/grufault.c b/drivers/misc/sgi-gru/grufault.c
index 6fb773dbcd0c..93be82fc338a 100644
--- a/drivers/misc/sgi-gru/grufault.c
+++ b/drivers/misc/sgi-gru/grufault.c
@@ -219,15 +219,20 @@ static int atomic_pte_lookup(struct vm_area_struct *vma, unsigned long vaddr,
int write, unsigned long *paddr, int *pageshift)
{
pgd_t *pgdp;
- pmd_t *pmdp;
+ p4d_t *p4dp;
pud_t *pudp;
+ pmd_t *pmdp;
pte_t pte;

pgdp = pgd_offset(vma->vm_mm, vaddr);
if (unlikely(pgd_none(*pgdp)))
goto err;

- pudp = pud_offset(pgdp, vaddr);
+ p4dp = p4d_offset(pgdp, vaddr);
+ if (unlikely(p4d_none(*p4dp)))
+ goto err;
+
+ pudp = pud_offset(p4dp, vaddr);
if (unlikely(pud_none(*pudp)))
goto err;

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 973607df579d..02ce3944d0f5 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -267,6 +267,7 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
{
struct mm_struct *mm = ctx->mm;
pgd_t *pgd;
+ p4d_t *p4d;
pud_t *pud;
pmd_t *pmd, _pmd;
pte_t *pte;
@@ -277,7 +278,10 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
pgd = pgd_offset(mm, address);
if (!pgd_present(*pgd))
goto out;
- pud = pud_offset(pgd, address);
+ p4d = p4d_offset(pgd, address);
+ if (!p4d_present(*p4d))
+ goto out;
+ pud = pud_offset(p4d, address);
if (!pud_present(*pud))
goto out;
pmd = pmd_offset(pud, address);
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index f4ca23b158b3..1fad160f35de 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -10,9 +10,9 @@
#include <linux/bug.h>
#include <linux/errno.h>

-#if 4 - defined(__PAGETABLE_PUD_FOLDED) - defined(__PAGETABLE_PMD_FOLDED) != \
- CONFIG_PGTABLE_LEVELS
-#error CONFIG_PGTABLE_LEVELS is not consistent with __PAGETABLE_{PUD,PMD}_FOLDED
+#if 5 - defined(__PAGETABLE_P4D_FOLDED) - defined(__PAGETABLE_PUD_FOLDED) - \
+ defined(__PAGETABLE_PMD_FOLDED) != CONFIG_PGTABLE_LEVELS
+#error CONFIG_PGTABLE_LEVELS is not consistent with __PAGETABLE_{P4D,PUD,PMD}_FOLDED
#endif

/*
@@ -424,6 +424,13 @@ static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
(__boundary - 1 < (end) - 1)? __boundary: (end); \
})

+#ifndef p4d_addr_end
+#define p4d_addr_end(addr, end) \
+({ unsigned long __boundary = ((addr) + P4D_SIZE) & P4D_MASK; \
+ (__boundary - 1 < (end) - 1)? __boundary: (end); \
+})
+#endif
+
#ifndef pud_addr_end
#define pud_addr_end(addr, end) \
({ unsigned long __boundary = ((addr) + PUD_SIZE) & PUD_MASK; \
@@ -444,6 +451,7 @@ static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
* Do the tests inline, but report and clear the bad entry in mm/memory.c.
*/
void pgd_clear_bad(pgd_t *);
+void p4d_clear_bad(p4d_t *);
void pud_clear_bad(pud_t *);
void pmd_clear_bad(pmd_t *);

@@ -458,6 +466,17 @@ static inline int pgd_none_or_clear_bad(pgd_t *pgd)
return 0;
}

+static inline int p4d_none_or_clear_bad(p4d_t *p4d)
+{
+ if (p4d_none(*p4d))
+ return 1;
+ if (unlikely(p4d_bad(*p4d))) {
+ p4d_clear_bad(p4d);
+ return 1;
+ }
+ return 0;
+}
+
static inline int pud_none_or_clear_bad(pud_t *pud)
{
if (pud_none(*pud))
@@ -844,11 +863,30 @@ static inline int pmd_protnone(pmd_t pmd)
#endif /* CONFIG_MMU */

#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
+
+#ifndef __PAGETABLE_P4D_FOLDED
+int p4d_set_huge(p4d_t *p4d, phys_addr_t addr, pgprot_t prot);
+int p4d_clear_huge(p4d_t *p4d);
+#else
+static inline int p4d_set_huge(p4d_t *p4d, phys_addr_t addr, pgprot_t prot)
+{
+ return 0;
+}
+static inline int p4d_clear_huge(p4d_t *p4d)
+{
+ return 0;
+}
+#endif /* !__PAGETABLE_P4D_FOLDED */
+
int pud_set_huge(pud_t *pud, phys_addr_t addr, pgprot_t prot);
int pmd_set_huge(pmd_t *pmd, phys_addr_t addr, pgprot_t prot);
int pud_clear_huge(pud_t *pud);
int pmd_clear_huge(pmd_t *pmd);
#else /* !CONFIG_HAVE_ARCH_HUGE_VMAP */
+static inline int p4d_set_huge(p4d_t *p4d, phys_addr_t addr, pgprot_t prot)
+{
+ return 0;
+}
static inline int pud_set_huge(pud_t *pud, phys_addr_t addr, pgprot_t prot)
{
return 0;
@@ -857,6 +895,10 @@ static inline int pmd_set_huge(pmd_t *pmd, phys_addr_t addr, pgprot_t prot)
{
return 0;
}
+static inline int p4d_clear_huge(p4d_t *p4d)
+{
+ return 0;
+}
static inline int pud_clear_huge(pud_t *pud)
{
return 0;
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 503099d8aada..b857fc8cc2ec 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -122,7 +122,7 @@ struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
struct page *follow_huge_pud(struct mm_struct *mm, unsigned long address,
pud_t *pud, int flags);
int pmd_huge(pmd_t pmd);
-int pud_huge(pud_t pmd);
+int pud_huge(pud_t pud);
unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
unsigned long address, unsigned long end, pgprot_t newprot);

@@ -197,6 +197,9 @@ static inline void __unmap_hugepage_range(struct mmu_gather *tlb,
#ifndef pgd_huge
#define pgd_huge(x) 0
#endif
+#ifndef p4d_huge
+#define p4d_huge(x) 0
+#endif

#ifndef pgd_write
static inline int pgd_write(pgd_t pgd)
diff --git a/include/linux/kasan.h b/include/linux/kasan.h
index ceb3fe78a0d3..1c823bef4c15 100644
--- a/include/linux/kasan.h
+++ b/include/linux/kasan.h
@@ -18,6 +18,7 @@ extern unsigned char kasan_zero_page[PAGE_SIZE];
extern pte_t kasan_zero_pte[PTRS_PER_PTE];
extern pmd_t kasan_zero_pmd[PTRS_PER_PMD];
extern pud_t kasan_zero_pud[PTRS_PER_PUD];
+extern p4d_t kasan_zero_p4d[PTRS_PER_P4D];

void kasan_populate_zero_shadow(const void *shadow_start,
const void *shadow_end);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index be1fe264eb37..5f01c88f0800 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1560,14 +1560,24 @@ static inline pte_t *get_locked_pte(struct mm_struct *mm, unsigned long addr,
return ptep;
}

+#ifdef __PAGETABLE_P4D_FOLDED
+static inline int __p4d_alloc(struct mm_struct *mm, pgd_t *pgd,
+ unsigned long address)
+{
+ return 0;
+}
+#else
+int __p4d_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address);
+#endif
+
#ifdef __PAGETABLE_PUD_FOLDED
-static inline int __pud_alloc(struct mm_struct *mm, pgd_t *pgd,
+static inline int __pud_alloc(struct mm_struct *mm, p4d_t *p4d,
unsigned long address)
{
return 0;
}
#else
-int __pud_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address);
+int __pud_alloc(struct mm_struct *mm, p4d_t *p4d, unsigned long address);
#endif

#if defined(__PAGETABLE_PMD_FOLDED) || !defined(CONFIG_MMU)
@@ -1621,10 +1631,18 @@ int __pte_alloc_kernel(pmd_t *pmd, unsigned long address);
#if defined(CONFIG_MMU) && !defined(__ARCH_HAS_4LEVEL_HACK)

#ifndef __ARCH_HAS_5LEVEL_HACK
-static inline pud_t *pud_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address)
+static inline p4d_t *p4d_alloc(struct mm_struct *mm, pgd_t *pgd,
+ unsigned long address)
+{
+ return (unlikely(pgd_none(*pgd)) && __p4d_alloc(mm, pgd, address)) ?
+ NULL : p4d_offset(pgd, address);
+}
+
+static inline pud_t *pud_alloc(struct mm_struct *mm, p4d_t *p4d,
+ unsigned long address)
{
- return (unlikely(pgd_none(*pgd)) && __pud_alloc(mm, pgd, address))?
- NULL: pud_offset(pgd, address);
+ return (unlikely(p4d_none(*p4d)) && __pud_alloc(mm, p4d, address)) ?
+ NULL : pud_offset(p4d, address);
}
#endif /* !__ARCH_HAS_5LEVEL_HACK */

@@ -2388,7 +2406,8 @@ void sparse_mem_maps_populate_node(struct page **map_map,

struct page *sparse_mem_map_populate(unsigned long pnum, int nid);
pgd_t *vmemmap_pgd_populate(unsigned long addr, int node);
-pud_t *vmemmap_pud_populate(pgd_t *pgd, unsigned long addr, int node);
+p4d_t *vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node);
+pud_t *vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node);
pmd_t *vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node);
pte_t *vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node);
void *vmemmap_alloc_block(unsigned long size, int node);
diff --git a/lib/ioremap.c b/lib/ioremap.c
index a3e14ce92a56..4bb30206b942 100644
--- a/lib/ioremap.c
+++ b/lib/ioremap.c
@@ -14,6 +14,7 @@
#include <asm/pgtable.h>

#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
+static int __read_mostly ioremap_p4d_capable;
static int __read_mostly ioremap_pud_capable;
static int __read_mostly ioremap_pmd_capable;
static int __read_mostly ioremap_huge_disabled;
@@ -35,6 +36,11 @@ void __init ioremap_huge_init(void)
}
}

+static inline int ioremap_p4d_enabled(void)
+{
+ return ioremap_p4d_capable;
+}
+
static inline int ioremap_pud_enabled(void)
{
return ioremap_pud_capable;
@@ -46,6 +52,7 @@ static inline int ioremap_pmd_enabled(void)
}

#else /* !CONFIG_HAVE_ARCH_HUGE_VMAP */
+static inline int ioremap_p4d_enabled(void) { return 0; }
static inline int ioremap_pud_enabled(void) { return 0; }
static inline int ioremap_pmd_enabled(void) { return 0; }
#endif /* CONFIG_HAVE_ARCH_HUGE_VMAP */
@@ -94,14 +101,14 @@ static inline int ioremap_pmd_range(pud_t *pud, unsigned long addr,
return 0;
}

-static inline int ioremap_pud_range(pgd_t *pgd, unsigned long addr,
+static inline int ioremap_pud_range(p4d_t *p4d, unsigned long addr,
unsigned long end, phys_addr_t phys_addr, pgprot_t prot)
{
pud_t *pud;
unsigned long next;

phys_addr -= addr;
- pud = pud_alloc(&init_mm, pgd, addr);
+ pud = pud_alloc(&init_mm, p4d, addr);
if (!pud)
return -ENOMEM;
do {
@@ -120,6 +127,32 @@ static inline int ioremap_pud_range(pgd_t *pgd, unsigned long addr,
return 0;
}

+static inline int ioremap_p4d_range(pgd_t *pgd, unsigned long addr,
+ unsigned long end, phys_addr_t phys_addr, pgprot_t prot)
+{
+ p4d_t *p4d;
+ unsigned long next;
+
+ phys_addr -= addr;
+ p4d = p4d_alloc(&init_mm, pgd, addr);
+ if (!p4d)
+ return -ENOMEM;
+ do {
+ next = p4d_addr_end(addr, end);
+
+ if (ioremap_p4d_enabled() &&
+ ((next - addr) == P4D_SIZE) &&
+ IS_ALIGNED(phys_addr + addr, P4D_SIZE)) {
+ if (p4d_set_huge(p4d, phys_addr + addr, prot))
+ continue;
+ }
+
+ if (ioremap_pud_range(p4d, addr, next, phys_addr + addr, prot))
+ return -ENOMEM;
+ } while (p4d++, addr = next, addr != end);
+ return 0;
+}
+
int ioremap_page_range(unsigned long addr,
unsigned long end, phys_addr_t phys_addr, pgprot_t prot)
{
@@ -135,7 +168,7 @@ int ioremap_page_range(unsigned long addr,
pgd = pgd_offset_k(addr);
do {
next = pgd_addr_end(addr, end);
- err = ioremap_pud_range(pgd, addr, next, phys_addr+addr, prot);
+ err = ioremap_p4d_range(pgd, addr, next, phys_addr+addr, prot);
if (err)
break;
} while (pgd++, addr = next, addr != end);
diff --git a/mm/gup.c b/mm/gup.c
index 9c047e951aa3..c74bad1bf6e8 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -226,6 +226,7 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
unsigned int *page_mask)
{
pgd_t *pgd;
+ p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
spinlock_t *ptl;
@@ -243,8 +244,13 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
pgd = pgd_offset(mm, address);
if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
return no_page_table(vma, flags);
-
- pud = pud_offset(pgd, address);
+ p4d = p4d_offset(pgd, address);
+ if (p4d_none(*p4d))
+ return no_page_table(vma, flags);
+ BUILD_BUG_ON(p4d_huge(*p4d));
+ if (unlikely(p4d_bad(*p4d)))
+ return no_page_table(vma, flags);
+ pud = pud_offset(p4d, address);
if (pud_none(*pud))
return no_page_table(vma, flags);
if (pud_huge(*pud) && vma->vm_flags & VM_HUGETLB) {
@@ -325,6 +331,7 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address,
struct page **page)
{
pgd_t *pgd;
+ p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
pte_t *pte;
@@ -338,7 +345,9 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address,
else
pgd = pgd_offset_gate(mm, address);
BUG_ON(pgd_none(*pgd));
- pud = pud_offset(pgd, address);
+ p4d = p4d_offset(pgd, address);
+ BUG_ON(p4d_none(*p4d));
+ pud = pud_offset(p4d, address);
BUG_ON(pud_none(*pud));
pmd = pmd_offset(pud, address);
if (pmd_none(*pmd))
@@ -1400,13 +1409,13 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
return 1;
}

-static int gup_pud_range(pgd_t pgd, unsigned long addr, unsigned long end,
+static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
int write, struct page **pages, int *nr)
{
unsigned long next;
pud_t *pudp;

- pudp = pud_offset(&pgd, addr);
+ pudp = pud_offset(&p4d, addr);
do {
pud_t pud = READ_ONCE(*pudp);

@@ -1428,6 +1437,31 @@ static int gup_pud_range(pgd_t pgd, unsigned long addr, unsigned long end,
return 1;
}

+static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end,
+ int write, struct page **pages, int *nr)
+{
+ unsigned long next;
+ p4d_t *p4dp;
+
+ p4dp = p4d_offset(&pgd, addr);
+ do {
+ p4d_t p4d = READ_ONCE(*p4dp);
+
+ next = p4d_addr_end(addr, end);
+ if (p4d_none(p4d))
+ return 0;
+ BUILD_BUG_ON(p4d_huge(p4d));
+ if (unlikely(is_hugepd(__hugepd(p4d_val(p4d))))) {
+ if (!gup_huge_pd(__hugepd(p4d_val(p4d)), addr,
+ P4D_SHIFT, next, write, pages, nr))
+ return 0;
+ } else if (!gup_pud_range(p4d, addr, next, write, pages, nr))
+ return 0;
+ } while (p4dp++, addr = next, addr != end);
+
+ return 1;
+}
+
/*
* Like get_user_pages_fast() except it's IRQ-safe in that it won't fall back to
* the regular GUP. It will only return non-negative values.
@@ -1478,7 +1512,7 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
if (!gup_huge_pd(__hugepd(pgd_val(pgd)), addr,
PGDIR_SHIFT, next, write, pages, &nr))
break;
- } else if (!gup_pud_range(pgd, addr, next, write, pages, &nr))
+ } else if (!gup_p4d_range(pgd, addr, next, write, pages, &nr))
break;
} while (pgdp++, addr = next, addr != end);
local_irq_restore(flags);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d36b2af4d1bf..e4766de25709 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2048,6 +2048,7 @@ void split_huge_pmd_address(struct vm_area_struct *vma, unsigned long address,
bool freeze, struct page *page)
{
pgd_t *pgd;
+ p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;

@@ -2055,7 +2056,11 @@ void split_huge_pmd_address(struct vm_area_struct *vma, unsigned long address,
if (!pgd_present(*pgd))
return;

- pud = pud_offset(pgd, address);
+ p4d = p4d_offset(pgd, address);
+ if (!p4d_present(*p4d))
+ return;
+
+ pud = pud_offset(p4d, address);
if (!pud_present(*pud))
return;

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a7aa811b7d14..3d0aab9ee80d 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4555,7 +4555,8 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep)
{
pgd_t *pgd = pgd_offset(mm, *addr);
- pud_t *pud = pud_offset(pgd, *addr);
+ p4d_t *p4d = p4d_offset(pgd, *addr);
+ pud_t *pud = pud_offset(p4d, *addr);

BUG_ON(page_count(virt_to_page(ptep)) == 0);
if (page_count(virt_to_page(ptep)) == 1)
@@ -4586,11 +4587,13 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
unsigned long addr, unsigned long sz)
{
pgd_t *pgd;
+ p4d_t *p4d;
pud_t *pud;
pte_t *pte = NULL;

pgd = pgd_offset(mm, addr);
- pud = pud_alloc(mm, pgd, addr);
+ p4d = p4d_offset(pgd, addr);
+ pud = pud_alloc(mm, p4d, addr);
if (pud) {
if (sz == PUD_SIZE) {
pte = (pte_t *)pud;
@@ -4610,18 +4613,22 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
{
pgd_t *pgd;
+ p4d_t *p4d;
pud_t *pud;
- pmd_t *pmd = NULL;
+ pmd_t *pmd;

pgd = pgd_offset(mm, addr);
- if (pgd_present(*pgd)) {
- pud = pud_offset(pgd, addr);
- if (pud_present(*pud)) {
- if (pud_huge(*pud))
- return (pte_t *)pud;
- pmd = pmd_offset(pud, addr);
- }
- }
+ if (!pgd_present(*pgd))
+ return NULL;
+ p4d = p4d_offset(pgd, addr);
+ if (!p4d_present(*p4d))
+ return NULL;
+ pud = pud_offset(p4d, addr);
+ if (!pud_present(*pud))
+ return NULL;
+ if (pud_huge(*pud))
+ return (pte_t *)pud;
+ pmd = pmd_offset(pud, addr);
return (pte_t *) pmd;
}

diff --git a/mm/kasan/kasan_init.c b/mm/kasan/kasan_init.c
index 31238dad85fb..7870ad44ee20 100644
--- a/mm/kasan/kasan_init.c
+++ b/mm/kasan/kasan_init.c
@@ -30,6 +30,9 @@
*/
unsigned char kasan_zero_page[PAGE_SIZE] __page_aligned_bss;

+#if CONFIG_PGTABLE_LEVELS > 4
+p4d_t kasan_zero_p4d[PTRS_PER_P4D] __page_aligned_bss;
+#endif
#if CONFIG_PGTABLE_LEVELS > 3
pud_t kasan_zero_pud[PTRS_PER_PUD] __page_aligned_bss;
#endif
@@ -82,10 +85,10 @@ static void __init zero_pmd_populate(pud_t *pud, unsigned long addr,
} while (pmd++, addr = next, addr != end);
}

-static void __init zero_pud_populate(pgd_t *pgd, unsigned long addr,
+static void __init zero_pud_populate(p4d_t *p4d, unsigned long addr,
unsigned long end)
{
- pud_t *pud = pud_offset(pgd, addr);
+ pud_t *pud = pud_offset(p4d, addr);
unsigned long next;

do {
@@ -107,6 +110,23 @@ static void __init zero_pud_populate(pgd_t *pgd, unsigned long addr,
} while (pud++, addr = next, addr != end);
}

+static void __init zero_p4d_populate(pgd_t *pgd, unsigned long addr,
+ unsigned long end)
+{
+ p4d_t *p4d = p4d_offset(pgd, addr);
+ unsigned long next;
+
+ do {
+ next = p4d_addr_end(addr, end);
+
+ if (p4d_none(*p4d)) {
+ p4d_populate(&init_mm, p4d,
+ early_alloc(PAGE_SIZE, NUMA_NO_NODE));
+ }
+ zero_pud_populate(p4d, addr, next);
+ } while (p4d++, addr = next, addr != end);
+}
+
/**
* kasan_populate_zero_shadow - populate shadow memory region with
* kasan_zero_page
@@ -125,6 +145,7 @@ void __init kasan_populate_zero_shadow(const void *shadow_start,
next = pgd_addr_end(addr, end);

if (IS_ALIGNED(addr, PGDIR_SIZE) && end - addr >= PGDIR_SIZE) {
+ p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;

@@ -136,8 +157,12 @@ void __init kasan_populate_zero_shadow(const void *shadow_start,
* puds,pmds, so pgd_populate(), pud_populate()
* is noops.
*/
- pgd_populate(&init_mm, pgd, lm_alias(kasan_zero_pud));
- pud = pud_offset(pgd, addr);
+#ifndef __ARCH_HAS_5LEVEL_HACK
+ pgd_populate(&init_mm, pgd, lm_alias(kasan_zero_p4d));
+#endif
+ p4d = p4d_offset(pgd, addr);
+ p4d_populate(&init_mm, p4d, lm_alias(kasan_zero_pud));
+ pud = pud_offset(p4d, addr);
pud_populate(&init_mm, pud, lm_alias(kasan_zero_pmd));
pmd = pmd_offset(pud, addr);
pmd_populate_kernel(&init_mm, pmd, lm_alias(kasan_zero_pte));
@@ -148,6 +173,6 @@ void __init kasan_populate_zero_shadow(const void *shadow_start,
pgd_populate(&init_mm, pgd,
early_alloc(PAGE_SIZE, NUMA_NO_NODE));
}
- zero_pud_populate(pgd, addr, next);
+ zero_p4d_populate(pgd, addr, next);
} while (pgd++, addr = next, addr != end);
}
diff --git a/mm/memory.c b/mm/memory.c
index a97a4cec2e1f..7f1c2163b3ce 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -445,7 +445,7 @@ static inline void free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
mm_dec_nr_pmds(tlb->mm);
}

-static inline void free_pud_range(struct mmu_gather *tlb, pgd_t *pgd,
+static inline void free_pud_range(struct mmu_gather *tlb, p4d_t *p4d,
unsigned long addr, unsigned long end,
unsigned long floor, unsigned long ceiling)
{
@@ -454,7 +454,7 @@ static inline void free_pud_range(struct mmu_gather *tlb, pgd_t *pgd,
unsigned long start;

start = addr;
- pud = pud_offset(pgd, addr);
+ pud = pud_offset(p4d, addr);
do {
next = pud_addr_end(addr, end);
if (pud_none_or_clear_bad(pud))
@@ -462,6 +462,39 @@ static inline void free_pud_range(struct mmu_gather *tlb, pgd_t *pgd,
free_pmd_range(tlb, pud, addr, next, floor, ceiling);
} while (pud++, addr = next, addr != end);

+ start &= P4D_MASK;
+ if (start < floor)
+ return;
+ if (ceiling) {
+ ceiling &= P4D_MASK;
+ if (!ceiling)
+ return;
+ }
+ if (end - 1 > ceiling - 1)
+ return;
+
+ pud = pud_offset(p4d, start);
+ p4d_clear(p4d);
+ pud_free_tlb(tlb, pud, start);
+}
+
+static inline void free_p4d_range(struct mmu_gather *tlb, pgd_t *pgd,
+ unsigned long addr, unsigned long end,
+ unsigned long floor, unsigned long ceiling)
+{
+ p4d_t *p4d;
+ unsigned long next;
+ unsigned long start;
+
+ start = addr;
+ p4d = p4d_offset(pgd, addr);
+ do {
+ next = p4d_addr_end(addr, end);
+ if (p4d_none_or_clear_bad(p4d))
+ continue;
+ free_pud_range(tlb, p4d, addr, next, floor, ceiling);
+ } while (p4d++, addr = next, addr != end);
+
start &= PGDIR_MASK;
if (start < floor)
return;
@@ -473,9 +506,9 @@ static inline void free_pud_range(struct mmu_gather *tlb, pgd_t *pgd,
if (end - 1 > ceiling - 1)
return;

- pud = pud_offset(pgd, start);
+ p4d = p4d_offset(pgd, start);
pgd_clear(pgd);
- pud_free_tlb(tlb, pud, start);
+ p4d_free_tlb(tlb, p4d, start);
}

/*
@@ -539,7 +572,7 @@ void free_pgd_range(struct mmu_gather *tlb,
next = pgd_addr_end(addr, end);
if (pgd_none_or_clear_bad(pgd))
continue;
- free_pud_range(tlb, pgd, addr, next, floor, ceiling);
+ free_p4d_range(tlb, pgd, addr, next, floor, ceiling);
} while (pgd++, addr = next, addr != end);
}

@@ -658,7 +691,8 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
pte_t pte, struct page *page)
{
pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
- pud_t *pud = pud_offset(pgd, addr);
+ p4d_t *p4d = p4d_offset(pgd, addr);
+ pud_t *pud = pud_offset(p4d, addr);
pmd_t *pmd = pmd_offset(pud, addr);
struct address_space *mapping;
pgoff_t index;
@@ -1023,16 +1057,16 @@ static inline int copy_pmd_range(struct mm_struct *dst_mm, struct mm_struct *src
}

static inline int copy_pud_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
- pgd_t *dst_pgd, pgd_t *src_pgd, struct vm_area_struct *vma,
+ p4d_t *dst_p4d, p4d_t *src_p4d, struct vm_area_struct *vma,
unsigned long addr, unsigned long end)
{
pud_t *src_pud, *dst_pud;
unsigned long next;

- dst_pud = pud_alloc(dst_mm, dst_pgd, addr);
+ dst_pud = pud_alloc(dst_mm, dst_p4d, addr);
if (!dst_pud)
return -ENOMEM;
- src_pud = pud_offset(src_pgd, addr);
+ src_pud = pud_offset(src_p4d, addr);
do {
next = pud_addr_end(addr, end);
if (pud_trans_huge(*src_pud) || pud_devmap(*src_pud)) {
@@ -1056,6 +1090,28 @@ static inline int copy_pud_range(struct mm_struct *dst_mm, struct mm_struct *src
return 0;
}

+static inline int copy_p4d_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
+ pgd_t *dst_pgd, pgd_t *src_pgd, struct vm_area_struct *vma,
+ unsigned long addr, unsigned long end)
+{
+ p4d_t *src_p4d, *dst_p4d;
+ unsigned long next;
+
+ dst_p4d = p4d_alloc(dst_mm, dst_pgd, addr);
+ if (!dst_p4d)
+ return -ENOMEM;
+ src_p4d = p4d_offset(src_pgd, addr);
+ do {
+ next = p4d_addr_end(addr, end);
+ if (p4d_none_or_clear_bad(src_p4d))
+ continue;
+ if (copy_pud_range(dst_mm, src_mm, dst_p4d, src_p4d,
+ vma, addr, next))
+ return -ENOMEM;
+ } while (dst_p4d++, src_p4d++, addr = next, addr != end);
+ return 0;
+}
+
int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
struct vm_area_struct *vma)
{
@@ -1111,7 +1167,7 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
next = pgd_addr_end(addr, end);
if (pgd_none_or_clear_bad(src_pgd))
continue;
- if (unlikely(copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd,
+ if (unlikely(copy_p4d_range(dst_mm, src_mm, dst_pgd, src_pgd,
vma, addr, next))) {
ret = -ENOMEM;
break;
@@ -1267,14 +1323,14 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
}

static inline unsigned long zap_pud_range(struct mmu_gather *tlb,
- struct vm_area_struct *vma, pgd_t *pgd,
+ struct vm_area_struct *vma, p4d_t *p4d,
unsigned long addr, unsigned long end,
struct zap_details *details)
{
pud_t *pud;
unsigned long next;

- pud = pud_offset(pgd, addr);
+ pud = pud_offset(p4d, addr);
do {
next = pud_addr_end(addr, end);
if (pud_trans_huge(*pud) || pud_devmap(*pud)) {
@@ -1295,6 +1351,25 @@ static inline unsigned long zap_pud_range(struct mmu_gather *tlb,
return addr;
}

+static inline unsigned long zap_p4d_range(struct mmu_gather *tlb,
+ struct vm_area_struct *vma, pgd_t *pgd,
+ unsigned long addr, unsigned long end,
+ struct zap_details *details)
+{
+ p4d_t *p4d;
+ unsigned long next;
+
+ p4d = p4d_offset(pgd, addr);
+ do {
+ next = p4d_addr_end(addr, end);
+ if (p4d_none_or_clear_bad(p4d))
+ continue;
+ next = zap_pud_range(tlb, vma, p4d, addr, next, details);
+ } while (p4d++, addr = next, addr != end);
+
+ return addr;
+}
+
void unmap_page_range(struct mmu_gather *tlb,
struct vm_area_struct *vma,
unsigned long addr, unsigned long end,
@@ -1310,7 +1385,7 @@ void unmap_page_range(struct mmu_gather *tlb,
next = pgd_addr_end(addr, end);
if (pgd_none_or_clear_bad(pgd))
continue;
- next = zap_pud_range(tlb, vma, pgd, addr, next, details);
+ next = zap_p4d_range(tlb, vma, pgd, addr, next, details);
} while (pgd++, addr = next, addr != end);
tlb_end_vma(tlb, vma);
}
@@ -1465,16 +1540,24 @@ EXPORT_SYMBOL_GPL(zap_vma_ptes);
pte_t *__get_locked_pte(struct mm_struct *mm, unsigned long addr,
spinlock_t **ptl)
{
- pgd_t *pgd = pgd_offset(mm, addr);
- pud_t *pud = pud_alloc(mm, pgd, addr);
- if (pud) {
- pmd_t *pmd = pmd_alloc(mm, pud, addr);
- if (pmd) {
- VM_BUG_ON(pmd_trans_huge(*pmd));
- return pte_alloc_map_lock(mm, pmd, addr, ptl);
- }
- }
- return NULL;
+ pgd_t *pgd;
+ p4d_t *p4d;
+ pud_t *pud;
+ pmd_t *pmd;
+
+ pgd = pgd_offset(mm, addr);
+ p4d = p4d_alloc(mm, pgd, addr);
+ if (!p4d)
+ return NULL;
+ pud = pud_alloc(mm, p4d, addr);
+ if (!pud)
+ return NULL;
+ pmd = pmd_alloc(mm, pud, addr);
+ if (!pmd)
+ return NULL;
+
+ VM_BUG_ON(pmd_trans_huge(*pmd));
+ return pte_alloc_map_lock(mm, pmd, addr, ptl);
}

/*
@@ -1740,7 +1823,7 @@ static inline int remap_pmd_range(struct mm_struct *mm, pud_t *pud,
return 0;
}

-static inline int remap_pud_range(struct mm_struct *mm, pgd_t *pgd,
+static inline int remap_pud_range(struct mm_struct *mm, p4d_t *p4d,
unsigned long addr, unsigned long end,
unsigned long pfn, pgprot_t prot)
{
@@ -1748,7 +1831,7 @@ static inline int remap_pud_range(struct mm_struct *mm, pgd_t *pgd,
unsigned long next;

pfn -= addr >> PAGE_SHIFT;
- pud = pud_alloc(mm, pgd, addr);
+ pud = pud_alloc(mm, p4d, addr);
if (!pud)
return -ENOMEM;
do {
@@ -1760,6 +1843,26 @@ static inline int remap_pud_range(struct mm_struct *mm, pgd_t *pgd,
return 0;
}

+static inline int remap_p4d_range(struct mm_struct *mm, pgd_t *pgd,
+ unsigned long addr, unsigned long end,
+ unsigned long pfn, pgprot_t prot)
+{
+ p4d_t *p4d;
+ unsigned long next;
+
+ pfn -= addr >> PAGE_SHIFT;
+ p4d = p4d_alloc(mm, pgd, addr);
+ if (!p4d)
+ return -ENOMEM;
+ do {
+ next = p4d_addr_end(addr, end);
+ if (remap_pud_range(mm, p4d, addr, next,
+ pfn + (addr >> PAGE_SHIFT), prot))
+ return -ENOMEM;
+ } while (p4d++, addr = next, addr != end);
+ return 0;
+}
+
/**
* remap_pfn_range - remap kernel memory to userspace
* @vma: user vma to map to
@@ -1816,7 +1919,7 @@ int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
flush_cache_range(vma, addr, end);
do {
next = pgd_addr_end(addr, end);
- err = remap_pud_range(mm, pgd, addr, next,
+ err = remap_p4d_range(mm, pgd, addr, next,
pfn + (addr >> PAGE_SHIFT), prot);
if (err)
break;
@@ -1932,7 +2035,7 @@ static int apply_to_pmd_range(struct mm_struct *mm, pud_t *pud,
return err;
}

-static int apply_to_pud_range(struct mm_struct *mm, pgd_t *pgd,
+static int apply_to_pud_range(struct mm_struct *mm, p4d_t *p4d,
unsigned long addr, unsigned long end,
pte_fn_t fn, void *data)
{
@@ -1940,7 +2043,7 @@ static int apply_to_pud_range(struct mm_struct *mm, pgd_t *pgd,
unsigned long next;
int err;

- pud = pud_alloc(mm, pgd, addr);
+ pud = pud_alloc(mm, p4d, addr);
if (!pud)
return -ENOMEM;
do {
@@ -1952,6 +2055,26 @@ static int apply_to_pud_range(struct mm_struct *mm, pgd_t *pgd,
return err;
}

+static int apply_to_p4d_range(struct mm_struct *mm, pgd_t *pgd,
+ unsigned long addr, unsigned long end,
+ pte_fn_t fn, void *data)
+{
+ p4d_t *p4d;
+ unsigned long next;
+ int err;
+
+ p4d = p4d_alloc(mm, pgd, addr);
+ if (!p4d)
+ return -ENOMEM;
+ do {
+ next = p4d_addr_end(addr, end);
+ err = apply_to_pud_range(mm, p4d, addr, next, fn, data);
+ if (err)
+ break;
+ } while (p4d++, addr = next, addr != end);
+ return err;
+}
+
/*
* Scan a region of virtual memory, filling in page tables as necessary
* and calling a provided function on each leaf page table.
@@ -1970,7 +2093,7 @@ int apply_to_page_range(struct mm_struct *mm, unsigned long addr,
pgd = pgd_offset(mm, addr);
do {
next = pgd_addr_end(addr, end);
- err = apply_to_pud_range(mm, pgd, addr, next, fn, data);
+ err = apply_to_p4d_range(mm, pgd, addr, next, fn, data);
if (err)
break;
} while (pgd++, addr = next, addr != end);
@@ -3653,11 +3776,15 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
};
struct mm_struct *mm = vma->vm_mm;
pgd_t *pgd;
+ p4d_t *p4d;
int ret;

pgd = pgd_offset(mm, address);
+ p4d = p4d_alloc(mm, pgd, address);
+ if (!p4d)
+ return VM_FAULT_OOM;

- vmf.pud = pud_alloc(mm, pgd, address);
+ vmf.pud = pud_alloc(mm, p4d, address);
if (!vmf.pud)
return VM_FAULT_OOM;
if (pud_none(*vmf.pud) && transparent_hugepage_enabled(vma)) {
@@ -3784,7 +3911,7 @@ EXPORT_SYMBOL_GPL(handle_mm_fault);
* Allocate page upper directory.
* We've already handled the fast-path in-line.
*/
-int __pud_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address)
+int __pud_alloc(struct mm_struct *mm, p4d_t *p4d, unsigned long address)
{
pud_t *new = pud_alloc_one(mm, address);
if (!new)
@@ -3793,10 +3920,17 @@ int __pud_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address)
smp_wmb(); /* See comment in __pte_alloc */

spin_lock(&mm->page_table_lock);
- if (pgd_present(*pgd)) /* Another has populated it */
+#ifndef __ARCH_HAS_5LEVEL_HACK
+ if (p4d_present(*p4d)) /* Another has populated it */
+ pud_free(mm, new);
+ else
+ p4d_populate(mm, p4d, new);
+#else
+ if (pgd_present(*p4d)) /* Another has populated it */
pud_free(mm, new);
else
- pgd_populate(mm, pgd, new);
+ pgd_populate(mm, p4d, new);
+#endif /* __ARCH_HAS_5LEVEL_HACK */
spin_unlock(&mm->page_table_lock);
return 0;
}
@@ -3839,6 +3973,7 @@ static int __follow_pte_pmd(struct mm_struct *mm, unsigned long address,
pte_t **ptepp, pmd_t **pmdpp, spinlock_t **ptlp)
{
pgd_t *pgd;
+ p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
pte_t *ptep;
@@ -3847,7 +3982,11 @@ static int __follow_pte_pmd(struct mm_struct *mm, unsigned long address,
if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
goto out;

- pud = pud_offset(pgd, address);
+ p4d = p4d_offset(pgd, address);
+ if (p4d_none(*p4d) || unlikely(p4d_bad(*p4d)))
+ goto out;
+
+ pud = pud_offset(p4d, address);
if (pud_none(*pud) || unlikely(pud_bad(*pud)))
goto out;

diff --git a/mm/mlock.c b/mm/mlock.c
index 1050511f8b2b..945edac46810 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -380,6 +380,7 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
pte = get_locked_pte(vma->vm_mm, start, &ptl);
/* Make sure we do not cross the page table boundary */
end = pgd_addr_end(start, end);
+ end = p4d_addr_end(start, end);
end = pud_addr_end(start, end);
end = pmd_addr_end(start, end);

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 848e946b08e5..8edd0d576254 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -193,14 +193,14 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
}

static inline unsigned long change_pud_range(struct vm_area_struct *vma,
- pgd_t *pgd, unsigned long addr, unsigned long end,
+ p4d_t *p4d, unsigned long addr, unsigned long end,
pgprot_t newprot, int dirty_accountable, int prot_numa)
{
pud_t *pud;
unsigned long next;
unsigned long pages = 0;

- pud = pud_offset(pgd, addr);
+ pud = pud_offset(p4d, addr);
do {
next = pud_addr_end(addr, end);
if (pud_none_or_clear_bad(pud))
@@ -212,6 +212,26 @@ static inline unsigned long change_pud_range(struct vm_area_struct *vma,
return pages;
}

+static inline unsigned long change_p4d_range(struct vm_area_struct *vma,
+ pgd_t *pgd, unsigned long addr, unsigned long end,
+ pgprot_t newprot, int dirty_accountable, int prot_numa)
+{
+ p4d_t *p4d;
+ unsigned long next;
+ unsigned long pages = 0;
+
+ p4d = p4d_offset(pgd, addr);
+ do {
+ next = p4d_addr_end(addr, end);
+ if (p4d_none_or_clear_bad(p4d))
+ continue;
+ pages += change_pud_range(vma, p4d, addr, next, newprot,
+ dirty_accountable, prot_numa);
+ } while (p4d++, addr = next, addr != end);
+
+ return pages;
+}
+
static unsigned long change_protection_range(struct vm_area_struct *vma,
unsigned long addr, unsigned long end, pgprot_t newprot,
int dirty_accountable, int prot_numa)
@@ -230,7 +250,7 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
next = pgd_addr_end(addr, end);
if (pgd_none_or_clear_bad(pgd))
continue;
- pages += change_pud_range(vma, pgd, addr, next, newprot,
+ pages += change_p4d_range(vma, pgd, addr, next, newprot,
dirty_accountable, prot_numa);
} while (pgd++, addr = next, addr != end);

diff --git a/mm/mremap.c b/mm/mremap.c
index 8233b0105c82..cd8a1b199ef9 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -32,6 +32,7 @@
static pmd_t *get_old_pmd(struct mm_struct *mm, unsigned long addr)
{
pgd_t *pgd;
+ p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;

@@ -39,7 +40,11 @@ static pmd_t *get_old_pmd(struct mm_struct *mm, unsigned long addr)
if (pgd_none_or_clear_bad(pgd))
return NULL;

- pud = pud_offset(pgd, addr);
+ p4d = p4d_offset(pgd, addr);
+ if (p4d_none_or_clear_bad(p4d))
+ return NULL;
+
+ pud = pud_offset(p4d, addr);
if (pud_none_or_clear_bad(pud))
return NULL;

@@ -54,11 +59,15 @@ static pmd_t *alloc_new_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long addr)
{
pgd_t *pgd;
+ p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;

pgd = pgd_offset(mm, addr);
- pud = pud_alloc(mm, pgd, addr);
+ p4d = p4d_alloc(mm, pgd, addr);
+ if (!p4d)
+ return NULL;
+ pud = pud_alloc(mm, p4d, addr);
if (!pud)
return NULL;

diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index a23001a22c15..c4c9def8ffea 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -104,6 +104,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
struct mm_struct *mm = pvmw->vma->vm_mm;
struct page *page = pvmw->page;
pgd_t *pgd;
+ p4d_t *p4d;
pud_t *pud;

/* The only possible pmd mapping has been handled on last iteration */
@@ -133,7 +134,10 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
pgd = pgd_offset(mm, pvmw->address);
if (!pgd_present(*pgd))
return false;
- pud = pud_offset(pgd, pvmw->address);
+ p4d = p4d_offset(pgd, pvmw->address);
+ if (!p4d_present(*p4d))
+ return false;
+ pud = pud_offset(p4d, pvmw->address);
if (!pud_present(*pud))
return false;
pvmw->pmd = pmd_offset(pud, pvmw->address);
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 03761577ae86..60f7856e508f 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -69,14 +69,14 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
return err;
}

-static int walk_pud_range(pgd_t *pgd, unsigned long addr, unsigned long end,
+static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
struct mm_walk *walk)
{
pud_t *pud;
unsigned long next;
int err = 0;

- pud = pud_offset(pgd, addr);
+ pud = pud_offset(p4d, addr);
do {
again:
next = pud_addr_end(addr, end);
@@ -113,6 +113,32 @@ static int walk_pud_range(pgd_t *pgd, unsigned long addr, unsigned long end,
return err;
}

+static int walk_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
+ struct mm_walk *walk)
+{
+ p4d_t *p4d;
+ unsigned long next;
+ int err = 0;
+
+ p4d = p4d_offset(pgd, addr);
+ do {
+ next = p4d_addr_end(addr, end);
+ if (p4d_none_or_clear_bad(p4d)) {
+ if (walk->pte_hole)
+ err = walk->pte_hole(addr, next, walk);
+ if (err)
+ break;
+ continue;
+ }
+ if (walk->pmd_entry || walk->pte_entry)
+ err = walk_pud_range(p4d, addr, next, walk);
+ if (err)
+ break;
+ } while (p4d++, addr = next, addr != end);
+
+ return err;
+}
+
static int walk_pgd_range(unsigned long addr, unsigned long end,
struct mm_walk *walk)
{
@@ -131,7 +157,7 @@ static int walk_pgd_range(unsigned long addr, unsigned long end,
continue;
}
if (walk->pmd_entry || walk->pte_entry)
- err = walk_pud_range(pgd, addr, next, walk);
+ err = walk_p4d_range(pgd, addr, next, walk);
if (err)
break;
} while (pgd++, addr = next, addr != end);
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 4ed5908c65b0..c99d9512a45b 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -22,6 +22,12 @@ void pgd_clear_bad(pgd_t *pgd)
pgd_clear(pgd);
}

+void p4d_clear_bad(p4d_t *p4d)
+{
+ p4d_ERROR(*p4d);
+ p4d_clear(p4d);
+}
+
void pud_clear_bad(pud_t *pud)
{
pud_ERROR(*pud);
diff --git a/mm/rmap.c b/mm/rmap.c
index 2da487d6cea8..2984403a2424 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -684,6 +684,7 @@ unsigned long page_address_in_vma(struct page *page, struct vm_area_struct *vma)
pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
{
pgd_t *pgd;
+ p4d_t *p4d;
pud_t *pud;
pmd_t *pmd = NULL;
pmd_t pmde;
@@ -692,7 +693,11 @@ pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
if (!pgd_present(*pgd))
goto out;

- pud = pud_offset(pgd, address);
+ p4d = p4d_offset(pgd, address);
+ if (!p4d_present(*p4d))
+ goto out;
+
+ pud = pud_offset(p4d, address);
if (!pud_present(*pud))
goto out;

diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 574c67b663fe..a56c3989f773 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -196,9 +196,9 @@ pmd_t * __meminit vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node)
return pmd;
}

-pud_t * __meminit vmemmap_pud_populate(pgd_t *pgd, unsigned long addr, int node)
+pud_t * __meminit vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node)
{
- pud_t *pud = pud_offset(pgd, addr);
+ pud_t *pud = pud_offset(p4d, addr);
if (pud_none(*pud)) {
void *p = vmemmap_alloc_block(PAGE_SIZE, node);
if (!p)
@@ -208,6 +208,18 @@ pud_t * __meminit vmemmap_pud_populate(pgd_t *pgd, unsigned long addr, int node)
return pud;
}

+p4d_t * __meminit vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node)
+{
+ p4d_t *p4d = p4d_offset(pgd, addr);
+ if (p4d_none(*p4d)) {
+ void *p = vmemmap_alloc_block(PAGE_SIZE, node);
+ if (!p)
+ return NULL;
+ p4d_populate(&init_mm, p4d, p);
+ }
+ return p4d;
+}
+
pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
{
pgd_t *pgd = pgd_offset_k(addr);
@@ -225,6 +237,7 @@ int __meminit vmemmap_populate_basepages(unsigned long start,
{
unsigned long addr = start;
pgd_t *pgd;
+ p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
pte_t *pte;
@@ -233,7 +246,10 @@ int __meminit vmemmap_populate_basepages(unsigned long start,
pgd = vmemmap_pgd_populate(addr, node);
if (!pgd)
return -ENOMEM;
- pud = vmemmap_pud_populate(pgd, addr, node);
+ p4d = vmemmap_p4d_populate(pgd, addr, node);
+ if (!p4d)
+ return -ENOMEM;
+ pud = vmemmap_pud_populate(p4d, addr, node);
if (!pud)
return -ENOMEM;
pmd = vmemmap_pmd_populate(pud, addr, node);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 521ef9b6064f..178130880b90 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1517,7 +1517,7 @@ static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud,
return 0;
}

-static inline int unuse_pud_range(struct vm_area_struct *vma, pgd_t *pgd,
+static inline int unuse_pud_range(struct vm_area_struct *vma, p4d_t *p4d,
unsigned long addr, unsigned long end,
swp_entry_t entry, struct page *page)
{
@@ -1525,7 +1525,7 @@ static inline int unuse_pud_range(struct vm_area_struct *vma, pgd_t *pgd,
unsigned long next;
int ret;

- pud = pud_offset(pgd, addr);
+ pud = pud_offset(p4d, addr);
do {
next = pud_addr_end(addr, end);
if (pud_none_or_clear_bad(pud))
@@ -1537,6 +1537,26 @@ static inline int unuse_pud_range(struct vm_area_struct *vma, pgd_t *pgd,
return 0;
}

+static inline int unuse_p4d_range(struct vm_area_struct *vma, pgd_t *pgd,
+ unsigned long addr, unsigned long end,
+ swp_entry_t entry, struct page *page)
+{
+ p4d_t *p4d;
+ unsigned long next;
+ int ret;
+
+ p4d = p4d_offset(pgd, addr);
+ do {
+ next = p4d_addr_end(addr, end);
+ if (p4d_none_or_clear_bad(p4d))
+ continue;
+ ret = unuse_pud_range(vma, p4d, addr, next, entry, page);
+ if (ret)
+ return ret;
+ } while (p4d++, addr = next, addr != end);
+ return 0;
+}
+
static int unuse_vma(struct vm_area_struct *vma,
swp_entry_t entry, struct page *page)
{
@@ -1560,7 +1580,7 @@ static int unuse_vma(struct vm_area_struct *vma,
next = pgd_addr_end(addr, end);
if (pgd_none_or_clear_bad(pgd))
continue;
- ret = unuse_pud_range(vma, pgd, addr, next, entry, page);
+ ret = unuse_p4d_range(vma, pgd, addr, next, entry, page);
if (ret)
return ret;
} while (pgd++, addr = next, addr != end);
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 479e631d43c2..8bcb501bce60 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -128,19 +128,22 @@ static int mfill_zeropage_pte(struct mm_struct *dst_mm,
static pmd_t *mm_alloc_pmd(struct mm_struct *mm, unsigned long address)
{
pgd_t *pgd;
+ p4d_t *p4d;
pud_t *pud;
- pmd_t *pmd = NULL;

pgd = pgd_offset(mm, address);
- pud = pud_alloc(mm, pgd, address);
- if (pud)
- /*
- * Note that we didn't run this because the pmd was
- * missing, the *pmd may be already established and in
- * turn it may also be a trans_huge_pmd.
- */
- pmd = pmd_alloc(mm, pud, address);
- return pmd;
+ p4d = p4d_alloc(mm, pgd, address);
+ if (!p4d)
+ return NULL;
+ pud = pud_alloc(mm, p4d, address);
+ if (!pud)
+ return NULL;
+ /*
+ * Note that we didn't run this because the pmd was
+ * missing, the *pmd may be already established and in
+ * turn it may also be a trans_huge_pmd.
+ */
+ return pmd_alloc(mm, pud, address);
}

#ifdef CONFIG_HUGETLB_PAGE
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index b4024d688f38..0dd80222b20b 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -86,12 +86,12 @@ static void vunmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end)
} while (pmd++, addr = next, addr != end);
}

-static void vunmap_pud_range(pgd_t *pgd, unsigned long addr, unsigned long end)
+static void vunmap_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end)
{
pud_t *pud;
unsigned long next;

- pud = pud_offset(pgd, addr);
+ pud = pud_offset(p4d, addr);
do {
next = pud_addr_end(addr, end);
if (pud_clear_huge(pud))
@@ -102,6 +102,22 @@ static void vunmap_pud_range(pgd_t *pgd, unsigned long addr, unsigned long end)
} while (pud++, addr = next, addr != end);
}

+static void vunmap_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end)
+{
+ p4d_t *p4d;
+ unsigned long next;
+
+ p4d = p4d_offset(pgd, addr);
+ do {
+ next = p4d_addr_end(addr, end);
+ if (p4d_clear_huge(p4d))
+ continue;
+ if (p4d_none_or_clear_bad(p4d))
+ continue;
+ vunmap_pud_range(p4d, addr, next);
+ } while (p4d++, addr = next, addr != end);
+}
+
static void vunmap_page_range(unsigned long addr, unsigned long end)
{
pgd_t *pgd;
@@ -113,7 +129,7 @@ static void vunmap_page_range(unsigned long addr, unsigned long end)
next = pgd_addr_end(addr, end);
if (pgd_none_or_clear_bad(pgd))
continue;
- vunmap_pud_range(pgd, addr, next);
+ vunmap_p4d_range(pgd, addr, next);
} while (pgd++, addr = next, addr != end);
}

@@ -160,13 +176,13 @@ static int vmap_pmd_range(pud_t *pud, unsigned long addr,
return 0;
}

-static int vmap_pud_range(pgd_t *pgd, unsigned long addr,
+static int vmap_pud_range(p4d_t *p4d, unsigned long addr,
unsigned long end, pgprot_t prot, struct page **pages, int *nr)
{
pud_t *pud;
unsigned long next;

- pud = pud_alloc(&init_mm, pgd, addr);
+ pud = pud_alloc(&init_mm, p4d, addr);
if (!pud)
return -ENOMEM;
do {
@@ -177,6 +193,23 @@ static int vmap_pud_range(pgd_t *pgd, unsigned long addr,
return 0;
}

+static int vmap_p4d_range(pgd_t *pgd, unsigned long addr,
+ unsigned long end, pgprot_t prot, struct page **pages, int *nr)
+{
+ p4d_t *p4d;
+ unsigned long next;
+
+ p4d = p4d_alloc(&init_mm, pgd, addr);
+ if (!p4d)
+ return -ENOMEM;
+ do {
+ next = p4d_addr_end(addr, end);
+ if (vmap_pud_range(p4d, addr, next, prot, pages, nr))
+ return -ENOMEM;
+ } while (p4d++, addr = next, addr != end);
+ return 0;
+}
+
/*
* Set up page tables in kva (addr, end). The ptes shall have prot "prot", and
* will have pfns corresponding to the "pages" array.
@@ -196,7 +229,7 @@ static int vmap_page_range_noflush(unsigned long start, unsigned long end,
pgd = pgd_offset_k(addr);
do {
next = pgd_addr_end(addr, end);
- err = vmap_pud_range(pgd, addr, next, prot, pages, &nr);
+ err = vmap_p4d_range(pgd, addr, next, prot, pages, &nr);
if (err)
return err;
} while (pgd++, addr = next, addr != end);
@@ -237,6 +270,10 @@ struct page *vmalloc_to_page(const void *vmalloc_addr)
unsigned long addr = (unsigned long) vmalloc_addr;
struct page *page = NULL;
pgd_t *pgd = pgd_offset_k(addr);
+ p4d_t *p4d;
+ pud_t *pud;
+ pmd_t *pmd;
+ pte_t *ptep, pte;

/*
* XXX we might need to change this if we add VIRTUAL_BUG_ON for
@@ -244,21 +281,23 @@ struct page *vmalloc_to_page(const void *vmalloc_addr)
*/
VIRTUAL_BUG_ON(!is_vmalloc_or_module_addr(vmalloc_addr));

- if (!pgd_none(*pgd)) {
- pud_t *pud = pud_offset(pgd, addr);
- if (!pud_none(*pud)) {
- pmd_t *pmd = pmd_offset(pud, addr);
- if (!pmd_none(*pmd)) {
- pte_t *ptep, pte;
-
- ptep = pte_offset_map(pmd, addr);
- pte = *ptep;
- if (pte_present(pte))
- page = pte_page(pte);
- pte_unmap(ptep);
- }
- }
- }
+ if (pgd_none(*pgd))
+ return NULL;
+ p4d = p4d_offset(pgd, addr);
+ if (p4d_none(*p4d))
+ return NULL;
+ pud = pud_offset(p4d, addr);
+ if (pud_none(*pud))
+ return NULL;
+ pmd = pmd_offset(pud, addr);
+ if (pmd_none(*pmd))
+ return NULL;
+
+ ptep = pte_offset_map(pmd, addr);
+ pte = *ptep;
+ if (pte_present(pte))
+ page = pte_page(pte);
+ pte_unmap(ptep);
return page;
}
EXPORT_SYMBOL(vmalloc_to_page);
--
2.11.0
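
The walkers in the patch above all follow the same pattern: each level splits the range at its own boundary with a `*_addr_end()` helper and hands each piece to the next level down. Below is a minimal userspace sketch of that pattern, not kernel code. `addr_end()`, `walk_pud_chunks()` and `walk_p4d_chunks()` are hypothetical names, and the shifts assume x86-64 with 5-level paging.

```c
#include <stdint.h>

#define PUD_SHIFT 30	/* assumed: a PUD entry maps 1 GiB */
#define P4D_SHIFT 39	/* assumed: a P4D entry maps 512 GiB */

/*
 * Return the next 2^shift boundary after addr, clamped to end.
 * Comparing "boundary - 1" with "end - 1" keeps the result correct
 * when the boundary wraps around to 0.
 */
static uint64_t addr_end(uint64_t addr, uint64_t end, unsigned int shift)
{
	uint64_t size = UINT64_C(1) << shift;
	uint64_t boundary = (addr + size) & ~(size - 1);

	return (boundary - 1 < end - 1) ? boundary : end;
}

/* Inner level: count the PUD-sized pieces of [addr, end). */
static unsigned long walk_pud_chunks(uint64_t addr, uint64_t end)
{
	unsigned long chunks = 0;
	uint64_t next;

	do {
		next = addr_end(addr, end, PUD_SHIFT);
		chunks++;
	} while (addr = next, addr != end);

	return chunks;
}

/* Outer level: split at P4D boundaries and delegate each piece. */
static unsigned long walk_p4d_chunks(uint64_t addr, uint64_t end)
{
	unsigned long chunks = 0;
	uint64_t next;

	do {
		next = addr_end(addr, end, P4D_SHIFT);
		chunks += walk_pud_chunks(addr, next);
	} while (addr = next, addr != end);

	return chunks;
}
```

Because the outer level clamps the range first, the inner level never sees a range that crosses a P4D boundary, which is why `pud_offset(p4d, addr)` only needs to be computed once per call.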

2017-03-06 13:58:54

by Kirill A. Shutemov

Subject: [PATCHv4 24/33] x86/mm: basic defines/helpers for CONFIG_X86_5LEVEL

Extend the page table headers to support the new paging mode.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/include/asm/pgtable_64.h | 11 +++++++++++
arch/x86/include/asm/pgtable_64_types.h | 20 +++++++++++++++++++
arch/x86/include/asm/pgtable_types.h | 10 +++++++++-
arch/x86/mm/pgtable.c | 34 ++++++++++++++++++++++++++++++++-
4 files changed, 73 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index 79396bfdc791..9991224f6238 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -35,6 +35,13 @@ extern void paging_init(void);
#define pud_ERROR(e) \
pr_err("%s:%d: bad pud %p(%016lx)\n", \
__FILE__, __LINE__, &(e), pud_val(e))
+
+#if CONFIG_PGTABLE_LEVELS >= 5
+#define p4d_ERROR(e) \
+ pr_err("%s:%d: bad p4d %p(%016lx)\n", \
+ __FILE__, __LINE__, &(e), p4d_val(e))
+#endif
+
#define pgd_ERROR(e) \
pr_err("%s:%d: bad pgd %p(%016lx)\n", \
__FILE__, __LINE__, &(e), pgd_val(e))
@@ -128,7 +135,11 @@ static inline void native_set_p4d(p4d_t *p4dp, p4d_t p4d)

static inline void native_p4d_clear(p4d_t *p4d)
{
+#ifdef CONFIG_X86_5LEVEL
+ native_set_p4d(p4d, native_make_p4d(0));
+#else
native_set_p4d(p4d, (p4d_t) { .pgd = native_make_pgd(0)});
+#endif
}

static inline void native_set_pgd(pgd_t *pgdp, pgd_t pgd)
diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
index 00dc0c2b456e..7ae641fdbd07 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -23,12 +23,32 @@ typedef struct { pteval_t pte; } pte_t;

#define SHARED_KERNEL_PMD 0

+#ifdef CONFIG_X86_5LEVEL
+
+/*
+ * PGDIR_SHIFT determines what a top-level page table entry can map
+ */
+#define PGDIR_SHIFT 48
+#define PTRS_PER_PGD 512
+
+/*
+ * 4th level page in 5-level paging case
+ */
+#define P4D_SHIFT 39
+#define PTRS_PER_P4D 512
+#define P4D_SIZE (_AC(1, UL) << P4D_SHIFT)
+#define P4D_MASK (~(P4D_SIZE - 1))
+
+#else /* CONFIG_X86_5LEVEL */
+
/*
* PGDIR_SHIFT determines what a top-level page table entry can map
*/
#define PGDIR_SHIFT 39
#define PTRS_PER_PGD 512

+#endif /* CONFIG_X86_5LEVEL */
+
/*
* 3rd level page
*/
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 4930afe9df0a..bf9638e1ee42 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -273,9 +273,17 @@ static inline pgdval_t pgd_flags(pgd_t pgd)
}

#if CONFIG_PGTABLE_LEVELS > 4
+typedef struct { p4dval_t p4d; } p4d_t;

-#error FIXME
+static inline p4d_t native_make_p4d(p4dval_t val)
+{
+ return (p4d_t) { val };
+}

+static inline p4dval_t native_p4d_val(p4d_t p4d)
+{
+ return p4d.p4d;
+}
#else
#include <asm-generic/pgtable-nop4d.h>

diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 38b6daf72deb..d26b066944a5 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -81,6 +81,14 @@ void ___pud_free_tlb(struct mmu_gather *tlb, pud_t *pud)
paravirt_release_pud(__pa(pud) >> PAGE_SHIFT);
tlb_remove_page(tlb, virt_to_page(pud));
}
+
+#if CONFIG_PGTABLE_LEVELS > 4
+void ___p4d_free_tlb(struct mmu_gather *tlb, p4d_t *p4d)
+{
+ paravirt_release_p4d(__pa(p4d) >> PAGE_SHIFT);
+ tlb_remove_page(tlb, virt_to_page(p4d));
+}
+#endif /* CONFIG_PGTABLE_LEVELS > 4 */
#endif /* CONFIG_PGTABLE_LEVELS > 3 */
#endif /* CONFIG_PGTABLE_LEVELS > 2 */

@@ -120,7 +128,7 @@ static void pgd_ctor(struct mm_struct *mm, pgd_t *pgd)
references from swapper_pg_dir. */
if (CONFIG_PGTABLE_LEVELS == 2 ||
(CONFIG_PGTABLE_LEVELS == 3 && SHARED_KERNEL_PMD) ||
- CONFIG_PGTABLE_LEVELS == 4) {
+ CONFIG_PGTABLE_LEVELS >= 4) {
clone_pgd_range(pgd + KERNEL_PGD_BOUNDARY,
swapper_pg_dir + KERNEL_PGD_BOUNDARY,
KERNEL_PGD_PTRS);
@@ -582,6 +590,30 @@ void native_set_fixmap(enum fixed_addresses idx, phys_addr_t phys,
}

#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
+#ifdef CONFIG_X86_5LEVEL
+/**
+ * p4d_set_huge - setup kernel P4D mapping
+ *
+ * No 512GB pages yet -- always return 0
+ *
+ * Returns 1 on success and 0 on failure.
+ */
+int p4d_set_huge(p4d_t *p4d, phys_addr_t addr, pgprot_t prot)
+{
+ return 0;
+}
+
+/**
+ * p4d_clear_huge - clear kernel P4D mapping when it is set
+ *
+ * No 512GB pages yet -- always return 0
+ */
+int p4d_clear_huge(p4d_t *p4d)
+{
+ return 0;
+}
+#endif
+
/**
* pud_set_huge - setup kernel PUD mapping
*
--
2.11.0
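
To illustrate what the new `PGDIR_SHIFT` and `P4D_SHIFT` values in the patch above mean for address translation, here is a small userspace sketch that splits a virtual address into its five table indices. The `PGDIR_SHIFT` and `P4D_SHIFT` values come from the patch; the lower three shifts are the usual x86-64 values. `table_index()` and `compose_vaddr()` are hypothetical helpers, not kernel functions.

```c
#include <stdint.h>

#define PAGE_SHIFT	12
#define PMD_SHIFT	21
#define PUD_SHIFT	30
#define P4D_SHIFT	39	/* from the patch, CONFIG_X86_5LEVEL case */
#define PGDIR_SHIFT	48	/* from the patch, CONFIG_X86_5LEVEL case */
#define PTRS_PER_TABLE	512	/* 9 index bits per level */

/* Index into the table whose entries each map 2^shift bytes. */
static unsigned int table_index(uint64_t vaddr, unsigned int shift)
{
	return (unsigned int)((vaddr >> shift) & (PTRS_PER_TABLE - 1));
}

/* Build an address from one index per level (page offset left at 0). */
static uint64_t compose_vaddr(uint64_t pgd, uint64_t p4d, uint64_t pud,
			      uint64_t pmd, uint64_t pte)
{
	return (pgd << PGDIR_SHIFT) | (p4d << P4D_SHIFT) |
	       (pud << PUD_SHIFT) | (pmd << PMD_SHIFT) | (pte << PAGE_SHIFT);
}
```

Five levels of 9 bits plus the 12-bit page offset give 57 bits of virtual address, and one P4D entry covers 2^39 bytes, which is the 512GB mentioned in the `p4d_set_huge()` comment.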

2017-03-06 13:59:10

by Kirill A. Shutemov

Subject: [PATCHv4 31/33] x86/mm: add support for 5-level paging for KASLR

With 5-level paging, randomization happens at the P4D level instead of PUD.

The maximum amount of physical memory is also bumped to 52 bits for
5-level paging.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/mm/kaslr.c | 82 ++++++++++++++++++++++++++++++++++++++++-------------
1 file changed, 63 insertions(+), 19 deletions(-)

diff --git a/arch/x86/mm/kaslr.c b/arch/x86/mm/kaslr.c
index 887e57182716..662e5c4b21c8 100644
--- a/arch/x86/mm/kaslr.c
+++ b/arch/x86/mm/kaslr.c
@@ -6,12 +6,12 @@
*
* Entropy is generated using the KASLR early boot functions now shared in
* the lib directory (originally written by Kees Cook). Randomization is
- * done on PGD & PUD page table levels to increase possible addresses. The
- * physical memory mapping code was adapted to support PUD level virtual
- * addresses. This implementation on the best configuration provides 30,000
- * possible virtual addresses in average for each memory region. An additional
- * low memory page is used to ensure each CPU can start with a PGD aligned
- * virtual address (for realmode).
+ * done on PGD & P4D/PUD page table levels to increase possible addresses.
+ * The physical memory mapping code was adapted to support P4D/PUD level
+ * virtual addresses. This implementation on the best configuration provides
+ * 30,000 possible virtual addresses in average for each memory region.
+ * An additional low memory page is used to ensure each CPU can start with
+ * a PGD aligned virtual address (for realmode).
*
* The order of each memory region is not changed. The feature looks at
* the available space for the regions based on different configuration
@@ -70,7 +70,8 @@ static __initdata struct kaslr_memory_region {
unsigned long *base;
unsigned long size_tb;
} kaslr_regions[] = {
- { &page_offset_base, 64/* Maximum */ },
+ { &page_offset_base,
+ 1 << (__PHYSICAL_MASK_SHIFT - TB_SHIFT) /* Maximum */ },
{ &vmalloc_base, VMALLOC_SIZE_TB },
{ &vmemmap_base, 1 },
};
@@ -142,7 +143,10 @@ void __init kernel_randomize_memory(void)
*/
entropy = remain_entropy / (ARRAY_SIZE(kaslr_regions) - i);
prandom_bytes_state(&rand_state, &rand, sizeof(rand));
- entropy = (rand % (entropy + 1)) & PUD_MASK;
+ if (IS_ENABLED(CONFIG_X86_5LEVEL))
+ entropy = (rand % (entropy + 1)) & P4D_MASK;
+ else
+ entropy = (rand % (entropy + 1)) & PUD_MASK;
vaddr += entropy;
*kaslr_regions[i].base = vaddr;

@@ -151,27 +155,21 @@ void __init kernel_randomize_memory(void)
* randomization alignment.
*/
vaddr += get_padding(&kaslr_regions[i]);
- vaddr = round_up(vaddr + 1, PUD_SIZE);
+ if (IS_ENABLED(CONFIG_X86_5LEVEL))
+ vaddr = round_up(vaddr + 1, P4D_SIZE);
+ else
+ vaddr = round_up(vaddr + 1, PUD_SIZE);
remain_entropy -= entropy;
}
}

-/*
- * Create PGD aligned trampoline table to allow real mode initialization
- * of additional CPUs. Consume only 1 low memory page.
- */
-void __meminit init_trampoline(void)
+static void __meminit init_trampoline_pud(void)
{
unsigned long paddr, paddr_next;
pgd_t *pgd;
pud_t *pud_page, *pud_page_tramp;
int i;

- if (!kaslr_memory_enabled()) {
- init_trampoline_default();
- return;
- }
-
pud_page_tramp = alloc_low_page();

paddr = 0;
@@ -192,3 +190,49 @@ void __meminit init_trampoline(void)
set_pgd(&trampoline_pgd_entry,
__pgd(_KERNPG_TABLE | __pa(pud_page_tramp)));
}
+
+static void __meminit init_trampoline_p4d(void)
+{
+ unsigned long paddr, paddr_next;
+ pgd_t *pgd;
+ p4d_t *p4d_page, *p4d_page_tramp;
+ int i;
+
+ p4d_page_tramp = alloc_low_page();
+
+ paddr = 0;
+ pgd = pgd_offset_k((unsigned long)__va(paddr));
+ p4d_page = (p4d_t *) pgd_page_vaddr(*pgd);
+
+ for (i = p4d_index(paddr); i < PTRS_PER_P4D; i++, paddr = paddr_next) {
+ p4d_t *p4d, *p4d_tramp;
+ unsigned long vaddr = (unsigned long)__va(paddr);
+
+ p4d_tramp = p4d_page_tramp + p4d_index(paddr);
+ p4d = p4d_page + p4d_index(vaddr);
+ paddr_next = (paddr & P4D_MASK) + P4D_SIZE;
+
+ *p4d_tramp = *p4d;
+ }
+
+ set_pgd(&trampoline_pgd_entry,
+ __pgd(_KERNPG_TABLE | __pa(p4d_page_tramp)));
+}
+
+/*
+ * Create PGD aligned trampoline table to allow real mode initialization
+ * of additional CPUs. Consume only 1 low memory page.
+ */
+void __meminit init_trampoline(void)
+{
+
+ if (!kaslr_memory_enabled()) {
+ init_trampoline_default();
+ return;
+ }
+
+ if (IS_ENABLED(CONFIG_X86_5LEVEL))
+ init_trampoline_p4d();
+ else
+ init_trampoline_pud();
+}
--
2.11.0
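
The two arithmetic changes in the KASLR patch above can be checked in isolation. The sketch below is a userspace illustration under stated assumptions: `TB_SHIFT` is taken to be 40 (1 TiB), and the physical mask shift is taken to be 46 bits for 4-level and 52 bits for 5-level paging. `direct_map_size_tb()` and `aligned_offset()` are hypothetical names.

```c
#include <stdint.h>

#define TB_SHIFT	40	/* assumed: 1 TiB = 2^40 bytes */
#define PUD_SHIFT	30
#define P4D_SHIFT	39

/*
 * Size in TiB of the direct mapping for a given number of physical
 * address bits. Only meaningful for phys_mask_shift >= TB_SHIFT.
 */
static uint64_t direct_map_size_tb(unsigned int phys_mask_shift)
{
	return UINT64_C(1) << (phys_mask_shift - TB_SHIFT);
}

/*
 * Pick an offset in [0, entropy] from rand and round it down to a
 * 2^align_shift boundary, mirroring the "& P4D_MASK" and "& PUD_MASK"
 * lines in the patch.
 */
static uint64_t aligned_offset(uint64_t rand, uint64_t entropy,
			       unsigned int align_shift)
{
	uint64_t mask = ~((UINT64_C(1) << align_shift) - 1);

	return (rand % (entropy + 1)) & mask;
}
```

With 46 bits the first expression yields the old hard-coded 64, and with 52 bits it yields 4096. Aligning to P4D instead of PUD discards 9 more low bits of the random offset.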

2017-03-06 13:59:16

by Kirill A. Shutemov

Subject: [PATCHv4 03/33] asm-generic: introduce __ARCH_USE_5LEVEL_HACK

We are going to introduce <asm-generic/pgtable-nop4d.h> to provide
an abstraction for a properly folded p4d level (as opposed to the
5level-fixup.h hack). The new header will be included from
pgtable-nopud.h.

If an architecture uses <asm-generic/pgtable-nop*d.h>, we cannot use
5level-fixup.h directly to quickly convert the architecture to 5-level
paging, as it would conflict with pgtable-nop4d.h.

With this patch, an architecture can define __ARCH_USE_5LEVEL_HACK before
including <asm-generic/pgtable-nop*d.h> to use 5level-fixup.h.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/asm-generic/pgtable-nop4d-hack.h | 62 ++++++++++++++++++++++++++++++++
include/asm-generic/pgtable-nopud.h | 5 +++
2 files changed, 67 insertions(+)
create mode 100644 include/asm-generic/pgtable-nop4d-hack.h

diff --git a/include/asm-generic/pgtable-nop4d-hack.h b/include/asm-generic/pgtable-nop4d-hack.h
new file mode 100644
index 000000000000..752fb7511750
--- /dev/null
+++ b/include/asm-generic/pgtable-nop4d-hack.h
@@ -0,0 +1,62 @@
+#ifndef _PGTABLE_NOP4D_HACK_H
+#define _PGTABLE_NOP4D_HACK_H
+
+#ifndef __ASSEMBLY__
+#include <asm-generic/5level-fixup.h>
+
+#define __PAGETABLE_PUD_FOLDED
+
+/*
+ * Having the pud type consist of a pgd gets the size right, and allows
+ * us to conceptually access the pgd entry that this pud is folded into
+ * without casting.
+ */
+typedef struct { pgd_t pgd; } pud_t;
+
+#define PUD_SHIFT PGDIR_SHIFT
+#define PTRS_PER_PUD 1
+#define PUD_SIZE (1UL << PUD_SHIFT)
+#define PUD_MASK (~(PUD_SIZE-1))
+
+/*
+ * The "pgd_xxx()" functions here are trivial for a folded two-level
+ * setup: the pud is never bad, and a pud always exists (as it's folded
+ * into the pgd entry)
+ */
+static inline int pgd_none(pgd_t pgd) { return 0; }
+static inline int pgd_bad(pgd_t pgd) { return 0; }
+static inline int pgd_present(pgd_t pgd) { return 1; }
+static inline void pgd_clear(pgd_t *pgd) { }
+#define pud_ERROR(pud) (pgd_ERROR((pud).pgd))
+
+#define pgd_populate(mm, pgd, pud) do { } while (0)
+/*
+ * (puds are folded into pgds so this doesn't get actually called,
+ * but the define is needed for a generic inline function.)
+ */
+#define set_pgd(pgdptr, pgdval) set_pud((pud_t *)(pgdptr), (pud_t) { pgdval })
+
+static inline pud_t *pud_offset(pgd_t *pgd, unsigned long address)
+{
+ return (pud_t *)pgd;
+}
+
+#define pud_val(x) (pgd_val((x).pgd))
+#define __pud(x) ((pud_t) { __pgd(x) })
+
+#define pgd_page(pgd) (pud_page((pud_t){ pgd }))
+#define pgd_page_vaddr(pgd) (pud_page_vaddr((pud_t){ pgd }))
+
+/*
+ * allocating and freeing a pud is trivial: the 1-entry pud is
+ * inside the pgd, so has no extra memory associated with it.
+ */
+#define pud_alloc_one(mm, address) NULL
+#define pud_free(mm, x) do { } while (0)
+#define __pud_free_tlb(tlb, x, a) do { } while (0)
+
+#undef pud_addr_end
+#define pud_addr_end(addr, end) (end)
+
+#endif /* __ASSEMBLY__ */
+#endif /* _PGTABLE_NOP4D_HACK_H */
diff --git a/include/asm-generic/pgtable-nopud.h b/include/asm-generic/pgtable-nopud.h
index 810431d8351b..5e49430a30a4 100644
--- a/include/asm-generic/pgtable-nopud.h
+++ b/include/asm-generic/pgtable-nopud.h
@@ -3,6 +3,10 @@

#ifndef __ASSEMBLY__

+#ifdef __ARCH_USE_5LEVEL_HACK
+#include <asm-generic/pgtable-nop4d-hack.h>
+#else
+
#define __PAGETABLE_PUD_FOLDED

/*
@@ -58,4 +62,5 @@ static inline pud_t * pud_offset(pgd_t * pgd, unsigned long address)
#define pud_addr_end(addr, end) (end)

#endif /* __ASSEMBLY__ */
+#endif /* !__ARCH_USE_5LEVEL_HACK */
#endif /* _PGTABLE_NOPUD_H */
--
2.11.0
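
The folding trick used by the header above can be shown outside the kernel. The following is a minimal sketch with simplified stand-in types, not the kernel definitions: the folded `pud_t` wraps a `pgd_t`, so "descending" a level returns the same entry under a different type.

```c
#include <stdint.h>

/* Simplified stand-ins for the kernel types. */
typedef struct { uint64_t pgd; } pgd_t;
/* Folded level: the pud consists of a pgd, so the sizes match. */
typedef struct { pgd_t pgd; } pud_t;

#define PTRS_PER_PUD 1	/* a folded table has exactly one entry */

/* With the level folded, the "pud table" is the pgd entry itself. */
static pud_t *pud_offset(pgd_t *pgd, uint64_t address)
{
	(void)address;	/* a 1-entry table needs no index */
	return (pud_t *)pgd;
}

static uint64_t pud_val(pud_t pud)
{
	return pud.pgd.pgd;
}

/* A folded pgd entry is always treated as populated. */
static int pgd_none(pgd_t pgd)
{
	(void)pgd;
	return 0;
}
```

Since `pgd_none()` is constant, generic code that checks each level keeps working unchanged, and the real presence check happens one level down on the same underlying entry.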

2017-03-06 13:59:36

by Kirill A. Shutemov

Subject: [PATCHv4 26/33] x86/kasan: extend to support 5-level paging

This patch brings support for a non-folded additional page table level.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
---
arch/x86/mm/kasan_init_64.c | 18 ++++++++++++++++--
1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index 733f8ba6a01f..bcabc56e0dc4 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -50,8 +50,18 @@ static void __init kasan_map_early_shadow(pgd_t *pgd)
unsigned long end = KASAN_SHADOW_END;

for (i = pgd_index(start); start < end; i++) {
- pgd[i] = __pgd(__pa_nodebug(kasan_zero_pud)
- | _KERNPG_TABLE);
+ switch (CONFIG_PGTABLE_LEVELS) {
+ case 4:
+ pgd[i] = __pgd(__pa_nodebug(kasan_zero_pud) |
+ _KERNPG_TABLE);
+ break;
+ case 5:
+ pgd[i] = __pgd(__pa_nodebug(kasan_zero_p4d) |
+ _KERNPG_TABLE);
+ break;
+ default:
+ BUILD_BUG();
+ }
start += PGDIR_SIZE;
}
}
@@ -79,6 +89,7 @@ void __init kasan_early_init(void)
pteval_t pte_val = __pa_nodebug(kasan_zero_page) | __PAGE_KERNEL;
pmdval_t pmd_val = __pa_nodebug(kasan_zero_pte) | _KERNPG_TABLE;
pudval_t pud_val = __pa_nodebug(kasan_zero_pmd) | _KERNPG_TABLE;
+ p4dval_t p4d_val = __pa_nodebug(kasan_zero_pud) | _KERNPG_TABLE;

for (i = 0; i < PTRS_PER_PTE; i++)
kasan_zero_pte[i] = __pte(pte_val);
@@ -89,6 +100,9 @@ void __init kasan_early_init(void)
for (i = 0; i < PTRS_PER_PUD; i++)
kasan_zero_pud[i] = __pud(pud_val);

+ for (i = 0; CONFIG_PGTABLE_LEVELS >= 5 && i < PTRS_PER_P4D; i++)
+ kasan_zero_p4d[i] = __p4d(p4d_val);
+
kasan_map_early_shadow(early_level4_pgt);
kasan_map_early_shadow(init_level4_pgt);
}
--
2.11.0
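
The early shadow setup extended by the patch above relies on one table per level, with every entry pointing at the single table of the next level, so the whole shadow range resolves to one zero page. A rough userspace model of that chaining follows, using plain pointers instead of physical addresses and flag bits. All names here are hypothetical.

```c
#include <stddef.h>

#define ENTRIES	512
#define LEVELS	5	/* pgd, p4d, pud, pmd, pte */

/* One table per level; tables[LEVELS - 1] plays the role of the pte table. */
static void *tables[LEVELS][ENTRIES];
static char zero_page[4096];

/*
 * Point every entry of each level at the one table of the next level,
 * and every pte entry at the shared zero page.
 */
static void early_shadow_init(void)
{
	for (int level = 0; level < LEVELS; level++) {
		void *target = (level == LEVELS - 1) ?
			(void *)zero_page : (void *)tables[level + 1];

		for (int i = 0; i < ENTRIES; i++)
			tables[level][i] = target;
	}
}

/* Walk from the top-level table using one index per level. */
static void *resolve(unsigned int pgd, unsigned int p4d, unsigned int pud,
		     unsigned int pmd, unsigned int pte)
{
	const unsigned int idx[LEVELS] = { pgd, p4d, pud, pmd, pte };
	void **table = tables[0];
	void *entry = NULL;

	for (int level = 0; level < LEVELS; level++) {
		entry = table[idx[level] % ENTRIES];
		if (level < LEVELS - 1)
			table = (void **)entry;
	}
	return entry;
}
```

In this model, adding the fifth level costs one extra table regardless of how large the shadow region is.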

2017-03-06 14:00:07

by Kirill A. Shutemov

Subject: [PATCHv4 12/33] x86/mm: add support of p4d_t in vmalloc_fault()

With 4-level paging, copying happens at the p4d level, because pgd_none()
is always false when p4d_t is folded.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/mm/fault.c | 18 ++++++++++++++++--
1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 605fd5e8e048..fcc887f607c2 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -435,6 +435,7 @@ void vmalloc_sync_all(void)
static noinline int vmalloc_fault(unsigned long address)
{
pgd_t *pgd, *pgd_ref;
+ p4d_t *p4d, *p4d_ref;
pud_t *pud, *pud_ref;
pmd_t *pmd, *pmd_ref;
pte_t *pte, *pte_ref;
@@ -462,13 +463,26 @@ static noinline int vmalloc_fault(unsigned long address)
BUG_ON(pgd_page_vaddr(*pgd) != pgd_page_vaddr(*pgd_ref));
}

+ /* With 4-level paging copying happens on p4d level. */
+ p4d = p4d_offset(pgd, address);
+ p4d_ref = p4d_offset(pgd_ref, address);
+ if (p4d_none(*p4d_ref))
+ return -1;
+
+ if (p4d_none(*p4d)) {
+ set_p4d(p4d, *p4d_ref);
+ arch_flush_lazy_mmu_mode();
+ } else {
+ BUG_ON(p4d_pfn(*p4d) != p4d_pfn(*p4d_ref));
+ }
+
/*
* Below here mismatches are bugs because these lower tables
* are shared:
*/

- pud = pud_offset(pgd, address);
- pud_ref = pud_offset(pgd_ref, address);
+ pud = pud_offset(p4d, address);
+ pud_ref = pud_offset(p4d_ref, address);
if (pud_none(*pud_ref))
return -1;

--
2.11.0
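
The per-level step that the patch above adds for p4d has the same shape as the existing pgd step: fail if the reference table has nothing, copy if the local entry is empty, otherwise require that both agree. Here is a simplified sketch of that step. `entry_t` and `sync_entry()` are hypothetical, and where the kernel uses BUG_ON() on a mismatch, this sketch just returns an error.

```c
#include <stdint.h>

typedef uint64_t entry_t;	/* 0 stands for "none" */

/*
 * Synchronize one page table entry from the reference table.
 * Returns 0 on success, -1 if the reference has no mapping or if the
 * two entries disagree (the kernel would BUG in the latter case).
 */
static int sync_entry(entry_t *entry, const entry_t *ref)
{
	if (*ref == 0)
		return -1;

	if (*entry == 0)
		*entry = *ref;		/* lazily copy the missing entry */
	else if (*entry != *ref)
		return -1;

	return 0;
}
```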

2017-03-06 14:00:16

by Kirill A. Shutemov

Subject: [PATCHv4 07/33] mm: introduce __p4d_alloc()

For full 5-level paging we need a helper to allocate a p4d page table.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/memory.c | 23 +++++++++++++++++++++++
1 file changed, 23 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index 7f1c2163b3ce..235ba51b2fbf 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3906,6 +3906,29 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
}
EXPORT_SYMBOL_GPL(handle_mm_fault);

+#ifndef __PAGETABLE_P4D_FOLDED
+/*
+ * Allocate p4d page table.
+ * We've already handled the fast-path in-line.
+ */
+int __p4d_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address)
+{
+ p4d_t *new = p4d_alloc_one(mm, address);
+ if (!new)
+ return -ENOMEM;
+
+ smp_wmb(); /* See comment in __pte_alloc */
+
+ spin_lock(&mm->page_table_lock);
+ if (pgd_present(*pgd)) /* Another has populated it */
+ p4d_free(mm, new);
+ else
+ pgd_populate(mm, pgd, new);
+ spin_unlock(&mm->page_table_lock);
+ return 0;
+}
+#endif /* __PAGETABLE_P4D_FOLDED */
+
#ifndef __PAGETABLE_PUD_FOLDED
/*
* Allocate page upper directory.
--
2.11.0
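
`__p4d_alloc()` above uses the usual "allocate outside the lock, recheck under the lock, free on a lost race" pattern. The sketch below models that in userspace with a C11 `atomic_flag` standing in for `mm->page_table_lock`. `struct mm_model`, `table_alloc()` and the lock helpers are hypothetical, and the sketch does not model the lockless readers that the `smp_wmb()` in the kernel version is there for.

```c
#include <stdatomic.h>
#include <stdlib.h>

struct mm_model {
	atomic_flag lock;	/* stands in for mm->page_table_lock */
	void *pgd_entry;	/* NULL means "not present" */
};

static void mm_lock(struct mm_model *mm)
{
	while (atomic_flag_test_and_set_explicit(&mm->lock,
						 memory_order_acquire))
		;	/* spin until the flag was previously clear */
}

static void mm_unlock(struct mm_model *mm)
{
	atomic_flag_clear_explicit(&mm->lock, memory_order_release);
}

/* Returns 0 on success (entry populated by us or by someone else), -1 on OOM. */
static int table_alloc(struct mm_model *mm)
{
	/* Allocate a zeroed table before taking the lock. */
	void *table = calloc(512, sizeof(void *));

	if (!table)
		return -1;

	mm_lock(mm);
	if (mm->pgd_entry)		/* another caller has populated it */
		free(table);
	else
		mm->pgd_entry = table;
	mm_unlock(mm);

	return 0;
}
```

Allocating before taking the lock keeps the critical section short, at the cost of an occasional wasted allocation when two callers race.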

2017-03-06 14:00:33

by Kirill A. Shutemov

Subject: [PATCHv4 11/33] x86/ident_map: add 5-level paging support

Nothing special: just handle one more level.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/mm/ident_map.c | 47 ++++++++++++++++++++++++++++++++++++++++-------
1 file changed, 40 insertions(+), 7 deletions(-)

diff --git a/arch/x86/mm/ident_map.c b/arch/x86/mm/ident_map.c
index 4473cb4f8b90..2c9a62282fb1 100644
--- a/arch/x86/mm/ident_map.c
+++ b/arch/x86/mm/ident_map.c
@@ -45,6 +45,34 @@ static int ident_pud_init(struct x86_mapping_info *info, pud_t *pud_page,
return 0;
}

+static int ident_p4d_init(struct x86_mapping_info *info, p4d_t *p4d_page,
+ unsigned long addr, unsigned long end)
+{
+ unsigned long next;
+
+ for (; addr < end; addr = next) {
+ p4d_t *p4d = p4d_page + p4d_index(addr);
+ pud_t *pud;
+
+ next = (addr & P4D_MASK) + P4D_SIZE;
+ if (next > end)
+ next = end;
+
+ if (p4d_present(*p4d)) {
+ pud = pud_offset(p4d, 0);
+ ident_pud_init(info, pud, addr, next);
+ continue;
+ }
+ pud = (pud_t *)info->alloc_pgt_page(info->context);
+ if (!pud)
+ return -ENOMEM;
+ ident_pud_init(info, pud, addr, next);
+ set_p4d(p4d, __p4d(__pa(pud) | _KERNPG_TABLE));
+ }
+
+ return 0;
+}
+
int kernel_ident_mapping_init(struct x86_mapping_info *info, pgd_t *pgd_page,
unsigned long pstart, unsigned long pend)
{
@@ -55,27 +83,32 @@ int kernel_ident_mapping_init(struct x86_mapping_info *info, pgd_t *pgd_page,

for (; addr < end; addr = next) {
pgd_t *pgd = pgd_page + pgd_index(addr);
- pud_t *pud;
+ p4d_t *p4d;

next = (addr & PGDIR_MASK) + PGDIR_SIZE;
if (next > end)
next = end;

if (pgd_present(*pgd)) {
- pud = pud_offset(pgd, 0);
- result = ident_pud_init(info, pud, addr, next);
+ p4d = p4d_offset(pgd, 0);
+ result = ident_p4d_init(info, p4d, addr, next);
if (result)
return result;
continue;
}

- pud = (pud_t *)info->alloc_pgt_page(info->context);
- if (!pud)
+ p4d = (p4d_t *)info->alloc_pgt_page(info->context);
+ if (!p4d)
return -ENOMEM;
- result = ident_pud_init(info, pud, addr, next);
+ result = ident_p4d_init(info, p4d, addr, next);
if (result)
return result;
- set_pgd(pgd, __pgd(__pa(pud) | _KERNPG_TABLE));
+ if (IS_ENABLED(CONFIG_X86_5LEVEL)) {
+ set_pgd(pgd, __pgd(__pa(p4d) | _KERNPG_TABLE));
+ } else {
+ pud_t *pud = pud_offset(p4d, 0);
+ set_pgd(pgd, __pgd(__pa(pud) | _KERNPG_TABLE));
+ }
}

return 0;
--
2.11.0

2017-03-06 14:00:24

by Kirill A. Shutemov

Subject: [PATCHv4 13/33] x86/power: support p4d_t in hibernate code

set_up_temporary_text_mapping() and relocate_restore_code() require
trivial adjustments to handle the additional page table level.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/power/hibernate_64.c | 49 ++++++++++++++++++++++++++++++-------------
1 file changed, 35 insertions(+), 14 deletions(-)

diff --git a/arch/x86/power/hibernate_64.c b/arch/x86/power/hibernate_64.c
index ded2e8272382..9ec941638932 100644
--- a/arch/x86/power/hibernate_64.c
+++ b/arch/x86/power/hibernate_64.c
@@ -49,6 +49,7 @@ static int set_up_temporary_text_mapping(pgd_t *pgd)
{
pmd_t *pmd;
pud_t *pud;
+ p4d_t *p4d;

/*
* The new mapping only has to cover the page containing the image
@@ -63,6 +64,13 @@ static int set_up_temporary_text_mapping(pgd_t *pgd)
* the virtual address space after switching over to the original page
* tables used by the image kernel.
*/
+
+ if (IS_ENABLED(CONFIG_X86_5LEVEL)) {
+ p4d = (p4d_t *)get_safe_page(GFP_ATOMIC);
+ if (!p4d)
+ return -ENOMEM;
+ }
+
pud = (pud_t *)get_safe_page(GFP_ATOMIC);
if (!pud)
return -ENOMEM;
@@ -75,8 +83,15 @@ static int set_up_temporary_text_mapping(pgd_t *pgd)
__pmd((jump_address_phys & PMD_MASK) | __PAGE_KERNEL_LARGE_EXEC));
set_pud(pud + pud_index(restore_jump_address),
__pud(__pa(pmd) | _KERNPG_TABLE));
- set_pgd(pgd + pgd_index(restore_jump_address),
- __pgd(__pa(pud) | _KERNPG_TABLE));
+ if (IS_ENABLED(CONFIG_X86_5LEVEL)) {
+ set_p4d(p4d + p4d_index(restore_jump_address),
+ __p4d(__pa(pud) | _KERNPG_TABLE));
+ set_pgd(pgd + pgd_index(restore_jump_address),
+ __pgd(__pa(p4d) | _KERNPG_TABLE));
+ } else {
+ set_pgd(pgd + pgd_index(restore_jump_address),
+ __pgd(__pa(pud) | _KERNPG_TABLE));
+ }

return 0;
}
@@ -124,7 +139,10 @@ static int set_up_temporary_mappings(void)
static int relocate_restore_code(void)
{
pgd_t *pgd;
+ p4d_t *p4d;
pud_t *pud;
+ pmd_t *pmd;
+ pte_t *pte;

relocated_restore_code = get_safe_page(GFP_ATOMIC);
if (!relocated_restore_code)
@@ -134,22 +152,25 @@ static int relocate_restore_code(void)

/* Make the page containing the relocated code executable */
pgd = (pgd_t *)__va(read_cr3()) + pgd_index(relocated_restore_code);
- pud = pud_offset(pgd, relocated_restore_code);
+ p4d = p4d_offset(pgd, relocated_restore_code);
+ if (p4d_large(*p4d)) {
+ set_p4d(p4d, __p4d(p4d_val(*p4d) & ~_PAGE_NX));
+ goto out;
+ }
+ pud = pud_offset(p4d, relocated_restore_code);
if (pud_large(*pud)) {
set_pud(pud, __pud(pud_val(*pud) & ~_PAGE_NX));
- } else {
- pmd_t *pmd = pmd_offset(pud, relocated_restore_code);
-
- if (pmd_large(*pmd)) {
- set_pmd(pmd, __pmd(pmd_val(*pmd) & ~_PAGE_NX));
- } else {
- pte_t *pte = pte_offset_kernel(pmd, relocated_restore_code);
-
- set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_NX));
- }
+ goto out;
+ }
+ pmd = pmd_offset(pud, relocated_restore_code);
+ if (pmd_large(*pmd)) {
+ set_pmd(pmd, __pmd(pmd_val(*pmd) & ~_PAGE_NX));
+ goto out;
}
+ pte = pte_offset_kernel(pmd, relocated_restore_code);
+ set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_NX));
+out:
__flush_tlb_all();
-
return 0;
}

--
2.11.0

2017-03-06 14:00:53

by Kirill A. Shutemov

Subject: [PATCHv4 32/33] x86: enable 5-level paging support

Most things are in place, so we can enable support for 5-level paging.

Enabling XEN with 5-level paging requires more work. The patch makes XEN
dependent on !X86_5LEVEL.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/Kconfig | 5 +++++
arch/x86/xen/Kconfig | 1 +
2 files changed, 6 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 747f06f00a22..43b3343402f5 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -317,6 +317,7 @@ config FIX_EARLYCON_MEM

config PGTABLE_LEVELS
int
+ default 5 if X86_5LEVEL
default 4 if X86_64
default 3 if X86_PAE
default 2
@@ -1381,6 +1382,10 @@ config X86_PAE
has the cost of more pagetable lookup overhead, and also
consumes more pagetable space per process.

+config X86_5LEVEL
+ bool "Enable 5-level page tables support"
+ depends on X86_64
+
config ARCH_PHYS_ADDR_T_64BIT
def_bool y
depends on X86_64 || X86_PAE
diff --git a/arch/x86/xen/Kconfig b/arch/x86/xen/Kconfig
index 76b6dbd627df..b90d481ce5a1 100644
--- a/arch/x86/xen/Kconfig
+++ b/arch/x86/xen/Kconfig
@@ -5,6 +5,7 @@
config XEN
bool "Xen guest support"
depends on PARAVIRT
+ depends on !X86_5LEVEL
select PARAVIRT_CLOCK
select XEN_HAVE_PVMMU
select XEN_HAVE_VPMU
--
2.11.0

2017-03-06 14:01:12

by Kirill A. Shutemov

Subject: [PATCHv4 33/33] x86/mm: allow to have userspace mappings above 47-bits

On x86, 5-level paging enables 56-bit userspace virtual address space.
Not all user space is ready to handle wide addresses. It's known that
at least some JIT compilers use the higher bits of pointers to encode
their own information. This collides with valid pointers under 5-level
paging and leads to crashes.

To mitigate this, we are not going to allocate virtual address space
above 47-bit by default.

But userspace can ask for allocation from the full address space by
specifying a hint address (with or without MAP_FIXED) above 47 bits.

If the hint address is above 47 bits but MAP_FIXED is not specified, we
first try to find an unmapped area at the specified address. If it is
already occupied, we look for an unmapped area in the *full* address
space, rather than only in the 47-bit window.

This approach makes it easy for an application's memory allocator to
become aware of the large address space without manually tracking
allocated virtual address space.
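
From userspace, opting in might look like the following sketch. It
assumes a 64-bit Linux system; map_with_high_hint(), is_above_47bit()
and HIGH_HINT are illustrative names, and the hint value is an arbitrary
page-aligned address above 47 bits. On a kernel without 5-level paging
the hint cannot be honoured, and the mapping is expected to land in the
usual 47-bit window instead.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE
#endif
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Arbitrary page-aligned hint above the 47-bit boundary (illustrative). */
#define HIGH_HINT ((void *)(uintptr_t)(UINT64_C(1) << 48))

/*
 * Request an anonymous mapping with a hint above 47 bits and without
 * MAP_FIXED. Returns the mapping, or MAP_FAILED on error.
 */
static void *map_with_high_hint(size_t len)
{
	return mmap(HIGH_HINT, len, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}

/* Nonzero if the address lies at or above the 47-bit boundary. */
static int is_above_47bit(const void *p)
{
	return (uintptr_t)p >= (UINT64_C(1) << 47);
}
```

An allocator could call map_with_high_hint() once and test the result
with is_above_47bit() to learn whether the wider address space is
actually available on the running system.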

One important case we need to handle here is the interaction with MPX.
MPX (without the MAWA extension) cannot handle addresses above 47 bits,
so we need to make sure that MPX cannot be enabled if we already have a
VMA above the boundary, and forbid creating such VMAs once MPX is
enabled.
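
The MPX rule can be modeled in a few lines. The sketch below mirrors the
decision made by mpx_unmapped_area_check() in the diff, but as a
standalone userspace function. It rests on several assumptions: the
mpx_enabled flag stands in for kernel_managing_mpx_tables(),
SKETCH_MAP_WINDOW stands in for DEFAULT_MAP_WINDOW with 4 KiB pages, and
errors are returned as a negative errno cast to an unsigned value.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/mman.h>

#define SKETCH_PAGE_SIZE  UINT64_C(4096)
#define SKETCH_MAP_WINDOW ((UINT64_C(1) << 47) - SKETCH_PAGE_SIZE)

/*
 * Returns the address to continue with: the original hint, 0 to drop
 * the hint and search inside the 47-bit window, or (uint64_t)-ENOMEM.
 */
static uint64_t sketch_mpx_check(int mpx_enabled, uint64_t addr,
				 uint64_t len, unsigned long flags)
{
	if (!mpx_enabled)
		return addr;		/* no MPX: no restriction */
	if (addr + len <= SKETCH_MAP_WINDOW)
		return addr;		/* request fits below the boundary */
	if (flags & MAP_FIXED)
		return (uint64_t)-ENOMEM; /* cannot relocate a fixed mapping */
	if (len > SKETCH_MAP_WINDOW)
		return (uint64_t)-ENOMEM; /* can never fit in the window */
	return 0;			/* drop the hint, search the window */
}
```

Returning 0 rather than an error for non-fixed requests lets the normal
search proceed as if no hint had been given.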

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/include/asm/elf.h | 2 +-
arch/x86/include/asm/mpx.h | 9 +++++++++
arch/x86/include/asm/processor.h | 9 ++++++---
arch/x86/kernel/sys_x86_64.c | 28 +++++++++++++++++++++++++++-
arch/x86/mm/hugetlbpage.c | 31 +++++++++++++++++++++++++++----
arch/x86/mm/mmap.c | 4 ++--
arch/x86/mm/mpx.c | 33 ++++++++++++++++++++++++++++++++-
7 files changed, 104 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
index 9d49c18b5ea9..265625b0d6cb 100644
--- a/arch/x86/include/asm/elf.h
+++ b/arch/x86/include/asm/elf.h
@@ -250,7 +250,7 @@ extern int force_personality32;
the loader. We need to make sure that it is out of the way of the program
that it will "exec", and that there is sufficient room for the brk. */

-#define ELF_ET_DYN_BASE (TASK_SIZE / 3 * 2)
+#define ELF_ET_DYN_BASE (DEFAULT_MAP_WINDOW / 3 * 2)

/* This yields a mask that user programs can use to figure out what
instruction set this CPU supports. This could be done in user space,
diff --git a/arch/x86/include/asm/mpx.h b/arch/x86/include/asm/mpx.h
index a0d662be4c5b..7d7404756bb4 100644
--- a/arch/x86/include/asm/mpx.h
+++ b/arch/x86/include/asm/mpx.h
@@ -73,6 +73,9 @@ static inline void mpx_mm_init(struct mm_struct *mm)
}
void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long start, unsigned long end);
+
+unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned long len,
+ unsigned long flags);
#else
static inline siginfo_t *mpx_generate_siginfo(struct pt_regs *regs)
{
@@ -94,6 +97,12 @@ static inline void mpx_notify_unmap(struct mm_struct *mm,
unsigned long start, unsigned long end)
{
}
+
+static inline unsigned long mpx_unmapped_area_check(unsigned long addr,
+ unsigned long len, unsigned long flags)
+{
+ return addr;
+}
#endif /* CONFIG_X86_INTEL_MPX */

#endif /* _ASM_X86_MPX_H */
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index f385eca5407a..da8ab4f2d0c7 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -799,6 +799,7 @@ static inline void spin_lock_prefetch(const void *x)
*/
#define TASK_SIZE PAGE_OFFSET
#define TASK_SIZE_MAX TASK_SIZE
+#define DEFAULT_MAP_WINDOW TASK_SIZE
#define STACK_TOP TASK_SIZE
#define STACK_TOP_MAX STACK_TOP

@@ -838,7 +839,9 @@ static inline void spin_lock_prefetch(const void *x)
* particular problem by preventing anything from being mapped
* at the maximum canonical address.
*/
-#define TASK_SIZE_MAX ((1UL << 47) - PAGE_SIZE)
+#define TASK_SIZE_MAX ((1UL << __VIRTUAL_MASK_SHIFT) - PAGE_SIZE)
+
+#define DEFAULT_MAP_WINDOW ((1UL << 47) - PAGE_SIZE)

/* This decides where the kernel will search for a free chunk of vm
* space during mmap's.
@@ -851,7 +854,7 @@ static inline void spin_lock_prefetch(const void *x)
#define TASK_SIZE_OF(child) ((test_tsk_thread_flag(child, TIF_ADDR32)) ? \
IA32_PAGE_OFFSET : TASK_SIZE_MAX)

-#define STACK_TOP TASK_SIZE
+#define STACK_TOP DEFAULT_MAP_WINDOW
#define STACK_TOP_MAX TASK_SIZE_MAX

#define INIT_THREAD { \
@@ -873,7 +876,7 @@ extern void start_thread(struct pt_regs *regs, unsigned long new_ip,
* This decides where the kernel will search for a free chunk of vm
* space during mmap's.
*/
-#define TASK_UNMAPPED_BASE (PAGE_ALIGN(TASK_SIZE / 3))
+#define TASK_UNMAPPED_BASE (PAGE_ALIGN(DEFAULT_MAP_WINDOW / 3))

#define KSTK_EIP(task) (task_pt_regs(task)->ip)

diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c
index 50215a4b9347..bae3706130a6 100644
--- a/arch/x86/kernel/sys_x86_64.c
+++ b/arch/x86/kernel/sys_x86_64.c
@@ -19,6 +19,7 @@

#include <asm/ia32.h>
#include <asm/syscalls.h>
+#include <asm/mpx.h>

/*
* Align a virtual address to avoid aliasing in the I$ on AMD F15h.
@@ -129,6 +130,10 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
struct vm_unmapped_area_info info;
unsigned long begin, end;

+ addr = mpx_unmapped_area_check(addr, len, flags);
+ if (IS_ERR_VALUE(addr))
+ return addr;
+
if (flags & MAP_FIXED)
return addr;

@@ -148,7 +153,16 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
info.flags = 0;
info.length = len;
info.low_limit = begin;
- info.high_limit = end;
+
+ /*
+ * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
+ * in the full address space.
+ */
+ if (addr > DEFAULT_MAP_WINDOW)
+ info.high_limit = min(end, TASK_SIZE);
+ else
+ info.high_limit = min(end, DEFAULT_MAP_WINDOW);
+
info.align_mask = 0;
info.align_offset = pgoff << PAGE_SHIFT;
if (filp) {
@@ -168,6 +182,10 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
unsigned long addr = addr0;
struct vm_unmapped_area_info info;

+ addr = mpx_unmapped_area_check(addr, len, flags);
+ if (IS_ERR_VALUE(addr))
+ return addr;
+
/* requested length too big for entire address space */
if (len > TASK_SIZE)
return -ENOMEM;
@@ -192,6 +210,14 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
info.length = len;
info.low_limit = PAGE_SIZE;
info.high_limit = mm->mmap_base;
+
+ /*
+ * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
+ * in the full address space.
+ */
+ if (addr > DEFAULT_MAP_WINDOW)
+ info.high_limit += TASK_SIZE - DEFAULT_MAP_WINDOW;
+
info.align_mask = 0;
info.align_offset = pgoff << PAGE_SHIFT;
if (filp) {
diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index c5066a260803..94f41a39d8fe 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -16,6 +16,7 @@
#include <asm/tlb.h>
#include <asm/tlbflush.h>
#include <asm/pgalloc.h>
+#include <asm/mpx.h>

#if 0 /* This is just for testing */
struct page *
@@ -83,24 +84,41 @@ static unsigned long hugetlb_get_unmapped_area_bottomup(struct file *file,
info.flags = 0;
info.length = len;
info.low_limit = current->mm->mmap_legacy_base;
- info.high_limit = TASK_SIZE;
+ info.high_limit = DEFAULT_MAP_WINDOW;
+
+ /*
+ * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
+ * in the full address space.
+ */
+ if (addr > DEFAULT_MAP_WINDOW)
+ info.high_limit = TASK_SIZE;
+ else
+ info.high_limit = DEFAULT_MAP_WINDOW;
+
info.align_mask = PAGE_MASK & ~huge_page_mask(h);
info.align_offset = 0;
return vm_unmapped_area(&info);
}

static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
- unsigned long addr0, unsigned long len,
+ unsigned long addr, unsigned long len,
unsigned long pgoff, unsigned long flags)
{
struct hstate *h = hstate_file(file);
struct vm_unmapped_area_info info;
- unsigned long addr;

info.flags = VM_UNMAPPED_AREA_TOPDOWN;
info.length = len;
info.low_limit = PAGE_SIZE;
info.high_limit = current->mm->mmap_base;
+
+ /*
+ * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
+ * in the full address space.
+ */
+ if (addr > DEFAULT_MAP_WINDOW)
+ info.high_limit += TASK_SIZE - DEFAULT_MAP_WINDOW;
+
info.align_mask = PAGE_MASK & ~huge_page_mask(h);
info.align_offset = 0;
addr = vm_unmapped_area(&info);
@@ -115,7 +133,7 @@ static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
VM_BUG_ON(addr != -ENOMEM);
info.flags = 0;
info.low_limit = TASK_UNMAPPED_BASE;
- info.high_limit = TASK_SIZE;
+ info.high_limit = DEFAULT_MAP_WINDOW;
addr = vm_unmapped_area(&info);
}

@@ -132,6 +150,11 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,

if (len & ~huge_page_mask(h))
return -EINVAL;
+
+ addr = mpx_unmapped_area_check(addr, len, flags);
+ if (IS_ERR_VALUE(addr))
+ return addr;
+
if (len > TASK_SIZE)
return -ENOMEM;

diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
index 7940166c799b..2fbfcabd098a 100644
--- a/arch/x86/mm/mmap.c
+++ b/arch/x86/mm/mmap.c
@@ -53,7 +53,7 @@ static unsigned long stack_maxrandom_size(void)
* Leave an at least ~128 MB hole with possible stack randomization.
*/
#define MIN_GAP (128*1024*1024UL + stack_maxrandom_size())
-#define MAX_GAP (TASK_SIZE/6*5)
+#define MAX_GAP (DEFAULT_MAP_WINDOW/6*5)

static int mmap_is_legacy(void)
{
@@ -91,7 +91,7 @@ static unsigned long mmap_base(unsigned long rnd)
else if (gap > MAX_GAP)
gap = MAX_GAP;

- return PAGE_ALIGN(TASK_SIZE - gap - rnd);
+ return PAGE_ALIGN(DEFAULT_MAP_WINDOW - gap - rnd);
}

/*
diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
index 5126dfd52b18..cc318817ce7c 100644
--- a/arch/x86/mm/mpx.c
+++ b/arch/x86/mm/mpx.c
@@ -355,10 +355,19 @@ int mpx_enable_management(void)
*/
bd_base = mpx_get_bounds_dir();
down_write(&mm->mmap_sem);
+
+ /* MPX doesn't support addresses above 47-bits yet. */
+ if (find_vma(mm, DEFAULT_MAP_WINDOW)) {
+ pr_warn_once("%s (%d): MPX cannot handle addresses "
+ "above 47-bits. Disabling.",
+ current->comm, current->pid);
+ ret = -ENXIO;
+ goto out;
+ }
mm->context.bd_addr = bd_base;
if (mm->context.bd_addr == MPX_INVALID_BOUNDS_DIR)
ret = -ENXIO;
-
+out:
up_write(&mm->mmap_sem);
return ret;
}
@@ -1038,3 +1047,25 @@ void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
if (ret)
force_sig(SIGSEGV, current);
}
+
+/* MPX cannot handle addresses above 47-bits yet. */
+unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned long len,
+ unsigned long flags)
+{
+ if (!kernel_managing_mpx_tables(current->mm))
+ return addr;
+ if (addr + len <= DEFAULT_MAP_WINDOW)
+ return addr;
+ if (flags & MAP_FIXED)
+ return -ENOMEM;
+
+ /*
+ * Requested len is larger than whole area we're allowed to map in.
+ * Resetting hinting address wouldn't do much good -- fail early.
+ */
+ if (len > DEFAULT_MAP_WINDOW)
+ return -ENOMEM;
+
+ /* Look for unmap area within DEFAULT_MAP_WINDOW */
+ return 0;
+}
--
2.11.0

2017-03-06 14:04:00

by Kirill A. Shutemov

Subject: [PATCHv4 16/33] x86/mm/pat: handle additional page table

Straightforward extension of the existing code to support an additional
page table level.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/mm/pageattr.c | 56 ++++++++++++++++++++++++++++++++++++--------------
1 file changed, 41 insertions(+), 15 deletions(-)

diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 28d42130243c..eb0ad12cdfde 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -346,6 +346,7 @@ static inline pgprot_t static_protections(pgprot_t prot, unsigned long address,
pte_t *lookup_address_in_pgd(pgd_t *pgd, unsigned long address,
unsigned int *level)
{
+ p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;

@@ -354,7 +355,15 @@ pte_t *lookup_address_in_pgd(pgd_t *pgd, unsigned long address,
if (pgd_none(*pgd))
return NULL;

- pud = pud_offset(pgd, address);
+ p4d = p4d_offset(pgd, address);
+ if (p4d_none(*p4d))
+ return NULL;
+
+ *level = PG_LEVEL_512G;
+ if (p4d_large(*p4d) || !p4d_present(*p4d))
+ return (pte_t *)p4d;
+
+ pud = pud_offset(p4d, address);
if (pud_none(*pud))
return NULL;

@@ -406,13 +415,18 @@ static pte_t *_lookup_address_cpa(struct cpa_data *cpa, unsigned long address,
pmd_t *lookup_pmd_address(unsigned long address)
{
pgd_t *pgd;
+ p4d_t *p4d;
pud_t *pud;

pgd = pgd_offset_k(address);
if (pgd_none(*pgd))
return NULL;

- pud = pud_offset(pgd, address);
+ p4d = p4d_offset(pgd, address);
+ if (p4d_none(*p4d) || p4d_large(*p4d) || !p4d_present(*p4d))
+ return NULL;
+
+ pud = pud_offset(p4d, address);
if (pud_none(*pud) || pud_large(*pud) || !pud_present(*pud))
return NULL;

@@ -477,11 +491,13 @@ static void __set_pmd_pte(pte_t *kpte, unsigned long address, pte_t pte)

list_for_each_entry(page, &pgd_list, lru) {
pgd_t *pgd;
+ p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;

pgd = (pgd_t *)page_address(page) + pgd_index(address);
- pud = pud_offset(pgd, address);
+ p4d = p4d_offset(pgd, address);
+ pud = pud_offset(p4d, address);
pmd = pmd_offset(pud, address);
set_pte_atomic((pte_t *)pmd, pte);
}
@@ -836,9 +852,9 @@ static void unmap_pmd_range(pud_t *pud, unsigned long start, unsigned long end)
pud_clear(pud);
}

-static void unmap_pud_range(pgd_t *pgd, unsigned long start, unsigned long end)
+static void unmap_pud_range(p4d_t *p4d, unsigned long start, unsigned long end)
{
- pud_t *pud = pud_offset(pgd, start);
+ pud_t *pud = pud_offset(p4d, start);

/*
* Not on a GB page boundary?
@@ -1004,8 +1020,8 @@ static long populate_pmd(struct cpa_data *cpa,
return num_pages;
}

-static long populate_pud(struct cpa_data *cpa, unsigned long start, pgd_t *pgd,
- pgprot_t pgprot)
+static int populate_pud(struct cpa_data *cpa, unsigned long start, p4d_t *p4d,
+ pgprot_t pgprot)
{
pud_t *pud;
unsigned long end;
@@ -1026,7 +1042,7 @@ static long populate_pud(struct cpa_data *cpa, unsigned long start, pgd_t *pgd,
cur_pages = (pre_end - start) >> PAGE_SHIFT;
cur_pages = min_t(int, (int)cpa->numpages, cur_pages);

- pud = pud_offset(pgd, start);
+ pud = pud_offset(p4d, start);

/*
* Need a PMD page?
@@ -1047,7 +1063,7 @@ static long populate_pud(struct cpa_data *cpa, unsigned long start, pgd_t *pgd,
if (cpa->numpages == cur_pages)
return cur_pages;

- pud = pud_offset(pgd, start);
+ pud = pud_offset(p4d, start);
pud_pgprot = pgprot_4k_2_large(pgprot);

/*
@@ -1067,7 +1083,7 @@ static long populate_pud(struct cpa_data *cpa, unsigned long start, pgd_t *pgd,
if (start < end) {
long tmp;

- pud = pud_offset(pgd, start);
+ pud = pud_offset(p4d, start);
if (pud_none(*pud))
if (alloc_pmd_page(pud))
return -1;
@@ -1090,33 +1106,43 @@ static int populate_pgd(struct cpa_data *cpa, unsigned long addr)
{
pgprot_t pgprot = __pgprot(_KERNPG_TABLE);
pud_t *pud = NULL; /* shut up gcc */
+ p4d_t *p4d;
pgd_t *pgd_entry;
long ret;

pgd_entry = cpa->pgd + pgd_index(addr);

+ if (pgd_none(*pgd_entry)) {
+ p4d = (p4d_t *)get_zeroed_page(GFP_KERNEL | __GFP_NOTRACK);
+ if (!p4d)
+ return -1;
+
+ set_pgd(pgd_entry, __pgd(__pa(p4d) | _KERNPG_TABLE));
+ }
+
/*
- * Allocate a PUD page and hand it down for mapping.
+ * Allocate a P4D page and hand it down for mapping.
*/
- if (pgd_none(*pgd_entry)) {
+ p4d = p4d_offset(pgd_entry, addr);
+ if (p4d_none(*p4d)) {
pud = (pud_t *)get_zeroed_page(GFP_KERNEL | __GFP_NOTRACK);
if (!pud)
return -1;

- set_pgd(pgd_entry, __pgd(__pa(pud) | _KERNPG_TABLE));
+ set_p4d(p4d, __p4d(__pa(pud) | _KERNPG_TABLE));
}

pgprot_val(pgprot) &= ~pgprot_val(cpa->mask_clr);
pgprot_val(pgprot) |= pgprot_val(cpa->mask_set);

- ret = populate_pud(cpa, addr, pgd_entry, pgprot);
+ ret = populate_pud(cpa, addr, p4d, pgprot);
if (ret < 0) {
/*
* Leave the PUD page in place in case some other CPU or thread
* already found it, but remove any useless entries we just
* added to it.
*/
- unmap_pud_range(pgd_entry, addr,
+ unmap_pud_range(p4d, addr,
addr + (cpa->numpages << PAGE_SHIFT));
return ret;
}
--
2.11.0

2017-03-06 14:04:40

by Kirill A. Shutemov

Subject: [PATCHv4 10/33] x86/gup: add 5-level paging support

This is simply an extension to handle one more page table level.
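
For orientation, the extra level means a virtual address is split into
five 9-bit table indexes instead of four, and the walk visits them in
the order pgd, p4d, pud, pmd, pte. The sketch below assumes the x86-64
5-level layout with 4 KiB pages (shifts of 48, 39, 30, 21 and 12); the
sketch_* names are illustrative and are not kernel symbols.

```c
#include <stdint.h>

#define SKETCH_INDEX_MASK 0x1ffu	/* 9 bits, 512 entries per table */

struct sketch_indexes {
	unsigned int pgd, p4d, pud, pmd, pte;
};

/* Split a virtual address into its per-level table indexes. */
static struct sketch_indexes sketch_split(uint64_t addr)
{
	struct sketch_indexes idx;

	idx.pgd = (unsigned int)((addr >> 48) & SKETCH_INDEX_MASK);
	idx.p4d = (unsigned int)((addr >> 39) & SKETCH_INDEX_MASK);
	idx.pud = (unsigned int)((addr >> 30) & SKETCH_INDEX_MASK);
	idx.pmd = (unsigned int)((addr >> 21) & SKETCH_INDEX_MASK);
	idx.pte = (unsigned int)((addr >> 12) & SKETCH_INDEX_MASK);
	return idx;
}
```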

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/mm/gup.c | 33 +++++++++++++++++++++++++++------
1 file changed, 27 insertions(+), 6 deletions(-)

diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index 99c7805a9693..eb407cf0f6d3 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -76,9 +76,9 @@ static void undo_dev_pagemap(int *nr, int nr_start, struct page **pages)
}

/*
- * 'pteval' can come from a pte, pmd or pud. We only check
+ * 'pteval' can come from a pte, pmd, pud or p4d. We only check
* _PAGE_PRESENT, _PAGE_USER, and _PAGE_RW in here which are the
- * same value on all 3 types.
+ * same value on all 4 types.
*/
static inline int pte_allows_gup(unsigned long pteval, int write)
{
@@ -290,13 +290,13 @@ static noinline int gup_huge_pud(pud_t pud, unsigned long addr,
return 1;
}

-static int gup_pud_range(pgd_t pgd, unsigned long addr, unsigned long end,
+static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
int write, struct page **pages, int *nr)
{
unsigned long next;
pud_t *pudp;

- pudp = pud_offset(&pgd, addr);
+ pudp = pud_offset(&p4d, addr);
do {
pud_t pud = *pudp;

@@ -315,6 +315,27 @@ static int gup_pud_range(pgd_t pgd, unsigned long addr, unsigned long end,
return 1;
}

+static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end,
+ int write, struct page **pages, int *nr)
+{
+ unsigned long next;
+ p4d_t *p4dp;
+
+ p4dp = p4d_offset(&pgd, addr);
+ do {
+ p4d_t p4d = *p4dp;
+
+ next = p4d_addr_end(addr, end);
+ if (p4d_none(p4d))
+ return 0;
+ BUILD_BUG_ON(p4d_large(p4d));
+ if (!gup_pud_range(p4d, addr, next, write, pages, nr))
+ return 0;
+ } while (p4dp++, addr = next, addr != end);
+
+ return 1;
+}
+
/*
* Like get_user_pages_fast() except its IRQ-safe in that it won't fall
* back to the regular GUP.
@@ -363,7 +384,7 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
next = pgd_addr_end(addr, end);
if (pgd_none(pgd))
break;
- if (!gup_pud_range(pgd, addr, next, write, pages, &nr))
+ if (!gup_p4d_range(pgd, addr, next, write, pages, &nr))
break;
} while (pgdp++, addr = next, addr != end);
local_irq_restore(flags);
@@ -435,7 +456,7 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
next = pgd_addr_end(addr, end);
if (pgd_none(pgd))
goto slow;
- if (!gup_pud_range(pgd, addr, next, write, pages, &nr))
+ if (!gup_p4d_range(pgd, addr, next, write, pages, &nr))
goto slow;
} while (pgdp++, addr = next, addr != end);
local_irq_enable();
--
2.11.0

2017-03-06 14:05:14

by Kirill A. Shutemov

Subject: [PATCHv4 15/33] x86/efi: handle p4d in EFI pagetables

Allocate an additional page table level and change
efi_sync_low_kernel_mappings() so that the syncing logic works with the
additional page table level.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Matt Fleming <[email protected]>
---
arch/x86/platform/efi/efi_64.c | 33 +++++++++++++++++++++++----------
1 file changed, 23 insertions(+), 10 deletions(-)

diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
index 8544dae3d1b4..34d019f75239 100644
--- a/arch/x86/platform/efi/efi_64.c
+++ b/arch/x86/platform/efi/efi_64.c
@@ -135,6 +135,7 @@ static pgd_t *efi_pgd;
int __init efi_alloc_page_tables(void)
{
pgd_t *pgd;
+ p4d_t *p4d;
pud_t *pud;
gfp_t gfp_mask;

@@ -147,15 +148,20 @@ int __init efi_alloc_page_tables(void)
return -ENOMEM;

pgd = efi_pgd + pgd_index(EFI_VA_END);
+ p4d = p4d_alloc(&init_mm, pgd, EFI_VA_END);
+ if (!p4d) {
+ free_page((unsigned long)efi_pgd);
+ return -ENOMEM;
+ }

- pud = pud_alloc_one(NULL, 0);
+ pud = pud_alloc(&init_mm, p4d, EFI_VA_END);
if (!pud) {
+ if (CONFIG_PGTABLE_LEVELS > 4)
+ free_page((unsigned long) pgd_page_vaddr(*pgd));
free_page((unsigned long)efi_pgd);
return -ENOMEM;
}

- pgd_populate(NULL, pgd, pud);
-
return 0;
}

@@ -190,6 +196,18 @@ void efi_sync_low_kernel_mappings(void)
num_entries = pgd_index(EFI_VA_END) - pgd_index(PAGE_OFFSET);
memcpy(pgd_efi, pgd_k, sizeof(pgd_t) * num_entries);

+ /* The same story as with PGD entries */
+ BUILD_BUG_ON(p4d_index(EFI_VA_END) != p4d_index(MODULES_END));
+ BUILD_BUG_ON((EFI_VA_START & P4D_MASK) != (EFI_VA_END & P4D_MASK));
+
+ pgd_efi = efi_pgd + pgd_index(EFI_VA_END);
+ pgd_k = pgd_offset_k(EFI_VA_END);
+ p4d_efi = p4d_offset(pgd_efi, 0);
+ p4d_k = p4d_offset(pgd_k, 0);
+
+ num_entries = p4d_index(EFI_VA_END);
+ memcpy(p4d_efi, p4d_k, sizeof(p4d_t) * num_entries);
+
/*
* We share all the PUD entries apart from those that map the
* EFI regions. Copy around them.
@@ -197,20 +215,15 @@ void efi_sync_low_kernel_mappings(void)
BUILD_BUG_ON((EFI_VA_START & ~PUD_MASK) != 0);
BUILD_BUG_ON((EFI_VA_END & ~PUD_MASK) != 0);

- pgd_efi = efi_pgd + pgd_index(EFI_VA_END);
- p4d_efi = p4d_offset(pgd_efi, 0);
+ p4d_efi = p4d_offset(pgd_efi, EFI_VA_END);
+ p4d_k = p4d_offset(pgd_k, EFI_VA_END);
pud_efi = pud_offset(p4d_efi, 0);
-
- pgd_k = pgd_offset_k(EFI_VA_END);
- p4d_k = p4d_offset(pgd_k, 0);
pud_k = pud_offset(p4d_k, 0);

num_entries = pud_index(EFI_VA_END);
memcpy(pud_efi, pud_k, sizeof(pud_t) * num_entries);

- p4d_efi = p4d_offset(pgd_efi, EFI_VA_START);
pud_efi = pud_offset(p4d_efi, EFI_VA_START);
- p4d_k = p4d_offset(pgd_k, EFI_VA_START);
pud_k = pud_offset(p4d_k, EFI_VA_START);

num_entries = PTRS_PER_PUD - pud_index(EFI_VA_START);
--
2.11.0

2017-03-06 14:06:43

by Kirill A. Shutemov

Subject: [PATCHv4 09/33] x86: trivial portion of 5-level paging conversion

This patch covers simple cases only.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/kernel/tboot.c | 6 +++++-
arch/x86/kernel/vm86_32.c | 6 +++++-
arch/x86/mm/fault.c | 39 +++++++++++++++++++++++++++++++++------
arch/x86/mm/init_32.c | 22 ++++++++++++++++------
arch/x86/mm/ioremap.c | 3 ++-
arch/x86/mm/pgtable.c | 4 +++-
arch/x86/mm/pgtable_32.c | 8 +++++++-
arch/x86/platform/efi/efi_64.c | 13 +++++++++----
arch/x86/power/hibernate_32.c | 7 +++++--
9 files changed, 85 insertions(+), 23 deletions(-)

diff --git a/arch/x86/kernel/tboot.c b/arch/x86/kernel/tboot.c
index b868fa1b812b..5db0f33cbf2c 100644
--- a/arch/x86/kernel/tboot.c
+++ b/arch/x86/kernel/tboot.c
@@ -118,12 +118,16 @@ static int map_tboot_page(unsigned long vaddr, unsigned long pfn,
pgprot_t prot)
{
pgd_t *pgd;
+ p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
pte_t *pte;

pgd = pgd_offset(&tboot_mm, vaddr);
- pud = pud_alloc(&tboot_mm, pgd, vaddr);
+ p4d = p4d_alloc(&tboot_mm, pgd, vaddr);
+ if (!p4d)
+ return -1;
+ pud = pud_alloc(&tboot_mm, p4d, vaddr);
if (!pud)
return -1;
pmd = pmd_alloc(&tboot_mm, pud, vaddr);
diff --git a/arch/x86/kernel/vm86_32.c b/arch/x86/kernel/vm86_32.c
index 23ee89ce59a9..62597c300d94 100644
--- a/arch/x86/kernel/vm86_32.c
+++ b/arch/x86/kernel/vm86_32.c
@@ -164,6 +164,7 @@ static void mark_screen_rdonly(struct mm_struct *mm)
struct vm_area_struct *vma;
spinlock_t *ptl;
pgd_t *pgd;
+ p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
pte_t *pte;
@@ -173,7 +174,10 @@ static void mark_screen_rdonly(struct mm_struct *mm)
pgd = pgd_offset(mm, 0xA0000);
if (pgd_none_or_clear_bad(pgd))
goto out;
- pud = pud_offset(pgd, 0xA0000);
+ p4d = p4d_offset(pgd, 0xA0000);
+ if (p4d_none_or_clear_bad(p4d))
+ goto out;
+ pud = pud_offset(p4d, 0xA0000);
if (pud_none_or_clear_bad(pud))
goto out;
pmd = pmd_offset(pud, 0xA0000);
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 428e31763cb9..605fd5e8e048 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -253,6 +253,7 @@ static inline pmd_t *vmalloc_sync_one(pgd_t *pgd, unsigned long address)
{
unsigned index = pgd_index(address);
pgd_t *pgd_k;
+ p4d_t *p4d, *p4d_k;
pud_t *pud, *pud_k;
pmd_t *pmd, *pmd_k;

@@ -265,10 +266,15 @@ static inline pmd_t *vmalloc_sync_one(pgd_t *pgd, unsigned long address)
/*
* set_pgd(pgd, *pgd_k); here would be useless on PAE
* and redundant with the set_pmd() on non-PAE. As would
- * set_pud.
+ * set_p4d/set_pud.
*/
- pud = pud_offset(pgd, address);
- pud_k = pud_offset(pgd_k, address);
+ p4d = p4d_offset(pgd, address);
+ p4d_k = p4d_offset(pgd_k, address);
+ if (!p4d_present(*p4d_k))
+ return NULL;
+
+ pud = pud_offset(p4d, address);
+ pud_k = pud_offset(p4d_k, address);
if (!pud_present(*pud_k))
return NULL;

@@ -384,6 +390,8 @@ static void dump_pagetable(unsigned long address)
{
pgd_t *base = __va(read_cr3());
pgd_t *pgd = &base[pgd_index(address)];
+ p4d_t *p4d;
+ pud_t *pud;
pmd_t *pmd;
pte_t *pte;

@@ -392,7 +400,9 @@ static void dump_pagetable(unsigned long address)
if (!low_pfn(pgd_val(*pgd) >> PAGE_SHIFT) || !pgd_present(*pgd))
goto out;
#endif
- pmd = pmd_offset(pud_offset(pgd, address), address);
+ p4d = p4d_offset(pgd, address);
+ pud = pud_offset(p4d, address);
+ pmd = pmd_offset(pud, address);
printk(KERN_CONT "*pde = %0*Lx ", sizeof(*pmd) * 2, (u64)pmd_val(*pmd));

/*
@@ -526,6 +536,7 @@ static void dump_pagetable(unsigned long address)
{
pgd_t *base = __va(read_cr3() & PHYSICAL_PAGE_MASK);
pgd_t *pgd = base + pgd_index(address);
+ p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
pte_t *pte;
@@ -538,7 +549,15 @@ static void dump_pagetable(unsigned long address)
if (!pgd_present(*pgd))
goto out;

- pud = pud_offset(pgd, address);
+ p4d = p4d_offset(pgd, address);
+ if (bad_address(p4d))
+ goto bad;
+
+ printk("P4D %lx ", p4d_val(*p4d));
+ if (!p4d_present(*p4d) || p4d_large(*p4d))
+ goto out;
+
+ pud = pud_offset(p4d, address);
if (bad_address(pud))
goto bad;

@@ -1082,6 +1101,7 @@ static noinline int
spurious_fault(unsigned long error_code, unsigned long address)
{
pgd_t *pgd;
+ p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
pte_t *pte;
@@ -1104,7 +1124,14 @@ spurious_fault(unsigned long error_code, unsigned long address)
if (!pgd_present(*pgd))
return 0;

- pud = pud_offset(pgd, address);
+ p4d = p4d_offset(pgd, address);
+ if (!p4d_present(*p4d))
+ return 0;
+
+ if (p4d_large(*p4d))
+ return spurious_fault_check(error_code, (pte_t *) p4d);
+
+ pud = pud_offset(p4d, address);
if (!pud_present(*pud))
return 0;

diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index 2b4b53e6793f..5ed3c141bbd5 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -67,6 +67,7 @@ bool __read_mostly __vmalloc_start_set = false;
*/
static pmd_t * __init one_md_table_init(pgd_t *pgd)
{
+ p4d_t *p4d;
pud_t *pud;
pmd_t *pmd_table;

@@ -75,13 +76,15 @@ static pmd_t * __init one_md_table_init(pgd_t *pgd)
pmd_table = (pmd_t *)alloc_low_page();
paravirt_alloc_pmd(&init_mm, __pa(pmd_table) >> PAGE_SHIFT);
set_pgd(pgd, __pgd(__pa(pmd_table) | _PAGE_PRESENT));
- pud = pud_offset(pgd, 0);
+ p4d = p4d_offset(pgd, 0);
+ pud = pud_offset(p4d, 0);
BUG_ON(pmd_table != pmd_offset(pud, 0));

return pmd_table;
}
#endif
- pud = pud_offset(pgd, 0);
+ p4d = p4d_offset(pgd, 0);
+ pud = pud_offset(p4d, 0);
pmd_table = pmd_offset(pud, 0);

return pmd_table;
@@ -390,8 +393,11 @@ pte_t *kmap_pte;

static inline pte_t *kmap_get_fixmap_pte(unsigned long vaddr)
{
- return pte_offset_kernel(pmd_offset(pud_offset(pgd_offset_k(vaddr),
- vaddr), vaddr), vaddr);
+ pgd_t *pgd = pgd_offset_k(vaddr);
+ p4d_t *p4d = p4d_offset(pgd, vaddr);
+ pud_t *pud = pud_offset(p4d, vaddr);
+ pmd_t *pmd = pmd_offset(pud, vaddr);
+ return pte_offset_kernel(pmd, vaddr);
}

static void __init kmap_init(void)
@@ -410,6 +416,7 @@ static void __init permanent_kmaps_init(pgd_t *pgd_base)
{
unsigned long vaddr;
pgd_t *pgd;
+ p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
pte_t *pte;
@@ -418,7 +425,8 @@ static void __init permanent_kmaps_init(pgd_t *pgd_base)
page_table_range_init(vaddr, vaddr + PAGE_SIZE*LAST_PKMAP, pgd_base);

pgd = swapper_pg_dir + pgd_index(vaddr);
- pud = pud_offset(pgd, vaddr);
+ p4d = p4d_offset(pgd, vaddr);
+ pud = pud_offset(p4d, vaddr);
pmd = pmd_offset(pud, vaddr);
pte = pte_offset_kernel(pmd, vaddr);
pkmap_page_table = pte;
@@ -450,6 +458,7 @@ void __init native_pagetable_init(void)
{
unsigned long pfn, va;
pgd_t *pgd, *base = swapper_pg_dir;
+ p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
pte_t *pte;
@@ -469,7 +478,8 @@ void __init native_pagetable_init(void)
if (!pgd_present(*pgd))
break;

- pud = pud_offset(pgd, va);
+ p4d = p4d_offset(pgd, va);
+ pud = pud_offset(p4d, va);
pmd = pmd_offset(pud, va);
if (!pmd_present(*pmd))
break;
diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
index 7aaa2635862d..a5e1cda85974 100644
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -425,7 +425,8 @@ static inline pmd_t * __init early_ioremap_pmd(unsigned long addr)
/* Don't assume we're using swapper_pg_dir at this point */
pgd_t *base = __va(read_cr3());
pgd_t *pgd = &base[pgd_index(addr)];
- pud_t *pud = pud_offset(pgd, addr);
+ p4d_t *p4d = p4d_offset(pgd, addr);
+ pud_t *pud = pud_offset(p4d, addr);
pmd_t *pmd = pmd_offset(pud, addr);

return pmd;
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 6cbdff26bb96..38b6daf72deb 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -261,13 +261,15 @@ static void pgd_mop_up_pmds(struct mm_struct *mm, pgd_t *pgdp)

static void pgd_prepopulate_pmd(struct mm_struct *mm, pgd_t *pgd, pmd_t *pmds[])
{
+ p4d_t *p4d;
pud_t *pud;
int i;

if (PREALLOCATED_PMDS == 0) /* Work around gcc-3.4.x bug */
return;

- pud = pud_offset(pgd, 0);
+ p4d = p4d_offset(pgd, 0);
+ pud = pud_offset(p4d, 0);

for (i = 0; i < PREALLOCATED_PMDS; i++, pud++) {
pmd_t *pmd = pmds[i];
diff --git a/arch/x86/mm/pgtable_32.c b/arch/x86/mm/pgtable_32.c
index 9adce776852b..3d275a791c76 100644
--- a/arch/x86/mm/pgtable_32.c
+++ b/arch/x86/mm/pgtable_32.c
@@ -26,6 +26,7 @@ unsigned int __VMALLOC_RESERVE = 128 << 20;
void set_pte_vaddr(unsigned long vaddr, pte_t pteval)
{
pgd_t *pgd;
+ p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
pte_t *pte;
@@ -35,7 +36,12 @@ void set_pte_vaddr(unsigned long vaddr, pte_t pteval)
BUG();
return;
}
- pud = pud_offset(pgd, vaddr);
+ p4d = p4d_offset(pgd, vaddr);
+ if (p4d_none(*p4d)) {
+ BUG();
+ return;
+ }
+ pud = pud_offset(p4d, vaddr);
if (pud_none(*pud)) {
BUG();
return;
diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
index a4695da42d77..8544dae3d1b4 100644
--- a/arch/x86/platform/efi/efi_64.c
+++ b/arch/x86/platform/efi/efi_64.c
@@ -166,6 +166,7 @@ void efi_sync_low_kernel_mappings(void)
{
unsigned num_entries;
pgd_t *pgd_k, *pgd_efi;
+ p4d_t *p4d_k, *p4d_efi;
pud_t *pud_k, *pud_efi;

if (efi_enabled(EFI_OLD_MEMMAP))
@@ -197,16 +198,20 @@ void efi_sync_low_kernel_mappings(void)
BUILD_BUG_ON((EFI_VA_END & ~PUD_MASK) != 0);

pgd_efi = efi_pgd + pgd_index(EFI_VA_END);
- pud_efi = pud_offset(pgd_efi, 0);
+ p4d_efi = p4d_offset(pgd_efi, 0);
+ pud_efi = pud_offset(p4d_efi, 0);

pgd_k = pgd_offset_k(EFI_VA_END);
- pud_k = pud_offset(pgd_k, 0);
+ p4d_k = p4d_offset(pgd_k, 0);
+ pud_k = pud_offset(p4d_k, 0);

num_entries = pud_index(EFI_VA_END);
memcpy(pud_efi, pud_k, sizeof(pud_t) * num_entries);

- pud_efi = pud_offset(pgd_efi, EFI_VA_START);
- pud_k = pud_offset(pgd_k, EFI_VA_START);
+ p4d_efi = p4d_offset(pgd_efi, EFI_VA_START);
+ pud_efi = pud_offset(p4d_efi, EFI_VA_START);
+ p4d_k = p4d_offset(pgd_k, EFI_VA_START);
+ pud_k = pud_offset(p4d_k, EFI_VA_START);

num_entries = PTRS_PER_PUD - pud_index(EFI_VA_START);
memcpy(pud_efi, pud_k, sizeof(pud_t) * num_entries);
diff --git a/arch/x86/power/hibernate_32.c b/arch/x86/power/hibernate_32.c
index 9f14bd34581d..c35fdb585c68 100644
--- a/arch/x86/power/hibernate_32.c
+++ b/arch/x86/power/hibernate_32.c
@@ -32,6 +32,7 @@ pgd_t *resume_pg_dir;
*/
static pmd_t *resume_one_md_table_init(pgd_t *pgd)
{
+ p4d_t *p4d;
pud_t *pud;
pmd_t *pmd_table;

@@ -41,11 +42,13 @@ static pmd_t *resume_one_md_table_init(pgd_t *pgd)
return NULL;

set_pgd(pgd, __pgd(__pa(pmd_table) | _PAGE_PRESENT));
- pud = pud_offset(pgd, 0);
+ p4d = p4d_offset(pgd, 0);
+ pud = pud_offset(p4d, 0);

BUG_ON(pmd_table != pmd_offset(pud, 0));
#else
- pud = pud_offset(pgd, 0);
+ p4d = p4d_offset(pgd, 0);
+ pud = pud_offset(p4d, 0);
pmd_table = pmd_offset(pud, 0);
#endif

--
2.11.0
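
The call sites converted above all follow one pattern: a p4d_offset() step is inserted between the pgd and pud lookups. On configurations without a real fifth level, that extra step is expected to cost nothing. Below is a minimal user-space sketch of how such folding can work, assuming a struct-wrapper scheme in the style of <asm-generic/pgtable-nop4d.h>. The types and helpers reuse kernel names but are simplified stand-ins, not the kernel definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-ins for the kernel types; only the folding mechanics are modelled. */
typedef struct { uint64_t pgd; } pgd_t;

/* Folded level: a p4d entry is just a wrapper around the pgd entry. */
typedef struct { pgd_t pgd; } p4d_t;

/*
 * With the level folded there is no separate p4d table to index into,
 * so "descending" from pgd to p4d is only a change of pointer type.
 */
static inline p4d_t *p4d_offset(pgd_t *pgd, unsigned long address)
{
	(void)address;		/* unused when the level is folded */
	assert(pgd != NULL);
	return (p4d_t *)pgd;
}

/* Reading a folded p4d entry reads the underlying pgd entry. */
static inline uint64_t p4d_val(p4d_t p4d)
{
	return p4d.pgd.pgd;
}

/* An entry is "none" in this model when no bits are set. */
static inline int p4d_none(p4d_t p4d)
{
	return p4d_val(p4d) == 0;
}
```

In this model a caller can write pgd -> p4d -> pud unconditionally, and the p4d step leaves the pointer value unchanged.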

2017-03-06 14:06:34

by Kirill A. Shutemov

Subject: [PATCHv4 08/33] x86: basic changes into headers for 5-level paging

This patch extends the x86 headers to enable 5-level paging support.

It is still based on <asm-generic/5level-fixup.h>; a later patch in the
series switches to <asm-generic/pgtable-nop4d.h>.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/include/asm/pgtable-2level_types.h | 1 +
arch/x86/include/asm/pgtable-3level_types.h | 1 +
arch/x86/include/asm/pgtable.h | 26 ++++++++++++++++++++-----
arch/x86/include/asm/pgtable_64_types.h | 1 +
arch/x86/include/asm/pgtable_types.h | 30 ++++++++++++++++++++++++++++-
5 files changed, 53 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/pgtable-2level_types.h b/arch/x86/include/asm/pgtable-2level_types.h
index 392576433e77..373ab1de909f 100644
--- a/arch/x86/include/asm/pgtable-2level_types.h
+++ b/arch/x86/include/asm/pgtable-2level_types.h
@@ -7,6 +7,7 @@
typedef unsigned long pteval_t;
typedef unsigned long pmdval_t;
typedef unsigned long pudval_t;
+typedef unsigned long p4dval_t;
typedef unsigned long pgdval_t;
typedef unsigned long pgprotval_t;

diff --git a/arch/x86/include/asm/pgtable-3level_types.h b/arch/x86/include/asm/pgtable-3level_types.h
index bcc89625ebe5..b8a4341faafa 100644
--- a/arch/x86/include/asm/pgtable-3level_types.h
+++ b/arch/x86/include/asm/pgtable-3level_types.h
@@ -7,6 +7,7 @@
typedef u64 pteval_t;
typedef u64 pmdval_t;
typedef u64 pudval_t;
+typedef u64 p4dval_t;
typedef u64 pgdval_t;
typedef u64 pgprotval_t;

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 1cfb36b8c024..6f6f351e0a81 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -179,6 +179,17 @@ static inline unsigned long pud_pfn(pud_t pud)
return (pud_val(pud) & pud_pfn_mask(pud)) >> PAGE_SHIFT;
}

+static inline unsigned long p4d_pfn(p4d_t p4d)
+{
+ return (p4d_val(p4d) & p4d_pfn_mask(p4d)) >> PAGE_SHIFT;
+}
+
+static inline int p4d_large(p4d_t p4d)
+{
+ /* No 512 GiB pages yet */
+ return 0;
+}
+
#define pte_page(pte) pfn_to_page(pte_pfn(pte))

static inline int pmd_large(pmd_t pte)
@@ -770,6 +781,16 @@ static inline int pud_large(pud_t pud)
}
#endif /* CONFIG_PGTABLE_LEVELS > 2 */

+static inline unsigned long pud_index(unsigned long address)
+{
+ return (address >> PUD_SHIFT) & (PTRS_PER_PUD - 1);
+}
+
+static inline unsigned long p4d_index(unsigned long address)
+{
+ return (address >> P4D_SHIFT) & (PTRS_PER_P4D - 1);
+}
+
#if CONFIG_PGTABLE_LEVELS > 3
static inline int pgd_present(pgd_t pgd)
{
@@ -788,11 +809,6 @@ static inline unsigned long pgd_page_vaddr(pgd_t pgd)
#define pgd_page(pgd) pfn_to_page(pgd_val(pgd) >> PAGE_SHIFT)

/* to find an entry in a page-table-directory. */
-static inline unsigned long pud_index(unsigned long address)
-{
- return (address >> PUD_SHIFT) & (PTRS_PER_PUD - 1);
-}
-
static inline pud_t *pud_offset(pgd_t *pgd, unsigned long address)
{
return (pud_t *)pgd_page_vaddr(*pgd) + pud_index(address);
diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
index 3a264200c62f..0b2797e5083c 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -13,6 +13,7 @@
typedef unsigned long pteval_t;
typedef unsigned long pmdval_t;
typedef unsigned long pudval_t;
+typedef unsigned long p4dval_t;
typedef unsigned long pgdval_t;
typedef unsigned long pgprotval_t;

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 62484333673d..df08535f774a 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -272,9 +272,20 @@ static inline pgdval_t pgd_flags(pgd_t pgd)
return native_pgd_val(pgd) & PTE_FLAGS_MASK;
}

-#if CONFIG_PGTABLE_LEVELS > 3
+#if CONFIG_PGTABLE_LEVELS > 4
+
+#error FIXME
+
+#else
#include <asm-generic/5level-fixup.h>

+static inline p4dval_t native_p4d_val(p4d_t p4d)
+{
+ return native_pgd_val(p4d);
+}
+#endif
+
+#if CONFIG_PGTABLE_LEVELS > 3
typedef struct { pudval_t pud; } pud_t;

static inline pud_t native_make_pud(pmdval_t val)
@@ -318,6 +329,22 @@ static inline pmdval_t native_pmd_val(pmd_t pmd)
}
#endif

+static inline p4dval_t p4d_pfn_mask(p4d_t p4d)
+{
+ /* No 512 GiB huge pages yet */
+ return PTE_PFN_MASK;
+}
+
+static inline p4dval_t p4d_flags_mask(p4d_t p4d)
+{
+ return ~p4d_pfn_mask(p4d);
+}
+
+static inline p4dval_t p4d_flags(p4d_t p4d)
+{
+ return native_p4d_val(p4d) & p4d_flags_mask(p4d);
+}
+
static inline pudval_t pud_pfn_mask(pud_t pud)
{
if (native_pud_val(pud) & _PAGE_PSE)
@@ -461,6 +488,7 @@ enum pg_level {
PG_LEVEL_4K,
PG_LEVEL_2M,
PG_LEVEL_1G,
+ PG_LEVEL_512G,
PG_LEVEL_NUM
};

--
2.11.0
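
The pud_index() and p4d_index() helpers in the patch above both extract one table index from a virtual address with a shift and a mask. Below is a small user-space sketch of that arithmetic for a full 5-level layout. The shift values (9 bits per level, 4 KiB pages) are an assumption based on the 57-bit address space described in the cover letter. They are not taken from this patch, which still folds the p4d level. level_index() is a hypothetical helper introduced only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout for real 5-level paging: 4 KiB pages, 9 index bits per level. */
#define PAGE_SHIFT	12
#define PMD_SHIFT	21
#define PUD_SHIFT	30
#define P4D_SHIFT	39
#define PGDIR_SHIFT	48
#define PTRS_PER_LEVEL	512	/* 2^9 entries in every table */

/* Hypothetical helper: extract the 9-bit table index that starts at 'shift'. */
static inline unsigned int level_index(uint64_t address, unsigned int shift)
{
	assert(shift >= PAGE_SHIFT && shift <= PGDIR_SHIFT);
	return (unsigned int)((address >> shift) & (PTRS_PER_LEVEL - 1));
}

static inline unsigned int pud_index(uint64_t address)
{
	return level_index(address, PUD_SHIFT);
}

static inline unsigned int p4d_index(uint64_t address)
{
	return level_index(address, P4D_SHIFT);
}

static inline unsigned int pgd_index(uint64_t address)
{
	return level_index(address, PGDIR_SHIFT);
}
```

Under these assumptions, each level consumes the next nine address bits above the 12-bit page offset, so the five indices plus the offset cover 57 bits.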

2017-03-06 14:07:16

by Kirill A. Shutemov

Subject: [PATCHv4 19/33] x86: convert the rest of the code to support p4d_t

This patch converts x86 to use proper folding of the new page table level
with <asm-generic/pgtable-nop4d.h>.

It's a bit of a kitchen sink, but I don't see how to split it further.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/include/asm/paravirt.h | 33 +++++-
arch/x86/include/asm/paravirt_types.h | 12 ++-
arch/x86/include/asm/pgalloc.h | 35 ++++++-
arch/x86/include/asm/pgtable.h | 59 ++++++++++-
arch/x86/include/asm/pgtable_64.h | 12 ++-
arch/x86/include/asm/pgtable_types.h | 10 +-
arch/x86/include/asm/xen/page.h | 8 +-
arch/x86/kernel/paravirt.c | 10 +-
arch/x86/mm/init_64.c | 183 +++++++++++++++++++++++++++-------
arch/x86/xen/mmu.c | 152 ++++++++++++++++------------
include/trace/events/xen.h | 28 +++---
11 files changed, 401 insertions(+), 141 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 0489884fdc44..158d877ce9e9 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -536,7 +536,7 @@ static inline void set_pud(pud_t *pudp, pud_t pud)
PVOP_VCALL2(pv_mmu_ops.set_pud, pudp,
val);
}
-#if CONFIG_PGTABLE_LEVELS == 4
+#if CONFIG_PGTABLE_LEVELS >= 4
static inline pud_t __pud(pudval_t val)
{
pudval_t ret;
@@ -565,6 +565,32 @@ static inline pudval_t pud_val(pud_t pud)
return ret;
}

+static inline void pud_clear(pud_t *pudp)
+{
+ set_pud(pudp, __pud(0));
+}
+
+static inline void set_p4d(p4d_t *p4dp, p4d_t p4d)
+{
+ p4dval_t val = native_p4d_val(p4d);
+
+ if (sizeof(p4dval_t) > sizeof(long))
+ PVOP_VCALL3(pv_mmu_ops.set_p4d, p4dp,
+ val, (u64)val >> 32);
+ else
+ PVOP_VCALL2(pv_mmu_ops.set_p4d, p4dp,
+ val);
+}
+
+static inline void p4d_clear(p4d_t *p4dp)
+{
+ set_p4d(p4dp, __p4d(0));
+}
+
+#if CONFIG_PGTABLE_LEVELS >= 5
+
+#error FIXME
+
static inline void set_pgd(pgd_t *pgdp, pgd_t pgd)
{
pgdval_t val = native_pgd_val(pgd);
@@ -582,10 +608,7 @@ static inline void pgd_clear(pgd_t *pgdp)
set_pgd(pgdp, __pgd(0));
}

-static inline void pud_clear(pud_t *pudp)
-{
- set_pud(pudp, __pud(0));
-}
+#endif /* CONFIG_PGTABLE_LEVELS >= 5 */

#endif /* CONFIG_PGTABLE_LEVELS == 4 */

diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index b060f962d581..93c49cf09b63 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -279,12 +279,18 @@ struct pv_mmu_ops {
struct paravirt_callee_save pmd_val;
struct paravirt_callee_save make_pmd;

-#if CONFIG_PGTABLE_LEVELS == 4
+#if CONFIG_PGTABLE_LEVELS >= 4
struct paravirt_callee_save pud_val;
struct paravirt_callee_save make_pud;

- void (*set_pgd)(pgd_t *pudp, pgd_t pgdval);
-#endif /* CONFIG_PGTABLE_LEVELS == 4 */
+ void (*set_p4d)(p4d_t *p4dp, p4d_t p4dval);
+
+#if CONFIG_PGTABLE_LEVELS >= 5
+#error FIXME
+#endif /* CONFIG_PGTABLE_LEVELS >= 5 */
+
+#endif /* CONFIG_PGTABLE_LEVELS >= 4 */
+
#endif /* CONFIG_PGTABLE_LEVELS >= 3 */

struct pv_lazy_ops lazy_mode;
diff --git a/arch/x86/include/asm/pgalloc.h b/arch/x86/include/asm/pgalloc.h
index b6d425999f99..2f585054c63c 100644
--- a/arch/x86/include/asm/pgalloc.h
+++ b/arch/x86/include/asm/pgalloc.h
@@ -121,10 +121,10 @@ static inline void pud_populate(struct mm_struct *mm, pud_t *pud, pmd_t *pmd)
#endif /* CONFIG_X86_PAE */

#if CONFIG_PGTABLE_LEVELS > 3
-static inline void pgd_populate(struct mm_struct *mm, pgd_t *pgd, pud_t *pud)
+static inline void p4d_populate(struct mm_struct *mm, p4d_t *p4d, pud_t *pud)
{
paravirt_alloc_pud(mm, __pa(pud) >> PAGE_SHIFT);
- set_pgd(pgd, __pgd(_PAGE_TABLE | __pa(pud)));
+ set_p4d(p4d, __p4d(_PAGE_TABLE | __pa(pud)));
}

static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
@@ -150,6 +150,37 @@ static inline void __pud_free_tlb(struct mmu_gather *tlb, pud_t *pud,
___pud_free_tlb(tlb, pud);
}

+#if CONFIG_PGTABLE_LEVELS > 4
+static inline void pgd_populate(struct mm_struct *mm, pgd_t *pgd, p4d_t *p4d)
+{
+ paravirt_alloc_p4d(mm, __pa(p4d) >> PAGE_SHIFT);
+ set_pgd(pgd, __pgd(_PAGE_TABLE | __pa(p4d)));
+}
+
+static inline p4d_t *p4d_alloc_one(struct mm_struct *mm, unsigned long addr)
+{
+ gfp_t gfp = GFP_KERNEL_ACCOUNT;
+
+ if (mm == &init_mm)
+ gfp &= ~__GFP_ACCOUNT;
+ return (p4d_t *)get_zeroed_page(gfp);
+}
+
+static inline void p4d_free(struct mm_struct *mm, p4d_t *p4d)
+{
+ BUG_ON((unsigned long)p4d & (PAGE_SIZE-1));
+ free_page((unsigned long)p4d);
+}
+
+extern void ___p4d_free_tlb(struct mmu_gather *tlb, p4d_t *p4d);
+
+static inline void __p4d_free_tlb(struct mmu_gather *tlb, p4d_t *p4d,
+ unsigned long address)
+{
+ ___p4d_free_tlb(tlb, p4d);
+}
+
+#endif /* CONFIG_PGTABLE_LEVELS > 4 */
#endif /* CONFIG_PGTABLE_LEVELS > 3 */
#endif /* CONFIG_PGTABLE_LEVELS > 2 */

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 6f6f351e0a81..90f32116acd8 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -53,11 +53,19 @@ extern struct mm_struct *pgd_page_get_mm(struct page *page);

#define set_pmd(pmdp, pmd) native_set_pmd(pmdp, pmd)

-#ifndef __PAGETABLE_PUD_FOLDED
+#ifndef __PAGETABLE_P4D_FOLDED
#define set_pgd(pgdp, pgd) native_set_pgd(pgdp, pgd)
#define pgd_clear(pgd) native_pgd_clear(pgd)
#endif

+#ifndef set_p4d
+# define set_p4d(p4dp, p4d) native_set_p4d(p4dp, p4d)
+#endif
+
+#ifndef __PAGETABLE_PUD_FOLDED
+#define p4d_clear(p4d) native_p4d_clear(p4d)
+#endif
+
#ifndef set_pud
# define set_pud(pudp, pud) native_set_pud(pudp, pud)
#endif
@@ -74,6 +82,11 @@ extern struct mm_struct *pgd_page_get_mm(struct page *page);
#define pgd_val(x) native_pgd_val(x)
#define __pgd(x) native_make_pgd(x)

+#ifndef __PAGETABLE_P4D_FOLDED
+#define p4d_val(x) native_p4d_val(x)
+#define __p4d(x) native_make_p4d(x)
+#endif
+
#ifndef __PAGETABLE_PUD_FOLDED
#define pud_val(x) native_pud_val(x)
#define __pud(x) native_make_pud(x)
@@ -549,6 +562,7 @@ static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
#define pte_pgprot(x) __pgprot(pte_flags(x))
#define pmd_pgprot(x) __pgprot(pmd_flags(x))
#define pud_pgprot(x) __pgprot(pud_flags(x))
+#define p4d_pgprot(x) __pgprot(p4d_flags(x))

#define canon_pgprot(p) __pgprot(massage_pgprot(p))

@@ -786,12 +800,47 @@ static inline unsigned long pud_index(unsigned long address)
return (address >> PUD_SHIFT) & (PTRS_PER_PUD - 1);
}

+#if CONFIG_PGTABLE_LEVELS > 3
+static inline int p4d_none(p4d_t p4d)
+{
+ return (native_p4d_val(p4d) & ~(_PAGE_KNL_ERRATUM_MASK)) == 0;
+}
+
+static inline int p4d_present(p4d_t p4d)
+{
+ return p4d_flags(p4d) & _PAGE_PRESENT;
+}
+
+static inline unsigned long p4d_page_vaddr(p4d_t p4d)
+{
+ return (unsigned long)__va(p4d_val(p4d) & p4d_pfn_mask(p4d));
+}
+
+/*
+ * Currently stuck as a macro due to indirect forward reference to
+ * linux/mmzone.h's __section_mem_map_addr() definition:
+ */
+#define p4d_page(p4d) \
+ pfn_to_page((p4d_val(p4d) & p4d_pfn_mask(p4d)) >> PAGE_SHIFT)
+
+/* Find an entry in the third-level page table.. */
+static inline pud_t *pud_offset(p4d_t *p4d, unsigned long address)
+{
+ return (pud_t *)p4d_page_vaddr(*p4d) + pud_index(address);
+}
+
+static inline int p4d_bad(p4d_t p4d)
+{
+ return (p4d_flags(p4d) & ~(_KERNPG_TABLE | _PAGE_USER)) != 0;
+}
+#endif /* CONFIG_PGTABLE_LEVELS > 3 */
+
static inline unsigned long p4d_index(unsigned long address)
{
return (address >> P4D_SHIFT) & (PTRS_PER_P4D - 1);
}

-#if CONFIG_PGTABLE_LEVELS > 3
+#if CONFIG_PGTABLE_LEVELS > 4
static inline int pgd_present(pgd_t pgd)
{
return pgd_flags(pgd) & _PAGE_PRESENT;
@@ -809,9 +858,9 @@ static inline unsigned long pgd_page_vaddr(pgd_t pgd)
#define pgd_page(pgd) pfn_to_page(pgd_val(pgd) >> PAGE_SHIFT)

/* to find an entry in a page-table-directory. */
-static inline pud_t *pud_offset(pgd_t *pgd, unsigned long address)
+static inline p4d_t *p4d_offset(pgd_t *pgd, unsigned long address)
{
- return (pud_t *)pgd_page_vaddr(*pgd) + pud_index(address);
+ return (p4d_t *)pgd_page_vaddr(*pgd) + p4d_index(address);
}

static inline int pgd_bad(pgd_t pgd)
@@ -829,7 +878,7 @@ static inline int pgd_none(pgd_t pgd)
*/
return !native_pgd_val(pgd);
}
-#endif /* CONFIG_PGTABLE_LEVELS > 3 */
+#endif /* CONFIG_PGTABLE_LEVELS > 4 */

#endif /* __ASSEMBLY__ */

diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index 73c7ccc38912..79396bfdc791 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -41,9 +41,9 @@ extern void paging_init(void);

struct mm_struct;

+void set_pte_vaddr_p4d(p4d_t *p4d_page, unsigned long vaddr, pte_t new_pte);
void set_pte_vaddr_pud(pud_t *pud_page, unsigned long vaddr, pte_t new_pte);

-
static inline void native_pte_clear(struct mm_struct *mm, unsigned long addr,
pte_t *ptep)
{
@@ -121,6 +121,16 @@ static inline pud_t native_pudp_get_and_clear(pud_t *xp)
#endif
}

+static inline void native_set_p4d(p4d_t *p4dp, p4d_t p4d)
+{
+ *p4dp = p4d;
+}
+
+static inline void native_p4d_clear(p4d_t *p4d)
+{
+ native_set_p4d(p4d, (p4d_t) { .pgd = native_make_pgd(0)});
+}
+
static inline void native_set_pgd(pgd_t *pgdp, pgd_t pgd)
{
*pgdp = pgd;
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index df08535f774a..4930afe9df0a 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -277,11 +277,11 @@ static inline pgdval_t pgd_flags(pgd_t pgd)
#error FIXME

#else
-#include <asm-generic/5level-fixup.h>
+#include <asm-generic/pgtable-nop4d.h>

static inline p4dval_t native_p4d_val(p4d_t p4d)
{
- return native_pgd_val(p4d);
+ return native_pgd_val(p4d.pgd);
}
#endif

@@ -298,12 +298,11 @@ static inline pudval_t native_pud_val(pud_t pud)
return pud.pud;
}
#else
-#define __ARCH_USE_5LEVEL_HACK
#include <asm-generic/pgtable-nopud.h>

static inline pudval_t native_pud_val(pud_t pud)
{
- return native_pgd_val(pud.pgd);
+ return native_pgd_val(pud.p4d.pgd);
}
#endif

@@ -320,12 +319,11 @@ static inline pmdval_t native_pmd_val(pmd_t pmd)
return pmd.pmd;
}
#else
-#define __ARCH_USE_5LEVEL_HACK
#include <asm-generic/pgtable-nopmd.h>

static inline pmdval_t native_pmd_val(pmd_t pmd)
{
- return native_pgd_val(pmd.pud.pgd);
+ return native_pgd_val(pmd.pud.p4d.pgd);
}
#endif

diff --git a/arch/x86/include/asm/xen/page.h b/arch/x86/include/asm/xen/page.h
index 33cbd3db97b9..bf2ca56fba11 100644
--- a/arch/x86/include/asm/xen/page.h
+++ b/arch/x86/include/asm/xen/page.h
@@ -279,13 +279,17 @@ static inline pte_t __pte_ma(pteval_t x)

#define pmd_val_ma(v) ((v).pmd)
#ifdef __PAGETABLE_PUD_FOLDED
-#define pud_val_ma(v) ((v).pgd.pgd)
+#define pud_val_ma(v) ((v).p4d.pgd.pgd)
#else
#define pud_val_ma(v) ((v).pud)
#endif
#define __pmd_ma(x) ((pmd_t) { (x) } )

-#define pgd_val_ma(x) ((x).pgd)
+#ifdef __PAGETABLE_P4D_FOLDED
+#define p4d_val_ma(x) ((x).pgd.pgd)
+#else
+#define p4d_val_ma(x) ((x).p4d)
+#endif

void xen_set_domain_pte(pte_t *ptep, pte_t pteval, unsigned domid);

diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 4797e87b0fb6..110daf22f5c7 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -430,12 +430,16 @@ struct pv_mmu_ops pv_mmu_ops __ro_after_init = {
.pmd_val = PTE_IDENT,
.make_pmd = PTE_IDENT,

-#if CONFIG_PGTABLE_LEVELS == 4
+#if CONFIG_PGTABLE_LEVELS >= 4
.pud_val = PTE_IDENT,
.make_pud = PTE_IDENT,

- .set_pgd = native_set_pgd,
-#endif
+ .set_p4d = native_set_p4d,
+
+#if CONFIG_PGTABLE_LEVELS >= 5
+#error FIXME
+#endif /* CONFIG_PGTABLE_LEVELS >= 5 */
+#endif /* CONFIG_PGTABLE_LEVELS >= 4 */
#endif /* CONFIG_PGTABLE_LEVELS >= 3 */

.pte_val = PTE_IDENT,
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 15173d37f399..7bdda6f1d135 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -97,28 +97,38 @@ void sync_global_pgds(unsigned long start, unsigned long end)
unsigned long address;

for (address = start; address <= end; address += PGDIR_SIZE) {
- const pgd_t *pgd_ref = pgd_offset_k(address);
+ pgd_t *pgd_ref = pgd_offset_k(address);
+ const p4d_t *p4d_ref;
struct page *page;

- if (pgd_none(*pgd_ref))
+ /*
+ * With folded p4d, pgd_none() is always false, so we need to
+ * handle synchronization on the p4d level.
+ */
+ BUILD_BUG_ON(pgd_none(*pgd_ref));
+ p4d_ref = p4d_offset(pgd_ref, address);
+
+ if (p4d_none(*p4d_ref))
continue;

spin_lock(&pgd_lock);
list_for_each_entry(page, &pgd_list, lru) {
pgd_t *pgd;
+ p4d_t *p4d;
spinlock_t *pgt_lock;

pgd = (pgd_t *)page_address(page) + pgd_index(address);
+ p4d = p4d_offset(pgd, address);
/* the pgt_lock only for Xen */
pgt_lock = &pgd_page_get_mm(page)->page_table_lock;
spin_lock(pgt_lock);

- if (!pgd_none(*pgd_ref) && !pgd_none(*pgd))
- BUG_ON(pgd_page_vaddr(*pgd)
- != pgd_page_vaddr(*pgd_ref));
+ if (!p4d_none(*p4d_ref) && !p4d_none(*p4d))
+ BUG_ON(p4d_page_vaddr(*p4d)
+ != p4d_page_vaddr(*p4d_ref));

- if (pgd_none(*pgd))
- set_pgd(pgd, *pgd_ref);
+ if (p4d_none(*p4d))
+ set_p4d(p4d, *p4d_ref);

spin_unlock(pgt_lock);
}
@@ -149,16 +159,28 @@ static __ref void *spp_getpage(void)
return ptr;
}

-static pud_t *fill_pud(pgd_t *pgd, unsigned long vaddr)
+static p4d_t *fill_p4d(pgd_t *pgd, unsigned long vaddr)
{
if (pgd_none(*pgd)) {
- pud_t *pud = (pud_t *)spp_getpage();
- pgd_populate(&init_mm, pgd, pud);
- if (pud != pud_offset(pgd, 0))
+ p4d_t *p4d = (p4d_t *)spp_getpage();
+ pgd_populate(&init_mm, pgd, p4d);
+ if (p4d != p4d_offset(pgd, 0))
printk(KERN_ERR "PAGETABLE BUG #00! %p <-> %p\n",
- pud, pud_offset(pgd, 0));
+ p4d, p4d_offset(pgd, 0));
+ }
+ return p4d_offset(pgd, vaddr);
+}
+
+static pud_t *fill_pud(p4d_t *p4d, unsigned long vaddr)
+{
+ if (p4d_none(*p4d)) {
+ pud_t *pud = (pud_t *)spp_getpage();
+ p4d_populate(&init_mm, p4d, pud);
+ if (pud != pud_offset(p4d, 0))
+ printk(KERN_ERR "PAGETABLE BUG #01! %p <-> %p\n",
+ pud, pud_offset(p4d, 0));
}
- return pud_offset(pgd, vaddr);
+ return pud_offset(p4d, vaddr);
}

static pmd_t *fill_pmd(pud_t *pud, unsigned long vaddr)
@@ -167,7 +189,7 @@ static pmd_t *fill_pmd(pud_t *pud, unsigned long vaddr)
pmd_t *pmd = (pmd_t *) spp_getpage();
pud_populate(&init_mm, pud, pmd);
if (pmd != pmd_offset(pud, 0))
- printk(KERN_ERR "PAGETABLE BUG #01! %p <-> %p\n",
+ printk(KERN_ERR "PAGETABLE BUG #02! %p <-> %p\n",
pmd, pmd_offset(pud, 0));
}
return pmd_offset(pud, vaddr);
@@ -179,20 +201,15 @@ static pte_t *fill_pte(pmd_t *pmd, unsigned long vaddr)
pte_t *pte = (pte_t *) spp_getpage();
pmd_populate_kernel(&init_mm, pmd, pte);
if (pte != pte_offset_kernel(pmd, 0))
- printk(KERN_ERR "PAGETABLE BUG #02!\n");
+ printk(KERN_ERR "PAGETABLE BUG #03!\n");
}
return pte_offset_kernel(pmd, vaddr);
}

-void set_pte_vaddr_pud(pud_t *pud_page, unsigned long vaddr, pte_t new_pte)
+static void __set_pte_vaddr(pud_t *pud, unsigned long vaddr, pte_t new_pte)
{
- pud_t *pud;
- pmd_t *pmd;
- pte_t *pte;
-
- pud = pud_page + pud_index(vaddr);
- pmd = fill_pmd(pud, vaddr);
- pte = fill_pte(pmd, vaddr);
+ pmd_t *pmd = fill_pmd(pud, vaddr);
+ pte_t *pte = fill_pte(pmd, vaddr);

set_pte(pte, new_pte);

@@ -203,10 +220,25 @@ void set_pte_vaddr_pud(pud_t *pud_page, unsigned long vaddr, pte_t new_pte)
__flush_tlb_one(vaddr);
}

+void set_pte_vaddr_p4d(p4d_t *p4d_page, unsigned long vaddr, pte_t new_pte)
+{
+ p4d_t *p4d = p4d_page + p4d_index(vaddr);
+ pud_t *pud = fill_pud(p4d, vaddr);
+
+ __set_pte_vaddr(pud, vaddr, new_pte);
+}
+
+void set_pte_vaddr_pud(pud_t *pud_page, unsigned long vaddr, pte_t new_pte)
+{
+ pud_t *pud = pud_page + pud_index(vaddr);
+
+ __set_pte_vaddr(pud, vaddr, new_pte);
+}
+
void set_pte_vaddr(unsigned long vaddr, pte_t pteval)
{
pgd_t *pgd;
- pud_t *pud_page;
+ p4d_t *p4d_page;

pr_debug("set_pte_vaddr %lx to %lx\n", vaddr, native_pte_val(pteval));

@@ -216,17 +248,20 @@ void set_pte_vaddr(unsigned long vaddr, pte_t pteval)
"PGD FIXMAP MISSING, it should be setup in head.S!\n");
return;
}
- pud_page = (pud_t*)pgd_page_vaddr(*pgd);
- set_pte_vaddr_pud(pud_page, vaddr, pteval);
+
+ p4d_page = p4d_offset(pgd, 0);
+ set_pte_vaddr_p4d(p4d_page, vaddr, pteval);
}

pmd_t * __init populate_extra_pmd(unsigned long vaddr)
{
pgd_t *pgd;
+ p4d_t *p4d;
pud_t *pud;

pgd = pgd_offset_k(vaddr);
- pud = fill_pud(pgd, vaddr);
+ p4d = fill_p4d(pgd, vaddr);
+ pud = fill_pud(p4d, vaddr);
return fill_pmd(pud, vaddr);
}

@@ -245,6 +280,7 @@ static void __init __init_extra_mapping(unsigned long phys, unsigned long size,
enum page_cache_mode cache)
{
pgd_t *pgd;
+ p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
pgprot_t prot;
@@ -255,11 +291,17 @@ static void __init __init_extra_mapping(unsigned long phys, unsigned long size,
for (; size; phys += PMD_SIZE, size -= PMD_SIZE) {
pgd = pgd_offset_k((unsigned long)__va(phys));
if (pgd_none(*pgd)) {
+ p4d = (p4d_t *) spp_getpage();
+ set_pgd(pgd, __pgd(__pa(p4d) | _KERNPG_TABLE |
+ _PAGE_USER));
+ }
+ p4d = p4d_offset(pgd, (unsigned long)__va(phys));
+ if (p4d_none(*p4d)) {
pud = (pud_t *) spp_getpage();
- set_pgd(pgd, __pgd(__pa(pud) | _KERNPG_TABLE |
+ set_p4d(p4d, __p4d(__pa(pud) | _KERNPG_TABLE |
_PAGE_USER));
}
- pud = pud_offset(pgd, (unsigned long)__va(phys));
+ pud = pud_offset(p4d, (unsigned long)__va(phys));
if (pud_none(*pud)) {
pmd = (pmd_t *) spp_getpage();
set_pud(pud, __pud(__pa(pmd) | _KERNPG_TABLE |
@@ -563,12 +605,15 @@ kernel_physical_mapping_init(unsigned long paddr_start,

for (; vaddr < vaddr_end; vaddr = vaddr_next) {
pgd_t *pgd = pgd_offset_k(vaddr);
+ p4d_t *p4d;
pud_t *pud;

vaddr_next = (vaddr & PGDIR_MASK) + PGDIR_SIZE;

- if (pgd_val(*pgd)) {
- pud = (pud_t *)pgd_page_vaddr(*pgd);
+ BUILD_BUG_ON(pgd_none(*pgd));
+ p4d = p4d_offset(pgd, vaddr);
+ if (p4d_val(*p4d)) {
+ pud = (pud_t *)p4d_page_vaddr(*p4d);
paddr_last = phys_pud_init(pud, __pa(vaddr),
__pa(vaddr_end),
page_size_mask);
@@ -580,7 +625,7 @@ kernel_physical_mapping_init(unsigned long paddr_start,
page_size_mask);

spin_lock(&init_mm.page_table_lock);
- pgd_populate(&init_mm, pgd, pud);
+ p4d_populate(&init_mm, p4d, pud);
spin_unlock(&init_mm.page_table_lock);
pgd_changed = true;
}
@@ -726,6 +771,24 @@ static void __meminit free_pmd_table(pmd_t *pmd_start, pud_t *pud)
spin_unlock(&init_mm.page_table_lock);
}

+static void __meminit free_pud_table(pud_t *pud_start, p4d_t *p4d)
+{
+ pud_t *pud;
+ int i;
+
+ for (i = 0; i < PTRS_PER_PUD; i++) {
+ pud = pud_start + i;
+ if (!pud_none(*pud))
+ return;
+ }
+
+ /* free a pud table */
+ free_pagetable(p4d_page(*p4d), 0);
+ spin_lock(&init_mm.page_table_lock);
+ p4d_clear(p4d);
+ spin_unlock(&init_mm.page_table_lock);
+}
+
static void __meminit
remove_pte_table(pte_t *pte_start, unsigned long addr, unsigned long end,
bool direct)
@@ -908,6 +971,32 @@ remove_pud_table(pud_t *pud_start, unsigned long addr, unsigned long end,
update_page_count(PG_LEVEL_1G, -pages);
}

+static void __meminit
+remove_p4d_table(p4d_t *p4d_start, unsigned long addr, unsigned long end,
+ bool direct)
+{
+ unsigned long next, pages = 0;
+ pud_t *pud_base;
+ p4d_t *p4d;
+
+ p4d = p4d_start + p4d_index(addr);
+ for (; addr < end; addr = next, p4d++) {
+ next = p4d_addr_end(addr, end);
+
+ if (!p4d_present(*p4d))
+ continue;
+
+ BUILD_BUG_ON(p4d_large(*p4d));
+
+ pud_base = (pud_t *)p4d_page_vaddr(*p4d);
+ remove_pud_table(pud_base, addr, next, direct);
+ free_pud_table(pud_base, p4d);
+ }
+
+ if (direct)
+ update_page_count(PG_LEVEL_512G, -pages);
+}
+
/* start and end are both virtual address. */
static void __meminit
remove_pagetable(unsigned long start, unsigned long end, bool direct)
@@ -915,7 +1004,7 @@ remove_pagetable(unsigned long start, unsigned long end, bool direct)
unsigned long next;
unsigned long addr;
pgd_t *pgd;
- pud_t *pud;
+ p4d_t *p4d;

for (addr = start; addr < end; addr = next) {
next = pgd_addr_end(addr, end);
@@ -924,8 +1013,8 @@ remove_pagetable(unsigned long start, unsigned long end, bool direct)
if (!pgd_present(*pgd))
continue;

- pud = (pud_t *)pgd_page_vaddr(*pgd);
- remove_pud_table(pud, addr, next, direct);
+ p4d = (p4d_t *)pgd_page_vaddr(*pgd);
+ remove_p4d_table(p4d, addr, next, direct);
}

flush_tlb_all();
@@ -1090,6 +1179,7 @@ int kern_addr_valid(unsigned long addr)
{
unsigned long above = ((long)addr) >> __VIRTUAL_MASK_SHIFT;
pgd_t *pgd;
+ p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
pte_t *pte;
@@ -1101,7 +1191,11 @@ int kern_addr_valid(unsigned long addr)
if (pgd_none(*pgd))
return 0;

- pud = pud_offset(pgd, addr);
+ p4d = p4d_offset(pgd, addr);
+ if (p4d_none(*p4d))
+ return 0;
+
+ pud = pud_offset(p4d, addr);
if (pud_none(*pud))
return 0;

@@ -1158,6 +1252,7 @@ static int __meminit vmemmap_populate_hugepages(unsigned long start,
unsigned long addr;
unsigned long next;
pgd_t *pgd;
+ p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;

@@ -1168,7 +1263,11 @@ static int __meminit vmemmap_populate_hugepages(unsigned long start,
if (!pgd)
return -ENOMEM;

- pud = vmemmap_pud_populate(pgd, addr, node);
+ p4d = vmemmap_p4d_populate(pgd, addr, node);
+ if (!p4d)
+ return -ENOMEM;
+
+ pud = vmemmap_pud_populate(p4d, addr, node);
if (!pud)
return -ENOMEM;

@@ -1236,6 +1335,7 @@ void register_page_bootmem_memmap(unsigned long section_nr,
unsigned long end = (unsigned long)(start_page + size);
unsigned long next;
pgd_t *pgd;
+ p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
unsigned int nr_pages;
@@ -1251,7 +1351,14 @@ void register_page_bootmem_memmap(unsigned long section_nr,
}
get_page_bootmem(section_nr, pgd_page(*pgd), MIX_SECTION_INFO);

- pud = pud_offset(pgd, addr);
+ p4d = p4d_offset(pgd, addr);
+ if (p4d_none(*p4d)) {
+ next = (addr + PAGE_SIZE) & PAGE_MASK;
+ continue;
+ }
+ get_page_bootmem(section_nr, p4d_page(*p4d), MIX_SECTION_INFO);
+
+ pud = pud_offset(p4d, addr);
if (pud_none(*pud)) {
next = (addr + PAGE_SIZE) & PAGE_MASK;
continue;
diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index 75af8da7b54f..c36c8178847d 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -535,40 +535,41 @@ static pgd_t *xen_get_user_pgd(pgd_t *pgd)
return user_ptr;
}

-static void __xen_set_pgd_hyper(pgd_t *ptr, pgd_t val)
+static void __xen_set_p4d_hyper(p4d_t *ptr, p4d_t val)
{
struct mmu_update u;

u.ptr = virt_to_machine(ptr).maddr;
- u.val = pgd_val_ma(val);
+ u.val = p4d_val_ma(val);
xen_extend_mmu_update(&u);
}

/*
- * Raw hypercall-based set_pgd, intended for in early boot before
+ * Raw hypercall-based set_p4d, intended for in early boot before
* there's a page structure. This implies:
* 1. The only existing pagetable is the kernel's
* 2. It is always pinned
* 3. It has no user pagetable attached to it
*/
-static void __init xen_set_pgd_hyper(pgd_t *ptr, pgd_t val)
+static void __init xen_set_p4d_hyper(p4d_t *ptr, p4d_t val)
{
preempt_disable();

xen_mc_batch();

- __xen_set_pgd_hyper(ptr, val);
+ __xen_set_p4d_hyper(ptr, val);

xen_mc_issue(PARAVIRT_LAZY_MMU);

preempt_enable();
}

-static void xen_set_pgd(pgd_t *ptr, pgd_t val)
+static void xen_set_p4d(p4d_t *ptr, p4d_t val)
{
- pgd_t *user_ptr = xen_get_user_pgd(ptr);
+ pgd_t *user_ptr = xen_get_user_pgd((pgd_t *)ptr);
+ pgd_t pgd_val;

- trace_xen_mmu_set_pgd(ptr, user_ptr, val);
+ trace_xen_mmu_set_p4d(ptr, (p4d_t *)user_ptr, val);

/* If page is not pinned, we can just update the entry
directly */
@@ -576,7 +577,8 @@ static void xen_set_pgd(pgd_t *ptr, pgd_t val)
*ptr = val;
if (user_ptr) {
WARN_ON(xen_page_pinned(user_ptr));
- *user_ptr = val;
+ pgd_val.pgd = p4d_val_ma(val);
+ *user_ptr = pgd_val;
}
return;
}
@@ -585,9 +587,9 @@ static void xen_set_pgd(pgd_t *ptr, pgd_t val)
user updates together. */
xen_mc_batch();

- __xen_set_pgd_hyper(ptr, val);
+ __xen_set_p4d_hyper(ptr, val);
if (user_ptr)
- __xen_set_pgd_hyper(user_ptr, val);
+ __xen_set_p4d_hyper((p4d_t *)user_ptr, val);

xen_mc_issue(PARAVIRT_LAZY_MMU);
}
@@ -1589,7 +1591,6 @@ static int xen_pgd_alloc(struct mm_struct *mm)
BUG_ON(PagePinned(virt_to_page(xen_get_user_pgd(pgd))));
}
#endif
-
return ret;
}

@@ -1781,7 +1782,7 @@ static void xen_release_pmd(unsigned long pfn)
xen_release_ptpage(pfn, PT_PMD);
}

-#if CONFIG_PGTABLE_LEVELS == 4
+#if CONFIG_PGTABLE_LEVELS >= 4
static void xen_alloc_pud(struct mm_struct *mm, unsigned long pfn)
{
xen_alloc_ptpage(mm, pfn, PT_PUD);
@@ -2122,21 +2123,27 @@ static phys_addr_t __init xen_early_virt_to_phys(unsigned long vaddr)
*/
void __init xen_relocate_p2m(void)
{
- phys_addr_t size, new_area, pt_phys, pmd_phys, pud_phys;
+ phys_addr_t size, new_area, pt_phys, pmd_phys, pud_phys, p4d_phys;
unsigned long p2m_pfn, p2m_pfn_end, n_frames, pfn, pfn_end;
- int n_pte, n_pt, n_pmd, n_pud, idx_pte, idx_pt, idx_pmd, idx_pud;
+ int n_pte, n_pt, n_pmd, n_pud, n_p4d, idx_pte, idx_pt, idx_pmd, idx_pud, idx_p4d;
pte_t *pt;
pmd_t *pmd;
pud_t *pud;
+ p4d_t *p4d = NULL;
pgd_t *pgd;
unsigned long *new_p2m;
+ int save_pud;

size = PAGE_ALIGN(xen_start_info->nr_pages * sizeof(unsigned long));
n_pte = roundup(size, PAGE_SIZE) >> PAGE_SHIFT;
n_pt = roundup(size, PMD_SIZE) >> PMD_SHIFT;
n_pmd = roundup(size, PUD_SIZE) >> PUD_SHIFT;
- n_pud = roundup(size, PGDIR_SIZE) >> PGDIR_SHIFT;
- n_frames = n_pte + n_pt + n_pmd + n_pud;
+ n_pud = roundup(size, P4D_SIZE) >> P4D_SHIFT;
+ if (PTRS_PER_P4D > 1)
+ n_p4d = roundup(size, PGDIR_SIZE) >> PGDIR_SHIFT;
+ else
+ n_p4d = 0;
+ n_frames = n_pte + n_pt + n_pmd + n_pud + n_p4d;

new_area = xen_find_free_area(PFN_PHYS(n_frames));
if (!new_area) {
@@ -2152,55 +2159,76 @@ void __init xen_relocate_p2m(void)
* To avoid any possible virtual address collision, just use
* 2 * PUD_SIZE for the new area.
*/
- pud_phys = new_area;
+ p4d_phys = new_area;
+ pud_phys = p4d_phys + PFN_PHYS(n_p4d);
pmd_phys = pud_phys + PFN_PHYS(n_pud);
pt_phys = pmd_phys + PFN_PHYS(n_pmd);
p2m_pfn = PFN_DOWN(pt_phys) + n_pt;

pgd = __va(read_cr3());
new_p2m = (unsigned long *)(2 * PGDIR_SIZE);
- for (idx_pud = 0; idx_pud < n_pud; idx_pud++) {
- pud = early_memremap(pud_phys, PAGE_SIZE);
- clear_page(pud);
- for (idx_pmd = 0; idx_pmd < min(n_pmd, PTRS_PER_PUD);
- idx_pmd++) {
- pmd = early_memremap(pmd_phys, PAGE_SIZE);
- clear_page(pmd);
- for (idx_pt = 0; idx_pt < min(n_pt, PTRS_PER_PMD);
- idx_pt++) {
- pt = early_memremap(pt_phys, PAGE_SIZE);
- clear_page(pt);
- for (idx_pte = 0;
- idx_pte < min(n_pte, PTRS_PER_PTE);
- idx_pte++) {
- set_pte(pt + idx_pte,
- pfn_pte(p2m_pfn, PAGE_KERNEL));
- p2m_pfn++;
+ idx_p4d = 0;
+ save_pud = n_pud;
+ do {
+ if (n_p4d > 0) {
+ p4d = early_memremap(p4d_phys, PAGE_SIZE);
+ clear_page(p4d);
+ n_pud = min(save_pud, PTRS_PER_P4D);
+ }
+ for (idx_pud = 0; idx_pud < n_pud; idx_pud++) {
+ pud = early_memremap(pud_phys, PAGE_SIZE);
+ clear_page(pud);
+ for (idx_pmd = 0; idx_pmd < min(n_pmd, PTRS_PER_PUD);
+ idx_pmd++) {
+ pmd = early_memremap(pmd_phys, PAGE_SIZE);
+ clear_page(pmd);
+ for (idx_pt = 0; idx_pt < min(n_pt, PTRS_PER_PMD);
+ idx_pt++) {
+ pt = early_memremap(pt_phys, PAGE_SIZE);
+ clear_page(pt);
+ for (idx_pte = 0;
+ idx_pte < min(n_pte, PTRS_PER_PTE);
+ idx_pte++) {
+ set_pte(pt + idx_pte,
+ pfn_pte(p2m_pfn, PAGE_KERNEL));
+ p2m_pfn++;
+ }
+ n_pte -= PTRS_PER_PTE;
+ early_memunmap(pt, PAGE_SIZE);
+ make_lowmem_page_readonly(__va(pt_phys));
+ pin_pagetable_pfn(MMUEXT_PIN_L1_TABLE,
+ PFN_DOWN(pt_phys));
+ set_pmd(pmd + idx_pt,
+ __pmd(_PAGE_TABLE | pt_phys));
+ pt_phys += PAGE_SIZE;
}
- n_pte -= PTRS_PER_PTE;
- early_memunmap(pt, PAGE_SIZE);
- make_lowmem_page_readonly(__va(pt_phys));
- pin_pagetable_pfn(MMUEXT_PIN_L1_TABLE,
- PFN_DOWN(pt_phys));
- set_pmd(pmd + idx_pt,
- __pmd(_PAGE_TABLE | pt_phys));
- pt_phys += PAGE_SIZE;
+ n_pt -= PTRS_PER_PMD;
+ early_memunmap(pmd, PAGE_SIZE);
+ make_lowmem_page_readonly(__va(pmd_phys));
+ pin_pagetable_pfn(MMUEXT_PIN_L2_TABLE,
+ PFN_DOWN(pmd_phys));
+ set_pud(pud + idx_pmd, __pud(_PAGE_TABLE | pmd_phys));
+ pmd_phys += PAGE_SIZE;
}
- n_pt -= PTRS_PER_PMD;
- early_memunmap(pmd, PAGE_SIZE);
- make_lowmem_page_readonly(__va(pmd_phys));
- pin_pagetable_pfn(MMUEXT_PIN_L2_TABLE,
- PFN_DOWN(pmd_phys));
- set_pud(pud + idx_pmd, __pud(_PAGE_TABLE | pmd_phys));
- pmd_phys += PAGE_SIZE;
+ n_pmd -= PTRS_PER_PUD;
+ early_memunmap(pud, PAGE_SIZE);
+ make_lowmem_page_readonly(__va(pud_phys));
+ pin_pagetable_pfn(MMUEXT_PIN_L3_TABLE, PFN_DOWN(pud_phys));
+ if (n_p4d > 0)
+ set_p4d(p4d + idx_pud, __p4d(_PAGE_TABLE | pud_phys));
+ else
+ set_pgd(pgd + 2 + idx_pud, __pgd(_PAGE_TABLE | pud_phys));
+ pud_phys += PAGE_SIZE;
}
- n_pmd -= PTRS_PER_PUD;
- early_memunmap(pud, PAGE_SIZE);
- make_lowmem_page_readonly(__va(pud_phys));
- pin_pagetable_pfn(MMUEXT_PIN_L3_TABLE, PFN_DOWN(pud_phys));
- set_pgd(pgd + 2 + idx_pud, __pgd(_PAGE_TABLE | pud_phys));
- pud_phys += PAGE_SIZE;
- }
+ if (n_p4d > 0) {
+ save_pud -= PTRS_PER_P4D;
+ early_memunmap(p4d, PAGE_SIZE);
+ make_lowmem_page_readonly(__va(p4d_phys));
+ pin_pagetable_pfn(MMUEXT_PIN_L4_TABLE, PFN_DOWN(p4d_phys));
+ set_pgd(pgd + 2 + idx_p4d, __pgd(_PAGE_TABLE | p4d_phys));
+ p4d_phys += PAGE_SIZE;
+ }
+ } while (++idx_p4d < n_p4d);

/* Now copy the old p2m info to the new area. */
memcpy(new_p2m, xen_p2m_addr, size);
@@ -2429,8 +2457,8 @@ static void __init xen_post_allocator_init(void)
pv_mmu_ops.set_pte = xen_set_pte;
pv_mmu_ops.set_pmd = xen_set_pmd;
pv_mmu_ops.set_pud = xen_set_pud;
-#if CONFIG_PGTABLE_LEVELS == 4
- pv_mmu_ops.set_pgd = xen_set_pgd;
+#if CONFIG_PGTABLE_LEVELS >= 4
+ pv_mmu_ops.set_p4d = xen_set_p4d;
#endif

/* This will work as long as patching hasn't happened yet
@@ -2439,7 +2467,7 @@ static void __init xen_post_allocator_init(void)
pv_mmu_ops.alloc_pmd = xen_alloc_pmd;
pv_mmu_ops.release_pte = xen_release_pte;
pv_mmu_ops.release_pmd = xen_release_pmd;
-#if CONFIG_PGTABLE_LEVELS == 4
+#if CONFIG_PGTABLE_LEVELS >= 4
pv_mmu_ops.alloc_pud = xen_alloc_pud;
pv_mmu_ops.release_pud = xen_release_pud;
#endif
@@ -2505,10 +2533,10 @@ static const struct pv_mmu_ops xen_mmu_ops __initconst = {
.make_pmd = PV_CALLEE_SAVE(xen_make_pmd),
.pmd_val = PV_CALLEE_SAVE(xen_pmd_val),

-#if CONFIG_PGTABLE_LEVELS == 4
+#if CONFIG_PGTABLE_LEVELS >= 4
.pud_val = PV_CALLEE_SAVE(xen_pud_val),
.make_pud = PV_CALLEE_SAVE(xen_make_pud),
- .set_pgd = xen_set_pgd_hyper,
+ .set_p4d = xen_set_p4d_hyper,

.alloc_pud = xen_alloc_pmd_init,
.release_pud = xen_release_pmd_init,
diff --git a/include/trace/events/xen.h b/include/trace/events/xen.h
index bce990f5a35d..31acce9019a6 100644
--- a/include/trace/events/xen.h
+++ b/include/trace/events/xen.h
@@ -241,21 +241,21 @@ TRACE_EVENT(xen_mmu_set_pud,
(int)sizeof(pudval_t) * 2, (unsigned long long)__entry->pudval)
);

-TRACE_EVENT(xen_mmu_set_pgd,
- TP_PROTO(pgd_t *pgdp, pgd_t *user_pgdp, pgd_t pgdval),
- TP_ARGS(pgdp, user_pgdp, pgdval),
+TRACE_EVENT(xen_mmu_set_p4d,
+ TP_PROTO(p4d_t *p4dp, p4d_t *user_p4dp, p4d_t p4dval),
+ TP_ARGS(p4dp, user_p4dp, p4dval),
TP_STRUCT__entry(
- __field(pgd_t *, pgdp)
- __field(pgd_t *, user_pgdp)
- __field(pgdval_t, pgdval)
- ),
- TP_fast_assign(__entry->pgdp = pgdp;
- __entry->user_pgdp = user_pgdp;
- __entry->pgdval = pgdval.pgd),
- TP_printk("pgdp %p user_pgdp %p pgdval %0*llx (raw %0*llx)",
- __entry->pgdp, __entry->user_pgdp,
- (int)sizeof(pgdval_t) * 2, (unsigned long long)pgd_val(native_make_pgd(__entry->pgdval)),
- (int)sizeof(pgdval_t) * 2, (unsigned long long)__entry->pgdval)
+ __field(p4d_t *, p4dp)
+ __field(p4d_t *, user_p4dp)
+ __field(p4dval_t, p4dval)
+ ),
+ TP_fast_assign(__entry->p4dp = p4dp;
+ __entry->user_p4dp = user_p4dp;
+ __entry->p4dval = p4d_val(p4dval)),
+ TP_printk("p4dp %p user_p4dp %p p4dval %0*llx (raw %0*llx)",
+ __entry->p4dp, __entry->user_p4dp,
+ (int)sizeof(p4dval_t) * 2, (unsigned long long)pgd_val(native_make_pgd(__entry->p4dval)),
+ (int)sizeof(p4dval_t) * 2, (unsigned long long)__entry->p4dval)
);

TRACE_EVENT(xen_mmu_pud_clear,
--
2.11.0

2017-03-06 14:07:25

by Kirill A. Shutemov

Subject: [PATCHv4 28/33] x86/mm: add support of additional page table level during early boot

This patch adds support for 5-level paging during early boot.
It generalizes the boot code for 4- and 5-level paging on 64-bit systems,
with a compile-time switch between the two.
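
As an illustration of what the extra level means for the early page tables,
here is a minimal userspace sketch (not kernel code) of how a virtual address
splits into per-level table indices. The helper names table_index() and
split_vaddr() and the struct are hypothetical; the shift values are the usual
x86-64 ones for 4 KiB pages with 512-entry tables. With 4-level paging the p4d
level is folded, so its index is reported as 0.

```c
#include <assert.h>
#include <stdint.h>

/* Each x86-64 page table holds 512 entries, i.e. 9 address bits per level. */
#define PTRS_PER_TABLE      512u
#define PAGE_SHIFT          12
#define PMD_SHIFT           21
#define PUD_SHIFT           30
#define P4D_SHIFT           39
#define PGDIR_SHIFT_4LEVEL  39   /* the pgd sits where the p4d would be */
#define PGDIR_SHIFT_5LEVEL  48

/* Hypothetical container for the result; not a kernel type. */
struct vaddr_indices {
    unsigned int pgd, p4d, pud, pmd, pte;
};

/* Hypothetical helper: entry index selected by the bits above `shift`. */
static unsigned int table_index(uint64_t addr, unsigned int shift)
{
    return (unsigned int)((addr >> shift) & (PTRS_PER_TABLE - 1));
}

/* Split `addr` for a 4- or 5-level layout. */
static struct vaddr_indices split_vaddr(uint64_t addr, int levels)
{
    struct vaddr_indices ix;

    assert(levels == 4 || levels == 5);

    ix.pgd = table_index(addr, levels == 5 ? PGDIR_SHIFT_5LEVEL
                                           : PGDIR_SHIFT_4LEVEL);
    /* With 4 levels the p4d is folded into the pgd: report index 0. */
    ix.p4d = levels == 5 ? table_index(addr, P4D_SHIFT) : 0;
    ix.pud = table_index(addr, PUD_SHIFT);
    ix.pmd = table_index(addr, PMD_SHIFT);
    ix.pte = table_index(addr, PAGE_SHIFT);
    return ix;
}
```

For the kernel text mapping at 0xffffffff80000000, split_vaddr() yields pgd
index 511 and pud index 510 in both layouts, and additionally p4d index 511
with 5 levels. This matches the 511 slot used for level4_kernel_pgt and the
510 slot used for level3_kernel_pgt in the head_64.S changes below.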

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/boot/compressed/head_64.S | 23 +++++++++--
arch/x86/include/asm/pgtable.h | 2 +-
arch/x86/include/asm/pgtable_64.h | 6 ++-
arch/x86/include/uapi/asm/processor-flags.h | 2 +
arch/x86/kernel/espfix_64.c | 2 +-
arch/x86/kernel/head64.c | 40 +++++++++++++-----
arch/x86/kernel/head_64.S | 63 +++++++++++++++++++++--------
arch/x86/kernel/machine_kexec_64.c | 2 +-
arch/x86/mm/dump_pagetables.c | 2 +-
arch/x86/mm/kasan_init_64.c | 12 +++---
arch/x86/realmode/init.c | 2 +-
arch/x86/xen/mmu.c | 38 ++++++++++-------
12 files changed, 135 insertions(+), 59 deletions(-)

diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index d2ae1f821e0c..3ed26769810b 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -122,9 +122,12 @@ ENTRY(startup_32)
addl %ebp, gdt+2(%ebp)
lgdt gdt(%ebp)

- /* Enable PAE mode */
+ /* Enable PAE and LA57 mode */
movl %cr4, %eax
orl $X86_CR4_PAE, %eax
+#ifdef CONFIG_X86_5LEVEL
+ orl $X86_CR4_LA57, %eax
+#endif
movl %eax, %cr4

/*
@@ -136,13 +139,24 @@ ENTRY(startup_32)
movl $(BOOT_INIT_PGT_SIZE/4), %ecx
rep stosl

+ xorl %edx, %edx
+
+ /* Build Top Level */
+ leal pgtable(%ebx,%edx,1), %edi
+ leal 0x1007 (%edi), %eax
+ movl %eax, 0(%edi)
+
+#ifdef CONFIG_X86_5LEVEL
/* Build Level 4 */
- leal pgtable + 0(%ebx), %edi
+ addl $0x1000, %edx
+ leal pgtable(%ebx,%edx), %edi
leal 0x1007 (%edi), %eax
movl %eax, 0(%edi)
+#endif

/* Build Level 3 */
- leal pgtable + 0x1000(%ebx), %edi
+ addl $0x1000, %edx
+ leal pgtable(%ebx,%edx), %edi
leal 0x1007(%edi), %eax
movl $4, %ecx
1: movl %eax, 0x00(%edi)
@@ -152,7 +166,8 @@ ENTRY(startup_32)
jnz 1b

/* Build Level 2 */
- leal pgtable + 0x2000(%ebx), %edi
+ addl $0x1000, %edx
+ leal pgtable(%ebx,%edx), %edi
movl $0x00000183, %eax
movl $2048, %ecx
1: movl %eax, 0(%edi)
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 90f32116acd8..6cefd861ac65 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -917,7 +917,7 @@ extern pgd_t trampoline_pgd_entry;
static inline void __meminit init_trampoline_default(void)
{
/* Default trampoline pgd value */
- trampoline_pgd_entry = init_level4_pgt[pgd_index(__PAGE_OFFSET)];
+ trampoline_pgd_entry = init_top_pgt[pgd_index(__PAGE_OFFSET)];
}
# ifdef CONFIG_RANDOMIZE_MEMORY
void __meminit init_trampoline(void);
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index 9991224f6238..c9e41f1599dd 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -14,15 +14,17 @@
#include <linux/bitops.h>
#include <linux/threads.h>

+extern p4d_t level4_kernel_pgt[512];
+extern p4d_t level4_ident_pgt[512];
extern pud_t level3_kernel_pgt[512];
extern pud_t level3_ident_pgt[512];
extern pmd_t level2_kernel_pgt[512];
extern pmd_t level2_fixmap_pgt[512];
extern pmd_t level2_ident_pgt[512];
extern pte_t level1_fixmap_pgt[512];
-extern pgd_t init_level4_pgt[];
+extern pgd_t init_top_pgt[];

-#define swapper_pg_dir init_level4_pgt
+#define swapper_pg_dir init_top_pgt

extern void paging_init(void);

diff --git a/arch/x86/include/uapi/asm/processor-flags.h b/arch/x86/include/uapi/asm/processor-flags.h
index 567de50a4c2a..185f3d10c194 100644
--- a/arch/x86/include/uapi/asm/processor-flags.h
+++ b/arch/x86/include/uapi/asm/processor-flags.h
@@ -104,6 +104,8 @@
#define X86_CR4_OSFXSR _BITUL(X86_CR4_OSFXSR_BIT)
#define X86_CR4_OSXMMEXCPT_BIT 10 /* enable unmasked SSE exceptions */
#define X86_CR4_OSXMMEXCPT _BITUL(X86_CR4_OSXMMEXCPT_BIT)
+#define X86_CR4_LA57_BIT 12 /* enable 5-level page tables */
+#define X86_CR4_LA57 _BITUL(X86_CR4_LA57_BIT)
#define X86_CR4_VMXE_BIT 13 /* enable VMX virtualization */
#define X86_CR4_VMXE _BITUL(X86_CR4_VMXE_BIT)
#define X86_CR4_SMXE_BIT 14 /* enable safer mode (TXT) */
diff --git a/arch/x86/kernel/espfix_64.c b/arch/x86/kernel/espfix_64.c
index 8e598a1ad986..6b91e2eb8d3f 100644
--- a/arch/x86/kernel/espfix_64.c
+++ b/arch/x86/kernel/espfix_64.c
@@ -125,7 +125,7 @@ void __init init_espfix_bsp(void)
p4d_t *p4d;

/* Install the espfix pud into the kernel page directory */
- pgd = &init_level4_pgt[pgd_index(ESPFIX_BASE_ADDR)];
+ pgd = &init_top_pgt[pgd_index(ESPFIX_BASE_ADDR)];
p4d = p4d_alloc(&init_mm, pgd, ESPFIX_BASE_ADDR);
p4d_populate(&init_mm, p4d, espfix_pud_page);

diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 54a2372f5dbb..f32d22986f47 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -32,7 +32,7 @@
/*
* Manage page tables very early on.
*/
-extern pgd_t early_level4_pgt[PTRS_PER_PGD];
+extern pgd_t early_top_pgt[PTRS_PER_PGD];
extern pmd_t early_dynamic_pgts[EARLY_DYNAMIC_PAGE_TABLES][PTRS_PER_PMD];
static unsigned int __initdata next_early_pgt = 2;
pmdval_t early_pmd_flags = __PAGE_KERNEL_LARGE & ~(_PAGE_GLOBAL | _PAGE_NX);
@@ -40,9 +40,9 @@ pmdval_t early_pmd_flags = __PAGE_KERNEL_LARGE & ~(_PAGE_GLOBAL | _PAGE_NX);
/* Wipe all early page tables except for the kernel symbol map */
static void __init reset_early_page_tables(void)
{
- memset(early_level4_pgt, 0, sizeof(pgd_t)*(PTRS_PER_PGD-1));
+ memset(early_top_pgt, 0, sizeof(pgd_t)*(PTRS_PER_PGD-1));
next_early_pgt = 0;
- write_cr3(__pa_nodebug(early_level4_pgt));
+ write_cr3(__pa_nodebug(early_top_pgt));
}

/* Create a new PMD entry */
@@ -50,15 +50,16 @@ int __init early_make_pgtable(unsigned long address)
{
unsigned long physaddr = address - __PAGE_OFFSET;
pgdval_t pgd, *pgd_p;
+ p4dval_t p4d, *p4d_p;
pudval_t pud, *pud_p;
pmdval_t pmd, *pmd_p;

/* Invalid address or early pgt is done ? */
- if (physaddr >= MAXMEM || read_cr3() != __pa_nodebug(early_level4_pgt))
+ if (physaddr >= MAXMEM || read_cr3() != __pa_nodebug(early_top_pgt))
return -1;

again:
- pgd_p = &early_level4_pgt[pgd_index(address)].pgd;
+ pgd_p = &early_top_pgt[pgd_index(address)].pgd;
pgd = *pgd_p;

/*
@@ -66,8 +67,25 @@ int __init early_make_pgtable(unsigned long address)
* critical -- __PAGE_OFFSET would point us back into the dynamic
* range and we might end up looping forever...
*/
- if (pgd)
- pud_p = (pudval_t *)((pgd & PTE_PFN_MASK) + __START_KERNEL_map - phys_base);
+ if (!IS_ENABLED(CONFIG_X86_5LEVEL))
+ p4d_p = pgd_p;
+ else if (pgd)
+ p4d_p = (p4dval_t *)((pgd & PTE_PFN_MASK) + __START_KERNEL_map - phys_base);
+ else {
+ if (next_early_pgt >= EARLY_DYNAMIC_PAGE_TABLES) {
+ reset_early_page_tables();
+ goto again;
+ }
+
+ p4d_p = (p4dval_t *)early_dynamic_pgts[next_early_pgt++];
+ memset(p4d_p, 0, sizeof(*p4d_p) * PTRS_PER_P4D);
+ *pgd_p = (pgdval_t)p4d_p - __START_KERNEL_map + phys_base + _KERNPG_TABLE;
+ }
+ p4d_p += p4d_index(address);
+ p4d = *p4d_p;
+
+ if (p4d)
+ pud_p = (pudval_t *)((p4d & PTE_PFN_MASK) + __START_KERNEL_map - phys_base);
else {
if (next_early_pgt >= EARLY_DYNAMIC_PAGE_TABLES) {
reset_early_page_tables();
@@ -76,7 +94,7 @@ int __init early_make_pgtable(unsigned long address)

pud_p = (pudval_t *)early_dynamic_pgts[next_early_pgt++];
memset(pud_p, 0, sizeof(*pud_p) * PTRS_PER_PUD);
- *pgd_p = (pgdval_t)pud_p - __START_KERNEL_map + phys_base + _KERNPG_TABLE;
+ *p4d_p = (p4dval_t)pud_p - __START_KERNEL_map + phys_base + _KERNPG_TABLE;
}
pud_p += pud_index(address);
pud = *pud_p;
@@ -155,7 +173,7 @@ asmlinkage __visible void __init x86_64_start_kernel(char * real_mode_data)

clear_bss();

- clear_page(init_level4_pgt);
+ clear_page(init_top_pgt);

kasan_early_init();

@@ -170,8 +188,8 @@ asmlinkage __visible void __init x86_64_start_kernel(char * real_mode_data)
*/
load_ucode_bsp();

- /* set init_level4_pgt kernel high mapping*/
- init_level4_pgt[511] = early_level4_pgt[511];
+ /* set init_top_pgt kernel high mapping*/
+ init_top_pgt[511] = early_top_pgt[511];

x86_64_start_reservations(real_mode_data);
}
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index b467b14b03eb..fd1f88d94d6b 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -37,10 +37,14 @@
*
*/

+#define p4d_index(x) (((x) >> P4D_SHIFT) & (PTRS_PER_P4D-1))
#define pud_index(x) (((x) >> PUD_SHIFT) & (PTRS_PER_PUD-1))

-L4_PAGE_OFFSET = pgd_index(__PAGE_OFFSET_BASE)
-L4_START_KERNEL = pgd_index(__START_KERNEL_map)
+PGD_PAGE_OFFSET = pgd_index(__PAGE_OFFSET_BASE)
+PGD_START_KERNEL = pgd_index(__START_KERNEL_map)
+#ifdef CONFIG_X86_5LEVEL
+L4_START_KERNEL = p4d_index(__START_KERNEL_map)
+#endif
L3_START_KERNEL = pud_index(__START_KERNEL_map)

.text
@@ -93,7 +97,11 @@ startup_64:
/*
* Fixup the physical addresses in the page table
*/
- addq %rbp, early_level4_pgt + (L4_START_KERNEL*8)(%rip)
+ addq %rbp, early_top_pgt + (PGD_START_KERNEL*8)(%rip)
+
+#ifdef CONFIG_X86_5LEVEL
+ addq %rbp, level4_kernel_pgt + (511*8)(%rip)
+#endif

addq %rbp, level3_kernel_pgt + (510*8)(%rip)
addq %rbp, level3_kernel_pgt + (511*8)(%rip)
@@ -107,7 +115,7 @@ startup_64:
* it avoids problems around wraparound.
*/
leaq _text(%rip), %rdi
- leaq early_level4_pgt(%rip), %rbx
+ leaq early_top_pgt(%rip), %rbx

movq %rdi, %rax
shrq $PGDIR_SHIFT, %rax
@@ -116,16 +124,26 @@ startup_64:
movq %rdx, 0(%rbx,%rax,8)
movq %rdx, 8(%rbx,%rax,8)

+#ifdef CONFIG_X86_5LEVEL
+ addq $PAGE_SIZE, %rbx
+ addq $PAGE_SIZE, %rdx
+ movq %rdi, %rax
+ shrq $P4D_SHIFT, %rax
+ andl $(PTRS_PER_P4D-1), %eax
+ movq %rdx, 0(%rbx,%rax,8)
+#endif
+
+ addq $PAGE_SIZE, %rbx
addq $PAGE_SIZE, %rdx
movq %rdi, %rax
shrq $PUD_SHIFT, %rax
andl $(PTRS_PER_PUD-1), %eax
- movq %rdx, PAGE_SIZE(%rbx,%rax,8)
+ movq %rdx, 0(%rbx,%rax,8)
incl %eax
andl $(PTRS_PER_PUD-1), %eax
- movq %rdx, PAGE_SIZE(%rbx,%rax,8)
+ movq %rdx, 0(%rbx,%rax,8)

- addq $PAGE_SIZE * 2, %rbx
+ addq $PAGE_SIZE, %rbx
movq %rdi, %rax
shrq $PMD_SHIFT, %rdi
addq $(__PAGE_KERNEL_LARGE_EXEC & ~_PAGE_GLOBAL), %rax
@@ -166,7 +184,7 @@ startup_64:
addq %rbp, phys_base(%rip)

.Lskip_fixup:
- movq $(early_level4_pgt - __START_KERNEL_map), %rax
+ movq $(early_top_pgt - __START_KERNEL_map), %rax
jmp 1f
ENTRY(secondary_startup_64)
/*
@@ -186,14 +204,17 @@ ENTRY(secondary_startup_64)
/* Sanitize CPU configuration */
call verify_cpu

- movq $(init_level4_pgt - __START_KERNEL_map), %rax
+ movq $(init_top_pgt - __START_KERNEL_map), %rax
1:

- /* Enable PAE mode and PGE */
+ /* Enable PAE mode, PGE and LA57 */
movl $(X86_CR4_PAE | X86_CR4_PGE), %ecx
+#ifdef CONFIG_X86_5LEVEL
+ orl $X86_CR4_LA57, %ecx
+#endif
movq %rcx, %cr4

- /* Setup early boot stage 4 level pagetables. */
+ /* Setup early boot stage 4-/5-level pagetables. */
addq phys_base(%rip), %rax
movq %rax, %cr3

@@ -419,9 +440,13 @@ GLOBAL(name)
.endr

__INITDATA
-NEXT_PAGE(early_level4_pgt)
+NEXT_PAGE(early_top_pgt)
.fill 511,8,0
+#ifdef CONFIG_X86_5LEVEL
+ .quad level4_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE
+#else
.quad level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE
+#endif

NEXT_PAGE(early_dynamic_pgts)
.fill 512*EARLY_DYNAMIC_PAGE_TABLES,8,0
@@ -429,14 +454,14 @@ NEXT_PAGE(early_dynamic_pgts)
.data

#ifndef CONFIG_XEN
-NEXT_PAGE(init_level4_pgt)
+NEXT_PAGE(init_top_pgt)
.fill 512,8,0
#else
-NEXT_PAGE(init_level4_pgt)
+NEXT_PAGE(init_top_pgt)
.quad level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
- .org init_level4_pgt + L4_PAGE_OFFSET*8, 0
+ .org init_top_pgt + PGD_PAGE_OFFSET*8, 0
.quad level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
- .org init_level4_pgt + L4_START_KERNEL*8, 0
+ .org init_top_pgt + PGD_START_KERNEL*8, 0
/* (2^48-(2*1024*1024*1024))/(2^39) = 511 */
.quad level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE

@@ -450,6 +475,12 @@ NEXT_PAGE(level2_ident_pgt)
PMDS(0, __PAGE_KERNEL_IDENT_LARGE_EXEC, PTRS_PER_PMD)
#endif

+#ifdef CONFIG_X86_5LEVEL
+NEXT_PAGE(level4_kernel_pgt)
+ .fill 511,8,0
+ .quad level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE
+#endif
+
NEXT_PAGE(level3_kernel_pgt)
.fill L3_START_KERNEL,8,0
/* (2^48-(2*1024*1024*1024)-((2^39)*511))/(2^30) = 510 */
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index 42eae96c8450..4b520d072056 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -339,7 +339,7 @@ void machine_kexec(struct kimage *image)
void arch_crash_save_vmcoreinfo(void)
{
VMCOREINFO_NUMBER(phys_base);
- VMCOREINFO_SYMBOL(init_level4_pgt);
+ VMCOREINFO_SYMBOL(init_top_pgt);

#ifdef CONFIG_NUMA
VMCOREINFO_SYMBOL(node_data);
diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
index 0effac6989cd..0431bfd5e09f 100644
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -435,7 +435,7 @@ static void ptdump_walk_pgd_level_core(struct seq_file *m, pgd_t *pgd,
bool checkwx)
{
#ifdef CONFIG_X86_64
- pgd_t *start = (pgd_t *) &init_level4_pgt;
+ pgd_t *start = (pgd_t *) &init_top_pgt;
#else
pgd_t *start = swapper_pg_dir;
#endif
diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index bcabc56e0dc4..a25dd40a0683 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -10,7 +10,7 @@
#include <asm/tlbflush.h>
#include <asm/sections.h>

-extern pgd_t early_level4_pgt[PTRS_PER_PGD];
+extern pgd_t early_top_pgt[PTRS_PER_PGD];
extern struct range pfn_mapped[E820_X_MAX];

static int __init map_range(struct range *range)
@@ -103,8 +103,8 @@ void __init kasan_early_init(void)
for (i = 0; CONFIG_PGTABLE_LEVELS >= 5 && i < PTRS_PER_P4D; i++)
kasan_zero_p4d[i] = __p4d(p4d_val);

- kasan_map_early_shadow(early_level4_pgt);
- kasan_map_early_shadow(init_level4_pgt);
+ kasan_map_early_shadow(early_top_pgt);
+ kasan_map_early_shadow(init_top_pgt);
}

void __init kasan_init(void)
@@ -115,8 +115,8 @@ void __init kasan_init(void)
register_die_notifier(&kasan_die_notifier);
#endif

- memcpy(early_level4_pgt, init_level4_pgt, sizeof(early_level4_pgt));
- load_cr3(early_level4_pgt);
+ memcpy(early_top_pgt, init_top_pgt, sizeof(early_top_pgt));
+ load_cr3(early_top_pgt);
__flush_tlb_all();

clear_pgds(KASAN_SHADOW_START, KASAN_SHADOW_END);
@@ -142,7 +142,7 @@ void __init kasan_init(void)
kasan_populate_zero_shadow(kasan_mem_to_shadow((void *)MODULES_END),
(void *)KASAN_SHADOW_END);

- load_cr3(init_level4_pgt);
+ load_cr3(init_top_pgt);
__flush_tlb_all();

/*
diff --git a/arch/x86/realmode/init.c b/arch/x86/realmode/init.c
index 5db706f14111..dc0836d5c5eb 100644
--- a/arch/x86/realmode/init.c
+++ b/arch/x86/realmode/init.c
@@ -102,7 +102,7 @@ static void __init setup_real_mode(void)

trampoline_pgd = (u64 *) __va(real_mode_header->trampoline_pgd);
trampoline_pgd[0] = trampoline_pgd_entry.pgd;
- trampoline_pgd[511] = init_level4_pgt[511].pgd;
+ trampoline_pgd[511] = init_top_pgt[511].pgd;
#endif
}

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index c36c8178847d..a4079cfab007 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -97,7 +97,11 @@ static RESERVE_BRK_ARRAY(pte_t, level1_ident_pgt, LEVEL1_IDENT_ENTRIES);
#endif
#ifdef CONFIG_X86_64
/* l3 pud for userspace vsyscall mapping */
-static pud_t level3_user_vsyscall[PTRS_PER_PUD] __page_aligned_bss;
+#if CONFIG_PGTABLE_LEVELS == 5
+static p4d_t user_vsyscall[PTRS_PER_P4D] __page_aligned_bss;
+#else
+static pud_t user_vsyscall[PTRS_PER_PUD] __page_aligned_bss;
+#endif
#endif /* CONFIG_X86_64 */

/*
@@ -504,7 +508,7 @@ __visible pmd_t xen_make_pmd(pmdval_t pmd)
}
PV_CALLEE_SAVE_REGS_THUNK(xen_make_pmd);

-#if CONFIG_PGTABLE_LEVELS == 4
+#if CONFIG_PGTABLE_LEVELS >= 4
__visible pudval_t xen_pud_val(pud_t pud)
{
return pte_mfn_to_pfn(pud.pud);
@@ -1529,8 +1533,8 @@ static void xen_write_cr3(unsigned long cr3)
* At the start of the day - when Xen launches a guest, it has already
* built pagetables for the guest. We diligently look over them
* in xen_setup_kernel_pagetable and graft as appropriate them in the
- * init_level4_pgt and its friends. Then when we are happy we load
- * the new init_level4_pgt - and continue on.
+ * init_top_pgt and its friends. Then when we are happy we load
+ * the new init_top_pgt - and continue on.
*
* The generic code starts (start_kernel) and 'init_mem_mapping' sets
* up the rest of the pagetables. When it has completed it loads the cr3.
@@ -1583,7 +1587,7 @@ static int xen_pgd_alloc(struct mm_struct *mm)
if (user_pgd != NULL) {
#ifdef CONFIG_X86_VSYSCALL_EMULATION
user_pgd[pgd_index(VSYSCALL_ADDR)] =
- __pgd(__pa(level3_user_vsyscall) | _PAGE_TABLE);
+ __pgd(__pa(user_vsyscall) | _PAGE_TABLE);
#endif
ret = 0;
}
@@ -1973,13 +1977,13 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
pt_end = pt_base + xen_start_info->nr_pt_frames;

/* Zap identity mapping */
- init_level4_pgt[0] = __pgd(0);
+ init_top_pgt[0] = __pgd(0);

if (!xen_feature(XENFEAT_auto_translated_physmap)) {
/* Pre-constructed entries are in pfn, so convert to mfn */
/* L4[272] -> level3_ident_pgt
* L4[511] -> level3_kernel_pgt */
- convert_pfn_mfn(init_level4_pgt);
+ convert_pfn_mfn(init_top_pgt);

/* L3_i[0] -> level2_ident_pgt */
convert_pfn_mfn(level3_ident_pgt);
@@ -2010,14 +2014,14 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
/* Copy the initial P->M table mappings if necessary. */
i = pgd_index(xen_start_info->mfn_list);
if (i && i < pgd_index(__START_KERNEL_map))
- init_level4_pgt[i] = ((pgd_t *)xen_start_info->pt_base)[i];
+ init_top_pgt[i] = ((pgd_t *)xen_start_info->pt_base)[i];

if (!xen_feature(XENFEAT_auto_translated_physmap)) {
/* Make pagetable pieces RO */
- set_page_prot(init_level4_pgt, PAGE_KERNEL_RO);
+ set_page_prot(init_top_pgt, PAGE_KERNEL_RO);
set_page_prot(level3_ident_pgt, PAGE_KERNEL_RO);
set_page_prot(level3_kernel_pgt, PAGE_KERNEL_RO);
- set_page_prot(level3_user_vsyscall, PAGE_KERNEL_RO);
+ set_page_prot(user_vsyscall, PAGE_KERNEL_RO);
set_page_prot(level2_ident_pgt, PAGE_KERNEL_RO);
set_page_prot(level2_kernel_pgt, PAGE_KERNEL_RO);
set_page_prot(level2_fixmap_pgt, PAGE_KERNEL_RO);
@@ -2025,7 +2029,7 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)

/* Pin down new L4 */
pin_pagetable_pfn(MMUEXT_PIN_L4_TABLE,
- PFN_DOWN(__pa_symbol(init_level4_pgt)));
+ PFN_DOWN(__pa_symbol(init_top_pgt)));

/* Unpin Xen-provided one */
pin_pagetable_pfn(MMUEXT_UNPIN_TABLE, PFN_DOWN(__pa(pgd)));
@@ -2036,10 +2040,10 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
* pgd.
*/
xen_mc_batch();
- __xen_write_cr3(true, __pa(init_level4_pgt));
+ __xen_write_cr3(true, __pa(init_top_pgt));
xen_mc_issue(PARAVIRT_LAZY_CPU);
} else
- native_write_cr3(__pa(init_level4_pgt));
+ native_write_cr3(__pa(init_top_pgt));

/* We can't that easily rip out L3 and L2, as the Xen pagetables are
* set out this way: [L4], [L1], [L2], [L3], [L1], [L1] ... for
@@ -2444,7 +2448,11 @@ static void xen_set_fixmap(unsigned idx, phys_addr_t phys, pgprot_t prot)
pagetable vsyscall mapping. */
if (idx == VSYSCALL_PAGE) {
unsigned long vaddr = __fix_to_virt(idx);
- set_pte_vaddr_pud(level3_user_vsyscall, vaddr, pte);
+#if CONFIG_PGTABLE_LEVELS == 5
+ set_pte_vaddr_p4d(user_vsyscall, vaddr, pte);
+#else
+ set_pte_vaddr_pud(user_vsyscall, vaddr, pte);
+#endif
}
#endif
}
@@ -2475,7 +2483,7 @@ static void __init xen_post_allocator_init(void)

#ifdef CONFIG_X86_64
pv_mmu_ops.write_cr3 = &xen_write_cr3;
- SetPagePinned(virt_to_page(level3_user_vsyscall));
+ SetPagePinned(virt_to_page(user_vsyscall));
#endif
xen_mark_init_mm_pinned();
}
--
2.11.0

2017-03-06 14:07:40

by Kirill A. Shutemov

Subject: [PATCHv4 18/33] x86/xen: convert __xen_pgd_walk() and xen_cleanmfnmap() to support p4d

Split these helpers into a few per-level functions and add p4d support.
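
The per-level split can be pictured with a toy userspace model, given here only
as a sketch of the pattern and not as the Xen code. All names (toy_table,
toy_walk, toy_count_visit and so on) are hypothetical, and the tables have 4
entries instead of 512. Each level visits only the entries up to the limit
index when it lies on the "last" path, and passes "last" down only through its
final entry. Note that this sketch ORs the nested walk's return value into
flush, which is a choice made for the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_ENTRIES 4   /* toy fan-out; real x86-64 tables have 512 entries */
#define TOY_BITS    2   /* address bits consumed per level */

struct toy_table {
    struct toy_table *slot[TOY_ENTRIES];   /* NULL means "none" */
};

struct toy_stats {
    int visited[3];   /* visited[k] counts callbacks on level-k pages */
};

typedef int (*toy_visit_fn)(struct toy_table *page, int level,
                            struct toy_stats *st);

/* Example callback: count the visit; only leaf pages ask for a flush. */
static int toy_count_visit(struct toy_table *page, int level,
                           struct toy_stats *st)
{
    (void)page;
    st->visited[level]++;
    return level == 0;
}

/* Index of the entry for `addr` in a table at `level` (level >= 1). */
static unsigned int toy_index(unsigned long addr, int level)
{
    return (addr >> (TOY_BITS * (level - 1))) & (TOY_ENTRIES - 1);
}

/*
 * Walk the entries of `table`, calling `fn` on every populated child page.
 * If `last` is set, this table lies on the path to `limit`, so stop after
 * the entry that covers `limit`; only that final entry stays "last".
 */
static int toy_walk(struct toy_table *table, int level, toy_visit_fn fn,
                    struct toy_stats *st, bool last, unsigned long limit)
{
    int i, nr, flush = 0;

    assert(level >= 1);

    nr = last ? (int)toy_index(limit, level) + 1 : TOY_ENTRIES;
    for (i = 0; i < nr; i++) {
        struct toy_table *child = table->slot[i];

        if (!child)
            continue;

        flush |= fn(child, level - 1, st);
        if (level > 1)
            flush |= toy_walk(child, level - 1, fn, st,
                              last && i == nr - 1, limit);
    }
    return flush;
}
```

With a two-level tree whose leaves are all populated and limit set to leaf
number 5, the walk visits every leaf under the first mid-level table and only
the first two leaves under the second one.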

Signed-off-by: Xiong Zhang <[email protected]>
[[email protected]: split off into separate patch]
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/xen/mmu.c | 243 ++++++++++++++++++++++++++++++++---------------------
arch/x86/xen/mmu.h | 1 +
2 files changed, 148 insertions(+), 96 deletions(-)

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index 37cb5aad71de..75af8da7b54f 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -593,6 +593,62 @@ static void xen_set_pgd(pgd_t *ptr, pgd_t val)
}
#endif /* CONFIG_PGTABLE_LEVELS == 4 */

+static int xen_pmd_walk(struct mm_struct *mm, pmd_t *pmd,
+ int (*func)(struct mm_struct *mm, struct page *, enum pt_level),
+ bool last, unsigned long limit)
+{
+ int i, nr, flush = 0;
+
+ nr = last ? pmd_index(limit) + 1 : PTRS_PER_PMD;
+ for (i = 0; i < nr; i++) {
+ if (!pmd_none(pmd[i]))
+ flush |= (*func)(mm, pmd_page(pmd[i]), PT_PTE);
+ }
+ return flush;
+}
+
+static int xen_pud_walk(struct mm_struct *mm, pud_t *pud,
+ int (*func)(struct mm_struct *mm, struct page *, enum pt_level),
+ bool last, unsigned long limit)
+{
+ int i, nr, flush = 0;
+
+ nr = last ? pud_index(limit) + 1 : PTRS_PER_PUD;
+ for (i = 0; i < nr; i++) {
+ pmd_t *pmd;
+
+ if (pud_none(pud[i]))
+ continue;
+
+ pmd = pmd_offset(&pud[i], 0);
+ if (PTRS_PER_PMD > 1)
+ flush |= (*func)(mm, virt_to_page(pmd), PT_PMD);
+ xen_pmd_walk(mm, pmd, func, last && i == nr - 1, limit);
+ }
+ return flush;
+}
+
+static int xen_p4d_walk(struct mm_struct *mm, p4d_t *p4d,
+ int (*func)(struct mm_struct *mm, struct page *, enum pt_level),
+ bool last, unsigned long limit)
+{
+ int i, nr, flush = 0;
+
+ nr = last ? p4d_index(limit) + 1 : PTRS_PER_P4D;
+ for (i = 0; i < nr; i++) {
+ pud_t *pud;
+
+ if (p4d_none(p4d[i]))
+ continue;
+
+ pud = pud_offset(&p4d[i], 0);
+ if (PTRS_PER_PUD > 1)
+ flush |= (*func)(mm, virt_to_page(pud), PT_PUD);
+ xen_pud_walk(mm, pud, func, last && i == nr - 1, limit);
+ }
+ return flush;
+}
+
/*
* (Yet another) pagetable walker. This one is intended for pinning a
* pagetable. This means that it walks a pagetable and calls the
@@ -613,10 +669,8 @@ static int __xen_pgd_walk(struct mm_struct *mm, pgd_t *pgd,
enum pt_level),
unsigned long limit)
{
- int flush = 0;
+ int i, nr, flush = 0;
unsigned hole_low, hole_high;
- unsigned pgdidx_limit, pudidx_limit, pmdidx_limit;
- unsigned pgdidx, pudidx, pmdidx;

/* The limit is the last byte to be touched */
limit--;
@@ -633,65 +687,22 @@ static int __xen_pgd_walk(struct mm_struct *mm, pgd_t *pgd,
hole_low = pgd_index(USER_LIMIT);
hole_high = pgd_index(PAGE_OFFSET);

- pgdidx_limit = pgd_index(limit);
-#if PTRS_PER_PUD > 1
- pudidx_limit = pud_index(limit);
-#else
- pudidx_limit = 0;
-#endif
-#if PTRS_PER_PMD > 1
- pmdidx_limit = pmd_index(limit);
-#else
- pmdidx_limit = 0;
-#endif
-
- for (pgdidx = 0; pgdidx <= pgdidx_limit; pgdidx++) {
- pud_t *pud;
+ nr = pgd_index(limit) + 1;
+ for (i = 0; i < nr; i++) {
+ p4d_t *p4d;

- if (pgdidx >= hole_low && pgdidx < hole_high)
+ if (i >= hole_low && i < hole_high)
continue;

- if (!pgd_val(pgd[pgdidx]))
+ if (pgd_none(pgd[i]))
continue;

- pud = pud_offset(&pgd[pgdidx], 0);
-
- if (PTRS_PER_PUD > 1) /* not folded */
- flush |= (*func)(mm, virt_to_page(pud), PT_PUD);
-
- for (pudidx = 0; pudidx < PTRS_PER_PUD; pudidx++) {
- pmd_t *pmd;
-
- if (pgdidx == pgdidx_limit &&
- pudidx > pudidx_limit)
- goto out;
-
- if (pud_none(pud[pudidx]))
- continue;
-
- pmd = pmd_offset(&pud[pudidx], 0);
-
- if (PTRS_PER_PMD > 1) /* not folded */
- flush |= (*func)(mm, virt_to_page(pmd), PT_PMD);
-
- for (pmdidx = 0; pmdidx < PTRS_PER_PMD; pmdidx++) {
- struct page *pte;
-
- if (pgdidx == pgdidx_limit &&
- pudidx == pudidx_limit &&
- pmdidx > pmdidx_limit)
- goto out;
-
- if (pmd_none(pmd[pmdidx]))
- continue;
-
- pte = pmd_page(pmd[pmdidx]);
- flush |= (*func)(mm, pte, PT_PTE);
- }
- }
+ p4d = p4d_offset(&pgd[i], 0);
+ if (PTRS_PER_P4D > 1)
+ flush |= (*func)(mm, virt_to_page(p4d), PT_P4D);
+ flush |= xen_p4d_walk(mm, p4d, func, i == nr - 1, limit);
}

-out:
/* Do the top level last, so that the callbacks can use it as
a cue to do final things like tlb flushes. */
flush |= (*func)(mm, virt_to_page(pgd), PT_PGD);
@@ -1150,57 +1161,97 @@ static void __init xen_cleanmfnmap_free_pgtbl(void *pgtbl, bool unpin)
xen_free_ro_pages(pa, PAGE_SIZE);
}

+static void __init xen_cleanmfnmap_pmd(pmd_t *pmd, bool unpin)
+{
+ unsigned long pa;
+ pte_t *pte_tbl;
+ int i;
+
+ if (pmd_large(*pmd)) {
+ pa = pmd_val(*pmd) & PHYSICAL_PAGE_MASK;
+ xen_free_ro_pages(pa, PMD_SIZE);
+ return;
+ }
+
+ pte_tbl = pte_offset_kernel(pmd, 0);
+ for (i = 0; i < PTRS_PER_PTE; i++) {
+ if (pte_none(pte_tbl[i]))
+ continue;
+ pa = pte_pfn(pte_tbl[i]) << PAGE_SHIFT;
+ xen_free_ro_pages(pa, PAGE_SIZE);
+ }
+ set_pmd(pmd, __pmd(0));
+ xen_cleanmfnmap_free_pgtbl(pte_tbl, unpin);
+}
+
+static void __init xen_cleanmfnmap_pud(pud_t *pud, bool unpin)
+{
+ unsigned long pa;
+ pmd_t *pmd_tbl;
+ int i;
+
+ if (pud_large(*pud)) {
+ pa = pud_val(*pud) & PHYSICAL_PAGE_MASK;
+ xen_free_ro_pages(pa, PUD_SIZE);
+ return;
+ }
+
+ pmd_tbl = pmd_offset(pud, 0);
+ for (i = 0; i < PTRS_PER_PMD; i++) {
+ if (pmd_none(pmd_tbl[i]))
+ continue;
+ xen_cleanmfnmap_pmd(pmd_tbl + i, unpin);
+ }
+ set_pud(pud, __pud(0));
+ xen_cleanmfnmap_free_pgtbl(pmd_tbl, unpin);
+}
+
+static void __init xen_cleanmfnmap_p4d(p4d_t *p4d, bool unpin)
+{
+ unsigned long pa;
+ pud_t *pud_tbl;
+ int i;
+
+ if (p4d_large(*p4d)) {
+ pa = p4d_val(*p4d) & PHYSICAL_PAGE_MASK;
+ xen_free_ro_pages(pa, P4D_SIZE);
+ return;
+ }
+
+ pud_tbl = pud_offset(p4d, 0);
+ for (i = 0; i < PTRS_PER_PUD; i++) {
+ if (pud_none(pud_tbl[i]))
+ continue;
+ xen_cleanmfnmap_pud(pud_tbl + i, unpin);
+ }
+ set_p4d(p4d, __p4d(0));
+ xen_cleanmfnmap_free_pgtbl(pud_tbl, unpin);
+}
+
/*
* Since it is well isolated we can (and since it is perhaps large we should)
* also free the page tables mapping the initial P->M table.
*/
static void __init xen_cleanmfnmap(unsigned long vaddr)
{
- unsigned long va = vaddr & PMD_MASK;
- unsigned long pa;
- pgd_t *pgd = pgd_offset_k(va);
- pud_t *pud_page = pud_offset(pgd, 0);
- pud_t *pud;
- pmd_t *pmd;
- pte_t *pte;
+ pgd_t *pgd;
+ p4d_t *p4d;
unsigned int i;
bool unpin;

unpin = (vaddr == 2 * PGDIR_SIZE);
- set_pgd(pgd, __pgd(0));
- do {
- pud = pud_page + pud_index(va);
- if (pud_none(*pud)) {
- va += PUD_SIZE;
- } else if (pud_large(*pud)) {
- pa = pud_val(*pud) & PHYSICAL_PAGE_MASK;
- xen_free_ro_pages(pa, PUD_SIZE);
- va += PUD_SIZE;
- } else {
- pmd = pmd_offset(pud, va);
- if (pmd_large(*pmd)) {
- pa = pmd_val(*pmd) & PHYSICAL_PAGE_MASK;
- xen_free_ro_pages(pa, PMD_SIZE);
- } else if (!pmd_none(*pmd)) {
- pte = pte_offset_kernel(pmd, va);
- set_pmd(pmd, __pmd(0));
- for (i = 0; i < PTRS_PER_PTE; ++i) {
- if (pte_none(pte[i]))
- break;
- pa = pte_pfn(pte[i]) << PAGE_SHIFT;
- xen_free_ro_pages(pa, PAGE_SIZE);
- }
- xen_cleanmfnmap_free_pgtbl(pte, unpin);
- }
- va += PMD_SIZE;
- if (pmd_index(va))
- continue;
- set_pud(pud, __pud(0));
- xen_cleanmfnmap_free_pgtbl(pmd, unpin);
- }
-
- } while (pud_index(va) || pmd_index(va));
- xen_cleanmfnmap_free_pgtbl(pud_page, unpin);
+ vaddr &= PMD_MASK;
+ pgd = pgd_offset_k(vaddr);
+ p4d = p4d_offset(pgd, 0);
+ for (i = 0; i < PTRS_PER_P4D; i++) {
+ if (p4d_none(p4d[i]))
+ continue;
+ xen_cleanmfnmap_p4d(p4d + i, unpin);
+ }
+ if (IS_ENABLED(CONFIG_X86_5LEVEL)) {
+ set_pgd(pgd, __pgd(0));
+ xen_cleanmfnmap_free_pgtbl(p4d, unpin);
+ }
}

static void __init xen_pagetable_p2m_free(void)
diff --git a/arch/x86/xen/mmu.h b/arch/x86/xen/mmu.h
index 73809bb951b4..3fe2b3292915 100644
--- a/arch/x86/xen/mmu.h
+++ b/arch/x86/xen/mmu.h
@@ -5,6 +5,7 @@

enum pt_level {
PT_PGD,
+ PT_P4D,
PT_PUD,
PT_PMD,
PT_PTE
--
2.11.0

2017-03-06 14:07:54

by Kirill A. Shutemov

Subject: [PATCHv4 14/33] x86/kexec: support p4d_t

Handle additional page table level in kexec code.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/include/asm/kexec.h | 1 +
arch/x86/kernel/machine_kexec_32.c | 4 +++-
arch/x86/kernel/machine_kexec_64.c | 14 ++++++++++++--
3 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
index 282630e4c6ea..70ef205489f0 100644
--- a/arch/x86/include/asm/kexec.h
+++ b/arch/x86/include/asm/kexec.h
@@ -164,6 +164,7 @@ struct kimage_arch {
};
#else
struct kimage_arch {
+ p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
pte_t *pte;
diff --git a/arch/x86/kernel/machine_kexec_32.c b/arch/x86/kernel/machine_kexec_32.c
index 469b23d6acc2..5f43cec296c5 100644
--- a/arch/x86/kernel/machine_kexec_32.c
+++ b/arch/x86/kernel/machine_kexec_32.c
@@ -103,6 +103,7 @@ static void machine_kexec_page_table_set_one(
pgd_t *pgd, pmd_t *pmd, pte_t *pte,
unsigned long vaddr, unsigned long paddr)
{
+ p4d_t *p4d;
pud_t *pud;

pgd += pgd_index(vaddr);
@@ -110,7 +111,8 @@ static void machine_kexec_page_table_set_one(
if (!(pgd_val(*pgd) & _PAGE_PRESENT))
set_pgd(pgd, __pgd(__pa(pmd) | _PAGE_PRESENT));
#endif
- pud = pud_offset(pgd, vaddr);
+ p4d = p4d_offset(pgd, vaddr);
+ pud = pud_offset(p4d, vaddr);
pmd = pmd_offset(pud, vaddr);
if (!(pmd_val(*pmd) & _PAGE_PRESENT))
set_pmd(pmd, __pmd(__pa(pte) | _PAGE_TABLE));
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index 307b1f4543de..42eae96c8450 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -36,6 +36,7 @@ static struct kexec_file_ops *kexec_file_loaders[] = {

static void free_transition_pgtable(struct kimage *image)
{
+ free_page((unsigned long)image->arch.p4d);
free_page((unsigned long)image->arch.pud);
free_page((unsigned long)image->arch.pmd);
free_page((unsigned long)image->arch.pte);
@@ -43,6 +44,7 @@ static void free_transition_pgtable(struct kimage *image)

static int init_transition_pgtable(struct kimage *image, pgd_t *pgd)
{
+ p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
pte_t *pte;
@@ -53,13 +55,21 @@ static int init_transition_pgtable(struct kimage *image, pgd_t *pgd)
paddr = __pa(page_address(image->control_code_page)+PAGE_SIZE);
pgd += pgd_index(vaddr);
if (!pgd_present(*pgd)) {
+ p4d = (p4d_t *)get_zeroed_page(GFP_KERNEL);
+ if (!p4d)
+ goto err;
+ image->arch.p4d = p4d;
+ set_pgd(pgd, __pgd(__pa(p4d) | _KERNPG_TABLE));
+ }
+ p4d = p4d_offset(pgd, vaddr);
+ if (!p4d_present(*p4d)) {
pud = (pud_t *)get_zeroed_page(GFP_KERNEL);
if (!pud)
goto err;
image->arch.pud = pud;
- set_pgd(pgd, __pgd(__pa(pud) | _KERNPG_TABLE));
+ set_p4d(p4d, __p4d(__pa(pud) | _KERNPG_TABLE));
}
- pud = pud_offset(pgd, vaddr);
+ pud = pud_offset(p4d, vaddr);
if (!pud_present(*pud)) {
pmd = (pmd_t *)get_zeroed_page(GFP_KERNEL);
if (!pmd)
--
2.11.0

2017-03-06 14:08:04

by Kirill A. Shutemov

Subject: [PATCHv4 20/33] x86: detect 5-level paging support

5-level paging support is required from hardware when compiled with
CONFIG_X86_5LEVEL=y. We may implement runtime switch support later.
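
As a rough illustration of the check described above (not code from the patch), the sketch below shows how a feature number such as X86_FEATURE_LA57 splits into a capability word and a bit mask. The helper names feature_word(), feature_mask() and cpu_has_la57() are made up for this example; the ecx argument is assumed to be the ECX value returned by CPUID leaf 7, subleaf 0, which the patch stores in cpu.flags[16].

```c
#include <stdint.h>

/* Feature numbers encode (word * 32 + bit), as in cpufeatures.h. */
#define X86_FEATURE_LA57 (16 * 32 + 16)

/* Index of the 32-bit capability word that holds the feature. */
static unsigned int feature_word(unsigned int feature)
{
	return feature / 32;
}

/* Bit mask of the feature inside its capability word. */
static uint32_t feature_mask(unsigned int feature)
{
	return (uint32_t)1 << (feature & 31);
}

/*
 * Hypothetical helper: 'ecx' is assumed to be the ECX output of
 * CPUID leaf 7, subleaf 0 (capability word 16).
 */
static int cpu_has_la57(uint32_t ecx)
{
	return (ecx & feature_mask(X86_FEATURE_LA57)) != 0;
}
```

The same mask expression, (1 << (X86_FEATURE_LA57 & 31)), is what NEED_LA57 and DISABLE_LA57 use in the headers touched by this patch.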

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/boot/cpucheck.c | 9 +++++++++
arch/x86/boot/cpuflags.c | 12 ++++++++++--
arch/x86/include/asm/disabled-features.h | 8 +++++++-
arch/x86/include/asm/required-features.h | 8 +++++++-
4 files changed, 33 insertions(+), 4 deletions(-)

diff --git a/arch/x86/boot/cpucheck.c b/arch/x86/boot/cpucheck.c
index 4ad7d70e8739..8f0c4c9fc904 100644
--- a/arch/x86/boot/cpucheck.c
+++ b/arch/x86/boot/cpucheck.c
@@ -44,6 +44,15 @@ static const u32 req_flags[NCAPINTS] =
0, /* REQUIRED_MASK5 not implemented in this file */
REQUIRED_MASK6,
0, /* REQUIRED_MASK7 not implemented in this file */
+ 0, /* REQUIRED_MASK8 not implemented in this file */
+ 0, /* REQUIRED_MASK9 not implemented in this file */
+ 0, /* REQUIRED_MASK10 not implemented in this file */
+ 0, /* REQUIRED_MASK11 not implemented in this file */
+ 0, /* REQUIRED_MASK12 not implemented in this file */
+ 0, /* REQUIRED_MASK13 not implemented in this file */
+ 0, /* REQUIRED_MASK14 not implemented in this file */
+ 0, /* REQUIRED_MASK15 not implemented in this file */
+ REQUIRED_MASK16,
};

#define A32(a, b, c, d) (((d) << 24)+((c) << 16)+((b) << 8)+(a))
diff --git a/arch/x86/boot/cpuflags.c b/arch/x86/boot/cpuflags.c
index 6687ab953257..9e77c23c2422 100644
--- a/arch/x86/boot/cpuflags.c
+++ b/arch/x86/boot/cpuflags.c
@@ -70,16 +70,19 @@ int has_eflag(unsigned long mask)
# define EBX_REG "=b"
#endif

-static inline void cpuid(u32 id, u32 *a, u32 *b, u32 *c, u32 *d)
+static inline void cpuid_count(u32 id, u32 count,
+ u32 *a, u32 *b, u32 *c, u32 *d)
{
asm volatile(".ifnc %%ebx,%3 ; movl %%ebx,%3 ; .endif \n\t"
"cpuid \n\t"
".ifnc %%ebx,%3 ; xchgl %%ebx,%3 ; .endif \n\t"
: "=a" (*a), "=c" (*c), "=d" (*d), EBX_REG (*b)
- : "a" (id)
+ : "a" (id), "c" (count)
);
}

+#define cpuid(id, a, b, c, d) cpuid_count(id, 0, a, b, c, d)
+
void get_cpuflags(void)
{
u32 max_intel_level, max_amd_level;
@@ -108,6 +111,11 @@ void get_cpuflags(void)
cpu.model += ((tfms >> 16) & 0xf) << 4;
}

+ if (max_intel_level >= 0x00000007) {
+ cpuid_count(0x00000007, 0, &ignored, &ignored,
+ &cpu.flags[16], &ignored);
+ }
+
cpuid(0x80000000, &max_amd_level, &ignored, &ignored,
&ignored);

diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
index 85599ad4d024..fc0960236fc3 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -36,6 +36,12 @@
# define DISABLE_OSPKE (1<<(X86_FEATURE_OSPKE & 31))
#endif /* CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS */

+#ifdef CONFIG_X86_5LEVEL
+#define DISABLE_LA57 0
+#else
+#define DISABLE_LA57 (1<<(X86_FEATURE_LA57 & 31))
+#endif
+
/*
* Make sure to add features to the correct mask
*/
@@ -55,7 +61,7 @@
#define DISABLED_MASK13 0
#define DISABLED_MASK14 0
#define DISABLED_MASK15 0
-#define DISABLED_MASK16 (DISABLE_PKU|DISABLE_OSPKE)
+#define DISABLED_MASK16 (DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57)
#define DISABLED_MASK17 0
#define DISABLED_MASK_CHECK BUILD_BUG_ON_ZERO(NCAPINTS != 18)

diff --git a/arch/x86/include/asm/required-features.h b/arch/x86/include/asm/required-features.h
index fac9a5c0abe9..d91ba04dd007 100644
--- a/arch/x86/include/asm/required-features.h
+++ b/arch/x86/include/asm/required-features.h
@@ -53,6 +53,12 @@
# define NEED_MOVBE 0
#endif

+#ifdef CONFIG_X86_5LEVEL
+# define NEED_LA57 (1<<(X86_FEATURE_LA57 & 31))
+#else
+# define NEED_LA57 0
+#endif
+
#ifdef CONFIG_X86_64
#ifdef CONFIG_PARAVIRT
/* Paravirtualized systems may not have PSE or PGE available */
@@ -98,7 +104,7 @@
#define REQUIRED_MASK13 0
#define REQUIRED_MASK14 0
#define REQUIRED_MASK15 0
-#define REQUIRED_MASK16 0
+#define REQUIRED_MASK16 (NEED_LA57)
#define REQUIRED_MASK17 0
#define REQUIRED_MASK_CHECK BUILD_BUG_ON_ZERO(NCAPINTS != 18)

--
2.11.0

2017-03-06 14:08:14

by Kirill A. Shutemov

Subject: [PATCHv4 21/33] x86/asm: remove __VIRTUAL_MASK_SHIFT==47 assert

We don't need it anymore. 17be0aec74fb ("x86/asm/entry/64: Implement
better check for canonical addresses") made canonical address check
generic wrt. address width.
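
For readers who prefer C to the shl/sar pair, here is a minimal user-space sketch of the same sign-extension idea. It is not the kernel's code: sign_extend_va() and is_canonical() are hypothetical names, and the mask shift values 47 (4-level) and 56 (5-level) are assumptions taken from the address widths discussed in this series.

```c
#include <stdint.h>

/*
 * Replace every bit above 'mask_shift' with a copy of bit 'mask_shift',
 * which is what the shl/sar pair does with %rcx.
 * Assumes 0 < mask_shift < 63.
 */
static uint64_t sign_extend_va(uint64_t addr, unsigned int mask_shift)
{
	unsigned int shift = 64 - (mask_shift + 1);
	uint64_t low = (addr << shift) >> shift;	/* bits [mask_shift:0] */
	uint64_t sign = (low >> mask_shift) & 1;
	uint64_t high = sign ? ~(((uint64_t)1 << (mask_shift + 1)) - 1) : 0;

	return low | high;
}

/* An address is canonical if sign extension leaves it unchanged. */
static int is_canonical(uint64_t addr, unsigned int mask_shift)
{
	return sign_extend_va(addr, mask_shift) == addr;
}
```

Because the shift count is derived from the mask shift, nothing in the check depends on the value 47, which is why the assert can go.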

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/entry/entry_64.S | 7 ++-----
1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 044d18ebc43c..f07b4efb34d5 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -265,12 +265,9 @@ return_from_SYSCALL_64:
*
* If width of "canonical tail" ever becomes variable, this will need
* to be updated to remain correct on both old and new CPUs.
+ *
+ * Change top 16 bits to be the sign-extension of 47th bit
*/
- .ifne __VIRTUAL_MASK_SHIFT - 47
- .error "virtual address width changed -- SYSRET checks need update"
- .endif
-
- /* Change top 16 bits to be the sign-extension of 47th bit */
shl $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
sar $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx

--
2.11.0

2017-03-06 14:08:19

by Kirill A. Shutemov

Subject: [PATCHv4 27/33] x86/espfix: support 5-level paging

We don't need extra virtual address space for ESPFIX, so it stays within
one PUD page table for both 4- and 5-level paging.
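
To make the arithmetic concrete, the sketch below evaluates the ESPFIX_PAGE_SPACE expression for different shifts. The shift values are assumptions for x86-64 as I understand them (PAGE_SHIFT 12, P4D_SHIFT 39 in both modes, PGDIR_SHIFT 39 with 4-level and 48 with 5-level paging), and espfix_page_space() is a made-up helper, not something the patch adds.

```c
/* Assumed x86-64 values, for illustration only. */
#define PAGE_SHIFT		12
#define P4D_SHIFT		39	/* same with 4- and 5-level paging */
#define PGDIR_SHIFT_4LEVEL	39
#define PGDIR_SHIFT_5LEVEL	48

/* Same expression as ESPFIX_PAGE_SPACE, with the shift as a parameter. */
static unsigned long espfix_page_space(unsigned int top_shift)
{
	return 1UL << (top_shift - PAGE_SHIFT - 16);
}
```

With P4D_SHIFT the result is 2048 pages in both paging modes, whereas keeping PGDIR_SHIFT would have grown the espfix area by a factor of 512 under 5-level paging.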

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/kernel/espfix_64.c | 12 +++++++-----
1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/espfix_64.c b/arch/x86/kernel/espfix_64.c
index 04f89caef9c4..8e598a1ad986 100644
--- a/arch/x86/kernel/espfix_64.c
+++ b/arch/x86/kernel/espfix_64.c
@@ -50,11 +50,11 @@
#define ESPFIX_STACKS_PER_PAGE (PAGE_SIZE/ESPFIX_STACK_SIZE)

/* There is address space for how many espfix pages? */
-#define ESPFIX_PAGE_SPACE (1UL << (PGDIR_SHIFT-PAGE_SHIFT-16))
+#define ESPFIX_PAGE_SPACE (1UL << (P4D_SHIFT-PAGE_SHIFT-16))

#define ESPFIX_MAX_CPUS (ESPFIX_STACKS_PER_PAGE * ESPFIX_PAGE_SPACE)
#if CONFIG_NR_CPUS > ESPFIX_MAX_CPUS
-# error "Need more than one PGD for the ESPFIX hack"
+# error "Need more virtual address space for the ESPFIX hack"
#endif

#define PGALLOC_GFP (GFP_KERNEL | __GFP_NOTRACK | __GFP_ZERO)
@@ -121,11 +121,13 @@ static void init_espfix_random(void)

void __init init_espfix_bsp(void)
{
- pgd_t *pgd_p;
+ pgd_t *pgd;
+ p4d_t *p4d;

/* Install the espfix pud into the kernel page directory */
- pgd_p = &init_level4_pgt[pgd_index(ESPFIX_BASE_ADDR)];
- pgd_populate(&init_mm, pgd_p, (pud_t *)espfix_pud_page);
+ pgd = &init_level4_pgt[pgd_index(ESPFIX_BASE_ADDR)];
+ p4d = p4d_alloc(&init_mm, pgd, ESPFIX_BASE_ADDR);
+ p4d_populate(&init_mm, p4d, espfix_pud_page);

/* Randomize the locations */
init_espfix_random();
--
2.11.0

2017-03-06 14:08:29

by Kirill A. Shutemov

Subject: [PATCHv4 01/33] x86/cpufeature: Add 5-level paging detection

Look for 'la57' in /proc/cpuinfo to see if your machine supports 5-level
paging.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/include/asm/cpufeatures.h | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 4e7772387c6e..b04bb6dfed7f 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -289,7 +289,8 @@
#define X86_FEATURE_PKU (16*32+ 3) /* Protection Keys for Userspace */
#define X86_FEATURE_OSPKE (16*32+ 4) /* OS Protection Keys Enable */
#define X86_FEATURE_AVX512_VPOPCNTDQ (16*32+14) /* POPCNT for vectors of DW/QW */
-#define X86_FEATURE_RDPID (16*32+ 22) /* RDPID instruction */
+#define X86_FEATURE_LA57 (16*32+16) /* 5-level page tables */
+#define X86_FEATURE_RDPID (16*32+22) /* RDPID instruction */

/* AMD-defined CPU features, CPUID level 0x80000007 (ebx), word 17 */
#define X86_FEATURE_OVERFLOW_RECOV (17*32+0) /* MCA overflow recovery support */
--
2.11.0

2017-03-06 14:08:39

by Kirill A. Shutemov

Subject: [PATCHv4 05/33] asm-generic: introduce <asm-generic/pgtable-nop4d.h>

As with pgtable-nopud.h for 4-level paging, this new header is the base
for converting an architecture to a properly folded p4d_t level.
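
A stand-alone toy model of what "folded" means here may help. It mirrors the typedef and p4d_offset() from the new header, but the pgd_t definition and pgd_val() below are simplified stand-ins so that the snippet compiles outside the kernel.

```c
/* Simplified stand-in for the architecture's real pgd_t. */
typedef struct { unsigned long pgd; } pgd_t;

/* Folded level: a p4d_t is just a wrapper around a pgd entry. */
typedef struct { pgd_t pgd; } p4d_t;

#define PTRS_PER_P4D	1

static unsigned long pgd_val(pgd_t pgd)
{
	return pgd.pgd;
}

#define p4d_val(x)	(pgd_val((x).pgd))

/* The one-entry p4d "table" lives inside the pgd entry itself. */
static p4d_t *p4d_offset(pgd_t *pgd, unsigned long address)
{
	(void)address;		/* a single entry, so no index is needed */
	return (p4d_t *)pgd;
}
```

Generic code can then always walk pgd -> p4d -> pud; on a folded configuration the p4d step costs nothing because it hands back the same entry.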

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/asm-generic/pgtable-nop4d.h | 56 +++++++++++++++++++++++++++++++++++++
include/asm-generic/pgtable-nopud.h | 43 ++++++++++++++--------------
include/asm-generic/tlb.h | 14 ++++++++--
3 files changed, 89 insertions(+), 24 deletions(-)
create mode 100644 include/asm-generic/pgtable-nop4d.h

diff --git a/include/asm-generic/pgtable-nop4d.h b/include/asm-generic/pgtable-nop4d.h
new file mode 100644
index 000000000000..de364ecb8df6
--- /dev/null
+++ b/include/asm-generic/pgtable-nop4d.h
@@ -0,0 +1,56 @@
+#ifndef _PGTABLE_NOP4D_H
+#define _PGTABLE_NOP4D_H
+
+#ifndef __ASSEMBLY__
+
+#define __PAGETABLE_P4D_FOLDED
+
+typedef struct { pgd_t pgd; } p4d_t;
+
+#define P4D_SHIFT PGDIR_SHIFT
+#define PTRS_PER_P4D 1
+#define P4D_SIZE (1UL << P4D_SHIFT)
+#define P4D_MASK (~(P4D_SIZE-1))
+
+/*
+ * The "pgd_xxx()" functions here are trivial for a folded two-level
+ * setup: the p4d is never bad, and a p4d always exists (as it's folded
+ * into the pgd entry)
+ */
+static inline int pgd_none(pgd_t pgd) { return 0; }
+static inline int pgd_bad(pgd_t pgd) { return 0; }
+static inline int pgd_present(pgd_t pgd) { return 1; }
+static inline void pgd_clear(pgd_t *pgd) { }
+#define p4d_ERROR(p4d) (pgd_ERROR((p4d).pgd))
+
+#define pgd_populate(mm, pgd, p4d) do { } while (0)
+/*
+ * (p4ds are folded into pgds so this doesn't get actually called,
+ * but the define is needed for a generic inline function.)
+ */
+#define set_pgd(pgdptr, pgdval) set_p4d((p4d_t *)(pgdptr), (p4d_t) { pgdval })
+
+static inline p4d_t *p4d_offset(pgd_t *pgd, unsigned long address)
+{
+ return (p4d_t *)pgd;
+}
+
+#define p4d_val(x) (pgd_val((x).pgd))
+#define __p4d(x) ((p4d_t) { __pgd(x) })
+
+#define pgd_page(pgd) (p4d_page((p4d_t){ pgd }))
+#define pgd_page_vaddr(pgd) (p4d_page_vaddr((p4d_t){ pgd }))
+
+/*
+ * allocating and freeing a p4d is trivial: the 1-entry p4d is
+ * inside the pgd, so has no extra memory associated with it.
+ */
+#define p4d_alloc_one(mm, address) NULL
+#define p4d_free(mm, x) do { } while (0)
+#define __p4d_free_tlb(tlb, x, a) do { } while (0)
+
+#undef p4d_addr_end
+#define p4d_addr_end(addr, end) (end)
+
+#endif /* __ASSEMBLY__ */
+#endif /* _PGTABLE_NOP4D_H */
diff --git a/include/asm-generic/pgtable-nopud.h b/include/asm-generic/pgtable-nopud.h
index 5e49430a30a4..c2b9b96d6268 100644
--- a/include/asm-generic/pgtable-nopud.h
+++ b/include/asm-generic/pgtable-nopud.h
@@ -6,53 +6,54 @@
#ifdef __ARCH_USE_5LEVEL_HACK
#include <asm-generic/pgtable-nop4d-hack.h>
#else
+#include <asm-generic/pgtable-nop4d.h>

#define __PAGETABLE_PUD_FOLDED

/*
- * Having the pud type consist of a pgd gets the size right, and allows
- * us to conceptually access the pgd entry that this pud is folded into
+ * Having the pud type consist of a p4d gets the size right, and allows
+ * us to conceptually access the p4d entry that this pud is folded into
* without casting.
*/
-typedef struct { pgd_t pgd; } pud_t;
+typedef struct { p4d_t p4d; } pud_t;

-#define PUD_SHIFT PGDIR_SHIFT
+#define PUD_SHIFT P4D_SHIFT
#define PTRS_PER_PUD 1
#define PUD_SIZE (1UL << PUD_SHIFT)
#define PUD_MASK (~(PUD_SIZE-1))

/*
- * The "pgd_xxx()" functions here are trivial for a folded two-level
+ * The "p4d_xxx()" functions here are trivial for a folded two-level
* setup: the pud is never bad, and a pud always exists (as it's folded
- * into the pgd entry)
+ * into the p4d entry)
*/
-static inline int pgd_none(pgd_t pgd) { return 0; }
-static inline int pgd_bad(pgd_t pgd) { return 0; }
-static inline int pgd_present(pgd_t pgd) { return 1; }
-static inline void pgd_clear(pgd_t *pgd) { }
-#define pud_ERROR(pud) (pgd_ERROR((pud).pgd))
+static inline int p4d_none(p4d_t p4d) { return 0; }
+static inline int p4d_bad(p4d_t p4d) { return 0; }
+static inline int p4d_present(p4d_t p4d) { return 1; }
+static inline void p4d_clear(p4d_t *p4d) { }
+#define pud_ERROR(pud) (p4d_ERROR((pud).p4d))

-#define pgd_populate(mm, pgd, pud) do { } while (0)
+#define p4d_populate(mm, p4d, pud) do { } while (0)
/*
- * (puds are folded into pgds so this doesn't get actually called,
+ * (puds are folded into p4ds so this doesn't get actually called,
* but the define is needed for a generic inline function.)
*/
-#define set_pgd(pgdptr, pgdval) set_pud((pud_t *)(pgdptr), (pud_t) { pgdval })
+#define set_p4d(p4dptr, p4dval) set_pud((pud_t *)(p4dptr), (pud_t) { p4dval })

-static inline pud_t * pud_offset(pgd_t * pgd, unsigned long address)
+static inline pud_t *pud_offset(p4d_t *p4d, unsigned long address)
{
- return (pud_t *)pgd;
+ return (pud_t *)p4d;
}

-#define pud_val(x) (pgd_val((x).pgd))
-#define __pud(x) ((pud_t) { __pgd(x) } )
+#define pud_val(x) (p4d_val((x).p4d))
+#define __pud(x) ((pud_t) { __p4d(x) })

-#define pgd_page(pgd) (pud_page((pud_t){ pgd }))
-#define pgd_page_vaddr(pgd) (pud_page_vaddr((pud_t){ pgd }))
+#define p4d_page(p4d) (pud_page((pud_t){ p4d }))
+#define p4d_page_vaddr(p4d) (pud_page_vaddr((pud_t){ p4d }))

/*
* allocating and freeing a pud is trivial: the 1-entry pud is
- * inside the pgd, so has no extra memory associated with it.
+ * inside the p4d, so has no extra memory associated with it.
*/
#define pud_alloc_one(mm, address) NULL
#define pud_free(mm, x) do { } while (0)
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 4329bc6ef04b..8afa4335e5b2 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -270,6 +270,12 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
__pte_free_tlb(tlb, ptep, address); \
} while (0)

+#define pmd_free_tlb(tlb, pmdp, address) \
+ do { \
+ __tlb_adjust_range(tlb, address, PAGE_SIZE); \
+ __pmd_free_tlb(tlb, pmdp, address); \
+ } while (0)
+
#ifndef __ARCH_HAS_4LEVEL_HACK
#define pud_free_tlb(tlb, pudp, address) \
do { \
@@ -278,11 +284,13 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
} while (0)
#endif

-#define pmd_free_tlb(tlb, pmdp, address) \
+#ifndef __ARCH_HAS_5LEVEL_HACK
+#define p4d_free_tlb(tlb, pudp, address) \
do { \
- __tlb_adjust_range(tlb, address, PAGE_SIZE); \
- __pmd_free_tlb(tlb, pmdp, address); \
+ __tlb_adjust_range(tlb, address, PAGE_SIZE); \
+ __p4d_free_tlb(tlb, pudp, address); \
} while (0)
+#endif

#define tlb_migrate_finish(mm) do {} while (0)

--
2.11.0

2017-03-06 14:08:49

by Kirill A. Shutemov

Subject: [PATCHv4 30/33] x86/mm: make kernel_physical_mapping_init() support 5-level paging

Properly populate the additional page table level if CONFIG_X86_5LEVEL is
enabled.
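
The new phys_p4d_init() walks the range one P4D entry at a time. The sketch below isolates only that stepping, using the same (paddr & P4D_MASK) + P4D_SIZE expression; P4D_SHIFT 39 is an assumed value and p4d_next() and p4d_entries_spanned() are hypothetical helpers. The real function additionally bounds the loop by PTRS_PER_P4D and clears entries past the end of the range.

```c
#include <stdint.h>

#define P4D_SHIFT	39	/* assumed x86-64 value */
#define P4D_SIZE	((uint64_t)1 << P4D_SHIFT)
#define P4D_MASK	(~(P4D_SIZE - 1))

/* Start of the next P4D-sized region after 'paddr'. */
static uint64_t p4d_next(uint64_t paddr)
{
	return (paddr & P4D_MASK) + P4D_SIZE;
}

/* Number of P4D entries needed to cover [start, end). */
static unsigned int p4d_entries_spanned(uint64_t start, uint64_t end)
{
	unsigned int n = 0;
	uint64_t paddr;

	for (paddr = start; paddr < end; paddr = p4d_next(paddr))
		n++;
	return n;
}
```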

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/mm/init_64.c | 71 ++++++++++++++++++++++++++++++++++++++++++++-------
1 file changed, 62 insertions(+), 9 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 5ba99090dc3c..ef117a69f74e 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -622,6 +622,58 @@ phys_pud_init(pud_t *pud_page, unsigned long paddr, unsigned long paddr_end,
return paddr_last;
}

+static unsigned long __meminit
+phys_p4d_init(p4d_t *p4d_page, unsigned long paddr, unsigned long paddr_end,
+ unsigned long page_size_mask)
+{
+ unsigned long paddr_next, paddr_last = paddr_end;
+ unsigned long vaddr = (unsigned long)__va(paddr);
+ int i = p4d_index(vaddr);
+
+ if (!IS_ENABLED(CONFIG_X86_5LEVEL))
+ return phys_pud_init((pud_t *) p4d_page, paddr, paddr_end, page_size_mask);
+
+ for (; i < PTRS_PER_P4D; i++, paddr = paddr_next) {
+ p4d_t *p4d;
+ pud_t *pud;
+
+ vaddr = (unsigned long)__va(paddr);
+ p4d = p4d_page + p4d_index(vaddr);
+ paddr_next = (paddr & P4D_MASK) + P4D_SIZE;
+
+ if (paddr >= paddr_end) {
+ if (!after_bootmem &&
+ !e820_any_mapped(paddr & P4D_MASK, paddr_next,
+ E820_RAM) &&
+ !e820_any_mapped(paddr & P4D_MASK, paddr_next,
+ E820_RESERVED_KERN)) {
+ set_p4d(p4d, __p4d(0));
+ }
+ continue;
+ }
+
+ if (!p4d_none(*p4d)) {
+ pud = pud_offset(p4d, 0);
+ paddr_last = phys_pud_init(pud, paddr,
+ paddr_end,
+ page_size_mask);
+ __flush_tlb_all();
+ continue;
+ }
+
+ pud = alloc_low_page();
+ paddr_last = phys_pud_init(pud, paddr, paddr_end,
+ page_size_mask);
+
+ spin_lock(&init_mm.page_table_lock);
+ p4d_populate(&init_mm, p4d, pud);
+ spin_unlock(&init_mm.page_table_lock);
+ }
+ __flush_tlb_all();
+
+ return paddr_last;
+}
+
/*
* Create page table mapping for the physical memory for specific physical
* addresses. The virtual and physical addresses have to be aligned on PMD level
@@ -643,26 +695,27 @@ kernel_physical_mapping_init(unsigned long paddr_start,
for (; vaddr < vaddr_end; vaddr = vaddr_next) {
pgd_t *pgd = pgd_offset_k(vaddr);
p4d_t *p4d;
- pud_t *pud;

vaddr_next = (vaddr & PGDIR_MASK) + PGDIR_SIZE;

- BUILD_BUG_ON(pgd_none(*pgd));
- p4d = p4d_offset(pgd, vaddr);
- if (p4d_val(*p4d)) {
- pud = (pud_t *)p4d_page_vaddr(*p4d);
- paddr_last = phys_pud_init(pud, __pa(vaddr),
+ if (pgd_val(*pgd)) {
+ p4d = (p4d_t *)pgd_page_vaddr(*pgd);
+ paddr_last = phys_p4d_init(p4d, __pa(vaddr),
__pa(vaddr_end),
page_size_mask);
continue;
}

- pud = alloc_low_page();
- paddr_last = phys_pud_init(pud, __pa(vaddr), __pa(vaddr_end),
+ p4d = alloc_low_page();
+ paddr_last = phys_p4d_init(p4d, __pa(vaddr), __pa(vaddr_end),
page_size_mask);

spin_lock(&init_mm.page_table_lock);
- p4d_populate(&init_mm, p4d, pud);
+ if (IS_ENABLED(CONFIG_X86_5LEVEL))
+ pgd_populate(&init_mm, pgd, p4d);
+ else
+ p4d_populate(&init_mm, p4d_offset(pgd, vaddr),
+ (pud_t *) p4d);
spin_unlock(&init_mm.page_table_lock);
pgd_changed = true;
}
--
2.11.0

2017-03-06 14:09:01

by Kirill A. Shutemov

Subject: [PATCHv4 23/33] x86/paravirt: make paravirt code support 5-level paging

Add operations to allocate/release p4ds.

TODO: cover XEN.

Not-yet-Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/include/asm/paravirt.h | 44 +++++++++++++++++++++++++++++++----
arch/x86/include/asm/paravirt_types.h | 7 +++++-
arch/x86/include/asm/pgalloc.h | 2 ++
arch/x86/kernel/paravirt.c | 9 +++++--
4 files changed, 55 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 158d877ce9e9..677edf3b6421 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -357,6 +357,16 @@ static inline void paravirt_release_pud(unsigned long pfn)
PVOP_VCALL1(pv_mmu_ops.release_pud, pfn);
}

+static inline void paravirt_alloc_p4d(struct mm_struct *mm, unsigned long pfn)
+{
+ PVOP_VCALL2(pv_mmu_ops.alloc_p4d, mm, pfn);
+}
+
+static inline void paravirt_release_p4d(unsigned long pfn)
+{
+ PVOP_VCALL1(pv_mmu_ops.release_p4d, pfn);
+}
+
static inline void pte_update(struct mm_struct *mm, unsigned long addr,
pte_t *ptep)
{
@@ -582,14 +592,35 @@ static inline void set_p4d(p4d_t *p4dp, p4d_t p4d)
val);
}

-static inline void p4d_clear(p4d_t *p4dp)
+#if CONFIG_PGTABLE_LEVELS >= 5
+
+static inline p4d_t __p4d(p4dval_t val)
{
- set_p4d(p4dp, __p4d(0));
+ p4dval_t ret;
+
+ if (sizeof(p4dval_t) > sizeof(long))
+ ret = PVOP_CALLEE2(p4dval_t, pv_mmu_ops.make_p4d,
+ val, (u64)val >> 32);
+ else
+ ret = PVOP_CALLEE1(p4dval_t, pv_mmu_ops.make_p4d,
+ val);
+
+ return (p4d_t) { ret };
}

-#if CONFIG_PGTABLE_LEVELS >= 5
+static inline p4dval_t p4d_val(p4d_t p4d)
+{
+ p4dval_t ret;
+
+ if (sizeof(p4dval_t) > sizeof(long))
+ ret = PVOP_CALLEE2(p4dval_t, pv_mmu_ops.p4d_val,
+ p4d.p4d, (u64)p4d.p4d >> 32);
+ else
+ ret = PVOP_CALLEE1(p4dval_t, pv_mmu_ops.p4d_val,
+ p4d.p4d);

-#error FIXME
+ return ret;
+}

static inline void set_pgd(pgd_t *pgdp, pgd_t pgd)
{
@@ -610,6 +641,11 @@ static inline void pgd_clear(pgd_t *pgdp)

#endif /* CONFIG_PGTABLE_LEVELS == 5 */

+static inline void p4d_clear(p4d_t *p4dp)
+{
+ set_p4d(p4dp, __p4d(0));
+}
+
#endif /* CONFIG_PGTABLE_LEVELS == 4 */

#endif /* CONFIG_PGTABLE_LEVELS >= 3 */
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 93c49cf09b63..7465d6fe336f 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -238,9 +238,11 @@ struct pv_mmu_ops {
void (*alloc_pte)(struct mm_struct *mm, unsigned long pfn);
void (*alloc_pmd)(struct mm_struct *mm, unsigned long pfn);
void (*alloc_pud)(struct mm_struct *mm, unsigned long pfn);
+ void (*alloc_p4d)(struct mm_struct *mm, unsigned long pfn);
void (*release_pte)(unsigned long pfn);
void (*release_pmd)(unsigned long pfn);
void (*release_pud)(unsigned long pfn);
+ void (*release_p4d)(unsigned long pfn);

/* Pagetable manipulation functions */
void (*set_pte)(pte_t *ptep, pte_t pteval);
@@ -286,7 +288,10 @@ struct pv_mmu_ops {
void (*set_p4d)(p4d_t *p4dp, p4d_t p4dval);

#if CONFIG_PGTABLE_LEVELS >= 5
-#error FIXME
+ struct paravirt_callee_save p4d_val;
+ struct paravirt_callee_save make_p4d;
+
+ void (*set_pgd)(pgd_t *pgdp, pgd_t pgdval);
#endif /* CONFIG_PGTABLE_LEVELS >= 5 */

#endif /* CONFIG_PGTABLE_LEVELS >= 4 */
diff --git a/arch/x86/include/asm/pgalloc.h b/arch/x86/include/asm/pgalloc.h
index 2f585054c63c..b2d0cd8288aa 100644
--- a/arch/x86/include/asm/pgalloc.h
+++ b/arch/x86/include/asm/pgalloc.h
@@ -17,9 +17,11 @@ static inline void paravirt_alloc_pmd(struct mm_struct *mm, unsigned long pfn) {
static inline void paravirt_alloc_pmd_clone(unsigned long pfn, unsigned long clonepfn,
unsigned long start, unsigned long count) {}
static inline void paravirt_alloc_pud(struct mm_struct *mm, unsigned long pfn) {}
+static inline void paravirt_alloc_p4d(struct mm_struct *mm, unsigned long pfn) {}
static inline void paravirt_release_pte(unsigned long pfn) {}
static inline void paravirt_release_pmd(unsigned long pfn) {}
static inline void paravirt_release_pud(unsigned long pfn) {}
+static inline void paravirt_release_p4d(unsigned long pfn) {}
#endif

/*
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 110daf22f5c7..3586996fc50d 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -405,9 +405,11 @@ struct pv_mmu_ops pv_mmu_ops __ro_after_init = {
.alloc_pte = paravirt_nop,
.alloc_pmd = paravirt_nop,
.alloc_pud = paravirt_nop,
+ .alloc_p4d = paravirt_nop,
.release_pte = paravirt_nop,
.release_pmd = paravirt_nop,
.release_pud = paravirt_nop,
+ .release_p4d = paravirt_nop,

.set_pte = native_set_pte,
.set_pte_at = native_set_pte_at,
@@ -437,8 +439,11 @@ struct pv_mmu_ops pv_mmu_ops __ro_after_init = {
.set_p4d = native_set_p4d,

#if CONFIG_PGTABLE_LEVELS >= 5
-#error FIXME
-#endif /* CONFIG_PGTABLE_LEVELS >= 4 */
+ .p4d_val = PTE_IDENT,
+ .make_p4d = PTE_IDENT,
+
+ .set_pgd = native_set_pgd,
+#endif /* CONFIG_PGTABLE_LEVELS >= 5 */
#endif /* CONFIG_PGTABLE_LEVELS >= 4 */
#endif /* CONFIG_PGTABLE_LEVELS >= 3 */

--
2.11.0

2017-03-06 14:09:12

by Kirill A. Shutemov

Subject: [PATCHv4 22/33] x86/mm: define virtual memory map for 5-level paging

The first part of the memory map (up to the %esp fixup stacks) simply
scales the existing 4-level map by 9 bits (a factor of 512), the number
of address bits translated by the additional page table level.

The rest of the map is unchanged.
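
The scaling can be checked with a few lines of arithmetic. This is only a worked example against the region sizes listed in mm.txt; scaled_bits() and scaled_size() are made-up helper names.

```c
#include <stdint.h>

/* One extra page table level translates 9 more address bits. */
#define EXTRA_LEVEL_BITS	9

/* Width of a region in the 5-level map, given its 4-level width. */
static unsigned int scaled_bits(unsigned int bits_4level)
{
	return bits_4level + EXTRA_LEVEL_BITS;
}

/* Size in bytes of a region in the 5-level map. */
static uint64_t scaled_size(uint64_t size_4level)
{
	return size_4level << EXTRA_LEVEL_BITS;
}
```

For example, the 47-bit user space becomes 56 bits, the 43-bit guard hole becomes 52 bits, and the 64 TB direct mapping becomes a 55-bit (32 PB) region, matching the new table.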

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
Documentation/x86/x86_64/mm.txt | 33 ++++++++++++++++++++++++++++++---
arch/x86/Kconfig | 1 +
arch/x86/include/asm/kasan.h | 9 ++++++---
arch/x86/include/asm/page_64_types.h | 10 ++++++++++
arch/x86/include/asm/pgtable_64_types.h | 6 ++++++
arch/x86/include/asm/sparsemem.h | 9 +++++++--
6 files changed, 60 insertions(+), 8 deletions(-)

diff --git a/Documentation/x86/x86_64/mm.txt b/Documentation/x86/x86_64/mm.txt
index 5724092db811..0303a47b82f8 100644
--- a/Documentation/x86/x86_64/mm.txt
+++ b/Documentation/x86/x86_64/mm.txt
@@ -4,7 +4,7 @@
Virtual memory map with 4 level page tables:

0000000000000000 - 00007fffffffffff (=47 bits) user space, different per mm
-hole caused by [48:63] sign extension
+hole caused by [47:63] sign extension
ffff800000000000 - ffff87ffffffffff (=43 bits) guard hole, reserved for hypervisor
ffff880000000000 - ffffc7ffffffffff (=64 TB) direct mapping of all phys. memory
ffffc80000000000 - ffffc8ffffffffff (=40 bits) hole
@@ -23,12 +23,39 @@ ffffffffa0000000 - ffffffffff5fffff (=1526 MB) module mapping space
ffffffffff600000 - ffffffffffdfffff (=8 MB) vsyscalls
ffffffffffe00000 - ffffffffffffffff (=2 MB) unused hole

+Virtual memory map with 5 level page tables:
+
+0000000000000000 - 00ffffffffffffff (=56 bits) user space, different per mm
+hole caused by [56:63] sign extension
+ff00000000000000 - ff0fffffffffffff (=52 bits) guard hole, reserved for hypervisor
+ff10000000000000 - ff8fffffffffffff (=55 bits) direct mapping of all phys. memory
+ff90000000000000 - ff91ffffffffffff (=49 bits) hole
+ff92000000000000 - ffd1ffffffffffff (=54 bits) vmalloc/ioremap space
+ffd2000000000000 - ffd3ffffffffffff (=49 bits) hole
+ffd4000000000000 - ffd5ffffffffffff (=49 bits) virtual memory map (512TB)
+... unused hole ...
+ffd8000000000000 - fff7ffffffffffff (=53 bits) kasan shadow memory (8PB)
+... unused hole ...
+fffe000000000000 - fffe007fffffffff (=39 bits) %esp fixup stacks
+... unused hole ...
+ffffffef00000000 - fffffffeffffffff (=64 GB) EFI region mapping space
+... unused hole ...
+ffffffff80000000 - ffffffff9fffffff (=512 MB) kernel text mapping, from phys 0
+ffffffffa0000000 - ffffffffff5fffff (=1526 MB) module mapping space
+ffffffffff600000 - ffffffffffdfffff (=8 MB) vsyscalls
+ffffffffffe00000 - ffffffffffffffff (=2 MB) unused hole
+
+Architecture defines a 64-bit virtual address. Implementations can support
+less. Currently supported are 48- and 57-bit virtual addresses. Bits 63
+through to the most-significant implemented bit are set to either all ones
+or all zero. This causes hole between user space and kernel addresses.
+
The direct mapping covers all memory in the system up to the highest
memory address (this means in some cases it can also include PCI memory
holes).

-vmalloc space is lazily synchronized into the different PML4 pages of
-the processes using the page fault handler, with init_level4_pgt as
+vmalloc space is lazily synchronized into the different PML4/PML5 pages of
+the processes using the page fault handler, with init_top_pgt as
reference.

Current X86-64 implementations support up to 46 bits of address space (64 TB),
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index cc98d5a294ee..747f06f00a22 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -290,6 +290,7 @@ config ARCH_SUPPORTS_DEBUG_PAGEALLOC
config KASAN_SHADOW_OFFSET
hex
depends on KASAN
+ default 0xdff8000000000000 if X86_5LEVEL
default 0xdffffc0000000000

config HAVE_INTEL_TXT
diff --git a/arch/x86/include/asm/kasan.h b/arch/x86/include/asm/kasan.h
index 1410b567ecde..f527b02a0ee3 100644
--- a/arch/x86/include/asm/kasan.h
+++ b/arch/x86/include/asm/kasan.h
@@ -11,9 +11,12 @@
* 'kernel address space start' >> KASAN_SHADOW_SCALE_SHIFT
*/
#define KASAN_SHADOW_START (KASAN_SHADOW_OFFSET + \
- (0xffff800000000000ULL >> 3))
-/* 47 bits for kernel address -> (47 - 3) bits for shadow */
-#define KASAN_SHADOW_END (KASAN_SHADOW_START + (1ULL << (47 - 3)))
+ ((-1UL << __VIRTUAL_MASK_SHIFT) >> 3))
+/*
+ * 47 bits for kernel address -> (47 - 3) bits for shadow
+ * 56 bits for kernel address -> (56 - 3) bits for shadow
+ */
+#define KASAN_SHADOW_END (KASAN_SHADOW_START + (1ULL << (__VIRTUAL_MASK_SHIFT - 3)))

#ifndef __ASSEMBLY__

diff --git a/arch/x86/include/asm/page_64_types.h b/arch/x86/include/asm/page_64_types.h
index 9215e0527647..3f5f08b010d0 100644
--- a/arch/x86/include/asm/page_64_types.h
+++ b/arch/x86/include/asm/page_64_types.h
@@ -36,7 +36,12 @@
* hypervisor to fit. Choosing 16 slots here is arbitrary, but it's
* what Xen requires.
*/
+#ifdef CONFIG_X86_5LEVEL
+#define __PAGE_OFFSET_BASE _AC(0xff10000000000000, UL)
+#else
#define __PAGE_OFFSET_BASE _AC(0xffff880000000000, UL)
+#endif
+
#ifdef CONFIG_RANDOMIZE_MEMORY
#define __PAGE_OFFSET page_offset_base
#else
@@ -46,8 +51,13 @@
#define __START_KERNEL_map _AC(0xffffffff80000000, UL)

/* See Documentation/x86/x86_64/mm.txt for a description of the memory map. */
+#ifdef CONFIG_X86_5LEVEL
+#define __PHYSICAL_MASK_SHIFT 52
+#define __VIRTUAL_MASK_SHIFT 56
+#else
#define __PHYSICAL_MASK_SHIFT 46
#define __VIRTUAL_MASK_SHIFT 47
+#endif

/*
* Kernel image size is limited to 1GiB due to the fixmap living in the
diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
index 0b2797e5083c..00dc0c2b456e 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -56,9 +56,15 @@ typedef struct { pteval_t pte; } pte_t;

/* See Documentation/x86/x86_64/mm.txt for a description of the memory map. */
#define MAXMEM _AC(__AC(1, UL) << MAX_PHYSMEM_BITS, UL)
+#ifdef CONFIG_X86_5LEVEL
+#define VMALLOC_SIZE_TB _AC(16384, UL)
+#define __VMALLOC_BASE _AC(0xff92000000000000, UL)
+#define __VMEMMAP_BASE _AC(0xffd4000000000000, UL)
+#else
#define VMALLOC_SIZE_TB _AC(32, UL)
#define __VMALLOC_BASE _AC(0xffffc90000000000, UL)
#define __VMEMMAP_BASE _AC(0xffffea0000000000, UL)
+#endif
#ifdef CONFIG_RANDOMIZE_MEMORY
#define VMALLOC_START vmalloc_base
#define VMEMMAP_START vmemmap_base
diff --git a/arch/x86/include/asm/sparsemem.h b/arch/x86/include/asm/sparsemem.h
index 4517d6b93188..1f5bee2c202f 100644
--- a/arch/x86/include/asm/sparsemem.h
+++ b/arch/x86/include/asm/sparsemem.h
@@ -26,8 +26,13 @@
# endif
#else /* CONFIG_X86_32 */
# define SECTION_SIZE_BITS 27 /* matt - 128 is convenient right now */
-# define MAX_PHYSADDR_BITS 44
-# define MAX_PHYSMEM_BITS 46
+# ifdef CONFIG_X86_5LEVEL
+# define MAX_PHYSADDR_BITS 52
+# define MAX_PHYSMEM_BITS 52
+# else
+# define MAX_PHYSADDR_BITS 44
+# define MAX_PHYSMEM_BITS 46
+# endif
#endif

#endif /* CONFIG_SPARSEMEM */
--
2.11.0
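
The KASAN shadow range and the sign-extension hole documented in mm.txt above can be cross-checked with a little arithmetic. The following is a minimal userspace sketch, not kernel code: the constants are copied from the patch, while the helper names (kernel_half_start, shadow_start, shadow_end, is_canonical) are made up for illustration.

```c
#include <stdint.h>

/* KASAN_SHADOW_OFFSET defaults taken from the Kconfig hunk above. */
#define SHADOW_OFFSET_4LEVEL 0xdffffc0000000000ULL
#define SHADOW_OFFSET_5LEVEL 0xdff8000000000000ULL
/* One shadow byte covers 8 bytes of address space. */
#define SHADOW_SCALE_SHIFT 3

/* Lowest kernel-half address: bits mask_shift..63 are all set. */
static uint64_t kernel_half_start(unsigned int mask_shift)
{
	return ~0ULL << mask_shift;
}

/* Mirrors the KASAN_SHADOW_START expression from kasan.h. */
static uint64_t shadow_start(uint64_t offset, unsigned int mask_shift)
{
	return offset + (kernel_half_start(mask_shift) >> SHADOW_SCALE_SHIFT);
}

/* Mirrors the KASAN_SHADOW_END expression from kasan.h (exclusive end). */
static uint64_t shadow_end(uint64_t offset, unsigned int mask_shift)
{
	return shadow_start(offset, mask_shift) +
	       (1ULL << (mask_shift - SHADOW_SCALE_SHIFT));
}

/*
 * An address is outside the hole when bits mask_shift..63 are either
 * all zero (user half) or all one (kernel half).
 */
static int is_canonical(uint64_t addr, unsigned int mask_shift)
{
	uint64_t upper = addr >> mask_shift;
	uint64_t all_ones = ~0ULL >> mask_shift;

	return upper == 0 || upper == all_ones;
}
```

With __VIRTUAL_MASK_SHIFT set to 56, shadow_start() and shadow_end() reproduce the ffd8000000000000 - fff7ffffffffffff range listed in the 5-level map. With 47 they reproduce the 4-level shadow range.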

2017-03-06 14:07:34

by Kirill A. Shutemov

Subject: [PATCHv4 02/33] asm-generic: introduce 5level-fixup.h

We are going to switch core MM to the 5-level paging abstraction.

This is a preparation step which adds <asm-generic/5level-fixup.h>.
As with 4level-fixup.h, the new header allows us to quickly make all
architectures compatible with 5-level paging in core MM.

In the long run we would like to switch architectures to a properly folded
p4d level by using <asm-generic/pgtable-nop4d.h>, but that requires more
changes to arch-specific code.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/asm-generic/4level-fixup.h | 3 ++-
include/asm-generic/5level-fixup.h | 41 ++++++++++++++++++++++++++++++++++++++
include/linux/mm.h | 3 +++
3 files changed, 46 insertions(+), 1 deletion(-)
create mode 100644 include/asm-generic/5level-fixup.h

diff --git a/include/asm-generic/4level-fixup.h b/include/asm-generic/4level-fixup.h
index 5bdab6bffd23..928fd66b1271 100644
--- a/include/asm-generic/4level-fixup.h
+++ b/include/asm-generic/4level-fixup.h
@@ -15,7 +15,6 @@
((unlikely(pgd_none(*(pud))) && __pmd_alloc(mm, pud, address))? \
NULL: pmd_offset(pud, address))

-#define pud_alloc(mm, pgd, address) (pgd)
#define pud_offset(pgd, start) (pgd)
#define pud_none(pud) 0
#define pud_bad(pud) 0
@@ -35,4 +34,6 @@
#undef pud_addr_end
#define pud_addr_end(addr, end) (end)

+#include <asm-generic/5level-fixup.h>
+
#endif
diff --git a/include/asm-generic/5level-fixup.h b/include/asm-generic/5level-fixup.h
new file mode 100644
index 000000000000..b5ca82dc4175
--- /dev/null
+++ b/include/asm-generic/5level-fixup.h
@@ -0,0 +1,41 @@
+#ifndef _5LEVEL_FIXUP_H
+#define _5LEVEL_FIXUP_H
+
+#define __ARCH_HAS_5LEVEL_HACK
+#define __PAGETABLE_P4D_FOLDED
+
+#define P4D_SHIFT PGDIR_SHIFT
+#define P4D_SIZE PGDIR_SIZE
+#define P4D_MASK PGDIR_MASK
+#define PTRS_PER_P4D 1
+
+#define p4d_t pgd_t
+
+#define pud_alloc(mm, p4d, address) \
+ ((unlikely(pgd_none(*(p4d))) && __pud_alloc(mm, p4d, address)) ? \
+ NULL : pud_offset(p4d, address))
+
+#define p4d_alloc(mm, pgd, address) (pgd)
+#define p4d_offset(pgd, start) (pgd)
+#define p4d_none(p4d) 0
+#define p4d_bad(p4d) 0
+#define p4d_present(p4d) 1
+#define p4d_ERROR(p4d) do { } while (0)
+#define p4d_clear(p4d) pgd_clear(p4d)
+#define p4d_val(p4d) pgd_val(p4d)
+#define p4d_populate(mm, p4d, pud) pgd_populate(mm, p4d, pud)
+#define p4d_page(p4d) pgd_page(p4d)
+#define p4d_page_vaddr(p4d) pgd_page_vaddr(p4d)
+
+#define __p4d(x) __pgd(x)
+#define set_p4d(p4dp, p4d) set_pgd(p4dp, p4d)
+
+#undef p4d_free_tlb
+#define p4d_free_tlb(tlb, x, addr) do { } while (0)
+#define p4d_free(mm, x) do { } while (0)
+#define __p4d_free_tlb(tlb, x, addr) do { } while (0)
+
+#undef p4d_addr_end
+#define p4d_addr_end(addr, end) (end)
+
+#endif
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0d65dd72c0f4..be1fe264eb37 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1619,11 +1619,14 @@ int __pte_alloc_kernel(pmd_t *pmd, unsigned long address);
* Remove it when 4level-fixup.h has been removed.
*/
#if defined(CONFIG_MMU) && !defined(__ARCH_HAS_4LEVEL_HACK)
+
+#ifndef __ARCH_HAS_5LEVEL_HACK
static inline pud_t *pud_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address)
{
return (unlikely(pgd_none(*pgd)) && __pud_alloc(mm, pgd, address))?
NULL: pud_offset(pgd, address);
}
+#endif /* !__ARCH_HAS_5LEVEL_HACK */

static inline pmd_t *pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address)
{
--
2.11.0
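
To illustrate what the fixup header does, here is a minimal userspace sketch of a folded p4d level. It is an assumption-laden model, not the kernel implementation: the types are simplified stand-ins, the macros are modelled as functions, and count_populated() is a hypothetical walker added only to show that a generic 5-level loop degenerates to a 4-level one.

```c
#include <stddef.h>

/* Simplified stand-in for the kernel's pgd_t. */
typedef struct { unsigned long val; } pgd_t;
/* Models "#define p4d_t pgd_t" from 5level-fixup.h. */
typedef pgd_t p4d_t;

#define PTRS_PER_P4D 1

static int pgd_none(pgd_t pgd)
{
	return pgd.val == 0;
}

/* Folded level: the p4d "table" is the pgd entry itself. */
static p4d_t *p4d_offset(pgd_t *pgd, unsigned long addr)
{
	(void)addr;
	return pgd;
}

/* Folded level is never empty; emptiness is tested at the pgd level. */
static int p4d_none(p4d_t p4d)
{
	(void)p4d;
	return 0;
}

/*
 * Hypothetical generic walker written against five levels. With the
 * folded p4d the inner loop runs exactly once per populated pgd entry.
 */
static size_t count_populated(pgd_t *pgd_table, size_t nr_pgd)
{
	size_t i, j, count = 0;

	for (i = 0; i < nr_pgd; i++) {
		p4d_t *p4d;

		if (pgd_none(pgd_table[i]))
			continue;

		p4d = p4d_offset(&pgd_table[i], 0);
		for (j = 0; j < PTRS_PER_P4D; j++) {
			if (p4d_none(p4d[j]))
				continue;
			count++;
		}
	}
	return count;
}
```

The point of the design is that core MM code can be written once against pgd/p4d/pud/pmd/pte, and architectures that have not been converted pay nothing for the extra level.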

2017-03-06 18:29:58

by Linus Torvalds

Subject: Re: [PATCHv4 00/33] 5-level paging

On Mon, Mar 6, 2017 at 5:53 AM, Kirill A. Shutemov
<[email protected]> wrote:
> Here is v4 of 5-level paging patchset. Please review and consider applying.

I think we should just aim for this being in 4.12. I don't see any
real reason to delay merging it, the main question in my mind is which
tree it would go through. A separate x86 -tip branch, or Andrew's mm
tree or me just pulling directly, or what?

I basically think it's in good enough shape that future work might as
well be based on this being merged. No?

Linus

2017-03-06 18:43:00

by Thomas Gleixner

Subject: Re: [PATCHv4 00/33] 5-level paging

On Mon, 6 Mar 2017, Linus Torvalds wrote:

> On Mon, Mar 6, 2017 at 5:53 AM, Kirill A. Shutemov
> <[email protected]> wrote:
> > Here is v4 of 5-level paging patchset. Please review and consider applying.
>
> I think we should just aim for this being in 4.12. I don't see any
> real reason to delay merging it, the main question in my mind is which
> tree it would go through. A separate x86 -tip branch, or Andrew's mm
> tree or me just pulling directly, or what?

We can take it through -tip and I prefer to do so as there are other
changes in the page table code lurking.

We probably need to split it apart:

- Apply the mm core only parts to a branch which can be pulled into
Andrew's mm-tree

- Base the x86 changes on top of it

So both worlds can work on top of the mm core parts (almost
independently). From what I have seen so far, it's more likely that we get
delta changes/fixes on the x86 side than on the mm core side. And if we get
changes on the mm core side, we can deal with that via the separate mm core
branch.

Andrew, does that work for you?

Thanks,

tglx

2017-03-06 19:04:20

by Linus Torvalds

Subject: Re: [PATCHv4 00/33] 5-level paging

On Mon, Mar 6, 2017 at 10:42 AM, Thomas Gleixner <[email protected]> wrote:
>
> We probably need to split it apart:
>
> - Apply the mm core only parts to a branch which can be pulled into
> Andrews mm-tree
>
> - Base the x86 changes on top of it

I'll happily take some of the preparatory patches for 4.11 too. Some
of them just don't seem to have any downside. The cpuid stuff, and the
basic scaffolding we could easily merge early. That includes the dummy
5level code, ie "5level-fixup.h" and even some of the mm side that
doesn't actually change anything and just prepares for the real code.

But having some base branch too just for avoiding conflicts with
whatever mm stuff that Andrew keeps around sounds fine too.

Linus

2017-03-06 19:09:50

by Kirill A. Shutemov

Subject: Re: [PATCHv4 00/33] 5-level paging

On Mon, Mar 06, 2017 at 11:03:56AM -0800, Linus Torvalds wrote:
> On Mon, Mar 6, 2017 at 10:42 AM, Thomas Gleixner <[email protected]> wrote:
> >
> > We probably need to split it apart:
> >
> > - Apply the mm core only parts to a branch which can be pulled into
> > Andrews mm-tree
> >
> > - Base the x86 changes on top of it
>
> I'll happily take some of the preparatory patches for 4.11 too. Some
> of them just don't seem to have any downside. The cpuid stuff, and the
> basic scaffolding we could easily merge early. That includes the dummy
> 5level code, ie "5level-fixup.h" and even some of the mm side that
> doesn't actually change anything and just prepares for the real code.

The first 7 patches are relatively low-risk. It would be nice to have them
in earlier.

I'm committed to addressing any possible drawbacks quickly if you are
considering applying them to v4.11.

--
Kirill A. Shutemov

2017-03-06 19:46:07

by Linus Torvalds

Subject: Re: [PATCHv4 00/33] 5-level paging

On Mon, Mar 6, 2017 at 11:09 AM, Kirill A. Shutemov
<[email protected]> wrote:
>
> The first 7 patches are relatively low-risk. It would be nice to have them
> in earlier.

Ok, I gave those another look since you mentioned them in particular,
and they still look fine and non-controversial to me. I'd be willing
to take them directly, and into 4.11, to make future integration
easier and avoid conflicts with other mm code during the 4.12 merge
window.

Just looking at my own inbox, I would suggest that maybe you should
send that small early series as a separate patch series, because those
patches actually got mixed up in my inbox with all the other patches
in the series. Email sending in quick succession does not tend to keep
things ordered. I suspect that happened to others too.

We might have people who are *not* willing to look at the whole
33-patch series that has a lot of x86 code in it, but are willing to
look through the first 7 emails when they are clearly separated out.

Linus

2017-03-06 20:33:49

by Kirill A. Shutemov

Subject: Re: [PATCHv4 28/33] x86/mm: add support of additional page table level during early boot

On Mon, Mar 06, 2017 at 03:05:49PM -0500, Boris Ostrovsky wrote:
>
> > diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
> > index 9991224f6238..c9e41f1599dd 100644
> > --- a/arch/x86/include/asm/pgtable_64.h
> > +++ b/arch/x86/include/asm/pgtable_64.h
> > @@ -14,15 +14,17 @@
> > #include <linux/bitops.h>
> > #include <linux/threads.h>
> >
> > +extern p4d_t level4_kernel_pgt[512];
> > +extern p4d_t level4_ident_pgt[512];
> > extern pud_t level3_kernel_pgt[512];
> > extern pud_t level3_ident_pgt[512];
> > extern pmd_t level2_kernel_pgt[512];
> > extern pmd_t level2_fixmap_pgt[512];
> > extern pmd_t level2_ident_pgt[512];
> > extern pte_t level1_fixmap_pgt[512];
> > -extern pgd_t init_level4_pgt[];
> > +extern pgd_t init_top_pgt[];
> >
> > -#define swapper_pg_dir init_level4_pgt
> > +#define swapper_pg_dir init_top_pgt
> >
> > extern void paging_init(void);
> >
>
>
> This means you also need
>
>
> diff --git a/arch/x86/xen/xen-pvh.S b/arch/x86/xen/xen-pvh.S
> index 5e24671..e1a5fbe 100644
> --- a/arch/x86/xen/xen-pvh.S
> +++ b/arch/x86/xen/xen-pvh.S
> @@ -87,7 +87,7 @@ ENTRY(pvh_start_xen)
> wrmsr
>
> /* Enable pre-constructed page tables. */
> - mov $_pa(init_level4_pgt), %eax
> + mov $_pa(init_top_pgt), %eax
> mov %eax, %cr3
> mov $(X86_CR0_PG | X86_CR0_PE), %eax
> mov %eax, %cr0
>
>

Ah. Thanks. I've missed that.

The fix is folded.

--
Kirill A. Shutemov

2017-03-06 20:49:21

by Boris Ostrovsky

Subject: Re: [PATCHv4 18/33] x86/xen: convert __xen_pgd_walk() and xen_cleanmfnmap() to support p4d


> +static int xen_p4d_walk(struct mm_struct *mm, p4d_t *p4d,
> + int (*func)(struct mm_struct *mm, struct page *, enum pt_level),
> + bool last, unsigned long limit)
> +{
> + int i, nr, flush = 0;
> +
> + nr = last ? p4d_index(limit) + 1 : PTRS_PER_P4D;
> + for (i = 0; i < nr; i++) {
> + pud_t *pud;
> +
> + if (p4d_none(p4d[i]))
> + continue;
> +
> + pud = pud_offset(&p4d[i], 0);
> + if (PTRS_PER_PUD > 1)
> + flush |= (*func)(mm, virt_to_page(pud), PT_PUD);
> + xen_pud_walk(mm, pud, func, last && i == nr - 1, limit);
> + }
> + return flush;
> +}

..

> + p4d = p4d_offset(&pgd[i], 0);
> + if (PTRS_PER_P4D > 1)
> + flush |= (*func)(mm, virt_to_page(p4d), PT_P4D);
> + xen_p4d_walk(mm, p4d, func, i == nr - 1, limit);


We are losing flush status at all levels so we need something like

flush |= xen_XXX_walk(...)



> }
>
> -out:
> /* Do the top level last, so that the callbacks can use it as
> a cue to do final things like tlb flushes. */
> flush |= (*func)(mm, virt_to_page(pgd), PT_PGD);
> @@ -1150,57 +1161,97 @@ static void __init xen_cleanmfnmap_free_pgtbl(void *pgtbl, bool unpin)
> xen_free_ro_pages(pa, PAGE_SIZE);
> }
>
> +static void __init xen_cleanmfnmap_pmd(pmd_t *pmd, bool unpin)
> +{
> + unsigned long pa;
> + pte_t *pte_tbl;
> + int i;
> +
> + if (pmd_large(*pmd)) {
> + pa = pmd_val(*pmd) & PHYSICAL_PAGE_MASK;
> + xen_free_ro_pages(pa, PMD_SIZE);
> + return;
> + }
> +
> + pte_tbl = pte_offset_kernel(pmd, 0);
> + for (i = 0; i < PTRS_PER_PTE; i++) {
> + if (pte_none(pte_tbl[i]))
> + continue;
> + pa = pte_pfn(pte_tbl[i]) << PAGE_SHIFT;
> + xen_free_ro_pages(pa, PAGE_SIZE);
> + }
> + set_pmd(pmd, __pmd(0));
> + xen_cleanmfnmap_free_pgtbl(pte_tbl, unpin);
> +}
> +
> +static void __init xen_cleanmfnmap_pud(pud_t *pud, bool unpin)
> +{
> + unsigned long pa;
> + pmd_t *pmd_tbl;
> + int i;
> +
> + if (pud_large(*pud)) {
> + pa = pud_val(*pud) & PHYSICAL_PAGE_MASK;
> + xen_free_ro_pages(pa, PUD_SIZE);
> + return;
> + }
> +
> + pmd_tbl = pmd_offset(pud, 0);
> + for (i = 0; i < PTRS_PER_PMD; i++) {
> + if (pmd_none(pmd_tbl[i]))
> + continue;
> + xen_cleanmfnmap_pmd(pmd_tbl + i, unpin);
> + }
> + set_pud(pud, __pud(0));
> + xen_cleanmfnmap_free_pgtbl(pmd_tbl, unpin);
> +}
> +
> +static void __init xen_cleanmfnmap_p4d(p4d_t *p4d, bool unpin)
> +{
> + unsigned long pa;
> + pud_t *pud_tbl;
> + int i;
> +
> + if (p4d_large(*p4d)) {
> + pa = p4d_val(*p4d) & PHYSICAL_PAGE_MASK;
> + xen_free_ro_pages(pa, P4D_SIZE);
> + return;
> + }
> +
> + pud_tbl = pud_offset(p4d, 0);
> + for (i = 0; i < PTRS_PER_PUD; i++) {
> + if (pud_none(pud_tbl[i]))
> + continue;
> + xen_cleanmfnmap_pud(pud_tbl + i, unpin);
> + }
> + set_p4d(p4d, __p4d(0));
> + xen_cleanmfnmap_free_pgtbl(pud_tbl, unpin);
> +}
> +
> /*
> * Since it is well isolated we can (and since it is perhaps large we should)
> * also free the page tables mapping the initial P->M table.
> */
> static void __init xen_cleanmfnmap(unsigned long vaddr)
> {
> - unsigned long va = vaddr & PMD_MASK;
> - unsigned long pa;
> - pgd_t *pgd = pgd_offset_k(va);
> - pud_t *pud_page = pud_offset(pgd, 0);
> - pud_t *pud;
> - pmd_t *pmd;
> - pte_t *pte;
> + pgd_t *pgd;
> + p4d_t *p4d;
> unsigned int i;
> bool unpin;
>
> unpin = (vaddr == 2 * PGDIR_SIZE);
> - set_pgd(pgd, __pgd(0));
> - do {
> - pud = pud_page + pud_index(va);
> - if (pud_none(*pud)) {
> - va += PUD_SIZE;
> - } else if (pud_large(*pud)) {
> - pa = pud_val(*pud) & PHYSICAL_PAGE_MASK;
> - xen_free_ro_pages(pa, PUD_SIZE);
> - va += PUD_SIZE;
> - } else {
> - pmd = pmd_offset(pud, va);
> - if (pmd_large(*pmd)) {
> - pa = pmd_val(*pmd) & PHYSICAL_PAGE_MASK;
> - xen_free_ro_pages(pa, PMD_SIZE);
> - } else if (!pmd_none(*pmd)) {
> - pte = pte_offset_kernel(pmd, va);
> - set_pmd(pmd, __pmd(0));
> - for (i = 0; i < PTRS_PER_PTE; ++i) {
> - if (pte_none(pte[i]))
> - break;
> - pa = pte_pfn(pte[i]) << PAGE_SHIFT;
> - xen_free_ro_pages(pa, PAGE_SIZE);
> - }
> - xen_cleanmfnmap_free_pgtbl(pte, unpin);
> - }
> - va += PMD_SIZE;
> - if (pmd_index(va))
> - continue;
> - set_pud(pud, __pud(0));
> - xen_cleanmfnmap_free_pgtbl(pmd, unpin);
> - }
> -
> - } while (pud_index(va) || pmd_index(va));
> - xen_cleanmfnmap_free_pgtbl(pud_page, unpin);
> + vaddr &= PMD_MASK;
> + pgd = pgd_offset_k(vaddr);
> + p4d = p4d_offset(pgd, 0);
> + for (i = 0; i < PTRS_PER_P4D; i++) {
> + if (p4d_none(p4d[i]))
> + continue;
> + xen_cleanmfnmap_p4d(p4d + i, unpin);
> + }

Don't we need to pass vaddr down to all routines so that they select
appropriate tables? You seem to always be choosing the first one.

-boris

> + if (IS_ENABLED(CONFIG_X86_5LEVEL)) {
> + set_pgd(pgd, __pgd(0));
> + xen_cleanmfnmap_free_pgtbl(p4d, unpin);
> + }
> }
>
> static void __init xen_pagetable_p2m_free(void)
> diff --git a/arch/x86/xen/mmu.h b/arch/x86/xen/mmu.h
> index 73809bb951b4..3fe2b3292915 100644
> --- a/arch/x86/xen/mmu.h
> +++ b/arch/x86/xen/mmu.h
> @@ -5,6 +5,7 @@
>
> enum pt_level {
> PT_PGD,
> + PT_P4D,
> PT_PUD,
> PT_PMD,
> PT_PTE


2017-03-06 21:01:51

by Boris Ostrovsky

Subject: Re: [PATCHv4 28/33] x86/mm: add support of additional page table level during early boot


> diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
> index 9991224f6238..c9e41f1599dd 100644
> --- a/arch/x86/include/asm/pgtable_64.h
> +++ b/arch/x86/include/asm/pgtable_64.h
> @@ -14,15 +14,17 @@
> #include <linux/bitops.h>
> #include <linux/threads.h>
>
> +extern p4d_t level4_kernel_pgt[512];
> +extern p4d_t level4_ident_pgt[512];
> extern pud_t level3_kernel_pgt[512];
> extern pud_t level3_ident_pgt[512];
> extern pmd_t level2_kernel_pgt[512];
> extern pmd_t level2_fixmap_pgt[512];
> extern pmd_t level2_ident_pgt[512];
> extern pte_t level1_fixmap_pgt[512];
> -extern pgd_t init_level4_pgt[];
> +extern pgd_t init_top_pgt[];
>
> -#define swapper_pg_dir init_level4_pgt
> +#define swapper_pg_dir init_top_pgt
>
> extern void paging_init(void);
>


This means you also need


diff --git a/arch/x86/xen/xen-pvh.S b/arch/x86/xen/xen-pvh.S
index 5e24671..e1a5fbe 100644
--- a/arch/x86/xen/xen-pvh.S
+++ b/arch/x86/xen/xen-pvh.S
@@ -87,7 +87,7 @@ ENTRY(pvh_start_xen)
wrmsr

/* Enable pre-constructed page tables. */
- mov $_pa(init_level4_pgt), %eax
+ mov $_pa(init_top_pgt), %eax
mov %eax, %cr3
mov $(X86_CR0_PG | X86_CR0_PE), %eax
mov %eax, %cr0


-boris

2017-03-07 01:25:24

by Stephen Rothwell

Subject: Re: [PATCHv4 00/33] 5-level paging

Hi Thomas,

On Mon, 6 Mar 2017 19:42:05 +0100 (CET) Thomas Gleixner <[email protected]> wrote:
>
> We probably need to split it apart:
>
> - Apply the mm core only parts to a branch which can be pulled into
> Andrews mm-tree

Andrew's mm-tree is not a git tree it is a quilt series ...

--
Cheers,
Stephen Rothwell

2017-03-07 10:11:47

by Thomas Gleixner

Subject: Re: [PATCHv4 00/33] 5-level paging

On Tue, 7 Mar 2017, Stephen Rothwell wrote:
> Hi Thomas,
>
> On Mon, 6 Mar 2017 19:42:05 +0100 (CET) Thomas Gleixner <[email protected]> wrote:
> >
> > We probably need to split it apart:
> >
> > - Apply the mm core only parts to a branch which can be pulled into
> > Andrews mm-tree
>
> Andrew's mm-tree is not a git tree it is a quilt series ...

I know, but creating a 'mm-base-5-level.patch' from a git branch is trivial
enough.

Thanks,

tglx

2017-03-07 13:02:21

by Kirill A. Shutemov

Subject: Re: [PATCHv4 18/33] x86/xen: convert __xen_pgd_walk() and xen_cleanmfnmap() to support p4d

On Mon, Mar 06, 2017 at 03:48:24PM -0500, Boris Ostrovsky wrote:
>
> > +static int xen_p4d_walk(struct mm_struct *mm, p4d_t *p4d,
> > + int (*func)(struct mm_struct *mm, struct page *, enum pt_level),
> > + bool last, unsigned long limit)
> > +{
> > + int i, nr, flush = 0;
> > +
> > + nr = last ? p4d_index(limit) + 1 : PTRS_PER_P4D;
> > + for (i = 0; i < nr; i++) {
> > + pud_t *pud;
> > +
> > + if (p4d_none(p4d[i]))
> > + continue;
> > +
> > + pud = pud_offset(&p4d[i], 0);
> > + if (PTRS_PER_PUD > 1)
> > + flush |= (*func)(mm, virt_to_page(pud), PT_PUD);
> > + xen_pud_walk(mm, pud, func, last && i == nr - 1, limit);
> > + }
> > + return flush;
> > +}
>
> ..
>
> > + p4d = p4d_offset(&pgd[i], 0);
> > + if (PTRS_PER_P4D > 1)
> > + flush |= (*func)(mm, virt_to_page(p4d), PT_P4D);
> > + xen_p4d_walk(mm, p4d, func, i == nr - 1, limit);
>
>
> We are losing flush status at all levels so we need something like
>
> flush |= xen_XXX_walk(...)

+ Xiong.

Thanks for noticing this. The fixup is below.

Please test, I don't have a setup for this.

>
>
>
> > }
> >
> > -out:
> > /* Do the top level last, so that the callbacks can use it as
> > a cue to do final things like tlb flushes. */
> > flush |= (*func)(mm, virt_to_page(pgd), PT_PGD);
> > @@ -1150,57 +1161,97 @@ static void __init xen_cleanmfnmap_free_pgtbl(void *pgtbl, bool unpin)
> > xen_free_ro_pages(pa, PAGE_SIZE);
> > }
> >
> > +static void __init xen_cleanmfnmap_pmd(pmd_t *pmd, bool unpin)
> > +{
> > + unsigned long pa;
> > + pte_t *pte_tbl;
> > + int i;
> > +
> > + if (pmd_large(*pmd)) {
> > + pa = pmd_val(*pmd) & PHYSICAL_PAGE_MASK;
> > + xen_free_ro_pages(pa, PMD_SIZE);
> > + return;
> > + }
> > +
> > + pte_tbl = pte_offset_kernel(pmd, 0);
> > + for (i = 0; i < PTRS_PER_PTE; i++) {
> > + if (pte_none(pte_tbl[i]))
> > + continue;
> > + pa = pte_pfn(pte_tbl[i]) << PAGE_SHIFT;
> > + xen_free_ro_pages(pa, PAGE_SIZE);
> > + }
> > + set_pmd(pmd, __pmd(0));
> > + xen_cleanmfnmap_free_pgtbl(pte_tbl, unpin);
> > +}
> > +
> > +static void __init xen_cleanmfnmap_pud(pud_t *pud, bool unpin)
> > +{
> > + unsigned long pa;
> > + pmd_t *pmd_tbl;
> > + int i;
> > +
> > + if (pud_large(*pud)) {
> > + pa = pud_val(*pud) & PHYSICAL_PAGE_MASK;
> > + xen_free_ro_pages(pa, PUD_SIZE);
> > + return;
> > + }
> > +
> > + pmd_tbl = pmd_offset(pud, 0);
> > + for (i = 0; i < PTRS_PER_PMD; i++) {
> > + if (pmd_none(pmd_tbl[i]))
> > + continue;
> > + xen_cleanmfnmap_pmd(pmd_tbl + i, unpin);
> > + }
> > + set_pud(pud, __pud(0));
> > + xen_cleanmfnmap_free_pgtbl(pmd_tbl, unpin);
> > +}
> > +
> > +static void __init xen_cleanmfnmap_p4d(p4d_t *p4d, bool unpin)
> > +{
> > + unsigned long pa;
> > + pud_t *pud_tbl;
> > + int i;
> > +
> > + if (p4d_large(*p4d)) {
> > + pa = p4d_val(*p4d) & PHYSICAL_PAGE_MASK;
> > + xen_free_ro_pages(pa, P4D_SIZE);
> > + return;
> > + }
> > +
> > + pud_tbl = pud_offset(p4d, 0);
> > + for (i = 0; i < PTRS_PER_PUD; i++) {
> > + if (pud_none(pud_tbl[i]))
> > + continue;
> > + xen_cleanmfnmap_pud(pud_tbl + i, unpin);
> > + }
> > + set_p4d(p4d, __p4d(0));
> > + xen_cleanmfnmap_free_pgtbl(pud_tbl, unpin);
> > +}
> > +
> > /*
> > * Since it is well isolated we can (and since it is perhaps large we should)
> > * also free the page tables mapping the initial P->M table.
> > */
> > static void __init xen_cleanmfnmap(unsigned long vaddr)
> > {
> > - unsigned long va = vaddr & PMD_MASK;
> > - unsigned long pa;
> > - pgd_t *pgd = pgd_offset_k(va);
> > - pud_t *pud_page = pud_offset(pgd, 0);
> > - pud_t *pud;
> > - pmd_t *pmd;
> > - pte_t *pte;
> > + pgd_t *pgd;
> > + p4d_t *p4d;
> > unsigned int i;
> > bool unpin;
> >
> > unpin = (vaddr == 2 * PGDIR_SIZE);
> > - set_pgd(pgd, __pgd(0));
> > - do {
> > - pud = pud_page + pud_index(va);
> > - if (pud_none(*pud)) {
> > - va += PUD_SIZE;
> > - } else if (pud_large(*pud)) {
> > - pa = pud_val(*pud) & PHYSICAL_PAGE_MASK;
> > - xen_free_ro_pages(pa, PUD_SIZE);
> > - va += PUD_SIZE;
> > - } else {
> > - pmd = pmd_offset(pud, va);
> > - if (pmd_large(*pmd)) {
> > - pa = pmd_val(*pmd) & PHYSICAL_PAGE_MASK;
> > - xen_free_ro_pages(pa, PMD_SIZE);
> > - } else if (!pmd_none(*pmd)) {
> > - pte = pte_offset_kernel(pmd, va);
> > - set_pmd(pmd, __pmd(0));
> > - for (i = 0; i < PTRS_PER_PTE; ++i) {
> > - if (pte_none(pte[i]))
> > - break;
> > - pa = pte_pfn(pte[i]) << PAGE_SHIFT;
> > - xen_free_ro_pages(pa, PAGE_SIZE);
> > - }
> > - xen_cleanmfnmap_free_pgtbl(pte, unpin);
> > - }
> > - va += PMD_SIZE;
> > - if (pmd_index(va))
> > - continue;
> > - set_pud(pud, __pud(0));
> > - xen_cleanmfnmap_free_pgtbl(pmd, unpin);
> > - }
> > -
> > - } while (pud_index(va) || pmd_index(va));
> > - xen_cleanmfnmap_free_pgtbl(pud_page, unpin);
> > + vaddr &= PMD_MASK;
> > + pgd = pgd_offset_k(vaddr);
> > + p4d = p4d_offset(pgd, 0);
> > + for (i = 0; i < PTRS_PER_P4D; i++) {
> > + if (p4d_none(p4d[i]))
> > + continue;
> > + xen_cleanmfnmap_p4d(p4d + i, unpin);
> > + }
>
> Don't we need to pass vaddr down to all routines so that they select
> appropriate tables? You seem to always be choosing the first one.

IIUC, we clear the whole page table subtree covered by one pgd entry.
So, no, there's no need to pass vaddr down. Just a pointer to the page
table entry is enough.

But I know virtually nothing about Xen. Please re-check my reasoning.

I would also appreciate help with getting the x86 Xen code to work with
5-level paging enabled. For now I make CONFIG_XEN dependent on
!CONFIG_X86_5LEVEL.

Fixup:

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index a4079cfab007..d66b7e79781a 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -629,7 +629,8 @@ static int xen_pud_walk(struct mm_struct *mm, pud_t *pud,
pmd = pmd_offset(&pud[i], 0);
if (PTRS_PER_PMD > 1)
flush |= (*func)(mm, virt_to_page(pmd), PT_PMD);
- xen_pmd_walk(mm, pmd, func, last && i == nr - 1, limit);
+ flush |= xen_pmd_walk(mm, pmd, func,
+ last && i == nr - 1, limit);
}
return flush;
}
@@ -650,7 +651,8 @@ static int xen_p4d_walk(struct mm_struct *mm, p4d_t *p4d,
pud = pud_offset(&p4d[i], 0);
if (PTRS_PER_PUD > 1)
flush |= (*func)(mm, virt_to_page(pud), PT_PUD);
- xen_pud_walk(mm, pud, func, last && i == nr - 1, limit);
+ flush |= xen_pud_walk(mm, pud, func,
+ last && i == nr - 1, limit);
}
return flush;
}
@@ -706,7 +708,7 @@ static int __xen_pgd_walk(struct mm_struct *mm, pgd_t *pgd,
p4d = p4d_offset(&pgd[i], 0);
if (PTRS_PER_P4D > 1)
flush |= (*func)(mm, virt_to_page(p4d), PT_P4D);
- xen_p4d_walk(mm, p4d, func, i == nr - 1, limit);
+ flush |= xen_p4d_walk(mm, p4d, func, i == nr - 1, limit);
}

/* Do the top level last, so that the callbacks can use it as
--
Kirill A. Shutemov
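
The bug addressed by the fixup above is easy to reproduce outside the kernel. The following is a minimal sketch with hypothetical names (walk_lower, walk_upper_lossy, walk_upper_fixed) and a two-level table of plain ints standing in for the page table hierarchy. It only models the control flow, not the Xen code itself.

```c
#define LEAVES 4

/* Lower-level walk: ORs together the flush request of every leaf. */
static int walk_lower(const int leaf_flush[LEAVES])
{
	int i, flush = 0;

	for (i = 0; i < LEAVES; i++)
		flush |= leaf_flush[i];
	return flush;
}

/* Pre-fixup shape: the lower level's result is silently dropped. */
static int walk_upper_lossy(const int tables[][LEAVES], int nr)
{
	int i, flush = 0;

	for (i = 0; i < nr; i++)
		walk_lower(tables[i]);	/* return value discarded */
	return flush;
}

/* Post-fixup shape: every level accumulates into flush. */
static int walk_upper_fixed(const int tables[][LEAVES], int nr)
{
	int i, flush = 0;

	for (i = 0; i < nr; i++)
		flush |= walk_lower(tables[i]);
	return flush;
}
```

With a single leaf requesting a flush, the lossy variant reports 0 and the fixed variant reports 1, which is why the missing flush would only show up intermittently in practice.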

2017-03-07 18:19:46

by Boris Ostrovsky

Subject: Re: [PATCHv4 18/33] x86/xen: convert __xen_pgd_walk() and xen_cleanmfnmap() to support p4d


>> Don't we need to pass vaddr down to all routines so that they select
>> appropriate tables? You seem to always be choosing the first one.
> IIUC, we clear whole page table subtree covered by one pgd entry.
> So, no, there's no need to pass vaddr down. Just pointer to page table
> entry is enough.
>
> But I know virtually nothing about Xen. Please re-check my reasoning.

Yes, we effectively remove the whole page table for vaddr so I guess
it's OK.

>
> I would also appreciate help with getting x86 Xen code work with 5-level
> paging enabled. For now I make CONFIG_XEN dependent on !CONFIG_X86_5LEVEL.

Hmmm... that's a problem, since this requires changes in the hypervisor,
and even if/when these changes are made, older versions of the hypervisor
still will not be able to run those guests.

This affects only PV guests, and there is a series under review that
provides clean code separation with CONFIG_XEN_PV. But because, for
example, dom0 (the Xen control domain) is PV, this will significantly limit
the availability of dom0-capable kernels (because I assume distros will
want to have CONFIG_X86_5LEVEL).


>
> Fixup:

Yes, that works. (But then it worked even without this change because
problems caused by missing the flush would be intermittent. And a joy to
debug).

-boris

2017-03-07 18:27:16

by Andrew Cooper

Subject: Re: [Xen-devel] [PATCHv4 18/33] x86/xen: convert __xen_pgd_walk() and xen_cleanmfnmap() to support p4d

On 07/03/17 18:18, Boris Ostrovsky wrote:
>>> Don't we need to pass vaddr down to all routines so that they select
>>> appropriate tables? You seem to always be choosing the first one.
>> IIUC, we clear the whole page table subtree covered by one pgd entry.
>> So, no, there's no need to pass vaddr down. Just a pointer to the page
>> table entry is enough.
>>
>> But I know virtually nothing about Xen. Please re-check my reasoning.
> Yes, we effectively remove the whole page table for vaddr so I guess
> it's OK.
>
>> I would also appreciate help with getting x86 Xen code to work with 5-level
>> paging enabled. For now I've made CONFIG_XEN dependent on !CONFIG_X86_5LEVEL.
> Hmmm... that's a problem since this requires changes in the hypervisor,
> and even if/when these changes are made, older versions of the hypervisor
> still will not be able to run those guests.
>
> This affects only PV guests and there is a series under review that
> provides clean code separation with CONFIG_XEN_PV but because, for
> example, dom0 (Xen control domain) is PV this will significantly limit
> availability of dom0-capable kernels (because I assume distros will want
> to have CONFIG_X86_5LEVEL).

Wasn't the plan to be able to automatically detect 4 vs 5 level support,
and cope either way, so distros didn't have to ship two different builds
of Linux?

If so, all we need to do is get things to compile sensibly, and have the PV
entry code in Linux configure the rest of the kernel appropriately.
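
A rough sketch of what such a boot-time decision could look like, as an
illustration only. The LA57 feature is reported in
CPUID.(EAX=7,ECX=0):ECX bit 16; the helper names and the
host_allows_5level parameter are hypothetical and do not correspond to
any existing kernel or Xen interface:

```c
#include <stdbool.h>
#include <stdint.h>

/* CPUID.(EAX=7,ECX=0):ECX bit 16 advertises LA57 (5-level paging). */
#define CPUID7_ECX_LA57 (UINT32_C(1) << 16)

/* True if the given CPUID leaf 7 ECX value reports LA57. */
static bool cpu_has_la57(uint32_t cpuid7_ecx)
{
    return (cpuid7_ecx & CPUID7_ECX_LA57) != 0;
}

/* Hypothetical policy: use 5 levels only when the CPU reports LA57 and
 * the environment (for a PV guest, the hypervisor) can cope with it. */
static unsigned int pick_paging_levels(uint32_t cpuid7_ecx,
                                       bool host_allows_5level)
{
    return (cpu_has_la57(cpuid7_ecx) && host_allows_5level) ? 5 : 4;
}

/* Virtual address width per mode: 48 bits (256 TiB) for 4-level paging,
 * 57 bits (128 PiB) for 5-level paging. */
static unsigned int vaddr_bits(unsigned int levels)
{
    return levels == 5 ? 57 : 48;
}
```

A single binary would call something like pick_paging_levels() once,
early, and let the rest of the kernel key off the result.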

(If not, please ignore me.)

~Andrew

2017-03-07 19:25:43

by Boris Ostrovsky

Subject: Re: [Xen-devel] [PATCHv4 18/33] x86/xen: convert __xen_pgd_walk() and xen_cleanmfnmap() to support p4d

On 03/07/2017 01:26 PM, Andrew Cooper wrote:
> On 07/03/17 18:18, Boris Ostrovsky wrote:
>>>> Don't we need to pass vaddr down to all routines so that they select
>>>> appropriate tables? You seem to always be choosing the first one.
>>> IIUC, we clear the whole page table subtree covered by one pgd entry.
>>> So, no, there's no need to pass vaddr down. Just a pointer to the page
>>> table entry is enough.
>>>
>>> But I know virtually nothing about Xen. Please re-check my reasoning.
>> Yes, we effectively remove the whole page table for vaddr so I guess
>> it's OK.
>>
>>> I would also appreciate help with getting x86 Xen code to work with 5-level
>>> paging enabled. For now I've made CONFIG_XEN dependent on !CONFIG_X86_5LEVEL.
>> Hmmm... that's a problem since this requires changes in the hypervisor,
>> and even if/when these changes are made, older versions of the hypervisor
>> still will not be able to run those guests.
>>
>> This affects only PV guests and there is a series under review that
>> provides clean code separation with CONFIG_XEN_PV but because, for
>> example, dom0 (Xen control domain) is PV this will significantly limit
>> availability of dom0-capable kernels (because I assume distros will want
>> to have CONFIG_X86_5LEVEL).
> Wasn't the plan to be able to automatically detect 4 vs 5 level support,
> and cope either way, so distros didn't have to ship two different builds
> of Linux?
>
> If so, all we need to do is get things to compile sensibly, and have the PV
> entry code in Linux configure the rest of the kernel appropriately.

I am not aware of any plans but this would obviously be the preferred route.

-boris