x86-64 is currently limited to 256 TiB of virtual address space and 64 TiB
of physical address space. We are already bumping into this limit: some
vendors offer servers with 64 TiB of memory today.
To overcome the limitation, upcoming hardware will introduce support for
5-level paging[1]. It is a straightforward extension of the current page
table structure, adding one more layer of translation.
It bumps the limits to 128 PiB of virtual address space and 4 PiB of
physical address space. This "ought to be enough for anybody" ©.
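(The arithmetic: each page table level indexes 512 = 2^9 entries and the
page offset takes 12 bits, so 4 levels cover 4*9 + 12 = 48 bits of
virtual address space (256 TiB), while 5 levels cover 5*9 + 12 = 57 bits
(128 PiB).)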
This patchset is still very early. There are a number of things we
still need to do before asking anyone to merge it (listed below).
It would be great if folks could start testing applications now (in
QEMU) to look for breakage.
Any early comments on the design or the patches would be appreciated as
well.
More details on the design and what’s left to implement are below.
- Linux MM now uses a 5-level paging abstraction.
The new page table level is called p4d and sits just below pgd (see
the sketch after this list).
- All architectures are converted to folded 5-level paging.
I added <asm-generic/5level-fixup.h>. It uses the same basic
approach as the <asm-generic/4level-fixup.h> hack.
- x86 is converted to the new <asm-generic/pgtable-nop4d.h>.
All existing paging modes (2-, 3- and 4-level) on x86 are converted
to pgtable-nop4d.h.
The new header provides the basics for a properly folded additional
page table level. The idea is the same as with the other
pgtable-nop?d.h headers.
- Implement 5-level paging on x86.
CONFIG_X86_5LEVEL=y enables the new 5-level paging mode.
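For orientation, the lookup path simply gains one level between pgd and
pud after the conversion; schematically (a sketch, not a verbatim
excerpt from the patches):

    pgd_t *pgd = pgd_offset(mm, addr);      /* top level, as before */
    p4d_t *p4d = p4d_offset(pgd, addr);     /* new level; folds into pgd
                                               when fewer levels exist */
    pud_t *pud = pud_offset(p4d, addr);     /* now takes p4d, not pgd */
    pmd_t *pmd = pmd_offset(pud, addr);
    pte_t *pte = pte_offset_map(pmd, addr);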
The patchset is built on top of v4.8.
I've also included a QEMU patch which enables 5-level paging in the
emulator, so anybody can play with the feature.
There is still work to do:
- Boot-time switch between 4- and 5-level paging.
We assume that distributions will be keen to avoid returning to the
i386 days where we shipped one kernel binary for each page table
layout.
As the page table format is the same for 4- and 5-level paging, it
should be possible to have a single kernel binary and switch between
them at boot time without too much hassle.
For now I have only implemented a compile-time switch.
I hope to bring this feature in a separate patchset once the basic
enabling is upstream. Is that okay?
- Handle opt-in to the wider address space for userspace.
Not all userspace is ready to handle addresses wider than the current
47 bits. At least some JIT compilers make use of the upper bits to
encode their own information (see the pointer-tagging sketch after
this list).
We need an interface for userspace to opt into wider addresses, to
avoid regressions.
For now, I've included a testing-only patch which bumps TASK_SIZE to
56 bits. This can be handy for testing to see what breaks if we max
out the size of the virtual address space.
- CONFIG_XEN is broken.
Paravirt Xen MMU support hasn't yet been adjusted to work with 5-level
paging. It's a legacy feature and I'm not sure we really need to
support it with the new paging mode, but it blocks Xen drivers too.
I haven't got around to setting up a testing environment for Xen, so
I've left it broken for now.
I would appreciate help with this code.
- Split patches further.
In some cases it's not trivial to split patches into reasonable pieces
without breaking bisectability.
- Validation.
I haven't done much testing beyond basic boot.
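As a concrete illustration of the userspace-breakage risk mentioned in
the opt-in item above, here is a sketch of the pointer-tagging pattern
some JITs use (box/unbox and the exact layout are hypothetical, not
taken from any particular JIT):

    #include <stdint.h>

    /* Tag stored in bits 48..63: only safe while user pointers fit
     * into 47 bits. */
    static uint64_t box(void *p, unsigned int tag)
    {
        return ((uint64_t)tag << 48) | (uint64_t)p;
    }

    static void *unbox(uint64_t v)
    {
        /* Masks off everything above bit 47, so a pointer from a
         * 56-bit TASK_SIZE comes back corrupted. */
        return (void *)(v & ((1ULL << 48) - 1));
    }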
Git:
git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git la57/v1
Any comments are welcome.
[1] https://software.intel.com/sites/default/files/managed/2b/80/5-level_paging_white_paper.pdf
Kirill A. Shutemov (28):
asm-generic: introduce 5level-fixup.h
asm-generic: introduce __ARCH_USE_5LEVEL_HACK
arch, mm: convert all architectures to use 5level-fixup.h
asm-generic: introduce <asm-generic/pgtable-nop4d.h>
mm: convert generic code to 5-level paging
x86: basic changes into headers for 5-level paging
x86: trivial portion of 5-level paging conversion
x86/gup: add 5-level paging support
x86/ident_map: add 5-level paging support
x86/mm: add support of p4d_t in vmalloc_fault()
x86/power: support p4d_t in hibernate code
x86/kexec: support p4d_t
x86: convert the rest of the code to support p4d_t
mm: introduce __p4d_alloc()
x86: detect 5-level paging support
x86/asm: remove __VIRTUAL_MASK_SHIFT==47 assert
x86/mm: define virtual memory map for 5-level paging
x86/paravirt: make paravirt code support 5-level paging
x86/mm: basic defines/helpers for CONFIG_X86_5LEVEL
x86/dump_pagetables: support 5-level paging
x86/mm: extend kasan to support 5-level paging
x86/espfix: support 5-level paging
x86/mm: add support of additional page table level during early boot
x86/mm: add sync_global_pgds() for configuration with 5-level paging
x86/mm: make kernel_physical_mapping_init() support 5-level paging
x86/mm: add support for 5-level paging for KASLR
x86: enable la57 support
TESTING-ONLY: bump TASK_SIZE_MAX
Documentation/x86/x86_64/mm.txt | 23 +-
arch/arc/include/asm/hugepage.h | 1 +
arch/arc/include/asm/pgtable.h | 1 +
arch/arm/include/asm/pgtable.h | 1 +
arch/arm64/include/asm/pgtable-types.h | 4 +
arch/avr32/include/asm/pgtable-2level.h | 1 +
arch/cris/include/asm/pgtable.h | 1 +
arch/frv/include/asm/pgtable.h | 1 +
arch/h8300/include/asm/pgtable.h | 1 +
arch/hexagon/include/asm/pgtable.h | 1 +
arch/ia64/include/asm/pgtable.h | 2 +
arch/metag/include/asm/pgtable.h | 1 +
arch/mips/include/asm/pgtable-32.h | 1 +
arch/mips/include/asm/pgtable-64.h | 1 +
arch/mn10300/include/asm/page.h | 1 +
arch/nios2/include/asm/pgtable.h | 1 +
arch/openrisc/include/asm/pgtable.h | 1 +
arch/powerpc/include/asm/book3s/32/pgtable.h | 1 +
arch/powerpc/include/asm/book3s/64/pgtable.h | 2 +
arch/powerpc/include/asm/nohash/32/pgtable.h | 1 +
arch/powerpc/include/asm/nohash/64/pgtable-4k.h | 3 +
arch/powerpc/include/asm/nohash/64/pgtable-64k.h | 1 +
arch/s390/include/asm/pgtable.h | 1 +
arch/score/include/asm/pgtable.h | 1 +
arch/sh/include/asm/pgtable-2level.h | 1 +
arch/sh/include/asm/pgtable-3level.h | 1 +
arch/sparc/include/asm/pgtable_64.h | 1 +
arch/tile/include/asm/pgtable_32.h | 1 +
arch/tile/include/asm/pgtable_64.h | 1 +
arch/um/include/asm/pgtable-2level.h | 1 +
arch/um/include/asm/pgtable-3level.h | 1 +
arch/unicore32/include/asm/pgtable.h | 1 +
arch/x86/Kconfig | 7 +
arch/x86/boot/compressed/head_64.S | 23 +-
arch/x86/boot/cpucheck.c | 9 +
arch/x86/boot/cpuflags.c | 16 ++
arch/x86/entry/entry_64.S | 7 +-
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/disabled-features.h | 8 +-
arch/x86/include/asm/kasan.h | 9 +-
arch/x86/include/asm/kexec.h | 1 +
arch/x86/include/asm/page_64_types.h | 10 +
arch/x86/include/asm/paravirt.h | 64 +++++-
arch/x86/include/asm/paravirt_types.h | 17 +-
arch/x86/include/asm/pgalloc.h | 36 ++-
arch/x86/include/asm/pgtable-2level_types.h | 1 +
arch/x86/include/asm/pgtable-3level_types.h | 1 +
arch/x86/include/asm/pgtable.h | 91 +++++++-
arch/x86/include/asm/pgtable_64.h | 29 ++-
arch/x86/include/asm/pgtable_64_types.h | 27 +++
arch/x86/include/asm/pgtable_types.h | 42 +++-
arch/x86/include/asm/processor.h | 3 +-
arch/x86/include/asm/required-features.h | 8 +-
arch/x86/include/asm/sparsemem.h | 9 +-
arch/x86/include/uapi/asm/processor-flags.h | 2 +
arch/x86/kernel/espfix_64.c | 43 +++-
arch/x86/kernel/head64.c | 40 +++-
arch/x86/kernel/head_64.S | 58 +++--
arch/x86/kernel/machine_kexec_32.c | 4 +-
arch/x86/kernel/machine_kexec_64.c | 14 +-
arch/x86/kernel/paravirt.c | 13 +-
arch/x86/kernel/tboot.c | 6 +-
arch/x86/kernel/vm86_32.c | 6 +-
arch/x86/mm/dump_pagetables.c | 51 ++++-
arch/x86/mm/fault.c | 57 ++++-
arch/x86/mm/gup.c | 33 ++-
arch/x86/mm/ident_map.c | 42 +++-
arch/x86/mm/init_32.c | 22 +-
arch/x86/mm/init_64.c | 274 +++++++++++++++++++----
arch/x86/mm/ioremap.c | 3 +-
arch/x86/mm/kasan_init_64.c | 42 +++-
arch/x86/mm/kaslr.c | 82 +++++--
arch/x86/mm/pageattr.c | 56 +++--
arch/x86/mm/pgtable.c | 38 +++-
arch/x86/mm/pgtable_32.c | 8 +-
arch/x86/platform/efi/efi_64.c | 21 +-
arch/x86/power/hibernate_32.c | 7 +-
arch/x86/power/hibernate_64.c | 35 +--
arch/x86/realmode/init.c | 2 +-
arch/x86/xen/Kconfig | 1 +
arch/xtensa/include/asm/pgtable.h | 1 +
drivers/misc/sgi-gru/grufault.c | 9 +-
fs/userfaultfd.c | 6 +-
include/asm-generic/4level-fixup.h | 3 +-
include/asm-generic/5level-fixup.h | 41 ++++
include/asm-generic/pgtable-nop4d-hack.h | 62 +++++
include/asm-generic/pgtable-nop4d.h | 56 +++++
include/asm-generic/pgtable-nopud.h | 48 ++--
include/asm-generic/pgtable.h | 48 +++-
include/asm-generic/tlb.h | 12 +-
include/linux/hugetlb.h | 5 +-
include/linux/kasan.h | 1 +
include/linux/mm.h | 32 ++-
lib/ioremap.c | 39 +++-
mm/gup.c | 46 +++-
mm/huge_memory.c | 7 +-
mm/hugetlb.c | 29 ++-
mm/kasan/kasan_init.c | 35 ++-
mm/memory.c | 230 ++++++++++++++++---
mm/mlock.c | 1 +
mm/mprotect.c | 26 ++-
mm/mremap.c | 13 +-
mm/pagewalk.c | 32 ++-
mm/pgtable-generic.c | 6 +
mm/rmap.c | 13 +-
mm/sparse-vmemmap.c | 22 +-
mm/swapfile.c | 26 ++-
mm/userfaultfd.c | 23 +-
mm/vmalloc.c | 81 +++++--
109 files changed, 2027 insertions(+), 366 deletions(-)
create mode 100644 include/asm-generic/5level-fixup.h
create mode 100644 include/asm-generic/pgtable-nop4d-hack.h
create mode 100644 include/asm-generic/pgtable-nop4d.h
--
2.10.2
If the architecture uses 4level-fixup.h, we don't need to do anything,
as it includes 5level-fixup.h.
If the architecture uses pgtable-nop*d.h, define __ARCH_USE_5LEVEL_HACK
before including the header. This makes the asm-generic code use
5level-fixup.h.
If the architecture has 4-level paging or folds levels on its own,
include 5level-fixup.h directly.
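Concretely, the latter two cases look like this (illustrative excerpts;
the hunks below apply exactly this pattern to each architecture):

    /* an architecture that folds levels via pgtable-nop*d.h */
    #define __ARCH_USE_5LEVEL_HACK
    #include <asm-generic/pgtable-nopmd.h>

    /* an architecture with 4-level paging, or one that folds levels
     * on its own */
    #include <asm-generic/5level-fixup.h>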
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/arc/include/asm/hugepage.h | 1 +
arch/arc/include/asm/pgtable.h | 1 +
arch/arm/include/asm/pgtable.h | 1 +
arch/arm64/include/asm/pgtable-types.h | 4 ++++
arch/avr32/include/asm/pgtable-2level.h | 1 +
arch/cris/include/asm/pgtable.h | 1 +
arch/frv/include/asm/pgtable.h | 1 +
arch/h8300/include/asm/pgtable.h | 1 +
arch/hexagon/include/asm/pgtable.h | 1 +
arch/ia64/include/asm/pgtable.h | 2 ++
arch/metag/include/asm/pgtable.h | 1 +
arch/mips/include/asm/pgtable-32.h | 1 +
arch/mips/include/asm/pgtable-64.h | 1 +
arch/mn10300/include/asm/page.h | 1 +
arch/nios2/include/asm/pgtable.h | 1 +
arch/openrisc/include/asm/pgtable.h | 1 +
arch/powerpc/include/asm/book3s/32/pgtable.h | 1 +
arch/powerpc/include/asm/book3s/64/pgtable.h | 2 ++
arch/powerpc/include/asm/nohash/32/pgtable.h | 1 +
arch/powerpc/include/asm/nohash/64/pgtable-4k.h | 3 +++
arch/powerpc/include/asm/nohash/64/pgtable-64k.h | 1 +
arch/s390/include/asm/pgtable.h | 1 +
arch/score/include/asm/pgtable.h | 1 +
arch/sh/include/asm/pgtable-2level.h | 1 +
arch/sh/include/asm/pgtable-3level.h | 1 +
arch/sparc/include/asm/pgtable_64.h | 1 +
arch/tile/include/asm/pgtable_32.h | 1 +
arch/tile/include/asm/pgtable_64.h | 1 +
arch/um/include/asm/pgtable-2level.h | 1 +
arch/um/include/asm/pgtable-3level.h | 1 +
arch/unicore32/include/asm/pgtable.h | 1 +
arch/x86/include/asm/pgtable_types.h | 4 ++++
arch/xtensa/include/asm/pgtable.h | 1 +
33 files changed, 43 insertions(+)
diff --git a/arch/arc/include/asm/hugepage.h b/arch/arc/include/asm/hugepage.h
index 317ff773e1ca..b18fcb606908 100644
--- a/arch/arc/include/asm/hugepage.h
+++ b/arch/arc/include/asm/hugepage.h
@@ -11,6 +11,7 @@
#define _ASM_ARC_HUGEPAGE_H
#include <linux/types.h>
+#define __ARCH_USE_5LEVEL_HACK
#include <asm-generic/pgtable-nopmd.h>
static inline pte_t pmd_pte(pmd_t pmd)
diff --git a/arch/arc/include/asm/pgtable.h b/arch/arc/include/asm/pgtable.h
index 89eeb3720051..233a92795c37 100644
--- a/arch/arc/include/asm/pgtable.h
+++ b/arch/arc/include/asm/pgtable.h
@@ -37,6 +37,7 @@
#include <asm/page.h>
#include <asm/mmu.h>
+#define __ARCH_USE_5LEVEL_HACK
#include <asm-generic/pgtable-nopmd.h>
#include <linux/const.h>
diff --git a/arch/arm/include/asm/pgtable.h b/arch/arm/include/asm/pgtable.h
index a8d656d9aec7..1c462381c225 100644
--- a/arch/arm/include/asm/pgtable.h
+++ b/arch/arm/include/asm/pgtable.h
@@ -20,6 +20,7 @@
#else
+#define __ARCH_USE_5LEVEL_HACK
#include <asm-generic/pgtable-nopud.h>
#include <asm/memory.h>
#include <asm/pgtable-hwdef.h>
diff --git a/arch/arm64/include/asm/pgtable-types.h b/arch/arm64/include/asm/pgtable-types.h
index 69b2fd41503c..345a072b5856 100644
--- a/arch/arm64/include/asm/pgtable-types.h
+++ b/arch/arm64/include/asm/pgtable-types.h
@@ -55,9 +55,13 @@ typedef struct { pteval_t pgprot; } pgprot_t;
#define __pgprot(x) ((pgprot_t) { (x) } )
#if CONFIG_PGTABLE_LEVELS == 2
+#define __ARCH_USE_5LEVEL_HACK
#include <asm-generic/pgtable-nopmd.h>
#elif CONFIG_PGTABLE_LEVELS == 3
+#define __ARCH_USE_5LEVEL_HACK
#include <asm-generic/pgtable-nopud.h>
+#elif CONFIG_PGTABLE_LEVELS == 4
+#include <asm-generic/5level-fixup.h>
#endif
#endif /* __ASM_PGTABLE_TYPES_H */
diff --git a/arch/avr32/include/asm/pgtable-2level.h b/arch/avr32/include/asm/pgtable-2level.h
index 425dd567b5b9..d5b1c63993ec 100644
--- a/arch/avr32/include/asm/pgtable-2level.h
+++ b/arch/avr32/include/asm/pgtable-2level.h
@@ -8,6 +8,7 @@
#ifndef __ASM_AVR32_PGTABLE_2LEVEL_H
#define __ASM_AVR32_PGTABLE_2LEVEL_H
+#define __ARCH_USE_5LEVEL_HACK
#include <asm-generic/pgtable-nopmd.h>
/*
diff --git a/arch/cris/include/asm/pgtable.h b/arch/cris/include/asm/pgtable.h
index ceefc314d64d..5dcdb7d014e5 100644
--- a/arch/cris/include/asm/pgtable.h
+++ b/arch/cris/include/asm/pgtable.h
@@ -6,6 +6,7 @@
#define _CRIS_PGTABLE_H
#include <asm/page.h>
+#define __ARCH_USE_5LEVEL_HACK
#include <asm-generic/pgtable-nopmd.h>
#ifndef __ASSEMBLY__
diff --git a/arch/frv/include/asm/pgtable.h b/arch/frv/include/asm/pgtable.h
index 07d7a7ef8bd5..847bcb14bf66 100644
--- a/arch/frv/include/asm/pgtable.h
+++ b/arch/frv/include/asm/pgtable.h
@@ -16,6 +16,7 @@
#ifndef _ASM_PGTABLE_H
#define _ASM_PGTABLE_H
+#include <asm-generic/5level-fixup.h>
#include <asm/mem-layout.h>
#include <asm/setup.h>
#include <asm/processor.h>
diff --git a/arch/h8300/include/asm/pgtable.h b/arch/h8300/include/asm/pgtable.h
index 8341db67821d..7d265d28ba5e 100644
--- a/arch/h8300/include/asm/pgtable.h
+++ b/arch/h8300/include/asm/pgtable.h
@@ -1,5 +1,6 @@
#ifndef _H8300_PGTABLE_H
#define _H8300_PGTABLE_H
+#define __ARCH_USE_5LEVEL_HACK
#include <asm-generic/pgtable-nopud.h>
#include <asm-generic/pgtable.h>
#define pgtable_cache_init() do { } while (0)
diff --git a/arch/hexagon/include/asm/pgtable.h b/arch/hexagon/include/asm/pgtable.h
index 49eab8136ec3..24a9177fb897 100644
--- a/arch/hexagon/include/asm/pgtable.h
+++ b/arch/hexagon/include/asm/pgtable.h
@@ -26,6 +26,7 @@
*/
#include <linux/swap.h>
#include <asm/page.h>
+#define __ARCH_USE_5LEVEL_HACK
#include <asm-generic/pgtable-nopmd.h>
/* A handy thing to have if one has the RAM. Declared in head.S */
diff --git a/arch/ia64/include/asm/pgtable.h b/arch/ia64/include/asm/pgtable.h
index 9f3ed9ee8f13..53d97cd5fab9 100644
--- a/arch/ia64/include/asm/pgtable.h
+++ b/arch/ia64/include/asm/pgtable.h
@@ -587,8 +587,10 @@ extern struct page *zero_page_memmap_ptr;
#if CONFIG_PGTABLE_LEVELS == 3
+#define __ARCH_USE_5LEVEL_HACK
#include <asm-generic/pgtable-nopud.h>
#endif
+#include <asm-generic/5level-fixup.h>
#include <asm-generic/pgtable.h>
#endif /* _ASM_IA64_PGTABLE_H */
diff --git a/arch/metag/include/asm/pgtable.h b/arch/metag/include/asm/pgtable.h
index ffa3a3a2ecad..0c151e5af079 100644
--- a/arch/metag/include/asm/pgtable.h
+++ b/arch/metag/include/asm/pgtable.h
@@ -6,6 +6,7 @@
#define _METAG_PGTABLE_H
#include <asm/pgtable-bits.h>
+#define __ARCH_USE_5LEVEL_HACK
#include <asm-generic/pgtable-nopmd.h>
/* Invalid regions on Meta: 0x00000000-0x001FFFFF and 0xFFFF0000-0xFFFFFFFF */
diff --git a/arch/mips/include/asm/pgtable-32.h b/arch/mips/include/asm/pgtable-32.h
index d21f3da7bdb6..6f94bed571c4 100644
--- a/arch/mips/include/asm/pgtable-32.h
+++ b/arch/mips/include/asm/pgtable-32.h
@@ -16,6 +16,7 @@
#include <asm/cachectl.h>
#include <asm/fixmap.h>
+#define __ARCH_USE_5LEVEL_HACK
#include <asm-generic/pgtable-nopmd.h>
extern int temp_tlb_entry;
diff --git a/arch/mips/include/asm/pgtable-64.h b/arch/mips/include/asm/pgtable-64.h
index 514cbc0a6a67..130a2a6c1531 100644
--- a/arch/mips/include/asm/pgtable-64.h
+++ b/arch/mips/include/asm/pgtable-64.h
@@ -17,6 +17,7 @@
#include <asm/cachectl.h>
#include <asm/fixmap.h>
+#define __ARCH_USE_5LEVEL_HACK
#if defined(CONFIG_PAGE_SIZE_64KB) && !defined(CONFIG_MIPS_VA_BITS_48)
#include <asm-generic/pgtable-nopmd.h>
#else
diff --git a/arch/mn10300/include/asm/page.h b/arch/mn10300/include/asm/page.h
index 3810a6f740fd..dfe730a5ede0 100644
--- a/arch/mn10300/include/asm/page.h
+++ b/arch/mn10300/include/asm/page.h
@@ -57,6 +57,7 @@ typedef struct page *pgtable_t;
#define __pgd(x) ((pgd_t) { (x) })
#define __pgprot(x) ((pgprot_t) { (x) })
+#define __ARCH_USE_5LEVEL_HACK
#include <asm-generic/pgtable-nopmd.h>
#endif /* !__ASSEMBLY__ */
diff --git a/arch/nios2/include/asm/pgtable.h b/arch/nios2/include/asm/pgtable.h
index 298393c3cb42..db4f7d179220 100644
--- a/arch/nios2/include/asm/pgtable.h
+++ b/arch/nios2/include/asm/pgtable.h
@@ -22,6 +22,7 @@
#include <asm/tlbflush.h>
#include <asm/pgtable-bits.h>
+#define __ARCH_USE_5LEVEL_HACK
#include <asm-generic/pgtable-nopmd.h>
#define FIRST_USER_ADDRESS 0UL
diff --git a/arch/openrisc/include/asm/pgtable.h b/arch/openrisc/include/asm/pgtable.h
index 69c7df0e1420..2c98e6dabcfe 100644
--- a/arch/openrisc/include/asm/pgtable.h
+++ b/arch/openrisc/include/asm/pgtable.h
@@ -25,6 +25,7 @@
#ifndef __ASM_OPENRISC_PGTABLE_H
#define __ASM_OPENRISC_PGTABLE_H
+#define __ARCH_USE_5LEVEL_HACK
#include <asm-generic/pgtable-nopmd.h>
#ifndef __ASSEMBLY__
diff --git a/arch/powerpc/include/asm/book3s/32/pgtable.h b/arch/powerpc/include/asm/book3s/32/pgtable.h
index 38b33dcfcc9d..ac53e7182387 100644
--- a/arch/powerpc/include/asm/book3s/32/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/32/pgtable.h
@@ -1,6 +1,7 @@
#ifndef _ASM_POWERPC_BOOK3S_32_PGTABLE_H
#define _ASM_POWERPC_BOOK3S_32_PGTABLE_H
+#define __ARCH_USE_5LEVEL_HACK
#include <asm-generic/pgtable-nopmd.h>
#include <asm/book3s/32/hash.h>
diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 263bf39ced40..e0be442aaacb 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -1,6 +1,8 @@
#ifndef _ASM_POWERPC_BOOK3S_64_PGTABLE_H_
#define _ASM_POWERPC_BOOK3S_64_PGTABLE_H_
+#include <asm-generic/5level-fixup.h>
+
/*
* Common bits between hash and Radix page table
*/
diff --git a/arch/powerpc/include/asm/nohash/32/pgtable.h b/arch/powerpc/include/asm/nohash/32/pgtable.h
index 780847597514..c4a3bd9e5c7f 100644
--- a/arch/powerpc/include/asm/nohash/32/pgtable.h
+++ b/arch/powerpc/include/asm/nohash/32/pgtable.h
@@ -1,6 +1,7 @@
#ifndef _ASM_POWERPC_NOHASH_32_PGTABLE_H
#define _ASM_POWERPC_NOHASH_32_PGTABLE_H
+#define __ARCH_USE_5LEVEL_HACK
#include <asm-generic/pgtable-nopmd.h>
#ifndef __ASSEMBLY__
diff --git a/arch/powerpc/include/asm/nohash/64/pgtable-4k.h b/arch/powerpc/include/asm/nohash/64/pgtable-4k.h
index fc7d51753f81..2ab403514c07 100644
--- a/arch/powerpc/include/asm/nohash/64/pgtable-4k.h
+++ b/arch/powerpc/include/asm/nohash/64/pgtable-4k.h
@@ -1,5 +1,8 @@
#ifndef _ASM_POWERPC_NOHASH_64_PGTABLE_4K_H
#define _ASM_POWERPC_NOHASH_64_PGTABLE_4K_H
+
+#include <asm-generic/5level-fixup.h>
+
/*
* Entries per page directory level. The PTE level must use a 64b record
* for each page table entry. The PMD and PGD level use a 32b record for
diff --git a/arch/powerpc/include/asm/nohash/64/pgtable-64k.h b/arch/powerpc/include/asm/nohash/64/pgtable-64k.h
index 908324574f77..f18752da5965 100644
--- a/arch/powerpc/include/asm/nohash/64/pgtable-64k.h
+++ b/arch/powerpc/include/asm/nohash/64/pgtable-64k.h
@@ -1,6 +1,7 @@
#ifndef _ASM_POWERPC_NOHASH_64_PGTABLE_64K_H
#define _ASM_POWERPC_NOHASH_64_PGTABLE_64K_H
+#define __ARCH_USE_5LEVEL_HACK
#include <asm-generic/pgtable-nopud.h>
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 72c7f60bfe83..738a49934fc0 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -24,6 +24,7 @@
* the S390 page table tree.
*/
#ifndef __ASSEMBLY__
+#include <asm-generic/5level-fixup.h>
#include <linux/sched.h>
#include <linux/mm_types.h>
#include <linux/page-flags.h>
diff --git a/arch/score/include/asm/pgtable.h b/arch/score/include/asm/pgtable.h
index 0553e5cd5985..46ff8fd678a7 100644
--- a/arch/score/include/asm/pgtable.h
+++ b/arch/score/include/asm/pgtable.h
@@ -2,6 +2,7 @@
#define _ASM_SCORE_PGTABLE_H
#include <linux/const.h>
+#define __ARCH_USE_5LEVEL_HACK
#include <asm-generic/pgtable-nopmd.h>
#include <asm/fixmap.h>
diff --git a/arch/sh/include/asm/pgtable-2level.h b/arch/sh/include/asm/pgtable-2level.h
index 19bd89db17e7..f75cf4387257 100644
--- a/arch/sh/include/asm/pgtable-2level.h
+++ b/arch/sh/include/asm/pgtable-2level.h
@@ -1,6 +1,7 @@
#ifndef __ASM_SH_PGTABLE_2LEVEL_H
#define __ASM_SH_PGTABLE_2LEVEL_H
+#define __ARCH_USE_5LEVEL_HACK
#include <asm-generic/pgtable-nopmd.h>
/*
diff --git a/arch/sh/include/asm/pgtable-3level.h b/arch/sh/include/asm/pgtable-3level.h
index 249a985d9648..9b1e776eca31 100644
--- a/arch/sh/include/asm/pgtable-3level.h
+++ b/arch/sh/include/asm/pgtable-3level.h
@@ -1,6 +1,7 @@
#ifndef __ASM_SH_PGTABLE_3LEVEL_H
#define __ASM_SH_PGTABLE_3LEVEL_H
+#define __ARCH_USE_5LEVEL_HACK
#include <asm-generic/pgtable-nopud.h>
/*
diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
index 1fb317fbc0b3..4a0ae4b542fd 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -12,6 +12,7 @@
* the SpitFire page tables.
*/
+#include <asm-generic/5level-fixup.h>
#include <linux/compiler.h>
#include <linux/const.h>
#include <asm/types.h>
diff --git a/arch/tile/include/asm/pgtable_32.h b/arch/tile/include/asm/pgtable_32.h
index d26a42279036..5f8c615cb5e9 100644
--- a/arch/tile/include/asm/pgtable_32.h
+++ b/arch/tile/include/asm/pgtable_32.h
@@ -74,6 +74,7 @@ extern unsigned long VMALLOC_RESERVE /* = CONFIG_VMALLOC_RESERVE */;
#define MAXMEM (_VMALLOC_START - PAGE_OFFSET)
/* We have no pmd or pud since we are strictly a two-level page table */
+#define __ARCH_USE_5LEVEL_HACK
#include <asm-generic/pgtable-nopmd.h>
static inline int pud_huge_page(pud_t pud) { return 0; }
diff --git a/arch/tile/include/asm/pgtable_64.h b/arch/tile/include/asm/pgtable_64.h
index e96cec52f6d8..96fe58b45118 100644
--- a/arch/tile/include/asm/pgtable_64.h
+++ b/arch/tile/include/asm/pgtable_64.h
@@ -59,6 +59,7 @@
#ifndef __ASSEMBLY__
/* We have no pud since we are a three-level page table. */
+#define __ARCH_USE_5LEVEL_HACK
#include <asm-generic/pgtable-nopud.h>
/*
diff --git a/arch/um/include/asm/pgtable-2level.h b/arch/um/include/asm/pgtable-2level.h
index cfbe59752469..179c0ea87a0c 100644
--- a/arch/um/include/asm/pgtable-2level.h
+++ b/arch/um/include/asm/pgtable-2level.h
@@ -8,6 +8,7 @@
#ifndef __UM_PGTABLE_2LEVEL_H
#define __UM_PGTABLE_2LEVEL_H
+#define __ARCH_USE_5LEVEL_HACK
#include <asm-generic/pgtable-nopmd.h>
/* PGDIR_SHIFT determines what a third-level page table entry can map */
diff --git a/arch/um/include/asm/pgtable-3level.h b/arch/um/include/asm/pgtable-3level.h
index bae8523a162f..c4d876dfb9ac 100644
--- a/arch/um/include/asm/pgtable-3level.h
+++ b/arch/um/include/asm/pgtable-3level.h
@@ -7,6 +7,7 @@
#ifndef __UM_PGTABLE_3LEVEL_H
#define __UM_PGTABLE_3LEVEL_H
+#define __ARCH_USE_5LEVEL_HACK
#include <asm-generic/pgtable-nopud.h>
/* PGDIR_SHIFT determines what a third-level page table entry can map */
diff --git a/arch/unicore32/include/asm/pgtable.h b/arch/unicore32/include/asm/pgtable.h
index 818d0f5598e3..a4f2bef37e70 100644
--- a/arch/unicore32/include/asm/pgtable.h
+++ b/arch/unicore32/include/asm/pgtable.h
@@ -12,6 +12,7 @@
#ifndef __UNICORE_PGTABLE_H__
#define __UNICORE_PGTABLE_H__
+#define __ARCH_USE_5LEVEL_HACK
#include <asm-generic/pgtable-nopmd.h>
#include <asm/cpu-single.h>
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index f1218f512f62..3187bec1b79a 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -273,6 +273,8 @@ static inline pgdval_t pgd_flags(pgd_t pgd)
}
#if CONFIG_PGTABLE_LEVELS > 3
+#include <asm-generic/5level-fixup.h>
+
typedef struct { pudval_t pud; } pud_t;
static inline pud_t native_make_pud(pmdval_t val)
@@ -285,6 +287,7 @@ static inline pudval_t native_pud_val(pud_t pud)
return pud.pud;
}
#else
+#define __ARCH_USE_5LEVEL_HACK
#include <asm-generic/pgtable-nopud.h>
static inline pudval_t native_pud_val(pud_t pud)
@@ -306,6 +309,7 @@ static inline pmdval_t native_pmd_val(pmd_t pmd)
return pmd.pmd;
}
#else
+#define __ARCH_USE_5LEVEL_HACK
#include <asm-generic/pgtable-nopmd.h>
static inline pmdval_t native_pmd_val(pmd_t pmd)
diff --git a/arch/xtensa/include/asm/pgtable.h b/arch/xtensa/include/asm/pgtable.h
index fb02fdc5ecee..91c530761f1e 100644
--- a/arch/xtensa/include/asm/pgtable.h
+++ b/arch/xtensa/include/asm/pgtable.h
@@ -11,6 +11,7 @@
#ifndef _XTENSA_PGTABLE_H
#define _XTENSA_PGTABLE_H
+#define __ARCH_USE_5LEVEL_HACK
#include <asm-generic/pgtable-nopmd.h>
#include <asm/page.h>
--
2.10.2
The new paging mode is an extension of IA-32e mode with one additional
page table level.
It brings support for a 57-bit virtual address space (128 PiB) and a
52-bit physical address space (4 PiB).
The structure of the new page table level is identical to that of PML4.
The feature is enumerated with CPUID.(EAX=07H, ECX=0):ECX[bit 16].
CR4.LA57[bit 12] needs to be set when paging is enabled to activate the
5-level paging mode.
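For reference, user-space feature detection might look like this (a
minimal sketch assuming GCC/Clang's <cpuid.h>; inside QEMU the same bit
is exposed as CPUID_7_0_ECX_LA57 in env->features[FEAT_7_0_ECX], as
this patch does):

    #include <cpuid.h>
    #include <stdbool.h>

    static bool cpu_has_la57(void)
    {
        unsigned int eax, ebx, ecx, edx;

        __cpuid(0, eax, ebx, ecx, edx);
        if (eax < 7)            /* leaf 7 not implemented */
            return false;
        /* CPUID.(EAX=07H, ECX=0):ECX[bit 16] */
        __cpuid_count(7, 0, eax, ebx, ecx, edx);
        return ecx & (1U << 16);
    }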
Signed-off-by: Kirill A. Shutemov <[email protected]>
Cc: [email protected]
---
target-i386/arch_memory_mapping.c | 42 ++++++++--
target-i386/cpu.c | 16 ++--
target-i386/cpu.h | 2 +
target-i386/helper.c | 54 ++++++++++--
target-i386/monitor.c | 167 ++++++++++++++++++++++++++++++++------
target-i386/translate.c | 2 +
6 files changed, 238 insertions(+), 45 deletions(-)
diff --git a/target-i386/arch_memory_mapping.c b/target-i386/arch_memory_mapping.c
index 88f341e1bbd0..826aee597b13 100644
--- a/target-i386/arch_memory_mapping.c
+++ b/target-i386/arch_memory_mapping.c
@@ -220,7 +220,8 @@ static void walk_pdpe(MemoryMappingList *list, AddressSpace *as,
/* IA-32e Paging */
static void walk_pml4e(MemoryMappingList *list, AddressSpace *as,
- hwaddr pml4e_start_addr, int32_t a20_mask)
+ hwaddr pml4e_start_addr, int32_t a20_mask,
+ target_ulong start_line_addr)
{
hwaddr pml4e_addr, pdpe_start_addr;
uint64_t pml4e;
@@ -236,11 +237,34 @@ static void walk_pml4e(MemoryMappingList *list, AddressSpace *as,
continue;
}
- line_addr = ((i & 0x1ffULL) << 39) | (0xffffULL << 48);
+ line_addr = start_line_addr | ((i & 0x1ffULL) << 39);
pdpe_start_addr = (pml4e & PLM4_ADDR_MASK) & a20_mask;
walk_pdpe(list, as, pdpe_start_addr, a20_mask, line_addr);
}
}
+
+static void walk_pml5e(MemoryMappingList *list, AddressSpace *as,
+ hwaddr pml5e_start_addr, int32_t a20_mask)
+{
+ hwaddr pml5e_addr, pml4e_start_addr;
+ uint64_t pml5e;
+ target_ulong line_addr;
+ int i;
+
+ for (i = 0; i < 512; i++) {
+ pml5e_addr = (pml5e_start_addr + i * 8) & a20_mask;
+ pml5e = address_space_ldq(as, pml5e_addr, MEMTXATTRS_UNSPECIFIED,
+ NULL);
+ if (!(pml5e & PG_PRESENT_MASK)) {
+ /* not present */
+ continue;
+ }
+
+ line_addr = (0x7fULL << 57) | ((i & 0x1ffULL) << 48);
+ pml4e_start_addr = (pml5e & PLM4_ADDR_MASK) & a20_mask;
+ walk_pml4e(list, as, pml4e_start_addr, a20_mask, line_addr);
+ }
+}
#endif
void x86_cpu_get_memory_mapping(CPUState *cs, MemoryMappingList *list,
@@ -257,10 +281,18 @@ void x86_cpu_get_memory_mapping(CPUState *cs, MemoryMappingList *list,
if (env->cr[4] & CR4_PAE_MASK) {
#ifdef TARGET_X86_64
if (env->hflags & HF_LMA_MASK) {
- hwaddr pml4e_addr;
+ if (env->cr[4] & CR4_LA57_MASK) {
+ hwaddr pml5e_addr;
+
+ pml5e_addr = (env->cr[3] & PLM4_ADDR_MASK) & env->a20_mask;
+ walk_pml5e(list, cs->as, pml5e_addr, env->a20_mask);
+ } else {
+ hwaddr pml4e_addr;
- pml4e_addr = (env->cr[3] & PLM4_ADDR_MASK) & env->a20_mask;
- walk_pml4e(list, cs->as, pml4e_addr, env->a20_mask);
+ pml4e_addr = (env->cr[3] & PLM4_ADDR_MASK) & env->a20_mask;
+ walk_pml4e(list, cs->as, pml4e_addr, env->a20_mask,
+ 0xffffULL << 48);
+ }
} else
#endif
{
diff --git a/target-i386/cpu.c b/target-i386/cpu.c
index de1f30eeda63..a4b9832b5916 100644
--- a/target-i386/cpu.c
+++ b/target-i386/cpu.c
@@ -238,7 +238,8 @@ static void x86_cpu_vendor_words2str(char *dst, uint32_t vendor1,
CPUID_7_0_EBX_HLE, CPUID_7_0_EBX_AVX2,
CPUID_7_0_EBX_INVPCID, CPUID_7_0_EBX_RTM,
CPUID_7_0_EBX_RDSEED */
-#define TCG_7_0_ECX_FEATURES (CPUID_7_0_ECX_PKU | CPUID_7_0_ECX_OSPKE)
+#define TCG_7_0_ECX_FEATURES (CPUID_7_0_ECX_PKU | CPUID_7_0_ECX_OSPKE | \
+ CPUID_7_0_ECX_LA57)
#define TCG_7_0_EDX_FEATURES 0
#define TCG_APM_FEATURES 0
#define TCG_6_EAX_FEATURES CPUID_6_EAX_ARAT
@@ -435,7 +436,7 @@ static FeatureWordInfo feature_word_info[FEATURE_WORDS] = {
"ospke", NULL, NULL, NULL,
NULL, NULL, NULL, NULL,
NULL, NULL, NULL, NULL,
- NULL, NULL, NULL, NULL,
+ "la57", NULL, NULL, NULL,
NULL, NULL, "rdpid", NULL,
NULL, NULL, NULL, NULL,
NULL, NULL, NULL, NULL,
@@ -2742,10 +2743,13 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
case 0x80000008:
/* virtual & phys address size in low 2 bytes. */
if (env->features[FEAT_8000_0001_EDX] & CPUID_EXT2_LM) {
- /* 64 bit processor, 48 bits virtual, configurable
- * physical bits.
- */
- *eax = 0x00003000 + cpu->phys_bits;
+ /* 64 bit processor */
+ *eax = cpu->phys_bits; /* configurable physical bits */
+ if (env->features[FEAT_7_0_ECX] & CPUID_7_0_ECX_LA57) {
+ *eax |= 0x00003900; /* 57 bits virtual */
+ } else {
+ *eax |= 0x00003000; /* 48 bits virtual */
+ }
} else {
*eax = cpu->phys_bits;
}
diff --git a/target-i386/cpu.h b/target-i386/cpu.h
index c60572402272..0ba880fc2632 100644
--- a/target-i386/cpu.h
+++ b/target-i386/cpu.h
@@ -224,6 +224,7 @@
#define CR4_OSFXSR_SHIFT 9
#define CR4_OSFXSR_MASK (1U << CR4_OSFXSR_SHIFT)
#define CR4_OSXMMEXCPT_MASK (1U << 10)
+#define CR4_LA57_MASK (1U << 12)
#define CR4_VMXE_MASK (1U << 13)
#define CR4_SMXE_MASK (1U << 14)
#define CR4_FSGSBASE_MASK (1U << 16)
@@ -628,6 +629,7 @@ typedef uint32_t FeatureWordArray[FEATURE_WORDS];
#define CPUID_7_0_ECX_UMIP (1U << 2)
#define CPUID_7_0_ECX_PKU (1U << 3)
#define CPUID_7_0_ECX_OSPKE (1U << 4)
+#define CPUID_7_0_ECX_LA57 (1U << 16)
#define CPUID_7_0_ECX_RDPID (1U << 22)
#define CPUID_7_0_EDX_AVX512_4VNNIW (1U << 2) /* AVX512 Neural Network Instructions */
diff --git a/target-i386/helper.c b/target-i386/helper.c
index 4ecc0912a48a..43e87ddba001 100644
--- a/target-i386/helper.c
+++ b/target-i386/helper.c
@@ -651,11 +651,11 @@ void cpu_x86_update_cr4(CPUX86State *env, uint32_t new_cr4)
uint32_t hflags;
#if defined(DEBUG_MMU)
- printf("CR4 update: CR4=%08x\n", (uint32_t)env->cr[4]);
+ printf("CR4 update: %08x -> %08x\n", (uint32_t)env->cr[4], new_cr4);
#endif
if ((new_cr4 ^ env->cr[4]) &
(CR4_PGE_MASK | CR4_PAE_MASK | CR4_PSE_MASK |
- CR4_SMEP_MASK | CR4_SMAP_MASK)) {
+ CR4_SMEP_MASK | CR4_SMAP_MASK | CR4_LA57_MASK)) {
tlb_flush(CPU(cpu), 1);
}
@@ -757,19 +757,41 @@ int x86_cpu_handle_mmu_fault(CPUState *cs, vaddr addr,
#ifdef TARGET_X86_64
if (env->hflags & HF_LMA_MASK) {
+ bool la57 = env->cr[4] & CR4_LA57_MASK;
+ uint64_t pml5e_addr, pml5e;
uint64_t pml4e_addr, pml4e;
int32_t sext;
/* test virtual address sign extension */
- sext = (int64_t)addr >> 47;
+ sext = la57 ? (int64_t)addr >> 56 : (int64_t)addr >> 47;
if (sext != 0 && sext != -1) {
env->error_code = 0;
cs->exception_index = EXCP0D_GPF;
return 1;
}
- pml4e_addr = ((env->cr[3] & ~0xfff) + (((addr >> 39) & 0x1ff) << 3)) &
- env->a20_mask;
+ if (la57) {
+ pml5e_addr = ((env->cr[3] & ~0xfff) +
+ (((addr >> 48) & 0x1ff) << 3)) & env->a20_mask;
+ pml5e = x86_ldq_phys(cs, pml5e_addr);
+ if (!(pml5e & PG_PRESENT_MASK)) {
+ goto do_fault;
+ }
+ if (pml5e & (rsvd_mask | PG_PSE_MASK)) {
+ goto do_fault_rsvd;
+ }
+ if (!(pml5e & PG_ACCESSED_MASK)) {
+ pml5e |= PG_ACCESSED_MASK;
+ x86_stl_phys_notdirty(cs, pml5e_addr, pml5e);
+ }
+ ptep = pml5e ^ PG_NX_MASK;
+ } else {
+ pml5e = env->cr[3];
+ ptep = PG_NX_MASK | PG_USER_MASK | PG_RW_MASK;
+ }
+
+ pml4e_addr = ((pml5e & PG_ADDRESS_MASK) +
+ (((addr >> 39) & 0x1ff) << 3)) & env->a20_mask;
pml4e = x86_ldq_phys(cs, pml4e_addr);
if (!(pml4e & PG_PRESENT_MASK)) {
goto do_fault;
@@ -781,7 +803,7 @@ int x86_cpu_handle_mmu_fault(CPUState *cs, vaddr addr,
pml4e |= PG_ACCESSED_MASK;
x86_stl_phys_notdirty(cs, pml4e_addr, pml4e);
}
- ptep = pml4e ^ PG_NX_MASK;
+ ptep &= pml4e ^ PG_NX_MASK;
pdpe_addr = ((pml4e & PG_ADDRESS_MASK) + (((addr >> 30) & 0x1ff) << 3)) &
env->a20_mask;
pdpe = x86_ldq_phys(cs, pdpe_addr);
@@ -1024,16 +1046,30 @@ hwaddr x86_cpu_get_phys_page_debug(CPUState *cs, vaddr addr)
#ifdef TARGET_X86_64
if (env->hflags & HF_LMA_MASK) {
+ bool la57 = env->cr[4] & CR4_LA57_MASK;
+ uint64_t pml5e_addr, pml5e;
uint64_t pml4e_addr, pml4e;
int32_t sext;
/* test virtual address sign extension */
- sext = (int64_t)addr >> 47;
+ sext = la57 ? (int64_t)addr >> 56 : (int64_t)addr >> 47;
if (sext != 0 && sext != -1) {
return -1;
}
- pml4e_addr = ((env->cr[3] & ~0xfff) + (((addr >> 39) & 0x1ff) << 3)) &
- env->a20_mask;
+
+ if (la57) {
+ pml5e_addr = ((env->cr[3] & ~0xfff) +
+ (((addr >> 48) & 0x1ff) << 3)) & env->a20_mask;
+ pml5e = x86_ldq_phys(cs, pml5e_addr);
+ if (!(pml5e & PG_PRESENT_MASK)) {
+ return -1;
+ }
+ } else {
+ pml5e = env->cr[3];
+ }
+
+ pml4e_addr = ((pml5e & PG_ADDRESS_MASK) +
+ (((addr >> 39) & 0x1ff) << 3)) & env->a20_mask;
pml4e = x86_ldq_phys(cs, pml4e_addr);
if (!(pml4e & PG_PRESENT_MASK)) {
return -1;
diff --git a/target-i386/monitor.c b/target-i386/monitor.c
index 9a3b4d746e8d..ae2d2f66b6fa 100644
--- a/target-i386/monitor.c
+++ b/target-i386/monitor.c
@@ -30,13 +30,18 @@
#include "hmp.h"
-static void print_pte(Monitor *mon, hwaddr addr,
- hwaddr pte,
- hwaddr mask)
+static void print_pte(Monitor *mon, CPUArchState *env, hwaddr addr,
+ hwaddr pte, hwaddr mask)
{
#ifdef TARGET_X86_64
- if (addr & (1ULL << 47)) {
- addr |= -1LL << 48;
+ if (env->cr[4] & CR4_LA57_MASK) {
+ if (addr & (1ULL << 56)) {
+ addr |= -1LL << 57;
+ }
+ } else {
+ if (addr & (1ULL << 47)) {
+ addr |= -1LL << 48;
+ }
}
#endif
monitor_printf(mon, TARGET_FMT_plx ": " TARGET_FMT_plx
@@ -66,13 +71,13 @@ static void tlb_info_32(Monitor *mon, CPUArchState *env)
if (pde & PG_PRESENT_MASK) {
if ((pde & PG_PSE_MASK) && (env->cr[4] & CR4_PSE_MASK)) {
/* 4M pages */
- print_pte(mon, (l1 << 22), pde, ~((1 << 21) - 1));
+ print_pte(mon, env, (l1 << 22), pde, ~((1 << 21) - 1));
} else {
for(l2 = 0; l2 < 1024; l2++) {
cpu_physical_memory_read((pde & ~0xfff) + l2 * 4, &pte, 4);
pte = le32_to_cpu(pte);
if (pte & PG_PRESENT_MASK) {
- print_pte(mon, (l1 << 22) + (l2 << 12),
+ print_pte(mon, env, (l1 << 22) + (l2 << 12),
pte & ~PG_PSE_MASK,
~0xfff);
}
@@ -100,7 +105,7 @@ static void tlb_info_pae32(Monitor *mon, CPUArchState *env)
if (pde & PG_PRESENT_MASK) {
if (pde & PG_PSE_MASK) {
/* 2M pages with PAE, CR4.PSE is ignored */
- print_pte(mon, (l1 << 30 ) + (l2 << 21), pde,
+ print_pte(mon, env, (l1 << 30 ) + (l2 << 21), pde,
~((hwaddr)(1 << 20) - 1));
} else {
pt_addr = pde & 0x3fffffffff000ULL;
@@ -108,7 +113,7 @@ static void tlb_info_pae32(Monitor *mon, CPUArchState *env)
cpu_physical_memory_read(pt_addr + l3 * 8, &pte, 8);
pte = le64_to_cpu(pte);
if (pte & PG_PRESENT_MASK) {
- print_pte(mon, (l1 << 30 ) + (l2 << 21)
+ print_pte(mon, env, (l1 << 30 ) + (l2 << 21)
+ (l3 << 12),
pte & ~PG_PSE_MASK,
~(hwaddr)0xfff);
@@ -122,13 +127,13 @@ static void tlb_info_pae32(Monitor *mon, CPUArchState *env)
}
#ifdef TARGET_X86_64
-static void tlb_info_64(Monitor *mon, CPUArchState *env)
+static void tlb_info_la48(Monitor *mon, CPUArchState *env,
+ uint64_t l0, uint64_t pml4_addr)
{
uint64_t l1, l2, l3, l4;
uint64_t pml4e, pdpe, pde, pte;
- uint64_t pml4_addr, pdp_addr, pd_addr, pt_addr;
+ uint64_t pdp_addr, pd_addr, pt_addr;
- pml4_addr = env->cr[3] & 0x3fffffffff000ULL;
for (l1 = 0; l1 < 512; l1++) {
cpu_physical_memory_read(pml4_addr + l1 * 8, &pml4e, 8);
pml4e = le64_to_cpu(pml4e);
@@ -140,8 +145,8 @@ static void tlb_info_64(Monitor *mon, CPUArchState *env)
if (pdpe & PG_PRESENT_MASK) {
if (pdpe & PG_PSE_MASK) {
/* 1G pages, CR4.PSE is ignored */
- print_pte(mon, (l1 << 39) + (l2 << 30), pdpe,
- 0x3ffffc0000000ULL);
+ print_pte(mon, env, (l0 << 48) + (l1 << 39) + (l2 << 30),
+ pdpe, 0x3ffffc0000000ULL);
} else {
pd_addr = pdpe & 0x3fffffffff000ULL;
for (l3 = 0; l3 < 512; l3++) {
@@ -150,9 +155,9 @@ static void tlb_info_64(Monitor *mon, CPUArchState *env)
if (pde & PG_PRESENT_MASK) {
if (pde & PG_PSE_MASK) {
/* 2M pages, CR4.PSE is ignored */
- print_pte(mon, (l1 << 39) + (l2 << 30) +
- (l3 << 21), pde,
- 0x3ffffffe00000ULL);
+ print_pte(mon, env, (l0 << 48) + (l1 << 39) +
+ (l2 << 30) + (l3 << 21), pde,
+ 0x3ffffffe00000ULL);
} else {
pt_addr = pde & 0x3fffffffff000ULL;
for (l4 = 0; l4 < 512; l4++) {
@@ -161,11 +166,11 @@ static void tlb_info_64(Monitor *mon, CPUArchState *env)
&pte, 8);
pte = le64_to_cpu(pte);
if (pte & PG_PRESENT_MASK) {
- print_pte(mon, (l1 << 39) +
- (l2 << 30) +
- (l3 << 21) + (l4 << 12),
- pte & ~PG_PSE_MASK,
- 0x3fffffffff000ULL);
+ print_pte(mon, env, (l0 << 48) +
+ (l1 << 39) + (l2 << 30) +
+ (l3 << 21) + (l4 << 12),
+ pte & ~PG_PSE_MASK,
+ 0x3fffffffff000ULL);
}
}
}
@@ -177,6 +182,22 @@ static void tlb_info_64(Monitor *mon, CPUArchState *env)
}
}
}
+
+static void tlb_info_la57(Monitor *mon, CPUArchState *env)
+{
+ uint64_t l0;
+ uint64_t pml5e;
+ uint64_t pml5_addr;
+
+ pml5_addr = env->cr[3] & 0x3fffffffff000ULL;
+ for (l0 = 0; l0 < 512; l0++) {
+ cpu_physical_memory_read(pml5_addr + l0 * 8, &pml5e, 8);
+ pml5e = le64_to_cpu(pml5e);
+ if (pml5e & PG_PRESENT_MASK) {
+ tlb_info_la48(mon, env, l0, pml5e & 0x3fffffffff000ULL);
+ }
+ }
+}
#endif /* TARGET_X86_64 */
void hmp_info_tlb(Monitor *mon, const QDict *qdict)
@@ -192,7 +213,11 @@ void hmp_info_tlb(Monitor *mon, const QDict *qdict)
if (env->cr[4] & CR4_PAE_MASK) {
#ifdef TARGET_X86_64
if (env->hflags & HF_LMA_MASK) {
- tlb_info_64(mon, env);
+ if (env->cr[4] & CR4_LA57_MASK) {
+ tlb_info_la57(mon, env);
+ } else {
+ tlb_info_la48(mon, env, 0, env->cr[3] & 0x3fffffffff000ULL);
+ }
} else
#endif
{
@@ -324,7 +349,7 @@ static void mem_info_pae32(Monitor *mon, CPUArchState *env)
#ifdef TARGET_X86_64
-static void mem_info_64(Monitor *mon, CPUArchState *env)
+static void mem_info_la48(Monitor *mon, CPUArchState *env)
{
int prot, last_prot;
uint64_t l1, l2, l3, l4;
@@ -400,6 +425,94 @@ static void mem_info_64(Monitor *mon, CPUArchState *env)
/* Flush last range */
mem_print(mon, &start, &last_prot, (hwaddr)1 << 48, 0);
}
+
+static void mem_info_la57(Monitor *mon, CPUArchState *env)
+{
+ int prot, last_prot;
+ uint64_t l0, l1, l2, l3, l4;
+ uint64_t pml5e, pml4e, pdpe, pde, pte;
+ uint64_t pml5_addr, pml4_addr, pdp_addr, pd_addr, pt_addr, start, end;
+
+ pml5_addr = env->cr[3] & 0x3fffffffff000ULL;
+ last_prot = 0;
+ start = -1;
+ for (l0 = 0; l0 < 512; l0++) {
+ cpu_physical_memory_read(pml5_addr + l0 * 8, &pml5e, 8);
+ pml5e = le64_to_cpu(pml5e);
+ end = l0 << 48;
+ if (pml5e & PG_PRESENT_MASK) {
+ pml4_addr = pml5e & 0x3fffffffff000ULL;
+ for (l1 = 0; l1 < 512; l1++) {
+ cpu_physical_memory_read(pml4_addr + l1 * 8, &pml4e, 8);
+ pml4e = le64_to_cpu(pml4e);
+ end = (l0 << 48) + (l1 << 39);
+ if (pml4e & PG_PRESENT_MASK) {
+ pdp_addr = pml4e & 0x3fffffffff000ULL;
+ for (l2 = 0; l2 < 512; l2++) {
+ cpu_physical_memory_read(pdp_addr + l2 * 8, &pdpe, 8);
+ pdpe = le64_to_cpu(pdpe);
+ end = (l0 << 48) + (l1 << 39) + (l2 << 30);
+ if (pdpe & PG_PRESENT_MASK) {
+ if (pdpe & PG_PSE_MASK) {
+ prot = pdpe & (PG_USER_MASK | PG_RW_MASK |
+ PG_PRESENT_MASK);
+ prot &= pml4e;
+ mem_print(mon, &start, &last_prot, end, prot);
+ } else {
+ pd_addr = pdpe & 0x3fffffffff000ULL;
+ for (l3 = 0; l3 < 512; l3++) {
+ cpu_physical_memory_read(pd_addr + l3 * 8, &pde, 8);
+ pde = le64_to_cpu(pde);
+ end = (l0 << 48) + (l1 << 39) + (l2 << 30) + (l3 << 21);
+ if (pde & PG_PRESENT_MASK) {
+ if (pde & PG_PSE_MASK) {
+ prot = pde & (PG_USER_MASK | PG_RW_MASK |
+ PG_PRESENT_MASK);
+ prot &= pml4e & pdpe;
+ mem_print(mon, &start, &last_prot, end, prot);
+ } else {
+ pt_addr = pde & 0x3fffffffff000ULL;
+ for (l4 = 0; l4 < 512; l4++) {
+ cpu_physical_memory_read(pt_addr
+ + l4 * 8,
+ &pte, 8);
+ pte = le64_to_cpu(pte);
+ end = (l0 << 48) + (l1 << 39) + (l2 << 30) +
+ (l3 << 21) + (l4 << 12);
+ if (pte & PG_PRESENT_MASK) {
+ prot = pte & (PG_USER_MASK | PG_RW_MASK |
+ PG_PRESENT_MASK);
+ prot &= pml4e & pdpe & pde;
+ } else {
+ prot = 0;
+ }
+ mem_print(mon, &start, &last_prot, end, prot);
+ }
+ }
+ } else {
+ prot = 0;
+ mem_print(mon, &start, &last_prot, end, prot);
+ }
+ }
+ }
+ } else {
+ prot = 0;
+ mem_print(mon, &start, &last_prot, end, prot);
+ }
+ }
+ } else {
+ prot = 0;
+ mem_print(mon, &start, &last_prot, end, prot);
+ }
+ }
+ } else {
+ prot = 0;
+ mem_print(mon, &start, &last_prot, end, prot);
+ }
+ }
+ /* Flush last range */
+ mem_print(mon, &start, &last_prot, (hwaddr)1 << 57, 0);
+}
#endif /* TARGET_X86_64 */
void hmp_info_mem(Monitor *mon, const QDict *qdict)
@@ -415,7 +528,11 @@ void hmp_info_mem(Monitor *mon, const QDict *qdict)
if (env->cr[4] & CR4_PAE_MASK) {
#ifdef TARGET_X86_64
if (env->hflags & HF_LMA_MASK) {
- mem_info_64(mon, env);
+ if (env->cr[4] & CR4_LA57_MASK) {
+ mem_info_la57(mon, env);
+ } else {
+ mem_info_la48(mon, env);
+ }
} else
#endif
{
diff --git a/target-i386/translate.c b/target-i386/translate.c
index 324103c88521..d2aec5c9bf06 100644
--- a/target-i386/translate.c
+++ b/target-i386/translate.c
@@ -137,6 +137,7 @@ typedef struct DisasContext {
int cpuid_ext2_features;
int cpuid_ext3_features;
int cpuid_7_0_ebx_features;
+ int cpuid_7_0_ecx_features;
int cpuid_xsave_features;
} DisasContext;
@@ -8350,6 +8351,7 @@ void gen_intermediate_code(CPUX86State *env, TranslationBlock *tb)
dc->cpuid_ext2_features = env->features[FEAT_8000_0001_EDX];
dc->cpuid_ext3_features = env->features[FEAT_8000_0001_ECX];
dc->cpuid_7_0_ebx_features = env->features[FEAT_7_0_EBX];
+ dc->cpuid_7_0_ecx_features = env->features[FEAT_7_0_ECX];
dc->cpuid_xsave_features = env->features[FEAT_XSAVE];
#ifdef TARGET_X86_64
dc->lma = (flags >> HF_LMA_SHIFT) & 1;
--
2.10.2
With 4-level paging, copying happens at the p4d level, as pgd_none()
is always false when p4d_t is folded.
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/mm/fault.c | 18 ++++++++++++++++--
1 file changed, 16 insertions(+), 2 deletions(-)
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 1bbdbb0594a3..820e8c284796 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -434,6 +434,7 @@ void vmalloc_sync_all(void)
static noinline int vmalloc_fault(unsigned long address)
{
pgd_t *pgd, *pgd_ref;
+ p4d_t *p4d, *p4d_ref;
pud_t *pud, *pud_ref;
pmd_t *pmd, *pmd_ref;
pte_t *pte, *pte_ref;
@@ -461,13 +462,26 @@ static noinline int vmalloc_fault(unsigned long address)
BUG_ON(pgd_page_vaddr(*pgd) != pgd_page_vaddr(*pgd_ref));
}
+ /* With 4-level paging copying happens on p4d level. */
+ p4d = p4d_offset(pgd, address);
+ p4d_ref = p4d_offset(pgd_ref, address);
+ if (p4d_none(*p4d_ref))
+ return -1;
+
+ if (p4d_none(*p4d)) {
+ set_p4d(p4d, *p4d_ref);
+ arch_flush_lazy_mmu_mode();
+ } else {
+ BUG_ON(p4d_pfn(*p4d) != p4d_pfn(*p4d_ref));
+ }
+
/*
* Below here mismatches are bugs because these lower tables
* are shared:
*/
- pud = pud_offset(pgd, address);
- pud_ref = pud_offset(pgd_ref, address);
+ pud = pud_offset(p4d, address);
+ pud_ref = pud_offset(p4d_ref, address);
if (pud_none(*pud_ref))
return -1;
--
2.10.2
When compiled with CONFIG_X86_5LEVEL=y, the kernel requires 5-level
paging support from the hardware. We may implement a runtime switch
later.
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/boot/cpucheck.c | 9 +++++++++
arch/x86/boot/cpuflags.c | 16 ++++++++++++++++
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/disabled-features.h | 8 +++++++-
arch/x86/include/asm/required-features.h | 8 +++++++-
5 files changed, 40 insertions(+), 2 deletions(-)
diff --git a/arch/x86/boot/cpucheck.c b/arch/x86/boot/cpucheck.c
index 4ad7d70e8739..8f0c4c9fc904 100644
--- a/arch/x86/boot/cpucheck.c
+++ b/arch/x86/boot/cpucheck.c
@@ -44,6 +44,15 @@ static const u32 req_flags[NCAPINTS] =
0, /* REQUIRED_MASK5 not implemented in this file */
REQUIRED_MASK6,
0, /* REQUIRED_MASK7 not implemented in this file */
+ 0, /* REQUIRED_MASK8 not implemented in this file */
+ 0, /* REQUIRED_MASK9 not implemented in this file */
+ 0, /* REQUIRED_MASK10 not implemented in this file */
+ 0, /* REQUIRED_MASK11 not implemented in this file */
+ 0, /* REQUIRED_MASK12 not implemented in this file */
+ 0, /* REQUIRED_MASK13 not implemented in this file */
+ 0, /* REQUIRED_MASK14 not implemented in this file */
+ 0, /* REQUIRED_MASK15 not implemented in this file */
+ REQUIRED_MASK16,
};
#define A32(a, b, c, d) (((d) << 24)+((c) << 16)+((b) << 8)+(a))
diff --git a/arch/x86/boot/cpuflags.c b/arch/x86/boot/cpuflags.c
index 6687ab953257..26e9a287805f 100644
--- a/arch/x86/boot/cpuflags.c
+++ b/arch/x86/boot/cpuflags.c
@@ -80,6 +80,17 @@ static inline void cpuid(u32 id, u32 *a, u32 *b, u32 *c, u32 *d)
);
}
+static inline void cpuid_count(u32 id, u32 count,
+ u32 *a, u32 *b, u32 *c, u32 *d)
+{
+ asm volatile(".ifnc %%ebx,%3 ; movl %%ebx,%3 ; .endif \n\t"
+ "cpuid \n\t"
+ ".ifnc %%ebx,%3 ; xchgl %%ebx,%3 ; .endif \n\t"
+ : "=a" (*a), "=c" (*c), "=d" (*d), EBX_REG (*b)
+ : "a" (id), "c" (count)
+ );
+}
+
void get_cpuflags(void)
{
u32 max_intel_level, max_amd_level;
@@ -108,6 +119,11 @@ void get_cpuflags(void)
cpu.model += ((tfms >> 16) & 0xf) << 4;
}
+ if (max_intel_level >= 0x00000007) {
+ cpuid_count(0x00000007, 0, &ignored, &ignored,
+ &cpu.flags[16], &ignored);
+ }
+
cpuid(0x80000000, &max_amd_level, &ignored, &ignored,
&ignored);
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 92a8308b96f6..388f1277880f 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -280,6 +280,7 @@
/* Intel-defined CPU features, CPUID level 0x00000007:0 (ecx), word 16 */
#define X86_FEATURE_PKU (16*32+ 3) /* Protection Keys for Userspace */
#define X86_FEATURE_OSPKE (16*32+ 4) /* OS Protection Keys Enable */
+#define X86_FEATURE_LA57 (16*32+16) /* 5-level page tables */
/* AMD-defined CPU features, CPUID level 0x80000007 (ebx), word 17 */
#define X86_FEATURE_OVERFLOW_RECOV (17*32+0) /* MCA overflow recovery support */
diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
index 85599ad4d024..fc0960236fc3 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -36,6 +36,12 @@
# define DISABLE_OSPKE (1<<(X86_FEATURE_OSPKE & 31))
#endif /* CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS */
+#ifdef CONFIG_X86_5LEVEL
+#define DISABLE_LA57 0
+#else
+#define DISABLE_LA57 (1<<(X86_FEATURE_LA57 & 31))
+#endif
+
/*
* Make sure to add features to the correct mask
*/
@@ -55,7 +61,7 @@
#define DISABLED_MASK13 0
#define DISABLED_MASK14 0
#define DISABLED_MASK15 0
-#define DISABLED_MASK16 (DISABLE_PKU|DISABLE_OSPKE)
+#define DISABLED_MASK16 (DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57)
#define DISABLED_MASK17 0
#define DISABLED_MASK_CHECK BUILD_BUG_ON_ZERO(NCAPINTS != 18)
diff --git a/arch/x86/include/asm/required-features.h b/arch/x86/include/asm/required-features.h
index fac9a5c0abe9..d91ba04dd007 100644
--- a/arch/x86/include/asm/required-features.h
+++ b/arch/x86/include/asm/required-features.h
@@ -53,6 +53,12 @@
# define NEED_MOVBE 0
#endif
+#ifdef CONFIG_X86_5LEVEL
+# define NEED_LA57 (1<<(X86_FEATURE_LA57 & 31))
+#else
+# define NEED_LA57 0
+#endif
+
#ifdef CONFIG_X86_64
#ifdef CONFIG_PARAVIRT
/* Paravirtualized systems may not have PSE or PGE available */
@@ -98,7 +104,7 @@
#define REQUIRED_MASK13 0
#define REQUIRED_MASK14 0
#define REQUIRED_MASK15 0
-#define REQUIRED_MASK16 0
+#define REQUIRED_MASK16 (NEED_LA57)
#define REQUIRED_MASK17 0
#define REQUIRED_MASK_CHECK BUILD_BUG_ON_ZERO(NCAPINTS != 18)
--
2.10.2
Convert all non-architecture-specific code to 5-level paging.
It's mostly mechanical: add handling of one more page table level in
the places where we deal with pud_t (see the schematic loop sketch
below).
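The recurring pattern is one extra nested loop between the pgd and pud
levels; schematically (a sketch modeled on the ioremap/pagewalk hunks
below; walk_pud_range stands in for whatever the caller already did at
the pud level):

    static int walk_p4d_range(pgd_t *pgd, unsigned long addr,
                              unsigned long end)
    {
        p4d_t *p4d = p4d_offset(pgd, addr);
        unsigned long next;

        do {
            next = p4d_addr_end(addr, end);
            if (p4d_none_or_clear_bad(p4d))
                continue;
            /* descend one level, exactly as the pgd loop did before */
            walk_pud_range(p4d, addr, next);
        } while (p4d++, addr = next, addr != end);

        return 0;
    }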
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
drivers/misc/sgi-gru/grufault.c | 9 +-
fs/userfaultfd.c | 6 +-
include/asm-generic/pgtable.h | 48 +++++++++-
include/linux/hugetlb.h | 5 +-
include/linux/kasan.h | 1 +
include/linux/mm.h | 29 ++++--
lib/ioremap.c | 39 +++++++-
mm/gup.c | 46 +++++++--
mm/huge_memory.c | 7 +-
mm/hugetlb.c | 29 +++---
mm/kasan/kasan_init.c | 35 ++++++-
mm/memory.c | 207 +++++++++++++++++++++++++++++++++-------
mm/mlock.c | 1 +
mm/mprotect.c | 26 ++++-
mm/mremap.c | 13 ++-
mm/pagewalk.c | 32 ++++++-
mm/pgtable-generic.c | 6 ++
mm/rmap.c | 13 ++-
mm/sparse-vmemmap.c | 22 ++++-
mm/swapfile.c | 26 ++++-
mm/userfaultfd.c | 23 +++--
mm/vmalloc.c | 81 ++++++++++++----
22 files changed, 584 insertions(+), 120 deletions(-)
diff --git a/drivers/misc/sgi-gru/grufault.c b/drivers/misc/sgi-gru/grufault.c
index a2d97b9b17e3..7c3ccb09d633 100644
--- a/drivers/misc/sgi-gru/grufault.c
+++ b/drivers/misc/sgi-gru/grufault.c
@@ -219,15 +219,20 @@ static int atomic_pte_lookup(struct vm_area_struct *vma, unsigned long vaddr,
int write, unsigned long *paddr, int *pageshift)
{
pgd_t *pgdp;
- pmd_t *pmdp;
+ p4d_t *p4dp;
pud_t *pudp;
+ pmd_t *pmdp;
pte_t pte;
pgdp = pgd_offset(vma->vm_mm, vaddr);
if (unlikely(pgd_none(*pgdp)))
goto err;
- pudp = pud_offset(pgdp, vaddr);
+ p4dp = p4d_offset(pgdp, vaddr);
+ if (unlikely(p4d_none(*p4dp)))
+ goto err;
+
+ pudp = pud_offset(p4dp, vaddr);
if (unlikely(pud_none(*pudp)))
goto err;
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 85959d8324df..91c55d2d9188 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -195,6 +195,7 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
{
struct mm_struct *mm = ctx->mm;
pgd_t *pgd;
+ p4d_t *p4d;
pud_t *pud;
pmd_t *pmd, _pmd;
pte_t *pte;
@@ -205,7 +206,10 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
pgd = pgd_offset(mm, address);
if (!pgd_present(*pgd))
goto out;
- pud = pud_offset(pgd, address);
+ p4d = p4d_offset(pgd, address);
+ if (!p4d_present(*p4d))
+ goto out;
+ pud = pud_offset(p4d, address);
if (!pud_present(*pud))
goto out;
pmd = pmd_offset(pud, address);
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index d4458b6dbfb4..9d67d03a16eb 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -10,9 +10,9 @@
#include <linux/bug.h>
#include <linux/errno.h>
-#if 4 - defined(__PAGETABLE_PUD_FOLDED) - defined(__PAGETABLE_PMD_FOLDED) != \
- CONFIG_PGTABLE_LEVELS
-#error CONFIG_PGTABLE_LEVELS is not consistent with __PAGETABLE_{PUD,PMD}_FOLDED
+#if 5 - defined(__PAGETABLE_P4D_FOLDED) - defined(__PAGETABLE_PUD_FOLDED) - \
+ defined(__PAGETABLE_PMD_FOLDED) != CONFIG_PGTABLE_LEVELS
+#error CONFIG_PGTABLE_LEVELS is not consistent with __PAGETABLE_{P4D,PUD,PMD}_FOLDED
#endif
/*
@@ -339,6 +339,13 @@ static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
(__boundary - 1 < (end) - 1)? __boundary: (end); \
})
+#ifndef p4d_addr_end
+#define p4d_addr_end(addr, end) \
+({ unsigned long __boundary = ((addr) + P4D_SIZE) & P4D_MASK; \
+ (__boundary - 1 < (end) - 1)? __boundary: (end); \
+})
+#endif
+
#ifndef pud_addr_end
#define pud_addr_end(addr, end) \
({ unsigned long __boundary = ((addr) + PUD_SIZE) & PUD_MASK; \
@@ -359,6 +366,7 @@ static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
* Do the tests inline, but report and clear the bad entry in mm/memory.c.
*/
void pgd_clear_bad(pgd_t *);
+void p4d_clear_bad(p4d_t *);
void pud_clear_bad(pud_t *);
void pmd_clear_bad(pmd_t *);
@@ -373,6 +381,17 @@ static inline int pgd_none_or_clear_bad(pgd_t *pgd)
return 0;
}
+static inline int p4d_none_or_clear_bad(p4d_t *p4d)
+{
+ if (p4d_none(*p4d))
+ return 1;
+ if (unlikely(p4d_bad(*p4d))) {
+ p4d_clear_bad(p4d);
+ return 1;
+ }
+ return 0;
+}
+
static inline int pud_none_or_clear_bad(pud_t *pud)
{
if (pud_none(*pud))
@@ -760,11 +779,30 @@ static inline int pmd_protnone(pmd_t pmd)
#endif /* CONFIG_MMU */
#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
+
+#ifndef __PAGETABLE_P4D_FOLDED
+int p4d_set_huge(p4d_t *p4d, phys_addr_t addr, pgprot_t prot);
+int p4d_clear_huge(p4d_t *p4d);
+#else
+static inline int p4d_set_huge(p4d_t *p4d, phys_addr_t addr, pgprot_t prot)
+{
+ return 0;
+}
+static inline int p4d_clear_huge(p4d_t *p4d)
+{
+ return 0;
+}
+#endif /* !__PAGETABLE_P4D_FOLDED */
+
int pud_set_huge(pud_t *pud, phys_addr_t addr, pgprot_t prot);
int pmd_set_huge(pmd_t *pmd, phys_addr_t addr, pgprot_t prot);
int pud_clear_huge(pud_t *pud);
int pmd_clear_huge(pmd_t *pmd);
#else /* !CONFIG_HAVE_ARCH_HUGE_VMAP */
+static inline int p4d_set_huge(p4d_t *p4d, phys_addr_t addr, pgprot_t prot)
+{
+ return 0;
+}
static inline int pud_set_huge(pud_t *pud, phys_addr_t addr, pgprot_t prot)
{
return 0;
@@ -773,6 +811,10 @@ static inline int pmd_set_huge(pmd_t *pmd, phys_addr_t addr, pgprot_t prot)
{
return 0;
}
+static inline int p4d_clear_huge(p4d_t *p4d)
+{
+ return 0;
+}
static inline int pud_clear_huge(pud_t *pud)
{
return 0;
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index c26d4638f665..f3e1798ad9c0 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -116,7 +116,7 @@ struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
struct page *follow_huge_pud(struct mm_struct *mm, unsigned long address,
pud_t *pud, int flags);
int pmd_huge(pmd_t pmd);
-int pud_huge(pud_t pmd);
+int pud_huge(pud_t pud);
unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
unsigned long address, unsigned long end, pgprot_t newprot);
@@ -189,6 +189,9 @@ static inline void __unmap_hugepage_range(struct mmu_gather *tlb,
#ifndef pgd_huge
#define pgd_huge(x) 0
#endif
+#ifndef p4d_huge
+#define p4d_huge(x) 0
+#endif
#ifndef pgd_write
static inline int pgd_write(pgd_t pgd)
diff --git a/include/linux/kasan.h b/include/linux/kasan.h
index d600303306eb..df20e2c108cf 100644
--- a/include/linux/kasan.h
+++ b/include/linux/kasan.h
@@ -19,6 +19,7 @@ extern unsigned char kasan_zero_page[PAGE_SIZE];
extern pte_t kasan_zero_pte[PTRS_PER_PTE];
extern pmd_t kasan_zero_pmd[PTRS_PER_PMD];
extern pud_t kasan_zero_pud[PTRS_PER_PUD];
+extern p4d_t kasan_zero_p4d[PTRS_PER_P4D];
void kasan_populate_zero_shadow(const void *shadow_start,
const void *shadow_end);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9a4dfd7c3515..b9c2fc7214e0 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1541,14 +1541,24 @@ static inline pte_t *get_locked_pte(struct mm_struct *mm, unsigned long addr,
return ptep;
}
+#ifdef __PAGETABLE_P4D_FOLDED
+static inline int __p4d_alloc(struct mm_struct *mm, pgd_t *pgd,
+ unsigned long address)
+{
+ return 0;
+}
+#else
+int __p4d_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address);
+#endif
+
#ifdef __PAGETABLE_PUD_FOLDED
-static inline int __pud_alloc(struct mm_struct *mm, pgd_t *pgd,
+static inline int __pud_alloc(struct mm_struct *mm, p4d_t *p4d,
unsigned long address)
{
return 0;
}
#else
-int __pud_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address);
+int __pud_alloc(struct mm_struct *mm, p4d_t *p4d, unsigned long address);
#endif
#if defined(__PAGETABLE_PMD_FOLDED) || !defined(CONFIG_MMU)
@@ -1602,10 +1612,16 @@ int __pte_alloc_kernel(pmd_t *pmd, unsigned long address);
#if defined(CONFIG_MMU) && !defined(__ARCH_HAS_4LEVEL_HACK)
#ifndef __ARCH_HAS_5LEVEL_HACK
-static inline pud_t *pud_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address)
+static inline p4d_t *p4d_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address)
+{
+ return (unlikely(pgd_none(*pgd)) && __p4d_alloc(mm, pgd, address))?
+ NULL: p4d_offset(pgd, address);
+}
+
+static inline pud_t *pud_alloc(struct mm_struct *mm, p4d_t *p4d, unsigned long address)
{
- return (unlikely(pgd_none(*pgd)) && __pud_alloc(mm, pgd, address))?
- NULL: pud_offset(pgd, address);
+ return (unlikely(p4d_none(*p4d)) && __pud_alloc(mm, p4d, address))?
+ NULL: pud_offset(p4d, address);
}
#endif /* !__ARCH_HAS_5LEVEL_HACK */
@@ -2331,7 +2347,8 @@ void sparse_mem_maps_populate_node(struct page **map_map,
struct page *sparse_mem_map_populate(unsigned long pnum, int nid);
pgd_t *vmemmap_pgd_populate(unsigned long addr, int node);
-pud_t *vmemmap_pud_populate(pgd_t *pgd, unsigned long addr, int node);
+p4d_t *vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node);
+pud_t *vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node);
pmd_t *vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node);
pte_t *vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node);
void *vmemmap_alloc_block(unsigned long size, int node);
diff --git a/lib/ioremap.c b/lib/ioremap.c
index 86c8911b0e3a..5629eeaba5ae 100644
--- a/lib/ioremap.c
+++ b/lib/ioremap.c
@@ -14,6 +14,7 @@
#include <asm/pgtable.h>
#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
+static int __read_mostly ioremap_p4d_capable;
static int __read_mostly ioremap_pud_capable;
static int __read_mostly ioremap_pmd_capable;
static int __read_mostly ioremap_huge_disabled;
@@ -35,6 +36,11 @@ void __init ioremap_huge_init(void)
}
}
+static inline int ioremap_p4d_enabled(void)
+{
+ return ioremap_p4d_capable;
+}
+
static inline int ioremap_pud_enabled(void)
{
return ioremap_pud_capable;
@@ -46,6 +52,7 @@ static inline int ioremap_pmd_enabled(void)
}
#else /* !CONFIG_HAVE_ARCH_HUGE_VMAP */
+static inline int ioremap_p4d_enabled(void) { return 0; }
static inline int ioremap_pud_enabled(void) { return 0; }
static inline int ioremap_pmd_enabled(void) { return 0; }
#endif /* CONFIG_HAVE_ARCH_HUGE_VMAP */
@@ -94,14 +101,14 @@ static inline int ioremap_pmd_range(pud_t *pud, unsigned long addr,
return 0;
}
-static inline int ioremap_pud_range(pgd_t *pgd, unsigned long addr,
+static inline int ioremap_pud_range(p4d_t *p4d, unsigned long addr,
unsigned long end, phys_addr_t phys_addr, pgprot_t prot)
{
pud_t *pud;
unsigned long next;
phys_addr -= addr;
- pud = pud_alloc(&init_mm, pgd, addr);
+ pud = pud_alloc(&init_mm, p4d, addr);
if (!pud)
return -ENOMEM;
do {
@@ -120,6 +127,32 @@ static inline int ioremap_pud_range(pgd_t *pgd, unsigned long addr,
return 0;
}
+static inline int ioremap_p4d_range(pgd_t *pgd, unsigned long addr,
+ unsigned long end, phys_addr_t phys_addr, pgprot_t prot)
+{
+ p4d_t *p4d;
+ unsigned long next;
+
+ phys_addr -= addr;
+ p4d = p4d_alloc(&init_mm, pgd, addr);
+ if (!p4d)
+ return -ENOMEM;
+ do {
+ next = p4d_addr_end(addr, end);
+
+ if (ioremap_p4d_enabled() &&
+ ((next - addr) == P4D_SIZE) &&
+ IS_ALIGNED(phys_addr + addr, P4D_SIZE)) {
+ if (p4d_set_huge(p4d, phys_addr + addr, prot))
+ continue;
+ }
+
+ if (ioremap_pud_range(p4d, addr, next, phys_addr + addr, prot))
+ return -ENOMEM;
+ } while (p4d++, addr = next, addr != end);
+ return 0;
+}
+
int ioremap_page_range(unsigned long addr,
unsigned long end, phys_addr_t phys_addr, pgprot_t prot)
{
@@ -135,7 +168,7 @@ int ioremap_page_range(unsigned long addr,
pgd = pgd_offset_k(addr);
do {
next = pgd_addr_end(addr, end);
- err = ioremap_pud_range(pgd, addr, next, phys_addr+addr, prot);
+ err = ioremap_p4d_range(pgd, addr, next, phys_addr+addr, prot);
if (err)
break;
} while (pgd++, addr = next, addr != end);
diff --git a/mm/gup.c b/mm/gup.c
index 96b2b2fd0fbd..0f1e42f1bac6 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -216,6 +216,7 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
unsigned int *page_mask)
{
pgd_t *pgd;
+ p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
spinlock_t *ptl;
@@ -233,8 +234,13 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
pgd = pgd_offset(mm, address);
if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
return no_page_table(vma, flags);
-
- pud = pud_offset(pgd, address);
+ p4d = p4d_offset(pgd, address);
+ if (p4d_none(*p4d))
+ return no_page_table(vma, flags);
+ BUILD_BUG_ON(p4d_huge(*p4d));
+ if (unlikely(p4d_bad(*p4d)))
+ return no_page_table(vma, flags);
+ pud = pud_offset(p4d, address);
if (pud_none(*pud))
return no_page_table(vma, flags);
if (pud_huge(*pud) && vma->vm_flags & VM_HUGETLB) {
@@ -307,6 +313,7 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address,
struct page **page)
{
pgd_t *pgd;
+ p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
pte_t *pte;
@@ -320,7 +327,9 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address,
else
pgd = pgd_offset_gate(mm, address);
BUG_ON(pgd_none(*pgd));
- pud = pud_offset(pgd, address);
+ p4d = p4d_offset(pgd, address);
+ BUG_ON(p4d_none(*p4d));
+ pud = pud_offset(p4d, address);
BUG_ON(pud_none(*pud));
pmd = pmd_offset(pud, address);
if (pmd_none(*pmd))
@@ -1389,13 +1398,13 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
return 1;
}
-static int gup_pud_range(pgd_t pgd, unsigned long addr, unsigned long end,
+static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
int write, struct page **pages, int *nr)
{
unsigned long next;
pud_t *pudp;
- pudp = pud_offset(&pgd, addr);
+ pudp = pud_offset(&p4d, addr);
do {
pud_t pud = READ_ONCE(*pudp);
@@ -1417,6 +1426,31 @@ static int gup_pud_range(pgd_t pgd, unsigned long addr, unsigned long end,
return 1;
}
+static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end,
+ int write, struct page **pages, int *nr)
+{
+ unsigned long next;
+ p4d_t *p4dp;
+
+ p4dp = p4d_offset(&pgd, addr);
+ do {
+ p4d_t p4d = READ_ONCE(*p4dp);
+
+ next = p4d_addr_end(addr, end);
+ if (p4d_none(p4d))
+ return 0;
+ BUILD_BUG_ON(p4d_huge(p4d));
+ if (unlikely(is_hugepd(__hugepd(p4d_val(p4d))))) {
+ if (!gup_huge_pd(__hugepd(p4d_val(p4d)), addr,
+ P4D_SHIFT, next, write, pages, nr))
+ return 0;
+ } else if (!gup_p4d_range(p4d, addr, next, write, pages, nr))
+ return 0;
+ } while (p4dp++, addr = next, addr != end);
+
+ return 1;
+}
+
/*
* Like get_user_pages_fast() except it's IRQ-safe in that it won't fall back to
* the regular GUP. It will only return non-negative values.
@@ -1467,7 +1501,7 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
if (!gup_huge_pd(__hugepd(pgd_val(pgd)), addr,
PGDIR_SHIFT, next, write, pages, &nr))
break;
- } else if (!gup_pud_range(pgd, addr, next, write, pages, &nr))
+ } else if (!gup_p4d_range(pgd, addr, next, write, pages, &nr))
break;
} while (pgdp++, addr = next, addr != end);
local_irq_restore(flags);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 53ae6d00656a..8f9f582ad870 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1667,6 +1667,7 @@ void split_huge_pmd_address(struct vm_area_struct *vma, unsigned long address,
bool freeze, struct page *page)
{
pgd_t *pgd;
+ p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
@@ -1674,7 +1675,11 @@ void split_huge_pmd_address(struct vm_area_struct *vma, unsigned long address,
if (!pgd_present(*pgd))
return;
- pud = pud_offset(pgd, address);
+ p4d = p4d_offset(pgd, address);
+ if (!p4d_present(*p4d))
+ return;
+
+ pud = pud_offset(p4d, address);
if (!pud_present(*pud))
return;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 87e11d8ad536..05177e18dd01 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4272,7 +4272,8 @@ out:
int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep)
{
pgd_t *pgd = pgd_offset(mm, *addr);
- pud_t *pud = pud_offset(pgd, *addr);
+ p4d_t *p4d = p4d_offset(pgd, *addr);
+ pud_t *pud = pud_offset(p4d, *addr);
BUG_ON(page_count(virt_to_page(ptep)) == 0);
if (page_count(virt_to_page(ptep)) == 1)
@@ -4303,11 +4304,13 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
unsigned long addr, unsigned long sz)
{
pgd_t *pgd;
+ p4d_t *p4d;
pud_t *pud;
pte_t *pte = NULL;
pgd = pgd_offset(mm, addr);
- pud = pud_alloc(mm, pgd, addr);
+ p4d = p4d_offset(pgd, addr);
+ pud = pud_alloc(mm, p4d, addr);
if (pud) {
if (sz == PUD_SIZE) {
pte = (pte_t *)pud;
@@ -4327,18 +4330,22 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
{
pgd_t *pgd;
+ p4d_t *p4d;
pud_t *pud;
- pmd_t *pmd = NULL;
+ pmd_t *pmd;
pgd = pgd_offset(mm, addr);
- if (pgd_present(*pgd)) {
- pud = pud_offset(pgd, addr);
- if (pud_present(*pud)) {
- if (pud_huge(*pud))
- return (pte_t *)pud;
- pmd = pmd_offset(pud, addr);
- }
- }
+ if (!pgd_present(*pgd))
+ return NULL;
+ p4d = p4d_offset(pgd, addr);
+ if (!p4d_present(*p4d))
+ return NULL;
+ pud = pud_offset(p4d, addr);
+ if (!pud_present(*pud))
+ return NULL;
+ if (pud_huge(*pud))
+ return (pte_t *)pud;
+ pmd = pmd_offset(pud, addr);
return (pte_t *) pmd;
}
diff --git a/mm/kasan/kasan_init.c b/mm/kasan/kasan_init.c
index 3f9a41cf0ac6..0d4ee78796fc 100644
--- a/mm/kasan/kasan_init.c
+++ b/mm/kasan/kasan_init.c
@@ -29,6 +29,9 @@
*/
unsigned char kasan_zero_page[PAGE_SIZE] __page_aligned_bss;
+#if CONFIG_PGTABLE_LEVELS > 4
+p4d_t kasan_zero_p4d[PTRS_PER_P4D] __page_aligned_bss;
+#endif
#if CONFIG_PGTABLE_LEVELS > 3
pud_t kasan_zero_pud[PTRS_PER_PUD] __page_aligned_bss;
#endif
@@ -81,10 +84,10 @@ static void __init zero_pmd_populate(pud_t *pud, unsigned long addr,
} while (pmd++, addr = next, addr != end);
}
-static void __init zero_pud_populate(pgd_t *pgd, unsigned long addr,
+static void __init zero_pud_populate(p4d_t *p4d, unsigned long addr,
unsigned long end)
{
- pud_t *pud = pud_offset(pgd, addr);
+ pud_t *pud = pud_offset(p4d, addr);
unsigned long next;
do {
@@ -106,6 +109,23 @@ static void __init zero_pud_populate(pgd_t *pgd, unsigned long addr,
} while (pud++, addr = next, addr != end);
}
+static void __init zero_p4d_populate(pgd_t *pgd, unsigned long addr,
+ unsigned long end)
+{
+ p4d_t *p4d = p4d_offset(pgd, addr);
+ unsigned long next;
+
+ do {
+ next = p4d_addr_end(addr, end);
+
+ if (p4d_none(*p4d)) {
+ p4d_populate(&init_mm, p4d,
+ early_alloc(PAGE_SIZE, NUMA_NO_NODE));
+ }
+ zero_pud_populate(p4d, addr, next);
+ } while (p4d++, addr = next, addr != end);
+}
+
/**
* kasan_populate_zero_shadow - populate shadow memory region with
* kasan_zero_page
@@ -124,6 +144,7 @@ void __init kasan_populate_zero_shadow(const void *shadow_start,
next = pgd_addr_end(addr, end);
if (IS_ALIGNED(addr, PGDIR_SIZE) && end - addr >= PGDIR_SIZE) {
+ p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
@@ -135,8 +156,12 @@ void __init kasan_populate_zero_shadow(const void *shadow_start,
* puds,pmds, so pgd_populate(), pud_populate()
* is noops.
*/
- pgd_populate(&init_mm, pgd, kasan_zero_pud);
- pud = pud_offset(pgd, addr);
+#ifndef __ARCH_HAS_5LEVEL_HACK
+ pgd_populate(&init_mm, pgd, kasan_zero_p4d);
+#endif
+ p4d = p4d_offset(pgd, addr);
+ p4d_populate(&init_mm, p4d, kasan_zero_pud);
+ pud = pud_offset(p4d, addr);
pud_populate(&init_mm, pud, kasan_zero_pmd);
pmd = pmd_offset(pud, addr);
pmd_populate_kernel(&init_mm, pmd, kasan_zero_pte);
@@ -147,6 +172,6 @@ void __init kasan_populate_zero_shadow(const void *shadow_start,
pgd_populate(&init_mm, pgd,
early_alloc(PAGE_SIZE, NUMA_NO_NODE));
}
- zero_pud_populate(pgd, addr, next);
+ zero_p4d_populate(pgd, addr, next);
} while (pgd++, addr = next, addr != end);
}
diff --git a/mm/memory.c b/mm/memory.c
index 793fe0f9841c..fd1a3319f413 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -444,7 +444,7 @@ static inline void free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
mm_dec_nr_pmds(tlb->mm);
}
-static inline void free_pud_range(struct mmu_gather *tlb, pgd_t *pgd,
+static inline void free_pud_range(struct mmu_gather *tlb, p4d_t *p4d,
unsigned long addr, unsigned long end,
unsigned long floor, unsigned long ceiling)
{
@@ -453,7 +453,7 @@ static inline void free_pud_range(struct mmu_gather *tlb, pgd_t *pgd,
unsigned long start;
start = addr;
- pud = pud_offset(pgd, addr);
+ pud = pud_offset(p4d, addr);
do {
next = pud_addr_end(addr, end);
if (pud_none_or_clear_bad(pud))
@@ -461,6 +461,39 @@ static inline void free_pud_range(struct mmu_gather *tlb, pgd_t *pgd,
free_pmd_range(tlb, pud, addr, next, floor, ceiling);
} while (pud++, addr = next, addr != end);
+ start &= P4D_MASK;
+ if (start < floor)
+ return;
+ if (ceiling) {
+ ceiling &= P4D_MASK;
+ if (!ceiling)
+ return;
+ }
+ if (end - 1 > ceiling - 1)
+ return;
+
+ pud = pud_offset(p4d, start);
+ p4d_clear(p4d);
+ pud_free_tlb(tlb, pud, start);
+}
+
+static inline void free_p4d_range(struct mmu_gather *tlb, pgd_t *pgd,
+ unsigned long addr, unsigned long end,
+ unsigned long floor, unsigned long ceiling)
+{
+ p4d_t *p4d;
+ unsigned long next;
+ unsigned long start;
+
+ start = addr;
+ p4d = p4d_offset(pgd, addr);
+ do {
+ next = p4d_addr_end(addr, end);
+ if (p4d_none_or_clear_bad(p4d))
+ continue;
+ free_pud_range(tlb, p4d, addr, next, floor, ceiling);
+ } while (p4d++, addr = next, addr != end);
+
start &= PGDIR_MASK;
if (start < floor)
return;
@@ -472,9 +505,9 @@ static inline void free_pud_range(struct mmu_gather *tlb, pgd_t *pgd,
if (end - 1 > ceiling - 1)
return;
- pud = pud_offset(pgd, start);
+ p4d = p4d_offset(pgd, start);
pgd_clear(pgd);
- pud_free_tlb(tlb, pud, start);
+ p4d_free_tlb(tlb, p4d, start);
}
/*
@@ -534,7 +567,7 @@ void free_pgd_range(struct mmu_gather *tlb,
next = pgd_addr_end(addr, end);
if (pgd_none_or_clear_bad(pgd))
continue;
- free_pud_range(tlb, pgd, addr, next, floor, ceiling);
+ free_p4d_range(tlb, pgd, addr, next, floor, ceiling);
} while (pgd++, addr = next, addr != end);
}
@@ -653,7 +686,8 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
pte_t pte, struct page *page)
{
pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
- pud_t *pud = pud_offset(pgd, addr);
+ p4d_t *p4d = p4d_offset(pgd, addr);
+ pud_t *pud = pud_offset(p4d, addr);
pmd_t *pmd = pmd_offset(pud, addr);
struct address_space *mapping;
pgoff_t index;
@@ -1018,16 +1052,16 @@ static inline int copy_pmd_range(struct mm_struct *dst_mm, struct mm_struct *src
}
static inline int copy_pud_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
- pgd_t *dst_pgd, pgd_t *src_pgd, struct vm_area_struct *vma,
+ p4d_t *dst_p4d, p4d_t *src_p4d, struct vm_area_struct *vma,
unsigned long addr, unsigned long end)
{
pud_t *src_pud, *dst_pud;
unsigned long next;
- dst_pud = pud_alloc(dst_mm, dst_pgd, addr);
+ dst_pud = pud_alloc(dst_mm, dst_p4d, addr);
if (!dst_pud)
return -ENOMEM;
- src_pud = pud_offset(src_pgd, addr);
+ src_pud = pud_offset(src_p4d, addr);
do {
next = pud_addr_end(addr, end);
if (pud_none_or_clear_bad(src_pud))
@@ -1039,6 +1073,28 @@ static inline int copy_pud_range(struct mm_struct *dst_mm, struct mm_struct *src
return 0;
}
+static inline int copy_p4d_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
+ pgd_t *dst_pgd, pgd_t *src_pgd, struct vm_area_struct *vma,
+ unsigned long addr, unsigned long end)
+{
+ p4d_t *src_p4d, *dst_p4d;
+ unsigned long next;
+
+ dst_p4d = p4d_alloc(dst_mm, dst_pgd, addr);
+ if (!dst_p4d)
+ return -ENOMEM;
+ src_p4d = p4d_offset(src_pgd, addr);
+ do {
+ next = p4d_addr_end(addr, end);
+ if (p4d_none_or_clear_bad(src_p4d))
+ continue;
+ if (copy_pud_range(dst_mm, src_mm, dst_p4d, src_p4d,
+ vma, addr, next))
+ return -ENOMEM;
+ } while (dst_p4d++, src_p4d++, addr = next, addr != end);
+ return 0;
+}
+
int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
struct vm_area_struct *vma)
{
@@ -1094,7 +1150,7 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
next = pgd_addr_end(addr, end);
if (pgd_none_or_clear_bad(src_pgd))
continue;
- if (unlikely(copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd,
+ if (unlikely(copy_p4d_range(dst_mm, src_mm, dst_pgd, src_pgd,
vma, addr, next))) {
ret = -ENOMEM;
break;
@@ -1263,14 +1319,14 @@ next:
}
static inline unsigned long zap_pud_range(struct mmu_gather *tlb,
- struct vm_area_struct *vma, pgd_t *pgd,
+ struct vm_area_struct *vma, p4d_t *p4d,
unsigned long addr, unsigned long end,
struct zap_details *details)
{
pud_t *pud;
unsigned long next;
- pud = pud_offset(pgd, addr);
+ pud = pud_offset(p4d, addr);
do {
next = pud_addr_end(addr, end);
if (pud_none_or_clear_bad(pud))
@@ -1281,6 +1337,25 @@ static inline unsigned long zap_pud_range(struct mmu_gather *tlb,
return addr;
}
+static inline unsigned long zap_p4d_range(struct mmu_gather *tlb,
+ struct vm_area_struct *vma, pgd_t *pgd,
+ unsigned long addr, unsigned long end,
+ struct zap_details *details)
+{
+ p4d_t *p4d;
+ unsigned long next;
+
+ p4d = p4d_offset(pgd, addr);
+ do {
+ next = p4d_addr_end(addr, end);
+ if (p4d_none_or_clear_bad(p4d))
+ continue;
+ next = zap_pud_range(tlb, vma, p4d, addr, next, details);
+ } while (p4d++, addr = next, addr != end);
+
+ return addr;
+}
+
void unmap_page_range(struct mmu_gather *tlb,
struct vm_area_struct *vma,
unsigned long addr, unsigned long end,
@@ -1296,7 +1371,7 @@ void unmap_page_range(struct mmu_gather *tlb,
next = pgd_addr_end(addr, end);
if (pgd_none_or_clear_bad(pgd))
continue;
- next = zap_pud_range(tlb, vma, pgd, addr, next, details);
+ next = zap_p4d_range(tlb, vma, pgd, addr, next, details);
} while (pgd++, addr = next, addr != end);
tlb_end_vma(tlb, vma);
}
@@ -1452,16 +1527,24 @@ EXPORT_SYMBOL_GPL(zap_vma_ptes);
pte_t *__get_locked_pte(struct mm_struct *mm, unsigned long addr,
spinlock_t **ptl)
{
- pgd_t * pgd = pgd_offset(mm, addr);
- pud_t * pud = pud_alloc(mm, pgd, addr);
- if (pud) {
- pmd_t * pmd = pmd_alloc(mm, pud, addr);
- if (pmd) {
- VM_BUG_ON(pmd_trans_huge(*pmd));
- return pte_alloc_map_lock(mm, pmd, addr, ptl);
- }
- }
- return NULL;
+ pgd_t *pgd;
+ p4d_t *p4d;
+ pud_t *pud;
+ pmd_t *pmd;
+
+ pgd = pgd_offset(mm, addr);
+ p4d = p4d_alloc(mm, pgd, addr);
+ if (!p4d)
+ return NULL;
+ pud = pud_alloc(mm, p4d, addr);
+ if (!pud)
+ return NULL;
+ pmd = pmd_alloc(mm, pud, addr);
+ if (!pmd)
+ return NULL;
+
+ VM_BUG_ON(pmd_trans_huge(*pmd));
+ return pte_alloc_map_lock(mm, pmd, addr, ptl);
}
/*
@@ -1723,7 +1806,7 @@ static inline int remap_pmd_range(struct mm_struct *mm, pud_t *pud,
return 0;
}
-static inline int remap_pud_range(struct mm_struct *mm, pgd_t *pgd,
+static inline int remap_pud_range(struct mm_struct *mm, p4d_t *p4d,
unsigned long addr, unsigned long end,
unsigned long pfn, pgprot_t prot)
{
@@ -1731,7 +1814,7 @@ static inline int remap_pud_range(struct mm_struct *mm, pgd_t *pgd,
unsigned long next;
pfn -= addr >> PAGE_SHIFT;
- pud = pud_alloc(mm, pgd, addr);
+ pud = pud_alloc(mm, p4d, addr);
if (!pud)
return -ENOMEM;
do {
@@ -1743,6 +1826,26 @@ static inline int remap_pud_range(struct mm_struct *mm, pgd_t *pgd,
return 0;
}
+static inline int remap_p4d_range(struct mm_struct *mm, pgd_t *pgd,
+ unsigned long addr, unsigned long end,
+ unsigned long pfn, pgprot_t prot)
+{
+ p4d_t *p4d;
+ unsigned long next;
+
+ pfn -= addr >> PAGE_SHIFT;
+ p4d = p4d_alloc(mm, pgd, addr);
+ if (!p4d)
+ return -ENOMEM;
+ do {
+ next = p4d_addr_end(addr, end);
+ if (remap_pud_range(mm, p4d, addr, next,
+ pfn + (addr >> PAGE_SHIFT), prot))
+ return -ENOMEM;
+ } while (p4d++, addr = next, addr != end);
+ return 0;
+}
+
/**
* remap_pfn_range - remap kernel memory to userspace
* @vma: user vma to map to
@@ -1799,7 +1902,7 @@ int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
flush_cache_range(vma, addr, end);
do {
next = pgd_addr_end(addr, end);
- err = remap_pud_range(mm, pgd, addr, next,
+ err = remap_p4d_range(mm, pgd, addr, next,
pfn + (addr >> PAGE_SHIFT), prot);
if (err)
break;
@@ -1915,7 +2018,7 @@ static int apply_to_pmd_range(struct mm_struct *mm, pud_t *pud,
return err;
}
-static int apply_to_pud_range(struct mm_struct *mm, pgd_t *pgd,
+static int apply_to_pud_range(struct mm_struct *mm, p4d_t *p4d,
unsigned long addr, unsigned long end,
pte_fn_t fn, void *data)
{
@@ -1923,7 +2026,7 @@ static int apply_to_pud_range(struct mm_struct *mm, pgd_t *pgd,
unsigned long next;
int err;
- pud = pud_alloc(mm, pgd, addr);
+ pud = pud_alloc(mm, p4d, addr);
if (!pud)
return -ENOMEM;
do {
@@ -1935,6 +2038,26 @@ static int apply_to_pud_range(struct mm_struct *mm, pgd_t *pgd,
return err;
}
+static int apply_to_p4d_range(struct mm_struct *mm, pgd_t *pgd,
+ unsigned long addr, unsigned long end,
+ pte_fn_t fn, void *data)
+{
+ p4d_t *p4d;
+ unsigned long next;
+ int err;
+
+ p4d = p4d_alloc(mm, pgd, addr);
+ if (!p4d)
+ return -ENOMEM;
+ do {
+ next = p4d_addr_end(addr, end);
+ err = apply_to_pud_range(mm, p4d, addr, next, fn, data);
+ if (err)
+ break;
+ } while (p4d++, addr = next, addr != end);
+ return err;
+}
+
/*
* Scan a region of virtual memory, filling in page tables as necessary
* and calling a provided function on each leaf page table.
@@ -1953,7 +2076,7 @@ int apply_to_page_range(struct mm_struct *mm, unsigned long addr,
pgd = pgd_offset(mm, addr);
do {
next = pgd_addr_end(addr, end);
- err = apply_to_pud_range(mm, pgd, addr, next, fn, data);
+ err = apply_to_p4d_range(mm, pgd, addr, next, fn, data);
if (err)
break;
} while (pgd++, addr = next, addr != end);
@@ -3573,10 +3696,14 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
};
struct mm_struct *mm = vma->vm_mm;
pgd_t *pgd;
+ p4d_t *p4d;
pud_t *pud;
pgd = pgd_offset(mm, address);
- pud = pud_alloc(mm, pgd, address);
+ p4d = p4d_alloc(mm, pgd, address);
+ if (!p4d)
+ return VM_FAULT_OOM;
+ pud = pud_alloc(mm, p4d, address);
if (!pud)
return VM_FAULT_OOM;
fe.pmd = pmd_alloc(mm, pud, address);
@@ -3667,7 +3794,7 @@ EXPORT_SYMBOL_GPL(handle_mm_fault);
* Allocate page upper directory.
* We've already handled the fast-path in-line.
*/
-int __pud_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address)
+int __pud_alloc(struct mm_struct *mm, p4d_t *p4d, unsigned long address)
{
pud_t *new = pud_alloc_one(mm, address);
if (!new)
@@ -3676,10 +3803,17 @@ int __pud_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address)
smp_wmb(); /* See comment in __pte_alloc */
spin_lock(&mm->page_table_lock);
- if (pgd_present(*pgd)) /* Another has populated it */
+#ifndef __ARCH_HAS_5LEVEL_HACK
+ if (p4d_present(*p4d)) /* Another has populated it */
+ pud_free(mm, new);
+ else
+ p4d_populate(mm, p4d, new);
+#else
+ if (pgd_present(*p4d)) /* Another has populated it */
pud_free(mm, new);
else
- pgd_populate(mm, pgd, new);
+ pgd_populate(mm, p4d, new);
+#endif /* __ARCH_HAS_5LEVEL_HACK */
spin_unlock(&mm->page_table_lock);
return 0;
}
@@ -3721,6 +3855,7 @@ static int __follow_pte(struct mm_struct *mm, unsigned long address,
pte_t **ptepp, spinlock_t **ptlp)
{
pgd_t *pgd;
+ p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
pte_t *ptep;
@@ -3729,7 +3864,11 @@ static int __follow_pte(struct mm_struct *mm, unsigned long address,
if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
goto out;
- pud = pud_offset(pgd, address);
+ p4d = p4d_offset(pgd, address);
+ if (p4d_none(*p4d) || unlikely(p4d_bad(*p4d)))
+ goto out;
+
+ pud = pud_offset(p4d, address);
if (pud_none(*pud) || unlikely(pud_bad(*pud)))
goto out;
diff --git a/mm/mlock.c b/mm/mlock.c
index 14645be06e30..35a2be030dc8 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -376,6 +376,7 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
pte = get_locked_pte(vma->vm_mm, start, &ptl);
/* Make sure we do not cross the page table boundary */
end = pgd_addr_end(start, end);
+ end = p4d_addr_end(start, end);
end = pud_addr_end(start, end);
end = pmd_addr_end(start, end);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index a4830f0325fe..d28516b620f6 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -195,14 +195,14 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
}
static inline unsigned long change_pud_range(struct vm_area_struct *vma,
- pgd_t *pgd, unsigned long addr, unsigned long end,
+ p4d_t *p4d, unsigned long addr, unsigned long end,
pgprot_t newprot, int dirty_accountable, int prot_numa)
{
pud_t *pud;
unsigned long next;
unsigned long pages = 0;
- pud = pud_offset(pgd, addr);
+ pud = pud_offset(p4d, addr);
do {
next = pud_addr_end(addr, end);
if (pud_none_or_clear_bad(pud))
@@ -214,6 +214,26 @@ static inline unsigned long change_pud_range(struct vm_area_struct *vma,
return pages;
}
+static inline unsigned long change_p4d_range(struct vm_area_struct *vma,
+ pgd_t *pgd, unsigned long addr, unsigned long end,
+ pgprot_t newprot, int dirty_accountable, int prot_numa)
+{
+ p4d_t *p4d;
+ unsigned long next;
+ unsigned long pages = 0;
+
+ p4d = p4d_offset(pgd, addr);
+ do {
+ next = p4d_addr_end(addr, end);
+ if (p4d_none_or_clear_bad(p4d))
+ continue;
+ pages += change_pud_range(vma, p4d, addr, next, newprot,
+ dirty_accountable, prot_numa);
+ } while (p4d++, addr = next, addr != end);
+
+ return pages;
+}
+
static unsigned long change_protection_range(struct vm_area_struct *vma,
unsigned long addr, unsigned long end, pgprot_t newprot,
int dirty_accountable, int prot_numa)
@@ -232,7 +252,7 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
next = pgd_addr_end(addr, end);
if (pgd_none_or_clear_bad(pgd))
continue;
- pages += change_pud_range(vma, pgd, addr, next, newprot,
+ pages += change_p4d_range(vma, pgd, addr, next, newprot,
dirty_accountable, prot_numa);
} while (pgd++, addr = next, addr != end);
diff --git a/mm/mremap.c b/mm/mremap.c
index da22ad2a5678..c4971130fd17 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -31,6 +31,7 @@
static pmd_t *get_old_pmd(struct mm_struct *mm, unsigned long addr)
{
pgd_t *pgd;
+ p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
@@ -38,7 +39,11 @@ static pmd_t *get_old_pmd(struct mm_struct *mm, unsigned long addr)
if (pgd_none_or_clear_bad(pgd))
return NULL;
- pud = pud_offset(pgd, addr);
+ p4d = p4d_offset(pgd, addr);
+ if (p4d_none_or_clear_bad(p4d))
+ return NULL;
+
+ pud = pud_offset(p4d, addr);
if (pud_none_or_clear_bad(pud))
return NULL;
@@ -53,11 +58,15 @@ static pmd_t *alloc_new_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long addr)
{
pgd_t *pgd;
+ p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
pgd = pgd_offset(mm, addr);
- pud = pud_alloc(mm, pgd, addr);
+ p4d = p4d_alloc(mm, pgd, addr);
+ if (!p4d)
+ return NULL;
+ pud = pud_alloc(mm, p4d, addr);
if (!pud)
return NULL;
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 207244489a68..0020f340abfd 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -69,14 +69,14 @@ again:
return err;
}
-static int walk_pud_range(pgd_t *pgd, unsigned long addr, unsigned long end,
+static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
struct mm_walk *walk)
{
pud_t *pud;
unsigned long next;
int err = 0;
- pud = pud_offset(pgd, addr);
+ pud = pud_offset(p4d, addr);
do {
next = pud_addr_end(addr, end);
if (pud_none_or_clear_bad(pud)) {
@@ -95,6 +95,32 @@ static int walk_pud_range(pgd_t *pgd, unsigned long addr, unsigned long end,
return err;
}
+static int walk_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
+ struct mm_walk *walk)
+{
+ p4d_t *p4d;
+ unsigned long next;
+ int err = 0;
+
+ p4d = p4d_offset(pgd, addr);
+ do {
+ next = p4d_addr_end(addr, end);
+ if (p4d_none_or_clear_bad(p4d)) {
+ if (walk->pte_hole)
+ err = walk->pte_hole(addr, next, walk);
+ if (err)
+ break;
+ continue;
+ }
+ if (walk->pmd_entry || walk->pte_entry)
+ err = walk_pud_range(p4d, addr, next, walk);
+ if (err)
+ break;
+ } while (p4d++, addr = next, addr != end);
+
+ return err;
+}
+
static int walk_pgd_range(unsigned long addr, unsigned long end,
struct mm_walk *walk)
{
@@ -113,7 +139,7 @@ static int walk_pgd_range(unsigned long addr, unsigned long end,
continue;
}
if (walk->pmd_entry || walk->pte_entry)
- err = walk_pud_range(pgd, addr, next, walk);
+ err = walk_p4d_range(pgd, addr, next, walk);
if (err)
break;
} while (pgd++, addr = next, addr != end);
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 71c5f9109f2a..738e278b48c1 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -22,6 +22,12 @@ void pgd_clear_bad(pgd_t *pgd)
pgd_clear(pgd);
}
+void p4d_clear_bad(p4d_t *p4d)
+{
+ p4d_ERROR(*p4d);
+ p4d_clear(p4d);
+}
+
void pud_clear_bad(pud_t *pud)
{
pud_ERROR(*pud);
diff --git a/mm/rmap.c b/mm/rmap.c
index 1ef36404e7b2..91a0cf369817 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -685,6 +685,7 @@ unsigned long page_address_in_vma(struct page *page, struct vm_area_struct *vma)
pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
{
pgd_t *pgd;
+ p4d_t *p4d;
pud_t *pud;
pmd_t *pmd = NULL;
pmd_t pmde;
@@ -693,7 +694,11 @@ pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
if (!pgd_present(*pgd))
goto out;
- pud = pud_offset(pgd, address);
+ p4d = p4d_offset(pgd, address);
+ if (!p4d_present(*p4d))
+ goto out;
+
+ pud = pud_offset(p4d, address);
if (!pud_present(*pud))
goto out;
@@ -798,6 +803,7 @@ bool page_check_address_transhuge(struct page *page, struct mm_struct *mm,
pte_t **ptep, spinlock_t **ptlp)
{
pgd_t *pgd;
+ p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
pte_t *pte;
@@ -817,7 +823,10 @@ bool page_check_address_transhuge(struct page *page, struct mm_struct *mm,
pgd = pgd_offset(mm, address);
if (!pgd_present(*pgd))
return false;
- pud = pud_offset(pgd, address);
+ p4d = p4d_offset(pgd, address);
+ if (!p4d_present(*p4d))
+ return false;
+ pud = pud_offset(p4d, address);
if (!pud_present(*pud))
return false;
pmd = pmd_offset(pud, address);
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 574c67b663fe..a56c3989f773 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -196,9 +196,9 @@ pmd_t * __meminit vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node)
return pmd;
}
-pud_t * __meminit vmemmap_pud_populate(pgd_t *pgd, unsigned long addr, int node)
+pud_t * __meminit vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node)
{
- pud_t *pud = pud_offset(pgd, addr);
+ pud_t *pud = pud_offset(p4d, addr);
if (pud_none(*pud)) {
void *p = vmemmap_alloc_block(PAGE_SIZE, node);
if (!p)
@@ -208,6 +208,18 @@ pud_t * __meminit vmemmap_pud_populate(pgd_t *pgd, unsigned long addr, int node)
return pud;
}
+p4d_t * __meminit vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node)
+{
+ p4d_t *p4d = p4d_offset(pgd, addr);
+ if (p4d_none(*p4d)) {
+ void *p = vmemmap_alloc_block(PAGE_SIZE, node);
+ if (!p)
+ return NULL;
+ p4d_populate(&init_mm, p4d, p);
+ }
+ return p4d;
+}
+
pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
{
pgd_t *pgd = pgd_offset_k(addr);
@@ -225,6 +237,7 @@ int __meminit vmemmap_populate_basepages(unsigned long start,
{
unsigned long addr = start;
pgd_t *pgd;
+ p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
pte_t *pte;
@@ -233,7 +246,10 @@ int __meminit vmemmap_populate_basepages(unsigned long start,
pgd = vmemmap_pgd_populate(addr, node);
if (!pgd)
return -ENOMEM;
- pud = vmemmap_pud_populate(pgd, addr, node);
+ p4d = vmemmap_p4d_populate(pgd, addr, node);
+ if (!p4d)
+ return -ENOMEM;
+ pud = vmemmap_pud_populate(p4d, addr, node);
if (!pud)
return -ENOMEM;
pmd = vmemmap_pmd_populate(pud, addr, node);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 2657accc6e2b..7267d5d59998 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1238,7 +1238,7 @@ static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud,
return 0;
}
-static inline int unuse_pud_range(struct vm_area_struct *vma, pgd_t *pgd,
+static inline int unuse_pud_range(struct vm_area_struct *vma, p4d_t *p4d,
unsigned long addr, unsigned long end,
swp_entry_t entry, struct page *page)
{
@@ -1246,7 +1246,7 @@ static inline int unuse_pud_range(struct vm_area_struct *vma, pgd_t *pgd,
unsigned long next;
int ret;
- pud = pud_offset(pgd, addr);
+ pud = pud_offset(p4d, addr);
do {
next = pud_addr_end(addr, end);
if (pud_none_or_clear_bad(pud))
@@ -1258,6 +1258,26 @@ static inline int unuse_pud_range(struct vm_area_struct *vma, pgd_t *pgd,
return 0;
}
+static inline int unuse_p4d_range(struct vm_area_struct *vma, pgd_t *pgd,
+ unsigned long addr, unsigned long end,
+ swp_entry_t entry, struct page *page)
+{
+ p4d_t *p4d;
+ unsigned long next;
+ int ret;
+
+ p4d = p4d_offset(pgd, addr);
+ do {
+ next = p4d_addr_end(addr, end);
+ if (p4d_none_or_clear_bad(p4d))
+ continue;
+ ret = unuse_pud_range(vma, p4d, addr, next, entry, page);
+ if (ret)
+ return ret;
+ } while (p4d++, addr = next, addr != end);
+ return 0;
+}
+
static int unuse_vma(struct vm_area_struct *vma,
swp_entry_t entry, struct page *page)
{
@@ -1281,7 +1301,7 @@ static int unuse_vma(struct vm_area_struct *vma,
next = pgd_addr_end(addr, end);
if (pgd_none_or_clear_bad(pgd))
continue;
- ret = unuse_pud_range(vma, pgd, addr, next, entry, page);
+ ret = unuse_p4d_range(vma, pgd, addr, next, entry, page);
if (ret)
return ret;
} while (pgd++, addr = next, addr != end);
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index af817e5060fb..721681deba9f 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -124,19 +124,22 @@ out_unlock:
static pmd_t *mm_alloc_pmd(struct mm_struct *mm, unsigned long address)
{
pgd_t *pgd;
+ p4d_t *p4d;
pud_t *pud;
- pmd_t *pmd = NULL;
pgd = pgd_offset(mm, address);
- pud = pud_alloc(mm, pgd, address);
- if (pud)
- /*
- * Note that we didn't run this because the pmd was
- * missing, the *pmd may be already established and in
- * turn it may also be a trans_huge_pmd.
- */
- pmd = pmd_alloc(mm, pud, address);
- return pmd;
+ p4d = p4d_alloc(mm, pgd, address);
+ if (!p4d)
+ return NULL;
+ pud = pud_alloc(mm, p4d, address);
+ if (!pud)
+ return NULL;
+ /*
+ * Note that we didn't run this because the pmd was
+ * missing, the *pmd may be already established and in
+ * turn it may also be a trans_huge_pmd.
+ */
+ return pmd_alloc(mm, pud, address);
}
static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 91f44e78c516..b61ab3603bf6 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -86,12 +86,12 @@ static void vunmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end)
} while (pmd++, addr = next, addr != end);
}
-static void vunmap_pud_range(pgd_t *pgd, unsigned long addr, unsigned long end)
+static void vunmap_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end)
{
pud_t *pud;
unsigned long next;
- pud = pud_offset(pgd, addr);
+ pud = pud_offset(p4d, addr);
do {
next = pud_addr_end(addr, end);
if (pud_clear_huge(pud))
@@ -102,6 +102,22 @@ static void vunmap_pud_range(pgd_t *pgd, unsigned long addr, unsigned long end)
} while (pud++, addr = next, addr != end);
}
+static void vunmap_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end)
+{
+ p4d_t *p4d;
+ unsigned long next;
+
+ p4d = p4d_offset(pgd, addr);
+ do {
+ next = p4d_addr_end(addr, end);
+ if (p4d_clear_huge(p4d))
+ continue;
+ if (p4d_none_or_clear_bad(p4d))
+ continue;
+ vunmap_pud_range(p4d, addr, next);
+ } while (p4d++, addr = next, addr != end);
+}
+
static void vunmap_page_range(unsigned long addr, unsigned long end)
{
pgd_t *pgd;
@@ -113,7 +129,7 @@ static void vunmap_page_range(unsigned long addr, unsigned long end)
next = pgd_addr_end(addr, end);
if (pgd_none_or_clear_bad(pgd))
continue;
- vunmap_pud_range(pgd, addr, next);
+ vunmap_p4d_range(pgd, addr, next);
} while (pgd++, addr = next, addr != end);
}
@@ -160,13 +176,13 @@ static int vmap_pmd_range(pud_t *pud, unsigned long addr,
return 0;
}
-static int vmap_pud_range(pgd_t *pgd, unsigned long addr,
+static int vmap_pud_range(p4d_t *p4d, unsigned long addr,
unsigned long end, pgprot_t prot, struct page **pages, int *nr)
{
pud_t *pud;
unsigned long next;
- pud = pud_alloc(&init_mm, pgd, addr);
+ pud = pud_alloc(&init_mm, p4d, addr);
if (!pud)
return -ENOMEM;
do {
@@ -177,6 +193,23 @@ static int vmap_pud_range(pgd_t *pgd, unsigned long addr,
return 0;
}
+static int vmap_p4d_range(pgd_t *pgd, unsigned long addr,
+ unsigned long end, pgprot_t prot, struct page **pages, int *nr)
+{
+ p4d_t *p4d;
+ unsigned long next;
+
+ p4d = p4d_alloc(&init_mm, pgd, addr);
+ if (!p4d)
+ return -ENOMEM;
+ do {
+ next = p4d_addr_end(addr, end);
+ if (vmap_pud_range(p4d, addr, next, prot, pages, nr))
+ return -ENOMEM;
+ } while (p4d++, addr = next, addr != end);
+ return 0;
+}
+
/*
* Set up page tables in kva (addr, end). The ptes shall have prot "prot", and
* will have pfns corresponding to the "pages" array.
@@ -196,7 +229,7 @@ static int vmap_page_range_noflush(unsigned long start, unsigned long end,
pgd = pgd_offset_k(addr);
do {
next = pgd_addr_end(addr, end);
- err = vmap_pud_range(pgd, addr, next, prot, pages, &nr);
+ err = vmap_p4d_range(pgd, addr, next, prot, pages, &nr);
if (err)
return err;
} while (pgd++, addr = next, addr != end);
@@ -237,6 +270,10 @@ struct page *vmalloc_to_page(const void *vmalloc_addr)
unsigned long addr = (unsigned long) vmalloc_addr;
struct page *page = NULL;
pgd_t *pgd = pgd_offset_k(addr);
+ p4d_t *p4d;
+ pud_t *pud;
+ pmd_t *pmd;
+ pte_t *ptep, pte;
/*
* XXX we might need to change this if we add VIRTUAL_BUG_ON for
@@ -244,21 +281,23 @@ struct page *vmalloc_to_page(const void *vmalloc_addr)
*/
VIRTUAL_BUG_ON(!is_vmalloc_or_module_addr(vmalloc_addr));
- if (!pgd_none(*pgd)) {
- pud_t *pud = pud_offset(pgd, addr);
- if (!pud_none(*pud)) {
- pmd_t *pmd = pmd_offset(pud, addr);
- if (!pmd_none(*pmd)) {
- pte_t *ptep, pte;
-
- ptep = pte_offset_map(pmd, addr);
- pte = *ptep;
- if (pte_present(pte))
- page = pte_page(pte);
- pte_unmap(ptep);
- }
- }
- }
+ if (pgd_none(*pgd))
+ return NULL;
+ p4d = p4d_offset(pgd, addr);
+ if (p4d_none(*p4d))
+ return NULL;
+ pud = pud_offset(p4d, addr);
+ if (pud_none(*pud))
+ return NULL;
+ pmd = pmd_offset(pud, addr);
+ if (pmd_none(*pmd))
+ return NULL;
+
+ ptep = pte_offset_map(pmd, addr);
+ pte = *ptep;
+ if (pte_present(pte))
+ page = pte_page(pte);
+ pte_unmap(ptep);
return page;
}
EXPORT_SYMBOL(vmalloc_to_page);
--
2.10.2
Add operations to allocate/release p4ds.
TODO: cover XEN.
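For illustration, a backend that needs non-trivial behaviour would plug
handlers into the two new slots. The sketch below is hypothetical (the
xen_* names are made up; Xen coverage is exactly what is still missing):

static void xen_alloc_p4d(struct mm_struct *mm, unsigned long pfn)
{
	/* e.g. pin the freshly allocated p4d page with the hypervisor */
}

static void xen_release_p4d(unsigned long pfn)
{
	/* and unpin it again before the page goes back to the allocator */
}

static const struct pv_mmu_ops xen_mmu_ops __initconst = {
	.alloc_p4d	= xen_alloc_p4d,
	.release_p4d	= xen_release_p4d,
	/* ... rest of the ops ... */
};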
Not-yet-Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/include/asm/paravirt.h | 43 +++++++++++++++++++++++++++++++----
arch/x86/include/asm/paravirt_types.h | 7 +++++-
arch/x86/include/asm/pgalloc.h | 1 +
arch/x86/kernel/paravirt.c | 9 ++++++--
4 files changed, 53 insertions(+), 7 deletions(-)
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 2196ec33063e..ccbb88bb7681 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -366,6 +366,15 @@ static inline void paravirt_release_pud(unsigned long pfn)
PVOP_VCALL1(pv_mmu_ops.release_pud, pfn);
}
+static inline void paravirt_alloc_p4d(struct mm_struct *mm, unsigned long pfn)
+{
+ PVOP_VCALL2(pv_mmu_ops.alloc_p4d, mm, pfn);
+}
+static inline void paravirt_release_p4d(unsigned long pfn)
+{
+ PVOP_VCALL1(pv_mmu_ops.release_p4d, pfn);
+}
+
static inline void pte_update(struct mm_struct *mm, unsigned long addr,
pte_t *ptep)
{
@@ -580,14 +589,35 @@ static inline void set_p4d(p4d_t *p4dp, p4d_t p4d)
val);
}
-static inline void p4d_clear(p4d_t *p4dp)
+#if CONFIG_PGTABLE_LEVELS >= 5
+
+static inline p4d_t __p4d(p4dval_t val)
{
- set_p4d(p4dp, __p4d(0));
+ p4dval_t ret;
+
+ if (sizeof(p4dval_t) > sizeof(long))
+ ret = PVOP_CALLEE2(p4dval_t, pv_mmu_ops.make_p4d,
+ val, (u64)val >> 32);
+ else
+ ret = PVOP_CALLEE1(p4dval_t, pv_mmu_ops.make_p4d,
+ val);
+
+ return (p4d_t) { ret };
}
-#if CONFIG_PGTABLE_LEVELS >= 5
+static inline p4dval_t p4d_val(p4d_t p4d)
+{
+ p4dval_t ret;
-#error FIXME
+ if (sizeof(p4dval_t) > sizeof(long))
+ ret = PVOP_CALLEE2(p4dval_t, pv_mmu_ops.p4d_val,
+ p4d.p4d, (u64)p4d.p4d >> 32);
+ else
+ ret = PVOP_CALLEE1(p4dval_t, pv_mmu_ops.p4d_val,
+ p4d.p4d);
+
+ return ret;
+}
static inline void set_pgd(pgd_t *pgdp, pgd_t pgd)
{
@@ -608,6 +638,11 @@ static inline void pgd_clear(pgd_t *pgdp)
#endif /* CONFIG_PGTABLE_LEVELS == 5 */
+static inline void p4d_clear(p4d_t *p4dp)
+{
+ set_p4d(p4dp, __p4d(0));
+}
+
#endif /* CONFIG_PGTABLE_LEVELS == 4 */
#endif /* CONFIG_PGTABLE_LEVELS >= 3 */
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index cdfa758ce7de..d1933e40cf4b 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -241,9 +241,11 @@ struct pv_mmu_ops {
void (*alloc_pte)(struct mm_struct *mm, unsigned long pfn);
void (*alloc_pmd)(struct mm_struct *mm, unsigned long pfn);
void (*alloc_pud)(struct mm_struct *mm, unsigned long pfn);
+ void (*alloc_p4d)(struct mm_struct *mm, unsigned long pfn);
void (*release_pte)(unsigned long pfn);
void (*release_pmd)(unsigned long pfn);
void (*release_pud)(unsigned long pfn);
+ void (*release_p4d)(unsigned long pfn);
/* Pagetable manipulation functions */
void (*set_pte)(pte_t *ptep, pte_t pteval);
@@ -287,7 +289,10 @@ struct pv_mmu_ops {
void (*set_p4d)(p4d_t *p4dp, p4d_t p4dval);
#if CONFIG_PGTABLE_LEVELS >= 5
-#error FIXME
+ struct paravirt_callee_save p4d_val;
+ struct paravirt_callee_save make_p4d;
+
+ void (*set_pgd)(pgd_t *pgdp, pgd_t pgdval);
#endif /* CONFIG_PGTABLE_LEVELS >= 5 */
#endif /* CONFIG_PGTABLE_LEVELS >= 4 */
diff --git a/arch/x86/include/asm/pgalloc.h b/arch/x86/include/asm/pgalloc.h
index 2f585054c63c..8408511dbdd1 100644
--- a/arch/x86/include/asm/pgalloc.h
+++ b/arch/x86/include/asm/pgalloc.h
@@ -17,6 +17,7 @@ static inline void paravirt_alloc_pmd(struct mm_struct *mm, unsigned long pfn) {
static inline void paravirt_alloc_pmd_clone(unsigned long pfn, unsigned long clonepfn,
unsigned long start, unsigned long count) {}
static inline void paravirt_alloc_pud(struct mm_struct *mm, unsigned long pfn) {}
+static inline void paravirt_alloc_p4d(struct mm_struct *mm, unsigned long pfn) {}
static inline void paravirt_release_pte(unsigned long pfn) {}
static inline void paravirt_release_pmd(unsigned long pfn) {}
static inline void paravirt_release_pud(unsigned long pfn) {}
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index d81c0c4e6bcf..ca61a7d566cc 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -407,9 +407,11 @@ struct pv_mmu_ops pv_mmu_ops = {
.alloc_pte = paravirt_nop,
.alloc_pmd = paravirt_nop,
.alloc_pud = paravirt_nop,
+ .alloc_p4d = paravirt_nop,
.release_pte = paravirt_nop,
.release_pmd = paravirt_nop,
.release_pud = paravirt_nop,
+ .release_p4d = paravirt_nop,
.set_pte = native_set_pte,
.set_pte_at = native_set_pte_at,
@@ -438,8 +440,11 @@ struct pv_mmu_ops pv_mmu_ops = {
.set_p4d = native_set_p4d,
#if CONFIG_PGTABLE_LEVELS >= 5
-#error FIXME
-#endif /* CONFIG_PGTABLE_LEVELS >= 4 */
+ .p4d_val = PTE_IDENT,
+ .make_p4d = PTE_IDENT,
+
+ .set_pgd = native_set_pgd,
+#endif /* CONFIG_PGTABLE_LEVELS >= 5 */
#endif /* CONFIG_PGTABLE_LEVELS >= 4 */
#endif /* CONFIG_PGTABLE_LEVELS >= 3 */
--
2.10.2
We are going to switch core MM to a 5-level paging abstraction.
This is a preparation step which adds <asm-generic/5level-fixup.h>.
As with 4level-fixup.h, the new header lets us quickly make all
architectures compatible with 5-level paging in core MM.
In the long run we would like to switch architectures to a properly folded
p4d level by using <asm-generic/pgtable-nop4d.h>, but that requires more
changes to arch-specific code.
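To illustrate (a sketch of the intended usage): once an architecture
includes the new header from its <asm/pgtable.h>, a core-MM sequence
like the one below compiles to exactly the old 4-level code, because
p4d_alloc(mm, pgd, address) is #defined to (pgd) and p4d_offset() is a
pass-through:

	pgd = pgd_offset(mm, addr);
	p4d = p4d_alloc(mm, pgd, addr);	/* expands to (pgd) */
	pud = pud_alloc(mm, p4d, addr);	/* __pud_alloc() on the pgd entry */
	pmd = pmd_alloc(mm, pud, addr);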
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/asm-generic/4level-fixup.h | 3 ++-
include/asm-generic/5level-fixup.h | 41 ++++++++++++++++++++++++++++++++++++++
include/linux/mm.h | 3 +++
3 files changed, 46 insertions(+), 1 deletion(-)
create mode 100644 include/asm-generic/5level-fixup.h
diff --git a/include/asm-generic/4level-fixup.h b/include/asm-generic/4level-fixup.h
index 5bdab6bffd23..928fd66b1271 100644
--- a/include/asm-generic/4level-fixup.h
+++ b/include/asm-generic/4level-fixup.h
@@ -15,7 +15,6 @@
((unlikely(pgd_none(*(pud))) && __pmd_alloc(mm, pud, address))? \
NULL: pmd_offset(pud, address))
-#define pud_alloc(mm, pgd, address) (pgd)
#define pud_offset(pgd, start) (pgd)
#define pud_none(pud) 0
#define pud_bad(pud) 0
@@ -35,4 +34,6 @@
#undef pud_addr_end
#define pud_addr_end(addr, end) (end)
+#include <asm-generic/5level-fixup.h>
+
#endif
diff --git a/include/asm-generic/5level-fixup.h b/include/asm-generic/5level-fixup.h
new file mode 100644
index 000000000000..02a71804d9fa
--- /dev/null
+++ b/include/asm-generic/5level-fixup.h
@@ -0,0 +1,41 @@
+#ifndef _5LEVEL_FIXUP_H
+#define _5LEVEL_FIXUP_H
+
+#define __ARCH_HAS_5LEVEL_HACK
+#define __PAGETABLE_P4D_FOLDED
+
+#define P4D_SHIFT PGDIR_SHIFT
+#define P4D_SIZE PGDIR_SIZE
+#define P4D_MASK PGDIR_MASK
+#define PTRS_PER_P4D 1
+
+#define p4d_t pgd_t
+
+#define pud_alloc(mm, p4d, address) \
+ ((unlikely(pgd_none(*(p4d))) && __pud_alloc(mm, p4d, address))? \
+ NULL: pud_offset(p4d, address))
+
+#define p4d_alloc(mm, pgd, address) (pgd)
+#define p4d_offset(pgd, start) (pgd)
+#define p4d_none(p4d) 0
+#define p4d_bad(p4d) 0
+#define p4d_present(p4d) 1
+#define p4d_ERROR(p4d) do { } while (0)
+#define p4d_clear(p4d) pgd_clear(p4d)
+#define p4d_val(p4d) pgd_val(p4d)
+#define p4d_populate(mm, p4d, pud) pgd_populate(mm, p4d, pud)
+#define p4d_page(p4d) pgd_page(p4d)
+#define p4d_page_vaddr(p4d) pgd_page_vaddr(p4d)
+
+#define __p4d(x) __pgd(x)
+#define set_p4d(p4dp, p4d) set_pgd(p4dp, p4d)
+
+#undef p4d_free_tlb
+#define p4d_free_tlb(tlb, x, addr) do { } while (0)
+#define p4d_free(mm, x) do { } while (0)
+#define __p4d_free_tlb(tlb, x, addr) do { } while (0)
+
+#undef p4d_addr_end
+#define p4d_addr_end(addr, end) (end)
+
+#endif
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ef815b9cd426..9a4dfd7c3515 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1600,11 +1600,14 @@ int __pte_alloc_kernel(pmd_t *pmd, unsigned long address);
* Remove it when 4level-fixup.h has been removed.
*/
#if defined(CONFIG_MMU) && !defined(__ARCH_HAS_4LEVEL_HACK)
+
+#ifndef __ARCH_HAS_5LEVEL_HACK
static inline pud_t *pud_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address)
{
return (unlikely(pgd_none(*pgd)) && __pud_alloc(mm, pgd, address))?
NULL: pud_offset(pgd, address);
}
+#endif /* !__ARCH_HAS_5LEVEL_HACK */
static inline pmd_t *pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address)
{
--
2.10.2
Like pgtable-nopud.h for 4-level paging, this new header is the basis
for converting an architecture to a properly folded p4d_t level.
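The payoff (sketch): for a converted 4-level architecture the extra
level is free at runtime, since p4d_offset() merely casts the pgd
pointer and the 1-entry p4d needs no allocation:

	pgd = pgd_offset(mm, addr);
	p4d = p4d_offset(pgd, addr);	/* (p4d_t *)pgd, no extra memory access */
	pud = pud_offset(p4d, addr);	/* proceeds exactly as before */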
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/asm-generic/pgtable-nop4d.h | 56 +++++++++++++++++++++++++++++++++++++
include/asm-generic/pgtable-nopud.h | 43 ++++++++++++++--------------
include/asm-generic/tlb.h | 12 ++++++--
3 files changed, 88 insertions(+), 23 deletions(-)
create mode 100644 include/asm-generic/pgtable-nop4d.h
diff --git a/include/asm-generic/pgtable-nop4d.h b/include/asm-generic/pgtable-nop4d.h
new file mode 100644
index 000000000000..d33c1ca857fc
--- /dev/null
+++ b/include/asm-generic/pgtable-nop4d.h
@@ -0,0 +1,56 @@
+#ifndef _PGTABLE_NOP4D_H
+#define _PGTABLE_NOP4D_H
+
+#ifndef __ASSEMBLY__
+
+#define __PAGETABLE_P4D_FOLDED
+
+typedef struct { pgd_t pgd; } p4d_t;
+
+#define P4D_SHIFT PGDIR_SHIFT
+#define PTRS_PER_P4D 1
+#define P4D_SIZE (1UL << P4D_SHIFT)
+#define P4D_MASK (~(P4D_SIZE-1))
+
+/*
+ * The "pgd_xxx()" functions here are trivial for a folded two-level
+ * setup: the p4d is never bad, and a p4d always exists (as it's folded
+ * into the pgd entry)
+ */
+static inline int pgd_none(pgd_t pgd) { return 0; }
+static inline int pgd_bad(pgd_t pgd) { return 0; }
+static inline int pgd_present(pgd_t pgd) { return 1; }
+static inline void pgd_clear(pgd_t *pgd) { }
+#define p4d_ERROR(p4d) (pgd_ERROR((p4d).pgd))
+
+#define pgd_populate(mm, pgd, p4d) do { } while (0)
+/*
+ * (p4ds are folded into pgds so this doesn't get actually called,
+ * but the define is needed for a generic inline function.)
+ */
+#define set_pgd(pgdptr, pgdval) set_p4d((p4d_t *)(pgdptr), (p4d_t) { pgdval })
+
+static inline p4d_t * p4d_offset(pgd_t * pgd, unsigned long address)
+{
+ return (p4d_t *)pgd;
+}
+
+#define p4d_val(x) (pgd_val((x).pgd))
+#define __p4d(x) ((p4d_t) { __pgd(x) } )
+
+#define pgd_page(pgd) (p4d_page((p4d_t){ pgd }))
+#define pgd_page_vaddr(pgd) (p4d_page_vaddr((p4d_t){ pgd }))
+
+/*
+ * allocating and freeing a p4d is trivial: the 1-entry p4d is
+ * inside the pgd, so has no extra memory associated with it.
+ */
+#define p4d_alloc_one(mm, address) NULL
+#define p4d_free(mm, x) do { } while (0)
+#define __p4d_free_tlb(tlb, x, a) do { } while (0)
+
+#undef p4d_addr_end
+#define p4d_addr_end(addr, end) (end)
+
+#endif /* __ASSEMBLY__ */
+#endif /* _PGTABLE_NOP4D_H */
diff --git a/include/asm-generic/pgtable-nopud.h b/include/asm-generic/pgtable-nopud.h
index 5e49430a30a4..de52c80c4349 100644
--- a/include/asm-generic/pgtable-nopud.h
+++ b/include/asm-generic/pgtable-nopud.h
@@ -6,53 +6,54 @@
#ifdef __ARCH_USE_5LEVEL_HACK
#include <asm-generic/pgtable-nop4d-hack.h>
#else
+#include <asm-generic/pgtable-nop4d.h>
#define __PAGETABLE_PUD_FOLDED
/*
- * Having the pud type consist of a pgd gets the size right, and allows
- * us to conceptually access the pgd entry that this pud is folded into
+ * Having the pud type consist of a p4d gets the size right, and allows
+ * us to conceptually access the p4d entry that this pud is folded into
* without casting.
*/
-typedef struct { pgd_t pgd; } pud_t;
+typedef struct { p4d_t p4d; } pud_t;
-#define PUD_SHIFT PGDIR_SHIFT
+#define PUD_SHIFT P4D_SHIFT
#define PTRS_PER_PUD 1
#define PUD_SIZE (1UL << PUD_SHIFT)
#define PUD_MASK (~(PUD_SIZE-1))
/*
- * The "pgd_xxx()" functions here are trivial for a folded two-level
+ * The "p4d_xxx()" functions here are trivial for a folded two-level
* setup: the pud is never bad, and a pud always exists (as it's folded
- * into the pgd entry)
+ * into the p4d entry)
*/
-static inline int pgd_none(pgd_t pgd) { return 0; }
-static inline int pgd_bad(pgd_t pgd) { return 0; }
-static inline int pgd_present(pgd_t pgd) { return 1; }
-static inline void pgd_clear(pgd_t *pgd) { }
-#define pud_ERROR(pud) (pgd_ERROR((pud).pgd))
+static inline int p4d_none(p4d_t p4d) { return 0; }
+static inline int p4d_bad(p4d_t p4d) { return 0; }
+static inline int p4d_present(p4d_t p4d) { return 1; }
+static inline void p4d_clear(p4d_t *p4d) { }
+#define pud_ERROR(pud) (p4d_ERROR((pud).p4d))
-#define pgd_populate(mm, pgd, pud) do { } while (0)
+#define p4d_populate(mm, p4d, pud) do { } while (0)
/*
- * (puds are folded into pgds so this doesn't get actually called,
+ * (puds are folded into p4ds so this doesn't get actually called,
* but the define is needed for a generic inline function.)
*/
-#define set_pgd(pgdptr, pgdval) set_pud((pud_t *)(pgdptr), (pud_t) { pgdval })
+#define set_p4d(p4dptr, p4dval) set_pud((pud_t *)(p4dptr), (pud_t) { p4dval })
-static inline pud_t * pud_offset(pgd_t * pgd, unsigned long address)
+static inline pud_t * pud_offset(p4d_t * p4d, unsigned long address)
{
- return (pud_t *)pgd;
+ return (pud_t *)p4d;
}
-#define pud_val(x) (pgd_val((x).pgd))
-#define __pud(x) ((pud_t) { __pgd(x) } )
+#define pud_val(x) (p4d_val((x).p4d))
+#define __pud(x) ((pud_t) { __p4d(x) } )
-#define pgd_page(pgd) (pud_page((pud_t){ pgd }))
-#define pgd_page_vaddr(pgd) (pud_page_vaddr((pud_t){ pgd }))
+#define p4d_page(p4d) (pud_page((pud_t){ p4d }))
+#define p4d_page_vaddr(p4d) (pud_page_vaddr((pud_t){ p4d }))
/*
* allocating and freeing a pud is trivial: the 1-entry pud is
- * inside the pgd, so has no extra memory associated with it.
+ * inside the p4d, so has no extra memory associated with it.
*/
#define pud_alloc_one(mm, address) NULL
#define pud_free(mm, x) do { } while (0)
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index c6d667187608..e34bdb4bfaa9 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -239,6 +239,12 @@ static inline bool __tlb_remove_pte_page(struct mmu_gather *tlb, struct page *pa
__pte_free_tlb(tlb, ptep, address); \
} while (0)
+#define pmd_free_tlb(tlb, pmdp, address) \
+ do { \
+ __tlb_adjust_range(tlb, address); \
+ __pmd_free_tlb(tlb, pmdp, address); \
+ } while (0)
+
#ifndef __ARCH_HAS_4LEVEL_HACK
#define pud_free_tlb(tlb, pudp, address) \
do { \
@@ -247,11 +253,13 @@ static inline bool __tlb_remove_pte_page(struct mmu_gather *tlb, struct page *pa
} while (0)
#endif
-#define pmd_free_tlb(tlb, pmdp, address) \
+#ifndef __ARCH_HAS_5LEVEL_HACK
+#define p4d_free_tlb(tlb, p4dp, address) \
do { \
__tlb_adjust_range(tlb, address); \
- __pmd_free_tlb(tlb, pmdp, address); \
+ __p4d_free_tlb(tlb, p4dp, address); \
} while (0)
+#endif
#define tlb_migrate_finish(mm) do {} while (0)
--
2.10.2
Properly populate the additional page table level if CONFIG_X86_5LEVEL
is enabled.
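For context, the assumed geometry under CONFIG_X86_5LEVEL (taken from
the LA57 whitepaper, not defined in this patch):

	/*
	 * PGDIR_SHIFT = 48, P4D_SHIFT = 39, PTRS_PER_P4D = 512.
	 * One p4d page thus maps 512 * 512 GiB = 256 TiB, and
	 * phys_p4d_init() below walks those 512 entries, handing each
	 * relevant one down to phys_pud_init().
	 */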
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/mm/init_64.c | 71 ++++++++++++++++++++++++++++++++++++++++++++-------
1 file changed, 62 insertions(+), 9 deletions(-)
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index d637893ac8c2..80c688e3133e 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -631,6 +631,58 @@ phys_pud_init(pud_t *pud_page, unsigned long paddr, unsigned long paddr_end,
return paddr_last;
}
+static unsigned long __meminit
+phys_p4d_init(p4d_t *p4d_page, unsigned long paddr, unsigned long paddr_end,
+ unsigned long page_size_mask)
+{
+ unsigned long paddr_next, paddr_last = paddr_end;
+ unsigned long vaddr = (unsigned long)__va(paddr);
+ int i = p4d_index(vaddr);
+
+ if (!IS_ENABLED(CONFIG_X86_5LEVEL))
+ return phys_pud_init((pud_t *) p4d_page, paddr, paddr_end, page_size_mask);
+
+ for (; i < PTRS_PER_P4D; i++, paddr = paddr_next) {
+ p4d_t *p4d;
+ pud_t *pud;
+
+ vaddr = (unsigned long)__va(paddr);
+ p4d = p4d_page + p4d_index(vaddr);
+ paddr_next = (paddr & P4D_MASK) + P4D_SIZE;
+
+ if (paddr >= paddr_end) {
+ if (!after_bootmem &&
+ !e820_any_mapped(paddr & P4D_MASK, paddr_next,
+ E820_RAM) &&
+ !e820_any_mapped(paddr & P4D_MASK, paddr_next,
+ E820_RESERVED_KERN)) {
+ set_p4d(p4d, __p4d(0));
+ }
+ continue;
+ }
+
+ if (!p4d_none(*p4d)) {
+ pud = pud_offset(p4d, 0);
+ paddr_last = phys_pud_init(pud, paddr,
+ paddr_end,
+ page_size_mask);
+ __flush_tlb_all();
+ continue;
+ }
+
+ pud = alloc_low_page();
+ paddr_last = phys_pud_init(pud, paddr, paddr_end,
+ page_size_mask);
+
+ spin_lock(&init_mm.page_table_lock);
+ p4d_populate(&init_mm, p4d, pud);
+ spin_unlock(&init_mm.page_table_lock);
+ }
+ __flush_tlb_all();
+
+ return paddr_last;
+}
+
/*
* Create page table mapping for the physical memory for specific physical
* addresses. The virtual and physical addresses have to be aligned on PMD level
@@ -652,26 +704,27 @@ kernel_physical_mapping_init(unsigned long paddr_start,
for (; vaddr < vaddr_end; vaddr = vaddr_next) {
pgd_t *pgd = pgd_offset_k(vaddr);
p4d_t *p4d;
- pud_t *pud;
vaddr_next = (vaddr & PGDIR_MASK) + PGDIR_SIZE;
- BUILD_BUG_ON(pgd_none(*pgd));
- p4d = p4d_offset(pgd, vaddr);
- if (p4d_val(*p4d)) {
- pud = (pud_t *)p4d_page_vaddr(*p4d);
- paddr_last = phys_pud_init(pud, __pa(vaddr),
+ if (pgd_val(*pgd)) {
+ p4d = (p4d_t *)pgd_page_vaddr(*pgd);
+ paddr_last = phys_p4d_init(p4d, __pa(vaddr),
__pa(vaddr_end),
page_size_mask);
continue;
}
- pud = alloc_low_page();
- paddr_last = phys_pud_init(pud, __pa(vaddr), __pa(vaddr_end),
+ p4d = alloc_low_page();
+ paddr_last = phys_p4d_init(p4d, __pa(vaddr), __pa(vaddr_end),
page_size_mask);
spin_lock(&init_mm.page_table_lock);
- p4d_populate(&init_mm, p4d, pud);
+ if (IS_ENABLED(CONFIG_X86_5LEVEL))
+ pgd_populate(&init_mm, pgd, p4d);
+ else
+ p4d_populate(&init_mm, p4d_offset(pgd, vaddr),
+ (pud_t *) p4d);
spin_unlock(&init_mm.page_table_lock);
pgd_changed = true;
}
--
2.10.2
This is useful to play with for now.
We would need to implement a proper opt-in via prctl()/personality()/an
ELF flag.
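A minimal userspace smoke test (assuming a kernel built with
CONFIG_X86_5LEVEL and this patch applied): with TASK_SIZE_MAX now
derived from __VIRTUAL_MASK_SHIFT, a hinted mapping above the old
47-bit limit should succeed:

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	/* hint well above the old 47-bit userspace limit */
	void *hint = (void *)(1UL << 52);
	void *p = mmap(hint, 4096, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	printf("mapped at %p\n", p);
	/* exit 0 only if we really got a wide address */
	return (unsigned long)p < (1UL << 47);
}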
---
arch/x86/include/asm/processor.h | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 63def9537a2d..ab44cbe853cb 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -755,7 +755,8 @@ extern unsigned long thread_saved_pc(struct task_struct *tsk);
* particular problem by preventing anything from being mapped
* at the maximum canonical address.
*/
-#define TASK_SIZE_MAX ((1UL << 47) - PAGE_SIZE)
+/* TODO: use >47bit only if requested */
+#define TASK_SIZE_MAX ((1UL << __VIRTUAL_MASK_SHIFT) - PAGE_SIZE)
/* This decides where the kernel will search for a free chunk of vm
* space during mmap's.
--
2.10.2
This patch covers simple cases only.
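Nearly all of them follow the same mechanical recipe, sketched here:
insert the p4d step between the pgd and pud lookups and check it the
same way the neighbouring levels are checked:

	pgd = pgd_offset(mm, addr);
	if (pgd_none_or_clear_bad(pgd))
		goto out;
	p4d = p4d_offset(pgd, addr);	/* the new step */
	if (p4d_none_or_clear_bad(p4d))
		goto out;
	pud = pud_offset(p4d, addr);	/* now takes a p4d */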
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/kernel/tboot.c | 6 +++++-
arch/x86/kernel/vm86_32.c | 6 +++++-
arch/x86/mm/fault.c | 39 +++++++++++++++++++++++++++++++++------
arch/x86/mm/init_32.c | 22 ++++++++++++++++------
arch/x86/mm/ioremap.c | 3 ++-
arch/x86/mm/pgtable.c | 4 +++-
arch/x86/mm/pgtable_32.c | 8 +++++++-
arch/x86/platform/efi/efi_64.c | 13 +++++++++----
arch/x86/power/hibernate_32.c | 7 +++++--
9 files changed, 85 insertions(+), 23 deletions(-)
diff --git a/arch/x86/kernel/tboot.c b/arch/x86/kernel/tboot.c
index 654f6c66fe45..e42397f51732 100644
--- a/arch/x86/kernel/tboot.c
+++ b/arch/x86/kernel/tboot.c
@@ -118,12 +118,16 @@ static int map_tboot_page(unsigned long vaddr, unsigned long pfn,
pgprot_t prot)
{
pgd_t *pgd;
+ p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
pte_t *pte;
pgd = pgd_offset(&tboot_mm, vaddr);
- pud = pud_alloc(&tboot_mm, pgd, vaddr);
+ p4d = p4d_alloc(&tboot_mm, pgd, vaddr);
+ if (!p4d)
+ return -1;
+ pud = pud_alloc(&tboot_mm, p4d, vaddr);
if (!pud)
return -1;
pmd = pmd_alloc(&tboot_mm, pud, vaddr);
diff --git a/arch/x86/kernel/vm86_32.c b/arch/x86/kernel/vm86_32.c
index 01f30e56f99e..e339717af265 100644
--- a/arch/x86/kernel/vm86_32.c
+++ b/arch/x86/kernel/vm86_32.c
@@ -161,6 +161,7 @@ void save_v86_state(struct kernel_vm86_regs *regs, int retval)
static void mark_screen_rdonly(struct mm_struct *mm)
{
pgd_t *pgd;
+ p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
pte_t *pte;
@@ -171,7 +172,10 @@ static void mark_screen_rdonly(struct mm_struct *mm)
pgd = pgd_offset(mm, 0xA0000);
if (pgd_none_or_clear_bad(pgd))
goto out;
- pud = pud_offset(pgd, 0xA0000);
+ p4d = p4d_offset(pgd, 0xA0000);
+ if (p4d_none_or_clear_bad(p4d))
+ goto out;
+ pud = pud_offset(p4d, 0xA0000);
if (pud_none_or_clear_bad(pud))
goto out;
pmd = pmd_offset(pud, 0xA0000);
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index dc8023060456..1bbdbb0594a3 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -252,6 +252,7 @@ static inline pmd_t *vmalloc_sync_one(pgd_t *pgd, unsigned long address)
{
unsigned index = pgd_index(address);
pgd_t *pgd_k;
+ p4d_t *p4d, *p4d_k;
pud_t *pud, *pud_k;
pmd_t *pmd, *pmd_k;
@@ -264,10 +265,15 @@ static inline pmd_t *vmalloc_sync_one(pgd_t *pgd, unsigned long address)
/*
* set_pgd(pgd, *pgd_k); here would be useless on PAE
* and redundant with the set_pmd() on non-PAE. As would
- * set_pud.
+ * set_p4d/set_pud.
*/
- pud = pud_offset(pgd, address);
- pud_k = pud_offset(pgd_k, address);
+ p4d = p4d_offset(pgd, address);
+ p4d_k = p4d_offset(pgd_k, address);
+ if (!p4d_present(*p4d_k))
+ return NULL;
+
+ pud = pud_offset(p4d, address);
+ pud_k = pud_offset(p4d_k, address);
if (!pud_present(*pud_k))
return NULL;
@@ -383,6 +389,8 @@ static void dump_pagetable(unsigned long address)
{
pgd_t *base = __va(read_cr3());
pgd_t *pgd = &base[pgd_index(address)];
+ p4d_t *p4d;
+ pud_t *pud;
pmd_t *pmd;
pte_t *pte;
@@ -391,7 +399,9 @@ static void dump_pagetable(unsigned long address)
if (!low_pfn(pgd_val(*pgd) >> PAGE_SHIFT) || !pgd_present(*pgd))
goto out;
#endif
- pmd = pmd_offset(pud_offset(pgd, address), address);
+ p4d = p4d_offset(pgd, address);
+ pud = pud_offset(p4d, address);
+ pmd = pmd_offset(pud, address);
printk(KERN_CONT "*pde = %0*Lx ", sizeof(*pmd) * 2, (u64)pmd_val(*pmd));
/*
@@ -525,6 +535,7 @@ static void dump_pagetable(unsigned long address)
{
pgd_t *base = __va(read_cr3() & PHYSICAL_PAGE_MASK);
pgd_t *pgd = base + pgd_index(address);
+ p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
pte_t *pte;
@@ -537,7 +548,15 @@ static void dump_pagetable(unsigned long address)
if (!pgd_present(*pgd))
goto out;
- pud = pud_offset(pgd, address);
+ p4d = p4d_offset(pgd, address);
+ if (bad_address(p4d))
+ goto bad;
+
+ printk("P4D %lx ", p4d_val(*p4d));
+ if (!p4d_present(*p4d) || p4d_large(*p4d))
+ goto out;
+
+ pud = pud_offset(p4d, address);
if (bad_address(pud))
goto bad;
@@ -1050,6 +1069,7 @@ static noinline int
spurious_fault(unsigned long error_code, unsigned long address)
{
pgd_t *pgd;
+ p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
pte_t *pte;
@@ -1072,7 +1092,14 @@ spurious_fault(unsigned long error_code, unsigned long address)
if (!pgd_present(*pgd))
return 0;
- pud = pud_offset(pgd, address);
+ p4d = p4d_offset(pgd, address);
+ if (!p4d_present(*p4d))
+ return 0;
+
+ if (p4d_large(*p4d))
+ return spurious_fault_check(error_code, (pte_t *) p4d);
+
+ pud = pud_offset(p4d, address);
if (!pud_present(*pud))
return 0;
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index cf8059016ec8..2b423b821386 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -67,6 +67,7 @@ bool __read_mostly __vmalloc_start_set = false;
*/
static pmd_t * __init one_md_table_init(pgd_t *pgd)
{
+ p4d_t *p4d;
pud_t *pud;
pmd_t *pmd_table;
@@ -75,13 +76,15 @@ static pmd_t * __init one_md_table_init(pgd_t *pgd)
pmd_table = (pmd_t *)alloc_low_page();
paravirt_alloc_pmd(&init_mm, __pa(pmd_table) >> PAGE_SHIFT);
set_pgd(pgd, __pgd(__pa(pmd_table) | _PAGE_PRESENT));
- pud = pud_offset(pgd, 0);
+ p4d = p4d_offset(pgd, 0);
+ pud = pud_offset(p4d, 0);
BUG_ON(pmd_table != pmd_offset(pud, 0));
return pmd_table;
}
#endif
- pud = pud_offset(pgd, 0);
+ p4d = p4d_offset(pgd, 0);
+ pud = pud_offset(p4d, 0);
pmd_table = pmd_offset(pud, 0);
return pmd_table;
@@ -390,8 +393,11 @@ pte_t *kmap_pte;
static inline pte_t *kmap_get_fixmap_pte(unsigned long vaddr)
{
- return pte_offset_kernel(pmd_offset(pud_offset(pgd_offset_k(vaddr),
- vaddr), vaddr), vaddr);
+ pgd_t *pgd = pgd_offset_k(vaddr);
+ p4d_t *p4d = p4d_offset(pgd, vaddr);
+ pud_t *pud = pud_offset(p4d, vaddr);
+ pmd_t *pmd = pmd_offset(pud, vaddr);
+ return pte_offset_kernel(pmd, vaddr);
}
static void __init kmap_init(void)
@@ -410,6 +416,7 @@ static void __init permanent_kmaps_init(pgd_t *pgd_base)
{
unsigned long vaddr;
pgd_t *pgd;
+ p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
pte_t *pte;
@@ -418,7 +425,8 @@ static void __init permanent_kmaps_init(pgd_t *pgd_base)
page_table_range_init(vaddr, vaddr + PAGE_SIZE*LAST_PKMAP, pgd_base);
pgd = swapper_pg_dir + pgd_index(vaddr);
- pud = pud_offset(pgd, vaddr);
+ p4d = p4d_offset(pgd, vaddr);
+ pud = pud_offset(p4d, vaddr);
pmd = pmd_offset(pud, vaddr);
pte = pte_offset_kernel(pmd, vaddr);
pkmap_page_table = pte;
@@ -450,6 +458,7 @@ void __init native_pagetable_init(void)
{
unsigned long pfn, va;
pgd_t *pgd, *base = swapper_pg_dir;
+ p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
pte_t *pte;
@@ -469,7 +478,8 @@ void __init native_pagetable_init(void)
if (!pgd_present(*pgd))
break;
- pud = pud_offset(pgd, va);
+ p4d = p4d_offset(pgd, va);
+ pud = pud_offset(p4d, va);
pmd = pmd_offset(pud, va);
if (!pmd_present(*pmd))
break;
diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
index 7aaa2635862d..a5e1cda85974 100644
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -425,7 +425,8 @@ static inline pmd_t * __init early_ioremap_pmd(unsigned long addr)
/* Don't assume we're using swapper_pg_dir at this point */
pgd_t *base = __va(read_cr3());
pgd_t *pgd = &base[pgd_index(addr)];
- pud_t *pud = pud_offset(pgd, addr);
+ p4d_t *p4d = p4d_offset(pgd, addr);
+ pud_t *pud = pud_offset(p4d, addr);
pmd_t *pmd = pmd_offset(pud, addr);
return pmd;
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 3feec5af4e67..cc6fcd4040e2 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -261,13 +261,15 @@ static void pgd_mop_up_pmds(struct mm_struct *mm, pgd_t *pgdp)
static void pgd_prepopulate_pmd(struct mm_struct *mm, pgd_t *pgd, pmd_t *pmds[])
{
+ p4d_t *p4d;
pud_t *pud;
int i;
if (PREALLOCATED_PMDS == 0) /* Work around gcc-3.4.x bug */
return;
- pud = pud_offset(pgd, 0);
+ p4d = p4d_offset(pgd, 0);
+ pud = pud_offset(p4d, 0);
for (i = 0; i < PREALLOCATED_PMDS; i++, pud++) {
pmd_t *pmd = pmds[i];
diff --git a/arch/x86/mm/pgtable_32.c b/arch/x86/mm/pgtable_32.c
index 9adce776852b..3d275a791c76 100644
--- a/arch/x86/mm/pgtable_32.c
+++ b/arch/x86/mm/pgtable_32.c
@@ -26,6 +26,7 @@ unsigned int __VMALLOC_RESERVE = 128 << 20;
void set_pte_vaddr(unsigned long vaddr, pte_t pteval)
{
pgd_t *pgd;
+ p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
pte_t *pte;
@@ -35,7 +36,12 @@ void set_pte_vaddr(unsigned long vaddr, pte_t pteval)
BUG();
return;
}
- pud = pud_offset(pgd, vaddr);
+ p4d = p4d_offset(pgd, vaddr);
+ if (p4d_none(*p4d)) {
+ BUG();
+ return;
+ }
+ pud = pud_offset(p4d, vaddr);
if (pud_none(*pud)) {
BUG();
return;
diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
index 8dd3784eb075..ac4b0cbd479b 100644
--- a/arch/x86/platform/efi/efi_64.c
+++ b/arch/x86/platform/efi/efi_64.c
@@ -165,6 +165,7 @@ void efi_sync_low_kernel_mappings(void)
{
unsigned num_entries;
pgd_t *pgd_k, *pgd_efi;
+ p4d_t *p4d_k, *p4d_efi;
pud_t *pud_k, *pud_efi;
if (efi_enabled(EFI_OLD_MEMMAP))
@@ -196,16 +197,20 @@ void efi_sync_low_kernel_mappings(void)
BUILD_BUG_ON((EFI_VA_END & ~PUD_MASK) != 0);
pgd_efi = efi_pgd + pgd_index(EFI_VA_END);
- pud_efi = pud_offset(pgd_efi, 0);
+ p4d_efi = p4d_offset(pgd_efi, 0);
+ pud_efi = pud_offset(p4d_efi, 0);
pgd_k = pgd_offset_k(EFI_VA_END);
- pud_k = pud_offset(pgd_k, 0);
+ p4d_k = p4d_offset(pgd_k, 0);
+ pud_k = pud_offset(p4d_k, 0);
num_entries = pud_index(EFI_VA_END);
memcpy(pud_efi, pud_k, sizeof(pud_t) * num_entries);
- pud_efi = pud_offset(pgd_efi, EFI_VA_START);
- pud_k = pud_offset(pgd_k, EFI_VA_START);
+ p4d_efi = p4d_offset(pgd_efi, EFI_VA_START);
+ pud_efi = pud_offset(p4d_efi, EFI_VA_START);
+ p4d_k = p4d_offset(pgd_k, EFI_VA_START);
+ pud_k = pud_offset(p4d_k, EFI_VA_START);
num_entries = PTRS_PER_PUD - pud_index(EFI_VA_START);
memcpy(pud_efi, pud_k, sizeof(pud_t) * num_entries);
diff --git a/arch/x86/power/hibernate_32.c b/arch/x86/power/hibernate_32.c
index 9f14bd34581d..c35fdb585c68 100644
--- a/arch/x86/power/hibernate_32.c
+++ b/arch/x86/power/hibernate_32.c
@@ -32,6 +32,7 @@ pgd_t *resume_pg_dir;
*/
static pmd_t *resume_one_md_table_init(pgd_t *pgd)
{
+ p4d_t *p4d;
pud_t *pud;
pmd_t *pmd_table;
@@ -41,11 +42,13 @@ static pmd_t *resume_one_md_table_init(pgd_t *pgd)
return NULL;
set_pgd(pgd, __pgd(__pa(pmd_table) | _PAGE_PRESENT));
- pud = pud_offset(pgd, 0);
+ p4d = p4d_offset(pgd, 0);
+ pud = pud_offset(p4d, 0);
BUG_ON(pmd_table != pmd_offset(pud, 0));
#else
- pud = pud_offset(pgd, 0);
+ p4d = p4d_offset(pgd, 0);
+ pud = pud_offset(p4d, 0);
pmd_table = pmd_offset(pud, 0);
#endif
--
2.10.2
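For reference, every site converted here follows the same walk shape;
walk_to_pte() below is an illustrative helper, not something this patch
adds:

static pte_t *walk_to_pte(struct mm_struct *mm, unsigned long addr)
{
	pgd_t *pgd = pgd_offset(mm, addr);
	p4d_t *p4d;
	pud_t *pud;
	pmd_t *pmd;

	if (pgd_none_or_clear_bad(pgd))
		return NULL;
	p4d = p4d_offset(pgd, addr);	/* the new intermediate step */
	if (p4d_none_or_clear_bad(p4d))
		return NULL;
	pud = pud_offset(p4d, addr);	/* pud_offset() now takes a p4d */
	if (pud_none_or_clear_bad(pud))
		return NULL;
	pmd = pmd_offset(pud, addr);
	if (pmd_none_or_clear_bad(pmd))
		return NULL;
	return pte_offset_kernel(pmd, addr);
}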
Nothing special: just handle one more level.
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/mm/ident_map.c | 42 +++++++++++++++++++++++++++++++++++-------
1 file changed, 35 insertions(+), 7 deletions(-)
diff --git a/arch/x86/mm/ident_map.c b/arch/x86/mm/ident_map.c
index 4473cb4f8b90..f1a620b9f2ef 100644
--- a/arch/x86/mm/ident_map.c
+++ b/arch/x86/mm/ident_map.c
@@ -45,6 +45,34 @@ static int ident_pud_init(struct x86_mapping_info *info, pud_t *pud_page,
return 0;
}
+static int ident_p4d_init(struct x86_mapping_info *info, p4d_t *p4d_page,
+ unsigned long addr, unsigned long end)
+{
+ unsigned long next;
+
+ for (; addr < end; addr = next) {
+ p4d_t *p4d = p4d_page + p4d_index(addr);
+ pud_t *pud;
+
+ next = (addr & P4D_MASK) + P4D_SIZE;
+ if (next > end)
+ next = end;
+
+ if (p4d_present(*p4d)) {
+ pud = pud_offset(p4d, 0);
+ ident_pud_init(info, pud, addr, next);
+ continue;
+ }
+ pud = (pud_t *)info->alloc_pgt_page(info->context);
+ if (!pud)
+ return -ENOMEM;
+ ident_pud_init(info, pud, addr, next);
+ set_p4d(p4d, __p4d(__pa(pud) | _KERNPG_TABLE));
+ }
+
+ return 0;
+}
+
int kernel_ident_mapping_init(struct x86_mapping_info *info, pgd_t *pgd_page,
unsigned long pstart, unsigned long pend)
{
@@ -55,27 +83,27 @@ int kernel_ident_mapping_init(struct x86_mapping_info *info, pgd_t *pgd_page,
for (; addr < end; addr = next) {
pgd_t *pgd = pgd_page + pgd_index(addr);
- pud_t *pud;
+ p4d_t *p4d;
next = (addr & PGDIR_MASK) + PGDIR_SIZE;
if (next > end)
next = end;
if (pgd_present(*pgd)) {
- pud = pud_offset(pgd, 0);
- result = ident_pud_init(info, pud, addr, next);
+ p4d = p4d_offset(pgd, 0);
+ result = ident_p4d_init(info, p4d, addr, next);
if (result)
return result;
continue;
}
- pud = (pud_t *)info->alloc_pgt_page(info->context);
- if (!pud)
+ p4d = (p4d_t *)info->alloc_pgt_page(info->context);
+ if (!p4d)
return -ENOMEM;
- result = ident_pud_init(info, pud, addr, next);
+ result = ident_p4d_init(info, p4d, addr, next);
if (result)
return result;
- set_pgd(pgd, __pgd(__pa(pud) | _KERNPG_TABLE));
+ set_pgd(pgd, __pgd(__pa(p4d) | _KERNPG_TABLE));
}
return 0;
--
2.10.2
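As a usage sketch: callers still only supply an allocator through struct
x86_mapping_info and never see the p4d level (make_ident_map() and
ident_alloc() below are made-up names, assuming the v4.8 layout of the
struct):

static void *ident_alloc(void *context)
{
	return alloc_low_page();	/* any zeroed page-sized buffer */
}

static int make_ident_map(pgd_t *pgd_page, unsigned long start,
			  unsigned long end)
{
	struct x86_mapping_info info = {
		.alloc_pgt_page	= ident_alloc,
		.pmd_flag	= __PAGE_KERNEL_LARGE_EXEC,
	};

	return kernel_ident_mapping_init(&info, pgd_page, start, end);
}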
It's simply an extension to handle one more page table level.
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/mm/gup.c | 33 +++++++++++++++++++++++++++------
1 file changed, 27 insertions(+), 6 deletions(-)
diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index b8b6a60b32cf..8b69adc10988 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -76,9 +76,9 @@ static void undo_dev_pagemap(int *nr, int nr_start, struct page **pages)
}
/*
- * 'pteval' can come from a pte, pmd or pud. We only check
+ * 'pteval' can come from a pte, pmd, pud or p4d. We only check
* _PAGE_PRESENT, _PAGE_USER, and _PAGE_RW in here which are the
- * same value on all 3 types.
+ * same value on all 4 types.
*/
static inline int pte_allows_gup(unsigned long pteval, int write)
{
@@ -270,13 +270,13 @@ static noinline int gup_huge_pud(pud_t pud, unsigned long addr,
return 1;
}
-static int gup_pud_range(pgd_t pgd, unsigned long addr, unsigned long end,
+static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
int write, struct page **pages, int *nr)
{
unsigned long next;
pud_t *pudp;
- pudp = pud_offset(&pgd, addr);
+ pudp = pud_offset(&p4d, addr);
do {
pud_t pud = *pudp;
@@ -295,6 +295,27 @@ static int gup_pud_range(pgd_t pgd, unsigned long addr, unsigned long end,
return 1;
}
+static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end,
+ int write, struct page **pages, int *nr)
+{
+ unsigned long next;
+ p4d_t *p4dp;
+
+ p4dp = p4d_offset(&pgd, addr);
+ do {
+ p4d_t p4d = *p4dp;
+
+ next = p4d_addr_end(addr, end);
+ if (p4d_none(p4d))
+ return 0;
+ BUILD_BUG_ON(p4d_large(p4d));
+ if (!gup_pud_range(p4d, addr, next, write, pages, nr))
+ return 0;
+ } while (p4dp++, addr = next, addr != end);
+
+ return 1;
+}
+
/*
* Like get_user_pages_fast() except its IRQ-safe in that it won't fall
* back to the regular GUP.
@@ -343,7 +364,7 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
next = pgd_addr_end(addr, end);
if (pgd_none(pgd))
break;
- if (!gup_pud_range(pgd, addr, next, write, pages, &nr))
+ if (!gup_p4d_range(pgd, addr, next, write, pages, &nr))
break;
} while (pgdp++, addr = next, addr != end);
local_irq_restore(flags);
@@ -415,7 +436,7 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
next = pgd_addr_end(addr, end);
if (pgd_none(pgd))
goto slow;
- if (!gup_pud_range(pgd, addr, next, write, pages, &nr))
+ if (!gup_p4d_range(pgd, addr, next, write, pages, &nr))
goto slow;
} while (pgdp++, addr = next, addr != end);
local_irq_enable();
--
2.10.2
With 5-level paging, randomization happens at the P4D level instead of
the PUD level.
The maximum amount of physical memory is also bumped to 52 bits for
5-level paging.
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/mm/kaslr.c | 82 ++++++++++++++++++++++++++++++++++++++++-------------
1 file changed, 63 insertions(+), 19 deletions(-)
diff --git a/arch/x86/mm/kaslr.c b/arch/x86/mm/kaslr.c
index bda8d5eef04d..c79d52732efa 100644
--- a/arch/x86/mm/kaslr.c
+++ b/arch/x86/mm/kaslr.c
@@ -6,12 +6,12 @@
*
* Entropy is generated using the KASLR early boot functions now shared in
* the lib directory (originally written by Kees Cook). Randomization is
- * done on PGD & PUD page table levels to increase possible addresses. The
- * physical memory mapping code was adapted to support PUD level virtual
- * addresses. This implementation on the best configuration provides 30,000
- * possible virtual addresses in average for each memory region. An additional
- * low memory page is used to ensure each CPU can start with a PGD aligned
- * virtual address (for realmode).
+ * done on PGD & P4D/PUD page table levels to increase possible addresses.
+ * The physical memory mapping code was adapted to support P4D/PUD level
+ * virtual addresses. This implementation on the best configuration provides
+ * 30,000 possible virtual addresses in average for each memory region.
+ * An additional low memory page is used to ensure each CPU can start with
+ * a PGD aligned virtual address (for realmode).
*
* The order of each memory region is not changed. The feature looks at
* the available space for the regions based on different configuration
@@ -61,7 +61,8 @@ static __initdata struct kaslr_memory_region {
unsigned long *base;
unsigned long size_tb;
} kaslr_regions[] = {
- { &page_offset_base, 64/* Maximum */ },
+ { &page_offset_base,
+ 1 << (__PHYSICAL_MASK_SHIFT - TB_SHIFT) /* Maximum */ },
{ &vmalloc_base, VMALLOC_SIZE_TB },
};
@@ -120,7 +121,10 @@ void __init kernel_randomize_memory(void)
*/
entropy = remain_entropy / (ARRAY_SIZE(kaslr_regions) - i);
prandom_bytes_state(&rand_state, &rand, sizeof(rand));
- entropy = (rand % (entropy + 1)) & PUD_MASK;
+ if (IS_ENABLED(CONFIG_X86_5LEVEL))
+ entropy = (rand % (entropy + 1)) & P4D_MASK;
+ else
+ entropy = (rand % (entropy + 1)) & PUD_MASK;
vaddr += entropy;
*kaslr_regions[i].base = vaddr;
@@ -129,27 +133,21 @@ void __init kernel_randomize_memory(void)
* randomization alignment.
*/
vaddr += get_padding(&kaslr_regions[i]);
- vaddr = round_up(vaddr + 1, PUD_SIZE);
+ if (IS_ENABLED(CONFIG_X86_5LEVEL))
+ vaddr = round_up(vaddr + 1, P4D_SIZE);
+ else
+ vaddr = round_up(vaddr + 1, PUD_SIZE);
remain_entropy -= entropy;
}
}
-/*
- * Create PGD aligned trampoline table to allow real mode initialization
- * of additional CPUs. Consume only 1 low memory page.
- */
-void __meminit init_trampoline(void)
+static void __meminit init_trampoline_pud(void)
{
unsigned long paddr, paddr_next;
pgd_t *pgd;
pud_t *pud_page, *pud_page_tramp;
int i;
- if (!kaslr_memory_enabled()) {
- init_trampoline_default();
- return;
- }
-
pud_page_tramp = alloc_low_page();
paddr = 0;
@@ -170,3 +168,49 @@ void __meminit init_trampoline(void)
set_pgd(&trampoline_pgd_entry,
__pgd(_KERNPG_TABLE | __pa(pud_page_tramp)));
}
+
+static void __meminit init_trampoline_p4d(void)
+{
+ unsigned long paddr, paddr_next;
+ pgd_t *pgd;
+ p4d_t *p4d_page, *p4d_page_tramp;
+ int i;
+
+ p4d_page_tramp = alloc_low_page();
+
+ paddr = 0;
+ pgd = pgd_offset_k((unsigned long)__va(paddr));
+ p4d_page = (p4d_t *) pgd_page_vaddr(*pgd);
+
+ for (i = p4d_index(paddr); i < PTRS_PER_P4D; i++, paddr = paddr_next) {
+ p4d_t *p4d, *p4d_tramp;
+ unsigned long vaddr = (unsigned long)__va(paddr);
+
+ p4d_tramp = p4d_page_tramp + p4d_index(paddr);
+ p4d = p4d_page + p4d_index(vaddr);
+ paddr_next = (paddr & P4D_MASK) + P4D_SIZE;
+
+ *p4d_tramp = *p4d;
+ }
+
+ set_pgd(&trampoline_pgd_entry,
+ __pgd(_KERNPG_TABLE | __pa(p4d_page_tramp)));
+}
+
+/*
+ * Create PGD aligned trampoline table to allow real mode initialization
+ * of additional CPUs. Consume only 1 low memory page.
+ */
+void __meminit init_trampoline(void)
+{
+
+ if (!kaslr_memory_enabled()) {
+ init_trampoline_default();
+ return;
+ }
+
+ if (IS_ENABLED(CONFIG_X86_5LEVEL))
+ init_trampoline_p4d();
+ else
+ init_trampoline_pud();
+}
--
2.10.2
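The visible effect of switching the mask is the randomization step size:
PUD_MASK keeps regions 1 GiB aligned, P4D_MASK makes them 512 GiB
aligned. A standalone illustration of the masking (shift values copied
from the x86 definitions):

#include <stdio.h>

#define PUD_SHIFT 30	/* 1 GiB units */
#define P4D_SHIFT 39	/* 512 GiB units */

int main(void)
{
	unsigned long rand = 0x123456789abcdefUL; /* stand-in for prandom */

	printf("PUD-aligned: %#lx\n", rand & ~((1UL << PUD_SHIFT) - 1));
	printf("P4D-aligned: %#lx\n", rand & ~((1UL << P4D_SHIFT) - 1));
	return 0;
}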
This basically restores a slightly modified version of the original
sync_global_pgds() which we had before the folded p4d was introduced.
The only modification is protection against 'address' overflow.
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/mm/init_64.c | 47 +++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 47 insertions(+)
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index a991f5c4c2c4..d637893ac8c2 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -92,6 +92,52 @@ __setup("noexec32=", nonx32_setup);
* When memory was added/removed make sure all the processes MM have
* suitable PGD entries in the local PGD level page.
*/
+#ifdef CONFIG_X86_5LEVEL
+void sync_global_pgds(unsigned long start, unsigned long end, int removed)
+{
+ unsigned long address;
+
+ for (address = start; address <= end && address >= start;
+ address += PGDIR_SIZE) {
+ const pgd_t *pgd_ref = pgd_offset_k(address);
+ struct page *page;
+
+ /*
+ * When it is called after memory hot remove, pgd_none()
+ * returns true. In this case (removed == 1), we must clear
+ * the PGD entries in the local PGD level page.
+ */
+ if (pgd_none(*pgd_ref) && !removed)
+ continue;
+
+ spin_lock(&pgd_lock);
+ list_for_each_entry(page, &pgd_list, lru) {
+ pgd_t *pgd;
+ spinlock_t *pgt_lock;
+
+ pgd = (pgd_t *)page_address(page) + pgd_index(address);
+ /* the pgt_lock only for Xen */
+ pgt_lock = &pgd_page_get_mm(page)->page_table_lock;
+ spin_lock(pgt_lock);
+
+ if (!pgd_none(*pgd_ref) && !pgd_none(*pgd))
+ BUG_ON(pgd_page_vaddr(*pgd)
+ != pgd_page_vaddr(*pgd_ref));
+
+ if (removed) {
+ if (pgd_none(*pgd_ref) && !pgd_none(*pgd))
+ pgd_clear(pgd);
+ } else {
+ if (pgd_none(*pgd))
+ set_pgd(pgd, *pgd_ref);
+ }
+
+ spin_unlock(pgt_lock);
+ }
+ spin_unlock(&pgd_lock);
+ }
+}
+#else
void sync_global_pgds(unsigned long start, unsigned long end, int removed)
{
unsigned long address;
@@ -145,6 +191,7 @@ void sync_global_pgds(unsigned long start, unsigned long end, int removed)
spin_unlock(&pgd_lock);
}
}
+#endif
/*
* NOTE: This function is marked __ref because it calls __init function
--
2.10.2
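The overflow protection is the extra "address >= start" term in the loop
condition: for ranges that reach the last PGD entry, address +=
PGDIR_SIZE wraps past zero, and a plain "address <= end" test with end ==
~0UL would never become false. A self-contained demonstration (PGDIR_SIZE
hard-coded to the 5-level value):

#include <stdio.h>

#define PGDIR_SIZE (1UL << 48)	/* PGDIR_SHIFT == 48 with 5-level paging */

int main(void)
{
	unsigned long start = 0xff00000000000000UL, end = ~0UL;
	unsigned long addr;
	int steps = 0;

	for (addr = start; addr <= end && addr >= start; addr += PGDIR_SIZE)
		steps++;
	/* Prints 256; without the ">= start" term this loops forever. */
	printf("terminated after %d steps\n", steps);
	return 0;
}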
Most things are in place and we can enable support for 5-level paging.
Things that are known to be broken are marked as BROKEN.
Not-yet-Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/Kconfig | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index df4f1d514ab0..83a4c22111b3 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -309,6 +309,7 @@ config DEBUG_RODATA
config PGTABLE_LEVELS
int
+ default 5 if X86_5LEVEL
default 4 if X86_64
default 3 if X86_PAE
default 2
@@ -1340,6 +1341,10 @@ config X86_PAE
has the cost of more pagetable lookup overhead, and also
consumes more pagetable space per process.
+config X86_5LEVEL
+ bool "Enable 5-level page tables support"
+ depends on X86_64
+
config ARCH_PHYS_ADDR_T_64BIT
def_bool y
depends on X86_64 || X86_PAE
@@ -1698,6 +1703,7 @@ config X86_SMAP
config X86_INTEL_MPX
prompt "Intel MPX (Memory Protection Extensions)"
def_bool n
+ depends on !X86_5LEVEL
depends on CPU_SUP_INTEL
---help---
MPX provides hardware features that can be used in
--
2.10.2
We don't need it anymore. 17be0aec74fb ("x86/asm/entry/64: Implement
better check for canonical addresses") made the canonical address check
generic with respect to address width.
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/entry/entry_64.S | 7 ++-----
1 file changed, 2 insertions(+), 5 deletions(-)
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 02fff3ebfb87..92a6753f8c85 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -267,12 +267,9 @@ return_from_SYSCALL_64:
*
* If width of "canonical tail" ever becomes variable, this will need
* to be updated to remain correct on both old and new CPUs.
+ *
+ * Change top bits to be the sign-extension of bit __VIRTUAL_MASK_SHIFT
*/
- .ifne __VIRTUAL_MASK_SHIFT - 47
- .error "virtual address width changed -- SYSRET checks need update"
- .endif
-
- /* Change top 16 bits to be the sign-extension of 47th bit */
shl $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
sar $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
--
2.10.2
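In C terms, the shl/sar pair is an arithmetic sign-extension from the top
implemented VA bit, so the same two instructions stay correct for both
widths. A sketch (canonicalize() is illustrative, not kernel code):

#include <stdio.h>

/* Sign-extend addr from bit vshift; vshift plays the role of
 * __VIRTUAL_MASK_SHIFT: 47 for 4-level, 56 for 5-level paging. */
static unsigned long canonicalize(unsigned long addr, int vshift)
{
	int s = 64 - (vshift + 1);

	return (unsigned long)((long)(addr << s) >> s);
}

int main(void)
{
	unsigned long ok = 0x00007fffffffffffUL;  /* canonical (4-level) */
	unsigned long bad = 0x0000800000000000UL; /* non-canonical */

	/* An address is a safe SYSRET target iff it survives the round
	 * trip unchanged; prints "1 0". */
	printf("%d %d\n", canonicalize(ok, 47) == ok,
	       canonicalize(bad, 47) == bad);
	return 0;
}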
XXX: how to test this?
Not-yet-Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/kernel/espfix_64.c | 41 ++++++++++++++++++++++++++++++++++++++---
1 file changed, 38 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kernel/espfix_64.c b/arch/x86/kernel/espfix_64.c
index 04f89caef9c4..f0afa0af4237 100644
--- a/arch/x86/kernel/espfix_64.c
+++ b/arch/x86/kernel/espfix_64.c
@@ -70,8 +70,15 @@ static DEFINE_MUTEX(espfix_init_mutex);
#define ESPFIX_MAX_PAGES DIV_ROUND_UP(CONFIG_NR_CPUS, ESPFIX_STACKS_PER_PAGE)
static void *espfix_pages[ESPFIX_MAX_PAGES];
-static __page_aligned_bss pud_t espfix_pud_page[PTRS_PER_PUD]
+#if CONFIG_PGTABLE_LEVELS == 5
+static __page_aligned_bss pud_t espfix_pgtable_page[PTRS_PER_PUD]
__aligned(PAGE_SIZE);
+#elif CONFIG_PGTABLE_LEVELS == 4
+static __page_aligned_bss pud_t espfix_pgtable_page[PTRS_PER_PUD]
+ __aligned(PAGE_SIZE);
+#else
+#error Unexpected CONFIG_PGTABLE_LEVELS
+#endif
static unsigned int page_random, slot_random;
@@ -97,6 +104,8 @@ static inline unsigned long espfix_base_addr(unsigned int cpu)
#define ESPFIX_PTE_CLONES (PTRS_PER_PTE/PTE_STRIDE)
#define ESPFIX_PMD_CLONES PTRS_PER_PMD
#define ESPFIX_PUD_CLONES (65536/(ESPFIX_PTE_CLONES*ESPFIX_PMD_CLONES))
+/* XXX: what should it be? */
+#define ESPFIX_P4D_CLONES PTRS_PER_P4D
#define PGTABLE_PROT ((_KERNPG_TABLE & ~_PAGE_RW) | _PAGE_NX)
@@ -122,10 +131,21 @@ static void init_espfix_random(void)
void __init init_espfix_bsp(void)
{
pgd_t *pgd_p;
+ p4d_t *p4d;
/* Install the espfix pud into the kernel page directory */
pgd_p = &init_level4_pgt[pgd_index(ESPFIX_BASE_ADDR)];
- pgd_populate(&init_mm, pgd_p, (pud_t *)espfix_pud_page);
+ switch (CONFIG_PGTABLE_LEVELS) {
+ case 4:
+ p4d = p4d_offset(pgd_p, ESPFIX_BASE_ADDR);
+ p4d_populate(&init_mm, p4d, (pud_t *)espfix_pgtable_page);
+ break;
+ case 5:
+ pgd_populate(&init_mm, pgd_p, (p4d_t *)espfix_pgtable_page);
+ break;
+ default:
+ BUILD_BUG();
+ }
/* Randomize the locations */
init_espfix_random();
@@ -138,6 +158,7 @@ void init_espfix_ap(int cpu)
{
unsigned int page;
unsigned long addr;
+ p4d_t p4d, *p4d_p;
pud_t pud, *pud_p;
pmd_t pmd, *pmd_p;
pte_t pte, *pte_p;
@@ -167,7 +188,21 @@ void init_espfix_ap(int cpu)
node = cpu_to_node(cpu);
ptemask = __supported_pte_mask;
- pud_p = &espfix_pud_page[pud_index(addr)];
+ if (CONFIG_PGTABLE_LEVELS == 5) {
+ p4d_p = (p4d_t *)espfix_pgtable_page + p4d_index(addr);
+ p4d = *p4d_p;
+ if (!p4d_present(p4d)) {
+ struct page *page = alloc_pages_node(node, PGALLOC_GFP, 0);
+
+ pud_p = (pud_t *)page_address(page);
+ p4d = __p4d(__pa(pud_p) | (PGTABLE_PROT & ptemask));
+ paravirt_alloc_pud(&init_mm, __pa(pud_p) >> PAGE_SHIFT);
+ for (n = 0; n < ESPFIX_P4D_CLONES; n++)
+ set_p4d(&p4d_p[n], p4d);
+ }
+ } else {
+ pud_p = (pud_t *)espfix_pgtable_page + pud_index(addr);
+ }
pud = *pud_p;
if (!pud_present(pud)) {
struct page *page = alloc_pages_node(node, PGALLOC_GFP, 0);
--
2.10.2
Simple extension to support one more page table level.
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/mm/dump_pagetables.c | 49 ++++++++++++++++++++++++++++++++++++-------
1 file changed, 42 insertions(+), 7 deletions(-)
diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
index ea9c49adaa1f..15670b55861b 100644
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -100,7 +100,8 @@ static struct addr_marker address_markers[] = {
#define PTE_LEVEL_MULT (PAGE_SIZE)
#define PMD_LEVEL_MULT (PTRS_PER_PTE * PTE_LEVEL_MULT)
#define PUD_LEVEL_MULT (PTRS_PER_PMD * PMD_LEVEL_MULT)
-#define PGD_LEVEL_MULT (PTRS_PER_PUD * PUD_LEVEL_MULT)
+#define P4D_LEVEL_MULT (PTRS_PER_PUD * PUD_LEVEL_MULT)
+#define PGD_LEVEL_MULT (PTRS_PER_PUD * P4D_LEVEL_MULT)
#define pt_dump_seq_printf(m, to_dmesg, fmt, args...) \
({ \
@@ -326,14 +327,14 @@ static void walk_pmd_level(struct seq_file *m, struct pg_state *st, pud_t addr,
#if PTRS_PER_PUD > 1
-static void walk_pud_level(struct seq_file *m, struct pg_state *st, pgd_t addr,
+static void walk_pud_level(struct seq_file *m, struct pg_state *st, p4d_t addr,
unsigned long P)
{
int i;
pud_t *start;
pgprotval_t prot;
- start = (pud_t *) pgd_page_vaddr(addr);
+ start = (pud_t *) p4d_page_vaddr(addr);
for (i = 0; i < PTRS_PER_PUD; i++) {
st->current_address = normalize_addr(P + i * PUD_LEVEL_MULT);
@@ -353,9 +354,43 @@ static void walk_pud_level(struct seq_file *m, struct pg_state *st, pgd_t addr,
}
#else
-#define walk_pud_level(m,s,a,p) walk_pmd_level(m,s,__pud(pgd_val(a)),p)
-#define pgd_large(a) pud_large(__pud(pgd_val(a)))
-#define pgd_none(a) pud_none(__pud(pgd_val(a)))
+#define walk_pud_level(m,s,a,p) walk_pmd_level(m,s,__pud(p4d_val(a)),p)
+#define p4d_large(a) pud_large(__pud(p4d_val(a)))
+#define p4d_none(a) pud_none(__pud(p4d_val(a)))
+#endif
+
+#if PTRS_PER_P4D > 1
+
+static void walk_p4d_level(struct seq_file *m, struct pg_state *st, pgd_t addr,
+ unsigned long P)
+{
+ int i;
+ p4d_t *start;
+ pgprotval_t prot;
+
+ start = (p4d_t *) pgd_page_vaddr(addr);
+
+ for (i = 0; i < PTRS_PER_P4D; i++) {
+ st->current_address = normalize_addr(P + i * P4D_LEVEL_MULT);
+ if (!p4d_none(*start)) {
+ if (p4d_large(*start) || !p4d_present(*start)) {
+ prot = p4d_flags(*start);
+ note_page(m, st, __pgprot(prot), 2);
+ } else {
+ walk_pud_level(m, st, *start,
+ P + i * P4D_LEVEL_MULT);
+ }
+ } else
+ note_page(m, st, __pgprot(0), 2);
+
+ start++;
+ }
+}
+
+#else
+#define walk_p4d_level(m,s,a,p) walk_pud_level(m,s,__p4d(pgd_val(a)),p)
+#define pgd_large(a) p4d_large(__p4d(pgd_val(a)))
+#define pgd_none(a) p4d_none(__p4d(pgd_val(a)))
#endif
static inline bool is_hypervisor_range(int idx)
@@ -400,7 +435,7 @@ static void ptdump_walk_pgd_level_core(struct seq_file *m, pgd_t *pgd,
prot = pgd_flags(*start);
note_page(m, &st, __pgprot(prot), 1);
} else {
- walk_pud_level(m, &st, *start,
+ walk_p4d_level(m, &st, *start,
i * PGD_LEVEL_MULT);
}
} else
--
2.10.2
This patch adds support for 5-level paging during early boot.
It generalizes boot for 4- and 5-level paging on 64-bit systems with a
compile-time switch between them.
TODO: XEN support is missing
Not-yet-Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/boot/compressed/head_64.S | 23 ++++++++++--
arch/x86/include/asm/pgtable.h | 2 +-
arch/x86/include/asm/pgtable_64.h | 6 ++-
arch/x86/include/uapi/asm/processor-flags.h | 2 +
arch/x86/kernel/espfix_64.c | 2 +-
arch/x86/kernel/head64.c | 40 ++++++++++++++------
arch/x86/kernel/head_64.S | 58 ++++++++++++++++++++++-------
arch/x86/kernel/machine_kexec_64.c | 2 +-
arch/x86/mm/dump_pagetables.c | 2 +-
arch/x86/mm/kasan_init_64.c | 12 +++---
arch/x86/realmode/init.c | 2 +-
11 files changed, 110 insertions(+), 41 deletions(-)
diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index 0d80a7ad65cd..725c5ee939d1 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -123,9 +123,12 @@ ENTRY(startup_32)
movl %eax, gdt+2(%ebp)
lgdt gdt(%ebp)
- /* Enable PAE mode */
+ /* Enable PAE and LA57 mode */
movl %cr4, %eax
orl $X86_CR4_PAE, %eax
+#ifdef CONFIG_X86_5LEVEL
+ orl $X86_CR4_LA57, %eax
+#endif
movl %eax, %cr4
/*
@@ -137,13 +140,24 @@ ENTRY(startup_32)
movl $(BOOT_INIT_PGT_SIZE/4), %ecx
rep stosl
+ xorl %edx, %edx
+
+ /* Build Top Level */
+ leal pgtable(%ebx,%edx,1), %edi
+ leal 0x1007 (%edi), %eax
+ movl %eax, 0(%edi)
+
+#ifdef CONFIG_X86_5LEVEL
/* Build Level 4 */
- leal pgtable + 0(%ebx), %edi
+ addl $0x1000, %edx
+ leal pgtable(%ebx,%edx), %edi
leal 0x1007 (%edi), %eax
movl %eax, 0(%edi)
+#endif
/* Build Level 3 */
- leal pgtable + 0x1000(%ebx), %edi
+ addl $0x1000, %edx
+ leal pgtable(%ebx,%edx), %edi
leal 0x1007(%edi), %eax
movl $4, %ecx
1: movl %eax, 0x00(%edi)
@@ -153,7 +167,8 @@ ENTRY(startup_32)
jnz 1b
/* Build Level 2 */
- leal pgtable + 0x2000(%ebx), %edi
+ addl $0x1000, %edx
+ leal pgtable(%ebx,%edx), %edi
movl $0x00000183, %eax
movl $2048, %ecx
1: movl %eax, 0(%edi)
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 398adab9a167..8992f0a9ea3a 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -813,7 +813,7 @@ extern pgd_t trampoline_pgd_entry;
static inline void __meminit init_trampoline_default(void)
{
/* Default trampoline pgd value */
- trampoline_pgd_entry = init_level4_pgt[pgd_index(__PAGE_OFFSET)];
+ trampoline_pgd_entry = init_top_pgt[pgd_index(__PAGE_OFFSET)];
}
# ifdef CONFIG_RANDOMIZE_MEMORY
void __meminit init_trampoline(void);
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index bfe276e9af1e..eab09641ef3f 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -14,15 +14,17 @@
#include <linux/bitops.h>
#include <linux/threads.h>
+extern p4d_t level4_kernel_pgt[512];
+extern p4d_t level4_ident_pgt[512];
extern pud_t level3_kernel_pgt[512];
extern pud_t level3_ident_pgt[512];
extern pmd_t level2_kernel_pgt[512];
extern pmd_t level2_fixmap_pgt[512];
extern pmd_t level2_ident_pgt[512];
extern pte_t level1_fixmap_pgt[512];
-extern pgd_t init_level4_pgt[];
+extern pgd_t init_top_pgt[];
-#define swapper_pg_dir init_level4_pgt
+#define swapper_pg_dir init_top_pgt
extern void paging_init(void);
diff --git a/arch/x86/include/uapi/asm/processor-flags.h b/arch/x86/include/uapi/asm/processor-flags.h
index 567de50a4c2a..185f3d10c194 100644
--- a/arch/x86/include/uapi/asm/processor-flags.h
+++ b/arch/x86/include/uapi/asm/processor-flags.h
@@ -104,6 +104,8 @@
#define X86_CR4_OSFXSR _BITUL(X86_CR4_OSFXSR_BIT)
#define X86_CR4_OSXMMEXCPT_BIT 10 /* enable unmasked SSE exceptions */
#define X86_CR4_OSXMMEXCPT _BITUL(X86_CR4_OSXMMEXCPT_BIT)
+#define X86_CR4_LA57_BIT 12 /* enable 5-level page tables */
+#define X86_CR4_LA57 _BITUL(X86_CR4_LA57_BIT)
#define X86_CR4_VMXE_BIT 13 /* enable VMX virtualization */
#define X86_CR4_VMXE _BITUL(X86_CR4_VMXE_BIT)
#define X86_CR4_SMXE_BIT 14 /* enable safer mode (TXT) */
diff --git a/arch/x86/kernel/espfix_64.c b/arch/x86/kernel/espfix_64.c
index f0afa0af4237..e6838b12414b 100644
--- a/arch/x86/kernel/espfix_64.c
+++ b/arch/x86/kernel/espfix_64.c
@@ -134,7 +134,7 @@ void __init init_espfix_bsp(void)
p4d_t *p4d;
/* Install the espfix pud into the kernel page directory */
- pgd_p = &init_level4_pgt[pgd_index(ESPFIX_BASE_ADDR)];
+ pgd_p = &init_top_pgt[pgd_index(ESPFIX_BASE_ADDR)];
switch (CONFIG_PGTABLE_LEVELS) {
case 4:
p4d = p4d_offset(pgd_p, ESPFIX_BASE_ADDR);
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 54a2372f5dbb..f32d22986f47 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -32,7 +32,7 @@
/*
* Manage page tables very early on.
*/
-extern pgd_t early_level4_pgt[PTRS_PER_PGD];
+extern pgd_t early_top_pgt[PTRS_PER_PGD];
extern pmd_t early_dynamic_pgts[EARLY_DYNAMIC_PAGE_TABLES][PTRS_PER_PMD];
static unsigned int __initdata next_early_pgt = 2;
pmdval_t early_pmd_flags = __PAGE_KERNEL_LARGE & ~(_PAGE_GLOBAL | _PAGE_NX);
@@ -40,9 +40,9 @@ pmdval_t early_pmd_flags = __PAGE_KERNEL_LARGE & ~(_PAGE_GLOBAL | _PAGE_NX);
/* Wipe all early page tables except for the kernel symbol map */
static void __init reset_early_page_tables(void)
{
- memset(early_level4_pgt, 0, sizeof(pgd_t)*(PTRS_PER_PGD-1));
+ memset(early_top_pgt, 0, sizeof(pgd_t)*(PTRS_PER_PGD-1));
next_early_pgt = 0;
- write_cr3(__pa_nodebug(early_level4_pgt));
+ write_cr3(__pa_nodebug(early_top_pgt));
}
/* Create a new PMD entry */
@@ -50,15 +50,16 @@ int __init early_make_pgtable(unsigned long address)
{
unsigned long physaddr = address - __PAGE_OFFSET;
pgdval_t pgd, *pgd_p;
+ p4dval_t p4d, *p4d_p;
pudval_t pud, *pud_p;
pmdval_t pmd, *pmd_p;
/* Invalid address or early pgt is done ? */
- if (physaddr >= MAXMEM || read_cr3() != __pa_nodebug(early_level4_pgt))
+ if (physaddr >= MAXMEM || read_cr3() != __pa_nodebug(early_top_pgt))
return -1;
again:
- pgd_p = &early_level4_pgt[pgd_index(address)].pgd;
+ pgd_p = &early_top_pgt[pgd_index(address)].pgd;
pgd = *pgd_p;
/*
@@ -66,8 +67,25 @@ again:
* critical -- __PAGE_OFFSET would point us back into the dynamic
* range and we might end up looping forever...
*/
- if (pgd)
- pud_p = (pudval_t *)((pgd & PTE_PFN_MASK) + __START_KERNEL_map - phys_base);
+ if (!IS_ENABLED(CONFIG_X86_5LEVEL))
+ p4d_p = pgd_p;
+ else if (pgd)
+ p4d_p = (p4dval_t *)((pgd & PTE_PFN_MASK) + __START_KERNEL_map - phys_base);
+ else {
+ if (next_early_pgt >= EARLY_DYNAMIC_PAGE_TABLES) {
+ reset_early_page_tables();
+ goto again;
+ }
+
+ p4d_p = (p4dval_t *)early_dynamic_pgts[next_early_pgt++];
+ memset(p4d_p, 0, sizeof(*p4d_p) * PTRS_PER_P4D);
+ *pgd_p = (pgdval_t)p4d_p - __START_KERNEL_map + phys_base + _KERNPG_TABLE;
+ }
+ p4d_p += p4d_index(address);
+ p4d = *p4d_p;
+
+ if (p4d)
+ pud_p = (pudval_t *)((p4d & PTE_PFN_MASK) + __START_KERNEL_map - phys_base);
else {
if (next_early_pgt >= EARLY_DYNAMIC_PAGE_TABLES) {
reset_early_page_tables();
@@ -76,7 +94,7 @@ again:
pud_p = (pudval_t *)early_dynamic_pgts[next_early_pgt++];
memset(pud_p, 0, sizeof(*pud_p) * PTRS_PER_PUD);
- *pgd_p = (pgdval_t)pud_p - __START_KERNEL_map + phys_base + _KERNPG_TABLE;
+ *p4d_p = (p4dval_t)pud_p - __START_KERNEL_map + phys_base + _KERNPG_TABLE;
}
pud_p += pud_index(address);
pud = *pud_p;
@@ -155,7 +173,7 @@ asmlinkage __visible void __init x86_64_start_kernel(char * real_mode_data)
clear_bss();
- clear_page(init_level4_pgt);
+ clear_page(init_top_pgt);
kasan_early_init();
@@ -170,8 +188,8 @@ asmlinkage __visible void __init x86_64_start_kernel(char * real_mode_data)
*/
load_ucode_bsp();
- /* set init_level4_pgt kernel high mapping*/
- init_level4_pgt[511] = early_level4_pgt[511];
+ /* set init_top_pgt kernel high mapping*/
+ init_top_pgt[511] = early_top_pgt[511];
x86_64_start_reservations(real_mode_data);
}
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 9f8efc9f0075..e1189003db50 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -36,10 +36,14 @@
*
*/
+#define p4d_index(x) (((x) >> P4D_SHIFT) & (PTRS_PER_P4D-1))
#define pud_index(x) (((x) >> PUD_SHIFT) & (PTRS_PER_PUD-1))
-L4_PAGE_OFFSET = pgd_index(__PAGE_OFFSET_BASE)
-L4_START_KERNEL = pgd_index(__START_KERNEL_map)
+PGD_PAGE_OFFSET = pgd_index(__PAGE_OFFSET_BASE)
+PGD_START_KERNEL = pgd_index(__START_KERNEL_map)
+#ifdef CONFIG_X86_5LEVEL
+L4_START_KERNEL = p4d_index(__START_KERNEL_map)
+#endif
L3_START_KERNEL = pud_index(__START_KERNEL_map)
.text
@@ -97,7 +101,11 @@ startup_64:
/*
* Fixup the physical addresses in the page table
*/
- addq %rbp, early_level4_pgt + (L4_START_KERNEL*8)(%rip)
+ addq %rbp, early_top_pgt + (PGD_START_KERNEL*8)(%rip)
+
+#ifdef CONFIG_X86_5LEVEL
+ addq %rbp, level4_kernel_pgt + (511*8)(%rip)
+#endif
addq %rbp, level3_kernel_pgt + (510*8)(%rip)
addq %rbp, level3_kernel_pgt + (511*8)(%rip)
@@ -111,7 +119,7 @@ startup_64:
* it avoids problems around wraparound.
*/
leaq _text(%rip), %rdi
- leaq early_level4_pgt(%rip), %rbx
+ leaq early_top_pgt(%rip), %rbx
movq %rdi, %rax
shrq $PGDIR_SHIFT, %rax
@@ -120,16 +128,26 @@ startup_64:
movq %rdx, 0(%rbx,%rax,8)
movq %rdx, 8(%rbx,%rax,8)
+#ifdef CONFIG_X86_5LEVEL
+ addq $4096, %rbx
+ addq $4096, %rdx
+ movq %rdi, %rax
+ shrq $P4D_SHIFT, %rax
+ andl $(PTRS_PER_P4D-1), %eax
+ movq %rdx, 0(%rbx,%rax,8)
+#endif
+
+ addq $4096, %rbx
addq $4096, %rdx
movq %rdi, %rax
shrq $PUD_SHIFT, %rax
andl $(PTRS_PER_PUD-1), %eax
- movq %rdx, 4096(%rbx,%rax,8)
+ movq %rdx, 0(%rbx,%rax,8)
incl %eax
andl $(PTRS_PER_PUD-1), %eax
- movq %rdx, 4096(%rbx,%rax,8)
+ movq %rdx, 0(%rbx,%rax,8)
- addq $8192, %rbx
+ addq $4096, %rbx
movq %rdi, %rax
shrq $PMD_SHIFT, %rdi
addq $(__PAGE_KERNEL_LARGE_EXEC & ~_PAGE_GLOBAL), %rax
@@ -166,7 +184,7 @@ startup_64:
/* Fixup phys_base */
addq %rbp, phys_base(%rip)
- movq $(early_level4_pgt - __START_KERNEL_map), %rax
+ movq $(early_top_pgt - __START_KERNEL_map), %rax
jmp 1f
ENTRY(secondary_startup_64)
/*
@@ -186,14 +204,17 @@ ENTRY(secondary_startup_64)
/* Sanitize CPU configuration */
call verify_cpu
- movq $(init_level4_pgt - __START_KERNEL_map), %rax
+ movq $(init_top_pgt - __START_KERNEL_map), %rax
1:
- /* Enable PAE mode and PGE */
+ /* Enable PAE mode, PGE and LA57 */
movl $(X86_CR4_PAE | X86_CR4_PGE), %ecx
+#ifdef CONFIG_X86_5LEVEL
+ orl $X86_CR4_LA57, %ecx
+#endif
movq %rcx, %cr4
- /* Setup early boot stage 4 level pagetables. */
+ /* Setup early boot stage 4-/5-level pagetables. */
addq phys_base(%rip), %rax
movq %rax, %cr3
@@ -415,9 +436,13 @@ GLOBAL(name)
.endr
__INITDATA
-NEXT_PAGE(early_level4_pgt)
+NEXT_PAGE(early_top_pgt)
.fill 511,8,0
+#ifdef CONFIG_X86_5LEVEL
+ .quad level4_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE
+#else
.quad level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE
+#endif
NEXT_PAGE(early_dynamic_pgts)
.fill 512*EARLY_DYNAMIC_PAGE_TABLES,8,0
@@ -425,9 +450,10 @@ NEXT_PAGE(early_dynamic_pgts)
.data
#ifndef CONFIG_XEN
-NEXT_PAGE(init_level4_pgt)
+NEXT_PAGE(init_top_pgt)
.fill 512,8,0
#else
+#error FIXME
NEXT_PAGE(init_level4_pgt)
.quad level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
.org init_level4_pgt + L4_PAGE_OFFSET*8, 0
@@ -446,6 +472,12 @@ NEXT_PAGE(level2_ident_pgt)
PMDS(0, __PAGE_KERNEL_IDENT_LARGE_EXEC, PTRS_PER_PMD)
#endif
+#ifdef CONFIG_X86_5LEVEL
+NEXT_PAGE(level4_kernel_pgt)
+ .fill 511,8,0
+ .quad level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE
+#endif
+
NEXT_PAGE(level3_kernel_pgt)
.fill L3_START_KERNEL,8,0
/* (2^48-(2*1024*1024*1024)-((2^39)*511))/(2^30) = 510 */
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index 0a44cf20f939..f9bf209b826c 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -339,7 +339,7 @@ void machine_kexec(struct kimage *image)
void arch_crash_save_vmcoreinfo(void)
{
VMCOREINFO_SYMBOL(phys_base);
- VMCOREINFO_SYMBOL(init_level4_pgt);
+ VMCOREINFO_SYMBOL(init_top_pgt);
#ifdef CONFIG_NUMA
VMCOREINFO_SYMBOL(node_data);
diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
index 15670b55861b..495ab353d576 100644
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -411,7 +411,7 @@ static void ptdump_walk_pgd_level_core(struct seq_file *m, pgd_t *pgd,
bool checkwx)
{
#ifdef CONFIG_X86_64
- pgd_t *start = (pgd_t *) &init_level4_pgt;
+ pgd_t *start = (pgd_t *) &init_top_pgt;
#else
pgd_t *start = swapper_pg_dir;
#endif
diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index e37504e94e8f..2d754cc4e02f 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -9,7 +9,7 @@
#include <asm/tlbflush.h>
#include <asm/sections.h>
-extern pgd_t early_level4_pgt[PTRS_PER_PGD];
+extern pgd_t early_top_pgt[PTRS_PER_PGD];
extern struct range pfn_mapped[E820_X_MAX];
static int __init map_range(struct range *range)
@@ -103,8 +103,8 @@ void __init kasan_early_init(void)
for (i = 0; CONFIG_PGTABLE_LEVELS >= 5 && i < PTRS_PER_P4D; i++)
kasan_zero_p4d[i] = __p4d(p4d_val);
- kasan_map_early_shadow(early_level4_pgt);
- kasan_map_early_shadow(init_level4_pgt);
+ kasan_map_early_shadow(early_top_pgt);
+ kasan_map_early_shadow(init_top_pgt);
}
void __init kasan_init(void)
@@ -115,8 +115,8 @@ void __init kasan_init(void)
register_die_notifier(&kasan_die_notifier);
#endif
- memcpy(early_level4_pgt, init_level4_pgt, sizeof(early_level4_pgt));
- load_cr3(early_level4_pgt);
+ memcpy(early_top_pgt, init_top_pgt, sizeof(early_top_pgt));
+ load_cr3(early_top_pgt);
__flush_tlb_all();
clear_pgds(KASAN_SHADOW_START, KASAN_SHADOW_END);
@@ -142,7 +142,7 @@ void __init kasan_init(void)
kasan_populate_zero_shadow(kasan_mem_to_shadow((void *)MODULES_END),
(void *)KASAN_SHADOW_END);
- load_cr3(init_level4_pgt);
+ load_cr3(init_top_pgt);
__flush_tlb_all();
/*
diff --git a/arch/x86/realmode/init.c b/arch/x86/realmode/init.c
index 5db706f14111..dc0836d5c5eb 100644
--- a/arch/x86/realmode/init.c
+++ b/arch/x86/realmode/init.c
@@ -102,7 +102,7 @@ static void __init setup_real_mode(void)
trampoline_pgd = (u64 *) __va(real_mode_header->trampoline_pgd);
trampoline_pgd[0] = trampoline_pgd_entry.pgd;
- trampoline_pgd[511] = init_level4_pgt[511].pgd;
+ trampoline_pgd[511] = init_top_pgt[511].pgd;
#endif
}
--
2.10.2
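Note that this enables LA57 purely from the compile-time config and never
probes the CPU; hardware support is reported in CPUID.(EAX=7,ECX=0):ECX
bit 16. For completeness, a userspace check using the compiler's cpuid
helper:

#include <stdio.h>
#include <cpuid.h>

int main(void)
{
	unsigned int eax, ebx, ecx, edx;

	__cpuid_count(7, 0, eax, ebx, ecx, edx);
	printf("LA57 %ssupported\n", (ecx & (1U << 16)) ? "" : "not ");
	return 0;
}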
This patch brings support for a non-folded additional page table level.
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/mm/kasan_init_64.c | 18 ++++++++++++++++--
1 file changed, 16 insertions(+), 2 deletions(-)
diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index 2964de48e177..e37504e94e8f 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -50,8 +50,18 @@ static void __init kasan_map_early_shadow(pgd_t *pgd)
unsigned long end = KASAN_SHADOW_END;
for (i = pgd_index(start); start < end; i++) {
- pgd[i] = __pgd(__pa_nodebug(kasan_zero_pud)
- | _KERNPG_TABLE);
+ switch (CONFIG_PGTABLE_LEVELS) {
+ case 4:
+ pgd[i] = __pgd(__pa_nodebug(kasan_zero_pud) |
+ _KERNPG_TABLE);
+ break;
+ case 5:
+ pgd[i] = __pgd(__pa_nodebug(kasan_zero_p4d) |
+ _KERNPG_TABLE);
+ break;
+ default:
+ BUILD_BUG();
+ }
start += PGDIR_SIZE;
}
}
@@ -79,6 +89,7 @@ void __init kasan_early_init(void)
pteval_t pte_val = __pa_nodebug(kasan_zero_page) | __PAGE_KERNEL;
pmdval_t pmd_val = __pa_nodebug(kasan_zero_pte) | _KERNPG_TABLE;
pudval_t pud_val = __pa_nodebug(kasan_zero_pmd) | _KERNPG_TABLE;
+ p4dval_t p4d_val = __pa_nodebug(kasan_zero_pud) | _KERNPG_TABLE;
for (i = 0; i < PTRS_PER_PTE; i++)
kasan_zero_pte[i] = __pte(pte_val);
@@ -89,6 +100,9 @@ void __init kasan_early_init(void)
for (i = 0; i < PTRS_PER_PUD; i++)
kasan_zero_pud[i] = __pud(pud_val);
+ for (i = 0; CONFIG_PGTABLE_LEVELS >= 5 && i < PTRS_PER_P4D; i++)
+ kasan_zero_p4d[i] = __p4d(p4d_val);
+
kasan_map_early_shadow(early_level4_pgt);
kasan_map_early_shadow(init_level4_pgt);
}
--
2.10.2
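The trick here is that the whole early shadow is one chain of shared zero
tables: every p4d slot points at the same kasan_zero_pud, every pud slot
at the same kasan_zero_pmd, and so on. A toy model of the sharing (plain
arrays standing in for the real page tables):

#include <stdio.h>

static void *zero_pud_model[512];	/* stands in for kasan_zero_pud */
static void *zero_p4d_model[512];	/* stands in for kasan_zero_p4d */

int main(void)
{
	int i;

	for (i = 0; i < 512; i++)
		zero_p4d_model[i] = zero_pud_model;
	/* One 4 KiB table serves all 512 slots, i.e. 512 x 512 GiB =
	 * 256 TiB of shadow per p4d table; prints "shared: 1". */
	printf("shared: %d\n", zero_p4d_model[0] == zero_p4d_model[511]);
	return 0;
}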
The first part of the memory map (up to the %esp fixup stacks) simply
scales the existing 4-level paging map by a factor of 2^9 -- 9 being the
number of bits addressed by the additional page table level.
The rest of the map is unchanged.
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
Documentation/x86/x86_64/mm.txt | 23 ++++++++++++++++++++++-
arch/x86/Kconfig | 1 +
arch/x86/include/asm/kasan.h | 9 ++++++---
arch/x86/include/asm/page_64_types.h | 10 ++++++++++
arch/x86/include/asm/pgtable_64_types.h | 6 ++++++
arch/x86/include/asm/sparsemem.h | 9 +++++++--
6 files changed, 52 insertions(+), 6 deletions(-)
diff --git a/Documentation/x86/x86_64/mm.txt b/Documentation/x86/x86_64/mm.txt
index 8c7dd5957ae1..d33fb0799b3d 100644
--- a/Documentation/x86/x86_64/mm.txt
+++ b/Documentation/x86/x86_64/mm.txt
@@ -12,7 +12,7 @@ ffffc90000000000 - ffffe8ffffffffff (=45 bits) vmalloc/ioremap space
ffffe90000000000 - ffffe9ffffffffff (=40 bits) hole
ffffea0000000000 - ffffeaffffffffff (=40 bits) virtual memory map (1TB)
... unused hole ...
-ffffec0000000000 - fffffc0000000000 (=44 bits) kasan shadow memory (16TB)
+ffffec0000000000 - fffffbffffffffff (=44 bits) kasan shadow memory (16TB)
... unused hole ...
ffffff0000000000 - ffffff7fffffffff (=39 bits) %esp fixup stacks
... unused hole ...
@@ -23,6 +23,27 @@ ffffffffa0000000 - ffffffffff5fffff (=1526 MB) module mapping space
ffffffffff600000 - ffffffffffdfffff (=8 MB) vsyscalls
ffffffffffe00000 - ffffffffffffffff (=2 MB) unused hole
+Virtual memory map with 5 level page tables:
+
+0000000000000000 - 00ffffffffffffff (=56 bits) user space, different per mm
+hole caused by [57:63] sign extension
+ff00000000000000 - ff0fffffffffffff (=52 bits) guard hole, reserved for hypervisor
+ff10000000000000 - ff8fffffffffffff (=55 bits) direct mapping of all phys. memory
+ff90000000000000 - ff91ffffffffffff (=49 bits) hole
+ff92000000000000 - ffd1ffffffffffff (=54 bits) vmalloc/ioremap space
+ffd2000000000000 - ffd3ffffffffffff (=49 bits) virtual memory map (512TB)
+... unused hole ...
+ff96000000000000 - ffb5ffffffffffff (=53 bits) kasan shadow memory (8PB)
+... unused hole ...
+fffe000000000000 - fffeffffffffffff (=48 bits) %esp fixup stacks
+... unused hole ...
+ffffffef00000000 - ffffffff00000000 (=64 GB) EFI region mapping space
+... unused hole ...
+ffffffff80000000 - ffffffffa0000000 (=512 MB) kernel text mapping, from phys 0
+ffffffffa0000000 - ffffffffff5fffff (=1526 MB) module mapping space
+ffffffffff600000 - ffffffffffdfffff (=8 MB) vsyscalls
+ffffffffffe00000 - ffffffffffffffff (=2 MB) unused hole
+
The direct mapping covers all memory in the system up to the highest
memory address (this means in some cases it can also include PCI memory
holes).
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 2a1f0ce7c59a..df4f1d514ab0 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -279,6 +279,7 @@ config ARCH_SUPPORTS_DEBUG_PAGEALLOC
config KASAN_SHADOW_OFFSET
hex
depends on KASAN
+ default 0xdfb6000000000000 if X86_5LEVEL
default 0xdffffc0000000000
config HAVE_INTEL_TXT
diff --git a/arch/x86/include/asm/kasan.h b/arch/x86/include/asm/kasan.h
index 1410b567ecde..2587c6bd89be 100644
--- a/arch/x86/include/asm/kasan.h
+++ b/arch/x86/include/asm/kasan.h
@@ -11,9 +11,12 @@
* 'kernel address space start' >> KASAN_SHADOW_SCALE_SHIFT
*/
#define KASAN_SHADOW_START (KASAN_SHADOW_OFFSET + \
- (0xffff800000000000ULL >> 3))
-/* 47 bits for kernel address -> (47 - 3) bits for shadow */
-#define KASAN_SHADOW_END (KASAN_SHADOW_START + (1ULL << (47 - 3)))
+ ((-1UL << __VIRTUAL_MASK_SHIFT) >> 3))
+/*
+ * 47 bits for kernel address -> (47 - 3) bits for shadow
+ * 56 bits for kernel address -> (56 - 3) bits for shadow
+ */
+#define KASAN_SHADOW_END (KASAN_SHADOW_START + (1ULL << (__VIRTUAL_MASK_SHIFT - 3)))
#ifndef __ASSEMBLY__
diff --git a/arch/x86/include/asm/page_64_types.h b/arch/x86/include/asm/page_64_types.h
index 9215e0527647..3f5f08b010d0 100644
--- a/arch/x86/include/asm/page_64_types.h
+++ b/arch/x86/include/asm/page_64_types.h
@@ -36,7 +36,12 @@
* hypervisor to fit. Choosing 16 slots here is arbitrary, but it's
* what Xen requires.
*/
+#ifdef CONFIG_X86_5LEVEL
+#define __PAGE_OFFSET_BASE _AC(0xff10000000000000, UL)
+#else
#define __PAGE_OFFSET_BASE _AC(0xffff880000000000, UL)
+#endif
+
#ifdef CONFIG_RANDOMIZE_MEMORY
#define __PAGE_OFFSET page_offset_base
#else
@@ -46,8 +51,13 @@
#define __START_KERNEL_map _AC(0xffffffff80000000, UL)
/* See Documentation/x86/x86_64/mm.txt for a description of the memory map. */
+#ifdef CONFIG_X86_5LEVEL
+#define __PHYSICAL_MASK_SHIFT 52
+#define __VIRTUAL_MASK_SHIFT 56
+#else
#define __PHYSICAL_MASK_SHIFT 46
#define __VIRTUAL_MASK_SHIFT 47
+#endif
/*
* Kernel image size is limited to 1GiB due to the fixmap living in the
diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
index d15ca53bd462..034cbca37c91 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -56,9 +56,15 @@ typedef struct { pteval_t pte; } pte_t;
/* See Documentation/x86/x86_64/mm.txt for a description of the memory map. */
#define MAXMEM _AC(__AC(1, UL) << MAX_PHYSMEM_BITS, UL)
+#ifdef CONFIG_X86_5LEVEL
+#define VMALLOC_SIZE_TB _AC(16384, UL)
+#define __VMALLOC_BASE _AC(0xff92000000000000, UL)
+#define VMEMMAP_START _AC(0xffd2000000000000, UL)
+#else
#define VMALLOC_SIZE_TB _AC(32, UL)
#define __VMALLOC_BASE _AC(0xffffc90000000000, UL)
#define VMEMMAP_START _AC(0xffffea0000000000, UL)
+#endif
#ifdef CONFIG_RANDOMIZE_MEMORY
#define VMALLOC_START vmalloc_base
#else
diff --git a/arch/x86/include/asm/sparsemem.h b/arch/x86/include/asm/sparsemem.h
index 4517d6b93188..1f5bee2c202f 100644
--- a/arch/x86/include/asm/sparsemem.h
+++ b/arch/x86/include/asm/sparsemem.h
@@ -26,8 +26,13 @@
# endif
#else /* CONFIG_X86_32 */
# define SECTION_SIZE_BITS 27 /* matt - 128 is convenient right now */
-# define MAX_PHYSADDR_BITS 44
-# define MAX_PHYSMEM_BITS 46
+# ifdef CONFIG_X86_5LEVEL
+# define MAX_PHYSADDR_BITS 52
+# define MAX_PHYSMEM_BITS 52
+# else
+# define MAX_PHYSADDR_BITS 44
+# define MAX_PHYSMEM_BITS 46
+# endif
#endif
#endif /* CONFIG_SPARSEMEM */
--
2.10.2
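A quick sanity check of that scaling -- one extra level of 512 entries
adds 9 VA bits:

#include <stdio.h>

int main(void)
{
	printf("4-level user VA: %llu TiB\n", (1ULL << 47) >> 40); /* 128 */
	printf("5-level user VA: %llu PiB\n", (1ULL << 56) >> 50); /* 64 */
	printf("scale factor: %llu (= 2^9)\n",
	       (1ULL << 56) / (1ULL << 47));			   /* 512 */
	return 0;
}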
Extends the pagetable headers to support the new paging mode.
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/include/asm/pgtable_64.h | 11 +++++++++++
arch/x86/include/asm/pgtable_64_types.h | 20 +++++++++++++++++++
arch/x86/include/asm/pgtable_types.h | 10 +++++++++-
arch/x86/mm/pgtable.c | 34 ++++++++++++++++++++++++++++++++-
4 files changed, 73 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index f14bbe95ca08..bfe276e9af1e 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -35,6 +35,13 @@ extern void paging_init(void);
#define pud_ERROR(e) \
pr_err("%s:%d: bad pud %p(%016lx)\n", \
__FILE__, __LINE__, &(e), pud_val(e))
+
+#if CONFIG_PGTABLE_LEVELS >= 5
+#define p4d_ERROR(e) \
+ pr_err("%s:%d: bad p4d %p(%016lx)\n", \
+ __FILE__, __LINE__, &(e), p4d_val(e))
+#endif
+
#define pgd_ERROR(e) \
pr_err("%s:%d: bad pgd %p(%016lx)\n", \
__FILE__, __LINE__, &(e), pgd_val(e))
@@ -113,7 +120,11 @@ static inline void native_set_p4d(p4d_t *p4dp, p4d_t p4d)
static inline void native_p4d_clear(p4d_t *p4d)
{
+#ifdef CONFIG_X86_5LEVEL
+ native_set_p4d(p4d, native_make_p4d(0));
+#else
native_set_p4d(p4d, (p4d_t) { .pgd = native_make_pgd(0)});
+#endif
}
static inline void native_set_pgd(pgd_t *pgdp, pgd_t pgd)
diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
index 034cbca37c91..677525f7d538 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -23,12 +23,32 @@ typedef struct { pteval_t pte; } pte_t;
#define SHARED_KERNEL_PMD 0
+#ifdef CONFIG_X86_5LEVEL
+
+/*
+ * PGDIR_SHIFT determines what a top-level page table entry can map
+ */
+#define PGDIR_SHIFT 48
+#define PTRS_PER_PGD 512
+
+/*
+ * 4th level page in the 5-level paging case
+ */
+#define P4D_SHIFT 39
+#define PTRS_PER_P4D 512
+#define P4D_SIZE (_AC(1, UL) << P4D_SHIFT)
+#define P4D_MASK (~(P4D_SIZE - 1))
+
+#else /* CONFIG_X86_5LEVEL */
+
/*
* PGDIR_SHIFT determines what a top-level page table entry can map
*/
#define PGDIR_SHIFT 39
#define PTRS_PER_PGD 512
+#endif /* CONFIG_X86_5LEVEL */
+
/*
* 3rd level page
*/
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 0af5650e118c..91cc929f27e6 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -273,9 +273,17 @@ static inline pgdval_t pgd_flags(pgd_t pgd)
}
#if CONFIG_PGTABLE_LEVELS > 4
+typedef struct { p4dval_t p4d; } p4d_t;
-#error FIXME
+static inline p4d_t native_make_p4d(pudval_t val)
+{
+ return (p4d_t) { val };
+}
+static inline p4dval_t native_p4d_val(p4d_t p4d)
+{
+ return p4d.p4d;
+}
#else
#include <asm-generic/pgtable-nop4d.h>
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index cc6fcd4040e2..ec2dc0625480 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -81,6 +81,14 @@ void ___pud_free_tlb(struct mmu_gather *tlb, pud_t *pud)
paravirt_release_pud(__pa(pud) >> PAGE_SHIFT);
tlb_remove_page(tlb, virt_to_page(pud));
}
+
+#if CONFIG_PGTABLE_LEVELS > 4
+void ___p4d_free_tlb(struct mmu_gather *tlb, p4d_t *p4d)
+{
+ paravirt_release_p4d(__pa(p4d) >> PAGE_SHIFT);
+ tlb_remove_page(tlb, virt_to_page(p4d));
+}
+#endif /* CONFIG_PGTABLE_LEVELS > 4 */
#endif /* CONFIG_PGTABLE_LEVELS > 3 */
#endif /* CONFIG_PGTABLE_LEVELS > 2 */
@@ -120,7 +128,7 @@ static void pgd_ctor(struct mm_struct *mm, pgd_t *pgd)
references from swapper_pg_dir. */
if (CONFIG_PGTABLE_LEVELS == 2 ||
(CONFIG_PGTABLE_LEVELS == 3 && SHARED_KERNEL_PMD) ||
- CONFIG_PGTABLE_LEVELS == 4) {
+ CONFIG_PGTABLE_LEVELS >= 4) {
clone_pgd_range(pgd + KERNEL_PGD_BOUNDARY,
swapper_pg_dir + KERNEL_PGD_BOUNDARY,
KERNEL_PGD_PTRS);
@@ -551,6 +559,30 @@ void native_set_fixmap(enum fixed_addresses idx, phys_addr_t phys,
}
#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
+#ifdef CONFIG_X86_5LEVEL
+/**
+ * p4d_set_huge - setup kernel P4D mapping
+ *
+ * No 512GB pages yet -- always return 0
+ *
+ * Returns 1 on success and 0 on failure.
+ */
+int p4d_set_huge(p4d_t *p4d, phys_addr_t addr, pgprot_t prot)
+{
+ return 0;
+}
+
+/**
+ * p4d_clear_huge - clear kernel P4D mapping when it is set
+ *
+ * No 512GB pages yet -- always return 0
+ */
+int p4d_clear_huge(p4d_t *p4d)
+{
+ return 0;
+}
+#endif
+
/**
* pud_set_huge - setup kernel PUD mapping
*
--
2.10.2
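With the constants defined above (PGDIR_SHIFT == 48, P4D_SHIFT == 39, 512
entries per level), a virtual address decomposes as below; the sample
address is arbitrary:

#include <stdio.h>

#define IDX(addr, shift) (((addr) >> (shift)) & 511)

int main(void)
{
	unsigned long addr = 0x00ffabcdef123456UL; /* arbitrary 56-bit VA */

	printf("pgd=%lu p4d=%lu pud=%lu pmd=%lu pte=%lu offset=%lu\n",
	       IDX(addr, 48), IDX(addr, 39), IDX(addr, 30),
	       IDX(addr, 21), IDX(addr, 12), addr & 0xfff);
	return 0;
}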
This patch converts x86 to use proper folding of the new page table level
with <asm-generic/pgtable-nop4d.h>.
TODO: split it up further.
FIXME: XEN is broken.
Not-yet-Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/include/asm/paravirt.h | 33 ++++++-
arch/x86/include/asm/paravirt_types.h | 12 ++-
arch/x86/include/asm/pgalloc.h | 35 ++++++-
arch/x86/include/asm/pgtable.h | 75 +++++++++++++--
arch/x86/include/asm/pgtable_64.h | 12 ++-
arch/x86/include/asm/pgtable_types.h | 10 +-
arch/x86/kernel/paravirt.c | 10 +-
arch/x86/mm/init_64.c | 168 ++++++++++++++++++++++++++--------
arch/x86/mm/kasan_init_64.c | 12 ++-
arch/x86/mm/pageattr.c | 56 +++++++++---
arch/x86/platform/efi/efi_64.c | 8 +-
arch/x86/xen/Kconfig | 1 +
12 files changed, 345 insertions(+), 87 deletions(-)
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 2970d22d7766..2196ec33063e 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -534,7 +534,7 @@ static inline void set_pud(pud_t *pudp, pud_t pud)
PVOP_VCALL2(pv_mmu_ops.set_pud, pudp,
val);
}
-#if CONFIG_PGTABLE_LEVELS == 4
+#if CONFIG_PGTABLE_LEVELS >= 4
static inline pud_t __pud(pudval_t val)
{
pudval_t ret;
@@ -563,6 +563,32 @@ static inline pudval_t pud_val(pud_t pud)
return ret;
}
+static inline void pud_clear(pud_t *pudp)
+{
+ set_pud(pudp, __pud(0));
+}
+
+static inline void set_p4d(p4d_t *p4dp, p4d_t p4d)
+{
+ p4dval_t val = native_p4d_val(p4d);
+
+ if (sizeof(p4dval_t) > sizeof(long))
+ PVOP_VCALL3(pv_mmu_ops.set_p4d, p4dp,
+ val, (u64)val >> 32);
+ else
+ PVOP_VCALL2(pv_mmu_ops.set_p4d, p4dp,
+ val);
+}
+
+static inline void p4d_clear(p4d_t *p4dp)
+{
+ set_p4d(p4dp, __p4d(0));
+}
+
+#if CONFIG_PGTABLE_LEVELS >= 5
+
+#error FIXME
+
static inline void set_pgd(pgd_t *pgdp, pgd_t pgd)
{
pgdval_t val = native_pgd_val(pgd);
@@ -580,10 +606,7 @@ static inline void pgd_clear(pgd_t *pgdp)
set_pgd(pgdp, __pgd(0));
}
-static inline void pud_clear(pud_t *pudp)
-{
- set_pud(pudp, __pud(0));
-}
+#endif /* CONFIG_PGTABLE_LEVELS == 5 */
#endif /* CONFIG_PGTABLE_LEVELS == 4 */
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 7fa9e7740ba3..cdfa758ce7de 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -280,12 +280,18 @@ struct pv_mmu_ops {
struct paravirt_callee_save pmd_val;
struct paravirt_callee_save make_pmd;
-#if CONFIG_PGTABLE_LEVELS == 4
+#if CONFIG_PGTABLE_LEVELS >= 4
struct paravirt_callee_save pud_val;
struct paravirt_callee_save make_pud;
- void (*set_pgd)(pgd_t *pudp, pgd_t pgdval);
-#endif /* CONFIG_PGTABLE_LEVELS == 4 */
+ void (*set_p4d)(p4d_t *p4dp, p4d_t p4dval);
+
+#if CONFIG_PGTABLE_LEVELS >= 5
+#error FIXME
+#endif /* CONFIG_PGTABLE_LEVELS >= 5 */
+
+#endif /* CONFIG_PGTABLE_LEVELS >= 4 */
+
#endif /* CONFIG_PGTABLE_LEVELS >= 3 */
struct pv_lazy_ops lazy_mode;
diff --git a/arch/x86/include/asm/pgalloc.h b/arch/x86/include/asm/pgalloc.h
index b6d425999f99..2f585054c63c 100644
--- a/arch/x86/include/asm/pgalloc.h
+++ b/arch/x86/include/asm/pgalloc.h
@@ -121,10 +121,10 @@ static inline void pud_populate(struct mm_struct *mm, pud_t *pud, pmd_t *pmd)
#endif /* CONFIG_X86_PAE */
#if CONFIG_PGTABLE_LEVELS > 3
-static inline void pgd_populate(struct mm_struct *mm, pgd_t *pgd, pud_t *pud)
+static inline void p4d_populate(struct mm_struct *mm, p4d_t *p4d, pud_t *pud)
{
paravirt_alloc_pud(mm, __pa(pud) >> PAGE_SHIFT);
- set_pgd(pgd, __pgd(_PAGE_TABLE | __pa(pud)));
+ set_p4d(p4d, __p4d(_PAGE_TABLE | __pa(pud)));
}
static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
@@ -150,6 +150,37 @@ static inline void __pud_free_tlb(struct mmu_gather *tlb, pud_t *pud,
___pud_free_tlb(tlb, pud);
}
+#if CONFIG_PGTABLE_LEVELS > 4
+static inline void pgd_populate(struct mm_struct *mm, pgd_t *pgd, p4d_t *p4d)
+{
+ paravirt_alloc_p4d(mm, __pa(p4d) >> PAGE_SHIFT);
+ set_pgd(pgd, __pgd(_PAGE_TABLE | __pa(p4d)));
+}
+
+static inline p4d_t *p4d_alloc_one(struct mm_struct *mm, unsigned long addr)
+{
+ gfp_t gfp = GFP_KERNEL_ACCOUNT;
+
+ if (mm == &init_mm)
+ gfp &= ~__GFP_ACCOUNT;
+ return (p4d_t *)get_zeroed_page(gfp);
+}
+
+static inline void p4d_free(struct mm_struct *mm, p4d_t *p4d)
+{
+ BUG_ON((unsigned long)p4d & (PAGE_SIZE-1));
+ free_page((unsigned long)p4d);
+}
+
+extern void ___p4d_free_tlb(struct mmu_gather *tlb, p4d_t *p4d);
+
+static inline void __p4d_free_tlb(struct mmu_gather *tlb, p4d_t *p4d,
+ unsigned long address)
+{
+ ___p4d_free_tlb(tlb, p4d);
+}
+
+#endif /* CONFIG_PGTABLE_LEVELS > 4 */
#endif /* CONFIG_PGTABLE_LEVELS > 3 */
#endif /* CONFIG_PGTABLE_LEVELS > 2 */
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 54b6632723d5..398adab9a167 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -52,11 +52,19 @@ extern struct mm_struct *pgd_page_get_mm(struct page *page);
#define set_pmd(pmdp, pmd) native_set_pmd(pmdp, pmd)
-#ifndef __PAGETABLE_PUD_FOLDED
+#ifndef __PAGETABLE_P4D_FOLDED
#define set_pgd(pgdp, pgd) native_set_pgd(pgdp, pgd)
#define pgd_clear(pgd) native_pgd_clear(pgd)
#endif
+#ifndef set_p4d
+# define set_p4d(p4dp, p4d) native_set_p4d(p4dp, p4d)
+#endif
+
+#ifndef __PAGETABLE_PUD_FOLDED
+#define p4d_clear(p4d) native_p4d_clear(p4d)
+#endif
+
#ifndef set_pud
# define set_pud(pudp, pud) native_set_pud(pudp, pud)
#endif
@@ -73,6 +81,11 @@ extern struct mm_struct *pgd_page_get_mm(struct page *page);
#define pgd_val(x) native_pgd_val(x)
#define __pgd(x) native_make_pgd(x)
+#ifndef __PAGETABLE_P4D_FOLDED
+#define p4d_val(x) native_p4d_val(x)
+#define __p4d(x) native_make_p4d(x)
+#endif
+
#ifndef __PAGETABLE_PUD_FOLDED
#define pud_val(x) native_pud_val(x)
#define __pud(x) native_make_pud(x)
@@ -439,6 +452,7 @@ static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
#define pte_pgprot(x) __pgprot(pte_flags(x))
#define pmd_pgprot(x) __pgprot(pmd_flags(x))
#define pud_pgprot(x) __pgprot(pud_flags(x))
+#define p4d_pgprot(x) __pgprot(p4d_flags(x))
#define canon_pgprot(p) __pgprot(massage_pgprot(p))
@@ -671,12 +685,58 @@ static inline int pud_large(pud_t pud)
}
#endif /* CONFIG_PGTABLE_LEVELS > 2 */
+#if CONFIG_PGTABLE_LEVELS > 3
+static inline int p4d_none(p4d_t p4d)
+{
+ return (native_p4d_val(p4d) & ~(_PAGE_KNL_ERRATUM_MASK)) == 0;
+}
+
+static inline int p4d_present(p4d_t p4d)
+{
+ return p4d_flags(p4d) & _PAGE_PRESENT;
+}
+
+static inline unsigned long p4d_page_vaddr(p4d_t p4d)
+{
+ return (unsigned long)__va(p4d_val(p4d) & p4d_pfn_mask(p4d));
+}
+
+/*
+ * Currently stuck as a macro due to indirect forward reference to
+ * linux/mmzone.h's __section_mem_map_addr() definition:
+ */
+#define p4d_page(p4d) \
+ pfn_to_page((p4d_val(p4d) & p4d_pfn_mask(p4d)) >> PAGE_SHIFT)
+
+/*
+ * the pud page can be thought of an array like this: pud_t[PTRS_PER_PUD]
+ *
+ * this macro returns the index of the entry in the pud page which would
+ * control the given virtual address
+ */
+static inline unsigned long pud_index(unsigned long address)
+{
+ return (address >> PUD_SHIFT) & (PTRS_PER_PUD - 1);
+}
+
+/* Find an entry in the third-level page table.. */
+static inline pud_t *pud_offset(p4d_t *p4d, unsigned long address)
+{
+ return (pud_t *)p4d_page_vaddr(*p4d) + pud_index(address);
+}
+
+static inline int p4d_bad(p4d_t p4d)
+{
+ return (p4d_flags(p4d) & ~(_KERNPG_TABLE | _PAGE_USER)) != 0;
+}
+#endif /* CONFIG_PGTABLE_LEVELS > 3 */
+
static inline unsigned long p4d_index(unsigned long address)
{
return (address >> P4D_SHIFT) & (PTRS_PER_P4D - 1);
}
-#if CONFIG_PGTABLE_LEVELS > 3
+#if CONFIG_PGTABLE_LEVELS > 4
static inline int pgd_present(pgd_t pgd)
{
return pgd_flags(pgd) & _PAGE_PRESENT;
@@ -694,14 +754,9 @@ static inline unsigned long pgd_page_vaddr(pgd_t pgd)
#define pgd_page(pgd) pfn_to_page(pgd_val(pgd) >> PAGE_SHIFT)
/* to find an entry in a page-table-directory. */
-static inline unsigned long pud_index(unsigned long address)
-{
- return (address >> PUD_SHIFT) & (PTRS_PER_PUD - 1);
-}
-
-static inline pud_t *pud_offset(pgd_t *pgd, unsigned long address)
+static inline p4d_t *p4d_offset(pgd_t *pgd, unsigned long address)
{
- return (pud_t *)pgd_page_vaddr(*pgd) + pud_index(address);
+ return (p4d_t *)pgd_page_vaddr(*pgd) + p4d_index(address);
}
static inline int pgd_bad(pgd_t pgd)
@@ -719,7 +774,7 @@ static inline int pgd_none(pgd_t pgd)
*/
return !native_pgd_val(pgd);
}
-#endif /* CONFIG_PGTABLE_LEVELS > 3 */
+#endif /* CONFIG_PGTABLE_LEVELS > 4 */
#endif /* __ASSEMBLY__ */
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index 1cc82ece9ac1..f14bbe95ca08 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -41,7 +41,7 @@ extern void paging_init(void);
struct mm_struct;
-void set_pte_vaddr_pud(pud_t *pud_page, unsigned long vaddr, pte_t new_pte);
+void set_pte_vaddr_p4d(pgd_t *pgd, unsigned long vaddr, pte_t new_pte);
static inline void native_pte_clear(struct mm_struct *mm, unsigned long addr,
@@ -106,6 +106,16 @@ static inline void native_pud_clear(pud_t *pud)
native_set_pud(pud, native_make_pud(0));
}
+static inline void native_set_p4d(p4d_t *p4dp, p4d_t p4d)
+{
+ *p4dp = p4d;
+}
+
+static inline void native_p4d_clear(p4d_t *p4d)
+{
+ native_set_p4d(p4d, (p4d_t) { .pgd = native_make_pgd(0)});
+}
+
static inline void native_set_pgd(pgd_t *pgdp, pgd_t pgd)
{
*pgdp = pgd;
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 4aa91e440b4a..0af5650e118c 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -277,11 +277,11 @@ static inline pgdval_t pgd_flags(pgd_t pgd)
#error FIXME
#else
-#include <asm-generic/5level-fixup.h>
+#include <asm-generic/pgtable-nop4d.h>
static inline p4dval_t native_p4d_val(p4d_t p4d)
{
- return native_pgd_val(p4d);
+ return native_pgd_val(p4d.pgd);
}
#endif
@@ -298,12 +298,11 @@ static inline pudval_t native_pud_val(pud_t pud)
return pud.pud;
}
#else
-#define __ARCH_USE_5LEVEL_HACK
#include <asm-generic/pgtable-nopud.h>
static inline pudval_t native_pud_val(pud_t pud)
{
- return native_pgd_val(pud.pgd);
+ return native_pgd_val(pud.p4d.pgd);
}
#endif
@@ -320,12 +319,11 @@ static inline pmdval_t native_pmd_val(pmd_t pmd)
return pmd.pmd;
}
#else
-#define __ARCH_USE_5LEVEL_HACK
#include <asm-generic/pgtable-nopmd.h>
static inline pmdval_t native_pmd_val(pmd_t pmd)
{
- return native_pgd_val(pmd.pud.pgd);
+ return native_pgd_val(pmd.pud.p4d.pgd);
}
#endif
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 1acfd76e3e26..d81c0c4e6bcf 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -431,12 +431,16 @@ struct pv_mmu_ops pv_mmu_ops = {
.pmd_val = PTE_IDENT,
.make_pmd = PTE_IDENT,
-#if CONFIG_PGTABLE_LEVELS == 4
+#if CONFIG_PGTABLE_LEVELS >= 4
.pud_val = PTE_IDENT,
.make_pud = PTE_IDENT,
- .set_pgd = native_set_pgd,
-#endif
+ .set_p4d = native_set_p4d,
+
+#if CONFIG_PGTABLE_LEVELS >= 5
+#error FIXME
+#endif /* CONFIG_PGTABLE_LEVELS >= 5 */
+#endif /* CONFIG_PGTABLE_LEVELS >= 4 */
#endif /* CONFIG_PGTABLE_LEVELS >= 3 */
.pte_val = PTE_IDENT,
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 14b9dd71d9e8..a991f5c4c2c4 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -97,37 +97,47 @@ void sync_global_pgds(unsigned long start, unsigned long end, int removed)
unsigned long address;
for (address = start; address <= end; address += PGDIR_SIZE) {
- const pgd_t *pgd_ref = pgd_offset_k(address);
+ pgd_t *pgd_ref = pgd_offset_k(address);
+ const p4d_t *p4d_ref;
struct page *page;
/*
- * When it is called after memory hot remove, pgd_none()
+ * With folded p4d, pgd_none() is always false; we need to
+ * handle synchronization at the p4d level instead.
+ */
+ BUILD_BUG_ON(pgd_none(*pgd_ref));
+ p4d_ref = p4d_offset(pgd_ref, address);
+
+ /*
+ * When it is called after memory hot remove, p4d_none()
* returns true. In this case (removed == 1), we must clear
- * the PGD entries in the local PGD level page.
+ * the P4D entries in the local P4D level page.
*/
- if (pgd_none(*pgd_ref) && !removed)
+ if (p4d_none(*p4d_ref) && !removed)
continue;
spin_lock(&pgd_lock);
list_for_each_entry(page, &pgd_list, lru) {
pgd_t *pgd;
+ p4d_t *p4d;
spinlock_t *pgt_lock;
pgd = (pgd_t *)page_address(page) + pgd_index(address);
+ p4d = p4d_offset(pgd, address);
/* the pgt_lock only for Xen */
pgt_lock = &pgd_page_get_mm(page)->page_table_lock;
spin_lock(pgt_lock);
- if (!pgd_none(*pgd_ref) && !pgd_none(*pgd))
- BUG_ON(pgd_page_vaddr(*pgd)
- != pgd_page_vaddr(*pgd_ref));
+ if (!p4d_none(*p4d_ref) && !p4d_none(*p4d))
+ BUG_ON(p4d_page_vaddr(*p4d)
+ != p4d_page_vaddr(*p4d_ref));
if (removed) {
- if (pgd_none(*pgd_ref) && !pgd_none(*pgd))
- pgd_clear(pgd);
+ if (p4d_none(*p4d_ref) && !p4d_none(*p4d))
+ p4d_clear(p4d);
} else {
- if (pgd_none(*pgd))
- set_pgd(pgd, *pgd_ref);
+ if (p4d_none(*p4d))
+ set_p4d(p4d, *p4d_ref);
}
spin_unlock(pgt_lock);
@@ -159,16 +169,28 @@ static __ref void *spp_getpage(void)
return ptr;
}
-static pud_t *fill_pud(pgd_t *pgd, unsigned long vaddr)
+static p4d_t *fill_p4d(pgd_t *pgd, unsigned long vaddr)
{
if (pgd_none(*pgd)) {
- pud_t *pud = (pud_t *)spp_getpage();
- pgd_populate(&init_mm, pgd, pud);
- if (pud != pud_offset(pgd, 0))
+ p4d_t *p4d = (p4d_t *)spp_getpage();
+ pgd_populate(&init_mm, pgd, p4d);
+ if (p4d != p4d_offset(pgd, 0))
printk(KERN_ERR "PAGETABLE BUG #00! %p <-> %p\n",
- pud, pud_offset(pgd, 0));
+ p4d, p4d_offset(pgd, 0));
+ }
+ return p4d_offset(pgd, vaddr);
+}
+
+static pud_t *fill_pud(p4d_t *p4d, unsigned long vaddr)
+{
+ if (p4d_none(*p4d)) {
+ pud_t *pud = (pud_t *)spp_getpage();
+ p4d_populate(&init_mm, p4d, pud);
+ if (pud != pud_offset(p4d, 0))
+ printk(KERN_ERR "PAGETABLE BUG #01! %p <-> %p\n",
+ pud, pud_offset(p4d, 0));
}
- return pud_offset(pgd, vaddr);
+ return pud_offset(p4d, vaddr);
}
static pmd_t *fill_pmd(pud_t *pud, unsigned long vaddr)
@@ -177,7 +199,7 @@ static pmd_t *fill_pmd(pud_t *pud, unsigned long vaddr)
pmd_t *pmd = (pmd_t *) spp_getpage();
pud_populate(&init_mm, pud, pmd);
if (pmd != pmd_offset(pud, 0))
- printk(KERN_ERR "PAGETABLE BUG #01! %p <-> %p\n",
+ printk(KERN_ERR "PAGETABLE BUG #02! %p <-> %p\n",
pmd, pmd_offset(pud, 0));
}
return pmd_offset(pud, vaddr);
@@ -189,18 +211,20 @@ static pte_t *fill_pte(pmd_t *pmd, unsigned long vaddr)
pte_t *pte = (pte_t *) spp_getpage();
pmd_populate_kernel(&init_mm, pmd, pte);
if (pte != pte_offset_kernel(pmd, 0))
- printk(KERN_ERR "PAGETABLE BUG #02!\n");
+ printk(KERN_ERR "PAGETABLE BUG #03!\n");
}
return pte_offset_kernel(pmd, vaddr);
}
-void set_pte_vaddr_pud(pud_t *pud_page, unsigned long vaddr, pte_t new_pte)
+void set_pte_vaddr_p4d(pgd_t *pgd, unsigned long vaddr, pte_t new_pte)
{
+ p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
pte_t *pte;
- pud = pud_page + pud_index(vaddr);
+ p4d = fill_p4d(pgd, vaddr);
+ pud = fill_pud(p4d, vaddr);
pmd = fill_pmd(pud, vaddr);
pte = fill_pte(pmd, vaddr);
@@ -216,7 +240,6 @@ void set_pte_vaddr_pud(pud_t *pud_page, unsigned long vaddr, pte_t new_pte)
void set_pte_vaddr(unsigned long vaddr, pte_t pteval)
{
pgd_t *pgd;
- pud_t *pud_page;
pr_debug("set_pte_vaddr %lx to %lx\n", vaddr, native_pte_val(pteval));
@@ -226,17 +249,18 @@ void set_pte_vaddr(unsigned long vaddr, pte_t pteval)
"PGD FIXMAP MISSING, it should be setup in head.S!\n");
return;
}
- pud_page = (pud_t*)pgd_page_vaddr(*pgd);
- set_pte_vaddr_pud(pud_page, vaddr, pteval);
+ set_pte_vaddr_p4d(pgd, vaddr, pteval);
}
pmd_t * __init populate_extra_pmd(unsigned long vaddr)
{
pgd_t *pgd;
+ p4d_t *p4d;
pud_t *pud;
pgd = pgd_offset_k(vaddr);
- pud = fill_pud(pgd, vaddr);
+ p4d = fill_p4d(pgd, vaddr);
+ pud = fill_pud(p4d, vaddr);
return fill_pmd(pud, vaddr);
}
@@ -255,6 +279,7 @@ static void __init __init_extra_mapping(unsigned long phys, unsigned long size,
enum page_cache_mode cache)
{
pgd_t *pgd;
+ p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
pgprot_t prot;
@@ -265,11 +290,17 @@ static void __init __init_extra_mapping(unsigned long phys, unsigned long size,
for (; size; phys += PMD_SIZE, size -= PMD_SIZE) {
pgd = pgd_offset_k((unsigned long)__va(phys));
if (pgd_none(*pgd)) {
+ p4d = (p4d_t *) spp_getpage();
+ set_pgd(pgd, __pgd(__pa(p4d) | _KERNPG_TABLE |
+ _PAGE_USER));
+ }
+ p4d = p4d_offset(pgd, (unsigned long)__va(phys));
+ if (p4d_none(*p4d)) {
pud = (pud_t *) spp_getpage();
- set_pgd(pgd, __pgd(__pa(pud) | _KERNPG_TABLE |
+ set_p4d(p4d, __p4d(__pa(pud) | _KERNPG_TABLE |
_PAGE_USER));
}
- pud = pud_offset(pgd, (unsigned long)__va(phys));
+ pud = pud_offset(p4d, (unsigned long)__va(phys));
if (pud_none(*pud)) {
pmd = (pmd_t *) spp_getpage();
set_pud(pud, __pud(__pa(pmd) | _KERNPG_TABLE |
@@ -573,12 +604,15 @@ kernel_physical_mapping_init(unsigned long paddr_start,
for (; vaddr < vaddr_end; vaddr = vaddr_next) {
pgd_t *pgd = pgd_offset_k(vaddr);
+ p4d_t *p4d;
pud_t *pud;
vaddr_next = (vaddr & PGDIR_MASK) + PGDIR_SIZE;
- if (pgd_val(*pgd)) {
- pud = (pud_t *)pgd_page_vaddr(*pgd);
+ BUILD_BUG_ON(pgd_none(*pgd));
+ p4d = p4d_offset(pgd, vaddr);
+ if (p4d_val(*p4d)) {
+ pud = (pud_t *)p4d_page_vaddr(*p4d);
paddr_last = phys_pud_init(pud, __pa(vaddr),
__pa(vaddr_end),
page_size_mask);
@@ -590,7 +624,7 @@ kernel_physical_mapping_init(unsigned long paddr_start,
page_size_mask);
spin_lock(&init_mm.page_table_lock);
- pgd_populate(&init_mm, pgd, pud);
+ p4d_populate(&init_mm, p4d, pud);
spin_unlock(&init_mm.page_table_lock);
pgd_changed = true;
}
@@ -736,6 +770,24 @@ static void __meminit free_pmd_table(pmd_t *pmd_start, pud_t *pud)
spin_unlock(&init_mm.page_table_lock);
}
+static void __meminit free_pud_table(pud_t *pud_start, p4d_t *p4d)
+{
+ pud_t *pud;
+ int i;
+
+ for (i = 0; i < PTRS_PER_PUD; i++) {
+ pud = pud_start + i;
+ if (!pud_none(*pud))
+ return;
+ }
+
+ /* free a pud table */
+ free_pagetable(p4d_page(*p4d), 0);
+ spin_lock(&init_mm.page_table_lock);
+ p4d_clear(p4d);
+ spin_unlock(&init_mm.page_table_lock);
+}
+
static void __meminit
remove_pte_table(pte_t *pte_start, unsigned long addr, unsigned long end,
bool direct)
@@ -918,6 +970,32 @@ remove_pud_table(pud_t *pud_start, unsigned long addr, unsigned long end,
update_page_count(PG_LEVEL_1G, -pages);
}
+static void __meminit
+remove_p4d_table(p4d_t *p4d_start, unsigned long addr, unsigned long end,
+ bool direct)
+{
+ unsigned long next, pages = 0;
+ pud_t *pud_base;
+ p4d_t *p4d;
+
+ p4d = p4d_start + p4d_index(addr);
+ for (; addr < end; addr = next, p4d++) {
+ next = p4d_addr_end(addr, end);
+
+ if (!p4d_present(*p4d))
+ continue;
+
+ BUILD_BUG_ON(p4d_large(*p4d));
+
+ pud_base = (pud_t *)p4d_page_vaddr(*p4d);
+ remove_pud_table(pud_base, addr, next, direct);
+ free_pud_table(pud_base, p4d);
+ }
+
+ if (direct)
+ update_page_count(PG_LEVEL_512G, -pages);
+}
+
/* start and end are both virtual address. */
static void __meminit
remove_pagetable(unsigned long start, unsigned long end, bool direct)
@@ -925,7 +1003,7 @@ remove_pagetable(unsigned long start, unsigned long end, bool direct)
unsigned long next;
unsigned long addr;
pgd_t *pgd;
- pud_t *pud;
+ p4d_t *p4d;
for (addr = start; addr < end; addr = next) {
next = pgd_addr_end(addr, end);
@@ -934,8 +1012,8 @@ remove_pagetable(unsigned long start, unsigned long end, bool direct)
if (!pgd_present(*pgd))
continue;
- pud = (pud_t *)pgd_page_vaddr(*pgd);
- remove_pud_table(pud, addr, next, direct);
+ p4d = (p4d_t *)pgd_page_vaddr(*pgd);
+ remove_p4d_table(p4d, addr, next, direct);
}
flush_tlb_all();
@@ -1105,6 +1183,7 @@ int kern_addr_valid(unsigned long addr)
{
unsigned long above = ((long)addr) >> __VIRTUAL_MASK_SHIFT;
pgd_t *pgd;
+ p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
pte_t *pte;
@@ -1116,7 +1195,11 @@ int kern_addr_valid(unsigned long addr)
if (pgd_none(*pgd))
return 0;
- pud = pud_offset(pgd, addr);
+ p4d = p4d_offset(pgd, addr);
+ if (p4d_none(*p4d))
+ return 0;
+
+ pud = pud_offset(p4d, addr);
if (pud_none(*pud))
return 0;
@@ -1173,6 +1256,7 @@ static int __meminit vmemmap_populate_hugepages(unsigned long start,
unsigned long addr;
unsigned long next;
pgd_t *pgd;
+ p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
@@ -1183,7 +1267,11 @@ static int __meminit vmemmap_populate_hugepages(unsigned long start,
if (!pgd)
return -ENOMEM;
- pud = vmemmap_pud_populate(pgd, addr, node);
+ p4d = vmemmap_p4d_populate(pgd, addr, node);
+ if (!p4d)
+ return -ENOMEM;
+
+ pud = vmemmap_pud_populate(p4d, addr, node);
if (!pud)
return -ENOMEM;
@@ -1251,6 +1339,7 @@ void register_page_bootmem_memmap(unsigned long section_nr,
unsigned long end = (unsigned long)(start_page + size);
unsigned long next;
pgd_t *pgd;
+ p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
unsigned int nr_pages;
@@ -1266,7 +1355,14 @@ void register_page_bootmem_memmap(unsigned long section_nr,
}
get_page_bootmem(section_nr, pgd_page(*pgd), MIX_SECTION_INFO);
- pud = pud_offset(pgd, addr);
+ p4d = p4d_offset(pgd, addr);
+ if (p4d_none(*p4d)) {
+ next = (addr + PAGE_SIZE) & PAGE_MASK;
+ continue;
+ }
+ get_page_bootmem(section_nr, p4d_page(*p4d), MIX_SECTION_INFO);
+
+ pud = pud_offset(p4d, addr);
if (pud_none(*pud)) {
next = (addr + PAGE_SIZE) & PAGE_MASK;
continue;
diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index 0493c17b8a51..2964de48e177 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -31,8 +31,16 @@ static int __init map_range(struct range *range)
static void __init clear_pgds(unsigned long start,
unsigned long end)
{
- for (; start < end; start += PGDIR_SIZE)
- pgd_clear(pgd_offset_k(start));
+ pgd_t *pgd;
+
+ for (; start < end; start += PGDIR_SIZE) {
+ pgd = pgd_offset_k(start);
+#ifdef __PAGETABLE_P4D_FOLDED
+ p4d_clear(p4d_offset(pgd, start));
+#else
+ pgd_clear(pgd);
+#endif
+ }
}
static void __init kasan_map_early_shadow(pgd_t *pgd)
diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index e3353c97d086..1cf11ffeb4c1 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -333,6 +333,7 @@ static inline pgprot_t static_protections(pgprot_t prot, unsigned long address,
pte_t *lookup_address_in_pgd(pgd_t *pgd, unsigned long address,
unsigned int *level)
{
+ p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
@@ -341,7 +342,15 @@ pte_t *lookup_address_in_pgd(pgd_t *pgd, unsigned long address,
if (pgd_none(*pgd))
return NULL;
- pud = pud_offset(pgd, address);
+ p4d = p4d_offset(pgd, address);
+ if (p4d_none(*p4d))
+ return NULL;
+
+ *level = PG_LEVEL_512G;
+ if (p4d_large(*p4d) || !p4d_present(*p4d))
+ return (pte_t *)p4d;
+
+ pud = pud_offset(p4d, address);
if (pud_none(*pud))
return NULL;
@@ -393,13 +402,18 @@ static pte_t *_lookup_address_cpa(struct cpa_data *cpa, unsigned long address,
pmd_t *lookup_pmd_address(unsigned long address)
{
pgd_t *pgd;
+ p4d_t *p4d;
pud_t *pud;
pgd = pgd_offset_k(address);
if (pgd_none(*pgd))
return NULL;
- pud = pud_offset(pgd, address);
+ p4d = p4d_offset(pgd, address);
+ if (p4d_none(*p4d) || p4d_large(*p4d) || !p4d_present(*p4d))
+ return NULL;
+
+ pud = pud_offset(p4d, address);
if (pud_none(*pud) || pud_large(*pud) || !pud_present(*pud))
return NULL;
@@ -464,11 +478,13 @@ static void __set_pmd_pte(pte_t *kpte, unsigned long address, pte_t pte)
list_for_each_entry(page, &pgd_list, lru) {
pgd_t *pgd;
+ p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
pgd = (pgd_t *)page_address(page) + pgd_index(address);
- pud = pud_offset(pgd, address);
+ p4d = p4d_offset(pgd, address);
+ pud = pud_offset(p4d, address);
pmd = pmd_offset(pud, address);
set_pte_atomic((pte_t *)pmd, pte);
}
@@ -823,9 +839,9 @@ static void unmap_pmd_range(pud_t *pud, unsigned long start, unsigned long end)
pud_clear(pud);
}
-static void unmap_pud_range(pgd_t *pgd, unsigned long start, unsigned long end)
+static void unmap_pud_range(p4d_t *p4d, unsigned long start, unsigned long end)
{
- pud_t *pud = pud_offset(pgd, start);
+ pud_t *pud = pud_offset(p4d, start);
/*
* Not on a GB page boundary?
@@ -991,8 +1007,8 @@ static long populate_pmd(struct cpa_data *cpa,
return num_pages;
}
-static long populate_pud(struct cpa_data *cpa, unsigned long start, pgd_t *pgd,
- pgprot_t pgprot)
+static long populate_pud(struct cpa_data *cpa, unsigned long start, p4d_t *p4d,
+ pgprot_t pgprot)
{
pud_t *pud;
unsigned long end;
@@ -1013,7 +1029,7 @@ static long populate_pud(struct cpa_data *cpa, unsigned long start, pgd_t *pgd,
cur_pages = (pre_end - start) >> PAGE_SHIFT;
cur_pages = min_t(int, (int)cpa->numpages, cur_pages);
- pud = pud_offset(pgd, start);
+ pud = pud_offset(p4d, start);
/*
* Need a PMD page?
@@ -1034,7 +1050,7 @@ static long populate_pud(struct cpa_data *cpa, unsigned long start, pgd_t *pgd,
if (cpa->numpages == cur_pages)
return cur_pages;
- pud = pud_offset(pgd, start);
+ pud = pud_offset(p4d, start);
pud_pgprot = pgprot_4k_2_large(pgprot);
/*
@@ -1054,7 +1070,7 @@ static long populate_pud(struct cpa_data *cpa, unsigned long start, pgd_t *pgd,
if (start < end) {
long tmp;
- pud = pud_offset(pgd, start);
+ pud = pud_offset(p4d, start);
if (pud_none(*pud))
if (alloc_pmd_page(pud))
return -1;
@@ -1077,33 +1093,43 @@ static int populate_pgd(struct cpa_data *cpa, unsigned long addr)
{
pgprot_t pgprot = __pgprot(_KERNPG_TABLE);
pud_t *pud = NULL; /* shut up gcc */
+ p4d_t *p4d;
pgd_t *pgd_entry;
long ret;
pgd_entry = cpa->pgd + pgd_index(addr);
+ if (pgd_none(*pgd_entry)) {
+ p4d = (p4d_t *)get_zeroed_page(GFP_KERNEL | __GFP_NOTRACK);
+ if (!p4d)
+ return -1;
+
+ set_pgd(pgd_entry, __pgd(__pa(p4d) | _KERNPG_TABLE));
+ }
+
/*
- * Allocate a PUD page and hand it down for mapping.
+ * Allocate a P4D page and hand it down for mapping.
*/
- if (pgd_none(*pgd_entry)) {
+ p4d = p4d_offset(pgd_entry, addr);
+ if (p4d_none(*p4d)) {
pud = (pud_t *)get_zeroed_page(GFP_KERNEL | __GFP_NOTRACK);
if (!pud)
return -1;
- set_pgd(pgd_entry, __pgd(__pa(pud) | _KERNPG_TABLE));
+ set_p4d(p4d, __p4d(__pa(pud) | _KERNPG_TABLE));
}
pgprot_val(pgprot) &= ~pgprot_val(cpa->mask_clr);
pgprot_val(pgprot) |= pgprot_val(cpa->mask_set);
- ret = populate_pud(cpa, addr, pgd_entry, pgprot);
+ ret = populate_pud(cpa, addr, p4d, pgprot);
if (ret < 0) {
/*
* Leave the PUD page in place in case some other CPU or thread
* already found it, but remove any useless entries we just
* added to it.
*/
- unmap_pud_range(pgd_entry, addr,
+ unmap_pud_range(p4d, addr,
addr + (cpa->numpages << PAGE_SHIFT));
return ret;
}
diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
index ac4b0cbd479b..9c4f3cd31bf0 100644
--- a/arch/x86/platform/efi/efi_64.c
+++ b/arch/x86/platform/efi/efi_64.c
@@ -134,7 +134,7 @@ static pgd_t *efi_pgd;
int __init efi_alloc_page_tables(void)
{
pgd_t *pgd;
- pud_t *pud;
+ p4d_t *p4d;
gfp_t gfp_mask;
if (efi_enabled(EFI_OLD_MEMMAP))
@@ -147,13 +147,13 @@ int __init efi_alloc_page_tables(void)
pgd = efi_pgd + pgd_index(EFI_VA_END);
- pud = pud_alloc_one(NULL, 0);
- if (!pud) {
+ p4d = p4d_alloc_one(NULL, 0);
+ if (!p4d) {
free_page((unsigned long)efi_pgd);
return -ENOMEM;
}
- pgd_populate(NULL, pgd, pud);
+ pgd_populate(NULL, pgd, p4d);
return 0;
}
diff --git a/arch/x86/xen/Kconfig b/arch/x86/xen/Kconfig
index c7b15f3e2cf3..2aecee939095 100644
--- a/arch/x86/xen/Kconfig
+++ b/arch/x86/xen/Kconfig
@@ -4,6 +4,7 @@
config XEN
bool "Xen guest support"
+ depends on BROKEN
depends on PARAVIRT
select PARAVIRT_CLOCK
select XEN_HAVE_PVMMU
--
2.10.2
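Worth spelling out what the conversion means for callers: with
<asm-generic/pgtable-nop4d.h> the new level folds away completely on
4-level kernels -- p4d_offset() degenerates into returning the pgd
entry -- so every page table walk gains one step syntactically at zero
runtime cost. A minimal sketch of the resulting canonical walk
(illustrative, not taken from the patch; the usual *_none() checks and
pte_unmap() are omitted):

	pgd_t *pgd = pgd_offset(mm, addr);
	p4d_t *p4d = p4d_offset(pgd, addr);	/* (p4d_t *)pgd when folded */
	pud_t *pud = pud_offset(p4d, addr);
	pmd_t *pmd = pmd_offset(pud, addr);
	pte_t *pte = pte_offset_map(pmd, addr);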
For full 5-level paging we need a helper to allocate a p4d page table.
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/memory.c | 23 +++++++++++++++++++++++
1 file changed, 23 insertions(+)
diff --git a/mm/memory.c b/mm/memory.c
index fd1a3319f413..ce7639fdeeda 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3789,6 +3789,29 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
}
EXPORT_SYMBOL_GPL(handle_mm_fault);
+#ifndef __PAGETABLE_P4D_FOLDED
+/*
+ * Allocate p4d page table.
+ * We've already handled the fast-path in-line.
+ */
+int __p4d_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address)
+{
+ p4d_t *new = p4d_alloc_one(mm, address);
+ if (!new)
+ return -ENOMEM;
+
+ smp_wmb(); /* See comment in __pte_alloc */
+
+ spin_lock(&mm->page_table_lock);
+ if (pgd_present(*pgd)) /* Another has populated it */
+ p4d_free(mm, new);
+ else
+ pgd_populate(mm, pgd, new);
+ spin_unlock(&mm->page_table_lock);
+ return 0;
+}
+#endif /* __PAGETABLE_P4D_FOLDED */
+
#ifndef __PAGETABLE_PUD_FOLDED
/*
* Allocate page upper directory.
--
2.10.2
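The corresponding inline fast path mirrors the existing pud_alloc() /
pmd_alloc() helpers; roughly (a sketch of the header counterpart,
relying on the pgd_none()/__p4d_alloc() split above):

	static inline p4d_t *p4d_alloc(struct mm_struct *mm, pgd_t *pgd,
				       unsigned long address)
	{
		return (unlikely(pgd_none(*pgd)) && __p4d_alloc(mm, pgd, address)) ?
			NULL : p4d_offset(pgd, address);
	}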
Handle the additional page table level in the kexec code.
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/include/asm/kexec.h | 1 +
arch/x86/kernel/machine_kexec_32.c | 4 +++-
arch/x86/kernel/machine_kexec_64.c | 12 +++++++++++-
3 files changed, 15 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
index d2434c1cad05..5fed4cc96028 100644
--- a/arch/x86/include/asm/kexec.h
+++ b/arch/x86/include/asm/kexec.h
@@ -164,6 +164,7 @@ struct kimage_arch {
};
#else
struct kimage_arch {
+ p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
pte_t *pte;
diff --git a/arch/x86/kernel/machine_kexec_32.c b/arch/x86/kernel/machine_kexec_32.c
index 469b23d6acc2..5f43cec296c5 100644
--- a/arch/x86/kernel/machine_kexec_32.c
+++ b/arch/x86/kernel/machine_kexec_32.c
@@ -103,6 +103,7 @@ static void machine_kexec_page_table_set_one(
pgd_t *pgd, pmd_t *pmd, pte_t *pte,
unsigned long vaddr, unsigned long paddr)
{
+ p4d_t *p4d;
pud_t *pud;
pgd += pgd_index(vaddr);
@@ -110,7 +111,8 @@ static void machine_kexec_page_table_set_one(
if (!(pgd_val(*pgd) & _PAGE_PRESENT))
set_pgd(pgd, __pgd(__pa(pmd) | _PAGE_PRESENT));
#endif
- pud = pud_offset(pgd, vaddr);
+ p4d = p4d_offset(pgd, vaddr);
+ pud = pud_offset(p4d, vaddr);
pmd = pmd_offset(pud, vaddr);
if (!(pmd_val(*pmd) & _PAGE_PRESENT))
set_pmd(pmd, __pmd(__pa(pte) | _PAGE_TABLE));
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index 5a294e48b185..0a44cf20f939 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -36,6 +36,7 @@ static struct kexec_file_ops *kexec_file_loaders[] = {
static void free_transition_pgtable(struct kimage *image)
{
+ free_page((unsigned long)image->arch.p4d);
free_page((unsigned long)image->arch.pud);
free_page((unsigned long)image->arch.pmd);
free_page((unsigned long)image->arch.pte);
@@ -43,6 +44,7 @@ static void free_transition_pgtable(struct kimage *image)
static int init_transition_pgtable(struct kimage *image, pgd_t *pgd)
{
+ p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
pte_t *pte;
@@ -59,7 +61,15 @@ static int init_transition_pgtable(struct kimage *image, pgd_t *pgd)
image->arch.pud = pud;
set_pgd(pgd, __pgd(__pa(pud) | _KERNPG_TABLE));
}
- pud = pud_offset(pgd, vaddr);
+ p4d = p4d_offset(pgd, vaddr);
+ if (!p4d_present(*p4d)) {
+ p4d = (p4d_t *)get_zeroed_page(GFP_KERNEL);
+ if (!p4d)
+ goto err;
+ image->arch.p4d = p4d;
+ set_pgd(pgd, __pgd(__pa(p4d) | _KERNPG_TABLE));
+ }
+ pud = pud_offset(p4d, vaddr);
if (!pud_present(*pud)) {
pmd = (pmd_t *)get_zeroed_page(GFP_KERNEL);
if (!pmd)
--
2.10.2
set_up_temporary_text_mapping() and relocate_restore_code() require
trivial adjustments to handle the additional page table level.
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/power/hibernate_64.c | 35 ++++++++++++++++++++++-------------
1 file changed, 22 insertions(+), 13 deletions(-)
diff --git a/arch/x86/power/hibernate_64.c b/arch/x86/power/hibernate_64.c
index 9634557a5444..bf38f5df60b0 100644
--- a/arch/x86/power/hibernate_64.c
+++ b/arch/x86/power/hibernate_64.c
@@ -45,6 +45,7 @@ static int set_up_temporary_text_mapping(pgd_t *pgd)
{
pmd_t *pmd;
pud_t *pud;
+ p4d_t *p4d;
/*
* The new mapping only has to cover the page containing the image
@@ -71,8 +72,10 @@ static int set_up_temporary_text_mapping(pgd_t *pgd)
__pmd((jump_address_phys & PMD_MASK) | __PAGE_KERNEL_LARGE_EXEC));
set_pud(pud + pud_index(restore_jump_address),
__pud(__pa(pmd) | _KERNPG_TABLE));
+ set_p4d(p4d + p4d_index(restore_jump_address),
+ __p4d(__pa(pud) | _KERNPG_TABLE));
set_pgd(pgd + pgd_index(restore_jump_address),
- __pgd(__pa(pud) | _KERNPG_TABLE));
+ __pgd(__pa(p4d) | _KERNPG_TABLE));
return 0;
}
@@ -120,7 +123,10 @@ static int set_up_temporary_mappings(void)
static int relocate_restore_code(void)
{
pgd_t *pgd;
+ p4d_t *p4d;
pud_t *pud;
+ pmd_t *pmd;
+ pte_t *pte;
relocated_restore_code = get_safe_page(GFP_ATOMIC);
if (!relocated_restore_code)
@@ -130,22 +136,25 @@ static int relocate_restore_code(void)
/* Make the page containing the relocated code executable */
pgd = (pgd_t *)__va(read_cr3()) + pgd_index(relocated_restore_code);
- pud = pud_offset(pgd, relocated_restore_code);
+ p4d = p4d_offset(pgd, relocated_restore_code);
+ if (p4d_large(*p4d)) {
+ set_p4d(p4d, __p4d(p4d_val(*p4d) & ~_PAGE_NX));
+ goto out;
+ }
+ pud = pud_offset(p4d, relocated_restore_code);
if (pud_large(*pud)) {
set_pud(pud, __pud(pud_val(*pud) & ~_PAGE_NX));
- } else {
- pmd_t *pmd = pmd_offset(pud, relocated_restore_code);
-
- if (pmd_large(*pmd)) {
- set_pmd(pmd, __pmd(pmd_val(*pmd) & ~_PAGE_NX));
- } else {
- pte_t *pte = pte_offset_kernel(pmd, relocated_restore_code);
-
- set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_NX));
- }
+ goto out;
+ }
+ pmd = pmd_offset(pud, relocated_restore_code);
+ if (pmd_large(*pmd)) {
+ set_pmd(pmd, __pmd(pmd_val(*pmd) & ~_PAGE_NX));
+ goto out;
}
+ pte = pte_offset_kernel(pmd, relocated_restore_code);
+ set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_NX));
+out:
__flush_tlb_all();
-
return 0;
}
--
2.10.2
We are going to introduce <asm-generic/pgtable-nop4d.h> to provide an
abstraction for a properly folded p4d level (as opposed to the
5level-fixup.h hack). The new header will be included from pgtable-nopud.h.
If an architecture uses <asm-generic/pgtable-nop*d.h>, we cannot use
5level-fixup.h directly to quickly convert the architecture to 5-level
paging, as it would conflict with pgtable-nop4d.h.
With this patch an architecture can define __ARCH_USE_5LEVEL_HACK before
including <asm-generic/pgtable-nop*d.h> to get 5level-fixup.h instead (a
usage sketch follows the patch).
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/asm-generic/pgtable-nop4d-hack.h | 62 ++++++++++++++++++++++++++++++++
include/asm-generic/pgtable-nopud.h | 5 +++
2 files changed, 67 insertions(+)
create mode 100644 include/asm-generic/pgtable-nop4d-hack.h
diff --git a/include/asm-generic/pgtable-nop4d-hack.h b/include/asm-generic/pgtable-nop4d-hack.h
new file mode 100644
index 000000000000..37d13cbf832d
--- /dev/null
+++ b/include/asm-generic/pgtable-nop4d-hack.h
@@ -0,0 +1,62 @@
+#ifndef _PGTABLE_NOP4D_HACK_H
+#define _PGTABLE_NOP4D_HACK_H
+
+#ifndef __ASSEMBLY__
+#include <asm-generic/5level-fixup.h>
+
+#define __PAGETABLE_PUD_FOLDED
+
+/*
+ * Having the pud type consist of a pgd gets the size right, and allows
+ * us to conceptually access the pgd entry that this pud is folded into
+ * without casting.
+ */
+typedef struct { pgd_t pgd; } pud_t;
+
+#define PUD_SHIFT PGDIR_SHIFT
+#define PTRS_PER_PUD 1
+#define PUD_SIZE (1UL << PUD_SHIFT)
+#define PUD_MASK (~(PUD_SIZE-1))
+
+/*
+ * The "pgd_xxx()" functions here are trivial for a folded two-level
+ * setup: the pud is never bad, and a pud always exists (as it's folded
+ * into the pgd entry)
+ */
+static inline int pgd_none(pgd_t pgd) { return 0; }
+static inline int pgd_bad(pgd_t pgd) { return 0; }
+static inline int pgd_present(pgd_t pgd) { return 1; }
+static inline void pgd_clear(pgd_t *pgd) { }
+#define pud_ERROR(pud) (pgd_ERROR((pud).pgd))
+
+#define pgd_populate(mm, pgd, pud) do { } while (0)
+/*
+ * (puds are folded into pgds so this doesn't get actually called,
+ * but the define is needed for a generic inline function.)
+ */
+#define set_pgd(pgdptr, pgdval) set_pud((pud_t *)(pgdptr), (pud_t) { pgdval })
+
+static inline pud_t * pud_offset(pgd_t * pgd, unsigned long address)
+{
+ return (pud_t *)pgd;
+}
+
+#define pud_val(x) (pgd_val((x).pgd))
+#define __pud(x) ((pud_t) { __pgd(x) } )
+
+#define pgd_page(pgd) (pud_page((pud_t){ pgd }))
+#define pgd_page_vaddr(pgd) (pud_page_vaddr((pud_t){ pgd }))
+
+/*
+ * allocating and freeing a pud is trivial: the 1-entry pud is
+ * inside the pgd, so has no extra memory associated with it.
+ */
+#define pud_alloc_one(mm, address) NULL
+#define pud_free(mm, x) do { } while (0)
+#define __pud_free_tlb(tlb, x, a) do { } while (0)
+
+#undef pud_addr_end
+#define pud_addr_end(addr, end) (end)
+
+#endif /* __ASSEMBLY__ */
+#endif /* _PGTABLE_NOP4D_HACK_H */
diff --git a/include/asm-generic/pgtable-nopud.h b/include/asm-generic/pgtable-nopud.h
index 810431d8351b..5e49430a30a4 100644
--- a/include/asm-generic/pgtable-nopud.h
+++ b/include/asm-generic/pgtable-nopud.h
@@ -3,6 +3,10 @@
#ifndef __ASSEMBLY__
+#ifdef __ARCH_USE_5LEVEL_HACK
+#include <asm-generic/pgtable-nop4d-hack.h>
+#else
+
#define __PAGETABLE_PUD_FOLDED
/*
@@ -58,4 +62,5 @@ static inline pud_t * pud_offset(pgd_t * pgd, unsigned long address)
#define pud_addr_end(addr, end) (end)
#endif /* __ASSEMBLY__ */
+#endif /* !__ARCH_USE_5LEVEL_HACK */
#endif /* _PGTABLE_NOPUD_H */
--
2.10.2
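To illustrate the opt-in: an architecture that still wants the fixup
hack defines the macro before pulling in the generic header. A sketch
with a made-up architecture (not part of the patch):

	/* arch/foo/include/asm/pgtable.h -- illustrative only */
	#define __ARCH_USE_5LEVEL_HACK
	#include <asm-generic/pgtable-nopud.h>

pgtable-nopud.h then routes through pgtable-nop4d-hack.h, and thus
5level-fixup.h, instead of the upcoming pgtable-nop4d.h.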
This patch extends x86 headers to enable 5-level paging support.
It's still based on <asm-generic/5level-fixup.h>. We will get to the
point where we can have <asm-generic/pgtable-nop4d.h> later.
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/include/asm/pgtable-2level_types.h | 1 +
arch/x86/include/asm/pgtable-3level_types.h | 1 +
arch/x86/include/asm/pgtable.h | 16 +++++++++++++++
arch/x86/include/asm/pgtable_64_types.h | 1 +
arch/x86/include/asm/pgtable_types.h | 30 ++++++++++++++++++++++++++++-
5 files changed, 48 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/pgtable-2level_types.h b/arch/x86/include/asm/pgtable-2level_types.h
index 392576433e77..373ab1de909f 100644
--- a/arch/x86/include/asm/pgtable-2level_types.h
+++ b/arch/x86/include/asm/pgtable-2level_types.h
@@ -7,6 +7,7 @@
typedef unsigned long pteval_t;
typedef unsigned long pmdval_t;
typedef unsigned long pudval_t;
+typedef unsigned long p4dval_t;
typedef unsigned long pgdval_t;
typedef unsigned long pgprotval_t;
diff --git a/arch/x86/include/asm/pgtable-3level_types.h b/arch/x86/include/asm/pgtable-3level_types.h
index bcc89625ebe5..b8a4341faafa 100644
--- a/arch/x86/include/asm/pgtable-3level_types.h
+++ b/arch/x86/include/asm/pgtable-3level_types.h
@@ -7,6 +7,7 @@
typedef u64 pteval_t;
typedef u64 pmdval_t;
typedef u64 pudval_t;
+typedef u64 p4dval_t;
typedef u64 pgdval_t;
typedef u64 pgprotval_t;
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 437feb436efa..54b6632723d5 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -168,6 +168,17 @@ static inline unsigned long pud_pfn(pud_t pud)
return (pud_val(pud) & pud_pfn_mask(pud)) >> PAGE_SHIFT;
}
+static inline unsigned long p4d_pfn(p4d_t p4d)
+{
+ return (p4d_val(p4d) & p4d_pfn_mask(p4d)) >> PAGE_SHIFT;
+}
+
+static inline int p4d_large(p4d_t p4d)
+{
+ /* No 512 GiB pages yet */
+ return 0;
+}
+
#define pte_page(pte) pfn_to_page(pte_pfn(pte))
static inline int pmd_large(pmd_t pte)
@@ -660,6 +671,11 @@ static inline int pud_large(pud_t pud)
}
#endif /* CONFIG_PGTABLE_LEVELS > 2 */
+static inline unsigned long p4d_index(unsigned long address)
+{
+ return (address >> P4D_SHIFT) & (PTRS_PER_P4D - 1);
+}
+
#if CONFIG_PGTABLE_LEVELS > 3
static inline int pgd_present(pgd_t pgd)
{
diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
index 6fdef9eef2d5..d15ca53bd462 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -13,6 +13,7 @@
typedef unsigned long pteval_t;
typedef unsigned long pmdval_t;
typedef unsigned long pudval_t;
+typedef unsigned long p4dval_t;
typedef unsigned long pgdval_t;
typedef unsigned long pgprotval_t;
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 3187bec1b79a..4aa91e440b4a 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -272,9 +272,20 @@ static inline pgdval_t pgd_flags(pgd_t pgd)
return native_pgd_val(pgd) & PTE_FLAGS_MASK;
}
-#if CONFIG_PGTABLE_LEVELS > 3
+#if CONFIG_PGTABLE_LEVELS > 4
+
+#error FIXME
+
+#else
#include <asm-generic/5level-fixup.h>
+static inline p4dval_t native_p4d_val(p4d_t p4d)
+{
+ return native_pgd_val(p4d);
+}
+#endif
+
+#if CONFIG_PGTABLE_LEVELS > 3
typedef struct { pudval_t pud; } pud_t;
static inline pud_t native_make_pud(pmdval_t val)
@@ -318,6 +329,22 @@ static inline pmdval_t native_pmd_val(pmd_t pmd)
}
#endif
+static inline p4dval_t p4d_pfn_mask(p4d_t p4d)
+{
+ /* No 512 GiB huge pages yet */
+ return PTE_PFN_MASK;
+}
+
+static inline p4dval_t p4d_flags_mask(p4d_t p4d)
+{
+ return ~p4d_pfn_mask(p4d);
+}
+
+static inline p4dval_t p4d_flags(p4d_t p4d)
+{
+ return native_p4d_val(p4d) & p4d_flags_mask(p4d);
+}
+
static inline pudval_t pud_pfn_mask(pud_t pud)
{
if (native_pud_val(pud) & _PAGE_PSE)
@@ -463,6 +490,7 @@ enum pg_level {
PG_LEVEL_4K,
PG_LEVEL_2M,
PG_LEVEL_1G,
+ PG_LEVEL_512G,
PG_LEVEL_NUM
};
--
2.10.2
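For orientation: once la57 is enabled, a 57-bit virtual address
decomposes into five 9-bit table indices plus the 12-bit page offset,
512 entries per table at every level (x86 values, shown as a sketch):

	pgd_index(addr) = (addr >> 48) & 511	/* bits 56..48 */
	p4d_index(addr) = (addr >> 39) & 511	/* bits 47..39 */
	pud_index(addr) = (addr >> 30) & 511	/* bits 38..30 */
	pmd_index(addr) = (addr >> 21) & 511	/* bits 29..21 */
	pte_index(addr) = (addr >> 12) & 511	/* bits 20..12 */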
Hi,
Your series seems to have some coding style problems. See output below for
more information:
Subject: [Qemu-devel] [QEMU, PATCH] x86: implement la57 paging mode
Type: series
Message-id: [email protected]
=== TEST SCRIPT BEGIN ===
#!/bin/bash
BASE=base
n=1
total=$(git log --oneline $BASE.. | wc -l)
failed=0
# Useful git options
git config --local diff.renamelimit 0
git config --local diff.renames True
commits="$(git log --format=%H --reverse $BASE..)"
for c in $commits; do
echo "Checking PATCH $n/$total: $(git log -n 1 --format=%s $c)..."
if ! git show $c --format=email | ./scripts/checkpatch.pl --mailback -; then
failed=1
echo
fi
n=$((n+1))
done
exit $failed
=== TEST SCRIPT END ===
Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
Switched to a new branch 'test'
509e387 x86: implement la57 paging mode
=== OUTPUT BEGIN ===
Checking PATCH 1/1: x86: implement la57 paging mode...
ERROR: space prohibited before that close parenthesis ')'
#311: FILE: target-i386/monitor.c:108:
+ print_pte(mon, env, (l1 << 30 ) + (l2 << 21), pde,
ERROR: space prohibited before that close parenthesis ')'
#320: FILE: target-i386/monitor.c:116:
+ print_pte(mon, env, (l1 << 30 ) + (l2 << 21)
WARNING: line over 80 characters
#347: FILE: target-i386/monitor.c:148:
+ print_pte(mon, env, (l0 << 48) + (l1 << 39) + (l2 << 30),
WARNING: line over 80 characters
#359: FILE: target-i386/monitor.c:158:
+ print_pte(mon, env, (l0 << 48) + (l1 << 39) +
WARNING: line over 80 characters
#467: FILE: target-i386/monitor.c:464:
+ cpu_physical_memory_read(pd_addr + l3 * 8, &pde, 8);
ERROR: line over 90 characters
#469: FILE: target-i386/monitor.c:466:
+ end = (l0 << 48) + (l1 << 39) + (l2 << 30) + (l3 << 21);
WARNING: line over 80 characters
#472: FILE: target-i386/monitor.c:469:
+ prot = pde & (PG_USER_MASK | PG_RW_MASK |
WARNING: line over 80 characters
#475: FILE: target-i386/monitor.c:472:
+ mem_print(mon, &start, &last_prot, end, prot);
WARNING: line over 80 characters
#480: FILE: target-i386/monitor.c:477:
+ + l4 * 8,
WARNING: line over 80 characters
#481: FILE: target-i386/monitor.c:478:
+ &pte, 8);
ERROR: line over 90 characters
#483: FILE: target-i386/monitor.c:480:
+ end = (l0 << 48) + (l1 << 39) + (l2 << 30) +
ERROR: line over 90 characters
#486: FILE: target-i386/monitor.c:483:
+ prot = pte & (PG_USER_MASK | PG_RW_MASK |
WARNING: line over 80 characters
#487: FILE: target-i386/monitor.c:484:
+ PG_PRESENT_MASK);
ERROR: line over 90 characters
#492: FILE: target-i386/monitor.c:489:
+ mem_print(mon, &start, &last_prot, end, prot);
WARNING: line over 80 characters
#497: FILE: target-i386/monitor.c:494:
+ mem_print(mon, &start, &last_prot, end, prot);
total: 6 errors, 9 warnings, 481 lines checked
Your patch has style problems, please review. If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
=== OUTPUT END ===
Test command exited with code: 1
---
Email generated automatically by Patchew [http://patchew.org/].
Please send your feedback to [email protected]
On Thu, Dec 8, 2016 at 8:21 AM, Kirill A. Shutemov
<[email protected]> wrote:
>
> This patchset is still very early. There are a number of things missing
> that we have to do before asking anyone to merge it (listed below).
> It would be great if folks can start testing applications now (in QEMU) to
> look for breakage.
> Any early comments on the design or the patches would be appreciated as
> well.
Looks ok to me. Starting off with a compile-time config option seems fine.
I do think that the x86 cpuid part should (patch 15) should be the
first patch, so that we see "la57" as a capability in /proc/cpuinfo
whether it's being enabled or not? We should merge that part
regardless of any mm patches, I think.
Linus
On December 8, 2016 10:16:07 AM PST, Linus Torvalds <[email protected]> wrote:
>On Thu, Dec 8, 2016 at 8:21 AM, Kirill A. Shutemov
><[email protected]> wrote:
>>
>> This patchset is still very early. There are a number of things
>> missing that we have to do before asking anyone to merge it (listed
>> below).
>> It would be great if folks can start testing applications now (in
>> QEMU) to look for breakage.
>> Any early comments on the design or the patches would be appreciated
>> as well.
>
>Looks ok to me. Starting off with a compile-time config option seems
>fine.
>
>I do think that the x86 cpuid part (patch 15) should be the
>first patch, so that we see "la57" as a capability in /proc/cpuinfo
>whether it's being enabled or not? We should merge that part
>regardless of any mm patches, I think.
>
> Linus
Definitely.
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Thu, Dec 8, 2016 at 8:21 AM, Kirill A. Shutemov
<[email protected]> wrote:
> We don't need it anymore. 17be0aec74fb ("x86/asm/entry/64: Implement
> better check for canonical addresses") made canonical address check
> generic wrt. address width.
This code existed in part to remind us that this needs very careful
adjustment when the paging size becomes dynamic. If you want to
remove it, please add test cases to tools/testing/selftests/x86 that
verify:
a. Either mmap(2^47-4096, ..., MAP_FIXED, ...) fails or that, if it
succeeds and you put a syscall instruction at the very end, that
invoking the syscall instruction there works. The easiest way to do
this may be to have the selftest literally have a page of text that
has 4094 0xcc bytes and a syscall and to map that page or perhaps move
it into place with mremap. That will avoid annoying W^X userspace
stuff from messing up the test. You'll need to handle the signal when
you fall off the end of the world after the syscall.
b. Ditto for the new highest possible userspace page.
c. Ditto for one page earlier to make sure that your test actually works.
d. For each possible maximum address, call raise(SIGUSR1) and, in the
signal handler, change RIP to point to the first noncanonical address
and RCX to match RIP. Return and catch the resulting exception. This
may be easy to integrate into the sigreturn tests, and I can help with
that.
--Andy
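For reference, a stripped-down starting point for test (a) could look
like this (assumes the current 47-bit limit; the real selftest would
still need the syscall-at-the-end-of-the-page trick and the signal
handling Andy describes):

	#include <stdio.h>
	#include <sys/mman.h>

	int main(void)
	{
		void *hint = (void *)((1UL << 47) - 4096);
		void *p = mmap(hint, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
			       MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);

		if (p == MAP_FAILED) {
			printf("mmap(2^47 - 4096) refused -- also a valid outcome\n");
			return 0;
		}
		printf("mapped last legacy page at %p\n", p);
		/* here: fill with 0xcc up to a trailing syscall insn,
		 * execute it, and catch the fault past the end */
		munmap(p, 4096);
		return 0;
	}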
On Thu, Dec 8, 2016 at 8:21 AM, Kirill A. Shutemov
<[email protected]> wrote:
> XXX: how to test this?
tools/testing/selftests/x86/sigreturn_{32,64}
On Thu, Dec 8, 2016 at 8:21 AM, Kirill A. Shutemov
<[email protected]> wrote:
> This basically restores a slightly modified version of the original
> sync_global_pgds() which we had before the folded p4d was introduced.
>
> The only modification is protection against 'address' overflow.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> ---
> arch/x86/mm/init_64.c | 47 +++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 47 insertions(+)
>
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index a991f5c4c2c4..d637893ac8c2 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -92,6 +92,52 @@ __setup("noexec32=", nonx32_setup);
> * When memory was added/removed make sure all the processes MM have
> * suitable PGD entries in the local PGD level page.
> */
> +#ifdef CONFIG_X86_5LEVEL
> +void sync_global_pgds(unsigned long start, unsigned long end, int removed)
> +{
> + unsigned long address;
> +
> + for (address = start; address <= end && address >= start;
> + address += PGDIR_SIZE) {
> + const pgd_t *pgd_ref = pgd_offset_k(address);
> + struct page *page;
> +
> + /*
> + * When it is called after memory hot remove, pgd_none()
> + * returns true. In this case (removed == 1), we must clear
> + * the PGD entries in the local PGD level page.
> + */
> + if (pgd_none(*pgd_ref) && !removed)
> + continue;
This isn't quite specific to your patch, but can we assert that, if
removed=1, then we're not operating on the vmalloc range? Because if
we do, this will be racy in nasty ways.
On 12/08/16 08:21, Kirill A. Shutemov wrote:
> The first part of memory map (up to %esp fixup) simply scales existing
> map for 4-level paging by factor of 9 -- number of bits addressed by
> additional page table level.
>
> The rest of the map is uncahnged.
unchanged.
(more fixes below)
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> ---
> Documentation/x86/x86_64/mm.txt | 23 ++++++++++++++++++++++-
> arch/x86/Kconfig | 1 +
> arch/x86/include/asm/kasan.h | 9 ++++++---
> arch/x86/include/asm/page_64_types.h | 10 ++++++++++
> arch/x86/include/asm/pgtable_64_types.h | 6 ++++++
> arch/x86/include/asm/sparsemem.h | 9 +++++++--
> 6 files changed, 52 insertions(+), 6 deletions(-)
>
> diff --git a/Documentation/x86/x86_64/mm.txt b/Documentation/x86/x86_64/mm.txt
> index 8c7dd5957ae1..d33fb0799b3d 100644
> --- a/Documentation/x86/x86_64/mm.txt
> +++ b/Documentation/x86/x86_64/mm.txt
> @@ -12,7 +12,7 @@ ffffc90000000000 - ffffe8ffffffffff (=45 bits) vmalloc/ioremap space
> ffffe90000000000 - ffffe9ffffffffff (=40 bits) hole
> ffffea0000000000 - ffffeaffffffffff (=40 bits) virtual memory map (1TB)
> ... unused hole ...
> -ffffec0000000000 - fffffc0000000000 (=44 bits) kasan shadow memory (16TB)
> +ffffec0000000000 - fffffbffffffffff (=44 bits) kasan shadow memory (16TB)
> ... unused hole ...
> ffffff0000000000 - ffffff7fffffffff (=39 bits) %esp fixup stacks
> ... unused hole ...
> @@ -23,6 +23,27 @@ ffffffffa0000000 - ffffffffff5fffff (=1526 MB) module mapping space
> ffffffffff600000 - ffffffffffdfffff (=8 MB) vsyscalls
> ffffffffffe00000 - ffffffffffffffff (=2 MB) unused hole
>
> +Virtual memory map with 5 level page tables:
> +
> +0000000000000000 - 00ffffffffffffff (=56 bits) user space, different per mm
> +hole caused by [57:63] sign extension
Can you briefly explain the sign extension?
Should that be [56:63]?
> +ff00000000000000 - ff0fffffffffffff (=52 bits) guard hole, reserved for hypervisor
> +ff10000000000000 - ff8fffffffffffff (=55 bits) direct mapping of all phys. memory
> +ff90000000000000 - ff91ffffffffffff (=49 bits) hole
> +ff92000000000000 - ffd1ffffffffffff (=54 bits) vmalloc/ioremap space
> > +ffd2000000000000 - ffd3ffffffffffff (=49 bits) virtual memory map (512TB)
> +... unused hole ...
> +ff96000000000000 - ffb5ffffffffffff (=53 bits) kasan shadow memory (8PB)
> +... unused hole ...
> +fffe000000000000 - fffeffffffffffff (=49 bits) %esp fixup stacks
> +... unused hole ...
> +ffffffef00000000 - ffffffff00000000 (=64 GB) EFI region mapping space
- fffffffeffffffff
> +... unused hole ...
> +ffffffff80000000 - ffffffffa0000000 (=512 MB) kernel text mapping, from phys 0
- ffffffff9fffffff
> +ffffffffa0000000 - ffffffffff5fffff (=1526 MB) module mapping space
> +ffffffffff600000 - ffffffffffdfffff (=8 MB) vsyscalls
> +ffffffffffe00000 - ffffffffffffffff (=2 MB) unused hole
> +
> The direct mapping covers all memory in the system up to the highest
> memory address (this means in some cases it can also include PCI memory
> holes).
> diff --git a/arch/x86/include/asm/kasan.h b/arch/x86/include/asm/kasan.h
> index 1410b567ecde..2587c6bd89be 100644
> --- a/arch/x86/include/asm/kasan.h
> +++ b/arch/x86/include/asm/kasan.h
> @@ -11,9 +11,12 @@
> * 'kernel address space start' >> KASAN_SHADOW_SCALE_SHIFT
> */
> #define KASAN_SHADOW_START (KASAN_SHADOW_OFFSET + \
> - (0xffff800000000000ULL >> 3))
> -/* 47 bits for kernel address -> (47 - 3) bits for shadow */
> -#define KASAN_SHADOW_END (KASAN_SHADOW_START + (1ULL << (47 - 3)))
> + ((-1UL << __VIRTUAL_MASK_SHIFT) >> 3))
> +/*
> + * 47 bits for kernel address -> (47 - 3) bits for shadow
> + * 56 bits for kernel address -> (56 - 3) bits fro shadow
typo: s/fro/for/
> + */
> +#define KASAN_SHADOW_END (KASAN_SHADOW_START + (1ULL << (__VIRTUAL_MASK_SHIFT - 3)))
>
> #ifndef __ASSEMBLY__
>
--
~Randy
On Thu, Dec 08, 2016 at 10:16:07AM -0800, Linus Torvalds wrote:
> On Thu, Dec 8, 2016 at 8:21 AM, Kirill A. Shutemov
> <[email protected]> wrote:
> >
> > This patchset is still very early. There are a number of things missing
> > that we have to do before asking anyone to merge it (listed below).
> > It would be great if folks can start testing applications now (in QEMU) to
> > look for breakage.
> > Any early comments on the design or the patches would be appreciated as
> > well.
>
> Looks ok to me. Starting off with a compile-time config option seems fine.
>
> I do think that the x86 cpuid part (patch 15) should be the
> first patch, so that we see "la57" as a capability in /proc/cpuinfo
> whether it's being enabled or not? We should merge that part
> regardless of any mm patches, I think.
Okay, I'll split up the CPUID part into a separate patch and move it to
the beginning of the patchset.
REQUIRED_MASK portion will stay where it is.
--
Kirill A. Shutemov
On Thu, Dec 08, 2016 at 10:39:57AM -0800, Andy Lutomirski wrote:
> On Thu, Dec 8, 2016 at 8:21 AM, Kirill A. Shutemov
> <[email protected]> wrote:
> > We don't need it anymore. 17be0aec74fb ("x86/asm/entry/64: Implement
> > better check for canonical addresses") made canonical address check
> > generic wrt. address width.
>
> This code existed in part to remind us that this needs very careful
> adjustment when the paging size becomes dynamic. If you want to
> remove it, please add test cases to tools/testing/selftests/x86 that
> verify:
>
> a. Either mmap(2^47-4096, ..., MAP_FIXED, ...) fails or that, if it
> succeeds and you put a syscall instruction at the very end, that
> invoking the syscall instruction there works. The easiest way to do
> this may be to have the selftest literally have a page of text that
> has 4094 0xcc bytes and a syscall and to map that page or perhaps move
> it into place with mremap. That will avoid annoying W^X userspace
> stuff from messing up the test. You'll need to handle the signal when
> you fall off the end of the world after the syscall.
>
> b. Ditto for the new highest possible userspace page.
>
> c. Ditto for one page earlier to make sure that your test actually works.
>
> d. For each possible maximum address, call raise(SIGUSR1) and, in the
> signal handler, change RIP to point to the first noncanonical address
> and RCX to match RIP. Return and catch the resulting exception. This
> may be easy to integrate into the sigreturn tests, and I can help with
> that.
Thanks for the hints.
I'll come back to you with the testcases so you can verify that they are
what you wanted to see.
--
Kirill A. Shutemov
On Thu, Dec 08, 2016 at 10:56:04AM -0800, Randy Dunlap wrote:
> > @@ -23,6 +23,27 @@ ffffffffa0000000 - ffffffffff5fffff (=1526 MB) module mapping space
> > ffffffffff600000 - ffffffffffdfffff (=8 MB) vsyscalls
> > ffffffffffe00000 - ffffffffffffffff (=2 MB) unused hole
> >
> > +Virtual memory map with 5 level page tables:
> > +
> > +0000000000000000 - 00ffffffffffffff (=56 bits) user space, different per mm
> > +hole caused by [57:63] sign extension
>
> Can you briefly explain the sign extension?
Sure, I'll update it on respin.
> Should that be [56:63]?
You're right, it should.
Thanks for all your corrections.
--
Kirill A. Shutemov
On Thu, Dec 08, 2016 at 10:42:19AM -0800, Andy Lutomirski wrote:
> On Thu, Dec 8, 2016 at 8:21 AM, Kirill A. Shutemov
> <[email protected]> wrote:
> > This basically restores a slightly modified version of the original
> > sync_global_pgds() which we had before the folded p4d was introduced.
> >
> > The only modification is protection against 'address' overflow.
> >
> > Signed-off-by: Kirill A. Shutemov <[email protected]>
> > ---
> > arch/x86/mm/init_64.c | 47 +++++++++++++++++++++++++++++++++++++++++++++++
> > 1 file changed, 47 insertions(+)
> >
> > diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> > index a991f5c4c2c4..d637893ac8c2 100644
> > --- a/arch/x86/mm/init_64.c
> > +++ b/arch/x86/mm/init_64.c
> > @@ -92,6 +92,52 @@ __setup("noexec32=", nonx32_setup);
> > * When memory was added/removed make sure all the processes MM have
> > * suitable PGD entries in the local PGD level page.
> > */
> > +#ifdef CONFIG_X86_5LEVEL
> > +void sync_global_pgds(unsigned long start, unsigned long end, int removed)
> > +{
> > + unsigned long address;
> > +
> > + for (address = start; address <= end && address >= start;
> > + address += PGDIR_SIZE) {
> > + const pgd_t *pgd_ref = pgd_offset_k(address);
> > + struct page *page;
> > +
> > + /*
> > + * When it is called after memory hot remove, pgd_none()
> > + * returns true. In this case (removed == 1), we must clear
> > + * the PGD entries in the local PGD level page.
> > + */
> > + if (pgd_none(*pgd_ref) && !removed)
> > + continue;
>
> This isn't quite specific to your patch, but can we assert that, if
> removed=1, then we're not operating on the vmalloc range? Because if
> we do, this will be racy in nasty ways.
Looks like there are no users of removed=1. The last user went away with
af2cf278ef4f ("x86/mm/hotplug: Don't remove PGD entries in
remove_pagetable()").
I'll just drop it (with separate patch).
--
Kirill A. Shutemov
On Thu, Dec 08, 2016 at 07:21:37PM +0300, Kirill A. Shutemov wrote:
> 5-level paging support is required from hardware when compiled with
> CONFIG_X86_5LEVEL=y. We may implement runtime switch support later.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
...
> diff --git a/arch/x86/boot/cpuflags.c b/arch/x86/boot/cpuflags.c
> index 6687ab953257..26e9a287805f 100644
> --- a/arch/x86/boot/cpuflags.c
> +++ b/arch/x86/boot/cpuflags.c
> @@ -80,6 +80,17 @@ static inline void cpuid(u32 id, u32 *a, u32 *b, u32 *c, u32 *d)
> );
> }
>
> +static inline void cpuid_count(u32 id, u32 count,
> + u32 *a, u32 *b, u32 *c, u32 *d)
> +{
> + asm volatile(".ifnc %%ebx,%3 ; movl %%ebx,%3 ; .endif \n\t"
> + "cpuid \n\t"
> + ".ifnc %%ebx,%3 ; xchgl %%ebx,%3 ; .endif \n\t"
> + : "=a" (*a), "=c" (*c), "=d" (*d), EBX_REG (*b)
> + : "a" (id), "c" (count)
> + );
> +}
Pls make those like cpuid() and cpuid_count() in
arch/x86/include/asm/processor.h, which explicitly assign ecx and then
call the underlying helper.
The cpuid() in cpuflags.c doesn't zero ecx which, if we have to be
pedantic, it should do. It calls CPUID now with the ptr value of its 4th
arg on 64-bit and 3rd arg on 32-bit, respectively, IINM.
--
Regards/Gruss,
Boris.
Good mailing practices for 400: avoid top-posting and trim the reply.
On Thu, Dec 8, 2016 at 12:05 PM, Borislav Petkov <[email protected]> wrote:
>
> The cpuid() in cpuflags.c doesn't zero ecx which, if we have to be
> pedantic, it should do. It calls CPUID now with the ptr value of its 4th
> on 64-bit and 3rd arg on 32-bit, respectively, IINM.
In fact, just do a single cpuid_count(), and then implement the
traditional cpuid() as just
#define cpuid(x, a,b,c,d) cpuid_count(x, 0, a, b, c, d)
or something.
Especially since that's some of the ugliest inline asm ever due to the
nasty BX handling.
Linus
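Sketching out what the two suggestions add up to for the boot code
(assumes the existing .ifnc/EBX-preserving asm moves into a single
underlying __cpuid() helper; illustrative only):

	static inline void cpuid_count(u32 id, u32 count,
				       u32 *a, u32 *b, u32 *c, u32 *d)
	{
		*a = id;
		*c = count;
		__cpuid(a, b, c, d);	/* the one copy of the EBX dance */
	}

	/* The traditional form just zeroes ecx: */
	#define cpuid(id, a, b, c, d) cpuid_count(id, 0, a, b, c, d)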
On Thu, Dec 08, 2016 at 12:08:53PM -0800, Linus Torvalds wrote:
> Especially since that's some of the ugliest inline asm ever due to the
> nasty BX handling.
Yeah, about that: why doesn't gcc handle that for us like it would
handle a clobbered register? I mean, it *should* know that BX is live
when building with -fPIC... The .ifnc thing looks really silly.
Hmmm.
--
Regards/Gruss,
Boris.
Good mailing practices for 400: avoid top-posting and trim the reply.
* Kirill A. Shutemov <[email protected]> wrote:
> x86-64 is currently limited to 256 TiB of virtual address space and 64 TiB
> of physical address space. We are already bumping into this limit: some
> vendors offers servers with 64 TiB of memory today.
>
> To overcome the limitation upcoming hardware will introduce support for
> 5-level paging[1]. It is a straight-forward extension of the current page
> table structure adding one more layer of translation.
>
> It bumps the limits to 128 PiB of virtual address space and 4 PiB of
> physical address space. This "ought to be enough for anybody" ©.
>
> This patchset is still very early. There are a number of things missing
> that we have to do before asking anyone to merge it (listed below).
> It would be great if folks can start testing applications now (in QEMU) to
> look for breakage.
> Any early comments on the design or the patches would be appreciated as
> well.
>
> More details on the design and what’s left to implement are below.
The patches don't look too painful, so no big complaints from me - kudos!
> There is still work to do:
>
> - Boot-time switch between 4- and 5-level paging.
>
> We assume that distributions will be keen to avoid returning to the
> i386 days where we shipped one kernel binary for each page table
> layout.
Absolutely.
> As page table format is the same for 4- and 5-level paging it should
> be possible to have single kernel binary and switch between them at
> boot-time without too much hassle.
>
> For now I only implemented compile-time switch.
>
> I hoped to bring this feature with separate patchset once basic
> enabling is in upstream.
>
> Is it okay?
LGTM, but we would eventually want to convert this kind of crazy open coding:
pgd_t *pgd, *pgd_ref;
p4d_t *p4d, *p4d_ref;
pud_t *pud, *pud_ref;
pmd_t *pmd, *pmd_ref;
pte_t *pte, *pte_ref;
To something saner that iterates and navigates the page table hierarchy in an
extensible fashion. That would also make it (much) easier to make the paging depth
boot time switchable.
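As an illustration of the kind of generic walk meant here -- a sketch only,
with made-up types and names; nothing like this exists in the kernel at
this point:

typedef unsigned long pt_entry_t;	/* stand-in for pgd_t/p4d_t/... */

struct pt_level {
	pt_entry_t *(*offset)(pt_entry_t *upper, unsigned long addr);
	int (*none)(pt_entry_t *entry);
};

/*
 * Walk 'depth' levels (4 or 5, decided at boot) and return the leaf
 * entry for 'addr', or NULL if the walk hits a hole.
 */
static pt_entry_t *pt_walk(const struct pt_level *levels, int depth,
			   pt_entry_t *root, unsigned long addr)
{
	pt_entry_t *entry = root;
	int i;

	for (i = 0; i < depth; i++) {
		entry = levels[i].offset(entry, addr);
		if (levels[i].none(entry))
			return NULL;
	}
	return entry;
}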
Somehow I'm quite certain we'll see requests for more than 4 PiB memory in our
lifetimes.
In a decade or two once global warming really gets going, especially after Trump &
Republicans & Old Energy implement their billionaire welfare policies to mine,
sell and burn even more coal & oil without paying for the damage caused, the U.S.
meteorology clusters tracking Category 6 hurricanes in the Atlantic (capable of 1+
trillion dollars damage) in near real time at 1 meter resolution will have to run
on something capable, right?
> - Handle opt-in wider address space for userspace.
>
> Not all userspace is ready to handle addresses wider than current
> 47-bits. At least some JIT compiler make use of upper bits to encode
> their info.
>
> We need to have an interface to opt-in wider addresses from userspace
> to avoid regressions.
>
> For now, I've included testing-only patch which bumps TASK_SIZE to
> 56-bits. This can be handy for testing to see what breaks if we max-out
> size of virtual address space.
So this is just a detail - but it sounds a bit limiting to me to provide an 'opt
in' flag for something that will work just fine on the vast majority of 64-bit
software.
Please make this an opt out compatibility flag instead: similar to how we handle
address space layout limitations/quirks ABI details, such as ADDR_LIMIT_32BIT,
ADDR_LIMIT_3GB, ADDR_COMPAT_LAYOUT, READ_IMPLIES_EXEC, etc.
Thanks,
Ingo
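For reference, the existing flags Ingo mentions are toggled via
personality(2); a wide-VA opt-out flag would presumably follow the same
pattern. ADDR_COMPAT_LAYOUT below is only an example of the mechanism --
no 5-level flag exists yet:

#include <stdio.h>
#include <sys/personality.h>

int main(void)
{
	/* Query the current personality without changing it. */
	int cur = personality(0xffffffff);

	/* Layer an address-space compat quirk on top of it. */
	if (personality(cur | ADDR_COMPAT_LAYOUT) == -1)
		perror("personality");

	return 0;
}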
On Friday, December 9, 2016 6:01:30 AM CET Ingo Molnar wrote:
> So this is just a detail - but it sounds a bit limiting to me to provide an 'opt
> in' flag for something that will work just fine on the vast majority of 64-bit
> software.
>
> Please make this an opt out compatibility flag instead: similar to how we handle
> address space layout limitations/quirks ABI details, such as ADDR_LIMIT_32BIT,
> ADDR_LIMIT_3GB, ADDR_COMPAT_LAYOUT, READ_IMPLIES_EXEC, etc.
We've had a similar discussion about JIT software on ARM64, which has a wide
range of supported page table layouts and some software wants to limit that
to a specific number.
I don't remember the outcome of that discussion, but I'm adding a few people
to Cc that might remember.
There have also been some discussions in the past to make the depth of the
page table a per-task decision on s390, since you may have some tasks that
run just fine with two or three levels of paging while another task actually
wants the full 64-bit address space. I wonder how much extra work this would
be on top of the boot-time option.
Arnd
On Fri, Dec 09, 2016 at 06:01:30AM +0100, Ingo Molnar wrote:
> The patches don't look too painful, so no big complaints from me - kudos!
Thanks.
> LGTM, but we would eventually want to convert this kind of crazy open coding:
>
> pgd_t *pgd, *pgd_ref;
> p4d_t *p4d, *p4d_ref;
> pud_t *pud, *pud_ref;
> pmd_t *pmd, *pmd_ref;
> pte_t *pte, *pte_ref;
>
> To something saner that iterates and navigates the page table hierarchy in an
> extensible fashion. That would also make it (much) easier to make the paging depth
> boot time switchable.
Yes, it would be nice to replace all these p??_t with something more
flexible. But there's no obviously right design for such a transition.
I would rather not tie it to the boot-time paging switch, but have a
separate experimental patchset. One day...
>
> So this is just a detail - but it sounds a bit limiting to me to provide an 'opt
> in' flag for something that will work just fine on the vast majority of 64-bit
> software.
>
> Please make this an opt out compatibility flag instead: similar to how we handle
> address space layout limitations/quirks ABI details, such as ADDR_LIMIT_32BIT,
> ADDR_LIMIT_3GB, ADDR_COMPAT_LAYOUT, READ_IMPLIES_EXEC, etc.
Well, it's true that most userspace can handle wide addresses just fine.
But even by simply booting Fedora in QEMU I see one SIGSEGV for this
reason: libmozjs-17.0.so cannot handle it (polkitd is linked with it,
hell knows why).
I think keeping software from crashing is a priority in this
transition.
Beyond that, most software would not benefit much from a large virtual
address space. Okay, there are more bits for ASLR, but that's it.
On the other hand, a large virtual address space would put more pressure
on the cache -- at least one more page table per process, if we make
56-bit VA the default.
--
Kirill A. Shutemov
On Fri, Dec 09, 2016 at 11:24:12AM +0100, Arnd Bergmann wrote:
> On Friday, December 9, 2016 6:01:30 AM CET Ingo Molnar wrote:
> > So this is just a detail - but it sounds a bit limiting to me to provide an 'opt
> > in' flag for something that will work just fine on the vast majority of 64-bit
> > software.
> >
> > Please make this an opt out compatibility flag instead: similar to how we handle
> > address space layout limitations/quirks ABI details, such as ADDR_LIMIT_32BIT,
> > ADDR_LIMIT_3GB, ADDR_COMPAT_LAYOUT, READ_IMPLIES_EXEC, etc.
>
> We've had a similar discussion about JIT software on ARM64, which has a wide
> range of supported page table layouts and some software wants to limit that
> to a specific number.
>
> I don't remember the outcome of that discussion, but I'm adding a few people
> to Cc that might remember.
The arm64 kernel supports several user VA space configurations (though
commonly 39 and 48-bit) and has had these from the initial port. We
realised that certain JITs (e.g.
https://bugzilla.mozilla.org/show_bug.cgi?id=1143022) and IIRC LLVM
assume a 47-bit user VA, but AFAICT most have been fixed.
ARMv8.1 also supports a 52-bit VA (though only with 64K pages, and we
haven't added support for it yet). However, it's likely that if we make
a 52-bit TASK_SIZE the default, we will break some user assumptions.
While arguably that's not necessarily ABI, if userspace relies on a 47-
or 48-bit VA the kernel shouldn't break it. So I'm strongly inclined to
make the 52-bit TASK_SIZE an opt-in on arm64.
--
Catalin
On Thu, Dec 08, 2016 at 09:05:05PM +0100, Borislav Petkov wrote:
> Pls make those like cpuid() and cpuid_count() in
> arch/x86/include/asm/processor.h, which explicitly assign ecx and then
> call the underlying helper.
>
> The cpuid() in cpuflags.c doesn't zero ecx, which, if we have to be
> pedantic, it should do. It calls CPUID now with the ptr value of its 4th
> arg on 64-bit and 3rd arg on 32-bit, respectively, IINM.
Something like this?
diff --git a/arch/x86/boot/cpuflags.c b/arch/x86/boot/cpuflags.c
index 6687ab953257..366aad972025 100644
--- a/arch/x86/boot/cpuflags.c
+++ b/arch/x86/boot/cpuflags.c
@@ -70,16 +70,22 @@ int has_eflag(unsigned long mask)
# define EBX_REG "=b"
#endif
-static inline void cpuid(u32 id, u32 *a, u32 *b, u32 *c, u32 *d)
+static inline void cpuid_count(u32 id, u32 count,
+ u32 *a, u32 *b, u32 *c, u32 *d)
{
+ *a = id;
+ *c = count;
+
asm volatile(".ifnc %%ebx,%3 ; movl %%ebx,%3 ; .endif \n\t"
"cpuid \n\t"
".ifnc %%ebx,%3 ; xchgl %%ebx,%3 ; .endif \n\t"
: "=a" (*a), "=c" (*c), "=d" (*d), EBX_REG (*b)
- : "a" (id)
+ : "a" (id), "c" (count)
);
}
+#define cpuid(id, a, b, c, d) cpuid_count(id, 0, a, b, c, d)
+
void get_cpuflags(void)
{
u32 max_intel_level, max_amd_level;
--
Kirill A. Shutemov
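For context, the presumable consumer of the new helper: 5-level paging is
enumerated in CPUID leaf 7, subleaf 0, ECX bit 16 (LA57), per the white
paper cited in the cover letter. A usage sketch:

static int has_la57(void)
{
	u32 a, b, c, d;

	/* CPUID.(EAX=07H, ECX=0):ECX[16] is the LA57 feature bit. */
	cpuid_count(0x7, 0, &a, &b, &c, &d);

	return (c >> 16) & 1;
}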
On Fri, Dec 09, 2016 at 06:32:33PM +0300, Kirill A. Shutemov wrote:
> Something like this?
LGTM.
Thanks.
--
Regards/Gruss,
Boris.
Good mailing practices for 400: avoid top-posting and trim the reply.
> On the other hand, a large virtual address space would put more pressure
> on the cache -- at least one more page table per process, if we make
> 56-bit VA the default.
The top level page always has to be there unless you disable it at boot time
(unless you go for a scheme where some processes share top level pages and
others do not, which would likely be very complicated).
But even with that it is more than one: a typical setup has at least two extra
4K pages of overhead, one for the bottom and one for the top mappings. It
could easily be more.
-Andi
On 12/09/2016 02:37 AM, Kirill A. Shutemov wrote:
> On the other hand, a large virtual address space would put more pressure
> on the cache -- at least one more page table per process, if we make
> 56-bit VA the default.
For a process only using a small amount of its address space, the
mid-level paging structure caches will be very effective since the page
walks are all very similar. You may take a cache miss on the extra
level on the *first* walk, but you only do that once per context switch.
I bet the CPU is also pretty aggressive about filling those things when
it sees a new CR3 and they've been forcibly emptied. So, you may never
even _see_ the latency from that extra miss.
On Fri, Dec 09, 2016 at 08:40:11AM -0800, Andi Kleen wrote:
> > On the other hand, a large virtual address space would put more pressure
> > on the cache -- at least one more page table per process, if we make
> > 56-bit VA the default.
>
> The top level page always has to be there unless you disable it at boot time
> (unless you go for a scheme where some processes share top level pages and
> others do not, which would likely be very complicated).
>
> But even with that it is more than one: a typical setup has at least two extra
> 4K pages of overhead, one for the bottom and one for the top mappings. It
> could easily be more.
So, right, one page for the pgd, which we can't easily avoid.
If we limit the VA to 47 bits by default, we would have one p4d page, as
the whole range is covered by a single pgd entry.
If we go to a 56-bit VA by default, we would have at least two p4d pages
even for small processes. This is where my "at least one more page table
per process" comes from.
That's a waste of memory and potentially cache. I don't think it's
justified.
--
Kirill A. Shutemov
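The arithmetic behind that, as a standalone back-of-the-envelope check
(constants are the 5-level paging values; illustrative, not kernel code):

#include <stdio.h>

int main(void)
{
	/* With 5-level paging, PGDIR_SHIFT is 48:
	 * one pgd entry maps 2^48 bytes (256 TiB). */
	unsigned long long pgd_span = 1ULL << 48;

	/* A 47-bit VA fits under a single pgd entry -> one p4d page. */
	printf("47-bit VA spans %llu pgd entry\n",
	       ((1ULL << 47) + pgd_span - 1) / pgd_span);

	/* A 56-bit VA spans 256 pgd entries; the low (text/heap) and
	 * high (stack/mmap) ends populate at least two of them. */
	printf("56-bit VA spans %llu pgd entries\n",
	       (1ULL << 56) / pgd_span);

	return 0;
}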
On Thu, Dec 08, 2016 at 10:40:41AM -0800, Andy Lutomirski wrote:
> On Thu, Dec 8, 2016 at 8:21 AM, Kirill A. Shutemov
> <[email protected]> wrote:
> > XXX: how to test this?
>
> tools/testing/selftests/x86/sigreturn_{32,64}
Hm. They fail on a non-patched kernel under QEMU, but not under KVM. :-/
I guess I'd need to fix QEMU first.
--
Kirill A. Shutemov
On 12/08/2016 09:01 PM, Ingo Molnar wrote:
> So this is just a detail - but it sounds a bit limiting to me to provide an 'opt
> in' flag for something that will work just fine on the vast majority of 64-bit
> software.
MPX is going to be a real pain here. It is relatively transparent to
applications that use it, and old MPX binaries are entirely incompatible
with the new address space size, so an opt-out wouldn't be friendly.
Because the top-level MPX bounds table is indexed by the virtual
address, a growth in vaddr space is going to require the table to grow
(or change somehow). The solution baked into the hardware spec is to
just make the top-level table 512x larger to accommodate the 512x
increase in vaddr space. (This behavior is controlled by a new MSR, btw...)
So, either we disable MPX on all old MPX binaries by returning an error
when the prctl() tries to enable MPX and 5-level paging is on, or we go
with some form of an opt-in. New MPX binaries will opt-in to the larger
address space since they know to allocate the new, larger table.
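A hypothetical sketch of the first option (disable MPX when the wider
address space is in use) -- task_size_is_56bit() is made up for
illustration; this is not the actual MPX code:

static int mpx_enable_management(void)
{
	/*
	 * Legacy MPX bounds tables only cover a 47-bit VA. Refuse to
	 * enable MPX for a task running with the wider address space.
	 */
	if (IS_ENABLED(CONFIG_X86_5LEVEL) && task_size_is_56bit(current))
		return -ENXIO;

	/* ... normal MPX enablement ... */
	return 0;
}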
On 12/08/16 12:20, Borislav Petkov wrote:
> On Thu, Dec 08, 2016 at 12:08:53PM -0800, Linus Torvalds wrote:
>> Especially since that's some of the ugliest inline asm ever due to the
>> nasty BX handling.
>
> Yeah, about that: why doesn't gcc handle that for us like it would
> handle a clobbered register? I mean, it *should* know that BX is live
> when building with -fPIC... The .ifnc thing looks really silly.
>
When compiling with -fPIC gcc treats ebx as a "fixed register". A fixed
register can't be spilled, and so a clobber of a fixed register is a
fatal error.
Like it or not, it's how it works.
-hpa
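Concretely, the plain form that relies on the compiler to handle ebx looks
like this; gcc 5 and later spill and restore ebx themselves, while gcc 4.x
with -m32 -fPIC rejects it as an impossible constraint (this is essentially
what the test program later in this thread checks):

static inline void cpuid(u32 id, u32 *a, u32 *b, u32 *c, u32 *d)
{
	/* "=b" makes the compiler responsible for ebx -- fatal on
	 * compilers that treat it as the fixed PIC register. */
	asm volatile("cpuid"
		     : "=a" (*a), "=b" (*b), "=c" (*c), "=d" (*d)
		     : "a" (id));
}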
On 12/09/16 07:32, Kirill A. Shutemov wrote:
>
> Something like this?
>
> diff --git a/arch/x86/boot/cpuflags.c b/arch/x86/boot/cpuflags.c
> index 6687ab953257..366aad972025 100644
> --- a/arch/x86/boot/cpuflags.c
> +++ b/arch/x86/boot/cpuflags.c
> @@ -70,16 +70,22 @@ int has_eflag(unsigned long mask)
> # define EBX_REG "=b"
> #endif
>
> -static inline void cpuid(u32 id, u32 *a, u32 *b, u32 *c, u32 *d)
> +static inline void cpuid_count(u32 id, u32 count,
> + u32 *a, u32 *b, u32 *c, u32 *d)
> {
> + *a = id;
> + *c = count;
These two lines are wrong, remove them.
> asm volatile(".ifnc %%ebx,%3 ; movl %%ebx,%3 ; .endif \n\t"
> "cpuid \n\t"
> ".ifnc %%ebx,%3 ; xchgl %%ebx,%3 ; .endif \n\t"
> : "=a" (*a), "=c" (*c), "=d" (*d), EBX_REG (*b)
> - : "a" (id)
> + : "a" (id), "c" (count)
> );
> }
>
> +#define cpuid(id, a, b, c, d) cpuid_count(id, 0, a, b, c, d)
> +
Other than that, it's correct.
That being said, the claim that ECX ought to be zeroed on a
non-subleaf-equipped CPUID leaf is spurious, in my opinion. Then again,
it also doesn't do any harm and might avoid problems in the opposite
direction, e.g. someone thinking that leaf 7 doesn't have subleaves.
It might also be better to have something like:
#define SAVE_EBX(x) ".ifnc %%ebx," x "; movl %%ebx," x "; .endif"
#define SWAP_EBX(x) ".ifnc %%ebx," x "; xchgl %%ebx," x "; .endif"
... but if it is only used once it might just be more confusion.
-hpa
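With hpa's corrections folded in (the two assignments dropped, the macro
kept), the presumable end state of the patch is:

static inline void cpuid_count(u32 id, u32 count,
			       u32 *a, u32 *b, u32 *c, u32 *d)
{
	asm volatile(".ifnc %%ebx,%3 ; movl %%ebx,%3 ; .endif \n\t"
		     "cpuid \n\t"
		     ".ifnc %%ebx,%3 ; xchgl %%ebx,%3 ; .endif \n\t"
		     : "=a" (*a), "=c" (*c), "=d" (*d), EBX_REG (*b)
		     : "a" (id), "c" (count)
		     );
}

#define cpuid(id, a, b, c, d) cpuid_count(id, 0, a, b, c, d)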
On December 13, 2016 11:44:06 PM GMT+01:00, "H. Peter Anvin" <[email protected]> wrote:
>When compiling with -fPIC gcc treats ebx as a "fixed register". A fixed
>register can't be spilled, and so a clobber of a fixed register is a
>fatal error.
>
>Like it or not, it's how it works.
>
> -hpa
In the meantime I talked to my gcc guy and here's the deal:
There are gcc versions (4.x and earlier) which do not save/restore the PIC register around an inline asm even if it is one of the registers that the inline asm clobbers. Therefore the saving/restoring needs to be done by the inline asm itself.
5.x and later handle that fine.
Thus I was thinking of adding a build-time check for the gcc version but that might turn out to be more code in the end than those ugly ifnc clauses.
--
Sent from a small device: formatting sux and brevity is inevitable.
On Wed, Dec 14, 2016 at 12:07:54AM +0100, Boris Petkov wrote:
> Thus I was thinking of adding a build-time check for the gcc version
> but that might turn out to be more code in the end than those ugly
> ifnc clauses.
IOW, something like this. I did this just to try to see whether it is
doable. And it does work - gcc 4.8 and 4.9 -m32 cannot preserve the PIC
register - actually the inline asm fails building due to impossible
constraints.
However, so many lines changed just to save the ifnc, meh, I dunno...
---
 arch/x86/boot/compressed/Makefile |  8 ++++++
 arch/x86/boot/cpuflags.c          | 14 ++++++++--
 scripts/gcc-clobber-pic.sh        | 58 +++++++++++++++++++++++++++++++++++++++
 3 files changed, 77 insertions(+), 3 deletions(-)
create mode 100755 scripts/gcc-clobber-pic.sh
diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
index 34d9e15857c3..705fc2ab3fd6 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -35,6 +35,14 @@ KBUILD_CFLAGS += -mno-mmx -mno-sse
KBUILD_CFLAGS += $(call cc-option,-ffreestanding)
KBUILD_CFLAGS += $(call cc-option,-fno-stack-protector)
+# check whether inline asm clobbers the PIC register
+ifeq ($(CONFIG_X86_32),y)
+ifeq ($(shell $(CONFIG_SHELL) $(srctree)/scripts/gcc-clobber-pic.sh $(CC) -m32),n)
+ KBUILD_CFLAGS += -DCC_PRESERVES_PIC
+ KBUILD_AFLAGS += -DCC_PRESERVES_PIC
+endif
+endif
+
KBUILD_AFLAGS := $(KBUILD_CFLAGS) -D__ASSEMBLY__
GCOV_PROFILE := n
UBSAN_SANITIZE :=n
diff --git a/arch/x86/boot/cpuflags.c b/arch/x86/boot/cpuflags.c
index 6687ab953257..913c3f5ab3a0 100644
--- a/arch/x86/boot/cpuflags.c
+++ b/arch/x86/boot/cpuflags.c
@@ -70,11 +70,19 @@ int has_eflag(unsigned long mask)
# define EBX_REG "=b"
#endif
+#if defined(__i386__) && defined(__PIC__) && !defined(CC_PRESERVES_PIC)
+# define SAVE_PIC ".ifnc %%ebx, %3; movl %%ebx, %3; .endif\n\t"
+# define SWAP_PIC ".ifnc %%ebx, %3; xchgl %%ebx, %3; .endif\n\t"
+#else
+# define SAVE_PIC
+# define SWAP_PIC
+#endif
+
static inline void cpuid(u32 id, u32 *a, u32 *b, u32 *c, u32 *d)
{
- asm volatile(".ifnc %%ebx,%3 ; movl %%ebx,%3 ; .endif \n\t"
- "cpuid \n\t"
- ".ifnc %%ebx,%3 ; xchgl %%ebx,%3 ; .endif \n\t"
+ asm volatile(SAVE_PIC
+ "cpuid\n\t"
+ SWAP_PIC
: "=a" (*a), "=c" (*c), "=d" (*d), EBX_REG (*b)
: "a" (id)
);
diff --git a/scripts/gcc-clobber-pic.sh b/scripts/gcc-clobber-pic.sh
new file mode 100755
index 000000000000..7ff10edf9b08
--- /dev/null
+++ b/scripts/gcc-clobber-pic.sh
@@ -0,0 +1,58 @@
+#!/bin/bash -x
+err=0
+O=$(mktemp)
+cat << "END" | $@ -fPIC -x c - -o $O >/dev/null 2>&1 || err=1
+int some_global_var, some_other_global_var;
+
+typedef unsigned int u32;
+
+void __attribute__((noinline)) foo(void)
+{
+ asm volatile("# some crap just so that we don't get optimized away");
+
+ some_other_global_var = 43;
+}
+
+static inline void cpuid(u32 id, u32 *a, u32 *b, u32 *c, u32 *d)
+{
+ asm volatile("cpuid"
+ : "=a" (*a), "=b" (*b), "=c" (*c), "=d" (*d)
+ : "a" (id), "2" (*c)
+ : "si", "di"
+ );
+
+ some_global_var = 42;
+ foo();
+}
+
+int main(void)
+{
+ u32 a, b, c = 0, d;
+
+ cpuid(0x1, &a, &b, &c, &d);
+
+ /*
+ * Make sure foo() gets actually called and not optimized away due to
+ * miscompilation.
+ */
+ if (some_global_var == 42 && some_other_global_var == 43)
+ return 0;
+ else
+ return 1;
+}
+END
+
+if (( $err ));
+then
+ exit 1
+fi
+
+chmod u+x $O
+$O
+
+if ! (( $? ));
+then
+ echo "n"
+fi
+
+rm -f $O
--
2.11.0
--
Regards/Gruss,
Boris.
Good mailing practices for 400: avoid top-posting and trim the reply.
On December 15, 2016 6:39:44 AM PST, Borislav Petkov <[email protected]> wrote:
>On Wed, Dec 14, 2016 at 12:07:54AM +0100, Boris Petkov wrote:
>> Thus I was thinking of adding a build-time check for the gcc version
>> but that might turn out to be more code in the end than those ugly
>> ifnc clauses.
>
>IOW, something like this. I did this just to try to see whether it is
>doable. And it does work - gcc 4.8 and 4.9 -m32 cannot preserve the PIC
>register - actually the inline asm fails building due to impossible
>constraints.
>
>However, so many lines changed just to save the ifnc, meh, I dunno...
>
This really is only worthwhile if it ends up producing better code, but I doubt it.
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Thu, Dec 15, 2016 at 09:52:12AM -0800, [email protected] wrote:
> This really is only worthwhile if it ends up producing better code,
> but I doubt it.
Nah, the most it does is drop those ifnc lines on newer gccs.
They will appear only:
- on gcc-4 and earlier, and
- if we're building with -fPIC, and
- if we're building with -m32, and
- if we have enough register pressure to force gcc to use the PIC
  register.
It was a good exercise for me to see in detail how would I go about
doing a gcc-specific workaround.
--
Regards/Gruss,
Boris.
Good mailing practices for 400: avoid top-posting and trim the reply.
On December 15, 2016 11:20:17 AM PST, Andi Kleen <[email protected]> wrote:
>
>The code is not calling CPUID in any performance critical path, only
>at initialization. So any discussion about saving a few instructions
>is a complete waste of time.
>
>-Andi
Sort of. The BIOS boot code is kept very space-constrained so that certain legacy bootloaders continue to work. The BIOS boot code proper does not need PIC.
However, the existing .ifnc solution already takes care of it, so it doesn't matter.
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
On December 15, 2016 11:20:17 AM PST, Andi Kleen <[email protected]> wrote:
>
>The code is not calling CPUID in any performance critical path, only
>at initialization. So any discussion about saving a few instructions
>is a complete waste of time.
>
>-Andi
NB: the chief offender is Loadlin, which is still used in some manufacturing flows that depend on a mix of legacy DOS and Linux applications; however, older versions of LILO also have this problem.
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
The code is not calling CPUID in any performance critical path, only
at initialization. So any discussion about saving a few instructions
is a complete waste of time.
-Andi